[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209162#comment-14209162 ]

jiangyu commented on HDFS-7385:
-------------------------------

[~hitliuyi], it also occurs when opening files, for the same reason as with mkdir: the op comes from the ThreadLocal variable cache. I will add a test case later on.

> ThreadLocal used in FSEditLog class lead FSImage permission mess up
> --------------------------------------------------------------------
>
>                 Key: HDFS-7385
>                 URL: https://issues.apache.org/jira/browse/HDFS-7385
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0, 2.5.0
>            Reporter: jiangyu
>            Assignee: jiangyu
>         Attachments: HDFS-7385.patch
>
>
> We migrated our NameNodes from low-configuration to high-configuration
> machines last week. First, we imported the current directory, including the
> fsimage and editlog files, from the original Active NameNode to the new
> Active NameNode and started the new NameNode. We then changed the
> configuration of all datanodes and restarted them; they sent block reports
> to the new NameNodes at once and heartbeats after that.
>
> Everything seemed perfect, but after we restarted the ResourceManager, most
> of the users complained that their jobs couldn't be executed because of
> permission problems.
>
> We use ACLs in our clusters, and after the migration we found that most of
> the directories and files that had no ACLs set before now had ACLs. That is
> why users could not execute their jobs. So we had to change the permission
> of most files to a+r and of most directories to a+rx to make sure the jobs
> could be executed.
>
> After investigating this problem for some days, I found there is a bug in
> FSEditLog.java. The op taken from the ThreadLocal variable cache in
> FSEditLog is not set to the proper values in the logMkDir and logOpenFile
> functions.
> Here is the code of logMkDir:
>
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>
> For example, if we mkdir with ACLs through one handler (a thread, in fact),
> we set the AclEntries on the op obtained from the cache. After that, if a
> mkdir without any ACLs goes through the same handler, the AclEntries on the
> cached op are still the ones from the previous call, and because the newNode
> has no AclFeature, nothing ever overwrites them. The editlog is then wrong:
> it records the wrong ACLs. After the Standby NameNode (SNN) loads the
> editlogs from the JournalNodes, applies them in memory, saves the namespace,
> and transfers the wrong fsimage to the Active NameNode (ANN), all the
> fsimages are wrong. The only solution is to save the namespace from the ANN,
> which gives the right fsimage.
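
The stale-field behaviour described above can be reproduced outside Hadoop with a small, self-contained sketch. Everything in it is hypothetical and only mirrors the shape of the report: `Op` stands in for MkdirOp, `CACHE` for the ThreadLocal cache in FSEditLog, and `getInstanceReset` shows one possible remedy (clearing optional fields whenever the cached op is handed out). It is not taken from the attached patch and may differ from it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StaleOpCacheSketch {
    /** Hypothetical stand-in for MkdirOp: one mutable op object reused per thread. */
    static final class Op {
        String path;
        List<String> aclEntries;

        // Buggy variant: hands out the cached op with whatever the last call left in it.
        static Op getInstance(ThreadLocal<Op> cache) {
            return cache.get();
        }

        // Possible remedy (an assumption, not the actual patch): clear fields on reuse.
        static Op getInstanceReset(ThreadLocal<Op> cache) {
            Op op = cache.get();
            op.path = null;
            op.aclEntries = null;
            return op;
        }
    }

    // Hypothetical stand-in for the per-handler op cache in FSEditLog.
    static final ThreadLocal<Op> CACHE = ThreadLocal.withInitial(Op::new);

    /** Mirrors logMkDir: ACL entries are only written when the new node has an ACL. */
    static String logMkDir(String path, List<String> acl, boolean reset) {
        Op op = reset ? Op.getInstanceReset(CACHE) : Op.getInstance(CACHE);
        op.path = path;
        if (acl != null) {
            op.aclEntries = new ArrayList<>(acl);
        }
        // Return what would have been written to the edit log.
        return op.path + " acl=" + op.aclEntries;
    }

    public static void main(String[] args) {
        List<String> acl = Arrays.asList("user:alice:rwx");
        System.out.println(logMkDir("/a", acl, false));  // /a acl=[user:alice:rwx]
        System.out.println(logMkDir("/b", null, false)); // /b acl=[user:alice:rwx]  (stale ACL)
        System.out.println(logMkDir("/c", acl, true));   // /c acl=[user:alice:rwx]
        System.out.println(logMkDir("/d", null, true));  // /d acl=null
    }
}
```

The second call shows the problem from the report: "/b" is created without ACLs, yet the logged op still carries the entries set for "/a" on the same thread. With the reset variant, "/d" is logged without ACL entries.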