[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209162#comment-14209162
 ] 

jiangyu commented on HDFS-7385:
-------------------------------

[~hitliuyi], it also occurs when opening files, for the same reason as with 
mkdir: the ThreadLocal variable cache is used. I will add a test case later. 

> ThreadLocal used in FSEditLog class  lead FSImage permission mess up
> --------------------------------------------------------------------
>
>                 Key: HDFS-7385
>                 URL: https://issues.apache.org/jira/browse/HDFS-7385
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0, 2.5.0
>            Reporter: jiangyu
>            Assignee: jiangyu
>         Attachments: HDFS-7385.patch
>
>
>       We migrated our NameNodes from low-configuration to high-configuration 
> machines last week. First, we imported the current directory, including the 
> fsimage and editlog files, from the original Active NameNode to the new 
> Active NameNode and started the new NameNode. Then we changed the 
> configuration of all DataNodes and restarted them; they sent block reports 
> to the new NameNodes at once and heartbeats after that.
>       Everything seemed perfect, but after we restarted the ResourceManager, 
> most of the users complained that their jobs could not be executed because 
> of permission problems.
>       We use ACLs in our clusters, and after the migration we found that most 
> of the directories and files that had no ACLs set before now had ACLs. 
> That is why users could not execute their jobs. So we had to change the 
> permission of most files to a+r and of most directories to a+rx to make 
> sure the jobs could be executed.
> After investigating this problem for some days, I found a bug in 
> FSEditLog.java. The ops taken from the ThreadLocal variable cache in 
> FSEditLog are not set to the proper values in the logMkDir and logOpenFile 
> functions. Here is the code of logMkDir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
>       For example, if we mkdir with ACLs through one handler (a thread, in 
> fact), we set the AclEntries on the op from the cache. After that, if we 
> mkdir without any ACLs through the same handler, the AclEntries of the op 
> from the cache are still those of the previous call that set the ACLs, and 
> because the newNode has no AclFeature, they are never overwritten. The 
> editlog is then wrong: it records the wrong ACLs. After the Standby 
> NameNode (SNN) loads the edit logs from the JournalNodes, applies them in 
> memory, saves the namespace, and transfers the wrong fsimage to the Active 
> NameNode (ANN), all the fsimages are wrong. The only solution is to save the 
> namespace from the ANN, which produces the correct fsimage.
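
The reuse problem described above can be reproduced outside Hadoop with a 
minimal sketch. Everything in it is a hypothetical stand-in, not the Hadoop 
code or the attached patch: Op plays the role of MkdirOp, CACHE plays the role 
of the ThreadLocal cache in FSEditLog, and logMkdirFixed shows only one 
possible way to avoid the stale value (always overwriting the field).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreadLocalOpReuse {
    // Hypothetical stand-in for MkdirOp: a mutable op object reused per thread.
    static final class Op {
        String path;
        List<String> aclEntries; // null means "no ACL entries"
    }

    // One cached Op per thread, analogous to the ThreadLocal cache in FSEditLog.
    static final ThreadLocal<Op> CACHE = ThreadLocal.withInitial(Op::new);

    // Buggy variant: aclEntries is written only when ACLs are supplied,
    // so a value left by an earlier call on the same thread survives.
    static String logMkdirBuggy(String path, List<String> acl) {
        Op op = CACHE.get();
        op.path = path;
        if (acl != null) {
            op.aclEntries = new ArrayList<>(acl);
        }
        return render(op);
    }

    // One possible fix (an assumption, not the actual patch):
    // always overwrite the field, clearing it when there are no ACLs.
    static String logMkdirFixed(String path, List<String> acl) {
        Op op = CACHE.get();
        op.path = path;
        op.aclEntries = (acl == null) ? null : new ArrayList<>(acl);
        return render(op);
    }

    // Stand-in for writing the op to the edit log.
    static String render(Op op) {
        return op.path + " acl=" + op.aclEntries;
    }

    public static void main(String[] args) {
        List<String> acl = Arrays.asList("user:alice:rwx");
        System.out.println(logMkdirBuggy("/a", acl));  // /a acl=[user:alice:rwx]
        System.out.println(logMkdirBuggy("/b", null)); // /b acl=[user:alice:rwx] (stale)
        System.out.println(logMkdirFixed("/c", null)); // /c acl=null
    }
}
```

The second call shows the failure mode: "/b" was created without ACLs, yet 
the logged op still carries the entries from "/a" because both calls ran on 
the same thread and shared the cached object.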



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
