[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213449#comment-14213449 ] Yongjun Zhang commented on HDFS-6133: - Hi [~kihwal] and [~ctrezzo], Thanks a lot for sharing the info and great insights here. About the throwing-away-over-replicated case, I think the "include pinning info in block reports and remember it in block manager" approach Kihwal and Daryn suggested seems reasonable. And I agree the corrupted-pinned-block case needs some more thought and careful logic. About the blockpool-aware balancing policy, since the balancer works on each blockpool independently, it seems a natural approach. However, I think this would be a complementary solution to the approach here; it deserves its own jira and can be worked on in parallel. Currently one NN is associated with only one blockpool, and federated clusters are not yet widely used as far as I know. Implementation-wise, there are two options I can think of if we want to use a blockpool-aware balancing policy to solve the problem here: # Users need to choose to use a federated cluster, and put all files that need to be pinned into a dedicated blockpool. # Make a single NN handle multiple block-pools. The solution would work nicely for already-federated clusters. For others, it won't be easy. Right now we don't have the capability to handle pinning at block/file granularity (the balancer does have the option to exclude nodes from being touched). It seems even providing a solution without handling the throwing-away-over-replicated case would help alleviate the pain. Let's see if we can include a mechanism in the patch of this jira, or at least think through how to handle the two cases (throwing-away-over-replicated, and corrupted-pinned-block). Thanks. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133.patch > > > Currently, running Balancer will destroy the Regionserver's data locality. > If getBlocks could exclude blocks belonging to files which have a specific path > prefix, like "/hbase", then we can run Balancer without destroying the > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-7279: - Attachment: HDFS-7279.013.patch > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch, HDFS-7279.013.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213445#comment-14213445 ] Hadoop QA commented on HDFS-7279: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681717/HDFS-7279.012.patch against trunk revision 9b86066. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8750//console This message is automatically generated. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213443#comment-14213443 ] Hadoop QA commented on HDFS-7279: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681717/HDFS-7279.012.patch against trunk revision 9b86066. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8749//console This message is automatically generated. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-7279: - Attachment: HDFS-7279.012.patch > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213442#comment-14213442 ] Haohui Mai commented on HDFS-7279: -- The v12 patch removes the excessive throws clauses. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213439#comment-14213439 ] Chris Nauroth commented on HDFS-7384: - I haven't reviewed the whole patch yet, but I wanted to state again quickly that I'd prefer to keep effective permissions out of {{AclEntry}}. One problem is that the {{AclEntry}} class is also used in the setter APIs, like {{setAcl}}. In that context, the effective permissions would be ignored. This could cause confusion for users of those APIs. Another problem is that we use the same class for both the public API on the client side and the internal in-memory representation in the NameNode. Therefore, adding a new member to {{AclEntry}} would have the side effect of increasing the memory footprint in the NameNode. Even if we don't populate the field when used within the NameNode, there is still the overhead of the additional pointer multiplied by every ACL entry. We could potentially change the NameNode to use a different class for its internal implementation, but then we'd have a dual-maintenance problem and a need for extra code to translate between the two representations. If {{AclStatus}} could have a new method that does the calculation for an entry's effective permissions on demand, instead of requiring a new member in {{AclEntry}}, then we wouldn't impact the setter APIs or increase the memory footprint in the NameNode. > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Attachments: HDFS-7384-001.patch > > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
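As an aside on the on-demand approach Chris sketches above, here is a minimal illustration of the POSIX effective-permission calculation. This is a hypothetical helper — the class and method names are made up for illustration, not the API that was ultimately committed — and it assumes the caller has already located the ACL's MASK entry.

{code:java}
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

/**
 * Hypothetical helper (not the committed API) showing the on-demand
 * effective-permission calculation. Per POSIX ACL semantics, named-user,
 * group, and named-group entries are constrained by the mask entry;
 * owner, other, and the mask itself are not.
 */
public final class EffectiveAclExample {
  public static FsAction getEffectivePermission(AclEntry entry, FsAction mask) {
    // Named users (USER type with a name) and all GROUP entries are masked.
    boolean maskApplies =
        (entry.getType() == AclEntryType.USER && entry.getName() != null)
        || entry.getType() == AclEntryType.GROUP;
    // Effective permission is the bitwise AND of the entry and the mask.
    return maskApplies ? entry.getPermission().and(mask) : entry.getPermission();
  }
}
{code}

Computing this lazily keeps {{AclEntry}} itself unchanged, so the setter APIs and the NameNode's in-memory representation would be unaffected, which is the point of the suggestion above.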
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213438#comment-14213438 ] Hadoop QA commented on HDFS-7384: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681715/HDFS-7384-001.patch against trunk revision 9b86066. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8748//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8748//console This message is automatically generated. > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Attachments: HDFS-7384-001.patch > > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213436#comment-14213436 ] Hadoop QA commented on HDFS-6982: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681698/HDFS-6982.v9.patch against trunk revision 9b86066. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby org.apache.hadoop.hdfs.TestDistributedFileSystem {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8747//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8747//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8747//console This message is automatically generated. > nntop: top-like tool for name node users > - > > Key: HDFS-6982 > URL: https://issues.apache.org/jira/browse/HDFS-6982 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Maysam Yabandeh >Assignee: Maysam Yabandeh > Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, > HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, > HDFS-6982.v7.patch, HDFS-6982.v8.patch, HDFS-6982.v9.patch, > nntop-design-v1.pdf > > > In this jira we motivate the need for nntop, a tool that, similarly to what > top does in Linux, gives the list of top users of the HDFS name node and > gives insight into which users are sending the majority of each traffic type to > the name node. This information turns out to be most critical when the > name node is under pressure and the HDFS admin needs to know which user is > hammering the name node and with what kind of requests. Here we present the > design of nntop, which has been in production at Twitter for the past 10 > months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K > nodes), a low memory footprint (less than a few MB), and to be quite efficient on > the write path (only two hash lookups for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-7384: Status: Patch Available (was: Open) > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Attachments: HDFS-7384-001.patch > > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-7384: Attachment: HDFS-7384-001.patch Attached patch. Please review and give your feedback. > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Attachments: HDFS-7384-001.patch > > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213392#comment-14213392 ] Tsz Wo Nicholas Sze commented on HDFS-7279: --- > the throw clause comes from the super class thus it cannot be removed. It actually can be removed, since removing it is narrowing the declaration. +1, the new patch looks good other than that. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213355#comment-14213355 ] Ming Ma commented on HDFS-7374: --- Yeah, that seems reasonable; how likely you are to get "whole cluster fully replicated" might depend on how you count it. If it is based on a full scan of the blockmap, the chance of hitting the "all blocks fully replicated" condition might be low, given it also includes newly added blocks for which not all DNs have sent IBRs; in addition, it has to take the FSNameSystem lock for a longer period of time. If it is based on {{BlockManager}}'s {{pendingReplicationBlocksCount}} + {{underReplicatedBlocksCount}}, then the chance might be higher, and it is faster. On the "track the blocks of those DECOMM_IN_PROGRESS DNs" note, it might be useful to add that feature later. It also helps another scenario, something [~kihwal] and [~daryn] mentioned before: {{isReplicationInProgress}} currently rescans all blocks of a given node each time the method is called, which isn't efficient as more blocks become fully replicated. We can have a separate list of DECOMM_IN_PROGRESS blocks which points to the DECOMM_IN_PROGRESS DNs. {{DecommissionManager}} will scan this list regularly. Each scan will shrink the list as blocks become fully replicated and calculate the latest list of DECOMM_IN_PROGRESS DNs. In normal decomm operations, the # of DECOMM_IN_PROGRESS DNs should be much smaller than the total # of DNs in a large cluster, so the extra memory overhead might be acceptable. > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, in the hope that they can come back and finish > the decommission work. If an upper-layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
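For concreteness, here is a minimal sketch of the counter-based criterion discussed above, assuming {{BlockManager}} exposes the two counts Ming names; the surrounding wiring in {{DecommissionManager}} is hypothetical.

{code:java}
import org.apache.hadoop.hdfs.server.blockmanagement.BlockManager;

// Sketch of the cheap criterion: treat a dead, decommissioning node as
// finished once nothing in the cluster is pending or under-replicated,
// instead of scanning the blockmap under the FSNameSystem lock.
class DecommissionCheckSketch {
  static boolean isClusterFullyReplicated(BlockManager blockManager) {
    return blockManager.getPendingReplicationBlocksCount() == 0
        && blockManager.getUnderReplicatedBlocksCount() == 0;
  }
}
{code}

The trade-off matches the comment above: the counter reads are O(1) and lock-friendly, at the cost of being a cluster-wide condition rather than a per-node one.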
[jira] [Updated] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated HDFS-6982: -- Attachment: HDFS-6982.v9.patch Attaching the new patch. [~andrew.wang], I ended up moving TopMetrics initialization to FsNamesystem, where I register TopAuditLogger with the auditLoggers. > nntop: top-like tool for name node users > - > > Key: HDFS-6982 > URL: https://issues.apache.org/jira/browse/HDFS-6982 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Maysam Yabandeh >Assignee: Maysam Yabandeh > Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, > HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, > HDFS-6982.v7.patch, HDFS-6982.v8.patch, HDFS-6982.v9.patch, > nntop-design-v1.pdf > > > In this jira we motivate the need for nntop, a tool that, similarly to what > top does in Linux, gives the list of top users of the HDFS name node and > gives insight into which users are sending the majority of each traffic type to > the name node. This information turns out to be most critical when the > name node is under pressure and the HDFS admin needs to know which user is > hammering the name node and with what kind of requests. Here we present the > design of nntop, which has been in production at Twitter for the past 10 > months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K > nodes), a low memory footprint (less than a few MB), and to be quite efficient on > the write path (only two hash lookups for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213326#comment-14213326 ] Vinayakumar B commented on HDFS-7384: - Thanks Chris. For the effective action, maybe we can have a separate method, without affecting the current fields. It will just be an alternative way for the client to get the effective action, instead of calculating it on its own. I will upload a patch soon. > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213324#comment-14213324 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681664/HDFS-4882.4.patch against trunk revision 4fb96db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//console This message is automatically generated. > Namenode LeaseManager checkLeases() runs into infinite loop > --- > > Key: HDFS-4882 > URL: https://issues.apache.org/jira/browse/HDFS-4882 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client, namenode >Affects Versions: 2.0.0-alpha, 2.5.1 >Reporter: Zesheng Wu >Assignee: Ravi Prakash >Priority: Critical > Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, > HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch > > > Scenario: > 1. cluster with 4 DNs > 2. the size of the file to be written is a little more than one block > 3. write the first block to 3 DNs, DN1->DN2->DN3 > 4. all the data packets of the first block are successfully acked and the client > sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out > 5. DN2 and DN3 are down > 6. client recovers the pipeline, but no new DN is added to the pipeline > because the current pipeline stage is PIPELINE_CLOSE > 7. client continues writing the last block, and tries to close the file after > writing all the data > 8. NN finds that the penultimate block doesn't have enough replicas (our > dfs.namenode.replication.min=2), the client's close runs into an indefinite > loop (HDFS-2936), and at the same time the NN sets the last block's state to > COMPLETE > 9. shutdown the client > 10. the file's lease exceeds the hard limit > 11. LeaseManager realizes that and begins to do lease recovery by calling > fsnamesystem.internalReleaseLease() > 12. but the last block's state is COMPLETE, and this triggers the lease manager's > infinite loop and prints massive logs like this: > {noformat} > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: > DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard > limit > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. > Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src= > /user/h_wuzesheng/test.dat > 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block > blk_-7028017402720175688_1202597, > lastBLockState=COMPLETE > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery > for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONM > APREDUCE_-1252656407_1, pendingcreates: 1] > {noformat} > (the 3rd log line is a debug log added by us) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7270) Implementing congestion control in writing pipeline
[ https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213287#comment-14213287 ] Hadoop QA commented on HDFS-7270: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681462/HDFS-7270.000.patch against trunk revision 49c3889. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.datanode.TestDataNodeMetrics org.apache.hadoop.hdfs.TestCrcCorruption The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestParallelShortCircuitReadUnCached {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8743//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8743//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8743//console This message is automatically generated. > Implementing congestion control in writing pipeline > --- > > Key: HDFS-7270 > URL: https://issues.apache.org/jira/browse/HDFS-7270 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7270.000.patch > > > When a client writes to HDFS faster than the disk bandwidth of the DNs, it > saturates the disk bandwidth and leaves the DNs unresponsive. The client only > backs off by aborting / recovering the pipeline, which leads to failed writes > and unnecessary pipeline recovery. > This jira proposes to add explicit congestion control mechanisms in the > writing pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213286#comment-14213286 ] Hadoop QA commented on HDFS-7394: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681654/HDFS-7394.patch against trunk revision 4fb96db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestParallelShortCircuitReadUnCached {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8744//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8744//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8744//console This message is automatically generated. > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long-running clients, getting an {{InvalidToken}} exception is expected, > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It would be > better if they were also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
[ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213281#comment-14213281 ] Ming Ma commented on HDFS-7400: --- Thanks, [~andrew.wang] and [~aw], for the comments. Here is the info I have so far; I will provide more details after I gather data from our admins and HW engineers. 1. We couldn't access the machine except to reboot it via IPMI, so there was no chance to run "df". 2. We didn't check how far ssh got. But given that no DN could connect to this NN at that point, it looks like a socket-level issue. > More reliable namenode health check to detect OS/HW issues > -- > > Key: HDFS-7400 > URL: https://issues.apache.org/jira/browse/HDFS-7400 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma > > We had this scenario on an active NN machine. > * The disk array controller firmware has a bug, so the disks stop working. > * ZKFC and NN still considered the node healthy; communications between ZKFC > and ZK, as well as ZKFC and NN, are good. > * The machine can be pinged. > * The machine can't be sshed. > So all clients and DNs can't use the NN, but ZKFC and NN still consider the > node healthy. > The question is how we can have ZKFC and NN detect such OS/HW-specific issues > quickly. Some ideas we discussed briefly: > * Have other machines help make the decision whether the NN is actually > healthy. Then you have to figure out how to make the decision accurate in the > case of a network issue, etc. > * Run an OS/HW health check script external to ZKFC/NN on the same machine. If > it detects disk or other issues, it can reboot the machine, for example. > * Run an OS/HW health check script inside ZKFC/NN. For example, NN's > HAServiceProtocol#monitorHealth can be modified to call such a health check > script. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213254#comment-14213254 ] Hadoop QA commented on HDFS-7146: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681667/HDFS-7146.005.patch against trunk revision 4fb96db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-nfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8746//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8746//console This message is automatically generated. > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with the assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > Administrators are advised to prevent, and in most secure setups do prevent, this > behaviour of the command, to avoid excessive load on the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests for ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd <uid>', if the UID does not match a cached value. This reduces > load on the LDAP-backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
[ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213246#comment-14213246 ] Allen Wittenauer commented on HDFS-7400: bq. Disk array controller firmware has a bug. So disks stop working. ... bq. The machine can be pinged. bq. The machine can't be sshed. Was ssh actually opening the socket and just not completing the login process? On the surface, this sounds like typical Linux IO weirdisms, but I want to make sure. bq. Out of curiosity, did your failure condition result in a situation where df worked, but the disk was otherwise non-functional? I keep thinking about the situation where there are two controllers but only one went belly up. Doing things like df or even a write+read combo might not be sufficient unless we do it across all devices. I suspect: bq. Have other machines help to make the decision whether the NN is actually healthy. ... might be the only truly viable solution under various failure modes. > More reliable namenode health check to detect OS/HW issues > -- > > Key: HDFS-7400 > URL: https://issues.apache.org/jira/browse/HDFS-7400 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma > > We had this scenario on an active NN machine. > * The disk array controller firmware has a bug, so the disks stop working. > * ZKFC and NN still considered the node healthy; communications between ZKFC > and ZK, as well as ZKFC and NN, are good. > * The machine can be pinged. > * The machine can't be sshed. > So all clients and DNs can't use the NN, but ZKFC and NN still consider the > node healthy. > The question is how we can have ZKFC and NN detect such OS/HW-specific issues > quickly. Some ideas we discussed briefly: > * Have other machines help make the decision whether the NN is actually > healthy. Then you have to figure out how to make the decision accurate in the > case of a network issue, etc. > * Run an OS/HW health check script external to ZKFC/NN on the same machine. If > it detects disk or other issues, it can reboot the machine, for example. > * Run an OS/HW health check script inside ZKFC/NN. For example, NN's > HAServiceProtocol#monitorHealth can be modified to call such a health check > script. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
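To make the write+read-across-all-devices idea above concrete, here is a rough sketch of such a probe. The class name and wiring are hypothetical, and a production version would need per-device timeouts, since a wedged controller typically blocks I/O indefinitely rather than failing fast — which is exactly the "IO weirdism" noted above.

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical probe: write and read back a marker on every device. */
class DiskProbeSketch {
  static boolean allDisksHealthy(List<File> dirs) {
    byte[] payload = "nn-health-probe".getBytes(StandardCharsets.UTF_8);
    for (File dir : dirs) {
      try {
        Path probe = Files.createTempFile(dir.toPath(), "probe", null);
        Files.write(probe, payload);
        byte[] readBack = Files.readAllBytes(probe);
        Files.delete(probe);
        if (!Arrays.equals(payload, readBack)) {
          return false; // read back different bytes: corruption on this device
        }
      } catch (IOException e) {
        return false; // write, read, or delete failed on this device
      }
    }
    return true;
  }
}
{code}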
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213247#comment-14213247 ] Andrew Wang commented on HDFS-7374: --- Yea, precisely :) I don't know how realistic this is in an active cluster with lots of failing disks, but it'd fix it for some users at least. > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, in the hope that they can come back and finish > the decommission work. If an upper-layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213244#comment-14213244 ] Ming Ma commented on HDFS-7374: --- So maybe we can use "if all blocks in the whole cluster are fully replicated" instead of "if all blocks of that dead node are fully replicated" as the criterion to move that dead node to the decommed state? > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, in the hope that they can come back and finish > the decommission work. If an upper-layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method
[ https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213224#comment-14213224 ] Yongjun Zhang commented on HDFS-7386: - Thank you so much Chris! > Replace check "port number < 1024" with shared isPrivilegedPort method > --- > > Key: HDFS-7386 > URL: https://issues.apache.org/jira/browse/HDFS-7386 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, security >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang >Priority: Trivial > Fix For: 2.7.0 > > Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch > > > Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace > check "port number < 1024" with shared isPrivilegedPort method. > Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
[ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213215#comment-14213215 ] Andrew Wang commented on HDFS-7400: --- So in {{monitorHealth}} we do a basic check just to see if the NN has free disk space. I'd be okay extending this to other checks related to disk health. Out of curiosity, did your failure condition result in a situation where {{df}} worked, but the disk was otherwise non-functional? I guess with no SSH it's a little hard to check, but I wonder what we could add to {{monitorHealth}} to detect this failure condition. > More reliable namenode health check to detect OS/HW issues > -- > > Key: HDFS-7400 > URL: https://issues.apache.org/jira/browse/HDFS-7400 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma > > We had this scenario on an active NN machine. > * The disk array controller firmware has a bug, so the disks stop working. > * ZKFC and NN still considered the node healthy; communications between ZKFC > and ZK, as well as ZKFC and NN, are good. > * The machine can be pinged. > * The machine can't be sshed. > So all clients and DNs can't use the NN, but ZKFC and NN still consider the > node healthy. > The question is how we can have ZKFC and NN detect such OS/HW-specific issues > quickly. Some ideas we discussed briefly: > * Have other machines help make the decision whether the NN is actually > healthy. Then you have to figure out how to make the decision accurate in the > case of a network issue, etc. > * Run an OS/HW health check script external to ZKFC/NN on the same machine. If > it detects disk or other issues, it can reboot the machine, for example. > * Run an OS/HW health check script inside ZKFC/NN. For example, NN's > HAServiceProtocol#monitorHealth can be modified to call such a health check > script. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7401) Add block info to DFSInputStream's WARN message when it adds node to deadNodes
Ming Ma created HDFS-7401: - Summary: Add block info to DFSInputStream's WARN message when it adds node to deadNodes Key: HDFS-7401 URL: https://issues.apache.org/jira/browse/HDFS-7401 Project: Hadoop HDFS Issue Type: Bug Reporter: Ming Ma Priority: Minor Block info is missing in the message below: {noformat} 2014-11-14 03:59:00,386 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /xx.xx.xx.xxx:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK {noformat} The relevant code: {noformat} DFSInputStream.java DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block" + ", add to deadNodes and continue. " + ex, ex); {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
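For illustration, the one-line fix the report implies might look like the following sketch; {{targetBlock}} stands in for whatever {{LocatedBlock}} is in scope at this call site in {{DFSInputStream}}, and is a hypothetical local name here.

{code:java}
// Hypothetical sketch: include the block in the warning so the log
// identifies which replica was being read, not just the DN address.
DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block "
    + targetBlock.getBlock() + ", add to deadNodes and continue. " + ex, ex);
{code}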
[jira] [Commented] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method
[ https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213193#comment-14213193 ] Hudson commented on HDFS-7386: -- FAILURE: Integrated in Hadoop-trunk-Commit #6552 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6552/]) HDFS-7386. Replace check "port number < 1024" with shared isPrivilegedPort method. Contributed by Yongjun Zhang. (cnauroth: rev 1925e2a4ae78ef4178393848b4d1d71b0f4a4709) * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/SecurityUtil.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/SecureDataNodeStarter.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/sasl/SaslDataTransferServer.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/sasl/SaslDataTransferClient.java > Replace check "port number < 1024" with shared isPrivilegedPort method > --- > > Key: HDFS-7386 > URL: https://issues.apache.org/jira/browse/HDFS-7386 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, security >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang >Priority: Trivial > Fix For: 2.7.0 > > Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch > > > Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace > check "port number < 1024" with shared isPrivilegedPort method. > Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-3749) Disable check for jsvc on windows
[ https://issues.apache.org/jira/browse/HDFS-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-3749. - Resolution: Won't Fix This is no longer required, because HDFS-2856 has been implemented, providing SASL as a means to authenticate the DataNode instead of jsvc/privileged ports. I'm resolving this as Won't Fix. > Disable check for jsvc on windows > - > > Key: HDFS-3749 > URL: https://issues.apache.org/jira/browse/HDFS-3749 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hdfs-3749-trunk.patch, hdfs-3749.patch, hdfs-3749.patch > > > Jsvc doesn't make sense on windows and thus we should not require the > datanode to start up under it on that platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method
[ https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-7386: Resolution: Fixed Fix Version/s: 2.7.0 Status: Resolved (was: Patch Available) I committed this to trunk and branch-2. Yongjun, thank you for improving this part of the code. > Replace check "port number < 1024" with shared isPrivilegedPort method > --- > > Key: HDFS-7386 > URL: https://issues.apache.org/jira/browse/HDFS-7386 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, security >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang >Priority: Trivial > Fix For: 2.7.0 > > Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch > > > Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace > check "port number < 1024" with shared isPrivilegedPort method. > Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213188#comment-14213188 ] Brandon Li commented on HDFS-7146: -- +1. Pending Jenkins. > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with the assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > Administrators are advised to prevent, and in most secure setups do prevent, this > behaviour of the command, to avoid excessive load on the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests for ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd <uid>', if the UID does not match a cached value. This reduces > load on the LDAP-backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method
[ https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-7386: Component/s: security datanode Target Version/s: 2.7.0 Hadoop Flags: Reviewed +1 for the patch. I agree that the test failures are unrelated. I saw the same thing that you saw when I reran locally. I'll commit this. > Replace check "port number < 1024" with shared isPrivilegedPort method > --- > > Key: HDFS-7386 > URL: https://issues.apache.org/jira/browse/HDFS-7386 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, security >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang >Priority: Trivial > Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch > > > Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace > check "port number < 1024" with shared isPrivilegedPort method. > Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213177#comment-14213177 ] Andrew Wang commented on HDFS-6982: --- Hi Maysam, took a look at the latest patch, I think we're almost there :) Just minor comments. Hopefully Jenkins behaves with the next rev too; I agree it looks unrelated or garbled. DFSConfigKeys / TopConf: * Need to rename the DFSConfigKeys variable names to reflect the new config names * Seems like I gave bad advice about getInts, since it doesn't have a way of taking a default, so right now if we try to turn it off, it'll set the default. Reverting to what you had is cool, though adding a getInts that takes a default would be appreciated. RWManager: * Could we add explanatory text to the Precondition checks? AuditLogger: * Rather than injecting it into the conf (kinda brittle), what I had in mind was in FSNamesystem#initAuditLoggers, we could tack it on the end after adding the ones from the conf. No need for reflection :) * Related to this, it'd be good to have a unit test that disables nntop and then checks that the audit logger isn't added and that metrics aren't published. Feel free to add a @VisibleForTesting getter if it helps. Nits: * Unused import in NameNode This is just minor stuff though, I'm +1 pending the above review comments. > nntop: top-like tool for name node users > - > > Key: HDFS-6982 > URL: https://issues.apache.org/jira/browse/HDFS-6982 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Maysam Yabandeh >Assignee: Maysam Yabandeh > Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, > HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, > HDFS-6982.v7.patch, HDFS-6982.v8.patch, nntop-design-v1.pdf > > > In this jira we motivate the need for nntop, a tool that, similarly to what > top does in Linux, gives the list of top users of the HDFS name node and > gives insight into which users are sending the majority of each traffic type to > the name node. This information turns out to be most critical when the > name node is under pressure and the HDFS admin needs to know which user is > hammering the name node and with what kind of requests. Here we present the > design of nntop, which has been in production at Twitter for the past 10 > months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K > nodes), a low memory footprint (less than a few MB), and to be quite efficient on > the write path (only two hash lookups for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
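On the getInts-with-a-default point above, a rough sketch of what such an overload could look like, written here as a static helper for illustration (the real change would presumably live in {{Configuration}} itself):

{code:java}
import org.apache.hadoop.conf.Configuration;

class ConfHelperSketch {
  /** Returns the ints stored at 'name', or the supplied defaults if the key is unset. */
  static int[] getInts(Configuration conf, String name, int... defaults) {
    if (conf.get(name) == null) {
      return defaults; // key absent: fall back to the caller-supplied defaults
    }
    String[] parts = conf.getTrimmedStrings(name);
    int[] values = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
      values[i] = Integer.parseInt(parts[i]);
    }
    return values;
  }
}
{code}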
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213165#comment-14213165 ] Yongjun Zhang commented on HDFS-7146: - Hi [~brandonli], Nice idea to add some javadocs describing the solution; I just uploaded 005 to include that. Thanks for your flexibility, I will create a separate jira for the platform coverage issue, 'cause I think that may involve looking into multiple places for platform differences. Thanks for taking a further look. > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
Ming Ma created HDFS-7400: - Summary: More reliable namenode health check to detect OS/HW issues Key: HDFS-7400 URL: https://issues.apache.org/jira/browse/HDFS-7400 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma We had this scenario on an active NN machine. * Disk array controller firmware has a bug. So disks stop working. * ZKFC and NN still considered the node healthy; communications between ZKFC and ZK as well as ZKFC and NN are good. * The machine can be pinged. * The machine can't be sshed. So all clients and DNs can't use the NN, but ZKFC and NN still consider the node healthy. The question is how we can have ZKFC and NN detect such OS/HW-specific issues quickly. Some ideas we discussed briefly: * Have other machines help to make the decision on whether the NN is actually healthy. Then you have to figure out how to make the decision accurate in the case of network issues, etc. * Run an OS/HW health check script external to ZKFC/NN on the same machine. If it detects disk or other issues, it can reboot the machine, for example. * Run an OS/HW health check script inside ZKFC/NN. For example, the NN's HAServiceProtocol#monitorHealth can be modified to call such a health check script. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
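The third idea might look roughly like the following sketch (the script path is a placeholder and the method body is an assumption, not a proposed patch):
{code}
// Hypothetical sketch of idea 3: the NN health check shells out to an
// admin-supplied OS/HW check script and fails when it exits non-zero.
@Override
public synchronized void monitorHealth() throws HealthCheckFailedException {
  // ... existing NN resource checks ...
  try {
    Process p = new ProcessBuilder("/etc/hadoop/nn-os-health.sh").start();
    if (p.waitFor() != 0) { // timeout handling omitted for brevity
      throw new HealthCheckFailedException("OS/HW health check script failed");
    }
  } catch (IOException | InterruptedException e) {
    throw new HealthCheckFailedException("Could not run OS/HW health check: " + e);
  }
}
{code}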
[jira] [Updated] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-7146: Attachment: HDFS-7146.005.patch > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated HDFS-4882: --- Attachment: HDFS-4882.4.patch Here's a patch which goes back to using sortedLeases.first() . > Namenode LeaseManager checkLeases() runs into infinite loop > --- > > Key: HDFS-4882 > URL: https://issues.apache.org/jira/browse/HDFS-4882 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client, namenode >Affects Versions: 2.0.0-alpha, 2.5.1 >Reporter: Zesheng Wu >Assignee: Ravi Prakash >Priority: Critical > Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, > HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch > > > Scenario: > 1. cluster with 4 DNs > 2. the size of the file to be written is a little more than one block > 3. write the first block to 3 DNs, DN1->DN2->DN3 > 4. all the data packets of first block is successfully acked and the client > sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out > 5. DN2 and DN3 are down > 6. client recovers the pipeline, but no new DN is added to the pipeline > because of the current pipeline stage is PIPELINE_CLOSE > 7. client continuously writes the last block, and try to close the file after > written all the data > 8. NN finds that the penultimate block doesn't has enough replica(our > dfs.namenode.replication.min=2), and the client's close runs into indefinite > loop(HDFS-2936), and at the same time, NN makes the last block's state to > COMPLETE > 9. shutdown the client > 10. the file's lease exceeds hard limit > 11. LeaseManager realizes that and begin to do lease recovery by call > fsnamesystem.internalReleaseLease() > 12. but the last block's state is COMPLETE, and this triggers lease manager's > infinite loop and prints massive logs like this: > {noformat} > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: > DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard > limit > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. > Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src= > /user/h_wuzesheng/test.dat > 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block > blk_-7028017402720175688_1202597, > lastBLockState=COMPLETE > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery > for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONM > APREDUCE_-1252656407_1, pendingcreates: 1] > {noformat} > (the 3rd line log is a debug log added by us) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213109#comment-14213109 ] Ravi Prakash commented on HDFS-4882: These test errors are valid. They are happening because pollFirst() retrieves *and removes* the first element. Sorry for the oversight. Will upload a new patch soon. > Namenode LeaseManager checkLeases() runs into infinite loop > --- > > Key: HDFS-4882 > URL: https://issues.apache.org/jira/browse/HDFS-4882 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client, namenode >Affects Versions: 2.0.0-alpha, 2.5.1 >Reporter: Zesheng Wu >Assignee: Ravi Prakash >Priority: Critical > Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, > HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.patch > > > Scenario: > 1. cluster with 4 DNs > 2. the size of the file to be written is a little more than one block > 3. write the first block to 3 DNs, DN1->DN2->DN3 > 4. all the data packets of first block is successfully acked and the client > sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out > 5. DN2 and DN3 are down > 6. client recovers the pipeline, but no new DN is added to the pipeline > because of the current pipeline stage is PIPELINE_CLOSE > 7. client continuously writes the last block, and try to close the file after > written all the data > 8. NN finds that the penultimate block doesn't has enough replica(our > dfs.namenode.replication.min=2), and the client's close runs into indefinite > loop(HDFS-2936), and at the same time, NN makes the last block's state to > COMPLETE > 9. shutdown the client > 10. the file's lease exceeds hard limit > 11. LeaseManager realizes that and begin to do lease recovery by call > fsnamesystem.internalReleaseLease() > 12. but the last block's state is COMPLETE, and this triggers lease manager's > infinite loop and prints massive logs like this: > {noformat} > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: > DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard > limit > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. > Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src= > /user/h_wuzesheng/test.dat > 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block > blk_-7028017402720175688_1202597, > lastBLockState=COMPLETE > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery > for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONM > APREDUCE_-1252656407_1, pendingcreates: 1] > {noformat} > (the 3rd line log is a debug log added by us) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
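The first()/pollFirst() difference is easy to demonstrate on any NavigableSet:
{code}
import java.util.TreeSet;

public class PollFirstVsFirst {
  public static void main(String[] args) {
    TreeSet<String> sortedLeases = new TreeSet<String>();
    sortedLeases.add("lease-a");
    sortedLeases.add("lease-b");

    // first() only peeks at the smallest element...
    System.out.println(sortedLeases.first());     // lease-a
    System.out.println(sortedLeases.size());      // 2

    // ...while pollFirst() retrieves *and removes* it, which is why
    // iterating with pollFirst() drained the set and broke the tests.
    System.out.println(sortedLeases.pollFirst()); // lease-a
    System.out.println(sortedLeases.size());      // 1
  }
}
{code}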
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Status: Open (was: Patch Available) > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Attachment: HDFS-7394.patch > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Status: Patch Available (was: Open) > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Attachment: (was: HDFS-7394.patch) > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213010#comment-14213010 ] Haohui Mai commented on HDFS-7279: -- The findbugs warning is unrelated. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the current jetty version the DN used (jetty 6) lacks of fine-grained buffer > and connection management, DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in DN using netty, > which can be more efficient and allow more finer-grain controls on webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212999#comment-14212999 ] Hadoop QA commented on HDFS-7279: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681609/HDFS-7279.011.patch against trunk revision f2fe8a8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-hdfs-project/hadoop-hdfs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8742//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8742//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8742//console This message is automatically generated. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the current jetty version the DN used (jetty 6) lacks of fine-grained buffer > and connection management, DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in DN using netty, > which can be more efficient and allow more finer-grain controls on webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212998#comment-14212998 ] Hadoop QA commented on HDFS-7394: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681607/HDFS-7394.patch against trunk revision 1a1dcce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1218 javac compiler warnings (more than the trunk's current 1217 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.ipc.TestRPCCallBenchmark The test build failed in hadoop-hdfs-project/hadoop-hdfs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//console This message is automatically generated. > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212985#comment-14212985 ] Brandon Li commented on HDFS-7146: -- {quote} The defaultStaticIdMappingFile was introduced in HADOOP-11195, and I actually have removed it in rev 004. Would you please indicate the place you were looking at?{quote} My bad. I looked into the wrong side of the diff. {quote}Relaxing the platform support is a different issue to solve and it seems to deserve a separate jira; what do you think?{quote} I am ok with either fixing it here or a different JIRA. {quote}I introduced this for testing purposes. {quote} Please add java doc for it. Also, it would be nice to describe the solution in the class java doc. > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-3806) Assertion failed in TestStandbyCheckpoints.testBothNodesInStandbyState
[ https://issues.apache.org/jira/browse/HDFS-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-3806. - Resolution: Duplicate I'm resolving this as duplicate of HDFS-3519. > Assertion failed in TestStandbyCheckpoints.testBothNodesInStandbyState > -- > > Key: HDFS-3806 > URL: https://issues.apache.org/jira/browse/HDFS-3806 > Project: Hadoop HDFS > Issue Type: Bug > Components: test > Environment: Jenkins >Reporter: Trevor Robinson >Priority: Minor > > Failed in Jenkins build for unrelated issue (HDFS-3804): > https://builds.apache.org/job/PreCommit-HDFS-Build/3011/testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestStandbyCheckpoints/testBothNodesInStandbyState/ > {noformat} > java.lang.AssertionError: Expected non-empty > /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/fsimage_012 > at org.junit.Assert.fail(Assert.java:91) > at org.junit.Assert.assertTrue(Assert.java:43) > at > org.apache.hadoop.hdfs.server.namenode.FSImageTestUtil.assertNNHasCheckpoints(FSImageTestUtil.java:467) > at > org.apache.hadoop.hdfs.server.namenode.ha.HATestUtil.waitForCheckpoint(HATestUtil.java:213) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints.testBothNodesInStandbyState(TestStandbyCheckpoints.java:133) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7393) TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212962#comment-14212962 ] Konstantin Shvachko commented on HDFS-7393: --- Indeed. Good it is fixed. Thanks. > TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk > --- > > Key: HDFS-7393 > URL: https://issues.apache.org/jira/browse/HDFS-7393 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Ted Yu > > The following is reproducible: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7399) Lack of synchronization in DFSOutputStream#Packet#getLastByteOffsetBlock()
Ted Yu created HDFS-7399: Summary: Lack of synchronization in DFSOutputStream#Packet#getLastByteOffsetBlock() Key: HDFS-7399 URL: https://issues.apache.org/jira/browse/HDFS-7399 Project: Hadoop HDFS Issue Type: Bug Reporter: Ted Yu Priority: Minor
{code}
long getLastByteOffsetBlock() {
  return offsetInBlock + dataPos - dataStart;
}
{code}
Access to fields of Packet.this should be protected by synchronization as done in other methods such as writeTo(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
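A minimal sketch of the suggested fix, assuming Packet's other accessors synchronize on the Packet instance the way writeTo() does:
{code}
// Hypothetical fix sketch: take the Packet lock so the three fields are
// read as a consistent snapshot, mirroring the synchronized writeTo().
synchronized long getLastByteOffsetBlock() {
  return offsetInBlock + dataPos - dataStart;
}
{code}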
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212846#comment-14212846 ] Andrew Wang commented on HDFS-7374: --- Hmm, so one situation we've seen is that the cluster is 100% healthy (no under-rep blocks) and dead DNs still get stuck in the D_I_P state. We can safely transition even dead nodes to DECOMMED in this situation. Going backwards from (DEAD, DECOMMED) back to (LIVE, D_I_P) feels a little weird. IMO DECOMMED should mean that a node can safely be removed from the cluster, even for dead nodes. That won't necessarily be true in this case. > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish > the decommission work. If an upper layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212844#comment-14212844 ] Yongjun Zhang commented on HDFS-7146: - Hi [~brandonli], Thanks a lot for the review and comments! I have a few questions to clarify: {quote} 1. it doesn’t seem to be necessary to introduce defaultStaticIdMappingFile {quote} The defaultStaticIdMappingFile was introduced in HADOOP-11195, and I actually have removed it in rev 004. Would you please indicate the place you were looking at? {quote} 2. Do we need checkSupportedPlatform()? We don’t have to limit the platform to only linux and mac. Some other UNIX flavors might also be able to run the NFS server. We could do the following: if (Shell.Mac) { // mac command } else { // linux command for everything else } {quote} About checkSupportedPlatform, I simply followed the existing implementation ({{ if (!OS.startsWith("Linux") && !OS.startsWith("Mac"))}}), which says only mac and linux are supported. Relaxing the platform support is a different issue to solve and it seems to deserve a separate jira; what do you think? {quote} 3. do we still need constructFullMapAtInit since it’s always false? {quote} I introduced this for testing purposes. If you look at the new test I introduced, it's first called with "true" to create a reference (refIdMapping). That's why I tagged the constructor with @VisibleForTesting. Does this sound ok to you? Thanks again! > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212825#comment-14212825 ] Maysam Yabandeh commented on HDFS-6982: --- [~andrew.wang] I do not see any relation between the patch and the findbugs warnings or the test failures. > nntop: top-like tool for name node users > - > > Key: HDFS-6982 > URL: https://issues.apache.org/jira/browse/HDFS-6982 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Maysam Yabandeh >Assignee: Maysam Yabandeh > Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, > HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, > HDFS-6982.v7.patch, HDFS-6982.v8.patch, nntop-design-v1.pdf > > > In this jira we motivate the need for nntop, a tool that, similarly to what > top does in Linux, gives the list of top users of the HDFS name node and > gives insight about which users are sending majority of each traffic type to > the name node. This information turns out to be the most critical when the > name node is under pressure and the HDFS admin needs to know which user is > hammering the name node and with what kind of requests. Here we present the > design of nntop which has been in production at Twitter in the past 10 > months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K > nodes), low memory footprint (less than a few MB), and quite efficient for > the write path (only two hash lookup for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-7279: - Attachment: HDFS-7279.011.patch > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the current jetty version the DN used (jetty 6) lacks of fine-grained buffer > and connection management, DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in DN using netty, > which can be more efficient and allow more finer-grain controls on webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212800#comment-14212800 ] Brandon Li commented on HDFS-7146: -- The patch looks nice. Some comments: 1. it doesn’t seem to be necessary to introduce defaultStaticIdMappingFile 2. Do we need checkSupportedPlatform()? We don’t have to limit the platform to only linux and mac. Some other UNIX flavors might also be able to run the NFS server. We could do the following: if (Shell.Mac) { // mac command } else { // linux command for everything else } 3. do we still need constructFullMapAtInit since it’s always false? > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
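Sketching out the suggested structure (the real constant is {{Shell.MAC}}; the command strings are placeholders, not the ones used by the patch):
{code}
// Hypothetical sketch: special-case Mac and treat every other UNIX
// flavor like Linux, instead of rejecting unknown platforms outright.
private static String getUserLookupCommand(int uid) {
  if (Shell.MAC) {
    return "dscl . -search /Users UniqueID " + uid; // mac command (placeholder)
  } else {
    return "getent passwd " + uid; // linux command for everything else (placeholder)
  }
}
{code}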
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212794#comment-14212794 ] Ming Ma commented on HDFS-7374: --- [~andrew.wang], after a node is dead, all its blocks will be removed from the blockmap. So if the node never rejoins the cluster, it is unclear how you can tell whether all its blocks are fully replicated, unless we track those blocks. Another way to cover all these scenarios could be to get rid of the {{DEAD, DECOM_IN_PROGRESS}} state. After the node dies during decommission, transition to {{DEAD, DECOMMED}}. When the node rejoins the cluster, transition it to {{LIVE, DECOM_IN_PROGRESS}}. > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish > the decommission work. If an upper layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Attachment: HDFS-7394.patch Attached patch > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Status: Patch Available (was: Open) > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7394 started by Keith Pak. --- > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work stopped] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7394 stopped by Keith Pak. --- > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work stopped] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7394 stopped by Keith Pak. --- > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7394 started by Keith Pak. --- > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak reassigned HDFS-7394: --- Assignee: Keith Pak > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212766#comment-14212766 ] Andrew Wang commented on HDFS-7374: --- Hey [~mingma], I was looking a bit more at decom, and I see that we have this if statement at the end of {{isReplicationInProgress}}:
{code}
if (!status && !srcNode.isAlive) {
  LOG.warn("srcNode " + srcNode + " is dead "
      + "when decommission is in progress. Continue to mark "
      + "it as decommission in progress. In that way, when it rejoins the "
      + "cluster it can continue the decommission process.");
  status = true;
}
{code}
Logically, a (DEAD, DECOM_IN_PROGRESS) node should be able to go to (DEAD, DECOMMED) if all of its blocks are fully replicated, but this if statement prevents {{isReplicationInProgress}} from ever returning false for a dead node. It seems like we can loosen this requirement? > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish > the decommission work. If an upper layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
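One possible reading of "loosen this requirement", sketched here under stated assumptions (the under-replication guard is an assumption, not a proposed patch; cf. Ming Ma's point elsewhere in this thread that a dead node's blocks leave the blockmap):
{code}
// Hypothetical sketch: let a dead node leave DECOMMISSION_IN_PROGRESS
// when nothing remains to replicate, instead of pinning it forever.
if (!status && !srcNode.isAlive) {
  if (blockManager.getUnderReplicatedBlocksCount() == 0) { // assumed guard
    LOG.info("srcNode " + srcNode + " is dead and no blocks are "
        + "under-replicated; allowing decommission to complete.");
  } else {
    LOG.warn("srcNode " + srcNode + " is dead while decommission is in "
        + "progress; keeping it in DECOMMISSION_IN_PROGRESS.");
    status = true;
  }
}
{code}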
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212757#comment-14212757 ] Yongjun Zhang commented on HDFS-4239: - Hi [~qwertymaniac], My bad that I did not notice your earlier comment {quote} I just noticed Steve's comment referring to the same - should've gone through properly before spending google cycles. I feel HDFS-1362 implemented would solve half of this - and the other half would be to make the removals automatic. Right now the checkDiskError does not eject if it's slow - as long as it succeeds, which would have to be done via this JIRA I think. The re-add would be possible via HDFS-1362. {quote} until now. So we need to use the functionality provided by HDFS-1362 to automatically remove a sick disk. It seems the original goal of HDFS-4239 is the same as that of HDFS-1362 (right?), so perhaps we can create a new jira for automatically removing a sick disk? Thanks. > Means of telling the datanode to stop using a sick disk > --- > > Key: HDFS-4239 > URL: https://issues.apache.org/jira/browse/HDFS-4239 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: stack >Assignee: Yongjun Zhang > Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, > hdfs-4239_v4.patch, hdfs-4239_v5.patch > > > If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing > occasionally, or just exhibiting high latency -- your choices are: > 1. Decommission the total datanode. If the datanode is carrying 6 or 12 > disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- > the rereplication of the downed datanode's data can be pretty disruptive, > especially if the cluster is doing low latency serving: e.g. hosting an hbase > cluster. > 2. Stop the datanode, unmount the bad disk, and restart the datanode (You > can't unmount the disk while it is in use). This latter is better in that > only the bad disk's data is rereplicated, not all datanode data. > Is it possible to do better, say, send the datanode a signal to tell it stop > using a disk an operator has designated 'bad'. This would be like option #2 > above minus the need to stop and restart the datanode. Ideally the disk > would become unmountable after a while. > Nice to have would be being able to tell the datanode to restart using a disk > after its been replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212745#comment-14212745 ] Chris Trezzo commented on HDFS-6133: I am slightly late to the party on this one, but at Twitter we have a similar need for a slightly different use case. We use federation and each block pool has a different workload. For some of these workloads it does not make sense to run the balancer. For example, we have a block pool associated with a tmp namespace. Ideally, we would never want to balance blocks in that block pool because they will be deleted shortly anyways. One design approach we were contemplating is to make the balancer block pool-aware. You could then run the balancer on a per-block pool basis and have pluggable balancing strategies for each pool (i.e. the balancer policy in the block pool associated with the tmp namespace is a no-op). This allows the balancer to remain decoupled from the namespace; it only needs to know about the block pool (we can still separate the BM at a later point). The above might work for this use case as well. The balancer policy for the block pool containing blocks in hbase would be a no-op. Let me know what you guys think. I can see the block pool design being orthogonal to this JIRA, so let me know if I should open up a separate JIRA for this effort. We could potentially use the pinning strategy for our use case as well, but I hesitate for the same reasons that [~kihwal] mentioned above with respect to corrupt/unavailable blocks. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6711) FSNamesystem#getAclStatus does not write to the audit log.
[ https://issues.apache.org/jira/browse/HDFS-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212736#comment-14212736 ] Chris Nauroth commented on HDFS-6711: - This was fixed in HDFS-7218, so I'm resolving this as duplicate. > FSNamesystem#getAclStatus does not write to the audit log. > -- > > Key: HDFS-6711 > URL: https://issues.apache.org/jira/browse/HDFS-6711 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.0.0, 2.4.0 >Reporter: Chris Nauroth >Priority: Minor > > Consider writing an event to the audit log for the {{getAclStatus}} method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-6711) FSNamesystem#getAclStatus does not write to the audit log.
[ https://issues.apache.org/jira/browse/HDFS-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-6711. - Resolution: Duplicate > FSNamesystem#getAclStatus does not write to the audit log. > -- > > Key: HDFS-6711 > URL: https://issues.apache.org/jira/browse/HDFS-6711 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.0.0, 2.4.0 >Reporter: Chris Nauroth >Priority: Minor > > Consider writing an event to the audit log for the {{getAclStatus}} method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212734#comment-14212734 ] Chris Nauroth commented on HDFS-7384: - Yes, what you described makes sense. An older client simply wouldn't consume the new protobuf field. I'd prefer not to add the effective action directly to {{AclEntry}}, since the effective action is something that only makes sense when the entry is considered against some other object (the mask). Overall, it sounds good. Thanks for thinking this through and putting out the proposal! > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > > *getfacl* command will print all the entries including basic and extended > entries, mask entries and effective permissions. > But, *getAclStatus* FileSystem API will return only extended ACL entries set > by the user. But this will not include the mask entry as well as effective > permissions. > To benefit the client using API, better to include 'mask' entry and effective > permissions in the return list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang resolved HDFS-4239. - Resolution: Duplicate Hi Stack, This issue turned out to be a duplicate of HDFS-1362, which is resolved now. I'm closing this jira as a duplicate. Please re-open if you think there is an additional issue to be addressed. Thanks. > Means of telling the datanode to stop using a sick disk > --- > > Key: HDFS-4239 > URL: https://issues.apache.org/jira/browse/HDFS-4239 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: stack >Assignee: Yongjun Zhang > Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, > hdfs-4239_v4.patch, hdfs-4239_v5.patch > > > If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing > occasionally, or just exhibiting high latency -- your choices are: > 1. Decommission the total datanode. If the datanode is carrying 6 or 12 > disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- > the rereplication of the downed datanode's data can be pretty disruptive, > especially if the cluster is doing low latency serving: e.g. hosting an hbase > cluster. > 2. Stop the datanode, unmount the bad disk, and restart the datanode (You > can't unmount the disk while it is in use). This latter is better in that > only the bad disk's data is rereplicated, not all datanode data. > Is it possible to do better, say, send the datanode a signal to tell it stop > using a disk an operator has designated 'bad'. This would be like option #2 > above minus the need to stop and restart the datanode. Ideally the disk > would become unmountable after a while. > Nice to have would be being able to tell the datanode to restart using a disk > after its been replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212699#comment-14212699 ] Andrew Wang commented on HDFS-7374: --- The patch looks good, findbugs looks unrelated, but the TestDecommission failure is worrying and also failed for me locally. Could you take a look? > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish > the decommission work. If an upper layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212680#comment-14212680 ] Vinayakumar B commented on HDFS-7384: - Thanks [~cnauroth] for the detailed explanation. bq. At this point, we can't change the behavior of getAclStatus on the 2.x line for compatibility reasons. Suppose a 2.6.0 deployment of the shell called getAclStatus on a 2.7.0 NameNode Here we can implement this without breaking compatibility. For example, the returned {{AclStatus}} can carry the default permissions in the form of an {{FsPermission}} object itself, which would be an optional field in protobuf. So we can keep the {{getAclEntries()}} return value as is, but in {{AclEntry}} we can add one more field, 'effective action'; this can either be calculated on the client side, based on the FsPermission object in AclStatus, or be an optional field set on the NN side itself. My basic intention is to avoid the extra client-side logic, which users currently have to implement, to find out the effective permission for an ACL entry. If {{AclStatus}} contains the {{FsPermission}} value, then we can produce the same output as 'getfacl' without having to make one more RPC to the NN. This would keep the existing behavior of returning empty entries for basic permissions, which was decided after so many discussions. Any thoughts? > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > > *getfacl* command will print all the entries including basic and extended > entries, mask entries and effective permissions. > But, *getAclStatus* FileSystem API will return only extended ACL entries set > by the user. But this will not include the mask entry as well as effective > permissions. > To benefit the client using API, better to include 'mask' entry and effective > permissions in the return list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
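For illustration, the client-side computation described here could be a one-liner over the existing permission API (a sketch; which entry types the mask actually applies to is simplified):
{code}
// Hypothetical sketch: effective action of a named ACL entry, derived by
// intersecting its granted action with the mask, which getfacl reads from
// the group bits of the file's FsPermission when an extended ACL exists.
static FsAction getEffectiveAction(AclEntry entry, FsPermission perm) {
  FsAction mask = perm.getGroupAction(); // mask is stored in the group bits
  return entry.getPermission().and(mask);
}
{code}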
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212670#comment-14212670 ] Kihwal Lee commented on HDFS-6133: -- This might be outside the scope of this jira, but I think we need to think about this before going further. If a node with a pinned block is temporarily unavailable, the namenode will try to replicate the block as it is under-replicated. When the node recovers, the block is over-replicated and a replica will be invalidated. How do we make sure it is not removed from the favored node? I think this scenario can happen during start-up or transient infra/network issues. Daryn and I had a brief discussion about this. It might be possible to include pinning info in block reports and remember it in block manager. This will enable NN to make the right decision on over-replicated cases. A bit more complicated logic will be needed when a pinned block gets corrupted on a favored node. The usual replicate + invalidate strategy won't be ideal here. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7393) TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang resolved HDFS-7393. --- Resolution: Duplicate I think this is a dupe of HDFS-7395; it has the same Preconditions stack trace in BlockIdManager. Please reopen if the failure still occurs. > TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk > --- > > Key: HDFS-7393 > URL: https://issues.apache.org/jira/browse/HDFS-7393 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Ted Yu > > The following is reproducible: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6962) ACLs inheritance conflict with umaskmode
[ https://issues.apache.org/jira/browse/HDFS-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-6962: Target Version/s: 2.7.0 (was: 2.4.1)

Hello, [~Alexandre LINTE]. Thank you for filing this issue. I tested the same scenario against a Linux local file system, and I confirmed that HDFS is showing different behavior, just like you described. I also confirmed that this is a divergence from the POSIX ACL specs. Here is a quote of the relevant section:

{quote} The permissions of inherited access ACLs are further modified by the mode parameter that each system call creating file system objects has. The mode parameter contains nine permission bits that stand for the permissions of the owner, group, and other class permissions. The effective permissions of each class are set to the intersection of the permissions defined for this class in the ACL and specified in the mode parameter. If the parent directory has no default ACL, the permissions of the new file are determined as defined in POSIX.1. The effective permissions are set to the permissions defined in the mode parameter, minus the permissions set in the current umask. The umask has no effect if a default ACL exists. {quote}

Changing this behavior is going to be somewhat challenging. Note the distinction made in the spec between mode and umask. When creating a new child (file or directory) of a directory with a default ACL, the mode influences the inherited access ACL entries, but the umask has no effect. Unfortunately, our current implementation intersects mode and umask on the client side before passing them to the NameNode in the RPC. This happens in {{DFSClient#mkdirs}} and {{DFSClient#create}}:

{code}
public boolean mkdirs(String src, FsPermission permission,
    boolean createParent) throws IOException {
  if (permission == null) {
    permission = FsPermission.getDefault();
  }
  FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
{code}

{code}
public DFSOutputStream create(String src, FsPermission permission,
    EnumSet<CreateFlag> flag, boolean createParent, short replication,
    long blockSize, Progressable progress, int buffersize,
    ChecksumOpt checksumOpt, InetSocketAddress[] favoredNodes)
    throws IOException {
  checkOpen();
  if (permission == null) {
    permission = FsPermission.getFileDefault();
  }
  FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
{code}

On the NameNode side, when it copies the default ACL from parent to child, we've lost the information. We just have a single piece of permissions data, with no knowledge of which part was mode vs. umask on the client side. A potential solution is to push both mode and umask explicitly to the NameNode in the RPC requests for {{MkdirsRequestProto}} and {{CreateRequestProto}}. Those messages already contain an instance of {{FsPermissionProto}}. We could add a second optional instance. If both instances are defined, then the NameNode would interpret one as being mode and the other as being umask. There would still be the possibility of an older client passing just one instance, and in that case, we'd have to fall back to the current behavior. It's a bit messy, but it could work.

We also have one additional problem specific to the shell for files (not directories). The implementation of copyFromLocal breaks down into 2 separate RPCs: creating the file, followed by a separate chmod call. The NameNode has no way of knowing whether that chmod call is part of a copyFromLocal or not, though. It's too late to enforce the mode vs. umask distinction.
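To make the divergence concrete, here is a small sketch (illustrative numbers only, not code from HDFS or this patch) contrasting what POSIX requires with the only information the NameNode receives today:

{code}
// Illustrative values only -- a sketch of the POSIX rule, not HDFS code.
// Per POSIX, when the parent has a default ACL, the umask is ignored and each
// class's effective permissions are the intersection of the default ACL
// permissions and the mode. The NameNode, however, only sees (mode & ~umask).
public class AclInheritSketch {
  public static void main(String[] args) {
    int mode = 0777;        // mode the application passed to mkdirs/create
    int umask = 027;        // client-side umask (dfs.umaskmode)
    int defaultAcl = 0770;  // permission bits carried by the default ACL

    int posixChild = defaultAcl & mode;   // 0770: umask plays no part
    int onTheWire = mode & ~umask;        // 0750: all the NameNode sees today
    System.out.printf("posix=%o on-wire=%o%n", posixChild, onTheWire);
  }
}
{code}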
I'm tentatively targeting this to 2.7.0. I think this will need more investigation to make sure there are no compatibility issues with the solution. If there is an unavoidable compatibility problem, then it might require pushing out to 3.x. We won't know for sure until someone starts coding. Thank you again for the very detailed bug report. > ACLs inheritance conflict with umaskmode > > > Key: HDFS-6962 > URL: https://issues.apache.org/jira/browse/HDFS-6962 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 2.4.1 > Environment: CentOS release 6.5 (Final) >Reporter: LINTE > Labels: hadoop, security > > In hdfs-site.xml: > <property> > <name>dfs.umaskmode</name> > <value>027</value> > </property> > 1/ Create a directory as superuser > bash# hdfs dfs -mkdir /tmp/ACLS > 2/ Set default ACLs on this directory: rwx access for group readwrite and user toto > bash# hdfs dfs -setfac
[jira] [Commented] (HDFS-7056) Snapshot support for truncate
[ https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212588#comment-14212588 ] Plamen Jeliazkov commented on HDFS-7056: The FindBugs warning appears to be unrelated; it points to inconsistent synchronization in org.apache.hadoop.hdfs.DFSOutputStream, a class we don't touch in this work. > Snapshot support for truncate > - > > Key: HDFS-7056 > URL: https://issues.apache.org/jira/browse/HDFS-7056 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.0.0 >Reporter: Konstantin Shvachko >Assignee: Plamen Jeliazkov > Attachments: HDFS-3107-HDFS-7056-combined.patch, > HDFS-3107-HDFS-7056-combined.patch, HDFS-3107-HDFS-7056-combined.patch, > HDFS-3107-HDFS-7056-combined.patch, HDFS-3107-HDFS-7056-combined.patch, > HDFS-7056.patch, HDFS-7056.patch, HDFS-7056.patch, HDFS-7056.patch, > HDFS-7056.patch, HDFS-7056.patch, HDFSSnapshotWithTruncateDesign.docx > > > Implementation of truncate in HDFS-3107 does not allow truncating files which > are in a snapshot. It is desirable to be able to truncate and still keep the > old state of the file in the snapshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7177) Add an option to include minimal ACL in getAclStatus return
[ https://issues.apache.org/jira/browse/HDFS-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-7177. - Resolution: Duplicate > Add an option to include minimal ACL in getAclStatus return > --- > > Key: HDFS-7177 > URL: https://issues.apache.org/jira/browse/HDFS-7177 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Zhe Zhang >Assignee: Zhe Zhang >Priority: Minor > > Currently the 3 minimal ACL entries are not included in the returned value of > getAclStatus. {{FsShell}} gets them separately ({{FsPermission perm = > item.stat.getPermission();}}). It'd be useful to make it optional to include > them, so that external programs can get a complete view of the permissions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7177) Add an option to include minimal ACL in getAclStatus return
[ https://issues.apache.org/jira/browse/HDFS-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212555#comment-14212555 ] Chris Nauroth commented on HDFS-7177: - Hi, [~zhz]. I just realized too late that HDFS-7384 is reporting basically the same thing as this. I just entered a huge comment on HDFS-7384 about it, so I'd prefer to resolve this one as duplicate, even though it really came first. I'll add all of the watchers over to HDFS-7384 so that they can still be involved in the conversation. Thanks! > Add an option to include minimal ACL in getAclStatus return > --- > > Key: HDFS-7177 > URL: https://issues.apache.org/jira/browse/HDFS-7177 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Zhe Zhang >Assignee: Zhe Zhang >Priority: Minor > > Currently the 3 minimal ACL entries are not included in the returned value of > getAclStatus. {{FsShell}} gets them separately ({{FsPermission perm = > item.stat.getPermission();}}). It'd be useful to make it optional to include > them, so that external programs can get a complete view of the permissions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212548#comment-14212548 ] Chris Nauroth commented on HDFS-7384: - Hi, [~vinayrpet]. The current behavior of {{getAclStatus}} is an intentional design choice, but the history behind that choice is a bit convoluted. Let me see if I can reconstruct it here. It starts with HADOOP-10220, which added an ACL indicator bit to {{FsPermission}}. This was provided as an optimization so that clients could quickly identify if a file has an ACL, without needing an additional RPC. Later, objections were raised against the ACL bit in HDFS-5923 and HDFS-5932. We made a decision to roll back the HADOOP-10220 changes, and instead require callers to use {{getAclStatus}} to identify the presence of an ACL. Prior to this, early implementations of {{getAclStatus}} would always return a non-empty list. For an inode with no ACL, it would return the "minimal ACL" containing the 3 entries that correspond to basic POSIX permissions. However, at this point, it became helpful to change {{getAclStatus}} so that it would return an empty list if there is no ACL. This was seen as easier for clients than trying to check the entries for no ACL/minimal ACL. It was also seen as a cleaner logical separation, since the client likely already has the {{FsPermission}} prior to calling {{getAclStatus}}, and therefore it would not be helpful to return redundant ACL entries. Finally, HDFS-6326 identified that our implementation choice was backwards-incompatible for webhdfs, and generally a performance bottleneck for shell users. To solve this, we reinstated the ACL bit, in a slightly different implementation, but the behavior of {{getAclStatus}} remained the same. You've definitely identified a weakness in the current API design, and I raised similar objections at the time. It's a trade-off. I think there is good logical separation right now, but as a side effect, it does mean that callers may need some extra client-side logic to piece all of the information together, such as if someone wanted to write a custom GUI consuming WebHDFS to display ACL information. At this point, we can't change the behavior of {{getAclStatus}} on the 2.x line for compatibility reasons. Suppose a 2.6.0 deployment of the shell called {{getAclStatus}} on a 2.7.0 NameNode, and it had been changed to return the complete ACL. This would cause {{getfacl}} to display duplicate entries, because the 2.6.0 logic of {{GetfaclCommand}} and {{AclUtil#getAclFromPermAndEntries}} will combine the output of {{getAclStatus}} with the {{FsPermission}}, resulting in 3 duplicate entries. Where does that leave us for this jira? I can see the following options: # Resolve as won't fix, based on the above rationale. # Target 3.0 for a backwards-incompatible change. # Add a new RPC, named {{getFullAcl}} or similar, with the behavior that you proposed. However, I'd prefer not to increase the API footprint unless there is a really strong use case. Hope this helps. Let me know your thoughts. Thanks! > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > > *getfacl* command will print all the entries including basic and extended > entries, mask entries and effective permissions. 
> But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user; it will not include the mask entry or effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
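For reference, the client-side combination described in the comment above looks roughly like this; a minimal sketch using the {{AclUtil}} helper (note {{AclUtil}} is Hadoop-private API, so treat this as an illustration rather than supported usage):

{code}
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclStatus;
import org.apache.hadoop.fs.permission.AclUtil;
import org.apache.hadoop.fs.permission.FsPermission;

// Merge the FsPermission (already fetched via getFileStatus) with the
// extended entries from getAclStatus to obtain the full logical ACL.
// If getAclStatus started returning the complete ACL instead, this merge
// would produce the duplicate entries described above.
public class FullAclSketch {
  static List<AclEntry> fullAcl(FileSystem fs, Path path) throws Exception {
    FsPermission perm = fs.getFileStatus(path).getPermission();
    AclStatus status = fs.getAclStatus(path);
    return AclUtil.getAclFromPermAndEntries(perm, status.getEntries());
  }
}
{code}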
[jira] [Commented] (HDFS-7398) Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit
[ https://issues.apache.org/jira/browse/HDFS-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212525#comment-14212525 ] Gera Shegalov commented on HDFS-7398: - Regarding the FindBugs warning: {quote} Inconsistent synchronization of org.apache.hadoop.hdfs.DFSOutputStream$Packet.dataPos; locked 83% of time {quote} It's obviously unrelated. > Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit > -- > > Key: HDFS-7398 > URL: https://issues.apache.org/jira/browse/HDFS-7398 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.6.0 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: HDFS-7398.v01.patch > > > This is a follow-up on HDFS-7385. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7396) Revisit synchronization in Namenode
[ https://issues.apache.org/jira/browse/HDFS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212469#comment-14212469 ] Chris Nauroth commented on HDFS-7396: - bq. Whenever we experimented with improving concurrency, the limiting factor was the garbage collection overhead. I also would be interested in seeing more information on this. We've been updating our recommendations for garbage collection tuning recently. It would be interesting for us to compare notes. I'm also curious whether you've tried any experiments running with the G1 collector. I haven't tried it in several years. When I tried it, it was still very experimental, so I ended up hitting too many bugs to run it in production. Perhaps it has stabilized by now. > Revisit synchronization in Namenode > --- > > Key: HDFS-7396 > URL: https://issues.apache.org/jira/browse/HDFS-7396 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > > HDFS-2106 separated block management out of the namenode into a new package. As part > of that, some code was refactored into new classes such as DatanodeManager, > HeartbeatManager, etc. There are opportunities to improve locking in the > namenode, whereas currently synchronization in the namenode is mainly done by a > single global FSNamesystem lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212384#comment-14212384 ] Hudson commented on HDFS-7395: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit > -- > > Key: HDFS-7395 > URL: https://issues.apache.org/jira/browse/HDFS-7395 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Yongjun Zhang >Assignee: Haohui Mai > Fix For: 2.7.0 > > Attachments: HDFS-7395.000.patch > > > In the latest jenkins jobs > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/ > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/ > but not > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/ > the following test failed the same way: > {code} > Failed > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image > Failing for the past 2 builds (Since Failed#1931 ) > Took 0.54 sec. > Stacktrace > java.lang.IllegalStateException: null > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975) > at > org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465) > at > org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
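From the stack trace quoted above, the failure mode can be sketched as follows; the shape is assumed from the trace and the jira title, not copied from the HDFS source:

{code}
import com.google.common.base.Preconditions;

// Sketch assumed from the stack trace above, not the exact HDFS source.
class BlockIdManagerSketch {
  static final long GRANDFATHER_GENERATION_STAMP = 0; // placeholder value

  private long generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;

  void setGenerationStampV1Limit(long stamp) {
    // Throws IllegalStateException if the limit was already set -- which is
    // exactly what clear() hits on an already-initialized BlockIdManager
    // during the upgrade path exercised by the failing test.
    Preconditions.checkState(
        generationStampV1Limit == GRANDFATHER_GENERATION_STAMP);
    generationStampV1Limit = stamp;
  }

  void clear() {
    // The fix presumably resets the field directly (or relaxes the check)
    // instead of going through the guarded setter.
    generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;
  }
}
{code}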
[jira] [Commented] (HDFS-7396) Revisit synchronization in Namenode
[ https://issues.apache.org/jira/browse/HDFS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212394#comment-14212394 ] Kihwal Lee commented on HDFS-7396: -- This is a general comment regarding reducing lock contention and increasing concurrency in the namenode. Whenever we experimented with improving concurrency, the limiting factor was the garbage collection overhead. This has gotten worse after the conversion to protobuf. Under a given load, locking improvements will certainly give better and more predictable response times. But if pushed beyond what the NN was capable of before, we will soon run into the existing inefficiencies. [~daryn] has found some of them and I hope he shares them with us soon. As [~tlipcon] mentioned in HDFS-2206, we need locking rules defined, documented and enforced if possible. In addition to the interactions between different locks, the role and scope of each lock need to be clearly defined. Lock definitions should include what the lock protects and the expected data consistency and visibility during and after a critical section, etc. At a minimum, we can come up with a comment template for this; one possible shape is sketched below. > Revisit synchronization in Namenode > --- > > Key: HDFS-7396 > URL: https://issues.apache.org/jira/browse/HDFS-7396 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > > HDFS-2106 separated block management out of the namenode into a new package. As part > of that, some code was refactored into new classes such as DatanodeManager, > HeartbeatManager, etc. There are opportunities to improve locking in the > namenode, whereas currently synchronization in the namenode is mainly done by a > single global FSNamesystem lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
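One possible shape for the lock-documentation template suggested above (purely illustrative; no such convention exists in the codebase yet):

{code}
/**
 * Lock: FSNamesystem#fsLock (example name)
 * Kind: global read-write lock
 * Protects: <data structures and invariants guarded by this lock>
 * Ordering: <locks that must be acquired before this one / must not be held>
 * Guarantees: <expected data consistency and visibility during and after
 *              a critical section, per the definition discussed above>
 */
{code}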
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212382#comment-14212382 ] Hudson commented on HDFS-7385: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
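The root cause described in the quoted report is the cached thread-local op retaining ACL entries between uses. A minimal sketch of the kind of defensive reset that prevents it (assumed shape; the committed patch may differ):

{code}
// Inside logMkDir: always (re)assign the ACL entries on the cached
// thread-local op, so a stale list left over from a previous mkdir on the
// same handler thread can never leak into the edit log.
AclFeature f = newNode.getAclFeature();
op.setAclEntries(f != null ? AclStorage.readINodeLogicalAcl(newNode) : null);
{code}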
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212388#comment-14212388 ] Hudson commented on HDFS-7358: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212359#comment-14212359 ] Hudson commented on HDFS-7395: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit > -- > > Key: HDFS-7395 > URL: https://issues.apache.org/jira/browse/HDFS-7395 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Yongjun Zhang >Assignee: Haohui Mai > Fix For: 2.7.0 > > Attachments: HDFS-7395.000.patch > > > In the latest jenkins jobs > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/ > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/ > but not > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/ > the following test failed the same way: > {code} > Failed > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image > Failing for the past 2 builds (Since Failed#1931 ) > Took 0.54 sec. > Stacktrace > java.lang.IllegalStateException: null > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975) > at > org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465) > at > org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212357#comment-14212357 ] Hudson commented on HDFS-7385: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212363#comment-14212363 ] Hudson commented on HDFS-7358: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212293#comment-14212293 ] Hudson commented on HDFS-7385: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212299#comment-14212299 ] Hudson commented on HDFS-7358: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212295#comment-14212295 ] Hudson commented on HDFS-7395: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit > -- > > Key: HDFS-7395 > URL: https://issues.apache.org/jira/browse/HDFS-7395 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Yongjun Zhang >Assignee: Haohui Mai > Fix For: 2.7.0 > > Attachments: HDFS-7395.000.patch > > > In the latest jenkins jobs > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/ > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/ > but not > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/ > the following test failed the same way: > {code} > Failed > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image > Failing for the past 2 builds (Since Failed#1931 ) > Took 0.54 sec. > Stacktrace > java.lang.IllegalStateException: null > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975) > at > org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465) > at > org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212286#comment-14212286 ] Hudson commented on HDFS-7358: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212282#comment-14212282 ] Hudson commented on HDFS-7395: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit > -- > > Key: HDFS-7395 > URL: https://issues.apache.org/jira/browse/HDFS-7395 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Yongjun Zhang >Assignee: Haohui Mai > Fix For: 2.7.0 > > Attachments: HDFS-7395.000.patch > > > In the latest jenkins jobs > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/ > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/ > but not > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/ > the following test failed the same way: > {code} > Failed > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image > Failing for the past 2 builds (Since Failed#1931 ) > Took 0.54 sec. > Stacktrace > java.lang.IllegalStateException: null > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975) > at > org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465) > at > org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212280#comment-14212280 ] Hudson commented on HDFS-7385: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212194#comment-14212194 ] Hudson commented on HDFS-7358: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212188#comment-14212188 ] Hudson commented on HDFS-7385: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
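The quoted description above shows where the stale state comes from: MkdirOp.getInstance(cache.get()) returns a per-thread cached op, and setAclEntries() is only called when the new inode actually has an AclFeature. Below is a minimal sketch of the kind of fix this implies; it assumes setAclEntries(null) clears the cached entries, and it is illustrative rather than the committed HDFS-7385 patch.
{code}
// Illustrative sketch, not the committed patch: overwrite the cached op's
// ACL entries on every call, so an op reused from the ThreadLocal cache
// cannot leak the previous operation's ACLs. setAclEntries(null) clearing
// the stale entries is an assumption made for this sketch.
public void logMkDir(String path, INode newNode) {
  PermissionStatus permissions = newNode.getPermissionStatus();
  MkdirOp op = MkdirOp.getInstance(cache.get())
      .setInodeId(newNode.getId())
      .setPath(path)
      .setTimestamp(newNode.getModificationTime())
      .setPermissionStatus(permissions);
  AclFeature f = newNode.getAclFeature();
  // Unconditional write: null when the inode has no ACLs, replacing
  // whatever a previous call on this thread left behind.
  op.setAclEntries(f != null ? AclStorage.readINodeLogicalAcl(newNode) : null);
  logEdit(op);
}
{code}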
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212190#comment-14212190 ] Hudson commented on HDFS-7395: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

> BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
> --------------------------------------------------------------------------
>
>                 Key: HDFS-7395
>                 URL: https://issues.apache.org/jira/browse/HDFS-7395
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Yongjun Zhang
>            Assignee: Haohui Mai
>             Fix For: 2.7.0
>
>         Attachments: HDFS-7395.000.patch
>
>
> In the latest jenkins jobs
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
> but not in
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
> the following test failed the same way:
> {code}
> Failed
> org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
> Failing for the past 2 builds (Since Failed#1931 )
> Took 0.54 sec.
> Stacktrace
> java.lang.IllegalStateException: null
> 	at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
> 	at org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
> 	at org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
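The stack trace shows a Preconditions.checkState() inside setGenerationStampV1Limit() firing when BlockIdManager#clear() goes through that setter during an upgrade. Below is a minimal sketch of the pattern and the obvious remedy; the field and sentinel names are assumptions modeled on the trace, and the actual HDFS-7395.000.patch may differ.
{code}
// Minimal sketch of the failure pattern, not the real BlockIdManager.
class BlockIdManagerSketch {
  // Assumed "unset" sentinel, standing in for the real constant.
  static final long GRANDFATHER_GENERATION_STAMP = 0;
  private long generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;

  void setGenerationStampV1Limit(long stamp) {
    // The setter guards against being called twice; this is the checkState
    // in the stack trace, which throws when clear() reuses the setter.
    if (generationStampV1Limit != GRANDFATHER_GENERATION_STAMP) {
      throw new IllegalStateException();
    }
    generationStampV1Limit = stamp;
  }

  void clear() {
    // Remedy sketched here: reset the field directly instead of going
    // through the guarded setter, so clear() is safe to call repeatedly.
    generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;
  }
}
{code}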
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212161#comment-14212161 ] Hudson commented on HDFS-7395: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

> BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
> --------------------------------------------------------------------------
>
>                 Key: HDFS-7395
>                 URL: https://issues.apache.org/jira/browse/HDFS-7395
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Yongjun Zhang
>            Assignee: Haohui Mai
>             Fix For: 2.7.0
>
>         Attachments: HDFS-7395.000.patch
>
>
> In the latest jenkins jobs
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
> but not in
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
> the following test failed the same way:
> {code}
> Failed
> org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
> Failing for the past 2 builds (Since Failed#1931 )
> Took 0.54 sec.
> Stacktrace
> java.lang.IllegalStateException: null
> 	at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
> 	at org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
> 	at org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212165#comment-14212165 ] Hudson commented on HDFS-7358: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java

> Clients may get stuck waiting when using ByteArrayManager
> ----------------------------------------------------------
>
>                 Key: HDFS-7358
>                 URL: https://issues.apache.org/jira/browse/HDFS-7358
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Tsz Wo Nicholas Sze
>             Fix For: 2.7.0
>
>         Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, h7358_20141108.patch
>
>
> [~stack] reported that clients might get stuck waiting when using ByteArrayManager; see [his comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
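ByteArrayManager recycles byte arrays for the client write path, and the attachment name h7358_20141104_wait_timeout.patch suggests a bounded wait was one remedy considered. The sketch below is hypothetical (the class and field names are inventions, not the real ByteArrayManager API): it shows how a caller parked on an unbounded wait() can hang forever if a notify is missed, and how waiting against a deadline bounds the stall.
{code}
// Hypothetical sketch, not the real ByteArrayManager: a bounded allocator
// where callers block until capacity frees up. With a plain wait(), a
// missed or lost notify leaves the client stuck forever; waiting with a
// deadline bounds the worst case.
class BoundedArrayAllocator {
  private final int limit;
  private int outstanding;

  BoundedArrayAllocator(int limit) { this.limit = limit; }

  synchronized byte[] allocate(int len, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (outstanding >= limit) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        break; // deadline passed: proceed anyway instead of hanging forever
      }
      wait(remaining); // bounded wait instead of an unbounded wait()
    }
    outstanding++;
    return new byte[len];
  }

  synchronized void release() {
    outstanding--;
    notifyAll(); // wake any callers waiting for capacity
  }
}
{code}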
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212159#comment-14212159 ] Hudson commented on HDFS-7385: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> ---------------------------------------------------------------------
>
>                 Key: HDFS-7385
>                 URL: https://issues.apache.org/jira/browse/HDFS-7385
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0, 2.5.0
>            Reporter: jiangyu
>            Assignee: jiangyu
>            Priority: Blocker
>             Fix For: 2.6.0
>
>         Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
> We migrated our NameNodes from low-configuration to high-configuration machines last week. First, we imported the current directory, including the fsimage and editlog files, from the original Active NameNode to the new Active NameNode and started the new NameNode; then we changed the configuration of all datanodes and restarted them, so they sent block reports to the new NameNodes at once and heartbeats after that.
> Everything seemed perfect, but after we restarted the ResourceManager, most users complained that their jobs couldn't be executed because of permission problems.
> We applied ACLs in our clusters, and after the migration we found that most of the directories and files which had no ACLs set before now carried ACL entries. That is why users could not execute their jobs, so we had to change most file permissions to a+r and directory permissions to a+rx to make sure the jobs could run.
> After investigating this problem for some days, I found a bug in FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the proper value in the logMkDir and logOpenFile functions. Here is the code of logMkDir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
> For example, if we mkdir with ACLs through one handler (a thread, in fact), we set the AclEntries on the op taken from the cache. After that, if we mkdir without any ACL settings through the same handler, the AclEntries from the cache are still those of the last call that set ACLs, and because the newNode has no AclFeature, we never get a chance to clear them. The editlog is therefore wrong and records the wrong ACLs. After the Standby NameNode loads the editlogs from the journalnodes, applies them to its in-memory state, saves the namespace, and transfers the wrong fsimage to the Active NameNode, all the fsimages are wrong. The only remedy is to save the namespace from the ANN, which yields the correct fsimage.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7392) org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
[ https://issues.apache.org/jira/browse/HDFS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frantisek Vacek updated HDFS-7392: -- Description:
In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
2) There are at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address: 127.0.1.1#53
share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 192.168.1.223
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 192.168.1.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.

was:
In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
2) There are at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address: 127.0.1.1#53
share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.

> org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
> ----------------------------------------------------------------------
>
>                 Key: HDFS-7392
>                 URL: https://issues.apache.org/jira/browse/HDFS-7392
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Frantisek Vacek
>            Priority: Critical
>         Attachments: 1.png, 2.png
>
>
> In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
> The specific circumstances are:
> 1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
> 2) There are at least 2 IP addresses for such a URI. See the output below:
> {quote}
> [~/proj/quickbox]$ nslookup share.example.com
> Server: 127.0.1.1
> Address: 127.0.1.1#53
> share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
> Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
> Address: 192.168.1.223
> Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
> Address: 192.168.1.65
> {quote}
> In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
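To make the failure mode concrete: with two load-balancer addresses behind one hostname, every re-resolution appears to be an address change, updateAddress() returns true, and the timeout-failure counter is reset before it can ever reach maxRetriesOnSocketTimeouts. Below is a simplified sketch of that retry accounting; the method and field names follow the description above, but this is not the actual org.apache.hadoop.ipc.Client code.
{code}
// Simplified sketch of the retry accounting described above, not the real
// Hadoop IPC client. resolve() stands in for DNS re-resolution that
// alternates between the two load-balancer addresses.
class RetryLoopSketch {
  private int timeoutFailures;
  private String currentAddress = "192.168.1.223";

  private String resolve() { // round-robin DNS: flips addresses on each call
    return currentAddress.equals("192.168.1.223") ? "192.168.1.65" : "192.168.1.223";
  }

  private boolean updateAddress() {
    String fresh = resolve();
    boolean changed = !fresh.equals(currentAddress); // always true here
    currentAddress = fresh;
    return changed;
  }

  void onSocketTimeout(int maxRetriesOnSocketTimeouts) throws java.io.IOException {
    if (updateAddress()) {
      timeoutFailures = 0; // this reset defeats the cap below: no exit, ever
    }
    if (++timeoutFailures > maxRetriesOnSocketTimeouts) {
      throw new java.io.IOException(
          "giving up after " + maxRetriesOnSocketTimeouts + " timeouts");
    }
  }
}
{code}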
[jira] [Updated] (HDFS-7392) org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
[ https://issues.apache.org/jira/browse/HDFS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frantisek Vacek updated HDFS-7392: -- Description:
In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
2) There are at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address: 127.0.1.1#53
share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.

was:
In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
The specific circumstances are:
1) The HDFS URI (hdfs://share.merck.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
2) There are at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.merck.com
Server: 127.0.1.1
Address: 127.0.1.1#53
share.merck.com canonical name = internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com.
Name: internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name: internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.

> org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
> ----------------------------------------------------------------------
>
>                 Key: HDFS-7392
>                 URL: https://issues.apache.org/jira/browse/HDFS-7392
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Frantisek Vacek
>            Priority: Critical
>         Attachments: 1.png, 2.png
>
>
> In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
> The specific circumstances are:
> 1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
> 2) There are at least 2 IP addresses for such a URI. See the output below:
> {quote}
> [~/proj/quickbox]$ nslookup share.example.com
> Server: 127.0.1.1
> Address: 127.0.1.1#53
> share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
> Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
> Address: 54.40.29.223
> Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
> Address: 54.40.29.65
> {quote}
> In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)