[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213449#comment-14213449 ] Yongjun Zhang commented on HDFS-6133: - Hi [~kihwal] and [~ctrezzo], Thanks a lot for sharing the info and great insights here. About the throwing-away-over-replicated case, I think the "include pinning info in block reports and remember it in block manager" approach Kihwal and Daryn suggested seems reasonable. And I agree the corrupted-pinned-block case needs some more thought and careful logic. About the blockpool-aware balancing policy, since the balancer works on each blockpool independently, it seems a natural approach. However, I think this would be a complementary solution to the approach here; it deserves its own jira and can be worked on in parallel. Currently one NN is associated with only one blockpool, and federated clusters are not yet widely used as far as I know. Implementation-wise, there are two options I can think of if we want to use a blockpool-aware balancing policy to solve the problem here: # Users need to choose to use a federated cluster, and put all files that need to be pinned into a dedicated blockpool. # Make a single NN handle multiple block-pools. The solution would work nicely for already-federated clusters. For others, it won't be easy. Right now we don't have the capability to handle pinning at block/file granularity (the balancer does have the option to exclude nodes from being touched). It seems even providing a solution without handling the throwing-away-over-replicated case would help alleviate the pain. Let's see if we can include a mechanism in the patch of this jira, or at least think through how to handle the two cases (throwing-away-over-replicated, and corrupted-pinned-block). Thanks. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133.patch > > > Currently, running Balancer will destroy the Regionserver's data locality. > If getBlocks could exclude blocks belonging to files which have a specific path > prefix, like "/hbase", then we can run Balancer without destroying the > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-7279: - Attachment: HDFS-7279.013.patch > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch, HDFS-7279.013.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213445#comment-14213445 ] Hadoop QA commented on HDFS-7279: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681717/HDFS-7279.012.patch against trunk revision 9b86066. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8750//console This message is automatically generated. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213443#comment-14213443 ] Hadoop QA commented on HDFS-7279: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681717/HDFS-7279.012.patch against trunk revision 9b86066. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8749//console This message is automatically generated. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-7279: - Attachment: HDFS-7279.012.patch > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213442#comment-14213442 ] Haohui Mai commented on HDFS-7279: -- The v12 patch removes the excessive throws clauses. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch, HDFS-7279.012.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213439#comment-14213439 ] Chris Nauroth commented on HDFS-7384: - I haven't reviewed the whole patch yet, but I wanted to state again quickly that I'd prefer to keep effective permissions out of {{AclEntry}}. One problem is that the {{AclEntry}} class is also used in the setter APIs, like {{setAcl}}. In that context, the effective permissions would be ignored. This could cause confusion for users of those APIs. Another problem is that we use the same class for both the public API on the client side and the internal in-memory representation in the NameNode. Therefore, adding a new member to {{AclEntry}} would have the side effect of increasing the memory footprint in the NameNode. Even if we don't populate the field when used within the NameNode, there is still the overhead of the additional pointer multiplied by every ACL entry. We could potentially change the NameNode to use a different class for its internal implementation, but then we'd have a dual-maintenance problem and a need for extra code to translate between the two representations. If {{AclStatus}} could have a new method that does the calculation for an entry's effective permissions on demand, instead of requiring a new member in {{AclEntry}}, then we wouldn't impact the setter APIs or increase the memory footprint in the NameNode. > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Attachments: HDFS-7384-001.patch > > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
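As an aside on the on-demand approach Chris sketches above, here is a minimal illustration of the POSIX effective-permission calculation. This is a hypothetical helper — the class and method names are made up for illustration, not the API that was ultimately committed — and it assumes the caller has already located the ACL's MASK entry.

{code:java}
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

/**
 * Hypothetical helper (not the committed API) showing the on-demand
 * effective-permission calculation. Per POSIX ACL semantics, named-user,
 * group, and named-group entries are constrained by the mask entry;
 * owner, other, and the mask itself are not.
 */
public final class EffectiveAclExample {
  public static FsAction getEffectivePermission(AclEntry entry, FsAction mask) {
    // Named users (USER type with a name) and all GROUP entries are masked.
    boolean maskApplies =
        (entry.getType() == AclEntryType.USER && entry.getName() != null)
        || entry.getType() == AclEntryType.GROUP;
    // Effective permission is the bitwise AND of the entry and the mask.
    return maskApplies ? entry.getPermission().and(mask) : entry.getPermission();
  }
}
{code}

Computing this lazily keeps {{AclEntry}} itself unchanged, so the setter APIs and the NameNode's in-memory representation would be unaffected, which is the point of the suggestion above.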
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213438#comment-14213438 ] Hadoop QA commented on HDFS-7384: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681715/HDFS-7384-001.patch against trunk revision 9b86066. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8748//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8748//console This message is automatically generated. > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Attachments: HDFS-7384-001.patch > > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213436#comment-14213436 ] Hadoop QA commented on HDFS-6982: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681698/HDFS-6982.v9.patch against trunk revision 9b86066. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby org.apache.hadoop.hdfs.TestDistributedFileSystem {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8747//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8747//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8747//console This message is automatically generated. > nntop: top-like tool for name node users > - > > Key: HDFS-6982 > URL: https://issues.apache.org/jira/browse/HDFS-6982 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Maysam Yabandeh >Assignee: Maysam Yabandeh > Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, > HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, > HDFS-6982.v7.patch, HDFS-6982.v8.patch, HDFS-6982.v9.patch, > nntop-design-v1.pdf > > > In this jira we motivate the need for nntop, a tool that, similarly to what > top does in Linux, gives the list of top users of the HDFS name node and > gives insight into which users are sending the majority of each traffic type to > the name node. This information turns out to be most critical when the > name node is under pressure and the HDFS admin needs to know which user is > hammering the name node and with what kind of requests. Here we present the > design of nntop, which has been in production at Twitter for the past 10 > months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K > nodes), a low memory footprint (less than a few MB), and to be quite efficient on > the write path (only two hash lookups for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-7384: Status: Patch Available (was: Open) > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Attachments: HDFS-7384-001.patch > > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinayakumar B updated HDFS-7384: Attachment: HDFS-7384-001.patch Attached patch. Please review and give your feedback. > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Attachments: HDFS-7384-001.patch > > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213392#comment-14213392 ] Tsz Wo Nicholas Sze commented on HDFS-7279: --- > the throw clause comes from the super class thus it cannot be removed. It actually can be removed, since removing it is narrowing the declaration. +1, the new patch looks good other than that. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer > and connection management, the DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in the DN using netty, > which can be more efficient and allow finer-grained control over webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213355#comment-14213355 ] Ming Ma commented on HDFS-7374: --- Yeah, that seems reasonable; how likely you are to get "whole cluster fully replicated" might depend on how you count it. If it is based on a full scan of the blockmap, the chance of hitting the "all blocks fully replicated" condition might be low, given it also includes newly added blocks for which not all DNs have sent IBRs; in addition, it has to take the FSNameSystem lock for a longer period of time. If it is based on {{BlockManager}}'s {{pendingReplicationBlocksCount}} + {{underReplicatedBlocksCount}}, then the chance might be higher, and it is faster. On the "track the blocks of those DECOMM_IN_PROGRESS DNs" note, it might be useful to add that feature later. It also helps another scenario, something [~kihwal] and [~daryn] mentioned before: {{isReplicationInProgress}} currently rescans all blocks of a given node each time the method is called, which isn't efficient as more blocks become fully replicated. We can have a separate list of DECOMM_IN_PROGRESS blocks which points to the DECOMM_IN_PROGRESS DNs. {{DecommissionManager}} will scan this list regularly. Each scan will shrink the list as blocks become fully replicated and calculate the latest list of DECOMM_IN_PROGRESS DNs. In normal decomm operations, the # of DECOMM_IN_PROGRESS DNs should be much smaller than the total # of DNs in a large cluster, so the extra memory overhead might be acceptable. > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, in the hope that they can come back and finish > the decommission work. If an upper-layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
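For concreteness, here is a minimal sketch of the counter-based criterion discussed above, assuming {{BlockManager}} exposes the two counts Ming names; the surrounding wiring in {{DecommissionManager}} is hypothetical.

{code:java}
import org.apache.hadoop.hdfs.server.blockmanagement.BlockManager;

// Sketch of the cheap criterion: treat a dead, decommissioning node as
// finished once nothing in the cluster is pending or under-replicated,
// instead of scanning the blockmap under the FSNameSystem lock.
class DecommissionCheckSketch {
  static boolean isClusterFullyReplicated(BlockManager blockManager) {
    return blockManager.getPendingReplicationBlocksCount() == 0
        && blockManager.getUnderReplicatedBlocksCount() == 0;
  }
}
{code}

The trade-off matches the comment above: the counter reads are O(1) and lock-friendly, at the cost of being a cluster-wide condition rather than a per-node one.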
[jira] [Updated] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maysam Yabandeh updated HDFS-6982: -- Attachment: HDFS-6982.v9.patch Attaching the new patch. [~andrew.wang], I ended up moving TopMetrics initialization to FsNamesystem, where I register TopAuditLogger with the auditLoggers. > nntop: top-like tool for name node users > - > > Key: HDFS-6982 > URL: https://issues.apache.org/jira/browse/HDFS-6982 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Maysam Yabandeh >Assignee: Maysam Yabandeh > Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, > HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, > HDFS-6982.v7.patch, HDFS-6982.v8.patch, HDFS-6982.v9.patch, > nntop-design-v1.pdf > > > In this jira we motivate the need for nntop, a tool that, similarly to what > top does in Linux, gives the list of top users of the HDFS name node and > gives insight into which users are sending the majority of each traffic type to > the name node. This information turns out to be most critical when the > name node is under pressure and the HDFS admin needs to know which user is > hammering the name node and with what kind of requests. Here we present the > design of nntop, which has been in production at Twitter for the past 10 > months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K > nodes), a low memory footprint (less than a few MB), and to be quite efficient on > the write path (only two hash lookups for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213326#comment-14213326 ] Vinayakumar B commented on HDFS-7384: - Thanks Chris. For the effective action, maybe we can have a separate method, without affecting the current fields. It will just be an alternative way for the client to get the effective action, instead of calculating it on its own. I will upload a patch soon. > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > > The *getfacl* command will print all the entries, including basic and extended > entries, mask entries and effective permissions. > But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user, and this will include neither the mask entry nor the effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213324#comment-14213324 ] Hadoop QA commented on HDFS-4882: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681664/HDFS-4882.4.patch against trunk revision 4fb96db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//console This message is automatically generated. > Namenode LeaseManager checkLeases() runs into infinite loop > --- > > Key: HDFS-4882 > URL: https://issues.apache.org/jira/browse/HDFS-4882 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client, namenode >Affects Versions: 2.0.0-alpha, 2.5.1 >Reporter: Zesheng Wu >Assignee: Ravi Prakash >Priority: Critical > Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, > HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch > > > Scenario: > 1. cluster with 4 DNs > 2. the size of the file to be written is a little more than one block > 3. write the first block to 3 DNs, DN1->DN2->DN3 > 4. all the data packets of the first block are successfully acked and the client > sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out > 5. DN2 and DN3 are down > 6. client recovers the pipeline, but no new DN is added to the pipeline > because the current pipeline stage is PIPELINE_CLOSE > 7. client continues writing the last block, and tries to close the file after > writing all the data > 8. NN finds that the penultimate block doesn't have enough replicas (our > dfs.namenode.replication.min=2), the client's close runs into an indefinite > loop (HDFS-2936), and at the same time the NN sets the last block's state to > COMPLETE > 9. shutdown the client > 10. the file's lease exceeds the hard limit > 11. LeaseManager realizes that and begins to do lease recovery by calling > fsnamesystem.internalReleaseLease() > 12. but the last block's state is COMPLETE, and this triggers the lease manager's > infinite loop and prints massive logs like this: > {noformat} > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: > DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard > limit > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. > Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src= > /user/h_wuzesheng/test.dat > 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block > blk_-7028017402720175688_1202597, > lastBLockState=COMPLETE > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery > for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONM > APREDUCE_-1252656407_1, pendingcreates: 1] > {noformat} > (the 3rd log line is a debug log added by us) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7270) Implementing congestion control in writing pipeline
[ https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213287#comment-14213287 ] Hadoop QA commented on HDFS-7270: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681462/HDFS-7270.000.patch against trunk revision 49c3889. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.datanode.TestDataNodeMetrics org.apache.hadoop.hdfs.TestCrcCorruption The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestParallelShortCircuitReadUnCached {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8743//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8743//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8743//console This message is automatically generated. > Implementing congestion control in writing pipeline > --- > > Key: HDFS-7270 > URL: https://issues.apache.org/jira/browse/HDFS-7270 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7270.000.patch > > > When a client writes to HDFS faster than the disk bandwidth of the DNs, it > saturates the disk bandwidth and leaves the DNs unresponsive. The client only > backs off by aborting / recovering the pipeline, which leads to failed writes > and unnecessary pipeline recovery. > This jira proposes to add explicit congestion control mechanisms in the > writing pipeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213286#comment-14213286 ] Hadoop QA commented on HDFS-7394: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681654/HDFS-7394.patch against trunk revision 4fb96db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestParallelShortCircuitReadUnCached {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8744//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8744//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8744//console This message is automatically generated. > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long-running clients, getting an {{InvalidToken}} exception is expected, > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It would be > better if they were also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
[ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213281#comment-14213281 ] Ming Ma commented on HDFS-7400: --- Thanks, [~andrew.wang] and [~aw], for the comments. Here is the info I have so far; I will provide more details after I gather data from our admins and HW engineers. 1. We couldn't access the machine except to reboot it via IPMI, so there was no chance to run "df". 2. We didn't check how far ssh got. But given that no DN could connect to this NN at that point, it looks like a socket-level issue. > More reliable namenode health check to detect OS/HW issues > -- > > Key: HDFS-7400 > URL: https://issues.apache.org/jira/browse/HDFS-7400 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma > > We had this scenario on an active NN machine. > * The disk array controller firmware has a bug, so the disks stop working. > * ZKFC and NN still considered the node healthy; communications between ZKFC > and ZK, as well as ZKFC and NN, are good. > * The machine can be pinged. > * The machine can't be sshed. > So all clients and DNs can't use the NN, but ZKFC and NN still consider the > node healthy. > The question is how we can have ZKFC and NN detect such OS/HW-specific issues > quickly. Some ideas we discussed briefly: > * Have other machines help make the decision whether the NN is actually > healthy. Then you have to figure out how to make the decision accurate in the > case of a network issue, etc. > * Run an OS/HW health check script external to ZKFC/NN on the same machine. If > it detects disk or other issues, it can reboot the machine, for example. > * Run an OS/HW health check script inside ZKFC/NN. For example, NN's > HAServiceProtocol#monitorHealth can be modified to call such a health check > script. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213254#comment-14213254 ] Hadoop QA commented on HDFS-7146: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681667/HDFS-7146.005.patch against trunk revision 4fb96db. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-nfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8746//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8746//console This message is automatically generated. > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with the assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > Administrators are advised to prevent, and in most secure setups do prevent, this > behaviour of the command, to avoid excessive load on the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests for ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd <uid>', if the UID does not match a cached value. This reduces > load on the LDAP-backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
[ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213246#comment-14213246 ] Allen Wittenauer commented on HDFS-7400: bq. Disk array controller firmware has a bug. So disks stop working. ... bq. The machine can be pinged. bq. The machine can't be sshed. Was ssh actually opening the socket and just not completing the login process? On the surface, this sounds like typical Linux IO weirdisms, but I want to make sure. bq. Out of curiosity, did your failure condition result in a situation where df worked, but the disk was otherwise non-functional? I keep thinking about the situation where there are two controllers but only one went belly up. Doing things like df or even a write+read combo might not be sufficient unless we do it across all devices. I suspect: bq. Have other machines help to make the decision whether the NN is actually healthy. ... might be the only truly viable solution under various failure modes. > More reliable namenode health check to detect OS/HW issues > -- > > Key: HDFS-7400 > URL: https://issues.apache.org/jira/browse/HDFS-7400 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma > > We had this scenario on an active NN machine. > * The disk array controller firmware has a bug, so the disks stop working. > * ZKFC and NN still considered the node healthy; communications between ZKFC > and ZK, as well as ZKFC and NN, are good. > * The machine can be pinged. > * The machine can't be sshed. > So all clients and DNs can't use the NN, but ZKFC and NN still consider the > node healthy. > The question is how we can have ZKFC and NN detect such OS/HW-specific issues > quickly. Some ideas we discussed briefly: > * Have other machines help make the decision whether the NN is actually > healthy. Then you have to figure out how to make the decision accurate in the > case of a network issue, etc. > * Run an OS/HW health check script external to ZKFC/NN on the same machine. If > it detects disk or other issues, it can reboot the machine, for example. > * Run an OS/HW health check script inside ZKFC/NN. For example, NN's > HAServiceProtocol#monitorHealth can be modified to call such a health check > script. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
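To make the write+read-across-all-devices idea above concrete, here is a rough sketch of such a probe. The class name and wiring are hypothetical, and a production version would need per-device timeouts, since a wedged controller typically blocks I/O indefinitely rather than failing fast — which is exactly the "IO weirdism" noted above.

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical probe: write and read back a marker on every device. */
class DiskProbeSketch {
  static boolean allDisksHealthy(List<File> dirs) {
    byte[] payload = "nn-health-probe".getBytes(StandardCharsets.UTF_8);
    for (File dir : dirs) {
      try {
        Path probe = Files.createTempFile(dir.toPath(), "probe", null);
        Files.write(probe, payload);
        byte[] readBack = Files.readAllBytes(probe);
        Files.delete(probe);
        if (!Arrays.equals(payload, readBack)) {
          return false; // read back different bytes: corruption on this device
        }
      } catch (IOException e) {
        return false; // write, read, or delete failed on this device
      }
    }
    return true;
  }
}
{code}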
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213247#comment-14213247 ] Andrew Wang commented on HDFS-7374: --- Yea, precisely :) I don't know how realistic this is in an active cluster with lots of failing disks, but it'd fix it for some users at least. > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, in the hope that they can come back and finish > the decommission work. If an upper-layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213244#comment-14213244 ] Ming Ma commented on HDFS-7374: --- So maybe we can use "if all blocks in the whole cluster are fully replicated" instead of "if all blocks of that dead node are fully replicated" as the criterion to move that dead node to the decommed state? > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, in the hope that they can come back and finish > the decommission work. If an upper-layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method
[ https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213224#comment-14213224 ] Yongjun Zhang commented on HDFS-7386: - Thank you so much Chris! > Replace check "port number < 1024" with shared isPrivilegedPort method > --- > > Key: HDFS-7386 > URL: https://issues.apache.org/jira/browse/HDFS-7386 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, security >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang >Priority: Trivial > Fix For: 2.7.0 > > Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch > > > Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace > check "port number < 1024" with shared isPrivilegedPort method. > Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
[ https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213215#comment-14213215 ] Andrew Wang commented on HDFS-7400: --- So in {{monitorHealth}} we do a basic check just to see if the NN has free disk space. I'd be okay extending this to other checks related to disk health. Out of curiosity, did your failure condition result in a situation where {{df}} worked, but the disk was otherwise non-functional? I guess with no SSH it's a little hard to check, but I wonder what we could add to {{monitorHealth}} to detect this failure condition. > More reliable namenode health check to detect OS/HW issues > -- > > Key: HDFS-7400 > URL: https://issues.apache.org/jira/browse/HDFS-7400 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ming Ma > > We had this scenario on an active NN machine. > * The disk array controller firmware has a bug, so the disks stop working. > * ZKFC and NN still considered the node healthy; communications between ZKFC > and ZK, as well as ZKFC and NN, are good. > * The machine can be pinged. > * The machine can't be sshed. > So all clients and DNs can't use the NN, but ZKFC and NN still consider the > node healthy. > The question is how we can have ZKFC and NN detect such OS/HW-specific issues > quickly. Some ideas we discussed briefly: > * Have other machines help make the decision whether the NN is actually > healthy. Then you have to figure out how to make the decision accurate in the > case of a network issue, etc. > * Run an OS/HW health check script external to ZKFC/NN on the same machine. If > it detects disk or other issues, it can reboot the machine, for example. > * Run an OS/HW health check script inside ZKFC/NN. For example, NN's > HAServiceProtocol#monitorHealth can be modified to call such a health check > script. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7401) Add block info to DFSInputStream's WARN message when it adds node to deadNodes
Ming Ma created HDFS-7401: - Summary: Add block info to DFSInputStream's WARN message when it adds node to deadNodes Key: HDFS-7401 URL: https://issues.apache.org/jira/browse/HDFS-7401 Project: Hadoop HDFS Issue Type: Bug Reporter: Ming Ma Priority: Minor Block info is missing in the message below: {noformat} 2014-11-14 03:59:00,386 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /xx.xx.xx.xxx:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK {noformat} The relevant code: {noformat} DFSInputStream.java DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block" + ", add to deadNodes and continue. " + ex, ex); {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
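For illustration, the one-line fix the report implies might look like the following sketch; {{targetBlock}} stands in for whatever {{LocatedBlock}} is in scope at this call site in {{DFSInputStream}}, and is a hypothetical local name here.

{code:java}
// Hypothetical sketch: include the block in the warning so the log
// identifies which replica was being read, not just the DN address.
DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block "
    + targetBlock.getBlock() + ", add to deadNodes and continue. " + ex, ex);
{code}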
[jira] [Commented] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method
[ https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213193#comment-14213193 ] Hudson commented on HDFS-7386: -- FAILURE: Integrated in Hadoop-trunk-Commit #6552 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6552/]) HDFS-7386. Replace check "port number < 1024" with shared isPrivilegedPort method. Contributed by Yongjun Zhang. (cnauroth: rev 1925e2a4ae78ef4178393848b4d1d71b0f4a4709) * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/SecurityUtil.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/SecureDataNodeStarter.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/sasl/SaslDataTransferServer.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/sasl/SaslDataTransferClient.java > Replace check "port number < 1024" with shared isPrivilegedPort method > --- > > Key: HDFS-7386 > URL: https://issues.apache.org/jira/browse/HDFS-7386 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, security >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang >Priority: Trivial > Fix For: 2.7.0 > > Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch > > > Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace > check "port number < 1024" with shared isPrivilegedPort method. > Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-3749) Disable check for jsvc on windows
[ https://issues.apache.org/jira/browse/HDFS-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-3749. - Resolution: Won't Fix This is no longer required, because HDFS-2856 has been implemented, providing SASL as a means to authenticate the DataNode instead of jsvc/privileged ports. I'm resolving this as Won't Fix. > Disable check for jsvc on windows > - > > Key: HDFS-3749 > URL: https://issues.apache.org/jira/browse/HDFS-3749 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: hdfs-3749-trunk.patch, hdfs-3749.patch, hdfs-3749.patch > > > Jsvc doesn't make sense on windows and thus we should not require the > datanode to start up under it on that platform. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method
[ https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-7386: Resolution: Fixed Fix Version/s: 2.7.0 Status: Resolved (was: Patch Available) I committed this to trunk and branch-2. Yongjun, thank you for improving this part of the code. > Replace check "port number < 1024" with shared isPrivilegedPort method > --- > > Key: HDFS-7386 > URL: https://issues.apache.org/jira/browse/HDFS-7386 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, security >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang >Priority: Trivial > Fix For: 2.7.0 > > Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch > > > Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace > check "port number < 1024" with shared isPrivilegedPort method. > Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213188#comment-14213188 ] Brandon Li commented on HDFS-7146: -- +1. Pending Jenkins. > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with the assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > Administrators are advised to prevent, and in most secure setups do prevent, this > behaviour of the command, to avoid excessive load on the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests for ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd <uid>', if the UID does not match a cached value. This reduces > load on the LDAP-backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7386) Replace check "port number < 1024" with shared isPrivilegedPort method
[ https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-7386: Component/s: security datanode Target Version/s: 2.7.0 Hadoop Flags: Reviewed +1 for the patch. I agree that the test failures are unrelated. I saw the same thing that you saw when I reran locally. I'll commit this. > Replace check "port number < 1024" with shared isPrivilegedPort method > --- > > Key: HDFS-7386 > URL: https://issues.apache.org/jira/browse/HDFS-7386 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, security >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang >Priority: Trivial > Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch > > > Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace > check "port number < 1024" with shared isPrivilegedPort method. > Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213177#comment-14213177 ] Andrew Wang commented on HDFS-6982: --- Hi Maysam, took a look at the latest patch, I think we're almost there :) Just minor comments. Hopefully Jenkins behaves with the next rev too; I agree it looks unrelated or garbled. DFSConfigKeys / TopConf: * Need to rename the DFSConfigKeys variable names to reflect the new config names * Seems like I gave bad advice about getInts, since it doesn't have a way of taking a default, so right now if we try to turn it off, it'll set the default. Reverting to what you had is cool, though adding a getInts that takes a default would be appreciated. RWManager: * Could we add explanatory text to the Precondition checks? AuditLogger: * Rather than injecting it into the conf (kinda brittle), what I had in mind was in FSNamesystem#initAuditLoggers, we could tack it on the end after adding the ones from the conf. No need for reflection :) * Related to this, it'd be good to have a unit test that disables nntop and then checks that the audit logger isn't added and that metrics aren't published. Feel free to add a @VisibleForTesting getter if it helps. Nits: * Unused import in NameNode This is just minor stuff though, I'm +1 pending the above review comments. > nntop: top-like tool for name node users > - > > Key: HDFS-6982 > URL: https://issues.apache.org/jira/browse/HDFS-6982 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Maysam Yabandeh >Assignee: Maysam Yabandeh > Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, > HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, > HDFS-6982.v7.patch, HDFS-6982.v8.patch, nntop-design-v1.pdf > > > In this jira we motivate the need for nntop, a tool that, similarly to what > top does in Linux, gives the list of top users of the HDFS name node and > gives insight into which users are sending the majority of each traffic type to > the name node. This information turns out to be most critical when the > name node is under pressure and the HDFS admin needs to know which user is > hammering the name node and with what kind of requests. Here we present the > design of nntop, which has been in production at Twitter for the past 10 > months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K > nodes), a low memory footprint (less than a few MB), and to be quite efficient on > the write path (only two hash lookups for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
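On the getInts-with-a-default point above, a rough sketch of what such an overload could look like, written here as a static helper for illustration (the real change would presumably live in {{Configuration}} itself):

{code:java}
import org.apache.hadoop.conf.Configuration;

class ConfHelperSketch {
  /** Returns the ints stored at 'name', or the supplied defaults if the key is unset. */
  static int[] getInts(Configuration conf, String name, int... defaults) {
    if (conf.get(name) == null) {
      return defaults; // key absent: fall back to the caller-supplied defaults
    }
    String[] parts = conf.getTrimmedStrings(name);
    int[] values = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
      values[i] = Integer.parseInt(parts[i]);
    }
    return values;
  }
}
{code}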
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213165#comment-14213165 ] Yongjun Zhang commented on HDFS-7146: - Hi [~brandonli], Nice idea to add some javadocs describing the solution; I just uploaded 005 to include that. Thanks for your flexibility, I will create a separate jira for the platform coverage issue, 'cause I think that may involve looking into multiple places for platform differences. Thanks for taking a further look. > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7400) More reliable namenode health check to detect OS/HW issues
Ming Ma created HDFS-7400: - Summary: More reliable namenode health check to detect OS/HW issues Key: HDFS-7400 URL: https://issues.apache.org/jira/browse/HDFS-7400 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ming Ma We had this scenario on an active NN machine. * Disk array controller firmware has a bug. So disks stop working. * ZKFC and NN still considered the node healthy; communications between ZKFC and ZK as well as ZKFC and NN are good. * The machine can be pinged. * The machine can't be sshed. So all clients and DNs can't use the NN, but ZKFC and NN still consider the node healthy. The question is how we can have ZKFC and NN detect such OS/HW-specific issues quickly. Some ideas we discussed briefly: * Have other machines help to make the decision on whether the NN is actually healthy. Then you have to figure out how to make the decision accurate in the case of network issues, etc. * Run an OS/HW health check script external to ZKFC/NN on the same machine. If it detects disk or other issues, it can reboot the machine, for example. * Run an OS/HW health check script inside ZKFC/NN. For example, the NN's HAServiceProtocol#monitorHealth can be modified to call such a health check script. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
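The third idea might look roughly like the following sketch (the script path is a placeholder and the method body is an assumption, not a proposed patch):
{code}
// Hypothetical sketch of idea 3: the NN health check shells out to an
// admin-supplied OS/HW check script and fails when it exits non-zero.
@Override
public synchronized void monitorHealth() throws HealthCheckFailedException {
  // ... existing NN resource checks ...
  try {
    Process p = new ProcessBuilder("/etc/hadoop/nn-os-health.sh").start();
    if (p.waitFor() != 0) { // timeout handling omitted for brevity
      throw new HealthCheckFailedException("OS/HW health check script failed");
    }
  } catch (IOException | InterruptedException e) {
    throw new HealthCheckFailedException("Could not run OS/HW health check: " + e);
  }
}
{code}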
[jira] [Updated] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang updated HDFS-7146: Attachment: HDFS-7146.005.patch > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated HDFS-4882: --- Attachment: HDFS-4882.4.patch Here's a patch which goes back to using sortedLeases.first() . > Namenode LeaseManager checkLeases() runs into infinite loop > --- > > Key: HDFS-4882 > URL: https://issues.apache.org/jira/browse/HDFS-4882 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client, namenode >Affects Versions: 2.0.0-alpha, 2.5.1 >Reporter: Zesheng Wu >Assignee: Ravi Prakash >Priority: Critical > Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, > HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch > > > Scenario: > 1. cluster with 4 DNs > 2. the size of the file to be written is a little more than one block > 3. write the first block to 3 DNs, DN1->DN2->DN3 > 4. all the data packets of first block is successfully acked and the client > sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out > 5. DN2 and DN3 are down > 6. client recovers the pipeline, but no new DN is added to the pipeline > because of the current pipeline stage is PIPELINE_CLOSE > 7. client continuously writes the last block, and try to close the file after > written all the data > 8. NN finds that the penultimate block doesn't has enough replica(our > dfs.namenode.replication.min=2), and the client's close runs into indefinite > loop(HDFS-2936), and at the same time, NN makes the last block's state to > COMPLETE > 9. shutdown the client > 10. the file's lease exceeds hard limit > 11. LeaseManager realizes that and begin to do lease recovery by call > fsnamesystem.internalReleaseLease() > 12. but the last block's state is COMPLETE, and this triggers lease manager's > infinite loop and prints massive logs like this: > {noformat} > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: > DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard > limit > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. > Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src= > /user/h_wuzesheng/test.dat > 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block > blk_-7028017402720175688_1202597, > lastBLockState=COMPLETE > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery > for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONM > APREDUCE_-1252656407_1, pendingcreates: 1] > {noformat} > (the 3rd line log is a debug log added by us) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
[ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213109#comment-14213109 ] Ravi Prakash commented on HDFS-4882: These test errors are valid. They are happening because pollFirst() retrieves *and removes* the first element. Sorry for the oversight. Will upload a new patch soon. > Namenode LeaseManager checkLeases() runs into infinite loop > --- > > Key: HDFS-4882 > URL: https://issues.apache.org/jira/browse/HDFS-4882 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client, namenode >Affects Versions: 2.0.0-alpha, 2.5.1 >Reporter: Zesheng Wu >Assignee: Ravi Prakash >Priority: Critical > Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, > HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.patch > > > Scenario: > 1. cluster with 4 DNs > 2. the size of the file to be written is a little more than one block > 3. write the first block to 3 DNs, DN1->DN2->DN3 > 4. all the data packets of first block is successfully acked and the client > sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out > 5. DN2 and DN3 are down > 6. client recovers the pipeline, but no new DN is added to the pipeline > because of the current pipeline stage is PIPELINE_CLOSE > 7. client continuously writes the last block, and try to close the file after > written all the data > 8. NN finds that the penultimate block doesn't has enough replica(our > dfs.namenode.replication.min=2), and the client's close runs into indefinite > loop(HDFS-2936), and at the same time, NN makes the last block's state to > COMPLETE > 9. shutdown the client > 10. the file's lease exceeds hard limit > 11. LeaseManager realizes that and begin to do lease recovery by call > fsnamesystem.internalReleaseLease() > 12. but the last block's state is COMPLETE, and this triggers lease manager's > infinite loop and prints massive logs like this: > {noformat} > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: > DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard > limit > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. > Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src= > /user/h_wuzesheng/test.dat > 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block > blk_-7028017402720175688_1202597, > lastBLockState=COMPLETE > 2013-06-05,17:42:25,695 INFO > org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery > for file /user/h_wuzesheng/test.dat lease [Lease. Holder: DFSClient_NONM > APREDUCE_-1252656407_1, pendingcreates: 1] > {noformat} > (the 3rd line log is a debug log added by us) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
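The first()/pollFirst() difference is easy to demonstrate on any NavigableSet:
{code}
import java.util.TreeSet;

public class PollFirstVsFirst {
  public static void main(String[] args) {
    TreeSet<String> sortedLeases = new TreeSet<String>();
    sortedLeases.add("lease-a");
    sortedLeases.add("lease-b");

    // first() only peeks at the smallest element...
    System.out.println(sortedLeases.first());     // lease-a
    System.out.println(sortedLeases.size());      // 2

    // ...while pollFirst() retrieves *and removes* it, which is why
    // iterating with pollFirst() drained the set and broke the tests.
    System.out.println(sortedLeases.pollFirst()); // lease-a
    System.out.println(sortedLeases.size());      // 1
  }
}
{code}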
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Status: Open (was: Patch Available) > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Attachment: HDFS-7394.patch > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Status: Patch Available (was: Open) > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Attachment: (was: HDFS-7394.patch) > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213010#comment-14213010 ] Haohui Mai commented on HDFS-7279: -- The findbugs warning is unrelated. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the current jetty version the DN used (jetty 6) lacks of fine-grained buffer > and connection management, DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in DN using netty, > which can be more efficient and allow more finer-grain controls on webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212999#comment-14212999 ] Hadoop QA commented on HDFS-7279: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681609/HDFS-7279.011.patch against trunk revision f2fe8a8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-hdfs-project/hadoop-hdfs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8742//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8742//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8742//console This message is automatically generated. > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the current jetty version the DN used (jetty 6) lacks of fine-grained buffer > and connection management, DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in DN using netty, > which can be more efficient and allow more finer-grain controls on webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212998#comment-14212998 ] Hadoop QA commented on HDFS-7394: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12681607/HDFS-7394.patch against trunk revision 1a1dcce. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1218 javac compiler warnings (more than the trunk's current 1217 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.ipc.TestRPCCallBenchmark The test build failed in hadoop-hdfs-project/hadoop-hdfs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//console This message is automatically generated. > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212985#comment-14212985 ] Brandon Li commented on HDFS-7146: -- {quote} The defaultStaticIdMappingFile was introduced in HADOOP-11195, and I actually have removed it in rev 004. Would you please indicate the place you were looking at?{quote} My bad. I looked into the wrong side of the diff. {quote}Relaxing the platform support is a different issue to solve and it seems to deserve a separate jira; what do you think?{quote} I am ok with either fixing it here or a different JIRA. {quote}I introduced this for testing purposes. {quote} Please add java doc for it. Also, it would be nice to describe the solution in the class java doc. > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-3806) Assertion failed in TestStandbyCheckpoints.testBothNodesInStandbyState
[ https://issues.apache.org/jira/browse/HDFS-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-3806. - Resolution: Duplicate I'm resolving this as duplicate of HDFS-3519. > Assertion failed in TestStandbyCheckpoints.testBothNodesInStandbyState > -- > > Key: HDFS-3806 > URL: https://issues.apache.org/jira/browse/HDFS-3806 > Project: Hadoop HDFS > Issue Type: Bug > Components: test > Environment: Jenkins >Reporter: Trevor Robinson >Priority: Minor > > Failed in Jenkins build for unrelated issue (HDFS-3804): > https://builds.apache.org/job/PreCommit-HDFS-Build/3011/testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestStandbyCheckpoints/testBothNodesInStandbyState/ > {noformat} > java.lang.AssertionError: Expected non-empty > /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/fsimage_012 > at org.junit.Assert.fail(Assert.java:91) > at org.junit.Assert.assertTrue(Assert.java:43) > at > org.apache.hadoop.hdfs.server.namenode.FSImageTestUtil.assertNNHasCheckpoints(FSImageTestUtil.java:467) > at > org.apache.hadoop.hdfs.server.namenode.ha.HATestUtil.waitForCheckpoint(HATestUtil.java:213) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints.testBothNodesInStandbyState(TestStandbyCheckpoints.java:133) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7393) TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212962#comment-14212962 ] Konstantin Shvachko commented on HDFS-7393: --- Indeed. Good it is fixed. Thanks. > TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk > --- > > Key: HDFS-7393 > URL: https://issues.apache.org/jira/browse/HDFS-7393 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Ted Yu > > The following is reproducible: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7399) Lack of synchronization in DFSOutputStream#Packet#getLastByteOffsetBlock()
Ted Yu created HDFS-7399: Summary: Lack of synchronization in DFSOutputStream#Packet#getLastByteOffsetBlock() Key: HDFS-7399 URL: https://issues.apache.org/jira/browse/HDFS-7399 Project: Hadoop HDFS Issue Type: Bug Reporter: Ted Yu Priority: Minor
{code}
long getLastByteOffsetBlock() {
  return offsetInBlock + dataPos - dataStart;
}
{code}
Access to fields of Packet.this should be protected by synchronization as done in other methods such as writeTo(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
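A minimal sketch of the suggested fix, assuming Packet's other accessors synchronize on the Packet instance the way writeTo() does:
{code}
// Hypothetical fix sketch: take the Packet lock so the three fields are
// read as a consistent snapshot, mirroring the synchronized writeTo().
synchronized long getLastByteOffsetBlock() {
  return offsetInBlock + dataPos - dataStart;
}
{code}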
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212846#comment-14212846 ] Andrew Wang commented on HDFS-7374: --- Hmm, so one situation we've seen is that the cluster is 100% healthy (no under-rep blocks) and dead DNs still get stuck in the D_I_P state. We can safely transition even dead nodes to DECOMMED in this situation. Going backwards from (DEAD, DECOMMED) back to (LIVE, D_I_P) feels a little weird. IMO DECOMMED should mean that a node can safely be removed from the cluster, even for dead nodes. That won't necessarily be true in this case. > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish > the decommission work. If an upper layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212844#comment-14212844 ] Yongjun Zhang commented on HDFS-7146: - Hi [~brandonli], Thanks a lot for the review and comments! I have a few questions to clarify: {quote} 1. it doesn’t seem to be necessary to introduce defaultStaticIdMappingFile {quote} The defaultStaticIdMappingFile was introduced in HADOOP-11195, and I actually have removed it in rev 004. Would you please indicate the place you were looking at? {quote} 2. Do we need checkSupportedPlatform()? We don’t have to limit the platform to only linux and mac. Some other UNIX flavors might also be able to run the NFS server. We could do the following: if (Shell.Mac) { // mac command } else { // linux command for everything else } {quote} About checkSupportedPlatform, I simply followed the existing implementation ({{ if (!OS.startsWith("Linux") && !OS.startsWith("Mac"))}}), which says only mac and linux are supported. Relaxing the platform support is a different issue to solve and it seems to deserve a separate jira; what do you think? {quote} 3. do we still need constructFullMapAtInit since it’s always false? {quote} I introduced this for testing purposes. If you look at the new test I introduced, it's first called with "true" to create a reference (refIdMapping). That's why I tagged the constructor with @VisibleForTesting. Does this sound ok to you? Thanks again! > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users
[ https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212825#comment-14212825 ] Maysam Yabandeh commented on HDFS-6982: --- [~andrew.wang] I do not see any relation between the patch and the findbugs warnings or the test failures. > nntop: top-like tool for name node users > - > > Key: HDFS-6982 > URL: https://issues.apache.org/jira/browse/HDFS-6982 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Maysam Yabandeh >Assignee: Maysam Yabandeh > Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, > HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, > HDFS-6982.v7.patch, HDFS-6982.v8.patch, nntop-design-v1.pdf > > > In this jira we motivate the need for nntop, a tool that, similarly to what > top does in Linux, gives the list of top users of the HDFS name node and > gives insight about which users are sending majority of each traffic type to > the name node. This information turns out to be the most critical when the > name node is under pressure and the HDFS admin needs to know which user is > hammering the name node and with what kind of requests. Here we present the > design of nntop which has been in production at Twitter in the past 10 > months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K > nodes), low memory footprint (less than a few MB), and quite efficient for > the write path (only two hash lookup for updating a metric). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods
[ https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohui Mai updated HDFS-7279: - Attachment: HDFS-7279.011.patch > Use netty to implement DatanodeWebHdfsMethods > - > > Key: HDFS-7279 > URL: https://issues.apache.org/jira/browse/HDFS-7279 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, webhdfs >Reporter: Haohui Mai >Assignee: Haohui Mai > Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, > HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, > HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, > HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, > HDFS-7279.011.patch > > > Currently the DN implements all related webhdfs functionality using jetty. As > the current jetty version the DN used (jetty 6) lacks of fine-grained buffer > and connection management, DN often suffers from long latency and OOM when > its webhdfs component is under sustained heavy load. > This jira proposes to implement the webhdfs component in DN using netty, > which can be more efficient and allow more finer-grain controls on webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server
[ https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212800#comment-14212800 ] Brandon Li commented on HDFS-7146: -- The patch looks nice. Some comments: 1. it doesn’t seem to be necessary to introduce defaultStaticIdMappingFile 2. Do we need checkSupportedPlatform()? We don’t have to limit the platform to only linux and mac. Some other UNIX flavors might also be able to run the NFS server. We could do the following: if (Shell.Mac) { // mac command } else { // linux command for everything else } 3. do we still need constructFullMapAtInit since it’s always false? > NFS ID/Group lookup requires SSSD enumeration on the server > --- > > Key: HDFS-7146 > URL: https://issues.apache.org/jira/browse/HDFS-7146 > Project: Hadoop HDFS > Issue Type: Bug > Components: nfs >Affects Versions: 2.6.0 >Reporter: Yongjun Zhang >Assignee: Yongjun Zhang > Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, > HDFS-7146.003.patch, HDFS-7146.004.patch > > > The current implementation of the NFS UID and GID lookup works by running > 'getent passwd' with an assumption that it will return the entire list of > users available on the OS, local and remote (AD/etc.). > This behaviour of the command is advised to be and is prevented by > administrators in most secure setups to avoid excessive load to the ADs > involved, as the # of users to be listed may be too large, and the repeated > requests of ALL users not present in the cache would be too much for the AD > infrastructure to bear. > The NFS server should likely do lookups based on a specific UID request, via > 'getent passwd ', if the UID does not match a cached value. This reduces > load on the LDAP backed infrastructure. > Thanks [~qwertymaniac] for reporting the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
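Sketching out the suggested structure (the real constant is {{Shell.MAC}}; the command strings are placeholders, not the ones used by the patch):
{code}
// Hypothetical sketch: special-case Mac and treat every other UNIX
// flavor like Linux, instead of rejecting unknown platforms outright.
private static String getUserLookupCommand(int uid) {
  if (Shell.MAC) {
    return "dscl . -search /Users UniqueID " + uid; // mac command (placeholder)
  } else {
    return "getent passwd " + uid; // linux command for everything else (placeholder)
  }
}
{code}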
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212794#comment-14212794 ] Ming Ma commented on HDFS-7374: --- [~andrew.wang], after a node is dead, all its blocks will be removed from the blockmap. So if the node never rejoins the cluster, it is unclear how you can tell whether all its blocks are fully replicated, unless we track those blocks. Another way to cover all these scenarios could be to get rid of the {{DEAD, DECOM_IN_PROGRESS}} state. After the node dies during decommission, transition to {{DEAD, DECOMMED}}. When the node rejoins the cluster, transition it to {{LIVE, DECOM_IN_PROGRESS}}. > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish > the decommission work. If an upper layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Attachment: HDFS-7394.patch Attached patch > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > Attachments: HDFS-7394.patch > > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak updated HDFS-7394: Status: Patch Available (was: Open) > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7394 started by Keith Pak. --- > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work stopped] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7394 stopped by Keith Pak. --- > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work stopped] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7394 stopped by Keith Pak. --- > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-7394 started by Keith Pak. --- > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache
[ https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Pak reassigned HDFS-7394: --- Assignee: Keith Pak > Log at INFO level when InvalidToken is seen in ShortCircuitCache > > > Key: HDFS-7394 > URL: https://issues.apache.org/jira/browse/HDFS-7394 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Assignee: Keith Pak >Priority: Minor > > For long running clients, getting an {{InvalidToken}} exception is expected > and the client refetches a block token when it happens. The related events > are logged at INFO except the ones in {{ShortCircuitCache}}. It will be > better if they are also made to log at INFO. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212766#comment-14212766 ] Andrew Wang commented on HDFS-7374: --- Hey [~mingma], I was looking a bit more at decom, and I see that we have this if statement at the end of {{isReplicationInProgress}}:
{code}
if (!status && !srcNode.isAlive) {
  LOG.warn("srcNode " + srcNode + " is dead "
      + "when decommission is in progress. Continue to mark "
      + "it as decommission in progress. In that way, when it rejoins the "
      + "cluster it can continue the decommission process.");
  status = true;
}
{code}
Logically, a (DEAD, DECOM_IN_PROGRESS) node should be able to go to (DEAD, DECOMMED) if all of its blocks are fully replicated, but this if statement prevents {{isReplicationInProgress}} from ever returning false for a dead node. It seems like we can loosen this requirement? > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish > the decommission work. If an upper layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
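One possible reading of "loosen this requirement", sketched here under stated assumptions (the under-replication guard is an assumption, not a proposed patch; cf. Ming Ma's point elsewhere in this thread that a dead node's blocks leave the blockmap):
{code}
// Hypothetical sketch: let a dead node leave DECOMMISSION_IN_PROGRESS
// when nothing remains to replicate, instead of pinning it forever.
if (!status && !srcNode.isAlive) {
  if (blockManager.getUnderReplicatedBlocksCount() == 0) { // assumed guard
    LOG.info("srcNode " + srcNode + " is dead and no blocks are "
        + "under-replicated; allowing decommission to complete.");
  } else {
    LOG.warn("srcNode " + srcNode + " is dead while decommission is in "
        + "progress; keeping it in DECOMMISSION_IN_PROGRESS.");
    status = true;
  }
}
{code}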
[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212757#comment-14212757 ] Yongjun Zhang commented on HDFS-4239: - Hi [~qwertymaniac], My bad that I did not notice your earlier comment {quote} I just noticed Steve's comment referring to the same - should've gone through properly before spending google cycles. I feel HDFS-1362 implemented would solve half of this - and the other half would be to make the removals automatic. Right now the checkDiskError does not eject if it's slow - as long as it succeeds, which would have to be done via this JIRA I think. The re-add would be possible via HDFS-1362. {quote} until now. So we need to use the functionality provided by HDFS-1362 to automatically remove a sick disk. It seems the original goal of HDFS-4239 is the same as that of HDFS-1362 (right?), so perhaps we can create a new jira for automatically removing a sick disk? Thanks. > Means of telling the datanode to stop using a sick disk > --- > > Key: HDFS-4239 > URL: https://issues.apache.org/jira/browse/HDFS-4239 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: stack >Assignee: Yongjun Zhang > Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, > hdfs-4239_v4.patch, hdfs-4239_v5.patch > > > If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing > occasionally, or just exhibiting high latency -- your choices are: > 1. Decommission the total datanode. If the datanode is carrying 6 or 12 > disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- > the rereplication of the downed datanode's data can be pretty disruptive, > especially if the cluster is doing low latency serving: e.g. hosting an hbase > cluster. > 2. Stop the datanode, unmount the bad disk, and restart the datanode (You > can't unmount the disk while it is in use). This latter is better in that > only the bad disk's data is rereplicated, not all datanode data. > Is it possible to do better, say, send the datanode a signal to tell it stop > using a disk an operator has designated 'bad'. This would be like option #2 > above minus the need to stop and restart the datanode. Ideally the disk > would become unmountable after a while. > Nice to have would be being able to tell the datanode to restart using a disk > after its been replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212745#comment-14212745 ] Chris Trezzo commented on HDFS-6133: I am slightly late to the party on this one, but at Twitter we have a similar need for a slightly different use case. We use federation and each block pool has a different workload. For some of these workloads it does not make sense to run the balancer. For example, we have a block pool associated with a tmp namespace. Ideally, we would never want to balance blocks in that block pool because they will be deleted shortly anyways. One design approach we were contemplating is to make the balancer block pool-aware. You could then run the balancer on a per-block pool basis and have pluggable balancing strategies for each pool (i.e. the balancer policy in the block pool associated with the tmp namespace is a no-op). This allows the balancer to remain decoupled from the namespace; it only needs to know about the block pool (we can still separate the BM at a later point). The above might work for this use case as well. The balancer policy for the block pool containing blocks in hbase would be a no-op. Let me know what you guys think. I can see the block pool design being orthogonal to this JIRA, so let me know if I should open up a separate JIRA for this effort. We could potentially use the pinning strategy for our use case as well, but I hesitate for the same reasons that [~kihwal] mentioned above with respect to corrupt/unavailable blocks. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6711) FSNamesystem#getAclStatus does not write to the audit log.
[ https://issues.apache.org/jira/browse/HDFS-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212736#comment-14212736 ] Chris Nauroth commented on HDFS-6711: - This was fixed in HDFS-7218, so I'm resolving this as duplicate. > FSNamesystem#getAclStatus does not write to the audit log. > -- > > Key: HDFS-6711 > URL: https://issues.apache.org/jira/browse/HDFS-6711 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.0.0, 2.4.0 >Reporter: Chris Nauroth >Priority: Minor > > Consider writing an event to the audit log for the {{getAclStatus}} method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-6711) FSNamesystem#getAclStatus does not write to the audit log.
[ https://issues.apache.org/jira/browse/HDFS-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-6711. - Resolution: Duplicate > FSNamesystem#getAclStatus does not write to the audit log. > -- > > Key: HDFS-6711 > URL: https://issues.apache.org/jira/browse/HDFS-6711 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.0.0, 2.4.0 >Reporter: Chris Nauroth >Priority: Minor > > Consider writing an event to the audit log for the {{getAclStatus}} method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212734#comment-14212734 ] Chris Nauroth commented on HDFS-7384: - Yes, what you described makes sense. An older client simply wouldn't consume the new protobuf field. I'd prefer not to add the effective action directly to {{AclEntry}}, since the effective action is something that only makes sense when the entry is considered against some other object (the mask). Overall, it sounds good. Thanks for thinking this through and putting out the proposal! > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > > *getfacl* command will print all the entries including basic and extended > entries, mask entries and effective permissions. > But, *getAclStatus* FileSystem API will return only extended ACL entries set > by the user. But this will not include the mask entry as well as effective > permissions. > To benefit the client using API, better to include 'mask' entry and effective > permissions in the return list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-4239) Means of telling the datanode to stop using a sick disk
[ https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yongjun Zhang resolved HDFS-4239. - Resolution: Duplicate Hi Stack, This issue turned out to be a duplicate of HDFS-1362, which is resolved now. I'm closing this jira as a duplicate. Please re-open if you think there is an additional issue to be addressed. Thanks. > Means of telling the datanode to stop using a sick disk > --- > > Key: HDFS-4239 > URL: https://issues.apache.org/jira/browse/HDFS-4239 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: stack >Assignee: Yongjun Zhang > Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, > hdfs-4239_v4.patch, hdfs-4239_v5.patch > > > If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing > occasionally, or just exhibiting high latency -- your choices are: > 1. Decommission the total datanode. If the datanode is carrying 6 or 12 > disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- > the rereplication of the downed datanode's data can be pretty disruptive, > especially if the cluster is doing low latency serving: e.g. hosting an hbase > cluster. > 2. Stop the datanode, unmount the bad disk, and restart the datanode (You > can't unmount the disk while it is in use). This latter is better in that > only the bad disk's data is rereplicated, not all datanode data. > Is it possible to do better, say, send the datanode a signal to tell it stop > using a disk an operator has designated 'bad'. This would be like option #2 > above minus the need to stop and restart the datanode. Ideally the disk > would become unmountable after a while. > Nice to have would be being able to tell the datanode to restart using a disk > after its been replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212699#comment-14212699 ] Andrew Wang commented on HDFS-7374: --- The patch looks good, findbugs looks unrelated, but the TestDecommission failure is worrying and also failed for me locally. Could you take a look? > Allow decommissioning of dead DataNodes > --- > > Key: HDFS-7374 > URL: https://issues.apache.org/jira/browse/HDFS-7374 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch > > > We have seen the use case of decommissioning DataNodes that are already dead > or unresponsive, and not expected to rejoin the cluster. > The logic introduced by HDFS-6791 will mark those nodes as > {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish > the decommission work. If an upper layer application is monitoring the > decommissioning progress, it will hang forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212680#comment-14212680 ] Vinayakumar B commented on HDFS-7384: - Thanks [~cnauroth] for the detailed explanation. bq. At this point, we can't change the behavior of getAclStatus on the 2.x line for compatibility reasons. Suppose a 2.6.0 deployment of the shell called getAclStatus on a 2.7.0 NameNode Here we can implement this without breaking compatibility. For example, the returned {{AclStatus}} can carry the default permissions in the form of an {{FsPermission}} object itself, which would be an optional field in protobuf. So we can keep the {{getAclEntries()}} return value as is, but in {{AclEntry}} we can add one more field, 'effective action'; this can either be calculated on the client side, based on the FsPermission object in AclStatus, or be an optional field set on the NN side itself. My basic intention is to avoid the extra client-side logic, which users currently have to implement, to find out the effective permission for an ACL entry. If {{AclStatus}} contains the {{FsPermission}} value, then we can produce the same output as 'getfacl' without having to make one more RPC to the NN. This would keep the existing behavior of returning empty entries for basic permissions, which was decided after so many discussions. Any thoughts? > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > > *getfacl* command will print all the entries including basic and extended > entries, mask entries and effective permissions. > But, *getAclStatus* FileSystem API will return only extended ACL entries set > by the user. But this will not include the mask entry as well as effective > permissions. > To benefit the client using API, better to include 'mask' entry and effective > permissions in the return list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
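For illustration, the client-side computation described here could be a one-liner over the existing permission API (a sketch; which entry types the mask actually applies to is simplified):
{code}
// Hypothetical sketch: effective action of a named ACL entry, derived by
// intersecting its granted action with the mask, which getfacl reads from
// the group bits of the file's FsPermission when an extended ACL exists.
static FsAction getEffectiveAction(AclEntry entry, FsPermission perm) {
  FsAction mask = perm.getGroupAction(); // mask is stored in the group bits
  return entry.getPermission().and(mask);
}
{code}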
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212670#comment-14212670 ] Kihwal Lee commented on HDFS-6133: -- This might be outside the scope of this jira, but I think we need to think about this before going further. If a node with a pinned block is temporarily unavailable, the namenode will try to replicate the block as it is under-replicated. When the node recovers, the block is over-replicated and a replica will be invalidated. How do we make sure it is not removed from the favored node? I think this scenario can happen during start-up or transient infra/network issues. Daryn and I had a brief discussion about this. It might be possible to include pinning info in block reports and remember it in block manager. This will enable NN to make the right decision on over-replicated cases. A bit more complicated logic will be needed when a pinned block gets corrupted on a favored node. The usual replicate + invalidate strategy won't be ideal here. > Make Balancer support exclude specified path > > > Key: HDFS-6133 > URL: https://issues.apache.org/jira/browse/HDFS-6133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover, namenode >Reporter: zhaoyunjiong >Assignee: zhaoyunjiong > Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, > HDFS-6133.patch > > > Currently, run Balancer will destroying Regionserver's data locality. > If getBlocks could exclude blocks belongs to files which have specific path > prefix, like "/hbase", then we can run Balancer without destroying > Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7393) TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk
[ https://issues.apache.org/jira/browse/HDFS-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang resolved HDFS-7393. --- Resolution: Duplicate I think this is a dupe of HDFS-7395; it has the same Preconditions stack trace in BlockIdManager. Please reopen if the failure still occurs. > TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk > --- > > Key: HDFS-7393 > URL: https://issues.apache.org/jira/browse/HDFS-7393 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Ted Yu > > The following is reproducible: -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6962) ACLs inheritance conflict with umaskmode
[ https://issues.apache.org/jira/browse/HDFS-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated HDFS-6962: Target Version/s: 2.7.0 (was: 2.4.1)

Hello, [~Alexandre LINTE]. Thank you for filing this issue. I tested the same scenario against a Linux local file system, and I confirmed that HDFS is showing different behavior, just like you described. I also confirmed that this is a divergence from the POSIX ACL specs. Here is a quote of the relevant section:

{quote} The permissions of inherited access ACLs are further modified by the mode parameter that each system call creating file system objects has. The mode parameter contains nine permission bits that stand for the permissions of the owner, group, and other class permissions. The effective permissions of each class are set to the intersection of the permissions defined for this class in the ACL and specified in the mode parameter. If the parent directory has no default ACL, the permissions of the new file are determined as defined in POSIX.1. The effective permissions are set to the permissions defined in the mode parameter, minus the permissions set in the current umask. The umask has no effect if a default ACL exists. {quote}

Changing this behavior is going to be somewhat challenging. Note the distinction made in the spec between mode and umask. When creating a new child (file or directory) of a directory with a default ACL, the mode influences the inherited access ACL entries, but the umask has no effect. Unfortunately, our current implementation intersects mode and umask on the client side before passing them to the NameNode in the RPC. This happens in {{DFSClient#mkdirs}} and {{DFSClient#create}}:

{code}
public boolean mkdirs(String src, FsPermission permission,
    boolean createParent) throws IOException {
  if (permission == null) {
    permission = FsPermission.getDefault();
  }
  FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
{code}

{code}
public DFSOutputStream create(String src, FsPermission permission,
    EnumSet<CreateFlag> flag, boolean createParent, short replication,
    long blockSize, Progressable progress, int buffersize,
    ChecksumOpt checksumOpt, InetSocketAddress[] favoredNodes)
    throws IOException {
  checkOpen();
  if (permission == null) {
    permission = FsPermission.getFileDefault();
  }
  FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
{code}

On the NameNode side, when it copies the default ACL from parent to child, we've lost the information. We just have a single piece of permissions data, with no knowledge of which part was mode vs. umask on the client side. A potential solution is to push both mode and umask explicitly to the NameNode in the RPC requests for {{MkdirsRequestProto}} and {{CreateRequestProto}}. Those messages already contain an instance of {{FsPermissionProto}}. We could add a second optional instance. If both instances are defined, then the NameNode would interpret one as being mode and the other as being umask. There would still be the possibility of an older client passing just one instance, and in that case, we'd have to fall back to the current behavior. It's a bit messy, but it could work.

We also have one additional problem specific to the shell for files (not directories). The implementation of copyFromLocal breaks down into 2 separate RPCs: creating the file, followed by a separate chmod call. The NameNode has no way of knowing whether that chmod call is part of a copyFromLocal or not, though. It's too late to enforce the mode vs. umask distinction.
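To make the divergence concrete, here is a small sketch (illustrative numbers only, not code from HDFS or this patch) contrasting what POSIX requires with the only information the NameNode receives today:

{code}
// Illustrative values only -- a sketch of the POSIX rule, not HDFS code.
// Per POSIX, when the parent has a default ACL, the umask is ignored and each
// class's effective permissions are the intersection of the default ACL
// permissions and the mode. The NameNode, however, only sees (mode & ~umask).
public class AclInheritSketch {
  public static void main(String[] args) {
    int mode = 0777;        // mode the application passed to mkdirs/create
    int umask = 027;        // client-side umask (dfs.umaskmode)
    int defaultAcl = 0770;  // permission bits carried by the default ACL

    int posixChild = defaultAcl & mode;   // 0770: umask plays no part
    int onTheWire = mode & ~umask;        // 0750: all the NameNode sees today
    System.out.printf("posix=%o on-wire=%o%n", posixChild, onTheWire);
  }
}
{code}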
I'm tentatively targeting this to 2.7.0. I think this will need more investigation to make sure there are no compatibility issues with the solution. If there is an unavoidable compatibility problem, then it might require pushing out to 3.x. We won't know for sure until someone starts coding. Thank you again for the very detailed bug report. > ACLs inheritance conflict with umaskmode > > > Key: HDFS-6962 > URL: https://issues.apache.org/jira/browse/HDFS-6962 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 2.4.1 > Environment: CentOS release 6.5 (Final) >Reporter: LINTE > Labels: hadoop, security > > In hdfs-site.xml: > <property> > <name>dfs.umaskmode</name> > <value>027</value> > </property> > 1/ Create a directory as superuser > bash# hdfs dfs -mkdir /tmp/ACLS > 2/ Set default ACLs on this directory: rwx access for group readwrite and user toto > bash# hdfs dfs -setfac
[jira] [Commented] (HDFS-7056) Snapshot support for truncate
[ https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212588#comment-14212588 ] Plamen Jeliazkov commented on HDFS-7056: The FindBugs warning appears to be unrelated; it points to inconsistent synchronization in org.apache.hadoop.hdfs.DFSOutputStream, a class we don't touch in this work. > Snapshot support for truncate > - > > Key: HDFS-7056 > URL: https://issues.apache.org/jira/browse/HDFS-7056 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.0.0 >Reporter: Konstantin Shvachko >Assignee: Plamen Jeliazkov > Attachments: HDFS-3107-HDFS-7056-combined.patch, > HDFS-3107-HDFS-7056-combined.patch, HDFS-3107-HDFS-7056-combined.patch, > HDFS-3107-HDFS-7056-combined.patch, HDFS-3107-HDFS-7056-combined.patch, > HDFS-7056.patch, HDFS-7056.patch, HDFS-7056.patch, HDFS-7056.patch, > HDFS-7056.patch, HDFS-7056.patch, HDFSSnapshotWithTruncateDesign.docx > > > Implementation of truncate in HDFS-3107 does not allow truncating files which > are in a snapshot. It is desirable to be able to truncate and still keep the > old state of the file in the snapshot. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7177) Add an option to include minimal ACL in getAclStatus return
[ https://issues.apache.org/jira/browse/HDFS-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth resolved HDFS-7177. - Resolution: Duplicate > Add an option to include minimal ACL in getAclStatus return > --- > > Key: HDFS-7177 > URL: https://issues.apache.org/jira/browse/HDFS-7177 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Zhe Zhang >Assignee: Zhe Zhang >Priority: Minor > > Currently the 3 minimal ACL entries are not included in the returned value of > getAclStatus. {{FsShell}} gets them separately ({{FsPermission perm = > item.stat.getPermission();}}). It'd be useful to make it optional to include > them, so that external programs can get a complete view of the permissions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7177) Add an option to include minimal ACL in getAclStatus return
[ https://issues.apache.org/jira/browse/HDFS-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212555#comment-14212555 ] Chris Nauroth commented on HDFS-7177: - Hi, [~zhz]. I just realized too late that HDFS-7384 is reporting basically the same thing as this. I just entered a huge comment on HDFS-7384 about it, so I'd prefer to resolve this one as duplicate, even though it really came first. I'll add all of the watchers over to HDFS-7384 so that they can still be involved in the conversation. Thanks! > Add an option to include minimal ACL in getAclStatus return > --- > > Key: HDFS-7177 > URL: https://issues.apache.org/jira/browse/HDFS-7177 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Zhe Zhang >Assignee: Zhe Zhang >Priority: Minor > > Currently the 3 minimal ACL entries are not included in the returned value of > getAclStatus. {{FsShell}} gets them separately ({{FsPermission perm = > item.stat.getPermission();}}). It'd be useful to make it optional to include > them, so that external programs can get a complete view of the permissions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync
[ https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212548#comment-14212548 ] Chris Nauroth commented on HDFS-7384: - Hi, [~vinayrpet]. The current behavior of {{getAclStatus}} is an intentional design choice, but the history behind that choice is a bit convoluted. Let me see if I can reconstruct it here. It starts with HADOOP-10220, which added an ACL indicator bit to {{FsPermission}}. This was provided as an optimization so that clients could quickly identify if a file has an ACL, without needing an additional RPC. Later, objections were raised against the ACL bit in HDFS-5923 and HDFS-5932. We made a decision to roll back the HADOOP-10220 changes, and instead require callers to use {{getAclStatus}} to identify the presence of an ACL. Prior to this, early implementations of {{getAclStatus}} would always return a non-empty list. For an inode with no ACL, it would return the "minimal ACL" containing the 3 entries that correspond to basic POSIX permissions. However, at this point, it became helpful to change {{getAclStatus}} so that it would return an empty list if there is no ACL. This was seen as easier for clients than trying to check the entries for no ACL/minimal ACL. It was also seen as a cleaner logical separation, since the client likely already has the {{FsPermission}} prior to calling {{getAclStatus}}, and therefore it would not be helpful to return redundant ACL entries. Finally, HDFS-6326 identified that our implementation choice was backwards-incompatible for webhdfs, and generally a performance bottleneck for shell users. To solve this, we reinstated the ACL bit, in a slightly different implementation, but the behavior of {{getAclStatus}} remained the same. You've definitely identified a weakness in the current API design, and I raised similar objections at the time. It's a trade-off. I think there is good logical separation right now, but as a side effect, it does mean that callers may need some extra client-side logic to piece all of the information together, such as if someone wanted to write a custom GUI consuming WebHDFS to display ACL information. At this point, we can't change the behavior of {{getAclStatus}} on the 2.x line for compatibility reasons. Suppose a 2.6.0 deployment of the shell called {{getAclStatus}} on a 2.7.0 NameNode, and it had been changed to return the complete ACL. This would cause {{getfacl}} to display duplicate entries, because the 2.6.0 logic of {{GetfaclCommand}} and {{AclUtil#getAclFromPermAndEntries}} will combine the output of {{getAclStatus}} with the {{FsPermission}}, resulting in 3 duplicate entries. Where does that leave us for this jira? I can see the following options: # Resolve as won't fix, based on the above rationale. # Target 3.0 for a backwards-incompatible change. # Add a new RPC, named {{getFullAcl}} or similar, with the behavior that you proposed. However, I'd prefer not to increase the API footprint unless there is a really strong use case. Hope this helps. Let me know your thoughts. Thanks! > 'getfacl' command and 'getAclStatus' output should be in sync > - > > Key: HDFS-7384 > URL: https://issues.apache.org/jira/browse/HDFS-7384 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Vinayakumar B >Assignee: Vinayakumar B > > *getfacl* command will print all the entries including basic and extended > entries, mask entries and effective permissions. 
> But the *getAclStatus* FileSystem API will return only the extended ACL entries set > by the user; it will not include the mask entry or effective > permissions. > To benefit clients using the API, it would be better to include the 'mask' entry and effective > permissions in the returned list of entries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
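For reference, the client-side combination described in the comment above looks roughly like this; a minimal sketch using the {{AclUtil}} helper (note {{AclUtil}} is Hadoop-private API, so treat this as an illustration rather than supported usage):

{code}
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclStatus;
import org.apache.hadoop.fs.permission.AclUtil;
import org.apache.hadoop.fs.permission.FsPermission;

// Merge the FsPermission (already fetched via getFileStatus) with the
// extended entries from getAclStatus to obtain the full logical ACL.
// If getAclStatus started returning the complete ACL instead, this merge
// would produce the duplicate entries described above.
public class FullAclSketch {
  static List<AclEntry> fullAcl(FileSystem fs, Path path) throws Exception {
    FsPermission perm = fs.getFileStatus(path).getPermission();
    AclStatus status = fs.getAclStatus(path);
    return AclUtil.getAclFromPermAndEntries(perm, status.getEntries());
  }
}
{code}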
[jira] [Commented] (HDFS-7398) Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit
[ https://issues.apache.org/jira/browse/HDFS-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212525#comment-14212525 ] Gera Shegalov commented on HDFS-7398: - Regarding the FindBugs warning: {quote} Inconsistent synchronization of org.apache.hadoop.hdfs.DFSOutputStream$Packet.dataPos; locked 83% of time {quote} It's obviously unrelated. > Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit > -- > > Key: HDFS-7398 > URL: https://issues.apache.org/jira/browse/HDFS-7398 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.6.0 >Reporter: Gera Shegalov >Assignee: Gera Shegalov > Attachments: HDFS-7398.v01.patch > > > This is a follow-up on HDFS-7385. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7396) Revisit synchronization in Namenode
[ https://issues.apache.org/jira/browse/HDFS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212469#comment-14212469 ] Chris Nauroth commented on HDFS-7396: - bq. Whenever we experimented with improving concurrency, the limiting factor was the garbage collection overhead. I also would be interested in seeing more information on this. We've been updating our recommendations for garbage collection tuning recently. It would be interesting for us to compare notes. I'm also curious whether you've tried any experiments running with the G1 collector. I haven't tried it in several years. When I tried it, it was still very experimental, so I ended up hitting too many bugs to run it in production. Perhaps it has stabilized by now. > Revisit synchronization in Namenode > --- > > Key: HDFS-7396 > URL: https://issues.apache.org/jira/browse/HDFS-7396 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > > HDFS-2106 separated block management out of the namenode into a new package. As part > of that, some code was refactored into new classes such as DatanodeManager, > HeartbeatManager, etc. There are opportunities to improve locking in the > namenode, whereas currently synchronization in the namenode is mainly done by a > single global FSNamesystem lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212384#comment-14212384 ] Hudson commented on HDFS-7395: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit > -- > > Key: HDFS-7395 > URL: https://issues.apache.org/jira/browse/HDFS-7395 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Yongjun Zhang >Assignee: Haohui Mai > Fix For: 2.7.0 > > Attachments: HDFS-7395.000.patch > > > In the latest jenkins jobs > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/ > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/ > but not > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/ > the following test failed the same way: > {code} > Failed > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image > Failing for the past 2 builds (Since Failed#1931 ) > Took 0.54 sec. > Stacktrace > java.lang.IllegalStateException: null > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975) > at > org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465) > at > org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
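From the stack trace quoted above, the failure mode can be sketched as follows; the shape is assumed from the trace and the jira title, not copied from the HDFS source:

{code}
import com.google.common.base.Preconditions;

// Sketch assumed from the stack trace above, not the exact HDFS source.
class BlockIdManagerSketch {
  static final long GRANDFATHER_GENERATION_STAMP = 0; // placeholder value

  private long generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;

  void setGenerationStampV1Limit(long stamp) {
    // Throws IllegalStateException if the limit was already set -- which is
    // exactly what clear() hits on an already-initialized BlockIdManager
    // during the upgrade path exercised by the failing test.
    Preconditions.checkState(
        generationStampV1Limit == GRANDFATHER_GENERATION_STAMP);
    generationStampV1Limit = stamp;
  }

  void clear() {
    // The fix presumably resets the field directly (or relaxes the check)
    // instead of going through the guarded setter.
    generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;
  }
}
{code}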
[jira] [Commented] (HDFS-7396) Revisit synchronization in Namenode
[ https://issues.apache.org/jira/browse/HDFS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212394#comment-14212394 ] Kihwal Lee commented on HDFS-7396: -- This is a general comment regarding reducing lock contention and increasing concurrency in the namenode. Whenever we experimented with improving concurrency, the limiting factor was the garbage collection overhead. This has gotten worse after the conversion to protobuf. Under a given load, locking improvements will certainly give better and more predictable response times. But if pushed beyond what the NN was capable of before, we will soon run into the existing inefficiencies. [~daryn] has found some of them and I hope he shares them with us soon. As [~tlipcon] mentioned in HDFS-2206, we need locking rules defined, documented and enforced if possible. In addition to the interactions between different locks, the role and scope of each lock need to be clearly defined. Lock definitions should include what the lock protects and the expected data consistency and visibility during and after a critical section, etc. At a minimum, we can come up with a comment template for this; one possible shape is sketched below. > Revisit synchronization in Namenode > --- > > Key: HDFS-7396 > URL: https://issues.apache.org/jira/browse/HDFS-7396 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > > HDFS-2106 separated block management out of the namenode into a new package. As part > of that, some code was refactored into new classes such as DatanodeManager, > HeartbeatManager, etc. There are opportunities to improve locking in the > namenode, whereas currently synchronization in the namenode is mainly done by a > single global FSNamesystem lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
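One possible shape for the lock-documentation template suggested above (purely illustrative; no such convention exists in the codebase yet):

{code}
/**
 * Lock: FSNamesystem#fsLock (example name)
 * Kind: global read-write lock
 * Protects: <data structures and invariants guarded by this lock>
 * Ordering: <locks that must be acquired before this one / must not be held>
 * Guarantees: <expected data consistency and visibility during and after
 *              a critical section, per the definition discussed above>
 */
{code}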
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212382#comment-14212382 ] Hudson commented on HDFS-7385: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
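The root cause described in the quoted report is the cached thread-local op retaining ACL entries between uses. A minimal sketch of the kind of defensive reset that prevents it (assumed shape; the committed patch may differ):

{code}
// Inside logMkDir: always (re)assign the ACL entries on the cached
// thread-local op, so a stale list left over from a previous mkdir on the
// same handler thread can never leak into the edit log.
AclFeature f = newNode.getAclFeature();
op.setAclEntries(f != null ? AclStorage.readINodeLogicalAcl(newNode) : null);
{code}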
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212388#comment-14212388 ] Hudson commented on HDFS-7358: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212359#comment-14212359 ] Hudson commented on HDFS-7395: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit > -- > > Key: HDFS-7395 > URL: https://issues.apache.org/jira/browse/HDFS-7395 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Yongjun Zhang >Assignee: Haohui Mai > Fix For: 2.7.0 > > Attachments: HDFS-7395.000.patch > > > In the latest jenkins jobs > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/ > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/ > but not > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/ > the following test failed the same way: > {code} > Failed > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image > Failing for the past 2 builds (Since Failed#1931 ) > Took 0.54 sec. > Stacktrace > java.lang.IllegalStateException: null > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975) > at > org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465) > at > org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212357#comment-14212357 ] Hudson commented on HDFS-7385: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212363#comment-14212363 ] Hudson commented on HDFS-7358: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212293#comment-14212293 ] Hudson commented on HDFS-7385: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212299#comment-14212299 ] Hudson commented on HDFS-7358: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212295#comment-14212295 ] Hudson commented on HDFS-7395: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit > -- > > Key: HDFS-7395 > URL: https://issues.apache.org/jira/browse/HDFS-7395 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Yongjun Zhang >Assignee: Haohui Mai > Fix For: 2.7.0 > > Attachments: HDFS-7395.000.patch > > > In the latest jenkins jobs > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/ > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/ > but not > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/ > the following test failed the same way: > {code} > Failed > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image > Failing for the past 2 builds (Since Failed#1931 ) > Took 0.54 sec. > Stacktrace > java.lang.IllegalStateException: null > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975) > at > org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465) > at > org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212286#comment-14212286 ] Hudson commented on HDFS-7358: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212282#comment-14212282 ] Hudson commented on HDFS-7395: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit > -- > > Key: HDFS-7395 > URL: https://issues.apache.org/jira/browse/HDFS-7395 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Yongjun Zhang >Assignee: Haohui Mai > Fix For: 2.7.0 > > Attachments: HDFS-7395.000.patch > > > In the latest jenkins jobs > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/ > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/ > but not > https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/ > the following test failed the same way: > {code} > Failed > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image > Failing for the past 2 builds (Since Failed#1931 ) > Took 0.54 sec. > Stacktrace > java.lang.IllegalStateException: null > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104) > at > org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975) > at > org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804) > at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465) > at > org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582) > at > org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212280#comment-14212280 ] Hudson commented on HDFS-7385: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212194#comment-14212194 ] Hudson commented on HDFS-7358: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java > Clients may get stuck waiting when using ByteArrayManager > - > > Key: HDFS-7358 > URL: https://issues.apache.org/jira/browse/HDFS-7358 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tsz Wo Nicholas Sze >Assignee: Tsz Wo Nicholas Sze > Fix For: 2.7.0 > > Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, > h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, > h7358_20141108.patch > > > [~stack] reported that clients might get stuck waiting when using > ByteArrayManager; see [his > comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212188#comment-14212188 ] Hudson commented on HDFS-7385: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java > ThreadLocal used in FSEditLog class causes FSImage permission mess up > - > > Key: HDFS-7385 > URL: https://issues.apache.org/jira/browse/HDFS-7385 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.4.0, 2.5.0 >Reporter: jiangyu >Assignee: jiangyu >Priority: Blocker > Fix For: 2.6.0 > > Attachments: HDFS-7385.2.patch, HDFS-7385.patch > > > We migrated our NameNodes from low-configuration to high-configuration > machines last week. Firstly, we imported the current directory, including > fsimage and editlog files, from the original ActiveNameNode to the new ActiveNameNode > and started the new NameNode, then changed the configuration of all > datanodes and restarted them; they block-reported to the new NameNodes > at once and sent heartbeats after that. > Everything seemed perfect, but after we restarted the ResourceManager, > most of the users complained that their jobs couldn't be executed because of > permission problems. > We use ACLs in our clusters, and after the migration we found that most of > the directories and files which had no ACLs set before now carried > ACL entries. That is why users could not execute their > jobs. So we had to change most file permissions to a+r and directory > permissions to a+rx to make sure the jobs could be executed. > After investigating this problem for some days, I found there is a bug in > FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the > proper value in the logMkdir and logOpenFile functions. Here is the code of > logMkdir: > public void logMkDir(String path, INode newNode) { > PermissionStatus permissions = newNode.getPermissionStatus(); > MkdirOp op = MkdirOp.getInstance(cache.get()) > .setInodeId(newNode.getId()) > .setPath(path) > .setTimestamp(newNode.getModificationTime()) > .setPermissionStatus(permissions); > AclFeature f = newNode.getAclFeature(); > if (f != null) { > op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode)); > } > logEdit(op); > } > For example, if we mkdir with ACLs through one handler (a thread, in fact), > we set the AclEntries on the op from the cache. After that, if we mkdir > without any ACLs through the same handler, the AclEntries from > the cache are the same as in the last call which set the ACLs, and because the > newNode has no AclFeature, we never get a chance to clear them. Then the > editlog is wrong and records the wrong ACLs. After the Standby loads the editlogs > from the journalnodes, applies them in memory on the SNN, saves the namespace and > transfers the wrong fsimage to the ANN, all the fsimages are wrong. The only > solution is to save the namespace from the ANN; that way you can get the right fsimage. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
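The quoted description above shows where the stale state comes from: MkdirOp.getInstance(cache.get()) returns a per-thread cached op, and setAclEntries() is only called when the new inode actually has an AclFeature. Below is a minimal sketch of the kind of fix this implies; it assumes setAclEntries(null) clears the cached entries, and it is illustrative rather than the committed HDFS-7385 patch.
{code}
// Illustrative sketch, not the committed patch: overwrite the cached op's
// ACL entries on every call, so an op reused from the ThreadLocal cache
// cannot leak the previous operation's ACLs. setAclEntries(null) clearing
// the stale entries is an assumption made for this sketch.
public void logMkDir(String path, INode newNode) {
  PermissionStatus permissions = newNode.getPermissionStatus();
  MkdirOp op = MkdirOp.getInstance(cache.get())
      .setInodeId(newNode.getId())
      .setPath(path)
      .setTimestamp(newNode.getModificationTime())
      .setPermissionStatus(permissions);
  AclFeature f = newNode.getAclFeature();
  // Unconditional write: null when the inode has no ACLs, replacing
  // whatever a previous call on this thread left behind.
  op.setAclEntries(f != null ? AclStorage.readINodeLogicalAcl(newNode) : null);
  logEdit(op);
}
{code}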
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212190#comment-14212190 ] Hudson commented on HDFS-7395: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/743/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

> BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
> --------------------------------------------------------------------------
>
>                 Key: HDFS-7395
>                 URL: https://issues.apache.org/jira/browse/HDFS-7395
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Yongjun Zhang
>            Assignee: Haohui Mai
>             Fix For: 2.7.0
>
>         Attachments: HDFS-7395.000.patch
>
>
> In the latest jenkins jobs
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
> but not in
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
> the following test failed the same way:
> {code}
> Failed
> org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
> Failing for the past 2 builds (Since Failed#1931 )
> Took 0.54 sec.
> Stacktrace
> java.lang.IllegalStateException: null
> 	at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
> 	at org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
> 	at org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
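The stack trace shows a Preconditions.checkState() inside setGenerationStampV1Limit() firing when BlockIdManager#clear() goes through that setter during an upgrade. Below is a minimal sketch of the pattern and the obvious remedy; the field and sentinel names are assumptions modeled on the trace, and the actual HDFS-7395.000.patch may differ.
{code}
// Minimal sketch of the failure pattern, not the real BlockIdManager.
class BlockIdManagerSketch {
  // Assumed "unset" sentinel, standing in for the real constant.
  static final long GRANDFATHER_GENERATION_STAMP = 0;
  private long generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;

  void setGenerationStampV1Limit(long stamp) {
    // The setter guards against being called twice; this is the checkState
    // in the stack trace, which throws when clear() reuses the setter.
    if (generationStampV1Limit != GRANDFATHER_GENERATION_STAMP) {
      throw new IllegalStateException();
    }
    generationStampV1Limit = stamp;
  }

  void clear() {
    // Remedy sketched here: reset the field directly instead of going
    // through the guarded setter, so clear() is safe to call repeatedly.
    generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;
  }
}
{code}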
[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
[ https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212161#comment-14212161 ] Hudson commented on HDFS-7395: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) HDFS-7395. BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

> BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
> --------------------------------------------------------------------------
>
>                 Key: HDFS-7395
>                 URL: https://issues.apache.org/jira/browse/HDFS-7395
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Yongjun Zhang
>            Assignee: Haohui Mai
>             Fix For: 2.7.0
>
>         Attachments: HDFS-7395.000.patch
>
>
> In the latest jenkins jobs
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
> but not in
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
> the following test failed the same way:
> {code}
> Failed
> org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
> Failing for the past 2 builds (Since Failed#1931 )
> Took 0.54 sec.
> Stacktrace
> java.lang.IllegalStateException: null
> 	at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:763)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:747)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:465)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
> 	at org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
> 	at org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager
[ https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212165#comment-14212165 ] Hudson commented on HDFS-7358: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. (szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java

> Clients may get stuck waiting when using ByteArrayManager
> ----------------------------------------------------------
>
>                 Key: HDFS-7358
>                 URL: https://issues.apache.org/jira/browse/HDFS-7358
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Tsz Wo Nicholas Sze
>             Fix For: 2.7.0
>
>         Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, h7358_20141108.patch
>
>
> [~stack] reported that clients might get stuck waiting when using ByteArrayManager; see [his comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
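ByteArrayManager recycles byte arrays for the client write path, and the attachment name h7358_20141104_wait_timeout.patch suggests a bounded wait was one remedy considered. The sketch below is hypothetical (the class and field names are inventions, not the real ByteArrayManager API): it shows how a caller parked on an unbounded wait() can hang forever if a notify is missed, and how waiting against a deadline bounds the stall.
{code}
// Hypothetical sketch, not the real ByteArrayManager: a bounded allocator
// where callers block until capacity frees up. With a plain wait(), a
// missed or lost notify leaves the client stuck forever; waiting with a
// deadline bounds the worst case.
class BoundedArrayAllocator {
  private final int limit;
  private int outstanding;

  BoundedArrayAllocator(int limit) { this.limit = limit; }

  synchronized byte[] allocate(int len, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (outstanding >= limit) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        break; // deadline passed: proceed anyway instead of hanging forever
      }
      wait(remaining); // bounded wait instead of an unbounded wait()
    }
    outstanding++;
    return new byte[len];
  }

  synchronized void release() {
    outstanding--;
    notifyAll(); // wake any callers waiting for capacity
  }
}
{code}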
[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up
[ https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212159#comment-14212159 ] Hudson commented on HDFS-7385: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/]) HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess up. Contributed by jiangyu. (cnauroth: rev b0a41de68c5b08f534ca231293de053c0b0cbd5d)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java

> ThreadLocal used in FSEditLog class causes FSImage permission mess up
> ---------------------------------------------------------------------
>
>                 Key: HDFS-7385
>                 URL: https://issues.apache.org/jira/browse/HDFS-7385
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.4.0, 2.5.0
>            Reporter: jiangyu
>            Assignee: jiangyu
>            Priority: Blocker
>             Fix For: 2.6.0
>
>         Attachments: HDFS-7385.2.patch, HDFS-7385.patch
>
>
> We migrated our NameNodes from low-configuration to high-configuration machines last week. First, we imported the current directory, including the fsimage and editlog files, from the original Active NameNode to the new Active NameNode and started the new NameNode; then we changed the configuration of all datanodes and restarted them, so they sent block reports to the new NameNodes at once and heartbeats after that.
> Everything seemed perfect, but after we restarted the ResourceManager, most users complained that their jobs couldn't be executed because of permission problems.
> We applied ACLs in our clusters, and after the migration we found that most of the directories and files which had no ACLs set before now carried ACL entries. That is why users could not execute their jobs, so we had to change most file permissions to a+r and directory permissions to a+rx to make sure the jobs could run.
> After investigating this problem for some days, I found a bug in FSEditLog.java. The ThreadLocal variable cache in FSEditLog doesn't set the proper value in the logMkDir and logOpenFile functions. Here is the code of logMkDir:
>   public void logMkDir(String path, INode newNode) {
>     PermissionStatus permissions = newNode.getPermissionStatus();
>     MkdirOp op = MkdirOp.getInstance(cache.get())
>       .setInodeId(newNode.getId())
>       .setPath(path)
>       .setTimestamp(newNode.getModificationTime())
>       .setPermissionStatus(permissions);
>     AclFeature f = newNode.getAclFeature();
>     if (f != null) {
>       op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
>     }
>     logEdit(op);
>   }
> For example, if we mkdir with ACLs through one handler (a thread, in fact), we set the AclEntries on the op taken from the cache. After that, if we mkdir without any ACL settings through the same handler, the AclEntries from the cache are still those of the last call that set ACLs, and because the newNode has no AclFeature, we never get a chance to clear them. The editlog is therefore wrong and records the wrong ACLs. After the Standby NameNode loads the editlogs from the journalnodes, applies them to its in-memory state, saves the namespace, and transfers the wrong fsimage to the Active NameNode, all the fsimages are wrong. The only remedy is to save the namespace from the ANN, which yields the correct fsimage.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7392) org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
[ https://issues.apache.org/jira/browse/HDFS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frantisek Vacek updated HDFS-7392: -- Description:
In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
2) There are at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address: 127.0.1.1#53
share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 192.168.1.223
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 192.168.1.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.

was:
In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
2) There are at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address: 127.0.1.1#53
share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.

> org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
> ----------------------------------------------------------------------
>
>                 Key: HDFS-7392
>                 URL: https://issues.apache.org/jira/browse/HDFS-7392
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Frantisek Vacek
>            Priority: Critical
>         Attachments: 1.png, 2.png
>
>
> In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
> The specific circumstances are:
> 1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
> 2) There are at least 2 IP addresses for such a URI. See the output below:
> {quote}
> [~/proj/quickbox]$ nslookup share.example.com
> Server: 127.0.1.1
> Address: 127.0.1.1#53
> share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
> Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
> Address: 192.168.1.223
> Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
> Address: 192.168.1.65
> {quote}
> In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
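To make the failure mode concrete: with two load-balancer addresses behind one hostname, every re-resolution appears to be an address change, updateAddress() returns true, and the timeout-failure counter is reset before it can ever reach maxRetriesOnSocketTimeouts. Below is a simplified sketch of that retry accounting; the method and field names follow the description above, but this is not the actual org.apache.hadoop.ipc.Client code.
{code}
// Simplified sketch of the retry accounting described above, not the real
// Hadoop IPC client. resolve() stands in for DNS re-resolution that
// alternates between the two load-balancer addresses.
class RetryLoopSketch {
  private int timeoutFailures;
  private String currentAddress = "192.168.1.223";

  private String resolve() { // round-robin DNS: flips addresses on each call
    return currentAddress.equals("192.168.1.223") ? "192.168.1.65" : "192.168.1.223";
  }

  private boolean updateAddress() {
    String fresh = resolve();
    boolean changed = !fresh.equals(currentAddress); // always true here
    currentAddress = fresh;
    return changed;
  }

  void onSocketTimeout(int maxRetriesOnSocketTimeouts) throws java.io.IOException {
    if (updateAddress()) {
      timeoutFailures = 0; // this reset defeats the cap below: no exit, ever
    }
    if (++timeoutFailures > maxRetriesOnSocketTimeouts) {
      throw new java.io.IOException(
          "giving up after " + maxRetriesOnSocketTimeouts + " timeouts");
    }
  }
}
{code}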
[jira] [Updated] (HDFS-7392) org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
[ https://issues.apache.org/jira/browse/HDFS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frantisek Vacek updated HDFS-7392: -- Description:
In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
2) There are at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address: 127.0.1.1#53
share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.

was:
In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
The specific circumstances are:
1) The HDFS URI (hdfs://share.merck.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
2) There are at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.merck.com
Server: 127.0.1.1
Address: 127.0.1.1#53
share.merck.com canonical name = internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com.
Name: internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name: internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.

> org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
> ----------------------------------------------------------------------
>
>                 Key: HDFS-7392
>                 URL: https://issues.apache.org/jira/browse/HDFS-7392
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Frantisek Vacek
>            Priority: Critical
>         Attachments: 1.png, 2.png
>
>
> In some specific circumstances, org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out and lasts forever.
> The specific circumstances are:
> 1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) points to a valid IP address, but no namenode service is running on it.
> 2) There are at least 2 IP addresses for such a URI. See the output below:
> {quote}
> [~/proj/quickbox]$ nslookup share.example.com
> Server: 127.0.1.1
> Address: 127.0.1.1#53
> share.example.com canonical name = internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
> Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
> Address: 54.40.29.223
> Name: internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
> Address: 54.40.29.65
> {quote}
> In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() sometimes returns true (even if the address didn't actually change; see img. 1) and the timeoutFailures counter is reset to 0 (see img. 2). The maxRetriesOnSocketTimeouts limit (45) is never reached and the connection attempt is repeated forever.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)