[jira] [Assigned] (HDFS-2263) Make DFSClient report bad blocks more quickly

2018-05-15 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J reassigned HDFS-2263:
-

Assignee: (was: Harsh J)

> Make DFSClient report bad blocks more quickly
> -
>
> Key: HDFS-2263
> URL: https://issues.apache.org/jira/browse/HDFS-2263
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 0.20.2
>Reporter: Aaron T. Myers
>Priority: Major
> Attachments: HDFS-2263.patch
>
>
> In certain circumstances the DFSClient may detect a block as being bad 
> without reporting it promptly to the NN.
> When a client finds an invalid checksum while reading a block, it immediately 
> reports that bad block to the NN. However, when a DN finds a truncated block 
> while serving a read, it reports this only to the client; the client merely 
> adds that DN to its list of dead nodes and moves on to another DN, without 
> reporting the truncated replica to the NN.
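
For illustration, a minimal read-path sketch (assuming a reachable cluster at 
fs.defaultFS and an existing /tmp/example.txt; not the actual DFSInputStream 
code) showing where client-side checksum verification surfaces. The bad-block 
report itself is issued internally by the DFS client, not by application code.

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumReadSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.setVerifyChecksum(true); // client-side checksum verification (the default)
    try (FSDataInputStream in = fs.open(new Path("/tmp/example.txt"))) {
      IOUtils.copyBytes(in, System.out, conf, false);
    } catch (IOException e) {
      // When a replica fails verification, the DFS client reports it to the NN
      // and retries another replica internally; the application only sees an
      // exception here if no good replica remains. The truncated-replica case in
      // the description above is the one that skips the NN report today.
      System.err.println("Read failed on all replicas: " + e.getMessage());
    }
  }
}
{code}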



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-2561) Under dfsadmin -report, show a proper 'last contact time' for decommissioned/dead nodes.

2018-05-15 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J reassigned HDFS-2561:
-

Assignee: (was: Harsh J)

> Under dfsadmin -report, show a proper 'last contact time' for 
> decommissioned/dead nodes.
> 
>
> Key: HDFS-2561
> URL: https://issues.apache.org/jira/browse/HDFS-2561
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 0.20.2
>Reporter: Harsh J
>Priority: Major
>
> Right now, the last contact period gets reset to 0 once we mark a DN dead. 
> This can be improved to show a proper time instead.
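
For illustration, a small client-side sketch of how a proper last-contact time 
could be derived for dead nodes from the report the NN already returns (assuming 
a 2.x client where DatanodeReportType lives under HdfsConstants; this is not the 
dfsadmin code itself):

{code:java}
import java.io.IOException;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.DatanodeReportType;

public class DeadNodeLastContact {
  public static void main(String[] args) throws IOException {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    for (DatanodeInfo dn : dfs.getDataNodeStats(DatanodeReportType.DEAD)) {
      // getLastUpdate() is the last heartbeat timestamp the NN recorded, which
      // is what a proper "last contact time" for a dead node could be based on.
      System.out.println(dn.getHostName() + " last heartbeat at "
          + new Date(dn.getLastUpdate()));
    }
  }
}
{code}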



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-1590) Decommissioning never ends when node to decommission has blocks that are under-replicated and cannot be replicated to the expected level of replication

2018-05-15 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J reassigned HDFS-1590:
-

Assignee: (was: Harsh J)

> Decommissioning never ends when node to decommission has blocks that are 
> under-replicated and cannot be replicated to the expected level of replication
> ---
>
> Key: HDFS-1590
> URL: https://issues.apache.org/jira/browse/HDFS-1590
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 0.20.2
> Environment: Linux
>Reporter: Mathias Herberts
>Priority: Minor
>
> On a test cluster with 4 DNs and a default repl level of 3, I recently 
> attempted to decommission one of the DNs. Right after the modification of the 
> dfs.hosts.exclude file and the 'dfsadmin -refreshNodes', I could see the 
> blocks being replicated to other nodes.
> After a while, the replication stopped but the node was not marked as 
> decommissioned.
> When running an 'fsck -files -blocks -locations' I saw that all files had a 
> replication of 4 (which is logical given there are 4 DNs), but some of the 
> files had an expected replication set to 10 (those were job.jar files from 
> M/R jobs).
> I ran 'fs -setrep 3' on those files and shortly after the namenode reported 
> the DN as decommissioned.
> Shouldn't this case be checked by the NameNode when decommissioning a node? 
> I.e., consider a node decommissioned if either of the following is true for 
> each block on the node being decommissioned:
> 1. It is replicated at or above the expected replication level.
> 2. It is replicated as much as possible given the available nodes, even though 
> that is less than the expected level.
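
For illustration, a minimal sketch of the manual workaround described above: 
capping any file whose expected replication exceeds what the cluster can satisfy 
(the path and the target factor are examples, not part of the report):

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class CapReplication {
  public static void main(String[] args) throws IOException {
    short maxRepl = 3; // e.g. the effective ceiling on a 4-DN cluster
    FileSystem fs = FileSystem.get(new Configuration());
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
    while (it.hasNext()) {
      LocatedFileStatus st = it.next();
      if (st.isFile() && st.getReplication() > maxRepl) {
        // Equivalent of the 'fs -setrep 3' run in the report above.
        fs.setReplication(st.getPath(), maxRepl);
      }
    }
  }
}
{code}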



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-4638) TransferFsImage should take Configuration as parameter

2018-05-15 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J reassigned HDFS-4638:
-

Assignee: (was: Harsh J)

> TransferFsImage should take Configuration as parameter
> --
>
> Key: HDFS-4638
> URL: https://issues.apache.org/jira/browse/HDFS-4638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Todd Lipcon
>Priority: Minor
>  Labels: BB2015-05-TBR
> Attachments: HDFS-4638.patch
>
>
> TransferFsImage currently creates a new HdfsConfiguration object, rather than 
> taking one passed in. This means that using {{dfsadmin -fetchImage}}, you 
> can't pass a different timeout on the command line, since the Tool's 
> configuration doesn't get plumbed through.
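
For illustration, the Tool pattern the description refers to - a minimal sketch 
(the class name is made up) showing that a -D value such as 
dfs.image.transfer.timeout only reaches downstream code if that code reads the 
Tool's Configuration rather than constructing a fresh one:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfPlumbingDemo extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // Visible here because ToolRunner parsed -D options into getConf();
    // a fresh HdfsConfiguration built inside TransferFsImage would not see it.
    System.out.println("dfs.image.transfer.timeout = "
        + getConf().get("dfs.image.transfer.timeout", "<unset>"));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // e.g. hadoop jar demo.jar ConfPlumbingDemo -D dfs.image.transfer.timeout=120000
    System.exit(ToolRunner.run(new Configuration(), new ConfPlumbingDemo(), args));
  }
}
{code}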



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output

2018-05-15 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J reassigned HDFS-8516:
-

Assignee: (was: Harsh J)

> The 'hdfs crypto -listZones' should not print an extra newline at end of 
> output
> ---
>
> Key: HDFS-8516
> URL: https://issues.apache.org/jira/browse/HDFS-8516
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.7.0
>Reporter: Harsh J
>Priority: Minor
> Attachments: HDFS-8516.patch
>
>
> It currently prints an extra newline (TableListing already adds a newline to 
> the end of the table string).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2018-05-15 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J reassigned HDFS-10237:
--

Assignee: (was: Harsh J)

> Support specifying checksum type in WebHDFS/HTTPFS writers
> --
>
> Key: HDFS-10237
> URL: https://issues.apache.org/jira/browse/HDFS-10237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: webhdfs
>Affects Versions: 2.8.0
>Reporter: Harsh J
>Priority: Minor
> Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, 
> HDFS-10237.002.patch, HDFS-10237.002.patch
>
>
> Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
> writer, as you can with the regular DFS writer (done via HADOOP-8240)
> This JIRA covers the changes necessary to bring the same ability to WebHDFS 
> and HTTPFS.
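
For illustration, how the existing knob works with the regular DFS writer, and 
what the same call over webhdfs:// would cover once supported. The URIs are 
placeholders; dfs.checksum.type is the existing client-side setting honored by 
the DFS writer.

{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumTypeWrite {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.set("dfs.checksum.type", "CRC32C"); // honored by the regular DFS writer
    FileSystem dfs = FileSystem.get(URI.create("hdfs://nn.example.com:8020"), conf);
    try (FSDataOutputStream out = dfs.create(new Path("/tmp/with-crc32c"))) {
      out.writeUTF("checksummed with the requested type");
    }
    // Over "webhdfs://nn.example.com:50070" the same setting is not carried
    // through today; plumbing it through is what this JIRA proposes.
  }
}
{code}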



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable

2017-05-25 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025565#comment-16025565
 ] 

Harsh J commented on HDFS-11421:


The patch looks good to me. Were the style changes made to address checkstyle 
warnings? I only notice two: one comment becoming multi-line, and the DOMAIN 
static member being made lowercase.

+1

> Make WebHDFS' ACLs RegEx configurable
> -
>
> Key: HDFS-11421
> URL: https://issues.apache.org/jira/browse/HDFS-11421
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
> Fix For: 3.0.0-alpha3
>
> Attachments: HDFS-11421.000.patch, HDFS-11421-branch-2.000.patch, 
> HDFS-11421.branch-2.001.patch, HDFS-11421.branch-2.003.patch
>
>
> Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently 
> identifies the passed arguments via a hard-coded regex that mandates certain 
> group and user naming styles.
> A similar limitation had existed before for CHOWN and other User/Group set 
> related operations of WebHDFS, where it was then made configurable via 
> HDFS-11391 + HDFS-4983.
> Such configurability should be allowed for the ACL operations too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9868) Add ability for DistCp to run between 2 clusters

2017-03-13 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907295#comment-15907295
 ] 

Harsh J commented on HDFS-9868:
---

Would this mapping allow DistCp to be aware of the destination's distinct KMS 
URIs? A common ask among the customers I work with is the ability for DistCp to 
copy from an EZ of one cluster into an EZ of another, where the KMS (keys) are 
not shared.

> Add ability for DistCp to run between 2 clusters
> 
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: distcp
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, 
> HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch, 
> HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, 
> HDFS-9868.4.patch
>
>
> Normally the HDFS cluster is HA enabled. Copying huge amounts of data with 
> DistCp can take a long time, and if the source cluster's active namenode 
> changes mid-run, the DistCp job fails. This patch lets DistCp read source 
> cluster files in HA mode. A source cluster configuration file needs to be 
> specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster 
> configuration file:
> {code:xml}
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>host1:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>host2:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn1</name>
>     <value>host1:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn2</name>
>     <value>host2:50070</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
> </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar 
> hdfs://nn2:8020/bar/foo
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7290) Add HTTP response code to the HttpPutFailedException message

2017-03-08 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901175#comment-15901175
 ] 

Harsh J commented on HDFS-7290:
---

[~ajisakaa] or [~jojochuang], could you help review this simple exception 
message enhancement?

> Add HTTP response code to the HttpPutFailedException message
> 
>
> Key: HDFS-7290
> URL: https://issues.apache.org/jira/browse/HDFS-7290
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.5.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7290.patch
>
>
> If the TransferFsImage#uploadImageFromStorage(…) call fails for some reason, 
> we try to report the reason for the connection failure.
> We currently only grab connection.getResponseMessage(…) and use that as the 
> exception's lone string, but this can often be empty if there was no real 
> response message from the connection end. However, the failures always carry a 
> status code, so we should also print the returned code, for at least a partial 
> hint.
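
For illustration, the kind of message composition being proposed, sketched 
against plain java.net (the host and path are placeholders; this is not the 
actual TransferFsImage code):

{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class PutFailureMessageSketch {
  static void checkPutResponse(HttpURLConnection connection) throws IOException {
    int code = connection.getResponseCode();
    if (code != HttpURLConnection.HTTP_OK) {
      String msg = connection.getResponseMessage(); // may be empty or null
      // Including the numeric code guarantees at least a partial hint even when
      // the response message is blank.
      throw new IOException("Image transfer failed with HTTP response code "
          + code + (msg == null || msg.isEmpty() ? "" : ": " + msg));
    }
  }

  public static void main(String[] args) throws IOException {
    HttpURLConnection conn = (HttpURLConnection)
        new URL("http://nn.example.com:50070/imagetransfer").openConnection();
    checkPutResponse(conn);
  }
}
{code}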



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable

2017-03-02 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891804#comment-15891804
 ] 

Harsh J commented on HDFS-11421:


[~xiaochen] - Thanks! I'm uncertain how to trigger just a branch-2 build, but a 
local build passes with the patch applied, along with the modified tests.

> Make WebHDFS' ACLs RegEx configurable
> -
>
> Key: HDFS-11421
> URL: https://issues.apache.org/jira/browse/HDFS-11421
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
> Fix For: 3.0.0-alpha3
>
> Attachments: HDFS-11421.000.patch, HDFS-11421-branch-2.000.patch
>
>
> Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently 
> identifies the passed arguments via a hard-coded regex that mandates certain 
> group and user naming styles.
> A similar limitation had existed before for CHOWN and other User/Group set 
> related operations of WebHDFS, where it was then made configurable via 
> HDFS-11391 + HDFS-4983.
> Such configurability should be allowed for the ACL operations too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable

2017-02-24 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-11421:
---
Target Version/s: 2.9.0, 3.0.0-beta1  (was: 2.9.0)

> Make WebHDFS' ACLs RegEx configurable
> -
>
> Key: HDFS-11421
> URL: https://issues.apache.org/jira/browse/HDFS-11421
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: HDFS-11421.000.patch, HDFS-11421-branch-2.000.patch
>
>
> Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently 
> identifies the passed arguments via a hard-coded regex that mandates certain 
> group and user naming styles.
> A similar limitation had existed before for CHOWN and other User/Group set 
> related operations of WebHDFS, where it was then made configurable via 
> HDFS-11391 + HDFS-4983.
> Such configurability should be allowed for the ACL operations too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable

2017-02-24 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-11421:
---
Attachment: HDFS-11421.000.patch
HDFS-11421-branch-2.000.patch

Thank you for the review [~xiaochen]! I've updated the PR with a commit 
addressing your comments. I'm also attaching it in patch form directly here.

> Make WebHDFS' ACLs RegEx configurable
> -
>
> Key: HDFS-11421
> URL: https://issues.apache.org/jira/browse/HDFS-11421
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: HDFS-11421.000.patch, HDFS-11421-branch-2.000.patch
>
>
> Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently 
> identifies the passed arguments via a hard-coded regex that mandates certain 
> group and user naming styles.
> A similar limitation had existed before for CHOWN and other User/Group set 
> related operations of WebHDFS, where it was then made configurable via 
> HDFS-11391 + HDFS-4983.
> Such configurability should be allowed for the ACL operations too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11422) Need a method to refresh the list of NN StorageDirectories after removal

2017-02-16 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871117#comment-15871117
 ] 

Harsh J commented on HDFS-11422:


IIRC, you can also use the command {{hdfs dfsadmin -restoreFailedStorage}} to 
control this.

> Need a method to refresh the list of NN StorageDirectories after removal
> 
>
> Key: HDFS-11422
> URL: https://issues.apache.org/jira/browse/HDFS-11422
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Attila Bukor
>
> If a NN storage directory is removed due to an error, the NameNode will fail 
> to write the image even if the issue was intermittent. It would be good to 
> have a way to make the NameNode try writing again after the issue is fixed - 
> and maybe even retry automatically at a configurable interval.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11422) Need a method to refresh the list of NN StorageDirectories after removal

2017-02-16 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871115#comment-15871115
 ] 

Harsh J commented on HDFS-11422:


Does the feature offered by {{dfs.namenode.name.dir.restore}} help, even 
partially? For NNs with multiple directories, an earlier failed directory is 
retried at every checkpoint trigger. It's not purely time-based, however.

We'd considered making it default to true (HDFS-3560), but that was not done. 
If this fits, perhaps we can turn it on by default, as the feature has existed 
for a while now.
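
For reference, the flag in question is just a boolean in the NN's configuration 
(normally set in hdfs-site.xml); a minimal check could look like:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RestoreFlagCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Defaults to false; HDFS-3560 discussed flipping the default to true.
    boolean restore = conf.getBoolean("dfs.namenode.name.dir.restore", false);
    System.out.println("dfs.namenode.name.dir.restore = " + restore);
  }
}
{code}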

> Need a method to refresh the list of NN StorageDirectories after removal
> 
>
> Key: HDFS-11422
> URL: https://issues.apache.org/jira/browse/HDFS-11422
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Attila Bukor
>
> If a NN storage directory is removed due to an error, the NameNode will fail 
> to write the image even if the issue was intermittent. It would be good to 
> have a way to make the NameNode try writing again after the issue is fixed - 
> and maybe even retry automatically at a configurable interval.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable

2017-02-16 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-11421:
---
Affects Version/s: 2.6.0
 Target Version/s: 2.9.0
   Status: Patch Available  (was: Open)

> Make WebHDFS' ACLs RegEx configurable
> -
>
> Key: HDFS-11421
> URL: https://issues.apache.org/jira/browse/HDFS-11421
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>
> Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently 
> identifies the passed arguments via a hard-coded regex that mandates certain 
> group and user naming styles.
> A similar limitation had existed before for CHOWN and other User/Group set 
> related operations of WebHDFS, where it was then made configurable via 
> HDFS-11391 + HDFS-4983.
> Such configurability should be allowed for the ACL operations too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable

2017-02-16 Thread Harsh J (JIRA)
Harsh J created HDFS-11421:
--

 Summary: Make WebHDFS' ACLs RegEx configurable
 Key: HDFS-11421
 URL: https://issues.apache.org/jira/browse/HDFS-11421
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: webhdfs
Reporter: Harsh J
Assignee: Harsh J


Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently 
identifies the passed arguments via a hard-coded regex that mandates certain 
group and user naming styles.

A similar limitation had existed before for CHOWN and other User/Group set 
related operations of WebHDFS, where it was then made configurable via 
HDFS-11391 + HDFS-4983.

Such configurability should be allowed for the ACL operations too.
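
For illustration, the shape of the change: a pattern read from configuration 
instead of a hard-coded constant. The user-name key below comes from 
HDFS-4983/HDFS-11391; the ACL-spec key name and its default regex are only 
stand-ins for whatever this JIRA settles on.

{code:java}
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;

public class ConfigurablePatternSketch {
  // Default user-name pattern used by WebHDFS parameter validation.
  private static final String DEFAULT_USER_PATTERN = "^[A-Za-z_][A-Za-z0-9._-]*[$]?$";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    Pattern user = Pattern.compile(
        conf.get("dfs.webhdfs.user.provider.user.pattern", DEFAULT_USER_PATTERN));
    // Hypothetical key and default for the ACL-spec pattern this JIRA would add:
    Pattern aclSpec = Pattern.compile(
        conf.get("dfs.webhdfs.acl.permission.pattern",
            "^(default:)?(user|group|mask|other):[A-Za-z0-9._-]*:([rwx-]{3})?"
                + "(,(default:)?(user|group|mask|other):[A-Za-z0-9._-]*:([rwx-]{3})?)*$"));

    System.out.println(user.matcher("hdfs-user_01$").matches());       // true
    System.out.println(aclSpec.matcher("user:foo:rw-,group::r--").matches()); // true
  }
}
{code}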



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-3302) Review and improve HDFS trash documentation

2017-01-17 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-3302:
--
Assignee: (was: Harsh J)

> Review and improve HDFS trash documentation
> ---
>
> Key: HDFS-3302
> URL: https://issues.apache.org/jira/browse/HDFS-3302
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.0.0-alpha1
>Reporter: Harsh J
>  Labels: docs
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-3302.patch
>
>
> Improve Trash documentation for users.
> (0.23 published release docs are missing original HDFS docs btw...)
> A set of FAQ-like questions can be found on HDFS-2740
> I'll update the ticket shortly with the areas to cover in the docs, as 
> enabling trash by default (HDFS-2740) would be considered a wide behavior 
> change per its follow-ups.
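
For context, the behavior the docs need to cover comes down to one setting plus 
the client-side move into trash; a minimal sketch (the interval value and path 
are only examples):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Trash is enabled whenever fs.trash.interval > 0 (retention in minutes).
    conf.set("fs.trash.interval", "1440");
    FileSystem fs = FileSystem.get(conf);
    // What 'hadoop fs -rm' does when trash is enabled: move the path into
    // .Trash/Current under the user's home directory instead of deleting it.
    Trash.moveToAppropriateTrash(fs, new Path("/tmp/obsolete-data"), conf);
  }
}
{code}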



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-3302) Review and improve HDFS trash documentation

2017-01-17 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825810#comment-15825810
 ] 

Harsh J commented on HDFS-3302:
---

Just a short note: This seems to be incorrectly attributed (to me). Actual 
patch contributor was [~pmkiran19].

> Review and improve HDFS trash documentation
> ---
>
> Key: HDFS-3302
> URL: https://issues.apache.org/jira/browse/HDFS-3302
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.0.0-alpha1
>Reporter: Harsh J
>Assignee: Harsh J
>  Labels: docs
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-3302.patch
>
>
> Improve Trash documentation for users.
> (0.23 published release docs are missing original HDFS docs btw...)
> A set of FAQ-like questions can be found on HDFS-2740
> I'll update the ticket shortly with the areas to cover in the docs, as 
> enabling trash by default (HDFS-2740) would be considered a wide behavior 
> change per its follow-ups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-2569) DN decommissioning quirks

2016-11-24 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-2569.
---
Resolution: Cannot Reproduce
  Assignee: (was: Harsh J)

Cannot quite reproduce this on current versions.

> DN decommissioning quirks
> -
>
> Key: HDFS-2569
> URL: https://issues.apache.org/jira/browse/HDFS-2569
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 0.23.0
>Reporter: Harsh J
>
> Decommissioning a node works slightly oddly in 0.23+:
> The steps I did:
> - Start HDFS via {{hdfs namenode}} and {{hdfs datanode}}. 1-node cluster.
> - Zero files/blocks, so I go ahead and add my DN to the excludes file and run 
> {{hdfs dfsadmin -refreshNodes}}
> - I see the following log in NN tails, which is fine:
> {code}
> 11/11/20 09:28:10 INFO util.HostsFileReader: Setting the includes file to 
> 11/11/20 09:28:10 INFO util.HostsFileReader: Setting the excludes file to 
> build/test/excludes
> 11/11/20 09:28:10 INFO util.HostsFileReader: Refreshing hosts 
> (include/exclude) list
> 11/11/20 09:28:10 INFO util.HostsFileReader: Adding 192.168.1.23 to the list 
> of hosts from build/test/excludes
> {code}
> - However, DN log tail gets no new messages. DN still runs.
> - The dfshealth.jsp page shows this table, which makes no sense -- why is 
> there 1 live and 1 dead?:
> |Live Nodes|1 (Decommissioned: 1)|
> |Dead Nodes|1 (Decommissioned: 0)|
> |Decommissioning Nodes|0|
> - The live nodes page shows this, meaning DN is still up and heartbeating but 
> is decommissioned:
> |Node|Last Contact|Admin State|
> |192.168.1.23|0|Decommissioned|
> - The dead nodes page shows this, and the link to the DN is broken because the 
> port is rendered as -1. Also, showing 'false' for Decommissioned makes no sense 
> when the live nodes page shows the node as already decommissioned:
> |Node|Decommissioned|
> |192.168.1.23|false|
> Investigating whether this is a quirk observed only when the DN holds zero 
> blocks in total.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-2936) Provide a way to apply a minimum replication factor aside of strict minimum live replicas feature

2016-11-03 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-2936:
--
Description: 
If an admin wishes to enforce replication today for all the users of their 
cluster, they may set {{dfs.namenode.replication.min}}. This property prevents 
users from creating files with < expected replication factor.

However, the value of minimum replication set by the above value is also 
checked at several other points, especially during completeFile (close) 
operations. If a condition arises wherein a write's pipeline may have gotten 
only < minimum nodes in it, the completeFile operation does not successfully 
close the file and the client begins to hang waiting for NN to replicate the 
last bad block in the background. This form of hard-guarantee can, for example, 
bring down clusters of HBase during high xceiver load on DN, or disk fill-ups 
on many of them, etc..

I propose we should split the property in two parts:
* dfs.namenode.replication.min
** Stays the same name, but only checks file creation time replication factor 
value and during adjustments made via setrep/etc.
* dfs.namenode.replication.min.for.write
** New property that disconnects the rest of the checks from the above 
property, such as the checks done during block commit, file complete/close, 
safemode checks for block availability, etc..

Alternatively, we may also choose to remove the client-side hang of 
completeFile/close calls with a set number of retries. This would further 
require discussion about how a file-closure handle ought to be handled.

  was:
If an admin wishes to enforce replication today for all the users of their 
cluster, he may set {{dfs.namenode.replication.min}}. This property prevents 
users from creating files with < expected replication factor.

However, the value of minimum replication set by the above value is also 
checked at several other points, especially during completeFile (close) 
operations. If a condition arises wherein a write's pipeline may have gotten 
only < minimum nodes in it, the completeFile operation does not successfully 
close the file and the client begins to hang waiting for NN to replicate the 
last bad block in the background. This form of hard-guarantee can, for example, 
bring down clusters of HBase during high xceiver load on DN, or disk fill-ups 
on many of them, etc..

I propose we should split the property in two parts:
* dfs.namenode.replication.min
** Stays the same name, but only checks file creation time replication factor 
value and during adjustments made via setrep/etc.
* dfs.namenode.replication.min.for.write
** New property that disconnects the rest of the checks from the above 
property, such as the checks done during block commit, file complete/close, 
safemode checks for block availability, etc..

Alternatively, we may also choose to remove the client-side hang of 
completeFile/close calls with a set number of retries. This would further 
require discussion about how a file-closure handle ought to be handled.
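
For illustration only - how the proposed split would look from an admin's point 
of view, expressed here as Configuration calls rather than hdfs-site.xml. 
dfs.namenode.replication.min exists today; the .for.write key is merely the name 
proposed above, not an existing property.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class MinReplicationSplitSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Existing knob: minimum replication users may request at create/setrep time.
    conf.setInt("dfs.namenode.replication.min", 2);
    // Proposed knob (name from the description above, not yet a real property):
    // minimum live replicas required for block commit / completeFile / safemode.
    conf.setInt("dfs.namenode.replication.min.for.write", 1);
    System.out.println(conf.get("dfs.namenode.replication.min") + " / "
        + conf.get("dfs.namenode.replication.min.for.write"));
  }
}
{code}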


> Provide a way to apply a minimum replication factor aside of strict minimum 
> live replicas feature
> -
>
> Key: HDFS-2936
> URL: https://issues.apache.org/jira/browse/HDFS-2936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 0.23.0
>Reporter: Harsh J
> Attachments: HDFS-2936.patch
>
>
> If an admin wishes to enforce replication today for all the users of their 
> cluster, they may set {{dfs.namenode.replication.min}}. This property 
> prevents users from creating files with < expected replication factor.
> However, the value of minimum replication set by the above value is also 
> checked at several other points, especially during completeFile (close) 
> operations. If a condition arises wherein a write's pipeline may have gotten 
> only < minimum nodes in it, the completeFile operation does not successfully 
> close the file and the client begins to hang waiting for NN to replicate the 
> last bad block in the background. This form of hard-guarantee can, for 
> example, bring down clusters of HBase during high xceiver load on DN, or disk 
> fill-ups on many of them, etc..
> I propose we should split the property in two parts:
> * dfs.namenode.replication.min
> ** Stays the same name, but only checks file creation time replication factor 
> value and during adjustments made via setrep/etc.
> * dfs.namenode.replication.min.for.write
> ** New property that disconnects the rest of the checks from the above 
> property, such as the checks done during block commit, file complete/close, 
> safemode checks for block availability, etc..
> Alternatively, we may also choose to remove the client-side hang of 
> completeFile/close calls with a set number of retries. This would further 
> require discussion about how a file-closure handle ought to be handled.

[jira] [Updated] (HDFS-2936) Provide a way to apply a minimum replication factor aside of strict minimum live replicas feature

2016-11-03 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-2936:
--
Summary: Provide a way to apply a minimum replication factor aside of 
strict minimum live replicas feature  (was: File close()-ing hangs indefinitely 
if the number of live blocks does not match the minimum replication)

> Provide a way to apply a minimum replication factor aside of strict minimum 
> live replicas feature
> -
>
> Key: HDFS-2936
> URL: https://issues.apache.org/jira/browse/HDFS-2936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 0.23.0
>Reporter: Harsh J
>Assignee: Harsh J
> Attachments: HDFS-2936.patch
>
>
> If an admin wishes to enforce replication today for all the users of their 
> cluster, he may set {{dfs.namenode.replication.min}}. This property prevents 
> users from creating files with < expected replication factor.
> However, the value of minimum replication set by the above value is also 
> checked at several other points, especially during completeFile (close) 
> operations. If a condition arises wherein a write's pipeline may have gotten 
> only < minimum nodes in it, the completeFile operation does not successfully 
> close the file and the client begins to hang waiting for NN to replicate the 
> last bad block in the background. This form of hard-guarantee can, for 
> example, bring down clusters of HBase during high xceiver load on DN, or disk 
> fill-ups on many of them, etc..
> I propose we should split the property in two parts:
> * dfs.namenode.replication.min
> ** Stays the same name, but only checks file creation time replication factor 
> value and during adjustments made via setrep/etc.
> * dfs.namenode.replication.min.for.write
> ** New property that disconnects the rest of the checks from the above 
> property, such as the checks done during block commit, file complete/close, 
> safemode checks for block availability, etc..
> Alternatively, we may also choose to remove the client-side hang of 
> completeFile/close calls with a set number of retries. This would further 
> require discussion about how a file-closure handle ought to be handled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-2936) Provide a way to apply a minimum replication factor aside of strict minimum live replicas feature

2016-11-03 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-2936:
--
Assignee: (was: Harsh J)

> Provide a way to apply a minimum replication factor aside of strict minimum 
> live replicas feature
> -
>
> Key: HDFS-2936
> URL: https://issues.apache.org/jira/browse/HDFS-2936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 0.23.0
>Reporter: Harsh J
> Attachments: HDFS-2936.patch
>
>
> If an admin wishes to enforce replication today for all the users of their 
> cluster, he may set {{dfs.namenode.replication.min}}. This property prevents 
> users from creating files with < expected replication factor.
> However, the value of minimum replication set by the above value is also 
> checked at several other points, especially during completeFile (close) 
> operations. If a condition arises wherein a write's pipeline may have gotten 
> only < minimum nodes in it, the completeFile operation does not successfully 
> close the file and the client begins to hang waiting for NN to replicate the 
> last bad block in the background. This form of hard-guarantee can, for 
> example, bring down clusters of HBase during high xceiver load on DN, or disk 
> fill-ups on many of them, etc..
> I propose we should split the property in two parts:
> * dfs.namenode.replication.min
> ** Stays the same name, but only checks file creation time replication factor 
> value and during adjustments made via setrep/etc.
> * dfs.namenode.replication.min.for.write
> ** New property that disconnects the rest of the checks from the above 
> property, such as the checks done during block commit, file complete/close, 
> safemode checks for block availability, etc..
> Alternatively, we may also choose to remove the client-side hang of 
> completeFile/close calls with a set number of retries. This would further 
> require discussion about how a file-closure handle ought to be handled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11012) Unnecessary INFO logging on DFSClients for InvalidToken

2016-10-14 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-11012:
---
Status: Patch Available  (was: Open)

> Unnecessary INFO logging on DFSClients for InvalidToken
> ---
>
> Key: HDFS-11012
> URL: https://issues.apache.org/jira/browse/HDFS-11012
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs
>Affects Versions: 2.5.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
>
> In situations where a DFSClient would receive an InvalidToken exception (as 
> described at [1]), a single retry is automatically made (as observed at [2]). 
> However, we still print an INFO message into the DFSClient's logger even 
> though the message is expected in some scenarios. This should ideally be a 
> DEBUG level message to avoid confusion.
> If the retry or the retried attempt fails, the final clause handles it anyway 
> and prints out a proper WARN (as seen at [3]) so the INFO is unnecessary.
> [1] - 
> https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1330-L1356
> [2] - 
> https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L649-L651
>  and 
> https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1163-L1170
> [3] - 
> https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L652-L658
>  and 
> https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1171-L1177



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11012) Unnecessary INFO logging on DFSClients for InvalidToken

2016-10-14 Thread Harsh J (JIRA)
Harsh J created HDFS-11012:
--

 Summary: Unnecessary INFO logging on DFSClients for InvalidToken
 Key: HDFS-11012
 URL: https://issues.apache.org/jira/browse/HDFS-11012
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: fs
Affects Versions: 2.5.0
Reporter: Harsh J
Assignee: Harsh J
Priority: Minor


In situations where a DFSClient would receive an InvalidToken exception (as 
described at [1]), a single retry is automatically made (as observed at [2]). 
However, we still print an INFO message into the DFSClient's logger even though 
the message is expected in some scenarios. This should ideally be a DEBUG level 
message to avoid confusion.

If the retry or the retried attempt fails, the final clause handles it anyway 
and prints out a proper WARN (as seen at [3]) so the INFO is unnecessary.

[1] - 
https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1330-L1356
[2] - 
https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L649-L651
 and 
https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1163-L1170
[3] - 
https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L652-L658
 and 
https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1171-L1177
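
For illustration, the shape of the proposed change in commons-logging style as 
used by the 2.x DFS client; the class and method names here are made up and this 
is not the DFSInputStream code itself.

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class TokenRetryLoggingSketch {
  private static final Log LOG = LogFactory.getLog(TokenRetryLoggingSketch.class);

  static void onInvalidToken(Exception cause, boolean willRetry) {
    if (willRetry) {
      // Expected case: the token is simply refetched and the read retried once,
      // so DEBUG is enough and avoids noisy client logs.
      if (LOG.isDebugEnabled()) {
        LOG.debug("Access token invalid or expired; refetching and retrying", cause);
      }
    } else {
      // Retry exhausted: the existing WARN path already covers this.
      LOG.warn("Failed to read block after refreshing the access token", cause);
    }
  }
}
{code}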



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-3584) Blocks are getting marked as corrupt with append operation under high load.

2016-06-02 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312138#comment-15312138
 ] 

Harsh J commented on HDFS-3584:
---

HDFS-10240 appears to report a similar issue.

> Blocks are getting marked as corrupt with append operation under high load.
> ---
>
> Key: HDFS-3584
> URL: https://issues.apache.org/jira/browse/HDFS-3584
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Brahma Reddy Battula
>
> Scenario:
> = 
> 1. There are two clients, cli1 and cli2. cli1 writes a file F1 and does not 
> close it.
> 2. cli2 calls append on the unclosed file, which triggers a lease recovery.
> 3. cli1 then closes the file.
> 4. Lease recovery completes with an updated GS on the DN; when the block report 
> arrives, the GS mismatch causes the block to be marked corrupt.
> 5. The subsequent CommitBlockSync also fails, since the file was already closed 
> by cli1 and its state in the NN is Finalized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10240) Race between close/recoverLease leads to missing block

2016-05-22 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295930#comment-15295930
 ] 

Harsh J commented on HDFS-10240:


This seems similar to the situation described in HDFS-3584. The comment there 
from [~umamaheswararao] suggests a possibly better approach: transfer the file's 
lease to the recovering client upfront, to prevent the older client from coming 
back and closing the file concurrently (given that the close call already checks 
for an active lease). Do you think such an approach would be better?

> Race between close/recoverLease leads to missing block
> --
>
> Key: HDFS-10240
> URL: https://issues.apache.org/jira/browse/HDFS-10240
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: zhouyingchao
>Assignee: zhouyingchao
> Attachments: HDFS-10240-001.patch
>
>
> We got a missing block in our cluster, and logs related to the missing block 
> are as follows:
> 2016-03-28,10:00:06,188 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocateBlock: XX. BP-219149063-10.108.84.25-1446859315800 
> blk_1226490256_153006345{blockUCState=UNDER_CONSTRUCTION, 
> primaryNodeIndex=-1, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
> 2016-03-28,10:00:06,205 INFO BlockStateChange: BLOCK* 
> blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
>  recovery started, 
> primary=ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]
> 2016-03-28,10:00:06,205 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.internalReleaseLease: File XX has not been closed. Lease 
> recovery is in progress. RecoveryId = 153006357 for block 
> blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]}
> 2016-03-28,10:00:06,248 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* 
> checkFileProgress: blk_1226490256_153006345{blockUCState=COMMITTED, 
> primaryNodeIndex=2, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]}
>  has not reached minimal replication 1
> 2016-03-28,10:00:06,358 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.114.5.53:11402 is added to 
> blk_1226490256_153006345{blockUCState=COMMITTED, primaryNodeIndex=2, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]}
>  size 139
> 2016-03-28,10:00:06,441 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.114.5.44:11402 is added to blk_1226490256_153006345 size 
> 139
> 2016-03-28,10:00:06,660 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.114.6.14:11402 is added to blk_1226490256_153006345 size 
> 139
> 2016-03-28,10:00:08,808 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
> commitBlockSynchronization(lastblock=BP-219149063-10.108.84.25-1446859315800:blk_1226490256_153006345,
>  newgenerationstamp=153006357, newlength=139, newtargets=[10.114.6.14:11402, 
> 10.114.5.53:11402, 10.114.5.44:11402], closeFile=true, deleteBlock=false)
> 2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on 
> 10.114.6.14:11402 by /10.114.6.14 because block is COMPLETE and reported 
> genstamp 153006357 does not match genstamp in block map 153006345
> 2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on 
> 10.114.5.53:11402 by /10.114.5.53 because block is COMPLETE and reported 
> genstamp 153006357 does not match genstamp in block map 153006345
> 2016-03-28,10:00:08,837 INFO BlockStateChange: BLOCK 
> 

[jira] [Resolved] (HDFS-3557) provide means of escaping special characters to `hadoop fs` command

2016-04-22 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-3557.
---
Resolution: Not A Problem

Resolving per comment (also stale)

> provide means of escaping special characters to `hadoop fs` command
> ---
>
> Key: HDFS-3557
> URL: https://issues.apache.org/jira/browse/HDFS-3557
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 0.20.2
>Reporter: Jeff Hodges
>Priority: Minor
>
> When running an investigative job, I used a date parameter that selected 
> multiple directories for the input (e.g. "my_data/2012/06/{18,19,20}"). The job 
> used this same date parameter when creating the output directory.
> But `hadoop fs` was unable to ls, getmerge, or rmr it until I used the glob 
> operator "?" and mv to change the name (that is, `-mv 
> output/2012/06/?18,19,20? foobar`).
> Shells and filesystems for other systems provide a means of escaping "special 
> characters" generically, but there seems to be no such means in HDFS/`hadoop 
> fs`. Providing one would be a great way to make accessing HDFS more robust.
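
As a side note, the programmatic FileSystem API treats paths literally (only 
FsShell and globStatus expand wildcards), so a name like the one above can be 
listed or removed without any escaping; a minimal sketch, with the path reused 
from the description:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LiteralPathCleanup {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    // The Java API does not glob, so the braces are part of the literal name here.
    Path awkward = new Path("output/2012/06/{18,19,20}");
    for (FileStatus st : fs.listStatus(awkward)) {
      System.out.println(st.getPath());
    }
    fs.delete(awkward, true); // equivalent of the 'rmr' the shell could not do
  }
}
{code}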



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-10296) FileContext.getDelegationTokens() fails to obtain KMS delegation token

2016-04-21 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251966#comment-15251966
 ] 

Harsh J commented on HDFS-10296:


We do special handling in DistributedFileSystem#addDelegationTokens to detect 
TDE features and inject an additional KMS DT; this enhancement is missing in 
FileContext.

> FileContext.getDelegationTokens() fails to obtain KMS delegation token
> --
>
> Key: HDFS-10296
> URL: https://issues.apache.org/jira/browse/HDFS-10296
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption
>Affects Versions: 2.6.0
> Environment: CDH 5.6 with a Java KMS
>Reporter: Andreas Neumann
>
> This little program demonstrates the problem: With FileSystem, we can get 
> both the HDFS and the kms-dt token, whereas with FileContext, we can only 
> obtain the HDFS delegation token. 
> {code}
> public class SimpleTest {
>   public static void main(String[] args) throws IOException {
>     YarnConfiguration hConf = new YarnConfiguration();
>     String renewer = "renewer";
>     FileContext fc = FileContext.getFileContext(hConf);
>     List<Token<?>> tokens = fc.getDelegationTokens(new Path("/"), renewer);
>     for (Token<?> token : tokens) {
>       System.out.println("Token from FC: " + token);
>     }
>     FileSystem fs = FileSystem.get(hConf);
>     for (Token<?> token : fs.addDelegationTokens(renewer, new Credentials())) {
>       System.out.println("Token from FS: " + token);
>     }
>   }
> }
> {code}
> Sample output (host/user name x'ed out):
> {noformat}
> Token from FC: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:xxx, Ident: 
> (HDFS_DELEGATION_TOKEN token 49 for xxx)
> Token from FS: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:xxx, Ident: 
> (HDFS_DELEGATION_TOKEN token 50 for xxx)
> Token from FS: Kind: kms-dt, Service: xx.xx.xx.xx:16000, Ident: 00 04 63 64 
> 61 70 07 72 65 6e 65 77 65 72 00 8a 01 54 16 96 c2 95 8a 01 54 3a a3 46 95 0e 
> 02
> {noformat}
> Apparently FileContext does not return the KMS token. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-8161) Both Namenodes are in standby State

2016-04-15 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243980#comment-15243980
 ] 

Harsh J edited comment on HDFS-8161 at 4/16/16 3:26 AM:


[~brahmareddy] - was this encountered on virtual machine hosts, or physical 
ones? Asking because 
https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.v3hx212ne
 (H/T [~daisuke.kobayashi])


was (Author: qwertymaniac):
[~brahmareddy] - was this encountered on virtual machine hosts, or physical 
ones? Asking because 
https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.v3hx212ne

> Both Namenodes are in standby State
> ---
>
> Key: HDFS-8161
> URL: https://issues.apache.org/jira/browse/HDFS-8161
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: auto-failover
>Affects Versions: 2.6.0
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
> Attachments: ACTIVEBreadcumb and StandbyElector.txt
>
>
> Suspected Scenario:
> 
> Start the cluster with three nodes.
> Reboot the machine where ZKFC is not running (the active NN's ZKFC holds its 
> session with the ZK on this machine).
> The active NN's ZKFC session then expires and it tries to re-establish a 
> connection with another ZK. By that time the standby NN's ZKFC has tried to 
> fence the old active, created the active breadcrumb, and transitioned the 
> standby NN to the active state.
> But it immediately fences back to the standby state (this is the doubt).
> Hence both NNs end up in the standby state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8161) Both Namenodes are in standby State

2016-04-15 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243980#comment-15243980
 ] 

Harsh J commented on HDFS-8161:
---

[~brahmareddy] - was this encountered on virtual machine hosts, or physical 
ones? Asking because 
https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.v3hx212ne

> Both Namenodes are in standby State
> ---
>
> Key: HDFS-8161
> URL: https://issues.apache.org/jira/browse/HDFS-8161
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: auto-failover
>Affects Versions: 2.6.0
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
> Attachments: ACTIVEBreadcumb and StandbyElector.txt
>
>
> Suspected Scenario:
> 
> Start the cluster with three nodes.
> Reboot the machine where ZKFC is not running (the active NN's ZKFC holds its 
> session with the ZK on this machine).
> The active NN's ZKFC session then expires and it tries to re-establish a 
> connection with another ZK. By that time the standby NN's ZKFC has tried to 
> fence the old active, created the active breadcrumb, and transitioned the 
> standby NN to the active state.
> But it immediately fences back to the standby state (this is the doubt).
> Hence both NNs end up in the standby state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2016-04-03 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-10237:
---
Attachment: HDFS-10237.002.patch

> Support specifying checksum type in WebHDFS/HTTPFS writers
> --
>
> Key: HDFS-10237
> URL: https://issues.apache.org/jira/browse/HDFS-10237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: webhdfs
>Affects Versions: 2.8.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, 
> HDFS-10237.002.patch, HDFS-10237.002.patch
>
>
> Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
> writer, as you can with the regular DFS writer (done via HADOOP-8240)
> This JIRA covers the changes necessary to bring the same ability to WebHDFS 
> and HTTPFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2016-04-01 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-10237:
---
Status: Open  (was: Patch Available)

> Support specifying checksum type in WebHDFS/HTTPFS writers
> --
>
> Key: HDFS-10237
> URL: https://issues.apache.org/jira/browse/HDFS-10237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: webhdfs
>Affects Versions: 2.8.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, 
> HDFS-10237.002.patch
>
>
> Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
> writer, as you can with the regular DFS writer (done via HADOOP-8240)
> This JIRA covers the changes necessary to bring the same ability to WebHDFS 
> and HTTPFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2016-04-01 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-10237:
---
Status: Patch Available  (was: Open)

> Support specifying checksum type in WebHDFS/HTTPFS writers
> --
>
> Key: HDFS-10237
> URL: https://issues.apache.org/jira/browse/HDFS-10237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: webhdfs
>Affects Versions: 2.8.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, 
> HDFS-10237.002.patch
>
>
> Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
> writer, as you can with the regular DFS writer (done via HADOOP-8240)
> This JIRA covers the changes necessary to bring the same ability to WebHDFS 
> and HTTPFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2016-03-31 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-10237:
---
Attachment: HDFS-10237.002.patch

Addressed most checkstyle issues, except the warnings about the number of 
parameters; those parameters are required given the overrides.

> Support specifying checksum type in WebHDFS/HTTPFS writers
> --
>
> Key: HDFS-10237
> URL: https://issues.apache.org/jira/browse/HDFS-10237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: webhdfs
>Affects Versions: 2.8.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, 
> HDFS-10237.002.patch
>
>
> Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
> writer, as you can with the regular DFS writer (done via HADOOP-8240)
> This JIRA covers the changes necessary to bring the same ability to WebHDFS 
> and HTTPFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2016-03-30 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-10237:
---
Attachment: HDFS-10237.001.patch

* Fixed the javadoc extra-parameter problem
* Fixed logic in the ChecksumOptParam that was not distinguishing the string value 
"null" from a literal null, which was breaking the non-recursive-create contract 
test

The other test failures appear unrelated.
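
For context, this is the kind of distinction involved: a default rendered as the 
string "null" has to be treated like an absent parameter, while "NULL" is also a 
legal checksum type name (checksum disabled). A minimal, hypothetical sketch, not 
the actual patch code:

{code}
// Hypothetical helper illustrating the "null" string vs. literal null pitfall.
static String resolveRequestedChecksum(String rawParamValue) {
  if (rawParamValue == null || rawParamValue.isEmpty()
      || "null".equals(rawParamValue)) {
    // Absent parameter, or a default rendered as the string "null":
    // fall back to the configured default checksum type.
    return null;
  }
  // A real request, e.g. "CRC32", "CRC32C" or "NULL" (checksum disabled).
  return rawParamValue;
}
{code}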

> Support specifying checksum type in WebHDFS/HTTPFS writers
> --
>
> Key: HDFS-10237
> URL: https://issues.apache.org/jira/browse/HDFS-10237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: webhdfs
>Affects Versions: 2.8.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch
>
>
> Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
> writer, as you can with the regular DFS writer (done via HADOOP-8240)
> This JIRA covers the changes necessary to bring the same ability to WebHDFS 
> and HTTPFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-6542) WebHDFSFileSystem doesn't transmit desired checksum type

2016-03-30 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-6542.
---
Resolution: Duplicate

I missed this JIRA when searching before I filed HDFS-10237, but now noticed 
via association to HADOOP-8240.

Since I've already posted a patch on HDFS-10237 and there's no ongoing 
work/assignee here, am marking this as a duplicate of HDFS-10237.

Sorry for the extra noise!

> WebHDFSFileSystem doesn't transmit desired checksum type
> 
>
> Key: HDFS-6542
> URL: https://issues.apache.org/jira/browse/HDFS-6542
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: Andrey Stepachev
>Priority: Minor
>
> Currently the DFSClient has the ability to specify a desired checksum type. This 
> behaviour is controlled by the dfs.checksum.type parameter, settable by the client. 
> It works with the hdfs:// filesystem, but does not work with webhdfs. It fails 
> because webhdfs uses the default checksum type initialised by the server-side 
> instance of the DFSClient.
> For example, https://issues.apache.org/jira/browse/HADOOP-8240 does not work 
> with webhdfs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2016-03-30 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-10237:
---
Attachment: HDFS-10237.000.patch

> Support specifying checksum type in WebHDFS/HTTPFS writers
> --
>
> Key: HDFS-10237
> URL: https://issues.apache.org/jira/browse/HDFS-10237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: webhdfs
>Affects Versions: 2.8.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-10237.000.patch
>
>
> Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
> writer, as you can with the regular DFS writer (done via HADOOP-8240)
> This JIRA covers the changes necessary to bring the same ability to WebHDFS 
> and HTTPFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2016-03-30 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-10237:
---
Status: Patch Available  (was: Open)

> Support specifying checksum type in WebHDFS/HTTPFS writers
> --
>
> Key: HDFS-10237
> URL: https://issues.apache.org/jira/browse/HDFS-10237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: webhdfs
>Affects Versions: 2.8.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-10237.000.patch
>
>
> Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
> writer, as you can with the regular DFS writer (done via HADOOP-8240)
> This JIRA covers the changes necessary to bring the same ability to WebHDFS 
> and HTTPFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers

2016-03-30 Thread Harsh J (JIRA)
Harsh J created HDFS-10237:
--

 Summary: Support specifying checksum type in WebHDFS/HTTPFS writers
 Key: HDFS-10237
 URL: https://issues.apache.org/jira/browse/HDFS-10237
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: webhdfs
Affects Versions: 2.8.0
Reporter: Harsh J
Assignee: Harsh J
Priority: Minor


Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS 
writer, as you can with the regular DFS writer (done via HADOOP-8240)

This JIRA covers the changes necessary to bring the same ability to WebHDFS and 
HTTPFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-10221) Add test resource dfs.hosts.json to the rat exclusions

2016-03-29 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215876#comment-15215876
 ] 

Harsh J commented on HDFS-10221:


It appears we can also ask the parser to allow comments as a feature, 
depending on the parser; for example, see this change and JIRA: LENS-729 / 
https://issues.apache.org/jira/secure/attachment/12750264/LENS-729.patch

This would help preserve the license header rather than excluding the file, just 
in case.
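
If the parser in question is Jackson, a minimal sketch of the sort of toggle meant 
here (assuming the test reads the JSON with an ObjectMapper; the file name is just 
the one from this JIRA):

{code}
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class CommentedJsonSketch {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // Allows // and /* ... */ comments, so an ASF license header could be kept
    // inside dfs.hosts.json instead of excluding the file from the rat check.
    mapper.configure(JsonParser.Feature.ALLOW_COMMENTS, true);
    JsonNode hosts = mapper.readTree(new File("dfs.hosts.json"));
    System.out.println(hosts);
  }
}
{code}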

> Add test resource dfs.hosts.json to the rat exclusions
> --
>
> Key: HDFS-10221
> URL: https://issues.apache.org/jira/browse/HDFS-10221
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.8.0
> Environment: Jenkins
>Reporter: Ming Ma
>Assignee: Ming Ma
>Priority: Blocker
> Attachments: HDFS-10221.patch
>
>
> A new test resource dfs.hosts.json was added in HDFS-9005 for better 
> readability. Since the JSON format doesn't allow comments, the file cannot carry 
> an ASF license header and fails the license check. To address this, we can add 
> the file to the rat exclusions list in the pom.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-19 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9949:
--
Description: 
In the following scenario, in releases without HDFS-8211, the DN may regenerate 
its UUIDs unintentionally.

0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
1. Stop DN
2. Unmount the second disk, {{/data2/dfs/dn}}
3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path
4. Start DN
5. DN now considers /data2/dfs/dn empty so formats it, but during the format it 
uses {{datanode.getDatanodeUuid()}} which is null until register() is called.
6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
gets called with successful condition, and it causes a new generation of UUID 
which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
{{/data2/dfs/dn/current/VERSION}}.
7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
realised)
8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} 
file to be the original one again on it (mounting masks the root path that we 
last generated upon).
9. DN fails to start up cause it finds mismatched UUID between the two disks, 
with an error similar to:
{code}WARN org.apache.hadoop.hdfs.server.common.Storage: 
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory 
/data/2/dfs/dn is in an inconsistent state: Root /data/2/dfs/dn: 
DatanodeUuid=fe3a848f-beb8-4fcb-9581-c6fb1c701cc4, does not match 
8ea9493c-7097-4ee3-96a3-0cc4dfc1d6ac from other StorageDirectory.{code}

The DN should not generate a new UUID if one of the storage disks already have 
the older one.

HDFS-8211 unintentionally fixes this by changing the 
{{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
representation of the UUID vs. the {{DatanodeID}} object which only gets 
available (non-null) _after_ the registration.

It'd still be good to add a direct test case to the above scenario that passes 
on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a 
regression around this in future.

  was:
In the following scenario, in releases without HDFS-8211, the DN may regenerate 
its UUIDs unintentionally.

0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
1. Stop DN
2. Unmount the second disk, {{/data2/dfs/dn}}
3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path
4. Start DN
5. DN now considers /data2/dfs/dn empty so formats it, but during the format it 
uses {{datanode.getDatanodeUuid()}} which is null until register() is called.
6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
gets called with successful condition, and it causes a new generation of UUID 
which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
{{/data2/dfs/dn/current/VERSION}}.
7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
realised)
8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} 
file to be the original one again on it (mounting masks the root path that we 
last generated upon).
9. DN fails to start up cause it finds mismatched UUID between the two disks

The DN should not generate a new UUID if one of the storage disks already have 
the older one.

HDFS-8211 unintentionally fixes this by changing the 
{{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
representation of the UUID vs. the {{DatanodeID}} object which only gets 
available (non-null) _after_ the registration.

It'd still be good to add a direct test case to the above scenario that passes 
on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a 
regression around this in future.


> Testcase for catching DN UUID regeneration regression
> -
>
> Key: HDFS-9949
> URL: https://issues.apache.org/jira/browse/HDFS-9949
> Project: Hadoop HDFS
>  Issue Type: Test
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, 
> HDFS-9949.000.patch, HDFS-9949.001.patch
>
>
> In the following scenario, in releases without HDFS-8211, the DN may 
> regenerate its UUIDs unintentionally.
> 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
> 1. Stop DN
> 2. Unmount the second disk, {{/data2/dfs/dn}}
> 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root 
> path
> 4. Start DN
> 5. DN now considers /data2/dfs/dn empty so formats it, but during the format 
> it uses {{datanode.getDatanodeUuid()}} which is null until register() is 
> called.
> 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
> gets called with successful condition, and it causes a new 

[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-19 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9949:
--
Attachment: HDFS-9949.001.patch

Thank you for reviewing [~cmccabe]! Agreed, I should've used a smaller sleep. 
Updated the trunk patch with the suggested {{50ms}}. Test runs on my local 
machine in ~3s.

> Testcase for catching DN UUID regeneration regression
> -
>
> Key: HDFS-9949
> URL: https://issues.apache.org/jira/browse/HDFS-9949
> Project: Hadoop HDFS
>  Issue Type: Test
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, 
> HDFS-9949.000.patch, HDFS-9949.001.patch
>
>
> In the following scenario, in releases without HDFS-8211, the DN may 
> regenerate its UUIDs unintentionally.
> 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
> 1. Stop DN
> 2. Unmount the second disk, {{/data2/dfs/dn}}
> 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root 
> path
> 4. Start DN
> 5. DN now considers /data2/dfs/dn empty so formats it, but during the format 
> it uses {{datanode.getDatanodeUuid()}} which is null until register() is 
> called.
> 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
> gets called with successful condition, and it causes a new generation of UUID 
> which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
> {{/data2/dfs/dn/current/VERSION}}.
> 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
> realised)
> 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the 
> {{VERSION}} file to be the original one again on it (mounting masks the root 
> path that we last generated upon).
> 9. DN fails to start up cause it finds mismatched UUID between the two disks
> The DN should not generate a new UUID if one of the storage disks already 
> have the older one.
> HDFS-8211 unintentionally fixes this by changing the 
> {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
> representation of the UUID vs. the {{DatanodeID}} object which only gets 
> available (non-null) _after_ the registration.
> It'd still be good to add a direct test case to the above scenario that 
> passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can 
> catch a regression around this in future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-19 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9949:
--
Description: 
In the following scenario, in releases without HDFS-8211, the DN may regenerate 
its UUIDs unintentionally.

0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
1. Stop DN
2. Unmount the second disk, {{/data2/dfs/dn}}
3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path
4. Start DN
5. DN now considers /data2/dfs/dn empty so formats it, but during the format it 
uses {{datanode.getDatanodeUuid()}} which is null until register() is called.
6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
gets called with successful condition, and it causes a new generation of UUID 
which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
{{/data2/dfs/dn/current/VERSION}}.
7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
realised)
8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} 
file to be the original one again on it (mounting masks the root path that we 
last generated upon).
9. DN fails to start up cause it finds mismatched UUID between the two disks, 
with an error similar to:
{code}WARN org.apache.hadoop.hdfs.server.common.Storage: 
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory 
/data/2/dfs/dn is in an inconsistent state: Root /data/2/dfs/dn: 
DatanodeUuid=fe3a848f-beb8-4fcb-9581-c6fb1c701cc4, does not match 
8ea9493c-7097-4ee3-96a3-0cc4dfc1d6ac from other StorageDirectory.{code}

The DN should not generate a new UUID if one of the storage disks already have 
the older one.

HDFS-8211 unintentionally fixes this by changing the 
{{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
representation of the UUID vs. the {{DatanodeID}} object which only gets 
available (non-null) _after_ the registration.

It'd still be good to add a direct test case to the above scenario that passes 
on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a 
regression around this in future.

  was:
In the following scenario, in releases without HDFS-8211, the DN may regenerate 
its UUIDs unintentionally.

0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
1. Stop DN
2. Unmount the second disk, {{/data2/dfs/dn}}
3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path
4. Start DN
5. DN now considers /data2/dfs/dn empty so formats it, but during the format it 
uses {{datanode.getDatanodeUuid()}} which is null until register() is called.
6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
gets called with successful condition, and it causes a new generation of UUID 
which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
{{/data2/dfs/dn/current/VERSION}}.
7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
realised)
8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} 
file to be the original one again on it (mounting masks the root path that we 
last generated upon).
9. DN fails to start up cause it finds mismatched UUID between the two disks, 
with an error similar to:
{code}WARN org.apache.hadoop.hdfs.server.common.Storage: 
{{org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory 
/data/2/dfs/dn is in an inconsistent state: Root /data/2/dfs/dn: 
DatanodeUuid=fe3a848f-beb8-4fcb-9581-c6fb1c701cc4, does not match 
8ea9493c-7097-4ee3-96a3-0cc4dfc1d6ac from other StorageDirectory.{code}

The DN should not generate a new UUID if one of the storage disks already have 
the older one.

HDFS-8211 unintentionally fixes this by changing the 
{{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
representation of the UUID vs. the {{DatanodeID}} object which only gets 
available (non-null) _after_ the registration.

It'd still be good to add a direct test case to the above scenario that passes 
on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a 
regression around this in future.


> Testcase for catching DN UUID regeneration regression
> -
>
> Key: HDFS-9949
> URL: https://issues.apache.org/jira/browse/HDFS-9949
> Project: Hadoop HDFS
>  Issue Type: Test
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, 
> HDFS-9949.000.patch, HDFS-9949.001.patch
>
>
> In the following scenario, in releases without HDFS-8211, the DN may 
> regenerate its UUIDs unintentionally.
> 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
> 1. Stop DN
> 2. Unmount the second disk, {{/data2/dfs/dn}}
> 3. Create (in the scenario, this was an 

[jira] [Commented] (HDFS-9940) Rename dfs.balancer.max.concurrent.moves to avoid confusion

2016-03-13 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192340#comment-15192340
 ] 

Harsh J commented on HDFS-9940:
---

Agreed on the confusion when there's role-based config management involved. I'd 
also vote for having the Balancer discover the property from the DNs 
directly/dynamically instead (HDFS-7466), which gives the added benefit of allowing 
per-DN flexibility should it be required in future.

> Rename dfs.balancer.max.concurrent.moves to avoid confusion
> ---
>
> Key: HDFS-9940
> URL: https://issues.apache.org/jira/browse/HDFS-9940
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.6.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
>  Labels: supportability
> Fix For: 2.8.0
>
>
> It is very confusing for both Balancer and Datanode to use the same property 
> {{dfs.datanode.balance.max.concurrent.moves}}. It is especially so for the 
> Balancer because the property has "datanode" in the name string. Many 
> customers forget to set the property for the Balancer.
> Change the Balancer to use a new property 
> {{dfs.balancer.max.concurrent.moves}}.
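
A minimal sketch of the end state the report asks for, assuming the new key keeps 
the name proposed above (how the Balancer actually resolves the value is 
version-dependent):

{code}
import org.apache.hadoop.conf.Configuration;

public class BalancerMovesConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Existing DataNode-side key, confusingly also used for the Balancer today.
    conf.setInt("dfs.datanode.balance.max.concurrent.moves", 50);
    // Balancer-side key as proposed in this report, so the two roles can be
    // tuned independently under role-based config management.
    conf.setInt("dfs.balancer.max.concurrent.moves", 50);
    System.out.println("balancer moves limit = "
        + conf.getInt("dfs.balancer.max.concurrent.moves", -1));
  }
}
{code}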



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-13 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192335#comment-15192335
 ] 

Harsh J commented on HDFS-9949:
---

bq. -1 unit *

The patch only adds a test case; the failing tests appear to be flaky rather than 
related to the test added here for trunk and branch-2.

The other patch (named for branch-2.7) is a proof of the test, but is not intended 
for commit: HDFS-8211 is not in branch-2.7, which causes the test to always fail 
there.

> Testcase for catching DN UUID regeneration regression
> -
>
> Key: HDFS-9949
> URL: https://issues.apache.org/jira/browse/HDFS-9949
> Project: Hadoop HDFS
>  Issue Type: Test
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, 
> HDFS-9949.000.patch
>
>
> In the following scenario, in releases without HDFS-8211, the DN may 
> regenerate its UUIDs unintentionally.
> 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
> 1. Stop DN
> 2. Unmount the second disk, {{/data2/dfs/dn}}
> 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root 
> path
> 4. Start DN
> 5. DN now considers /data2/dfs/dn empty so formats it, but during the format 
> it uses {{datanode.getDatanodeUuid()}} which is null until register() is 
> called.
> 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
> gets called with successful condition, and it causes a new generation of UUID 
> which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
> {{/data2/dfs/dn/current/VERSION}}.
> 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
> realised)
> 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the 
> {{VERSION}} file to be the original one again on it (mounting masks the root 
> path that we last generated upon).
> 9. DN fails to start up cause it finds mismatched UUID between the two disks
> The DN should not generate a new UUID if one of the storage disks already 
> have the older one.
> HDFS-8211 unintentionally fixes this by changing the 
> {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
> representation of the UUID vs. the {{DatanodeID}} object which only gets 
> available (non-null) _after_ the registration.
> It'd still be good to add a direct test case to the above scenario that 
> passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can 
> catch a regression around this in future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-12 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9949:
--
Target Version/s: 3.0.0, 2.8.0, 2.9.0
  Status: Patch Available  (was: Open)

> Testcase for catching DN UUID regeneration regression
> -
>
> Key: HDFS-9949
> URL: https://issues.apache.org/jira/browse/HDFS-9949
> Project: Hadoop HDFS
>  Issue Type: Test
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, 
> HDFS-9949.000.patch
>
>
> In the following scenario, in releases without HDFS-8211, the DN may 
> regenerate its UUIDs unintentionally.
> 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
> 1. Stop DN
> 2. Unmount the second disk, {{/data2/dfs/dn}}
> 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root 
> path
> 4. Start DN
> 5. DN now considers /data2/dfs/dn empty so formats it, but during the format 
> it uses {{datanode.getDatanodeUuid()}} which is null until register() is 
> called.
> 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
> gets called with successful condition, and it causes a new generation of UUID 
> which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
> {{/data2/dfs/dn/current/VERSION}}.
> 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
> realised)
> 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the 
> {{VERSION}} file to be the original one again on it (mounting masks the root 
> path that we last generated upon).
> 9. DN fails to start up cause it finds mismatched UUID between the two disks
> The DN should not generate a new UUID if one of the storage disks already 
> have the older one.
> HDFS-8211 unintentionally fixes this by changing the 
> {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
> representation of the UUID vs. the {{DatanodeID}} object which only gets 
> available (non-null) _after_ the registration.
> It'd still be good to add a direct test case to the above scenario that 
> passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can 
> catch a regression around this in future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-12 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9949:
--
Attachment: (was: HDFS-9949.000.patch)

> Testcase for catching DN UUID regeneration regression
> -
>
> Key: HDFS-9949
> URL: https://issues.apache.org/jira/browse/HDFS-9949
> Project: Hadoop HDFS
>  Issue Type: Test
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, 
> HDFS-9949.000.patch
>
>
> In the following scenario, in releases without HDFS-8211, the DN may 
> regenerate its UUIDs unintentionally.
> 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
> 1. Stop DN
> 2. Unmount the second disk, {{/data2/dfs/dn}}
> 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root 
> path
> 4. Start DN
> 5. DN now considers /data2/dfs/dn empty so formats it, but during the format 
> it uses {{datanode.getDatanodeUuid()}} which is null until register() is 
> called.
> 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
> gets called with successful condition, and it causes a new generation of UUID 
> which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
> {{/data2/dfs/dn/current/VERSION}}.
> 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
> realised)
> 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the 
> {{VERSION}} file to be the original one again on it (mounting masks the root 
> path that we last generated upon).
> 9. DN fails to start up cause it finds mismatched UUID between the two disks
> The DN should not generate a new UUID if one of the storage disks already 
> have the older one.
> HDFS-8211 unintentionally fixes this by changing the 
> {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
> representation of the UUID vs. the {{DatanodeID}} object which only gets 
> available (non-null) _after_ the registration.
> It'd still be good to add a direct test case to the above scenario that 
> passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can 
> catch a regression around this in future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-12 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9949:
--
Attachment: HDFS-9949.000.patch

> Testcase for catching DN UUID regeneration regression
> -
>
> Key: HDFS-9949
> URL: https://issues.apache.org/jira/browse/HDFS-9949
> Project: Hadoop HDFS
>  Issue Type: Test
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, 
> HDFS-9949.000.patch
>
>
> In the following scenario, in releases without HDFS-8211, the DN may 
> regenerate its UUIDs unintentionally.
> 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
> 1. Stop DN
> 2. Unmount the second disk, {{/data2/dfs/dn}}
> 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root 
> path
> 4. Start DN
> 5. DN now considers /data2/dfs/dn empty so formats it, but during the format 
> it uses {{datanode.getDatanodeUuid()}} which is null until register() is 
> called.
> 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
> gets called with successful condition, and it causes a new generation of UUID 
> which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
> {{/data2/dfs/dn/current/VERSION}}.
> 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
> realised)
> 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the 
> {{VERSION}} file to be the original one again on it (mounting masks the root 
> path that we last generated upon).
> 9. DN fails to start up cause it finds mismatched UUID between the two disks
> The DN should not generate a new UUID if one of the storage disks already 
> have the older one.
> HDFS-8211 unintentionally fixes this by changing the 
> {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
> representation of the UUID vs. the {{DatanodeID}} object which only gets 
> available (non-null) _after_ the registration.
> It'd still be good to add a direct test case to the above scenario that 
> passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can 
> catch a regression around this in future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-12 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9949:
--
Attachment: HDFS-9949.000.branch-2.7.not-for-commit.patch
HDFS-9949.000.patch

> Testcase for catching DN UUID regeneration regression
> -
>
> Key: HDFS-9949
> URL: https://issues.apache.org/jira/browse/HDFS-9949
> Project: Hadoop HDFS
>  Issue Type: Test
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Harsh J
>Priority: Minor
> Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, 
> HDFS-9949.000.patch
>
>
> In the following scenario, in releases without HDFS-8211, the DN may 
> regenerate its UUIDs unintentionally.
> 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
> 1. Stop DN
> 2. Unmount the second disk, {{/data2/dfs/dn}}
> 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root 
> path
> 4. Start DN
> 5. DN now considers /data2/dfs/dn empty so formats it, but during the format 
> it uses {{datanode.getDatanodeUuid()}} which is null until register() is 
> called.
> 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} 
> gets called with successful condition, and it causes a new generation of UUID 
> which is written to all disks {{/data1/dfs/dn/current/VERSION}} and 
> {{/data2/dfs/dn/current/VERSION}}.
> 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was 
> realised)
> 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the 
> {{VERSION}} file to be the original one again on it (mounting masks the root 
> path that we last generated upon).
> 9. DN fails to start up cause it finds mismatched UUID between the two disks
> The DN should not generate a new UUID if one of the storage disks already 
> have the older one.
> HDFS-8211 unintentionally fixes this by changing the 
> {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
> representation of the UUID vs. the {{DatanodeID}} object which only gets 
> available (non-null) _after_ the registration.
> It'd still be good to add a direct test case to the above scenario that 
> passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can 
> catch a regression around this in future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9949) Testcase for catching DN UUID regeneration regression

2016-03-12 Thread Harsh J (JIRA)
Harsh J created HDFS-9949:
-

 Summary: Testcase for catching DN UUID regeneration regression
 Key: HDFS-9949
 URL: https://issues.apache.org/jira/browse/HDFS-9949
 Project: Hadoop HDFS
  Issue Type: Test
Affects Versions: 2.6.0
Reporter: Harsh J
Assignee: Harsh J
Priority: Minor


In the following scenario, in releases without HDFS-8211, the DN may regenerate 
its UUIDs unintentionally.

0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}}
1. Stop DN
2. Unmount the second disk, {{/data2/dfs/dn}}
3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path
4. Start DN
5. The DN now considers /data2/dfs/dn empty and so formats it, but during the format it 
uses {{datanode.getDatanodeUuid()}}, which is null until register() is called.
6. As a result, after the directory loading, {{datanode.checkDatanodeUuid()}} 
gets called with a successful condition, and it causes a new UUID to be generated 
and written to all disks: {{/data1/dfs/dn/current/VERSION}} and 
{{/data2/dfs/dn/current/VERSION}}.
7. Stop the DN (in the scenario, this was when the mistake of the unmounted disk was 
realised)
8. Mount the second disk back again at {{/data2/dfs/dn}}, causing the {{VERSION}} 
file on it to be the original one again (mounting masks the root path that we 
last generated upon).
9. The DN fails to start up because it finds mismatched UUIDs between the two disks

The DN should not generate a new UUID if one of the storage disks already has 
the older one.

HDFS-8211 unintentionally fixes this by changing the 
{{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} 
representation of the UUID instead of the {{DatanodeID}} object, which only becomes 
available (non-null) _after_ registration.

It'd still be good to add a direct test case to the above scenario that passes 
on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a 
regression around this in future.
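
Independent of the MiniDFSCluster plumbing, a minimal sketch of the kind of check 
such a test needs to make (the paths are the ones from the scenario above; 
{{datanodeUuid}} is the key the DN stores in its {{VERSION}} properties file):

{code}
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Properties;

public class VersionFileUuidCheck {
  // Reads the datanodeUuid recorded in a storage directory's VERSION file.
  static String readUuid(String storageDir) throws Exception {
    Properties props = new Properties();
    try (InputStream in = new FileInputStream(storageDir + "/current/VERSION")) {
      props.load(in);
    }
    return props.getProperty("datanodeUuid");
  }

  public static void main(String[] args) throws Exception {
    String uuid1 = readUuid("/data1/dfs/dn");
    String uuid2 = readUuid("/data2/dfs/dn");
    // After the accidental reformat described above, these differ on releases
    // without HDFS-8211, which is why the DN refuses to start.
    System.out.println(uuid1.equals(uuid2)
        ? "UUIDs consistent: " + uuid1
        : "UUID mismatch: " + uuid1 + " vs " + uuid2);
  }
}
{code}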



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8475) Exception in createBlockOutputStream java.io.EOFException: Premature EOF: no length prefix available

2016-03-09 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-8475.
---
Resolution: Not A Bug

I don't see a bug reported here - the report says the write was done with a 
single replica and that the single replica was manually corrupted.

Please post to u...@hadoop.apache.org for problems observed in usage.

If you plan to reopen this, please post precise steps of how the bug may be 
reproduced.

I'd recommend looking at your NN and DN logs to trace further on what's 
happening.

> Exception in createBlockOutputStream java.io.EOFException: Premature EOF: no 
> length prefix available
> 
>
> Key: HDFS-8475
> URL: https://issues.apache.org/jira/browse/HDFS-8475
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Vinod Valecha
>Priority: Blocker
>
> Scenario:
> =
> write a file
> corrupt block manually
> Exception stack trace- 
> 2015-05-24 02:31:55.291 INFO [T-33716795] 
> [org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer] Exception in 
> createBlockOutputStream
> java.io.EOFException: Premature EOF: no length prefix available
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1155)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> [5/24/15 2:31:55:291 UTC] 02027a3b DFSClient I 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer createBlockOutputStream 
> Exception in createBlockOutputStream
>  java.io.EOFException: Premature EOF: no 
> length prefix available
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1155)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> 2015-05-24 02:31:55.291 INFO [T-33716795] 
> [org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer] Abandoning 
> BP-176676314-10.108.106.59-1402620296713:blk_1404621403_330880579
> [5/24/15 2:31:55:291 UTC] 02027a3b DFSClient I 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer nextBlockOutputStream 
> Abandoning BP-176676314-10.108.106.59-1402620296713:blk_1404621403_330880579
> 2015-05-24 02:31:55.299 INFO [T-33716795] 
> [org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer] Excluding datanode 
> 10.108.106.59:50010
> [5/24/15 2:31:55:299 UTC] 02027a3b DFSClient I 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer nextBlockOutputStream 
> Excluding datanode 10.108.106.59:50010
> 2015-05-24 02:31:55.300 WARNING [T-33716795] 
> [org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer] DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /var/db/opera/files/B4889CCDA75F9751DDBB488E5AAB433E/BE4DAEF290B7136ED6EF3D4B157441A2/BE4DAEF290B7136ED6EF3D4B157441A2-4.pag
>  could only be replicated to 0 nodes instead of minReplication (=1).  There 
> are 1 datanode(s) running and 1 node(s) are excluded in this operation.
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1384)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2477)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
> [5/24/15 2:31:55:300 UTC] 02027a3b DFSClient W 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer run DataStreamer Exception
>  
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /var/db/opera/files/B4889CCDA75F9751DDBB488E5AAB433E/BE4DAEF290B7136ED6EF3D4B157441A2/BE4DAEF290B7136ED6EF3D4B157441A2-4.pag
>  could only be replicated to 0 nodes instead of minReplication (=1).  There 
> are 1 datanode(s) running and 1 node(s) are excluded in this operation.
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1384)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2477)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
> at 
> 

[jira] [Updated] (HDFS-9521) TransferFsImage.receiveFile should account and log separate times for image download and fsync to disk

2016-03-07 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9521:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.9.0
   Status: Resolved  (was: Patch Available)

> TransferFsImage.receiveFile should account and log separate times for image 
> download and fsync to disk 
> ---
>
> Key: HDFS-9521
> URL: https://issues.apache.org/jira/browse/HDFS-9521
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Fix For: 2.9.0
>
> Attachments: HDFS-9521-2.patch, HDFS-9521-3.patch, 
> HDFS-9521.004.patch, HDFS-9521.patch, HDFS-9521.patch.1
>
>
> Currently, TransferFsImage.receiveFile is logging total transfer time as 
> below:
> {noformat}
> double xferSec = Math.max(
>((float)(Time.monotonicNow() - startTime)) / 1000.0, 0.001);
> long xferKb = received / 1024;
> LOG.info(String.format("Transfer took %.2fs at %.2f KB/s",xferSec, xferKb / 
> xferSec))
> {noformat}
> This is really useful, but it just measures the total method execution time, 
> which includes time taken to download the image and do an fsync to all the 
> namenode metadata directories.
> Sometimes when troubleshooting these image transfer problems, it's 
> interesting to know which part of the process is the bottleneck 
> (whether network or disk write).
> This patch accounts for the image download and the fsync to each disk 
> separately, logging how long each operation took.
>  
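
A minimal sketch of the split being proposed (plain timers only; the real patch 
uses the NameNode's own utilities and log messages):

{code}
import java.util.concurrent.TimeUnit;

public class TransferTimingSketch {
  public static void main(String[] args) {
    long downloadStart = System.nanoTime();
    // ... stream the image bytes from the other NameNode to a local temp file ...
    long downloadMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - downloadStart);

    long fsyncStart = System.nanoTime();
    // ... fsync the downloaded file into each configured metadata directory ...
    long fsyncMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - fsyncStart);

    // Logging the two phases separately shows whether the network download or
    // the disk write is the bottleneck, instead of one combined transfer time.
    System.out.printf("Image download took %d ms, fsync to disks took %d ms%n",
        downloadMs, fsyncMs);
  }
}
{code}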



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9521) TransferFsImage.receiveFile should account and log separate times for image download and fsync to disk

2016-03-07 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182915#comment-15182915
 ] 

Harsh J commented on HDFS-9521:
---

+1.

The checkpoint-related tests in one of the failing test classes seemed relevant, 
but they pass locally on both JDK7 and JDK8.

{code}
Running org.apache.hadoop.hdfs.TestRollingUpgrade
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 99.737 sec - 
in org.apache.hadoop.hdfs.TestRollingUpgrade
{code}

Therefore they appear to be flaky rather than at fault here. The other tests appear 
similarly unrelated to the log change (no tests appear to rely on the 
original message either).

Committing to branch-2 and trunk shortly.

> TransferFsImage.receiveFile should account and log separate times for image 
> download and fsync to disk 
> ---
>
> Key: HDFS-9521
> URL: https://issues.apache.org/jira/browse/HDFS-9521
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: HDFS-9521-2.patch, HDFS-9521-3.patch, 
> HDFS-9521.004.patch, HDFS-9521.patch, HDFS-9521.patch.1
>
>
> Currently, TransferFsImage.receiveFile is logging total transfer time as 
> below:
> {noformat}
> double xferSec = Math.max(
>((float)(Time.monotonicNow() - startTime)) / 1000.0, 0.001);
> long xferKb = received / 1024;
> LOG.info(String.format("Transfer took %.2fs at %.2f KB/s",xferSec, xferKb / 
> xferSec))
> {noformat}
> This is really useful, but it just measures the total method execution time, 
> which includes time taken to download the image and do an fsync to all the 
> namenode metadata directories.
> Sometimes when troubleshooting these image transfer problems, it's 
> interesting to know which part of the process is the bottleneck 
> (whether network or disk write).
> This patch accounts for the image download and the fsync to each disk 
> separately, logging how long each operation took.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9521) TransferFsImage.receiveFile should account and log separate times for image download and fsync to disk

2016-03-07 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9521:
--
Attachment: HDFS-9521.004.patch

LGTM. There were just two checkstyle nits, which I've corrected in this variant, 
along with some spacing fixes. Will commit once Jenkins returns +1.

The previously failed tests don't appear related.

> TransferFsImage.receiveFile should account and log separate times for image 
> download and fsync to disk 
> ---
>
> Key: HDFS-9521
> URL: https://issues.apache.org/jira/browse/HDFS-9521
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: HDFS-9521-2.patch, HDFS-9521-3.patch, 
> HDFS-9521.004.patch, HDFS-9521.patch, HDFS-9521.patch.1
>
>
> Currently, TransferFsImage.receiveFile is logging total transfer time as 
> below:
> {noformat}
> double xferSec = Math.max(
>((float)(Time.monotonicNow() - startTime)) / 1000.0, 0.001);
> long xferKb = received / 1024;
> LOG.info(String.format("Transfer took %.2fs at %.2f KB/s",xferSec, xferKb / 
> xferSec))
> {noformat}
> This is really useful, but it just measures the total method execution time, 
> which includes time taken to download the image and do an fsync to all the 
> namenode metadata directories.
> Sometimes when troubleshooting these image transfer problems, it's 
> interesting to know which part of the process is the bottleneck 
> (whether network or disk write).
> This patch accounts for the image download and the fsync to each disk 
> separately, logging how long each operation took.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4936) Handle overflow condition for txid going over Long.MAX_VALUE

2016-02-28 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171424#comment-15171424
 ] 

Harsh J commented on HDFS-4936:
---

Forgot to add the response of the asker:

{quote}
After 200 million years, spacemen manage the earth, they also know Hadoop,
but they cannot restart it, after a hard debug they find the txid has been
overflowed for many years.
{quote}
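
The negative txid in the restart failure below falls straight out of 
two's-complement wraparound; a one-liner to see it (plain Java arithmetic, 
nothing HDFS-specific):

{code}
public class TxidOverflowSketch {
  public static void main(String[] args) {
    long txid = Long.MAX_VALUE;   //  9223372036854775807
    long next = txid + 1;         // wraps around to Long.MIN_VALUE
    System.out.println(next);     // prints -9223372036854775808
    // This matches the "unable to find any edit logs containing txid
    // -9223372036854775808" error shown in the quoted description below.
  }
}
{code}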

> Handle overflow condition for txid going over Long.MAX_VALUE
> 
>
> Key: HDFS-4936
> URL: https://issues.apache.org/jira/browse/HDFS-4936
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> Hat tip to [~azuryy] for the question that lead to this (on mailing lists).
> I hacked up my local NN's txids manually to go very large (close to max) and 
> decided to try out if this causes any harm. I basically bumped up the freshly 
> formatted files' starting txid to 9223372036854775805 (and ensured image 
> references the same by hex-editing it):
> {code}
> ➜  current  ls
> VERSION
> fsimage_9223372036854775805.md5
> fsimage_9223372036854775805
> seen_txid
> ➜  current  cat seen_txid
> 9223372036854775805
> {code}
> NameNode started up as expected.
> {code}
> 13/06/25 18:30:08 INFO namenode.FSImage: Image file of size 129 loaded in 0 
> seconds.
> 13/06/25 18:30:08 INFO namenode.FSImage: Loaded image for txid 
> 9223372036854775805 from 
> /temp-space/tmp-default/dfs-cdh4/name/current/fsimage_9223372036854775805
> 13/06/25 18:30:08 INFO namenode.FSEditLog: Starting log segment at 
> 9223372036854775806
> {code}
> I could create a bunch of files and do regular ops (so the txid counter 
> incremented well past the long max). I created over 10 files, just to make it go 
> well over Long.MAX_VALUE.
> Quitting NameNode and restarting fails though, with the following error:
> {code}
> 13/06/25 18:31:08 INFO namenode.FileJournalManager: Recovering unfinalized 
> segments in 
> /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current
> 13/06/25 18:31:08 INFO namenode.FileJournalManager: Finalizing edits file 
> /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_inprogress_9223372036854775806
>  -> 
> /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_9223372036854775806-9223372036854775807
> 13/06/25 18:31:08 FATAL namenode.NameNode: Exception in namenode join
> java.io.IOException: Gap in transactions. Expected to be able to read up 
> until at least txid 9223372036854775806 but unable to find any edit logs 
> containing txid -9223372036854775808
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkForGaps(FSEditLog.java:1194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1152)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:616)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:267)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:592)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:435)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:397)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:399)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:433)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:609)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:590)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1141)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1205)
> {code}
> Looks like we also lose some edits when we restart, as noted by the finalized 
> edits filename:
> {code}
> VERSION
> edits_9223372036854775806-9223372036854775807
> fsimage_9223372036854775805
> fsimage_9223372036854775805.md5
> seen_txid
> {code}
> It seems like we won't be able to handle the case where the txid overflows. It's a 
> very, very large number, so that's not an immediate concern, but it seemed worthy 
> of a report.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9868) add reading source cluster with HA access mode feature for DistCp

2016-02-28 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171119#comment-15171119
 ] 

Harsh J commented on HDFS-9868:
---

Does the feature added with HDFS-6376 not suffice?

> add reading source cluster with HA access mode feature for DistCp
> -
>
> Key: HDFS-9868
> URL: https://issues.apache.org/jira/browse/HDFS-9868
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: distcp
>Affects Versions: 2.7.1
>Reporter: NING DING
>Assignee: NING DING
> Attachments: HDFS-9868.1.patch
>
>
> Normally the HDFS cluster is HA enabled. It can take a long time to copy huge 
> amounts of data with distcp. If the source cluster switches its active namenode, 
> the distcp run will fail. This patch lets DistCp read the source cluster's 
> files in HA access mode. A source cluster configuration file needs to be 
> specified (via the -sourceClusterConf option).
>   The following is an example of the contents of a source cluster configuration 
>   file:
> {code:xml}
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>host1:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>host2:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn1</name>
>     <value>host1:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn2</name>
>     <value>host2:50070</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
> </configuration>
> {code}
>   The invocation of DistCp is as below:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar 
> hdfs://nn2:8020/bar/foo
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8509) Support different passwords for key and keystore on HTTPFS using SSL. This requires for a Tomcat version update.

2016-02-08 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-8509:
--
Issue Type: Improvement  (was: Task)

> Support different passwords for key and keystore on HTTPFS using SSL. This 
> requires for a Tomcat version update.
> 
>
> Key: HDFS-8509
> URL: https://issues.apache.org/jira/browse/HDFS-8509
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Affects Versions: 2.7.0
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: HDFS-8509.patch
>
>
> Currently, SSL for HTTPFS requires that the keystore/truststore and key passwords 
> be the same. This is a limitation of Tomcat version 6, which didn't have 
> support for different passwords. From Tomcat 7 onwards this is possible, by 
> defining the "keyPass" property on the "Connector" configuration in Tomcat's 
> server.xml file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9521) TransferFsImage.receiveFile should account and log separate times for image download and fsync to disk

2016-01-04 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082446#comment-15082446
 ] 

Harsh J commented on HDFS-9521:
---

Patch's approach looks good to me. Agreed with [~liuml07], that we can keep the 
total time also (but indicate in the message that it includes both times). 
Alternatively, a single combined log at the end that prints the total and 
divided times (along with path info as we have it in the current patch) would 
be better too.

I do not agree on DEBUG level though. The change is a refinement of an 
existing, vital INFO message.

Please also address the checkstyle issues, if they are relevant (sorry, I am too 
late here and the build data has been wiped already). You can run the checkstyle 
goal with maven to get the same results locally.

The failing tests don't appear related.

> TransferFsImage.receiveFile should account and log separate times for image 
> download and fsync to disk 
> ---
>
> Key: HDFS-9521
> URL: https://issues.apache.org/jira/browse/HDFS-9521
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: HDFS-9521.patch
>
>
> Currently, TransferFsImage.receiveFile is logging total transfer time as 
> below:
> {noformat}
> double xferSec = Math.max(
>((float)(Time.monotonicNow() - startTime)) / 1000.0, 0.001);
> long xferKb = received / 1024;
> LOG.info(String.format("Transfer took %.2fs at %.2f KB/s",xferSec, xferKb / 
> xferSec))
> {noformat}
> This is really useful, but it just measures the total method execution time, 
> which includes time taken to download the image and do an fsync to all the 
> namenode metadata directories.
> Sometimes when troubleshooting these image transfer problems, it's 
> interesting to know which part of the process is the bottleneck 
> (whether network or disk write).
> This patch accounts for the image download and the fsync to each disk 
> separately, logging how long each operation took.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7447) Number of maximum Acl entries on a File/Folder should be made user configurable than hardcoding .

2015-12-11 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15054098#comment-15054098
 ] 

Harsh J commented on HDFS-7447:
---

bq. The number of entries in a single ACL is capped at a maximum of 32. 
Attempts to add ACL entries exceeding the maximum will fail with a user-facing 
error. This is done for 2 reasons: to simplify management, and to limit 
resource consumption. ACLs with a very high number of entries tend to become 
difficult to understand and may indicate that the requirements are better 
implemented by defining additional groups or users. ACLs with a very high 
number of entries also would require more memory and storage and take longer to 
evaluate on each permission check. The number 32 is chosen for consistency with 
the maximum number of ACL entries enforced by the ext family of file systems. - 
https://issues.apache.org/jira/secure/attachment/12627729/HDFS-ACLs-Design-3.pdf

> Number of maximum Acl entries on a File/Folder should be made user 
> configurable than hardcoding .
> -
>
> Key: HDFS-7447
> URL: https://issues.apache.org/jira/browse/HDFS-7447
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: J.Andreina
>
> By default, a newly created folder1 will have 6 ACL entries. Once the ACLs 
> assigned on folder1 exceed 32, it is no longer possible to assign ACLs for a 
> group/user to folder1. 
> {noformat}
> 2014-11-20 18:55:06,553 ERROR [qtp1279235236-17 - /rolexml/role/modrole] 
> Error occured while setting permissions for Resource:[ 
> hdfs://hacluster/folder1 ] and Error message is : Invalid ACL: ACL has 33 
> entries, which exceeds maximum of 32.
> at 
> org.apache.hadoop.hdfs.server.namenode.AclTransformation.buildAndValidateAcl(AclTransformation.java:274)
> at 
> org.apache.hadoop.hdfs.server.namenode.AclTransformation.mergeAclEntries(AclTransformation.java:181)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedModifyAclEntries(FSDirectory.java:2771)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.modifyAclEntries(FSDirectory.java:2757)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.modifyAclEntries(FSNamesystem.java:7734)
> {noformat}
> Here the value 32 is hardcoded; it could be made user-configurable. 
> {noformat}
> private static List buildAndValidateAcl(ArrayList aclBuilder)
> throws AclException
> {
> if(aclBuilder.size() > 32)
> throw new AclException((new StringBuilder()).append("Invalid ACL: 
> ACL has ").append(aclBuilder.size()).append(" entries, which exceeds maximum 
> of ").append(32).append(".").toString());
> :
> :
> }
> {noformat}
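
A minimal sketch of what a configurable limit could look like; the configuration 
key name and class below are hypothetical, used only for illustration:

{code}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.hdfs.protocol.AclException;

/** Hypothetical sketch: read the ACL entry limit from configuration instead of hardcoding 32. */
class ConfigurableAclLimit {
  // This key name is an assumption for illustration, not an existing HDFS property.
  static final String MAX_ENTRIES_KEY = "dfs.namenode.acl.max.entries";
  static final int MAX_ENTRIES_DEFAULT = 32;

  private final int maxEntries;

  ConfigurableAclLimit(Configuration conf) {
    this.maxEntries = conf.getInt(MAX_ENTRIES_KEY, MAX_ENTRIES_DEFAULT);
  }

  void check(List<AclEntry> acl) throws AclException {
    if (acl.size() > maxEntries) {
      throw new AclException("Invalid ACL: ACL has " + acl.size()
          + " entries, which exceeds the configured maximum of " + maxEntries + ".");
    }
  }
}
{code}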



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2015-11-16 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006787#comment-15006787
 ] 

Harsh J commented on HDFS-8298:
---

I'm closing this JIRA as Invalid. This is not a bug, but a misunderstanding of 
what the error means. It's worth raising these discussions on the user list prior 
to reporting a JIRA.

The chief QJM guarantee is that a failure, of whatever kind, is intolerable. 
That the NN kills itself in such a situation is the currently expected outcome.

bq. In an HDFS HA setup if there is a temporary problem with contacting journal 
nodes (eg. network interruption), the NameNode shuts down entirely,

This is as per the current design, and the NN is operating in the right.

bq. When it should instead go in to a standby mode so that it can stay online 
and retry to achieve quorum later.

This is not possible today, because at the point of failure the transaction is 
incomplete. The mutation writes are not write-ahead but write-behind (although 
part of the same transaction), so we don't want to risk an inconsistent state, 
especially if reads over the standby are allowed; and if we let the NN stay 
alive, then if it ever became active again it would have an edit conflict.

bq. As usual, don't blindly change without understanding the impacts of your 
change.

Indeed, modification of these values must be done with care, especially that of 
the client retry and timeout values. If the timeouts add up to more than the 
period a client may wait (and retry over), then you'd see actual client 
failures in an HA situation.

bq. We are facing this issue every week at the same time might be due to a 
network glitch but is there a workaround that can be put in place?

It is important to understand why the writes time out. Note that each write is 
hardly more than a few thousand bytes, usually. Writing such a small amount, in 
parallel and over the network, should not take > 2-3s. There are a LOT of 
factors besides the network that can cause the timeouts: GC or other forms of 
process-level pauses (the NN pauses mid-write for over 20s, recovers, but the 
write finisher now thinks > 20s has passed writing to the JNs, so it marks the 
write as failed), JN fsync delays due to not providing it dedicated disks (if 
you're unaware, JN writes are locally synchronous, for durability), and 
Kerberos KDC timeouts (the JVM default of 30s before a failed KDC connection is 
retried is higher than the 20s default transaction write timeout to the JNs, 
when auth gets required between NN -> JNs periodically).
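
As an illustration of that 20s default only, a small sketch of reading or 
raising the JN transaction-write timeout; treat the key name and default as 
something to verify against your release, and weigh any increase against the 
client retry window discussed above:

{code}
import org.apache.hadoop.conf.Configuration;

/** Sketch only: the JN transaction-write timeout discussed above. */
class QjmWriteTimeoutSketch {
  // Believed to default to 20000 ms; verify the key against your Hadoop release.
  static final String WRITE_TXNS_TIMEOUT_KEY = "dfs.qjournal.write-txns.timeout.ms";

  static int readTimeoutMs(Configuration conf) {
    return conf.getInt(WRITE_TXNS_TIMEOUT_KEY, 20000);
  }

  static void raiseTimeoutMs(Configuration conf, int timeoutMs) {
    // Raising this blindly is risky: if NN-side timeouts add up to more than the
    // window a client will wait and retry over, clients see real failures.
    conf.setInt(WRITE_TXNS_TIMEOUT_KEY, timeoutMs);
  }
}
{code}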

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> ---
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, HDFS, namenode, qjm
>Affects Versions: 2.6.0
>Reporter: Hari Sekhon
>
> In an HDFS HA setup if there is a temporary problem with contacting journal 
> nodes (eg. network interruption), the NameNode shuts down entirely, when it 
> should instead go in to a standby mode so that it can stay online and retry 
> to achieve quorum later.
> If both NameNodes shut themselves off like this then even after the temporary 
> network outage is resolved, the entire cluster remains offline indefinitely 
> until operator intervention, whereas it could have self-repaired after 
> re-contacting the journalnodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
> required journal (JournalAndStre
> am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream 
> starting at txid 54270281))
> java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
> respond.
> at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
> at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at 
> 

[jira] [Resolved] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2015-11-16 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-8298.
---
Resolution: Invalid

Closing out - for specific identified improvements (such as log improvements, 
or ideas about improving unclear root-causing), please log a more direct JIRA.

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> ---
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, HDFS, namenode, qjm
>Affects Versions: 2.6.0
>Reporter: Hari Sekhon
>
> In an HDFS HA setup if there is a temporary problem with contacting journal 
> nodes (eg. network interruption), the NameNode shuts down entirely, when it 
> should instead go in to a standby mode so that it can stay online and retry 
> to achieve quorum later.
> If both NameNodes shut themselves off like this then even after the temporary 
> network outage is resolved, the entire cluster remains offline indefinitely 
> until operator intervention, whereas it could have self-repaired after 
> re-contacting the journalnodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
> required journal (JournalAndStre
> am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream 
> starting at txid 54270281))
> java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
> respond.
> at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
> at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
> at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
> at java.lang.Thread.run(Thread.java:745)
> 2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager 
> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at 
> txid 54270281
> 2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
> Exiting with status 1
> 2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - 
> SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at /
> /{code}
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2015-11-16 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-8298:
--
Environment: (was: HDP 2.2)

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> ---
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, HDFS, namenode, qjm
>Affects Versions: 2.6.0
>Reporter: Hari Sekhon
>
> In an HDFS HA setup if there is a temporary problem with contacting journal 
> nodes (eg. network interruption), the NameNode shuts down entirely, when it 
> should instead go in to a standby mode so that it can stay online and retry 
> to achieve quorum later.
> If both NameNodes shut themselves off like this then even after the temporary 
> network outage is resolved, the entire cluster remains offline indefinitely 
> until operator intervention, whereas it could have self-repaired after 
> re-contacting the journalnodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
> required journal (JournalAndStre
> am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream 
> starting at txid 54270281))
> java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
> respond.
> at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
> at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
> at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
> at java.lang.Thread.run(Thread.java:745)
> 2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager 
> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at 
> txid 54270281
> 2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
> Exiting with status 1
> 2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - 
> SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at /
> /{code}
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8986) Add option to -du to calculate directory space usage excluding snapshots

2015-11-06 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994469#comment-14994469
 ] 

Harsh J commented on HDFS-8986:
---

This continues to cause a bunch of confusion among our user-base who are still 
reliant on the pre-snapshot feature behaviour, so it would be nice to see it 
implemented.

> Add option to -du to calculate directory space usage excluding snapshots
> 
>
> Key: HDFS-8986
> URL: https://issues.apache.org/jira/browse/HDFS-8986
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: snapshots
>Reporter: Gautam Gopalakrishnan
>Assignee: Jagadesh Kiran N
>
> When running {{hadoop fs -du}} on a snapshotted directory (or one of its 
> children), the report includes space consumed by blocks that are only present 
> in the snapshots. This is confusing for end users.
> {noformat}
> $  hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 799.7 M  2.3 G  /tmp/parent
> 799.7 M  2.3 G  /tmp/parent/sub1
> $ hdfs dfs -createSnapshot /tmp/parent snap1
> Created snapshot /tmp/parent/.snapshot/snap1
> $ hadoop fs -rm -skipTrash /tmp/parent/sub1/*
> ...
> $ hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 799.7 M  2.3 G  /tmp/parent
> 799.7 M  2.3 G  /tmp/parent/sub1
> $ hdfs dfs -deleteSnapshot /tmp/parent snap1
> $ hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 0  0  /tmp/parent
> 0  0  /tmp/parent/sub1
> {noformat}
> It would be helpful if we had a flag, say -X, to exclude any snapshot related 
> disk usage in the output



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8986) Add option to -du to calculate directory space usage excluding snapshots

2015-10-31 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984274#comment-14984274
 ] 

Harsh J commented on HDFS-8986:
---

[~jagadesh.kiran] - What's the precise opposition w.r.t. adding a new flag to 
explicitly exclude snapshot counts? It does not break compatibility, and is an 
added feature/improvement.

> Add option to -du to calculate directory space usage excluding snapshots
> 
>
> Key: HDFS-8986
> URL: https://issues.apache.org/jira/browse/HDFS-8986
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: snapshots
>Reporter: Gautam Gopalakrishnan
>Assignee: Jagadesh Kiran N
>
> When running {{hadoop fs -du}} on a snapshotted directory (or one of its 
> children), the report includes space consumed by blocks that are only present 
> in the snapshots. This is confusing for end users.
> {noformat}
> $  hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 799.7 M  2.3 G  /tmp/parent
> 799.7 M  2.3 G  /tmp/parent/sub1
> $ hdfs dfs -createSnapshot /tmp/parent snap1
> Created snapshot /tmp/parent/.snapshot/snap1
> $ hadoop fs -rm -skipTrash /tmp/parent/sub1/*
> ...
> $ hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 799.7 M  2.3 G  /tmp/parent
> 799.7 M  2.3 G  /tmp/parent/sub1
> $ hdfs dfs -deleteSnapshot /tmp/parent snap1
> $ hadoop fs -du -h -s /tmp/parent /tmp/parent/*
> 0  0  /tmp/parent
> 0  0  /tmp/parent/sub1
> {noformat}
> It would be helpful if we had a flag, say -X, to exclude any snapshot related 
> disk usage in the output



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9257) improve error message for "Absolute path required" in INode.java to contain the rejected path

2015-10-16 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960854#comment-14960854
 ] 

Harsh J commented on HDFS-9257:
---

+1, failed tests are unrelated. Tests shouldn't be necessary for the trivial 
message improvement. Committing shortly.

> improve error message for "Absolute path required" in INode.java to contain 
> the rejected path
> -
>
> Key: HDFS-9257
> URL: https://issues.apache.org/jira/browse/HDFS-9257
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Marcell Szabo
>Assignee: Marcell Szabo
>Priority: Trivial
> Attachments: HDFS-9257.000.patch
>
>
> throw new AssertionError("Absolute path required");
> message should also show the path to help debugging.
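
A minimal sketch of the suggested improvement; the class and method names are 
placeholders, not the committed patch:

{code}
/** Sketch only: carry the rejected path in the assertion message. */
final class AbsolutePathCheckSketch {
  static void requireAbsolute(String path) {
    if (path == null || !path.startsWith("/")) {
      // Including the offending path makes the failure diagnosable from the message alone.
      throw new AssertionError("Absolute path required, but got '" + path + "'");
    }
  }
}
{code}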



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9257) improve error message for "Absolute path required" in INode.java to contain the rejected path

2015-10-16 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-9257:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2. Thank you for the improvement contribution 
Marcell, hope to see more!

> improve error message for "Absolute path required" in INode.java to contain 
> the rejected path
> -
>
> Key: HDFS-9257
> URL: https://issues.apache.org/jira/browse/HDFS-9257
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Marcell Szabo
>Assignee: Marcell Szabo
>Priority: Trivial
> Fix For: 2.8.0
>
> Attachments: HDFS-9257.000.patch
>
>
> throw new AssertionError("Absolute path required");
> message should also show the path to help debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7899) Improve EOF error message

2015-10-06 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945253#comment-14945253
 ] 

Harsh J commented on HDFS-7899:
---

Thank you [~jagadesh.kiran] and [~vinayrpet]

> Improve EOF error message
> -
>
> Key: HDFS-7899
> URL: https://issues.apache.org/jira/browse/HDFS-7899
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Jagadesh Kiran N
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: HDFS-7899-00.patch, HDFS-7899-01.patch, 
> HDFS-7899-02.patch
>
>
> Currently, a DN disconnection for reasons other than connection timeout or 
> refused messages, such as an EOF message as a result of rejection or other 
> network fault, reports in this manner:
> {code}
> WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /x.x.x.x: for 
> block, add to deadNodes and continue. java.io.EOFException: Premature EOF: no 
> length prefix available 
> java.io.EOFException: Premature EOF: no length prefix available 
> at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
>  
> at 
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:392)
>  
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:137)
>  
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1103)
>  
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:538) 
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:750)
>  
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:794) 
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:602) 
> {code}
> This is not very clear to a user (warn's at the hdfs-client). It could likely 
> be improved with a more diagnosable message, or at least the direct reason 
> than an EOF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7899) Improve EOF error message

2015-10-03 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942281#comment-14942281
 ] 

Harsh J commented on HDFS-7899:
---

Sorry for the delay here!

Perhaps it could be fixed to:

s/Unexpected EOF while trying read response/Unexpected EOF while trying to read 
response from server

(With the hope that it would get some users to look above it and spot the 
server/block identifiers to investigate further.)
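
For illustration, a small sketch of wrapping the EOF with the peer and block 
context so the client-side warning is diagnosable on its own; the names here 
are placeholders, not the actual patch:

{code}
import java.io.EOFException;
import java.io.IOException;

/** Sketch only: attach datanode and block context to the premature-EOF error. */
class EofContextSketch {
  static IOException withContext(EOFException cause, String datanodeAddr, String blockId) {
    return new IOException("Unexpected EOF while trying to read response from server "
        + datanodeAddr + " for block " + blockId, cause);
  }
}
{code}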

> Improve EOF error message
> -
>
> Key: HDFS-7899
> URL: https://issues.apache.org/jira/browse/HDFS-7899
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Jagadesh Kiran N
>Priority: Minor
> Attachments: HDFS-7899-00.patch, HDFS-7899-01.patch
>
>
> Currently, a DN disconnection for reasons other than connection timeout or 
> refused messages, such as an EOF message as a result of rejection or other 
> network fault, reports in this manner:
> {code}
> WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /x.x.x.x: for 
> block, add to deadNodes and continue. java.io.EOFException: Premature EOF: no 
> length prefix available 
> java.io.EOFException: Premature EOF: no length prefix available 
> at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
>  
> at 
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:392)
>  
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:137)
>  
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1103)
>  
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:538) 
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:750)
>  
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:794) 
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:602) 
> {code}
> This is not very clear to a user (warn's at the hdfs-client). It could likely 
> be improved with a more diagnosable message, or at least the direct reason 
> than an EOF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-6674) UserGroupInformation.loginUserFromKeytab will hang forever if keytab file length is less than 6 byte.

2015-09-25 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-6674.
---
Resolution: Invalid

The hang, if still valid, appears to be an outcome of the underlying Java 
libraries being at fault. There's nothing HDFS can control about this; the bug 
instead needs to be reported to the Oracle/OpenJDK communities with a 
test case.
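
As a workaround sketch only, an application could refuse obviously truncated 
keytabs before attempting the login; the 6-byte threshold below is taken from 
this report, not from any HDFS API:

{code}
import java.io.File;
import java.io.IOException;

/** Sketch only: a defensive pre-check before keytab login. */
class KeytabSanityCheck {
  static void check(String keytabPath) throws IOException {
    File keytab = new File(keytabPath);
    if (!keytab.isFile() || keytab.length() < 6) {
      throw new IOException("Keytab " + keytabPath + " is missing or too short ("
          + keytab.length() + " bytes); refusing to attempt login");
    }
  }
}
{code}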

> UserGroupInformation.loginUserFromKeytab will hang forever if keytab file 
> length  is less than 6 byte.
> --
>
> Key: HDFS-6674
> URL: https://issues.apache.org/jira/browse/HDFS-6674
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.0.1-alpha
>Reporter: liuyang
>Priority: Minor
>
> The jstack is as follows:
>java.lang.Thread.State: RUNNABLE
>   at java.io.FileInputStream.available(Native Method)
>   at java.io.BufferedInputStream.available(BufferedInputStream.java:399)
>   - locked <0x000745585330> (a 
> sun.security.krb5.internal.ktab.KeyTabInputStream)
>   at sun.security.krb5.internal.ktab.KeyTab.load(KeyTab.java:257)
>   at sun.security.krb5.internal.ktab.KeyTab.(KeyTab.java:97)
>   at sun.security.krb5.internal.ktab.KeyTab.getInstance0(KeyTab.java:124)
>   - locked <0x000745586560> (a java.lang.Class for 
> sun.security.krb5.internal.ktab.KeyTab)
>   at sun.security.krb5.internal.ktab.KeyTab.getInstance(KeyTab.java:157)
>   at javax.security.auth.kerberos.KeyTab.takeSnapshot(KeyTab.java:119)
>   at 
> javax.security.auth.kerberos.KeyTab.getEncryptionKeys(KeyTab.java:192)
>   at 
> javax.security.auth.kerberos.JavaxSecurityAuthKerberosAccessImpl.keyTabGetEncryptionKeys(JavaxSecurityAuthKerberosAccessImpl.java:36)
>   at 
> sun.security.jgss.krb5.Krb5Util.keysFromJavaxKeyTab(Krb5Util.java:381)
>   at 
> com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:701)
>   at 
> com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:584)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.security.auth.login.LoginContext.invoke(LoginContext.java:784)
>   at 
> javax.security.auth.login.LoginContext.access$000(LoginContext.java:203)
>   at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721)
>   at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718)
>   at javax.security.auth.login.LoginContext.login(LoginContext.java:590)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:679)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-4224) The dncp_block_verification log can be compressed

2015-09-15 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-4224.
---
Resolution: Invalid

Invalid after HDFS-7430

> The dncp_block_verification log can be compressed
> -
>
> Key: HDFS-4224
> URL: https://issues.apache.org/jira/browse/HDFS-4224
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.0.0-alpha
>Reporter: Harsh J
>Priority: Minor
>
> On some systems, I noticed that when the scanner runs, the 
> dncp_block_verification.log.curr file under the block pool gets quite large 
> (several GBs). Although this is rolled away, we could also configure 
> compression upon it (a codec that may work without natives, would be a good 
> default) and save on I/O and space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7899) Improve EOF error message

2015-09-06 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733250#comment-14733250
 ] 

Harsh J commented on HDFS-7899:
---

Thanks Jagadesh, that message change was just a small idea to make it a bit 
clearer. Do you also have any ideas to improve the situation such that users 
may be able to figure out on their own what's going on? I've seen this appear 
during socket disconnects/timeouts/etc. - but the message it prints comes from 
the software layer instead, which causes confusion.

> Improve EOF error message
> -
>
> Key: HDFS-7899
> URL: https://issues.apache.org/jira/browse/HDFS-7899
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.6.0
>Reporter: Harsh J
>Assignee: Jagadesh Kiran N
>Priority: Minor
> Attachments: HDFS-7899-00.patch
>
>
> Currently, a DN disconnection for reasons other than connection timeout or 
> refused messages, such as an EOF message as a result of rejection or other 
> network fault, reports in this manner:
> {code}
> WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /x.x.x.x: for 
> block, add to deadNodes and continue. java.io.EOFException: Premature EOF: no 
> length prefix available 
> java.io.EOFException: Premature EOF: no length prefix available 
> at 
> org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
>  
> at 
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:392)
>  
> at 
> org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:137)
>  
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1103)
>  
> at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:538) 
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:750)
>  
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:794) 
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:602) 
> {code}
> This is not very clear to a user (warn's at the hdfs-client). It could likely 
> be improved with a more diagnosable message, or at least the direct reason 
> than an EOF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-237) Better handling of dfsadmin command when namenode is slow

2015-09-06 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-237.
--
Resolution: Later

This older JIRA is a bit stale given the multiple changes that went into the 
RPC side. Follow HADOOP-9640 and related JIRAs instead for more recent work.

bq. a separate rpc queue

This is supported today via the servicerpc-address configs (typically set to 
8022, and strongly recommended for HA modes).

> Better handling of dfsadmin command when namenode is slow
> -
>
> Key: HDFS-237
> URL: https://issues.apache.org/jira/browse/HDFS-237
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Koji Noguchi
>
> Probably when hitting HADOOP-3810, Namenode became unresponsive.  Large time 
> spent in GC.
> All dfs/dfsadmin command were timing out.
> WebUI was coming up after waiting for a long time.
> Maybe setting a long timeout would have made the dfsadmin command go through.
> But it would be nice to have a separate queue/handler which doesn't compete 
> with regular rpc calls.
> All I wanted to do was dfsadmin -safemode enter, dfsadmin -finalizeUpgrade ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-2390) dfsadmin -setBalancerBandwidth doesnot validate -ve value

2015-08-27 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-2390:
--
Issue Type: Improvement  (was: Bug)

 dfsadmin -setBalancerBandwidth doesnot validate -ve value
 -

 Key: HDFS-2390
 URL: https://issues.apache.org/jira/browse/HDFS-2390
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover
Affects Versions: 2.7.1
Reporter: Rajit Saha
Assignee: Gautam Gopalakrishnan
 Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch, 
 HDFS-2390-4.patch


 $ hadoop dfsadmin -setBalancerBandwidth -1 
 does not throw any message that it is invalid although in DN log we are not 
 getting
 DNA_BALANCERBANDWIDTHUPDATE. 
 I think it should throw some message that -ve numbers are not valid , as it 
 does
 for decimal numbers or non-numbers like -
 $ hadoop dfsadmin -setBalancerBandwidth 12.34
 NumberFormatException: For input string: 12.34
 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-2390) dfsadmin -setBalancerBandwidth doesnot validate -ve value

2015-08-27 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716463#comment-14716463
 ] 

Harsh J commented on HDFS-2390:
---

+1 on v4. The checkstyle warning is about file length, but we can ignore that 
(I agree with HADOOP-12005).
The test on the console output does not appear to have failed, and is unrelated 
and also passes locally:
{code}
Running org.apache.hadoop.hdfs.server.namenode.TestINodeFile
Tests run: 26, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 39.608 sec - 
in org.apache.hadoop.hdfs.server.namenode.TestINodeFile
{code}

Thanks for the test and improvement fix, committing this shortly!

 dfsadmin -setBalancerBandwidth doesnot validate -ve value
 -

 Key: HDFS-2390
 URL: https://issues.apache.org/jira/browse/HDFS-2390
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer  mover
Affects Versions: 2.7.1
Reporter: Rajit Saha
Assignee: Gautam Gopalakrishnan
 Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch, 
 HDFS-2390-4.patch


 $ hadoop dfsadmin -setBalancerBandwidth -1 
 does not throw any message that it is invalid although in DN log we are not 
 getting
 DNA_BALANCERBANDWIDTHUPDATE. 
 I think it should throw some message that -ve numbers are not valid , as it 
 does
 for decimal numbers or non-numbers like -
 $ hadoop dfsadmin -setBalancerBandwidth 12.34
 NumberFormatException: For input string: 12.34
 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-2390) dfsadmin -setBalancerBandwidth doesnot validate -ve value

2015-08-27 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-2390:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2. Thank you Gautam!

 dfsadmin -setBalancerBandwidth doesnot validate -ve value
 -

 Key: HDFS-2390
 URL: https://issues.apache.org/jira/browse/HDFS-2390
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover
Affects Versions: 2.7.1
Reporter: Rajit Saha
Assignee: Gautam Gopalakrishnan
 Fix For: 2.8.0

 Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch, 
 HDFS-2390-4.patch


 $ hadoop dfsadmin -setBalancerBandwidth -1 
 does not throw any message that it is invalid although in DN log we are not 
 getting
 DNA_BALANCERBANDWIDTHUPDATE. 
 I think it should throw some message that -ve numbers are not valid , as it 
 does
 for decimal numbers or non-numbers like -
 $ hadoop dfsadmin -setBalancerBandwidth 12.34
 NumberFormatException: For input string: 12.34
 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-2390) dfsadmin -setBalancerBandwidth doesnot validate -ve value

2015-08-27 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-2390:
--
Priority: Minor  (was: Major)

 dfsadmin -setBalancerBandwidth doesnot validate -ve value
 -

 Key: HDFS-2390
 URL: https://issues.apache.org/jira/browse/HDFS-2390
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover
Affects Versions: 2.7.1
Reporter: Rajit Saha
Assignee: Gautam Gopalakrishnan
Priority: Minor
 Fix For: 2.8.0

 Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch, 
 HDFS-2390-4.patch


 $ hadoop dfsadmin -setBalancerBandwidth -1 
 does not throw any message that it is invalid although in DN log we are not 
 getting
 DNA_BALANCERBANDWIDTHUPDATE. 
 I think it should throw some message that -ve numbers are not valid , as it 
 does
 for decimal numbers or non-numbers like -
 $ hadoop dfsadmin -setBalancerBandwidth 12.34
 NumberFormatException: For input string: 12.34
 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-2390) dfsadmin -setBalancerBandwidth doesnot validate -ve value

2015-08-26 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14714205#comment-14714205
 ] 

Harsh J commented on HDFS-2390:
---

bq. +assertEquals("Bandwidth should be a non-negative integer", -1, 
exitCode);

This should rather be something like "Negative bandwidth value must fail the 
command", such that upon a regression, when the test fails, the message 
produced by the JUnit test suite would look like: "Negative bandwidth value 
must fail the command: expected -1 but got 0".
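
A minimal JUnit sketch of that assertion-message phrasing; the exit code here is 
a stand-in value, not the real command invocation:

{code}
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestBalancerBandwidthMessageSketch {
  @Test
  public void negativeBandwidthMustFailCommand() {
    int exitCode = -1; // stand-in for the value returned by the dfsadmin command
    assertEquals("Negative bandwidth value must fail the command", -1, exitCode);
  }
}
{code}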

 dfsadmin -setBalancerBandwidth doesnot validate -ve value
 -

 Key: HDFS-2390
 URL: https://issues.apache.org/jira/browse/HDFS-2390
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer  mover
Affects Versions: 2.7.1
Reporter: Rajit Saha
Assignee: Gautam Gopalakrishnan
 Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch


 $ hadoop dfsadmin -setBalancerBandwidth -1 
 does not throw any message that it is invalid although in DN log we are not 
 getting
 DNA_BALANCERBANDWIDTHUPDATE. 
 I think it should throw some message that -ve numbers are not valid , as it 
 does
 for decimal numbers or non-numbers like -
 $ hadoop dfsadmin -setBalancerBandwidth 12.34
 NumberFormatException: For input string: 12.34
 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-5696) Examples for httpfs REST API incorrect on apache.org

2015-08-25 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1479#comment-1479
 ] 

Harsh J commented on HDFS-5696:
---

Thanks for reporting this. Care to submit a fix changing the op values to the 
right ones? I suspect these may be leftovers from the Hoop days.

bq. Not sure what the convention should be for specifying the user.name. Use 
hdfs? or a name that is obviously an example?

Since these are curl-based examples that also likely assume no Kerberos setup, 
why not $USER or $(whoami) instead of a hardcoded value?

 Examples for httpfs REST API incorrect on apache.org
 

 Key: HDFS-5696
 URL: https://issues.apache.org/jira/browse/HDFS-5696
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.2.0
 Environment: NA
Reporter: Casey Brotherton
Priority: Trivial

 The examples provided for the httpfs REST API are incorrect.
 http://hadoop.apache.org/docs/r2.2.0/hadoop-hdfs-httpfs/index.html
 http://hadoop.apache.org/docs/r2.0.5-alpha/hadoop-hdfs-httpfs/index.html
 From the documentation:
 *
 HttpFS is a separate service from Hadoop NameNode.
 HttpFS itself is Java web-application and it runs using a preconfigured 
 Tomcat bundled with HttpFS binary distribution.
 HttpFS HTTP web-service API calls are HTTP REST calls that map to a HDFS file 
 system operation. For example, using the curl Unix command:
 $ curl http://httpfs-host:14000/webhdfs/v1/user/foo/README.txt returns the 
 contents of the HDFS /user/foo/README.txt file.
 $ curl http://httpfs-host:14000/webhdfs/v1/user/foo?op=list returns the 
 contents of the HDFS /user/foo directory in JSON format.
 $ curl -X POST http://httpfs-host:14000/webhdfs/v1/user/foo/bar?op=mkdirs 
 creates the HDFS /user/foo.bar directory.
 ***
 The commands have incorrect operations. ( Verified through source code in 
 HttpFSFileSystem.java )
 In addition, although the webhdfs documentation specifies user.name as 
 optional, on my cluster, each action required a user.name
 It should be included in the short examples to allow for the greatest chance 
 of success.
 Three examples rewritten:
 curl -i -L 
 http://httpfs-host:14000/webhdfs/v1/user/foo/README.txt?op=open&user.name=hdfsuser;
 curl -i 
 http://httpfs-host:14000/webhdfs/v1/user/foo/?op=liststatus&user.name=hdfsuser;
 curl -i -X PUT 
 http://httpfs-host:14000/webhdfs/v1/user/foo/bar?op=mkdirs&user.name=hdfsuser;
 Not sure what the convention should be for specifying the user.name. Use 
 hdfs? or a name that is obviously an example?
 It would also be beneficial if the HTTPfs page linked to the webhdfs 
 documentation page in the text instead of just on the menu sidebar.
 http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting

2015-08-24 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-8118:
--
 Hadoop Flags: Reviewed
Affects Version/s: 2.7.1
 Target Version/s: 3.0.0, 2.8.0
   Status: Patch Available  (was: Open)

I re-looked at the change and the problem, and although this can be difficult 
to test, the change does certainly fix the described changing-timestamp 
behaviour.

+1, will commit after verifying Jenkins result on the newer patch. Marking as 
Patch Available to trigger the build.

 Delay in checkpointing Trash can leave trash for 2 intervals before deleting
 

 Key: HDFS-8118
 URL: https://issues.apache.org/jira/browse/HDFS-8118
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Casey Brotherton
Assignee: Casey Brotherton
Priority: Trivial
 Attachments: HDFS-8118.001.patch, HDFS-8118.patch


 When the fs.trash.checkpoint.interval and the fs.trash.interval are set 
 non-zero and the same, it is possible for trash to be left for two intervals.
 The TrashPolicyDefault will use a floor and ceiling function to ensure that 
 the Trash will be checkpointed every interval of minutes.
 Each user's trash is checkpointed individually.  The time resolution of the 
 checkpoint timestamp is to the second.
 If the seconds switch while one user is checkpointing, then the next user's 
 timestamp will be later.
 This will cause the next user's checkpoint to not be deleted at the next 
 interval.
 I have recreated this in a lab cluster 
 I also have a suggestion for a patch that I can upload later tonight after 
 testing it further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting

2015-08-24 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14709034#comment-14709034
 ] 

Harsh J commented on HDFS-8118:
---

I missed a small detail - why is the {{new Date()}} not outside the 
user-directory iteration, if the goal is to make it constant across all user 
directories when the emptier gets invoked?
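
A minimal sketch of the idea being asked about, computing one timestamp before 
the per-user iteration; the class and helper below are placeholders and the 
date format is an assumption, not the actual TrashPolicyDefault code:

{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;

/** Sketch only: one checkpoint timestamp for all user trash directories. */
class TrashCheckpointSketch {
  static void checkpointAll(List<String> userTrashDirs) {
    // One timestamp for the whole run, so a second rolling over mid-iteration
    // cannot give later users a newer checkpoint name.
    String timestamp = new SimpleDateFormat("yyMMddHHmmss").format(new Date());
    for (String dir : userTrashDirs) {
      createCheckpoint(dir, timestamp);
    }
  }

  private static void createCheckpoint(String dir, String timestamp) {
    // Placeholder for the rename of <dir>/Current to <dir>/<timestamp>.
    System.out.println("Would rename " + dir + "/Current to " + dir + "/" + timestamp);
  }
}
{code}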

 Delay in checkpointing Trash can leave trash for 2 intervals before deleting
 

 Key: HDFS-8118
 URL: https://issues.apache.org/jira/browse/HDFS-8118
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Casey Brotherton
Assignee: Casey Brotherton
Priority: Trivial
 Attachments: HDFS-8118.001.patch, HDFS-8118.patch


 When the fs.trash.checkpoint.interval and the fs.trash.interval are set 
 non-zero and the same, it is possible for trash to be left for two intervals.
 The TrashPolicyDefault will use a floor and ceiling function to ensure that 
 the Trash will be checkpointed every interval of minutes.
 Each user's trash is checkpointed individually.  The time resolution of the 
 checkpoint timestamp is to the second.
 If the seconds switch while one user is checkpointing, then the next user's 
 timestamp will be later.
 This will cause the next user's checkpoint to not be deleted at the next 
 interval.
 I have recreated this in a lab cluster 
 I also have a suggestion for a patch that I can upload later tonight after 
 testing it further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting

2015-08-24 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-8118:
--
Comment: was deleted

(was: I missed a small detail - why is the {{new Date()}} not outside the 
user-directory iteration, if the goal is to make it constant across all user 
directories when the emptier gets invoked?)

 Delay in checkpointing Trash can leave trash for 2 intervals before deleting
 

 Key: HDFS-8118
 URL: https://issues.apache.org/jira/browse/HDFS-8118
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Casey Brotherton
Assignee: Casey Brotherton
Priority: Trivial
 Attachments: HDFS-8118.001.patch, HDFS-8118.patch


 When the fs.trash.checkpoint.interval and the fs.trash.interval are set 
 non-zero and the same, it is possible for trash to be left for two intervals.
 The TrashPolicyDefault will use a floor and ceiling function to ensure that 
 the Trash will be checkpointed every interval of minutes.
 Each user's trash is checkpointed individually.  The time resolution of the 
 checkpoint timestamp is to the second.
 If the seconds switch while one user is checkpointing, then the next user's 
 timestamp will be later.
 This will cause the next user's checkpoint to not be deleted at the next 
 interval.
 I have recreated this in a lab cluster 
 I also have a suggestion for a patch that I can upload later tonight after 
 testing it further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting

2015-08-24 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14710624#comment-14710624
 ] 

Harsh J commented on HDFS-8118:
---

Thanks Casey,

You can run individual tests locally via {{mvn test 
-Dtest=TestWebDelegationToken}}, for example. The Jenkins build cannot retrigger 
specific tests, but you can always check past/future builds to inspect whether 
the test has been generally flaky, and search JIRA/emails to see if this has 
already been reported or is being worked on.

It doesn't appear related to the behaviour fix we're making here, and the test 
does pass locally for me, so I'm committing this shortly.

 Delay in checkpointing Trash can leave trash for 2 intervals before deleting
 

 Key: HDFS-8118
 URL: https://issues.apache.org/jira/browse/HDFS-8118
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Casey Brotherton
Assignee: Casey Brotherton
Priority: Trivial
 Attachments: HDFS-8118.001.patch, HDFS-8118.patch


 When the fs.trash.checkpoint.interval and the fs.trash.interval are set 
 non-zero and the same, it is possible for trash to be left for two intervals.
 The TrashPolicyDefault will use a floor and ceiling function to ensure that 
 the Trash will be checkpointed every interval of minutes.
 Each user's trash is checkpointed individually.  The time resolution of the 
 checkpoint timestamp is to the second.
 If the seconds switch while one user is checkpointing, then the next user's 
 timestamp will be later.
 This will cause the next user's checkpoint to not be deleted at the next 
 interval.
 I have recreated this in a lab cluster 
 I also have a suggestion for a patch that I can upload later tonight after 
 testing it further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8821) Explain message Operation category X is not supported in state standby

2015-07-30 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-8821:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Thanks Gautam, test passed locally for me as well. Seems unrelated.

I've pushed this into trunk and branch-2; thank you for the continued 
contributions!

 Explain message Operation category X is not supported in state standby 
 -

 Key: HDFS-8821
 URL: https://issues.apache.org/jira/browse/HDFS-8821
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Gautam Gopalakrishnan
Assignee: Gautam Gopalakrishnan
Priority: Minor
 Fix For: 2.8.0

 Attachments: HDFS-8821-1.patch, HDFS-8821-2.patch


 There is one message specifically that causes many users to question the 
 health of their HDFS cluster, namely Operation category READ/WRITE is not 
 supported in state standby.
 HDFS-3447 is an attempt to lower the logging severity for StandbyException 
 related messages but it is not resolved yet. So this jira is an attempt to 
 explain this particular message so it appears less scary.
 The text is question 3.17 in the Hadoop Wiki FAQ
 ref: 
 https://wiki.apache.org/hadoop/FAQ#What_does_the_message_.22Operation_category_READ.2FWRITE_is_not_supported_in_state_standby.22_mean.3F



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8821) Explain message Operation category X is not supported in state standby

2015-07-25 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14641777#comment-14641777
 ] 

Harsh J commented on HDFS-8821:
---

+1 looks good to me. This should help avoid the continual confusion people new 
to HDFS HA appear to have (from experience) about whether that error is to be 
taken seriously or not. Thank you for writing up the wiki entry too!

bq. hadoop.hdfs.server.namenode.ha.TestStandbyIsHot

Test name appears relevant but the failing test does not (fails at counting 
proper under-replicated blocks value). I'll still test this again manually 
before committing (by Tuesday EOD).

 Explain message Operation category X is not supported in state standby 
 -

 Key: HDFS-8821
 URL: https://issues.apache.org/jira/browse/HDFS-8821
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Gautam Gopalakrishnan
Assignee: Gautam Gopalakrishnan
Priority: Minor
 Attachments: HDFS-8821-1.patch, HDFS-8821-2.patch


 There is one message specifically that causes many users to question the 
 health of their HDFS cluster, namely Operation category READ/WRITE is not 
 supported in state standby.
 HDFS-3447 is an attempt to lower the logging severity for StandbyException 
 related messages but it is not resolved yet. So this jira is an attempt to 
 explain this particular message so it appears less scary.
 The text is question 3.17 in the Hadoop Wiki FAQ
 ref: 
 https://wiki.apache.org/hadoop/FAQ#What_does_the_message_.22Operation_category_READ.2FWRITE_is_not_supported_in_state_standby.22_mean.3F



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8486) DN startup may cause severe data loss

2015-07-20 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-8486:
--
Release Note: 
Public service notice:
- Every restart of a 2.6.x or 2.7.0 DN incurs a risk of unwanted block deletion.
- Apply this patch if you are running a pre-2.7.1 release.

(Promoting comment into release-notes area of JIRA just so its better visible)

 DN startup may cause severe data loss
 -

 Key: HDFS-8486
 URL: https://issues.apache.org/jira/browse/HDFS-8486
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 0.23.1, 2.0.0-alpha
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Blocker
 Fix For: 2.7.1

 Attachments: HDFS-8486.patch, HDFS-8486.patch


 A race condition between block pool initialization and the directory scanner 
 may cause a mass deletion of blocks in multiple storages.
 If block pool initialization finds a block on disk that is already in the 
 replica map, it deletes one of the blocks based on size, GS, etc.  
 Unfortunately it _always_ deletes one of the blocks even if identical, thus 
 the replica map _must_ be empty when the pool is initialized.
 The directory scanner starts at a random time within its periodic interval 
 (default 6h).  If the scanner starts very early it races to populate the 
 replica map, causing the block pool init to erroneously delete blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8660) Slow write to packet mirror should log which mirror and which block

2015-07-03 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613063#comment-14613063
 ] 

Harsh J commented on HDFS-8660:
---

This would be an excellent improvement for certain performance troubleshooting. 
In looking for more such "Slow ..." messages, the following matches may also need 
similar changes:

The "Slow ReadProcessor" message in DataStreamer.java can benefit from a block 
ID.
The "Slow waitForAckedSeqno" message in DataStreamer.java too could benefit 
from a block ID as well as a nodes list.

Just the block ID can also be added into the below messages under 
BlockReceiver.java:

"Slow flushOrSync"
"Slow BlockReceiver write data to disk"
"Slow manageWriterOsCache"

The DN mirror host and block ID can both be added into the below message under 
BlockReceiver.java:

"Slow PacketResponder send ack to upstream took"

Could you check if these are possible to do as part of the same JIRA as simple 
changes too?
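
For illustration, a tiny sketch of what carrying the block and mirror in one of 
these warnings could look like; the names are placeholders, not the committed 
change:

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch only: include block ID and mirror address in a slow-write warning. */
class SlowWriteLogSketch {
  private static final Logger LOG = LoggerFactory.getLogger(SlowWriteLogSketch.class);

  static void warnIfSlow(long durationMs, long thresholdMs, String blockId, String mirrorAddr) {
    if (durationMs > thresholdMs) {
      LOG.warn("Slow BlockReceiver write packet to mirror took {}ms (threshold={}ms), "
          + "block={}, mirror={}", durationMs, thresholdMs, blockId, mirrorAddr);
    }
  }
}
{code}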

 Slow write to packet mirror should log which mirror and which block
 ---

 Key: HDFS-8660
 URL: https://issues.apache.org/jira/browse/HDFS-8660
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.7.0
Reporter: Hazem Mahmoud
Assignee: Hazem Mahmoud

 Currently, log format states something similar to: 
 Slow BlockReceiver write packet to mirror took 468ms (threshold=300ms)
 For troubleshooting purposes, it would be good to have it mention which block 
 ID it's writing as well as the mirror (DN) that it's writing it to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3627) OfflineImageViewer oiv Indented processor prints out the Java class name in the DELEGATION_KEY field

2015-06-04 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572631#comment-14572631
 ] 

Harsh J commented on HDFS-3627:
---

bq. Its used only for the old Pre-protobuf images.

Yup but we seem to support writing a legacy copy for the OIV specifically, so 
this fix could still be useful to some.

 OfflineImageViewer oiv Indented processor prints out the Java class name in 
 the DELEGATION_KEY field
 

 Key: HDFS-3627
 URL: https://issues.apache.org/jira/browse/HDFS-3627
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 0.23.0
Reporter: Ravi Prakash
Priority: Minor
  Labels: BB2015-05-TBR
 Attachments: HDFS-3627.patch, HDFS-3627.patch, HDFS-3627.patch, 
 HDFS-3627.patch, HDFS-3627.patch, HDFS-3627.patch


 Instead of the contents of the delegation key this is printed out
 DELEGATION_KEY = 
 org.apache.hadoop.security.token.delegation.DelegationKey@1e2ca7
 DELEGATION_KEY = 
 org.apache.hadoop.security.token.delegation.DelegationKey@105bd58
 DELEGATION_KEY = 
 org.apache.hadoop.security.token.delegation.DelegationKey@1d1e730
 DELEGATION_KEY = 
 org.apache.hadoop.security.token.delegation.DelegationKey@1a116c9
 DELEGATION_KEY = 
 org.apache.hadoop.security.token.delegation.DelegationKey@df1832



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output

2015-06-03 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570433#comment-14570433
 ] 

Harsh J commented on HDFS-8516:
---

I wasn't worried about the usage ones, as I'd generally never parse them, but if we'd 
like the extra newlines fixed in those as well, I think I'll also need to target the 
cache/fs-shell/trace tools. If that's needed, it may be better to add a builder option 
to TableListing itself so that it does not append the newline after its last row 
(sketched below). Let me know!
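
A minimal sketch of the kind of builder option meant here. This is a hypothetical, 
standalone builder used purely for illustration; it is not the actual 
org.apache.hadoop.tools.TableListing API.

{code}
// Hypothetical table builder with an opt-out for the trailing newline.
public class SimpleTableBuilder {
  private final StringBuilder rows = new StringBuilder();
  private boolean trailingNewline = true;   // assumed current behaviour

  public SimpleTableBuilder addRow(String... cols) {
    rows.append(String.join("  ", cols)).append('\n');
    return this;
  }

  // The proposed knob: callers that print the table via System.out.println()
  // could disable the extra newline after the last row.
  public SimpleTableBuilder omitTrailingNewline() {
    trailingNewline = false;
    return this;
  }

  public String build() {
    String table = rows.toString();
    return trailingNewline ? table : table.replaceAll("\n$", "");
  }

  public static void main(String[] args) {
    String table = new SimpleTableBuilder()
        .addRow("/zone1", "key1")
        .addRow("/zone2", "key2")
        .omitTrailingNewline()
        .build();
    System.out.println(table);   // println itself supplies the single final newline
  }
}
{code}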

 The 'hdfs crypto -listZones' should not print an extra newline at end of 
 output
 ---

 Key: HDFS-8516
 URL: https://issues.apache.org/jira/browse/HDFS-8516
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: tools
Affects Versions: 2.7.0
Reporter: Harsh J
Assignee: Harsh J
Priority: Minor
 Attachments: HDFS-8516.patch


 It currently prints an extra newline (TableListing already adds a newline to 
 end of table string).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output

2015-06-02 Thread Harsh J (JIRA)
Harsh J created HDFS-8516:
-

 Summary: The 'hdfs crypto -listZones' should not print an extra 
newline at end of output
 Key: HDFS-8516
 URL: https://issues.apache.org/jira/browse/HDFS-8516
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: tools
Reporter: Harsh J
Assignee: Harsh J
Priority: Minor


It currently prints an extra newline (TableListing already adds a newline to 
end of table string).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output

2015-06-02 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-8516:
--
Attachment: HDFS-8516.patch

 The 'hdfs crypto -listZones' should not print an extra newline at end of 
 output
 ---

 Key: HDFS-8516
 URL: https://issues.apache.org/jira/browse/HDFS-8516
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: tools
Reporter: Harsh J
Assignee: Harsh J
Priority: Minor
 Attachments: HDFS-8516.patch


 It currently prints an extra newline (TableListing already adds a newline to 
 end of table string).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output

2015-06-02 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-8516:
--
 Target Version/s: 2.8.0
Affects Version/s: 2.7.0
   Status: Patch Available  (was: Open)

 The 'hdfs crypto -listZones' should not print an extra newline at end of 
 output
 ---

 Key: HDFS-8516
 URL: https://issues.apache.org/jira/browse/HDFS-8516
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: tools
Affects Versions: 2.7.0
Reporter: Harsh J
Assignee: Harsh J
Priority: Minor
 Attachments: HDFS-8516.patch


 It currently prints an extra newline (TableListing already adds a newline to 
 end of table string).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7097) Allow block reports to be processed during checkpointing on standby name node

2015-05-04 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-7097:
--
Target Version/s:   (was: 2.6.0)

 Allow block reports to be processed during checkpointing on standby name node
 -

 Key: HDFS-7097
 URL: https://issues.apache.org/jira/browse/HDFS-7097
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Critical
 Fix For: 2.7.0

 Attachments: HDFS-7097.patch, HDFS-7097.patch, HDFS-7097.patch, 
 HDFS-7097.patch, HDFS-7097.ultimate.trunk.patch


 On a reasonably busy HDFS cluster, there is a steady stream of creates, causing data 
 nodes to generate incremental block reports. When a standby name node is 
 checkpointing, RPC handler threads trying to process a full or incremental 
 block report are blocked on the name system's {{fsLock}}, because the 
 checkpointer acquires the read lock on it. This can create a serious problem 
 if the name space is big and checkpointing takes a long time.
 All available RPC handlers can be tied up very quickly. If you have 100 
 handlers, it only takes 34 file creates. If a separate service RPC port is 
 not used, HA transition will have to wait in the call queue for minutes. Even 
 if a separate service RPC port is configured, heartbeats from datanodes will 
 be blocked. A standby NN with a big name space can lose all its data nodes after 
 checkpointing. The RPC calls will also be retransmitted by data nodes many 
 times, filling up the call queue and potentially causing listen queue 
 overflow.
 Since block reports do not modify any state that is being saved to the 
 fsimage, I propose letting them through during checkpointing. 
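
A minimal, self-contained sketch of the contention described above, with a plain 
ReentrantReadWriteLock standing in for the name system's fsLock (this is not 
FSNamesystem code): while the checkpointer holds the read lock, a handler that needs 
the write lock for a block report simply waits, and handlers that follow pile up 
behind it.

{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stand-in illustration of the fsLock contention; not actual NameNode code.
public class CheckpointLockContention {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

    Thread checkpointer = new Thread(() -> {
      fsLock.readLock().lock();              // checkpointer holds the read lock...
      try {
        TimeUnit.SECONDS.sleep(5);           // ...for the whole (long) checkpoint
      } catch (InterruptedException ignored) {
      } finally {
        fsLock.readLock().unlock();
      }
    }, "standby-checkpointer");

    Thread blockReportHandler = new Thread(() -> {
      long start = System.nanoTime();
      fsLock.writeLock().lock();             // block report processing needs the write lock
      try {
        long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("block report handler waited " + waitedMs + " ms");
      } finally {
        fsLock.writeLock().unlock();
      }
    }, "rpc-handler");

    checkpointer.start();
    Thread.sleep(100);                       // ensure the read lock is taken first
    blockReportHandler.start();
    checkpointer.join();
    blockReportHandler.join();
  }
}
{code}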



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7442) Optimization for decommission-in-progress check

2015-04-30 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-7442:
--
Affects Version/s: 2.6.0

 Optimization for decommission-in-progress check
 ---

 Key: HDFS-7442
 URL: https://issues.apache.org/jira/browse/HDFS-7442
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.6.0
Reporter: Ming Ma

 1. {{isReplicationInProgress}} currently rescans all blocks of a given node 
 each time the method is called; it becomes less efficient as more of its 
 blocks become fully replicated. Each scan takes the FS lock.
 2. As discussed in HDFS-7374, if the node dies during decommission, 
 it is useful if the dead node can be marked as decommissioned after all its 
 blocks are fully replicated. Currently there is no way to check the blocks of 
 dead decommission-in-progress nodes, given that the dead node is removed from 
 the block map.
 There are mitigations for these limitations. Set 
 dfs.namenode.decommission.nodes.per.interval to a small value to reduce how 
 long the lock is held. HDFS-7409 uses global FS state to tell whether a dead 
 node's blocks are fully replicated.
 To address these scenarios, it would be useful to track the 
 decommission-in-progress blocks separately.
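
A minimal sketch of the separate-tracking idea, with hypothetical names rather than the 
NameNode's actual data structures: keep only the blocks that still need replication per 
decommissioning node and shrink that set as replication completes, instead of rescanning 
every block of the node on each pass.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of tracking decommission-in-progress blocks separately.
public class DecommissionTracker {
  // node id -> blocks on that node still below the expected replication
  private final Map<String, Set<Long>> pendingBlocksPerNode = new HashMap<>();

  public void startDecommission(String nodeId, Set<Long> underReplicatedBlocks) {
    pendingBlocksPerNode.put(nodeId, new HashSet<>(underReplicatedBlocks));
  }

  // Called as blocks reach their expected replication; much cheaper than
  // rescanning every block of every decommissioning node under the FS lock.
  public void onBlockFullyReplicated(long blockId) {
    for (Set<Long> pending : pendingBlocksPerNode.values()) {
      pending.remove(blockId);
    }
  }

  // A node (alive or dead) can be marked decommissioned once its set drains.
  public boolean isDecommissionComplete(String nodeId) {
    Set<Long> pending = pendingBlocksPerNode.get(nodeId);
    return pending != null && pending.isEmpty();
  }

  public static void main(String[] args) {
    DecommissionTracker tracker = new DecommissionTracker();
    tracker.startDecommission("dn-1", Set.of(101L, 102L));
    tracker.onBlockFullyReplicated(101L);
    tracker.onBlockFullyReplicated(102L);
    System.out.println(tracker.isDecommissionComplete("dn-1")); // prints true
  }
}
{code}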



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7442) Optimization for decommission-in-progress check

2015-04-30 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-7442:
--
Component/s: namenode

 Optimization for decommission-in-progress check
 ---

 Key: HDFS-7442
 URL: https://issues.apache.org/jira/browse/HDFS-7442
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.6.0
Reporter: Ming Ma

 1. {{isReplicationInProgress}} currently rescans all blocks of a given node 
 each time the method is called; it becomes less efficient as more of its 
 blocks become fully replicated. Each scan takes the FS lock.
 2. As discussed in HDFS-7374, if the node dies during decommission, 
 it is useful if the dead node can be marked as decommissioned after all its 
 blocks are fully replicated. Currently there is no way to check the blocks of 
 dead decommission-in-progress nodes, given that the dead node is removed from 
 the block map.
 There are mitigations for these limitations. Set 
 dfs.namenode.decommission.nodes.per.interval to a small value to reduce how 
 long the lock is held. HDFS-7409 uses global FS state to tell whether a dead 
 node's blocks are fully replicated.
 To address these scenarios, it would be useful to track the 
 decommission-in-progress blocks separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-3493) Invalidate excess corrupted blocks as long as minimum replication is satisfied

2015-04-30 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J updated HDFS-3493:
--
Component/s: namenode

 Invalidate excess corrupted blocks as long as minimum replication is satisfied
 --

 Key: HDFS-3493
 URL: https://issues.apache.org/jira/browse/HDFS-3493
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.0.0-alpha, 2.0.5-alpha
Reporter: J.Andreina
Assignee: Juan Yu
 Fix For: 2.5.0

 Attachments: HDFS-3493.002.patch, HDFS-3493.003.patch, 
 HDFS-3493.004.patch, HDFS-3493.patch


 Replication factor = 3, block report interval = 1 min; start the NN and 3 DNs.
 Step 1: Write a file without closing it and do an hflush (DN1, DN2, DN3 have blk_ts1).
 Step 2: Stop DN3.
 Step 3: Recovery happens and the timestamp (generation stamp) is updated (blk_ts2).
 Step 4: Close the file.
 Step 5: blk_ts2 is finalized and available on DN1 and DN2.
 Step 6: Now restart DN3 (which still has blk_ts1 in rbw).
 From the NN side, no command is issued to DN3 to delete blk_ts1; the stale 
 replica is only marked as corrupt.
 Replication of blk_ts2 to DN3 does not happen.
 NN logs:
 
 {noformat}
 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
 NameSystem.addToCorruptReplicasMap: duplicate requested for 
 blk_3927215081484173742 to add as corrupt on XX.XX.XX.XX:50276 by 
 /XX.XX.XX.XX because reported RWR replica with genstamp 1007 does not match 
 COMPLETE block's genstamp in block map 1008
 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from 
 DatanodeRegistration(XX.XX.XX.XX, 
 storageID=DS-443871816-XX.XX.XX.XX-50276-1336829714197, infoPort=50275, 
 ipcPort=50277, 
 storageInfo=lv=-40;cid=CID-e654ac13-92dc-4f82-a22b-c0b6861d06d7;nsid=2063001898;c=0),
  blocks: 2, processing time: 1 msecs
 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* Removing block 
 blk_3927215081484173742_1008 from neededReplications as it has enough 
 replicas.
 INFO org.apache.hadoop.hdfs.StateChange: BLOCK 
 NameSystem.addToCorruptReplicasMap: duplicate requested for 
 blk_3927215081484173742 to add as corrupt on XX.XX.XX.XX:50276 by 
 /XX.XX.XX.XX because reported RWR replica with genstamp 1007 does not match 
 COMPLETE block's genstamp in block map 1008
 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from 
 DatanodeRegistration(XX.XX.XX.XX, 
 storageID=DS-443871816-XX.XX.XX.XX-50276-1336829714197, infoPort=50275, 
 ipcPort=50277, 
 storageInfo=lv=-40;cid=CID-e654ac13-92dc-4f82-a22b-c0b6861d06d7;nsid=2063001898;c=0),
  blocks: 2, processing time: 1 msecs
 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not 
 able to place enough replicas, still in need of 1 to reach 1
 For more information, please enable DEBUG log level on 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
 {noformat}
 fsck Report
 ===
 {noformat}
 /file21:  Under replicated 
 BP-1008469586-XX.XX.XX.XX-1336829603103:blk_3927215081484173742_1008. Target 
 Replicas is 3 but found 2 replica(s).
 .Status: HEALTHY
  Total size:  495 B
  Total dirs:  1
  Total files: 3
  Total blocks (validated):3 (avg. block size 165 B)
  Minimally replicated blocks: 3 (100.0 %)
  Over-replicated blocks:  0 (0.0 %)
  Under-replicated blocks: 1 (33.32 %)
  Mis-replicated blocks:   0 (0.0 %)
  Default replication factor:  1
  Average block replication:   2.0
  Corrupt blocks:  0
  Missing replicas:1 (14.285714 %)
  Number of data-nodes:3
  Number of racks: 1
 FSCK ended at Sun May 13 09:49:05 IST 2012 in 9 milliseconds
 The filesystem under path '/' is HEALTHY
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-7960) The full block report should prune zombie storages even if they're not empty

2015-04-29 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J reassigned HDFS-7960:
-

Assignee: Colin Patrick McCabe  (was: Samuel Otero Schmidt)

 The full block report should prune zombie storages even if they're not empty
 

 Key: HDFS-7960
 URL: https://issues.apache.org/jira/browse/HDFS-7960
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lei (Eddy) Xu
Assignee: Colin Patrick McCabe
Priority: Critical
 Fix For: 2.7.0

 Attachments: HDFS-7960.002.patch, HDFS-7960.003.patch, 
 HDFS-7960.004.patch, HDFS-7960.005.patch, HDFS-7960.006.patch, 
 HDFS-7960.007.patch, HDFS-7960.008.patch


 The full block report should prune zombie storages even if they're not empty. 
  We have seen cases in production where zombie storages have not been pruned 
 subsequent to HDFS-7575.  This could arise any time the NameNode thinks there 
 is a block in some old storage which is actually not there.  In this case, 
 the block will not show up in the new storage (once old is renamed to new) 
 and the old storage will linger forever as a zombie, even with the HDFS-7596 
 fix applied.  This also happens with datanode hotplug, when a drive is 
 removed.  In this case, an entire storage (volume) goes away but the blocks 
 do not show up in another storage on the same datanode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting

2015-04-21 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505791#comment-14505791
 ] 

Harsh J commented on HDFS-8118:
---

Thanks for explaining that, Casey. It makes sense to use a constant checkpoint date for 
uniformity, and the fix for this looks alright to me.

It may also make sense that people want to set the checkpoint interval equal to the 
trash interval. I think we can drop the part of the patch that caps it to 1/2 of the 
interval and instead add a small doc note in hdfs-default.xml, on the trash checkpoint 
period property, describing what the behaviour could end up being if it's set equal to 
the trash clearing interval.

Would it also be possible to come up with a test case for this? For example, load some 
files into trash such that multiple directories need to be checkpointed, issue a 
checkpoint (or await its lowered interval), and ensure only one date is observed before 
clearing occurs. It would help avoid regressions in the future, just in case.
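
To make the timing issue concrete, a hedged, self-contained illustration of the 
second-resolution skew described in this issue (the "yyMMddHHmmss" format string here 
is an assumption, not necessarily the exact one TrashPolicyDefault uses): two users 
checkpointed a few hundred milliseconds apart can end up with different checkpoint 
directory names once the wall clock ticks over to the next second, so the later one is 
not old enough to delete at the next interval.

{code}
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustration only: per-user checkpoint names straddling a second boundary.
public class TrashCheckpointSkewExample {
  public static void main(String[] args) {
    SimpleDateFormat checkpointFormat = new SimpleDateFormat("yyMMddHHmmss");

    long base = System.currentTimeMillis();
    long nextSecond = (base / 1000 + 1) * 1000;   // first instant of the next second

    String userA = checkpointFormat.format(new Date(nextSecond - 200)); // before the tick
    String userB = checkpointFormat.format(new Date(nextSecond + 200)); // after the tick

    System.out.println("user A checkpoint dir: " + userA);
    System.out.println("user B checkpoint dir: " + userB);
    // userB's name is one second "newer"; at the next interval it is not yet old
    // enough to delete and lingers for one extra interval. Taking a single timestamp
    // once per checkpoint run avoids the skew.
  }
}
{code}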

 Delay in checkpointing Trash can leave trash for 2 intervals before deleting
 

 Key: HDFS-8118
 URL: https://issues.apache.org/jira/browse/HDFS-8118
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Casey Brotherton
Assignee: Casey Brotherton
Priority: Trivial
 Attachments: HDFS-8118.patch


 When the fs.trash.checkpoint.interval and the fs.trash.interval are set 
 non-zero and the same, it is possible for trash to be left for two intervals.
 The TrashPolicyDefault will use a floor and ceiling function to ensure that 
 the Trash will be checkpointed every interval of minutes.
 Each user's trash is checkpointed individually.  The time resolution of the 
 checkpoint timestamp is to the second.
 If the seconds switch while one user is checkpointing, then the next user's 
 timestamp will be later.
 This will cause the next user's checkpoint to not be deleted at the next 
 interval.
 I have recreated this in a lab cluster.
 I also have a suggestion for a patch that I can upload later tonight after 
 testing it further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure

2015-04-15 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496446#comment-14496446
 ] 

Harsh J commented on HDFS-8113:
---

Stale block copies left over on the DN can cause this condition; it
indeed goes away if you clear out the RBW directory on the DN.

Imagine this condition:
1. A file is being written, with a replica on node X among others.
2. The replica write to node X in the pipeline fails. The write carries on,
leaving a stale block copy in the RBW directory of node X.
3. The file gets closed and deleted soon or immediately after (but well
before a block report from X).
4. The block report then sends the RBW info, but the NN has no knowledge of the
block anymore.

I think modifying Colin's test this way should reproduce the issue (a rough
sketch follows the steps):

1. start a mini dfs cluster with 2 datanodes
2. create a file with repl=2, but do not close it (flush it to ensure
on-disk RBW write)
3. take down one DN
4. close and delete the file
5. wait
6. bring the downed DN back up; it will still have the RBW block
from the file that was deleted
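
A rough, hedged sketch of those steps as a standalone program; only widely used 
MiniDFSCluster and FileSystem calls are assumed, and the sleeps are placeholders for 
proper block-report triggering and assertions in a real test:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Rough reproduction sketch of the scenario above; not a polished test.
public class StaleRbwReproSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(2).build();                          // step 1: 2 DNs
    try {
      cluster.waitActive();
      FileSystem fs = cluster.getFileSystem();

      Path file = new Path("/stale-rbw-test");
      FSDataOutputStream out = fs.create(file, (short) 2); // step 2: repl=2, keep open
      out.write(new byte[1024]);
      out.hflush();                                        // force on-disk RBW replicas

      MiniDFSCluster.DataNodeProperties downed =
          cluster.stopDataNode(0);                         // step 3: take down one DN

      out.close();                                         // step 4: close...
      fs.delete(file, false);                              // ...and delete the file

      Thread.sleep(5000);                                  // step 5: wait

      cluster.restartDataNode(downed);                     // step 6: downed DN returns
      // with a stale RBW copy of a block the NN no longer knows about; its block
      // report should then exercise the failure path described in this issue.
      Thread.sleep(10000);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}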




-- 
Harsh J


 NullPointerException in BlockInfoContiguous causes block report failure
 ---

 Key: HDFS-8113
 URL: https://issues.apache.org/jira/browse/HDFS-8113
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Chengbing Liu
Assignee: Chengbing Liu
 Attachments: HDFS-8113.patch


 The following copy constructor can throw NullPointerException if {{bc}} is 
 null.
 {code}
   protected BlockInfoContiguous(BlockInfoContiguous from) {
 this(from, from.bc.getBlockReplication());
 this.bc = from.bc;
   }
 {code}
 We have observed that some DataNodes keep failing their block reports to the 
 NameNode. The stack trace is as follows. Though we are not using the latest 
 version, the problem still exists.
 {quote}
 2015-03-08 19:28:13,442 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
 RemoteException in offerService
 org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
 java.lang.NullPointerException
 at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.(BlockInfo.java:80)
 at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockToMarkCorrupt.(BlockManager.java:1696)
 at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.checkReplicaCorrupt(BlockManager.java:2185)
 at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReportedBlock(BlockManager.java:2047)
 at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiff(BlockManager.java:1950)
 at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1823)
 at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1750)
 at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1069)
 at 
 org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152)
 at 
 org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26382)
 at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1623)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7452) Can we skip getCorruptFiles() call for standby NameNode..?

2015-04-07 Thread Harsh J (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484575#comment-14484575
 ] 

Harsh J commented on HDFS-7452:
---

IIUC, the Web UI of the SBN tries to load corrupt/missing block file info from 
the SBN itself, which also causes this log spam. Could we eliminate that to 
address this?

 Can we skip getCorruptFiles() call for standby NameNode..?
 --

 Key: HDFS-7452
 URL: https://issues.apache.org/jira/browse/HDFS-7452
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Brahma Reddy Battula
Assignee: Brahma Reddy Battula
Priority: Trivial

 Seen the following WARN logs from the standby NameNode:
 {noformat}
 2014-11-27 17:50:32,497 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:50:42,557 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:50:52,617 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:51:00,058 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:51:00,117 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:51:02,678 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:51:12,738 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:51:22,798 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:51:30,058 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 2014-11-27 17:51:30,119 | WARN  | 512264920@qtp-429668078-606 | Get corrupt 
 file blocks returned error: Operation category READ is not supported in state 
 standby | 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916)
 {noformat}
 Do we need to make this call for the standby NN? I feel it might not be required. 
 Can we handle this based on the NN's HA state? Please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7306) can't decommission w/under construction blocks

2015-04-01 Thread Harsh J (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J resolved HDFS-7306.
---
Resolution: Duplicate

This should be resolved via HDFS-5579.

 can't decommission w/under construction blocks
 --

 Key: HDFS-7306
 URL: https://issues.apache.org/jira/browse/HDFS-7306
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Allen Wittenauer

 We need a way to decommission a node with open blocks.  Now that HDFS 
 supports append, this should be do-able.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

