[jira] [Assigned] (HDFS-2263) Make DFSClient report bad blocks more quickly
[ https://issues.apache.org/jira/browse/HDFS-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J reassigned HDFS-2263: - Assignee: (was: Harsh J) > Make DFSClient report bad blocks more quickly > - > > Key: HDFS-2263 > URL: https://issues.apache.org/jira/browse/HDFS-2263 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 0.20.2 >Reporter: Aaron T. Myers >Priority: Major > Attachments: HDFS-2263.patch > > > In certain circumstances the DFSClient may detect a block as being bad > without reporting it promptly to the NN. > If when reading a file a client finds an invalid checksum of a block, it > immediately reports that bad block to the NN. If when serving up a block a DN > finds a truncated block, it reports this to the client, but the client merely > adds that DN to the list of dead nodes and moves on to trying another DN, > without reporting this to the NN. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
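The behavioral gap described in HDFS-2263 can be sketched as follows. This is illustrative Python, not the real DFSClient API; the class and method names are invented for the sketch. The point is only that both failure paths — a locally detected checksum error and a DN-reported truncated block — should end with a report to the NameNode:

```python
class DFSClientSketch:
    """Illustrative sketch of the reporting behavior proposed in HDFS-2263.

    None of these names mirror the real DFSClient; they only show that
    *both* failure paths should notify the NameNode before moving on.
    """

    def __init__(self, namenode):
        self.namenode = namenode
        self.dead_nodes = set()

    def handle_checksum_failure(self, block, datanode):
        # Existing behavior: a checksum error found while reading is
        # reported to the NN right away.
        self.namenode.report_bad_block(block, datanode)
        self.dead_nodes.add(datanode)

    def handle_truncated_block(self, block, datanode):
        # Proposed behavior: a truncated block reported by the DN should
        # ALSO be reported to the NN, instead of the client merely adding
        # the DN to its dead-node list and trying another replica.
        self.namenode.report_bad_block(block, datanode)
        self.dead_nodes.add(datanode)
```

Under the pre-patch behavior, only the first method would call `report_bad_block`; the patch's intent is to make the second path symmetric.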
[jira] [Assigned] (HDFS-2561) Under dfsadmin -report, show a proper 'last contact time' for decommissioned/dead nodes.
[ https://issues.apache.org/jira/browse/HDFS-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J reassigned HDFS-2561: - Assignee: (was: Harsh J) > Under dfsadmin -report, show a proper 'last contact time' for > decommissioned/dead nodes. > > > Key: HDFS-2561 > URL: https://issues.apache.org/jira/browse/HDFS-2561 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 0.20.2 >Reporter: Harsh J >Priority: Major > > Right now, the last contact period gets reset to 0 once we mark a DN dead. > This can be improved to show a proper time instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-1590) Decommissioning never ends when node to decommission has blocks that are under-replicated and cannot be replicated to the expected level of replication
[ https://issues.apache.org/jira/browse/HDFS-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J reassigned HDFS-1590: - Assignee: (was: Harsh J) > Decommissioning never ends when node to decommission has blocks that are > under-replicated and cannot be replicated to the expected level of replication > --- > > Key: HDFS-1590 > URL: https://issues.apache.org/jira/browse/HDFS-1590 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 0.20.2 > Environment: Linux >Reporter: Mathias Herberts >Priority: Minor > > On a test cluster with 4 DNs and a default repl level of 3, I recently > attempted to decommission one of the DNs. Right after the modification of the > dfs.hosts.exclude file and the 'dfsadmin -refreshNodes', I could see the > blocks being replicated to other nodes. > After a while, the replication stopped but the node was not marked as > decommissioned. > When running an 'fsck -files -blocks -locations' I saw that all files had a > replication of 4 (which is logical given there are 4 DNs), but some of the > files had an expected replication set to 10 (those were job.jar files from > M/R jobs). > I ran 'fs -setrep 3' on those files and shortly after the namenode reported > the DN as decommissioned. > Shouldn't this case be checked by the NameNode when decommissioning a node? > i.e., consider a node decommissioned if either one of the following is true > for each block on the node being decommissioned: > 1. It is replicated at or above the expected replication level. > 2. It is replicated as much as possible given the available nodes, even > though it is less replicated than expected.
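The completion rule proposed in HDFS-1590 — a decommissioning node is done once each of its blocks is either at its expected replication, or as replicated as the remaining nodes allow — can be sketched like this. This is illustrative logic, not the NameNode's actual decommission-monitor implementation:

```python
def block_fully_migrated(live_replicas, expected_replication, usable_datanodes):
    """A block no longer pins a decommissioning node once it has reached
    min(expected replication, number of usable target nodes) live replicas.
    Illustrative only; not the NameNode's actual check."""
    achievable = min(expected_replication, usable_datanodes)
    return live_replicas >= achievable

def decommission_complete(blocks, usable_datanodes):
    # blocks: iterable of (live_replicas, expected_replication) pairs for
    # every block stored on the node being decommissioned.
    return all(
        block_fully_migrated(live, expected, usable_datanodes)
        for live, expected in blocks
    )
```

In the reporter's scenario, a job.jar block with expected replication 10 but only 4 DNs in the cluster would satisfy rule 2 (`min(10, 4) = 4` replicas achieved), so the decommission would finish without the manual `fs -setrep 3` workaround.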
[jira] [Assigned] (HDFS-4638) TransferFsImage should take Configuration as parameter
[ https://issues.apache.org/jira/browse/HDFS-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J reassigned HDFS-4638: - Assignee: (was: Harsh J) > TransferFsImage should take Configuration as parameter > -- > > Key: HDFS-4638 > URL: https://issues.apache.org/jira/browse/HDFS-4638 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.0.0-alpha1 >Reporter: Todd Lipcon >Priority: Minor > Labels: BB2015-05-TBR > Attachments: HDFS-4638.patch > > > TransferFsImage currently creates a new HdfsConfiguration object, rather than > taking one passed in. This means that using {{dfsadmin -fetchImage}}, you > can't pass a different timeout on the command line, since the Tool's > configuration doesn't get plumbed through. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output
[ https://issues.apache.org/jira/browse/HDFS-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J reassigned HDFS-8516: - Assignee: (was: Harsh J) > The 'hdfs crypto -listZones' should not print an extra newline at end of > output > --- > > Key: HDFS-8516 > URL: https://issues.apache.org/jira/browse/HDFS-8516 > Project: Hadoop HDFS > Issue Type: Improvement > Components: tools >Affects Versions: 2.7.0 >Reporter: Harsh J >Priority: Minor > Attachments: HDFS-8516.patch > > > It currently prints an extra newline (TableListing already adds a newline to > end of table string). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
[ https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J reassigned HDFS-10237: -- Assignee: (was: Harsh J) > Support specifying checksum type in WebHDFS/HTTPFS writers > -- > > Key: HDFS-10237 > URL: https://issues.apache.org/jira/browse/HDFS-10237 > Project: Hadoop HDFS > Issue Type: New Feature > Components: webhdfs >Affects Versions: 2.8.0 >Reporter: Harsh J >Priority: Minor > Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, > HDFS-10237.002.patch, HDFS-10237.002.patch > > > Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS > writer, as you can with the regular DFS writer (done via HADOOP-8240) > This JIRA covers the changes necessary to bring the same ability to WebHDFS > and HTTPFS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable
[ https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025565#comment-16025565 ] Harsh J commented on HDFS-11421: The patch looks good to me. Were the style changes done out of checkstyle warnings? I only notice two changes, one is a comment becoming multi-line, the other's the DOMAIN static member being made lowercased. +1 > Make WebHDFS' ACLs RegEx configurable > - > > Key: HDFS-11421 > URL: https://issues.apache.org/jira/browse/HDFS-11421 > Project: Hadoop HDFS > Issue Type: Improvement > Components: webhdfs >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-11421.000.patch, HDFS-11421-branch-2.000.patch, > HDFS-11421.branch-2.001.patch, HDFS-11421.branch-2.003.patch > > > Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently > identifies the passed arguments via a hard-coded regex that mandates certain > group and user naming styles. > A similar limitation had existed before for CHOWN and other User/Group set > related operations of WebHDFS, where it was then made configurable via > HDFS-11391 + HDFS-4983. > Such configurability should be allowed for the ACL operations too. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
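For operators, the change in HDFS-11421 surfaces as a pattern property in hdfs-site.xml. The key name and value below are assumptions modeled on the earlier CHOWN user/group pattern keys from HDFS-11391/HDFS-4983 — verify both against the hdfs-default.xml of the release you run:

```xml
<!-- ASSUMED key name and ILLUSTRATIVE value; check hdfs-default.xml.
     Relaxes the hard-coded ACL-spec regex so user/group names with
     dots, hyphens, or '@' can be used in WebHDFS GETACLSTATUS/SETACL. -->
<property>
  <name>dfs.webhdfs.acl.provider.permission.pattern</name>
  <value>^(default:)?(user|group|mask|other):[A-Za-z0-9_.@/-]*:([rwx-]{3})?(,(default:)?(user|group|mask|other):[A-Za-z0-9_.@/-]*:([rwx-]{3})?)*$</value>
</property>
```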
[jira] [Commented] (HDFS-9868) Add ability for DistCp to run between 2 clusters
[ https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15907295#comment-15907295 ] Harsh J commented on HDFS-9868: --- Would this mapping allow DistCp to be aware of destination's distinct KMS URIs? A popular ask among some of the customers I work with seems to be an ability to allow DistCp to copy from EZ of one cluster into EZ of another, where the KMS (keys) are not shared. > Add ability for DistCp to run between 2 clusters > > > Key: HDFS-9868 > URL: https://issues.apache.org/jira/browse/HDFS-9868 > Project: Hadoop HDFS > Issue Type: Improvement > Components: distcp >Affects Versions: 2.7.1 >Reporter: NING DING >Assignee: NING DING > Attachments: HDFS-9868.05.patch, HDFS-9868.06.patch, > HDFS-9868.07.patch, HDFS-9868.08.patch, HDFS-9868.09.patch, > HDFS-9868.10.patch, HDFS-9868.1.patch, HDFS-9868.2.patch, HDFS-9868.3.patch, > HDFS-9868.4.patch > > > Normally the HDFS cluster is HA enabled. It could take a long time when > copying huge data with DistCp. If the source cluster's active namenode changes, the > DistCp job will fail. This patch enables DistCp to read source cluster > files in HA access mode. A source cluster configuration file needs to be > specified (via the -sourceClusterConf option). 
> The following is an example of the contents of a source cluster configuration file:
> {code:xml}
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>host1:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>host2:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn1</name>
>     <value>host1:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn2</name>
>     <value>host2:50070</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
> </configuration>
> {code}
> The invocation of DistCp is as below:
> {code}
> bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar hdfs://nn2:8020/bar/foo
> {code}
[jira] [Commented] (HDFS-7290) Add HTTP response code to the HttpPutFailedException message
[ https://issues.apache.org/jira/browse/HDFS-7290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901175#comment-15901175 ] Harsh J commented on HDFS-7290: --- [~ajisakaa] or [~jojochuang], could you help review this simple exception message enhancement? > Add HTTP response code to the HttpPutFailedException message > > > Key: HDFS-7290 > URL: https://issues.apache.org/jira/browse/HDFS-7290 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.5.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Labels: BB2015-05-TBR > Attachments: HDFS-7290.patch > > > If the TransferFsImage#uploadImageFromStorage(…) call fails for some reason, > we try to print back the reason of the connection failure. > We currently only grab connection.getResponseMessage(…) and use that as our > exception's lone string, but this can often be empty if there was no real > response message from the connection end. However, the failures always have a > code, so we should also ensure to print the error code returned, for at least > a partial hint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable
[ https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891804#comment-15891804 ] Harsh J commented on HDFS-11421: [~xiaochen] - Thanks! I'm uncertain how to trigger just a branch-2 build, but a local build passes with the patch applied, along with the modified tests. > Make WebHDFS' ACLs RegEx configurable > - > > Key: HDFS-11421 > URL: https://issues.apache.org/jira/browse/HDFS-11421 > Project: Hadoop HDFS > Issue Type: Improvement > Components: webhdfs >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J > Fix For: 3.0.0-alpha3 > > Attachments: HDFS-11421.000.patch, HDFS-11421-branch-2.000.patch > > > Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently > identifies the passed arguments via a hard-coded regex that mandates certain > group and user naming styles. > A similar limitation had existed before for CHOWN and other User/Group set > related operations of WebHDFS, where it was then made configurable via > HDFS-11391 + HDFS-4983. > Such configurability should be allowed for the ACL operations too. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable
[ https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-11421: --- Target Version/s: 2.9.0, 3.0.0-beta1 (was: 2.9.0) > Make WebHDFS' ACLs RegEx configurable > - > > Key: HDFS-11421 > URL: https://issues.apache.org/jira/browse/HDFS-11421 > Project: Hadoop HDFS > Issue Type: Improvement > Components: webhdfs >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J > Attachments: HDFS-11421.000.patch, HDFS-11421-branch-2.000.patch > > > Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently > identifies the passed arguments via a hard-coded regex that mandates certain > group and user naming styles. > A similar limitation had existed before for CHOWN and other User/Group set > related operations of WebHDFS, where it was then made configurable via > HDFS-11391 + HDFS-4983. > Such configurability should be allowed for the ACL operations too. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable
[ https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-11421: --- Attachment: HDFS-11421.000.patch HDFS-11421-branch-2.000.patch Thank you for the review [~xiaochen]! I've updated the PR with a commit addressing your comments. I'm also attaching a patch form directly here. > Make WebHDFS' ACLs RegEx configurable > - > > Key: HDFS-11421 > URL: https://issues.apache.org/jira/browse/HDFS-11421 > Project: Hadoop HDFS > Issue Type: Improvement > Components: webhdfs >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J > Attachments: HDFS-11421.000.patch, HDFS-11421-branch-2.000.patch > > > Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently > identifies the passed arguments via a hard-coded regex that mandates certain > group and user naming styles. > A similar limitation had existed before for CHOWN and other User/Group set > related operations of WebHDFS, where it was then made configurable via > HDFS-11391 + HDFS-4983. > Such configurability should be allowed for the ACL operations too. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11422) Need a method to refresh the list of NN StorageDirectories after removal
[ https://issues.apache.org/jira/browse/HDFS-11422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871117#comment-15871117 ] Harsh J commented on HDFS-11422: IIRC, you can also use the command {{hdfs dfsadmin -restoreFailedStorage}} to control this. > Need a method to refresh the list of NN StorageDirectories after removal > > > Key: HDFS-11422 > URL: https://issues.apache.org/jira/browse/HDFS-11422 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.6.0 >Reporter: Attila Bukor > > If a NN storage directory is removed due to an error, the NameNode will fail > to write the image even if the issue was intermittent. It would be good to > have a way to make the NameNode try writing again after the issue is fixed - > and maybe even try it automatically every certain amount of time > (configurable). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11422) Need a method to refresh the list of NN StorageDirectories after removal
[ https://issues.apache.org/jira/browse/HDFS-11422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871115#comment-15871115 ] Harsh J commented on HDFS-11422: Does the feature offered by {{dfs.namenode.name.dir.restore}} help, even partially? For NNs with multiple directories, a reattempt of an earlier failed directory gets tried at every checkpoint trigger. It's not purely time-based, however. We'd considered making it default to true (HDFS-3560), but that was not done. If this fits, perhaps we can turn it on by default as the feature has existed for a while now. > Need a method to refresh the list of NN StorageDirectories after removal > > > Key: HDFS-11422 > URL: https://issues.apache.org/jira/browse/HDFS-11422 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.6.0 >Reporter: Attila Bukor > > If a NN storage directory is removed due to an error, the NameNode will fail > to write the image even if the issue was intermittent. It would be good to > have a way to make the NameNode try writing again after the issue is fixed - > and maybe even try it automatically every certain amount of time > (configurable).
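For reference, the mechanism named in the comments above is a single hdfs-site.xml switch; the value shown is illustrative (the property exists, but the default discussed above is false):

```xml
<!-- Re-adopt a previously failed NN storage directory at the next
     checkpoint trigger, instead of dropping it until restart. -->
<property>
  <name>dfs.namenode.name.dir.restore</name>
  <value>true</value>
</property>
```

The runtime counterpart mentioned in the sibling comment is the admin command `hdfs dfsadmin -restoreFailedStorage true|false|check`, which toggles or queries the same behavior without a config edit.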
[jira] [Updated] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable
[ https://issues.apache.org/jira/browse/HDFS-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-11421: --- Affects Version/s: 2.6.0 Target Version/s: 2.9.0 Status: Patch Available (was: Open) > Make WebHDFS' ACLs RegEx configurable > - > > Key: HDFS-11421 > URL: https://issues.apache.org/jira/browse/HDFS-11421 > Project: Hadoop HDFS > Issue Type: Improvement > Components: webhdfs >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J > > Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently > identifies the passed arguments via a hard-coded regex that mandates certain > group and user naming styles. > A similar limitation had existed before for CHOWN and other User/Group set > related operations of WebHDFS, where it was then made configurable via > HDFS-11391 + HDFS-4983. > Such configurability should be allowed for the ACL operations too. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-11421) Make WebHDFS' ACLs RegEx configurable
Harsh J created HDFS-11421: -- Summary: Make WebHDFS' ACLs RegEx configurable Key: HDFS-11421 URL: https://issues.apache.org/jira/browse/HDFS-11421 Project: Hadoop HDFS Issue Type: Improvement Components: webhdfs Reporter: Harsh J Assignee: Harsh J Part of HDFS-5608 added support for GET/SET ACLs over WebHDFS. This currently identifies the passed arguments via a hard-coded regex that mandates certain group and user naming styles. A similar limitation had existed before for CHOWN and other User/Group set related operations of WebHDFS, where it was then made configurable via HDFS-11391 + HDFS-4983. Such configurability should be allowed for the ACL operations too. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-3302) Review and improve HDFS trash documentation
[ https://issues.apache.org/jira/browse/HDFS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-3302: -- Assignee: (was: Harsh J) > Review and improve HDFS trash documentation > --- > > Key: HDFS-3302 > URL: https://issues.apache.org/jira/browse/HDFS-3302 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.0.0-alpha1 >Reporter: Harsh J > Labels: docs > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HDFS-3302.patch > > > Improve Trash documentation for users. > (0.23 published release docs are missing original HDFS docs btw...) > A set of FAQ-like questions can be found on HDFS-2740 > I'll update the ticket shortly with the areas to cover in the docs, as > enabling trash by default (HDFS-2740) would be considered as a wide behavior > change per its follow ups. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-3302) Review and improve HDFS trash documentation
[ https://issues.apache.org/jira/browse/HDFS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825810#comment-15825810 ] Harsh J commented on HDFS-3302: --- Just a short note: This seems to be incorrectly attributed (to me). Actual patch contributor was [~pmkiran19]. > Review and improve HDFS trash documentation > --- > > Key: HDFS-3302 > URL: https://issues.apache.org/jira/browse/HDFS-3302 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation >Affects Versions: 3.0.0-alpha1 >Reporter: Harsh J >Assignee: Harsh J > Labels: docs > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HDFS-3302.patch > > > Improve Trash documentation for users. > (0.23 published release docs are missing original HDFS docs btw...) > A set of FAQ-like questions can be found on HDFS-2740 > I'll update the ticket shortly with the areas to cover in the docs, as > enabling trash by default (HDFS-2740) would be considered as a wide behavior > change per its follow ups. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-2569) DN decommissioning quirks
[ https://issues.apache.org/jira/browse/HDFS-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-2569. --- Resolution: Cannot Reproduce Assignee: (was: Harsh J) Cannot quite reproduce this on current versions. > DN decommissioning quirks > - > > Key: HDFS-2569 > URL: https://issues.apache.org/jira/browse/HDFS-2569 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 0.23.0 >Reporter: Harsh J > > Decommissioning a node is working slightly odd in 0.23+: > The steps I did: > - Start HDFS via {{hdfs namenode}} and {{hdfs datanode}}. 1-node cluster. > - Zero files/blocks, so I go ahead and exclude-add my DN and do {{hdfs > dfsadmin -refreshNodes}} > - I see the following log in NN tails, which is fine: > {code} > 11/11/20 09:28:10 INFO util.HostsFileReader: Setting the includes file to > 11/11/20 09:28:10 INFO util.HostsFileReader: Setting the excludes file to > build/test/excludes > 11/11/20 09:28:10 INFO util.HostsFileReader: Refreshing hosts > (include/exclude) list > 11/11/20 09:28:10 INFO util.HostsFileReader: Adding 192.168.1.23 to the list > of hosts from build/test/excludes > {code} > - However, DN log tail gets no new messages. DN still runs. > - The dfshealth.jsp page shows this table, which makes no sense -- why is > there 1 live and 1 dead?: > |Live Nodes|1 (Decommissioned: 1)| > |Dead Nodes|1 (Decommissioned: 0)| > |Decommissioning Nodes|0| > - The live nodes page shows this, meaning DN is still up and heartbeating but > is decommissioned: > |Node|Last Contact|Admin State| > |192.168.1.23|0|Decommissioned| > - The dead nodes page shows this, and the link to the DN is broken cause the > port is linked as -1. Also, showing 'false' for decommissioned makes no sense > when live node page shows that it is already decommissioned: > |Node|Decommissioned| > |192.168.1.23|false| > Investigating if this is a quirk only observed when the DN had 0 blocks on it > in sum total. 
[jira] [Updated] (HDFS-2936) Provide a way to apply a minimum replication factor aside of strict minimum live replicas feature
[ https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-2936: -- Description: If an admin wishes to enforce replication today for all the users of their cluster, they may set {{dfs.namenode.replication.min}}. This property prevents users from creating files with < expected replication factor. However, the value of minimum replication set by the above value is also checked at several other points, especially during completeFile (close) operations. If a condition arises wherein a write's pipeline may have gotten only < minimum nodes in it, the completeFile operation does not successfully close the file and the client begins to hang waiting for NN to replicate the last bad block in the background. This form of hard-guarantee can, for example, bring down clusters of HBase during high xceiver load on DN, or disk fill-ups on many of them, etc.. I propose we should split the property in two parts: * dfs.namenode.replication.min ** Stays the same name, but only checks file creation time replication factor value and during adjustments made via setrep/etc. * dfs.namenode.replication.min.for.write ** New property that disconnects the rest of the checks from the above property, such as the checks done during block commit, file complete/close, safemode checks for block availability, etc.. Alternatively, we may also choose to remove the client-side hang of completeFile/close calls with a set number of retries. This would further require discussion about how a file-closure handle ought to be handled. was: If an admin wishes to enforce replication today for all the users of their cluster, he may set {{dfs.namenode.replication.min}}. This property prevents users from creating files with < expected replication factor. However, the value of minimum replication set by the above value is also checked at several other points, especially during completeFile (close) operations. 
If a condition arises wherein a write's pipeline may have gotten only < minimum nodes in it, the completeFile operation does not successfully close the file and the client begins to hang waiting for NN to replicate the last bad block in the background. This form of hard-guarantee can, for example, bring down clusters of HBase during high xceiver load on DN, or disk fill-ups on many of them, etc.. I propose we should split the property in two parts: * dfs.namenode.replication.min ** Stays the same name, but only checks file creation time replication factor value and during adjustments made via setrep/etc. * dfs.namenode.replication.min.for.write ** New property that disconnects the rest of the checks from the above property, such as the checks done during block commit, file complete/close, safemode checks for block availability, etc.. Alternatively, we may also choose to remove the client-side hang of completeFile/close calls with a set number of retries. This would further require discussion about how a file-closure handle ought to be handled. > Provide a way to apply a minimum replication factor aside of strict minimum > live replicas feature > - > > Key: HDFS-2936 > URL: https://issues.apache.org/jira/browse/HDFS-2936 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 0.23.0 >Reporter: Harsh J > Attachments: HDFS-2936.patch > > > If an admin wishes to enforce replication today for all the users of their > cluster, they may set {{dfs.namenode.replication.min}}. This property > prevents users from creating files with < expected replication factor. > However, the value of minimum replication set by the above value is also > checked at several other points, especially during completeFile (close) > operations. 
If a condition arises wherein a write's pipeline may have gotten > only < minimum nodes in it, the completeFile operation does not successfully > close the file and the client begins to hang waiting for NN to replicate the > last bad block in the background. This form of hard-guarantee can, for > example, bring down clusters of HBase during high xceiver load on DN, or disk > fill-ups on many of them, etc.. > I propose we should split the property in two parts: > * dfs.namenode.replication.min > ** Stays the same name, but only checks file creation time replication factor > value and during adjustments made via setrep/etc. > * dfs.namenode.replication.min.for.write > ** New property that disconnects the rest of the checks from the above > property, such as the checks done during block commit, file complete/close, > safemode checks for block availability, etc.. > Alternatively, we may also choose to remove the client-side hang of > completeFile/close calls with a set number of retries. This would further > require discussion about how a file-closure handle ought to be handled.
[jira] [Updated] (HDFS-2936) Provide a way to apply a minimum replication factor aside of strict minimum live replicas feature
[ https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-2936: -- Summary: Provide a way to apply a minimum replication factor aside of strict minimum live replicas feature (was: File close()-ing hangs indefinitely if the number of live blocks does not match the minimum replication) > Provide a way to apply a minimum replication factor aside of strict minimum > live replicas feature > - > > Key: HDFS-2936 > URL: https://issues.apache.org/jira/browse/HDFS-2936 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 0.23.0 >Reporter: Harsh J >Assignee: Harsh J > Attachments: HDFS-2936.patch > > > If an admin wishes to enforce replication today for all the users of their > cluster, he may set {{dfs.namenode.replication.min}}. This property prevents > users from creating files with < expected replication factor. > However, the value of minimum replication set by the above value is also > checked at several other points, especially during completeFile (close) > operations. If a condition arises wherein a write's pipeline may have gotten > only < minimum nodes in it, the completeFile operation does not successfully > close the file and the client begins to hang waiting for NN to replicate the > last bad block in the background. This form of hard-guarantee can, for > example, bring down clusters of HBase during high xceiver load on DN, or disk > fill-ups on many of them, etc.. > I propose we should split the property in two parts: > * dfs.namenode.replication.min > ** Stays the same name, but only checks file creation time replication factor > value and during adjustments made via setrep/etc. > * dfs.namenode.replication.min.for.write > ** New property that disconnects the rest of the checks from the above > property, such as the checks done during block commit, file complete/close, > safemode checks for block availability, etc.. 
> Alternatively, we may also choose to remove the client-side hang of > completeFile/close calls with a set number of retries. This would further > require discussion about how a file-closure handle ought to be handled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
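The proposed split of the minimum-replication checks can be sketched as follows. This is a hypothetical illustration only: {{dfs.namenode.replication.min.for.write}} is the property proposed above and does not exist in Hadoop; the class and method names below are invented for the sketch, not NameNode code.

```java
// Hypothetical sketch of the proposed split: the strict minimum gates
// create/setrep, while a relaxed write-time minimum gates completeFile,
// so close() need not hang on a pipeline that shrank below the strict value.
public class ReplicationPolicySketch {
    static final int MIN_REPLICATION = 2;           // dfs.namenode.replication.min
    static final int MIN_REPLICATION_FOR_WRITE = 1; // proposed dfs.namenode.replication.min.for.write

    // Checked at file creation time and during setrep adjustments.
    static boolean allowCreate(int requestedReplication) {
        return requestedReplication >= MIN_REPLICATION;
    }

    // Checked at block commit, file complete/close, and safemode availability.
    static boolean allowComplete(int liveReplicas) {
        return liveReplicas >= MIN_REPLICATION_FOR_WRITE;
    }

    public static void main(String[] args) {
        // A write whose pipeline degraded to a single live replica can still close:
        System.out.println("create with replication 1 allowed: " + allowCreate(1));
        System.out.println("complete with 1 live replica allowed: " + allowComplete(1));
    }
}
```

Under this split, an HBase-style workload that temporarily loses pipeline nodes would still complete its files, while users remain unable to create files below the admin-enforced replication factor.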
[jira] [Updated] (HDFS-2936) Provide a way to apply a minimum replication factor aside of strict minimum live replicas feature
[ https://issues.apache.org/jira/browse/HDFS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-2936: -- Assignee: (was: Harsh J) > Provide a way to apply a minimum replication factor aside of strict minimum > live replicas feature > - > > Key: HDFS-2936 > URL: https://issues.apache.org/jira/browse/HDFS-2936 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 0.23.0 >Reporter: Harsh J > Attachments: HDFS-2936.patch > > > If an admin wishes to enforce replication today for all the users of their > cluster, he may set {{dfs.namenode.replication.min}}. This property prevents > users from creating files with < expected replication factor. > However, the value of minimum replication set by the above value is also > checked at several other points, especially during completeFile (close) > operations. If a condition arises wherein a write's pipeline may have gotten > only < minimum nodes in it, the completeFile operation does not successfully > close the file and the client begins to hang waiting for NN to replicate the > last bad block in the background. This form of hard-guarantee can, for > example, bring down clusters of HBase during high xceiver load on DN, or disk > fill-ups on many of them, etc.. > I propose we should split the property in two parts: > * dfs.namenode.replication.min > ** Stays the same name, but only checks file creation time replication factor > value and during adjustments made via setrep/etc. > * dfs.namenode.replication.min.for.write > ** New property that disconnects the rest of the checks from the above > property, such as the checks done during block commit, file complete/close, > safemode checks for block availability, etc.. > Alternatively, we may also choose to remove the client-side hang of > completeFile/close calls with a set number of retries. This would further > require discussion about how a file-closure handle ought to be handled. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11012) Unnecessary INFO logging on DFSClients for InvalidToken
[ https://issues.apache.org/jira/browse/HDFS-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-11012: --- Status: Patch Available (was: Open) > Unnecessary INFO logging on DFSClients for InvalidToken > --- > > Key: HDFS-11012 > URL: https://issues.apache.org/jira/browse/HDFS-11012 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs >Affects Versions: 2.5.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > > In situations where a DFSClient would receive an InvalidToken exception (as > described at [1]), a single retry is automatically made (as observed at [2]). > However, we still print an INFO message into the DFSClient's logger even > though the message is expected in some scenarios. This should ideally be a > DEBUG level message to avoid confusion. > If the retry or the retried attempt fails, the final clause handles it anyway > and prints out a proper WARN (as seen at [3]) so the INFO is unnecessary. > [1] - > https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1330-L1356 > [2] - > https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L649-L651 > and > https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1163-L1170 > [3] - > https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L652-L658 > and > https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1171-L1177 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-11012) Unnecessary INFO logging on DFSClients for InvalidToken
Harsh J created HDFS-11012: -- Summary: Unnecessary INFO logging on DFSClients for InvalidToken Key: HDFS-11012 URL: https://issues.apache.org/jira/browse/HDFS-11012 Project: Hadoop HDFS Issue Type: Improvement Components: fs Affects Versions: 2.5.0 Reporter: Harsh J Assignee: Harsh J Priority: Minor In situations where a DFSClient would receive an InvalidToken exception (as described at [1]), a single retry is automatically made (as observed at [2]). However, we still print an INFO message into the DFSClient's logger even though the message is expected in some scenarios. This should ideally be a DEBUG level message to avoid confusion. If the retry or the retried attempt fails, the final clause handles it anyway and prints out a proper WARN (as seen at [3]) so the INFO is unnecessary. [1] - https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1330-L1356 [2] - https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L649-L651 and https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1163-L1170 [3] - https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L652-L658 and https://github.com/apache/hadoop/blob/release-2.7.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1171-L1177 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
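The intended level change can be illustrated as below. This uses java.util.logging purely for illustration (the real DFSClient logs through commons-logging/slf4j), and the method names are hypothetical: the expected, retried InvalidToken case logs at DEBUG (FINE here), leaving only the post-retry failure at WARN.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustration only: demote the expected InvalidToken retry message to DEBUG,
// and keep WARN for the case where the retry itself has failed.
public class RetryLoggingSketch {
    static final Logger LOG = Logger.getLogger(RetryLoggingSketch.class.getName());

    static Level levelFor(boolean willRetry) {
        // Expected single retry after refetching the token: DEBUG/FINE.
        // Retry exhausted: the final clause logs a proper WARN anyway.
        return willRetry ? Level.FINE : Level.WARNING;
    }

    static void onInvalidToken(boolean willRetry) {
        LOG.log(levelFor(willRetry), willRetry
            ? "Access token invalid; refetching token and retrying read"
            : "Access token still invalid after retry; failing read");
    }

    public static void main(String[] args) {
        onInvalidToken(true);   // filtered out at the default INFO logger level
        onInvalidToken(false);  // published as a WARNING
    }
}
```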
[jira] [Commented] (HDFS-3584) Blocks are getting marked as corrupt with append operation under high load.
[ https://issues.apache.org/jira/browse/HDFS-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312138#comment-15312138 ] Harsh J commented on HDFS-3584: --- HDFS-10240 appears to report a similar issue. > Blocks are getting marked as corrupt with append operation under high load. > --- > > Key: HDFS-3584 > URL: https://issues.apache.org/jira/browse/HDFS-3584 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Brahma Reddy Battula > > Scenario: > = > 1. There are two clients, cli1 and cli2; cli1 writes a file F1 and does not close it > 2. cli2 calls append on the unclosed file, which triggers a lease recovery > 3. cli1 then closes the file > 4. Lease recovery completes with an updated GS on the DN; when the block report arrives, the GS mismatch causes the block to be marked corrupt > 5. The subsequent commitBlockSynchronization also fails, since the file was already closed by cli1 and its state on the NN is Finalized
[jira] [Commented] (HDFS-10240) Race between close/recoverLease leads to missing block
[ https://issues.apache.org/jira/browse/HDFS-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295930#comment-15295930 ] Harsh J commented on HDFS-10240: This seems similar to the situation described in HDFS-3584. The comment there from [~umamaheswararao] suggests a probable better approach of altering the lease of the file to the recovering client upfront, to prevent the older client from coming back and closing the file out concurrently (given that close call does already check for active lease). Do you think such an approach would be better? > Race between close/recoverLease leads to missing block > -- > > Key: HDFS-10240 > URL: https://issues.apache.org/jira/browse/HDFS-10240 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: zhouyingchao >Assignee: zhouyingchao > Attachments: HDFS-10240-001.patch > > > We got a missing block in our cluster, and logs related to the missing block > are as follows: > 2016-03-28,10:00:06,188 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocateBlock: XX. 
BP-219149063-10.108.84.25-1446859315800 > blk_1226490256_153006345{blockUCState=UNDER_CONSTRUCTION, > primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]} > 2016-03-28,10:00:06,205 INFO BlockStateChange: BLOCK* > blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]} > recovery started, > primary=ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW] > 2016-03-28,10:00:06,205 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File XX has not been closed. Lease > recovery is in progress. 
RecoveryId = 153006357 for block > blk_1226490256_153006345{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-3f5032bc-6006-4fcc-b0f7-b355a5b94f1b:NORMAL|RBW]]} > 2016-03-28,10:00:06,248 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* > checkFileProgress: blk_1226490256_153006345{blockUCState=COMMITTED, > primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]} > has not reached minimal replication 1 > 2016-03-28,10:00:06,358 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.114.5.53:11402 is added to > blk_1226490256_153006345{blockUCState=COMMITTED, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-bcd22774-cf4d-45e9-a6a6-c475181271c9:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-ec1413ae-5541-4b44-8922-c928be3bb306:NORMAL|RBW], > > ReplicaUnderConstruction[[DISK]DS-85819f0d-bdbb-4a9b-b90c-eba078547c23:NORMAL|RBW]]} > size 139 > 2016-03-28,10:00:06,441 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.114.5.44:11402 is added to blk_1226490256_153006345 size > 139 > 2016-03-28,10:00:06,660 INFO BlockStateChange: BLOCK* addStoredBlock: > blockMap updated: 10.114.6.14:11402 is added to blk_1226490256_153006345 size > 139 > 2016-03-28,10:00:08,808 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > commitBlockSynchronization(lastblock=BP-219149063-10.108.84.25-1446859315800:blk_1226490256_153006345, > newgenerationstamp=153006357, newlength=139, newtargets=[10.114.6.14:11402, > 10.114.5.53:11402, 10.114.5.44:11402], closeFile=true, 
deleteBlock=false) > 2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on > 10.114.6.14:11402 by /10.114.6.14 because block is COMPLETE and reported > genstamp 153006357 does not match genstamp in block map 153006345 > 2016-03-28,10:00:08,836 INFO BlockStateChange: BLOCK > NameSystem.addToCorruptReplicasMap: blk_1226490256 added as corrupt on > 10.114.5.53:11402 by /10.114.5.53 because block is COMPLETE and reported > genstamp 153006357 does not match genstamp in block map 153006345 > 2016-03-28,10:00:08,837 INFO BlockStateChange: BLOCK >
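The lease-reassignment idea raised in the comment above can be sketched as follows. This is a heavily simplified, hypothetical model (not NameNode code; "NN_Recovery" is an invented holder name): move the lease to an internal recovery holder before recovery begins, so the original writer's concurrent close() fails its existing lease check instead of racing commitBlockSynchronization.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: reassigning the lease up front makes the stale
// writer's close() lose the race deterministically.
public class LeaseSketch {
    private final Map<String, String> leaseHolderByPath = new HashMap<>();

    void open(String path, String client) {
        leaseHolderByPath.put(path, client);
    }

    void startRecovery(String path) {
        // Reassign the lease before recovery; the old writer no longer holds it.
        leaseHolderByPath.put(path, "NN_Recovery");
    }

    boolean close(String path, String client) {
        // close() already checks for an active lease; with the lease
        // reassigned, the stale writer's close is rejected here.
        return client.equals(leaseHolderByPath.get(path));
    }
}
```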
[jira] [Resolved] (HDFS-3557) provide means of escaping special characters to `hadoop fs` command
[ https://issues.apache.org/jira/browse/HDFS-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-3557. --- Resolution: Not A Problem Resolving per comment (also stale) > provide means of escaping special characters to `hadoop fs` command > --- > > Key: HDFS-3557 > URL: https://issues.apache.org/jira/browse/HDFS-3557 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 0.20.2 >Reporter: Jeff Hodges >Priority: Minor > > When running an investigative job, I used a date parameter that selected > multiple directories for the input (e.g. "my_data/2012/06/{18,19,20}"). It > used this same date parameter when creating the output directory. > But `hadoop fs` was unable to ls, getmerge, or rmr it until I used the regex > operator "?" and mv to change the name (that is, `-mv > output/2012/06/?18,19,20? foobar"). > Shells and filesystems for other systems provide a means of escaping "special > characters" generically, but there seems to be no such means in HDFS/`hadoop > fs`. Providing one would be a great way to make accessing HDFS more robust. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
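The kind of escaping asked for above can be sketched as a small helper. This is hypothetical, not a Hadoop API: it backslash-escapes the glob metacharacters that FsShell's path globbing would otherwise interpret, so a literal path such as output/2012/06/{18,19,20} could be addressed without renaming it first.

```java
// Hypothetical helper: escape glob metacharacters in a literal path string.
public class GlobEscape {
    static String escapeGlob(String path) {
        // Backslash-escape {, }, [, ], ?, * and the backslash itself.
        return path.replaceAll("([{}\\[\\]?*\\\\])", "\\\\$1");
    }

    public static void main(String[] args) {
        System.out.println(escapeGlob("output/2012/06/{18,19,20}"));
    }
}
```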
[jira] [Commented] (HDFS-10296) FileContext.getDelegationTokens() fails to obtain KMS delegation token
[ https://issues.apache.org/jira/browse/HDFS-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251966#comment-15251966 ] Harsh J commented on HDFS-10296: We do special handling in DistributedFileSystem#addDelegationTokens to detect TDE features and inject an additional KMS DT; this enhancement is missing in FileContext. > FileContext.getDelegationTokens() fails to obtain KMS delegation token > -- > > Key: HDFS-10296 > URL: https://issues.apache.org/jira/browse/HDFS-10296 > Project: Hadoop HDFS > Issue Type: Bug > Components: encryption >Affects Versions: 2.6.0 > Environment: CDH 5.6 with a Java KMS >Reporter: Andreas Neumann > > This little program demonstrates the problem: With FileSystem, we can get > both the HDFS and the kms-dt token, whereas with FileContext, we can only > obtain the HDFS delegation token. > {code} > public class SimpleTest { > public static void main(String[] args) throws IOException { > YarnConfiguration hConf = new YarnConfiguration(); > String renewer = "renewer"; > FileContext fc = FileContext.getFileContext(hConf); > List<Token<?>> tokens = fc.getDelegationTokens(new Path("/"), renewer); > for (Token token : tokens) { > System.out.println("Token from FC: " + token); > } > FileSystem fs = FileSystem.get(hConf); > for (Token token : fs.addDelegationTokens(renewer, new Credentials())) > { > System.out.println("Token from FS: " + token); > } > } > } > {code} > Sample output (host/user name x'ed out): > {noformat} > Token from FC: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:xxx, Ident: > (HDFS_DELEGATION_TOKEN token 49 for xxx) > Token from FS: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:xxx, Ident: > (HDFS_DELEGATION_TOKEN token 50 for xxx) > Token from FS: Kind: kms-dt, Service: xx.xx.xx.xx:16000, Ident: 00 04 63 64 > 61 70 07 72 65 6e 65 77 65 72 00 8a 01 54 16 96 c2 95 8a 01 54 3a a3 46 95 0e > 02 > {noformat} > Apparently FileContext does not return the KMS token. 
[jira] [Comment Edited] (HDFS-8161) Both Namenodes are in standby State
[ https://issues.apache.org/jira/browse/HDFS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243980#comment-15243980 ] Harsh J edited comment on HDFS-8161 at 4/16/16 3:26 AM: [~brahmareddy] - was this encountered on virtual machine hosts, or physical ones? Asking because https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.v3hx212ne (H/T [~daisuke.kobayashi]) was (Author: qwertymaniac): [~brahmareddy] - was this encountered on virtual machine hosts, or physical ones? Asking because https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.v3hx212ne > Both Namenodes are in standby State > --- > > Key: HDFS-8161 > URL: https://issues.apache.org/jira/browse/HDFS-8161 > Project: Hadoop HDFS > Issue Type: Bug > Components: auto-failover >Affects Versions: 2.6.0 >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: ACTIVEBreadcumb and StandbyElector.txt > > > Suspected Scenario: > > Start cluster with three Nodes. > Reboot Machine where ZKFC is not running..( Here Active Node ZKFC should open > session with this ZK ) > Now ZKFC ( Active NN's ) session expire and try re-establish connection with > another ZK...Bythe time ZKFC ( StndBy NN's ) will try to fence old active > and create the active Breadcrumb and Makes SNN to active state.. > But immediately it fence to standby state.. ( Here is the doubt) > Hence both will be in standby state.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8161) Both Namenodes are in standby State
[ https://issues.apache.org/jira/browse/HDFS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243980#comment-15243980 ] Harsh J commented on HDFS-8161: --- [~brahmareddy] - was this encountered on virtual machine hosts, or physical ones? Asking because https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.v3hx212ne > Both Namenodes are in standby State > --- > > Key: HDFS-8161 > URL: https://issues.apache.org/jira/browse/HDFS-8161 > Project: Hadoop HDFS > Issue Type: Bug > Components: auto-failover >Affects Versions: 2.6.0 >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Attachments: ACTIVEBreadcumb and StandbyElector.txt > > > Suspected scenario: > > Start a cluster with three nodes. > Reboot the machine where no ZKFC is running (the active NN's ZKFC should hold a session with this ZK node). > The active NN's ZKFC session now expires, and it tries to re-establish a connection with another ZK node. By that time the standby NN's ZKFC tries to fence the old active, creates the active breadcrumb, and transitions the standby NN to active state. > But it immediately fences back to standby state (this is the doubt). > Hence both end up in standby state.
[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
[ https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-10237: --- Attachment: HDFS-10237.002.patch > Support specifying checksum type in WebHDFS/HTTPFS writers > -- > > Key: HDFS-10237 > URL: https://issues.apache.org/jira/browse/HDFS-10237 > Project: Hadoop HDFS > Issue Type: New Feature > Components: webhdfs >Affects Versions: 2.8.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, > HDFS-10237.002.patch, HDFS-10237.002.patch > > > Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS > writer, as you can with the regular DFS writer (done via HADOOP-8240) > This JIRA covers the changes necessary to bring the same ability to WebHDFS > and HTTPFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
[ https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-10237: --- Status: Open (was: Patch Available) > Support specifying checksum type in WebHDFS/HTTPFS writers > -- > > Key: HDFS-10237 > URL: https://issues.apache.org/jira/browse/HDFS-10237 > Project: Hadoop HDFS > Issue Type: New Feature > Components: webhdfs >Affects Versions: 2.8.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, > HDFS-10237.002.patch > > > Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS > writer, as you can with the regular DFS writer (done via HADOOP-8240) > This JIRA covers the changes necessary to bring the same ability to WebHDFS > and HTTPFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
[ https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-10237: --- Status: Patch Available (was: Open) > Support specifying checksum type in WebHDFS/HTTPFS writers > -- > > Key: HDFS-10237 > URL: https://issues.apache.org/jira/browse/HDFS-10237 > Project: Hadoop HDFS > Issue Type: New Feature > Components: webhdfs >Affects Versions: 2.8.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, > HDFS-10237.002.patch > > > Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS > writer, as you can with the regular DFS writer (done via HADOOP-8240) > This JIRA covers the changes necessary to bring the same ability to WebHDFS > and HTTPFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
[ https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-10237: --- Attachment: HDFS-10237.002.patch Addressed most checkstyle issues except the ones about the warnings surrounding # of parameters - these are required given the overrides. > Support specifying checksum type in WebHDFS/HTTPFS writers > -- > > Key: HDFS-10237 > URL: https://issues.apache.org/jira/browse/HDFS-10237 > Project: Hadoop HDFS > Issue Type: New Feature > Components: webhdfs >Affects Versions: 2.8.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch, > HDFS-10237.002.patch > > > Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS > writer, as you can with the regular DFS writer (done via HADOOP-8240) > This JIRA covers the changes necessary to bring the same ability to WebHDFS > and HTTPFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
[ https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-10237: --- Attachment: HDFS-10237.001.patch * Fixed the javadoc extra param problem * Fixed logic in the ChecksumOptParam that was not considering the string value of "null" vs. a literal null, and was breaking the contract test/non-recursive-create The other tests appear unrelated. > Support specifying checksum type in WebHDFS/HTTPFS writers > -- > > Key: HDFS-10237 > URL: https://issues.apache.org/jira/browse/HDFS-10237 > Project: Hadoop HDFS > Issue Type: New Feature > Components: webhdfs >Affects Versions: 2.8.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-10237.000.patch, HDFS-10237.001.patch > > > Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS > writer, as you can with the regular DFS writer (done via HADOOP-8240) > This JIRA covers the changes necessary to bring the same ability to WebHDFS > and HTTPFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
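The null-handling distinction described in the fix above can be sketched as follows. The class and method names are hypothetical (the real change lives in the checksum option parameter handling): an absent query parameter, a literal Java null, must mean "use the server default", and must not be conflated with an explicit string value such as "NULL", which in Hadoop names the no-op checksum type.

```java
// Hypothetical sketch: distinguish a missing parameter from an explicit value.
public class ChecksumParamSketch {
    static String resolve(String queryValue, String serverDefault) {
        if (queryValue == null || queryValue.isEmpty()) {
            return serverDefault;         // parameter not supplied at all
        }
        return queryValue.toUpperCase();  // explicit request, e.g. CRC32, CRC32C, NULL
    }
}
```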
[jira] [Resolved] (HDFS-6542) WebHDFSFileSystem doesn't transmit desired checksum type
[ https://issues.apache.org/jira/browse/HDFS-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-6542. --- Resolution: Duplicate I missed this JIRA when searching before I filed HDFS-10237, but now noticed via association to HADOOP-8240. Since I've already posted a patch on HDFS-10237 and there's no ongoing work/assignee here, am marking this as a duplicate of HDFS-10237. Sorry for the extra noise! > WebHDFSFileSystem doesn't transmit desired checksum type > > > Key: HDFS-6542 > URL: https://issues.apache.org/jira/browse/HDFS-6542 > Project: Hadoop HDFS > Issue Type: Bug > Components: webhdfs >Reporter: Andrey Stepachev >Priority: Minor > > Currently DFSClient has possibility to specify desired checksum type. This > behaviour controlled by dfs.checksym.type parameter settable by client. > It works with hdfs:// filesystem, but doesn't works with webhdfs.It fails to > work because webhdfs will use default type of checksumming initialised by > server instance of DFSClient. > As example https://issues.apache.org/jira/browse/HADOOP-8240 doesn't works > with webhdfs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
[ https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-10237: --- Attachment: HDFS-10237.000.patch > Support specifying checksum type in WebHDFS/HTTPFS writers > -- > > Key: HDFS-10237 > URL: https://issues.apache.org/jira/browse/HDFS-10237 > Project: Hadoop HDFS > Issue Type: New Feature > Components: webhdfs >Affects Versions: 2.8.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-10237.000.patch > > > Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS > writer, as you can with the regular DFS writer (done via HADOOP-8240) > This JIRA covers the changes necessary to bring the same ability to WebHDFS > and HTTPFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
[ https://issues.apache.org/jira/browse/HDFS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-10237: --- Status: Patch Available (was: Open) > Support specifying checksum type in WebHDFS/HTTPFS writers > -- > > Key: HDFS-10237 > URL: https://issues.apache.org/jira/browse/HDFS-10237 > Project: Hadoop HDFS > Issue Type: New Feature > Components: webhdfs >Affects Versions: 2.8.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-10237.000.patch > > > Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS > writer, as you can with the regular DFS writer (done via HADOOP-8240) > This JIRA covers the changes necessary to bring the same ability to WebHDFS > and HTTPFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-10237) Support specifying checksum type in WebHDFS/HTTPFS writers
Harsh J created HDFS-10237: -- Summary: Support specifying checksum type in WebHDFS/HTTPFS writers Key: HDFS-10237 URL: https://issues.apache.org/jira/browse/HDFS-10237 Project: Hadoop HDFS Issue Type: New Feature Components: webhdfs Affects Versions: 2.8.0 Reporter: Harsh J Assignee: Harsh J Priority: Minor Currently you cannot set a desired checksum type over a WebHDFS or HTTPFS writer, as you can with the regular DFS writer (done via HADOOP-8240) This JIRA covers the changes necessary to bring the same ability to WebHDFS and HTTPFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10221) Add test resource dfs.hosts.json to the rat exclusions
[ https://issues.apache.org/jira/browse/HDFS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215876#comment-15215876 ] Harsh J commented on HDFS-10221: It appears we can also request the parser to allow comments as a feature, depending on the parser, for ex. this change and JIRA: LENS-729 / https://issues.apache.org/jira/secure/attachment/12750264/LENS-729.patch This would help preserve licenses vs. excluding it, just in case. > Add test resource dfs.hosts.json to the rat exclusions > -- > > Key: HDFS-10221 > URL: https://issues.apache.org/jira/browse/HDFS-10221 > Project: Hadoop HDFS > Issue Type: Bug > Components: build >Affects Versions: 2.8.0 > Environment: Jenkins >Reporter: Ming Ma >Assignee: Ming Ma >Priority: Blocker > Attachments: HDFS-10221.patch > > > A new test resource dfs.hosts.json was added to HDFS-9005 for better > readability. Given json file doesn't allow comments, it violates ASF license. > To address this, we can add the file to rat exclusions list in the pom.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
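As a rough illustration of the alternative raised in the comment, a test could also pre-strip whole-line comments from the JSON resource before parsing, letting the file carry an ASF license header; parsers such as Jackson can instead be configured to tolerate comments directly (the ALLOW_COMMENTS feature referenced via the LENS-729 patch). The helper below is hypothetical, not Hadoop code.

```java
// Hypothetical helper: drop whole-line "//" comments from a JSON resource
// so a license header can coexist with a strict JSON parser.
public class JsonCommentStrip {
    static String stripLineComments(String json) {
        StringBuilder out = new StringBuilder();
        for (String line : json.split("\n", -1)) {
            if (!line.trim().startsWith("//")) {  // keep everything but comment lines
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }
}
```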
[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression
[ https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9949: -- Description: In the following scenario, in releases without HDFS-8211, the DN may regenerate its UUIDs unintentionally. 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} 1. Stop DN 2. Unmount the second disk, {{/data2/dfs/dn}} 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path 4. Start DN 5. DN now considers /data2/dfs/dn empty so formats it, but during the format it uses {{datanode.getDatanodeUuid()}} which is null until register() is called. 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} gets called with successful condition, and it causes a new generation of UUID which is written to all disks {{/data1/dfs/dn/current/VERSION}} and {{/data2/dfs/dn/current/VERSION}}. 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was realised) 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} file to be the original one again on it (mounting masks the root path that we last generated upon). 9. DN fails to start up cause it finds mismatched UUID between the two disks, with an error similar to: {code}WARN org.apache.hadoop.hdfs.server.common.Storage: {{org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data/2/dfs/dn is in an inconsistent state: Root /data/2/dfs/dn: DatanodeUuid=fe3a848f-beb8-4fcb-9581-c6fb1c701cc4, does not match 8ea9493c-7097-4ee3-96a3-0cc4dfc1d6ac from other StorageDirectory.{code} The DN should not generate a new UUID if one of the storage disks already have the older one. HDFS-8211 unintentionally fixes this by changing the {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} representation of the UUID vs. the {{DatanodeID}} object which only gets available (non-null) _after_ the registration. 
It'd still be good to add a direct test case to the above scenario that passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a regression around this in future. was: In the following scenario, in releases without HDFS-8211, the DN may regenerate its UUIDs unintentionally. 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} 1. Stop DN 2. Unmount the second disk, {{/data2/dfs/dn}} 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path 4. Start DN 5. DN now considers /data2/dfs/dn empty so formats it, but during the format it uses {{datanode.getDatanodeUuid()}} which is null until register() is called. 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} gets called with successful condition, and it causes a new generation of UUID which is written to all disks {{/data1/dfs/dn/current/VERSION}} and {{/data2/dfs/dn/current/VERSION}}. 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was realised) 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} file to be the original one again on it (mounting masks the root path that we last generated upon). 9. DN fails to start up cause it finds mismatched UUID between the two disks The DN should not generate a new UUID if one of the storage disks already have the older one. HDFS-8211 unintentionally fixes this by changing the {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} representation of the UUID vs. the {{DatanodeID}} object which only gets available (non-null) _after_ the registration. It'd still be good to add a direct test case to the above scenario that passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a regression around this in future. 
> Testcase for catching DN UUID regeneration regression > - > > Key: HDFS-9949 > URL: https://issues.apache.org/jira/browse/HDFS-9949 > Project: Hadoop HDFS > Issue Type: Test >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, > HDFS-9949.000.patch, HDFS-9949.001.patch > > > In the following scenario, in releases without HDFS-8211, the DN may > regenerate its UUIDs unintentionally. > 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} > 1. Stop DN > 2. Unmount the second disk, {{/data2/dfs/dn}} > 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root > path > 4. Start DN > 5. DN now considers /data2/dfs/dn empty so formats it, but during the format > it uses {{datanode.getDatanodeUuid()}} which is null until register() is > called. > 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} > gets called with successful condition, and it causes a new
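The invariant violated in steps 5-6 above can be sketched without Hadoop itself. The class and method below are invented for illustration (this is not the actual DataNode code): a DatanodeUuid found in any already-formatted storage directory should win over generating a fresh one, and two conflicting existing UUIDs should fail loudly, mirroring the InconsistentFSStateException in step 9.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;

/**
 * Simplified sketch (not the real DataNode code) of UUID resolution across
 * storage directories. Each map models a directory's VERSION file; a missing
 * "datanodeUuid" key models a freshly formatted (empty) disk.
 */
class UuidCheckSketch {
    static String resolveDatanodeUuid(List<Map<String, String>> versionFiles) {
        String existing = null;
        for (Map<String, String> version : versionFiles) {
            String uuid = version.get("datanodeUuid");
            if (uuid == null) {
                continue;                    // empty disk, formatted this startup
            }
            if (existing == null) {
                existing = uuid;             // adopt the first UUID we find
            } else if (!existing.equals(uuid)) {
                // Mirrors the InconsistentFSStateException in step 9
                throw new IllegalStateException("DatanodeUuid " + uuid
                    + " does not match " + existing + " from other StorageDirectory");
            }
        }
        // Generate a brand-new UUID only when *no* disk has one yet
        return existing != null ? existing : UUID.randomUUID().toString();
    }
}
```

Under this rule, the freshly formatted /data2 disk in the scenario would adopt /data1's existing UUID instead of triggering a regeneration across all disks.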
[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression
[ https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9949: -- Attachment: HDFS-9949.001.patch Thank you for reviewing [~cmccabe]! Agreed, I should've used a smaller sleep. Updated the trunk patch with the suggested {{50ms}}. Test runs on my local machine in ~3s. > Testcase for catching DN UUID regeneration regression > - > > Key: HDFS-9949 > URL: https://issues.apache.org/jira/browse/HDFS-9949 > Project: Hadoop HDFS > Issue Type: Test >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, > HDFS-9949.000.patch, HDFS-9949.001.patch > > > In the following scenario, in releases without HDFS-8211, the DN may > regenerate its UUIDs unintentionally. > 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} > 1. Stop DN > 2. Unmount the second disk, {{/data2/dfs/dn}} > 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root > path > 4. Start DN > 5. DN now considers /data2/dfs/dn empty so formats it, but during the format > it uses {{datanode.getDatanodeUuid()}} which is null until register() is > called. > 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} > gets called with successful condition, and it causes a new generation of UUID > which is written to all disks {{/data1/dfs/dn/current/VERSION}} and > {{/data2/dfs/dn/current/VERSION}}. > 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was > realised) > 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the > {{VERSION}} file to be the original one again on it (mounting masks the root > path that we last generated upon). > 9. DN fails to start up cause it finds mismatched UUID between the two disks > The DN should not generate a new UUID if one of the storage disks already > have the older one. 
> HDFS-8211 unintentionally fixes this by changing the > {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} > representation of the UUID vs. the {{DatanodeID}} object which only gets > available (non-null) _after_ the registration. > It'd still be good to add a direct test case to the above scenario that > passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can > catch a regression around this in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression
[ https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9949: -- Description: In the following scenario, in releases without HDFS-8211, the DN may regenerate its UUIDs unintentionally. 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} 1. Stop DN 2. Unmount the second disk, {{/data2/dfs/dn}} 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path 4. Start DN 5. DN now considers /data2/dfs/dn empty so formats it, but during the format it uses {{datanode.getDatanodeUuid()}} which is null until register() is called. 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} gets called with successful condition, and it causes a new generation of UUID which is written to all disks {{/data1/dfs/dn/current/VERSION}} and {{/data2/dfs/dn/current/VERSION}}. 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was realised) 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} file to be the original one again on it (mounting masks the root path that we last generated upon). 9. DN fails to start up cause it finds mismatched UUID between the two disks, with an error similar to: {code}WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data/2/dfs/dn is in an inconsistent state: Root /data/2/dfs/dn: DatanodeUuid=fe3a848f-beb8-4fcb-9581-c6fb1c701cc4, does not match 8ea9493c-7097-4ee3-96a3-0cc4dfc1d6ac from other StorageDirectory.{code} The DN should not generate a new UUID if one of the storage disks already have the older one. HDFS-8211 unintentionally fixes this by changing the {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} representation of the UUID vs. the {{DatanodeID}} object which only gets available (non-null) _after_ the registration. 
It'd still be good to add a direct test case to the above scenario that passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a regression around this in future. was: In the following scenario, in releases without HDFS-8211, the DN may regenerate its UUIDs unintentionally. 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} 1. Stop DN 2. Unmount the second disk, {{/data2/dfs/dn}} 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path 4. Start DN 5. DN now considers /data2/dfs/dn empty so formats it, but during the format it uses {{datanode.getDatanodeUuid()}} which is null until register() is called. 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} gets called with successful condition, and it causes a new generation of UUID which is written to all disks {{/data1/dfs/dn/current/VERSION}} and {{/data2/dfs/dn/current/VERSION}}. 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was realised) 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} file to be the original one again on it (mounting masks the root path that we last generated upon). 9. DN fails to start up cause it finds mismatched UUID between the two disks, with an error similar to: {code}WARN org.apache.hadoop.hdfs.server.common.Storage: {{org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /data/2/dfs/dn is in an inconsistent state: Root /data/2/dfs/dn: DatanodeUuid=fe3a848f-beb8-4fcb-9581-c6fb1c701cc4, does not match 8ea9493c-7097-4ee3-96a3-0cc4dfc1d6ac from other StorageDirectory.{code} The DN should not generate a new UUID if one of the storage disks already have the older one. HDFS-8211 unintentionally fixes this by changing the {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} representation of the UUID vs. the {{DatanodeID}} object which only gets available (non-null) _after_ the registration. 
It'd still be good to add a direct test case to the above scenario that passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a regression around this in future. > Testcase for catching DN UUID regeneration regression > - > > Key: HDFS-9949 > URL: https://issues.apache.org/jira/browse/HDFS-9949 > Project: Hadoop HDFS > Issue Type: Test >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, > HDFS-9949.000.patch, HDFS-9949.001.patch > > > In the following scenario, in releases without HDFS-8211, the DN may > regenerate its UUIDs unintentionally. > 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} > 1. Stop DN > 2. Unmount the second disk, {{/data2/dfs/dn}} > 3. Create (in the scenario, this was an
[jira] [Commented] (HDFS-9940) Rename dfs.balancer.max.concurrent.moves to avoid confusion
[ https://issues.apache.org/jira/browse/HDFS-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192340#comment-15192340 ] Harsh J commented on HDFS-9940: --- Agreed on the confusion when there's role-based config management involved - I'd also vote for having Balancers discover the property from DNs directly/dynamically instead, which gives the added benefit of allowing per-DN flexibility (should it get required in future) - HDFS-7466. > Rename dfs.balancer.max.concurrent.moves to avoid confusion > --- > > Key: HDFS-9940 > URL: https://issues.apache.org/jira/browse/HDFS-9940 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.6.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Minor > Labels: supportability > Fix For: 2.8.0 > > > It is very confusing for both Balancer and Datanode to use the same property > {{dfs.datanode.balance.max.concurrent.moves}}. It is especially so for the > Balancer because the property has "datanode" in the name string. Many > customers forget to set the property for the Balancer. > Change the Balancer to use a new property > {{dfs.balancer.max.concurrent.moves}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
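For reference, after HDFS-9940 the two sides are configured under separately named properties. A hypothetical hdfs-site.xml fragment (the values here are illustrative, not defaults) might look like:

```xml
<!-- hdfs-site.xml: illustrative values, not defaults -->
<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>10</value> <!-- enforced on each DataNode -->
</property>
<property>
  <name>dfs.balancer.max.concurrent.moves</name>
  <value>10</value> <!-- read by the Balancer after HDFS-9940 -->
</property>
```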
[jira] [Commented] (HDFS-9949) Testcase for catching DN UUID regeneration regression
[ https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192335#comment-15192335 ] Harsh J commented on HDFS-9949: --- bq. -1 unit * The patch only adds a test case; the failing tests appear flaky rather than related to the added test for trunk and branch-2 here. The other patch (the one named for branch-2.7) is a proof of the test's validity, but is not intended for commit: HDFS-8211 is not in branch-2.7, so the test will always fail there. > Testcase for catching DN UUID regeneration regression > - > > Key: HDFS-9949 > URL: https://issues.apache.org/jira/browse/HDFS-9949 > Project: Hadoop HDFS > Issue Type: Test >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, > HDFS-9949.000.patch > > > In the following scenario, in releases without HDFS-8211, the DN may > regenerate its UUIDs unintentionally. > 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} > 1. Stop DN > 2. Unmount the second disk, {{/data2/dfs/dn}} > 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root > path > 4. Start DN > 5. DN now considers /data2/dfs/dn empty so formats it, but during the format > it uses {{datanode.getDatanodeUuid()}} which is null until register() is > called. > 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} > gets called with successful condition, and it causes a new generation of UUID > which is written to all disks {{/data1/dfs/dn/current/VERSION}} and > {{/data2/dfs/dn/current/VERSION}}. > 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was > realised) > 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the > {{VERSION}} file to be the original one again on it (mounting masks the root > path that we last generated upon). > 9. 
DN fails to start up cause it finds mismatched UUID between the two disks > The DN should not generate a new UUID if one of the storage disks already > have the older one. > HDFS-8211 unintentionally fixes this by changing the > {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} > representation of the UUID vs. the {{DatanodeID}} object which only gets > available (non-null) _after_ the registration. > It'd still be good to add a direct test case to the above scenario that > passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can > catch a regression around this in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression
[ https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9949: -- Target Version/s: 3.0.0, 2.8.0, 2.9.0 Status: Patch Available (was: Open) > Testcase for catching DN UUID regeneration regression > - > > Key: HDFS-9949 > URL: https://issues.apache.org/jira/browse/HDFS-9949 > Project: Hadoop HDFS > Issue Type: Test >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, > HDFS-9949.000.patch > > > In the following scenario, in releases without HDFS-8211, the DN may > regenerate its UUIDs unintentionally. > 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} > 1. Stop DN > 2. Unmount the second disk, {{/data2/dfs/dn}} > 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root > path > 4. Start DN > 5. DN now considers /data2/dfs/dn empty so formats it, but during the format > it uses {{datanode.getDatanodeUuid()}} which is null until register() is > called. > 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} > gets called with successful condition, and it causes a new generation of UUID > which is written to all disks {{/data1/dfs/dn/current/VERSION}} and > {{/data2/dfs/dn/current/VERSION}}. > 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was > realised) > 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the > {{VERSION}} file to be the original one again on it (mounting masks the root > path that we last generated upon). > 9. DN fails to start up cause it finds mismatched UUID between the two disks > The DN should not generate a new UUID if one of the storage disks already > have the older one. > HDFS-8211 unintentionally fixes this by changing the > {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} > representation of the UUID vs. 
the {{DatanodeID}} object which only gets > available (non-null) _after_ the registration. > It'd still be good to add a direct test case to the above scenario that > passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can > catch a regression around this in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression
[ https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9949: -- Attachment: (was: HDFS-9949.000.patch) > Testcase for catching DN UUID regeneration regression > - > > Key: HDFS-9949 > URL: https://issues.apache.org/jira/browse/HDFS-9949 > Project: Hadoop HDFS > Issue Type: Test >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, > HDFS-9949.000.patch > > > In the following scenario, in releases without HDFS-8211, the DN may > regenerate its UUIDs unintentionally. > 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} > 1. Stop DN > 2. Unmount the second disk, {{/data2/dfs/dn}} > 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root > path > 4. Start DN > 5. DN now considers /data2/dfs/dn empty so formats it, but during the format > it uses {{datanode.getDatanodeUuid()}} which is null until register() is > called. > 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} > gets called with successful condition, and it causes a new generation of UUID > which is written to all disks {{/data1/dfs/dn/current/VERSION}} and > {{/data2/dfs/dn/current/VERSION}}. > 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was > realised) > 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the > {{VERSION}} file to be the original one again on it (mounting masks the root > path that we last generated upon). > 9. DN fails to start up cause it finds mismatched UUID between the two disks > The DN should not generate a new UUID if one of the storage disks already > have the older one. > HDFS-8211 unintentionally fixes this by changing the > {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} > representation of the UUID vs. 
the {{DatanodeID}} object which only gets > available (non-null) _after_ the registration. > It'd still be good to add a direct test case to the above scenario that > passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can > catch a regression around this in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression
[ https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9949: -- Attachment: HDFS-9949.000.patch > Testcase for catching DN UUID regeneration regression > - > > Key: HDFS-9949 > URL: https://issues.apache.org/jira/browse/HDFS-9949 > Project: Hadoop HDFS > Issue Type: Test >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, > HDFS-9949.000.patch > > > In the following scenario, in releases without HDFS-8211, the DN may > regenerate its UUIDs unintentionally. > 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} > 1. Stop DN > 2. Unmount the second disk, {{/data2/dfs/dn}} > 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root > path > 4. Start DN > 5. DN now considers /data2/dfs/dn empty so formats it, but during the format > it uses {{datanode.getDatanodeUuid()}} which is null until register() is > called. > 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} > gets called with successful condition, and it causes a new generation of UUID > which is written to all disks {{/data1/dfs/dn/current/VERSION}} and > {{/data2/dfs/dn/current/VERSION}}. > 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was > realised) > 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the > {{VERSION}} file to be the original one again on it (mounting masks the root > path that we last generated upon). > 9. DN fails to start up cause it finds mismatched UUID between the two disks > The DN should not generate a new UUID if one of the storage disks already > have the older one. > HDFS-8211 unintentionally fixes this by changing the > {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} > representation of the UUID vs. the {{DatanodeID}} object which only gets > available (non-null) _after_ the registration. 
> It'd still be good to add a direct test case to the above scenario that > passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can > catch a regression around this in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9949) Testcase for catching DN UUID regeneration regression
[ https://issues.apache.org/jira/browse/HDFS-9949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9949: -- Attachment: HDFS-9949.000.branch-2.7.not-for-commit.patch HDFS-9949.000.patch > Testcase for catching DN UUID regeneration regression > - > > Key: HDFS-9949 > URL: https://issues.apache.org/jira/browse/HDFS-9949 > Project: Hadoop HDFS > Issue Type: Test >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Harsh J >Priority: Minor > Attachments: HDFS-9949.000.branch-2.7.not-for-commit.patch, > HDFS-9949.000.patch > > > In the following scenario, in releases without HDFS-8211, the DN may > regenerate its UUIDs unintentionally. > 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} > 1. Stop DN > 2. Unmount the second disk, {{/data2/dfs/dn}} > 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root > path > 4. Start DN > 5. DN now considers /data2/dfs/dn empty so formats it, but during the format > it uses {{datanode.getDatanodeUuid()}} which is null until register() is > called. > 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} > gets called with successful condition, and it causes a new generation of UUID > which is written to all disks {{/data1/dfs/dn/current/VERSION}} and > {{/data2/dfs/dn/current/VERSION}}. > 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was > realised) > 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the > {{VERSION}} file to be the original one again on it (mounting masks the root > path that we last generated upon). > 9. DN fails to start up cause it finds mismatched UUID between the two disks > The DN should not generate a new UUID if one of the storage disks already > have the older one. > HDFS-8211 unintentionally fixes this by changing the > {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} > representation of the UUID vs. 
the {{DatanodeID}} object which only gets > available (non-null) _after_ the registration. > It'd still be good to add a direct test case to the above scenario that > passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can > catch a regression around this in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9949) Testcase for catching DN UUID regeneration regression
Harsh J created HDFS-9949: - Summary: Testcase for catching DN UUID regeneration regression Key: HDFS-9949 URL: https://issues.apache.org/jira/browse/HDFS-9949 Project: Hadoop HDFS Issue Type: Test Affects Versions: 2.6.0 Reporter: Harsh J Assignee: Harsh J Priority: Minor In the following scenario, in releases without HDFS-8211, the DN may regenerate its UUIDs unintentionally. 0. Consider a DN with two disks {{/data1/dfs/dn,/data2/dfs/dn}} 1. Stop DN 2. Unmount the second disk, {{/data2/dfs/dn}} 3. Create (in the scenario, this was an accident) /data2/dfs/dn on the root path 4. Start DN 5. DN now considers /data2/dfs/dn empty so formats it, but during the format it uses {{datanode.getDatanodeUuid()}} which is null until register() is called. 6. As a result, after the directory loading, {{datanode.checkDatanodUuid()}} gets called with successful condition, and it causes a new generation of UUID which is written to all disks {{/data1/dfs/dn/current/VERSION}} and {{/data2/dfs/dn/current/VERSION}}. 7. Stop DN (in the scenario, this was when the mistake of unmounted disk was realised) 8. Mount the second disk back again {{/data2/dfs/dn}}, causing the {{VERSION}} file to be the original one again on it (mounting masks the root path that we last generated upon). 9. DN fails to start up cause it finds mismatched UUID between the two disks The DN should not generate a new UUID if one of the storage disks already have the older one. HDFS-8211 unintentionally fixes this by changing the {{datanode.getDatanodeUuid()}} function to rely on the {{DataStorage}} representation of the UUID vs. the {{DatanodeID}} object which only gets available (non-null) _after_ the registration. It'd still be good to add a direct test case to the above scenario that passes on trunk and branch-2, but fails on branch-2.7 and lower, so we can catch a regression around this in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-8475) Exception in createBlockOutputStream java.io.EOFException: Premature EOF: no length prefix available
[ https://issues.apache.org/jira/browse/HDFS-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-8475. --- Resolution: Not A Bug I don't see a bug reported here - the report says the write was done with a single replica and that the single replica was manually corrupted. Please post to u...@hadoop.apache.org for problems observed in usage. If you plan to reopen this, please post precise steps for how the bug may be reproduced. I'd recommend looking at your NN and DN logs to trace further what's happening. > Exception in createBlockOutputStream java.io.EOFException: Premature EOF: no > length prefix available > > > Key: HDFS-8475 > URL: https://issues.apache.org/jira/browse/HDFS-8475 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.2.0 >Reporter: Vinod Valecha >Priority: Blocker > > Scenario: > = > write a file > corrupt block manually > Exception stack trace- > 2015-05-24 02:31:55.291 INFO [T-33716795] > [org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer] Exception in > createBlockOutputStream > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1155) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514) > [5/24/15 2:31:55:291 UTC] 02027a3b DFSClient I > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer createBlockOutputStream > Exception in createBlockOutputStream > java.io.EOFException: Premature EOF: no > length prefix available > at > org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1155) > at > 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1088) > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514) > 2015-05-24 02:31:55.291 INFO [T-33716795] > [org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer] Abandoning > BP-176676314-10.108.106.59-1402620296713:blk_1404621403_330880579 > [5/24/15 2:31:55:291 UTC] 02027a3b DFSClient I > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer nextBlockOutputStream > Abandoning BP-176676314-10.108.106.59-1402620296713:blk_1404621403_330880579 > 2015-05-24 02:31:55.299 INFO [T-33716795] > [org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer] Excluding datanode > 10.108.106.59:50010 > [5/24/15 2:31:55:299 UTC] 02027a3b DFSClient I > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer nextBlockOutputStream > Excluding datanode 10.108.106.59:50010 > 2015-05-24 02:31:55.300 WARNING [T-33716795] > [org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer] DataStreamer Exception > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File > /var/db/opera/files/B4889CCDA75F9751DDBB488E5AAB433E/BE4DAEF290B7136ED6EF3D4B157441A2/BE4DAEF290B7136ED6EF3D4B157441A2-4.pag > could only be replicated to 0 nodes instead of minReplication (=1). There > are 1 datanode(s) running and 1 node(s) are excluded in this operation. 
> at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1384) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2477) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555) > [5/24/15 2:31:55:300 UTC] 02027a3b DFSClient W > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer run DataStreamer Exception > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File > /var/db/opera/files/B4889CCDA75F9751DDBB488E5AAB433E/BE4DAEF290B7136ED6EF3D4B157441A2/BE4DAEF290B7136ED6EF3D4B157441A2-4.pag > could only be replicated to 0 nodes instead of minReplication (=1). There > are 1 datanode(s) running and 1 node(s) are excluded in this operation. > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1384) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2477) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555) > at >
[jira] [Updated] (HDFS-9521) TransferFsImage.receiveFile should account and log separate times for image download and fsync to disk
[ https://issues.apache.org/jira/browse/HDFS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9521: -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.9.0 Status: Resolved (was: Patch Available) > TransferFsImage.receiveFile should account and log separate times for image > download and fsync to disk > --- > > Key: HDFS-9521 > URL: https://issues.apache.org/jira/browse/HDFS-9521 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Fix For: 2.9.0 > > Attachments: HDFS-9521-2.patch, HDFS-9521-3.patch, > HDFS-9521.004.patch, HDFS-9521.patch, HDFS-9521.patch.1 > > > Currently, TransferFsImage.receiveFile is logging total transfer time as > below: > {noformat} > double xferSec = Math.max( >((float)(Time.monotonicNow() - startTime)) / 1000.0, 0.001); > long xferKb = received / 1024; > LOG.info(String.format("Transfer took %.2fs at %.2f KB/s",xferSec, xferKb / > xferSec)) > {noformat} > This is really useful, but it just measures the total method execution time, > which includes time taken to download the image and do an fsync to all the > namenode metadata directories. > Sometimes when troubleshooting these image transfer problems, it's > useful to know which part of the process is the bottleneck > (network or disk write). > This patch accounts for image download time and fsync time to each disk > separately, logging how long each operation took. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
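The split logging discussed above can be illustrated with a standalone sketch. The class name and message format here are invented for illustration, not the committed TransferFsImage code: the idea is simply to report the download phase and the fsync phase as separate figures instead of one combined "Transfer took" line.

```java
/**
 * Invented sketch of phase-separated reporting: given measured durations for
 * the network download and the fsync-to-disk phases, format one log line
 * that shows each phase on its own.
 */
class PhaseTimingSketch {
    static String report(long downloadMillis, long fsyncMillis, long receivedBytes) {
        // Floor at 1ms, mirroring the Math.max(..., 0.001) guard in the quoted code
        double downSec = Math.max(downloadMillis / 1000.0, 0.001);
        long kb = receivedBytes / 1024;
        return String.format(java.util.Locale.ROOT,
            "Image download took %.2fs at %.2f KB/s; fsync to disk took %.2fs",
            downSec, kb / downSec, fsyncMillis / 1000.0);
    }
}
```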
[jira] [Commented] (HDFS-9521) TransferFsImage.receiveFile should account and log separate times for image download and fsync to disk
[ https://issues.apache.org/jira/browse/HDFS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182915#comment-15182915 ] Harsh J commented on HDFS-9521: --- +1. The check-point related tests in one of the tests seemed relevant but they pass locally on both JDK7 and JDK8. {code} Running org.apache.hadoop.hdfs.TestRollingUpgrade Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 99.737 sec - in org.apache.hadoop.hdfs.TestRollingUpgrade {code} Therefore they appear to be flaky than at fault here. Other tests appear similarly unrelated to the log change here (no tests appear to rely on the original message either). Committing to branch-2 and trunk shortly. > TransferFsImage.receiveFile should account and log separate times for image > download and fsync to disk > --- > > Key: HDFS-9521 > URL: https://issues.apache.org/jira/browse/HDFS-9521 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Attachments: HDFS-9521-2.patch, HDFS-9521-3.patch, > HDFS-9521.004.patch, HDFS-9521.patch, HDFS-9521.patch.1 > > > Currently, TransferFsImage.receiveFile is logging total transfer time as > below: > {noformat} > double xferSec = Math.max( >((float)(Time.monotonicNow() - startTime)) / 1000.0, 0.001); > long xferKb = received / 1024; > LOG.info(String.format("Transfer took %.2fs at %.2f KB/s",xferSec, xferKb / > xferSec)) > {noformat} > This is really useful, but it just measures the total method execution time, > which includes time taken to download the image and do an fsync to all the > namenode metadata directories. > Sometime when troubleshooting these imager transfer problems, it's > interesting to know which part of the process is being the bottleneck > (whether network or disk write). > This patch accounts time for image download and fsync to each disk > separately, logging how much time did it take on each operation. 
> -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9521) TransferFsImage.receiveFile should account and log separate times for image download and fsync to disk
[ https://issues.apache.org/jira/browse/HDFS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9521: -- Attachment: HDFS-9521.004.patch LGTM. Just had two checkstyle nits I've corrected in this variant, aside from some spacing fixes. Will commit once jenkins returns +1. The previously failed tests don't appear related. > TransferFsImage.receiveFile should account and log separate times for image > download and fsync to disk > --- > > Key: HDFS-9521 > URL: https://issues.apache.org/jira/browse/HDFS-9521 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Attachments: HDFS-9521-2.patch, HDFS-9521-3.patch, > HDFS-9521.004.patch, HDFS-9521.patch, HDFS-9521.patch.1 > > > Currently, TransferFsImage.receiveFile is logging total transfer time as > below: > {noformat} > double xferSec = Math.max( >((float)(Time.monotonicNow() - startTime)) / 1000.0, 0.001); > long xferKb = received / 1024; > LOG.info(String.format("Transfer took %.2fs at %.2f KB/s",xferSec, xferKb / > xferSec)) > {noformat} > This is really useful, but it just measures the total method execution time, > which includes time taken to download the image and do an fsync to all the > namenode metadata directories. > Sometimes when troubleshooting these image transfer problems, it's > interesting to know which part of the process is being the bottleneck > (whether network or disk write). > This patch accounts time for image download and fsync to each disk > separately, logging how much time each operation took. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4936) Handle overflow condition for txid going over Long.MAX_VALUE
[ https://issues.apache.org/jira/browse/HDFS-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171424#comment-15171424 ] Harsh J commented on HDFS-4936: --- Forgot to add the response of the asker: {quote} After 200 million years, spacemen manage the earth, they also know Hadoop, but they cannot restart it, after a hard debug they find the txid has been overflowed for many years. {quote} > Handle overflow condition for txid going over Long.MAX_VALUE > > > Key: HDFS-4936 > URL: https://issues.apache.org/jira/browse/HDFS-4936 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: Harsh J >Priority: Minor > > Hat tip to [~azuryy] for the question that led to this (on mailing lists). > I hacked up my local NN's txids manually to go very large (close to max) and > decided to try out if this causes any harm. I basically bumped up the freshly > formatted files' starting txid to 9223372036854775805 (and ensured the image > references the same by hex-editing it): > {code} > ➜ current ls > VERSION > fsimage_9223372036854775805.md5 > fsimage_9223372036854775805 > seen_txid > ➜ current cat seen_txid > 9223372036854775805 > {code} > NameNode started up as expected. > {code} > 13/06/25 18:30:08 INFO namenode.FSImage: Image file of size 129 loaded in 0 > seconds. > 13/06/25 18:30:08 INFO namenode.FSImage: Loaded image for txid > 9223372036854775805 from > /temp-space/tmp-default/dfs-cdh4/name/current/fsimage_9223372036854775805 > 13/06/25 18:30:08 INFO namenode.FSEditLog: Starting log segment at > 9223372036854775806 > {code} > I could create a bunch of files and do regular ops, counting to well past the > long max with the increments. I created over 10 files, just to make it go well > over the Long.MAX_VALUE.
> Quitting NameNode and restarting fails though, with the following error: > {code} > 13/06/25 18:31:08 INFO namenode.FileJournalManager: Recovering unfinalized > segments in > /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current > 13/06/25 18:31:08 INFO namenode.FileJournalManager: Finalizing edits file > /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_inprogress_9223372036854775806 > -> > /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_9223372036854775806-9223372036854775807 > 13/06/25 18:31:08 FATAL namenode.NameNode: Exception in namenode join > java.io.IOException: Gap in transactions. Expected to be able to read up > until at least txid 9223372036854775806 but unable to find any edit logs > containing txid -9223372036854775808 > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkForGaps(FSEditLog.java:1194) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1152) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:616) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:267) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:592) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:435) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:397) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:399) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:433) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:609) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:590) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1141) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1205) > {code} 
> Looks like we also lose some edits when we restart, as noted by the finalized > edits filename: > {code} > VERSION > edits_9223372036854775806-9223372036854775807 > fsimage_9223372036854775805 > fsimage_9223372036854775805.md5 > seen_txid > {code} > It seems like we won't be able to handle the case where txid overflows. It's a > very large number, so that's not an immediate concern, but it seemed worthy > of a report. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
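The negative txid in the restart error above (-9223372036854775808) is ordinary two's-complement wrap-around: incrementing a Java long past Long.MAX_VALUE silently yields Long.MIN_VALUE. A minimal sketch (not HDFS code) showing both the silent wrap and one possible guard:

```java
public class TxidOverflow {
    public static void main(String[] args) {
        long txid = Long.MAX_VALUE;   // 9223372036854775807
        txid = txid + 1;              // silent wrap-around: no exception in Java
        System.out.println(txid);     // -9223372036854775808, as seen in the error

        // Math.addExact throws instead of wrapping; a txid counter built on it
        // would fail fast rather than write out a negative transaction id.
        boolean overflowDetected = false;
        try {
            Math.addExact(Long.MAX_VALUE, 1L);
        } catch (ArithmeticException e) {
            overflowDetected = true;
        }
        System.out.println(overflowDetected); // true
    }
}
```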
[jira] [Commented] (HDFS-9868) add reading source cluster with HA access mode feature for DistCp
[ https://issues.apache.org/jira/browse/HDFS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171119#comment-15171119 ] Harsh J commented on HDFS-9868: --- Does the feature added with HDFS-6376 not suffice? > add reading source cluster with HA access mode feature for DistCp > - > > Key: HDFS-9868 > URL: https://issues.apache.org/jira/browse/HDFS-9868 > Project: Hadoop HDFS > Issue Type: New Feature > Components: distcp >Affects Versions: 2.7.1 >Reporter: NING DING >Assignee: NING DING > Attachments: HDFS-9868.1.patch > > > Normally the HDFS cluster is HA enabled. It can take a long time to copy huge > data sets with distcp. If the source cluster switches its active namenode, the > distcp run will fail. This patch lets DistCp read source cluster files in HA > access mode. A source cluster configuration file needs to be > specified (via the -sourceClusterConf option). > The following is an example of the contents of a source cluster > configuration > file:
> {code:xml}
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://mycluster</value>
>   </property>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>mycluster</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.mycluster</name>
>     <value>nn1,nn2</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
>     <value>host1:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
>     <value>host2:9000</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn1</name>
>     <value>host1:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.mycluster.nn2</name>
>     <value>host2:50070</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.mycluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
> </configuration>
> {code}
> The invocation of DistCp is as below: > {code} > bash$ hadoop distcp -sourceClusterConf sourceCluster.xml /foo/bar > hdfs://nn2:8020/bar/foo > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8509) Support different passwords for key and keystore on HTTPFS using SSL. This requires for a Tomcat version update.
[ https://issues.apache.org/jira/browse/HDFS-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-8509: -- Issue Type: Improvement (was: Task) > Support different passwords for key and keystore on HTTPFS using SSL. This > requires for a Tomcat version update. > > > Key: HDFS-8509 > URL: https://issues.apache.org/jira/browse/HDFS-8509 > Project: Hadoop HDFS > Issue Type: Improvement > Components: webhdfs >Affects Versions: 2.7.0 >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Attachments: HDFS-8509.patch > > > Currently, SSL for HTTPFS requires that keystore/truststore and key passwords > be the same. This is a limitation from Tomcat version 6, which didn't have > support for different passwords. From Tomcat 7, this is now possible by > defining "keyPass" property for "Connector" configuration on Tomcat's > server.xml file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9521) TransferFsImage.receiveFile should account and log separate times for image download and fsync to disk
[ https://issues.apache.org/jira/browse/HDFS-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082446#comment-15082446 ] Harsh J commented on HDFS-9521: --- Patch's approach looks good to me. Agreed with [~liuml07] that we can keep the total time also (but indicate in the message that it includes both times). Alternatively, a single combined log at the end that prints the total and divided times (along with path info as we have it in the current patch) would be better too. I do not agree on DEBUG level though. The change is a refinement of an existing, vital INFO message. Please also address the checkstyle issues, if they are relevant (sorry, am too late here and the build data's been wiped already). You can run the checkstyle goal with maven to get the same results locally. The failing tests don't appear related. > TransferFsImage.receiveFile should account and log separate times for image > download and fsync to disk > --- > > Key: HDFS-9521 > URL: https://issues.apache.org/jira/browse/HDFS-9521 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Attachments: HDFS-9521.patch > > > Currently, TransferFsImage.receiveFile is logging total transfer time as > below: > {noformat} > double xferSec = Math.max( >((float)(Time.monotonicNow() - startTime)) / 1000.0, 0.001); > long xferKb = received / 1024; > LOG.info(String.format("Transfer took %.2fs at %.2f KB/s",xferSec, xferKb / > xferSec)) > {noformat} > This is really useful, but it just measures the total method execution time, > which includes time taken to download the image and do an fsync to all the > namenode metadata directories. > Sometimes when troubleshooting these image transfer problems, it's > interesting to know which part of the process is being the bottleneck > (whether network or disk write).
> This patch accounts time for image download and fsync to each disk > separately, logging how much time each operation took. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
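As a sketch of the combined single-log suggestion in the comment above (the class and method names here are illustrative stand-ins, not the actual HDFS-9521 patch), timing the two phases separately and emitting one summary line could look like:

```java
import java.util.concurrent.TimeUnit;

public class SplitTransferTiming {
    // Hypothetical stand-ins for the two phases of receiveFile().
    static void downloadImage() throws InterruptedException {
        TimeUnit.MILLISECONDS.sleep(30); // simulate the network transfer
    }

    static void fsyncToDisk() throws InterruptedException {
        TimeUnit.MILLISECONDS.sleep(10); // simulate fsync to a metadata dir
    }

    public static void main(String[] args) throws InterruptedException {
        long t0 = System.nanoTime();
        downloadImage();
        long downloadMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);

        long t1 = System.nanoTime();
        fsyncToDisk();
        long fsyncMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t1);

        // One combined INFO-style line carrying both the divided and total times.
        System.out.printf("Image download took %d ms, fsync took %d ms, total %d ms%n",
                downloadMs, fsyncMs, downloadMs + fsyncMs);
    }
}
```

Keeping both split and total times in a single message preserves the original INFO-level signal while pointing at whichever phase is the bottleneck.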
[jira] [Commented] (HDFS-7447) Number of maximum Acl entries on a File/Folder should be made user configurable than hardcoding .
[ https://issues.apache.org/jira/browse/HDFS-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15054098#comment-15054098 ] Harsh J commented on HDFS-7447: --- bq. The number of entries in a single ACL is capped at a maximum of 32. Attempts to add ACL entries exceeding the maximum will fail with a user-facing error. This is done for 2 reasons: to simplify management, and to limit resource consumption. ACLs with a very high number of entries tend to become difficult to understand and may indicate that the requirements are better implemented by defining additional groups or users. ACLs with a very high number of entries also would require more memory and storage and take longer to evaluate on each permission check. The number 32 is chosen for consistency with the maximum number of ACL entries enforced by the ext family of file systems. - https://issues.apache.org/jira/secure/attachment/12627729/HDFS-ACLs-Design-3.pdf > Number of maximum Acl entries on a File/Folder should be made user > configurable than hardcoding . > - > > Key: HDFS-7447 > URL: https://issues.apache.org/jira/browse/HDFS-7447 > Project: Hadoop HDFS > Issue Type: Improvement > Components: security >Reporter: J.Andreina > > By default, a newly created folder1 will have 6 ACL entries. Once the ACL > entries on folder1 exceed 32, it is no longer possible to assign ACLs for a > group/user to folder1. > {noformat} > 2014-11-20 18:55:06,553 ERROR [qtp1279235236-17 - /rolexml/role/modrole] > Error occured while setting permissions for Resource:[ > hdfs://hacluster/folder1 ] and Error message is : Invalid ACL: ACL has 33 > entries, which exceeds maximum of 32.
> at > org.apache.hadoop.hdfs.server.namenode.AclTransformation.buildAndValidateAcl(AclTransformation.java:274) > at > org.apache.hadoop.hdfs.server.namenode.AclTransformation.mergeAclEntries(AclTransformation.java:181) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedModifyAclEntries(FSDirectory.java:2771) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.modifyAclEntries(FSDirectory.java:2757) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.modifyAclEntries(FSNamesystem.java:7734) > {noformat} > Here the value 32 is hardcoded, which can be made user-configurable. > {noformat} > private static List<AclEntry> buildAndValidateAcl(ArrayList<AclEntry> aclBuilder) > throws AclException > { > if(aclBuilder.size() > 32) > throw new AclException((new StringBuilder()).append("Invalid ACL: > ACL has ").append(aclBuilder.size()).append(" entries, which exceeds maximum > of ").append(32).append(".").toString()); > : > : > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
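A sketch of what making the limit configurable might look like. The config key dfs.namenode.acl.max-entries is hypothetical, and a plain Map stands in for the Hadoop Configuration API; the validation logic mirrors the quoted snippet above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConfigurableAclLimit {
    static final int DEFAULT_MAX_ACL_ENTRIES = 32;

    // Stand-in for Configuration.getInt(key, default); the key name is hypothetical.
    static int maxAclEntries(Map<String, String> conf) {
        String v = conf.get("dfs.namenode.acl.max-entries");
        return (v == null) ? DEFAULT_MAX_ACL_ENTRIES : Integer.parseInt(v);
    }

    static void validateAclSize(List<String> acl, int max) {
        if (acl.size() > max) {
            throw new IllegalArgumentException("Invalid ACL: ACL has " + acl.size()
                    + " entries, which exceeds maximum of " + max + ".");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        List<String> acl = new ArrayList<>();
        for (int i = 0; i < 33; i++) {
            acl.add("user:u" + i + ":rwx");
        }

        try {
            validateAclSize(acl, maxAclEntries(conf)); // default limit 32: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }

        conf.put("dfs.namenode.acl.max-entries", "64");
        validateAclSize(acl, maxAclEntries(conf));     // raised limit: accepted
        System.out.println("accepted with raised limit");
    }
}
```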
[jira] [Commented] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages
[ https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006787#comment-15006787 ] Harsh J commented on HDFS-8298: --- I'm closing this JIRA as Invalid. This is not a bug, but a misunderstanding of what the error means. It's worth taking these discussions to the user list prior to reporting to JIRA. The chief QJM guarantee is that a failure, of whatever kind, is intolerable. That the NN kills itself in such a situation is the current expected outcome. bq. In an HDFS HA setup if there is a temporary problem with contacting journal nodes (eg. network interruption), the NameNode shuts down entirely, This is as per the current design, and the NN is behaving correctly. bq. When it should instead go in to a standby mode so that it can stay online and retry to achieve quorum later. This is not possible today, because at the point of failure the transaction is incomplete. The mutation writes are not write-ahead, but write-behind (although part of the same transaction), so we don't want to risk an inconsistent state, especially if reads over the standby are allowed; and if we let it stay alive, it would face an edit conflict if it ever became active again. bq. As usual, don't blindly change without understanding the impacts of your change. Indeed, modification of these values must be done with care, especially that of client retries + timeout values. If the whole timeouts add up to more than the period a client may wait (and retry over), then you'd have actual client failures in a HA situation. bq. We are facing this issue every week at the same time might be due to a network glitch but is there a workaround that can be put in place? It is important to understand why the writes time out. Note that each write is hardly more than a few thousand bytes, usually. Writing such a small amount, in parallel and over the network, should not take more than 2-3s.
There's a LOT of factors that can cause the timeouts besides the network: GC or other form of process-level pauses (NN pauses mid-write over 20s, recovers, but write finisher now thinks > 20s has passed to write to JNs, so it marks the write failed), JN fsync delays due to not providing it dedicated disks (if you're unaware, JN writes are locally synchronous, for durability), and Kerberos KDC timeouts (JVM default of 30s before a failed KDC connection retry is higher than 20s default of a transaction write timeout to JN, when auth gets required between NN -> JNs periodically). > HA: NameNode should not shut down completely without quorum, doesn't recover > from temporary network outages > --- > > Key: HDFS-8298 > URL: https://issues.apache.org/jira/browse/HDFS-8298 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ha, HDFS, namenode, qjm >Affects Versions: 2.6.0 >Reporter: Hari Sekhon > > In an HDFS HA setup if there is a temporary problem with contacting journal > nodes (eg. network interruption), the NameNode shuts down entirely, when it > should instead go in to a standby mode so that it can stay online and retry > to achieve quorum later. > If both NameNodes shut themselves off like this then even after the temporary > network outage is resolved, the entire cluster remains offline indefinitely > until operator intervention, whereas it could have self-repaired after > re-contacting the journalnodes and re-achieving quorum. > {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog > (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for > required journal (JournalAndStre > am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream > starting at txid 54270281)) > java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to > respond. 
> at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107) > at > org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113) > at > org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529) > at >
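For reference, the write timeout discussed in the comment above is itself tunable in hdfs-site.xml; dfs.qjournal.write-txns.timeout.ms defaults to 20000 ms, matching the 20s figure cited. Raising it can be a stopgap, though it only masks the underlying pause (GC, JN disk latency, or KDC delays) rather than fixing it:

```xml
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <!-- Time the NameNode waits for a quorum of JournalNodes to ack an edit
       batch before declaring the write failed (default: 20000). -->
  <value>60000</value>
</property>
```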
[jira] [Resolved] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages
[ https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-8298. --- Resolution: Invalid Closing out - for specific identified improvements (such as log improvements, or ideas about improving unclear root-causing), please log a more direct JIRA. > HA: NameNode should not shut down completely without quorum, doesn't recover > from temporary network outages > --- > > Key: HDFS-8298 > URL: https://issues.apache.org/jira/browse/HDFS-8298 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ha, HDFS, namenode, qjm >Affects Versions: 2.6.0 >Reporter: Hari Sekhon > > In an HDFS HA setup if there is a temporary problem with contacting journal > nodes (eg. network interruption), the NameNode shuts down entirely, when it > should instead go in to a standby mode so that it can stay online and retry > to achieve quorum later. > If both NameNodes shut themselves off like this then even after the temporary > network outage is resolved, the entire cluster remains offline indefinitely > until operator intervention, whereas it could have self-repaired after > re-contacting the journalnodes and re-achieving quorum. > {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog > (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for > required journal (JournalAndStre > am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream > starting at txid 54270281)) > java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to > respond. 
> at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107) > at > org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113) > at > org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639) > at > org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388) > at java.lang.Thread.run(Thread.java:745) > 2015-04-15 15:59:26,901 WARN client.QuorumJournalManager > (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at > txid 54270281 > 2015-04-15 15:59:26,904 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - > Exiting with status 1 > 2015-04-15 15:59:27,001 INFO namenode.NameNode (StringUtils.java:run(659)) - > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down NameNode at / > /{code} > Hari Sekhon > http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages
[ https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-8298: -- Environment: (was: HDP 2.2) > HA: NameNode should not shut down completely without quorum, doesn't recover > from temporary network outages > --- > > Key: HDFS-8298 > URL: https://issues.apache.org/jira/browse/HDFS-8298 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ha, HDFS, namenode, qjm >Affects Versions: 2.6.0 >Reporter: Hari Sekhon > > In an HDFS HA setup if there is a temporary problem with contacting journal > nodes (eg. network interruption), the NameNode shuts down entirely, when it > should instead go in to a standby mode so that it can stay online and retry > to achieve quorum later. > If both NameNodes shut themselves off like this then even after the temporary > network outage is resolved, the entire cluster remains offline indefinitely > until operator intervention, whereas it could have self-repaired after > re-contacting the journalnodes and re-achieving quorum. > {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog > (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for > required journal (JournalAndStre > am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream > starting at txid 54270281)) > java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to > respond. 
> at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107) > at > org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113) > at > org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639) > at > org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388) > at java.lang.Thread.run(Thread.java:745) > 2015-04-15 15:59:26,901 WARN client.QuorumJournalManager > (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at > txid 54270281 > 2015-04-15 15:59:26,904 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - > Exiting with status 1 > 2015-04-15 15:59:27,001 INFO namenode.NameNode (StringUtils.java:run(659)) - > SHUTDOWN_MSG: > / > SHUTDOWN_MSG: Shutting down NameNode at / > /{code} > Hari Sekhon > http://www.linkedin.com/in/harisekhon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8986) Add option to -du to calculate directory space usage excluding snapshots
[ https://issues.apache.org/jira/browse/HDFS-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994469#comment-14994469 ] Harsh J commented on HDFS-8986: --- This continues to cause a bunch of confusion among our user-base who are still reliant on the pre-snapshot feature behaviour, so it would be nice to see it implemented. > Add option to -du to calculate directory space usage excluding snapshots > > > Key: HDFS-8986 > URL: https://issues.apache.org/jira/browse/HDFS-8986 > Project: Hadoop HDFS > Issue Type: Improvement > Components: snapshots >Reporter: Gautam Gopalakrishnan >Assignee: Jagadesh Kiran N > > When running {{hadoop fs -du}} on a snapshotted directory (or one of its > children), the report includes space consumed by blocks that are only present > in the snapshots. This is confusing for end users. > {noformat} > $ hadoop fs -du -h -s /tmp/parent /tmp/parent/* > 799.7 M 2.3 G /tmp/parent > 799.7 M 2.3 G /tmp/parent/sub1 > $ hdfs dfs -createSnapshot /tmp/parent snap1 > Created snapshot /tmp/parent/.snapshot/snap1 > $ hadoop fs -rm -skipTrash /tmp/parent/sub1/* > ... > $ hadoop fs -du -h -s /tmp/parent /tmp/parent/* > 799.7 M 2.3 G /tmp/parent > 799.7 M 2.3 G /tmp/parent/sub1 > $ hdfs dfs -deleteSnapshot /tmp/parent snap1 > $ hadoop fs -du -h -s /tmp/parent /tmp/parent/* > 0 0 /tmp/parent > 0 0 /tmp/parent/sub1 > {noformat} > It would be helpful if we had a flag, say -X, to exclude any snapshot related > disk usage in the output -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8986) Add option to -du to calculate directory space usage excluding snapshots
[ https://issues.apache.org/jira/browse/HDFS-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14984274#comment-14984274 ] Harsh J commented on HDFS-8986: --- [~jagadesh.kiran] - What's the precise opposition w.r.t. adding a new flag to explicitly exclude snapshot counts? It does not break compatibility, and is an added feature/improvement. > Add option to -du to calculate directory space usage excluding snapshots > > > Key: HDFS-8986 > URL: https://issues.apache.org/jira/browse/HDFS-8986 > Project: Hadoop HDFS > Issue Type: Improvement > Components: snapshots >Reporter: Gautam Gopalakrishnan >Assignee: Jagadesh Kiran N > > When running {{hadoop fs -du}} on a snapshotted directory (or one of its > children), the report includes space consumed by blocks that are only present > in the snapshots. This is confusing for end users. > {noformat} > $ hadoop fs -du -h -s /tmp/parent /tmp/parent/* > 799.7 M 2.3 G /tmp/parent > 799.7 M 2.3 G /tmp/parent/sub1 > $ hdfs dfs -createSnapshot /tmp/parent snap1 > Created snapshot /tmp/parent/.snapshot/snap1 > $ hadoop fs -rm -skipTrash /tmp/parent/sub1/* > ... > $ hadoop fs -du -h -s /tmp/parent /tmp/parent/* > 799.7 M 2.3 G /tmp/parent > 799.7 M 2.3 G /tmp/parent/sub1 > $ hdfs dfs -deleteSnapshot /tmp/parent snap1 > $ hadoop fs -du -h -s /tmp/parent /tmp/parent/* > 0 0 /tmp/parent > 0 0 /tmp/parent/sub1 > {noformat} > It would be helpful if we had a flag, say -X, to exclude any snapshot related > disk usage in the output -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9257) improve error message for "Absolute path required" in INode.java to contain the rejected path
[ https://issues.apache.org/jira/browse/HDFS-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960854#comment-14960854 ] Harsh J commented on HDFS-9257: --- +1, failed tests are unrelated. Tests shouldn't be necessary for the trivial message improvement. Committing shortly. > improve error message for "Absolute path required" in INode.java to contain > the rejected path > - > > Key: HDFS-9257 > URL: https://issues.apache.org/jira/browse/HDFS-9257 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.7.1 >Reporter: Marcell Szabo >Assignee: Marcell Szabo >Priority: Trivial > Attachments: HDFS-9257.000.patch > > > throw new AssertionError("Absolute path required"); > message should also show the path to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9257) improve error message for "Absolute path required" in INode.java to contain the rejected path
[ https://issues.apache.org/jira/browse/HDFS-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-9257: -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.8.0 Status: Resolved (was: Patch Available) Committed to trunk and branch-2. Thank you for the improvement contribution Marcell, hope to see more! > improve error message for "Absolute path required" in INode.java to contain > the rejected path > - > > Key: HDFS-9257 > URL: https://issues.apache.org/jira/browse/HDFS-9257 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.7.1 >Reporter: Marcell Szabo >Assignee: Marcell Szabo >Priority: Trivial > Fix For: 2.8.0 > > Attachments: HDFS-9257.000.patch > > > throw new AssertionError("Absolute path required"); > message should also show the path to help debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7899) Improve EOF error message
[ https://issues.apache.org/jira/browse/HDFS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945253#comment-14945253 ] Harsh J commented on HDFS-7899: --- Thank you [~jagadesh.kiran] and [~vinayrpet] > Improve EOF error message > - > > Key: HDFS-7899 > URL: https://issues.apache.org/jira/browse/HDFS-7899 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Jagadesh Kiran N >Priority: Minor > Fix For: 2.8.0 > > Attachments: HDFS-7899-00.patch, HDFS-7899-01.patch, > HDFS-7899-02.patch > > > Currently, a DN disconnection for reasons other than connection timeout or > refused messages, such as an EOF message as a result of rejection or other > network fault, reports in this manner: > {code} > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /x.x.x.x: for > block, add to deadNodes and continue. java.io.EOFException: Premature EOF: no > length prefix available > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171) > > at > org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:392) > > at > org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:137) > > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1103) > > at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:538) > at > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:750) > > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:794) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:602) > {code} > This is not very clear to a user (warn's at the hdfs-client). It could likely > be improved with a more diagnosable message, or at least the direct reason > than an EOF. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7899) Improve EOF error message
[ https://issues.apache.org/jira/browse/HDFS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942281#comment-14942281 ] Harsh J commented on HDFS-7899: --- Sorry on delay here! Perhaps it could be fixed to: s/Unexpected EOF while trying read response/Unexpected EOF while trying to read response from server (With hope that it would get some users to look above it and spot the server/block identifiers to investigate forward.) > Improve EOF error message > - > > Key: HDFS-7899 > URL: https://issues.apache.org/jira/browse/HDFS-7899 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Jagadesh Kiran N >Priority: Minor > Attachments: HDFS-7899-00.patch, HDFS-7899-01.patch > > > Currently, a DN disconnection for reasons other than connection timeout or > refused messages, such as an EOF message as a result of rejection or other > network fault, reports in this manner: > {code} > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /x.x.x.x: for > block, add to deadNodes and continue. java.io.EOFException: Premature EOF: no > length prefix available > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171) > > at > org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:392) > > at > org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:137) > > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1103) > > at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:538) > at > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:750) > > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:794) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:602) > {code} > This is not very clear to a user (warn's at the hdfs-client). 
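The message change suggested above could be paired with attaching the peer and block identifiers directly to the rethrown exception. A minimal standalone sketch of that idea — this is a hypothetical helper, not the actual Hadoop client code, and the peer/block strings are examples:

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical helper (not the Hadoop sources): rethrow a premature EOF with
// the peer and block identifiers attached, so the log line itself says where
// the stream was cut off instead of only "no length prefix available".
public class EofContext {
    static IOException withPeerContext(EOFException e, String peer, String blockId) {
        return new IOException("Unexpected EOF while trying to read response from server "
                + peer + " for block " + blockId
                + " (the connection was likely closed by the remote side)", e);
    }

    public static void main(String[] args) {
        IOException wrapped = withPeerContext(
                new EOFException("Premature EOF: no length prefix available"),
                "/10.0.0.1:50010", "blk_123_456");
        System.out.println(wrapped.getMessage());
    }
}
```

With the server and block named in the message, the user no longer has to "look above it" to find the identifiers to investigate.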
It could likely > be improved with a more diagnosable message, or at least the direct reason > rather than a bare EOF. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-6674) UserGroupInformation.loginUserFromKeytab will hang forever if keytab file length is less than 6 byte.
[ https://issues.apache.org/jira/browse/HDFS-6674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-6674. --- Resolution: Invalid The hang, if still valid, seems to result as an outcome of the underlying Java libraries being at fault. There's not anything HDFS can control about this, and this bug instead needs to be reported to the Oracle/OpenJDK communities with a test case. > UserGroupInformation.loginUserFromKeytab will hang forever if keytab file > length is less than 6 byte. > -- > > Key: HDFS-6674 > URL: https://issues.apache.org/jira/browse/HDFS-6674 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 2.0.1-alpha >Reporter: liuyang >Priority: Minor > > The jstack is as follows: >java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.available(Native Method) > at java.io.BufferedInputStream.available(BufferedInputStream.java:399) > - locked <0x000745585330> (a > sun.security.krb5.internal.ktab.KeyTabInputStream) > at sun.security.krb5.internal.ktab.KeyTab.load(KeyTab.java:257) > at sun.security.krb5.internal.ktab.KeyTab.(KeyTab.java:97) > at sun.security.krb5.internal.ktab.KeyTab.getInstance0(KeyTab.java:124) > - locked <0x000745586560> (a java.lang.Class for > sun.security.krb5.internal.ktab.KeyTab) > at sun.security.krb5.internal.ktab.KeyTab.getInstance(KeyTab.java:157) > at javax.security.auth.kerberos.KeyTab.takeSnapshot(KeyTab.java:119) > at > javax.security.auth.kerberos.KeyTab.getEncryptionKeys(KeyTab.java:192) > at > javax.security.auth.kerberos.JavaxSecurityAuthKerberosAccessImpl.keyTabGetEncryptionKeys(JavaxSecurityAuthKerberosAccessImpl.java:36) > at > sun.security.jgss.krb5.Krb5Util.keysFromJavaxKeyTab(Krb5Util.java:381) > at > com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:701) > at > com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:584) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at javax.security.auth.login.LoginContext.invoke(LoginContext.java:784) > at > javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) > at javax.security.auth.login.LoginContext$5.run(LoginContext.java:721) > at javax.security.auth.login.LoginContext$5.run(LoginContext.java:719) > at java.security.AccessController.doPrivileged(Native Method) > at > javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:718) > at javax.security.auth.login.LoginContext.login(LoginContext.java:590) > at > org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:679) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-4224) The dncp_block_verification log can be compressed
[ https://issues.apache.org/jira/browse/HDFS-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-4224. --- Resolution: Invalid Invalid after HDFS-7430 > The dncp_block_verification log can be compressed > - > > Key: HDFS-4224 > URL: https://issues.apache.org/jira/browse/HDFS-4224 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.0.0-alpha >Reporter: Harsh J >Priority: Minor > > On some systems, I noticed that when the scanner runs, the > dncp_block_verification.log.curr file under the block pool gets quite large > (several GBs). Although this is rolled away, we could also configure > compression upon it (a codec that may work without natives, would be a good > default) and save on I/O and space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7899) Improve EOF error message
[ https://issues.apache.org/jira/browse/HDFS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733250#comment-14733250 ] Harsh J commented on HDFS-7899: --- Thanks Jagadesh, that message change was just a small idea to make it carry slightly more sense. Do you have any ideas also to improve the situation such that users may be able to self-figure out whats going on? I've seen this appear during socket disconnects/timeouts/etc. - but the message it prints is from the software layer instead, which causes confusion. > Improve EOF error message > - > > Key: HDFS-7899 > URL: https://issues.apache.org/jira/browse/HDFS-7899 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.6.0 >Reporter: Harsh J >Assignee: Jagadesh Kiran N >Priority: Minor > Attachments: HDFS-7899-00.patch > > > Currently, a DN disconnection for reasons other than connection timeout or > refused messages, such as an EOF message as a result of rejection or other > network fault, reports in this manner: > {code} > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /x.x.x.x: for > block, add to deadNodes and continue. 
java.io.EOFException: Premature EOF: no > length prefix available > java.io.EOFException: Premature EOF: no length prefix available > at > org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171) > > at > org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:392) > > at > org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:137) > > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:1103) > > at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:538) > at > org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:750) > > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:794) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:602) > {code} > This is not very clear to a user (warn's at the hdfs-client). It could likely > be improved with a more diagnosable message, or at least the direct reason > than an EOF. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-237) Better handling of dfsadmin command when namenode is slow
[ https://issues.apache.org/jira/browse/HDFS-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-237. -- Resolution: Later This older JIRA is a bit stale given the multiple changes that went into the RPC side. Follow HADOOP-9640 and related JIRAs instead for more recent work. bq. a separate rpc queue This is supported today via the servicerpc-address configs (typically set to 8022, and strongly recommended for HA modes). > Better handling of dfsadmin command when namenode is slow > - > > Key: HDFS-237 > URL: https://issues.apache.org/jira/browse/HDFS-237 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Koji Noguchi > > Probably when hitting HADOOP-3810, Namenode became unresponsive. Large time > spent in GC. > All dfs/dfsadmin command were timing out. > WebUI was coming up after waiting for a long time. > Maybe setting a long timeout would have made the dfsadmin command go through. > But it would be nice to have a separate queue/handler which doesn't compete > with regular rpc calls. > All I wanted to do was dfsadmin -safemode enter, dfsadmin -finalizeUpgrade ... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
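The separate service RPC endpoint mentioned above is a standard hdfs-site.xml setting. A minimal sketch — the hostname is a placeholder, and 8022 is the conventional port noted in the comment:

```xml
<!-- hdfs-site.xml: give DN heartbeats and admin/HA health-check calls their
     own NameNode port so they don't queue behind slow client RPCs. -->
<property>
  <name>dfs.namenode.servicerpc-address</name>
  <value>namenode.example.com:8022</value>
</property>
```

With this set, dfsadmin-style service calls use the dedicated queue even when the client-facing RPC port is saturated.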
[jira] [Updated] (HDFS-2390) dfsadmin -setBalancerBandwidth does not validate negative values
[ https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-2390: -- Issue Type: Improvement (was: Bug) dfsadmin -setBalancerBandwidth doesnot validate -ve value - Key: HDFS-2390 URL: https://issues.apache.org/jira/browse/HDFS-2390 Project: Hadoop HDFS Issue Type: Improvement Components: balancer mover Affects Versions: 2.7.1 Reporter: Rajit Saha Assignee: Gautam Gopalakrishnan Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch, HDFS-2390-4.patch $ hadoop dfsadmin -setBalancerBandwidth -1 does not throw any message that it is invalid although in DN log we are not getting DNA_BALANCERBANDWIDTHUPDATE. I think it should throw some message that -ve numbers are not valid , as it does for decimal numbers or non-numbers like - $ hadoop dfsadmin -setBalancerBandwidth 12.34 NumberFormatException: For input string: 12.34 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2390) dfsadmin -setBalancerBandwidth does not validate negative values
[ https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716463#comment-14716463 ] Harsh J commented on HDFS-2390: --- +1 on v4. The check style warning is about file length, but we can ignore that (I agree with HADOOP-12005). The test on the console output does not appear to have failed, and is unrelated and also passes locally: {code} Running org.apache.hadoop.hdfs.server.namenode.TestINodeFile Tests run: 26, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 39.608 sec - in org.apache.hadoop.hdfs.server.namenode.TestINodeFile {code} Thanks for the test and improvement fix, committing this shortly! dfsadmin -setBalancerBandwidth doesnot validate -ve value - Key: HDFS-2390 URL: https://issues.apache.org/jira/browse/HDFS-2390 Project: Hadoop HDFS Issue Type: Bug Components: balancer mover Affects Versions: 2.7.1 Reporter: Rajit Saha Assignee: Gautam Gopalakrishnan Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch, HDFS-2390-4.patch $ hadoop dfsadmin -setBalancerBandwidth -1 does not throw any message that it is invalid although in DN log we are not getting DNA_BALANCERBANDWIDTHUPDATE. I think it should throw some message that -ve numbers are not valid , as it does for decimal numbers or non-numbers like - $ hadoop dfsadmin -setBalancerBandwidth 12.34 NumberFormatException: For input string: 12.34 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-2390) dfsadmin -setBalancerBandwidth does not validate negative values
[ https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-2390: -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.8.0 Status: Resolved (was: Patch Available) Committed to trunk and branch-2. Thank you Gautam! dfsadmin -setBalancerBandwidth doesnot validate -ve value - Key: HDFS-2390 URL: https://issues.apache.org/jira/browse/HDFS-2390 Project: Hadoop HDFS Issue Type: Improvement Components: balancer mover Affects Versions: 2.7.1 Reporter: Rajit Saha Assignee: Gautam Gopalakrishnan Fix For: 2.8.0 Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch, HDFS-2390-4.patch $ hadoop dfsadmin -setBalancerBandwidth -1 does not throw any message that it is invalid although in DN log we are not getting DNA_BALANCERBANDWIDTHUPDATE. I think it should throw some message that -ve numbers are not valid , as it does for decimal numbers or non-numbers like - $ hadoop dfsadmin -setBalancerBandwidth 12.34 NumberFormatException: For input string: 12.34 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-2390) dfsadmin -setBalancerBandwidth does not validate negative values
[ https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-2390: -- Priority: Minor (was: Major) dfsadmin -setBalancerBandwidth doesnot validate -ve value - Key: HDFS-2390 URL: https://issues.apache.org/jira/browse/HDFS-2390 Project: Hadoop HDFS Issue Type: Improvement Components: balancer mover Affects Versions: 2.7.1 Reporter: Rajit Saha Assignee: Gautam Gopalakrishnan Priority: Minor Fix For: 2.8.0 Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch, HDFS-2390-4.patch $ hadoop dfsadmin -setBalancerBandwidth -1 does not throw any message that it is invalid although in DN log we are not getting DNA_BALANCERBANDWIDTHUPDATE. I think it should throw some message that -ve numbers are not valid , as it does for decimal numbers or non-numbers like - $ hadoop dfsadmin -setBalancerBandwidth 12.34 NumberFormatException: For input string: 12.34 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2390) dfsadmin -setBalancerBandwidth does not validate negative values
[ https://issues.apache.org/jira/browse/HDFS-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14714205#comment-14714205 ] Harsh J commented on HDFS-2390: --- bq. +assertEquals(Bandwidth should be a non-negative integer, -1, exitCode); This should rather be something like Negative bandwidth value must fail the command, such that upon a regression during which the test fails, the message produced by the JUnit test suite would look like: Negative bandwidth value must fail command: expected -1 but got 0. dfsadmin -setBalancerBandwidth doesnot validate -ve value - Key: HDFS-2390 URL: https://issues.apache.org/jira/browse/HDFS-2390 Project: Hadoop HDFS Issue Type: Bug Components: balancer mover Affects Versions: 2.7.1 Reporter: Rajit Saha Assignee: Gautam Gopalakrishnan Attachments: HDFS-2390-1.patch, HDFS-2390-2.patch, HDFS-2390-3.patch $ hadoop dfsadmin -setBalancerBandwidth -1 does not throw any message that it is invalid although in DN log we are not getting DNA_BALANCERBANDWIDTHUPDATE. I think it should throw some message that -ve numbers are not valid , as it does for decimal numbers or non-numbers like - $ hadoop dfsadmin -setBalancerBandwidth 12.34 NumberFormatException: For input string: 12.34 Usage: java DFSAdmin [-setBalancerBandwidth bandwidth in bytes per second] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
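The two pieces discussed in this thread — rejecting a negative bandwidth up front, and asserting on the exit code with a message that explains the failure when the test regresses — can be sketched as follows. This is an illustrative stand-in, not the actual DFSAdmin sources:

```java
// Sketch (not the Hadoop sources): validate the bandwidth argument before
// sending DNA_BALANCERBANDWIDTHUPDATE, and check the exit code the way the
// review comment above suggests phrasing the JUnit assertion.
public class BandwidthCheck {
    /** Returns 0 on success, -1 for a rejected (negative) value. */
    static int setBalancerBandwidth(long bytesPerSec) {
        if (bytesPerSec < 0) {
            System.err.println("Bandwidth should be a non-negative integer");
            return -1;
        }
        // ... would issue the bandwidth update to datanodes here ...
        return 0;
    }

    public static void main(String[] args) {
        int exitCode = setBalancerBandwidth(-1);
        // JUnit-style: assertEquals("Negative bandwidth value must fail the command", -1, exitCode)
        if (exitCode != -1) {
            throw new AssertionError("Negative bandwidth value must fail the command: "
                    + "expected -1 but got " + exitCode);
        }
        System.out.println("negative value rejected as expected");
    }
}
```

On regression the assertion message then reads "Negative bandwidth value must fail the command: expected -1 but got 0", which states the intent rather than restating the check.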
[jira] [Commented] (HDFS-5696) Examples for httpfs REST API incorrect on apache.org
[ https://issues.apache.org/jira/browse/HDFS-5696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1479#comment-1479 ] Harsh J commented on HDFS-5696: --- Thanks for reporting this. Care to submit a fix changing the op values to the right one? I suspect these may be leftovers from the Hoop days. bq. Not sure what the convention should be for specifying the user.name. Use hdfs? or a name that is obviously an example? Since these are curl based examples that also likely assume no kerberos setup, why not $USER or $(whoami) instead of a hardcoded value? Examples for httpfs REST API incorrect on apache.org Key: HDFS-5696 URL: https://issues.apache.org/jira/browse/HDFS-5696 Project: Hadoop HDFS Issue Type: Bug Components: documentation Affects Versions: 2.2.0 Environment: NA Reporter: Casey Brotherton Priority: Trivial The examples provided for the httpfs REST API are incorrect. http://hadoop.apache.org/docs/r2.2.0/hadoop-hdfs-httpfs/index.html http://hadoop.apache.org/docs/r2.0.5-alpha/hadoop-hdfs-httpfs/index.html From the documentation: * HttpFS is a separate service from Hadoop NameNode. HttpFS itself is Java web-application and it runs using a preconfigured Tomcat bundled with HttpFS binary distribution. HttpFS HTTP web-service API calls are HTTP REST calls that map to a HDFS file system operation. For example, using the curl Unix command: $ curl http://httpfs-host:14000/webhdfs/v1/user/foo/README.txt returns the contents of the HDFS /user/foo/README.txt file. $ curl http://httpfs-host:14000/webhdfs/v1/user/foo?op=list returns the contents of the HDFS /user/foo directory in JSON format. $ curl -X POST http://httpfs-host:14000/webhdfs/v1/user/foo/bar?op=mkdirs creates the HDFS /user/foo.bar directory. *** The commands have incorrect operations. 
( Verified through source code in HttpFSFileSystem.java ) In addition, although the webhdfs documentation specifies user.name as optional, on my cluster each action required a user.name. It should be included in the short examples to allow for the greatest chance of success. Three examples rewritten: curl -i -L "http://httpfs-host:14000/webhdfs/v1/user/foo/README.txt?op=open&user.name=hdfsuser" curl -i "http://httpfs-host:14000/webhdfs/v1/user/foo/?op=liststatus&user.name=hdfsuser" curl -i -X PUT "http://httpfs-host:14000/webhdfs/v1/user/foo/bar?op=mkdirs&user.name=hdfsuser" Not sure what the convention should be for specifying the user.name. Use hdfs? or a name that is obviously an example? It would also be beneficial if the HTTPfs page linked to the webhdfs documentation page in the text instead of just on the menu sidebar. http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
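The suggestion above to avoid a hardcoded account on an unsecured cluster could look like the following sketch. httpfs-host, the path, and the liststatus op value are placeholders carried over from the examples; note the URL must be quoted so the shell does not treat & as a backgrounding operator:

```shell
# Build the request URL once, substituting the invoking user for user.name.
# httpfs-host and the /user path are placeholders, as in the examples above.
USER_NAME="${USER:-$(whoami)}"
URL="http://httpfs-host:14000/webhdfs/v1/user/${USER_NAME}?op=liststatus&user.name=${USER_NAME}"
echo "$URL"
# curl -i "$URL"    # uncomment to actually issue the request
```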
[jira] [Updated] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting
[ https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-8118: -- Hadoop Flags: Reviewed Affects Version/s: 2.7.1 Target Version/s: 3.0.0, 2.8.0 Status: Patch Available (was: Open) I re-looked at the change and the problem, and although this can be difficult to test, the change does certainly fix the described changing-timestamp behaviour. +1, will commit after verifying Jenkins result on the newer patch. Marking as Patch Available to trigger the build. Delay in checkpointing Trash can leave trash for 2 intervals before deleting Key: HDFS-8118 URL: https://issues.apache.org/jira/browse/HDFS-8118 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.1 Reporter: Casey Brotherton Assignee: Casey Brotherton Priority: Trivial Attachments: HDFS-8118.001.patch, HDFS-8118.patch When the fs.trash.checkpoint.interval and the fs.trash.interval are set non-zero and the same, it is possible for trash to be left for two intervals. The TrashPolicyDefault will use a floor and ceiling function to ensure that the Trash will be checkpointed every interval of minutes. Each user's trash is checkpointed individually. The time resolution of the checkpoint timestamp is to the second. If the seconds switch while one user is checkpointing, then the next user's timestamp will be later. This will cause the next user's checkpoint to not be deleted at the next interval. I have recreated this in a lab cluster I also have a suggestion for a patch that I can upload later tonight after testing it further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting
[ https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14709034#comment-14709034 ] Harsh J commented on HDFS-8118: --- I missed a small detail - why is the {{new Date()}} not outside the user-directory iteration, if the goal is to make it constant across all user directories when the emptier gets invoked? Delay in checkpointing Trash can leave trash for 2 intervals before deleting Key: HDFS-8118 URL: https://issues.apache.org/jira/browse/HDFS-8118 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.1 Reporter: Casey Brotherton Assignee: Casey Brotherton Priority: Trivial Attachments: HDFS-8118.001.patch, HDFS-8118.patch When the fs.trash.checkpoint.interval and the fs.trash.interval are set non-zero and the same, it is possible for trash to be left for two intervals. The TrashPolicyDefault will use a floor and ceiling function to ensure that the Trash will be checkpointed every interval of minutes. Each user's trash is checkpointed individually. The time resolution of the checkpoint timestamp is to the second. If the seconds switch while one user is checkpointing, then the next user's timestamp will be later. This will cause the next user's checkpoint to not be deleted at the next interval. I have recreated this in a lab cluster I also have a suggestion for a patch that I can upload later tonight after testing it further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting
[ https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-8118: -- Comment: was deleted (was: I missed a small detail - why is the {{new Date()}} not outside the user-directory iteration, if the goal is to make it constant across all user directories when the emptier gets invoked?) Delay in checkpointing Trash can leave trash for 2 intervals before deleting Key: HDFS-8118 URL: https://issues.apache.org/jira/browse/HDFS-8118 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.1 Reporter: Casey Brotherton Assignee: Casey Brotherton Priority: Trivial Attachments: HDFS-8118.001.patch, HDFS-8118.patch When the fs.trash.checkpoint.interval and the fs.trash.interval are set non-zero and the same, it is possible for trash to be left for two intervals. The TrashPolicyDefault will use a floor and ceiling function to ensure that the Trash will be checkpointed every interval of minutes. Each user's trash is checkpointed individually. The time resolution of the checkpoint timestamp is to the second. If the seconds switch while one user is checkpointing, then the next user's timestamp will be later. This will cause the next user's checkpoint to not be deleted at the next interval. I have recreated this in a lab cluster I also have a suggestion for a patch that I can upload later tonight after testing it further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting
[ https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14710624#comment-14710624 ] Harsh J commented on HDFS-8118: --- Thanks Casey, You can run individual tests locally via {{mvn test -Dtest=TestWebDelegationToken}} for example. The jenkins build cannot retrigger specific tests, but you can always check past/future builds to also inspect if the test has been generally flaky, and search JIRA/emails to see if this has/is already been reported/being worked on. It doesn't appear related to the behaviour fix we're making here, and the test does pass locally for me, so I'm committing this shortly. Delay in checkpointing Trash can leave trash for 2 intervals before deleting Key: HDFS-8118 URL: https://issues.apache.org/jira/browse/HDFS-8118 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.1 Reporter: Casey Brotherton Assignee: Casey Brotherton Priority: Trivial Attachments: HDFS-8118.001.patch, HDFS-8118.patch When the fs.trash.checkpoint.interval and the fs.trash.interval are set non-zero and the same, it is possible for trash to be left for two intervals. The TrashPolicyDefault will use a floor and ceiling function to ensure that the Trash will be checkpointed every interval of minutes. Each user's trash is checkpointed individually. The time resolution of the checkpoint timestamp is to the second. If the seconds switch while one user is checkpointing, then the next user's timestamp will be later. This will cause the next user's checkpoint to not be deleted at the next interval. I have recreated this in a lab cluster I also have a suggestion for a patch that I can upload later tonight after testing it further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
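The second-boundary drift described in this issue, and the fix of formatting the checkpoint timestamp once before iterating over user trash directories, can be sketched in isolation. This is not the actual TrashPolicyDefault code, and the checkpoint name format used here is an assumption:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of the fix discussed above (not Hadoop's TrashPolicyDefault): format
// the checkpoint timestamp once, up front, so every user's trash gets an
// identical checkpoint name even if the wall clock ticks over a second
// boundary while the emptier is mid-iteration.
public class TrashCheckpointSketch {
    // Assumed checkpoint directory-name format, resolution of one second.
    static final SimpleDateFormat CHECKPOINT = new SimpleDateFormat("yyMMddHHmmss");

    public static void main(String[] args) throws InterruptedException {
        String name = CHECKPOINT.format(new Date());  // computed once, shared
        for (String user : new String[] {"alice", "bob"}) {
            Thread.sleep(1100);  // simulate slow per-user checkpointing
            // Reusing 'name' keeps all users on the same checkpoint; calling
            // new Date() here instead is what lets the names drift apart.
            System.out.println("/user/" + user + "/.Trash/" + name);
        }
    }
}
```

Because the deletion pass matches checkpoints by timestamp, a name that drifted one second later survives an extra interval — exactly the behaviour reported.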
[jira] [Updated] (HDFS-8821) Explain message "Operation category X is not supported in state standby"
[ https://issues.apache.org/jira/browse/HDFS-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-8821: -- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.8.0 Status: Resolved (was: Patch Available) Thanks Gautam, test passed locally for me as well. Seems unrelated. I've pushed this into trunk and branch-2; thank you for the continued contributions! Explain message Operation category X is not supported in state standby - Key: HDFS-8821 URL: https://issues.apache.org/jira/browse/HDFS-8821 Project: Hadoop HDFS Issue Type: Improvement Reporter: Gautam Gopalakrishnan Assignee: Gautam Gopalakrishnan Priority: Minor Fix For: 2.8.0 Attachments: HDFS-8821-1.patch, HDFS-8821-2.patch There is one message specifically that causes many users to question the health of their HDFS cluster, namely Operation category READ/WRITE is not supported in state standby. HDFS-3447 is an attempt to lower the logging severity for StandbyException related messages but it is not resolved yet. So this jira is an attempt to explain this particular message so it appears less scary. The text is question 3.17 in the Hadoop Wiki FAQ ref: https://wiki.apache.org/hadoop/FAQ#What_does_the_message_.22Operation_category_READ.2FWRITE_is_not_supported_in_state_standby.22_mean.3F -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8821) Explain message "Operation category X is not supported in state standby"
[ https://issues.apache.org/jira/browse/HDFS-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14641777#comment-14641777 ] Harsh J commented on HDFS-8821: --- +1 looks good to me. This should help avoid the continual confusion people new to HDFS HA appear to have (from experience) about whether that error is to be taken seriously or not. Thank you for writing up the wiki entry too! bq. hadoop.hdfs.server.namenode.ha.TestStandbyIsHot Test name appears relevant but the failing test does not (fails at counting proper under-replicated blocks value). I'll still test this again manually before committing (by Tuesday EOD). Explain message Operation category X is not supported in state standby - Key: HDFS-8821 URL: https://issues.apache.org/jira/browse/HDFS-8821 Project: Hadoop HDFS Issue Type: Improvement Reporter: Gautam Gopalakrishnan Assignee: Gautam Gopalakrishnan Priority: Minor Attachments: HDFS-8821-1.patch, HDFS-8821-2.patch There is one message specifically that causes many users to question the health of their HDFS cluster, namely Operation category READ/WRITE is not supported in state standby. HDFS-3447 is an attempt to lower the logging severity for StandbyException related messages but it is not resolved yet. So this jira is an attempt to explain this particular message so it appears less scary. The text is question 3.17 in the Hadoop Wiki FAQ ref: https://wiki.apache.org/hadoop/FAQ#What_does_the_message_.22Operation_category_READ.2FWRITE_is_not_supported_in_state_standby.22_mean.3F -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8486) DN startup may cause severe data loss
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-8486: -- Release Note: Public service notice: - Every restart of a 2.6.x or 2.7.0 DN incurs a risk of unwanted block deletion. - Apply this patch if you are running a pre-2.7.1 release. (Promoting comment into release-notes area of JIRA just so its better visible) DN startup may cause severe data loss - Key: HDFS-8486 URL: https://issues.apache.org/jira/browse/HDFS-8486 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 0.23.1, 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Fix For: 2.7.1 Attachments: HDFS-8486.patch, HDFS-8486.patch A race condition between block pool initialization and the directory scanner may cause a mass deletion of blocks in multiple storages. If block pool initialization finds a block on disk that is already in the replica map, it deletes one of the blocks based on size, GS, etc. Unfortunately it _always_ deletes one of the blocks even if identical, thus the replica map _must_ be empty when the pool is initialized. The directory scanner starts at a random time within its periodic interval (default 6h). If the scanner starts very early it races to populate the replica map, causing the block pool init to erroneously delete blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8660) Slow write to packet mirror should log which mirror and which block
[ https://issues.apache.org/jira/browse/HDFS-8660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613063#comment-14613063 ] Harsh J commented on HDFS-8660: --- This would be an excellent improvement for certain performance troubleshooting. In looking for more such Slow messages, the following matches may also need similar changes: The Slow ReadProcessor message in DataStreamer.java can benefit from a block ID. The Slow waitForAckedSeqno in DataStreamer.java message too could benefit from a block ID as well as a nodes list. Just Block ID can also be added into the below messages under BlockReceiver.java: Slow flushOrSync Slow BlockReceiver write data to disk Slow manageWriterOsCache DN Mirror Host+Block ID can both be added into the below message under BlockReceiver.java: Slow PacketResponder send ack to upstream took Could you check if these are possible to do as part of the same JIRA as simple changes too? Slow write to packet mirror should log which mirror and which block --- Key: HDFS-8660 URL: https://issues.apache.org/jira/browse/HDFS-8660 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 2.7.0 Reporter: Hazem Mahmoud Assignee: Hazem Mahmoud Currently, log format states something similar to: Slow BlockReceiver write packet to mirror took 468ms (threshold=300ms) For troubleshooting purposes, it would be good to have it mention which block ID it's writing as well as the mirror (DN) that it's writing it to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-3627) OfflineImageViewer oiv Indented processor prints out the Java class name in the DELEGATION_KEY field
[ https://issues.apache.org/jira/browse/HDFS-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572631#comment-14572631 ] Harsh J commented on HDFS-3627: --- bq. Its used only for the old Pre-protobuf images. Yup but we seem to support writing a legacy copy for the OIV specifically, so this fix could still be useful to some. OfflineImageViewer oiv Indented processor prints out the Java class name in the DELEGATION_KEY field Key: HDFS-3627 URL: https://issues.apache.org/jira/browse/HDFS-3627 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 0.23.0 Reporter: Ravi Prakash Priority: Minor Labels: BB2015-05-TBR Attachments: HDFS-3627.patch, HDFS-3627.patch, HDFS-3627.patch, HDFS-3627.patch, HDFS-3627.patch, HDFS-3627.patch Instead of the contents of the delegation key this is printed out DELEGATION_KEY = org.apache.hadoop.security.token.delegation.DelegationKey@1e2ca7 DELEGATION_KEY = org.apache.hadoop.security.token.delegation.DelegationKey@105bd58 DELEGATION_KEY = org.apache.hadoop.security.token.delegation.DelegationKey@1d1e730 DELEGATION_KEY = org.apache.hadoop.security.token.delegation.DelegationKey@1a116c9 DELEGATION_KEY = org.apache.hadoop.security.token.delegation.DelegationKey@df1832 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
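The `DelegationKey@1e2ca7` output above is the classic symptom of printing an object that does not override `toString()`: Java falls back to `Object`'s `ClassName@hexHash` form. A minimal illustration, with hypothetical class names (not the actual DelegationKey or the actual patch), of the symptom and the usual fix of formatting the object's fields:

```java
public class ToStringDemo {
    /** No toString() override: printing yields the "ClassName@hexHash" identity form. */
    public static class RawKey {
        public final int keyId = 7;
    }

    /** With an override, the field contents are printed instead of the identity. */
    public static class PrintableKey {
        public final int keyId;
        public final long expiryDate;
        public PrintableKey(int keyId, long expiryDate) {
            this.keyId = keyId; this.expiryDate = expiryDate;
        }
        @Override public String toString() {
            return "key id=" + keyId + "; expiry=" + expiryDate;
        }
    }
}
```

The Indented processor's DELEGATION_KEY lines would become useful once the key's fields (id, expiry, etc.) are emitted in some such readable form.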
[jira] [Commented] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output
[ https://issues.apache.org/jira/browse/HDFS-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570433#comment-14570433 ] Harsh J commented on HDFS-8516: --- I wasn't worried about the usage ones, as I'd generally never parse them, but if we'd like the extra newlines fixed in those I think I'll also need to target the cache/fs-shell/trace tools. If that's needed, it may be better to have a builder option in TableListing itself to not add that newline after its last row. Let me know! The 'hdfs crypto -listZones' should not print an extra newline at end of output --- Key: HDFS-8516 URL: https://issues.apache.org/jira/browse/HDFS-8516 Project: Hadoop HDFS Issue Type: Improvement Components: tools Affects Versions: 2.7.0 Reporter: Harsh J Assignee: Harsh J Priority: Minor Attachments: HDFS-8516.patch It currently prints an extra newline (TableListing already adds a newline to the end of the table string). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
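A sketch of the builder-style switch suggested in the comment above. `TableSketch` and `noTrailingNewline()` are hypothetical stand-ins, not the real TableListing API; the point is only that trailing-newline emission becomes opt-out per call site instead of every tool trimming its own output.

```java
import java.util.ArrayList;
import java.util.List;

public class TableSketch {
    private final List<String> rows = new ArrayList<>();
    private boolean trailingNewline = true; // mimics today's always-on behaviour

    public TableSketch addRow(String row) {
        rows.add(row);
        return this;
    }

    /** Hypothetical option like the one proposed for TableListing. */
    public TableSketch noTrailingNewline() {
        trailingNewline = false;
        return this;
    }

    @Override public String toString() {
        String body = String.join("\n", rows);
        return trailingNewline ? body + "\n" : body;
    }
}
```

Callers that print the table with their own println (as the crypto/cache/fs-shell/trace tools appear to) would use `noTrailingNewline()` to avoid the doubled blank line.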
[jira] [Created] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output
Harsh J created HDFS-8516: - Summary: The 'hdfs crypto -listZones' should not print an extra newline at end of output Key: HDFS-8516 URL: https://issues.apache.org/jira/browse/HDFS-8516 Project: Hadoop HDFS Issue Type: Improvement Components: tools Reporter: Harsh J Assignee: Harsh J Priority: Minor It currently prints an extra newline (TableListing already adds a newline to end of table string). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output
[ https://issues.apache.org/jira/browse/HDFS-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-8516: -- Attachment: HDFS-8516.patch The 'hdfs crypto -listZones' should not print an extra newline at end of output --- Key: HDFS-8516 URL: https://issues.apache.org/jira/browse/HDFS-8516 Project: Hadoop HDFS Issue Type: Improvement Components: tools Reporter: Harsh J Assignee: Harsh J Priority: Minor Attachments: HDFS-8516.patch It currently prints an extra newline (TableListing already adds a newline to end of table string). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8516) The 'hdfs crypto -listZones' should not print an extra newline at end of output
[ https://issues.apache.org/jira/browse/HDFS-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-8516: -- Target Version/s: 2.8.0 Affects Version/s: 2.7.0 Status: Patch Available (was: Open) The 'hdfs crypto -listZones' should not print an extra newline at end of output --- Key: HDFS-8516 URL: https://issues.apache.org/jira/browse/HDFS-8516 Project: Hadoop HDFS Issue Type: Improvement Components: tools Affects Versions: 2.7.0 Reporter: Harsh J Assignee: Harsh J Priority: Minor Attachments: HDFS-8516.patch It currently prints an extra newline (TableListing already adds a newline to end of table string). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7097) Allow block reports to be processed during checkpointing on standby name node
[ https://issues.apache.org/jira/browse/HDFS-7097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-7097: -- Target Version/s: (was: 2.6.0) Allow block reports to be processed during checkpointing on standby name node - Key: HDFS-7097 URL: https://issues.apache.org/jira/browse/HDFS-7097 Project: Hadoop HDFS Issue Type: Bug Reporter: Kihwal Lee Assignee: Kihwal Lee Priority: Critical Fix For: 2.7.0 Attachments: HDFS-7097.patch, HDFS-7097.patch, HDFS-7097.patch, HDFS-7097.patch, HDFS-7097.ultimate.trunk.patch On a reasonably busy HDFS cluster, there is a stream of creates, causing data nodes to generate incremental block reports. When a standby name node is checkpointing, RPC handler threads trying to process a full or incremental block report are blocked on the name system's {{fsLock}}, because the checkpointer acquires the read lock on it. This can create a serious problem if the name space is large and checkpointing takes a long time. All available RPC handlers can be tied up very quickly. If you have 100 handlers, it only takes 34 file creates. If a separate service RPC port is not used, HA transition will have to wait in the call queue for minutes. Even if a separate service RPC port is configured, heartbeats from datanodes will be blocked. A standby NN with a big name space can lose all data nodes after checkpointing. The RPC calls will also be retransmitted by data nodes many times, filling up the call queue and potentially causing listen queue overflow. Since block reports do not modify any state that is being saved to the fsimage, I propose letting them through during checkpointing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
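The contention mechanism above - a long-held read lock starving writers - can be demonstrated with a plain `ReentrantReadWriteLock`. This is a single-threaded simulation (in the real NN the checkpointer and the RPC handlers are different threads, and handlers block rather than probing); the class and method names are illustrative, not Hadoop's.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FsLockDemo {
    private static final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

    /** Simulates the checkpointer taking the namesystem read lock for the whole save. */
    public static void checkpointStart() { fsLock.readLock().lock(); }
    public static void checkpointEnd()   { fsLock.readLock().unlock(); }

    /**
     * A block-report handler needs the write lock. Returns whether it could
     * enter without blocking; while the checkpoint "holds" the read lock,
     * every such handler stalls, and with enough reports queued all RPC
     * handler threads end up tied to this one lock.
     */
    public static boolean blockReportCanEnter() {
        boolean ok = fsLock.writeLock().tryLock();
        if (ok) fsLock.writeLock().unlock();
        return ok;
    }
}
```

The proposed fix side-steps this by letting report processing proceed during the checkpoint, since the report does not mutate state being saved to the fsimage.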
[jira] [Updated] (HDFS-7442) Optimization for decommission-in-progress check
[ https://issues.apache.org/jira/browse/HDFS-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-7442: -- Affects Version/s: 2.6.0 Optimization for decommission-in-progress check --- Key: HDFS-7442 URL: https://issues.apache.org/jira/browse/HDFS-7442 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.6.0 Reporter: Ming Ma 1. {{isReplicationInProgress}} currently rescans all blocks of a given node each time the method is called; it becomes less efficient as more of the node's blocks become fully replicated. Each scan takes the FS lock. 2. As discussed in HDFS-7374, if the node becomes dead during decommission, it is useful if the dead node can be marked as decommissioned after all its blocks are fully replicated. Currently there is no way to check the blocks of dead decommission-in-progress nodes, given that the dead node is removed from the blockmap. There are mitigations for these limitations: set dfs.namenode.decommission.nodes.per.interval to a small value to reduce the duration of the lock; HDFS-7409 uses global FS state to tell if a dead node's blocks are fully replicated. To address these scenarios, it will be useful to track the decommission-in-progress blocks separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
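The proposal's core idea - track the decommission-in-progress blocks separately instead of rescanning every block of the node under the FS lock - can be sketched as an incrementally maintained pending set. Names here (`DecommissionTracker` etc.) are hypothetical, not the namenode's actual data structures.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class DecommissionTracker {
    // Only the blocks still awaiting full replication are kept; the set
    // shrinks as replication completes, so the in-progress check is O(1)
    // instead of a full per-node block rescan under the FS lock.
    private final Set<Long> pending = new HashSet<>();

    public void startDecommission(Collection<Long> underReplicatedBlocks) {
        pending.addAll(underReplicatedBlocks);
    }

    /** Called as each block reaches its target replication. */
    public void onBlockFullyReplicated(long blockId) {
        pending.remove(blockId);
    }

    public boolean isDecommissionInProgress() {
        return !pending.isEmpty();
    }
}
```

Because the set survives independently of the blockmap, it would also let a node that dies mid-decommission still be marked decommissioned once its last pending block replicates - the second scenario the report describes.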
[jira] [Updated] (HDFS-7442) Optimization for decommission-in-progress check
[ https://issues.apache.org/jira/browse/HDFS-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-7442: -- Component/s: namenode Optimization for decommission-in-progress check --- Key: HDFS-7442 URL: https://issues.apache.org/jira/browse/HDFS-7442 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.6.0 Reporter: Ming Ma 1. {{isReplicationInProgress}} currently rescans all blocks of a given node each time the method is called; it becomes less efficient as more of the node's blocks become fully replicated. Each scan takes the FS lock. 2. As discussed in HDFS-7374, if the node becomes dead during decommission, it is useful if the dead node can be marked as decommissioned after all its blocks are fully replicated. Currently there is no way to check the blocks of dead decommission-in-progress nodes, given that the dead node is removed from the blockmap. There are mitigations for these limitations: set dfs.namenode.decommission.nodes.per.interval to a small value to reduce the duration of the lock; HDFS-7409 uses global FS state to tell if a dead node's blocks are fully replicated. To address these scenarios, it will be useful to track the decommission-in-progress blocks separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3493) Invalidate excess corrupted blocks as long as minimum replication is satisfied
[ https://issues.apache.org/jira/browse/HDFS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J updated HDFS-3493: -- Component/s: namenode Invalidate excess corrupted blocks as long as minimum replication is satisfied -- Key: HDFS-3493 URL: https://issues.apache.org/jira/browse/HDFS-3493 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.0-alpha, 2.0.5-alpha Reporter: J.Andreina Assignee: Juan Yu Fix For: 2.5.0 Attachments: HDFS-3493.002.patch, HDFS-3493.003.patch, HDFS-3493.004.patch, HDFS-3493.patch Replication factor = 3, block report interval = 1 min; start the NN and 3 DNs. Step 1: Write a file without closing it and do an hflush (DN1, DN2, DN3 have blk_ts1). Step 2: Stop DN3. Step 3: Recovery happens and the timestamp is updated (blk_ts2). Step 4: Close the file. Step 5: blk_ts2 is finalized and available on DN1 and DN2. Step 6: Now restart DN3 (which still has blk_ts1 in rbw). From the NN side, no command is issued to DN3 to delete blk_ts1; DN3 is only asked to mark the block as corrupt. Replication of blk_ts2 to DN3 does not happen. NN logs: {noformat} INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_3927215081484173742 to add as corrupt on XX.XX.XX.XX:50276 by /XX.XX.XX.XX because reported RWR replica with genstamp 1007 does not match COMPLETE block's genstamp in block map 1008 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from DatanodeRegistration(XX.XX.XX.XX, storageID=DS-443871816-XX.XX.XX.XX-50276-1336829714197, infoPort=50275, ipcPort=50277, storageInfo=lv=-40;cid=CID-e654ac13-92dc-4f82-a22b-c0b6861d06d7;nsid=2063001898;c=0), blocks: 2, processing time: 1 msecs INFO org.apache.hadoop.hdfs.StateChange: BLOCK* Removing block blk_3927215081484173742_1008 from neededReplications as it has enough replicas. 
INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_3927215081484173742 to add as corrupt on XX.XX.XX.XX:50276 by /XX.XX.XX.XX because reported RWR replica with genstamp 1007 does not match COMPLETE block's genstamp in block map 1008 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* processReport: from DatanodeRegistration(XX.XX.XX.XX, storageID=DS-443871816-XX.XX.XX.XX-50276-1336829714197, infoPort=50275, ipcPort=50277, storageInfo=lv=-40;cid=CID-e654ac13-92dc-4f82-a22b-c0b6861d06d7;nsid=2063001898;c=0), blocks: 2, processing time: 1 msecs WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not able to place enough replicas, still in need of 1 to reach 1 For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy {noformat} fsck Report === {noformat} /file21: Under replicated BP-1008469586-XX.XX.XX.XX-1336829603103:blk_3927215081484173742_1008. Target Replicas is 3 but found 2 replica(s). .Status: HEALTHY Total size: 495 B Total dirs: 1 Total files: 3 Total blocks (validated):3 (avg. block size 165 B) Minimally replicated blocks: 3 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 1 (33.32 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 1 Average block replication: 2.0 Corrupt blocks: 0 Missing replicas:1 (14.285714 %) Number of data-nodes:3 Number of racks: 1 FSCK ended at Sun May 13 09:49:05 IST 2012 in 9 milliseconds The filesystem under path '/' is HEALTHY {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
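The rule in this issue's title can be stated as a one-line predicate: an excess corrupt replica (here, DN3's stale blk_ts1) may be invalidated as long as the healthy replica count still meets minimum replication. A hedged sketch - the class and method names are illustrative, not the patch's actual code:

```java
public class CorruptReplicaPolicy {
    /**
     * An excess corrupt replica may be invalidated (and the slot freed for
     * re-replication of the good genstamp) as long as the live, healthy
     * replica count still satisfies the configured minimum replication.
     */
    public static boolean canInvalidateCorrupt(int liveReplicas, int minReplication) {
        return liveReplicas >= minReplication;
    }
}
```

In the scenario above, blk_ts2 has two live replicas (DN1, DN2), so with the default minimum replication of 1 the stale copy on DN3 is safe to invalidate, which would then let blk_ts2 replicate to DN3.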
[jira] [Assigned] (HDFS-7960) The full block report should prune zombie storages even if they're not empty
[ https://issues.apache.org/jira/browse/HDFS-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J reassigned HDFS-7960: - Assignee: Colin Patrick McCabe (was: Samuel Otero Schmidt) The full block report should prune zombie storages even if they're not empty Key: HDFS-7960 URL: https://issues.apache.org/jira/browse/HDFS-7960 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Colin Patrick McCabe Priority: Critical Fix For: 2.7.0 Attachments: HDFS-7960.002.patch, HDFS-7960.003.patch, HDFS-7960.004.patch, HDFS-7960.005.patch, HDFS-7960.006.patch, HDFS-7960.007.patch, HDFS-7960.008.patch The full block report should prune zombie storages even if they're not empty. We have seen cases in production where zombie storages have not been pruned subsequent to HDFS-7575. This could arise any time the NameNode thinks there is a block in some old storage which is actually not there. In this case, the block will not show up in the new storage (once old is renamed to new) and the old storage will linger forever as a zombie, even with the HDFS-7596 fix applied. This also happens with datanode hotplug, when a drive is removed. In this case, an entire storage (volume) goes away but the blocks do not show up in another storage on the same datanode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
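The pruning rule in the title - a storage known to the NN but absent from the DN's latest full block report is a zombie, even if the NN still thinks it holds blocks - reduces to a set difference. A hypothetical sketch (not BlockManager's actual interleaved-report logic, which must also handle reports split across multiple RPCs):

```java
import java.util.HashSet;
import java.util.Set;

public class StoragePruner {
    /**
     * Storages the NN tracks for a datanode that did NOT appear in that
     * datanode's latest full block report. Per this issue, these should be
     * pruned even when the NN's view of them is non-empty (e.g. after a
     * hotplug drive removal, the blocks simply no longer exist anywhere
     * on the node).
     */
    public static Set<String> zombies(Set<String> knownStorages, Set<String> reportedStorages) {
        Set<String> z = new HashSet<>(knownStorages);
        z.removeAll(reportedStorages);
        return z;
    }
}
```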
[jira] [Commented] (HDFS-8118) Delay in checkpointing Trash can leave trash for 2 intervals before deleting
[ https://issues.apache.org/jira/browse/HDFS-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505791#comment-14505791 ] Harsh J commented on HDFS-8118: --- Thanks for explaining that Casey. It makes sense to make the checkpoint date a constant for uniformity - and the fix for this looks alright to me. It also makes sense that people may want to set the checkpoint interval equal to the trash interval. I think we can remove the patch's change of capping it to half the interval value, and instead just add a small doc note in hdfs-default.xml to the trash checkpoint period property on what the behaviour could end up being if it's set equal to the trash clearing interval. Would it also be possible to come up with a test case for this? For example, load some files into trash such that multiple dirs need to be checkpointed, then issue a checkpoint (or await its lowered interval) and ensure only one date is observed before clearing occurs? It would help avoid regressions in the future, just in case. Delay in checkpointing Trash can leave trash for 2 intervals before deleting Key: HDFS-8118 URL: https://issues.apache.org/jira/browse/HDFS-8118 Project: Hadoop HDFS Issue Type: Bug Reporter: Casey Brotherton Assignee: Casey Brotherton Priority: Trivial Attachments: HDFS-8118.patch When fs.trash.checkpoint.interval and fs.trash.interval are set non-zero and the same, it is possible for trash to be left for two intervals. The TrashPolicyDefault will use a floor and ceiling function to ensure that the Trash will be checkpointed every interval of minutes. Each user's trash is checkpointed individually. The time resolution of the checkpoint timestamp is to the second. If the seconds tick over while one user is checkpointing, then the next user's timestamp will be later. This will cause the next user's checkpoint to not be deleted at the next interval. 
I have recreated this in a lab cluster. I also have a suggestion for a patch that I can upload later tonight after testing it further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
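The "constant-ise the checkpoint date" idea above - take one timestamp per checkpoint run and name every user's checkpoint directory from it, so a second ticking over mid-run cannot split users across two names - can be sketched as follows. The class name and pattern are illustrative, not TrashPolicyDefault's actual code.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class TrashCheckpointName {
    // SimpleDateFormat is not thread-safe, hence the synchronized accessor.
    private static final SimpleDateFormat FMT = new SimpleDateFormat("yyMMddHHmmss");

    /**
     * Derives the checkpoint directory name from a timestamp captured ONCE
     * at the start of the run and reused for every user, instead of calling
     * new Date() per user.
     */
    public static synchronized String checkpointDirName(Date checkpointTime) {
        return FMT.format(checkpointTime);
    }
}
```

Two users checkpointed with the same captured `Date` always get the same name; two calls a second apart (the buggy per-user pattern) do not, which is exactly the mismatch that leaves trash for a second interval.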
[jira] [Commented] (HDFS-8113) NullPointerException in BlockInfoContiguous causes block report failure
[ https://issues.apache.org/jira/browse/HDFS-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496446#comment-14496446 ] Harsh J commented on HDFS-8113: --- Stale block copies leftover in the DN can cause the condition - it indeed goes away if you clear out the RBW directory in the DN. Imagine this condition: 1. File is being written. Has replica on node X among others. 2. Replica write to node X in pipeline fails. Write carries on, leaving stale block copy in RBW of node X. 3. File gets closed and deleted away soon/immediately after (but well before a block report from X). 4. Block report now sends the RBW info but NN has no knowledge of the block anymore. I think modifying Colin's test this way should reproduce the issue: 1. start a mini dfs cluster with 2 datanodes 2. create a file with repl=2, but do not close it (flush it to ensure on-disk RBW write) 3. take down one DN 4. close and delete the file 5. wait 6. bring back up the other DN, which will still have the RBW block from the file which was deleted -- Harsh J NullPointerException in BlockInfoContiguous causes block report failure --- Key: HDFS-8113 URL: https://issues.apache.org/jira/browse/HDFS-8113 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: HDFS-8113.patch The following copy constructor can throw NullPointerException if {{bc}} is null. {code} protected BlockInfoContiguous(BlockInfoContiguous from) { this(from, from.bc.getBlockReplication()); this.bc = from.bc; } {code} We have observed that some DataNodes keeps failing doing block reports with NameNode. The stacktrace is as follows. Though we are not using the latest version, the problem still exists. 
{quote} 2015-03-08 19:28:13,442 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.<init>(BlockInfo.java:80) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockToMarkCorrupt.<init>(BlockManager.java:1696) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.checkReplicaCorrupt(BlockManager.java:2185) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReportedBlock(BlockManager.java:2047) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.reportDiff(BlockManager.java:1950) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1823) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1750) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1069) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152) at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26382) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1623) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
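The quoted copy constructor dereferences `from.bc` unconditionally, which is where the NPE originates when the block's file has already been deleted (the stale-RBW sequence Harsh describes). A hedged sketch of a null-guarded version - the types, the factory-method shape, and the fallback replication value are all illustrative, not the actual BlockInfoContiguous fix:

```java
public class BlockInfoSketch {
    /** Stand-in for the BlockCollection back-reference (the owning file). */
    public interface BlockCollection {
        int getBlockReplication();
    }

    public final BlockCollection bc;
    public final int replication;

    public BlockInfoSketch(BlockCollection bc, int replication) {
        this.bc = bc;
        this.replication = replication;
    }

    /**
     * Null-guarded equivalent of the quoted copy constructor: from.bc can be
     * null for a block whose file was deleted before the block report arrived,
     * so getBlockReplication() must not be called blindly. The fallback value
     * of 1 is purely illustrative.
     */
    public static BlockInfoSketch copyOf(BlockInfoSketch from) {
        int repl = (from.bc != null) ? from.bc.getBlockReplication() : 1;
        return new BlockInfoSketch(from.bc, repl);
    }
}
```

With the guard, copying an orphaned block no longer throws, so a single stale RBW replica cannot fail the datanode's entire block report.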
[jira] [Commented] (HDFS-7452) Can we skip getCorruptFiles() call for standby NameNode..?
[ https://issues.apache.org/jira/browse/HDFS-7452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484575#comment-14484575 ] Harsh J commented on HDFS-7452: --- IIUC, the SBN's web UI tries to load corrupt/missing block file info from itself locally, which also causes this log spam. We could eliminate that to address this? Can we skip getCorruptFiles() call for standby NameNode..? -- Key: HDFS-7452 URL: https://issues.apache.org/jira/browse/HDFS-7452 Project: Hadoop HDFS Issue Type: Bug Reporter: Brahma Reddy Battula Assignee: Brahma Reddy Battula Priority: Trivial Seen the following WARN logs in the standby NameNode logs: {noformat} 2014-11-27 17:50:32,497 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:50:42,557 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:50:52,617 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:51:00,058 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:51:00,117 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:51:02,678 | WARN | 
512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:51:12,738 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:51:22,798 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:51:30,058 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) 2014-11-27 17:51:30,119 | WARN | 512264920@qtp-429668078-606 | Get corrupt file blocks returned error: Operation category READ is not supported in state standby | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCorruptFiles(FSNamesystem.java:6916) {noformat} Do we need this call on the standby NN? I feel it might not be required. Can we maintain this state-wide? Please let me know if I am wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
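Harsh's suggestion above amounts to a guard on the caller: the standby's web UI should not attempt the READ-category `getCorruptFiles()` call at all, returning an empty listing instead of triggering a WARN on every refresh. A hypothetical sketch - the names and the enum are illustrative, not the NameNode's actual HA machinery:

```java
import java.util.Collections;
import java.util.List;

public class CorruptFilesGuard {
    public enum HAState { ACTIVE, STANDBY }

    /**
     * On the standby, skip the namesystem call entirely (it would only be
     * rejected as "Operation category READ is not supported in state standby"
     * and logged as a warning); on the active NN, return the real listing.
     */
    public static List<String> corruptFilesFor(HAState state, List<String> fromNamesystem) {
        if (state == HAState.STANDBY) {
            return Collections.emptyList();
        }
        return fromNamesystem;
    }
}
```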
[jira] [Resolved] (HDFS-7306) can't decommission w/under construction blocks
[ https://issues.apache.org/jira/browse/HDFS-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harsh J resolved HDFS-7306. --- Resolution: Duplicate This should be resolved via HDFS-5579. can't decommission w/under construction blocks -- Key: HDFS-7306 URL: https://issues.apache.org/jira/browse/HDFS-7306 Project: Hadoop HDFS Issue Type: Bug Reporter: Allen Wittenauer We need a way to decommission a node with open blocks. Now that HDFS supports append, this should be do-able. -- This message was sent by Atlassian JIRA (v6.3.4#6332)