[jira] [Commented] (HDFS-1447) Make getGenerationStampFromFile() more efficient, so it doesn't reprocess full directory listing for every block
[ https://issues.apache.org/jira/browse/HDFS-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335597#comment-14335597 ] Hadoop QA commented on HDFS-1447: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12499004/HDFS-1447.patch against trunk revision 9a37247. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9661//console This message is automatically generated. Make getGenerationStampFromFile() more efficient, so it doesn't reprocess full directory listing for every block Key: HDFS-1447 URL: https://issues.apache.org/jira/browse/HDFS-1447 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 0.20.2 Reporter: Matt Foley Assignee: Matt Foley Attachments: HDFS-1447.patch, Test_HDFS_1447_NotForCommitt.java.patch Make getGenerationStampFromFile() more efficient. Currently this routine is called by addToReplicasMap() for every blockfile in the directory tree, and it walks each file's containing directory on every call. There is a simple refactoring that should make it more efficient. This work item is one of four sub-tasks for HDFS-1443, Improve Datanode startup time. The fix will probably be folded into sibling task HDFS-1446, which is already refactoring the method that calls getGenerationStampFromFile(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
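For readers following along, here is a minimal sketch of the refactoring idea described in the issue (helper names like scanForMetaFile() and isBlockFile() are illustrative, not the actual patch): the caller lists each directory once and passes the cached listing into getGenerationStampFromFile(), instead of letting that method re-list the parent directory for every block file.
{code}
// Before (sketch): the parent directory is re-listed on every call,
// so a directory with N blocks pays for N listings.
static long getGenerationStampFromFile(File blockFile) {
  File[] siblings = blockFile.getParentFile().listFiles();
  return scanForMetaFile(siblings, blockFile);  // hypothetical helper
}

// After (sketch): the caller supplies the listing it already holds.
static long getGenerationStampFromFile(File[] dirListing, File blockFile) {
  return scanForMetaFile(dirListing, blockFile);  // no extra listFiles() call
}

void addToReplicasMap(ReplicasMap map, File dir) {
  File[] listing = dir.listFiles();  // one listing per directory
  for (File f : listing) {
    if (isBlockFile(f)) {  // hypothetical helper
      long genStamp = getGenerationStampFromFile(listing, f);
      // ... register the replica with genStamp in the map ...
    }
  }
}
{code}
This turns per-block directory walks into a single walk per directory, which matters most on DataNodes with large block counts per volume.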
[jira] [Commented] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335658#comment-14335658 ] Andrew Wang commented on HDFS-7763: --- This looks good to me, though one little nit is we could do {{System.exit}} in a {{finally}}. +1, I'll commit shortly. fix zkfc hung issue due to not catching exception in a corner case -- Key: HDFS-7763 URL: https://issues.apache.org/jira/browse/HDFS-7763 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Liang Xie Assignee: Liang Xie Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936 In our production cluster, both zkfc processes hung after a ZK network outage. The zkfc log said:
{code}
2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect
2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed
2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300
2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300
{code}
The thread dump has also been uploaded as an attachment. From the dump, we can see that because of the unknown non-daemon threads (pool-*-thread-*) the process did not exit, but the critical threads, like the health monitor and RPC threads, had been stopped, so our watchdog (supervisord) did not observe that the zkfc process was down or abnormal, and the subsequent namenode failover could not be done as expected. There are two possible fixes here: 1) figure out where the default-named threads, like pool-7-thread-1, came from, and close them or set their daemon property; I tried to track them down but have found nothing so far. 2) catch the exception from ZKFailoverController.run() so we can continue on to execute System.exit. The attached patch implements 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
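As a companion to fix (2), here is a minimal sketch of the shape of the change, with the reviewer's {{finally}} suggestion applied (names are illustrative, not the committed patch): any throwable escaping run() is logged, and System.exit() lives in a finally block so the JVM terminates even if stray non-daemon pool-*-thread-* threads are still alive.
{code}
public static void main(String[] args) {
  int rc = 1;  // non-zero default so unexpected paths register as failure
  try {
    rc = zkfc.run(args);  // zkfc: a concrete ZKFailoverController (illustrative)
  } catch (Throwable t) {
    LOG.fatal("ZKFC exiting due to unhandled error", t);  // LOG assumed in scope
  } finally {
    System.exit(rc);  // runs even when non-daemon threads would block normal exit
  }
}
{code}
With this shape, the watchdog sees the process die and can restart it, so the namenode failover proceeds as expected.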
[jira] [Created] (HDFS-7837) Allocate and persist striped blocks in FSNamesystem
Jing Zhao created HDFS-7837: --- Summary: Allocate and persist striped blocks in FSNamesystem Key: HDFS-7837 URL: https://issues.apache.org/jira/browse/HDFS-7837 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jing Zhao Assignee: Jing Zhao Try to finish the remaining work from HDFS-7339 (except the ClientProtocol/DFSClient part): # Allow FSNamesystem#getAdditionalBlock to create striped blocks and persist striped blocks to editlog # Update FSImage for max allocated striped block ID # Update the block commit/complete logic in BlockManager -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7836) BlockManager Scalability Improvements
Charles Lamb created HDFS-7836: -- Summary: BlockManager Scalability Improvements Key: HDFS-7836 URL: https://issues.apache.org/jira/browse/HDFS-7836 Project: Hadoop HDFS Issue Type: Bug Reporter: Charles Lamb Assignee: Charles Lamb Improvements to BlockManager scalability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-1732) Enhance NNThroughputBenchmark to observe scale-dependent changes in IBR processing
[ https://issues.apache.org/jira/browse/HDFS-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-1732: --- Fix Version/s: (was: 0.24.0) Enhance NNThroughputBenchmark to observe scale-dependent changes in IBR processing -- Key: HDFS-1732 URL: https://issues.apache.org/jira/browse/HDFS-1732 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: 0.22.0 Reporter: Matt Foley Assignee: Matt Foley Attachments: IBRBenchmark_apachetrunk_v8.patch, IBRBenchmark_apachetrunk_v9.patch Rework NNThroughputBenchmark to provide more detailed info about Initial Block Report (IBR) processing time, as a function of number of nodes, number of unique blocks, and total number of replicas. Allow both direct local communication and remote and local RPC, so we can see how much impact RPC overhead has on IBR processing time. Also plug some holes in performance-specific logging of Block Report processing, so that consistent and complete data are logged from both Namenode and Datanode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-1447) Make getGenerationStampFromFile() more efficient, so it doesn't reprocess full directory listing for every block
[ https://issues.apache.org/jira/browse/HDFS-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-1447: --- Status: Open (was: Patch Available) Make getGenerationStampFromFile() more efficient, so it doesn't reprocess full directory listing for every block Key: HDFS-1447 URL: https://issues.apache.org/jira/browse/HDFS-1447 Project: Hadoop HDFS Issue Type: Sub-task Components: datanode Affects Versions: 0.20.2 Reporter: Matt Foley Assignee: Matt Foley Attachments: HDFS-1447.patch, Test_HDFS_1447_NotForCommitt.java.patch Make getGenerationStampFromFile() more efficient. Currently this routine is called by addToReplicasMap() for every blockfile in the directory tree, and it walks each file's containing directory on every call. There is a simple refactoring that should make it more efficient. This work item is one of four sub-tasks for HDFS-1443, Improve Datanode startup time. The fix will probably be folded into sibling task HDFS-1446, which is already refactoring the method that calls getGenerationStampFromFile(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-1732) Enhance NNThroughputBenchmark to observe scale-dependent changes in IBR processing
[ https://issues.apache.org/jira/browse/HDFS-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated HDFS-1732: --- Status: Open (was: Patch Available) Cancelling patch as it no longer applies to trunk. Enhance NNThroughputBenchmark to observe scale-dependent changes in IBR processing -- Key: HDFS-1732 URL: https://issues.apache.org/jira/browse/HDFS-1732 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: 0.22.0 Reporter: Matt Foley Assignee: Matt Foley Attachments: IBRBenchmark_apachetrunk_v8.patch, IBRBenchmark_apachetrunk_v9.patch Rework NNThroughputBenchmark to provide more detailed info about Initial Block Report (IBR) processing time, as a function of number of nodes, number of unique blocks, and total number of replicas. Allow both direct local communication and remote and local RPC, so we can see how much impact RPC overhead has on IBR processing time. Also plug some holes in performance-specific logging of Block Report processing, so that consistent and complete data are logged from both Namenode and Datanode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7840) cdd
[ https://issues.apache.org/jira/browse/HDFS-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ashu updated HDFS-7840: --- Description: cdc was: We are trying to set the following properties in the Hue LDAP section of our environment so that all usernames are forced to lowercase and authentication ignores case. This will avoid new users' home folders being in UPPERCASE, causing access errors with HDFS, which parses to lowercase. Configuration to Set/Append: [[ldap]] ignore_username_case=true force_username_lowercase=true Problem: Cannot identify the proper configuration file to edit (tried the runtime hue.ini and safety-valve files). We have edited the following files and restarted the Hue service, and the runtime hue.ini still does not show the changes made. (8378 is the current process as of this email.) Cloudera Manager also does not expose these parameters, but does offer a field for safety-valve entries. 1. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue_safety_valve.ini a. currently contains the above LDAP section 2. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue.ini a. does not show safety-valve changes; editing the file directly does not work since changes are lost on the next service restart. 3. /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/etc/hue/conf.empty/hue.ini a. File is empty Please provide the correct files to edit and which services need to be restarted for the change to take effect. Summary: cdd (was: errors with HDFS which parses to lowercase.) cdd --- Key: HDFS-7840 URL: https://issues.apache.org/jira/browse/HDFS-7840 Project: Hadoop HDFS Issue Type: Bug Environment: Hadoop Reporter: ashu cdc -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HDFS-7840) cdd
[ https://issues.apache.org/jira/browse/HDFS-7840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved HDFS-7840. Resolution: Invalid cdd --- Key: HDFS-7840 URL: https://issues.apache.org/jira/browse/HDFS-7840 Project: Hadoop HDFS Issue Type: Bug Environment: Hadoop Reporter: ashu cdc -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7840) errors with HDFS which parses to lowercase.
ashu created HDFS-7840: -- Summary: errors with HDFS which parses to lowercase. Key: HDFS-7840 URL: https://issues.apache.org/jira/browse/HDFS-7840 Project: Hadoop HDFS Issue Type: Bug Environment: Hadoop Reporter: ashu We are trying to set the following properties in the Hue LDAP section of our environment so that all usernames are forced to lowercase and authentication ignores case. This will avoid new users' home folders being in UPPERCASE, causing access errors with HDFS, which parses to lowercase. Configuration to Set/Append: [[ldap]] ignore_username_case=true force_username_lowercase=true Problem: Cannot identify the proper configuration file to edit (tried the runtime hue.ini and safety-valve files). We have edited the following files and restarted the Hue service, and the runtime hue.ini still does not show the changes made. (8378 is the current process as of this email.) Cloudera Manager also does not expose these parameters, but does offer a field for safety-valve entries. 1. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue_safety_valve.ini a. currently contains the above LDAP section 2. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue.ini a. does not show safety-valve changes; editing the file directly does not work since changes are lost on the next service restart. 3. /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/etc/hue/conf.empty/hue.ini a. File is empty Please provide the correct files to edit and which services need to be restarted for the change to take effect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7841) access errors with HDFS which parses to lowercase.
ankush created HDFS-7841: Summary: access errors with HDFS which parses to lowercase. Key: HDFS-7841 URL: https://issues.apache.org/jira/browse/HDFS-7841 Project: Hadoop HDFS Issue Type: Bug Reporter: ankush We are trying to set the following properties in the Hue LDAP section of our environment so that all usernames are forced to lowercase and authentication ignores case. This will avoid new users' home folders being in UPPERCASE, causing access errors with HDFS, which parses to lowercase. [[ldap]] ignore_username_case=true force_username_lowercase=true Problem: Cannot identify the proper configuration file to edit (tried the runtime hue.ini and safety-valve files). We have edited the following files and restarted the Hue service, and the runtime hue.ini still does not show the changes made. (8378 is the current process as of this email.) Cloudera Manager also does not expose these parameters, but does offer a field for safety-valve entries. 1. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue_safety_valve.ini a. currently contains the above LDAP section 2. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue.ini a. does not show safety-valve changes; editing the file directly does not work since changes are lost on the next service restart. 3. /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/etc/hue/conf.empty/hue.ini a. File is empty Support Needed: Please provide the correct files to edit and which services need to be restarted for the change to take effect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7538) removedDst should be checked against null in the finally block of FSDirRenameOp#unprotectedRenameTo()
[ https://issues.apache.org/jira/browse/HDFS-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335972#comment-14335972 ] Binglin Chang commented on HDFS-7538: - Hi [~tedyu], the patch is out of date, and I think the bug no longer exists; should this be resolved? removedDst should be checked against null in the finally block of FSDirRenameOp#unprotectedRenameTo() - Key: HDFS-7538 URL: https://issues.apache.org/jira/browse/HDFS-7538 Project: Hadoop HDFS Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: hdfs-7538-001.patch
{code}
if (removedDst != null) {
  undoRemoveDst = false;
...
if (undoRemoveDst) {
  // Rename failed - restore dst
  if (dstParent.isDirectory() && dstParent.asDirectory().isWithSnapshot()) {
    dstParent.asDirectory().undoRename4DstParent(removedDst,
{code}
If the first if check doesn't pass, removedDst would be null and undoRemoveDst may be true. This combination would lead to a NullPointerException in the finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
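To make the failure mode concrete, here is a minimal sketch of the proposed guard, following the shape of the snippet above (surrounding code and the call's trailing arguments are elided, as in the original):
{code}
} finally {
  // The added null check prevents the NPE: undoRemoveDst can still be
  // true when the earlier "removedDst != null" branch never executed.
  if (undoRemoveDst && removedDst != null) {
    // Rename failed - restore dst
    if (dstParent.isDirectory() && dstParent.asDirectory().isWithSnapshot()) {
      dstParent.asDirectory().undoRename4DstParent(removedDst, /* ... */);
    }
  }
}
{code}
The check is cheap and purely defensive, though as the comment above notes, the surrounding code may have changed enough since the report that the path is no longer reachable.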
[jira] [Updated] (HDFS-7841) fff
[ https://issues.apache.org/jira/browse/HDFS-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ankush updated HDFS-7841: - Description: hv (was: We are trying to set the following properties in the Hue LDAP section of our environment so that all usernames are forced to lowercase and authentication ignores case. This will avoid new users' home folders being in UPPERCASE, causing access errors with HDFS, which parses to lowercase. [[ldap]] ignore_username_case=true force_username_lowercase=true Problem: Cannot identify the proper configuration file to edit (tried the runtime hue.ini and safety-valve files). We have edited the following files and restarted the Hue service, and the runtime hue.ini still does not show the changes made. (8378 is the current process as of this email.) Cloudera Manager also does not expose these parameters, but does offer a field for safety-valve entries. 1. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue_safety_valve.ini a. currently contains the above LDAP section 2. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue.ini a. does not show safety-valve changes; editing the file directly does not work since changes are lost on the next service restart. 3. /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/etc/hue/conf.empty/hue.ini a. File is empty Support Needed: Please provide the correct files to edit and which services need to be restarted for the change to take effect. ) Summary: fff (was: access errors with HDFS which parses to lowercase.) fff --- Key: HDFS-7841 URL: https://issues.apache.org/jira/browse/HDFS-7841 Project: Hadoop HDFS Issue Type: Bug Reporter: ankush hv -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7784) load fsimage in parallel
[ https://issues.apache.org/jira/browse/HDFS-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335929#comment-14335929 ] Walter Su commented on HDFS-7784: - I used VisualVM to profile the loading process and found that the bottleneck is deserialization taking too much CPU time, not disk I/O. The test (test-20150213.pdf) uses three 7200rpm hard disks as RAID 0. I tried single-threaded starts with and without cleaning the buffer cache, and the difference is very small. load fsimage in parallel Key: HDFS-7784 URL: https://issues.apache.org/jira/browse/HDFS-7784 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Walter Su Assignee: Walter Su Priority: Minor Attachments: HDFS-7784.001.patch, test-20150213.pdf When a single Namenode has a huge number of files, without using federation, the startup/restart speed is slow, and the fsimage loading step takes most of the time. fsimage loading can be separated into two parts: deserialization and object construction (mostly map insertion). Deserialization takes most of the CPU time, so we can do the deserialization in parallel and add to the hashmap serially. This will significantly reduce the NN start time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
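To illustrate the proposed split, here is a minimal sketch of the pipeline the description outlines (illustrative types; deserialize(), INode, rawSections, and inodeMap are stand-ins, not the actual patch): worker threads do the CPU-bound decoding in parallel, while a single thread performs all map insertions so the non-thread-safe map only ever sees serial updates.
{code}
// Exception handling elided for brevity.
ExecutorService workers = Executors.newFixedThreadPool(
    Runtime.getRuntime().availableProcessors());
CompletionService<List<INode>> done = new ExecutorCompletionService<>(workers);

for (byte[] section : rawSections) {          // raw section bytes from the fsimage
  done.submit(() -> deserialize(section));    // parallel, CPU-bound decoding
}
for (int i = 0; i < rawSections.size(); i++) {
  for (INode inode : done.take().get()) {     // collect finished batches
    inodeMap.put(inode.getId(), inode);       // insertion stays single-threaded
  }
}
workers.shutdown();
{code}
Because decoding dominates CPU time (per the VisualVM profile), parallelizing just that stage should capture most of the win even though insertion remains serial.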
[jira] [Resolved] (HDFS-7841) access errors with HDFS which parses to lowercase.
[ https://issues.apache.org/jira/browse/HDFS-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved HDFS-7841. Resolution: Invalid Sorry, but none of this has anything to do with Apache Hadoop. Please contact Cloudera for support. access errors with HDFS which parses to lowercase. -- Key: HDFS-7841 URL: https://issues.apache.org/jira/browse/HDFS-7841 Project: Hadoop HDFS Issue Type: Bug Reporter: ankush We are trying to set the following properties in the Hue LDAP section of our environment so that all usernames are forced to lowercase and authentication ignores case. This will avoid new users' home folders being in UPPERCASE, causing access errors with HDFS, which parses to lowercase. [[ldap]] ignore_username_case=true force_username_lowercase=true Problem: Cannot identify the proper configuration file to edit (tried the runtime hue.ini and safety-valve files). We have edited the following files and restarted the Hue service, and the runtime hue.ini still does not show the changes made. (8378 is the current process as of this email.) Cloudera Manager also does not expose these parameters, but does offer a field for safety-valve entries. 1. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue_safety_valve.ini a. currently contains the above LDAP section 2. /var/run/cloudera-scm-agent/process/8378-hue-HUE_SERVER/hue.ini a. does not show safety-valve changes; editing the file directly does not work since changes are lost on the next service restart. 3. /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/etc/hue/conf.empty/hue.ini a. File is empty Support Needed: Please provide the correct files to edit and which services need to be restarted for the change to take effect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7538) removedDst should be checked against null in the finally block of FSDirRenameOp#unprotectedRenameTo()
[ https://issues.apache.org/jira/browse/HDFS-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HDFS-7538: - Resolution: Not a Problem Status: Resolved (was: Patch Available) removedDst should be checked against null in the finally block of FSDirRenameOp#unprotectedRenameTo() - Key: HDFS-7538 URL: https://issues.apache.org/jira/browse/HDFS-7538 Project: Hadoop HDFS Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Priority: Minor Attachments: hdfs-7538-001.patch
{code}
if (removedDst != null) {
  undoRemoveDst = false;
...
if (undoRemoveDst) {
  // Rename failed - restore dst
  if (dstParent.isDirectory() && dstParent.asDirectory().isWithSnapshot()) {
    dstParent.asDirectory().undoRename4DstParent(removedDst,
{code}
If the first if check doesn't pass, removedDst would be null and undoRemoveDst may be true. This combination would lead to a NullPointerException in the finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7722) DataNode#checkDiskError should also remove Storage when error is found.
[ https://issues.apache.org/jira/browse/HDFS-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-7722: Attachment: HDFS-7722.001.patch Hi, [~cmccabe]. Thanks for reviewing. I updated the patch based on your input. Now {{checkDirs()}} shares the same logic with {{DataNode#refreshVolumes()}}, because we'd like to remove everything about the volumes, i.e., {{blockInfos}} and {{FsVolumeImpls}} in {{FsDataset}} and storage dirs in {{DataStorage}}. The existing {{checkDirs()}} logic only removes {{blockInfo}} and {{FsVolumeImpl}} in {{FsDataset}}. Thus {{checkDirs()}} returns the failed volumes back to {{DataNode}}. For the above reason, I chose to let {{checkDirs()}} return {{Set<File>}} instead of {{Set<FsVolumeImpl>}}/{{FsVolumeRef}}, since these volumes will be consumed in {{DataNode}}. I think that {{FsVolumeRef}} should only be used when there is I/O on the volume. Would you mind taking another look? bq. Please remember that this scans all files on a volume, which is an expensive operation. {{FsVolumeList#checkDirs}} only checks access permissions on all subdirectories and does not read files. I agree that it can still be problematic; I will file a follow-up JIRA to throttle it. DataNode#checkDiskError should also remove Storage when error is found. --- Key: HDFS-7722 URL: https://issues.apache.org/jira/browse/HDFS-7722 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Attachments: HDFS-7722.000.patch, HDFS-7722.001.patch When {{DataNode#checkDiskError}} finds disk errors, it removes all block metadata from {{FsDatasetImpl}}. However, it does not remove the corresponding {{DataStorage}} and {{BlockPoolSliceStorage}}. The result is that we cannot directly run {{reconfig}} to hot swap the failed disks without changing the configuration file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
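For orientation, a minimal sketch of the flow this comment describes (the DataNode-side method names are hypothetical, not the patch's actual API): checkDirs() reports the failed volume directories, and the DataNode then removes the matching storage state so a later reconfig can hot-swap the disks.
{code}
// In FsDataset (sketch): detect and detach failed volumes, report their dirs.
Set<File> failedVolumeDirs = fsDataset.checkDirs();  // drops blockInfos + FsVolumeImpls

// In DataNode (sketch): clean up the storage records the old code left behind.
for (File volumeDir : failedVolumeDirs) {
  dataStorage.removeVolume(volumeDir);            // hypothetical call
  blockPoolSliceStorage.removeVolume(volumeDir);  // hypothetical call
}
// With DataStorage/BlockPoolSliceStorage cleaned up, "reconfig" can hot-swap
// the failed disks without first editing the configuration file.
{code}
Returning plain Set<File> (rather than FsVolumeImpl/FsVolumeRef) matches the comment's point that volume references should only be held while doing I/O on the volume.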
[jira] [Updated] (HDFS-7722) DataNode#checkDiskError should also remove Storage when error is found.
[ https://issues.apache.org/jira/browse/HDFS-7722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei (Eddy) Xu updated HDFS-7722: Status: Patch Available (was: Open) DataNode#checkDiskError should also remove Storage when error is found. --- Key: HDFS-7722 URL: https://issues.apache.org/jira/browse/HDFS-7722 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Lei (Eddy) Xu Assignee: Lei (Eddy) Xu Attachments: HDFS-7722.000.patch, HDFS-7722.001.patch When {{DataNode#checkDiskError}} finds disk errors, it removes all block metadata from {{FsDatasetImpl}}. However, it does not remove the corresponding {{DataStorage}} and {{BlockPoolSliceStorage}}. The result is that we cannot directly run {{reconfig}} to hot swap the failed disks without changing the configuration file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang updated HDFS-7763: -- Resolution: Fixed Fix Version/s: 2.7.0 Status: Resolved (was: Patch Available) Committed to trunk and branch-2, thanks for the nice find and fix [~xieliang007]! fix zkfc hung issue due to not catching exception in a corner case -- Key: HDFS-7763 URL: https://issues.apache.org/jira/browse/HDFS-7763 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Liang Xie Assignee: Liang Xie Fix For: 2.7.0 Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936 In our production cluster, both zkfc processes hung after a ZK network outage. The zkfc log said:
{code}
2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect
2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed
2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300
2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300
{code}
The thread dump has also been uploaded as an attachment. From the dump, we can see that because of the unknown non-daemon threads (pool-*-thread-*) the process did not exit, but the critical threads, like the health monitor and RPC threads, had been stopped, so our watchdog (supervisord) did not observe that the zkfc process was down or abnormal, and the subsequent namenode failover could not be done as expected. There are two possible fixes here: 1) figure out where the default-named threads, like pool-7-thread-1, came from, and close them or set their daemon property; I tried to track them down but have found nothing so far. 2) catch the exception from ZKFailoverController.run() so we can continue on to execute System.exit. The attached patch implements 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7836) BlockManager Scalability Improvements
[ https://issues.apache.org/jira/browse/HDFS-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Lamb updated HDFS-7836: --- Attachment: BlockManagerScalabilityImprovementsDesign.pdf BlockManager Scalability Improvements - Key: HDFS-7836 URL: https://issues.apache.org/jira/browse/HDFS-7836 Project: Hadoop HDFS Issue Type: Improvement Reporter: Charles Lamb Assignee: Charles Lamb Attachments: BlockManagerScalabilityImprovementsDesign.pdf Improvements to BlockManager scalability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)