[jira] [Resolved] (HDFS-12649) handling of corrupt blocks not suitable for commodity hardware
[ https://issues.apache.org/jira/browse/HDFS-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gruust resolved HDFS-12649.
---
    Resolution: Invalid

> handling of corrupt blocks not suitable for commodity hardware
> --
>
> Key: HDFS-12649
> URL: https://issues.apache.org/jira/browse/HDFS-12649
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.8.1
> Reporter: Gruust
> Priority: Minor
>
> Hadoop's documentation tells me it's suitable for commodity hardware in the sense that hardware failures are expected to happen frequently. However, there is currently no automatic handling of corrupted blocks, which seems a bit contradictory to me.
> See: https://stackoverflow.com/questions/19205057/how-to-fix-corrupt-hdfs-files
> This is even problematic for data integrity, as without manual intervention the redundancy is not restored to the desired level in a timely manner. If there is a corrupted block, I would at least expect the namenode to force the creation of an additional good replica to keep up the redundancy level, i.e. the redundancy level should never count corrupted data... which it currently does:
> "UnderReplicatedBlocks" : 0,
> "CorruptBlocks" : 2,
> (namenode /jmx http dump)

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-12653) Implement toArray() and toSubArray() for ReadOnlyList
Manoj Govindassamy created HDFS-12653:
-

             Summary: Implement toArray() and toSubArray() for ReadOnlyList
                 Key: HDFS-12653
                 URL: https://issues.apache.org/jira/browse/HDFS-12653
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Manoj Govindassamy
            Assignee: Manoj Govindassamy

{{ReadOnlyList}} today gives an unmodifiable view of the backing List. It also supports the following Util methods for easy construction of read-only views of any given list:

{noformat}
public static ReadOnlyList asReadOnlyList(final List list)
public static List asList(final ReadOnlyList list)
{noformat}

{{asList}} above additionally overrides {{Object[] toArray()}} of the {{java.util.List}} interface. Unlike {{java.util.List}}, this one returns an array of Objects referring to the backing list and avoids any copying of objects. Given that we have many usages of read-only lists:
1. Let's have a light-weight / shared-view {{toArray()}} implementation for {{ReadOnlyList}} as well.
2. Additionally, similar to {{java.util.List#subList(fromIndex, toIndex)}}, let's have {{ReadOnlyList#subArray(fromIndex, toIndex)}}.
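To illustrate the proposal, here is a minimal, hypothetical sketch of a shared-view {{toArray()}} and {{subArray()}} on a read-only wrapper. The class and method bodies are illustrative only, not the actual Hadoop implementation:

```java
import java.util.List;

// Hypothetical sketch of a shared-view toArray()/subArray() for a
// ReadOnlyList-style wrapper. Only the array itself is allocated;
// the elements are shared references into the backing list (no deep copy).
class ReadOnlyListSketch<E> {
  private final List<E> backing;

  ReadOnlyListSketch(List<E> backing) {
    this.backing = backing;
  }

  // Shallow copy: element references are shared with the backing list.
  Object[] toArray() {
    Object[] arr = new Object[backing.size()];
    for (int i = 0; i < arr.length; i++) {
      arr[i] = backing.get(i);
    }
    return arr;
  }

  // Analogous to java.util.List#subList(fromIndex, toIndex),
  // but materialized as an array of shared references.
  Object[] subArray(int fromIndex, int toIndex) {
    Object[] arr = new Object[toIndex - fromIndex];
    for (int i = fromIndex; i < toIndex; i++) {
      arr[i - fromIndex] = backing.get(i);
    }
    return arr;
  }
}
```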
[jira] [Created] (HDFS-12652) INodeAttributesProvider#getAttributes(): Avoid multiple conversions of path components byte[][] to String[] when requesting INode attributes
Manoj Govindassamy created HDFS-12652:
-

             Summary: INodeAttributesProvider#getAttributes(): Avoid multiple conversions of path components byte[][] to String[] when requesting INode attributes
                 Key: HDFS-12652
                 URL: https://issues.apache.org/jira/browse/HDFS-12652
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs
    Affects Versions: 3.0.0-beta1
            Reporter: Manoj Govindassamy
            Assignee: Manoj Govindassamy

{{INodeAttributesProvider#getAttributes}} needs the path components passed in as an array of Strings, whereas the INode and related layers maintain path components as an array of byte[]. So these layers have to convert each byte[] component of the path back into a String, multiple times, when requesting INode attributes from the provider. That is, the path "/a/b/c" requires calling the attribute provider with: (1) "", (2) "", "a", (3) "", "a", "b", (4) "", "a", "b", "c". Every single one of those strings is freshly (re)converted from a byte[]. If, say, a file listing is done on a huge directory containing hundreds of millions of files, these redundant byte[][] to String[] conversions create lots of short-lived garbage objects, occupying memory and hurting performance. It would be better if we could avoid creating redundant copies of path component strings.
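The redundancy described above could be avoided by decoding each component once and reusing the prefix for every ancestor lookup. A minimal sketch of that idea (class and method names are illustrative, not the actual patch):

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: decode each byte[] path component to a String
// exactly once. The provider call for depth d can then reuse
// names[0..d] with no further byte[] decoding, instead of re-decoding
// "/a/b/c" once per ancestor as described above.
class PathComponentDecoder {
  static String[] toStrings(byte[][] components) {
    String[] names = new String[components.length];
    for (int i = 0; i < components.length; i++) {
      names[i] = new String(components[i], StandardCharsets.UTF_8);
    }
    return names;
  }
}
```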
[jira] [Created] (HDFS-12651) Ozone: SCM: avoid synchronously loading all the keys from containers upon SCM datanode start
Xiaoyu Yao created HDFS-12651:
-

             Summary: Ozone: SCM: avoid synchronously loading all the keys from containers upon SCM datanode start
                 Key: HDFS-12651
                 URL: https://issues.apache.org/jira/browse/HDFS-12651
             Project: Hadoop HDFS
          Issue Type: Sub-task
    Affects Versions: HDFS-7240
            Reporter: Xiaoyu Yao
            Assignee: Xiaoyu Yao

This is based on code review feedback from HDFS-12411, to avoid a slow SCM datanode restart when there are a large number of keys and containers. E.g., 5 GB per container / 4 KB per key = 1.25 million keys per container. The proposed solution is to load the container/key size info asynchronously and update the containerStatus once done.
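A minimal sketch of the async-loading idea, assuming a hypothetical container status holder (all names here are invented for illustration):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch: the datanode starts immediately, the expensive
// per-container key scan runs in the background, and the count is
// published into the status object once the scan completes.
class ContainerStatusSketch {
  private final AtomicLong keyCount = new AtomicLong(-1); // -1 = still loading

  CompletableFuture<Void> loadKeyCountAsync(LongSupplier expensiveScan) {
    return CompletableFuture
        .supplyAsync(expensiveScan::getAsLong) // off the startup path
        .thenAccept(keyCount::set);            // update status once done
  }

  long getKeyCount() {
    return keyCount.get();
  }
}
```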
[jira] [Created] (HDFS-12650) Use slf4j instead of log4j in LeaseManager
Ajay Kumar created HDFS-12650:
-

             Summary: Use slf4j instead of log4j in LeaseManager
                 Key: HDFS-12650
                 URL: https://issues.apache.org/jira/browse/HDFS-12650
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Ajay Kumar
            Assignee: Ajay Kumar
             Fix For: 3.1.0

FSNamesystem is still using log4j dependencies. We should move those to slf4j, as most of the methods using log4j are deprecated.
[jira] [Created] (HDFS-12649) handling of corrupt blocks not suitable for commodity hardware
Gruust created HDFS-12649:
-

             Summary: handling of corrupt blocks not suitable for commodity hardware
                 Key: HDFS-12649
                 URL: https://issues.apache.org/jira/browse/HDFS-12649
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: namenode
    Affects Versions: 2.8.1
            Reporter: Gruust
            Priority: Minor

Hadoop's documentation tells me it's suitable for commodity hardware in the sense that hardware failures are expected to happen frequently. However, there is currently no automatic handling of corrupted blocks, which seems a bit contradictory to me.

See: https://stackoverflow.com/questions/19205057/how-to-fix-corrupt-hdfs-files

This is even problematic for data integrity, as without manual intervention the redundancy is not restored to the desired level in a timely manner. If there is a corrupted block, I would at least expect the namenode to force the creation of an additional good replica to keep up the redundancy level.
Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86
For more details, see https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/555/

[Oct 11, 2017 7:45:28 AM] (kai.zheng) HDFS-12635. Unnecessary exception declaration of the CellBuffers
[Oct 11, 2017 8:57:38 AM] (rohithsharmaks) MAPREDUCE-6951. Improve exception message when
[Oct 11, 2017 9:09:53 AM] (aajisaka) HDFS-12622. Fix enumerate in HDFSErasureCoding.md. Contributed by Yiqun
[Oct 11, 2017 3:31:02 PM] (jlowe) YARN-7082. TestContainerManagerSecurity failing in trunk. Contributed by
[Oct 11, 2017 5:06:43 PM] (stevel) HADOOP-14913. Sticky bit implementation for rename() operation in Azure
[Oct 11, 2017 6:14:33 PM] (sunilg) YARN-6620. Add support in NodeManager to isolate GPU devices by using
[Oct 11, 2017 7:26:14 PM] (arp) HDFS-12627. Fix typo in DFSAdmin command output. Contributed by Ajay
[Oct 11, 2017 7:29:35 PM] (arp) HDFS-12542. Update javadoc and documentation for listStatus. Contributed
[Oct 11, 2017 10:21:21 PM] (Arun Suresh) HADOOP-13556. Change Configuration.getPropsWithPrefix to use getProps
[Oct 11, 2017 10:25:28 PM] (wangda) YARN-7205. Log improvements for the ResourceUtils. (Sunil G via wangda)
[Oct 11, 2017 10:58:20 PM] (aengineer) HADOOP-13102. Update GroupsMapping documentation to reflect the new

-1 overall

The following subsystems voted -1:
    unit

The following subsystems voted -1 but were configured to be filtered/ignored:
    cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace

The following subsystems are considered long running (runtime bigger than 1h 0m 0s):
    unit

Specific tests:

    Failed junit tests:
        hadoop.crypto.key.TestCachingKeyProvider
        hadoop.hdfs.server.namenode.TestNameNodeMetadataConsistency
        hadoop.yarn.server.nodemanager.scheduler.TestDistributedScheduler
        hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter
        hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSAppAttempt

    Timed out junit tests:
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.TestZKConfigurationStore
        org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl
        org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
        org.apache.hadoop.yarn.server.resourcemanager.TestRMHA
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
        org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
        org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
        org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
        org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
        org.apache.hadoop.yarn.server.resourcemanager.TestLeaderElectorService
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesReservation
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesSchedulerActivities
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
        org.apache.hadoop.yarn.server.resourcemanager.TestRMProxyUsersConf
        org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage
        org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
        org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerAsyncScheduling
        org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestIncreaseAllocationExpirer
        org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
        org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf
        org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
        org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens
        org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer
        org.apache.hadoop.yarn.client.api.impl.TestAMRMClient

cc: https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/555/artifact/out/diff-compile-cc-root.txt [4.0K]
javac: https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/555/artifact/out/diff-compile-javac-root.txt [288K]
checkstyle: https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/555/artifact/out/diff-checksty
[jira] [Created] (HDFS-12648) DN should provide feedback to NN for throttling commands
Daryn Sharp created HDFS-12648:
--

             Summary: DN should provide feedback to NN for throttling commands
                 Key: HDFS-12648
                 URL: https://issues.apache.org/jira/browse/HDFS-12648
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: datanode
    Affects Versions: 2.8.0
            Reporter: Daryn Sharp

The NN should avoid sending commands to a DN with a high number of outstanding commands. The heartbeat could provide this feedback, perhaps via a simple count of the queued commands or the rate of processing.
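As a hypothetical sketch of that feedback loop (the threshold and all names are invented for illustration, not part of the proposal):

```java
// Hypothetical sketch: the DN piggybacks its outstanding-command count
// on the heartbeat, and the NN holds back further commands once a
// backlog threshold is exceeded. The threshold value is illustrative.
class CommandThrottleSketch {
  static final int MAX_OUTSTANDING = 100;

  // Evaluated by the NN when building the heartbeat response.
  static boolean shouldSendCommands(int outstandingOnDatanode) {
    return outstandingOnDatanode < MAX_OUTSTANDING;
  }
}
```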
[jira] [Created] (HDFS-12647) DN commands processing should be async
Daryn Sharp created HDFS-12647:
--

             Summary: DN commands processing should be async
                 Key: HDFS-12647
                 URL: https://issues.apache.org/jira/browse/HDFS-12647
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: datanode
    Affects Versions: 2.8.0
            Reporter: Daryn Sharp

Due to dataset lock contention, service actors may encounter significant latency while processing DN commands. Even the queuing of async deletions requires multiple lock acquisitions. A slow disk will cause a backlog of xceivers instantiating block senders/receivers, which starves the actor and leads to the NN falsely declaring the node dead. Async processing of all commands will free the actor to perform its primary purpose of heartbeating and block reporting. Note that FBRs will depend on queued block invalidations not being included in the report.
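The proposed decoupling could be sketched as a producer/consumer queue; this is a minimal illustration with invented names, not the actual patch:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the proposal: the heartbeat actor only
// enqueues commands (cheap, no dataset lock), while a dedicated worker
// thread drains the queue and does the slow, lock-heavy processing, so
// heartbeating and block reporting are never starved.
class AsyncCommandProcessorSketch {
  private final BlockingQueue<Runnable> commands = new LinkedBlockingQueue<>();

  // Called from the service actor thread: O(1), non-blocking.
  void enqueue(Runnable command) {
    commands.add(command);
  }

  // Called in a loop on a dedicated worker thread.
  void processOne() throws InterruptedException {
    commands.take().run();
  }
}
```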
[jira] [Created] (HDFS-12646) Avoid IO while holding the FsDataset lock
Daryn Sharp created HDFS-12646:
--

             Summary: Avoid IO while holding the FsDataset lock
                 Key: HDFS-12646
                 URL: https://issues.apache.org/jira/browse/HDFS-12646
             Project: Hadoop HDFS
          Issue Type: Sub-task
          Components: datanode
    Affects Versions: 2.8.0
            Reporter: Daryn Sharp

IO operations should not be allowed while holding the dataset lock. Notable offenders include, but are not limited to, instantiating a block sender/receiver, constructing the path to a block, and unfinalizing a block.
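The general pattern for this kind of fix is to gather what is needed under the lock and do the disk IO only after releasing it. A minimal sketch (path, class, and method names are invented for illustration):

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative pattern: do only the in-memory lookup under the dataset
// lock, then perform the slow disk IO after releasing it.
class LockThenIoSketch {
  private final ReentrantLock datasetLock = new ReentrantLock();

  String resolveBlockPath() {
    String path;
    datasetLock.lock();
    try {
      path = "/data/current/blk_1001"; // in-memory metadata lookup only
    } finally {
      datasetLock.unlock();
    }
    // Any open/read/seek against 'path' happens here, outside the lock,
    // so a slow disk no longer stalls every other lock holder.
    return path;
  }
}
```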
[jira] [Created] (HDFS-12645) FSDatasetImpl lock will stall BP service actors and may cause missing blocks
Daryn Sharp created HDFS-12645:
--

             Summary: FSDatasetImpl lock will stall BP service actors and may cause missing blocks
                 Key: HDFS-12645
                 URL: https://issues.apache.org/jira/browse/HDFS-12645
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.8.0
            Reporter: Daryn Sharp

The DN is extremely susceptible to a slow volume due to bad locking practices. DN operations require the fs dataset lock. IO while holding the dataset lock should not be permitted, as it leads to severe performance degradation and possibly (temporarily) missing blocks. A slow disk will cause pipelines to experience significant latency and timeouts, increasing lock/IO contention while cleaning up, leading to more timeouts, etc. Meanwhile, the actor service thread is interleaving multiple lock acquires/releases with xceivers. If many commands are issued, the node may be incorrectly declared dead. HDFS-12639 documents that both actors synchronize on the offer service lock while processing commands. A backlogged active actor will block the standby actor and cause it to go dead too.
[jira] [Created] (HDFS-12644) Offer a non-privileged listEncryptionZone operation
Wei-Chiu Chuang created HDFS-12644:
--

             Summary: Offer a non-privileged listEncryptionZone operation
                 Key: HDFS-12644
                 URL: https://issues.apache.org/jira/browse/HDFS-12644
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: encryption, namenode
    Affects Versions: 3.0.0-alpha1, 2.8.0
            Reporter: Wei-Chiu Chuang
            Assignee: Wei-Chiu Chuang

As discussed in HDFS-12484, we can consider adding a non-privileged listEncryptionZone for better user experience.
[jira] [Resolved] (HDFS-11797) BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException when corrupt replicas are inconsistent
[ https://issues.apache.org/jira/browse/HDFS-11797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang resolved HDFS-11797.

    Resolution: Duplicate

I'm going to close it as a dup of HDFS-11445. Feel free to reopen if this is not the case. Thanks [~kshukla]!

> BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException
> when corrupt replicas are inconsistent
> --
>
> Key: HDFS-11797
> URL: https://issues.apache.org/jira/browse/HDFS-11797
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Critical
> Attachments: HDFS-11797.001.patch
>
> The calculated {{numMachines}} can be too small (causing ArrayIndexOutOfBoundsException) or too large (causing an NPE, HDFS-9958) if data structures find an inconsistent number of corrupt replicas. This was earlier found to be related to failed storages. This JIRA tracks a change that works for all possible cases of inconsistency.
[jira] [Resolved] (HDFS-12630) Rolling restart can create inconsistency between blockMap and corrupt replicas map
[ https://issues.apache.org/jira/browse/HDFS-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang resolved HDFS-12630.

    Resolution: Duplicate

> Rolling restart can create inconsistency between blockMap and corrupt
> replicas map
> --
>
> Key: HDFS-12630
> URL: https://issues.apache.org/jira/browse/HDFS-12630
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Andre Araujo
>
> After a NN rolling restart several HDFS files started showing block problems. Running FSCK for one of the files or for the directory that contained it would complete with a FAILED message but without any details of the failure. The NameNode log showed the following:
> {code}
> 2017-10-10 16:58:32,147 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: FSCK started by hdfs (auth:KERBEROS_SSL) from /10.92.128.4 for path /user/prod/data/file_20171010092201.csv at Tue Oct 10 16:58:32 PDT 2017
> 2017-10-10 16:58:32,147 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1941920008_1133195379 blockMap has 1 but corrupt replicas map has 2
> 2017-10-10 16:58:32,147 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: Fsck on path '/user/prod/data/file_20171010092201.csv' FAILED
> java.lang.ArrayIndexOutOfBoundsException
> {code}
> After triggering a full block report for all the DNs the problem went away.
[jira] [Created] (HDFS-12643) HDFS maintenance state behaviour is confusing and not well documented
Andre Araujo created HDFS-12643:
---

             Summary: HDFS maintenance state behaviour is confusing and not well documented
                 Key: HDFS-12643
                 URL: https://issues.apache.org/jira/browse/HDFS-12643
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: documentation, namenode
            Reporter: Andre Araujo

The current implementation of the HDFS maintenance state feature is confusing and error-prone. The documentation is missing important information that's required for the correct use of the feature.

For example, if the Hadoop admin wants to put a single node in maintenance state, he/she can add a single entry to the maintenance file with the contents:

{code}
{
  "hostName": "host-1.example.com",
  "adminState": "IN_MAINTENANCE",
  "maintenanceExpireTimeInMS": 1507663698000
}
{code}

Let's say now that the actual maintenance finished well before the set expiration time and the Hadoop admin wants to bring the node back to the NORMAL state. It would be natural to simply change the state of the node, as shown below, and run another refresh:

{code}
{
  "hostName": "host-1.example.com",
  "adminState": "NORMAL"
}
{code}

The configuration file above, though, not only takes the node {{host-1}} out of maintenance state but also *blacklists all the other DataNodes*. This behaviour seems inconsistent to me and is due to {{emptyInServiceNodeLists}} being set to {{false}} [here|https://github.com/apache/hadoop/blob/230b85d5865b7e08fb7aaeab45295b5b966011ef/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/CombinedHostFileManager.java#L80] only when there is at least one node with {{adminState = NORMAL}} listed in the file.

I believe that it would be more consistent, and less error-prone, to simply implement the following:
* If the dfs.hosts file is empty, all nodes are allowed and in normal state
* If the file is not empty, any host *not* listed in the file is *blacklisted*, regardless of the state of the hosts listed in the file.

Regardless of whether the implementation is changed or not, the documentation also needs to be updated to ensure readers know of the caveats mentioned above.