[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haohui Mai updated HDFS-9103:
-----------------------------
      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

I've committed the patch to trunk and branch-2. Thanks [~James Clampffer] for the contribution.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707, HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.007.patch,
> HDFS-9103.HDFS-8707.008.patch, HDFS-9103.HDFS-8707.009.patch,
> HDFS-9103.HDFS-8707.010.patch, HDFS-9103.HDFS-8707.3.patch,
> HDFS-9103.HDFS-8707.4.patch, HDFS-9103.HDFS-8707.5.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-9103:
----------------------------------
    Attachment: HDFS-9103.HDFS-8707.010.patch

Thanks for the clarification, [~wheat9]. New patch posted:
- Got bad_datanode_tracker.h out of the public headers and moved it into lib/fs, because that's where it's most tightly coupled.
- Declared NodeExclusionRule in hdfs.h.
- Got rid of the comment about the static cast.
- Renamed the 'optional_exclude_rule' param to 'excluded_nodes'.
- Inlined SelectBlockAndNode.
- Changed DN selection to use a find_if rather than an explicit loop.
- Kept the existing bad_datanode_test tests.
- Put the unit tests for BadDataNodeTracker and ExclusionSet that don't use mock objects/methods into a separate test and cmake target.

Other things:
- NodeExclusionRule and the classes that derive from it got virtual destructors to avoid leaks.
- Added tests for the ExcludedSet object. It's incredibly simple, but more tests won't hurt.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.007.patch,
> HDFS-9103.HDFS-8707.008.patch, HDFS-9103.HDFS-8707.009.patch,
> HDFS-9103.HDFS-8707.010.patch, HDFS-9103.HDFS-8707.3.patch,
> HDFS-9103.HDFS-8707.4.patch, HDFS-9103.HDFS-8707.5.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
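
For readers following the thread, the find_if-based DataNode selection mentioned in the 010 patch notes could look roughly like the sketch below. It is not taken from the patch: DataNodeInfo, SelectDataNode and the callback signature are hypothetical names chosen for illustration, and the exclusion check is modeled as a plain std::function rather than the real NodeExclusionRule.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-replica location info the client holds.
struct DataNodeInfo {
  std::string uuid;
  std::string hostname;
};

// Returns a pointer to the first replica that is not excluded, or nullptr
// when every replica is excluded. The predicate is an assumption standing in
// for whatever exclusion rule the caller supplies.
inline const DataNodeInfo *SelectDataNode(
    const std::vector<DataNodeInfo> &replicas,
    const std::function<bool(const std::string &)> &is_excluded) {
  auto it = std::find_if(
      replicas.begin(), replicas.end(),
      [&](const DataNodeInfo &dn) { return !is_excluded(dn.uuid); });
  return it == replicas.end() ? nullptr : &*it;
}

// Usage sketch: the first replica is excluded, so the second one is chosen.
inline void SelectDataNodeDemo() {
  std::vector<DataNodeInfo> replicas = {{"dn-1", "host1"}, {"dn-2", "host2"}};
  const DataNodeInfo *picked = SelectDataNode(
      replicas, [](const std::string &uuid) { return uuid == "dn-1"; });
  assert(picked != nullptr && picked->uuid == "dn-2");
}
```

Compared with an explicit loop, find_if keeps the "first acceptable replica" intent in one expression and makes the not-found case an ordinary end() comparison.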
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-9103:
----------------------------------
    Attachment: HDFS-9103.HDFS-8707.009.patch

New patch:
- Addressed [~bobthansen]'s concerns about test coverage and API.
- Got rid of the little GC pass for expired nodes; we can wait and see if that ever becomes a real problem.
- Got rid of the optional_node_rule default parameter in AsyncPreadSome; just pass in nullptr if you don't want to use it.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.007.patch,
> HDFS-9103.HDFS-8707.008.patch, HDFS-9103.HDFS-8707.009.patch,
> HDFS-9103.HDFS-8707.3.patch, HDFS-9103.HDFS-8707.4.patch,
> HDFS-9103.HDFS-8707.5.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-9103:
----------------------------------
    Attachment: HDFS-9103.HDFS-8707.008.patch

I still need to get rid of some test duplication and write a couple of good tests for AsyncPreadSome with an override, but I wanted to post this in case anyone was curious.
- Got rid of explicitly passing around the BadDataNodeTracker. FileSystem and InputStream now keep shared_ptrs to the BadDataNodeTracker. The tracker is used by default for methods like PositionRead.
- I've added an abstraction, NodeExclusionRule, with a uuid->bool virtual method for testing bad nodes, so that the tracker can be overridden in AsyncPreadSome if the user wants to. Added a wrapper for std::set that inherits from this to provide an easy way to pass in a set of nodes to exclude.
- Added unit tests for BadDataNodeTracker. Added a method that can be used in tests to move time forward, to make sure that nodes get kicked out after enough time has elapsed.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.007.patch,
> HDFS-9103.HDFS-8707.008.patch, HDFS-9103.HDFS-8707.3.patch,
> HDFS-9103.HDFS-8707.4.patch, HDFS-9103.HDFS-8707.5.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
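
As an illustration of the abstraction described in the 008 notes, a uuid->bool rule with a std::set-backed implementation might be sketched as follows. The class names NodeExclusionRule and ExclusionSet come from the thread, but the method name IsBadNode on the base class and all signatures here are assumptions, not the patch's actual declarations.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Assumed shape of the rule: given a DataNode UUID, say whether to skip it.
class NodeExclusionRule {
 public:
  // Virtual destructor so deleting through a base pointer is well defined.
  virtual ~NodeExclusionRule() = default;
  virtual bool IsBadNode(const std::string &node_uuid) = 0;
};

// Thin wrapper over std::set so a caller can pass a fixed set of nodes to
// exclude wherever a NodeExclusionRule is accepted.
class ExclusionSet : public NodeExclusionRule {
 public:
  explicit ExclusionSet(std::set<std::string> excluded)
      : excluded_(std::move(excluded)) {}

  bool IsBadNode(const std::string &node_uuid) override {
    return excluded_.count(node_uuid) > 0;
  }

 private:
  std::set<std::string> excluded_;
};

// Usage sketch: exclude two nodes and query through the base interface.
inline void ExclusionSetDemo() {
  std::set<std::string> nodes = {"dn-1", "dn-3"};
  ExclusionSet rule(nodes);
  NodeExclusionRule &as_rule = rule;
  assert(as_rule.IsBadNode("dn-1"));
  assert(!as_rule.IsBadNode("dn-2"));
}
```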
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-9103:
----------------------------------
    Attachment: HDFS-9103.HDFS-8707.007.patch

New patch. There's a bit of extra noise due to clang-format hitting a few files that hadn't had it before.

Addressing Haohui's batch of concerns in order:
- That name_match function isn't needed after switching bad_datanodes_ to a map.
- Got rid of BadDataNodeTracker::GetNodesToExclude and added an IsBadNode method instead. The InputStream takes a shared_ptr to the BadDataNodeTracker and calls IsBadNode directly. This should remove any need for caching, as it gets rid of a lot of copies and other work making sets of strings.
- Got rid of BadDataNodeTracker::Clear entirely and changed the tests so that BadDataNodeTracker is scoped by test function. This avoids issues with possibly carrying state between tests.
- Added a datanode exclusion duration to the Option class with a default of 10 minutes. Switched time units to milliseconds to be consistent. Is there a standard name for this? I didn't see anything in the options used for hdfs-site.xml.
- Switched from system_clock to steady_clock to make sure time is always monotonically increasing.
- I think the way I rearranged the code that this comment referred to simplified it. If it didn't, please let me know what exactly needs to be simplified.
- Made ShouldExclude a static method of InputStream and got rid of the duplicate used by the gmock test.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.007.patch,
> HDFS-9103.HDFS-8707.3.patch, HDFS-9103.HDFS-8707.4.patch,
> HDFS-9103.HDFS-8707.5.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
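
The 007 notes describe a tracker keyed by node in a map, queried through IsBadNode, timed with steady_clock, and configured with a millisecond exclusion duration defaulting to 10 minutes. A minimal sketch under those stated properties is below. The AddBadNode name, the explicit `now` parameters (added here so time can be driven by a test), and the internal locking layout are assumptions for illustration, not the patch's actual code.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <mutex>
#include <string>

class BadDataNodeTracker {
 public:
  // steady_clock never goes backwards, unlike system_clock.
  using Clock = std::chrono::steady_clock;

  explicit BadDataNodeTracker(
      std::chrono::milliseconds exclusion_duration = std::chrono::minutes(10))
      : exclusion_duration_(exclusion_duration) {}

  // Records (or refreshes) the time at which a node was seen failing.
  void AddBadNode(const std::string &uuid,
                  Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> lock(mutex_);
    bad_datanodes_[uuid] = now;
  }

  // A node counts as bad only while its exclusion window is still open.
  bool IsBadNode(const std::string &uuid,
                 Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = bad_datanodes_.find(uuid);
    if (it == bad_datanodes_.end()) return false;
    return now - it->second < exclusion_duration_;
  }

 private:
  std::mutex mutex_;
  std::chrono::milliseconds exclusion_duration_;
  std::map<std::string, Clock::time_point> bad_datanodes_;
};

// Usage sketch: a node is excluded inside the window and allowed after it.
inline void BadDataNodeTrackerDemo() {
  BadDataNodeTracker tracker(std::chrono::milliseconds(100));
  auto t0 = BadDataNodeTracker::Clock::now();
  tracker.AddBadNode("dn-1", t0);
  assert(tracker.IsBadNode("dn-1", t0 + std::chrono::milliseconds(50)));
  assert(!tracker.IsBadNode("dn-1", t0 + std::chrono::milliseconds(100)));
}
```

Expired entries are simply ignored on lookup rather than erased, which keeps the sketch free of any cleanup pass.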
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Clampffer updated HDFS-9103:
----------------------------------
    Attachment: HDFS-9103.HDFS-8707.006.patch

Adding a patch:

This is effectively the logic I had in one of the earlier revisions of HDFS-8766; however, it now keeps time on a per-node basis. I added a class, BadDataNodeTracker, that encapsulates all locking. The HadoopFileSystem is the first to create it, with make_shared, and keeps a shared_ptr to it; each FileHandle object is then given a shared_ptr to the tracker as well.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.3.patch,
> HDFS-9103.HDFS-8707.4.patch, HDFS-9103.HDFS-8707.5.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
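
The ownership arrangement described in the 006 notes (file system creates the tracker with make_shared, every file handle gets its own shared_ptr) can be sketched as below. The class bodies are placeholders: the empty BadDataNodeTracker, the Open method and the tracker() accessor are hypothetical and exist only to show the shared lifetime.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Placeholder for the real tracker; only its lifetime matters here.
struct BadDataNodeTracker {};

class FileHandle {
 public:
  explicit FileHandle(std::shared_ptr<BadDataNodeTracker> tracker)
      : tracker_(std::move(tracker)) {}
  const std::shared_ptr<BadDataNodeTracker> &tracker() const {
    return tracker_;
  }

 private:
  std::shared_ptr<BadDataNodeTracker> tracker_;
};

class HadoopFileSystem {
 public:
  // The file system is the first owner of the tracker.
  HadoopFileSystem() : tracker_(std::make_shared<BadDataNodeTracker>()) {}

  // Hypothetical open call: each handle shares the same tracker instance.
  FileHandle Open() const { return FileHandle(tracker_); }

 private:
  std::shared_ptr<BadDataNodeTracker> tracker_;
};

// Usage sketch: two handles from one file system see the same tracker.
inline void SharedTrackerDemo() {
  HadoopFileSystem fs;
  FileHandle a = fs.Open();
  FileHandle b = fs.Open();
  assert(a.tracker() == b.tracker());
}
```

Sharing ownership this way means a handle that outlives its file system still holds a valid tracker, at the cost of the tracker needing its own internal locking.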
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Attachment: HDFS-9103.HDFS-8707.5.patch

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.3.patch, HDFS-9103.HDFS-8707.4.patch,
> HDFS-9103.HDFS-8707.5.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Attachment: HDFS-9103.HDFS-8707.4.patch

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.3.patch, HDFS-9103.HDFS-8707.4.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Attachment: HDFS-9103.HDFS-8707.3.patch

Removed the public state mutations that are no longer necessary for testing.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.3.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Attachment: HDFS-9103.2.patch

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Fix Version/s: HDFS-8707

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Attachment:     (was: HDFS-9103.1.patch)

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>         Attachments: HDFS-9103.1.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Attachment: HDFS-9103.1.patch

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>         Attachments: HDFS-9103.1.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Target Version/s: HDFS-8707

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>         Attachments: HDFS-9103.1.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Status: Patch Available  (was: Open)

Changes of note:
- I changed the semantics of InputStream::PositionRead to be success-or-failure. It will now retry if there is a Status::exception in the pipeline. [Q: will all I/O errors be reflected as Status::exceptions?]
- I exposed Status::Code so we can use the right internal semantics.
- I moved the excluded_datanodes to be a member of the InputStream. I think we would want failed nodes to be remembered across individual reads and not retried. Because it's mutable, I didn't want to be passing copies or mutable references on the stack.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>         Attachments: HDFS-9103.1.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
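
The retry behaviour described in this first patch (retry on an exception status, remember failed nodes on the stream across reads) can be sketched roughly as follows. This is not the patch's code: InputStreamSketch, ReadResult and the ReadAttempt callback are hypothetical stand-ins, with the callback playing the role of a single AsyncPreadSome-style attempt against one DataNode.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical two-valued outcome standing in for the real Status codes.
enum class ReadResult { kOk, kException };

// One read attempt against the DataNode with the given UUID.
using ReadAttempt = std::function<ReadResult(const std::string &dn_uuid)>;

class InputStreamSketch {
 public:
  explicit InputStreamSketch(std::vector<std::string> replicas)
      : replicas_(std::move(replicas)) {}

  // Tries each non-excluded replica in order. A failing node is added to the
  // excluded set, which is a member so it persists across calls.
  ReadResult PositionRead(const ReadAttempt &attempt) {
    for (const std::string &dn : replicas_) {
      if (excluded_datanodes_.count(dn) > 0) continue;
      if (attempt(dn) == ReadResult::kOk) return ReadResult::kOk;
      excluded_datanodes_.insert(dn);
    }
    return ReadResult::kException;  // every replica failed or was excluded
  }

  const std::set<std::string> &excluded_datanodes() const {
    return excluded_datanodes_;
  }

 private:
  std::vector<std::string> replicas_;
  std::set<std::string> excluded_datanodes_;
};

// Usage sketch: dn-1 fails once, the read succeeds on dn-2, and dn-1 stays
// excluded for later reads on the same stream.
inline void PositionReadDemo() {
  std::vector<std::string> replicas = {"dn-1", "dn-2"};
  InputStreamSketch stream(replicas);
  ReadAttempt attempt = [](const std::string &dn) {
    return dn == "dn-1" ? ReadResult::kException : ReadResult::kOk;
  };
  assert(stream.PositionRead(attempt) == ReadResult::kOk);
  assert(stream.excluded_datanodes().count("dn-1") == 1);
}
```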
[jira] [Updated] (HDFS-9103) Retry reads on DN failure
[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob Hansen updated HDFS-9103:
-----------------------------
    Attachment: HDFS-9103.1.patch

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
>         Attachments: HDFS-9103.1.patch
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.