[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987971#comment-14987971 ]
Haohui Mai commented on HDFS-9103:
----------------------------------

Thanks for updating the patch.

{code}
+  auto name_match = [dn](const Entry& entry) {
+    if (dn == entry.first) {
+      return true;
+    }
+    return false;
+  };
+  auto entry = std::find_if(datanodes_.begin(), datanodes_.end(), name_match);
+
+  if (entry != datanodes_.end()) {
+    /* if entry already exists remove it so one with a new timestamp can be
+     * added */
+    datanodes_.erase(entry);
+  }
+
+  /* insert a new entry with bad node and current time */
+  datanodes_.insert(Entry(std::string(dn), Clock::now()));
{code}

The {{name_match}} function can be simplified, and the comments look unnecessary.

{code}
+std::set<std::string> BadDataNodeTracker::GetNodesToExclude() {
{code}

This function is on the critical path, so the result should be cached.

{code}
+void BadDataNodeTracker::Clear() {
+  std::lock_guard<std::mutex> update_lock(datanodes_update_lock_);
+  datanodes_.clear();
+}
{code}

This looks like a test-only function. Let's name it {{TEST_Clear()}} for clarity.

{code}
+  BadDataNodeTracker() : timeout_duration_(10 * 60) {}
+  unsigned int timeout_duration_; /* in seconds */
{code}

The timeout setting needs to be passed in through the {{Option}} object, and {{timeout_duration_}} can be marked as {{const}}.

{code}
+  typedef std::chrono::system_clock Clock;
+  typedef std::pair<std::string, TimePoint> Entry;
+  std::set<Entry> datanodes_;
{code}

The clock needs to be a {{steady_clock}} for monotonicity. Is there any reason to use a {{set<Entry>}} instead of a {{vector}} / {{map}}? A {{vector}} has much better cache locality in this case.

{code}
 if (!s.ok()) {
+    /* determine if DN gets marked bad */
+    if (ShouldExclude(s)) {
+      bad_datanodes_->AddBadNode(contacted_datanode);
+    }
+      // resource might free up
+      case Status::kResourceUnavailable:
+        return false;
{code}

The above code can be simplified.

{code}
+static bool ShouldExclude(const Status &s) {
+  if (s.ok()) {
+    return false;
+  }
+
+  switch (s.code()) {
+    // resource might free up
+    case Status::kResourceUnavailable:
+      return false;
+    case Status::kInvalidArgument:
+    case Status::kUnimplemented:
+    case Status::kException:
+    default:
+      return true;
+  }
+  return true;
+}
{code}

This function can be placed in the {{InputStream}} class.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch, HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.3.patch, HDFS-9103.HDFS-8707.4.patch, HDFS-9103.HDFS-8707.5.patch
>
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and try again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
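For reference, a minimal sketch (not the actual patch) of the shape the comments above point toward: a {{map}} keyed by node name replaces the {{find_if}} / {{erase}} / {{insert}} sequence, the clock is a {{steady_clock}}, the timeout is supplied by the caller and kept {{const}}, and the test hook is named {{TEST_Clear()}}. The {{IsBadNode()}} query and the member names here are illustrative assumptions, not taken from the attached patches.

{code}
#include <chrono>
#include <map>
#include <mutex>
#include <string>

// Sketch only: a bad-datanode tracker along the lines suggested above.
class BadDataNodeTracker {
 public:
  typedef std::chrono::steady_clock Clock;  // monotonic, unlike system_clock

  // The timeout comes from the caller (e.g. read out of the Option object)
  // instead of a hard-coded 10 * 60 seconds, and stays const afterwards.
  explicit BadDataNodeTracker(std::chrono::seconds timeout)
      : timeout_duration_(timeout) {}

  // A map keyed by node name makes "refresh the timestamp" a plain
  // assignment; no find_if/erase/insert sequence is needed.
  void AddBadNode(const std::string &dn) {
    std::lock_guard<std::mutex> lock(lock_);
    datanodes_[dn] = Clock::now();
  }

  // Hypothetical query for the read path; expired entries age out here.
  bool IsBadNode(const std::string &dn) {
    std::lock_guard<std::mutex> lock(lock_);
    auto it = datanodes_.find(dn);
    if (it == datanodes_.end()) {
      return false;
    }
    if (Clock::now() - it->second >= timeout_duration_) {
      datanodes_.erase(it);
      return false;
    }
    return true;
  }

  // Test-only helper, prefixed so that is obvious at the call site.
  void TEST_Clear() {
    std::lock_guard<std::mutex> lock(lock_);
    datanodes_.clear();
  }

 private:
  const std::chrono::seconds timeout_duration_;
  std::mutex lock_;
  std::map<std::string, Clock::time_point> datanodes_;
};
{code}

Whether a {{map}} or a sorted {{vector}} wins in practice depends on how many datanodes end up excluded at once; with only a handful of entries a {{vector}}'s cache locality is hard to beat.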