[ https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987971#comment-14987971 ]
Haohui Mai commented on HDFS-9103:
----------------------------------

Thanks for updating the patch.

{code}
+  auto name_match = [dn](const Entry& entry) {
+    if (dn == entry.first) {
+      return true;
+    }
+    return false;
+  };
+  auto entry = std::find_if(datanodes_.begin(), datanodes_.end(), name_match);
+
+  if (entry != datanodes_.end()) {
+    /* if entry already exists remove it so one with a new timestamp can be
+     * added */
+    datanodes_.erase(entry);
+  }
+
+  /* insert a new entry with bad node and current time */
+  datanodes_.insert(Entry(std::string(dn), Clock::now()));
{code}

The {{name_match}} function can be simplified, and the comments look unnecessary.

{code}
+std::set<std::string> BadDataNodeTracker::GetNodesToExclude() {
{code}

This function is on the critical path, so the result should be cached.

{code}
+void BadDataNodeTracker::Clear() {
+  std::lock_guard<std::mutex> update_lock(datanodes_update_lock_);
+  datanodes_.clear();
+}
{code}

This looks like a test-only function. Let's name it {{TEST_Clear()}} for clarity.

{code}
+  BadDataNodeTracker() : timeout_duration_(10 * 60) {}
+  unsigned int timeout_duration_; /* in seconds */
{code}

The timeout setting needs to be passed in through the {{Option}} object, and {{timeout_duration_}} can be marked as {{const}}.

{code}
+  typedef std::chrono::system_clock Clock;
+  typedef std::pair<std::string, TimePoint> Entry;
+  std::set<Entry> datanodes_;
{code}

The clock needs to be a {{steady_clock}} for monotonicity. Is there any reason to use a {{set<Entry>}} instead of a {{vector}} / {{map}}? A {{vector}} has much better cache locality in this case.

{code}
 if (!s.ok()) {
+    /* determine if DN gets marked bad */
+    if (ShouldExclude(s)) {
+      bad_datanodes_->AddBadNode(contacted_datanode);
+    }
+      // resource might free up
+      case Status::kResourceUnavailable:
+        return false;
{code}

The above code can be simplified.

{code}
+static bool ShouldExclude(const Status &s) {
+  if (s.ok()) {
+    return false;
+  }
+
+  switch (s.code()) {
+    // resource might free up
+    case Status::kResourceUnavailable:
+      return false;
+    case Status::kInvalidArgument:
+    case Status::kUnimplemented:
+    case Status::kException:
+    default:
+      return true;
+  }
+  return true;
+}
{code}

This function can be placed in the {{InputStream}} class.

> Retry reads on DN failure
> -------------------------
>
>                 Key: HDFS-9103
>                 URL: https://issues.apache.org/jira/browse/HDFS-9103
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: James Clampffer
>             Fix For: HDFS-8707
>
>         Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch, HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.3.patch, HDFS-9103.HDFS-8707.4.patch, HDFS-9103.HDFS-8707.5.patch
>
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and try again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
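For reference, a minimal sketch (not the actual patch) of the shape the comments above point toward: a {{map}} keyed by node name replaces the {{find_if}} / {{erase}} / {{insert}} sequence, the clock is a {{steady_clock}}, the timeout is supplied by the caller and kept {{const}}, and the test hook is named {{TEST_Clear()}}. The {{IsBadNode()}} query and the member names here are illustrative assumptions, not taken from the attached patches.

{code}
#include <chrono>
#include <map>
#include <mutex>
#include <string>

// Sketch only: a bad-datanode tracker along the lines suggested above.
class BadDataNodeTracker {
 public:
  typedef std::chrono::steady_clock Clock;  // monotonic, unlike system_clock

  // The timeout comes from the caller (e.g. read out of the Option object)
  // instead of a hard-coded 10 * 60 seconds, and stays const afterwards.
  explicit BadDataNodeTracker(std::chrono::seconds timeout)
      : timeout_duration_(timeout) {}

  // A map keyed by node name makes "refresh the timestamp" a plain
  // assignment; no find_if/erase/insert sequence is needed.
  void AddBadNode(const std::string &dn) {
    std::lock_guard<std::mutex> lock(lock_);
    datanodes_[dn] = Clock::now();
  }

  // Hypothetical query for the read path; expired entries age out here.
  bool IsBadNode(const std::string &dn) {
    std::lock_guard<std::mutex> lock(lock_);
    auto it = datanodes_.find(dn);
    if (it == datanodes_.end()) {
      return false;
    }
    if (Clock::now() - it->second >= timeout_duration_) {
      datanodes_.erase(it);
      return false;
    }
    return true;
  }

  // Test-only helper, prefixed so that is obvious at the call site.
  void TEST_Clear() {
    std::lock_guard<std::mutex> lock(lock_);
    datanodes_.clear();
  }

 private:
  const std::chrono::seconds timeout_duration_;
  std::mutex lock_;
  std::map<std::string, Clock::time_point> datanodes_;
};
{code}

Whether a {{map}} or a sorted {{vector}} wins in practice depends on how many datanodes end up excluded at once; with only a handful of entries a {{vector}}'s cache locality is hard to beat.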