[GitHub] [hadoop] functioner commented on pull request #2737: HDFS-15869. Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang

GitBox Mon, 19 Apr 2021 12:07:56 -0700


functioner commented on pull request #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-822713500



   > Thanks @functioner
   > The detailed discussions (except the lambda argument) should have been on 
the Jira.
   > 
   > > IMO, this is a classic Producer-Consumer problem, and it is natural idea 
to improve performance using parallel way.
   > 
   > > So, call.sendResponse() (network service) affects FSEditLogAsync (edit 
log sync service). So, I would say it's a bug.
   > 
   > Now, I am really even more confused about the (Bug Vs. Improvement). So, I 
am going to pass on reviewing.
   
   @amahussein Thanks for your feedback and your time!
   Sorry for any confusion I caused.
   
   It's not a big deal whether this is marked as a bug or an improvement. One 
of my bug reports 
([HADOOP-17552](https://issues.apache.org/jira/browse/HADOOP-17552)) was also 
ultimately marked as an improvement rather than a bug. The point is that the 
developers in that issue eventually recognized the potential hang I had 
pointed out, and the patch (as well as the relevant discussion) is very 
helpful for developers and users.
   
   > Ok, is the purpose of the change is to improve performance of the 
FSEditLogAsync.java by executing sendResponse() in parallel?
   > In that case, please change the title of the Jira and the description to 
remove references to "hanging" problems.
   
   My position is that we should not remove the references to "hanging" 
problems.
   
   In short, the discussion above can be summarized in three arguments:
   1. https://github.com/apache/hadoop/pull/2737#issuecomment-822151838: this 
is a classic Producer-Consumer problem, and it is a natural idea to improve 
performance through parallelism.
   2. https://github.com/apache/hadoop/pull/2737#issuecomment-822591028: 
`call.sendResponse()` may hang due to a network issue without throwing any 
exception.
   3. https://github.com/apache/hadoop/pull/2737#issuecomment-822617097: 
   a). the `FSEditLogAsync` thread (without this patch) directly invokes the 
network I/O call `call.sendResponse()`, so if this network I/O invocation 
hangs, the `FSEditLogAsync` thread also hangs.
   b). in a "correct" system design, a hang in this network I/O invocation 
should be fine, because HDFS (as a fault-tolerant system) should tolerate it.
   c). for the system to tolerate this network issue, the `FSEditLogAsync` 
thread must not hang; otherwise nobody can commit to the edit log.
   d). the expected behavior is that, despite this network issue, the 
`FSEditLogAsync` thread continues and everything else keeps working.
   
   This patch addresses both Argument 1 and Argument 3.
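   
   To make Argument 3 concrete, here is a minimal sketch of the decoupling 
idea. It is not the code in this PR and does not use Hadoop's real classes: 
`Edit`, `syncAll`, `responder`, and `demo` are hypothetical names chosen for 
illustration, and a latch that is never released stands in for a 
`call.sendResponse()` that hangs on the network.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch only; not the HDFS-15869 patch or Hadoop's API. */
class ResponseOffloadSketch {

  /** Stand-in for an edit whose client must be notified after the sync. */
  interface Edit {
    void logSyncNotify() throws InterruptedException; // may block on network I/O
  }

  /**
   * Simulated sync loop: each edit is "synced", then its notification is
   * handed to a separate executor so a blocked notification cannot stall
   * the loop. Returns how many edits the loop got through.
   */
  static int syncAll(Edit[] edits, ExecutorService responder) {
    int synced = 0;
    for (Edit edit : edits) {
      synced++; // stand-in for flushing this edit to the journal
      responder.execute(() -> {
        try {
          edit.logSyncNotify();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
    return synced;
  }

  /** Returns {edits synced, healthy responses delivered}. */
  static int[] demo() {
    CountDownLatch neverReleased = new CountDownLatch(1);
    CountDownLatch delivered = new CountDownLatch(2);
    Edit hanging = () -> neverReleased.await(); // simulated hung response
    Edit healthy = delivered::countDown;        // response that returns at once
    ExecutorService responder = Executors.newFixedThreadPool(2);
    try {
      int synced = syncAll(new Edit[] {hanging, healthy, healthy}, responder);
      try {
        delivered.await(5, TimeUnit.SECONDS); // wait for the healthy responses
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      int deliveredCount = 2 - (int) delivered.getCount();
      return new int[] {synced, deliveredCount};
    } finally {
      responder.shutdownNow(); // interrupts the simulated hung response
    }
  }

  public static void main(String[] args) {
    int[] result = demo();
    System.out.println("synced=" + result[0] + " delivered=" + result[1]);
  }
}
```

   In this sketch the sync loop gets through all three edits even though the 
first response never returns. Note that with a single responder thread the 
hung response would still delay the responses queued behind it, although it 
would no longer block the sync loop itself.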
   
   In conclusion, this patch not only improves performance but also enhances 
availability and fault tolerance.
   
   So, I think the references to "hanging" problems should not be removed.
   
   P.S. I will summarize our discussion with a comment in Jira after we reach a 
consensus.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
