Re: [PR] HDFS-15869. Network issue while FSEditLogAsync is executing RpcEdit.logSyncNotify can cause the namenode to hang [hadoop]

via GitHub Thu, 22 Feb 2024 23:08:54 -0800


ZanderXu commented on PR #2737:
URL: https://github.com/apache/hadoop/pull/2737#issuecomment-1960828209


   Thanks for your work and discussion on this problem. It took me a long time 
to follow your ideas and concerns 😂; it is a hard one.
   
   I have some thoughts and questions about this ticket.
   
   Some questions about 
[HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869):
   
   > The `channel.write(buffer)` operation in line 3594 may be slow. Although 
for this specific stack trace, the channel is initialized in the non-blocking 
mode, there is still a chance of being slow depending on native write 
implementation in the OS (e.g., a kernel issue). Furthermore, the channelIO 
invocation in line 3594 may also get stuck, since it waits until the buffer is 
drained:
   
   `ret = (readCh == null) ? writeCh.write(buf) : readCh.read(buf);` will 
return 0 if the namenode cannot write more data to this connection, right? 
`RpcEdit.logSyncNotify` will then add the response to this connection's queue 
and let the Responder take over the job, right? So `FSEditLogAsync` can move on 
to the next job, right?
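
The behavior asked about above can be illustrated with a small, simplified sketch. It is not Hadoop's `Server` code: `NonBlockingWriteSketch`, `trySend`, and `pending` are hypothetical names, and the queue stands in for the per-connection response queue that the Responder drains. The assumption is that the channel is in non-blocking mode, so `write` returns immediately even when it writes 0 bytes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of "write inline, otherwise hand off"; not Hadoop code. */
class NonBlockingWriteSketch {
  /** Responses waiting for a responder thread (stand-in for the connection queue). */
  final Deque<ByteBuffer> pending = new ArrayDeque<>();

  /**
   * Tries to send on the caller's thread. Returns true if the buffer was
   * fully written, false if it was queued for the responder instead.
   */
  boolean trySend(WritableByteChannel ch, ByteBuffer buf) throws IOException {
    if (!pending.isEmpty()) {
      // Earlier responses are still queued: keep ordering, do not write inline.
      pending.add(buf);
      return false;
    }
    ch.write(buf); // non-blocking channel: may write 0 bytes and return at once
    if (buf.hasRemaining()) {
      pending.add(buf); // the responder finishes the write later
      return false;
    }
    return true;
  }
}
```

In this sketch the calling thread never waits for the socket to drain; the open question in the ticket is whether the underlying `write` call itself can still be slow.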
   
   
   Some thoughts on 
[HDFS-15869](https://issues.apache.org/jira/browse/HDFS-15869):
   
   Actually, I ran into this problem in our production environment: the 
`FSEditLogAsync` thread spent slightly more time sending responses to clients, 
which had a big performance impact on writing edits to the JNs. So I used a 
single new thread to do the response sending. Of course, multiple threads could 
also be used to send responses to clients. Creating a new task per response is 
very expensive, so we use a producer-consumer model to fix this problem:
   
   - `FSEditLogAsync` just puts each task into a bounded blocking queue.
   - The ResponseSender thread takes tasks from the queue and sends them to 
clients.
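
A minimal sketch of this producer-consumer arrangement follows. It is not the patch in this PR: `ResponseSenderSketch`, `ResponseTask`, `submit`, and `shutdown` are hypothetical names, and the queue capacity is an arbitrary parameter. The only assumption taken from the comment is that the edit-log thread enqueues into a bounded blocking queue and a single sender thread does the network I/O.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of the producer-consumer idea; not the actual patch. */
class ResponseSenderSketch {
  /** Stand-in for a pending RPC response (hypothetical type). */
  interface ResponseTask {
    void send() throws Exception;
  }

  /** Sentinel that tells the sender thread to stop. */
  private static final ResponseTask POISON = () -> { };

  private final BlockingQueue<ResponseTask> queue;
  private final Thread sender;

  public ResponseSenderSketch(int capacity) {
    queue = new ArrayBlockingQueue<>(capacity);
    sender = new Thread(this::drain, "ResponseSender");
    sender.setDaemon(true);
    sender.start();
  }

  /** Producer side: enqueue only, no network I/O on the caller's thread. */
  public void submit(ResponseTask task) throws InterruptedException {
    queue.put(task); // blocks only while the queue is full
  }

  /** Consumer side: take tasks in FIFO order and send them. */
  private void drain() {
    try {
      while (true) {
        ResponseTask task = queue.take();
        if (task == POISON) {
          return;
        }
        try {
          task.send();
        } catch (Exception e) {
          // One failed send must not kill the sender thread.
          System.err.println("send failed: " + e);
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Sends everything already queued, then stops the sender thread. */
  public void shutdown() throws InterruptedException {
    queue.put(POISON);
    sender.join();
  }
}
```

With a bounded queue the producer still blocks in `put` once the sender falls far enough behind, so the capacity decides how much slack the edit-log thread gets before it feels a slow client again.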
   
   
   As for "Bug" versus "Improvement", I think this should be a performance 
improvement, since every process works as expected: nothing blocks or hangs, 
it is just slow.
   
   
   Some thoughts on 
[HDFS-15957](https://issues.apache.org/jira/browse/HDFS-15957):
   
   - I think the namenode should close the connection directly if an 
IOException happens in `RpcEdit.logSyncNotify`, since we cannot let the client 
hang forever. As it stands, it looks as if the namenode simply drops the 
request.
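
The close-on-failure suggestion can be sketched as follows. This is an illustration only, not the code path in `RpcEdit.logSyncNotify`: `CloseOnFailureSketch` and `sendOrClose` are hypothetical names, and partial writes are deliberately left out of scope.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

/** Hypothetical sketch: close the connection when the response cannot be sent. */
class CloseOnFailureSketch {
  /**
   * Returns true if the write call succeeded, false if it failed and the
   * channel was closed. Partial writes are not handled in this sketch.
   */
  static boolean sendOrClose(WritableByteChannel ch, ByteBuffer response) {
    try {
      ch.write(response);
      return true;
    } catch (IOException e) {
      // Closing lets the client see a broken connection instead of
      // waiting forever for a response that will never arrive.
      try {
        ch.close();
      } catch (IOException ignored) {
        // Nothing more can be done for this connection.
      }
      return false;
    }
  }
}
```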
   
   
   @functioner Looking forward to your ideas and confirmation.
   
   
   @daryn-sharp @Hexiaoqiao @linyiqun @amahussein Looking forward to your 
ideas. I hope we can push this ticket forward.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

