ZanderXu commented on PR #4527:
URL: https://github.com/apache/hadoop/pull/4527#issuecomment-1183957096

   @omalley We encountered an incident in our prod environment that is related 
to this connection. Limiting the `rpcRequestQueue` can fix the problem, and I 
would welcome your ideas.
   
   The root cause is a NameNode OOM caused by too many requests pending to be 
sent on a connection. 
   
   - The network between the Observer NameNode and JournalNode 1 (JN1) is 
abnormal, for example lagging or dropping TCP packets. 
   - The connection is not closed, but the NameNode cannot send requests or 
receive responses over it.
   - The Observer NameNode keeps sending `getJournaledEdits` RPCs to JN1, and 
it can ignore JN1's response because it has already received a quorum of 
responses.
   - The Observer NameNode tries to abandon the abnormal call by interrupting 
it, but it is not able to interrupt this connection.
   - In the end, the NameNode runs out of memory because too many pending 
requests accumulate on this abnormal connection.
   
   So I think we could limit the size of the `rpcRequestQueue` and throw an 
IOException when it is full.
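
   A minimal sketch of that idea, assuming a plain `LinkedBlockingQueue` with 
a fixed capacity. `BoundedRequestQueue`, `enqueue`, and the capacity value are 
hypothetical names used only for illustration; this is not the actual Hadoop 
IPC client code.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; not the Hadoop ipc.Client implementation.
class BoundedRequestQueue<T> {
  private final BlockingQueue<T> queue;

  BoundedRequestQueue(int capacity) {
    // A capacity-bounded queue caps how many unsent requests can pile up.
    this.queue = new LinkedBlockingQueue<>(capacity);
  }

  // Fails fast instead of blocking or growing without bound.
  void enqueue(T request) throws IOException {
    // offer() returns false immediately when the queue is at capacity.
    if (!queue.offer(request)) {
      throw new IOException(
          "Request queue is full (" + queue.size() + " pending requests)");
    }
  }

  // Called by the sender thread; blocks until a request is available.
  T take() throws InterruptedException {
    return queue.take();
  }

  int size() {
    return queue.size();
  }
}
```

   With a bound like this, a caller that keeps issuing requests on a stalled 
connection would get an IOException once the limit is reached, rather than 
adding to the heap until the process runs out of memory.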
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
