ZanderXu commented on PR #4527: URL: https://github.com/apache/hadoop/pull/4527#issuecomment-1183957096
@omalley We hit an incident in our prod environment that is related to this connection handling. Limiting the `rpcRequestQueue` would fix the problem, and I'd welcome your ideas. The root cause was a NameNode OOM caused by many pending outgoing requests piling up on a single connection:

- The network between the Observer NameNode and JournalNode 1 (JN1) was abnormal, e.g. lag and TCP drops.
- The connection was not broken, but the NameNode could neither send requests nor receive responses over it.
- The Observer NameNode keeps sending `getJournaledEdits` RPCs to JN1, and it can ignore JN1's responses because it has already received a quorum of responses.
- The Observer NameNode tries to abandon the abnormal call by interrupting it, but it is not able to interrupt this connection.
- In the end, the NameNode ran out of memory because too many requests were pending on this abnormal connection.

So I think we could bound the `rpcRequestQueue` and throw an `IOException` when it is full.
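A rough sketch of that idea, only to make the proposal concrete. `BoundedRequestQueue` and its methods are hypothetical names and not the actual `org.apache.hadoop.ipc.Client` code, and a plain `LinkedBlockingQueue` stands in for whatever queue the connection really uses. The point is just that a non-blocking `offer` on a bounded queue lets the caller fail fast instead of piling up requests in memory:

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Hypothetical illustration: a per-connection request queue with a hard cap.
 * Not taken from the Hadoop code base.
 */
final class BoundedRequestQueue<T> {
  private final BlockingQueue<T> queue;

  BoundedRequestQueue(int capacity) {
    // A capacity-bounded queue; offer() returns false once it is full.
    this.queue = new LinkedBlockingQueue<>(capacity);
  }

  /** Enqueue without blocking; reject the request if the queue is full. */
  void enqueue(T request) throws IOException {
    if (!queue.offer(request)) {
      throw new IOException("rpcRequestQueue is full (" + queue.size()
          + " pending requests); rejecting request");
    }
  }

  /** Called by the sender thread; blocks until a request is available. */
  T take() throws InterruptedException {
    return queue.take();
  }

  /** Number of requests currently waiting to be sent. */
  int size() {
    return queue.size();
  }
}
```

With something like this, a stuck connection to JN1 would hold at most `capacity` pending requests, and later `getJournaledEdits` calls on that connection would fail with an `IOException` rather than accumulate. The right capacity, and whether it should be configurable, is an open question.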