[ 
https://issues.apache.org/jira/browse/HADOOP-19061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin resolved HADOOP-19061.
-------------------------------
       Fix Version/s: 3.5.0
    Target Version/s:   (was: 3.3.9, 3.5.0)
          Resolution: Fixed

> Capture exception in rpcRequestSender.start() in IPC.Connection.run()
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-19061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19061
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 3.5.0
>            Reporter: Xing Lin
>            Assignee: Xing Lin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>
> rpcRequestThread.start() can fail due to OOM. This will immediately crash the 
> Connection thread, without removing itself from the connections pool. Then 
> for all following getConnection(remoteid), we will get this bad connection 
> object and all rpc requests will be hanging, because this is a bad connection 
> object, without threads being properly running (Neither Connection or 
> Connection.rpcRequestSender thread is running due to OOM.).
> In this PR, we moved the rpcRequestThread.start() to be within the 
> try{}-catch{} block, to capture OOM from rpcRequestThread.start() and proper 
> cleaning is followed if we hit OOM.
> {code:java}
> IPC.Connection.run()
>   @Override
>     public void run() {
>       // Don't start the ipc parameter sending thread until we start this
>       // thread, because the shutdown logic only gets triggered if this
>       // thread is started.
>       rpcRequestThread.start();
>       if (LOG.isDebugEnabled())
>         LOG.debug(getName() + ": starting, having connections " 
>             + connections.size());      
>       try {
>         while (waitForWork()) {//wait here for work - read or close connection
>           receiveRpcResponse();
>         }
>       } catch (Throwable t) {
>         // This truly is unexpected, since we catch IOException in 
> receiveResponse
>         // -- this is only to be really sure that we don't leave a client 
> hanging
>         // forever.
>         LOG.warn("Unexpected error reading responses on connection " + this, 
> t);
>         markClosed(new IOException("Error reading responses", t));
>       }{code}
> Because there is no rpcRequestSender thread consuming the rpcRequestQueue, 
> all rpc request enqueue operations for this connection will be blocked and 
> will be hanging at this while loop forever during sendRpcRequest().
> {code:java}
> while (!shouldCloseConnection.get()) {
>   if (rpcRequestQueue.offer(Pair.of(call, buf), 1, TimeUnit.SECONDS)) {
>     break;
>   }
> }{code}
> OOM exception in starting the rpcRequestSender thread.
> {code:java}
> Exception in thread "IPC Client (1664093259) connection to 
> nn01.grid.linkedin.com/IP-Address:portNum from kafkaetl" 
> java.lang.OutOfMemoryError: unable to create new native thread
>       at java.lang.Thread.start0(Native Method)
>       at java.lang.Thread.start(Thread.java:717)
>       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1034)
> {code}
> Multiple threads blocked by queue.offer(). and we don't found any "IPC 
> Client" or "IPC Parameter Sending Thread" in thread dump. 
> {code:java}
> Thread 2156123: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information 
> may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) 
> @bci=20, line=215 (Compiled frame)
>  - 
> java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferQueue$QNode,
>  java.lang.Object, boolean, long) @bci=156, line=764 (Compiled frame)
>  - 
> java.util.concurrent.SynchronousQueue$TransferQueue.transfer(java.lang.Object,
>  boolean, long) @bci=148, line=695 (Compiled frame)
>  - java.util.concurrent.SynchronousQueue.offer(java.lang.Object, long, 
> java.util.concurrent.TimeUnit) @bci=24, line=895 (Compiled frame)
>  - 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(org.apache.hadoop.ipc.Client$Call)
>  @bci=88, line=1134 (Compiled frame)
>  - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, 
> org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, 
> int, java.util.concurrent.atomic.AtomicBoolean, 
> org.apache.hadoop.ipc.AlignmentContext) @bci=36, line=1402 (Interpreted frame)
>  - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, 
> org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, 
> java.util.concurrent.atomic.AtomicBoolean, 
> org.apache.hadoop.ipc.AlignmentContext) @bci=9, line=1349 (Compiled frame)
>  - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, 
> java.lang.reflect.Method, java.lang.Object[]) @bci=248, line=230 (Compiled 
> frame)
>  - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, 
> java.lang.reflect.Method, java.lang.Object[]) @bci=4, line=118 (Compiled 
> frame)
>  - com.sun.proxy.$Proxy11.getBlockLocations({code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to