[ 
https://issues.apache.org/jira/browse/HDFS-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated HDFS-17250:
------------------------------
          Component/s: hdfs
     Target Version/s: 3.4.0
    Affects Version/s: 3.4.0

> EditLogTailer#triggerActiveLogRoll should handle thread Interrupted
> -------------------------------------------------------------------
>
>                 Key: HDFS-17250
>                 URL: https://issues.apache.org/jira/browse/HDFS-17250
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.4.0
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>
> *Issue:*
> When the NameNode attempts to trigger a log roll and the cachedActiveProxy is 
> a "the machine has been shut down of the namenode," it is unable to establish 
> a network connection. This results in a timeout during the socket connection 
> phase, which has a set timeout of 90 seconds. Since the asynchronous call for 
> "Triggering log roll" has a waiting time of 60 seconds, it triggers a timeout 
> and initiates a "cancel" operation, causing the executing thread to receive 
> an "Interrupted" signal and throwing a "java.io.InterruptedIOException" 
> exception.
> Currently, the logic not to handle interrupted signal, and the 
> "getActiveNodeProxy" method hasn't reached the maximum retry limit, the 
> overall execution process doesn't exit and it continues to attempt to
> call the "rollEditLog" on the next NameNode in the list. However when a 
> socket connection is established, it throws a 
> "java.nio.channels.ClosedByInterruptException" exception due to the thread 
> being in an "Interrupted" state.
> this cycle repeats until it reaches the maximum retry limit (nnCount * 
> maxRetries) will exits.
> However in the next cycle of "Triggering log roll," it continues to traverse 
> the NameNode list and encounters the same issue and the cachedActiveProxy is 
> still a "shut down NameNode."
> This eventually results in the NameNode being unable to successfully complete 
> the "Triggering log roll" operation.
> To optimize this, we need to handle the thread being interrupted and exit the 
> execution.
> *Detailed logs such as:*
> the Observer node "ob1" will execute "Triggering log roll" is as follows:
> nns list is [ob2(shut down machine),nn1(active),nn2(standy)]
> * The Observer node "ob1" periodically executes "triggerActiveLogRoll" and 
> asynchronously calls "getNameNodeProxy" to request "ob2" for the 
> "rollEditLog" operation, since the "ob2" machine is shut down, it cannot 
> establish a network connection, this results in a timeout during the socket 
> connection phase (here set timeout is 90 seconds).
> {quote}
> 2023-11-03 10:27:41,734 INFO  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering 
> log roll on remote NameNode
> 2023-11-03 10:28:41,734 WARN  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to 
> finish rolling edits in 60000 ms
> {quote}
> * As the asynchronous call for "Triggering log roll" has a waiting time of 60 
> seconds, it triggers a timeout and initiates a "cancel" operation, causing 
> the executing thread to receive an "Interrupted" signal and will throw 
> "java.io.InterruptedIOException".
> {quote}
> 2023-11-03 10:28:41,753 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:41,972 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
> java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort 
> xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted 
> while waiting for IO on channel 
> java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total 
> timeout mills is 90000, 30038 millis timeout left.
>         at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1511)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
>         at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.InterruptedIOException: Interrupted while waiting for IO 
> on channel java.nio.channels.SocketChannel[connection-pending 
> remote=xxx:8040]. Total timeout mills is 90000, 30038 millis timeout left.
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
> {quote}
> *  Since current logic not to handle interrupted signal and hasn't reached 
> the maximum retry limit, the overall execution process doesn't exit and it 
> continues to attempt to call the "rollEditLog" on the next NameNode in the 
> list. However when a socket connection is established, it throws a 
> "java.nio.channels.ClosedByInterruptException" exception due to the thread 
> being in an "Interrupted" state. this cycle repeats until it reaches 9 retry 
> limit (3*3) and then exits.
> {quote}
> 2023-11-03 10:28:41,991 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:41,991 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=http://xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
>         at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:656)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
>         at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2023-11-03 10:28:42,095 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,095 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,104 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,104 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,118 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,118 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> 2023-11-03 10:28:42,119 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,119 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,148 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,148 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,148 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,148 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,149 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,149 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> {quote}
> * In the next cycle of "Triggering log roll," it continues to traverse the 
> NameNode list and proceeds to "ob2," causing the same issue to recur. This 
> eventually results in the NameNode being unable to successfully complete the 
> "Triggering log roll" operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to