[
https://issues.apache.org/jira/browse/HDFS-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiaoqiao He resolved HDFS-17250.
--------------------------------
Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Resolution: Fixed
> EditLogTailer#triggerActiveLogRoll should handle thread Interrupted
> -------------------------------------------------------------------
>
> Key: HDFS-17250
> URL: https://issues.apache.org/jira/browse/HDFS-17250
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
>
> *Issue:*
> When the NameNode attempts to trigger a log roll and the cachedActiveProxy is
> a "the machine has been shut down of the namenode," it is unable to establish
> a network connection. This results in a timeout during the socket connection
> phase, which has a set timeout of 90 seconds. Since the asynchronous call for
> "Triggering log roll" has a waiting time of 60 seconds, it triggers a timeout
> and initiates a "cancel" operation, causing the executing thread to receive
> an "Interrupted" signal and throwing a "java.io.InterruptedIOException"
> exception.
> Currently, the logic not to handle interrupted signal, and the
> "getActiveNodeProxy" method hasn't reached the maximum retry limit, the
> overall execution process doesn't exit and it continues to attempt to
> call the "rollEditLog" on the next NameNode in the list. However when a
> socket connection is established, it throws a
> "java.nio.channels.ClosedByInterruptException" exception due to the thread
> being in an "Interrupted" state.
> this cycle repeats until it reaches the maximum retry limit (nnCount *
> maxRetries) will exits.
> However in the next cycle of "Triggering log roll," it continues to traverse
> the NameNode list and encounters the same issue and the cachedActiveProxy is
> still a "shut down NameNode."
> This eventually results in the NameNode being unable to successfully complete
> the "Triggering log roll" operation.
> To optimize this, we need to handle the thread being interrupted and exit the
> execution.
> *Detailed logs such as:*
> the Observer node "ob1" will execute "Triggering log roll" is as follows:
> nns list is [ob2(shut down machine),nn1(active),nn2(standy)]
> * The Observer node "ob1" periodically executes "triggerActiveLogRoll" and
> asynchronously calls "getNameNodeProxy" to request "ob2" for the
> "rollEditLog" operation, since the "ob2" machine is shut down, it cannot
> establish a network connection, this results in a timeout during the socket
> connection phase (here set timeout is 90 seconds).
> {quote}
> 2023-11-03 10:27:41,734 INFO ha.EditLogTailer
> (EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering
> log roll on remote NameNode
> 2023-11-03 10:28:41,734 WARN ha.EditLogTailer
> (EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to
> finish rolling edits in 60000 ms
> {quote}
> * As the asynchronous call for "Triggering log roll" has a waiting time of 60
> seconds, it triggers a timeout and initiates a "cancel" operation, causing
> the executing thread to receive an "Interrupted" signal and will throw
> "java.io.InterruptedIOException".
> {quote}
> 2023-11-03 10:28:41,753 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:41,972 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
> java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort
> xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted
> while waiting for IO on channel
> java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total
> timeout mills is 90000, 30038 millis timeout left.
> at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown
> Source)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
> at org.apache.hadoop.ipc.Client.call(Client.java:1511)
> at org.apache.hadoop.ipc.Client.call(Client.java:1402)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
> at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.InterruptedIOException: Interrupted while waiting for IO
> on channel java.nio.channels.SocketChannel[connection-pending
> remote=xxx:8040]. Total timeout mills is 90000, 30038 millis timeout left.
> at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
> at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
> at
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
> {quote}
> * Since current logic not to handle interrupted signal and hasn't reached
> the maximum retry limit, the overall execution process doesn't exit and it
> continues to attempt to call the "rollEditLog" on the next NameNode in the
> list. However when a socket connection is established, it throws a
> "java.nio.channels.ClosedByInterruptException" exception due to the thread
> being in an "Interrupted" state. this cycle repeats until it reaches 9 retry
> limit (3*3) and then exits.
> {quote}
> 2023-11-03 10:28:41,991 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:41,991 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=http://xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> at
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:656)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
> at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
> at
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
> at org.apache.hadoop.ipc.Client.call(Client.java:1449)
> at org.apache.hadoop.ipc.Client.call(Client.java:1402)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
> at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2023-11-03 10:28:42,095 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:42,095 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,104 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:42,104 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,118 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:42,118 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> 2023-11-03 10:28:42,119 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:42,119 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,148 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:42,148 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,148 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:42,148 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,149 WARN ipc.Client
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted
> while trying for connection
> 2023-11-03 10:28:42,149 WARN ha.EditLogTailer (EditLogTailer.java:call(618))
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> {quote}
> * In the next cycle of "Triggering log roll," it continues to traverse the
> NameNode list and proceeds to "ob2," causing the same issue to recur. This
> eventually results in the NameNode being unable to successfully complete the
> "Triggering log roll" operation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]