[ 
https://issues.apache.org/jira/browse/HDFS-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787443#comment-17787443
 ] 

ASF GitHub Bot commented on HDFS-17250:
---------------------------------------

xinglin commented on PR #6266:
URL: https://github.com/apache/hadoop/pull/6266#issuecomment-1817339327

   qq: which version of hadoop did you run? Does `triggerActiveLogRoll()` 
actually try to connect other NN as well? Is your issue happening for trunk? By 
reading the source code, that does not seem to be the case.
   
   log line of "Triggering log rolling to the remote NameNode, ": is this 
truncated in your paste? seems to be incomplete compared to trunk.
   
   ```
     /**
      * Trigger the active node to roll its logs.
      */
     @VisibleForTesting
     void triggerActiveLogRoll() {
       Future<Void> future = null;
       try {
        ...
      }
   
     Callable<Void> getNameNodeProxy() {
       return new MultipleNameNodeProxy<Void>() {
         @Override
         protected Void doWork() throws IOException {
           LOG.info("Triggering log rolling to the remote NameNode, " +
               "active NameNode = {}", currentNN.getIpcAddress());
           cachedActiveProxy.rollEditLog();
           return null;
         }
       };
   ```




> EditLogTailer#triggerActiveLogRoll should handle thread Interrupted
> -------------------------------------------------------------------
>
>                 Key: HDFS-17250
>                 URL: https://issues.apache.org/jira/browse/HDFS-17250
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>
> *Issue:*
> When the NameNode attempts to trigger a log roll and the cachedActiveProxy is 
> a "the machine has been shut down of the namenode," it is unable to establish 
> a network connection. This results in a timeout during the socket connection 
> phase, which has a set timeout of 90 seconds. Since the asynchronous call for 
> "Triggering log roll" has a waiting time of 60 seconds, it triggers a timeout 
> and initiates a "cancel" operation, causing the executing thread to receive 
> an "Interrupted" signal and throwing a "java.io.InterruptedIOException" 
> exception.
> Currently, the logic not to handle interrupted signal, and the 
> "getActiveNodeProxy" method hasn't reached the maximum retry limit, the 
> overall execution process doesn't exit and it continues to attempt to
> call the "rollEditLog" on the next NameNode in the list. However when a 
> socket connection is established, it throws a 
> "java.nio.channels.ClosedByInterruptException" exception due to the thread 
> being in an "Interrupted" state.
> this cycle repeats until it reaches the maximum retry limit (nnCount * 
> maxRetries) will exits.
> However in the next cycle of "Triggering log roll," it continues to traverse 
> the NameNode list and encounters the same issue and the cachedActiveProxy is 
> still a "shut down NameNode."
> This eventually results in the NameNode being unable to successfully complete 
> the "Triggering log roll" operation.
> To optimize this, we need to handle the thread being interrupted and exit the 
> execution.
> *Detailed logs such as:*
> the Observer node "ob1" will execute "Triggering log roll" is as follows:
> nns list is [ob2(shut down machine),nn1(active),nn2(standy)]
> * The Observer node "ob1" periodically executes "triggerActiveLogRoll" and 
> asynchronously calls "getNameNodeProxy" to request "ob2" for the 
> "rollEditLog" operation, since the "ob2" machine is shut down, it cannot 
> establish a network connection, this results in a timeout during the socket 
> connection phase (here set timeout is 90 seconds).
> {quote}
> 2023-11-03 10:27:41,734 INFO  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering 
> log roll on remote NameNode
> 2023-11-03 10:28:41,734 WARN  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to 
> finish rolling edits in 60000 ms
> {quote}
> * As the asynchronous call for "Triggering log roll" has a waiting time of 60 
> seconds, it triggers a timeout and initiates a "cancel" operation, causing 
> the executing thread to receive an "Interrupted" signal and will throw 
> "java.io.InterruptedIOException".
> {quote}
> 2023-11-03 10:28:41,753 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:41,972 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
> java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort 
> xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted 
> while waiting for IO on channel 
> java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total 
> timeout mills is 90000, 30038 millis timeout left.
>         at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1511)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
>         at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.InterruptedIOException: Interrupted while waiting for IO 
> on channel java.nio.channels.SocketChannel[connection-pending 
> remote=xxx:8040]. Total timeout mills is 90000, 30038 millis timeout left.
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:351)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
> {quote}
> *  Since current logic not to handle interrupted signal and hasn't reached 
> the maximum retry limit, the overall execution process doesn't exit and it 
> continues to attempt to call the "rollEditLog" on the next NameNode in the 
> list. However when a socket connection is established, it throws a 
> "java.nio.channels.ClosedByInterruptException" exception due to the thread 
> being in an "Interrupted" state. this cycle repeats until it reaches 9 retry 
> limit (3*3) and then exits.
> {quote}
> 2023-11-03 10:28:41,991 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:41,991 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=http://xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
>         at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:656)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:662)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:783)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:346)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1653)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1402)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
>         at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
>         at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2023-11-03 10:28:42,095 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,095 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,104 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,104 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,118 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,118 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> 2023-11-03 10:28:42,119 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,119 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,148 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,148 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,148 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,148 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn1, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> 2023-11-03 10:28:42,149 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:42,149 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=nn2, ipcAddress=xxx:8040, httpAddress=xxx:50070], try next.
> java.nio.channels.ClosedByInterruptException
> {quote}
> * In the next cycle of "Triggering log roll," it continues to traverse the 
> NameNode list and proceeds to "ob2," causing the same issue to recur. This 
> eventually results in the NameNode being unable to successfully complete the 
> "Triggering log roll" operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to