[jira] [Updated] (HDFS-17250) EditLogTailer#triggerActiveLogRoll should handle thread Interrupted

2024-01-27 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated HDFS-17250:
--
  Component/s: hdfs
 Target Version/s: 3.4.0
Affects Version/s: 3.4.0

> EditLogTailer#triggerActiveLogRoll should handle thread Interrupted
> ---
>
> Key: HDFS-17250
> URL: https://issues.apache.org/jira/browse/HDFS-17250
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> *Issue:*
> When the NameNode attempts to trigger a log roll and the cachedActiveProxy
> points to a NameNode whose machine has been shut down, it cannot establish a
> network connection. This results in a timeout during the socket connection
> phase, which is configured to 90 seconds. Since the asynchronous "Triggering
> log roll" call waits only 60 seconds, it times out first and initiates a
> "cancel" operation, causing the executing thread to receive an interrupt and
> throw a "java.io.InterruptedIOException".
> Currently the logic does not handle the interrupt: as long as the
> "getActiveNodeProxy" method has not reached the maximum retry limit, the
> overall execution does not exit and it continues to attempt to call
> "rollEditLog" on the next NameNode in the list. However, when it then tries
> to establish a socket connection, it throws a
> "java.nio.channels.ClosedByInterruptException" because the thread is still in
> the "Interrupted" state.
> This cycle repeats until it reaches the maximum retry limit (nnCount *
> maxRetries) and exits.
> However, in the next cycle of "Triggering log roll" it traverses the NameNode
> list again and encounters the same issue, because the cachedActiveProxy still
> points to the shut-down NameNode.
> This eventually results in the NameNode being unable to successfully complete
> the "Triggering log roll" operation.
> To fix this, we need to handle the thread being interrupted and exit the
> execution.
> *Detailed logs:*
> The Observer node "ob1" executes "Triggering log roll" as follows; the
> NameNode list is [ob2 (shut-down machine), nn1 (active), nn2 (standby)].
> * The Observer node "ob1" periodically executes "triggerActiveLogRoll" and
> asynchronously calls "getNameNodeProxy" to request "rollEditLog" on "ob2".
> Since the "ob2" machine is shut down, it cannot establish a network
> connection, and this results in a timeout during the socket connection phase
> (the timeout here is set to 90 seconds).
> {quote}
> 2023-11-03 10:27:41,734 INFO  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering 
> log roll on remote NameNode
> 2023-11-03 10:28:41,734 WARN  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to 
> finish rolling edits in 6 ms
> {quote}
> * As the asynchronous "Triggering log roll" call waits only 60 seconds, it
> times out and initiates a "cancel" operation, causing the executing thread to
> receive an interrupt and throw "java.io.InterruptedIOException".
> {quote}
> 2023-11-03 10:28:41,753 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:41,972 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
> java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort 
> xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted 
> while waiting for IO on channel 
> java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total 
> timeout mills is 9, 30038 millis timeout left.
> at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
> at org.apache.hadoop.ipc.Client.call(Client.java:1511)
> at org.apache.hadoop.ipc.Client.call(Client.java:1402)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
> at 
> 

[jira] [Updated] (HDFS-17250) EditLogTailer#triggerActiveLogRoll should handle thread Interrupted

2023-11-12 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17250:
--
Description: 
*Issue:*
When the NameNode attempts to trigger a log roll and the cachedActiveProxy points 
to a NameNode whose machine has been shut down, it cannot establish a network 
connection. This results in a timeout during the socket connection phase, which 
is configured to 90 seconds. Since the asynchronous "Triggering log roll" call 
waits only 60 seconds, it times out first and initiates a "cancel" operation, 
causing the executing thread to receive an interrupt and throw a 
"java.io.InterruptedIOException".

Currently the logic does not handle the interrupt: as long as the 
"getActiveNodeProxy" method has not reached the maximum retry limit, the overall 
execution does not exit and it continues to attempt to call "rollEditLog" on the 
next NameNode in the list. However, when it then tries to establish a socket 
connection, it throws a "java.nio.channels.ClosedByInterruptException" because 
the thread is still in the "Interrupted" state.
This cycle repeats until it reaches the maximum retry limit (nnCount * 
maxRetries) and exits.
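
For reference, the ClosedByInterruptException behaviour can be reproduced with a 
tiny sketch (hypothetical class name and a placeholder address): blocking NIO 
channel I/O attempted on a thread whose interrupt flag is already set closes the 
channel and fails immediately, which is why every follow-up attempt against the 
other NameNodes also fails.
{code:java}
import java.net.InetSocketAddress;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.SocketChannel;

public class InterruptedConnectSketch {
  public static void main(String[] args) throws Exception {
    // The interrupt flag is still set, as after the cancelled log-roll call.
    Thread.currentThread().interrupt();
    try (SocketChannel ch = SocketChannel.open()) {
      // Placeholder address/port; the connect never gets a chance to run.
      ch.connect(new InetSocketAddress("127.0.0.1", 8040));
    } catch (ClosedByInterruptException e) {
      System.out.println("connect failed because the thread is interrupted: " + e);
    }
  }
}
{code}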

However, in the next cycle of "Triggering log roll" it traverses the NameNode 
list again and encounters the same issue, because the cachedActiveProxy still 
points to the shut-down NameNode.
This eventually results in the NameNode being unable to successfully complete 
the "Triggering log roll" operation.

To fix this, we need to detect that the thread has been interrupted and exit the 
execution, as sketched below.
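
A possible shape of that handling (hypothetical class and method names, not the 
actual patch) is a retry loop over the NameNode list that checks the thread's 
interrupt flag on every attempt and gives up immediately once it has been 
interrupted, instead of moving on to the next NameNode with the flag still set.
{code:java}
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical stand-in for EditLogTailer's multi-NameNode retry loop.
class RollEditLogTask implements Callable<Void> {
  private final List<String> nameNodes; // stand-in for the RemoteNameNodeInfo list
  private final int maxRetries;

  RollEditLogTask(List<String> nameNodes, int maxRetries) {
    this.nameNodes = nameNodes;
    this.maxRetries = maxRetries;
  }

  @Override
  public Void call() throws IOException {
    int attempts = nameNodes.size() * maxRetries; // nnCount * maxRetries
    for (int i = 0; i < attempts; i++) {
      // Proposed behaviour: stop retrying as soon as the thread has been
      // interrupted (e.g. after the 60-second wait expired and the caller
      // cancelled the Future), instead of trying the next NameNode.
      if (Thread.currentThread().isInterrupted()) {
        throw new InterruptedIOException("Interrupted while rolling edit log, giving up");
      }
      String nn = nameNodes.get(i % nameNodes.size());
      try {
        rollEditLog(nn); // placeholder RPC call
        return null;
      } catch (IOException e) {
        // As before, fall through and try the next NameNode in the list.
      }
    }
    throw new IOException("Could not roll edit log on any NameNode");
  }

  private void rollEditLog(String nn) throws IOException {
    // Placeholder for the real rollEditLog RPC to NameNode 'nn'.
    throw new IOException("connection to " + nn + " failed (placeholder)");
  }
}
{code}
Translating the interrupt into an InterruptedIOException, as in this sketch, lets 
the caller see a single clear failure instead of a cascade of 
ClosedByInterruptException retries.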

*Detailed logs:*

The Observer node "ob1" executes "Triggering log roll" as follows; the NameNode 
list is [ob2 (shut-down machine), nn1 (active), nn2 (standby)].

* The Observer node "ob1" periodically executes "triggerActiveLogRoll" and 
asynchronously calls "getNameNodeProxy" to request "rollEditLog" on "ob2". Since 
the "ob2" machine is shut down, it cannot establish a network connection, and 
this results in a timeout during the socket connection phase (the timeout here 
is set to 90 seconds).
{quote}
2023-11-03 10:27:41,734 INFO  ha.EditLogTailer 
(EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering 
log roll on remote NameNode

2023-11-03 10:28:41,734 WARN  ha.EditLogTailer 
(EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to 
finish rolling edits in 6 ms
{quote}

* As the asynchronous "Triggering log roll" call waits only 60 seconds, it times 
out and initiates a "cancel" operation, causing the executing thread to receive 
an interrupt and throw "java.io.InterruptedIOException".
{quote}
2023-11-03 10:28:41,753 WARN  ipc.Client 
(Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
while trying for connection
2023-11-03 10:28:41,972 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
[pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
[nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort 
xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted 
while waiting for IO on channel 
java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total 
timeout mills is 9, 30038 millis timeout left.
at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
at org.apache.hadoop.ipc.Client.call(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1402)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:454)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:451)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:615)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 

[jira] [Updated] (HDFS-17250) EditLogTailer#triggerActiveLogRoll should handle thread Interrupted

2023-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17250:
--
Labels: pull-request-available  (was: )

> EditLogTailer#triggerActiveLogRoll should handle thread Interrupted
> ---
>
> Key: HDFS-17250
> URL: https://issues.apache.org/jira/browse/HDFS-17250
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>
> *Issue:*
> When the NameNode attempts to trigger a log roll and the cachedActiveProxy is
> a shut-down NameNode, it is unable to establish a network connection. This
> results in a timeout during the socket connection phase, which is configured
> to 90 seconds. Since the asynchronous "Triggering log roll" call waits only
> 60 seconds, it times out first and initiates a "cancel" operation, causing
> the executing thread to receive an interrupt and throw a
> "java.io.InterruptedIOException".
> Currently the logic does not handle the interrupt: as long as the
> "getActiveNodeProxy" method has not reached the maximum retry limit, the
> overall execution does not exit and it continues to attempt to call
> "rollEditLog" on the next NameNode in the list. However, when it then tries
> to establish a socket connection, it throws a
> "java.nio.channels.ClosedByInterruptException" because the thread is still in
> the "Interrupted" state.
> This cycle repeats until it reaches the maximum retry limit (nnCount *
> maxRetries) and exits.
> However, in the next cycle of "Triggering log roll" it traverses the NameNode
> list again and encounters the same issue, because the cachedActiveProxy still
> points to the shut-down NameNode.
> This eventually results in the NameNode being unable to successfully complete
> the "Triggering log roll" operation.
> To fix this, we need to handle the thread being interrupted and exit the
> execution.
> *Detailed logs:*
> The Observer node "ob1" executes "Triggering log roll" as follows; the
> NameNode list is [ob2 (shut-down node), nn1 (active), nn2 (standby)].
> * The Observer node "ob1" periodically executes "triggerActiveLogRoll" and
> asynchronously calls "getNameNodeProxy" to request "rollEditLog" on "ob2".
> Since the "ob2" machine is shut down, it cannot establish a network
> connection, and this results in a timeout during the socket connection phase
> (the timeout here is set to 90 seconds).
> {quote}
> 2023-11-03 10:27:41,734 INFO  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering 
> log roll on remote NameNode
> 2023-11-03 10:28:41,734 WARN  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to 
> finish rolling edits in 6 ms
> {quote}
> * As the asynchronous "Triggering log roll" call waits only 60 seconds, it
> times out and initiates a "cancel" operation, causing the executing thread to
> receive an interrupt and throw "java.io.InterruptedIOException".
> {quote}
> 2023-11-03 10:28:41,753 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:41,972 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
> java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort 
> xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted 
> while waiting for IO on channel 
> java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total 
> timeout mills is 9, 30038 millis timeout left.
> at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
> at org.apache.hadoop.ipc.Client.call(Client.java:1511)
> at org.apache.hadoop.ipc.Client.call(Client.java:1402)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
> at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
> at 
> 
