empiredan opened a new issue, #2133:
URL: https://github.com/apache/incubator-pegasus/issues/2133

   After all servers (including both meta/replica servers) in a pegasus cluster 
with authentication enabled were restarted, the java client that had connected 
to them before would never succeed in connecting to the meta server due to 
`Negotiation failed` with  `ERR_TIMEOUT` for port **33391** as follows:
   ```
   I1012 17:22:33.122887 28573 MetaSession.java:197] query meta got error, rpc 
error(ERR_OK), meta error(ERR_FORWARD_TO_OTHERS), forward 
address(rpc_address(10.1.136.90:8170)), current leader
   (rpc_address(10.1.136.41:8170)), remain retry count(1), need switch 
leader(true), need delay(false)
   I1012 17:22:33.123011 28571 ReplicaSession.java:189] 
rpc_address(10.1.136.90:8170): the session is disconnected, needs to reconnect
   I1012 17:22:33.123739 28522 ReplicaSession.java:204] 
rpc_address(10.1.136.90:8170): start to async connect to target, wait channel 
to active
   I1012 17:22:33.123829 28522 ReplicaSession.java:424] Channel [id: 
0xea9a05e5, L:/10.1.136.26:33391 - R:/10.1.136.90:8170] for session 
rpc_address(10.1.136.90:8170) is active
   
   ......
   
   E1012 17:22:43.134124 28520 Negotiation.java:85] Negotiation failed
   Java exception follows:
   com.xiaominew.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
           at 
com.xiaominew.infra.pegasus.security.Negotiation$RecvHandler.run(Negotiation.java:81)
           at 
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:321)
           at 
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:367)
           at 
io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
           at 
io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
           at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
           at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
           at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
           at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
           at java.lang.Thread.run(Thread.java:748)
   I1012 17:22:43.134315 28520 ReplicaSession.java:138] channel to 
rpc_address(10.1.136.90:8170) closed
   W1012 17:22:43.134387 28522 ReplicaSession.java:418] Channel [id: 
0xea9a05e5, L:/10.1.136.26:33391 ! R:/10.1.136.90:8170] for session 
rpc_address(10.1.136.90:8170) is inactive
   I1012 17:22:43.134445 28522 ReplicaSession.java:274] 
rpc_address(10.1.136.90:8170): mark the session to be disconnected from 
state=CONNECTED
   ```
   
   corresponding logs of primary meta server is as follows:
   ```
   D2024-10-12 17:22:33.123 (1728724953123732943 6bad)   meta.io-thrd.27565: 
server_negotiation.cpp:40:start(): 
SERVER_NEGOTIATION(CLIENT=10.1.136.26:33391): start negotiation
   D2024-10-12 17:22:33.123 (1728724953123744766 6bad)   meta.io-thrd.27565: 
network.cpp:702:on_server_session_accepted(): server session accepted, 
remote_client = 10.1.136.26:33391, current_count = 6
   D2024-10-12 17:22:33.123 (1728724953123748123 6bad)   meta.io-thrd.27565: 
network.cpp:707:on_server_session_accepted(): ip session increased, 
remote_client = 10.1.136.26:33391, current_count = 4
   D2024-10-12 17:22:33.123 (1728724953123964597 6bad)   meta.io-thrd.27565: 
network.cpp:744:on_server_session_disconnected(): session 10.1.136.26:33391 
disconnected, the total client sessions count remains 5
   D2024-10-12 17:22:33.123 (1728724953123970218 6bad)   meta.io-thrd.27565: 
network.cpp:754:on_server_session_disconnected(): client ip 10.1.136.26:33391 
has still 3 of sessions to this server
   W2024-10-12 17:22:33.123 (1728724953123972893 6bad)   meta.io-thrd.27565: 
network.cpp:397:on_failure(): disconnect to remote {}, the socket will be 
lazily closed when the session destroyed
   D2024-10-12 17:22:33.123 (1728724953123997959 6bbb)   
meta.default7.01006bad00011326: negotiation_manager.cpp:376:get_negotiation(): 
negotiation was removed for msg: RPC_NEGOTIATION, 10.1.136.26:33391
   ```
   
   Or `Negotiation failed` with  `ERR_SESSION_RESET` for port **33375** as 
follows:
   ```
   I1012 17:22:23.114938 28573 MetaSession.java:197] query meta got error, rpc 
error(ERR_OK), meta error(ERR_FORWARD_TO_OTHERS), forward 
address(rpc_address(10.1.136.90:8170)), current leader
   (rpc_address(10.1.136.41:8170)), remain retry count(4), need switch 
leader(true), need delay(false)
   I1012 17:22:23.115047 28571 ReplicaSession.java:189] 
rpc_address(10.1.136.90:8170): the session is disconnected, needs to reconnect
   I1012 17:22:23.115684 28572 ReplicaSession.java:204] 
rpc_address(10.1.136.90:8170): start to async connect to target, wait channel 
to active
   I1012 17:22:23.115765 28572 ReplicaSession.java:424] Channel [id: 
0xae9c99dd, L:/10.1.136.26:33375 - R:/10.1.136.90:8170] for session 
rpc_address(10.1.136.90:8170) is active
   
   ......
   
   W1012 17:22:28.117709 28570 ReplicaSession.java:309] 
rpc_address(10.1.136.90:8170): actively close the session because it's not 
responding for 30 seconds
   I1012 17:22:28.117861 28570 ReplicaSession.java:138] channel to 
rpc_address(10.1.136.90:8170) closed
   W1012 17:22:28.117931 28572 ReplicaSession.java:418] Channel [id: 
0xae9c99dd, L:/10.1.136.26:33375 ! R:/10.1.136.90:8170] for session 
rpc_address(10.1.136.90:8170) is inactive
   I1012 17:22:28.118019 28570 MetaSession.java:197] query meta got error, rpc 
error(ERR_SESSION_RESET), meta error(ERR_UNKNOWN), forward address(null), 
current leader(rpc_address(10.1.136.90:8170)), remain retry count(3), need 
switch leader(true), need delay(false)
   E1012 17:22:28.118207 28572 Negotiation.java:85] Negotiation failed
   Java exception follows:
   com.xiaominew.infra.pegasus.rpc.ReplicationException: ERR_SESSION_RESET
           at 
com.xiaominew.infra.pegasus.security.Negotiation$RecvHandler.run(Negotiation.java:81)
           at 
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:321)
           at 
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.markSessionDisconnect(ReplicaSession.java:266)
           at 
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession$DefaultHandler.channelInactive(ReplicaSession.java:419)
           at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
           at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
           at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
           at 
io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:379)
           at 
io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:344)
           at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
           at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
           at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
           at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1299)
           at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
           at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
           at 
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:903)
           at 
io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:768)
           at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
           at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
           at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
           at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
           at java.lang.Thread.run(Thread.java:748)
   I1012 17:22:28.118362 28572 ReplicaSession.java:138] channel to 
rpc_address(10.1.136.90:8170) closed
   I1012 17:22:28.118568 28572 ReplicaSession.java:274] 
rpc_address(10.1.136.90:8170): mark the session to be disconnected from 
state=CONNECTED
   ```
   
   corresponding logs of primary meta server is as follows:
   ```
   D2024-10-12 17:22:23.115 (1728724943115678962 6bad)   meta.io-thrd.27565: 
server_negotiation.cpp:40:start(): 
SERVER_NEGOTIATION(CLIENT=10.1.136.26:33375): start negotiation
   D2024-10-12 17:22:23.115 (1728724943115698138 6bad)   meta.io-thrd.27565: 
network.cpp:702:on_server_session_accepted(): server session accepted, 
remote_client = 10.1.136.26:33375, current_count = 6
   D2024-10-12 17:22:23.115 (1728724943115702576 6bad)   meta.io-thrd.27565: 
network.cpp:707:on_server_session_accepted(): ip session increased, 
remote_client = 10.1.136.26:33375, current_count = 4
   D2024-10-12 17:22:23.115 (1728724943115923801 6bad)   meta.io-thrd.27565: 
network.cpp:744:on_server_session_disconnected(): session 10.1.136.26:33375 
disconnected, the total client sessions count remains 5
   D2024-10-12 17:22:23.115 (1728724943115929441 6bad)   meta.io-thrd.27565: 
network.cpp:754:on_server_session_disconnected(): client ip 10.1.136.26:33375 
has still 3 of sessions to this server
   W2024-10-12 17:22:23.115 (1728724943115959959 6bad)   meta.io-thrd.27565: 
network.cpp:397:on_failure(): disconnect to remote {}, the socket will be 
lazily closed when the session destroyed
   D2024-10-12 17:22:23.115 (1728724943115988613 6bb7)   
meta.default3.01006bad00011320: negotiation_manager.cpp:376:get_negotiation(): 
negotiation was removed for msg: RPC_NEGOTIATION, 10.1.136.26:33375
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to