empiredan opened a new issue, #2133:
URL: https://github.com/apache/incubator-pegasus/issues/2133
After all servers (including both meta/replica servers) in a pegasus cluster
with authentication enabled were restarted, the java client that had connected
to them before would never succeed in connecting to the meta server due to
`Negotiation failed` with `ERR_TIMEOUT` for port **33391** as follows:
```
I1012 17:22:33.122887 28573 MetaSession.java:197] query meta got error, rpc
error(ERR_OK), meta error(ERR_FORWARD_TO_OTHERS), forward
address(rpc_address(10.1.136.90:8170)), current leader
(rpc_address(10.1.136.41:8170)), remain retry count(1), need switch
leader(true), need delay(false)
I1012 17:22:33.123011 28571 ReplicaSession.java:189]
rpc_address(10.1.136.90:8170): the session is disconnected, needs to reconnect
I1012 17:22:33.123739 28522 ReplicaSession.java:204]
rpc_address(10.1.136.90:8170): start to async connect to target, wait channel
to active
I1012 17:22:33.123829 28522 ReplicaSession.java:424] Channel [id:
0xea9a05e5, L:/10.1.136.26:33391 - R:/10.1.136.90:8170] for session
rpc_address(10.1.136.90:8170) is active
......
E1012 17:22:43.134124 28520 Negotiation.java:85] Negotiation failed
Java exception follows:
com.xiaominew.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
at
com.xiaominew.infra.pegasus.security.Negotiation$RecvHandler.run(Negotiation.java:81)
at
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:321)
at
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:367)
at
io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at
io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
I1012 17:22:43.134315 28520 ReplicaSession.java:138] channel to
rpc_address(10.1.136.90:8170) closed
W1012 17:22:43.134387 28522 ReplicaSession.java:418] Channel [id:
0xea9a05e5, L:/10.1.136.26:33391 ! R:/10.1.136.90:8170] for session
rpc_address(10.1.136.90:8170) is inactive
I1012 17:22:43.134445 28522 ReplicaSession.java:274]
rpc_address(10.1.136.90:8170): mark the session to be disconnected from
state=CONNECTED
```
corresponding logs of primary meta server is as follows:
```
D2024-10-12 17:22:33.123 (1728724953123732943 6bad) meta.io-thrd.27565:
server_negotiation.cpp:40:start():
SERVER_NEGOTIATION(CLIENT=10.1.136.26:33391): start negotiation
D2024-10-12 17:22:33.123 (1728724953123744766 6bad) meta.io-thrd.27565:
network.cpp:702:on_server_session_accepted(): server session accepted,
remote_client = 10.1.136.26:33391, current_count = 6
D2024-10-12 17:22:33.123 (1728724953123748123 6bad) meta.io-thrd.27565:
network.cpp:707:on_server_session_accepted(): ip session increased,
remote_client = 10.1.136.26:33391, current_count = 4
D2024-10-12 17:22:33.123 (1728724953123964597 6bad) meta.io-thrd.27565:
network.cpp:744:on_server_session_disconnected(): session 10.1.136.26:33391
disconnected, the total client sessions count remains 5
D2024-10-12 17:22:33.123 (1728724953123970218 6bad) meta.io-thrd.27565:
network.cpp:754:on_server_session_disconnected(): client ip 10.1.136.26:33391
has still 3 of sessions to this server
W2024-10-12 17:22:33.123 (1728724953123972893 6bad) meta.io-thrd.27565:
network.cpp:397:on_failure(): disconnect to remote {}, the socket will be
lazily closed when the session destroyed
D2024-10-12 17:22:33.123 (1728724953123997959 6bbb)
meta.default7.01006bad00011326: negotiation_manager.cpp:376:get_negotiation():
negotiation was removed for msg: RPC_NEGOTIATION, 10.1.136.26:33391
```
Or `Negotiation failed` with `ERR_SESSION_RESET` for port **33375** as
follows:
```
I1012 17:22:23.114938 28573 MetaSession.java:197] query meta got error, rpc
error(ERR_OK), meta error(ERR_FORWARD_TO_OTHERS), forward
address(rpc_address(10.1.136.90:8170)), current leader
(rpc_address(10.1.136.41:8170)), remain retry count(4), need switch
leader(true), need delay(false)
I1012 17:22:23.115047 28571 ReplicaSession.java:189]
rpc_address(10.1.136.90:8170): the session is disconnected, needs to reconnect
I1012 17:22:23.115684 28572 ReplicaSession.java:204]
rpc_address(10.1.136.90:8170): start to async connect to target, wait channel
to active
I1012 17:22:23.115765 28572 ReplicaSession.java:424] Channel [id:
0xae9c99dd, L:/10.1.136.26:33375 - R:/10.1.136.90:8170] for session
rpc_address(10.1.136.90:8170) is active
......
W1012 17:22:28.117709 28570 ReplicaSession.java:309]
rpc_address(10.1.136.90:8170): actively close the session because it's not
responding for 30 seconds
I1012 17:22:28.117861 28570 ReplicaSession.java:138] channel to
rpc_address(10.1.136.90:8170) closed
W1012 17:22:28.117931 28572 ReplicaSession.java:418] Channel [id:
0xae9c99dd, L:/10.1.136.26:33375 ! R:/10.1.136.90:8170] for session
rpc_address(10.1.136.90:8170) is inactive
I1012 17:22:28.118019 28570 MetaSession.java:197] query meta got error, rpc
error(ERR_SESSION_RESET), meta error(ERR_UNKNOWN), forward address(null),
current leader(rpc_address(10.1.136.90:8170)), remain retry count(3), need
switch leader(true), need delay(false)
E1012 17:22:28.118207 28572 Negotiation.java:85] Negotiation failed
Java exception follows:
com.xiaominew.infra.pegasus.rpc.ReplicationException: ERR_SESSION_RESET
at
com.xiaominew.infra.pegasus.security.Negotiation$RecvHandler.run(Negotiation.java:81)
at
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:321)
at
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.markSessionDisconnect(ReplicaSession.java:266)
at
com.xiaominew.infra.pegasus.rpc.async.ReplicaSession$DefaultHandler.channelInactive(ReplicaSession.java:419)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
at
io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:379)
at
io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:344)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
at
io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1299)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
at
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:903)
at
io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:768)
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
I1012 17:22:28.118362 28572 ReplicaSession.java:138] channel to
rpc_address(10.1.136.90:8170) closed
I1012 17:22:28.118568 28572 ReplicaSession.java:274]
rpc_address(10.1.136.90:8170): mark the session to be disconnected from
state=CONNECTED
```
corresponding logs of primary meta server is as follows:
```
D2024-10-12 17:22:23.115 (1728724943115678962 6bad) meta.io-thrd.27565:
server_negotiation.cpp:40:start():
SERVER_NEGOTIATION(CLIENT=10.1.136.26:33375): start negotiation
D2024-10-12 17:22:23.115 (1728724943115698138 6bad) meta.io-thrd.27565:
network.cpp:702:on_server_session_accepted(): server session accepted,
remote_client = 10.1.136.26:33375, current_count = 6
D2024-10-12 17:22:23.115 (1728724943115702576 6bad) meta.io-thrd.27565:
network.cpp:707:on_server_session_accepted(): ip session increased,
remote_client = 10.1.136.26:33375, current_count = 4
D2024-10-12 17:22:23.115 (1728724943115923801 6bad) meta.io-thrd.27565:
network.cpp:744:on_server_session_disconnected(): session 10.1.136.26:33375
disconnected, the total client sessions count remains 5
D2024-10-12 17:22:23.115 (1728724943115929441 6bad) meta.io-thrd.27565:
network.cpp:754:on_server_session_disconnected(): client ip 10.1.136.26:33375
has still 3 of sessions to this server
W2024-10-12 17:22:23.115 (1728724943115959959 6bad) meta.io-thrd.27565:
network.cpp:397:on_failure(): disconnect to remote {}, the socket will be
lazily closed when the session destroyed
D2024-10-12 17:22:23.115 (1728724943115988613 6bb7)
meta.default3.01006bad00011320: negotiation_manager.cpp:376:get_negotiation():
negotiation was removed for msg: RPC_NEGOTIATION, 10.1.136.26:33375
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]