Marco Baldessari created GEODE-9906:
---------------------------------------
Summary: Unable to reconnect a node after OS patching: "15 seconds
have elapsed while waiting for replies"
Key: GEODE-9906
URL: https://issues.apache.org/jira/browse/GEODE-9906
Project: Geode
Issue Type: Bug
Reporter: Marco Baldessari
I have a cluster of 4 nodes in total (3 servers and 1 management node)
that was working properly.
At the beginning of the month we planned to patch the OS, and we started with
the first server node, following this procedure:
- Stop service
- OS patching
- Server restart
- Start service
The service on the first patched node, "serverA", fails to restart with
this error:
Log entries during cluster join:
serverA:
| INFO | region-dm-12 | ache.geode.internal.tcp.Connection |
--> Connection: shared=true ordered=false failed to connect to peer
10.237.110.195( Server serverB:9993)<ec><v127>:1024 because:
java.net.ConnectException: Connection timed out (Connection timed out)
| WARN | region-dm-12 | ache.geode.internal.tcp.Connection | -->
Connection: Attempting reconnect to peer 10.237.110.195( Server
serverB:9993)<ec><v127>:1024
ServerMgmt:
| WARN | pool-3-thread-1 | tributed.internal.ReplyProcessor21
| --> 15 seconds have elapsed while waiting for replies:
<CreateRegionProcessor$CreateRegionReplyProcessor 44180 waiting for 1 replies
from [10.237.110.194( Server serverA:632)<ec><v174>:1024]> on 10.237.110.225(
Management:6033)<ec><v111>:1024 whose current membership list is:
[[10.237.110.196( Server serverC:16805)<ec><v136>:1024, 10.237.110.225(
Management:6033)<ec><v111>:1024, 10.237.110.195( Server
serverB:9993)<ec><v127>:1024, 10.237.110.194( Server
serverA:632)<ec><v174>:1024]]
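To compare the member identifiers across the two log excerpts, they can be pulled apart programmatically. The following is a minimal Python sketch, assuming the identifiers follow the shape seen in the logs above, `host( name:number)<ec><vN>:port`. The function `parse_member_ids` and the field names `pid`, `flags`, `view_id` and `port` are hypothetical names chosen for illustration, not part of Geode, and reading the numbers as process id, view id and membership port is an assumption.

```python
import re
from typing import NamedTuple


class MemberId(NamedTuple):
    host: str
    name: str
    pid: int        # number after the name (assumed to be a process id)
    flags: tuple    # bracketed markers such as "ec", kept verbatim
    view_id: int    # number inside <vN> (assumed to be a view id)
    port: int       # trailing :port (assumed to be the membership port)


_MEMBER_RE = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3})"   # IPv4 host
    r"\(\s*([^:()]+?):(\d+)\)"     # ( name:number )
    r"((?:<[a-z]+>)*)"             # optional markers such as <ec>
    r"<v(\d+)>"                    # <vN>
    r":(\d+)"                      # trailing :port
)


def parse_member_ids(text):
    """Extract every member identifier found in a (possibly wrapped) log excerpt."""
    # Log lines in the report are wrapped, so collapse all whitespace first.
    flat = " ".join(text.split())
    members = []
    for match in _MEMBER_RE.finditer(flat):
        host, name, pid, flags, view_id, port = match.groups()
        members.append(
            MemberId(
                host=host,
                name=name.strip(),
                pid=int(pid),
                flags=tuple(re.findall(r"<([a-z]+)>", flags)),
                view_id=int(view_id),
                port=int(port),
            )
        )
    return members


if __name__ == "__main__":
    sample = """from [10.237.110.194( Server serverA:632)<ec><v174>:1024]> on 10.237.110.225(
    Management:6033)<ec><v111>:1024"""
    for member in parse_member_ids(sample):
        print(member.name, member.host, member.view_id, member.port)
```

Listing the members this way makes it easy to check whether serverA appears in the membership list with the same identifier that the management node is waiting on.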
We verified connectivity between the systems with tcpdump; UDP traffic on
port 1024 is flowing correctly.
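The error logged on serverA is a java.net.ConnectException ("Connection timed out"), which is raised when a TCP connect fails, so a UDP check alone may not cover it. Below is a minimal Python sketch of a TCP reachability probe that could be run from serverA against a peer. The function name `tcp_reachable` is hypothetical, and the TCP port to test is an assumption the reader must supply: the logs above do not show which TCP port the peer connection uses.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port is established within timeout."""
    try:
        # create_connection performs the full TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example (placeholder port, replace with the TCP port actually used by the peer):
# tcp_reachable("10.237.110.195", PEER_TCP_PORT)
```

A result of False with a long delay would point towards packets being dropped (for example by a firewall rule changed during patching), whereas an immediate False usually means the connection was refused.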
We have tried redeploying the service and restarting it numerous times, but we
always get the same error during startup.
Any ideas? Thank you.
Marco.