[ https://issues.apache.org/jira/browse/GEODE-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775637#comment-16775637 ]
ASF subversion and git services commented on GEODE-6423:
--------------------------------------------------------

Commit 8b29d9eb6d759435d8d9e39575f2f0edff8e81c1 in geode's branch refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=8b29d9e ]

GEODE-6423 availability checks sometimes immediately initiate removal

Ensure that the availability check is performed for the contracted
member-timeout period. This allows a suspect to survive the check if it's
having a momentary glitch like a brief garbage-collection, or if there is a
short network outage.

This change caused some "reconnect" tests to fail due to short
auto-reconnect intervals letting disconnected nodes start reconnecting
before suspect processing completed on the force-disconnected nodes. I've
fixed this by reinitializing the UUID part of the membership ID in
JGroupsMessenger during reconnect attempts.


> availability checks sometimes immediately initiate removal
> ----------------------------------------------------------
>
>                 Key: GEODE-6423
>                 URL: https://issues.apache.org/jira/browse/GEODE-6423
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> If the network goes down the JGroupsMessenger service initiates suspect
> processing when it tries to send messages. In 1.8 this seems to initiate
> immediate removal of the suspect.
> ioexception sending udp message initiates suspicion
> suspect processing initiates a final check
> the final check fails immediately (it's using a timed Socket.connect() which
> fails immediately)
> the member is declared dead
> {noformat}
> [info 2019/02/13 17:44:59.366 CST perf157-130-167-server1 <Geode Failure
> Detection thread 3> tid=0xc2] received suspect message from myself for
> 192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000:
> Unable to send messages to this member via JGroups
> [info 2019/02/13 17:44:59.368 CST perf157-130-167-server1 <Geode Failure
> Detection thread 4> tid=0xc3] Performing final check for suspect member
> 192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000
> reason=Unable to send messages to this member via JGroups
> [info 2019/02/13 17:44:59.368 CST perf157-130-167-server1 <Geode Failure
> Detection thread 5> tid=0xc4] Performing final check for suspect member
> 192.168.130.167(perf157-130-167-worker1:225794)<v3>:16202 reason=Unable to
> send messages to this member via JGroups
> [info 2019/02/13 17:44:59.368 CST perf157-130-167-server1 <Geode Failure
> Detection thread 4> tid=0xc3] Failure detection is now watching
> 192.168.130.167(perf157-130-167-server1:225263)<v1>:16200
> [info 2019/02/13 17:44:59.368 CST perf157-130-167-server1 <Geode Failure
> Detection thread 5> tid=0xc4] Failure detection is now watching
> 192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000
> [info 2019/02/13 17:44:59.368 CST perf157-130-167-server1 <Geode Failure
> Detection thread 3> tid=0xc2] received suspect message from myself for
> 192.168.130.167(perf157-130-167-server2:225522)<v2>:16201: Unable to send
> messages to this member via JGroups
> [info 2019/02/13 17:44:59.369 CST perf157-130-167-server1 <Geode Failure
> Detection thread 6> tid=0xc5] Performing final check for suspect member
> 192.168.130.167(perf157-130-167-server2:225522)<v2>:16201 reason=Unable to
> send messages to this member via JGroups
> [info 2019/02/13 17:44:59.369 CST perf157-130-167-server1 <Geode Failure
> Detection thread 6> tid=0xc5] Failure detection is now watching
> 192.168.130.167(perf157-130-167-server1:225263)<v1>:16200
> [info 2019/02/13 17:44:59.371 CST perf157-130-167-server1 <Geode Failure
> Detection thread 5> tid=0xc4] Final check failed for member
> 192.168.130.167(perf157-130-167-worker1:225794)<v3>:16202
> [info 2019/02/13 17:44:59.371 CST perf157-130-167-server1 <Geode Failure
> Detection thread 5> tid=0xc4] Requesting removal of suspect member
> 192.168.130.167(perf157-130-167-worker1:225794)<v3>:16202
> [info 2019/02/13 17:44:59.371 CST perf157-130-167-server1 <Geode Failure
> Detection thread 4> tid=0xc3] Final check failed for member
> 192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000
> [info 2019/02/13 17:44:59.371 CST perf157-130-167-server1 <Geode Failure
> Detection thread 4> tid=0xc3] Requesting removal of suspect member
> 192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000
> [info 2019/02/13 17:44:59.371 CST perf157-130-167-server1 <Geode Failure
> Detection thread 4> tid=0xc3] This member is becoming the membership
> coordinator with address
> 192.168.130.167(perf157-130-167-server1:225263)<v1>:16200
> [info 2019/02/13 17:44:59.371 CST perf157-130-167-server1 <Geode Failure
> Detection thread 6> tid=0xc5] Final check failed for member
> 192.168.130.167(perf157-130-167-server2:225522)<v2>:16201
> [info 2019/02/13 17:44:59.373 CST perf157-130-167-server1 <Geode Failure
> Detection thread 6> tid=0xc5] Requesting removal of suspect member
> 192.168.130.167(perf157-130-167-server2:225522)<v2>:16201
> [info 2019/02/13 17:44:59.376 CST perf157-130-167-server1 <Geode Failure
> Detection thread 4> tid=0xc3] ViewCreator starting
> on:192.168.130.167(perf157-130-167-server1:225263)<v1>:16200
> [info 2019/02/13 17:44:59.376 CST perf157-130-167-server1 <Geode Membership
> View Creator> tid=0xc6] View Creator thread is starting
> [info 2019/02/13 17:44:59.377 CST perf157-130-167-server1 <Geode Membership
> View Creator> tid=0xc6]
> 192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000 had a
> weight of 3
> [info 2019/02/13 17:44:59.377 CST perf157-130-167-server1 <Geode Membership
> View Creator> tid=0xc6]
> 192.168.130.167(perf157-130-167-worker1:225794)<v3>:16202 had a weight of 10
> [info 2019/02/13 17:44:59.377 CST perf157-130-167-server1 <Geode Membership
> View Creator> tid=0xc6] preparing new view
> View[192.168.130.167(perf157-130-167-server1:225263)<v1>:16200|10] members:
> [192.168.130.167(perf157-130-167-server1:225263)<v1>:16200{lead},
> 192.168.130.167(perf157-130-167-server2:225522)<v2>:16201] crashed:
> [192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000,
> 192.168.130.167(perf157-130-167-worker1:225794)<v3>:16202]
> [info 2019/02/13 17:45:03.627 CST perf157-130-167-server1 <unicast
> receiver,perf157-130-167-62066> tid=0x21] received suspect message from
> 192.168.130.167(perf157-130-167-worker1:225794)<v3>:16202 for
> 192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000:
> Unable to send messages to this member via JGroups
> [info 2019/02/13 17:45:03.718 CST perf157-130-167-server1 <unicast
> receiver,perf157-130-167-62066> tid=0x21] Membership received a request to
> remove 192.168.130.167(perf157-130-167-server1:225263)<v1>:16200 from
> 192.168.130.167(perf157-130-167-locator1:225065:locator)<ec><v0>:41000
> reason=Unable to send messages to this member via JGroups
> [severe 2019/02/13 17:45:03.719 CST perf157-130-167-server1 <unicast
> receiver,perf157-130-167-62066> tid=0x21] Membership service failure: Unable
> to send messages to this member via JGroups
> org.apache.geode.ForcedDisconnectException: Unable to send messages to this
> member via JGroups
> {noformat}
>
> We expect the final check to respect the member-timeout setting.
>

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
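
For readers following the issue, the behaviour the commit message describes (keep checking a suspect for the whole member-timeout period instead of giving up when the first timed Socket.connect() fails immediately) can be sketched roughly as below. This is only an illustration under stated assumptions, not the Geode implementation: the class FinalCheckSketch, the methods checkForFullTimeout and tcpProbe, and the retry-delay parameter are hypothetical names introduced here. Only Socket.connect() and the member-timeout setting come from the issue text.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; none of these names come from the Geode code base.
class FinalCheckSketch {

  /**
   * Probes repeatedly until the probe succeeds or memberTimeoutMs has elapsed.
   * A probe that fails instantly therefore cannot end the check early.
   * retryDelayMs must not be negative.
   */
  static boolean checkForFullTimeout(BooleanSupplier probe, long memberTimeoutMs,
      long retryDelayMs) {
    long deadline = System.nanoTime() + memberTimeoutMs * 1_000_000L;
    while (true) {
      if (probe.getAsBoolean()) {
        return true; // the suspect answered, so it survives the check
      }
      long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
      if (remainingMs <= 0) {
        return false; // the whole period passed without a successful probe
      }
      try {
        // Wait before the next attempt, but never sleep past the deadline.
        Thread.sleep(Math.min(retryDelayMs, remainingMs));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false; // treat interruption as a failed check in this sketch
      }
    }
  }

  /** One timed TCP connect attempt; returns false on any I/O failure. */
  static boolean tcpProbe(String host, int port, int connectTimeoutMs) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    int[] attempts = {0};
    // Simulated suspect: unreachable for the first two probes, then reachable.
    boolean survived = checkForFullTimeout(() -> ++attempts[0] > 2, 1000, 50);
    System.out.println("survived=" + survived + " after " + attempts[0] + " probes");
  }
}
```

Against a real endpoint the probe would be something like `() -> tcpProbe(host, port, 500)`, with the timeout argument taken from the configured member-timeout. Passing the probe in as a BooleanSupplier keeps the deadline loop testable without a network.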