[ https://issues.apache.org/jira/browse/GEODE-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Juan José Ramos Cassella resolved GEODE-6977. --------------------------------------------- Resolution: Fixed Fix Version/s: 1.10.0 Fixed through [869cf6765bedcb7b618b7a7702c9195ee6fbf1f8|https://github.com/apache/geode/commit/869cf6765bedcb7b618b7a7702c9195ee6fbf1f8] > WAN Auto Discovery Failure > -------------------------- > > Key: GEODE-6977 > URL: https://issues.apache.org/jira/browse/GEODE-6977 > Project: Geode > Issue Type: Improvement > Components: membership, wan > Reporter: Juan José Ramos Cassella > Assignee: Juan José Ramos Cassella > Priority: Major > Labels: GeodeCommons > Fix For: 1.10.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > The topology for the test is 4 clusters with 4 locators and 4 servers each, > the sites are connected using a ring topology (cluster1 knows about cluster2, > cluster2 knows about cluster3, cluster3 knows about cluster4 and cluster4 > knows about cluster1). > After the startup finished, it was found out that one locator (out of > sixteen) didn't know about another locator in other site. > According to the docs: > {noformat} > Discovery for Multi-Site Systems > Each Geode cluster in a WAN configuration uses locators to discover remote > clusters as well as local members. > Each locator in a WAN configuration defines a unique distributed-system-id > property that identifies the local cluster to which it belongs. A locator > uses the remote-locators property to define the addresses of one or more > locators in remote clusters to use for WAN distribution. > When a locator starts up, it contacts each locator that is configured in the > remote-locators property to exchange information about the available locators > and gateway receivers in the cluster. The locator also shares information > about locators and gateway receivers in any other Geode clusters that have > connected to the cluster. Connected clusters can then use the shared gateway > receiver information to distribute region events according to their > configured gateway senders. > Each time a new locator starts up or an existing locator shuts down, the > changed information is broadcast to other connected Geode clusters. > {noformat} > I've added some extra logging to the {{LocatorMembershipListenerImpl}} class > and reproduced the issue from scratch. In this particular run {{locator_1_1}} > ({{rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}) was not found in > {{locator_3_3}} ({{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}). The > member {{locator 1_1}} is the last one to startup: > {noformat} > [info 2019/07/12 12:10:47.737 PDT <vm_34_thr_66_locator_1_3_host1_6234> > tid=0x15] Locator started on 10.32.111.135[22488] > [info 2019/07/12 12:10:47.756 PDT <vm_36_thr_68_locator_2_1_host2_8759> > tid=0x17] Locator started on 10.32.109.20[23639] > [info 2019/07/12 12:10:47.809 PDT <vm_44_thr_76_locator_4_1_host2_8840> > tid=0x17] Locator started on 10.32.109.20[24645] > [info 2019/07/12 12:10:48.800 PDT <vm_41_thr_73_locator_3_2_host1_6306> > tid=0x15] Locator started on 10.32.111.135[23280] > [info 2019/07/12 12:10:50.042 PDT <vm_47_thr_79_locator_4_4_host2_8863> > tid=0x15] Locator started on 10.32.109.20[29961] > [info 2019/07/12 12:10:50.162 PDT <vm_38_thr_70_locator_2_3_host2_8780> > tid=0x15] Locator started on 10.32.109.20[27708] > [info 2019/07/12 12:10:50.172 PDT <vm_46_thr_78_locator_4_3_host2_8860> > tid=0x15] Locator started on 10.32.109.20[22615] > [info 2019/07/12 12:10:50.493 PDT <vm_39_thr_71_locator_2_4_host2_8797> > tid=0x17] Locator started on 10.32.109.20[25209] > [info 2019/07/12 12:10:51.052 PDT <vm_37_thr_69_locator_2_2_host2_8766> > tid=0x15] Locator started on 10.32.109.20[24073] > [info 2019/07/12 12:10:51.067 PDT <vm_35_thr_67_locator_1_4_host1_6246> > tid=0x15] Locator started on 10.32.111.135[28694] > [info 2019/07/12 12:10:51.082 PDT <vm_45_thr_77_locator_4_2_host2_8849> > tid=0x15] Locator started on 10.32.109.20[20182] > [info 2019/07/12 12:10:51.112 PDT <vm_42_thr_74_locator_3_3_host1_6314> > tid=0x15] Locator started on 10.32.111.135[22655] > [info 2019/07/12 12:10:51.709 PDT <vm_33_thr_65_locator_1_2_host1_6229> > tid=0x15] Locator started on 10.32.111.135[28293] > [info 2019/07/12 12:10:52.007 PDT <vm_40_thr_72_locator_3_1_host1_6297> > tid=0x15] Locator started on 10.32.111.135[22923] > [info 2019/07/12 12:10:52.625 PDT <vm_43_thr_75_locator_3_4_host1_6326> > tid=0x15] Locator started on 10.32.111.135[26144] > [info 2019/07/12 12:10:56.183 PDT <vm_32_thr_64_locator_1_1_host1_6225> > tid=0x15] Locator started on 10.32.111.135[23955] > {noformat} > After {{locator_1_1}} joins the cluster, every locator informs the other > locators about this new addition (and also about the other locators known so > far), as expected. > The logs below show that all locators tried to share the addition of locator > {{locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with > {{locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}: > {noformat} > /locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:57.153 PDT > <Thread-26> tid=0x7c] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-7[28293]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:57.192 PDT > <Thread-24> tid=0x6c] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-7[22488]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:58.011 PDT > <Thread-44> tid=0xa7] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-7[28694]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:57.796 PDT > <Thread-25> tid=0x6d] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-41[23639]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:58.334 PDT > <Thread-64> tid=0xa2] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-41[24073]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:57.368 PDT > <Thread-29> tid=0x78] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-41[27708]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:57.086 PDT > <Thread-42> tid=0x93] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-41[25209]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:10:59.059 PDT > <Thread-23> tid=0x76] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-7[22923]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:58.027 PDT > <Thread-25> tid=0x6c] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-7[23280]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:58.890 PDT > <Thread-30> tid=0x7f] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-7[26144]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:57.506 PDT > <Thread-23> tid=0x74] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-41[24645]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:57.179 PDT > <Thread-25> tid=0x77] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-41[20182]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:10:59.006 PDT > <Thread-53> tid=0xb1] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-41[22615]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:57.507 PDT > <Thread-28> tid=0xa1] TcpClient sending > LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : > rs-GEM-2603-1105a2i3large-hydra-client-41[29961]} to > rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655 > {noformat} > One second afterwards, every single locator also logs that it failed to share > the information about {{locator_1_1 > rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with {{locator_3_3 > rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}: > {noformat} > ./locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:58.155 PDT > <Thread-26> tid=0x7c] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:58.196 PDT > <Thread-24> tid=0x6c] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:59.012 PDT > <Thread-44> tid=0xa7] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:58.798 PDT > <Thread-25> tid=0x6d] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:59.336 PDT > <Thread-64> tid=0xa2] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:58.371 PDT > <Thread-29> tid=0x78] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:58.093 PDT > <Thread-42> tid=0x93] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:11:00.061 PDT > <Thread-23> tid=0x76] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:59.029 PDT > <Thread-25> tid=0x6c] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:59.903 PDT > <Thread-30> tid=0x7f] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:58.514 PDT > <Thread-23> tid=0x74] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:58.368 PDT > <Thread-25> tid=0x77] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:11:00.007 PDT > <Thread-53> tid=0xb1] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:58.515 PDT > <Thread-28> tid=0xa1] Locator Membership listener could not exchange locator > information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with > rs-GEM-2603-1105a2i3large-hydra-client-7:22655 > {noformat} > This is also backed up by the fact that {{locator_3_3}} > {{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}} is the only one that > never logs the reception of the {{LocatorJoinMessage}} for {{locator_1_1 > rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}: > {noformat} > ##### grep -lR "LocatorJoinMessage{distributedSystemId=1 > locators=rs-GEM-2603-1105a2i3large-hydra-client-7\[23955\]" . | sort > ./locatorgemfire_1_1_6225/system.log > ./locatorgemfire_1_2_6229/system.log > ./locatorgemfire_1_3_6234/system.log > ./locatorgemfire_1_4_6246/system.log > ./locatorgemfire_2_1_8759/system.log > ./locatorgemfire_2_2_8766/system.log > ./locatorgemfire_2_3_8780/system.log > ./locatorgemfire_2_4_8797/system.log > ./locatorgemfire_3_1_6297/system.log > ./locatorgemfire_3_2_6306/system.log > ./locatorgemfire_3_4_6326/system.log > ./locatorgemfire_4_1_8840/system.log > ./locatorgemfire_4_2_8849/system.log > ./locatorgemfire_4_3_8860/system.log > ./locatorgemfire_4_4_8863/system.log > {noformat} > The exception when trying to exchange the locator information > {{rs-GEM-2603-1105a2i3large-hydra-client-7:23955}} with > {{rs-GEM-2603-1105a2i3large-hydra-client-7:22655}} is always the same on > every single locator: > {noformat} > java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at > org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:898) > at > org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:839) > at > org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:828) > at > org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:205) > at > org.apache.geode.cache.client.internal.locator.wan.LocatorMembershipListenerImpl$1.run(LocatorMembershipListenerImpl.java:120) > {noformat} > Even though there are no errors on the logs for {{locator_3_3}} within the > relevant time frame, the statistics show that the cpu is at 100% and that > there are no samples taken for around 5 seconds, probably due to a short JVM > pause or resource contention, that's probably the reason why the connection > from other locators failed. That said, there were no more additions/deletions > of extra locators within any of the clusters either (which would have fired > another round of messages from every locator to every other known locator > about the addition/deletion), so in the end "it is expected" that > {{locator_3_3}} didn't have the information about {{locator_1_1}} in its > internal map of known locators. > There's no retry logic within {{LocatorJoinMessage.locatorJoined}} and the > timeout of 1 second seems to be too short: any locator might be busy for a > short period of time and, as such, the information about the new locator will > be missed until there's another locator membership change; this ticket is to > change that: increase the timeout and retry a couple of times by default. -- This message was sent by Atlassian JIRA (v7.6.14#76016)