[ 
https://issues.apache.org/jira/browse/GEODE-6977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan José Ramos Cassella resolved GEODE-6977.
---------------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.10.0

Fixed through 
[869cf6765bedcb7b618b7a7702c9195ee6fbf1f8|https://github.com/apache/geode/commit/869cf6765bedcb7b618b7a7702c9195ee6fbf1f8]

> WAN Auto Discovery Failure
> --------------------------
>
>                 Key: GEODE-6977
>                 URL: https://issues.apache.org/jira/browse/GEODE-6977
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership, wan
>            Reporter: Juan José Ramos Cassella
>            Assignee: Juan José Ramos Cassella
>            Priority: Major
>              Labels: GeodeCommons
>             Fix For: 1.10.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The topology for the test is 4 clusters with 4 locators and 4 servers each, 
> the sites are connected using a ring topology (cluster1 knows about cluster2, 
> cluster2 knows about cluster3, cluster3 knows about cluster4 and cluster4 
> knows about cluster1).
> After the startup finished, it was found out that one locator (out of 
> sixteen) didn't know about another locator in other site.
> According to the docs:
> {noformat}
> Discovery for Multi-Site Systems
> Each Geode cluster in a WAN configuration uses locators to discover remote 
> clusters as well as local members.
> Each locator in a WAN configuration defines a unique distributed-system-id 
> property that identifies the local cluster to which it belongs. A locator 
> uses the remote-locators property to define the addresses of one or more 
> locators in remote clusters to use for WAN distribution.
> When a locator starts up, it contacts each locator that is configured in the 
> remote-locators property to exchange information about the available locators 
> and gateway receivers in the cluster. The locator also shares information 
> about locators and gateway receivers in any other Geode clusters that have 
> connected to the cluster. Connected clusters can then use the shared gateway 
> receiver information to distribute region events according to their 
> configured gateway senders.
> Each time a new locator starts up or an existing locator shuts down, the 
> changed information is broadcast to other connected Geode clusters.
> {noformat}
> I've added some extra logging to the {{LocatorMembershipListenerImpl}} class 
> and reproduced the issue from scratch. In this particular run {{locator_1_1}} 
> ({{rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}) was not found in 
> {{locator_3_3}} ({{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}). The 
> member {{locator 1_1}} is the last one to startup:
> {noformat}
> [info 2019/07/12 12:10:47.737 PDT <vm_34_thr_66_locator_1_3_host1_6234> 
> tid=0x15] Locator started on 10.32.111.135[22488]
> [info 2019/07/12 12:10:47.756 PDT <vm_36_thr_68_locator_2_1_host2_8759> 
> tid=0x17] Locator started on 10.32.109.20[23639]
> [info 2019/07/12 12:10:47.809 PDT <vm_44_thr_76_locator_4_1_host2_8840> 
> tid=0x17] Locator started on 10.32.109.20[24645]
> [info 2019/07/12 12:10:48.800 PDT <vm_41_thr_73_locator_3_2_host1_6306> 
> tid=0x15] Locator started on 10.32.111.135[23280]
> [info 2019/07/12 12:10:50.042 PDT <vm_47_thr_79_locator_4_4_host2_8863> 
> tid=0x15] Locator started on 10.32.109.20[29961]
> [info 2019/07/12 12:10:50.162 PDT <vm_38_thr_70_locator_2_3_host2_8780> 
> tid=0x15] Locator started on 10.32.109.20[27708]
> [info 2019/07/12 12:10:50.172 PDT <vm_46_thr_78_locator_4_3_host2_8860> 
> tid=0x15] Locator started on 10.32.109.20[22615]
> [info 2019/07/12 12:10:50.493 PDT <vm_39_thr_71_locator_2_4_host2_8797> 
> tid=0x17] Locator started on 10.32.109.20[25209]
> [info 2019/07/12 12:10:51.052 PDT <vm_37_thr_69_locator_2_2_host2_8766> 
> tid=0x15] Locator started on 10.32.109.20[24073]
> [info 2019/07/12 12:10:51.067 PDT <vm_35_thr_67_locator_1_4_host1_6246> 
> tid=0x15] Locator started on 10.32.111.135[28694]
> [info 2019/07/12 12:10:51.082 PDT <vm_45_thr_77_locator_4_2_host2_8849> 
> tid=0x15] Locator started on 10.32.109.20[20182]
> [info 2019/07/12 12:10:51.112 PDT <vm_42_thr_74_locator_3_3_host1_6314> 
> tid=0x15] Locator started on 10.32.111.135[22655]
> [info 2019/07/12 12:10:51.709 PDT <vm_33_thr_65_locator_1_2_host1_6229> 
> tid=0x15] Locator started on 10.32.111.135[28293]
> [info 2019/07/12 12:10:52.007 PDT <vm_40_thr_72_locator_3_1_host1_6297> 
> tid=0x15] Locator started on 10.32.111.135[22923]
> [info 2019/07/12 12:10:52.625 PDT <vm_43_thr_75_locator_3_4_host1_6326> 
> tid=0x15] Locator started on 10.32.111.135[26144]
> [info 2019/07/12 12:10:56.183 PDT <vm_32_thr_64_locator_1_1_host1_6225> 
> tid=0x15] Locator started on 10.32.111.135[23955]
> {noformat}
> After {{locator_1_1}} joins the cluster, every locator informs the other 
> locators about this new addition (and also about the other locators known so 
> far), as expected.
> The logs below show that all locators tried to share the addition of locator 
> {{locator_1_1 rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with 
> {{locator_3_3 rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
> {noformat}
> /locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:57.153 PDT 
> <Thread-26> tid=0x7c] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-7[28293]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:57.192 PDT 
> <Thread-24> tid=0x6c] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-7[22488]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:58.011 PDT 
> <Thread-44> tid=0xa7] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-7[28694]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:57.796 PDT 
> <Thread-25> tid=0x6d] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-41[23639]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:58.334 PDT 
> <Thread-64> tid=0xa2] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-41[24073]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:57.368 PDT 
> <Thread-29> tid=0x78] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-41[27708]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:57.086 PDT 
> <Thread-42> tid=0x93] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-41[25209]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:10:59.059 PDT 
> <Thread-23> tid=0x76] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-7[22923]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:58.027 PDT 
> <Thread-25> tid=0x6c] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-7[23280]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:58.890 PDT 
> <Thread-30> tid=0x7f] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-7[26144]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:57.506 PDT 
> <Thread-23> tid=0x74] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-41[24645]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:57.179 PDT 
> <Thread-25> tid=0x77] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-41[20182]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:10:59.006 PDT 
> <Thread-53> tid=0xb1] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-41[22615]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:57.507 PDT 
> <Thread-28> tid=0xa1] TcpClient sending 
> LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7[23955] Source Locator : 
> rs-GEM-2603-1105a2i3large-hydra-client-41[29961]} to 
> rs-GEM-2603-1105a2i3large-hydra-client-7/10.32.111.135:22655
> {noformat}
> One second afterwards, every single locator also logs that it failed to share 
> the information about {{locator_1_1 
> rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}} with {{locator_3_3 
> rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}}:
> {noformat}
> ./locatorgemfire_1_2_6229/system.log:[debug 2019/07/12 12:10:58.155 PDT 
> <Thread-26> tid=0x7c] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_1_3_6234/system.log:[debug 2019/07/12 12:10:58.196 PDT 
> <Thread-24> tid=0x6c] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_1_4_6246/system.log:[debug 2019/07/12 12:10:59.012 PDT 
> <Thread-44> tid=0xa7] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_1_8759/system.log:[debug 2019/07/12 12:10:58.798 PDT 
> <Thread-25> tid=0x6d] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_2_8766/system.log:[debug 2019/07/12 12:10:59.336 PDT 
> <Thread-64> tid=0xa2] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_3_8780/system.log:[debug 2019/07/12 12:10:58.371 PDT 
> <Thread-29> tid=0x78] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_2_4_8797/system.log:[debug 2019/07/12 12:10:58.093 PDT 
> <Thread-42> tid=0x93] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_1_6297/system.log:[debug 2019/07/12 12:11:00.061 PDT 
> <Thread-23> tid=0x76] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_2_6306/system.log:[debug 2019/07/12 12:10:59.029 PDT 
> <Thread-25> tid=0x6c] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_3_4_6326/system.log:[debug 2019/07/12 12:10:59.903 PDT 
> <Thread-30> tid=0x7f] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_1_8840/system.log:[debug 2019/07/12 12:10:58.514 PDT 
> <Thread-23> tid=0x74] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_2_8849/system.log:[debug 2019/07/12 12:10:58.368 PDT 
> <Thread-25> tid=0x77] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_3_8860/system.log:[debug 2019/07/12 12:11:00.007 PDT 
> <Thread-53> tid=0xb1] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> ./locatorgemfire_4_4_8863/system.log:[debug 2019/07/12 12:10:58.515 PDT 
> <Thread-28> tid=0xa1] Locator Membership listener could not exchange locator 
> information rs-GEM-2603-1105a2i3large-hydra-client-7:23955 with 
> rs-GEM-2603-1105a2i3large-hydra-client-7:22655
> {noformat}
> This is also backed up by the fact that {{locator_3_3}} 
> {{rs-GEM-2603-1105a2i3large-hydra-client-7[22655]}} is the only one that 
> never logs the reception of the {{LocatorJoinMessage}} for {{locator_1_1 
> rs-GEM-2603-1105a2i3large-hydra-client-7[23955]}}:
> {noformat}
> ##### grep -lR "LocatorJoinMessage{distributedSystemId=1 
> locators=rs-GEM-2603-1105a2i3large-hydra-client-7\[23955\]" . | sort
> ./locatorgemfire_1_1_6225/system.log
> ./locatorgemfire_1_2_6229/system.log
> ./locatorgemfire_1_3_6234/system.log
> ./locatorgemfire_1_4_6246/system.log
> ./locatorgemfire_2_1_8759/system.log
> ./locatorgemfire_2_2_8766/system.log
> ./locatorgemfire_2_3_8780/system.log
> ./locatorgemfire_2_4_8797/system.log
> ./locatorgemfire_3_1_6297/system.log
> ./locatorgemfire_3_2_6306/system.log
> ./locatorgemfire_3_4_6326/system.log
> ./locatorgemfire_4_1_8840/system.log
> ./locatorgemfire_4_2_8849/system.log
> ./locatorgemfire_4_3_8860/system.log
> ./locatorgemfire_4_4_8863/system.log
> {noformat}
> The exception when trying to exchange the locator information 
> {{rs-GEM-2603-1105a2i3large-hydra-client-7:23955}} with 
> {{rs-GEM-2603-1105a2i3large-hydra-client-7:22655}} is always the same on 
> every single locator:
> {noformat}
> java.net.SocketTimeoutException: connect timed out
>       at java.net.PlainSocketImpl.socketConnect(Native Method)
>       at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>       at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>       at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>       at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>       at java.net.Socket.connect(Socket.java:589)
>       at 
> org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:898)
>       at 
> org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:839)
>       at 
> org.apache.geode.internal.net.SocketCreator.connect(SocketCreator.java:828)
>       at 
> org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:205)
>       at 
> org.apache.geode.cache.client.internal.locator.wan.LocatorMembershipListenerImpl$1.run(LocatorMembershipListenerImpl.java:120)
> {noformat}
> Even though there are no errors on the logs for {{locator_3_3}} within the 
> relevant time frame, the statistics show that the cpu is at 100% and that 
> there are no samples taken for around 5 seconds, probably due to a short JVM 
> pause or resource contention, that's probably the reason why the connection 
> from other locators failed. That said, there were no more additions/deletions 
> of extra locators within any of the clusters either (which would have fired 
> another round of messages from every locator to every other known locator 
> about the addition/deletion), so in the end "it is expected" that 
> {{locator_3_3}} didn't have the information about {{locator_1_1}} in its 
> internal map of known locators. 
> There's no retry logic within {{LocatorJoinMessage.locatorJoined}} and the 
> timeout of 1 second seems to be too short: any locator might be busy for a 
> short period of time and, as such, the information about the new locator will 
> be missed until there's another locator membership change; this ticket is to 
> change that: increase the timeout and retry a couple of times by default.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to