[ https://issues.apache.org/jira/browse/GEODE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236482#comment-17236482 ]
Dan Smith commented on GEODE-8739:
----------------------------------

I think these are the most interesting lines from the log files, where each locator decides that it should be the coordinator. What's weird is that in the "Discovery state" message, each one has the same list of registrants and the same view, but they have different possible coordinators.

{noformat}
gemfirecluster-sample-locator-0.log: [info 2020/11/17 12:22:12.973 GMT <main> tid=0x1] using findCoordinatorFromView
gemfirecluster-sample-locator-0.log: [info 2020/11/17 12:22:12.974 GMT <main> tid=0x1] searching for coordinator in findCoordinatorFromView
gemfirecluster-sample-locator-0.log: [info 2020/11/17 12:22:12.974 GMT <main> tid=0x1] sending FindCoordinatorRequests to [192.168.68.28(gemfirecluster-sample-server-0:1)<v2>:41000, 192.168.149.18(gemfirecluster-sample-server-1:1)<v2>:41000, 192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000]
gemfirecluster-sample-locator-0.log: [info 2020/11/17 12:22:15.975 GMT <main> tid=0x1] findCoordinatorFromView processing FindCoordinatorResponse(coordinator=192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000; senderId=192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000)
gemfirecluster-sample-locator-0.log: [info 2020/11/17 12:22:15.976 GMT <main> tid=0x1] Discovery state after looking for membership coordinator is locatorsContacted=2; findInViewResponses=0; alreadyTried=[192.168.149.18(gemfirecluster-sample-server-1:1)<v2>:41000, 192.168.68.28(gemfirecluster-sample-server-0:1)<v2>:41000, 192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000]; registrants=[192.168.64.210(gemfirecluster-sample-locator-0:1:locator)<ec>:41000, 192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000]; possibleCoordinator=192.168.64.210(gemfirecluster-sample-locator-0:1:locator)<ec>:41000; viewId=-1; hasContactedAJoinedLocator=false; view=View[192.168.149.10(gemfirecluster-sample-locator-0:1:locator)<ec><v0>:41000|-1] members: [192.168.68.28(gemfirecluster-sample-server-0:1)<v2>:41000{lead}, 192.168.149.18(gemfirecluster-sample-server-1:1)<v2>:41000]; responses=[]
gemfirecluster-sample-locator-0.log: [info 2020/11/17 12:22:15.976 GMT <main> tid=0x1] found possible coordinator 192.168.64.210(gemfirecluster-sample-locator-0:1:locator)<ec>:41000
gemfirecluster-sample-locator-0.log: [info 2020/11/17 12:22:15.976 GMT <main> tid=0x1] This member is becoming the membership coordinator with address 192.168.64.210(gemfirecluster-sample-locator-0:1:locator)<ec>:41000
gemfirecluster-sample-locator-1.log: [info 2020/11/17 12:22:16.000 GMT <main> tid=0x1] using findCoordinatorFromView
gemfirecluster-sample-locator-1.log: [info 2020/11/17 12:22:16.001 GMT <main> tid=0x1] searching for coordinator in findCoordinatorFromView
gemfirecluster-sample-locator-1.log: [info 2020/11/17 12:22:16.002 GMT <main> tid=0x1] sending FindCoordinatorRequests to [192.168.68.28(gemfirecluster-sample-server-0:1)<v2>:41000, 192.168.149.18(gemfirecluster-sample-server-1:1)<v2>:41000, 192.168.64.210(gemfirecluster-sample-locator-0:1:locator)<ec>:41000]
gemfirecluster-sample-locator-1.log: [info 2020/11/17 12:22:19.003 GMT <main> tid=0x1] findCoordinatorFromView processing FindCoordinatorResponse(coordinator=192.168.64.210(gemfirecluster-sample-locator-0:1:locator)<ec>:41000; senderId=192.168.64.210(gemfirecluster-sample-locator-0:1:locator)<ec>:41000)
gemfirecluster-sample-locator-1.log: [info 2020/11/17 12:22:19.004 GMT <main> tid=0x1] findCoordinatorFromView's best guess is now 192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000
gemfirecluster-sample-locator-1.log: [info 2020/11/17 12:22:19.005 GMT <main> tid=0x1] Discovery state after looking for membership coordinator is locatorsContacted=2; findInViewResponses=0; alreadyTried=[192.168.149.18(gemfirecluster-sample-server-1:1)<v2>:41000, 192.168.68.28(gemfirecluster-sample-server-0:1)<v2>:41000]; registrants=[192.168.64.210(gemfirecluster-sample-locator-0:1:locator)<ec>:41000, 192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000]; possibleCoordinator=192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000; viewId=-1; hasContactedAJoinedLocator=false; view=View[192.168.149.10(gemfirecluster-sample-locator-0:1:locator)<ec><v0>:41000|-1] members: [192.168.68.28(gemfirecluster-sample-server-0:1)<v2>:41000{lead}, 192.168.149.18(gemfirecluster-sample-server-1:1)<v2>:41000]; responses=[]
gemfirecluster-sample-locator-1.log: [info 2020/11/17 12:22:19.005 GMT <main> tid=0x1] found possible coordinator 192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000
gemfirecluster-sample-locator-1.log: [info 2020/11/17 12:22:19.005 GMT <main> tid=0x1] This member is becoming the membership coordinator with address 192.168.149.63(gemfirecluster-sample-locator-1:1:locator)<ec>:41000
{noformat}

> Split brain when locators exhaust join attempts on non existant servers
> -----------------------------------------------------------------------
>
>                 Key: GEODE-8739
>                 URL: https://issues.apache.org/jira/browse/GEODE-8739
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Jason Huynh
>            Priority: Major
>         Attachments: exportedLogs_locator-0.zip, exportedLogs_locator-1.zip
>
>
> The hypothesis: "if there is a locator view .dat file with several
> non-existent servers then the locators will waste all of their join attempts
> on the servers instead of finding each other"
> The scenario is that a test/user attempts to recreate a cluster with existing
> .dat and persistent files. The locators are spun up in parallel and, from the
> analysis, it looks like they are able to communicate with each other, but
> then end up forming their own ds.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)