Udo Kohlmeyer created GEODE-870:
-----------------------------------

             Summary: 2 locators connecting simultaneously both think they are the coordinator even after one is kicked out as a surprise member
                 Key: GEODE-870
                 URL: https://issues.apache.org/jira/browse/GEODE-870
             Project: Geode
          Issue Type: Bug
          Components: membership
            Reporter: Udo Kohlmeyer

The scenario is to permanently remove a locator from the distributed system.

Steps to reproduce:

1. Start 3 locators
2. Start 2 servers
3. Stop locator 1
4. Stop locators 2 and 3
5. Reconfigure locators 2 and 3 without locator 1
6. Restart locators 2 and 3

Both locators think they are the coordinator:

locator-2 log messages:

[info 2015/09/14 15:47:13.844 PDT locator-2 <main> tid=0x1] Membership: lead member is now 192.168.2.7(server-1:67247)<v3>:37028
[info 2015/09/14 15:47:13.850 PDT locator-2 <FD_SOCK Ping thread> tid=0x46] GemFire failure detection is now monitoring 192.168.2.7(server-1:67247)<v3>:37028
[info 2015/09/14 15:47:13.850 PDT locator-2 <main> tid=0x1] This member, 192.168.2.7(locator-2:67411:locator)<ec>:64755, is becoming group coordinator.
[info 2015/09/14 15:47:13.854 PDT locator-2 <main> tid=0x1] Membership: sending new view [[192.168.2.7(locator-2:67411:locator)<ec><v28>:64755|28] [192.168.2.7(server-1:67247)<v3>:37028/7081, 192.168.2.7(server-2:67265)<v4>:43233/7082, 192.168.2.7(locator-2:67411:locator)<ec><v28>:64755/7072]] (3 mbrs)
[info 2015/09/14 15:47:13.866 PDT locator-2 <main> tid=0x1] Admitting member <192.168.2.7(server-1:67247)<v3>:37028>. Now there are 1 non-admin member(s).
[info 2015/09/14 15:47:13.867 PDT locator-2 <main> tid=0x1] Admitting member <192.168.2.7(server-2:67265)<v4>:43233>. Now there are 2 non-admin member(s).
[info 2015/09/14 15:47:13.867 PDT locator-2 <main> tid=0x1] Admitting member <192.168.2.7(locator-2:67411:locator)<ec><v28>:64755>. Now there are 3 non-admin member(s).
[info 2015/09/14 15:47:13.869 PDT locator-2 <main> tid=0x1] Membership: Finished view processing viewID = 28
[info 2015/09/14 15:47:15.178 PDT locator-2 <main> tid=0x1] Starting server location for Distribution Locator on boglesbymac[9092]

locator-3 log messages:

[info 2015/09/14 15:47:13.846 PDT locator-3 <main> tid=0x1] Membership: lead member is now 192.168.2.7(server-1:67247)<v3>:37028
[info 2015/09/14 15:47:13.852 PDT locator-3 <FD_SOCK Ping thread> tid=0x47] GemFire failure detection is now monitoring 192.168.2.7(server-1:67247)<v3>:37028
[info 2015/09/14 15:47:13.853 PDT locator-3 <main> tid=0x1] This member, 192.168.2.7(locator-3:67410:locator)<ec>:9461, is becoming group coordinator.
[info 2015/09/14 15:47:13.855 PDT locator-3 <main> tid=0x1] Membership: sending new view [[192.168.2.7(locator-3:67410:locator)<ec><v28>:9461|28] [192.168.2.7(server-1:67247)<v3>:37028/7081, 192.168.2.7(server-2:67265)<v4>:43233/7082, 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461/7073]] (3 mbrs)
[info 2015/09/14 15:47:13.868 PDT locator-3 <main> tid=0x1] Admitting member <192.168.2.7(server-1:67247)<v3>:37028>. Now there are 1 non-admin member(s).
[info 2015/09/14 15:47:13.868 PDT locator-3 <main> tid=0x1] Admitting member <192.168.2.7(server-2:67265)<v4>:43233>. Now there are 2 non-admin member(s).
[info 2015/09/14 15:47:13.869 PDT locator-3 <main> tid=0x1] Admitting member <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>. Now there are 3 non-admin member(s).
[info 2015/09/14 15:47:13.870 PDT locator-3 <main> tid=0x1] Membership: Finished view processing viewID = 28
[info 2015/09/14 15:47:15.213 PDT locator-3 <main> tid=0x1] Starting server location for Distribution Locator on boglesbymac[9093]

Both server logs show locator-3 being admitted, then expired:

[finest 2015/09/14 15:47:13.888 PDT server-1 <P2P message reader@233ba812> tid=0x71] Membership: Received message from surprise member: <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>. My view number is 28 it is 28
[finest 2015/09/14 15:47:13.888 PDT server-1 <P2P message reader@233ba812> tid=0x71] Membership: Processing surprise addition <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>
[info 2015/09/14 15:47:13.889 PDT server-1 <P2P message reader@233ba812> tid=0x71] Admitting member <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>. Now there are 4 non-admin member(s).
[info 2015/09/14 15:47:13.896 PDT server-1 <Pooled High Priority Message Processor 4> tid=0x5c] Member 192.168.2.7(locator-2:67411:locator)<ec><v28>:64755 is equivalent or in the same redundancy zone.
[info 2015/09/14 15:47:13.900 PDT server-1 <Pooled High Priority Message Processor 5> tid=0x73] Member 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 is equivalent or in the same redundancy zone.
[info 2015/09/14 15:49:03.791 PDT server-1 <Timer-4> tid=0x4d] Membership: expiring membership of surprise member <192.168.2.7(locator-3:67410:locator)<ec><v28>:9461>
[finest 2015/09/14 15:49:03.791 PDT server-1 <Timer-4> tid=0x4d] Membership: destroying < 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 >
[finest 2015/09/14 15:49:03.792 PDT server-1 <Timer-4> tid=0x4d] Membership: added shunned member < 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 >
[finest 2015/09/14 15:49:03.792 PDT server-1 <Timer-4> tid=0x4d] Membership: dispatching uplevel departure event for < 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 >
[info 2015/09/14 15:49:03.793 PDT server-1 <Timer-4> tid=0x4d] Member at 192.168.2.7(locator-3:67410:locator)<ec><v28>:9461 unexpectedly left the distributed cache: not seen in membership view in 100000ms

AFAICT from the logs, locator-3 has no idea it's not the coordinator.

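The duplicate-coordinator condition can be spotted mechanically, because both locators log a "sending new view" message for the same view ID (28) with themselves as sender. Below is a minimal sketch of such a check. The class name CoordinatorConflictCheck and its methods are hypothetical and not part of Geode, and the regular expression assumes the log message format shown in this report.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Geode): flags view IDs announced by more than one member. */
final class CoordinatorConflictCheck {

    // Assumed format, taken from the log excerpts in this report:
    //   Membership: sending new view [[<sender member ID>|<view ID>] [<members>]] (N mbrs)
    // Group 1 captures the sender's member ID, group 2 the view ID.
    private static final Pattern NEW_VIEW =
        Pattern.compile("Membership: sending new view \\[\\[(.+?)\\|(\\d+)\\]");

    /** Maps each view ID to the distinct members that announced it, in log order. */
    static Map<Integer, List<String>> sendersByViewId(List<String> logLines) {
        Map<Integer, List<String>> senders = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = NEW_VIEW.matcher(line);
            if (m.find()) {
                int viewId = Integer.parseInt(m.group(2));
                List<String> members = senders.computeIfAbsent(viewId, k -> new ArrayList<>());
                if (!members.contains(m.group(1))) {
                    members.add(m.group(1));
                }
            }
        }
        return senders;
    }

    /** Returns only the view IDs that two or more different members announced. */
    static Map<Integer, List<String>> conflicts(List<String> logLines) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        sendersByViewId(logLines).forEach((viewId, members) -> {
            if (members.size() > 1) {
                result.put(viewId, members);
            }
        });
        return result;
    }
}
```

Passing the concatenated lines of the locator-2 and locator-3 logs to conflicts() would report view ID 28 with both locators listed as senders. A single healthy coordinator would produce an empty result.
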
The locator-3 process is still alive, and its locator thread is still alive:

"Distribution Locator on boglesbymac[9093]" daemon prio=5 tid=0x00007ffd5e90e000 nid=0x7003 runnable [0x00000001127c8000]
   java.lang.Thread.State: RUNNABLE
	at java.net.PlainSocketImpl.socketAccept(Native Method)
	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
	at java.net.ServerSocket.implAccept(ServerSocket.java:530)
	at java.net.ServerSocket.accept(ServerSocket.java:498)
	at com.gemstone.org.jgroups.stack.tcpserver.TcpServer.run(TcpServer.java:246)
	at com.gemstone.org.jgroups.stack.tcpserver.TcpServer$2.run(TcpServer.java:196)

Also, if a client connects to it, it still answers a findAllServers request with the list of servers. This code:

private void dumpServers() {
  PoolImpl pool = (PoolImpl) PoolManager.find("pool");
  AutoConnectionSourceImpl connectionSource = (AutoConnectionSourceImpl) pool.getConnectionSource();
  List<InetSocketAddress> knownLocators = pool.getLocators();
  ArrayList<ServerLocation> allServers = connectionSource.findAllServers(); // message to locator
  System.out.println("Locator " + knownLocators + " knows about the following "
      + (allServers == null ? 0 : allServers.size()) + " servers:");
  for (ServerLocation server : allServers) {
    System.out.println("\t" + server);
  }
}

Dumps this output from locator-3:

Locator [localhost/127.0.0.1:9093] knows about the following 2 servers:
	192.168.2.7:40402
	192.168.2.7:40401