MG> Waiting for Yuval to answer:
MG> In a Docker container? If so, how do you resolve hosts (.hosts file, Kubernetes, Swarm)?

________________________________
From: VALLASTER Stefan <stefan.vallas...@frequentis.com>
Sent: Tuesday, April 10, 2018 4:46 AM
To: user@zookeeper.apache.org
Subject: RE: IllegalStateException: instance must be started before calling this method

How can we get this topic moving forward again? The mail thread seems to have stalled for no obvious reason. Is anyone willing to help us out here again?

Best regards
Stefan

-----Original Message-----
From: Martin Gainty [mailto:mgai...@hotmail.com]
Sent: Sunday, 25 March 2018 22:58
To: user@zookeeper.apache.org
Subject: Re: IllegalStateException: instance must be started before calling this method

In a Docker container? If so, how do you resolve hosts (.hosts file, Kubernetes, Swarm)?

M

________________________________
From: Yuval Dori <yuval.d...@gigaspaces.com>
Sent: Sunday, March 25, 2018 5:34 AM
To: user@zookeeper.apache.org
Cc: Yuval Dori
Subject: IllegalStateException: instance must be started before calling this method

Hi,

Our customer is using ZK 3.4.3 and got "IllegalStateException: instance must be started before calling this method" with the following stack trace:

2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194; [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler]; Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:542)

This is the description:

At first, ps-main.1 is started as primary on WSOSTSLXBMS01 GSC 3.

24.01.2018 11:31:04,947 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Space instance is Primary

After some relocation of ps-main instances, a new backup instance is started on the same GSC.

24.01.2018 11:32:59,886 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T146 SPACE/GS-LRMI Connection-pool-1-thread-6/1 Space instance [ frqMain_container1:frqMain ] has been selected as Backup

Later, a failover test is performed by shutting down WSOSTSLXBMS02 and WSOSTSLXMSS02. We would expect the ps-main.1 backup to be elected as primary, because the former primary was located on WSOSTSLXBMS02, but no election happened. In the end, no primary of ps-main.1 was running anywhere in the cluster.

We found a suspicious exception in the logs:

24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/frqMain_container1:frqMain Lost leadership [sessionId=0x36128253b820001]
24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T98 SPACE/Curator-Framework-0/CuratorFrameworkImpl backgroundOperationsLoop exiting
24.01.2018 11:32:44,990 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/1 Quiesce state set to SUSPENDED
24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ZooKeeper Session: 0x36128253b820001 closed
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector The leader threw an exception
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector java.lang.IllegalStateException: instance must be started before calling this method
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:339)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:123)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:449)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:466)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:65)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:246)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:240)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector   at java.lang.Thread.run(Thread.java:748)
24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 SPACE/Curator-LeaderSelector-0/LeaderSelector
24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T97 SPACE/GS-LRMI Connection-pool-1-thread-1-EventThread/ClientCnxn EventThread shut down for session: 0x36128253b820001
24.01.2018 11:32:44,997 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/frqMain1_1 Channel is closed [target=frqMain1_1, target url=jini://*/frqMain_container1_1/frqMain?groups=POLBAY&ignoreValidation=true&schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS02:11002/pid[3756]/24795468496_3_2538032931939568226_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(frqMain_container1_1:frqMain)]]
24.01.2018 11:32:44,998 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/ps-main-mirror-service Channel is closed [target=ps-main-mirror-service, target url=jini://*/ps-main-mirror-service_container/ps-main-mirror-service?schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&groups=POLBAY&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS01:11002/pid[3220]/25906827215_3_7196393537085178059_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(ps-main-mirror-service_container:ps-main-mirror-service)]]
24.01.2018 11:32:45,767 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/1 Shutdown complete
24.01.2018 11:32:45,768 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/container Container <frqMain_container1> shutdown completed
24.01.2018 11:32:46,037 +01:00 WARN WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Failed to delete deployed processing unit from [C:\Program Files\Frequentis\Frequentis LifeX Platform\gigaspaces\work\processing-units\ps-main_1_2008048111]
24.01.2018 11:32:46,037 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Stopped

Attached thread dump.

Failover happened around:

24.01.2018 11:40:20,893 +01:00 WARN WSOSTSLXBMS01/GSC/3 P1280/T59 SPACE/GS-Watchdog/watchdog The ServerEndPoint [WSOSTSLXMSS02/10.17.46.142:11001] is not reachable or is reachable but with no matching invocation in progress at the server peer, closing invalid connection with local address [/10.17.46.136:49864] [ remote invocation of: net.jini.lookup.DiscoveryAdmin.getMemberGroups] [java.net.SocketTimeoutException]

Could this be a ZK bug?

Thanks,
Yuval
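One plausible reading of the stack trace above: the CuratorFramework instance was closed (the session-closed and "backgroundOperationsLoop exiting" lines) before the LeaderSelector's work loop finished releasing its leadership lock, so `CuratorFrameworkImpl.delete()` hit its lifecycle precondition (the `Preconditions.checkState` frame) and threw. The sketch below is a minimal, self-contained illustration of that guard pattern, not Curator's real code; the class and method names are invented for illustration.

```java
// Sketch of the lifecycle-guard pattern (assumed to mirror what the stack
// trace shows): every operation checks the client's state first, so a call
// that arrives after close() fails with exactly this message.
public class StateGuardSketch {
    enum State { LATENT, STARTED, STOPPED }

    static class Client {
        private volatile State state = State.LATENT;

        void start() { state = State.STARTED; }
        void close() { state = State.STOPPED; }

        // Stand-in for a framework operation such as delete(): guard, then act.
        void delete(String path) {
            if (state != State.STARTED) {
                throw new IllegalStateException(
                        "instance must be started before calling this method");
            }
            // ... the znode would be deleted here ...
        }
    }

    public static void main(String[] args) {
        Client client = new Client();
        client.start();
        client.close();                    // framework shut down first...
        try {
            client.delete("/leader-lock"); // ...then a late release of the lock znode
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

If the root cause is shutdown ordering inside the application rather than ZooKeeper itself, closing the LeaderSelector (and letting it release its lock) before closing the CuratorFramework instance would typically avoid this exception; that ordering, however, is an inference from the log, not something confirmed in this thread.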