Richard Pijnenburg created ATLAS-4659:
-----------------------------------------

             Summary: Atlas in HA mode fails to get healthy
                 Key: ATLAS-4659
                 URL: https://issues.apache.org/jira/browse/ATLAS-4659
             Project: Atlas
          Issue Type: Bug
    Affects Versions: 3.0.0
         Environment: Zookeeper 3.8.0
            Reporter: Richard Pijnenburg


We are trying to set up Atlas with the HA functionality using ZooKeeper 3.8.0.

Relevant logs:
{code:java}
2022-08-18 14:57:06,924 INFO - [main:] ~ Found matched server id id1 with host port: atlas-0.atlas-headless.atlas.svc.cluster.local:21000 (AtlasServerIdSelector:65)
2022-08-18 14:57:06,924 INFO - [main:] ~ Starting leader election for id1 (ActiveInstanceElectorService:112)
2022-08-18 14:57:06,933 INFO - [main:] ~ Leader latch started for id1. (ActiveInstanceElectorService:118)
2022-08-18 14:57:06,991 INFO - [main:] ~ AtlasJsonProvider() instantiated (AtlasJsonProvider:53)
2022-08-18 14:57:07,296 WARN - [main-EventThread:] ~ Server instance with server id id1 is elected as leader (ActiveInstanceElectorService:152)
2022-08-18 14:57:07,296 WARN - [main-EventThread:] ~ Instance becoming active from PASSIVE (ServiceState:88
———
2022-08-18 14:57:27,818 INFO - [main-EventThread:] ~ Reacting to active state: initializing Kafka consumers (NotificationHookConsumer:421)
2022-08-18 14:57:27,819 INFO - [main-EventThread:] ~ ==> KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1, autoCommitEnabled=false) (KafkaNotification:194)
2022-08-18 14:57:28,237 INFO - [main-EventThread:] ~ <== KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1, autoCommitEnabled=false) (KafkaNotification:234)
2022-08-18 14:57:28,402 INFO - [main-EventThread:] ~ ==> TaskManagement.instanceIsActive() (TaskManagement:94)
2022-08-18 14:57:28,402 INFO - [main-EventThread:] ~ TaskManagement: Started! (TaskManagement:196)
2022-08-18 14:57:28,479 INFO - [NotificationHookConsumer thread-0:] ~ [atlas-hook-consumer-thread]: Starting (Logging:66)
2022-08-18 14:57:28,481 INFO - [NotificationHookConsumer thread-0:] ~ ==> HookConsumer doWork() (NotificationHookConsumer$HookConsumer:540)
2022-08-18 14:57:28,483 INFO - [NotificationHookConsumer thread-0:] ~ Atlas Server is not ready. Waiting for 1000 milliseconds to retry... (NotificationHookConsumer$HookConsumer:940)
2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ TaskManagement: Found: 0: Tasks in pending state. (TaskManagement:212)
2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ <== TaskManagement.instanceIsActive() (TaskManagement:98)
2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ ==> IndexRecoveryService.instanceIsActive() (IndexRecoveryService:117)
2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ <== IndexRecoveryService.instanceIsActive() (IndexRecoveryService:121)
2022-08-18 14:57:28,486 INFO - [index-health-monitor:] ~ Index Health Monitor: Starting... (IndexRecoveryService$RecoveryThread:175)
2022-08-18 14:57:28,487 ERROR - [main-EventThread:] ~ Got exception while activating (ActiveInstanceElectorService:162)
org.apache.atlas.exception.AtlasBaseException: ActiveInstanceState.update resulted in exception.
    at org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:119)
    at org.apache.atlas.web.service.ActiveInstanceElectorService.isLeader(ActiveInstanceElectorService.java:158)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698)
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
    at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
    at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
    at org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:697)
    at org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:575)
    at org.apache.curator.framework.recipes.leader.LeaderLatch.access$600(LeaderLatch.java:65)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:626)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653)
    at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:627)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]
    at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:823)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:432)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkExists(CuratorFrameworkImpl.java:459)
    at org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:109)
    ... 16 more
2022-08-18 14:57:28,487 WARN - [main-EventThread:] ~ Server instance with server id id1 is removed as leader (ActiveInstanceElectorService:199)
2022-08-18 14:57:28,487 WARN - [main-EventThread:] ~ Instance becoming passive from BECOMING_ACTIVE (ServiceState:119)
2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ ==> IndexRecoveryService.instanceIsPassive() (IndexRecoveryService:126)
2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ Index Health Monitor: Shutdown: Starting... (IndexRecoveryService$RecoveryThread:196)
2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ Index Health Monitor: Shutdown: Done! (IndexRecoveryService$RecoveryThread:206)
2022-08-18 14:57:29,484 INFO - [NotificationHookConsumer thread-0:] ~ Atlas Server is not ready. Waiting for 1000 milliseconds to retry... (NotificationHookConsumer$HookConsumer:940)
2022-08-18 14:57:30,484 INFO - [NotificationHookConsumer thread-0:] ~ Atlas Server is not ready. Waiting for 1000 milliseconds to retry... (NotificationHookConsumer$HookConsumer:940)
{code}
Running Atlas in non-HA mode works fine.

The ZooKeeper instance is also used for Cassandra and Solr, and those don't seem to have any issues with ZooKeeper.

It's unclear from the logs where the actual issue is.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)