Hi,
I am using aurora for running 1300 instances of jenkins CIs in a mesos cluster.
Aurora is running in a HA mode with a three node cluster. A few days ago aurora
crashed and the end log can be seen below (Note that I have changed the actual
zookeeper IP addresses with IP1 and IP2).
Appreciate any help with regards to avoiding this kind of a crash in future.
Aurora version --> 0.6.1-snapshot
Thanks.
Ash.
============entries from the aurora log file
below================================================
I0305 21:23:25.546087 8811 replica.cpp:661] Replica learned APPEND action at
position 93907
I0305 21:23:25.546380 8811 replica.cpp:508] Replica received write request for
position 93908
I0305 21:23:25.651371 8811 leveldb.cpp:343] Persisting action (524328 bytes)
to leveldb took 104.79926ms
I0305 21:23:25.651468 8811 replica.cpp:676] Persisted action at 93908
I0305 21:23:25.768111 8809 replica.cpp:655] Replica received learned notice
for position 93908
I0305 21:23:25.809016 8809 leveldb.cpp:343] Persisting action (524330 bytes)
to leveldb took 40.811957ms
I0305 21:23:25.809109 8809 replica.cpp:676] Persisted action at 93908
I0305 21:23:25.809135 8809 replica.cpp:661] Replica learned APPEND action at
position 93908
I0305 21:23:25.831048 8812 replica.cpp:508] Replica received write request for
position 93909
I0305 21:23:25.859412 8812 leveldb.cpp:343] Persisting action (524328 bytes)
to leveldb took 28.015678ms
I0305 21:23:25.859561 8812 replica.cpp:676] Persisted action at 93909
I0305 21:23:26.522222 8814 replica.cpp:655] Replica received learned notice
for position 93909
I0305 21:23:26.552645 8814 leveldb.cpp:343] Persisting action (524330 bytes)
to leveldb took 30.324183ms
I0305 21:23:26.552736 8814 replica.cpp:676] Persisted action at 93909
I0305 21:23:26.552763 8814 replica.cpp:661] Replica learned APPEND action at
position 93909
I0305 21:23:32.186401 8810 network.hpp:423] ZooKeeper group memberships changed
I0305 21:23:32.186600 8815 group.cpp:658] Trying to get
'/aurora/replicated-log/0000000012' in ZooKeeper
I0305 21:23:32.197412 8815 group.cpp:658] Trying to get
'/aurora/replicated-log/0000000016' in ZooKeeper
I0305 21:23:32.208843 8815 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@IP1:8083, log-replica(1)@IP2:8083 }
W0305 21:23:36.191 THREAD128
com.twitter.common.zookeeper.ServerSetImpl$ServerSetWatcher.notifyServerSetChange:
server set empty for path /aurora/scheduler
W0305 21:23:36.191 THREAD128
org.apache.aurora.scheduler.http.LeaderRedirect$SchedulerMonitor.onChange: No
schedulers in host set, will not redirect despite not being leader.
I0305 21:23:36.202 THREAD128
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
/aurora/scheduler/singleton_candidate_0000000017 waiting for the next leader
election, current voting: [singleton_candidate_0000000012,
singleton_candidate_0000000017]
I0305 21:28:36.605 THREAD128
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange: Candidate
/aurora/scheduler/singleton_candidate_0000000017 is now leader of group:
[singleton_candidate_0000000017]
I0305 21:28:36.608 THREAD128
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
I0305 21:28:36.608 THREAD128
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute: Elected as leading
scheduler!
I0305 21:28:36.628625 8815 log.cpp:656] Attempting to start the writer
I0305 21:28:36.629897 8809 replica.cpp:474] Replica received implicit promise
request with proposal 2
I0305 21:28:36.662684 8809 leveldb.cpp:306] Persisting metadata (8 bytes) to
leveldb took 32.691373ms
I0305 21:28:36.662773 8809 replica.cpp:342] Persisted promised to 2
I0305 21:28:42.186013 8809 network.hpp:423] ZooKeeper group memberships changed
I0305 21:28:42.186200 8814 group.cpp:658] Trying to get
'/aurora/replicated-log/0000000016' in ZooKeeper
I0305 21:28:42.197717 8813 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@IP1:8083 }
I0305 21:28:51.631983 8808 log.cpp:656] Attempting to start the writer
I0305 21:29:06.634974 8808 log.cpp:656] Attempting to start the writer
I0305 21:29:21.649663 8808 log.cpp:656] Attempting to start the writer
I0305 21:29:36.662432 8808 log.cpp:656] Attempting to start the writer
I0305 21:29:51.669358 8813 log.cpp:656] Attempting to start the writer
I0305 21:30:06.684087 8809 log.cpp:656] Attempting to start the writer
I0305 21:30:21.696292 8812 log.cpp:656] Attempting to start the writer
I0305 21:30:36.703007 8812 log.cpp:656] Attempting to start the writer
I0305 21:30:51.712620 8812 log.cpp:656] Attempting to start the writer
I0305 21:31:06.727669 8808 log.cpp:656] Attempting to start the writer
I0305 21:31:21.742475 8813 log.cpp:656] Attempting to start the writer
I0305 21:31:36.754021 8810 log.cpp:656] Attempting to start the writer
I0305 21:31:51.766616 8815 log.cpp:656] Attempting to start the writer
I0305 21:32:06.779209 8812 log.cpp:656] Attempting to start the writer
I0305 21:32:21.794214 8809 log.cpp:656] Attempting to start the writer
I0305 21:32:36.803851 8811 log.cpp:656] Attempting to start the writer
I0305 21:32:51.812130 8809 log.cpp:656] Attempting to start the writer
I0305 21:33:06.822957 8814 log.cpp:656] Attempting to start the writer
I0305 21:33:21.835047 8811 log.cpp:656] Attempting to start the writer
I0305 21:33:36.842078 8813 log.cpp:680] Attempting to append 5 bytes to the log
I0305 21:33:36.842 THREAD128 com.twitter.common.application.Lifecycle.shutdown:
Shutting down application
I0305 21:33:36.843 THREAD128
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
Executing 7 shutdown commands.
I0305 21:33:36.854 THREAD128
org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute:
Shutdown initiated by: Thread: RedirectMonitor STARTING-EventThread (id 128)
java.lang.Thread.getStackTrace(Thread.java:1567)
org.apache.aurora.scheduler.app.AppModule$RegisterShutdownStackPrinter$2.execute(AppModule.java:159)
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:350)
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:369)
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315)
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.readAll(MesosLog.java:238)
org.apache.aurora.scheduler.storage.log.StreamManagerImpl.readFromBeginning(StreamManagerImpl.java:109)
org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:505)
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.CGLIB$recover$6(<generated>:-1)
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca$$FastClassByGuice$$ea0e64a7.invoke(<generated>:-1)
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.recover(<generated>:-1)
org.apache.aurora.scheduler.storage.log.LogStorage$20.execute(LogStorage.java:485)
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:147)
org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:144)
org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:137)
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.CGLIB$write$0(<generated>:-1)
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73$$FastClassByGuice$$66aea947.invoke(<generated>:-1)
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.write(<generated>:-1)
org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:144)
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.CGLIB$write$0(<generated>:-1)
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca$$FastClassByGuice$$fb95f8d6.invoke(<generated>:-1)
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.write(<generated>:-1)
org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635)
org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:479)
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:257)
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:253)
com.twitter.common.base.Closures$4.execute(Closures.java:120)
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:384)
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:380)
com.twitter.common.base.Closures$4.execute(Closures.java:120)
com.twitter.common.base.Closures$3.execute(Closures.java:98)
com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:426)
com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
I0305 21:33:36.854 THREAD128
com.twitter.common.stats.TimeSeriesRepositoryImpl$3.execute: Variable sampler
shut down
I0305 21:33:36.855 THREAD128
org.apache.aurora.scheduler.http.JettyServerModule$HttpServerLauncher$2.execute:
Shutting down embedded http server
I0305 21:33:36.880 THREAD128
org.eclipse.jetty.server.handler.ContextHandler.doStop: stopped
o.e.j.s.ServletContextHandler{/,null}
I0305 21:33:36.938 THREAD128
com.twitter.common.util.StateMachine$Builder$1.execute: SchedulerLifecycle
state machine transition LEADER_AWAITING_REGISTRATION -> DEAD
W0305 21:33:36.956 THREAD128
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute:
Shutdown action failed.
com.twitter.common.util.StateMachine$IllegalStateTransitionException: State
transition from PREPARED to STOPPED is not allowed.
at
com.twitter.common.util.StateMachine.transition(StateMachine.java:183)
at
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.stop(CallOrderEnforcingStorage.java:109)
at
org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:339)
at
org.apache.aurora.scheduler.SchedulerLifecycle$8.execute(SchedulerLifecycle.java:313)
at com.twitter.common.base.Closures$4.execute(Closures.java:120)
at com.twitter.common.base.Closures$3.execute(Closures.java:98)
at
com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
at
org.apache.aurora.scheduler.SchedulerLifecycle$4.execute(SchedulerLifecycle.java:239)
at
com.twitter.common.application.ShutdownRegistry$ShutdownRegistryImpl.execute(ShutdownRegistry.java:88)
at
com.twitter.common.application.Lifecycle.shutdown(Lifecycle.java:92)
at
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.disableLog(MesosLog.java:350)
at
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.mutate(MesosLog.java:369)
at
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.append(MesosLog.java:315)
at
org.apache.aurora.scheduler.log.mesos.MesosLog$LogStream.readAll(MesosLog.java:238)
at
org.apache.aurora.scheduler.storage.log.StreamManagerImpl.readFromBeginning(StreamManagerImpl.java:109)
at
org.apache.aurora.scheduler.storage.log.LogStorage.recover(LogStorage.java:505)
at
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.CGLIB$recover$6(<generated>)
at
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca$$FastClassByGuice$$ea0e64a7.invoke(<generated>)
at
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
at
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
at
com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
at
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
at
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
at
org.apache.aurora.scheduler.storage.log.LogStorage$$EnhancerByGuice$$598802ca.recover(<generated>)
at
org.apache.aurora.scheduler.storage.log.LogStorage$20.execute(LogStorage.java:485)
at
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132)
at
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult$Quiet.apply(Storage.java:149)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:147)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$3.apply(MemStorage.java:144)
at
org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:137)
at
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.CGLIB$write$0(<generated>)
at
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73$$FastClassByGuice$$66aea947.invoke(<generated>)
at
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
at
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
at
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
at
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
at
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
at
org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$71c38b73.write(<generated>)
at
org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemStorage.java:144)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.CGLIB$write$0(<generated>)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca$$FastClassByGuice$$fb95f8d6.invoke(<generated>)
at
com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
at
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
at
com.twitter.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:87)
at
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
at
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
at
org.apache.aurora.scheduler.storage.mem.MemStorage$$EnhancerByGuice$$cfe951ca.write(<generated>)
at
org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:635)
at
org.apache.aurora.scheduler.storage.log.LogStorage.start(LogStorage.java:479)
at
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.start(CallOrderEnforcingStorage.java:92)
at
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:257)
at
org.apache.aurora.scheduler.SchedulerLifecycle$6.execute(SchedulerLifecycle.java:253)
at com.twitter.common.base.Closures$4.execute(Closures.java:120)
at
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:384)
at
org.apache.aurora.scheduler.SchedulerLifecycle$9.execute(SchedulerLifecycle.java:380)
at com.twitter.common.base.Closures$4.execute(Closures.java:120)
at com.twitter.common.base.Closures$3.execute(Closures.java:98)
at
com.twitter.common.util.StateMachine.transition(StateMachine.java:191)
at
org.apache.aurora.scheduler.SchedulerLifecycle$SchedulerCandidateImpl.onLeading(SchedulerLifecycle.java:426)
at
com.twitter.common.zookeeper.SingletonService$1.onElected(SingletonService.java:168)
at
com.twitter.common.zookeeper.CandidateImpl$4.onGroupChange(CandidateImpl.java:155)
at
com.twitter.common.zookeeper.Group$GroupMonitor.setMembers(Group.java:665)
at
com.twitter.common.zookeeper.Group$GroupMonitor.watchGroup(Group.java:638)
at
com.twitter.common.zookeeper.Group$GroupMonitor.access$900(Group.java:579)
at
com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:600)
at
com.twitter.common.zookeeper.Group$GroupMonitor$2.get(Group.java:597)
at
com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:109)
at
com.twitter.common.util.BackoffHelper$1.get(BackoffHelper.java:107)
at
com.twitter.common.util.BackoffHelper.doUntilResult(BackoffHelper.java:127)
at
com.twitter.common.util.BackoffHelper.doUntilSuccess(BackoffHelper.java:107)
at
com.twitter.common.zookeeper.Group$GroupMonitor.tryWatchGroup(Group.java:622)
at
com.twitter.common.zookeeper.Group$GroupMonitor.access$1100(Group.java:579)
at
com.twitter.common.zookeeper.Group$GroupMonitor$1.process(Group.java:591)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
I0305 21:33:36.960 THREAD1 com.twitter.common.application.AppLauncher.run:
Application run() exited.