[ https://issues.apache.org/jira/browse/AURORA-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181856#comment-16181856 ]
Bill Farner commented on AURORA-1950: ------------------------------------- The problem is that this sequence results in an exception thrown from {{SchedulerMain.run()}}, which then does not trigger teardown of app services (including the HTTP server, which uses non-daemon threads). I only tested it once in vagrant, but i think this should do the trick: {code} --- a/src/main/java/org/apache/aurora/scheduler/app/SchedulerMain.java +++ b/src/main/java/org/apache/aurora/scheduler/app/SchedulerMain.java @@ -220,6 +220,7 @@ public class SchedulerMain { } finally { LOG.info("Application run() exited."); } + scheduler.stop(); } finally { if (lifecycle != null) { lifecycle.shutdown(); {code} {{stop()}} should be idempotent, so sprinkling another call to plug the hole should be safe. Overall, managing the scheduler's complex lifecycle has been an interesting challenge, and resembles a whack-a-mole in places like this Ideas are welcome! {quote} CallOrderEnforcingStorage.java does not allow transitions to STOPPED from any state other than READY. I am not sure if this is intentional, but I've added the transitions in the patch to isolate the above error {quote} I'm in favor of allowing those transitions, especially as it supports idempotent stoppage as described above. > Exceptions in LogStorage.prepare() cause scheduler die and stay in DEAD state > ----------------------------------------------------------------------------- > > Key: AURORA-1950 > URL: https://issues.apache.org/jira/browse/AURORA-1950 > Project: Aurora > Issue Type: Bug > Components: Scheduler > Reporter: Jordan Ly > Attachments: prepare_failure.patch > > > If the Scheduler fails while starting up its storage (calling > storage.prepare() in PREPARING_STORAGE state in SchedulerLifecycle.java), the > Scheduler will transition to dead but incorrectly shutdown. The non-daemon > threads will fail to stop, leaving the scheduler in the DEAD state. > The logs showing the issue: > {noformat} > I0926 01:02:34.118 [HttpServerLauncher STARTING, AbstractConnector] Started > ServerConnector@6b15ba8c{HTTP/1.1,[http/1.1]}{0.0.0.0:8081} > I0926 01:02:34.120 [HttpServerLauncher STARTING, Server] Started @7043ms > I0926 01:02:34.122 [main, StateMachine] SchedulerLifecycle state machine > transition IDLE -> PREPARING_STORAGE > E0926 01:02:35.023 [main, SchedulerLifecycle] Caught unchecked exception: > java.lang.IllegalStateException: Failed to open the log, cannot continue > java.lang.IllegalStateException: Failed to open the log, cannot continue > at > org.apache.aurora.scheduler.storage.log.LogStorage.prepare(LogStorage.java:407) > at > org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) > at > org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.prepare(CallOrderEnforcingStorage.java:86) > at > org.apache.aurora.scheduler.SchedulerLifecycle$3.accept(SchedulerLifecycle.java:216) > at > org.apache.aurora.scheduler.SchedulerLifecycle$3.accept(SchedulerLifecycle.java:213) > at > org.apache.aurora.common.base.Consumers.lambda$filter$1(Consumers.java:74) > at > org.apache.aurora.scheduler.SchedulerLifecycle.lambda$dieOnError$1(SchedulerLifecycle.java:337) > at > org.apache.aurora.common.base.Consumers.lambda$filter$1(Consumers.java:74) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at > org.apache.aurora.common.util.StateMachine.transition(StateMachine.java:160) > at > org.apache.aurora.scheduler.SchedulerLifecycle.prepare(SchedulerLifecycle.java:354) > at > org.apache.aurora.scheduler.app.SchedulerMain.run(SchedulerMain.java:148) > at > org.apache.aurora.scheduler.app.SchedulerMain.flagConfiguredMain(SchedulerMain.java:219) > at > org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:243) > Caused by: java.io.IOException: blah blah blah > at > org.apache.aurora.scheduler.storage.log.LogStorage.prepare(LogStorage.java:404) > ... 19 common frames omitted > I0926 01:02:35.023 [main, StateMachine] SchedulerLifecycle state machine > transition PREPARING_STORAGE -> DEAD > I0926 01:02:35.028 [main, StateMachine] storage state machine transition > CONSTRUCTED -> STOPPED > I0926 01:02:35.028 [main, Lifecycle] Shutting down application > I0926 01:02:35.028 [main, ShutdownRegistry$ShutdownRegistryImpl] Executing 4 > shutdown commands. > I0926 01:02:35.047 [main, StateMachine] SchedulerLifecycle state machine > transition DEAD -> DEAD > I0926 01:02:35.047 [main, SchedulerLifecycle] Shutdown already invoked, > ignoring extra call. > I0926 01:02:35.053 [Curator-Framework-0, CuratorFrameworkImpl] > backgroundOperationsLoop exiting > I0926 01:02:35.066 [main-EventThread, ClientCnxn] EventThread shut down for > session: 0x15ebba3b77c0010 > I0926 01:02:35.066 [main, ZooKeeper] Session: 0x15ebba3b77c0010 closed > I0926 01:02:35.067 [main, SchedulerMain] Application run() exited. > E0926 01:02:35.067 [main, SchedulerMain] Uncaught exception from > Thread[main,5,main]:java.lang.IllegalStateException: Failed to open the log, > cannot continue java.lang.IllegalStateException: Failed to open the log, > cannot continue > at > org.apache.aurora.scheduler.storage.log.LogStorage.prepare(LogStorage.java:407) > at > org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) > at > org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.prepare(CallOrderEnforcingStorage.java:86) > at > org.apache.aurora.scheduler.SchedulerLifecycle$3.accept(SchedulerLifecycle.java:216) > at > org.apache.aurora.scheduler.SchedulerLifecycle$3.accept(SchedulerLifecycle.java:213) > at > org.apache.aurora.common.base.Consumers.lambda$filter$1(Consumers.java:74) > at > org.apache.aurora.scheduler.SchedulerLifecycle.lambda$dieOnError$1(SchedulerLifecycle.java:337) > at > org.apache.aurora.common.base.Consumers.lambda$filter$1(Consumers.java:74) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at java.util.function.Consumer.lambda$andThen$0(Consumer.java:65) > at > org.apache.aurora.common.util.StateMachine.transition(StateMachine.java:160) > at > org.apache.aurora.scheduler.SchedulerLifecycle.prepare(SchedulerLifecycle.java:354) > at > org.apache.aurora.scheduler.app.SchedulerMain.run(SchedulerMain.java:148) > at > org.apache.aurora.scheduler.app.SchedulerMain.flagConfiguredMain(SchedulerMain.java:219) > at > org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:243) > Caused by: java.io.IOException: blah blah blah > at > org.apache.aurora.scheduler.storage.log.LogStorage.prepare(LogStorage.java:404) > ... 19 common frames omitted > I0926 01:02:51.943 [qtp2025987098-37, RequestLog] 192.168.33.1 - - > [26/Sep/2017:01:02:51 +0000] "GET //192.168.33.7:8081/threads HTTP/1.1" 200 > 2120 > I0926 01:02:52.925 [qtp2025987098-38, RequestLog] 192.168.33.1 - - > [26/Sep/2017:01:02:52 +0000] "GET //192.168.33.7:8081/threads HTTP/1.1" 200 > 2134 > {noformat} > You will still be able to access the "DEAD" Scheduler's HTTP interface. > Checking the "/threads" endpoint shows that services that have failed to stop. > I've attached a diff that can replicate the issue. Just bring up the Vagrant > cluster and run the Scheduler with the patch applied. > On a unrelated note: CallOrderEnforcingStorage.java does not allow > transitions to STOPPED from any state other than READY. I am not sure if this > is intentional, but I've added the transitions in the patch to isolate the > above error. -- This message was sent by Atlassian JIRA (v6.4.14#64029)