Zameer Manji created AURORA-1911:
------------------------------------

             Summary: HTTP Scheduler Driver does not reliably resubscribe
                 Key: AURORA-1911
                 URL: https://issues.apache.org/jira/browse/AURORA-1911
             Project: Aurora
          Issue Type: Bug
            Reporter: Zameer Manji
            Assignee: Zameer Manji

I observed this issue in a large production cluster during a period of Mesos master instability:

1. The Mesos master crashes or restarts.
2. The {{V1Mesos}} driver detects this and reconnects.
3. Aurora issues the {{SUBSCRIBE}} call again.
4. The {{SUBSCRIBE}} call fails silently in the driver.
5. All subsequent calls are silently dropped by the driver.
6. Aurora receives no offers because it is not subscribed.

Logs:
{noformat}
I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at http://10.162.14.30:5050/master/api/v1/scheduler
W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service Unavailable' () for SUBSCRIBE
....
W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED
....
W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED
....
W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED
...
{noformat}

To fix this, the {{VersionedSchedulerDriver}} needs to do two things:

1. Block calls when unsubscribed, not just when disconnected.
2. Retry the {{SUBSCRIBE}} call repeatedly with exponential backoff.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
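
The two-part fix proposed in the report could look roughly like the Python sketch below. It is illustrative only and is not Aurora's Java {{VersionedSchedulerDriver}} code: the class {{ResubscribingDriver}}, its method names, the {{send_subscribe}} and {{send_call}} callbacks, and the delay parameters are all hypothetical. As a simplifying assumption, {{send_subscribe}} returns True once the master accepts the subscription and False on a failure such as the 503 in the logs; a real driver would learn this from the master's response and events instead.

```python
import threading
import time
from typing import Callable, Optional


class ResubscribingDriver:
    """Hypothetical wrapper illustrating the proposed fix (not Aurora code)."""

    def __init__(
        self,
        send_subscribe: Callable[[], bool],   # assumed: True once SUBSCRIBE is accepted
        send_call: Callable[[str], None],     # assumed: forwards a call such as KILL
        sleep: Callable[[float], None] = time.sleep,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
    ) -> None:
        self._send_subscribe = send_subscribe
        self._send_call = send_call
        self._sleep = sleep
        self._base_delay = base_delay
        self._max_delay = max_delay
        # Set only while subscribed; being merely connected leaves it clear.
        self._subscribed = threading.Event()

    def on_disconnected(self) -> None:
        self._subscribed.clear()

    def on_connected(self) -> None:
        # Fix 2: retry SUBSCRIBE with exponential backoff until it succeeds.
        self._subscribed.clear()
        delay = self._base_delay
        while not self._send_subscribe():
            self._sleep(delay)
            delay = min(delay * 2, self._max_delay)
        self._subscribed.set()

    def call(self, name: str, timeout: Optional[float] = None) -> None:
        # Fix 1: block while unsubscribed instead of silently dropping the call.
        if not self._subscribed.wait(timeout):
            raise TimeoutError(f"{name}: driver is not subscribed")
        self._send_call(name)


# Usage: two simulated SUBSCRIBE failures, then success.
attempts = iter([False, False, True])
delays, sent = [], []
driver = ResubscribingDriver(lambda: next(attempts), sent.append, sleep=delays.append)
driver.on_connected()
driver.call("KILL")
print(delays, sent)  # [1.0, 2.0] ['KILL']
```

The sketch omits details a real fix would need, such as adding jitter to the backoff and abandoning the retry loop if the connection drops again while it is still retrying.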