[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle
[ https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484846#comment-15484846 ]

Zameer Manji commented on AURORA-1769:

Agreed that a solution to this problem involves two components:
* Not sending the {{TaskStateChange}} events on scheduler restart (that's very surprising to me)
* Sending data asynchronously so as not to block in the event bus callback.

I don't think this has to be a blocker for 0.16.0 either, but I wanted to surface it in case [~joshua.cohen] agreed.

> Enabling webhook is synchronous and could cause longer leader reelection cycle
>
> Key: AURORA-1769
> URL: https://issues.apache.org/jira/browse/AURORA-1769
> Project: Aurora
> Issue Type: Bug
> Reporter: Dmitriy Shirchenko
> Assignee: Dmitriy Shirchenko
>
> We had an issue where on scheduler leader reelection EventBus was full of
> TaskStateChange events and caused scheduler to not be able to post
> DriverRegistered() message which caused Aurora scheduler to not register
> within 1 minute.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle
[ https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484533#comment-15484533 ]

Maxim Khutornenko commented on AURORA-1769:

My suggestion was targeting the restart issue, where events should be suppressed regardless: you don't want to resend {{TaskStateChange}} events for all tasks every time a scheduler restarts.

As for the general perf issue, blocking {{EventBus}} threads was one of the concerns raised in the original https://reviews.apache.org/r/47440/ RB. We concluded back then that using aggressive connection timeouts _was_ appropriate to mitigate possible event queue saturation. If you feel that is no longer the case, please follow up with an async proposal. You'll likely need something akin to the [BatchWorker|https://reviews.apache.org/r/51759/]: a sending thread working off of its own queue.

In any case, given that this feature is optional and off by default, I feel blocking the release until it's improved is not justified.
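The "aggressive connection timeouts" mitigation mentioned above can be illustrated with plain {{HttpURLConnection}}, which the webhook code imports. This is only a rough sketch: the class name {{TimeoutPostSketch}}, the method names, and any timeout value passed in are hypothetical, and the thread does not say which values the scheduler actually uses.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the actual Aurora Webhook class. */
public class TimeoutPostSketch {

  /** Opens (but does not yet connect) a POST connection with bounded timeouts. */
  public static HttpURLConnection open(String url, int timeoutMs) throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) URI.create(url).toURL().openConnection();
    // Bound both the TCP connect and every blocking read, so a slow or
    // unresponsive endpoint cannot hold the calling thread indefinitely.
    conn.setConnectTimeout(timeoutMs);
    conn.setReadTimeout(timeoutMs);
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    return conn;
  }

  /** Sends the JSON body and returns the HTTP status code. */
  public static int post(String url, String json, int timeoutMs) throws IOException {
    HttpURLConnection conn = open(url, timeoutMs);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(json.getBytes(StandardCharsets.UTF_8));
    }
    try {
      // Throws java.net.SocketTimeoutException if no response arrives in time.
      return conn.getResponseCode();
    } finally {
      conn.disconnect();
    }
  }
}
```

Timeouts only cap how long each call can block. With many queued events the total delay still grows with the number of events, which is the concern raised elsewhere in this thread.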
[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle
[ https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15479046#comment-15479046 ]

Zameer Manji commented on AURORA-1769:

[~maximk]: I don't think that's sufficient. In reality, doing any blocking in any event subscriber will delay propagation of events.

Apply the following patch to your repo:
{noformat}
diff --git c/examples/vagrant/upstart/aurora-scheduler.conf w/examples/vagrant/upstart/aurora-scheduler.conf
index 91b27d7..f7419d4 100644
--- c/examples/vagrant/upstart/aurora-scheduler.conf
+++ w/examples/vagrant/upstart/aurora-scheduler.conf
@@ -51,4 +51,5 @@ exec bin/aurora-scheduler \
   -mesos_role=aurora-role \
   -populate_discovery_info=true \
   -receive_revocable_resources=true \
-  -allow_gpu_resource=true
+  -allow_gpu_resource=true \
+  -webhook_config=/home/vagrant/aurora/src/main/resources/org/apache/aurora/scheduler/webhook.json
diff --git c/src/main/java/org/apache/aurora/scheduler/events/Webhook.java w/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
index e54aa19..ed61ac0 100644
--- c/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
+++ w/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
@@ -13,6 +13,7 @@
  */
 package org.apache.aurora.scheduler.events;
 
+import com.google.common.base.Throwables;
 import java.io.DataOutputStream;
 import java.io.InputStream;
 import java.net.HttpURLConnection;
@@ -23,6 +24,8 @@ import com.google.common.eventbus.Subscribe;
 
 import com.google.inject.Inject;
 
+import org.apache.aurora.common.quantity.Amount;
+import org.apache.aurora.common.quantity.Time;
 import org.apache.aurora.scheduler.events.PubsubEvent.EventSubscriber;
 import org.apache.aurora.scheduler.events.PubsubEvent.TaskStateChange;
 import org.slf4j.Logger;
@@ -104,7 +107,11 @@ public class Webhook implements EventSubscriber {
    */
   @Subscribe
   public void taskChangedState(TaskStateChange stateChange) {
-    String eventJson = stateChange.toJson();
-    callEndpoint(eventJson);
+    int i = Amount.of(15, Time.SECONDS).as(Time.MILLISECONDS);
+    try {
+      Thread.sleep(i);
+    } catch (InterruptedException e) {
+      Throwables.propagate(e);
+    }
   }
 }
{noformat}

Then, in vagrant, create a job with 100 tasks and restart the scheduler. You will see that it never registers within one minute, because the async worker for the event bus is busy, blocked delivering {{TaskStateChange}} events. You can confirm this by checking {{/threads}}, where the {{AsyncProcessor-*}} threads are blocked in the {{Webhook}} class.

Since calling an external HTTP server can block for an unknown amount of time, I think the solution here is to make the hook async and have the event subscriber just place the event in a queue for processing. Then it can have its own thread pool for sending the requests out.
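A minimal sketch of the queue-plus-thread-pool idea proposed above, using only the JDK. Everything here is an assumption made for illustration: the class {{AsyncWebhookSketch}}, its constructor parameters, the {{Consumer<String>}} standing in for the HTTP call, and the choice to drop the oldest queued event when the queue is full. In the scheduler the handler would presumably be a Guava {{@Subscribe}} method taking a {{TaskStateChange}}; a plain JSON string is used here so the example stays self-contained.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch; not the actual Aurora Webhook class. */
public class AsyncWebhookSketch {
  private final ExecutorService senders;
  private final Consumer<String> endpointCaller;

  public AsyncWebhookSketch(int threads, int queueCapacity, Consumer<String> endpointCaller) {
    this.endpointCaller = endpointCaller;
    // Dedicated sender threads working off a bounded queue. When the queue is
    // full, the oldest pending event is discarded so the caller never blocks
    // (an assumed policy; dropping the newest or logging are alternatives).
    this.senders = new ThreadPoolExecutor(
        threads, threads, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<>(queueCapacity),
        new ThreadPoolExecutor.DiscardOldestPolicy());
  }

  /** Event bus callback: only enqueues the event, then returns immediately. */
  public void taskChangedState(String eventJson) {
    senders.execute(() -> endpointCaller.accept(eventJson));
  }

  /** Stops accepting events and waits for queued ones to be sent. */
  public boolean shutdownAndAwait(long timeoutMs) {
    senders.shutdown();
    try {
      return senders.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

With this shape, a slow endpoint only delays the sender threads. The event bus thread spends just the time needed to enqueue, so other events such as {{DriverRegistered}} are not held up behind webhook calls.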
[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle
[ https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478951#comment-15478951 ]

Maxim Khutornenko commented on AURORA-1769:

Scratch the above. I read the description but completely ignored the title. I can totally see how this can happen with a webhook enabled. Should be easy to ignore {{TaskStateChange}} processing there until the {{DriverRegistered}} is received.
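The suggestion above, ignoring {{TaskStateChange}} until {{DriverRegistered}} has been received, could look roughly like the following. This is an illustrative sketch only: {{GatedWebhookSketch}} and its methods are hypothetical, events are reduced to plain strings, and in the scheduler both handlers would presumably be Guava {{@Subscribe}} methods on the webhook subscriber.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical sketch; not the actual Aurora Webhook class. */
public class GatedWebhookSketch {
  // Flipped once the driver has registered; read by event bus threads.
  private final AtomicBoolean registered = new AtomicBoolean(false);
  private final Consumer<String> endpointCaller;

  public GatedWebhookSketch(Consumer<String> endpointCaller) {
    this.endpointCaller = endpointCaller;
  }

  /** Stand-in for the handler of the DriverRegistered event. */
  public void driverRegistered() {
    registered.set(true);
  }

  /** Stand-in for the handler of TaskStateChange events. */
  public void taskChangedState(String eventJson) {
    if (!registered.get()) {
      // Events replayed during startup are dropped rather than sent out.
      return;
    }
    endpointCaller.accept(eventJson);
  }
}
```

This addresses only the restart case. After registration the call to the endpoint is still synchronous, so a slow endpoint can still hold up event delivery.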
[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle
[ https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478914#comment-15478914 ]

Maxim Khutornenko commented on AURORA-1769:

Are you sure that was due to a backed-up {{EventBus}}? The only way for {{TaskStateChange}} events to get there before the driver registration is through re-populating the {{TaskStore}} while [reading snapshot/replaying transactions|https://github.com/apache/aurora/blob/b24619b28c4dbb35188871bacd0091a9e01218e3/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L99]. This is usually very fast for the {{MemTaskStore}}, but I can see how it _might_ take longer when using the {{DBTaskStore}}. Are you setting {{use_beta_db_task_store=true}}?