[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle

2016-09-12 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484846#comment-15484846
 ] 

Zameer Manji commented on AURORA-1769:
--

Agreed that a solution to this problem involves two components:
* Not sending the {{TaskStateChange}} events on scheduler restart (that's very 
surprising to me)
* Sending data asynchronously so as not to block in the event bus callback.

I don't think this has to be a blocker for 0.16.0 either, but I just wanted to 
surface it in case [~joshua.cohen] agreed.

> Enabling webhook is synchronous and could cause longer leader reelection cycle
> --
>
> Key: AURORA-1769
> URL: https://issues.apache.org/jira/browse/AURORA-1769
> Project: Aurora
> Issue Type: Bug
> Reporter: Dmitriy Shirchenko
> Assignee: Dmitriy Shirchenko
>
> We had an issue where on scheduler leader reelection EventBus was full of 
> TaskStateChange events and caused scheduler to not be able to post 
> DriverRegistered() message which caused Aurora scheduler to not register 
> within 1 minute. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle

2016-09-12 Thread Maxim Khutornenko (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484533#comment-15484533
 ] 

Maxim Khutornenko commented on AURORA-1769:
---

My suggestion was targeting the restart issue where events should be suppressed 
regardless: you don't want to resend {{TaskStateChange}} events for all tasks 
every time a scheduler restarts.

As for the general perf issue, blocking {{EventBus}} threads was one of the 
concerns raised in the original https://reviews.apache.org/r/47440/ RB. We 
concluded back then that using aggressive connection timeouts _was_ appropriate 
to mitigate possible event queue saturation. If you feel that is no longer the 
case, please follow up with an async proposal. You'll likely need something 
akin to the [BatchWorker|https://reviews.apache.org/r/51759/]: a sending thread 
working off of its own queue. In any case, given that this feature is optional 
and off by default, I feel blocking the release until it's improved is not 
justified.
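
For illustration, the timeout mitigation mentioned above could look roughly like the sketch below. It is not the code from the RB: the class name {{TimeoutPost}}, the method signature, and the timeout values are assumptions made for this example, and only standard {{java.net}} calls are used. Note that timeouts only bound how long the calling thread blocks; the call is still synchronous.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: a synchronous POST whose blocking time is bounded by timeouts. */
final class TimeoutPost {
  private TimeoutPost() {}

  /**
   * Posts the JSON body and returns the HTTP status code. Throws
   * java.net.SocketTimeoutException (an IOException) if connecting or reading
   * takes longer than the given limits.
   */
  static int post(URL url, String json, int connectTimeoutMs, int readTimeoutMs)
      throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try {
      conn.setConnectTimeout(connectTimeoutMs); // upper bound on connection setup
      conn.setReadTimeout(readTimeoutMs);       // upper bound on waiting for the response
      conn.setRequestMethod("POST");
      conn.setRequestProperty("Content-Type", "application/json");
      conn.setDoOutput(true);
      try (OutputStream out = conn.getOutputStream()) {
        out.write(json.getBytes(StandardCharsets.UTF_8));
      }
      return conn.getResponseCode();
    } finally {
      conn.disconnect();
    }
  }
}
```

With N queued events and an unresponsive endpoint, the subscriber thread can still be held for up to N times the combined timeout, which is why timeouts alone may not be enough under a burst of events.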



[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle

2016-09-09 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15479046#comment-15479046
 ] 

Zameer Manji commented on AURORA-1769:
--

[~maximk]: I don't think that's sufficient. In reality, doing any blocking in 
any event subscriber will delay propagation of events. Apply the following 
patch to your repo:

{noformat}
diff --git c/examples/vagrant/upstart/aurora-scheduler.conf w/examples/vagrant/upstart/aurora-scheduler.conf
index 91b27d7..f7419d4 100644
--- c/examples/vagrant/upstart/aurora-scheduler.conf
+++ w/examples/vagrant/upstart/aurora-scheduler.conf
@@ -51,4 +51,5 @@ exec bin/aurora-scheduler \
   -mesos_role=aurora-role \
   -populate_discovery_info=true \
   -receive_revocable_resources=true \
-  -allow_gpu_resource=true
+  -allow_gpu_resource=true \
+  -webhook_config=/home/vagrant/aurora/src/main/resources/org/apache/aurora/scheduler/webhook.json
diff --git c/src/main/java/org/apache/aurora/scheduler/events/Webhook.java w/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
index e54aa19..ed61ac0 100644
--- c/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
+++ w/src/main/java/org/apache/aurora/scheduler/events/Webhook.java
@@ -13,6 +13,7 @@
  */
 package org.apache.aurora.scheduler.events;
 
+import com.google.common.base.Throwables;
 import java.io.DataOutputStream;
 import java.io.InputStream;
 import java.net.HttpURLConnection;
@@ -23,6 +24,8 @@ import com.google.common.eventbus.Subscribe;
 
 import com.google.inject.Inject;
 
+import org.apache.aurora.common.quantity.Amount;
+import org.apache.aurora.common.quantity.Time;
 import org.apache.aurora.scheduler.events.PubsubEvent.EventSubscriber;
 import org.apache.aurora.scheduler.events.PubsubEvent.TaskStateChange;
 import org.slf4j.Logger;
@@ -104,7 +107,11 @@ public class Webhook implements EventSubscriber {
    */
   @Subscribe
   public void taskChangedState(TaskStateChange stateChange) {
-    String eventJson = stateChange.toJson();
-    callEndpoint(eventJson);
+    int i = Amount.of(15, Time.SECONDS).as(Time.MILLISECONDS);
+    try {
+      Thread.sleep(i);
+    } catch (InterruptedException e) {
+      Throwables.propagate(e);
+    }
   }
 }

{noformat}

Then, in Vagrant, create a job with 100 tasks.

Then restart the scheduler. You will see that it never registers within one 
minute because the async worker for the event bus is blocked delivering 
{{TaskStateChange}} events. You can confirm this by checking {{/threads}} and 
seeing the {{AsyncProcessor-*}} threads blocked in the {{Webhook}} class.

Since calling an external HTTP server can block for an unknown amount of time, 
I think the solution here is to make the hook async and have the event 
subscriber just place the event in a queue for processing. The hook can then 
have its own thread pool for sending the requests out.
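
A minimal sketch of that shape, using only {{java.util.concurrent}}, might look like the following. The class name {{AsyncWebhookSender}}, the bounded queue, and the drop-when-full policy are assumptions for illustration, not an agreed design; the {{Consumer<String>}} stands in for the actual HTTP call.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical helper: decouples event delivery from the (possibly slow) HTTP call. */
final class AsyncWebhookSender implements AutoCloseable {
  private final BlockingQueue<String> queue;
  private final ExecutorService workers;

  /**
   * @param capacity maximum number of events waiting to be sent
   * @param threads  number of sender threads draining the queue
   * @param sender   stand-in for the blocking HTTP call
   */
  AsyncWebhookSender(int capacity, int threads, Consumer<String> sender) {
    this.queue = new LinkedBlockingQueue<>(capacity);
    this.workers = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < threads; i++) {
      workers.execute(() -> {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            String eventJson = queue.take(); // blocks only this worker thread
            try {
              sender.accept(eventJson);
            } catch (RuntimeException e) {
              // A failed delivery must not kill the worker; real code would log it.
            }
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // close() was called
        }
      });
    }
  }

  /** Called from the event subscriber. Never blocks; returns false if the queue is full. */
  boolean enqueue(String eventJson) {
    return queue.offer(eventJson);
  }

  @Override
  public void close() {
    workers.shutdownNow();
  }
}
```

In this sketch the subscriber method would only call {{enqueue(stateChange.toJson())}}, so the event bus thread returns immediately however slow the endpoint is. Dropping events when the queue is full is one possible policy; whether that is acceptable for webhook consumers is an open question.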



[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle

2016-09-09 Thread Maxim Khutornenko (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478951#comment-15478951
 ] 

Maxim Khutornenko commented on AURORA-1769:
---

Scratch the above. I read the description but completely ignored the title. I 
can totally see how this can happen with a webhook enabled. It should be easy 
to skip {{TaskStateChange}} processing there until {{DriverRegistered}} is 
received.
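
As a rough sketch of that idea (the class {{GatedWebhook}} and its method names are hypothetical, and plain strings stand in for the real event types and the HTTP call):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical subscriber that ignores task events until the driver has registered. */
final class GatedWebhook {
  private final AtomicBoolean registered = new AtomicBoolean(false);
  private final List<String> sent = new ArrayList<>(); // stand-in for outgoing HTTP calls

  /** In the scheduler this would be the subscriber method for DriverRegistered. */
  void driverRegistered() {
    registered.set(true);
  }

  /** In the scheduler this would be the subscriber method for TaskStateChange. */
  void taskChangedState(String eventJson) {
    if (!registered.get()) {
      return; // event replayed during startup: skip it
    }
    synchronized (sent) {
      sent.add(eventJson);
    }
  }

  /** Returns a copy of everything that would have been sent so far. */
  List<String> sent() {
    synchronized (sent) {
      return new ArrayList<>(sent);
    }
  }
}
```

Events that arrive before registration are dropped rather than deferred in this sketch, which matches the goal of not resending state for every task on restart.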



[jira] [Commented] (AURORA-1769) Enabling webhook is synchronous and could cause longer leader reelection cycle

2016-09-09 Thread Maxim Khutornenko (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478914#comment-15478914
 ] 

Maxim Khutornenko commented on AURORA-1769:
---

Are you sure that was due to the backed-up {{EventBus}}? The only way for the 
{{TaskStateChange}} events to get there before the driver registration is by 
re-populating the {{TaskStore}} while [reading snapshot/replaying 
transactions|https://github.com/apache/aurora/blob/b24619b28c4dbb35188871bacd0091a9e01218e3/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L99].
This is usually very fast for the {{MemTaskStore}}, but I can see how it 
_might_ take longer when using the {{DBTaskStore}}. Are you setting 
{{use_beta_db_task_store=true}}?
