[jira] [Created] (AURORA-1892) TaskQuery `limit` and `offset` must be applied at TaskStore

2017-02-13 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1892:
-

 Summary: TaskQuery `limit` and `offset` must be applied at TaskStore
 Key: AURORA-1892
 URL: https://issues.apache.org/jira/browse/AURORA-1892
 Project: Aurora
 Issue Type: Task
 Reporter: Santhosh Kumar Shanmugham


{{TaskQuery}}'s {{limit}} and {{offset}} are currently applied after the 
results have been fetched from the {{TaskStore}}, which is inefficient. Apply 
{{limit}} and {{offset}} inside the {{TaskStore}} itself, in both 
{{MemTaskStore}} and {{DBTaskStore}}.
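
As a rough illustration of the idea, the sketch below applies the filter, offset, and limit inside a toy in-memory store, so the caller never receives rows it would discard. The class {{PagedTaskStoreSketch}} and its methods are hypothetical and are not Aurora's {{MemTaskStore}} API; tasks are reduced to plain string IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for an in-memory task store; not Aurora's MemTaskStore.
final class PagedTaskStoreSketch {
  private final List<String> taskIds = new ArrayList<>();

  void save(String taskId) {
    taskIds.add(taskId);
  }

  // Filters first, then skips `offset` matches and keeps at most `limit`,
  // all inside the store. Negative values are treated as zero.
  List<String> fetch(Predicate<String> filter, int offset, int limit) {
    return taskIds.stream()
        .filter(filter)
        .skip(Math.max(0, offset))
        .limit(Math.max(0, limit))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    PagedTaskStoreSketch store = new PagedTaskStoreSketch();
    for (int i = 0; i < 10; i++) {
      store.save("task-" + i);
    }
    // Prints [task-3, task-4]
    System.out.println(store.fetch(id -> true, 3, 2));
  }
}
```

For a database-backed store, the equivalent would presumably be to push the paging into the SQL query rather than trimming the result list afterwards.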



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1891) Unable to upgrade Guava

2017-02-13 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1891:


 Summary: Unable to upgrade Guava
 Key: AURORA-1891
 URL: https://issues.apache.org/jira/browse/AURORA-1891
 Project: Aurora
 Issue Type: Bug
 Reporter: Zameer Manji
 Priority: Minor


Guava 21 is out, with better Java 8 integration.

I cannot upgrade us. Bumping the dependency results in:

{noformat}
/Users/zmanji/code/aurora/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:82:
 error: cannot find symbol
class WriteAheadStorage extends WriteAheadStorageForwarder implements
^
  symbol: class WriteAheadStorageForwarder
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith': 
class file for com.google.errorprone.annotations.CompatibleWith not found
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multiset.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multiset.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/code/aurora/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:74:
 Note: Wrote forwarder 
org.apache.aurora.scheduler.storage.log.WriteAheadStorageForwarder
@Forward({
^
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith': 
class file for com.google.errorprone.annotations.CompatibleWith not found
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
{noformat}

[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-13 Thread David McLaughlin (JIRA)

[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864608#comment-15864608 ]

David McLaughlin commented on AURORA-1890:
--

Sounds good to me. 

> Job Update Pulse History is not durably stored
> --
>
> Key: AURORA-1890
> URL: https://issues.apache.org/jira/browse/AURORA-1890
> Project: Aurora
> Issue Type: Bug
> Reporter: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h.
> 2. Send a pulse to get the update going.
> 3. Fail over the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
>
> This is because the {{JobUpdateControllerImpl}} stores pulse history and
> state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is
> reset to no pulse received.
>
> We can solve this by durably storing the timestamp of the last pulse received
> in storage.





[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-13 Thread Zameer Manji (JIRA)

[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864598#comment-15864598 ]

Zameer Manji commented on AURORA-1890:
--

I would be content with initializing the {{PulseState}} timestamp with the 
timestamp of the most recent event that transitioned out of 
{{BLOCKED_AWAITING_PULSE}}.

I feel this is more correct than what we do now, avoids hashing out some 
storage changes, and is suitable for my current use case.

If you confirm that you agree, I can rephrase this ticket to better capture 
what the fix would be.



[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-13 Thread David McLaughlin (JIRA)

[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864569#comment-15864569 ]

David McLaughlin commented on AURORA-1890:
--

You're right, the write volume is totally dependent on your update volume and 
the pulse interval. For many use cases, the cost of the update would be 
negligible. I think the real concern was the cost of reading the last pulse 
time. 

One other reason why persisting the pulse is not super useful is that the 
scheduler failover time typically exceeds a sane pulse timeout. The same 
applies to automatically setting it to the last event time (which would be 
preferable IMO). I think the reason we backed out of the grace period change 
(which was going to be achieved by setting the timestamp to the time the 
scheduler acquired leadership) is that it would potentially reactivate a bunch 
of updates that were legitimately blocked. In the end, we agreed the churn from 
ROLLING_FORWARD -> BLOCKED_AWAITING_PULSE -> ROLLING_FORWARD was harmless. But 
I suppose if you have automation on top of this that reacts to state changes, 
it could be annoying. 



[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-13 Thread Zameer Manji (JIRA)

[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864537#comment-15864537 ]

Zameer Manji commented on AURORA-1890:
--

The scheduler does the right thing on first pulse. However, on failover, any 
coordinated updates are immediately sent to BLOCKED_AWAITING_PULSE. This is 
because on scheduler startup the pulse state is reset to no pulse received. The 
code sets the timestamp of the last pulse received to 0L:

{noformat}
synchronized void initializePulseState(IJobUpdate update, JobUpdateStatus status) {
  pulseStates.put(update.getSummary().getKey(), new PulseState(
      status,
      update.getInstructions().getSettings().getBlockIfNoPulsesAfterMs(),
      0L));
}
{noformat}
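
To make the consequence of the {{0L}} initial value concrete, here is a minimal model of the blocked check. It assumes that an update counts as blocked once the time since the last pulse reaches the timeout; that comparison, the class {{PulseStateSketch}}, and its fields are illustrative assumptions, not the actual {{PulseState}} implementation.

```java
// Hypothetical model of the pulse-timeout check; names are illustrative only.
final class PulseStateSketch {
  private final long blockIfNoPulsesAfterMs;
  private final long lastPulseMs;

  PulseStateSketch(long blockIfNoPulsesAfterMs, long lastPulseMs) {
    this.blockIfNoPulsesAfterMs = blockIfNoPulsesAfterMs;
    this.lastPulseMs = lastPulseMs;
  }

  // Assumed rule: blocked once the time since the last pulse reaches the timeout.
  boolean isBlocked(long nowMs) {
    return nowMs - lastPulseMs >= blockIfNoPulsesAfterMs;
  }

  public static void main(String[] args) {
    long oneHourMs = 3_600_000L;
    long nowMs = 1_487_000_000_000L; // an arbitrary wall-clock time in early 2017

    // After failover the last pulse is reset to 0L, so the timeout looks long expired.
    System.out.println(new PulseStateSketch(oneHourMs, 0L).isBlocked(nowMs)); // true

    // Had the pulse from one minute ago been remembered, the update would not block.
    System.out.println(new PulseStateSketch(oneHourMs, nowMs - 60_000L).isBlocked(nowMs)); // false
  }
}
```

Under this assumption, any realistic wall-clock time minus {{0L}} exceeds the timeout, which matches the observed jump to BLOCKED_AWAITING_PULSE right after failover.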

Would it be ok to set the timestamp to the first event after the most recent 
{{BLOCKED_AWAITING_PULSE}}? We know for sure at that point in time that a pulse 
was received because of the state transition from {{BLOCKED_AWAITING_PULSE}} to 
some other event.
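
A minimal sketch of that lookup follows, assuming the update's events are available in chronological order as (status, timestamp) pairs. The {{Event}} type and the method name are hypothetical and do not correspond to Aurora's storage API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

final class LastPulseFromEventsSketch {
  static final String BLOCKED = "BLOCKED_AWAITING_PULSE";

  // Hypothetical event record: a status name plus the time it was recorded.
  static final class Event {
    final String status;
    final long timestampMs;

    Event(String status, long timestampMs) {
      this.status = status;
      this.timestampMs = timestampMs;
    }
  }

  // Returns the timestamp of the first event after the most recent
  // BLOCKED_AWAITING_PULSE event. Empty if the update was never blocked or
  // if the blocked event is the latest one (the update is still blocked).
  // Assumes `events` is sorted oldest first.
  static OptionalLong firstEventAfterLastBlock(List<Event> events) {
    int lastBlocked = -1;
    for (int i = 0; i < events.size(); i++) {
      if (BLOCKED.equals(events.get(i).status)) {
        lastBlocked = i;
      }
    }
    if (lastBlocked < 0 || lastBlocked + 1 >= events.size()) {
      return OptionalLong.empty();
    }
    return OptionalLong.of(events.get(lastBlocked + 1).timestampMs);
  }

  public static void main(String[] args) {
    List<Event> events = Arrays.asList(
        new Event(BLOCKED, 100L),
        new Event("ROLLING_FORWARD", 200L),
        new Event(BLOCKED, 300L),
        new Event("ROLLING_FORWARD", 450L));
    // Prints OptionalLong[450]
    System.out.println(firstEventAfterLastBlock(events));
  }
}
```

When the result is empty, the caller would need some fallback; which fallback is appropriate is not settled in this thread.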

Also, could you describe "significant" write volume? I can imagine that if the 
pulse interval were a few seconds and there were thousands of updates, it would 
perhaps be too much. However, we could prevent excessively small pulse 
intervals.
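
For a rough sense of scale, the sketch below estimates the write rate under the assumption of one storage write per pulse per active update, and shows what a floor on the pulse interval could look like. The class, the method names, and the 60-second minimum are all hypothetical; no specific limit has been proposed here.

```java
// Back-of-the-envelope helper; every name and constant here is illustrative.
final class PulseIntervalSketch {
  // Hypothetical floor on the pulse interval; not a value from this ticket.
  static final long MIN_PULSE_INTERVAL_MS = 60_000L;

  // Estimated storage writes per second, assuming one write per pulse per update.
  static double writesPerSecond(int activeUpdates, long pulseIntervalMs) {
    if (pulseIntervalMs <= 0) {
      throw new IllegalArgumentException("pulse interval must be positive");
    }
    return activeUpdates * 1000.0 / pulseIntervalMs;
  }

  // Rejects pulse intervals below the assumed floor and returns valid ones unchanged.
  static long validatePulseIntervalMs(long requestedMs) {
    if (requestedMs < MIN_PULSE_INTERVAL_MS) {
      throw new IllegalArgumentException(
          "pulse interval " + requestedMs + "ms is below the minimum of "
              + MIN_PULSE_INTERVAL_MS + "ms");
    }
    return requestedMs;
  }

  public static void main(String[] args) {
    // 2000 updates pulsing every second: 2000.0 writes per second.
    System.out.println(writesPerSecond(2000, 1_000L));
    // The same 2000 updates pulsing once a minute: roughly 33 writes per second.
    System.out.println(writesPerSecond(2000, 60_000L));
  }
}
```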



[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-13 Thread David McLaughlin (JIRA)

[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864530#comment-15864530 ]

David McLaughlin commented on AURORA-1890:
--

There should also be a grace period of pulse_interval_secs that the Scheduler 
waits for before transitioning to BLOCKED_AWAITING_PULSE. Is that not the case? 



[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-13 Thread David McLaughlin (JIRA)

[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864528#comment-15864528 ]

David McLaughlin commented on AURORA-1890:
--

Just an FYI: the design there was intentional; otherwise the write volume 
caused by pulses would be significant. 

Plus, the Scheduler does the right thing on first pulse, right?
