[jira] [Created] (AURORA-1892) TaskQuery `limit` and `offset` must be applied at TaskStore
Santhosh Kumar Shanmugham created AURORA-1892:

Summary: TaskQuery `limit` and `offset` must be applied at TaskStore
Key: AURORA-1892
URL: https://issues.apache.org/jira/browse/AURORA-1892
Project: Aurora
Issue Type: Task
Reporter: Santhosh Kumar Shanmugham

{{TaskQuery}}'s {{limit}} and {{offset}} are currently applied after the results have been fetched from the {{TaskStore}}, which is inefficient. Apply the {{limit}} and {{offset}} conditions inside the store itself, in both {{MemTaskStore}} and {{DBTaskStore}}.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
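A minimal sketch of the difference this ticket is asking for, assuming a much simplified store. The {{InMemoryTaskStore}} class, its {{save}} and {{fetch}} methods, and the use of plain task IDs are hypothetical stand-ins for illustration, not the actual {{TaskStore}} API. The point is only that the window is applied while scanning, so the caller never receives more than {{limit}} results; a DB-backed store would presumably express the same window in its query instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical, simplified store used only to illustrate the idea.
final class InMemoryTaskStore {
  private final List<String> taskIds = new ArrayList<>();

  void save(String taskId) {
    taskIds.add(taskId);
  }

  // Applies offset and limit inside the store. The stream is lazy, so
  // scanning stops once `limit` matching tasks have been collected.
  List<String> fetch(Predicate<String> filter, long offset, long limit) {
    return taskIds.stream()
        .filter(filter)
        .skip(offset)
        .limit(limit)
        .collect(Collectors.toList());
  }
}
```

With ten saved tasks, {{fetch(id -> true, 3, 2)}} returns only the fourth and fifth tasks rather than all ten.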
[jira] [Created] (AURORA-1891) Unable to upgrade Guava
Zameer Manji created AURORA-1891:

Summary: Unable to upgrade Guava
Key: AURORA-1891
URL: https://issues.apache.org/jira/browse/AURORA-1891
Project: Aurora
Issue Type: Bug
Reporter: Zameer Manji
Priority: Minor

Guava 21 is out with better Java 8 integration, but I cannot upgrade us to it. Bumping the dependency results in:
{noformat}
/Users/zmanji/code/aurora/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:82: error: cannot find symbol
class WriteAheadStorage extends WriteAheadStorageForwarder implements
                                ^
  symbol: class WriteAheadStorageForwarder
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith': class file for com.google.errorprone.annotations.CompatibleWith not found
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multiset.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multiset.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/code/aurora/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:74: Note: Wrote forwarder org.apache.aurora.scheduler.storage.log.WriteAheadStorageForwarder
@Forward({
^
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith': class file for com.google.errorprone.annotations.CompatibleWith not found
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864608#comment-15864608 ]

David McLaughlin commented on AURORA-1890:

Sounds good to me.

> Job Update Pulse History is not durably stored
>
> Key: AURORA-1890
> URL: https://issues.apache.org/jira/browse/AURORA-1890
> Project: Aurora
> Issue Type: Bug
> Reporter: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h
> 2. Send a pulse to get the update going.
> 3. Failover the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
> This is because the {{JobUpdateControllerImpl}} stores pulse history and state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is reset to no pulse received.
> We can solve this by durably storing the timestamp of the last pulse received in storage.
[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864598#comment-15864598 ]

Zameer Manji commented on AURORA-1890:

I would be content with initializing the {{PulseState}} timestamp with the timestamp of the most recent event that transitioned the update out of {{BLOCKED_AWAITING_PULSE}}. I feel this is more correct than what we do now, avoids hashing out some storage changes, and is suitable for my current use case.

If you confirm that you agree, I can rephrase this ticket to better capture what the fix would be.
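The initialization proposed above could look roughly like the following sketch. The {{Status}} enum, the {{Event}} class, and {{initialPulseTimestampMs}} are hypothetical names for illustration and do not mirror Aurora's actual update event types. The sketch assumes events are ordered oldest to newest, and it falls back to 0L (the current behavior) when the update was never blocked or is still blocked; that fallback is an assumption, not something settled in this thread.

```java
import java.util.List;

// Hypothetical types used only to illustrate the proposal.
final class PulseTimestampSketch {
  enum Status { ROLLING_FORWARD, BLOCKED_AWAITING_PULSE }

  static final class Event {
    final Status status;
    final long timestampMs;

    Event(Status status, long timestampMs) {
      this.status = status;
      this.timestampMs = timestampMs;
    }
  }

  // Returns the timestamp of the first event after the most recent
  // BLOCKED_AWAITING_PULSE event, or 0L if there is no such event.
  static long initialPulseTimestampMs(List<Event> eventsOldestFirst) {
    for (int i = eventsOldestFirst.size() - 1; i >= 0; i--) {
      if (eventsOldestFirst.get(i).status == Status.BLOCKED_AWAITING_PULSE) {
        // Leaving the blocked state implies that a pulse was received.
        return i + 1 < eventsOldestFirst.size()
            ? eventsOldestFirst.get(i + 1).timestampMs
            : 0L;
      }
    }
    return 0L;
  }
}
```

This only reads events that are already stored, which is why it avoids the storage changes that persisting every pulse would need.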
[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864569#comment-15864569 ]

David McLaughlin commented on AURORA-1890:

You're right, the write volume is totally dependent on your update volume and the pulse interval. For many use cases, the cost of the update would be negligible. I think the real concern was the cost of reading the last pulse time.

One other reason why persisting the pulse is not super useful is that the scheduler failover time typically exceeds a sane pulse timeout. The same applies to automatically setting it to the last event time (which would be preferable IMO).

I think the reason we backed out of the grace period change (which was going to be achieved by setting the timestamp to the time at which the scheduler acquired leadership) is that it would potentially reactivate a bunch of updates that were legitimately blocked. In the end, we agreed the churn from ROLLING_FORWARD -> BLOCKED_AWAITING_PULSE -> ROLLING_FORWARD was harmless. But I suppose if you have automation on top of this that reacts to state changes, it could be annoying.
[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864537#comment-15864537 ]

Zameer Manji commented on AURORA-1890:

The scheduler does the right thing on first pulse. However, on failover, any coordinated updates are immediately sent to BLOCKED_AWAITING_PULSE. This is because on scheduler startup the pulse state is reset to no pulse received. The code sets the timestamp of the last pulse received to 0L:
{noformat}
synchronized void initializePulseState(IJobUpdate update, JobUpdateStatus status) {
  pulseStates.put(update.getSummary().getKey(), new PulseState(
      status,
      update.getInstructions().getSettings().getBlockIfNoPulsesAfterMs(),
      0L));
}
{noformat}
Would it be ok to set the timestamp to that of the first event after the most recent {{BLOCKED_AWAITING_PULSE}}? We know for sure that a pulse was received at that point in time because of the state transition from {{BLOCKED_AWAITING_PULSE}} to some other state.

Also, could you describe "significant" write volume? I can imagine that if the pulse interval were in the seconds and there were thousands of updates, it would perhaps be too much. However, we could prevent excessively small pulse intervals.
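To make the effect of the 0L seed concrete, here is a small sketch of a timeout check. The {{PulseStateSketch}} class and its {{isBlocked}} method are assumptions about the general shape of such a check, not a copy of Aurora's {{PulseState}}. With a last-pulse timestamp of 0L, any realistic wall-clock time is far more than one hour past the "last pulse", so the update is treated as blocked immediately after startup.

```java
// Hypothetical sketch; the real PulseState may differ.
final class PulseStateSketch {
  private final long blockIfNoPulsesAfterMs;
  private final long lastPulseMs;

  PulseStateSketch(long blockIfNoPulsesAfterMs, long lastPulseMs) {
    this.blockIfNoPulsesAfterMs = blockIfNoPulsesAfterMs;
    this.lastPulseMs = lastPulseMs;
  }

  // Assumed rule: blocked once the time since the last pulse reaches the
  // timeout. A timeout of 0 means the update is not pulse-coordinated.
  boolean isBlocked(long nowMs) {
    return blockIfNoPulsesAfterMs > 0
        && nowMs - lastPulseMs >= blockIfNoPulsesAfterMs;
  }
}
```

For example, with a one hour timeout (3,600,000 ms), {{new PulseStateSketch(3_600_000L, 0L).isBlocked(now)}} is true for any current epoch time, whereas seeding the state with a pulse received one minute ago yields false.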
[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864530#comment-15864530 ]

David McLaughlin commented on AURORA-1890:

There should also be a grace period of pulse_interval_secs that the Scheduler waits before transitioning to BLOCKED_AWAITING_PULSE. Is that not the case?
[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864528#comment-15864528 ]

David McLaughlin commented on AURORA-1890:

Just an FYI, the design there was intentional; otherwise the write volume caused by pulses would be significant. Plus, the Scheduler does the right thing on first pulse, right?