[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-05-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=431314&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-431314
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 06/May/20 16:58
Start Date: 06/May/20 16:58
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on pull request #10852:
URL: https://github.com/apache/beam/pull/10852#issuecomment-624768356


   This pull request has been closed due to lack of activity. If you think that 
is incorrect, or the pull request requires review, you can revive the PR at any 
time.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 431314)
Time Spent: 2h  (was: 1h 50m)

> Optimize state cleanup at end-of-window
> ---
>
> Key: BEAM-9308
> URL: https://issues.apache.org/jira/browse/BEAM-9308
> Project: Beam
> Issue Type: Improvement
> Components: runner-dataflow
> Reporter: Steve Niemitz
> Assignee: Steve Niemitz
> Priority: Major
> Time Spent: 2h
> Remaining Estimate: 0h
>
> When using state with a large keyspace, you can end up with a large number of 
> state cleanup timers all set to fire 1ms after the end of a window.  This can 
> cause a momentary (I've observed 1-3 minute) lag in processing while Windmill 
> and the Java harness fire and process these cleanup timers.
> By spreading the firing over a short period after the end of the window, we 
> can decorrelate the firing of the timers and smooth the load out, resulting 
> in much less impact from state cleanup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-04-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=428653&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-428653
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 29/Apr/20 16:29
Start Date: 29/Apr/20 16:29
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on pull request #10852:
URL: https://github.com/apache/beam/pull/10852#issuecomment-621322177


   This pull request has been marked as stale due to 60 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@beam.apache.org list. Thank you for your 
contributions.
   





Issue Time Tracking
---

Worklog Id: (was: 428653)
Time Spent: 1h 50m  (was: 1h 40m)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=395475&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-395475
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 29/Feb/20 15:54
Start Date: 29/Feb/20 15:54
Worklog Time Spent: 10m 
  Work Description: steveniemitz commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-592959333
 
 
   > maybe we need to explore the prioritization issue a bit more.
   
   Agreed, I think ideally the state cleanup timers would have a (much?) lower 
priority than everything else so they don't starve out more important "user" 
work.
   
   > Is this a blocker for you? If so, then maybe we can add a parameter to 
DataflowPipelineOptions to control this so we don't take the risk of changing 
the default behavior without more data.
   
   We run our own fork anyway, so it's not particularly a blocker here. 
 I mostly just intended this PR as a conversation starter.
   
   I am curious about your comment above though ("We currently rely on the 
state cleanup timer for watermark holds").  From what I've observed in the 
code, the state cleanup is set for after the window end, so delaying it 
slightly more shouldn't cause any correctness issues, correct?
   
   
 



Issue Time Tracking
---

Worklog Id: (was: 395475)
Time Spent: 1h 40m  (was: 1.5h)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=395473&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-395473
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 29/Feb/20 15:41
Start Date: 29/Feb/20 15:41
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-592958092
 
 
   I'm trying to think of a principled way to do this - maybe we need to 
explore the prioritization issue a bit more.
   
   Is this a blocker for you? If so, then maybe we can add a parameter to 
DataflowPipelineOptions to control this so we don't take the risk of changing 
the default behavior without more data.
 



Issue Time Tracking
---

Worklog Id: (was: 395473)
Time Spent: 1.5h  (was: 1h 20m)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=391041&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-391041
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 22/Feb/20 03:36
Start Date: 22/Feb/20 03:36
Worklog Time Spent: 10m 
  Work Description: steveniemitz commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-589913494
 
 
   > Why is this problem specific to the GC timer? How about the normal 
end-of-window timer that is used to fire windowed aggregations? For fixed 
windows there is one per key and those also all fire at the same time.
   
   heh, we already work around that on our own by using state + timers instead 
of the built-in combine transform.  We already decorrelate our end-of-window 
triggering (and we're now using the watermark hold feature for timers which 
simplified things a lot), but can't work around the state GC w/o changing the 
worker itself.
 



Issue Time Tracking
---

Worklog Id: (was: 391041)
Time Spent: 1h 20m  (was: 1h 10m)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=391040&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-391040
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 22/Feb/20 03:35
Start Date: 22/Feb/20 03:35
Worklog Time Spent: 10m 
  Work Description: steveniemitz commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-589913494
 
 
   > Why is this problem specific to the GC timer? How about the normal 
end-of-window timer that is used to fire windowed aggregations? For fixed 
windows there is one per key and those also all fire at the same time.
   
   heh, we already work around that on our own by using state + timers instead 
of the built-in combine transform.  We already decorrelate our end-of-window 
triggering (and we're now using the watermark hold feature for timers which 
simplified things a lot).
 



Issue Time Tracking
---

Worklog Id: (was: 391040)
Time Spent: 1h 10m  (was: 1h)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=391039&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-391039
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 22/Feb/20 03:32
Start Date: 22/Feb/20 03:32
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-589913248
 
 
   Why is this problem specific to the GC timer? How about the normal 
end-of-window timer that is used to fire windowed aggregations? For fixed 
windows there is one per key and those also all fire at the same time. 
 



Issue Time Tracking
---

Worklog Id: (was: 391039)
Time Spent: 1h  (was: 50m)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=391029&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-391029
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 22/Feb/20 02:24
Start Date: 22/Feb/20 02:24
Worklog Time Spent: 10m 
  Work Description: steveniemitz commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-589907688
 
 
   Yay thanks for looking at this.  I'll address your points in reverse order :P
   
   > Maybe we need a better prioritization strategy so that large #s of timers 
don't starve out elements?
   
   I think that'd be the best overall option, but ideally we'd have variable 
priority.  ie, state cleanup timers should be low priority, while user timers 
should be the same priority as "normal" elements.  In the end though, if we end 
up with state cleanup timers delayed by N minutes because they are 
deprioritized, that seems like we'd be in the same spot as explicitly 
decorrelating them here.
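
As a rough illustration of the variable-priority idea above, here is a minimal Python sketch of a work queue in which state cleanup timers sort behind elements and user timers. The `WorkQueue` class and the `NORMAL`/`LOW` constants are hypothetical names made up for this sketch; it does not reflect how Windmill or the Java harness actually schedule work.

```python
import heapq
import itertools

# Hypothetical priority classes: lower values are served first.
NORMAL = 0  # elements and user timers share one class
LOW = 1     # state cleanup timers


class WorkQueue:
    """Strict-priority queue that keeps FIFO order within a priority class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving insertion order

    def add(self, priority, item):
        heapq.heappush(self._heap, (priority, next(self._seq), item))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = WorkQueue()
queue.add(LOW, "cleanup-a")
queue.add(NORMAL, "element-1")
queue.add(LOW, "cleanup-b")
queue.add(NORMAL, "user-timer-1")

drained = [queue.pop() for _ in range(len(queue))]
print(drained)
```

With strict priorities like this, cleanup timers only run when no normal work is pending, so under sustained load they are simply delayed, which is the same outcome as jittering them explicitly.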
   
   > Delaying the timer will also prevent downstream aggregations from firing. 
3 minutes could cause issues if the window itself is much smaller.
   
   Agreed, I sort of touched on this in my comment about letting the duration 
be configurable.  Ideally it'd be some fraction of the window duration itself. 
   
   I'm not sure it actually will delay the downstream aggregations from firing 
however, since the firing time is set to after the window closes (maxTimestamp 
+ allowedLateness + 1ms), so once these begin firing, the watermark has already 
passed the end of the window.  Or am I misunderstanding something here?
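
The timing argument in the paragraph above can be written out as a small Python sketch. The function names and the example window are hypothetical; only the formula (maxTimestamp + allowedLateness + 1ms) comes from the comment. It covers firing times only and says nothing about the watermark-hold concern raised in the review.

```python
def base_cleanup_time_ms(max_timestamp_ms: int, allowed_lateness_ms: int) -> int:
    """Cleanup firing time as described above: maxTimestamp + allowedLateness + 1ms."""
    return max_timestamp_ms + allowed_lateness_ms + 1


def fires_after_expiry(firing_ms: int, max_timestamp_ms: int,
                       allowed_lateness_ms: int) -> bool:
    """True if the timer fires strictly after the window has expired."""
    return firing_ms > max_timestamp_ms + allowed_lateness_ms


# Hypothetical one-minute window whose last valid timestamp is 59,999 ms,
# with no allowed lateness.
max_ts = 59_999
base = base_cleanup_time_ms(max_ts, 0)

# Any non-negative extra delay keeps the firing time past window expiry.
delayed = base + 3 * 60 * 1000
print(base, delayed, fires_after_expiry(delayed, max_ts, 0))
```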
   
   > We want to reuse this timer for OnWindowExpiration, and this will delay 
all those callbacks as well.
   
   I'd actually argue that's preferable, since you'd have the same problem 
there as well (potentially millions of timers firing at the same time).
   
   > We currently rely on the state cleanup timer for watermark holds.
   
   Is this true?  The state cleanup timer is already set past the end of the 
window, so by the time the timer fires the window has already closed.
   
   
 



Issue Time Tracking
---

Worklog Id: (was: 391029)
Time Spent: 50m  (was: 40m)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=391021&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-391021
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 22/Feb/20 01:55
Start Date: 22/Feb/20 01:55
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-589904689
 
 
   As written, this is incorrect. We currently rely on the state cleanup timer 
for watermark holds. This PR will cause that hold to be pushed later, which can 
cause incorrect grouping for any downstream aggregations. This is something we 
might be able to address by using the new outputTimestamp.
   
   This requires some thought though. Delaying the timer will also prevent 
downstream aggregations from firing.  3 minutes could cause issues if the 
window itself is much smaller. We want to reuse this timer for 
OnWindowExpiration, and this will delay all those callbacks as well.
   
   I wonder if it would be better to first root cause why the GC timers caused 
issues for your pipeline. One possibility: I believe that today any timers for 
a key are always prioritized over any data for that key. Maybe we need a better 
prioritization strategy  so that large #s of timers don't starve out elements?
 



Issue Time Tracking
---

Worklog Id: (was: 391021)
Time Spent: 40m  (was: 0.5h)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=389483&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-389483
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 19/Feb/20 16:16
Start Date: 19/Feb/20 16:16
Worklog Time Spent: 10m 
  Work Description: steveniemitz commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-588303100
 
 
   cc @reuvenlax maybe?
 



Issue Time Tracking
---

Worklog Id: (was: 389483)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=386938&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386938
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 13/Feb/20 23:01
Start Date: 13/Feb/20 23:01
Worklog Time Spent: 10m 
  Work Description: steveniemitz commented on issue #10852: [BEAM-9308] 
Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852#issuecomment-586014639
 
 
   The precommit failures seem unrelated to this: one is 
`ParDoLifecycleTest.testTeardownCalledAfterExceptionInFinishBundleStateful` and 
the other is Cassandra failing to start.
 



Issue Time Tracking
---

Worklog Id: (was: 386938)
Time Spent: 20m  (was: 10m)



[jira] [Work logged] (BEAM-9308) Optimize state cleanup at end-of-window

2020-02-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9308?focusedWorklogId=386743&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386743
 ]

ASF GitHub Bot logged work on BEAM-9308:


Author: ASF GitHub Bot
Created on: 13/Feb/20 16:50
Start Date: 13/Feb/20 16:50
Worklog Time Spent: 10m 
  Work Description: steveniemitz commented on pull request #10852: 
[BEAM-9308] Decorrelate state cleanup timers
URL: https://github.com/apache/beam/pull/10852
 
 
   In our larger streaming pipelines, we generally observe a short blip (1-3 
minutes) in event processing, as well as an increase in lag following window 
closing.  One reason for this is the state cleanup timers all firing once a 
window closes.
   
   We've been running this PR in our dev environment for a few days now, and 
the results are impressive.  By decorrelating (jittering) the state cleanup 
timer, we spread the timer load across a short period of time, with the 
trade-off of holding state for a slightly longer period of time.  In practice 
though, I've actually noticed our state cleans up QUICKER with this change, 
because the timers don't all compete with each other.
   
   I'd like to contribute this back (and could modify the core StatefulDoFn 
runner as well) if we agree this is something useful.
   
   There's a couple points for discussion:
   - I chose 3 minutes arbitrarily based on some experimentation, should this 
be configurable somehow?
   - I use the "user" key (from their KV input) to derive a consistent jitter 
amount.  The only real reason for this is to prevent the timer from moving 
around each element (if we used just a random amount each time instead).  I'm 
not sure if this actually matters in practice, since timers are supposed to be 
cheap to reset?
   - I added a counter which has been useful for debugging (and seeing how many 
keys are active each window), but could be removed.
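
The key-derived jitter described in the second point could look roughly like the following Python sketch. This is not the PR's code: the function names are hypothetical, and CRC32 is just one convenient stable hash picked for illustration. The only details taken from the description are the 3-minute bound and the idea of deriving a consistent offset from the user key.

```python
import zlib

# The 3-minute spread mentioned above, in milliseconds.
MAX_JITTER_MS = 3 * 60 * 1000


def cleanup_jitter_ms(user_key: bytes, max_jitter_ms: int = MAX_JITTER_MS) -> int:
    """Map a key to a stable offset in [0, max_jitter_ms)."""
    # crc32 is deterministic across processes, so the same key always
    # yields the same offset and the timer does not move between elements.
    return zlib.crc32(user_key) % max_jitter_ms


def jittered_cleanup_time_ms(base_cleanup_ms: int, user_key: bytes) -> int:
    """Shift the base cleanup time later by the key's jitter."""
    return base_cleanup_ms + cleanup_jitter_ms(user_key)


base = 60_000  # hypothetical base cleanup time in ms
first = jittered_cleanup_time_ms(base, b"user-123")
second = jittered_cleanup_time_ms(base, b"user-123")
print(first == second, first - base)
```

A fixed hash is used here rather than Python's built-in `hash()`, whose value for strings and bytes is randomized per process by default and would therefore give a different offset on each worker restart.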
   
   Interested to hear thoughts here.  
   
   Here's a before and after of our pubsub latency:
   
   before:
   
![image](https://user-images.githubusercontent.com/1882981/74457778-bb9cf080-4e56-11ea-900c-69f2a4a28613.png)
   
   after:
   
![image](https://user-images.githubusercontent.com/1882981/74457812-c788b280-4e56-11ea-801e-a4b69a84a10b.png)
   
   Based on the counter I added, we're firing ~20 million timers, across 50 
workers = ~400,000 timers / worker.  So rather than having them all fire in one 
shot, we can spread them over 3 minutes, for only ~2,000 timers / sec, which is 
much more reasonable.
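
The arithmetic in the paragraph above, spelled out in Python. It assumes the timers are spread evenly across workers and evenly across the 3-minute period, which the description implies but does not state.

```python
# Figures quoted in the PR description.
total_timers = 20_000_000
workers = 50
spread_seconds = 3 * 60

timers_per_worker = total_timers // workers
timers_per_worker_per_second = timers_per_worker / spread_seconds

print(timers_per_worker)                    # 400000
print(round(timers_per_worker_per_second))  # 2222, i.e. roughly 2,000 per second
```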
   
   cc @lukecwik @pabloem 
   
   
   
 | [![Build