Re: Streaming side-input performance in dataflow

2021-07-28 Thread Steve Niemitz
So I spent some more time on this today, and noticed something interesting when trying to reproduce it: it only seems to happen if the ParDo using the side input is fused with another stage that uses state. I'm not quite sure why this is, but I'm also fairly certain the fix is to simply clear the

Re: Streaming side-input performance in dataflow

2021-07-22 Thread Kenneth Knowles
On Thu, Jul 22, 2021 at 4:47 AM Steve Niemitz wrote:
> I don't think I'd call it a bug? The cache doesn't differentiate between
> a state cell that existed but was cleared, and one that is missing from the
> cache (maybe it should?).
>
Filing this in my collection of problems caused by "nullabl

Re: Streaming side-input performance in dataflow

2021-07-22 Thread Steve Niemitz
I don't think I'd call it a bug? The cache doesn't differentiate between a state cell that existed but was cleared, and one that is missing from the cache (maybe it should?). The side input fetcher clears the blocked state when it becomes unblocked: https://github.com/apache/beam/blob/master/run
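To make the cleared-versus-missing distinction concrete, here is a minimal sketch of a cache that records "known to be empty" separately from "not cached". It is not the Dataflow worker's state cache: `StateCache`, `_CLEARED`, and `lookup` are hypothetical names chosen for illustration, and the real cache may be organized quite differently.

```python
from typing import Any, Dict, Hashable, Tuple

# Hypothetical sentinel meaning "this cell is known to be empty".
_CLEARED = object()


class StateCache:
    """Toy cache that tells a cleared cell apart from an uncached one."""

    def __init__(self) -> None:
        self._entries: Dict[Hashable, Any] = {}

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = value

    def clear(self, key: Hashable) -> None:
        # Record that the cell is empty instead of dropping the entry,
        # so a later lookup does not have to go back to the backend.
        self._entries[key] = _CLEARED

    def lookup(self, key: Hashable) -> Tuple[bool, Any]:
        """Return (known, value); known=False means a backend read is needed."""
        if key not in self._entries:
            return (False, None)
        value = self._entries[key]
        if value is _CLEARED:
            return (True, None)
        return (True, value)
```

With this shape, a lookup on a cleared cell returns `(True, None)` and can be served locally, while a lookup on a key that was never cached returns `(False, None)` and signals that a read from the state backend is still required.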

Streaming side-input performance in dataflow

2021-07-21 Thread Steve Niemitz
I opened a Jira years ago [1] about this, and would like to actually fix it now, given that our users have started using streaming more and more. There's more detail in the Jira, but basically side inputs in streaming pipelines on Dataflow lead to pretty bad performance because they r