First off, thanks for your reply!

I have an assumption that I should probably verify first:
When determining the source of the backpressure we look (in the WebUI) for the 
first operator in our pipeline that is not showing backpressure. That’s what we 
consider to be the source of the backpressure

In this case the first operator that in our graph that is not showing 
backpressure is our window operator (all though the keyBy operation right 
before it doesn’t show up in the graph). The window function uses a custom 
aggregation function that builds up a hashmap and a custom process function 
that emits the hashmap and performs some metrics operations. I am not sure how 
this would generate backpressure since it doesn’t perform any IO, but again I 
might be drawing incorrect conclusions.

The window function has a parallelism of 32. Each of the Subtasks has between 
136kb and 2.45mb of state, with a checkpoint duration of 280ms to 2 seconds. 
Each of the 32 subtasks appear to be handling 1,700-50,000 records an hour with 
a bytes received of 7mb and 170mb

Am I barking up the wrong tree?

-Steve


From: David Anderson <da...@alpinegizmo.com>
Sent: Friday, July 17, 2020 6:54 AM
To: Nelson Steven <nelsonste...@johndeere.com>
Cc: user@flink.apache.org
Subject: Re: Backpressure on Window step

Backpressure is typically caused by something like one of these things:

* problems relating to i/o to external services (e.g., enrichment via an API or 
database lookup, or a sink)
* data skew (e.g., a hot key)
* under-provisioning, or competition for resources
* spikes in traffic
* timer storms

I would start to debug this by looking for signs of significant asymmetry in 
the metrics (across the various subtasks), or resource exhaustion. Could be 
related to the network, GC, CPU, disk i/o, etc. Flink's webUI will show you 
checkpoint size and timing information for each sub-task; you can learn a lot 
from studying that data.

Relating to session windows -- could you occasionally have an unusually long 
session, and might that cause problems?

Best,
David

On Tue, Jul 14, 2020 at 10:12 PM Nelson Steven 
<nelsonste...@johndeere.com<mailto:nelsonste...@johndeere.com>> wrote:
Hello!

We are experiencing occasional backpressure on a Window function in our 
pipeline. The window is on a KeyedStream and is triggered by an 
EventTimeSessionWindows.withGap(Time.seconds(30)). The prior step does a fanout 
and we use the window to sort things into batches based on the Key for the 
keyed stream. We aren’t seeing an unreasonable amount of records (500-600/s) on 
a parallism of 32 (prior step has a parallelism of 4). We are as interested in 
learning out to debug the issue as we are in fixing the actual problem. Any 
ideas?

-Steve

Reply via email to