Re: Web UI shows my AssignTimestamp is in high back pressure but in/outPoolUsage are both 0.

2021-06-18 Thread Piotr Nowojski
Hi Haocheng,

Regarding the first part, yes. For a very long time there was a trivial bug
in the Web UI: it displayed the maximum "backpressure status" ("HIGH" in
your case) across all of the subtasks for every subtask, instead of showing
each subtask's individual status [1]. It is/will be fixed in Flink 1.11.4,
1.12.4, 1.13.1, 1.14.0.
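
To illustrate the display bug described above, here is a minimal sketch (not
Flink's actual code). It assumes the ratio thresholds documented for the
pre-1.13 back pressure monitor (OK up to 0.10, LOW up to 0.50, HIGH above
that). The function names are hypothetical and exist only for this example.

```python
from typing import List

# Status labels ordered from least to most severe.
ORDER = ["OK", "LOW", "HIGH"]


def status(ratio: float) -> str:
    """Map a back pressure ratio (0.0..1.0) to a status label.

    Thresholds assumed from the pre-1.13 documentation:
    OK <= 0.10 < LOW <= 0.50 < HIGH.
    """
    if ratio <= 0.10:
        return "OK"
    if ratio <= 0.50:
        return "LOW"
    return "HIGH"


def displayed_with_bug(ratios: List[float]) -> List[str]:
    """Buggy display: every subtask shows the worst status of all subtasks."""
    worst = max((status(r) for r in ratios), key=ORDER.index)
    return [worst for _ in ratios]


def displayed_fixed(ratios: List[float]) -> List[str]:
    """Fixed display: every subtask shows its own status."""
    return [status(r) for r in ratios]


# Eight subtasks, two with ratio 0 and six above 0.80, as in the report.
ratios = [0.0, 0.0, 0.85, 0.9, 0.82, 0.88, 0.95, 0.81]
print(displayed_with_bug(ratios))
print(displayed_fixed(ratios))
```

With ratios like the ones you reported, the buggy variant marks all eight
subtasks as "HIGH", while the per-subtask variant shows two of them as "OK".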

Also please note that, starting from 1.13.0, Flink has much better, more
user-friendly tools for analysing the source of backpressure [2]. I would
highly recommend upgrading to it.

About the empty `inPoolUsage`: keep in mind that this metric ignores
local channels [3], which might be hiding the problem. But yes, in
principle, if the upstream subtask has full output buffers while the
downstream subtasks have empty input buffers, that most likely means there
is a problem in the network exchange. It could be network IO related, the
network threads could be overloaded (CPU), or it could be some other issue
(GC, encryption/SSL, compression). But that should only happen in very
high throughput jobs, with hundreds of MB/s of network traffic. I would
first make sure that your `Window` operator is not causing the
backpressure. You could do that by upgrading to Flink 1.13.x and checking
the newly added `busyTimeMsPerSecond` metric. Alternatively, you can attach
a CPU profiler to a TaskManager. This is the most reliable way.
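
The reasoning above can be written down as a rough decision rule. This is
only a sketch of the heuristic, not anything Flink provides: the `Exchange`
class, the `diagnose` function and all cut-off values (0.9, 0.1, 900 ms) are
assumptions made for illustration. Only the metric names `outPoolUsage`,
`inPoolUsage` and `busyTimeMsPerSecond` come from Flink.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exchange:
    """Metrics read by hand for one upstream -> downstream exchange."""
    upstream_out_pool_usage: float    # outPoolUsage of the upstream subtask, 0.0..1.0
    downstream_in_pool_usage: float   # inPoolUsage of the downstream subtask, 0.0..1.0
    downstream_busy_ms_per_s: Optional[float] = None  # busyTimeMsPerSecond (1.13+), if available
    has_local_channels: bool = False  # inPoolUsage ignores local channels


def diagnose(x: Exchange, full: float = 0.9, empty: float = 0.1,
             busy_ms: float = 900.0) -> str:
    """Return a coarse hint; the cut-off values are illustrative assumptions."""
    if x.upstream_out_pool_usage < full:
        return "no-backpressure"      # upstream output buffers are not full
    if x.downstream_busy_ms_per_s is not None and x.downstream_busy_ms_per_s >= busy_ms:
        return "downstream-busy"      # e.g. the Window operator itself is the bottleneck
    if x.downstream_in_pool_usage <= empty:
        if x.has_local_channels:
            return "inconclusive"     # empty inPoolUsage may be hiding local-channel load
        return "network-exchange"     # full upstream, empty downstream: suspect the network
    return "downstream-buffers-full"  # input is piling up at the downstream operator


# The situation from the question: FlatMap outPoolUsage = 1, Window inPoolUsage = 0.
print(diagnose(Exchange(1.0, 0.0)))
print(diagnose(Exchange(1.0, 0.0, downstream_busy_ms_per_s=980.0)))
```

Without `busyTimeMsPerSecond` the rule can only point at the network
exchange; once that metric is available, a busy downstream operator takes
precedence, which is why I would check it first.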

Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-22489
[2] https://issues.apache.org/jira/browse/FLINK-14814
[3]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/metrics/#default-shuffle-service



Web UI shows my AssignTimestamp is in high back pressure but in/outPoolUsage are both 0.

2021-06-12 Thread Haocheng Wang
Hi, I have a job like 'Source -> assignmentTimestamp -> flatmap -> Window
-> Sink', and the 'BackPressure' tab in the Web UI shows back pressure from
'Source' through to the 'FlatMap' operator.
To find which operator is the source of the back pressure, I am using the
metrics provided by the Web UI, specifically 'inPoolUsage' and
'outPoolUsage'.
First, as far as I know, when both of these metrics are 0, the operator
should not be considered 'back pressured'. But when I check the
'AssignmentTimestamp' operator, which has 8 subtasks running, 1 or 2 of
them have a back pressure index of 0, the others have an index higher than
0.80, and all of them are marked as 'HIGH'.
However, the two metrics, 'in/outPoolUsage', are always 0. So maybe the
operator is not actually back pressured? Or is there a problem with my
Flink Web UI?
My second question: from my experience, I think the source of the back
pressure should be the Window operator, because the outPoolUsage of
'FlatMap' is 1 and 'Window' is the first operator downstream of 'FlatMap'.
But the inPoolUsage and outPoolUsage of 'Window' are also 0. So is the
cause of the back pressure a network bottleneck between 'Window' and
'FlatMap'? Am I right?
Thanks for reading, and I'm looking forward to your ideas.

Haocheng