[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256927#comment-17256927 ]

Piotr Nowojski commented on FLINK-14814:

The previous approach, using {{isBackPressuredRatio}} and {{isCausingBackPressureRatio}}, had a major problem with measurement accuracy when load spikes happened faster than the sampling rate (it's impossible to accurately sample a wave with a sampling rate lower than twice the wave's frequency). Because of that I switched to another approach: using {{backPressuredTimeMsPerSecond}} and {{busyTimeMsPerSecond}}, which we can calculate much more accurately.

> Show the vertex that produces the backpressure source in the job
>
>                 Key: FLINK-14814
>                 URL: https://issues.apache.org/jira/browse/FLINK-14814
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Metrics, Runtime / Network, Runtime / REST, Runtime / Web Frontend
>            Reporter: lining
>            Assignee: Piotr Nowojski
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 2B0E910D-6D95-401F-B450-1F6B1AFB9BEA.png, Screenshot 2020-12-30 at 14.09.19.png, Screenshot 2020-12-31 at 10.27.52.png
>
> By checking the status of output and input buffer pools exposed via
> FLINK-14815 (output buffer empty, input buffer full) it is possible to
> display which node is a source of the back pressure. This information could
> be displayed/accessible in the Web Frontend.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
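To illustrate the accuracy problem described in the comment above, here is a minimal simulation sketch. It is not Flink's implementation: the load pattern, the 1 ms resolution, and the function names are all assumptions made up for the example. It compares a ratio estimated by periodic sampling with back-pressured time accumulated directly.

```python
def is_back_pressured(t_ms: int) -> bool:
    """Hypothetical load pattern: back-pressured for the first 10 ms of every 40 ms cycle."""
    return t_ms % 40 < 10


def sampled_ratio(duration_ms: int, sample_interval_ms: int) -> float:
    """Estimate the back-pressured ratio by polling the state at fixed intervals."""
    samples = [is_back_pressured(t) for t in range(0, duration_ms, sample_interval_ms)]
    return sum(samples) / len(samples)


def back_pressured_time_ms_per_second(duration_ms: int) -> float:
    """Accumulate the time actually spent back-pressured, normalised to one second."""
    total_ms = sum(1 for t in range(duration_ms) if is_back_pressured(t))
    return total_ms * 1000 / duration_ms


# Sampling every 200 ms always lands at the start of a 40 ms cycle,
# so every sample sees back pressure even though it is present only 25% of the time.
print(sampled_ratio(1000, 200))                 # 1.0
print(back_pressured_time_ms_per_second(1000))  # 250.0
```

In this toy setup the sampled ratio reports 100% back pressure, while the accumulated time gives 250 ms per second, which matches the pattern that was simulated.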
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256926#comment-17256926 ]

Piotr Nowojski commented on FLINK-14814:

New visualisation based on the updated approach:

!Screenshot 2020-12-31 at 10.27.52.png!
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256515#comment-17256515 ]

Piotr Nowojski commented on FLINK-14814:

My current proposal (implemented in the PR) for how to display the backpressure status looks like this:

!Screenshot 2020-12-30 at 14.09.19.png!
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236101#comment-17236101 ]

Matthias commented on FLINK-14814:

FYI: I commented on the state of this issue in its parent FLINK-14712.
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977229#comment-16977229 ]

Piotr Nowojski commented on FLINK-14814:

Ok, sounds good [~lining]. One thing:

> pool usage aggregated by max, min, and the average in every vertex for users
> to judge vertex

Do we need all of the aggregates? Max & min, for example? See my explanation in FLINK-14815 for why I think min/max aggregates of pool usage might be redundant given the average. On the other hand, presenting too many metrics has a couple of potential issues:
# information spam for the user (why show them something they don't need?)
# potential performance implications? Even if there are none today, metrics that we add now will be difficult to drop in the future.
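One possible reading of the redundancy argument, not taken from FLINK-14815 itself: if each subtask's pool is idealised as either empty (0.0) or full (1.0), in line with the observation elsewhere in this thread that buffers are rarely in other states, then min and max are almost fully determined by the average. The sketch below uses made-up values and a hypothetical helper name.

```python
from typing import List, Tuple


def summarize(usages: List[float]) -> Tuple[float, float, float]:
    """Return (min, max, average) of per-subtask pool usage."""
    return min(usages), max(usages), sum(usages) / len(usages)


# Hypothetical subtasks, each idealised as either empty (0.0) or full (1.0).
for usages in ([0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]):
    lowest, highest, average = summarize(usages)
    print(f"min={lowest} max={highest} avg={average}")
```

Under that assumption, min is 0.0 unless the average is 1.0, and max is 1.0 unless the average is 0.0, so the two extra aggregates add little beyond the average. With real pools that sit at intermediate usage, this no longer holds exactly.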
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976469#comment-16976469 ]

Yadong Xie commented on FLINK-14814:

Sure [~lining], this proposal looks great. I will review this PR when you finish it.
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976459#comment-16976459 ]

lining commented on FLINK-14814:

Maybe we could create another Jira for showing these metrics on the vertex, and use this one for the REST API that exposes these metrics.
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976456#comment-16976456 ]

lining commented on FLINK-14814:

We want to do both.
1. Present non-aggregated metrics per subtask, to find which subtask is blocked.
2. Present pool usage aggregated by max, min, and average for every vertex, so users can judge the vertex.
3. Show the back-pressured metric from FLINK-14813 on the vertex.

Maybe [~vthinkxie] could help us review it.
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976364#comment-16976364 ]

Piotr Nowojski commented on FLINK-14814:

Having multiple output edges is, I think, not that common, and even then one can deduce the state from the combined output usage, based on the fact that buffers are rarely in states other than "mostly empty" and "mostly full". A value of {{outputUsage}} hovering around 50% means one output is full and the other is empty. Because of that I wouldn't worry about it too much, at least not in the first version.

I think the bigger problem is that your screenshot displays the tasks, not individual subtasks/parallel instances. This raises a few questions:
# do we want to present non-aggregated metrics for subtasks?
# do we want to present aggregated metrics for the tasks? ...
# ... if so, how to aggregate the metrics (and who should be doing that)?

1. would be easier to do and significantly more detailed and fine-grained, but less user friendly and more difficult to use.
2. loses some information in exchange for simpler usage (we might want to do both, or one first and the other later).
3. we would have to decide how to aggregate individual values. For example, if one single subtask is back-pressured, do we report that the whole task is back-pressured? For pool usage, should we average them out? Take the max? Regarding who should be doing that: it shouldn't be the UI, so in that case we would need one more metric-related ticket to actually come up with an idea of how to aggregate the metrics.
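The aggregation question raised in the comment above can be made concrete with a small sketch. It does not describe what Flink does: the {{SubtaskMetrics}} type, its fields, and the choice of aggregates are hypothetical and only show how different aggregation rules give different task-level answers for the same subtasks.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Union


@dataclass
class SubtaskMetrics:
    """Hypothetical per-subtask snapshot; field names are illustrative only."""
    back_pressured: bool
    output_pool_usage: float  # fraction of the output buffer pool in use, 0.0 to 1.0


def aggregate_task(subtasks: List[SubtaskMetrics]) -> Dict[str, Union[bool, float]]:
    """Compute several candidate task-level aggregates from subtask metrics."""
    usages = [s.output_pool_usage for s in subtasks]
    flags = [s.back_pressured for s in subtasks]
    return {
        "any_back_pressured": any(flags),                    # one subtask is enough
        "back_pressured_fraction": sum(flags) / len(flags),  # share of affected subtasks
        "pool_usage_avg": mean(usages),
        "pool_usage_max": max(usages),
    }


# One back-pressured subtask out of four.
subtasks = [SubtaskMetrics(True, 1.0)] + [SubtaskMetrics(False, 0.0)] * 3
print(aggregate_task(subtasks))
```

With one of four subtasks back-pressured, an "any" rule reports the whole task as back-pressured and the max usage is 1.0, while the fraction and the average are both 0.25. Which of these should be shown is exactly the open decision in the comment.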
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976238#comment-16976238 ]

lining commented on FLINK-14814:

Now we may just show this information in the Web Frontend.
[jira] [Commented] (FLINK-14814) Show the vertex that produces the backpressure source in the job
[ https://issues.apache.org/jira/browse/FLINK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974967#comment-16974967 ]

Piotr Nowojski commented on FLINK-14814:

One clarifying question [~lining]. What's the scope of this ticket? Would you like this information to be visible in the Web Frontend, or just exposed via some metric?