Zhanghao Chen created FLINK-31769:
-------------------------------------
Summary: Add percentiles to aggregated metrics
Key: FLINK-31769
URL: https://issues.apache.org/jira/browse/FLINK-31769
Project: Flink
Issue Type: Improvement
Components: Autoscaler, Runtime / Metrics
Reporter: Zhanghao Chen
Attachments: image-2023-04-11-15-11-51-471.png
*Background*
Currently only min/avg/max of metrics are exposed via REST API. Flink
Autoscaler relies on these aggregated metrics to make predictions, and the type
of aggregation plays an import role. [FLINK-30652] Use max busytime instead of
average to compute true processing rate - ASF JIRA (apache.org) suggests that
using max aggregator instead of avg of busy time can handle data skew more
robustly. However, we found that for large-scale jobs, using max aggregation
may be too sensitive. As a result, the true processing rate is underestimated
with severe turbulence.
The graph below is the true processing rate estimated with different
aggregators of a real production data transmission job with a parallelism of
750.
!image-2023-04-11-15-11-51-471.png!
*Proposal*
Add percentiles (p50, p90, p99) to aggregated metrics. Apache common maths can
be used for computing that.
A follow up would be making Flink autoscaler make use of the new aggregators.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)