Zhanghao Chen created FLINK-31769:
-------------------------------------

             Summary: Add percentiles to aggregated metrics
                 Key: FLINK-31769
                 URL: https://issues.apache.org/jira/browse/FLINK-31769
             Project: Flink
          Issue Type: Improvement
          Components: Autoscaler, Runtime / Metrics
            Reporter: Zhanghao Chen
         Attachments: image-2023-04-11-15-11-51-471.png

*Background*

Currently, only the min/avg/max of metrics are exposed via the REST API. Flink 
Autoscaler relies on these aggregated metrics to make predictions, and the type 
of aggregation plays an important role. FLINK-30652 (Use max busytime instead 
of average to compute true processing rate) suggests that using the max 
aggregator instead of avg for busy time handles data skew more robustly. 
However, we found that for large-scale jobs, max aggregation may be too 
sensitive. As a result, the true processing rate is underestimated and 
fluctuates severely.

The graph below shows the true processing rate estimated with different 
aggregators for a real production data transmission job with a parallelism of 
750.

!image-2023-04-11-15-11-51-471.png!

*Proposal*

Add percentiles (p50, p90, p99) to aggregated metrics. Apache Commons Math can 
be used to compute them.
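
To illustrate the idea, here is a minimal sketch of percentile aggregation over 
per-subtask busy time, using the nearest-rank definition and only the JDK. The 
class name, method name, and sample values are hypothetical and not taken from 
the Flink code base. The Percentile class in Apache Commons Math could replace 
the hand-written method, though its estimators may interpolate and return 
slightly different values.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Flink: nearest-rank percentile aggregation.
final class PercentileAggregation {

    // Returns the smallest sample such that at least p percent of all
    // samples are less than or equal to it (nearest-rank method).
    static double percentile(double[] values, double p) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // 1-based rank of the percentile within the sorted samples.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up busy time samples (ms per second) for 10 subtasks,
        // with a single outlier subtask.
        double[] busy = {400, 410, 420, 430, 440, 450, 460, 470, 480, 990};
        System.out.println("max = " + Arrays.stream(busy).max().getAsDouble());
        System.out.println("p50 = " + percentile(busy, 50));
        System.out.println("p90 = " + percentile(busy, 90));
        System.out.println("p99 = " + percentile(busy, 99));
    }
}
```

In this made-up sample, the single outlier determines max, while p90 ignores 
it. With only 10 samples p99 still coincides with max; at a parallelism in the 
hundreds, a high percentile can exclude a few outlier subtasks.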

A follow-up would be to make Flink Autoscaler use the new aggregators.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
