[ 
https://issues.apache.org/jira/browse/FLINK-39160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Lakshman updated FLINK-39160:
----------------------------------
    Description: 
Flink currently exposes no RPC-level metrics for serialized response frame 
sizes or for oversized-response rejections. Responses that exceed 
pekko.framesize are rejected, but there is no easy way to see the 
response-size trend. This makes it difficult to diagnose RPC failures, tune 
frame-size settings, and detect payload-size regressions in production.


Today, oversized RPC responses are visible only through error logs, with no 
dedicated metric to track response sizes or rejection frequency over time. 
This makes diagnosis reactive and noisy, since operators must grep logs 
instead of using dashboards and alerts.
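
To make the request concrete, the following is a minimal sketch of the kind of 
bookkeeping such metrics would need. It is not Flink code: the class 
RpcResponseSizeMetrics, its method names, and the choice of counters are all 
hypothetical, and the limit is passed in as a plain byte count rather than 
read from pekko.framesize. A real implementation would presumably register 
these values with Flink's metric system instead of holding them in a 
standalone class.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical helper, not part of Flink: tracks serialized RPC response
 * sizes and counts responses that exceed the configured frame size.
 */
final class RpcResponseSizeMetrics {

    // Assumed to mirror the configured maximum frame size, in bytes.
    private final long maxFrameSizeBytes;

    private final LongAdder responseCount = new LongAdder();
    private final LongAdder oversizedCount = new LongAdder();
    private final AtomicLong maxObservedBytes = new AtomicLong();

    RpcResponseSizeMetrics(long maxFrameSizeBytes) {
        this.maxFrameSizeBytes = maxFrameSizeBytes;
    }

    /**
     * Records one serialized response size.
     *
     * @return true if the response fits in a frame, false if it exceeds
     *     the limit and would be rejected
     */
    boolean record(long serializedSizeBytes) {
        responseCount.increment();
        maxObservedBytes.accumulateAndGet(serializedSizeBytes, Math::max);
        if (serializedSizeBytes > maxFrameSizeBytes) {
            oversizedCount.increment();
            return false;
        }
        return true;
    }

    long getResponseCount() {
        return responseCount.sum();
    }

    long getOversizedCount() {
        return oversizedCount.sum();
    }

    long getMaxObservedBytes() {
        return maxObservedBytes.get();
    }
}
```

With values like these exported, the oversized count could drive an alert and 
the largest observed size could be compared against the configured limit on a 
dashboard, instead of being inferred from logs.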



  was:
Flink currently lacks metrics for RPC-level observability for serialized 
response frame sizes and oversized-response rejections. When responses exceed 
pekko.framesize, they are rejected, but we cannot easily see the response-size 
trend. This makes it difficult to diagnose RPC failures, tune frame-size 
settings, and detect payload-size regressions in production




> [runtime][rpc][metrics] Expose RPC response frame size and oversized-response 
> rejection metrics
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39160
>                 URL: https://issues.apache.org/jira/browse/FLINK-39160
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / RPC
>    Affects Versions: 2.2.0
>            Reporter: Arun Lakshman
>            Priority: Minor
>              Labels: metrics, rpc
>
> Flink currently exposes no RPC-level metrics for serialized response frame 
> sizes or for oversized-response rejections. Responses that exceed 
> pekko.framesize are rejected, but there is no easy way to see the 
> response-size trend. This makes it difficult to diagnose RPC failures, tune 
> frame-size settings, and detect payload-size regressions in production.
> Today, oversized RPC responses are visible only through error logs, with no 
> dedicated metric to track response sizes or rejection frequency over time. 
> This makes diagnosis reactive and noisy, since operators must grep logs 
> instead of using dashboards and alerts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
