[
https://issues.apache.org/jira/browse/FLINK-39160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun Lakshman updated FLINK-39160:
----------------------------------
Description:
Flink currently exposes no metrics for the size of serialized RPC response
frames or for rejections of oversized responses. Responses that exceed
pekko.framesize are rejected, but there is no easy way to see the
response-size trend. This makes it difficult to diagnose RPC failures, tune
frame-size settings, and detect payload-size regressions in production.

Today, oversized RPC responses are visible mainly through error logs, with no
dedicated metric to track response sizes or rejection frequency over time.
Diagnosis is therefore reactive and noisy: operators must grep logs instead of
using dashboards and alerts.
was:
Flink currently lacks metrics for RPC-level observability for serialized
response frame sizes and oversized-response rejections. When responses exceed
pekko.framesize, they are rejected, but we cannot easily see the response-size
trend. This makes it difficult to diagnose RPC failures, tune frame-size
settings, and detect payload-size regressions in production.
> [runtime][rpc][metrics] Expose RPC response frame size and oversized-response
> rejection metrics
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-39160
> URL: https://issues.apache.org/jira/browse/FLINK-39160
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / RPC
> Affects Versions: 2.2.0
> Reporter: Arun Lakshman
> Priority: Minor
> Labels: metrics, rpc
>
> Flink currently exposes no metrics for the size of serialized RPC response
> frames or for rejections of oversized responses. Responses that exceed
> pekko.framesize are rejected, but there is no easy way to see the
> response-size trend. This makes it difficult to diagnose RPC failures, tune
> frame-size settings, and detect payload-size regressions in production.
>
> Today, oversized RPC responses are visible mainly through error logs, with no
> dedicated metric to track response sizes or rejection frequency over time.
> Diagnosis is therefore reactive and noisy: operators must grep logs instead of
> using dashboards and alerts.
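
To make the request concrete, the kind of bookkeeping described above could look roughly like the following. This is only a minimal, self-contained sketch in plain Java, not Flink's metrics API and not a proposed patch: the class name RpcResponseSizeMetrics, its methods, and the idea of calling it at the point where the serialized response size is compared against the configured frame size are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical tracker for serialized RPC response sizes.
 * Illustration only; none of these names exist in Flink.
 */
final class RpcResponseSizeMetrics {

    // Assumed to mirror the configured frame size (pekko.framesize), in bytes.
    private final long maxFrameSizeBytes;

    private final LongAdder responseCount = new LongAdder();
    private final LongAdder totalResponseBytes = new LongAdder();
    private final LongAdder oversizedRejections = new LongAdder();
    private final AtomicLong maxObservedBytes = new AtomicLong();

    RpcResponseSizeMetrics(long maxFrameSizeBytes) {
        if (maxFrameSizeBytes <= 0) {
            throw new IllegalArgumentException("maxFrameSizeBytes must be positive");
        }
        this.maxFrameSizeBytes = maxFrameSizeBytes;
    }

    /**
     * Records the size of one serialized response.
     *
     * @return true if the response fits within the frame size,
     *         false if it exceeds it and would be rejected
     */
    boolean recordResponse(long serializedSizeBytes) {
        responseCount.increment();
        totalResponseBytes.add(serializedSizeBytes);
        // Keep the largest size seen so far, safely under concurrent updates.
        maxObservedBytes.accumulateAndGet(serializedSizeBytes, Math::max);

        if (serializedSizeBytes > maxFrameSizeBytes) {
            oversizedRejections.increment();
            return false;
        }
        return true;
    }

    long getResponseCount() {
        return responseCount.sum();
    }

    long getTotalResponseBytes() {
        return totalResponseBytes.sum();
    }

    long getOversizedRejections() {
        return oversizedRejections.sum();
    }

    long getMaxObservedBytes() {
        return maxObservedBytes.get();
    }
}
```

In a real change, these values would presumably be registered as a histogram and a counter on an RPC-scoped metric group, so that response-size trends and rejection frequency can feed dashboards and alerts rather than being recovered from logs.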