Anurag Kyal created FLINK-36983:
-----------------------------------
Summary: Observing unreliable IO metrics
Key: FLINK-36983
URL: https://issues.apache.org/jira/browse/FLINK-36983
Project: Flink
Issue Type: Bug
Components: Autoscaler
Affects Versions: 1.18.1
Reporter: Anurag Kyal
Attachments: Screenshot 2024-12-31 at 2.01.53 PM.png
<Not sure yet if it's a bug or just an issue with my setup>
I have been trying to enable the autoscaler for our Flink jobs, and it hasn't
been working as expected. So I started digging into the source code and found
that the algorithm relies heavily on the IO metrics for the job's vertices in
the DAG. However, the IO metrics look quite inconsistent for my job, in which
case the autoscaling algorithm will definitely not work.
I had noticed the IO metrics in the UI being inconsistent before, but never
paid much attention to it until I found out that they are actually used as
inputs to the autoscaling algorithm.
The screenshot below shows some of the discrepancies for a sample job.
!Screenshot 2024-12-31 at 2.01.53 PM.png|width=643,height=257!
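For anyone who wants to compare the same numbers outside the UI, here is a
rough sketch of how the per-vertex counters could be pulled from the JobManager
REST API and scanned for suspicious values. It assumes the job details response
(GET /jobs/<jobid>) lists vertices with a "metrics" object containing
"read-records", "write-records" and the matching "-complete" flags. The helper
names and the "suspect" heuristic are my own illustration, not what the
autoscaler itself does. Only the "both directions zero" case is flagged
because, as far as I understand, sources normally show zero records received
and sinks zero records sent at this level.

```python
import json
from urllib.request import urlopen


def fetch_job_details(base_url, job_id):
    """Fetch job details from the JobManager REST API (GET /jobs/<job_id>).

    base_url is an assumption, e.g. "http://localhost:8081".
    """
    with urlopen(f"{base_url}/jobs/{job_id}") as resp:
        return json.load(resp)


def summarize_vertex_io(job_details):
    """Return one row per vertex with its record counters and a 'suspect' flag."""
    rows = []
    for vertex in job_details.get("vertices", []):
        metrics = vertex.get("metrics", {})
        records_in = metrics.get("read-records", 0)
        records_out = metrics.get("write-records", 0)
        # The "-complete" flags indicate whether every subtask reported a value.
        complete = bool(
            metrics.get("read-records-complete", False)
            and metrics.get("write-records-complete", False)
        )
        # Hypothetical heuristic: incomplete counters, or a vertex that reports
        # no records in either direction, deserve a closer look.
        suspect = (not complete) or (records_in == 0 and records_out == 0)
        rows.append(
            {
                "name": vertex.get("name", "<unnamed>"),
                "records_in": records_in,
                "records_out": records_out,
                "complete": complete,
                "suspect": suspect,
            }
        )
    return rows
```

Calling summarize_vertex_io(fetch_job_details(base_url, job_id)) a few minutes
apart and comparing the rows should show whether the counters move the way the
business metrics suggest they should.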
I also want to add that I have verified from business metrics that the job is
healthy and not doing anything unexpected. There is a consistently healthy
amount of data flowing in and out to the sink.
Since so many people are using autoscaling successfully, I wonder whether this
is an issue with my setup. I would love to hear whether anyone else is seeing
this issue, or any insights on how to resolve it.