[ https://issues.apache.org/jira/browse/SPARK-53007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Harsha Gudladona updated SPARK-53007:
-------------------------------------
Description:

Hello! We are running Hudi Delta Streamer on Spark, sourcing data from Kafka. We have a case where the Spark UI shows negative RDD block counts and incorrect storage values after intermittent task failures and successful retries. The metrics stay correct until the first task failure and a subsequent retry. My first thought was that the status event queue on the listener bus on the driver was full, but JMX metrics show the dropped count as 0. I am not aware of any other way to troubleshoot this; any help is appreciated.

Attached screenshots: !image-2025-07-29-17-04-51-074.png! !image-2025-07-29-17-04-27-564.png!

> Spark UI: Incorrect metrics reported after Spark task failures and successful
> retries.
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-53007
>                 URL: https://issues.apache.org/jira/browse/SPARK-53007
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 3.4.1
>        Environment: Spark Version: 3.4.1
>                     Infra: Spark on EKS
>                     Operator: Kubeflow
>            Reporter: Harsha Gudladona
>            Priority: Major
>
> Hello! We are running Hudi Delta Streamer on Spark, sourcing data from Kafka. We have
> a case where the Spark UI shows negative RDD block counts and incorrect
> storage values after intermittent task failures and successful retries.
> The metrics stay correct until the first task failure and a subsequent
> retry. My first thought was that the status event queue on the listener bus on
> the driver was full, but JMX metrics show the dropped count as 0. I am not aware
> of any other way to troubleshoot this; any help is appreciated.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
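Since the report hinges on whether the driver's listener bus is dropping status events, the configuration knobs involved may be worth spelling out. Below is a sketch of the relevant driver-side settings, assuming Spark 3.4 defaults; the capacity value is illustrative, not a recommendation, and the exact JMX gauge name should be verified against your deployment:

```properties
# spark-defaults.conf (driver side) -- sketch, values illustrative

# Per-queue capacity of the listener bus; events are dropped once a
# queue (e.g. the appStatus queue that feeds the UI) fills up.
# The documented default is 10000.
spark.scheduler.listenerbus.eventqueue.capacity   30000

# Expose driver metrics over JMX, which includes the per-queue
# dropped-event gauges (e.g. LiveListenerBus.queue.appStatus.numDroppedEvents),
# presumably the counter the report observed at 0.
spark.metrics.conf.*.sink.jmx.class   org.apache.spark.metrics.sink.JmxSink
```

If the appStatus queue's dropped-event gauge genuinely stays at 0 across the failure/retry window, the events are reaching the UI's state store, which points away from queue overflow and toward an accounting issue in how block updates from failed and retried tasks are reconciled.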