[ https://issues.apache.org/jira/browse/SPARK-38309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-38309:
------------------------------------

    Assignee:     (was: Apache Spark)

> SHS has incorrect percentiles for shuffle read bytes and shuffle total blocks
> metrics
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-38309
>                 URL: https://issues.apache.org/jira/browse/SPARK-38309
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Rob Reeves
>            Priority: Major
>              Labels: correctness
>         Attachments: image-2022-02-23-14-19-33-255.png
>
> *Background*
> In [this PR|https://github.com/apache/spark/pull/26508] (SPARK-26260) the SHS
> stage metric percentiles were updated to only include successful tasks when
> using disk storage. It did this by making the stored value for each metric
> negative when the task is not in a successful state. This approach was chosen
> to avoid breaking changes to disk storage. See [this
> comment|https://github.com/apache/spark/pull/26508#issuecomment-554540314]
> for context.
> To compute the percentiles, the SHS reads the metric values in ascending
> order starting at 0. This filters out all tasks that are not successful,
> because their stored values are less than 0. The percentile lookup indexes
> are then scaled to the number of successful tasks. For example, if there are
> 200 successful tasks and you want percentiles [0, 25, 50, 75, 100], the
> lookup indexes into the task collection are scaled to those 200 tasks.
>
> *Issue*
> For two metrics, 1) shuffle total read bytes and 2) shuffle total blocks
> fetched, the above PR incorrectly leaves the stored metric values positive
> for tasks that are not successful, so those tasks are included in the
> percentile scan. The percentile lookup index calculation is still based on
> the number of successful tasks, so the wrong task metric is returned for a
> given percentile. This was not caught because the unit test only verified
> values for one metric, executorRunTime.
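To make the failure mode above concrete, here is a minimal sketch (not the actual Spark code; the function names and values are my own) of the negative-value filtering scheme and how a non-negated failed-task value corrupts the result when the lookup index is still scaled to the successful-task count:

```python
def percentiles(stored_values, quantiles, n_successful):
    # Ascending scan starting at 0: negated (failed-task) values are skipped.
    visible = sorted(v for v in stored_values if v >= 0)
    # Lookup index is scaled to the number of *successful* tasks, as the
    # description says, not to the number of visible values.
    return [visible[int(q * (n_successful - 1))] for q in quantiles]

# 3 tasks: two successful (values 10 and 20) and one failed task whose
# metric value is 15. With the correct encoding the failed value is stored
# negated, so the scan filters it out:
print(percentiles([10, -16, 20], [0.0, 0.5, 1.0], n_successful=2))
# -> [10, 10, 20]

# With the buggy encoding (as for shuffle read bytes / total blocks) the
# failed value stays positive, shifts the sorted collection, and the wrong
# task's metric is returned for the top percentile:
print(percentiles([10, 15, 20], [0.0, 0.5, 1.0], n_successful=2))
# -> [10, 10, 15]  (the max should be 20)
```

This mirrors the UI symptom in the repro steps: the "max" summary value no longer matches the largest value in the sorted task table.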
> *Steps to Reproduce*
> _SHS UI_
> # Find a Spark application in the SHS that has failed tasks for a stage with
> shuffle read.
> # Navigate to the stage UI.
> # Look at the max shuffle read size in the summary metrics.
> # Sort the tasks by shuffle read size descending. You'll see it doesn't
> match the max from step 3.
>
> !image-2022-02-23-14-19-33-255.png|width=789,height=389!
> _API_
> # For the same stage as in the repro steps above, make a request to the task
> summary endpoint (e.g.
> /api/v1/applications/application_1632281309592_21294517/1/stages/6/0/taskSummary?quantiles=0,0.25,0.5,0.75,1.0).
> # Look at shuffleReadMetrics.readBytes and
> shuffleReadMetrics.totalBlocksFetched. You will see -2 for at least some of
> the lower percentiles, and the positive values will also be wrong.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
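As a supplement to the API repro step: negative values in a taskSummary response are easy to flag mechanically, since real byte and block counts can never be below zero. The sketch below uses a made-up response fragment (the numbers are invented to mirror the -2 symptom; only the field names come from the report above):

```python
import json

# Hypothetical taskSummary response fragment; field names match those cited
# in the repro steps, values are invented for illustration.
payload = json.loads("""
{
  "quantiles": [0.0, 0.25, 0.5, 0.75, 1.0],
  "shuffleReadMetrics": {
    "readBytes": [-2.0, -2.0, 1024.0, 4096.0, 8192.0],
    "totalBlocksFetched": [-2.0, 3.0, 7.0, 9.0, 12.0]
  }
}
""")

def negative_quantiles(summary):
    # Collect (quantile, value) pairs where a negated failed-task metric
    # leaked through into the percentile output.
    bad = {}
    for name, values in summary["shuffleReadMetrics"].items():
        hits = [(q, v) for q, v in zip(summary["quantiles"], values) if v < 0]
        if hits:
            bad[name] = hits
    return bad

print(negative_quantiles(payload))
# -> {'readBytes': [(0.0, -2.0), (0.25, -2.0)],
#     'totalBlocksFetched': [(0.0, -2.0)]}
```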