[jira] [Created] (SPARK-38309) SHS has incorrect percentiles for shuffle read bytes and shuffle total blocks metrics

Rob Reeves (Jira) Wed, 23 Feb 2022 14:19:06 -0800

Rob Reeves created SPARK-38309:
----------------------------------

             Summary: SHS has incorrect percentiles for shuffle read bytes and 
shuffle total blocks metrics
                 Key: SPARK-38309
                 URL: https://issues.apache.org/jira/browse/SPARK-38309
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.0
            Reporter: Rob Reeves

*Background*

In [this PR|https://github.com/apache/spark/pull/26508] (SPARK-26260), the Spark
History Server (SHS) stage metric percentiles were updated to include only
successful tasks when using disk storage. It did this by making the value of
each metric negative when the task is not in a successful state. This approach
was chosen to avoid breaking changes to disk storage. See [this
comment|https://github.com/apache/spark/pull/26508#issuecomment-554540314] for
context.
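
As a rough illustration of the idea (not Spark's actual code), the sketch below encodes a metric value for a sorted index so that non-successful tasks always sort below 0. The function name `index_value` and the `-value - 1` encoding are assumptions made for this example; the offset is there so that a metric of 0 from a failed task also becomes negative.

```python
def index_value(value: int, succeeded: bool) -> int:
    """Return the value to store in a sorted metric index (hypothetical helper).

    Successful tasks keep their non-negative metric value. Any other task is
    mapped to a strictly negative value, so a scan that starts at 0 skips it.
    """
    if succeeded:
        return value
    # Assumed encoding: shift by one so that a metric of 0 also turns negative.
    return -value - 1


# Made-up (metric value, succeeded) pairs for four tasks.
tasks = [(120, True), (0, True), (300, False), (0, False)]
indexed = sorted(index_value(v, ok) for v, ok in tasks)
visible = [v for v in indexed if v >= 0]  # what a scan starting at 0 would see
```

With this encoding the two non-successful tasks end up below 0 and only the successful values remain visible to the scan.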

To get the percentiles, it reads the metric values in ascending order, starting
at 0. This filters out all tasks that are not successful, because their values
are less than 0. It then scales each percentile to an index into the ordered
list of successful tasks. For example, if there are 200 tasks and you want
percentiles [0, 25, 50, 75, 100], each percentile is mapped to a lookup index
in that task collection.
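
The index scaling can be sketched as follows. The formula `min(int(q * n), n - 1)` is an assumption about how a quantile is mapped to a position in the sorted list of successful tasks; it is not stated in this report, and the function names are hypothetical.

```python
def quantile_indices(quantiles, successful_count):
    """Map each quantile in [0, 1] to a position in the sorted successful tasks.

    Assumed formula: floor(q * n), clamped to the last valid index n - 1.
    """
    last = successful_count - 1
    return [min(int(q * successful_count), last) for q in quantiles]


def percentile_values(index_values, quantiles):
    """Read values in ascending order starting at 0, then pick by scaled index."""
    successful = sorted(v for v in index_values if v >= 0)
    if not successful:
        return []  # no successful tasks, so there is nothing to report
    return [successful[i] for i in quantile_indices(quantiles, len(successful))]
```

Under this assumed formula, 200 successful tasks and quantiles [0, 0.25, 0.5, 0.75, 1.0] give the lookup indices [0, 50, 100, 150, 199].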

*Issue*
For metrics 1) shuffle total reads and 2) shuffle total blocks, the above PR
incorrectly makes the metric index values positive. This means tasks that are
not successful are included in the percentile calculations. The percentile
lookup index calculation is still based on the number of successful tasks, so
the wrong task metric is returned for a given percentile. This was not caught
because the unit test only verified values for one metric, executorRunTime.
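
A small sketch of the failure mode, using made-up numbers: when the index values of non-successful tasks are stored as positive, the scan from 0 picks them up, while the lookup indices are still computed from the successful-task count. The helper below and its quantile formula are illustrative assumptions, not Spark's implementation.

```python
def lookup(index_values, quantiles, successful_count):
    """Pick values from the ascending scan (>= 0) at indices scaled by successful_count."""
    scanned = sorted(v for v in index_values if v >= 0)
    last = successful_count - 1
    return [scanned[min(int(q * successful_count), last)] for q in quantiles]


successful = [10, 20, 30, 40]   # metric values of successful tasks (made up)
failed = [5, 15]                # metric values of non-successful tasks (made up)
quantiles = [0, 0.5, 1.0]

# Correct index: failed tasks are stored as negative values and skipped.
correct = lookup(successful + [-v - 1 for v in failed], quantiles, len(successful))

# Buggy index: failed tasks are stored as positive values and leak into the scan.
buggy = lookup(successful + failed, quantiles, len(successful))
```

Here `correct` is [10, 30, 40] while `buggy` is [5, 15, 20], so the reported maximum no longer matches the largest value among successful tasks. A test that checks each metric index, rather than executorRunTime alone, would likely expose this kind of mismatch.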

*Steps to Reproduce*
_SHS UI_

# Find a Spark application in the SHS that has failed tasks for a stage with
shuffle read.
# Navigate to the stage UI.
# Look at the max shuffle read size in the summary metrics.
# Sort the tasks by shuffle read size descending. You'll see that the largest
value doesn't match the max from step 3.

!image-2022-02-23-14-13-49-403.png!

_API_
# For the same stage as in the repro steps above, make a request to the task
summary endpoint (e.g.
/api/v1/applications/application_1632281309592_21294517/1/stages/6/0/taskSummary?quantiles=0,0.25,0.5,0.75,1.0).
# Look at the shuffleReadMetrics.readBytes and
shuffleReadMetrics.totalBlocksFetched values. You will see -2 for at least some
of the lower percentiles, and the positive values will also be wrong.
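
The API check can be scripted roughly as below. The helper names, the base URL, and the sample response are made up for illustration; only the endpoint path shape, the `quantiles` parameter, and the two field names come from the steps above. The sketch assumes the response is a JSON object in which each of these fields holds a list with one value per requested quantile.

```python
from urllib.parse import urlencode


def task_summary_url(base_url, app_id, attempt, stage_id, stage_attempt, quantiles):
    """Build the task summary URL for one stage attempt (base_url is an assumption)."""
    path = (f"/api/v1/applications/{app_id}/{attempt}"
            f"/stages/{stage_id}/{stage_attempt}/taskSummary")
    # Keep commas unescaped so the query matches the form used in the steps above.
    query = urlencode({"quantiles": ",".join(str(q) for q in quantiles)}, safe=",")
    return f"{base_url.rstrip('/')}{path}?{query}"


def negative_shuffle_quantiles(summary):
    """Return the shuffle read fields that contain negative quantile values."""
    shuffle = summary.get("shuffleReadMetrics", {})
    fields = ("readBytes", "totalBlocksFetched")
    return {f: [v for v in shuffle.get(f, []) if v < 0]
            for f in fields
            if any(v < 0 for v in shuffle.get(f, []))}


# Hypothetical host and port; substitute the address of your own SHS.
url = task_summary_url("http://localhost:18080",
                       "application_1632281309592_21294517", 1, 6, 0,
                       [0, 0.25, 0.5, 0.75, 1.0])

# Made-up response fragment showing the symptom described above.
sample = {
    "quantiles": [0.0, 0.5, 1.0],
    "shuffleReadMetrics": {
        "readBytes": [-2.0, 100.0, 250.0],
        "totalBlocksFetched": [-2.0, 3.0, 8.0],
    },
}
suspicious = negative_shuffle_quantiles(sample)
```

Any non-empty result from `negative_shuffle_quantiles` indicates that values from non-successful tasks leaked into the percentile lookup for that field.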

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
