Rob Reeves created SPARK-38309:
----------------------------------

             Summary: SHS has incorrect percentiles for shuffle read bytes and 
shuffle total blocks metrics
                 Key: SPARK-38309
                 URL: https://issues.apache.org/jira/browse/SPARK-38309
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.0
            Reporter: Rob Reeves


*Background*

In [this PR|https://github.com/apache/spark/pull/26508] (SPARK-26260) the Spark 
History Server (SHS) stage metric percentiles were updated to include only 
successful tasks when using disk storage. The PR did this by making the values 
for each metric negative when the task is not in a successful state. This approach was chosen to avoid 
breaking changes to disk storage. See [this 
comment|https://github.com/apache/spark/pull/26508#issuecomment-554540314] for 
context.

To get the percentiles, the SHS reads the metric values in ascending order, 
starting at 0. This filters out all tasks that are not successful because their 
values are less than 0. It then scales each percentile to an index in the list 
of successful tasks. For example, if there are 200 tasks and you want 
percentiles [0, 25, 50, 75, 100], the lookup indexes in the task collection are

*Issue*
For metrics 1) shuffle total reads and 2) shuffle total blocks, the above PR 
incorrectly makes the metric indices positive. This means tasks that are not 
successful are included in the percentile calculations. The percentile lookup 
index calculation is still based on the number of successful tasks, so the wrong 
task metric is returned for a given percentile. This was not caught because the 
unit test only verified values for one metric, executorRunTime.
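
The mismatch can be illustrated with a small, simplified model. This is only a 
sketch and not Spark's code: the function name, the sample data, the plain 
negation used as the encoding, and the index formula min(floor(q * n), n - 1) 
are all assumptions made for illustration.

```python
def percentile_lookup(index_values, quantiles, successful_count):
    """Read index values >= 0 in ascending order, then pick one per quantile.

    The position formula is an assumed stand-in for how percentiles are
    scaled to the list of successful tasks.
    """
    ordered = sorted(v for v in index_values if v >= 0)
    positions = [min(int(q * successful_count), successful_count - 1)
                 for q in quantiles]
    return [ordered[p] for p in positions]


# Hypothetical tasks: (shuffle read bytes, succeeded?)
tasks = [(1, False), (2, False), (10, True), (20, True), (30, True), (40, True)]
successful_count = sum(1 for _, ok in tasks if ok)
quantiles = [0, 0.25, 0.5, 0.75, 1.0]

# Intended encoding (simplified): unsuccessful tasks get a negative index value.
intended_index = [v if ok else -v for v, ok in tasks]
# Modeled bug: the index value stays positive for unsuccessful tasks.
buggy_index = [v for v, _ in tasks]

intended = percentile_lookup(intended_index, quantiles, successful_count)
buggy = percentile_lookup(buggy_index, quantiles, successful_count)

print(intended)  # [10, 20, 30, 40, 40]
print(buggy)     # [1, 2, 10, 20, 20]
```

In this model the buggy maximum (20) is lower than the largest successful value 
(40), because the two unsuccessful tasks shift every lookup position while the 
positions are still computed from the successful task count.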

*Steps to Reproduce*
_SHS UI_


 # Find a Spark application in the SHS that has failed tasks for a stage with 
shuffle read.
 # Navigate to the stage UI.
 # Look at the max shuffle read size in the summary metrics.
 # Sort the tasks by shuffle read size descending. The largest value does not 
match the max from step 3.

!image-2022-02-23-14-13-49-403.png!

 

_API_
 # For the same stage in the above repro steps, make a request to the task 
summary endpoint (e.g. 
/api/v1/applications/application_1632281309592_21294517/1/stages/6/0/taskSummary?quantiles=0,0.25,0.5,0.75,1.0)
 # Look at the shuffleReadMetrics.readBytes and 
shuffleReadMetrics.totalBlocksFetched. You will see -2 for at least some of the 
lower percentiles and the positive values will also be wrong.
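
The API check above could be scripted roughly as follows. This is a sketch 
under assumptions: the helper names are hypothetical, the base URL 
http://localhost:18080 is a placeholder for your SHS host, and the sample 
response is made up to show the shape being inspected (only the two fields 
named above are used).

```python
from urllib.parse import urlencode


def task_summary_url(base, app_id, attempt, stage_id, stage_attempt, quantiles):
    """Build the taskSummary URL in the form shown in the repro step."""
    query = urlencode({"quantiles": ",".join(str(q) for q in quantiles)}, safe=",")
    return (f"{base}/api/v1/applications/{app_id}/{attempt}"
            f"/stages/{stage_id}/{stage_attempt}/taskSummary?{query}")


def negative_entries(summary, fields=("readBytes", "totalBlocksFetched")):
    """Return the negative percentile values found in shuffleReadMetrics."""
    shuffle = summary.get("shuffleReadMetrics", {})
    return {f: [v for v in shuffle.get(f, []) if v < 0] for f in fields}


url = task_summary_url("http://localhost:18080",  # placeholder host
                       "application_1632281309592_21294517", 1, 6, 0,
                       [0, 0.25, 0.5, 0.75, 1.0])
print(url)

# Made-up response for illustration; a real one would come from fetching `url`,
# e.g. with urllib.request.urlopen and json.load.
sample = {"shuffleReadMetrics": {"readBytes": [-2.0, -2.0, 5.0, 7.0, 9.0],
                                 "totalBlocksFetched": [-2.0, 1.0, 1.0, 2.0, 3.0]}}
print(negative_entries(sample))
```

Any non-empty list in the result indicates that unsuccessful tasks leaked into 
the percentile values for that field.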



