[jira] [Created] (SPARK-4157) Task input statistics incomplete when a task reads from multiple locations

Charles Reiss (JIRA) Thu, 30 Oct 2014 12:03:59 -0700

Charles Reiss created SPARK-4157:
------------------------------------

             Summary: Task input statistics incomplete when a task reads from multiple locations
                 Key: SPARK-4157
                 URL: https://issues.apache.org/jira/browse/SPARK-4157
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.1.0
            Reporter: Charles Reiss
            Priority: Minor



SPARK-1683 introduced tracking of filesystem reads for tasks, but the tracking 
code assumes that each task reads from exactly one file/cache block, and 
replaces any prior InputMetrics object for a task after each read.

However, a task may read from more than one location. For example, a task 
computing a shuffle-less join (where the input RDDs are already partitioned 
by key) may read two or more cached blocks, one from each dependency RDD. In 
that case, the displayed input size covers only whichever dependency was 
requested last, not the total across all reads.
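
To make the failure mode concrete, here is a minimal sketch in plain Python. 
It does not use Spark's actual classes: the InputMetrics and TaskMetrics 
types below, the method names record_read_replacing and 
record_read_accumulating, and the block sizes are all hypothetical stand-ins 
chosen for illustration. The sketch contrasts replacing the metrics object on 
every read (the behaviour described above) with accumulating bytes across 
reads (one possible remedy, not necessarily the fix that was adopted).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputMetrics:
    """Hypothetical stand-in for a per-task input metrics record."""
    read_method: str
    bytes_read: int = 0


@dataclass
class TaskMetrics:
    """Hypothetical stand-in for the metrics attached to one task."""
    input_metrics: Optional[InputMetrics] = None

    def record_read_replacing(self, read_method: str, nbytes: int) -> None:
        # Behaviour described in the report: every read installs a fresh
        # object, so whatever an earlier read recorded is discarded.
        self.input_metrics = InputMetrics(read_method, nbytes)

    def record_read_accumulating(self, read_method: str, nbytes: int) -> None:
        # Possible remedy (an assumption): create the object once, then
        # add each read's size to the running total.
        if self.input_metrics is None:
            self.input_metrics = InputMetrics(read_method, 0)
        self.input_metrics.bytes_read += nbytes


# Made-up sizes for two cached blocks read by one join task,
# one block from each dependency RDD.
reads = [("Memory", 64_000_000), ("Memory", 16_000_000)]

replacing = TaskMetrics()
accumulating = TaskMetrics()
for read_method, nbytes in reads:
    replacing.record_read_replacing(read_method, nbytes)
    accumulating.record_read_accumulating(read_method, nbytes)

print(replacing.input_metrics.bytes_read)     # 16000000 (last read only)
print(accumulating.input_metrics.bytes_read)  # 80000000 (sum of both reads)
```

The accumulating variant keeps only the first read method it sees. A task 
that mixes read methods (for example, one cached block and one filesystem 
read) would need a design decision this sketch does not address.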



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
