Patrick Wendell created SPARK-4092: -------------------------------------- Summary: Input metrics don't work for coalesce()'d RDD's Key: SPARK-4092 URL: https://issues.apache.org/jira/browse/SPARK-4092 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Priority: Critical
In every case where we set input metrics (from both Hadoop and block storage) we currently assume that exactly one input partition is computed within the task. This is not a correct assumption in the general case. The main example in the current API is coalesce(), but user-defined RDD's could also be affected. To deal with the most general case, we would need to support the notion of a single task having multiple input sources. A more surgical and less general fix is to simply go to HadoopRDD and check if there are already inputMetrics defined for the task with the same "type". If there are, then merge in the new data rather than blowing away the old one. This wouldn't cover case where, e.g. a single task has input from both on-disk and in-memory blocks. It _would_ cover the case where someone calls coalesce on a HadoopRDD... which is more common. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org