Weichen Xu created SPARK-18078:
----------------------------------

Summary: Add option for customize zipPartition task preferred locations
Key: SPARK-18078
URL: https://issues.apache.org/jira/browse/SPARK-18078
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Weichen Xu

`RDD.zipPartitions` chooses task preferred locations by taking the intersection of the preferred locations of the corresponding zipped partitions; if that intersection is empty, it falls back to the union of those locations. In some special cases, though, I want to customize the task preferred locations for better performance.

A typical case is in spark-tfocs: multiplying a distributed matrix (`DMatrix`) by a vector (`DVector`) uses `RDD.zipPartitions`.
https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/DVectorFunctions.scala

Usually the `DMatrix` RDD is much larger than the `DVector` RDD, so we want the zipPartitions task to always be placed at the `DMatrix` partition's location. This gives better data locality than the default preferred location strategy, since the larger partition then stays local to the task.

I think it makes sense to add an option for this.
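
To make the difference concrete, here is a minimal sketch of the two strategies in plain Scala, with no Spark dependency. It only models the host-selection logic described above. The names `PreferredLocations`, `defaultLocations`, `pinnedLocations`, and `preferredParent` are hypothetical and chosen for illustration; they are not part of the Spark API, and the actual option proposed here might look quite different.

```scala
object PreferredLocations {

  /** Default strategy as described in this issue: use the hosts common to
    * all zipped partitions; if there are none, fall back to the union.
    * `perParent` holds the preferred hosts of the matching partition in
    * each zipped RDD.
    */
  def defaultLocations(perParent: Seq[Seq[String]]): Seq[String] =
    if (perParent.isEmpty) Seq.empty
    else {
      val common = perParent.reduce((a, b) => a.intersect(b))
      if (common.nonEmpty) common else perParent.flatten.distinct
    }

  /** Hypothetical customized strategy: always follow one chosen parent,
    * e.g. the large DMatrix RDD, and ignore the other parents' hosts.
    */
  def pinnedLocations(perParent: Seq[Seq[String]], preferredParent: Int): Seq[String] =
    perParent(preferredParent)
}
```

For example, if the `DMatrix` partition lives on `host1` and `host2` and the `DVector` partition lives on `host3`, the default strategy yields all three hosts, so the task may be scheduled on `host3` and have to fetch the large matrix partition remotely. Pinning to the `DMatrix` parent (index 0 in this sketch) restricts the candidates to `host1` and `host2`.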