Hi Folks,
Does anybody know the reason for not allowing preservesPartitioning in
RDD.map? Am I missing something here?
The following example involves two shuffles. I think that if
preservesPartitioning were allowed, we could avoid the second one, right?
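Presumably the chain looked something like the following sketch (illustrative
code reconstructing a two-shuffle version, not the original snippet; the
middle map is the problem, because RDD.map drops the partitioner):

    val r1 = sc.parallelize(List(1, 2, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 2, 4))
      .map((_, 1))
      .reduceByKey(_ + _)                  // shuffle 1: establishes a HashPartitioner
      .map { case (k, v) => (k, v + 1) }   // keys unchanged, but the partitioner is discarded
      .reduceByKey(_ + _)                  // shuffle 2: Spark no longer knows the layout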
I believe if you do the following, the second reduceByKey does not shuffle:

sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString

(8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
 |  MapPartitionsRDD[33] at mapValues at <console>:23 []
 |  ShuffledRDD[32] at reduceByKey at <console>:23 []

Only one ShuffledRDD appears in the lineage: mapValues preserves the
partitioner, so the second reduceByKey runs without another shuffle.
Thanks Jonathan. You are right about rewriting the example.
I mean providing such an option to the developer so that it is controllable.
The example may seem silly, and I don't know all the use cases.
But, for example, what if I also want to operate on both the key and the
value to generate some new value?
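For instance (a sketch; scaleByKey and its RDD are illustrative, not from the
thread), mapValues only sees the value, so a transform that needs the key has
to fall back to map, which drops the partitioner even when the keys are
untouched:

    import org.apache.spark.rdd.RDD

    // counts: an RDD already partitioned by key, e.g. the output of reduceByKey.
    def scaleByKey(counts: RDD[(Int, Int)]): RDD[(Int, Int)] =
      counts.map { case (k, v) => (k, k * v) }  // keys unchanged, partitioner lost anyway

The result no longer carries a partitioner, so a following reduceByKey or
join shuffles again.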
This is just a deficiency of the API, imo. I agree: mapValues could
definitely take a function (K, V) => V1. The preservesPartitioning option
isn't set by the function; it's on the RDD operation. So you could look at
the code and do this yourself:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
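For example, a user-side helper along these lines (a sketch; mapValuesWithKey
is an illustrative name, not an existing Spark API) routes through
mapPartitions and asserts that the partitioning is preserved, which is safe
because the keys are never changed:

    import org.apache.spark.rdd.RDD

    implicit class KeyAwarePairOps[K, V](rdd: RDD[(K, V)]) {
      // Like mapValues, but the function also sees the key. Keys pass through
      // untouched, so preservesPartitioning = true is safe here.
      def mapValuesWithKey[U](f: (K, V) => U): RDD[(K, U)] =
        rdd.mapPartitions(
          iter => iter.map { case (k, v) => (k, f(k, v)) },
          preservesPartitioning = true)
    }

With that in scope, counts.mapValuesWithKey((k, v) => k * v).reduceByKey(_ + _)
keeps the partitioner and avoids the second shuffle.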
I think we have a version of mapPartitions that allows you to tell
Spark the partitioning is preserved:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639
We could also add a map function that does the same. Or you can just write
your map using an inline mapPartitions call.
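E.g., inline (a sketch; rdd is assumed to be a pair RDD that already has a
partitioner, and this is safe only as long as the keys are not modified):

    // Instead of rdd.map { case (k, v) => (k, k + v) }, which drops the partitioner:
    val mapped = rdd.mapPartitions(
      _.map { case (k, v) => (k, k + v) },
      preservesPartitioning = true)
    // mapped keeps rdd's partitioner, so a following reduceByKey needs no extra shuffle.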
Thanks all for the quick response.
Thanks.
Zhan Zhang