I have a JavaPairRDD<KeyType, Tuple2<Type1, Type2>> originalPairs. There are
on the order of 100 million elements.
I call a function to rearrange the tuples:
JavaPairRDDString,Tuple2Type1,Type2 newPairs =
originalPairs.values().mapToPair(new PairFunctionTuple2Type1,Type2,
String, Tuple2IType1,Type2
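(The elided body just emits a re-keyed Tuple2 - roughly the sketch below, where makeKey is a stand-in for however the new String key is derived:)

    // Body of the PairFunction, roughly:
    public Tuple2<String, Tuple2<Type1, Type2>> call(Tuple2<Type1, Type2> t) {
        // makeKey is hypothetical: derive the new String key from the tuple
        return new Tuple2<>(makeKey(t), t);
    }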
Hi Steve,
You changed the first element of the Tuple2, which is the key Spark hashes
to determine where in the cluster to place the value. By changing
the key part of the PairRDD, you've implicitly asked Spark to reshuffle
the data according to the new keys. I'd guess that you would see that
shuffle the next time you run a key-based operation on newPairs.
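One way to see it: a key-changing transformation like mapToPair can't carry
a partitioner forward, so the new RDD reports none, and the next key-based
operation has to shuffle. You can check this via the underlying RDD:

    // mapToPair may rewrite keys arbitrarily, so Spark drops any partitioner;
    // the underlying RDD exposes this as a scala.Option:
    System.out.println(originalPairs.rdd().partitioner().isDefined()); // true only if explicitly partitioned
    System.out.println(newPairs.rdd().partitioner().isDefined());      // false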
If I combineByKey in the next step, I suppose I'm paying for a shuffle I'd
need anyway - right?
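For example, I'd expect this to pay exactly one shuffle - the same one any
grouping by the new key would cost (Java 8 lambdas for brevity; collecting
into a java.util.List is just an illustration):

    JavaPairRDD<String, List<Tuple2<Type1, Type2>>> combined =
        newPairs.combineByKey(
            // createCombiner: start a combiner from the first value seen for a key
            v -> new ArrayList<>(Collections.singletonList(v)),
            // mergeValue: fold further values into the per-partition combiner
            (list, v) -> { list.add(v); return list; },
            // mergeCombiners: merge combiners across partitions (this is the shuffle)
            (a, b) -> { a.addAll(b); return a; });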
Also, if I supply a custom partitioner rather than relying on the default
hash partitioning, can I control where and how data is shuffled? Overriding
equals and hashCode on the key class could be a bad thing, but a custom
partitioner seems less dangerous.
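To make that concrete, here's the sort of thing I mean - a hypothetical
PrefixPartitioner that buckets String keys by their first character, handed
directly to combineByKey so I decide where each key lands:

    import org.apache.spark.Partitioner;

    public class PrefixPartitioner extends Partitioner {
        private final int numParts;

        public PrefixPartitioner(int numParts) { this.numParts = numParts; }

        @Override
        public int numPartitions() { return numParts; }

        @Override
        public int getPartition(Object key) {
            // Route keys explicitly, e.g. bucket by the first character
            String k = (String) key;
            return k.isEmpty() ? 0 : k.charAt(0) % numParts;
        }
    }

    // Same three combine functions as above, plus explicit placement:
    JavaPairRDD<String, List<Tuple2<Type1, Type2>>> combined =
        newPairs.combineByKey(
            v -> new ArrayList<>(Collections.singletonList(v)),
            (list, v) -> { list.add(v); return list; },
            (a, b) -> { a.addAll(b); return a; },
            new PrefixPartitioner(128));

As far as I understand, Spark compares partitioners with equals() to decide
whether data is already laid out correctly, so giving the custom partitioner
a meaningful equals() can save a redundant shuffle later on.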