Why is this operation so expensive

2014-11-25 Thread Steve Lewis
I have a JavaPairRDD<KeyType, Tuple2<Type1, Type2>> originalPairs. There are on the order of 100 million elements. I call a function to rearrange the tuples: JavaPairRDD<String, Tuple2<Type1, Type2>> newPairs = originalPairs.values().mapToPair(new PairFunction<Tuple2<Type1, Type2>, String, Tuple2<Type1, Type2>>
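
As a minimal runnable sketch of the operation being described, assuming placeholder concrete types (Integer, Long, Double) in place of KeyType/Type1/Type2, and a made-up key expression where the real job would build its new String key:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;
    import java.util.Arrays;

    public class RekeyExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("RekeyExample").setMaster("local[2]"));

            // Stand-in for originalPairs; the real job has ~100 million elements.
            JavaPairRDD<Integer, Tuple2<Long, Double>> originalPairs =
                sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>(1, new Tuple2<>(10L, 0.5)),
                    new Tuple2<>(2, new Tuple2<>(20L, 1.5))));

            // Drop the old key and derive a new String key from each value.
            JavaPairRDD<String, Tuple2<Long, Double>> newPairs =
                originalPairs.values().mapToPair(
                    new PairFunction<Tuple2<Long, Double>, String, Tuple2<Long, Double>>() {
                        @Override
                        public Tuple2<String, Tuple2<Long, Double>> call(
                                Tuple2<Long, Double> t) {
                            // "key-" + t._1() stands in for whatever builds the new key
                            return new Tuple2<>("key-" + t._1(), t);
                        }
                    });

            System.out.println(newPairs.collect());
            sc.stop();
        }
    }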

Re: Why is this operation so expensive

2014-11-25 Thread Andrew Ash
Hi Steve, you changed the first value in a Tuple2, which is the part Spark hashes to determine where in the cluster to place the record. By changing the first part of the PairRDD, you've implicitly asked Spark to reshuffle the data according to the new keys. I'd guess that you would
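
To make the implicit reshuffle visible: the RDD produced by values().mapToPair(...) carries no partitioner, so Spark no longer knows where any key lives and the first key-based operation afterwards must shuffle everything. A sketch continuing the example above (partitionBy, HashPartitioner, and toDebugString are standard Spark APIs; the variable names carry over from the previous sketch):

    import org.apache.spark.HashPartitioner;

    // Even if the original RDD was explicitly partitioned by its old keys...
    JavaPairRDD<Integer, Tuple2<Long, Double>> partitioned =
        originalPairs.partitionBy(new HashPartitioner(8));
    System.out.println(partitioned.rdd().partitioner().isDefined()); // true

    // ...the rekeyed RDD has no partitioner at all:
    System.out.println(newPairs.rdd().partitioner().isDefined());    // false

    // The shuffle itself happens lazily, at the first operation that needs
    // co-located keys; the debug string shows the ShuffledRDD boundary.
    System.out.println(newPairs.groupByKey().toDebugString());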

Re: Why is this operation so expensive

2014-11-25 Thread Steve Lewis
If I combineByKey in the next step, I suppose I am paying for a shuffle I need anyway - right? Also, if I supply a custom partitioner rather than the default hash, can I control where and how data is shuffled? Overriding equals and hashCode could be a bad thing, but a custom partitioner is less dangerous.
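
For reference, JavaPairRDD.combineByKey has an overload that takes a Partitioner, so the one shuffle that combineByKey needs can use custom placement. A hypothetical sketch (PrefixPartitioner is made up for illustration; it routes String keys by their first character instead of by hashCode):

    import org.apache.spark.Partitioner;

    public class PrefixPartitioner extends Partitioner {
        private final int numPartitions;

        public PrefixPartitioner(int numPartitions) {
            this.numPartitions = numPartitions;
        }

        @Override
        public int numPartitions() {
            return numPartitions;
        }

        @Override
        public int getPartition(Object key) {
            String s = (String) key;
            // Any deterministic, non-negative function of the key works here.
            return (s.isEmpty() ? 0 : s.charAt(0)) % numPartitions;
        }

        // equals/hashCode on the partitioner (not on the keys) let Spark
        // recognize that two RDDs share a partitioning and skip a re-shuffle.
        @Override
        public boolean equals(Object other) {
            return other instanceof PrefixPartitioner
                && ((PrefixPartitioner) other).numPartitions == numPartitions;
        }

        @Override
        public int hashCode() {
            return numPartitions;
        }
    }

It would then be passed straight into the aggregation, e.g. newPairs.combineByKey(createCombiner, mergeValue, mergeCombiners, new PrefixPartitioner(64)), so the only shuffle paid for is the one placed by the custom partitioner, and the keys' own equals/hashCode stay untouched.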