Re: Why is this operation so expensive

Steve Lewis Tue, 25 Nov 2014 17:44:54 -0800

If I combineByKey in the next step I suppose I am paying for a shuffle I
need any way - right?
Also if I supply a custom partitioner rather than hash can I control where
and how data is shuffled - overriding equals and hashcode could be a bad
thing but a custom partitioner is less dangerous


On Tue, Nov 25, 2014 at 1:55 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Steve,
>
> You changed the first value in a Tuple2, which is the one that Spark uses
> to hash and determine where in the cluster to place the value.  By changing
> the first part of the PairRDD, you've implicitly asked Spark to reshuffle
> the data according to the new keys.  I'd guess that you would observe large
> amounts of shuffle in the webui as a result of this code.
>
> If you don't actually need your data shuffled by the first part of the
> pair RDD, then consider making the KeyType not in the first half of the
> PairRDD.  An alternative is to make the .equals() and .hashcode() of
> KeyType delegate to the .getId() method you use in the anonymous function.
>
> Cheers,
> Andrew
>
> On Tue, Nov 25, 2014 at 10:06 AM, Steve Lewis <lordjoe2...@gmail.com>
> wrote:
>
>> I have an JavaPairRDD<KeyType,Tuple2<Type1,Type2>> originalPairs. There
>> are on the order of 100 million elements
>>
>> I call a function to rearrange the tuples
>>   JavaPairRDD<String,Tuple2<Type1,Type2>>   newPairs =
>> originalPairs.values().mapToPair(new PairFunction<Tuple2<Type1,Type2>,
>> String, Tuple2<IType1,Type2>> {
>>         @Override
>>         public Tuple2<String, Tuple2<Type1,Type2>> doCall(final
>> Tuple2<Type1,Type2> t)  {
>>             return new Tuple2<String,
>> Tuple2<Type1,Type2>>(t._1().getId(), t);
>>         }
>>     }
>>
>> where Type1.getId() returns a String
>>
>> The data are spread across 120 partitions on 15 machines. The operation
>> is dead simple and yet it takes 5 minutes to generate the data and over 30
>> minutes to perform this simple operation. I am at a loss to  understand
>> what is taking so long or how to make it faster. It this stage there is no
>> reason to move data to different partitions
>> Anyone have bright ideas - Oh yes Type1 and Type2 are moderately complex
>> objects weighing in at about 10kb
>>
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: Why is this operation so expensive

Reply via email to