I have a job involving two sets of data indexed with the same type of key.
I have an expensive operation that I want to run on pairs sharing the same
key. The following code works, BUT all of the work is being done on 3 of 16
processors.

How do I go about diagnosing and fixing this behavior? A shuffle would
take a lot less time than running MyExpensiveOperation on all the data.

JavaPairRDD<MyKey, Type1> set1;
JavaPairRDD<MyKey, Type2> set2;

I do a join

JavaPairRDD<MyKey, Tuple2<Type1, Type2>> joinSet = set1.join(set2);

JavaRDD<MyResult> results =
    joinSet.values().map(new MyExpensiveOperation());
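
One possible first check, sketched here in plain Java with no Spark dependency: if the join uses hash partitioning (an assumption; that is Spark's default when no partitioner is supplied), each key goes to partition hashCode mod numPartitions, made non-negative. Counting how the keys would spread over 16 partitions shows whether MyKey.hashCode() itself is causing the skew. The class name PartitionSkewCheck and the Integer sample keys are hypothetical stand-ins for the real keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spark: simulates hash partitioning locally.
class PartitionSkewCheck {

    // Non-negative (hashCode mod numPartitions), the rule a hash partitioner applies.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many of the given keys would land in each partition.
    static int[] countPerPartition(List<?> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (Object key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up keys whose hash codes are all multiples of 16,
        // so every one of them maps to partition 0.
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(i * 16);
        }
        System.out.println(Arrays.toString(countPerPartition(keys, 16)));
    }
}
```

If a sample of the real keys shows only a few non-empty partitions, the likely fixes are a better-distributed MyKey.hashCode() or an explicit shuffle of the joined values before the expensive map, for example joinSet.values().repartition(16). Whether 16 is the right partition count for this cluster is an assumption taken from the processor count above.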
