I have a job involving two sets of data keyed by the same type. I have an expensive operation that I want to run on pairs that share a key. The following code works, BUT all of the work is being done on 3 of 16 processors. How do I go about diagnosing and fixing this behavior? A shuffle would take far less time than running MyExpensiveOperation on all the data.
    JavaPairRDD<MyKey, Type1> set1;
    JavaPairRDD<MyKey, Type2> set2;

I do a join:

    JavaPairRDD<MyKey, Tuple2<Type1, Type2>> joinSet = set1.join(set2);
    JavaRDD<MyResult> results = joinSet.values().map(new MyExpensiveOperation());
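One thing that may be worth ruling out is whether the hash codes of MyKey spread evenly over the partitions. Spark's default HashPartitioner places a key in partition `hashCode() mod numPartitions` (made non-negative), so keys whose hash codes cluster can end up in only a few partitions. The following is a minimal plain-Java sketch of that check, not Spark code: the class name `PartitionSkewCheck`, its helper methods, and the sample Integer keys are all hypothetical stand-ins for a sample of the real keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: counts how many keys would land in each partition
// under hash partitioning (non-negative hashCode modulo partition count).
class PartitionSkewCheck {

    // Same arithmetic as a hash partitioner: floorMod keeps the result in [0, numPartitions).
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Tally the keys per partition so an uneven spread is easy to spot.
    static int[] countPerPartition(List<?> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (Object key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample: Integer keys that are all multiples of 16.
        // Integer.hashCode() is the value itself, so they all share one partition.
        List<Integer> skewedKeys = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            skewedKeys.add(i * 16);
        }
        System.out.println(Arrays.toString(countPerPartition(skewedKeys, 16)));
    }
}
```

If a sample of the real keys shows this kind of clustering, improving `MyKey.hashCode()` is one option. If the hash codes look even, the join result may simply have few partitions; in that case checking `joinSet.getNumPartitions()` and calling `joinSet.repartition(n)` before the `map` (with `n` chosen for the cluster) would force the shuffle described above.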