I have a `JavaPairRDD<KeyType, Tuple2<Type1, Type2>>` called `originalPairs`, with on the order of 100 million elements.
I call a function to rearrange the tuples:

```java
JavaPairRDD<String, Tuple2<Type1, Type2>> newPairs =
    originalPairs.values().mapToPair(
        new PairFunction<Tuple2<Type1, Type2>, String, Tuple2<Type1, Type2>>() {
            @Override
            public Tuple2<String, Tuple2<Type1, Type2>> call(final Tuple2<Type1, Type2> t) {
                return new Tuple2<String, Tuple2<Type1, Type2>>(t._1().getId(), t);
            }
        });
```

where `Type1.getId()` returns a `String`. The data are spread across 120 partitions on 15 machines. The operation is dead simple, and yet it takes 5 minutes to generate the data and over 30 minutes to perform this simple operation. I am at a loss to understand what is taking so long or how to make it faster. At this stage there is no reason to move data to different partitions. Anyone have bright ideas?

Oh yes: `Type1` and `Type2` are moderately complex objects, weighing in at about 10 KB each.
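To show just how little per-element work is involved, here is the same re-keying logic as a plain-Java sketch with no Spark at all (`Type1` and `Type2` below are hypothetical stand-ins for my real classes; only `getId()` matters):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ReKeySketch {
    // Hypothetical stand-in for Type1; only getId() is used by the re-keying.
    static class Type1 {
        private final String id;
        Type1(String id) { this.id = id; }
        String getId() { return id; }
    }

    // Hypothetical stand-in for Type2 (carried along, never inspected).
    static class Type2 { }

    // The same per-element work as the mapToPair above:
    // map each (Type1, Type2) value to a (Type1.getId(), (Type1, Type2)) pair.
    static Map<String, Map.Entry<Type1, Type2>> rekey(List<Map.Entry<Type1, Type2>> values) {
        return values.stream()
                .collect(Collectors.toMap(e -> e.getKey().getId(), e -> e));
    }

    public static void main(String[] args) {
        List<Map.Entry<Type1, Type2>> values = List.of(
                new SimpleEntry<>(new Type1("a"), new Type2()),
                new SimpleEntry<>(new Type1("b"), new Type2()));
        System.out.println(
                rekey(values).keySet().stream().sorted().collect(Collectors.toList()));
        // prints [a, b]
    }
}
```

Each element is touched exactly once and only a `String` key is extracted, which is why the 30-minute runtime on the cluster surprises me.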