I have the following code in a Spark job:

    // Get the baseline input file(s).
    JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable =
        sc.hadoopFile(baselineInputBucketFile,
            SequenceFileInputFormat.class, Text.class, Text.class);
    JavaPairRDD<String, String> hsfBaselinePairRDD =
        hsfBaselinePairRDDReadable
            .mapToPair(new ConvertFromWritableTypes())
            .partitionBy(new HashPartitioner(Variables.NUM_OUTPUT_PARTITIONS))
            .persist(StorageLevel.MEMORY_ONLY_SER());
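For context on why partitionBy moves so much data: Spark's HashPartitioner routes each record to a partition derived from the key's hashCode, so most records must travel to a different task than the one that read them. Below is a minimal sketch of that assignment in plain Java (no Spark dependency; the class and method names here are illustrative, not Spark's actual source, though the non-negative-mod logic matches HashPartitioner's documented behavior):

```java
// Illustrative sketch of HashPartitioner-style partition assignment.
public class HashPartitionSketch {

    // Spark's HashPartitioner uses a non-negative modulus of the
    // key's hashCode(), so negative hash codes still map into
    // the range [0, numPartitions).
    static int partitionFor(Object key, int numPartitions) {
        int rawMod = key.hashCode() % numPartitions;
        return rawMod + (rawMod < 0 ? numPartitions : 0);
    }

    public static void main(String[] args) {
        // Stand-in for Variables.NUM_OUTPUT_PARTITIONS.
        int numPartitions = 4;
        for (String key : new String[] {"alpha", "bravo", "charlie"}) {
            System.out.println(key + " -> partition "
                + partitionFor(key, numPartitions));
        }
    }
}
```

Every record whose computed partition differs from the partition it currently lives in has to be written to shuffle files and fetched by the task owning the target partition, which is where the shuffle write in this stage comes from.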
    // Use 'substring' to extract epoch values.
    JavaPairRDD<String, Long> baselinePairRDD =
        hsfBaselinePairRDD
            .mapValues(new ExtractEpoch(accumBadBaselineRecords))
            .persist(StorageLevel.MEMORY_ONLY_SER());

When looking at the stage information for my job, I notice the following:

- Constructing hsfBaselinePairRDD takes about 7.5 minutes, with 931 GB of input (from S3) and 377 GB of shuffle write (presumably because of the hash partitioning). This all makes sense.

- Constructing baselinePairRDD also takes about 7.5 minutes, which I found a bit odd. What I found really odd is that this stage also shows 330 GB of shuffle read; I would have expected zero shuffle read here.

What I'm confused about is why there is any shuffle read at all when constructing baselinePairRDD. If anyone could shed some light on this, it would be appreciated.

Thanks,
Darin.