In the following code, I read in a large SequenceFile from S3 (about 1TB) spread across 1024 partitions. When I look at the job/stage summary, I see about 400GB of shuffle writes, which seems to make sense since I'm hash-partitioning this file.
// Get the baseline input file.
JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable =
    sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class, Text.class, Text.class);

JavaPairRDD<String, String> hsfBaselinePairRDD = hsfBaselinePairRDDReadable
    .mapToPair(new ConvertFromWritableTypes())
    .partitionBy(new HashPartitioner(Variables.NUM_OUTPUT_PARTITIONS))
    .persist(StorageLevel.MEMORY_AND_DISK_SER());

I then execute the following code (with a count to force execution). What I find very strange is that the job/stage summary for this step shows more than 340GB of shuffle read. Why would there be any shuffle read here? Since hsfBaselinePairRDD is already hash-partitioned and persisted, and mapValues does not change the keys, I would expect little (if any) shuffle read in this step (a stripped-down sketch of the pattern is in the P.S. below).

// Use 'substring' to extract the epoch value from each record.
JavaPairRDD<String, Long> baselinePairRDD = hsfBaselinePairRDD
    .mapValues(new ExtractEpoch(accumBadBaselineRecords))
    .persist(StorageLevel.MEMORY_AND_DISK_SER());

log.info("Number of baseline records: " + baselinePairRDD.count());

Both hsfBaselinePairRDD and baselinePairRDD have 1024 partitions. I'm using Spark 1.2.0 on a stand-alone cluster. Any insights would be appreciated.

Darin.
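P.S. In case it helps anyone reproduce or reason about this, below is a stripped-down sketch of the pattern I'm relying on. It uses small in-memory data instead of the S3 SequenceFile, and the class name, sample key/value strings, and the inline epoch parser are made up for illustration (they stand in for my ConvertFromWritableTypes and ExtractEpoch classes). My understanding is that mapValues preserves the partitioner set by partitionBy, which is why I expected the count in the second step to read the persisted blocks without a large shuffle.

import java.util.Arrays;

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public class PartitionPreservationCheck {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("PartitionPreservationCheck").setMaster("local[4]"));

        // Stand-in for hsfBaselinePairRDD: hash-partitioned and persisted pairs.
        // (The real job reads a SequenceFile from S3 and converts the Writable types.)
        JavaPairRDD<String, String> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, String>("key1", "1425000000000|payload1"),
                new Tuple2<String, String>("key2", "1425000060000|payload2")))
            .partitionBy(new HashPartitioner(4))
            .persist(StorageLevel.MEMORY_AND_DISK_SER());

        // Stand-in for ExtractEpoch: mapValues leaves the keys untouched,
        // so the HashPartitioner should be carried through to the result.
        JavaPairRDD<String, Long> epochs = pairs.mapValues(new Function<String, Long>() {
            public Long call(String value) {
                return Long.parseLong(value.split("\\|")[0]);
            }
        });

        // Both lines should report the same HashPartitioner, and the count
        // should be able to run without re-shuffling the partitioned data.
        System.out.println("pairs partitioner:  " + pairs.partitioner());
        System.out.println("epochs partitioner: " + epochs.partitioner());
        System.out.println("count: " + epochs.count());

        sc.stop();
    }
}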