I have the following code in a Spark job.

// Get the baseline input file(s)
JavaPairRDD<Text, Text> hsfBaselinePairRDDReadable =
    sc.hadoopFile(baselineInputBucketFile, SequenceFileInputFormat.class,
                  Text.class, Text.class);

JavaPairRDD<String, String> hsfBaselinePairRDD =
    hsfBaselinePairRDDReadable
        .mapToPair(new ConvertFromWritableTypes())
        .partitionBy(new HashPartitioner(Variables.NUM_OUTPUT_PARTITIONS))
        .persist(StorageLevel.MEMORY_ONLY_SER());
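
(For context, ConvertFromWritableTypes is essentially a PairFunction that copies
the Hadoop Text writables into plain Strings so the RDD can be serialized with
the normal serializers. A rough sketch of the idea, not the exact code:)

import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class ConvertFromWritableTypes
    implements PairFunction<Tuple2<Text, Text>, String, String> {

  @Override
  public Tuple2<String, String> call(Tuple2<Text, Text> record) {
    // Copy the mutable Hadoop Writables into immutable Strings.
    return new Tuple2<>(record._1().toString(), record._2().toString());
  }
}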

// Use 'substring' to extract epoch values.
JavaPairRDD<String, Long> baselinePairRDD =
    hsfBaselinePairRDD
        .mapValues(new ExtractEpoch(accumBadBaselineRecords))
        .persist(StorageLevel.MEMORY_ONLY_SER());
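
(And ExtractEpoch is along these lines: a plain Function that substrings the
epoch out of each value and bumps the accumulator for records it can't parse.
The field offsets below are placeholders, not the real record layout:)

import org.apache.spark.Accumulator;
import org.apache.spark.api.java.function.Function;

public class ExtractEpoch implements Function<String, Long> {

  // Placeholder offsets of the epoch field within the value;
  // the real values depend on the record layout.
  private static final int EPOCH_START = 0;
  private static final int EPOCH_END = 13;

  private final Accumulator<Integer> badRecords;

  public ExtractEpoch(Accumulator<Integer> badRecords) {
    this.badRecords = badRecords;
  }

  @Override
  public Long call(String value) {
    try {
      return Long.parseLong(value.substring(EPOCH_START, EPOCH_END));
    } catch (Exception e) {
      // Count records whose epoch field can't be parsed.
      badRecords.add(1);
      return -1L;
    }
  }
}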

When looking at the STAGE information for my job, I notice the following:

The stage that constructs hsfBaselinePairRDD takes about 7.5 minutes, with
931 GB of input (from S3) and 377 GB of shuffle write (presumably because of
the hash partitioning).  This all makes sense.

The stage that constructs baselinePairRDD also takes about 7.5 minutes, which
already seemed a bit odd.  What seems really odd, though, is that this stage
also shows 330 GB of shuffle read.  I would have expected zero shuffle read
here.

What I'm confused about is why there is any 'shuffle read' at all when
constructing baselinePairRDD.  If anyone could shed some light on this, it
would be appreciated.

Thanks.

Darin.
