I'm using Spark 1.2 with a stand-alone cluster on EC2. I have a cluster of 8
r3.8xlarge machines but limit the job to only 128 cores. I have also tried
other configurations, such as 4 workers per r3.8xlarge with 67 GB each, but
this made no difference.
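For reference, the "4 workers per node, 67 GB each" variant I tried was set up roughly like this in spark-env.sh (a sketch of my setup, exact values from memory; the core count per worker is an assumption):

```shell
# spark-env.sh on each r3.8xlarge (sketch)
export SPARK_WORKER_INSTANCES=4   # 4 worker processes per machine
export SPARK_WORKER_MEMORY=67g    # 67 GB per worker
export SPARK_WORKER_CORES=8       # assumed: 32 vCPUs split across 4 workers
```

with spark.cores.max=128 set on the job to cap the total cores used.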
The job frequently fails at the end, in the saveAsHadoopFile step, though it
sometimes succeeds.
finalNewBaselinePairRDD is a JavaPairRDD<String, String>, hash-partitioned
into 1024 partitions, with a total size of around 1 TB and about 13.5M
records.
JavaPairRDD<Text, Text> finalBaselineRDDWritable =
    finalNewBaselinePairRDD
        .mapToPair(new ConvertToWritableTypes())
        .persist(StorageLevel.MEMORY_AND_DISK_SER());

// Save to HDFS as a gzip-compressed SequenceFile
finalBaselineRDDWritable.saveAsHadoopFile(
    "hdfs:///sparksync/",
    Text.class,
    Text.class,
    SequenceFileOutputFormat.class,
    org.apache.hadoop.io.compress.GzipCodec.class);
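In case it matters, ConvertToWritableTypes just wraps the String pairs as Hadoop Text so they can be written to the SequenceFile; a minimal sketch of what it looks like (reconstructed, since the class isn't shown above):

```java
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Sketch: converts <String, String> pairs into Writable <Text, Text>
// pairs for SequenceFileOutputFormat.
public class ConvertToWritableTypes
        implements PairFunction<Tuple2<String, String>, Text, Text> {
    @Override
    public Tuple2<Text, Text> call(Tuple2<String, String> record) {
        return new Tuple2<>(new Text(record._1()), new Text(record._2()));
    }
}
```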
If anyone has any tips on what I should look into, it would be appreciated.
Thanks.
Darin.