I'm using Spark 1.2 on a standalone cluster on EC2. The cluster has 8 r3.8xlarge machines, but I limit the job to 128 cores. I have also tried other configurations, such as 4 workers per r3.8xlarge with 67GB each, but it made no difference.
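For reference, this is roughly how I'm expressing those limits (standalone-mode worker settings in spark-env.sh plus spark.cores.max on the application side); the exact placement below is just a sketch:

    # spark-env.sh on each r3.8xlarge (the 4-worker variant I tried)
    SPARK_WORKER_INSTANCES=4
    SPARK_WORKER_MEMORY=67g

    # in spark-defaults.conf (or on the SparkConf), to cap the job at 128 cores
    spark.cores.max=128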
The job frequently fails at the end, in this step (saveAsHadoopFile), though it will sometimes succeed. finalNewBaselinePairRDD is a JavaPairRDD<String, String>, hash-partitioned into 1024 partitions, with about 13.5M records and a total size of around 1TB.

    JavaPairRDD<Text, Text> finalBaselineRDDWritable = finalNewBaselinePairRDD
        .mapToPair(new ConvertToWritableTypes())
        .persist(StorageLevel.MEMORY_AND_DISK_SER());

    // Save to HDFS as a gzip-compressed SequenceFile
    finalBaselineRDDWritable.saveAsHadoopFile("hdfs:///sparksync/",
        Text.class, Text.class,
        SequenceFileOutputFormat.class,
        org.apache.hadoop.io.compress.GzipCodec.class);

If anyone has any tips on what I should look into, it would be appreciated.

Thanks,
Darin
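P.S. For context, ConvertToWritableTypes is just a small PairFunction that wraps the String key and value in Hadoop Text writables so the pairs can be written to the SequenceFile; a minimal sketch of that conversion:

    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // Sketch of the String -> Text conversion applied before saveAsHadoopFile.
    public class ConvertToWritableTypes
            implements PairFunction<Tuple2<String, String>, Text, Text> {
        @Override
        public Tuple2<Text, Text> call(Tuple2<String, String> record) {
            // Wrap key and value in Hadoop writables for SequenceFile output.
            return new Tuple2<Text, Text>(new Text(record._1()), new Text(record._2()));
        }
    }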