I am executing a Spark job on a cluster in yarn-client mode (yarn-cluster is not an option due to permission issues).
My configuration:

- num-executors 800
- spark.akka.frameSize=1024
- spark.default.parallelism=25600
- driver-memory=4G
- executor-memory=32G

My input size is around 1.5 TB. When I execute

    rdd.saveAsTextFile(outputPath, classOf[org.apache.hadoop.io.compress.SnappyCodec])

I get a heap space error. (Saving as Avro is also not an option; I have tried saveAsSequenceFile with GZIP and saveAsNewAPIHadoopFile, with the same result.) On the other hand, if I execute rdd.take(1), there is no such issue, so I assume the problem occurs during the write.
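For reference, the two calls being compared can be sketched as a minimal job. The paths, the RDD contents, and the job skeleton are assumptions for illustration, not from my actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.SnappyCodec

object SaveJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-job"))

    // Hypothetical input path standing in for the real 1.5 TB dataset
    val rdd = sc.textFile("hdfs:///path/to/input")

    // This succeeds: take(1) only materializes a single element on the driver
    println(rdd.take(1).mkString)

    // This is the call that fails with a heap space error: it triggers the
    // full job and writes every partition, Snappy-compressed, to HDFS
    rdd.saveAsTextFile("hdfs:///path/to/output", classOf[SnappyCodec])

    sc.stop()
  }
}
```

The contrast matters because take(1) computes only the first partition (or part of it), while saveAsTextFile evaluates and serializes the entire RDD, so a failure only on the write path points at per-partition or shuffle memory pressure rather than at reading the input.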