I am executing a Spark job on a cluster in yarn-client mode (yarn-cluster is not
an option due to permission issues) with the following settings; a rough sketch
of how they map onto a SparkConf follows the list.

   - num-executors 800
   - spark.akka.frameSize=1024
   - spark.default.parallelism=25600
   - driver-memory=4G
   - executor-memory=32G
   - My input size is around 1.5TB.
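In case it helps, this is roughly how those settings translate into a SparkConf
(the app name is just a placeholder; note that in yarn-client mode driver memory
has to be passed on the spark-submit command line, since the driver JVM is
already running by the time the conf is applied):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-job")                      // placeholder app name
      .setMaster("yarn-client")                  // yarn-client mode, as above
      .set("spark.executor.instances", "800")    // same as --num-executors 800
      .set("spark.akka.frameSize", "1024")
      .set("spark.default.parallelism", "25600")
      .set("spark.executor.memory", "32g")       // same as --executor-memory 32G
    val sc = new SparkContext(conf)
    // driver-memory=4G is supplied via --driver-memory on spark-submit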

My problem is that when I execute rdd.saveAsTextFile(outputPath,
classOf[org.apache.hadoop.io.compress.SnappyCodec]) I get a heap space error
(saving as Avro is also not an option; I have tried saveAsSequenceFile with GZIP
and saveAsNewAPIHadoopFile with the same result). On the other hand, if I
execute rdd.take(1) there is no such issue, so I am assuming the problem comes
from the write.
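
For reference, the failing write and the action that succeeds look roughly like
this (outputPath and the rdd value stand in for my real HDFS path and ~1.5TB
dataset):

    import org.apache.spark.rdd.RDD
    import org.apache.hadoop.io.compress.{GzipCodec, SnappyCodec}

    val outputPath = "hdfs:///path/to/output"   // placeholder output path
    val rdd: RDD[String] = ???                  // stands in for the ~1.5TB input

    // Snappy-compressed text output: this is where the heap space error shows up
    rdd.saveAsTextFile(outputPath, classOf[SnappyCodec])

    // Variants I tried that end the same way:
    //   pair-RDD sequence file with GZIP:
    //     pairRdd.saveAsSequenceFile(outputPath, Some(classOf[GzipCodec]))
    //   saveAsNewAPIHadoopFile with an equivalent output format

    // A simple action on the same RDD finishes without any heap problem:
    rdd.take(1)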
