Hello, I am trying to write Parquet files from a DataFrame. I am able to use partitionBy("year", "month", "day"), and Spark correctly partitions the data physically into the directory structure I expect.
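For reference, the write looks roughly like this (a minimal sketch; "df", the input path, and the output path are placeholders for the real job's DataFrame and S3 locations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()

    // Placeholder source; the real DataFrame has year/month/day
    // columns alongside the payload columns.
    val df = spark.read.parquet("s3://my-bucket/input/")

    // Partitioned Parquet write that produces the expected
    // year=.../month=.../day=... directory layout.
    df.write
      .partitionBy("year", "month", "day")
      .parquet("s3://my-bucket/output/")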
The issue is that when the partitions themselves are anything non-trivial in size, memory usage seems to blow up and I get a lot of GC pressure on my cluster: there is lots of red in the GC Time column of the Executors tab on the web UI, for all executors. If I try to coalesce the DataFrame to get reasonably sized output files (see the sketch at the end of this mail), the job falls over due to GC pressure. Removing the partitionBy and writing directly to the output destination alleviates the problem, but we would like this functionality to improve our query performance in engines like Hive.

I am running Spark 2.0.2 on EMR 5.3.1, using pretty large nodes (c3.4xlarge, with 30 GB of RAM per node), and each executor gets 5.5 GB. I saw some previous mails about a similar issue, but those were back in the Spark 1.4 days and seem to have been resolved; I still hit this problem. Any help would be appreciated.
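For completeness, the coalesce attempt is essentially the following (again a sketch; the target partition count of 200 is an illustrative value, not the exact number from my job):

    // Sketch of the coalesce attempt: shrink the number of task
    // partitions before the partitioned write so each output file
    // comes out reasonably sized. 200 is an illustrative count.
    df.coalesce(200)
      .write
      .partitionBy("year", "month", "day")
      .parquet("s3://my-bucket/output/")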