Hi, what is the proper configuration for writing partitioned parquet data when the partition column has a large number of repeated keys?
In the code below I load 500 million rows of data and partition them on a column with relatively few distinct values, using spark-shell with 30g for each executor and the driver, and 3 executor cores:

    sqlContext.read.load("hdfs://notpartitioneddata")
      .write.partitionBy("columnname").parquet("partitioneddata")

The job failed because the executors ran out of memory:

    WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
    16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on datanode2.babar.poc: Container killed by YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
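For context, here is a minimal sketch of one common mitigation, assuming a Spark 1.6+ DataFrame API matching the snippet above (repartition accepting Column expressions was added in 1.6): repartition by the partition column before the write, so each task only keeps open parquet writers for a few output directories at a time, and bump the off-heap overhead as the YARN message itself suggests. The paths and column name are the placeholders from my post, not a real layout.

    // Launch with extra off-heap headroom, per the YARN error
    // (spark.yarn.executor.memoryOverhead is in MB), e.g.:
    //   spark-shell --executor-memory 30g --driver-memory 30g \
    //     --conf spark.yarn.executor.memoryOverhead=4096

    val df = sqlContext.read.load("hdfs://notpartitioneddata")

    // Cluster rows by the partition column first so each task writes to
    // only a few partition directories, keeping fewer parquet writers
    // (and their column buffers) alive in executor memory at once.
    df.repartition(df("columnname"))
      .write
      .partitionBy("columnname")
      .parquet("partitioneddata")

Without the repartition, every task can touch every partition value, so each task may hold one buffered parquet writer per distinct key simultaneously, which is presumably what exhausts the container memory here.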