Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Adrien Mogenet
Very interested in that topic too; thanks, Cheng, for the direction! We'll give it a try as well. On 3 December 2015 at 01:40, Cheng Lian wrote: > You may try setting the Hadoop conf "parquet.enable.summary-metadata" to false > to disable writing Parquet summary files (_metadata and _common_metadata).

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Cheng Lian
You may try setting the Hadoop conf "parquet.enable.summary-metadata" to false to disable writing Parquet summary files (_metadata and _common_metadata). By default, Parquet writes the summary files by collecting the footers of all part-files in the dataset while committing the job. Spark also follows
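A minimal sketch of applying this suggestion from a Spark 1.5-era Scala application, assuming an existing `sqlContext`; the conf key is the one named above, everything else is illustrative:

```scala
// Disable Parquet summary files so the job-commit phase does not have
// to collect footers from every part-file on the driver. With this set,
// no _metadata or _common_metadata files should be written.
sqlContext.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false")
```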

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Jerry Lam
Hi Don, it sounds similar to this: https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCAG+ckK9L=htfyrwx3ux2oeqjjkyukkpmxjq+tns1xrwh-ff...@mail.gmail.com%3E

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Don Drake
Does anyone have any suggestions on creating a large number of Parquet files, especially regarding the last phase, where it creates the _metadata? Thanks. -Don On Sat, Nov 28, 2015 at 9:02 AM, Don Drake wrote: > I have a 2TB dataset in a DataFrame that I am attempting to > partition

df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-28 Thread Don Drake
I have a 2TB dataset in a DataFrame that I am attempting to partition by 2 fields, and my YARN job seems to write the partitioned dataset successfully. I can see the output in HDFS once all Spark tasks are done. After the Spark tasks are done, the job appears to be running for over an
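For context, a minimal sketch of the kind of write being described; the two partition columns and the output path are hypothetical stand-ins, since the actual code is not included in the snippet:

```scala
// Partitioned Parquet write of the sort described above: "year" and
// "month" stand in for the two unnamed partition fields, and the
// output path is illustrative.
df.write
  .partitionBy("year", "month")
  .parquet("hdfs:///data/output/partitioned")
```

Per Cheng's explanation earlier in the thread, the time spent after the tasks finish corresponds to the driver-side job commit, which by default collects the footers of all part-files to build the summary files.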