I see. So there are actually 3000 tasks instead of 3000 jobs, right?
Would you mind providing the full stack trace of the GC issue? At first
I thought it was identical to the _metadata one in the mail thread you
mentioned.
Cheng
On 1/11/16 5:30 PM, Gavin Yue wrote:
Here is how I set the conf:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
This actually works: I do not see the _metadata file anymore.
I think I made a mistake. The 3000 jobs are coming from
repartition("id").
I have 7600 json files and want to save as parquet.
So if I use df.write.parquet(path), it generates 7600 Parquet files
with 7600 partitions, which works without a problem.
But if I use repartition to change the partition number, say:
df.repartition(3000).write.parquet(path)
this generates 7600 + 3000 tasks, and the 3000 tasks always fail due to
a GC problem.
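For reference, the whole pipeline is roughly this (the paths are just
placeholders):

// Summary metadata disabled as described above; paths are hypothetical
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
val df = sqlContext.read.json("hdfs:///input/json")  // ~7600 input files
df.repartition(3000).write.parquet("hdfs:///output/parquet")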
Best,
Gavin
On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
Hey Gavin,
Could you please provide a snippet of your code to show how you
disabled "parquet.enable.summary-metadata" and wrote the files?
In particular, you mentioned you saw "3000 jobs" fail. Were you
writing each Parquet file with an individual job? (Usually people
use write.partitionBy(...).parquet(...) to write multiple
Parquet files.)
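For example, something along these lines, where "date" is just a
hypothetical partition column:

// Writes one directory of Parquet files per distinct "date" value
df.write.partitionBy("date").parquet("/path/to/output")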
Cheng
On 1/10/16 10:12 PM, Gavin Yue wrote:
Hey,
I am trying to convert a bunch of JSON files into Parquet, which
would output over 7000 Parquet files. But there are too many files,
so I want to repartition based on id down to 3000.
But I got the error of GC problem like this one:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives
So I set parquet.enable.summary-metadata to false. But when I call
write.parquet, I can still see the 3000 jobs run after the Parquet
write, and they fail due to GC.
Basically, repartition has never succeeded for me. Are there any
other settings that could be tuned?
Thanks,
Gavin