Re: parquet repartitions and parquet.enable.summary-metadata does not work
I see. So there are actually 3000 tasks instead of 3000 jobs, right? Would you mind providing the full stack trace of the GC issue? At first I thought it was identical to the _metadata one in the mail thread you mentioned.

Cheng

On 1/11/16 5:30 PM, Gavin Yue wrote:

Here is how I set the conf:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

This actually works; I do not see the _metadata file anymore.

I think I made a mistake. The 3000 jobs are coming from repartition("id").

I have 7600 json files and want to save them as parquet.

So if I use df.write.parquet(path), it would generate 7600 parquet files with 7600 partitions, which is no problem.

But if I use repartition to change the partition number, say df.repartition(3000).write.parquet, this would generate 7600 + 3000 tasks. The 3000 tasks always fail due to the GC problem.

Best,
Gavin

On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian wrote:

Hey Gavin,

Could you please provide a snippet of your code to show how you disabled "parquet.enable.summary-metadata" and wrote the files? In particular, you mentioned you saw "3000 jobs" fail. Were you writing each Parquet file with an individual job? (Usually people use write.partitionBy(...).parquet(...) to write multiple Parquet files.)

Cheng

On 1/10/16 10:12 PM, Gavin Yue wrote:

Hey,

I am trying to convert a bunch of json files into parquet, which would output over 7000 parquet files. But there are too many files, so I want to repartition based on id to 3000.

But I got a GC error like this one: https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call write.parquet, I can still see the 3000 jobs run after the parquet files are written, and they fail due to GC.

Basically repartition never succeeded for me. Are there any other settings which could be optimized?

Thanks,
Gavin
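For readers following the thread, the write path Gavin describes can be put together roughly as follows. This is only a sketch, assuming the Spark 1.x SQLContext API that was current at the time of the thread; the object name, application name, and both paths are hypothetical placeholders, and it does not address the GC failures themselves.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object; the names and paths are not from the thread.
object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Setting from the thread: skip writing the Parquet summary files
    // (such as _metadata) at the end of the job.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Placeholder input path standing in for the 7600 json files.
    val df = sqlContext.read.json("/path/to/json")

    // repartition(3000) introduces a shuffle, so the job has two stages:
    // one reading the input and one writing the 3000 output partitions.
    // In the thread these show up as 7600 + 3000 tasks.
    df.repartition(3000).write.parquet("/path/to/parquet")

    sc.stop()
  }
}
```

The 3000 tasks Gavin sees after the initial 7600 are the post-shuffle write stage, which matches Cheng's point that they are tasks rather than separate jobs.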
Re: parquet repartitions and parquet.enable.summary-metadata does not work
Hey Gavin,

Could you please provide a snippet of your code to show how you disabled "parquet.enable.summary-metadata" and wrote the files? In particular, you mentioned you saw "3000 jobs" fail. Were you writing each Parquet file with an individual job? (Usually people use write.partitionBy(...).parquet(...) to write multiple Parquet files.)

Cheng

On 1/10/16 10:12 PM, Gavin Yue wrote:

Hey,

I am trying to convert a bunch of json files into parquet, which would output over 7000 parquet files. But there are too many files, so I want to repartition based on id to 3000.

But I got a GC error like this one: https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call write.parquet, I can still see the 3000 jobs run after the parquet files are written, and they fail due to GC.

Basically repartition never succeeded for me. Are there any other settings which could be optimized?

Thanks,
Gavin
parquet repartitions and parquet.enable.summary-metadata does not work
Hey,

I am trying to convert a bunch of json files into parquet, which would output over 7000 parquet files. But there are too many files, so I want to repartition based on id to 3000.

But I got a GC error like this one: https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call write.parquet, I can still see the 3000 jobs run after the parquet files are written, and they fail due to GC.

Basically repartition never succeeded for me. Are there any other settings which could be optimized?

Thanks,
Gavin