I see. So there are actually 3000 tasks instead of 3000 jobs, right?
Would you mind providing the full stack trace of the GC issue? At first
I thought it was identical to the _metadata one in the mail thread you
mentioned.
Cheng
On 1/11/16 5:30 PM, Gavin Yue wrote:
Here is how I set the conf:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
This actually works: I do not see the _metadata file anymore.
I think I made a mistake. The 3000 jobs are coming from
repartition("id").
I have 7600 json files and want to save as parquet.
So if I use df.write.parquet(path), it generates 7600 Parquet files
with 7600 partitions, which works without a problem.
But if I use repartition to change the partition number, say:
df.repartition(3000).write.parquet(path)
this generates 7600 + 3000 tasks, and the 3000 tasks always fail due to
a GC problem.
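For reference, the whole pipeline is roughly this (the paths are just
placeholders):

// Summary metadata disabled as described above; paths are hypothetical
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
val df = sqlContext.read.json("hdfs:///input/json")  // ~7600 input files
df.repartition(3000).write.parquet("hdfs:///output/parquet")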
Best,
Gavin
On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
Hey Gavin,
Could you please provide a snippet of your code to show how you
disabled "parquet.enable.summary-metadata" and wrote the files?
In particular, you mentioned you saw "3000 jobs" fail. Were you
writing each Parquet file with an individual job? (Usually people
use write.partitionBy(...).parquet(...) to write multiple
Parquet files.)
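For example, something along these lines, where "date" is just a
hypothetical partition column:

// Writes one directory of Parquet files per distinct "date" value
df.write.partitionBy("date").parquet("/path/to/output")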
Cheng
On 1/10/16 10:12 PM, Gavin Yue wrote:
Hey,
I am trying to convert a bunch of JSON files into Parquet, which
would output over 7000 Parquet files. But there are too many files,
so I want to repartition based on id down to 3000.
But I got the error of GC problem like this one:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives
So I set parquet.enable.summary-metadata to false. But when I call
write.parquet, I can still see the 3000 jobs run after the Parquet
write, and they fail due to GC.
Basically, repartition has never succeeded for me. Are there any
other settings that could be tuned?
Thanks,
Gavin