Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-12 Thread Cheng Lian

I see. So there are actually 3000 tasks instead of 3000 jobs, right?

Would you mind providing the full stack trace of the GC issue? At first 
I thought it was identical to the _metadata one in the mail thread you 
mentioned.


Cheng

On 1/11/16 5:30 PM, Gavin Yue wrote:
Here is how I set the conf: 
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")


This actually works, I do not see the _metadata file anymore.
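Roughly the full flow I'm running (sc/sqlContext are the Spark shell variables; the directory names below are placeholders):

```scala
// Disable Parquet summary metadata before any write happens.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Read the ~7600 json files and write them back out as parquet;
// with no repartitioning this produces one output file per input partition.
val df = sqlContext.read.json("hdfs:///placeholder/input/json")
df.write.parquet("hdfs:///placeholder/output/parquet")
```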

I think I made a mistake.  The 3000 jobs are coming from 
repartition("id").


I have 7600 json files and want to save as parquet.

So if I use df.write.parquet(path), it generates 7600 parquet files 
with 7600 partitions, which works fine.


But if I use repartition to change the partition number, say: 
df.repartition(3000).write.parquet(path)


This generates 7600 + 3000 tasks, and the 3000 tasks always fail due to 
a GC problem.
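If I understand the physical plan correctly, repartition(3000) inserts a full shuffle stage (the 7600 map-side tasks plus the 3000 reduce-side tasks I see), whereas coalesce(3000) would just merge the 7600 input partitions without shuffling. A sketch of the two (the output path is a placeholder):

```scala
// Full shuffle: 7600 map-side tasks feed 3000 reduce-side tasks;
// the reduce side is where the GC failures show up for me.
df.repartition(3000).write.parquet("hdfs:///placeholder/output/parquet")

// No shuffle: each of the 3000 tasks reads a few of the 7600 input
// partitions and writes a single output file.
df.coalesce(3000).write.parquet("hdfs:///placeholder/output/parquet")
```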


Best,
Gavin



On Mon, Jan 11, 2016 at 4:31 PM, Cheng Lian wrote:


Hey Gavin,

Could you please provide a snippet of your code showing how you
disabled "parquet.enable.summary-metadata" and wrote the files?
In particular, you mentioned you saw "3000 jobs" fail. Were you
writing each Parquet file with an individual job? (Usually
people use write.partitionBy(...).parquet(...) to write multiple
Parquet files.)

Cheng


On 1/10/16 10:12 PM, Gavin Yue wrote:

Hey,

I am trying to convert a bunch of json files into parquet,
which would output over 7000 parquet files. But there are too
many files, so I want to repartition based on id down to 3000.

But I hit a GC problem like the one in this thread:

https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I
call write.parquet, I could still see the 3000 jobs run after
writing the parquet files, and they failed due to GC.

Basically repartition never succeeded for me. Are there any
other settings that could be optimized?

Thanks,
Gavin







Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-11 Thread Cheng Lian

Hey Gavin,

Could you please provide a snippet of your code showing how you 
disabled "parquet.enable.summary-metadata" and wrote the files? 
In particular, you mentioned you saw "3000 jobs" fail. Were you writing 
each Parquet file with an individual job? (Usually people use 
write.partitionBy(...).parquet(...) to write multiple Parquet files.)


Cheng

On 1/10/16 10:12 PM, Gavin Yue wrote:

Hey,

I am trying to convert a bunch of json files into parquet, which would 
output over 7000 parquet files. But there are too many files, so I 
want to repartition based on id down to 3000.


But I hit a GC problem like the one in this thread: 
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives


So I set parquet.enable.summary-metadata to false. But when I call 
write.parquet, I could still see the 3000 jobs run after writing the 
parquet files, and they failed due to GC.


Basically repartition never succeeded for me. Are there any other 
settings that could be optimized?


Thanks,
Gavin






parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-10 Thread Gavin Yue
Hey,

I am trying to convert a bunch of json files into parquet, which would
output over 7000 parquet files.  But there are too many files, so I want
to repartition based on id down to 3000.

But I hit a GC problem like the one in this thread:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call
write.parquet, I could still see the 3000 jobs run after writing the
parquet files, and they failed due to GC.

Basically repartition never succeeded for me. Are there any other settings
that could be optimized?

Thanks,
Gavin