Hey,

I am trying to convert a bunch of JSON files into Parquet, which would
output over 7000 Parquet files. That is too many files, so I want to
repartition by id down to 3000 partitions.

But I ran into a GC error like the one described here:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call
write.parquet, I could still see the 3000 jobs run after the Parquet
files were written, and they failed due to GC.
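
The job is roughly equivalent to the sketch below. This is only an
illustration under assumptions, not the exact code: it assumes PySpark with
a SparkSession (Spark 2.0 or later), the paths and the function names
build_conf, convert and main are hypothetical, and it assumes that passing
the Parquet setting through the spark.hadoop. prefix is an acceptable way to
get it into the Hadoop configuration.

```python
# Sketch only: paths and helper names are placeholders, not the real job.
NUM_PARTITIONS = 3000  # target number of partitions mentioned above


def build_conf():
    """Settings applied to the session before writing."""
    return {
        # Assumption: spark.hadoop.* entries are copied into the Hadoop
        # configuration, so this disables the Parquet summary files.
        "spark.hadoop.parquet.enable.summary-metadata": "false",
    }


def convert(spark, src, dst, num_partitions=NUM_PARTITIONS):
    """Read JSON from src, repartition by the id column, write Parquet to dst."""
    df = spark.read.json(src)
    # repartition(n, "id") hash-partitions the rows by id into n partitions,
    # which triggers a full shuffle of the data.
    df.repartition(num_partitions, "id").write.parquet(dst)


def main():
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("json-to-parquet")
    for key, value in build_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    # Hypothetical input and output locations.
    convert(spark, "/path/to/json", "/path/to/parquet")
```

Calling main() runs the whole conversion; the repartition step is where the
shuffle happens, so that is presumably where the memory pressure comes from.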

Basically, the repartition has never succeeded for me. Are there any other
settings that could be tuned?

Thanks,
Gavin
