Hey, I am trying to convert a bunch of JSON files into Parquet, which would output over 7000 Parquet files. That is too many files, so I want to repartition by id down to 3000 partitions.
But I got a GC error like the one described here:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call write.parquet, I still see 3000 jobs run after the Parquet write, and they fail due to GC. Basically, the repartition has never succeeded for me.

Are there any other settings that could be tuned?

Thanks,
Gavin
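
For reference, a minimal sketch of the kind of job described above, not the original code. It assumes PySpark 2.0 or later (for SparkSession) and that the Hadoop property from the post is passed through Spark's spark.hadoop. prefix; whether that property has any effect depends on the Spark and Parquet versions in use. The names SPARK_CONF, build_session and json_to_parquet, the app name, and the paths are hypothetical.

```python
# Sketch only: names, paths and defaults are illustrative assumptions.

SPARK_CONF = {
    # Hadoop property mentioned in the post, forwarded to the Hadoop
    # configuration via the "spark.hadoop." prefix.
    "spark.hadoop.parquet.enable.summary-metadata": "false",
}


def build_session(app_name="json-to-parquet"):
    """Create a SparkSession with the settings above applied."""
    # Imported here so the rest of the module can be loaded without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def json_to_parquet(spark, src, dst, num_partitions=3000, key="id"):
    """Read JSON from src, repartition by `key`, and write Parquet to dst."""
    df = spark.read.json(src)
    # Hash-partitions rows by `key` into `num_partitions` partitions,
    # so the write produces at most that many output files.
    df.repartition(num_partitions, key).write.parquet(dst)
```

It could be called as json_to_parquet(build_session(), "/path/to/json", "/path/to/parquet"), with both paths replaced by real locations.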