Hi Denny,

Thank you for your input. I am also targeting 128 MB, but my Spark app still
generates far too many files, each only ~14 KB. That's why I'm asking whether
there is a known solution, in case someone else has run into the same issue.
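Would coalescing the DataFrame down to a handful of partitions before the write
be the right approach? Here's a rough sketch of what I have in mind (the
input/output paths, the DataFrame, and the partition count are just
placeholders I would tune to land near 128 MB per file):

import org.apache.spark.sql.SparkSession

object ParquetWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-write-sketch")
      .getOrCreate()

    // Placeholder input; in my app the DataFrame comes from elsewhere.
    val df = spark.read.json("/data/input")

    // Collapse the many small partitions into a few larger ones, so each
    // task writes one bigger parquet file instead of a ~14 KB one.
    df.coalesce(8)
      .write
      .option("compression", "snappy") // snappy is already the parquet default
      .parquet("/data/output/events")

    spark.stop()
  }
}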
Cheers,
Kevin.

On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g....@gmail.com> wrote:

> Generally, yes - you should try to have larger data sizes due to the
> overhead of opening up files. Typical guidance is between 64MB-1GB;
> personally I usually stick with 128MB-512MB with the default of snappy
> codec compression with parquet. A good reference is Vida Ha's presentation
> Data Storage Tips for Optimal Spark Performance
> <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>.
>
> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote:
>
>> Hi Everyone,
>> Does anyone know what the best practice is for writing parquet files from
>> Spark?
>>
>> As the Spark app writes data to parquet, under that directory there are
>> heaps of very small parquet files (such as
>> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet).
>> Each parquet file is only 15KB.
>>
>> Should it instead write bigger chunks of data (such as 128 MB) with a
>> proper number of files?
>>
>> Has anyone found performance changes when changing the data size of
>> each parquet file?
>>
>> Thanks,
>> Kevin.