Generally, yes - you should aim for larger files because of the per-file overhead of opening and tracking many small ones. Typical guidance is between 64MB-1GB per file; personally I usually stick with 128MB-512MB with the default snappy codec compression for Parquet. A good reference is Vida Ha's presentation Data Storage Tips for Optimal Spark Performance <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>.
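As a minimal sketch of one way to control this (assuming Spark 2.x, a DataFrame called df, a target of 8 output partitions, and made-up paths - those names and numbers are just illustrative; pick the partition count so each output file lands roughly in the 128MB-512MB range):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-write-sketch").getOrCreate()

// hypothetical input - replace with your actual source
val df = spark.read.json("/path/to/input")

// coalesce to fewer partitions so each output Parquet file is larger;
// Spark writes one file per partition of the DataFrame
df.coalesce(8)
  .write
  .option("compression", "snappy")  // snappy is already the default codec for Parquet
  .parquet("/path/to/output")

You could also use repartition(n) instead of coalesce(n) if the partitions are skewed and you need a full shuffle to even them out; coalesce is cheaper because it only merges existing partitions.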
On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote:

> Hi Everyone,
> Does anyone know what is the best practise of writing parquet file from
> Spark ?
>
> As Spark app write data to parquet and it shows that under that directory
> there are heaps of very small parquet file (such as
> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only
> 15KB
>
> Should it write each chunk of bigger data size (such as 128 MB) with
> proper number of files ?
>
> Does anyone find out any performance changes when changing data size of
> each parquet file ?
>
> Thanks,
> Kevin.
>