Try coalescing to a small number of partitions, a value of 2 or so. You could also dynamically calculate how many partitions to use to obtain an optimal file size; see the sketch below.
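A minimal sketch in Scala of what that dynamic calculation could look like. The target size, the output-size estimate, and the path are illustrative assumptions, not values from this thread:

    // df: an existing DataFrame you are about to write out.
    // Aim for ~128 MB per file by deriving the partition count from an
    // up-front estimate of the output size (e.g. measured from the input
    // files). The numbers and paths here are hypothetical.
    val targetFileBytes = 128L * 1024 * 1024             // ~128 MB per file
    val estimatedOutputBytes = 4L * 1024 * 1024 * 1024   // assume ~4 GB of output
    val numFiles = math.max(1, (estimatedOutputBytes / targetFileBytes).toInt)

    df.coalesce(numFiles)
      .write
      .parquet("/path/to/output")

Note that coalesce never shuffles, so it can only reduce the partition count; if you need more partitions (or better balance across them), use repartition instead.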
> On Dec 8, 2016, at 1:03 PM, Kevin Tran <kevin...@gmail.com> wrote:
>
> How many partitions should there be when streaming? In a streaming process the data keeps growing in size; is there any configuration to limit the file size and write to a new file once it exceeds x (let's say 128 MB per file)?
>
> Another question about performance when querying these parquet files: what is the recommended practice for file size and number of files?
>
> How do you compact small parquet files into a smaller number of bigger parquet files?
>
> Thanks,
> Kevin.
>
>> On Tue, Nov 29, 2016 at 3:01 AM, Chin Wei Low <lowchin...@gmail.com> wrote:
>> Try limiting the partitions: spark.sql.shuffle.partitions
>>
>> This controls the number of files generated.
>>
>>> On 28 Nov 2016 8:29 p.m., "Kevin Tran" <kevin...@gmail.com> wrote:
>>> Hi Denny,
>>> Thank you for your input. I also use 128 MB, but the Spark app still generates too many files, each only ~14 KB! That's why I'm asking whether there is a solution, in case someone else has hit the same issue.
>>>
>>> Cheers,
>>> Kevin.
>>>
>>>> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g....@gmail.com> wrote:
>>>> Generally, yes - you should aim for larger data sizes due to the overhead of opening up files. Typical guidance is between 64 MB and 1 GB; personally I usually stick with 128-512 MB with the default snappy codec compression for parquet. A good reference is Vida Ha's presentation Data Storage Tips for Optimal Spark Performance.
>>>>
>>>>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote:
>>>>> Hi Everyone,
>>>>> Does anyone know the best practice for writing parquet files from Spark?
>>>>>
>>>>> As the Spark app writes data to parquet, the output directory fills with heaps of very small parquet files (such as e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only 15 KB.
>>>>>
>>>>> Should it write bigger chunks of data (such as 128 MB) in a proper number of files?
>>>>>
>>>>> Has anyone observed performance changes when varying the data size of each parquet file?
>>>>>
>>>>> Thanks,
>>>>> Kevin.
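On the compaction question above, a minimal sketch of one common approach: read the small files back and rewrite them as fewer, larger ones. All names, paths, and counts below are assumptions for illustration:

    // spark: an existing SparkSession. Pick a partition count that
    // yields roughly 128 MB per output file, and write to a fresh
    // directory rather than the one being read from.
    val small = spark.read.parquet("/path/to/small-files")
    small.repartition(8)
      .write
      .mode("overwrite")
      .parquet("/path/to/compacted")

As Chin Wei Low notes, spark.sql.shuffle.partitions similarly bounds the file count for shuffle-producing queries, but for a one-off compaction job an explicit repartition is the more direct control.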