How many partitions should there be when streaming? In a streaming process the data keeps growing in size - is there any configuration to limit file size and roll over to a new file once it exceeds a threshold (let's say 128 MB per file)?
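One rough way to reason about the partition count is total data volume divided by the target file size, since each output partition becomes one file. A minimal sketch in plain Python (the 128 MB target is the figure from this thread, not a Spark default):

```python
import math

def target_partitions(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Rough number of output partitions so each written file lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# e.g. ~10 GB of data at a 128 MB target -> 80 output partitions/files
print(target_partitions(10 * 1024**3))  # 80
```

In Spark that number would be applied with df.repartition(n) (or coalesce(n)) before the write; later Spark versions also expose spark.sql.files.maxRecordsPerFile to cap rows per output file, which may cover the roll-over case directly.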
Another question about performance when querying these parquet files: what is the practice for file size and number of files? How do you compact many small parquet files into a smaller number of bigger parquet files?

Thanks,
Kevin.

On Tue, Nov 29, 2016 at 3:01 AM, Chin Wei Low <lowchin...@gmail.com> wrote:

> Try limiting the partitions: spark.sql.shuffle.partitions
>
> This controls the number of files generated.
>
> On 28 Nov 2016 8:29 p.m., "Kevin Tran" <kevin...@gmail.com> wrote:
>
>> Hi Denny,
>> Thank you for your input. I also use 128 MB, but there are still too many
>> files generated by the Spark app, each only ~14 KB! That's why I'm asking
>> whether there is a solution, in case someone else has hit the same issue.
>>
>> Cheers,
>> Kevin.
>>
>> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g....@gmail.com> wrote:
>>
>>> Generally, yes - you should aim for larger file sizes because of the
>>> overhead of opening files. Typical guidance is between 64 MB and 1 GB;
>>> personally I usually stick with 128-512 MB with the default snappy
>>> codec compression for parquet. A good reference is Vida Ha's presentation
>>> Data Storage Tips for Optimal Spark Performance
>>> <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>.
>>>
>>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote:
>>>
>>>> Hi Everyone,
>>>> Does anyone know the best practice for writing parquet files from
>>>> Spark?
>>>>
>>>> As the Spark app writes data to parquet, the output directory ends up
>>>> containing heaps of very small parquet files (such as
>>>> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is
>>>> only ~15 KB.
>>>>
>>>> Should it instead write bigger chunks of data (such as 128 MB) across a
>>>> proper number of files?
>>>>
>>>> Has anyone observed performance changes when changing the size of each
>>>> parquet file?
>>>>
>>>> Thanks,
>>>> Kevin.
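For the compaction question, the usual pattern is to read the small files back and rewrite them with fewer partitions. The sizing logic behind that can be sketched in plain Python as a greedy grouping of files into batches near the target size (the file sizes here are made up for illustration, echoing the ~15 KB files mentioned in the thread):

```python
TARGET = 128 * 1024 * 1024  # 128 MB per output file, per the thread's guidance

def plan_compaction(file_sizes, target=TARGET):
    """Greedily group small files (by index) into batches whose total size stays near the target."""
    batches, current, current_size = [], [], 0
    for i, size in enumerate(file_sizes):
        # start a new batch once adding this file would overshoot the target
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        batches.append(current)
    return batches

# a thousand ~15 KB files collapse into a single ~15 MB output file
small_files = [15 * 1024] * 1000
print(len(plan_compaction(small_files)))  # 1
```

In Spark itself the rewrite is roughly spark.read.parquet(in_path).repartition(n).write.parquet(out_path), with n derived from total size divided by target file size; Spark then handles the grouping internally, so the sketch above only illustrates why n comes out so much smaller than the original file count.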