Re: Spark app write too many small parquet files

2016-12-08 Thread Miguel Morales
Try to coalesce with a value of 2 or so. You could dynamically calculate how many partitions to have to obtain an optimal file size.

> On Dec 8, 2016, at 1:03 PM, Kevin Tran wrote:
>
> How many partition should it be when streaming? - As in streaming
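The dynamic calculation suggested in this reply could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `target_partitions` is hypothetical, the 128 MB default is simply the file size discussed elsewhere in this thread, and it assumes you already have some estimate of the output size in bytes.

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed target: 128 MB per output file


def target_partitions(estimated_output_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Return how many partitions (and therefore output files) to aim for.

    Hypothetical helper: the caller supplies the size estimate as an integer.
    Always returns at least 1 so that a tiny batch still produces one file.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    if estimated_output_bytes < 0:
        raise ValueError("estimated_output_bytes must not be negative")
    # Integer ceiling division: round up so no file exceeds the target size.
    needed = (estimated_output_bytes + target_file_bytes - 1) // target_file_bytes
    return max(1, needed)


# With an existing Spark DataFrame `df` (not created here), the result
# would be used along these lines:
#   df.coalesce(target_partitions(estimated_bytes)).write.parquet(output_path)
```

Note that `coalesce` can only lower the partition count. If the DataFrame has fewer partitions than the computed value, `repartition` would be needed instead, at the cost of a shuffle.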

Re: Spark app write too many small parquet files

2016-12-08 Thread Kevin Tran
How many partitions should there be when streaming? In a streaming process the data keeps growing in size. Is there any configuration to limit the file size and write to a new file once it exceeds x (let's say 128 MB per file)? Another question is about performance when querying these parquet files.
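On the question of a per-file limit: as far as I know, Spark releases newer than this thread (2.2 onward) accept a `maxRecordsPerFile` write option, which caps rows per file rather than bytes. A rough way to turn a byte target into a row cap is sketched below. The helper name `max_records_per_file` and the average row size are assumptions for illustration, and the row size should be the on-disk (compressed) size for the estimate to be meaningful.

```python
def max_records_per_file(target_file_bytes, avg_row_bytes):
    """Estimate a row cap that keeps each file near target_file_bytes.

    Hypothetical helper: avg_row_bytes is an estimate the caller must
    measure, for example from the size and row count of an existing file.
    """
    if target_file_bytes <= 0 or avg_row_bytes <= 0:
        raise ValueError("both sizes must be positive")
    # Round down so files stay at or under the target; never return 0.
    return max(1, target_file_bytes // avg_row_bytes)


# With an existing Spark DataFrame `df` (not created here):
#   cap = max_records_per_file(128 * 1024 * 1024, 200)
#   df.write.option("maxRecordsPerFile", cap).parquet(output_path)
```

This only splits files that would otherwise be too large. It does not merge small files, so it does not by itself fix the many-small-files problem described in this thread.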

Re: Spark app write too many small parquet files

2016-11-28 Thread Chin Wei Low
Try limiting the number of partitions with spark.sql.shuffle.partitions. This controls the number of files generated.

On 28 Nov 2016 8:29 p.m., "Kevin Tran" wrote:
> Hi Denny,
> Thank you for your inputs. I also use 128 MB but still too many files
> generated by Spark app which is only ~14 KB
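A minimal sketch of applying this setting from PySpark follows. The function name `limit_shuffle_partitions` is hypothetical, and `spark` is assumed to be an existing SparkSession. Keep in mind that the setting only affects stages that shuffle (joins, aggregations and the like); a job that writes without shuffling produces one file per partition of the DataFrame being written, regardless of this value.

```python
def limit_shuffle_partitions(spark, num_partitions):
    """Set spark.sql.shuffle.partitions on an existing SparkSession.

    Hypothetical helper: `spark` is assumed to be a SparkSession created
    elsewhere. Returns the value read back from the session config.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Pass the value as a string, which every Spark 2.x release accepts.
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
    return spark.conf.get("spark.sql.shuffle.partitions")


# Usage with a real session (not created here):
#   limit_shuffle_partitions(spark, 8)
```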

Re: Spark app write too many small parquet files

2016-11-28 Thread Kevin Tran
Hi Denny,

Thank you for your input. I also use 128 MB, but the Spark app still generates too many files, each only ~14 KB! That's why I'm asking whether there is a solution for this, in case someone has had the same issue.

Cheers,
Kevin.

On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee

Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should try to have larger data sizes due to the overhead of opening up files. Typical guidance is between 64 MB and 1 GB; personally I usually stick with 128 MB to 512 MB, with parquet's default snappy compression codec. A good reference is Vida Ha's presentation Data Storage

Spark app write too many small parquet files

2016-11-27 Thread Kevin Tran
Hi Everyone, Does anyone know the best practice for writing parquet files from Spark? When my Spark app writes data to parquet, the output directory ends up with heaps of very small parquet files (such as e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only 15KB