Try coalescing to a small number of partitions, say 2 or so.  You could also 
calculate the number of partitions dynamically to obtain an optimal file size.
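
A rough sketch of that dynamic calculation in Scala (spark, df, inputPath and 
outputPath are placeholders, and the ~128 MB target is just an assumption; the 
parquet output will usually be smaller than the raw input because it is 
columnar and compressed):

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Rough size of the data about to be written, estimated from the input files.
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val inputBytes = fs.getContentSummary(new Path(inputPath)).getLength

  // Assumed target of roughly 128 MB per output parquet file.
  val targetFileBytes = 128L * 1024 * 1024
  val numPartitions = math.max(1, math.ceil(inputBytes.toDouble / targetFileBytes).toInt)

  // coalesce avoids a full shuffle; use repartition instead if the data is skewed.
  df.coalesce(numPartitions)
    .write
    .mode("overwrite")
    .parquet(outputPath)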

Sent from my iPhone

> On Dec 8, 2016, at 1:03 PM, Kevin Tran <kevin...@gmail.com> wrote:
> 
> How many partitions should there be when streaming? In a streaming job the 
> data keeps growing, so is there any configuration to limit the file size and 
> roll over to a new file once it exceeds a threshold (say, 128 MB per file)?
> 
> Another question is about query performance against these parquet files: what 
> is the recommended practice for file size and number of files?
> 
> How do you compact many small parquet files into a smaller number of bigger parquet files?
> 
> Thanks,
> Kevin.
> 
>> On Tue, Nov 29, 2016 at 3:01 AM, Chin Wei Low <lowchin...@gmail.com> wrote:
>> Try limiting the partitions via spark.sql.shuffle.partitions.
>> 
>> This controls the number of files generated.
>> 
>> 
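
For reference, a minimal way to apply that setting, assuming spark is the 
SparkSession (the value 8 is only an example; it should be derived from your 
data volume and target file size):

  // Number of partitions used by shuffles (joins, aggregations), which also
  // bounds the number of output files written after a shuffle. Default is 200.
  spark.conf.set("spark.sql.shuffle.partitions", "8")

  // Or at submit time:
  //   spark-submit --conf spark.sql.shuffle.partitions=8 ...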
>>> On 28 Nov 2016 8:29 p.m., "Kevin Tran" <kevin...@gmail.com> wrote:
>>> Hi Denny,
>>> Thank you for your input. I also use 128 MB, but the Spark app still 
>>> generates far too many files, each only ~14 KB! That's why I'm asking whether 
>>> there is a solution, in case someone else has hit the same issue.
>>> 
>>> Cheers,
>>> Kevin.
>>> 
>>>> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g....@gmail.com> wrote:
>>>> Generally, yes - you should aim for larger file sizes because of the 
>>>> overhead of opening files.  Typical guidance is between 64 MB and 1 GB; 
>>>> personally I usually stick with 128 MB-512 MB, with the default snappy 
>>>> codec compression for parquet.  A good reference is Vida Ha's 
>>>> presentation Data Storage Tips for Optimal Spark Performance.  
>>>> 
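
For completeness, an explicit parquet write with the snappy codec looks roughly 
like this (df and the output path are placeholders; snappy is already the 
default in Spark 2.x, so the option is shown only to make the choice visible):

  // Write parquet with snappy compression (the Spark 2.x default codec).
  df.write
    .option("compression", "snappy")
    .parquet("/path/to/output")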
>>>>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote:
>>>>> Hi Everyone,
>>>>> Does anyone know the best practice for writing parquet files from Spark?
>>>>> 
>>>>> When the Spark app writes data to parquet, the output directory ends up 
>>>>> with heaps of very small parquet files (such as 
>>>>> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only 
>>>>> 15 KB.
>>>>> 
>>>>> Should it instead write bigger chunks of data (such as 128 MB each) with a 
>>>>> sensible number of files?
>>>>> 
>>>>> Has anyone observed performance changes when varying the size of each 
>>>>> parquet file?
>>>>> 
>>>>> Thanks,
>>>>> Kevin.
>>> 
> 
