Re: Spark app write too many small parquet files

2016-12-08 Thread Miguel Morales
Try to coalesce with a value of 2 or so.  You could also calculate dynamically how 
many partitions you need in order to hit an optimal file size.

Sent from my iPhone
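
As a rough sketch of that idea (Scala; writeWithTargetFileSize, the 128 MB target
and the sample size are hypothetical, and the SizeEstimator-based estimate only
approximates the compressed size Parquet will actually produce on disk):

import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// Hypothetical helper: derive a partition count from a sampled row-size
// estimate and a target output file size, then coalesce before writing.
// SizeEstimator measures in-memory size, so this is only an approximation
// of the compressed on-disk Parquet size.
def writeWithTargetFileSize(df: DataFrame, path: String,
                            targetFileSizeBytes: Long = 128L * 1024 * 1024): Unit = {
  val sample = df.limit(1000).collect()
  val bytesPerRow =
    if (sample.isEmpty) 1L else SizeEstimator.estimate(sample) / sample.length
  val estimatedBytes = bytesPerRow * df.count()
  val numFiles = math.max(1L, estimatedBytes / targetFileSizeBytes).toInt
  df.coalesce(numFiles).write.parquet(path)
}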


Re: Spark app write too many small parquet files

2016-12-08 Thread Kevin Tran
How many partitions should there be when streaming? In a streaming job the data
keeps growing, so is there any configuration that limits file size and rolls over
to a new file once a file exceeds some threshold (say, 128 MB per file)?

Another question is about query performance against these parquet files: what
is the recommended practice for file size and number of files?

And how do you compact many small parquet files into a smaller number of bigger
ones?

Thanks,
Kevin.
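
For the compaction question, one simple pattern as a hedged sketch (the paths
and the partition count of 4 are placeholders, not values from this thread):
read the small files back, shrink the partition count, and rewrite the rows to
a fresh directory, swapping it in once the job succeeds.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-parquet").getOrCreate()

// Read the directory full of tiny part files, reduce the partition count,
// and rewrite the same rows as fewer, larger files in a new directory.
// "/data/events" and "/data/events_compacted" are placeholder paths.
val small = spark.read.parquet("/data/events")
small
  .coalesce(4)                  // pick a count that yields roughly 128 MB files
  .write
  .mode("overwrite")
  .parquet("/data/events_compacted")
// After verifying the output, swap the compacted directory in for the old one.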


Re: Spark app write too many small parquet files

2016-11-28 Thread Chin Wei Low
Try limiting the number of partitions via spark.sql.shuffle.partitions.

This controls the number of files generated.
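
A minimal illustration of that setting (the value 8 is just an example; it only
affects writes whose final stage involves a shuffle, such as joins or aggregations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fewer-shuffle-files").getOrCreate()

// Lower the shuffle partition count (default 200) so that the final stage of a
// shuffled query emits fewer tasks, and therefore fewer part files on write.
spark.conf.set("spark.sql.shuffle.partitions", "8")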


Re: Spark app write too many small parquet files

2016-11-28 Thread Kevin Tran
Hi Denny,
Thank you for your input. I also target 128 MB, but the Spark app still generates
too many files, each only ~14 KB! That's why I'm asking whether there is a
solution, in case someone has hit the same issue.

Cheers,
Kevin.


Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should aim for larger data sizes because of the overhead of
opening each file. Typical guidance is between 64 MB and 1 GB; personally I
usually stick with 128-512 MB with the default snappy codec compression for
parquet. A good reference is Vida Ha's presentation, Data Storage Tips for
Optimal Spark Performance.
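
A hedged illustration of that guidance (snappy is already Spark's default
Parquet codec; the repartition count of 16 and the paths are placeholders, to
be derived from your data volume divided by the target file size):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-write").getOrCreate()
val df = spark.read.json("/data/input")   // placeholder source

// Repartition so each output task holds roughly 128-512 MB of data, then
// write Parquet with the (default) snappy codec.
df.repartition(16)
  .write
  .option("compression", "snappy")
  .parquet("/data/output")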


Spark app write too many small parquet files

2016-11-27 Thread Kevin Tran
Hi Everyone,
Does anyone know the best practice for writing parquet files from Spark?

When the Spark app writes data to parquet, the output directory ends up with
heaps of very small parquet files (such as
e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet), each only about 15 KB.

Should it instead write bigger chunks of data (such as 128 MB each) across a
sensible number of files?

Has anyone observed performance changes when changing the size of each parquet
file?

Thanks,
Kevin.