How many partitions should there be when streaming? In a streaming process
the data keeps growing in size, so is there any configuration to limit the
file size and roll over to a new file once it exceeds a threshold (let's say
128 MB per file)?
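
For example, is something along these lines possible? (Just a sketch of what
I mean, assuming Spark 2.x APIs; the values are placeholders, and I am not
sure maxRecordsPerFile is available in my version - and it caps rows rather
than bytes, so 128 MB could only be approximated from the average row size.)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("streaming-parquet-writer")
    // fewer shuffle partitions => fewer, larger output files per write
    .config("spark.sql.shuffle.partitions", "8")
    // row cap per output file (placeholder value; no byte cap that I know of)
    .config("spark.sql.files.maxRecordsPerFile", "1000000")
    .getOrCreate()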

Another question is about query performance against these parquet files.
What is the recommended practice for file size and number of files?

How do I compact many small parquet files into a smaller number of bigger
parquet files?
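
Would a periodic job along these lines be the usual approach? (Only a sketch;
the paths and the target partition count are placeholders.)

  // `spark` is the SparkSession from the sketch above.
  // Re-read the directory of small files and rewrite it as a few larger files.
  val small = spark.read.parquet("/data/events/2016-11-28")   // hypothetical input path
  val target = 8                                              // e.g. totalBytes / (128 * 1024 * 1024)
  small.coalesce(target)
    .write
    .mode("overwrite")
    .parquet("/data/events_compacted/2016-11-28")             // hypothetical output path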

Thanks,
Kevin.

On Tue, Nov 29, 2016 at 3:01 AM, Chin Wei Low <lowchin...@gmail.com> wrote:

> Try limiting the partitions: spark.sql.shuffle.partitions
>
> This controls the number of files generated.
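>
> For example (sketch only; "df" and the output path are placeholders):
>
>   // Fewer shuffle partitions means fewer output files per write:
>   spark.conf.set("spark.sql.shuffle.partitions", "8")
>   // Or, without changing the global setting, repartition right before the write:
>   df.repartition(8).write.parquet("/path/out")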
>
> On 28 Nov 2016 8:29 p.m., "Kevin Tran" <kevin...@gmail.com> wrote:
>
>> Hi Denny,
>> Thank you for your input. I also use 128 MB, but the Spark app still
>> generates too many files of only ~14 KB each! That's why I'm asking whether
>> there is a solution, in case someone else has hit the same issue.
>>
>> Cheers,
>> Kevin.
>>
>> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g....@gmail.com> wrote:
>>
>>> Generally, yes - you should try to have larger data sizes due to the
>>> overhead of opening up files.  Typical guidance is between 64 MB and 1 GB;
>>> personally I usually stick with 128 MB-512 MB with the default snappy
>>> codec compression for parquet.  A good reference is Vida Ha's presentation
>>> Data Storage Tips for Optimal Spark Performance
>>> <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>.
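>>>
>>> A rough sketch of what I mean (the partition count is just a placeholder
>>> you would tune so each output file lands in that size range; "compression"
>>> is the standard parquet writer option):
>>>
>>>   df.repartition(16)
>>>     .write
>>>     .option("compression", "snappy")
>>>     .parquet("/path/to/output")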
>>>
>>>
>>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote:
>>>
>>>> Hi Everyone,
>>>> Does anyone know the best practice for writing parquet files from
>>>> Spark?
>>>>
>>>> When the Spark app writes data to parquet, the output directory ends up
>>>> with heaps of very small parquet files (such as
>>>> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is
>>>> only ~15 KB.
>>>>
>>>> Should it instead write bigger chunks of data (such as 128 MB) with an
>>>> appropriate number of files?
>>>>
>>>> Has anyone observed performance changes when changing the data size of
>>>> each parquet file?
>>>>
>>>> Thanks,
>>>> Kevin.
>>>>
>>>
>>
