Try limiting the partitions: spark.sql.shuffle.partitions

This controls the number of files generated.
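
For example (a minimal sketch, not from the thread; the paths and partition
counts below are made up), you can either lower spark.sql.shuffle.partitions
or coalesce just before the write:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-file-count")
  // Lower the shuffle partition count (default is 200); each partition
  // that reaches the writer typically becomes one output file.
  .config("spark.sql.shuffle.partitions", "16")
  .getOrCreate()

val df = spark.read.json("/data/input")      // hypothetical input path

// Alternatively, collapse partitions right before writing so the data
// lands in a handful of larger files instead of many ~15 KB ones.
df.coalesce(8)
  .write
  .option("compression", "snappy")           // snappy is the parquet default
  .parquet("/data/output")                   // hypothetical output path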

On 28 Nov 2016 8:29 p.m., "Kevin Tran" <kevin...@gmail.com> wrote:

> Hi Denny,
> Thank you for your inputs. I also use 128 MB, but there are still too many
> files generated by the Spark app, each only ~14 KB! That's why I'm asking
> whether there is a solution, in case someone else has the same issue.
>
> Cheers,
> Kevin.
>
> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g....@gmail.com> wrote:
>
>> Generally, yes - you should try to have larger data sizes due to the
>> overhead of opening up files.  Typical guidance is between 64MB-1GB;
>> personally I usually stick with 128MB-512MB with the default snappy
>> codec compression for parquet.  A good reference is Vida Ha's
>> presentation Data Storage Tips for Optimal Spark Performance
>> <https://spark-summit.org/2015/events/data-storage-tips-for-optimal-spark-performance/>.
>>
>>
>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran <kevin...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>> Does anyone know what the best practice is for writing parquet files
>>> from Spark?
>>>
>>> As the Spark app writes data to parquet, it turns out that under that
>>> directory there are heaps of very small parquet files (such as
>>> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is
>>> only 15 KB.
>>>
>>> Should it write each chunk with a bigger data size (such as 128 MB) and
>>> a proper number of files?
>>>
>>> Has anyone found any performance change when changing the data size of
>>> each parquet file?
>>>
>>> Thanks,
>>> Kevin.
>>>
>>
>