Re: Parquet write optimization by row group size config

Akhil Das Wed, 20 Jan 2016 22:04:51 -0800

It would help if you could share the code. Someone here or I can guide you
better once you post the code snippet.
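
In the meantime, here is a minimal sketch of the two knobs discussed in this
thread, not a tested fix for your job. It assumes PySpark with an existing
SparkContext `sc` and a DataFrame `df`. The helper names, the factor of two
tasks per core, and the 32 MB row group size are illustrative assumptions.
`parquet.block.size` is the Hadoop property for the Parquet row group size in
bytes. As far as I know, each partition is written by one task into its own
file, so the partition count is what mainly governs write parallelism.

```python
def partitions_for_write(total_cores, tasks_per_core=2):
    """Pick a partition count so that every core gets work during the write.

    tasks_per_core=2 is an assumed rule of thumb, not a Spark requirement.
    """
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be >= 1")
    return total_cores * tasks_per_core


def write_hourly_parquet(sc, df, path, total_cores,
                         row_group_bytes=32 * 1024 * 1024):
    """Hypothetical helper: repartition, then write Parquet.

    sc   -- an existing SparkContext
    df   -- the DataFrame holding one hour of data
    path -- output directory (for example on the NFS mount)
    """
    # Row group size in bytes, set through the Hadoop configuration.
    # Note: sc._jsc is a private PySpark attribute, used here for brevity.
    sc._jsc.hadoopConfiguration().setInt("parquet.block.size", row_group_bytes)

    # One write task per partition, so spread the data over all cores first.
    num_partitions = partitions_for_write(total_cores)
    df.repartition(num_partitions).write.parquet(path)
    return num_partitions
```

With 8 cores, `partitions_for_write(8)` gives 16, so the write would run as 16
tasks and produce roughly one part file per partition.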


Thanks
Best Regards

On Wed, Jan 20, 2016 at 10:54 PM, Pavel Plotnikov <
pavel.plotni...@team.wrike.com> wrote:

> Thanks, Akhil! It helps, but these jobs are still not fast enough. Maybe I
> missed something.
>
> Regards,
> Pavel
>
> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Did you try re-partitioning the data before doing the write?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <
>> pavel.plotni...@team.wrike.com> wrote:
>>
>>> Hello,
>>> I'm using Spark on several machines in standalone mode; data storage is
>>> mounted on these machines via NFS. I have an input data stream, and when I
>>> try to store all the data for an hour in Parquet, the job executes mostly on
>>> one core and the hourly data takes 40-50 minutes to store. It is very slow!
>>> And it is not an IO problem. After researching how Parquet files work, I
>>> found that writing can be parallelized at the row group level.
>>> I think the row group for my files is too large. How can I change it?
>>> When I create a very large DataFrame, it divides into parts very well and
>>> writes quickly!
>>>
>>> Thanks,
>>> Pavel
>>>
>>
>>
