Thanks, Akhil! It helps, but the job is still not fast enough; maybe I missed something.
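For reference, here is roughly what the write looks like now after re-partitioning, plus the row-group change I am considering (a minimal sketch for spark-shell, not the real job code: df, the partition count, the 32 MB value, and the output path are all placeholders, and I am not certain parquet.block.size is the right setting):

    // spread the hour's data across the cluster's cores before writing
    val hourly = df.repartition(48)

    // if parquet.block.size really controls the row-group size,
    // would shrinking it like this improve write parallelism?
    sc.hadoopConfiguration.set("parquet.block.size", (32 * 1024 * 1024).toString)

    hourly.write.parquet("/mnt/nfs/parquet/2016-01-20/hour=09")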
Regards,
Pavel

On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Did you try re-partitioning the data before doing the write?
>
> Thanks
> Best Regards
>
> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <pavel.plotni...@team.wrike.com> wrote:
>
>> Hello,
>> I'm using Spark on several machines in standalone mode; the data storage is mounted on these machines via NFS. I have an input data stream, and when I try to store an hour's worth of data in Parquet, the job runs mostly on one core, and writing the hourly data takes 40-50 minutes. That is very slow, and it is not an I/O problem. After researching how the Parquet format works, I found that writing can be parallelized at the row-group level.
>> I think the row group for my files is too large; how can I change it? When I create a very big DataFrame, it is divided into parts very well and writes quickly!
>>
>> Thanks,
>> Pavel