What is your data size, the algorithm, and the expected time? Depending on those, the group can recommend optimizations or tell you whether your expectations are realistic.
> On 20 Jan 2016, at 18:24, Pavel Plotnikov <pavel.plotni...@team.wrike.com> wrote:
>
> Thanks, Akhil! It helps, but the job is still not fast enough; maybe I missed something.
>
> Regards,
> Pavel
>
>> On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com> wrote:
>> Did you try re-partitioning the data before doing the write?
>>
>> Thanks
>> Best Regards
>>
>>> On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov <pavel.plotni...@team.wrike.com> wrote:
>>> Hello,
>>> I'm using Spark on some machines in standalone mode; the data storage is mounted on these machines via NFS. I have an input data stream, and when I try to store an hour's worth of data in Parquet, the job executes mostly on one core and the hourly data takes 40-50 minutes to write. That is very slow, and it is not an I/O problem. After researching how Parquet files work, I found that writing can be parallelized at the row-group abstraction level.
>>> I think the row group for my files is too large; how can I change it?
>>> When I create a big enough DataFrame, it divides into parts very well and writes quickly!
>>>
>>> Thanks,
>>> Pavel
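
For reference, a minimal sketch of what Akhil's repartition suggestion plus a smaller Parquet row group could look like (Spark 1.6-era DataFrame API; the input/output paths, partition count, and 64 MB parquet.block.size are illustrative assumptions, not details from Pavel's job):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("hourly-parquet-write"))
val sqlContext = new SQLContext(sc)

// Parquet row group size is the Hadoop property "parquet.block.size" (bytes).
// Smaller row groups give readers more parallel units; 64 MB here instead of
// the 128 MB default is an assumption to tune, not a recommendation.
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)

// Hypothetical hourly input; replace with however the stream is materialized.
val hourlyDf = sqlContext.read.json("/mnt/nfs/input/2016-01-20-18/*.json")

// Repartition before the write so each partition is written by its own task
// as its own Parquet file, instead of one task doing most of the work.
hourlyDf
  .repartition(32) // e.g. roughly the total cores, or sized for ~128 MB files
  .write
  .parquet("/mnt/nfs/output/hourly/2016-01-20-18")
```

Note that the write itself parallelizes per DataFrame partition (one file per task), so the repartition is usually the bigger lever for the "runs on one core" symptom; the row group size mainly affects how much parallelism readers get within each file.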