Re: Parquet write optimization by row group size config

2016-01-21 Thread Pavel Plotnikov
I get about 25 separate gzipped log files per hour. File sizes vary widely, from 10MB to 50MB of gzipped JSON data. So I convert this data to Parquet each hour. The Python code is very simple: text_file = sc.textFile(src_file) df = sqlCtx.jsonRDD(text_file.map(lambda x:
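
The snippet above is truncated by the archive. A minimal sketch of the hourly gzip-JSON to Parquet conversion being described, with assumed paths and no attempt to reconstruct the original lambda body, might look like:

# Sketch only: paths are hypothetical and the original per-line mapping is not
# preserved in the archive, so the raw JSON lines are passed to jsonRDD directly.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="hourly-json-to-parquet")
sqlCtx = SQLContext(sc)

src_file = "/mnt/nfs/logs/2016-01-21/14/*.json.gz"   # hypothetical input glob
dst_dir = "/mnt/nfs/parquet/2016-01-21/14"           # hypothetical output dir

# Gzipped files are not splittable, so each input file becomes one partition.
text_file = sc.textFile(src_file)
df = sqlCtx.jsonRDD(text_file)
df.write.parquet(dst_dir)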

Re: Parquet write optimization by row group size config

2016-01-20 Thread Akhil Das
It would be good if you could share the code; someone here or I can guide you better if you post the code snippet. Thanks Best Regards On Wed, Jan 20, 2016 at 10:54 PM, Pavel Plotnikov < pavel.plotni...@team.wrike.com> wrote: > Thanks, Akhil! It helps, but this job is still not fast enough,

Re: Parquet write optimization by row group size config

2016-01-20 Thread Jörn Franke
What is your data size, the algorithm, and the expected time? Depending on this, the group can recommend optimizations or tell you that the expectations are wrong. > On 20 Jan 2016, at 18:24, Pavel Plotnikov > wrote: > > Thanks, Akhil! It helps, but this job

Re: Parquet write optimization by row group size config

2016-01-20 Thread Pavel Plotnikov
Thanks, Akhil! It helps, but this job is still not fast enough; maybe I missed something. Regards, Pavel On Wed, Jan 20, 2016 at 9:51 AM Akhil Das wrote: > Did you try re-partitioning the data before doing the write? > > Thanks > Best Regards > > On Tue, Jan 19, 2016

Re: Parquet write optimization by row group size config

2016-01-19 Thread Akhil Das
Did you try re-partitioning the data before doing the write? Thanks Best Regards On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov < pavel.plotni...@team.wrike.com> wrote: > Hello, > I'm using Spark on several machines in standalone mode; data storage is > mounted on these machines via NFS. I have
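
A minimal sketch of this suggestion, assuming a DataFrame df and output path dst_dir as in the conversion sketch above; the partition count of 48 is an arbitrary assumption and should roughly match the cluster's total cores:

# Repartition before writing so the Parquet output is produced by many tasks
# rather than the few partitions created by non-splittable gzipped inputs.
df.repartition(48).write.parquet(dst_dir)

# If the goal is fewer, larger output files, coalesce avoids a full shuffle:
# df.coalesce(8).write.parquet(dst_dir)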

Parquet write optimization by row group size config

2016-01-19 Thread Pavel Plotnikov
Hello, I'm using Spark on several machines in standalone mode; data storage is mounted on these machines via NFS. I have an input data stream, and when I try to store an hour's worth of data in Parquet, the job executes mostly on one core and the hourly data takes 40-50 minutes to write. It is very slow!
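
For context, the row group size in the thread subject corresponds to Parquet's parquet.block.size setting, which the Parquet writer reads from the Hadoop configuration. A sketch of setting it from PySpark (the 128 MB value is only illustrative, not a recommendation from this thread):

# Set the Parquet row group size via the Hadoop configuration; 128 MB here is
# just an example value. sc._jsc is the underlying JavaSparkContext handle.
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", 128 * 1024 * 1024)

# The same setting can be passed at submit time:
# spark-submit --conf spark.hadoop.parquet.block.size=134217728 ...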