RE: how to save spark files as parquets efficiently

2016-07-29 Thread Ewan Leith
… Thanks, Ewan

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Hey Gourav, Well, I think it is my execution plan that is at fault. Basically, since df.write is an action, the Spark job shown at localhost:4040 will include the time taken by all the umpteen transformations feeding it, right? All I wanted to know is "what apt env/config params are needed to
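One way to separate the two costs is to materialize the DataFrame before timing the write, so the save job no longer folds the transformation time in. A minimal sketch of that idea (the app name, example DataFrame, and output path are placeholders, not from the thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("time-parquet-write").getOrCreate()

    // Placeholder for the real DataFrame and its umpteen upstream transformations.
    val df = spark.range(10000000L).toDF("id").filter("id % 2 = 0")

    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()   // action: runs and caches the transformations now

    val t0 = System.nanoTime()
    df.write.parquet("/tmp/parquet_out")   // placeholder path; this now times mostly the write itself
    println(f"parquet write took ${(System.nanoTime() - t0) / 1e9}%.1f s")

    df.unpersist()

With the input cached, the job for df.write in the UI reflects the Parquet write alone rather than the whole plan.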

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Gourav Sengupta
Hi, The default write format in Spark is Parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron? Regards, Gourav Sengupta On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna
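For reference, a minimal sketch of the default Gourav mentions: the data source is governed by spark.sql.sources.default, which is "parquet" out of the box, so a plain save() and an explicit parquet() call produce the same files (the session, DataFrame, and paths below are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("default-parquet").getOrCreate()
    val df = spark.range(1000000L).toDF("id")   // placeholder DataFrame

    // spark.sql.sources.default defaults to "parquet", so these are equivalent:
    df.write.save("/tmp/out_default")       // uses the default source
    df.write.parquet("/tmp/out_explicit")   // names parquet explicitly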

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Hey, So I believe this is the right format in which to save the file; the optimization is never in the write part itself but in the head/body of my execution plan, isn't it? Thanks, On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna wrote: > Hey, > > master=yarn > mode=cluster
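Since the point here is that the cost lives in the plan feeding the write rather than in the write itself, the "head/body" can be inspected with explain(), and a long, expensive lineage can be cut in two by staging an intermediate Parquet write. A minimal sketch (df, spark, and the staging path are placeholders):

    // df is a placeholder for the DataFrame being saved.
    df.explain(true)   // prints the parsed, analyzed, optimized, and physical plans

    // Staging to Parquet truncates the lineage: later writes only pay for
    // transformations applied after this point.
    df.write.parquet("/tmp/staged")                 // placeholder path
    val staged = spark.read.parquet("/tmp/staged")  // plan now starts from the staged files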