If you replace the df.write … with df.count() in your code, you'll see how much time is taken to process the full execution plan without the write output. The code below looks perfectly normal for writing a parquet file, so there shouldn't be any tuning needed for "normal" performance.

Thanks,
Ewan

From: Sumit Khanna [mailto:sumit.kha...@askme.in]
Sent: 29 July 2016 13:41
To: Gourav Sengupta <gourav.sengu...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: how to save spark files as parquets efficiently

Hey Gourav,

Well, I think it is my execution plan that is at fault. Since df.write is an action, the Spark job shown on localhost:4040/ will include the time taken for all the many transformations before it, right? All I wanted to know is: what env/config params are apt for something as simple as reading a dataframe from parquet and saving it back as another parquet (i.e. a vanilla load/store with no transformations)? Is it good enough to simply read and write in the form mentioned in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ?

Thanks,

On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?

Regards,
Gourav Sengupta

On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in> wrote:

Hey,

master=yarn
mode=cluster
spark.executor.memory=8g
spark.rpc.netty.dispatcher.numThreads=2

All the POC is on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:

df.write.format("parquet").mode("overwrite").save(hdfspathTemp)

No doubt, the whole execution plan gets triggered on this write / save action.
But is it the right command / set of params to save a dataframe? Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure whether the write itself takes that much time or whether some optimization is needed for the upsert (I have asked that as another question altogether).

Thanks,
Sumit