If you replace the df.write … with df.count() in your code, you'll see how much time is taken to process the full execution plan without the write output. The code below looks perfectly normal for writing a parquet file, so there shouldn't be any tuning needed for "normal" performance.

Thanks,
Ewan

From: Sumit Khanna [mailto:sumit.kha...@askme.in]
Sent: 29 July 2016 13:41
To: Gourav Sengupta <gourav.sengu...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: how to save spark files as parquets efficiently

Hey Gourav,

Well, I think it is my execution plan that is at fault. Since df.write is an action, the Spark job shown on localhost:4040/ will include the time taken for all the many transformations before it, right? All I wanted to know is: what env/config params are apt for something as simple as reading a dataframe from parquet and saving it back as another parquet (i.e. a vanilla load/store with no transformations)? Is it good enough to simply read and write in the form mentioned in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ?

Thanks,

On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

The default write format in Spark is parquet, and I have never faced any issues writing over a billion records in Spark. Are you using virtualization by any chance, or an obsolete hard disk, or maybe an Intel Celeron?

Regards,
Gourav Sengupta

On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in> wrote:

Hey,

master=yarn
mode=cluster
spark.executor.memory=8g
spark.rpc.netty.dispatcher.numThreads=2

All the POC is on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:

df.write.format("parquet").mode("overwrite").save(hdfspathTemp)

No doubt, the whole execution plan gets triggered on this write / save action.
But is it the right command / set of params to save a dataframe? Essentially I am doing an upsert: pulling in data from HDFS and then updating it with the delta changes of the current run. But I'm not sure whether the write itself takes that much time or whether some optimization is needed for the upsert (I have asked that as another question altogether).

Thanks,
Sumit