Hey Gourav,

Well, I think it is my execution plan that is at fault. Since df.write is an action, the Spark job shown on localhost:4040/ will include the time taken by all the many transformations leading up to it, right?

All I really wanted to know is what env/config params are needed for something simple: read a dataframe from parquet and save it back as another parquet (i.e. a vanilla load/store, no transformations). Is it good enough to simply read and write in the form shown in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ?
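To be concrete, the whole job I have in mind is little more than the sketch below: read the parquet in and write it straight back out, so the timing reflects only I/O. The path values here are placeholders, and spark is just the usual SparkSession:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ParquetCopy")
      .getOrCreate()

    // Placeholder HDFS paths
    val hdfspathSource = "hdfs:///path/to/source"   // placeholder
    val hdfspathTemp   = "hdfs:///path/to/temp"     // placeholder

    // Plain load/store: no transformations between the read and the write
    val df = spark.read.parquet(hdfspathSource)

    df.write
      .format("parquet")
      .mode("overwrite")
      .save(hdfspathTemp)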
df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ?? Thanks, On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > Hi, > > The default write format in SPARK is parquet. And I have never faced any > issues writing over a billion records in SPARK. Are you using > virtualization by any chance or an obsolete hard disk or Intel Celeron may > be? > > Regards, > Gourav Sengupta > > On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in> > wrote: > >> Hey, >> >> master=yarn >> mode=cluster >> >> spark.executor.memory=8g >> spark.rpc.netty.dispatcher.numThreads=2 >> >> All the POC on a single node cluster. the biggest bottle neck being : >> >> 1.8 hrs to save 500k records as a parquet file/dir executing this command >> : >> >> df.write.format("parquet").mode("overwrite").save(hdfspathTemp) >> >> >> No doubt, the whole execution plan gets triggered on this write / save >> action. But is it the right command / set of params to save a dataframe? >> >> essentially I am doing an upsert by pulling in data from hdfs and then >> updating it with the delta changes of the current run. But not sure if >> write itself takes that much time or some optimization is needed for >> upsert. (I have that asked as another question altogether). >> >> Thanks, >> Sumit >> >> >