Hey,

master=yarn
mode=cluster

spark.executor.memory=8g
spark.rpc.netty.dispatcher.numThreads=2

This is all for a POC on a single-node cluster. The biggest bottleneck is:

It takes 1.8 hrs to save 500k records as a Parquet file/directory when executing this command:

df.write.format("parquet").mode("overwrite").save(hdfspathTemp)


No doubt the whole execution plan gets triggered by this write/save
action. But is this the right command and set of parameters for saving a DataFrame?
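
To check whether the write itself is the slow part, I am thinking of forcing the plan to run before timing the save, roughly like this (just a rough sketch, using the same df and hdfspathTemp as above from a spark-shell session):

// Materialize the plan first so the write timing excludes upstream computation.
df.cache()
df.count()  // forces the full execution plan to run and caches the result

val t0 = System.nanoTime()
df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
println(s"pure write time: ${(System.nanoTime() - t0) / 1e9} s")

If the count takes most of the 1.8 hrs, the problem is upstream of the write.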

Essentially I am doing an upsert: I pull the existing data from HDFS and
then update it with the delta changes from the current run. But I am not
sure whether the write itself really takes that long or whether the
upsert logic needs optimization. (I have asked about that as a separate question.)
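
For context, the upsert part looks roughly like this (a simplified sketch; the path, the deltas DataFrame, and the "id"/"updated_at" columns are placeholders for my actual names):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// base: what is already on HDFS; deltas: this run's changes (same schema).
val base   = spark.read.parquet("/data/current")  // placeholder path
val w      = Window.partitionBy("id").orderBy(col("updated_at").desc)
val df     = base.union(deltas)
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)                        // keep newest row per key
  .drop("rn")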

Thanks,
Sumit
