Hey,

master=yarn
mode=cluster

spark.executor.memory=8g
spark.rpc.netty.dispatcher.numThreads=2

This is all for a POC on a single-node cluster. The biggest bottleneck is:

It takes 1.8 hrs to save 500k records as a Parquet file/directory when executing this command:

df.write.format("parquet").mode("overwrite").save(hdfspathTemp)


No doubt the whole execution plan gets triggered by this write/save
action. But is this the right command and set of parameters for saving a DataFrame?
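
To check whether the write itself is the slow part, I am thinking of forcing the plan to run before timing the save, roughly like this (just a rough sketch, using the same df and hdfspathTemp as above from a spark-shell session):

// Materialize the plan first so the write timing excludes upstream computation.
df.cache()
df.count()  // forces the full execution plan to run and caches the result

val t0 = System.nanoTime()
df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
println(s"pure write time: ${(System.nanoTime() - t0) / 1e9} s")

If the count takes most of the 1.8 hrs, the problem is upstream of the write.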

Essentially I am doing an upsert: I pull the existing data from HDFS and
then update it with the delta changes from the current run. But I am not
sure whether the write itself really takes that long or whether the
upsert logic needs optimization. (I have asked about that as a separate question.)
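
For context, the upsert part looks roughly like this (a simplified sketch; the path, the deltas DataFrame, and the "id"/"updated_at" columns are placeholders for my actual names):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// base: what is already on HDFS; deltas: this run's changes (same schema).
val base   = spark.read.parquet("/data/current")  // placeholder path
val w      = Window.partitionBy("id").orderBy(col("updated_at").desc)
val df     = base.union(deltas)
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)                        // keep newest row per key
  .drop("rn")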

Thanks,
Sumit
