Hey,

master=yarn mode=cluster
spark.executor.memory=8g spark.rpc.netty.dispatcher.numThreads=2

The whole POC runs on a single-node cluster. The biggest bottleneck: it takes 1.8 hours to save 500k records as a Parquet file/dir with this command:

df.write.format("parquet").mode("overwrite").save(hdfspathTemp)

No doubt the whole execution plan gets triggered on this write/save action. But is this the right command / set of parameters to save a DataFrame?

Essentially I am doing an upsert: pulling in existing data from HDFS and then updating it with the delta changes of the current run. I am not sure whether the write itself takes that much time or whether the upsert needs optimization (I have asked that as a separate question).

Thanks,
Sumit
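For reference, the upsert I'm describing boils down to something like the sketch below, written in plain Python rather than Spark so the logic is easy to see. The record shape and the key field "id" are placeholders, not my actual schema:

```python
# Minimal, Spark-free sketch of an upsert: merge delta records into the
# base set, keyed by a unique "id" field (placeholder name).

def upsert(base, deltas):
    """Return base records with deltas applied.

    Records whose id already exists are replaced by the delta version;
    records with new ids are appended.
    """
    merged = {rec["id"]: rec for rec in base}  # index current state by key
    for rec in deltas:                         # apply this run's changes
        merged[rec["id"]] = rec
    return list(merged.values())

base = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
deltas = [{"id": 2, "val": "b2"}, {"id": 3, "val": "c"}]
print(upsert(base, deltas))
```

In Spark the same effect needs a join (or union plus dedup by key) between the HDFS DataFrame and the delta DataFrame, which is where I suspect the time may actually be going rather than in the Parquet write itself.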