Hello all,

I have a Spark job that reads Parquet data and partitions it based on one of the columns. I made sure the partitions are equally distributed and not skewed. My code looks like this:
datasetA.write.partitionBy("column1").parquet(outputPath)

Execution plan: (inline screenshot of the execution plan omitted)

All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins to close the application. I am not sure what Spark is doing after all tasks have been processed successfully. I checked the thread dumps (using the UI executor tab) on a few executors but couldn't find anything major. Overall, a few shuffle-client threads are "RUNNABLE" and a few dispatched-* threads are "WAITING".

Please let me know what Spark is doing at this stage (after all tasks have finished) and whether there is any way I can optimize it.

Thanks,
Swapnil
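For context, here is a fuller sketch of the job. Everything other than "column1" and the write call is a placeholder (paths, app name), and the committer setting is only something I am considering trying, not something I have confirmed helps; my understanding is that the default v1 FileOutputCommitter renames all task outputs serially at job commit, which could explain a long pause after the last task finishes.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder paths -- the real ones are not shown here.
val inputPath  = "/data/in"
val outputPath = "/data/out"

val spark = SparkSession.builder()
  .appName("partitioned-parquet-write") // placeholder name
  // Candidate tweak (unconfirmed): v2 commit algorithm moves task
  // output into place as each task commits, instead of renaming
  // ~12,000 task outputs one by one on the driver at job commit.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

val datasetA = spark.read.parquet(inputPath)

// The write in question: one output directory per value of column1.
datasetA.write
  .partitionBy("column1")
  .parquet(outputPath)
```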