Hello all
   I have a Spark job that reads parquet data and partitions it based on one
of the columns. I made sure the partitions are equally distributed and not
skewed. My code looks like this -

datasetA.write.partitionBy("column1").parquet(outputPath)
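
In case it helps, a minimal self-contained version of the job looks roughly
like this (the paths, app name, and session setup are placeholders, not my
exact code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("PartitionedParquetWrite")  // placeholder app name
      .getOrCreate()

    // Read the source parquet data (placeholder input path)
    val datasetA = spark.read.parquet("/data/input")

    // Write one output sub-directory per distinct value of column1
    datasetA.write
      .partitionBy("column1")
      .parquet("/data/output")  // placeholder output path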

Execution plan - [screenshot of the Spark UI execution plan, not reproduced
here]

All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins
for the application to close. I am not sure what Spark is doing after all
tasks have completed successfully.
I checked the thread dumps (via the Executors tab in the UI) on a few
executors but couldn't find anything major. Overall, a few shuffle-client
threads are "RUNNABLE" and a few dispatcher-* threads are "WAITING".

Please let me know what Spark is doing at this stage (after all tasks have
finished) and whether there is any way I can optimize it.
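
One setting I came across that is supposed to affect this post-task commit
phase is the Hadoop FileOutputCommitter algorithm version, but I have not
confirmed it applies to my setup. As a sketch, setting it to v2 before the
write:

    // Assumption: this applies to the Hadoop FileOutputCommitter used by
    // the parquet writer. Algorithm v2 renames task output into the final
    // directory at task commit time, instead of a single serial rename
    // pass on the driver at job commit.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Would that be the right lever here, or is the time going somewhere else?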

Thanks
Swapnil
