Did you enable Spark's fault-tolerance mechanism (RDD checkpointing)? If so, Spark launches a separate job at the end of the main job to write the checkpointed data to the file system, so the persisted data is highly available. That extra job runs after all of the regular tasks have finished.
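For example, here is a minimal sketch of RDD checkpointing (the checkpoint directory path, object name, and RDD are placeholders for illustration, not anything from your job):

import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CheckpointSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint data should go to a reliable file system such as HDFS;
    // this path is a placeholder.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.cache()      // persist first so the checkpoint job does not recompute the lineage
    rdd.checkpoint() // only marks the RDD; nothing is written yet

    // The first action materializes the RDD, and Spark then launches a
    // separate job to write the checkpointed data to the checkpoint
    // directory. That extra job runs after the "main" job finishes,
    // which can look like the application hanging at the end.
    rdd.count()

    spark.stop()
  }
}

If checkpointing is enabled anywhere in your pipeline, that trailing checkpoint job would explain time spent after all visible tasks complete.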
2017-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:
> Hello all
> I have a Spark job that reads parquet data and partitions it based on
> one of the columns. I made sure the partitions are equally distributed and
> not skewed. My code looks like this -
>
> datasetA.write.partitionBy("column1").parquet(outputPath)
>
> Execution plan -
> [image: Inline image 1]
>
> All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins
> to close the application. I am not sure what Spark is doing after all
> tasks have completed successfully.
> I checked the thread dump (using the UI executor tab) on a few executors
> but couldn't find anything major. Overall, a few shuffle-client threads
> are "RUNNABLE" and a few dispatcher-* threads are "WAITING".
>
> Please let me know what Spark is doing at this stage (after all tasks
> have finished) and any way I can optimize it.
>
> Thanks
> Swapnil