Have you enabled Spark's fault-tolerance (checkpointing) mechanism? When an RDD
is checkpointed, Spark starts a separate job at the end of the main job to write
the checkpoint data to the file system, persisting it for high availability, and
that extra job can add noticeable time after the main tasks finish.
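
For reference, a minimal sketch of what RDD checkpointing looks like in Scala
(the directory path, input path, and RDD are made up for illustration; this is
not taken from your job):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
  val sc = spark.sparkContext

  // Checkpoint data is written to a reliable file system (e.g. HDFS).
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

  val rdd = sc.textFile("hdfs:///tmp/input").map(_.length)
  rdd.checkpoint()   // marks the RDD for checkpointing

  // The checkpoint itself is materialized by a separate job that runs after
  // the first action computing this RDD, which adds extra wall-clock time.
  println(rdd.count())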

2017-03-08 2:45 GMT+08:00 Swapnil Shinde <swapnilushi...@gmail.com>:

> Hello all
>    I have a spark job that reads parquet data and partition it based on
> one of the columns. I made sure partitions equally distributed and not
> skewed. My code looks like this -
>
> datasetA.write.partitionBy("column1").parquet(outputPath)
>
> Execution plan -
> [image: Inline image 1]
>
> All tasks (~12,000) finish in 30-35 mins, but it takes another 40-45 mins
> for the application to close. I am not sure what Spark is doing after all
> tasks are processed successfully.
> I checked the thread dump (using the UI executor tab) on a few executors but
> couldn't find anything major. Overall, a few shuffle-client threads are
> "RUNNABLE" and a few dispatched-* threads are "WAITING".
>
> Please let me know what Spark is doing at this stage (after all tasks have
> finished) and whether there is any way I can optimize it.
>
> Thanks
> Swapnil
>
>
>
