Cleanup after Spark SQL job with window aggregation takes a long time

Jestin Ma Mon, 29 Aug 2016 15:58:29 -0700

After a Spark SQL job appending a few columns using window aggregation
functions, and performing a join and some data massaging, I find that the
cleanup after the job finishes saving the result data to disk takes as long
if not longer than the job.


I currently am performing window aggregation on a dataset ~150 GB and
joining with another dataset of about ~50 GB.

With window aggregation, it takes about 15 minutes. Without window
aggregation and instead performing a standard groupBy(..).agg(...) and
join, it takes about 19 minutes.

However, when using window aggregation functions, for more than 15-20
minutes, the driver program is removing broadcast pieces, cleaning
accumulators, and cleaning shuffles.

Can anyone explain what these are at a lower level besides what I see on
the command line, or why this happens ONLY when I use window aggregation?
And are there any ways to remedy this?

Thank you!
Jestin

Cleanup after Spark SQL job with window aggregation takes a long time

Reply via email to