After a Spark SQL job appending a few columns using window aggregation
functions, and performing a join and some data massaging, I find that the
cleanup after the job finishes saving the result data to disk takes as long
if not longer than the job.

I currently am performing window aggregation on a dataset ~150 GB and
joining with another dataset of about ~50 GB.

With window aggregation, it takes about 15 minutes. Without window
aggregation and instead performing a standard groupBy(..).agg(...) and
join, it takes about 19 minutes.

However, when using window aggregation functions, for more than 15-20
minutes, the driver program is removing broadcast pieces, cleaning
accumulators, and cleaning shuffles.

Can anyone explain what these are at a lower level besides what I see on
the command line, or why this happens ONLY when I use window aggregation?
And are there any ways to remedy this?

Thank you!
Jestin

Reply via email to