One guess - you are doing two things here, count() and write(). There is a
persist(), but it's async. It won't necessarily wait for the persist to
finish before proceeding and may have to recompute at least some partitions
for the second op. You could debug further by looking at the stages and
seeing what exactly is executing and where it uses cached partitions or not.

On Mon, Jan 31, 2022 at 2:12 AM Benjamin Du <legendu....@outlook.com> wrote:

> I did check the execution plan, there were 2 stages and both stages show
> that the pandas UDF (which takes almost all the computation time of the
> DataFrame) is executed.
>
> It didn't seem to be an issue of repartition/coalesce as the DataFrame was
> still computed twice after removing coalesce.
>
>
>

Reply via email to