Hello all,

We've been working with PySpark and Pandas, and have found that to
convert a dataset using N bytes of memory to Pandas, we need to have
2N bytes free, even with the Arrow optimization enabled. The
fundamental reason is ARROW-3789[1]: Arrow does not free the Arrow
table until conversion finishes, so there are 2 copies of the dataset
in memory.

We'd like to improve this by taking advantage of the Arrow
"self_destruct" option available in Arrow >= 0.16. When converting a
suitable[*] Arrow table to a Pandas dataframe, it avoids the
worst-case 2x memory usage, with something more like ~25% overhead
instead, by freeing the columns in the Arrow table after converting
each column instead of at the end of conversion.

Does this sound like a desirable optimization to have in Spark? If so,
how should it be exposed to users? As discussed below, there are cases
where a user may or may not want it enabled.

Here's a proof-of-concept patch, along with a demonstration, and a
comparison of memory usage (via memory_profiler[2]) with and without
the flag enabled:
https://gist.github.com/lidavidm/289229caa022358432f7deebe26a9bd3

There are some cases where you may _not_ want this optimization,
however, so the patch leaves it as a toggle. Is this the API we'd
want, or would we prefer a different API (e.g. a configuration flag)?

The reason we may not want this enabled by default is that the related
split_blocks option is more likely to find zero-copy opportunities,
which will result in the Pandas dataframe being backed by immutable
buffers. Some Pandas operations will error in these cases, e.g. [3].
Also, to minimize memory usage, we set use_threads=False to converts
each column sequentially, rather than in parallel, but this slows down
the conversion somewhat. One option here may be to set self_destruct
by default, but relegate the other two options (which further save
memory) to a toggle, and I can measure the impact of this if desired.

[1]: https://issues.apache.org/jira/browse/ARROW-3789
[2]: https://github.com/pythonprofilers/memory_profiler
[3]: https://github.com/pandas-dev/pandas/issues/35530
[*] See my comment in https://issues.apache.org/jira/browse/ARROW-9878.

Thanks,
David

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Reply via email to