Hello all, We've been working with PySpark and Pandas, and have found that to convert a dataset using N bytes of memory to Pandas, we need to have 2N bytes free, even with the Arrow optimization enabled. The fundamental reason is ARROW-3789[1]: Arrow does not free the Arrow table until conversion finishes, so there are 2 copies of the dataset in memory.
We'd like to improve this by taking advantage of the Arrow "self_destruct" option available in Arrow >= 0.16. When converting a suitable[*] Arrow table to a Pandas dataframe, it avoids the worst-case 2x memory usage, with something more like ~25% overhead instead, by freeing the columns in the Arrow table after converting each column instead of at the end of conversion. Does this sound like a desirable optimization to have in Spark? If so, how should it be exposed to users? As discussed below, there are cases where a user may or may not want it enabled. Here's a proof-of-concept patch, along with a demonstration, and a comparison of memory usage (via memory_profiler[2]) with and without the flag enabled: https://gist.github.com/lidavidm/289229caa022358432f7deebe26a9bd3 There are some cases where you may _not_ want this optimization, however, so the patch leaves it as a toggle. Is this the API we'd want, or would we prefer a different API (e.g. a configuration flag)? The reason we may not want this enabled by default is that the related split_blocks option is more likely to find zero-copy opportunities, which will result in the Pandas dataframe being backed by immutable buffers. Some Pandas operations will error in these cases, e.g. [3]. Also, to minimize memory usage, we set use_threads=False to converts each column sequentially, rather than in parallel, but this slows down the conversion somewhat. One option here may be to set self_destruct by default, but relegate the other two options (which further save memory) to a toggle, and I can measure the impact of this if desired. [1]: https://issues.apache.org/jira/browse/ARROW-3789 [2]: https://github.com/pythonprofilers/memory_profiler [3]: https://github.com/pandas-dev/pandas/issues/35530 [*] See my comment in https://issues.apache.org/jira/browse/ARROW-9878. Thanks, David --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org