Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/21546

_UPDATE_ I changed `toPandas` to write out-of-order partitions to Python as they complete, followed by a list of indices that represents the correct batch order. In Python, the batches are then put back in the correct order before building the Arrow Table / Pandas DataFrame.

This is slightly more complicated than before because we send extra info to Python, but it significantly reduces the upper-bound space complexity in the driver JVM from the full dataset down to the size of the largest partition. It also seems to be a little faster, so I re-ran the performance tests, which I'll post next.
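The reordering step on the Python side can be sketched roughly as follows. This is a simplified illustration, not the actual PySpark code: `reorder_batches` and the index convention (`batch_order[i]` gives the arrival position of the batch belonging at logical position `i`) are assumptions for the example.

```python
def reorder_batches(out_of_order_batches, batch_order):
    """Restore the original partition order of Arrow batches.

    out_of_order_batches: batches in the order they arrived from the JVM
    batch_order: batch_order[i] is the arrival index of the batch that
                 belongs at logical position i
    """
    return [out_of_order_batches[i] for i in batch_order]

# Example: partitions finished in the order 2, 0, 1
batches = ["part2", "part0", "part1"]
order = [1, 2, 0]  # the batch for position 0 arrived at index 1, etc.
assert reorder_batches(batches, order) == ["part0", "part1", "part2"]
```

Once reordered, the batches can be concatenated into a single Arrow Table and converted to a Pandas DataFrame as before.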