Bryan Cutler created SPARK-25274: ------------------------------------ Summary: Improve `toPandas` with Arrow by sending out-of-order record batches Key: SPARK-25274 URL: https://issues.apache.org/jira/browse/SPARK-25274 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 2.4.0 Reporter: Bryan Cutler
When executing {{toPandas}} with Arrow enabled, partitions that arrive in the JVM out-of-order must be buffered before they can be send to Python. This causes an excess of memory to be used in the driver JVM and increases the time it takes to complete because data must sit in the JVM waiting for preceding partitions to come in. This can be improved by sending out-of-order partitions to Python as soon as they arrive in the JVM, followed by a list of indices so that Python can assemble the data in the correct order. This way, data is not buffered at the JVM and there is no waiting on particular partitions so performance will be increased. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org