Bryan Cutler created SPARK-25274:
------------------------------------

             Summary: Improve `toPandas` with Arrow by sending out-of-order 
record batches
                 Key: SPARK-25274
                 URL: https://issues.apache.org/jira/browse/SPARK-25274
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark, SQL
    Affects Versions: 2.4.0
            Reporter: Bryan Cutler


When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
JVM out-of-order must be buffered before they can be send to Python. This 
causes an excess of memory to be used in the driver JVM and increases the time 
it takes to complete because data must sit in the JVM waiting for preceding 
partitions to come in.

This can be improved by sending out-of-order partitions to Python as soon as 
they arrive in the JVM, followed by a list of indices so that Python can 
assemble the data in the correct order. This way, data is not buffered at the 
JVM and there is no waiting on particular partitions so performance will be 
increased.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to