[ https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Cutler updated SPARK-25274: --------------------------------- Summary: Improve toPandas with Arrow by sending out-of-order record batches (was: Improve `toPandas` with Arrow by sending out-of-order record batches) > Improve toPandas with Arrow by sending out-of-order record batches > ------------------------------------------------------------------ > > Key: SPARK-25274 > URL: https://issues.apache.org/jira/browse/SPARK-25274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL > Affects Versions: 2.4.0 > Reporter: Bryan Cutler > Priority: Major > > When executing {{toPandas}} with Arrow enabled, partitions that arrive in the > JVM out-of-order must be buffered before they can be send to Python. This > causes an excess of memory to be used in the driver JVM and increases the > time it takes to complete because data must sit in the JVM waiting for > preceding partitions to come in. > This can be improved by sending out-of-order partitions to Python as soon as > they arrive in the JVM, followed by a list of indices so that Python can > assemble the data in the correct order. This way, data is not buffered at the > JVM and there is no waiting on particular partitions so performance will be > increased. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org