[ https://issues.apache.org/jira/browse/SPARK-47466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850855#comment-17850855 ]
Ian Cook commented on SPARK-47466:
----------------------------------

For Connect, see the function {{to_table_as_iterator}} in {{python/pyspark/sql/connect/client/core.py}}. To return an iterator of RecordBatches, we could add another function similar to that.

For Classic, see the function {{_collect_as_arrow}} in {{python/pyspark/sql/pandas/conversion.py}}. To return an iterator of RecordBatches, we could add another function similar to that. A rough sketch of what such a helper could look like follows the quoted issue below.

> Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
> -------------------------------------------------------------------------
>
> Key: SPARK-47466
> URL: https://issues.apache.org/jira/browse/SPARK-47466
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.5.1
> Reporter: Ian Cook
> Priority: Major
>
> As a follow-up to SPARK-47365:
> {{toArrow()}} is useful when the data is relatively small. For larger data,
> the best way to return the contents of a PySpark DataFrame in Arrow format is
> to return an iterator of [PyArrow RecordBatches|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html].
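A minimal sketch of the Classic-path idea, for illustration only: the function name {{to_arrow_batch_iterator}} is hypothetical, and it simply delegates to the existing internal {{_collect_as_arrow}} helper, which collects all batches on the driver before yielding them. It demonstrates the desired iterator-of-RecordBatches interface rather than a true streaming implementation.

{code:python}
from typing import Iterator

import pyarrow as pa
from pyspark.sql import DataFrame


def to_arrow_batch_iterator(df: DataFrame) -> Iterator[pa.RecordBatch]:
    """Yield the DataFrame's contents as PyArrow RecordBatches (Classic only)."""
    # Hypothetical helper, not part of PySpark: _collect_as_arrow() is an
    # internal method that returns a list of pyarrow.RecordBatch objects
    # collected on the driver; yielding them one at a time mimics the
    # iterator-returning method proposed in this issue.
    for batch in df._collect_as_arrow():
        yield batch
{code}

A user-facing method would presumably stream batches back incrementally instead of materializing them all on the driver first, closer to what the Connect {{to_table_as_iterator}} path already does.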