[ https://issues.apache.org/jira/browse/SPARK-38111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fabien updated SPARK-38111:
---------------------------
    Labels: arrow  (was: )

> Retrieve a Spark dataframe as Arrow batches
> -------------------------------------------
>
>                 Key: SPARK-38111
>                 URL: https://issues.apache.org/jira/browse/SPARK-38111
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 3.2.0
>         Environment: Java 11
> Spark 3
>            Reporter: Fabien
>            Priority: Minor
>              Labels: arrow
>
> Using the Java API, is there a way to efficiently retrieve a dataframe as
> Arrow batches?
> I have a fairly large dataset on my cluster, so I cannot collect it using
> [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--],
> which downloads everything at once and saturates my JVM memory.
> Since Arrow is becoming a standard for transferring large datasets and
> Spark already uses Arrow heavily, is there a way to transfer my Spark
> dataframe as Arrow batches?
> This would be ideal for processing the data batch by batch without
> saturating the memory.
>
> I am looking for an API like this (in Java):
>
> {code:java}
> var stream = dataframe.collectAsArrowStream();
> while (stream.hasNextBatch()) {
>     var batch = stream.getNextBatch();
>     // do some stuff with the arrow batch
> }
> {code}
> It would be even better if I could split the dataframe into several streams
> so that I can download and process it in parallel.
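
As a partial illustration of the batch-by-batch processing described in the issue, here is a minimal sketch in plain Java. It is not the requested Arrow API: `BatchingIterator` is a hypothetical helper, not part of Spark, that groups any `Iterator<T>` into fixed-size lists. With Spark, the source iterator could be the one returned by `Dataset.toLocalIterator()`, which yields rows (not Arrow batches) and is documented to consume as much memory as the largest partition. The demo below uses an in-memory list as a stand-in so that it runs without a cluster.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical helper (not a Spark class): wraps an Iterator<T> and
// returns its elements in lists of at most batchSize items.
final class BatchingIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final int batchSize;

    BatchingIterator(Iterator<T> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.source = source;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        // A further batch exists as long as the source has any element left.
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        // Pull at most batchSize elements; the last batch may be shorter.
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && source.hasNext()) {
            batch.add(source.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        // Stand-in for dataframe.toLocalIterator() (assumption: any row iterator works here).
        Iterator<Integer> rows = List.of(1, 2, 3, 4, 5).iterator();
        BatchingIterator<Integer> batches = new BatchingIterator<>(rows, 2);
        while (batches.hasNext()) {
            List<Integer> batch = batches.next();
            System.out.println(batch); // process one batch at a time
        }
    }
}
```

Only one batch is held by the helper at a time, so driver memory is bounded by the batch size plus whatever the underlying iterator buffers. This sketch does not address the parallel download mentioned at the end of the issue.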