Fabien created SPARK-38111:
------------------------------

             Summary: Retrieve a Spark dataframe as Arrow batches
                 Key: SPARK-38111
                 URL: https://issues.apache.org/jira/browse/SPARK-38111
             Project: Spark
          Issue Type: Question
          Components: Java API
    Affects Versions: 3.2.0
         Environment: Java 11
Spark 3
            Reporter: Fabien


Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow batches?

I have a fairly large dataset on my cluster, so I cannot collect it with [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--], which downloads everything at once and saturates my JVM memory.

Given that Arrow is becoming a standard for transferring large datasets, and that Spark already uses Arrow extensively, is there a way to retrieve my Spark dataframe as Arrow batches? That would make it possible to process the data batch by batch and avoid exhausting memory.

I am looking for an API like this (in Java):
{code:java}
var stream = dataframe.collectAsArrowStream();
while (stream.hasNextBatch()) {
    var batch = stream.getNextBatch();
    // do some stuff with the arrow batch
}
{code}
It would be even better if I could split the dataframe into several streams so I can download and process them in parallel.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
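In the meantime, one public API that avoids materializing the whole dataset on the driver is {{Dataset.toLocalIterator()}}, which fetches rows one partition at a time. It yields rows rather than Arrow batches, but the batch-per-batch consumption pattern can be layered on top. Below is a minimal, Spark-free sketch of that grouping step; the {{BatchingIterator}} class and the batch size are illustrative, not part of any Spark API — with Spark, the source iterator would be the one returned by {{dataframe.toLocalIterator()}}:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Groups an underlying iterator into fixed-size batches, so only one
// batch is materialized in memory at a time. With Spark, the source
// could be dataframe.toLocalIterator() (hypothetical usage; this class
// itself is not a Spark API).
class BatchingIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final int batchSize;

    BatchingIterator(Iterator<T> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        // Drain up to batchSize elements from the source into one batch.
        List<T> batch = new ArrayList<>(batchSize);
        while (source.hasNext() && batch.size() < batchSize) {
            batch.add(source.next());
        }
        return batch;
    }
}
```

This only addresses the memory pressure, not the Arrow wire format or parallel download; a true {{collectAsArrowStream()}} would still need support inside Spark itself.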