Re: [PR] [SPARK-53562][PYTHON] Limit Arrow batch sizes in `applyInArrow` and `applyInPandas` [spark]

via GitHub Fri, 12 Sep 2025 00:51:08 -0700


viirya commented on code in PR #52303:
URL: https://github.com/apache/spark/pull/52303#discussion_r2343321734



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -194,3 +180,118 @@ private[python] trait BatchedPythonArrowInput extends BasicPythonArrowInput {
     }
   }
 }
+
+object BatchedPythonArrowInput {
+  /**
+   * Split a group into smaller Arrow batches within
+   * a separate and complete Arrow streaming format in order
+   * to work around Arrow 2G limit, see ARROW-4890.
+   *
+   * The return value is the number of rows in the batch.
+   *
+   * Note that `rowIter` here is always grouped batch. One group does not span
+   * multiple groups, see also [[org.apache.spark.sql.execution.GroupedIterator]].

Review Comment:
   Hmm, what does "One group does not span multiple groups" mean? Do you mean the same group won't appear more than once like `GroupedIterator`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


