viirya commented on code in PR #52303:
URL: https://github.com/apache/spark/pull/52303#discussion_r2343324017
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -194,3 +180,118 @@ private[python] trait BatchedPythonArrowInput extends BasicPythonArrowInput {
}
}
}
+
+object BatchedPythonArrowInput {
+ /**
+ * Split a group into smaller Arrow batches within
+ * a separate and complete Arrow streaming format in order
+ * to work around Arrow 2G limit, see ARROW-4890.
+ *
+ * The return value is the number of rows in the batch.
+ *
+ * Note that `rowIter` here is always grouped batch. One group does not span
+ * multiple groups, see also [[org.apache.spark.sql.execution.GroupedIterator]].
+ * Therefore, each split Arrow batch also does not have mixed grouped. For example:
+ *
+ * +------------------------+ +------------------------+ +--------------------
+ * |Group (by k1) v1, v2, v3| |Group (by k2) v1, v2, v3| | ...
+ * +------------------------+ +------------------------+ +--------------------
+ *
+ * +------+-----------------+------+------+-----------------+------+------+--------------------
+ * |Schema|            Batch| Batch|Schema|            Batch| Batch|Schema|            Batch ...
+ * +------+-----------------+------+------+-----------------+------+------+--------------------
+ * |    Arrow Streaming Format     |    Arrow Streaming Format     |    Arrow Streaming Form...
+ *
+ * Here, each (Arrow) batch does not span multiple groups.
+ * These (Arrow) batches within each complete Arrow Streaming Format are
Review Comment:
Arrow Streaming Format -> Arrow IPC Format?
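
For context on the layout the Scaladoc describes, here is a minimal sketch of the message sequence, assuming a toy model rather than the actual Spark or Arrow APIs. `Message`, `Schema`, `Batch`, `GroupSplitSketch`, `writeGroup`, `writeAll`, and `maxRows` are all hypothetical names introduced for illustration; the real writer operates on Arrow vectors and may also emit framing that this model leaves out.

```scala
// Toy model of the diagram in the Scaladoc. These types are hypothetical
// and are not part of Spark or Arrow.
sealed trait Message
case object Schema extends Message
final case class Batch(rows: Vector[Int]) extends Message

object GroupSplitSketch {
  // One group becomes one stream: a Schema message followed by batches
  // of at most `maxRows` rows each. A batch never mixes rows from two groups
  // because each group is written independently.
  def writeGroup(rows: Iterator[Int], maxRows: Int): Vector[Message] = {
    require(maxRows > 0, "maxRows must be positive")
    val batches: Vector[Message] =
      rows.grouped(maxRows).map(chunk => Batch(chunk.toVector)).toVector
    Schema +: batches
  }

  // Concatenate the per-group streams in order, as in the diagram.
  def writeAll(groups: Seq[Seq[Int]], maxRows: Int): Vector[Message] =
    groups.toVector.flatMap(g => writeGroup(g.iterator, maxRows))
}
```

In this model every group starts with its own `Schema` message, which is what makes each segment self-contained regardless of which name the Scaladoc ends up using for the format.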
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]