zhengruifeng commented on code in PR #52303:
URL: https://github.com/apache/spark/pull/52303#discussion_r2354501693
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowInput.scala:
##########
@@ -194,3 +180,115 @@ private[python] trait BatchedPythonArrowInput extends
BasicPythonArrowInput {
}
}
}
+
+object BatchedPythonArrowInput {
+ /**
+ * Split a group into smaller Arrow batches within
+ * a separate and complete Arrow streaming format in order
+ * to work around Arrow 2G limit, see ARROW-4890.
+ *
+ * The return value is the number of rows in the batch.
+ * Each split Arrow batch also does not contain mixed groups. For example:
+ *
+ * +------------------------+ +------------------------+ +--------------------
+ * |Group (by k1) v1, v2, v3| |Group (by k2) v1, v2, v3| |                 ...
+ * +------------------------+ +------------------------+ +--------------------
+ *
+ * +------+-----------------+------+------+-----------------+------+------+--------------------
+ * |Schema|      Batch      | Batch|Schema|      Batch      | Batch|Schema| Batch ...
+ * +------+-----------------+------+------+-----------------+------+------+--------------------
+ * |    Arrow Streaming Format     |    Arrow Streaming Format     | Arrow Streaming Form...
+ *
+ * Here, each (Arrow) batch does not span multiple groups.
+ * These (Arrow) batches within each complete Arrow IPC stream are
+ * reconstructed back into the group as pandas instances later on the
+ * Python worker side.
+ */
+ def writeSizedBatch(
Review Comment:
Regarding the reuse of `BaseStreamingArrowWriter`: since it is only used in
the PySpark streaming feature, I am still hesitant to use it here.
I created subtask https://issues.apache.org/jira/browse/SPARK-53612 to track
the consolidation.
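
For illustration only, the group-preserving splitting constraint described in the doc comment (each batch stays within a single group; no batch spans a group boundary) can be sketched as follows. This is a hypothetical Python sketch of the chunking logic, not the actual `writeSizedBatch` implementation, which operates on Arrow record batches in Scala:

```python
from itertools import islice


def split_groups_into_batches(groups, max_rows_per_batch):
    """Yield batches of at most max_rows_per_batch rows, never mixing
    rows from different groups in a single batch (mirroring the diagram
    in the doc comment above)."""
    for group in groups:
        it = iter(group)
        # Drain each group independently, so a batch never crosses
        # a group boundary even when it is smaller than the cap.
        while batch := list(islice(it, max_rows_per_batch)):
            yield batch


batches = list(split_groups_into_batches([[1, 2, 3, 4, 5], [6, 7]], 2))
print(batches)  # [[1, 2], [3, 4], [5], [6, 7]]
```

Note the trailing partial batch `[5]` and the standalone `[6, 7]`: the per-group cap is what works around the Arrow 2G limit while keeping each group reconstructible on the Python worker side.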
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]