Yicong Huang created SPARK-54688:
------------------------------------
Summary: Let ArrowBatchIterator return ArrowBatch instead of Array[Byte]
Key: SPARK-54688
URL: https://issues.apache.org/jira/browse/SPARK-54688
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
Currently {{ArrowBatchIterator}} is an {{Iterator[Array[Byte]]}}, so each call
to {{next()}} yields only the raw serialized batch bytes. Callers that need
batch-level metadata (e.g., row count, byte size) cannot get it without
re-parsing the batch.
{code:scala}
// Current implementation
private[sql] class ArrowBatchIterator(...)
    extends Iterator[Array[Byte]] with AutoCloseable {
  override def next(): Array[Byte] = {
    // ...
    bytes
  }
}
{code}
I propose that {{next()}} return an {{ArrowBatch}} instead, which carries the
row count and byte size alongside the serialized bytes.
{code:scala}
// Proposed implementation
private[sql] class ArrowBatchIterator(...)
    extends Iterator[ArrowBatch] with AutoCloseable {
  override def next(): ArrowBatch = {
    // ...
    ArrowBatch(rowCount, bytes.length, bytes)
  }
}
{code}
This lets downstream consumers read batch metadata without deserializing the
batch.
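As a rough illustration of the consumer side, the sketch below shows how a caller might total rows and bytes from metadata alone. The shape of {{ArrowBatch}} is an assumption inferred from the constructor call {{ArrowBatch(rowCount, bytes.length, bytes)}}: the field names, the {{Long}} types, and the {{ArrowBatchStats}} helper are hypothetical and not part of Spark.

```scala
// Hypothetical shape of ArrowBatch, inferred from the proposed constructor call.
// Field names and types are assumptions made for illustration only.
final case class ArrowBatch(rowCount: Long, byteSize: Long, bytes: Array[Byte])

// Hypothetical helper, not part of Spark.
object ArrowBatchStats {
  // Returns (total rows, total bytes) using only the metadata fields;
  // the serialized payload in `bytes` is never parsed.
  def summarize(batches: Iterator[ArrowBatch]): (Long, Long) =
    batches.foldLeft((0L, 0L)) { case ((rows, size), batch) =>
      (rows + batch.rowCount, size + batch.byteSize)
    }
}
```

With the current {{Iterator[Array[Byte]]}} the byte total is still available from the array length, but the row total would require parsing each batch.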