Yicong Huang created SPARK-54688:
------------------------------------

             Summary: Let ArrowBatchIterator return ArrowBatch instead of 
Array[Byte]
                 Key: SPARK-54688
                 URL: https://issues.apache.org/jira/browse/SPARK-54688
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


Currently {{ArrowBatchIterator}} extends {{Iterator[Array[Byte]]}}, so each 
call to {{next()}} yields only the raw serialized batch bytes. Callers that 
need batch-level metadata (e.g., row count, byte size) cannot get it without 
re-parsing the batch.

{code:scala}
// Current implementation
private[sql] class ArrowBatchIterator(...)
    extends Iterator[Array[Byte]] with AutoCloseable {
  
  override def next(): Array[Byte] = {
    // ...
    bytes
  }
}
{code}

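Proposal: have it return {{ArrowBatch}} instead, which carries the row count 
and serialized size alongside the bytes.
{code:scala}
// Proposed implementation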
private[sql] class ArrowBatchIterator(...)
    extends Iterator[ArrowBatch] with AutoCloseable {
  
  override def next(): ArrowBatch = {
    // ...
    ArrowBatch(rowCount, bytes.length, bytes)
  }
}
{code}

This lets downstream consumers read batch metadata without deserializing 
the batch.
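
As a rough illustration of the consumer side, here is a minimal, self-contained 
sketch. The {{ArrowBatch}} shape below is an assumption inferred from the call 
{{ArrowBatch(rowCount, bytes.length, bytes)}} above; the field names 
{{sizeInBytes}} and {{bytes}}, the field types, and the {{summarize}} helper are 
hypothetical and may not match the actual Spark class.

{code:scala}
// Hypothetical stand-in for ArrowBatch; field names and types are assumptions
// inferred from the constructor call shown in this issue.
final case class ArrowBatch(rowCount: Long, sizeInBytes: Long, bytes: Array[Byte])

object ArrowBatchMetadataSketch {
  // Hypothetical helper: totals row counts and sizes from the metadata fields
  // only, without reading or deserializing the payload.
  def summarize(batches: Iterator[ArrowBatch]): (Long, Long) =
    batches.foldLeft((0L, 0L)) { case ((rows, size), batch) =>
      (rows + batch.rowCount, size + batch.sizeInBytes)
    }

  def main(args: Array[String]): Unit = {
    // Placeholder payloads; real batches would hold serialized Arrow data.
    val batches = Iterator(
      ArrowBatch(3L, 16L, new Array[Byte](16)),
      ArrowBatch(5L, 32L, new Array[Byte](32)))
    val (rows, size) = summarize(batches)
    println(s"rows=$rows bytes=$size")
  }
}
{code}

With {{Iterator[Array[Byte]]}}, the same totals would require parsing each 
batch to recover its row count.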




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
