viirya commented on PR #967:
URL: https://github.com/apache/arrow-java/pull/967#issuecomment-3775998864
> When outer array is empty, nested writers are never invoked, so child
list's offset buffer remains unallocated (capacity = 0). This violates Arrow
spec which requires offset[0] = 0.
I think you are referring to the writers in Spark. That is out of context here
and unrelated to the root cause. We should update the description to explain
the issue clearly.
The offset buffers are actually allocated properly, but they are ignored
during IPC serialization. The reason is how `ArrowBuf.readableBytes()` is
computed:
```Java
public long readableBytes() {
  return writerIndex - readerIndex;
}
```
So when ListVector.setReaderAndWriterIndex() sets writerIndex(0) and
readerIndex(0), readableBytes() returns 0 - 0 = 0.
Then when MessageSerializer.writeBatchBuffers() calls
WriteChannel.write(buffer), it writes 0 bytes.
So the flow is:
1. valueCount=0 → ListVector.setReaderAndWriterIndex() sets
offsetBuffer.writerIndex(0)
2. VectorUnloader.getFieldBuffers() returns the buffer with writerIndex=0
3. MessageSerializer.writeBatchBuffers() writes the buffer
4. WriteChannel.write(buffer) checks buffer.readableBytes() which is 0
5. 0 bytes are written to the IPC stream
6. PyArrow reads the batch with the missing buffer → crash when other
libraries try to read it
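
To make the flow above concrete, here is a minimal standalone sketch. It does not use the real Arrow classes: `SketchBuf`, `write`, and `serializedOffsetBytes` are hypothetical stand-ins that only model the reader/writer index bookkeeping, and the `keepFirstOffset` flag is an assumption added for illustration, not necessarily how the PR fixes it. The 4-byte figure assumes 32-bit list offsets with `valueCount + 1` entries.

```java
import java.io.ByteArrayOutputStream;

/** Simplified model of the index bookkeeping; NOT the real ArrowBuf or WriteChannel. */
class EmptyListOffsetSketch {

  /** Hypothetical minimal buffer: backing bytes plus reader and writer indices. */
  static final class SketchBuf {
    final byte[] data;
    long readerIndex;
    long writerIndex;

    SketchBuf(int capacity) {
      this.data = new byte[capacity]; // zero-filled, so offset[0] is already 0
    }

    long readableBytes() {
      return writerIndex - readerIndex;
    }
  }

  /** Copies only the readable region, as a channel write would. Returns bytes written. */
  static long write(ByteArrayOutputStream out, SketchBuf buf) {
    int length = (int) buf.readableBytes();
    out.write(buf.data, (int) buf.readerIndex, length);
    return length;
  }

  /**
   * Models steps 1 to 5: set the indices for a list offset buffer, then serialize it.
   * keepFirstOffset is a hypothetical switch that keeps offset[0] readable
   * even when valueCount is 0.
   */
  static long serializedOffsetBytes(int valueCount, boolean keepFirstOffset) {
    SketchBuf offsets = new SketchBuf(64); // the buffer IS allocated
    offsets.readerIndex = 0;
    if (valueCount == 0 && !keepFirstOffset) {
      offsets.writerIndex = 0; // the behavior described above
    } else {
      offsets.writerIndex = (long) (valueCount + 1) * 4; // 32-bit offsets
    }
    return write(new ByteArrayOutputStream(), offsets);
  }

  public static void main(String[] args) {
    System.out.println(serializedOffsetBytes(0, false));
    System.out.println(serializedOffsetBytes(0, true));
  }
}
```

In this model the buffer has capacity 64 in both cases, yet the first call serializes 0 bytes and the second serializes 4. That is the point: the allocation is fine, and only the writer index decides what reaches the IPC stream.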