Re: [PR] GH-343: Fix ListVector offset buffer not allocated for nested empty arrays [arrow-java]

via GitHub Tue, 20 Jan 2026 19:48:13 -0800


viirya commented on PR #967:
URL: https://github.com/apache/arrow-java/pull/967#issuecomment-3775998864


   > When outer array is empty, nested writers are never invoked, so child 
list's offset buffer remains unallocated (capacity = 0). This violates Arrow 
spec which requires offset[0] = 0.
   
   I think you are referring to the writers in Spark. That is out of context 
here and not related to the root cause. We should update the description to 
explain the issue clearly.
   
   The offset buffers are actually allocated properly. But during IPC 
serialization, they are ignored.
   
   ```Java
     public long readableBytes() {
         return writerIndex - readerIndex;
     }
   ```
   
   So when ListVector.setReaderAndWriterIndex() sets writerIndex(0) and 
readerIndex(0), readableBytes() returns 0 - 0 = 0.
   
   Then when MessageSerializer.writeBatchBuffers() calls 
WriteChannel.write(buffer), it writes 0 bytes.
   
   So the flow is:
     1. valueCount=0 → ListVector.setReaderAndWriterIndex() sets 
offsetBuffer.writerIndex(0)
     2. VectorUnloader.getFieldBuffers() returns the buffer with writerIndex=0
     3. MessageSerializer.writeBatchBuffers() writes the buffer
     4. WriteChannel.write(buffer) checks buffer.readableBytes() which is 0
     5. 0 bytes are written to the IPC stream
     6. PyArrow reads the batch with the missing buffer → crash (likewise 
when other libraries read it)
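
To make the flow above concrete, here is a minimal, self-contained sketch. It does not use Arrow's classes: `SimpleBuf`, `write`, `writerIndexFor`, and the `alwaysIncludeFirstOffset` flag are hypothetical stand-ins for illustration only. It assumes 4-byte offsets and a writer that emits only the readable region, and it shows that an allocated buffer with writerIndex=0 contributes 0 bytes, whereas keeping offset[0] readable would contribute 4 bytes. The flag is just one possible remedy, not necessarily what this PR should do.

```java
import java.io.ByteArrayOutputStream;

public class OffsetBufferSketch {
    // Assumption: 32-bit list offsets, so each offset occupies 4 bytes.
    static final int OFFSET_WIDTH = 4;

    /** Hypothetical stand-in for a buffer with reader/writer indices (not Arrow's ArrowBuf). */
    static final class SimpleBuf {
        final byte[] data;
        long readerIndex;
        long writerIndex;

        SimpleBuf(int capacity) {
            data = new byte[capacity]; // zero-filled, so offset[0] == 0 is already present
        }

        long readableBytes() {
            return writerIndex - readerIndex;
        }
    }

    /** Models a channel that writes only the readable region and returns the byte count. */
    static int write(SimpleBuf buf, ByteArrayOutputStream out) {
        int n = (int) buf.readableBytes();
        out.write(buf.data, (int) buf.readerIndex, n);
        return n;
    }

    /** Models how the writer index is chosen; the flag is a hypothetical alternative. */
    static long writerIndexFor(int valueCount, boolean alwaysIncludeFirstOffset) {
        if (valueCount == 0 && !alwaysIncludeFirstOffset) {
            return 0; // behaviour described in the flow above
        }
        return (long) (valueCount + 1) * OFFSET_WIDTH;
    }

    /** Allocates an offset buffer, sets the indices, and reports how many bytes get written. */
    static int bytesWritten(int valueCount, boolean alwaysIncludeFirstOffset) {
        SimpleBuf offsets = new SimpleBuf(64); // the buffer IS allocated
        offsets.readerIndex = 0;
        offsets.writerIndex = writerIndexFor(valueCount, alwaysIncludeFirstOffset);
        return write(offsets, new ByteArrayOutputStream());
    }

    public static void main(String[] args) {
        System.out.println(bytesWritten(0, false)); // empty list, writerIndex = 0
        System.out.println(bytesWritten(0, true));  // empty list, offset[0] kept readable
        System.out.println(bytesWritten(3, false)); // three lists, four offsets
    }
}
```

The point of the sketch is that allocation and serialization are separate concerns: the buffer has capacity in every case, and only the writer index decides how many bytes reach the stream.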
   


