I just updated my pull request from May, adding language to clarify
what buffer sizes protocol writers are expected to set when producing
the Arrow binary protocol:

https://github.com/apache/arrow/pull/4370

Implementations may allocate small buffers, or use memory that does
not meet the 8-byte minimum padding requirement of the Arrow
protocol. The question, then, is whether to set the in-memory buffer
size or the padded size when producing the protocol.

This PR states that either is acceptable. As an example, a 1-byte
validity buffer could have Buffer metadata stating that the size is
either 1 byte or 8 bytes. Either way, 7 bytes of padding must be
written to conform to the protocol. The metadata therefore reflects
the "intent" of the protocol writer for the protocol reader. If the
writer says the length is 1, then the protocol reader understands that
the writer does not expect the reader to concern itself with the 7
bytes of padding. This could have implications for hashing or
comparisons, for example, so I think having the flexibility to do
either is a good idea.
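To make the padding arithmetic concrete, here is a minimal sketch
(the helper name `padded_length` is my own, not part of any Arrow
API) of rounding a buffer size up to the protocol's 8-byte alignment:

```python
def padded_length(nbytes, alignment=8):
    # Round up to the next multiple of the alignment
    # (8 bytes is the minimum required by the Arrow protocol).
    return ((nbytes + alignment - 1) // alignment) * alignment

# A 1-byte validity buffer occupies 8 bytes on the wire; the writer
# may report either 1 or 8 in the Buffer metadata, but must write
# the 7 padding bytes regardless.
print(padded_length(1))   # -> 8
print(padded_length(8))   # -> 8
print(padded_length(9))   # -> 16
```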

For an application that wants to guarantee that AVX512 instructions
can be used on all buffers on the receiver side, it would be
appropriate to include 512-bit (64-byte) padding in the accounting.
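In the same sketch as above (again, `padded_length` is a hypothetical
helper, not an Arrow function), accounting for 512-bit padding just
means rounding up to a 64-byte alignment instead of 8:

```python
def padded_length(nbytes, alignment=8):
    # Round up to the next multiple of the alignment.
    return ((nbytes + alignment - 1) // alignment) * alignment

# Padding to 512 bits (64 bytes) so that AVX512-width loads are safe
# on the receiver side for any buffer:
print(padded_length(1, alignment=64))    # -> 64
print(padded_length(100, alignment=64))  # -> 128
```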

Let me know if others think differently so we can have this properly
documented for the 1.0.0 Format release.

Thanks,
Wes
