Hi Wes,

It seems fine to be flexible here. However:

> This could have implications for hashing or
> comparisons, for example, so I think that having the flexibility to do
> either is a good idea.

This statement of use-cases makes me a little nervous. It seems like it
could lead to bugs if a consumer is reading from two producers that use
different alternatives?

Thanks,
Micah

On Mon, Sep 30, 2019 at 5:24 PM Wes McKinney <[email protected]> wrote:

> I just updated my pull request from May adding language to clarify
> what protocol writers are expected to set when producing the Arrow
> binary protocol:
>
> https://github.com/apache/arrow/pull/4370
>
> Implementations may allocate small buffers, or use memory which does
> not meet the 8-byte minimal padding requirements of the Arrow
> protocol. It becomes a question, then, whether to set the in-memory
> buffer size or the padded size when producing the protocol.
>
> This PR states that either is acceptable. As an example, a 1-byte
> validity buffer could have Buffer metadata stating that the size is
> either 1 byte or 8 bytes. Either way, 7 bytes of padding must be
> written to conform to the protocol. The metadata, therefore, reflects
> the "intent" of the protocol writer for the protocol reader. If the
> writer says the length is 1, then the protocol reader understands that
> the writer does not expect the reader to concern itself with the 7
> bytes of padding. This could have implications for hashing or
> comparisons, for example, so I think that having the flexibility to do
> either is a good idea.
>
> For an application that wants to guarantee that AVX512 instructions
> can be used on all buffers on the receiver side, it would be
> appropriate to include 512-bit padding in the accounting.
>
> Let me know if others think differently so we can have this properly
> documented for the 1.0.0 Format release.
>
> Thanks,
> Wes
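[Editor's note: the padding accounting discussed above boils down to rounding a buffer's in-memory size up to a multiple of the alignment. The sketch below illustrates that arithmetic only; `padded_length` is a hypothetical helper for this thread, not part of any Arrow library API.]

```python
def padded_length(nbytes, alignment=8):
    """Round nbytes up to the next multiple of `alignment`.

    The Arrow format requires buffers to be padded to a multiple of
    8 bytes; a larger alignment (e.g. 64 bytes) can be chosen by the
    writer. This helper is illustrative, not an Arrow API.
    """
    return (nbytes + alignment - 1) & ~(alignment - 1)

# A 1-byte validity buffer: the Buffer metadata may report either the
# in-memory size (1) or the padded size, but the bytes written on the
# wire always include the padding.
print(padded_length(1))       # 8-byte minimum padding -> 8

# To guarantee AVX512-friendly buffers on the receiver side, include
# 512 bits (64 bytes) in the accounting, as Wes suggests.
print(padded_length(1, 64))   # -> 64
```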
