Per Antoine's recommendation, I'm splitting off the discussion about data integrity from the previous e-mail thread about the format additions [1]. To recap, I made a proposal including data integrity [2] by adding a new message type to the
From the previous thread, the main question was at what level to apply digests to Arrow data (message level, array, buffer, or potentially some hybrid).

Some trade-offs I've thought of for each approach:

* Message level
  + Simplest implementation, and it can be applied across all messages with pretty much the same code.
  + Smallest amount of additional data (each digest will likely be 8-64 bytes).
  - It lacks the granularity to recover partial data from a record batch if there is corruption.

* Array level
  + Allows for reading non-corrupted columns.
  + Allows for potentially more complicated use-cases, like having different compute engines "collaborate" and sign each array they computed to establish a "chain-of-trust".
  - Adds some implementation complexity. We will need different schemes for message types other than RecordBatch and for message metadata. We also need to determine digest boundaries (would a complex column be consumed entirely, or would child arrays be separate?).

* Buffer level
  More or less the same issues as array level, but with the following additional factors:
  - The largest amount of additional data.
  - It's not clear that there is a benefit to detecting that a single buffer is corrupted if it means we can't accurately decode the array.

Other implementation options:

* Use message level metadata (this can be a little awkward if we want to have safety against metadata corruption).

[1] https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/4815
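
To make the granularity trade-off above a bit more concrete, here is a rough sketch in plain Python. It is not Arrow code and does not reflect the proposal in [2]: the "batch" is just a hypothetical dict of column name to raw bytes, SHA-256 is an arbitrary stand-in for whatever digest would be chosen, and all function names are made up for illustration. It only shows that one digest per message detects corruption but cannot localize it, while one digest per array costs more bytes but tells us which columns are still readable.

```python
import hashlib

# Hypothetical stand-in for a record batch: column name -> raw column bytes.
# Real Arrow arrays consist of several buffers; this is deliberately simplified.
batch = {
    "ids": bytes(range(16)),
    "values": b"\x01\x02\x03\x04" * 8,
}


def digest(data: bytes) -> bytes:
    # SHA-256 is an assumption for illustration; it produces 32 bytes.
    return hashlib.sha256(data).digest()


def message_digest(columns: dict) -> bytes:
    # One digest over the whole message body (columns in a fixed order).
    h = hashlib.sha256()
    for name in sorted(columns):
        h.update(columns[name])
    return h.digest()


def array_digests(columns: dict) -> dict:
    # One digest per column.
    return {name: digest(buf) for name, buf in columns.items()}


def corrupted_columns(columns: dict, expected: dict) -> list:
    # Names of the columns whose digest no longer matches.
    return [name for name, buf in columns.items() if digest(buf) != expected[name]]


# Digests computed by the writer.
msg_sum = message_digest(batch)
col_sums = array_digests(batch)

# Simulate a single flipped bit in the "values" column.
damaged = dict(batch)
damaged["values"] = bytes([batch["values"][0] ^ 0x01]) + batch["values"][1:]

# Message level: we learn only that something in the batch is wrong.
message_ok = message_digest(damaged) == msg_sum

# Array level: we learn which column is wrong, so "ids" can still be used.
bad = corrupted_columns(damaged, col_sums)

# Additional bytes carried under each scheme.
overhead_message = len(msg_sum)
overhead_array = sum(len(d) for d in col_sums.values())
```

With buffer-level digests the overhead would grow further (one digest per buffer rather than per column), which is the cost side of the buffer-level option above.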