[DISCUSS][FORMAT] Data Integrity

Micah Kornfield Fri, 12 Jul 2019 00:57:25 -0700

Per Antoine's recommendation, I'm splitting off the discussion about data
integrity from the previous e-mail thread about the format additions [1].
To recap, I made a proposal including data integrity [2] by adding a new
message type to the

From the previous thread, the main question was at what level to apply
digests to Arrow data (message, array, buffer, or potentially some
hybrid).

Some trade-offs I've thought of for each approach:
Message level:
+ Simplest implementation and can be applied across all messages with
pretty much the same code.
+ Smallest amount of additional data (each digest will likely be 8-64 bytes)
- It lacks granularity to recover partial data from a record batch if there
is corruption.
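
To make the message-level option concrete, here is a minimal sketch in
Python. It assumes the digest covers the full serialized message bytes and
uses SHA-256 purely as a stand-in; the names digest_message and
verify_message are hypothetical and not part of the proposal in [2].

```python
import hashlib
import hmac


def digest_message(message_bytes: bytes, algorithm: str = "sha256") -> bytes:
    """Digest the whole serialized message (metadata plus body) in one pass."""
    return hashlib.new(algorithm, message_bytes).digest()


def verify_message(message_bytes: bytes, expected: bytes,
                   algorithm: str = "sha256") -> bool:
    """Return True only if the message bytes still match the stored digest."""
    actual = digest_message(message_bytes, algorithm)
    return hmac.compare_digest(actual, expected)


# Stand-in bytes for a serialized message (assumption: any byte string works).
message = b"metadata" + b"\x01\x02\x03\x04"
digest = digest_message(message)

# Flipping a single byte invalidates the whole message; there is no way to
# tell which column was affected.
corrupted = message[:-1] + b"\xff"
```

The same two functions work for any message type, which is the "same code"
advantage; the cost is that verify_message(corrupted, digest) can only say
that something in the message is wrong.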

Array level:
+ Allows for reading non-corrupted columns
+ Allows for potentially more complicated use-cases, like having different
compute engines "collaborate" and sign each array they computed to
establish a "chain-of-trust"
- Adds some implementation complexity. Will need different schemes for
message types other than RecordBatch and for message metadata. We also
need to determine digest boundaries (would a complex column be consumed
entirely or would child arrays be separate).
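
As a rough illustration of the array-level trade-off, the sketch below keeps
one digest per column and reports which columns can still be read after
corruption. It assumes each column is available as a single byte string
(which sidesteps the open question about child arrays) and uses SHA-256 as a
placeholder; digest_columns and readable_columns are hypothetical names.

```python
import hashlib
from typing import Dict, List


def digest_columns(columns: Dict[str, bytes]) -> Dict[str, bytes]:
    """Compute one digest per column (assumption: one byte string per column)."""
    return {name: hashlib.sha256(data).digest() for name, data in columns.items()}


def readable_columns(columns: Dict[str, bytes],
                     digests: Dict[str, bytes]) -> List[str]:
    """Return the names of columns whose bytes still match their digest."""
    return [
        name
        for name, data in columns.items()
        if hashlib.sha256(data).digest() == digests.get(name)
    ]


batch = {"id": b"\x01\x00\x00\x00\x02\x00\x00\x00", "name": b"abcdef"}
digests = digest_columns(batch)

# Corrupt only the "name" column; "id" remains verifiable and readable.
damaged = dict(batch, name=b"abcdeX")
intact = readable_columns(damaged, digests)
```

Note that this covers only column data; message metadata would still need
its own scheme, as mentioned above.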

Buffer level:
More or less the same issues as array level, but with the following other
factors:
- The largest amount of additional data
- It's not clear if there is a benefit to detecting that a single buffer is
corrupted if it means we can't accurately decode the array.
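
For a back-of-the-envelope comparison of the additional data at each level,
here is a small sketch. It assumes a fixed 32-byte digest and a flat record
batch (no nested children); digest_overhead is a hypothetical helper, and
the buffer counts follow the usual layout of 2 buffers for a nullable
fixed-width array and 3 for a nullable string array.

```python
from typing import Dict, List


def digest_overhead(buffers_per_array: List[int],
                    digest_size: int = 32) -> Dict[str, int]:
    """Bytes of digest data per record batch at each granularity."""
    n_arrays = len(buffers_per_array)
    n_buffers = sum(buffers_per_array)
    return {
        "message": digest_size,            # one digest for the whole message
        "array": n_arrays * digest_size,   # one digest per column
        "buffer": n_buffers * digest_size, # one digest per buffer
    }


# Two nullable int32 columns (validity + data) and one nullable string
# column (validity + offsets + data).
overhead = digest_overhead([2, 2, 3])
```

With these assumptions the buffer-level overhead is always at least the
array-level overhead, since every array has one or more buffers.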

Other implementation options:
* Use message level metadata (this can be a little awkward if we want to
have safety against metadata corruption).

[1] https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/4815
