On Sat, Jul 30, 2022 at 7:10 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> > but the decision must be declared up front in the schema.
>
> I am probably missing something obvious but given the description of the
> IPC format [1] perhaps we could support "on the fly changes" by sending
> additional `SCHEMA` messages in the stream, rather than just a single
> message at the beginning.
>
> If (re) sending the entire schema again in a SCHEMA is too costly, perhaps
> some sort of SCHEMA_DELTA message could be devised that transmitted the
> incremental schema changes.
>
> I think such a scheme could be "backwards compatible" too as existing
> readers should error if they see a new SCHEMA message they are not prepared
> to handle.

In principle you are right:

* We could add new encoding information to the Schema / Field flatbuffers
  types
* A schema delta message could be sent

There is a certain awkwardness to routing encoding changes through the
Schema, though, and I can envision the implementation in the libraries
becoming rather complicated. The main trouble with adding new encoding
information at the Field level is that it would risk forward compatibility
issues (maybe implementations could observe the presence of an unrecognized
member of "Field", but I'm not sure -- Micah would know more). I think the
only thing we can safely do without jeopardizing forward compatibility is
adding new types to the "Type" union.

What I have in mind is adding a new "EncodedRecordBatch" type that provides
for:

* Field-level encodings passed with the batch message rather than the schema
* Pluggable encodings (in the way that Type is "pluggable" now -- so new
  encodings could be added to an Encodings union without breaking forward
  compatibility)
* Fields to be omitted entirely (i.e. when they are all null)

We had previously discussed adding a "Constant" encoding type, so this would
be the place to encode a constant / scalar value for a whole field/column in
a batch without trying to shoehorn it into our existing "RecordBatch" (a
rough sketch of what this could look like is below).

The other main question in my mind is how support for this new IPC metadata
type would be negotiated from a forward compatibility standpoint -- we would
want to add more metadata version negotiation in Flight implementations, for
example. The current MetadataVersion in Schema.fbs is V5, so if we added a
new batch type allowing for encodings, sparseness, etc., then we would need
to bump the MetadataVersion to V6, but libraries implementing V6 metadata
should be able to operate in V5 compatibility mode (sending non-encoded data
in the current IPC format). A second sketch of what that might mean at the
message level is at the bottom of this mail, below the quoted text.
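
To make that a bit more concrete, here is a rough flatbuffers sketch of the
kind of thing I mean. All of the names below are made up for illustration
only and are not a concrete proposal:

    // Hypothetical additions to the IPC flatbuffers files; Buffer is the
    // struct already defined in Message.fbs.
    include "Message.fbs";

    namespace org.apache.arrow.flatbuf;

    table PlainColumn {}
    table ConstantColumn {}        // one scalar value for the whole column
    table DictionaryColumn { dictionary_id: long; }
    table RunLengthColumn {}

    // Open union: new encodings get appended here, in the same way new
    // types are appended to the Type union today, so older readers see an
    // unknown enum value instead of silently misreading the data.
    union ColumnEncoding {
      PlainColumn,
      ConstantColumn,
      DictionaryColumn,
      RunLengthColumn
    }

    // One entry per field that is physically present in the batch.
    table EncodedColumn {
      schema_field_index: int;     // position of the field in the Schema
      encoding: ColumnEncoding;
      length: long;
      null_count: long;
      buffers: [Buffer];           // buffer layout depends on the encoding
    }

    table EncodedRecordBatch {
      length: long;                // logical number of rows in the batch
      // Fields omitted from this vector are implicitly all-null, which is
      // what gives us the sparseness property.
      columns: [EncodedColumn];
    }

Because the encoding travels with each batch rather than with the schema, a
producer could send a dictionary-encoded column in one batch and a plain or
constant one in the next without renegotiating anything up front.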

> [1]
> https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
>
> On Fri, Jul 29, 2022 at 7:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi all,
> >
> > Since we've been recently discussing adding new data types, memory
> > formats, or data encodings to Arrow, I wanted to bring up a more "big
> > picture" question around how we could support data whose encodings may
> > change throughout the lifetime of a data stream sent via the IPC
> > format (e.g. over Flight) or over the C data interface.
> >
> > I can think of a few common encodings which could appear dynamically
> > in a stream, and apply to basically all Arrow data types:
> >
> > * Constant (value is the same for all records in a batch)
> > * Dictionary
> > * Run-length encoded
> > * Plain
> >
> > There are some other encodings that work only for certain data types
> > (e.g. FrameOfReference for integers).
> >
> > Current Arrow record batches can either be all Plain encoded or all
> > Dictionary encoded, but the decision must be declared up front in the
> > schema. The dictionary can change, but the stream cannot stop
> > dictionary encoding.
> >
> > In Parquet files, many writers will start out all columns using
> > dictionary encoding and then switch to plain encoding if the
> > dictionary exceeds a certain size. This has led to a certain
> > awkwardness when trying to return dictionary encoded data directly
> > from a Parquet file, since the "switchover" to Plain encoding is not
> > compatible with the way that Arrow schemas work.
> >
> > In general, it's not practical for all data sources to know up front
> > what is the "best" encoding, so being able to switch from one encoding
> > to another would give Arrow producers more flexibility in their
> > choice.
> >
> > Micah Kornfield had previously put up a PR to add a new RecordBatch
> > metadata variant for the IPC format that would permit dynamic
> > encodings as well as sparseness (fields not present in the batch --
> > effectively "all null" -- currently "all null" fields in record
> > batches take up a lot of useless space)
> >
> > https://github.com/apache/arrow/pull/4815
> >
> > I think given the discussions that have been happening in and around
> > the project, that now would be a good time to rekindle this discussion
> > and see if we can come up with something that will work with the above
> > listed encodings and also provide for the beneficial sparseness
> > property. It is also timely since there are several PRs for RLE that
> > Tobias Zagorni has been working on [1], and knowing how new encodings
> > could be added to Arrow in general will have some bearing on the shape
> > of the implementation when it comes to the IPC format and the C
> > interface.
> >
> > For the Arrow C ABI, I am not sure about whether sparseness could be
> > supported, but finding a mechanism to transmit dynamically-encoded
> > data without breaking the existing C ABI would be worthwhile also.
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+rle
> >
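
Here is the second sketch mentioned above -- how the message-level plumbing
and the version bump might look. Again this is purely illustrative, not a
concrete proposal:

    // Message.fbs: append the new type so the existing enum values are
    // unchanged and V5-only readers see an unknown header they can reject.
    union MessageHeader {
      Schema,
      DictionaryBatch,
      RecordBatch,
      Tensor,
      SparseTensor,
      EncodedRecordBatch         // hypothetical, V6 only
    }

    // Schema.fbs
    enum MetadataVersion : short {
      V1, V2, V3, V4,
      V5,                        // current version
      V6                         // hypothetical: encoded / sparse batches
    }

A V6-capable producer would keep sending plain RecordBatch messages (V5
compatibility mode) until the consumer indicates that it understands V6, for
example during the Flight handshake.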