> but the decision must be declared up front in the schema.

I am probably missing something obvious, but given the description of the
IPC format [1], perhaps we could support on-the-fly changes by sending
additional `SCHEMA` messages in the stream, rather than just a single
message at the beginning.
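
For illustration, here is a minimal pyarrow sketch of the message sequence
in a stream today; `MessageReader` is real pyarrow API, and the closing
comment marks where extra `SCHEMA` messages would slot in under this idea:

    import pyarrow as pa

    # Build a small two-batch stream, then inspect the raw IPC messages.
    batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
        writer.write_batch(batch)

    reader = pa.ipc.MessageReader.open_stream(pa.BufferReader(sink.getvalue()))
    while True:
        try:
            print(reader.read_next_message().type)
        except StopIteration:
            break
    # Prints "schema", then "record batch" twice: exactly one SCHEMA
    # message, always first. Under this proposal, further SCHEMA
    # messages could appear between record batches.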

If (re)sending the entire schema in a `SCHEMA` message is too costly,
perhaps some sort of `SCHEMA_DELTA` message could be devised that
transmits only the incremental schema changes.
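
To make the delta idea concrete, one hypothetical shape for such a message
is sketched below; the name `SCHEMA_DELTA` and all of its fields are
invented for illustration, though the spirit mirrors the isDelta flag that
dictionary batches already carry in the IPC format:

    from dataclasses import dataclass, field

    # Hypothetical payload -- none of these names exist in the Arrow format.
    @dataclass
    class SchemaDelta:
        # Replacement Fields for columns whose type or encoding changes,
        # keyed by position in the current schema.
        changed_fields: dict = field(default_factory=dict)
        # New Fields appended to the schema.
        added_fields: list = field(default_factory=list)

    # e.g. column 0 switching from dictionary<string> back to plain string:
    #   SchemaDelta(changed_fields={0: pa.field("s", pa.string())})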

I think such a scheme could be "backwards compatible" too, as existing
readers should error out if they encounter a mid-stream `SCHEMA` message
they are not prepared to handle.

[1] https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc

On Fri, Jul 29, 2022 at 7:19 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi all,
>
> Since we've been recently discussing adding new data types, memory
> formats, or data encodings to Arrow, I wanted to bring up a more "big
> picture" question around how we could support data whose encodings may
> change throughout the lifetime of a data stream sent via the IPC
> format (e.g. over Flight) or over the C data interface.
>
> I can think of a few common encodings which could appear dynamically
> in a stream and which apply to basically all Arrow data types (a rough
> sketch follows the list):
>
> * Constant (value is the same for all records in a batch)
> * Dictionary
> * Run-length encoded
> * Plain
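>
> As a rough sketch, here is the same column under each of these; the
> dictionary case uses real pyarrow API, while constant and RLE appear
> as comments because they are not expressible in pyarrow today:
>
>     import pyarrow as pa
>
>     plain = pa.array(["a", "a", "a", "b", "b", "a"])  # Plain
>     dictionary = plain.dictionary_encode()            # Dictionary
>     # dictionary.indices    -> [0, 0, 0, 1, 1, 0]
>     # dictionary.dictionary -> ["a", "b"]
>     # Run-length encoded: runs of (value, length), here
>     #   ("a", 3), ("b", 2), ("a", 1)
>     # Constant: a single value standing in for every row, usable when
>     #   an entire column within a batch repeats one value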
>
> There are some other encodings that work only for certain data types
> (e.g. FrameOfReference for integers).
>
> Current Arrow record batches can either be all Plain encoded or all
> Dictionary encoded, but the decision must be declared up front in the
> schema. The dictionary can change, but the stream cannot stop
> dictionary encoding.
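>
> For example, in pyarrow the encoding is baked into the declared field
> type, so a writer cannot drop dictionary encoding mid-stream (a
> sketch; the exact error text may differ by version):
>
>     import pyarrow as pa
>
>     schema = pa.schema([("s", pa.dictionary(pa.int32(), pa.string()))])
>     dict_batch = pa.record_batch(
>         [pa.array(["x", "y", "x"]).dictionary_encode()], schema=schema)
>     plain_batch = pa.record_batch([pa.array(["x", "y", "x"])],
>                                   names=["s"])
>
>     sink = pa.BufferOutputStream()
>     with pa.ipc.new_stream(sink, schema) as writer:
>         writer.write_batch(dict_batch)   # fine; the dictionary itself
>                                          # may change between batches
>         writer.write_batch(plain_batch)  # raises: schema mismatch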
>
> In Parquet files, many writers will start out all columns using
> dictionary encoding and then switch to plain encoding if the
> dictionary exceeds a certain size. This has led to a certain
> awkwardness when trying to return dictionary encoded data directly
> from a Parquet file, since the "switchover" to Plain encoding is not
> compatible with the way that Arrow schemas work.
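>
> For instance, with pyarrow's Parquet module (these parameters exist in
> recent pyarrow; the file name is arbitrary):
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     # High-cardinality column: the dictionary outgrows the tiny limit
>     # below, so the writer falls back to plain encoding mid-file.
>     table = pa.table({"s": ["value-%d" % i for i in range(100_000)]})
>     pq.write_table(table, "demo.parquet", use_dictionary=True,
>                    dictionary_pagesize_limit=1024)
>
>     # Asking for the column back as dictionary-encoded maps cleanly
>     # onto a fixed Arrow schema only while every page stayed
>     # dictionary encoded; the mid-file switchover is the awkward part.
>     out = pq.read_table("demo.parquet", read_dictionary=["s"])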
>
> In general, it's not practical for all data sources to know up front
> what the "best" encoding is, so being able to switch from one encoding
> to another would give Arrow producers more flexibility in their
> choice.
>
> Micah Kornfield had previously put up a PR to add a new RecordBatch
> metadata variant for the IPC format that would permit dynamic
> encodings as well as sparseness (fields not present in the batch,
> effectively "all null"; currently, all-null fields in record batches
> take up a lot of useless space):
>
> https://github.com/apache/arrow/pull/4815
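>
> A quick way to see that cost (a sketch; exact sizes vary slightly):
>
>     import pyarrow as pa
>
>     n = 1_000_000
>     batch = pa.record_batch(
>         [pa.array([1.0] * n), pa.array([None] * n, type=pa.float64())],
>         names=["dense", "all_null"])
>
>     sink = pa.BufferOutputStream()
>     with pa.ipc.new_stream(sink, batch.schema) as writer:
>         writer.write_batch(batch)
>     # ~16 MB: the all-null column ships a full 8-byte-per-row value
>     # buffer that carries no information at all.
>     print(len(sink.getvalue()))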
>
> Given the discussions that have been happening in and around the
> project, I think now would be a good time to rekindle this discussion
> and see if we can come up with something that will work with the
> above-listed encodings and also provide for the beneficial sparseness
> property. It is also timely since there are several PRs for RLE that
> Tobias Zagorni has been working on [1], and knowing how new encodings
> could be added to Arrow in general will have some bearing on the shape
> of the implementation when it comes to the IPC format and the C
> interface.
>
> For the Arrow C ABI, I am not sure whether sparseness could be
> supported, but finding a mechanism to transmit dynamically encoded
> data without breaking the existing C ABI would also be worthwhile.
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+rle
>
