> but the decision must be declared up front in the schema.

I am probably missing something obvious, but given the description of
the IPC format [1], perhaps we could support on-the-fly changes by
sending additional SCHEMA messages in the stream, rather than just a
single message at the beginning.
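
To make this concrete, here is a minimal pyarrow sketch of how a
stream is laid out today, with a comment marking where such a
mid-stream message would go (the mid-stream SCHEMA message itself is
hypothetical, not an existing API):

    import pyarrow as pa

    # Today, an IPC stream is a single SCHEMA message followed by
    # RECORD_BATCH (and DICTIONARY_BATCH) messages.
    schema = pa.schema([pa.field("x", pa.utf8())])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, schema) as writer:
        writer.write_batch(pa.record_batch([pa.array(["a", "b"])], schema=schema))
        # Hypothetical: a second SCHEMA message would be emitted here to
        # change the encoding of "x" for the remaining batches.
        writer.write_batch(pa.record_batch([pa.array(["a", "a"])], schema=schema))

    # Readers consume messages one at a time; an existing reader would
    # encounter the unexpected message type at that point and error out.
    reader = pa.ipc.open_stream(sink.getvalue())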

If (re)sending the entire schema in a new SCHEMA message is too
costly, perhaps some sort of SCHEMA_DELTA message could be devised
that transmits only the incremental schema changes. I think such a
scheme could be backwards compatible too, since existing readers
should error if they see a new message type they are not prepared to
handle.

[1] https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc

On Fri, Jul 29, 2022 at 7:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
> hi all,
>
> Since we've been recently discussing adding new data types, memory
> formats, or data encodings to Arrow, I wanted to bring up a more "big
> picture" question around how we could support data whose encodings may
> change throughout the lifetime of a data stream sent via the IPC
> format (e.g. over Flight) or over the C data interface.
>
> I can think of a few common encodings which could appear dynamically
> in a stream and apply to basically all Arrow data types:
>
> * Constant (value is the same for all records in a batch)
> * Dictionary
> * Run-length encoded
> * Plain
>
> There are some other encodings that work only for certain data types
> (e.g. FrameOfReference for integers).
>
> Current Arrow record batches can either be all Plain encoded or all
> Dictionary encoded, but the decision must be declared up front in the
> schema. The dictionary can change, but the stream cannot stop
> dictionary encoding.
>
> In Parquet files, many writers start out using dictionary encoding
> for all columns and then switch to plain encoding if the dictionary
> exceeds a certain size. This has led to a certain awkwardness when
> trying to return dictionary-encoded data directly from a Parquet
> file, since the "switchover" to Plain encoding is not compatible with
> the way that Arrow schemas work.
>
> In general, it's not practical for all data sources to know up front
> what the "best" encoding is, so being able to switch from one
> encoding to another would give Arrow producers more flexibility in
> their choice.
>
> Micah Kornfield had previously put up a PR to add a new RecordBatch
> metadata variant for the IPC format that would permit dynamic
> encodings as well as sparseness (fields not present in the batch are
> effectively "all null"; currently, "all null" fields in record
> batches take up a lot of useless space):
>
> https://github.com/apache/arrow/pull/4815
>
> I think, given the discussions that have been happening in and around
> the project, now would be a good time to rekindle this discussion and
> see if we can come up with something that works with the encodings
> listed above and also provides the beneficial sparseness property. It
> is also timely since there are several PRs for RLE that Tobias
> Zagorni has been working on [1], and knowing how new encodings could
> be added to Arrow in general will have some bearing on the shape of
> the implementation when it comes to the IPC format and the C
> interface.
>
> For the Arrow C ABI, I am not sure whether sparseness could be
> supported, but finding a mechanism to transmit dynamically encoded
> data without breaking the existing C ABI would also be worthwhile.
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+rle
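
To illustrate the "declared up front" constraint above with today's
API, here is a minimal pyarrow sketch; the commented-out write is the
Parquet-style switchover that the current stream format rejects:

    import pyarrow as pa

    # The encoding decision is part of the schema: a dictionary-encoded
    # column must be declared as a dictionary type before the stream starts.
    dict_schema = pa.schema([pa.field("x", pa.dictionary(pa.int32(), pa.utf8()))])
    plain_schema = pa.schema([pa.field("x", pa.utf8())])

    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, dict_schema) as writer:
        arr = pa.array(["a", "b", "a"]).dictionary_encode()
        writer.write_batch(pa.record_batch([arr], schema=dict_schema))
        # The dictionary itself can be replaced in later batches, but the
        # column cannot switch to plain encoding mid-stream; this write
        # raises a schema mismatch error:
        # writer.write_batch(
        #     pa.record_batch([pa.array(["c"])], schema=plain_schema))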