On Sat, Jul 30, 2022 at 7:10 AM Andrew Lamb <al...@influxdata.com> wrote:
>
> > but the decision must be declared up front in the schema.
>
> I am probably missing something obvious but given the description of the
> IPC format [1] perhaps we could support "on the fly changes" by sending
> additional `SCHEMA` messages in the stream, rather than just a single
> message at the beginning.
>
> If (re) sending the entire schema again in a SCHEMA is too costly, perhaps
> some sort of SCHEMA_DELTA message could be devised that transmitted the
> incremental schema changes.
>
> I think such a scheme could be "backwards compatible" too as existing
> readers should error if they see a new SCHEMA message they are not prepared
> to handle.

In principle you are right:

* We could add new encoding information to the Schema / Field flatbuffers
  types
* A schema delta message could be sent

There is a certain awkwardness to routing encoding changes through the
Schema, though, and I can envision the implementation in the libraries
becoming rather complicated. The main trouble with adding new encoding
information at the Field level is that it would risk forward compatibility
issues (maybe implementations could observe the presence of an unrecognized
member of "Field", but I'm not sure -- Micah would know more). I think the
only thing we can safely do without jeopardizing forward compatibility is
adding new types to the "Type" union.

What I have in mind is adding a new "EncodedRecordBatch" type that provides
for:

* Field-level encodings passed with the batch message rather than the schema
* Pluggable encodings (in the way that Type is "pluggable" now -- so new
  encodings could be added to an Encodings union without breaking forward
  compatibility)
* Fields to be omitted entirely (i.e. when they are all null)

We had previously discussed adding a "Constant" encoding type, so this would
be the place to encode a constant / scalar value for a whole field/column in
a batch without trying to shoehorn it into our existing "RecordBatch" (a
rough sketch of what this could look like is below).

The other main question in my mind is how support for this new IPC metadata
type would be negotiated from a forward compatibility standpoint -- we would
want to add more metadata version negotiation in Flight implementations, for
example. The current MetadataVersion in Schema.fbs is V5, so if we added a
new batch type allowing for encodings, sparseness, etc., then we would need
to bump the MetadataVersion to V6, but libraries implementing V6 metadata
should be able to operate in V5 compatibility mode (sending non-encoded data
in the current IPC format). A second sketch of what that might mean at the
message level is at the bottom of this mail, below the quoted text.
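
To make that a bit more concrete, here is a rough flatbuffers sketch of the
kind of thing I mean. All of the names below are made up for illustration
only and are not a concrete proposal:

    // Hypothetical additions to the IPC flatbuffers files; Buffer is the
    // struct already defined in Message.fbs.
    include "Message.fbs";

    namespace org.apache.arrow.flatbuf;

    table PlainColumn {}
    table ConstantColumn {}        // one scalar value for the whole column
    table DictionaryColumn { dictionary_id: long; }
    table RunLengthColumn {}

    // Open union: new encodings get appended here, in the same way new
    // types are appended to the Type union today, so older readers see an
    // unknown enum value instead of silently misreading the data.
    union ColumnEncoding {
      PlainColumn,
      ConstantColumn,
      DictionaryColumn,
      RunLengthColumn
    }

    // One entry per field that is physically present in the batch.
    table EncodedColumn {
      schema_field_index: int;     // position of the field in the Schema
      encoding: ColumnEncoding;
      length: long;
      null_count: long;
      buffers: [Buffer];           // buffer layout depends on the encoding
    }

    table EncodedRecordBatch {
      length: long;                // logical number of rows in the batch
      // Fields omitted from this vector are implicitly all-null, which is
      // what gives us the sparseness property.
      columns: [EncodedColumn];
    }

Because the encoding travels with each batch rather than with the schema, a
producer could send a dictionary-encoded column in one batch and a plain or
constant one in the next without renegotiating anything up front.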

> [1]
> https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
>
> On Fri, Jul 29, 2022 at 7:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi all,
> >
> > Since we've been recently discussing adding new data types, memory
> > formats, or data encodings to Arrow, I wanted to bring up a more "big
> > picture" question around how we could support data whose encodings may
> > change throughout the lifetime of a data stream sent via the IPC
> > format (e.g. over Flight) or over the C data interface.
> >
> > I can think of a few common encodings which could appear dynamically
> > in a stream, and apply to basically all Arrow data types:
> >
> > * Constant (value is the same for all records in a batch)
> > * Dictionary
> > * Run-length encoded
> > * Plain
> >
> > There are some other encodings that work only for certain data types
> > (e.g. FrameOfReference for integers).
> >
> > Current Arrow record batches can either be all Plain encoded or all
> > Dictionary encoded, but the decision must be declared up front in the
> > schema. The dictionary can change, but the stream cannot stop
> > dictionary encoding.
> >
> > In Parquet files, many writers will start out all columns using
> > dictionary encoding and then switch to plain encoding if the
> > dictionary exceeds a certain size. This has led to a certain
> > awkwardness when trying to return dictionary encoded data directly
> > from a Parquet file, since the "switchover" to Plain encoding is not
> > compatible with the way that Arrow schemas work.
> >
> > In general, it's not practical for all data sources to know up front
> > what is the "best" encoding, so being able to switch from one encoding
> > to another would give Arrow producers more flexibility in their
> > choice.
> >
> > Micah Kornfield had previously put up a PR to add a new RecordBatch
> > metadata variant for the IPC format that would permit dynamic
> > encodings as well as sparseness (fields not present in the batch --
> > effectively "all null" -- currently "all null" fields in record
> > batches take up a lot of useless space)
> >
> > https://github.com/apache/arrow/pull/4815
> >
> > I think given the discussions that have been happening in and around
> > the project, that now would be a good time to rekindle this discussion
> > and see if we can come up with something that will work with the above
> > listed encodings and also provide for the beneficial sparseness
> > property. It is also timely since there are several PRs for RLE that
> > Tobias Zagorni has been working on [1], and knowing how new encodings
> > could be added to Arrow in general will have some bearing on the shape
> > of the implementation when it comes to the IPC format and the C
> > interface.
> >
> > For the Arrow C ABI, I am not sure about whether sparseness could be
> > supported, but finding a mechanism to transmit dynamically-encoded
> > data without breaking the existing C ABI would be worthwhile also.
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+rle
> >
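
Here is the second sketch mentioned above -- how the message-level plumbing
and the version bump might look. Again this is purely illustrative, not a
concrete proposal:

    // Message.fbs: append the new type so the existing enum values are
    // unchanged and V5-only readers see an unknown header they can reject.
    union MessageHeader {
      Schema,
      DictionaryBatch,
      RecordBatch,
      Tensor,
      SparseTensor,
      EncodedRecordBatch         // hypothetical, V6 only
    }

    // Schema.fbs
    enum MetadataVersion : short {
      V1, V2, V3, V4,
      V5,                        // current version
      V6                         // hypothetical: encoded / sparse batches
    }

A V6-capable producer would keep sending plain RecordBatch messages (V5
compatibility mode) until the consumer indicates that it understands V6, for
example during the Flight handshake.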