hi all, Since we've been recently discussing adding new data types, memory formats, or data encodings to Arrow, I wanted to bring up a more "big picture" question around how we could support data whose encodings may change throughout the lifetime of a data stream sent via the IPC format (e.g. over Flight) or over the C data interface.
I can think of a few common encodings which could appear dynamically in a stream, and apply to basically all Arrow data types: * Constant (value is the same for all records in a batch) * Dictionary * Run-length encoded * Plain There are some other encodings that work only for certain data types (e.g. FrameOfReference for integers). Current Arrow record batches can either be all Plain encoded or all Dictionary encoded, but the decision must be declared up front in the schema. The dictionary can change, but the stream cannot stop dictionary encoding. In Parquet files, many writers will start out all columns using dictionary encoding and then switch to plain encoding if the dictionary exceeds a certain size. This has led to a certain awkwardness when trying to return dictionary encoded data directly from a Parquet file, since the "switchover" to Plain encoding is not compatible with the way that Arrow schemas work. In general, it's not practical for all data sources to know up front what is the "best" encoding, so being able to switch from one encoding to another would give Arrow producers more flexibility in their choice. Micah Kornfield had previously put up a PR to add a new RecordBatch metadata variant for the IPC format that would permit dynamic encodings as well as sparseness (fields not present in the batch -- effectively "all null" -- currently "all null" fields in record batches take up a lot of useless space) https://github.com/apache/arrow/pull/4815 I think given the discussions that have been happening in and around the project, that now would be a good time to rekindle this discussion and see if we can come up with something that will work with the above listed encodings and also provide for the beneficial sparseness property. It is also timely since there are several PRs for RLE that Tobias Zagorni has been working on [1], and knowing how new encodings could be added to Arrow in general will have some bearing on the same of the implementation when it comes to the IPC format and the C interface. For the Arrow C ABI, I am not sure about whether sparseness could be supported, but finding a mechanism to transmit dynamically-encoded data without breaking the existing C ABI would be worthwhile also. Thanks, Wes [1]: https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+rle