hi all,

Since we've been recently discussing adding new data types, memory
formats, or data encodings to Arrow, I wanted to bring up a more "big
picture" question around how we could support data whose encodings may
change throughout the lifetime of a data stream sent via the IPC
format (e.g. over Flight) or over the C data interface.

I can think of a few common encodings which could appear dynamically
in a stream, and apply to basically all Arrow data types:

* Constant (value is the same for all records in a batch)
* Dictionary
* Run-length encoded
* Plain

There are some other encodings that work only for certain data types
(e.g. FrameOfReference for integers).

Current Arrow record batches can either be all Plain encoded or all
Dictionary encoded, but the decision must be declared up front in the
schema. The dictionary can change, but the stream cannot stop
dictionary encoding.

In Parquet files, many writers will start out all columns using
dictionary encoding and then switch to plain encoding if the
dictionary exceeds a certain size. This has led to a certain
awkwardness when trying to return dictionary encoded data directly
from a Parquet file, since the "switchover" to Plain encoding is not
compatible with the way that Arrow schemas work.

In general, it's not practical for all data sources to know up front
what is the "best" encoding, so being able to switch from one encoding
to another would give Arrow producers more flexibility in their
choice.

Micah Kornfield had previously put up a PR to add a new RecordBatch
metadata variant for the IPC format that would permit dynamic
encodings as well as sparseness (fields not present in the batch --
effectively "all null" -- currently "all null" fields in record
batches take up a lot of useless space)

https://github.com/apache/arrow/pull/4815

I think given the discussions that have been happening in and around
the project, that now would be a good time to rekindle this discussion
and see if we can come up with something that will work with the above
listed encodings and also provide for the beneficial sparseness
property. It is also timely since there are several PRs for RLE that
Tobias Zagorni has been working on [1], and knowing how new encodings
could be added to Arrow in general will have some bearing on the same
of the implementation when it comes to the IPC format and the C
interface.

For the Arrow C ABI, I am not sure about whether sparseness could be
supported, but finding a mechanism to transmit dynamically-encoded
data without breaking the existing C ABI would be worthwhile also.

Thanks,
Wes

[1]: https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+rle

Reply via email to