Extending the IPC format to support these additional flexibilities is probably the easy part.

The difficult part is shoehorning the newfound flexibility into existing APIs, which also leaks into the expectations of downstream users. For example, in C++ it is expected that a RecordBatchReader, a Table, or an AsyncGenerator<RecordBatch> will always exhibit the same Schema.
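
For illustration, a minimal C++ sketch of the contract downstream code relies on today (using the public RecordBatchReader API): the schema is queried once and assumed to hold for every subsequent batch.

    #include <arrow/api.h>

    // Minimal sketch: consumers typically query the schema once and assume
    // every batch produced afterwards conforms to it.
    arrow::Status ConsumeStream(arrow::RecordBatchReader& reader) {
      std::shared_ptr<arrow::Schema> schema = reader.schema();  // fixed for the stream
      std::shared_ptr<arrow::RecordBatch> batch;
      while (true) {
        ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
        if (batch == nullptr) break;  // end of stream
        // Callers routinely assume batch->schema()->Equals(*schema) here;
        // a mid-stream encoding change would break that assumption.
      }
      return arrow::Status::OK();
    }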

Regards

Antoine.





On 30/07/2022 at 21:35, Wes McKinney wrote:
On Sat, Jul 30, 2022 at 7:10 AM Andrew Lamb <al...@influxdata.com> wrote:

but the decision must be declared up front in the schema.

I am probably missing something obvious but given the description of the
IPC format [1] perhaps we could support "on the fly changes" by sending
additional `SCHEMA` messages in the stream, rather than just a single
message at the beginning.

If (re) sending the entire schema again in a SCHEMA is too costly, perhaps
some sort of SCHEMA_DELTA message could be devised that transmitted the
incremental schema changes.

I think such a scheme could be "backwards compatible" too as existing
readers should error if they see a new SCHEMA message they are not prepared
to handle.
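
For illustration, a rough C++ sketch (against the existing arrow::ipc::MessageReader API) of where such handling would sit on the reader side; the mid-stream SCHEMA branch is hypothetical behavior, not anything implemented today:

    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <arrow/ipc/message.h>

    // Walk the raw IPC message stream and notice a SCHEMA message arriving
    // after the first one. Current stream readers are not prepared for this;
    // the sketch only shows where a "schema changed" branch would live.
    arrow::Status ScanMessages(arrow::io::InputStream* stream) {
      std::unique_ptr<arrow::ipc::MessageReader> reader =
          arrow::ipc::MessageReader::Open(stream);
      bool seen_schema = false;
      while (true) {
        ARROW_ASSIGN_OR_RAISE(std::unique_ptr<arrow::ipc::Message> message,
                              reader->ReadNextMessage());
        if (message == nullptr) break;  // end of stream
        if (message->type() == arrow::ipc::MessageType::SCHEMA) {
          if (seen_schema) {
            // A current reader should fail here; a reader aware of the new
            // scheme could instead rebuild its schema (or apply a
            // hypothetical SCHEMA_DELTA) and continue.
            return arrow::Status::Invalid("unexpected mid-stream SCHEMA message");
          }
          seen_schema = true;
        }
        // DICTIONARY_BATCH / RECORD_BATCH messages would be decoded as usual.
      }
      return arrow::Status::OK();
    }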

In principle you are right:

* We could add new encoding information to the Schema / Field flatbuffers types
* A schema delta message could be sent

There is a certain awkwardness to dealing with things through the
Schema and I can envision the implementation in the libraries being
rather complicated.

The main trouble with adding new encoding information at the Field
level is that it would risk forward compatibility issues (maybe
implementations could observe the presence of an unrecognized member
of "Field" but I'm not sure -- Micah would know more). I think the
only thing we can safely do without jeopardizing forward compatibility
is adding new types to the "Type" union.

Adding a new "EncodedRecordBatch" type that provides for:

* Field-level encodings passed with the batch message rather than the schema
* Pluggable encodings (in the way that Type is "pluggable" now -- so
new encodings could be added to an Encodings union without breaking
forward compatibility)
* Fields to be omitted entirely (i.e. when they are all null)

We had previously discussed adding a "Constant" encoding type, so this
would be the place to encode a constant / scalar value for a whole
field/column in a batch without trying to shoehorn it into our
existing "RecordBatch".

The other main question in my mind is how support for this new IPC
metadata type would be negotiated from a forward compatibility
standpoint -- we would want to add more metadata version negotiation
in Flight implementations, for example. The current MetadataVersion in
Schema.fbs is V5, so if we added a new batch type allowing for
encodings, sparseness, etc., then we would need to bump the
MetadataVersion to V6, but libraries implementing V6 metadata should
be able to operate in V5 compatibility mode (sending non-encoded data
in the current IPC format).
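
As a rough sketch of the writer side of that negotiation (IpcWriteOptions and MetadataVersion::V5 are real C++ API; the V6 value and the peer_version argument are hypothetical and would depend on a Flight-level handshake that does not exist yet):

    #include <arrow/ipc/options.h>

    // Sketch of a V6-capable writer falling back to V5 for older peers.
    arrow::ipc::IpcWriteOptions MakeWriteOptions(int peer_version) {
      arrow::ipc::IpcWriteOptions options = arrow::ipc::IpcWriteOptions::Defaults();
      if (peer_version >= 6) {
        // options.metadata_version = arrow::ipc::MetadataVersion::V6;  // hypothetical
        // ...enable encoded / sparse record batches here...
      } else {
        // V5 compatibility mode: plain, non-encoded record batches as today.
        options.metadata_version = arrow::ipc::MetadataVersion::V5;
      }
      return options;
    }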


[1]
https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc

On Fri, Jul 29, 2022 at 7:19 PM Wes McKinney <wesmck...@gmail.com> wrote:

hi all,

Since we've been recently discussing adding new data types, memory
formats, or data encodings to Arrow, I wanted to bring up a more "big
picture" question around how we could support data whose encodings may
change throughout the lifetime of a data stream sent via the IPC
format (e.g. over Flight) or over the C data interface.

I can think of a few common encodings which could appear dynamically
in a stream, and apply to basically all Arrow data types:

* Constant (value is the same for all records in a batch)
* Dictionary
* Run-length encoded
* Plain

There are some other encodings that work only for certain data types
(e.g. FrameOfReference for integers).

Current Arrow record batches can either be all Plain encoded or all
Dictionary encoded, but the decision must be declared up front in the
schema. The dictionary can change, but the stream cannot stop
dictionary encoding.
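
For reference, a minimal C++ sketch of what "declared up front" means in practice today: the encoding is part of the field's type in the schema, so it cannot be changed or dropped later in the stream.

    #include <arrow/api.h>

    // "tags" is dictionary-encoded for the whole stream, "value" is plain for
    // the whole stream. A producer that later wants plain strings for "tags"
    // cannot express that within this stream.
    std::shared_ptr<arrow::Schema> MakeStreamSchema() {
      auto tags_type = arrow::dictionary(arrow::int32(), arrow::utf8());
      return arrow::schema({
          arrow::field("tags", tags_type),
          arrow::field("value", arrow::float64()),
      });
    }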

In Parquet files, many writers will start out all columns using
dictionary encoding and then switch to plain encoding if the
dictionary exceeds a certain size. This has led to a certain
awkwardness when trying to return dictionary encoded data directly
from a Parquet file, since the "switchover" to Plain encoding is not
compatible with the way that Arrow schemas work.
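
For context, a minimal sketch of the writer-side knobs in Parquet C++ (dictionary encoding is on by default; the writer falls back to plain once a column's dictionary page exceeds the configured limit, which is exactly the switchover a fixed Arrow schema cannot express):

    #include <parquet/properties.h>

    // Parquet writer properties: dictionary encoding is enabled by default and
    // the writer falls back to PLAIN for a column chunk once its dictionary
    // grows past the configured page size limit.
    std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
      return parquet::WriterProperties::Builder()
          .enable_dictionary()                      // default, shown for clarity
          ->dictionary_pagesize_limit(1024 * 1024)  // fall back past ~1 MiB
          ->build();
    }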

In general, it's not practical for all data sources to know up front
what is the "best" encoding, so being able to switch from one encoding
to another would give Arrow producers more flexibility in their
choice.

Micah Kornfield had previously put up a PR to add a new RecordBatch
metadata variant for the IPC format that would permit dynamic
encodings as well as sparseness (fields not present in the batch are
effectively "all null"; currently, "all null" fields in record
batches take up a lot of useless space):

https://github.com/apache/arrow/pull/4815
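
To illustrate the sparseness point, a small sketch using the current C++ API (exact byte counts depend on the type and on buffer truncation): an all-null column still gets buffers written into the IPC body even though it conveys nothing beyond its length.

    #include <arrow/api.h>
    #include <arrow/ipc/writer.h>

    // An "all null" column today still carries buffers that are written into
    // the IPC body; a sparse batch format could omit the field entirely.
    arrow::Result<int64_t> AllNullBatchSize(int64_t num_rows) {
      ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> nulls,
                            arrow::MakeArrayOfNull(arrow::utf8(), num_rows));
      auto schema = arrow::schema({arrow::field("unused", arrow::utf8())});
      auto batch = arrow::RecordBatch::Make(schema, num_rows, {nulls});
      int64_t size = 0;
      ARROW_RETURN_NOT_OK(arrow::ipc::GetRecordBatchSize(*batch, &size));
      return size;  // includes the all-null column's serialized buffers
    }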

I think that, given the discussions that have been happening in and
around the project, now would be a good time to rekindle this discussion
and see if we can come up with something that will work with the above
listed encodings and also provide for the beneficial sparseness
property. It is also timely since there are several PRs for RLE that
Tobias Zagorni has been working on [1], and knowing how new encodings
could be added to Arrow in general will have some bearing on the shape
of the implementation when it comes to the IPC format and the C
interface.

For the Arrow C ABI, I am not sure about whether sparseness could be
supported, but finding a mechanism to transmit dynamically-encoded
data without breaking the existing C ABI would be worthwhile also.
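
For reference, this is the ArrowSchema struct from the published C data interface specification; its layout is ABI-frozen, so new struct members are off the table, and any encoding information would presumably have to travel through the existing format or metadata strings (or a new, separate struct):

    #include <stdint.h>

    // ArrowSchema as defined by the Arrow C data interface specification.
    struct ArrowSchema {
      const char* format;              // type description, e.g. "u" for utf8
      const char* name;
      const char* metadata;            // binary-encoded key/value metadata
      int64_t flags;
      int64_t n_children;
      struct ArrowSchema** children;
      struct ArrowSchema* dictionary;  // set when the column is dictionary-encoded
      void (*release)(struct ArrowSchema*);
      void* private_data;
    };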

Thanks,
Wes

[1]: https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+rle
