Hello,
As discussed on [1], I've opened a PR [2] that proposes the following
clarifications to the specification:

1.  Dictionary batches are not all required to occur at the beginning of
the IPC stream (if the first record batch has an all-null
dictionary-encoded column, that column's dictionary might not be sent
until later in the stream).

2.  A second dictionary batch for the same ID that is not a "delta batch"
in an IPC stream indicates the dictionary should be replaced.

3.  The file format can contain only one non-delta dictionary batch per
dictionary ID, plus any number of delta dictionary batches. Dictionary
replacement is not supported in the file format.

4.  Add an enum to the dictionary metadata to allow for possible future
changes to the format in which dictionary batches are sent (the most likely
being a Map<Int, Value> array). The enum is needed as a placeholder to
allow for forward compatibility past the 1.0.0 release.
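
To make points 1-3 concrete, here is a rough sketch in plain Python of how a
reader might track dictionary state. It is not taken from the PR or from any
Arrow implementation: DictionaryBatch, apply_stream_batch and
check_file_batches are hypothetical names used only for illustration, and
rejecting a delta for an unknown ID is an assumption of this sketch rather
than something stated above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class DictionaryBatch:
    """Hypothetical stand-in for an IPC dictionary batch message."""
    dict_id: int
    values: Sequence[str]
    is_delta: bool = False


def apply_stream_batch(state: Dict[int, List[str]], batch: DictionaryBatch) -> None:
    """Update the reader's dictionary state for one batch seen in a stream."""
    if batch.is_delta:
        # Assumption of this sketch: a delta needs an existing dictionary.
        if batch.dict_id not in state:
            raise ValueError(f"delta for unknown dictionary id {batch.dict_id}")
        # Delta batch: append the new values to the existing dictionary.
        state[batch.dict_id].extend(batch.values)
    else:
        # Non-delta batch: first definition, or a replacement (point 2).
        state[batch.dict_id] = list(batch.values)


def check_file_batches(batches: Sequence[DictionaryBatch]) -> None:
    """Reject a second non-delta batch for the same ID (point 3)."""
    seen = set()
    for batch in batches:
        if batch.is_delta:
            continue
        if batch.dict_id in seen:
            raise ValueError(
                f"file format: dictionary {batch.dict_id} cannot be replaced"
            )
        seen.add(batch.dict_id)


state: Dict[int, List[str]] = {}
# Point 1: nothing is known about dictionary 0 until its first batch
# arrives, which may be after some (all-null) record batches.
apply_stream_batch(state, DictionaryBatch(0, ["a", "b"]))
apply_stream_batch(state, DictionaryBatch(0, ["c"], is_delta=True))
apply_stream_batch(state, DictionaryBatch(0, ["x"]))  # replaces the dictionary
```

In the stream case the reader simply keeps the latest state per ID, while the
file check refuses a second non-delta batch for the same ID, since
replacement is not supported there.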

If accepted, there will be work in all implementations to make sure that
they cover the edge cases highlighted, and additional integration testing
will be needed.

Please vote whether to accept these additions. The vote will be open for at
least 72 hours.

[ ] +1 Accept these changes to the specification
[ ] +0
[ ] -1 Do not accept the changes because...

Thanks,
Micah


[1]
https://lists.apache.org/thread.html/d0f137e9db0abfcfde2ef879ca517a710f620e5be4dd749923d22c37@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/5585
