Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Wes McKinney Wed, 30 Oct 2019 13:13:35 -0700

Returning to this discussion as there seems to lack consensus in the vote thread

Copying Micah's proposals in the VOTE thread here, I wanted to state
my opinions so we can discuss further and see where there is potential
disagreement

1.  It is not required that all dictionary batches occur at the beginning
of the IPC stream format (if a the first record batch has an all null
dictionary encoded column, the null column's dictionary might not be sent
until later in the stream).

This seems preferable to requiring a placeholder empty dictionary
batch. This does mean more to test but the integration tests will
force the issue

2.  A second dictionary batch for the same ID that is not a "delta batch"
in an IPC stream indicates the dictionary should be replaced.

Agree.

3.  Clarifies that the file format, can only contain 1 "NON-delta"
dictionary batch and multiple "delta" dictionary batches.

Agree -- it is also worth stating explicitly that dictionary
replacements are not allowed in the file format.

In the file format, all the dictionaries must be "loaded" up front.
The code path for loading the dictionaries ideally should use nearly
the same code as the stream-reader code that sees follow-up dictionary
batches interspersed in the stream. The only downside is that it will
not be possible to exactly preserve the dictionary "state" as of each
record batch being written.

So if we had a file containing

DICTIONARY ID=0
RECORD BATCH
RECORD BATCH
DICTIONARY DELTA ID=0
RECORD BATCH
RECORD BATCH

Then after processing/loading the dictionaries, the first two record
batches will have a dictionary that is "larger" (on account of the
delta) than when they were written. Since dictionaries are
fundamentally about data representation, they still represent the same
data so I think this is acceptable.

4.  Add an enum to dictionary metadata for possible future changes in what
format dictionary batches can be sent. (the most likely would be an array
Map<Int, Value>).  An enum is needed as a place holder to allow for forward
compatibility past the release 1.0.0.

I'm least sure about this but I do not think it is harmful to have a
forward-compatible "escape hatch" for future evolutions in dictionary
encoding.

On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <[email protected]> wrote:
>
> I'll plan on starting a vote in the next day or two if there are no further
> objections/comments.
>
> On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield <[email protected]>
> wrote:
>
> > I think the only point asked on the PR that I think is worth discussing is
> > assumptions about dictionaries at the beginning of streams.
> >
> > There are two options:
> > 1.  Based on the current wording, it does not seem that all dictionaries
> > need to be at the beginning of the stream if they aren't made use of in the
> > first record batch (i.e. a dictionary encoded column is all null in the
> > first record batch).
> > 2.  We require a dictionary batch for each dictionary at the beginning of
> > the stream (and require implementations to send an empty batch if they
> > don't have the dictionary available).
> >
> > The current proposal in the PR is option #1.
> >
> > Thanks,
> > Micah
> >
> > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <[email protected]>
> > wrote:
> >
> >> I've opened a pull request [1] to clarify some recent conversations about
> >> semantics/edge cases for dictionary encoding [2][3] around interleaved
> >> batches and when isDelta=False.
> >>
> >> Specifically, it proposes isDelta=False indicates dictionary
> >> replacement.  For the file format, only one isDelta=False batch is allowed
> >> per file and isDelta=true batches are applied in the order supplied file
> >> footer.
> >>
> >> In addition, I've added a new enum to DictionaryEncoding to preserve
> >> future compatibility in case we want to expand dictionary encoding to be an
> >> explicit mapping from "ID" to "VALUE" as discussed in [4].
> >>
> >> Once people have had a change to review and come to a consensus. I will
> >> call a formal vote to approve the change commit the change.
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://github.com/apache/arrow/pull/5585
> >> [2]
> >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> >> [3]
> >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> >> [4]
> >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
> >>
> >>

Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Reply via email to