Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

Wes McKinney Sat, 29 Jun 2019 13:15:01 -0700

hey Jacques,

I think in the case of dictionary encoding, any algorithm should be
operating with the dictionary for a particular field in a record batch
already in hand. Certain algorithms optimized for dictionary-encoded
data (like hash aggregations) may have to branch at fragment merge
steps (where the reduction may have been performed on different
dictionary indices).


So if we had:

fragment A
indices [0, 1, 2, 0, 1]
dictionary: ['a', 'b', 'c']

fragment B
indices [0, 1, 2, 0, 1]
dictionary: ['a', 'c', 'b']

then the COUNT aggregation would be merging partials

fragment A
partials: [(0, 2), (0, 2), (0, 1)]
dictionary: ['a', 'b', 'c']

fragment B
partials: [(0, 2), (0, 2), (0, 1)]
dictionary: ['a', 'c', 'b']

So in this case a unification is needed, where in other cases it could
be skipped if the dictionary is unchanged.

In other cases, optimized kernels (e.g. string operations) can
dispatch to the dictionary and leave the indices untouched, so we
could have

indices [0, 1, 2, 0, 1, 2]
dictionary [foo, bar, baz]

then the UPPER kernel would yield

indices [0, 1, 2, 0, 1, 2]
dictionary [FOO, BAR, BAZ]

and pass through the unmodified indices memory.

It's a double-edged sword -- it will be interesting to see how things
develop at the compiler level (Gandiva, or similar)

- Wes

On Fri, Jun 28, 2019 at 9:01 PM Jacques Nadeau <[email protected]> wrote:
>
> I think we need to start separating out dataset behavior from base IPC
> behavior. Having worked with this kind of structure in both Drill (where
> things were entirely late bound dynamic) and Dremio (where we start with
> schema and restart if we identify schema change), I strongly recommend that
> "dataset behavior" be strongly schemaed with pre-known schemas. This also
> includes avoiding dictionary restatement/mutation. It's fine for IPC to
> support this but I think we shouldn't pollute higher level concepts with
> it. For example, a sort kernel should or gandiva runtime compilation should
> not be expected to deal with any of these things. In a generalized engine,
> the amount of recompilation and optimization you have to do makes the code
> far more difficult to build and maintain.
>
> In short, being too accommodating of change and dynamicism substantially
> stifles the progress of innovation and completeness of the datasets
> approach.
>
> On Thu, Jun 27, 2019 at 7:49 PM Fan Liya <[email protected]> wrote:
>
> > @Wes McKinney, I see your comments. Thank you so much.
> >
> > I agree with you that the schema and dictionary should be separated.
> > However, according to the current Java implementation, the dictionary is
> > attached to the schema, so a refactoring is required.
> >
> > BTW, a somewhat related problem is that the data schema is not immutable
> > even if we remove dictionary from it.
> > As data are inserted to the batch, the schema may change. A problem related
> > to this is found in https://issues.apache.org/jira/browse/ARROW-5658
> > Do you think that violates the IPC protocol?
> >
> > Best,
> > Liya Fan
> >
> >
> > On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney <[email protected]> wrote:
> >
> > > hi Liya,
> > >
> > > I left a couple of comments in the document. You might look at what we
> > > have developed in C++ and JavaSript which is more mature and widely
> > > used in those languages than what currently exists in Java.
> > >
> > > In particular, I strongly encourage you to avoid creating a coupling
> > > between the Schema (which indicates the "index type" and the
> > > "dictionary type") and the Data (which contains the "dictionary
> > > indices" and the "dictionary values"). If you do this, then it is
> > > nearly impossible to account for dictionary changes in a stream of
> > > data, which is already supported in the IPC protocol, see
> > >
> > >
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
> > >
> > > The protocol does not permit dictionaries to change order altogether,
> > > but I think it should. I just opened
> > > https://issues.apache.org/jira/browse/ARROW-5767
> > >
> > > We made this mistake early in the C++ library and found it difficult
> > > to have to reconciled dictionaries produced by different threads of
> > > execution. See ARROW-3144 and the patch I wrote in this
> > > release cycle which addressed this design flaw.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Wed, Jun 12, 2019 at 5:33 AM Fan Liya <[email protected]> wrote:
> > > >
> > > > @Micah Kornfield <[email protected]> Thanks a lot for your
> > comments.
> > > >
> > > > In the doc, we identify 3 problems for the current dictionary encoding
> > > use
> > > > case (there can be more, so please give your valuable suggestions):
> > > >
> > > > 1. there should be a convenient way to provide access to both
> > > > encoded/decoded data.
> > > > 2. the constructor for the schema with dictionary is misleading.
> > > > 3. we should provide a way to do the encoding/decoding during
> > > > serialization/deserialization, so the encoding/decoding can be
> > > transparent
> > > > to the user.
> > > >
> > > > The current PR provides a solution for problem 2, and it is a
> > relatively
> > > > small change, so we think we can merge this PR first.
> > > >
> > > > Solutions for problem 1 and 3 should be chosen carefully so as not to
> > > > affect existing APIs. So more discussions/designs for problems 1 and 3
> > > are
> > > > desired.
> > > > Do you think it reasonable?
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > >
> > > >
> > > > On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield <[email protected]
> > >
> > > > wrote:
> > > >
> > > > > Hi Liya Fan,
> > > > > Thanks you for doing this.  I need to take a closer look at the PR in
> > > > > question and the dictionary encoding code but this seems like it is
> > on
> > > the
> > > > > right track.
> > > > >
> > > > > Could other java contributors with more familiarity in the space look
> > > over
> > > > > the document to make sure it makes sense to them?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya <[email protected]>
> > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > This is concerning issue ARROW-3396.
> > > > > >
> > > > > > I have summarized the problem (please see if my understanding is
> > > > > correct),
> > > > > > and proposed some solutions to it. Please give your valuable
> > > feedback.
> > > > > > For details, please see:
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > >
> > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing
> > > > > >
> > > > > > Thank you in advance.
> > > > > >
> > > > > > Best,
> > > > > > Liya Fan
> > > > > >
> > > > >
> > >
> >

Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

Reply via email to