Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-29 Thread Wes McKinney
hey Jacques,

I think in the case of dictionary encoding, any algorithm should be
operating with the dictionary for a particular field in a record batch
already in hand. Certain algorithms optimized for dictionary-encoded
data (like hash aggregations) may have to branch at fragment merge
steps (where the reduction may have been performed on different
dictionary indices).

So if we had:

fragment A
indices [0, 1, 2, 0, 1]
dictionary: ['a', 'b', 'c']

fragment B
indices [0, 1, 2, 0, 1]
dictionary: ['a', 'c', 'b']

then the COUNT aggregation would be merging partials

fragment A
partials: [(0, 2), (0, 2), (0, 1)]
dictionary: ['a', 'b', 'c']

fragment B
partials: [(0, 2), (0, 2), (0, 1)]
dictionary: ['a', 'c', 'b']

So in this case a unification is needed, where in other cases it could
be skipped if the dictionary is unchanged.

In other cases, optimized kernels (e.g. string operations) can
dispatch to the dictionary and leave the indices untouched, so we
could have

indices [0, 1, 2, 0, 1, 2]
dictionary [foo, bar, baz]

then the UPPER kernel would yield

indices [0, 1, 2, 0, 1, 2]
dictionary [FOO, BAR, BAZ]

and pass through the unmodified indices memory.

It's a double-edged sword -- it will be interesting to see how things
develop at the compiler level (Gandiva, or similar)

- Wes

On Fri, Jun 28, 2019 at 9:01 PM Jacques Nadeau  wrote:
>
> I think we need to start separating out dataset behavior from base IPC
> behavior. Having worked with this kind of structure in both Drill (where
> things were entirely late bound dynamic) and Dremio (where we start with
> schema and restart if we identify schema change), I strongly recommend that
> "dataset behavior" be strongly schemaed with pre-known schemas. This also
> includes avoiding dictionary restatement/mutation. It's fine for IPC to
> support this but I think we shouldn't pollute higher level concepts with
> it. For example, a sort kernel should or gandiva runtime compilation should
> not be expected to deal with any of these things. In a generalized engine,
> the amount of recompilation and optimization you have to do makes the code
> far more difficult to build and maintain.
>
> In short, being too accommodating of change and dynamicism substantially
> stifles the progress of innovation and completeness of the datasets
> approach.
>
> On Thu, Jun 27, 2019 at 7:49 PM Fan Liya  wrote:
>
> > @Wes McKinney, I see your comments. Thank you so much.
> >
> > I agree with you that the schema and dictionary should be separated.
> > However, according to the current Java implementation, the dictionary is
> > attached to the schema, so a refactoring is required.
> >
> > BTW, a somewhat related problem is that the data schema is not immutable
> > even if we remove dictionary from it.
> > As data are inserted to the batch, the schema may change. A problem related
> > to this is found in https://issues.apache.org/jira/browse/ARROW-5658
> > Do you think that violates the IPC protocol?
> >
> > Best,
> > Liya Fan
> >
> >
> > On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney  wrote:
> >
> > > hi Liya,
> > >
> > > I left a couple of comments in the document. You might look at what we
> > > have developed in C++ and JavaSript which is more mature and widely
> > > used in those languages than what currently exists in Java.
> > >
> > > In particular, I strongly encourage you to avoid creating a coupling
> > > between the Schema (which indicates the "index type" and the
> > > "dictionary type") and the Data (which contains the "dictionary
> > > indices" and the "dictionary values"). If you do this, then it is
> > > nearly impossible to account for dictionary changes in a stream of
> > > data, which is already supported in the IPC protocol, see
> > >
> > >
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
> > >
> > > The protocol does not permit dictionaries to change order altogether,
> > > but I think it should. I just opened
> > > https://issues.apache.org/jira/browse/ARROW-5767
> > >
> > > We made this mistake early in the C++ library and found it difficult
> > > to have to reconciled dictionaries produced by different threads of
> > > execution. See ARROW-3144 and the patch I wrote in this
> > > release cycle which addressed this design flaw.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Wed, Jun 12, 2019 at 5:33 AM Fan Liya  wrote:
> > > >
> > > > @Micah Kornfield  Thanks a lot for your
> > comments.
> > > >
> > > > In the doc, we identify 3 problems for the current dictionary encoding
> > > use
> > > > case (there can be more, so please give your valuable suggestions):
> > > >
> > > > 1. there should be a convenient way to provide access to both
> > > > encoded/decoded data.
> > > > 2. the constructor for the schema with dictionary is misleading.
> > > > 3. we should provide a way to do the encoding/decoding during
> > > > serialization/deserialization, so the encoding/decoding can be
> > > transparent
> > > > to the user.
> > > >
> 

Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-28 Thread Jacques Nadeau
I think we need to start separating out dataset behavior from base IPC
behavior. Having worked with this kind of structure in both Drill (where
things were entirely late bound dynamic) and Dremio (where we start with
schema and restart if we identify schema change), I strongly recommend that
"dataset behavior" be strongly schemaed with pre-known schemas. This also
includes avoiding dictionary restatement/mutation. It's fine for IPC to
support this but I think we shouldn't pollute higher level concepts with
it. For example, a sort kernel should or gandiva runtime compilation should
not be expected to deal with any of these things. In a generalized engine,
the amount of recompilation and optimization you have to do makes the code
far more difficult to build and maintain.

In short, being too accommodating of change and dynamicism substantially
stifles the progress of innovation and completeness of the datasets
approach.

On Thu, Jun 27, 2019 at 7:49 PM Fan Liya  wrote:

> @Wes McKinney, I see your comments. Thank you so much.
>
> I agree with you that the schema and dictionary should be separated.
> However, according to the current Java implementation, the dictionary is
> attached to the schema, so a refactoring is required.
>
> BTW, a somewhat related problem is that the data schema is not immutable
> even if we remove dictionary from it.
> As data are inserted to the batch, the schema may change. A problem related
> to this is found in https://issues.apache.org/jira/browse/ARROW-5658
> Do you think that violates the IPC protocol?
>
> Best,
> Liya Fan
>
>
> On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney  wrote:
>
> > hi Liya,
> >
> > I left a couple of comments in the document. You might look at what we
> > have developed in C++ and JavaSript which is more mature and widely
> > used in those languages than what currently exists in Java.
> >
> > In particular, I strongly encourage you to avoid creating a coupling
> > between the Schema (which indicates the "index type" and the
> > "dictionary type") and the Data (which contains the "dictionary
> > indices" and the "dictionary values"). If you do this, then it is
> > nearly impossible to account for dictionary changes in a stream of
> > data, which is already supported in the IPC protocol, see
> >
> >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
> >
> > The protocol does not permit dictionaries to change order altogether,
> > but I think it should. I just opened
> > https://issues.apache.org/jira/browse/ARROW-5767
> >
> > We made this mistake early in the C++ library and found it difficult
> > to have to reconciled dictionaries produced by different threads of
> > execution. See ARROW-3144 and the patch I wrote in this
> > release cycle which addressed this design flaw.
> >
> > Thanks
> > Wes
> >
> > On Wed, Jun 12, 2019 at 5:33 AM Fan Liya  wrote:
> > >
> > > @Micah Kornfield  Thanks a lot for your
> comments.
> > >
> > > In the doc, we identify 3 problems for the current dictionary encoding
> > use
> > > case (there can be more, so please give your valuable suggestions):
> > >
> > > 1. there should be a convenient way to provide access to both
> > > encoded/decoded data.
> > > 2. the constructor for the schema with dictionary is misleading.
> > > 3. we should provide a way to do the encoding/decoding during
> > > serialization/deserialization, so the encoding/decoding can be
> > transparent
> > > to the user.
> > >
> > > The current PR provides a solution for problem 2, and it is a
> relatively
> > > small change, so we think we can merge this PR first.
> > >
> > > Solutions for problem 1 and 3 should be chosen carefully so as not to
> > > affect existing APIs. So more discussions/designs for problems 1 and 3
> > are
> > > desired.
> > > Do you think it reasonable?
> > >
> > > Best,
> > > Liya Fan
> > >
> > >
> > >
> > > On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield  >
> > > wrote:
> > >
> > > > Hi Liya Fan,
> > > > Thanks you for doing this.  I need to take a closer look at the PR in
> > > > question and the dictionary encoding code but this seems like it is
> on
> > the
> > > > right track.
> > > >
> > > > Could other java contributors with more familiarity in the space look
> > over
> > > > the document to make sure it makes sense to them?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya 
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > This is concerning issue ARROW-3396.
> > > > >
> > > > > I have summarized the problem (please see if my understanding is
> > > > correct),
> > > > > and proposed some solutions to it. Please give your valuable
> > feedback.
> > > > > For details, please see:
> > > > >
> > > > >
> > > > >
> > > >
> >
> https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing
> > > > >
> > > > > Thank you in advance.
> > > > >
> > > > > Best,
> > > > > Liya Fan
> > > > >
> > > >
> >
>


Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-27 Thread Fan Liya
@Wes McKinney, I see your comments. Thank you so much.

I agree with you that the schema and dictionary should be separated.
However, according to the current Java implementation, the dictionary is
attached to the schema, so a refactoring is required.

BTW, a somewhat related problem is that the data schema is not immutable
even if we remove dictionary from it.
As data are inserted to the batch, the schema may change. A problem related
to this is found in https://issues.apache.org/jira/browse/ARROW-5658
Do you think that violates the IPC protocol?

Best,
Liya Fan


On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney  wrote:

> hi Liya,
>
> I left a couple of comments in the document. You might look at what we
> have developed in C++ and JavaSript which is more mature and widely
> used in those languages than what currently exists in Java.
>
> In particular, I strongly encourage you to avoid creating a coupling
> between the Schema (which indicates the "index type" and the
> "dictionary type") and the Data (which contains the "dictionary
> indices" and the "dictionary values"). If you do this, then it is
> nearly impossible to account for dictionary changes in a stream of
> data, which is already supported in the IPC protocol, see
>
>
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
>
> The protocol does not permit dictionaries to change order altogether,
> but I think it should. I just opened
> https://issues.apache.org/jira/browse/ARROW-5767
>
> We made this mistake early in the C++ library and found it difficult
> to have to reconciled dictionaries produced by different threads of
> execution. See ARROW-3144 and the patch I wrote in this
> release cycle which addressed this design flaw.
>
> Thanks
> Wes
>
> On Wed, Jun 12, 2019 at 5:33 AM Fan Liya  wrote:
> >
> > @Micah Kornfield  Thanks a lot for your comments.
> >
> > In the doc, we identify 3 problems for the current dictionary encoding
> use
> > case (there can be more, so please give your valuable suggestions):
> >
> > 1. there should be a convenient way to provide access to both
> > encoded/decoded data.
> > 2. the constructor for the schema with dictionary is misleading.
> > 3. we should provide a way to do the encoding/decoding during
> > serialization/deserialization, so the encoding/decoding can be
> transparent
> > to the user.
> >
> > The current PR provides a solution for problem 2, and it is a relatively
> > small change, so we think we can merge this PR first.
> >
> > Solutions for problem 1 and 3 should be chosen carefully so as not to
> > affect existing APIs. So more discussions/designs for problems 1 and 3
> are
> > desired.
> > Do you think it reasonable?
> >
> > Best,
> > Liya Fan
> >
> >
> >
> > On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield 
> > wrote:
> >
> > > Hi Liya Fan,
> > > Thanks you for doing this.  I need to take a closer look at the PR in
> > > question and the dictionary encoding code but this seems like it is on
> the
> > > right track.
> > >
> > > Could other java contributors with more familiarity in the space look
> over
> > > the document to make sure it makes sense to them?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya  wrote:
> > >
> > > > Hi all,
> > > >
> > > > This is concerning issue ARROW-3396.
> > > >
> > > > I have summarized the problem (please see if my understanding is
> > > correct),
> > > > and proposed some solutions to it. Please give your valuable
> feedback.
> > > > For details, please see:
> > > >
> > > >
> > > >
> > >
> https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing
> > > >
> > > > Thank you in advance.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > >
>


Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-27 Thread Wes McKinney
hi Liya,

I left a couple of comments in the document. You might look at what we
have developed in C++ and JavaSript which is more mature and widely
used in those languages than what currently exists in Java.

In particular, I strongly encourage you to avoid creating a coupling
between the Schema (which indicates the "index type" and the
"dictionary type") and the Data (which contains the "dictionary
indices" and the "dictionary values"). If you do this, then it is
nearly impossible to account for dictionary changes in a stream of
data, which is already supported in the IPC protocol, see

https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches

The protocol does not permit dictionaries to change order altogether,
but I think it should. I just opened
https://issues.apache.org/jira/browse/ARROW-5767

We made this mistake early in the C++ library and found it difficult
to have to reconciled dictionaries produced by different threads of
execution. See ARROW-3144 and the patch I wrote in this
release cycle which addressed this design flaw.

Thanks
Wes

On Wed, Jun 12, 2019 at 5:33 AM Fan Liya  wrote:
>
> @Micah Kornfield  Thanks a lot for your comments.
>
> In the doc, we identify 3 problems for the current dictionary encoding use
> case (there can be more, so please give your valuable suggestions):
>
> 1. there should be a convenient way to provide access to both
> encoded/decoded data.
> 2. the constructor for the schema with dictionary is misleading.
> 3. we should provide a way to do the encoding/decoding during
> serialization/deserialization, so the encoding/decoding can be transparent
> to the user.
>
> The current PR provides a solution for problem 2, and it is a relatively
> small change, so we think we can merge this PR first.
>
> Solutions for problem 1 and 3 should be chosen carefully so as not to
> affect existing APIs. So more discussions/designs for problems 1 and 3 are
> desired.
> Do you think it reasonable?
>
> Best,
> Liya Fan
>
>
>
> On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield 
> wrote:
>
> > Hi Liya Fan,
> > Thanks you for doing this.  I need to take a closer look at the PR in
> > question and the dictionary encoding code but this seems like it is on the
> > right track.
> >
> > Could other java contributors with more familiarity in the space look over
> > the document to make sure it makes sense to them?
> >
> > Thanks,
> > Micah
> >
> > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya  wrote:
> >
> > > Hi all,
> > >
> > > This is concerning issue ARROW-3396.
> > >
> > > I have summarized the problem (please see if my understanding is
> > correct),
> > > and proposed some solutions to it. Please give your valuable feedback.
> > > For details, please see:
> > >
> > >
> > >
> > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing
> > >
> > > Thank you in advance.
> > >
> > > Best,
> > > Liya Fan
> > >
> >


Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-12 Thread Fan Liya
@Micah Kornfield  Thanks a lot for your comments.

In the doc, we identify 3 problems for the current dictionary encoding use
case (there can be more, so please give your valuable suggestions):

1. there should be a convenient way to provide access to both
encoded/decoded data.
2. the constructor for the schema with dictionary is misleading.
3. we should provide a way to do the encoding/decoding during
serialization/deserialization, so the encoding/decoding can be transparent
to the user.

The current PR provides a solution for problem 2, and it is a relatively
small change, so we think we can merge this PR first.

Solutions for problem 1 and 3 should be chosen carefully so as not to
affect existing APIs. So more discussions/designs for problems 1 and 3 are
desired.
Do you think it reasonable?

Best,
Liya Fan



On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield 
wrote:

> Hi Liya Fan,
> Thanks you for doing this.  I need to take a closer look at the PR in
> question and the dictionary encoding code but this seems like it is on the
> right track.
>
> Could other java contributors with more familiarity in the space look over
> the document to make sure it makes sense to them?
>
> Thanks,
> Micah
>
> On Mon, Jun 10, 2019 at 2:23 AM Fan Liya  wrote:
>
> > Hi all,
> >
> > This is concerning issue ARROW-3396.
> >
> > I have summarized the problem (please see if my understanding is
> correct),
> > and proposed some solutions to it. Please give your valuable feedback.
> > For details, please see:
> >
> >
> >
> https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing
> >
> > Thank you in advance.
> >
> > Best,
> > Liya Fan
> >
>


Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-12 Thread Micah Kornfield
Hi Liya Fan,
Thanks you for doing this.  I need to take a closer look at the PR in
question and the dictionary encoding code but this seems like it is on the
right track.

Could other java contributors with more familiarity in the space look over
the document to make sure it makes sense to them?

Thanks,
Micah

On Mon, Jun 10, 2019 at 2:23 AM Fan Liya  wrote:

> Hi all,
>
> This is concerning issue ARROW-3396.
>
> I have summarized the problem (please see if my understanding is correct),
> and proposed some solutions to it. Please give your valuable feedback.
> For details, please see:
>
>
> https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing
>
> Thank you in advance.
>
> Best,
> Liya Fan
>


[Discuss][Java][Typical use cases for dictionary encoding string vectors]

2019-06-10 Thread Fan Liya
Hi all,

This is concerning issue ARROW-3396.

I have summarized the problem (please see if my understanding is correct),
and proposed some solutions to it. Please give your valuable feedback.
For details, please see:

https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing

Thank you in advance.

Best,
Liya Fan