@Wes McKinney, I see your comments. Thank you so much.

I agree with you that the schema and dictionary should be separated.
However, according to the current Java implementation, the dictionary is
attached to the schema, so a refactoring is required.

BTW, a somewhat related problem is that the data schema is not immutable
even if we remove dictionary from it.
As data are inserted to the batch, the schema may change. A problem related
to this is found in https://issues.apache.org/jira/browse/ARROW-5658
Do you think that violates the IPC protocol?

Best,
Liya Fan


On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Liya,
>
> I left a couple of comments in the document. You might look at what we
> have developed in C++ and JavaSript which is more mature and widely
> used in those languages than what currently exists in Java.
>
> In particular, I strongly encourage you to avoid creating a coupling
> between the Schema (which indicates the "index type" and the
> "dictionary type") and the Data (which contains the "dictionary
> indices" and the "dictionary values"). If you do this, then it is
> nearly impossible to account for dictionary changes in a stream of
> data, which is already supported in the IPC protocol, see
>
>
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
>
> The protocol does not permit dictionaries to change order altogether,
> but I think it should. I just opened
> https://issues.apache.org/jira/browse/ARROW-5767
>
> We made this mistake early in the C++ library and found it difficult
> to have to reconciled dictionaries produced by different threads of
> execution. See ARROW-3144 and the patch I wrote in this
> release cycle which addressed this design flaw.
>
> Thanks
> Wes
>
> On Wed, Jun 12, 2019 at 5:33 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > @Micah Kornfield <emkornfi...@gmail.com> Thanks a lot for your comments.
> >
> > In the doc, we identify 3 problems for the current dictionary encoding
> use
> > case (there can be more, so please give your valuable suggestions):
> >
> > 1. there should be a convenient way to provide access to both
> > encoded/decoded data.
> > 2. the constructor for the schema with dictionary is misleading.
> > 3. we should provide a way to do the encoding/decoding during
> > serialization/deserialization, so the encoding/decoding can be
> transparent
> > to the user.
> >
> > The current PR provides a solution for problem 2, and it is a relatively
> > small change, so we think we can merge this PR first.
> >
> > Solutions for problem 1 and 3 should be chosen carefully so as not to
> > affect existing APIs. So more discussions/designs for problems 1 and 3
> are
> > desired.
> > Do you think it reasonable?
> >
> > Best,
> > Liya Fan
> >
> >
> >
> > On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Liya Fan,
> > > Thanks you for doing this.  I need to take a closer look at the PR in
> > > question and the dictionary encoding code but this seems like it is on
> the
> > > right track.
> > >
> > > Could other java contributors with more familiarity in the space look
> over
> > > the document to make sure it makes sense to them?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya <liya.fa...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > This is concerning issue ARROW-3396.
> > > >
> > > > I have summarized the problem (please see if my understanding is
> > > correct),
> > > > and proposed some solutions to it. Please give your valuable
> feedback.
> > > > For details, please see:
> > > >
> > > >
> > > >
> > >
> https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing
> > > >
> > > > Thank you in advance.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > >
>

Reply via email to