@Wes McKinney, I see your comments. Thank you so much. I agree with you that the schema and dictionary should be separated. However, according to the current Java implementation, the dictionary is attached to the schema, so a refactoring is required.
BTW, a somewhat related problem is that the data schema is not immutable even if we remove dictionary from it. As data are inserted to the batch, the schema may change. A problem related to this is found in https://issues.apache.org/jira/browse/ARROW-5658 Do you think that violates the IPC protocol? Best, Liya Fan On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney <wesmck...@gmail.com> wrote: > hi Liya, > > I left a couple of comments in the document. You might look at what we > have developed in C++ and JavaSript which is more mature and widely > used in those languages than what currently exists in Java. > > In particular, I strongly encourage you to avoid creating a coupling > between the Schema (which indicates the "index type" and the > "dictionary type") and the Data (which contains the "dictionary > indices" and the "dictionary values"). If you do this, then it is > nearly impossible to account for dictionary changes in a stream of > data, which is already supported in the IPC protocol, see > > > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches > > The protocol does not permit dictionaries to change order altogether, > but I think it should. I just opened > https://issues.apache.org/jira/browse/ARROW-5767 > > We made this mistake early in the C++ library and found it difficult > to have to reconciled dictionaries produced by different threads of > execution. See ARROW-3144 and the patch I wrote in this > release cycle which addressed this design flaw. > > Thanks > Wes > > On Wed, Jun 12, 2019 at 5:33 AM Fan Liya <liya.fa...@gmail.com> wrote: > > > > @Micah Kornfield <emkornfi...@gmail.com> Thanks a lot for your comments. > > > > In the doc, we identify 3 problems for the current dictionary encoding > use > > case (there can be more, so please give your valuable suggestions): > > > > 1. there should be a convenient way to provide access to both > > encoded/decoded data. > > 2. the constructor for the schema with dictionary is misleading. > > 3. we should provide a way to do the encoding/decoding during > > serialization/deserialization, so the encoding/decoding can be > transparent > > to the user. > > > > The current PR provides a solution for problem 2, and it is a relatively > > small change, so we think we can merge this PR first. > > > > Solutions for problem 1 and 3 should be chosen carefully so as not to > > affect existing APIs. So more discussions/designs for problems 1 and 3 > are > > desired. > > Do you think it reasonable? > > > > Best, > > Liya Fan > > > > > > > > On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > > > Hi Liya Fan, > > > Thanks you for doing this. I need to take a closer look at the PR in > > > question and the dictionary encoding code but this seems like it is on > the > > > right track. > > > > > > Could other java contributors with more familiarity in the space look > over > > > the document to make sure it makes sense to them? > > > > > > Thanks, > > > Micah > > > > > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya <liya.fa...@gmail.com> wrote: > > > > > > > Hi all, > > > > > > > > This is concerning issue ARROW-3396. > > > > > > > > I have summarized the problem (please see if my understanding is > > > correct), > > > > and proposed some solutions to it. Please give your valuable > feedback. > > > > For details, please see: > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing > > > > > > > > Thank you in advance. > > > > > > > > Best, > > > > Liya Fan > > > > > > > >