Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]
hey Jacques, I think in the case of dictionary encoding, any algorithm should be operating with the dictionary for a particular field in a record batch already in hand. Certain algorithms optimized for dictionary-encoded data (like hash aggregations) may have to branch at fragment merge steps (where the reduction may have been performed on different dictionary indices). So if we had: fragment A indices [0, 1, 2, 0, 1] dictionary: ['a', 'b', 'c'] fragment B indices [0, 1, 2, 0, 1] dictionary: ['a', 'c', 'b'] then the COUNT aggregation would be merging partials fragment A partials: [(0, 2), (0, 2), (0, 1)] dictionary: ['a', 'b', 'c'] fragment B partials: [(0, 2), (0, 2), (0, 1)] dictionary: ['a', 'c', 'b'] So in this case a unification is needed, where in other cases it could be skipped if the dictionary is unchanged. In other cases, optimized kernels (e.g. string operations) can dispatch to the dictionary and leave the indices untouched, so we could have indices [0, 1, 2, 0, 1, 2] dictionary [foo, bar, baz] then the UPPER kernel would yield indices [0, 1, 2, 0, 1, 2] dictionary [FOO, BAR, BAZ] and pass through the unmodified indices memory. It's a double-edged sword -- it will be interesting to see how things develop at the compiler level (Gandiva, or similar) - Wes On Fri, Jun 28, 2019 at 9:01 PM Jacques Nadeau wrote: > > I think we need to start separating out dataset behavior from base IPC > behavior. Having worked with this kind of structure in both Drill (where > things were entirely late bound dynamic) and Dremio (where we start with > schema and restart if we identify schema change), I strongly recommend that > "dataset behavior" be strongly schemaed with pre-known schemas. This also > includes avoiding dictionary restatement/mutation. It's fine for IPC to > support this but I think we shouldn't pollute higher level concepts with > it. For example, a sort kernel should or gandiva runtime compilation should > not be expected to deal with any of these things. In a generalized engine, > the amount of recompilation and optimization you have to do makes the code > far more difficult to build and maintain. > > In short, being too accommodating of change and dynamicism substantially > stifles the progress of innovation and completeness of the datasets > approach. > > On Thu, Jun 27, 2019 at 7:49 PM Fan Liya wrote: > > > @Wes McKinney, I see your comments. Thank you so much. > > > > I agree with you that the schema and dictionary should be separated. > > However, according to the current Java implementation, the dictionary is > > attached to the schema, so a refactoring is required. > > > > BTW, a somewhat related problem is that the data schema is not immutable > > even if we remove dictionary from it. > > As data are inserted to the batch, the schema may change. A problem related > > to this is found in https://issues.apache.org/jira/browse/ARROW-5658 > > Do you think that violates the IPC protocol? > > > > Best, > > Liya Fan > > > > > > On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney wrote: > > > > > hi Liya, > > > > > > I left a couple of comments in the document. You might look at what we > > > have developed in C++ and JavaSript which is more mature and widely > > > used in those languages than what currently exists in Java. > > > > > > In particular, I strongly encourage you to avoid creating a coupling > > > between the Schema (which indicates the "index type" and the > > > "dictionary type") and the Data (which contains the "dictionary > > > indices" and the "dictionary values"). If you do this, then it is > > > nearly impossible to account for dictionary changes in a stream of > > > data, which is already supported in the IPC protocol, see > > > > > > > > > > > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches > > > > > > The protocol does not permit dictionaries to change order altogether, > > > but I think it should. I just opened > > > https://issues.apache.org/jira/browse/ARROW-5767 > > > > > > We made this mistake early in the C++ library and found it difficult > > > to have to reconciled dictionaries produced by different threads of > > > execution. See ARROW-3144 and the patch I wrote in this > > > release cycle which addressed this design flaw. > > > > > > Thanks > > > Wes > > > > > > On Wed, Jun 12, 2019 at 5:33 AM Fan Liya wrote: > > > > > > > > @Micah Kornfield Thanks a lot for your > > comments. > > > > > > > > In the doc, we identify 3 problems for the current dictionary encoding > > > use > > > > case (there can be more, so please give your valuable suggestions): > > > > > > > > 1. there should be a convenient way to provide access to both > > > > encoded/decoded data. > > > > 2. the constructor for the schema with dictionary is misleading. > > > > 3. we should provide a way to do the encoding/decoding during > > > > serialization/deserialization, so the encoding/decoding can be > > > transparent > > > > to the user. > > > > >
Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]
I think we need to start separating out dataset behavior from base IPC behavior. Having worked with this kind of structure in both Drill (where things were entirely late bound dynamic) and Dremio (where we start with schema and restart if we identify schema change), I strongly recommend that "dataset behavior" be strongly schemaed with pre-known schemas. This also includes avoiding dictionary restatement/mutation. It's fine for IPC to support this but I think we shouldn't pollute higher level concepts with it. For example, a sort kernel should or gandiva runtime compilation should not be expected to deal with any of these things. In a generalized engine, the amount of recompilation and optimization you have to do makes the code far more difficult to build and maintain. In short, being too accommodating of change and dynamicism substantially stifles the progress of innovation and completeness of the datasets approach. On Thu, Jun 27, 2019 at 7:49 PM Fan Liya wrote: > @Wes McKinney, I see your comments. Thank you so much. > > I agree with you that the schema and dictionary should be separated. > However, according to the current Java implementation, the dictionary is > attached to the schema, so a refactoring is required. > > BTW, a somewhat related problem is that the data schema is not immutable > even if we remove dictionary from it. > As data are inserted to the batch, the schema may change. A problem related > to this is found in https://issues.apache.org/jira/browse/ARROW-5658 > Do you think that violates the IPC protocol? > > Best, > Liya Fan > > > On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney wrote: > > > hi Liya, > > > > I left a couple of comments in the document. You might look at what we > > have developed in C++ and JavaSript which is more mature and widely > > used in those languages than what currently exists in Java. > > > > In particular, I strongly encourage you to avoid creating a coupling > > between the Schema (which indicates the "index type" and the > > "dictionary type") and the Data (which contains the "dictionary > > indices" and the "dictionary values"). If you do this, then it is > > nearly impossible to account for dictionary changes in a stream of > > data, which is already supported in the IPC protocol, see > > > > > > > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches > > > > The protocol does not permit dictionaries to change order altogether, > > but I think it should. I just opened > > https://issues.apache.org/jira/browse/ARROW-5767 > > > > We made this mistake early in the C++ library and found it difficult > > to have to reconciled dictionaries produced by different threads of > > execution. See ARROW-3144 and the patch I wrote in this > > release cycle which addressed this design flaw. > > > > Thanks > > Wes > > > > On Wed, Jun 12, 2019 at 5:33 AM Fan Liya wrote: > > > > > > @Micah Kornfield Thanks a lot for your > comments. > > > > > > In the doc, we identify 3 problems for the current dictionary encoding > > use > > > case (there can be more, so please give your valuable suggestions): > > > > > > 1. there should be a convenient way to provide access to both > > > encoded/decoded data. > > > 2. the constructor for the schema with dictionary is misleading. > > > 3. we should provide a way to do the encoding/decoding during > > > serialization/deserialization, so the encoding/decoding can be > > transparent > > > to the user. > > > > > > The current PR provides a solution for problem 2, and it is a > relatively > > > small change, so we think we can merge this PR first. > > > > > > Solutions for problem 1 and 3 should be chosen carefully so as not to > > > affect existing APIs. So more discussions/designs for problems 1 and 3 > > are > > > desired. > > > Do you think it reasonable? > > > > > > Best, > > > Liya Fan > > > > > > > > > > > > On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield > > > > wrote: > > > > > > > Hi Liya Fan, > > > > Thanks you for doing this. I need to take a closer look at the PR in > > > > question and the dictionary encoding code but this seems like it is > on > > the > > > > right track. > > > > > > > > Could other java contributors with more familiarity in the space look > > over > > > > the document to make sure it makes sense to them? > > > > > > > > Thanks, > > > > Micah > > > > > > > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya > wrote: > > > > > > > > > Hi all, > > > > > > > > > > This is concerning issue ARROW-3396. > > > > > > > > > > I have summarized the problem (please see if my understanding is > > > > correct), > > > > > and proposed some solutions to it. Please give your valuable > > feedback. > > > > > For details, please see: > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing > > > > > > > > > > Thank you in advance. > > > > > > > > > > Best, > > > > > Liya Fan > > > > > > > > > > > >
Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]
@Wes McKinney, I see your comments. Thank you so much. I agree with you that the schema and dictionary should be separated. However, according to the current Java implementation, the dictionary is attached to the schema, so a refactoring is required. BTW, a somewhat related problem is that the data schema is not immutable even if we remove dictionary from it. As data are inserted to the batch, the schema may change. A problem related to this is found in https://issues.apache.org/jira/browse/ARROW-5658 Do you think that violates the IPC protocol? Best, Liya Fan On Fri, Jun 28, 2019 at 7:36 AM Wes McKinney wrote: > hi Liya, > > I left a couple of comments in the document. You might look at what we > have developed in C++ and JavaSript which is more mature and widely > used in those languages than what currently exists in Java. > > In particular, I strongly encourage you to avoid creating a coupling > between the Schema (which indicates the "index type" and the > "dictionary type") and the Data (which contains the "dictionary > indices" and the "dictionary values"). If you do this, then it is > nearly impossible to account for dictionary changes in a stream of > data, which is already supported in the IPC protocol, see > > > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches > > The protocol does not permit dictionaries to change order altogether, > but I think it should. I just opened > https://issues.apache.org/jira/browse/ARROW-5767 > > We made this mistake early in the C++ library and found it difficult > to have to reconciled dictionaries produced by different threads of > execution. See ARROW-3144 and the patch I wrote in this > release cycle which addressed this design flaw. > > Thanks > Wes > > On Wed, Jun 12, 2019 at 5:33 AM Fan Liya wrote: > > > > @Micah Kornfield Thanks a lot for your comments. > > > > In the doc, we identify 3 problems for the current dictionary encoding > use > > case (there can be more, so please give your valuable suggestions): > > > > 1. there should be a convenient way to provide access to both > > encoded/decoded data. > > 2. the constructor for the schema with dictionary is misleading. > > 3. we should provide a way to do the encoding/decoding during > > serialization/deserialization, so the encoding/decoding can be > transparent > > to the user. > > > > The current PR provides a solution for problem 2, and it is a relatively > > small change, so we think we can merge this PR first. > > > > Solutions for problem 1 and 3 should be chosen carefully so as not to > > affect existing APIs. So more discussions/designs for problems 1 and 3 > are > > desired. > > Do you think it reasonable? > > > > Best, > > Liya Fan > > > > > > > > On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield > > wrote: > > > > > Hi Liya Fan, > > > Thanks you for doing this. I need to take a closer look at the PR in > > > question and the dictionary encoding code but this seems like it is on > the > > > right track. > > > > > > Could other java contributors with more familiarity in the space look > over > > > the document to make sure it makes sense to them? > > > > > > Thanks, > > > Micah > > > > > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya wrote: > > > > > > > Hi all, > > > > > > > > This is concerning issue ARROW-3396. > > > > > > > > I have summarized the problem (please see if my understanding is > > > correct), > > > > and proposed some solutions to it. Please give your valuable > feedback. > > > > For details, please see: > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing > > > > > > > > Thank you in advance. > > > > > > > > Best, > > > > Liya Fan > > > > > > > >
Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]
hi Liya, I left a couple of comments in the document. You might look at what we have developed in C++ and JavaSript which is more mature and widely used in those languages than what currently exists in Java. In particular, I strongly encourage you to avoid creating a coupling between the Schema (which indicates the "index type" and the "dictionary type") and the Data (which contains the "dictionary indices" and the "dictionary values"). If you do this, then it is nearly impossible to account for dictionary changes in a stream of data, which is already supported in the IPC protocol, see https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches The protocol does not permit dictionaries to change order altogether, but I think it should. I just opened https://issues.apache.org/jira/browse/ARROW-5767 We made this mistake early in the C++ library and found it difficult to have to reconciled dictionaries produced by different threads of execution. See ARROW-3144 and the patch I wrote in this release cycle which addressed this design flaw. Thanks Wes On Wed, Jun 12, 2019 at 5:33 AM Fan Liya wrote: > > @Micah Kornfield Thanks a lot for your comments. > > In the doc, we identify 3 problems for the current dictionary encoding use > case (there can be more, so please give your valuable suggestions): > > 1. there should be a convenient way to provide access to both > encoded/decoded data. > 2. the constructor for the schema with dictionary is misleading. > 3. we should provide a way to do the encoding/decoding during > serialization/deserialization, so the encoding/decoding can be transparent > to the user. > > The current PR provides a solution for problem 2, and it is a relatively > small change, so we think we can merge this PR first. > > Solutions for problem 1 and 3 should be chosen carefully so as not to > affect existing APIs. So more discussions/designs for problems 1 and 3 are > desired. > Do you think it reasonable? > > Best, > Liya Fan > > > > On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield > wrote: > > > Hi Liya Fan, > > Thanks you for doing this. I need to take a closer look at the PR in > > question and the dictionary encoding code but this seems like it is on the > > right track. > > > > Could other java contributors with more familiarity in the space look over > > the document to make sure it makes sense to them? > > > > Thanks, > > Micah > > > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya wrote: > > > > > Hi all, > > > > > > This is concerning issue ARROW-3396. > > > > > > I have summarized the problem (please see if my understanding is > > correct), > > > and proposed some solutions to it. Please give your valuable feedback. > > > For details, please see: > > > > > > > > > > > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing > > > > > > Thank you in advance. > > > > > > Best, > > > Liya Fan > > > > >
Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]
@Micah Kornfield Thanks a lot for your comments. In the doc, we identify 3 problems for the current dictionary encoding use case (there can be more, so please give your valuable suggestions): 1. there should be a convenient way to provide access to both encoded/decoded data. 2. the constructor for the schema with dictionary is misleading. 3. we should provide a way to do the encoding/decoding during serialization/deserialization, so the encoding/decoding can be transparent to the user. The current PR provides a solution for problem 2, and it is a relatively small change, so we think we can merge this PR first. Solutions for problem 1 and 3 should be chosen carefully so as not to affect existing APIs. So more discussions/designs for problems 1 and 3 are desired. Do you think it reasonable? Best, Liya Fan On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield wrote: > Hi Liya Fan, > Thanks you for doing this. I need to take a closer look at the PR in > question and the dictionary encoding code but this seems like it is on the > right track. > > Could other java contributors with more familiarity in the space look over > the document to make sure it makes sense to them? > > Thanks, > Micah > > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya wrote: > > > Hi all, > > > > This is concerning issue ARROW-3396. > > > > I have summarized the problem (please see if my understanding is > correct), > > and proposed some solutions to it. Please give your valuable feedback. > > For details, please see: > > > > > > > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing > > > > Thank you in advance. > > > > Best, > > Liya Fan > > >
Re: [Discuss][Java][Typical use cases for dictionary encoding string vectors]
Hi Liya Fan, Thanks you for doing this. I need to take a closer look at the PR in question and the dictionary encoding code but this seems like it is on the right track. Could other java contributors with more familiarity in the space look over the document to make sure it makes sense to them? Thanks, Micah On Mon, Jun 10, 2019 at 2:23 AM Fan Liya wrote: > Hi all, > > This is concerning issue ARROW-3396. > > I have summarized the problem (please see if my understanding is correct), > and proposed some solutions to it. Please give your valuable feedback. > For details, please see: > > > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing > > Thank you in advance. > > Best, > Liya Fan >
[Discuss][Java][Typical use cases for dictionary encoding string vectors]
Hi all, This is concerning issue ARROW-3396. I have summarized the problem (please see if my understanding is correct), and proposed some solutions to it. Please give your valuable feedback. For details, please see: https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing Thank you in advance. Best, Liya Fan