Hi Micah,

I think we should formulate the changes to format/Columnar.rst and hold a vote. What do you think?

On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1. Exactly one DictionaryBatch per encoded column
>> > 2. DictionaryBatches are interleaved correctly.
>>
>> Could you clarify?
>
> I think you clarified it very well :) My motivation for suggesting the
> additional complexity is that I see two use cases for the file format. These
> roughly correspond to the two options I suggested:
>
> 1. We are encoding data from scratch. In this case, it seems like all
> dictionaries would be built incrementally and would not need replacement,
> and we write them at the end of the file [1].
>
> 2. The data being written out is essentially a "tee" off of some stream that
> is generating new dictionaries requiring replacement on the fly (i.e. reading
> back two Parquet files).
>
>> It might be better to disallow replacements
>> in the file format (which does introduce semantic slippage between the
>> file and stream formats as Antoine was saying).
>
> It is certainly possible to accept the slippage from the stream format
> for now and add this capability later, since it should be forwards compatible.
>
> Thanks,
> Micah
>
> [1] There is also a medium-complexity option where we require one non-delta
> dictionary and as many delta dictionaries as the user wants.
>
> On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1. Exactly one DictionaryBatch per encoded column
>> > 2. DictionaryBatches are interleaved correctly.
>>
>> Could you clarify? In the first case, there is no issue with
>> dictionary replacements. I'm not sure about the second case -- if a
>> dictionary id appears twice, then you'll see it twice in the file
>> footer. I suppose you could look at the file offsets to determine
>> whether a dictionary batch precedes a particular record batch block
>> (to know which dictionary you should be using), but that's rather
>> complicated to implement. It might be better to disallow replacements
>> in the file format (which does introduce semantic slippage between the
>> file and stream formats as Antoine was saying).
>>
>> >
>> > On Tuesday, August 27, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <anto...@python.org>
>> > > wrote:
>> > > >
>> > > > On 27/08/2019 at 22:31, Wes McKinney wrote:
>> > > > > So the current situation in C++ is that if we tried to create
>> > > > > an IPC stream from a sequence of record batches that don't
>> > > > > all have the same dictionary, we'd run into two scenarios:
>> > > > >
>> > > > > * Batches that either have a prefix of a prior-observed
>> > > > > dictionary, or the prior dictionary is a prefix of their
>> > > > > dictionary. For example, suppose that the dictionary sent for
>> > > > > an id was ['A', 'B', 'C'] and then there's a subsequent batch
>> > > > > with ['A', 'B', 'C', 'D', 'E']. In such a case we could
>> > > > > compute and send a delta batch.
>> > > > >
>> > > > > * Batches with a dictionary that is a permutation of values,
>> > > > > and possibly new unique values.
>> > > > >
>> > > > > In this latter case, without the option of replacing an
>> > > > > existing ID in the stream, we would have to do a unification /
>> > > > > permutation of indices and then also possibly send a delta
>> > > > > batch. We should probably have code at some point that deals
>> > > > > with both cases, but in the meantime I would like to allow
>> > > > > dictionaries to be redefined in this case. Seems like we might
>> > > > > need a vote to formalize this?
>> > > >
>> > > > Isn't the stream format deviating from the file format then? In the
>> > > > file format, IIUC, dictionaries can appear after the respective record
>> > > > batches, so there's no way to tell whether the original or redefined
>> > > > version of a dictionary is being referred to.
>> > >
>> > > You make a good point -- we can consider changes to the file format to
>> > > allow for record batches to have different dictionaries. Even handling
>> > > delta dictionaries with the current file format would be a bit tedious
>> > > (though not indeterminate).
>> > >
>> > > > Regards
>> > > >
>> > > > Antoine.
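
For reference, the two scenarios described at the bottom of the thread (a prefix relationship that allows a delta batch, and a permutation that needs index unification when replacement is not allowed) can be sketched roughly as follows. This is only an illustration under stated assumptions: plain Python lists stand in for dictionary arrays, dictionary values are assumed unique and non-null, and the names delta_if_prefix and unify are hypothetical, not functions from the Arrow C++ or Python libraries.

```python
from typing import List, Optional, Sequence, Tuple


def delta_if_prefix(prev: Sequence[str], new: Sequence[str]) -> Optional[List[str]]:
    """Hypothetical helper: return the values a delta batch would carry
    when one dictionary is a prefix of the other, else None."""
    if list(new[:len(prev)]) == list(prev):
        # The prior dictionary is a prefix of the new one: send only the tail.
        return list(new[len(prev):])
    if list(prev[:len(new)]) == list(new):
        # The new dictionary is a prefix of the prior one: nothing to send,
        # existing indices already resolve correctly.
        return []
    return None


def unify(
    prev: Sequence[str], new: Sequence[str], indices: Sequence[int]
) -> Tuple[List[str], List[int]]:
    """Hypothetical helper for the permutation case without replacement:
    return (delta values, batch indices remapped onto prev + delta)."""
    position = {value: i for i, value in enumerate(prev)}
    delta: List[str] = []
    for value in new:
        if value not in position:
            # Unseen value: it is appended after the prior dictionary.
            position[value] = len(prev) + len(delta)
            delta.append(value)
    remapped = [position[new[i]] for i in indices]
    return delta, remapped


# Prefix case from the thread: ['A', 'B', 'C'] followed by ['A', 'B', 'C', 'D', 'E'].
prefix_delta = delta_if_prefix(["A", "B", "C"], ["A", "B", "C", "D", "E"])

# Permutation case: same values reordered, plus one new value.
perm_delta, perm_indices = unify(["A", "B", "C"], ["C", "A", "D"], [0, 1, 2, 0])
```

If replacement were allowed, as proposed for the stream format, the second function would be unnecessary: the writer could simply emit the new dictionary under the same id. The sketch says nothing about how either case should be laid out in the file format, which is the open question in this thread.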