Re: [Format] Semantics for dictionary batches in streams

Wes McKinney Mon, 09 Sep 2019 12:21:57 -0700

hi Micah,

I think we should formulate changes to format/Columnar.rst and have a
vote. What do you think?


On Thu, Aug 29, 2019 at 2:23 AM Micah Kornfield <[email protected]> wrote:
>>
>>
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1.  Exactly one DictionaryBatch per encoded column
>> > 2.  DictionaryBatches are interleaved correctly.
>>
>> Could you clarify?
>
> I think you clarified it very well :) My motivation for suggesting the
> additional complexity is that I see two use cases for the file format. These
> roughly correspond to the two options I suggested:
> 1.  We are encoding data from scratch.  In this case, it seems like all
> dictionaries would be built incrementally and would not need replacement,
> and we would write them at the end of the file [1].
>
> 2.  The data being written out is essentially a "tee" off of some stream that
> is generating new dictionaries requiring replacement on the fly (e.g. reading
> back two Parquet files).
>
>>  It might be better to disallow replacements
>> in the file format (which does introduce semantic slippage between the
>> file and stream formats as Antoine was saying).
>
> It is certainly possible to accept the slippage from the stream format
> for now and later add this capability, since it should be forwards compatible.
>
> Thanks,
> Micah
>
> [1] There is also a medium-complexity option where we require one non-delta
> dictionary and as many delta dictionaries as the user wants.
>
> On Wed, Aug 28, 2019 at 7:50 AM Wes McKinney <[email protected]> wrote:
>>
>> On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <[email protected]> wrote:
>> >
>> > I was thinking the file format must satisfy one of two conditions:
>> > 1.  Exactly one DictionaryBatch per encoded column
>> > 2.  DictionaryBatches are interleaved correctly.
>>
>> Could you clarify? In the first case, there is no issue with
>> dictionary replacements. I'm not sure about the second case -- if a
>> dictionary id appears twice, then you'll see it twice in the file
>> footer. I suppose you could look at the file offsets to determine
>> whether a dictionary batch precedes a particular record batch block
>> (to know which dictionary you should be using), but that's rather
>> complicated to implement. It might be better to disallow replacements
>> in the file format (which does introduce semantic slippage between the
>> file and stream formats as Antoine was saying).
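The offset-based lookup described above can be sketched roughly as follows. This is only an illustration in plain Python, not the Arrow C++ or pyarrow API: the function name and the offset values are hypothetical, and it assumes a reader has already collected, for one dictionary id, the sorted file offsets of its dictionary batches from the footer. It covers replacement only, not delta dictionaries.

```python
from bisect import bisect_left

def dictionary_for_batch(dictionary_offsets, record_batch_offset):
    """Pick the dictionary batch that applies to a record batch.

    dictionary_offsets: sorted file offsets of the dictionary batches
        written for a single dictionary id (hypothetical input).
    record_batch_offset: file offset of the record batch block.

    Returns the offset of the latest dictionary batch that precedes the
    record batch, or None if no dictionary batch precedes it.
    """
    pos = bisect_left(dictionary_offsets, record_batch_offset)
    return dictionary_offsets[pos - 1] if pos > 0 else None

# Hypothetical layout: the dictionary is written at offset 8 and
# replaced at offset 400.
offsets = [8, 400]
print(dictionary_for_batch(offsets, 120))  # 8
print(dictionary_for_batch(offsets, 650))  # 400
print(dictionary_for_batch(offsets, 4))    # None
```

The None case is where the file-format problem shows up: a record batch written before any dictionary with that id cannot be resolved by position alone.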
>>
>> >
>> > On Tuesday, August 27, 2019, Wes McKinney <[email protected]> wrote:
>> >
>> > > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <[email protected]> wrote:
>> > > >
>> > > >
>> > > > On 27/08/2019 at 22:31, Wes McKinney wrote:
>> > > > > So the current situation we have right now in C++ is that if we tried
>> > > > > to create an IPC stream from a sequence of record batches that don't
>> > > > > all have the same dictionary, we'd run into two scenarios:
>> > > > >
>> > > > > * Batches whose dictionary is a prefix of a previously observed
>> > > > > dictionary, or for which the prior dictionary is a prefix of theirs.
>> > > > > For example, suppose that the dictionary sent for an id was
>> > > > > ['A', 'B', 'C'] and then there's a subsequent batch with
>> > > > > ['A', 'B', 'C', 'D', 'E']. In such a case we could compute and send
>> > > > > a delta batch.
>> > > > >
>> > > > > * Batches with a dictionary that is a permutation of values, and
>> > > > > possibly new unique values.
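A minimal sketch of the prefix check behind the first scenario, using plain Python lists in place of Arrow dictionary arrays. The function name and return convention are assumptions made for illustration and do not reflect the Arrow C++ implementation.

```python
def dictionary_delta(prior, new):
    """Return the values that would go into a delta dictionary batch.

    Returns a list of appended values when `prior` is a prefix of `new`,
    an empty list when `new` is itself a prefix of `prior` (nothing to
    send, existing indices stay valid), and None when neither holds, in
    which case a delta alone cannot express the change.
    """
    if new[:len(prior)] == prior:
        return new[len(prior):]
    if prior[:len(new)] == new:
        return []
    return None

prior = ['A', 'B', 'C']
print(dictionary_delta(prior, ['A', 'B', 'C', 'D', 'E']))  # ['D', 'E']
print(dictionary_delta(prior, ['A', 'B']))                 # []
print(dictionary_delta(prior, ['C', 'A', 'B', 'D']))       # None
```

The None result corresponds to the second scenario (a permutation, possibly with new values), which needs either replacement or index remapping.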
>> > > > >
>> > > > > In this latter case, without the option of replacing an existing ID
>> > > > > in the stream, we would have to do a unification / permutation of
>> > > > > indices and then also possibly send a delta batch. We should probably
>> > > > > have code at some point that deals with both cases, but in the
>> > > > > meantime I would like to allow dictionaries to be redefined in this
>> > > > > case. Seems like we might need a vote to formalize this?
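The unification / permutation step mentioned above could look roughly like the following. It is a sketch in plain Python under stated assumptions: dictionary values are hashable and unique, indices are a list of integers (or None for nulls), and the function name is hypothetical rather than anything in the Arrow libraries.

```python
def unify_dictionary(prior, new, indices):
    """Remap `indices` (which point into `new`) onto `prior` plus a delta.

    Values of `new` that are missing from `prior` are collected, in
    order, as the delta to append. Returns (delta, remapped_indices),
    where the remapped indices point into prior + delta.
    """
    position = {value: i for i, value in enumerate(prior)}
    delta = []
    for value in new:
        if value not in position:
            position[value] = len(prior) + len(delta)
            delta.append(value)
    remapped = [None if i is None else position[new[i]] for i in indices]
    return delta, remapped

prior = ['A', 'B', 'C']
new = ['C', 'A', 'D']          # a permutation plus one new value
indices = [0, 2, 1, 0]         # decodes to C, D, A, C against `new`
delta, remapped = unify_dictionary(prior, new, indices)
print(delta)     # ['D']
print(remapped)  # [2, 3, 0, 2]
```

Decoding the remapped indices against ['A', 'B', 'C', 'D'] gives C, D, A, C again, so the batch is unchanged logically. The cost is a pass over every index, which is what allowing replacement would avoid.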
>> > > >
>> > > > Isn't the stream format deviating from the file format then?  In the
>> > > > file format, IIUC, dictionaries can appear after the respective record
>> > > > batches, so there's no way to tell whether the original or redefined
>> > > > version of a dictionary is being referred to.
>> > >
>> > > You make a good point -- we can consider changes to the file format to
>> > > allow for record batches to have different dictionaries. Even handling
>> > > delta dictionaries with the current file format would be a bit tedious
>> > > (though not indeterminate)
>> > >
>> > > > Regards
>> > > >
>> > > > Antoine.
>> > >
