Hello,

My own answers:

1) isDelta should be true only when a delta is being transmitted (to be appended to the existing dictionary with the same id); it should be false when a full dictionary is being transmitted (to replace the existing dictionary with the same id, if any)
2) yes, it could
3) yes
4) there's no reason it can't be valid
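
To make answer 1 concrete, here is a minimal sketch of how a stream reader might track dictionary state. It is not taken from any Arrow implementation: the `DictionaryTracker` class and its `apply` method are hypothetical names used only for illustration, and treating a delta for an unseen id as an append to an empty dictionary is an assumption.

```python
from typing import Dict, List


class DictionaryTracker:
    """Hypothetical reader-side state: one value list per dictionary id."""

    def __init__(self) -> None:
        self.dictionaries: Dict[int, List[str]] = {}

    def apply(self, dict_id: int, values: List[str], is_delta: bool) -> None:
        if is_delta:
            # Delta: append to the existing dictionary with the same id.
            # Assumption: a delta for an unseen id starts from an empty list.
            self.dictionaries.setdefault(dict_id, []).extend(values)
        else:
            # Full dictionary: replace whatever was stored under this id.
            self.dictionaries[dict_id] = list(values)


tracker = DictionaryTracker()
tracker.apply(1, ["a", "b", "c"], is_delta=False)  # initial full dictionary
tracker.apply(1, ["d"], is_delta=True)             # delta appends "d"
after_delta = list(tracker.dictionaries[1])
tracker.apply(1, ["c", "a", "d"], is_delta=False)  # switch to a replacement
after_replacement = list(tracker.dictionaries[1])
```

Under this model, switching between delta and replacement mode (answer 2) needs no extra state, since each message is handled according to its own flag.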

Regards

Antoine.


On 25/01/2024 at 07:25, Micah Kornfield wrote:
Hi Chris,
My interpretations:
1) I'm not sure it is clearly defined, but my impression is the first
dictionary is never a delta dictionary (option 1)
2) I don't think they are prevented from switching state (which I suppose
is more complicated?) but hopefully not by much?
3) Dictionaries are reused across batches unless replaced.
4) I'm not sure I understand this question. Dictionaries should be passed
independently of indices?

Thanks,
Micah

On Fri, Jan 19, 2024 at 1:55 PM Chris Larsen <clar...@netflix.com.invalid>
wrote:

Hi folks,

I'm working on multi-batch dictionary with delta support in Java [1] and
would like some clarifications. Given the "isDelta" flag in the dictionary
message [2], when should this be set to "true"?

1) If we have a dictionary with an ID of 1 that we want to delta encode and
it is used across multiple batches, should the initial batch have
`isDelta=false` then subsequent batches have `isDelta=true`? E.g.

batch 1, dict 1, isDelta=false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4]

Or should the flag be true for the entire IPC flow? E.g.

batch 1, dict 1, isDelta=true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4, 3]

Either works for me.

2) Could (in stream, not file IPCs) a single dictionary ever switch state
across batches from delta to replacement mode or vice-versa? E.g.

batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2]

I'd like to keep the protocol and API simple and assume switching is not
allowed. This would mean the 2nd example above would be canonical.

3) Are replacement dictionaries required to be serialized for every batch
or is a dictionary re-used across batches until a replacement is received?
The C++ IPC API has 'unify_dictionaries' [3] that mentions "a column with a
dictionary type must have the same dictionary in each record batch". I
assume (and prefer) the latter, that replacements are serialized once and
re-used. E.g.

batch 1, dict 1, isDelta = false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = false, dictVector=[], indexVector=[2, 1, 0, 1] // use previous dictionary
batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2] // replacement

And I assume that 'unify_dictionaries' simply concatenates all dictionaries
into a single vector serialized in the first batch (haven't looked at the
code yet).
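
The reuse behaviour asked about in question 3 can be sketched as follows. This is a rough model, not Arrow code: `decode_with_reuse` is a hypothetical helper, and batch 2 is modelled as carrying no dictionary message at all (`None`) rather than an empty dictionary, which is an assumption about how "use previous dictionary" would appear on the wire.

```python
from typing import List, Optional, Tuple

# (replacement dictionary or None, index vector); None means that no
# dictionary message was sent for this batch (assumed representation).
Batch = Tuple[Optional[List[str]], List[int]]


def decode_with_reuse(batches: List[Batch]) -> List[List[str]]:
    """Resolve each batch's indices against the most recent dictionary."""
    current: List[str] = []
    decoded: List[List[str]] = []
    for replacement, indices in batches:
        if replacement is not None:
            # A replacement was received: it supersedes the previous one.
            current = list(replacement)
        # Otherwise keep using the dictionary from an earlier batch.
        decoded.append([current[i] for i in indices])
    return decoded


batches: List[Batch] = [
    (["a", "b", "c"], [0, 1, 1, 2]),  # batch 1: initial dictionary
    (None, [2, 1, 0, 1]),             # batch 2: reuse previous dictionary
    (["c", "a", "d"], [0, 1, 2]),     # batch 3: replacement
]
decoded = decode_with_reuse(batches)
```

With these inputs, batch 2 decodes to c, b, a, b using the dictionary from batch 1, and batch 3 decodes to c, a, d using the replacement.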

4) Is it valid for a delta dictionary to have an update in a subsequent
batch even though the update is not used in that batch? A silly example
would be:

batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[null, null, null, null]
batch 3, dict 1, isDelta = true, dictVector=[], indexVector=[0, 3, 2]
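
As a sanity check on the example in question 4, the sketch below decodes those three batches with a plain Python list standing in for the dictionary. `decode_deltas` is a hypothetical helper, nulls are modelled as `None`, and starting from an empty dictionary when the first message is already a delta is an assumption.

```python
from typing import List, Optional, Tuple

# (delta values, index vector); a null index is modelled as None.
DeltaBatch = Tuple[List[str], List[Optional[int]]]


def decode_deltas(batches: List[DeltaBatch]) -> List[List[Optional[str]]]:
    """Append each delta, then resolve that batch's indices."""
    dictionary: List[str] = []  # assumption: deltas start from empty
    decoded: List[List[Optional[str]]] = []
    for delta, indices in batches:
        # The delta is applied whether or not this batch references it.
        dictionary.extend(delta)
        decoded.append(
            [None if i is None else dictionary[i] for i in indices]
        )
    return decoded


batches: List[DeltaBatch] = [
    (["a", "b", "c"], [0, 1, 1, 2]),
    (["d"], [None, None, None, None]),  # "d" is added but not used here
    ([], [0, 3, 2]),                    # index 3 resolves to "d"
]
decoded = decode_deltas(batches)
```

In this model the unused entry in batch 2 is harmless: it simply becomes available to later batches, which is why index 3 in batch 3 resolves.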

Thanks for your help!

[1] https://github.com/apache/arrow/pull/38423
[2] https://github.com/apache/arrow/blob/main/format/Message.fbs#L134
[3] https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions18unify_dictionariesE

--


Chris Larsen

