Joris Peeters created ARROW-11869:
-------------------------------------

             Summary: [Java] Support re-emitting dictionaries in ArrowStreamWriter
                 Key: ARROW-11869
                 URL: https://issues.apache.org/jira/browse/ARROW-11869
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Java
            Reporter: Joris Peeters
            Assignee: Joris Peeters


The ArrowStreamWriter currently takes a DictionaryProvider at construction time 
and emits the dictionaries it uses only once.

However, the streaming format allows for the dictionaries to change between 
record batches. It would be useful to support this mechanism. It can be worked 
around in various ways (e.g. manually re-emitting DictionaryBatches between 
calling writeBatch), but this isn't very pleasant.
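
To illustrate what "dictionaries changing between record batches" means for a 
consumer of the stream, here is a toy model. It is only a sketch: StreamMessage 
and DictionaryReplacementDemo are hypothetical stand-ins, not Arrow classes, and 
the "stream" is just a list of in-memory messages. The one thing it is meant to 
show is that a later dictionary with the same id replaces the earlier one, so 
each record batch is decoded against the most recent dictionary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical stand-ins, not Arrow classes.
final class StreamMessage {
    final boolean isDictionary;  // true: dictionary batch, false: record batch
    final long dictionaryId;     // only meaningful for dictionary batches
    final List<String> values;   // dictionary values (dictionary batches only)
    final int[] indices;         // dictionary-encoded column (record batches only)

    private StreamMessage(boolean isDictionary, long dictionaryId,
                          List<String> values, int[] indices) {
        this.isDictionary = isDictionary;
        this.dictionaryId = dictionaryId;
        this.values = values;
        this.indices = indices;
    }

    static StreamMessage dictionary(long id, String... values) {
        return new StreamMessage(true, id, List.of(values), null);
    }

    static StreamMessage recordBatch(int... indices) {
        return new StreamMessage(false, -1L, null, indices);
    }
}

final class DictionaryReplacementDemo {
    // Decodes each record batch against the latest dictionary seen for dictionaryId.
    static List<List<String>> decode(List<StreamMessage> stream, long dictionaryId) {
        Map<Long, List<String>> current = new HashMap<>();
        List<List<String>> decoded = new ArrayList<>();
        for (StreamMessage message : stream) {
            if (message.isDictionary) {
                // Replacement (not delta): the new dictionary overwrites the old one.
                current.put(message.dictionaryId, message.values);
            } else {
                List<String> dictionary = current.get(dictionaryId);
                if (dictionary == null) {
                    throw new IllegalStateException("record batch before any dictionary");
                }
                List<String> column = new ArrayList<>();
                for (int index : message.indices) {
                    column.add(dictionary.get(index));
                }
                decoded.add(column);
            }
        }
        return decoded;
    }

    public static void main(String[] args) {
        List<StreamMessage> stream = List.of(
                StreamMessage.dictionary(0L, "a", "b"),
                StreamMessage.recordBatch(0, 1, 0),
                StreamMessage.dictionary(0L, "x", "y"),   // replaces dictionary 0
                StreamMessage.recordBatch(1, 1, 0));
        System.out.println(decode(stream, 0L)); // prints [[a, b, a], [y, y, x]]
    }
}
```

The manual workaround mentioned above amounts to producing the third message by 
hand before writing the second record batch.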

We'd somehow have to reconcile this with the abstract ArrowWriter parent and 
the ArrowFileWriter sibling. In the latter, for example, this mechanism is not 
supported.

An example solution (but perhaps we can do better) might be to add a virtual 
`writeBatch(Provider provider)` method that throws UnsupportedOperationException 
in ArrowFileWriter and re-emits the used dictionaries in ArrowStreamWriter.
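
A rough sketch of the shape this could take is given here. All class names 
(SketchWriter, SketchStreamWriter, SketchFileWriter) are hypothetical stand-ins 
and deliberately not the real Arrow classes; a plain Map plays the role of the 
provider, and "writing" just appends a description to a list. For simplicity it 
re-emits every dictionary in the provider rather than only the used ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the abstract writer parent; not the Arrow API.
abstract class SketchWriter {
    // Records what would be written to the underlying channel.
    protected final List<String> output = new ArrayList<>();

    // Existing behaviour: write the current record batch only.
    void writeBatch() {
        output.add("RecordBatch");
    }

    // Proposed overload: write the batch using dictionaries from the provider.
    abstract void writeBatch(Map<Long, List<String>> provider);

    List<String> output() {
        return output;
    }
}

// Stream variant: re-emits the dictionaries as replacements, then the batch.
final class SketchStreamWriter extends SketchWriter {
    @Override
    void writeBatch(Map<Long, List<String>> provider) {
        // TreeMap gives a deterministic order by dictionary id.
        for (Map.Entry<Long, List<String>> entry : new TreeMap<>(provider).entrySet()) {
            output.add("DictionaryBatch(id=" + entry.getKey()
                    + ", values=" + entry.getValue() + ")");
        }
        writeBatch();
    }
}

// File variant: the mechanism is not supported, so the overload rejects the call.
final class SketchFileWriter extends SketchWriter {
    @Override
    void writeBatch(Map<Long, List<String>> provider) {
        throw new UnsupportedOperationException(
                "re-emitting dictionaries is not supported by the file writer");
    }
}
```

One drawback of this shape is that the unsupported case only surfaces at 
runtime, which is part of why a better design may be worth looking for.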

For now, this issue only covers dictionary replacement, not dictionary deltas.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
