Dimitri Vorona created ARROW-2330:
-------------------------------------

             Summary: Optimize delta buffer creation with partially finishable 
array builders
                 Key: ARROW-2330
                 URL: https://issues.apache.org/jira/browse/ARROW-2330
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
    Affects Versions: 0.8.0
            Reporter: Dimitri Vorona
             Fix For: 0.9.0


The main aim of this change is to optimize the building of delta dictionaries. 
In the current version, delta dictionaries are built using an additional 
"overflow" buffer, which leads to complicated and potentially error-prone code 
and to subpar performance, because it doubles the number of lookups.

I solve this problem by introducing the notion of partially finishable array 
builders, i.e. builders that are able to retain their state when Finish is 
called. The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
overloaded with an additional signature, Finish(bool reset_builder, 
std::shared_ptr<Array>* out). The resulting Arrays point to the same data 
buffer with different offsets.
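
To make the idea concrete, here is a minimal, self-contained sketch of a 
partially finishable builder. It is not the code from the pull request and 
does not use the Arrow classes: ToyArray and ToyBuilder are hypothetical 
stand-ins, and a shared std::vector plays the role of the data buffer. The 
sketch also assumes that a non-resetting Finish returns only the values 
appended since the previous Finish (the "delta").

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for an array: an (offset, length) view into
// storage that is shared with the builder and with other arrays.
struct ToyArray {
  std::shared_ptr<std::vector<int64_t>> buffer;
  std::size_t offset = 0;
  std::size_t length = 0;

  int64_t Value(std::size_t i) const {
    assert(i < length);
    return (*buffer)[offset + i];
  }
};

// Hypothetical builder that can be finished without losing its state.
class ToyBuilder {
 public:
  ToyBuilder() : buffer_(std::make_shared<std::vector<int64_t>>()) {}

  void Append(int64_t value) { buffer_->push_back(value); }

  // Emits the values appended since the previous Finish call.
  // reset_builder == false: keep the storage and remember where the next
  //   delta starts, so later arrays share the buffer at a larger offset.
  // reset_builder == true: drop the state and start with fresh storage.
  void Finish(bool reset_builder, std::shared_ptr<ToyArray>* out) {
    auto result = std::make_shared<ToyArray>();
    result->buffer = buffer_;
    result->offset = finished_;
    result->length = buffer_->size() - finished_;
    *out = result;

    if (reset_builder) {
      buffer_ = std::make_shared<std::vector<int64_t>>();
      finished_ = 0;
    } else {
      finished_ = buffer_->size();
    }
  }

  // The classic signature keeps its old meaning: finish and reset.
  void Finish(std::shared_ptr<ToyArray>* out) { Finish(true, out); }

 private:
  std::shared_ptr<std::vector<int64_t>> buffer_;
  std::size_t finished_ = 0;  // number of values already handed out
};
```

With this toy builder, appending two values, calling Finish(false, &first), 
appending a third value and calling Finish(false, &delta) yields two arrays 
that refer to the same buffer: first covers positions 0 and 1, delta starts 
at offset 2. No second "overflow" structure is involved.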

I'm aware that the change is fairly large, but I'd like to discuss it here. 
The solution makes the code more straightforward, doesn't bloat the code base 
too much, and leaves the API more or less untouched. Additionally, the new way 
of making delta dictionaries by using a different call signature for Finish 
feels cleaner to me.

I'm looking forward to your critique and ideas for improvement.

The pull request is available at: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
