[ https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432390#comment-16432390 ]
ASF GitHub Bot commented on ARROW-2330: --------------------------------------- xhochy commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer creation with partially finishable array builders URL: https://github.com/apache/arrow/pull/1769#issuecomment-380127439 Hello Dimitri, I forgot to reply to you. This got drowned in all the other Arrow activity in the last days. Sorry about that. > Do you know why the RecordBatchBuilder has a partial Flush method? Is it because its instantion is more cumbersome compared to array builders? The `RecordBatchBuilder` is different to the other builders as it does not generate a Arrays but should be used for a more streaming-like output. In it, all builders are resetted on Flush. This is also the behaviour I would expect in the dictionary builder. > That being said, I don't think that it adds that much complexity, compared to the previous implementation. Most of the changes are in the testing code, and builder files themselves have around 30 additional LOC. At the same time, it makes the DictionaryBuilder quite straight forward. The added complexity is the thing that scares me most at the moment. The Builders are built with the intention that they complety own their buffers and can modify them as they like internally. You can ask them for snapshot access to elements to inspect them but we don't provide any guarantees that the memory addresses that were returned from them are still valid once you can a method on the builder again. As far as I understand the current implementation in some cases we return the current buffer pointer as new `Buffer` object in the BufferBuilder. Once we do expand it though later on, we may free this memory address and thus the `Buffer` object point to unreachable memory. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Optimize delta buffer creation with partially finishable array builders > ----------------------------------------------------------------------------- > > Key: ARROW-2330 > URL: https://issues.apache.org/jira/browse/ARROW-2330 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Affects Versions: 0.8.0 > Reporter: Dimitri Vorona > Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The main aim of this change is to optimize the building of delta > dictionaries. In the current version delta dictionaries are built using an > additional "overflow" buffer which leads to complicated and potentially > error-prone code and subpar performance by doubling the number of lookups. > I solve this problem by introducing the notion of partially finishable array > builders, i.e. builder which are able to retain the state on calling Finish. > The interface is based on RecordBatchBuilder::Flush, i.e. Finish is > overloaded with additional signature Finish(bool reset_builder, > std::shared_ptr<Array>* out). The resulting Arrays point to the same data > buffer with different offsets. > I'm aware that the change is kind of biggish, but I'd like to discuss it > here. The solution makes the code more straight forward, doesn't bloat the > code base too much and leaves the API more or less untouched. Additionally, > the new way to make delta dictionaries by using a different call signature to > Finish feel cleaner to me. > I'm looking forward to your critic and improvement ideas. > The pull request is available at: https://github.com/apache/arrow/pull/1769 -- This message was sent by Atlassian JIRA (v7.6.3#76005)