[
https://issues.apache.org/jira/browse/ARROW-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718728#comment-16718728
]
Antoine Pitrou commented on ARROW-2532:
---------------------------------------
[~wesmckinn] I have been thinking about this a bit. I think there are several
possible designs:
# An entirely duplicate class hierarchy and implementation. The main downside
is obviously the development work, and later maintenance to try to keep the
interfaces consistent as one of the hierarchies evolves.
# A refactor of the current ArrayBuilder classes to take a template parameter
that either implements chunking or raises {{CapacityError}}. This is very heavy
in terms of implementation, but it is conceptually "clean" and avoids code
duplication.
# A set of "heavy" wrappers that redefine all overloads of {{Append}} and
{{AppendValues}}, but delegate some of the work to an underlying ArrayBuilder.
# A set of "light-weight" wrappers that define templated {{Append}} and
{{AppendValues}} methods that simply redirect to an underlying ArrayBuilder,
but catch {{CapacityError}} to finalize the current chunk and try again.
# A set of "very thin" wrappers that only have a couple of methods for space
reservation. So {{ChunkedBuilder::Reserve}} would either reserve some space on
the underlying ArrayBuilder, or finalize the current chunk and reserve space in
the new chunk. Actual appending would need the user to call the underlying
ArrayBuilder directly, taking care to sequence calls to the ChunkedBuilder and
the ArrayBuilder carefully: you must first ask the ChunkedBuilder for space
reservation, then append on the ArrayBuilder.
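As a rough illustration of the fourth design (forward the call, and on a
capacity error finalize the current chunk and retry), here is a minimal,
self-contained sketch. Everything in it is a toy stand-in invented for this
example ({{ToyStatus}}, {{ToyStringBuilder}}, {{ToyChunkedBuilder}}); it does
not use the real Arrow classes, and the byte limit is a small constructor
argument standing in for the 32-bit offset limit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins, NOT the Arrow API.
enum class ToyStatus { OK, CapacityError };
using ToyChunk = std::vector<std::string>;

// A string builder whose total data size is capped at max_bytes.
class ToyStringBuilder {
 public:
  explicit ToyStringBuilder(std::size_t max_bytes) : max_bytes_(max_bytes) {}

  ToyStatus Append(const std::string& value) {
    if (bytes_ + value.size() > max_bytes_) return ToyStatus::CapacityError;
    bytes_ += value.size();
    values_.push_back(value);
    return ToyStatus::OK;
  }

  // Hand over the accumulated values and reset the builder.
  ToyChunk Finish() {
    ToyChunk out = std::move(values_);
    values_.clear();
    bytes_ = 0;
    return out;
  }

  std::size_t length() const { return values_.size(); }

 private:
  std::size_t max_bytes_;
  std::size_t bytes_ = 0;
  ToyChunk values_;
};

// Light-weight wrapper: a templated Append that redirects to the underlying
// builder; on CapacityError it finalizes the current chunk and tries again.
template <typename Builder>
class ToyChunkedBuilder {
 public:
  explicit ToyChunkedBuilder(Builder builder) : builder_(std::move(builder)) {}

  template <typename... Args>
  ToyStatus Append(Args&&... args) {
    ToyStatus st = builder_.Append(args...);
    // Retrying is pointless if the builder is already empty.
    if (st != ToyStatus::CapacityError || builder_.length() == 0) return st;
    chunks_.push_back(builder_.Finish());
    return builder_.Append(args...);  // a second failure is returned as-is
  }

  std::vector<ToyChunk> Finish() {
    if (builder_.length() > 0) chunks_.push_back(builder_.Finish());
    std::vector<ToyChunk> out = std::move(chunks_);
    chunks_.clear();
    return out;
  }

 private:
  Builder builder_;
  std::vector<ToyChunk> chunks_;
};

// Usage: with an 8-byte limit, three 4-byte strings end up in two chunks.
std::vector<ToyChunk> DemoAppendWithRetry() {
  ToyChunkedBuilder<ToyStringBuilder> builder{ToyStringBuilder(8)};
  const std::vector<std::string> inputs = {"abcd", "efgh", "ijkl"};
  for (const std::string& s : inputs) {
    ToyStatus st = builder.Append(s);
    assert(st == ToyStatus::OK);
    (void)st;
  }
  return builder.Finish();
}
```

The wrapper never inspects the value being appended, which is what keeps it
small; the cost is that a failed first attempt is simply thrown away and
repeated.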
The fifth approach is probably the most light-weight in terms of
implementation. The main downside is that it requires a bit more care from the user
in how they interact with the two builders.
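To make the required call sequencing concrete, here is a minimal sketch of
the "very thin" wrapper idea. The names ({{ToyBinaryBuilder}},
{{ThinChunkedBuilder}}, {{AppendUnchecked}}) are hypothetical and exist only
for this example; they are not Arrow classes. The chunk limit is a
constructor parameter, so a tiny value can be used to exercise chunking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins, NOT the Arrow API.
using ToyBinaryChunk = std::vector<std::string>;

// Underlying builder: appends without any capacity check of its own.
class ToyBinaryBuilder {
 public:
  void AppendUnchecked(const std::string& value) {
    data_bytes_ += value.size();
    values_.push_back(value);
  }

  // Hand over the accumulated values and reset the builder.
  ToyBinaryChunk Finish() {
    ToyBinaryChunk out = std::move(values_);
    values_.clear();
    data_bytes_ = 0;
    return out;
  }

  std::size_t data_bytes() const { return data_bytes_; }
  std::size_t length() const { return values_.size(); }

 private:
  std::size_t data_bytes_ = 0;
  ToyBinaryChunk values_;
};

// Thin wrapper: only handles space reservation and chunk bookkeeping.
class ThinChunkedBuilder {
 public:
  ThinChunkedBuilder(ToyBinaryBuilder* builder, std::size_t max_chunk_bytes)
      : builder_(builder), max_chunk_bytes_(max_chunk_bytes) {
    assert(builder != nullptr);
  }

  // Either the current chunk has room for nbytes, or it is finalized so that
  // the next append starts a new chunk. Returns false if nbytes can never
  // fit in a single chunk.
  bool Reserve(std::size_t nbytes) {
    if (nbytes > max_chunk_bytes_) return false;
    if (builder_->data_bytes() + nbytes > max_chunk_bytes_) {
      chunks_.push_back(builder_->Finish());
    }
    return true;
  }

  std::vector<ToyBinaryChunk> Finish() {
    if (builder_->length() > 0) chunks_.push_back(builder_->Finish());
    std::vector<ToyBinaryChunk> out = std::move(chunks_);
    chunks_.clear();
    return out;
  }

 private:
  ToyBinaryBuilder* builder_;
  std::size_t max_chunk_bytes_;
  std::vector<ToyBinaryChunk> chunks_;
};

// Usage: the caller must reserve on the wrapper first, then append on the
// underlying builder. Skipping Reserve would silently overflow the chunk.
std::vector<ToyBinaryChunk> DemoReserveThenAppend() {
  ToyBinaryBuilder values;
  ThinChunkedBuilder chunked(&values, /*max_chunk_bytes=*/10);
  const std::vector<std::string> inputs = {"aaaa", "bbbb", "cccc", "dd"};
  for (const std::string& s : inputs) {
    if (!chunked.Reserve(s.size())) return {};
    values.AppendUnchecked(s);
  }
  return chunked.Finish();
}
```

Nothing in this sketch stops a caller from appending without reserving,
which is exactly the extra care this design asks of the user.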
Specialized implementations will be needed for binary, list and struct arrays,
and I'm not sure what we can do for dictionary arrays.
To make testing easier, we should ideally make the maximum chunk size
parametrizable.
> [C++] Add chunked builder classes
> ---------------------------------
>
> Key: ARROW-2532
> URL: https://issues.apache.org/jira/browse/ARROW-2532
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.9.0
> Reporter: Antoine Pitrou
> Priority: Major
>
> I think it would be useful to have chunked builders for list, string and
> binary types. A chunked builder would produce a chunked array as output,
> circumventing the 32-bit offset limit of those types. There's some
> special-casing scattered around our NumPy conversion routines right now.