[
https://issues.apache.org/jira/browse/ARROW-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Omer Ozarslan updated ARROW-6375:
---------------------------------
Description:
I was benchmarking the array builder API against the STL API for converting
row data to Arrow tables. Converting {{std::vector}} values with the STL API
is around 1.5-1.8 times slower than doing so with the builder API. This
appears to be primarily because rows are appended via the {{...::Append}}
method, by iterating over
{{ConversionTraits<std::vector<...>>::AppendRow}} for each value.
Calling {{...::AppendValues}} would be more efficient; however,
{{ConversionTraits}} doesn't offer a way to append more than one cell
({{AppendRow}} takes a builder and a single cell as its parameters).
Would it be possible to extend conversion traits with an optional method
{{AppendRows(Builder, Cell*, size_t)}}, which would allow a template
specialization to efficiently append multiple values at once? In the example
above, this function would be called with {{std::vector::data()}} and
{{std::vector::size()}} if provided. If the specialization doesn't provide
such a method, the current behavior (i.e. iterating over {{AppendRow}}) can be
used as the default.
[This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
is the particular part of the code that would be replaced in practice. Instead
of calling {{AppendRow}} directly in a for loop, a public helper function (e.g.
{{stl::AppendRows}}) could be provided that implements the above logic.
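A minimal sketch of the dispatch such a helper could perform, written without Arrow so that it compiles on its own. Everything here is hypothetical: {{MockInt64Builder}} stands in for a real builder (and returns {{bool}} where Arrow would return a status), {{RowOnlyTraits}} and {{BulkTraits}} stand in for {{ConversionTraits}} specializations, and {{HasAppendRows}} is just one possible way to detect the optional method.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow array builder. It counts calls so the
// chosen code path can be observed.
struct MockInt64Builder {
  std::vector<int64_t> values;
  int append_calls = 0;         // single-value Append calls
  int append_values_calls = 0;  // bulk AppendValues calls

  bool Append(int64_t v) {
    values.push_back(v);
    ++append_calls;
    return true;
  }
  bool AppendValues(const int64_t* data, std::size_t n) {
    values.insert(values.end(), data, data + n);
    ++append_values_calls;
    return true;
  }
};

// Hypothetical traits that only provide the mandatory per-cell method.
struct RowOnlyTraits {
  static bool AppendRow(MockInt64Builder& b, const int64_t& cell) {
    return b.Append(cell);
  }
};

// Hypothetical traits that additionally provide the optional bulk method.
struct BulkTraits {
  static bool AppendRow(MockInt64Builder& b, const int64_t& cell) {
    return b.Append(cell);
  }
  static bool AppendRows(MockInt64Builder& b, const int64_t* cells,
                         std::size_t n) {
    return b.AppendValues(cells, n);
  }
};

// Detects whether Traits::AppendRows(Builder&, const Cell*, size_t) is callable.
template <typename Traits, typename Builder, typename Cell, typename = void>
struct HasAppendRows : std::false_type {};

template <typename Traits, typename Builder, typename Cell>
struct HasAppendRows<
    Traits, Builder, Cell,
    decltype(void(Traits::AppendRows(std::declval<Builder&>(),
                                     std::declval<const Cell*>(),
                                     std::size_t{0})))> : std::true_type {};

// Bulk path: used when the traits provide AppendRows.
template <typename Traits, typename Builder, typename Cell>
bool AppendRowsImpl(Builder& b, const Cell* cells, std::size_t n,
                    std::true_type) {
  return Traits::AppendRows(b, cells, n);
}

// Fallback path: iterate over AppendRow, stopping at the first failure.
template <typename Traits, typename Builder, typename Cell>
bool AppendRowsImpl(Builder& b, const Cell* cells, std::size_t n,
                    std::false_type) {
  for (std::size_t i = 0; i < n; ++i) {
    if (!Traits::AppendRow(b, cells[i])) return false;
  }
  return true;
}

// Public helper: picks the bulk path if available, else the per-cell loop.
template <typename Traits, typename Builder, typename Cell>
bool AppendRows(Builder& b, const Cell* cells, std::size_t n) {
  return AppendRowsImpl<Traits>(b, cells, n,
                                HasAppendRows<Traits, Builder, Cell>{});
}
```

With a {{std::vector<int64_t> v}}, a call such as {{AppendRows<BulkTraits>(builder, v.data(), v.size())}} results in a single {{AppendValues}} call, while {{AppendRows<RowOnlyTraits>(...)}} falls back to one {{Append}} per element. Tag dispatch keeps the sketch C++11-compatible; with C++17, {{if constexpr}} would be an alternative.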
> [C++] Extend ConversionTraits to allow efficiently appending list values in
> STL API
> -----------------------------------------------------------------------------------
>
> Key: ARROW-6375
> URL: https://issues.apache.org/jira/browse/ARROW-6375
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Omer Ozarslan
> Priority: Major