[ https://issues.apache.org/jira/browse/ARROW-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529119#comment-17529119 ]
Tobias Zagorni commented on ARROW-16161:
----------------------------------------

I created a hacky prototype of an ExecArrayData class that does not hold a shared_ptr to DataType. It is used in Datum as an alternative to ArrayData. I can currently run all the ExecuteScalarExpressionOverhead benchmarks with it. There are noticeable performance improvements in multi-threaded runs with small batch sizes (e.g. 1000 rows). The best result is in zero_copy_expression:

{noformat}
old: ExecuteScalarExpressionOverhead/zero_copy_expression/rows_per_batch:1000/threads:16   1613271 ns   25730049 ns   32   rows_per_second=388.651M/s
new: ExecuteScalarExpressionOverhead/zero_copy_expression/rows_per_batch:1000/threads:16   1121647 ns   17913402 ns   32   rows_per_second=558.241M/s
{noformat}

(full results are attached: [^ExecArrayData-difference.txt])

In other places I see a small performance reduction, and the simple_expression/100 improvement is smaller than with the first copyable-DataType experiment. This is probably due to the additional code paths in this implementation, and to my way of getting a shared_ptr<DataType> back from ExecArrayData, which clones the DataType object and is far from optimal. I'm still looking into this.

Most of the remaining reference-counting overhead seems to come from three places:
* ValueDescr / resolving output types using it
* Scalar datums
* pointers to the CPUDevice instance, because CPUMemoryManager instances are recreated a lot
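As a rough illustration of the approach (a minimal sketch; all names here are hypothetical, not the prototype's actual API), a non-owning variant of array data might look like this, including the clone step identified above as a source of overhead:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>

// Stand-in for arrow::DataType, simplified for the sketch.
struct DataType {
  std::string name;
};

// Hypothetical ExecArrayData-like struct: it borrows a raw DataType pointer
// instead of sharing ownership, so copying it inside expression execution
// does no atomic reference counting.
struct ExecArrayDataSketch {
  const DataType* type;  // borrowed; the caller guarantees its lifetime
  int64_t length;

  // Getting a shared_ptr<DataType> back requires cloning the type object,
  // which the comment above notes is far from optimal.
  std::shared_ptr<DataType> owned_type() const {
    return std::make_shared<DataType>(*type);
  }
};
```

The design trades pointer-copy cost for a lifetime contract: whoever builds the batch must keep the DataType alive for the duration of the call.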
I'll create a different subtask for this. [Code|https://github.com/zagto/arrow/compare/datatype-performance-execarraydata-baseline...zagto:datatype-performance?expand=1] (ignore the weird mix of template and subclassing for ExecArrayData)

> [C++] Overhead of std::shared_ptr<DataType> copies is causing thread contention
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-16161
>                 URL: https://issues.apache.org/jira/browse/ARROW-16161
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Tobias Zagorni
>            Priority: Major
>         Attachments: ExecArrayData-difference.txt
>
> We created a benchmark to measure ExecuteScalarExpression performance in ARROW-16014. We noticed significant thread contention (even though there shouldn't be much, if any, for this task). As part of ARROW-16138 we have been investigating possible causes.
> One cause seems to be contention from copying shared_ptr<DataType> objects. Two possible solutions jump to mind, and I'm sure there are many more.
> ExecBatch is an internal type, used inside of ExecuteScalarExpression as well as inside of the execution engine. In the former we can safely assume the data types will exist for the duration of the call. In the latter we can safely assume the data types will exist for the duration of the execution plan. Thus we can probably take a more targeted fix and migrate only ExecBatch to using DataType* (or const DataType&).
> On the other hand, we might consider a more global approach. All of our "stock" data types are assumed to have static storage duration. However, we must use std::shared_ptr<DataType> because users could create their own extension types. We could invent an "extension type registration" system where extension types must first be registered with the C++ lib before being used.
> Then we could have long-lived DataType instances, and we could replace std::shared_ptr<DataType> with DataType* (or const DataType&) throughout most of the code base.
> But, as I mentioned, I'm sure there are many approaches to take. CC [~lidavidm] and [~apitrou] and [~yibocai] for thoughts, but this might be interesting for just about any C++ dev.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
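The registration idea quoted above can be sketched as follows (a minimal illustration under assumed names; TypeRegistry, Register, and Lookup are not existing Arrow APIs): the registry owns each extension type for the life of the process, so everything downstream can pass a plain const DataType* with no reference counting.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for arrow::DataType, simplified for the sketch.
struct DataType {
  std::string name;
};

// Hypothetical registry: extension types are registered once, kept alive by
// the registry, and handed out as stable raw pointers thereafter.
class TypeRegistry {
 public:
  // Registering transfers ownership to the registry; the returned pointer
  // remains valid as long as the registry lives.
  const DataType* Register(std::unique_ptr<DataType> type) {
    const DataType* raw = type.get();
    types_[raw->name] = std::move(type);
    return raw;
  }

  // Returns the long-lived instance, or nullptr if the name is unknown.
  const DataType* Lookup(const std::string& name) const {
    auto it = types_.find(name);
    return it == types_.end() ? nullptr : it->second.get();
  }

 private:
  std::unordered_map<std::string, std::unique_ptr<DataType>> types_;
};
```

With long-lived instances, a shared_ptr copy (an atomic refcount bump on a cache line shared by every thread touching that type) becomes a plain pointer copy, which is the contention the issue is describing.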