[
https://issues.apache.org/jira/browse/ORC-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun resolved ORC-1950.
--------------------------------
Fix Version/s: 2.2.0
Resolution: Fixed
Issue resolved by pull request 2321
[https://github.com/apache/orc/pull/2321]
> [C++] Replace std::unordered_map with google dense_hash_map in
> SortedStringDictionary and remove reorder to improve write performance of
> dict-encoding columns
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ORC-1950
> URL: https://issues.apache.org/jira/browse/ORC-1950
> Project: ORC
> Issue Type: Bug
> Reporter: Taiyang Li
> Assignee: Taiyang Li
> Priority: Minor
> Fix For: 2.2.0
>
>
> Replace std::unordered_map with Google dense_hash_map in SortedStringDictionary
> and remove the reorder step to improve the write performance of
> dictionary-encoded columns. This follows earlier optimizations in MaxCompute ORC:
> [https://zhuanlan.zhihu.com/p/75112745]
>
> We benchmarked writing a dictionary-encoded string column of 1000000 rows at
> different cardinalities (10, 100, 1000, 10000, 100000).
> POC:
> [https://github.com/bigo-sg/ClickHouse/commit/b9fc51fd8ded21f84f31cfa169350906b9f14456]
>
>
>
> baseline:
> {code:java}
> 2025-07-08T16:22:54+08:00
> Running ./build_gcc/src/Common/benchmarks/orc_string_dictionary
> Run on (96 X 2900 MHz CPU s)
> CPU Caches:
> L1 Data 32 KiB (x48)
> L1 Instruction 32 KiB (x48)
> L2 Unified 1024 KiB (x48)
> L3 Unified 36608 KiB (x2)
> Load Average: 27.44, 62.03, 43.39
> Benchmark                                                           Time            CPU   Iterations
> BM_writeStringDictionary<NewSortedStringDictionary, 10>      49801815 ns    49800922 ns           11
> BM_writeStringDictionary<NewSortedStringDictionary, 100>     60295648 ns    60294001 ns           12
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>    73385081 ns    73383192 ns           10
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>  121725939 ns   121642493 ns            6
> BM_writeStringDictionary<NewSortedStringDictionary, 100000> 232034759 ns   232031059 ns            3
> {code}
>
> Opt1: Replace std::unordered_map with Google dense_hash_map in
> SortedStringDictionary.
> {code:java}
> BM_writeStringDictionary<NewSortedStringDictionary, 10>      47356699 ns    47353586 ns           12
> BM_writeStringDictionary<NewSortedStringDictionary, 100>     45186048 ns    45183162 ns           15
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>    57606074 ns    57602874 ns           12
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>   95010486 ns    95005368 ns            7
> BM_writeStringDictionary<NewSortedStringDictionary, 100000> 188142023 ns   188129613 ns            4
> {code}
>
> Opt2: Remove the reorder operation in SortedStringDictionary, as it is not
> required by the ORC specification.
> {code:java}
> BM_writeStringDictionary<NewSortedStringDictionary, 10>      46043402 ns    46039712 ns           13
> BM_writeStringDictionary<NewSortedStringDictionary, 100>     44810622 ns    44809496 ns           16
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>    55505824 ns    55500060 ns           12
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>   91577865 ns    91573214 ns            7
> BM_writeStringDictionary<NewSortedStringDictionary, 100000> 186722559 ns   186709219 ns            4
> {code}
>
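
For readers unfamiliar with the two optimizations, here is a minimal sketch of the idea, not the code from pull request 2321. The class name InsertionOrderDictionary and its methods are hypothetical. The sketch uses std::unordered_map so that it compiles without third-party headers; substituting google::dense_hash_map (which, among other things, needs set_empty_key() called before first use) is left as an assumption about how Opt1 would look. It also assumes that "reorder" means remapping insertion-order ids to sorted positions; sortedRemap() shows that extra pass, which Opt2 drops.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a writer-side string dictionary (not the ORC class).
class InsertionOrderDictionary {
 public:
  // Returns the id of `value`; a value seen for the first time gets the next id.
  std::size_t insert(const std::string& value) {
    // The map type is the piece Opt1 would swap for a faster hash map.
    auto result = ids_.emplace(value, entries_.size());
    if (result.second) {
      entries_.push_back(value);  // new entry: remember it in insertion order
    }
    return result.first->second;
  }

  std::size_t size() const { return entries_.size(); }
  const std::string& at(std::size_t id) const { return entries_[id]; }

  // The extra pass a sorted dictionary needs (assumed meaning of "reorder"):
  // remap[oldId] is the position of that entry in lexicographic order.
  // Every id already written for the rows must then be translated through it.
  std::vector<std::size_t> sortedRemap() const {
    std::vector<std::size_t> order(entries_.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
      order[i] = i;
    }
    std::sort(order.begin(), order.end(),
              [this](std::size_t a, std::size_t b) {
                return entries_[a] < entries_[b];
              });
    std::vector<std::size_t> remap(entries_.size());
    for (std::size_t pos = 0; pos < order.size(); ++pos) {
      remap[order[pos]] = pos;
    }
    return remap;
  }

 private:
  std::unordered_map<std::string, std::size_t> ids_;
  std::vector<std::string> entries_;
};
```

With insertion-order ids, the ids handed out by insert() can be written as-is; with a sorted dictionary, the writer pays for the sort plus one translation per row. That may be why the benchmark gain from Opt2 is small at low cardinality, where the sort is cheap.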