[
https://issues.apache.org/jira/browse/ORC-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Taiyang Li updated ORC-1950:
----------------------------
Description:
Replace std::unordered_map with Google dense_hash_map in SortedStringDictionary
and remove the reorder step to improve write performance of dictionary-encoded columns.
POC:
[https://github.com/bigo-sg/ClickHouse/commit/b9fc51fd8ded21f84f31cfa169350906b9f14456]
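To illustrate the idea, here is a minimal sketch, not the ORC or POC code. It assumes "remove reorder" means that dictionary ids are handed out in insertion order, so no sort-and-remap pass is needed before the dictionary is written. `InsertionOrderDictionary` is a hypothetical name, and std::unordered_map stands in for dense_hash_map only to keep the sketch free of external dependencies.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration; this is not the ORC SortedStringDictionary API.
class InsertionOrderDictionary {
 public:
  // Returns the id of `value`, assigning the next free id the first time
  // the value is seen. Ids therefore follow insertion order.
  std::size_t insert(const std::string& value) {
    // entries_.size() is evaluated before emplace runs, so a new key
    // receives the id equal to the current number of entries.
    auto result = index_.emplace(value, entries_.size());
    if (result.second) {
      entries_.push_back(value);  // new key: remember it in id order
    }
    return result.first->second;
  }

  std::size_t size() const { return entries_.size(); }

  // Looks up the string stored under a previously returned id.
  const std::string& at(std::size_t id) const { return entries_[id]; }

 private:
  // Stand-in for dense_hash_map (assumption made for this sketch).
  std::unordered_map<std::string, std::size_t> index_;
  // Dictionary entries in id order; can be written out without reordering.
  std::vector<std::string> entries_;
};
```

Because the ids stored for each row already match the position of the entry in `entries_`, a writer built on this layout could emit the dictionary as-is. The cost is that entries are no longer sorted.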
baseline:
{code:java}
2025-07-08T16:22:54+08:00
Running ./build_gcc/src/Common/benchmarks/orc_string_dictionary
Run on (96 X 2900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x48)
  L1 Instruction 32 KiB (x48)
  L2 Unified 1024 KiB (x48)
  L3 Unified 36608 KiB (x2)
Load Average: 27.44, 62.03, 43.39
Benchmark                                                             Time            CPU  Iterations
BM_writeStringDictionary<NewSortedStringDictionary, 10>        49801815 ns    49800922 ns          11
BM_writeStringDictionary<NewSortedStringDictionary, 100>       60295648 ns    60294001 ns          12
BM_writeStringDictionary<NewSortedStringDictionary, 1000>      73385081 ns    73383192 ns          10
BM_writeStringDictionary<NewSortedStringDictionary, 10000>    121725939 ns   121642493 ns           6
BM_writeStringDictionary<NewSortedStringDictionary, 100000>   232034759 ns   232031059 ns           3
{code}
Opt1:
> [C++] Replace std::unordered_map with Google dense_hash_map in
> SortedStringDictionary and remove the reorder step to improve write performance of
> dictionary-encoded columns
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ORC-1950
> URL: https://issues.apache.org/jira/browse/ORC-1950
> Project: ORC
> Issue Type: Bug
> Reporter: Taiyang Li
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)