[ 
https://issues.apache.org/jira/browse/ORC-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taiyang Li reassigned ORC-1950:
-------------------------------

    Assignee: Taiyang Li

> [C++] Replace std::unordered_map with google dense_hash_map in 
> SortedStringDictionary and remove reorder to improve write performance of 
> dict-encoding columns
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ORC-1950
>                 URL: https://issues.apache.org/jira/browse/ORC-1950
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Taiyang Li
>            Assignee: Taiyang Li
>            Priority: Major
>
> Replace std::unordered_map with Google's dense_hash_map in 
> SortedStringDictionary and remove the reorder step to improve write 
> performance for dictionary-encoded columns. This follows earlier 
> optimizations made in MaxCompute ORC: 
> [https://zhuanlan.zhihu.com/p/75112745]  
>  
> We benchmarked writing a dictionary-encoded string column with different 
> cardinalities (10, 100, 1000, 10000, 100000) over 1000000 rows. 
> POC: 
> [https://github.com/bigo-sg/ClickHouse/commit/b9fc51fd8ded21f84f31cfa169350906b9f14456]
>   
>  
>  
> Baseline: 
> {code:java}
> 2025-07-08T16:22:54+08:00
> Running ./build_gcc/src/Common/benchmarks/orc_string_dictionary
> Run on (96 X 2900 MHz CPU s)
> CPU Caches:
> L1 Data 32 KiB (x48)
> L1 Instruction 32 KiB (x48)
> L2 Unified 1024 KiB (x48)
> L3 Unified 36608 KiB (x2)
> Load Average: 27.44, 62.03, 43.39
> Benchmark                                                            Time             CPU   Iterations
> BM_writeStringDictionary<NewSortedStringDictionary, 10>       49801815 ns     49800922 ns           11
> BM_writeStringDictionary<NewSortedStringDictionary, 100>      60295648 ns     60294001 ns           12
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>     73385081 ns     73383192 ns           10
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>   121725939 ns    121642493 ns            6
> BM_writeStringDictionary<NewSortedStringDictionary, 100000>  232034759 ns    232031059 ns            3
> {code}
>  
> Opt1: Replace std::unordered_map with Google's dense_hash_map in 
> SortedStringDictionary. 
> {code:java}
> BM_writeStringDictionary<NewSortedStringDictionary, 10>       47356699 ns     47353586 ns           12
> BM_writeStringDictionary<NewSortedStringDictionary, 100>      45186048 ns     45183162 ns           15
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>     57606074 ns     57602874 ns           12
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>    95010486 ns     95005368 ns            7
> BM_writeStringDictionary<NewSortedStringDictionary, 100000>  188142023 ns    188129613 ns            4
> {code}
>  
> Opt2: Remove the reorder operation in SortedStringDictionary, as it is not 
> required by the ORC specification. 
> {code:java}
> BM_writeStringDictionary<NewSortedStringDictionary, 10>       46043402 ns     46039712 ns           13
> BM_writeStringDictionary<NewSortedStringDictionary, 100>      44810622 ns     44809496 ns           16
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>     55505824 ns     55500060 ns           12
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>    91577865 ns     91573214 ns            7
> BM_writeStringDictionary<NewSortedStringDictionary, 100000>  186722559 ns    186709219 ns            4
> {code}
>  
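
To illustrate what Opt2 removes, here is a minimal C++ sketch. It is not the ORC implementation: InsertionOrderDictionary and sortedRemap are hypothetical names, and std::unordered_map stands in for the hash map so that the example needs only the standard library. Indices are assigned in first-seen order, and the reorder step is the extra sort plus remap table built in sortedRemap.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a string dictionary; not the ORC class.
class InsertionOrderDictionary {
 public:
  // Returns the index of `value`, adding it to the dictionary if it is new.
  std::size_t insert(const std::string& value) {
    auto result = index_.emplace(value, entries_.size());
    if (result.second) {
      entries_.push_back(value);  // new entry: its index is the old size
    }
    return result.first->second;
  }

  // Entries in first-seen order; usable directly when no reorder is done.
  const std::vector<std::string>& entries() const { return entries_; }

  // The kind of reorder step Opt2 drops: for each insertion-order index,
  // compute its rank in lexicographic order.
  std::vector<std::size_t> sortedRemap() const {
    std::vector<std::size_t> order(entries_.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [this](std::size_t a, std::size_t b) {
                return entries_[a] < entries_[b];
              });
    std::vector<std::size_t> remap(entries_.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank) {
      remap[order[rank]] = rank;
    }
    return remap;
  }

 private:
  std::unordered_map<std::string, std::size_t> index_;
  std::vector<std::string> entries_;
};
```

In this sketch, skipping the reorder means the indices returned by insert can be used as they are, while keeping it costs one sort of the dictionary plus one remap lookup per stored index.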

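The workload described in the issue (a fixed number of rows drawn from a chosen number of distinct strings) can be approximated as below. This is a rough standard-library sketch under assumptions: makeColumn, encodeColumn, RunResult and the "value_" prefix are hypothetical and are not taken from the POC, which measures the actual dictionary code.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Builds `rows` strings drawn round-robin from `cardinality` distinct values.
// `cardinality` must be greater than zero. The "value_" prefix is a
// hypothetical choice for illustration.
std::vector<std::string> makeColumn(std::size_t rows, std::size_t cardinality) {
  std::vector<std::string> column;
  column.reserve(rows);
  for (std::size_t i = 0; i < rows; ++i) {
    column.push_back("value_" + std::to_string(i % cardinality));
  }
  return column;
}

struct RunResult {
  std::size_t distinct;              // number of dictionary entries
  std::chrono::nanoseconds elapsed;  // wall time spent encoding
};

// Dictionary-encodes the column with std::unordered_map and times the loop.
RunResult encodeColumn(const std::vector<std::string>& column) {
  std::unordered_map<std::string, std::size_t> dict;
  std::vector<std::size_t> codes;
  codes.reserve(column.size());
  auto start = std::chrono::steady_clock::now();
  for (const std::string& value : column) {
    // New values receive the next free index; known values keep theirs.
    auto it = dict.emplace(value, dict.size()).first;
    codes.push_back(it->second);
  }
  auto stop = std::chrono::steady_clock::now();
  return {dict.size(),
          std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Calling encodeColumn(makeColumn(1000000, 10)) and repeating for the other cardinalities gives a comparable sweep, although absolute timings from this sketch are not expected to match the figures above.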


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
