[ https://issues.apache.org/jira/browse/ORC-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Taiyang Li reassigned ORC-1950:
-------------------------------

    Assignee: Taiyang Li

> [C++] Replace std::unordered_map with google dense_hash_map in
> SortedStringDictionary and remove reorder to improve write performance of
> dict-encoding columns
> -------------------------------------------------------------------------
>
>                 Key: ORC-1950
>                 URL: https://issues.apache.org/jira/browse/ORC-1950
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Taiyang Li
>            Assignee: Taiyang Li
>            Priority: Major
>
> Replace std::unordered_map with google dense_hash_map in
> SortedStringDictionary and remove the reorder step to improve write
> performance of dict-encoding columns.
> This follows earlier optimizations in MaxCompute ORC:
> [https://zhuanlan.zhihu.com/p/75112745]
>
> We benchmarked writing a dictionary-encoded string column with different
> cardinalities (10, 100, 1000, 10000, 100000) over 1000000 rows.
> POC:
> [https://github.com/bigo-sg/ClickHouse/commit/b9fc51fd8ded21f84f31cfa169350906b9f14456]
>
> baseline:
> {code:java}
> 2025-07-08T16:22:54+08:00
> Running ./build_gcc/src/Common/benchmarks/orc_string_dictionary
> Run on (96 X 2900 MHz CPU s)
> CPU Caches:
>   L1 Data 32 KiB (x48)
>   L1 Instruction 32 KiB (x48)
>   L2 Unified 1024 KiB (x48)
>   L3 Unified 36608 KiB (x2)
> Load Average: 27.44, 62.03, 43.39
> Benchmark                                                            Time           CPU  Iterations
> BM_writeStringDictionary<NewSortedStringDictionary, 10>       49801815 ns   49800922 ns          11
> BM_writeStringDictionary<NewSortedStringDictionary, 100>      60295648 ns   60294001 ns          12
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>     73385081 ns   73383192 ns          10
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>   121725939 ns  121642493 ns           6
> BM_writeStringDictionary<NewSortedStringDictionary, 100000>  232034759 ns  232031059 ns           3
> {code}
>
> Opt1: Replace std::unordered_map with google dense_hash_map in
> SortedStringDictionary.
> {code:java}
> Benchmark                                                            Time           CPU  Iterations
> BM_writeStringDictionary<NewSortedStringDictionary, 10>       47356699 ns   47353586 ns          12
> BM_writeStringDictionary<NewSortedStringDictionary, 100>      45186048 ns   45183162 ns          15
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>     57606074 ns   57602874 ns          12
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>    95010486 ns   95005368 ns           7
> BM_writeStringDictionary<NewSortedStringDictionary, 100000>  188142023 ns  188129613 ns           4
> {code}
>
> Opt2: Remove reorder operation in SortedStringDictionary, as it is not
> required by ORC specifications.
> {code:java}
> Benchmark                                                            Time           CPU  Iterations
> BM_writeStringDictionary<NewSortedStringDictionary, 10>       46043402 ns   46039712 ns          13
> BM_writeStringDictionary<NewSortedStringDictionary, 100>      44810622 ns   44809496 ns          16
> BM_writeStringDictionary<NewSortedStringDictionary, 1000>     55505824 ns   55500060 ns          12
> BM_writeStringDictionary<NewSortedStringDictionary, 10000>    91577865 ns   91573214 ns           7
> BM_writeStringDictionary<NewSortedStringDictionary, 100000>  186722559 ns  186709219 ns           4
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
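
For readers following the two changes described in the issue above, here is a minimal C++ sketch of the idea. It is not the ORC implementation: InsertionOrderDictionary and reorderSorted are hypothetical names chosen for illustration. The index is a std::unordered_map so that the sketch builds with the standard library alone; Opt1 would swap in google dense_hash_map, which additionally needs an empty key registered with set_empty_key() before use. reorderSorted shows the kind of sort-and-remap pass that Opt2 removes, under the assumption that the reorder step rewrites insertion-order ids into sorted-order ids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a string dictionary (not the ORC class).
// Ids are assigned in insertion order, so no reorder pass is needed.
class InsertionOrderDictionary {
 public:
  // Returns the id of `value`, assigning the next free id on first sight.
  std::size_t insert(const std::string& value) {
    auto result = index_.emplace(value, entries_.size());
    if (result.second) {
      entries_.push_back(value);  // new entry: remember it in arrival order
    }
    return result.first->second;
  }

  const std::vector<std::string>& entries() const { return entries_; }

 private:
  // Assumption: Opt1 replaces this container with google dense_hash_map.
  std::unordered_map<std::string, std::size_t> index_;
  std::vector<std::string> entries_;
};

// Illustration of the extra work Opt2 drops: sort the entries and rewrite
// every row id from insertion order to sorted order.
std::vector<std::string> reorderSorted(const std::vector<std::string>& entries,
                                       std::vector<std::size_t>& ids) {
  std::vector<std::size_t> order(entries.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(), [&entries](std::size_t a, std::size_t b) {
    return entries[a] < entries[b];
  });

  std::vector<std::size_t> oldToNew(entries.size());
  std::vector<std::string> sorted(entries.size());
  for (std::size_t newId = 0; newId < order.size(); ++newId) {
    oldToNew[order[newId]] = newId;
    sorted[newId] = entries[order[newId]];
  }
  for (std::size_t& id : ids) {
    assert(id < oldToNew.size());
    id = oldToNew[id];  // one extra pass over every row
  }
  return sorted;
}
```

With insertion-order ids, the writer can emit dict.entries() and the row ids as they are. The sorted variant costs an additional sort over the entries plus a pass over all row ids.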