Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Volodymyr Verovkin, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/15802 to look at the new patch set (#3).

Change subject: KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena
......................................................................

KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena

This changes Dictionary and Plain binary blocks to no longer copy string
values into the destination RowBlock's Arena. Instead, the dictionary
block (or plain block) is attached to the RowBlockMemory's list of
reference-counted block handles, and each cell refers directly to the
data in the underlying block handle.

I modified full_stack-insert-scan-test to do some scans with fault
tolerance and no caching. If I comment out the code path that
reference-counts the blocks during a scan, the newly added test fails.

I performance-tested this by loading a lineitem table using
tpch_real_world:

  $ tpch_real_world -tpch-path-to-dbgen-dir /data/2/todd/dbgen/ \
      -tpch_num_inserters=8 -tpch_scaling_factor=100 \
      --tpch_mini_cluster_base_dir /data/1/todd/tpch-kudu-data

For the numbers reported below, I accidentally cancelled the load after
347M rows were present, but I used the same dataset for both "before"
and "after", so the relative comparison is still valid.

I started a tserver and master with the same data directories, with the
tserver running inside 'perf stat' (or 'perf record' to look at a
profile):

  $ kudu-master \
      -fs-wal-dir /data/1/todd/tpch-kudu-data/master-0/wal/ \
      -fs-data-dirs /data/1/todd/tpch-kudu-data/master-0/data/
  $ perf stat kudu-tserver \
      -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ \
      -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/

I waited until the data had been fully flushed from the MRS before
running the read workloads.
To test the reads I used:

  $ kudu perf table_scan localhost tpch_real_world \
      --columns l_shipdate --num_threads=16

The results of the first test were a bit noisy due to NUMA placement
issues -- some runs were 30-40% faster than others, even on the same
build, which made it hard to compare results, even though it was clear
that the optimized version used fewer cycles on average. So, I ran both
the tserver and the client under 'numactl -m 0 -N 0' to force everything
onto a single NUMA node. This made the results much more consistent.

Before:

 Performance counter stats for 'numactl -m 0 -N 0 build/latest/bin/kudu-tserver -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/':

         109,499.82 msec task-clock         #    2.157 CPUs utilized
            126,972      context-switches   #    0.001 M/sec
                860      cpu-migrations     #    0.008 K/sec
            179,152      page-faults        #    0.002 M/sec
    287,368,365,068      cycles             #    2.624 GHz                    (28.49%)
    449,376,563,201      instructions       #    1.56  insn per cycle         (43.78%)
     72,675,361,468      branches           #  663.703 M/sec                  (27.51%)
      1,498,319,620      branch-misses      #    2.06% of all branches        (22.68%)

       88.573074000 seconds user
       30.057621000 seconds sys

After:

 Performance counter stats for 'numactl -m 0 -N 0 build/latest/bin/kudu-tserver.threadsafe -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/':

          79,082.33 msec task-clock         #    2.136 CPUs utilized
            123,454      context-switches   #    0.002 M/sec
                651      cpu-migrations     #    0.008 K/sec
            180,599      page-faults        #    0.002 M/sec
    206,372,306,012      cycles             #    2.610 GHz                    (28.27%)
    333,399,478,320      instructions       #    1.62  insn per cycle         (43.61%)
     45,056,751,917      branches           #  569.745 M/sec                  (27.44%)
        223,361,581      branch-misses      #    0.50% of all branches        (22.77%)

       57.790049000 seconds user
       29.709481000 seconds sys

To summarize, the change gives about a 1.39x reduction in CPU cycles on
the tserver (287.4B cycles before vs. 206.4B after). The perf scan tool
reported about a 20% reduction in wall-clock time.
Looking at 'perf diff -c ratio -o 0 --sort=sym' between two profiles
collected by perf-record, we can see that the BinaryDictBlockDecoder
code path is much cheaper, most of the memcpy calls are removed, and
slightly more CPU is spent serializing the result (probably due to
reduced locality of reference when copying the string data to the wire).

  Orig %  CPU Ratio  Symbol
  -------------------------------------------------------------------------------------
  26.16%   0.437272  [.] kudu::cfile::BinaryDictBlockDecoder::CopyNextDecodeStrings(unsigned long*, kudu::ColumnDataView*)
  19.63%   1.135304  [.] kudu::SerializeRowBlock(kudu::RowBlock const&, kudu::Schema const*, kudu::faststring*, kudu::faststring*, bool)
  11.49%   1.002845  [k] copy_user_enhanced_fast_string
  10.29%   0.068955  [.] __memcpy_ssse3_back

Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
---
M src/kudu/cfile/binary_dict_block.cc
M src/kudu/cfile/binary_plain_block.cc
M src/kudu/cfile/binary_plain_block.h
M src/kudu/cfile/block_handle.h
M src/kudu/cfile/cfile_reader.cc
M src/kudu/common/rowblock_memory.h
M src/kudu/integration-tests/full_stack-insert-scan-test.cc
7 files changed, 85 insertions(+), 19 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/15802/3
--
To view, visit http://gerrit.cloudera.org:8080/15802
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
Gerrit-Change-Number: 15802
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Andrew Wong <andrew.w...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Volodymyr Verovkin <verjov...@cloudera.com>