Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15802 )
Change subject: KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena
......................................................................

KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena

This changes Dictionary and Plain binary blocks to no longer copy string
values into the destination RowBlock's Arena. Instead, the dictionary
block (or plain block) is attached to the RowBlockMemory's list of
reference-counted block handles, and the cell directly refers to the
underlying block handle (a simplified sketch of this scheme appears
below, after the test setup).

I modified full_stack-insert-scan-test to run some scans with fault
tolerance and with no caching. If I comment out the code path that
reference-counts the blocks during a scan, the newly added test fails.

I performance-tested this by loading a lineitem table using
tpch_real_world:

$ tpch_real_world -tpch-path-to-dbgen-dir /data/2/todd/dbgen/ \
    -tpch_num_inserters=8 -tpch_scaling_factor=100 \
    --tpch_mini_cluster_base_dir /data/1/todd/tpch-kudu-data

For the numbers reported below, I accidentally cancelled the load after
222M rows were present, but I used the same dataset for both "before"
and "after", so the relative comparison is still valid.

I started a tserver and master with the same data directories, with the
tserver running inside perf stat (or perf record, to look at a profile):

$ kudu-master \
    -fs-wal-dir /data/1/todd/tpch-kudu-data/master-0/wal/ \
    -fs-data-dirs /data/1/todd/tpch-kudu-data/master-0/data/
$ perf stat kudu-tserver \
    -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ \
    -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/

I waited until the data had been fully flushed from the MRS and
compacted before running the read workloads. To test the reads, I ran
the following 10 times:

$ kudu perf table_scan localhost tpch_real_world \
    --columns l_shipdate,l_shipmode,l_comment --num_threads=16

The results of the first test were a bit noisy due to NUMA placement
issues: some runs were 30-40% faster than others, even on the same
build, which made results hard to compare even though the optimized
version clearly used fewer cycles on average. So, I ran both the tserver
and the client under 'numactl -m 0 -N 0' to force everything onto a
single NUMA node. This made the results much more consistent.
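(The exact pinned invocation is not spelled out in the original message;
a representative reconstruction, reusing the tserver flags from above,
would look like this:)

$ numactl -m 0 -N 0 kudu-tserver \
    -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ \
    -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/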
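Before the numbers, here is a minimal, self-contained sketch of the
zero-copy scheme described at the top. All names here (DecodedBlock,
RowBlockMemory::RetainBlock, CopyNextValuesZeroCopy) are illustrative
stand-ins rather than Kudu's actual API; the real implementation lives
in block_handle.h, rowblock_memory.h, and the binary block decoders
listed in the file list below.

  #include <cstddef>
  #include <iostream>
  #include <memory>
  #include <string>
  #include <vector>

  // Stand-in for a decoded (dictionary or plain) binary block.
  struct DecodedBlock {
    std::string data;  // contiguous buffer holding the string payload
  };

  // Borrowed view into a block's buffer, in the spirit of kudu::Slice.
  struct Slice {
    const char* ptr;
    std::size_t len;
  };

  // Stand-in for RowBlockMemory: instead of copying every string cell
  // into an arena, the scan attaches a reference-counted handle to each
  // source block whose memory the cells alias. The block then lives at
  // least as long as the RowBlock's memory does.
  struct RowBlockMemory {
    std::vector<std::shared_ptr<DecodedBlock>> retained_blocks;

    void RetainBlock(std::shared_ptr<DecodedBlock> block) {
      retained_blocks.push_back(std::move(block));
    }
  };

  // "Decoding" here hands out zero-copy views into the block after
  // making sure the RowBlockMemory keeps the block alive. For this toy
  // example, values are assumed to be NUL-separated in the buffer.
  std::vector<Slice> CopyNextValuesZeroCopy(
      const std::shared_ptr<DecodedBlock>& block, RowBlockMemory* mem) {
    mem->RetainBlock(block);  // cells below alias block->data
    std::vector<Slice> cells;
    std::size_t pos = 0;
    while (pos < block->data.size()) {
      std::size_t end = block->data.find('\0', pos);
      if (end == std::string::npos) end = block->data.size();
      cells.push_back(Slice{block->data.data() + pos, end - pos});
      pos = end + 1;
    }
    return cells;
  }

  int main() {
    RowBlockMemory mem;
    std::vector<Slice> cells;
    {
      auto block = std::make_shared<DecodedBlock>();
      block->data = std::string("RAIL\0MAIL\0SHIP", 14);
      cells = CopyNextValuesZeroCopy(block, &mem);
      // `block` goes out of scope here, but the refcounted handle in
      // RowBlockMemory keeps the buffer alive: no per-cell memcpy.
    }
    for (const Slice& s : cells) {
      std::cout << std::string(s.ptr, s.len) << "\n";
    }
    return 0;
  }

The key point is the RetainBlock() call: ownership of the decoded block
is shared with the RowBlock's memory, so the cells can point straight
into the decoder's buffer instead of being copied into an Arena.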
Before:

        255870.36 msec task-clock              #    3.058 CPUs utilized
           244847      context-switches        #    0.957 K/sec
             3322      cpu-migrations          #    0.013 K/sec
           245814      page-faults             #    0.961 K/sec
    1066864136000      cycles                  #    4.170 GHz                      (83.46%)
      84410991344      stalled-cycles-frontend #    7.91% frontend cycles idle     (83.37%)
     340913242391      stalled-cycles-backend  #   31.95% backend cycles idle      (83.25%)
    1131564485394      instructions            #    1.06  insn per cycle
                                               #    0.30  stalled cycles per insn  (83.34%)
     187879069908      branches                #  734.274 M/sec                    (83.32%)
       8550168935      branch-misses           #    4.55% of all branches          (83.26%)

    191.262870000 seconds user
     64.765755000 seconds sys

After:

        214131.49 msec task-clock              #    2.750 CPUs utilized
           245357      context-switches        #    0.001 M/sec
             2734      cpu-migrations          #    0.013 K/sec
           248108      page-faults             #    0.001 M/sec
     893270854012      cycles                  #    4.172 GHz                      (83.45%)
      83805641687      stalled-cycles-frontend #    9.38% frontend cycles idle     (83.25%)
     345166097238      stalled-cycles-backend  #   38.64% backend cycles idle      (83.29%)
     913435059189      instructions            #    1.02  insn per cycle
                                               #    0.38  stalled cycles per insn  (83.36%)
     142198832288      branches                #  664.072 M/sec                    (83.36%)
       4819907752      branch-misses           #    3.39% of all branches          (83.29%)

     77.876854360 seconds time elapsed

    146.195821000 seconds user
     68.113598000 seconds sys

To summarize, the change gives about a 1.30x reduction in CPU cycles on
the tserver, and the perf scan tool reported a 15-20% reduction in
wall-clock time. Testing a scan of just a dictionary-encoded column
shows even better results.

Looking at 'perf diff -c ratio -o 0 --sort=sym' between two profiles
collected by perf-record, we can see that the BinaryDictBlockDecoder
code path is much cheaper, most of the memcpy calls are removed, and
slightly more CPU is spent serializing the result (probably due to
reduced locality of reference when copying the string data to the
wire).

Orig % CPU  Ratio     Symbol
-------------------------------------------------------------------------------------
    26.16%  0.437272  [.] kudu::cfile::BinaryDictBlockDecoder::CopyNextDecodeStrings(unsigned long*, kudu::ColumnDataView*)
    19.63%  1.135304  [.] kudu::SerializeRowBlock(kudu::RowBlock const&, kudu::Schema const*, kudu::faststring*, kudu::faststring*, bool)
    11.49%  1.002845  [k] copy_user_enhanced_fast_string
    10.29%  0.068955  [.] __memcpy_ssse3_back

Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
Reviewed-on: http://gerrit.cloudera.org:8080/15802
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
M src/kudu/cfile/binary_dict_block.cc
M src/kudu/cfile/binary_plain_block.cc
M src/kudu/cfile/binary_plain_block.h
M src/kudu/cfile/block_handle.h
M src/kudu/common/rowblock_memory.h
M src/kudu/integration-tests/full_stack-insert-scan-test.cc
6 files changed, 101 insertions(+), 24 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Andrew Wong: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/15802
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
Gerrit-Change-Number: 15802
Gerrit-PatchSet: 8
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Andrew Wong <andrew.w...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Volodymyr Verovkin <verjov...@cloudera.com>