Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Volodymyr Verovkin,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15802

to look at the new patch set (#3).

Change subject: KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena
......................................................................

KUDU-2844 (3/3): avoid copying plain/dict strings to RowBlock Arena

This changes Dictionary and Plain binary blocks to no longer copy string
values into the destination RowBlock's Arena. Instead, the dictionary
block (or plain block) is attached to the RowBlockMemory's list of
reference-counted block handles, and the cell directly refers to the
underlying block handle.
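
To illustrate the idea (a rough sketch with simplified stand-in types, not
the actual classes touched by this patch): rather than one memcpy per cell
into the Arena, the decoder retains a reference-counted handle on the
decoded block and points each cell's Slice at the bytes inside it.

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct Slice {
  const char* data;
  size_t size;
};

// Stand-in for the reference-counted cfile block handle.
using BlockHandle = std::shared_ptr<const std::string>;

struct RowBlockMemory {
  // Handles kept here pin the decoded blocks for as long as the
  // RowBlock's cells may point into them.
  std::vector<BlockHandle> retained_blocks;

  void RetainBlock(BlockHandle handle) {
    retained_blocks.emplace_back(std::move(handle));
  }
};

// Zero-copy cell: retain the block (once per block in practice, not per
// cell) and reference its bytes directly instead of copying into the Arena.
Slice CellFromBlock(const BlockHandle& block, size_t offset, size_t len,
                    RowBlockMemory* mem) {
  mem->RetainBlock(block);
  return Slice{block->data() + offset, len};
}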

I modified full_stack-insert-scan-test to have it do some scans with
fault tolerance and no caching. If I comment out the code path that
reference-counts the blocks during a scan, the newly added test fails.
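
For reference, the kind of scan the test now runs looks roughly like this
through the public C++ client API (a hedged sketch; the helper function and
surrounding setup are illustrative, not the actual test code):

#include "kudu/client/client.h"

using kudu::client::KuduScanBatch;
using kudu::client::KuduScanner;
using kudu::client::KuduTable;

void ScanFaultTolerantNoCache(KuduTable* table) {
  KuduScanner scanner(table);
  // Fault-tolerant scans may resume on another replica mid-scan, which
  // stresses the lifetime of blocks that a RowBlock still references.
  KUDU_CHECK_OK(scanner.SetFaultTolerant());
  // Don't retain blocks in the block cache, so the scan decodes
  // freshly-read blocks instead of cached ones.
  KUDU_CHECK_OK(scanner.SetCacheBlocks(false));
  KUDU_CHECK_OK(scanner.Open());
  KuduScanBatch batch;
  while (scanner.HasMoreRows()) {
    KUDU_CHECK_OK(scanner.NextBatch(&batch));  // consume the rows
  }
}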

I performance-tested this by loading a lineitem table using
tpch_real_world:

$ tpch_real_world -tpch_path_to_dbgen_dir /data/2/todd/dbgen/ \
  -tpch_num_inserters=8 -tpch_scaling_factor=100 \
  -tpch_mini_cluster_base_dir /data/1/todd/tpch-kudu-data

For the numbers reported below, I accidentally cancelled the load after
347M rows were present, but I used the same dataset for both "before"
and "after" so the relative comparison is still valid.

I started a tserver and master with the same data directories, with the
tserver running under perf stat (or perf record when collecting a
profile):

$ kudu-master \
  -fs-wal-dir /data/1/todd/tpch-kudu-data/master-0/wal/ \
  -fs-data-dirs /data/1/todd/tpch-kudu-data/master-0/data/

$ perf stat kudu-tserver \
  -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ \
  -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/

I waited until the data had been fully flushed from MRS before running
the read workloads. To test the reads I used:

$ kudu perf table_scan localhost tpch_real_world  \
  --columns l_shipdate --num_threads=16

The results of the first test were a bit noisy due to NUMA placement
issues -- some runs were 30-40% faster than other runs, even on the same
build, which made it hard to compare results, even though it was clear
that the optimized version used fewer cycles on average. So, I ran both
the tserver and the client using 'numactl -m 0 -N 0' to force everything
to a single NUMA node. This made results much more consistent.

Before:

 Performance counter stats for 'numactl -m 0 -N 0 build/latest/bin/kudu-tserver -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/':

        109,499.82 msec task-clock                #    2.157 CPUs utilized
           126,972      context-switches          #    0.001 M/sec
               860      cpu-migrations            #    0.008 K/sec
           179,152      page-faults               #    0.002 M/sec
   287,368,365,068      cycles                    #    2.624 GHz                      (28.49%)
   449,376,563,201      instructions              #    1.56  insn per cycle           (43.78%)
    72,675,361,468      branches                  #  663.703 M/sec                    (27.51%)
     1,498,319,620      branch-misses             #    2.06% of all branches          (22.68%)

      88.573074000 seconds user
      30.057621000 seconds sys

After:

 Performance counter stats for 'numactl -m 0 -N 0 build/latest/bin/kudu-tserver.threadsafe -fs-wal-dir /data/1/todd/tpch-kudu-data/ts-0/wal/ -fs-data-dirs /data/1/todd/tpch-kudu-data/ts-0/data/':

         79,082.33 msec task-clock                #    2.136 CPUs utilized
           123,454      context-switches          #    0.002 M/sec
               651      cpu-migrations            #    0.008 K/sec
           180,599      page-faults               #    0.002 M/sec
   206,372,306,012      cycles                    #    2.610 GHz                      (28.27%)
   333,399,478,320      instructions              #    1.62  insn per cycle           (43.61%)
    45,056,751,917      branches                  #  569.745 M/sec                    (27.44%)
       223,361,581      branch-misses             #    0.50% of all branches          (22.77%)

      57.790049000 seconds user
      29.709481000 seconds sys

To summarize, the change gives about a 1.39x reduction in CPU cycles on the
tserver (287.4B cycles before vs. 206.4B after). The perf scan tool also
reported about a 20% reduction in wall-clock time.

Looking at 'perf diff -c ratio -o 0 --sort=sym' between the two profiles
collected by perf-record, we can see that the BinaryDictBlockDecoder code path
is much cheaper, most of the memcpy calls are gone, and slightly more CPU is
spent serializing the result (probably due to reduced locality of reference
when copying the string data to the wire).

    Orig % CPU     Ratio      Symbol
-------------------------------------------------------------------------------------
    26.16%        0.437272  [.] kudu::cfile::BinaryDictBlockDecoder::CopyNextDecodeStrings(unsigned long*, kudu::ColumnDataView*)
    19.63%        1.135304  [.] kudu::SerializeRowBlock(kudu::RowBlock const&, kudu::Schema const*, kudu::faststring*, kudu::faststring*, bool)
    11.49%        1.002845  [k] copy_user_enhanced_fast_string
    10.29%        0.068955  [.] __memcpy_ssse3_back

Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
---
M src/kudu/cfile/binary_dict_block.cc
M src/kudu/cfile/binary_plain_block.cc
M src/kudu/cfile/binary_plain_block.h
M src/kudu/cfile/block_handle.h
M src/kudu/cfile/cfile_reader.cc
M src/kudu/common/rowblock_memory.h
M src/kudu/integration-tests/full_stack-insert-scan-test.cc
7 files changed, 85 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/15802/3
--
To view, visit http://gerrit.cloudera.org:8080/15802
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93fa1f9fd401814a42dc5a1f3fd2ffb1286ac441
Gerrit-Change-Number: 15802
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Andrew Wong <andrew.w...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Volodymyr Verovkin <verjov...@cloudera.com>
