Hello Andrew Wong, I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/15677 to review the following change. Change subject: tserver: add locking around an RPC's usage of a Scanner ...................................................................... tserver: add locking around an RPC's usage of a Scanner Prior to this patch, there was little to prevent multiple RPCs from trying to access and work with a single Scanner object other than the call sequence ID checks. In practice, those checks were fairly likely to prevent races -- there was only a very small window of time in between looking up a scanner and checking its sequence ID -- but it was buggy nonetheless. In addition, the lack of reasonable locking meant that some variables like scanner::num_rows_returned_ were needlessly atomic. In certain workloads this actually took a measurable amount of CPU. This patch adds a new test that would reliably crash the tserver and trigger TSAN warnings prior to the fix. It adds some explicit locking around the Scanner object and asserts that most methods on the scanner require the lock to be held. I benchmarked this by running perf-stat on a tserver and 'kudu perf table_scan' to count the rows in a table 200 times (ie scan with no columns projected). This is a best case for seeing the value of the optimization. Before: Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts': 11,753.59 msec task-clock # 1.166 CPUs utilized 7,174 context-switches # 0.610 K/sec 807 cpu-migrations # 0.069 K/sec 14,909 page-faults # 0.001 M/sec 32,447,480,644 cycles # 2.761 GHz 49,275,463,523 instructions # 1.52 insn per cycle 7,570,941,653 branches # 644.139 M/sec 14,271,190 branch-misses # 0.19% of all branches 10.082020598 seconds time elapsed 11.561886000 seconds user 0.340585000 seconds sys After: Performance counter stats for './build/latest/bin/kudu.opt tserver run -fs-wal-dir /tmp/ts': 9,426.45 msec task-clock # 1.010 CPUs utilized 6,906 context-switches # 0.733 K/sec 892 cpu-migrations # 0.095 K/sec 15,078 page-faults # 0.002 M/sec 26,127,343,920 cycles # 2.772 GHz 48,101,748,066 instructions # 1.84 insn per cycle 7,402,811,470 branches # 785.323 M/sec 13,857,599 branch-misses # 0.19% of all branches 9.335786317 seconds time elapsed 9.258446000 seconds user 0.315547000 seconds sys (1.24x fewer cycles for the same work) When I actually scan one column of data, the improvement is only about 1% since other CPU consumption is dominant. Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e --- M src/kudu/tserver/scanners-test.cc M src/kudu/tserver/scanners.cc M src/kudu/tserver/scanners.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc 5 files changed, 269 insertions(+), 127 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/15677/1 -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon <t...@apache.org> Gerrit-Reviewer: Andrew Wong <andrew.w...@cloudera.com>