[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15677 ) Change subject: tserver: add locking around an RPC's usage of a Scanner .. tserver: add locking around an RPC's usage of a Scanner Prior to this patch, there was little to prevent multiple RPCs from trying to access and work with a single Scanner object other than the call sequence ID checks. In practice, those checks were fairly likely to prevent races -- there was only a very small window of time in between looking up a scanner and checking its sequence ID -- but it was buggy nonetheless. In addition, the lack of reasonable locking meant that some variables like scanner::num_rows_returned_ were needlessly atomic. In certain workloads this actually took a measurable amount of CPU. This patch adds a new test that would reliably crash the tserver and trigger TSAN warnings prior to the fix. It adds some explicit locking around the Scanner object and asserts that most methods on the scanner require the lock to be held. I benchmarked this by running perf-stat on a tserver and 'kudu perf table_scan' to count the rows in a table 200 times (ie scan with no columns projected). This is a best case for seeing the value of the optimization. Before: Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts': 11,753.59 msec task-clock#1.166 CPUs utilized 7,174 context-switches #0.610 K/sec 807 cpu-migrations#0.069 K/sec 14,909 page-faults #0.001 M/sec 32,447,480,644 cycles#2.761 GHz 49,275,463,523 instructions #1.52 insn per cycle 7,570,941,653 branches # 644.139 M/sec 14,271,190 branch-misses #0.19% of all branches 10.082020598 seconds time elapsed 11.561886000 seconds user 0.340585000 seconds sys After: Performance counter stats for './build/latest/bin/kudu.opt tserver run -fs-wal-dir /tmp/ts': 9,426.45 msec task-clock#1.010 CPUs utilized 6,906 context-switches #0.733 K/sec 892 cpu-migrations#0.095 K/sec 15,078 page-faults #0.002 M/sec 26,127,343,920 cycles#2.772 GHz 48,101,748,066 instructions #1.84 insn per cycle 7,402,811,470 branches # 785.323 M/sec 13,857,599 branch-misses #0.19% of all branches 9.335786317 seconds time elapsed 9.258446000 seconds user 0.315547000 seconds sys (1.24x fewer cycles for the same work) When I actually scan one column of data, the improvement is only about 1% since other CPU consumption is dominant. Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Reviewed-on: http://gerrit.cloudera.org:8080/15677 Tested-by: Todd Lipcon Reviewed-by: Andrew Wong --- M src/kudu/tserver/scanners-test.cc M src/kudu/tserver/scanners.cc M src/kudu/tserver/scanners.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc 5 files changed, 281 insertions(+), 136 deletions(-) Approvals: Todd Lipcon: Verified Andrew Wong: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15677 ) Change subject: tserver: add locking around an RPC's usage of a Scanner .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 16 Apr 2020 21:54:52 + Gerrit-HasComments: No
[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Todd Lipcon has removed a vote on this change. Change subject: tserver: add locking around an RPC's usage of a Scanner .. Removed Verified-1 by Kudu Jenkins (120) -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: deleteVote Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/15677 ) Change subject: tserver: add locking around an RPC's usage of a Scanner .. Patch Set 3: Verified+1 test was unrelaed (chrony) -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 16 Apr 2020 18:32:51 + Gerrit-HasComments: No
[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Andrew Wong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15677 to look at the new patch set (#3). Change subject: tserver: add locking around an RPC's usage of a Scanner .. tserver: add locking around an RPC's usage of a Scanner Prior to this patch, there was little to prevent multiple RPCs from trying to access and work with a single Scanner object other than the call sequence ID checks. In practice, those checks were fairly likely to prevent races -- there was only a very small window of time in between looking up a scanner and checking its sequence ID -- but it was buggy nonetheless. In addition, the lack of reasonable locking meant that some variables like scanner::num_rows_returned_ were needlessly atomic. In certain workloads this actually took a measurable amount of CPU. This patch adds a new test that would reliably crash the tserver and trigger TSAN warnings prior to the fix. It adds some explicit locking around the Scanner object and asserts that most methods on the scanner require the lock to be held. I benchmarked this by running perf-stat on a tserver and 'kudu perf table_scan' to count the rows in a table 200 times (ie scan with no columns projected). This is a best case for seeing the value of the optimization. Before: Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts': 11,753.59 msec task-clock#1.166 CPUs utilized 7,174 context-switches #0.610 K/sec 807 cpu-migrations#0.069 K/sec 14,909 page-faults #0.001 M/sec 32,447,480,644 cycles#2.761 GHz 49,275,463,523 instructions #1.52 insn per cycle 7,570,941,653 branches # 644.139 M/sec 14,271,190 branch-misses #0.19% of all branches 10.082020598 seconds time elapsed 11.561886000 seconds user 0.340585000 seconds sys After: Performance counter stats for './build/latest/bin/kudu.opt tserver run -fs-wal-dir /tmp/ts': 9,426.45 msec task-clock#1.010 CPUs utilized 6,906 context-switches #0.733 K/sec 892 cpu-migrations#0.095 K/sec 15,078 page-faults #0.002 M/sec 26,127,343,920 cycles#2.772 GHz 48,101,748,066 instructions #1.84 insn per cycle 7,402,811,470 branches # 785.323 M/sec 13,857,599 branch-misses #0.19% of all branches 9.335786317 seconds time elapsed 9.258446000 seconds user 0.315547000 seconds sys (1.24x fewer cycles for the same work) When I actually scan one column of data, the improvement is only about 1% since other CPU consumption is dominant. Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e --- M src/kudu/tserver/scanners-test.cc M src/kudu/tserver/scanners.cc M src/kudu/tserver/scanners.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc 5 files changed, 282 insertions(+), 136 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/15677/3 -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/15677 ) Change subject: tserver: add locking around an RPC's usage of a Scanner .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners-test.cc File src/kudu/tserver/scanners-test.cc: http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners-test.cc@90 PS2, Line 90: auto access = s2->LockForAccess(); > nit: mind noting that locking updates the access time, which is what we act Done http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners.h File src/kudu/tserver/scanners.h: http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners.h@255 PS2, Line 255: // Check result.owns_lock() to see if the lock was successful. : AccessLock TryLockForAccess() WARN_UNUSED_RESULT { : return AccessLock(this); : } > Should be using the other constructor? woops! http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/tablet_server-test.cc File src/kudu/tserver/tablet_server-test.cc: http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/tablet_server-test.cc@2521 PS2, Line 2521: threads.emplace_back( : [&]() { : while (!done) { : mini_server_->server()->scanner_manager()->ListScans(); : } : }); > nit: mind commenting why this is important for the test? Is this just anoth Done -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 16 Apr 2020 17:56:35 + Gerrit-HasComments: Yes
[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15677 ) Change subject: tserver: add locking around an RPC's usage of a Scanner .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners-test.cc File src/kudu/tserver/scanners-test.cc: http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners-test.cc@90 PS2, Line 90: auto access = s2->LockForAccess(); nit: mind noting that locking updates the access time, which is what we actually care about in this test? http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners.h File src/kudu/tserver/scanners.h: http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners.h@255 PS2, Line 255: // Check result.owns_lock() to see if the lock was successful. : AccessLock TryLockForAccess() WARN_UNUSED_RESULT { : return AccessLock(this); : } Should be using the other constructor? http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/tablet_server-test.cc File src/kudu/tserver/tablet_server-test.cc: http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/tablet_server-test.cc@2521 PS2, Line 2521: threads.emplace_back( : [&]() { : while (!done) { : mini_server_->server()->scanner_manager()->ListScans(); : } : }); nit: mind commenting why this is important for the test? Is this just another avenue on which we're testing concurrency? -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Thu, 09 Apr 2020 06:33:13 + Gerrit-HasComments: Yes
[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15677 to look at the new patch set (#2). Change subject: tserver: add locking around an RPC's usage of a Scanner .. tserver: add locking around an RPC's usage of a Scanner Prior to this patch, there was little to prevent multiple RPCs from trying to access and work with a single Scanner object other than the call sequence ID checks. In practice, those checks were fairly likely to prevent races -- there was only a very small window of time in between looking up a scanner and checking its sequence ID -- but it was buggy nonetheless. In addition, the lack of reasonable locking meant that some variables like scanner::num_rows_returned_ were needlessly atomic. In certain workloads this actually took a measurable amount of CPU. This patch adds a new test that would reliably crash the tserver and trigger TSAN warnings prior to the fix. It adds some explicit locking around the Scanner object and asserts that most methods on the scanner require the lock to be held. I benchmarked this by running perf-stat on a tserver and 'kudu perf table_scan' to count the rows in a table 200 times (ie scan with no columns projected). This is a best case for seeing the value of the optimization. Before: Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts': 11,753.59 msec task-clock#1.166 CPUs utilized 7,174 context-switches #0.610 K/sec 807 cpu-migrations#0.069 K/sec 14,909 page-faults #0.001 M/sec 32,447,480,644 cycles#2.761 GHz 49,275,463,523 instructions #1.52 insn per cycle 7,570,941,653 branches # 644.139 M/sec 14,271,190 branch-misses #0.19% of all branches 10.082020598 seconds time elapsed 11.561886000 seconds user 0.340585000 seconds sys After: Performance counter stats for './build/latest/bin/kudu.opt tserver run -fs-wal-dir /tmp/ts': 9,426.45 msec task-clock#1.010 CPUs utilized 6,906 context-switches #0.733 K/sec 892 cpu-migrations#0.095 K/sec 15,078 page-faults #0.002 M/sec 26,127,343,920 cycles#2.772 GHz 48,101,748,066 instructions #1.84 insn per cycle 7,402,811,470 branches # 785.323 M/sec 13,857,599 branch-misses #0.19% of all branches 9.335786317 seconds time elapsed 9.258446000 seconds user 0.315547000 seconds sys (1.24x fewer cycles for the same work) When I actually scan one column of data, the improvement is only about 1% since other CPU consumption is dominant. Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e --- M src/kudu/tserver/scanners-test.cc M src/kudu/tserver/scanners.cc M src/kudu/tserver/scanners.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc 5 files changed, 271 insertions(+), 127 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/15677/2 -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241)
[kudu-CR] tserver: add locking around an RPC's usage of a Scanner
Hello Andrew Wong, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/15677 to review the following change. Change subject: tserver: add locking around an RPC's usage of a Scanner .. tserver: add locking around an RPC's usage of a Scanner Prior to this patch, there was little to prevent multiple RPCs from trying to access and work with a single Scanner object other than the call sequence ID checks. In practice, those checks were fairly likely to prevent races -- there was only a very small window of time in between looking up a scanner and checking its sequence ID -- but it was buggy nonetheless. In addition, the lack of reasonable locking meant that some variables like scanner::num_rows_returned_ were needlessly atomic. In certain workloads this actually took a measurable amount of CPU. This patch adds a new test that would reliably crash the tserver and trigger TSAN warnings prior to the fix. It adds some explicit locking around the Scanner object and asserts that most methods on the scanner require the lock to be held. I benchmarked this by running perf-stat on a tserver and 'kudu perf table_scan' to count the rows in a table 200 times (ie scan with no columns projected). This is a best case for seeing the value of the optimization. Before: Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir /tmp/ts': 11,753.59 msec task-clock#1.166 CPUs utilized 7,174 context-switches #0.610 K/sec 807 cpu-migrations#0.069 K/sec 14,909 page-faults #0.001 M/sec 32,447,480,644 cycles#2.761 GHz 49,275,463,523 instructions #1.52 insn per cycle 7,570,941,653 branches # 644.139 M/sec 14,271,190 branch-misses #0.19% of all branches 10.082020598 seconds time elapsed 11.561886000 seconds user 0.340585000 seconds sys After: Performance counter stats for './build/latest/bin/kudu.opt tserver run -fs-wal-dir /tmp/ts': 9,426.45 msec task-clock#1.010 CPUs utilized 6,906 context-switches #0.733 K/sec 892 cpu-migrations#0.095 K/sec 15,078 page-faults #0.002 M/sec 26,127,343,920 cycles#2.772 GHz 48,101,748,066 instructions #1.84 insn per cycle 7,402,811,470 branches # 785.323 M/sec 13,857,599 branch-misses #0.19% of all branches 9.335786317 seconds time elapsed 9.258446000 seconds user 0.315547000 seconds sys (1.24x fewer cycles for the same work) When I actually scan one column of data, the improvement is only about 1% since other CPU consumption is dominant. Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e --- M src/kudu/tserver/scanners-test.cc M src/kudu/tserver/scanners.cc M src/kudu/tserver/scanners.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc 5 files changed, 269 insertions(+), 127 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/15677/1 -- To view, visit http://gerrit.cloudera.org:8080/15677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e Gerrit-Change-Number: 15677 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong