[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-16 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/15677 )

Change subject: tserver: add locking around an RPC's usage of a Scanner
..

tserver: add locking around an RPC's usage of a Scanner

Prior to this patch, there was little to prevent multiple RPCs from
trying to access and work with a single Scanner object other than the
call sequence ID checks. In practice, those checks were fairly likely to
prevent races -- there was only a very small window of time in between
looking up a scanner and checking its sequence ID -- but it was buggy
nonetheless.

In addition, the lack of reasonable locking meant that some variables
like scanner::num_rows_returned_ were needlessly atomic. In certain
workloads this actually took a measurable amount of CPU.

This patch adds a new test that would reliably crash the tserver and
trigger TSAN warnings prior to the fix. It adds some explicit locking
around the Scanner object and asserts that most methods on the scanner
require the lock to be held.

I benchmarked this by running perf-stat on a tserver and 'kudu perf
table_scan' to count the rows in a table 200 times (ie scan with no
columns projected). This is a best case for seeing the value of the
optimization.

Before:
 Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir 
/tmp/ts':

 11,753.59 msec task-clock#1.166 CPUs utilized
 7,174  context-switches  #0.610 K/sec
   807  cpu-migrations#0.069 K/sec
14,909  page-faults   #0.001 M/sec
32,447,480,644  cycles#2.761 GHz
49,275,463,523  instructions  #1.52  insn per cycle
 7,570,941,653  branches  #  644.139 M/sec
14,271,190  branch-misses #0.19% of all branches

  10.082020598 seconds time elapsed

  11.561886000 seconds user
   0.340585000 seconds sys

After:
 Performance counter stats for './build/latest/bin/kudu.opt tserver run 
-fs-wal-dir /tmp/ts':

  9,426.45 msec task-clock#1.010 CPUs utilized
 6,906  context-switches  #0.733 K/sec
   892  cpu-migrations#0.095 K/sec
15,078  page-faults   #0.002 M/sec
26,127,343,920  cycles#2.772 GHz
48,101,748,066  instructions  #1.84  insn per cycle
 7,402,811,470  branches  #  785.323 M/sec
13,857,599  branch-misses #0.19% of all branches

   9.335786317 seconds time elapsed

   9.258446000 seconds user
   0.315547000 seconds sys

(1.24x fewer cycles for the same work)

When I actually scan one column of data, the improvement is only about
1% since other CPU consumption is dominant.

Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Reviewed-on: http://gerrit.cloudera.org:8080/15677
Tested-by: Todd Lipcon 
Reviewed-by: Andrew Wong 
---
M src/kudu/tserver/scanners-test.cc
M src/kudu/tserver/scanners.cc
M src/kudu/tserver/scanners.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
5 files changed, 281 insertions(+), 136 deletions(-)

Approvals:
  Todd Lipcon: Verified
  Andrew Wong: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-16 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15677 )

Change subject: tserver: add locking around an RPC's usage of a Scanner
..


Patch Set 3: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 16 Apr 2020 21:54:52 +
Gerrit-HasComments: No


[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-16 Thread Todd Lipcon (Code Review)
Todd Lipcon has removed a vote on this change.

Change subject: tserver: add locking around an RPC's usage of a Scanner
..


Removed Verified-1 by Kudu Jenkins (120)
--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-16 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15677 )

Change subject: tserver: add locking around an RPC's usage of a Scanner
..


Patch Set 3: Verified+1

test was unrelaed (chrony)


--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 16 Apr 2020 18:32:51 +
Gerrit-HasComments: No


[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-16 Thread Todd Lipcon (Code Review)
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Andrew Wong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15677

to look at the new patch set (#3).

Change subject: tserver: add locking around an RPC's usage of a Scanner
..

tserver: add locking around an RPC's usage of a Scanner

Prior to this patch, there was little to prevent multiple RPCs from
trying to access and work with a single Scanner object other than the
call sequence ID checks. In practice, those checks were fairly likely to
prevent races -- there was only a very small window of time in between
looking up a scanner and checking its sequence ID -- but it was buggy
nonetheless.

In addition, the lack of reasonable locking meant that some variables
like scanner::num_rows_returned_ were needlessly atomic. In certain
workloads this actually took a measurable amount of CPU.

This patch adds a new test that would reliably crash the tserver and
trigger TSAN warnings prior to the fix. It adds some explicit locking
around the Scanner object and asserts that most methods on the scanner
require the lock to be held.

I benchmarked this by running perf-stat on a tserver and 'kudu perf
table_scan' to count the rows in a table 200 times (ie scan with no
columns projected). This is a best case for seeing the value of the
optimization.

Before:
 Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir 
/tmp/ts':

 11,753.59 msec task-clock#1.166 CPUs utilized
 7,174  context-switches  #0.610 K/sec
   807  cpu-migrations#0.069 K/sec
14,909  page-faults   #0.001 M/sec
32,447,480,644  cycles#2.761 GHz
49,275,463,523  instructions  #1.52  insn per cycle
 7,570,941,653  branches  #  644.139 M/sec
14,271,190  branch-misses #0.19% of all branches

  10.082020598 seconds time elapsed

  11.561886000 seconds user
   0.340585000 seconds sys

After:
 Performance counter stats for './build/latest/bin/kudu.opt tserver run 
-fs-wal-dir /tmp/ts':

  9,426.45 msec task-clock#1.010 CPUs utilized
 6,906  context-switches  #0.733 K/sec
   892  cpu-migrations#0.095 K/sec
15,078  page-faults   #0.002 M/sec
26,127,343,920  cycles#2.772 GHz
48,101,748,066  instructions  #1.84  insn per cycle
 7,402,811,470  branches  #  785.323 M/sec
13,857,599  branch-misses #0.19% of all branches

   9.335786317 seconds time elapsed

   9.258446000 seconds user
   0.315547000 seconds sys

(1.24x fewer cycles for the same work)

When I actually scan one column of data, the improvement is only about
1% since other CPU consumption is dominant.

Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
---
M src/kudu/tserver/scanners-test.cc
M src/kudu/tserver/scanners.cc
M src/kudu/tserver/scanners.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
5 files changed, 282 insertions(+), 136 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/15677/3
--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-16 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15677 )

Change subject: tserver: add locking around an RPC's usage of a Scanner
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners-test.cc
File src/kudu/tserver/scanners-test.cc:

http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners-test.cc@90
PS2, Line 90: auto access = s2->LockForAccess();
> nit: mind noting that locking updates the access time, which is what we act
Done


http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners.h
File src/kudu/tserver/scanners.h:

http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners.h@255
PS2, Line 255:   // Check result.owns_lock() to see if the lock was successful.
 :   AccessLock TryLockForAccess() WARN_UNUSED_RESULT {
 : return AccessLock(this);
 :   }
> Should be using the other constructor?
woops!


http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/tablet_server-test.cc
File src/kudu/tserver/tablet_server-test.cc:

http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/tablet_server-test.cc@2521
PS2, Line 2521:   threads.emplace_back(
  :   [&]() {
  : while (!done) {
  :   
mini_server_->server()->scanner_manager()->ListScans();
  : }
  :   });
> nit: mind commenting why this is important for the test? Is this just anoth
Done



--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 16 Apr 2020 17:56:35 +
Gerrit-HasComments: Yes


[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-09 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15677 )

Change subject: tserver: add locking around an RPC's usage of a Scanner
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners-test.cc
File src/kudu/tserver/scanners-test.cc:

http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners-test.cc@90
PS2, Line 90: auto access = s2->LockForAccess();
nit: mind noting that locking updates the access time, which is what we 
actually care about in this test?


http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners.h
File src/kudu/tserver/scanners.h:

http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/scanners.h@255
PS2, Line 255:   // Check result.owns_lock() to see if the lock was successful.
 :   AccessLock TryLockForAccess() WARN_UNUSED_RESULT {
 : return AccessLock(this);
 :   }
Should be using the other constructor?


http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/tablet_server-test.cc
File src/kudu/tserver/tablet_server-test.cc:

http://gerrit.cloudera.org:8080/#/c/15677/2/src/kudu/tserver/tablet_server-test.cc@2521
PS2, Line 2521:   threads.emplace_back(
  :   [&]() {
  : while (!done) {
  :   
mini_server_->server()->scanner_manager()->ListScans();
  : }
  :   });
nit: mind commenting why this is important for the test? Is this just another 
avenue on which we're testing concurrency?



--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 09 Apr 2020 06:33:13 +
Gerrit-HasComments: Yes


[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-08 Thread Todd Lipcon (Code Review)
Hello Tidy Bot, Andrew Wong, Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15677

to look at the new patch set (#2).

Change subject: tserver: add locking around an RPC's usage of a Scanner
..

tserver: add locking around an RPC's usage of a Scanner

Prior to this patch, there was little to prevent multiple RPCs from
trying to access and work with a single Scanner object other than the
call sequence ID checks. In practice, those checks were fairly likely to
prevent races -- there was only a very small window of time in between
looking up a scanner and checking its sequence ID -- but it was buggy
nonetheless.

In addition, the lack of reasonable locking meant that some variables
like scanner::num_rows_returned_ were needlessly atomic. In certain
workloads this actually took a measurable amount of CPU.

This patch adds a new test that would reliably crash the tserver and
trigger TSAN warnings prior to the fix. It adds some explicit locking
around the Scanner object and asserts that most methods on the scanner
require the lock to be held.

I benchmarked this by running perf-stat on a tserver and 'kudu perf
table_scan' to count the rows in a table 200 times (ie scan with no
columns projected). This is a best case for seeing the value of the
optimization.

Before:
 Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir 
/tmp/ts':

 11,753.59 msec task-clock#1.166 CPUs utilized
 7,174  context-switches  #0.610 K/sec
   807  cpu-migrations#0.069 K/sec
14,909  page-faults   #0.001 M/sec
32,447,480,644  cycles#2.761 GHz
49,275,463,523  instructions  #1.52  insn per cycle
 7,570,941,653  branches  #  644.139 M/sec
14,271,190  branch-misses #0.19% of all branches

  10.082020598 seconds time elapsed

  11.561886000 seconds user
   0.340585000 seconds sys

After:
 Performance counter stats for './build/latest/bin/kudu.opt tserver run 
-fs-wal-dir /tmp/ts':

  9,426.45 msec task-clock#1.010 CPUs utilized
 6,906  context-switches  #0.733 K/sec
   892  cpu-migrations#0.095 K/sec
15,078  page-faults   #0.002 M/sec
26,127,343,920  cycles#2.772 GHz
48,101,748,066  instructions  #1.84  insn per cycle
 7,402,811,470  branches  #  785.323 M/sec
13,857,599  branch-misses #0.19% of all branches

   9.335786317 seconds time elapsed

   9.258446000 seconds user
   0.315547000 seconds sys

(1.24x fewer cycles for the same work)

When I actually scan one column of data, the improvement is only about
1% since other CPU consumption is dominant.

Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
---
M src/kudu/tserver/scanners-test.cc
M src/kudu/tserver/scanners.cc
M src/kudu/tserver/scanners.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
5 files changed, 271 insertions(+), 127 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/15677/2
--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] tserver: add locking around an RPC's usage of a Scanner

2020-04-07 Thread Todd Lipcon (Code Review)
Hello Andrew Wong,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/15677

to review the following change.


Change subject: tserver: add locking around an RPC's usage of a Scanner
..

tserver: add locking around an RPC's usage of a Scanner

Prior to this patch, there was little to prevent multiple RPCs from
trying to access and work with a single Scanner object other than the
call sequence ID checks. In practice, those checks were fairly likely to
prevent races -- there was only a very small window of time in between
looking up a scanner and checking its sequence ID -- but it was buggy
nonetheless.

In addition, the lack of reasonable locking meant that some variables
like scanner::num_rows_returned_ were needlessly atomic. In certain
workloads this actually took a measurable amount of CPU.

This patch adds a new test that would reliably crash the tserver and
trigger TSAN warnings prior to the fix. It adds some explicit locking
around the Scanner object and asserts that most methods on the scanner
require the lock to be held.

I benchmarked this by running perf-stat on a tserver and 'kudu perf
table_scan' to count the rows in a table 200 times (ie scan with no
columns projected). This is a best case for seeing the value of the
optimization.

Before:
 Performance counter stats for './build/latest/bin/kudu tserver run -fs-wal-dir 
/tmp/ts':

 11,753.59 msec task-clock#1.166 CPUs utilized
 7,174  context-switches  #0.610 K/sec
   807  cpu-migrations#0.069 K/sec
14,909  page-faults   #0.001 M/sec
32,447,480,644  cycles#2.761 GHz
49,275,463,523  instructions  #1.52  insn per cycle
 7,570,941,653  branches  #  644.139 M/sec
14,271,190  branch-misses #0.19% of all branches

  10.082020598 seconds time elapsed

  11.561886000 seconds user
   0.340585000 seconds sys

After:
 Performance counter stats for './build/latest/bin/kudu.opt tserver run 
-fs-wal-dir /tmp/ts':

  9,426.45 msec task-clock#1.010 CPUs utilized
 6,906  context-switches  #0.733 K/sec
   892  cpu-migrations#0.095 K/sec
15,078  page-faults   #0.002 M/sec
26,127,343,920  cycles#2.772 GHz
48,101,748,066  instructions  #1.84  insn per cycle
 7,402,811,470  branches  #  785.323 M/sec
13,857,599  branch-misses #0.19% of all branches

   9.335786317 seconds time elapsed

   9.258446000 seconds user
   0.315547000 seconds sys

(1.24x fewer cycles for the same work)

When I actually scan one column of data, the improvement is only about
1% since other CPU consumption is dominant.

Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
---
M src/kudu/tserver/scanners-test.cc
M src/kudu/tserver/scanners.cc
M src/kudu/tserver/scanners.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
5 files changed, 269 insertions(+), 127 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/15677/1
--
To view, visit http://gerrit.cloudera.org:8080/15677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3591e85a07aefadaf7fb05768109c8a261a8828e
Gerrit-Change-Number: 15677
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong