[kudu-CR] KUDU-695. Avoid glog contention by deferring log writes to dedicated threads

2016-12-04 Thread Todd Lipcon (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5321

to look at the new patch set (#7).

Change subject: KUDU-695. Avoid glog contention by deferring log writes to 
dedicated threads
..

KUDU-695. Avoid glog contention by deferring log writes to dedicated threads

This patch changes the logging init sequence to install a wrapper around
the built-in glog Loggers. The wrapper starts a new thread which does
double-buffering: log entries are buffered in a vector, and the thread
is woken up. The thread swaps in a clean buffer to avoid delaying
application threads while it flushes the original buffer to disk.

It's hard to test the end-to-end integration in a unit test, since the
unit tests disable logging to files, and this path only affects
file-based logging. However, I ran an earlier version of this patch in a
stress test environment and it seemed to reduce the frequency with which
I saw threads blocked on glog.

The patch does, however, have a test which exercises the new code paths,
including the blocking path. I looped the new test 500 times in TSAN
mode with success.

The new feature is enabled by default, but I left a hidden flag to
disable it in case we have any issues.

Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/async_logger.cc
A src/kudu/util/async_logger.h
M src/kudu/util/debug/leakcheck_disabler.h
M src/kudu/util/logging-test.cc
M src/kudu/util/logging.cc
6 files changed, 461 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/5321/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5321
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] ssl factory: avoid a non-POD (vector<>) in static storage

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged.

Change subject: ssl_factory: avoid a non-POD (vector<>) in static storage
..


ssl_factory: avoid a non-POD (vector<>) in static storage

The global vector in static storage was causing ASAN test
failures when I looped rpc-test's keepalive test with SSL enabled[1].
The issue is that the static storage destructor could run concurrently
with other threads still trying to access the SSL mutexes.

Since this mutex list is purposefully leaked anyway, it's better to just
use a C-style array.

[1] http://dist-test.cloudera.org/job?job_id=todd.1480910327.10622

Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4
Reviewed-on: http://gerrit.cloudera.org:8080/5354
Reviewed-by: Adar Dembo 
Tested-by: Kudu Jenkins
---
M src/kudu/util/net/ssl_factory.cc
1 file changed, 5 insertions(+), 7 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rpc-test: fix flakiness of TestCallLongerThanKeepalive

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged.

Change subject: rpc-test: fix flakiness of TestCallLongerThanKeepalive
..


rpc-test: fix flakiness of TestCallLongerThanKeepalive

The 50ms keepalive was apparently short enough that occasionally the
server connection would drop before even receiving the initial call on
the connection. This would cause the test to fail with a spurious
"connection reset" type error.

I looped the following to test:

/build-support/dist_test.py --disable-sharding \
   loop -n 1000 build/latest/bin/rpc-test \
   --gtest_filter=OptionalSSL/TestRpc.TestCallLongerThanKeepalive/1 \
   --stress_cpu_threads=8

Without the patch[1] there were 31/1000 failures, all due to this issue.
With the patch[2] there were no failures.

[1] http://dist-test.cloudera.org//job?job_id=todd.1480911578.21798
[2] http://dist-test.cloudera.org//job?job_id=todd.1480911736.2857

Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039
Reviewed-on: http://gerrit.cloudera.org:8080/5355
Reviewed-by: Adar Dembo 
Tested-by: Kudu Jenkins
---
M src/kudu/rpc/rpc-test.cc
1 file changed, 3 insertions(+), 3 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5355
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-798 (part 4) Add a TimeManager to manage safe time advancement

2016-12-04 Thread David Ribeiro Alves (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5300

to look at the new patch set (#10).

Change subject: KUDU-798 (part 4) Add a TimeManager to manage safe time 
advancement
..

KUDU-798 (part 4) Add a TimeManager to manage safe time advancement

This patch adds an entity to manage safe time (the timestamp before
which all transactions are committed/aborted or in-flight) across
a consensus configuration. This allows scans to make sure/wait that
the scan timestamp is safe, thus making sure the scan is repeatable.

This adds a small unit test. Follow up patches will use this
in integration tests.

Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b
---
M src/kudu/consensus/CMakeLists.txt
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
4 files changed, 643 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/00/5300/10
-- 
To view, visit http://gerrit.cloudera.org:8080/5300
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1608: Catalog manager can stop retrying DeleteTablet upon fatal errors

2016-12-04 Thread Dinesh Bhat (Code Review)
Hello Mike Percy, Adar Dembo,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/5357

to review the following change.

Change subject: KUDU-1608: Catalog manager can stop retrying DeleteTablet upon 
fatal errors
..

KUDU-1608: Catalog manager can stop retrying DeleteTablet upon fatal errors

Master continuously retries DeleteTablet RPC for few non-retriable
error codes like WRONG_SERVER_UUID or TABLET_NOT_RUNNING both of
which indicate that tablet ceases to exist on that tserver.
Such errors can be deemed fatal and DeleteTablet need not be retried.

Change-Id: Id45f07667b6e62ce4814acfdf931dea2af4332d1
---
M src/kudu/master/catalog_manager.cc
1 file changed, 22 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/5357/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5357
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id45f07667b6e62ce4814acfdf931dea2af4332d1
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Mike Percy 


[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

2016-12-04 Thread Dinesh Bhat (Code Review)
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5352

to look at the new patch set (#3).

Change subject: [consensus] KUDU-1407 replica is not evcited when 
TABLET_NOT_RUNNING
..

[consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING

Fixes replica eviction failure where a follower returning
TABLET_NOT_RUNNING is not evicted from the consensus config.

Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/tablet_server_test_util.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 97 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot


[kudu-CR] WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement

2016-12-04 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time 
advancement
..


Patch Set 7:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5300/7/src/kudu/consensus/time_manager-test.cc
File src/kudu/consensus/time_manager-test.cc:

Line 43:   // Returns a latch that allows to
> incomplete sentence
Done


Line 46: latches_.push_back(std::move(latch));
> why not just do: latches_.emplace_back(new CountDownLatch(1))
Done


PS7, Line 48: Thread::Create("test", "time_manager-test", 
::WaitForSafeTimeThread,
:  this, latches_.back().get(), safe_time, 
);
:  
> for the purposes of tests, I think just using std::thread with a lambda her
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5300
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement

2016-12-04 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time 
advancement
..


Patch Set 8:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/5300/8/src/kudu/consensus/time_manager-test.cc
File src/kudu/consensus/time_manager-test.cc:

Line 32: using server::Clock;
> warning: using decl 'Clock' is unused [misc-unused-using-decls]
Done


Line 154:   ReplicateMsg message;
> error: redefinition of 'message' [clang-diagnostic-error]
Done


http://gerrit.cloudera.org:8080/#/c/5300/7/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

Line 51: TimeManager::TimeManager(scoped_refptr clock, Timestamp 
initial_safe_time)
> nit: pass by value and use std::move
Done


Line 84:   // NOTE: Currently this method just updates the clock and stores the 
message's timestamp.
> redundant cast?
Done


Line 90:   //   committed, at the cost of additional delay in moving safe 
time.
> is it worth a check here (either a Status-returning one or an assertion) th
Until we have leader leases it will always be able to go back here. replicas 
call this on uncommitted messages (much like the leader calls AssignTimestamp). 
Not sure it's informative to log that since this doesn't have a direct impact 
on safe time and consensus already logs operation overwrite.

Left a lengthy note here.


PS7, Line 179: 
> this wording is kind of vague. i think it would be good to outline the exac
moved this to GetSafeTime and added an ASCII diag


Line 197: Timestamp TimeManager::GetSafeTime()  {
> this logic could also use a good comment
Done


http://gerrit.cloudera.org:8080/#/c/5300/7/src/kudu/consensus/time_manager.h
File src/kudu/consensus/time_manager.h:

Line 33: //
> specify thread safety (or lack thereof)
Done


PS7, Line 39: // returns an OK status, all the transactions whose timestamps 
fall before the scan's timestamp
: /
> no GrantLease yet?
Done


Line 43: // and for moving the leader's safe time, which in turn may be sent to 
replicas on heartbeats
> would be good to add some class-level doc about the two "modes" and what th
Done


PS7, Line 46: d on t
> makes it sound like this is an argument. Maybe "the message's ExternalConsi
Done


PS7, Line 50: 
> nit: comma, not new sentence
Done


Line 51: //
> this sentence reads a little funny... "the queue knows of the messages appe
Done


Line 55: //
> under what circumstances does it return an error status?
Done


Line 57: //
> can you clarify a bit more what effect this has, or at least that we expect
Done


PS7, Line 60: //   GetSafeTime will still return monotonically increasing 
timestamps, it's just
: //   that, in certain corner cases, the timestamp returned by 
GetSafeTime() can't be trusted
: //   to mean that all future messages will
> docs
Done


Line 64: //
> does this mean tat it's a no-op if it's in leader mode? or is it a CHECK?
Done


Line 75: 
> maybe move the implementation details into the .cc file?
I left some info on what the caller can expect but moved the impl details to 
the cc.


Line 123:   // In leader mode returns clock_->Now() or some value close to it.
> nit: usually we define the lock first, not last
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5300
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement

2016-12-04 Thread David Ribeiro Alves (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5300

to look at the new patch set (#8).

Change subject: WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time 
advancement
..

WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement

WIP still need to finish the unit test

Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b
---
M src/kudu/consensus/CMakeLists.txt
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
4 files changed, 645 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/00/5300/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5300
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rpc-test: fix flakiness in tests that expect timeouts

2016-12-04 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: rpc-test: fix flakiness in tests that expect timeouts
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5356/1//COMMIT_MSG
Commit Message:

PS1, Line 12: some
Extra 'some' here?


-- 
To view, visit http://gerrit.cloudera.org:8080/5356
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifff555634968bc92f453b25af4d5c15da21edf7c
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-HasComments: Yes


[kudu-CR] ssl factory: avoid a non-POD (vector<>) in static storage

2016-12-04 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: ssl_factory: avoid a non-POD (vector<>) in static storage
..


Patch Set 1: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-HasComments: No


[kudu-CR] ts recovery-itest: reduce test flakiness

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged.

Change subject: ts_recovery-itest: reduce test flakiness
..


ts_recovery-itest: reduce test flakiness

In the flaky test dashboard, this test case was flaky in two cases:

1) in TSAN builds sometimes it wouldn't write fast enough to trigger
lots of rolls. In that case, the "wait for crash in 60 seconds" might
not trigger.

2) In DEBUG builds, sometimes the 50% chance of crashes allowed it to
write 10+ segments before crashing (0.5^10 = 1/1000 times). With this
many log segments, the bootstrap could take long enough that the
verification on restart would time out.

This addresses the issue by (a) writing larger ops, so that logs would
roll quicker with fewer ops required, and (b) increasing the crash
probability so that it needs fewer rolls to crash on average.

Change-Id: I4d5018636d20a4663e84071e91f573b86e309a29
Reviewed-on: http://gerrit.cloudera.org:8080/5351
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo 
---
M src/kudu/integration-tests/ts_recovery-itest.cc
1 file changed, 2 insertions(+), 1 deletion(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5351
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I4d5018636d20a4663e84071e91f573b86e309a29
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] client: throttle some warning messages when lookup RPCs fail

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged.

Change subject: client: throttle some warning messages when lookup RPCs fail
..


client: throttle some warning messages when lookup RPCs fail

In an Impala stress workload, we found that LookupRpc::SendRpcCb was
spewing warnings at an alarming rate. The warnings themselves were
causing so much load (due to Impala's glog lock contention) that it
produced a vicious cycle: the log lock contention caused more timeouts,
which caused more warnings, which caused more lock contention, etc...

I figured out the address of the glog mutex in an impalad process and
used perf to trace the stack trace of its lock acquisitions over a
5-second period:

[todd@ve1120 kudu]$ sudo perf record -a -e syscalls:sys_enter_futex --filter 
'uaddr == "0x7f6950a7714c"' -g sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.191 MB perf.data (~183118 samples) ]
[todd@ve1120 kudu]$ sudo perf script | grep -A1 'LogMess' | grep client:: | 
sort | uniq -c
272 7f6950477224 
kudu::client::internal::WriteRpc::Finish(kudu::Status const&)
 92 7f695049be85 
kudu::client::internal::RemoteTablet::MarkReplicaFailed(kudu::client::internal::RemoteTabletServer*,
 kudu::Status const&)
  14453 7f69504a1bc5 
kudu::client::internal::LookupRpc::SendRpcCb(kudu::Status const&)

Here we can see the vast majority came from LookupRpc::SendRpcCb and
WriteRpc::Finish. This patch addresses those as well as a few others
that look like they might potentially occur frequently.

Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3
Reviewed-on: http://gerrit.cloudera.org:8080/5332
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo 
---
M src/kudu/client/batcher.cc
M src/kudu/client/client-internal.cc
M src/kudu/client/meta_cache.cc
3 files changed, 22 insertions(+), 17 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rpc-test: fix flakiness of TestCallLongerThanKeepalive

2016-12-04 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: rpc-test: fix flakiness of TestCallLongerThanKeepalive
..


Patch Set 1: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5355
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-HasComments: No


[kudu-CR] ts recovery-itest: reduce test flakiness

2016-12-04 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: ts_recovery-itest: reduce test flakiness
..


Patch Set 1: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5351
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4d5018636d20a4663e84071e91f573b86e309a29
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] KUDU-695. Avoid glog contention by deferring log writes to dedicated threads

2016-12-04 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-695. Avoid glog contention by deferring log writes to 
dedicated threads
..


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5321/4/src/kudu/util/async_logger.cc
File src/kudu/util/async_logger.cc:

PS4, Line 69: ull(*active
> > BTW, Kudu should never Write() FATAL messages given that we aren't wrappi
Ah, I forgot about that.

Could you update the class doc to describe the flush semantics of glog in a 
little more detail, as I suggested above?


http://gerrit.cloudera.org:8080/#/c/5321/4/src/kudu/util/async_logger.h
File src/kudu/util/async_logger.h:

PS4, Line 128:   size = 0;
 :   flush = false;
> Sure, or 'needs_flush_or_write'.
Were you going to rename this?


-- 
To view, visit http://gerrit.cloudera.org:8080/5321
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] client: throttle some warning messages when lookup RPCs fail

2016-12-04 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: client: throttle some warning messages when lookup RPCs fail
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] rpc-test: fix flakiness of TestCallLongerThanKeepalive

2016-12-04 Thread Todd Lipcon (Code Review)
Hello Mike Percy, Adar Dembo,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/5355

to review the following change.

Change subject: rpc-test: fix flakiness of TestCallLongerThanKeepalive
..

rpc-test: fix flakiness of TestCallLongerThanKeepalive

The 50ms keepalive was apparently short enough that occasionally the
server connection would drop before even receiving the initial call on
the connection. This would cause the test to fail with a spurious
"connection reset" type error.

I looped the following to test:

/build-support/dist_test.py --disable-sharding \
   loop -n 1000 build/latest/bin/rpc-test \
   --gtest_filter=OptionalSSL/TestRpc.TestCallLongerThanKeepalive/1 \
   --stress_cpu_threads=8

Without the patch[1] there were 31/1000 failures, all due to this issue.
With the patch[2] there were no failures.

[1] http://dist-test.cloudera.org//job?job_id=todd.1480911578.21798
[2] http://dist-test.cloudera.org//job?job_id=todd.1480911736.2857

Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039
---
M src/kudu/rpc/rpc-test.cc
1 file changed, 3 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/5355/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5355
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Mike Percy 


[kudu-CR] ssl factory: avoid a non-POD (vector<>) in static storage

2016-12-04 Thread Todd Lipcon (Code Review)
Hello Mike Percy, Adar Dembo,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/5354

to review the following change.

Change subject: ssl_factory: avoid a non-POD (vector<>) in static storage
..

ssl_factory: avoid a non-POD (vector<>) in static storage

The global vector in static storage was causing ASAN test
failures when I looped rpc-test's keepalive test with SSL enabled[1].
The issue is that the static storage destructor could run concurrently
with other threads still trying to access the SSL mutexes.

Since this mutex list is purposefully leaked anyway, it's better to just
use a C-style array.

[1] http://dist-test.cloudera.org/job?job_id=todd.1480910327.10622

Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4
---
M src/kudu/util/net/ssl_factory.cc
1 file changed, 5 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/5354/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5354
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Mike Percy 


[kudu-CR] rpc-test: fix flakiness in tests that expect timeouts

2016-12-04 Thread Todd Lipcon (Code Review)
Hello Mike Percy, Adar Dembo,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/5356

to review the following change.

Change subject: rpc-test: fix flakiness in tests that expect timeouts
..

rpc-test: fix flakiness in tests that expect timeouts

The DoTestExpectTimeout() utility function would set a timeout to some
number of milliseconds N, and then ask the server to sleep for (N + 50)
milliseconds, expecting a timeout. It would then assert that the timeout
was returned some after some amount of time between N and N+50, but no
more than N+50ms.

This would be flaky under concurrent load (eg stress threads) because
the sleep(50ms) might sometimes actually sleep for an extra 50-100ms.

This just changes the test to ask the server to sleep for n+500ms,
giving it a lot more budget for sloppiness.

I looped TestCallTimeout/0 with 4 stress threads 1000 times in TSAN.
Before[1] it failed 4/1000. After[2] it didn't fail.

[1] http://dist-test.cloudera.org//job?job_id=todd.1480912345.2054
[2] http://dist-test.cloudera.org//job?job_id=todd.1480912461.12007

Change-Id: Ifff555634968bc92f453b25af4d5c15da21edf7c
---
M src/kudu/rpc/rpc-test-base.h
1 file changed, 2 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/56/5356/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5356
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ifff555634968bc92f453b25af4d5c15da21edf7c
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Mike Percy 


[kudu-CR] [i-tests] TestWorkload::set num tablets() accepts 1

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: [i-tests] TestWorkload::set_num_tablets() accepts 1
..


Patch Set 1: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5347/1//COMMIT_MSG
Commit Message:

Line 7: [i-tests] TestWorkload::set_num_tablets() accepts 1
nit: this summary line is a bit confusing about whether the commit is stating a 
bug (that it currently accepts 1, but shouldn't), or fixing a bug (that it 
should accept 1, but doesn't). Would be better to write "should accept 1" or 
"Fix TestWorkload to allow set_num_tablets(1)" or something for better clarity

(obviously it's explained below but often I end up looking at git log --oneline 
or just skimming a git log)


-- 
To view, visit http://gerrit.cloudera.org:8080/5347
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iaf9d0e06550fffdd6a04f6bfba53fe873fb33749
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Dinesh Bhat 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1757: fix appendCellValueDebugString, do not throw exception when a column is not set

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1757: fix appendCellValueDebugString, do not throw 
exception when a column is not set
..


Patch Set 2:

JD and Dan, do you guys have opinions on this?

-- 
To view, visit http://gerrit.cloudera.org:8080/5237
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ife05af5a11e95751b3cc259e786d36494aeb5d3a
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Yanlong Zheng 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Yanlong Zheng 
Gerrit-HasComments: No


[kudu-CR] client: throttle some warning messages when lookup RPCs fail

2016-12-04 Thread Todd Lipcon (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5332

to look at the new patch set (#2).

Change subject: client: throttle some warning messages when lookup RPCs fail
..

client: throttle some warning messages when lookup RPCs fail

In an Impala stress workload, we found that LookupRpc::SendRpcCb was
spewing warnings at an alarming rate. The warnings themselves were
causing so much load (due to Impala's glog lock contention) that it
produced a vicious cycle: the log lock contention caused more timeouts,
which caused more warnings, which caused more lock contention, etc...

I figured out the address of the glog mutex in an impalad process and
used perf to trace the stack trace of its lock acquisitions over a
5-second period:

[todd@ve1120 kudu]$ sudo perf record -a -e syscalls:sys_enter_futex --filter 
'uaddr == "0x7f6950a7714c"' -g sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.191 MB perf.data (~183118 samples) ]
[todd@ve1120 kudu]$ sudo perf script | grep -A1 'LogMess' | grep client:: | 
sort | uniq -c
272 7f6950477224 
kudu::client::internal::WriteRpc::Finish(kudu::Status const&)
 92 7f695049be85 
kudu::client::internal::RemoteTablet::MarkReplicaFailed(kudu::client::internal::RemoteTabletServer*,
 kudu::Status const&)
  14453 7f69504a1bc5 
kudu::client::internal::LookupRpc::SendRpcCb(kudu::Status const&)

Here we can see the vast majority came from LookupRpc::SendRpcCb and
WriteRpc::Finish. This patch addresses those as well as a few others
that look like they might potentially occur frequently.

Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3
---
M src/kudu/client/batcher.cc
M src/kudu/client/client-internal.cc
M src/kudu/client/meta_cache.cc
3 files changed, 22 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/32/5332/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] client: throttle some warning messages when lookup RPCs fail

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: client: throttle some warning messages when lookup RPCs fail
..


Patch Set 1:

(2 comments)

OK, added throttling in a few other spots

http://gerrit.cloudera.org:8080/#/c/5332/1/src/kudu/client/meta_cache.cc
File src/kudu/client/meta_cache.cc:

Line 197:   LOG(WARNING) << "Tablet " << tablet_id_ << ": Replica " << 
ts->ToString()
> What about this one?
Done


Line 696: LOG(WARNING) << "Failed to determine new Master: " << 
status.ToString();
> Or this one?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-695. Avoid glog contention by deferring log writes to dedicated threads

2016-12-04 Thread Todd Lipcon (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5321

to look at the new patch set (#6).

Change subject: KUDU-695. Avoid glog contention by deferring log writes to 
dedicated threads
..

KUDU-695. Avoid glog contention by deferring log writes to dedicated threads

This patch changes the logging init sequence to install a wrapper around
the built-in glog Loggers. The wrapper starts a new thread which does
double-buffering: log entries are buffered in a vector, and the thread
is woken up. The thread swaps in a clean buffer to avoid delaying
application threads while it flushes the original buffer to disk.

It's hard to test the end-to-end integration in a unit test, since the
unit tests disable logging to files, and this path only affects
file-based logging. However, I ran an earlier version of this patch in a
stress test environment and it seemed to reduce the frequency with which
I saw threads blocked on glog.

The patch does, however, have a test which exercises the new code paths,
including the blocking path. I looped the new test 500 times in TSAN
mode with success.

The new feature is enabled by default, but I left a hidden flag to
disable it in case we have any issues.

Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/async_logger.cc
A src/kudu/util/async_logger.h
M src/kudu/util/debug/leakcheck_disabler.h
M src/kudu/util/logging-test.cc
M src/kudu/util/logging.cc
6 files changed, 453 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/5321/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5321
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-695. Avoid glog contention by deferring log writes to dedicated threads

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-695. Avoid glog contention by deferring log writes to 
dedicated threads
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5321/5/src/kudu/util/async_logger.cc
File src/kudu/util/async_logger.cc:

Line 88: l.Unlock();
> Nit: you could move the Unlock() above the conditional. Maybe adjusting the
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5321
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-1782. Fault injection crashes should exit with a specific 
exit code
..


KUDU-1782. Fault injection crashes should exit with a specific exit code

This changes the fault injection "crash" to use exit(85) instead of
LOG(FATAL). This means that we can now distinguish between a process
exiting due to fault injection compared to a process exiting due to a
crash or real FATAL-inducing issue.

The WaitForCrash method in ExternalMiniCluster::ExternalDaemon is now
split into WaitForInjectedCrash, which specifically checks that the exit
code matches the expected one, and WaitForFatal, which checks that the
exit code matches a SIGABRT. If the process exits with an unexpected
status, a bad Status is returned, typically resulting in a test failure.

I verified the new functionality by locally reverting back the fault
injection code to use LOG(FATAL) and checking that ts_recovery-itest
reported that the process had crashed unexpectedly.

Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
Reviewed-on: http://gerrit.cloudera.org:8080/5353
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo 
---
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/integration-tests/disk_reservation-itest.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
8 files changed, 90 insertions(+), 31 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1782. Fault injection crashes should exit with a specific 
exit code
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5353/3/src/kudu/integration-tests/delete_table-test.cc
File src/kudu/integration-tests/delete_table-test.cc:

Line 494:   ASSERT_OK(cluster_->tablet_server(kTsIndex)->Restart());
> Seems like a pretty important omission. What was the effect?
basically made this part of the test a no-op: the WaitForTSToCrash() below used 
to silently ignore if the TS wasn't started at that point, whereas I added a 
CHECK in this patch that verifies it's in the expected state.


-- 
To view, visit http://gerrit.cloudera.org:8080/5353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code

2016-12-04 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1782. Fault injection crashes should exit with a specific 
exit code
..


Patch Set 3: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5353/3/src/kudu/integration-tests/delete_table-test.cc
File src/kudu/integration-tests/delete_table-test.cc:

Line 494:   ASSERT_OK(cluster_->tablet_server(kTsIndex)->Restart());
Seems like a pretty important omission. What was the effect?


-- 
To view, visit http://gerrit.cloudera.org:8080/5353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes


[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

2016-12-04 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of 
writes
..


Patch Set 21:

(29 comments)

mostly small stuff

http://gerrit.cloudera.org:8080/#/c/5240/21//COMMIT_MSG
Commit Message:

Line 7: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
nit: absence


PS21, Line 19: if there aren't any writes in-flight that
 :   it doesn't know about yet
I think this merits further explanation


PS21, Line 22: Mvcc's safe time is a more conservative version of "global"
 :   safe time
should we consider renaming this to 'trimmable time' or 'safetime_lower_bound' 
or something?


http://gerrit.cloudera.org:8080/#/c/5240/21/java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java:

PS21, Line 58: cluster 
nit: comma here (http://grammar.ccc.commnet.edu/GRAMMAR/commas_intro.htm has 
some good examples)


PS21, Line 59: se
absence


PS21, Line 59: we expect precise clock values in the test
not understanding this exactly. What does one thing have to do with the other? 
safetime advancements end up pushing the clock value ahead or something? Please 
elaborate, especially since people editing the Java client tests may not be as 
familiar with the internal details.


PS21, Line 61: --disable_
I think 'negative' gflags are kind of confusing, because you get these kind of 
double-negatives. why not have it be an 'enable' flag, defaulted to true, and 
set it to false to disable it?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus.h
File src/kudu/consensus/consensus.h:

Line 28: #include "kudu/consensus/time_manager.h"
can you forward decl?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 78: scoped_refptr time_manager(new TimeManager(clock, 
Timestamp::kMin));
missing include for this? (I guess it's getting it transitively)


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 105:const scoped_refptr& 
time_manager,
pass by value and then use std::move below


Line 275: 
RETURN_NOT_OK(time_manager_->AdvanceSafeTimeWithMessage(*msg->get()));
above comment should be updated to reference this.

would it be more efficient (and just as useful) to just do it once at the end 
for msgs.back()?


PS21, Line 442:  
comma


PS21, Line 443: else {
  : if
collapse


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 34: #include "kudu/consensus/time_manager.h"
is forward decl sufficient?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 533:   RETURN_NOT_OK(time_manager_->AssignTimestamp(replicate));
PREPEND


PS21, Line 1198: *(*iter)->get()
is **iter sufficient?

Can't remember under what circumstances this returns non-OK, but are we 
actually safe to return here rather than break?


PS21, Line 1202: d 
comma


Line 1205:   
RETURN_NOT_OK(time_manager_->AdvanceSafeTime(Timestamp(request->safe_timestamp(;
same, is this safe to return or should it be a CHECK?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/raft_consensus_state.h
File src/kudu/consensus/raft_consensus_state.h:

Line 33: #include "kudu/consensus/time_manager.h"
fwd decl?


PS21, Line 256: explicit
dont need explicit


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS21, Line 1195: the clock values are unexpected.
not quite following this


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc-test.cc
File src/kudu/tablet/mvcc-test.cc:

Line 508:   // Commit tx 2 - thread should still wai.
typo


Line 510:   mgr.CommitTransaction(tx2);
probably want an ASSERT_FALSE(HasResultSnapshot()) here.


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc.cc
File src/kudu/tablet/mvcc.cc:

Line 285: bool MvccManager::AnyInFlightAtOrBeforeUnlocked(Timestamp ts) const {
hrm, could this just be implemented by looking at earliest_in_flight_?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc.h
File src/kudu/tablet/mvcc.h:

Line 224:   // time received from the leader (which can go back without leader 
leases).
perhaps elaborate that this would crash if you tried to move it backwards? 
(would it?)

Maybe it should be called something like SetNewTransactionLowerBound? per your 
commit message, the MvccManager doesn't really track safetime anymore, it just 
tracks a lower bound on new transactions (which itself is a lower bound on 

[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code

2016-12-04 Thread Todd Lipcon (Code Review)
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5353

to look at the new patch set (#3).

Change subject: KUDU-1782. Fault injection crashes should exit with a specific 
exit code
..

KUDU-1782. Fault injection crashes should exit with a specific exit code

This changes the fault injection "crash" to use exit(85) instead of
LOG(FATAL). This means that we can now distinguish between a process
exiting due to fault injection compared to a process exiting due to a
crash or real FATAL-inducing issue.

The WaitForCrash method in ExternalMiniCluster::ExternalDaemon is now
split into WaitForInjectedCrash, which specifically checks that the exit
code matches the expected one, and WaitForFatal, which checks that the
exit code matches a SIGABRT. If the process exits with an unexpected
status, a bad Status is returned, typically resulting in a test failure.

I verified the new functionality by locally reverting back the fault
injection code to use LOG(FATAL) and checking that ts_recovery-itest
reported that the process had crashed unexpectedly.

Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
---
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/integration-tests/disk_reservation-itest.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
8 files changed, 90 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/5353/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot


[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code

2016-12-04 Thread Todd Lipcon (Code Review)
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5353

to look at the new patch set (#2).

Change subject: KUDU-1782. Fault injection crashes should exit with a specific 
exit code
..

KUDU-1782. Fault injection crashes should exit with a specific exit code

This changes the fault injection "crash" to use exit(85) instead of
LOG(FATAL). This means that we can now distinguish between a process
exiting due to fault injection compared to a process exiting due to a
crash or real FATAL-inducing issue.

The WaitForCrash method in ExternalMiniCluster::ExternalDaemon is now
split into WaitForInjectedCrash, which specifically checks that the exit
code matches the expected one, and WaitForFatal, which checks that the
exit code matches a SIGABRT. If the process exits with an unexpected
status, a bad Status is returned, typically resulting in a test failure.

I verified the new functionality by locally reverting back the fault
injection code to use LOG(FATAL) and checking that ts_recovery-itest
reported that the process had crashed unexpectedly.

Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
---
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/integration-tests/disk_reservation-itest.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
8 files changed, 84 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/5353/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot


[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code

2016-12-04 Thread Todd Lipcon (Code Review)
Hello David Ribeiro Alves,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/5353

to review the following change.

Change subject: KUDU-1782. Fault injection crashes should exit with a specific 
exit code
..

KUDU-1782. Fault injection crashes should exit with a specific exit code

This changes the fault injection "crash" to use exit(85) instead of
LOG(FATAL). This means that we can now distinguish between a process
exiting due to fault injection compared to a process exiting due to a
crash or real FATAL-inducing issue.

The WaitForCrash method in ExternalMiniCluster::ExternalDaemon is now
renamed to WaitForInjectedCrash and specifically checks that the exit
code matches the expected one. If not, it now returns a bad status.

I verified the new functionality by locally reverting back the fault
injection code to use LOG(FATAL) and checking that ts_recovery-itest
reported that the process had crashed unexpectedly.

Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
---
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/integration-tests/disk_reservation-itest.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/util/fault_injection.cc
M src/kudu/util/fault_injection.h
8 files changed, 45 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/5353/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5353
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: David Ribeiro Alves 


[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

2016-12-04 Thread Dinesh Bhat (Code Review)
Hello David Ribeiro Alves, Mike Percy,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5352

to look at the new patch set (#2).

Change subject: [consensus] KUDU-1407 replica is not evcited when 
TABLET_NOT_RUNNING
..

[consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING

Fixes replica eviction failure where a follower returning
TABLET_NOT_RUNNING is not evicted from the consensus config.

Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/tablet_server_test_util.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 95 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot


[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

2016-12-04 Thread Dinesh Bhat (Code Review)
Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for 
WRONG_SERVER_UUID
..


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 2609:  const scoped_refptr& tablet,
> Would be better to use the standard prefix T  P  for 
This change is moved to different patch, addressing it there.


http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 213:   // Open a tablet meta from the local file system by loading its 
superblock.
> Why optional<>* ? New param needs doc
Since this change moved to a different patch, I have addressed it there.
http://gerrit.cloudera.org:8080/5352


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Dinesh Bhat 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

2016-12-04 Thread Dinesh Bhat (Code Review)
Hello David Ribeiro Alves, Mike Percy,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/5352

to review the following change.

Change subject: [consensus] KUDU-1407 replica is not evcited when 
TABLET_NOT_RUNNING
..

[consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING

Fixes replica eviction failure where a follower returning
TABLET_NOT_RUNNING is not evicted from the consensus config.

Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/tablet_server_test_util.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 91 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Mike Percy