[kudu-CR] KUDU-695. Avoid glog contention by deferring log writes to dedicated threads
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5321 to look at the new patch set (#7). Change subject: KUDU-695. Avoid glog contention by deferring log writes to dedicated threads .. KUDU-695. Avoid glog contention by deferring log writes to dedicated threads This patch changes the logging init sequence to install a wrapper around the built-in glog Loggers. The wrapper starts a new thread which does double-buffering: log entries are buffered in a vector, and the thread is woken up. The thread swaps in a clean buffer to avoid delaying application threads while it flushes the original buffer to disk. It's hard to test the end-to-end integration in a unit test, since the unit tests disable logging to files, and this path only affects file-based logging. However, I ran an earlier version of this patch in a stress test environment and it seemed to reduce the frequency with which I saw threads blocked on glog. The patch does, however, have a test which exercises the new code paths, including the blocking path. I looped the new test 500 times in TSAN mode with success. The new feature is enabled by default, but I left a hidden flag to disable it in case we have any issues. Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d --- M src/kudu/util/CMakeLists.txt A src/kudu/util/async_logger.cc A src/kudu/util/async_logger.h M src/kudu/util/debug/leakcheck_disabler.h M src/kudu/util/logging-test.cc M src/kudu/util/logging.cc 6 files changed, 461 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/5321/7 -- To view, visit http://gerrit.cloudera.org:8080/5321 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] ssl factory: avoid a non-POD (vector<>) in static storage
Todd Lipcon has submitted this change and it was merged. Change subject: ssl_factory: avoid a non-POD (vector<>) in static storage .. ssl_factory: avoid a non-POD (vector<>) in static storage The global vectorin static storage was causing ASAN test failures when I looped rpc-test's keepalive test with SSL enabled[1]. The issue is that the static storage destructor could run concurrently with other threads still trying to access the SSL mutexes. Since this mutex list is purposefully leaked anyway, it's better to just use a C-style array. [1] http://dist-test.cloudera.org/job?job_id=todd.1480910327.10622 Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4 Reviewed-on: http://gerrit.cloudera.org:8080/5354 Reviewed-by: Adar Dembo Tested-by: Kudu Jenkins --- M src/kudu/util/net/ssl_factory.cc 1 file changed, 5 insertions(+), 7 deletions(-) Approvals: Adar Dembo: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/5354 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] rpc-test: fix flakiness of TestCallLongerThanKeepalive
Todd Lipcon has submitted this change and it was merged. Change subject: rpc-test: fix flakiness of TestCallLongerThanKeepalive .. rpc-test: fix flakiness of TestCallLongerThanKeepalive The 50ms keepalive was apparently short enough that occasionally the server connection would drop before even receiving the initial call on the connection. This would cause the test to fail with a spurious "connection reset" type error. I looped the following to test: /build-support/dist_test.py --disable-sharding \ loop -n 1000 build/latest/bin/rpc-test \ --gtest_filter=OptionalSSL/TestRpc.TestCallLongerThanKeepalive/1 \ --stress_cpu_threads=8 Without the patch[1] there were 31/1000 failures, all due to this issue. With the patch[2] there were no failures. [1] http://dist-test.cloudera.org//job?job_id=todd.1480911578.21798 [2] http://dist-test.cloudera.org//job?job_id=todd.1480911736.2857 Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039 Reviewed-on: http://gerrit.cloudera.org:8080/5355 Reviewed-by: Adar DemboTested-by: Kudu Jenkins --- M src/kudu/rpc/rpc-test.cc 1 file changed, 3 insertions(+), 3 deletions(-) Approvals: Adar Dembo: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/5355 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-798 (part 4) Add a TimeManager to manage safe time advancement
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5300 to look at the new patch set (#10). Change subject: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement .. KUDU-798 (part 4) Add a TimeManager to manage safe time advancement This patch adds an entity to manage safe time (the timestamp before which all transactions are committed/aborted or in-flight) across a consensus configuration. This allows scans to make sure/wait that the scan timestamp is safe, thus making sure the scan is repeatable. This adds a small unit test. Follow up patches will use this in integration tests. Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b --- M src/kudu/consensus/CMakeLists.txt A src/kudu/consensus/time_manager-test.cc A src/kudu/consensus/time_manager.cc A src/kudu/consensus/time_manager.h 4 files changed, 643 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/00/5300/10 -- To view, visit http://gerrit.cloudera.org:8080/5300 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro AlvesGerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1608: Catalog manager can stop retrying DeleteTablet upon fatal errors
Hello Mike Percy, Adar Dembo, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/5357 to review the following change. Change subject: KUDU-1608: Catalog manager can stop retrying DeleteTablet upon fatal errors .. KUDU-1608: Catalog manager can stop retrying DeleteTablet upon fatal errors Master continuously retries DeleteTablet RPC for few non-retriable error codes like WRONG_SERVER_UUID or TABLET_NOT_RUNNING both of which indicate that tablet ceases to exist on that tserver. Such errors can be deemed fatal and DeleteTablet need not be retried. Change-Id: Id45f07667b6e62ce4814acfdf931dea2af4332d1 --- M src/kudu/master/catalog_manager.cc 1 file changed, 22 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/5357/1 -- To view, visit http://gerrit.cloudera.org:8080/5357 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Id45f07667b6e62ce4814acfdf931dea2af4332d1 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh BhatGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Mike Percy
[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING
Hello Mike Percy, Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5352 to look at the new patch set (#3). Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING .. [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING Fixes replica eviction failure where a follower returning TABLET_NOT_RUNNING is not evicted from the consensus config. Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625 --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/tserver/tablet_server-test-base.h M src/kudu/tserver/tablet_server_test_util.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 8 files changed, 97 insertions(+), 34 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/3 -- To view, visit http://gerrit.cloudera.org:8080/5352 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh BhatGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement
David Ribeiro Alves has posted comments on this change. Change subject: WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement .. Patch Set 7: (3 comments) http://gerrit.cloudera.org:8080/#/c/5300/7/src/kudu/consensus/time_manager-test.cc File src/kudu/consensus/time_manager-test.cc: Line 43: // Returns a latch that allows to > incomplete sentence Done Line 46: latches_.push_back(std::move(latch)); > why not just do: latches_.emplace_back(new CountDownLatch(1)) Done PS7, Line 48: Thread::Create("test", "time_manager-test", ::WaitForSafeTimeThread, : this, latches_.back().get(), safe_time, ); : > for the purposes of tests, I think just using std::thread with a lambda her Done -- To view, visit http://gerrit.cloudera.org:8080/5300 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro AlvesGerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement
David Ribeiro Alves has posted comments on this change. Change subject: WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement .. Patch Set 8: (19 comments) http://gerrit.cloudera.org:8080/#/c/5300/8/src/kudu/consensus/time_manager-test.cc File src/kudu/consensus/time_manager-test.cc: Line 32: using server::Clock; > warning: using decl 'Clock' is unused [misc-unused-using-decls] Done Line 154: ReplicateMsg message; > error: redefinition of 'message' [clang-diagnostic-error] Done http://gerrit.cloudera.org:8080/#/c/5300/7/src/kudu/consensus/time_manager.cc File src/kudu/consensus/time_manager.cc: Line 51: TimeManager::TimeManager(scoped_refptr clock, Timestamp initial_safe_time) > nit: pass by value and use std::move Done Line 84: // NOTE: Currently this method just updates the clock and stores the message's timestamp. > redundant cast? Done Line 90: // committed, at the cost of additional delay in moving safe time. > is it worth a check here (either a Status-returning one or an assertion) th Until we have leader leases it will always be able to go back here. replicas call this on uncommitted messages (much like the leader calls AssignTimestamp). Not sure it's informative to log that since this doesn't have a direct impact on safe time and consensus already logs operation overwrite. Left a lengthy note here. PS7, Line 179: > this wording is kind of vague. i think it would be good to outline the exac moved this to GetSafeTime and added an ASCII diag Line 197: Timestamp TimeManager::GetSafeTime() { > this logic could also use a good comment Done http://gerrit.cloudera.org:8080/#/c/5300/7/src/kudu/consensus/time_manager.h File src/kudu/consensus/time_manager.h: Line 33: // > specify thread safety (or lack thereof) Done PS7, Line 39: // returns an OK status, all the transactions whose timestamps fall before the scan's timestamp : / > no GrantLease yet? Done Line 43: // and for moving the leader's safe time, which in turn may be sent to replicas on heartbeats > would be good to add some class-level doc about the two "modes" and what th Done PS7, Line 46: d on t > makes it sound like this is an argument. Maybe "the message's ExternalConsi Done PS7, Line 50: > nit: comma, not new sentence Done Line 51: // > this sentence reads a little funny... "the queue knows of the messages appe Done Line 55: // > under what circumstances does it return an error status? Done Line 57: // > can you clarify a bit more what effect this has, or at least that we expect Done PS7, Line 60: // GetSafeTime will still return monotonically increasing timestamps, it's just : // that, in certain corner cases, the timestamp returned by GetSafeTime() can't be trusted : // to mean that all future messages will > docs Done Line 64: // > does this mean tat it's a no-op if it's in leader mode? or is it a CHECK? Done Line 75: > maybe move the implementation details into the .cc file? I left some info on what the caller can expect but moved the impl details to the cc. Line 123: // In leader mode returns clock_->Now() or some value close to it. > nit: usually we define the lock first, not last Done -- To view, visit http://gerrit.cloudera.org:8080/5300 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro AlvesGerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5300 to look at the new patch set (#8). Change subject: WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement .. WIP: KUDU-798 (part 4) Add a TimeManager to manage safe time advancement WIP still need to finish the unit test Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b --- M src/kudu/consensus/CMakeLists.txt A src/kudu/consensus/time_manager-test.cc A src/kudu/consensus/time_manager.cc A src/kudu/consensus/time_manager.h 4 files changed, 645 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/00/5300/8 -- To view, visit http://gerrit.cloudera.org:8080/5300 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0bb2b1d2590ed7ead6f1980f9572be10444bb81b Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro AlvesGerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] rpc-test: fix flakiness in tests that expect timeouts
Adar Dembo has posted comments on this change. Change subject: rpc-test: fix flakiness in tests that expect timeouts .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/5356/1//COMMIT_MSG Commit Message: PS1, Line 12: some Extra 'some' here? -- To view, visit http://gerrit.cloudera.org:8080/5356 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ifff555634968bc92f453b25af4d5c15da21edf7c Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-HasComments: Yes
[kudu-CR] ssl factory: avoid a non-POD (vector<>) in static storage
Adar Dembo has posted comments on this change. Change subject: ssl_factory: avoid a non-POD (vector<>) in static storage .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/5354 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-HasComments: No
[kudu-CR] ts recovery-itest: reduce test flakiness
Todd Lipcon has submitted this change and it was merged. Change subject: ts_recovery-itest: reduce test flakiness .. ts_recovery-itest: reduce test flakiness In the flaky test dashboard, this test case was flaky in two cases: 1) in TSAN builds sometimes it wouldn't write fast enough to trigger lots of rolls. In that case, the "wait for crash in 60 seconds" might not trigger. 2) In DEBUG builds, sometimes the 50% chance of crashes allowed it to write 10+ segments before crashing (0.5^10 = 1/1000 times). With this many log segments, the bootstrap could take long enough that the verification on restart would time out. This addresses the issue by (a) writing larger ops, so that logs would roll quicker with fewer ops required, and (b) increasing the crash probability so that it needs fewer rolls to crash on average. Change-Id: I4d5018636d20a4663e84071e91f573b86e309a29 Reviewed-on: http://gerrit.cloudera.org:8080/5351 Tested-by: Kudu Jenkins Reviewed-by: Adar Dembo--- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 2 insertions(+), 1 deletion(-) Approvals: Adar Dembo: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/5351 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I4d5018636d20a4663e84071e91f573b86e309a29 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] client: throttle some warning messages when lookup RPCs fail
Todd Lipcon has submitted this change and it was merged. Change subject: client: throttle some warning messages when lookup RPCs fail .. client: throttle some warning messages when lookup RPCs fail In an Impala stress workload, we found that LookupRpc::SendRpcCb was spewing warnings at an alarming rate. The warnings themselves were causing so much load (due to Impala's glog lock contention) that it produced a vicious cycle: the log lock contention caused more timeouts, which caused more warnings, which caused more lock contention, etc... I figured out the address of the glog mutex in an impalad process and used perf to trace the stack trace of its lock acquisitions over a 5-second period: [todd@ve1120 kudu]$ sudo perf record -a -e syscalls:sys_enter_futex --filter 'uaddr == "0x7f6950a7714c"' -g sleep 5 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 4.191 MB perf.data (~183118 samples) ] [todd@ve1120 kudu]$ sudo perf script | grep -A1 'LogMess' | grep client:: | sort | uniq -c 272 7f6950477224 kudu::client::internal::WriteRpc::Finish(kudu::Status const&) 92 7f695049be85 kudu::client::internal::RemoteTablet::MarkReplicaFailed(kudu::client::internal::RemoteTabletServer*, kudu::Status const&) 14453 7f69504a1bc5 kudu::client::internal::LookupRpc::SendRpcCb(kudu::Status const&) Here we can see the vast majority came from LookupRpc::SendRpcCb and WriteRpc::Finish. This patch addresses those as well as a few others that look like they might potentially occur frequently. Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3 Reviewed-on: http://gerrit.cloudera.org:8080/5332 Tested-by: Kudu Jenkins Reviewed-by: Adar Dembo--- M src/kudu/client/batcher.cc M src/kudu/client/client-internal.cc M src/kudu/client/meta_cache.cc 3 files changed, 22 insertions(+), 17 deletions(-) Approvals: Adar Dembo: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/5332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] rpc-test: fix flakiness of TestCallLongerThanKeepalive
Adar Dembo has posted comments on this change. Change subject: rpc-test: fix flakiness of TestCallLongerThanKeepalive .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/5355 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-HasComments: No
[kudu-CR] ts recovery-itest: reduce test flakiness
Adar Dembo has posted comments on this change. Change subject: ts_recovery-itest: reduce test flakiness .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/5351 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I4d5018636d20a4663e84071e91f573b86e309a29 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: No
[kudu-CR] KUDU-695. Avoid glog contention by deferring log writes to dedicated threads
Adar Dembo has posted comments on this change. Change subject: KUDU-695. Avoid glog contention by deferring log writes to dedicated threads .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/5321/4/src/kudu/util/async_logger.cc File src/kudu/util/async_logger.cc: PS4, Line 69: ull(*active > > BTW, Kudu should never Write() FATAL messages given that we aren't wrappi Ah, I forgot about that. Could you update the class doc to describe the flush semantics of glog in a little more detail, as I suggested above? http://gerrit.cloudera.org:8080/#/c/5321/4/src/kudu/util/async_logger.h File src/kudu/util/async_logger.h: PS4, Line 128: size = 0; : flush = false; > Sure, or 'needs_flush_or_write'. Were you going to rename this? -- To view, visit http://gerrit.cloudera.org:8080/5321 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] client: throttle some warning messages when lookup RPCs fail
Adar Dembo has posted comments on this change. Change subject: client: throttle some warning messages when lookup RPCs fail .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/5332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] rpc-test: fix flakiness of TestCallLongerThanKeepalive
Hello Mike Percy, Adar Dembo, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/5355 to review the following change. Change subject: rpc-test: fix flakiness of TestCallLongerThanKeepalive .. rpc-test: fix flakiness of TestCallLongerThanKeepalive The 50ms keepalive was apparently short enough that occasionally the server connection would drop before even receiving the initial call on the connection. This would cause the test to fail with a spurious "connection reset" type error. I looped the following to test: /build-support/dist_test.py --disable-sharding \ loop -n 1000 build/latest/bin/rpc-test \ --gtest_filter=OptionalSSL/TestRpc.TestCallLongerThanKeepalive/1 \ --stress_cpu_threads=8 Without the patch[1] there were 31/1000 failures, all due to this issue. With the patch[2] there were no failures. [1] http://dist-test.cloudera.org//job?job_id=todd.1480911578.21798 [2] http://dist-test.cloudera.org//job?job_id=todd.1480911736.2857 Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039 --- M src/kudu/rpc/rpc-test.cc 1 file changed, 3 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/5355/1 -- To view, visit http://gerrit.cloudera.org:8080/5355 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I8d8d981af6dae90d0c656745d9260ffa920c9039 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Mike Percy
[kudu-CR] ssl factory: avoid a non-POD (vector<>) in static storage
Hello Mike Percy, Adar Dembo, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/5354 to review the following change. Change subject: ssl_factory: avoid a non-POD (vector<>) in static storage .. ssl_factory: avoid a non-POD (vector<>) in static storage The global vectorin static storage was causing ASAN test failures when I looped rpc-test's keepalive test with SSL enabled[1]. The issue is that the static storage destructor could run concurrently with other threads still trying to access the SSL mutexes. Since this mutex list is purposefully leaked anyway, it's better to just use a C-style array. [1] http://dist-test.cloudera.org/job?job_id=todd.1480910327.10622 Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4 --- M src/kudu/util/net/ssl_factory.cc 1 file changed, 5 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/5354/1 -- To view, visit http://gerrit.cloudera.org:8080/5354 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I859aefd932d62fa8619650fcc795d3676187ceb4 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Mike Percy
[kudu-CR] rpc-test: fix flakiness in tests that expect timeouts
Hello Mike Percy, Adar Dembo, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/5356 to review the following change. Change subject: rpc-test: fix flakiness in tests that expect timeouts .. rpc-test: fix flakiness in tests that expect timeouts The DoTestExpectTimeout() utility function would set a timeout to some number of milliseconds N, and then ask the server to sleep for (N + 50) milliseconds, expecting a timeout. It would then assert that the timeout was returned some after some amount of time between N and N+50, but no more than N+50ms. This would be flaky under concurrent load (eg stress threads) because the sleep(50ms) might sometimes actually sleep for an extra 50-100ms. This just changes the test to ask the server to sleep for n+500ms, giving it a lot more budget for sloppiness. I looped TestCallTimeout/0 with 4 stress threads 1000 times in TSAN. Before[1] it failed 4/1000. After[2] it didn't fail. [1] http://dist-test.cloudera.org//job?job_id=todd.1480912345.2054 [2] http://dist-test.cloudera.org//job?job_id=todd.1480912461.12007 Change-Id: Ifff555634968bc92f453b25af4d5c15da21edf7c --- M src/kudu/rpc/rpc-test-base.h 1 file changed, 2 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/56/5356/1 -- To view, visit http://gerrit.cloudera.org:8080/5356 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ifff555634968bc92f453b25af4d5c15da21edf7c Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Mike Percy
[kudu-CR] [i-tests] TestWorkload::set num tablets() accepts 1
Todd Lipcon has posted comments on this change. Change subject: [i-tests] TestWorkload::set_num_tablets() accepts 1 .. Patch Set 1: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/5347/1//COMMIT_MSG Commit Message: Line 7: [i-tests] TestWorkload::set_num_tablets() accepts 1 nit: this summary line is a bit confusing about whether the commit is stating a bug (that it currently accepts 1, but shouldn't), or fixing a bug (that it should accept 1, but doesn't). Would be better to write "should accept 1" or "Fix TestWorkload to allow set_num_tablets(1)" or something for better clarity (obviously it's explained below but often I end up looking at git log --oneline or just skimming a git log) -- To view, visit http://gerrit.cloudera.org:8080/5347 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iaf9d0e06550fffdd6a04f6bfba53fe873fb33749 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1757: fix appendCellValueDebugString, do not throw exception when a column is not set
Todd Lipcon has posted comments on this change. Change subject: KUDU-1757: fix appendCellValueDebugString, do not throw exception when a column is not set .. Patch Set 2: JD and Dan, do you guys have opinions on this? -- To view, visit http://gerrit.cloudera.org:8080/5237 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ife05af5a11e95751b3cc259e786d36494aeb5d3a Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Yanlong ZhengGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Yanlong Zheng Gerrit-HasComments: No
[kudu-CR] client: throttle some warning messages when lookup RPCs fail
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5332 to look at the new patch set (#2). Change subject: client: throttle some warning messages when lookup RPCs fail .. client: throttle some warning messages when lookup RPCs fail In an Impala stress workload, we found that LookupRpc::SendRpcCb was spewing warnings at an alarming rate. The warnings themselves were causing so much load (due to Impala's glog lock contention) that it produced a vicious cycle: the log lock contention caused more timeouts, which caused more warnings, which caused more lock contention, etc... I figured out the address of the glog mutex in an impalad process and used perf to trace the stack trace of its lock acquisitions over a 5-second period: [todd@ve1120 kudu]$ sudo perf record -a -e syscalls:sys_enter_futex --filter 'uaddr == "0x7f6950a7714c"' -g sleep 5 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 4.191 MB perf.data (~183118 samples) ] [todd@ve1120 kudu]$ sudo perf script | grep -A1 'LogMess' | grep client:: | sort | uniq -c 272 7f6950477224 kudu::client::internal::WriteRpc::Finish(kudu::Status const&) 92 7f695049be85 kudu::client::internal::RemoteTablet::MarkReplicaFailed(kudu::client::internal::RemoteTabletServer*, kudu::Status const&) 14453 7f69504a1bc5 kudu::client::internal::LookupRpc::SendRpcCb(kudu::Status const&) Here we can see the vast majority came from LookupRpc::SendRpcCb and WriteRpc::Finish. This patch addresses those as well as a few others that look like they might potentially occur frequently. Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3 --- M src/kudu/client/batcher.cc M src/kudu/client/client-internal.cc M src/kudu/client/meta_cache.cc 3 files changed, 22 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/32/5332/2 -- To view, visit http://gerrit.cloudera.org:8080/5332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] client: throttle some warning messages when lookup RPCs fail
Todd Lipcon has posted comments on this change. Change subject: client: throttle some warning messages when lookup RPCs fail .. Patch Set 1: (2 comments) OK, added throttling in a few other spots http://gerrit.cloudera.org:8080/#/c/5332/1/src/kudu/client/meta_cache.cc File src/kudu/client/meta_cache.cc: Line 197: LOG(WARNING) << "Tablet " << tablet_id_ << ": Replica " << ts->ToString() > What about this one? Done Line 696: LOG(WARNING) << "Failed to determine new Master: " << status.ToString(); > Or this one? Done -- To view, visit http://gerrit.cloudera.org:8080/5332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-695. Avoid glog contention by deferring log writes to dedicated threads
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5321 to look at the new patch set (#6). Change subject: KUDU-695. Avoid glog contention by deferring log writes to dedicated threads .. KUDU-695. Avoid glog contention by deferring log writes to dedicated threads This patch changes the logging init sequence to install a wrapper around the built-in glog Loggers. The wrapper starts a new thread which does double-buffering: log entries are buffered in a vector, and the thread is woken up. The thread swaps in a clean buffer to avoid delaying application threads while it flushes the original buffer to disk. It's hard to test the end-to-end integration in a unit test, since the unit tests disable logging to files, and this path only affects file-based logging. However, I ran an earlier version of this patch in a stress test environment and it seemed to reduce the frequency with which I saw threads blocked on glog. The patch does, however, have a test which exercises the new code paths, including the blocking path. I looped the new test 500 times in TSAN mode with success. The new feature is enabled by default, but I left a hidden flag to disable it in case we have any issues. Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d --- M src/kudu/util/CMakeLists.txt A src/kudu/util/async_logger.cc A src/kudu/util/async_logger.h M src/kudu/util/debug/leakcheck_disabler.h M src/kudu/util/logging-test.cc M src/kudu/util/logging.cc 6 files changed, 453 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/5321/6 -- To view, visit http://gerrit.cloudera.org:8080/5321 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-695. Avoid glog contention by deferring log writes to dedicated threads
Todd Lipcon has posted comments on this change. Change subject: KUDU-695. Avoid glog contention by deferring log writes to dedicated threads .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/5321/5/src/kudu/util/async_logger.cc File src/kudu/util/async_logger.cc: Line 88: l.Unlock(); > Nit: you could move the Unlock() above the conditional. Maybe adjusting the Done -- To view, visit http://gerrit.cloudera.org:8080/5321 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie22a0a29fa00a988a53a15d2c726ce5d49018f4d Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code
Todd Lipcon has submitted this change and it was merged. Change subject: KUDU-1782. Fault injection crashes should exit with a specific exit code .. KUDU-1782. Fault injection crashes should exit with a specific exit code This changes the fault injection "crash" to use exit(85) instead of LOG(FATAL). This means that we can now distinguish between a process exiting due to fault injection compared to a process exiting due to a crash or real FATAL-inducing issue. The WaitForCrash method in ExternalMiniCluster::ExternalDaemon is now split into WaitForInjectedCrash, which specifically checks that the exit code matches the expected one, and WaitForFatal, which checks that the exit code matches a SIGABRT. If the process exits with an unexpected status, a bad Status is returned, typically resulting in a test failure. I verified the new functionality by locally reverting back the fault injection code to use LOG(FATAL) and checking that ts_recovery-itest reported that the process had crashed unexpectedly. Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 Reviewed-on: http://gerrit.cloudera.org:8080/5353 Tested-by: Kudu Jenkins Reviewed-by: Adar Dembo--- M src/kudu/integration-tests/delete_table-test.cc M src/kudu/integration-tests/disk_reservation-itest.cc M src/kudu/integration-tests/external_mini_cluster.cc M src/kudu/integration-tests/external_mini_cluster.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/util/fault_injection.cc M src/kudu/util/fault_injection.h 8 files changed, 90 insertions(+), 31 deletions(-) Approvals: Adar Dembo: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/5353 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code
Todd Lipcon has posted comments on this change. Change subject: KUDU-1782. Fault injection crashes should exit with a specific exit code .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/5353/3/src/kudu/integration-tests/delete_table-test.cc File src/kudu/integration-tests/delete_table-test.cc: Line 494: ASSERT_OK(cluster_->tablet_server(kTsIndex)->Restart()); > Seems like a pretty important omission. What was the effect? basically made this part of the test a no-op: the WaitForTSToCrash() below used to silently ignore if the TS wasn't started at that point, whereas I added a CHECK in this patch that verifies it's in the expected state. -- To view, visit http://gerrit.cloudera.org:8080/5353 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code
Adar Dembo has posted comments on this change. Change subject: KUDU-1782. Fault injection crashes should exit with a specific exit code .. Patch Set 3: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/5353/3/src/kudu/integration-tests/delete_table-test.cc File src/kudu/integration-tests/delete_table-test.cc: Line 494: ASSERT_OK(cluster_->tablet_server(kTsIndex)->Restart()); Seems like a pretty important omission. What was the effect? -- To view, visit http://gerrit.cloudera.org:8080/5353 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
Todd Lipcon has posted comments on this change. Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes .. Patch Set 21: (29 comments) mostly small stuff http://gerrit.cloudera.org:8080/#/c/5240/21//COMMIT_MSG Commit Message: Line 7: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes nit: absence PS21, Line 19: if there aren't any writes in-flight that : it doesn't know about yet I think this merits further explanation PS21, Line 22: Mvcc's safe time is a more conservative version of "global" : safe time should we consider renaming this to 'trimmable time' or 'safetime_lower_bound' or something? http://gerrit.cloudera.org:8080/#/c/5240/21/java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java File java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java: PS21, Line 58: cluster nit: comma here (http://grammar.ccc.commnet.edu/GRAMMAR/commas_intro.htm has some good examples) PS21, Line 59: se absence PS21, Line 59: we expect precise clock values in the test not understanding this exactly. What does one thing have to do with the other? safetime advancements end up pushing the clock value ahead or something? Please elaborate, especially since people editing the Java client tests may not be as familiar with the internal details. PS21, Line 61: --disable_ I think 'negative' gflags are kind of confusing, because you get these kind of double-negatives. why not have it be an 'enable' flag, defaulted to true, and set it to false to disable it? http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus.h File src/kudu/consensus/consensus.h: Line 28: #include "kudu/consensus/time_manager.h" can you forward decl? http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue-test.cc File src/kudu/consensus/consensus_queue-test.cc: Line 78: scoped_refptr time_manager(new TimeManager(clock, Timestamp::kMin)); missing include for this? (I guess it's getting it transitively) http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 105:const scoped_refptr& time_manager, pass by value and then use std::move below Line 275: RETURN_NOT_OK(time_manager_->AdvanceSafeTimeWithMessage(*msg->get())); above comment should be updated to reference this. would it be more efficient (and just as useful) to just do it once at the end for msgs.back()? PS21, Line 442: comma PS21, Line 443: else { : if collapse http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: Line 34: #include "kudu/consensus/time_manager.h" is forward decl sufficient? http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: Line 533: RETURN_NOT_OK(time_manager_->AssignTimestamp(replicate)); PREPEND PS21, Line 1198: *(*iter)->get() is **iter sufficient? Can't remember under what circumstances this returns non-OK, but are we actually safe to return here rather than break? PS21, Line 1202: d comma Line 1205: RETURN_NOT_OK(time_manager_->AdvanceSafeTime(Timestamp(request->safe_timestamp(; same, is this safe to return or should it be a CHECK? http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/raft_consensus_state.h File src/kudu/consensus/raft_consensus_state.h: Line 33: #include "kudu/consensus/time_manager.h" fwd decl? PS21, Line 256: explicit dont need explicit http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: PS21, Line 1195: the clock values are unexpected. not quite following this http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc-test.cc File src/kudu/tablet/mvcc-test.cc: Line 508: // Commit tx 2 - thread should still wai. typo Line 510: mgr.CommitTransaction(tx2); probably want an ASSERT_FALSE(HasResultSnapshot()) here. http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc.cc File src/kudu/tablet/mvcc.cc: Line 285: bool MvccManager::AnyInFlightAtOrBeforeUnlocked(Timestamp ts) const { hrm, could this just be implemented by looking at earliest_in_flight_? http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc.h File src/kudu/tablet/mvcc.h: Line 224: // time received from the leader (which can go back without leader leases). perhaps elaborate that this would crash if you tried to move it backwards? (would it?) Maybe it should be called something like SetNewTransactionLowerBound? per your commit message, the MvccManager doesn't really track safetime anymore, it just tracks a lower bound on new transactions (which itself is a lower bound on
[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5353 to look at the new patch set (#3). Change subject: KUDU-1782. Fault injection crashes should exit with a specific exit code .. KUDU-1782. Fault injection crashes should exit with a specific exit code This changes the fault injection "crash" to use exit(85) instead of LOG(FATAL). This means that we can now distinguish between a process exiting due to fault injection compared to a process exiting due to a crash or real FATAL-inducing issue. The WaitForCrash method in ExternalMiniCluster::ExternalDaemon is now split into WaitForInjectedCrash, which specifically checks that the exit code matches the expected one, and WaitForFatal, which checks that the exit code matches a SIGABRT. If the process exits with an unexpected status, a bad Status is returned, typically resulting in a test failure. I verified the new functionality by locally reverting back the fault injection code to use LOG(FATAL) and checking that ts_recovery-itest reported that the process had crashed unexpectedly. Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 --- M src/kudu/integration-tests/delete_table-test.cc M src/kudu/integration-tests/disk_reservation-itest.cc M src/kudu/integration-tests/external_mini_cluster.cc M src/kudu/integration-tests/external_mini_cluster.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/util/fault_injection.cc M src/kudu/util/fault_injection.h 8 files changed, 90 insertions(+), 31 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/5353/3 -- To view, visit http://gerrit.cloudera.org:8080/5353 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot
[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5353 to look at the new patch set (#2). Change subject: KUDU-1782. Fault injection crashes should exit with a specific exit code .. KUDU-1782. Fault injection crashes should exit with a specific exit code This changes the fault injection "crash" to use exit(85) instead of LOG(FATAL). This means that we can now distinguish between a process exiting due to fault injection compared to a process exiting due to a crash or real FATAL-inducing issue. The WaitForCrash method in ExternalMiniCluster::ExternalDaemon is now split into WaitForInjectedCrash, which specifically checks that the exit code matches the expected one, and WaitForFatal, which checks that the exit code matches a SIGABRT. If the process exits with an unexpected status, a bad Status is returned, typically resulting in a test failure. I verified the new functionality by locally reverting back the fault injection code to use LOG(FATAL) and checking that ts_recovery-itest reported that the process had crashed unexpectedly. Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 --- M src/kudu/integration-tests/delete_table-test.cc M src/kudu/integration-tests/disk_reservation-itest.cc M src/kudu/integration-tests/external_mini_cluster.cc M src/kudu/integration-tests/external_mini_cluster.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/util/fault_injection.cc M src/kudu/util/fault_injection.h 8 files changed, 84 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/5353/2 -- To view, visit http://gerrit.cloudera.org:8080/5353 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot
[kudu-CR] KUDU-1782. Fault injection crashes should exit with a specific exit code
Hello David Ribeiro Alves, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/5353 to review the following change. Change subject: KUDU-1782. Fault injection crashes should exit with a specific exit code .. KUDU-1782. Fault injection crashes should exit with a specific exit code This changes the fault injection "crash" to use exit(85) instead of LOG(FATAL). This means that we can now distinguish between a process exiting due to fault injection compared to a process exiting due to a crash or real FATAL-inducing issue. The WaitForCrash method in ExternalMiniCluster::ExternalDaemon is now renamed to WaitForInjectedCrash and specifically checks that the exit code matches the expected one. If not, it now returns a bad status. I verified the new functionality by locally reverting back the fault injection code to use LOG(FATAL) and checking that ts_recovery-itest reported that the process had crashed unexpectedly. Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 --- M src/kudu/integration-tests/delete_table-test.cc M src/kudu/integration-tests/disk_reservation-itest.cc M src/kudu/integration-tests/external_mini_cluster.cc M src/kudu/integration-tests/external_mini_cluster.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/util/fault_injection.cc M src/kudu/util/fault_injection.h 8 files changed, 45 insertions(+), 25 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/53/5353/1 -- To view, visit http://gerrit.cloudera.org:8080/5353 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ic8d9ae38efa65123ae69097a76c113b9709c3484 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves
[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING
Hello David Ribeiro Alves, Mike Percy, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/5352 to look at the new patch set (#2). Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING .. [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING Fixes replica eviction failure where a follower returning TABLET_NOT_RUNNING is not evicted from the consensus config. Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625 --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/tserver/tablet_server-test-base.h M src/kudu/tserver/tablet_server_test_util.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 8 files changed, 95 insertions(+), 33 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/2 -- To view, visit http://gerrit.cloudera.org:8080/5352 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh BhatGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID
Dinesh Bhat has posted comments on this change. Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID .. Patch Set 5: (2 comments) http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 2609: const scoped_refptr& tablet, > Would be better to use the standard prefix T P for This change is moved to different patch, addressing it there. http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 213: // Open a tablet meta from the local file system by loading its superblock. > Why optional<>* ? New param needs doc Since this change moved to a different patch, I have addressed it there. http://gerrit.cloudera.org:8080/5352 -- To view, visit http://gerrit.cloudera.org:8080/5111 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh BhatGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING
Hello David Ribeiro Alves, Mike Percy, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/5352 to review the following change. Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING .. [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING Fixes replica eviction failure where a follower returning TABLET_NOT_RUNNING is not evicted from the consensus config. Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625 --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/tserver/tablet_server-test-base.h M src/kudu/tserver/tablet_server_test_util.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 8 files changed, 91 insertions(+), 31 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/1 -- To view, visit http://gerrit.cloudera.org:8080/5352 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh BhatGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy