Adar Dembo has posted comments on this change. Change subject: KUDU-456 Implement AUTO_FLUSH_BACKGROUND flush mode ......................................................................
Patch Set 25: (45 comments) I actually reviewed the changes to client-test this time. http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/client-test.cc File src/kudu/client/client-test.cc: PS25, Line 182: // Must be public to use as a thread closure. They can be static functions, though, right? I don't see them accessing any test state. PS25, Line 187: Microseconds How did you choose 10 us? It seems low to me (i.e. maybe we can get away with a higher granularity), but maybe I'm missing something. PS25, Line 199: Nit: extra space here. PS25, Line 553: oprations Nit: operations PS25, Line 556: WaitForAutoFlushBackground Hmm, when this returns, there's no actual guarantee that a batch flushed, right? We may have just expired the deadline? PS25, Line 558: MAX_WAIT_MS kMaxWaitMs Line 1207: << "unexpected status: " << result.ToString(); There's a better technique for handling situations like these: Status s = DoSomething(); SCOPED_TRACE(s.ToString()); ASSERT_TRUE(s.IsIOError()); ASSERT_TRUE(...); If a test fails while SCOPED_TRACE() is still in scope, gtest will print s.ToString(). That way you don't have to worry about printing it yourself following every potential ASSERT/EXPECT that may fail. PS25, Line 2153: BUFFER_SIZE kBufferSizeBytes (camel case, k prefix, and append the unit) Below too. Please fix all of your "constants" to follow this naming convention. PS25, Line 2186: // Applying a big operation which does not fit into the buffer should : // return an error with session running in any supported flush mode. Doesn't this test make the second half of TestApplyTooMuchWithoutFlushing redundant? PS25, Line 2197: for (auto mode: FLUSH_MODES) Nit: auto mode : FLUSH_MODES (separate 'mode' from the colon with a space) PS25, Line 2215: STLDeleteElements(&errors); Nit: use an ElementDeleter and declare it right after declaring 'errors'. Then there's no possibility for a leak. PS25, Line 2258: WaitForAutoFlushBackground(session, FLUSH_INTERVAL_MS); : // There should be some pending ops: the automatic background flush : // should not be active after switch to the MANUAL_FLUSH mode. I see. How expensive is this in terms of test runtime? You've dropped the flush interval to 100 ms; how much time do we end up waiting here? Line 2304: *result = t_end - t_begin; Have you looked at the StopWatch class? May be easier for this kind of thing. PS25, Line 2335: // Check that call to Flush()/FlushAsync() is no-op if there isn't current : // batcher for the session. Doesn't this test make TestEmptyBatch redundant? PS25, Line 2348: ASSERT_OK(sync0.Wait()); : // The same with the synchronous flush. : ASSERT_OK(session->Flush()); : ASSERT_FALSE(session->HasPendingOperations()); : ASSERT_EQ(0, session->CountPendingErrors()); How does this verify that flush was a no-op? Seems to me you need to find some metrics indicating that no Write RPCs were sent. PS25, Line 2391: kudu::Thread::Create Now that we're building with C++11, we prefer to use std::thread for test threads. There's less boilerplate with them (i.e. no need to make up a thread category and name), and they have better support for lambdas, which I think you can use here pretty effectively. Same goes for the other new uses of kudu::Thread. Line 2428: // AUTO_FLUSH_BACKGROUND should be faster than AUTO_FLUSH_SYNC. Hmm, is this expected to be true 100% of all test runs? Even with debug instrumentation, TSAN, ASAN, etc? Wondering if this test will be flaky. A good way to find out is to use build-support/dist-test.py to loop your test 1000 times. You can use --gtest_filter on the command line to run just this one test case (as opposed to all of client-test, much of which is already flaky). Line 2429: EXPECT_GT(t_diff_afs.ToMilliseconds(), t_diff_afb.ToMilliseconds()); Can't we just EXPECT_GT on the MonoDeltas directly now? PS25, Line 2451: // AUTO_FLUSH_BACKGROUND should be faster than AUTO_FLUSH_SYNC. : EXPECT_GT(t_diff_afs.ToMilliseconds(), t_diff_afb.ToMilliseconds()); Same issues and questions as in the above test. PS25, Line 2456: // applying a bunch of small rows without a flush should not result in : // an error, even with low limit on the buffer space. This is because : // Session::Apply() blocks and waits for buffer space to become available. For this test, you may want to scan once you're done writing to make sure that all of the rows really made it and that Apply() didn't just decide to throw some rows away because we ran out of buffer space. Line 2488: // applying a bunch of rows every of which is so big in size that Nit: every one of which PS25, Line 2489: does Nit: do PS25, Line 2510: 3 Why does i start at 3 and not 0? PS25, Line 2574: ASSERT_TRUE(session->HasPendingOperations()); This is the first test where I've noticed this but others may be affected too. It's risky to have this assert when using background flushing. We've seen Java tests become flaky from asserting this. Scheduling delays may cause the background flush to be executed before this assert runs, especially with a low interval like 100ms. If you want your test to be robust, it should either: 1. Properly handle the case where HasPendingOperations() has already become false (i.e. short-circuit out of the test, I guess). 2. Loop a bunch of times until at least one run works as expected, though I don't know if this is actually more robust in slow environments like TSAN. 3. Use a much higher flush interval, though this has the side effect of increasing test runtime, potentially significantly. Again, when in doubt, use dist-test.py to loop the test 1000 times in a pathological environment (e.g. TSAN) and see whether it fails. PS25, Line 2612: const MonoTime t_begin(MonoTime::Now()); : ASSERT_OK(session->Apply(insert.release())); : const MonoTime t_end(MonoTime::Now()); Consider StopWatch for this. PS25, Line 2627: // It's supposed that sending an operation to the server at least : // order of magnitude less than the periodic flush interval. Nit: "Sending an operation to the server is supposed to be at least an order of magnitude less than the periodic flush interval" PS25, Line 2630: EXPECT_GT(FLUSH_INTERVAL_MS / 10, t_diff.ToMilliseconds()); Again, worried about flakiness due to testing timing like this. PS25, Line 2634: It's supposed the flushed operations : // reach the tablet server faster than the max-wait flush interval. Nit: "The flushed operations should reach the tablet server faster than the max-wait flush interval." PS25, Line 2636: WaitForAutoFlushBackground(session, FLUSH_INTERVAL_MS); This will be at least a 10s pause, right? If so, can you condition the entire test on AllowSlowTests()? PS25, Line 2645: accomodate Nit: accommodate PS25, Line 2653: gurantees Nit: guarantees PS25, Line 2663: TestAutoFlushBackgroundApplyBlocks Lots of opportunities for flakiness here due to timing measurements, please test with dist-test and TSAN/ASAN. PS25, Line 2714: It's supposed the flushed operations : // reach the tablet server faster than wait_timeout_ms. Nit: reword as per the above. http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/client.h File src/kudu/client/client.h: PS25, Line 1351: successfull successful PS25, Line 1352: left be left http://gerrit.cloudera.org:8080/#/c/3952/23/src/kudu/client/session-internal.cc File src/kudu/client/session-internal.cc: PS23, Line 296: : if (batcher_) { : const MonoTime f > OK, I see. I suspected something like that, but since I haven't seen anyth This means acquiring a Mutex on the hot path, though. Hmm. The implementation does a TryAcquire() in Acquire() so that it will do just a CAS when uncontended, so maybe that's OK. May want to ask Todd about this one (that is, about whether taking a lock is necessary). http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/session-internal.cc File src/kudu/client/session-internal.cc: PS25, Line 106: PRCs RPCs PS25, Line 150: PRCs RPCs PS25, Line 174: PRCs RPCs PS25, Line 208: PRCs RPCs http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/session-internal.h File src/kudu/client/session-internal.h: PS25, Line 102: NoLock Kudu convention is to use an "Unlocked" suffix for methods like these. PS25, Line 144: This method is used by tests only. Kudu convention is to add the suffix "ForTests" for test-only methods like these (and below). PS25, Line 151: is here to represent Nit: just "represents" PS25, Line 152: // the first argument in expressions like FlushCurrentBatcher(1, cbk): : // this is the watermark corresponding to 1 byte of data. Yes, but what is the effect? What is special about choosing a value of 1? PS25, Line 181: // Note that this mutex should not be acquired if the thread is already : // holding a Batcher's lock. This must be acquired first. How is such sequencing possible? Isn't the batcher's lock internal to the batcher, and thus even if you call a batcher method that acquires the batcher lock, it is released before the call returns? -- To view, visit http://gerrit.cloudera.org:8080/3952 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I34905c30b3aad96f53cf7a1822b1cde6d25f33a8 Gerrit-PatchSet: 25 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Dan Burkert <d...@cloudera.com> Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes