Adar Dembo has posted comments on this change.

Change subject: KUDU-456 Implement AUTO_FLUSH_BACKGROUND flush mode
......................................................................


Patch Set 25:

(45 comments)

I actually reviewed the changes to client-test this time.

http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

PS25, Line 182:   // Must be public to use as a thread closure.
They can be static functions, though, right? I don't see them accessing any 
test state.


PS25, Line 187: Microseconds
How did you choose 10 us? It seems low to me (i.e. maybe we can get away with a 
higher granularity), but maybe I'm missing something.


PS25, Line 199:   
Nit: extra space here.


PS25, Line 553: oprations
Nit: operations


PS25, Line 556: WaitForAutoFlushBackground
Hmm, when this returns, there's no actual guarantee that a batch flushed, 
right? We may have just expired the deadline?


PS25, Line 558: MAX_WAIT_MS
kMaxWaitMs


Line 1207:         << "unexpected status: " << result.ToString();
There's a better technique for handling situations like these:

  Status s = DoSomething();
  SCOPED_TRACE(s.ToString());
  ASSERT_TRUE(s.IsIOError());
  ASSERT_TRUE(...);

If a test fails while SCOPED_TRACE() is still in scope, gtest will print 
s.ToString(). That way you don't have to worry about printing it yourself 
following every potential ASSERT/EXPECT that may fail.


PS25, Line 2153: BUFFER_SIZE
kBufferSizeBytes (camel case, k prefix, and append the unit)

Below too. Please fix all of your "constants" to follow this naming convention.
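
For example (the 1024 here is just a stand-in for whatever value you're using):

  const size_t kBufferSizeBytes = 1024;  // mutation buffer limit for this test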


PS25, Line 2186: // Applying a big operation which does not fit into the buffer should
               : // return an error with session running in any supported flush mode.
Doesn't this test make the second half of TestApplyTooMuchWithoutFlushing 
redundant?


PS25, Line 2197: for (auto mode: FLUSH_MODES)
Nit: auto mode : FLUSH_MODES

(separate 'mode' from the colon with a space)


PS25, Line 2215:     STLDeleteElements(&errors);
Nit: use an ElementDeleter and declare it right after declaring 'errors'. Then 
there's no possibility for a leak.
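
I.e. something like this sketch (ElementDeleter lives in gutil/stl_util.h, IIRC):

  vector<KuduError*> errors;
  ElementDeleter deleter(&errors);  // deletes the KuduError objects on scope exit
  bool overflowed;
  session->GetPendingErrors(&errors, &overflowed);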


PS25, Line 2258:   WaitForAutoFlushBackground(session, FLUSH_INTERVAL_MS);
               :   // There should be some pending ops: the automatic background flush
               :   // should not be active after switch to the MANUAL_FLUSH mode.
I see. How expensive is this in terms of test runtime? You've dropped the flush 
interval to 100 ms; how much time do we end up waiting here?


Line 2304:     *result = t_end - t_begin;
Have you looked at the StopWatch class? May be easier for this kind of thing.
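
Rough sketch of what I mean (API from memory; see util/stopwatch.h):

  Stopwatch sw;
  sw.start();
  // ... apply the operations ...
  sw.stop();
  *result = MonoDelta::FromNanoseconds(sw.elapsed().wall);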


PS25, Line 2335: // Check that call to Flush()/FlushAsync() is no-op if there isn't current
               : // batcher for the session.
Doesn't this test make TestEmptyBatch redundant?


PS25, Line 2348:   ASSERT_OK(sync0.Wait());
               :   // The same with the synchronous flush.
               :   ASSERT_OK(session->Flush());
               :   ASSERT_FALSE(session->HasPendingOperations());
               :   ASSERT_EQ(0, session->CountPendingErrors());
How does this verify that flush was a no-op? Seems to me you need to find some 
metrics indicating that no Write RPCs were sent.


PS25, Line 2391: kudu::Thread::Create
Now that we're building with C++11, we prefer to use std::thread for test 
threads. There's less boilerplate with them (i.e. no need to make up a thread 
category and name), and they have better support for lambdas, which I think you 
can use here pretty effectively.

Same goes for the other new uses of kudu::Thread.
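
For example (sketch only; the lambda body is whatever the closure does today):

  std::thread flusher([&]() {
    // ... former thread body: apply rows on 'session', etc. ...
  });
  // ... later ...
  flusher.join();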


Line 2428:   // AUTO_FLUSH_BACKGROUND should be faster than AUTO_FLUSH_SYNC.
Hmm, is this expected to be true 100% of all test runs? Even with debug 
instrumentation, TSAN, ASAN, etc? Wondering if this test will be flaky.

A good way to find out is to use build-support/dist-test.py to loop your test 
1000 times. You can use --gtest_filter on the command line to run just this one 
test case (as opposed to all of client-test, much of which is already flaky).


Line 2429:   EXPECT_GT(t_diff_afs.ToMilliseconds(), t_diff_afb.ToMilliseconds());
Can't we just EXPECT_GT on the MonoDeltas directly now?
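
I.e. (assuming MonoDelta's comparison operators have landed by now):

  EXPECT_GT(t_diff_afs, t_diff_afb);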


PS25, Line 2451:   // AUTO_FLUSH_BACKGROUND should be faster than AUTO_FLUSH_SYNC.
               :   EXPECT_GT(t_diff_afs.ToMilliseconds(), t_diff_afb.ToMilliseconds());
Same issues and questions as in the above test.


PS25, Line 2456: // applying a bunch of small rows without a flush should not result in
               : // an error, even with low limit on the buffer space.  This is because
               : // Session::Apply() blocks and waits for buffer space to become available.
For this test, you may want to scan once you're done writing to make sure that 
all of the rows really made it and that Apply() didn't just decide to throw 
some rows away because we ran out of buffer space.
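
Something along these lines (a sketch; 'table' and kNumRows stand for whatever the
test already has), or reuse a row-counting helper if this file already has one:

  KuduScanner scanner(table.get());
  ASSERT_OK(scanner.Open());
  int64_t row_count = 0;
  while (scanner.HasMoreRows()) {
    KuduScanBatch batch;
    ASSERT_OK(scanner.NextBatch(&batch));
    row_count += batch.NumRows();
  }
  ASSERT_EQ(kNumRows, row_count);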


Line 2488: // applying a bunch of rows every of which is so big in size that
Nit: every one of which


PS25, Line 2489: does
Nit: do


PS25, Line 2510: 3
Why does i start at 3 and not 0?


PS25, Line 2574:   ASSERT_TRUE(session->HasPendingOperations());
This is the first test where I've noticed this but others may be affected too.

It's risky to have this assert when using background flushing. We've seen Java 
tests become flaky from asserting this. Scheduling delays may cause the 
background flush to be executed before this assert runs, especially with a low 
interval like 100ms.

If you want your test to be robust, it should either:
1. Properly handle the case where HasPendingOperations() has already become
false (i.e. short-circuit out of the test, I guess; see the sketch at the end
of this comment).
2. Loop a bunch of times until at least one run works as expected, though I 
don't know if this is actually more robust in slow environments like TSAN.
3. Use a much higher flush interval, though this has the side effect of 
increasing test runtime, potentially significantly.

Again, when in doubt, use dist-test.py to loop the test 1000 times in a 
pathological environment (e.g. TSAN) and see whether it fails.
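
For option 1, I'm picturing something like:

  if (!session->HasPendingOperations()) {
    LOG(WARNING) << "operations were already flushed in the background; "
                    "nothing left to verify here, bailing out";
    return;
  }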


PS25, Line 2612:     const MonoTime t_begin(MonoTime::Now());
               :     ASSERT_OK(session->Apply(insert.release()));
               :     const MonoTime t_end(MonoTime::Now());
Consider StopWatch for this.


PS25, Line 2627:     // It's supposed that sending an operation to the server at least
               :     // order of magnitude less than the periodic flush interval.
Nit: "Sending an operation to the server is supposed to take at least an order
of magnitude less time than the periodic flush interval."


PS25, Line 2630:     EXPECT_GT(FLUSH_INTERVAL_MS / 10, t_diff.ToMilliseconds());
Again, worried about flakiness due to testing timing like this.


PS25, Line 2634: It's supposed the flushed operations
               :   // reach the tablet server faster than the max-wait flush interval.
Nit: "The flushed operations should reach the tablet server faster than the 
max-wait flush interval."


PS25, Line 2636:   WaitForAutoFlushBackground(session, FLUSH_INTERVAL_MS);
This will be at least a 10s pause, right? If so, can you condition the entire 
test on AllowSlowTests()?
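
I.e. the usual early-return at the top of the test (sketch):

  if (!AllowSlowTests()) {
    LOG(WARNING) << "test is skipped; set KUDU_ALLOW_SLOW_TESTS=1 to run it";
    return;
  }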


PS25, Line 2645: accomodate
Nit: accommodate


PS25, Line 2653: gurantees
Nit: guarantees


PS25, Line 2663: TestAutoFlushBackgroundApplyBlocks
Lots of opportunities for flakiness here due to timing measurements, please 
test with dist-test and TSAN/ASAN.


PS25, Line 2714: It's supposed the flushed operations
               :   // reach the tablet server faster than wait_timeout_ms.
Nit: reword as per the above.


http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/client.h
File src/kudu/client/client.h:

PS25, Line 1351: successfull
successful


PS25, Line 1352: left
be left


http://gerrit.cloudera.org:8080/#/c/3952/23/src/kudu/client/session-internal.cc
File src/kudu/client/session-internal.cc:

PS23, Line 296: 
              :     if (batcher_) {
              :       const MonoTime f
> OK, I see.  I suspected something like that, but since I haven't seen anyth
This means acquiring a Mutex on the hot path, though. Hmm. The implementation 
does a TryAcquire() in Acquire() so that it will do just a CAS when 
uncontended, so maybe that's OK.

May want to ask Todd about this one (that is, about whether taking a lock is 
necessary).


http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/session-internal.cc
File src/kudu/client/session-internal.cc:

PS25, Line 106: PRCs
RPCs


PS25, Line 150: PRCs
RPCs


PS25, Line 174: PRCs
RPCs


PS25, Line 208: PRCs
RPCs


http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/session-internal.h
File src/kudu/client/session-internal.h:

PS25, Line 102: NoLock
Kudu convention is to use an "Unlocked" suffix for methods like these.


PS25, Line 144: This method is used by tests only.
Kudu convention is to add the suffix "ForTests" for test-only methods like 
these (and below).


PS25, Line 151: is here to represent
Nit: just "represents"


PS25, Line 152: // the first argument in expressions like FlushCurrentBatcher(1, cbk):
              :   // this is the watermark corresponding to 1 byte of data.
Yes, but what is the effect? What is special about choosing a value of 1?


PS25, Line 181:   // Note that this mutex should not be acquired if the thread is already
              :   // holding a Batcher's lock. This must be acquired first.
How is such sequencing possible? Isn't the batcher's lock internal to the 
batcher, and thus even if you call a batcher method that acquires the batcher 
lock, it is released before the call returns?


-- 
To view, visit http://gerrit.cloudera.org:8080/3952
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I34905c30b3aad96f53cf7a1822b1cde6d25f33a8
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <d...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes
