Hi,

I played with the test (mostly in the background) and made the failure almost 100% reproducible.
After collecting some evidence, I can say it's a server-side bug. I think so because the reproduction scenario I'm describing uses good old MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode. Yes, I've modified the test slightly to achieve a higher reproduction ratio and to settle the question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.

Here's what I found:

1. The problem occurs when updating rows with the same primary keys multiple times.
2. It's crucial to flush (i.e. call KuduSession::Flush() or KuduSession::FlushAsync()) freshly applied update operations not just once at the very end of a client session, but multiple times while adding those operations. If flushing just once at the very end, the issue becomes 0% reproducible.
3. The more updates for different rows we have, the more likely we are to hit the issue (but there should be at least a couple of updates for every row).
4. The problem persists in all types of Kudu builds: debug, TSAN, release, ASAN (in decreasing order of reproduction ratio).
5. The problem is also highly reproducible when running the test via the dist_test.py utility (check the 256 out of 256 failure ratio at http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603 ).

To build the modified test and run the reproduction scenario:

1. Get the patch from https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
2. Apply the patch to the latest Kudu source from the master branch.
3. Build the debug, TSAN, release or ASAN configuration and run with the following command (the random seed is not really crucial, but this one gives better results):

     ../../build-support/run-test.sh ./bin/tablet_history_gc-itest \
       --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload \
       --stress_cpu_threads=64 --test_random_seed=1213726993

4. If running via dist_test.py, run the following instead:

     ../../build-support/dist_test.py loop -n 256 -- ./bin/tablet_history_gc-itest \
       --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload \
       --stress_cpu_threads=8 --test_random_seed=1213726993

Mike, it seems I'll need your help to troubleshoot/debug this issue further.

Best regards,

Alexey

On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin <aser...@cloudera.com> wrote:

> Todd,
>
> I apologize for the late response -- somehow my inbox is messed up.
> Probably, I need to switch to use stand-alone mail application (as iMail)
> instead of browser-based one.
>
> Yes, I'll take a look at that.
>
>
> Best regards,
>
> Alexey
>
> On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> This test has gotten flaky with a concerning failure mode (seeing "wrong"
>> results, not just a timeout or something):
>>
>> http://dist-test.cloudera.org:8080/test_drilldown?test_name=tablet_history_gc-itest
>>
>> It seems like it got flaky starting with Alexey's
>> commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa which switched it to use
>> AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and not
>> anything to do with GC.
>>
>> Alexey, do you have time to take a look, and perhaps consult with Mike if
>> you think it's actually a server-side bug?
>>
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera