One small update: the issue might not be in the GC logic, but in some other flakiness related to reading data at a snapshot.
I updated the patch so the only operations the test now does are inserts,
updates, and scans: no tablet merge compactions, no redo delta compactions,
no forced re-updates of missing deltas, and no moving time forward. The
updated patch can be found at:
https://gist.github.com/alexeyserbin/06ed8dbdb0e8e9abcbde2991c6615660

The test reliably fails when run as described in the previous message in
this thread; just use the updated patch location.

David, maybe you can take a quick look at that as well?

Thanks,

Alexey

On Wed, Oct 12, 2016 at 2:01 AM, Alexey Serbin <aser...@cloudera.com> wrote:

> Hi,
>
> I played with the test (mostly in the background), making the failure
> almost 100% reproducible.
>
> After collecting some evidence, I can say it's a server-side bug. I think
> so because the reproduction scenario I'm talking about uses the good old
> MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode. Yes, I've modified the
> test slightly to achieve a higher reproduction ratio and to settle the
> question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.
>
> Here is what I found:
>
> 1. The problem occurs when updating rows with the same primary keys
> multiple times.
>
> 2. It's crucial to flush (i.e. call KuduSession::Flush() or
> KuduSession::FlushAsync()) freshly applied update operations not just
> once at the very end of a client session, but multiple times while adding
> those operations. When flushing just once at the very end, the issue
> becomes 0% reproducible.
>
> 3. The more updates for different rows we have, the more likely we are to
> hit the issue (but there should be at least a couple of updates for every
> row).
>
> 4. The problem persists in all types of Kudu builds: debug, TSAN,
> release, ASAN (in decreasing order of reproduction ratio).
>
> 5.
> The problem is also highly reproducible if running the test via the
> dist_test.py utility (check the 256-out-of-256 failure ratio at
> http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603).
>
> To build the modified test and run the reproduction scenario:
>
> 1. Get the patch from
> https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
>
> 2. Apply the patch to the latest Kudu source from the master branch.
>
> 3. Build the debug, TSAN, release, or ASAN configuration and run with the
> following command (the random seed is not really crucial, but this one
> gives better results):
>
> ../../build-support/run-test.sh ./bin/tablet_history_gc-itest \
>   --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload \
>   --stress_cpu_threads=64 --test_random_seed=1213726993
>
> 4. If running via dist_test.py, run the following instead:
>
> ../../build-support/dist_test.py loop -n 256 -- \
>   ./bin/tablet_history_gc-itest \
>   --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload \
>   --stress_cpu_threads=8 --test_random_seed=1213726993
>
> Mike, it seems I'll need your help to troubleshoot/debug this issue
> further.
>
> Best regards,
>
> Alexey
>
> On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin <aser...@cloudera.com>
> wrote:
>
>> Todd,
>>
>> I apologize for the late response -- somehow my inbox got messed up.
>> Probably I need to switch to a stand-alone mail application (such as
>> iMail) instead of a browser-based one.
>>
>> Yes, I'll take a look at that.
>>
>> Best regards,
>>
>> Alexey
>>
>> On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> This test has gotten flaky with a concerning failure mode (seeing
>>> "wrong" results, not just a timeout or something):
>>>
>>> http://dist-test.cloudera.org:8080/test_drilldown?test_name=tablet_history_gc-itest
>>>
>>> It seems like it got flaky starting with Alexey's commit
>>> bc14b2f9d775c9f27f2e2be36d4b03080977e8fa, which switched it to use
>>> AUTO_FLUSH_BACKGROUND.
>>> So perhaps the bug is actually a client bug and not anything to do
>>> with GC.
>>>
>>> Alexey, do you have time to take a look, and perhaps consult with Mike
>>> if you think it's actually a server-side bug?
>>>
>>> -Todd
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
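[Editor's note: for readers following along, the reproduction pattern from the findings above (repeated updates to the same primary keys, flushed in batches under MANUAL_FLUSH rather than once at the end of the session) looks roughly like the sketch below against the Kudu C++ client API. This is an illustrative sketch only, not the actual test code: it assumes a running cluster and an already-opened client, the table name, column names, and constants are made up, and it is not buildable without the Kudu client library.]

```cpp
// Sketch: repeatedly update the same rows, flushing after every batch.
// Assumes the Kudu C++ client library and a pre-built KuduClient; the
// table "test_table" with columns "key" and "int_val" is hypothetical.
#include "kudu/client/client.h"

using namespace kudu::client;

void RepeatedlyUpdateAndFlush(const sp::shared_ptr<KuduClient>& client) {
  sp::shared_ptr<KuduTable> table;
  KUDU_CHECK_OK(client->OpenTable("test_table", &table));

  sp::shared_ptr<KuduSession> session = client->NewSession();
  // MANUAL_FLUSH: operations are buffered until Flush()/FlushAsync().
  KUDU_CHECK_OK(session->SetFlushMode(KuduSession::MANUAL_FLUSH));

  const int kNumRows = 100;
  const int kNumRounds = 10;
  for (int round = 0; round < kNumRounds; ++round) {
    for (int key = 0; key < kNumRows; ++key) {
      // Update the same primary keys on every round (finding #1).
      KuduUpdate* update = table->NewUpdate();
      KUDU_CHECK_OK(update->mutable_row()->SetInt32("key", key));
      KUDU_CHECK_OK(update->mutable_row()->SetInt32("int_val", round));
      KUDU_CHECK_OK(session->Apply(update));
    }
    // Flushing after every batch, not just once at the end of the
    // session, is what makes the issue reproducible (finding #2);
    // a single Flush() at the very end does not trigger it.
    KUDU_CHECK_OK(session->Flush());
  }
}
```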