One small update: the issue might not be in the GC logic, but rather some
other flakiness related to reading data at a snapshot.

I updated the patch so the only operations the test now does are inserts,
updates and scans. No tablet merge compactions, redo delta compactions,
forced re-updates of missing deltas, or moving time forward.  The updated
patch can be found at:
  https://gist.github.com/alexeyserbin/06ed8dbdb0e8e9abcbde2991c6615660

The test fails consistently when run as described in the previous message in
this thread; just use the updated patch location.

David, maybe you could take a quick look at that as well?


Thanks,

Alexey

On Wed, Oct 12, 2016 at 2:01 AM, Alexey Serbin <aser...@cloudera.com> wrote:

> Hi,
>
> I played with the test (mostly in the background), making the failure
> almost 100% reproducible.
>
> After collecting some evidence, I can say it's a server-side bug.  I think
> so because the reproduction scenario I'm talking about uses the good old
> MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode.  Yes, I've modified the
> test slightly to achieve a higher reproduction ratio and to settle the
> question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.
>
> Here's what I found:
>   1. The problem occurs when updating rows with the same primary keys
> multiple times.
>   2. It's crucial to flush (i.e. call KuduSession::Flush() or
> KuduSession::FlushAsync()) freshly applied update operations not just once
> at the very end of a client session, but multiple times while adding those
> operations.  When flushing just once at the very end, the issue does not
> reproduce at all.
>   3. The more updates for different rows we have, the more likely we are
> to hit the issue (but there should be at least a couple of updates for
> every row).
>   4. The problem persists in all types of Kudu builds: debug, TSAN,
> release, ASAN (in decreasing order of reproduction ratio).
>   5. The problem is also highly reproducible when running the test via the
> dist_test.py utility (see the 256 out of 256 failure ratio at
> http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603 )
>
> To build the modified test and run the reproduction scenario:
>   1. Get the patch from
> https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
>   2. Apply the patch to the latest Kudu source from the master branch.
>   3. Build the debug, TSAN, release or ASAN configuration and run the
> following command (the random seed is not really crucial, but this one
> gives better results):
>     ../../build-support/run-test.sh ./bin/tablet_history_gc-itest \
>       --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload \
>       --stress_cpu_threads=64 --test_random_seed=1213726993
>
>   4. If running via dist_test.py, run the following instead:
>
>     ../../build-support/dist_test.py loop -n 256 -- \
>       ./bin/tablet_history_gc-itest \
>       --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload \
>       --stress_cpu_threads=8 --test_random_seed=1213726993
>
> Mike, it seems I'll need your help to troubleshoot/debug this issue
> further.
>
>
> Best regards,
>
> Alexey
>
>
> On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin <aser...@cloudera.com>
> wrote:
>
>> Todd,
>>
>> I apologize for the late response -- somehow my inbox is messed up.
>> I probably need to switch to a stand-alone mail application (such as
>> iMail) instead of a browser-based one.
>>
>> Yes, I'll take a look at that.
>>
>>
>> Best regards,
>>
>> Alexey
>>
>> On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> This test has gotten flaky with a concerning failure mode (seeing
>>> "wrong" results, not just a timeout or something):
>>>
>>> http://dist-test.cloudera.org:8080/test_drilldown?test_name=tablet_history_gc-itest
>>>
>>> It seems like it got flaky starting with Alexey's
>>> commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa which switched it to
>>> use AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and
>>> not anything to do with GC.
>>>
>>> Alexey, do you have time to take a look, and perhaps consult with Mike
>>> if you think it's actually a server-side bug?
>>>
>>> -Todd
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
