Re: [HACKERS] logical replication - still unstable after all these months

Jeff Janes Sun, 28 May 2017 18:39:57 -0700

On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood <
[email protected]> wrote:

> On 28/05/17 19:01, Mark Kirkwood wrote:
>
>
>> So running in cloud land now...so for no errors - will update.
>>
>>
>>
>>
> The framework ran 600 tests last night, and I see 3 'NOK' results, i.e 3
> failed test runs (all scale 25 and 8 pgbench clients). Given the way the
> test decides on failure (gets tired of waiting for the table md5's to
> match) - it begs the question 'What if it had waited a bit longer'? However
> from what I can see in all cases:
>
> - the rowcounts were the same in master and replica
> - the md5 of pgbench_accounts was different
>

All four tables should be wrong if there is still a transaction it is
waiting for, as all the changes happen in a single transaction.

I also got a failure, after 87 iterations of a similar test case.  It
waited for hours, as mine requires manual intervention to stop waiting.  On
the subscriber, one account still had a zero balance, while the history
table on the subscriber agreed with both history and accounts on the
publisher and the account should not have been zero, so definitely a
transaction atomicity got busted.

I altered the script to also save the tellers and branches tables and
repeated the runs, but so far it hasn't failed again in over 800 iterations
using the altered script.

>
> ...so does seem possible that there is some bug being tickled here.
> Unfortunately the test framework blasts away the failed tables and
> subscription and continues on...I'm going to amend it to stop on failure so
> I can have a closer look at what happened.
>

What would you want to look at?  Would saving the WAL from the master be
helpful?

Cheers,

Jeff

Re: [HACKERS] logical replication - still unstable after all these months

Reply via email to