On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood <mark.kirkw...@catalyst.net.nz> wrote:

> On 28/05/17 19:01, Mark Kirkwood wrote:
>
>> So running in cloud land now...so for no errors - will update.
>>
>
> The framework ran 600 tests last night, and I see 3 'NOK' results, i.e 3
> failed test runs (all scale 25 and 8 pgbench clients). Given the way the
> test decides on failure (gets tired of waiting for the table md5's to
> match) - it begs the question 'What if it had waited a bit longer'? However
> from what I can see in all cases:
>
> - the rowcounts were the same in master and replica
> - the md5 of pgbench_accounts was different
>

All four tables should be wrong if there is still a transaction it is
waiting for, as all the changes happen in a single transaction.

I also got a failure, after 87 iterations of a similar test case. It
waited for hours, as mine requires manual intervention to stop waiting. On
the subscriber, one account still had a zero balance, while the history
table on the subscriber agreed with both history and accounts on the
publisher. The account should not have been zero, so transaction atomicity
definitely got busted.

I altered the script to also save the tellers and branches tables and
repeated the runs, but so far it hasn't failed again in over 800 iterations
using the altered script.

> ...so does seem possible that there is some bug being tickled here.
> Unfortunately the test framework blasts away the failed tables and
> subscription and continues on...I'm going to amend it to stop on failure so
> I can have a closer look at what happened.
>

What would you want to look at? Would saving the WAL from the master be
helpful?

Cheers,

Jeff
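
For reference, the atomicity argument above can be sketched as a check. This is only an illustration, not the script used in either test run: the function names and the sample rows are hypothetical. It assumes a freshly initialized pgbench database (all balances start at zero) and a history table that has not been truncated. Under those assumptions, each built-in pgbench transaction adds the same delta to one account, one teller and one branch and records it in history, so the four totals should always agree.

```python
# Hypothetical sketch: none of these names come from the actual test scripts.

def balance_totals(accounts, tellers, branches, history):
    """Return the four totals that one pgbench transaction changes by the same delta.

    accounts/tellers/branches map an id to its balance;
    history is a list of (aid, tid, bid, delta) rows.
    """
    return {
        "pgbench_accounts": sum(accounts.values()),
        "pgbench_tellers": sum(tellers.values()),
        "pgbench_branches": sum(branches.values()),
        "pgbench_history": sum(delta for (_aid, _tid, _bid, delta) in history),
    }


def atomicity_violations(totals):
    """Return the names of tables whose total disagrees with the history total."""
    expected = totals["pgbench_history"]
    return sorted(name for name, total in totals.items() if total != expected)


# Made-up subscriber state resembling the failure described above:
# the history rows arrived, but account 2 still has a zero balance.
history = [(1, 1, 1, 50), (2, 1, 1, -20)]
accounts = {1: 50, 2: 0}  # account 2 should hold -20
tellers = {1: 30}
branches = {1: 30}

violations = atomicity_violations(
    balance_totals(accounts, tellers, branches, history)
)
print(violations)
```

Running the same check on both publisher and subscriber would show whether a mismatch is confined to one table, which is what saving the tellers and branches tables is meant to reveal.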