Robert Haas <robertmh...@gmail.com> writes:
>> That would be great.  Taking a look at what happened, I have a feeling
>> this may be a race condition of some kind in the isolation tester.  It
>> seems to have failed to recognize that a1 started waiting, and that
>> caused the "deadlock detected" message to be reported differently.  I'm
>> not immediately sure what to do about that.
> Yeah, so: try_complete_step() waits 10ms, and if it still hasn't
> gotten any data back from the server, then it uses a separate query to
> see whether the step in question is waiting on a lock.  So what
> must've happened here is that it took more than 10ms for the process
> to show up as waiting in pg_stat_activity.

No, because the machines that are failing are showing a "<waiting ...>"
annotation that your reference output *doesn't* have.  I think what is
actually happening is that these machines are seeing the process as
waiting and reporting it, whereas on your machine the backend detects
the deadlock and completes the query (with an error) before
isolationtester realizes that the process is waiting.

It would probably help if you didn't do this:

setup		{ BEGIN; SET deadlock_timeout = '10ms'; }

which pretty much guarantees that there is a race condition: you've set
it so that the deadlock detector will run at approximately the same time
that isolationtester will be probing the state.  I'm surprised that it
seemed to act consistently for you.

I would suggest setting deadlock_timeout to 100s in all the other
sessions, and to ~5s in the one you want to fail.  That will mean that
the "<waiting ...>" output should show up pretty reliably even on
overloaded buildfarm critters.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
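[For readers unfamiliar with the isolation tester: the suggested timeout split might look roughly like this in a spec file.  Session, step, and table names below are invented for illustration; they are not from the spec under discussion.]

```
# Hypothetical isolation-spec excerpt: every session that should merely
# block gets a long deadlock_timeout, so only the designated victim's
# deadlock detector fires, well after isolationtester has had time to
# observe the wait and print "<waiting ...>".
session "s1"
setup		{ BEGIN; SET deadlock_timeout = '100s'; }
step "s1a"	{ LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE; }
step "s1b"	{ LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; }
step "s1c"	{ COMMIT; }

session "s2"
# Only the session expected to report "deadlock detected" gets the
# short (but still generous) timeout.
setup		{ BEGIN; SET deadlock_timeout = '5s'; }
step "s2a"	{ LOCK TABLE a2 IN ACCESS EXCLUSIVE MODE; }
step "s2b"	{ LOCK TABLE a1 IN ACCESS EXCLUSIVE MODE; }
step "s2c"	{ COMMIT; }
```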