On Sun, Mar 4, 2018 at 3:40 PM, Andres Freund <and...@anarazel.de> wrote:
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>>On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>>                                 QUERY PLAN
>>> --------------------------------------------------------------------------
>>>   Aggregate (actual rows=1 loops=1)
>>> !   ->  Nested Loop (actual rows=98000 loops=1)
>>>           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                 Filter: (thousand = 0)
>>>                 Rows Removed by Filter: 9990
>>> !         ->  Gather (actual rows=9800 loops=10)
>>>                 Workers Planned: 4
>>>                 Workers Launched: 4
>>>                 ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>> --- 485,495 ----
>>>                                 QUERY PLAN
>>> --------------------------------------------------------------------------
>>>   Aggregate (actual rows=1 loops=1)
>>> !   ->  Nested Loop (actual rows=97984 loops=1)
>>>           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                 Filter: (thousand = 0)
>>>                 Rows Removed by Filter: 9990
>>> !         ->  Gather (actual rows=9798 loops=10)
>>>                 Workers Planned: 4
>>>                 Workers Launched: 4
>>>                 ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>>I think the same failure (or at least very similar plan diff) was
>>already mentioned here:
>>
>>https://www.postgresql.org/message-id/17385.1520018...@sss.pgh.pa.us
>>
>>So I guess someone else already noticed, but I don't see the cause
>>identified in that thread.

Oh. Sorry, I didn't recognise that as the same thing, from the title.
It doesn't seem to be related to the number of workers launched at all...
it looks more like the tuple queue is misbehaving. Though I haven't got
any proof of anything yet.

> Robert and I started discussing it a bit over IM. No conclusion. Robert tried
> to reproduce locally, including disabling atomics, without luck.
>
> Can anybody reproduce locally?

I've seen it several times on Travis CI. (So I would normally have
been able to tell you about this problem before it was committed,
except that the email thread was too long and the mail archive app
cuts long threads off!) Will try on some different kinds of computers
that I have local control of... I suspect (knowing how we run it on
Travis CI) that being way overloaded might be helpful...

--
Thomas Munro
http://www.enterprisedb.com