Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

Thomas Munro Sat, 03 Mar 2018 18:50:31 -0800

On Sun, Mar 4, 2018 at 3:40 PM, Andres Freund <[email protected]> wrote:
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <[email protected]> wrote:
>>On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>>                                   QUERY PLAN
>>>
>>--------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=98000 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9800 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>> --- 485,495 ----
>>>                                   QUERY PLAN
>>>
>>--------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=97984 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9798 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>>I think the same failure (or at least very similar plan diff) was
>>already mentioned here:
>>
>>https://www.postgresql.org/message-id/[email protected]
>>
>>So I guess someone else already noticed, but I don't see the cause
>>identified in that thread.


Oh.  Sorry, I didn't recognise that as the same thing from the title.
It doesn't seem to be related to the number of workers launched at
all... it looks more like the tuple queue is misbehaving.  Though I
haven't got any proof of anything yet.

> Robert and I started discussing it a bit over IM. No conclusion. Robert tried 
> to reproduce locally, including disabling atomics, without luck.
>
> Can anybody reproduce locally?

I've seen it several times on Travis CI.  (So I would normally have
been able to tell you about this problem before the change was
committed, except that the email thread was too long and the mail
archive app cuts long threads off!)  Will try on some different kinds
of computers that I have local control of...  I suspect (knowing how we
run it on Travis CI) that being way overloaded might be helpful...

-- 
Thomas Munro
http://www.enterprisedb.com
