It's definitely a relatively complex pattern. The query I sent you last 
time was minimal with respect to predicates (so removing any single one of the 
predicates turned it into a working query).
> Huh.  Ok well that's a lot more frequent than I thought.  Is it always the 
> same query?  Any chance you can get the plan?  Are there more things going on 
> on the server, like perhaps concurrent parallel queries?
I had this bug occurring while I was the only one working on the server. I 
checked that there was just one transaction holding a snapshot at all, and it 
was an autovacuum busy with a totally unrelated relation my colleague was 
working on.

The bug is indeed behaving like a ghost.
One child relation needed a few new rows to test a particular application a 
colleague of mine was working on. The insert triggered an autoanalyze, and the 
EXPLAIN output changed slightly. Besides row and cost estimates, the change is 
that the line
    Recheck Cond: (((COALESCE((fid)::bigint, fallback) ) >= 1) AND
        ((COALESCE((fid)::bigint, fallback) ) <= 1) AND
        (gid && '{853078,853080,853082}'::integer[]))
is now 
    Recheck Cond: ((gid && '{853078,853080,853082}'::integer[]) AND
        ((COALESCE((fid)::bigint, fallback) ) >= 1) AND
        ((COALESCE((fid)::bigint, fallback) ) <= 1))
and the error vanished.

I could try to hunt down another query by assembling seemingly random queries. 
I don't see a very clear pattern in the queries aborting with this error on 
our production servers. I'm not surprised that the bug is hard to chase on 
production servers; they are usually quite lively.

>If you're able to run a throwaway copy of your production database on another 
>system that you don't have to worry about crashing, you could just replace 
>ERROR with PANIC and run a high-speed loop of the query that crashed in 
>production, or something.  This might at least tell us whether it's reaching that 
>condition via something dereferencing a dsa_pointer or something manipulating 
>the segment lists while allocating/freeing.

I could take a backup and restore the relevant tables on a throwaway system. 
You are just suggesting that I replace line 728
    elog(FATAL,
         "dsa_allocate could not find %zu free pages", npages);
by
    elog(PANIC,
         "dsa_allocate could not find %zu free pages", npages);
correct? Just for my understanding: why would the shutdown of the whole 
instance create more helpful logging?
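
For the high-speed loop on the throwaway system I would probably use something 
along these lines. It is only a rough sketch in Python and rests on several 
assumptions: psql is on the PATH, the failing query is saved in a file, and the 
function names (psql_runner, loop_until_failure) are made up for illustration. 
It relies on psql returning a non-zero exit status when the script fails with 
ON_ERROR_STOP set or when the connection to the server is lost.

```python
import subprocess
from typing import Callable, Optional, Tuple


def psql_runner(dbname: str, sql_file: str) -> Callable[[], Optional[str]]:
    """Build a callable that runs sql_file once through psql.

    The callable returns None if psql exited with status 0, otherwise the
    text psql wrote to stderr. dbname and sql_file are placeholders.
    """
    def run_once() -> Optional[str]:
        proc = subprocess.run(
            ["psql", "-X", "-q", "-v", "ON_ERROR_STOP=1",
             "-d", dbname, "-f", sql_file],
            stdout=subprocess.DEVNULL,   # query results are not interesting here
            stderr=subprocess.PIPE,
            text=True,
        )
        return None if proc.returncode == 0 else proc.stderr
    return run_once


def loop_until_failure(run_once: Callable[[], Optional[str]],
                       max_iterations: int) -> Optional[Tuple[int, str]]:
    """Call run_once up to max_iterations times.

    Returns (iteration, message) for the first failing run (1-based),
    or None if every run succeeded.
    """
    for i in range(1, max_iterations + 1):
        message = run_once()
        if message is not None:
            return i, message
    return None
```

Calling loop_until_failure(psql_runner("throwaway", "failing_query.sql"), 100000) 
would then stop at the first run that errors out or loses its connection; both 
names passed to psql_runner are placeholders.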

All the best
Arne
