On Tue, 8 Jun 2021 at 13:03, Justin Pryzby <pry...@telsasoft.com> wrote:
>
> On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby <pry...@telsasoft.com> wrote:
> > > I'll leave the instance running for a little bit before restarting (or 
> > > kill-9)
> > > in case someone requests more info.
> >
> > How about dumping the page image out, and sharing it with the list?
> > This procedure should work fine from gdb:
> >
> > https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB
>
> > I suggest that you dump the "page" pointer inside lazy_scan_prune(). I
> > imagine that you have the instance already stuck in an infinite loop,
> > so what we'll probably see from the page image is the page after the
> > first prune and another no-progress prune.
>
> The cluster was again rejecting with "too many clients already".
>
> I was able to open a shell this time, but it immediately froze when I tried to
> tab complete "pg_stat_acti"...
>
> I was able to dump the page image, though - attached.  I can send you its
> "data" privately, if desirable.  I'll also try to step through this.

Could you attach a dump of lazy_scan_prune's vacrel, the global
visibility states (GlobalVisCatalogRels, and possibly
GlobalVisSharedRels, GlobalVisDataRels and GlobalVisTempRels), and
heap_page_prune's PruneState?

Additionally, the locals of lazy_scan_prune (more specifically, the
'offnum' at the point where it enters heap_page_prune) would also be
appreciated, as they help identify the problematic tuple.
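For completeness, something along these lines in gdb should capture all
of that. This is only a sketch: the frame numbers are hypothetical and
depend on your build, and the GlobalVis* variables need debug symbols
for procarray.c to be printable by name.

```
(gdb) bt                      # locate the lazy_scan_prune / heap_page_prune frames
(gdb) frame 2                 # substitute the actual lazy_scan_prune frame number
(gdb) print *vacrel
(gdb) print offnum
(gdb) print GlobalVisCatalogRels
(gdb) print GlobalVisSharedRels
(gdb) print GlobalVisDataRels
(gdb) print GlobalVisTempRels
(gdb) frame 1                 # substitute the heap_page_prune frame, if still on the stack
(gdb) print prstate
```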

I've been looking into what might have caused this, but I'm currently
stuck for lack of information on GlobalVisCatalogRels and the
PruneState.

One curiosity I did notice is that the t_xmax of the problematic
tuples is exactly one less than OldestXmin. Not necessarily wrong, but
a curiosity.


With regards,

Matthias van de Meent.


PS. Attached a few of my current research notes, which are mainly
comparisons between heap_prune_satisfies_vacuum and
HeapTupleSatisfiesVacuum.
# Analysis of what can happen

In heap_prune_chain, heap_prune_satisfies_vacuum (HPSV) is used for visibility 
checks instead of HeapTupleSatisfiesVacuum (HTSV). Both functions use 
HeapTupleSatisfiesVacuumHorizon (HTSVH), but they differ in one behaviour: the 
handling of HEAPTUPLE_RECENTLY_DEAD.

More specifically, when HTSVH returns RECENTLY_DEAD, HTSV will return DEAD when 
the dead_after result from HTSVH precedes vacrel->OldestXmin (an XID 
comparison). HPSV, however, will return DEAD in either of two cases:
 - when dead_after precedes prstate->old_snap_xmin (but only when 
OldSnapshotThresholdActive(), so presumably not here; an XID comparison)
 - when dead_after is removable according to GlobalVisTestIsRemovableXid 
(using the GlobalVisState applicable for that relation, in this case 
GlobalVisCatalogRels; an FXID comparison)

GlobalVisTestIsRemovableXid returns true when the FXID of the tuple (generated 
relative to GlobalVisState->definitely_needed) precedes 
GlobalVisState->maybe_needed.

One more observation: GlobalVisState->maybe_needed is initially set from the 
same value that is later returned and stored in vacrel->OldestXmin.

