> On Aug 9, 2022, at 7:26 PM, Andres Freund <and...@anarazel.de> wrote:
> 
> The relevant code triggering it:
> 
>       newbuf = XLogInitBufferForRedo(record, 1);
>       _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
>                                 xlrec->new_bucket_flag, true);
>       if (!IsBufferCleanupOK(newbuf))
>               elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
> 
> Why do we just crash if we don't already have a cleanup lock? That can't be
> right. Or is there supposed to be a guarantee this can't happen?

Perhaps the code assumes that when the xl_hash_split_allocate_page record was 
written, the new_bucket field referred to an unused page, so during replay it 
should also refer to an unused page, and, being unused, nobody should have it 
pinned.  But at least in heap we sometimes pin unused pages just long enough to 
examine them and see that they are unused.  Maybe something like that is 
happening here?

I'd be curious to see the count returned by 
BUF_STATE_GET_REFCOUNT(LockBufHdr(newbuf)) right before this panic.  If it's 
just 1, then it's not another backend but our own pin, and we'd want to debug 
why we're pinning the same page twice (or more) while replaying WAL.  
Otherwise, maybe it's a race condition with some other process that transiently 
pins the buffer and occasionally causes this code to panic?
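Something along these lines could report that count; this is only a hypothetical 
diagnostic sketch against PostgreSQL's bufmgr internals (GetBufferDescriptor, 
LockBufHdr, BUF_STATE_GET_REFCOUNT), not a proposed patch:

```c
if (!IsBufferCleanupOK(newbuf))
{
    /* Peek at the pin count under the buffer header spinlock before
     * panicking, so the log tells us whether the extra pin is ours
     * (refcount == 1) or some other backend's (refcount > 1). */
    BufferDesc *bufHdr = GetBufferDescriptor(newbuf - 1);
    uint32      buf_state = LockBufHdr(bufHdr);
    int         refcount = BUF_STATE_GET_REFCOUNT(buf_state);

    UnlockBufHdr(bufHdr, buf_state);
    elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock (refcount %d)",
         refcount);
}
```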

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
