On Wed, Aug 10, 2022 at 2:52 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Wed, Aug 10, 2022 at 10:58 AM Andres Freund <and...@anarazel.de> wrote: > > > > Hi, > > > > On 2022-08-09 20:21:19 -0700, Mark Dilger wrote: > > > > On Aug 9, 2022, at 7:26 PM, Andres Freund <and...@anarazel.de> wrote: > > > > > > > > The relevant code triggering it: > > > > > > > > newbuf = XLogInitBufferForRedo(record, 1); > > > > _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket, > > > > xlrec->new_bucket_flag, true); > > > > if (!IsBufferCleanupOK(newbuf)) > > > > elog(PANIC, "hash_xlog_split_allocate_page: failed to > > > > acquire cleanup lock"); > > > > > > > > Why do we just crash if we don't already have a cleanup lock? That > > > > can't be > > > > right. Or is there supposed to be a guarantee this can't happen? > > > > > > Perhaps the code assumes that when xl_hash_split_allocate_page record was > > > written, the new_bucket field referred to an unused page, and so during > > > replay it should also refer to an unused page, and being unused, that > > > nobody > > > will have it pinned. But at least in heap we sometimes pin unused pages > > > just long enough to examine them and to see that they are unused. Maybe > > > something like that is happening here? > > > > I don't think it's a safe assumption that nobody would hold a pin on such a > > page during recovery. While not the case here, somebody else could have used > > pg_prewarm to read it in. > > > > But also, the checkpointer or bgwriter could have it temporarily pinned, to > > write it out, or another backend could try to write it out as a victim > > buffer > > and have it temporarily pinned. > > > > > > static int > > SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext > > *wb_context) > > { > > ... > > /* > > * Pin it, share-lock it, write it. (FlushBuffer will do nothing > > if the > > * buffer is clean by the time we've locked it.) > > */ > > PinBuffer_Locked(bufHdr); > > LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED); > > > > > > As you can see we acquire a pin without holding a lock on the page (and that > > can't be changed!). > > > > I think this could be the probable reason for failure though I didn't > try to debug/reproduce this yet. AFAIU, this is possible during > recovery/replay of WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE as via > XLogReadBufferForRedoExtended, we can mark the buffer dirty while > restoring from full page image. OTOH, because during normal operation > we didn't mark the page dirty SyncOneBuffer would have skipped it due > to check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).
I'm trying to simulate the scenario in streaming replication using the below: CREATE TABLE pvactst (i INT, a INT[], p POINT) with (autovacuum_enabled = off); CREATE INDEX hash_pvactst ON pvactst USING hash (i); INSERT INTO pvactst SELECT i, array[1,2,3], point(i, i+1) FROM generate_series(1,1000) i; With the above scenario, it will be able to replay allocation of page for split operation. I will slightly change the above statements and try to debug and see if we can make the background writer process to pin this buffer and simulate the scenario. I will post my findings once I'm done with the analysis. Regards, Vignesh