On Fri, Feb 3, 2012 at 6:45 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> I have not gotten very far with the coredump, except to observe that
>> gdb says the Assert ought to have passed: ...
>> This suggests very strongly that indeed the buffer was changing under
>> us.
>
> I probably ought to let the test case run overnight before concluding
> anything, but at this point it's run for two-plus hours with no errors
> after applying this patch:
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index cce87a3..b128bfd 100644
> *** a/src/backend/access/transam/xlog.c
> --- b/src/backend/access/transam/xlog.c
> *************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec
> *** 3716,3724 ****
>                }
>                else
>                {
> -                       /* must zero-fill the hole */
> -                       MemSet((char *) page, 0, BLCKSZ);
>                        memcpy((char *) page, blk, bkpb.hole_offset);
>                        memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
>                                   blk + bkpb.hole_offset,
>                                   BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
> --- 3716,3724 ----
>                }
>                else
>                {
>                        memcpy((char *) page, blk, bkpb.hole_offset);
> +                       /* must zero-fill the hole */
>                        MemSet((char *) page + bkpb.hole_offset, 0, bkpb.hole_length);
>                        memcpy((char *) page + (bkpb.hole_offset + bkpb.hole_length),
>                                   blk + bkpb.hole_offset,
>                                   BLCKSZ - (bkpb.hole_offset + bkpb.hole_length));
>
>
> The existing code makes the page state transiently invalid (all zeroes)
> for no particularly good reason, and consumes useless cycles to do so,
> so this would be a good change in any case.  The reason it is relevant
> to our current problem is that even though RestoreBkpBlocks faithfully
> takes exclusive lock on the buffer, *that is not enough to guarantee
> that no one else is touching that buffer*.  Another backend that has
> already located a visible tuple on a page is entitled to keep accessing
> that tuple with only a buffer pin.  So the existing code transiently
> wipes the data from underneath the other backend's pin.
>
> It's clear how this explains the symptoms

Yes, that looks like the murder weapon.
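
For anyone reading along without the source open, here is a rough stand-alone
sketch of the order of operations in the patched branch. It is not the actual
xlog.c code: restore_page_with_hole is a hypothetical helper name, plain
memset stands in for MemSet, the BLCKSZ value of 8192 is assumed, and blk is
assumed to hold the page image with the hole removed
(BLCKSZ - hole_length bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLCKSZ 8192				/* assumed block size for this sketch */

/*
 * Hypothetical helper, for illustration only.  Rebuilds a full page from
 * a backup image that was stored with its "hole" of zero bytes removed.
 * The page is never zeroed as a whole; only the hole itself is cleared.
 */
static void
restore_page_with_hole(char *page, const char *blk,
					   size_t hole_offset, size_t hole_length)
{
	assert(hole_offset + hole_length <= BLCKSZ);

	/* bytes in front of the hole come straight from the backup image */
	memcpy(page, blk, hole_offset);

	/* zero-fill only the hole */
	memset(page + hole_offset, 0, hole_length);

	/* bytes behind the hole follow the front part in the backup image */
	memcpy(page + hole_offset + hole_length,
		   blk + hole_offset,
		   BLCKSZ - (hole_offset + hole_length));
}
```

With this ordering the bytes outside the hole are only ever overwritten by
the backup image, never by a transient run of zeroes.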

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
