On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <si...@2ndquadrant.com> wrote:

> We are starting recovery at the right place but we are initialising
> the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
> being derived later than it should be, which can cause problems if
> this then means that whole pages are unitialised in subtrans. The bug
> only shows up if you do enough transactions (2048 is always enough) to
> move to the next subtrans page between the redo pointer and the
> checkpoint record while at the same time we do not have a long running
> transaction that spans those two points. That's just enough to happen
> reasonably frequently on busy systems and yet just enough to have
> slipped through testing.
>
> We must either
>
> 1. During CreateCheckpoint() we should derive oldestActiveXid before
> we derive the redo location
>
> 2. Change the way subtrans pages are initialized during recovery so we
> don't rely on oldestActiveXid
>
> I need to think some more before a decision on this in my own mind,
> but I lean towards doing (1) as a longer term fix and doing (2) as a
> short term fix for existing releases. I expect to have a fix later
> today.

(1) looks the best way forwards in all cases.

Patch attached. Will be backpatched to 9.0

I think it is possible to avoid taking XidGenLock during
GetRunningTransactions() now, but I haven't included that change in
this patch.

Any other comments before commit?

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment: oldestActiveXid_fixed.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to