On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> We are starting recovery at the right place but we are initialising
> the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
> being derived later than it should be, which can cause problems if
> this then means that whole pages are uninitialised in subtrans. The bug
> only shows up if you do enough transactions (2048 is always enough) to
> move to the next subtrans page between the redo pointer and the
> checkpoint record while at the same time we do not have a long running
> transaction that spans those two points. That's just enough to happen
> reasonably frequently on busy systems and yet just enough to have
> slipped through testing.
>
> We must either
>
> 1. During CreateCheckpoint() we should derive oldestActiveXid before
> we derive the redo location
>
> 2. Change the way subtrans pages are initialized during recovery so we
> don't rely on oldestActiveXid
>
> I need to think some more before a decision on this in my own mind,
> but I lean towards doing (1) as a longer term fix and doing (2) as a
> short term fix for existing releases. I expect to have a fix later
> today.

(1) looks the best way forwards in all cases.

Patch attached. Will be backpatched to 9.0.

I think it is possible to avoid taking XidGenLock during
GetRunningTransactions() now, but I haven't included that change in
this patch.

Any other comments before commit?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
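
For readers following the page arithmetic in the quoted description, here is a
rough standalone sketch of why 2048 transactions are always enough to cross a
subtrans page boundary, and why deriving oldestActiveXid too late can leave a
page uncovered. It is not the attached patch and not PostgreSQL source. The
names xid_to_page, page_is_initialised and demo are hypothetical, and the
sketch assumes the default 8 kB block size, 4-byte transaction ids, no XID
wraparound, and that recovery initialises the pages from the one holding
oldestActiveXid up to the one holding the next XID.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: default 8 kB block, one 4-byte parent xid per subtrans slot. */
#define BLOCK_SIZE 8192u
#define XACTS_PER_PAGE ((uint32_t) (BLOCK_SIZE / sizeof(uint32_t)))  /* 2048 */

/* Hypothetical helper: which subtrans page holds the slot for this xid. */
uint32_t xid_to_page(uint32_t xid)
{
    return xid / XACTS_PER_PAGE;
}

/*
 * Hypothetical model of startup: pages from page(oldest_active_xid) to
 * page(next_xid) are initialised. Returns 1 if xid's page is in that range.
 * Wraparound is deliberately ignored in this sketch.
 */
int page_is_initialised(uint32_t xid, uint32_t oldest_active_xid,
                        uint32_t next_xid)
{
    uint32_t page = xid_to_page(xid);

    return page >= xid_to_page(oldest_active_xid) &&
           page <= xid_to_page(next_xid);
}

/* Compare deriving oldestActiveXid at the redo pointer vs. at the record. */
void demo(void)
{
    uint32_t xid_at_redo = 4000;                         /* on page 1 */
    uint32_t xid_at_record = xid_at_redo + XACTS_PER_PAGE; /* on page 2 */
    uint32_t next_xid = xid_at_record + 10;

    /* Replay starts at the redo pointer, so xid_at_redo may be touched. */
    printf("derived early: covered=%d\n",
           page_is_initialised(xid_at_redo, xid_at_redo, next_xid));
    printf("derived late:  covered=%d\n",
           page_is_initialised(xid_at_redo, xid_at_record, next_xid));
}
```

Adding 2048 to any xid always moves it to the next page under these
assumptions, which matches the "2048 is always enough" remark; fewer
transactions can also cross a boundary, depending on where the redo pointer
falls within the page.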
oldestActiveXid_fixed.v1.patch
Description: Binary data