I'd like to add some new flag bits to XLogRecord (xlog.h). Where? xl_prev.
xl_prev is an XLogRecPtr which points backwards to the immediately preceding WAL record. All of the bits are currently used, but I have some observations and a proposal to change that.

We currently compare the whole xl_prev value against the whole XLogRecPtr of the last WAL record. When we are reading back WAL, if a WAL record is valid the xlogid portion of the value seldom differs by more than +1 from the pointer of the current record, since that would imply an xlog record of more than 4GB. If it is incorrect, it will either be garbage or occasionally be a previously valid value, but from two prior checkpoints back, before this file was reused.

So we probably don't need to compare the whole of xl_prev against the whole of the last WAL record pointer. We can probably avoid comparing some of the high bits, since the range of valid values is so limited.

How many bits? checkpoint_segments is limited to INT_MAX, which means the xlogid increase of a single checkpoint is always at most INT_MAX/255. That means that the xl_prev value cannot differ by more than 2 * INT_MAX/255 across two checkpoints. (I make that 134 Petabytes.) Alternatively, checkpoint_timeout is at most one hour, so we're OK until systems can write WAL at 67 Petabytes/hour.

Which means if

* we never get WAL records of more than 67 Petabytes in size *and*
* the lowest 25 bits of xl_prev.xlogid do not match the position of the last WAL record

then the XLogRecord is invalid, no matter what the value of the highest 7 bits of xl_prev.xlogid.

So I would like to propose that we ignore the top 4 bits in xl_prev.xlogid when comparing values, rather than using all 32 bits for comparison. That then frees up 4 new flag bits on XLogRecords.

Changing xl_prev handling is only required in 3 places, all in xlog.c, plus some log outputs. I would simply document the limitation of WAL record sizes. Putting code in for that would be pointless since the test would last years on current systems.
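
As a rough sketch of what the masked comparison might look like (this is not actual xlog.c code: the XL_PREV_* macros and the function names are made up for illustration, and XLogRecPtr is reduced here to a stand-in with just its two 32-bit fields):

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for XLogRecPtr: log file id plus byte offset. */
typedef struct XLogRecPtr
{
    uint32_t xlogid;
    uint32_t xrecoff;
} XLogRecPtr;

/* Hypothetical names: the top 4 bits of xl_prev.xlogid become flags. */
#define XL_PREV_FLAG_BITS   4
#define XL_PREV_FLAG_SHIFT  (32 - XL_PREV_FLAG_BITS)
#define XL_PREV_XLOGID_MASK ((UINT32_C(1) << XL_PREV_FLAG_SHIFT) - 1)

/* Compare xl_prev with the last record pointer, ignoring the flag bits. */
static inline bool
xl_prev_matches(XLogRecPtr xl_prev, XLogRecPtr last)
{
    return ((xl_prev.xlogid ^ last.xlogid) & XL_PREV_XLOGID_MASK) == 0 &&
           xl_prev.xrecoff == last.xrecoff;
}

/* Extract the 4 flag bits stored in the top of xl_prev.xlogid. */
static inline uint32_t
xl_prev_get_flags(XLogRecPtr xl_prev)
{
    return xl_prev.xlogid >> XL_PREV_FLAG_SHIFT;
}

/* Return a copy of ptr with the flag bits replaced by flags. */
static inline XLogRecPtr
xl_prev_set_flags(XLogRecPtr ptr, uint32_t flags)
{
    uint32_t flag_mask = (UINT32_C(1) << XL_PREV_FLAG_BITS) - 1;

    ptr.xlogid = (ptr.xlogid & XL_PREV_XLOGID_MASK) |
                 ((flags & flag_mask) << XL_PREV_FLAG_SHIFT);
    return ptr;
}

/* Bits needed to hold an xlogid distance of 2 * INT_MAX/255. */
static inline unsigned
xlogid_bits_needed(void)
{
    uint32_t span = 2u * (uint32_t) (INT_MAX / 255);
    unsigned bits = 0;

    while (span != 0)
    {
        bits++;
        span >>= 1;
    }
    return bits;
}
```

With these definitions, a record written with flags 0x9 on top of xlogid 0x12 stores 0x90000012 in xl_prev.xlogid and still matches a last-record pointer whose xlogid is 0x12. xlogid_bits_needed() just redoes the arithmetic above: the two-checkpoint span fits in 25 bits, which leaves room to spare below the 28 bits that are still compared.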
(For that test we wouldn't need dtrace to measure the WALInsertLock hold time, we could use tree rings. :-)

These values would vary if we allow XLOG_SEG_SIZE higher than 16MB, but we should probably limit checkpoint_segments according to the setting of XLOG_SEG_SIZE anyhow.

Thoughts?

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support