[HACKERS] txid failed epoch increment, again, aka 6291

Daniel Farina Tue, 04 Sep 2012 02:08:42 -0700

It seems like this has reproduced once more.  And once again, there
doesn't appear to be any funny business in pg_control (but the structs
are pasted here for your re-check), and there are successful, sensible
updates to it.  The primary is running 9.0.6.


However, we do have a new piece of data: there was a very brief period
where txid_snapshot did report an xmin greater than 2^33, our next
epoch boundary, by a few thousand transactions.  That could be because
the reporting function GetNextXidAndEpoch does its own epoch
calculation before the checkpoint; then, after a checkpoint that
forgets to increment the epoch, it no longer has any reason to adjust
the epoch after the fact.
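
To make that hypothesis concrete, here is a minimal standalone sketch of
the interaction as I understand it.  It is a model, not the server code:
the struct and function names (CkptXidState, reported_epoch,
checkpoint_correct, checkpoint_missed, txid_from) are made up for
illustration, and the "missed" variant simply assumes the increment is
skipped without explaining why that would happen.

```c
#include <stdint.h>

/* Hypothetical stand-in for the (epoch, xid) pair saved at a checkpoint,
 * i.e. the ckptXidEpoch/ckptXid and nextXidEpoch/nextXid fields in the dumps. */
typedef struct
{
    uint32_t epoch;
    uint32_t xid;
} CkptXidState;

/* Reader side: models the inference attributed to GetNextXidAndEpoch.
 * The next XID is logically later than the checkpoint's XID, so if it is
 * numerically smaller it must have wrapped into the next epoch. */
static uint32_t
reported_epoch(CkptXidState ckpt, uint32_t next_xid)
{
    uint32_t epoch = ckpt.epoch;

    if (next_xid < ckpt.xid)
        epoch++;
    return epoch;
}

/* Checkpoint side, expected behaviour: carry the epoch forward and bump it
 * if the XID counter wrapped since the previous checkpoint. */
static CkptXidState
checkpoint_correct(CkptXidState prev, uint32_t next_xid)
{
    CkptXidState next = { prev.epoch, next_xid };

    if (next_xid < prev.xid)
        next.epoch++;
    return next;
}

/* Checkpoint side, hypothesized failure: the new XID is stored but the
 * epoch bump is lost.  This is an assumption, not observed code. */
static CkptXidState
checkpoint_missed(CkptXidState prev, uint32_t next_xid)
{
    CkptXidState next = { prev.epoch, next_xid };

    return next;
}

/* Combine epoch and 32-bit XID into the 64-bit value that txid exposes. */
static uint64_t
txid_from(uint32_t epoch, uint32_t xid)
{
    return ((uint64_t) epoch << 32) | xid;
}
```

With a pre-wraparound checkpoint at epoch 1 and a few thousand XIDs
consumed after the wrap, reported_epoch returns 2 and the txid is just
above 2^33, matching the brief window described above.  Once
checkpoint_missed stores the small post-wrap XID with epoch still 1, the
reader's comparison no longer fires and the reported epoch falls back to
1, which is consistent with ckptXidEpoch = 1, ckptXid = 9904084 in the
dump.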

I've been reviewing the mechanism in CreateCheckPoint for this on and
off for a couple of days, but so far I haven't come up with a
convincing explanation. However, given that it seems historically that
this bug is more likely to surface than *not* surface on this system,
perhaps we can try for a sometimes-reproducing test case. I'm still
struggling for a hint of a solution, though, so toss your thoughts
here.

$2 = {Insert = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
Flush = {xlogid = 0, xrecoff = 0}}, PrevRecord = {xlogid = 0,
      xrecoff = 0}, curridx = 0, currpage = 0x7ff4ed04a000, currpos =
0x0, RedoRecPtr = {xlogid = 18751, xrecoff = 1200832888},
    forcePageWrites = 0 '\000'}, LogwrtRqst = {Write = {xlogid = 0,
xrecoff = 0}, Flush = {xlogid = 0, xrecoff = 0}}, LogwrtResult = {
    Write = {xlogid = 0, xrecoff = 0}, Flush = {xlogid = 0, xrecoff =
0}}, ckptXidEpoch = 1, ckptXid = 9904084, asyncXactLSN = {
    xlogid = 0, xrecoff = 0}, lastRemovedLog = 0, lastRemovedSeg = 0,
Write = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
      Flush = {xlogid = 0, xrecoff = 0}}, curridx = 0,
lastSegSwitchTime = 0}, pages = 0x7ff4ed04a000 "", xlblocks =
0x7ff4ed0471d8,
  XLogCacheBlck = 1023, ThisTimeLineID = 0, RecoveryTargetTLI = 6,
archiveCleanupCommand = '\000' <repeats 1023 times>,
  SharedRecoveryInProgress = 1 '\001', lastCheckPointRecPtr = {xlogid
= 18751, xrecoff = 1671519088}, lastCheckPoint = {redo = {
      xlogid = 18751, xrecoff = 1200832888}, ThisTimeLineID = 6,
nextXidEpoch = 1, nextXid = 9904084, nextOid = 2047524,
    nextMulti = 1119, nextMultiOffset = 3115, oldestXid = 4115479553,
oldestXidDB = 1, time = 1346746796, oldestActiveXid = 9776547},
  replayEndRecPtr = {xlogid = 18751, xrecoff = 1748623656},
recoveryLastRecPtr = {xlogid = 18751, xrecoff = 1748623656},
  recoveryLastXTime = 400062234671833, info_lck = 0 '\000'}
(gdb) p ControlFile
$3 = (ControlFileData *) 0x7ff4ed046bf8
(gdb) p *ControlFile
$4 = {system_identifier = 5613733157253676693, pg_control_version =
903, catalog_version_no = 201008051,
  state = DB_IN_ARCHIVE_RECOVERY, time = 1346746898, checkPoint =
{xlogid = 18751, xrecoff = 1072693824}, prevCheckPoint = {
    xlogid = 18751, xrecoff = 1072693824}, checkPointCopy = {redo =
{xlogid = 18751, xrecoff = 602482536}, ThisTimeLineID = 6,
    nextXidEpoch = 1, nextXid = 9904084, nextOid = 2047524, nextMulti
= 1119, nextMultiOffset = 3115, oldestXid = 4115479553,
    oldestXidDB = 1, time = 1346746496, oldestActiveXid = 9558248},
minRecoveryPoint = {xlogid = 18751, xrecoff = 1748623656},
  backupStartPoint = {xlogid = 0, xrecoff = 0}, wal_level = 2,
MaxConnections = 500, max_prepared_xacts = 500, max_locks_per_xact =
64,
  maxAlign = 8, floatFormat = 1234567, blcksz = 8192, relseg_size =
131072, xlog_blcksz = 8192, xlog_seg_size = 16777216,
  nameDataLen = 64, indexMaxKeys = 32, toast_max_chunk_size = 1996,
enableIntTimes = 1 '\001', float4ByVal = 1 '\001',
  float8ByVal = 1 '\001', crc = 3725972657}

-- 
fdr

