Andres,

* Andres Freund (and...@anarazel.de) wrote:
> On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:
> > On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov
> > <a.korot...@postgrespro.ru> wrote:
> > > On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila
> > > <amit.kapil...@gmail.com> wrote:
> > >>
> > >> Currently, txid_current and friends export a 64-bit format of
> > >> transaction id that is extended with an "epoch" counter so that it
> > >> will not wrap around during the life of an installation. The epoch
> > >> value it uses is based on the epoch that is maintained by checkpoint
> > >> (aka only checkpoint increments it).
> > >>
> > >> Now if the epoch changes multiple times between two checkpoints
> > >> (practically the chances of this are bleak, but there is a
> > >> theoretical possibility), then won't the computation of xids go
> > >> wrong? Basically, it can give the same value of txid after
> > >> wraparound if a checkpoint doesn't occur between the two calls to
> > >> txid_current.
> > >
> > > AFAICS, yes, if the epoch changes multiple times between two
> > > checkpoints, then the computation will go wrong. And it doesn't look
> > > like a purely theoretical possibility to me, because I think I know
> > > a couple of instances on the edge of this...
>
> I think it's not terribly likely in practice, due to the required WAL
> size. You need at least a commit record for each of 4 billion
> transactions. Each commit record is at least 24 bytes long, and in a
> non-artificial scenario you would additionally have a few hundred bytes
> of actual WAL content per transaction. So we're talking about a
> distance of at least 0.5-2TB within a single checkpoint here. Not
> impossible, but not likely either.
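The back-of-envelope figures above can be checked with a quick calculation (a sketch; the 24-byte commit-record floor is from the message above, and the 128-500 byte per-transaction payload range is an illustrative reading of "a few hundred bytes"):

```python
# Back-of-envelope check of the WAL volume needed to wrap the 32-bit
# xid space within a single checkpoint (figures from the mail above).
XIDS = 2**32              # transactions needed for a full wraparound
COMMIT_RECORD = 24        # minimum commit record size, in bytes

floor_bytes = XIDS * COMMIT_RECORD
print(f"bare commit records: {floor_bytes / 2**30:.0f} GiB")

# With a few hundred bytes of real WAL content per transaction
# (128 and 500 bytes are illustrative assumptions):
low, high = XIDS * 128, XIDS * 500
print(f"realistic range: {low / 10**12:.1f}-{high / 10**12:.1f} TB")

# Sustained write rate needed to do that within a 30-minute
# checkpoint interval, taking the low end of the range:
print(f"required rate: {low / 1800 / 10**6:.0f} MB/s")
```

The commit records alone come to roughly 96 GiB; with realistic payloads the range lands at roughly 0.5-2TB, and the low end spread over 30 minutes works out to about 300MB/s, matching the figures in the thread.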
At the bottom end, with a 30-minute checkpoint, that's about 300MB/s.
Certainly quite a bit, and we might have trouble getting there for other
reasons, but definitely something that can be accomplished with even a
single SSD these days.

> > Okay, it is quite strange that we haven't discovered this problem
> > till now. I think we should do something to fix it. One idea is that
> > we track the epoch change in shared memory (probably in the same data
> > structure (VariableCacheData) where we track nextXid). We need to
> > increment it when the xid wraps around during xid allocation (in
> > GetNewTransactionId). Also, we need to make it persistent, which
> > means we need to log it in the checkpoint xlog record and we need to
> > write a separate xlog record for the epoch change.
>
> I think it makes a fair bit of sense to not do the current crufty
> tracking of xid epochs. I don't really know how we got there, but it
> doesn't make terribly much sense. I don't think we need additional WAL
> logging though - we should be able to piggyback this onto the already
> existing clog logging.

Don't you mean xact logging? ;)

> I kinda wonder if we shouldn't just track nextXid as a 64-bit integer
> internally, instead of bothering with tracking the epoch separately.
> Then we can "just" truncate it in the cases where it's stored in
> space-constrained places etc.

This sounds reasonable to me, at least, but I've not been in these
depths much.

Thanks!

Stephen
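For what it's worth, the difference between the two schemes can be modeled in a few lines (a toy sketch, not PostgreSQL code; all names here are hypothetical):

```python
# Toy model contrasting a checkpoint-tracked epoch with a single
# 64-bit transaction counter. Names are hypothetical, not PostgreSQL's.
XID_BITS = 32
XID_MASK = (1 << XID_BITS) - 1

def txid_with_checkpoint_epoch(xid32, checkpoint_epoch):
    # Roughly the current scheme: combine the 32-bit xid with the epoch
    # last recorded by a checkpoint. If the xid space wraps (even more
    # than once) with no intervening checkpoint, the epoch is stale.
    return (checkpoint_epoch << XID_BITS) | xid32

def txid_from_full(full_xid):
    # The suggested scheme: the counter is 64-bit to begin with, so
    # the exported value is simply the counter itself.
    return full_xid

# Scenario from the thread: the 32-bit space wraps between two
# checkpoints, so the checkpoint-recorded epoch stays stale at 4.
stale_epoch = 4
xid_before = 100
xid_after_wrap = 100   # same low 32 bits, one wraparound later

# Checkpoint-based epoch: both calls yield the same 64-bit txid -- the bug.
assert txid_with_checkpoint_epoch(xid_before, stale_epoch) == \
       txid_with_checkpoint_epoch(xid_after_wrap, stale_epoch)

# 64-bit counter: the values differ, because the wrap is part of the counter.
full_before = (4 << XID_BITS) | xid_before
full_after = (5 << XID_BITS) | xid_after_wrap
assert txid_from_full(full_before) != txid_from_full(full_after)

# Truncation for space-constrained places keeps only the low 32 bits.
assert full_after & XID_MASK == xid_after_wrap
```

The upshot is that with a single 64-bit counter the epoch is always derivable (it is just the high 32 bits), so no separate tracking or extra WAL logging is needed for it, and truncation is a simple mask where only 32 bits fit.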