Andres,

* Andres Freund (and...@anarazel.de) wrote:
> On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:
> > On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov
> > <a.korot...@postgrespro.ru> wrote:
> > > On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila <amit.kapil...@gmail.com> 
> > > wrote:
> > >>
> > >> Currently, txid_current and friends export a 64-bit format of
> > >> transaction id that is extended with an “epoch” counter so that it
> > >> will not wrap around during the life of an installation.   The epoch
> > >> value it uses is based on the epoch that is maintained by checkpoint
> > >> (aka only checkpoint increments it).
> > >>
> > >> Now if the epoch changes multiple times between two checkpoints
> > >> (practically the chances of this are bleak, but there is a theoretical
> > >> possibility), then won't the computation of xids go wrong?
> > >> Basically, it can give the same value of txid after wraparound if no
> > >> checkpoint occurs between the two calls to txid_current.
> > >
> > >
> > > AFAICS, yes, if the epoch changes multiple times between two checkpoints,
> > > then the computation will go wrong.  And it doesn't look like a purely
> > > theoretical possibility to me, because I think I know of a couple of
> > > instances on the edge of this...
> 
> I think it's not terribly likely in practice, due to the required WAL
> size. You need at least a commit record for each of 4 billion
> transactions. Each commit record is at least 24 bytes long, and in a
> non-artificial scenario you would additionally have a few hundred bytes
> of actual WAL content per transaction. So we're talking about a distance
> of at least 0.5-2TB within a single checkpoint here. Not impossible, but
> not likely either.

At the bottom end, with a 30-minute checkpoint, that's about 300MB/s.
Certainly quite a bit and we might have trouble getting there for other
reasons, but definitely something that can be accomplished with even a
single SSD these days.

> > Okay, it is quite strange that we haven't discovered this problem until
> > now.  I think we should do something to fix it.  One idea is that we
> > track epoch changes in shared memory (probably in the same data
> > structure (VariableCacheData) where we track nextXid).  We need to
> > increment it when the xid wraps around during xid allocation (in
> > GetNewTransactionId).  Also, we need to make it persistent, which
> > means we need to log it in the checkpoint xlog record and write a
> > separate xlog record for the epoch change.
> 
> I think it makes a fair bit of sense not to do the current crufty
> tracking of xid epochs. I don't really know how we got there, but it
> doesn't make terribly much sense. I don't think we need additional WAL
> logging though - we should be able to piggyback this onto the already
> existing clog logging.

Don't you mean xact logging? ;)

> I kinda wonder if we shouldn't just track nextXid as a 64-bit integer
> internally, instead of bothering with tracking the epoch
> separately. Then we can "just" truncate it in the cases where it's
> stored in space constrained places etc.

This sounds reasonable to me, at least, but I've not been in these
depths much.

Thanks!

Stephen
