Re: [HACKERS] [REVIEW] pg_last_xact_insert_timestamp

Greg Smith Sat, 10 Dec 2011 04:30:22 -0800

On 10/02/2011 07:10 AM, Robert Haas wrote:

Your proposals involve sending additional information from the master to
the slave, but the slave already knows both its WAL position and the
timestamp of the transaction it has most recently replayed, because
the startup process on the slave tracks that information and publishes
it in shared memory.  On the master, however, only the WAL position is
centrally tracked; the transaction timestamps are not.

This seems to be the question that was never really answered well enough to satisfy anyone, so let's rewind to here for a bit. I wasn't following this closely until now, so I just did my own review from scratch against the latest patch. I found a few issues, and I don't think all of them have been vented here yet:

-It adds overhead at every commit, even for people who aren't using it. Probably not enough to matter, but it's yet another thing going through the often-maligned-as-too-heavy pgstat system, often.

-In order to measure lag this way, you need access to both the master and the standby. Yuck, dblink or application code doing timestamp math, either idea makes me shudder. It would be nice to answer "how many seconds of lag do I have?" directly from the standby. Ideally I would like both the master and standby to know those numbers.

-After the server is restarted, you get a hole in the monitoring data until the first transaction is committed or aborted. The way the existing pg_last_xact_replay_timestamp side of this computation goes NULL for some unpredictable period after restart is going to drive monitoring systems batty. Building this new facility similarly will force everyone who writes a lag monitor to special case for that condition on both sides. Sure, that's less user hostile than the status quo, but it isn't going to help PostgreSQL's battered reputation in the area of monitoring either.

-The transaction ID advancing is not a very good proxy for real-world lag. You can have a very large amount of writing in between commits. The existing lag monitoring possibilities focus on WAL position instead, which is better correlated with the sort of activity that causes lag. Making one measurement of lag WAL position based (the existing ones) and another based on transaction advance (this proposal) is bound to raise some "which of these is the correct lag?" questions, when the two diverge. Large data loading operations come to mind as a not unusual at all situation where this would happen.

I'm normally a fan of building the simplest thing that will do something useful, and this patch succeeds there. But the best data to collect needs to have a timestamp that keeps moving forward in a way that correlates reasonably well to the system WAL load. Using the transaction ID doesn't do that. Simon did some hand waving around sending a timestamp every checkpoint. That would allow the standby to compute its own lag, limit overhead to something much lighter than per-transaction, and better track write volume. There could still be a bigger than normal discontinuity after server restart, if the server was down a while, but at least there wouldn't ever be a point where the value was returned by the master but was NULL.

But as Simon mentioned in passing, it will bloat the WAL streams for everyone. Here's the as yet uncoded mini-proposal that seems to have slid by uncommented upon:

"We can send regular special messages from WALSender to WALReceiver that do not form part of the WAL stream, so we don't bulk up WAL archives. (i.e. don't use "w" messages)."

Here's my understanding of how this would work. Each time a commit/abort record appears in the WAL, that updates XLogCtl with the associated timestamp. If WALReceiver received regular updates containing the master's clock timestamp and stored them similarly--either via regular streaming or the heartbeat--it could compute lag with the same resolution as this patch aims to do, for the price of two spinlocks: just subtract the two timestamps. No overhead on the master, and lag can be directly computed and queried from each standby. If you want to feed that data back to the master so it can appear in pg_stat_replication in both places, send it periodically via the same channel sync rep and standby feedback use. I believe that will be cheap in many cases, since it can piggyback on messages that will still be quite small relative to minimum packet size on most networks. (The exception for compressed/encrypted networks, where packets aren't discrete in this way, doesn't seem that relevant; if you're doing one of those, I would think this overhead is the least of your worries.)
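To make the standby-side arithmetic concrete, here is a minimal sketch in Python rather than backend C. Every name in it (StandbyClockState and its fields and methods) is made up for illustration; nothing here is an existing PostgreSQL structure or API. It assumes the standby remembers the timestamp of the last replayed commit/abort record and the last master clock timestamp it was sent, and reports lag as the difference.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class StandbyClockState:
    """Hypothetical stand-in for the two timestamps kept on the standby."""

    # Timestamp of the last commit/abort record replayed from the WAL.
    last_replayed_xact_time: Optional[datetime] = None
    # Last master clock timestamp received from the WAL sender.
    last_master_time: Optional[datetime] = None

    def on_replay_commit_or_abort(self, record_time: datetime) -> None:
        self.last_replayed_xact_time = record_time

    def on_master_timestamp(self, master_time: datetime) -> None:
        self.last_master_time = master_time

    def lag(self) -> Optional[timedelta]:
        # Before either value has arrived there is nothing to subtract.
        if self.last_replayed_xact_time is None or self.last_master_time is None:
            return None
        # "Just subtract the two timestamps."
        return self.last_master_time - self.last_replayed_xact_time


state = StandbyClockState()
state.on_replay_commit_or_abort(datetime(2011, 12, 10, 4, 30, 0))
state.on_master_timestamp(datetime(2011, 12, 10, 4, 30, 5))
print(state.lag())
```

Both timestamps originate on the master's clock in this scheme, so the subtraction does not depend on the two servers' clocks agreeing.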

Now, there's still one problem here. This doesn't address the "lots of write volume but no commit" problem any better than the proposed patch does. Maybe there's some fancy way to inject it along with the received WAL on the standby, I'm out of brain power to think through that right now. One way to force this to go away is to add a "time update" WAL record type here, one that only appears in some of these unusual situations. My first idea for keeping that overhead small is to put the injection logic for it at WAL file switch time. If you haven't committed a transaction during that entire 16MB of writes, start the new WAL segment with a little time update record. That would ensure you never do too many writes before the WAL replay clock is updated, while avoiding this record type altogether during commit-heavy periods.
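The segment-switch rule can be sketched as a toy model. This is Python, not backend code, and the class, the record kinds, and the "time_update" name are all hypothetical. It tracks whether a commit/abort was written in the current 16MB segment and, at switch time, starts the new segment with a time update record only if none was. The sketch ignores the size of the time update record itself.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # 16MB segments, as in the text


class WalWriterSketch:
    """Toy model of the proposed injection logic; not real WAL code."""

    def __init__(self):
        self.bytes_in_segment = 0
        self.commit_seen_in_segment = False
        self.records = []  # (kind, timestamp or None), in write order

    def write(self, kind, size, now):
        # Switch segments when the record would not fit in the current one.
        if self.bytes_in_segment + size > WAL_SEGMENT_SIZE:
            self._switch_segment(now)
        is_xact_end = kind in ("commit", "abort")
        # Only commit/abort records carry a timestamp in this model.
        self.records.append((kind, now if is_xact_end else None))
        self.bytes_in_segment += size
        if is_xact_end:
            self.commit_seen_in_segment = True

    def _switch_segment(self, now):
        had_commit = self.commit_seen_in_segment
        self.bytes_in_segment = 0
        self.commit_seen_in_segment = False
        # A whole segment with no commit/abort: start the new segment
        # with a time update so the replay clock still advances.
        if not had_commit:
            self.records.append(("time_update", now))


writer = WalWriterSketch()
for t in (1.0, 2.0, 3.0):
    writer.write("data", 8 * 1024 * 1024, now=t)  # bulk load, no commits
print([kind for kind, _ in writer.records])
```

In the bulk-load run above, the third 8MB write forces a segment switch after a commit-free segment, so a time update record is placed ahead of it. A segment containing at least one commit or abort produces no extra record.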

The WAL sender and receiver pair should be able to work out how to handle the other corner case, where neither WAL advance nor transactions occurred. You don't want lag to keep increasing forever there. If the transaction ID hasn't moved forward, the heartbeat can still update the time. I think if you do that, all you'd need to special case is that if master XID = standby replay XID, lag time should be forced to 0 instead of being its usual value (master timestamp - last standby commit/abort record).
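A small sketch of that special case, again in Python with made-up names (replay_lag and its parameters are illustrative, not an existing function). It returns zero when the standby has replayed up to the master's XID, and otherwise the usual difference. Checking for missing timestamps before the XID comparison is a choice made for this sketch, not something the proposal specifies.

```python
from datetime import datetime, timedelta
from typing import Optional


def replay_lag(
    master_xid: int,
    standby_replay_xid: int,
    master_time: Optional[datetime],
    last_replayed_record_time: Optional[datetime],
) -> Optional[timedelta]:
    """Hypothetical lag calculation with the idle-system special case."""
    # No timestamps yet: lag is unknown.
    if master_time is None or last_replayed_record_time is None:
        return None
    # Standby has replayed everything the master has done, so an idle
    # system reports zero instead of an ever-growing interval.
    if master_xid == standby_replay_xid:
        return timedelta(0)
    # Usual value: master timestamp - last standby commit/abort record.
    return master_time - last_replayed_record_time


heartbeat = datetime(2011, 12, 10, 5, 0, 0)    # heartbeat keeps advancing
last_commit = datetime(2011, 12, 10, 4, 30, 0)  # last replayed commit
print(replay_lag(1000, 1000, heartbeat, last_commit))  # idle and caught up
print(replay_lag(1005, 1000, heartbeat, last_commit))  # standby behind
```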

As for the "what do you return if asked for lag before the first data with a timestamp appears?" problem, there are surely still cases where that happens in this approach. I'm less concerned about that if there's only a single function involved though. The part that worries me is the high probability people are going to do NULL math wrong if they have to combine two values from different servers, and not catch it during development. If I had a penny for every NULL handling mistake ever made in that category, I'd be writing this from my island lair instead of Baltimore. I'm more comfortable saying "this lag interval might be NULL"; if that's the only exception people have to worry about, that doesn't stress me as much.
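From the monitoring side, a single nullable lag value leaves exactly one extra state to handle. A minimal sketch, with a hypothetical classify_lag helper and an arbitrary threshold, where None stands in for the SQL NULL:

```python
from datetime import timedelta
from typing import Optional


def classify_lag(lag: Optional[timedelta], warn_after: timedelta) -> str:
    """Hypothetical monitor check for one nullable lag interval."""
    # The only exception to remember: no timestamped data seen yet.
    if lag is None:
        return "unknown"
    return "lagging" if lag > warn_after else "ok"


threshold = timedelta(seconds=30)  # arbitrary value for the example
print(classify_lag(None, threshold))
print(classify_lag(timedelta(seconds=5), threshold))
print(classify_lag(timedelta(minutes=2), threshold))
```

Reporting "unknown" as its own state, rather than folding it into "ok", keeps a missing value from silently passing the check.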

Last disclaimer: if this argument is persuasive enough to punt pg_last_xact_insert_timestamp for now, in favor of a new specification along the lines I've outlined here, I have no intention of letting myself or Simon end up blocking progress on that. This one is important enough for 9.2 that if there's not another patch offering the same feature sitting in the queue by the end of the month, I'll ask someone else here to sit on this problem a while. Probably Jaime, because he's suffering with this problem as much as I am. We maintain code in repmgr to do this job with brute force: it saves a history table to translate XID->timestamps and works out lag from there. Window function query + constantly churning monitoring table = high overhead. It would really be great to EOL that whole messy section as deprecated starting in 9.2.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
