On 10/02/2011 07:10 AM, Robert Haas wrote:
Your proposals involve sending additional information from the master to
the slave, but the slave already knows both its WAL position and the
timestamp of the transaction it has most recently replayed, because
the startup process on the slave tracks that information and publishes
it in shared memory.  On the master, however, only the WAL position is
centrally tracked; the transaction timestamps are not.

This seems to be the question that was never really answered well enough to satisfy anyone, so let's rewind to here for a bit. I wasn't following this closely until now, so I just did my own review from scratch against the latest patch. I found a few issues, and I don't think all of them have been vented here yet:

-It adds overhead at every commit, even for people who aren't using it. Probably not enough to matter, but it's yet another thing going through the often maligned as too heavy pgstat system, often.

-In order to measure lag this way, you need access to both the master and the standby. Yuck, dblink or application code doing timestamp math, either idea makes me shudder. It would be nice to answer "how many seconds of lag do have?" directly from the standby. Ideally I would like both the master and standby to know those numbers.

-After the server is restarted, you get a hole in the monitoring data until the first transaction is committed or aborted. The way the existing pg_last_xact_replay_timestamp side of this computation goes NULL for some unpredictable period after restart is going to drive monitoring systems batty. Building this new facility similarly will force everyone who writes a lag monitor to special case for that condition on both sides. Sure, that's less user hostile than the status quo, but it isn't going to help PostgreSQL's battered reputation in the area of monitoring either.

-The transaction ID advancing is not a very good proxy for real-world lag. You can have a very large amount of writing in between commits. The existing lag monitoring possibilities focus on WAL position instead, which is better correlated with the sort of activity that causes lag. Making one measurement of lag WAL position based (the existing ones) and another based on transaction advance (this proposal) is bound to raise some "which of these is the correct lag?" questions, when the two diverge. Large data loading operations come to mind as a not unusual at all situation where this would happen.

I'm normally a fan of building the simplest thing that will do something useful, and this patch succeeds there. But the best data to collect needs to have a timestamp that keeps moving forward in a way that correlates reasonably well to the system WAL load. Using the transaction ID doesn't do that. Simon did some hand waving around sending a timestamp every checkpoint. That would allow the standby to compute its own lag, limit overhead to something much lighter than per-transaction, and better track write volume. There could still be a bigger than normal discontinuity after server restart, if the server was down a while, but at least there wouldn't ever be a point where the value was returned by the master but was NULL.

But as Simon mentioned in passing, it will bloat the WAL streams for everyone. Here's the as yet uncoded mini-proposal that seems to have slid by uncommented upon:

"We can send regular special messages from WALSender to WALReceiver that do not form part of the WAL stream, so we don't bulk
up WAL archives. (i.e. don't use "w" messages)."

Here's my understanding of how this would work. Each time a commit/abort record appears in the WAL, that updates XLogCtl with the associated timestamp. If WALReceiver received regular updates containing the master's clock timestamp and stored them similarly--either via regular streaming or the heartbeat--it could compute lag with the same resolution as this patch aims to do, for the price of two spinlocks: just subtract the two timestamps. No overhead on the master, and lag can be directly computed and queried from each standby. If you want to feed that data back to the master so it can appear in pg_stat_replication in both places, send it periodically via the same channel sync rep and standby feedback use. I believe that will be cheap in many cases, since it can piggyback on messages that will still be quite small relative to minimum packet size on most networks. (Exception for compressed/encrypted networks where packets aren't discrete in this way doesn't seem that relevant, presuming that if you're doing one of those I would think this overhead is the least of your worries)

Now, there's still one problem here. This doesn't address the "lots of write volume but no commit" problem any better than the proposed patch does. Maybe there's some fancy way to inject it along with the received WAL on the standby, I'm out of brain power to think through that right now. One way to force this to go away is to add a "time update" WAL record type here, one that only appears in some of these unusual situations. My first idea for keeping that overhead small is to put the injection logic for it at WAL file switch time. If you haven't committed a transaction during that entire 16MB of writes, start the new WAL segment with a little time update record. That would ensure you never do too many writes before the WAL replay clock is updated, while avoiding this record type altogether during commit-heavy periods.

The WAL sender and receiver pair should be able to work out how to handle the other corner case, where neither WAL advance nor transactions occured. You don't want lag to keep increasing forever there. If the transaction ID hasn't moved forward, the heartbeat can still update the time. I think if you do that, all you'd need to special case is that if master XID=standby replay XID, lag time should be forced to 0 instead of being its usual value (master timestamp - last standby commit/abort record).

As for "what do you return if asked for lag before the first data with a timestamp appears?" problem, there are surely still cases where that happens in this approach. I'm less concerned about that if there's only a single function involved though. The part that worries me is the high probability people are going to do NULL math wrong if they have to combine two values from different servers, and not catch it during development. If I had a penny for every NULL handling mistake ever made in that category, I'd be writing this from my island lair instead of Baltimore. I'm more comfortable saying "this lag interval might be NULL"; if that's the only exception people have to worry about, that doesn't stress me as much.

Last disclaimer: if this argument is persuasive enough to punt pg_last_xact_insert_timestamp for now, in favor of a new specification along the lines I've outlined here, I have no intention of letting myself or Simon end up blocking progress on that. This one is important enough for 9.2 that if there's not another patch offering the same feature sitting in the queue by the end of the month, I'll ask someone else here to sit on this problem a while. Probably Jaime, because he's suffering with this problem as much as I am. We maintain code in repmgr to do this job with brute force: it saves a history table to translate XID->timestamps and works out lag from there. Window function query + constantly churning monitoring table=high overhead. It would really be great to EOL that whole messy section as deprecated starting in 9.2.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to