Fujii Masao wrote:
"I'm thinking something like pg_standbys_xlog_location() [on the primary] which 
returns
one row per standby servers, showing pid of walsender, host name/
port number/user OID of the standby, the location where the standby
has written/flushed WAL. DBA can measure the gap from the
combination of pg_current_xlog_location() and pg_standbys_xlog_location()
via one query on the primary."

This function is useful but not essential for troubleshooting, I think.
So I'd like to postpone it.

Sure; in a functional system where primary and secondary are both up, you can assemble the info using the new functions you just added, so this other one is certainly optional. I just took a brief look at the code of the features you added, and it looks like it exposes the minimum necessary to make this whole thing possible to manage. I think it's OK if you postpone this other bit, more important stuff for you to work on.

So: the one piece of information I though was most important to expose here at an absolute minimum is there now. Good progress. The other popular request that keeps popping up here is providing an easy way to see how backlogged the archive_command is, to make it easier to monitor for out of disk errors that might prove catastrophic to replication.

I just spent some time looking through the WAL/archiving code in that context. It looks to me that that this information isn't really stored anywhere right now. The only thing that knows what segment is currently queued up to copy over is pgarch_ArchiverCopyLoop via its call to pgarch_readyXlog. Now, this is a pretty brute-force piece of code: it doesn't remember its previous work at all, it literally walks the archive_status directory looking for *.ready files that have names that look like xlog files, then returns the earliest. That unfortunately means that it's not even thinking in the same terms as all these other functions, which are driven by the xlog_location advancing, and then the filename is computed from that. All you've got is the filename at this point, and it's not even guaranteed to be real--you could easily fool this code if you dropped an inappropriately named file into that directory.

I could easily update this code path to save the name of the last archived file in memory while all this directory scanning is going on anyway, and then provide a UDF to expose that bit of information. The result would need to have documentation that disclaims it like this:

pg_last_archived_xlogfile() text: Get the name of the last file the archive_command [tried to|successfully] archived since the server was started. If archiving is disabled or no xlog files have become ready to archive since startup, a blank line will be returned. It is possible for this function to return a result that does not reflect an actual xlogfile if files are manually added to the server's archive_status directory.

I'd find this extremely handy as a hook for monitoring scripts that want to watch the server but don't have access to the filesystem directly, even given those limitations. I'd prefer to have the "tried to" version, because it will populate with the name of the troublesome file it's stuck on even if archiving never gets its first segment delivered.

I'd happily write a patch to handle all that if I thought it would be accepted. I fear that the whole approach will be considered a bit too hackish and get rejected on that basis though. Not really sure of a "right" way to handle this though. Anything better is going to be more complicated because it requires passing more information into the archiver, with little gain for that work beyond improving the quality of this diagnostic routine. And I think most people would find what I described above useful enough.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com

Reply via email to