Greg Smith wrote:
> Stefan Kaltenbrunner wrote:
>> Greg Smith wrote:
>>> The other popular request that keeps popping up here is providing an
>>> easy way to see how backlogged the archive_command is, to make it
>>> easier to monitor for out-of-disk errors that might prove
>>> catastrophic to replication.
>> I tend to disagree - in any reasonable production setup, basic stuff
>> like disk space usage is monitored by non-application-specific means.
>> While monitoring backlog might be interesting for other reasons,
>> citing disk space usage/exhaustion seems just wrong.
> I was just mentioning that as one use of the data, but there are others.
> Let's say that your archive_command works by copying things over to an
> NFS mount, and the mount goes down. It could be a long time before you
> noticed this via disk space monitoring. But if you were monitoring "how
> long has it been since the last time pg_last_archived_xlogfile()
> changed?", this would jump right out at you.
Well, from a sysadmin perspective you have to monitor the NFS mount
anyway - so why do you need the database to do it too (and not in a sane
way, because there is no way the database can even figure out what the
real problem is, or whether there is one)?
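
(To make the check Greg describes concrete, here is a minimal sketch of
such a poller. It assumes the pg_last_archived_xlogfile() function
*proposed* in this thread - it does not exist in any released version -
and the psycopg2 driver; connection details are hypothetical. The point
is that everything goes over the one exposed port:)

    import time
    import psycopg2  # driver choice is an assumption; any client library works

    POLL_INTERVAL = 60          # seconds between checks
    STALL_THRESHOLD = 15 * 60   # warn if nothing new is archived for 15 min

    # Connection details are hypothetical; only port 5432 is needed.
    conn = psycopg2.connect("host=dbserver port=5432 dbname=postgres")
    conn.autocommit = True

    last_file, last_change = None, time.time()
    while True:
        with conn.cursor() as cur:
            # The function below is the one *proposed* in this thread.
            cur.execute("SELECT pg_last_archived_xlogfile()")
            current = cur.fetchone()[0]
        if current != last_file:
            last_file, last_change = current, time.time()
        elif time.time() - last_change > STALL_THRESHOLD:
            print("WARNING: archiver may be stuck; last archived file "
                  "unchanged for %d seconds" % (time.time() - last_change))
        time.sleep(POLL_INTERVAL)
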
> Another popular question is "how far behind real-time is the archiver
> process?" You can do this right now by duplicating the same xlog file
> name scanning and sorting that the archiver does in your own code,
> looking for .ready files. It would be simpler if you could call
> pg_last_archived_xlogfile() and then just grab that file's timestamp.
Well, that one seems like a more reasonable rationale to me; however, I'm
not so sure that the proposed implementation feels right - though I can't
come up with a better suggestion for now.
> I think it's also important to consider the fact that diagnostic
> internals exposed via the database are far more useful to some people
> than things you have to set up outside of it. You talk about reasonable
> configurations above, but some production setups are not so reasonable.
> In many of the more secure environments I've worked in (finance,
> defense), there is *no* access to the database server beyond what comes
> out of port 5432 without getting a whole separate team of people
> involved. If the DBA can write a simple monitoring program themselves
> that presents data via the one port that is exposed, that makes life
> easier for them. This same issue pops up sometimes when we consider the
> shared hosting case too, where the user may not have the option of
> running a full-fledged monitoring script.
Well, again, I consider stuff like "available disk space" or "NFS mount
available" to be completely in the realm of OS-level management. The
database side should focus on the stuff that concerns the internal state
and operation of the database app itself.
If you continue that line of thought, you will have to add all kinds of
stuff to the database: CPU usage tracking, information about running
processes, storage health, and so on.
By the time you are done, you will have reimplemented nagios-plugins over
SQL on port 5432 instead of over NRPE (or SNMP or whatnot).
Again, I fully understand that there are environments where the DBA has
no OS-level access (be it root or any shell at all), but even if you had
that "archiving is hanging" function, you would still have to go back to
that "completely different group" and have them diagnose the problem
anyway.
So my point is that even if you have disparate groups of people
responsible for different parts of a solution, you can't really work
around the incompetence (or slowness, or whatever) of the group
responsible for the lower layer by adding partial and inexact
functionality at the upper layer that can only guess what the real issue
is.
Stefan