On 06/20/2017 01:23 PM, Alban Hertroys wrote:
On 20 Jun 2017, at 18:46, Adrian Klaver <adrian.kla...@aklaver.com> wrote:
Yes, this could become complicated, if for no other reason than that it is
being driven from the customer end and there will need to be a process to
verify and incorporate their changes.
There you're saying something rather important: "If it is being driven from the
customer end".
Yeah, it is the interaction between technical issues and people issues.
One is easier to solve than the other :)
2) Figure out what a day is. In other words are different timezones involved
and if so what do you 'anchor' a day to?
For an example of how that might fail: At our company, they work in shifts (I
don't) of 3*8 hours, that run from 23:00 to 23:00. Depending on who looks at
the data, either that's a day or a normal day (00:00-00:00) is. It's a matter
of perspective.
I see that as part of how to 'anchor' a day. Right now Steve is looking
at one customer as I understand it. I would expect that might change so
I can envision a system that would need to account for different
definitions of a day. Still you have to start somewhere.
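To make the 'anchor' idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name anchored_day, the choice to label a day by the calendar date on which it starts, and the idea that each viewer supplies their own time zone and day start. It is not something from Steve's system.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def anchored_day(ts: datetime, tz: tzinfo, day_start: time = time(0, 0)) -> date:
    """Return the label of the 'day' that ts falls in.

    Assumption: a day runs from day_start to day_start (local wall-clock
    time in tz) and is labelled with the calendar date on which it starts.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    # Convert to the viewer's local wall-clock time.
    local = ts.astimezone(tz).replace(tzinfo=None)
    # Shift so that day_start maps to 00:00, then take the date.
    offset = timedelta(hours=day_start.hour, minutes=day_start.minute)
    return (local - offset).date()


# One instant, three perspectives (fixed UTC+2 offset used for the example).
ts = datetime(2017, 6, 20, 22, 30, tzinfo=timezone.utc)
local_tz = timezone(timedelta(hours=2))

utc_day = anchored_day(ts, timezone.utc)               # calendar day in UTC
local_day = anchored_day(ts, local_tz)                 # calendar day, local time
shift_day = anchored_day(ts, local_tz, time(23, 0))    # 23:00-23:00 shift day
```

For real data a named zone such as zoneinfo.ZoneInfo("Europe/Amsterdam") could be passed as tz so that DST changes are handled. The point is only that the same timestamp lands on different days depending on the anchor, so the anchor has to be stored per customer (or per viewer) rather than assumed.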
IMHO, the only safe approach is to have the customer end decide whether it's a
regular outage or an irregular one. There is just no way to reliably guess that
from the data. If a customer decides to turn off the system when he's going
home, you can't guess when he's going to do that and you will be raising false
positives when you depend on a schedule of when he might be going home.
From a software implementation point of view that means that your
customer-side application needs to be able to signal planned shutdowns and
startups. If you detect any outages without such a signal, then you can flag it
as a problem.
I agree. I personally see false alerts as a form of 'Crying wolf' and I
think that down the road they lead to complacency. Hence my earlier
suggestion to have a method to indicate manual intervention on the
customer end.
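A rough sketch of that idea in Python, under stated assumptions: the customer side sends "heartbeat", "planned_shutdown" and "planned_startup" events, and any silence longer than max_gap that does not directly follow a planned shutdown gets flagged. The Event type, the event names and unplanned_outages are all hypothetical, made up for this example.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    ts: datetime
    kind: str  # "heartbeat", "planned_shutdown" or "planned_startup"


def unplanned_outages(
    events: List[Event], max_gap: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) of every gap longer than max_gap that was not
    announced by a planned_shutdown event immediately before it."""
    flagged = []
    ordered = sorted(events, key=lambda e: e.ts)
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.ts - prev.ts
        if gap > max_gap and prev.kind != "planned_shutdown":
            flagged.append((prev.ts, cur.ts))
    return flagged


events = [
    Event(datetime(2017, 6, 19, 17, 55), "heartbeat"),
    Event(datetime(2017, 6, 19, 18, 0), "planned_shutdown"),  # customer goes home
    Event(datetime(2017, 6, 20, 8, 0), "planned_startup"),
    Event(datetime(2017, 6, 20, 8, 5), "heartbeat"),
    Event(datetime(2017, 6, 20, 8, 10), "heartbeat"),
    Event(datetime(2017, 6, 20, 9, 30), "heartbeat"),         # silence before this
]

outages = unplanned_outages(events, max_gap=timedelta(minutes=10))
```

The overnight gap is ignored because it was signalled; only the unannounced silence after 08:10 is reported. No schedule of when the customer "might" go home is needed, which is what avoids the false positives.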
There are still opportunities for getting those wrong of course, such as lack
of connectivity between you and your customer, but those should be easy to
explain once detected.
And I'm sure there are plenty of other corner-cases you need to take into
account. I bet it has a lot of problems in common with replication actually
(how do we reliably get information from system A to system B), so it probably
pays to look at what particular problems occur there and how they're solved.
I would say a good deal of the above is going to be driven by legal
considerations. Who is responsible for what and what guarantees are in
effect.
Alban Hertroys
--
Adrian Klaver
adrian.kla...@aklaver.com