On 06/20/2017 01:23 PM, Alban Hertroys wrote:

On 20 Jun 2017, at 18:46, Adrian Klaver <adrian.kla...@aklaver.com> wrote:


Yes, this could become complicated, if for no other reason than that it is being 
driven from the customer end and there will need to be a process to verify and 
incorporate their changes.

There you're saying something rather important: "If it is being driven from the 
customer end".

Yeah, it is the interaction between technical issues and people issues. One is easier to solve than the other :)


2) Figure out what a day is. In other words are different timezones involved 
and if so what do you 'anchor' a day to?

For an example of how that might fail: At our company, they work in shifts (I 
don't) of 3*8 hours, that run from 23:00 to 23:00. Depending on who looks at 
the data, either that's a day or a normal day (00:00-00:00) is. It's a matter 
of perspective.

I see that as part of how to 'anchor' a day. Right now Steve is looking at one customer, as I understand it. I would expect that to change, so I can envision a system that would need to account for different definitions of a day. Still, you have to start somewhere.
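
To make the 'anchor' idea concrete, here is a minimal sketch in Python, not anything from Steve's actual system. The function name anchored_day, its parameters, and the choice to label a day by the calendar date on which it starts are all assumptions made for illustration. A fixed UTC+2 offset stands in for a customer's time zone; a real deployment would more likely use a named zone such as zoneinfo.ZoneInfo("Europe/Amsterdam").

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def anchored_day(ts: datetime, tz: tzinfo, day_start: time = time(0, 0)) -> date:
    """Return the label of the 'day' that contains ts.

    A day runs from day_start to day_start (local time in tz) and is
    labelled with the calendar date on which it starts. That labelling
    rule is an assumption; a site could just as well label by end date.
    """
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    # Convert to the customer's wall-clock time, then shift the clock back
    # so that day_start lines up with midnight.
    local = ts.astimezone(tz).replace(tzinfo=None)
    shift = timedelta(hours=day_start.hour, minutes=day_start.minute)
    return (local - shift).date()


# Hypothetical customer at UTC+2 (stand-in for a named zone).
CUSTOMER_TZ = timezone(timedelta(hours=2))
reading = datetime(2017, 6, 20, 20, 30, tzinfo=timezone.utc)  # 22:30 local

calendar_day = anchored_day(reading, CUSTOMER_TZ)            # 00:00-00:00 day
shift_day = anchored_day(reading, CUSTOMER_TZ, time(23, 0))  # 23:00-23:00 day
```

The same reading lands on 2017-06-20 under the calendar definition and on 2017-06-19 under the 23:00 shift definition, which is the "matter of perspective" described above. Storing the time zone and day_start per customer would let the definition change later without touching the stored timestamps.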



IMHO, the only safe approach is to have the customer end decide whether it's a 
regular outage or an irregular one. There is just no way to reliably guess that 
from the data. If a customer decides to turn off the system when he's going 
home, you can't guess when he's going to do that and you will be raising false 
positives when you depend on a schedule of when he might be going home.

From a software implementation point of view that means that your 
customer-side application needs to be able to signal planned shutdowns and 
startups. If you detect any outages without such a signal, then you can flag it 
as a problem.

I agree. I personally see false alerts as a form of 'Crying wolf' and I think that down the road they lead to complacency. Hence my earlier suggestion to have a method to indicate manual intervention on the customer end.
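
A rough sketch of that rule in Python, assuming the customer-side application reports each planned shutdown/startup pair as a time window. The names Window and classify_outage are hypothetical, and the requirement that a single signalled window fully covers the outage is one possible policy, not the only reasonable one.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    """A closed time interval, either a signalled shutdown or a detected gap."""
    start: datetime
    end: datetime


def classify_outage(outage: Window, planned: list[Window]) -> str:
    """Return 'planned' if one signalled window fully covers the outage.

    Anything else, including an outage that merely overlaps a planned
    window, is reported as 'unplanned' so that it gets looked at.
    """
    for window in planned:
        if window.start <= outage.start and outage.end <= window.end:
            return "planned"
    return "unplanned"


# Hypothetical data: the customer signalled an overnight shutdown.
signalled = [Window(datetime(2017, 6, 20, 18, 0), datetime(2017, 6, 21, 7, 0))]

overnight_gap = Window(datetime(2017, 6, 20, 18, 5), datetime(2017, 6, 21, 6, 50))
midday_gap = Window(datetime(2017, 6, 21, 11, 0), datetime(2017, 6, 21, 11, 40))

overnight_status = classify_outage(overnight_gap, signalled)
midday_status = classify_outage(midday_gap, signalled)
```

Here the overnight gap is classified as planned and the midday gap as unplanned. Treating partial overlaps as unplanned errs toward raising an alert, so how strict to be is a trade-off against the 'Crying wolf' problem.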


There are still opportunities for getting those wrong of course, such as lack 
of connectivity between you and your customer, but those should be easy to 
explain once detected.
And I'm sure there are plenty of other corner-cases you need to take into 
account. I bet it has a lot of problems in common with replication actually 
(how do we reliably get information from system A to system B), so it probably 
pays to look at what particular problems occur there and how they're solved.

I would say a good deal of the above is going to be driven by legal considerations. Who is responsible for what and what guarantees are in effect.


Alban Hertroys


--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
