On Jan 25, 8:00 pm, Paul Rubin <http://[EMAIL PROTECTED]> wrote:
> "Paddy" <[EMAIL PROTECTED]> writes:
> > No, you should think of the service that needs to be up. You seem to be
> > talking about how it can't be fixed rather than looking for ways to
> > keep things going.
> But you're proposing cargo cult programming.

I don't know that term. What I'm proposing is that if, for example, a
process stops running three times in a year at roughly three-to-four-month
intervals, and it should have stayed up, then restart the server sooner,
at a time of your choosing, whilst taking other measures to investigate
the error.

> There is no reason
> whatsoever to expect that restarting the server now and then will help
> the problem in the slightest.

That's where we most likely differ. The problem is only indirectly the
program failing. The customer wants reliable service, which you can get
from unreliable components. It happens all the time in firmware-controlled
systems that periodically reboot themselves as a matter of course.

> Nick used the fancy term Poisson
> process but it just means that the probability of failure at any
> moment is independent of what's happened in the past, like the
> spontaneous radioactive decay of an atom. It's not like a mechanical
> system where some part gradually gets worn out and eventually breaks,
> so you can prevent the failure by replacing the part every so often.

Whilst you sit agreeing on how many fairies can dance on the end of a pin
or not, your company could be losing customers. You and Nick seem to be
saying it *must* be Poisson, therefore we can't do...

> > A little learning is fine but "it can't theoretically be fixed" is
> > no solution.
> The best you can do is identify the unfixable situations precisely and
> work around them. Precision is important.

I'm sorry, but your argument reminds me of when Western statistical
quality control first met with the Japanese Zero Defects methodologies.
We had argued ourselves into accepting a certain number of defective cars
getting out to customers as the result of our theories. The Japanese
practices emphasized that *no* defects were acceptable at the customer,
and they seemed to deliver better-made cars.

> The next best thing is have several servers running simultaneously,
> with failure detection and automatic failover.

Yah, finally. I can work with that.

> If a server is failing at random every few months, trying to prevent
> that by restarting it every so often is just shooting in the dark.

"at random" - "every few months"
Me thinking it happens "every few months" allows me to search for a fix.
If thinking it happens "at random" leads you to a brick wall, then switch!

> Think of your server stopping now and then because there's a power
> failure, where you get power failures every few months on the average.
> Shutting down your server once a month, unplugging it, and plugging it
> back in will do nothing to prevent those outages. You need to either
> identify and fix whatever is causing the power outages, or install a
> backup generator.

Yep. I also know that a mad bloke entering the server room with a hammer
every three to four months is not likely to be fixed by restarting the
server every two months ;-)

- Paddy.