Carl J. Van Arsdall wrote:
> Aahz wrote:
>> [snip]
>>
>> My response is that you're asking the wrong questions here. Our database
>> server locked up hard Sunday morning, and we still have no idea why (the
>> machine itself, not just the database app). I think it's more important
>> to focus on whether you have done all that is reasonable to make your
>> application reliable -- and then put your efforts into making your app
>> recoverable.
>>
> Well, I assume that I have done all I can to make it reliable. This
> list is usually my last resort, or a place where I come hoping to find
> ideas that aren't coming to me naturally. The only other thing I
> thought to come up with was that there might be network errors. But
> I've gone back and forth on that, because TCP should handle that for me
> and I shouldn't have to deal with it directly in pyro, although I've
> added (and continue to add) checks in places that appear appropriate
> (and in some cases, checks because I prefer to be paranoid about errors).
>
>
>> I'm particularly making this comment in the context of your later point
>> about the bug showing up only every three or four months.
>>
>> Side note: without knowing what error messages you're getting, there's
>> not much anybody can say about your programs or the reliability of
>> threads for your application.
>>
> Right, I wasn't coming here to get someone to debug my app, I'm just
> looking for ideas. I constantly am trying to find new ways to improve
> my software and new ways to reduce bugs, and when I get really stuck,
> new ways to track bugs down. The exception won't mean much, but I can
> say that the error appears to me as bad data. I do checks prior to
> performing actions on any data; if the data doesn't look like what it
> should look like, then the system flags an exception.
>
> The problem I'm having is determining how the data went bad. In
> tracking down the problem a couple guys mentioned that problems like
> that usually are a race condition. From here I examined my code,
> checked out all the locking stuff, made sure it was good, and wasn't
> able to find anything. Being that there's one lock and the critical
> sections are well defined, I'm having difficulty. One idea I have to
> try and get a better understanding might be to check data before it's
> stored. Again, I still don't know how it would get messed up nor can I
> reproduce the error on my own.
>
> Do any of you think that would be a good practice for trying to track
> this down? (Check the data after reading it, check the data before
> saving it)
>
Are you using memory with built-in error detection and correction?
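
For what it's worth, checking on both sides of the store can be sketched roughly as below. This is only an illustration, not your code: CheckedStore, save, load and the validate callback are names made up for the example, and it assumes your values can be pickled. Keeping a pickled copy plus a CRC separates two cases. If the value fails validation in save(), the writer produced bad data. If the CRC no longer matches in load(), the stored bytes changed underneath you, which points away from your locking and towards memory or hardware.

```python
import pickle
import threading
import zlib


class CheckedStore:
    """Hypothetical shared store that checks every value on write and on read."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}  # key -> (pickled bytes, CRC-32 of those bytes)

    def save(self, key, value, validate):
        # Check the data before saving it, so a bad value is caught at its source.
        if not validate(value):
            raise ValueError(f"refusing to store bad data for {key!r}: {value!r}")
        # Store a serialized copy so no other thread can mutate it through
        # a shared reference after it has been validated.
        blob = pickle.dumps(value)
        with self._lock:
            self._items[key] = (blob, zlib.crc32(blob))

    def load(self, key, validate):
        with self._lock:
            blob, crc = self._items[key]
        # bytes objects are immutable, so a mismatch here means the stored
        # bytes or the recorded CRC changed outside normal program logic.
        if zlib.crc32(blob) != crc:
            raise RuntimeError(f"stored bytes for {key!r} changed after saving")
        value = pickle.loads(blob)
        # Check the data after reading it as well.
        if not validate(value):
            raise ValueError(f"data for {key!r} is bad after reading: {value!r}")
        return value
```

Logging which of the three exceptions fires, together with the thread name and a timestamp, should at least narrow down where the data goes bad the next time it happens.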

regards
 Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Skype: holdenweb     http://del.icio.us/steve.holden
Blog of Note:          http://holdenweb.blogspot.com
See you at PyCon?         http://us.pycon.org/TX2007