And to go just a bit further with this, would this be considered a non-backward compatible change to the core expectations of the system (even if it's patching a "bug")? It changes the "expected" behavior of the application from an undefined possible zombie state to fail-fast. Theoretically, that might constitute a version bump (http://semver.org), unless you parameterize the behavior so that a user, who encounters this behavior outside of their existing expectations, could turn off the fail-fast behavior by setting an appropriate flag. I'm not an expert in ZK whatsoever, but maybe there's some runtime operation that occurs and generates an unchecked exception and the system is fine with that, but it's only really experienced at runtime in exceptional cases. If the system fails-fast, without a way to disable fail-fast, you may have to release another patch promptly (instead of simply saying "disable fail-fast for now, we'll fix the larger issue in the next release"). Maybe that's excessive, but just putting it out there anyway.
On 4/13/12 11:22 AM, "Shelley, Ryan" <[email protected]> wrote: >Just my 2 centsŠ is the error code 1 the correct error code to return to >the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1) >may be called. It may make sense to either re-use that error code, or use >a different one (if 1 is already used elsewhere for a different type of >error, like "Invalid arguments" during start-up, for example). > >If the error isn't an OOME, is there any clean-up ZK needs to do to maybe >inform a cluster it's going down abruptly (maybe to gracefully begin a >leader re-election if necessary, for example)? > >I'm +1 to fail-fast behavior. > >Ryan > >On 4/13/12 8:15 AM, "Scott Fines" <[email protected]> wrote: > >>On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM >>for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd >>args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't >>remember off the top of my head). Will these triggers still be fired, or >>will the catch-all prevent them? >> >>I'm still +1 for the change no matter what, but it's probably a good idea >>to mention that in the docs if they don't work. >> >>Scott >> >>On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier >><[email protected]>wrote: >> >>> Hi everyone, >>> >>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and >>>I'd >>> like some feedback from the user base on it. >>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442 >>> >>> The current behavior of ZK when we get an uncaught exception is to log >>>it >>> and try to move on. This is arguably not the right thing to do, and >>>will >>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) >>>for >>> longer than it should. >>> The patch proposes that when we get an instance of java.lang.Error, we >>> should do a system.exit to fast-fail the process. With the possible >>> exception of ThreadDeath (which may or may not be an unrecoverable >>>system >>> state depending on the thread), I think this makes sense, but I would >>>like >>> to hear from others if they have an opinion. I think it's better to >>>kill >>> the process and let your monitoring services detect process death (and >>>thus >>> restart) than possibly linger unresponsive for a while, are there >>>scenarios >>> that we're missing where this error can occur and you wouldn't want the >>> process killed? >>> >>> Thanks for your feedback, >>> >>> Camille >>> >
