All of this heated debate has led me to reconsider our whole concept of exceptions. It seems that we're squabbling over little details in existing paradigms. But what of the big picture? What *is* an exception anyway? We all know the textbook definition, but clearly something is missing since we can't seem to agree how it should be implemented.
DEFINITION

So I'd like to propose this definition: an exception is an abnormal condition that causes a particular operation not to be completed, *but which may have one or more ways of recovery*.

I'm not interested in problems with no method of recovery: we might as well just terminate the program right there and call it a day. The fact that there's such a thing as try-catch means that we're interested in problems that we have *some* way of handling.

Before I proceed, I'd like to propose that we temporarily forget about the current implementation details of exceptions. Let's for the time being forget about class hierarchies, try-catch, Variant hashes, etc., and let's consider the *concept* of an exception.

In discussing what an exception is, we can jump into the nitty-gritty details, and argue all day about how to handle the particulars of case X and how to reconcile it with case Y, but I'd like to approach this from a different angle. Yes, if we know the nitty-gritty, then we can deal with it in a nitty-gritty way. Let it suffice to say that sometimes, we *want* to get into the nitty-gritty of an exception, so in any implementation there should be a way to do this.

But let's talk about the generics. What if we *don't* know the nitty-gritty? Can we still say something useful about exception handling *independently* of the details of the exception? If module X calls module Y and a problem happens, but X doesn't understand the implementation details of Y or how the problem relates to it, can X still do something meaningful to handle the problem?

To this end, I decided to categorize exceptions according to their general characteristics, rather than their particulars. In making this categorization, I'm not looking for artificial, ivory-tower idealizations of exception types; I'm looking for categories that would allow us to handle an unknown exception meaningfully. In other words, categories for which *there are recovery methods available* to handle that exception.
I don't care whether it's an I/O error, a network error, or an out-of-memory error; what I want to know is, *what can I do about it*?

TRANSITIVITY

One more point before I get into the categories themselves: in finding useful categorizations of exceptions, it's useful to find categories which are *transitive*. And by that I mean, a category C is transitive if an exception E in this category remains in the same category when the stack unwinds from Y, where E was first thrown, to X, which called Y.

For example, Andrei's is_transient property constitutes a transitive category. If X calls Y and Y experiences a transient error, then by extension X has also experienced a transient error. Therefore, the is_transient category is transitive.

Why should we care if an exception category is transitive? Because it allows us to make meaningful decisions far up the call stack from where the actual problem happened. If an exception category is not transitive, then as soon as the stack unwinds past the caller of the function which threw the exception, we can no longer reasonably assume that the exception still belongs to the same category as before.

An illegal input error, for example, is not necessarily transitive: if X calls Y and Y calls Z, and Z says "illegal input!", then it means Y passed bad data to Z, but it doesn't necessarily mean that X passed bad data to Y. It may be Y's fault that the input to Z was bad. So trying to recover from the bad data when the stack has unwound to X is not useful: X may not have any way of changing what Y passes to Z.

But if Y merely takes the data passed to it by X and hands it to Z, then the illegal input *is* transitive up to X. In *that* case, X can meaningfully attempt a recovery by fixing the bad data. But if X was the one who created that data, then X's caller cannot do anything about it. So input errors are only conditionally transitive -- up to the origin of the input data.

CATEGORIES

Here are the categories I've found so far.
I don't claim this is anywhere near complete, but I'd like to put it on the table so that y'all can discuss it, and hopefully refine the idea.

Each category is associated with a list of recovery actions that are meaningful for that category. Note that it doesn't mean that *every* exception in that category will have all listed recovery actions available; some recovery actions aren't possible in all cases. In an implementation, there would need to be a way to indicate which of the listed recovery actions are actually possible given a particular exception.

Also, the recovery actions are deliberately generic. This will be explained later, but the idea is to let generic code decide on a general course of action without knowing the details of the actual implementation, which is determined by the code that triggered the original condition, or by an intermediate handler midway up the call stack that knows how to deal with the condition.

INPUT ERRORS:

Definition: errors that are caused by passing bad input to function X, such that X doesn't know how to compute a meaningful output or execute a meaningful operation.

Transitivity: conditional, only up to origin of the input data.

Recovery actions:
- Attempt to repair the input and try the operation again. Only possible if there exists a mechanism of repairing the input.
- Skip over the bad input and attempt to continue. Only applicable if the input is a list, say, and the program can still function (albeit not to the full extent) without the erroneous input. Otherwise, this recovery action is meaningless.

COMPONENT FAILURE:

Definition: the operation being attempted depends on the normal functioning of sub-operations X1, X2, ..., but one (or more) of them isn't functioning normally, so the operation can't be completed.

Transitivity: Yes. If X calls Y and Y calls Z, and one of Z's subcomponents failed, then from Y's perspective, Z has also failed.
Recovery actions:
- Retry the operation using an alternative component, if one is available. For example, if X is a DNS resolver and Y is a non-responding DNS server, then X could try DNS server Z instead. Transitively, if X can't recover (doesn't know an alternative DNS server to try), then X's caller can attempt to bypass X and use W instead, which looks up a local DNS cache, say.

(Note that the high-level code that handles a component failure does not need to know the details of how this component swapping is implemented, or exactly what is being swapped in/out. It only knows that this is a possible course of action.)

CONSISTENCY ERROR:

Definition: the operation being attempted depends on a suboperation X, which is operating normally; however, the result returned by X fails a consistency check. (I'm not sure if this warrants separate treatment from component failure; they are very similar and the recovery actions are also very similar. Maybe they can be lumped together as a single category.)

Examples: Numerical overflow/underflow, which would throw off any remaining calculations.

Transitivity: Yes. If X calls Y and Y calls Z, and Y discovers that Z produced an inconsistent result, then by extension Y would have also produced an inconsistent result had it decided to blindly charge ahead with the operation.

Recovery actions:
- Retry the operation by using an alternative component, if available. For example, a numerical overflow might be repaired by switching to a BigNum library for the troublesome part of the computation.

LACK OF RESOURCES:

Definition: the operation being attempted would have completed normally, had there been sufficient resources available, but there aren't, so it can't continue.

Transitivity: Yes. If X calls Y and Y runs out of resources to finish, then by extension X doesn't have the resources to finish either.

Recovery actions:
- Free up some resources and try again.
  This one is debatable, since it may not be clear which resources need to be freed up, or whether they *can* be freed at all. If it's a full disk, for example, it would be unwise to just go and randomly delete files. But some cases can be handled, e.g., if memory runs out, trigger the GC. (But presumably the GC does this automatically, so this may not be an actual use case that needs manual handling.)

All in all, this category may not be easy to recover from, so it may be of limited utility.

TRANSIENT ERROR:

Definition: the operation depends on component X, which is known to sometimes fail. Example: a network server may sometimes go down due to intermittent network problems, timeouts, etc.

Transitivity: Yes. If operation X calls operation Y and Y has a transient error, then X also has a transient error by extension.

Recovery actions:
- Retry the operation: it may succeed next time.

Sidenote: Here I'd like to say that at first, I was very skeptical about Andrei's is_transient proposal, because I didn't have the proper context to understand its utility. I felt that something was missing. And that missing something was that is_transient is but a part of a larger framework of generic exception categories. Without this larger context, the value of is_transient is not immediately obvious. It seems like just an arbitrary thing out of the blue. How could it possibly be useful?? But when viewed as part of a larger system, is_transient can be seen to be an extremely useful concept: it is a *transitive* category, which means you can do something meaningful with it at any point up the call stack.

CREDENTIALS ERROR:

Definition: there's no problem with the input, and all subcomponents are functioning properly, but because of lack of (or improper) credentials, the operation could not be completed.

Transitivity: Yes(?). Not sure about this one, not because it doesn't fit the definition, but because it's unclear how to correctly handle the recovery action.
A single operation may consist of many sub-operations, each requiring a different set of credentials. Just because one of the sub-operations raises a credentials error doesn't mean the exception handler knows where to find alternative credentials, or even what kind of credentials they are.

Recovery actions:
- Retry the operation with different credentials. E.g., prompt user for a different password. But I'm unsure if/how this can be generally implemented, as described above.

These are all the general categories I found. There may be more.

IMPLEMENTATION

Alright. All of this grand talk about generics and categories is well and good, but how can this actually be implemented in real life?

The try-catch mechanism is not adequate to implement all the recovery actions described above. As I've said before when discussing what I called the Lispian model, some of these recovery actions need to happen *in the context where the exception was thrown*. Once the stack unwinds, it may not be possible to recover anymore, because the execution context of the original code is gone.

One peculiarity about Andrei's is_transient is that you *can* re-attempt the operation after unwinding the stack. Which is what makes it useful in the current try/catch exception system that we have. But not all recovery actions can be implemented this way. Some, such as repair bad input, or try alternate component, make no sense after the stack has unwound: the execution context of the failing component and its caller is long gone; to try an alternate component would require painstaking passing of retry information all the way down the function call chain, polluting normal function parameters with retry parameters and producing very ugly code.

Repair bad input, in particular, *must* be done before the stack unwinds past the origin of the input, otherwise it's impossible to correct it.

This is where the Lispian model really shines.
To summarize:

1) When we encounter a problem, we raise a Condition (instead of throwing an exception immediately).

2) Every Condition is associated with a set of recovery actions. These actions are generic; basically we're mapping each exception category to a Condition. The raiser of the Condition will specialize each recovery action with code specific to itself.

3) High-level code may register Handlers (in the form of a delegate) for particular Conditions. These registrations are limited by scope; once the function registering the handler exits, any handlers it registered are removed from the system. The handler registered closest to the origin of a Condition has priority over other matching handlers.

4) When a Condition is raised, the condition-handling system first checks a list of registered condition handlers to see if any handler is willing to handle the condition. The handler is passed the Condition with its associated set of recovery options. The handler decides, based on high-level information, which recovery action to take, and informs the condition-handling system. The recovery action is then executed *in the context of the function that raised the Condition*, *without unwinding the stack*.

If no handler is found, or the handler decides to abort the operation, then the condition-handling system converts the Condition into an exception and throws it. A function higher up the call chain may decide to catch this exception and raise a corresponding Condition, to allow (other) handlers to deal with the situation at the higher level. If nothing is caught or all attempts to fix the problem failed, we eventually percolate up the call stack to the top and fail the program.

Advantages of this system:
- Complex recovery actions are possible, because we don't unwind the stack until we decide to abort the operation after all.
- Recovery actions run in the context where failure is first seen, thereby taking advantage of the immediate context to recover in a specific way.
- High-level code gets to make decisions about which recovery action to pursue (via the delegate handler). It gets to do this *without* needing to know the nitty-gritty of the low-level code; it is given the generic problem category and a list of generic recovery actions that can be attempted. The low-level code implements various recovery options, the high-level code chooses between them.
- If nobody knows how to handle the situation, we unwind the stack, as in the traditional try/catch model.
- If an intermediate function up the stack has a way to deal with the situation, it can catch the associated exception and raise a Condition that has recovery actions *run at its level*. The high-level delegate still gets to make decisions, but now the recovery actions are run at a higher level than the original locus of the problem. In some cases, this is a better position for attempting recovery. E.g., a network timeout may be seen at the packet level, but to repair the problem requires reconnecting from, say, the HTTP request level, so we need to unwind the stack up until that point. This is actually superior to the try/catch mechanism, because at the HTTP request level, we don't necessarily have enough context to decide what course of action to take; but by passing the condition to a higher-level delegate, it can make decisions the HTTP module can't make, and the HTTP module can correct the problem without unwinding the stack all the way to where the delegate was registered.

In a previous post, I had a skeletal implementation of this system, but the major problem was that it was too specialized: every piece of code that wanted to implement recovery needed to define a specific Condition with its own set of recovery strategies, leading to reams and reams of code just to achieve something simple.
Furthermore, the high-level handler needed to know the nitty-gritty low-level details of what each Condition represented and what options were available to deal with it, so there was no way to write a *generic* handler that can decide what to do with conditions whose details it knows nothing about.

But by using generic exception categories, we can finally get rid of that bloat and still be able to implement problem-specific recovery strategies. The high-level code need only know which generic category the Condition belongs to, and based on this it knows which recovery actions are available. It never needs to know what the details are (unless it's intended to be a very specific handler dealing with a very specific condition whose details it knows). The low-level code provides the implementation of the recovery actions by implementing the generic interface of that particular category.

Currently, I'm still unsure whether Conditions and Exceptions should be unified, or whether they should be kept separate; deadalnix recommended they be kept separate, but I'd like to open it for discussion.

Sorry for this super-long post, but I wanted to lay my ideas out in a coherent fashion so that we can discuss their conceptual aspects without getting lost in arguing about the details. I hope this is a step in the right direction toward a better model of exception handling.

T

--
Life is complex. It consists of real and imaginary parts. -- YHL