[ 
https://issues.apache.org/jira/browse/CAMEL-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18064066#comment-18064066
 ] 

Raymond commented on CAMEL-23079:
---------------------------------

I notified our operational manager that work was started on this ticket. He was 
excited about it, and thinks this will benefit the Camel framework and its 
community broadly. 

Some things he said that I would like to share here:

Normal, successful message processing happens automatically without human 
intervention. In operations, I am therefore mostly interested in the 
deviations: errors and exceptions. This is better known as management by 
exception.

Important for operations is that errors are:


*Categorized*
             

1. Exceptions (The Technical Failures)

    1a. Recoverable Errors (Soft Failures)

        These are temporary issues where the message might succeed if tried 
        again. Examples: a temporary network glitch, a database timeout, or a 
        remote server being briefly offline.

    1b. Irrecoverable Errors (Hard Failures)

        These are permanent issues where retrying is futile. Examples: a "File 
        Not Found" error, a null pointer in your custom code, or a "404 Not 
        Found" from an API.

2. Fault Messages (The Business Failures)

    A message that is technically "successful" in terms of delivery but 
    contains a functional error.

    Example: You call a bank API to withdraw money. The connection is perfect 
    (no exception), but the API returns a "Fault" saying "Insufficient Funds." 
    Such a fault can be represented by an exception that the Camel developer 
    sets in the route, or by a specific HTTP error code.
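
A minimal Java sketch of the categorization above. The names ErrorCategory, 
BusinessFaultException and ErrorClassifier are hypothetical and not part of 
any Camel API, and the choice of which exception types count as recoverable 
is an assumption made for illustration only:

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.sql.SQLTransientException;
import java.util.concurrent.TimeoutException;

/** Hypothetical categories mirroring the list above; not a Camel API. */
enum ErrorCategory { RECOVERABLE, IRRECOVERABLE, FAULT }

/** Hypothetical marker for a functional (business) failure raised in a route. */
class BusinessFaultException extends RuntimeException {
    BusinessFaultException(String message) {
        super(message);
    }
}

final class ErrorClassifier {
    private ErrorClassifier() {}

    /** Maps an exception to a category based on its type only. */
    static ErrorCategory classify(Throwable t) {
        if (t instanceof BusinessFaultException) {
            return ErrorCategory.FAULT;
        }
        // Assumption: these types usually indicate a temporary condition.
        if (t instanceof SocketTimeoutException
                || t instanceof ConnectException
                || t instanceof TimeoutException
                || t instanceof SQLTransientException) {
            return ErrorCategory.RECOVERABLE;
        }
        // Everything else is treated as permanent, e.g. FileNotFoundException
        // or NullPointerException.
        return ErrorCategory.IRRECOVERABLE;
    }
}
```

For example, ErrorClassifier.classify(new NullPointerException()) yields 
IRRECOVERABLE. A real classifier would probably also need to inspect wrapped 
causes and HTTP status codes.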


*Error Handling* 

How was the error handled automatically? Was it handled or not? How was the 
handling done: retried, sent to an endpoint, logged, etc.?

*Error Query returns something that is structured and understandable*

A query result should contain all the data needed to understand the error, 
decide on its severity/impact, and choose the action to take. The result is 
best structured, for example as JSON, so that it can be processed further by 
the maintainer, by another system (say Elastic/Grafana), or by an AI that 
analyzes and explains the error and proposes actions.
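
To illustrate what a structured result could look like, here is a small 
Java sketch. The ErrorEntry record and all of its field names are 
hypothetical; they are not taken from the design document or from Camel. 
The JSON is built by hand only to keep the example free of dependencies:

```java
import java.time.Instant;

/** Hypothetical shape of one query result; field names are illustrative. */
record ErrorEntry(String routeId, String exceptionType, String message,
                  String category, boolean handled, Instant timestamp) {

    /** Minimal JSON string escaping for quotes, backslashes and controls. */
    private static String esc(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.toString();
    }

    /** Renders the entry as a single-line JSON object. */
    String toJson() {
        return "{"
            + "\"routeId\":\"" + esc(routeId) + "\","
            + "\"exceptionType\":\"" + esc(exceptionType) + "\","
            + "\"message\":\"" + esc(message) + "\","
            + "\"category\":\"" + esc(category) + "\","
            + "\"handled\":" + handled + ","
            + "\"timestamp\":\"" + timestamp + "\""
            + "}";
    }
}
```

The timestamp is written in ISO-8601 form via Instant.toString(), which 
tools such as Elastic or Grafana can usually parse without extra 
configuration.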

*Find and analyze hotspots*

An Error Registry could be used to find hotspots of errors. What are the most 
significant errors within a time frame (say the last 3 days)? What has been 
the impact of each error (did it slow the route, did it use a lot of 
resources)? This can concern failed messages in the route, but also bridge 
consumer errors such as database connection timeouts or errors.
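
A hotspot query of this kind could be sketched in Java as follows. The 
types ErrorEvent, Hotspot and HotspotQuery are hypothetical, and "most 
significant" is simplified here to "most frequent per route and exception 
type"; measuring impact on throughput or resources is left out:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical minimal error event; a real entry would carry more data. */
record ErrorEvent(String routeId, String exceptionType, Instant timestamp) {}

/** One hotspot: how often an exception type occurred on a route. */
record Hotspot(String routeId, String exceptionType, long count) {}

final class HotspotQuery {
    private HotspotQuery() {}

    /** Returns the most frequent (route, exception type) pairs in the window. */
    static List<Hotspot> top(List<ErrorEvent> events, Instant now,
                             Duration window, int limit) {
        Instant cutoff = now.minus(window);
        // Count events per (routeId, exceptionType), ignoring older ones.
        Map<List<String>, Long> counts = events.stream()
            .filter(e -> !e.timestamp().isBefore(cutoff))
            .collect(Collectors.groupingBy(
                e -> List.of(e.routeId(), e.exceptionType()),
                Collectors.counting()));
        // Sort by count descending, then by names for a stable order.
        return counts.entrySet().stream()
            .map(en -> new Hotspot(en.getKey().get(0), en.getKey().get(1),
                                   en.getValue()))
            .sorted(Comparator.comparingLong(Hotspot::count).reversed()
                .thenComparing(Hotspot::routeId)
                .thenComparing(Hotspot::exceptionType))
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

Calling HotspotQuery.top(events, Instant.now(), Duration.ofDays(3), 10) 
would answer the "last 3 days" question from above for the ten most 
frequent errors.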

*Analyze errors from a high level down to the details*
    
Recognition of the exception should be as close as possible to the source of 
the exception. The goal is to make error handling consistent, observable, and 
largely automated. Exception classification becomes useful when:

* the system can reliably recognize the category automatically
* each exception category has predefined automated measures

There has to be observability on different levels. Classification follows 
recognition: once an exception has occurred, it has to be classified in order 
to choose the right handling or measure. A hotspot query may identify a 
problem area, but you then need a more detailed query to get all the details 
for analyzing that specific error.
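
The idea that each category has a predefined automated measure can be 
sketched as a simple lookup. The enum values and the MeasurePolicy class 
are hypothetical, and the chosen measures are only examples of what an 
operations team might configure:

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical categories, following the classification described earlier. */
enum Category { RECOVERABLE, IRRECOVERABLE, FAULT }

/** Hypothetical automated measures; the actual set is up to operations. */
enum Measure { RETRY_WITH_BACKOFF, MOVE_TO_DEAD_LETTER, NOTIFY_BUSINESS_OWNER }

final class MeasurePolicy {
    private static final Map<Category, Measure> MEASURES =
        new EnumMap<>(Category.class);

    static {
        // Assumed defaults, chosen for illustration only.
        MEASURES.put(Category.RECOVERABLE, Measure.RETRY_WITH_BACKOFF);
        MEASURES.put(Category.IRRECOVERABLE, Measure.MOVE_TO_DEAD_LETTER);
        MEASURES.put(Category.FAULT, Measure.NOTIFY_BUSINESS_OWNER);
    }

    private MeasurePolicy() {}

    /** Returns the predefined measure for a recognized category. */
    static Measure measureFor(Category category) {
        return MEASURES.get(category);
    }
}
```

Keeping the mapping in one place makes the handling consistent: once an 
exception is recognized and classified, the measure follows from the 
category alone.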

--------------------------------------------------------

This feedback hopefully gives a more functional / operational perspective on 
the matter. As a Camel developer I would of course like a nice API to work 
with, but I can follow the reasoning of the operational maintainer, who mostly 
looks at whether the end result gives him the right information in a 
structured format.
    

> camel-core - Registry for capturing errors during routing messages
> ------------------------------------------------------------------
>
>                 Key: CAMEL-23079
>                 URL: https://issues.apache.org/jira/browse/CAMEL-23079
>             Project: Camel
>          Issue Type: New Feature
>          Components: camel-core
>            Reporter: Claus Ibsen
>            Assignee: Guillaume Nodet
>            Priority: Major
>             Fix For: 4.x
>
>         Attachments: ErrorRegistry-Design-Document.pdf
>
>
> Some new kind of API that end users can enable that collects the exceptions 
> that happened during routing and stores them in a registry (memory).
> Can have a cap size for how many entries to store, and also for how long time 
> to keep them back so an error does not stay around for months.
> Then this registry can have Java and JMX API for monitoring and management. 
> And Camel JBang command to browse etc.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
