[
https://issues.apache.org/jira/browse/CAMEL-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18064066#comment-18064066
]
Raymond commented on CAMEL-23079:
---------------------------------
I notified our operational manager that work was started on this ticket. He was
excited about it, and thinks this will benefit the Camel framework and its
community broadly.
Some things he said that I would like to share here:
Normal, successful message processing happens automatically without human
intervention. In operations I am therefore mostly interested in the
deviations, errors, or exceptions. This is better known as management by
exception.
Important for operations is that errors are:
*Categorized*
1. Exceptions (The Technical Failures)
1a. Recoverable Errors (Soft Failures)
These are temporary issues where the message might succeed if tried
again. Examples: A temporary network glitch, a database timeout, or a remote
server being briefly offline.
1b. Irrecoverable Errors (Hard Failures)
These are permanent issues where retrying is futile. Examples: A "File
Not Found" error, a null pointer in your custom code, or a "404 Not Found" from
an API.
2. Fault Messages (The Business Failures)
A message that is technically "successful" in terms of delivery but
contains a functional error.
Example: You call a bank API to withdraw money. The connection is perfect
(no exception), but the API returns a "Fault" saying "Insufficient Funds." This
can be an exception set by the Camel developer in the route or a specific HTTP
error code.
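The categories above could be recognized automatically along the following lines. This is only a minimal sketch: ErrorClassifier, Category and BusinessFaultException are hypothetical names made up for illustration, not existing Camel API, and the mapping from exception type to category is an assumption that each project would have to tune.

```java
import java.io.FileNotFoundException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.sql.SQLTransientException;

// Hypothetical names throughout: none of this is an existing Camel API.
final class ErrorClassifier {

    enum Category { RECOVERABLE, IRRECOVERABLE, FAULT }

    // Hypothetical marker for a business fault raised deliberately in a route,
    // for example "Insufficient Funds".
    static class BusinessFaultException extends RuntimeException {
        BusinessFaultException(String message) {
            super(message);
        }
    }

    static Category classify(Throwable error) {
        // Walk the cause chain so that wrapped exceptions are still recognized.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof BusinessFaultException) {
                return Category.FAULT;
            }
            // Temporary conditions: a retry might succeed.
            if (t instanceof SocketTimeoutException
                    || t instanceof ConnectException
                    || t instanceof SQLTransientException) {
                return Category.RECOVERABLE;
            }
            // Permanent conditions: retrying is futile.
            if (t instanceof FileNotFoundException
                    || t instanceof NullPointerException) {
                return Category.IRRECOVERABLE;
            }
        }
        // Assumption: unknown errors are treated as hard failures
        // so that they are never retried blindly.
        return Category.IRRECOVERABLE;
    }
}
```

For example, ErrorClassifier.classify(new SocketTimeoutException("read timed out")) yields RECOVERABLE, while an unrecognized exception falls back to IRRECOVERABLE.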
*Error Handling*
Was the error handled automatically, or not at all? If it was handled, how
was that done: retried, sent to an endpoint, logged, etc.?
*Error Query returns something that is structured and understandable*
A query result should have all the data needed to understand the error, decide
on the severity/impact, and choose the action to take. A query result is best
structured, for example as JSON, so that it can be processed further by the
maintainer, by another system (say Elastic/Grafana), or by an AI that
analyzes, explains, and proposes actions.
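To make this concrete, one query result could look roughly like the record below. It is a sketch under assumptions: ErrorEntry and all of its field names are hypothetical and not taken from the design document, and the JSON is hand-rolled only to keep the example dependency-free.

```java
import java.time.Instant;

// Hypothetical shape of a single query result; all field names are assumptions.
record ErrorEntry(String routeId, String exchangeId, String exceptionType,
                  String message, String category, boolean handled,
                  int redeliveryAttempts, Instant timestamp) {

    // Hand-rolled JSON to avoid dependencies; a real implementation
    // would more likely use a JSON library.
    String toJson() {
        return "{"
            + "\"routeId\":" + quote(routeId)
            + ",\"exchangeId\":" + quote(exchangeId)
            + ",\"exceptionType\":" + quote(exceptionType)
            + ",\"message\":" + quote(message)
            + ",\"category\":" + quote(category)
            + ",\"handled\":" + handled
            + ",\"redeliveryAttempts\":" + redeliveryAttempts
            + ",\"timestamp\":" + quote(timestamp == null ? null : timestamp.toString())
            + "}";
    }

    // Escapes the characters that JSON does not allow unescaped inside a string.
    private static String quote(String s) {
        if (s == null) {
            return "null";
        }
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

The handled flag and the redelivery count carry the error handling outcome, so that a maintainer or a downstream system can decide on severity without going back to the logs.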
*Find and analyze hotspots*
An Error Registry could be used to find hotspots of errors. What are the most
significant errors within a time frame (say last 3 days)? What has been the
impact of this error (did it slow the route, did it use a lot of resources)?
This can be failed messages in the route, but also bridge consumer errors such
as database connection timeouts or other errors.
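A hotspot query of this kind might be sketched as follows: count errors per route and exception type within a time window and return the most frequent first. HotspotQuery, Occurrence and Hotspot are hypothetical names, and grouping by route plus exception type is just one possible assumption about what counts as "the same error".

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical names throughout: none of this is an existing Camel API.
final class HotspotQuery {

    record Occurrence(String routeId, String exceptionType, Instant timestamp) {}

    record Hotspot(String routeId, String exceptionType, long count) {}

    // Returns the most frequent (route, exception type) pairs within the window.
    static List<Hotspot> top(List<Occurrence> errors, Instant now,
                             Duration window, int limit) {
        Instant cutoff = now.minus(window);

        // Count occurrences per (routeId, exceptionType), ignoring old entries.
        Map<List<String>, Long> counts = errors.stream()
            .filter(e -> !e.timestamp().isBefore(cutoff))
            .collect(Collectors.groupingBy(
                e -> List.of(e.routeId(), e.exceptionType()),
                Collectors.counting()));

        // Highest count first; ties are ordered by name to keep results stable.
        return counts.entrySet().stream()
            .map(en -> new Hotspot(en.getKey().get(0), en.getKey().get(1), en.getValue()))
            .sorted(Comparator.comparingLong(Hotspot::count).reversed()
                .thenComparing(Hotspot::routeId)
                .thenComparing(Hotspot::exceptionType))
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

Calling HotspotQuery.top(errors, Instant.now(), Duration.ofDays(3), 10) would then answer the "last 3 days" question. Impact figures such as processing time or resource use are left out of this sketch.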
*Analyze errors on a high level to details*
Recognition of the exception should be as close as possible to the source of
the exception. The goal is to make error handling consistent, observable, and
largely automated. Exception classification becomes useful when:
- the system can reliably recognize the category automatically
- each exception category has predefined automated measures
There has to be observability on different levels. Classification follows
recognition: once an exception has occurred, it has to be classified in order
to choose the right handling or measure. A hotspot query may identify a
problem area, but then you need a more detailed query to get all the details
required to analyze that specific error.
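The idea of predefined automated measures per category could be expressed as a simple lookup table. This is a sketch only: AutomatedMeasures, Category and Measure are hypothetical names, and the specific measures chosen here are assumptions for illustration rather than a recommendation.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical names throughout: none of this is an existing Camel API.
final class AutomatedMeasures {

    enum Category { RECOVERABLE, IRRECOVERABLE, FAULT }

    enum Measure { RETRY_WITH_BACKOFF, PARK_AND_ALERT, REPORT_TO_CALLER }

    // Assumed mapping: each category has exactly one predefined measure.
    private static final Map<Category, Measure> MEASURES = new EnumMap<>(Map.of(
        Category.RECOVERABLE, Measure.RETRY_WITH_BACKOFF,
        Category.IRRECOVERABLE, Measure.PARK_AND_ALERT,
        Category.FAULT, Measure.REPORT_TO_CALLER));

    static Measure measureFor(Category category) {
        Measure measure = MEASURES.get(category);
        if (measure == null) {
            throw new IllegalArgumentException("No measure defined for " + category);
        }
        return measure;
    }
}
```

Keeping the mapping in one table means that a newly recognized category cannot be introduced without also deciding what should happen to it.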
--------------------------------------------------------
This feedback hopefully gives a more functional / operational perspective on
the matter. As a Camel developer I would of course like a nice API to work
with, but I can follow the reasoning of the operational maintainer, who mostly
looks at whether the end result gives him the right information in a
structured format.
> camel-core - Registry for capturing errors during routing messages
> ------------------------------------------------------------------
>
> Key: CAMEL-23079
> URL: https://issues.apache.org/jira/browse/CAMEL-23079
> Project: Camel
> Issue Type: New Feature
> Components: camel-core
> Reporter: Claus Ibsen
> Assignee: Guillaume Nodet
> Priority: Major
> Fix For: 4.x
>
> Attachments: ErrorRegistry-Design-Document.pdf
>
>
> Some new kind of API that end users can enable that collects the exceptions
> that happened during routing and stores them in a registry (memory).
> Can have a cap size for how many entries to store, and also for how long time
> to keep them back so an error does not stay around for months.
> Then this registry can have Java and JMX API for monitoring and management.
> And Camel JBang command to browse etc.
>