Re: Ops experience: monitoring [mail processing - mailbox event processing] for distributed James product

2020-02-12 Thread Antoine Duprat
+1

On Wed, 12 Feb 2020 at 12:01, Tellier Benoit wrote:

> Then an admin might miss the original log if it falls outside her
> browsing window.
>
> However I agree the logging could happen at a lower rate:
>  - Check every minute, log directly upon status change
>  - Otherwise re-log current status every 30 minutes
>
> On 12/02/2020 17:48, Antoine Duprat wrote:
> > Wouldn't it be more logical to log only status changes?
> >
> > I mean, if you are in a degraded state, you will log the same thing every
> > minute until you have fixed the issue.
> >
> > On Wed, 12 Feb 2020 at 11:43, Tellier Benoit wrote:
> >
> >> +1
> >>
> >> We should make this happen.
> >>
> >> On 12/02/2020 17:29, Matthieu Baechler wrote:
> >>> On Wed, 2020-02-12 at 16:27 +0700, Tellier Benoit wrote:
> >>>
> >>>>  - Through Grafana, the admin will have the information directly
> >>>> available. Nowadays, health checks require her to run the
> >>>> health check via webadmin. Requiring more actions is generally the
> >>>> best way of having none of them taken.
> >>>>
> >>>
> >>> I just want to add, on that matter, that I already proposed to have a
> >>> timer that logs health state to WARN when status is `degraded` and
> >>> ERROR when status is `down` on a sensible time interval (like once a
> >>> minute) and that would be enabled in our default configuration.
> >>>
> >>> That way the logs, which are the first and most basic tool any admin is
> >>> looking at, would give you that very important information.
> >>>
> >>> Cheers,
> >>>
> >>
> >> -
> >> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
> >> For additional commands, e-mail: server-dev-h...@james.apache.org
> >>
> >>
> >
>


Re: Ops experience: monitoring [mail processing - mailbox event processing] for distributed James product

2020-02-12 Thread Tellier Benoit
Then an admin might miss the original log if it falls outside her
browsing window.

However I agree the logging could happen at a lower rate:
 - Check every minute, log directly upon status change
 - Otherwise re-log current status every 30 minutes
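
The policy above can be sketched as a small decision helper. This is only an illustration, not James code: the class name `StatusLogThrottle`, its fields, and the plain string statuses are assumptions made for the example. The caller is expected to invoke it once per check (every minute in the proposal).

```python
from dataclasses import dataclass
from typing import Optional

# Re-log an unchanged status every 30 minutes, as proposed above.
RELOG_INTERVAL_SECONDS = 30 * 60


@dataclass
class StatusLogThrottle:
    """Decides, at each periodic check, whether the status should be logged.

    Hypothetical helper for illustration; timestamps are seconds.
    """

    last_status: Optional[str] = None
    last_logged_at: Optional[float] = None

    def should_log(self, status: str, now: float) -> bool:
        # Log directly upon a status change (the first check counts as one).
        changed = status != self.last_status
        # Otherwise log only if the last log line is at least 30 minutes old.
        stale = (
            self.last_logged_at is None
            or now - self.last_logged_at >= RELOG_INTERVAL_SECONDS
        )
        self.last_status = status
        if changed or stale:
            self.last_logged_at = now
            return True
        return False
```

With this, a server that stays degraded for an hour produces one log line on the change and then one every 30 minutes, instead of sixty identical lines.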

On 12/02/2020 17:48, Antoine Duprat wrote:
> Wouldn't it be more logical to log only status changes?
> 
> I mean, if you are in a degraded state, you will log the same thing every
> minute until you have fixed the issue.
> 
> On Wed, 12 Feb 2020 at 11:43, Tellier Benoit wrote:
> 
>> +1
>>
>> We should make this happen.
>>
>> On 12/02/2020 17:29, Matthieu Baechler wrote:
>>> On Wed, 2020-02-12 at 16:27 +0700, Tellier Benoit wrote:
>>>
>>>>  - Through Grafana, the admin will have the information directly
>>>> available. Nowadays, health checks require her to run the
>>>> health check via webadmin. Requiring more actions is generally the
>>>> best way of having none of them taken.
>>>>
>>>
>>> I just want to add, on that matter, that I already proposed to have a
>>> timer that logs health state to WARN when status is `degraded` and
>>> ERROR when status is `down` on a sensible time interval (like once a
>>> minute) and that would be enabled in our default configuration.
>>>
>>> That way the logs, which are the first and most basic tool any admin is
>>> looking at, would give you that very important information.
>>>
>>> Cheers,
>>>
>>
> 


Re: Ops experience: monitoring [mail processing - mailbox event processing] for distributed James product

2020-02-12 Thread Antoine Duprat
Wouldn't it be more logical to log only status changes?

I mean, if you are in a degraded state, you will log the same thing every
minute until you have fixed the issue.

On Wed, 12 Feb 2020 at 11:43, Tellier Benoit wrote:

> +1
>
> We should make this happen.
>
> On 12/02/2020 17:29, Matthieu Baechler wrote:
> > On Wed, 2020-02-12 at 16:27 +0700, Tellier Benoit wrote:
> >
> >>  - Through Grafana, the admin will have the information directly
> >> available. Nowadays, health checks require her to run the
> >> health check via webadmin. Requiring more actions is generally the
> >> best way of having none of them taken.
> >>
> >
> > I just want to add, on that matter, that I already proposed to have a
> > timer that logs health state to WARN when status is `degraded` and
> > ERROR when status is `down` on a sensible time interval (like once a
> > minute) and that would be enabled in our default configuration.
> >
> > That way the logs, which are the first and most basic tool any admin is
> > looking at, would give you that very important information.
> >
> > Cheers,
> >
>


Re: Ops experience: monitoring [mail processing - mailbox event processing] for distributed James product

2020-02-12 Thread Tellier Benoit
+1

We should make this happen.

On 12/02/2020 17:29, Matthieu Baechler wrote:
> On Wed, 2020-02-12 at 16:27 +0700, Tellier Benoit wrote:
> 
>>  - Through Grafana, the admin will have the information directly
>> available. Nowadays, health checks require her to run the
>> health check via webadmin. Requiring more actions is generally the
>> best way of having none of them taken.
>>
> 
> I just want to add, on that matter, that I already proposed to have a
> timer that logs health state to WARN when status is `degraded` and
> ERROR when status is `down` on a sensible time interval (like once a
> minute) and that would be enabled in our default configuration.
> 
> That way the logs, which are the first and most basic tool any admin is
> looking at, would give you that very important information.
> 
> Cheers,
> 


Re: Ops experience: monitoring [mail processing - mailbox event processing] for distributed James product

2020-02-12 Thread Matthieu Baechler
On Wed, 2020-02-12 at 16:27 +0700, Tellier Benoit wrote:

>  - Through Grafana, the admin will have the information directly
> available. Nowadays, health checks require her to run the
> health check via webadmin. Requiring more actions is generally the
> best way of having none of them taken.
> 

I just want to add, on that matter, that I already proposed to have a
timer that logs health state to WARN when status is `degraded` and
ERROR when status is `down` on a sensible time interval (like once a
minute) and that would be enabled in our default configuration.

That way the logs, which are the first and most basic tool any admin is
looking at, would give you that very important information.
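
To make the idea concrete, here is a rough sketch of such a timer. It is not the James implementation: `log_health_state`, `run_periodically`, the logger name, and the choice of INFO for a healthy status and ERROR for an unknown one are all assumptions for illustration. Only the WARN-on-`degraded` and ERROR-on-`down` mapping comes from the proposal.

```python
import logging
import time
from typing import Callable, Optional, Tuple

logger = logging.getLogger("healthcheck")  # hypothetical logger name

LEVEL_BY_STATUS = {
    "healthy": logging.INFO,      # assumption: the proposal does not cover this case
    "degraded": logging.WARNING,  # WARN when status is `degraded`
    "down": logging.ERROR,        # ERROR when status is `down`
}


def log_health_state(component: str, status: str) -> int:
    """Log one health status at the matching level and return that level."""
    # Assumption: an unknown status is treated as an error.
    level = LEVEL_BY_STATUS.get(status, logging.ERROR)
    logger.log(level, "health check %s is %s", component, status)
    return level


def run_periodically(
    check: Callable[[], Tuple[str, str]],
    interval_seconds: float = 60.0,
    rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run `check` (returning (component, status)) and log it at each interval.

    `rounds=None` loops forever; a finite value is convenient for tests.
    """
    done = 0
    while rounds is None or done < rounds:
        component, status = check()
        log_health_state(component, status)
        done += 1
        sleep(interval_seconds)
```

For example, `run_periodically(lambda: ("MailQueue", "down"), rounds=3)` would emit three ERROR lines, one per minute.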

Cheers,

-- 
Matthieu Baechler



Ops experience: monitoring [mail processing - mailbox event processing] for distributed James product

2020-02-12 Thread Tellier Benoit
Hello all,

Recently, as part of our work documenting Administration Procedures for
the Distributed Guice James product, we have been reflecting on how to
conduct monitoring, which prompted some nice discussions.

Currently, monitoring of `mailbox event processing` and `mail
processing` can be achieved via logs (i.e. ERROR log review, etc.).

However, relying on logs requires a correct Kibana configuration, which
in turn requires the logs to carry good information. In addition:
 - It makes retries and final tries non-trivial to distinguish
 - Admins generally monitor logs using a time window. Events older than
this time window are ignored.

We can think of several mechanisms to improve this situation:

 - Having, for instance, a health check like
MailboxEventProcessingHealthCheck that ensures the dead-letter is empty
and returns "degraded" otherwise
 - Having a metric displayed on a dashboard. For the dead-letter example, a
boolean text field can be enough.
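
As a rough illustration of the first option, the check could look like the sketch below. It is not the James health check API: `HealthResult`, `check_mailbox_event_processing`, and the `count_dead_letters` callback are hypothetical names, and how the dead-letter is actually counted is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HealthResult:
    """Hypothetical result type: component name, status, optional cause."""

    component: str
    status: str  # "healthy" or "degraded"
    cause: str = ""


def check_mailbox_event_processing(
    count_dead_letters: Callable[[], int],
) -> HealthResult:
    """Report "healthy" when the dead-letter is empty, "degraded" otherwise."""
    pending = count_dead_letters()
    if pending == 0:
        return HealthResult("MailboxEventProcessing", "healthy")
    return HealthResult(
        "MailboxEventProcessing",
        "degraded",
        f"{pending} event(s) in dead-letter",
    )
```

The same emptiness test could feed the boolean field of the second option, so both mechanisms can share one source of truth.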

While interesting, the health check option has received the following
criticisms so far:
 - A perfectly behaving James server might report some failed processing
entries (for example on some borderline EML parsing), leading to a
degraded status for an otherwise perfectly working James server (for both
the mail processing and mailbox event processing cases)
 - Through Grafana, the admin will have the information directly
available. Nowadays, health checks require her to run the
health check via webadmin. Requiring more actions is generally the best
way of having none of them taken.

We would be very interested in feedback on this topic, in order to
provide a friendlier admin experience.

Best regards,

Benoit

