Re: [PATCH] audit: add backlog high water mark metric

Paul Moore Tue, 12 May 2026 09:02:37 -0700

On Fri, Apr 17, 2026 at 9:02 AM Ricardo Robaina <[email protected]> wrote:
> On Thu, Apr 16, 2026 at 5:58 PM Paul Moore <[email protected]> wrote:
> > On Thu, Apr 16, 2026 at 4:51 PM Paul Moore <[email protected]> wrote:
> > > On Thu, Apr 16, 2026 at 4:33 PM Steve Grubb <[email protected]> wrote:
> > > > On Wednesday, April 15, 2026 11:21:52 AM Eastern Daylight Time Paul Moore wrote:
> > > > > On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <[email protected]> wrote:
> > > > > > On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <[email protected]> wrote:
> > > > > > > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> > > > > > > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <[email protected]> wrote:
> > > > > > ...
> > > > > >
> > > > > > > ... compliance-driven systems that must use a finite backlog
> > > > > > > limit for memory safety but cannot tolerate dropped events ...
> > > > > >
> > > > > > You must pick one of those two requirements, or at the very least
> > > > > > prioritize them; it is simply impossible to both limit the backlog
> > > > > > queue and require zero dropped events.
> > > > >
> > > > > To be perfectly honest, it's also impossible to require zero dropped
> > > > > events.  Even in the most extreme configurations where the admin
> > > > > decides to panic the system, that only happens once the system reaches
> > > > > the point where it is dropping events.  We try *really* hard to not
> > > > > drop events, but it is always going to be a possibility.
> > > >
> > > > You're helping make the point.  Those administrators have decided 
> > > > reliable
> > > > auditing is more important than system availability.
> > >
> > > Users prioritizing reliable auditing over system availability should
> > > not run with a backlog limit.  It's that simple.
> >
> > To clarify this further, even on systems without a backlog limit and a
> > panic-on-loss configuration, there is still a possibility that the
> > system could lose an event when it hits the edge before it panics.  A
> > maximum backlog stat won't help here.  Even if you had a way to
> > capture the backlog size before the system took itself out, the size
> > is flirting with the maximum resource limits of the system, it would
> > be silly to use that as a configured backlog limit, you would still
> > want to leave the limit at 0/disabled.
> >
> > > Regardless, I'm still not convinced this maximum backlog stat alone
> > > will solve any meaningful problems.  If your audit log is predictable
> > > enough that this metric has value, it should be possible to either
> > > capture the backlog size during periods of high audit load or simply
> > > run the system through that load and verify it doesn't crash and go to
> > > hell.  If your audit log isn't predictable, capturing a maximum
> > > backlog size doesn't really mean anything since it is still a snapshot
> > > of one instance of the system and there is always the possibility of
> > > the system exceeding it.
> >
> > --
> > paul-moore.com
> >
>
> Hi Paul,
>
> Thanks for reviewing the patch and giving your perspective on it.
>
> I understand your point that if a system truly prioritizes auditing
> over everything else, it shouldn't run with a limit. However, in
> practice, there is a middle ground where compliance frameworks or
> internal infrastructure policies require a finite backlog limit to
> ensure memory safety, while still demanding reliable auditing.


It is important that those users understand they are believing a lie
if they think one can demand reliable auditing with a finite backlog
limit.

> I'd like to ask what specific metric or combination of metrics you
> would be willing to consider? You mentioned average queue length
> earlier, and Steve suggested combining the max depth with a
> backlog_lost_since_reset counter. I'm happy to work on a v2 that
> addresses your concerns while still delivering the metrics audit users
> currently lack.

My suggestion would be to put forth a proposal explaining the problems
you want to solve and what metrics you believe are important towards
solving those problems.  I agree that the current list of audit
metrics is rather sparse, but as we've seen here, I don't think we
yet have agreement on what metrics would be useful.  My hope is that
having a discussion on the metrics first could avoid false starts as
we've seen here.

-- 
paul-moore.com
