Re: Occasional delayed output of events

Burn Alting Tue, 19 Jan 2021 00:20:54 -0800

On Mon, 2021-01-18 at 15:36 -0500, Paul Moore wrote:
> On Mon, Jan 18, 2021 at 9:31 AM Steve Grubb <[email protected]> wrote:
> > On Monday, January 18, 2021 8:54:30 AM EST Paul Moore wrote:
> > > > > > I like the N of M concept but there would be a LOT of change -
> > > > > > especially for all the non-kernel event sources. The EOE would be
> > > > > > the most seamless, but at a cost. My preference is to allow the
> > > > > > 2 second 'timer' to be configurable.
> > > > > 
> > > > > Agree with Burn, numbering the records coming up from the kernel is
> > > > > going to be a real nightmare, and not something to consider lightly.
> > > > > Especially when it sounds like we don't yet have a root cause for the
> > > > > issue.
> > > > 
> > > > A very long time ago, we had numbered records. But it was decided that
> > > > there's no real point in it and we'd rather just save disk space.
> > > 
> > > With the current kernel code, adding numbered records is not something
> > > to take lightly.
> > 
> > That's why I'm saying we had it and it was removed. I could imagine that
> > if you had auditing of the kill syscall enabled and a whole process group
> > was being killed, you could have hundreds of records that need numbering.
> > No good way to know in advance how many records make up the event.
> 
> You only mentioned disk space concerns so it wasn't clear to me that you
> were in agreement about this being a bad idea.  Regardless, I'm glad to see
> we are on the same page about this.
>
> > > > I know that the kernel does not serialize the events headed for
> > > > userspace. But I'm curious how an event gets stuck and others can jump
> > > > ahead while one that's already inflight can get hung for 4 seconds
> > > > before its next record goes out?
> > > 
> > > Have you determined that the problem is the kernel?
> > 
> > I assume so because the kernel adds the timestamp and chooses what hits
> > the socket next. Auditd does no ordering of events. It just looks up the
> > text event ID, does some minor translation if the enriched format is
> > being used, and writes it to disk. It can handle well over 100k records
> > per second.
> 
> Feel free to insert the old joke about assumptions.
>
> I guess I was hoping for a bit more understanding of the problem and
> perhaps some actual data indicating the kernel was the source of the
> problem.  Conjecture based on how things are supposed to work can be
> misleading.
>
> > > Initially it was looking like it was a userspace issue, is that no
> > > longer the general thought?
> > 
> > I don't see how user space could cause this. Even if auditd was slow, it
> > shouldn't take 4 seconds to write to disk and then come back to read
> > another record. And even if it did, why would the newest record go out
> > before completing one that's in progress? Something in the kernel chooses
> > what's next. I suspect that might need looking at.
> 
> See above.
>
> > > Also, is there a reliable reproducer yet?
> > 
> > I don't know of one. But, I suppose we could modify ausearch to look for
> > examples of this.
> 
> The kernel queuing is a rather complicated affair due to the need to
> gracefully handle auditd failing, fallbacks to the console, and multicast
> groups all while handling extreme pressure (e.g. auditing *every* syscall)
> and not destroying the responsiveness of the system (we actually can still
> make forward progress if you are auditing *every* syscall).  With that
> complexity comes a number of corner cases, and I imagine there are a few
> cases where the system is under extreme pressure and/or the auditd daemon
> is dead and/or starved of CPU time.  As I know Richard is reading this, to
> be clear I'm talking about the hold/retry queues and the UNICAST_RETRIES
> case.  The delays you are talking about in this thread seem severe, but
> perhaps if the system is under enough pressure to cause the ordering
> issues in the first place such a delay is to be expected.
>
> Anyway, my test setup isn't likely able to reproduce such a scenario
> without some significant tweaks, so perhaps those of you who have seen this
> problem (Burn, and anyone else?) could shed some light into the state of
> the system when the ordering problem occurred.


I tend to have a rigorous auditing posture (see the rules loaded in
https://github.com/linux-audit/audit-userspace/issues/148) which is not normal
for most. Perhaps, Paul, you have hit the nail on the head by stating that this
'severe delay' is not that unreasonable given my rules posture and we just need
to 'deal with it' in user space. We still get the event data; I just need to
adjust the user space tools to deal with this occurrence.

As for what the systems are doing, in my home case one is a CentOS 7 VM running
a Tomcat service which only gets busy every 20 minutes, and the other is an HPE
Z800 running CentOS 8 with 4-5 VMs, mostly dormant. I can put any code on these
hosts to assist in 'validating'/testing the delay. Advise and I will run.
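
Short of modifying ausearch, one way to look for examples of this in existing
logs might be a small script along the following lines. It is only a sketch
under stated assumptions: it expects raw auditd log lines carrying the usual
msg=audit(seconds.milliseconds:serial) stamp, the function name
find_late_records is hypothetical, and the 2000 ms default simply mirrors the
2 second 'timer' mentioned above. It flags any record whose event timestamp is
at least that far behind the newest timestamp already seen earlier in the file.
It shows the symptom as written to disk; it does not say whether the kernel or
user space caused it.

```python
import re
import sys
from typing import Iterable, List, Optional, Tuple

# Record stamp, e.g. "msg=audit(1610958990.123:4567)":
# seconds, milliseconds, then the event serial number.
AUDIT_STAMP = re.compile(r"msg=audit\((\d+)\.(\d{3}):(\d+)\)")


def find_late_records(
    lines: Iterable[str], threshold_ms: int = 2000
) -> List[Tuple[int, int, int]]:
    """Return (line_no, serial, lag_ms) for each record written after a
    record whose event timestamp is at least threshold_ms newer."""
    newest_ms: Optional[int] = None  # newest event timestamp seen so far
    late: List[Tuple[int, int, int]] = []
    for line_no, line in enumerate(lines, start=1):
        match = AUDIT_STAMP.search(line)
        if match is None:
            continue  # not an audit record line
        secs, millis, serial = (int(g) for g in match.groups())
        stamp_ms = secs * 1000 + millis  # integer ms avoids float rounding
        if newest_ms is not None and newest_ms - stamp_ms >= threshold_ms:
            late.append((line_no, serial, newest_ms - stamp_ms))
        if newest_ms is None or stamp_ms > newest_ms:
            newest_ms = stamp_ms
    return late


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 late_records.py /var/log/audit/audit.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for line_no, serial, lag_ms in find_late_records(log):
            print(f"line {line_no}: event {serial} is {lag_ms} ms "
                  f"behind a newer event already written")
```

Because every record of an event carries the same timestamp, the lag reported
is only a lower bound on how long the late record was held up.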

--
Linux-audit mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-audit
