Re: [PATCH v2] audit: report audit wait metric in audit status reply

2020-12-07 Thread Max Englander
On Wed, Dec 2, 2020 at 11:33 PM Joe Wulf  wrote:

> I would like to suggest providing a mechanism where admins can query the
> status or state of backlog issues (wait time, sums, etc...).  Maybe the
> intent is to expand the output of status checking of auditd.
>
> I believe further clarity is beneficial on the setting of the
> 'backlog_wait_sum' (or to whatever the name evolves to) initially.
> -  How it evolves over time
> -  What the conditions in the system, or auditing, would change it
> -  What conditions admins should pay attention to for informational
> understanding of status
> -  What conditions admins should realize exist such that adjustments are
> needed
>    (and suggestions to what those adjustments should be)
> -  What new guidance will admins have for building or adjusting audit.rules
> around this
>
> Consider the scenario where auditing has been 'working fine' for days.
> Little to no active admin monitoring.
> Events occur to spike the auditing such that backlogging of audit records
> dramatically increases.
> (for some reason) admins now come looking to investigate.
> Assuming they do:  'systemctl status auditd' the newly presented 'state'
> of the 'backlog_wait_sum' will show some evidence.
> Q:  Is that just a moment in time?
> Q:  What information here will give the perspective things are good/ok
> 'now', versus some action needs to be taken?
>
> Maybe that isn't a great scenario, or good questions; it is what occurs
> to me at the moment.
>
> Thank you.
>
> R,
> -Joe Wulf
>
>
> On Wednesday, July 1, 2020, 5:33:14 PM EDT, Max Englander <
> max.englan...@gmail.com> wrote:
>
> >  In environments where the preservation of audit events and predictable
> >  usage of system memory are prioritized, admins may use a combination of
> >  --backlog_wait_time and -b options at the risk of degraded performance
> >  resulting from backlog waiting. In some cases, this risk may be
> >  preferred to lost events or unbounded memory usage. Ideally, this risk
> >  can be mitigated by making adjustments when backlog waiting is detected.
> >
> >  However, detection can be difficult using the currently available metrics.
> >  For example, an admin attempting to debug degraded performance may
> >  falsely believe a full backlog indicates backlog waiting. It may turn
> >  out the backlog frequently fills up but drains quickly.
> >
> >  To make it easier to reliably track degraded performance to backlog
> >  waiting, this patch makes the following changes:
> >
> >  Add a new field backlog_wait_sum to the audit status reply. Initialize
> >  this field to zero. Add to this field the total time spent by the
> >  current task on scheduled timeouts while the backlog limit is exceeded.
> >
> >  Tested on Ubuntu 18.04 using complementary changes to the audit
> >  userspace: https://github.com/linux-audit/audit-userspace/pull/134.
>
> 
>

Hi Joe,

Not sure I can address all your points above, but the way that we monitor
Linux audit internals at my employer is to continuously monitor the audit
status response with short evaluation windows.

- We compute a rate of change on the lost field, and alert if there are
more than N lost records per second on average
- We compute the backlog utilization by computing backlog/backlog_limit,
and alert if that goes above 75% at any point in time
- If/when we run on a kernel that has backlog_wait_time_actual, we'll
monitor on that as well, setting thresholds around where we'd expect
growth in this value to result in service degradation.
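
As a rough illustration of the first two checks above, here is a minimal sketch in C. It is not our production tooling; the struct, the function names and the way samples are timestamped are all assumptions made for the example. Only the field names (lost, backlog, backlog_limit) and the 75% figure come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* One sample of the audit status reply fields we care about.
 * Hypothetical local struct, not the kernel's struct audit_status. */
struct status_sample {
    uint32_t lost;           /* cumulative lost records */
    uint32_t backlog;        /* records currently queued */
    uint32_t backlog_limit;  /* configured backlog limit */
    double   t_seconds;      /* time the sample was taken */
};

/* Average lost records per second between two samples. */
double lost_rate(const struct status_sample *prev,
                 const struct status_sample *cur)
{
    double dt = cur->t_seconds - prev->t_seconds;
    assert(dt > 0.0);
    /* If the counter went backwards (for example after a reset),
     * count only what has accumulated since then. */
    uint32_t delta = cur->lost >= prev->lost ? cur->lost - prev->lost
                                             : cur->lost;
    return (double)delta / dt;
}

/* backlog / backlog_limit, or 0 when no limit is configured. */
double backlog_utilization(const struct status_sample *s)
{
    if (s->backlog_limit == 0)
        return 0.0;
    return (double)s->backlog / (double)s->backlog_limit;
}

/* Nonzero if either threshold is crossed for this evaluation window. */
int should_alert(const struct status_sample *prev,
                 const struct status_sample *cur,
                 double max_lost_per_sec, double max_utilization)
{
    return lost_rate(prev, cur) > max_lost_per_sec ||
           backlog_utilization(cur) > max_utilization;
}
```

A caller would take a sample per evaluation window and call should_alert(prev, cur, N, 0.75). A check on backlog_wait_time_actual could follow the same rate-of-change pattern as lost.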

If we get an alert, and it is just a blip that goes away and doesn't come
back, we probably won't spend a lot of time investigating. However, if we
see that the alert is frequently active across multiple hosts, that will
prompt us to investigate. As far as what action we would take, it would
depend on the precise values in the audit status reply, as well as other
information we had gathered from our system. For example, if we observed
elevated values for backlog and backlog_wait_time_actual, we might first
investigate other environmental factors such as whether the auditd daemon
had crashed or was starved for CPU time. If we saw that lost was high but
backlog was low, that might indicate to us that the rate limit is being
exceeded, or that the kernel is out of memory.

I agree with you that it would help to expand the metrics reported in
audit status. For example, reporting the number of times an audit record was
lost due to rate limit being exceeded would help.

Not sure how responsive this is to your questions. Hope it helps some.

Thanks,
Max
--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2] audit: report audit wait metric in audit status reply

2020-12-07 Thread Max Englander
On Mon, Dec 7, 2020 at 4:21 PM Richard Guy Briggs  wrote:

> On 2020-12-07 16:13, Max Englander wrote:
> > On Fri, Dec 4, 2020 at 3:41 PM Paul Moore  wrote:
> >
> > > On Thu, Dec 3, 2020 at 9:47 PM Steve Grubb  wrote:
> > > > On Thursday, December 3, 2020 9:16:52 PM EST Paul Moore wrote:
> > > > > > > > Author: Richard Guy Briggs 
> > > > > > > > AuthorDate: 2014-11-17 15:51:01 -0500
> > > > > > > > Commit: Paul Moore 
> > > > > > > > CommitDate: 2014-11-17 16:53:51 -0500
> > > > > > > > ("audit: convert status version to a feature bitmap")
> > > > > > > > > It was introduced specifically to enable distributions to
> > > > > > > > > selectively backport features.  It was converted away from
> > > > > > > > > AUDIT_VERSION.
> > > > > > > >
> > > > > > > > There are other ways to detect the presence of
> > > > > > > > backlog_wait_time_actual
> > > > > > > > as I mentioned above.
> > > > > > >
> > > > > > > > Let me be blunt - I honestly don't care what Steve's audit
> > > > > > > > userspace does to detect this.  I've got my own opinion, but
> > > > > > > > Steve's audit userspace is not my project to manage and I think
> > > > > > > > we've established over the years that Steve and I have very
> > > > > > > > different views on what constitutes good design.
> > > > > >
> > > > > > > And guessing what might be in buffers of different sizes is good
> > > > > > > design?
> > > > > > > The FEATURE_BITMAP was introduced to get rid of this ambiguity.
> > > > >
> > > > > There is just soo much to unpack in your comment Steve, but let me
> > > > > keep it short ...
> > > > >
> > > > > > - This is an enterprise distro problem, not an upstream problem.
> > > > > > The problems you are talking about are not a problem for upstream.
> > > >
> > > > > You may look at it that way. I do not. Audit -userspace is also an
> > > > > upstream for a lot of distros and I need to make this painless for
> > > > > them. So, while you may think of this being a backport problem for
> > > > > Red Hat to solve, I think of this as a generic problem that I'd like
> > > > > to solve for Debian, Suse, Ubuntu, Arch, Gentoo, anyone using audit.
> > > > > We both are upstream.
> > >
> > > I intentionally said "enterprise Linux distributions", I never singled
> > > out RH/IBM.  Contrary to what RH/IBM marketing may have me believe, I
> > > don't consider RHEL to be the only "enterprise Linux distribution" :)
> > >
> > > Beyond that, while I haven't looked at all of the distros you list
> > > above, I know a few of them typically only backport fixes, not new
> > > features.  Further, as I mentioned previously in this thread, there is
> > > a way to backport this feature in a safe manner without using the
> > > feature bits.  Even further, if there wasn't a way to backport
> > > > this feature safely (and let me stress again that you can backport this
> > > safely), I would still consider that to be a distro problem and not an
> > > upstream kernel problem.  The upstream kernel is not responsible for
> > > enabling or supporting arbitrary combinations of patches.
> > >
> > > --
> > > paul moore
> > > www.paul-moore.com
> > >
> > > --
> > > Linux-audit mailing list
> > > Linux-audit@redhat.com
> > > https://www.redhat.com/mailman/listinfo/linux-audit
> > >
> > >
> > Hi Steve, Paul,
> >
> > I'm replying with the Gmail UI since I don't have my Mutt setup handy, so
> > apologies for any formatting which doesn't align with the mailing list
> > best practices!
> >
> > First off, my apologies for being late to the thread, and for submitting
> > code to the kernel and user space which aren't playing nicely with each
> > other.
> >
> > It sounds like there's a decision to be made around whether or not to use
> > the bitmap feature flags, which I am probably not in a position to
> > help decide. However, I'm more than happy to fix my userspace PR so
> > that it does not rely on the feature flag space using

Re: [PATCH v2] audit: report audit wait metric in audit status reply

2020-12-07 Thread Max Englander
On Fri, Dec 4, 2020 at 3:41 PM Paul Moore  wrote:

> On Thu, Dec 3, 2020 at 9:47 PM Steve Grubb  wrote:
> > On Thursday, December 3, 2020 9:16:52 PM EST Paul Moore wrote:
> > > > > > Author: Richard Guy Briggs 
> > > > > > AuthorDate: 2014-11-17 15:51:01 -0500
> > > > > > Commit: Paul Moore 
> > > > > > CommitDate: 2014-11-17 16:53:51 -0500
> > > > > > ("audit: convert status version to a feature bitmap")
> > > > > > > It was introduced specifically to enable distributions to
> > > > > > > selectively backport features.  It was converted away from
> > > > > > > AUDIT_VERSION.
> > > > > >
> > > > > > There are other ways to detect the presence of
> > > > > > backlog_wait_time_actual
> > > > > > as I mentioned above.
> > > > >
> > > > > > Let me be blunt - I honestly don't care what Steve's audit
> > > > > > userspace does to detect this.  I've got my own opinion, but
> > > > > > Steve's audit userspace is not my project to manage and I think
> > > > > > we've established over the years that Steve and I have very
> > > > > > different views on what constitutes good design.
> > > >
> > > > > And guessing what might be in buffers of different sizes is good
> > > > > design?
> > > > > The FEATURE_BITMAP was introduced to get rid of this ambiguity.
> > >
> > > There is just soo much to unpack in your comment Steve, but let me
> > > keep it short ...
> > >
> > > - This is an enterprise distro problem, not an upstream problem.  The
> > > problems you are talking about are not a problem for upstream.
> >
> > > You may look at it that way. I do not. Audit -userspace is also an
> > > upstream for a lot of distros and I need to make this painless for
> > > them. So, while you may think of this being a backport problem for
> > > Red Hat to solve, I think of this as a generic problem that I'd like
> > > to solve for Debian, Suse, Ubuntu, Arch, Gentoo, anyone using audit.
> > > We both are upstream.
>
> I intentionally said "enterprise Linux distributions", I never singled
> out RH/IBM.  Contrary to what RH/IBM marketing may have me believe, I
> don't consider RHEL to be the only "enterprise Linux distribution" :)
>
> Beyond that, while I haven't looked at all of the distros you list
> above, I know a few of them typically only backport fixes, not new
> features.  Further, as I mentioned previously in this thread, there is
> a way to backport this feature in a safe manner without using the
> feature bits.  Even further, if there wasn't a way to backport
> this feature safely (and let me stress again that you can backport this
> safely), I would still consider that to be a distro problem and not an
> upstream kernel problem.  The upstream kernel is not responsible for
> enabling or supporting arbitrary combinations of patches.
>
> --
> paul moore
> www.paul-moore.com
>
>
>
Hi Steve, Paul,

I'm replying with the Gmail UI since I don't have my Mutt setup handy, so
apologies for any formatting which doesn't align with the mailing list best
practices!

First off, my apologies for being late to the thread, and for submitting
code to the kernel and user space which aren't playing nicely with each
other.

It sounds like there's a decision to be made around whether or not to use
the bitmap feature flags, which I am probably not in a position to
help decide. However, I'm more than happy to fix my userspace PR so
that it does not rely on the feature flag space using the approach Paul
outlined, in spite of the drawbacks, if that ends up being the decision.

Steve, I understand your preference to rely on the feature bitmap since it
is a more reliable way to determine the availability of a feature than
key size, but if you're open to Paul's recommendations in spite of the
drawbacks, I'll make the changes to my patch as soon as I can to unblock
your work.
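
For reference, the structure-size approach could look roughly like the sketch below. This is only an illustration under stated assumptions: struct status_reply is a local stand-in for the kernel's struct audit_status (the leading field order is my understanding of the header, and the real header wraps feature_bitmap in a union), and reply_has_wait_time_actual is a made-up helper name, not something in audit-userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the status reply layout (assumption, see above). */
struct status_reply {
    uint32_t mask;
    uint32_t enabled;
    uint32_t failure;
    uint32_t pid;
    uint32_t rate_limit;
    uint32_t backlog_limit;
    uint32_t lost;
    uint32_t backlog;
    uint32_t feature_bitmap;
    uint32_t backlog_wait_time;
    uint32_t backlog_wait_time_actual;  /* field added by this patch */
};

/* All members are 32-bit, so no padding is expected. */
static_assert(sizeof(struct status_reply) == 11 * sizeof(uint32_t),
              "unexpected padding in status_reply");

/* Nonzero if a reply payload of payload_len bytes is long enough to
 * contain the whole backlog_wait_time_actual field. */
int reply_has_wait_time_actual(size_t payload_len)
{
    return payload_len >=
           offsetof(struct status_reply, backlog_wait_time_actual) +
           sizeof(uint32_t);
}
```

An older kernel would send a shorter payload, so the check returns 0 and user space would simply not report the metric. The drawback Steve raises still applies: the size only says the field is present, not which backported features sit behind it.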

Separately, since there is tension between these two approaches
(structure size and bitmap), I wonder whether you, Paul and Steve, would
be open to a third way.

For example, I can imagine adding additional bitmap
spaces (FEATURE_BITMAP_2, FEATURE_BITMAP_3, etc.).
Alternately, I can imagine each feature being assigned a unique u64
ID, and user space programs querying the kernel to see whether a
particular feature is enabled.
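
To make the idea a bit more concrete, here is a purely hypothetical sketch in C that combines the two variants: numeric feature IDs mapped onto several 32-bit bitmap words. Nothing like this exists in the kernel today; struct feature_set, FEATURE_WORDS, feature_enable and feature_enabled are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: three words, standing in for FEATURE_BITMAP,
 * FEATURE_BITMAP_2 and FEATURE_BITMAP_3. */
#define FEATURE_WORDS 3

struct feature_set {
    uint32_t word[FEATURE_WORDS];
};

/* Mark feature `id` as supported. */
void feature_enable(struct feature_set *fs, uint64_t id)
{
    assert(id / 32 < FEATURE_WORDS);
    fs->word[id / 32] |= UINT32_C(1) << (id % 32);
}

/* Nonzero if feature `id` is supported. IDs beyond the known words
 * read as "not supported" rather than being an error. */
int feature_enabled(const struct feature_set *fs, uint64_t id)
{
    uint64_t w = id / 32;
    if (w >= FEATURE_WORDS)
        return 0;
    return (int)((fs->word[w] >> (id % 32)) & 1u);
}
```

The point of the sketch is only that a query by ID lets the set of features grow without user space guessing from buffer sizes; how the kernel would expose such a query is left open.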

I'm not familiar enough with the kernel to be able to judge how sound
either idea is (or if these have been considered and rejected in the past)
but if you all think a third way is viable, I'd be happy to start a separate
mailing thread to try to reconcile the competing requirements of the kernel
and userspace, and contribute code if we can find a solution.

Max

Re: [PATCH v3] audit: report audit wait metric in audit status reply

2020-07-21 Thread Max Englander
On Tue, Jul 21, 2020 at 11:26:53AM -0400, Paul Moore wrote:
> On Wed, Jul 15, 2020 at 9:30 PM Paul Moore  wrote:
> > On Wed, Jul 8, 2020 at 7:13 PM Paul Moore  wrote:
> > > On Sat, Jul 4, 2020 at 11:15 AM Max Englander  
> > > wrote:
> > > >
> > > > In environments where the preservation of audit events and predictable
> > > > usage of system memory are prioritized, admins may use a combination of
> > > > --backlog_wait_time and -b options at the risk of degraded performance
> > > > resulting from backlog waiting. In some cases, this risk may be
> > > > preferred to lost events or unbounded memory usage. Ideally, this risk
> > > > can be mitigated by making adjustments when backlog waiting is detected.
> > > >
> > > > However, detection can be difficult using the currently available
> > > > metrics. For example, an admin attempting to debug degraded performance
> > > > may falsely believe a full backlog indicates backlog waiting. It may
> > > > turn out the backlog frequently fills up but drains quickly.
> > > >
> > > > To make it easier to reliably track degraded performance to backlog
> > > > waiting, this patch makes the following changes:
> > > >
> > > > > Add a new field backlog_wait_time_actual to the audit status reply.
> > > > Initialize this field to zero. Add to this field the total time spent
> > > > by the current task on scheduled timeouts while the backlog limit is
> > > > exceeded. Reset field to zero upon request via AUDIT_SET.
> > > >
> > > > Tested on Ubuntu 18.04 using complementary changes to the
> > > > audit-userspace and audit-testsuite:
> > > > - https://github.com/linux-audit/audit-userspace/pull/134
> > > > - https://github.com/linux-audit/audit-testsuite/pull/97
> > > >
> > > > Signed-off-by: Max Englander 
> > > > ---
> > > > Patch changelogs between v1 and v2:
> > > > >   - Instead of printing a warning when backlog waiting occurs, add
> > > > >     duration of backlog waiting to cumulative sum, and report this
> > > > >     sum in audit status reply.
> > > >
> > > > Patch changelogs between v2 and v3:
> > > >  - Rename backlog_wait_sum to backlog_wait_time_actual.
> > > > >  - Drop unneeded and unwanted header flags
> > > > >    AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_SUM and
> > > > >    AUDIT_VERSION_BACKLOG_WAIT_SUM.
> > > > >  - Increment backlog_wait_time_actual counter after every call to
> > > > >    schedule_timeout rather than once after enqueuing (or losing) an
> > > > >    audit record.
> > > > >  - Add support for resetting backlog_wait_time_actual counter to zero
> > > > >    upon request via AUDIT_SET.
> > > >
> > > >  include/uapi/linux/audit.h | 18 +++---
> > > >  kernel/audit.c | 35 +--
> > > >  2 files changed, 36 insertions(+), 17 deletions(-)
> > >
> > > This looks okay to me, thanks for the fixes Max.
> > >
> > > Steve, does the associated userspace patch look okay to you?
> >
> > Steve, any comments on the userspace patch?  Did I miss a reply in my
> > inbox perhaps?
> >
> > If I don't see any feedback by the end of the week I'll plan on
> > merging this into audit/next.
> 
> It's been over two weeks with no comment, so I went ahead and merged
> this into audit/next.  Thanks for your patience Max!

Excellent, glad to hear it! Thank you (and Richard, Steve) for the
guidance and interesting discussion along the way.

> 
> -- 
> paul moore
> www.paul-moore.com




[PATCH v3] audit: report audit wait metric in audit status reply

2020-07-04 Thread Max Englander
In environments where the preservation of audit events and predictable
usage of system memory are prioritized, admins may use a combination of
--backlog_wait_time and -b options at the risk of degraded performance
resulting from backlog waiting. In some cases, this risk may be
preferred to lost events or unbounded memory usage. Ideally, this risk
can be mitigated by making adjustments when backlog waiting is detected.

However, detection can be difficult using the currently available
metrics. For example, an admin attempting to debug degraded performance
may falsely believe a full backlog indicates backlog waiting. It may
turn out the backlog frequently fills up but drains quickly.

To make it easier to reliably track degraded performance to backlog
waiting, this patch makes the following changes:

Add a new field backlog_wait_time_actual to the audit status reply.
Initialize this field to zero. Add to this field the total time spent
by the current task on scheduled timeouts while the backlog limit is
exceeded. Reset field to zero upon request via AUDIT_SET.

Tested on Ubuntu 18.04 using complementary changes to the
audit-userspace and audit-testsuite:
- https://github.com/linux-audit/audit-userspace/pull/134
- https://github.com/linux-audit/audit-testsuite/pull/97

Signed-off-by: Max Englander 
---
Patch changelogs between v1 and v2:
  - Instead of printing a warning when backlog waiting occurs, add
    duration of backlog waiting to cumulative sum, and report this
    sum in audit status reply.

Patch changelogs between v2 and v3:
 - Rename backlog_wait_sum to backlog_wait_time_actual.
 - Drop unneeded and unwanted header flags
   AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_SUM and
   AUDIT_VERSION_BACKLOG_WAIT_SUM.
 - Increment backlog_wait_time_actual counter after every call to
   schedule_timeout rather than once after enqueuing (or losing) an
   audit record.
 - Add support for resetting backlog_wait_time_actual counter to zero
   upon request via AUDIT_SET.

 include/uapi/linux/audit.h | 18 +++---
 kernel/audit.c | 35 +--
 2 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index a534d71e689a..92d72965ad44 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -332,14 +332,15 @@ enum {
 };
 
 /* Status symbols */
-   /* Mask values */
-#define AUDIT_STATUS_ENABLED   0x0001
-#define AUDIT_STATUS_FAILURE   0x0002
-#define AUDIT_STATUS_PID   0x0004
+   /* Mask values */
+#define AUDIT_STATUS_ENABLED   0x0001
+#define AUDIT_STATUS_FAILURE   0x0002
+#define AUDIT_STATUS_PID   0x0004
 #define AUDIT_STATUS_RATE_LIMIT        0x0008
-#define AUDIT_STATUS_BACKLOG_LIMIT 0x0010
-#define AUDIT_STATUS_BACKLOG_WAIT_TIME 0x0020
-#define AUDIT_STATUS_LOST  0x0040
+#define AUDIT_STATUS_BACKLOG_LIMIT 0x0010
+#define AUDIT_STATUS_BACKLOG_WAIT_TIME 0x0020
+#define AUDIT_STATUS_LOST  0x0040
+#define AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL  0x0080
 
 #define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT 0x0001
 #define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME 0x0002
@@ -466,6 +467,9 @@ struct audit_status {
__u32   feature_bitmap; /* bitmap of kernel audit features */
};
__u32   backlog_wait_time;/* message queue wait timeout */
+   __u32   backlog_wait_time_actual;/* time spent waiting while
+ * message limit exceeded
+ */
 };
 
 struct audit_features {
diff --git a/kernel/audit.c b/kernel/audit.c
index 87f31bf1f0a0..33c640fdacf7 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -136,6 +136,11 @@ u32 audit_sig_sid = 0;
 */
 static atomic_t audit_lost = ATOMIC_INIT(0);
 
+/* Monotonically increasing sum of time the kernel has spent
+ * waiting while the backlog limit is exceeded.
+ */
+static atomic_t audit_backlog_wait_time_actual = ATOMIC_INIT(0);
+
 /* Hash for inode-based rules */
 struct list_head audit_inode_hash[AUDIT_INODE_BUCKETS];
 
@@ -1193,17 +1198,18 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
case AUDIT_GET: {
struct audit_status s;
memset(&s, 0, sizeof(s));
-   s.enabled   = audit_enabled;
-   s.failure   = audit_failure;
+   s.enabled  = audit_enabled;
+   s.failure  = audit_failure;
/* NOTE: use pid_vnr() so the PID is relative to the current
 *   namespace */
-   s.pid   = auditd_pid_vnr();
-   s.rate_limit= audit_rate_limit

Re: [PATCH v2] audit: report audit wait metric in audit status reply

2020-07-03 Thread Max Englander
On Fri, Jul 03, 2020 at 05:29:49PM -0400, Richard Guy Briggs wrote:
> On 2020-07-02 16:42, Paul Moore wrote:
> > On Wed, Jul 1, 2020 at 5:32 PM Max Englander  
> > wrote:
> > >
> > > In environments where the preservation of audit events and predictable
> > > usage of system memory are prioritized, admins may use a combination of
> > > --backlog_wait_time and -b options at the risk of degraded performance
> > > resulting from backlog waiting. In some cases, this risk may be
> > > preferred to lost events or unbounded memory usage. Ideally, this risk
> > > can be mitigated by making adjustments when backlog waiting is detected.
> > >
> > > However, detection can be difficult using the currently available metrics.
> > > For example, an admin attempting to debug degraded performance may
> > > falsely believe a full backlog indicates backlog waiting. It may turn
> > > out the backlog frequently fills up but drains quickly.
> > >
> > > To make it easier to reliably track degraded performance to backlog
> > > waiting, this patch makes the following changes:
> > >
> > > Add a new field backlog_wait_sum to the audit status reply. Initialize
> > > this field to zero. Add to this field the total time spent by the
> > > current task on scheduled timeouts while the backlog limit is exceeded.
> > >
> > > Tested on Ubuntu 18.04 using complementary changes to the audit
> > > userspace: https://github.com/linux-audit/audit-userspace/pull/134.
> > >
> > > Signed-off-by: Max Englander 
> > > ---
> > >  Patch changelogs between v1 and v2:
> > >  - Instead of printing a warning when backlog waiting occurs, add
> > >    duration of backlog waiting to cumulative sum, and report this
> > >    sum in audit status reply.
> > >
> > >  include/uapi/linux/audit.h | 7 ++-
> > >  kernel/audit.c | 9 +
> > >  2 files changed, 15 insertions(+), 1 deletion(-)
> > 
> > Hi Max,
> > 
> > In general this looks better than the previous approach, but I do have
> > a few specific comments (inline).  It is also important that in addition
> > to the requisite userspace patch, we also need a test added to the
> > audit-testsuite project so we can verify this functionality in the
> > future.
> > 
> > * https://github.com/linux-audit/audit-testsuite
> > 
> > > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > > index a534d71e689a..ea0cc364beca 100644
> > > --- a/include/uapi/linux/audit.h
> > > +++ b/include/uapi/linux/audit.h
> > > @@ -340,6 +340,7 @@ enum {
> > >  #define AUDIT_STATUS_BACKLOG_LIMIT 0x0010
> > >  #define AUDIT_STATUS_BACKLOG_WAIT_TIME 0x0020
> > >  #define AUDIT_STATUS_LOST  0x0040
> > > +#define AUDIT_STATUS_BACKLOG_WAIT_SUM  0x0080
> > 
> > Sooo ... you've defined this, but I don't see any of the corresponding
> > AUDIT_SET code that I would expect, was that an oversight?  If not, it
> > is something we should support in the kernel as I'm sure admins will
> > want to reset this value at some point.
> 
> Have a look at the lost reset code as an example.  It is tricky since it
> does an atomic reset while delivering a value back up the control plane
> and issuing a record.  There were some fallout bug fixes because it
> wasn't as obvious as it looked.

Thank you for the suggestion. I've copied that approach in the (not yet
submitted) v3 patch.
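
For anyone following along, the accumulate-then-reset pattern can be sketched in user-space C as below. This is an analogy only, not the kernel code: it uses C11 <stdatomic.h> in place of the kernel's atomic_t helpers, and the function names are invented for the example. The part worth noting is that the reset is a single atomic exchange, so the value handed back and the zeroing cannot be separated by a concurrent add.

```c
#include <stdatomic.h>

/* Running sum of time spent waiting; static storage starts at zero. */
static atomic_uint wait_time_actual;

/* Called after each timed wait with the time actually spent waiting. */
void record_wait(unsigned int waited)
{
    atomic_fetch_add(&wait_time_actual, waited);
}

/* Value reported in a status reply. */
unsigned int read_wait_total(void)
{
    return atomic_load(&wait_time_actual);
}

/* Reset on request: returns the old total and stores zero in one
 * atomic step, so no concurrent record_wait() is lost in between. */
unsigned int reset_wait_total(void)
{
    return atomic_exchange(&wait_time_actual, 0);
}
```

A reader that did a load followed by a separate store of zero could drop whatever was added between the two steps, which is the kind of subtlety Richard is pointing at.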

> 
> > >  #define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT 0x0001
> > >  #define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME 0x0002
> > > @@ -348,6 +349,7 @@ enum {
> > >  #define AUDIT_FEATURE_BITMAP_SESSIONID_FILTER  0x0010
> > > >  #define AUDIT_FEATURE_BITMAP_LOST_RESET        0x0020
> > >  #define AUDIT_FEATURE_BITMAP_FILTER_FS 0x0040
> > > +#define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_SUM  0x0080
> > 
> > In an effort not to exhaust the feature bitmap too quickly, I've been
> > restricting it to only those features that would cause breakage with
> > userspace.  I haven't looked closely at Steve's userspace in quite a
> > while, but I'm guessing it can key off the structure size and doesn't
> > need this entry in the bitmap, right?  Let me rephrase, if userspace
> > needs to key off anything, it *should* key off the structure size and
> > not a new flag in the bitmask ;)
> 
> It could key solely off the existence o

Re: [PATCH v2] audit: report audit wait metric in audit status reply

2020-07-03 Thread Max Englander
On Thu, Jul 02, 2020 at 04:42:13PM -0400, Paul Moore wrote:
> On Wed, Jul 1, 2020 at 5:32 PM Max Englander  wrote:
> >
> > In environments where the preservation of audit events and predictable
> > usage of system memory are prioritized, admins may use a combination of
> > --backlog_wait_time and -b options at the risk of degraded performance
> > resulting from backlog waiting. In some cases, this risk may be
> > preferred to lost events or unbounded memory usage. Ideally, this risk
> > can be mitigated by making adjustments when backlog waiting is detected.
> >
> > However, detection can be difficult using the currently available metrics.
> > For example, an admin attempting to debug degraded performance may
> > falsely believe a full backlog indicates backlog waiting. It may turn
> > out the backlog frequently fills up but drains quickly.
> >
> > To make it easier to reliably track degraded performance to backlog
> > waiting, this patch makes the following changes:
> >
> > Add a new field backlog_wait_sum to the audit status reply. Initialize
> > this field to zero. Add to this field the total time spent by the
> > current task on scheduled timeouts while the backlog limit is exceeded.
> >
> > Tested on Ubuntu 18.04 using complementary changes to the audit
> > userspace: https://github.com/linux-audit/audit-userspace/pull/134.
> >
> > Signed-off-by: Max Englander 
> > ---
> >  Patch changelogs between v1 and v2:
> >  - Instead of printing a warning when backlog waiting occurs, add
> >    duration of backlog waiting to cumulative sum, and report this
> >    sum in audit status reply.
> >
> >  include/uapi/linux/audit.h | 7 ++-
> >  kernel/audit.c | 9 +
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> Hi Max,
> 
> In general this looks better than the previous approach, but I do have
> a few specific comments (inline).  It is also important that in addition

Thanks for your feedback and comments, Paul. I've prepared a v3 patch
addressing all of your comments, with corresponding changes to the
audit-userspace, which I'll submit once I get a working change to
the test suite. I'll use this thread to respond to some things and ask
a question.

> to the requisite userspace patch, we also need a test added to the
> audit-testsuite project so we can verify this functionality in the
> future.
> 
> * https://github.com/linux-audit/audit-testsuite
> 

I downloaded this test suite and attempted to run it on Ubuntu 18.04,
with the latest audit kernel tree and latest audit-userspace. Many tests
were failing, I assume because I have some issues with my environment.
May I ask what environment (OS, tree, commit) you recommend for running
the test suite? Happy to move this question over to GitHub if that's a
better venue.

> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > index a534d71e689a..ea0cc364beca 100644
> > --- a/include/uapi/linux/audit.h
> > +++ b/include/uapi/linux/audit.h
> > @@ -340,6 +340,7 @@ enum {
> >  #define AUDIT_STATUS_BACKLOG_LIMIT 0x0010
> >  #define AUDIT_STATUS_BACKLOG_WAIT_TIME 0x0020
> >  #define AUDIT_STATUS_LOST  0x0040
> > +#define AUDIT_STATUS_BACKLOG_WAIT_SUM  0x0080
> 
> Sooo ... you've defined this, but I don't see any of the corresponding
> AUDIT_SET code that I would expect, was that an oversight?  If not, it
> is something we should support in the kernel as I'm sure admins will
> want to reset this value at some point.

To be honest I had based this patch off v1 which included a flag for
setting the backlog warn time threshold, but didn't remove it from the
v2 patch (an oversight). I wasn't thinking about admins' need to reset
the value, but since you suggested it I've included support for that in
the v3 patch.

> 
> >  #define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT 0x0001
> >  #define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME 0x0002
> > @@ -348,6 +349,7 @@ enum {
> >  #define AUDIT_FEATURE_BITMAP_SESSIONID_FILTER  0x0010
> >  #define AUDIT_FEATURE_BITMAP_LOST_RESET        0x0020
> >  #define AUDIT_FEATURE_BITMAP_FILTER_FS 0x0040
> > +#define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_SUM  0x0080
> 
> In an effort not to exhaust the feature bitmap too quickly, I've been
> restricting it to only those features that would cause breakage with
> userspace.  I haven't looked closely at Steve's userspace in quite a
> while, but I'm guessing it can key off the structure size and doesn't
> need this entry in the bitmap, right?  Let me rephrase, if userspace
> needs to key off anything, it *should* key off the structure size 

[PATCH v2] audit: report audit wait metric in audit status reply

2020-07-01 Thread Max Englander
In environments where the preservation of audit events and predictable
usage of system memory are prioritized, admins may use a combination of
--backlog_wait_time and -b options at the risk of degraded performance
resulting from backlog waiting. In some cases, this risk may be
preferred to lost events or unbounded memory usage. Ideally, this risk
can be mitigated by making adjustments when backlog waiting is detected.

However, detection can be difficult using the currently available metrics.
For example, an admin attempting to debug degraded performance may
falsely believe a full backlog indicates backlog waiting. It may turn
out the backlog frequently fills up but drains quickly.

To make it easier to reliably track degraded performance to backlog
waiting, this patch makes the following changes:

Add a new field backlog_wait_sum to the audit status reply. Initialize
this field to zero. Add to this field the total time spent by the
current task on scheduled timeouts while the backlog limit is exceeded.

Tested on Ubuntu 18.04 using complementary changes to the audit
userspace: https://github.com/linux-audit/audit-userspace/pull/134.

Signed-off-by: Max Englander 
---
 Patch changelogs between v1 and v2:
 - Instead of printing a warning when backlog waiting occurs, add
   duration of backlog waiting to cumulative sum, and report this
   sum in audit status reply.

 include/uapi/linux/audit.h | 7 ++-
 kernel/audit.c | 9 +
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index a534d71e689a..ea0cc364beca 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -340,6 +340,7 @@ enum {
 #define AUDIT_STATUS_BACKLOG_LIMIT 0x0010
 #define AUDIT_STATUS_BACKLOG_WAIT_TIME 0x0020
 #define AUDIT_STATUS_LOST  0x0040
+#define AUDIT_STATUS_BACKLOG_WAIT_SUM  0x0080
 
 #define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT 0x0001
 #define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME 0x0002
@@ -348,6 +349,7 @@ enum {
 #define AUDIT_FEATURE_BITMAP_SESSIONID_FILTER  0x0010
 #define AUDIT_FEATURE_BITMAP_LOST_RESET 0x0020
 #define AUDIT_FEATURE_BITMAP_FILTER_FS 0x0040
+#define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_SUM  0x0080
 
 #define AUDIT_FEATURE_BITMAP_ALL (AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT | \
  AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME | \
@@ -355,12 +357,14 @@ enum {
  AUDIT_FEATURE_BITMAP_EXCLUDE_EXTEND | \
  AUDIT_FEATURE_BITMAP_SESSIONID_FILTER | \
  AUDIT_FEATURE_BITMAP_LOST_RESET | \
- AUDIT_FEATURE_BITMAP_FILTER_FS)
+ AUDIT_FEATURE_BITMAP_FILTER_FS | \
+ AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_SUM)
 
 /* deprecated: AUDIT_VERSION_* */
 #define AUDIT_VERSION_LATEST   AUDIT_FEATURE_BITMAP_ALL
 #define AUDIT_VERSION_BACKLOG_LIMIT AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT
 #define AUDIT_VERSION_BACKLOG_WAIT_TIME AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME
+#define AUDIT_VERSION_BACKLOG_WAIT_SUM AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_SUM
 
/* Failure-to-log actions */
 #define AUDIT_FAIL_SILENT  0
@@ -466,6 +470,7 @@ struct audit_status {
__u32   feature_bitmap; /* bitmap of kernel audit features */
};
__u32   backlog_wait_time;/* message queue wait timeout */
+   __u32   backlog_wait_sum;/* time spent waiting while message limit exceeded */
 };
 
 struct audit_features {
diff --git a/kernel/audit.c b/kernel/audit.c
index 87f31bf1f0a0..301ea4f3d750 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -136,6 +136,11 @@ u32 audit_sig_sid = 0;
 */
 static atomic_t audit_lost = ATOMIC_INIT(0);
 
+/* Monotonically increasing sum of time the kernel has spent
+ * waiting while the backlog limit is exceeded.
+ */
+static atomic_t audit_backlog_wait_sum = ATOMIC_INIT(0);
+
 /* Hash for inode-based rules */
 struct list_head audit_inode_hash[AUDIT_INODE_BUCKETS];
 
@@ -1204,6 +1209,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
s.backlog   = skb_queue_len(&audit_queue);
s.feature_bitmap= AUDIT_FEATURE_BITMAP_ALL;
s.backlog_wait_time = audit_backlog_wait_time;
+   s.backlog_wait_sum  = atomic_read(&audit_backlog_wait_sum);
audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s));
break;
}
@@ -1794,6 +1800,9 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
return NULL;
}
}
+
+   if (stime != audit_backlog_wait_time)
+   atomic_add(audit_backlog_wait_time - stime

Re: [PATCH] audit: optionally print warning after waiting to enqueue record

2020-06-24 Thread Max Englander
On Tue, Jun 23, 2020 at 08:15:59PM -0400, Paul Moore wrote:
> On Thu, Jun 18, 2020 at 8:30 PM Richard Guy Briggs  wrote:
> > On 2020-06-18 23:48, Max Englander wrote:
> > > In case you’re any more receptive to the idea, I thought I’d mention
> > > that the need this patch addresses would be just as well fulfilled if
> > > wait times were reported in the audit status response along with other
> > > currently reported metrics like backlog length and lost events. Wait
> > > times could be reported as a cumulative sum, a moving average, or in
> > > some other way, and would help directly implicate or rule out backlog
> > > waiting as the cause in the event that an admin is faced with debugging
> > > degraded kernel performance. It would eliminate the need for a new flag,
> > > and fit well with the userspace tooling approach you suggested above.
> >
> > Such as is captured in this upstream issue from 3 years ago:
> >
> > https://github.com/linux-audit/audit-kernel/issues/63
> > "RFE: add kernel audit queue statistics"
> 
> I would be more open to the idea of reporting queue statistics as part
> of the audit status information, or similar.
> 
> -- 
> paul moore
> www.paul-moore.com

Excellent, I'll send a v2 patch.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] audit: optionally print warning after waiting to enqueue record

2020-06-18 Thread Max Englander
On Wed, Jun 17, 2020 at 09:06:27PM -0400, Paul Moore wrote:
> On Wed, Jun 17, 2020 at 6:54 PM Max Englander  wrote:
> > On Wed, Jun 17, 2020 at 02:47:19PM -0400, Paul Moore wrote:
> > > > On Tue, Jun 16, 2020 at 12:58 AM Max Englander wrote:
> > > >
> > > > In environments where security is prioritized, users may set
> > > > --backlog_wait_time to a high value in order to reduce the likelihood
> > > > that any audit event is lost, even though doing so may result in
> > > > unpredictable performance if the kernel schedules a timeout when the
> > > > backlog limit is exceeded. For these users, the next best thing to
> > > > predictable performance is the ability to quickly detect and react to
> > > > degraded performance. This patch proposes to aid the detection of kernel
> > > > audit subsystem pauses through the following changes:
> > > >
> > > > Add a variable named audit_backlog_warn_time. Enforce the value of this
> > > > variable to be no less than zero, and no more than the value of
> > > > audit_backlog_wait_time.
> > > >
> > > > If audit_backlog_warn_time is greater than zero and if the total time
> > > > spent waiting to enqueue an audit record is greater than or equal to
> > > > audit_backlog_warn_time, then print a warning with the total time
> > > > spent waiting.
> > > >
> > > > An example configuration:
> > > >
> > > > auditctl --backlog_warn_time 50
> > > >
> > > > An example warning message:
> > > >
> > > > audit: sleep_time=52 >= audit_backlog_warn_time=50
> > > >
> > > > Tested on Ubuntu 18.04.04 using complementary changes to the audit
> > > > userspace: https://github.com/linux-audit/audit-userspace/pull/131.
> > > >
> > > > Signed-off-by: Max Englander 
> > > > ---
> > > >  include/uapi/linux/audit.h |  7 ++-
> > > >  kernel/audit.c | 35 +++
> > > >  2 files changed, 41 insertions(+), 1 deletion(-)
> > >
> > > If an admin is prioritizing security, aka don't lose any audit
> > > records, and there is a concern over variable system latency due to an
> > > audit queue backlog, why not simply disable the backlog limit?
> > >
> > > --
> > > paul moore
> > > www.paul-moore.com
> >
> > That’s good in some cases, but in other cases unbounded growth of the
> > backlog could result in memory issues. If the kernel runs out of memory
> > it would drop the audit event or possibly have other problems. It could
> > also consume memory in a way that starves user workloads or causes
> > them to be killed by the OOMKiller.
> >
> > To refine my motivating use case a bit, if a Kubernetes admin wants to
> > prioritize security, and also avoid unbounded growth of the audit
> > backlog, they may set -b and --backlog_wait_time in a way that limits
> > kernel memory usage and reduces the likelihood that any audit event is
> > lost. Occasional performance degradation may be acceptable to the admin,
> > but they would like a way to be alerted to prolonged kernel pauses, so
> > that they can investigate and take corrective action (increase backlog,
> > increase server capacity, move some workloads to other servers, etc.).
> >
> > To state it another way: the kernel can currently be configured to print a
> > message when the backlog limit is exceeded and it must discard the audit
> > event. This is a useful message for admins, which they can address with
> > corrective action. I think a message similar to the one proposed by this
> > patch would be equally useful when the backlog limit is exceeded and the
> > kernel is configured to wait for the backlog to drain. Admins could
> > address that message in the same way, but without the cost of lost audit
> > events.
> 
> I'm still struggling to understand how this is any better than
> disabling the backlog limit, or setting it very high, and simply
> monitoring the size of the audit backlog.  This way the admin
> doesn't have to worry about the latency issues of a full backlog,
> while still being able to trigger actions based on the state of the
> backlog.  The userspace tooling/scripting to watch the backlog size
> would be trivial, and would arguably provide much better visibility
> into the backlog state than a single warning threshold in the kernel.
> 
> -- 
> paul moore
> www.paul-moore.com

Re: [PATCH] audit: optionally print warning after waiting to enqueue record

2020-06-18 Thread Max Englander
On Thu, Jun 18, 2020 at 09:39:08AM -0400, Steve Grubb wrote:
> On Wednesday, June 17, 2020 6:54:16 PM EDT Max Englander wrote:
> > On Wed, Jun 17, 2020 at 02:47:19PM -0400, Paul Moore wrote:
> > > > On Tue, Jun 16, 2020 at 12:58 AM Max Englander wrote:
> > > > In environments where security is prioritized, users may set
> > > > --backlog_wait_time to a high value in order to reduce the likelihood
> > > > that any audit event is lost, even though doing so may result in
> > > > unpredictable performance if the kernel schedules a timeout when the
> > > > backlog limit is exceeded. For these users, the next best thing to
> > > > predictable performance is the ability to quickly detect and react to
> > > > > degraded performance. This patch proposes to aid the detection of kernel
> > > > > audit subsystem pauses through the following changes:
> > > > 
> > > > Add a variable named audit_backlog_warn_time. Enforce the value of this
> > > > variable to be no less than zero, and no more than the value of
> > > > audit_backlog_wait_time.
> > > > 
> > > > If audit_backlog_warn_time is greater than zero and if the total time
> > > > spent waiting to enqueue an audit record is greater than or equal to
> > > > audit_backlog_warn_time, then print a warning with the total time
> > > > spent waiting.
> > > > 
> > > > An example configuration:
> > > > auditctl --backlog_warn_time 50
> > > > 
> > > > An example warning message:
> > > > audit: sleep_time=52 >= audit_backlog_warn_time=50
> > > > 
> > > > Tested on Ubuntu 18.04.04 using complementary changes to the audit
> > > > userspace: https://github.com/linux-audit/audit-userspace/pull/131.
> > > > 
> > > > Signed-off-by: Max Englander 
> > > > ---
> > > > 
> > > >  include/uapi/linux/audit.h |  7 ++-
> > > >  kernel/audit.c | 35 +++
> > > >  2 files changed, 41 insertions(+), 1 deletion(-)
> > > 
> > > > If an admin is prioritizing security, aka don't lose any audit
> > > records, and there is a concern over variable system latency due to an
> > > audit queue backlog, why not simply disable the backlog limit?
> > 
> > That’s good in some cases, but in other cases unbounded growth of the
> > backlog could result in memory issues. If the kernel runs out of memory
> > it would drop the audit event or possibly have other problems. It could
> > > also consume memory in a way that starves user workloads or causes
> > them to be killed by the OOMKiller.
> 
> The kernel cannot grow the backlog unbounded. If you do nothing, the backlog 
> is 64 - which is too small to really use. Otherwise, you set the backlog to a 
> finite number with the -b option.
> 
> > To refine my motivating use case a bit, if a Kubernetes admin wants to
> > prioritize security, and also avoid unbounded growth of the audit
> > backlog, they may set -b and --backlog_wait_time in a way that limits
> > kernel memory usage and reduces the likelihood that any audit event is
> > lost. Occasional performance degradation may be acceptable to the admin,
> > but they would like a way to be alerted to prolonged kernel pauses, so
> > that they can investigate and take corrective action (increase backlog,
> > increase server capacity, move some workloads to other servers, etc.).
> > 
> > To state it another way: the kernel can currently be configured to print a
> > message when the backlog limit is exceeded and it must discard the audit
> > event. This is a useful message for admins, which they can address with
> > corrective action. I think a message similar to the one proposed by this
> > patch would be equally useful when the backlog limit is exceeded and the
> > kernel is configured to wait for the backlog to drain. Admins could
> > address that message in the same way, but without the cost of lost audit
> > events.
> 
> If backlog wait time is exceeded, that could be a useful warning if that does 
> not exist. I don't know how often that could happen...and of course without a 
> warning we don't know if it happens or why it happens.
  
What you’re describing already exists, if I’m reading your words right.
In the event that the backlog wait time limit is exceeded, the -f flag
is consulted, and, if the value of -f is 1, then an error message
stating that the backlog limit is exceeded is printed. This is also true
when the backlog wait time is zero.

What I am suggesting is that even if the backlog wait time is not
exceeded, it would be useful for the kernel to report when backlog
waiting occurs as a way to help identify degraded kernel performance.
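
To make the distinction concrete, here is a simplified model of the
possible outcomes of one enqueue attempt. It is a sketch under stated
assumptions, not the kernel's code path: the enum and function names
are hypothetical, and needed_wait stands in for however long the task
would have to sleep before the backlog drains. The fail mode values
mirror the -f settings (0 silent, 1 printk, 2 panic).

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the -f values: 0 = silent, 1 = printk, 2 = panic. */
enum fail_mode { FAIL_SILENT = 0, FAIL_PRINTK = 1, FAIL_PANIC = 2 };

/* Hypothetical outcome labels for one attempt to enqueue a record. */
enum outcome {
	QUEUED_NO_WAIT,    /* backlog limit not exceeded */
	QUEUED_AFTER_WAIT, /* waited, then enqueued: no message today */
	LOST_SILENT,       /* record lost, -f 0 */
	LOST_WITH_MESSAGE, /* record lost, -f 1: existing message */
	LOST_PANIC         /* record lost, -f 2 */
};

/*
 * over_limit:  nonzero if the backlog limit is currently exceeded
 * wait_budget: configured backlog wait time (0 disables waiting)
 * needed_wait: time the task would need to sleep until the queue drains
 */
static enum outcome classify(int over_limit, uint32_t wait_budget,
			     uint32_t needed_wait, enum fail_mode mode)
{
	assert(mode == FAIL_SILENT || mode == FAIL_PRINTK ||
	       mode == FAIL_PANIC);

	if (!over_limit)
		return QUEUED_NO_WAIT;
	if (wait_budget > 0 && needed_wait <= wait_budget)
		return QUEUED_AFTER_WAIT;

	/* Wait budget is zero or exhausted: the record is lost. */
	switch (mode) {
	case FAIL_PRINTK:
		return LOST_WITH_MESSAGE;
	case FAIL_PANIC:
		return LOST_PANIC;
	default:
		return LOST_SILENT;
	}
}
```

In this model, the QUEUED_AFTER_WAIT case is the one I would like the
kernel to report: the task was delayed but nothing is logged, because
no record was lost.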

> I also wished we had metrics on the backlog such as max used. That might help 
> admins tune the size of the backlog.
> 
> -Steve
> 
> 

Re: [PATCH] audit: optionally print warning after waiting to enqueue record

2020-06-17 Thread Max Englander
On Wed, Jun 17, 2020 at 02:47:19PM -0400, Paul Moore wrote:
> On Tue, Jun 16, 2020 at 12:58 AM Max Englander wrote:
> >
> > In environments where security is prioritized, users may set
> > --backlog_wait_time to a high value in order to reduce the likelihood
> > that any audit event is lost, even though doing so may result in
> > unpredictable performance if the kernel schedules a timeout when the
> > backlog limit is exceeded. For these users, the next best thing to
> > predictable performance is the ability to quickly detect and react to
> > degraded performance. This patch proposes to aid the detection of kernel
> > audit subsystem pauses through the following changes:
> >
> > Add a variable named audit_backlog_warn_time. Enforce the value of this
> > variable to be no less than zero, and no more than the value of
> > audit_backlog_wait_time.
> >
> > If audit_backlog_warn_time is greater than zero and if the total time
> > spent waiting to enqueue an audit record is greater than or equal to
> > audit_backlog_warn_time, then print a warning with the total time
> > spent waiting.
> >
> > An example configuration:
> >
> > auditctl --backlog_warn_time 50
> >
> > An example warning message:
> >
> > audit: sleep_time=52 >= audit_backlog_warn_time=50
> >
> > Tested on Ubuntu 18.04.04 using complementary changes to the audit
> > userspace: https://github.com/linux-audit/audit-userspace/pull/131.
> >
> > Signed-off-by: Max Englander 
> > ---
> >  include/uapi/linux/audit.h |  7 ++-
> >  kernel/audit.c | 35 +++
> >  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> If an admin is prioritizing security, aka don't lose any audit
> records, and there is a concern over variable system latency due to an
> audit queue backlog, why not simply disable the backlog limit?
> 
> -- 
> paul moore
> www.paul-moore.com

That’s good in some cases, but in other cases unbounded growth of the
backlog could result in memory issues. If the kernel runs out of memory
it would drop the audit event or possibly have other problems. It could
also consume memory in a way that starves user workloads or causes
them to be killed by the OOMKiller.

To refine my motivating use case a bit, if a Kubernetes admin wants to
prioritize security, and also avoid unbounded growth of the audit
backlog, they may set -b and --backlog_wait_time in a way that limits
kernel memory usage and reduces the likelihood that any audit event is
lost. Occasional performance degradation may be acceptable to the admin,
but they would like a way to be alerted to prolonged kernel pauses, so
that they can investigate and take corrective action (increase backlog,
increase server capacity, move some workloads to other servers, etc.).

To state it another way: the kernel can currently be configured to print a
message when the backlog limit is exceeded and it must discard the audit
event. This is a useful message for admins, which they can address with
corrective action. I think a message similar to the one proposed by this
patch would be equally useful when the backlog limit is exceeded and the
kernel is configured to wait for the backlog to drain. Admins could
address that message in the same way, but without the cost of lost audit
events.

[PATCH] audit: optionally print warning after waiting to enqueue record

2020-06-15 Thread Max Englander
In environments where security is prioritized, users may set
--backlog_wait_time to a high value in order to reduce the likelihood
that any audit event is lost, even though doing so may result in
unpredictable performance if the kernel schedules a timeout when the
backlog limit is exceeded. For these users, the next best thing to
predictable performance is the ability to quickly detect and react to
degraded performance. This patch proposes to aid the detection of kernel
audit subsystem pauses through the following changes:

Add a variable named audit_backlog_warn_time. Enforce the value of this
variable to be no less than zero, and no more than the value of
audit_backlog_wait_time.

If audit_backlog_warn_time is greater than zero and if the total time
spent waiting to enqueue an audit record is greater than or equal to
audit_backlog_warn_time, then print a warning with the total time
spent waiting.

An example configuration:

auditctl --backlog_warn_time 50

An example warning message:

audit: sleep_time=52 >= audit_backlog_warn_time=50
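
A minimal sketch of the threshold rule described above, written as
standalone C rather than kernel code. The helper names are hypothetical,
and treating an out-of-range value as invalid (rather than clamping it)
is an assumption; only the two conditions and the message format come
from this description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical validator: warn_time must lie in [0, wait_time]. */
static int warn_time_valid(uint32_t warn_time, uint32_t wait_time)
{
	return warn_time <= wait_time;
}

/* Warn only if the threshold is enabled (> 0) and has been reached. */
static int should_warn(uint32_t warn_time, uint32_t sleep_time)
{
	return warn_time > 0 && sleep_time >= warn_time;
}

/*
 * Writes the warning into buf when one is due and returns the value
 * of snprintf(); returns 0 and leaves buf empty when no warning is due.
 */
static int format_warning(char *buf, size_t len, uint32_t warn_time,
			  uint32_t sleep_time)
{
	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	if (!should_warn(warn_time, sleep_time))
		return 0;
	return snprintf(buf, len,
			"audit: sleep_time=%u >= audit_backlog_warn_time=%u",
			(unsigned int)sleep_time, (unsigned int)warn_time);
}
```

With warn_time=50, a wait of 52 produces the message shown above, a
wait of 49 produces nothing, and warn_time=0 disables the warning
regardless of how long the wait was.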

Tested on Ubuntu 18.04.04 using complementary changes to the audit
userspace: https://github.com/linux-audit/audit-userspace/pull/131.

Signed-off-by: Max Englander 
---
 include/uapi/linux/audit.h |  7 ++-
 kernel/audit.c | 35 +++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index a534d71e689a..e3e021047fdc 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -340,6 +340,7 @@ enum {
 #define AUDIT_STATUS_BACKLOG_LIMIT 0x0010
 #define AUDIT_STATUS_BACKLOG_WAIT_TIME 0x0020
 #define AUDIT_STATUS_LOST  0x0040
+#define AUDIT_STATUS_BACKLOG_WARN_TIME 0x0080
 
 #define AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT 0x0001
 #define AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME 0x0002
@@ -348,6 +349,7 @@ enum {
 #define AUDIT_FEATURE_BITMAP_SESSIONID_FILTER  0x0010
 #define AUDIT_FEATURE_BITMAP_LOST_RESET 0x0020
 #define AUDIT_FEATURE_BITMAP_FILTER_FS 0x0040
+#define AUDIT_FEATURE_BITMAP_BACKLOG_WARN_TIME 0x0080
 
 #define AUDIT_FEATURE_BITMAP_ALL (AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT | \
  AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME | \
@@ -355,12 +357,14 @@ enum {
  AUDIT_FEATURE_BITMAP_EXCLUDE_EXTEND | \
  AUDIT_FEATURE_BITMAP_SESSIONID_FILTER | \
  AUDIT_FEATURE_BITMAP_LOST_RESET | \
- AUDIT_FEATURE_BITMAP_FILTER_FS)
+ AUDIT_FEATURE_BITMAP_FILTER_FS | \
+ AUDIT_FEATURE_BITMAP_BACKLOG_WARN_TIME)
 
 /* deprecated: AUDIT_VERSION_* */
 #define AUDIT_VERSION_LATEST   AUDIT_FEATURE_BITMAP_ALL
 #define AUDIT_VERSION_BACKLOG_LIMIT AUDIT_FEATURE_BITMAP_BACKLOG_LIMIT
 #define AUDIT_VERSION_BACKLOG_WAIT_TIME AUDIT_FEATURE_BITMAP_BACKLOG_WAIT_TIME
+#define AUDIT_VERSION_BACKLOG_WARN_TIME AUDIT_FEATURE_BITMAP_BACKLOG_WARN_TIME
 
/* Failure-to-log actions */
 #define AUDIT_FAIL_SILENT  0
@@ -466,6 +470,7 @@ struct audit_status {
__u32   feature_bitmap; /* bitmap of kernel audit features */
};
__u32   backlog_wait_time;/* message queue wait timeout */
+   __u32   backlog_warn_time;/* message queue warn threshold */
 };
 
 struct audit_features {
diff --git a/kernel/audit.c b/kernel/audit.c
index 87f31bf1f0a0..4a5437cfe61f 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -122,6 +122,12 @@ static u32 audit_backlog_limit = 64;
 #define AUDIT_BACKLOG_WAIT_TIME (60 * HZ)
 static u32 audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME;
 
+/* If audit_backlog_wait_time is non-zero, and the kernel waits
+ * for audit_backlog_warn_time or more to enqueue audit record,
+ * a warning will be printed with the duration of the wait
+ */
+static u32 audit_backlog_warn_time;
+
 /* The identity of the user shutting down the audit system. */
 kuid_t audit_sig_uid = INVALID_UID;
 pid_t  audit_sig_pid = -1;
@@ -439,6 +445,12 @@ static int audit_set_backlog_wait_time(u32 timeout)
  &audit_backlog_wait_time, timeout);
 }
 
+static int audit_set_backlog_warn_time(u32 warn_time)
+{
+   return audit_do_config_change("audit_backlog_warn_time",
+ &audit_backlog_warn_time, warn_time);
+}
+
 static int audit_set_enabled(u32 state)
 {
int rc;
@@ -1204,6 +1216,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
s.backlog   = skb_queue_len(&audit_queue);
s.feature_bitmap= AUDIT_FEATURE_BITMAP_ALL;
s.backlog_wait_time = audit_backlog_wait_time;
+   s.back