Re: Trying to understand audisp-remote network behavior

2022-07-12 Thread Lenny Bruzenak

On 7/12/22 12:57, Ken Hornstein wrote:


>>> Well, the default configuration is that heartbeats are turned off, so
>>> the general impression I would take away from that is you should only
>>> turn on heartbeats if you have some unusual requirement.
>>
>> This has to be coordinated between the client and server as many of these
>> settings need to be. I can add some discussion to the man page that this is
>> recommended.
>
> Errr ... does it?
>
> I certainly turned them on on all of our clients but did not turn
> them on on our server.  Did not cause any problems.  I mean, yes, I could
> see that turning them on on the server might be helpful, but it doesn't
> seem to be required to make them work; from my reading of the code,
> the server will respond to a heartbeat message whether or not they
> are configured, and since connections all initiate from the clients
> that's the end that has to notice the connection has dropped.


I think what Steve was referring to is the tcp_client_max_idle setting, 
which has a man page item saying it needs to be higher than the 
heartbeat setting on the sending side.
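
As a rough illustration of that relationship, the two settings might be paired as below. The 120-second heartbeat is the value Ken mentions elsewhere in this thread; the 240-second idle limit and the file locations are assumptions for illustration only (config paths differ between audit versions), not recommended values.

```
# Client side: audisp-remote.conf
heartbeat_timeout = 120

# Server side: auditd.conf
# Must be higher than any client's heartbeat_timeout.
tcp_client_max_idle = 240
```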




> And yes, some additional documentation might be helpful.  Like if there
> was a note in the man page that said, "Enabling heartbeats is the only
> way to ensure that a connection will be retried if it is lost", that
> might have clued me in that heartbeats are essentially required for
> reliable connectivity (I am assuming we all agree that statement is
> true; as far as I can tell, even with the latest code it still is!).


This may be true, but I doubt it is the intent.

LCB

--
Lenny Bruzenak
MagitekLTD

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit



Re: Trying to understand audisp-remote network behavior

2022-07-12 Thread Ken Hornstein
>> Well, the default configuration is that heartbeats are turned off, so
>> the general impression I would take away from that is you should only
>> turn on heartbeats if you have some unusual requirement.
>
>This has to be coordinated between the client and server as many of these 
>settings need to be. I can add some discussion to the man page that this is 
>recommended.

Errr ... does it?

I certainly turned them on on all of our clients but did not turn
them on on our server.  Did not cause any problems.  I mean, yes, I could
see that turning them on on the server might be helpful, but it doesn't
seem to be required to make them work; from my reading of the code,
the server will respond to a heartbeat message whether or not they
are configured, and since connections all initiate from the clients
that's the end that has to notice the connection has dropped.

And yes, some additional documentation might be helpful.  Like if there
was a note in the man page that said, "Enabling heartbeats is the only
way to ensure that a connection will be retried if it is lost", that
might have clued me in that heartbeats are essentially required for
reliable connectivity (I am assuming we all agree that statement is
true; as far as I can tell, even with the latest code it still is!).

--Ken




Re: Trying to understand audisp-remote network behavior

2022-07-12 Thread Steve Grubb
Hello,

On Monday, July 11, 2022 10:14:40 PM EDT Ken Hornstein wrote:
> >It is advisable to use the heartbeat option. This way each end can detect
> >the other "disappeared" for some reason.
> 
> Well, the default configuration is that heartbeats are turned off, so
> the general impression I would take away from that is you should only
> turn on heartbeats if you have some unusual requirement.

This has to be coordinated between the client and server as many of these 
settings need to be. I can add some discussion to the man page that this is 
recommended.

-Steve





Re: Trying to understand audisp-remote network behavior

2022-07-12 Thread Ken Hornstein
>> I would like to speak to those people who use it reliably in production!
>> Specifically, do they have heartbeats configured?
>
>Hello Ken, been fielding systems for a very long time now.
>
>Yes. I've always had heartbeats configured on.

Thank you for your reply!

So I am wondering ...

- Did you ever try without heartbeats configured on?

- Do you see the same things that I see, in that if there is a connection
  drop you don't get a connection retry unless you don't get an audit event
  within the heartbeat interval?

- What _do_ you have your heartbeat interval set to?  I settled on 120
  seconds basically as a guess and that seems to work based on the
  amount of audit activity we get (it is bursty enough that so far
  we've always had an idle interval of at least 120 seconds).

Thanks for any feedback you can give me!

--Ken




Re: Trying to understand audisp-remote network behavior

2022-07-12 Thread Lenny Bruzenak

On 7/11/22 20:14, Ken Hornstein wrote:


>> I know there are people on this list that are using it reliably in
>> production. But, the problems were worked out mostly in the 3.0 release. The
>> kerberos code is donated code. I have not personally tested it myself due to
>> the problems in setting up the infrastructure. But from my review 2 weeks
>> ago, it looks like it would have problems in any error situation. I committed
>> some updates today which should make krb5 support better.
>
> I would like to speak to those people who use it reliably in production!
> Specifically, do they have heartbeats configured?


Hello Ken, been fielding systems for a very long time now.

Yes. I've always had heartbeats configured on.



> As long as I have you ... there is one additional issue I think that
> is worth mentioning.  If you have GSS configured you can hang an aggregation
> server hard by doing:
>
> % telnet aggregation-server 60
>
> The problem is while nearly all of auditd uses a libev event loop, the
> function ar_read() calls read() without a timeout, and it blocks and
> none of the other connections get serviced.  This can happen if you
> are doing something like network scanning, or you have a misconfigured
> audisp-remote client.  I think the only long-term solution there is to
> make sure ar_read (or maybe recv_token()) uses the ev event loop;
> I know that's not easy.
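
To make the blocking problem concrete, here is a minimal sketch of a bounded read using poll(). It is not the libev integration suggested above and not code from auditd; read_with_timeout() is a hypothetical helper, and the timeout value is left to the caller. It only shows how a silent peer could be turned into an error instead of an indefinite stall.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of auditd): wait up to timeout_ms for fd
 * to become readable, then read once. Returns the byte count from read(),
 * 0 on EOF, or -1 with errno set (ETIMEDOUT if nothing arrived in time).
 */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	assert(buf != NULL);

	/* Retry if a signal interrupts the wait. The timeout restarts on
	 * each retry, which is acceptable for a sketch. */
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc < 0)
		return -1;		/* poll() itself failed */
	if (rc == 0) {
		errno = ETIMEDOUT;	/* peer sent nothing in time */
		return -1;
	}
	/* Readable, hung up, or in error: read() reports which. */
	return read(fd, buf, len);
}
```

A caller that gets -1 with errno set to ETIMEDOUT could close that one connection and keep servicing the others. A multi-read exchange such as a GSS token transfer would need the bound applied to every read, which is part of why moving the reads into the event loop is the cleaner long-term fix.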


>> The non-kerberos code has been heavily tested. You might try that to see if
>> it works better. But if you are on the old code, there were problems fixed in
>> the 3.0 release. I think people using it are not using the krb5 code and
>> create a vpn or ssh tunnel for encryption.
>
> Well, it's a large effort to use a non-vendor RPM here _and_ the STIGs
> mandate the use of krb5 with audisp-remote (I know people have asked
> for exceptions successfully, but having been involved with that process
> I know the fewer exceptions you ask for, the better).  Just from my
> analysis the core networking code hasn't really changed in any way that
> would change the basic problem.  Like I said, I am open to being proven
> wrong!  I'd be interested in hearing from others who have used audisp-remote
> successfully in production, Kerberos or not.


Because ALL our inter-server communication is encrypted with ipsec 
(libreswan), we are not required to add another.


I will say that I've custom-patched the auditd code and the 
audisp-remote code in ways probably not suitable for general use.


Thx,

LCB

--
Lenny Bruzenak
MagitekLTD


Re: Trying to understand audisp-remote network behavior

2022-07-11 Thread Ken Hornstein
>2 Weeks ago I wrote a model to go looking for certain kinds of problems in 
>kerberos. The results were that it's probably leaking memory. And on the 
>client side, I don't think it was fully resetting all the kerberos variables 
>on failure - which may be contributing to the problems.

Well, in my experience that isn't the problem, see below.

>> This is on RHEL 7 which ships with audit-2.8.5, but as far as
>> I can tell the relevant code hasn't changed much from there to what
>> is on GitHub.
>
>There are differences. I'd trust the current code in github more than the old 
>code.

Well ...

I took a look.  There are 32 commits between 2.8.5 and HEAD for
audisp-remote.c.  It looks like they break down as:

- 4 Kerberos/GSS memory leak fixes
- 3 whitespace/typo fixes
- 4 warning fixes
- 4 misc code cleanups
- 6 code changes related to configuration or moving things around

The remaining ones that might affect the network connection:

b6c474b22f6e - audisp-remote: fix hang with disk_low_action=suspend (#254)

We did have this happen once, so yes, definitely an issue.  But that wasn't
the major cause of our problems.

3e45aa959d55 - In audisp-remote, fixup remote endpoint disappearin in ascii 
forma

We use managed format so that isn't an issue.

10dde069d1a - Dont look for stop on exit while draining the queue

That only affects things when audisp-remote is exiting, not during the
main loop.

9debebcc066 - Fixup krb5 broken by T_KRB5 & T_TCP separation

"Maybe" would cause an issue, but ... see below!

>In github, the first connection should do unlimited retries.

Sure, but at least _for us_, that isn't the issue.  It's when a connection
is lost after the first one.

>> - If the connection is lost for almost any reason (see below), the
>> connection is never retried using the default configuration.  There might
>> be some corner cases where a retry can happen, but in my experience that
>> is rare. Once it's gone, it never gets retried, and audit messages build
>> up until the queue overflows.
>
>The behavior for what to do became a configuration item around 3.0.

Ummm ... so I am trying to understand what you mean there.  Yes, I see
there is a configuration item for _startup_ errors added since 2.8.5,
but like I said that's not the problem we encounter.

>> - In theory if a graceful shutdown is received by audisp-remote (either
>>   a zero-length read or a "ENDING" audit message), then retries can
>>   happen; this is indicated by the "remote_ended" flag in the code.
>
>This would happen if, for example, the aggregating server needed to reboot.

Right, but, like I said in my original message, at least during my
testing that never happened.

>It is advisable to use the heartbeat option. This way each end can detect the 
>other "disappeared" for some reason.

Well, the default configuration is that heartbeats are turned off, so
the general impression I would take away from that is you should only
turn on heartbeats if you have some unusual requirement.

>> The key issue seems to be in this part of the loop in main() (this section
>> is entered when audisp-remote receives an audit record):
>[...]
>> In short, when a new audit record is received, init_transport()
>> (which tries to connect to the audit server) is only called _IF_ the
>> connection is down (transport_ok == 0) _and_ remote_ended is true _and_
>> remote_ending_action is set to FA_RECONNECT (the default) _or_ there
>> hasn't been at least one successful connection (connected_once == 0).
>> 
>> The problem with that is at least in our environment remote_ended is
>> never set to 1, so when the connection drops it is never retried, and
>> there aren't any other entry points in the normal event loop that would
>> ever cause the connection to retry.
>
>I want to think this has been fixed in the current code. It is one of the 
>subtle changes since 2.8.5.

I ... do not believe this is true!

Generally when there is a communication problem then transport_ok is set
to 0 and sock is set to -1 (stop_transport() does this).

In the main loop, if sock == -1 it is never set in the fd_set, so you
never try to send anything.  Even if you _do_ happen to call send_one(),
it will return if transport_ok == 0.  The only time init_transport()
is called in the main loop is if transport_ok == 0 _and_
remote_ended == 1, and like I said we never get remote_ended == 1
even with an auditd server reboot.

So, really ... _if_ heartbeats are _not_ set, I can't see a code path
that would ever result in a reconnect.  I'd love to be proven wrong!
This actually should be easy to test; just make sure heartbeats
are not configured and send audisp-remote a HUP signal; that will call
stop_transport(), and then see if the connection is ever reconnected or
not.  That should act the same whether or not you're using GSS-API.

If the answer is "you should use heartbeats", well ... fair enough.
But it might be worth making those the default, and maybe make sure
if the transport is 

Re: Trying to understand audisp-remote network behavior

2022-07-11 Thread Steve Grubb
Hello,

On Thursday, July 7, 2022 12:05:28 AM EDT Ken Hornstein wrote:
> So we've been struggling with getting audisp-remote working in a
> reliable manner.  In summary, it works but the networking seems fragile.
> We are using Kerberos authentication with audisp-remote, but that
> doesn't seem to be related to the fragility (sadly the Kerberos support
> does make it trivial to completely hang the server, but that's another
> issue).

2 Weeks ago I wrote a model to go looking for certain kinds of problems in 
kerberos. The results were that it's probably leaking memory. And on the 
client side, I don't think it was fully resetting all the kerberos variables 
on failure - which may be contributing to the problems.

> This is on RHEL 7 which ships with audit-2.8.5, but as far as
> I can tell the relevant code hasn't changed much from there to what
> is on GitHub.

There are differences. I'd trust the current code in github more than the old 
code.

> After staring at the code a lot and doing some experiments, here's what
> I believe to be true.  I'll gladly take corrections for anything I get
> wrong.
> 
> - If a connection has _never_ been made successfully by audisp-remote,
>   it will retry the connection (in theory there's a limit to retries,
>   but that seems to be per-message; it will retry on every new message).
>   Fine, that seems reasonable.

In github, the first connection should do unlimited retries.

> - If the connection is lost for almost any reason (see below), the
> connection is never retried using the default configuration.  There might
> be some corner cases where a retry can happen, but in my experience that
> is rare. Once it's gone, it never gets retried, and audit messages build
> up until the queue overflows.

The behavior for what to do became a configuration item around 3.0.

> - In theory if a graceful shutdown is received by audisp-remote (either
>   a zero-length read or a "ENDING" audit message), then retries can
>   happen; this is indicated by the "remote_ended" flag in the code.

This would happen if, for example, the aggregating server needed to reboot.

>   But
>   in my experience that is rare; during my experiments when I rebooted
>   our audit server that message was never sent (I guess the audit server
>   stop was received after the interfaces were shut down).  If the audit
>   server crashes or you have a network failure, you end up getting an
>   error on a write and then the network is marked down and you get into
>   never-retry state.
> 
> - If you turn on heartbeats via heartbeat_timeout, the network connection
>   _will_ retry when a heartbeat is sent.  However, the subtle issue here
>   is that a heartbeat is only sent when there are no incoming audit
> messages within the heartbeat timeout.

It is advisable to use the heartbeat option. This way each end can detect the 
other "disappeared" for some reason.

> The key issue seems to be in this part of the loop in main() (this section
> is entered when audisp-remote receives an audit record):
> 
> // See if input fd is also set
> if (FD_ISSET(ifd, &rfd)) {
>     do {
>         if (remote_fgets(event, sizeof(event), ifd)) {
>             if (!transport_ok && remote_ended &&
>                 (config.remote_ending_action == FA_RECONNECT ||
>                  !connected_once)) {
>                 quiet = 1;
>                 if (init_transport() == ET_SUCCESS) {
>                     remote_ended = 0;
>                     connected_once = 1;
>                 }
>                 quiet = 0;
>             }
> 
> In short, when a new audit record is received, init_transport()
> (which tries to connect to the audit server) is only called _IF_ the
> connection is down (transport_ok == 0) _and_ remote_ended is true _and_
> remote_ending_action is set to FA_RECONNECT (the default) _or_ there
> hasn't been at least one successful connection (connected_once == 0).
> 
> The problem with that is at least in our environment remote_ended is
> never set to 1, so when the connection drops it is never retried, and
> there aren't any other entry points in the normal event loop that would
> ever cause the connection to retry.

I want to think this has been fixed in the current code. It is one of the 
subtle changes since 2.8.5.

> The heartbeat code calls relay_event() directly (code that sends audit
> events normally calls send_one() which returns if transport_ok is false)
> and relay_event() calls either relay_sock_ascii() or relay_sock_managed()
> and those two functions will call init_transport() if the network
> connection is down.  But as mentioned above, you need to make sure that
> you try to send a heartbeat every so often; if you have a server generating
> audit messages constantly then there won't be a heartbeat if you set the
> heartbeat timeout too high.
> 
> You _can_ get a network connection retry if you 

Trying to understand audisp-remote network behavior

2022-07-06 Thread Ken Hornstein
So we've been struggling with getting audisp-remote working in a
reliable manner.  In summary, it works but the networking seems fragile.
We are using Kerberos authentication with audisp-remote, but that
doesn't seem to be related to the fragility (sadly the Kerberos support
does make it trivial to completely hang the server, but that's another
issue).  This is on RHEL 7 which ships with audit-2.8.5, but as far as
I can tell the relevant code hasn't changed much from there to what
is on GitHub.

After staring at the code a lot and doing some experiments, here's what
I believe to be true.  I'll gladly take corrections for anything I get
wrong.

- If a connection has _never_ been made successfully by audisp-remote,
  it will retry the connection (in theory there's a limit to retries,
  but that seems to be per-message; it will retry on every new message).
  Fine, that seems reasonable.

- If the connection is lost for almost any reason (see below), the connection
  is never retried using the default configuration.  There might be some
  corner cases where a retry can happen, but in my experience that is rare.
  Once it's gone, it never gets retried, and audit messages build up until
  the queue overflows.

- In theory if a graceful shutdown is received by audisp-remote (either
  a zero-length read or a "ENDING" audit message), then retries can
  happen; this is indicated by the "remote_ended" flag in the code.  But
  in my experience that is rare; during my experiments when I rebooted
  our audit server that message was never sent (I guess the audit server
  stop was received after the interfaces were shut down).  If the audit
  server crashes or you have a network failure, you end up getting an
  error on a write and then the network is marked down and you get into
  never-retry state.

- If you turn on heartbeats via heartbeat_timeout, the network connection
  _will_ retry when a heartbeat is sent.  However, the subtle issue here
  is that a heartbeat is only sent when there are no incoming audit messages
  within the heartbeat timeout.

The key issue seems to be in this part of the loop in main() (this section is
entered when audisp-remote receives an audit record):

// See if input fd is also set
if (FD_ISSET(ifd, &rfd)) {
    do {
        if (remote_fgets(event, sizeof(event), ifd)) {
            if (!transport_ok && remote_ended &&
                (config.remote_ending_action == FA_RECONNECT ||
                 !connected_once)) {
                quiet = 1;
                if (init_transport() == ET_SUCCESS) {
                    remote_ended = 0;
                    connected_once = 1;
                }
                quiet = 0;
            }

In short, when a new audit record is received, init_transport()
(which tries to connect to the audit server) is only called _IF_ the
connection is down (transport_ok == 0) _and_ remote_ended is true _and_
remote_ending_action is set to FA_RECONNECT (the default) _or_ there
hasn't been at least one successful connection (connected_once == 0).

The problem with that is at least in our environment remote_ended is
never set to 1, so when the connection drops it is never retried, and
there aren't any other entry points in the normal event loop that would
ever cause the connection to retry.

The heartbeat code calls relay_event() directly (code that sends audit
events normally calls send_one() which returns if transport_ok is false)
and relay_event() calls either relay_sock_ascii() or relay_sock_managed()
and those two functions will call init_transport() if the network
connection is down.  But as mentioned above, you need to make sure that
you try to send a heartbeat every so often; if you have a server generating
audit messages constantly then there won't be a heartbeat if you set the
heartbeat timeout too high.
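
The interaction between audit traffic and the heartbeat timer can be sketched with a toy model. It assumes, as a simplification of the behavior described above, that the idle timer restarts after every audit event and after every heartbeat, and that a heartbeat goes out only when the timer reaches the timeout. count_heartbeats() is a hypothetical function, not code from audisp-remote.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count heartbeats in a simplified model: the idle timer restarts after
 * every audit event and after every heartbeat, and a heartbeat is sent
 * only when the timer reaches `timeout` seconds. `arrivals` holds event
 * times in ascending order (seconds, none later than `end`); the run
 * lasts from 0 to `end`.
 */
static unsigned count_heartbeats(const unsigned *arrivals, size_t n,
				 unsigned timeout, unsigned end)
{
	unsigned last = 0, beats = 0;
	size_t i;

	assert(timeout > 0);
	for (i = 0; i <= n; i++) {
		/* The final iteration covers the tail after the last event. */
		unsigned next = (i < n) ? arrivals[i] : end;

		/* Number of timer expiries that fit in this idle gap. */
		beats += (next - last) / timeout;
		last = next;
	}
	return beats;
}
```

In this model a host that logs an event every 60 seconds never sends a heartbeat with a 120-second timeout, so the heartbeat path never gets a chance to call init_transport(). A bursty host with a 280-second quiet period sends two.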

You _can_ get a network connection retry if you encounter an error
inside of relay_sock_ascii() or relay_sock_managed(); I can't say
that didn't happen with us, but it sure seemed like it wasn't sufficient
and having the transport marked as failed was inevitable.

So, I guess my questions are:

- Is this all accurate?

- Is this how it's SUPPOSED to be?  At least for us, network glitches
  happen enough that most of our hosts ended up with overflowing
  audisp-remote queues.  Setting the heartbeat timeout seems to have
  resolved that (but it took a little experimentation to figure out
  the right value).  It just seems surprising that