Re: Routing socket issue?

2021-01-31 Thread Frank Kardel

Looks reasonable to me.

I think the the overflows can be triggered when the routing daemon add a 
longer list of routes like when discovering another part


of the net. This happens e. g. here when my host starts up the VPN an 
then frr needs to insert 30+ routes it just learned.


So depending on the topology there can be at times route message floods.

Frank


On 01/31/21 09:33, Roy Marples wrote:

Hi Frank :)

On 31/01/2021 07:58, Frank Kardel wrote:
For example I fail to see how RTM_LOSING helps that because it won't 
change

how ntpd would configure itself.

Well if I read the comment right I am inclined to differ here:
In in_pcs.c we find:
/*
  * Check for alternatives when higher level complains
  * about service problems.  For now, invalidate cached
  * routing information.  If the route was created dynamically
  * (by a redirect), time to try a default gateway again.
  */
in_losing(struct inpcb *inp)

and the call is in tcp_time.c:
 /*
  * If losing, let the lower level know and try for
  * a better route.  Also, if we backed off this far,
  * our srtt estimate is probably bogus.  Clobber it
  * so we'll take the next rtt measurement as our srtt;
  * move the current srtt into rttvar to keep the current
  * retransmit times until then.
  */

As ntpd acts after a grace period the routing engine may have 
corrected this situation and routing may indeed change.
ntpd's interactions with peers can take up to 1024s so it is good to 
attempt in a best effort way to keep the internal

local address/socket state close to the current state.
It is likely though that there have been routing messages like 
RTM_CHANGE/ADD/DELETE before that and RTM_LOSING is not providing

additional information at the point.


Right, RTM_LOSING is just informational.
If any routing does change then we get RTM_CHANGE/ADD/DELETE etc.





As NTP doesn't bring interfaces up or down, RFM_IFANNOUNCE is 
useless as well.
If the interface does vanish, any addresses on it will be reported 
via RTM_DELADDR.
RTM_IFINFO is also questionable as commentary in the code is that it 
only cares about addresses.



Well I read
ntp_io.c
 /*
  * we are keen on new and deleted addresses and
  * if an interface goes up and down or routing
  * changes
  */
not as being interested in addresses only.

Also keep in mind that at this point routing messages are processed 
in a loop and the action here

 timer_interfacetimeout(current_time + UPDATE_GRACE);
just sets the variable for the next interface+local address update 
run. This is very cheap. The grace period
will batch multiple routing message together. An explicit routing 
message flush is from my point of view
code clutter here. as the socket is effectively drained in the loop 
at the cost of examining the msg_type and setting

a variable. Not much gained here.


OK, we'll keep RTM_IFINFO but drop RTM_IFANNOUNCE.
The point is trying to eliminate the overflow message entirely.

I mean, if you want to argue against any of that then I would 
suggest why even bother filtering or looking at overflow at all?
Shrink the code - any activity on the routing socket, drain it 
ignoring all error, start the interface update timer.
That would be an option but we should react only on known events. 
There may be one or two events that could be removed from
the list after examination as other messages can cover for them. Keep 
in mind the this is a portable code section and the
code tries to be on the fail safe, robust side for the goal of 
address/routing tracking so adjusting it to a particular implementation

may break on other os implementations.


Well, Dragonfly (prior to my patches there) and by extension FreeBSD 
(not checked to see if that changed) both emit RMT_DELADDR before 
RTM_IFANNOUNCE (ie wrong order) so when they do overflow you never see 
RTM_IFANNOUNCE to say the interface vanished. Hence there is zero 
point is listening for it for ntp.






As for the message: IMHO it does not need to be logged at all 
(DPRINTF/maybe LOGDEBUG at most) because the overflow should and 
does just trigger ntpd to reevaluate the interface/routing 
configuration.


This information is not important at all for normal operation as 
the effects are correctly mitigated.


I changed it to LOG_DEBUG as well as removing RTM_LOSING and 
RTM_IFANNOUNCE as discussed above.




Great.

BTW: does the current code revert to (fail safe) periodic interface 
scanning if the routing socket is being disabled (happens when an 
unexpected error code is returned from read(2))?


No.

The socket is non blocking so the only error to ignore here would be 
EINTR.

Any other errors are due to bad programming IMO.
Could be bad programming, but I prefer the ntpd being forgiving 
against hiccups by reverting to periodic scanning when we
disable to routing socket. That is a fail safe stra

Re: Routing socket issue?

2021-01-31 Thread Roy Marples

Hi Frank :)

On 31/01/2021 07:58, Frank Kardel wrote:

For example I fail to see how RTM_LOSING helps that because it won't change
how ntpd would configure itself.

Well if I read the comment right I am inclined to differ here:
In in_pcs.c we find:
/*
  * Check for alternatives when higher level complains
  * about service problems.  For now, invalidate cached
  * routing information.  If the route was created dynamically
  * (by a redirect), time to try a default gateway again.
  */
in_losing(struct inpcb *inp)

and the call is in tcp_time.c:
     /*
  * If losing, let the lower level know and try for
  * a better route.  Also, if we backed off this far,
  * our srtt estimate is probably bogus.  Clobber it
  * so we'll take the next rtt measurement as our srtt;
  * move the current srtt into rttvar to keep the current
  * retransmit times until then.
  */

As ntpd acts after a grace period the routing engine may have corrected this 
situation and routing may indeed change.
ntpd's interactions with peers can take up to 1024s so it is good to attempt in 
a best effort way to keep the internal

local address/socket state close to the current state.
It is likely though that there have been routing messages like 
RTM_CHANGE/ADD/DELETE before that and RTM_LOSING is not providing

additional information at the point.


Right, RTM_LOSING is just informational.
If any routing does change then we get RTM_CHANGE/ADD/DELETE etc.





As NTP doesn't bring interfaces up or down, RFM_IFANNOUNCE is useless as well.
If the interface does vanish, any addresses on it will be reported via 
RTM_DELADDR.
RTM_IFINFO is also questionable as commentary in the code is that it only 
cares about addresses.



Well I read
ntp_io.c
     /*
  * we are keen on new and deleted addresses and
  * if an interface goes up and down or routing
  * changes
  */
not as being interested in addresses only.

Also keep in mind that at this point routing messages are processed in a loop 
and the action here

     timer_interfacetimeout(current_time + UPDATE_GRACE);
just sets the variable for the next interface+local address update run. This is 
very cheap. The grace period
will batch multiple routing message together. An explicit routing message flush 
is from my point of view
code clutter here. as the socket is effectively drained in the loop at the cost 
of examining the msg_type and setting

a variable. Not much gained here.


OK, we'll keep RTM_IFINFO but drop RTM_IFANNOUNCE.
The point is trying to eliminate the overflow message entirely.

I mean, if you want to argue against any of that then I would suggest why even 
bother filtering or looking at overflow at all?
Shrink the code - any activity on the routing socket, drain it ignoring all 
error, start the interface update timer.
That would be an option but we should react only on known events. There may be 
one or two events that could be removed from
the list after examination as other messages can cover for them. Keep in mind 
the this is a portable code section and the
code tries to be on the fail safe, robust side for the goal of address/routing 
tracking so adjusting it to a particular implementation

may break on other os implementations.


Well, Dragonfly (prior to my patches there) and by extension FreeBSD (not 
checked to see if that changed) both emit RMT_DELADDR before RTM_IFANNOUNCE (ie 
wrong order) so when they do overflow you never see RTM_IFANNOUNCE to say the 
interface vanished. Hence there is zero point is listening for it for ntp.






As for the message: IMHO it does not need to be logged at all (DPRINTF/maybe 
LOGDEBUG at most) because the overflow should and does just trigger ntpd to 
reevaluate the interface/routing configuration.


This information is not important at all for normal operation as the effects 
are correctly mitigated.


I changed it to LOG_DEBUG as well as removing RTM_LOSING and RTM_IFANNOUNCE as 
discussed above.




Great.

BTW: does the current code revert to (fail safe) periodic interface scanning 
if the routing socket is being disabled (happens when an unexpected error 
code is returned from read(2))?


No.

The socket is non blocking so the only error to ignore here would be EINTR.
Any other errors are due to bad programming IMO.
Could be bad programming, but I prefer the ntpd being forgiving against hiccups 
by reverting to periodic scanning when we
disable to routing socket. That is a fail safe strategy and would also warrant a 
log message as it is an unusual event.


EINTR is now ignored.
I'll find time to restore periodic scanning later.

Roy


Re: Routing socket issue?

2021-01-30 Thread Frank Kardel

Hi Roy!


On 01/31/21 03:27, Roy Marples wrote:

On 30/01/2021 22:01, Frank Kardel wrote:

"why it needs to be interested i..."

Ntpd needs to know the local address being used when sending to peers 
(authentication, which socket to use). That is why it not just reacts to


address information but also redetermines to local addresses (and 
sockets) are being used for reaching its peers.


The interaction with the routing socket is purposely simple. ntpd 
just needs to know that *something* has changed. It will then rescan 
after a grace period the interfaces and reevaluate


the interface/local address/socket setup. It does not need to be 
extremely snappy but it needs to happen.


Dropping that might delay ntpd's detection of changed local addresses 
for peers.


For example I fail to see how RTM_LOSING helps that because it won't 
change

how ntpd would configure itself.

Well if I read the comment right I am inclined to differ here:
In in_pcs.c we find:
/*
 * Check for alternatives when higher level complains
 * about service problems.  For now, invalidate cached
 * routing information.  If the route was created dynamically
 * (by a redirect), time to try a default gateway again.
 */
in_losing(struct inpcb *inp)

and the call is in tcp_time.c:
/*
 * If losing, let the lower level know and try for
 * a better route.  Also, if we backed off this far,
 * our srtt estimate is probably bogus.  Clobber it
 * so we'll take the next rtt measurement as our srtt;
 * move the current srtt into rttvar to keep the current
 * retransmit times until then.
 */

As ntpd acts after a grace period the routing engine may have corrected 
this situation and routing may indeed change.
ntpd's interactions with peers can take up to 1024s so it is good to 
attempt in a best effort way to keep the internal

local address/socket state close to the current state.
It is likely though that there have been routing messages like 
RTM_CHANGE/ADD/DELETE before that and RTM_LOSING is not providing

additional information at the point.



As NTP doesn't bring interfaces up or down, RFM_IFANNOUNCE is useless 
as well.
If the interface does vanish, any addresses on it will be reported via 
RTM_DELADDR.
RTM_IFINFO is also questionable as commentary in the code is that it 
only cares about addresses.



Well I read
ntp_io.c
/*
 * we are keen on new and deleted addresses and
 * if an interface goes up and down or routing
 * changes
 */
not as being interested in addresses only.

Also keep in mind that at this point routing messages are processed in a 
loop and the action here

timer_interfacetimeout(current_time + UPDATE_GRACE);
just sets the variable for the next interface+local address update run. 
This is very cheap. The grace period
will batch multiple routing message together. An explicit routing 
message flush is from my point of view
code clutter here. as the socket is effectively drained in the loop at 
the cost of examining the msg_type and setting

a variable. Not much gained here.

NOTE TO SELF: our kernel doesn't seem to report RTM_CHGADDR anymore 
looking at nxr.netbsd.org


I mean, if you want to argue against any of that then I would suggest 
why even bother filtering or looking at overflow at all?
Shrink the code - any activity on the routing socket, drain it 
ignoring all error, start the interface update timer.
That would be an option but we should react only on known events. There 
may be one or two events that could be removed from
the list after examination as other messages can cover for them. Keep in 
mind the this is a portable code section and the
code tries to be on the fail safe, robust side for the goal of 
address/routing tracking so adjusting it to a particular implementation

may break on other os implementations.




As for the message: IMHO it does not need to be logged at all 
(DPRINTF/maybe LOGDEBUG at most) because the overflow should and does 
just trigger ntpd to reevaluate the interface/routing configuration.


This information is not important at all for normal operation as the 
effects are correctly mitigated.


Great.

BTW: does the current code revert to (fail safe) periodic interface 
scanning if the routing socket is being disabled (happens when an 
unexpected error code is returned from read(2))?


No.

The socket is non blocking so the only error to ignore here would be 
EINTR.

Any other errors are due to bad programming IMO.
Could be bad programming, but I prefer the ntpd being forgiving against 
hiccups by reverting to periodic scanning when we
disable to routing socket. That is a fail safe strategy and would also 
warrant a log message as it is an unusual event.


Roy

Frank


Re: Routing socket issue?

2021-01-30 Thread Roy Marples

On 30/01/2021 22:01, Frank Kardel wrote:

"why it needs to be interested i..."

Ntpd needs to know the local address being used when sending to peers 
(authentication, which socket to use). That is why it not just reacts to


address information but also redetermines to local addresses (and sockets) are 
being used for reaching its peers.


The interaction with the routing socket is purposely simple. ntpd just needs to 
know that *something* has changed. It will then rescan after a grace period the 
interfaces and reevaluate


the interface/local address/socket setup. It does not need to be extremely 
snappy but it needs to happen.


Dropping that might delay ntpd's detection of changed local addresses for peers.


For example I fail to see how RTM_LOSING helps that because it won't change
how ntpd would configure itself.

As NTP doesn't bring interfaces up or down, RFM_IFANNOUNCE is useless as well.
If the interface does vanish, any addresses on it will be reported via 
RTM_DELADDR.
RTM_IFINFO is also questionable as commentary in the code is that it only cares 
about addresses.


NOTE TO SELF: our kernel doesn't seem to report RTM_CHGADDR anymore looking at 
nxr.netbsd.org


I mean, if you want to argue against any of that then I would suggest why even 
bother filtering or looking at overflow at all?
Shrink the code - any activity on the routing socket, drain it ignoring all 
error, start the interface update timer.




As for the message: IMHO it does not need to be logged at all (DPRINTF/maybe 
LOGDEBUG at most) because the overflow should and does just trigger ntpd to 
reevaluate the interface/routing configuration.


This information is not important at all for normal operation as the effects are 
correctly mitigated.


Great.

BTW: does the current code revert to (fail safe) periodic interface scanning if 
the routing socket is being disabled (happens when an unexpected error code is 
returned from read(2))?


No.

The socket is non blocking so the only error to ignore here would be EINTR.
Any other errors are due to bad programming IMO.

Roy


Re: Routing socket issue?

2021-01-30 Thread Paul Goyette

On Sat, 30 Jan 2021, Roy Marples wrote:


On 30/01/2021 18:27, Paul Goyette wrote:

On Sat, 30 Jan 2021, Roy Marples wrote:


On 30/01/2021 15:12, Paul Goyette wrote:

I thought we took care of the buffer-space issue a long time ago, but
today I've gotten about a dozen of these:

...
Jan 30 05:20:11 speedy ntpd[3146]: routing socket reports: No buffer
space available


I recently adding a patch to enable the diagnostic AND take action on it.
We can change the upstream default from LOG_ERR to LOG_DEBUG or maybe 
their custom DPRINTF though if you think that would help reduce the noise.


Not concerned about noise, just wanted to make sure we didn't have a
regression slip by.  As long as the message is deliberate, I'm not too
worried.


Well, currently other apps such as dhcpcd still log an error when the routing 
socket overflows but a more helpful message.


I think we can just change it to:
  routing socket overflowed - will update interfaces

Happy with that?


Sure.



To alleviate the issue we could also stop ntpd from listening to routing 
changes has that has no bearing on how it discovers interfaces and addresses 
as far as i can tell.

Frank ok with that?

Roy

!DSPAM:6015c335157686319899926!




++--+---+
| Paul Goyette   | PGP Key fingerprint: | E-mail addresses: |
| (Retired)  | FA29 0E3B 35AF E8AE 6651 | p...@whooppee.com |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoye...@netbsd.org   |
++--+---+

Re: Routing socket issue?

2021-01-30 Thread Frank Kardel

"why it needs to be interested i..."

Ntpd needs to know the local address being used when sending to peers 
(authentication, which socket to use). That is why it not just reacts to


address information but also redetermines to local addresses (and 
sockets) are being used for reaching its peers.


The interaction with the routing socket is purposely simple. ntpd just 
needs to know that *something* has changed. It will then rescan after a 
grace period the interfaces and reevaluate


the interface/local address/socket setup. It does not need to be 
extremely snappy but it needs to happen.


Dropping that might delay ntpd's detection of changed local addresses 
for peers.


As for the message: IMHO it does not need to be logged at all 
(DPRINTF/maybe LOGDEBUG at most) because the overflow should and does 
just trigger ntpd to reevaluate the interface/routing configuration.


This information is not important at all for normal operation as the 
effects are correctly mitigated.


BTW: does the current code revert to (fail safe) periodic interface 
scanning if the routing socket is being disabled (happens when an 
unexpected error code is returned from read(2))?


Frank


On 01/30/21 21:41, Roy Marples wrote:

On 30/01/2021 18:27, Paul Goyette wrote:

On Sat, 30 Jan 2021, Roy Marples wrote:


On 30/01/2021 15:12, Paul Goyette wrote:

I thought we took care of the buffer-space issue a long time ago, but
today I've gotten about a dozen of these:

...
Jan 30 05:20:11 speedy ntpd[3146]: routing socket reports: No buffer
space available


I recently adding a patch to enable the diagnostic AND take action 
on it.
We can change the upstream default from LOG_ERR to LOG_DEBUG or 
maybe their custom DPRINTF though if you think that would help 
reduce the noise.


Not concerned about noise, just wanted to make sure we didn't have a
regression slip by.  As long as the message is deliberate, I'm not too
worried.


Just to be clear on this, we have the framework to filter out routing 
messages we don't need to stop overflow from happening and we can also 
detect when overflow still happens.
Currently ntpd now does both, before it just filtered out, but I 
didn't change what it was interested in and now I'm curious why it 
needs to be interested in actual routing changes for interface/address 
discovery as I'm pretty sure we can drop that.


As we enable this in more applications we just have to make some 
choices - filter more out vs increasing buffer size vs just discarding 
the error if the prior two are not feasible.


Roy




Re: Routing socket issue?

2021-01-30 Thread Roy Marples

On 30/01/2021 18:27, Paul Goyette wrote:

On Sat, 30 Jan 2021, Roy Marples wrote:


On 30/01/2021 15:12, Paul Goyette wrote:

I thought we took care of the buffer-space issue a long time ago, but
today I've gotten about a dozen of these:

...
Jan 30 05:20:11 speedy ntpd[3146]: routing socket reports: No buffer
space available


I recently adding a patch to enable the diagnostic AND take action on it.
We can change the upstream default from LOG_ERR to LOG_DEBUG or maybe their 
custom DPRINTF though if you think that would help reduce the noise.


Not concerned about noise, just wanted to make sure we didn't have a
regression slip by.  As long as the message is deliberate, I'm not too
worried.


Just to be clear on this, we have the framework to filter out routing messages 
we don't need to stop overflow from happening and we can also detect when 
overflow still happens.
Currently ntpd now does both, before it just filtered out, but I didn't change 
what it was interested in and now I'm curious why it needs to be interested in 
actual routing changes for interface/address discovery as I'm pretty sure we can 
drop that.


As we enable this in more applications we just have to make some choices - 
filter more out vs increasing buffer size vs just discarding the error if the 
prior two are not feasible.


Roy


Re: Routing socket issue?

2021-01-30 Thread Roy Marples

On 30/01/2021 18:27, Paul Goyette wrote:

On Sat, 30 Jan 2021, Roy Marples wrote:


On 30/01/2021 15:12, Paul Goyette wrote:

I thought we took care of the buffer-space issue a long time ago, but
today I've gotten about a dozen of these:

...
Jan 30 05:20:11 speedy ntpd[3146]: routing socket reports: No buffer
space available


I recently adding a patch to enable the diagnostic AND take action on it.
We can change the upstream default from LOG_ERR to LOG_DEBUG or maybe their 
custom DPRINTF though if you think that would help reduce the noise.


Not concerned about noise, just wanted to make sure we didn't have a
regression slip by.  As long as the message is deliberate, I'm not too
worried.


Well, currently other apps such as dhcpcd still log an error when the routing 
socket overflows but a more helpful message.


I think we can just change it to:
   routing socket overflowed - will update interfaces

Happy with that?

To alleviate the issue we could also stop ntpd from listening to routing changes 
has that has no bearing on how it discovers interfaces and addresses as far as i 
can tell.

Frank ok with that?

Roy


Re: Routing socket issue?

2021-01-30 Thread Paul Goyette

On Sat, 30 Jan 2021, Roy Marples wrote:


On 30/01/2021 15:12, Paul Goyette wrote:

I thought we took care of the buffer-space issue a long time ago, but
today I've gotten about a dozen of these:

...
Jan 30 05:20:11 speedy ntpd[3146]: routing socket reports: No buffer
space available


I recently adding a patch to enable the diagnostic AND take action on it.
We can change the upstream default from LOG_ERR to LOG_DEBUG or maybe their 
custom DPRINTF though if you think that would help reduce the noise.


Not concerned about noise, just wanted to make sure we didn't have a
regression slip by.  As long as the message is deliberate, I'm not too
worried.

Perhaps LOG_DEBUG would be better, though, as well as some indication of
what "action" is taken and whether or not the action was successful?


++--+---+
| Paul Goyette   | PGP Key fingerprint: | E-mail addresses: |
| (Retired)  | FA29 0E3B 35AF E8AE 6651 | p...@whooppee.com |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoye...@netbsd.org   |
++--+---+


Re: Routing socket issue?

2021-01-30 Thread Roy Marples

On 30/01/2021 15:12, Paul Goyette wrote:

I thought we took care of the buffer-space issue a long time ago, but
today I've gotten about a dozen of these:

...
Jan 30 05:20:11 speedy ntpd[3146]: routing socket reports: No buffer
space available


I recently adding a patch to enable the diagnostic AND take action on it.
We can change the upstream default from LOG_ERR to LOG_DEBUG or maybe their 
custom DPRINTF though if you think that would help reduce the noise.


Roy


Routing socket issue?

2021-01-30 Thread Paul Goyette

I thought we took care of the buffer-space issue a long time ago, but
today I've gotten about a dozen of these:

...
Jan 30 05:20:11 speedy ntpd[3146]: routing socket reports: No buffer
space available
...



++--+---+
| Paul Goyette   | PGP Key fingerprint: | E-mail addresses: |
| (Retired)  | FA29 0E3B 35AF E8AE 6651 | p...@whooppee.com |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoye...@netbsd.org   |
++--+---+