Re: babel RTT metric false samples

2024-08-14 Thread Stephanie Wilde-Hobbs via Bird-users

Hi,

I've been running that branch since Jul 24 and have had no issues; it has 
run stably apart from one logged assertion failure:


Assertion 'cs == CS_DOWN || cs == CS_START' failed at nest/proto.c:1147

Unfortunately I can't speak for its effect on the Babel RTT metric, since 
I have moved only a single machine to that branch, to avoid having BIRD 3 
in the firing line of my home internet.


Hope this helps validate the threaded babel implementation.

Stephanie.

On 13/04/2024 16:14, Maria Matejka wrote:

Hello Stephanie, Toke and list,

On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via 
Bird-users wrote:


Stephanie Wilde-Hobbs via Bird-users <bird-users@network.cz> writes:

The babel RTT metric measurements provided by bird appear
suspect for my setup. The metric through a tunnel with a latency
of about 5ms is shown in babel as 150+ms.

[…]

Debug logs show many RTT samples with approximately correct
timestamps (4-6ms) then the occasional IHU with 800-1200ms
calculated instead. Calculating the RTT metric by hand using
babel packet logs shows that the calculations are correct. By
correlating two packet dumps (the machines have <1ms NTP
timekeeping) I can also see that the packets for which high RTT
is calculated have similar transit times through the tunnel as
other packets. Hence, I suspect the accuracy of the packet
timestamps recorded by bird. Is the current packet timestamping
system giving correct timestamps if the packet arrives while
babel is processing another event?

Hmm, so the Babel implementation in Bird tries to get a timestamp as
early as possible after receiving the packet, and to set it as late as
possible before sending the packet out. However, the former in
practice means after returning from poll(), so if the packet has
been sitting in an OS buffer for a while before Bird gets around to
processing it, that waiting time is counted as part of the RTT.
Likewise, if the packet sits in a socket buffer (or in a lower-level
buffer on the sending side) after Bird has sent it out, that time is
also counted as part of the RTT.

I would suspect that the kernel table prune routine may be the cause. It 
just runs synchronously from beginning to end.


I have just fast-tracked Babel into its own thread for BIRD 3; it may be 
worth checking. (There should also be artifacts from the build process 
available for download.) This should rid you of most of the cases of 
suspiciously high RTT.


https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads

Just to note, updating a route in BIRD 3 is still a locking process, so 
it may still tamper with the RTT measurements. At least it should happen 
only in cases where Babel itself is doing the update. Anyway, with the 
BIRD 3 internals, it should be possible to easily /detect/ such 
situations and disregard these single measurements as unreliable. (Not 
implemented, though.)
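Maria's idea is to detect the locking situations directly from the BIRD 3 internals; purely as an illustration of "disregard these single measurements", here is a toy statistical variant that drops isolated samples far above a smoothed baseline (the API, threshold factor, and smoothing constant are invented for this sketch and come from neither BIRD nor babeld):

```python
# Illustrative only: drop isolated RTT samples far above a smoothed
# baseline. Not BIRD code; a toy filter for the idea of disregarding
# single unreliable measurements.
def make_rtt_filter(alpha=0.2, outlier_factor=10.0):
    smoothed = None
    def feed(sample_ms):
        nonlocal smoothed
        if smoothed is None:
            smoothed = sample_ms
            return sample_ms
        if sample_ms > outlier_factor * smoothed:
            return None  # disregard as unreliable
        smoothed = (1 - alpha) * smoothed + alpha * sample_ms
        return sample_ms
    return feed

feed = make_rtt_filter()
samples = [5.1, 4.8, 5.3, 950.0, 5.0]  # one inflated sample, as in the logs
accepted = [s for s in samples if feed(s) is not None]
print(accepted)  # the 950 ms outlier is dropped
```

A real detector inside BIRD could instead flag the sample whenever the measured packet was handled while a lock was contended, which avoids guessing thresholds.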


There are also some thoughts on implementing lockless import queues for 
routing tables, but for now we have to prioritize stabilizing BIRD 3 to 
actually release it as a stable version. Import queues must wait.


Also, while testing, feel free to report any weird behavior, notably 
crashes of BIRD 3, as bugs. That would be very helpful in stabilizing 
BIRD 3. Thanks a lot!


Maria

– Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.



babel RTT metric false samples

2024-04-11 Thread Stephanie Wilde-Hobbs via Bird-users

Hi,

The babel RTT metric measurements provided by bird appear suspect for 
my setup. The metric through a tunnel with a latency of about 5ms is 
shown in babel as 150+ms.


Can others replicate this issue? (should be easy to check for other 
babel users since RTT measurement is on by default in recent versions)
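For anyone checking: in the birdc console, `show babel neighbors` lists neighbours and (in versions with RTT measurement) their measured RTT, which can be compared against a long-running ping on the same link. The RTT contribution to the metric is configured per interface; a rough bird.conf sketch follows (option names and values are from my reading of the BIRD 2.14+ manual, so verify against your version's documentation):

```
protocol babel {
  interface "wg0" {
    rtt cost 96;      # extra metric added as RTT approaches rtt max
    rtt min 10 ms;    # below this RTT, no extra cost
    rtt max 120 ms;   # at or above this RTT, the full rtt cost applies
  };
}
```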


First I suspected a problem with the tunnel, but I compared bird's babel 
RTT measurement against a long-running ping over the same time period: 
bird's babel implementation measured ~160ms, while pings through the same 
wireguard tunnel reported 4.6ms with a 28ms maximum. Other machines 
across my network also report similarly inflated RTT metrics for all 
non-wired links.


Debug logs show many RTT samples with approximately correct timestamps 
(4-6ms) then the occasional IHU with 800-1200ms calculated instead. 
Calculating the RTT metric by hand using babel packet logs shows that 
the calculations are correct. By correlating two packet dumps (the 
machines have <1ms NTP timekeeping) I can also see that the packets for 
which high RTT is calculated have similar transit times through the 
tunnel as other packets. Hence, I suspect the accuracy of the packet 
timestamps recorded by bird. Is the current packet timestamping system 
giving correct timestamps if the packet arrives while babel is 
processing another event?
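The hand calculation above can be sketched with the classic four-timestamp round-trip formula used by the Babel RTT extension (RFC 9616): A sends a Hello at t1, B receives it at t2 and sends an IHU at t3 echoing the timestamps, and A receives the IHU at t4. Only the difference t3 - t2 is taken on B's clock, so the clocks need not be synchronized. A toy calculation (my paraphrase of the scheme, not BIRD's code, with made-up timestamps):

```python
# Four-timestamp RTT: subtract the responder's hold time (t3 - t2) from
# the requester's round trip (t4 - t1). All values in milliseconds.
def babel_rtt_ms(t1, t4, t2, t3):
    return (t4 - t1) - (t3 - t2)

# True tunnel latency ~5 ms (2.5 ms each way), B answers 20 ms later:
print(babel_rtt_ms(t1=0.0, t4=25.0, t2=102.5, t3=122.5))    # 5.0

# Same transit times, but A's receive timestamp is taken 1000 ms late
# because it was busy with another event: the sample inflates to 1005 ms.
print(babel_rtt_ms(t1=0.0, t4=1025.0, t2=102.5, t3=122.5))  # 1005.0
```

This shows why a delayed userspace timestamp on either end inflates individual samples even though the on-wire transit times are normal.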


I can provide packet captures for anyone interested in debugging further.

Thanks,
Stephanie.