Re: babel RTT metric false samples

2024-04-15 Thread Ondrej Zajicek via Bird-users
On Sat, Apr 13, 2024 at 04:38:47PM +0200, Erin Shepherd wrote:
> I guess it might not fit with bird's abstractions (or perhaps the Babel 
> protocol), but has thought been given to using SO_TIMESTAMPING to have the 
> kernel compute TX/RX timestamps? 

Yeah, that is definitely a better solution.

-- 
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santi...@crfreenet.org)
"To err is human -- to blame it on a computer is even more so."


Fwd: Re: babel RTT metric false samples

2024-04-13 Thread Maria Matejka via Bird-users
(mistakenly sent off-list)


 Original Message 
From: Maria Matejka 
Sent: 13 April 2024 18:18:05 CEST
To: Erin Shepherd 
Subject: Re: babel RTT metric false samples

Just quick thought – I think both approaches (timestamping in kernel and in 
userspace) are actually useful for different purposes. Thus, we shall support 
both.

Transferred to our internal issue: https://gitlab.nic.cz/labs/bird/-/issues/61

Maria


On 13 April 2024 16:38:47 CEST, Erin Shepherd  
wrote:
>I guess it might not fit with bird's abstractions (or perhaps the Babel 
>protocol), but has thought been given to using SO_TIMESTAMPING to have the 
>kernel compute TX/RX timestamps? 
>
>- Erin
>
>
>On Sat, 13 Apr 2024, at 16:14, Maria Matejka via Bird-users wrote:
>> Hello Stephanie, Toke and list,
>> 
>> On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via 
>> Bird-users wrote:
>> 
>>> Stephanie Wilde-Hobbs via Bird-users bird-users@network.cz writes:
>>> 
>>>> The babel RTT metric measurements provided by bird appears suspect for my 
>>>> setup. The metric through a tunnel with a latency of about 5ms is shown in 
>>>> babel as 150+ms.
>>>> 
>> […]
>> 
>>>> Debug logs show many RTT samples with approximately correct timestamps 
>>>> (4-6ms) then the occasional IHU with 800-1200ms calculated instead. 
>>>> Calculating the RTT metric by hand using babel packet logs shows that the 
>>>> calculations are correct. By correlating two packet dumps (the machines 
>>>> have <1ms NTP timekeeping) I can also see that the packets for which high 
>>>> RTT is calculated have similar transit times through the tunnel as other 
>>>> packets. Hence, I suspect the accuracy of the packet timestamps recorded 
>>>> by bird. Is the current packet timestamping system giving correct 
>>>> timestamps if the packet arrives while babel is processing another event?
>>>> 
>>> Hmm, so Babel implementation in Bird tries to get a timestamp as early as 
>>> possible after receiving the packet, and set it as late as possible before 
>>> sending out the packet. However, the former in practice means after 
>>> returning from poll(), so if the packet has been sitting around in the OS 
>>> buffer for a while before Bird gets around to process it, the timestamp is 
>>> not set until Bird is done processing it. Likewise, if the packet sits 
>>> around in a socket buffer (or in a lower-level buffer on the sending side) 
>>> after Bird has sent it out, that time will also be counted as part of the 
>>> RTT.
>>> 
>> I would suspect that the kernel table prune routine may be the case. It just 
>> runs from begin to end synchronously.
>> 
>> I have just fast-tracked Babel in its own thread for BIRD 3, it may be worth 
>> checking. (There should be also artifacts from the build process for 
>> download available.) This should get you rid of most of the cases of 
>> suspiciously high RTT.
>> 
>> `https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads`
>> Just to be noted, updating a route in BIRD 3 is still a locking process so 
>> it may still tamper the RTT measurements. At least it should happen only in 
>> cases where Babel is doing the update. Anyway, with BIRD 3 internals, it 
>> should be possible to easily *detect* such situations and disregard these 
>> single measurements as unreliable. (Not implemented, though.)
>> 
>> There are even some thoughts on implementing lockless import queues for 
>> routing tables, yet now we have to prioritize BIRD 3 stabilization to 
>> actually release it as a stable version. Import queues must wait.
>> 
>> Also with this testing, feel free to report any weird behavior, notably 
>> crashes of BIRD 3, as bugs. That would be very helpful with stabilizing BIRD 
>> 3. Thanks a lot!
>> 
>> Maria
>> 
>> – Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.
>> 
-- 
Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.
-- 
Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.

Re: babel RTT metric false samples

2024-04-13 Thread Erin Shepherd
I guess it might not fit with bird's abstractions (or perhaps the Babel 
protocol), but has thought been given to using SO_TIMESTAMPING to have the 
kernel compute TX/RX timestamps? 

- Erin


On Sat, 13 Apr 2024, at 16:14, Maria Matejka via Bird-users wrote:
> Hello Stephanie, Toke and list,
> 
> On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via 
> Bird-users wrote:
> 
>> Stephanie Wilde-Hobbs via Bird-users bird-users@network.cz writes:
>> 
>>> The babel RTT metric measurements provided by bird appears suspect for my 
>>> setup. The metric through a tunnel with a latency of about 5ms is shown in 
>>> babel as 150+ms.
>>> 
> […]
> 
>>> Debug logs show many RTT samples with approximately correct timestamps 
>>> (4-6ms) then the occasional IHU with 800-1200ms calculated instead. 
>>> Calculating the RTT metric by hand using babel packet logs shows that the 
>>> calculations are correct. By correlating two packet dumps (the machines 
>>> have <1ms NTP timekeeping) I can also see that the packets for which high 
>>> RTT is calculated have similar transit times through the tunnel as other 
>>> packets. Hence, I suspect the accuracy of the packet timestamps recorded by 
>>> bird. Is the current packet timestamping system giving correct timestamps 
>>> if the packet arrives while babel is processing another event?
>>> 
>> Hmm, so Babel implementation in Bird tries to get a timestamp as early as 
>> possible after receiving the packet, and set it as late as possible before 
>> sending out the packet. However, the former in practice means after 
>> returning from poll(), so if the packet has been sitting around in the OS 
>> buffer for a while before Bird gets around to process it, the timestamp is 
>> not set until Bird is done processing it. Likewise, if the packet sits 
>> around in a socket buffer (or in a lower-level buffer on the sending side) 
>> after Bird has sent it out, that time will also be counted as part of the 
>> RTT.
>> 
> I would suspect that the kernel table prune routine may be the case. It just 
> runs from begin to end synchronously.
> 
> I have just fast-tracked Babel in its own thread for BIRD 3, it may be worth 
> checking. (There should be also artifacts from the build process for download 
> available.) This should get you rid of most of the cases of suspiciously high 
> RTT.
> 
> `https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads`
> Just to be noted, updating a route in BIRD 3 is still a locking process so it 
> may still tamper the RTT measurements. At least it should happen only in 
> cases where Babel is doing the update. Anyway, with BIRD 3 internals, it 
> should be possible to easily *detect* such situations and disregard these 
> single measurements as unreliable. (Not implemented, though.)
> 
> There are even some thoughts on implementing lockless import queues for 
> routing tables, yet now we have to prioritize BIRD 3 stabilization to 
> actually release it as a stable version. Import queues must wait.
> 
> Also with this testing, feel free to report any weird behavior, notably 
> crashes of BIRD 3, as bugs. That would be very helpful with stabilizing BIRD 
> 3. Thanks a lot!
> 
> Maria
> 
> – Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.
> 

Re: babel RTT metric false samples

2024-04-13 Thread Maria Matejka via Bird-users
Hello Stephanie, Toke and list,

On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via
Bird-users wrote:

> Stephanie Wilde-Hobbs via Bird-users  writes:

> > The babel RTT metric measurements provided by bird appears suspect
> > for my setup. The metric through a tunnel with a latency of about
> > 5ms is shown in babel as 150+ms.

[...]

> > Debug logs show many RTT samples with approximately correct
> > timestamps (4-6ms) then the occasional IHU with 800-1200ms
> > calculated instead.  Calculating the RTT metric by hand using babel
> > packet logs shows that the calculations are correct. By correlating
> > two packet dumps (the machines have <1ms NTP timekeeping) I can also
> > see that the packets for which high RTT is calculated have similar
> > transit times through the tunnel as other packets. Hence, I suspect
> > the accuracy of the packet timestamps recorded by bird. Is the
> > current packet timestamping system giving correct timestamps if the
> > packet arrives while babel is processing another event?
> 
> Hmm, so Babel implementation in Bird tries to get a timestamp as early
> as possible after receiving the packet, and set it as late as possible
> before sending out the packet. However, the former in practice means
> after returning from poll(), so if the packet has been sitting around
> in the OS buffer for a while before Bird gets around to process it,
> the timestamp is not set until Bird is done processing it. Likewise,
> if the packet sits around in a socket buffer (or in a lower-level
> buffer on the sending side) after Bird has sent it out, that time will
> also be counted as part of the RTT.

I would suspect that the kernel table prune routine may be the case. It
just runs from begin to end synchronously.

I have just fast-tracked Babel in its own thread for BIRD 3, it may be
worth checking. (There should be also artifacts from the build process
for download available.) This should get you rid of most of the cases of
suspiciously high RTT.

https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads

Just to be noted, updating a route in BIRD 3 is still a locking process
so it may still tamper the RTT measurements. At least it should happen
only in cases where Babel is doing the update. Anyway, with BIRD 3
internals, it should be possible to easily *detect* such situations and
disregard these single measurements as unreliable. (Not implemented,
though.)

There are even some thoughts on implementing lockless import queues for
routing tables, yet now we have to prioritize BIRD 3 stabilization to
actually release it as a stable version. Import queues must wait.

Also with this testing, feel free to report any weird behavior, notably
crashes of BIRD 3, as bugs. That would be very helpful with stabilizing
BIRD 3. Thanks a lot!

Maria

-- 
Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.


Re: babel RTT metric false samples

2024-04-12 Thread Toke Høiland-Jørgensen via Bird-users
Stephanie Wilde-Hobbs via Bird-users  writes:

> Hi,
>
> The babel RTT metric measurements provided by bird appears suspect for 
> my setup. The metric through a tunnel with a latency of about 5ms is 
> shown in babel as 150+ms.
>
> Can others replicate this issue? (should be easy to check for other 
> babel users since RTT measurement is on by default in recent versions)
>
> First I suspected a problem with the tunnel, but I compared bird's babel 
> RTT measurement against a long-running ping for the same time period and 
> got ~160ms measured by bird's babel implementation, and 4.6ms with a 
> 28ms maximum latency reported by pings in the same wireguard tunnel. 
> Other machines across my network also report similarly inflated RTT 
> metrics for all non-wired links.
>
> Debug logs show many RTT samples with approximately correct timestamps 
> (4-6ms) then the occasional IHU with 800-1200ms calculated instead. 
> Calculating the RTT metric by hand using babel packet logs shows that 
> the calculations are correct. By correlating two packet dumps (the 
> machines have <1ms NTP timekeeping) I can also see that the packets for 
> which high RTT is calculated have similar transit times through the 
> tunnel as other packets. Hence, I suspect the accuracy of the packet 
> timestamps recorded by bird. Is the current packet timestamping system 
> giving correct timestamps if the packet arrives while babel is 
> processing another event?

Hmm, so Babel implementation in Bird tries to get a timestamp as early
as possible after receiving the packet, and set it as late as possible
before sending out the packet. However, the former in practice means
after returning from poll(), so if the packet has been sitting around in
the OS buffer for a while before Bird gets around to process it, the
timestamp is not set until Bird is done processing it. Likewise, if the
packet sits around in a socket buffer (or in a lower-level buffer on the
sending side) after Bird has sent it out, that time will also be counted
as part of the RTT.

You may be able to figure out on which side the inflated RTTs happen by
looking at differences in timestamps between a packet capture and the
values Bird puts into the packet. The timestamps Bird writes are just
CLOCK_MONOTONIC times in microsecond resolution, so they should be
comparable to packet timestamps if you can get the offset between the
system's monotonic and real-time clocks at the time of the packet
capture.

-Toke


babel RTT metric false samples

2024-04-11 Thread Stephanie Wilde-Hobbs via Bird-users

Hi,

The babel RTT metric measurements provided by bird appears suspect for 
my setup. The metric through a tunnel with a latency of about 5ms is 
shown in babel as 150+ms.


Can others replicate this issue? (should be easy to check for other 
babel users since RTT measurement is on by default in recent versions)


First I suspected a problem with the tunnel, but I compared bird's babel 
RTT measurement against a long-running ping for the same time period and 
got ~160ms measured by bird's babel implementation, and 4.6ms with a 
28ms maximum latency reported by pings in the same wireguard tunnel. 
Other machines across my network also report similarly inflated RTT 
metrics for all non-wired links.


Debug logs show many RTT samples with approximately correct timestamps 
(4-6ms) then the occasional IHU with 800-1200ms calculated instead. 
Calculating the RTT metric by hand using babel packet logs shows that 
the calculations are correct. By correlating two packet dumps (the 
machines have <1ms NTP timekeeping) I can also see that the packets for 
which high RTT is calculated have similar transit times through the 
tunnel as other packets. Hence, I suspect the accuracy of the packet 
timestamps recorded by bird. Is the current packet timestamping system 
giving correct timestamps if the packet arrives while babel is 
processing another event?


I can provide packet captures for anyone interested in debugging further.

Thanks,
Stephanie.