(mistakenly sent off-list)

-------- Original Message --------
From: Maria Matejka <maria.mate...@nic.cz>
Sent: 13 April 2024 18:18:05 CEST
To: Erin Shepherd <bird-us...@erinshepherd.net>
Subject: Re: babel RTT metric false samples

Just a quick thought – I think both approaches (timestamping in the kernel and 
in userspace) are actually useful for different purposes. Thus, we should 
support both.
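
To illustrate why the timestamping point matters, here is a minimal sketch of how a single RTT sample is derived from the four timestamps carried by a Hello/IHU exchange, as I understand the Babel RTT extension. The function name, argument names, and all numbers are hypothetical and are not taken from the BIRD source; the times are plain milliseconds.

```python
def rtt_sample(t_hello_tx, t_hello_rx_peer, t_ihu_tx_peer, t_ihu_rx):
    """Return one RTT sample, in the same unit as the inputs.

    t_hello_tx      -- local clock: when we sent the Hello
    t_hello_rx_peer -- peer clock: when the peer received that Hello
    t_ihu_tx_peer   -- peer clock: when the peer sent the IHU echoing it
    t_ihu_rx        -- local clock: when we timestamped the received IHU

    The peer's holding time is subtracted, so the two clocks do not
    need to be synchronised with each other.
    """
    return (t_ihu_rx - t_hello_tx) - (t_ihu_tx_peer - t_hello_rx_peer)


# Hypothetical exchange: 2.5 ms one-way delay, peer clock 5000 ms ahead.
arrival = 1002.5            # when the IHU really reached our host
late_read = arrival + 900   # when a busy event loop finally read it

rtt_at_arrival = rtt_sample(0.0, 5002.5, 6000.0, arrival)
rtt_at_late_read = rtt_sample(0.0, 5002.5, 6000.0, late_read)
print(rtt_at_arrival, rtt_at_late_read)
```

Any delay between the packet's real arrival and the moment the receive timestamp is taken is added to the sample in full, which would be consistent with the occasional 800-1200ms readings reported in this thread.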

Transferred to our internal issue: https://gitlab.nic.cz/labs/bird/-/issues/61

Maria


On 13 April 2024 16:38:47 CEST, Erin Shepherd <bird-us...@erinshepherd.net> 
wrote:
>I guess it might not fit with bird's abstractions (or perhaps the Babel 
>protocol), but has thought been given to using SO_TIMESTAMPING to have the 
>kernel compute TX/RX timestamps? 
>
>- Erin
>
>
>On Sat, 13 Apr 2024, at 16:14, Maria Matejka via Bird-users wrote:
>> Hello Stephanie, Toke and list,
>> 
>> On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via 
>> Bird-users wrote:
>> 
>>> Stephanie Wilde-Hobbs via Bird-users bird-users@network.cz writes:
>>> 
>>>> The babel RTT metric measurements provided by bird appear suspect for my 
>>>> setup. The metric through a tunnel with a latency of about 5ms is shown in 
>>>> babel as 150+ms.
>>>> 
>> […]
>> 
>>>> Debug logs show many RTT samples with approximately correct timestamps 
>>>> (4-6ms) then the occasional IHU with 800-1200ms calculated instead. 
>>>> Calculating the RTT metric by hand using babel packet logs shows that the 
>>>> calculations are correct. By correlating two packet dumps (the machines 
>>>> have <1ms NTP timekeeping) I can also see that the packets for which high 
>>>> RTT is calculated have similar transit times through the tunnel as other 
>>>> packets. Hence, I suspect the accuracy of the packet timestamps recorded 
>>>> by bird. Is the current packet timestamping system giving correct 
>>>> timestamps if the packet arrives while babel is processing another event?
>>>> 
>>> Hmm, so the Babel implementation in Bird tries to get a timestamp as early 
>>> as possible after receiving the packet, and set it as late as possible 
>>> before sending out the packet. However, the former in practice means after 
>>> returning from poll(), so if the packet has been sitting around in the OS 
>>> buffer for a while before Bird gets around to processing it, the timestamp is 
>>> not set until Bird is done processing it. Likewise, if the packet sits 
>>> around in a socket buffer (or in a lower-level buffer on the sending side) 
>>> after Bird has sent it out, that time will also be counted as part of the 
>>> RTT.
>>> 
>> I would suspect that the kernel table prune routine may be the cause. It 
>> just runs from beginning to end synchronously.
>> 
>> I have just fast-tracked Babel in its own thread for BIRD 3, it may be worth 
>> checking. (Artifacts from the build process should also be available for 
>> download.) This should rid you of most of the cases of suspiciously high 
>> RTT.
>> 
>> `https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads`
>> 
>> Note that updating a route in BIRD 3 is still a locking process, so it may 
>> still distort the RTT measurements. At least it should happen only in 
>> cases where Babel is doing the update. Anyway, with BIRD 3 internals, it 
>> should be possible to easily *detect* such situations and disregard these 
>> single measurements as unreliable. (Not implemented, though.)
>> 
>> There are even some thoughts on implementing lockless import queues for 
>> routing tables, but for now we have to prioritize BIRD 3 stabilization to 
>> actually release it as a stable version. Import queues must wait.
>> 
>> Also with this testing, feel free to report any weird behavior, notably 
>> crashes of BIRD 3, as bugs. That would be very helpful with stabilizing BIRD 
>> 3. Thanks a lot!
>> 
>> Maria
>> 
>> – Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.
>> 
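
The idea above of detecting unreliable measurements and disregarding them could look roughly like the following. This is only a sketch under stated assumptions, not BIRD code: the class name, the 10 ms threshold, the smoothing weight, and the notion of a measurable "loop delay" (how long the packet waited before being timestamped) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RttFilter:
    """Smooth RTT samples, skipping those taken while the loop was busy."""

    max_loop_delay_ms: float = 10.0     # assumed cut-off, not a BIRD setting
    alpha: float = 0.2                  # hypothetical weight of a new sample
    smoothed_ms: Optional[float] = None

    def add(self, rtt_ms: float, loop_delay_ms: float) -> bool:
        """Feed one sample; return False if it was discarded."""
        if loop_delay_ms > self.max_loop_delay_ms:
            # The timestamp was taken too long after the real arrival,
            # so the sample is treated as unreliable and ignored.
            return False
        if self.smoothed_ms is None:
            self.smoothed_ms = rtt_ms
        else:
            # Exponentially weighted moving average of accepted samples.
            self.smoothed_ms = (
                (1 - self.alpha) * self.smoothed_ms + self.alpha * rtt_ms
            )
        return True


f = RttFilter()
accepted_first = f.add(5.0, loop_delay_ms=0.5)      # normal sample
accepted_second = f.add(905.0, loop_delay_ms=900.0) # loop was blocked
print(accepted_first, accepted_second, f.smoothed_ms)
```

Discarding a single sample rather than clamping it keeps the smoothed value unaffected by one blocked iteration, at the cost of needing a trustworthy estimate of how long the loop was blocked.
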
-- 
Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.
