I guess it might not fit with BIRD's abstractions (or perhaps the Babel protocol), but has any thought been given to using SO_TIMESTAMPING to have the kernel record the TX/RX timestamps?

- Erin

On Sat, 13 Apr 2024, at 16:14, Maria Matejka via Bird-users wrote:
> Hello Stephanie, Toke and list,
>
> On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via
> Bird-users wrote:
>
>> Stephanie Wilde-Hobbs via Bird-users bird-users@network.cz writes:
>>
>>> The babel RTT metric measurements provided by bird appears suspect for my
>>> setup. The metric through a tunnel with a latency of about 5ms is shown in
>>> babel as 150+ms.
>>>
> […]
>
>>> Debug logs show many RTT samples with approximately correct timestamps
>>> (4-6ms) then the occasional IHU with 800-1200ms calculated instead.
>>> Calculating the RTT metric by hand using babel packet logs shows that the
>>> calculations are correct. By correlating two packet dumps (the machines
>>> have <1ms NTP timekeeping) I can also see that the packets for which high
>>> RTT is calculated have similar transit times through the tunnel as other
>>> packets. Hence, I suspect the accuracy of the packet timestamps recorded by
>>> bird. Is the current packet timestamping system giving correct timestamps
>>> if the packet arrives while babel is processing another event?
>>>
>> Hmm, so Babel implementation in Bird tries to get a timestamp as early as
>> possible after receiving the packet, and set it as late as possible before
>> sending out the packet. However, the former in practice means after
>> returning from poll(), so if the packet has been sitting around in the OS
>> buffer for a while before Bird gets around to process it, the timestamp is
>> not set until Bird is done processing it. Likewise, if the packet sits
>> around in a socket buffer (or in a lower-level buffer on the sending side)
>> after Bird has sent it out, that time will also be counted as part of the
>> RTT.
>>
> I would suspect that the kernel table prune routine may be the case. It just
> runs from begin to end synchronously.
>
> I have just fast-tracked Babel in its own thread for BIRD 3, it may be worth
> checking. (There should be also artifacts from the build process for download
> available.) This should get you rid of most of the cases of suspiciously high
> RTT.
>
> `https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads`
>
> Just to be noted, updating a route in BIRD 3 is still a locking process so it
> may still tamper the RTT measurements. At least it should happen only in
> cases where Babel is doing the update. Anyway, with BIRD 3 internals, it
> should be possible to easily *detect* such situations and disregard these
> single measurements as unreliable. (Not implemented, though.)
>
> There are even some thoughts on implementing lockless import queues for
> routing tables, yet now we have to prioritize BIRD 3 stabilization to
> actually release it as a stable version. Import queues must wait.
>
> Also with this testing, feel free to report any weird behavior, notably
> crashes of BIRD 3, as bugs. That would be very helpful with stabilizing BIRD
> 3. Thanks a lot!
>
> Maria
>
> --
> Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.