Hi,

I've been running that branch since Jul 24. I've had no issues and it has run stably, apart from a logged assertion failure:

Assertion 'cs == CS_DOWN || cs == CS_START' failed at nest/proto.c:1147

Unfortunately I can't speak for its effect on the Babel RTT metric, since I have moved only a single machine to that branch, to avoid putting BIRD 3 in the firing line of my home internet.

Hope this helps validate the threaded babel implementation.

Stephanie.

On 13/04/2024 16:14, Maria Matejka wrote:
Hello Stephanie, Toke and list,

On Fri, Apr 12, 2024 at 04:22:50PM +0200, Toke Høiland-Jørgensen via Bird-users wrote:

    Stephanie Wilde-Hobbs via Bird-users <bird-users@network.cz> writes:

        The Babel RTT metric measurements provided by BIRD appear
        suspect for my setup. The metric through a tunnel with a latency
        of about 5 ms is shown in Babel as 150+ ms.

[…]

        Debug logs show many RTT samples with approximately correct
        values (4-6 ms), then the occasional IHU with 800-1200 ms
        calculated instead. Calculating the RTT metric by hand from
        Babel packet logs shows that the calculations are correct. By
        correlating two packet dumps (the machines have <1 ms NTP
        timekeeping) I can also see that the packets for which a high
        RTT is calculated have transit times through the tunnel similar
        to those of other packets. Hence, I suspect the accuracy of the
        packet timestamps recorded by BIRD. Does the current packet
        timestamping system give correct timestamps if a packet arrives
        while Babel is processing another event?

    Hmm, so the Babel implementation in BIRD tries to get a timestamp
    as early as possible after receiving the packet, and to set it as
    late as possible before sending out the packet. However, the former
    in practice means after returning from poll(), so if the packet has
    been sitting around in the OS buffer for a while before BIRD gets
    around to processing it, the timestamp is not set until BIRD is
    done with whatever else it was processing, and that waiting time is
    counted as part of the RTT. Likewise, if the packet sits around in
    a socket buffer (or in a lower-level buffer on the sending side)
    after BIRD has sent it out, that time will also be counted as part
    of the RTT.

I would suspect that the kernel table prune routine may be the cause. It simply runs from beginning to end synchronously.

I have just fast-tracked running Babel in its own thread for BIRD 3; it may be worth checking. (There should also be artifacts from the build process available for download.) This should rid you of most of the cases of suspiciously high RTT.

https://gitlab.nic.cz/labs/bird/-/tree/babel-in-threads

Just to note, updating a route in BIRD 3 is still a locking process, so it may still tamper with the RTT measurements. At least it should now happen only in cases where Babel itself is doing the update. Anyway, with the BIRD 3 internals, it should be possible to easily /detect/ such situations and disregard those single measurements as unreliable. (Not implemented yet, though.)

There are even some thoughts on implementing lockless import queues for routing tables, but for now we have to prioritize stabilizing BIRD 3 so we can actually release it as a stable version. Import queues must wait.

Also, while doing this testing, feel free to report any weird behavior, notably crashes of BIRD 3, as bugs. That would be very helpful in stabilizing BIRD 3. Thanks a lot!

Maria

– Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.
