Hi Vladimir,

On Fri, Mar 27, 2026 at 7:28 PM Medvedkin, Vladimir <[email protected]> wrote:
>
> Hi Maxime,
>
> On 3/25/2026 9:43 PM, Maxime Leroy wrote:
> > Hi Vladimir,
> >
> > On Wed, Mar 25, 2026 at 4:56 PM Medvedkin, Vladimir
> > <[email protected]> wrote:
> >>
> >> On 3/24/2026 9:19 AM, Maxime Leroy wrote:
> >>> Hi Vladimir,
> >>>
> >>> On Mon, Mar 23, 2026 at 7:46 PM Medvedkin, Vladimir
> >>> <[email protected]> wrote:
> >>>> On 3/23/2026 2:53 PM, Maxime Leroy wrote:
> >>>>> On Mon, Mar 23, 2026 at 1:49 PM Medvedkin, Vladimir
> >>>>> <[email protected]> wrote:
> >>>>>> Hi Maxime,
> >>>>>>
> >>>>>> On 3/23/2026 11:27 AM, Maxime Leroy wrote:
> >>>>>>> Hi Vladimir,
> >>>>>>>
> >>>>>>> On Sun, Mar 22, 2026 at 4:42 PM Vladimir Medvedkin
> >>>>>>> <[email protected]> wrote:
> >>>>>>>> This series adds multi-VRF support to both IPv4 and IPv6 FIB paths by
> >>>>>>>> allowing a single FIB instance to host multiple isolated routing
> >>>>>>>> domains.
> >>>>>>>>
> >>>>>>>> Currently a FIB instance represents one routing instance. For workloads
> >>>>>>>> that need multiple VRFs, the only option is to create multiple FIB
> >>>>>>>> objects. In a burst-oriented datapath, packets in the same batch can
> >>>>>>>> belong to different VRFs, so the application either does per-packet
> >>>>>>>> lookup in different FIB instances or regroups packets by VRF before
> >>>>>>>> lookup. Both approaches are expensive.
> >>>>>>>>
> >>>>>>>> To remove that cost, this series keeps all VRFs inside one FIB instance
> >>>>>>>> and extends lookup input with per-packet VRF IDs.
> >>>>>>>>
> >>>>>>>> The design follows the existing fast-path structure for both families.
> >>>>>>>> IPv4 and IPv6 use multi-ary trees with a 2^24 associativity on the
> >>>>>>>> first level (tbl24). The first-level table scales per configured VRF.
> >>>>>>>> This increases memory usage, but keeps performance and lookup
> >>>>>>>> complexity on par with the non-VRF implementation.
> >>>>>>>>
> >>>>>>> Thanks for the RFC. Some thoughts below.
> >>>>>>>
> >>>>>>> Memory cost: the flat TBL24 replicates the entire table for every VRF
> >>>>>>> (num_vrfs * 2^24 * nh_size). With 256 VRFs and 8B nexthops that is
> >>>>>>> 32 GB for TBL24 alone. In grout we support up to 256 VRFs allocated
> >>>>>>> on demand -- this approach forces the full cost upfront even if most
> >>>>>>> VRFs are empty.
> >>>>>> Yes, increased memory consumption is the trade-off. We make this choice
> >>>>>> in DPDK quite often, such as with pre-allocated mbufs, mempools and many
> >>>>>> other things allocated in advance to gain performance. For FIB, I chose
> >>>>>> to replicate TBL24 per VRF for this same reason.
> >>>>>>
> >>>>>> And, as Morten mentioned earlier, if memory is the priority, a table
> >>>>>> instance per VRF allocated on-demand is still supported.
> >>>>>>
> >>>>>> The high memory cost stems from TBL24's design: for IPv4, it was
> >>>>>> justified by the BGP filtering convention (no prefixes more specific
> >>>>>> than /24 in BGPv4 full view), ensuring most lookups hit with just one
> >>>>>> random memory access. For IPv6, we should likely switch to a 16-bit TRIE
> >>>>>> scheme on all layers. For IPv4, alternative algorithms with smaller
> >>>>>> footprints (like DXR or DIR16-8-8, as used in VPP) may be worth
> >>>>>> exploring if BGP full view is not required for those VRFs.
> >>>>>>
> >>>>>>> Per-packet VRF lookup: Rx bursts come from one port, thus one VRF.
> >>>>>>> Mixed-VRF bulk lookups do not occur in practice. The three AVX512
> >>>>>>> code paths add complexity for a scenario that does not exist, at
> >>>>>>> least for a classic router. Am I missing a use-case?
> >>>>>> That's not true, you're missing out on a lot of established core use
> >>>>>> cases that are at least 2 decades old:
> >>>>>>
> >>>>>> - VLAN subinterface abstraction. Each subinterface may belong to a
> >>>>>> separate VRF
> >>>>>>
> >>>>>> - MPLS VPN
> >>>>>>
> >>>>>> - Policy-based routing
> >>>>>>
> >>>>> Fair point on VLAN subinterfaces and MPLS VPN. SRv6 L3VPN
> >>>>> (End.DT4/End.DT6) also fits that pattern after decap.
> >>>>>
> >>>>> I agree DPDK often pre-allocates for performance, but I wonder if the
> >>>>> flat TBL24 actually helps here. Each VRF's working set is spread
> >>>>> 128 MB apart in the flat table. Would regrouping packets by VRF and
> >>>>> doing one bulk lookup per VRF with separate contiguous TBL24s be
> >>>>> more cache-friendly than a single mixed-VRF gather? Do you have
> >>>>> benchmarks comparing the two approaches?
> >>>> It depends. Generally, if we assume that we are working with wide
> >>>> internet traffic, then even for a single VRF we will most likely miss
> >>>> the cache for TBL24, thus, regardless of the size of the tbl24, each
> >>>> memory access will be performed directly to DRAM.
> >>> If the lookup is DRAM-bound anyway, then the 10 cycles/addr cost
> >>> is dominated by memory latency, not CPU. The CPU cost of a bucket
> >>> sort on 32-64 packets is negligible next to a DRAM access (~80-100
> >>> ns per cache miss).
> >> memory accesses are independent and executed in parallel in the CPU
> >> pipeline
> >>> That actually makes the case for regroup +
> >>> per-VRF lookup: the regrouping is pure CPU work hidden behind
> >>> memory stalls,
> >> regrouping must be performed before memory accesses, so it cannot be
> >> amortized in between memory reads
> > With internet traffic, TBL24 lookups quickly become limited by
> > cache misses, not CPU cycles. Even if some bursts hit the same
> > routes and benefit from cache locality, the CPU has a limited
> > number of outstanding misses (load buffer entries, MSHRs) --
> > out-of-order execution helps, but it is not magic.
> Correct, but this does not contradict what I'm saying
> >
> > The whole point of vector/graph processing (VPP, DPDK graph, etc.)
> > is to amortize that memory latency: prefetch for packet N+1 while
> > processing packet N. This works because all packets in a batch
> > hit the same data structure in a tight loop.
> https://github.com/DPDK/dpdk/blob/626d4e39327333cd5508885162e45ca7fb94ef7f/lib/fib/dir24_8.h#L161
> >
> > With separate per-VRF TBL24s, a bucket sort by VRF -- a few
> > dozen cycles, all in L1 -- gives you clean batches where
> > prefetching works as designed. This is exactly what graph nodes
> > already do: classify, then process per-class in a tight loop.
> How is lookup performed in this design? Do I understand it right:
> 1. sort the batch by VRF IDs, splitting the batch into IP sub-batches
> belonging to the same VRF ID
> 2. for each subset of IPs perform lookup in tbl24[batch_common_vrf_id]
> 3. unsort nexthops
>
> Correct?
No sort/unsort. This is how rte_graph classification works:

ip_input (validation)
    -> ip_lookup-v0 (bulk fib4_lookup on homogeneous VRF 0 burst)
    -> ip_lookup-v1 (bulk fib4_lookup on homogeneous VRF 1 burst)
        -> ip_forward / ip_input_local / ...

ip_input already iterates over packets for header validation and
enqueues them to different next nodes. Adding a per-VRF edge costs one
iface->vrf_id load (already in L1) and one rte_node_enqueue_x1()
(already done today). Each ip_lookup-vN clone holds its VRF's rte_fib
in its node context and calls rte_fib_lookup_bulk() on the whole burst
at once.

We do not use bulk lookups yet in grout (each packet does its own
rte_fib_lookup_bulk(..., 1) today), but this is how we would implement
it.

The tradeoff is batch fragmentation: with traffic spread across K
active VRFs, each sub-batch is ~N/K packets. But in practice, most
deployments have 1-3 hot VRFs, so batches stay large. And even
fragmented batches benefit from the vectorized lookup -- 8 packets is
still one AVX512 iteration, vs. 8 scalar lookups today.
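For concreteness, the classification step would look roughly like this.
It is a sketch only -- struct iface, iface_from_mbuf() and the
IP_LOOKUP_V0 edge numbering are illustrative names, not existing grout
code:

#include <stdint.h>
#include <rte_graph_worker.h>
#include <rte_mbuf.h>

/* illustrative grout-side helpers, definitions elided */
struct iface { uint16_t vrf_id; /* ... */ };
static struct iface *iface_from_mbuf(struct rte_mbuf *m);
enum { IP_LOOKUP_V0 = 0 }; /* edge of the first ip_lookup-vN clone */

/* ip_input: we already loop over the burst for header validation;
 * selecting the per-VRF edge adds one load that is already hot in L1. */
static uint16_t
ip_input_process(struct rte_graph *graph, struct rte_node *node,
		 void **objs, uint16_t nb_objs)
{
	for (uint16_t i = 0; i < nb_objs; i++) {
		struct rte_mbuf *m = objs[i];
		/* ... header validation as today ... */
		uint16_t vrf_id = iface_from_mbuf(m)->vrf_id;
		/* IP_LOOKUP_V0 + vrf_id selects the ip_lookup-vN clone */
		rte_node_enqueue_x1(graph, node, IP_LOOKUP_V0 + vrf_id, m);
	}
	return nb_objs;
}

The enqueue call replaces the one ip_input already makes today, so the
only added per-packet cost is the vrf_id load.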
> >
> >>> and each per-VRF bulk lookup hits a contiguous
> >>> TBL24 instead of scattering across 128 MB-apart VRF regions.
> >> why is a contiguous 128 MB single-VRF TBL24 OK for you, but a bigger
> >> contiguous multi-VRF TBL24 is not OK in the context of lookup (here we
> >> are talking about lookup, omitting the problem of memory consumption on
> >> init)?
> > The performance difference may be small, but the flat approach
> > is not faster either -- while costing 64 GB upfront.
> it seems you are implicitly assuming 256 VRFs. Does my use case with a
> few VRFs have a right to exist?
> >
> >> In both of these cases, memory access behaves the same way within a
> >> single batch of packets during lookup, i.e. the first hit is likely a
> >> cache miss, regardless of whether we are dealing with one or more VRFs;
> >> it will not keep TBL24 in L3$ in any way in a real dataplane app.
> >>
> >>>> And if the addresses are localized (i.e. most traffic is internal),
> >>>> then having multiple TBL24s won't make the situation much worse.
> >>>>
> >>> With localized traffic, regrouping by VRF + per-VRF lookup on
> >>> contiguous TBL24s would benefit from cache locality,
> >> why so? There will be no difference within a single batch of a
> >> reasonable size (for example 64), because within the lookup session,
> >> with or without regrouping, temporal cache locality will be the same.
> >>
> >> Let's look at it from a different angle. Is it worth regrouping IP
> >> addresses by /8 (i.e. 8 MSBs) with the current implementation of a
> >> single-VRF FIB?
> >>
> >>> while the
> >>> flat multi-VRF table spreads hot entries 128 MB apart. The flat
> >>> approach may actually be worse in that scenario
> >>>
> >>>> I don't have any benchmarks for regrouping, however I have 2 things to
> >>>> consider:
> >>>>
> >>>> 1. lookup is relatively fast (for IPv4 it is about 10 cycles per
> >>>> address, and I don't really want to slow it down)
> >>>>
> >>>> 2. incoming addresses and their corresponding VRFs are not controlled
> >>>> by "us", so this is a random set. Regrouping effectively is sorting.
> >>>> I'm not really happy to have nlogn complexity on a fast path :)
> >>> Without benchmarks, we do not know whether the flat approach is
> >>> actually faster than regroup + per-VRF lookup.
> >> feel free to share benchmark results. The only thing you need to add is
> >> the packet regrouping logic, and then use separate single-VRF FIB
> >> instances.
> > Your series introduces a new API that optimizes multi-VRF lookup.
> > The performance numbers should come with the proposal.
> By the policy we cannot share raw performance numbers, and I think this
> is unnecessary, because performance depends on the testing environment
> (content of the routing table, CPU model, etc.).
>
> Tests I've done on my board with an IPv4 full view (782940 routes) and
> 4 VRFs, performing random lookups in all of them, showed 180% cost
> compared to a single VRF with the same RT content.
>
> You can test it in your environment with
> dpdk-test-fib -l 1,2 --no-pci -- -f <path to your routes> -e 4 -l
> 100000000 -V <number of VRFs>

The dpdk-test-fib benchmark is useful for measuring raw lookup
throughput, but it does not capture the full picture. In a real router
stack with rte_graph, the classification by VRF happens naturally as
part of packet processing -- it is not an extra sorting step. The only
way to compare both approaches fairly is to measure end-to-end
forwarding performance in a real datapath.

grout is an open source DPDK-based router built on rte_graph, designed
to exercise and validate DPDK APIs in realistic conditions. I would be
happy to help benchmark both approaches there.
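To be concrete about what we would measure on the grout side, each
ip_lookup-vN clone boils down to something like this (again a sketch;
ip4_hdr() and nh_to_edge() are illustrative helpers, and each clone is
assumed to have its VRF's rte_fib pointer stored in its node context at
init):

#include <stdint.h>
#include <rte_byteorder.h>
#include <rte_fib.h>
#include <rte_graph_worker.h>
#include <rte_ip.h>

/* illustrative helpers, definitions elided */
static struct rte_ipv4_hdr *ip4_hdr(void *obj);
static rte_edge_t nh_to_edge(uint64_t nh);

static uint16_t
ip_lookup_process(struct rte_graph *graph, struct rte_node *node,
		  void **objs, uint16_t nb_objs)
{
	/* this clone's VRF FIB, stored in node->ctx at node init time */
	struct rte_fib *fib = *(struct rte_fib **)node->ctx;
	uint32_t ips[RTE_GRAPH_BURST_SIZE];
	uint64_t nhs[RTE_GRAPH_BURST_SIZE];

	/* the burst is homogeneous by construction: one VRF per node */
	for (uint16_t i = 0; i < nb_objs; i++)
		ips[i] = rte_be_to_cpu_32(ip4_hdr(objs[i])->dst_addr);

	/* one vectorized lookup for the whole burst */
	rte_fib_lookup_bulk(fib, ips, nhs, nb_objs);

	for (uint16_t i = 0; i < nb_objs; i++)
		rte_node_enqueue_x1(graph, node, nh_to_edge(nhs[i]), objs[i]);

	return nb_objs;
}

The benchmark would then compare this per-VRF bulk lookup (one FIB
instance per VRF) against the proposed flat multi-VRF lookup on the
same traffic mix.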
> >
> >>>>> On the memory trade-off and VRF ID mapping: the API uses vrf_id as
> >>>>> a direct index (0 to max_vrfs-1). With 256 VRFs and 8B nexthops,
> >>>>> TBL24 alone costs 32 GB for IPv4 and 32 GB for IPv6 -- 64 GB total
> >>>>> at startup. In grout, VRF IDs are interface IDs that can be any
> >>>>> uint16_t, so we would also need to maintain a mapping between our
> >>>>> VRF IDs and FIB slot indices.
> >>>> of course, this is an application responsibility. In FIB, VRFs are
> >>>> in a contiguous range.
> >>>>> We would need to introduce a max_vrfs
> >>>>> limit, which forces a bad trade-off: either set it low (e.g. 16)
> >>>>> and limit deployments, or set it high (e.g. 256) and pay 64 GB at
> >>>>> startup even with a single VRF. With separate FIB instances per VRF,
> >>>>> we only allocate what we use.
> >>>> Yes, I understand this. In the end, if the user wants to use 256
> >>>> VRFs, the memory footprint will be at least 64 GB anyway.
> >>> The difference is when the memory is committed.
> >> yes, this is the only difference. It all comes down to the static vs
> >> dynamic memory allocation problem. And each of these approaches is good
> >> for solving a specific task. For the task of creating a new VRF, what
> >> is more preferable - to fail at init or at runtime?
> > The main problem is that your series imposes contiguous VRF IDs
> > (0 to max_vrfs-1). How a VRF is represented is a network stack
> > design decision
> exactly - a network stack decision. FIB is not a network stack.
> > -- in Linux it is an ifindex,
> so every interface lives in its own private VRF?
> > in Cisco a name,
> are you going to pass an array of strings on lookup?
> > in grout an interface ID.
> haven't we decided this is a problematic design (VLANs, L3VPN, etc.)?
> > Any application using this API needs
> > a mapping layer on top.
> I think from my rhetorical questions this should be obvious
> >
> > In grout, everything is allocated dynamically: mempools, FIBs,
> > conntrack tables. Pre-allocating everything at init forces
> > hardcoded arbitrary limits and prevents memory reuse between
> > subsystems -- memory reserved for FIB TBL24 cannot be used for
> > conntrack when the VRF has no routes, and vice versa. We prefer
> > to allocate resources only when needed. It is simpler for users
> > and more efficient for memory.
> >
> >>> With separate FIB
> >>> instances per VRF, you allocate 128 MB only when a VRF is actually
> >>> created at runtime. With the flat multi-VRF approach, you pay
> >>> max_vrfs * 128 MB at startup, even if only one VRF is active.
> >>>
> >>> On top of that, the API uses vrf_id as a direct index (0 to
> >>> max_vrfs-1). As Stephen noted, there are multiple ways to model
> >>> VRFs. Depending on the networking stack, VRFs are identified by
> >>> ifindex (Linux l3mdev), by name (Cisco, Juniper), or by some
> >>> other scheme. This means the application must maintain a mapping
> >>> between its own VRF representation and the FIB slot indices, and
> >>> choose max_vrfs upfront. What is the benefit of this flat
> >>> multi-VRF FIB if the application still needs to manage a
> >>> translation layer and pre-commit memory for VRFs that may never
> >>> exist?
> >> This is a control plane task.
> >>>> As a trade-off for a bad trade-off ;) I can suggest allocating it in
> >>>> chunks. Let's say you are starting with 16 VRFs, and during runtime,
> >>>> if the user wants to increase the number of VRFs above this limit, you
> >>>> can allocate another 16xVRF FIB. Then, of course, you need to split
> >>>> addresses into 2 bursts, one for each FIB handle.
> >>> But then we are back to regrouping packets -- just by chunk of
> >>> VRFs instead of by individual VRF. If we have to sort the burst
> >>> anyway, what does the flat multi-VRF table buy us?
> >>>
> >>>>>>> I am not too familiar with DPDK FIB internals, but would it be
> >>>>>>> possible to keep a separate TBL24 per VRF and only share the TBL8
> >>>>>>> pool?
> >>>>>> it is how it is implemented right now, with one note - TBL24s are
> >>>>>> pre-allocated.
> >>>>>>> Something like pre-allocating an array of max_vrfs TBL24
> >>>>>>> pointers, allocating each TBL24 on demand at VRF add time,
> >>>>>> and you are suggesting allocating TBL24 on demand by adding an extra
> >>>>>> indirection layer. This will lead to lower performance, which I
> >>>>>> would like to avoid.
> >>>>>>> and
> >>>>>>> having them all point into a shared TBL8 pool. The TBL8 index in
> >>>>>>> TBL24 entries seems to already be global, so would that work without
> >>>>>>> encoding changes?
> >>>>>>>
> >>>>>>> Going further: could the same idea extend to IPv6? The dir24_8 and
> >>>>>>> trie seem to use the same TBL8 block format (256 entries, same
> >>>>>>> (nh << 1) | ext_bit encoding, same size). Would unifying the TBL8
> >>>>>>> allocator allow a single pool shared across IPv4, IPv6, and all
> >>>>>>> VRFs? That could be a bigger win for /32-heavy and /128-heavy tables
> >>>>>>> and maybe a good first step before multi-VRF.
> >>>>>> So, you are suggesting merging IPv4 and IPv6 into a single unified
> >>>>>> FIB? I'm not sure how this can be a bigger win, could you please
> >>>>>> elaborate more on this?
> >>>>> On the IPv4/IPv6 TBL8 pool: I was not suggesting merging FIBs, just
> >>>>> sharing the TBL8 block allocator between separate FIB instances.
> >>>>> This is possible since dir24_8 and trie use the same TBL8 block
> >>>>> format (256 entries, same encoding, same size).
> >>>>>
> >>>>> Would it be possible to pass a shared TBL8 pool at rte_fib_create()
> >>>>> time? Each FIB keeps its own TBL24 and RIB, but TBL8 is shared
> >>>>> across all FIBs and potentially across IPv4/IPv6. Users would no
> >>>>> longer have to guess num_tbl8 per FIB.
> >>>> Yes, this is possible.
> >>>> However, this will significantly complicate working with the
> >>>> library, while solving a not-so-big problem.
> >>> Your series already shares TBL8 across all VRFs within a single
> >>> FIB -- that part is useful, and it does not require the flat
> >>> multi-VRF TBL24.
> >>>
> >>> In grout, routes arrive from FRR (BGP, OSPF, etc.) at runtime.
> >>> We cannot predict TBL8 usage per VRF in advance
> >> and you don't need it (knowing per-VRF consumption) now. If I
> >> understood your request here properly, do you want to share TBL8
> >> between IPv4 and IPv6 FIBs? I don't think this is a good idea at all.
> >> At the very least, keeping them split means that if one AF consumes
> >> all TBL8s (because of an attack or a bogus CP), the other AF remains
> >> intact.
> > If TBL8 isolation per AF is meant as a protection against route
> > floods, then the same argument applies between VRFs: your series
> > shares TBL8 across all VRFs within a single FIB, so a bogus
> > control plane in one VRF exhausts TBL8 for all other VRFs.
> >
> > But more fundamentally, this is not how route flood protection
> > works. It is handled in the control plane: the routing daemon
> > limits the number of prefixes accepted per BGP session
> > (max-prefix) and selects which routes are installed via prefix
> > filters -- before those routes ever reach the forwarding table.
> >
> > The Linux kernel is a good reference here. IPv6 used to enforce
> > a max_size limit on FIB + cache entries (net.ipv6.route.max_size,
> > defaulting to 4096). It caused real production issues and was
> > removed in kernel 6.3. IPv4 never had a FIB route limit. There
> > is no per-VRF route limit either. The kernel relies entirely on
> > the control plane for route flood protection.
>
> FIB is not the Linux kernel, nor is it a network stack. We cannot rely
> on control plane protection, since the control plane is 3rd party
> software.
>
> Also, I think allocating a very algorithm-specific entity such as a
> pool of TBL8s prior to calling rte_fib_create() and passing a pointer
> to it could be confusing for many users and would bloat the API.
>
> FIB supports pluggable lookup algorithms, you can write your own and
> specify a pointer to the tbl8_pool in an algorithm-specific
> configuration defined for your algorithm, where you may also create a
> dynamic table of TBL24 pointers per VRF. If you need any help with
> this task - I would be happy to help.

I have sent this RFC:
https://mails.dpdk.org/archives/dev/2026-March/335512.html

Thanks in advance for your help.
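To make sure we are talking about the same thing, this is how I read
your suggestion about the algorithm-specific configuration (a
hypothetical sketch only, not the actual RFC contents -- the struct and
field names are illustrative, and tbl8_pool is an opaque type the
application would create):

#include <stdint.h>
#include <rte_fib.h>

struct tbl8_pool; /* opaque, hypothetical */

/* passed through the algorithm-specific part of the FIB configuration */
struct dir24_8_vrf_conf {
	enum rte_fib_dir24_8_nh_sz nh_sz; /* nexthop entry size */
	struct tbl8_pool *tbl8;  /* TBL8 pool created by the application,
				  * shareable between FIB instances */
	uint16_t max_vrfs;       /* size of the TBL24 pointer table;
				  * tbl24[vrf] is allocated on demand at
				  * VRF creation time */
};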
> >
> >>> -- it depends on prefix length distribution which varies per VRF and
> >>> changes over time. No production LPM (Linux kernel, JunOS, IOS) asks
> >>> the operator to size these structures per routing table upfront.
> >> - they are using different LPM algorithms
> >> - you use these facilities, developers properly tuned them. FIB is a
> >> low-level library, it cannot be used without any knowledge, it will
> >> not solve all the problems with a single red button "make it work,
> >> don't make any bugs"
> >> P.S. how do you know how JunOS/IOS implement their LPMs? ;)
> >>
> > I do not need to know their LPM implementation -- I only need
> > to know how they are configured. No production router requires
> > the operator to size internal LPM structures.
> >
> > We can impose a maximum number of IPv4/IPv6 routes on the user
> > -- even though the kernel does not need this either. But TBL8
> > is a different problem: the application cannot predict TBL8
> > consumption because it depends on prefix length distribution,
> > which varies per VRF and changes over time with dynamic routing.
> > Today there is no API to query TBL8 usage, and no API to resize
> > a FIB without destroying it.
> >
> > This is exactly why a shared TBL8 pool across VRFs is useful:
> > VRFs with few long prefixes naturally leave room for VRFs that
> > need more.
> but this is already implemented. I don't know why you are repeatedly
> concerned about this. We are aligned on this, and this feature is
> already there - in the patch.
> On the other hand, what we disagree on is sharing not only across
> VRFs, but also across address families. If you don't understand the
> amount of TBL8s per AF, how would you magically understand the number
> of TBL8s for a merged pool?
> > This is the valuable part of your series. But it
> > does not require a flat multi-VRF TBL24 -- separate per-VRF
> > TBL24s sharing a common TBL8 pool would give the same benefit
> > without the 64 GB upfront cost.
> and that's a completely different problem. Please, let's separate the
> problems and not mix them up.
>
> I understand your concern about memory consumption. I have some ideas
> on how to solve this problem in parallel to the proposed solution.
> >
> >>> Today we do not even have TBL8 usage stats (Robin's series
> >>> addresses that)
> >> I will try to find time to review this patch in the near future
> > Thanks, Robin's TBL8 stats series would help users understand
> > their TBL8 consumption -- a more practical improvement for
> > current users.
> >
> > Regards,
> > Maxime
>
> --
> Regards,
> Vladimir
>

--
Regards,
Maxime

