Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On 1/25/26 11:15 AM, Jakub Sitnicki wrote: On Thu, Jan 22, 2026 at 12:21 PM -08, Martin KaFai Lau wrote: On 1/13/26 4:33 AM, Jakub Sitnicki wrote: Good point. I'm hoping we don't have to allocate from skb_metadata_set(), which does sound prohibitively expensive. Instead we'd allocate the extension together with the skb if we know upfront that metadata will be used. [ Sorry for being late. Have been catching up after holidays. ] For the sk local storage (which was mentioned in other replies as making skb metadata to look more like sk local storage), there is a plan (Amery has been looking into it) to allocate the storage together with sk for performance reason. This means allocating a larger 'struct sock'. The extra space will be at the front of sk instead of the end of sk because of how the 'struct sock' is embedded in tcp_sock/udp_sock/... If skb is going in the same direction, it should be useful to have a similar scheme on: upfront allocation and then shared by multiple BPF progs. The current thinking is to built upon the existing bpf_sk_local_storage usage. A boot param decides how much BPF space should be allocated for 'struct sock'. When a bpf_sk_storage_map is created (with a new use_reserve flag), the space will be allocated permanently from the head space of every sk for this map. The read (from a BPF prog) will be at one stable offset before a sk. If there is no more head space left, the map creation will fail. User can decide if it wants to retry without the 'use_reserve' flag. Thanks for sharing the plans. We will definitely be looking into ways of eliminating allocations in the long run. With one allocation for skb_ext, one for bpf_local_storage, and one for the actual map, it seems unlikely we will be able to attach metadata this way to every packet. Which is something we wanted for our "label packet once, use label everywhere" use case. I'm not sure how much we can squeeze in together with the sk_buff. Hopefully at least skb_ext plus a pointer to bpf_local_storage. yeah, only a bpf_local_storage pointer is needed in skb (or in skb_ext). It is the same for the bpf sk/task/... storage. To be clear, for allocation in skb, I was thinking more about Paolo's comment on "...increasing struct sk_buff size as an alternative to the mptcp skb extension...". I'm also hoping we can allocate memory for bpf_local_storage together with the backing space for the map, which update triggers the skb extension activation. Allocate the actual storage at the end of bpf_local_storage? Hmm... off the top of my head, I don't have a good idea how to do it without trading off flexibility. If trading off flexibility, may as well allocate fixed extra space at the sk (/skb) and get a performance benefit (which would need to be measured). Finally, bpf_local_storage itself has a pretty generous cache which blows it up. Maybe the cache could be a flexible array, which could be smaller for skb local storage. For our usage, the cache has been slowly filling up, so we actually have another side of the issue. Improvements on bpf_local_storage is always welcomed. I am currently more interested in getting the extra memory/headroom allocated for an sk. Eventually, the storage(s) that will be needed for all (or most) sk will use the extra headroom of sk. The current bpf_local_storage (pointer) in sk will be more for testing/ad-hoc purpose or for performance-insensitive usage. It is probably off topic now. It seems having extra tail space in a skb is not in your current plan for the next respin.
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On Thu, Jan 22, 2026 at 12:21 PM -08, Martin KaFai Lau wrote: > On 1/13/26 4:33 AM, Jakub Sitnicki wrote: >> Good point. I'm hoping we don't have to allocate from >> skb_metadata_set(), which does sound prohibitively expensive. Instead >> we'd allocate the extension together with the skb if we know upfront >> that metadata will be used. > > [ Sorry for being late. Have been catching up after holidays. ] > > For the sk local storage (which was mentioned in other replies as making skb > metadata to look more like sk local storage), there is a plan (Amery has been > looking into it) to allocate the storage together with sk for performance > reason. This means allocating a larger 'struct sock'. The extra space will be > at > the front of sk instead of the end of sk because of how the 'struct sock' is > embedded in tcp_sock/udp_sock/... If skb is going in the same direction, it > should be useful to have a similar scheme on: upfront allocation and then > shared > by multiple BPF progs. > > The current thinking is to built upon the existing bpf_sk_local_storage > usage. A > boot param decides how much BPF space should be allocated for 'struct > sock'. When a bpf_sk_storage_map is created (with a new use_reserve flag), the > space will be allocated permanently from the head space of every sk for this > map. The read (from a BPF prog) will be at one stable offset before a sk. If > there is no more head space left, the map creation will fail. User can decide > if > it wants to retry without the 'use_reserve' flag. Thanks for sharing the plans. We will definitely be looking into ways of eliminating allocations in the long run. With one allocation for skb_ext, one for bpf_local_storage, and one for the actual map, it seems unlikely we will be able to attach metadata this way to every packet. Which is something we wanted for our "label packet once, use label everywhere" use case. I'm not sure how much we can squeeze in together with the sk_buff. Hopefully at least skb_ext plus a pointer to bpf_local_storage. I'm also hoping we can allocate memory for bpf_local_storage together with the backing space for the map, which update triggers the skb extension activation. Finally, bpf_local_storage itself has a pretty generous cache which blows it up. Maybe the cache could be a flexible array, which could be smaller for skb local storage. All just ideas ATM. Initial RFC won't have any of these optimizations.
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On 1/13/26 4:33 AM, Jakub Sitnicki wrote: Good point. I'm hoping we don't have to allocate from skb_metadata_set(), which does sound prohibitively expensive. Instead we'd allocate the extension together with the skb if we know upfront that metadata will be used. [ Sorry for being late. Have been catching up after holidays. ] For the sk local storage (which was mentioned in other replies as making skb metadata to look more like sk local storage), there is a plan (Amery has been looking into it) to allocate the storage together with sk for performance reason. This means allocating a larger 'struct sock'. The extra space will be at the front of sk instead of the end of sk because of how the 'struct sock' is embedded in tcp_sock/udp_sock/... If skb is going in the same direction, it should be useful to have a similar scheme on: upfront allocation and then shared by multiple BPF progs. The current thinking is to built upon the existing bpf_sk_local_storage usage. A boot param decides how much BPF space should be allocated for 'struct sock'. When a bpf_sk_storage_map is created (with a new use_reserve flag), the space will be allocated permanently from the head space of every sk for this map. The read (from a BPF prog) will be at one stable offset before a sk. If there is no more head space left, the map creation will fail. User can decide if it wants to retry without the 'use_reserve' flag.
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On Wed, Jan 14, 2026 at 12:49 PM +01, Toke Høiland-Jørgensen wrote: > Jakub Sitnicki via Intel-wired-lan writes: > >> On Tue, Jan 13, 2026 at 07:52 PM +01, Jesper Dangaard Brouer wrote: >>> *BUT* this patchset isn't doing that. To me it looks like a cleanup >>> patchset that simply makes it consistent when skb_metadata_set() called. >>> Selling it as a pre-requirement for doing copy later seems fishy. >> >> Fair point on the framing. The interface cleanup is useful on its own - >> I should have presented it that way rather than tying it to future work. >> >>> Instead of blindly copying XDP data_meta area into a single SKB >>> extension. What if we make it the responsibility of the TC-ingress BPF- >>> hook to understand the data_meta format and via (kfunc) helpers >>> transfer/create the SKB extension that it deems relevant. >>> Would this be an acceptable approach that makes it easier to propagate >>> metadata deeper in netstack? >> >> I think you and Jakub are actually proposing the same thing. >> >> If we can access a buffer tied to an skb extension from BPF, this could >> act as skb-local storage and solves the problem (with some operational >> overhead to set up TC on ingress). >> >> I'd also like to get Alexei's take on this. We had a discussion before >> about not wanting to maintain two different storage areas for skb >> metadata. >> >> That was one of two reasons why we abandoned Arthur's patches and why I >> tried to make the existing headroom-backed metadata area work. >> >> But perhaps I misunderstood the earlier discussion. Alexei's point may >> have been that we don't want another *headroom-backed* metadata area >> accessible from XDP, because we already have that. >> >> Looks like we have two options on the table: >> >> Option A) Headroom-backed metadata >> - Use existing skb metadata area >> - Patch skb_push/pull call sites to preserve it >> >> Option B) Extension-backed metadata >> - Store metadata in skb extension from BPF >> - TC BPF copies/extracts what it needs from headroom-metadata >> >> Or is there an Option C I'm missing? > > Not sure if it's really an option C, but would it be possible to > consolidate them using verifier tricks? I.e., the data_meta field in the > __sk_buff struct is really a virtual pointer that the verifier rewrites > to loading an actual pointer from struct bpf_skb_data_end in skb->cb. So > in principle this could be loaded from an skb extension instead with the > BPF programs being none the wiser. > > There's the additional wrinkle that the end of the data_meta pointer is > compared to the 'data' start pointer to check for overflow, which > wouldn't work anymore. Not sure if there's a way to make the verifier > rewrite those checks in a compatible way, or if this is even a path we > want to go down. But it would be a pretty neat way to make the whole > thing transparent and backwards compatible, I think :) I gave it a shot when working on [1]. Here's the challenge: 1) Keep the skb->data_meta + N <= skb->data checks working This is what guarantees that your BPF program won't access memory outside of the metadata area. So you can't rewrite the skb->data_meta pseudo-pointer load. This means you must... 2) Patch the skb->data_meta pointer dereference after the check Since deref happens at some unknown point after the skb->data_meta pointer load, you may no longer have the context pointer in any of the registers. You might be able to hack it by spilling the context pointer to the stack in the prologue, like I've seen bpf_qdisc does. But that I haven't tried. In general, I view it as a seconary issue since you can use a BPF dynptr to access the skb metadata. It was exactly for that reason - to hide the fact where the metadata is actually located. > Other than that, I like the extention-backed metadata idea! That's what I'm going to work on. I look at it as an skb local storage backed by an skb extension. If the user wants to transfer the contents of the skb metadata into local storage, they can. But the extra allocation is their decision. [1] https://lore.kernel.org/r/20260110-skb-meta-fixup-skb_metadata_set-calls-v1-0-1047878ed...@cloudflare.com
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
Jakub Sitnicki via Intel-wired-lan writes: > On Tue, Jan 13, 2026 at 07:52 PM +01, Jesper Dangaard Brouer wrote: >> *BUT* this patchset isn't doing that. To me it looks like a cleanup >> patchset that simply makes it consistent when skb_metadata_set() called. >> Selling it as a pre-requirement for doing copy later seems fishy. > > Fair point on the framing. The interface cleanup is useful on its own - > I should have presented it that way rather than tying it to future work. > >> Instead of blindly copying XDP data_meta area into a single SKB >> extension. What if we make it the responsibility of the TC-ingress BPF- >> hook to understand the data_meta format and via (kfunc) helpers >> transfer/create the SKB extension that it deems relevant. >> Would this be an acceptable approach that makes it easier to propagate >> metadata deeper in netstack? > > I think you and Jakub are actually proposing the same thing. > > If we can access a buffer tied to an skb extension from BPF, this could > act as skb-local storage and solves the problem (with some operational > overhead to set up TC on ingress). > > I'd also like to get Alexei's take on this. We had a discussion before > about not wanting to maintain two different storage areas for skb > metadata. > > That was one of two reasons why we abandoned Arthur's patches and why I > tried to make the existing headroom-backed metadata area work. > > But perhaps I misunderstood the earlier discussion. Alexei's point may > have been that we don't want another *headroom-backed* metadata area > accessible from XDP, because we already have that. > > Looks like we have two options on the table: > > Option A) Headroom-backed metadata > - Use existing skb metadata area > - Patch skb_push/pull call sites to preserve it > > Option B) Extension-backed metadata > - Store metadata in skb extension from BPF > - TC BPF copies/extracts what it needs from headroom-metadata > > Or is there an Option C I'm missing? Not sure if it's really an option C, but would it be possible to consolidate them using verifier tricks? I.e., the data_meta field in the __sk_buff struct is really a virtual pointer that the verifier rewrites to loading an actual pointer from struct bpf_skb_data_end in skb->cb. So in principle this could be loaded from an skb extension instead with the BPF programs being none the wiser. There's the additional wrinkle that the end of the data_meta pointer is compared to the 'data' start pointer to check for overflow, which wouldn't work anymore. Not sure if there's a way to make the verifier rewrite those checks in a compatible way, or if this is even a path we want to go down. But it would be a pretty neat way to make the whole thing transparent and backwards compatible, I think :) Other than that, I like the extention-backed metadata idea! -Toke
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On Tue, Jan 13, 2026 at 07:52 PM +01, Jesper Dangaard Brouer wrote: > *BUT* this patchset isn't doing that. To me it looks like a cleanup > patchset that simply makes it consistent when skb_metadata_set() called. > Selling it as a pre-requirement for doing copy later seems fishy. Fair point on the framing. The interface cleanup is useful on its own - I should have presented it that way rather than tying it to future work. > Instead of blindly copying XDP data_meta area into a single SKB > extension. What if we make it the responsibility of the TC-ingress BPF- > hook to understand the data_meta format and via (kfunc) helpers > transfer/create the SKB extension that it deems relevant. > Would this be an acceptable approach that makes it easier to propagate > metadata deeper in netstack? I think you and Jakub are actually proposing the same thing. If we can access a buffer tied to an skb extension from BPF, this could act as skb-local storage and solves the problem (with some operational overhead to set up TC on ingress). I'd also like to get Alexei's take on this. We had a discussion before about not wanting to maintain two different storage areas for skb metadata. That was one of two reasons why we abandoned Arthur's patches and why I tried to make the existing headroom-backed metadata area work. But perhaps I misunderstood the earlier discussion. Alexei's point may have been that we don't want another *headroom-backed* metadata area accessible from XDP, because we already have that. Looks like we have two options on the table: Option A) Headroom-backed metadata - Use existing skb metadata area - Patch skb_push/pull call sites to preserve it Option B) Extension-backed metadata - Store metadata in skb extension from BPF - TC BPF copies/extracts what it needs from headroom-metadata Or is there an Option C I'm missing? Thanks, -jkbs
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On 13/01/2026 13.09, Paolo Abeni wrote: On 1/13/26 4:08 AM, Jakub Kicinski wrote: On Sat, 10 Jan 2026 22:05:14 +0100 Jakub Sitnicki wrote: This series is split out of [1] following discussion with Jakub. To copy XDP metadata into an skb extension when skb_metadata_set() is called, we need to locate the metadata contents. "When skb_metadata_set() is called"? I think that may cause perf regressions unless we merge major optimizations at the same time? Should we defer touching the drivers until we have a PoC and some idea whether allocating the extension right away is manageable or we are better off doing it via a kfunc in TC (after GRO)? To be clear putting the metadata in an extension right away would indeed be much cleaner, just not sure how much of the perf hit we can optimize away.. I agree it would be better deferring touching the driver before we have proof there will not be significant regressions. It will be a performance regression to (as cover-letter says): "To copy XDP metadata into an skb extension when skb_metadata_set() is called". The XDP to TC-ingress code path is a fast-path IMHO. *BUT* this patchset isn't doing that. To me it looks like a cleanup patchset that simply makes it consistent when skb_metadata_set() called. Selling it as a pre-requirement for doing copy later seems fishy. IIRC, at early MPTCP impl time, Eric suggested increasing struct sk_buff size as an alternative to the mptcp skb extension, leaving the added trailing part uninitialized when the sk_buff is allocated. If skb extensions usage become so ubicuos they are basically allocated for each packet, the total skb extension is kept under strict control and remains reasonable (assuming it is :), perhaps we could consider revisiting the above mentioned approach? I really like this idea. As using the uninitialized tail room in the SKB (memory area) will make SKB extensions fast. Today SKBs are allocated via SLUB-alloacator cache-aligned so the real size is 256 bytes. On my system the actual SKB (sk_buff) size is 232 bytes (already leaving us 24 bytes). The area that gets zero-initialized is only 192 bytes (3 cache-lines). My experience with the SLUB allocator is that increasing the object size doesn't increase the allocation cost (below PAGE_SIZE). So, the suggestion of simply allocating a larger sk_buff is valid as it doesn't cost more (if we don't touch those cache-lines). We could even make it a CONFIG compile time option how big this area should be. For Jakub this unfortunately challenge/breaks the design of keeping data_meta area valid deeper into the netstack. With all the challenges around encapsulation/decap it seems hard/infeasible to maintain this area in-front of the packet data pointer deeper into the netstack. Instead of blindly copying XDP data_meta area into a single SKB extension. What if we make it the responsibility of the TC-ingress BPF- hook to understand the data_meta format and via (kfunc) helpers transfer/create the SKB extension that it deems relevant. Would this be an acceptable approach that makes it easier to propagate metadata deeper in netstack? --Jesper p.s. For compact storage of SKB extensions in the SKB tail-area, we could revisit Arthur's "traits" (compact-KV storage).
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On Tue, Jan 13, 2026 at 01:09 PM +01, Paolo Abeni wrote: > IIRC, at early MPTCP impl time, Eric suggested increasing struct sk_buff > size as an alternative to the mptcp skb extension, leaving the added > trailing part uninitialized when the sk_buff is allocated. > > If skb extensions usage become so ubicuos they are basically allocated > for each packet, the total skb extension is kept under strict control > and remains reasonable (assuming it is :), perhaps we could consider > revisiting the above mentioned approach? I've been thinking the same thing. Great to hear that this idea is not new. FWIW, in our use cases we'd want to attach metadata to the first packet of new TCP/QUIC flow, and ocassionally to sampled skbs for tracing.
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On Mon, Jan 12, 2026 at 07:08 PM -08, Jakub Kicinski wrote: > On Sat, 10 Jan 2026 22:05:14 +0100 Jakub Sitnicki wrote: >> This series is split out of [1] following discussion with Jakub. >> >> To copy XDP metadata into an skb extension when skb_metadata_set() is >> called, we need to locate the metadata contents. > > "When skb_metadata_set() is called"? I think that may cause perf > regressions unless we merge major optimizations at the same time? > Should we defer touching the drivers until we have a PoC and some > idea whether allocating the extension right away is manageable or > we are better off doing it via a kfunc in TC (after GRO)? > To be clear putting the metadata in an extension right away would > indeed be much cleaner, just not sure how much of the perf hit we > can optimize away.. Good point. I'm hoping we don't have to allocate from skb_metadata_set(), which does sound prohibitively expensive. Instead we'd allocate the extension together with the skb if we know upfront that metadata will be used. Things took an unexpected turn and I'm figuring this out as I go. Please bear with me :-) Here are my thoughts: 1) The driver changes do clean up the interface, but you're right that it's premature churn if the approach changes. If the skb extension approach doesn't pan out, we're ready to fall back to headroom-based storage. 2) How do we handle CONFIG_SKB_EXTENSIONS=n? Without extensions, reliable metadata access after L2 encap/decap would require patching skb_push/pull call sites—or we declare the feature unsupported without CONFIG_SKB_EXTENSIONS=y. 3) When skb extensions are enabled, asking users to attach TC BPF progs to call a kfunc to all devices the skb goes through before L2 encap/decap is impractical. The extension alloc/move needs to be baked into the stack. I'll focus on getting a PoC together next. Stay tuned. Thanks, -jkbs
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On 1/13/26 4:08 AM, Jakub Kicinski wrote: > On Sat, 10 Jan 2026 22:05:14 +0100 Jakub Sitnicki wrote: >> This series is split out of [1] following discussion with Jakub. >> >> To copy XDP metadata into an skb extension when skb_metadata_set() is >> called, we need to locate the metadata contents. > > "When skb_metadata_set() is called"? I think that may cause perf > regressions unless we merge major optimizations at the same time? > Should we defer touching the drivers until we have a PoC and some > idea whether allocating the extension right away is manageable or > we are better off doing it via a kfunc in TC (after GRO)? > To be clear putting the metadata in an extension right away would > indeed be much cleaner, just not sure how much of the perf hit we > can optimize away.. I agree it would be better deferring touching the driver before we have proof there will not be significant regressions. IIRC, at early MPTCP impl time, Eric suggested increasing struct sk_buff size as an alternative to the mptcp skb extension, leaving the added trailing part uninitialized when the sk_buff is allocated. If skb extensions usage become so ubicuos they are basically allocated for each packet, the total skb extension is kept under strict control and remains reasonable (assuming it is :), perhaps we could consider revisiting the above mentioned approach? /P
Re: [Intel-wired-lan] [PATCH net-next 00/10] Call skb_metadata_set when skb->data points past metadata
On Sat, 10 Jan 2026 22:05:14 +0100 Jakub Sitnicki wrote: > This series is split out of [1] following discussion with Jakub. > > To copy XDP metadata into an skb extension when skb_metadata_set() is > called, we need to locate the metadata contents. "When skb_metadata_set() is called"? I think that may cause perf regressions unless we merge major optimizations at the same time? Should we defer touching the drivers until we have a PoC and some idea whether allocating the extension right away is manageable or we are better off doing it via a kfunc in TC (after GRO)? To be clear putting the metadata in an extension right away would indeed be much cleaner, just not sure how much of the perf hit we can optimize away..
