2018-05-18 15:43 GMT+02:00 Daniel Borkmann <dan...@iogearbox.net>:
> On 05/18/2018 05:38 AM, Alexei Starovoitov wrote:
>> On 5/16/18 11:46 PM, Björn Töpel wrote:
>>> 2018-05-04 1:38 GMT+02:00 Alexei Starovoitov <alexei.starovoi...@gmail.com>:
>>>> On Fri, May 04, 2018 at 12:49:09AM +0200, Daniel Borkmann wrote:
>>>>> On 05/02/2018 01:01 PM, Björn Töpel wrote:
>>>>>> From: Björn Töpel <bjorn.to...@intel.com>
>>>>>>
>>>>>> This patch set introduces a new address family called AF_XDP that is
>>>>>> optimized for high performance packet processing and, in upcoming
>>>>>> patch sets, zero-copy semantics. In this patch set, we have removed
>>>>>> all zero-copy related code in order to make it smaller, simpler and
>>>>>> hopefully more review friendly. This patch set only supports copy-mode
>>>>>> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
>>>>>> for RX using the XDP_DRV path. Zero-copy support requires XDP and
>>>>>> driver changes that Jesper Dangaard Brouer is working on. Some of his
>>>>>> work has already been accepted. We will publish our zero-copy support
>>>>>> for RX and TX on top of his patch sets at a later point in time.
>>>>>
>>>>> +1, would be great to see it land this cycle. Saw a few minor nits here
>>>>> and there, but nothing to hold it up, for the series:
>>>>>
>>>>> Acked-by: Daniel Borkmann <dan...@iogearbox.net>
>>>>>
>>>>> Thanks everyone!
>>>>
>>>> Great stuff!
>>>>
>>>> Applied to bpf-next, with one condition.
>>>> Upcoming zero-copy patches for both RX and TX need to be posted
>>>> and reviewed within this release window.
>>>> If the netdev community as a whole won't be able to agree on the zero-copy
>>>> bits, we'd need to revert this feature before the next merge window.
>>>>
>>>> A few other minor nits:
>>>> patch 3:
>>>> +struct xdp_ring {
>>>> +	__u32 producer __attribute__((aligned(64)));
>>>> +	__u32 consumer __attribute__((aligned(64)));
>>>> +};
>>>> It kinda begs for ____cacheline_aligned_in_smp to be introduced for uapi
>>>> headers.
>>>
>>> Hmm, I need some guidance on what a sane uapi variant would be. We
>>> can't have the uapi depend on the kernel build. ARM64, e.g., can have
>>> both 64B and 128B cache lines according to the specs. Contemporary IA
>>> processors have 64B.
>>>
>>> The simplest, and maybe most future-proof, option would be 128B alignment
>>> for all. Another is having 128B for ARM and 64B for all IA. A third option
>>> is having a hand-shaking API (I think virtio has that) for determining the
>>> cache line size, but I'd rather not go down that route.
>>>
>>> Thoughts/ideas on what a uapi ____cacheline_aligned_in_smp version
>>> would look like?
>>
>> I suspect the i40e+arm combination wasn't tested anyway.
>> The api may have endianness issues too on something like sparc.
>> I think the way to be backwards compatible in this area
>> is to make the api usable on x86 only by adding
>> to include/uapi/linux/if_xdp.h
>> #if defined(__x86_64__)
>> #define AF_XDP_CACHE_BYTES 64
>> #else
>> #error "AF_XDP support is not yet available for this architecture"
>> #endif
>> and doing:
>> __u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>> __u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
>>
>> And progressively add to this for arm64 and a few other archs,
>> eventually removing the #error and adding some generic define
>> that's good enough for the long tail of architectures that
>> we really cannot test.
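To make that concrete, here is a sketch of what the if_xdp.h fragment could
grow into once arm64 is added. AF_XDP_CACHE_BYTES is the name suggested above;
picking 128 for arm64 (the worst-case line size the ARMv8 specs allow) is my
assumption only, nothing that has been agreed on:

#include <linux/types.h>

#if defined(__x86_64__)
#define AF_XDP_CACHE_BYTES 64
#elif defined(__aarch64__)
/* ARMv8 implementations have either 64B or 128B lines; use the worst case. */
#define AF_XDP_CACHE_BYTES 128
#else
#error "AF_XDP support is not yet available for this architecture"
#endif

struct xdp_ring {
	__u32 producer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
	__u32 consumer __attribute__((aligned(AF_XDP_CACHE_BYTES)));
};
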
>
> Been looking into this yesterday as well a bit, and it's a bit of a mess what
> uapi headers do in this regard (though there are just a handful of such
> headers). Some of the kernel uapi headers simply hard-code 64 bytes regardless
> of the underlying arch. In general, the kernel does expose it to user space
> via sysfs (coherency_line_size). Here's what perf does to retrieve it:
>
> #ifdef _SC_LEVEL1_DCACHE_LINESIZE
> #define cache_line_size(cacheline_sizep) \
>	*cacheline_sizep = sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
> #else
> static void cache_line_size(int *cacheline_sizep)
> {
>	if (sysfs__read_int("devices/system/cpu/cpu0/cache/index0/coherency_line_size",
>			    cacheline_sizep))
>		pr_debug("cannot determine cache line size");
> }
> #endif
>
> The sysconf() implementation for _SC_LEVEL1_DCACHE_LINESIZE also seems to be
> available only for x86, arm64, s390 and ppc on a cursory glance at the glibc
> code. In the x86 case it retrieves the info from the cpuid insn. In order to
> generically use it in combination with the header, you'd have some probe
> which would then set this as a define before including the header.
>
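For reference, a standalone version of that kind of probe, without the perf
helpers, could look roughly like this (simplified error handling; a return
value of 0 means the size could not be determined):

#include <stdio.h>
#include <unistd.h>

static long probe_cache_line_size(void)
{
	long sz = 0;

#ifdef _SC_LEVEL1_DCACHE_LINESIZE
	sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
	if (sz <= 0) {
		/* Fall back to the sysfs attribute perf reads. */
		FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");

		if (f) {
			if (fscanf(f, "%ld", &sz) != 1)
				sz = 0;
			fclose(f);
		}
	}
	return sz;
}

int main(void)
{
	printf("cache line size: %ld\n", probe_cache_line_size());
	return 0;
}
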
But as a uapi we cannot depend on the L1 cache line size of whatever system
the binary happens to be running on, right? So one option is "one cache line
size for all flavors of an arch", e.g. for ARMv8 that would be 128B, even
though there are 64B flavors out there.

Another way would be to remove the ring structure from the uapi completely,
and leave the layout to the user-space application to figure out at runtime:
a runtime interface (getsockopt) would report the offsets of the producer
(head) and consumer (tail) pointers within the area returned by the mmap
call, and only the descriptor format would be exposed explicitly in if_xdp.h
(a rough sketch of how that could look from user space is at the end of this
mail). Don't know if that is too unorthodox or not...

> Then projects like urcu, they do ...
>
> #define ____cacheline_internodealigned_in_smp \
>	__attribute__((__aligned__(CAA_CACHE_LINE_SIZE)))
>
> ... and then hard code CAA_CACHE_LINE_SIZE for x86 (== 128), s390 (== 128),
> ppc (== 256) and sparc64 (== 256) with a generic fallback to 64.
>
> Hmm, perhaps a combination of the two would make sense, where in the case of
> a known cacheline size it can still be used and we only have the fallback
> otherwise. Like:
>
> #ifndef XDP_CACHE_BYTES
> # if defined(__x86_64__)
> #  define XDP_CACHE_BYTES 64
> # else
> #  error "Please define XDP_CACHE_BYTES for this architecture!"
> # endif
> #endif
>
> Too bad there's no asm uapi header, at least for the archs where it's fixed
> anyway, such that not every project out there has to redefine all of it from
> scratch and we could just include it (and the generic-asm one would throw
> a compile error if it's not externally defined or such).

I started out with adding a cache.h to arch/XXX/include/uapi, but then
realized that this didn't really work. :-)

> Cheers,
> Daniel
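P.S. The getsockopt probing mentioned above could, from the application's
point of view, look something like the sketch below. The option name
XDP_RING_OFFSETS, struct xdp_ring_offsets and the option number are all made
up for illustration; nothing like this exists in the series as posted:

#include <stdint.h>
#include <sys/socket.h>
#include <sys/mman.h>

#ifndef SOL_XDP
#define SOL_XDP 283		/* socket level as added by the AF_XDP series */
#endif
#define XDP_RING_OFFSETS 64	/* hypothetical option number, placeholder only */

/* Hypothetical layout report: offsets relative to the start of the mapping. */
struct xdp_ring_offsets {
	uint64_t producer;	/* offset of the producer (head) index */
	uint64_t consumer;	/* offset of the consumer (tail) index */
	uint64_t desc;		/* offset of the descriptor array */
};

/* Map a ring and locate its pointers via the (hypothetical) getsockopt. */
static int map_ring(int fd, size_t len, off_t pgoff,
		    volatile uint32_t **prod, volatile uint32_t **cons)
{
	struct xdp_ring_offsets off;
	socklen_t optlen = sizeof(off);
	void *map;

	if (getsockopt(fd, SOL_XDP, XDP_RING_OFFSETS, &off, &optlen))
		return -1;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, pgoff);
	if (map == MAP_FAILED)
		return -1;

	*prod = (volatile uint32_t *)((char *)map + off.producer);
	*cons = (volatile uint32_t *)((char *)map + off.consumer);
	return 0;
}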