On Wed, 2 May 2018 12:33:47 +0900 Toshiaki Makita <toshiaki.maki...@gmail.com> wrote:
> On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote: > > On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita > > <makita.toshi...@lab.ntt.co.jp> wrote: > > > >> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote: > >>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita > >>> <makita.toshi...@lab.ntt.co.jp> wrote: > >>> > >>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote: > >>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita > >>>>> <toshiaki.maki...@gmail.com> wrote: > >>>>> > >>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct [...] > >>>>> > >>>>> I'm not sure you can make this assumption, that xdp_frames > >>>>> coming from another device driver uses a refcnt based memory > >>>>> model. But maybe I'm confused, as this looks like an SKB > >>>>> receive path, but in the ndo_xdp_xmit(). > >>>> > >>>> I find this path similar to cpumap, which creates skb from > >>>> redirected xdp frame. Once it is converted to skb, skb head is > >>>> freed by page_frag_free, so anyway I needed to get the > >>>> refcount here regardless of memory model. > >>> > >>> Yes I know, I wrote cpumap ;-) > >>> > >>> First of all, I don't want to see such xdp_frame to SKB > >>> conversion code in every driver. Because that increase the > >>> chances of errors. And when looking at the details, then it > >>> seems that you have made the mistake of making it possible to > >>> leak xdp_frame info to the SKB (which cpumap takes into > >>> account). > >> > >> Do you mean leaving xdp_frame in skb->head is leaking something? > >> how? > > > > Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp > > headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored > > in frame data on page reuse"). > > Thanks for sharing the info. > > > But this time, the concern is a bpf_prog attached at TC/bpf_cls > > level, that can that can adjust head via bpf_skb_change_head (for > > XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame. As > > described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in > > is not super critical at the moment, as this _currently_ runs as > > CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN. > > What I don't get is why special casing xdp_frame info. My assumption is > that the area above skb->mac_header is uninit kernel memory and should > not be readable by unprivileged users anyway. So I didn't clear the area > at this point. This is also my understanding. But Alexei explicitly asked me to handle this xdp_frame info case. I assume that more work is required for the transition from CAP_SYS_ADMIN to CAP_NET_ADMIN, we just don't want to add more/new code that makes this more difficult. > >>> Second, I think the refcnt scheme here is wrong. The xdp_frame > >>> should be "owned" by XDP and have the proper refcnt to deliver > >>> it directly to the network stack. > >>> > >>> Third, if we choose that we want a fallback, in-case XDP is not > >>> enabled on egress dev (but it have an ndo_xdp_xmit), then it > >>> should be placed in the generic/core code. E.g. > >>> __bpf_tx_xdp_map() could look at the return code from > >>> dev->netdev_ops->ndo_xdp() and create an SKB. (Hint, this would > >>> make it easy to implement TX bulking towards the dev). > >> > >> Right, this is a much cleaner way. > >> > >> Although I feel like we should add this fallback for veth because > >> it requires something which is different from other drivers > >> (enabling XDP on the peer device of the egress device), > > > > (This is why I Cc'ed Tariq...) > > > > This is actually a general problem with the xdp "xmit" side (and not > > specific to veth driver). The problem exists for other drivers as > > well. > > > > The problem is that a driver can implement ndo_xdp_xmit(), but the > > driver might not have allocated the necessary XDP TX-queue resources > > yet (or it might not be possible due to system resource limits). > > > > The current "hack" is to load an XDP prog on the egress device, and > > then assume that this cause the driver to also allocate the XDP > > ndo_xdo_xmit side HW resources. This is IMHO a wrong assumption. > > > > We need a more generic way to test if a net_device is "ready/enabled" > > for handling xdp_frames via ndo_xdp_xmit. And Tariq had some ideas > > on how to implement this... > > My assumption on REDIRECT requirement came from this. > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b Yes, I wrote that. > I guess you are saying thing are changing, and having an XDP program > attached on the egress device is no longer generally sufficient. Looking > forward to Tariq's solution. Yes, (hopefully) things are changing. Loading a dummy XDP program to enable ndo_xdp_xmit, turned out to be a bad model, for all the reasons I listed. I hope Tariq find some time to work on this ... ;-) > Toshiaki Makita > > > > > My opinion is that it is a waste of (HW/mem) resources to always > > allocate resources for ndo_xdp_xmit when loading an XDP program. > > Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP > > redirect load-balancer, then I don't want/need ndo_xdp_xmit. E.g. > > today using ixgbe on machines with more than 96 CPUs, will fail due > > to limited TX queue resources. Thus, blocking the mentioned > > use-cases. > > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer