Re: [PATCH net-next 1/4] netdev: support dumping a single netdev in qstats

2024-04-22 Thread David Ahern
On 4/22/24 7:48 AM, Jakub Kicinski wrote:
> On Sun, 21 Apr 2024 13:32:24 -0600 David Ahern wrote:
>> On 4/21/24 1:17 PM, Eric Dumazet wrote:
>>> I wonder if NLM_F_DUMP_FILTERED should not be reported to user space?
>>
>> good point. We do set that flag for other dumps when a filter has been
>> used to limit data returned.
> 
> That flag appears to be a, hm, historic workaround?
> If I was to guess what the motivation was I'd say that it's because
> "old school netlink" didn't reject unknown attributes. And you wanted
> to know whether the kernel did the filtering or you have to filter
> again in user space? Am I close? :)

close enough based on what I can recall.

> 
> The flag is mostly used in the IP stack, I'd rather try to deprecate 
> it than propagate it to new genetlink families which do full input
> validation, rendering the flag 100% unnecessary.




Re: [PATCH net-next 1/4] netdev: support dumping a single netdev in qstats

2024-04-21 Thread David Ahern
On 4/21/24 1:17 PM, Eric Dumazet wrote:
> I wonder if NLM_F_DUMP_FILTERED should not be reported to user space?

good point. We do set that flag for other dumps when a filter has been
used to limit data returned.



Re: [PATCH net-next 3/4] netlink: support all extack types in dumps

2024-04-20 Thread David Ahern
On 4/19/24 8:35 PM, Jakub Kicinski wrote:
> Note that when this commit message refers to netlink dump
> it only means the actual dumping part, the parsing / dump
> start is handled by the same code as "doit".
> 
> Commit 4a19edb60d02 ("netlink: Pass extack to dump handlers")
> added support for returning extack messages from dump handlers,
> but left out other extack info, e.g. bad attribute.
> 
> This used to be fine because until YNL we had little practical
> use for the machine readable attributes, and only messages were
> used in practice.
> 
> YNL flips the preference 180 degrees: it's now much more useful
> to point to a bad attr with NL_SET_BAD_ATTR() than to type
> an English message saying "attribute XYZ is $reason-why-bad".
> 
> Support all of extack. The fact that extack only gets added if
> it fits remains unaddressed.
> 
> Signed-off-by: Jakub Kicinski 
> ---
>  net/netlink/af_netlink.c | 15 ++-
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 

Reviewed-by: David Ahern 





Re: [PATCH net-next 2/4] netlink: move extack writing helpers

2024-04-20 Thread David Ahern
On 4/19/24 8:35 PM, Jakub Kicinski wrote:
> Next change will need them in netlink_dump_done(), pure move.
> 
> Signed-off-by: Jakub Kicinski 
> ---
>  net/netlink/af_netlink.c | 126 +++
>  1 file changed, 63 insertions(+), 63 deletions(-)
> 

Reviewed-by: David Ahern 





Re: [PATCH net-next 1/4] netdev: support dumping a single netdev in qstats

2024-04-20 Thread David Ahern
On 4/19/24 8:35 PM, Jakub Kicinski wrote:
> Having to filter the right ifindex in the tests is a bit tedious.
> Add support for dumping qstats for a single ifindex.
> 
> Signed-off-by: Jakub Kicinski 
> ---
>  Documentation/netlink/specs/netdev.yaml |  1 +
>  net/core/netdev-genl-gen.c  |  1 +
>  net/core/netdev-genl.c  | 52 ++---
>  3 files changed, 41 insertions(+), 13 deletions(-)
> 

Reviewed-by: David Ahern 





Re: [PATCH net-next 08/10] selftests: forwarding: router_mpath_nh: Add a diagram

2024-04-15 Thread David Ahern
On 4/12/24 11:03 AM, Petr Machata wrote:
> This test lacks a topology diagram, making the setup not obvious.
> Add one.
> 
> Cc: David Ahern 
> Signed-off-by: Petr Machata 
> ---
>  .../net/forwarding/router_mpath_nh.sh | 35 +++
>  1 file changed, 35 insertions(+)
> 

Reviewed-by: David Ahern 





Re: [BUG] selftests/net: icmp_redirect.sh: 12 out of 40 test result with [FAIL]

2024-03-18 Thread David Ahern
On 3/17/24 8:38 PM, Hangbin Liu wrote:
> Wild guess: the last change to icmp_redirect was my netns update. Maybe
> some default sysctl settings in the netns cause the error?

It is most likely sysctl settings. It would be good to chase those down
and make sure we have the script setting them.

Mirsad: What OS are you testing with? That script has a verbose option
(-v) to get more output like the commands run and pause-on-fail (-p) to
manually debug at that point.
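As a sketch, the test script could pin the relevant knobs explicitly rather than inheriting netns defaults (the exact sysctl list below is an assumption; the real script would set whichever settings it depends on). The helper only prints the commands, as a dry run:

```shell
#!/bin/sh
# Sketch: pin the sysctls a redirect test depends on inside its test netns,
# instead of relying on distro/netns defaults. The knob list here is an
# assumption for illustration.

set_redirect_sysctls() {
	ns="$1"
	for kv in \
		net.ipv4.conf.all.send_redirects=1 \
		net.ipv4.conf.all.accept_redirects=1 \
		net.ipv6.conf.all.accept_redirects=1
	do
		# Dry run: emit the command instead of executing it (which
		# would need root and an existing netns).
		echo ip netns exec "$ns" sysctl -w "$kv"
	done
}

set_redirect_sysctls h1
```

In the actual script the `echo` would be dropped so the commands run during setup, making the test independent of whatever defaults a new netns happens to have.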



Re: [BUG net-next] fcnal-test.sh: 4 (four) tests FAIL

2024-03-11 Thread David Ahern
On 3/11/24 10:17 AM, Jakub Kicinski wrote:
> On Sat, 9 Mar 2024 19:45:15 +0100 Mirsad Todorovac wrote:
>> In the vanilla net-next tree build of v6.8-rc7-2348-g75c2946db360, with up-to-date
>> iproute2 built tools, fcnal-test.sh reports certain failures:
>>
>> --
>> # TEST: ping local, VRF bind - VRF IP
>>[FAIL]
>> # TEST: ping local, device bind - ns-A IP
>>[FAIL]
>> # TEST: ping local, VRF bind - VRF IP
>>[FAIL]
>> # TEST: ping local, device bind - ns-A IP
>>[FAIL]
>> --
> 
> Adding David A to CC.
> 
> It rings a bell. We also build ping from source when running the tests
> locally, I have in my notes "AWS iputils are buggy, use iputils.git"
> but unfortunately I didn't make a note which tests were failing without
> it. David might remember..

yes, please update ping -- make sure it has proper support for
SO_BINDTODEVICE.

It's a bug in some versions of iputils ping. It sets SO_BINDTODEVICE and
then resets it because the source address is not set on the command line
(it should not be required; they are separate intents).



Re: [RFC PATCH net-next v6 01/15] queue_api: define queue api

2024-03-10 Thread David Ahern
On 3/8/24 4:47 PM, David Wei wrote:
> 
> I'm working to port bnxt over to using this API. What are your thoughts
> on maybe pulling this out and use bnxt to drive it?
> 

I would love to see a second nic implementation; this patch set and
overall design is driven by GVE limitations.





Re: [PATCH v2 net 0/4] selftests: net: more fixes

2024-02-01 Thread David Ahern
On 2/1/24 11:42 AM, Paolo Abeni wrote:
> Another small bunch of fixes, addressing issues outlined by the
> netdev CI.
> 
> The first 2 patches are just rebased.
> 
> The following 2 are new fixes, for even more problems that surfaced
> meanwhile.
> 
> Paolo Abeni (4):
>   selftests: net: cut more slack for gro fwd tests.
>   selftests: net: fix setup_ns usage in rtnetlink.sh
>   selftests: net: fix tcp listener handling in pmtu.sh
>   selftests: net: avoid just another constant wait
> 
>  tools/testing/selftests/net/pmtu.sh   | 23 ++-
>  tools/testing/selftests/net/rtnetlink.sh  |  6 ++---
>  tools/testing/selftests/net/udpgro_fwd.sh | 14 +--
>  tools/testing/selftests/net/udpgso_bench_rx.c |  2 +-
>  4 files changed, 32 insertions(+), 13 deletions(-)
> 

For the set:

Reviewed-by: David Ahern 




Re: [PATCH net 0/3] selftests: net: a few pmtu.sh fixes

2024-01-30 Thread David Ahern
On 1/30/24 10:47 AM, Paolo Abeni wrote:
> This series try to address CI failures for the pmtu.sh tests. It
> does _not_ attempt to enable all the currently skipped cases, to
> avoid adding more entropy.
> 
> Tested with:
> 
> make -C tools/testing/selftests/ TARGETS=net install
> vng --build  --config tools/testing/selftests/net/config
> vng --run . --user root -- \
>   ./tools/testing/selftests/kselftest_install/run_kselftest.sh \
>   -t net:pmtu.sh
> 
> Paolo Abeni (3):
>   selftests: net: add missing config for pmtu.sh tests
>   selftests: net: fix available tunnels detection
>   selftests: net: don't access /dev/stdout in pmtu.sh
> 
>  tools/testing/selftests/net/config  |  3 +++
>  tools/testing/selftests/net/pmtu.sh | 18 +-
>  2 files changed, 12 insertions(+), 9 deletions(-)
> 

selftests are getting a lot of love right now :-)

Reviewed-by: David Ahern 




Re: selftest: net: fcnal-test.sh TIMEOUT

2024-01-06 Thread David Ahern
On 1/5/24 2:41 AM, Mirsad Todorovac wrote:
> diff --git a/tools/testing/selftests/net/settings b/tools/testing/selftests/net/settings
> index dfc27cdc6c05..ed8418e8217a 100644
> --- a/tools/testing/selftests/net/settings
> +++ b/tools/testing/selftests/net/settings
> @@ -1 +1 @@
> -timeout=1500
> +timeout=3600

bumping the timeout is fine with me. that script runs a lot of
permutations.



Re: [PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation

2024-01-02 Thread David Ahern
On 1/2/24 6:24 AM, Richard Gobert wrote:
> The existing code always pulls the IPv6 header and sets the transport
> offset initially. Then optionally again pulls any extension headers in
> ipv6_gso_pull_exthdrs and sets the transport offset again on return from
> that call. skb->data is set at the start of the first extension header
> before calling ipv6_gso_pull_exthdrs, and must disable the frag0
> optimization because that function uses pskb_may_pull/pskb_pull instead of
> skb_gro_ helpers. It sets the GRO offset to the TCP header with
> skb_gro_pull and sets the transport header. Then returns skb->data to its
> position before this block.
> 
> This commit introduces a new helper function - ipv6_gro_pull_exthdrs -
> which is used in ipv6_gro_receive to pull ipv6 ext headers instead of
> ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all
> operations use skb_gro_* helpers, and the frag0 fast path can be taken for
> IPv6 packets with ext headers.
> 
> Signed-off-by: Richard Gobert 
> Reviewed-by: Willem de Bruijn 
> ---
>  include/net/ipv6.h |  1 +
>  net/ipv6/ip6_offload.c | 51 +-
>  2 files changed, 42 insertions(+), 10 deletions(-)
> 

Reviewed-by: David Ahern 





Re: [PATCH net-next v2 1/3] net: gso: add HBH extension header offload support

2024-01-02 Thread David Ahern
On 1/2/24 6:20 AM, Richard Gobert wrote:
> This commit adds net_offload to IPv6 Hop-by-Hop extension headers (as is
> done for routing and dstopts) since it is supported in GSO and GRO.
> This allows removing specific HBH conditionals in GSO and GRO when
> pulling and parsing an incoming packet.
> 
> Signed-off-by: Richard Gobert 
> Reviewed-by: Willem de Bruijn 
> ---
>  net/ipv6/exthdrs_offload.c | 11 +++
>  net/ipv6/ip6_offload.c | 25 +++--
>  2 files changed, 22 insertions(+), 14 deletions(-)
> 

Reviewed-by: David Ahern 





Re: [net-next v1 08/16] memory-provider: dmabuf devmem memory provider

2023-12-12 Thread David Ahern
On 12/12/23 6:09 PM, Mina Almasry wrote:
> OK, I imagine this is not that hard to implement - it's really whether
> the change is acceptable to reviewers.
> 
> I figure I can start by implementing a no-op abstraction to page*:
> 
> typedef struct page netmem_t
> 
> and replace the page* in the following places with netmem_t*:
> 
> 1. page_pool API (not internals)
> 2. drivers using the page_pool.
> 3. skb_frag_t.
> 

accessors to skb_frag_t fields are now consolidated in
include/linux/skbuff.h (the one IB driver was fixed in Sept by
4ececeb83986), so changing skb_frag_t from bio_vec to something like:

typedef struct skb_frag {
	void *addr;
	unsigned int length;
	unsigned int offset;
} skb_frag_t;

is trivial. From there, addr can default to `struct page *`. If LSB is
set, strip it and return `struct page_pool_iov *` or `struct buffer_pool *`





Re: [net-next v1 06/16] netdev: support binding dma-buf to netdevice

2023-12-09 Thread David Ahern
On 12/8/23 12:22 PM, Mina Almasry wrote:
> On Fri, Dec 8, 2023 at 9:48 AM David Ahern  wrote:
>>
>> On 12/7/23 5:52 PM, Mina Almasry wrote:
> ...
>>> +
>>> + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
>>> + if (rxq->binding == binding) {
>>> + /* We hold the rtnl_lock while binding/unbinding
>>> +  * dma-buf, so we can't race with another thread that
>>> +  * is also modifying this value. However, the driver
>>> +  * may read this config while it's creating its
>>> +  * rx-queues. WRITE_ONCE() here to match the
>>> +  * READ_ONCE() in the driver.
>>> +  */
>>> + WRITE_ONCE(rxq->binding, NULL);
>>> +
>>> + rxq_idx = get_netdev_rx_queue_index(rxq);
>>> +
>>> + netdev_restart_rx_queue(binding->dev, rxq_idx);
>>
>> Blindly restarting a queue when unbinding a dmabuf is heavy handed. If the dmabuf
>> has no outstanding references (ie., no references in the RxQ), then no
>> restart is needed.
>>
> 
> I think I need to stop the queue while binding to a dmabuf for the
> sake of concurrency, no? I.e. the softirq thread may be delivering a
> packet, and in parallel a separate thread holds rtnl_lock and tries to
> bind the dma-buf. At that point the page_pool recreation will race
> with the driver doing page_pool_alloc_page(). I don't think I can
> insert a lock to handle this into the rx fast path, no?

I think it depends on the details of how entries are added and removed
from the pool. I am behind on the pp details at this point, so I do need
to do some homework.

> 
> Also, this sounds like it requires (lots of) more changes. The
> page_pool + driver need to report how many pending references there
> are (with locking so we don't race with incoming packets), and have
> them reported via an ndo so that we can skip restarting the queue.
> Implementing the changes is not a huge issue but handling the
> concurrency may be a genuine blocker. Not sure it's worth the upside
> of not restarting the single rx queue?

It has to do with the usability of this overall solution. As I mentioned
most ML use cases can (and will want to) use many memory allocations for
receiving packets - e.g., allocations per message and receiving multiple
messages per socket connection.

> 
>>> + }
>>> + }
>>> +
>>> + xa_erase(&netdev_dmabuf_bindings, binding->id);
>>> +
>>> + netdev_dmabuf_binding_put(binding);
>>> +}
>>> +
>>> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>>> + struct netdev_dmabuf_binding *binding)
>>> +{
>>> + struct netdev_rx_queue *rxq;
>>> + u32 xa_idx;
>>> + int err;
>>> +
>>> + rxq = __netif_get_rx_queue(dev, rxq_idx);
>>> +
>>> + if (rxq->binding)
>>> + return -EEXIST;
>>> +
>>> + err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
>>> +GFP_KERNEL);
>>> + if (err)
>>> + return err;
>>> +
>>> + /* We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
>>> +  * race with another thread that is also modifying this value. However,
>>> +  * the driver may read this config while it's creating its rx-queues.
>>> +  * WRITE_ONCE() here to match the READ_ONCE() in the driver.
>>> +  */
>>> + WRITE_ONCE(rxq->binding, binding);
>>> +
>>> + err = netdev_restart_rx_queue(dev, rxq_idx);
>>
>> Similarly, here binding a dmabuf to a queue. I was expecting the dmabuf
>> binding to add entries to the page pool for the queue.
> 
> To be honest, I think maybe there's a slight disconnect between how
> you think the page_pool works, and my primitive understanding of how
> it works. Today, I see a 1:1 mapping between rx-queue and page_pool in
> the code. I don't see 1:many or many:1 mappings.

I am not referring to 1:N or N:1 for page pool and queues. I am
referring to entries within a single page pool for a single Rx queue.


> 
> In theory mapping 1 rx-queue to n page_pools is trivial: the driver
> can call page_pool_create() multiple times to generate n queues and
> decide for incoming packets which one to use.
> 
> However, mapping n rx-queues to 1 page_pool seems like a can of worms.
> I see code in the page_pool that looks t


Re: [net-next v1 00/16] Device Memory TCP

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> Major changes in v1:
> --
> 
> 1. Implemented MVP queue API ndos to remove the userspace-visible
>driver reset.
> 
> 2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
> 
> 3. Removed RFC tag.
> 
> Many smaller addressed comments across all the patches (patches have
> individual change log).
> 
> Full tree including the rest of the GVE driver changes:
> https://github.com/mina/linux/commits/tcpdevmem-v1
> 

Still a lot of DEVMEM references (e.g., socket API). Any reason not to
move those to DMABUF?






Re: [net-next v1 07/16] netdev: netdevice devmem allocator

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b8c8be5a912e..30667e4c3b95 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2120,6 +2120,41 @@ static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx)
>   return err;
>  }
>  
> +struct page_pool_iov *netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding)
> +{
> + struct dmabuf_genpool_chunk_owner *owner;
> + struct page_pool_iov *ppiov;
> + unsigned long dma_addr;
> + ssize_t offset;
> + ssize_t index;
> +
> + dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,

Any reason not to allow allocation sizes other than PAGE_SIZE? e.g.,
2048 for smaller MTUs or 8192 for larger ones. It can be a property of
page_pool and constant across allocations vs allowing different size for
each allocation.





Re: [net-next v1 13/16] tcp: RX path for devmem TCP

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> In tcp_recvmsg_locked(), detect if the skb being received by the user
> is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> flag - pass it to tcp_recvmsg_devmem() for custom handling.
> 
> tcp_recvmsg_devmem() copies any data in the skb header to the linear
> buffer, and returns a cmsg to the user indicating the number of bytes
> returned in the linear buffer.
> 
> tcp_recvmsg_devmem() then loops over the inaccessible devmem skb frags,
> and returns to the user a cmsg_devmem indicating the location of the
> data in the dmabuf device memory. cmsg_devmem contains this information:
> 
> 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> 2. the size of the frag. 'frag_size'.
> 3. an opaque token 'frag_token' to return to the kernel when the buffer
> is to be released.
> 
> The pages awaiting freeing are stored in the newly added
> sk->sk_user_pages, and each page passed to userspace is get_page()'d.
> This reference is dropped once the userspace indicates that it is
> done reading this page.  All pages are released when the socket is
> destroyed.
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> Changes in v1:
> - Added dmabuf_id to dmabuf_cmsg (David/Stan).
> - Devmem -> dmabuf (David).
> - Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo).
> - Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng).
> 
> RFC v3:
> - Fixed issue with put_cmsg() failing silently.
> 

What happens if a retransmitted packet is received or an rx window is
closed and a probe is received where the kernel drops the skb - is the
iov reference(s) in the skb returned to the pool by the stack and ready
for use again?





Re: [net-next v1 06/16] netdev: support binding dma-buf to netdevice

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> +
> +static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx)
> +{
> + void *new_mem;
> + void *old_mem;
> + int err;
> +
> + if (!dev || !dev->netdev_ops)
> + return -EINVAL;
> +
> + if (!dev->netdev_ops->ndo_queue_stop ||
> + !dev->netdev_ops->ndo_queue_mem_free ||
> + !dev->netdev_ops->ndo_queue_mem_alloc ||
> + !dev->netdev_ops->ndo_queue_start)
> + return -EOPNOTSUPP;
> +
> + new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
> + if (!new_mem)
> + return -ENOMEM;
> +
> + err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem);
> + if (err)
> + goto err_free_new_mem;
> +
> + err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
> + if (err)
> + goto err_start_queue;
> +
> + dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
> +
> + return 0;
> +
> +err_start_queue:
> + dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
> +
> +err_free_new_mem:
> + dev->netdev_ops->ndo_queue_mem_free(dev, new_mem);
> +
> + return err;
> +}
> +
> +/* Protected by rtnl_lock() */
> +static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1);
> +
> +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + unsigned long xa_idx;
> + unsigned int rxq_idx;
> +
> + if (!binding)
> + return;
> +
> + if (binding->list.next)
> + list_del(&binding->list);
> +
> + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
> + if (rxq->binding == binding) {
> + /* We hold the rtnl_lock while binding/unbinding
> +  * dma-buf, so we can't race with another thread that
> +  * is also modifying this value. However, the driver
> +  * may read this config while it's creating its
> +  * rx-queues. WRITE_ONCE() here to match the
> +  * READ_ONCE() in the driver.
> +  */
> + WRITE_ONCE(rxq->binding, NULL);
> +
> + rxq_idx = get_netdev_rx_queue_index(rxq);
> +
> + netdev_restart_rx_queue(binding->dev, rxq_idx);

Blindly restarting a queue when unbinding a dmabuf is heavy handed. If the dmabuf
has no outstanding references (ie., no references in the RxQ), then no
restart is needed.

> + }
> + }
> +
> + xa_erase(&netdev_dmabuf_bindings, binding->id);
> +
> + netdev_dmabuf_binding_put(binding);
> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> + struct netdev_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + u32 xa_idx;
> + int err;
> +
> + rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> + if (rxq->binding)
> + return -EEXIST;
> +
> + err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> +GFP_KERNEL);
> + if (err)
> + return err;
> +
> + /* We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
> +  * race with another thread that is also modifying this value. However,
> +  * the driver may read this config while it's creating its rx-queues.
> +  * WRITE_ONCE() here to match the READ_ONCE() in the driver.
> +  */
> + WRITE_ONCE(rxq->binding, binding);
> +
> + err = netdev_restart_rx_queue(dev, rxq_idx);

Similarly, here binding a dmabuf to a queue. I was expecting the dmabuf
binding to add entries to the page pool for the queue. If the pool was
previously empty, then maybe the queue needs to be "started" in the
sense of creating with h/w or just pushing buffers into the queue and
moving the pidx.





Re: [net-next v1 06/16] netdev: support binding dma-buf to netdevice

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> +
> +static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx)
> +{
> + void *new_mem;
> + void *old_mem;
> + int err;
> +
> + if (!dev || !dev->netdev_ops)
> + return -EINVAL;
> +
> + if (!dev->netdev_ops->ndo_queue_stop ||
> + !dev->netdev_ops->ndo_queue_mem_free ||
> + !dev->netdev_ops->ndo_queue_mem_alloc ||
> + !dev->netdev_ops->ndo_queue_start)
> + return -EOPNOTSUPP;
> +
> + new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
> + if (!new_mem)
> + return -ENOMEM;
> +
> + err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, _mem);
> + if (err)
> + goto err_free_new_mem;
> +
> + err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
> + if (err)
> + goto err_start_queue;
> +
> + dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
> +
> + return 0;
> +
> +err_start_queue:
> + dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
> +
> +err_free_new_mem:
> + dev->netdev_ops->ndo_queue_mem_free(dev, new_mem);
> +
> + return err;
> +}
> +
> +/* Protected by rtnl_lock() */
> +static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1);
> +
> +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + unsigned long xa_idx;
> + unsigned int rxq_idx;
> +
> + if (!binding)
> + return;
> +
> + if (binding->list.next)
> + list_del(&binding->list);
> +
> + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
> + if (rxq->binding == binding) {
> + /* We hold the rtnl_lock while binding/unbinding
> +  * dma-buf, so we can't race with another thread that
> +  * is also modifying this value. However, the driver
> +  * may read this config while it's creating its
> +  * rx-queues. WRITE_ONCE() here to match the
> +  * READ_ONCE() in the driver.
> +  */
> + WRITE_ONCE(rxq->binding, NULL);
> +
> + rxq_idx = get_netdev_rx_queue_index(rxq);
> +
> + netdev_restart_rx_queue(binding->dev, rxq_idx);

Blindly restarting a queue when a dmabuf is heavy handed. If the dmabuf
has no outstanding references (ie., no references in the RxQ), then no
restart is needed.

> + }
> + }
> +
> + xa_erase(&netdev_dmabuf_bindings, binding->id);
> +
> + netdev_dmabuf_binding_put(binding);
> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> + struct netdev_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + u32 xa_idx;
> + int err;
> +
> + rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> + if (rxq->binding)
> + return -EEXIST;
> +
> + err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> +		GFP_KERNEL);
> + if (err)
> + return err;
> +
> + /* We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
> +  * race with another thread that is also modifying this value. However,
> +  * the driver may read this config while it's creating its rx-queues.
> +  * WRITE_ONCE() here to match the READ_ONCE() in the driver.
> +  */
> + WRITE_ONCE(rxq->binding, binding);
> +
> + err = netdev_restart_rx_queue(dev, rxq_idx);

Similarly, here binding a dmabuf to a queue. I was expecting the dmabuf
binding to add entries to the page pool for the queue. If the pool was
previously empty, then maybe the queue needs to be "started" in the
sense of creating with h/w or just pushing buffers into the queue and
moving the pidx.




Re: [PATCHv3 net-next 01/14] selftests/net: add lib.sh

2023-12-07 Thread David Ahern
On 12/7/23 3:34 AM, Petr Machata wrote:
> But what I object against is that the library uses trap without having a
> way for user scripts to schedule at-exit work, because that's used
> literally everywhere in forwarding tests. 

+1



Re: [PATCH net-next 00/38] Conver all net selftests to run in unique namespace

2023-11-26 Thread David Ahern
On 11/24/23 2:26 AM, Hangbin Liu wrote:
> As Guillaume pointed, many selftests create namespaces with very common
> names (like "client" or "server") or even (partially) run directly in 
> init_net.
> This makes these tests prone to failure if another namespace with the same
> name already exists. It also makes it impossible to run several instances
> of these tests in parallel.
> 
> This patch set converts all the net selftests to run in unique namespaces,
> so we can update the selftest framework to run all tests in their own
> namespaces in parallel. After the update, we only need to wait for the test
> which needs the longest time.
> 
> ]# per_test_logging=1 time ./run_kselftest.sh -n -c net
> TAP version 13
> # selftests: net: reuseport_bpf_numa
> not ok 3 selftests: net: reuseport_bpf_numa # exit=1
> # selftests: net: reuseport_bpf_cpu
> not ok 2 selftests: net: reuseport_bpf_cpu # exit=1
> # selftests: net: reuseport_dualstack
> not ok 4 selftests: net: reuseport_dualstack # exit=1
> # selftests: net: reuseaddr_conflict
> ok 5 selftests: net: reuseaddr_conflict
> 
> ...
> 
> # selftests: net: test_vxlan_mdb.sh
> ok 90 selftests: net: test_vxlan_mdb.sh
> # selftests: net: fib_nexthops.sh
> not ok 41 selftests: net: fib_nexthops.sh # exit=1
> # selftests: net: fcnal-test.sh
> not ok 36 selftests: net: fcnal-test.sh # exit=1
> 
> real	55m1.238s
> user	12m10.350s
> sys 22m17.432s
> 
> 

I have not looked at the details of each script change, but as long as
no test is broken by the change I am fine with the intent.

Acked-by: David Ahern 





Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-11 Thread David Ahern
On 11/10/23 7:26 AM, Pavel Begunkov wrote:
> On 11/7/23 23:03, Mina Almasry wrote:
>> On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:
>>>
>>> On 11/7/23 3:10 PM, Mina Almasry wrote:
>>>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>>>>
>>>>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>>> index eeeda849115c..1c351c138a5b 100644
>>>>>> --- a/include/linux/netdevice.h
>>>>>> +++ b/include/linux/netdevice.h
>>>>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>>>>   };
>>>>>>
>>>>>>   #ifdef CONFIG_DMA_SHARED_BUFFER
>>>>>> +struct page_pool_iov *
>>>>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>>>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>>>>
>>>>> netdev_{alloc,free}_dmabuf?
>>>>>
>>>>
>>>> Can do.
>>>>
>>>>> I say that because a dmabuf can be host memory, at least I am not
>>>>> aware
>>>>> of a restriction that a dmabuf is device memory.
>>>>>
>>>>
>>>> In my limited experience dma-buf is generally device memory, and
>>>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>>>> dma-buf with a memfd which I think is used for testing. But I can do
>>>> the rename, it's more clear anyway, I think.
>>>
>>> config UDMABUF
>>>  bool "userspace dmabuf misc driver"
>>>  default n
>>>  depends on DMA_SHARED_BUFFER
>>>  depends on MEMFD_CREATE || COMPILE_TEST
>>>  help
>>>    A driver to let userspace turn memfd regions into dma-bufs.
>>>    Qemu can use this to create host dmabufs for guest
>>> framebuffers.
>>>
>>>
>>> Qemu is just a userspace process; it is no way a special one.
>>>
>>> Treating host memory as a dmabuf should radically simplify the io_uring
>>> extension of this set.
>>
>> I agree actually, and I was about to make that comment to David Wei's
>> series once I have the time.
>>
>> David, your io_uring RX zerocopy proposal actually works with devmem
>> TCP, if you're inclined to do that instead, what you'd do roughly is
>> (I think):
> That would be a Frankenstein's monster api with no good reason for it.

It brings a consistent API from a networking perspective.

io_uring should not need to be in the page pool and memory management
business. Have you or David coded up the re-use of the socket APIs with
dmabuf to see how much smaller it makes the io_uring change - or even
walked through from a theoretical perspective?





Re: [RFC PATCH v3 06/12] memory-provider: dmabuf devmem memory provider

2023-11-07 Thread David Ahern
On 11/7/23 5:02 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 1:02 PM Stanislav Fomichev  wrote:
>>
>> On 11/05, Mina Almasry wrote:
>>> +static inline bool page_is_page_pool_iov(const struct page *page)
>>> +{
>>> + return (unsigned long)page & PP_DEVMEM;
>>> +}
>>
>> Speaking of bpf: one thing that might be problematic with this PP_DEVMEM
>> bit is that it will make debugging with bpftrace a bit (more)
>> complicated. If somebody were trying to get to that page_pool_iov from
>> the frags, they will have to do the equivalent of page_is_page_pool_iov,
>> but probably not a big deal? (thinking out loud)
> 
> Good point, but that doesn't only apply to bpf I think. I'm guessing
> even debugger drgn access to the bv_page in the frag will have trouble
> if it's actually accessing an iov with LSB set.
> 
> But this is not specific to this use for LSB pointer trick. I think
> all code that currently uses LSB pointer trick will have similar
> troubles. In this context my humble vote is that we get such big
> upside from reducing code churn that it's reasonable to tolerate such
> side effects.

+1

> 
> I could alleviate some of the issues by teaching drgn to do the right
> thing for devmem/iovs... time permitting.
> 
Tools like drgn and crash have to know when the LSB trick is used  -
e.g., dst_entry - and handle it when dereferencing pointers.


Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP

2023-11-07 Thread David Ahern
On 11/7/23 4:55 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 4:03 PM Willem de Bruijn
>  wrote:
>>
>> On Mon, Nov 6, 2023 at 3:55 PM David Ahern  wrote:
>>>
>>> On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
>>>>> The concise notification API returns tokens as a range for
>>>>> compression, encoding as two 32-bit unsigned integers start + length.
>>>>> It allows for even further batching by returning multiple such ranges
>>>>> in a single call.
>>>>
>>>> Tangential: should tokens be u64? Otherwise we can't have more than
>>>> 4gb unacknowledged. Or that's a reasonable constraint?
>>>>
>>>
>>> Was thinking the same and with bits reserved for a dmabuf id to allow
>>> multiple dmabufs in a single rx queue (future extension, but build the
>>> capability in now). e.g., something like a 37b offset (128GB dmabuf
>>> size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).
>>
>> Agreed. Converting to 64b now sounds like a good forward looking revision.
> 
> The concept of IDing a dma-buf came up in a couple of different
> contexts. First, in the context of us giving the dma-buf ID to the
> user on recvmsg() to tell the user the data is in this specific
> dma-buf. The second context is here, to bind dma-bufs with multiple
> user-visible IDs to an rx queue.
> 
> My issue here is that I don't see anything in the struct dma_buf that
> can practically serve as an ID:
> 
> https://elixir.bootlin.com/linux/v6.6-rc7/source/include/linux/dma-buf.h#L302
> 
> Actually, from the userspace, only the name of the dma-buf seems
> queryable. That's only unique if the user sets it as such. The dmabuf
> FD can't serve as an ID. For our use case we need to support 1 process
> doing the dma-buf bind via netlink, sharing the dma-buf FD to another
> process, and that process receives the data.  In this case the FDs
> shown by the 2 processes may be different. Converting to 64b is a
> trivial change I can make now, but I'm not sure how to ID these
> dma-bufs. Suggestions welcome. I'm not sure the dma-buf guys will
> allow adding a new ID + APIs to query said dma-buf ID.
> 

The API can be unique to this usage: e.g., add a dmabuf id to the
netlink API. Userspace manages the ids (tells the kernel what value to
use with an instance), the kernel validates no 2 dmabufs have the same
id and then returns the value here.




Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread David Ahern
On 11/7/23 3:10 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>
>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index eeeda849115c..1c351c138a5b 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>  };
>>>
>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
>>> +struct page_pool_iov *
>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>
>> netdev_{alloc,free}_dmabuf?
>>
> 
> Can do.
> 
>> I say that because a dmabuf can be host memory, at least I am not aware
>> of a restriction that a dmabuf is device memory.
>>
> 
> In my limited experience dma-buf is generally device memory, and
> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> dma-buf with a memfd which I think is used for testing. But I can do
> the rename, it's more clear anyway, I think.

config UDMABUF
bool "userspace dmabuf misc driver"
default n
depends on DMA_SHARED_BUFFER
depends on MEMFD_CREATE || COMPILE_TEST
help
  A driver to let userspace turn memfd regions into dma-bufs.
  Qemu can use this to create host dmabufs for guest framebuffers.


Qemu is just a userspace process; it is no way a special one.

Treating host memory as a dmabuf should radically simplify the io_uring
extension of this set. That the io_uring set needs to dive into
page_pools is just wrong - complicating the design and code and pushing
io_uring into a realm it does not need to be involved in.

Most (all?) of this patch set can work with any memory; only device
memory is unreadable.




Re: [RFC PATCH v3 00/12] Device Memory TCP

2023-11-07 Thread David Ahern
Is there a policy about cc'ing moderated lists on patch sets? I thought
there was, but not finding anything under Documentation/. Getting a
'needs moderator approval response' on every message is rather annoying.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 5:20 PM, Mina Almasry wrote:
> The user is free to modify or delete flow steering rules outside of the
> lifetime of the socket. Technically it's possible for the user to
> reconfigure flow steering while the socket is simultaneously receiving,
> and the result will be packets switching from devmem to non-devmem.

generically, from one page pool to another (ie., devmem piece of that
statement is not relevant).


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 176eb5834746..cdd4fb129968 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, 
> int offset,
>   return 0;
>   }
>  
> + if (skb_frags_not_readable(skb))
> + goto short_copy;
> +
>   /* Copy paged appendix. Hmm... why does this look so complicated? */
>   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
>   int end;
> @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct 
> sock *sk,
>  {
>   int frag;
>  
> + if (skb_frags_not_readable(skb))
> + return -EFAULT;

This check 
> +
>   if (msg && msg->msg_ubuf && msg->sg_from_iter)
>   return msg->sg_from_iter(sk, skb, from, length);


... should go here. That allows a custom sg_from_iter to have access to
the skb. What matters is not expecting struct page (e.g., refcounting);
if the custom iter does not do that then all is well. io_uring's iter
does not look at the pages, so all good.

>  
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 42d7f6755f32..56046d65386a 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int 
> grow)
>  {
>   struct skb_shared_info *pinfo = skb_shinfo(skb);
>  
> + if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
> + return;
> +
>   BUG_ON(skb->end - skb->tail < grow);
>  
>   memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
>  {
>   int grow = skb_gro_offset(skb) - skb_headlen(skb);
>  
> - if (grow > 0)
> + if (grow > 0 && !skb_frags_not_readable(skb))
>   gro_pull_from_frag0(skb, grow);
>  }
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 13eca4fd25e1..f01673ed2eff 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct sk_buff 
> *skb, bool full_pkt)
>   struct page *p;
>   u8 *vaddr;
>  
> + if (skb_frag_is_page_pool_iov(frag)) {

Why skb_frag_is_page_pool_iov here vs skb_frags_not_readable?


Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP

2023-11-06 Thread David Ahern
On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
>> The concise notification API returns tokens as a range for
>> compression, encoding as two 32-bit unsigned integers start + length.
>> It allows for even further batching by returning multiple such ranges
>> in a single call.
> 
> Tangential: should tokens be u64? Otherwise we can't have more than
> 4gb unacknowledged. Or that's a reasonable constraint?
> 

Was thinking the same and with bits reserved for a dmabuf id to allow
multiple dmabufs in a single rx queue (future extension, but build the
capability in now). e.g., something like a 37b offset (128GB dmabuf
size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).


Re: [RFC PATCH v3 06/12] memory-provider: dmabuf devmem memory provider

2023-11-06 Thread David Ahern
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 78cbb040af94..b93243c2a640 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov *ppiov)
>   return page_pool_iov_owner(ppiov)->binding;
>  }
>  
> +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
> +{
> + return refcount_read(&ppiov->refcount);
> +}
> +
> +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
> +   unsigned int count)
> +{
> + refcount_add(count, &ppiov->refcount);
> +}
> +
> +void __page_pool_iov_free(struct page_pool_iov *ppiov);
> +
> +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
> +   unsigned int count)
> +{
> + if (!refcount_sub_and_test(count, &ppiov->refcount))
> + return;
> +
> + __page_pool_iov_free(ppiov);
> +}
> +
> +/* page pool mm helpers */
> +
> +static inline bool page_is_page_pool_iov(const struct page *page)
> +{
> + return (unsigned long)page & PP_DEVMEM;

This is another one where the code can be more generic to not force a
lot changes later.  e.g., PP_CUSTOM or PP_NO_PAGE. Then io_uring use
case with host memory can leverage the iov pool in a similar manner.

That does mean skb->devmem needs to be a flag on the page pool and not
just assume iov == device memory.




Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-06 Thread David Ahern
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeeda849115c..1c351c138a5b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>  };
>  
>  #ifdef CONFIG_DMA_SHARED_BUFFER
> +struct page_pool_iov *
> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> +void netdev_free_devmem(struct page_pool_iov *ppiov);

netdev_{alloc,free}_dmabuf?

I say that because a dmabuf can be host memory, at least I am not aware
of a restriction that a dmabuf is device memory.



Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 3:18 PM, Mina Almasry wrote:
>> @@ -991,7 +993,7 @@ struct sk_buff {
>>  #if IS_ENABLED(CONFIG_IP_SCTP)
>>  __u8csum_not_inet:1;
>>  #endif
>> -
>> +__u8devmem:1;
>>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>  __u16   tc_index;   /* traffic control index */
>>  #endif
>> @@ -1766,6 +1768,12 @@ static inline void 
>> skb_zcopy_downgrade_managed(struct sk_buff *skb)
>>  __skb_zcopy_downgrade_managed(skb);
>>  }
>>
>> +/* Return true if frags in this skb are not readable by the host. */
>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>> +{
>> +return skb->devmem;
>
> bikeshedding: should we also rename 'devmem' sk_buff flag to 
> 'not_readable'?
> It better communicates the fact that the stack shouldn't dereference the
> frags (because it has 'devmem' fragments or for some other potential
> future reason).

 +1.

 Also, the flag on the skb is an optimization - a high level signal that
 one or more frags is in unreadable memory. There is no requirement that
 all of the frags are in the same memory type.
>>
>> David: maybe there should be such a requirement (that they all are
>> unreadable)? Might be easier to support initially; we can relax later
>> on.
>>
> 
> Currently devmem == not_readable, and the restriction is that all the
> frags in the same skb must be either all readable or all unreadable
> (all devmem or all non-devmem).

What requires that restriction? In all of the uses of skb->devmem and
skb_frags_not_readable() what matters is if any frag is not readable,
then frag list walk or collapse is avoided.



Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> On 11/05, Mina Almasry wrote:
>> For device memory TCP, we expect the skb headers to be available in host
>> memory for access, and we expect the skb frags to be in device memory
>> and unaccessible to the host. We expect there to be no mixing and
>> matching of device memory frags (unaccessible) with host memory frags
>> (accessible) in the same skb.
>>
>> Add a skb->devmem flag which indicates whether the frags in this skb
>> are device memory frags or not.
>>
>> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
>> and marks the skb as skb->devmem accordingly.
>>
>> Add checks through the network stack to avoid accessing the frags of
>> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
>>
>> Signed-off-by: Willem de Bruijn 
>> Signed-off-by: Kaiyuan Zhang 
>> Signed-off-by: Mina Almasry 
>>
>> ---
>>  include/linux/skbuff.h | 14 +++-
>>  include/net/tcp.h  |  5 +--
>>  net/core/datagram.c|  6 
>>  net/core/gro.c |  5 ++-
>>  net/core/skbuff.c  | 77 --
>>  net/ipv4/tcp.c |  6 
>>  net/ipv4/tcp_input.c   | 13 +--
>>  net/ipv4/tcp_output.c  |  5 ++-
>>  net/packet/af_packet.c |  4 +--
>>  9 files changed, 115 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 1fae276c1353..8fb468ff8115 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
>>   *  @csum_level: indicates the number of consecutive checksums found in
>>   *  the packet minus one that have been verified as
>>   *  CHECKSUM_UNNECESSARY (max 3)
>> + *  @devmem: indicates that all the fragments in this skb are backed by
>> + *  device memory.
>>   *  @dst_pending_confirm: need to confirm neighbour
>>   *  @decrypted: Decrypted SKB
>>   *  @slow_gro: state present at GRO time, slower prepare step required
>> @@ -991,7 +993,7 @@ struct sk_buff {
>>  #if IS_ENABLED(CONFIG_IP_SCTP)
>>  __u8csum_not_inet:1;
>>  #endif
>> -
>> +__u8devmem:1;
>>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>  __u16   tc_index;   /* traffic control index */
>>  #endif
>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct 
>> sk_buff *skb)
>>  __skb_zcopy_downgrade_managed(skb);
>>  }
>>  
>> +/* Return true if frags in this skb are not readable by the host. */
>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>> +{
>> +return skb->devmem;
> 
> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> It better communicates the fact that the stack shouldn't dereference the
> frags (because it has 'devmem' fragments or for some other potential
> future reason).

+1.

Also, the flag on the skb is an optimization - a high level signal that
one or more frags is in unreadable memory. There is no requirement that
all of the frags are in the same memory type.


Re: [PATCH] selftests:net change ifconfig with ip command

2023-10-23 Thread David Ahern
On 10/23/23 1:12 AM, Jiri Pirko wrote:
>> ip addr add ...
> 
> Why not "address" then? :)

no objection from me.

> What's wrong with "a"?

1-letter commands can be ambiguous. Test scripts should be clear and
obvious.




Re: [PATCH v2] selftests:net change ifconfig with ip command

2023-10-23 Thread David Ahern
On 10/23/23 6:34 AM, Swarup Laxman Kotiaklapudi wrote:
> Change ifconfig with ip command,
> on a system where ifconfig is
> not used this script will not
> work correctly.
> 
> Test result with this patchset:
> 
> sudo make TARGETS="net" kselftest
> 
> TAP version 13
> 1..1
>  timeout set to 1500
>  selftests: net: route_localnet.sh
>  run arp_announce test
>  net.ipv4.conf.veth0.route_localnet = 1
>  net.ipv4.conf.veth1.route_localnet = 1
>  net.ipv4.conf.veth0.arp_announce = 2
>  net.ipv4.conf.veth1.arp_announce = 2
>  PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
>   bytes of data.
>  64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.038 ms
>  64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.068 ms
>  64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.068 ms
>  64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.068 ms
>  64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.068 ms
> 
>  --- 127.25.3.14 ping statistics ---
>  5 packets transmitted, 5 received, 0% packet loss, time 4073ms
>  rtt min/avg/max/mdev = 0.038/0.062/0.068/0.012 ms
>  ok
>  run arp_ignore test
>  net.ipv4.conf.veth0.route_localnet = 1
>  net.ipv4.conf.veth1.route_localnet = 1
>  net.ipv4.conf.veth0.arp_ignore = 3
>  net.ipv4.conf.veth1.arp_ignore = 3
>  PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
>   bytes of data.
>  64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.032 ms
>  64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.065 ms
>  64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.066 ms
>  64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.065 ms
>  64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.065 ms
> 
>  --- 127.25.3.14 ping statistics ---
>  5 packets transmitted, 5 received, 0% packet loss, time 4092ms
>  rtt min/avg/max/mdev = 0.032/0.058/0.066/0.013 ms
>  ok
> ok 1 selftests: net: route_localnet.sh
> ...
> 
> Signed-off-by: Swarup Laxman Kotiaklapudi 
> ---
>  tools/testing/selftests/net/route_localnet.sh | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/net/route_localnet.sh 
> b/tools/testing/selftests/net/route_localnet.sh
> index 116bfeab72fa..e08701c750e3 100755
> --- a/tools/testing/selftests/net/route_localnet.sh
> +++ b/tools/testing/selftests/net/route_localnet.sh
> @@ -18,8 +18,10 @@ setup() {
>  ip route del 127.0.0.0/8 dev lo table local
>  ip netns exec "${PEER_NS}" ip route del 127.0.0.0/8 dev lo table local
>  
> -ifconfig veth0 127.25.3.4/24 up
> -ip netns exec "${PEER_NS}" ifconfig veth1 127.25.3.14/24 up
> +ip address add 127.25.3.4/24 dev veth0
> +ip link set dev veth0 up
> +ip netns exec "${PEER_NS}" ip address add 127.25.3.14/24 dev veth1
> +ip netns exec "${PEER_NS}" ip link set dev veth1 up
>  
>  ip route flush cache
>  ip netns exec "${PEER_NS}" ip route flush cache

Reviewed-by: David Ahern 



Re: [PATCH] selftests:net change ifconfig with ip command

2023-10-22 Thread David Ahern
On 10/22/23 5:31 AM, Swarup Laxman Kotiaklapudi wrote:
> diff --git a/tools/testing/selftests/net/route_localnet.sh 
> b/tools/testing/selftests/net/route_localnet.sh
> index 116bfeab72fa..3ab9beb4462c 100755
> --- a/tools/testing/selftests/net/route_localnet.sh
> +++ b/tools/testing/selftests/net/route_localnet.sh
> @@ -18,8 +18,10 @@ setup() {
>  ip route del 127.0.0.0/8 dev lo table local
>  ip netns exec "${PEER_NS}" ip route del 127.0.0.0/8 dev lo table local
>  
> -ifconfig veth0 127.25.3.4/24 up
> -ip netns exec "${PEER_NS}" ifconfig veth1 127.25.3.14/24 up
> +ip a add 127.25.3.4/24 dev veth0

ip addr add ...

> +ip link set dev veth0 up
> +ip netns exec "${PEER_NS}" ip a add 127.25.3.14/24 dev veth1

ip addr add ...

> +ip netns exec "${PEER_NS}" ip link set dev veth1 up
>  
>  ip route flush cache
>  ip netns exec "${PEER_NS}" ip route flush cache



Re: [PATCH] neighbor: tracing: Move pin6 inside CONFIG_IPV6=y section

2023-10-17 Thread David Ahern
On 10/16/23 6:49 AM, Geert Uytterhoeven wrote:
> When CONFIG_IPV6=n, and building with W=1:
> 
> In file included from include/trace/define_trace.h:102,
>from include/trace/events/neigh.h:255,
>from net/core/net-traces.c:51:
> include/trace/events/neigh.h: In function 
> ‘trace_event_raw_event_neigh_create’:
> include/trace/events/neigh.h:42:34: error: variable ‘pin6’ set but not 
> used [-Werror=unused-but-set-variable]
>42 | struct in6_addr *pin6;
> |  ^~~~
> include/trace/trace_events.h:402:11: note: in definition of macro 
> ‘DECLARE_EVENT_CLASS’
>   402 | { assign; }   
>   \
> |   ^~
> include/trace/trace_events.h:44:30: note: in expansion of macro ‘PARAMS’
>44 |  PARAMS(assign),   \
> |  ^~
> include/trace/events/neigh.h:23:1: note: in expansion of macro 
> ‘TRACE_EVENT’
>23 | TRACE_EVENT(neigh_create,
> | ^~~
> include/trace/events/neigh.h:41:9: note: in expansion of macro 
> ‘TP_fast_assign’
>41 | TP_fast_assign(
> | ^~
> In file included from include/trace/define_trace.h:103,
>from include/trace/events/neigh.h:255,
>from net/core/net-traces.c:51:
> include/trace/events/neigh.h: In function ‘perf_trace_neigh_create’:
> include/trace/events/neigh.h:42:34: error: variable ‘pin6’ set but not 
> used [-Werror=unused-but-set-variable]
>42 | struct in6_addr *pin6;
> |  ^~~~
> include/trace/perf.h:51:11: note: in definition of macro 
> ‘DECLARE_EVENT_CLASS’
>51 | { assign; }   
>   \
> |   ^~
> include/trace/trace_events.h:44:30: note: in expansion of macro ‘PARAMS’
>44 |  PARAMS(assign),   \
> |  ^~
> include/trace/events/neigh.h:23:1: note: in expansion of macro 
> ‘TRACE_EVENT’
>23 | TRACE_EVENT(neigh_create,
> | ^~~
> include/trace/events/neigh.h:41:9: note: in expansion of macro 
> ‘TP_fast_assign’
>41 | TP_fast_assign(
> | ^~
> 
> Indeed, the variable pin6 is declared and initialized unconditionally,
> while it is only used and needlessly re-initialized when support for
> IPv6 is enabled.
> 
> Fix this by dropping the unused variable initialization, and moving the
> variable declaration inside the existing section protected by a check
> for CONFIG_IPV6.
> 
> Fixes: fc651001d2c5ca4f ("neighbor: Add tracepoint to __neigh_create")
> Signed-off-by: Geert Uytterhoeven 
> ---
> No changes in generated code.
> 
>  include/trace/events/neigh.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Reviewed-by: David Ahern 

Google put your email in the spam folder, BTW.



Re: [PATCH v2 10/15] vrf: Remove the now superfluous sentinel element from ctl_table array

2023-10-04 Thread David Ahern
On 10/2/23 2:55 AM, Joel Granados via B4 Relay wrote:
> From: Joel Granados 
> 
> This commit comes at the tail end of a greater effort to remove the
> empty elements at the end of the ctl_table arrays (sentinels) which
> will reduce the overall build time size of the kernel and run time
> memory bloat by ~64 bytes per sentinel (further information Link :
> https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/)
> 
> Remove sentinel from vrf_table
> 
> Signed-off-by: Joel Granados 
> ---
>  drivers/net/vrf.c | 1 -
>  1 file changed, 1 deletion(-)
> 

Reviewed-by: David Ahern 





Re: [PATCH] nexthop: Annotate struct nh_group with __counted_by

2023-10-03 Thread David Ahern
On 10/3/23 7:44 PM, Kees Cook wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
> array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> As found with Coccinelle[1], add __counted_by for struct nh_group.
> 
> Cc: David Ahern 
> Cc: "David S. Miller" 
> Cc: Eric Dumazet 
> Cc: Jakub Kicinski 
> Cc: Paolo Abeni 
> Cc: net...@vger.kernel.org
> Link: 
> https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
>  [1]
> Signed-off-by: Kees Cook 
> ---
>  include/net/nexthop.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH] nexthop: Annotate struct nh_notifier_grp_info with __counted_by

2023-10-03 Thread David Ahern
On 10/3/23 5:21 PM, Kees Cook wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
> array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> As found with Coccinelle[1], add __counted_by for struct nh_notifier_grp_info.
> 
> Cc: David Ahern 
> Cc: "David S. Miller" 
> Cc: Eric Dumazet 
> Cc: Jakub Kicinski 
> Cc: Paolo Abeni 
> Cc: Nathan Chancellor 
> Cc: Nick Desaulniers 
> Cc: Tom Rix 
> Cc: net...@vger.kernel.org
> Cc: l...@lists.linux.dev
> Link: 
> https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
>  [1]
> Signed-off-by: Kees Cook 
> ---
>  include/net/nexthop.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 


Reviewed-by: David Ahern 




Re: [PATCH] nexthop: Annotate struct nh_notifier_res_table_info with __counted_by

2023-10-03 Thread David Ahern
On 10/3/23 5:18 PM, Kees Cook wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
> array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> As found with Coccinelle[1], add __counted_by for struct
> nh_notifier_res_table_info.
> 
> Cc: David Ahern 
> Cc: "David S. Miller" 
> Cc: Eric Dumazet 
> Cc: Jakub Kicinski 
> Cc: Paolo Abeni 
> Cc: Nathan Chancellor 
> Cc: Nick Desaulniers 
> Cc: Tom Rix 
> Cc: net...@vger.kernel.org
> Cc: l...@lists.linux.dev
> Link: 
> https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
>  [1]
> Signed-off-by: Kees Cook 
> ---
>  include/net/nexthop.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Reviewed-by: David Ahern 





Re: [PATCH] nexthop: Annotate struct nh_res_table with __counted_by

2023-10-03 Thread David Ahern
On 10/3/23 5:18 PM, Kees Cook wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
> array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> As found with Coccinelle[1], add __counted_by for struct nh_res_table.
> 
> Cc: David Ahern 
> Cc: "David S. Miller" 
> Cc: Eric Dumazet 
> Cc: Jakub Kicinski 
> Cc: Paolo Abeni 
> Cc: net...@vger.kernel.org
> Link: 
> https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
>  [1]
> Signed-off-by: Kees Cook 
> ---
>  include/net/nexthop.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Reviewed-by: David Ahern 





Re: [Intel-gfx] [PATCH v2 10/15] vrf: Remove the now superfluous sentinel element from ctl_table array

2023-10-03 Thread David Ahern
On 10/2/23 2:55 AM, Joel Granados via B4 Relay wrote:
> From: Joel Granados 
> 
> This commit comes at the tail end of a greater effort to remove the
> empty elements at the end of the ctl_table arrays (sentinels) which
> will reduce the overall build time size of the kernel and run time
> memory bloat by ~64 bytes per sentinel (further information Link :
> https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/)
> 
> Remove sentinel from vrf_table
> 
> Signed-off-by: Joel Granados 
> ---
>  drivers/net/vrf.c | 1 -
>  1 file changed, 1 deletion(-)
> 

Reviewed-by: David Ahern 




Re: [Openipmi-developer] [PATCH v2 10/15] vrf: Remove the now superfluous sentinel element from ctl_table array

2023-10-02 Thread David Ahern
On 10/2/23 2:55 AM, Joel Granados via B4 Relay wrote:
> From: Joel Granados 
> 
> This commit comes at the tail end of a greater effort to remove the
> empty elements at the end of the ctl_table arrays (sentinels) which
> will reduce the overall build time size of the kernel and run time
> memory bloat by ~64 bytes per sentinel (further information Link :
> https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/)
> 
> Remove sentinel from vrf_table
> 
> Signed-off-by: Joel Granados 
> ---
>  drivers/net/vrf.c | 1 -
>  1 file changed, 1 deletion(-)
> 

Reviewed-by: David Ahern 




___
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer


Re: [PATCH v2 10/15] vrf: Remove the now superfluous sentinel element from ctl_table array

2023-10-02 Thread David Ahern
On 10/2/23 2:55 AM, Joel Granados via B4 Relay wrote:
> From: Joel Granados 
> 
> This commit comes at the tail end of a greater effort to remove the
> empty elements at the end of the ctl_table arrays (sentinels) which
> will reduce the overall build time size of the kernel and run time
> memory bloat by ~64 bytes per sentinel (further information Link :
> https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/)
> 
> Remove sentinel from vrf_table
> 
> Signed-off-by: Joel Granados 
> ---
>  drivers/net/vrf.c | 1 -
>  1 file changed, 1 deletion(-)
> 

Reviewed-by: David Ahern 




Re: [Bridge] [PATCH iproute2-next 1/2] configure: add the --color option

2023-09-15 Thread David Ahern
On 9/15/23 9:59 AM, Stephen Hemminger wrote:
> On Wed, 13 Sep 2023 19:58:25 +0200
> Andrea Claudi  wrote:
> 
>> This commit allows users/packagers to choose a default for the color
>> output feature provided by some iproute2 tools.
>>
>> The configure script option is documented in the script itself and it is
>> pretty much self-explanatory. The default value is set to "never" to
>> avoid changes to the current ip, tc, and bridge behaviour.
>>
>> Signed-off-by: Andrea Claudi 
>> ---
> 
> More build time config is not the answer either.
> Don't want complaints from distribution users about the change.
> Needs to be an environment variable or config file.

I liked the intent, and it defaults to off. Letting packagers enable color
by default brings awareness of the usefulness of the color option.

I have applied the patches, so we need either a revert or followup based
on Stephen's objections.


Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-23 Thread David Ahern
On 8/23/23 3:52 PM, David Wei wrote:
> I'm also preparing a submission for NetDev conf. I wonder if you and others at
> Google plan to present there as well? If so, then we may want to coordinate 
> our
> submissions and talks (if accepted).

personally, I see them as related but separate topics: Mina's proposal is
infra that io_uring builds on. Both are interesting and needed
discussions.


Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread David Ahern
On 8/19/23 9:22 AM, Jesper Dangaard Brouer wrote:
> 
> I do see the problem of depending on having a struct page, as the
> page_pool_iov isn't related to struct page.  Having "page" in the name
> of "page_pool_iov" is also confusing (hardest problem in CS is naming,
> as we all know).
> 
> To support more allocator types, perhaps skb->pp_recycle bit needs to
> grow another bit (and be renamed skb->recycle), so we can tell allocator
> types apart, those that are page based and those which are not.
> 
> 
>> I think the feedback has been strong to not multiplex yet another
>> memory type into that struct, that is not a real page. Which is why
>> we went into this direction. This latest series limits the impact largely
>> to networking structures and code.
>>
> 
> Some what related what I'm objecting to: the "page_pool_iov" is not a
> real page, but this getting recycled into something called "page_pool",
> which funny enough deals with struct-pages internally and depend on the
> struct-page-refcnt.
> 
> Given the approach changed way from using struct page, then I also don't
> see the connection with the page_pool. Sorry.

I do not care for the page_pool_iov name either; I presumed it was the
least change needed to prove an idea, and that the name and details would evolve.

How about something like buffer_pool or netdev_buf_pool that can operate
with either pages, dma addresses, or something else in the future?

> 
>> As for the LSB trick: that avoided adding a lot of boilerplate churn
>> with new type and helper functions.
>>
> 
> Says the lazy programmer :-P ... sorry could not resist ;-)

Use of the LSB (or bits depending on alignment expectations) is a common
trick and already done in quite a few places in the networking stack.
This trick is essential to any realistic change here to incorporate gpu
memory; way too much code will have unnecessary churn without it.

I do prefer my earlier suggestion though where the skb_frag_t has a
union of relevant types though. Instead of `struct page *page` it could
be `void *addr` with the helpers indicating page, iov, or other.



Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-13 Thread David Ahern
On 8/9/23 7:57 PM, Mina Almasry wrote:
> Changes in RFC v2:
> --
> 
> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> that attempts to resolve this by implementing scatterlist support in the
> networking stack, such that we can import the dma-buf scatterlist
> directly. This is the approach proposed at a high level here[2].
> 
> Detailed changes:
> 1. Replaced dma-buf pages approach with importing scatterlist into the
>page pool.
> 2. Replace the dma-buf pages centric API with a netlink API.
> 3. Removed the TX path implementation - there is no issue with
>implementing the TX path with scatterlist approach, but leaving
>out the TX path makes it easier to review.
> 4. Functionality is tested with this proposal, but I have not conducted
>perf testing yet. I'm not sure there are regressions, but I removed
>perf claims from the cover letter until they can be re-confirmed.
> 5. Added Signed-off-by: contributors to the implementation.
> 6. Fixed some bugs with the RX path since RFC v1.
> 
> Any feedback welcome, but specifically the biggest pending questions
> needing feedback IMO are:
> 
> 1. Feedback on the scatterlist-based approach in general.
> 2. Netlink API (Patch 1 & 2).
> 3. Approach to handle all the drivers that expect to receive pages from
>the page pool (Patch 6).
> 
> [1] 
> https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb...@gmail.com/T/
> [2] 
> https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=h...@mail.gmail.com/
> 
> --
> 
> * TL;DR:
> 
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.
> 
> * Problem:
> 
> A large amount of data transfers have device memory as the source and/or
> destination. Accelerators drastically increased the volume of such transfers.
> Some examples include:
> - ML accelerators transferring large amounts of training data from storage 
> into
>   GPU/TPU memory. In some cases ML training setup time can be as long as 50% 
> of
>   TPU compute time, improving data transfer throughput & efficiency can help
>   improving GPU/TPU utilization.
> 
> - Distributed training, where ML accelerators, such as GPUs on different 
> hosts,
>   exchange data among them.
> 
> - Distributed raw block storage applications transfer large amounts of data 
> with
>   remote SSDs, much of this data does not require host processing.
> 
> Today, the majority of the Device-to-Device data transfers over the network
> are implemented as the following low level operations: Device-to-Host copy,
> Host-to-Host network transfer, and Host-to-Device copy.
> 
> The implementation is suboptimal, especially for bulk data transfers, and can
> put significant strains on system resources, such as host memory bandwidth,
> PCIe bandwidth, etc. One important reason behind the current state is the
> kernel’s lack of semantics to express device to network transfers.
> 
> * Proposal:
> 
> In this patch series we attempt to optimize this use case by implementing
> socket APIs that enable the user to:
> 
> 1. send device memory across the network directly, and
> 2. receive incoming network packets directly into device memory.
> 
> Packet _payloads_ go directly from the NIC to device memory for receive and 
> from
> device memory to NIC for transmit.
> Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
> normally. The NIC _must_ support header split to achieve this.
> 
> Advantages:
> 
> - Alleviate host memory bandwidth pressure, compared to existing
>  network-transfer + device-copy semantics.
> 
> - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
>   of the PCIe tree, compared to traditional path which sends data through the
>   root complex.
> 
> * Patch overview:
> 
> ** Part 1: netlink API
> 
> Gives user ability to bind dma-buf to an RX queue.
> 
> ** Part 2: scatterlist support
> 
> Currently the standard for device memory sharing is DMABUF, which doesn't
> generate struct pages. On the other hand, networking stack (skbs, drivers, and
> page pool) operate on pages. We have 2 options:
> 
> 1. Generate struct pages for dmabuf device memory, or,
> 2. Modify the networking stack to process scatterlist.
> 
> Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
> 
> ** part 3: page pool support
> 
> We piggy back on page pool memory providers proposal:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> It allows the page pool to define a memory provider that provides the
> page allocation and freeing. It helps abstract most of the device memory
> TCP changes from the driver.
> 
> ** part 4: support for unreadable skb frags
> 
> Page pool iovs are not accessible by the host; we implement changes
> throughout the networking stack to 

Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice

2023-08-13 Thread David Ahern
On 8/9/23 7:57 PM, Mina Almasry wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8e7d0cb540cd..02a25ccf771a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -151,6 +151,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include "dev.h"
>  #include "net-sysfs.h"
> @@ -2037,6 +2039,182 @@ static int call_netdevice_notifiers_mtu(unsigned long 
> val,
>   return call_netdevice_notifiers_info(val, );
>  }
> 
> +/* Device memory support */
> +
> +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
> +struct gen_pool_chunk *chunk,
> +void *not_used)
> +{
> + struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
> +
> + kvfree(owner->ppiovs);
> + kfree(owner);
> +}
> +
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> +{
> + size_t size, avail;
> +
> + gen_pool_for_each_chunk(binding->chunk_pool,
> + netdev_devmem_free_chunk_owner, NULL);
> +
> + size = gen_pool_size(binding->chunk_pool);
> + avail = gen_pool_avail(binding->chunk_pool);
> +
> + if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> +   size, avail))
> + gen_pool_destroy(binding->chunk_pool);
> +
> + dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +  DMA_BIDIRECTIONAL);
> + dma_buf_detach(binding->dmabuf, binding->attachment);
> + dma_buf_put(binding->dmabuf);
> + kfree(binding);
> +}
> +
> +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + unsigned long xa_idx;
> +
> + list_del_rcu(>list);
> +
> + xa_for_each(>bound_rxq_list, xa_idx, rxq)
> + if (rxq->binding == binding)
> + rxq->binding = NULL;
> +
> + netdev_devmem_binding_put(binding);

This does a put on the binding but it does not notify the driver that
the dmabuf references need to be flushed from the rx queue.

Also, what about the device getting deleted - e.g., the driver is removed?

> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int 
> dmabuf_fd,
> + u32 rxq_idx, struct netdev_dmabuf_binding **out)
> +{
> + struct netdev_dmabuf_binding *binding;
> + struct netdev_rx_queue *rxq;
> + struct scatterlist *sg;
> + struct dma_buf *dmabuf;
> + unsigned int sg_idx, i;
> + unsigned long virtual;
> + u32 xa_idx;
> + int err;
> +
> + rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> + if (rxq->binding)
> + return -EEXIST;

So this proposal is limiting a binding to a single dmabuf at a time? Is
this just for the RFC?

Also, this suggests that the Rx queue is unique to the flow. I do not
recall a netdev API to create H/W queues on the fly (only a passing
comment from Kuba), so how is the H/W queue (or queue set since a
completion queue is needed as well) created for the flow? And in turn if
it is unique to the flow, what deletes the queue if an app does not do a
proper cleanup? If the queue sticks around, the dmabuf references stick
around.

Also, if this is an app specific h/w queue, flow steering is not
mentioned in this RFC.
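For context, an app-specific Rx queue would normally also need an ntuple
rule steering the flow into it. With standard ethtool that could look like
the following sketch (device name, addresses and queue index are purely
illustrative, not part of this RFC):

```shell
# Steer a TCP/IPv4 flow destined to port 5001 into Rx queue 8, so its
# payloads land in the dmabuf-backed queue (all values are examples).
ethtool -N eth0 flow-type tcp4 dst-ip 192.168.1.10 dst-port 5001 action 8

# Inspect the configured rules
ethtool -n eth0
```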

> +
> + dmabuf = dma_buf_get(dmabuf_fd);
> + if (IS_ERR_OR_NULL(dmabuf))
> + return -EBADFD;
> +
> + binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> +dev_to_node(>dev));
> + if (!binding) {
> + err = -ENOMEM;
> + goto err_put_dmabuf;
> + }
> +
> + xa_init_flags(>bound_rxq_list, XA_FLAGS_ALLOC);
> +
> + refcount_set(>ref, 1);
> +
> + binding->dmabuf = dmabuf;
> +
> + binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> + if (IS_ERR(binding->attachment)) {
> + err = PTR_ERR(binding->attachment);
> + goto err_free_binding;
> + }
> +
> + binding->sgt = dma_buf_map_attachment(binding->attachment,
> +   DMA_BIDIRECTIONAL);
> + if (IS_ERR(binding->sgt)) {
> + err = PTR_ERR(binding->sgt);
> + goto err_detach;
> + }
> +
> + /* For simplicity we expect to make PAGE_SIZE allocations, but the
> +  * binding can be much more flexible than that. We may be able to
> +  * allocate MTU sized chunks here. Leave that for future work...
> +  */
> + binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +   dev_to_node(>dev));
> + if (!binding->chunk_pool) {
> + err = -ENOMEM;
> + goto err_unmap;
> + }
> +
> + virtual = 0;
> + for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> + dma_addr_t dma_addr = sg_dma_address(sg);
> + struct dmabuf_genpool_chunk_owner *owner;
> + 

Re: [RFC PATCH 00/10] Device Memory TCP

2023-07-18 Thread David Ahern
On 7/18/23 12:29 PM, Jakub Kicinski wrote:
> On Tue, 18 Jul 2023 12:20:59 -0600 David Ahern wrote:
>> On 7/18/23 12:15 PM, Jakub Kicinski wrote:
>>> On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:  
>>>> netlink feels like a weird API choice for that, in particular it would
>>>> be really wrong to somehow bind the lifecycle of a netlink object to a
>>>> process.  
>>>
>>> Netlink is the right API, life cycle of objects can be easily tied to
>>> a netlink socket.  
>>
>> That is an unintuitive connection -- memory references, h/w queues, flow
>> steering should be tied to the datapath socket, not a control plane socket.
> 
> There's one RSS context for many datapath sockets. Plus a lot of the
> APIs already exist, and it's more of a question of packaging them up 
> at the user space level. For things which do not have an API, however,
> netlink, please.

I do not see how 1 RSS context (or more specifically a h/w Rx queue) can
be used properly with memory from different processes (or dma-buf
references). When the process dies, that memory needs to be flushed from
the H/W queues. Queues with interlaced submissions make that more
complicated.

I guess the devil is in the details; I look forward to the evolution of
the patches.


Re: [RFC PATCH 00/10] Device Memory TCP

2023-07-18 Thread David Ahern
On 7/18/23 12:15 PM, Jakub Kicinski wrote:
> On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
>> netlink feels like a weird API choice for that, in particular it would
>> be really wrong to somehow bind the lifecycle of a netlink object to a
>> process.
> 
> Netlink is the right API, life cycle of objects can be easily tied to
> a netlink socket.

That is an unintuitive connection -- memory references, h/w queues, flow
steering should be tied to the datapath socket, not a control plane socket.


Re: [Bridge] [PATCH net-next v2 0/4] Add backup nexthop ID support

2023-07-18 Thread David Ahern
> Patch #1 extends the tunnel key structure with the new nexthop ID field.
> 
> Patch #2 uses the new field in the VXLAN driver to forward packets via
> the specified nexthop ID.
> 
> Patch #3 adds the new backup nexthop ID bridge port attribute and
> adjusts the bridge driver to attach the ID as tunnel metadata upon
> redirection.
> 
> Patch #4 adds a selftest.
> 
> iproute2 patches can be found here [3].
> 
> Changelog
> =
> 
> Since RFC [4]:
> 
> * Added Nik's tags.
> 
> [1] https://datatracker.ietf.org/doc/html/rfc7432#section-7.1
> [2] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
> [3] https://github.com/idosch/iproute2/tree/submit/backup_nhid_v1
> [4] https://lore.kernel.org/netdev/20230713070925.3955850-1-ido...@nvidia.com/
> 
> Ido Schimmel (4):
>   ip_tunnels: Add nexthop ID field to ip_tunnel_key
>   vxlan: Add support for nexthop ID metadata
>   bridge: Add backup nexthop ID support
>   selftests: net: Add bridge backup port and backup nexthop ID test
> 
>  drivers/net/vxlan/vxlan_core.c|  44 +
>  include/net/ip_tunnels.h  |   1 +
>  include/uapi/linux/if_link.h  |   1 +
>  net/bridge/br_forward.c   |   1 +
>  net/bridge/br_netlink.c   |  12 +
>  net/bridge/br_private.h   |   3 +
>  net/bridge/br_vlan_tunnel.c   |  15 +
>  net/core/rtnetlink.c  |   2 +-
>  tools/testing/selftests/net/Makefile  |   1 +
>  .../selftests/net/test_bridge_backup_port.sh  | 759 ++
>  10 files changed, 838 insertions(+), 1 deletion(-)
>  create mode 100755 tools/testing/selftests/net/test_bridge_backup_port.sh
> 

For the series:
Acked-by: David Ahern 



Re: [Bridge] [RFC PATCH net-next 0/4] Add backup nexthop ID support

2023-07-14 Thread David Ahern
On 7/13/23 1:09 AM, Ido Schimmel wrote:
> tl;dr
> =
> 
> This patchset adds a new bridge port attribute specifying the nexthop
> object ID to attach to a redirected skb as tunnel metadata. The ID is
> used by the VXLAN driver to choose the target VTEP for the skb. This is
> useful for EVPN multi-homing, where we want to redirect local
> (intra-rack) traffic upon carrier loss through one of the other VTEPs
> (ES peers) connected to the target host.
> 
> Background
> ==
> 
> In a typical EVPN multi-homing setup each host is multi-homed using a
> set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
> switches in a rack. These switches act as VTEPs and are not directly
> connected (as opposed to MLAG), but can communicate with each other (as
> well as with VTEPs in remote racks) via spine switches over L3.
> 
> The control plane uses Type 1 routes [1] to create a mapping between an
> ES and VTEPs where the ES has active links. In addition, the control
> plane uses Type 2 routes [2] to create a mapping between {MAC, VLAN} and
> an ES.
> 
> These tables are then used by the control plane to instruct VTEPs how to
> reach remote hosts. For example, assuming {MAC X, VLAN Y} is accessible
> via ES1 and this ES has active links to VTEP1 and VTEP2. The control
> plane will program the following entries to a remote VTEP:
> 
>  # ip nexthop add id 1 via $VTEP1_IP fdb
>  # ip nexthop add id 2 via $VTEP2_IP fdb
>  # ip nexthop add id 10 group 1/2 fdb
>  # bridge fdb add $MAC_X dev vx0 master extern_learn vlan $VLAN_Y
>  # bridge fdb add $MAC_Y dev vx0 self extern_learn nhid 10 src_vni $VNI_Y
> 
> Remote traffic towards the host will be load balanced between VTEP1 and
> VTEP2. If the control plane notices a carrier loss on the ES1 link
> connected to VTEP1, it will issue a Type 1 route withdraw, prompting
> remote VTEPs to remove the effected nexthop from the group:
> 
>  # ip nexthop replace id 10 group 2 fdb
> 
> Motivation
> ==
> 
> While remote traffic can be redirected to a VTEP with an active ES link
> by withdrawing a Type 1 route, the same is not true for local traffic. A
> host that is multi-homed to VTEP1 and VTEP2 via another ES (e.g., ES2)
> will send its traffic to {MAC X, VLAN Y} via one of these two switches,
> according to its LAG hash algorithm which is not under our control. If
> the traffic arrives at VTEP1 - which no longer has an active ES1 link -
> it will be dropped due to the carrier loss.
> 
> In MLAG setups, the above problem is solved by redirecting the traffic
> through the peer link upon carrier loss. This is achieved by defining
> the peer link as the backup port of the host facing bond. For example:
> 
>  # bridge link set dev bond0 backup_port bond_peer
> 
> Unlike MLAG, there is no peer link between the leaf switches in EVPN.
> Instead, upon carrier loss, local traffic should be redirected through
> one of the active ES peers. This can be achieved by defining the VXLAN
> port as the backup port of the host facing bonds. For example:
> 
>  # bridge link set dev es1_bond backup_port vx0
> 
> However, the VXLAN driver is not programmed with FDB entries for locally
> attached hosts and therefore does not know to which VTEP to redirect the
> traffic to. This will result in the traffic being replicated to all the
> VTEPs (potentially hundreds) in the network and each VTEP dropping the
> traffic, except for the active ES peer.
> 
> Avoiding the flooding by programming local FDB entries in the VXLAN
> driver is not a viable solution as it requires to significantly increase
> the number of programmed FDB entries.
> 
> Implementation
> ==
> 
> The proposed solution is to create an FDB nexthop group for each ES with
> the IP addresses of the active ES peers and set this ID as the backup
> nexthop ID (new bridge port attribute) of the ES link. For example, on
> VTEP1:
> 
>  # ip nexthop add id 1 via $VTEP2_IP fdb
>  # ip nexthop add id 10 group 1 fdb
>  # bridge link set dev es1_bond backup_nhid 10
>  # bridge link set dev es1_bond backup_port vx0
> 
> When the ES link loses its carrier, traffic will be redirected to the
> VXLAN port, but instead of only attaching the tunnel ID (i.e., VNI) as
> tunnel metadata to the skb, the backup nexthop ID will be attached as
> well. The VXLAN driver will then use this information to forward the skb
> via the nexthop object associated with the ID, as if the skb hit an FDB
> entry associated with this ID.
> 
> Testing
> ===
> 
> A test for both the existing backup port attribute as well as the new
> backup nexthop ID attribute is added in patch #4.
> 
> Patchset overview
> =
> 
> Patch #1 extends the tunnel key structure with the new nexthop ID field.
> 
> Patch #2 uses the new field in the VXLAN driver to forward packets via
> the specified nexthop ID.
> 
> Patch #3 adds the new backup nexthop ID bridge port attribute and
> adjusts the bridge driver to attach the ID as tunnel metadata upon
> 

Bug#1032642: iproute2: ip tunnel change ip6gre to gre crashes with stack smash

2023-04-03 Thread David Ahern
On 4/3/23 9:24 AM, Stephen Hemminger wrote:
> ted
>>
>> This happens because iproute2 just assumes the tunnel is ipv4, but the
>> kernel "knows" it's actually ip6gre so when calling the SIOCGETTUNNEL
>> ioctl it writes back a struct ip6_tnl_parm2 into the struct
>> ip_tunnel_parm which is smaller, so the stack gets overwritten. Is
>> there any way to tell from userspace whether a gre is v4 or v6 before
>> doing an ioctl? The ioctls don't take/return a size parameter as far
>> as I can see...
> 
> Ip uses an IPv4 UDP socket when it thinks it is talking to GRE.
> And a IPv6 UDP socket when it is talking to GRE6.
> 
> So the kernel could check and error out?
> 

Does seem like a kernel bug and a well known design flaw in ioctl
interface (assuming buffer of a specific size). The best iproute2 can do
is have `old_p` be a larger size (e.g., ip6_tnl_parm2) to avoid the
overrun, but then the result is nonsense with no way for it to know an ipv6
struct was passed back. The crash at least indicates something is off.




[jira] [Created] (SPARK-40317) Improvement to JDBC predicate for queries involving joins

2022-09-02 Thread David Ahern (Jira)
David Ahern created SPARK-40317:
---

 Summary: Improvement to JDBC predicate for queries involving joins
 Key: SPARK-40317
 URL: https://issues.apache.org/jira/browse/SPARK-40317
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.2
Reporter: David Ahern


Current behaviour on tables involving joins seems to use a subquery as follows

select * from
(
select a, b, c from tbl1
lj tbl2 on tbl1.col1 = tbl2.col1
lj tbl3 on tbl1.col2 = tbl3.col2
)
where predicate = 1
where predicate = 2
where predicate = 3

More desirable would be

(
select a, b, c from tbl1 where (predicate = 1, predicate = 2, etc)
lj tbl2 on tbl1.col1 = tbl2.col1
lj tbl3 on tbl1.col2 = tbl3.col2
)

to just do the join on the subset of data rather than joining all data then
filtering. Predicate pushdown usually only works on columns that have been
indexed. So even if the data isn't indexed, this would reduce the amount of
data needing to be moved. In many cases it is better to do the join on the
DB side than pulling everything into Spark.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [RFC] Remove DECNET support from kernel

2022-08-01 Thread David Ahern
On 7/31/22 1:06 PM, Stephen Hemminger wrote:
> Decnet is an obsolete network protocol that receives more attention
> from kernel janitors than users. It belongs in computer protocol
> history museum not in Linux kernel.
> 
> It has been Orphaned in kernel since 2010.
> And the documentation link on Sourceforge says it is abandoned there.
> 
> Leave the UAPI alone to keep userspace programs compiling.
> 
> Signed-off-by: Stephen Hemminger 
> ---

Acked-by: David Ahern 




Re: [ovs-dev] [PATCH net-next] net: rename reference+tracking helpers

2022-06-09 Thread David Ahern
On 6/8/22 8:58 AM, Jakub Kicinski wrote:
> IMO to encourage use of the track-capable API we could keep their names
> short and call the legacy functions __netdev_hold() as I mentioned or
> maybe netdev_hold_notrack().

I like that option. Similar to the old nla_parse functions that were
renamed with _deprecated - makes it easier to catch new uses.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [Bridge] [PATCH net-next v2 0/8] net: bridge: add flush filtering support

2022-04-15 Thread David Ahern
On Mon, Apr 11, 2022 at 12:22:24PM -0700, Roopa Prabhu wrote:
> all great points. My only reason to explore RTM_DELNEIGH is to see if we can
> find a recipe to support similar bulk deletes of other objects handled via
> rtm msgs in the future. Plus, it allows you to maintain symmetry between
> flush requests and object delete notification msg types.
> 
> Lets see if there are other opinions.

I guess I should have read the entire thread. :-) (still getting used to
the new lei + mutt workflow). This was my thought - bulk delete is going
to be a common need, and it is really just a mass delete. The GET
message types are used for dumps and some allow attributes on the
request as a means of coarse grain filtering. I think we should try
something similar here for the flush case.


Re: [Bridge] [PATCH net-next v4 05/12] net: rtnetlink: add bulk delete support flag

2022-04-15 Thread David Ahern
On 4/13/22 6:21 AM, Nikolay Aleksandrov wrote:
>> If a buggy user space application is sending messages with NLM_F_BULK
>> set (unintentionally), will it break on newer kernel? I couldn't find
>> where the kernel was validating that reserved flags are not used (I
>> suspect it doesn't).
> 
> Correct, it doesn't.
> 
>>
>> Assuming the above is correct and of interest, maybe just emit a warning
>> via extack and drop the goto? Alternatively, we can see if anyone
>> complains which might never happen
>>
> 
> TBH I prefer to error out on an unsupported flag, but I get the problem. These
> weren't validated before and we start checking now. The problem is that we'll
> return an extack without an error, but the delete might also remove something.
> Hrm.. perhaps we can rephrase the error in that case (since it becomes a 
> warning
> in iproute2 terms):
>  "NLM_F_BULK flag is set but bulk delete operation is not supported"
> So it will warn the user it has an unsupported flag.
> 
> WDYT ?
> 
> IMO we should bite the bullet and keep the error though. :)
> 

I agree. The check across the board for BULK flag on any DELETE requests
should tell us pretty quick if someone is setting that flag when it
should not be.


Re: [Bridge] [PATCH net-next v3 0/8] net: bridge: add flush filtering support

2022-04-15 Thread David Ahern
On 4/12/22 7:22 AM, Nikolay Aleksandrov wrote:
> Hi,
> This patch-set adds support to specify filtering conditions for a bulk
> delete (flush) operation. This version uses a new nlmsghdr delete flag
> called NLM_F_BULK in combination with a new ndo_fdb_del_bulk op which is
> used to signal that the driver supports bulk deletes (that avoids
> pushing common mac address checks to ndo_fdb_del implementations and
> also has a different prototype and parsed attribute expectations, more
> info in patch 03). The new delete flag can be used for any RTM_DEL*
> type, implementations just need to be careful with older kernels which
> are doing non-strict attribute parses. Here I use the fact that mac

overall it looks fine to me. The rollout of BULK delete for other
commands will be slow so we need a way to reject the BULK flag if the
handler does not support it. One thought is to add another flag to
rtnl_link_flags (e.g., RTNL_FLAG_BULK_DEL_SUPPORTED) and pass that flag
in for handlers that handle bulk delete and reject it for others in core
rtnetlink code.


Re: [Bridge] [PATCH net-next v2 1/8] net: rtnetlink: add RTM_FLUSHNEIGH

2022-04-15 Thread David Ahern
On Mon, Apr 11, 2022 at 08:29:27PM +0300, Nikolay Aleksandrov wrote:
> Add a new rtnetlink type used to flush neigh objects. It will be
> initially used to add flush with filtering support for bridge fdbs, but
> it also opens the door to add similar support to others (e.g. vxlan).
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  include/uapi/linux/rtnetlink.h | 3 +++
>  security/selinux/nlmsgtab.c| 3 ++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 83849a37db5b..06001cfd404b 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -194,6 +194,9 @@ enum {
>   RTM_GETTUNNEL,
>  #define RTM_GETTUNNELRTM_GETTUNNEL
>  
> + RTM_FLUSHNEIGH = 124,
> +#define RTM_FLUSHNEIGH   RTM_FLUSHNEIGH
> +

rtm message types are "new, del, get, set" quadruplets; making this a
flush breaks the current consistent style. Can this be done by adding
a FLUSH flag to the RTM_DELNEIGH message?

>   __RTM_MAX,
>  #define RTM_MAX  (((__RTM_MAX + 3) & ~3) - 1)
>  };


Re: [iproute2-next v2 4/4] vdpa: Enable user to set mtu of the vdpa device

2021-12-20 Thread David Ahern
On 12/20/21 12:11 PM, Parav Pandit wrote:
>> After consideration, this future proofing seems like a good thing to have.
> Ok. I will first get kernel change merged, after which will send v3 for 
> iproute2.

this set has been committed; not sure what happened to the notification
from patchworks bot.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [iproute2-next v2 4/4] vdpa: Enable user to set mtu of the vdpa device

2021-12-18 Thread David Ahern
On 12/17/21 1:08 AM, Parav Pandit wrote:
> @@ -204,6 +217,8 @@ static void vdpa_opts_put(struct nlmsghdr *nlh, struct 
> vdpa *vdpa)
>   if (opts->present & VDPA_OPT_VDEV_MAC)
>   mnl_attr_put(nlh, VDPA_ATTR_DEV_NET_CFG_MACADDR,
>sizeof(opts->mac), opts->mac);
> + if (opts->present & VDPA_OPT_VDEV_MTU)
> + mnl_attr_put_u16(nlh, VDPA_ATTR_DEV_NET_CFG_MTU, opts->mtu);

Why limit the MTU to a u16? Eric, for example, is working on "Big TCP",
where IPv6 can use Jumbograms with an MTU > 64k.

https://datatracker.ietf.org/doc/html/rfc2675



Re: [iproute2-next 3/4] vdpa: Enable user to set mac address of vdpa device

2021-12-03 Thread David Ahern
On 12/1/21 9:22 PM, Parav Pandit wrote:
> @@ -233,6 +254,15 @@ static int vdpa_argv_parse(struct vdpa *vdpa, int argc, 
> char **argv,
>  
>   NEXT_ARG_FWD();
>   o_found |= VDPA_OPT_VDEV_MGMTDEV_HANDLE;
> + } else if ((matches(*argv, "mac")  == 0) &&

use strcmp; we are not taking any more uses of matches() for parameters.


> +(o_all & VDPA_OPT_VDEV_MAC)) {
> + NEXT_ARG_FWD();
> + err = vdpa_argv_mac(vdpa, argc, argv, opts->mac);
> + if (err)
> + return err;
> +
> + NEXT_ARG_FWD();
> + o_found |= VDPA_OPT_VDEV_MAC;
>   } else {
>   fprintf(stderr, "Unknown option \"%s\"\n", *argv);
>   return -EINVAL;



Re: [iproute2-next 4/4] vdpa: Enable user to set mtu of the vdpa device

2021-12-03 Thread David Ahern
On 12/1/21 9:22 PM, Parav Pandit wrote:
> @@ -154,6 +156,31 @@ static int vdpa_argv_mac(struct vdpa *vdpa, int argc, 
> char **argv, char *mac)
>   return 0;
>  }
>  
> +static int strtouint16_t(const char *str, uint16_t *p_val)
> +{
> + char *endptr;
> + unsigned long int val;
> +
> + val = strtoul(str, , 10);
> + if (endptr == str || *endptr != '\0')
> + return -EINVAL;
> + if (val > USHRT_MAX)
> + return -ERANGE;
> + *p_val = val;
> + return 0;
> +}

duplicates get_u16


Re: [iproute2-next 0/4] vdpa tool to query and set config layout

2021-12-03 Thread David Ahern
On 12/1/21 9:22 PM, Parav Pandit wrote:
> This series implements querying and setting of the mac address and mtu
> device config fields of the vdpa device of type net.
> 
> An example of query and set as below.
> 
> $ vdpa dev add name bar mgmtdev vdpasim_net mac 00:11:22:33:44:55 mtu 9000
> 
> $ vdpa dev config show
> bar: mac 00:11:22:33:44:55 link up link_announce false mtu 9000
> 
> $ vdpa dev config show -jp
> {
> "config": {
> "bar": {
> "mac": "00:11:22:33:44:55",
> "link ": "up",
> "link_announce ": false,
> "mtu": 1500,
> }
> }
> }
> 
> patch summary:
> patch-1 updates the kernel headers
> patch-2 implements the query command
> patch-3 implements setting the mac address of vdpa dev config space
> patch-4 implements setting the mtu of vdpa dev config space
> 
> 
> Parav Pandit (4):
>   vdpa: Update kernel headers
>   vdpa: Enable user to query vdpa device config layout
>   vdpa: Enable user to set mac address of vdpa device
>   vdpa: Enable user to set mtu of the vdpa device
> 
>  include/uapi/linux/virtio_net.h |  81 +
>  vdpa/include/uapi/linux/vdpa.h  |   7 ++
>  vdpa/vdpa.c | 198 ++--
>  3 files changed, 277 insertions(+), 9 deletions(-)
>  create mode 100644 include/uapi/linux/virtio_net.h
> 

please update man page(s)


Re: [PATCH v2 net-next 00/26] net: introduce and use generic XDP stats

2021-11-30 Thread David Ahern
On 11/30/21 10:07 AM, Jakub Kicinski wrote:
> On Tue, 30 Nov 2021 17:17:24 +0100 Toke Høiland-Jørgensen wrote:
>>> 1. Channels vs queues vs global.
>>>
>>> Jakub: no per-channel.
>>> David (Ahern): it's worth it to separate as Rx/Tx.
>>> Toke is fine with globals at the end I think?  
>>
>> Well, I don't like throwing data away, so in that sense I do like
>> per-queue stats, but it's not a very strong preference (i.e., I can live
>> with either)...
> 
> We don't even have a clear definition of a queue in Linux.
> 

The summary above says "Jakub: no per-channel", and then you have this
comment about a clear definition of a queue. What is your preference
here, Jakub? I think I have gotten lost in all of the comments.

My request was just to not lump Rx and Tx together under a 'channel'
definition as a new API. Proposals like zctap and 'queues as a first
class citizen' are examples of intentions / desires to move towards Rx
and Tx queues beyond what exists today.

Re: [PATCH v2 net-next 00/26] net: introduce and use generic XDP stats

2021-11-30 Thread David Ahern
On 11/30/21 8:56 AM, Alexander Lobakin wrote:
> Rest:
>  - don't create a separate `ip` command and report under `-s`;

Reporting XDP stats under 'ip -s' is not going to be scalable from a
readability perspective.

ifstat (misc/ifstat.c) has support for extended stats which is where you
are adding these.


Re: [PATCH v2 net-next 00/26] net: introduce and use generic XDP stats

2021-11-30 Thread David Ahern
On 11/30/21 10:04 AM, Jakub Kicinski wrote:
> On Tue, 30 Nov 2021 17:34:54 +0100 Alexander Lobakin wrote:
>>> Another thought on this patch: with individual attributes you could save
>>> some overhead by not sending 0 counters to userspace. e.g., define a
>>> helper that does:  
>>
>> I know about ETHTOOL_STAT_NOT_SET, but RTNL xstats doesn't use this,
>> does it?
> 
> Not sure if you're asking me or Dave but no, to my knowledge RTNL does
> not use such semantics today. But the reason is mostly because there
> weren't many driver stats added there. Knowing if an error counter is
> not supported or supporter and 0 is important for monitoring. Even if
> XDP stats don't have a counter which may not be supported today it's
> not a good precedent to make IMO.
> 

Today, stats are sent as a struct so skipping stats whose value is 0 is
not an option. When using individual attributes for the counters this
becomes an option. Given there is no value in sending '0' why do it?

Is your pushback that there should be a uapi to opt-in to this behavior?


Re: [PATCH v2 net-next 01/26] rtnetlink: introduce generic XDP statistics

2021-11-29 Thread David Ahern
On 11/23/21 9:39 AM, Alexander Lobakin wrote:
> +static bool rtnl_get_xdp_stats_xdpxsk(struct sk_buff *skb, u32 ch,
> +   const void *attr_data)
> +{
> + const struct ifla_xdp_stats *xstats = attr_data;
> +
> + xstats += ch;
> +
> + if (nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_PACKETS, xstats->packets,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_BYTES, xstats->bytes,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_ERRORS, xstats->errors,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_ABORTED, xstats->aborted,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_DROP, xstats->drop,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_INVALID, xstats->invalid,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_PASS, xstats->pass,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_REDIRECT, xstats->redirect,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_REDIRECT_ERRORS,
> +   xstats->redirect_errors,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_TX, xstats->tx,
> +   IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_TX_ERRORS,
> +   xstats->tx_errors, IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_XMIT_PACKETS,
> +   xstats->xmit_packets, IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_XMIT_BYTES,
> +   xstats->xmit_bytes, IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_XMIT_ERRORS,
> +   xstats->xmit_errors, IFLA_XDP_XSTATS_UNSPEC) ||
> + nla_put_u64_64bit(skb, IFLA_XDP_XSTATS_XMIT_FULL,
> +   xstats->xmit_full, IFLA_XDP_XSTATS_UNSPEC))
> + return false;
> +
> + return true;
> +}
> +

Another thought on this patch: with individual attributes you could save
some overhead by not sending 0 counters to userspace. e.g., define a
helper that does:

static inline int nla_put_u64_if_set(struct sk_buff *skb, int attrtype,
                                     u64 value, int padattr)
{
        if (value)
                return nla_put_u64_64bit(skb, attrtype, value, padattr);

        return 0;
}


Re: [PATCH v2 net-next 00/26] net: introduce and use generic XDP stats

2021-11-28 Thread David Ahern
On 11/23/21 9:39 AM, Alexander Lobakin wrote:
> This is an almost complete rework of [0].
> 
> This series introduces generic XDP statistics infra based on rtnl
> xstats (Ethtool standard stats previously), and wires up the drivers
> which collect appropriate statistics to this new interface. Finally,
> it introduces XDP/XSK statistics to all XDP-capable Intel drivers.
> 
> Those counters are:
> * packets: number of frames passed to bpf_prog_run_xdp().
> * bytes: number of bytes went through bpf_prog_run_xdp().
> * errors: number of general XDP errors, if driver has one unified
>   counter.
> * aborted: number of XDP_ABORTED returns.
> * drop: number of XDP_DROP returns.
> * invalid: number of returns of unallowed values (i.e. not XDP_*).
> * pass: number of XDP_PASS returns.
> * redirect: number of successfully performed XDP_REDIRECT requests.
> * redirect_errors: number of failed XDP_REDIRECT requests.
> * tx: number of successfully performed XDP_TX requests.
> * tx_errors: number of failed XDP_TX requests.
> * xmit_packets: number of successfully transmitted XDP/XSK frames.
> * xmit_bytes: number of bytes successfully transmitted via XDP/XSK.
> * xmit_errors: number of XDP/XSK frames that failed to transmit.
> * xmit_full: number of XDP/XSK queue being full at the moment of
>   transmission.
> 
> To provide them, developers need to implement .ndo_get_xdp_stats()
> and, if they want to expose stats on a per-channel basis,
> .ndo_get_xdp_stats_nch(). include/net/xdp.h contains some helper

Why the tie to a channel? There are Rx queues and Tx queues and no
requirement to link them into a channel. It would be better (more
flexible) to allow them to be independent. Rather than ask the driver
"how many channels", ask 'how many Rx queues' and 'how many Tx queues'
for which xdp stats are reported.

From there, allow queue numbers or queue ids to be non-consecutive and
add a queue id or number as an attribute. e.g.,

[XDP stats]
    [ Rx queue N ]
        counters

    [ Tx queue N ]
        counters

This would allow a follow-on patch set to do something like "give me XDP
stats for Rx queue N" instead of doing a full dump.


Re: [ovs-dev] [PATCH net-next] openvswitch: allow linking a VRF to an OVS bridge

2021-09-21 Thread David Ahern
On 9/20/21 9:34 AM, Antoine Tenart wrote:
> There might be questions about the setup in which a VRF is linked to an
> OVS bridge; I cc'ed Luis Tomás who wrote the article.

My head just exploded. You want to make an L3 device a port of an L2
device.

Can someone explain how this is supposed to work and why it is even
needed? ie., given how an OVS bridge handles packets and the point of a
VRF device (to direct lookups to a table), the 2 are at odds in my head.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [PATCH net-next 03/21] ethtool, stats: introduce standard XDP statistics

2021-08-04 Thread David Ahern
On 8/4/21 12:27 PM, Saeed Mahameed wrote:
> 
>> I just ran some quick tests with my setup and measured about 1.2%
>> worst
> 
> 1.2% is a lot ! what was the test ? what is the change ?

I did say "quick test ... not exhaustive" and it was definitely
eyeballing a pps change over a small time window.

If multiple counters are bumped 20-25 million times a second (e.g. XDP
drop case), how measurable is it? I was just trying to ballpark the
overhead - 1%, 5%, more? If it is <~ 1% then there is no performance
argument in which case let's do the right thing for users - export via
existing APIs.


Re: [PATCH net-next 03/21] ethtool, stats: introduce standard XDP statistics

2021-08-04 Thread David Ahern
On 8/4/21 10:44 AM, Jakub Kicinski wrote:
> On Wed, 4 Aug 2021 10:17:56 -0600 David Ahern wrote:
>> On 8/4/21 6:36 AM, Jakub Kicinski wrote:
>>>> XDP is going to always be eBPF based ! why not just report such stats
>>>> to a special BPF_MAP ? BPF stack can collect the stats from the driver
>>>> and report them to this special MAP upon user request.  
>>> Do you mean replacing the ethtool-netlink / rtnetlink etc. with
>>> a new BPF_MAP? I don't think adding another category of uAPI thru 
>>> which netdevice stats are exposed would do much good :( Plus it 
>>> doesn't address the "yet another cacheline" concern.
>>>
>>> To my understanding the need for stats recognizes the fact that (in
>>> large organizations) fleet monitoring is done by different teams than
>>> XDP development. So XDP team may have all the stats they need, but the
>>> team doing fleet monitoring has no idea how to get to them.
>>>
>>> To bridge the two worlds we need a way for the infra team to ask the
>>> XDP for well-defined stats. Maybe we should take a page from the BPF
>>> iterators book and create a program type for bridging the two worlds?
>>> Called by networking core when duping stats to extract from the
>>> existing BPF maps all the relevant stats and render them into a well
>>> known struct? Users' XDP design can still use a single per-cpu map with
>>> all the stats if they so choose, but there's a way to implement more
>>> optimal designs and still expose well-defined stats.
>>>
>>> Maybe that's too complex, IDK.  
>>
>> I was just explaining to someone internally how to get stats at all of
>> the different points in the stack to track down reasons for dropped packets:
>>
>> ethtool -S for h/w and driver
>> tc -s for drops by the qdisc
>> /proc/net/softnet_stat for drops at the backlog layer
>> netstat -s for network and transport layer
>>
>> yet another command and API just adds to the nightmare of explaining and
>> understanding these stats.
> 
> Are you referring to RTM_GETSTATS when you say "yet another command"?
> RTM_GETSTATS exists and is used by offloads today.
> 
> I'd expect ip -s (-s) to be extended to run GETSTATS and display the xdp
> stats. (Not sure why ip -s was left out of your list :))

It's on my diagram, and yes, forgot to add it here.

> 
>> There is real value in continuing to use ethtool API for XDP stats. Not
>> saying this reorg of the XDP stats is the right thing to do, only that
>> the existing API has real user benefits.
> 
> RTM_GETSTATS is an existing API. New ethtool stats are intended to be HW
> stats. I don't want to go back to ethtool being a dumping ground for all
> stats because that's what the old interface encouraged.

driver stats are important too. e.g., mlx5's cache stats and per-queue
stats.

> 
>> Does anyone have data that shows bumping a properly implemented counter
>> causes a noticeable performance degradation and if so by how much? You
>> mention 'yet another cacheline' but collecting stats on stack and
>> incrementing the driver structs at the end of the napi loop should not
>> have a huge impact versus the value the stats provide.
> 
> Not sure, maybe Jesper has some numbers. Maybe Intel folks do?

I just ran some quick tests with my setup and measured about 1.2% worst
case. Certainly not exhaustive. Perhaps Intel or Mellanox can provide
numbers for their high speed nics - e.g. ConnectX-6 and a saturated host.

> 
> I'm just allergic to situations when there is a decision made and 
> then months later patches are posted disregarding the decision, 
> without analysis on why that decision was wrong. And while the
> maintainer who made the decision is on vacation.
> 

stats is one of the many sensitive topics. I have been consistent in
defending the need to use existing APIs and tooling and not relying on
XDP program writers to add the relevant stats and then provide whatever
tool is needed to extract and print them. Standardization for
fundamental analysis tools.


Re: [PATCH net-next 03/21] ethtool, stats: introduce standard XDP statistics

2021-08-04 Thread David Ahern
On 8/4/21 6:36 AM, Jakub Kicinski wrote:
>> XDP is going to always be eBPF based ! why not just report such stats
>> to a special BPF_MAP ? BPF stack can collect the stats from the driver
>> and report them to this special MAP upon user request.
> Do you mean replacing the ethtool-netlink / rtnetlink etc. with
> a new BPF_MAP? I don't think adding another category of uAPI thru 
> which netdevice stats are exposed would do much good :( Plus it 
> doesn't address the "yet another cacheline" concern.
> 
> To my understanding the need for stats recognizes the fact that (in
> large organizations) fleet monitoring is done by different teams than
> XDP development. So XDP team may have all the stats they need, but the
> team doing fleet monitoring has no idea how to get to them.
> 
> To bridge the two worlds we need a way for the infra team to ask the
> XDP for well-defined stats. Maybe we should take a page from the BPF
> iterators book and create a program type for bridging the two worlds?
> Called by networking core when duping stats to extract from the
> existing BPF maps all the relevant stats and render them into a well
> known struct? Users' XDP design can still use a single per-cpu map with
> all the stats if they so choose, but there's a way to implement more
> optimal designs and still expose well-defined stats.
> 
> Maybe that's too complex, IDK.

I was just explaining to someone internally how to get stats at all of
the different points in the stack to track down reasons for dropped packets:

ethtool -S for h/w and driver
tc -s for drops by the qdisc
/proc/net/softnet_stat for drops at the backlog layer
netstat -s for network and transport layer

yet another command and API just adds to the nightmare of explaining and
understanding these stats.

There is real value in continuing to use ethtool API for XDP stats. Not
saying this reorg of the XDP stats is the right thing to do, only that
the existing API has real user benefits.

Does anyone have data that shows bumping a properly implemented counter
causes a noticeable performance degradation and if so by how much? You
mention 'yet another cacheline' but collecting stats on stack and
incrementing the driver structs at the end of the napi loop should not
have a huge impact versus the value the stats provide.


Re: [tipc-discussion] [iproute2] tipc: call a sub-routine in separate socket

2021-05-09 Thread David Ahern
On 5/5/21 9:27 PM, Hoang Le wrote:
> When receiving a result from the first query to netlink, we may exec
> another query inside the callback. If this sub-routine is called on
> the same socket, the result from the previous execution will be
> discarded.
> To avoid this we perform the nested query on a separate socket.
> 
> Fixes: 202102830663 ("tipc: use the libmnl functions in lib/mnl_utils.c")
> Signed-off-by: Hoang Le 
> Acked-by: Jon Maloy 
> ---
>  tipc/bearer.c | 50 +-
>  tipc/link.c   | 15 +--
>  tipc/socket.c | 17 +++--
>  3 files changed, 73 insertions(+), 9 deletions(-)
> 

applied, thanks



___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom

2021-04-19 Thread David Ahern
On 4/16/21 2:16 AM, Xuan Zhuo wrote:
> In page_to_skb(), if we have enough tailroom to save skb_shared_info, we
> can use build_skb to create skb directly. No need to alloc for
> additional space. And it can save a 'frags slot', which is very friendly
> to GRO.
> 
> Here, if the payload of the received package is too small (less than
> GOOD_COPY_LEN), we still choose to copy it directly to the space got by
> napi_alloc_skb. So we can reuse these pages.
> 
> Testing Machine:
> The four queues of the network card are bound to the cpu1.
> 
> Test command:
> for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& 
> done
> 
> The size of the UDP packet is 1000, so in the case of this patch, there
> will always be enough tailroom to use build_skb. The sent UDP packets
> will be discarded because there is no port to receive them. The softirq
> load of the machine is 100%; we observe the received rate displayed by
> sar -n DEV 1:
> 
> no build_skb:  956864.00 rxpck/s
> build_skb:1158465.00 rxpck/s
> 

virtio_net is using napi_consume_skb, so napi_build_skb should show a
small increase from build_skb.


