Re: [PATCH 2/3] treewide: Convert some ethtool_sprintf() to ethtool_puts()

2023-10-26 Thread Jakub Kicinski
On Thu, 26 Oct 2023 11:23:37 +0200 Przemek Kitszel wrote:
> this would now fit into one line
> (perhaps it's the same in other cases, I just checked this one manually)

I think cocci would fold lines automatically? Could be worth trying
spatch to do the conversion for that reason, if you aren't already.
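For illustration, a minimal sketch of the conversion being discussed; this is
not taken from the series, and "example_stats" is a made-up table:

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/ethtool.h>

static const struct { char name[ETH_GSTRING_LEN]; } example_stats[] = {
	{ "tx_cq_err" }, { "rx_coalesced_err" },
};

static void example_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
{
	int i;

	if (stringset != ETH_SS_STATS)
		return;

	for (i = 0; i < ARRAY_SIZE(example_stats); i++)
		/* was: ethtool_sprintf(&data, example_stats[i].name);
		 * a plain string with no format arguments can use
		 * ethtool_puts() and avoid -Wformat-security concerns */
		ethtool_puts(&data, example_stats[i].name);
}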



Re: [PATCH net-next] Use xdp_set_features_flag instead of direct assignment

2023-10-27 Thread Jakub Kicinski
On Fri, 27 Oct 2023 11:06:51 -0700 Haiyang Zhang wrote:
> From: Konstantin Taranov 
> 
> This patch uses a helper function for assignment of xdp_features.
> This change simplifies backports.

Generally, making backports easier is not a strong enough reason to
change upstream code, but using the helper seems like a good idea.

I touched up the white space and title when applying, thanks!
-- 
pw-bot: applied



Re: [Patch v2] hv_netvsc: Mark VF as slave before exposing it to user-mode

2023-10-30 Thread Jakub Kicinski
On Fri, 27 Oct 2023 13:59:50 -0700 lon...@linuxonhyperv.com wrote:
> When a VF is being exposed from the kernel, it should be marked as "slave"
> before exposing to the user-mode. The VF is not usable without netvsc running
> as master. The user-mode should never see a VF without the "slave" flag.
> 
> This commit moves the code of setting the slave flag to the time before VF is
> exposed to user-mode.

Can you give a real example in the commit message of a flow in user
space which would get confused by seeing the VF netdev without
IFF_SLAVE?

You're only moving the setting of IFF_SLAVE but not the linking of the
master; is there no code which would assume that if SLAVE is set there
is a master?



Re: [PATCH net,v2] hv_netvsc: fix race of netvsc and VF register_netdevice

2023-11-01 Thread Jakub Kicinski
On Thu, 26 Oct 2023 14:22:34 -0700 Haiyang Zhang wrote:
> And, move register_netdevice_notifier() earlier, so the callback
> function is set before probing.

Are you sure you need this? I thought the netdev notifier "replays"
registration events (i.e. sends "fake" events for already present
netdevs).

If I'm wrong this should still be a separate patch from the rtnl
reorder.
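
For context, register_netdevice_notifier() does replay NETDEV_REGISTER (and
NETDEV_UP) for netdevs that already exist, so a "late" notifier still sees
them. A minimal, generic sketch of a callback relying on that (handler names
are placeholders, not hv_netvsc code):

#include <linux/netdevice.h>
#include <linux/notifier.h>

static int example_netdev_event(struct notifier_block *nb,
				unsigned long event, void *ptr)
{
	struct net_device *ndev = netdev_notifier_info_to_dev(ptr);

	switch (event) {
	case NETDEV_REGISTER:	/* delivered for new and pre-existing devices */
		netdev_dbg(ndev, "register event (possibly a replay)\n");
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}

static struct notifier_block example_nb = {
	.notifier_call = example_netdev_event,
};

/* register_netdevice_notifier(&example_nb); */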
-- 
pw-bot: cr



Re: [PATCH net-next v4] hv_netvsc: Mark VF as slave before exposing it to user-mode

2023-11-08 Thread Jakub Kicinski
On Wed,  8 Nov 2023 14:56:52 -0800 lon...@linuxonhyperv.com wrote:
> From: Long Li 
> 
> When a VF is being exposed from the kernel, it should be marked as "slave"
> before exposing to the user-mode. The VF is not usable without netvsc running
> as master. The user-mode should never see a VF without the "slave" flag.
> 
> An example of a user-mode program depending on this flag is cloud-init
> (https://github.com/canonical/cloud-init/blob/19.3/cloudinit/net/__init__.py)

Quick grep for "flags", "priv" and "slave" doesn't show anything.
Can you point me to the line of code?

> When scanning interfaces, it checks whether this interface has a master to
> decide if it should be configured. There are other user-mode programs that
> perform similar checks.
> 
> This commit moves the code of setting the slave flag to the time before VF is
> exposed to user-mode.

> Change since v3:
> Change target to net-next.

You don't consider this a fix? It seems like a race condition.

> - if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
> - netdev_notice(vf_netdev,
> -		      "falling back to mac addr based matching\n");
> + if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr) ||
> + ether_addr_equal(vf_netdev->dev_addr, ndev->perm_addr))

This change doesn't seem to be described in the commit message.

Please note that we have a rule against reposting patches within 24h:

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#resending-after-review



Re: [PATCH net,v3, 2/2] hv_netvsc: Fix race of register_netdevice_notifier and VF register

2023-11-08 Thread Jakub Kicinski
On Tue,  7 Nov 2023 13:05:32 -0800 Haiyang Zhang wrote:
> If VF NIC is registered earlier, NETDEV_REGISTER event is replayed,
> but NETDEV_POST_INIT is not.

But Long Li sent the patch which starts to use POST_INIT against
the net-next tree. If we apply this to net and Long Li's patch to
net-next, one release will have half of the fixes.

I think that you should add Long Li's patch to this series. That'd
limit the confusion and git preserves authorship of the changes, so
neither of you will lose the credit.



Re: [PATCH net-next v4] hv_netvsc: Mark VF as slave before exposing it to user-mode

2023-11-10 Thread Jakub Kicinski
On Fri, 10 Nov 2023 00:43:55 + Long Li wrote:
> The code above needs to work with and without netvsc (the possible
> master device) present.

I don't think that's a reasonable requirement for the kernel code.

The auto-bonding already puts the kernel into the business of guessing
policy, which frankly we shouldn't be in.

Having the kernel guess even harder that there will be a master,
but it's not there yet, is not reasonable.



Re: [PATCH net-next v4] hv_netvsc: Mark VF as slave before exposing it to user-mode

2023-11-18 Thread Jakub Kicinski
On Wed, 15 Nov 2023 08:14:06 -0800 Stephen Hemminger wrote:
> Jakub is right that in an ideal world, this could all be managed by
> userspace. But the management of network devices in Linux is a
dumpster fire! Every distro invents their own solution; last time
> I counted there were six different tools claiming to be the
> "one network device manager to rule them all". And that doesn't
> include all the custom scripts and vendor appliances.

To be clear, I thought Long Li was saying that the goal is to work around
cases where the VF is probed before netvsc. That seems like something that
can be prevented by the hypervisor.



Re: [PATCH V2 net-next] net: mana: Assigning IRQ affinity on HT cores

2023-11-21 Thread Jakub Kicinski
On Tue, 21 Nov 2023 05:54:37 -0800 Souradeep Chakrabarti wrote:
> Existing MANA design assigns an IRQ to every CPU, including sibling
> hyper-threads in a core. This causes multiple IRQs to work on the same
> CPU and may reduce the network performance with RSS.
> 
> Improve the performance by adhering to the configuration for RSS, which
> assigns IRQs on HT cores.

Drivers should not have to carry 120 LoC for something as basic as
spreading IRQs. Please take a look at include/linux/topology.h and
if there's nothing that fits your needs there - add it. That way
other drivers can reuse it.
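
A minimal sketch of what leaning on an existing helper could look like;
cpumask_local_spread() lives in the core cpumask code, "irqs", "nvec" and
"node" are assumed driver-local variables, and this does not implement the
HT-aware policy the patch is after:

#include <linux/cpumask.h>
#include <linux/interrupt.h>

static void example_spread_irqs(int *irqs, int nvec, int node)
{
	int i, cpu;

	for (i = 0; i < nvec; i++) {
		/* pick CPUs near "node" first, round-robin across the rest */
		cpu = cpumask_local_spread(i, node);
		irq_set_affinity_hint(irqs[i], cpumask_of(cpu));
	}
}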



Re: [EXTERNAL] Re: [PATCH V2 net-next] net: mana: Assigning IRQ affinity on HT cores

2023-11-27 Thread Jakub Kicinski
On Mon, 27 Nov 2023 09:36:38 + Souradeep Chakrabarti wrote:
> easier to keep things inside the mana driver code here

Easier for who? Upstream we care about consistency and maintainability
across all drivers.



Re: [EXTERNAL] Re: [PATCH V2 net-next] net: mana: Assigning IRQ affinity on HT cores

2023-11-29 Thread Jakub Kicinski
On Wed, 29 Nov 2023 14:17:39 -0800 Souradeep Chakrabarti wrote:
> On Mon, Nov 27, 2023 at 10:06:39AM -0800, Jakub Kicinski wrote:
> > On Mon, 27 Nov 2023 09:36:38 + Souradeep Chakrabarti wrote:  
> > > easier to keep things inside the mana driver code here  
> > 
> > Easier for who? Upstream we care about consistency and maintainability
> > across all drivers.  
> I am refactoring the code and putting some of the changes in topology.h
> and in nodemask.h. I am sharing the proposed change here for those two
> files. Please let me know if they are acceptable.

Thanks, adding Yury  who's the best person 
to comment on the details...

> Added a new helper to iterate on numa nodes with cpu and start from a 
> particular node, instead of first node. This helps when we want to
> iterate from the local numa node.
> 
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 8d07116caaf1..6e4528376164 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -392,6 +392,15 @@ static inline void __nodes_fold(nodemask_t *dstp, const nodemask_t *origp,
> for ((node) = 0; (node) < 1 && !nodes_empty(mask); (node)++)
>  #endif /* MAX_NUMNODES */
> 
> +#if MAX_NUMNODES > 1
> +#define for_each_node_next_mask(node_start, node_next, mask)   \
> +   for ((node_next) = (node_start);\
> +(node_next) < MAX_NUMNODES;\
> +(node_next) = next_node((node_next), (mask)))
> +#else
> +#define for_each_node_next_mask(node_start, node_next, mask)   \
> +   for_each_node_mask(node_next, mask)
> +#endif
>  /*
>   * Bitmasks that are kept for all the nodes.
>   */
> @@ -440,6 +449,8 @@ static inline int num_node_state(enum node_states state)
> 
>  #define for_each_node_state(__node, __state) \
> for_each_node_mask((__node), node_states[__state])
> +#define for_each_node_next_state(__node_start, __node_next, __state) \
> +   for_each_node_next_mask((__node_start), (__node_next), node_states[__state])
> 
>  #define first_online_node  first_node(node_states[N_ONLINE])
>  #define first_memory_node  first_node(node_states[N_MEMORY])
> @@ -489,7 +500,8 @@ static inline int num_node_state(enum node_states state)
> 
>  #define for_each_node_state(node, __state) \
> for ( (node) = 0; (node) == 0; (node) = 1)
> -
> +#define for_each_node_next_state(node, next_node, _state) \
> +   for_each_node_state(node, __state)
>  #define first_online_node  0
>  #define first_memory_node  0
>  #define next_online_node(nid)  (MAX_NUMNODES)
> @@ -535,6 +547,8 @@ static inline int node_random(const nodemask_t *maskp)
> 
>  #define for_each_node(node)   for_each_node_state(node, N_POSSIBLE)
>  #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
> +#define for_each_online_node_next(node, next_node)  \
> + for_each_node_next_state(node, next_node, N_ONLINE)
> 
>  /*
>   * For nodemask scratch area.
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 52f5850730b3..a06b16e5a955 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -43,6 +43,9 @@
> for_each_online_node(node)  \
> if (nr_cpus_node(node))
> 
> +#define for_each_next_node_with_cpus(node, next_node)  \
> +   for_each_online_node_next(node, next_node)  \
> +   if (nr_cpus_node(next_node))
>  int arch_update_cpu_topology(void);
> 
>  /* Conform to ACPI 2.0 SLIT distance definitions */
> 




Re: [PATCH 4/4 V2 net-next] net: mana: Assigning IRQ affinity on HT cores

2024-01-23 Thread Jakub Kicinski
On Mon, 22 Jan 2024 08:00:59 -0800 Souradeep Chakrabarti wrote:
> IRQ    node-num    core-num    CPU        performance(%)
> 1      0 | 0       0 | 0       0 | 0-1     0
> 2      0 | 0       0 | 1       1 | 2-3     3
> 3      0 | 0       1 | 2       2 | 4-5    10
> 4      0 | 0       1 | 3       3 | 6-7    15
> 5      0 | 0       2 | 4       4 | 8-9    15
> ---
> ---

Please don't use --- as a line, indent it or use ... because git am
uses --- as a commit message separator. The commit message will get
cut off at the first one of those if we try to apply this.
-- 
pw-bot: cr



Re: [PATCH] hv_netvsc:Register VF in netvsc_probe if NET_DEVICE_REGISTER missed

2024-01-30 Thread Jakub Kicinski
On Mon, 29 Jan 2024 23:18:55 -0800 Shradha Gupta wrote:
> If hv_netvsc driver is removed and reloaded, the NET_DEVICE_REGISTER
> handler cannot perform VF register successfully as the register call
> is received before netvsc_probe is finished. This is because we
> register register_netdevice_notifier() very early (even before
> vmbus_driver_register()).
> To fix this, we try to register each such matching VF (if it is visible
> as a netdevice) at the end of netvsc_probe.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 85520856466e ("hv_netvsc: Fix race of register_netdevice_notifier and VF register")
> Suggested-by: Dexuan Cui 
> Signed-off-by: Shradha Gupta 

Does not seem to apply to net/main, please respin



Re: [PATCH] hv_netvsc:Register VF in netvsc_probe if NET_DEVICE_REGISTER missed

2024-01-31 Thread Jakub Kicinski
On Tue, 30 Jan 2024 22:09:57 -0800 Shradha Gupta wrote:
> This patch applies to net, which is missed in the subject. I will fix
> this in the new version of the patch. Thanks


$ git checkout net/main 
[...]
$ git am raw
Applying: hv_netvsc:Register VF in netvsc_probe if NET_DEVICE_REGISTER missed
error: patch failed: drivers/net/hyperv/netvsc_drv.c:42
error: drivers/net/hyperv/netvsc_drv.c: patch does not apply
Patch failed at 0001 hv_netvsc:Register VF in netvsc_probe if NET_DEVICE_REGISTER missed
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".



Re: [PATCH] net :mana : Add per-cpu stats for MANA device

2024-03-07 Thread Jakub Kicinski
On Thu,  7 Mar 2024 06:52:12 -0800 Shradha Gupta wrote:
> Extend 'ethtool -S' output for mana devices to include per-CPU packet
> stats

But why? You already have per queue stats.

> Built-on: Ubuntu22
> Tested-on: Ubuntu22

I understand when people mention testing driver changes on particular
SKUs of supported devices, that adds meaningful info. I don't understand
what "built" and "tested" on a distro is supposed to tell us.
Honest question, what is the value of this?



Re: [PATCH] net :mana : Add per-cpu stats for MANA device

2024-03-07 Thread Jakub Kicinski
On Thu, 7 Mar 2024 15:49:15 + Haiyang Zhang wrote:
> > > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > > stats  
> > 
> > But why? You already have per queue stats.  
> Yes. But the q to cpu binding is dynamic, so we also want the per-CPU stats
> to analyze the CPU usage by counting the packets and bytes on each CPU.

Dynamic is a bit of an exaggeration, right? On a well-configured system
each CPU should use a single queue assigned thru XPS. And for manual
debug bpftrace should serve the purpose quite well.

Please note that you can't use num_present_cpus() to size stats in
ethtool -S, you have to use num_possible_cpus(), because the retrieval
of the stats is done in a multi-syscall fashion and there are no
explicit lengths in the API. So you must always report all possible
stats, not just the currently active ones :(
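
A minimal sketch of that sizing rule, using hypothetical example_* callbacks
rather than mana code:

#include <linux/cpumask.h>
#include <linux/ethtool.h>
#include <linux/netdevice.h>

static int example_get_sset_count(struct net_device *ndev, int sset)
{
	if (sset != ETH_SS_STATS)
		return -EOPNOTSUPP;

	/* two counters per *possible* CPU, never per present/online CPU */
	return num_possible_cpus() * 2;
}

static void example_get_pcpu_strings(u8 *data)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		ethtool_sprintf(&data, "cpu%d_rx_packets", cpu);
		ethtool_sprintf(&data, "cpu%d_tx_packets", cpu);
	}
}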



Re: [PATCH v2] net :mana : Add per-cpu stats for MANA device

2024-03-08 Thread Jakub Kicinski
On Fri,  8 Mar 2024 02:44:31 -0800 Shradha Gupta wrote:
> Extend 'ethtool -S' output for mana devices to include per-CPU packet
> stats

You haven't answered the questions on v1 and haven't addressed all 
the feedback.
-- 
pw-bot: cr
pv-bot: feedback



Re: [PATCH] net :mana : Add per-cpu stats for MANA device

2024-03-08 Thread Jakub Kicinski
On Fri, 8 Mar 2024 18:51:58 + Haiyang Zhang wrote:
> > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > each CPU should use a single queue assigned thru XPS. And for manual
> > debug bpftrace should serve the purpose quite well.  
> 
> Some programs, like irqbalance, can dynamically change the CPU affinity,
> so we want to add the per-CPU counters for better understanding of the CPU 
> usage.

Do you have experimental data showing this making a difference
in production?

Seems unlikely, but if it does work we should enable it for all
devices, not driver by driver.



Re: [PATCH] net :mana : Add per-cpu stats for MANA device

2024-03-08 Thread Jakub Kicinski
On Fri, 8 Mar 2024 19:43:57 + Haiyang Zhang wrote:
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.  
> There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu 
> counters. Are you suggesting we add a common API for all drivers?

Hm, I don't think mlx does. The typical use case for pcpu stats so far
has been software devices which don't have queues, and implement
lockless Tx. In that case instead of recording stats on a local queue
struct we can count per-cpu and not worry about locking.
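
A minimal sketch of that pattern, using the generic pcpu_sw_netstats helpers
rather than any particular driver's code:

#include <linux/netdevice.h>

static int example_init(struct net_device *dev)
{
	/* one counter set per CPU, no queue lock needed on the hot path */
	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
	return dev->tstats ? 0 : -ENOMEM;
}

static void example_count_rx(struct net_device *dev, unsigned int len)
{
	/* this_cpu_ptr() plus u64_stats_update_begin/end under the hood */
	dev_sw_netstats_rx_add(dev, len);
}

/* reporting can then use the generic helper:
 *	.ndo_get_stats64 = dev_get_tstats64,
 */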



Re: [PATCH] net :mana : Add per-cpu stats for MANA device

2024-03-11 Thread Jakub Kicinski
On Sun, 10 Mar 2024 21:19:50 -0700 Shradha Gupta wrote:
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.  
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul 
> (https://lore.kernel.org/lkml/20240307072923.6cc8a...@kernel.org/)
> to include these stats as well?

"framework" is a big word, but yes, add a netlink command to get 
pcpu stats. Let's focus on the usefulness before investing time
in a rewrite, tho.



Re: [PATCH] net :mana : Add per-cpu stats for MANA device

2024-03-14 Thread Jakub Kicinski
On Wed, 13 Mar 2024 19:57:20 -0700 Shradha Gupta wrote:
> Default interrupts affinity for each queue:
> 
>  25:  1103  02989138  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
>  26:  0  14005360  0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
>  27:  0  0  12997584  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
>  28:3565461  0  0  1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0
> 
> As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped 
> to cpu3. From this knowledge we can figure out the total RX stats processed by
> each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
> this data changes dynamically using irqbalance or smp_affinity file edits, the
> above assumption fails. 
> 
> Interrupt affinity for mana_q2 changes and the affinity table looks as follows
>  25:  1103  03038084  Hyper-V PCIe MSI 4138200989697-edge  mana_q0@pci:7870:00:00.0
>  26:  0  14012447  0  Hyper-V PCIe MSI 4138200989698-edge  mana_q1@pci:7870:00:00.0
>  27: 157181 10  13007990  Hyper-V PCIe MSI 4138200989699-edge  mana_q2@pci:7870:00:00.0
>  28:3593858  0  0  1  Hyper-V PCIe MSI 4138200989700-edge  mana_q3@pci:7870:00:00.0
> 
> And during this time we might end up calculating the per-CPU stats incorrectly,
> messing up the understanding of CPU usage by MANA driver that is consumed by 
> monitoring services. 

Like Stephen said, forget about irqbalance for networking.

Assume that the IRQs are affinitized and XPS is set correctly.

Now, presumably you can use your pcpu stats to "trade queues",
e.g. 4 CPUs / 4 queues, if CPU 0 insists on using queue 1
instead of queue 0, you can swap the 0 <> 1 assignment.
That's just an example of an "algorithm", maybe you have other
use cases. But if the problem is "user runs broken irqbalance"
the solution is not in the kernel...
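
One possible kernel-side shape of that queue-swap idea, purely as a sketch;
in practice XPS is usually written from user space through the per-queue
xps_cpus sysfs files, and the policy deciding which queues to trade would
live there too:

#include <linux/netdevice.h>

static int example_swap_xps(struct net_device *dev, u16 q_a, u16 q_b,
			    int cpu_a, int cpu_b)
{
	int err;

	/* point queue q_a at cpu_b and queue q_b at cpu_a */
	err = netif_set_xps_queue(dev, cpumask_of(cpu_b), q_a);
	if (err)
		return err;
	return netif_set_xps_queue(dev, cpumask_of(cpu_a), q_b);
}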



Re: [PATCH] net :mana : Add per-cpu stats for MANA device

2024-03-14 Thread Jakub Kicinski
On Thu, 14 Mar 2024 18:54:31 + Haiyang Zhang wrote:
> We understand irqbalance may be a "bad idea", and recommended some 
> customers to disable it when having problems with it... But it's 
> still enabled by default, and we cannot let all distro vendors 
> and custom image makers to disable the irqbalance. So, our host-
> networking team is eager to have per-CPU stats for analyzing CPU 
> usage related to irqbalance or other issues.

You need a use case to get the stats upstream.
"CPU usage related to irqbalance or other issues" is both too vague,
and irqbalance is a user space problem.



Re: [PATCH net] net: mana: Fix Rx DMA datasize and skb_over_panic

2024-04-01 Thread Jakub Kicinski
On Tue, 2 Apr 2024 01:23:08 + Dexuan Cui wrote:
> > > I suggest the Fixes tag should be updated. Otherwise the fix
> > > looks good to me.  
> > 
> > Thanks for the suggestion. I actually thought about this before
> > submission.
> > I was worried that if someone backports the jumbo frame feature,
> > they may not automatically know this patch should be backported
> > too.   
> 
> The jumbo frame commit (2fbbd712baf1) depends on the MTU
> commit (2fbbd712baf1), so adding "Fixes: 2fbbd712baf1" (
> instead of "Fixes: ca9c54d2d6a5") might make it easier for people
> to notice and pick up this fix.
> 
> I'm OK if the patch remains as is. Just wanted to make sure I
> understand the issue here.

Please update the tag to where the bug was actually first exposed.



Re: [PATCH net-next v2 1/2] net: Add sysfs atttributes for max_mtu min_mtu

2024-04-24 Thread Jakub Kicinski
On Wed, 24 Apr 2024 03:33:37 -0700 Shradha Gupta wrote:
> Add sysfs attributes to read max_mtu and min_mtu value for
> network devices

Absolutely pointless. You posted v1, dumping this as a driver
specific value, even tho it's already reported by the core...
And you can't even produce a meaningful commit message longer
than one sentence.

This is not meeting the bar. Please get your patches reviewed
internally at Microsoft by someone with good understanding of
Linux networking before you post.



Re: [PATCH net-next] net: mana: Enable MANA driver on ARM64 with 4K page size

2024-05-13 Thread Jakub Kicinski
On Mon, 13 May 2024 13:29:01 -0700 Haiyang Zhang wrote:
> - depends on PCI_MSI && X86_64
> + depends on PCI_MSI
> + depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN && ARM64_4K_PAGES)

Can ARM64 be big endian?



Re: [PATCH net-next] net: mana: Enable MANA driver on ARM64 with 4K page size

2024-05-13 Thread Jakub Kicinski
On Mon, 13 May 2024 20:50:37 + Haiyang Zhang wrote:
> From the document, it can be:
> "ARM cores support both modes, but are most commonly used in, and typically 
> default to little-endian mode. Most Linux distributions for ARM tend to be 
> little-endian only." 
> https://developer.arm.com/documentation/den0042/a/Coding-for-Cortex-R-Processors/Endianness
> 
> MANA driver doesn't support big endian.

Alright, but please prioritize at least adding the 64k page support.
Linux drivers are supposed to be as platform independent as possible.
If you use the right APIs you shouldn't have to worry about the endian
or the page size.



Re: [PATCH net-next] net: mana: Implement get_ringparam/set_ringparam for mana

2024-07-15 Thread Jakub Kicinski
On Sun, 14 Jul 2024 20:40:20 -0700 Shradha Gupta wrote:
> + if (ring->rx_jumbo_pending) {
> +	netdev_err(ndev, "%s: rx_jumbo_pending not supported\n", __func__);
> + return -EINVAL;
> + }
> + if (ring->rx_mini_pending) {
> +	netdev_err(ndev, "%s: rx_mini_pending not supported\n", __func__);
> + return -EINVAL;
> + }

I think that core already checks this

> + if (!apc)
> + return -EINVAL;

Provably impossible, apc is netdev + sizeof(netdev) so it'd have to
wrap a 64b integer to be NULL :|
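
For reference, this is roughly how netdev_priv() is defined in
include/linux/netdevice.h; the private area sits right after the aligned
struct net_device, so it cannot be NULL for a valid ndev:

static inline void *netdev_priv(const struct net_device *dev)
{
	return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}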

> + old_tx = apc->tx_queue_size;
> + old_rx = apc->rx_queue_size;
> + new_tx = clamp_t(u32, ring->tx_pending, MIN_TX_BUFFERS_PER_QUEUE, MAX_TX_BUFFERS_PER_QUEUE);
> + new_rx = clamp_t(u32, ring->rx_pending, MIN_RX_BUFFERS_PER_QUEUE, MAX_RX_BUFFERS_PER_QUEUE);
> +
> + if (new_tx == old_tx && new_rx == old_rx)
> + return 0;

Pretty sure core will also not call you if there's no change.
If it does please update core instead of catching this in the driver.

Please keep in mind that net-next will be closed for the duration
of the merge window.
-- 
pw-bot: cr



Re: [PATCH net-next v2] net: mana: Implement get_ringparam/set_ringparam for mana

2024-07-31 Thread Jakub Kicinski
On Tue, 30 Jul 2024 10:01:35 -0700 Shradha Gupta wrote:
> + err1 = mana_detach(ndev, false);
> + if (err1) {
> + netdev_err(ndev, "mana_detach failed: %d\n", err1);
> + return err1;
> + }
> +
> + apc->tx_queue_size = new_tx;
> + apc->rx_queue_size = new_rx;
> + err1 = mana_attach(ndev);
> + if (!err1)
> + return 0;
> +
> + netdev_err(ndev, "mana_attach failed: %d\n", err1);
> +
> + /* Try rolling back to the older values */
> + apc->tx_queue_size = old_tx;
> + apc->rx_queue_size = old_rx;
> + err2 = mana_attach(ndev);

If the system is under memory pressure there's no guarantee you'll get
the memory back, even if you revert to the old counts.
We strongly recommend you refactor the code to hold onto the old memory
until you're sure the new config works.
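
A minimal sketch of the recommended structure, using hypothetical
example_rings_* helpers (mana has no functions under these names):

#include <linux/netdevice.h>

struct example_rings;	/* placeholder for the driver's ring state */

struct example_rings *example_rings_alloc(struct net_device *ndev,
					  u32 tx_size, u32 rx_size);
struct example_rings *example_rings_swap(struct net_device *ndev,
					 struct example_rings *rings);
void example_rings_free(struct example_rings *rings);

static int example_resize_rings(struct net_device *ndev, u32 new_tx, u32 new_rx)
{
	struct example_rings *new_rings, *old_rings;

	/* allocate first: on failure the old rings are untouched and
	 * traffic keeps flowing on the existing configuration */
	new_rings = example_rings_alloc(ndev, new_tx, new_rx);
	if (!new_rings)
		return -ENOMEM;

	old_rings = example_rings_swap(ndev, new_rings);
	example_rings_free(old_rings);
	return 0;
}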
-- 
pw-bot: cr



Re: [PATCH net-next v2] net: mana: Implement get_ringparam/set_ringparam for mana

2024-08-01 Thread Jakub Kicinski
On Wed, 31 Jul 2024 20:49:05 -0700 Shradha Gupta wrote:
> It is a pretty standard support for network drivers to allow changing
> TX/RX queue sizes. We are working on improving customizations in MANA
> driver based on VM configurations. This patch is a part of that series.
> Hope that makes things more clear.

Simple reconfiguration must not run the risk of taking the system off
network.



Re: [PATCH] net: netvsc: Increase default VMBus channel from 8 to 16

2024-08-07 Thread Jakub Kicinski
On Mon,  5 Aug 2024 22:55:51 -0700 Erni Sri Satya Vennela wrote:
> Performance tests showed significant improvement in throughput:
> - 0.54% for 16 vCPUs
> - 1.51% for 32 vCPUs
> - 0.72% for 48 vCPUs
> - 5.57% for 64 vCPUs
> - 9.14% for 96 vCPUs

Could you please switch to netif_get_num_default_rss_queues()?
It used to return a hard-coded 8, but now it returns #physical cores / 2.
That's based on broad experience with Meta's workloads. Some workloads
need more, some need fewer, but broadly half of the physical cores is
a good guess for 90%+ of them.
Assuming you have thread siblings in those vCPUs above, it should
match what you want, too.
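
A minimal sketch of that suggestion, assuming VRSS_CHANNEL_MAX stays as the
driver's existing upper bound on channels:

#include <linux/netdevice.h>
#include <linux/minmax.h>

static unsigned int example_default_channels(void)
{
	/* half the physical cores, capped by what the device supports */
	return min_t(unsigned int, VRSS_CHANNEL_MAX,
		     netif_get_num_default_rss_queues());
}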
-- 
pw-bot: cr



Re: [PATCH v2 net] net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings

2024-08-08 Thread Jakub Kicinski
On Wed,  7 Aug 2024 16:17:06 -0700 lon...@linuxonhyperv.com wrote:
> Cc: sta...@vger.kernel.org
> Fixes: e1b5683ff62e ("net: mana: Move NAPI from EQ to CQ")
> 
> Reviewed-by: Haiyang Zhang 

no empty lines between trailers please



Re: [PATCH v2 net] net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings

2024-08-08 Thread Jakub Kicinski
On Thu, 8 Aug 2024 15:33:55 + Long Li wrote:
> > no empty lines between trailers please  
> 
> I'm sending v3 to fix this.

I hope you don't mean you're sending it _now_, given:
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html



Re: [PATCH v2] net: netvsc: Update default VMBus channels

2024-08-15 Thread Jakub Kicinski
On Wed, 14 Aug 2024 09:59:13 -0700 Erni Sri Satya Vennela wrote:
> Change VMBus channels macro (VRSS_CHANNEL_DEFAULT) in
> Linux netvsc from 8 to 16 to align with Azure Windows VM
> and improve networking throughput.
> 
> For VMs having less than 16 vCPUS, the channels depend
> on number of vCPUs. Between 16 to 32 vCPUs, the channels
> default to VRSS_CHANNEL_DEFAULT. For greater than 32 vCPUs,
> set the channels to number of physical cores / 2 as a way
> to optimize CPU resource utilization and scale for high-end
> processors with many cores.
> Maximum number of channels are by default set to 64.
> 
> Based on this change the subchannel creation would change as follows:
> 
> ---------------------------------------------------------
> | No. of vCPU | dev_info->num_chn | subchannel created |
> ---------------------------------------------------------
> |  0-16       |   16              |   vCPU             |
> | >16 & <=32  |   16              |   16               |
> | >32 & <=128 |   vCPU/2          |   vCPU/2           |
> | >128        |   vCPU/2          |   64               |
> ---------------------------------------------------------
> 
> Performance tests showed significant improvement in throughput:
> - 0.54% for 16 vCPUs
> - 0.83% for 32 vCPUs
> - 1.76% for 48 vCPUs
> - 10.35% for 64 vCPUs
> - 13.47% for 96 vCPUs

Is there anything that needs clarifying in my feedback on v1?

https://lore.kernel.org/all/20240807201857.445f9...@kernel.org/

Ignoring maintainer feedback is known to result in angry outbursts.



Re: [PATCH v2] net: netvsc: Update default VMBus channels

2024-08-16 Thread Jakub Kicinski
On Thu, 15 Aug 2024 19:23:50 + Haiyang Zhang wrote:
> Your suggestion on netif_get_num_default_rss_queues() is not ignored.
> We discussed internally on the formula we used for the num_chn, and
> chose a similar formula for higher number of vCPUs as in 
> netif_get_num_default_rss_queues().
> For lower number of vCPUs, we use the same default as Windows guests,
> because we don't want any potential regression.

Ideally you'd just use netif_get_num_default_rss_queues()
but the code is close enough to that, and I don't have enough
experience with the question of online CPUs vs physical CPUs.

I would definitely advise you to try this on real workloads.
While "iperf" looks great with a lot of rings, real workloads
suffer measurably from having more channels eating up memory
and generating interrupts.

But if you're confident with the online_cpus() / 2, that's fine.
You may be better off coding it up using max:

	dev_info->num_chn = max(DIV_ROUND_UP(num_online_cpus(), 2),
				VRSS_CHANNEL_DEFAULT);



Re: [PATCH net-next v3] net: mana: Implement get_ringparam/set_ringparam for mana

2024-08-16 Thread Jakub Kicinski
On Fri, 16 Aug 2024 03:48:23 -0700 Shradha Gupta wrote:
> + old_tx = apc->tx_queue_size;
> + old_rx = apc->rx_queue_size;
> + new_tx = clamp_t(u32, ring->tx_pending, MIN_TX_BUFFERS_PER_QUEUE, MAX_TX_BUFFERS_PER_QUEUE);
> + new_rx = clamp_t(u32, ring->rx_pending, MIN_RX_BUFFERS_PER_QUEUE, MAX_RX_BUFFERS_PER_QUEUE);

You can min(), the max side of the clamp is unnecessary. Core code won't
let user requests above the max provided by "get" thru.

> + if (!is_power_of_2(new_tx)) {
> +	netdev_err(ndev, "%s:Tx:%d not supported. Needs to be a power of 2\n",
> +		   __func__, new_tx);
> + return -EINVAL;
> + }

The power of 2 vs clamp is a bit odd.
On one hand you clamp the values to what's supported automatically.
On the other you hard reject values which are not power of 2.
Why not round them up?

IDK whether checking or auto-correction is better, but mixing the two
is odd.
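
A minimal sketch of the round-up option, assuming roundup_pow_of_two() from
<linux/log2.h> is acceptable here:

#include <linux/log2.h>
#include <linux/minmax.h>

static u32 example_fix_ring_size(u32 requested, u32 min_sz, u32 max_sz)
{
	u32 val = clamp(requested, min_sz, max_sz);

	/* rounding up may overshoot the maximum, so cap it again */
	return min_t(u32, roundup_pow_of_two(val), max_sz);
}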

> + if (!is_power_of_2(new_rx)) {
> +	netdev_err(ndev, "%s:Rx:%d not supported. Needs to be a power of 2\n",
> +		   __func__, new_rx);

Instead of printing please use the extack passed in as an argument.
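
A minimal sketch of the extack approach, assuming the current four-argument
.set_ringparam signature:

#include <linux/ethtool.h>
#include <linux/log2.h>
#include <linux/netlink.h>

static int example_set_ringparam(struct net_device *ndev,
				 struct ethtool_ringparam *ring,
				 struct kernel_ethtool_ringparam *kring,
				 struct netlink_ext_ack *extack)
{
	if (!is_power_of_2(ring->rx_pending)) {
		NL_SET_ERR_MSG(extack, "rx ring size must be a power of 2");
		return -EINVAL;
	}
	return 0;
}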
-- 
pw-bot: cr



Re: [PATCH net-next v21] net: refactor ->ndo_bpf calls into dev_xdp_propagate

2024-08-21 Thread Jakub Kicinski
On Wed, 21 Aug 2024 04:56:29 + Mina Almasry wrote:
> When net devices propagate xdp configurations to slave devices, or when
> core propagates xdp configuration to a device, we will need to perform
> a memory provider check to ensure we're not binding xdp to a device
> using unreadable netmem.
> 
> Currently ->ndo_bpf calls are all over the place. Adding checks to all
> these places would not be ideal.
> 
> Refactor all the ->ndo_bpf calls into one place where we can add this
> check in the future.
> 
> Suggested-by: Jakub Kicinski 
> Signed-off-by: Mina Almasry 

> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index f9633a6f8571..73f9416c6c1b 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -2258,7 +2258,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
>   goto err_sysfs_del;
>   }
>  
> - res = slave_dev->netdev_ops->ndo_bpf(slave_dev, &xdp);
> + res = dev_xdp_propagate(slave_dev, &xdp);

I was hoping we could fold in the "is there any program present already"
check, but I'm not sure if that check itself isn't buggy... so let's leave
that part to someone else.

Hangbin, would you be willing to take a look at testing (and fixing)
the XDP program propagation? I did a naive test of adding a bond
and veth under it, I attached an XDP prog to bond, and nothing happened
on the veth. Maybe I'm misreading but I expected the XDP prog to show
up on the veth.

> diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
> index 4a9522689fa4..e01c5997a551 100644
> --- a/drivers/net/hyperv/netvsc_bpf.c
> +++ b/drivers/net/hyperv/netvsc_bpf.c
> @@ -183,7 +183,7 @@ int netvsc_vf_setxdp(struct net_device *vf_netdev, struct bpf_prog *prog)
>   xdp.command = XDP_SETUP_PROG;
>   xdp.prog = prog;
>  
> - ret = vf_netdev->netdev_ops->ndo_bpf(vf_netdev, &xdp);
> + ret = dev_xdp_propagate(vf_netdev, &xdp);

Again, the driver itself appears rather questionable but we can leave
it be :)

> @@ -130,7 +130,7 @@ static int bpf_map_offload_ndo(struct bpf_offloaded_map *offmap,
>   /* Caller must make sure netdev is valid */
>   netdev = offmap->netdev;
>  
> - return netdev->netdev_ops->ndo_bpf(netdev, &data);
> + return dev_xdp_propagate(netdev, &data);

This is not propagation, it's an offload call, let's not convert it

> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index c0e0204b9630..f44d68c8d75d 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -149,7 +149,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
>   bpf.xsk.pool = NULL;
>   bpf.xsk.queue_id = pool->queue_id;
>  
> - err = pool->netdev->netdev_ops->ndo_bpf(pool->netdev, &bpf);
> + err = dev_xdp_propagate(pool->netdev, &bpf);
>  
>   if (err)
>   WARN(1, "Failed to disable zero-copy!\n");
> @@ -215,7 +215,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>   bpf.xsk.pool = pool;
>   bpf.xsk.queue_id = queue_id;
>  
> - err = netdev->netdev_ops->ndo_bpf(netdev, &bpf);
> + err = dev_xdp_propagate(netdev, &bpf);
>   if (err)
>   goto err_unreg_pool;
>  

That's also not xdp propagation. If you're not doing so already in your
series - you should look at queue state here directly and check if it
has MP installed. Conversely you should look at rxq->pool when binding
MP. But as I said, that's not part of the XDP refactor, just as part of
your series.

So in case my ramblings were confusing - code LG, but ditch the
net/xdp/xsk_buff_pool.c and kernel/bpf/offload.c changes.



Re: [PATCH net-next v2] net: refactor ->ndo_bpf calls into dev_xdp_propagate

2024-08-22 Thread Jakub Kicinski
On Thu, 22 Aug 2024 05:51:54 + Mina Almasry wrote:
> When net devices propagate xdp configurations to slave devices,
> we will need to perform a memory provider check to ensure we're
> not binding xdp to a device using unreadable netmem.
> 
> Currently the ->ndo_bpf calls are in a few places. Adding checks to all
> these places would not be ideal.
> 
> Refactor all the ->ndo_bpf calls into one place where we can add this
> check in the future.

LGTM! (if anyone is planning to review this please TAL, I'm thinking of
applying it a few hours before the full 24h period to let Mina post his
larger series today)



Re: [PATCH V2 net] net: mana: Fix error handling in mana_create_txq/rxq's NAPI cleanup

2024-08-27 Thread Jakub Kicinski
On Fri, 23 Aug 2024 02:44:29 -0700 Souradeep Chakrabarti wrote:
> @@ -2023,14 +2024,17 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
>  
>   napi = &rxq->rx_cq.napi;
>  
> - if (validate_state)
> - napi_synchronize(napi);
> + if (napi->dev == apc->ndev) {
>  
> - napi_disable(napi);
> + if (validate_state)
> + napi_synchronize(napi);
>  
> - xdp_rxq_info_unreg(&rxq->xdp_rxq);
> + napi_disable(napi);
>  
> - netif_napi_del(napi);
> + netif_napi_del(napi);
> + }
> +
> + xdp_rxq_info_unreg(&rxq->xdp_rxq);

Please don't use internal core state as a crutch for your cleanup.

IDK what "validate_state" stands for, but it gives you all the info you
need on Rx. On Rx NAPI registration happens as the last stage of rxq
activation, once nothing can fail. And the "cleanup" path calls destroy
with validate_state=false. The only other caller passes true.

So you can rewrite this as:

if (validate_state) { /* rename it maybe? */
napi_disable(napi);
...
}
xdp_rxq_info_unreg(&rxq->xdp_rxq);

You can take similar approach with Tx. Pass a bool which tells the
destroy function whether NAPI has been registered.
-- 
pw-bot: cr