Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote: > On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote: > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote: > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > I'm also not 100% on board with the argument that "future" FW > > > > can > > > > reshuffle things whatever way it wants to. Is the assumption > > > > that > > > > future ASICs/FW will be designed to always use the "blessed" > > > > BTF > > > > format? Or will it be reconfigurable at runtime? > > > > > > let's table configuration of metadata aside for a second. > > > > I agree that this should/could be NIC-specific and shouldn't weigh > > on > > the metadata interface between the drivers and XDP layer. > > > > > Describing metedata layout in BTF allows NICs to disclose > > > everything > > > NIC has to users in a standard and generic way. > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > and hw powercycled to have new md layout (and corresponding BTF > > > description) > > > is a separate discussion. > > > Saeed's proposal introduces the concept of 'offset' inside > > > 'struct > > > xdp_md_info' > > > to reach 'hash' value in metadata. > > > Essentially it's a run-time way to access 'hash' instead of > > > build- > > > time. > > > So bpf program would need two loads to read csum or hash field > > > instead of one. > > > With BTF the layout of metadata is known to the program at build- > > > time. > > > > > > To reiterate the proposal: > > > - driver+firmware keep layout of the metadata in BTF format > > > (either > > > in the driver > > > or driver can read it from firmware) > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will > > > query > > > the driver and > > > generate normal C header file based on BTF in the given NIC > > > - user does #include "md_desc.h" and bpf program can access md- > > > >csum > > > or md->hash > > > with direct single load out of metadata area in front of the > > > packet > > > > This piece is where I'd like to discuss more. When we discussed > > this > > in Seoul, the initial proposal was a static struct that we'd try to > > hammer out a common layout between the interested parties. That > > obviously wasn't going to scale, and we wanted to pursue something > > more > > dynamic. But I thought the goal was the XDP/eBPF program wouldn't > > want > > to care what the underlying device is, and could just ask for > > metadata > > that it's interested in. With this approach, your eBPF program is > > now > > bound/tied to the NIC/driver, and if you switched to a differen > > NIC/driver combo, then you'd have to rewrite part of your eBPF > > program > > to comprehend that. I thought we were trying to avoid that. > > It looks to me that NICs have a lot more differences instead of > common pieces. > If we focus the architecture on making common things as generic > as possible we miss the bigger picture of covering distinct > and unique features that hw provides. > > In the last email when I said "would be good to standardize at least > a few common fields" > I meant that it make sense only at the later phases. > I don't think we should put things into uapi upfront. > Otherwise we will end up with likely useless fields. > Like vlan field. Surely a lot of nics can store it into metadata, but > why? > bpf program can easily read it from the packet. > What's the benefit of adding it to md? > This metadata concept we're discussing in the context of program > acceleration. > If metadata doesn't give perf benefits it should not be done by > firmware, it should not be supported by the driver and certainly > should not be part of uapi. > Hence I'd like to see pure BTF description without any > common/standard > fields across the drivers and figure out what (if anything) is useful > to accelerate xdp programs (and af_xdp in the future). I guess I didn't write my response very well before, apologies. I actually really like the BTF description and the dynamic-ness of it. It solves lots of problems and keeps things out of the UAPI. What I was getting at was using the BTF description and dumping a header file to be used in the program to be loaded. If we have an application develo
Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote: > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > I'm also not 100% on board with the argument that "future" FW can > > reshuffle things whatever way it wants to. Is the assumption that > > future ASICs/FW will be designed to always use the "blessed" BTF > > format? Or will it be reconfigurable at runtime? > > let's table configuration of metadata aside for a second. I agree that this should/could be NIC-specific and shouldn't weigh on the metadata interface between the drivers and XDP layer. > Describing metedata layout in BTF allows NICs to disclose everything > NIC has to users in a standard and generic way. > Whether firmware is reconfigurable on the fly or has to reflashed > and hw powercycled to have new md layout (and corresponding BTF > description) > is a separate discussion. > Saeed's proposal introduces the concept of 'offset' inside 'struct > xdp_md_info' > to reach 'hash' value in metadata. > Essentially it's a run-time way to access 'hash' instead of build- > time. > So bpf program would need two loads to read csum or hash field > instead of one. > With BTF the layout of metadata is known to the program at build- > time. > > To reiterate the proposal: > - driver+firmware keep layout of the metadata in BTF format (either > in the driver > or driver can read it from firmware) > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query > the driver and > generate normal C header file based on BTF in the given NIC > - user does #include "md_desc.h" and bpf program can access md->csum > or md->hash > with direct single load out of metadata area in front of the packet This piece is where I'd like to discuss more. When we discussed this in Seoul, the initial proposal was a static struct that we'd try to hammer out a common layout between the interested parties. That obviously wasn't going to scale, and we wanted to pursue something more dynamic. But I thought the goal was the XDP/eBPF program wouldn't want to care what the underlying device is, and could just ask for metadata that it's interested in. With this approach, your eBPF program is now bound/tied to the NIC/driver, and if you switched to a differen NIC/driver combo, then you'd have to rewrite part of your eBPF program to comprehend that. I thought we were trying to avoid that. Our proposed approach (still working on something ready to RFC) is to provide a method for the eBPF program to send a struct of requested hints down to the driver on load. If the driver can provide the hints, then that'd be how they'd be laid out in the metadata. If it can't provide them, we'd probably reject the program loading, or discuss providing a software fallback (I know this is an area of contention). I suppose we could get there with the rewriting mechanism described below, but that'd be a tough sell to set a bit of ABI for metadata, then change it to be potentially dynamic at runtime in the future. -PJ
Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
On 9/28/17 2:23 PM, John Fastabend wrote: > [...] > >> I'm pretty sure I misunderstood what you were going after with >> XDP_REDIRECT reserving the headroom. Our use case (patches coming in a >> few weeks) will populate the headroom coming out of the driver to XDP, >> and then once the XDP program extracts whatever hints it wants via >> helpers, I fully expect that area in the headroom to get stomped by >> something else. If we want to send any of that hint data up farther, >> we'll already have it extracted via the helpers, and the eBPF program >> can happily assign it to wherever in the outbound metadata area. > > In case its not obvious with the latest xdp metadata patches the outbound > metadata can then be pushed into skb fields via a tc_cls program if needed. Yes, that was what I was alluding to with "can happily assign it to wherever." The patches we're working on are driver->XDP, then anything else using the latest meta-data patches would be XDP->anywhere else. So I don't think we're going to step on any toes. Thanks John, -PJ
Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
On 9/28/17 12:59 PM, Andy Gospodarek wrote: > On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter > <peter.waskiewicz...@intel.com> wrote: >> On 9/26/17 10:21 AM, Andy Gospodarek wrote: >>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote: >>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote: >>>> [...] >>>>> First, thanks for this detailed description. It was helpful to read >>>>> along with the patches. >>>>> >>>>> My only concern about this area being generic is that you are now in a >>>>> state where any bpf program must know about all the bpf programs in the >>>>> receive pipeline before it can properly parse what is stored in the >>>>> meta-data and add it to an skb (or perform any other action). >>>>> Especially if each program adds it's own meta-data along the way. >>>>> >>>>> Maybe this isn't a big concern based on the number of users of this >>>>> today, but it just starts to seem like a concern as there are these >>>>> hints being passed between layers that are challenging to track due to a >>>>> lack of a standard format for passing data between. >>>> >>>> Btw, we do have similar kind of programmable scratch buffer also today >>>> wrt skb cb[] that you can program from tc side, the perf ring buffer, >>>> which doesn't have any fixed layout for the slots, or a per-cpu map >>>> where you can transfer data between tail calls for example, then tail >>>> calls themselves that need to coordinate, or simply mangling of packets >>>> itself if you will, but more below to your use case ... >>>> >>>>> The main reason I bring this up is that Michael and I had discussed and >>>>> designed a way for drivers to communicate between each other that rx >>>>> resources could be freed after a tx completion on an XDP_REDIRECT >>>>> action. Much like this code, it involved adding an new element to >>>>> struct xdp_md that could point to the important information. Now that >>>>> there is a generic way to handle this, it would seem nice to be able to >>>>> leverage it, but I'm not sure how reliable this meta-data area would be >>>>> without the ability to mark it in some manner. >>>>> >>>>> For additional background, the minimum amount of data needed in the case >>>>> Michael and I were discussing was really 2 words. One to serve as a >>>>> pointer to an rx_ring structure and one to have a counter to the rx >>>>> producer entry. This data could be acessed by the driver processing the >>>>> tx completions and callback to the driver that received the frame off the >>>>> wire >>>>> to perform any needed processing. (For those curious this would also >>>>> require a >>>>> new callback/netdev op to act on this data stored in the XDP buffer.) >>>> >>>> What you describe above doesn't seem to be fitting to the use-case of >>>> this set, meaning the area here is fully programmable out of the BPF >>>> program, the infrastructure you're describing is some sort of means of >>>> communication between drivers for the XDP_REDIRECT, and should be >>>> outside of the control of the BPF program to mangle. >>> >>> OK, I understand that perspective. I think saying this is really meant >>> as a BPF<->BPF communication channel for now is fine. >>> >>>> You could probably reuse the base infra here and make a part of that >>>> inaccessible for the program with some sort of a fixed layout, but I >>>> haven't seen your code yet to be able to fully judge. Intention here >>>> is to allow for programmability within the BPF prog in a generic way, >>>> such that based on the use-case it can be populated in specific ways >>>> and propagated to the skb w/o having to define a fixed layout and >>>> bloat xdp_buff all the way to an skb while still retaining all the >>>> flexibility. >>> >>> Some level of reuse might be proper, but I'd rather it be explicit for >>> my use since it's not exclusively something that will need to be used by >>> a BPF prog, but rather the driver. I'll produce some patches this week >>> for reference. >> >> Sorry for chiming in late, I've been offline. >> >> We're looking to add some functionality from driver to XDP inside
Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
On 9/26/17 10:21 AM, Andy Gospodarek wrote: > On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote: >> On 09/25/2017 08:10 PM, Andy Gospodarek wrote: >> [...] >>> First, thanks for this detailed description. It was helpful to read >>> along with the patches. >>> >>> My only concern about this area being generic is that you are now in a >>> state where any bpf program must know about all the bpf programs in the >>> receive pipeline before it can properly parse what is stored in the >>> meta-data and add it to an skb (or perform any other action). >>> Especially if each program adds it's own meta-data along the way. >>> >>> Maybe this isn't a big concern based on the number of users of this >>> today, but it just starts to seem like a concern as there are these >>> hints being passed between layers that are challenging to track due to a >>> lack of a standard format for passing data between. >> >> Btw, we do have similar kind of programmable scratch buffer also today >> wrt skb cb[] that you can program from tc side, the perf ring buffer, >> which doesn't have any fixed layout for the slots, or a per-cpu map >> where you can transfer data between tail calls for example, then tail >> calls themselves that need to coordinate, or simply mangling of packets >> itself if you will, but more below to your use case ... >> >>> The main reason I bring this up is that Michael and I had discussed and >>> designed a way for drivers to communicate between each other that rx >>> resources could be freed after a tx completion on an XDP_REDIRECT >>> action. Much like this code, it involved adding an new element to >>> struct xdp_md that could point to the important information. Now that >>> there is a generic way to handle this, it would seem nice to be able to >>> leverage it, but I'm not sure how reliable this meta-data area would be >>> without the ability to mark it in some manner. >>> >>> For additional background, the minimum amount of data needed in the case >>> Michael and I were discussing was really 2 words. One to serve as a >>> pointer to an rx_ring structure and one to have a counter to the rx >>> producer entry. This data could be acessed by the driver processing the >>> tx completions and callback to the driver that received the frame off the >>> wire >>> to perform any needed processing. (For those curious this would also >>> require a >>> new callback/netdev op to act on this data stored in the XDP buffer.) >> >> What you describe above doesn't seem to be fitting to the use-case of >> this set, meaning the area here is fully programmable out of the BPF >> program, the infrastructure you're describing is some sort of means of >> communication between drivers for the XDP_REDIRECT, and should be >> outside of the control of the BPF program to mangle. > > OK, I understand that perspective. I think saying this is really meant > as a BPF<->BPF communication channel for now is fine. > >> You could probably reuse the base infra here and make a part of that >> inaccessible for the program with some sort of a fixed layout, but I >> haven't seen your code yet to be able to fully judge. Intention here >> is to allow for programmability within the BPF prog in a generic way, >> such that based on the use-case it can be populated in specific ways >> and propagated to the skb w/o having to define a fixed layout and >> bloat xdp_buff all the way to an skb while still retaining all the >> flexibility. > > Some level of reuse might be proper, but I'd rather it be explicit for > my use since it's not exclusively something that will need to be used by > a BPF prog, but rather the driver. I'll produce some patches this week > for reference. Sorry for chiming in late, I've been offline. We're looking to add some functionality from driver to XDP inside this xdp_buff->data_meta region. We want to assign it to an opaque structure, that would be specific per driver (think of a flex descriptor coming out of the hardware). We'd like to pass these offloaded computations into XDP programs to help accelerate them, such as packet type, where headers are located, etc. It's similar to Jesper's RFC patches back in May when passing through the mlx Rx descriptor to XDP. This is actually what a few of us are planning to present at NetDev 2.2 in November. If you're hoping to restrict this headroom in the xdp_buff for an exclusive use case with XDP_REDIRECT, then I'd like to discuss that further. -PJ
Re: [Intel-wired-lan] [PATCH] igb: check memory allocation failure
On 9/13/17 7:24 PM, Brown, Aaron F wrote: >> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf >> Of Christophe JAILLET >> Sent: Monday, August 28, 2017 10:13 AM >> To: Waskiewicz Jr, Peter <peter.waskiewicz...@intel.com>; Kirsher, Jeffrey T >> <jeffrey.t.kirs...@intel.com> >> Cc: netdev@vger.kernel.org; kernel-janit...@vger.kernel.org; intel-wired- >> l...@lists.osuosl.org; linux-ker...@vger.kernel.org >> Subject: Re: [Intel-wired-lan] [PATCH] igb: check memory allocation failure >> >> Le 28/08/2017 à 01:09, Waskiewicz Jr, Peter a écrit : >>> On 8/27/17 2:42 AM, Christophe JAILLET wrote: >>>> Check memory allocation failures and return -ENOMEM in such cases, as >>>> already done for other memory allocations in this function. >>>> >>>> This avoids NULL pointers dereference. >>>> >>>> Signed-off-by: Christophe JAILLET <christophe.jail...@wanadoo.fr> >>>> --- >>>> drivers/net/ethernet/intel/igb/igb_main.c | 2 ++ >>>> 1 file changed, 2 insertions(+) >>>> > > This seems to be fine from a "it does not break in testing" perspective, so... > > Tested-by: Aaron Brown <aaron.f.br...@intel.com > >>> -PJ >>> >> Hi, >> >> in fact, there is no leak because the only caller of 'igb_sw_init()' >> (i.e. 'igb_probe()'), already frees these resources in case of error, >> see [1] >> >> These resources are also freed in 'igb_remove()'. >> >> Best reagrds, >> CJ >> >> [1]: >> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux- >> next.git/tree/drivers/net/ethernet/intel/igb/igb_main.c#n2775 > > But is PJ's comment saying that it is not really necessary? If so I tend to > lean towards the don't touch it if it's not broken perspective. I guess I didn't respond after Christophe replied, sorry about that. The patch is good to me. It's definitely catching an issue where we're not checking for a memory failure, then just follows the same de-allocation path on unwind. If you want it: Acked-by: PJ Waskiewicz <peter.waskiewicz...@intel.com>
Re: [PATCH] connector: Delete an error message for a failed memory allocation in cn_queue_alloc_callback_entry()
On 8/28/17 2:06 AM, Dan Carpenter wrote: > On Sun, Aug 27, 2017 at 11:16:06PM +0000, Waskiewicz Jr, Peter wrote: >> On 8/27/17 3:26 PM, SF Markus Elfring wrote: >>> From: Markus Elfring <elfr...@users.sourceforge.net> >>> Date: Sun, 27 Aug 2017 21:18:37 +0200 >>> >>> Omit an extra message for a memory allocation failure in this function. >>> >>> This issue was detected by using the Coccinelle software. >> >> Did coccinelle trip on the message or the fact you weren't returning NULL? >> > > You've misread the patch somehow. The existing code has a NULL return > and it's preserved in Markus's patch. This sort of patch is to fix a > checkpatch.pl warning. The error message from this kzalloc() isn't going > to get printed because it's a small allocation and small allocations > always succeed in current kernels. But probably the main reason > checkpatch complains is that kmalloc() already prints a stack trace and > a bunch of other information so the printk doesn't add anyting. > Removing it saves a little memory. > > I'm mostly a fan of running checkpatch on new patches or staging and not > on old code... And this is what I get for reading the patch with a crappy mailer...thanks Doubtlook. Sorry for the noise.
Re: Get ARP/ND tables from kernel
On 8/27/17 9:25 PM, Bassam Alsanie wrote: > Hello everyone, > I looking into a good way (stable and compatible with large number of > distros) to get the arp/nd cache from kernel to user space, for both > IP4 and IP6. > > It seem IOCTL (SIOCGARP) can't do that, you can only get MAC address > from provided IP address. But IOCTL can't give the the full arp/nd > table. > The other option is the Netlink interface. I tried it and I got the > ARP/ND table :). > The third option is using /proc/net/arp, which only restricted to IP4. > > There is command line utilities that I excluding in my case. > > Is there another way to do it? what is the best way in my case? > > Thank you all. # strace arp -an [...] open("/proc/net/arp", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 read(4, "IP address HW type Fla"..., 1024) = 310 [...] # strace ip -6 neighbor show [...] socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 3 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0 bind(3, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=}, 12) = 0 getsockname(3, {sa_family=AF_NETLINK, nl_pid=30292, nl_groups=}, [12]) = 0 sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=1503888680, pid=0}, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\35\0\1\0\0\0"}, 40, 0, NULL, 0) = 40 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=}, msg_namelen=12, msg_iov=[{iov_base=[{{len=1268, type=RTM_NEWLINK, flags=NLM_F_MULTI, seq=1503888680, pid=30292}, "\0\0\4\3\1\0\0\0I\0\1\0\0\0\0\0\7\0\3\0lo\0\0\10\0\r\0\350\3\0\0"...}, {{len=1280, type=RTM_NEWLINK, flags=NLM_F_MULTI, seq=1503888680, pid=30292}, "\0\0\1\0\2\0\0\0C\20\1\0\0\0\0\0\t\0\3\0eno1\0\0\0\0\10\0\r\0"...}], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2548 [...] Seems like it's pretty obvious if you don't want to use the existing tools, just look at how the existing tools get this data. IPv4 uses /proc/net/arp, IPv6 uses netlink. Cheers, -PJ
Re: [PATCH] connector: Delete an error message for a failed memory allocation in cn_queue_alloc_callback_entry()
On 8/27/17 3:26 PM, SF Markus Elfring wrote: > From: Markus Elfring> Date: Sun, 27 Aug 2017 21:18:37 +0200 > > Omit an extra message for a memory allocation failure in this function. > > This issue was detected by using the Coccinelle software. Did coccinelle trip on the message or the fact you weren't returning NULL? > > Signed-off-by: Markus Elfring > --- > drivers/connector/cn_queue.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c > index 1f8bf054d11c..e4f31d679f02 100644 > --- a/drivers/connector/cn_queue.c > +++ b/drivers/connector/cn_queue.c > @@ -40,10 +40,8 @@ cn_queue_alloc_callback_entry(struct cn_queue_dev *dev, > const char *name, > struct cn_callback_entry *cbq; > > cbq = kzalloc(sizeof(*cbq), GFP_KERNEL); > - if (!cbq) { > - pr_err("Failed to create new callback queue.\n"); > + if (!cbq) > return NULL; > - } Wny not: if (!cbq) { pr_err("Failed to create new callback queue.\n"); + return NULL; } > > atomic_set(>refcnt, 1); > >
Re: [PATCH] igb: check memory allocation failure
On 8/27/17 2:42 AM, Christophe JAILLET wrote: > Check memory allocation failures and return -ENOMEM in such cases, as > already done for other memory allocations in this function. > > This avoids NULL pointers dereference. > > Signed-off-by: Christophe JAILLET> --- > drivers/net/ethernet/intel/igb/igb_main.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c > b/drivers/net/ethernet/intel/igb/igb_main.c > index fd4a46b03cc8..837d9b46a390 100644 > --- a/drivers/net/ethernet/intel/igb/igb_main.c > +++ b/drivers/net/ethernet/intel/igb/igb_main.c > @@ -3162,6 +3162,8 @@ static int igb_sw_init(struct igb_adapter *adapter) > /* Setup and initialize a copy of the hw vlan table array */ > adapter->shadow_vfta = kcalloc(E1000_VLAN_FILTER_TBL_SIZE, sizeof(u32), > GFP_ATOMIC); > + if (!adapter->shadow_vfta) > + return -ENOMEM; Looks reasonable to me. A larger issue though I see in this function is that if we return -ENOMEM here, and if we return -ENOMEM from igb_init_interrupt_scheme() below on failure, we leak adapter->mac_table (and adapter->shadow_vfta in the latter). We should add a proper unwind to free up the memory on failure. -PJ
Re: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
On 8/25/17 11:25 AM, Jacob Keller wrote: > Under some circumstances, such as with many stacked devices, it is > possible that dev_hard_start_xmit will bundle many packets together, and > mark them all with xmit_more. > > Most drivers respond to xmit_more by skipping tail bumps on packet > rings, or similar behavior as long as xmit_more is set. This is > a performance win since it means drivers can avoid notifying hardware of > new packets repeat daily, and thus avoid wasting unnecessary PCIe or other > bandwidth. > > This use of xmit_more comes with a trade off because bundling too many > packets can increase latency of the Tx packets. To avoid this, we should > limit the maximum number of packets with xmit_more. > > Driver authors could modify their drivers to check for some determined > limit, but this requires all drivers to be modified in order to gain > advantage. > > Instead, add a sysctl "xmit_more_max" which can be used to configure the > maximum number of xmit_more skbs to send in a sequence. This ensures > that all drivers benefit, and allows system administrators the option to > tune the value to their environment. > > Signed-off-by: Jacob Keller> --- > > Stray thoughts and further questions > > Is this the right approach? Did I miss any other places where we should > limit? Does the limit make sense? Should it instead be a per-device > tuning nob instead of a global? Is 32 a good default? I actually like the idea of a per-device knob. A xmit_more_max that's global in a system with 1GbE devices along with a 25/50GbE or more just doesn't make much sense to me. Or having heterogeneous vendor devices in the same system that have different HW behaviors could mask issues with latency. This seems like another incarnation of possible buffer-bloat if the max is too high... > > Documentation/sysctl/net.txt | 6 ++ > include/linux/netdevice.h| 2 ++ > net/core/dev.c | 10 +- > net/core/sysctl_net_core.c | 7 +++ > 4 files changed, 24 insertions(+), 1 deletion(-) > > diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt > index b67044a2575f..3d995e8f4448 100644 > --- a/Documentation/sysctl/net.txt > +++ b/Documentation/sysctl/net.txt > @@ -230,6 +230,12 @@ netdev_max_backlog > Maximum number of packets, queued on the INPUT side, when the > interface > receives packets faster than kernel can process them. > > +xmit_more_max > +- > + > +Maximum number of packets in a row to mark with skb->xmit_more. A value of > zero > +indicates no limit. What defines "packet?" MTU-sized packets, or payloads coming down from the stack (e.g. TSO's)? -PJ
Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
On 8/25/17 10:59 AM, Jesper Dangaard Brouer wrote: > On Fri, 25 Aug 2017 14:24:28 + > "Waskiewicz Jr, Peter" <peter.waskiewicz...@intel.com> wrote: > >> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote: >>>> >>>> Tested with Intel XL710 NIC with Cisco 3172 switch. >>>> >>>> It would be even slightly better if the irqbalance service is turned >>>> off outside. >>> >>> Yes, if you don't turn-off (kill) irqbalance it will move around the >>> IRQs behind your back... >> >> Or you can use the --banirq option to irqbalance to ignore your device's >> interrupts as targets for balancing. > > It might be worth mentioning that --banirq=X is specified for each IRQ > that you want to exclude, and --banirq is simply specified multiple > times on the command line. > > Is it possible to tell a running irqbalance that I want to excluded an > extra IRQ? (just before I do my manual adjustment). It isn't possible today, since we don't have a way to attach a foreground/oneshot irqbalance run to a currently-running daemon. That's an interesting feature enhancement...I can add it to our list as a feature request so I don't forget about it. That way I can also get Neil's thoughts on this. -PJ
Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote: >> >> Tested with Intel XL710 NIC with Cisco 3172 switch. >> >> It would be even slightly better if the irqbalance service is turned >> off outside. > > Yes, if you don't turn-off (kill) irqbalance it will move around the > IRQs behind your back... Or you can use the --banirq option to irqbalance to ignore your device's interrupts as targets for balancing. Cheers, -PJ
RE: [PATCH] [IPROUTE2] Update various classifiers' help output forexpected CLASSID syntax
+ fprintf(stderr, \nNOTE: CLASSID is parsed as hexidecimal +input.\n); s/hexidecimal/hexadecimal/g (?) Gah! Thanks Jarek, I can correct and resend. -PJ Waskiewicz -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] [TC U32] Fix input parsing to support more than 9 flow id'scorrectly
Sorry, can't change the api, update the documentation instead. Yes, this is much more reasonable. I'll send a patch shortly to do that. Thanks -PJ Waskiewicz -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] [TC U32] Fix input parsing to support more than 9 flow id'scorrectly
From: PJ Waskiewicz [EMAIL PROTECTED] Using strtoul with a base of 16 converts flowid 10 into 0x10, which makes it flowid 16. This is interpreted by the kernel incorrectly, and causes traffic flows above 9 to be classified into band 0 on multiband qdiscs. This changes the base to 10, which will correctly parse input into the proper hexidecimal value. Stephen, We can go one of two ways I suppose. Once is this way, since most user input for CLASSID is base 10, or we can update documentation to say that CLASSID input is expected to be base 16. What do you think? Thanks, -PJ Waskiewicz [EMAIL PROTECTED] Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED] --- tc/tc_util.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/tc/tc_util.c b/tc/tc_util.c index cdbae42..a277eac 100644 --- a/tc/tc_util.c +++ b/tc/tc_util.c @@ -65,7 +65,7 @@ int get_tc_classid(__u32 *h, const char *str) maj = TC_H_UNSPEC; if (strcmp(str, none) == 0) goto ok; - maj = strtoul(str, p, 16); + maj = strtoul(str, p, 10); if (p == str) { maj = 0; if (*p != ':') @@ -76,7 +76,7 @@ int get_tc_classid(__u32 *h, const char *str) return -1; maj = 16; str = p+1; - min = strtoul(str, p, 16); + min = strtoul(str, p, 10); if (*p != 0) return -1; if (min = (116)) -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Disable TSO for non standard qdiscs
...But, on the other hand, in this case the realization seems to be wrong: probably still all locally created packets will be treated the same - or I miss something? Jarek P. The TCP layer will generate TSO packets based on the kernel socket features associated with the flow. So if you have two devices, one supporting TSO, the other not, then the flows associated with the non-TSO device will not have their packets built for TSO. This has no bearing on the device supporting TSO, which its feature flags will propogate into the kernel socket for that flow, and cause any TCP flows to that device to be TSO packets. So in a nutshell, disabling TSO is on a per-device level, not a global switch. -PJ Waskiewicz -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Disable TSO for non standard qdiscs
Indeed. As an example of an unknowing user, this discussion made me check whether my cablemodem device (on which I'm using HFSC) uses TSO :) The TSO defer logic is based on your congestion window and current window size. So the actual frame sizes hitting your NIC attached to your DSL probably aren't anywhere near 64KB, but probably more in line with whatever your window size is for DSL. The bottom line is TSO saves CPU cycles. If we want to make it go away because of a traffic shaping qdisc interfering, then that's fine. I just don't think a TSO option should be added to the scheduler layer, since it already exists in the ethtool layer. Asking a user to type 'ethtool -k devicename tso off' is probably going to be much easier than setting an option on your qdisc through tc to turn TSO back on. I think we're having more of a disagreement of what is considered the normal case user. If you are on a slow link, such as a DSL/cable line, your TCP window/congestion window aren't going to be big enough to generate large TSO's, so what is the issue? But disabling TSO, say on a 10 GbE link, can cut throughput by half (I have data on 8-core machines with 10 GbE with/without TSO if you're interested). Even on a single-core machine with a 1GbE link can have bad performance hits. So this is why I'm so concerned about a proposal to turn off TSO outside of the current established methods of using ethtool. Rather than educating the user about how to turn TSO back on using tc if they want it, educate them why they may want to consider turning TSO off in certain configurations. And I don't consider any user effectively using a TBF qdisc someone incapable of understanding how to use ethtool. Cheers, -PJ Waskiewicz -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Disable TSO for non standard qdiscs
Right - Essentially it is a usability issue: People who know how to use TSO (Peter for example) will be clueful enough to turn it on. Which means the default should be to protect the clueless and turn it off. On Andis approach: Turning TSO off at netdev registration time with a warning will be a cleaner IMO. Or alternatively introducing a kernel-config I know what TSO is option which is then used at netdev registration. From a usability perspective it would make more sense to just keep ethtool as the only way to configure TSO. To me this sounds like the most reasonable approach. I've put out my concerns, so I'll get out of the way now so we can move forward in some direction. :-) Cheers, -PJ Waskiewicz -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Lots of BUG eth1 code -5 qlen 0 messages in 2.6.24
I've changed the -EIO into NETDEV_TX_BUSY and so far I can't trigger the bug anymore. It was quite easy to trigger within minutes with rsync, but I can't trigger it anymore. Should I send a patch, and if so: to who? The tulip/xircom_cb driver seems to be orphaned. Perhaps Jeff Garzik is a good place to start. He maintains the driver tree, he either can take the patch or direct you to where it should go. No need to test that, it *is* netif_stopped before the return: /* Uh oh... no free descriptor... drop the packet */ netif_stop_queue(dev); spin_unlock_irqrestore(card-lock,flags); trigger_transmit(card); return NETDEV_TX_BUSY; trigger_transmit() is a simple function that just writes a single register on the card to trigger a transmit. Eek! No need to make this smarter at this point, but this is pretty basic. Cheers, -PJ Waskiewicz -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Disable TSO for non standard qdiscs
I totally disagree with these POVs: - 10G cards should be treated by default as 10G cards - not DSL modems, and common users shouldn't have to read any warnings or configs to see this. - tc with TBF or HTB are professional tools; there should be added some warnings to manuals. But trying to change the way they work because we think we know better what users want, and changing BTW some other things (making debugging this later a hell), is simply disrespectful for target users of these tools. There are some wrappers or creators invented for this. And, BTW, I think I've seen somewhere a system which does this this other way - with creators for professionals. So, you could be right with this too... Ok, maybe I'm not done quite yet. Jarek is echo'ing my original point, changing the behavior of the tool automatically (these qdiscs in question) is not good for a normal end user. It might be fine for kernel developers, but not users of these tools, IMO. A less disruptive approach, such as a warning message printed when loading the qdisc if TSO is enabled, and documenting recommended usage, I think is more prudent here. Cheers, -PJ Waskiewicz -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Disable TSO for non standard qdiscs
The philosophical problem I have with this suggestion is that I expect that the large majority of users will be more happy with disabled TSO if they use non standard qdiscs and defaults that do not fit the majority use case are bad. Basically you're suggesting that nearly everyone using tc should learn about another obscure command. If someone is using tc to load and configure a qdisc, I'd really hope knowing or learning ethtool wouldn't be a stretch for them... And I'm not arguing the majority of people will want this or not, but taking away the ability to use TSO at the kernel level here without allowing a tuneable is making that decision for them, which is wrong IMO. Cheers, -PJ Waskiewicz [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Disable TSO for non standard qdiscs
Well, it could be just that when using such qdiscs TSO would be disabled, but the user could override this by using ethtool after loading the qdiscs. I still disagree with this. The qdisc should not cause anything to happen to feature flags on the device. It's the scheduling layer and really shouldn't care about what features the device supports or not. If someone has an issue with a feature hurting performance or causing odd behavior when using a qdisc, then they should disable the feature on the device using the appropriate tools provided. If it's the qdisc causing issues, then either the qdisc needs to be fixed, or it should be documented what features are recommended to be on and off with the qdisc. I don't agree that the scheduling layer should affect features on an underlying device. Cheers, -PJ Waskiewicz -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Lots of BUG eth1 code -5 qlen 0 messages in 2.6.24
Are you using any specific qdisc, or just the default pfifo_fast? Have you done any specific tuning on your qdisc as well? The default qlen seems to have been changed. The driver seems buggy. Make it return NETDEV_TX_BUSY instead of -EIO in xircom_start_xmit() and the messages will go away. Totally agree. However, the driver is still getting entries when it shouldn't (netif_queue_stopped() is true), hence it's returning a non-OK value. We can either re-add the check for netif_queue_stopped() to qdisc_restart(), or update the drivers to use the newer API. I'd rather do the latter, which I can work on. Thanks Jamal! -PJ -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Lots of BUG eth1 code -5 qlen 0 messages in 2.6.24
I've just started to use 2.6.24 on my home firewall (before it was running 2.6.24-rc2 for about 65 days) and I noticed a couple of error messages I've never seen before: Jan 29 07:50:54 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 08:28:30 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 08:57:30 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 09:44:04 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 10:01:35 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 10:01:35 gateway last message repeated 2 times Jan 29 10:16:48 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 10:16:48 gateway last message repeated 2 times Jan 29 10:45:48 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 10:45:48 gateway last message repeated 2 times Jan 29 11:10:01 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 11:10:02 gateway last message repeated 9 times The message seems to be coming from the qdisc_restart() in net/sched/sch_generic.c which was changed with commit 5f1a485d5905aa641f33009019b369907a4c . The NIC is an IBM EtherJet cardbus card using the xircom_cb driver: Are you using any specific qdisc, or just the default pfifo_fast? Have you done any specific tuning on your qdisc as well? The default qlen seems to have been changed. Basically your queue is being overrun, and with the current checks in the kernel in the stack, it's allowing the skb into the driver. I've known about this issue, and I'm hesitant to push a patch to re-add the netif_queue_stopped() check into qdisc_restart(). I'd rather push a one-time patch to the drivers that interacts with netif_stop_subqueue(netdev, 0), so we can completely decouple the single queue from the netdev. I'd say you can somewhat ignore the messages for now. But there is work to be done here, and it's obvious I need to do this sooner than later. Please let me know about the qdisc parameters though when you get a chance. Thanks, -PJ Waskiewicz [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Lots of BUG eth1 code -5 qlen 0 messages in 2.6.24
Peter, I suspect that driver is just buggy in some other way as opposed to being re-entered; couldnt tell by inspection. It is possible it may be too eager to open up before it really has space. It will be easy to check your theory by having the driver just check if it is netif_stopped just before it returns NETDEV_TX_BUSY. Ahh, very good point. Thanks, -PJ -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] PATCH 1/2 [SCHED 2.6.24]: Check subqueue status before calling hard_start_xmit
You could optimize this by getting HARD_TX_LOCK after the check. I assume that netif_stop_subqueue (from another CPU) would always be called by the driver xmit, and that is not possible since we hold the __LINK_STATE_QDISC_RUNNING bit. Does that sound correct? Sorry for not responding sooner; Dave hit it on the head though with his response. I agree with your changes, and I'll incorporate them in the lockless stack patches I've been working on (in the software queuing mode). Now if I could just find some time to finish them up and get them out for review... Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] [AF_PACKET]: Allow multicast traffic to be caught by ORIGDEV when bonded
The socket option for packet sockets to return the original ifindex instead of the bonded ifindex will not match multicast traffic. Since this socket option is the most useful for layer 2 traffic and multicast traffic, make the option multicast-aware. Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED] I agree with you in principle, but I'd like to hear some feedback from other folks. In particular I'd like a discussion about what this might break, if anything. That's reasonable. In any event, the only thing this could affect is if the option is set on the socket, which shouldn't be very often at all. I'm more than open to feedback on this change. Thanks Dave, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] [AF_PACKET]: Allow multicast traffic to be caught by ORIGDEV when bonded
-Original Message- From: David Miller [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 07, 2007 3:52 AM To: Waskiewicz Jr, Peter P Cc: netdev@vger.kernel.org Subject: Re: [PATCH] [AF_PACKET]: Allow multicast traffic to be caught by ORIGDEV when bonded From: Waskiewicz Jr, Peter P [EMAIL PROTECTED] Date: Wed, 7 Nov 2007 03:00:42 -0800 In any event, the only thing this could affect is if the option is set on the socket, which shouldn't be very often at all. Any idea how many programs set this option and which ones? You obviously noticed, so perhaps you know at least one or was this discovered purely by code inspection? We have an application that communicates with a switch and a NIC that helps some additional configuration for the network (think of it as an extended link negotiation). It uses LLDP, which is layer 2 and is multicast. We noticed that when testing bonding, it didn't work. I'm not sure who else might be using this code right now, since it went into the mainstream kernel in 2.6.22. Thanks, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Question on TSO maximum segment sizes.
I'm having an issue with TSO right vs. hardware that can't take the maximum segment size sent from the stack. I've been told that the maximum packet size that can be sent to the hardware today is 64k, but my hardware can only take 32k in certain modes per queue due to hardware limitations. I have two questions regarding this: 1) where is this value set in the TCP code, and 2) Is this something that can be configured on the fly? If the answer to 2 is no, I will try and put something together to allow this to happen. Thanks in advance, PJ Waskiewicz Intel Corporation [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Question on TSO maximum segment sizes.
From: Rick Jones [EMAIL PROTECTED] Date: Thu, 11 Oct 2007 16:50:46 -0700 For just messing about, might it be possible to tweak the socket buffer sizes and tcp_tso_win_divisor to kludge things for a short while? Couldn't ship that way certainly, but assuming Peter's going to get his broken hardware fixed it might let him limp along until then. TCP dynamically grows the socket buffer sizes unless the application explicitly sets them via setsockopt() and the limits imposed in those cases are controlled by tcp_{,r,w}mem[] sysctls. Decreasing those will kill performance exactly for the cases this person cares about. No, the only way to deal with this is to GSO segment incoming frames inside of the driver when they exceed the HW limits. Thanks Dave and Rick. I'll mess with this and make my driver happy again. Cheers, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
-Original Message- From: Andi Kleen [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 10, 2007 9:02 AM To: Waskiewicz Jr, Peter P Cc: David Miller; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching We've done similar testing with ixgbe to push maximum descriptor counts, and we lost performance very quickly in the same range you're quoting on NIU. Did you try it with WC writes to the ring or CLFLUSH? -Andi Hmm, I think it might be slightly different, but it still shows queue depth vs. performance. I was actually referring to how many descriptors we can represent a packet with before it becomes a problem wrt performance. This morning I tried to actually push my ixgbe NIC hard enough to come close to filling the ring with packets (384-byte packets), and even on my 8-core Xeon I can't do it. My system can't generate enough I/O to fill the hardware queues before CPUs max out. -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3][NET_BATCH] net core use batching
IMO the net driver really should provide a hint as to what it wants. 8139cp and tg3 would probably prefer multiple TX queue behavior to match silicon behavior -- strict prio. If I understand what you just said, I disagree. If your hardware is running strict prio, you don't want to enforce strict prio in the qdisc layer; performing two layers of QoS is excessive, and may lead to results you don't want. The reason I added the DRR qdisc is for the Si that has its own queueing strategy that is not RR. For Si that implements RR (like e1000), you can either use the DRR qdisc, or if you want to prioritize your flows, use PRIO. -PJ Waskiewicz [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3][NET_BATCH] net core use batching
A misunderstanding, I think. To my brain, DaveM's item #2 seemed to assume/require the NIC hardware to balance fairly across hw TX rings, which seemed to preclude the 8139cp/tg3 style of strict-prio hardware. That's what I was responding to. As long as there is some modular way to fit 8139cp/tg3 style multi-TX into our universe, I'm happy :) Ah hah. Yes, a misunderstanding on my part. Thanks for the clarification. Methinks more caffeine is required for today... -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3][NET_BATCH] net core use batching
-Original Message- From: J Hadi Salim [mailto:[EMAIL PROTECTED] On Behalf Of jamal Sent: Monday, October 08, 2007 11:27 AM To: David Miller Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; Waskiewicz Jr, Peter P; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: [PATCH 2/3][NET_BATCH] net core use batching This patch adds the usage of batching within the core. cheers, jamal Hey Jamal, I still have concerns how this will work with Tx multiqueue. The way the batching code looks right now, you will probably send a batch of skb's from multiple bands from PRIO or RR to the driver. For non-Tx multiqueue drivers, this is fine. For Tx multiqueue drivers, this isn't fine, since the Tx ring is selected by the value of skb-queue_mapping (set by the qdisc on {prio|rr}_classify()). If the whole batch comes in with different queue_mappings, this could prove to be an interesting issue. Now I see in the driver HOWTO you recently sent that the driver will be expected to loop over the list and call it's -hard_start_xmit() for each skb. I think that should be fine for multiqueue, I just wanted to see if you had any thoughts on how it should work, any performance issues you can see (I can't think of any). Since the batching feature and Tx multiqueue are very new features, I'd like to make sure we can think of any possible issues with them coexisting before they are both mainline. Looking ahead for multiqueue, I'm still working on the per-queue lock implementation for multiqueue, which I know will not work with batching as it's designed today. I'm still not sure how to handle this, because it really would require the batch you send to have the same queue_mapping in each skb, so you're grabbing the correct queue_lock. Or, we could have the core grab all the queue locks for each skb-queue_mapping represented in the batch. That would block another batch though if it had any of those queues in it's next batch before the first one completed. Thoughts? Thanks Jamal, -PJ Waskiewicz [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3][NET_BATCH] net core use batching
true, that needs some resolution. Heres a hand-waving thought: Assuming all packets of a specific map end up in the same qdiscn queue, it seems feasible to ask the qdisc scheduler to give us enough packages (ive seen people use that terms to refer to packets) for each hardware ring's available space. With the patches i posted, i do that via dev-xmit_win that assumes only one view of the driver; essentially a single ring. If that is doable, then it is up to the driver to say i have space for 5 in ring[0], 10 in ring[1] 0 in ring[2] based on what scheduling scheme the driver implements - the dev-blist can stay the same. Its a handwave, so there may be issues there and there could be better ways to handle this. Note: The other issue that needs resolving that i raised earlier was in regards to multiqueue running on multiple cpus servicing different rings concurently. I can see the qdisc being modified to send batches per queue_mapping. This shouldn't be too difficult, and if we had the xmit_win per queue (in the subqueue struct like Dave pointed out). Addressing your note/issue with different rings being services concurrently: I'd like to remove the QDISC_RUNNING bit from the global device; with Tx multiqueue, this bit should be set on each queue (if at all), allowing multiple Tx rings to be loaded simultaneously. The biggest issue today with the multiqueue implementation is the global queue_lock. I see it being a hot source of contention in my testing; my setup is a 8-core machine (dual quad-core procs) with a 10GbE NIC, using 8 Tx and 8 Rx queues. On transmit, when loading all 8 queues, the enqueue/dequeue are hitting that lock quite a bit for the whole device. I really think that the queue_lock should join the queue_state, so the device no longer manages the top-level state (since we're operating per-queue instead of per-device). It's the core that does that, not the driver; the driver continues to use -hard_start_xmit() (albeit modified one). The idea is not to have many new interfaces. I'll look closer at this, since I think I confused myself. Isnt multiqueue mainline already? Well, it's in 2.6.23-rc*. I imagine it won't see much action though until 2.6.24, since people will be porting drivers during that time. Plus having the native Rx multiqueue w/NAPI code in 2.6.24 makes sense to have Tx multiqueue at that time. The point behind batching is to reduce the cost of the locks by amortizing across the locks. Even better if one can, they should get rid of locks. Remind me, why do you need the per-queuemap lock? And is it needed from the enqueuing side too? Maybe lets start there to help me understand things? The multiqueue implementation today enforces the number of qdisc bands (RR or PRIO) to be equal to the number of Tx rings your hardware/driver is supporting. Therefore, the queue_lock and queue_state in the kernel directly relate to the qdisc band management. If the queue stops from the driver, then the qdisc won't try to dequeue from the band. What I'm working on is to move the lock there too, so I can lock the queue when I enqueue (protect the band from multiple sources modifying the skb chain), and lock it when I dequeue. This is purely for concurrency of adding/popping skb's from the qdisc queues. Right now, we take the whole global lock to add and remove skb's. This is the next logical step for separating the queue dependancy on each other. Please let me know if this doesn't make sense, or if you have any questions at all about my reasoning. I agree that this is where we should be on the same page before moving onto anything else in this discussion. :) Sure, that is doable if the driver can set a per queue_mapping xmit_win and the qdisc can be taught to say give me packets for queue_mapping X Yes, I like this idea very much. Do that, modify the qdisc to send in chunks from a queue, and the problem should be solved. I will try and find some additional cycles to get my patches completely working, and send them. It'd be easier I think to see what's going on if I did that. I'll also try to make them work with the ideas of xmit_win per queue and batched queue qdisc sends. Stay tuned... Thanks Jamal, -PJ Waskiewicz [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: parallel networking
Multiply whatever effect you think you might be able to measure due to that on your 2 or 4 way system, and multiple it up to 64 cpus or so for machines I am using. This is where machines are going, and is going to become the norm. That along with speeds going to 10 GbE with multiple Tx/Rx queues (with 40 and 100 GbE under discussion now), where multiple CPU's hitting the driver are needed to push line rate without cratering the entire machine. -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3][NET_BATCH] net core use batching
If net_device_subqueue is visible from both driver and core scheduler area (couldnt tell from looking at whats in there already), then that'll do it. Yes, I use the net_device_subqueue structs (the state variable in there) in the prio and rr qdiscs right now. It's an indexed list at the very end of struct netdevice. -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
I have submitted this before; but here it is again. Against net-2.6.24 from yesterday for this and all following patches. cheers, jamal Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? I apologize for not having working patches right now, but I am working on them slowly as I have some blips of spare time. Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
The one thing that seems obvious is to use dev-hard_prep_xmit() in the patches i posted to select the xmit ring. You should be able to do figure out the txmit ring without holding any lock. I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. I lost track of how/where things went since the last discussion; so i need to wrap my mind around it to make sensisble suggestions - I know the core patches are in the kernel but havent paid attention to details and if you look at my second patch youd see a comment in dev_batch_xmit() which says i need to scrutinize multiqueue more. No worries. I'll try to get things together on my end and provide some patches to add a per-queue lock. In the meantime, I'll take a much closer look at the batching code, since I've stopped looking at the patches in-depth about a month ago. :-( Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote: I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software queue for the qdisc, so it'd be quick. The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: We should make sure we're symmetric with the locking on enqueue to dequeue. If we use the single device queue lock on enqueue, then dequeue will also need to check that lock in addition to the individual queue lock. The details of this are more trivial than the actual dequeue to make it efficient though. Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of txlocks per, one per ring? How is the policy to define the qdisc queues locked/mapped to tx rings? The dequeue locking would be pushed into the qdisc itself. This is how I had it originally, and it did make the code more complex, but it was successful at breaking the heavily-contended queue_lock apart. I have a subqueue structure right now in netdev, which only has queue_state (for netif_{start|stop}_subqueue). This state is checked in sch_prio right now in the dequeue for both prio and rr. My approach is to add a queue_lock in that struct, so each queue allocated by the driver would have a lock per queue. Then in dequeue, that lock would be taken when the skb is about to be dequeued. The skb-queue_mapping field also maps directly to the queue index itself, so it can be unlocked easily outside of the context of the dequeue function. The policy would be to use a spin_trylock() in dequeue, so that dequeue can still do work if enqueue or another dequeue is busy. And the allocation of qdisc queues to device queues is assumed to be one-to-one (that's how the qdisc behaves now). I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock
I really just need to put my nose to the grindstone and get the patches together and to the list...stay tuned. Thanks, -PJ Waskiewicz - Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. That seems quite reasonable. I will certainly see what I can do. Thanks Stephen, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2] Move the definition of pr_err() into kernel.h
Other pr_*() macros are already defined in kernel.h, but pr_err() was defined multiple times in several other places Signed-off-by: Emil Medve [EMAIL PROTECTED] --- I'm writing a driver and I've been using the pr_*() macros from kernel.h and I was surprised not to find there pr_err() but defined multiple times (in four different files). I didn't want to define it yet one more time so I did this cleanup This patch is against Linus' tree v2.6.23-rc6 (0d4cbb5e7f60b2f1a4d8b7f6ea4cc264262c7a01) linux-2.6 scripts/checkpatch.pl 0001-Move-the-definition-of-pr_err-into-kernel.h.patch Your patch has no obvious style problems and is ready for submission. Almost. :-) drivers/i2c/chips/menelaus.c | 10 -- drivers/net/spider_net.h |3 --- drivers/video/omap/lcd_h3.c |6 ++ drivers/video/omap/lcd_inn1610.c |6 ++ include/linux/kernel.h |2 ++ 5 files changed, 10 insertions(+), 17 deletions(-) diff --git a/drivers/i2c/chips/menelaus.c b/drivers/i2c/chips/menelaus.c index d9c92c5..66436ba 100644 --- a/drivers/i2c/chips/menelaus.c +++ b/drivers/i2c/chips/menelaus.c @@ -49,8 +49,6 @@ #define DRIVER_NAME menelaus -#define pr_err(fmt, arg...) printk(KERN_ERR DRIVER_NAME : , ## arg); - Unnecessary whitespace removal. #define MENELAUS_I2C_ADDRESS 0x72 #define MENELAUS_REV 0x01 @@ -155,7 +153,7 @@ static int menelaus_write_reg(int reg, u8 value) int val = i2c_smbus_write_byte_data(the_menelaus-client, reg, value); if (val 0) { - pr_err(write error); + pr_err(DRIVER_NAME : write error); return val; } @@ -167,7 +165,7 @@ static int menelaus_read_reg(int reg) int val = i2c_smbus_read_byte_data(the_menelaus-client, reg); if (val 0) - pr_err(read error); + pr_err(DRIVER_NAME : read error); return val; } @@ -1177,7 +1175,7 @@ static int menelaus_probe(struct i2c_client *client) /* If a true probe check the device */ rev = menelaus_read_reg(MENELAUS_REV); if (rev 0) { - pr_err(device not found); + pr_err(DRIVER_NAME : device not found); err = -ENODEV; goto fail1; } @@ -1258,7 +1256,7 @@ static int __init menelaus_init(void) res = i2c_add_driver(menelaus_i2c_driver); if (res 0) { - pr_err(driver registration failed\n); + pr_err(DRIVER_NAME : driver registration failed\n); return res; } diff --git a/drivers/net/spider_net.h b/drivers/net/spider_net.h index dbbdb8c..c67b11d 100644 --- a/drivers/net/spider_net.h +++ b/drivers/net/spider_net.h @@ -493,7 +493,4 @@ struct spider_net_card { struct spider_net_descr darray[0]; }; -#define pr_err(fmt,arg...) \ - printk(KERN_ERR fmt ,##arg) - Unnecessary whitespace removal. #endif diff --git a/drivers/video/omap/lcd_h3.c b/drivers/video/omap/lcd_h3.c index 51807b4..c604d93 100644 --- a/drivers/video/omap/lcd_h3.c +++ b/drivers/video/omap/lcd_h3.c @@ -28,8 +28,6 @@ #define MODULE_NAME omapfb-lcd_h3 -#define pr_err(fmt, args...) printk(KERN_ERR MODULE_NAME : fmt, ## args) - Unnecessary whitespace removal. static int h3_panel_init(struct lcd_panel *panel, struct omapfb_device *fbdev) { return 0; @@ -48,7 +46,7 @@ static int h3_panel_enable(struct lcd_panel *panel) if (!r) r = tps65010_set_gpio_out_value(GPIO2, HIGH); if (r) - pr_err(Unable to turn on LCD panel\n); + pr_err(MODULE_NAME : Unable to turn on LCD panel\n); return r; } @@ -62,7 +60,7 @@ static void h3_panel_disable(struct lcd_panel *panel) if (!r) tps65010_set_gpio_out_value(GPIO2, LOW); if (r) - pr_err(Unable to turn off LCD panel\n); + pr_err(MODULE_NAME : Unable to turn off LCD panel\n); } static unsigned long h3_panel_get_caps(struct lcd_panel *panel) diff --git a/drivers/video/omap/lcd_inn1610.c b/drivers/video/omap/lcd_inn1610.c index 95604ca..5ef119c 100644 --- a/drivers/video/omap/lcd_inn1610.c +++ b/drivers/video/omap/lcd_inn1610.c @@ -27,20 +27,18 @@ #define MODULE_NAME omapfb-lcd_h3 -#define pr_err(fmt, args...) printk(KERN_ERR MODULE_NAME : fmt, ## args) - Unnecessary whitespace removal. static int innovator1610_panel_init(struct lcd_panel *panel, struct omapfb_device *fbdev) { int r = 0; if (omap_request_gpio(14)) { - pr_err(can't request GPIO 14\n); + pr_err(MODULE_NAME : can't request GPIO 14\n); r = -1; goto exit; } if (omap_request_gpio(15)) { - pr_err(can't request GPIO 15\n); + pr_err(MODULE_NAME : can't
RE: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Wed, 29 Aug 2007 10:43:23 +0530 The reason was to run parallel copies, not for buffer limitations. Oh, I see. I'll note in passing that current lmbench-3 has some parallelization features you could play with, you might want to check it out. I've also used iperf for parallel connections successfully, and that will allow you to mess with the buffer sizes as well along with other variables of the data streams for both TCP and UDP. Cheers, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [patch 4/4] Update e1000 driver to use devres.
Index: linux-2.6/drivers/net/e1000/e1000_main.c === --- linux-2.6.orig/drivers/net/e1000/e1000_main.c +++ linux-2.6/drivers/net/e1000/e1000_main.c @@ -860,15 +860,14 @@ e1000_probe(struct pci_dev *pdev, { struct net_device *netdev; struct e1000_adapter *adapter; - unsigned long mmio_start, mmio_len; - unsigned long flash_start, flash_len; + unsigned long mmio_len, flash_len; static int cards_found = 0; static int global_quad_port_a = 0; /* global ksp3 port a indication */ int i, err, pci_using_dac; uint16_t eeprom_data = 0; uint16_t eeprom_apme_mask = E1000_EEPROM_APME; - if ((err = pci_enable_device(pdev))) + if ((err = pcim_enable_device(pdev))) return err; if (!(err = pci_set_dma_mask(pdev, DMA_64BIT_MASK)) @@ -878,20 +877,19 @@ e1000_probe(struct pci_dev *pdev, if ((err = pci_set_dma_mask(pdev, DMA_32BIT_MASK)) (err = pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK))) { E1000_ERR(No usable DMA configuration, aborting\n); - goto err_dma; + return err; } pci_using_dac = 0; } if ((err = pci_request_regions(pdev, e1000_driver_name))) - goto err_pci_reg; + return err; pci_set_master(pdev); - err = -ENOMEM; - netdev = alloc_etherdev(sizeof(struct e1000_adapter)); + netdev = devm_alloc_etherdev(pdev-dev, sizeof(struct +e1000_adapter)); if (!netdev) - goto err_alloc_etherdev; + return -ENOMEM; I'm a bit confused why you removed the goto's, and then removed all the target unwinding code at the bottom of e1000_probe(). Those labels clean up resources if something fails, like the err_sw_init label. I don't see anything in the devres code that jumps out at me that explains why we can do away with these cleanup routines. Thoughts? Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [patch 4/4] Update e1000 driver to use devres.
Now if I insert a return -ENOMEM right after allocating tx_ring: --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -1356,6 +1356,8 @@ e1000_alloc_queues(struct e1000_adapter *adapter) { adapter-tx_ring = kcalloc(adapter-num_tx_queues, sizeof(struct e1000_tx_ring), GFP_KERNEL); + + return -ENOMEM; if (!adapter-tx_ring) return -ENOMEM; #insmod DEVRES ADD f7a80e80 pcim_release (8 bytes) DEVRES ADD f7a80ca0 devm_free_netdev (4 bytes) DEVRES ADD eb7f0080 pcim_iomap_release (24 bytes) DEVRES ADD eb7f devm_kzalloc_release (40 bytes) e1000_sw_init: Unable to allocate memory for queues DEVRES REL eb7f devm_kzalloc_release (40 bytes) DEVRES REL eb7f0080 pcim_iomap_release (24 bytes) DEVRES REL f7a80ca0 devm_free_netdev (4 bytes) DEVRES REL f7a80e80 pcim_release (8 bytes) ACPI: PCI interrupt for device :02:00.0 disabled e1000: probe of :02:00.0 failed with error -12 Since we are returning an error from probe the driver core calls devres_release_all(dev) which releases all of the resources in the right order. See really_probe() in drivers/base/dd.c. This looks fine then to me. Thanks for the explanation. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Potential u32 classifier bug.
* Waskiewicz Jr, Peter P [EMAIL PROTECTED] 2007-08-15 11:02 There is this very horrible way of using the u32 classifier with a negative offset to look into the ethernet header. Based on this, it sounds like u32 using protocol 802_3 is broken? You might be expecting too much from u32. The protocol given to u32 is just a filter, it doesn't imply anything beyond that. u32 has its usage the way it is, that's way we've added an ematch rather than extending u32 itself. Ok, that clarifies it a bit. I've just found a few examples on the net, one of which is in a TC filter manual (http://tcn.hypert.net/tcmanual.pdf, section 2.2.1.3 at the bottom of the section), that was using u32 to simply filter on dest MAC address without anything elaborate. Either it worked way back when, or it was a bogus example. Thanks again Thomas, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Potential u32 classifier bug.
* Waskiewicz Jr, Peter P [EMAIL PROTECTED] 2007-08-09 18:07 My big question is: Has anyone recently used the 802_3 protocol in tc with u32 and actually gotten it to work? I can't see how the u32_classify() code can look at the mac header, since it is using the network header accessor to start looking. I think this is an issue with the classification code, but I'm looking to see if I'm doing something stupid before I really start digging into this mess. There is this very horrible way of using the u32 classifier with a negative offset to look into the ethernet header. Based on this, it sounds like u32 using protocol 802_3 is broken? You might want to look into the cmp ematch which can be attached to almost any classifier. It allows basing offsets on any layer thus making ethernet header filtering trivial. I'll look at this. Thanks Thomas for the response! -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Potential u32 classifier bug.
I've recently been trying to use tc with u32 classifiers to filter ethernet traffic based on ethertype. Although we found and fixed an issue in the sch_prio call to tc_classify() (thanks Patrick!), this didn't fix the actual filtering issue. For those of you who are curious or are tc-savy, I'm really in a bind and need some help. This is what I have so far: Filter to identify and move traffic to a different flow: # tc qdisc add dev eth2 root handle 1: rr bands 8 priomap 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 multiqueue # tc filter add dev eth2 parent 1: protocol 802_3 prio 1 u32 match u32 0x0800 0x at 12 flowid 1:6 Now this hits tc_classify(), and tp-protocol and skb-protocol match (be16 of 8 - ETH_P_802_3, which is what I expect), so u32_classify() is called through the tp-classify() func pointer. This is where things get weird. In net/sched/cls_u32.c, u32_classify() grabs a reference to part of the network header. This is question number one: how can the filter look at the ethernet (mac) header if it's grabbing a reference to the network header: u8 *ptr = skb_network_header(skb); I would think that for a protocol of 802.3, one would want: u8 *ptr = skb_mac_header(skb); Changing this though doesn't fix the filter. Further down is a rather horrible match criteria to make sure the filter is looking at the right data before it applies the whole filter list on the skb: if ((*(u32*)(ptr+key-off+(off2key-offmask))^key-val)key-mask) { Now if this matches (meaning it evaluates to zero), we can move on. If not, we go to the next key node and try again. Run out of key nodes, we return -1 and take the default mapping from IP TOS to Linux priority, and get queued to a band. My big question is: Has anyone recently used the 802_3 protocol in tc with u32 and actually gotten it to work? I can't see how the u32_classify() code can look at the mac header, since it is using the network header accessor to start looking. I think this is an issue with the classification code, but I'm looking to see if I'm doing something stupid before I really start digging into this mess. Thanks in advance, PJ Waskiewicz Intel Corporation [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] NET_DMA: remove unused dma_memcpy_to_kernel_iovec
On 7/26/07, David Miller [EMAIL PROTECTED] wrote: From: Shannon Nelson [EMAIL PROTECTED] Date: Tue, 24 Jul 2007 17:36:06 -0700 (repost - original eaten by vger?) Al Viro pointed out that dma_memcpy_to_kernel_iovec() really was unreachable and thus unused. The code originally was there to support in-kernel dma needs, but since it remains unused, we'll pull it out. Signed-off-by: Shannon Nelson [EMAIL PROTECTED] Applied, thanks Shannon. NET_DMA on kernel buffer is pretty useful in ndb, iSCSI target and initiators which uses kernel buffer to receive data. Are there any other issues with dma-memcpy on kernel buffers, if not then following patch makes dma_memcpy_to_kernel_iovec() reachable from tcp_recvmsg. I tested this patch and it work fine with unh iSCSI target. comments? --pravin. Pravin, Currently, NET_DMA only has one provider in the kernel, namely TCP. As the driver stands now, I'd like to keep this function out since it really wasn't being used before. Shannon has posted patches that will hopefully make it into 2.6.24 that allow multiple clients to request DMA channels, so things like iSCSI, ndb, UDP, and NFS can ask for channels without depending on TCP. Once we get to that point, if we see the need for the function, we can add it back. We have tried getting iSCSI accelerated through NET_DMA already, and didn't see the performance boost that made it worthwhile to do. Once the multiple client channel patches are in, we can revisit that, since it can be done right. Cheers, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [FIX] NET: Fix sch_api and sch_prio to properly set and detect the root qdisc
-Original Message- From: David Miller [mailto:[EMAIL PROTECTED] Sent: Monday, July 30, 2007 5:14 PM To: Waskiewicz Jr, Peter P Cc: netdev@vger.kernel.org; [EMAIL PROTECTED] Subject: Re: [FIX] NET: Fix sch_api and sch_prio to properly set and detect the root qdisc From: Waskiewicz Jr, Peter P [EMAIL PROTECTED] Date: Fri, 27 Jul 2007 09:22:11 -0700 Are these queued for 2.6.24, or are they going to make it into 2.6.23? I know you're busy with patches and NAPI, but I was curious. Thanks! I've applied both fixes and will push them into 2.6.23 Many thanks Dave! Cheers, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [FIX] NET: Fix sch_api and sch_prio to properly set and detect the root qdisc
From: Patrick McHardy [mailto:[EMAIL PROTECTED] Sent: Tuesday, July 24, 2007 5:36 PM To: Waskiewicz Jr, Peter P Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org Subject: Re: [FIX] NET: Fix sch_api and sch_prio to properly set and detect the root qdisc [...] Both look good, thanks. Acked-by: Patrick McHardy [EMAIL PROTECTED] Dave, Are these queued for 2.6.24, or are they going to make it into 2.6.23? I know you're busy with patches and NAPI, but I was curious. Thanks! Cheers, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [NET_SCHED]: Fix prio/ingress classification logic error
This is the final fix for the problem Peter reported. Turns out most other schedulers get it right, the only other case is ingress setting skb-tc_index to the uninitialized value of res.classid. I've applied this and tested prio and rr, and everything is happy again. Thanks for finding this and putting the patch together. I got lost in looking for the bug in the u32_classify() code, and missed the easier defect... Thanks again Patrick, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Tc filtering: broken 802_3 classifier?
The protocol match is on skb-protocol, so it case of ethernet its on the ethernet protocol, which is ETH_P_IP or ip for IPv4. I see that in the code, but the reason I started worrying was when I added the 802_3 classifier on 8 flows, it would shove all traffic into flowid 1:1, no matter if it matched or not. I'll keep investigating and see if I can narrow down what I'm seeing. Thanks Patrick, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Tc filtering: broken 802_3 classifier?
Waskiewicz Jr, Peter P wrote: The protocol match is on skb-protocol, so it case of ethernet its on the ethernet protocol, which is ETH_P_IP or ip for IPv4. I see that in the code, but the reason I started worrying was when I added the 802_3 classifier on 8 flows, it would shove all traffic into flowid 1:1, no matter if it matched or not. I'll keep investigating and see if I can narrow down what I'm seeing. I'm not sure what you're expecting. skb-protocol is usually not set to ETH_P_802_3, which is why the filter is not matching. I understand that. I had two issues, which you cleared up one by reminding me that the protocol matches on skb-protocol before it tries to run the -classify() routine. The other issue I am seeing is with 8 bands, an 802_3 filter is affecting classification of IP traffic. For example, I have an 802_3 filter to look for dst MAC address, but an ssh packet, which without any filters should go into flowid 1:3 on my system, is getting pushed into flowid 1:1. I remove the 802_3 filter, and ssh traffic starts going back into 1:3. No other filters on the system. That's the main issue I'm seeing, so I'll keep investigating to see what's going on. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Tc filtering: broken 802_3 classifier?
In case of prio, if your manually installed filters don't match, it will fall back to the skb-priority based classification, which is based on tos and is probably responsible for what you're seeing. Feel free to investigate, but you could save us all some time by simply posting what you're doing, what you're expecting and what is actually happening, there's probably a good explanation. I thought I did that before, but I probably wasn't clear. I'll try again (and if I'm still not clear, please pop me in the head). I am aware that skb-priority is used if no filter matches, and that is derived from tos (and gets set in ipsockglue). This is my setup. 8 bands with prio, with a priomap that is nice and simple: # tc qdisc add dev eth0 root handle 1: prio bands 8 priomap 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 With this configuration, ICMP will default to flowid 1:1 (band 0), and ssh will default to flowid 1:4 (band 3) based on TOS. I add this filter (802_3) and all traffic starts flowing into flowid 1:1 (including ssh), even though it should never match: # tc filter add dev eth0 protocol 802_3 parent 1: prio 2 u32 match u32 0x0800 0x at 12 flowid 1:6 As soon as I remove the filter: # tc filter del dev eth0 protocol 802_3 prio 2 ssh flows back into flowid 1:4. No filters of protocol ip were added, only the 802.3 filter. I hope this is more clear as to what I'm seeing. Thanks, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2][RFC] Update net core to use devres.
* netdev_pci_remove_one() can replace simple pci device remove functions * devm_alloc_netdev() is like alloc_netdev but allocates memory using devres. alloc_netdev() can be removed once all drivers use devres. Please look at the multiqueue network code that is in 2.6.23. It creates alloc_netdev_mq() for multiqueue network device drivers. If you want to add this devres feature (which based on the threads, might be difficult), please make sure you comprehend the multiqueue features. Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Tc filtering: broken 802_3 classifier?
I've been trying to use tc filtering to filter on ethertype, among other things in the MAC layer. I'm running into multiple issues, and want to put this out there in case I'm using the filters wrong, or if there really is a bug in the filter code (I've stared at most of it today, and my head hurts). Here's the scenario. I am running on a recent 2.6.23 GIT tree, and am using sch_prio with no multiqueue turned on in the qdisc. The network interface in question is e1000 (no multiqueue). # tc qdisc add dev eth0 root handle 1: prio Now to see the flowid's packet counts incrementing, I add explicit classful qdiscs to the leaves: # tc qdisc add dev eth0 parent 1:1 handle 10: pfifo # tc qdisc add dev eth0 parent 1:2 handle 20: pfifo # tc qdisc add dev eth0 parent 1:3 handle 30: pfifo Now packet counts can be seen with: # tc -s qdisc ls dev eth0 I can add a filter for IP for ssh, and it works as intended: # tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dport 22 0x flowid 1:3 This will shove ssh traffic into the 3rd pfifo queue, where by default it will flow into flowid 1:1. This is good. Now I add a filter for ethernet (802.3), and things aren't as happy: # tc filter add dev eth0 protocol 802_3 parent 1: prio 2 u32 match u32 first 4 bytes of dst mac address 0x at 0 match u32 last 2 bytes of dst mac address 0x at 4 flowid 1:1 This should match the destination MAC address of outgoing packets, and put it into flowid 1:1. For pings, using the normal priomap, they go into 1:2, so ping should be a good candidate for seeing if it goes into 1:1. In this case, it does not filter into 1:1. If I expand this into 8 flows on a multiqueue NIC, using sch_prio or sch_rr, adding any 802_3 filter to the chain will cause any traffic that hits the classifier (i.e. no other filters match first) to go into flowid 1:1, regardless if it actually matches. Remove the 802_3 filter from the chain, and all filtering starts working again. I'm trying to get state from the classifier code now when it's running, but it's a really big mess of black magic. I'm wondering if anyone is also seeing this behavior, and if they've tried to fix it. If not, I'll continue to search for a solution, but I'm just polling the community to see if this is a known issue, or if I'm doing something wrong. Thanks, -PJ Waskiewicz Intel Corporation [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] NET: Fix sch_prio to detect the root qdisc loading
As explained in my last mail, sch-parent is an integer. And it is set when grafting the qdisc, not on initilization, so it is always 0 here when coming from prio_init. This untested patch should make sure its always set correctly. Yes, I'm using NULL and 0 interchangeably here, since in the sch_api code, qdisc_graft(), sch-parent is referenced using NULL and not 0. I know it's a u32, and the value is getting set to the proper handle when the qdisc is not the root qdisc. When it's the root qdisc, it's left to be 0. This patch was tested as well, but looking at the history now, I didn't set bands correctly, so that's why the multiqueue case failed. My mistake, and I'll keep working on this. Sorry for the extra noise, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Question: how to detect if a qdisc is root or not?
From: Patrick McHardy [EMAIL PROTECTED] Date: Sat, 21 Jul 2007 06:01:07 +0200 Waskiewicz Jr, Peter P wrote: I just sent out a patch to fix this. I didn't see it yet. VGER ate it for some reason and I didn't notice it during my daily bounce analysis. I asked him on IRC to resend it, but he's away from the computer already :) I'll get it resent today. Sorry for the delay. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Question: how to detect if a qdisc is root or not?
Anyways, I tried a few different things, and what it looks like is sch-parent will be NULL (0) for the top-level device. This is sch-correct, and trying to mess with that screws up qdisc_graft() when unloading the qdisc. I also tried adding a TCQ_F_ROOT flag to sch-flags when classid is TC_H_ROOT, but that also screwed up unloading the qdisc. I dont think I understand. Whats the problem with setting sch-parent on initialization instead on grafting as I did in my example patch? Please explain the problems arrising on unload in detail. sch-parent is getting set on initialization, and for the root and ingress qdiscs, it's left at zero. If I change that value, when the root qdisc is unloaded and pfifo_fast is put back into place, the qdisc_destroy() walks the tree and attempts to free memory from the handle pointed at by sch-parent. It stops when sch-parent is NULL, so sch-parent is actually being set as intended. The only thing that confused me is that nowhere in the qdisc is TC_H_ROOT included explicitly, rather, the root qdisc is where sch-parent is NULL. So I misunderstood what was actually wrong. The qdisc code is ok as-is, it's just that the top-level qdisc (root and ingress) have a sch-parent of NULL, which is being set correctly today. Hope that clarifies. Thanks, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Question: how to detect if a qdisc is root or not?
Its set after grafting the parent, which is after initialization. I think what should work is to set it in qdisc_create instead, sch_api.c around line 490: + sch-parent = handle; if (handle == TC_H_INGRESS) { sch-flags |= TCQ_F_INGRESS; sch-stats_lock = dev-ingress_lock; ... and remove the initialization in qdisc_graft. That would additionally have the benefit that ingress qdiscs also have it initialized properly. I just sent out a patch to fix this. Sorry for the delay; my development machine oops'd in the middle of some disk I/O, and it corrupted part of the inode table...the ext3 journal application seemed to make it worse too. Rebuilt the machine, so I'm back on my feet. Anyways, I tried a few different things, and what it looks like is sch-parent will be NULL (0) for the top-level device. This is correct, and trying to mess with that screws up qdisc_graft() when unloading the qdisc. I also tried adding a TCQ_F_ROOT flag to sch-flags when classid is TC_H_ROOT, but that also screwed up unloading the qdisc. The ingress qdisc does have the parent handle set correctly, namely NULL, since it will always be the top-level qdisc on ingress sessions. Thx Patrick, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Question: how to detect if a qdisc is root or not?
I've been tracking down an issue with the recent multiqueue code, specifically with sch_prio and sch_rr loading as a root qdisc. Right now, we do not want to allow child qdiscs of sch_rr and sch_prio to load with multiqueue enabled; we want to restrict multiqueue-enabled qdiscs to the root qdisc (since this is the only thing to push into the device). The issue I have is I don't know how to detect if the qdisc I'm currently processing is the root qdisc, or if it's a child. From sch_prio.c: q-mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]); if (q-mq) { if (sch-handle != TC_H_ROOT) return -EINVAL; if (netif_is_multiqueue(sch-dev)) { Unfortunately, this code isn't working. This sch-handle is the handle assigned to the qdisc upon creation, and it's not TC_H_ROOT. Now the information whether or not the qdisc is root gets passed via tc from userspace, but that lives in the tcmsg struct, not the rtattr list of options. Does anyone know, outside of adding another attribute to the rtattr list, how to detect if a qdisc is a root qdisc within the prio_tune() initialization routine? Any and all help would be much appreciated. Thanks, PJ Waskiewicz Intel Corporation [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Question: how to detect if a qdisc is root or not?
You're right, thats a bug. TC_H_ROOT is the parent ID, which is stored in sch-parent. IIRC its also passed to the -init() function. Unfortunately it's not passed. It is passed into the -change() function: static int prio_init(struct Qdisc *sch, struct rtattr *opt) static int prio_change(struct Qdisc *sch, u32 handle, u32 parent, struct rtattr **tca, unsigned long *arg) I did mess around with sch-parent a bit, with no success (it appears to be zero / unitialized). I'll keep investigating. Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Question: how to detect if a qdisc is root or not?
Its set after grafting the parent, which is after initialization. I think what should work is to set it in qdisc_create instead, sch_api.c around line 490: + sch-parent = handle; if (handle == TC_H_INGRESS) { sch-flags |= TCQ_F_INGRESS; sch-stats_lock = dev-ingress_lock; ... and remove the initialization in qdisc_graft. That would additionally have the benefit that ingress qdiscs also have it initialized properly. That's where I'm messing around right now, and I did see that. Let me put it in and test. If all is well, I'll post a patch. Thanks for the help Patrick. -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
It would be great if we could finally get a working e1000 multiqueue patch so work in this area can actually be tested. I'm actively working on this right now. I'm on vacation next week, but hopefully I can get something working before I leave OLS and post it. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Ok everything is checked into net-2.6.23, thanks everyone. Dave, thank you for your patience and feedback on this whole process. Patrick and everyone else, thank you for your feedback and assistance. I am looking at your posed virtualization question, but I need sleep since I just remembered I'm on east coast time here at OLS, and it's 4:30am. Thanks again, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote: + static int __init prio_module_init(void) { - return register_qdisc(prio_qdisc_ops); + int err; + err = register_qdisc(prio_qdisc_ops); + if (!err) + err = register_qdisc(rr_qdisc_ops); + return err; } Thats still broken. I'll fix this and some minor cleanness issues myself so you don't have to go through another resend. Auke and I just looked at register_qdisc() and this code. Maybe we haven't had enough coffee yet, but register_qdisc() returns 0 on success. So if register_qdisc(prio_qdisc_ops) succeeds, then rr_qdisc_ops gets registered. I'm curious what is broken with this. Thanks Patrick -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] iproute2: sch_rr support in tc
Since rr is not built as a module you could actually put everything in q_prio and share the code. But I don't really care :) I tried doing this, but I couldn't get things working quite right with selecting the correct module from sch_prio to sch_rr. So I just added the q_rr.c portion of tc, and it fixed the selection issues I was seeing. At least for userspace, it may be better to explicitly show the existance of rr being independent; that's just my personal preference. Cheers, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Its not error handling. You do: err = register qdisc 1 if (err) return err; err = register qdisc 2 if (err) unregister qdisc 2 return err anyways, I already fixed that and cleaned up prio_classify the way I suggested. Will send shortly. Thanks for fixing; however, the current sch_prio doesn't unregister the qdisc if register_qdisc() on prio fails, or does that happen implicitly because the module will probably unload? Thanks again Patrick, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
PJ Waskiewicz wrote: include/linux/etherdevice.h |3 +- include/linux/netdevice.h | 62 ++- include/linux/skbuff.h |4 ++- net/core/dev.c | 27 +-- net/core/netpoll.c |8 +++--- net/core/pktgen.c | 10 +-- net/core/skbuff.c |3 ++ net/ethernet/eth.c |9 +++--- 8 files changed, 104 insertions(+), 22 deletions(-) include/linux/pkt_sched.h |9 +++ net/sched/Kconfig | 23 +++ net/sched/sch_prio.c | 147 + 3 files changed, 166 insertions(+), 13 deletions(-) Quick question: where are the sch_generic changes? :) If you hold for ten minutes I'll post a set of slightly changed patches with the NETDEVICES_MULTIQUEUE option and a fix for this. Jamal's and KK's qdisc_restart() rewrite took the netif_queue_stopped() call out of sch_generic.c. So the underlying qdisc is only responsible for checking the queue status now before dequeueing. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote: +#ifdef CONFIG_NET_SCH_MULTIQUEUE + if (q-mq) + skb-queue_mapping = + q-prio2band[bandTC_PRIO_MAX]; + else + skb-queue_mapping = 0; +#endif Setting it to zero here is wrong, consider: root qdisc: prio multiqueue child qdisc: prio non-multiqueue The top-level qdisc will set it, the child qdisc will unset it again. When multiqueue is inactive it should not touch it. I'll fix that as well. But the child can't assume the device is multiqueue if the child is non-multiqueue. This is the same issue with IP forwarding, where if you forward through a multiqueue device to a non-mq device, you don't know if the destination device is multiqueue. So the last qdisc to actually dequeue into a device should have control what the queue mapping is. If a user had a multiqueue qdisc as root, and configured a child qdisc as non-mq, that is a configuration error if the underlying device is indeed multiqueue IMO. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
Waskiewicz Jr, Peter P wrote: Quick question: where are the sch_generic changes? :) If you hold for ten minutes I'll post a set of slightly changed patches with the NETDEVICES_MULTIQUEUE option and a fix for this. Jamal's and KK's qdisc_restart() rewrite took the netif_queue_stopped() call out of sch_generic.c. So the underlying qdisc is only responsible for checking the queue status now before dequeueing. Yes, I noticed that now. Doesn't seem right though as long as queueing while queue is stopped is treated as a bug by the drivers. But I vaguely recall seeing a discussion about this, I'll check the archives. The basic gist is before the dequeue is done, the qdisc is locked by the qdisc is running bit, so another CPU cannot get in there. So if the queue isn't stopped when a dequeue is done, that same queue should not be stopped when hard_start_xmit() is called. The only thing I could think of that could happen is some out-of-band cleanup routine in the driver where the tx_ring lock is held, and the skb is bounced back, where the driver returns NETIF_TX_BUSY, and you requeue. This is an extreme corner case, so the check could be removed. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Absolutely not. First of all, its perfectly valid to use non-multiqueue qdiscs on multiqueue devices. Secondly, its only the root qdisc that has to know about multiqueue since that one controls the child qdiscs. Think about it, it makes absolutely no sense to have the child qdisc even know about multiqueue. Changing my example to have a multiqueue qdisc as child: root qdisc: 2 band prio multiqueue child qdisc of band 0: 2 band prio multiqueue When the root qdisc decides to dequeue band0, it checks whether subqueue 0 is active and dequeues the child qdisc. If the child qdisc is indeed another multiqueue qdisc as you suggest, if might decide to dequeue its own band 1 and checks that subqueue state. So where should the packet finally end up? And what if one of both subqueues is inactive? The only reasonable thing it can do is not care about multiqueue and just dequeue as usual. In fact I think it should be an error to configure multiqueue on a non-root qdisc. Ack. This is a thought process that trips me up from time to time...I see child qdisc, and think that's the last qdisc to dequeue and send to the device, not the first one to dequeue. So please disregard my comments before; I totally agree with you. Great catch here; I really like the prio_classify() cleanup. As always, many thanks for your feedback and help Patrick. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote: [...] The only reasonable thing it can do is not care about multiqueue and just dequeue as usual. In fact I think it should be an error to configure multiqueue on a non-root qdisc. Ack. This is a thought process that trips me up from time to time...I see child qdisc, and think that's the last qdisc to dequeue and send to the device, not the first one to dequeue. So please disregard my comments before; I totally agree with you. Great catch here; I really like the prio_classify() cleanup. Thanks. This updated patch makes configuring a non-root qdisc for multiqueue an error. The patch looks fine to me. Thanks Patrick. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
Waskiewicz Jr, Peter P wrote: Looking at Peter's multiqueue patch, which should include all hard_start_xmit users (I'm not seeing sch_teql though, Peter?) the only other one is pktgen. Ugh. That is another netif_queue_stopped() that needs netif_subqueue_stopped(). I can send an updated patch for the core to fix this based from your patches Patrick. I still have the tree around, here's an updated version. So what do we do about netpoll then wrt netif_(sub)queue_stopped() being removed from qdisc_restart()? The fallout of having netpoll() cause a queue to stop (queue 0 only) is the skb sent will be requeued, since the driver will return NETIF_TX_BUSY if this actually happens. But this is a corner case, and we won't lose packets; we'll just have increased latency on that queue. Should I worry about this or just move forward with the sch_teql.c change and repost the core patch? I don't think you need to worry about that, the subqueue patch just follows the existing code. Thanks Patrick for taking care of this. I am totally fine with this patch; if anyone else has feedback, please send it. If not, I'm excited to see if these can be considered for 2.6.23 now. :) Thanks everyone for the help. Cheers, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API
/* ensure 32-byte alignment of both the device and private area */ - alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST) ~NETDEV_ALIGN_CONST; + alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST + +(sizeof(struct net_device_subqueue) * (queue_count - 1))) Why queue_count - 1 ? It should be queue_count I think. I'm not sure what went through my head, but I'll fix this. Otherwise ACK for this patch except that it should also contain the sch_generic changes. I misread your previous mail; I'll get the sch_generic.c changes into this patch. Thanks Patrick, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RTNETLINK]: Add nested compat attribute
This patch adds a new attribute type that can be used to replace non-nested attributes that contain structures by nested ones in a compatible way. This can be used in cases like Peter's who is trying to extend sch_prio, which currently uses a fixed structure without any holes. Switching to nested attributes makes sure that the next person won't run into the same problem. I've been using this patch and the IPROUTE2 patches Patrick has proposed with no issues. Can someone else look at these patches when they have time? I'd be interested in seeing them make it into 2.6.23. Thanks!! -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
enum { - TCA_PRIO_UNPSEC, - TCA_PRIO_TEST, You misunderstood me. You can work on top of my compat attribute patches, but the example code should not have to go in to apply your patch. Ok. I'll fix my patches. diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 475df84..7f14fa6 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -102,8 +102,16 @@ config NET_SCH_ATM To compile this code as a module, choose M here: the module will be called sch_atm. +config NET_SCH_BANDS +bool Multi Band Queueing (PRIO and RR) This options seems useless. Its not used *anywhere* except for dependencies. I was trying to group the multiqueue qdiscs together with this. But I can see just having the multiqueue option for scheduling will cover this. I'll remove this. +config NET_SCH_BANDS_MQ + bool Multiple hardware queue support + depends on NET_SCH_BANDS OK, again: Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. Nothing at all changes for NET_SCH_PRIO itself. Additionally introduce a boolean NET_SCH_MULTIQUEUE. No dependencies at all. Use NET_SCH_MULTIQUEUE to guard the multiqueue code in sch_prio.c. Your current code doesn't even have any ifdefs anymore though, so this might not be needed at all. Additionally you could later introduce E1000_MULTIQUEUE and have that select NET_SCH_MULTIQUEUE. I'll clean this up. Thanks for the persistance. :) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 9461e8a..203d5c4 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev) spin_unlock(dev-queue_lock); ret = NETDEV_TX_BUSY; - if (!netif_queue_stopped(dev)) + if (!netif_queue_stopped(dev) + !netif_subqueue_stopped(dev, skb-queue_mapping)) /* churn baby churn .. */ ret = dev_hard_start_xmit(skb, dev); I'll try again - please move this to patch 2/3. I'm sorry; I misread your original comment about this. I'll move the change (although this disappears with Jamal's and KK's qdisc_restart() cleanup). diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 40a13e8..8a716f0 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -40,9 +40,11 @@ struct prio_sched_data { int bands; + int curband; /* for round-robin */ struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; + unsigned char mq; }; @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) #endif if (TC_H_MAJ(band)) band = 0; + if (q-mq) + skb-queue_mapping = + q-prio2band[bandTC_PRIO_MAX]; + else + skb-queue_mapping = 0; Might look cleaner if you have one central point where queue_mapping is set and the band is returned. I'll see how easy it'll be to condense this; because the queue being selected in the qdisc can be different based on a few different things, I'm not sure how easy it'll be to assign this in one spot. I'll play around with it and see what I can come up with. + /* If we're multiqueue, make sure the number of incoming bands +* matches the number of queues on the device we're associating with. +*/ + if (tb[TCA_PRIO_MQ - 1]) + q-mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]); If you're using it as a flag, please use RTA_GET_FLAG(), otherwise RTA_GET_U8. Will do. Thanks. + if (q-mq (qopt-bands != sch-dev-egress_subqueue_count)) + return -EINVAL; sch_tree_lock(sch); q-bands = qopt-bands; @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb) memcpy(opt.priomap, q-prio2band, TC_PRIO_MAX+1); nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), opt); - RTA_PUT_U32(skb, TCA_PRIO_TEST, 321); + RTA_PUT_U8(skb, TCA_PRIO_MQ, q-mq); And RTA_PUT_FLAG. Now that I think of it, does it even makes sense to have a prio private flag for this instead of a qdisc global one? There currently aren't any other qdiscs that are natural fits for multiqueue that I can see. I can see the benefit though of having this as a global flag in the qdisc API; let me check it out, and if it makes sense, I can move it. static int __init prio_module_init(void) { - return register_qdisc(prio_qdisc_ops); + register_qdisc(prio_qdisc_ops); + register_qdisc(rr_qdisc_ops); Proper error handling please. Will do. Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at
RE: [RTNETLINK]: Add nested compat attribute
From: Waskiewicz Jr, Peter P [EMAIL PROTECTED] Date: Mon, 25 Jun 2007 10:14:37 -0700 This patch adds a new attribute type that can be used to replace non-nested attributes that contain structures by nested ones in a compatible way. This can be used in cases like Peter's who is trying to extend sch_prio, which currently uses a fixed structure without any holes. Switching to nested attributes makes sure that the next person won't run into the same problem. I've been using this patch and the IPROUTE2 patches Patrick has proposed with no issues. Can someone else look at these patches when they have time? I'd be interested in seeing them make it into 2.6.23. I've just put Patrick's patch into the net-2.6.23 tree. Awesome Dave!! Thank you very much. :) -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RTNETLINK]: Add nested compat attribute
-Original Message- From: David Miller [mailto:[EMAIL PROTECTED] Sent: Monday, June 25, 2007 2:30 PM To: Waskiewicz Jr, Peter P Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED] Subject: Re: [RTNETLINK]: Add nested compat attribute From: Waskiewicz Jr, Peter P [EMAIL PROTECTED] Date: Mon, 25 Jun 2007 14:01:31 -0700 Awesome Dave!! Thank you very much. :) Please get your next round of patches ready, Patrick and I can review them and barring any serious issues we can finally put this stuff in to net-2.6.23. I'm putting them into the latest 2.6.23 tree right now - I'll have them tested and sent upstream later today. Thanks Dave, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
@@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) #endif if (TC_H_MAJ(band)) band = 0; + if (q-mq) + skb-queue_mapping = + q-prio2band[bandTC_PRIO_MAX]; + else + skb-queue_mapping = 0; Might look cleaner if you have one central point where queue_mapping is set and the band is returned. I've taken a stab at this. I can have one return point, but I'll still have multiple assignments of skb-queue_mapping due to the different branches for which queue to select in the qdisc. I suppose we can do a rewrite of prio_classify(), but to me that seems beyond the scope of the multiqueue patches themselves. What do you think? Thanks, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Thats not necessary. I just though you could add one exit point: ... out: skb-queue_mapping = q-mq ? band : 0; return q-queues[band]; } But if that doesn't work don't bother .. Unfortunately it won't, given how band might be used like this to select the queue: return q-queues[q-prio2band[bandTC_PRIO_MAX]]; I'll keep this in mind though, and if it can be done cleanly, I'll submit a patch. Thanks Patrick, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RTNETLINK]: Add nested compat attribute
From: Patrick McHardy [EMAIL PROTECTED] Date: Mon, 25 Jun 2007 23:08:09 +0200 David Miller wrote: I've been using this patch and the IPROUTE2 patches Patrick has proposed with no issues. Can someone else look at these patches when they have time? I'd be interested in seeing them make it into 2.6.23. I've just put Patrick's patch into the net-2.6.23 tree. You seem to have added the netlink patch for the generic netlink attributes. Peter needs the rtnetlink attribute patch since qdiscs still use the old stuff. Attached again to this mail. Thanks for the clarification, I thought they were both the same patch, resent for the sake of tgraf seeing it. I've applied this one too, thanks Patrick. It looks like the one Patrick resent was the older version that requires a typecast. This is the function prototype currently in the kernel: +extern int rtattr_parse_nested_compat(struct rtattr *tb[], int maxattr, + struct rtattr *rta, void **data, int len); This is the newer version: +extern int __rtattr_parse_nested_compat(struct rtattr *tb[], int maxattr, + struct rtattr *rta, int len); +#define rtattr_parse_nested_compat(tb, max, rta, data, len) \ +({ data = RTA_PAYLOAD(rta) = len ? RTA_DATA(rta) : NULL; \ + __rtattr_parse_nested_compat(tb, max, rta, len); }) + I can send the update to what's in 2.6.23 as the first patch of my series, or I can write my function calls using the older callback (which isn't a problem). Which would you prefer? Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
#include linux/module.h @@ -40,9 +42,13 @@ struct prio_sched_data { int bands; +#ifdef CONFIG_NET_SCH_RR + int curband; /* for round-robin */ +#endif struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; + u16 band2queue[TC_PRIO_MAX + 1]; Why is this still here? Its a 1:1 mapping. Thought about this more last night and this morning. As far as I can tell, I still need this. If the qdisc gets loaded with multiqueue turned on, I can just use the value of band to assign skb-queue_mapping. But if the qdisc is loaded without multiqueue support, then I need to assign a value of zero to queue_mapping, or not assign it at all (it will be zero'd out before the call to -enqueue() in dev_queue_xmit()). But I'd rather not have a conditional in the hotpath checking if the qdisc is multiqueue; I'd rather have the array to match the bands so I can just do an assignment. What do you think? Thanks, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote: Waskiewicz Jr, Peter P wrote: Thought about this more last night and this morning. As far as I can tell, I still need this. If the qdisc gets loaded with multiqueue turned on, I can just use the value of band to assign skb-queue_mapping. But if the qdisc is loaded without multiqueue support, then I need to assign a value of zero to queue_mapping, or not assign it at all (it will be zero'd out before the call to -enqueue() in dev_queue_xmit()). But I'd rather not have a conditional in the hotpath checking if the qdisc is multiqueue; I'd rather have the array to match the bands so I can just do an assignment. What do you think? I very much doubt that it has any measurable impact. You can also add a small inline function void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue) OK I didn't really listen obviously :) A compile time option won't help. Just remove it and assign it conditionally. Sounds good. Thanks Patrick. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
BTw, couldn't you just merge sch_rr with prio? AFAICT you only need a new dequeue function, a new struct Qdisc_ops and a MODULE_ALIAS. Ok, I have this somewhat working, but need to poll for some help from the community. I used MODULE_ALIAS(sch_rr) in sch_prio.c, and modprobe is happily loading sch_prio.ko when I ask for sch_rr.ko. It also recognizes the correct ops struct to associate with the instance of the module. However, when I try to load the qdisc via tc (modified version that knows sch_rr), I'm getting No Such File or Directory from RTNETLINK. It's looking for sch_rr.ko, and is bailing. I've scoured the code looking for a reason why, and am drawing a blank. I'll continue looking, but if this sounds familiar to someone who knows how to get around this, please reply and let me know. Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Please post the code. Code is attached. Please forgive the attachment and any whitespace damage...currently using Doubtlook to send this (cringe). Thanks, -PJ Waskiewicz sch_prio.c Description: sch_prio.c
RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote: Please post the code. Code is attached. Please forgive the attachment and any whitespace damage...currently using Doubtlook to send this (cringe). The code looks correct. Are you sure you had the config option enabled during your test? Yes. This is the tc command I used to configure the qdisc (with q_rr.c attached from my patches iproute2 package): # tc qdisc add dev eth2 root handle 1: rr bands 8 RTNETLINK answers: No such file or directory At this point, sch_prio gets loaded correctly, but it obviously fails to finish loading the qdisc. Using prio works though: # tc qdisc add dev eth2 root handle 1: prio bands 8 And yes, the NIC I'm working with has 8 queues, just to be clear. Any help is definitely appreciated; I'm going to keep this copy of the code for now, but am going to get the separate module written back up just in case this can't be solved in the short-term. This is the only piece keeping me from sending these patches back for consideration, so I'll keep the parallel effort going. Thanks Patrick, -PJ Waskiewicz q_rr.c Description: q_rr.c
RE: [PATCH] NET: Multiple queue hardware support
PJ Waskiewicz wrote: I did not modify other users of netif_queue_stopped() in net/core/netpoll.c, net/core/dev.c, or net/core/pktgen.c, since no classification occurs for the skb being sent to the device. Therefore, packets should always be ending up in queue 0, so there's no need to check the subqueue status either. Thats not correct. Subqueue 0 may be full and the queue still running. I'll look over the patches later. I'm working something up to address this. The last time I thought about this, I had issues with software devices, such as loopback. They weren't allocating any subqueues at all, so they would call netif_subqueue_stopped() and panic the kernel. However, now with Dave's request to index egress_subqueue, the first queue is allocated for everyone, so loopback and other software devices should be happy. Let me put these checks back in, test it out, and resend if I don't see any issues. Sorry for the thrash, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
The dependencies seem to be very confused. SCHED_PRIO does not depend on anything new, SCH_RR also doesn't depend on anything. SCH_PRIO_MQ and SCH_RR_MQ (which is missing) depend on SCH_PRIO/SCH_RR. A single NET_SCH_MULTIQUEUE option seems better than adding one per scheduler though. I agree with a NET_SCH_MULTIQUEUE option. However, SCH_RR does depend on SCH_PRIO being built since it's the same code, doesn't it? Maybe I'm not understanding something about the build process. I'll clean this up. --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -9,6 +9,8 @@ * Authors:Alexey Kuznetsov, [EMAIL PROTECTED] * Fixes: 19990609: J Hadi Salim [EMAIL PROTECTED]: * Init -- EINVAL when opt undefined + * Additions: Peter P. Waskiewicz Jr. [EMAIL PROTECTED] + * Added round-robin scheduling for selection at load-time git keeps changelogs, please don't add it here. Roger. struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; + u16 band2queue[TC_PRIO_MAX + 1]; Why is this still here? Its a 1:1 mapping. I'll fix this. @@ -211,6 +265,22 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt) return -EINVAL; } + /* If we're prio multiqueue or are using round-robin, make +* sure the number of incoming bands matches the number of +* queues on the device we're associating with. +*/ +#ifdef CONFIG_NET_SCH_RR + if (strcmp(rr, sch-ops-id) == 0) + if (qopt-bands != sch-dev-egress_subqueue_count) + return -EINVAL; +#endif + +#ifdef CONFIG_NET_SCH_PRIO_MQ + if (strcmp(prio, sch-ops-id) == 0) + if (qopt-bands != sch-dev-egress_subqueue_count) + return -EINVAL; +#endif For the tenth time now, the user should enable this at runtime. You can't just break things dependant on config options. I had this in sch_prio and tc before, and was told to remove it because of ABI issues. I can put it back in, but I'm not sure what those previous ABI issues were. Was it backwards compatibility that you referred to before that was broken? As always, the feedback is very much appreciated. I'll get these fixes in as soon as possible. -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API
From: PJ Waskiewicz [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 11:42:29 -0700 + + /* The TX queue control structures */ + struct net_device_subqueue *egress_subqueue; + int egress_subqueue_count; Since every net device will have at least one subqueue, I would suggest that you do this as follows: 1) In net_device change the quoted part of the patch above to: int egress_subqueue_count; struct net_device_subqueue egress_subqueue[0]; 2) In alloc_netdev(): Factor (sizeof(struct egress_subqueue) * num_subqueues) into the net_device allocation size, place the priv area after the subqueues. This will save us pointer dereferences on all of these quite common accesses. Thanks Dave! I'll be putting this in today and run a test pass on it. Thanks for the feedback. -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API
From: PJ Waskiewicz [EMAIL PROTECTED] Date: Mon, 18 Jun 2007 11:42:29 -0700 + + /* The TX queue control structures */ + struct net_device_subqueue *egress_subqueue; + int egress_subqueue_count; Since every net device will have at least one subqueue, I would suggest that you do this as follows: 1) In net_device change the quoted part of the patch above to: int egress_subqueue_count; struct net_device_subqueue egress_subqueue[0]; 2) In alloc_netdev(): Factor (sizeof(struct egress_subqueue) * num_subqueues) into the net_device allocation size, place the priv area after the subqueues. This will save us pointer dereferences on all of these quite common accesses. I've been thinking about this more today, so please bear with me if I'm missing something. Right now, with how qdisc_restart() is running, we'd definitely call netif_subqueue_stopped(dev, skb-queue_mapping) for all multi-ring and single-ring devices. However, with Jamal's and Krishna's qdisc_restart() rewrite patch, the checks for netif_queue_stopped() and netif_subqueue_stopped() would be pushed into the qdisc's -dequeue() functions. If that's the case, then the only checks on egress_subqueue[x] would be for multi-ring adapters, or if someone was silly enough to load sch_{rr|prio} onto a single-ring device with multiqueue hardware support compiled in. Given all of that, I'm not sure allocating egress_subqueue[0] at compile time or runtime would make any difference either way. If I'm missing something, please let me know - I'd like to reduce any unnecessary pointer dereferences if possible, but given the proposed qdisc_restart(), I think the code as-is would be ok. Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API
It's not being allocated at compile time, it's being allocated linearly into one block of ram in order to avoid pointer derefs but it's still dynamic in that the size isn't known until the alloc_netdev() call. We do this trick all over the networking, TCP sockets are 3 or 4 different structures, all allocated into a linear block of memory so that: 1) only one memory allocation needs to be done for each object create, this is not relevant for the net_device case except in extreme examples bringing thousands of devices up and down which I suppose someone can give a realistic example of :-) 2) each part can be accessed as an offset from the other instead of a pointer deref which costs a cpu memory read Please change the allocation strategy as I recommended, thanks. Later this afternoon someone in my group came over and bonked me in the head, and made this obvious to me. Thanks for the follow-up; great catch on the efficiency. It's in my code now and is currently running some test passes. Thanks Dave, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/2 - rev2] qdisc_restart - readability changes plus onebug fix.
-Original Message- From: Krishna Kumar [mailto:[EMAIL PROTECTED] Sent: Sunday, June 17, 2007 9:51 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; Waskiewicz Jr, Peter P; [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org Subject: [PATCH 1/2 - rev2] qdisc_restart - readability changes plus onebug fix. New changes : - Incorporated Peter Waskiewicz's comments. - Re-added back one warning message (on driver returning wrong value). I've run these patches last night and this morning without issue. I see no issues with these updated patches from my testing. Thanks, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API
PJ Waskiewicz wrote: Add the multiqueue hardware device support API to the core network stack. Allow drivers to allocate multiple queues and manage them at the netdev level if they choose to do so. Should be 2/3 and qdisc changes should be 3/3. Well actually the qdisc sch_generic changes belong in this patch as well and the qdisc changes should be split in one change per qdisc. I'll re-arrange the patches. /* Functions used for multicast support */ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index e7367c7..8bcd870 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -215,6 +215,7 @@ typedef unsigned char *sk_buff_data_t; * @pkt_type: Packet class * @fclone: skbuff clone status * @ip_summed: Driver fed us an IP checksum + * @queue_mapping: Queue mapping for multiqueue devices * @priority: Packet queueing priority * @users: User count - see {datagram,tcp}.c * @protocol: Packet protocol from driver @@ -269,6 +270,7 @@ struct sk_buff { __u16 csum_offset; }; }; + __u16 queue_mapping; We have a 4 byte hole on 64 bit after iif where this would fit in. I'll move the variable. Thanks for this! @@ -3377,12 +3381,23 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name, if (sizeof_priv) dev-priv = netdev_priv(dev); + alloc_size = (sizeof(struct net_device_subqueue) * queue_count); + + p = kzalloc(alloc_size, GFP_KERNEL); + if (!p) { + printk(KERN_ERR alloc_netdev: Unable to allocate queues.\n); + return NULL; Same leak here that you already fixed a couple of posts ago. Heavy sigh...I synced off an older personal git repository. I'll clean this and the casts up and repost with the other feedback. Thanks Patrick, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote: diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 6d7542c..44ecdc6 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c } +#ifdef CONFIG_NET_SCH_PRIO_MQ + /* setup queue to band mapping */ + if (q-bands sch-dev-egress_subqueue_count) { + qmapoffset = 1; + mod = sch-dev-egress_subqueue_count; + } else { + mod = q-bands % sch-dev-egress_subqueue_count; + qmapoffset = q-bands / sch-dev-egress_subqueue_count + + ((mod) ? 1 : 0); + } + + queue = 0; + offset = 0; + for (i = 0; i q-bands; i++) { + q-band2queue[i] = queue; + if ( ((i + 1) - offset) == qmapoffset) { + queue++; + offset += qmapoffset; + if (mod) + mod--; + qmapoffset = q-bands / + sch-dev-egress_subqueue_count + + ((mod) ? 1 : 0); + } + } +#endif This should really go, its not only ugly, it also makes no sense to use more bands than queues since that means multiple bands of different priorities are controlled through a single queue state, so lower priority bands can stop the queue for higher priority ones. The user should enable multiqueue behaviour and using it with a non-matching parameters should simply return an error. That sounds fine to me. I'll clean this and sch_prio up. diff --git a/net/sched/sch_rr.c b/net/sched/sch_rr.c new file mode 100644 index 000..ce9f237 --- /dev/null +++ b/net/sched/sch_rr.c For which multiqueue capable device is this? Jamal mentioned that e1000 uses drr. E1000 is capable of doing DRR and WRR, however the way our drivers are written, they're straight round-robin. But this qdisc would be useful for devices such as wireless where they have their own scheduler in the MAC, and do not want the stack to prioritize traffic beyond that. This way a user can classify multiple flows into multiple bands, but the driver will control the prioritization of the traffic beyond that. For MAC's that have no strict scheduler (such as e1000 as it is today), they can use sch_prio to achieve multiple queue support, but have scheduling priority from the stack. This qdisc can also be useful at the physical netdev layer for virtualization I would think, but perhaps I haven't thought that hard about that yet. Lots os unnecessary includes. I have a patch that cleans this up for net/sched, this is the relevant sch_prio part where you copied this from: I'll clean the includes up. Thanks! + band = TC_H_MIN(band) - 1; + if (band q-bands) { You copied an off-by-one from an old sch_prio version here. Hmm. This is the sch_prio from the first 2.6.23-dev tree. I'll resync and make sure it's the correct one. +static int rr_tune(struct Qdisc *sch, struct rtattr *opt) +{ + struct rr_sched_data *q = qdisc_priv(sch); + struct tc_rr_qopt *qopt = RTA_DATA(opt); Nested attributes please, don't repeat sch_prio's mistake. I'm not sure I understand what you mean here about nested attributes. ... + /* setup queue to band mapping - best effort to map into available +* hardware queues +*/ + if (q-bands sch-dev-egress_subqueue_count) { + qmapoffset = 1; + mod = sch-dev-egress_subqueue_count; + } else { + mod = q-bands % sch-dev-egress_subqueue_count; + qmapoffset = q-bands / sch-dev-egress_subqueue_count + + ((mod) ? 1 : 0); + } + + queue = 0; + offset = 0; + for (i = 0; i q-bands; i++) { + q-band2queue[i] = queue; + if ( ((i + 1) - offset) == qmapoffset) { + queue++; + offset += qmapoffset; + if (mod) + mod--; + qmapoffset = q-bands / + sch-dev-egress_subqueue_count + + ((mod) ? 1 : 0); + } + } Should go as well. Works for me. I'll clean this up as well. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Hmm. This is the sch_prio from the first 2.6.23-dev tree. I'll resync and make sure it's the correct one. Current 2.6.22-rc and net-2.6.23 have if (band = q-bands) I just pulled 2.6.23 down, and see that is true. I must have had that left over. I'll fix that. I'm not sure I understand what you mean here about nested attributes. Nested netlink attributes, like most qdisc use, instead of struct tc_rr_qopt (or additionally). The way you've done it makes it hard to add further attributes later. I'm going to need to think about this more, since I'm not immediately getting what you're referring to. I see the qdisc using tc_prio_qopt as a single member; do you have an example outside of the qdiscs I can look at and see what you're referring to? Please bear with me: my netlink skills are still very green. BTw, couldn't you just merge sch_rr with prio? AFAICT you only need a new dequeue function, a new struct Qdisc_ops and a MODULE_ALIAS. Are you suggesting a module that can determine RR or PRIO at runtime? Because the two are so similar, I definitely thought about combining them, but because of the dequeue difference, you'd need a load-time switch to determine which mode to run the module in. That would break ABI for sch_prio, which I was trying to avoid. Thanks Patrick, -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/2] qdisc_restart - couple of optimizations.
IMHO this scenario occurs so infrequently that the check isn't worth it especially since the driver has to be able to deal with us calling it after netif_stop_queue() anyway. That sounds just fine to me. Thanks Krishna and Herbert for weighing in on this. -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/2] qdisc_restart - couple of optimizations.
- netif_queue_stopped need not be called inside qdisc_restart as it has been called already in qdisc_run() before the first skb is sent, and in __qdisc_run() after each intermediate skb is sent (note : we are the only sender, so the queue cannot get stopped while the tx lock was got in the ~LLTX case). I somewhat disagree here. The underlying driver can conceivably stop the device queue even if the stack holds the queue lock during an interrupt to clean Tx descriptors, and it finds it's out of them or needs to grab the device for whatever reason. Granted this is a corner case, and the net effect would be a simple requeue of the skb, but checking the status of the queue at the last possible moment before entering the driver could alleviate the requeue in the time between -dequeue() from the qdisc, and hard_start_xmit() if an event like I mentioned happened. I'm ok with it either way, especially since this is a corner case. But it does need to be considered that it can happen. Cheers, -PJ Waskiewicz - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html