Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Waskiewicz Jr, Peter
On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:
> > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > 
> > > > I'm also not 100% on board with the argument that "future" FW
> > > > can
> > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > that
> > > > future ASICs/FW will be designed to always use the "blessed"
> > > > BTF
> > > > format?  Or will it be reconfigurable at runtime?
> > > 
> > > let's table configuration of metadata aside for a second.
> > 
> > I agree that this should/could be NIC-specific and shouldn't weigh
> > on
> > the metadata interface between the drivers and XDP layer.
> > 
> > > Describing metedata layout in BTF allows NICs to disclose
> > > everything
> > > NIC has to users in a standard and generic way.
> > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > and hw powercycled to have new md layout (and corresponding BTF
> > > description)
> > > is a separate discussion.
> > > Saeed's proposal introduces the concept of 'offset' inside
> > > 'struct
> > > xdp_md_info'
> > > to reach 'hash' value in metadata.
> > > Essentially it's a run-time way to access 'hash' instead of
> > > build-
> > > time.
> > > So bpf program would need two loads to read csum or hash field
> > > instead of one.
> > > With BTF the layout of metadata is known to the program at build-
> > > time.
> > > 
> > > To reiterate the proposal:
> > > - driver+firmware keep layout of the metadata in BTF format
> > > (either
> > > in the driver
> > >   or driver can read it from firmware)
> > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > query
> > > the driver and
> > >   generate normal C header file based on BTF in the given NIC
> > > - user does #include "md_desc.h" and bpf program can access md-
> > > >csum
> > > or md->hash
> > >   with direct single load out of metadata area in front of the
> > > packet
> > 
> > This piece is where I'd like to discuss more.  When we discussed
> > this
> > in Seoul, the initial proposal was a static struct that we'd try to
> > hammer out a common layout between the interested parties.  That
> > obviously wasn't going to scale, and we wanted to pursue something
> > more
> > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > want
> > to care what the underlying device is, and could just ask for
> > metadata
> > that it's interested in.  With this approach, your eBPF program is
> > now
> > bound/tied to the NIC/driver, and if you switched to a differen
> > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > program
> > to comprehend that.  I thought we were trying to avoid that.
> 
> It looks to me that NICs have a lot more differences instead of
> common pieces.
> If we focus the architecture on making common things as generic
> as possible we miss the bigger picture of covering distinct
> and unique features that hw provides.
> 
> In the last email when I said "would be good to standardize at least
> a few common fields"
> I meant that it make sense only at the later phases.
> I don't think we should put things into uapi upfront.
> Otherwise we will end up with likely useless fields.
> Like vlan field. Surely a lot of nics can store it into metadata, but
> why?
> bpf program can easily read it from the packet.
> What's the benefit of adding it to md?
> This metadata concept we're discussing in the context of program
> acceleration.
> If metadata doesn't give perf benefits it should not be done by
> firmware, it should not be supported by the driver and certainly
> should not be part of uapi.
> Hence I'd like to see pure BTF description without any
> common/standard
> fields across the drivers and figure out what (if anything) is useful
> to accelerate xdp programs (and af_xdp in the future).

I guess I didn't write my response very well before, apologies.  I
actually really like the BTF description and the dynamic-ness of it. 
It solves lots of problems and keeps things out of the UAPI.

What I was getting at was using the BTF description and dumping a
header file to be used in the program to be loaded.  If we have an
application develo

Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure

2018-07-06 Thread Waskiewicz Jr, Peter
On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > 
> > I'm also not 100% on board with the argument that "future" FW can
> > reshuffle things whatever way it wants to.  Is the assumption that
> > future ASICs/FW will be designed to always use the "blessed" BTF
> > format?  Or will it be reconfigurable at runtime?
> 
> let's table configuration of metadata aside for a second.

I agree that this should/could be NIC-specific and shouldn't weigh on
the metadata interface between the drivers and XDP layer.

> Describing metedata layout in BTF allows NICs to disclose everything
> NIC has to users in a standard and generic way.
> Whether firmware is reconfigurable on the fly or has to reflashed
> and hw powercycled to have new md layout (and corresponding BTF
> description)
> is a separate discussion.
> Saeed's proposal introduces the concept of 'offset' inside 'struct
> xdp_md_info'
> to reach 'hash' value in metadata.
> Essentially it's a run-time way to access 'hash' instead of build-
> time.
> So bpf program would need two loads to read csum or hash field
> instead of one.
> With BTF the layout of metadata is known to the program at build-
> time.
> 
> To reiterate the proposal:
> - driver+firmware keep layout of the metadata in BTF format (either
> in the driver
>   or driver can read it from firmware)
> - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query
> the driver and
>   generate normal C header file based on BTF in the given NIC
> - user does #include "md_desc.h" and bpf program can access md->csum
> or md->hash
>   with direct single load out of metadata area in front of the packet

This piece is where I'd like to discuss more.  When we discussed this
in Seoul, the initial proposal was a static struct that we'd try to
hammer out a common layout between the interested parties.  That
obviously wasn't going to scale, and we wanted to pursue something more
dynamic.  But I thought the goal was the XDP/eBPF program wouldn't want
to care what the underlying device is, and could just ask for metadata
that it's interested in.  With this approach, your eBPF program is now
bound/tied to the NIC/driver, and if you switched to a differen
NIC/driver combo, then you'd have to rewrite part of your eBPF program
to comprehend that.  I thought we were trying to avoid that.

Our proposed approach (still working on something ready to RFC) is to
provide a method for the eBPF program to send a struct of requested
hints down to the driver on load.  If the driver can provide the hints,
then that'd be how they'd be laid out in the metadata.  If it can't
provide them, we'd probably reject the program loading, or discuss
providing a software fallback (I know this is an area of contention).

I suppose we could get there with the rewriting mechanism described
below, but that'd be a tough sell to set a bit of ABI for metadata,
then change it to be potentially dynamic at runtime in the future.

-PJ

Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-28 Thread Waskiewicz Jr, Peter
On 9/28/17 2:23 PM, John Fastabend wrote:
> [...]
> 
>> I'm pretty sure I misunderstood what you were going after with
>> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a
>> few weeks) will populate the headroom coming out of the driver to XDP,
>> and then once the XDP program extracts whatever hints it wants via
>> helpers, I fully expect that area in the headroom to get stomped by
>> something else.  If we want to send any of that hint data up farther,
>> we'll already have it extracted via the helpers, and the eBPF program
>> can happily assign it to wherever in the outbound metadata area.
> 
> In case its not obvious with the latest xdp metadata patches the outbound
> metadata can then be pushed into skb fields via a tc_cls program if needed.

Yes, that was what I was alluding to with "can happily assign it to 
wherever."  The patches we're working on are driver->XDP, then anything 
else using the latest meta-data patches would be XDP->anywhere else.  So 
I don't think we're going to step on any toes.

Thanks John,
-PJ


Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-28 Thread Waskiewicz Jr, Peter
On 9/28/17 12:59 PM, Andy Gospodarek wrote:
> On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
> <peter.waskiewicz...@intel.com> wrote:
>> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>>>> [...]
>>>>> First, thanks for this detailed description.  It was helpful to read
>>>>> along with the patches.
>>>>>
>>>>> My only concern about this area being generic is that you are now in a
>>>>> state where any bpf program must know about all the bpf programs in the
>>>>> receive pipeline before it can properly parse what is stored in the
>>>>> meta-data and add it to an skb (or perform any other action).
>>>>> Especially if each program adds it's own meta-data along the way.
>>>>>
>>>>> Maybe this isn't a big concern based on the number of users of this
>>>>> today, but it just starts to seem like a concern as there are these
>>>>> hints being passed between layers that are challenging to track due to a
>>>>> lack of a standard format for passing data between.
>>>>
>>>> Btw, we do have similar kind of programmable scratch buffer also today
>>>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>>>> which doesn't have any fixed layout for the slots, or a per-cpu map
>>>> where you can transfer data between tail calls for example, then tail
>>>> calls themselves that need to coordinate, or simply mangling of packets
>>>> itself if you will, but more below to your use case ...
>>>>
>>>>> The main reason I bring this up is that Michael and I had discussed and
>>>>> designed a way for drivers to communicate between each other that rx
>>>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>>>> action.  Much like this code, it involved adding an new element to
>>>>> struct xdp_md that could point to the important information.  Now that
>>>>> there is a generic way to handle this, it would seem nice to be able to
>>>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>>>> without the ability to mark it in some manner.
>>>>>
>>>>> For additional background, the minimum amount of data needed in the case
>>>>> Michael and I were discussing was really 2 words.  One to serve as a
>>>>> pointer to an rx_ring structure and one to have a counter to the rx
>>>>> producer entry.  This data could be acessed by the driver processing the
>>>>> tx completions and callback to the driver that received the frame off the 
>>>>> wire
>>>>> to perform any needed processing.  (For those curious this would also 
>>>>> require a
>>>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>>>
>>>> What you describe above doesn't seem to be fitting to the use-case of
>>>> this set, meaning the area here is fully programmable out of the BPF
>>>> program, the infrastructure you're describing is some sort of means of
>>>> communication between drivers for the XDP_REDIRECT, and should be
>>>> outside of the control of the BPF program to mangle.
>>>
>>> OK, I understand that perspective.  I think saying this is really meant
>>> as a BPF<->BPF communication channel for now is fine.
>>>
>>>> You could probably reuse the base infra here and make a part of that
>>>> inaccessible for the program with some sort of a fixed layout, but I
>>>> haven't seen your code yet to be able to fully judge. Intention here
>>>> is to allow for programmability within the BPF prog in a generic way,
>>>> such that based on the use-case it can be populated in specific ways
>>>> and propagated to the skb w/o having to define a fixed layout and
>>>> bloat xdp_buff all the way to an skb while still retaining all the
>>>> flexibility.
>>>
>>> Some level of reuse might be proper, but I'd rather it be explicit for
>>> my use since it's not exclusively something that will need to be used by
>>> a BPF prog, but rather the driver.  I'll produce some patches this week
>>> for reference.
>>
>> Sorry for chiming in late, I've been offline.
>>
>> We're looking to add some functionality from driver to XDP inside 

Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-28 Thread Waskiewicz Jr, Peter
On 9/26/17 10:21 AM, Andy Gospodarek wrote:
> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>> [...]
>>> First, thanks for this detailed description.  It was helpful to read
>>> along with the patches.
>>>
>>> My only concern about this area being generic is that you are now in a
>>> state where any bpf program must know about all the bpf programs in the
>>> receive pipeline before it can properly parse what is stored in the
>>> meta-data and add it to an skb (or perform any other action).
>>> Especially if each program adds it's own meta-data along the way.
>>>
>>> Maybe this isn't a big concern based on the number of users of this
>>> today, but it just starts to seem like a concern as there are these
>>> hints being passed between layers that are challenging to track due to a
>>> lack of a standard format for passing data between.
>>
>> Btw, we do have similar kind of programmable scratch buffer also today
>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>> which doesn't have any fixed layout for the slots, or a per-cpu map
>> where you can transfer data between tail calls for example, then tail
>> calls themselves that need to coordinate, or simply mangling of packets
>> itself if you will, but more below to your use case ...
>>
>>> The main reason I bring this up is that Michael and I had discussed and
>>> designed a way for drivers to communicate between each other that rx
>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>> action.  Much like this code, it involved adding an new element to
>>> struct xdp_md that could point to the important information.  Now that
>>> there is a generic way to handle this, it would seem nice to be able to
>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>> without the ability to mark it in some manner.
>>>
>>> For additional background, the minimum amount of data needed in the case
>>> Michael and I were discussing was really 2 words.  One to serve as a
>>> pointer to an rx_ring structure and one to have a counter to the rx
>>> producer entry.  This data could be acessed by the driver processing the
>>> tx completions and callback to the driver that received the frame off the 
>>> wire
>>> to perform any needed processing.  (For those curious this would also 
>>> require a
>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>
>> What you describe above doesn't seem to be fitting to the use-case of
>> this set, meaning the area here is fully programmable out of the BPF
>> program, the infrastructure you're describing is some sort of means of
>> communication between drivers for the XDP_REDIRECT, and should be
>> outside of the control of the BPF program to mangle.
> 
> OK, I understand that perspective.  I think saying this is really meant
> as a BPF<->BPF communication channel for now is fine.
> 
>> You could probably reuse the base infra here and make a part of that
>> inaccessible for the program with some sort of a fixed layout, but I
>> haven't seen your code yet to be able to fully judge. Intention here
>> is to allow for programmability within the BPF prog in a generic way,
>> such that based on the use-case it can be populated in specific ways
>> and propagated to the skb w/o having to define a fixed layout and
>> bloat xdp_buff all the way to an skb while still retaining all the
>> flexibility.
> 
> Some level of reuse might be proper, but I'd rather it be explicit for
> my use since it's not exclusively something that will need to be used by
> a BPF prog, but rather the driver.  I'll produce some patches this week
> for reference.

Sorry for chiming in late, I've been offline.

We're looking to add some functionality from driver to XDP inside this 
xdp_buff->data_meta region.  We want to assign it to an opaque 
structure, that would be specific per driver (think of a flex descriptor 
coming out of the hardware).  We'd like to pass these offloaded 
computations into XDP programs to help accelerate them, such as packet 
type, where headers are located, etc.  It's similar to Jesper's RFC 
patches back in May when passing through the mlx Rx descriptor to XDP.

This is actually what a few of us are planning to present at NetDev 2.2 
in November.  If you're hoping to restrict this headroom in the xdp_buff 
for an exclusive use case with XDP_REDIRECT, then I'd like to discuss 
that further.

-PJ



Re: [Intel-wired-lan] [PATCH] igb: check memory allocation failure

2017-09-14 Thread Waskiewicz Jr, Peter
On 9/13/17 7:24 PM, Brown, Aaron F wrote:
>> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf
>> Of Christophe JAILLET
>> Sent: Monday, August 28, 2017 10:13 AM
>> To: Waskiewicz Jr, Peter <peter.waskiewicz...@intel.com>; Kirsher, Jeffrey T
>> <jeffrey.t.kirs...@intel.com>
>> Cc: netdev@vger.kernel.org; kernel-janit...@vger.kernel.org; intel-wired-
>> l...@lists.osuosl.org; linux-ker...@vger.kernel.org
>> Subject: Re: [Intel-wired-lan] [PATCH] igb: check memory allocation failure
>>
>> Le 28/08/2017 à 01:09, Waskiewicz Jr, Peter a écrit :
>>> On 8/27/17 2:42 AM, Christophe JAILLET wrote:
>>>> Check memory allocation failures and return -ENOMEM in such cases, as
>>>> already done for other memory allocations in this function.
>>>>
>>>> This avoids NULL pointers dereference.
>>>>
>>>> Signed-off-by: Christophe JAILLET <christophe.jail...@wanadoo.fr>
>>>> ---
>>>> drivers/net/ethernet/intel/igb/igb_main.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>>
> 
> This seems to be fine from a "it does not break in testing" perspective, so...
> 
> Tested-by: Aaron Brown <aaron.f.br...@intel.com
> 
>>> -PJ
>>>
>> Hi,
>>
>> in fact, there is no leak because the only caller of 'igb_sw_init()'
>> (i.e. 'igb_probe()'), already frees these resources in case of error,
>> see [1]
>>
>> These resources are also freed  in 'igb_remove()'.
>>
>> Best reagrds,
>> CJ
>>
>> [1]:
>> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-
>> next.git/tree/drivers/net/ethernet/intel/igb/igb_main.c#n2775
> 
> But is PJ's comment saying that it is not really necessary?  If so I tend to 
> lean towards the don't touch it if it's not broken perspective.

I guess I didn't respond after Christophe replied, sorry about that. 
The patch is good to me.  It's definitely catching an issue where we're 
not checking for a memory failure, then just follows the same 
de-allocation path on unwind.

If you want it:

Acked-by: PJ Waskiewicz <peter.waskiewicz...@intel.com>



Re: [PATCH] connector: Delete an error message for a failed memory allocation in cn_queue_alloc_callback_entry()

2017-08-28 Thread Waskiewicz Jr, Peter
On 8/28/17 2:06 AM, Dan Carpenter wrote:
> On Sun, Aug 27, 2017 at 11:16:06PM +0000, Waskiewicz Jr, Peter wrote:
>> On 8/27/17 3:26 PM, SF Markus Elfring wrote:
>>> From: Markus Elfring <elfr...@users.sourceforge.net>
>>> Date: Sun, 27 Aug 2017 21:18:37 +0200
>>>
>>> Omit an extra message for a memory allocation failure in this function.
>>>
>>> This issue was detected by using the Coccinelle software.
>>
>> Did coccinelle trip on the message or the fact you weren't returning NULL?
>>
> 
> You've misread the patch somehow.  The existing code has a NULL return
> and it's preserved in Markus's patch.  This sort of patch is to fix a
> checkpatch.pl warning.  The error message from this kzalloc() isn't going
> to get printed because it's a small allocation and small allocations
> always succeed in current kernels.  But probably the main reason
> checkpatch complains is that kmalloc() already prints a stack trace and
> a bunch of other information so the printk doesn't add anyting.
> Removing it saves a little memory.
> 
> I'm mostly a fan of running checkpatch on new patches or staging and not
> on old code...

And this is what I get for reading the patch with a crappy 
mailer...thanks Doubtlook.

Sorry for the noise.


Re: Get ARP/ND tables from kernel

2017-08-27 Thread Waskiewicz Jr, Peter
On 8/27/17 9:25 PM, Bassam Alsanie wrote:
> Hello everyone,
> I looking into a good way (stable and compatible with large number of
> distros) to get the arp/nd cache from kernel to user space, for both
> IP4 and IP6.
> 
> It seem IOCTL (SIOCGARP) can't do that, you can only get MAC address
> from provided IP address. But IOCTL can't give the the full arp/nd
> table.
> The other option is the Netlink interface. I tried it and I got the
> ARP/ND table :).
> The third option is using /proc/net/arp, which only restricted to IP4.
> 
> There is command line utilities that I excluding in my case.
> 
> Is there another way to do it? what is the best way in my case?
> 
> Thank you all.

# strace arp -an
[...]
open("/proc/net/arp", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(4, "IP address   HW type Fla"..., 1024) = 310
[...]

# strace ip -6 neighbor show
[...]
socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 3
setsockopt(3, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
bind(3, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=}, 12) = 0
getsockname(3, {sa_family=AF_NETLINK, nl_pid=30292, nl_groups=}, 
[12]) = 0
sendto(3, {{len=40, type=RTM_GETLINK, flags=NLM_F_REQUEST|NLM_F_DUMP, 
seq=1503888680, pid=0}, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\35\0\1\0\0\0"}, 40, 0, NULL, 0) = 40
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, 
nl_groups=}, msg_namelen=12, msg_iov=[{iov_base=[{{len=1268, 
type=RTM_NEWLINK, flags=NLM_F_MULTI, seq=1503888680, pid=30292}, 
"\0\0\4\3\1\0\0\0I\0\1\0\0\0\0\0\7\0\3\0lo\0\0\10\0\r\0\350\3\0\0"...}, 
{{len=1280, type=RTM_NEWLINK, flags=NLM_F_MULTI, seq=1503888680, 
pid=30292}, 
"\0\0\1\0\2\0\0\0C\20\1\0\0\0\0\0\t\0\3\0eno1\0\0\0\0\10\0\r\0"...}], 
iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2548
[...]

Seems like it's pretty obvious if you don't want to use the existing 
tools, just look at how the existing tools get this data.  IPv4 uses 
/proc/net/arp, IPv6 uses netlink.

Cheers,
-PJ



Re: [PATCH] connector: Delete an error message for a failed memory allocation in cn_queue_alloc_callback_entry()

2017-08-27 Thread Waskiewicz Jr, Peter
On 8/27/17 3:26 PM, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Sun, 27 Aug 2017 21:18:37 +0200
> 
> Omit an extra message for a memory allocation failure in this function.
> 
> This issue was detected by using the Coccinelle software.

Did coccinelle trip on the message or the fact you weren't returning NULL?

> 
> Signed-off-by: Markus Elfring 
> ---
>   drivers/connector/cn_queue.c | 4 +---
>   1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c
> index 1f8bf054d11c..e4f31d679f02 100644
> --- a/drivers/connector/cn_queue.c
> +++ b/drivers/connector/cn_queue.c
> @@ -40,10 +40,8 @@ cn_queue_alloc_callback_entry(struct cn_queue_dev *dev, 
> const char *name,
>   struct cn_callback_entry *cbq;
>   
>   cbq = kzalloc(sizeof(*cbq), GFP_KERNEL);
> - if (!cbq) {
> - pr_err("Failed to create new callback queue.\n");
> + if (!cbq)
>   return NULL;
> - }

Wny not:

if (!cbq) {
pr_err("Failed to create new callback queue.\n");
+   return NULL;
}

>   
>   atomic_set(>refcnt, 1);
>   
> 



Re: [PATCH] igb: check memory allocation failure

2017-08-27 Thread Waskiewicz Jr, Peter
On 8/27/17 2:42 AM, Christophe JAILLET wrote:
> Check memory allocation failures and return -ENOMEM in such cases, as
> already done for other memory allocations in this function.
> 
> This avoids NULL pointers dereference.
> 
> Signed-off-by: Christophe JAILLET 
> ---
>   drivers/net/ethernet/intel/igb/igb_main.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index fd4a46b03cc8..837d9b46a390 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3162,6 +3162,8 @@ static int igb_sw_init(struct igb_adapter *adapter)
>   /* Setup and initialize a copy of the hw vlan table array */
>   adapter->shadow_vfta = kcalloc(E1000_VLAN_FILTER_TBL_SIZE, sizeof(u32),
>  GFP_ATOMIC);
> + if (!adapter->shadow_vfta)
> + return -ENOMEM;

Looks reasonable to me.

A larger issue though I see in this function is that if we return 
-ENOMEM here, and if we return -ENOMEM from igb_init_interrupt_scheme() 
below on failure, we leak adapter->mac_table (and adapter->shadow_vfta 
in the latter).  We should add a proper unwind to free up the memory on 
failure.

-PJ


Re: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more

2017-08-25 Thread Waskiewicz Jr, Peter
On 8/25/17 11:25 AM, Jacob Keller wrote:
> Under some circumstances, such as with many stacked devices, it is
> possible that dev_hard_start_xmit will bundle many packets together, and
> mark them all with xmit_more.
> 
> Most drivers respond to xmit_more by skipping tail bumps on packet
> rings, or similar behavior as long as xmit_more is set. This is
> a performance win since it means drivers can avoid notifying hardware of
> new packets repeat daily, and thus avoid wasting unnecessary PCIe or other
> bandwidth.
> 
> This use of xmit_more comes with a trade off because bundling too many
> packets can increase latency of the Tx packets. To avoid this, we should
> limit the maximum number of packets with xmit_more.
> 
> Driver authors could modify their drivers to check for some determined
> limit, but this requires all drivers to be modified in order to gain
> advantage.
> 
> Instead, add a sysctl "xmit_more_max" which can be used to configure the
> maximum number of xmit_more skbs to send in a sequence. This ensures
> that all drivers benefit, and allows system administrators the option to
> tune the value to their environment.
> 
> Signed-off-by: Jacob Keller 
> ---
> 
> Stray thoughts and further questions
> 
> Is this the right approach? Did I miss any other places where we should
> limit? Does the limit make sense? Should it instead be a per-device
> tuning nob instead of a global? Is 32 a good default?

I actually like the idea of a per-device knob.  A xmit_more_max that's 
global in a system with 1GbE devices along with a 25/50GbE or more just 
doesn't make much sense to me.  Or having heterogeneous vendor devices 
in the same system that have different HW behaviors could mask issues 
with latency.

This seems like another incarnation of possible buffer-bloat if the max 
is too high...

> 
>   Documentation/sysctl/net.txt |  6 ++
>   include/linux/netdevice.h|  2 ++
>   net/core/dev.c   | 10 +-
>   net/core/sysctl_net_core.c   |  7 +++
>   4 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> index b67044a2575f..3d995e8f4448 100644
> --- a/Documentation/sysctl/net.txt
> +++ b/Documentation/sysctl/net.txt
> @@ -230,6 +230,12 @@ netdev_max_backlog
>   Maximum number  of  packets,  queued  on  the  INPUT  side, when the 
> interface
>   receives packets faster than kernel can process them.
>   
> +xmit_more_max
> +-
> +
> +Maximum number of packets in a row to mark with skb->xmit_more. A value of 
> zero
> +indicates no limit.

What defines "packet?"  MTU-sized packets, or payloads coming down from 
the stack (e.g. TSO's)?

-PJ


Re: [PATCH] pktgen: add a new sample script for 40G and above link testing

2017-08-25 Thread Waskiewicz Jr, Peter
On 8/25/17 10:59 AM, Jesper Dangaard Brouer wrote:
> On Fri, 25 Aug 2017 14:24:28 +
> "Waskiewicz Jr, Peter" <peter.waskiewicz...@intel.com> wrote:
> 
>> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
>>>>
>>>> Tested with Intel XL710 NIC with Cisco 3172 switch.
>>>>
>>>> It would be even slightly better if the irqbalance service is turned
>>>> off outside.
>>>
>>> Yes, if you don't turn-off (kill) irqbalance it will move around the
>>> IRQs behind your back...
>>
>> Or you can use the --banirq option to irqbalance to ignore your device's
>> interrupts as targets for balancing.
> 
> It might be worth mentioning that --banirq=X is specified for each IRQ
> that you want to exclude, and --banirq is simply specified multiple
> times on the command line.
> 
> Is it possible to tell a running irqbalance that I want to excluded an
> extra IRQ? (just before I do my manual adjustment).

It isn't possible today, since we don't have a way to attach a 
foreground/oneshot irqbalance run to a currently-running daemon.  That's 
an interesting feature enhancement...I can add it to our list as a 
feature request so I don't forget about it.  That way I can also get 
Neil's thoughts on this.

-PJ



Re: [PATCH] pktgen: add a new sample script for 40G and above link testing

2017-08-25 Thread Waskiewicz Jr, Peter
On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
>>
>> Tested with Intel XL710 NIC with Cisco 3172 switch.
>>
>> It would be even slightly better if the irqbalance service is turned
>> off outside.
> 
> Yes, if you don't turn-off (kill) irqbalance it will move around the
> IRQs behind your back...

Or you can use the --banirq option to irqbalance to ignore your device's 
interrupts as targets for balancing.

Cheers,
-PJ


RE: [PATCH] [IPROUTE2] Update various classifiers' help output forexpected CLASSID syntax

2008-02-13 Thread Waskiewicz Jr, Peter P
  +   fprintf(stderr, \nNOTE: CLASSID is parsed as hexidecimal 
  +input.\n);
 
 s/hexidecimal/hexadecimal/g (?)

Gah!  Thanks Jarek, I can correct and resend.

-PJ Waskiewicz
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] [TC U32] Fix input parsing to support more than 9 flow id'scorrectly

2008-02-12 Thread Waskiewicz Jr, Peter P
  
 Sorry, can't change the api, update the documentation instead.

Yes, this is much more reasonable.  I'll send a patch shortly to do
that.

Thanks
-PJ Waskiewicz
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] [TC U32] Fix input parsing to support more than 9 flow id'scorrectly

2008-02-12 Thread Waskiewicz Jr, Peter P
 From: PJ Waskiewicz [EMAIL PROTECTED]
 
 Using strtoul with a base of 16 converts flowid 10 into 0x10, 
 which makes it flowid 16.  This is interpreted by the kernel 
 incorrectly, and causes traffic flows above 9 to be 
 classified into band 0 on multiband qdiscs.
 This changes the base to 10, which will correctly parse input 
 into the proper hexidecimal value.

Stephen,
We can go one of two ways I suppose.  Once is this way, since
most user input for CLASSID is base 10, or we can update documentation
to say that CLASSID input is expected to be base 16.  What do you think?

Thanks,
-PJ Waskiewicz
[EMAIL PROTECTED]

 
 Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
 ---
 
  tc/tc_util.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/tc/tc_util.c b/tc/tc_util.c index cdbae42..a277eac 100644
 --- a/tc/tc_util.c
 +++ b/tc/tc_util.c
 @@ -65,7 +65,7 @@ int get_tc_classid(__u32 *h, const char *str)
   maj = TC_H_UNSPEC;
   if (strcmp(str, none) == 0)
   goto ok;
 - maj = strtoul(str, p, 16);
 + maj = strtoul(str, p, 10);
   if (p == str) {
   maj = 0;
   if (*p != ':')
 @@ -76,7 +76,7 @@ int get_tc_classid(__u32 *h, const char *str)
   return -1;
   maj = 16;
   str = p+1;
 - min = strtoul(str, p, 16);
 + min = strtoul(str, p, 10);
   if (*p != 0)
   return -1;
   if (min = (116))
 
 --
 To unsubscribe from this list: send the line unsubscribe 
 netdev in the body of a message to [EMAIL PROTECTED] 
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] Disable TSO for non standard qdiscs

2008-02-01 Thread Waskiewicz Jr, Peter P
 ...But, on the other hand, in this case the realization seems to be
 wrong: probably still all locally created packets will be 
 treated the same - or I miss something?
 
 Jarek P.

The TCP layer will generate TSO packets based on the kernel socket
features associated with the flow.  So if you have two devices, one
supporting TSO, the other not, then the flows associated with the
non-TSO device will not have their packets built for TSO.  This has no
bearing on the device supporting TSO, which its feature flags will
propogate into the kernel socket for that flow, and cause any TCP flows
to that device to be TSO packets.  So in a nutshell, disabling TSO is on
a per-device level, not a global switch.

-PJ Waskiewicz
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] Disable TSO for non standard qdiscs

2008-02-01 Thread Waskiewicz Jr, Peter P
 Indeed. As an example of an unknowing user, this discussion 
 made me check whether my cablemodem device (on which I'm 
 using HFSC) uses TSO :)

The TSO defer logic is based on your congestion window and current
window size.  So the actual frame sizes hitting your NIC attached to
your DSL probably aren't anywhere near 64KB, but probably more in line
with whatever your window size is for DSL.

The bottom line is TSO saves CPU cycles.  If we want to make it go away
because of a traffic shaping qdisc interfering, then that's fine.  I
just don't think a TSO option should be added to the scheduler layer,
since it already exists in the ethtool layer.  Asking a user to type
'ethtool -k devicename tso off' is probably going to be much easier
than setting an option on your qdisc through tc to turn TSO back on.

I think we're having more of a disagreement of what is considered the
normal case user.  If you are on a slow link, such as a DSL/cable
line, your TCP window/congestion window aren't going to be big enough to
generate large TSO's, so what is the issue?  But disabling TSO, say on a
10 GbE link, can cut throughput by half (I have data on 8-core machines
with 10 GbE with/without TSO if you're interested).  Even on a
single-core machine with a 1GbE link can have bad performance hits.  So
this is why I'm so concerned about a proposal to turn off TSO outside of
the current established methods of using ethtool.  Rather than educating
the user about how to turn TSO back on using tc if they want it, educate
them why they may want to consider turning TSO off in certain
configurations.  And I don't consider any user effectively using a TBF
qdisc someone incapable of understanding how to use ethtool.

Cheers,

-PJ Waskiewicz
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] Disable TSO for non standard qdiscs

2008-02-01 Thread Waskiewicz Jr, Peter P
 Right - Essentially it is a usability issue:
 People who know how to use TSO (Peter for example) will be 
 clueful enough to turn it on. Which means the default should 
 be to protect the clueless and turn it off.
 On Andis approach:
 Turning TSO off at netdev registration time with a warning 
 will be a cleaner IMO. Or alternatively introducing a 
 kernel-config I know what TSO is option which is then used 
 at netdev registration. From a usability perspective it would 
 make more sense to just keep ethtool as the only way to 
 configure TSO. 

To me this sounds like the most reasonable approach.  I've put out my
concerns, so I'll get out of the way now so we can move forward in some
direction.  :-)

Cheers,

-PJ Waskiewicz
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Lots of BUG eth1 code -5 qlen 0 messages in 2.6.24

2008-02-01 Thread Waskiewicz Jr, Peter P
 I've changed the -EIO into NETDEV_TX_BUSY and so far I can't 
 trigger the bug anymore. It was quite easy to trigger within 
 minutes with rsync, but I can't trigger it anymore. Should I 
 send a patch, and if
 so: to who? The tulip/xircom_cb driver seems to be orphaned.

Perhaps Jeff Garzik is a good place to start.  He maintains the driver
tree, he either can take the patch or direct you to where it should go.

 No need to test that, it *is* netif_stopped before the return:
 
 /* Uh oh... no free descriptor... drop the packet */
 netif_stop_queue(dev);
 spin_unlock_irqrestore(card-lock,flags);
 trigger_transmit(card);
 
 return NETDEV_TX_BUSY;
 
 trigger_transmit() is a simple function that just writes a 
 single register on the card to trigger a transmit.

Eek!  No need to make this smarter at this point, but this is pretty
basic.

Cheers,

-PJ Waskiewicz
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] Disable TSO for non standard qdiscs

2008-02-01 Thread Waskiewicz Jr, Peter P
 I totally disagree with these POVs:
 
 - 10G cards should be treated by default as 10G cards - not 
 DSL modems,
   and common users shouldn't have to read any warnings or configs to
   see this.
 - tc with TBF or HTB are professional tools; there should be 
 added some
   warnings to manuals. But trying to change the way they work because
   we think we know better what users want, and changing BTW some other
   things (making debugging this later a hell), is simply disrespectful
   for target users of these tools. There are some wrappers or 
 creators
   invented for this. And, BTW, I think I've seen somewhere a system
   which does this this other way - with creators for 
 professionals. So,
   you could be right with this too...

Ok, maybe I'm not done quite yet.  Jarek is echo'ing my original point,
changing the behavior of the tool automatically (these qdiscs in
question) is not good for a normal end user.  It might be fine for
kernel developers, but not users of these tools, IMO.  A less disruptive
approach, such as a warning message printed when loading the qdisc if
TSO is enabled, and documenting recommended usage, I think is more
prudent here.

Cheers,
-PJ Waskiewicz
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Waskiewicz Jr, Peter P
 The philosophical problem I have with this suggestion is that 
 I expect that the large majority of users will be more happy 
 with disabled TSO if they use non standard qdiscs and 
 defaults that do not fit the majority use case are bad.
 
 Basically you're suggesting that nearly everyone using tc 
 should learn about another obscure command.

If someone is using tc to load and configure a qdisc, I'd really hope
knowing or learning ethtool wouldn't be a stretch for them...  And I'm
not arguing the majority of people will want this or not, but taking
away the ability to use TSO at the kernel level here without allowing a
tuneable is making that decision for them, which is wrong IMO.

Cheers,

-PJ Waskiewicz
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Waskiewicz Jr, Peter P
 Well, it could be just that when using such qdiscs TSO would be
 disabled, but the user could override this by using ethtool after
 loading the qdiscs.

I still disagree with this.  The qdisc should not cause anything to happen to 
feature flags on the device. It's the scheduling layer and really shouldn't 
care about what features the device supports or not.  If someone has an issue 
with a feature hurting performance or causing odd behavior when using a qdisc, 
then they should disable the feature on the device using the appropriate tools 
provided.  If it's the qdisc causing issues, then either the qdisc needs to be 
fixed, or it should be documented what features are recommended to be on and 
off with the qdisc.  I don't agree that the scheduling layer should affect 
features on an underlying device.

Cheers,

-PJ Waskiewicz
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Lots of BUG eth1 code -5 qlen 0 messages in 2.6.24

2008-01-29 Thread Waskiewicz Jr, Peter P

  Are you using any specific qdisc, or just the default pfifo_fast?  
  Have you done any specific tuning on your qdisc as well?  
 The default 
  qlen seems to have been changed.
 
 The driver seems buggy. Make it return NETDEV_TX_BUSY instead 
 of -EIO in xircom_start_xmit() and the messages will go away.

Totally agree.  However, the driver is still getting entries when it
shouldn't (netif_queue_stopped() is true), hence it's returning a non-OK
value.  We can either re-add the check for netif_queue_stopped() to
qdisc_restart(), or update the drivers to use the newer API.  I'd rather
do the latter, which I can work on.

Thanks Jamal!

-PJ
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Lots of BUG eth1 code -5 qlen 0 messages in 2.6.24

2008-01-29 Thread Waskiewicz Jr, Peter P
 I've just started to use 2.6.24 on my home firewall (before 
 it was running 2.6.24-rc2 for about 65 days) and I noticed a 
 couple of error messages I've never seen before:
 
 Jan 29 07:50:54 gateway kernel: BUG eth1 code -5 qlen 0 Jan 
 29 08:28:30 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 
 08:57:30 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 
 09:44:04 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 
 10:01:35 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 
 10:01:35 gateway last message repeated 2 times Jan 29 
 10:16:48 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 
 10:16:48 gateway last message repeated 2 times Jan 29 
 10:45:48 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 
 10:45:48 gateway last message repeated 2 times Jan 29 
 11:10:01 gateway kernel: BUG eth1 code -5 qlen 0 Jan 29 
 11:10:02 gateway last message repeated 9 times
 
 The message seems to be coming from the qdisc_restart() in 
 net/sched/sch_generic.c which was changed with commit 
 5f1a485d5905aa641f33009019b369907a4c .
 
 The NIC is an IBM EtherJet cardbus card using the xircom_cb driver:

Are you using any specific qdisc, or just the default pfifo_fast?  Have
you done any specific tuning on your qdisc as well?  The default qlen
seems to have been changed.

Basically your queue is being overrun, and with the current checks in
the kernel in the stack, it's allowing the skb into the driver.  I've
known about this issue, and I'm hesitant to push a patch to re-add the
netif_queue_stopped() check into qdisc_restart().  I'd rather push a
one-time patch to the drivers that interacts with
netif_stop_subqueue(netdev, 0), so we can completely decouple the single
queue from the netdev.

I'd say you can somewhat ignore the messages for now.  But there is work
to be done here, and it's obvious I need to do this sooner than later.
Please let me know about the qdisc parameters though when you get a
chance.

Thanks,

-PJ Waskiewicz
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Lots of BUG eth1 code -5 qlen 0 messages in 2.6.24

2008-01-29 Thread Waskiewicz Jr, Peter P
 Peter, I suspect that driver is just buggy in some other way 
 as opposed to being re-entered; couldnt tell by inspection. 
 It is possible it may be too eager to open up before it 
 really has space.
 It will be easy to check your theory by having the driver 
 just check if it is netif_stopped just before it returns 
 NETDEV_TX_BUSY.

Ahh, very good point.

Thanks,
-PJ
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] PATCH 1/2 [SCHED 2.6.24]: Check subqueue status before calling hard_start_xmit

2007-11-15 Thread Waskiewicz Jr, Peter P
 You could optimize this by getting HARD_TX_LOCK after the 
 check. I assume that netif_stop_subqueue (from another CPU) 
 would always be called by the driver xmit, and that is not 
 possible since we hold the __LINK_STATE_QDISC_RUNNING bit. 
 Does that sound correct?

Sorry for not responding sooner; Dave hit it on the head though with his
response.  I agree with your changes, and I'll incorporate them in the
lockless stack patches I've been working on (in the software queuing
mode).  Now if I could just find some time to finish them up and get
them out for review...

Thanks,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] [AF_PACKET]: Allow multicast traffic to be caught by ORIGDEV when bonded

2007-11-07 Thread Waskiewicz Jr, Peter P
  The socket option for packet sockets to return the original ifindex 
  instead of the bonded ifindex will not match multicast 
 traffic.  Since 
  this socket option is the most useful for layer 2 traffic and 
  multicast traffic, make the option multicast-aware.
  
  Signed-off-by: Peter P Waskiewicz Jr 
 [EMAIL PROTECTED]
 
 I agree with you in principle, but I'd like to hear some 
 feedback from other folks.  In particular I'd like a 
 discussion about what this might break, if anything.

That's reasonable.  In any event, the only thing this could affect is if
the option is set on the socket, which shouldn't be very often at all.

I'm more than open to feedback on this change.

Thanks Dave,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] [AF_PACKET]: Allow multicast traffic to be caught by ORIGDEV when bonded

2007-11-07 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: David Miller [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, November 07, 2007 3:52 AM
 To: Waskiewicz Jr, Peter P
 Cc: netdev@vger.kernel.org
 Subject: Re: [PATCH] [AF_PACKET]: Allow multicast traffic to 
 be caught by ORIGDEV when bonded
 
 From: Waskiewicz Jr, Peter P [EMAIL PROTECTED]
 Date: Wed, 7 Nov 2007 03:00:42 -0800
 
  In any event, the only thing this could affect is if the 
 option is set 
  on the socket, which shouldn't be very often at all.
 
 Any idea how many programs set this option and which ones?  
 You obviously noticed, so perhaps you know at least one or 
 was this discovered purely by code inspection?

We have an application that communicates with a switch and a NIC that
helps some additional configuration for the network (think of it as an
extended link negotiation).  It uses LLDP, which is layer 2 and is
multicast.  We noticed that when testing bonding, it didn't work.  I'm
not sure who else might be using this code right now, since it went into
the mainstream kernel in 2.6.22.

Thanks,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Question on TSO maximum segment sizes.

2007-10-11 Thread Waskiewicz Jr, Peter P
I'm having an issue with TSO right vs. hardware that can't take the
maximum segment size sent from the stack.  I've been told that the
maximum packet size that can be sent to the hardware today is 64k, but
my hardware can only take 32k in certain modes per queue due to hardware
limitations.  I have two questions regarding this: 1) where is this
value set in the TCP code, and 2) Is this something that can be
configured on the fly?  If the answer to 2 is no, I will try and put
something together to allow this to happen.

Thanks in advance,
 
PJ Waskiewicz
Intel Corporation
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Question on TSO maximum segment sizes.

2007-10-11 Thread Waskiewicz Jr, Peter P
 From: Rick Jones [EMAIL PROTECTED]
 Date: Thu, 11 Oct 2007 16:50:46 -0700
 
  For just messing about, might it be possible to tweak the socket 
  buffer sizes and tcp_tso_win_divisor to kludge things for a short 
  while?  Couldn't ship that way certainly, but assuming 
 Peter's going 
  to get his broken hardware fixed it might let him limp 
 along until then.
 
 TCP dynamically grows the socket buffer sizes unless the 
 application explicitly sets them via setsockopt() and the 
 limits imposed in those cases are controlled by 
 tcp_{,r,w}mem[] sysctls.  Decreasing those will kill 
 performance exactly for the cases this person cares about.
 
 No, the only way to deal with this is to GSO segment incoming 
 frames inside of the driver when they exceed the HW limits.

Thanks Dave and Rick.  I'll mess with this and make my driver happy
again.

Cheers,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: Andi Kleen [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, October 10, 2007 9:02 AM
 To: Waskiewicz Jr, Peter P
 Cc: David Miller; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 netdev@vger.kernel.org; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net 
 core use batching
 
  We've done similar testing with ixgbe to push maximum descriptor 
  counts, and we lost performance very quickly in the same 
 range you're 
  quoting on NIU.
 
 Did you try it with WC writes to the ring or CLFLUSH?
 
 -Andi

Hmm, I think it might be slightly different, but it still shows queue
depth vs. performance.  I was actually referring to how many descriptors
we can represent a packet with before it becomes a problem wrt
performance.  This morning I tried to actually push my ixgbe NIC hard
enough to come close to filling the ring with packets (384-byte
packets), and even on my 8-core Xeon I can't do it.  My system can't
generate enough I/O to fill the hardware queues before CPUs max out.

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Waskiewicz Jr, Peter P
 IMO the net driver really should provide a hint as to what it wants.
 
 8139cp and tg3 would probably prefer multiple TX queue 
 behavior to match silicon behavior -- strict prio.

If I understand what you just said, I disagree.  If your hardware is
running strict prio, you don't want to enforce strict prio in the qdisc
layer; performing two layers of QoS is excessive, and may lead to
results you don't want.  The reason I added the DRR qdisc is for the Si
that has its own queueing strategy that is not RR.  For Si that
implements RR (like e1000), you can either use the DRR qdisc, or if you
want to prioritize your flows, use PRIO.

-PJ Waskiewicz
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Waskiewicz Jr, Peter P
 A misunderstanding, I think.
 
 To my brain, DaveM's item #2 seemed to assume/require the NIC 
 hardware to balance fairly across hw TX rings, which seemed 
 to preclude the
 8139cp/tg3 style of strict-prio hardware.  That's what I was 
 responding to.
 
 As long as there is some modular way to fit 8139cp/tg3 style 
 multi-TX into our universe, I'm happy :)

Ah hah.  Yes, a misunderstanding on my part.  Thanks for the
clarification.  Methinks more caffeine is required for today...

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: J Hadi Salim [mailto:[EMAIL PROTECTED] On Behalf Of jamal
 Sent: Monday, October 08, 2007 11:27 AM
 To: David Miller
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org; 
 [EMAIL PROTECTED]; Waskiewicz Jr, Peter P; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: [PATCH 2/3][NET_BATCH] net core use batching
 
 This patch adds the usage of batching within the core.
 
 cheers,
 jamal

Hey Jamal,
I still have concerns how this will work with Tx multiqueue.
The way the batching code looks right now, you will probably send a
batch of skb's from multiple bands from PRIO or RR to the driver.  For
non-Tx multiqueue drivers, this is fine.  For Tx multiqueue drivers,
this isn't fine, since the Tx ring is selected by the value of
skb-queue_mapping (set by the qdisc on {prio|rr}_classify()).  If the
whole batch comes in with different queue_mappings, this could prove to
be an interesting issue.
Now I see in the driver HOWTO you recently sent that the driver
will be expected to loop over the list and call it's -hard_start_xmit()
for each skb.  I think that should be fine for multiqueue, I just wanted
to see if you had any thoughts on how it should work, any performance
issues you can see (I can't think of any).  Since the batching feature
and Tx multiqueue are very new features, I'd like to make sure we can
think of any possible issues with them coexisting before they are both
mainline.

Looking ahead for multiqueue, I'm still working on the per-queue
lock implementation for multiqueue, which I know will not work with
batching as it's designed today.  I'm still not sure how to handle this,
because it really would require the batch you send to have the same
queue_mapping in each skb, so you're grabbing the correct queue_lock.
Or, we could have the core grab all the queue locks for each
skb-queue_mapping represented in the batch.  That would block another
batch though if it had any of those queues in it's next batch before the
first one completed.  Thoughts?

Thanks Jamal,

-PJ Waskiewicz
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Waskiewicz Jr, Peter P
 true, that needs some resolution. Heres a hand-waving thought:
 Assuming all packets of a specific map end up in the same 
 qdiscn queue, it seems feasible to ask the qdisc scheduler to 
 give us enough packages (ive seen people use that terms to 
 refer to packets) for each hardware ring's available space. 
 With the patches i posted, i do that via
 dev-xmit_win that assumes only one view of the driver; essentially a
 single ring.  
 If that is doable, then it is up to the driver to say i have 
 space for 5 in ring[0], 10 in ring[1] 0 in ring[2] based on 
 what scheduling scheme the driver implements - the dev-blist 
 can stay the same. Its a handwave, so there may be issues 
 there and there could be better ways to handle this.
 
 Note: The other issue that needs resolving that i raised 
 earlier was in regards to multiqueue running on multiple cpus 
 servicing different rings concurently. 

I can see the qdisc being modified to send batches per queue_mapping.
This shouldn't be too difficult, and if we had the xmit_win per queue
(in the subqueue struct like Dave pointed out).

Addressing your note/issue with different rings being services
concurrently: I'd like to remove the QDISC_RUNNING bit from the global
device; with Tx multiqueue, this bit should be set on each queue (if at
all), allowing multiple Tx rings to be loaded simultaneously.  The
biggest issue today with the multiqueue implementation is the global
queue_lock.  I see it being a hot source of contention in my testing; my
setup is a 8-core machine (dual quad-core procs) with a 10GbE NIC, using
8 Tx and 8 Rx queues.  On transmit, when loading all 8 queues, the
enqueue/dequeue are hitting that lock quite a bit for the whole device.
I really think that the queue_lock should join the queue_state, so the
device no longer manages the top-level state (since we're operating
per-queue instead of per-device).

 It's the core that does that, not the driver; the driver 
 continues to use -hard_start_xmit() (albeit modified one). 
 The idea is not to have many new interfaces.

I'll look closer at this, since I think I confused myself.

 Isnt multiqueue mainline already?

Well, it's in 2.6.23-rc*.  I imagine it won't see much action though
until 2.6.24, since people will be porting drivers during that time.
Plus having the native Rx multiqueue w/NAPI code in 2.6.24 makes sense
to have Tx multiqueue at that time.

 The point behind batching is to reduce the cost of the locks 
 by amortizing across the locks. Even better if one can, they 
 should get rid of locks. Remind me, why do you need the 
 per-queuemap lock? And is it needed from the enqueuing side 
 too? Maybe lets start there to help me understand things?

The multiqueue implementation today enforces the number of qdisc bands
(RR or PRIO) to be equal to the number of Tx rings your hardware/driver
is supporting.  Therefore, the queue_lock and queue_state in the kernel
directly relate to the qdisc band management.  If the queue stops from
the driver, then the qdisc won't try to dequeue from the band.  What I'm
working on is to move the lock there too, so I can lock the queue when I
enqueue (protect the band from multiple sources modifying the skb
chain), and lock it when I dequeue.  This is purely for concurrency of
adding/popping skb's from the qdisc queues.  Right now, we take the
whole global lock to add and remove skb's.  This is the next logical
step for separating the queue dependancy on each other.  Please let me
know if this doesn't make sense, or if you have any questions at all
about my reasoning.  I agree that this is where we should be on the same
page before moving onto anything else in this discussion.  :)

 Sure, that is doable if the driver can set a per 
 queue_mapping xmit_win and the qdisc can be taught to say 
 give me packets for queue_mapping X

Yes, I like this idea very much.  Do that, modify the qdisc to send in
chunks from a queue, and the problem should be solved.

I will try and find some additional cycles to get my patches completely
working, and send them.  It'd be easier I think to see what's going on
if I did that.  I'll also try to make them work with the ideas of
xmit_win per queue and batched queue qdisc sends.  Stay tuned...

Thanks Jamal,

-PJ Waskiewicz
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: parallel networking

2007-10-08 Thread Waskiewicz Jr, Peter P
 Multiply whatever effect you think you might be able to 
 measure due to that on your 2 or 4 way system, and multiple 
 it up to 64 cpus or so for machines I am using.  This is 
 where machines are going, and is going to become the norm.

That along with speeds going to 10 GbE with multiple Tx/Rx queues (with
40 and 100 GbE under discussion now), where multiple CPU's hitting the
driver are needed to push line rate without cratering the entire
machine.

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-08 Thread Waskiewicz Jr, Peter P
 If  net_device_subqueue is visible from both driver and core 
 scheduler area (couldnt tell from looking at whats in there 
 already), then that'll do it.

Yes, I use the net_device_subqueue structs (the state variable in there)
in the prio and rr qdiscs right now.  It's an indexed list at the very
end of struct netdevice.

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-24 Thread Waskiewicz Jr, Peter P
 I have submitted this before; but here it is again.
 Against net-2.6.24 from yesterday for this and all following patches. 
 
 
 cheers,
 jamal

Hi Jamal,
I've been (slowly) working on resurrecting the original design
of my multiqueue patches to address this exact issue of the queue_lock
being a hot item.  I added a queue_lock to each queue in the subqueue
struct, and in the enqueue and dequeue, just lock that queue instead of
the global device queue_lock.  The only two issues to overcome are the
QDISC_RUNNING state flag, since that also serializes entry into the
qdisc_restart() function, and the qdisc statistics maintenance, which
needs to be serialized.  Do you think this work along with your patch
will benefit from one another?  I apologize for not having working
patches right now, but I am working on them slowly as I have some blips
of spare time.

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-24 Thread Waskiewicz Jr, Peter P
 The one thing that seems obvious is to use 
 dev-hard_prep_xmit() in the patches i posted to select the 
 xmit ring. You should be able to do figure out the txmit ring 
 without holding any lock. 

I've looked at that as a candidate to use.  The lock for enqueue would
be needed when actually placing the skb into the appropriate software
queue for the qdisc, so it'd be quick.

 I lost track of how/where things went since the last 
 discussion; so i need to wrap my mind around it to make 
 sensisble suggestions - I know the core patches are in the 
 kernel but havent paid attention to details and if you look 
 at my second patch youd see a comment in
 dev_batch_xmit() which says i need to scrutinize multiqueue more. 

No worries.  I'll try to get things together on my end and provide some
patches to add a per-queue lock.  In the meantime, I'll take a much
closer look at the batching code, since I've stopped looking at the
patches in-depth about a month ago.  :-(

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-24 Thread Waskiewicz Jr, Peter P
 On Mon, 2007-24-09 at 15:57 -0700, Waskiewicz Jr, Peter P wrote:
 
  I've looked at that as a candidate to use.  The lock for 
 enqueue would 
  be needed when actually placing the skb into the 
 appropriate software 
  queue for the qdisc, so it'd be quick.
 
 The enqueue is easy to comprehend. The single device queue 
 lock should suffice. The dequeue is interesting:

We should make sure we're symmetric with the locking on enqueue to
dequeue.  If we use the single device queue lock on enqueue, then
dequeue will also need to check that lock in addition to the individual
queue lock.  The details of this are more trivial than the actual
dequeue to make it efficient though.

 Maybe you can point me to some doc or describe to me the 
 dequeue aspect; are you planning to have an array of txlocks 
 per, one per ring?
 How is the policy to define the qdisc queues locked/mapped to 
 tx rings? 

The dequeue locking would be pushed into the qdisc itself.  This is how
I had it originally, and it did make the code more complex, but it was
successful at breaking the heavily-contended queue_lock apart.  I have a
subqueue structure right now in netdev, which only has queue_state (for
netif_{start|stop}_subqueue).  This state is checked in sch_prio right
now in the dequeue for both prio and rr.  My approach is to add a
queue_lock in that struct, so each queue allocated by the driver would
have a lock per queue.  Then in dequeue, that lock would be taken when
the skb is about to be dequeued.

The skb-queue_mapping field also maps directly to the queue index
itself, so it can be unlocked easily outside of the context of the
dequeue function.  The policy would be to use a spin_trylock() in
dequeue, so that dequeue can still do work if enqueue or another dequeue
is busy.  And the allocation of qdisc queues to device queues is assumed
to be one-to-one (that's how the qdisc behaves now).

I really just need to put my nose to the grindstone and get the patches
together and to the list...stay tuned.

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/4] [NET_SCHED] explict hold dev tx lock

2007-09-24 Thread Waskiewicz Jr, Peter P
  I really just need to put my nose to the grindstone and get the 
  patches together and to the list...stay tuned.
  
  Thanks,
  -PJ Waskiewicz
  -
 
 
 Since we are redoing this, is there any way to make the whole 
 TX path more lockless?  The existing model seems to be more 
 of a monitor than a real locking model.

That seems quite reasonable.  I will certainly see what I can do.

Thanks Stephen,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2] Move the definition of pr_err() into kernel.h

2007-09-11 Thread Waskiewicz Jr, Peter P
 Other pr_*() macros are already defined in kernel.h, but 
 pr_err() was defined multiple times in several other places
 
 Signed-off-by: Emil Medve [EMAIL PROTECTED]
 ---
 
 I'm writing a driver and I've been using the pr_*() macros 
 from kernel.h and I was surprised not to find there pr_err() 
 but defined multiple times (in four different files). I 
 didn't want to define it yet one more time so I did this cleanup
 
 This patch is against Linus' tree v2.6.23-rc6 
 (0d4cbb5e7f60b2f1a4d8b7f6ea4cc264262c7a01)
 
 linux-2.6 scripts/checkpatch.pl 
 0001-Move-the-definition-of-pr_err-into-kernel.h.patch
 Your patch has no obvious style problems and is ready for submission.

Almost.  :-)

 
  drivers/i2c/chips/menelaus.c |   10 --
  drivers/net/spider_net.h |3 ---
  drivers/video/omap/lcd_h3.c  |6 ++
  drivers/video/omap/lcd_inn1610.c |6 ++
  include/linux/kernel.h   |2 ++
  5 files changed, 10 insertions(+), 17 deletions(-)
 
 diff --git a/drivers/i2c/chips/menelaus.c 
 b/drivers/i2c/chips/menelaus.c index d9c92c5..66436ba 100644
 --- a/drivers/i2c/chips/menelaus.c
 +++ b/drivers/i2c/chips/menelaus.c
 @@ -49,8 +49,6 @@
  
  #define DRIVER_NAME  menelaus
  
 -#define pr_err(fmt, arg...)  printk(KERN_ERR DRIVER_NAME : 
 , ## arg);
 -

Unnecessary whitespace removal.

  #define MENELAUS_I2C_ADDRESS 0x72
  
  #define MENELAUS_REV 0x01
 @@ -155,7 +153,7 @@ static int menelaus_write_reg(int reg, u8 value)
   int val = 
 i2c_smbus_write_byte_data(the_menelaus-client, reg, value);
  
   if (val  0) {
 - pr_err(write error);
 + pr_err(DRIVER_NAME : write error);
   return val;
   }
  
 @@ -167,7 +165,7 @@ static int menelaus_read_reg(int reg)
   int val = i2c_smbus_read_byte_data(the_menelaus-client, reg);
  
   if (val  0)
 - pr_err(read error);
 + pr_err(DRIVER_NAME : read error);
  
   return val;
  }
 @@ -1177,7 +1175,7 @@ static int menelaus_probe(struct 
 i2c_client *client)
   /* If a true probe check the device */
   rev = menelaus_read_reg(MENELAUS_REV);
   if (rev  0) {
 - pr_err(device not found);
 + pr_err(DRIVER_NAME : device not found);
   err = -ENODEV;
   goto fail1;
   }
 @@ -1258,7 +1256,7 @@ static int __init menelaus_init(void)
  
   res = i2c_add_driver(menelaus_i2c_driver);
   if (res  0) {
 - pr_err(driver registration failed\n);
 + pr_err(DRIVER_NAME : driver registration failed\n);
   return res;
   }
  
 diff --git a/drivers/net/spider_net.h 
 b/drivers/net/spider_net.h index dbbdb8c..c67b11d 100644
 --- a/drivers/net/spider_net.h
 +++ b/drivers/net/spider_net.h
 @@ -493,7 +493,4 @@ struct spider_net_card {
   struct spider_net_descr darray[0];
  };
  
 -#define pr_err(fmt,arg...) \
 - printk(KERN_ERR fmt ,##arg)
 -

Unnecessary whitespace removal.

  #endif
 diff --git a/drivers/video/omap/lcd_h3.c 
 b/drivers/video/omap/lcd_h3.c index 51807b4..c604d93 100644
 --- a/drivers/video/omap/lcd_h3.c
 +++ b/drivers/video/omap/lcd_h3.c
 @@ -28,8 +28,6 @@
  
  #define MODULE_NAME  omapfb-lcd_h3
  
 -#define pr_err(fmt, args...) printk(KERN_ERR MODULE_NAME : 
  fmt, ## args)
 -

Unnecessary whitespace removal.

  static int h3_panel_init(struct lcd_panel *panel, struct 
 omapfb_device *fbdev)  {
   return 0;
 @@ -48,7 +46,7 @@ static int h3_panel_enable(struct lcd_panel *panel)
   if (!r)
   r = tps65010_set_gpio_out_value(GPIO2, HIGH);
   if (r)
 - pr_err(Unable to turn on LCD panel\n);
 + pr_err(MODULE_NAME : Unable to turn on LCD panel\n);
  
   return r;
  }
 @@ -62,7 +60,7 @@ static void h3_panel_disable(struct 
 lcd_panel *panel)
   if (!r)
   tps65010_set_gpio_out_value(GPIO2, LOW);
   if (r)
 - pr_err(Unable to turn off LCD panel\n);
 + pr_err(MODULE_NAME : Unable to turn off LCD panel\n);
  }
  
  static unsigned long h3_panel_get_caps(struct lcd_panel 
 *panel) diff --git a/drivers/video/omap/lcd_inn1610.c 
 b/drivers/video/omap/lcd_inn1610.c
 index 95604ca..5ef119c 100644
 --- a/drivers/video/omap/lcd_inn1610.c
 +++ b/drivers/video/omap/lcd_inn1610.c
 @@ -27,20 +27,18 @@
  
  #define MODULE_NAME  omapfb-lcd_h3
  
 -#define pr_err(fmt, args...) printk(KERN_ERR MODULE_NAME : 
  fmt, ## args)
 -

Unnecessary whitespace removal.

  static int innovator1610_panel_init(struct lcd_panel *panel,
   struct omapfb_device *fbdev)
  {
   int r = 0;
  
   if (omap_request_gpio(14)) {
 - pr_err(can't request GPIO 14\n);
 + pr_err(MODULE_NAME : can't request GPIO 14\n);
   r = -1;
   goto exit;
   }
   if (omap_request_gpio(15)) {
 - pr_err(can't request GPIO 15\n);
 + pr_err(MODULE_NAME : can't 

RE: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-29 Thread Waskiewicz Jr, Peter P
 From: Krishna Kumar2 [EMAIL PROTECTED]
 Date: Wed, 29 Aug 2007 10:43:23 +0530
 
  The reason was to run parallel copies, not for buffer limitations.
 
 Oh, I see.
 
 I'll note in passing that current lmbench-3 has some 
 parallelization features you could play with, you might want 
 to check it out.

I've also used iperf for parallel connections successfully, and that
will allow you to mess with the buffer sizes as well along with other
variables of the data streams for both TCP and UDP.

Cheers,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [patch 4/4] Update e1000 driver to use devres.

2007-08-16 Thread Waskiewicz Jr, Peter P
 Index: linux-2.6/drivers/net/e1000/e1000_main.c
 ===
 --- linux-2.6.orig/drivers/net/e1000/e1000_main.c
 +++ linux-2.6/drivers/net/e1000/e1000_main.c
 @@ -860,15 +860,14 @@ e1000_probe(struct pci_dev *pdev,  {
   struct net_device *netdev;
   struct e1000_adapter *adapter;
 - unsigned long mmio_start, mmio_len;
 - unsigned long flash_start, flash_len;
 + unsigned long mmio_len, flash_len;
  
   static int cards_found = 0;
   static int global_quad_port_a = 0; /* global ksp3 port 
 a indication */
   int i, err, pci_using_dac;
   uint16_t eeprom_data = 0;
   uint16_t eeprom_apme_mask = E1000_EEPROM_APME;
 - if ((err = pci_enable_device(pdev)))
 + if ((err = pcim_enable_device(pdev)))
   return err;
  
   if (!(err = pci_set_dma_mask(pdev, DMA_64BIT_MASK))  
 @@ -878,20 +877,19 @@ e1000_probe(struct pci_dev *pdev,
   if ((err = pci_set_dma_mask(pdev, DMA_32BIT_MASK)) 
   (err = pci_set_consistent_dma_mask(pdev, 
 DMA_32BIT_MASK))) {
   E1000_ERR(No usable DMA configuration, 
 aborting\n);
 - goto err_dma;
 + return err;
   }
   pci_using_dac = 0;
   }
  
   if ((err = pci_request_regions(pdev, e1000_driver_name)))
 - goto err_pci_reg;
 + return err;
  
   pci_set_master(pdev);
  
 - err = -ENOMEM;
 - netdev = alloc_etherdev(sizeof(struct e1000_adapter));
 + netdev = devm_alloc_etherdev(pdev-dev, sizeof(struct 
 +e1000_adapter));
   if (!netdev)
 - goto err_alloc_etherdev;
 + return -ENOMEM;

I'm a bit confused why you removed the goto's, and then removed all the
target unwinding code at the bottom of e1000_probe().   Those labels
clean up resources if something fails, like the err_sw_init label.  I
don't see anything in the devres code that jumps out at me that explains
why we can do away with these cleanup routines.  Thoughts?

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [patch 4/4] Update e1000 driver to use devres.

2007-08-16 Thread Waskiewicz Jr, Peter P
 Now if I insert a return -ENOMEM right after allocating tx_ring:
 --- a/drivers/net/e1000/e1000_main.c
 +++ b/drivers/net/e1000/e1000_main.c
 @@ -1356,6 +1356,8 @@ e1000_alloc_queues(struct e1000_adapter 
 *adapter)  {
 adapter-tx_ring = kcalloc(adapter-num_tx_queues,
sizeof(struct 
 e1000_tx_ring), GFP_KERNEL);
 +
 +   return -ENOMEM;
 if (!adapter-tx_ring)
 return -ENOMEM;
 
 #insmod
 DEVRES ADD f7a80e80 pcim_release (8 bytes) DEVRES ADD 
 f7a80ca0 devm_free_netdev (4 bytes) DEVRES ADD eb7f0080 
 pcim_iomap_release (24 bytes) DEVRES ADD eb7f 
 devm_kzalloc_release (40 bytes)
 e1000_sw_init: Unable to allocate memory for queues DEVRES 
 REL eb7f devm_kzalloc_release (40 bytes) DEVRES REL 
 eb7f0080 pcim_iomap_release (24 bytes) DEVRES REL f7a80ca0 
 devm_free_netdev (4 bytes) DEVRES REL f7a80e80 pcim_release (8 bytes)
 ACPI: PCI interrupt for device :02:00.0 disabled
 e1000: probe of :02:00.0 failed with error -12
 
 Since we are returning an error from probe the driver core calls
 devres_release_all(dev) which releases all of the resources 
 in the right order.  See really_probe() in drivers/base/dd.c.

This looks fine then to me.  Thanks for the explanation.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Potential u32 classifier bug.

2007-08-15 Thread Waskiewicz Jr, Peter P
 * Waskiewicz Jr, Peter P [EMAIL PROTECTED] 
 2007-08-15 11:02
   There is this very horrible way of using the u32 
 classifier with a 
   negative offset to look into the ethernet header.
  
  Based on this, it sounds like u32 using protocol 802_3 is broken?
 
 You might be expecting too much from u32. The protocol given 
 to u32 is just a filter, it doesn't imply anything beyond that.
 u32 has its usage the way it is, that's way we've added an 
 ematch rather than extending u32 itself.

Ok, that clarifies it a bit.  I've just found a few examples on the net,
one of which is in a TC filter manual
(http://tcn.hypert.net/tcmanual.pdf, section 2.2.1.3 at the bottom of
the section), that was using u32 to simply filter on dest MAC address
without anything elaborate.  Either it worked way back when, or it was a
bogus example.

Thanks again Thomas,

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Potential u32 classifier bug.

2007-08-15 Thread Waskiewicz Jr, Peter P
 * Waskiewicz Jr, Peter P [EMAIL PROTECTED] 
 2007-08-09 18:07
  My big question is: Has anyone recently used the 802_3 
 protocol in tc 
  with u32 and actually gotten it to work?  I can't see how the
  u32_classify() code can look at the mac header, since it is 
 using the 
  network header accessor to start looking.  I think this is an issue 
  with the classification code, but I'm looking to see if I'm doing 
  something stupid before I really start digging into this mess.
 
 There is this very horrible way of using the u32 classifier 
 with a negative offset to look into the ethernet header.

Based on this, it sounds like u32 using protocol 802_3 is broken?

 You might want to look into the cmp ematch which can be 
 attached to almost any classifier. It allows basing offsets 
 on any layer thus making ethernet header filtering trivial.

I'll look at this.  Thanks Thomas for the response!

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Potential u32 classifier bug.

2007-08-09 Thread Waskiewicz Jr, Peter P
I've recently been trying to use tc with u32 classifiers to filter
ethernet traffic based on ethertype.  Although we found and fixed an
issue in the sch_prio call to tc_classify() (thanks Patrick!), this
didn't fix the actual filtering issue.  For those of you who are curious
or are tc-savy, I'm really in a bind and need some help.  This is what I
have so far:
 
Filter to identify and move traffic to a different flow:
 
# tc qdisc add dev eth2 root handle 1: rr bands 8 priomap 7 7 6 6 5 5 4
4 3 3 2 2 1 1 0 0 multiqueue
# tc filter add dev eth2 parent 1: protocol 802_3 prio 1 u32 match u32
0x0800 0x at 12 flowid 1:6

Now this hits tc_classify(), and tp-protocol and skb-protocol match
(be16 of 8 - ETH_P_802_3, which is what I expect), so u32_classify() is
called through the tp-classify() func pointer.  This is where things
get weird.

In net/sched/cls_u32.c, u32_classify() grabs a reference to part of the
network header.  This is question number one: how can the filter look at
the ethernet (mac) header if it's grabbing a reference to the network
header:

u8 *ptr = skb_network_header(skb);

I would think that for a protocol of 802.3, one would want:

u8 *ptr = skb_mac_header(skb);

Changing this though doesn't fix the filter.  Further down is a rather
horrible match criteria to make sure the filter is looking at the right
data before it applies the whole filter list on the skb:

if
((*(u32*)(ptr+key-off+(off2key-offmask))^key-val)key-mask) {

Now if this matches (meaning it evaluates to zero), we can move on.  If
not, we go to the next key node and try again.  Run out of key nodes, we
return -1 and take the default mapping from IP TOS to Linux priority,
and get queued to a band.

My big question is: Has anyone recently used the 802_3 protocol in tc
with u32 and actually gotten it to work?  I can't see how the
u32_classify() code can look at the mac header, since it is using the
network header accessor to start looking.  I think this is an issue with
the classification code, but I'm looking to see if I'm doing something
stupid before I really start digging into this mess.

Thanks in advance,

PJ Waskiewicz
Intel Corporation
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] NET_DMA: remove unused dma_memcpy_to_kernel_iovec

2007-07-30 Thread Waskiewicz Jr, Peter P
 On 7/26/07, David Miller [EMAIL PROTECTED] wrote:
  From: Shannon Nelson [EMAIL PROTECTED]
  Date: Tue, 24 Jul 2007 17:36:06 -0700
 
   (repost - original eaten by vger?)
  
   Al Viro pointed out that dma_memcpy_to_kernel_iovec() really was 
   unreachable and thus unused.  The code originally was there to 
   support in-kernel dma needs, but since it remains unused, 
 we'll pull it out.
  
   Signed-off-by: Shannon Nelson [EMAIL PROTECTED]
 
  Applied, thanks Shannon.
 
 NET_DMA on kernel buffer is pretty useful in ndb, iSCSI 
 target and initiators which uses kernel buffer to receive 
 data. Are there any other issues with dma-memcpy on kernel 
 buffers, if not then following patch makes 
 dma_memcpy_to_kernel_iovec() reachable from tcp_recvmsg.
 I tested this patch and it work fine with unh iSCSI target.
 
 
 comments?
 
 --pravin.

Pravin,
Currently, NET_DMA only has one provider in the kernel, namely
TCP.  As the driver stands now, I'd like to keep this function out since
it really wasn't being used before.  Shannon has posted patches that
will hopefully make it into 2.6.24 that allow multiple clients to
request DMA channels, so things like iSCSI, ndb, UDP, and NFS can ask
for channels without depending on TCP.  Once we get to that point, if we
see the need for the function, we can add it back.
We have tried getting iSCSI accelerated through NET_DMA already,
and didn't see the performance boost that made it worthwhile to do.
Once the multiple client channel patches are in, we can revisit that,
since it can be done right.

Cheers,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [FIX] NET: Fix sch_api and sch_prio to properly set and detect the root qdisc

2007-07-30 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: David Miller [mailto:[EMAIL PROTECTED] 
 Sent: Monday, July 30, 2007 5:14 PM
 To: Waskiewicz Jr, Peter P
 Cc: netdev@vger.kernel.org; [EMAIL PROTECTED]
 Subject: Re: [FIX] NET: Fix sch_api and sch_prio to properly 
 set and detect the root qdisc
 
 From: Waskiewicz Jr, Peter P [EMAIL PROTECTED]
 Date: Fri, 27 Jul 2007 09:22:11 -0700
 
  Are these queued for 2.6.24, or are they going to make it into 
  2.6.23?  I know you're busy with patches and NAPI, but I 
 was curious.  
  Thanks!
 
 I've applied both fixes and will push them into 2.6.23


Many thanks Dave!

Cheers,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [FIX] NET: Fix sch_api and sch_prio to properly set and detect the root qdisc

2007-07-27 Thread Waskiewicz Jr, Peter P
 From: Patrick McHardy [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, July 24, 2007 5:36 PM
 To: Waskiewicz Jr, Peter P
 Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org
 Subject: Re: [FIX] NET: Fix sch_api and sch_prio to properly 
 set and detect the root qdisc

[...]
 Both look good, thanks.
 
 Acked-by: Patrick McHardy [EMAIL PROTECTED]
 

Dave,
Are these queued for 2.6.24, or are they going to make it into
2.6.23?  I know you're busy with patches and NAPI, but I was curious.
Thanks!

Cheers,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [NET_SCHED]: Fix prio/ingress classification logic error

2007-07-26 Thread Waskiewicz Jr, Peter P
 This is the final fix for the problem Peter reported.
 Turns out most other schedulers get it right, the only other 
 case is ingress setting skb-tc_index to the uninitialized 
 value of res.classid.

I've applied this and tested prio and rr, and everything is happy again.
Thanks for finding this and putting the patch together.

I got lost in looking for the bug in the u32_classify() code, and missed
the easier defect...

Thanks again Patrick,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Tc filtering: broken 802_3 classifier?

2007-07-25 Thread Waskiewicz Jr, Peter P

 The protocol match is on skb-protocol, so it case of 
 ethernet its on the ethernet protocol, which is ETH_P_IP or 
 ip for IPv4.

I see that in the code, but the reason I started worrying was when I
added the 802_3 classifier on 8 flows, it would shove all traffic into
flowid 1:1, no matter if it matched or not.

I'll keep investigating and see if I can narrow down what I'm seeing.

Thanks Patrick,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Tc filtering: broken 802_3 classifier?

2007-07-25 Thread Waskiewicz Jr, Peter P
 Waskiewicz Jr, Peter P wrote:
 The protocol match is on skb-protocol, so it case of 
 ethernet its on 
 the ethernet protocol, which is ETH_P_IP or ip for IPv4.
  
  
  I see that in the code, but the reason I started worrying was when I
  added the 802_3 classifier on 8 flows, it would shove all 
 traffic into
  flowid 1:1, no matter if it matched or not.
  
  I'll keep investigating and see if I can narrow down what 
 I'm seeing.
 
 
 I'm not sure what you're expecting. skb-protocol is usually not set
 to ETH_P_802_3, which is why the filter is not matching.

I understand that.  I had two issues, which you cleared up one by
reminding me that the protocol matches on skb-protocol before it tries
to run the -classify() routine.  The other issue I am seeing is with 8
bands, an 802_3 filter is affecting classification of IP traffic.  For
example, I have an 802_3 filter to look for dst MAC address, but an ssh
packet, which without any filters should go into flowid 1:3 on my
system, is getting pushed into flowid 1:1.  I remove the 802_3 filter,
and ssh traffic starts going back into 1:3.  No other filters on the
system.  That's the main issue I'm seeing, so I'll keep investigating to
see what's going on.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Tc filtering: broken 802_3 classifier?

2007-07-25 Thread Waskiewicz Jr, Peter P
 In case of prio, if your manually installed filters don't 
 match, it will fall back to the skb-priority based 
 classification, which is based on tos and is probably 
 responsible for what you're seeing. Feel free to investigate, 
 but you could save us all some time by simply posting what 
 you're doing, what you're expecting and what is actually 
 happening, there's probably a good explanation.

I thought I did that before, but I probably wasn't clear.  I'll try
again (and if I'm still not clear, please pop me in the head).  I am
aware that skb-priority is used if no filter matches, and that is
derived from tos (and gets set in ipsockglue).

This is my setup.  8 bands with prio, with a priomap that is nice and
simple:

# tc qdisc add dev eth0 root handle 1: prio bands 8 priomap 0 0 1 1 2 2
3 3 4 4 5 5 6 6 7 7

With this configuration, ICMP will default to flowid 1:1 (band 0), and
ssh will default to flowid 1:4 (band 3) based on TOS.  I add this filter
(802_3) and all traffic starts flowing into flowid 1:1 (including ssh),
even though it should never match:

# tc filter add dev eth0 protocol 802_3 parent 1: prio 2 u32 match u32
0x0800 0x at 12 flowid 1:6

As soon as I remove the filter:

# tc filter del dev eth0 protocol 802_3 prio 2

ssh flows back into flowid 1:4.  No filters of protocol ip were added,
only the 802.3 filter.

I hope this is more clear as to what I'm seeing.

Thanks,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/2][RFC] Update net core to use devres.

2007-07-24 Thread Waskiewicz Jr, Peter P
 * netdev_pci_remove_one() can replace simple pci device remove
   functions
 
 * devm_alloc_netdev() is like alloc_netdev but allocates 
 memory using devres.
   alloc_netdev() can be removed once all drivers use devres.

Please look at the multiqueue network code that is in 2.6.23.  It
creates alloc_netdev_mq() for multiqueue network device drivers.  If you
want to add this devres feature (which based on the threads, might be
difficult), please make sure you comprehend the multiqueue features.

Thanks,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Tc filtering: broken 802_3 classifier?

2007-07-24 Thread Waskiewicz Jr, Peter P
I've been trying to use tc filtering to filter on ethertype, among other
things in the MAC layer.  I'm running into multiple issues, and want to
put this out there in case I'm using the filters wrong, or if there
really is a bug in the filter code (I've stared at most of it today, and
my head hurts).

Here's the scenario.  I am running on a recent 2.6.23 GIT tree, and am
using sch_prio with no multiqueue turned on in the qdisc.  The network
interface in question is e1000 (no multiqueue).

# tc qdisc add dev eth0 root handle 1: prio

Now to see the flowid's packet counts incrementing, I add explicit
classful qdiscs to the leaves:

# tc qdisc add dev eth0 parent 1:1 handle 10: pfifo
# tc qdisc add dev eth0 parent 1:2 handle 20: pfifo
# tc qdisc add dev eth0 parent 1:3 handle 30: pfifo

Now packet counts can be seen with:

# tc -s qdisc ls dev eth0

I can add a filter for IP for ssh, and it works as intended:

# tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dport
22 0x flowid 1:3

This will shove ssh traffic into the 3rd pfifo queue, where by default
it will flow into flowid 1:1.  This is good.

Now I add a filter for ethernet (802.3), and things aren't as happy:

# tc filter add dev eth0 protocol 802_3 parent 1: prio 2 u32 match u32
first 4 bytes of dst mac address 0x at 0 match u32 last 2
bytes of dst mac address 0x at 4 flowid 1:1

This should match the destination MAC address of outgoing packets, and
put it into flowid 1:1.  For pings, using the normal priomap, they go
into 1:2, so ping should be a good candidate for seeing if it goes into
1:1.  In this case, it does not filter into 1:1.

If I expand this into 8 flows on a multiqueue NIC, using sch_prio or
sch_rr, adding any 802_3 filter to the chain will cause any traffic that
hits the classifier (i.e. no other filters match first) to go into
flowid 1:1, regardless if it actually matches.  Remove the 802_3 filter
from the chain, and all filtering starts working again.

I'm trying to get state from the classifier code now when it's running,
but it's a really big mess of black magic.  I'm wondering if anyone is
also seeing this behavior, and if they've tried to fix it.  If not, I'll
continue to search for a solution, but I'm just polling the community to
see if this is a known issue, or if I'm doing something wrong.

Thanks,

-PJ Waskiewicz
Intel Corporation
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] NET: Fix sch_prio to detect the root qdisc loading

2007-07-22 Thread Waskiewicz Jr, Peter P

 As explained in my last mail, sch-parent is an integer. And 
 it is set when grafting the qdisc, not on initilization, so 
 it is always 0 here when coming from prio_init.
 
 This untested patch should make sure its always set correctly.

Yes, I'm using NULL and 0 interchangeably here, since in the sch_api
code, qdisc_graft(), sch-parent is referenced using NULL and not 0.  I
know it's a u32, and the value is getting set to the proper handle when
the qdisc is not the root qdisc.  When it's the root qdisc, it's left to
be 0.

This patch was tested as well, but looking at the history now, I didn't
set bands correctly, so that's why the multiqueue case failed.  My
mistake, and I'll keep working on this.


Sorry for the extra noise,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Question: how to detect if a qdisc is root or not?

2007-07-21 Thread Waskiewicz Jr, Peter P
 From: Patrick McHardy [EMAIL PROTECTED]
 Date: Sat, 21 Jul 2007 06:01:07 +0200
 
  Waskiewicz Jr, Peter P wrote:
   I just sent out a patch to fix this.
  
  I didn't see it yet.
 
 VGER ate it for some reason and I didn't notice it during my 
 daily bounce analysis.  I asked him on IRC to resend it, but 
 he's away from the computer already :)

I'll get it resent today.  Sorry for the delay.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Question: how to detect if a qdisc is root or not?

2007-07-21 Thread Waskiewicz Jr, Peter P
  Anyways, I tried a few different things, and what it looks like is
  sch-parent will be NULL (0) for the top-level device.  This is 
  sch-correct,
  and trying to mess with that screws up qdisc_graft() when unloading 
  the qdisc.  I also tried adding a TCQ_F_ROOT flag to 
 sch-flags when 
  classid is TC_H_ROOT, but that also screwed up unloading the qdisc.
 
 
 I dont think I understand. Whats the problem with setting 
 sch-parent on initialization instead on grafting as I did in 
 my example patch?
 Please explain the problems arrising on unload in detail.

sch-parent is getting set on initialization, and for the root and
ingress qdiscs, it's left at zero.  If I change that value, when the
root qdisc is unloaded and pfifo_fast is put back into place, the
qdisc_destroy() walks the tree and attempts to free memory from the
handle pointed at by sch-parent.  It stops when sch-parent is NULL, so
sch-parent is actually being set as intended.  The only thing that
confused me is that nowhere in the qdisc is TC_H_ROOT included
explicitly, rather, the root qdisc is where sch-parent is NULL.

So I misunderstood what was actually wrong.  The qdisc code is ok as-is,
it's just that the top-level qdisc (root and ingress) have a sch-parent
of NULL, which is being set correctly today.

Hope that clarifies.

Thanks,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Question: how to detect if a qdisc is root or not?

2007-07-20 Thread Waskiewicz Jr, Peter P
 Its set after grafting the parent, which is after initialization.
 I think what should work is to set it in qdisc_create 
 instead, sch_api.c around line 490:
 
 + sch-parent = handle;
 
 if (handle == TC_H_INGRESS) {
 sch-flags |= TCQ_F_INGRESS;
 sch-stats_lock = dev-ingress_lock; ...
 
 and remove the initialization in qdisc_graft. That would 
 additionally have the benefit that ingress qdiscs also have 
 it initialized properly.

I just sent out a patch to fix this.  Sorry for the delay; my
development machine oops'd in the middle of some disk I/O, and it
corrupted part of the inode table...the ext3 journal application seemed
to make it worse too.  Rebuilt the machine, so I'm back on my feet.

Anyways, I tried a few different things, and what it looks like is
sch-parent will be NULL (0) for the top-level device.  This is correct,
and trying to mess with that screws up qdisc_graft() when unloading the
qdisc.  I also tried adding a TCQ_F_ROOT flag to sch-flags when classid
is TC_H_ROOT, but that also screwed up unloading the qdisc.

The ingress qdisc does have the parent handle set correctly, namely
NULL, since it will always be the top-level qdisc on ingress sessions.

Thx Patrick,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Question: how to detect if a qdisc is root or not?

2007-07-18 Thread Waskiewicz Jr, Peter P
I've been tracking down an issue with the recent multiqueue code,
specifically with sch_prio and sch_rr loading as a root qdisc.  Right
now, we do not want to allow child qdiscs of sch_rr and sch_prio to load
with multiqueue enabled; we want to restrict multiqueue-enabled qdiscs
to the root qdisc (since this is the only thing to push into the
device).  The issue I have is I don't know how to detect if the qdisc
I'm currently processing is the root qdisc, or if it's a child.  From
sch_prio.c:

q-mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
if (q-mq) {
if (sch-handle != TC_H_ROOT)
return -EINVAL;

if (netif_is_multiqueue(sch-dev)) { 

Unfortunately, this code isn't working.  This sch-handle is the handle
assigned to the qdisc upon creation, and it's not TC_H_ROOT.  Now the
information whether or not the qdisc is root gets passed via tc from
userspace, but that lives in the tcmsg struct, not the rtattr list of
options.  Does anyone know, outside of adding another attribute to the
rtattr list, how to detect if a qdisc is a root qdisc within the
prio_tune() initialization routine?  Any and all help would be much
appreciated.

Thanks,
 
PJ Waskiewicz
Intel Corporation
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Question: how to detect if a qdisc is root or not?

2007-07-18 Thread Waskiewicz Jr, Peter P

 You're right, thats a bug. TC_H_ROOT is the parent ID, which 
 is stored in sch-parent. IIRC its also passed to the 
 -init() function.

Unfortunately it's not passed.  It is passed into the -change()
function:

static int prio_init(struct Qdisc *sch, struct rtattr *opt)

static int prio_change(struct Qdisc *sch, u32 handle, u32 parent, struct
rtattr **tca, unsigned long *arg)

I did mess around with sch-parent a bit, with no success (it appears to
be zero / unitialized).  I'll keep investigating.

Thanks,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Question: how to detect if a qdisc is root or not?

2007-07-18 Thread Waskiewicz Jr, Peter P
 Its set after grafting the parent, which is after initialization.
 I think what should work is to set it in qdisc_create 
 instead, sch_api.c around line 490:
 
 + sch-parent = handle;
 
 if (handle == TC_H_INGRESS) {
 sch-flags |= TCQ_F_INGRESS;
 sch-stats_lock = dev-ingress_lock; ...
 
 and remove the initialization in qdisc_graft. That would 
 additionally have the benefit that ingress qdiscs also have 
 it initialized properly.

That's where I'm messing around right now, and I did see that.  Let me
put it in and test.  If all is well, I'll post a patch.

Thanks for the help Patrick.

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-30 Thread Waskiewicz Jr, Peter P
 It would be great if we could finally get a working e1000 
 multiqueue patch so work in this area can actually be tested.

I'm actively working on this right now.  I'm on vacation next week, but
hopefully I can get something working before I leave OLS and post it.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Waskiewicz Jr, Peter P
 Ok everything is checked into net-2.6.23, thanks everyone.

Dave, thank you for your patience and feedback on this whole process.
Patrick and everyone else, thank you for your feedback and assistance.

I am looking at your posed virtualization question, but I need sleep
since I just remembered I'm on east coast time here at OLS, and it's
4:30am.

Thanks again,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P
 PJ Waskiewicz wrote:
 
  +
   static int __init prio_module_init(void)  {
  -   return register_qdisc(prio_qdisc_ops);
  +   int err;
  +   err = register_qdisc(prio_qdisc_ops);
  +   if (!err)
  +   err = register_qdisc(rr_qdisc_ops);
  +   return err;
   }
   
 
 Thats still broken. I'll fix this and some minor cleanness 
 issues myself so you don't have to go through another resend.

Auke and I just looked at register_qdisc() and this code.  Maybe we
haven't had enough coffee yet, but register_qdisc() returns 0 on
success.  So if register_qdisc(prio_qdisc_ops) succeeds, then
rr_qdisc_ops gets registered.  I'm curious what is broken with this.

Thanks Patrick
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] iproute2: sch_rr support in tc

2007-06-28 Thread Waskiewicz Jr, Peter P

 Since rr is not built as a module you could actually put 
 everything in q_prio and share the code. But I don't really care :)

I tried doing this, but I couldn't get things working quite right with
selecting the correct module from sch_prio to sch_rr.  So I just added
the q_rr.c portion of tc, and it fixed the selection issues I was
seeing.  At least for userspace, it may be better to explicitly show the
existance of rr being independent; that's just my personal preference.

Cheers,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P

 Its not error handling. You do:
 
 err = register qdisc 1
 if (err)
   return err;
 err = register qdisc 2
 if (err)
   unregister qdisc 2
 return err
 
 anyways, I already fixed that and cleaned up prio_classify 
 the way I suggested. Will send shortly.

Thanks for fixing; however, the current sch_prio doesn't unregister the
qdisc if register_qdisc() on prio fails, or does that happen implicitly
because the module will probably unload?

Thanks again Patrick,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-28 Thread Waskiewicz Jr, Peter P
 PJ Waskiewicz wrote:
   include/linux/etherdevice.h |3 +-
   include/linux/netdevice.h   |   62 
 ++-
   include/linux/skbuff.h  |4 ++-
   net/core/dev.c  |   27 +--
   net/core/netpoll.c  |8 +++---
   net/core/pktgen.c   |   10 +--
   net/core/skbuff.c   |3 ++
   net/ethernet/eth.c  |9 +++---
   8 files changed, 104 insertions(+), 22 deletions(-)
 
  include/linux/pkt_sched.h |9 +++
  net/sched/Kconfig |   23 +++
  net/sched/sch_prio.c  |  147
 +
   3 files changed, 166 insertions(+), 13 deletions(-)
 
 
 Quick question: where are the sch_generic changes? :)
 
 If you hold for ten minutes I'll post a set of slightly 
 changed patches with the NETDEVICES_MULTIQUEUE option and a 
 fix for this.

Jamal's and KK's qdisc_restart() rewrite took the netif_queue_stopped()
call out of sch_generic.c.  So the underlying qdisc is only responsible
for checking the queue status now before dequeueing.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P
 PJ Waskiewicz wrote:
  +#ifdef CONFIG_NET_SCH_MULTIQUEUE
  +   if (q-mq)
  +   skb-queue_mapping = 
  +   
 q-prio2band[bandTC_PRIO_MAX];
  +   else
  +   skb-queue_mapping = 0;
  +#endif
 
 
 Setting it to zero here is wrong, consider:
 
 root qdisc: prio multiqueue
 child qdisc: prio non-multiqueue
 
 The top-level qdisc will set it, the child qdisc will unset it again.
 When multiqueue is inactive it should not touch it.
 
 I'll fix that as well.

But the child can't assume the device is multiqueue if the child is
non-multiqueue.  This is the same issue with IP forwarding, where if you
forward through a multiqueue device to a non-mq device, you don't know
if the destination device is multiqueue.  So the last qdisc to actually
dequeue into a device should have control what the queue mapping is.  If
a user had a multiqueue qdisc as root, and configured a child qdisc as
non-mq, that is a configuration error if the underlying device is indeed
multiqueue IMO.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-28 Thread Waskiewicz Jr, Peter P
 Waskiewicz Jr, Peter P wrote:
 Quick question: where are the sch_generic changes? :)
 
 If you hold for ten minutes I'll post a set of slightly changed 
 patches with the NETDEVICES_MULTIQUEUE option and a fix for this.
  
  
  Jamal's and KK's qdisc_restart() rewrite took the 
 netif_queue_stopped()
  call out of sch_generic.c.  So the underlying qdisc is only 
 responsible
  for checking the queue status now before dequeueing.
 
 
 Yes, I noticed that now. Doesn't seem right though as long as
 queueing while queue is stopped is treated as a bug by the
 drivers.
 
 But I vaguely recall seeing a discussion about this, I'll check
 the archives.

The basic gist is before the dequeue is done, the qdisc is locked by the
qdisc is running bit, so another CPU cannot get in there.  So if the
queue isn't stopped when a dequeue is done, that same queue should not
be stopped when hard_start_xmit() is called.  The only thing I could
think of that could happen is some out-of-band cleanup routine in the
driver where the tx_ring lock is held, and the skb is bounced back,
where the driver returns NETIF_TX_BUSY, and you requeue.  This is an
extreme corner case, so the check could be removed.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P
 Absolutely not. First of all, its perfectly valid to use 
 non-multiqueue qdiscs on multiqueue devices. Secondly, its 
 only the root qdisc that has to know about multiqueue since 
 that one controls the child qdiscs.
 
 Think about it, it makes absolutely no sense to have the 
 child qdisc even know about multiqueue. Changing my example 
 to have a multiqueue qdisc as child:
 
 root qdisc: 2 band prio multiqueue
 child qdisc of band 0: 2 band prio multiqueue
 
 When the root qdisc decides to dequeue band0, it checks 
 whether subqueue 0 is active and dequeues the child qdisc. If 
 the child qdisc is indeed another multiqueue qdisc as you 
 suggest, if might decide to dequeue its own band 1 and checks 
 that subqueue state. So where should the packet finally end 
 up? And what if one of both subqueues is inactive?
 
 The only reasonable thing it can do is not care about 
 multiqueue and just dequeue as usual. In fact I think it 
 should be an error to configure multiqueue on a non-root qdisc.

Ack.  This is a thought process that trips me up from time to time...I
see child qdisc, and think that's the last qdisc to dequeue and send to
the device, not the first one to dequeue.  So please disregard my
comments before; I totally agree with you.  Great catch here; I really
like the prio_classify() cleanup.

As always, many thanks for your feedback and help Patrick.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P
 Waskiewicz Jr, Peter P wrote:
 [...]
 The only reasonable thing it can do is not care about 
 multiqueue and 
 just dequeue as usual. In fact I think it should be an error to 
 configure multiqueue on a non-root qdisc.
  
  
  Ack.  This is a thought process that trips me up from time 
 to time...I 
  see child qdisc, and think that's the last qdisc to dequeue 
 and send 
  to the device, not the first one to dequeue.  So please 
 disregard my 
  comments before; I totally agree with you.  Great catch 
 here; I really 
  like the prio_classify() cleanup.
 
 
 Thanks. This updated patch makes configuring a non-root qdisc 
 for multiqueue an error.
 

The patch looks fine to me.  Thanks Patrick.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-28 Thread Waskiewicz Jr, Peter P
 Waskiewicz Jr, Peter P wrote:
 
  Looking at Peter's multiqueue patch, which should include all 
  hard_start_xmit users (I'm not seeing sch_teql though,
  Peter?) the only other one is pktgen.
  
 
  Ugh.  That is another netif_queue_stopped() that needs 
  netif_subqueue_stopped().  I can send an updated patch for 
 the core to 
  fix this based from your patches Patrick.

 
 I still have the tree around, here's an updated version.
 
 
  So what do we do about netpoll then wrt netif_(sub)queue_stopped() 
  being removed from qdisc_restart()?  The fallout of having 
 netpoll() 
  cause a queue to stop (queue 0 only) is the skb sent will 
 be requeued, 
  since the driver will return NETIF_TX_BUSY if this actually 
 happens.  
  But this is a corner case, and we won't lose packets; we'll 
 just have 
  increased latency on that queue.  Should I worry about this or just 
  move forward with the sch_teql.c change and repost the core patch?

 
 
 I don't think you need to worry about that, the subqueue 
 patch just follows the existing code.

Thanks Patrick for taking care of this.  I am totally fine with this
patch; if anyone else has feedback, please send it.  If not, I'm excited
to see if these can be considered for 2.6.23 now.  :)  Thanks everyone
for the help.

Cheers,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-25 Thread Waskiewicz Jr, Peter P
  /* ensure 32-byte alignment of both the device and 
 private area */
  -   alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST)  
 ~NETDEV_ALIGN_CONST;
  +   alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
  +(sizeof(struct net_device_subqueue) * 
 (queue_count - 1))) 
 
 
 Why queue_count - 1 ? It should be queue_count I think.

I'm not sure what went through my head, but I'll fix this.

 Otherwise ACK for this patch except that it should also 
 contain the sch_generic changes.

I misread your previous mail; I'll get the sch_generic.c changes into
this patch.

Thanks Patrick,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RTNETLINK]: Add nested compat attribute

2007-06-25 Thread Waskiewicz Jr, Peter P
 This patch adds a new attribute type that can be used to 
 replace non-nested attributes that contain structures by 
 nested ones in a compatible way.
 
 This can be used in cases like Peter's who is trying to 
 extend sch_prio, which currently uses a fixed structure 
 without any holes.
 
 Switching to nested attributes makes sure that the next 
 person won't run into the same problem.

I've been using this patch and the IPROUTE2 patches Patrick has proposed
with no issues.  Can someone else look at these patches when they have
time?  I'd be interested in seeing them make it into 2.6.23.

Thanks!!

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-25 Thread Waskiewicz Jr, Peter P
   enum
   {
  -   TCA_PRIO_UNPSEC,
  -   TCA_PRIO_TEST,
 
 
 You misunderstood me. You can work on top of my compat 
 attribute patches, but the example code should not have to go 
 in to apply your patch.

Ok.  I'll fix my patches.

  diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 
  475df84..7f14fa6 100644
  --- a/net/sched/Kconfig
  +++ b/net/sched/Kconfig
  @@ -102,8 +102,16 @@ config NET_SCH_ATM
To compile this code as a module, choose M here: the
module will be called sch_atm.
   
  +config NET_SCH_BANDS
  +bool Multi Band Queueing (PRIO and RR)
 
 This options seems useless. Its not used *anywhere* except 
 for dependencies.

I was trying to group the multiqueue qdiscs together with this.  But I
can see just having the multiqueue option for scheduling will cover
this.  I'll remove this.

  +config NET_SCH_BANDS_MQ
  +   bool Multiple hardware queue support
  +   depends on NET_SCH_BANDS
 
 
 OK, again:
 
 Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. 
 Nothing at all changes for NET_SCH_PRIO itself. Additionally 
 introduce a boolean NET_SCH_MULTIQUEUE. No dependencies at 
 all. Use NET_SCH_MULTIQUEUE to guard the multiqueue code in 
 sch_prio.c.
 Your current code doesn't even have any ifdefs anymore 
 though, so this might not be needed at all.
 
 Additionally you could later introduce E1000_MULTIQUEUE and 
 have that select NET_SCH_MULTIQUEUE.

I'll clean this up.  Thanks for the persistance.  :)

  diff --git a/net/sched/sch_generic.c 
 b/net/sched/sch_generic.c index 
  9461e8a..203d5c4 100644
  --- a/net/sched/sch_generic.c
  +++ b/net/sched/sch_generic.c
  @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct 
 net_device *dev)
  spin_unlock(dev-queue_lock);
   
  ret = NETDEV_TX_BUSY;
  -   if (!netif_queue_stopped(dev))
  +   if (!netif_queue_stopped(dev) 
  +   !netif_subqueue_stopped(dev, skb-queue_mapping))
  /* churn baby churn .. */
  ret = dev_hard_start_xmit(skb, dev);
 
 I'll try again - please move this to patch 2/3.

I'm sorry; I misread your original comment about this.  I'll move the
change (although this disappears with Jamal's and KK's qdisc_restart()
cleanup).

  diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 
  40a13e8..8a716f0 100644
  --- a/net/sched/sch_prio.c
  +++ b/net/sched/sch_prio.c
  @@ -40,9 +40,11 @@
   struct prio_sched_data
   {
  int bands;
  +   int curband; /* for round-robin */
  struct tcf_proto *filter_list;
  u8  prio2band[TC_PRIO_MAX+1];
  struct Qdisc *queues[TCQ_PRIO_BANDS];
  +   unsigned char mq;
   };
   
   
  @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc 
  *sch, int *qerr)  #endif
  if (TC_H_MAJ(band))
  band = 0;
  +   if (q-mq)
  +   skb-queue_mapping = 
  +   
 q-prio2band[bandTC_PRIO_MAX];
  +   else
  +   skb-queue_mapping = 0;
 
 
 Might look cleaner if you have one central point where 
 queue_mapping is set and the band is returned.

I'll see how easy it'll be to condense this; because the queue being
selected in the qdisc can be different based on a few different things,
I'm not sure how easy it'll be to assign this in one spot.  I'll play
around with it and see what I can come up with.

  +   /* If we're multiqueue, make sure the number of incoming bands
  +* matches the number of queues on the device we're 
 associating with.
  +*/
  +   if (tb[TCA_PRIO_MQ - 1])
  +   q-mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
 
 
 If you're using it as a flag, please use RTA_GET_FLAG(), 
 otherwise RTA_GET_U8.

Will do.  Thanks.

  +   if (q-mq  (qopt-bands != sch-dev-egress_subqueue_count))
  +   return -EINVAL;
   
  sch_tree_lock(sch);
  q-bands = qopt-bands;
  @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, 
 struct sk_buff *skb)
  memcpy(opt.priomap, q-prio2band, TC_PRIO_MAX+1);
   
  nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), opt);
  -   RTA_PUT_U32(skb, TCA_PRIO_TEST, 321);
  +   RTA_PUT_U8(skb, TCA_PRIO_MQ, q-mq);
 
 
 And RTA_PUT_FLAG. Now that I think of it, does it even makes 
 sense to have a prio private flag for this instead of a qdisc 
 global one?

There currently aren't any other qdiscs that are natural fits for
multiqueue that I can see.  I can see the benefit though of having this
as a global flag in the qdisc API; let me check it out, and if it makes
sense, I can move it.

   static int __init prio_module_init(void)  {
  -   return register_qdisc(prio_qdisc_ops);
  +   register_qdisc(prio_qdisc_ops);
  +   register_qdisc(rr_qdisc_ops);
 
 Proper error handling please.

Will do.

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  

RE: [RTNETLINK]: Add nested compat attribute

2007-06-25 Thread Waskiewicz Jr, Peter P
 From: Waskiewicz Jr, Peter P [EMAIL PROTECTED]
 Date: Mon, 25 Jun 2007 10:14:37 -0700
 
   This patch adds a new attribute type that can be used to replace 
   non-nested attributes that contain structures by nested ones in a 
   compatible way.
   
   This can be used in cases like Peter's who is trying to extend 
   sch_prio, which currently uses a fixed structure without 
 any holes.
   
   Switching to nested attributes makes sure that the next 
 person won't 
   run into the same problem.
  
  I've been using this patch and the IPROUTE2 patches Patrick has 
  proposed with no issues.  Can someone else look at these 
 patches when 
  they have time?  I'd be interested in seeing them make it 
 into 2.6.23.
 
 I've just put Patrick's patch into the net-2.6.23 tree.

Awesome Dave!!  Thank you very much. :)

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RTNETLINK]: Add nested compat attribute

2007-06-25 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: David Miller [mailto:[EMAIL PROTECTED] 
 Sent: Monday, June 25, 2007 2:30 PM
 To: Waskiewicz Jr, Peter P
 Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; 
 [EMAIL PROTECTED]
 Subject: Re: [RTNETLINK]: Add nested compat attribute
 
 From: Waskiewicz Jr, Peter P [EMAIL PROTECTED]
 Date: Mon, 25 Jun 2007 14:01:31 -0700
 
  Awesome Dave!!  Thank you very much. :)
 
 Please get your next round of patches ready, Patrick and I 
 can review them and barring any serious issues we can finally 
 put this stuff in to net-2.6.23.

I'm putting them into the latest 2.6.23 tree right now - I'll have them
tested and sent upstream later today.

Thanks Dave,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-25 Thread Waskiewicz Jr, Peter P
  @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc 
  *sch, int *qerr)  #endif
  if (TC_H_MAJ(band))
  band = 0;
  +   if (q-mq)
  +   skb-queue_mapping = 
  +   
 q-prio2band[bandTC_PRIO_MAX];
  +   else
  +   skb-queue_mapping = 0;
 
 
 Might look cleaner if you have one central point where 
 queue_mapping is set and the band is returned.

I've taken a stab at this.  I can have one return point, but I'll still
have multiple assignments of skb-queue_mapping due to the different
branches for which queue to select in the qdisc.  I suppose we can do a
rewrite of prio_classify(), but to me that seems beyond the scope of the
multiqueue patches themselves.  What do you think?

Thanks,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-25 Thread Waskiewicz Jr, Peter P
 Thats not necessary. I just though you could add one exit point:
 
 
 ...
 out:
 skb-queue_mapping = q-mq ? band : 0;
 return q-queues[band];
 }
 
 But if that doesn't work don't bother ..

Unfortunately it won't, given how band might be used like this to select
the queue:

return q-queues[q-prio2band[bandTC_PRIO_MAX]];

I'll keep this in mind though, and if it can be done cleanly, I'll
submit a patch.

Thanks Patrick,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RTNETLINK]: Add nested compat attribute

2007-06-25 Thread Waskiewicz Jr, Peter P
 From: Patrick McHardy [EMAIL PROTECTED]
 Date: Mon, 25 Jun 2007 23:08:09 +0200
 
  David Miller wrote:
  
   I've been using this patch and the IPROUTE2 patches Patrick has 
   proposed with no issues.  Can someone else look at these patches 
   when they have time?  I'd be interested in seeing them 
 make it into 2.6.23.
   
  
   I've just put Patrick's patch into the net-2.6.23 tree.
 
  
  You seem to have added the netlink patch for the generic netlink 
  attributes. Peter needs the rtnetlink attribute patch since qdiscs 
  still use the old stuff.
  
  Attached again to this mail.
 
 Thanks for the clarification, I thought they were both the 
 same patch, resent for the sake of tgraf seeing it.
 
 I've applied this one too, thanks Patrick.

It looks like the one Patrick resent was the older version that requires
a typecast.  This is the function prototype currently in the kernel:

+extern int rtattr_parse_nested_compat(struct rtattr *tb[], int maxattr,
+ struct rtattr *rta, void **data,
int len);

This is the newer version:

+extern int __rtattr_parse_nested_compat(struct rtattr *tb[], int
maxattr,
+   struct rtattr *rta, int len);
+#define rtattr_parse_nested_compat(tb, max, rta, data, len) \
+({ data = RTA_PAYLOAD(rta) = len ? RTA_DATA(rta) : NULL; \
+   __rtattr_parse_nested_compat(tb, max, rta, len); })
+

I can send the update to what's in 2.6.23 as the first patch of my
series, or I can write my function calls using the older callback (which
isn't a problem).  Which would you prefer?

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-22 Thread Waskiewicz Jr, Peter P
   #include linux/module.h
  @@ -40,9 +42,13 @@
   struct prio_sched_data
   {
  int bands;
  +#ifdef CONFIG_NET_SCH_RR
  +   int curband; /* for round-robin */
  +#endif
  struct tcf_proto *filter_list;
  u8  prio2band[TC_PRIO_MAX+1];
  struct Qdisc *queues[TCQ_PRIO_BANDS];
  +   u16 band2queue[TC_PRIO_MAX + 1];

 
 Why is this still here? Its a 1:1 mapping.

Thought about this more last night and this morning.  As far as I can
tell, I still need this.  If the qdisc gets loaded with multiqueue
turned on, I can just use the value of band to assign
skb-queue_mapping.  But if the qdisc is loaded without multiqueue
support, then I need to assign a value of zero to queue_mapping, or not
assign it at all (it will be zero'd out before the call to -enqueue()
in dev_queue_xmit()).  But I'd rather not have a conditional in the
hotpath checking if the qdisc is multiqueue; I'd rather have the array
to match the bands so I can just do an assignment.

What do you think?

Thanks,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-22 Thread Waskiewicz Jr, Peter P
 Patrick McHardy wrote:
  Waskiewicz Jr, Peter P wrote:
  
 Thought about this more last night and this morning.  As 
 far as I can 
 tell, I still need this.  If the qdisc gets loaded with multiqueue 
 turned on, I can just use the value of band to assign
 skb-queue_mapping.  But if the qdisc is loaded without multiqueue
 support, then I need to assign a value of zero to queue_mapping, or 
 not assign it at all (it will be zero'd out before the call to 
 -enqueue() in dev_queue_xmit()).  But I'd rather not have a 
 conditional in the hotpath checking if the qdisc is multiqueue; I'd 
 rather have the array to match the bands so I can just do 
 an assignment.
 
 What do you think?
  
  
  
  I very much doubt that it has any measurable impact. You 
 can also add 
  a small inline function
  
  void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
 
 
 OK I didn't really listen obviously :) A compile time option 
 won't help. Just remove it and assign it conditionally.

Sounds good.  Thanks Patrick.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread Waskiewicz Jr, Peter P

 BTw, couldn't you just merge sch_rr with prio? AFAICT you 
 only need a new dequeue function, a new struct Qdisc_ops and 
 a MODULE_ALIAS.

Ok, I have this somewhat working, but need to poll for some help from
the community.  I used MODULE_ALIAS(sch_rr) in sch_prio.c, and
modprobe is happily loading sch_prio.ko when I ask for sch_rr.ko.  It
also recognizes the correct ops struct to associate with the instance of
the module.  However, when I try to load the qdisc via tc (modified
version that knows sch_rr), I'm getting No Such File or Directory from
RTNETLINK.  It's looking for sch_rr.ko, and is bailing.  I've scoured
the code looking for a reason why, and am drawing a blank.  I'll
continue looking, but if this sounds familiar to someone who knows how
to get around this, please reply and let me know.

Thanks,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread Waskiewicz Jr, Peter P

 Please post the code.
 

Code is attached.  Please forgive the attachment and any whitespace
damage...currently using Doubtlook to send this (cringe).

Thanks,
-PJ Waskiewicz


sch_prio.c
Description: sch_prio.c


RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread Waskiewicz Jr, Peter P
 Waskiewicz Jr, Peter P wrote:
  Please post the code.
 
  
 
  Code is attached.  Please forgive the attachment and any whitespace 
  damage...currently using Doubtlook to send this (cringe).

 
 The code looks correct. Are you sure you had the config 
 option enabled during your test?


Yes.  This is the tc command I used to configure the qdisc (with q_rr.c
attached from my patches iproute2 package):

# tc qdisc add dev eth2 root handle 1: rr bands 8
RTNETLINK answers: No such file or directory

At this point, sch_prio gets loaded correctly, but it obviously fails to
finish loading the qdisc.  Using prio works though:

# tc qdisc add dev eth2 root handle 1: prio bands 8

And yes, the NIC I'm working with has 8 queues, just to be clear.  Any
help is definitely appreciated; I'm going to keep this copy of the code
for now, but am going to get the separate module written back up just in
case this can't be solved in the short-term.  This is the only piece
keeping me from sending these patches back for consideration, so I'll
keep the parallel effort going.

Thanks Patrick,

-PJ Waskiewicz


q_rr.c
Description: q_rr.c


RE: [PATCH] NET: Multiple queue hardware support

2007-06-21 Thread Waskiewicz Jr, Peter P
 PJ Waskiewicz wrote:
  I did not modify other users of netif_queue_stopped() in 
  net/core/netpoll.c, net/core/dev.c, or net/core/pktgen.c, since no 
  classification occurs for the skb being sent to the device.  
  Therefore, packets should always be ending up in queue 0, 
 so there's no need to check the subqueue status either.

 
 Thats not correct. Subqueue 0 may be full and the queue still running.
 
 I'll look over the patches later.

I'm working something up to address this.  The last time I thought about
this, I had issues with software devices, such as loopback.  They
weren't allocating any subqueues at all, so they would call
netif_subqueue_stopped() and panic the kernel.  However, now with Dave's
request to index egress_subqueue, the first queue is allocated for
everyone, so loopback and other software devices should be happy.  Let
me put these checks back in, test it out, and resend if I don't see any
issues.

Sorry for the thrash,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread Waskiewicz Jr, Peter P
 The dependencies seem to be very confused. SCHED_PRIO does 
 not depend on anything new, SCH_RR also doesn't depend on 
 anything. SCH_PRIO_MQ and SCH_RR_MQ (which is missing) depend 
 on SCH_PRIO/SCH_RR. A single NET_SCH_MULTIQUEUE option seems 
 better than adding one per scheduler though.

I agree with a NET_SCH_MULTIQUEUE option.  However, SCH_RR does depend
on SCH_PRIO being built since it's the same code, doesn't it?  Maybe I'm
not understanding something about the build process.  I'll clean this
up.

 
  --- a/net/sched/sch_prio.c
  +++ b/net/sched/sch_prio.c
  @@ -9,6 +9,8 @@
* Authors:Alexey Kuznetsov, [EMAIL PROTECTED]
* Fixes:   19990609: J Hadi Salim [EMAIL PROTECTED]:
*  Init --  EINVAL when opt undefined
  + * Additions:  Peter P. Waskiewicz Jr. 
 [EMAIL PROTECTED]
  + * Added round-robin scheduling for selection at load-time

 
 git keeps changelogs, please don't add it here.

Roger.

  struct tcf_proto *filter_list;
  u8  prio2band[TC_PRIO_MAX+1];
  struct Qdisc *queues[TCQ_PRIO_BANDS];
  +   u16 band2queue[TC_PRIO_MAX + 1];

 
 Why is this still here? Its a 1:1 mapping.

I'll fix this.

  @@ -211,6 +265,22 @@ static int prio_tune(struct Qdisc 
 *sch, struct rtattr *opt)
  return -EINVAL;
  }
   
  +   /* If we're prio multiqueue or are using round-robin, make
  +* sure the number of incoming bands matches the number of
  +* queues on the device we're associating with.
  +*/
  +#ifdef CONFIG_NET_SCH_RR
  +   if (strcmp(rr, sch-ops-id) == 0)
  +   if (qopt-bands != sch-dev-egress_subqueue_count)
  +   return -EINVAL;
  +#endif
  +
  +#ifdef CONFIG_NET_SCH_PRIO_MQ
  +   if (strcmp(prio, sch-ops-id) == 0)
  +   if (qopt-bands != sch-dev-egress_subqueue_count)
  +   return -EINVAL;
  +#endif

 
 For the tenth time now, the user should enable this at 
 runtime. You can't just break things dependant on config options.

I had this in sch_prio and tc before, and was told to remove it because
of ABI issues.  I can put it back in, but I'm not sure what those
previous ABI issues were.  Was it backwards compatibility that you
referred to before that was broken?

As always, the feedback is very much appreciated.  I'll get these fixes
in as soon as possible.

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-19 Thread Waskiewicz Jr, Peter P
 From: PJ Waskiewicz [EMAIL PROTECTED]
 Date: Mon, 18 Jun 2007 11:42:29 -0700
 
  +
  +   /* The TX queue control structures */
  +   struct net_device_subqueue  *egress_subqueue;
  +   int egress_subqueue_count;
 
 Since every net device will have at least one subqueue, I 
 would suggest that you do this as follows:
 
 1) In net_device change the quoted part of the patch above to:
 
   int egress_subqueue_count;
   struct net_device_subqueue  egress_subqueue[0];
 
 2) In alloc_netdev():
 
Factor (sizeof(struct egress_subqueue) * num_subqueues) into
the net_device allocation size, place the priv area after
the subqueues.
 
 This will save us pointer dereferences on all of these quite 
 common accesses.

Thanks Dave!  I'll be putting this in today and run a test pass on it.
Thanks for the feedback.

-PJ Waskiewicz 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-19 Thread Waskiewicz Jr, Peter P
 From: PJ Waskiewicz [EMAIL PROTECTED]
 Date: Mon, 18 Jun 2007 11:42:29 -0700
 
  +
  +   /* The TX queue control structures */
  +   struct net_device_subqueue  *egress_subqueue;
  +   int egress_subqueue_count;
 
 Since every net device will have at least one subqueue, I 
 would suggest that you do this as follows:
 
 1) In net_device change the quoted part of the patch above to:
 
   int egress_subqueue_count;
   struct net_device_subqueue  egress_subqueue[0];
 
 2) In alloc_netdev():
 
Factor (sizeof(struct egress_subqueue) * num_subqueues) into
the net_device allocation size, place the priv area after
the subqueues.
 
 This will save us pointer dereferences on all of these quite 
 common accesses.

I've been thinking about this more today, so please bear with me if I'm
missing something.  Right now, with how qdisc_restart() is running, we'd
definitely call netif_subqueue_stopped(dev, skb-queue_mapping) for all
multi-ring and single-ring devices.  However, with Jamal's and Krishna's
qdisc_restart() rewrite patch, the checks for netif_queue_stopped() and
netif_subqueue_stopped() would be pushed into the qdisc's -dequeue()
functions.  If that's the case, then the only checks on
egress_subqueue[x] would be for multi-ring adapters, or if someone was
silly enough to load sch_{rr|prio} onto a single-ring device with
multiqueue hardware support compiled in.  Given all of that, I'm not
sure allocating egress_subqueue[0] at compile time or runtime would make
any difference either way.  If I'm missing something, please let me know
- I'd like to reduce any unnecessary pointer dereferences if possible,
but given the proposed qdisc_restart(), I think the code as-is would be
ok.

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-19 Thread Waskiewicz Jr, Peter P
 It's not being allocated at compile time, it's being 
 allocated linearly into one block of ram in order to avoid 
 pointer derefs but it's still dynamic in that the size 
 isn't known until the alloc_netdev() call.
 
 We do this trick all over the networking, TCP sockets are 3 
 or 4 different structures, all allocated into a linear block 
 of memory so that:
 
 1) only one memory allocation needs to be done for each object
create, this is not relevant for the net_device case
except in extreme examples bringing thousands of devices
up and down which I suppose someone can give a realistic
example of :-)
 
 2) each part can be accessed as an offset from the other instead
of a pointer deref which costs a cpu memory read
 
 Please change the allocation strategy as I recommended, thanks.

Later this afternoon someone in my group came over and bonked me in the
head, and made this obvious to me.  Thanks for the follow-up; great
catch on the efficiency.  It's in my code now and is currently running
some test passes.

Thanks Dave,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/2 - rev2] qdisc_restart - readability changes plus onebug fix.

2007-06-18 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: Krishna Kumar [mailto:[EMAIL PROTECTED] 
 Sent: Sunday, June 17, 2007 9:51 PM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; Waskiewicz 
 Jr, Peter P; [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org
 Subject: [PATCH 1/2 - rev2] qdisc_restart - readability 
 changes plus onebug fix.
 
 New changes :
 
 - Incorporated Peter Waskiewicz's comments.
 - Re-added back one warning message (on driver returning wrong value).

I've run these patches last night and this morning without issue.  I see
no issues with these updated patches from my testing.

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-18 Thread Waskiewicz Jr, Peter P
 PJ Waskiewicz wrote:
  Add the multiqueue hardware device support API to the core network 
  stack.  Allow drivers to allocate multiple queues and 
 manage them at 
  the netdev level if they choose to do so.

 
 Should be 2/3 and qdisc changes should be 3/3. Well actually 
 the qdisc sch_generic changes belong in this patch as well 
 and the qdisc changes should be split in one change per qdisc.

I'll re-arrange the patches.

   /* Functions used for multicast support */ diff --git 
  a/include/linux/skbuff.h b/include/linux/skbuff.h index 
  e7367c7..8bcd870 100644
  --- a/include/linux/skbuff.h
  +++ b/include/linux/skbuff.h
  @@ -215,6 +215,7 @@ typedef unsigned char *sk_buff_data_t;
* @pkt_type: Packet class
* @fclone: skbuff clone status
* @ip_summed: Driver fed us an IP checksum
  + * @queue_mapping: Queue mapping for multiqueue devices
* @priority: Packet queueing priority
* @users: User count - see {datagram,tcp}.c
* @protocol: Packet protocol from driver
  @@ -269,6 +270,7 @@ struct sk_buff {
  __u16   csum_offset;
  };
  };
  +   __u16   queue_mapping;

 
 We have a 4 byte hole on 64 bit after iif where this would fit in.

I'll move the variable.  Thanks for this!

  @@ -3377,12 +3381,23 @@ struct net_device *alloc_netdev(int 
 sizeof_priv, const char *name,
  if (sizeof_priv)
  dev-priv = netdev_priv(dev);
   
  +   alloc_size = (sizeof(struct net_device_subqueue) * queue_count);
  +  
  +   p = kzalloc(alloc_size, GFP_KERNEL);
  +   if (!p) {
  +   printk(KERN_ERR alloc_netdev: Unable to 
 allocate queues.\n);
  +   return NULL;

 
 Same leak here that you already fixed a couple of posts ago.

Heavy sigh...I synced off an older personal git repository.  I'll clean
this and the casts up and repost with the other feedback.

Thanks Patrick,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-18 Thread Waskiewicz Jr, Peter P
 PJ Waskiewicz wrote:
 
  diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 
  6d7542c..44ecdc6 100644
  --- a/net/sched/sch_prio.c
  +++ b/net/sched/sch_prio.c
  }
  +#ifdef CONFIG_NET_SCH_PRIO_MQ
  +   /* setup queue to band mapping */
  +   if (q-bands  sch-dev-egress_subqueue_count) {
  +   qmapoffset = 1;
  +   mod = sch-dev-egress_subqueue_count;
  +   } else {
  +   mod = q-bands % sch-dev-egress_subqueue_count;
  +   qmapoffset = q-bands / sch-dev-egress_subqueue_count
  +   + ((mod) ? 1 : 0);
  +   }
  +
  +   queue = 0;
  +   offset = 0;
  +   for (i = 0; i  q-bands; i++) {
  +   q-band2queue[i] = queue;
  +   if ( ((i + 1) - offset) == qmapoffset) {
  +   queue++;
  +   offset += qmapoffset;
  +   if (mod)
  +   mod--;
  +   qmapoffset = q-bands /
  +   sch-dev-egress_subqueue_count +
  +   ((mod) ? 1 : 0);
  +   }
  +   }
  +#endif
 
 This should really go, its not only ugly, it also makes no 
 sense to use more bands than queues since that means multiple 
 bands of different priorities are controlled through a single 
 queue state, so lower priority bands can stop the queue for 
 higher priority ones.
 
 The user should enable multiqueue behaviour and using it with 
 a non-matching parameters should simply return an error.

That sounds fine to me.  I'll clean this and sch_prio up.

  diff --git a/net/sched/sch_rr.c b/net/sched/sch_rr.c new file mode 
  100644 index 000..ce9f237
  --- /dev/null
  +++ b/net/sched/sch_rr.c
 
 For which multiqueue capable device is this? Jamal mentioned 
 that e1000 
 uses drr.

E1000 is capable of doing DRR and WRR, however the way our drivers are
written, they're straight round-robin.  But this qdisc would be useful
for devices such as wireless where they have their own scheduler in the
MAC, and do not want the stack to prioritize traffic beyond that.  This
way a user can classify multiple flows into multiple bands, but the
driver will control the prioritization of the traffic beyond that.  For
MAC's that have no strict scheduler (such as e1000 as it is today), they
can use sch_prio to achieve multiple queue support, but have scheduling
priority from the stack.

This qdisc can also be useful at the physical netdev layer for
virtualization I would think, but perhaps I haven't thought that hard
about that yet.

 Lots os unnecessary includes. I have a patch that cleans this up for 
 net/sched,
 this is the relevant sch_prio part where you copied this from:

I'll clean the includes up.  Thanks!

  +   band = TC_H_MIN(band) - 1;
  +   if (band  q-bands) {
 
 You copied an off-by-one from an old sch_prio version here.

Hmm.  This is the sch_prio from the first 2.6.23-dev tree.  I'll resync
and make sure it's the correct one.

  +static int rr_tune(struct Qdisc *sch, struct rtattr *opt)
  +{
  +   struct rr_sched_data *q = qdisc_priv(sch);
  +   struct tc_rr_qopt *qopt = RTA_DATA(opt);
 
 
 Nested attributes please, don't repeat sch_prio's mistake.

I'm not sure I understand what you mean here about nested attributes.

 
  ...
  +   /* setup queue to band mapping - best effort to map 
 into available
  +* hardware queues
  +*/
  +   if (q-bands  sch-dev-egress_subqueue_count) {
  +   qmapoffset = 1;
  +   mod = sch-dev-egress_subqueue_count;
  +   } else {
  +   mod = q-bands % sch-dev-egress_subqueue_count;
  +   qmapoffset = q-bands / sch-dev-egress_subqueue_count
  +   + ((mod) ? 1 : 0);
  +   }
  +
  +   queue = 0;
  +   offset = 0;
  +   for (i = 0; i  q-bands; i++) {
  +   q-band2queue[i] = queue;
  +   if ( ((i + 1) - offset) == qmapoffset) {
  +   queue++;
  +   offset += qmapoffset;
  +   if (mod)
  +   mod--;
  +   qmapoffset = q-bands /
  +   sch-dev-egress_subqueue_count +
  +   ((mod) ? 1 : 0);
  +   }
  +   }
 
 Should go as well.
 

Works for me.  I'll clean this up as well.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-18 Thread Waskiewicz Jr, Peter P
  Hmm.  This is the sch_prio from the first 2.6.23-dev tree.  I'll 
  resync and make sure it's the correct one.
 
 Current 2.6.22-rc and net-2.6.23 have
 
 if (band = q-bands)

I just pulled 2.6.23 down, and see that is true.  I must have had that
left over.  I'll fix that.

  I'm not sure I understand what you mean here about nested 
 attributes.
 
 
 Nested netlink attributes, like most qdisc use, instead of 
 struct tc_rr_qopt (or additionally). The way you've done it 
 makes it hard to add further attributes later.

I'm going to need to think about this more, since I'm not immediately
getting what you're referring to.  I see the qdisc using tc_prio_qopt as
a single member; do you have an example outside of the qdiscs I can look
at and see what you're referring to?  Please bear with me: my netlink
skills are still very green.

 
 BTw, couldn't you just merge sch_rr with prio? AFAICT you 
 only need a new dequeue function, a new struct Qdisc_ops and 
 a MODULE_ALIAS.

Are you suggesting a module that can determine RR or PRIO at runtime?
Because the two are so similar, I definitely thought about combining
them, but because of the dequeue difference, you'd need a load-time
switch to determine which mode to run the module in.  That would break
ABI for sch_prio, which I was trying to avoid.

Thanks Patrick,
-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/2] qdisc_restart - couple of optimizations.

2007-06-14 Thread Waskiewicz Jr, Peter P
 IMHO this scenario occurs so infrequently that the check 
 isn't worth it especially since the driver has to be able to 
 deal with us calling it after netif_stop_queue() anyway.

That sounds just fine to me.  Thanks Krishna and Herbert for weighing in
on this.

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/2] qdisc_restart - couple of optimizations.

2007-06-13 Thread Waskiewicz Jr, Peter P
 - netif_queue_stopped need not be called inside qdisc_restart as
   it has been called already in qdisc_run() before the first skb
   is sent, and in __qdisc_run() after each intermediate skb is
   sent (note : we are the only sender, so the queue cannot get
   stopped while the tx lock was got in the ~LLTX case).

I somewhat disagree here.  The underlying driver can conceivably stop
the device queue even if the stack holds the queue lock during an
interrupt to clean Tx descriptors, and it finds it's out of them or
needs to grab the device for whatever reason.  Granted this is a corner
case, and the net effect would be a simple requeue of the skb, but
checking the status of the queue at the last possible moment before
entering the driver could alleviate the requeue in the time between
-dequeue() from the qdisc, and hard_start_xmit() if an event like I
mentioned happened.

I'm ok with it either way, especially since this is a corner case.  But
it does need to be considered that it can happen.

Cheers,

-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >