Clarification required about select vs wake_up race condition
Hi,

I am facing the following problem and was wondering if somebody could help me out. I reference our char driver below, but the question I really have is about the sleep/wake_up mechanism, so I thought somebody who is aware of this could help me. BTW, this is 2.6.10.

Our char driver (pretty much like all other char drivers) does a poll_wait() and returns status depending on whether data is available to be read. Even though some data is available to be read (verified using one of our internal commands), the select() never wakes up, in spite of any number of messages sent.

To understand this, I was looking at the code of select vs. wake_up_interruptible(). I feel I am misunderstanding some part of the kernel code, but would be glad if somebody could point it out.

My understanding: do_select() sets the state of the task to TASK_INTERRUPTIBLE and calls the driver's poll entry point. In our poll(), let's say that immediately after we determine there's nothing to be read, some data arrives, causing a wake_up_interruptible() on another CPU. The wake-up happens in the context of the process sending the data. Since the receiving process was already added to the list of listeners, from looking at the code of try_to_wake_up() it looks like it can set the state of the receiving process to TASK_RUNNING (I don't see any lock preventing this). After this happens, the receiving process goes to sleep (because of the schedule_timeout() called by do_select()) but its state is still set to TASK_RUNNING. In this state, when another message arrives, will wake_up_interruptible() fail to wake the process because of the following code in try_to_wake_up()?

        old_state = p->state;
        if (!(old_state & state))
                goto out;

The above situation seems simplistic, so I'm wondering what I am missing here.

Thanks,
Ravi
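For reference, our poll() follows the usual pattern; a simplified sketch (names are placeholders, prototypes are the 2.6-era ones). As I understand it, the scheme relies on do_select() re-checking the returned mask on every loop pass, and on schedule_timeout() returning quickly for a task that is already TASK_RUNNING, so a wake-up landing between the driver's check and the sleep should not be lost:

        #include <linux/poll.h>
        #include <linux/wait.h>

        static DECLARE_WAIT_QUEUE_HEAD(mydrv_read_wq);
        static int mydrv_data_ready;    /* placeholder "data available" flag */

        static unsigned int mydrv_poll(struct file *filp, poll_table *wait)
        {
                unsigned int mask = 0;

                /* register on the wait queue first... */
                poll_wait(filp, &mydrv_read_wq, wait);
                /* ...then test the condition, so a concurrent wake-up is
                 * caught by do_select()'s re-check of the returned mask */
                if (mydrv_data_ready)
                        mask |= POLLIN | POLLRDNORM;
                return mask;
        }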
I/OAT configuration?
Hi,

I am trying to use I/OAT on one of the newer Woodcrest boxes, but I am not sure things are configured properly, since there seems to be no change in performance with I/OAT enabled or disabled. The following are the steps I followed:

1. MSI (CONFIG_PCI_MSI) is enabled in the kernel (2.6.16.21).
2. In the kernel DMA configuration, the following are enabled:
   - Support for DMA Engines
   - Network: TCP receive copy offload
   - Test DMA Client
   - Intel I/OAT DMA support
3. I manually load the ioatdma driver (modprobe ioatdma).

As per some documentation I read, when step #3 is performed successfully, directories dma0chanX are supposed to be created under /sys/class/dma, but in my case this directory stays empty. I don't see any messages in /var/log/messages. Any idea what is missing?

Thanks,
Ravi
H/W requirements for NETIF_F_HW_CSUM
Hello,

Our current NIC does not provide the actual checksum value on the receive path. Hence we only claim NETIF_F_IP_CSUM instead of the more general NETIF_F_HW_CSUM. To support this in a future adapter, we would like to know what exactly the requirements are (on both Rx and Tx) to claim NETIF_F_HW_CSUM.

The following are some specific questions:

1. On Tx, our adapter supports checksumming of TCP/UDP over IPv4 and IPv6. This computation is TCP/UDP specific. Does the checksum calculation need to be more generic? Also, skbuff.h says that the checksum needs to be placed at a specific location (skb->h.raw + skb->csum). I guess this means the adapter needs to pass back the checksum to the host driver after transmission. What happens in the case of TSO?

2. On Rx, is it sufficient if we place the L4 checksum in skb->csum? What about the L3 checksum?

Thanks,
Ravi
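To make question 1 concrete, this is roughly what I understand the 2.6.16-era transmit side to look like when NETIF_F_HW_CSUM is claimed (only the skb fields are from the kernel API; the TxD programming is our guess):

        /* in probe: advertise generic checksum offload */
        dev->features |= NETIF_F_HW_CSUM;

        /* in hard_start_xmit(): */
        static int mydrv_xmit(struct sk_buff *skb, struct net_device *dev)
        {
                if (skb->ip_summed == CHECKSUM_HW) {
                        /* offset, from the start of the frame, of the 16-bit
                         * field the hardware must stuff with the checksum it
                         * computes over the L4 payload */
                        unsigned int stuff_off =
                                (skb->h.raw - skb->data) + skb->csum;
                        /* program stuff_off into the TxD (device-specific) */
                }
                /* ... queue the frame ... */
                return 0;
        }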
RE: H/W requirements for NETIF_F_HW_CSUM
Steve,

Thanks for the response. Please see my comments below.

-----Original Message-----
From: Stephen Hemminger [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 26, 2006 12:16 PM
To: [EMAIL PROTECTED]
Cc: netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Leonid Grossman (E-mail)
Subject: Re: H/W requirements for NETIF_F_HW_CSUM

On Wed, 26 Jul 2006 10:28:00 -0700, Ravinandan Arakali <[EMAIL PROTECTED]> wrote:

>> Hello,
>> Our current NIC does not provide the actual checksum value on the
>> receive path. Hence we only claim NETIF_F_IP_CSUM instead of the more
>> general NETIF_F_HW_CSUM. To support this in a future adapter, we would
>> like to know what exactly the requirements are (on both Rx and Tx) to
>> claim NETIF_F_HW_CSUM.
>
> If you set NETIF_F_HW_CSUM, on transmit the adapter, if ip_summed is
> set, will be handed an unchecksummed frame with the offset to stuff the
> checksum at. The only difference between NETIF_F_HW_CSUM and
> NETIF_F_IP_CSUM is that IP_CSUM means the device can handle IPv4 only.

Since our adapter does IPv4 and IPv6 checksum, do we then satisfy the requirements to claim NETIF_F_HW_CSUM on the Tx side? Also, for non-TSO we can stuff the checksum at the specified offset in the skb. What about TSO frames?

> NETIF_F_HW_CSUM has no impact on receive. The receive checksumming
> format is up to the device. It can either put the one's complement in
> skb->csum and set ip_summed to CHECKSUM_HW, or, if the device only
> reports "checksum good", set CHECKSUM_UNNECESSARY.

The reason for thinking that NETIF_F_HW_CSUM and CHECKSUM_UNNECESSARY don't go together was a comment from Jeff way back in '04, when our driver was initially submitted. Quoting from that mail: "You CANNOT use NETIF_F_HW_CSUM, when your hardware does not provide the checksum value. You must use NETIF_F_IP_CSUM. Your use of NETIF_F_HW_CSUM + CHECKSUM_UNNECESSARY is flat out incorrect."

> There are a couple of subtleties to the receive processing:
> * The meaning of ip_summed changes from the tx to the rx path, and that
>   has to be handled in code that does forwarding, like bridges.
> * If the device only reports checksum okay vs. bad: the packets marked
>   bad might be another protocol, so they should be passed up with
>   CHECKSUM_NONE and any checksum errors detected in software.
> * Checksum HW on receive should work better, since then IPv6 and nested
>   protocols like VLANs can be handled.
>
>> Following are some specific questions:
>> 1. On Tx, our adapter supports checksumming of TCP/UDP over IPv4 and
>> IPv6. This computation is TCP/UDP specific. Does the checksum
>> calculation need to be more generic? Also, skbuff.h says that the
>> checksum needs to be placed at a specific location (skb->h.raw +
>> skb->csum). I guess this means the adapter needs to pass back the
>> checksum to the host driver after transmission. What happens in the
>> case of TSO?
>> 2. On Rx, is it sufficient if we place the L4 checksum in skb->csum?
>> What about the L3 checksum?
>
> The L3 checksum (IP) is always computed. Since the header is in CPU
> cache anyway, it is faster that way.
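To make sure I read the receive-side options correctly, this is what I understand the two forms to be (a sketch; the RxD field names are made up):

        /* Option 1: device supplies the raw one's-complement sum */
        skb->csum = rxd->l4_csum;               /* hypothetical RxD field */
        skb->ip_summed = CHECKSUM_HW;           /* stack folds/verifies it */

        /* Option 2: device only reports "checksum good" */
        if (rxd->csum_ok)
                skb->ip_summed = CHECKSUM_UNNECESSARY;
        else
                skb->ip_summed = CHECKSUM_NONE; /* let software re-check */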
RE: [PATCH]NET: Add ECN support for TSO
Dave/Michael,

Replicating the NS bit (from the super-segment) across all segments looks fine. But one of the issues is the random/pseudo-random generation of ECT codepoints on each of these segments. The hardware will need to be capable of generating this, and I guess it should be able to verify this against the NS bit received as part of the ACK for that packet. Following are a couple of schemes proposed by our team. Please comment.

Option A) If we were to permit ourselves to somewhat break the spirit of RFC 3540 without breaking the letter, we could come up with a fairly easy enhancement to TSO... I think it would be acceptable to set ECT(0) on all packets except one (I would suggest the last, but an argument could be made for the first). That one would have either ECT(0) or ECT(1) set as per a field in the TxD (for example). That would give us a method that works with ECN nonces (ECT(0) doesn't increment the sum). Unfortunately, it would give us a relative increase in the number of packets being sent with ECT(0) (the random generation should see a 50-50 distribution between ECT(0) and ECT(1); we would be skewing it toward ECT(0) by whatever the proportion of packets to TSO operations is). So a connection using ECN nonces and TSO would be less robust than one not using TSO. But it wouldn't be broken...

Option B) The hardware could randomly generate either ECN codepoint on all packets of a TSO operation except the last. It would keep a local NS value for the operation and, in the last packet, set either ECT(0) or ECT(1) as necessary to generate an NS value equal to that specified in the descriptor. That way we would keep a much more equal distribution. It comes at the cost of a random value generator in the hardware, but we could get by with something extremely basic (e.g. the LSB of the internal clock at the point the packet is generated) if perfect randomness is not required. The limitation of this scheme is that the sender can't verify the NS from any returned ACK that falls inside a TSO operation (it can only be checked at TSO endpoints and on non-TSO transmissions).

Ravi

-----Original Message-----
From: David Miller [mailto:[EMAIL PROTECTED]]
Sent: Saturday, July 08, 2006 1:32 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org
Subject: Re: [PATCH]NET: Add ECN support for TSO

From: Michael Chan <[EMAIL PROTECTED]>
Date: Fri, 7 Jul 2006 18:01:34 -0700

> However, Large Receive Offload will be a different story. If packets
> are accumulated in the hardware and presented to the stack as one large
> packet, the stack will not be able to calculate the cumulative NS
> correctly. Unless the hardware calculates the partial NS over the LRO
> packet and puts it in the SKB when handing over the packet.

This is correct, LRO hardware would need to do something to make sure the nonce parity works out.
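As pseudo-code, Option B amounts to something like the following (the helper names are purely illustrative, not a proposed interface):

        /* accumulate the 1-bit nonce sum over segments 0..n-2, then pick
         * the last segment's codepoint so the total matches the NS value
         * given in the transmit descriptor (txd_ns) */
        unsigned int nonce_sum = 0;
        unsigned int i, ect;

        for (i = 0; i < nsegs - 1; i++) {
                ect = random_bit();             /* ECT(0)=0, ECT(1)=1 */
                nonce_sum ^= ect;
                set_ect_codepoint(seg[i], ect);
        }
        /* last segment forces the cumulative sum to the requested NS */
        set_ect_codepoint(seg[nsegs - 1], txd_ns ^ nonce_sum);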
RE: [PATCH]NET: Add ECN support for TSO
Thanks. I will get rid of the per-session check for ECN.

Ravi

-----Original Message-----
From: David Miller [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, July 11, 2006 11:12 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org
Subject: Re: [PATCH]NET: Add ECN support for TSO

From: Michael Chan <[EMAIL PROTECTED]>
Date: Tue, 11 Jul 2006 21:53:42 -0700

> There is no reason to find out if ECN is enabled or not for any TCP
> connections. Hw just needs to watch the above bits in the incoming
> packets.

Correct, it can be done in a completely stateless manner.
RE: [PATCH]NET: Add ECN support for TSO
Michael/David,

Thanks for the comments on LRO. The current LRO code in the s2io driver is not aware of ECN. While I was trying to fix this, the first thing I encountered was the need to check, in the driver, whether ECN is enabled for the current session. To do this, I try to get hold of the socket by doing something like:

        tk = tcp_sk(skb->sk);
        if (tk->ecn_flags & TCP_ECN_OK)
                /* Check CE, ECE, CWR etc. */

I find that skb->sk is NULL. Is this the correct way to check the per-session ECN capability? Why is skb->sk NULL?

Thanks,
Ravi

-----Original Message-----
From: David Miller [mailto:[EMAIL PROTECTED]]
Sent: Saturday, July 08, 2006 1:32 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org
Subject: Re: [PATCH]NET: Add ECN support for TSO

From: Michael Chan <[EMAIL PROTECTED]>
Date: Fri, 7 Jul 2006 18:01:34 -0700

> However, Large Receive Offload will be a different story. If packets
> are accumulated in the hardware and presented to the stack as one large
> packet, the stack will not be able to calculate the cumulative NS
> correctly. Unless the hardware calculates the partial NS over the LRO
> packet and puts it in the SKB when handing over the packet.

This is correct, LRO hardware would need to do something to make sure the nonce parity works out.
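(As the follow-ups above point out, the check can be done statelessly from the packet itself, without the socket. A sketch along those lines, with the header pointers assumed already parsed:)

        #include <linux/ip.h>
        #include <linux/tcp.h>

        static int lro_pkt_has_ecn(struct iphdr *iph, struct tcphdr *tcph)
        {
                /* low two TOS bits are the ECN field; 0 == Not-ECT */
                if ((iph->tos & 0x3) != 0)
                        return 1;       /* ECT(0)/ECT(1)/CE set */
                if (tcph->ece || tcph->cwr)
                        return 1;       /* ECN signalling in TCP flags */
                return 0;
        }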
RE: [PATCH]NET: Add ECN support for TSO
Michael,

Are network cards expected to be aware of and implement RFC 3540 (ECN with nonces)?

Thanks,
Ravi

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Michael Chan
Sent: Tuesday, June 27, 2006 8:07 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: netdev@vger.kernel.org
Subject: [PATCH]NET: Add ECN support for TSO

In the current TSO implementation, NETIF_F_TSO and ECN cannot be turned on together in a TCP connection. The problem is that most hardware that supports TSO does not handle CWR correctly if it is set in the TSO packet. Correct handling requires CWR to be set in the first packet only if it is set in the TSO header.

This patch adds the ability to turn on NETIF_F_TSO and ECN, using GSO if necessary to handle TSO packets with CWR set. Hardware that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev->features flag. All TSO packets with CWR set will have SKB_GSO_TCPV4_ECN set. If the output device does not have the NETIF_F_TSO_ECN feature set, GSO will split the packet up correctly with CWR only set in the first segment.

It is further assumed that all hardware will handle ECE properly by replicating the ECE flag in all segments. If that is not the case, a simple extension of the logic will be required.

Signed-off-by: Michael Chan [EMAIL PROTECTED]
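(The software fallback described above amounts to roughly the following when GSO segments a CWR-marked frame; the per-segment header access is illustrative, not the patch's actual code:)

        /* keep CWR only in segment 0; ECE is replicated into all segments */
        int i;
        for (i = 0; i < nsegs; i++) {
                struct tcphdr *th = seg_tcp_hdr(segs[i]);  /* hypothetical */
                if (i > 0)
                        th->cwr = 0;
        }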
RE: [3/5] [NET]: Add software TSOv4
We are working on it.

Ravi

-----Original Message-----
From: YOSHIFUJI Hideaki [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 23, 2006 6:33 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]
Subject: Re: [3/5] [NET]: Add software TSOv4

In article <[EMAIL PROTECTED]> (at Fri, 23 Jun 2006 17:28:12 -0700), Ravinandan Arakali <[EMAIL PROTECTED]> says:

> Neterion's Xframe adapter supports TSO over IPv6.

I remember you posted some patches. Would you post a revised version reflecting Stephen's comment, please?

--yoshfuji
RE: [patch 2.6.17] s2io driver irq fix
Andrew,

My understanding is that MSI-X vectors are not usually shared. We don't want to spend cycles checking whether the interrupt was indeed from our card or from another device on the same IRQ. In fact, the current driver shares the IRQ for the MSI case, which I think is a bug; that should also be non-shared. Our MSI handler just runs through the Tx/Rx completions and returns IRQ_HANDLED. In case of IRQ sharing, we could be falsely claiming the interrupt as our own.

Ravi

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Andrew Morton
Sent: Wednesday, June 21, 2006 9:16 PM
To: Ananda Raju
Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [patch 2.6.17] s2io driver irq fix

On Wed, 21 Jun 2006 15:50:49 -0400 (EDT), Ananda Raju <[EMAIL PROTECTED]> wrote:

> +	if (sp->intr_type == MSI_X) {
> +		int i;
> -	free_irq(vector, arg);
> +		for (i = 1; (sp->s2io_entries[i].in_use == MSIX_FLG); i++) {
> +			if (sp->s2io_entries[i].type == MSIX_FIFO_TYPE) {
> +				sprintf(sp->desc[i], "%s:MSI-X-%d-TX",
> +					dev->name, i);
> +				err = request_irq(sp->entries[i].vector,
> +					s2io_msix_fifo_handle, 0, sp->desc[i],
> +					sp->s2io_entries[i].arg);

Is it usual to prohibit IRQ sharing with msix?
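In other words, for the MSI case we should be registering along these lines (2.6.17-era API; the handler and field names follow the driver's naming but are illustrative here):

        err = request_irq(sp->pdev->irq, s2io_msi_handle,
                          0 /* no SA_SHIRQ: the vector is not shared */,
                          sp->name, dev);
        if (err)
                printk(KERN_ERR "%s: MSI registration failed\n", dev->name);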
RE: [PATCH] s2io: netpoll support
I don't think we should disable and re-enable all interrupts in the poll_controller entry point. With the current patch, at the end of the routine _all_ interrupts get enabled, which is not desirable. Maybe you should just do disable_irq() at the start of the function and enable_irq() before exiting, the way some of the other drivers do.

Ravi

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Brian Haley
Sent: Thursday, June 08, 2006 9:02 AM
To: netdev@vger.kernel.org
Cc: [EMAIL PROTECTED]
Subject: [PATCH] s2io: netpoll support

This adds netpoll support for things like netconsole/kgdboe to the s2io 10GbE driver. This duplicates some code from s2io_poll() as I wanted to be least-invasive; someone from Neterion might have other thoughts?

Signed-off-by: Brian Haley [EMAIL PROTECTED]
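Something along these lines (a sketch only; the ISR wrapper call is illustrative, with the 2.6-era three-argument handler signature):

        #ifdef CONFIG_NET_POLL_CONTROLLER
        static void s2io_netpoll(struct net_device *dev)
        {
                disable_irq(dev->irq);          /* mask only our line */
                s2io_isr(dev->irq, dev, NULL);  /* run Tx/Rx completions */
                enable_irq(dev->irq);
        }
        #endif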
[PATCH 2.6.16.18] MSI: Proposed fix for MSI/MSI-X load failure
Hi,

This patch suggests a fix for the MSI/MSI-X load failure. Please review the patch.

Symptoms: When a driver is loaded with MSI followed by MSI-X, the load fails, indicating that the MSI vector is still active. And vice versa.

Suspected root cause: This happens in spite of the driver calling free_irq() followed by pci_disable_msi()/pci_disable_msix(). This appears to be a kernel bug wherein the pci_disable_msi and pci_disable_msix calls do not clear/unpopulate the msi_desc data structure that was populated by pci_enable_msi/pci_enable_msix.

Proposed fix: Free the MSI vector in pci_disable_msi, and all allocated MSI-X vectors in pci_disable_msix.

Testing: The fix has been tested on IA64 platforms with Neterion's Xframe driver.

Signed-off-by: Ravinandan Arakali [EMAIL PROTECTED]
---
diff -urpN old/drivers/pci/msi.c new/drivers/pci/msi.c
--- old/drivers/pci/msi.c	2006-05-31 19:02:19.0 -0700
+++ new/drivers/pci/msi.c	2006-05-31 19:02:39.0 -0700
@@ -779,6 +779,7 @@ void pci_disable_msi(struct pci_dev* dev
 		nr_released_vectors++;
 	default_vector = entry->msi_attrib.default_vector;
 	spin_unlock_irqrestore(&msi_lock, flags);
+	msi_free_vector(dev, dev->irq, 1);
 	/* Restore dev->irq to its default pin-assertion vector */
 	dev->irq = default_vector;
 	disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
@@ -1046,6 +1047,7 @@ void pci_disable_msix(struct pci_dev* de
 		}
 	}
+	msi_remove_pci_irq_vectors(dev);
 }

/**
RE: [PATCH 2.6.16.18] MSI: Proposed fix for MSI/MSI-X load failure
Rajesh,

It's possible that the current behavior is by design, but once the driver is loaded with MSI, you need a reboot to be able to load with MSI-X, and vice versa. I found this rather restrictive.

I did test the fix multiple times. For example: multiple load/unload iterations with MSI, followed by multiple load/unload iterations with MSI-X, followed by load/unload with MSI. That way both transitions (MSI to MSI-X and vice versa) are tested.

Thanks,
Ravi

-----Original Message-----
From: Rajesh Shah [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 02, 2006 2:55 PM
To: Ravinandan Arakali
Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Leonid Grossman; Ananda Raju; Sriram Rapuru
Subject: Re: [PATCH 2.6.16.18] MSI: Proposed fix for MSI/MSI-X load failure

On Fri, Jun 02, 2006 at 03:21:37PM -0400, Ravinandan Arakali wrote:

> Symptoms: When a driver is loaded with MSI followed by MSI-X, the load
> fails, indicating that the MSI vector is still active. And vice versa.
>
> Suspected root cause: This happens in spite of the driver calling
> free_irq() followed by pci_disable_msi/pci_disable_msix. This appears
> to be a kernel bug wherein the pci_disable_msi and pci_disable_msix
> calls do not clear/unpopulate the msi_desc data structure that was
> populated by pci_enable_msi/pci_enable_msix.

The current MSI code actually does this deliberately, not by accident. It's got a lot of complex code to track devices and vectors and make sure an enable_msi -> disable -> enable sequence gives a driver the same vector. It also has policies about reserving vectors based on potential hotplug activity etc. Frankly, I've never understood the need for such policies, and am in the process of removing all of them.

> Proposed fix: Free the MSI vector in pci_disable_msi, and all allocated
> MSI-X vectors in pci_disable_msix.

This will break the existing MSI policies. Once you take that away, a whole lot of additional code and complexity can be removed too. That's what I'm working on right now, but such a change is likely too big for -stable. So, I'm OK with this patch if it actually doesn't break MSI/MSI-X. Did you try to repeatedly load/unload an MSI-capable driver with this patch? Did you repeatedly try to ifdown/ifup an Ethernet driver that uses MSI? I'm not in a position to test this today, but will try it out next week.

thanks,
Rajesh
RE: pci_enable_msix throws up error
I have submitted a proposed fix for the below issue. Will wait for comments.

Ravi

-----Original Message-----
From: Andi Kleen [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 05, 2006 1:44 AM
To: Ayaz Abdulla
Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; Ananda Raju; netdev@vger.kernel.org; Leonid Grossman
Subject: Re: pci_enable_msix throws up error

On Friday 05 May 2006 07:14, Ayaz Abdulla wrote:

> I noticed the same behaviour, i.e. cannot use both MSI and MSIX without
> rebooting. I had sent a message to the maintainer of the MSI/MSIX
> source a few months ago and got a response that they were working on
> fixing it. Not sure what the progress is on it.

The best way to make progress faster would be for someone like you, who needs it, to submit a patch to fix it.

-Andi
pci_enable_msix throws up error
Hi,

I am seeing the following problem with MSI/MSI-X. Note: I am copying netdev since other network drivers use this feature and somebody on the list could throw light on it.

Our 10G network card (Xframe II) supports MSI and MSI-X. When I load/unload the driver with MSI support, followed by an attempt to load with MSI-X, I get the following message from pci_enable_msix:

        Can't enable MSI-X. Device already has an MSI vector assigned

I seem to be doing the correct things when unloading the MSI driver. Basically, I do free_irq() followed by pci_disable_msi(). Any idea what I am missing?

Further analysis: Looking at the code, the following check (when it finds a match) in msi_lookup_vector() (called by pci_enable_msix) seems to throw up this message:

        if (!msi_desc[vector] || msi_desc[vector]->dev != dev ||
            msi_desc[vector]->msi_attrib.type != type ||
            msi_desc[vector]->msi_attrib.default_vector != dev->irq)

pci_enable_msi, on successful completion, will populate the fields in msi_desc. But neither pci_disable_msi nor free_irq seems to undo/unpopulate the msi_desc table. Could this be the cause of the problem?

Thanks,
Ravi
RE: [PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms
Andi,

The driver will be listening on netlink for any configuration requests. We could release the user tools, but we are not sure where (in the tree) they would reside.

Thanks,
Ravi

-----Original Message-----
From: Andi Kleen [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 19, 2006 5:51 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org
Subject: Re: [PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms

On Thursday 20 April 2006 00:45, Ravinandan Arakali wrote:

> Andi,
> We would like to explain that this patch is tier-1 of a two-tiered
> approach. It implements all the steering functionality at driver-only
> level, and it is fairly Neterion-specific.

That's fine for experiments, but probably not something that should be in tree.

> The second upcoming submission will add a generic netlink-based
> interface for channel data flow and configuration (including receive
> steering parameters) on a per-channel basis, that will utilize the
> lower-level implementation from the current patch.

Will the driver itself be listening to netlink? My feeling would be that teaching the stack to use this would require efficient interfaces, and netlink isn't particularly efficient. But if it's just a glue module outside the driver, that would be reasonable as a first step, I guess. Do you also plan to release user tools to use it?

-Andi
RE: [PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms
Andi,

We would like to explain that this patch is tier-1 of a two-tiered approach. It implements all the steering functionality at driver-only level, and it is fairly Neterion-specific. The second upcoming submission will add a generic netlink-based interface for channel data flow and configuration (including receive steering parameters) on a per-channel basis, that will utilize the lower-level implementation from the current patch.

Thanks,
Ravi

-----Original Message-----
From: Andi Kleen [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 18, 2006 5:59 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org
Subject: Re: [PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms

On Wednesday 19 April 2006 02:38, Ravinandan Arakali wrote:

> configuration: A mask (specified using the loadable parameter
> rth_fn_and_mask) can be used to select a subset of the TCP/UDP tuple
> for hash calculation, e.g. to mask the source port for a TCP/IPv4
> configuration:
>
>	# insmod s2io.ko rx_steering_type=2 rth_fn_and_mask=0x0101
>
> The LSB specifies the RTH function type and the MSB the mask. A full
> description is provided at the beginning of s2io.c.

I don't think it's a good idea to introduce such weird and hard-to-understand module parameters for this. It would be better to define a generic internal kernel interface between the stack and the driver. Perhaps starting with a standard netlink interface for this might be a good start, until the stack learns how to use this on its own.

> 3. MAC address based: Done based on the destination MAC address of the
> packet. Xframe can be configured with multiple unicast MAC addresses.
> configuration: The load-time parameters multi_mac_cnt and multi_macs
> can be used to specify the number of MAC addresses and the list of
> unicast addresses, e.g.
>
>	# insmod s2io.ko rx_steering_type=8 multi_mac_cnt=3 multi_macs=00:0c:fc:00:00:22, 00:0c:fc:00:01:22, 00:0c:fc:00:02:22
>
> Packets received with the default destination MAC address will be
> steered to ring0. Packets with the destination MAC addresses specified
> by multi_macs are steered to ring1, ring2... respectively.

The obvious way to do this nicely would be to allow defining multiple virtual interfaces where the MAC addresses can be set using the usual ioctls.

-Andi
RE: [PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms
Hi,

Just wondering if anybody got a chance to review the below patch.

Thanks,
Ravi

-----Original Message-----
From: Ravinandan Arakali [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 10, 2006 12:32 PM
To: [EMAIL PROTECTED]; netdev@vger.kernel.org
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: [PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms

Hi,

Attached below is a patch adding several receive packet classification and steering mechanisms for Xframe NIC hw channels. The current Xframe ASIC supports one hardware channel per CPU, up to 8 channels. This number will increase in the next ASIC release. A channel can be attached to a specific MSI-X vector (with an independent interrupt moderation scheme), which in turn can be bound to a CPU. Follow-up patches will provide some enhancements for the default TCP workload balancing across hw channels, as well as an optional hw channel interface. The interface is intended to be very generic (not specific to Xframe hardware).

The following mechanisms are supported in this patch. Note: the steering type can be specified at load time with the parameter rx_steering_type. Values supported are 1 (port based), 2 (RTH), 4 (SPDM), 8 (MAC address based).

1. RTH (Receive Traffic Hashing): Steering is based on the socket tuple (or a subset), and the popular Jenkins hash is used for RTH. This lets the receive processing be spread across multiple CPUs, thus reducing the single-CPU bottleneck on the Rx path. Hash-based steering can be used when it is desired to balance an unlimited number of TCP sessions across multiple CPUs, but the exact mapping between a particular session and a particular CPU is not important.
configuration: A mask (specified using the loadable parameter rth_fn_and_mask) can be used to select a subset of the TCP/UDP tuple for hash calculation, e.g. to mask the source port for a TCP/IPv4 configuration:

	# insmod s2io.ko rx_steering_type=2 rth_fn_and_mask=0x0101

The LSB specifies the RTH function type and the MSB the mask. A full description is provided at the beginning of s2io.c.

2. Port based: Steering is done based on the source/destination TCP/UDP port number.
configuration: The interface used is netlink sockets. One can specify port number(s), TCP/UDP type, and source/destination port.

3. MAC address based: Done based on the destination MAC address of the packet. Xframe can be configured with multiple unicast MAC addresses.
configuration: The load-time parameters multi_mac_cnt and multi_macs can be used to specify the number of MAC addresses and the list of unicast addresses, e.g.

	# insmod s2io.ko rx_steering_type=8 multi_mac_cnt=3 multi_macs=00:0c:fc:00:00:22, 00:0c:fc:00:01:22, 00:0c:fc:00:02:22

Packets received with the default destination MAC address will be steered to ring0. Packets with the destination MAC addresses specified by multi_macs are steered to ring1, ring2... respectively.

4. SPDM (Socket Pair Direct Match): Steering is based on an exact socket tuple (or subset) match. SPDM steering can be used when the exact mapping between a particular session and a particular CPU is desired.
configuration: The interface used is netlink sockets. One can specify the socket tuple values. If any of the values (say the source port) needs to be don't-care, specify 0x.

Signed-off-by: Raghavendra Koushik [EMAIL PROTECTED]
Signed-off-by: Sivakumar Subramani [EMAIL PROTECTED]
Signed-off-by: Ravinandan Arakali [EMAIL PROTECTED]
---
diff -urpN old/drivers/net/rx_cfg.h new/drivers/net/rx_cfg.h
--- old/drivers/net/rx_cfg.h	1969-12-31 16:00:00.0 -0800
+++ new/drivers/net/rx_cfg.h	2006-03-10 02:54:56.0 -0800
@@ -0,0 +1,44 @@
+#ifndef _RX_CFG_H_
+#define _RX_CFG_H_
+
+typedef struct {
+	unsigned short port;
+	unsigned short prot_n_type;	/* TCP/UDP Dst/Src port type */
+#define SRC_PRT		0x0
+#define DST_PRT		0x1
+#define TCP_PROT	0x0
+#define UDP_PROT	0x1
+	unsigned short dst_ring;
+} port_info_t;
+
+/* A Rx steering config structure to pass info to the driver by user */
+typedef struct {
+	//SPDM
+	unsigned int	sip;	/* Src IP addr */
+	unsigned int	dip;	/* Dst IP addr */
+	unsigned short	sprt;	/* Src TCP port */
+	unsigned short	dprt;	/* Dst TCP port */
+	unsigned int	t_queue;	/* Target Rx Queue for the packet */
+	unsigned int	hash;	/* the hash as per Jenkins' hash algorithm */
+#define SPDM_NO_DATA			0x1
+#define SPDM_XENA_IF			0x2
+#define SPDM_HW_UNINITIALIZED		0x3
+#define SPDM_INCOMPLETE_SOCKET		0x4
+#define SPDM_TABLE_ACCESS_FAILED	0x5
+#define SPDM_TABLE_FULL			0x6
+#define SPDM_TABLE_UNKNOWN_BAR		0x7
+#define SPDM_TABLE_MALLOC_FAIL		0x8
+#define SPDM_INVALID_DEVICE		0x9
+#define SPDM_CONF_SUCCESS		0x0
+#define SPDM_GET_CFG_DATA		0xAA55
+	int
[PATCH 2.6.16-rc5] S2io: Receive packet classification and steering mechanisms
Hi,

Attached below is a patch adding several receive packet classification and steering mechanisms for Xframe NIC hw channels. The current Xframe ASIC supports one hardware channel per CPU, up to 8 channels. This number will increase in the next ASIC release. A channel can be attached to a specific MSI-X vector (with an independent interrupt moderation scheme), which in turn can be bound to a CPU. Follow-up patches will provide some enhancements for the default TCP workload balancing across hw channels, as well as an optional hw channel interface. The interface is intended to be very generic (not specific to Xframe hardware).

The following mechanisms are supported in this patch. Note: the steering type can be specified at load time with the parameter rx_steering_type. Values supported are 1 (port based), 2 (RTH), 4 (SPDM), 8 (MAC address based).

1. RTH (Receive Traffic Hashing): Steering is based on the socket tuple (or a subset), and the popular Jenkins hash is used for RTH. This lets the receive processing be spread across multiple CPUs, thus reducing the single-CPU bottleneck on the Rx path. Hash-based steering can be used when it is desired to balance an unlimited number of TCP sessions across multiple CPUs, but the exact mapping between a particular session and a particular CPU is not important.
configuration: A mask (specified using the loadable parameter rth_fn_and_mask) can be used to select a subset of the TCP/UDP tuple for hash calculation, e.g. to mask the source port for a TCP/IPv4 configuration:

	# insmod s2io.ko rx_steering_type=2 rth_fn_and_mask=0x0101

The LSB specifies the RTH function type and the MSB the mask. A full description is provided at the beginning of s2io.c.

2. Port based: Steering is done based on the source/destination TCP/UDP port number.
configuration: The interface used is netlink sockets. One can specify port number(s), TCP/UDP type, and source/destination port.

3. MAC address based: Done based on the destination MAC address of the packet. Xframe can be configured with multiple unicast MAC addresses.
configuration: The load-time parameters multi_mac_cnt and multi_macs can be used to specify the number of MAC addresses and the list of unicast addresses, e.g.

	# insmod s2io.ko rx_steering_type=8 multi_mac_cnt=3 multi_macs=00:0c:fc:00:00:22, 00:0c:fc:00:01:22, 00:0c:fc:00:02:22

Packets received with the default destination MAC address will be steered to ring0. Packets with the destination MAC addresses specified by multi_macs are steered to ring1, ring2... respectively.

4. SPDM (Socket Pair Direct Match): Steering is based on an exact socket tuple (or subset) match. SPDM steering can be used when the exact mapping between a particular session and a particular CPU is desired. (A software analogue of the RTH hashing in mechanism 1 is sketched after this patch.)
configuration: The interface used is netlink sockets. One can specify the socket tuple values. If any of the values (say the source port) needs to be don't-care, specify 0x.

Signed-off-by: Raghavendra Koushik [EMAIL PROTECTED]
Signed-off-by: Sivakumar Subramani [EMAIL PROTECTED]
Signed-off-by: Ravinandan Arakali [EMAIL PROTECTED]
---
diff -urpN old/drivers/net/rx_cfg.h new/drivers/net/rx_cfg.h
--- old/drivers/net/rx_cfg.h	1969-12-31 16:00:00.0 -0800
+++ new/drivers/net/rx_cfg.h	2006-03-10 02:54:56.0 -0800
@@ -0,0 +1,44 @@
+#ifndef _RX_CFG_H_
+#define _RX_CFG_H_
+
+typedef struct {
+	unsigned short port;
+	unsigned short prot_n_type;	/* TCP/UDP Dst/Src port type */
+#define SRC_PRT		0x0
+#define DST_PRT		0x1
+#define TCP_PROT	0x0
+#define UDP_PROT	0x1
+	unsigned short dst_ring;
+} port_info_t;
+
+/* A Rx steering config structure to pass info to the driver by user */
+typedef struct {
+	//SPDM
+	unsigned int	sip;	/* Src IP addr */
+	unsigned int	dip;	/* Dst IP addr */
+	unsigned short	sprt;	/* Src TCP port */
+	unsigned short	dprt;	/* Dst TCP port */
+	unsigned int	t_queue;	/* Target Rx Queue for the packet */
+	unsigned int	hash;	/* the hash as per Jenkins' hash algorithm */
+#define SPDM_NO_DATA			0x1
+#define SPDM_XENA_IF			0x2
+#define SPDM_HW_UNINITIALIZED		0x3
+#define SPDM_INCOMPLETE_SOCKET		0x4
+#define SPDM_TABLE_ACCESS_FAILED	0x5
+#define SPDM_TABLE_FULL			0x6
+#define SPDM_TABLE_UNKNOWN_BAR		0x7
+#define SPDM_TABLE_MALLOC_FAIL		0x8
+#define SPDM_INVALID_DEVICE		0x9
+#define SPDM_CONF_SUCCESS		0x0
+#define SPDM_GET_CFG_DATA		0xAA55
+	int ret;
+#define MAX_SPDM_ENTRIES_SIZE	(0x100 * 0x40)
+	unsigned char data[MAX_SPDM_ENTRIES_SIZE];
+	int data_len;	/* Number of entries retrieved */
+	char dev_name[20];	/* Device name, e.g. eth0, eth1... */
+
+	// Port steering
+	port_info_t l4_ports;
+} rx_steering_cfg_t;
+
+#endif /*_RX_CFG_H_*/
diff -urpN old/drivers/net/s2io-regs.h new/drivers/net/s2io-regs.h
--- old/drivers/net
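(For reference, a software analogue of the RTH ring selection described in mechanism 1 above; the masking step and ring count are illustrative, not the hardware's actual implementation:)

        #include <linux/jhash.h>

        static unsigned int rth_select_ring(u32 sip, u32 dip,
                                            u16 sport, u16 dport,
                                            unsigned int nrings)
        {
                /* a mask (cf. rth_fn_and_mask) could zero e.g. sport here */
                u32 h = jhash_3words(sip, dip,
                                     ((u32)sport << 16) | dport, 0);
                return h % nrings;      /* ring index for this tuple */
        }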
RE: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature(v2) for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs
Hi,

Just wondering if anybody got a chance to review the below patch. This version (as per Rick's comment on the v1 patch) includes support for TCP timestamps.

Thanks,
Ravi

-----Original Message-----
From: Ravinandan Arakali [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 25, 2006 11:53 AM
To: [EMAIL PROTECTED]; netdev@vger.kernel.org
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature(v2) for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs

Hi,

Below is a patch for the Large Receive Offload feature. Please review and let us know your comments.

The LRO algorithm was described in an OLS 2005 presentation, located at ftp.s2io.com (user: linuxdocs, password: HALdocs). The same ftp site has the Programming Manual for the Xframe-I ASIC. The LRO feature is supported on Neterion Xframe-I, Xframe-II and Xframe-Express 10GbE NICs.

Brief description: The Large Receive Offload (LRO) feature is a stateless offload that is complementary to the TSO feature, but on the receive path. The idea is to combine and collapse (up to the 64K maximum), in the driver, in-sequence TCP packets belonging to the same session. It is mainly designed to improve 1500-MTU receive performance, since Jumbo frame performance is already close to 10GbE line rate. Some performance numbers are attached below.

Implementation details:
1. Handle packet chains from multiple sessions (current default MAX_LRO_SESSIONS=32).
2. Examine each packet for eligibility to aggregate. A packet is considered eligible if it meets all the criteria below:
a. It is a TCP/IP packet and the L2 type is not LLC or SNAP.
b. The packet has no checksum errors (L3 and L4).
c. There are no IP options. The only TCP option supported is timestamps.
d. Search and locate the LRO object corresponding to this socket, and ensure the packet is in TCP sequence.
e. It's not a special packet (SYN, FIN, RST, URG, PSH etc. flags are not set).
f. The TCP payload is non-zero (it's not a pure ACK).
g. It's not an IP-fragmented packet.
3. If a packet is found eligible, the LRO object is updated with information such as the next sequence number expected, the current length of the aggregated packet, and so on. If not eligible, or the max packet count is reached, update the IP and TCP headers of the first packet in the chain and pass it up to the stack.
4. The frag_list in the skb structure is used to chain packets into one large packet.

Kernel changes required: None

Performance results: The main focus of the initial testing was on a 1500-MTU receiver, since this is a bottleneck not covered by the existing stateless offloads. There are a couple of disclaimers about the performance results below:
1. Your mileage will vary. We initially concentrated on a couple of PCI-X 2.0 platforms that are powerful enough to push a 10GbE NIC and do not have bottlenecks other than cpu%; testing on other platforms is still in progress. On some lower-end systems we are seeing lower gains.
2. The current LRO implementation is still (for the most part) software based, and therefore the performance potential of the feature is far from being realized. A full hw implementation of LRO is expected in the next version of the Xframe ASIC.

Performance delta (with MTU=1500) going from LRO disabled to enabled:
IBM 2-way Xeon (x366): 3.5 to 7.1 Gbps
2-way Opteron: 4.5 to 6.1 Gbps

Signed-off-by: Ravinandan Arakali [EMAIL PROTECTED]
---
diff -urpN old/drivers/net/s2io.c new_ts/drivers/net/s2io.c
--- old/drivers/net/s2io.c	2006-01-19 04:31:05.0 -0800
+++ new_ts/drivers/net/s2io.c	2006-01-24 08:56:25.0 -0800
@@ -57,6 +57,9 @@
 #include <linux/ethtool.h>
 #include <linux/workqueue.h>
 #include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <net/tcp.h>
 #include <asm/system.h>
 #include <asm/uaccess.h>
@@ -66,7 +69,7 @@
 #include "s2io.h"
 #include "s2io-regs.h"
-#define DRV_VERSION "Version 2.0.9.4"
+#define DRV_VERSION "2.0.11.2"
 /* S2io Driver name & version. */
 static char s2io_driver_name[] = "Neterion";
@@ -168,6 +171,11 @@ static char ethtool_stats_keys[][ETH_GST
 	{"\n DRIVER STATISTICS"},
 	{"single_bit_ecc_errs"},
 	{"double_bit_ecc_errs"},
+	{"lro_aggregated_pkts"},
+	{"lro_flush_both_count"},
+	{"lro_out_of_sequence_pkts"},
+	{"lro_flush_due_to_max_pkts"},
+	{"lro_avg_aggr_pkts"},
 };
 #define S2IO_STAT_LEN sizeof(ethtool_stats_keys)/ ETH_GSTRING_LEN
@@ -317,6 +325,12 @@ static unsigned int indicate_max_pkts;
 static unsigned int rxsync_frequency = 3;
 /* Interrupt type. Values can be 0(INTA), 1(MSI), 2(MSI_X) */
 static unsigned int intr_type = 0;
+/* Large receive offload feature */
+static unsigned int lro = 0;
+/* Max pkts to be aggregated by LRO at one time. If not specified,
+ * aggregation happens until we hit max IP pkt size(64K)
+ */
+static unsigned int lro_max_pkts = 0xFFFF;
 /*
  * S2IO device table.
@@ -1476,6 +1490,19 @@ static int init_nic(struct s2io_nic *nic
 	writel((u32) (val64 >> 32), (add + 4));
 	val64 = readq(&bar0->mac_cfg);
+	/* Enable FCS stripping by adapter */
+	add = &bar0->mac_cfg;
+	val64 = readq(&bar0->mac_cfg);
+	val64 |= MAC_CFG_RMAC_STRIP_FCS;
+	if (nic->device_type == XFRAME_II_DEVICE)
+		writeq(val64, &bar0->mac_cfg);
+	else {
+		writeq(RMAC_CFG_KEY(0x4C0D), &bar0->rmac_cfg_key);
+		writel((u32) (val64), add);
+		writeq(RMAC_CFG_KEY(0x4C0D), &bar0->rmac_cfg_key);
+		writel((u32) (val64 >> 32), (add + 4
[PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature(v2) for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs
Hi,

Below is a patch for the Large Receive Offload feature. Please review and let us know your comments.

The LRO algorithm was described in an OLS 2005 presentation, located at ftp.s2io.com (user: linuxdocs, password: HALdocs). The same ftp site has the Programming Manual for the Xframe-I ASIC. The LRO feature is supported on Neterion Xframe-I, Xframe-II and Xframe-Express 10GbE NICs.

Brief description: The Large Receive Offload (LRO) feature is a stateless offload that is complementary to the TSO feature, but on the receive path. The idea is to combine and collapse (up to the 64K maximum), in the driver, in-sequence TCP packets belonging to the same session. It is mainly designed to improve 1500-MTU receive performance, since Jumbo frame performance is already close to 10GbE line rate. Some performance numbers are attached below.

Implementation details:
1. Handle packet chains from multiple sessions (current default MAX_LRO_SESSIONS=32).
2. Examine each packet for eligibility to aggregate. A packet is considered eligible if it meets all the criteria below (a condensed sketch of this test follows the patch):
a. It is a TCP/IP packet and the L2 type is not LLC or SNAP.
b. The packet has no checksum errors (L3 and L4).
c. There are no IP options. The only TCP option supported is timestamps.
d. Search and locate the LRO object corresponding to this socket, and ensure the packet is in TCP sequence.
e. It's not a special packet (SYN, FIN, RST, URG, PSH etc. flags are not set).
f. The TCP payload is non-zero (it's not a pure ACK).
g. It's not an IP-fragmented packet.
3. If a packet is found eligible, the LRO object is updated with information such as the next sequence number expected, the current length of the aggregated packet, and so on. If not eligible, or the max packet count is reached, update the IP and TCP headers of the first packet in the chain and pass it up to the stack.
4. The frag_list in the skb structure is used to chain packets into one large packet.

Kernel changes required: None

Performance results: The main focus of the initial testing was on a 1500-MTU receiver, since this is a bottleneck not covered by the existing stateless offloads. There are a couple of disclaimers about the performance results below:
1. Your mileage will vary. We initially concentrated on a couple of PCI-X 2.0 platforms that are powerful enough to push a 10GbE NIC and do not have bottlenecks other than cpu%; testing on other platforms is still in progress. On some lower-end systems we are seeing lower gains.
2. The current LRO implementation is still (for the most part) software based, and therefore the performance potential of the feature is far from being realized. A full hw implementation of LRO is expected in the next version of the Xframe ASIC.

Performance delta (with MTU=1500) going from LRO disabled to enabled:
IBM 2-way Xeon (x366): 3.5 to 7.1 Gbps
2-way Opteron: 4.5 to 6.1 Gbps

Signed-off-by: Ravinandan Arakali [EMAIL PROTECTED]
---
diff -urpN old/drivers/net/s2io.c new_ts/drivers/net/s2io.c
--- old/drivers/net/s2io.c	2006-01-19 04:31:05.0 -0800
+++ new_ts/drivers/net/s2io.c	2006-01-24 08:56:25.0 -0800
@@ -57,6 +57,9 @@
 #include <linux/ethtool.h>
 #include <linux/workqueue.h>
 #include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <net/tcp.h>
 #include <asm/system.h>
 #include <asm/uaccess.h>
@@ -66,7 +69,7 @@
 #include "s2io.h"
 #include "s2io-regs.h"
-#define DRV_VERSION "Version 2.0.9.4"
+#define DRV_VERSION "2.0.11.2"
 /* S2io Driver name & version. */
 static char s2io_driver_name[] = "Neterion";
@@ -168,6 +171,11 @@ static char ethtool_stats_keys[][ETH_GST
 	{"\n DRIVER STATISTICS"},
 	{"single_bit_ecc_errs"},
 	{"double_bit_ecc_errs"},
+	{"lro_aggregated_pkts"},
+	{"lro_flush_both_count"},
+	{"lro_out_of_sequence_pkts"},
+	{"lro_flush_due_to_max_pkts"},
+	{"lro_avg_aggr_pkts"},
 };
 #define S2IO_STAT_LEN sizeof(ethtool_stats_keys)/ ETH_GSTRING_LEN
@@ -317,6 +325,12 @@ static unsigned int indicate_max_pkts;
 static unsigned int rxsync_frequency = 3;
 /* Interrupt type. Values can be 0(INTA), 1(MSI), 2(MSI_X) */
 static unsigned int intr_type = 0;
+/* Large receive offload feature */
+static unsigned int lro = 0;
+/* Max pkts to be aggregated by LRO at one time. If not specified,
+ * aggregation happens until we hit max IP pkt size(64K)
+ */
+static unsigned int lro_max_pkts = 0xFFFF;
 /*
  * S2IO device table.
@@ -1476,6 +1490,19 @@ static int init_nic(struct s2io_nic *nic
 	writel((u32) (val64 >> 32), (add + 4));
 	val64 = readq(&bar0->mac_cfg);
+	/* Enable FCS stripping by adapter */
+	add = &bar0->mac_cfg;
+	val64 = readq(&bar0->mac_cfg);
+	val64 |= MAC_CFG_RMAC_STRIP_FCS;
+	if (nic->device_type == XFRAME_II_DEVICE)
+		writeq(val64, &bar0->mac_cfg);
+	else {
+		writeq(RMAC_CFG_KEY(0x4C0D), &bar0->rmac_cfg_key);
+		writel((u32) (val64), add);
+		writeq(RMAC_CFG_KEY(0x4C0D), &bar0->rmac_cfg_key);
+		writel((u32) (val64 >> 32), (add + 4
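(Condensed, the eligibility test of step 2 looks roughly like the following; the header pointers are assumed already located, the checksum/option/session checks are omitted, and the helper is illustrative rather than the patch's actual code:)

        static int lro_eligible(struct iphdr *ip, struct tcphdr *tcp)
        {
                if (ip->ihl != 5)               /* (c) no IP options */
                        return 0;
                if (ip->frag_off & htons(0x3fff))
                        return 0;               /* (g) MF bit or frag offset */
                if (tcp->syn || tcp->fin || tcp->rst || tcp->urg || tcp->psh)
                        return 0;               /* (e) special flags set */
                if (ntohs(ip->tot_len) == ip->ihl * 4 + tcp->doff * 4)
                        return 0;               /* (f) pure ACK, no payload */
                return 1;
        }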
RE: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs
Rick,

This is the basic implementation I submitted. I will try to include support for the timestamp option and resubmit. I did not understand your other comments about service demand.

Thanks,
Ravi

-----Original Message-----
From: Rick Jones [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 20, 2006 3:30 PM
To: Ravinandan Arakali
Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs

> Implementation details:
> 1. Handle packet chains from multiple sessions (current default
> MAX_LRO_SESSIONS=32).
> 2. Examine each packet for eligibility to aggregate. A packet is
> considered eligible if it meets all the criteria below:
> a. It is a TCP/IP packet and the L2 type is not LLC or SNAP.
> b. The packet has no checksum errors (L3 and L4).
> c. There are no TCP or IP options.

_No_ TCP options? Not even timestamps? Given that one can theoretically wrap the 32-bit TCP sequence space in something like four seconds, and the general goodness of timestamps when using window scaling, one might think that timestamps, if not already common today, would become more common?

> d. Search and locate the LRO object corresponding to this socket and
> ensure the packet is in TCP sequence.
> e. It's not a special packet (SYN, FIN, RST, URG, PSH etc. flags are
> not set).
> f. The TCP payload is non-zero (it's not a pure ACK).
> g. It's not an IP-fragmented packet.
> 3. If a packet is found eligible, the LRO object is updated with
> information such as the next sequence number expected, the current
> length of the aggregated packet, and so on. If not eligible, or the max
> packet count is reached, update the IP and TCP headers of the first
> packet in the chain and pass it up to the stack.
> 4. The frag_list in the skb structure is used to chain packets into one
> large packet.
>
> Kernel changes required: None
>
> Performance results: The main focus of the initial testing was on a
> 1500-MTU receiver, since this is a bottleneck not covered by the
> existing stateless offloads. There are a couple of disclaimers about
> the performance results below:
> 1. Your mileage will vary. We initially concentrated on a couple of
> PCI-X 2.0 platforms that are powerful enough to push a 10GbE NIC and do
> not have bottlenecks other than cpu%; testing on other platforms is
> still in progress. On some lower-end systems we are seeing lower gains.

You should still see benefits in reported service demand, no?

> 2. The current LRO implementation is still (for the most part) software
> based, and therefore the performance potential of the feature is far
> from being realized. A full hw implementation of LRO is expected in the
> next version of the Xframe ASIC.
>
> Performance delta (with MTU=1500) going from LRO disabled to enabled:
> IBM 2-way Xeon (x366): 3.5 to 7.1 Gbps
> 2-way Opteron: 4.5 to 6.1 Gbps

Service demand changes?

rick jones
RE: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs
Rick,

In addition to showing improved throughput, the CPU utilization (service demand) also went down, but one of the CPUs was running at full utilization. For example, without LRO the CPU idle times on the 4 CPUs were 39/43/8/12 (average 25% idle). With LRO, they were 48/0/46/47 (average 35% idle).

Regards,
Ravi

-----Original Message-----
From: Rick Jones [mailto:[EMAIL PROTECTED]]
Sent: Monday, January 23, 2006 4:08 PM
To: Ravinandan Arakali
Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs

Ravinandan Arakali wrote:

> Rick,
> This is the basic implementation I submitted. I will try to include
> support for the timestamp option and resubmit. I did not understand
> your other comments about service demand.

Sorry, that's a netperfism - netperf can report the service demand measured during a test - it is basically the quantity of CPU consumed per unit of work performed. Lower is better. For example:

        languid:/opt/netperf2# src/netperf -H 192.168.3.212 -c -C
        TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.212 (192.168.3.212) port 0 AF_INET
        Recv   Send    Send                          Utilization       Service Demand
        Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
        Size   Size    Size     Time     Throughput  local    remote   local   remote
        bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

         87380  16384  16384    10.00      940.96    17.01    47.96    2.962   8.351

In the test above, the sender consumed nearly 3 microseconds of CPU time to transfer a KB of data, and the receiver consumed nearly 8.4.

rick
[PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs
Hi,

Below is a patch for the Large Receive Offload feature. Please review and let us know your comments.

The LRO algorithm was described in an OLS 2005 presentation, located at ftp.s2io.com (user: linuxdocs, password: HALdocs). The same ftp site has the Programming Manual for the Xframe-I ASIC. The LRO feature is supported on Neterion Xframe-I, Xframe-II and Xframe-Express 10GbE NICs.

Brief description: The Large Receive Offload (LRO) feature is a stateless offload that is complementary to the TSO feature, but on the receive path. The idea is to combine and collapse (up to the 64K maximum), in the driver, in-sequence TCP packets belonging to the same session. It is mainly designed to improve 1500-MTU receive performance, since Jumbo frame performance is already close to 10GbE line rate. Some performance numbers are attached below.

Implementation details:
1. Handle packet chains from multiple sessions (current default MAX_LRO_SESSIONS=32).
2. Examine each packet for eligibility to aggregate. A packet is considered eligible if it meets all the criteria below:
a. It is a TCP/IP packet and the L2 type is not LLC or SNAP.
b. The packet has no checksum errors (L3 and L4).
c. There are no TCP or IP options.
d. Search and locate the LRO object corresponding to this socket, and ensure the packet is in TCP sequence.
e. It's not a special packet (SYN, FIN, RST, URG, PSH etc. flags are not set).
f. The TCP payload is non-zero (it's not a pure ACK).
g. It's not an IP-fragmented packet.
3. If a packet is found eligible, the LRO object is updated with information such as the next sequence number expected, the current length of the aggregated packet, and so on. If not eligible, or the max packet count is reached, update the IP and TCP headers of the first packet in the chain and pass it up to the stack.
4. The frag_list in the skb structure is used to chain packets into one large packet.

Kernel changes required: None

Performance results: The main focus of the initial testing was on a 1500-MTU receiver, since this is a bottleneck not covered by the existing stateless offloads. There are a couple of disclaimers about the performance results below:
1. Your mileage will vary. We initially concentrated on a couple of PCI-X 2.0 platforms that are powerful enough to push a 10GbE NIC and do not have bottlenecks other than cpu%; testing on other platforms is still in progress. On some lower-end systems we are seeing lower gains.
2. The current LRO implementation is still (for the most part) software based, and therefore the performance potential of the feature is far from being realized. A full hw implementation of LRO is expected in the next version of the Xframe ASIC.

Performance delta (with MTU=1500) going from LRO disabled to enabled:
IBM 2-way Xeon (x366): 3.5 to 7.1 Gbps
2-way Opteron: 4.5 to 6.1 Gbps

Signed-off-by: Ravinandan Arakali [EMAIL PROTECTED]
---
diff -urpN old/drivers/net/s2io.c new/drivers/net/s2io.c
--- old/drivers/net/s2io.c	2006-01-19 04:31:05.0 -0800
+++ new/drivers/net/s2io.c	2006-01-20 04:04:09.0 -0800
@@ -57,6 +57,8 @@
 #include <linux/ethtool.h>
 #include <linux/workqueue.h>
 #include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
 #include <asm/system.h>
 #include <asm/uaccess.h>
@@ -66,7 +68,7 @@
 #include "s2io.h"
 #include "s2io-regs.h"
-#define DRV_VERSION "Version 2.0.9.4"
+#define DRV_VERSION "2.0.11.2"
 /* S2io Driver name & version. */
 static char s2io_driver_name[] = "Neterion";
@@ -168,6 +170,11 @@ static char ethtool_stats_keys[][ETH_GST
 	{"\n DRIVER STATISTICS"},
 	{"single_bit_ecc_errs"},
 	{"double_bit_ecc_errs"},
+	{"lro_aggregated_pkts"},
+	{"lro_flush_both_count"},
+	{"lro_out_of_sequence_pkts"},
+	{"lro_flush_due_to_max_pkts"},
+	{"lro_avg_aggr_pkts"},
 };
 #define S2IO_STAT_LEN sizeof(ethtool_stats_keys)/ ETH_GSTRING_LEN
@@ -317,6 +324,12 @@ static unsigned int indicate_max_pkts;
 static unsigned int rxsync_frequency = 3;
 /* Interrupt type. Values can be 0(INTA), 1(MSI), 2(MSI_X) */
 static unsigned int intr_type = 0;
+/* Large receive offload feature */
+static unsigned int lro = 0;
+/* Max pkts to be aggregated by LRO at one time. If not specified,
+ * aggregation happens until we hit max IP pkt size(64K)
+ */
+static unsigned int lro_max_pkts = 0xFFFF;
 /*
  * S2IO device table.
@@ -1476,6 +1489,19 @@ static int init_nic(struct s2io_nic *nic
 	writel((u32) (val64 >> 32), (add + 4));
 	val64 = readq(&bar0->mac_cfg);
+	/* Enable FCS stripping by adapter */
+	add = &bar0->mac_cfg;
+	val64 = readq(&bar0->mac_cfg);
+	val64 |= MAC_CFG_RMAC_STRIP_FCS;
+	if (nic->device_type == XFRAME_II_DEVICE)
+		writeq(val64, &bar0->mac_cfg);
+	else {
+		writeq(RMAC_CFG_KEY(0x4C0D), &bar0->rmac_cfg_key);
+		writel((u32) (val64), add);
+		writeq(RMAC_CFG_KEY(0x4C0D), &bar0->rmac_cfg_key);
+		writel((u32) (val64 >> 32), (add + 4));
+	}
+
 	/*
 	 * Set the time value to be inserted
RE: [PATCH 2.6.12.1 5/12] S2io: Performance improvements
Arthur/David/Jeff,

Thanks for pointing that out. We will wait for any other comments on our 12 patches. If there are none, we will send out a patch 13 to include the mmiowb() change.

Thanks,
Ravi

-----Original Message-----
From: Arthur Kepner [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 08, 2005 8:31 AM
To: Raghavendra Koushik
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [PATCH 2.6.12.1 5/12] S2io: Performance improvements

On Thu, 7 Jul 2005, Raghavendra Koushik wrote:

> On an Altix machine I believe the readq was necessary to flush the PIO
> writes. How long did you run the tests? I had seen in long-duration
> tests that an occasional write (the TXDL control word and the address)
> would be missed and the xmit gets stuck.

The most recent tests I did used pktgen, and they ran for a total time of ~.5 hours (changing pkt_size every 30 seconds or so). The pktgen tests and other tests (like nttcp) have been run several times, so I've exercised the card for a total of several hours without any problems. FWIW, I've done quite a few performance measurements with the patch I posted earlier, and it's worked well. For 1500-byte MTUs, throughput goes up by ~20%.

> Is even the mmiowb() unnecessary? Was this on a 2.4 kernel? I think the
> readq would not have a significant impact on 2.6 kernels due to TSO
> (with TSO on, the number of packets that actually enter the xmit
> routine would be reduced approximately 40 times).

This was with a 2.6 kernel (with TSO on). PIO reads are pretty expensive on Altix, so eliminating them really helps us. For big MTUs (>=4 KBytes) the benefit of replacing the readq() with mmiowb() in s2io_xmit() is negligible.

-- Arthur
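(For the record, the idiom under discussion is roughly the following; the register and lock names are illustrative, not the driver's exact code:)

        /* post the TxDL doorbell write(s) */
        writeq(txdl_ptr, &tx_fifo->List_Control);
        /* order the posted MMIO writes before releasing the lock, so
         * another CPU taking the lock cannot have its device writes
         * pass ours (the Altix concern) -- no PIO read-back needed */
        mmiowb();
        spin_unlock_irqrestore(&sp->tx_lock, flags);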