RE: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs

2006-01-23 Thread Ravinandan Arakali
Rick,
This is the basic implementation I submitted. I will try to include support
for the timestamp option and resubmit.
I did not understand your other comments about service demand.

Thanks,
Ravi

-----Original Message-----
From: Rick Jones [mailto:[EMAIL PROTECTED]
Sent: Friday, January 20, 2006 3:30 PM
To: Ravinandan Arakali
Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org;
[EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: Re: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO)
feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs



 Implementation details:
 1. Handle packet chains from multiple sessions (current default
 MAX_LRO_SESSIONS=32).
 2. Examine each packet for eligibility to aggregate (see the sketch
 after this list). A packet is considered eligible if it meets all of
 the criteria below.
   a. It is a TCP/IP packet and the L2 type is not LLC or SNAP.
   b. The packet has no checksum errors (L3 and L4).
   c. There are no TCP or IP options.

_No_ TCP options?  Not even Timestamps?  Given that one can theoretically
wrap the 32-bit TCP sequence space in something like four seconds, and the
general goodness of timestamps when using window scaling, one might think
that timestamps, if not already common today, would become more common?
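
One way an LRO implementation could still aggregate timestamped segments
is to accept the single common option layout (NOP, NOP, Timestamp); the
sketch below is purely illustrative and not code from the patch under
review:

#include <linux/tcp.h>

/* Hypothetical check: allow aggregation when the TCP header carries
 * either no options or exactly the 12-byte NOP,NOP,Timestamp block. */
static int lro_tcp_opts_ok(const struct tcphdr *tcp)
{
        if (tcp->doff == 5)
                return 1;               /* 20-byte header, no options */
        if (tcp->doff == 8) {           /* 32-byte header, 12 option bytes */
                const __be32 *opt = (const __be32 *)(tcp + 1);
                /* 0x01 0x01 0x08 0x0a = NOP, NOP, kind 8 (TS), len 10 */
                return *opt == htonl(0x0101080a);
        }
        return 0;
}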

   d. Search for and locate the LRO object corresponding to this
  socket, and ensure the packet is in TCP sequence.
   e. It is not a special packet (the SYN, FIN, RST, URG, PSH etc.
  flags are not set).
   f. The TCP payload is non-zero (it is not a pure ACK).
   g. It is not an IP-fragmented packet.
 3. If a packet is found eligible, the LRO object is updated with
information such as the next expected sequence number, the current
length of the aggregated packet, and so on. If the packet is not
eligible or the maximum packet count is reached, update the IP and
TCP headers of the first packet in the chain and pass it up to the
stack.
 4. The frag_list in the skb structure is used to chain packets into
one large packet.
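
A minimal sketch of the step-2 eligibility test referenced above (a
hypothetical illustration; lro_eligible() and its signature are my own
names, not the driver's symbols):

#include <linux/ip.h>
#include <linux/tcp.h>
#include <net/ip.h>     /* IP_MF, IP_OFFSET */

/* Hypothetical sketch of the step-2 checks; not the actual s2io code. */
static int lro_eligible(const struct iphdr *ip, const struct tcphdr *tcp,
                        int payload_len)
{
        if (ip->protocol != IPPROTO_TCP)          /* (a) TCP over IP only */
                return 0;
        if (ip->ihl != 5 || tcp->doff != 5)       /* (c) no IP/TCP options */
                return 0;
        if (ip->frag_off & htons(IP_MF | IP_OFFSET))
                return 0;                         /* (g) not a fragment */
        if (tcp->syn || tcp->fin || tcp->rst || tcp->urg || tcp->psh)
                return 0;                         /* (e) no special flags */
        if (payload_len == 0)                     /* (f) not a pure ACK */
                return 0;
        /* (b) checksums and (d) the sequence check are done elsewhere. */
        return 1;
}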

 Kernel changes required: None

 Performance results:
 The main focus of the initial testing was on a 1500-MTU receiver, since
 this is a bottleneck not covered by the existing stateless offloads.

 There are a couple of disclaimers about the performance results below:
 1. Your mileage will vary. We initially concentrated on a couple of
 PCI-X 2.0 platforms that are powerful enough to push a 10 GbE NIC and
 do not have bottlenecks other than CPU utilization; testing on other
 platforms is still in progress. On some lower-end systems we are
 seeing lower gains.

You should still see benefits in reported service demand, no?

 2. The current LRO implementation is still (for the most part) software
 based, and therefore the performance potential of the feature is far
 from being realized. A full hardware implementation of LRO is expected
 in the next version of the Xframe ASIC.

 Performance delta (with MTU=1500) going from LRO disabled to enabled:
 IBM 2-way Xeon (x366) : 3.5 to 7.1 Gbps
 2-way Opteron : 4.5 to 6.1 Gbps

Service demand changes?

rick jones



Re: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs

2006-01-23 Thread Rick Jones

Ravinandan Arakali wrote:

Rick,
This is the basic implementation I submitted. I will try to include support
for the timestamp option and resubmit.
I did not understand your other comments about service demand.


Sorry, that's a netperfism - netperf can report the service demand measured 
during a test - it is basically the quantity of CPU consumed per unit of work 
performed.  Lower is better.


For example:

languid:/opt/netperf2# src/netperf -H 192.168.3.212 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.212
(192.168.3.212) port 0 AF_INET

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       940.96   17.01    47.96    2.962   8.351

In the test above, the sender consumed nearly 3 microseconds of CPU time to
transfer a KB of data, and the receiver consumed nearly 8.4.
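
To make the arithmetic concrete (a sketch, assuming netperf's
utilization figure is an average across all CPUs, that both machines in
this run are 2-way, and that netperf's KB means 2^10 bytes):

  940.96 * 10^6 bits/s / 8 / 1024     ~= 114,863 KB/s transferred
  0.4796 util * 2 CPUs * 10^6 us/s    ~= 959,200 us of CPU per second
  959,200 / 114,863                   ~= 8.35 us/KB

which matches the reported remote service demand of 8.351 us/KB.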


rick


RE: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs

2006-01-23 Thread Ravinandan Arakali
Rick,
In addition to showing improved throughput, the CPU utilization (service
demand) also went down. But one of the CPUs was running at full
utilization. For example, without LRO, the CPU idle percentages on the
4 CPUs were 39/43/8/12 (average ~25% idle). With LRO, they were
48/0/46/47 (average ~35% idle).
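
For anyone wanting to see that per-CPU imbalance directly, one way to
watch it during a run (assuming a Linux system with the sysstat package
installed) is:

  mpstat -P ALL 1

which prints per-CPU utilization once per second, so the single
saturated CPU stands out immediately.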

Regards,
Ravi

-----Original Message-----
From: Rick Jones [mailto:[EMAIL PROTECTED]
Sent: Monday, January 23, 2006 4:08 PM
To: Ravinandan Arakali
Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org;
[EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: Re: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO)
feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs


Ravinandan Arakali wrote:
 Rick,
 This is the basic implementation I submitted. I will try to include
 support for the timestamp option and resubmit.
 I did not understand your other comments about service demand.

Sorry, that's a netperfism - netperf can report the service demand
measured during a test - it is basically the quantity of CPU consumed
per unit of work performed.  Lower is better.

For example:

languid:/opt/netperf2# src/netperf -H 192.168.3.212 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.3.212
(192.168.3.212) port 0 AF_INET

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       940.96   17.01    47.96    2.962   8.351

In the test above, the sender consumed nearly 3 microseconds of CPU time to
transfer a KB of data, and the receiver consumed nearly 8.4.

rick



Re: [PATCH 2.6.16-rc1] S2io: Large Receive Offload (LRO) feature for Neterion (s2io) 10GbE Xframe PCI-X and PCI-E NICs

2006-01-20 Thread Rick Jones


Implementation details:
1. Handle packet chains from multiple sessions (current default
MAX_LRO_SESSIONS=32).
2. Examine each packet for eligibility to aggregate. A packet is
considered eligible if it meets all of the criteria below.
  a. It is a TCP/IP packet and the L2 type is not LLC or SNAP.
  b. The packet has no checksum errors (L3 and L4).
  c. There are no TCP or IP options.


_No_ TCP options?  Not even Timestamps?  Given that one can theoretically
wrap the 32-bit TCP sequence space in something like four seconds, and the
general goodness of timestamps when using window scaling, one might think
that timestamps, if not already common today, would become more common?



  d. Search for and locate the LRO object corresponding to this
 socket, and ensure the packet is in TCP sequence.
  e. It is not a special packet (the SYN, FIN, RST, URG, PSH etc. flags
 are not set).
  f. The TCP payload is non-zero (it is not a pure ACK).
  g. It is not an IP-fragmented packet.
3. If a packet is found eligible, the LRO object is updated with
   information such as the next expected sequence number, the current
   length of the aggregated packet, and so on. If the packet is not
   eligible or the maximum packet count is reached, update the IP and
   TCP headers of the first packet in the chain and pass it up to the
   stack.
4. The frag_list in the skb structure is used to chain packets into one
   large packet (a sketch of this chaining follows below).
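
A rough sketch of the frag_list chaining in step 4 (hypothetical; the
lro structure and field names below are illustrative, not the actual
s2io definitions):

#include <linux/skbuff.h>

struct lro {
        struct sk_buff *parent;     /* first packet of the session */
        struct sk_buff *last_frag;  /* tail of the frag_list chain */
};

/* Append a newly arrived in-sequence skb to the aggregated packet. */
static void lro_append_pkt(struct lro *lro, struct sk_buff *skb)
{
        struct sk_buff *first = lro->parent;

        if (!skb_shinfo(first)->frag_list)
                skb_shinfo(first)->frag_list = skb;  /* start the chain */
        else
                lro->last_frag->next = skb;          /* extend the chain */
        lro->last_frag = skb;

        /* Keep the head skb's length accounting consistent. */
        first->len      += skb->len;
        first->data_len += skb->len;
        first->truesize += skb->truesize;
}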
 
Kernel changes required: None


Performance results:
The main focus of the initial testing was on a 1500-MTU receiver, since
this is a bottleneck not covered by the existing stateless offloads.


There are a couple of disclaimers about the performance results below:
1. Your mileage will vary. We initially concentrated on a couple of
PCI-X 2.0 platforms that are powerful enough to push a 10 GbE NIC and
do not have bottlenecks other than CPU utilization; testing on other
platforms is still in progress. On some lower-end systems we are
seeing lower gains.


You should still see benefits in reported service demand, no?

2. The current LRO implementation is still (for the most part) software
based, and therefore the performance potential of the feature is far
from being realized. A full hardware implementation of LRO is expected
in the next version of the Xframe ASIC.


Performance delta (with MTU=1500) going from LRO disabled to enabled:
IBM 2-way Xeon (x366) : 3.5 to 7.1 Gbps
2-way Opteron : 4.5 to 6.1 Gbps


Service demand changes?

rick jones