Re: [PATCH] softirq: let ksoftirqd do its job
On 08/31/2016 04:11 PM, Eric Dumazet wrote: On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote: With regard to drops, are both of you sure you're using the same socket buffer sizes? Does it really matter ? At least at points in the past I have seen different drop counts at the SO_RCVBUF based on using (sometimes much) larger sizes. The hypothesis I was operating under at the time was that this dealt with those situations where the netserver was held-off from running for "a little while" from time to time. It didn't change things for a sustained overload situation though. In the meantime, is anything interesting happening with TCP_RR or TCP_STREAM? TCP_RR is driven by the network latency, we do not drop packets in the socket itself. I've been of the opinion it (single stream) is driven by path length. Sometimes by NIC latency. But then I'm almost always measuring in the LAN rather than across the WAN. happy benchmarking, rick
Re: strange Mac OSX RST behavior
On 07/01/2016 08:10 AM, Jason Baron wrote: I'm wondering if anybody else has run into this... On Mac OSX 10.11.5 (latest version), we have found that when tcp connections are abruptly terminated (via ^C), a FIN is sent followed by an RST packet. That just seems, well, silly. If the client application wants to use abortive close (sigh..) it should do so; there shouldn't be this little-bit-pregnant, correct close initiation (FIN) followed by an RST. The RST is sent with the same sequence number as the FIN, and thus dropped since the stack only accepts RST packets matching rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with an RST on the closed socket, but it appears that it does not. The workaround here is then to reset the connection if the RST is equal to rcv_nxt - 1 and we have already received a FIN. The RST attack surface is limited b/c we only accept the RST after we've accepted a FIN and have not previously sent a FIN and received back the corresponding ACK. In other words, RST is only accepted in the tcp states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING. I'm interested if anybody else has run into this issue. It's problematic since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT. Isn't the server application expected to act on the read return of zero (which is supposed to be triggered by the receipt of the FIN segment)? rick jones We are also in the process of contacting Apple to see what can be done here... workaround patch is below.
Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets
On 06/28/2016 02:59 AM, Dexuan Cui wrote: The idea here is: IMO the syscalls sys_read()/write() shouldn't return -ENOMEM, so I have to make sure the buffer allocation succeeds. I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning in mm/page_alloc.c: WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); What error code do you think I should return? EAGAIN, ERESTARTSYS, or something else? May I have your suggestion? Thanks! What happens as far as errno is concerned when an application makes a read() call against a (say TCP) socket associated with a connection which has been reset? Is it limited to those errno values listed in the read() manpage, or does it end-up getting an errno value from those listed in the recv() manpage? Or, perhaps even one not (presently) listed in either? rick jones
Re: [PATCH -next 2/2] virtio_net: Read the advised MTU
On 06/02/2016 10:06 AM, Aaron Conole wrote: Rick Jones writes: One of the things I've been doing has been setting-up a cluster (OpenStack) with JumboFrames, and then setting MTUs on instance vNICs by hand to measure different MTU sizes. It would be a shame if such a thing were not possible in the future. Keeping a warning if shrinking the MTU would be good, leave the error (perhaps) to if an attempt is made to go beyond the advised value. This was cut because it didn't make sense for such a warning to be issued, but it seems like perhaps you may want such a feature? I agree with Michael, after thinking about it, that I don't know what sort of use the warning would serve. After all, if you're changing the MTU, you must have wanted such a change to occur? I don't need a warning, was simply willing to live with one when shrinking the MTU. Didn't want an error. happy benchmarking, rick jones
Re: [RFC v2 -next 0/2] virtio-net: Advised MTU feature
On 03/15/2016 02:04 PM, Aaron Conole wrote: The following series adds the ability for a hypervisor to set an MTU on the guest during the feature negotiation phase. This is useful for VM orchestration when, for instance, tunneling is involved and the MTU of the various systems should be homogeneous. The first patch adds the feature bit as described in the proposed virtio spec addition found at https://lists.oasis-open.org/archives/virtio-dev/201603/msg1.html The second patch adds a user of the bit, and a warning when the guest changes the MTU from the hypervisor-advised MTU. Future patches may add more thorough error handling. How do you see this interacting with VMs getting MTU settings via DHCP? rick jones
v2:
* Whitespace and code style cleanups from Sergei Shtylyov and Paolo Abeni
* Additional test before printing a warning
Aaron Conole (2):
  virtio: Start feature MTU support
  virtio_net: Read the advised MTU
 drivers/net/virtio_net.c        | 12 ++++++++++++
 include/uapi/linux/virtio_net.h |  3 +++
 2 files changed, 15 insertions(+)
Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support
On 10/22/2015 02:33 AM, Michael S. Tsirkin wrote: On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote: This patch tries to poll for newly added tx buffers for a while at the end of tx processing. The maximum time spent on polling was limited through a module parameter. To avoid blocking rx, the loop will end if there's other new work queued on vhost, so in fact the socket receive queue is also polled. busyloop_timeout = 50 gives us the following improvement on TCP_RR tests:
size/session/+thu%/+normalize%
   1/      1/  +5%/ -20%
   1/     50/ +17%/  +3%
Is there a measurable increase in cpu utilization with busyloop_timeout = 0? And since a netperf TCP_RR test is involved, be careful about what netperf reports for CPU util if that increase isn't in the context of the guest OS. For completeness, looking at the effect on TCP_STREAM and TCP_MAERTS, aggregate _RR and even aggregate _RR/packets per second for many VMs on the same system would be in order. happy benchmarking, rick jones -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] tcp: Return error instead of partial read for saved syn headers
On 05/18/2015 11:35 AM, Eric B Munson wrote: Currently the getsockopt() requesting the cached contents of the syn packet headers will fail silently if the caller uses a buffer that is too small to contain the requested data. Rather than fail silently and discard the headers, getsockopt() should return an error and report the required size to hold the data. Is there any chapter and verse on whether a "failed" getsockopt() may alter the items passed to it? rick jones
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 11:32 AM, Eric Dumazet wrote: On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote: Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs? Keep in mind some VM users install custom qdisc, or even custom TCP sysctls. That could very well be, though I confess I've not seen that happening in my little corner of the cloud. They tend to want to launch the VM and go. Some of the more advanced/sophisticated ones might tweak a few things but my (admittedly limited) experience has been they are few in number. They just expect it to work "out of the box" (to the extent one can use that phrase still). It's kind of ironic - go back to the (early) 1990s when NICs generated a completion interrupt for every individual tx completion (and incoming packet) and all everyone wanted to do was coalesce/avoid interrupts. I guess that has gone rather far. And today to fight bufferbloat TCP gets tweaked to favor quick tx completions. Call it cycles, or pendulums or whatever I guess. I wonder just how consistent tx completion timings are for a VM so a virtio_net or whatnot in the VM can pick a per-device setting to advertise to TCP? Hopefully, full NIC emulation is no longer a thing and VMs "universally" use a virtual NIC interface. At least in my little corner of the cloud, emulated NICs are gone, and good riddance. rick
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 11:08 AM, Eric Dumazet wrote: On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote: Have you tested this patch on a NIC without GSO/TSO ? This would allow more than 500 packets for a single flow. Hello bufferbloat. Wouldn't the fq_codel qdisc on that interface address that problem? Last time I checked, default qdisc was pfifo_fast. Bummer. These guys do not want to change a sysctl, how will pfifo_fast magically become fq_codel ? Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs? rick
Re: TCP connection issues against Amazon S3
Strange thing is that sender does not misbehave at the beginning when receiver window is still small. Only after a while. Just guessing, but when the receiver window is small, the sender cannot get a large quantity of data out there at once, so any string of lost packets will tend to be smaller. If the sender is relying on the RTO to trigger the retransmits, and is not resetting his RTO until the clean ACK of a segment sent after snd_nxt when the loss is detected, the smaller loss strings will not get to the rather large RTO values seen in the trace before curl gives-up. It may be that the sender is indeed misbehaving at the beginning, just that it isn't noticeable? Different but perhaps related observation/question - without timestamps (which we don't have in this case), isn't there a certain ambiguity about arriving out-of-order segments? One doesn't really know if they are out-of-order because the network is re-ordering, or because they are retransmissions of segments we've not yet seen at the receiver. rick
Re: TCP connection issues against Amazon S3
On 01/06/2015 11:16 AM, Rick Jones wrote: I'm assuming one incident starts at XX:41:24.748265 in the trace? That does look like it is slowly slogging its way through a bunch of lost traffic, which was I think part of the problem I was seeing with the middlebox I stepped in, but I don't think I see the reset where I would have expected it. Still, it looks like the sender has an increasing TCP RTO as it is going through the slog (as it likely must since there are no TCP timestamps?), to the point it gets larger than I'm guessing curl was willing to wait, so the FIN at XX:41:53.269534 after a ten second or so gap. Should the receiver's autotuning be advertising an ever larger window the way it is while going through the slog of lost traffic? rick
Re: TCP connection issues against Amazon S3
A packet dump [1] shows repeated ACK retransmits for some of the TCP does not retransmit ACK ... do you mean DUPACKs sent by the receiver? I am trying to understand the problem. Could you confirm that it's the HTTP responses sent from Amazon S3 got stalled, or HTTP requests sent from the receiver (your host)? btw I suspect some middleboxes are stripping SACKOK options from your SYNs (or Amazon SYN-ACKs) assuming Amazon supports SACK. The TCP Timestamp option too it seems. Speaking of middleboxes... It is probably a fish that is red, but a while back I stepped in a middle box (a load balancer) which decided that if it saw "too many" retransmissions in a given TCP window that something was seriously wrong and it would toast the connection. I thought though that was an active reset on the part of the middlebox. (And the client was the active sender not the back-end server) I'm assuming one incident starts at XX:41:24.748265 in the trace? That does look like it is slowly slogging its way through a bunch of lost traffic, which was I think part of the problem I was seeing with the middlebox I stepped in, but I don't think I see the reset where I would have expected it. Still, it looks like the sender has an increasing TCP RTO as it is going through the slog (as it likely must since there are no TCP timestamps?), to the point it gets larger than I'm guessing curl was willing to wait, so the FIN at XX:41:53.269534 after a ten second or so gap. rick jones
Re: What's the concern about setting irq thread's policy as SCHED_FIFO
On 12/03/2014 12:06 AM, Qin Chuanyu wrote: I am doing network performance testing under suse11sp3 with an intel 82599 nic. Because the softirq is outside the schedule policy's control, the netserver thread couldn't always get 100% cpu usage, and then packets were dropped in the kernel udp socket's receive queue. In order to get a stable result, I did some patching in the ixgbe driver and then used an irq_thread instead of softirq to handle rx. It seems to work well, but the irq_thread's SCHED_FIFO schedule policy means that when the cpu is limited, netserver couldn't run at all. I cannot speak to any scheduling issues/questions, but can ask if you tried binding netserver to a CPU other than the one servicing the interrupts via the -T option on the netperf command line: netperf -T , ... http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#index-g_t_002dT_002c-Global-41 happy benchmarking, rick jones So I changed the irq_thread's schedule policy from SCHED_FIFO to SCHED_NORMAL, and then the irq_thread could share the cpu with the netserver thread. The question is: what's the concrete reason for setting the irq thread's policy to SCHED_FIFO? Apart from the priority affecting cpu usage, would any function be broken if the irq thread changed to SCHED_NORMAL?
Re: [QA-TCP] How to send tcp small packages immediately?
On 10/24/2014 12:41 AM, Zhangjie (HZ) wrote: Hi, I use netperf to test the performance of small tcp packets, with TCP_NODELAY set: netperf -H 129.9.7.164 -l 100 -- -m 512 -D Among the packets I captured with tcpdump, there are not only small packets, but also lots of big ones (skb->len=65160).
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 65160
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 65160
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 80
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 512
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 512
So, how do I test small tcp packets? Besides TCP_NODELAY, what else should be set? Well, I don't think there is anything else you can set. Even with TCP_NODELAY set, segment size with TCP will still be controlled by factors such as the congestion window. I am ass-u-me-ing your packet trace is at the sender. I suppose if your sender were fast enough compared to the path, that might combine with congestion window to result in the very large segments. Not to say there cannot be a bug somewhere with TSO overriding TCP_NODELAY, but in broad terms, even TCP_NODELAY does not guarantee small TCP segments. That has been something of a bane on my attempts to use TCP for aggregate small-packet performance measurements via netperf for quite some time. And since you seem to have included a virtualization mailing list I would also ass-u-me that virtualization is involved somehow. Knuth only knows how that will affect the timing of events, which will be very much involved in matters of congestion window and such.
I suppose it is even possible that if the packet trace is on a VM receiver that some delays in getting the VM running could mean that GRO would end-up making large segments being pushed up the stack. happy benchmarking, rick jones
Re: skbuff truesize incorrect.
On 05/23/2014 02:33 AM, Bjørn Mork wrote: Jim Baxter writes: I'll create and test a patch for the cdc_ncm host driver unless someone else wants to do that. I haven't really played with the gadget driver before, so I'd prefer if someone knowing it (Jim maybe?) could take care of it. If not, then I can always make an attempt using dummy_hcd to test it. I can create a patch for the host driver, I will issue the gadget patch first to resolve any issues, the fix would be similar. Well, I couldn't help myself. I just had to test it. The attached patch works for me, briefly tested with an Ericsson H5321gw NCM device. I have no ideas about the performance impact as that modem is limited to 21 Mbps HSDPA. If you are measuring performance with the likes of netperf, you should be able to get an idea of the performance effect from the change in service demand (CPU consumed per unit of work) even if the maximum throughput remains capped. You can run a netperf TCP_STREAM test along the lines of: netperf -H -c -C -t TCP_STREAM and also netperf -H -c -C -t TCP_RR For extra added credit you can consider either multiple runs and post-processing, or adding a -i 30,3 to the command line to tell netperf to run at least three iterations, no more than thirty and it will try to achieve a 99% confidence that the reported means for throughput, local and remote CPU utilization are within +/- 2.5% of the actual mean. You can narrow or widen that with a -I 99,. A width of 5% is what gives the +/- 2.5% (and/or demonstrates my lack of accurate statistics knowledge :) ) happy benchmarking, rick jones
Re: [PATCH 08/24] net, diet: Make TCP metrics optional
On 05/06/2014 09:41 AM, j...@joshtriplett.org wrote: On Tue, May 06, 2014 at 11:59:41AM -0400, David Miller wrote: Making 2MB RAM machines today makes no sense at all. The lowest end dirt cheap smartphone, something which fits on someone's pocket, has gigabytes of ram. The lowest-end smartphone isn't anywhere close to "dirt cheap", and hardly counts as "embedded" at all anymore. Smartphones cost $100+; we're talking about systems in the low tens of dollars or less. These systems will have no graphics, no peripherals, and only one or two specific functions. The entirety of their functionality will likely consist of a single userspace program; they might not even have a PID 2. *That's* the kind of "embedded" we're talking about, not the supercomputers we carry around in our pockets. Would this be some sort of "Internet of Things" system? rick jones
Re: network performance get regression from 2.6 to 3.10 by each version
On 05/02/2014 12:40 PM, V JobNickname wrote: I have an ARM platform which works with older 2.6.28 Linux Kernel and the embedded NIC driver I profile the TCP Tx using netperf 2.6 by command "./netperf -H {serverip} -l 300". Is your ARM platform a multi-core one? If so, you may need/want to look into making certain the assignment of NIC interrupts and netperf have remained constant through your tests. You can bind netperf to a specific CPU via either "taskset" or the global -T option. You can check the interrupt assignment(s) for the queue(s) from the NIC by looking at /proc/interrupts and perhaps via other means. It would also be good to know if the drops in throughput correspond to an increase in service demand (CPU per unit of work). To that end, adding a global -c option to measure local (netperf side) CPU utilization would be a good idea. Still, even armed with that information, tracking down the regression or regressions will be no small feat particularly since the timespan is so long. A very good reason to be trying the newer versions as they appear, even if only briefly, rather than leaving it for so long. happy benchmarking, rick jones
Re: A call to revise sockets behaviour
A wine developer clearly showed that this option simply doesn't work. http://bugs.winehq.org/show_bug.cgi?id=26031#c21 Output of strace: getsockopt(24, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0 setsockopt(24, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(24, {sa_family=AF_INET, sin_port=htons(43012), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) The output of netstat -an didn't by any chance happen to still show an endpoint in the LISTEN state for that port number did it? rick jones
Re: [RFC PATCH 0/5] net: low latency Ethernet device polling
On 02/27/2013 09:55 AM, Eliezer Tamir wrote: This patchset adds the ability for the socket layer code to poll directly on an Ethernet device's RX queue. This eliminates the cost of the interrupt and context switch and with proper tuning allows us to get very close to the HW latency. This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from last year http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf Patch 1 adds ndo_ll_poll and the IP code to use it. Patch 2 is an example of how TCP can use ndo_ll_poll. Patch 3 shows how this method would be implemented for the ixgbe driver. Patch 4 adds statistics to the ixgbe driver for ndo_ll_poll events. (Optional) Patch 5 is a handy kprobes module to measure detailed latency numbers. this patchset is also available in the following git branch git://github.com/jbrandeb/lls.git rfc Performance numbers:
Kernel   Config     C3/6  rx-usecs  TCP  UDP
3.8rc6   typical    off   adaptive  37k  40k
3.8rc6   typical    off   0*        50k  56k
3.8rc6   optimized  off   0*        61k  67k
3.8rc6   optimized  on    adaptive  26k  29k
patched  typical    off   adaptive  70k  78k
patched  optimized  off   adaptive  79k  88k
patched  optimized  off   100       84k  92k
patched  optimized  on    adaptive  83k  91k
*rx-usecs=0 is usually not useful in a production environment. I would think that latency-sensitive folks would be using rx-usecs=0 in production - at least if the NIC in use didn't have low enough latency with its default interrupt coalescing/avoidance heuristics. If I take the first "pure" A/B comparison it seems that the change as benchmarked takes latency for TCP from ~27 usec (37k) to ~14 usec (70k). At what request/response size does the benefit taper-off? 13 usec seems to be about 16250 bytes at 10 GbE.
When I last looked at netperf TCP_RR performance where something similar could happen I think it was IPoIB where it was possible to set things up such that polling happened rather than wakeups (perhaps it was with a shim library that converted netperf's socket calls to "native" IB). My recollection is that it "did a number" on the netperf service demands thanks to the spinning. It would be a good thing to include those figures in any subsequent rounds of benchmarking. Am I correct in assuming this is a mechanism which would not be used in a high aggregate PPS situation? happy benchmarking, rick jones
Re: Doubts about listen backlog and tcp_max_syn_backlog
On 01/24/2013 04:22 AM, Leandro Lucarella wrote: On Wed, Jan 23, 2013 at 11:28:08AM -0800, Rick Jones wrote: Then if syncookies are enabled, the time spent in connect() shouldn't be bigger than 3 seconds even if SYNs are being "dropped" by listen, right? Do you mean if "ESTABLISHED" connections are dropped because the listen queue is full? I don't think I would put that as "SYNs being dropped by listen" - too easy to confuse that with an actual dropping of a SYN segment. I was just kind of quoting the name given by netstat: "SYNs to LISTEN sockets dropped" (for kernel 3.0, I noticed newer kernels don't have this stat anymore, or the name was changed). I still don't know if we are talking about the same thing. Are you sure those stats are not present in 3.X kernels? I just looked at /proc/net/netstat on a 3.7 system and noticed both the ListenMumble stats and the three cookie stats. And I see the code for them in the tree:
raj@tardy:~/net-next/net/ipv4$ grep MIB_LISTEN *.c
proc.c: SNMP_MIB_ITEM("ListenOverflows", LINUX_MIB_LISTENOVERFLOWS),
proc.c: SNMP_MIB_ITEM("ListenDrops", LINUX_MIB_LISTENDROPS),
tcp_ipv4.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
tcp_ipv4.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
raj@tardy:~/net-next/net/ipv4$ grep MIB_SYN *.c
proc.c: SNMP_MIB_ITEM("SyncookiesSent", LINUX_MIB_SYNCOOKIESSENT),
proc.c: SNMP_MIB_ITEM("SyncookiesRecv", LINUX_MIB_SYNCOOKIESRECV),
proc.c: SNMP_MIB_ITEM("SyncookiesFailed", LINUX_MIB_SYNCOOKIESFAILED),
syncookies.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
syncookies.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
syncookies.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
I will sometimes be tripped-up by netstat's not showing a statistic with a zero value... But yes, I would not expect a connect() call to remain incomplete for any longer than it took to receive a SYN|ACK from the other end.
So the only reason to experience these high times spent in connect() should be because a SYN or SYN|ACK was actually lost in a lower layer, like an error in the network device or a transmission error? Modulo the/some other drop-without-stat point such as Vijay mentioned yesterday. You might consider taking some packet traces. If you can, I would start with a trace taken on the system(s) on which the long connect() calls are happening. I think the tcpdump manpage has an example of a tcpdump command with a filter expression that catches just SYNchronize and FINished segments, which I suppose you could extend to include ReSeT segments. Such a filter expression would be missing the client's ACK of the SYN|ACK, but unless you see incrementing stats relating to say checksum failures or other drops on the "client" side, I suppose you could assume that the client ACKed the server's SYN|ACK. That would be 3 (,9, 21, etc...) seconds on a kernel with 3 seconds as the initial retransmission timeout. Which can't be changed without recompiling, right? To the best of my knowledge. rick jones
Re: Doubts about listen backlog and tcp_max_syn_backlog
On 01/23/2013 02:47 AM, Leandro Lucarella wrote: Thanks for the info. I'm definitely dropping SYNs and sending cookies, around 50/s. Is there any way to tell how many connections are queued in a particular socket? I am not familiar with one. Doesn't mean there isn't one, only that I am not able to think of it. Then if syncookies are enabled, the time spent in connect() shouldn't be bigger than 3 seconds even if SYNs are being "dropped" by listen, right? Do you mean if "ESTABLISHED" connections are dropped because the listen queue is full? I don't think I would put that as "SYNs being dropped by listen" - too easy to confuse that with an actual dropping of a SYN segment. But yes, I would not expect a connect() call to remain incomplete for any longer than it took to receive an SYN|ACK from the other end. That would be 3 (,9, 21, etc...) seconds on a kernel with 3 seconds as the initial retransmission timeout. rick
Re: Doubts about listen backlog and tcp_max_syn_backlog
On 01/22/2013 10:42 AM, Leandro Lucarella wrote: On Tue, Jan 22, 2013 at 10:17:50AM -0800, Rick Jones wrote: What is important is the backlog, and I guess you didn't increase it properly. The somaxconn default is quite low (128) Leandro - If that is being overflowed, I believe you should be seeing something like: 14 SYNs to LISTEN sockets dropped in the output of netstat -s on the system on which the server application is running. What is that value reporting exactly? Netstat is reporting the ListenDrops and/or ListenOverflows which map to LINUX_MIB_LISTENDROPS and LINUX_MIB_LISTENOVERFLOWS. Those get incremented in tcp_v4_syn_recv_sock() (and its v6 version etc) if (sk_acceptq_is_full(sk)) goto exit_overflow; Will increment both overflows and drops, and drops will increment on its own in some additional cases. Because we are using syncookies, and AFAIK with that enabled, all SYNs are being replied, and what the listen backlog is really limitting is the "completely established sockets waiting to be accepted", according to listen(2). What I don't really know to be honest, is what a "completely established socket" is, does it mean that the SYN,ACK was sent, or the ACK was received back? I have always thought it meant that the ACK of the SYN|ACK has been received. SyncookiesSent SyncookiesRecv SyncookiesFailed also appear in /proc/net/netstat and presumably in netstat -s output. Also, from the client side, when is the connect(2) call done? When the SYN,ACK is received? That would be my assumption. In a previous message: What I'm seeing are clients taking either useconds to connect, or 3 seconds, which suggest SYNs are getting lost, but the network doesn't seem to be the problem. I'm still investigating this, so unfortunately I'm not really sure. I recently ran into something like that, which turned-out to be an issue with nf_conntrack and its table filling. 
rick
Re: [PATCH net-next] tcp: add ability to set a timestamp offset
On 01/22/2013 12:52 PM, Andrey Vagin wrote: If a TCP socket gets live-migrated from one box to another the timestamps (which are typically ON) will get screwed up -- the new kernel will generate TS values that have nothing to do with what they were on dump. The solution is to yet again fix the kernel and put a "timestamp offset" on a socket. Is there a chance a connection can be moved more than once within the "lifetime" of a given timestamp value? rick jones
Re: Doubts about listen backlog and tcp_max_syn_backlog
What is important is the backlog, and I guess you didn't increase it properly. The somaxconn default is quite low (128) Leandro - If that is being overflowed, I believe you should be seeing something like: 14 SYNs to LISTEN sockets dropped in the output of netstat -s on the system on which the server application is running. rick
Re: [PATCH net-next] softirq: reduce latencies
On 01/03/2013 05:31 AM, Eric Dumazet wrote: A common network load is to launch ~200 concurrent TCP_RR netperf sessions like the following netperf -H remote_host -t TCP_RR -l 1000 And then you can launch some netperf asking for P99_LATENCY results: netperf -H remote_host -t TCP_RR -- -k P99_LATENCY In terms of netperf overhead, once you specify P99_LATENCY, you are already in for the pound of cost but only getting the penny of output (so to speak). While it would clutter the output, one could go ahead and ask for the other latency stats and it won't "cost" anything more: ... -- -k RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY Additional information about how the omni output selectors work can be found at http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Omni-Output-Selection happy benchmarking, rick jones BTW - you will likely see some differences between RT_LATENCY, which is calculated from the average transactions per second, and MEAN_LATENCY, which is calculated from the histogram of individual latencies maintained when any of the _LATENCY outputs other than RT_LATENCY is requested. Kudos to the folks at Google who did the extensions to the then-existing histogram code to enable it to be used for more reasonably accurate statistics.
Re: [rfc net-next v6 0/3] Multiqueue virtio-net
On 10/30/2012 03:03 AM, Jason Wang wrote: Hi all: This series is an updated version of the multiqueue virtio-net driver, based on Krishna Kumar's work, to let virtio-net use multiple rx/tx queues for packet reception and transmission. Please review and comment. Changes from v5:
- Align the implementation with the RFC spec update v4
- Switch between single mode and multiqueue mode without reset
- Remove the 256 limitation of queues
- Use helpers to do the mapping between virtqueues and tx/rx queues
- Use combined channels instead of separate rx/tx queues when doing the queue number configuration
- Other coding style comments from Michael
Reference:
- A prototype implementation of qemu-kvm support can be found at git://github.com/jasowang/qemu-kvm-mq.git
- V5 could be found at http://lwn.net/Articles/505388/
- V4 could be found at https://lkml.org/lkml/2012/6/25/120
- V2 could be found at http://lwn.net/Articles/467283/
- Michael's virtio-spec: http://www.spinics.net/lists/netdev/msg209986.html
Perf Numbers:
- Pktgen tests show the receiving capability of multiqueue virtio-net was dramatically improved.
- Netperf results show latency was greatly improved according to the test results.
I suppose it is technically correct to say that latency was improved, but usually for aggregate request/response tests I tend to talk about the aggregate transactions per second. Do you have a hypothesis as to why the improvement dropped going to 20 concurrent sessions from 10?
rick jones

Netperf Local VM to VM test:
- VM1 and its vcpu/vhost thread in numa node 0
- VM2 and its vcpu/vhost thread in numa node 1
- a script is used to launch netperf in demo mode and post-process the aggregate result with the help of timestamps
- average of 3 runs

TCP_RR: size/session/+lat%/+normalize%
1/ 1/0%/0% 1/10/ +52%/ +6% 1/20/ +27%/ +5%
64/ 1/0%/0% 64/10/ +45%/ +4% 64/20/ +28%/ +7%
256/ 1/ -1%/0% 256/10/ +38%/ +2% 256/20/ +27%/ +6%

TCP_CRR: size/session/+lat%/+normalize%
1/ 1/ -7%/ -12% 1/10/ +34%/ +3% 1/20/ +3%/ -8%
64/ 1/ -7%/ -3% 64/10/ +32%/ +1% 64/20/ +4%/ -7%
256/ 1/ -6%/ -18% 256/10/ +33%/0% 256/20/ +4%/ -8%

STREAM: size/session/+thu%/+normalize%
1/ 1/ -3%/0% 1/ 2/ -1%/0% 1/ 4/ -2%/0%
64/ 1/0%/ +1% 64/ 2/ -6%/ -6% 64/ 4/ -8%/ -14%
256/ 1/0%/0% 256/ 2/ -48%/ -52% 256/ 4/ -50%/ -55%
512/ 1/ +4%/ +5% 512/ 2/ -29%/ -33% 512/ 4/ -37%/ -49%
1024/ 1/ +6%/ +7% 1024/ 2/ -46%/ -51% 1024/ 4/ -15%/ -17%
4096/ 1/ +1%/ +1% 4096/ 2/ +16%/ -2% 4096/ 4/ +31%/ -10%
16384/ 1/0%/0% 16384/ 2/ +16%/ +9% 16384/ 4/ +17%/ -9%

Netperf test between external host and guest over 10gb (ixgbe):
- VM thread and vhost threads were pinned in node 0
- a script is used to launch netperf in demo mode and post-process the aggregate result with the help of timestamps
- average of 3 runs

TCP_RR: size/session/+lat%/+normalize%
1/ 1/0%/ +6% 1/10/ +41%/ +2% 1/20/ +10%/ -3%
64/ 1/0%/ -10% 64/10/ +39%/ +1% 64/20/ +22%/ +2%
256/ 1/0%/ +2% 256/10/ +26%/ -17% 256/20/ +24%/ +10%

TCP_CRR: size/session/+lat%/+normalize%
1/ 1/ -3%/ -3% 1/10/ +34%/ -3% 1/20/0%/ -15%
64/ 1/ -3%/ -3% 64/10/ +34%/ -3% 64/20/ -1%/ -16%
256/ 1/ -1%/ -3% 256/10/ +38%/ -2% 256/20/ -2%/ -17%

TCP_STREAM (guest receiving): size/session/+thu%/+normalize%
1/ 1/ +1%/ +14% 1/ 2/0%/ +4% 1/ 4/ -2%/ -24%
64/ 1/ -6%/ +1% 64/ 2/ +1%/ +1% 64/ 4/ -1%/ -11%
256/ 1/ +3%/ +4% 256/ 2/0%/ -1% 256/ 4/0%/ -15%
512/ 1/ +4%/0% 512/ 2/ -10%/ -12% 512/ 4/0%/ -11%
1024/ 1/ -5%/0% 1024/ 2/ -11%/ -16% 1024/ 4/ +3%/ -11%
4096/ 1/ +27%/ +6% 4096/ 2/0%/ -12% 4096/ 4/0%/ -20%
16384/ 1/0%/ -2% 16384/ 2/0%/ -9% 16384/ 4/ +10%/ -2%

TCP_MAERTS (guest sending):
1/ 1/ -1%/0% 1/ 2/0%/0% 1/ 4/ -5%/0%
64/ 1/0%/0% 64/ 2/ -7%/ -8% 64/ 4/ -7%/ -8%
256/ 1/0%/0% 256/ 2/ -28%/ -28% 256/ 4/ -28%/ -29%
512/ 1/0%/0% 512/ 2/ -15%/ -13% 512/ 4/ -53%/ -59%
1024/ 1/ +4%/ +13% 1024/ 2/ -7%/ -18% 1024/ 4/ +1%/ -18%
4096/ 1/
Re: Netperf UDP_STREAM regression due to not sending IPIs in ttwu_queue()
On 10/03/2012 02:47 AM, Mel Gorman wrote: On Tue, Oct 02, 2012 at 03:48:57PM -0700, Rick Jones wrote: On 10/02/2012 01:45 AM, Mel Gorman wrote: SIZE=64 taskset -c 0 netserver taskset -c 1 netperf -t UDP_STREAM -i 50,6 -I 99,1 -l 20 -H 127.0.0.1 -- -P 15895 -s 32768 -S 32768 -m $SIZE -M $SIZE Just FYI, unless you are running a hacked version of netperf, the "50" in "-i 50,6" will be silently truncated to 30. I'm not using a hacked version of netperf. The 50,6 has been there a long time so I'm not sure where I took it from any more. It might have been an older version or me being over-zealous at the time. No version has ever gone past 30. It has been that way since the confidence interval code was contributed. It doesn't change anything, so it hasn't messed-up any results. It would be good to fix but not critical. PS - I trust it is the receive-side throughput being reported/used with UDP_STREAM :) Good question. Now that I examine the scripts, it is in fact the sending side that is being reported which is flawed. Granted I'm not expecting any UDP loss on loopback and looking through a range of results, the difference is marginal. It's still wrong to report just the sending side for UDP_STREAM and I'll correct the scripts for it in the future. Switching from sending to receiving throughput in UDP_STREAM could be a non-trivial disconnect in throughputs. As Eric mentions, the receiver could be dropping lots of datagrams if it cannot keep-up, and netperf makes no attempt to provide any application-layer flow-control. Not sure which version of netperf you are using to know whether or not it has gone to the "omni" code path. If you aren't using 2.5.0 or 2.6.0 then the confidence intervals will have been computed based on the receive side throughput, so you will at least know that it was stable, even if it wasn't the same as the sending side. The top of trunk will use the remote's receive stats for the omni migration of a UDP_STREAM test too.
I think it is that way in 2.5.0 and 2.6.0 as well but I've not gone into the repository to check. Of course, that means you don't necessarily know that the sending throughput met your confidence intervals :) If you are on 2.5.0 or later, you may find: http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Omni-Output-Selection helpful when looking to parse results. One more, little thing - taskset may indeed be better for what you are doing (it will happen "sooner" certainly), but there is also the global -T option to bind netperf/netserver to the specified CPU id. http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#index-g_t_002dT_002c-Global-41 happy benchmarking, rick jones
Re: Netperf UDP_STREAM regression due to not sending IPIs in ttwu_queue()
On 10/02/2012 01:45 AM, Mel Gorman wrote: SIZE=64 taskset -c 0 netserver taskset -c 1 netperf -t UDP_STREAM -i 50,6 -I 99,1 -l 20 -H 127.0.0.1 -- -P 15895 -s 32768 -S 32768 -m $SIZE -M $SIZE Just FYI, unless you are running a hacked version of netperf, the "50" in "-i 50,6" will be silently truncated to 30. happy benchmarking, rick jones PS - I trust it is the receive-side throughput being reported/used with UDP_STREAM :)
Re: getsockopt/setsockopt with SO_RCVBUF and SO_SNDBUF "non-standard" behaviour
On 07/18/2012 09:11 AM, Eric Dumazet wrote: That's the way it's been done on Linux since day 0. You can probably find a lot of pages on the web explaining the rationale. If your application handles UDP frames, what should SO_RCVBUF count? If it's the amount of payload bytes, you could have a pathological situation where an attacker sends 1-byte UDP frames fast enough and could consume a lot of kernel memory. Each frame consumes a fair amount of kernel memory (between 512 bytes and 8 Kbytes depending on the driver). So Linux says: if the user expects to receive bytes, set a limit on the _kernel_ memory used to store those bytes, and use an estimation of 100% overhead. That is: allow 2*bytes to be allocated for socket receive buffers. Expanding on/rewording that, in a setsockopt() call SO_RCVBUF specifies the data bytes and gets doubled to become the kernel/overhead byte limit. Unless the doubling would be greater than net.core.rmem_max, in which case the limit becomes net.core.rmem_max. But on getsockopt() SO_RCVBUF is always the kernel/overhead byte limit. In one call it is fish. In the other it is fowl. Other stacks appear to keep their kernel/overhead limit quiet, keeping SO_RCVBUF an expression of a data limit in both setsockopt() and getsockopt(). With those stacks, there is I suppose the possible source of confusion when/if someone tests the queuing to a socket, sends "high overhead" packets and doesn't get to SO_RCVBUF worth of data, though I don't recall encountering that in my "pre-linux" time. The sometimes fish, sometimes fowl version (along with the auto tuning when one doesn't make setsockopt() calls) gave me fits in netperf for years until I finally relented and split the socket buffer size variables into three - what netperf's user requested via the command line, what it was right after the socket was created, and what it was at the end of the data phase of the test.
rick jones
Re: [net-next RFC V5 0/5] Multiqueue virtio-net
On 07/08/2012 08:23 PM, Jason Wang wrote: On 07/07/2012 12:23 AM, Rick Jones wrote: On 07/06/2012 12:42 AM, Jason Wang wrote: Which mechanism to address skew error? The netperf manual describes more than one: This mechanism was missing in my test; I will add it to my test scripts. http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance Personally, my preference these days is to use the "demo mode" method of aggregate results as it can be rather faster than (ab)using the confidence intervals mechanism, which I suspect may not really scale all that well to large numbers of concurrent netperfs. During my test, the confidence interval was hard to achieve even in RR tests when I pinned vhost/vcpus to processors, so I didn't use it. When running aggregate netperfs, *something* has to be done to address the prospect of skew error. Otherwise the results are suspect. happy benchmarking, rick jones
Re: [net-next RFC V5 0/5] Multiqueue virtio-net
On 07/06/2012 12:42 AM, Jason Wang wrote: I'm not expert of tcp, but looks like the changes are reasonable: - we can do full-sized TSO check in tcp_tso_should_defer() only for westwood, according to tcp westwood - run tcp_tso_should_defer for tso_segs = 1 when tso is enabled. I'm sure Eric and David will weigh-in on the TCP change. My initial inclination would have been to say "well, if multiqueue is draining faster, that means ACKs come-back faster, which means the "race" between more data being queued by netperf and ACKs will go more to the ACKs which means the segments being sent will be smaller - as TCP_NODELAY is not set, the Nagle algorithm is in force, which means once there is data outstanding on the connection, no more will be sent until either the outstanding data is ACKed, or there is an accumulation of > MSS worth of data to send. Also, how are you combining the concurrent netperf results? Are you taking sums of what netperf reports, or are you gathering statistics outside of netperf? The throughput were just sumed from netperf result like what netperf manual suggests. The cpu utilization were measured by mpstat. Which mechanism to address skew error? The netperf manual describes more than one: http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance Personally, my preference these days is to use the "demo mode" method of aggregate results as it can be rather faster than (ab)using the confidence intervals mechanism, which I suspect may not really scale all that well to large numbers of concurrent netperfs. I also tend to use the --enable-burst configure option to allow me to minimize the number of concurrent netperfs in the first place. Set TCP_NODELAY (the test-specific -D option) and then have several transactions outstanding at one time (test-specific -b option with a number of additional in-flight transactions). 
This is expressed in the runemomniaggdemo.sh script: http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomniaggdemo.sh which uses the find_max_burst.sh script: http://www.netperf.org/svn/netperf2/trunk/doc/examples/find_max_burst.sh to pick the burst size to use in the concurrent netperfs, the results of which can be post-processed with: http://www.netperf.org/svn/netperf2/trunk/doc/examples/post_proc.py The nice feature of using the "demo mode" mechanism is that when it is coupled with systems with reasonably synchronized clocks (eg NTP) it can be used for many-to-many testing in addition to one-to-many testing (which cannot be dealt with by the confidence interval method of dealing with skew error). A single instance TCP_RR test would help confirm/refute any non-trivial change in (effective) path length between the two cases. Yes, I will test this, thanks. Excellent. happy benchmarking, rick jones
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
*) netperf/netserver support CPU affinity within themselves with the global -T option to netperf. Is the result with taskset much different? The equivalent to the above would be to run netperf with: ./netperf -T 0,7 .. I checked the source code and didn't find this option. I use netperf V2.3 (I found the number in the makefile). Indeed, that version pre-dates the -T option. If you weren't already chasing a regression I'd suggest an upgrade to 2.4.mumble. Once you are at a point where changing another variable won't muddle things you may want to consider upgrading. happy benchmarking, rick jones
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
The test command is:
#sudo taskset -c 7 ./netserver
#sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- -r 1,1
A couple of comments/questions on the command lines: *) netperf/netserver support CPU affinity within themselves with the global -T option to netperf. Is the result with taskset much different? The equivalent to the above would be to run netperf with: ./netperf -T 0,7 ... The one possibly salient difference between the two is that when done within netperf, the initial process creation will take place wherever the scheduler wants it. *) The -i option to set the confidence iteration count will silently cap the max at 30. happy benchmarking, rick jones
Re: questions on NAPI processing latency and dropped network packets
1) Interrupts are being processed on both cpus:
[EMAIL PROTECTED]:/root> cat /proc/interrupts
    CPU0    CPU1
30: 17037564530785 U3-MPIC Level eth0
IIRC none of the e1000 driven cards are multi-queue, so while the above shows that interrupts from eth0 have been processed on both CPUs at various points in the past, it doesn't necessarily mean that they are being processed on both CPUs at the same time right? rick jones
Re: AF_UNIX MSG_PEEK bug?
Potential bugs notwithstanding, given that this is a STREAM socket, and as such shouldn't (I hope, or I'm eating toes for dinner again) have side effects like tossing the rest of a datagram, why are you using MSG_PEEK? Why not simply read the N bytes of the message that will have the message length with a normal read/recv, and then read that many bytes in the next call? rick jones
Re: Reproducible data corruption with sendfile+vsftp - splice regression?
Could the corruption be seen in a tcpdump trace prior to transmission (ie taken on the sender) or was it only seen after the data passed out the NIC? rick jones
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
Adrian Bunk wrote: On Tue, Nov 27, 2007 at 01:15:23PM -0800, Rick Jones wrote: The real problem is that these drivers are not in the upstream kernel. Are there common reasons why these drivers are not upstream? One might be that upstream has not accepted them. Anything doing or smelling of TOE comes to mind right away. Which modules doing or smelling of TOE do work with unmodified vendor kernels? At the very real risk of further demonstrating my Linux vocabulary limitations, I believe there is a "Linux Sockets Acceleration" module/whatnot for NetXen and related 10G NICs, and a cxgb3_toe (?) module for Chelsio 10G NICs. rick jones
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
The real problem is that these drivers are not in the upstream kernel. Are there common reasons why these drivers are not upstream? One might be that upstream has not accepted them. Anything doing or smelling of TOE comes to mind right away. rick jones
Re: [PATCH 0/5][RFC] Physical PCI slot objects
Greg KH wrote: Doesn't /sys/firmware/acpi give you raw access to the correct tables already? And isn't there some other tool that dumps the raw ACPI tables? I thought the acpi developers used it all the time when debugging things with users. I'm neither an acpi developer (well I don't think that I am :) nor an end-user, but here are the two things for which I was going to use the information being presented by Alex's patch: 1) a not-yet, but on track to be released tool to be used by end-users to diagnose I/O bottlenecks - the information in /sys/bus/pci/slot//address would be used to associate interfaces and/or pci busses etc with something the end user would grok - the number next to the slot. 2) I was also going to get the folks doing installers to make use of the "end-user" slot ID. Even without going to the extreme of the aforementioned 192 slot system, an 8 slot system with a bunch of dual-port NICs in it (for example) is going to present this huge list of otherwise identical entries. Even if the installers show the MAC for a NIC (or I guess a WWN for an HBA or whatnot) that still doesn't tell one without prior knowledge of what MACs were installed in which slot, which slot is associated with a given ethN. Having the end-user slot ID visible is then going to be a great help to that poor admin who is doing the install. rick jones
Re: bizarre network timing problem
Felix von Leitner wrote: Thus spake Rick Jones ([EMAIL PROTECTED]): Past performance is no guarantee of current correctness :) And over an Ethernet, there will be a very different set of both timings and TCP segment sizes compared to loopback. My guess is that you will find setting the lo mtu to 1500 a very interesting experiment. Setting the MTU on lo to 1500 eliminates the problem and gives me double digit MB/sec throughput. I'm not in a position at the moment to test it as my IPoIB systems are offline, and not sure you are either, but I will note that with IPoIB bits circa OFED1.2 the default MTU for IPoIB goes up to 65520 bytes. If indeed the problem you were seeing was related to sub-mss sends and window probing and such, it might appear on IPoIB in addition to loopback. rick jones
Re: bizarre network timing problem
Felix von Leitner wrote: Thus spake Rick Jones ([EMAIL PROTECTED]): Oh I'm pretty sure it's not my application, because my application performs well over ethernet, which is after all its purpose. Also I see the write, the TCP uncork, then a pause, and then the packet leaving. Well, a wise old engineer tried to teach me that the proper spelling is ass-u-me :) so just for grins, you might try the TCP_RR test anyway :) And even if your application is correct (although I wonder why the receiver isn't sucking data-out very quickly...) if you can reproduce the problem with netperf it will be easier for others to do so. My application is only the server, the receiver is smbget from Samba, so I don't feel responsible for it :-) Might want to strace it anyway... no good deed (such as reporting a potential issue) goes unpunished :) Still, when run over Ethernet, it works fine without waiting for timeouts to expire. Past performance is no guarantee of current correctness :) And over an Ethernet, there will be a very different set of both timings and TCP segment sizes compared to loopback. My guess is that you will find setting the lo mtu to 1500 a very interesting experiment. To reproduce this: - smbget is from samba, you probably already have this - gatling (my server) can be gotten from cvs -d :pserver:[EMAIL PROTECTED]:/cvs -z9 co dietlibc libowfat gatling dietlibc is not strictly needed, but it's my environment. First build dietlibc, then libowfat, then gatling. Felix
Re: bizarre network timing problem
Felix von Leitner wrote: Thus spake Rick Jones ([EMAIL PROTECTED]): How could I test this theory? Can you take another trace that isn't so "cooked?" One that just sticks with TCP-level and below stuff? Sorry for taking so long. Here is a tcpdump. The side on port 445 is the SMB server using TCP_CORK.
23:03:20.283772 IP 127.0.0.1.33230 > 127.0.0.1.445: S 1503927325:1503927325(0) win 32792
23:03:20.283774 IP 127.0.0.1.445 > 127.0.0.1.33230: S 1513925692:1513925692(0) ack 1503927326 win 32768
23:03:20.283797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 1 win 257
23:03:20.295851 IP 127.0.0.1.33230 > 127.0.0.1.445: P 1:195(194) ack 1 win 257
23:03:20.295881 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 195 win 265
23:03:20.295959 IP 127.0.0.1.445 > 127.0.0.1.33230: P 1:87(86) ack 195 win 265
23:03:20.295998 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 87 win 256
23:03:20.296063 IP 127.0.0.1.33230 > 127.0.0.1.445: P 195:287(92) ack 87 win 256
23:03:20.296096 IP 127.0.0.1.445 > 127.0.0.1.33230: P 87:181(94) ack 287 win 265
23:03:20.296135 IP 127.0.0.1.33230 > 127.0.0.1.445: P 287:373(86) ack 181 win 255
23:03:20.296163 IP 127.0.0.1.445 > 127.0.0.1.33230: P 181:239(58) ack 373 win 265
23:03:20.296201 IP 127.0.0.1.33230 > 127.0.0.1.445: P 373:459(86) ack 239 win 255
23:03:20.296245 IP 127.0.0.1.445 > 127.0.0.1.33230: P 239:309(70) ack 459 win 265
23:03:20.296286 IP 127.0.0.1.33230 > 127.0.0.1.445: P 459:535(76) ack 309 win 254
23:03:20.296314 IP 127.0.0.1.445 > 127.0.0.1.33230: P 309:461(152) ack 535 win 265
23:03:20.296361 IP 127.0.0.1.33230 > 127.0.0.1.445: P 535:594(59) ack 461 win 253
23:03:20.296400 IP 127.0.0.1.445 > 127.0.0.1.33230: . 461:16845(16384) ack 594 win 265
23:03:20.335748 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 16845 win 125
[note the .2 sec pause]
I wonder if the ack 16845 win 125 not updating the window is part of it?
With a window scale of 7, the advertised window of 125 is only 16000 bytes, and it looks based on what follows that TCP has another 16384 to send, so my guess is that TCP was waiting to have enough window, the persist timer expired and TCP then had to say "oh well, send what I can". Probably a coupling with this being less than the MSS (16396) involved too.
23:03:20.547763 IP 127.0.0.1.445 > 127.0.0.1.33230: P 16845:32845(16000) ack 594 win 265
23:03:20.547797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 0
Notice that an ACK comes back with a zero window in it - that means that by this point the receiver still hasn't consumed the 16384+16000 bytes sent to it.
23:03:20.547855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 96
Now the receiver has pulled some data, on the order of 96*128 bytes, so TCP can now go ahead and send the remaining 384 bytes.
23:03:20.547863 IP 127.0.0.1.445 > 127.0.0.1.33230: P 32845:33229(384) ack 594 win 265
23:03:20.547890 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 33229 win 96
[note the .2 sec pause]
I'll bet that 96 * 128 is 12288 and we have another persist timer expiring. I also wonder if the behaviour might be different if you were using send() rather than sendfile() - just random musings...
23:03:20.755775 IP 127.0.0.1.445 > 127.0.0.1.33230: P 33229:45517(12288) ack 594 win 265
23:03:20.755855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 45517 win 96
23:03:20.755868 IP 127.0.0.1.445 > 127.0.0.1.33230: P 45517:49613(4096) ack 594 win 265
23:03:20.755898 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 49613 win 96
[another one]
23:03:20.963789 IP 127.0.0.1.445 > 127.0.0.1.33230: P 49613:61901(12288) ack 594 win 265
23:03:20.963871 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 61901 win 96
23:03:20.963885 IP 127.0.0.1.445 > 127.0.0.1.33230: P 61901:64525(2624) ack 594 win 265
23:03:20.963909 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 64525 win 96
23:03:20.964101 IP 127.0.0.1.33230 > 127.0.0.1.445: P 594:653(59) ack 64525 win 96
23:03:21.003790 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 653 win 265
23:03:21.171811 IP 127.0.0.1.445 > 127.0.0.1.33230: P 64525:76813(12288) ack 653 win 265
You get the idea. Anyway, now THIS is the interesting case, because we have two packets in the answer, and you see the first half of the answer leaving immediately (when I wanted the whole answer to be sent) but the second only leaving after the .2 sec delay. And it wasn't waiting for an ACK/window-update. You could try: ifconfig lo mtu 1500 and see what happens then. If SMB is a one-request-at-a-time protocol (I can never remember), It is. Joy. you could simulate it with a netperf TCP_RR test by passing suitable values to the test-specific -r option: netperf -H -t TCP_RR -- -r , If that shows similar behaviour then you can ass-u-me it isn't your application. Oh I'm pretty sure it's
Re: expected behavior of PF_PACKET on NETIF_F_HW_VLAN_RX device?
David Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> I'll try to go pester folks in tcpdump-workers then. The thing to check is "TP_STATUS_CSUMNOTREADY". When using mmap(), it will be provided in the descriptor. When using recvmsg() it will be provided via a PACKET_AUXDATA control message when enabled via the PACKET_AUXDATA socket option. Figures... the "dailies" and "weeklies" for tar files of tcpdump and libpcap source are fubar... again. I've email in to tcpdump-workers on that one. If that isn't resolved quickly I'll learn how to access their CVS (pick an SCM, any SCM...) I did an apt-get of debian lenny's tcpdump and sources: hpcpc103:~# tcpdump -V tcpdump version 3.9.8 libpcap version 0.9.8 and that seems to show the false checksum failure and not use the TP_STATUS_CSUMNOTREADY - at least that didn't appear in a grepping of the sources. At first I thought it might be, but then I realized that my snaplen was too short to get the whole TSO'ed frame so tcpdump wasn't even trying to verify. After disabling TSO on the NIC, leaving CKO on, and making my snaplen > 1500 I could see it was doing undesirable stuff. I'll see what top of trunk has at some point and what the folks there think of adding-in a change. rick jones - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: expected behavior of PF_PACKET on NETIF_F_HW_VLAN_RX device?
David Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> Date: Thu, 01 Nov 2007 14:48:45 -0700 One could I suppose try to amend the information passed to allow tcpdump to say "oh, this was a tx packet on the same machine on which I am tracing so don't worry about checksum mismatch" We do this already! I'll try to go pester folks in tcpdump-workers then. rick
Re: expected behavior of PF_PACKET on NETIF_F_HW_VLAN_RX device?
The code in AF_PACKET should fix the skb before passing to user space so that there is no difference between accel and non-accel hardware. Internal choices shouldn't leak to user space. Ditto, the receive checksum offload should be fixed up as well. yep. bad csum on tx packets as reported by tcpdump is also an issue. With TX CKO enabled, there isn't any checksum to fixup when a tx packet is sniffed, so I'm not sure what can be done in the kernel apart from an unpalatable "disable CKO and all which depend upon it when entering promiscuous mode." Having the tap calculate a checksum would be equally bad for performance, and would frankly be incorrect anyway because it would give the user the false impression that was the checksum which went-out onto the wire. One could I suppose try to amend the information passed to allow tcpdump to say "oh, this was a tx packet on the same machine on which I am tracing so don't worry about checksum mismatch" but I have to wonder if it is _really_ worth it. Already someone has to deal with seeing TCP segments >> the MSS thanks to TSO. (Actually tcpdump got rather confused about that too since the IP length of those was 0, but IIRC we got that patched to use the length of zero as a "ah, this was TSO so wing it" heuristic.) rick jones
Re: [bug, 2.6.24-rc1] sysfs: duplicate filename 'eth0' can not be created
[/proc/interrupts excerpt, per-CPU counter columns mangled in transit. The legible rows show the usual LSAPIC entries (cmc_hndlr, mca_rdzv, perfmon, timer, mca_wkup, tlb_flush, resched, IPI) and IO-SAPIC-level entries (acpi, serial, ehci_hcd:usb1, ohci_hcd:usb2/usb3, cciss0), plus the Neterion 10 Gigabit Ethernet-SR Low Profile PCI-X 2.0 DDR A on two interrupts:

 60: 0 0 0 0 11945 0 0 0 IO-SAPIC-level eth6
 61: 0 0 0 0 0 101072 0 0 IO-SAPIC-level eth6]

it appears as eth6. rick jones ... Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6 patch] always export sysctl_{r,w}mem_max
David Miller wrote: If DLM really wants minimum, it can use SO_SNDBUFFORCE and SO_RCVBUFFORCE socket options and use whatever limits it likes. But even this is questionable. Drift... Is that something netperf should be using though? Right now it uses the regular SO_[SND|RCV]BUF calls and is at the mercy of sysctls. I wonder if it would be better to have it use their FORCE versions to make life easier on the benchmarker - such as myself - who has an unfortunate habit of forgetting to update sysctl.conf :) rick jones
Re: [2.6 patch] always export sysctl_{r,w}mem_max
Eric W. Biederman wrote: Adrian Bunk <[EMAIL PROTECTED]> writes: This patch fixes the following build error with CONFIG_SYSCTL=n:

<-- snip -->
...
ERROR: "sysctl_rmem_max" [fs/dlm/dlm.ko] undefined!
ERROR: "sysctl_wmem_max" [drivers/net/rrunner.ko] undefined!
ERROR: "sysctl_rmem_max" [drivers/net/rrunner.ko] undefined!
make[2]: *** [__modpost] Error 1

I was going to ask if allowing drivers to increase rmem_max is something that we want to do. Apparently the road runner driver has been doing this since 2.6.12-rc1, when the git repository starts, so this probably isn't a latent bug. Although it does rather sound like a driver writer yanking the rope from the hands of the sysadmin and hanging him with it rather than letting the sysadmin do it himself. I've seen other drivers' README's suggesting larger mem's but not their sources doing it. rick jones
Re: Bad TCP checksum error
Checksum Offload on the NIC(s) can complicate things. First, if you are tracing on the sender, the tracepoint is before the NIC has computed the full checksum. IIRC only a partial checksum is passed-down to the NIC when CKO is in use. So, making certain your trace is from the "wire" or the receiver rather than the sender would be a good thing, and trying again with CKO disabled on the interface(s) (via ethtool) might be something worth looking at. Ultimately, doing the partial checksum modifications in a CKO-friendly manner might be a good thing. rick jones
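For reference, the "full checksum" the NIC finishes under CKO is the usual RFC 1071 one's-complement sum - with offload, the sender-side tracepoint only sees the partial (pseudo-header) sum, which is why the trace looks "bad." A rough sketch of the arithmetic (my own illustration, not the kernel's implementation):

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum per RFC 1071, folding carries as we go."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold carry back in
    return total

def internet_checksum(data: bytes) -> int:
    """Header checksum field: one's complement of the running sum."""
    return ~ones_complement_sum(data) & 0xFFFF

# Data carrying a correct checksum re-checksums (checksum included) to 0:
pkt = b"\x00\x01\xf2\x03"
csum = internet_checksum(pkt)
print(hex(csum))                                         # -> 0xdfb
print(internet_checksum(pkt + csum.to_bytes(2, "big")))  # -> 0
```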
Re: bizarre network timing problem
Felix von Leitner wrote: the packet trace was a bit too cooked perhaps, but there were indications that at times the TCP window was going to zero - perhaps something with window updates or persist timers? Does TCP use different window sizes on loopback? Why is this not happening on ethernet? I don't think it uses different window sizes on loopback, but with the autotuning it can be difficult to say a priori what the window size will be. What one can say with confidence is that the MTU and thus the MSS will be different between loopback and ethernet. How could I test this theory? Can you take another trace that isn't so "cooked?" One that just sticks with TCP-level and below stuff? If SMB is a one-request-at-a-time protocol (I can never remember), you could simulate it with a netperf TCP_RR test by passing suitable values to the test-specific -r option: netperf -H -t TCP_RR -- -r , If that shows similar behaviour then you can ass-u-me it isn't your application. One caveat though is that TCP_CORK mode in netperf is very primitive and may not match what you are doing, however, that may be a good thing. http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/ or ftp://ftp.netperf.org/netperf/ to get the current netperf bits. It is also possible to get multiple transactions in flight at one time if you configure netperf with --enable-burst, which will then enable a test-specific -b option. With the latest netperf you can also switch the output of a TCP_RR test to bits or bytes per second a la the _STREAM tests. rick jones My initial idea was that it has something to do with the different MTU on loopback. My initial block size was 16k, but the problem stayed when I changed it to 64k. Felix
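If netperf isn't handy, the one-request-at-a-time pattern a TCP_RR test exercises can be mimicked in a few lines of Python over loopback (the sizes and iteration count here are arbitrary, and this measures far less carefully than netperf does):

```python
import socket
import threading
import time

def rr_test(req_size=1, rsp_size=1, iters=200):
    """Minimal one-request-at-a-time ping-pong over loopback,
    in the spirit of a netperf TCP_RR test."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        with conn:
            for _ in range(iters):
                buf = b""
                while len(buf) < req_size:          # read the whole request
                    buf += conn.recv(req_size - len(buf))
                conn.sendall(b"\x00" * rsp_size)    # then answer it

    t = threading.Thread(target=serve)
    t.start()
    cli = socket.create_connection(srv.getsockname())
    start = time.perf_counter()
    for _ in range(iters):
        cli.sendall(b"\x00" * req_size)
        buf = b""
        while len(buf) < rsp_size:                  # wait for the full reply
            buf += cli.recv(rsp_size - len(buf))
    elapsed = time.perf_counter() - start
    cli.close()
    t.join()
    srv.close()
    return iters / elapsed  # transactions per second

print(round(rr_test()))
```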
Re: bizarre network timing problem
the packet trace was a bit too cooked perhaps, but there were indications that at times the TCP window was going to zero - perhaps something with window updates or persist timers? rick jones
Re: follow-up: discrepancy with POSIX
Andi Kleen wrote: On Wed, Sep 19, 2007 at 11:02:00AM -0700, Ulrich Drepper wrote: on UDP/RAW and it's certainly possible to connect() to that. Where do you get this from? And where is this implemented? I don't Sorry it's actually loopback, not broadcast as implemented in Linux. In Linux it's implemented in ip_route_output_slow(). Essentially converted to 127.0.0.1 I think it's traditional BSD behaviour but couldn't find it on a quick look in FreeBSD source (but haven't looked very intensively) One has to set their way-back machine pretty far back to find the *BSD bits which used 0.0.0.0 as the "all nets, all subnets" (to mis-use a term) broadcast IPv4 address when sending. Perhaps as far back as the time before HP-UX 7 or SunOS4. The bit errors in my dimm memory get pretty dense that far back... It has hung-on in various places (stacks) as an "accepted" broadcast IP in the receive path, but not the send path for quite possibly decades now. rick jones
Re: [PATCH] Configurable tap interface MTU
Ed Swierk wrote: This patch makes it possible to change the MTU on a tap interface. Increasing the MTU beyond the 1500-byte default is useful for applications that interoperate with Ethernet devices supporting jumbo frames. The patch caps the MTU somewhat arbitrarily at 16000 bytes. This is slightly lower than the value used by the e1000 driver, so it seems like a safe upper limit. FWIW the OFED 1.2 bits take the MTU of IPoIB up to 65520 bytes :) rick jones
Re: RFC: issues concerning the next NAPI interface
Just to be clear, in the previous email I posted on this thread, I described a worst-case network ping-pong test case (send a packet, wait for reply), and found out that a deferred interrupt scheme just damaged the performance of the test case. Since the folks who came up with the test case were adamant, I turned off the deferred interrupts. While deferred interrupts are an "obvious" solution, I decided that they weren't a good solution. (And I have no other solution to offer). Sounds exactly like the default netperf TCP_RR test and any number of other benchmarks. The "send a request, wait for reply, send next request, etc etc etc" is a rather common application behaviour after all. rick jones
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
Andi Kleen wrote: TSO is beneficial for the software again. The linux code currently takes several locks and does quite a few function calls for each packet and using larger packets lowers this overhead. At least with 10GbE saving CPU cycles is still quite important.

Some quick netperf TCP_RR tests between a pair of dual-core rx6600's running 2.6.23-rc3. The NICs are dual-port e1000's connected back-to-back with the interrupt throttle disabled. I like using TCP_RR to tickle path-length questions because it rarely runs into bandwidth limitations regardless of the link-type. First, with TSO enabled on both sides, then with it disabled, netperf/netserver bound to the same CPU as takes interrupts, which is the "best" place to be for a TCP_RR test (although not always for a TCP_STREAM test...):

:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 (192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.3%
!!!                       Local CPU util  : 39.3%
!!!                       Remote CPU util : 40.6%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   18611.32 20.96  22.35  22.522  24.017
16384  87380

:~# ethtool -K eth2 tso off
e1000: eth2: e1000_set_tso: TSO is Disabled
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 (192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.4%
!!!                       Local CPU util  : 21.0%
!!!                       Remote CPU util : 25.2%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   19812.51 17.81  17.19  17.983  17.358
16384  87380

While the confidence intervals for CPU util weren't hit, I suspect the differences in service demand were still real. On throughput we are talking about +/- 0.2%, for CPU util we are talking about +/- 20% (percent not percentage points) in the first test and 12.5% in the second. So, in broad handwaving terms, TSO increased the per-transaction service demand by something along the lines of (23.27 - 17.67)/17.67 or ~30% and the transaction rate decreased by ~6%.

rick jones
bitrate blindness is a constant concern
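For anyone following along, the handwaving is just the average of the local and remote per-transaction service demands from the two runs:

```python
# Average per-transaction service demand (us/Tr) across local and remote:
tso_on  = (22.522 + 24.017) / 2   # first run, TSO enabled  -> ~23.27
tso_off = (17.983 + 17.358) / 2   # second run, TSO disabled -> ~17.67

sdem_increase = (tso_on - tso_off) / tso_off
rate_decrease = (19812.51 - 18611.32) / 19812.51

print(f"service demand up {sdem_increase:.0%}, "
      f"transaction rate down {rate_decrease:.1%}")
# -> service demand up 32%, transaction rate down 6.1%
```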
Re: [PATCH 1/4] Add ETHTOOL_[GS]FLAGS sub-ioctls
David Miller wrote: From: Ben Greear <[EMAIL PROTECTED]> Date: Fri, 10 Aug 2007 15:40:02 -0700 For GSO on output, is there a generic fallback for any driver that does not specifically implement GSO? Absolutely, in fact that's mainly what it's there for. I don't think there is any issue. The knob is there via ethtool for people who really want to disable it. Just to be paranoid (who me?) we are then at a point where what happened a couple months ago with forwarding between 10G and IPoIB won't happen again - where things failed because a 10G NIC had LRO enabled by default? rick jones
Re: Driver writer hints (was [PATCH 3/4] Add ETHTOOL_[GS]PFLAGS sub-ioctls)
If we are getting (retrieving) flags: 3) Userland issues ETHTOOL_GPFLAGS, to obtain a 32-bit bitmap 4) Userland prints out a tag returned from ETHTOOL_GSTRINGS for each bit set to one in the bitmap. If a bit is set, but there is no string to describe it, that bit is ignored. (i.e. a list of 5 strings is returned, but bit 24 is set) Is that to enable "hidden" bits? If not I'd think that emitting some sort of "UNKNOWN_FLAG" might help flush-out little oopses like forgetting a string. rick jones
Re: all syscalls initially taking 4usec on a P4? Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?
I measure a huge slope, however. Starting at 1usec for back-to-back system calls, it rises to 2usec after interleaving calls with a count to 20 million. 4usec is hit after 110 million. The graph, with semi-scientific error-bars is on http://ds9a.nl/tmp/recvfrom-usec-vs-wait.png The code to generate it is on: http://ds9a.nl/tmp/recvtimings.c I'm investigating this further for other system calls. It might be that my measurements are off, but it appears even a slight delay between calls incurs a large penalty. The slope appears to be flattening-out the farther out to the right it goes. Perhaps that is the length of time it takes to take all the requisite cache misses. Some judicious use of HW perf counters might be in order via say papi or pfmon. Otherwise, you could try a test where you don't delay, but do try to blow-out the cache(s) between recvfrom() calls. If the delay there starts to match the delay as you go out to the right on the graph it would suggest that it is indeed cache effects. rick jones
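The blow-out-the-cache experiment can be roughed out in a few lines - thrash a large buffer between calls and see whether the per-call time heads toward the right-hand side of the graph. A crude sketch (buffer size, stride, and sample counts are arbitrary, and Python adds its own overhead, so treat the absolute numbers with suspicion):

```python
import socket
import time

def avg_recvfrom_usec(disturb, samples):
    """Average cost of a nonblocking recvfrom() on an idle UDP socket,
    with an arbitrary 'disturb' callable run between calls."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setblocking(False)
    total = 0.0
    for _ in range(samples):
        disturb()
        t0 = time.perf_counter()
        try:
            s.recvfrom(1)
        except BlockingIOError:
            pass  # nothing to read - we only want the syscall cost
        total += time.perf_counter() - t0
    s.close()
    return total / samples * 1e6

junk = bytearray(1 << 20)  # 1 MB, bigger than typical L1/L2

def blow_cache():
    # Touch one byte per cacheline-ish stride to evict warm data
    for i in range(0, len(junk), 64):
        junk[i] = (junk[i] + 1) & 0xFF

back_to_back = avg_recvfrom_usec(lambda: None, 2000)
cache_cold = avg_recvfrom_usec(blow_cache, 50)
print(f"{back_to_back:.2f} usec back-to-back, {cache_cold:.2f} usec cache-cold")
```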
Re: Network drivers that don't suspend on interface down
There are two different problems: 1) Behavior seems to be different depending on device driver author. We should document the expected semantics better. IMHO: When device is down, it should: a) use as few resources as possible: - not grab memory for buffers - not assign IRQ unless it could get one - turn off all power consumption possible b) allow setting parameters like speed/duplex/autonegotiation, ring buffers, ... with ethtool, and remember the state c) not accept data coming in, and drop packets queued What implications does c have for something like tcpdump? rick jones
Re: [patch 2.6.11] bonding: avoid tx balance for IGMP (alb/tlb mode)
Is that switch behaviour "normal" or "correct?" I know next to nothing about what stuff like LACP should do, but asked some internal folks and they had this to say: treats IGMP packets the same as all other non-broadcast traffic (i.e. it will attempt to load balance). This switch behavior seems rather odd in an aggregated case, given the fact that most traffic (except broadcast packets) will be load balanced by the partner device. In addition, the switch (in theory) is supposed to treat the aggregated switch ports as 1 logical port and therefore it should allow IGMP packets to be received back on any port in the logical aggregation. IMO, the switch behavior in this case seems questionable. FWIW, rick jones
Re: Very high bandwith packet based interface and performance problems
Alan Cox wrote:
> > > TCP _requires_ the remote end ack every 2nd frame regardless of progress.
> >
> > um, I thought the spec says that ACK every 2nd segment is a SHOULD not a
> > MUST?
>
> Yes its a SHOULD in RFC1122, but in any normal environment pretty much a
> must and I know of no stack significantly violating it.

I didn't know there was such a thing as a normal environment :)

> RFC1122 also requires that your protocol stack SHOULD be able to leap tall
> buildings at a single bound of course...

And, of course my protocol stack does :) It is also a floor wax, AND a dessert topping!-)

rick jones
-- ftp://ftp.cup.hp.com/dist/networking/misc/rachel/ these opinions are mine, all mine; HP might not want them anyway... :) feel free to email, OR post, but please do NOT do BOTH... my email address is raj in the cup.hp.com domain...
Re: Very high bandwith packet based interface and performance problems
Alan Cox wrote:
> > that because the kernel was getting 99% of the cpu, the application was
> > getting very little, and thus the read wasn't happening fast enough, and
>
> Seems reasonable
>
> > This is NOT what I'm seeing at all.. the kernel load appears to be
> > pegged at 100% (or very close to it), the user space app is getting
> > enough cpu time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > appears to be ACKING ALL the traffic, which I don't understand at all
> > (e.g. the transmitter is simply blasting 300MBit of tcp unrestricted)
>
> TCP _requires_ the remote end ack every 2nd frame regardless of progress.

um, I thought the spec says that ACK every 2nd segment is a SHOULD not a MUST?

rick jones
Re: Very high bandwith packet based interface and performance problems
> > > This is NOT what I'm seeing at all.. the kernel load appears to be
> > > pegged at 100% (or very close to it), the user space app is getting
> > > enough cpu time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > > appears to be ACKING ALL the traffic, which I don't understand at all
> > > (e.g. the transmitter is simply blasting 300MBit of tcp unrestricted)
> >
> > TCP _requires_ the remote end ack every 2nd frame regardless of progress.
>
> YIPES. I didn't realize this was the case.. how is end-to-end application
> flow control handled when the bottle neck is user space bound and not b/w
> bound? e.g. if i write a test app that does a

If the app is not reading from the socket buffer, the receiving TCP is supposed to stop sending window-updates, and the sender is supposed to stop sending data when it runs-out of window. If TCP ACK's data, it really should (must?) not then later drop it on the floor without aborting the connection. If a TCP is ACKing data and then that data is dropped before it is given to the application, and the connection is not being reset, that is probably a bug. A TCP _is_ free to drop data prior to sending an ACK - it simply drops it and does not ACK it.

rick jones
Re: MTU and 2.4.x kernel
the TCP code should be "honouring" the link-local MTU in its selection of MSS. rick jones
Re: MTU and 2.4.x kernel
> Default of 536 is sadistic (and apparently will be changed eventually
> to stop tears of poor people whose providers not only supply them
> with bogus mtu values sort of 552 or even 296, but also jailed them
> to some proxy or masquerading domain), but it is still right: IP
> with mtu lower than 576 is not fully functional.

I thought that the specs said that 576 was the "minimum maximum" reassemblable IP datagram size and not a minimum MTU.

rick jones
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todowith ECN)
> > As time marches on, the orders of magnitude of the constants may change,
> > but basic concepts still remain, and the "lessons" learned in the past
> > by one generation tend to get relearned in the next :) for example -
> > there is no such a thing as a free lunch... :)
>
> ;->
> BTW, i am reading one of your papers (circa 1993 ;->, "we go fast with a
> little help from your apps") in which you make an interesting
> observation. That (figure 2) there is "a considerable increase in
> efficiency but not a considerable increase in throughput" I "scanned"
> to the end of the paper and dont see an explanation.

That would be the copyavoidance paper using the very old G30 with the HP-PB (sometimes called PeanutButter) bus :) (http://ftp.cup.hp.com/dist/networking/briefs/) No, back then we were not going to describe the dirty laundry of the G30 hardware :) The limiter appears to have been the bus converter from the SGC (?) main bus of the Novas (8x7,F,G,H,I) to the HP-PB bus. The chip was (appropriately enough) codenamed "BOA" and it was a constrictor :) I never had a chance to carry-out the tests on an older 852 system - those have slower CPU's, but HP-PB was _the_ bus in the system. Prototypes leading to the HP-PB FDDI card achieved 10 MB/s on an 832 system using UDP - this was back in the 1988-1989 timeframe iirc.

> I've made a somehow similar observation with the current zc patches and
> infact observed that throughput goes down with the linux zc patches.
> [This is being contested but no-one else is testing at gigE, so my word is
> the only truth].
> Of course your paper doesnt talk about sendfile rather the page pinning +
> COW tricks (which are considered taboo in Linux) but i do sense a
> relationship.
Well, the HP-PB FDDI card did follow buffer chains rather well, and there was no mapping overhead on a Nova - it was a non-coherent I/O subsystem and DMA was done exclusively with physical addresses (and requisite pre-DMA flushes on outbound, and purges on inbound - another reason why copy-avoidance was such a win overheadwise). Also, there was no throughput drop when going to copyavoidance in that stuff. So, I'd say that while some things might "feel" similar, it does not go much deeper than that.

rick

> PS:- I dont have "my" machines yet and i have a feeling it will be a while
> before i re-run the tests; however, i have created a patch for
> linux-sendfile with netperf. Please take a look at it at:
> http://www.cyberus.ca/~hadi/patch-nperf-sfile-linux.gz
> tell me if is missing anything and if it is ok, could you please merge in
> your tree?

I will take a look.
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todowith ECN)
> > How does ZC/SG change the nature of the packets presented to the NIC?
>
> what do you mean? I am _sure_ you know how SG/ZC work. So i am suspecting
> more than socratic view on life here. Could be influence from Aristotle;->

Well, I don't know the specifics of Linux, but I gather from what I've read on the list thus far, that prior to implementing SG support, Linux NIC drivers would copy packets into single contiguous buffers that were then sent to the NIC yes? If so, the implication is with SG going, that copy no longer takes place, and so a chain of buffers is given to the NIC. Also, if one is fully ZC :) pesky things like protocol headers can naturally end-up in separate buffers. So, now you have to ask how well any given NIC follows chains of buffers. At what number of buffers is the overhead in the NIC of following the chains enough to keep it from achieving link-rate? One way to try and deduce that would be to meld some of the SG and preSG behaviours and copy packets into varying numbers of buffers per packet and measure the resulting impact on throughput through the NIC.

rick jones

As time marches on, the orders of magnitude of the constants may change, but basic concepts still remain, and the "lessons" learned in the past by one generation tend to get relearned in the next :) for example - there is no such a thing as a free lunch... :)
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to dowith ECN)
> ** I reported that there was also an oddity in throughput values,
> unfortunately since no one (other than me) seems to have access
> to a gige cards in the ZC list, nobody can confirm or disprove
> what i posted. Here again as a reminder:
>
> Kernel     | tput     | sender-CPU | receiver-CPU |
> ---------------------------------------------------
> 2.4.0-pre3 | 99MB/s   | 87%        | 23%          |
> NSF        |          |            |              |
> ---------------------------------------------------
> 2.4.0-pre3 | 86MB/s   | 100%       | 17%          |
> SF         |          |            |              |
> ---------------------------------------------------
> 2.4.0-pre3 | 66.2MB/s | 60%        | 11%          |
> +ZC        |          |            |              |
> ---------------------------------------------------
> 2.4.0-pre3 | 68MB/s   | 8%         | 8%           |
> +ZC SF     |          |            |              |
> ---------------------------------------------------
>
> Just ignore the CPU readings, focus on throughput. And could someone please
> post results?

In the spirit of the socratic method :) Is your gige card based on Alteon? How does ZC/SG change the nature of the packets presented to the NIC? How well does the NIC do with that changed nature?

rick jones
sometimes, performance tuning is like squeezing a balloon. one part gets smaller, but then you start to see the rest of the balloon...
Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN)
> I'll give this a shot later. Can you try with the sendfiled-ttcp? > http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz I guess I need to "leverage" some bits for netperf :) WRT getting data with links that cannot saturate a system, having something akin to the netperf service demand measure can help. Nothing terribly fancy - simply a conversion of the CPU utilization and throughput to microseconds of CPU to transfer a KB of data. As for CKO and avoiding copies and such, if past experience is any guide (ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.ps) you get a very nice synergistic effect once the last "access" of data is removed. CKO gets you say 10%, avoiding the copy gets you say 10%, but doing both at the same time gets you 30%. rick jones http://www.netperf.org/
Re: hotmail not dealing with ECN
> As David pointed out, it is "reserved for future use - you must set > these bits to zero and not use it _for your own purposes_. For non-rfc > use of these bits _will_ break something the day we start using them > for something useful. > > So, no reason for a firewall author to check these bits. I thought that most firewalls were supposed to be insanely paranoid. Perhaps it would be considered a possible covert data channel, as far-fetched as that may sound. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
[EMAIL PROTECTED] wrote: > > Hello! > > > is there really > > much value in the second request flowing to the server before the first > > byte of the reply has hit? > > Yes, of course, it has lots of sense: f.e. all the icons, referenced > parent page are batched to single well-coalesced stream without rtt delays > between them. It is the only sense of pipelining yet. "Elsewhere" i see references stating that the typical RTT for the great unwashed masses is somewhere in the range of 100 to 200 milliseconds. The linux standalone ACK timer is 200 milliseconds yes? If the web server is going to take longer than 200 milliseconds to generate the first byte of the reply to the first request it seems that the bottleneck here is the web server, not the link RTT. Now, if the server _is_ able to respond with the first bytes (ignoring CORK for the moment) sooner than the standalone ACK timer, then perhaps the RTT is an issue. However, as we were in the constrained case of only two requests, I suspect that it is not a big deal. If there are all those icons to be displayed, there would be more than two requests. Without the explicit (cork et al)/implicit (tcp_nodelay) push at the client those 2-N requests will pile-up into a nice sized TCP segment. Those requests will arrive en masse at the server and will then have RTT issues amortized. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
> Look: http-1.1, asynchronous one, the first request is sent, but not acked. > Time to send the second one, but it is blocked by Nagle. If there is no > third request, the pipe stalls. Seems, this situation will be usual, > when http-1.1 will start to be used by clients, because of dependencies > between replys (references) frequently move it to http-1.0 synchronous > mode, but with some data in flight. See? The stall takes place if and only if the web server takes longer than the standalone ACK timer to generate the first bytes of reply. Once the first bytes of the reply hit the client, the client's second request will flow. If the web server takes longer than the standalone ACK timer to generate the first bytes of the reply, there is no particular value in the second request having arrived anyway - it will simply sit queued in the server's stack rather than the client's stack. You could argue that the server could start serving the second request, but it still has to hold the reply and keep it queued until the first reply is complete, and I suspect there is little value in working for that much parallelism here. Better to have as much queuing in your most distributed resource - the clients. Further, even ignoring the issue of standalone acks, is there really much value in the second request flowing to the server before the first byte of the reply has hit? I would think that the parallelism in the server is going to be among all the different sources of request, not from within a given source of requests. 
Also, if the browser is indeed going to do pipelined requests, and getting the requests to the server as quickly as possible was indeed required because the requests could be started in parallel (just how likely that is I have no idea) i would have thought that it would (could) want to go through the page, gather-up all the URLs from the given server, and then dump all those requests into the connection at once (modulo various folks dislike of sendmsg and writev :). We are in this instance at least talking about purpose coded software anyhow and not a random CGI dribbler. In that sense, the "logically associated data" are all the server's URL's from that page. Yes, this paragraph is in slight contradiction with my statement above about keeping things queued in the client :) rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
dean gaudet wrote: > > On Wed, 17 Jan 2001, Rick Jones wrote: > > > > actually the problem isn't nagle... nagle needs to be turned off for > > > efficient servers anyhow. > > > > i'm not sure I follow that. could you expand on that a bit? > > the problem which caused us to disable nagle in apache is documented in > this paper <http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html>. mind you > i should personally revisit the paper after all these years so that i can > reconsider its implications in the context of pipelining and webmux. ah yes, that - where the web server even for just static content was providing the replies in more than one send. i would not consider that to have been an "efficient" server. i'm not sure that I agree with their statement that piggy-backing is rarely successful in request/response situations. the business about the last 1100ish bytes of a 4096 byte send being delayed by nagle only implies that the stack's implementation of nagle was broken and interpreting it on a per-segment rather than a per-send basis. if the app sends 4096 bytes, then there should be no nagle-induced delays on a connection with an MSS of 4096 or less. it would seem that in the context of that paper at least, most if not all of the problems were the result of bugs - either in the webserver software, or the host TCP stack. otherwise, the persistent connections would have worked just fine. > i'm not aware yet of any study in the field. and i'm out of touch enough > with the clients that i don't know if new netscape or IE have finally > begun to use pipelining (they hadn't as of 1998). someone else sent a private email implying that no browsers were yet doing pipelining. rick
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Olivier Galibert wrote: > > On Thu, Jan 18, 2001 at 10:04:28PM +0100, Andrea Arcangeli wrote: > > NAGLE algorithm is only one, CORK algorithm is another different algorithm. So > > probably it would be not appropriate to mix CORK and NAGLE under the name > > "CONTROL_NAGLING", but certainly I agree they could stay together under another > > name ;). > > TCP_FLOW_CONTROL ? then folks would think you were controlling the congestion or "classic" windows. what all these things do is affect segmentation, so perhaps TCP_SEGMENT_CONTROL or something to that effect, if anything. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
[EMAIL PROTECTED] wrote: > > Hello! > > > So if I understand all this correctly... > > > > The difference in ACK generation > > CORK does not affect receive direction and, hence, ACK generation. I was asking how the semantics of cork interacted with piggybacking ACK's on data flowing the other way. Was I wrong in assuming that the Linux TCP piggybacks ACKs? rick
Re: Is sendfile all that sexy?
> device-to-device is not the same as disk-to-disk. A better example would > be a streaming file server. Slowly the pci bus becomes a bottleneck, why > would you want to move the data twice over the pci bus if once is enough > and the data very likely not needed afterwards? Sure you can use a more > expensive 64bit/60MHz bus, but why should you if the 32bit/30MHz bus is > theoretically fast enough for your application? theoretically fast enough for the application would imply the dual transfers across the bus would fit :) also, if a system was doing something with that much throughput, i suspect it would not only be designed with 64/66 busses (or better), but also have things on several different busses. that makes device to device life more of a challenge. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Linus Torvalds wrote: > Remember the UNIX philosophy: everything is a file. ...and a file is simply a stream of bytes (iirc?) rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Ingo Molnar wrote: > > On Wed, 17 Jan 2001, Rick Jones wrote: > > > i'd heard interesting generalities but no specifics. for instance, > > when the send is small, does TCP wait exclusively for the app to > > flush, or is there an "if all else fails" sort of timer running? > > yes there is a per-socket timer for this. According to RFC 1122 a TCP > stack 'MUST NOT' buffer app-sent TCP data indefinitely if the PSH bit > cannot be explicitly set by a SEND operation. Was this a trick question? > :-) Nope, not a trick question. The nagle heuristic means that small sends will not wait indefinitely since sending the first small bit of data starts the retransmission timer as a course of normal processing. So, I am not in the habit of thinking about a "clear the buffer" timer being set when a small send takes place but no transmit happens. rick jones btw, as I'm currently on linux-kernel, no need to cc me :)
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Andi Kleen wrote: > > On Wed, Jan 17, 2001 at 02:17:36PM -0800, Rick Jones wrote: > > How does CORKing interact with ACK generation? In particular how it > > might interact with (or rather possibly induce) standalone ACKs? > > It doesn't change the ACK generation. If your cork'ed packets gets sent > before the delayed ack triggers it is piggy backed, if not it is send > individually. When the delayed ack triggers depends; Linux has dynamic > delack based on the rtt and also a special quickack mode to speed up slow > start. So if I understand all this correctly... The difference in ACK generation would be that with nagle it is a race between the standalone ack heuristic and the first byte of response data, with cork, the race is between the standalone ack heuristic and the last byte of response data and an uncork call, or the MSSth byte whichever comes first. If the response bytes are dribbling slowly into the socket, where slowly is less than the bandwidth delay product of the connection, cork can result in quite a few fewer packets than nagle would. It would perhaps though have one more standalone ACK than nagle. If the response bytes are dribbling quickly into the socket, where quickly is greater than the bandwidth delay product of the connection, cork will produce one less packet than nagle. If the response bytes go into the socket together, cork and nagle will produce the same number of packets. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Linus Torvalds wrote: > > On Wed, 17 Jan 2001, Rick Jones wrote: > > > > > The fact that I understand _why_ it is done that way doesn't mean that I > > > don't think it's a hack. It doesn't allow you to sendfile multiple files > > > etc without having nagle boundaries, and the header/trailer stuff really > > > isn't a generic solution. > > > > Hmm, I would think that nagle would only come into play if those files > > were each less than MSS and there were no intervening application level > > reply/request messages for each. > > It's not the file itself - it's the headers and trailers. OK, the sum of the header/trailer/file when one calls an HP-UX-style sendfile(). All that does is make it more likely that one will have sends larger than the MSS. > - the packet boundary between the header and the file you're sending. > > Normally, if you do a separate data "send()" for the header before > actually using sendfile(), the header would be sent out as one packet, > while the actual file contents would then get coalesced into MSS-sized > packets. > > This is why people originally did writev() and sendmsg() - to allow > people to do scatter-gather without having multiple packets on the > wire, and letting the OS choose the best packet boundaries, of course. I prefer to describe it as "presenting logically associated data to the transport at one time" but that's just wordsmithing. > So the Linux approach (and, obviously, in my opinion the only right > approach) is basically to > > (a) make sure that system call latency is low enough that there really > aren't any major reasons to avoid system calls. They're just function > calls - they may be a bit heavier than most functions, of course, but > people shouldn't need to avoid them like the plague like on some > systems. i'm not quite sure how it plays here, but someone once told me that the most efficient procedure call was the one that was never made :) > and > > (b) TCP_CORK. 
> > Now, TCP_CORK is basically me telling David Miller that I refuse to play > games to have good packet size distribution, and that I wanted a way for > the application to just tell the OS: I want big packets, please wait until > you get enough data from me that you can make big packets. > > Basically, TCP_CORK is a kind of "anti-nagle" flag. It's the reverse of > "no-nagle". So you'd "cork" the TCP connection when you know you are going > to do bulk transfers, and when you're done with the bulk transfer you just > "uncork" it. At which point the normal rules take effect (ie normally > "send out any partial packets if you have no packets in flight"). How "bulk" is a bulk transfer in your thinking? By the time the transfer gets above something like 100*MSS I would think that the first small packet would become epsilon. How does CORKing interact with ACK generation? In particular how it might interact with (or rather possibly induce) standalone ACKs? > This is a _much_ better interface than having to play games with > scatter-gather lists etc. You could basically just do > > int optval = 1; > > setsockopt(sk, SOL_TCP, TCP_CORK, &optval, sizeof(int)); > write(sk, ..); > write(sk, ..); > write(sk, ..); > sendfile(sk, ..); > write(..) > printf(...); > ...any kind of output.. > > optval = 0; > setsockopt(sk, SOL_TCP, TCP_CORK, &optval, sizeof(int)); > > and notice how you don't need to worry about _how_ you output the data any > more. It will automatically generate the best packet sizes - waiting for > disk if necessary etc. > > With TCP_CORK, you can obviously and trivially emulate the HP-UX behaviour > if you want to. But you can just do _soo_ much more. > > Imagine, for example, keep-alive http connections. Where you might be > doing multiple sendfile()'s of small files over the same connection, one > after the other. 
With Linux and TCP_CORK, what you can basically do is to > just cork the connection at the beginning, and then let it stay corked for > as long as you don't have any outstanding requests - ie you uncork only > when you don't have anything pending any more. so after i present each reply, i'm checking to see if there is another request and if there is not i have to uncork to get the residual data to flow. > (The reason you want to uncork at all, is to obviously let the partial > packets out when you don't know if you'll write anything more in the near
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
> > Hmm, I would think that nagle would only come into play if those files > > were each less than MSS and there were no intervening application level > > reply/request messages for each. > > actually the problem isn't nagle... nagle needs to be turned off for > efficient servers anyhow. i'm not sure I follow that. could you expand on that a bit? > but once it's turned off, the standard socket > API requires (or rather allows) the kernel to flush packets to the wire > after each system call. most definitely allows, not requires. > consider the case where you're responding to a pair of pipelined HTTP/1.1 > requests. with the HPUX and BSD sendfile() APIs you end up forcing a > packet boundary between the two responses. this is likely to result in > one small packet on the wire after each response. i _possibly_ have a packet boundary. if the last small bit of the first file is handed to the transport when there is sufficient classic and congestion window to send it or that window "arrives" before the first chunk of the second file is sent. on the topic of pipelining - do the pipelined requests tend to be sent or arrive together? > with the linux TCP_CORK API you only get one trailing small packet. in > case you haven't heard of TCP_CORK -- when the cork is set, the kernel is > free to send any maximum size packets it can form, but has to hold on to > the stragglers until userland gives it more data or pops the cork. i'd heard interesting generalities but no specifics. for instance, when the send is small, does TCP wait exclusively for the app to flush, or is there an "if all else fails" sort of timer running? > (the heuristic i use in apache to decide if i need to flush responses in a > pipeline is to look if there are any more requests to read first, and if > there are none then i flush before blocking waiting for new requests.) how often do you find yourself flushing the little bits anyhow? 
> > As for the header/trailer stuff, you're right, I should have spec'd a > > separate iovec for each :) > > well, if you've got low system call overhead (such as linux ;), and you > add TCP_CORK ... then you don't even need to combine all those system > calls into one monster syscall. how low is the system call overhead to check for the next request before you flush? (i'm not sure that I'd say HP-UX sendfile() was a combination of system calls - i'd probably say it was a (partial) replacement for writev()) rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
> The fact that I understand _why_ it is done that way doesn't mean that I > don't think it's a hack. It doesn't allow you to sendfile multiple files > etc without having nagle boundaries, and the header/trailer stuff really > isn't a generic solution. Hmm, I would think that nagle would only come into play if those files were each less than MSS and there were no intervening application level reply/request messages for each. So, perhaps rcp, but not FTP nor HTTP. I'm not sure where the break-even point versus send() is on other OSes, but it seems to be in the neighborhood of the typical ethernet MSS on HP-UX. As for the header/trailer stuff, you're right, I should have spec'd a separate iovec for each :) > Also note how I said that it is the BSD people I _despise_. Not The HP-UX > implementation. That misunderstanding would be the result of my entering the conversation in the middle... > The HP-UX one is not pretty, but it works. But I hold open > source people to higher standards. They are supposed to be the people who > do programming because it's an art-form, not because it's their job. I'm not sure, but I think I've just been insulted !-) (in case it is not clear, that is meant as a joke...) rick jones
[Fwd: Is sendfile all that sexy? (fwd)]
> : >Agreed -- the hard-coded Nagle algorithm makes no sense these days. > : > : The fact I dislike about the HP-UX implementation is that it is so > : _obviously_ stupid. > : > : And I have to say that I absolutely despise the BSD people. They did > : sendfile() after both Linux and HP-UX had done it, and they must have > : known about both implementations. And they chose the HP-UX braindamage, > : and even brag about the fact that they were stupid and didn't understand > : TCP_CORK (they don't say so in those exact words, of course - they just > : show that they were stupid and clueless by the things they brag about). > : > : Oh, well. Not everybody can be as goodlooking as me. It's a curse. nor it would seem, as humble :) Hello Linus, my name is Rick Jones. I am the person at Hewlett-Packard who drafted the "so _obviously_ stupid" sendfile() interface of HP-UX. Some of your critique (quoted above) found its way to my inbox and I thought I would introduce myself to you to give you an opportunity to expand a bit on your criticism. In return, if you like, I would be more than happy to describe a bit of the history of sendfile() on HP-UX. Perhaps (though I cannot say with any certainty) it will help explain why HP-UX sendfile() is spec'd the way it is. rick jones never forget what leads to the downfall of the protagonist in Greek tragedy...