Re: [PATCH] softirq: let ksoftirqd do its job
On 08/31/2016 04:11 PM, Eric Dumazet wrote: On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote: With regard to drops, are both of you sure you're using the same socket buffer sizes? Does it really matter ? At least at points in the past I have seen different drop counts at the SO_RCVBUF based on using (sometimes much) larger sizes. The hypothesis I was operating under at the time was that this dealt with those situations where the netserver was held-off from running for "a little while" from time to time. It didn't change things for a sustained overload situation though. In the meantime, is anything interesting happening with TCP_RR or TCP_STREAM? TCP_RR is driven by the network latency, we do not drop packets in the socket itself. I've been of the opinion it (single stream) is driven by path length. Sometimes by NIC latency. But then I'm almost always measuring in the LAN rather than across the WAN. happy benchmarking, rick
Re: strange Mac OSX RST behavior
On 07/01/2016 08:10 AM, Jason Baron wrote: I'm wondering if anybody else has run into this... On Mac OSX 10.11.5 (latest version), we have found that when tcp connections are abruptly terminated (via ^C), a FIN is sent followed by an RST packet. That just seems, well, silly. If the client application wants to use abortive close (sigh..) it should do so; there shouldn't be this little-bit-pregnant, correct close initiation (FIN) followed by an RST. The RST is sent with the same sequence number as the FIN, and thus dropped since the stack only accepts RST packets matching rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with an RST on the closed socket, but it appears that it does not. The workaround here is then to reset the connection if the RST is equal to rcv_nxt - 1 and we have already received a FIN. The RST attack surface is limited b/c we only accept the RST after we've accepted a FIN and have not previously sent a FIN and received back the corresponding ACK. In other words, RST is only accepted in the tcp states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING. I'm interested if anybody else has run into this issue. It's problematic since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT. Isn't the server application expected to act on the read return of zero (which is supposed to be triggered by the receipt of the FIN segment)? rick jones We are also in the process of contacting Apple to see what can be done here... workaround patch is below.
Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets
On 06/28/2016 02:59 AM, Dexuan Cui wrote: The idea here is: IMO the syscalls sys_read()/write() shouldn't return -ENOMEM, so I have to make sure the buffer allocation succeeds. I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning in mm/page_alloc.c: WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); What error code do you think I should return? EAGAIN, ERESTARTSYS, or something else? May I have your suggestion? Thanks! What happens as far as errno is concerned when an application makes a read() call against a (say TCP) socket associated with a connection which has been reset? Is it limited to those errno values listed in the read() manpage, or does it end-up getting an errno value from those listed in the recv() manpage? Or, perhaps even one not (presently) listed in either? rick jones
Re: [PATCH -next 2/2] virtio_net: Read the advised MTU
On 06/02/2016 10:06 AM, Aaron Conole wrote: Rick Jones writes: One of the things I've been doing has been setting-up a cluster (OpenStack) with JumboFrames, and then setting MTUs on instance vNICs by hand to measure different MTU sizes. It would be a shame if such a thing were not possible in the future. Keeping a warning if shrinking the MTU would be good, leave the error (perhaps) to if an attempt is made to go beyond the advised value. This was cut because it didn't make sense for such a warning to be issued, but it seems like perhaps you may want such a feature? I agree with Michael, after thinking about it, that I don't know what sort of use the warning would serve. After all, if you're changing the MTU, you must have wanted such a change to occur? I don't need a warning, was simply willing to live with one when shrinking the MTU. Didn't want an error. happy benchmarking, rick jones
Re: [RFC v2 -next 0/2] virtio-net: Advised MTU feature
On 03/15/2016 02:04 PM, Aaron Conole wrote: The following series adds the ability for a hypervisor to set an MTU on the guest during the feature negotiation phase. This is useful for VM orchestration when, for instance, tunneling is involved and the MTU of the various systems should be homogeneous. The first patch adds the feature bit as described in the proposed virtio spec addition found at https://lists.oasis-open.org/archives/virtio-dev/201603/msg1.html The second patch adds a user of the bit, and a warning when the guest changes the MTU from the hypervisor-advised MTU. Future patches may add more thorough error handling. How do you see this interacting with VMs getting MTU settings via DHCP? rick jones
v2:
* Whitespace and code style cleanups from Sergei Shtylyov and Paolo Abeni
* Additional test before printing a warning
Aaron Conole (2):
  virtio: Start feature MTU support
  virtio_net: Read the advised MTU
 drivers/net/virtio_net.c        | 12 ++++++++++++
 include/uapi/linux/virtio_net.h |  3 +++
 2 files changed, 15 insertions(+)
Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support
On 10/22/2015 02:33 AM, Michael S. Tsirkin wrote: On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote: This patch tries to poll for newly added tx buffers for a while at the end of tx processing. The maximum time spent on polling was limited through a module parameter. To avoid blocking rx, the loop will end if there's other new work queued on vhost, so in fact the socket receive queue is also polled. busyloop_timeout = 50 gives us the following improvement on TCP_RR tests:
size/session/+thu%/+normalize%
   1/      1/  +5%/ -20%
   1/     50/ +17%/  +3%
Is there a measurable increase in cpu utilization with busyloop_timeout = 0? And since a netperf TCP_RR test is involved, be careful about what netperf reports for CPU util if that increase isn't in the context of the guest OS. For completeness, looking at the effect on TCP_STREAM and TCP_MAERTS, aggregate _RR and even aggregate _RR/packets per second for many VMs on the same system would be in order. happy benchmarking, rick jones -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] tcp: Return error instead of partial read for saved syn headers
On 05/18/2015 11:35 AM, Eric B Munson wrote: Currently the getsockopt() requesting the cached contents of the syn packet headers will fail silently if the caller uses a buffer that is too small to contain the requested data. Rather than fail silently and discard the headers, getsockopt() should return an error and report the required size to hold the data. Is there any chapter and verse on whether a "failed" getsockopt() may alter the items passed to it? rick jones
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 11:32 AM, Eric Dumazet wrote: On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote: Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs? Keep in mind some VM users install custom qdisc, or even custom TCP sysctls. That could very well be, though I confess I've not seen that happening in my little corner of the cloud. They tend to want to launch the VM and go. Some of the more advanced/sophisticated ones might tweak a few things but my (admittedly limited) experience has been they are few in number. They just expect it to work "out of the box" (to the extent one can use that phrase still). It's kind of ironic - go back to the (early) 1990s when NICs generated a completion interrupt for every individual tx completion (and incoming packet) and all everyone wanted to do was coalesce/avoid interrupts. I guess that has gone rather far. And today to fight bufferbloat TCP gets tweaked to favor quick tx completions. Call it cycles, or pendulums or whatever I guess. I wonder just how consistent tx completion timings are for a VM so a virtio_net or whatnot in the VM can pick a per-device setting to advertise to TCP? Hopefully, full NIC emulation is no longer a thing and VMs "universally" use a virtual NIC interface. At least in my little corner of the cloud, emulated NICs are gone, and good riddance. rick
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 11:08 AM, Eric Dumazet wrote: On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote: Have you tested this patch on a NIC without GSO/TSO ? This would allow more than 500 packets for a single flow. Hello bufferbloat. Wouldn't the fq_codel qdisc on that interface address that problem? Last time I checked, default qdisc was pfifo_fast. Bummer. These guys do not want to change a sysctl, how will pfifo_fast magically become fq_codel ? Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs? rick
Re: TCP connection issues against Amazon S3
Strange thing is that sender does not misbehave at the beginning when receiver window is still small. Only after a while. Just guessing, but when the receiver window is small, the sender cannot get a large quantity of data out there at once, so any string of lost packets will tend to be smaller. If the sender is relying on the RTO to trigger the retransmits, and is not resetting his RTO until the clean ACK of a segment sent after snd_nxt when the loss is detected, the smaller loss strings will not get to the rather large RTO values seen in the trace before curl gives-up. It may be that the sender is indeed misbehaving at the beginning, just that it isn't noticeable? Different but perhaps related observation/question - without timestamps (which we don't have in this case), isn't there a certain ambiguity about arriving out-of-order segments? One doesn't really know if they are out-of-order because the network is re-ordering, or because they are retransmissions of segments we've not yet seen at the receiver. rick
Re: TCP connection issues against Amazon S3
On 01/06/2015 11:16 AM, Rick Jones wrote: I'm assuming one incident starts at XX:41:24.748265 in the trace? That does look like it is slowly slogging its way through a bunch of lost traffic, which was I think part of the problem I was seeing with the middlebox I stepped in, but I don't think I see the reset where I would have expected it. Still, it looks like the sender has an increasing TCP RTO as it is going through the slog (as it likely must since there are no TCP timestamps?), to the point it gets larger than I'm guessing curl was willing to wait, so the FIN at XX:41:53.269534 after a ten second or so gap. Should the receiver's autotuning be advertising an ever larger window the way it is while going through the slog of lost traffic? rick
Re: TCP connection issues against Amazon S3
A packet dump [1] shows repeated ACK retransmits for some of the TCP does not retransmit ACK ... do you mean DUPACKs sent by the receiver? I am trying to understand the problem. Could you confirm that it's the HTTP responses sent from Amazon S3 got stalled, or HTTP requests sent from the receiver (your host)? btw I suspect some middleboxes are stripping SACKOK options from your SYNs (or Amazon SYN-ACKs) assuming Amazon supports SACK. The TCP Timestamp option too it seems. Speaking of middleboxes... It is probably a fish that is red, but a while back I stepped in a middle box (a load balancer) which decided that if it saw "too many" retransmissions in a given TCP window that something was seriously wrong and it would toast the connection. I thought though that was an active reset on the part of the middlebox. (And the client was the active sender not the back-end server) I'm assuming one incident starts at XX:41:24.748265 in the trace? That does look like it is slowly slogging its way through a bunch of lost traffic, which was I think part of the problem I was seeing with the middlebox I stepped in, but I don't think I see the reset where I would have expected it. Still, it looks like the sender has an increasing TCP RTO as it is going through the slog (as it likely must since there are no TCP timestamps?), to the point it gets larger than I'm guessing curl was willing to wait, so the FIN at XX:41:53.269534 after a ten second or so gap. rick jones
Re: What's the concern about setting irq thread's policy as SCHED_FIFO
On 12/03/2014 12:06 AM, Qin Chuanyu wrote: I am doing network performance testing under suse11sp3 with an intel 82599 nic. Because the softirq is outside the schedule policy's control, the netserver thread couldn't always get 100% cpu usage, and then packets were dropped in the kernel udp socket's receive queue. In order to get a stable result, I did some patching in the ixgbe driver and then used an irq_thread instead of softirq to handle rx. It seems to work well, but the irq_thread's SCHED_FIFO schedule policy means that when the cpu is limited, netserver couldn't run at all. I cannot speak to any scheduling issues/questions, but can ask if you tried binding netserver to a CPU other than the one servicing the interrupts via the -T option on the netperf command line: netperf -T , ... http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#index-g_t_002dT_002c-Global-41 happy benchmarking, rick jones So I changed the irq_thread's schedule policy from SCHED_FIFO to SCHED_NORMAL, and then the irq_thread could share the cpu with the netserver thread. The question is: what's the concrete reason for setting the irq thread's policy to SCHED_FIFO? Apart from the priority affecting cpu usage, would any function be broken if the irq thread changed to SCHED_NORMAL?
Re: [QA-TCP] How to send tcp small packages immediately?
On 10/24/2014 12:41 AM, Zhangjie (HZ) wrote: Hi, I use netperf to test the performance of small tcp packets, with TCP_NODELAY set: netperf -H 129.9.7.164 -l 100 -- -m 512 -D Among the packets I captured with tcpdump, there are not only small packets, but also lots of big ones (skb->len=65160).
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 65160
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 65160
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.164.34607 > 129.9.7.186.60840: tcp 0
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 80
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 512
IP 129.9.7.186.60840 > 129.9.7.164.34607: tcp 512
So, how do I test small tcp packets? Besides TCP_NODELAY, what else should be set? Well, I don't think there is anything else you can set. Even with TCP_NODELAY set, segment size with TCP will still be controlled by factors such as the congestion window. I am ass-u-me-ing your packet trace is at the sender. I suppose if your sender were fast enough compared to the path, that might combine with congestion window to result in the very large segments. Not to say there cannot be a bug somewhere with TSO overriding TCP_NODELAY, but in broad terms, even TCP_NODELAY does not guarantee small TCP segments. That has been something of a bane on my attempts to use TCP for aggregate small-packet performance measurements via netperf for quite some time. And since you seem to have included a virtualization mailing list I would also ass-u-me that virtualization is involved somehow. Knuth only knows how that will affect the timing of events, which will be very much involved in matters of congestion window and such.
I suppose it is even possible that if the packet trace is on a VM receiver that some delays in getting the VM running could mean that GRO would end-up making large segments being pushed up the stack. happy benchmarking, rick jones
Re: skbuff truesize incorrect.
On 05/23/2014 02:33 AM, Bjørn Mork wrote: Jim Baxter writes: I'll create and test a patch for the cdc_ncm host driver unless someone else wants to do that. I haven't really played with the gadget driver before, so I'd prefer if someone knowing it (Jim maybe?) could take care of it. If not, then I can always make an attempt using dummy_hcd to test it. I can create a patch for the host driver, I will issue the gadget patch first to resolve any issues, the fix would be similar. Well, I couldn't help myself. I just had to test it. The attached patch works for me, briefly tested with an Ericsson H5321gw NCM device. I have no ideas about the performance impact as that modem is limited to 21 Mbps HSDPA. If you are measuring performance with the likes of netperf, you should be able to get an idea of the performance effect from the change in service demand (CPU consumed per unit of work) even if the maximum throughput remains capped. You can run a netperf TCP_STREAM test along the lines of: netperf -H -c -C -t TCP_STREAM and also netperf -H -c -C -t TCP_RR For extra added credit you can consider either multiple runs and post-processing, or adding a -i 30,3 to the command line to tell netperf to run at least three iterations, no more than thirty and it will try to achieve a 99% confidence that the reported means for throughput, local and remote CPU utilization are within +/- 2.5% of the actual mean. You can narrow or widen that with a -I 99,. A width of 5% is what gives the +/- 2.5% (and/or demonstrates my lack of accurate statistics knowledge :) ) happy benchmarking, rick jones
Re: [PATCH 08/24] net, diet: Make TCP metrics optional
On 05/06/2014 09:41 AM, j...@joshtriplett.org wrote: On Tue, May 06, 2014 at 11:59:41AM -0400, David Miller wrote: Making 2MB RAM machines today makes no sense at all. The lowest end dirt cheap smartphone, something which fits on someone's pocket, has gigabytes of ram. The lowest-end smartphone isn't anywhere close to "dirt cheap", and hardly counts as "embedded" at all anymore. Smartphones cost $100+; we're talking about systems in the low tens of dollars or less. These systems will have no graphics, no peripherals, and only one or two specific functions. The entirety of their functionality will likely consist of a single userspace program; they might not even have a PID 2. *That's* the kind of "embedded" we're talking about, not the supercomputers we carry around in our pockets. Would this be some sort of "Internet of Things" system? rick jones
Re: network performance get regression from 2.6 to 3.10 by each version
On 05/02/2014 12:40 PM, V JobNickname wrote: I have an ARM platform which works with older 2.6.28 Linux Kernel and the embedded NIC driver I profile the TCP Tx using netperf 2.6 by command "./netperf -H {serverip} -l 300". Is your ARM platform a multi-core one? If so, you may need/want to look into making certain the assignment of NIC interrupts and netperf have remained constant through your tests. You can bind netperf to a specific CPU via either "taskset" or the global -T option. You can check the interrupt assignment(s) for the queue(s) from the NIC by looking at /proc/interrupts and perhaps via other means. It would also be good to know if the drops in throughput correspond to an increase in service demand (CPU per unit of work). To that end, adding a global -c option to measure local (netperf side) CPU utilization would be a good idea. Still, even armed with that information, tracking down the regression or regressions will be no small feat particularly since the timespan is so long. A very good reason to be trying the newer versions as they appear, even if only briefly, rather than leaving it for so long. happy benchmarking, rick jones
Re: A call to revise sockets behaviour
A wine developer clearly showed that this option simply doesn't work. http://bugs.winehq.org/show_bug.cgi?id=26031#c21 Output of strace: getsockopt(24, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0 setsockopt(24, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(24, {sa_family=AF_INET, sin_port=htons(43012), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) The output of netstat -an didn't by any chance happen to still show an endpoint in the LISTEN state for that port number did it? rick jones
Re: [RFC PATCH 0/5] net: low latency Ethernet device polling
On 02/27/2013 09:55 AM, Eliezer Tamir wrote: This patchset adds the ability for the socket layer code to poll directly on an Ethernet device's RX queue. This eliminates the cost of the interrupt and context switch and with proper tuning allows us to get very close to the HW latency. This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from last year http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf Patch 1 adds ndo_ll_poll and the IP code to use it. Patch 2 is an example of how TCP can use ndo_ll_poll. Patch 3 shows how this method would be implemented for the ixgbe driver. Patch 4 adds statistics to the ixgbe driver for ndo_ll_poll events. (Optional) Patch 5 is a handy kprobes module to measure detailed latency numbers. this patchset is also available in the following git branch git://github.com/jbrandeb/lls.git rfc Performance numbers:
Kernel   Config     C3/6  rx-usecs  TCP  UDP
3.8rc6   typical    off   adaptive  37k  40k
3.8rc6   typical    off   0*        50k  56k
3.8rc6   optimized  off   0*        61k  67k
3.8rc6   optimized  on    adaptive  26k  29k
patched  typical    off   adaptive  70k  78k
patched  optimized  off   adaptive  79k  88k
patched  optimized  off   100       84k  92k
patched  optimized  on    adaptive  83k  91k
*rx-usecs=0 is usually not useful in a production environment. I would think that latency-sensitive folks would be using rx-usecs=0 in production - at least if the NIC in use didn't have low enough latency with its default interrupt coalescing/avoidance heuristics. If I take the first "pure" A/B comparison it seems that the change as benchmarked takes latency for TCP from ~27 usec (37k) to ~14 usec (70k). At what request/response size does the benefit taper-off? 13 usec seems to be about 16250 bytes at 10 GbE.
When I last looked at netperf TCP_RR performance where something similar could happen I think it was IPoIB where it was possible to set things up such that polling happened rather than wakeups (perhaps it was with a shim library that converted netperf's socket calls to "native" IB). My recollection is that it "did a number" on the netperf service demands thanks to the spinning. It would be a good thing to include those figures in any subsequent rounds of benchmarking. Am I correct in assuming this is a mechanism which would not be used in a high aggregate PPS situation? happy benchmarking, rick jones
Re: Doubts about listen backlog and tcp_max_syn_backlog
On 01/24/2013 04:22 AM, Leandro Lucarella wrote: On Wed, Jan 23, 2013 at 11:28:08AM -0800, Rick Jones wrote: Then if syncookies are enabled, the time spent in connect() shouldn't be bigger than 3 seconds even if SYNs are being "dropped" by listen, right? Do you mean if "ESTABLISHED" connections are dropped because the listen queue is full? I don't think I would put that as "SYNs being dropped by listen" - too easy to confuse that with an actual dropping of a SYN segment. I was just kind of quoting the name given by netstat: "SYNs to LISTEN sockets dropped" (for kernel 3.0, I noticed newer kernels don't have this stat anymore, or the name was changed). I still don't know if we are talking about the same thing. Are you sure those stats are not present in 3.X kernels? I just looked at /proc/net/netstat on a 3.7 system and noticed both the ListenMumble stats and the three cookie stats. And I see the code for them in the tree:
raj@tardy:~/net-next/net/ipv4$ grep MIB_LISTEN *.c
proc.c: SNMP_MIB_ITEM("ListenOverflows", LINUX_MIB_LISTENOVERFLOWS),
proc.c: SNMP_MIB_ITEM("ListenDrops", LINUX_MIB_LISTENDROPS),
tcp_ipv4.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
tcp_ipv4.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
raj@tardy:~/net-next/net/ipv4$ grep MIB_SYN *.c
proc.c: SNMP_MIB_ITEM("SyncookiesSent", LINUX_MIB_SYNCOOKIESSENT),
proc.c: SNMP_MIB_ITEM("SyncookiesRecv", LINUX_MIB_SYNCOOKIESRECV),
proc.c: SNMP_MIB_ITEM("SyncookiesFailed", LINUX_MIB_SYNCOOKIESFAILED),
syncookies.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESSENT);
syncookies.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
syncookies.c: NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESRECV);
I will sometimes be tripped-up by netstat's not showing a statistic with a zero value... But yes, I would not expect a connect() call to remain incomplete for any longer than it took to receive a SYN|ACK from the other end.
So the only reason to experience these high times spent in connect() should be because a SYN or SYN|ACK was actually lost in a lower layer, like an error in the network device or a transmission error? Modulo the/some other drop-without-stat point such as Vijay mentioned yesterday. You might consider taking some packet traces. If you can, I would start with a trace taken on the system(s) on which the long connect() calls are happening. I think the tcpdump manpage has an example of a tcpdump command with a filter expression that catches just SYNchronize and FINished segments, which I suppose you could extend to include ReSeT segments. Such a filter expression would be missing the client's ACK of the SYN|ACK, but unless you see incrementing stats relating to say checksum failures or other drops on the "client" side, I suppose you could assume that the client ACKed the server's SYN|ACK. That would be 3 (,9, 21, etc...) seconds on a kernel with 3 seconds as the initial retransmission timeout. Which can't be changed without recompiling, right? To the best of my knowledge. rick jones
Re: Doubts about listen backlog and tcp_max_syn_backlog
On 01/23/2013 02:47 AM, Leandro Lucarella wrote: Thanks for the info. I'm definitely dropping SYNs and sending cookies, around 50/s. Is there any way to tell how many connections are queued in a particular socket? I am not familiar with one. Doesn't mean there isn't one, only that I am not able to think of it. Then if syncookies are enabled, the time spent in connect() shouldn't be bigger than 3 seconds even if SYNs are being "dropped" by listen, right? Do you mean if "ESTABLISHED" connections are dropped because the listen queue is full? I don't think I would put that as "SYNs being dropped by listen" - too easy to confuse that with an actual dropping of a SYN segment. But yes, I would not expect a connect() call to remain incomplete for any longer than it took to receive an SYN|ACK from the other end. That would be 3 (,9, 21, etc...) seconds on a kernel with 3 seconds as the initial retransmission timeout. rick
Re: Doubts about listen backlog and tcp_max_syn_backlog
On 01/22/2013 10:42 AM, Leandro Lucarella wrote: On Tue, Jan 22, 2013 at 10:17:50AM -0800, Rick Jones wrote: What is important is the backlog, and I guess you didn't increase it properly. The somaxconn default is quite low (128) Leandro - If that is being overflowed, I believe you should be seeing something like: 14 SYNs to LISTEN sockets dropped in the output of netstat -s on the system on which the server application is running. What is that value reporting exactly? Netstat is reporting the ListenDrops and/or ListenOverflows which map to LINUX_MIB_LISTENDROPS and LINUX_MIB_LISTENOVERFLOWS. Those get incremented in tcp_v4_syn_recv_sock() (and its v6 version etc) if (sk_acceptq_is_full(sk)) goto exit_overflow; Will increment both overflows and drops, and drops will increment on its own in some additional cases. Because we are using syncookies, and AFAIK with that enabled, all SYNs are being replied, and what the listen backlog is really limitting is the "completely established sockets waiting to be accepted", according to listen(2). What I don't really know to be honest, is what a "completely established socket" is, does it mean that the SYN,ACK was sent, or the ACK was received back? I have always thought it meant that the ACK of the SYN|ACK has been received. SyncookiesSent SyncookiesRecv SyncookiesFailed also appear in /proc/net/netstat and presumably in netstat -s output. Also, from the client side, when is the connect(2) call done? When the SYN,ACK is received? That would be my assumption. In a previous message: What I'm seeing are clients taking either useconds to connect, or 3 seconds, which suggest SYNs are getting lost, but the network doesn't seem to be the problem. I'm still investigating this, so unfortunately I'm not really sure. I recently ran into something like that, which turned-out to be an issue with nf_conntrack and its table filling. 
rick
Re: [PATCH net-next] tcp: add ability to set a timestamp offset
On 01/22/2013 12:52 PM, Andrey Vagin wrote: If a TCP socket gets live-migrated from one box to another the timestamps (which are typically ON) will get screwed up -- the new kernel will generate TS values that have nothing to do with what they were on dump. The solution is to yet again fix the kernel and put a "timestamp offset" on a socket. Is there a chance a connection can be moved more than once within the "lifetime" of a given timestamp value? rick jones
Re: Doubts about listen backlog and tcp_max_syn_backlog
What is important is the backlog, and I guess you didn't increase it properly. The somaxconn default is quite low (128) Leandro - If that is being overflowed, I believe you should be seeing something like: 14 SYNs to LISTEN sockets dropped in the output of netstat -s on the system on which the server application is running. rick
Re: [PATCH net-next] softirq: reduce latencies
On 01/03/2013 05:31 AM, Eric Dumazet wrote: A common network load is to launch ~200 concurrent TCP_RR netperf sessions like the following netperf -H remote_host -t TCP_RR -l 1000 And then you can launch some netperf asking for P99_LATENCY results: netperf -H remote_host -t TCP_RR -- -k P99_LATENCY In terms of netperf overhead, once you specify P99_LATENCY, you are already in for the pound of cost but only getting the penny of output (so to speak). While it would clutter the output, one could go ahead and ask for the other latency stats and it won't "cost" anything more: ... -- -k RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY Additional information about how the omni output selectors work can be found at http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Omni-Output-Selection happy benchmarking, rick jones BTW - you will likely see some differences between RT_LATENCY, which is calculated from the average transactions per second, and MEAN_LATENCY, which is calculated from the histogram of individual latencies maintained when any of the _LATENCY outputs other than RT_LATENCY is requested. Kudos to the folks at Google who did the extensions to the then-existing histogram code to enable it to be used for more reasonably accurate statistics.
Re: [rfc net-next v6 0/3] Multiqueue virtio-net
On 10/30/2012 03:03 AM, Jason Wang wrote: Hi all: This series is an updated version of the multiqueue virtio-net driver, based on Krishna Kumar's work, to let virtio-net use multiple rx/tx queues for packet reception and transmission. Please review and comment. Changes from v5:
- Align the implementation with the RFC spec update v4
- Switch between single mode and multiqueue mode without reset
- Remove the 256 limitation of queues
- Use helpers to do the mapping between virtqueues and tx/rx queues
- Use combined channels instead of separate rx/tx queues when doing the queue number configuration
- Other coding style comments from Michael
Reference:
- A prototype implementation of qemu-kvm support can be found at git://github.com/jasowang/qemu-kvm-mq.git
- V5 could be found at http://lwn.net/Articles/505388/
- V4 could be found at https://lkml.org/lkml/2012/6/25/120
- V2 could be found at http://lwn.net/Articles/467283/
- Michael's virtio-spec: http://www.spinics.net/lists/netdev/msg209986.html
Perf Numbers:
- Pktgen tests show the receiving capability of multiqueue virtio-net was dramatically improved.
- Netperf results show latency was greatly improved according to the test results.
I suppose it is technically correct to say that latency was improved, but usually for aggregate request/response tests I tend to talk about the aggregate transactions per second. Do you have a hypothesis as to why the improvement dropped going to 20 concurrent sessions from 10?
rick jones

Netperf Local VM to VM test:
- VM1 and its vcpu/vhost thread in numa node 0
- VM2 and its vcpu/vhost thread in numa node 1
- a script is used to launch netperf in demo mode and post-process the aggregate result with the help of timestamps
- average of 3 runs

TCP_RR: size/session/+lat%/+normalize%
1/ 1/0%/0% 1/10/ +52%/ +6% 1/20/ +27%/ +5%
64/ 1/0%/0% 64/10/ +45%/ +4% 64/20/ +28%/ +7%
256/ 1/ -1%/0% 256/10/ +38%/ +2% 256/20/ +27%/ +6%

TCP_CRR: size/session/+lat%/+normalize%
1/ 1/ -7%/ -12% 1/10/ +34%/ +3% 1/20/ +3%/ -8%
64/ 1/ -7%/ -3% 64/10/ +32%/ +1% 64/20/ +4%/ -7%
256/ 1/ -6%/ -18% 256/10/ +33%/0% 256/20/ +4%/ -8%

STREAM: size/session/+thu%/+normalize%
1/ 1/ -3%/0% 1/ 2/ -1%/0% 1/ 4/ -2%/0%
64/ 1/0%/ +1% 64/ 2/ -6%/ -6% 64/ 4/ -8%/ -14%
256/ 1/0%/0% 256/ 2/ -48%/ -52% 256/ 4/ -50%/ -55%
512/ 1/ +4%/ +5% 512/ 2/ -29%/ -33% 512/ 4/ -37%/ -49%
1024/ 1/ +6%/ +7% 1024/ 2/ -46%/ -51% 1024/ 4/ -15%/ -17%
4096/ 1/ +1%/ +1% 4096/ 2/ +16%/ -2% 4096/ 4/ +31%/ -10%
16384/ 1/0%/0% 16384/ 2/ +16%/ +9% 16384/ 4/ +17%/ -9%

Netperf test between external host and guest over 10gb (ixgbe):
- VM thread and vhost threads were pinned in node 0
- a script is used to launch netperf in demo mode and post-process the aggregate result with the help of timestamps
- average of 3 runs

TCP_RR: size/session/+lat%/+normalize%
1/ 1/0%/ +6% 1/10/ +41%/ +2% 1/20/ +10%/ -3%
64/ 1/0%/ -10% 64/10/ +39%/ +1% 64/20/ +22%/ +2%
256/ 1/0%/ +2% 256/10/ +26%/ -17% 256/20/ +24%/ +10%

TCP_CRR: size/session/+lat%/+normalize%
1/ 1/ -3%/ -3% 1/10/ +34%/ -3% 1/20/0%/ -15%
64/ 1/ -3%/ -3% 64/10/ +34%/ -3% 64/20/ -1%/ -16%
256/ 1/ -1%/ -3% 256/10/ +38%/ -2% 256/20/ -2%/ -17%

TCP_STREAM (guest receiving): size/session/+thu%/+normalize%
1/ 1/ +1%/ +14% 1/ 2/0%/ +4% 1/ 4/ -2%/ -24%
64/ 1/ -6%/ +1% 64/ 2/ +1%/ +1% 64/ 4/ -1%/ -11%
256/ 1/ +3%/ +4% 256/ 2/0%/ -1% 256/ 4/0%/ -15%
512/ 1/ +4%/0% 512/ 2/ -10%/ -12% 512/ 4/0%/ -11%
1024/ 1/ -5%/0% 1024/ 2/ -11%/ -16% 1024/ 4/ +3%/ -11%
4096/ 1/ +27%/ +6% 4096/ 2/0%/ -12% 4096/ 4/0%/ -20%
16384/ 1/0%/ -2% 16384/ 2/0%/ -9% 16384/ 4/ +10%/ -2%

TCP_MAERTS (guest sending):
1/ 1/ -1%/0% 1/ 2/0%/0% 1/ 4/ -5%/0%
64/ 1/0%/0% 64/ 2/ -7%/ -8% 64/ 4/ -7%/ -8%
256/ 1/0%/0% 256/ 2/ -28%/ -28% 256/ 4/ -28%/ -29%
512/ 1/0%/0% 512/ 2/ -15%/ -13% 512/ 4/ -53%/ -59%
1024/ 1/ +4%/ +13% 1024/ 2/ -7%/ -18% 1024/ 4/ +1%/ -18%
4096/ 1/
Re: Netperf UDP_STREAM regression due to not sending IPIs in ttwu_queue()
On 10/03/2012 02:47 AM, Mel Gorman wrote: On Tue, Oct 02, 2012 at 03:48:57PM -0700, Rick Jones wrote: On 10/02/2012 01:45 AM, Mel Gorman wrote: SIZE=64 taskset -c 0 netserver taskset -c 1 netperf -t UDP_STREAM -i 50,6 -I 99,1 -l 20 -H 127.0.0.1 -- -P 15895 -s 32768 -S 32768 -m $SIZE -M $SIZE Just FYI, unless you are running a hacked version of netperf, the "50" in "-i 50,6" will be silently truncated to 30. I'm not using a hacked version of netperf. The 50,6 has been there a long time so I'm not sure where I took it from any more. It might have been an older version or me being over-zealous at the time. No version has ever gone past 30. It has been that way since the confidence interval code was contributed. It doesn't change anything, so it hasn't messed-up any results. It would be good to fix but not critical. PS - I trust it is the receive-side throughput being reported/used with UDP_STREAM :) Good question. Now that I examine the scripts, it is in fact the sending side that is being reported which is flawed. Granted I'm not expecting any UDP loss on loopback and looking through a range of results, the difference is marginal. It's still wrong to report just the sending side for UDP_STREAM and I'll correct the scripts for it in the future. Switching from sending to receiving throughput in UDP_STREAM could be a non-trivial disconnect in throughputs. As Eric mentions, the receiver could be dropping lots of datagrams if it cannot keep-up, and netperf makes no attempt to provide any application-layer flow-control. Not sure which version of netperf you are using to know whether or not it has gone to the "omni" code path. If you aren't using 2.5.0 or 2.6.0 then the confidence intervals will have been computed based on the receive side throughput, so you will at least know that it was stable, even if it wasn't the same as the sending side. The top of trunk will use the remote's receive stats for the omni migration of a UDP_STREAM test too.
I think it is that way in 2.5.0 and 2.6.0 as well but I've not gone into the repository to check. Of course, that means you don't necessarily know that the sending throughput met your confidence intervals :) If you are on 2.5.0 or later, you may find: http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Omni-Output-Selection helpful when looking to parse results. One more, little thing - taskset may indeed be better for what you are doing (it will happen "sooner" certainly), but there is also the global -T option to bind netperf/netserver to the specified CPU id. http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#index-g_t_002dT_002c-Global-41 happy benchmarking, rick jones
Re: Netperf UDP_STREAM regression due to not sending IPIs in ttwu_queue()
On 10/02/2012 01:45 AM, Mel Gorman wrote: SIZE=64 taskset -c 0 netserver taskset -c 1 netperf -t UDP_STREAM -i 50,6 -I 99,1 -l 20 -H 127.0.0.1 -- -P 15895 -s 32768 -S 32768 -m $SIZE -M $SIZE Just FYI, unless you are running a hacked version of netperf, the "50" in "-i 50,6" will be silently truncated to 30. happy benchmarking, rick jones PS - I trust it is the receive-side throughput being reported/used with UDP_STREAM :)
Re: getsockopt/setsockopt with SO_RCVBUF and SO_SNDBUF "non-standard" behaviour
On 07/18/2012 09:11 AM, Eric Dumazet wrote: That's the way it's been done on Linux since day 0. You can probably find a lot of pages on the web explaining the rationale. If your application handles UDP frames, what should SO_RCVBUF count? If it's the amount of payload bytes, you could have a pathological situation where an attacker sends 1-byte UDP frames fast enough and could consume a lot of kernel memory. Each frame consumes a fair amount of kernel memory (between 512 bytes and 8 Kbytes depending on the driver). So Linux says: if the user expects to receive bytes, set a limit on the _kernel_ memory used to store those bytes, and use an estimation of 100% overhead. That is: allow 2*bytes to be allocated for socket receive buffers. Expanding on/rewording that, in a setsockopt() call SO_RCVBUF specifies the data bytes and gets doubled to become the kernel/overhead byte limit. Unless the doubling would be greater than net.core.rmem_max, in which case the limit becomes net.core.rmem_max. But on getsockopt() SO_RCVBUF is always the kernel/overhead byte limit. In one call it is fish. In the other it is fowl. Other stacks appear to keep their kernel/overhead limit quiet, keeping SO_RCVBUF an expression of a data limit in both setsockopt() and getsockopt(). With those stacks, there is I suppose the possible source of confusion when/if someone tests the queuing to a socket, sends "high overhead" packets and doesn't get to SO_RCVBUF worth of data, though I don't recall encountering that in my "pre-linux" time. The sometimes fish, sometimes fowl version (along with the auto tuning when one doesn't make setsockopt() calls) gave me fits in netperf for years until I finally relented and split the socket buffer size variables into three - what netperf's user requested via the command line, what it was right after the socket was created, and what it was at the end of the data phase of the test.
rick jones
Re: [net-next RFC V5 0/5] Multiqueue virtio-net
On 07/08/2012 08:23 PM, Jason Wang wrote: On 07/07/2012 12:23 AM, Rick Jones wrote: On 07/06/2012 12:42 AM, Jason Wang wrote: Which mechanism to address skew error? The netperf manual describes more than one: This mechanism was missing in my test; I will add it to my test scripts. http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance Personally, my preference these days is to use the "demo mode" method of aggregate results as it can be rather faster than (ab)using the confidence intervals mechanism, which I suspect may not really scale all that well to large numbers of concurrent netperfs. During my test, the confidence interval was hard to achieve even in RR tests when I pinned vhost/vcpus to processors, so I didn't use it. When running aggregate netperfs, *something* has to be done to address the prospect of skew error. Otherwise the results are suspect. happy benchmarking, rick jones
Re: [net-next RFC V5 0/5] Multiqueue virtio-net
On 07/06/2012 12:42 AM, Jason Wang wrote: I'm not expert of tcp, but looks like the changes are reasonable: - we can do full-sized TSO check in tcp_tso_should_defer() only for westwood, according to tcp westwood - run tcp_tso_should_defer for tso_segs = 1 when tso is enabled. I'm sure Eric and David will weigh-in on the TCP change. My initial inclination would have been to say "well, if multiqueue is draining faster, that means ACKs come-back faster, which means the "race" between more data being queued by netperf and ACKs will go more to the ACKs which means the segments being sent will be smaller - as TCP_NODELAY is not set, the Nagle algorithm is in force, which means once there is data outstanding on the connection, no more will be sent until either the outstanding data is ACKed, or there is an accumulation of > MSS worth of data to send. Also, how are you combining the concurrent netperf results? Are you taking sums of what netperf reports, or are you gathering statistics outside of netperf? The throughput were just sumed from netperf result like what netperf manual suggests. The cpu utilization were measured by mpstat. Which mechanism to address skew error? The netperf manual describes more than one: http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance Personally, my preference these days is to use the "demo mode" method of aggregate results as it can be rather faster than (ab)using the confidence intervals mechanism, which I suspect may not really scale all that well to large numbers of concurrent netperfs. I also tend to use the --enable-burst configure option to allow me to minimize the number of concurrent netperfs in the first place. Set TCP_NODELAY (the test-specific -D option) and then have several transactions outstanding at one time (test-specific -b option with a number of additional in-flight transactions). 
This is expressed in the runemomniaggdemo.sh script: http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomniaggdemo.sh which uses the find_max_burst.sh script: http://www.netperf.org/svn/netperf2/trunk/doc/examples/find_max_burst.sh to pick the burst size to use in the concurrent netperfs, the results of which can be post-processed with: http://www.netperf.org/svn/netperf2/trunk/doc/examples/post_proc.py The nice feature of using the "demo mode" mechanism is that when it is coupled with systems with reasonably synchronized clocks (eg NTP) it can be used for many-to-many testing in addition to one-to-many testing (which cannot be dealt with by the confidence interval method of dealing with skew error). A single instance TCP_RR test would help confirm/refute any non-trivial change in (effective) path length between the two cases. Yes, I will test this, thanks. Excellent. happy benchmarking, rick jones
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
*) netperf/netserver support CPU affinity within themselves with the global -T option to netperf. Is the result with taskset much different? The equivalent to the above would be to run netperf with: ./netperf -T 0,7 .. I checked the source code and didn't find this option. I use netperf V2.3 (I found the number in the makefile). Indeed, that version pre-dates the -T option. If you weren't already chasing a regression I'd suggest an upgrade to 2.4.mumble. Once you are at a point where changing another variable won't muddle things you may want to consider upgrading. happy benchmarking, rick jones
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
The test command is:
#sudo taskset -c 7 ./netserver
#sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- -r 1,1
A couple of comments/questions on the command lines: *) netperf/netserver support CPU affinity within themselves with the global -T option to netperf. Is the result with taskset much different? The equivalent to the above would be to run netperf with: ./netperf -T 0,7 ... The one possibly salient difference between the two is that when done within netperf, the initial process creation will take place wherever the scheduler wants it. *) The -i option to set the confidence iteration count will silently cap the max at 30. happy benchmarking, rick jones
Re: questions on NAPI processing latency and dropped network packets
1) Interrupts are being processed on both cpus:
[EMAIL PROTECTED]:/root> cat /proc/interrupts
    CPU0    CPU1
30: 17037564530785 U3-MPIC Level eth0
IIRC none of the e1000 driven cards are multi-queue, so while the above shows that interrupts from eth0 have been processed on both CPUs at various points in the past, it doesn't necessarily mean that they are being processed on both CPUs at the same time right? rick jones
Re: AF_UNIX MSG_PEEK bug?
Potential bugs notwithstanding, given that this is a STREAM socket, and as such shouldn't (I hope, or I'm eating toes for dinner again) have side effects like tossing the rest of a datagram, why are you using MSG_PEEK? Why not simply read the N bytes of the message that will have the message length with a normal read/recv, and then read that many bytes in the next call? rick jones
Re: Reproducible data corruption with sendfile+vsftp - splice regression?
Could the corruption be seen in a tcpdump trace prior to transmission (ie taken on the sender) or was it only seen after the data passed out the NIC? rick jones
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
Adrian Bunk wrote: On Tue, Nov 27, 2007 at 01:15:23PM -0800, Rick Jones wrote: The real problem is that these drivers are not in the upstream kernel. Are there common reasons why these drivers are not upstream? One might be that upstream has not accepted them. Anything doing or smelling of TOE comes to mind right away. Which modules doing or smelling of TOE do work with unmodified vendor kernels? At the very real risk of further demonstrating my Linux vocabulary limitations, I believe there is a "Linux Sockets Acceleration" module/whatnot for NetXen and related 10G NICs, and a cxgb3_toe (?) module for Chelsio 10G NICs. rick jones
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
The real problem is that these drivers are not in the upstream kernel. Are there common reasons why these drivers are not upstream? One might be that upstream has not accepted them. Anything doing or smelling of TOE comes to mind right away. rick jones
Re: [PATCH 0/5][RFC] Physical PCI slot objects
Greg KH wrote: Doesn't /sys/firmware/acpi give you raw access to the correct tables already? And isn't there some other tool that dumps the raw ACPI tables? I thought the acpi developers used it all the time when debugging things with users. I'm neither an acpi developer (well I don't think that I am :) nor an end-user, but here are the two things for which I was going to use the information being presented by Alex's patch: 1) a not-yet, but on track to be released tool to be used by end-users to diagnose I/O bottlenecks - the information in /sys/bus/pci/slot//address would be used to associate interfaces and/or pci busses etc with something the end user would grok - the number next to the slot. 2) I was also going to get the folks doing installers to make use of the "end-user" slot ID. Even without going to the extreme of the aforementioned 192 slot system, an 8 slot system with a bunch of dual-port NICs in it (for example) is going to present this huge list of otherwise identical entries. Even if the installers show the MAC for a NIC (or I guess a WWN for an HBA or whatnot) that still doesn't tell one without prior knowledge of what MACs were installed in which slot, which slot is associated with a given ethN. Having the end-user slot ID visible is then going to be a great help to that poor admin who is doing the install. rick jones
Re: bizarre network timing problem
Felix von Leitner wrote: Thus spake Rick Jones ([EMAIL PROTECTED]): Past performance is no guarantee of current correctness :) And over an Ethernet, there will be a very different set of both timings and TCP segment sizes compared to loopback. My guess is that you will find setting the lo mtu to 1500 a very interesting experiment. Setting the MTU on lo to 1500 eliminates the problem and gives me double digit MB/sec throughput. I'm not in a position at the moment to test it as my IPoIB systems are offline, and not sure you are either, but I will note that with IPoIB bits circa OFED1.2 the default MTU for IPoIB goes up to 65520 bytes. If indeed the problem you were seeing was related to sub-mss sends and window probing and such, it might appear on IPoIB in addition to loopback. rick jones
Re: bizarre network timing problem
Felix von Leitner wrote: Thus spake Rick Jones ([EMAIL PROTECTED]): Oh I'm pretty sure it's not my application, because my application performs well over ethernet, which is after all its purpose. Also I see the write, the TCP uncork, then a pause, and then the packet leaving. Well, a wise old engineer tried to teach me that the proper spelling is ass-u-me :) so just for grins, you might try the TCP_RR test anyway :) And even if your application is correct (although I wonder why the receiver isn't sucking data-out very quickly...) if you can reproduce the problem with netperf it will be easier for others to do so. My application is only the server, the receiver is smbget from Samba, so I don't feel responsible for it :-) Might want to strace it anyway... no good deed (such as reporting a potential issue) goes unpunished :) Still, when run over Ethernet, it works fine without waiting for timeouts to expire. Past performance is no guarantee of current correctness :) And over an Ethernet, there will be a very different set of both timings and TCP segment sizes compared to loopback. My guess is that you will find setting the lo mtu to 1500 a very interesting experiment. To reproduce this: - smbget is from samba, you probably already have this - gatling (my server) can be gotten from cvs -d :pserver:[EMAIL PROTECTED]:/cvs -z9 co dietlibc libowfat gatling dietlibc is not strictly needed, but it's my environment. First build dietlibc, then libowfat, then gatling. Felix
Re: bizarre network timing problem
Felix von Leitner wrote: Thus spake Rick Jones ([EMAIL PROTECTED]): How could I test this theory? Can you take another trace that isn't so "cooked?" One that just sticks with TCP-level and below stuff? Sorry for taking so long. Here is a tcpdump. The side on port 445 is the SMB server using TCP_CORK.
23:03:20.283772 IP 127.0.0.1.33230 > 127.0.0.1.445: S 1503927325:1503927325(0) win 32792
23:03:20.283774 IP 127.0.0.1.445 > 127.0.0.1.33230: S 1513925692:1513925692(0) ack 1503927326 win 32768
23:03:20.283797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 1 win 257
23:03:20.295851 IP 127.0.0.1.33230 > 127.0.0.1.445: P 1:195(194) ack 1 win 257
23:03:20.295881 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 195 win 265
23:03:20.295959 IP 127.0.0.1.445 > 127.0.0.1.33230: P 1:87(86) ack 195 win 265
23:03:20.295998 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 87 win 256
23:03:20.296063 IP 127.0.0.1.33230 > 127.0.0.1.445: P 195:287(92) ack 87 win 256
23:03:20.296096 IP 127.0.0.1.445 > 127.0.0.1.33230: P 87:181(94) ack 287 win 265
23:03:20.296135 IP 127.0.0.1.33230 > 127.0.0.1.445: P 287:373(86) ack 181 win 255
23:03:20.296163 IP 127.0.0.1.445 > 127.0.0.1.33230: P 181:239(58) ack 373 win 265
23:03:20.296201 IP 127.0.0.1.33230 > 127.0.0.1.445: P 373:459(86) ack 239 win 255
23:03:20.296245 IP 127.0.0.1.445 > 127.0.0.1.33230: P 239:309(70) ack 459 win 265
23:03:20.296286 IP 127.0.0.1.33230 > 127.0.0.1.445: P 459:535(76) ack 309 win 254
23:03:20.296314 IP 127.0.0.1.445 > 127.0.0.1.33230: P 309:461(152) ack 535 win 265
23:03:20.296361 IP 127.0.0.1.33230 > 127.0.0.1.445: P 535:594(59) ack 461 win 253
23:03:20.296400 IP 127.0.0.1.445 > 127.0.0.1.33230: . 461:16845(16384) ack 594 win 265
23:03:20.335748 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 16845 win 125
[note the .2 sec pause]
I wonder if the ack 16845 win 125 not updating the window is part of it?
With a window scale of 7, the advertised window of 125 is only 16000 bytes, and it looks based on what follows that TCP has another 16384 to send, so my guess is that TCP was waiting to have enough window, the persist timer expired and TCP then had to say "oh well, send what I can". Probably a coupling with this being less than the MSS (16396) involved too.
23:03:20.547763 IP 127.0.0.1.445 > 127.0.0.1.33230: P 16845:32845(16000) ack 594 win 265
23:03:20.547797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 0
Notice that an ACK comes back with a zero window in it - that means that by this point the receiver still hasn't consumed the 16384+16000 bytes sent to it.
23:03:20.547855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 96
Now the receiver has pulled some data, on the order of 96*128 bytes, so TCP can now go ahead and send the remaining 384 bytes.
23:03:20.547863 IP 127.0.0.1.445 > 127.0.0.1.33230: P 32845:33229(384) ack 594 win 265
23:03:20.547890 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 33229 win 96
[note the .2 sec pause]
I'll bet that 96 * 128 is 12288 and we have another persist timer expiring. I also wonder if the behaviour might be different if you were using send() rather than sendfile() - just random musings...
23:03:20.755775 IP 127.0.0.1.445 > 127.0.0.1.33230: P 33229:45517(12288) ack 594 win 265
23:03:20.755855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 45517 win 96
23:03:20.755868 IP 127.0.0.1.445 > 127.0.0.1.33230: P 45517:49613(4096) ack 594 win 265
23:03:20.755898 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 49613 win 96
[another one]
23:03:20.963789 IP 127.0.0.1.445 > 127.0.0.1.33230: P 49613:61901(12288) ack 594 win 265
23:03:20.963871 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 61901 win 96
23:03:20.963885 IP 127.0.0.1.445 > 127.0.0.1.33230: P 61901:64525(2624) ack 594 win 265
23:03:20.963909 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 64525 win 96
23:03:20.964101 IP 127.0.0.1.33230 > 127.0.0.1.445: P 594:653(59) ack 64525 win 96
23:03:21.003790 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 653 win 265
23:03:21.171811 IP 127.0.0.1.445 > 127.0.0.1.33230: P 64525:76813(12288) ack 653 win 265
You get the idea. Anyway, now THIS is the interesting case, because we have two packets in the answer, and you see the first half of the answer leaving immediately (when I wanted the whole answer to be sent) but the second only leaving after the .2 sec delay. And it wasn't waiting for an ACK/window-update. You could try: ifconfig lo mtu 1500 and see what happens then. If SMB is a one-request-at-a-time protocol (I can never remember), It is. Joy. you could simulate it with a netperf TCP_RR test by passing suitable values to the test-specific -r option: netperf -H -t TCP_RR -- -r , If that shows similar behaviour then you can ass-u-me it isn't your application. Oh I'm pretty sure it's
Re: expected behavior of PF_PACKET on NETIF_F_HW_VLAN_RX device?
David Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> I'll try to go pester folks in tcpdump-workers then. The thing to check is "TP_STATUS_CSUMNOTREADY". When using mmap(), it will be provided in the descriptor. When using recvmsg() it will be provided via a PACKET_AUXDATA control message when enabled via the PACKET_AUXDATA socket option. Figures... the "dailies" and "weeklies" for tar files of tcpdump and libpcap source are fubar... again. I've email in to tcpdump-workers on that one. If that isn't resolved quickly I'll learn how to access their CVS (pick an SCM, any SCM...) I did an apt-get of debian lenny's tcpdump and sources: hpcpc103:~# tcpdump -V tcpdump version 3.9.8 libpcap version 0.9.8 and that seems to show the false checksum failure and not use the TP_STATUS_CSUMNOTREADY - at least that didn't appear in a grepping of the sources. At first I thought it might be, but then I realized that my snaplen was too short to get the whole TSO'ed frame so tcpdump wasn't even trying to verify. After disabling TSO on the NIC, leaving CKO on, and making my snaplen > 1500 I could see it was doing undesirable stuff. I'll see what top of trunk has at some point and what the folks there think of adding-in a change. rick jones - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: expected behavior of PF_PACKET on NETIF_F_HW_VLAN_RX device?
David Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> Date: Thu, 01 Nov 2007 14:48:45 -0700 One could I suppose try to amend the information passed to allow tcpdump to say "oh, this was a tx packet on the same machine on which I am tracing so don't worry about checksum mismatch" We do this already! I'll try to go pester folks in tcpdump-workers then. rick
Re: expected behavior of PF_PACKET on NETIF_F_HW_VLAN_RX device?
The code in AF_PACKET should fix the skb before passing to user space so that there is no difference between accel and non-accel hardware. Internal choices shouldn't leak to user space. Ditto, the receive checksum offload should be fixed up as well. yep. bad csum on tx packets as reported by tcpdump is also an issue. With TX CKO enabled, there isn't any checksum to fixup when a tx packet is sniffed, so I'm not sure what can be done in the kernel apart from an unpalatable "disable CKO and all which depend upon it when entering promiscuous mode." Having the tap calculate a checksum would be equally bad for performance, and would frankly be incorrect anyway because it would give the user the false impression that was the checksum which went-out onto the wire. One could I suppose try to amend the information passed to allow tcpdump to say "oh, this was a tx packet on the same machine on which I am tracing so don't worry about checksum mismatch" but I have to wonder if it is _really_ worth it. Already someone has to deal with seeing TCP segments >> the MSS thanks to TSO. (Actually tcpdump got rather confused about that too since the IP length of those was 0, but IIRC we got that patched to use the length of zero as a "ah, this was TSO so wing it" heuristic.) rick jones
Re: [bug, 2.6.24-rc1] sysfs: duplicate filename 'eth0' can not be created
[/proc/interrupts excerpt, per-CPU counter columns mangled in transit. The legible rows show the usual LSAPIC entries (cmc_hndlr, mca_rdzv, perfmon, timer, mca_wkup, tlb_flush, resched, IPI) and IO-SAPIC-level entries (acpi, serial, ehci_hcd:usb1, ohci_hcd:usb2/usb3, cciss0), plus the Neterion 10 Gigabit Ethernet-SR Low Profile PCI-X 2.0 DDR A on two interrupts:

 60: 0 0 0 0 11945 0 0 0 IO-SAPIC-level eth6
 61: 0 0 0 0 0 101072 0 0 IO-SAPIC-level eth6]

it appears as eth6. rick jones ... Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6 patch] always export sysctl_{r,w}mem_max
David Miller wrote: If DLM really wants minimum, it can use SO_SNDBUFFORCE and SO_RCVBUFFORCE socket options and use whatever limits it likes. But even this is questionable. Drift... Is that something netperf should be using though? Right now it uses the regular SO_[SND|RCV]BUF calls and is at the mercy of sysctls. I wonder if it would be better to have it use their FORCE versions to make life easier on the benchmarker - such as myself - who has an unfortunate habit of forgetting to update sysctl.conf :) rick jones
Re: [2.6 patch] always export sysctl_{r,w}mem_max
Eric W. Biederman wrote: Adrian Bunk <[EMAIL PROTECTED]> writes: This patch fixes the following build error with CONFIG_SYSCTL=n:

<-- snip -->
...
ERROR: "sysctl_rmem_max" [fs/dlm/dlm.ko] undefined!
ERROR: "sysctl_wmem_max" [drivers/net/rrunner.ko] undefined!
ERROR: "sysctl_rmem_max" [drivers/net/rrunner.ko] undefined!
make[2]: *** [__modpost] Error 1

I was going to ask if allowing drivers to increase rmem_max is something that we want to do. Apparently the road runner driver has been doing this since 2.6.12-rc1, when the git repository starts, so this probably isn't a latent bug. Although it does rather sound like a driver writer yanking the rope from the hands of the sysadmin and hanging him with it rather than letting the sysadmin do it himself. I've seen other drivers' README's suggesting larger mem's but not their sources doing it. rick jones
Re: Bad TCP checksum error
Checksum Offload on the NIC(s) can complicate things. First, if you are tracing on the sender, the tracepoint is before the NIC has computed the full checksum. IIRC only a partial checksum is passed-down to the NIC when CKO is in use. So, making certain your trace is from the "wire" or the receiver rather than the sender would be a good thing, and trying again with CKO disabled on the interface(s) (via ethtool) might be something worth looking at. Ultimately, doing the partial checksum modifications in a CKO-friendly manner might be a good thing. rick jones
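For reference, the "full checksum" the NIC finishes under CKO is the usual RFC 1071 one's-complement sum - with offload, the sender-side tracepoint only sees the partial (pseudo-header) sum, which is why the trace looks "bad." A rough sketch of the arithmetic (my own illustration, not the kernel's implementation):

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum per RFC 1071, folding carries as we go."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold carry back in
    return total

def internet_checksum(data: bytes) -> int:
    """Header checksum field: one's complement of the running sum."""
    return ~ones_complement_sum(data) & 0xFFFF

# Data carrying a correct checksum re-checksums (checksum included) to 0:
pkt = b"\x00\x01\xf2\x03"
csum = internet_checksum(pkt)
print(hex(csum))                                         # -> 0xdfb
print(internet_checksum(pkt + csum.to_bytes(2, "big")))  # -> 0
```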
Re: bizarre network timing problem
Felix von Leitner wrote: the packet trace was a bit too cooked perhaps, but there were indications that at times the TCP window was going to zero - perhaps something with window updates or persist timers? Does TCP use different window sizes on loopback? Why is this not happening on ethernet? I don't think it uses different window sizes on loopback, but with the autotuning it can be difficult to say a priori what the window size will be. What one can say with confidence is that the MTU and thus the MSS will be different between loopback and ethernet. How could I test this theory? Can you take another trace that isn't so "cooked?" One that just sticks with TCP-level and below stuff? If SMB is a one-request-at-a-time protocol (I can never remember), you could simulate it with a netperf TCP_RR test by passing suitable values to the test-specific -r option: netperf -H -t TCP_RR -- -r , If that shows similar behaviour then you can ass-u-me it isn't your application. One caveat though is that TCP_CORK mode in netperf is very primitive and may not match what you are doing, however, that may be a good thing. http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/ or ftp://ftp.netperf.org/netperf/ to get the current netperf bits. It is also possible to get multiple transactions in flight at one time if you configure netperf with --enable-burst, which will then enable a test-specific -b option. With the latest netperf you can also switch the output of a TCP_RR test to bits or bytes per second a la the _STREAM tests. rick jones My initial idea was that it has something to do with the different MTU on loopback. My initial block size was 16k, but the problem stayed when I changed it to 64k. Felix
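If netperf isn't handy, the one-request-at-a-time pattern a TCP_RR test exercises can be mimicked in a few lines of Python over loopback (the sizes and iteration count here are arbitrary, and this measures far less carefully than netperf does):

```python
import socket
import threading
import time

def rr_test(req_size=1, rsp_size=1, iters=200):
    """Minimal one-request-at-a-time ping-pong over loopback,
    in the spirit of a netperf TCP_RR test."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        with conn:
            for _ in range(iters):
                buf = b""
                while len(buf) < req_size:          # read the whole request
                    buf += conn.recv(req_size - len(buf))
                conn.sendall(b"\x00" * rsp_size)    # then answer it

    t = threading.Thread(target=serve)
    t.start()
    cli = socket.create_connection(srv.getsockname())
    start = time.perf_counter()
    for _ in range(iters):
        cli.sendall(b"\x00" * req_size)
        buf = b""
        while len(buf) < rsp_size:                  # wait for the full reply
            buf += cli.recv(rsp_size - len(buf))
    elapsed = time.perf_counter() - start
    cli.close()
    t.join()
    srv.close()
    return iters / elapsed  # transactions per second

print(round(rr_test()))
```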
Re: bizarre network timing problem
the packet trace was a bit too cooked perhaps, but there were indications that at times the TCP window was going to zero - perhaps something with window updates or persist timers? rick jones
Re: follow-up: discrepancy with POSIX
Andi Kleen wrote: On Wed, Sep 19, 2007 at 11:02:00AM -0700, Ulrich Drepper wrote: on UDP/RAW and it's certainly possible to connect() to that. Where do you get this from? And where is this implemented? I don't Sorry it's actually loopback, not broadcast as implemented in Linux. In Linux it's implemented in ip_route_output_slow(). Essentially converted to 127.0.0.1 I think it's traditional BSD behaviour but couldn't find it on a quick look in FreeBSD source (but haven't looked very intensively) One has to set their way-back machine pretty far back to find the *BSD bits which used 0.0.0.0 as the "all nets, all subnets" (to mis-use a term) broadcast IPv4 address when sending. Perhaps as far back as the time before HP-UX 7 or SunOS4. The bit errors in my dimm memory get pretty dense that far back... It has hung-on in various places (stacks) as an "accepted" broadcast IP in the receive path, but not the send path for quite possibly decades now. rick jones
Re: [PATCH] Configurable tap interface MTU
Ed Swierk wrote: This patch makes it possible to change the MTU on a tap interface. Increasing the MTU beyond the 1500-byte default is useful for applications that interoperate with Ethernet devices supporting jumbo frames. The patch caps the MTU somewhat arbitrarily at 16000 bytes. This is slightly lower than the value used by the e1000 driver, so it seems like a safe upper limit. FWIW the OFED 1.2 bits take the MTU of IPoIB up to 65520 bytes :) rick jones
Re: RFC: issues concerning the next NAPI interface
Just to be clear, in the previous email I posted on this thread, I described a worst-case network ping-pong test case (send a packet, wait for reply), and found out that a deferred interrupt scheme just damaged the performance of the test case. Since the folks who came up with the test case were adamant, I turned off the deferred interrupts. While deferred interrupts are an "obvious" solution, I decided that they weren't a good solution. (And I have no other solution to offer). Sounds exactly like the default netperf TCP_RR test and any number of other benchmarks. The "send a request, wait for reply, send next request, etc etc etc" is a rather common application behaviour after all. rick jones
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
Andi Kleen wrote: TSO is beneficial for the software again. The linux code currently takes several locks and does quite a few function calls for each packet and using larger packets lowers this overhead. At least with 10GbE saving CPU cycles is still quite important.

Some quick netperf TCP_RR tests between a pair of dual-core rx6600's running 2.6.23-rc3. The NICs are dual-port e1000's connected back-to-back with the interrupt throttle disabled. I like using TCP_RR to tickle path-length questions because it rarely runs into bandwidth limitations regardless of the link-type. First, with TSO enabled on both sides, then with it disabled, netperf/netserver bound to the same CPU as takes interrupts, which is the "best" place to be for a TCP_RR test (although not always for a TCP_STREAM test...):

:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 (192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.3%
!!!                       Local CPU util  : 39.3%
!!!                       Remote CPU util : 40.6%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   18611.32 20.96  22.35  22.522  24.017
16384  87380

:~# ethtool -K eth2 tso off
e1000: eth2: e1000_set_tso: TSO is Disabled
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 (192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.4%
!!!                       Local CPU util  : 21.0%
!!!                       Remote CPU util : 25.2%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   19812.51 17.81  17.19  17.983  17.358
16384  87380

While the confidence intervals for CPU util weren't hit, I suspect the differences in service demand were still real. On throughput we are talking about +/- 0.2%, for CPU util we are talking about +/- 20% (percent not percentage points) in the first test and 12.5% in the second. So, in broad handwaving terms, TSO increased the per-transaction service demand by something along the lines of (23.27 - 17.67)/17.67 or ~30% and the transaction rate decreased by ~6%.

rick jones
bitrate blindness is a constant concern
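For anyone following along, the handwaving is just the average of the local and remote per-transaction service demands from the two runs:

```python
# Average per-transaction service demand (us/Tr) across local and remote:
tso_on  = (22.522 + 24.017) / 2   # first run, TSO enabled  -> ~23.27
tso_off = (17.983 + 17.358) / 2   # second run, TSO disabled -> ~17.67

sdem_increase = (tso_on - tso_off) / tso_off
rate_decrease = (19812.51 - 18611.32) / 19812.51

print(f"service demand up {sdem_increase:.0%}, "
      f"transaction rate down {rate_decrease:.1%}")
# -> service demand up 32%, transaction rate down 6.1%
```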
Re: [PATCH 1/4] Add ETHTOOL_[GS]FLAGS sub-ioctls
David Miller wrote: From: Ben Greear <[EMAIL PROTECTED]> Date: Fri, 10 Aug 2007 15:40:02 -0700 For GSO on output, is there a generic fallback for any driver that does not specifically implement GSO? Absolutely, in fact that's mainly what it's there for. I don't think there is any issue. The knob is there via ethtool for people who really want to disable it. Just to be paranoid (who me?) we are then at a point where what happened a couple months ago with forwarding between 10G and IPoIB won't happen again - where things failed because a 10G NIC had LRO enabled by default? rick jones
Re: Driver writer hints (was [PATCH 3/4] Add ETHTOOL_[GS]PFLAGS sub-ioctls)
If we are getting (retrieving) flags: 3) Userland issues ETHTOOL_GPFLAGS, to obtain a 32-bit bitmap 4) Userland prints out a tag returned from ETHTOOL_GSTRINGS for each bit set to one in the bitmap. If a bit is set, but there is no string to describe it, that bit is ignored. (i.e. a list of 5 strings is returned, but bit 24 is set) Is that to enable "hidden" bits? If not I'd think that emitting some sort of "UNKNOWN_FLAG" might help flush-out little oopses like forgetting a string. rick jones
Re: all syscalls initially taking 4usec on a P4? Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?
I measure a huge slope, however. Starting at 1usec for back-to-back system calls, it rises to 2usec after interleaving calls with a count to 20 million. 4usec is hit after 110 million. The graph, with semi-scientific error-bars is on http://ds9a.nl/tmp/recvfrom-usec-vs-wait.png The code to generate it is on: http://ds9a.nl/tmp/recvtimings.c I'm investigating this further for other system calls. It might be that my measurements are off, but it appears even a slight delay between calls incurs a large penalty. The slope appears to be flattening-out the farther out to the right it goes. Perhaps that is the length of time it takes to take all the requisite cache misses. Some judicious use of HW perf counters might be in order via say papi or pfmon. Otherwise, you could try a test where you don't delay, but do try to blow-out the cache(s) between recvfrom() calls. If the delay there starts to match the delay as you go out to the right on the graph it would suggest that it is indeed cache effects. rick jones
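The blow-out-the-cache experiment can be roughed out in a few lines - thrash a large buffer between calls and see whether the per-call time heads toward the right-hand side of the graph. A crude sketch (buffer size, stride, and sample counts are arbitrary, and Python adds its own overhead, so treat the absolute numbers with suspicion):

```python
import socket
import time

def avg_recvfrom_usec(disturb, samples):
    """Average cost of a nonblocking recvfrom() on an idle UDP socket,
    with an arbitrary 'disturb' callable run between calls."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setblocking(False)
    total = 0.0
    for _ in range(samples):
        disturb()
        t0 = time.perf_counter()
        try:
            s.recvfrom(1)
        except BlockingIOError:
            pass  # nothing to read - we only want the syscall cost
        total += time.perf_counter() - t0
    s.close()
    return total / samples * 1e6

junk = bytearray(1 << 20)  # 1 MB, bigger than typical L1/L2

def blow_cache():
    # Touch one byte per cacheline-ish stride to evict warm data
    for i in range(0, len(junk), 64):
        junk[i] = (junk[i] + 1) & 0xFF

back_to_back = avg_recvfrom_usec(lambda: None, 2000)
cache_cold = avg_recvfrom_usec(blow_cache, 50)
print(f"{back_to_back:.2f} usec back-to-back, {cache_cold:.2f} usec cache-cold")
```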
Re: Network drivers that don't suspend on interface down
There are two different problems: 1) Behavior seems to be different depending on device driver author. We should document the expected semantics better. IMHO: When device is down, it should: a) use as few resources as possible: - not grab memory for buffers - not assign IRQ unless it could get one - turn off all power consumption possible b) allow setting parameters like speed/duplex/autonegotiation, ring buffers, ... with ethtool, and remember the state c) not accept data coming in, and drop packets queued What implications does c have for something like tcpdump? rick jones
Re: [patch 2.6.11] bonding: avoid tx balance for IGMP (alb/tlb mode)
Is that switch behaviour "normal" or "correct?" I know next to nothing about what stuff like LACP should do, but asked some internal folks and they had this to say: treats IGMP packets the same as all other non-broadcast traffic (i.e. it will attempt to load balance). This switch behavior seems rather odd in an aggregated case, given the fact that most traffic (except broadcast packets) will be load balanced by the partner device. In addition, the switch (in theory) is supposed to treat the aggregated switch ports as 1 logical port and therefore it should allow IGMP packets to be received back on any port in the logical aggregation. IMO, the switch behavior in this case seems questionable. FWIW, rick jones
Re: Very high bandwith packet based interface and performance problems
Alan Cox wrote:
> > > TCP _requires_ the remote end ack every 2nd frame regardless of progress.
> >
> > um, I thought the spec says that ACK every 2nd segment is a SHOULD not a
> > MUST?
>
> Yes its a SHOULD in RFC1122, but in any normal environment pretty much a
> must and I know of no stack significantly violating it.

I didn't know there was such a thing as a normal environment :)

> RFC1122 also requires that your protocol stack SHOULD be able to leap tall
> buildings at a single bound of course...

And, of course my protocol stack does :) It is also a floor wax, AND a dessert topping!-)

rick jones
-- ftp://ftp.cup.hp.com/dist/networking/misc/rachel/ these opinions are mine, all mine; HP might not want them anyway... :) feel free to email, OR post, but please do NOT do BOTH... my email address is raj in the cup.hp.com domain...
Re: Very high bandwith packet based interface and performance problems
Alan Cox wrote:
> > that because the kernel was getting 99% of the cpu, the application was
> > getting very little, and thus the read wasn't happening fast enough, and
>
> Seems reasonable
>
> > This is NOT what I'm seeing at all.. the kernel load appears to be
> > pegged at 100% (or very close to it), the user space app is getting
> > enough cpu time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > appears to be ACKING ALL the traffic, which I don't understand at all
> > (e.g. the transmitter is simply blasting 300MBit of tcp unrestricted)
>
> TCP _requires_ the remote end ack every 2nd frame regardless of progress.

um, I thought the spec says that ACK every 2nd segment is a SHOULD not a MUST?

rick jones
Re: Very high bandwith packet based interface and performance problems
> > > This is NOT what I'm seeing at all.. the kernel load appears to be
> > > pegged at 100% (or very close to it), the user space app is getting
> > > enough cpu time to read out about 10-20Mbit, and FURTHERMORE the kernel
> > > appears to be ACKING ALL the traffic, which I don't understand at all
> > > (e.g. the transmitter is simply blasting 300MBit of tcp unrestricted)
> >
> > TCP _requires_ the remote end ack every 2nd frame regardless of progress.
>
> YIPES. I didn't realize this was the case.. how is end-to-end application
> flow control handled when the bottle neck is user space bound and not b/w
> bound? e.g. if i write a test app that does a

If the app is not reading from the socket buffer, the receiving TCP is supposed to stop sending window-updates, and the sender is supposed to stop sending data when it runs-out of window. If TCP ACK's data, it really should (must?) not then later drop it on the floor without aborting the connection. If a TCP is ACKing data and then that data is dropped before it is given to the application, and the connection is not being reset, that is probably a bug. A TCP _is_ free to drop data prior to sending an ACK - it simply drops it and does not ACK it.

rick jones
Re: MTU and 2.4.x kernel
the TCP code should be "honouring" the link-local MTU in its selection of MSS. rick jones
Re: MTU and 2.4.x kernel
> Default of 536 is sadistic (and apparently will be changed eventually
> to stop tears of poor people whose providers not only supply them
> with bogus mtu values sort of 552 or even 296, but also jailed them
> to some proxy or masquerading domain), but it is still right: IP
> with mtu lower than 576 is not fully functional.

I thought that the specs said that 576 was the "minimum maximum" reassemblable IP datagram size and not a minimum MTU.

rick jones
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todowith ECN)
> > As time marches on, the orders of magnitude of the constants may change,
> > but basic concepts still remain, and the "lessons" learned in the past
> > by one generation tend to get relearned in the next :) for example -
> > there is no such a thing as a free lunch... :)
>
> ;->
> BTW, i am reading one of your papers (circa 1993 ;->, "we go fast with a
> little help from your apps") in which you make an interesting
> observation. That (figure 2) there is "a considerable increase in
> efficiency but not a considerable increase in throughput" I "scanned"
> to the end of the paper and dont see an explanation.

That would be the copyavoidance paper using the very old G30 with the HP-PB (sometimes called PeanutButter) bus :) (http://ftp.cup.hp.com/dist/networking/briefs/) No, back then we were not going to describe the dirty laundry of the G30 hardware :) The limiter appears to have been the bus converter from the SGC (?) main bus of the Novas (8x7,F,G,H,I) to the HP-PB bus. The chip was (appropriately enough) codenamed "BOA" and it was a constrictor :) I never had a chance to carry-out the tests on an older 852 system - those have slower CPU's, but HP-PB was _the_ bus in the system. Prototypes leading to the HP-PB FDDI card achieved 10 MB/s on an 832 system using UDP - this was back in the 1988-1989 timeframe iirc.

> I've made a somehow similar observation with the current zc patches and
> infact observed that throughput goes down with the linux zc patches.
> [This is being contested but no-one else is testing at gigE, so my word is
> the only truth].
> Of course your paper doesnt talk about sendfile rather the page pinning +
> COW tricks (which are considered taboo in Linux) but i do sense a
> relationship.
Well, the HP-PB FDDI card did follow buffer chains rather well, and there was no mapping overhead on a Nova - it was a non-coherent I/O subsystem and DMA was done exclusively with physical addresses (and requisite pre-DMA flushes on outbound, and purges on inbound - another reason why copy-avoidance was such a win overheadwise). Also, there was no throughput drop when going to copyavoidance in that stuff. So, I'd say that while some things might "feel" similar, it does not go much deeper than that.

rick

> PS:- I dont have "my" machines yet and i have a feeling it will be a while
> before i re-run the tests; however, i have created a patch for
> linux-sendfile with netperf. Please take a look at it at:
> http://www.cyberus.ca/~hadi/patch-nperf-sfile-linux.gz
> tell me if is missing anything and if it is ok, could you please merge in
> your tree?

I will take a look.
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing todowith ECN)
> > How does ZC/SG change the nature of the packets presented to the NIC?
>
> what do you mean? I am _sure_ you know how SG/ZC work. So i am suspecting
> more than socratic view on life here. Could be influence from Aristotle;->

Well, I don't know the specifics of Linux, but I gather from what I've read on the list thus far, that prior to implementing SG support, Linux NIC drivers would copy packets into single contiguous buffers that were then sent to the NIC yes? If so, the implication is with SG going, that copy no longer takes place, and so a chain of buffers is given to the NIC. Also, if one is fully ZC :) pesky things like protocol headers can naturally end-up in separate buffers. So, now you have to ask how well any given NIC follows chains of buffers. At what number of buffers is the overhead in the NIC of following the chains enough to keep it from achieving link-rate? One way to try and deduce that would be to meld some of the SG and preSG behaviours and copy packets into varying numbers of buffers per packet and measure the resulting impact on throughput through the NIC.

rick jones

As time marches on, the orders of magnitude of the constants may change, but basic concepts still remain, and the "lessons" learned in the past by one generation tend to get relearned in the next :) for example - there is no such a thing as a free lunch... :)
Re: Still not sexy! (Re: sendfile+zerocopy: fairly sexy (nothing to dowith ECN)
> ** I reported that there was also an oddity in throughput values,
> unfortunately since no one (other than me) seems to have access
> to a gige cards in the ZC list, nobody can confirm or disprove
> what i posted. Here again as a reminder:
>
> Kernel     | tput     | sender-CPU | receiver-CPU |
> ---------------------------------------------------
> 2.4.0-pre3 | 99MB/s   | 87%        | 23%          |
> NSF        |          |            |              |
> ---------------------------------------------------
> 2.4.0-pre3 | 86MB/s   | 100%       | 17%          |
> SF         |          |            |              |
> ---------------------------------------------------
> 2.4.0-pre3 | 66.2MB/s | 60%        | 11%          |
> +ZC        |          |            |              |
> ---------------------------------------------------
> 2.4.0-pre3 | 68MB/s   | 8%         | 8%           |
> +ZC SF     |          |            |              |
> ---------------------------------------------------
>
> Just ignore the CPU readings, focus on throughput. And could someone please
> post results?

In the spirit of the socratic method :) Is your gige card based on Alteon? How does ZC/SG change the nature of the packets presented to the NIC? How well does the NIC do with that changed nature?

rick jones
sometimes, performance tuning is like squeezing a balloon. one part gets smaller, but then you start to see the rest of the balloon...
Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN)
> I'll give this a shot later. Can you try with the sendfiled-ttcp? > http://www.cyberus.ca/~hadi/ttcp-sf.tar.gz I guess I need to "leverage" some bits for netperf :) WRT getting data with links that cannot saturate a system, having something akin to the netperf service demand measure can help. Nothing terribly fancy - simply a conversion of the CPU utilization and throughput to microseconds of CPU to transfer a KB of data. As for CKO and avoiding copies and such, if past experience is any guide (ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.ps) you get a very nice synergistic effect once the last "access" of data is removed. CKO gets you say 10%, avoiding the copy gets you say 10%, but doing both at the same time gets you 30%. rick jones http://www.netperf.org/
Re: hotmail not dealing with ECN
> As David pointed out, it is "reserved for future use - you must set > these bits to zero and not use it _for your own purposes_. For non-rfc > use of these bits _will_ break something the day we start using them > for something useful. > > So, no reason for a firewall author to check these bits. I thought that most firewalls were supposed to be insanely paranoid. Perhaps it would be considered a possible covert data channel, as far-fetched as that may sound. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
[EMAIL PROTECTED] wrote: > > Hello! > > > is there really > > much value in the second request flowing to the server before the first > > byte of the reply has hit? > > Yes, of course, it has lots of sense: f.e. all the icons, referenced > parent page are batched to single well-coalesced stream without rtt delays > between them. It is the only sense of pipelining yet. "Elsewhere" i see references stating that the typical RTT for the great unwashed masses is somewhere in the range of 100 to 200 milliseconds. The linux standalone ACK timer is 200 milliseconds yes? If the web server is going to take longer than 200 milliseconds to generate the first byte of the reply to the first request it seems that the bottleneck here is the web server, not the link RTT. Now, if the server _is_ able to respond with the first bytes (ignoring CORK for the moment) sooner than the standalone ACK timer, then perhaps the RTT is an issue. However, as we were in the constrained case of only two requests, I suspect that it is not a big deal. If there are all those icons to be displayed, there would be more than two requests. Without the explicit (cork et al)/implicit (tcp_nodelay) push at the client those 2-N requests will pile-up into a nice sized TCP segment. Those requests will arrive en masse at the server and will then have RTT issues amortized. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
> Look: http-1.1, asynchronous one, the first request is sent, but not acked. > Time to send the second one, but it is blocked by Nagle. If there is no > third request, the pipe stalls. Seems, this situation will be usual, > when http-1.1 will start to be used by clients, because of dependencies > between replys (references) frequently move it to http-1.0 synchronous > mode, but with some data in flight. See? The stall takes place if and only if the web server takes longer than the standalone ACK timer to generate the first bytes of reply. Once the first bytes of the reply hit the client, the client's second request will flow. If the web server takes longer than the standalone ACK timer to generate the first bytes of the reply, there is no particular value in the second request having arrived anyway - it will simply sit queued in the server's stack rather than the client's stack. You could argue that the server could start serving the second request, but it still has to hold the reply and keep it queued until the first reply is complete, and I suspect there is little value in working for that much parallelism here. Better to have as much queuing in your most distributed resource - the clients. Further, even ignoring the issue of standalone acks, is there really much value in the second request flowing to the server before the first byte of the reply has hit? I would think that the parallelism in the server is going to be among all the different sources of request, not from within a given source of requests. 
Also, if the browser is indeed going to do pipelined requests, and getting the requests to the server as quickly as possible was indeed required because the requests could be started in parallel (just how likely that is I have no idea) i would have thought that it would (could) want to go through the page, gather-up all the URLs from the given server, and then dump all those requests into the connection at once (modulo various folks dislike of sendmsg and writev :). We are in this instance at least talking about purpose coded software anyhow and not a random CGI dribbler. In that sense, the "logically associated data" are all the server's URL's from that page. Yes, this paragraph is in slight contradiction with my statement above about keeping things queued in the client :) rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
dean gaudet wrote: > > On Wed, 17 Jan 2001, Rick Jones wrote: > > > > actually the problem isn't nagle... nagle needs to be turned off for > > > efficient servers anyhow. > > > > i'm not sure I follow that. could you expand on that a bit? > > the problem which caused us to disable nagle in apache is documented in > this paper <http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html>. mind you > i should personally revisit the paper after all these years so that i can > reconsider its implications in the context of pipelining and webmux. ah yes, that - where the web server even for just static content was providing the replies in more than one send. i would not consider that to have been an "efficient" server. i'm not sure that I agree with their statement that piggy-backing is rarely successful in request/response situations. the business about the last 1100ish bytes of a 4096 byte send being delayed by nagle only implies that the stack's implementation of nagle was broken and interpreting it on a per-segment rather than a per-send basis. if the app sends 4096 bytes, then there should be no nagle-induced delays on a connection with an MSS of 4096 or less. it would seem that in the context of that paper at least, most if not all of the problems were the result of bugs - either in the webserver software, or the host TCP stack. otherwise, the persistent connections would have worked just fine. > i'm not aware yet of any study in the field. and i'm out of touch enough > with the clients that i don't know if new netscape or IE have finally > begun to use pipelining (they hadn't as of 1998). someone else sent a private email implying that no browsers were yet doing pipelining. rick
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Olivier Galibert wrote: > > On Thu, Jan 18, 2001 at 10:04:28PM +0100, Andrea Arcangeli wrote: > > NAGLE algorithm is only one, CORK algorithm is another different algorithm. So > > probably it would be not appropriate to mix CORK and NAGLE under the name > > "CONTROL_NAGLING", but certainly I agree they could stay together under another > > name ;). > > TCP_FLOW_CONTROL ? then folks would think you were controlling the congestion or "classic" windows. what all these things do is affect segmentation, so perhaps TCP_SEGMENT_CONTROL or something to that effect, if anything. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
[EMAIL PROTECTED] wrote: > > Hello! > > > So if I understand all this correctly... > > > > The difference in ACK generation > > CORK does not affect receive direction and, hence, ACK generation. I was asking how the semantics of cork interacted with piggybacking ACK's on data flowing the other way. Was I wrong in assuming that the Linux TCP piggybacks ACKs? rick
Re: Is sendfile all that sexy?
> device-to-device is not the same as disk-to-disk. A better example would > be a streaming file server. Slowly the pci bus becomes a bottleneck, why > would you want to move the data twice over the pci bus if once is enough > and the data very likely not needed afterwards? Sure you can use a more > expensive 64bit/60MHz bus, but why should you if the 32bit/30MHz bus is > theoretically fast enough for your application? theoretically fast enough for the application would imply the dual transfers across the bus would fit :) also, if a system was doing something with that much throughput, i suspect it would not only be designed with 64/66 busses (or better), but also have things on several different busses. that makes device to device life more of a challenge. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Linus Torvalds wrote: > Remember the UNIX philosophy: everything is a file. ...and a file is simply a stream of bytes (iirc?) rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Ingo Molnar wrote: > > On Wed, 17 Jan 2001, Rick Jones wrote: > > > i'd heard interesting generalities but no specifics. for instance, > > when the send is small, does TCP wait exclusively for the app to > > flush, or is there an "if all else fails" sort of timer running? > > yes there is a per-socket timer for this. According to RFC 1122 a TCP > stack 'MUST NOT' buffer app-sent TCP data indefinitely if the PSH bit > cannot be explicitly set by a SEND operation. Was this a trick question? > :-) Nope, not a trick question. The nagle heuristic means that small sends will not wait indefinitely since sending the first small bit of data starts the retransmission timer as a course of normal processing. So, I am not in the habit of thinking about a "clear the buffer" timer being set when a small send takes place but no transmit happens. rick jones btw, as I'm currently on linux-kernel, no need to cc me :)
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Andi Kleen wrote: > > On Wed, Jan 17, 2001 at 02:17:36PM -0800, Rick Jones wrote: > > How does CORKing interact with ACK generation? In particular how it > > might interact with (or rather possibly induce) standalone ACKs? > > It doesn't change the ACK generation. If your cork'ed packets gets sent > before the delayed ack triggers it is piggy backed, if not it is send > individually. When the delayed ack triggers depends; Linux has dynamic > delack based on the rtt and also a special quickack mode to speed up slow > start. So if I understand all this correctly... The difference in ACK generation would be that with nagle it is a race between the standalone ack heuristic and the first byte of response data, with cork, the race is between the standalone ack heuristic and the last byte of response data and an uncork call, or the MSSth byte whichever comes first. If the response bytes are dribbling slowly into the socket, where slowly is less than the bandwidth delay product of the connection, cork can result in quite a few fewer packets than nagle would. It would perhaps though have one more standalone ACK than nagle. If the response bytes are dribbling quickly into the socket, where quickly is greater than the bandwidth delay product of the connection, cork will produce one less packet than nagle. If the response bytes go into the socket together, cork and nagle will produce the same number of packets. rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Linus Torvalds wrote: > > On Wed, 17 Jan 2001, Rick Jones wrote: > > > > > The fact that I understand _why_ it is done that way doesn't mean that I > > > don't think it's a hack. It doesn't allow you to sendfile multiple files > > > etc without having nagle boundaries, and the header/trailer stuff really > > > isn't a generic solution. > > > > Hmm, I would think that nagle would only come into play if those files > > were each less than MSS and there were no intervening application level > > reply/request messages for each. > > It's not the file itself - it's the headers and trailers. OK, the sum of the header/trailer/file when one calls an HP-UX-style sendfile(). All that does is make it more likely that one will have sends larger than the MSS. > - the packet boundary between the header and the file you're sending. > > Normally, if you do a separate data "send()" for the header before > actually using sendfile(), the header would be sent out as one packet, > while the actual file contents would then get coalesced into MSS-sized > packets. > > This is why people originally did writev() and sendmsg() - to allow > people to do scatter-gather without having multiple packets on the > wire, and letting the OS choose the best packet boundaries, of course. I prefer to describe it as "presenting logically associated data to the transport at one time" but that's just wordsmithing. > So the Linux approach (and, obviously, in my opinion the only right > approach) is basically to > > (a) make sure that system call latency is low enough that there really > aren't any major reasons to avoid system calls. They're just function > calls - they may be a bit heavier than most functions, of course, but > people shouldn't need to avoid them like the plague like on some > systems. i'm not quite sure how it plays here, but someone once told me that the most efficient procedure call was the one that was never made :) > and > > (b) TCP_CORK. 
> > Now, TCP_CORK is basically me telling David Miller that I refuse to play > games to have good packet size distribution, and that I wanted a way for > the application to just tell the OS: I want big packets, please wait until > you get enough data from me that you can make big packets. > > Basically, TCP_CORK is a kind of "anti-nagle" flag. It's the reverse of > "no-nagle". So you'd "cork" the TCP connection when you know you are going > to do bulk transfers, and when you're done with the bulk transfer you just > "uncork" it. At which point the normal rules take effect (ie normally > "send out any partial packets if you have no packets in flight"). How "bulk" is a bulk transfer in your thinking? By the time the transfer gets above something like 100*MSS I would think that the first small packet would become epsilon. How does CORKing interact with ACK generation? In particular how it might interact with (or rather possibly induce) standalone ACKs? > This is a _much_ better interface than having to play games with > scatter-gather lists etc. You could basically just do > > int optval = 1; > > setsockopt(sk, SOL_TCP, TCP_CORK, &optval, sizeof(int)); > write(sk, ..); > write(sk, ..); > write(sk, ..); > sendfile(sk, ..); > write(..) > printf(...); > ...any kind of output.. > > optval = 0; > setsockopt(sk, SOL_TCP, TCP_CORK, &optval, sizeof(int)); > > and notice how you don't need to worry about _how_ you output the data any > more. It will automatically generate the best packet sizes - waiting for > disk if necessary etc. > > With TCP_CORK, you can obviously and trivially emulate the HP-UX behaviour > if you want to. But you can just do _soo_ much more. > > Imagine, for example, keep-alive http connections. Where you might be > doing multiple sendfile()'s of small files over the same connection, one > after the other. 
With Linux and TCP_CORK, what you can basically do is to > just cork the connection at the beginning, and then let it stay corked for > as long as you don't have any outstanding requests - ie you uncork only > when you don't have anything pending any more. so after i present each reply, i'm checking to see if there is another request and if there is not i have to uncork to get the residual data to flow. > (The reason you want to uncork at all, is to obviously let the partial > packets out when you don't know if you'll write anything more in the near
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
> > Hmm, I would think that nagle would only come into play if those files > > were each less than MSS and there were no intervening application level > > reply/request messages for each. > > actually the problem isn't nagle... nagle needs to be turned off for > efficient servers anyhow. i'm not sure I follow that. could you expand on that a bit? > but once it's turned off, the standard socket > API requires (or rather allows) the kernel to flush packets to the wire > after each system call. most definitely allows, not requires. > consider the case where you're responding to a pair of pipelined HTTP/1.1 > requests. with the HPUX and BSD sendfile() APIs you end up forcing a > packet boundary between the two responses. this is likely to result in > one small packet on the wire after each response. i _possibly_ have a packet boundary. if the last small bit of the first file is handed to the transport when there is sufficient classic and congestion window to send it or that window "arrives" before the first chunk of the second file is sent. on the topic of pipelining - do the pipelined requests tend to be sent or arrive together? > with the linux TCP_CORK API you only get one trailing small packet. in > case you haven't heard of TCP_CORK -- when the cork is set, the kernel is > free to send any maximum size packets it can form, but has to hold on to > the stragglers until userland gives it more data or pops the cork. i'd heard interesting generalities but no specifics. for instance, when the send is small, does TCP wait exclusively for the app to flush, or is there an "if all else fails" sort of timer running? > (the heuristic i use in apache to decide if i need to flush responses in a > pipeline is to look if there are any more requests to read first, and if > there are none then i flush before blocking waiting for new requests.) how often do you find yourself flushing the little bits anyhow? 
> > As for the header/trailer stuff, you're right, I should have spec'd a > > separate iovec for each :) > > well, if you've got low system call overhead (such as linux ;), and you > add TCP_CORK ... then you don't even need to combine all those system > calls into one monster syscall. how low is the system call overhead to check for the next request before you flush? (i'm not sure that I'd say HP-UX sendfile() was a combination of system calls - i'd probably say it was a (partial) replacement for writev()) rick jones
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
> The fact that I understand _why_ it is done that way doesn't mean that I > don't think it's a hack. It doesn't allow you to sendfile multiple files > etc without having nagle boundaries, and the header/trailer stuff really > isn't a generic solution. Hmm, I would think that nagle would only come into play if those files were each less than MSS and there were no intervening application level reply/request messages for each. So, perhaps rcp, but not FTP nor HTTP. I'm not sure where the break-even point versus send() is on other OSes, but it seems to be in the neighborhood of the typical ethernet MSS on HP-UX. As for the header/trailer stuff, you're right, I should have spec'd a separate iovec for each :) > Also note how I said that it is the BSD people I _despise_. Not The HP-UX > implementation. That misunderstanding would be the result of my entering the conversation in the middle... > The HP-UX one is not pretty, but it works. But I hold open > source people to higher standards. They are supposed to be the people who > do programming because it's an art-form, not because it's their job. I'm not sure, but I think I've just been insulted !-) (in case it is not clear, that is meant as a joke...) rick jones
[Fwd: Is sendfile all that sexy? (fwd)]
> : >Agreed -- the hard-coded Nagle algorithm makes no sense these days. > : > : The fact I dislike about the HP-UX implementation is that it is so > : _obviously_ stupid. > : > : And I have to say that I absolutely despise the BSD people. They did > : sendfile() after both Linux and HP-UX had done it, and they must have > : known about both implementations. And they chose the HP-UX braindamage, > : and even brag about the fact that they were stupid and didn't understand > : TCP_CORK (they don't say so in those exact words, of course - they just > : show that they were stupid and clueless by the things they brag about). > : > : Oh, well. Not everybody can be as goodlooking as me. It's a curse. nor it would seem, as humble :) Hello Linus, my name is Rick Jones. I am the person at Hewlett-Packard who drafted the "so _obviously_ stupid" sendfile() interface of HP-UX. Some of your critique (quoted above) found its way to my inbox and I thought I would introduce myself to you to give you an opportunity to expand a bit on your criticism. In return, if you like, I would be more than happy to describe a bit of the history of sendfile() on HP-UX. Perhaps (though I cannot say with any certainty) it will help explain why HP-UX sendfile() is spec'd the way it is. rick jones never forget what leads to the downfall of the protagonist in Greek tragedy...