Re: svn commit: r212803 - head/sys/netinet
On 23.10.2010 15:10, Bjoern A. Zeeb wrote:
> On Fri, 17 Sep 2010, Andre Oppermann wrote:
>
>> Author: andre
>> Date: Fri Sep 17 22:05:27 2010
>> New Revision: 212803
>> URL: http://svn.freebsd.org/changeset/base/212803
>>
>> Log:
>>   Rearrange the TSO code to make it more readable and to clearly
>>   separate the decision logic, of whether we can do TSO, and the
>>   calculation of the burst length into two distinct parts.
>>
>>   Change the way the TSO burst length calculation is done. While TSO
>>   could do bursts of 65535 bytes, that can't be represented in ip_len
>>   together with the IP and TCP header. Account for that and use
>>   IP_MAXPACKET instead of TCP_MAXWIN as base constant (both have the
>>   same value of 64K). When more data is available, prevent less than
>>   MSS sized segments from being sent during the current TSO burst.
>>
>>   Add two more KASSERTs to ensure the integrity of the packets.
>>
>>   Tested by:	Ben Wilber ben-at-desync com
>>   MFC after:	10 days
>
> As this hasn't happened yet, please do not do it. It breaks things.
> I'll follow up later as soon as I have more details.

I was busied out after the EuroBSDCon DevSummit and didn't have time
to MFC. Incidentally I was planning on doing it today, but will hold off
based on your request.

The version currently in 8 certainly has a bug. For the one in head
yours is the first report. Others reported all their issues to be fixed
with this patch.

Can you give a high level description of the problem you are seeing?
A detailed description is not required to take a first look at whatever
issue you may have.

-- 
Andre

>> Modified:
>>   head/sys/netinet/tcp_output.c
>>
>> Modified: head/sys/netinet/tcp_output.c
>> ==============================================================================
>> --- head/sys/netinet/tcp_output.c	Fri Sep 17 21:53:56 2010	(r212802)
>> +++ head/sys/netinet/tcp_output.c	Fri Sep 17 22:05:27 2010	(r212803)
>> @@ -465,9 +465,8 @@ after_sack_rexmit:
>>  	}
>>  
>>  	/*
>> -	 * Truncate to the maximum segment length or enable TCP Segmentation
>> -	 * Offloading (if supported by hardware) and ensure that FIN is removed
>> -	 * if the length no longer contains the last data byte.
>> +	 * Decide if we can use TCP Segmentation Offloading (if supported by
>> +	 * hardware).
>>  	 *
>>  	 * TSO may only be used if we are in a pure bulk sending state.  The
>>  	 * presence of TCP-MD5, SACK retransmits, SACK advertizements and
>> @@ -475,10 +474,6 @@ after_sack_rexmit:
>>  	 * (except for the sequence number) for all generated packets.  This
>>  	 * makes it impossible to transmit any options which vary per generated
>>  	 * segment or packet.
>> -	 *
>> -	 * The length of TSO bursts is limited to TCP_MAXWIN.  That limit and
>> -	 * removal of FIN (if not already catched here) are handled later after
>> -	 * the exact length of the TCP options are known.
>>  	 */
>>  #ifdef IPSEC
>>  	/*
>> @@ -487,22 +482,15 @@ after_sack_rexmit:
>>  	 */
>>  	ipsec_optlen = ipsec_hdrsiz_tcp(tp);
>>  #endif
>> -	if (len > tp->t_maxseg) {
>> -		if ((tp->t_flags & TF_TSO) && V_tcp_do_tso &&
>> -		    ((tp->t_flags & TF_SIGNATURE) == 0) &&
>> -		    tp->rcv_numsacks == 0 && sack_rxmit == 0 &&
>> -		    tp->t_inpcb->inp_options == NULL &&
>> -		    tp->t_inpcb->in6p_options == NULL
>> +	if ((tp->t_flags & TF_TSO) && V_tcp_do_tso && len > tp->t_maxseg &&
>> +	    ((tp->t_flags & TF_SIGNATURE) == 0) &&
>> +	    tp->rcv_numsacks == 0 && sack_rxmit == 0 &&
>>  #ifdef IPSEC
>> -		    && ipsec_optlen == 0
>> +	    ipsec_optlen == 0 &&
>>  #endif
>> -		    ) {
>> -			tso = 1;
>> -		} else {
>> -			len = tp->t_maxseg;
>> -			sendalot = 1;
>> -		}
>> -	}
>> +	    tp->t_inpcb->inp_options == NULL &&
>> +	    tp->t_inpcb->in6p_options == NULL)
>> +		tso = 1;
>>  
>>  	if (sack_rxmit) {
>>  		if (SEQ_LT(p->rxmit + len, tp->snd_una + so->so_snd.sb_cc))
>> @@ -732,28 +720,53 @@ send:
>>  		 * bump the packet length beyond the t_maxopd length.
>>  		 * Clear the FIN bit because we cut off the tail of
>>  		 * the segment.
>> -		 *
>> -		 * When doing TSO limit a burst to TCP_MAXWIN minus the
>> -		 * IP, TCP and Options length to keep ip->ip_len from
>> -		 * overflowing.  Prevent the last segment from being
>> -		 * fractional thus making them all equal sized and set
>> -		 * the flag to continue sending.  TSO is disabled when
>> -		 * IP options or IPSEC are present.
>>  		 */
>>  		if (len + optlen + ipoptlen > tp->t_maxopd) {
>>  			flags &= ~TH_FIN;
>> +
>>  			if (tso) {
>> -				if (len > TCP_MAXWIN - hdrlen - optlen) {
>> -					len = TCP_MAXWIN - hdrlen - optlen;
>> -					len = len - (len % (tp->t_maxopd - optlen));
>> +				KASSERT(ipoptlen == 0,
>> +				    ("%s: TSO can't do IP options", __func__));
>> +
>> +				/*
>> +				 * Limit a burst to IP_MAXPACKET minus IP,
>> +				 * TCP and options length to keep ip->ip_len
>> +				 * from overflowing.
>> +				 */
>> +				if (len > IP_MAXPACKET - hdrlen) {
>> +					len = IP_MAXPACKET - hdrlen;
>> +					sendalot = 1;
>> +				}
>> +
>> +				/*
>> +				 * Prevent the last segment from being
>> +				 * fractional unless the send sockbuf can
>> +				 * be emptied.
>> +				 */
>> +				if (sendalot && off + len < so->so_snd.sb_cc) {
>> +					len -= len % (tp->t_maxopd - optlen);
>>  					sendalot = 1;
>> -				} else if (tp->t_flags & TF_NEEDFIN)
>> +				}
>> +
>> +				/*
>> +				 * Send the FIN in a separate segment
>> +				 * after the bulk sending is done.
>> +				 * We don't trust the TSO implementations
>> +				 * to clear the FIN flag on all but the
>> +				 * last segment.
>> +				 */
>> +				if (tp->t_flags & TF_NEEDFIN)
>>  					sendalot = 1;
>> +
>>  			} else {
>>  				len = tp->t_maxopd - optlen - ipoptlen;
>>  				sendalot = 1;
>>  			}
>> -		}
>> +		} else
>> +			tso = 0;
>> +
>> +		KASSERT(len
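The two new TSO length rules (clamp to IP_MAXPACKET, then drop the fractional tail while more data is queued) can be modelled outside the kernel. This is a hypothetical user-space sketch, not the tcp_output() code. It assumes `hdrlen` already covers the IP, TCP and option headers, as the new comment describes, and that `seglen` stands for `tp->t_maxopd - optlen`. It also assumes `sendalot` starts cleared and leaves the separate FIN handling out.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as IP_MAXPACKET in <netinet/ip.h>: the largest ip_len. */
#define SKETCH_IP_MAXPACKET	65535L

/*
 * Hypothetical model of the r212803 TSO burst-length rules.
 *   len:    payload bytes we would like to put into this burst
 *   hdrlen: IP + TCP + TCP option header bytes
 *   seglen: payload carried by each generated segment
 *   off:    offset of this burst within the send buffer
 *   sb_cc:  number of bytes in the send buffer
 * Sets *sendalot when more data must follow this burst.
 */
static long
tso_burst_len(long len, long hdrlen, long seglen, long off, long sb_cc,
    bool *sendalot)
{
	assert(seglen > 0 && hdrlen < SKETCH_IP_MAXPACKET);
	*sendalot = false;

	/* Keep payload plus headers within the 16-bit ip_len field. */
	if (len > SKETCH_IP_MAXPACKET - hdrlen) {
		len = SKETCH_IP_MAXPACKET - hdrlen;
		*sendalot = true;
	}

	/*
	 * If more data follows this burst, drop the fractional tail so
	 * every segment generated from the burst is full sized.
	 */
	if (*sendalot && off + len < sb_cc)
		len -= len % seglen;

	return (len);
}
```

Take a 52-byte header (20 IP + 20 TCP + 12 bytes of timestamp option) with 1448-byte segments. A 100000-byte send buffer is first clamped to 65483 bytes. That is then trimmed to 45 full segments, or 65160 bytes. A 3000-byte buffer that can be emptied in one burst is left untouched.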
svn commit: r241686 - in head/sys: net netgraph netgraph/atm/ccatm netgraph/atm/sscfu netgraph/atm/sscop netgraph/atm/uni netinet netinet6 netipsec
Author: andre
Date: Thu Oct 18 13:57:24 2012
New Revision: 241686
URL: http://svn.freebsd.org/changeset/base/241686

Log:
  Mechanically remove the last stray remains of spl* calls from net*/*.
  They have been no-ops for a long time now.

Modified:
  head/sys/net/if.c
  head/sys/net/if_ef.c
  head/sys/net/if_gre.c
  head/sys/net/if_spppsubr.c
  head/sys/net/if_var.h
  head/sys/net/rtsock.c
  head/sys/netgraph/atm/ccatm/ng_ccatm.c
  head/sys/netgraph/atm/sscfu/ng_sscfu.c
  head/sys/netgraph/atm/sscop/ng_sscop.c
  head/sys/netgraph/atm/uni/ng_uni.c
  head/sys/netgraph/ng_eiface.c
  head/sys/netgraph/ng_ether.c
  head/sys/netgraph/ng_fec.c
  head/sys/netgraph/ng_gif.c
  head/sys/netgraph/ng_ksocket.c
  head/sys/netgraph/ng_source.c
  head/sys/netinet/ip_ipsec.c
  head/sys/netinet6/in6.c
  head/sys/netinet6/ip6_ipsec.c
  head/sys/netinet6/nd6.c
  head/sys/netinet6/nd6_nbr.c
  head/sys/netinet6/nd6_rtr.c
  head/sys/netinet6/udp6_usrreq.c
  head/sys/netipsec/key.c

Modified: head/sys/net/if.c
==============================================================================
--- head/sys/net/if.c	Thu Oct 18 13:46:26 2012	(r241685)
+++ head/sys/net/if.c	Thu Oct 18 13:57:24 2012	(r241686)
@@ -691,12 +691,9 @@ static void
 if_attachdomain(void *dummy)
 {
 	struct ifnet *ifp;
-	int s;
 
-	s = splnet();
 	TAILQ_FOREACH(ifp, &V_ifnet, if_link)
 		if_attachdomain1(ifp);
-	splx(s);
 }
 SYSINIT(domainifattach, SI_SUB_PROTO_IFATTACHDOMAIN, SI_ORDER_SECOND,
     if_attachdomain, NULL);
@@ -705,21 +702,15 @@ static void
 if_attachdomain1(struct ifnet *ifp)
 {
 	struct domain *dp;
-	int s;
-
-	s = splnet();
 
 	/*
 	 * Since dp->dom_ifattach calls malloc() with M_WAITOK, we
 	 * cannot lock ifp->if_afdata initialization, entirely.
 	 */
-	if (IF_AFDATA_TRYLOCK(ifp) == 0) {
-		splx(s);
+	if (IF_AFDATA_TRYLOCK(ifp) == 0)
 		return;
-	}
 	if (ifp->if_afdata_initialized >= domain_init_status) {
 		IF_AFDATA_UNLOCK(ifp);
-		splx(s);
 		printf("if_attachdomain called more than once on %s\n",
 		    ifp->if_xname);
 		return;
@@ -734,8 +725,6 @@ if_attachdomain1(struct ifnet *ifp)
 			ifp->if_afdata[dp->dom_family] =
 			    (*dp->dom_ifattach)(ifp);
 	}
-
-	splx(s);
 }
 
 /*
@@ -1825,7 +1814,6 @@ link_rtrequest(int cmd, struct rtentry *
 /*
  * Mark an interface down and notify protocols of
  * the transition.
- * NOTE: must be called at splnet or eqivalent.
  */
 static void
 if_unroute(struct ifnet *ifp, int flag, int fam)
@@ -1849,7 +1837,6 @@ if_unroute(struct ifnet *ifp, int flag,
 /*
  * Mark an interface up and notify protocols of
  * the transition.
- * NOTE: must be called at splnet or eqivalent.
  */
 static void
 if_route(struct ifnet *ifp, int flag, int fam)
@@ -1935,7 +1922,6 @@ do_link_state_change(void *arg, int pend
 /*
  * Mark an interface down and notify protocols of
  * the transition.
- * NOTE: must be called at splnet or eqivalent.
  */
 void
 if_down(struct ifnet *ifp)
@@ -1947,7 +1933,6 @@ if_down(struct ifnet *ifp)
 /*
  * Mark an interface up and notify protocols of
  * the transition.
- * NOTE: must be called at splnet or eqivalent.
  */
 void
 if_up(struct ifnet *ifp)
@@ -2150,14 +2135,10 @@ ifhwioctl(u_long cmd, struct ifnet *ifp,
 			/* Smart drivers twiddle their own routes */
 		} else if (ifp->if_flags & IFF_UP &&
 		    (new_flags & IFF_UP) == 0) {
-			int s = splimp();
 			if_down(ifp);
-			splx(s);
 		} else if (new_flags & IFF_UP &&
 		    (ifp->if_flags & IFF_UP) == 0) {
-			int s = splimp();
 			if_up(ifp);
-			splx(s);
 		}
 		/* See if permanently promiscuous mode bit is about to flip */
 		if ((ifp->if_flags ^ new_flags) & IFF_PPROMISC) {
@@ -2605,11 +2586,8 @@ ifioctl(struct socket *so, u_long cmd, c
 
 	if ((oif_flags ^ ifp->if_flags) & IFF_UP) {
 #ifdef INET6
-		if (ifp->if_flags & IFF_UP) {
-			int s = splimp();
+		if (ifp->if_flags & IFF_UP)
 			in6_if_up(ifp);
-			splx(s);
-		}
 #endif
 	}
 	if_rele(ifp);

Modified: head/sys/net/if_ef.c
==============================================================================
--- head/sys/net/if_ef.c	Thu Oct 18 13:46:26 2012	(r241685)
+++ head/sys/net/if_ef.c	Thu Oct 18 13:57:24 2012	(r241686)
@@ -151,14 +151,10 @@ static int
 ef_detach(struct efnet *sc)
 {
 	struct ifnet *ifp = sc->ef_ifp;
-	int s;
-
-	s = splimp();
 
 	ether_ifdetach(ifp);
 	if_free(ifp);
 
-	splx(s);
 	return 0;
 }
 
@@ -172,11
svn commit: r241688 - head/sys/net
Author: andre
Date: Thu Oct 18 14:08:26 2012
New Revision: 241688
URL: http://svn.freebsd.org/changeset/base/241688

Log:
  Use LOG_WARNING level in in_attachdomain1() instead of printf().

  Submitted by:	vijju.singh-at-gmail.com

Modified:
  head/sys/net/if.c

Modified: head/sys/net/if.c
==============================================================================
--- head/sys/net/if.c	Thu Oct 18 13:57:28 2012	(r241687)
+++ head/sys/net/if.c	Thu Oct 18 14:08:26 2012	(r241688)
@@ -711,8 +711,8 @@ if_attachdomain1(struct ifnet *ifp)
 		return;
 	if (ifp->if_afdata_initialized >= domain_init_status) {
 		IF_AFDATA_UNLOCK(ifp);
-		printf("if_attachdomain called more than once on %s\n",
-		    ifp->if_xname);
+		log(LOG_WARNING, "if_attachdomain called more than once "
+		    "on %s\n", ifp->if_xname);
 		return;
 	}
 	ifp->if_afdata_initialized = domain_init_status;
_______________________________________________
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r241703 - head/sys/kern
On 18.10.2012 22:22, Andre Oppermann wrote:
> Author: andre
> Date: Thu Oct 18 20:22:17 2012
> New Revision: 241703
> URL: http://svn.freebsd.org/changeset/base/241703
>
> Log:
>   Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS within
>   zero copy specialized sosend_copyin() helper function.

Note that I'm not saying zero copy should be used or is even more
performant than the optimized m_uiotombuf() function.

Actually there may be some real bit-rot to zero copy sockets. I've just
started looking into it.

Note that "zero copy" isn't entirely true either, as it marks the page
as COW. So when the userspace application reuses the memory it is
copied anyway. Also the overhead of doing the VM magic and mbuf
attachment of a VM page isn't free either. To really benefit from it an
application has to be written with COW in mind and not reuse the memory
that was just written to the socket. For non-aware applications it may
be a net performance loss overall.

Also I don't like the name "zero-copy-socket" as it promises too much
for those not into socket, mbuf and VM magic. I'd rather call it
"cow-socket" or something like that, as it describes much better what
is actually happening behind the scenes.

-- 
Andre
svn commit: r241704 - head/sys/kern
Author: andre
Date: Thu Oct 18 21:04:30 2012
New Revision: 241704
URL: http://svn.freebsd.org/changeset/base/241704

Log:
  Remove unnecessary includes from sosend_copyin() and fix
  a couple of style issues.

Modified:
  head/sys/kern/uipc_socket.c

Modified: head/sys/kern/uipc_socket.c
==============================================================================
--- head/sys/kern/uipc_socket.c	Thu Oct 18 20:22:17 2012	(r241703)
+++ head/sys/kern/uipc_socket.c	Thu Oct 18 21:04:30 2012	(r241704)
@@ -860,12 +860,6 @@ struct so_zerocopy_stats{
 	int found_ifp;
 };
 struct so_zerocopy_stats so_zerocp_stats = {0,0,0};
-#include <netinet/in.h>
-#include <net/route.h>
-#include <netinet/in_pcb.h>
-#include <vm/vm.h>
-#include <vm/vm_page.h>
-#include <vm/vm_object.h>
 
 /*
  * sosend_copyin() is only used if zero copy sockets are enabled.  Otherwise
@@ -907,9 +901,9 @@ sosend_copyin(struct uio *uio, struct mb
 			} else
 				m = m_get(M_WAITOK, MT_DATA);
 			if (so_zero_copy_send &&
-			    resid>=PAGE_SIZE &&
-			    *space>=PAGE_SIZE &&
-			    uio->uio_iov->iov_len>=PAGE_SIZE) {
+			    resid >= PAGE_SIZE &&
+			    *space >= PAGE_SIZE &&
+			    uio->uio_iov->iov_len >= PAGE_SIZE) {
 				so_zerocp_stats.size_ok++;
 				so_zerocp_stats.align_ok++;
 				cow_send = socow_setup(m, uio);
@@ -946,7 +940,7 @@ sosend_copyin(struct uio *uio, struct mb
 		if (cow_send)
 			error = 0;
 		else
-			  error = uiomove(mtod(m, void *), (int)len, uio);
+			error = uiomove(mtod(m, void *), (int)len, uio);
 		resid = uio->uio_resid;
 		m->m_len = len;
 		*mp = m;
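The restyled condition is the gate for the COW send path. It applies only when the feature is on and the remaining data, the free socket-buffer space and the current iovec each cover at least one page. The sketch below restates that test as a standalone predicate. It is a hypothetical illustration: the 4 KiB page size is an assumption, and the kernel uses its own `PAGE_SIZE`.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages; the kernel uses its PAGE_SIZE constant. */
#define SKETCH_PAGE_SIZE	4096L

/*
 * Hypothetical restatement of the test sosend_copyin() applies before
 * trying the COW send path: the feature must be enabled and the
 * remaining data, the free socket buffer space and the current iovec
 * must each cover at least one full page.
 */
static bool
cow_send_eligible(bool zero_copy_send, long resid, long space, long iov_len)
{
	assert(resid >= 0 && iov_len >= 0);
	return (zero_copy_send &&
	    resid >= SKETCH_PAGE_SIZE &&
	    space >= SKETCH_PAGE_SIZE &&
	    iov_len >= SKETCH_PAGE_SIZE);
}
```

If any one of the three sizes is below a page, the data falls back to the ordinary `uiomove()` copy.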
Re: svn commit: r241703 - head/sys/kern
On 18.10.2012 23:06, Navdeep Parhar wrote:
> Hello Andre,
>
> A couple of things if you're poking around in this area...

I didn't really mean to dive too deep into COW socket writes.

> On 10/18/12 13:44, Andre Oppermann wrote:
>> On 18.10.2012 22:22, Andre Oppermann wrote:
>>> Author: andre
>>> Date: Thu Oct 18 20:22:17 2012
>>> New Revision: 241703
>>> URL: http://svn.freebsd.org/changeset/base/241703
>>>
>>> Log:
>>>   Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS within
>>>   zero copy specialized sosend_copyin() helper function.
>>
>> Note that I'm not saying zero copy should be used or is even more
>> performant than the optimized m_uiotombuf() function.
>
> Some time back I played around with a modified m_uiotombuf() that was
> aware of the mbuf_jumbo_16K zone (instead of limiting itself to 4K
> mbufs). In some cases it performed better than the stock m_uiotombuf.
> I suspect this change would also help drivers that are unable to deal
> with long gather lists when doing TSO. But my testing wasn't rigorous
> enough (I was merely playing around), and the drivers I work with can
> mostly cope with whatever the kernel throws at them. So nothing came
> out of it.

The jumbo 16K zone is special in that the memory is actually allocated
by contigmalloc to get physically contiguous RAM. After some uptime and
heavy use this may become difficult to obtain. Also contigmalloc has to
hunt for it, which may cause quite a bit of overhead.

4K mbufs, actually PAGE_SIZE mbufs, are very easily obtainable and fast.

To be honest I'm not really happy about >PAGE_SIZE mbufs. They were
introduced at a time when DMA engines were more limited and couldn't do
S/G DMA on receive. So performance with >PAGE_SIZE mbufs may be a little
bit better, but once memory becomes fragmented after some heavy system
usage it sucks up to the point where it fails most of the time.
PAGE_SIZE mbufs always perform the same with very little deviation.

In an ideal scenario I'd like to see 9K and 16K mbufs go away and have
the RX DMA ring stitch a packet up out of PAGE_SIZE mbufs.

>> Actually there may be some real bit-rot to zero copy sockets. I've
>> just started looking into it.
>
> I have a cxgbe(4)-specific true zero-copy implementation. The rx side
> is in head, the tx side works only for blocking sockets (the easy
> case) and I haven't checked it in anywhere. Take a look at
> t4_soreceive_ddp() and m_mbuftouio_ddp() in sys/dev/cxgbe/t4_ddp.c.
> They're mostly identical to the kernel routines they're based on
> (read: copy-pasted from). You may find them of some interest if you're
> working in this area and are thinking of adding zero-copy hooks to the
> socket implementation.

I'm going to have a look at it and think about how to generically
support DDP either way with our socket buffer layout.

Actually that may end up as the golden path. Do away with >PAGE_SIZE
mbufs, sink page flipping COW (incorrectly named ZERO_COPY) and use DDP
for those who need utmost performance (as I said, only COW-aware
applications gain a bit of speed; unaware ones may end up much worse).

-- 
Andre
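Andre's "ideal scenario" has the RX ring assemble a jumbo frame from page-sized buffers instead of one physically contiguous 9K or 16K cluster. The arithmetic behind that is sketched below. It is a hypothetical illustration, not driver code: `struct sketch_seg` and `stitch_frame()` are invented names, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE	4096u

/* Hypothetical scatter/gather entry: one page-sized receive buffer. */
struct sketch_seg {
	size_t	off;	/* offset of this piece within the frame */
	size_t	len;	/* bytes of the frame held in this buffer */
};

/*
 * Describe how a received frame of frame_len bytes would be spread over
 * PAGE_SIZE buffers instead of one physically contiguous 9K/16K cluster.
 * Returns the number of buffers used, or 0 if frame_len is 0 or more
 * than maxsegs buffers would be needed.
 */
static size_t
stitch_frame(size_t frame_len, struct sketch_seg *segs, size_t maxsegs)
{
	size_t n = (frame_len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	assert(segs != NULL || maxsegs == 0);
	if (n == 0 || n > maxsegs)
		return (0);
	for (size_t i = 0; i < n; i++) {
		segs[i].off = i * SKETCH_PAGE_SIZE;
		segs[i].len = (i + 1 < n) ? SKETCH_PAGE_SIZE :
		    frame_len - i * SKETCH_PAGE_SIZE;
	}
	return (n);
}
```

A 9000-byte jumbo frame needs three page buffers holding 4096, 4096 and 808 bytes. None of these buffers has to come from contigmalloc.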
svn commit: r241724 - head/sys/sys
Author: andre
Date: Fri Oct 19 10:04:43 2012
New Revision: 241724
URL: http://svn.freebsd.org/changeset/base/241724

Log:
  Remove splimp() comment from sysinit table and attribute SI_SUB_PROTO_BEGIN
  and SI_SUB_PROTO_END to VNET related initializations.

  MFC after:	3 days

Modified:
  head/sys/sys/kernel.h

Modified: head/sys/sys/kernel.h
==============================================================================
--- head/sys/sys/kernel.h	Fri Oct 19 09:41:45 2012	(r241723)
+++ head/sys/sys/kernel.h	Fri Oct 19 10:04:43 2012	(r241724)
@@ -84,12 +84,6 @@ extern int ticks;
  * The SI_SUB_SWAP values represent a value used by
  * the BSD 4.4Lite but not by FreeBSD; it is maintained in dependent
  * order to support porting.
- *
- * The SI_SUB_PROTO_BEGIN and SI_SUB_PROTO_END bracket a range of
- * initializations to take place at splimp().  This is a historical
- * wart that should be removed -- probably running everything at
- * splimp() until the first init that doesn't want it is the correct
- * fix.  They are currently present to ensure historical behavior.
  */
 enum sysinit_sub_id {
 	SI_SUB_DUMMY		= 0x0000000,	/* not executed; for linker*/
@@ -147,12 +141,12 @@ enum sysinit_sub_id {
 	SI_SUB_P1003_1B		= 0x6E00000,	/* P1003.1B realtime */
 	SI_SUB_PSEUDO		= 0x7000000,	/* pseudo devices*/
 	SI_SUB_EXEC		= 0x7400000,	/* execve() handlers */
-	SI_SUB_PROTO_BEGIN	= 0x8000000,	/* XXX: set splimp (kludge)*/
+	SI_SUB_PROTO_BEGIN	= 0x8000000,	/* VNET initialization */
 	SI_SUB_PROTO_IF		= 0x8400000,	/* interfaces*/
 	SI_SUB_PROTO_DOMAININIT	= 0x8600000,	/* domain registration system */
 	SI_SUB_PROTO_DOMAIN	= 0x8800000,	/* domains (address families?)*/
 	SI_SUB_PROTO_IFATTACHDOMAIN	= 0x8800001,	/* domain dependent data init*/
-	SI_SUB_PROTO_END	= 0x8ffffff,	/* XXX: set splx (kludge)*/
+	SI_SUB_PROTO_END	= 0x8ffffff,	/* VNET helper functions */
 	SI_SUB_KPROF		= 0x9000000,	/* kernel profiling*/
 	SI_SUB_KICK_SCHEDULER	= 0xa000000,	/* start the timeout events*/
 	SI_SUB_INT_CONFIG_HOOKS	= 0xa800000,	/* Interrupts enabled config */
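SYSINIT entries run in ascending `sysinit_sub_id` order, and entries in the same subsystem run by their SI_ORDER value. So SI_SUB_PROTO_BEGIN and SI_SUB_PROTO_END simply bracket everything registered between them. The sketch below models that ordering in user space with `qsort()`. It is a hypothetical illustration, not the kernel's startup code. The `SK_*` constants mirror the SI_SUB_PROTO_* names, and only their relative order matters here.

```c
#include <assert.h>
#include <stdlib.h>

/* Mirrors of the SI_SUB_PROTO_* bands; only their relative order matters. */
enum sketch_sub_id {
	SK_SUB_PROTO_BEGIN		= 0x8000000,
	SK_SUB_PROTO_IF			= 0x8400000,
	SK_SUB_PROTO_DOMAININIT		= 0x8600000,
	SK_SUB_PROTO_DOMAIN		= 0x8800000,
	SK_SUB_PROTO_IFATTACHDOMAIN	= 0x8800001,
	SK_SUB_PROTO_END		= 0x8ffffff,
};

/* Hypothetical stand-in for struct sysinit: what runs, and when. */
struct sketch_sysinit {
	unsigned	 subsystem;	/* SI_SUB_* band */
	unsigned	 order;		/* SI_ORDER_* within the band */
	const char	*name;
};

/* Sort key: subsystem first, then order within the subsystem. */
static int
sysinit_cmp(const void *a, const void *b)
{
	const struct sketch_sysinit *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);
}

static void
sort_sysinits(struct sketch_sysinit *v, size_t n)
{
	assert(v != NULL || n == 0);
	qsort(v, n, sizeof(*v), sysinit_cmp);
}
```

However the entries are registered, a VNET setup hook in SI_SUB_PROTO_BEGIN sorts ahead of the domain and interface initializers. A helper placed in SI_SUB_PROTO_END sorts after them.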
svn commit: r241725 - head/sys/net
Author: andre
Date: Fri Oct 19 10:07:55 2012
New Revision: 241725
URL: http://svn.freebsd.org/changeset/base/241725

Log:
  Update to previous r241688 to use __func__ instead of spelled out
  function name in log(9) message.

  Suggested by:	glebius

Modified:
  head/sys/net/if.c

Modified: head/sys/net/if.c
==============================================================================
--- head/sys/net/if.c	Fri Oct 19 10:04:43 2012	(r241724)
+++ head/sys/net/if.c	Fri Oct 19 10:07:55 2012	(r241725)
@@ -711,8 +711,8 @@ if_attachdomain1(struct ifnet *ifp)
 		return;
 	if (ifp->if_afdata_initialized >= domain_init_status) {
 		IF_AFDATA_UNLOCK(ifp);
-		log(LOG_WARNING, "if_attachdomain called more than once "
-		    "on %s\n", ifp->if_xname);
+		log(LOG_WARNING, "%s called more than once on %s\n",
+		    __func__, ifp->if_xname);
 		return;
 	}
 	ifp->if_afdata_initialized = domain_init_status;
Re: svn commit: r241688 - head/sys/net
On 18.10.2012 16:11, Gleb Smirnoff wrote:
> On Thu, Oct 18, 2012 at 02:08:26PM +0000, Andre Oppermann wrote:
> A Author: andre
> A Date: Thu Oct 18 14:08:26 2012
> A New Revision: 241688
> A URL: http://svn.freebsd.org/changeset/base/241688
> A
> A Log:
> A   Use LOG_WARNING level in in_attachdomain1() instead of printf().
> A
> A   Submitted by:	vijju.singh-at-gmail.com
> A
> A Modified:
> A   head/sys/net/if.c
> A
> A Modified: head/sys/net/if.c
> A ==============================================================================
> A --- head/sys/net/if.c	Thu Oct 18 13:57:28 2012	(r241687)
> A +++ head/sys/net/if.c	Thu Oct 18 14:08:26 2012	(r241688)
> A @@ -711,8 +711,8 @@ if_attachdomain1(struct ifnet *ifp)
> A  		return;
> A  	if (ifp->if_afdata_initialized >= domain_init_status) {
> A  		IF_AFDATA_UNLOCK(ifp);
> A -		printf("if_attachdomain called more than once on %s\n",
> A -		    ifp->if_xname);
> A +		log(LOG_WARNING, "if_attachdomain called more than once "
> A +		    "on %s\n", ifp->if_xname);
> A  		return;
> A  	}
> A  	ifp->if_afdata_initialized = domain_init_status;
>
> It'll be even more perfect if done as
>
> "%s called more than once on %s\n", __func__, ifp->if_xname

Thanks, done in r241725.

> And do we need \n for log(9)?

Yes.

-- 
Andre
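Gleb's suggestion relies on the C99 predefined identifier `__func__`, which always holds the name of the enclosing function. The message therefore stays correct if the function is renamed. The sketch below is a hypothetical user-space stand-in that formats the same message with `snprintf()` instead of log(9). The format string carries its own trailing `\n`, as in the committed code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for the log(9) call in
 * if_attachdomain1(): __func__ expands to the name of the enclosing
 * function, so the message follows the function if it is renamed.
 */
static int
attachdomain_warning(char *buf, size_t buflen, const char *ifname)
{
	assert(buf != NULL && buflen > 0);
	return (snprintf(buf, buflen, "%s called more than once on %s\n",
	    __func__, ifname));
}
```

Called with `"em0"`, this produces `attachdomain_warning called more than once on em0` followed by a newline.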
svn commit: r241726 - head/sys/kern
Author: andre
Date: Fri Oct 19 10:15:32 2012
New Revision: 241726
URL: http://svn.freebsd.org/changeset/base/241726

Log:
  Move UMA socket zone initialization from uipc_domain.c to uipc_socket.c
  into one place next to its other related functions to avoid confusion.

Modified:
  head/sys/kern/uipc_domain.c
  head/sys/kern/uipc_socket.c

Modified: head/sys/kern/uipc_domain.c
==============================================================================
--- head/sys/kern/uipc_domain.c	Fri Oct 19 10:07:55 2012	(r241725)
+++ head/sys/kern/uipc_domain.c	Fri Oct 19 10:15:32 2012	(r241726)
@@ -239,28 +239,11 @@ domain_add(void *data)
 	mtx_unlock(&dom_mtx);
 }
 
-static void
-socket_zone_change(void *tag)
-{
-
-	uma_zone_set_max(socket_zone, maxsockets);
-}
-
 /* ARGSUSED*/
 static void
 domaininit(void *dummy)
 {
 
-	/*
-	 * Before we do any setup, make sure to initialize the
-	 * zone allocator we get struct sockets from.
-	 */
-	socket_zone = uma_zcreate("socket", sizeof(struct socket), NULL, NULL,
-	    NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
-	uma_zone_set_max(socket_zone, maxsockets);
-	EVENTHANDLER_REGISTER(maxsockets_change, socket_zone_change, NULL,
-		EVENTHANDLER_PRI_FIRST);
-
 	if (max_linkhdr < 16)		/* XXX */
 		max_linkhdr = 16;
 

Modified: head/sys/kern/uipc_socket.c
==============================================================================
--- head/sys/kern/uipc_socket.c	Fri Oct 19 10:07:55 2012	(r241725)
+++ head/sys/kern/uipc_socket.c	Fri Oct 19 10:15:32 2012	(r241726)
@@ -227,6 +227,29 @@ MTX_SYSINIT(so_global_mtx, &so_global_mt
 SYSCTL_NODE(_kern, KERN_IPC, ipc, CTLFLAG_RW, 0, "IPC");
 
 /*
+ * Initialize the socket subsystem and set up the socket
+ * memory allocator.
+ */
+static void
+socket_zone_change(void *tag)
+{
+
+	uma_zone_set_max(socket_zone, maxsockets);
+}
+
+static void
+socket_init(void *tag)
+{
+
+        socket_zone = uma_zcreate("socket", sizeof(struct socket), NULL, NULL,
+            NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
+        uma_zone_set_max(socket_zone, maxsockets);
+        EVENTHANDLER_REGISTER(maxsockets_change, socket_zone_change, NULL,
+                EVENTHANDLER_PRI_FIRST);
+}
+SYSINIT(socket, SI_SUB_PROTO_DOMAININIT, SI_ORDER_ANY, socket_init, NULL);
+
+/*
  * Sysctl to get and set the maximum global sockets limit.  Notify protocols
  * of the change so that they can update their dependent limits as required.
  */
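The moved code follows a common pattern. It creates a zone, caps it at `maxsockets`, and subscribes to `maxsockets_change` so that a later change of the limit also updates the cap. The user-space sketch below models that pattern. The zone struct, the one-slot-per-handler event list and the starting limit of 1024 are hypothetical simplifications. UMA and EVENTHANDLER(9) priorities are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a UMA zone with an item limit. */
struct sketch_zone {
	const char	*name;
	int		 max;
};

typedef void (*sketch_handler)(void *tag);

#define SKETCH_MAX_HANDLERS	4

static int maxsockets = 1024;		/* assumed initial limit */
static struct sketch_zone *socket_zone;
static sketch_handler handlers[SKETCH_MAX_HANDLERS];
static void *handler_tags[SKETCH_MAX_HANDLERS];
static int nhandlers;

/* Minimal "maxsockets_change" event: remember a callback. */
static void
register_maxsockets_change(sketch_handler fn, void *tag)
{
	assert(nhandlers < SKETCH_MAX_HANDLERS);
	handlers[nhandlers] = fn;
	handler_tags[nhandlers] = tag;
	nhandlers++;
}

/* Change the limit and notify every subscriber. */
static void
set_maxsockets(int val)
{
	maxsockets = val;
	for (int i = 0; i < nhandlers; i++)
		handlers[i](handler_tags[i]);
}

/* Mirrors socket_zone_change(): keep the zone limit in sync. */
static void
socket_zone_change(void *tag)
{
	(void)tag;
	socket_zone->max = maxsockets;
}

/* Mirrors socket_init(): create the zone, cap it, subscribe to changes. */
static void
socket_init(struct sketch_zone *zone)
{
	zone->name = "socket";
	zone->max = maxsockets;
	socket_zone = zone;
	register_maxsockets_change(socket_zone_change, NULL);
}
```

Keeping creation, capping and subscription in one function is the point of the commit. Anyone reading `socket_init()` sees the whole life cycle of the limit in one place.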
svn commit: r241779 - head/sys/kern
Author: andre
Date: Sat Oct 20 10:51:32 2012
New Revision: 241779
URL: http://svn.freebsd.org/changeset/base/241779

Log:
  Tidy up somaxconn (accept queue limit) and related functions
  and move it together into one place.

Modified:
  head/sys/kern/uipc_socket.c

Modified: head/sys/kern/uipc_socket.c
==============================================================================
--- head/sys/kern/uipc_socket.c	Sat Oct 20 10:34:55 2012	(r241778)
+++ head/sys/kern/uipc_socket.c	Sat Oct 20 10:51:32 2012	(r241779)
@@ -182,15 +182,37 @@ MALLOC_DEFINE(M_PCB, "pcb", "protocol co
 	VNET_ASSERT(curvnet != NULL,					\
 	    ("%s:%d curvnet is NULL, so=%p", __func__, __LINE__, (so)));
 
+/*
+ * Limit on the number of connections in the listen queue waiting
+ * for accept(2).
+ */
 static int somaxconn = SOMAXCONN;
-static int sysctl_somaxconn(SYSCTL_HANDLER_ARGS);
-/* XXX: we dont have SYSCTL_USHORT */
+
+static int
+sysctl_somaxconn(SYSCTL_HANDLER_ARGS)
+{
+	int error;
+	int val;
+
+	val = somaxconn;
+	error = sysctl_handle_int(oidp, &val, 0, req);
+	if (error || !req->newptr )
+		return (error);
+
+	if (val < 1 || val > USHRT_MAX)
+		return (EINVAL);
+
+	somaxconn = val;
+	return (0);
+}
 SYSCTL_PROC(_kern_ipc, KIPC_SOMAXCONN, somaxconn, CTLTYPE_UINT | CTLFLAG_RW,
-    0, sizeof(int), sysctl_somaxconn, "I", "Maximum pending socket connection "
-    "queue size");
+    0, sizeof(int), sysctl_somaxconn, "I",
+    "Maximum listen socket pending connection accept queue size");
+
 static int numopensockets;
 SYSCTL_INT(_kern_ipc, OID_AUTO, numopensockets, CTLFLAG_RD,
     &numopensockets, 0, "Number of open sockets");
+
 #ifdef ZERO_COPY_SOCKETS
 /* These aren't static because they're used in other files. */
 int so_zero_copy_send = 1;
@@ -3269,24 +3291,6 @@ socheckuid(struct socket *so, uid_t uid)
 	return (0);
 }
 
-static int
-sysctl_somaxconn(SYSCTL_HANDLER_ARGS)
-{
-	int error;
-	int val;
-
-	val = somaxconn;
-	error = sysctl_handle_int(oidp, &val, 0, req);
-	if (error || !req->newptr )
-		return (error);
-
-	if (val < 1 || val > USHRT_MAX)
-		return (EINVAL);
-
-	somaxconn = val;
-	return (0);
-}
-
 /*
  * These functions are used by protocols to notify the socket layer (and its
  * consumers) of state changes in the sockets driven by protocol-side events.
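The write half of `sysctl_somaxconn()` range-checks the new value before it replaces the limit. A rejected write returns `EINVAL` and leaves the old value in place. The removed `/* XXX: we dont have SYSCTL_USHORT */` comment hints that the upper bound exists so the limit fits an unsigned short. The sketch below is a hypothetical standalone model of that check. The starting value of 128 is an assumed SOMAXCONN default, and the sysctl(9) plumbing is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Assumption: starts at 128, the usual SOMAXCONN default. */
static int somaxconn = 128;

/*
 * Hypothetical model of the write half of sysctl_somaxconn(): the new
 * value is range checked before it replaces the current limit, and an
 * out-of-range write leaves the old value untouched.
 */
static int
somaxconn_set(int val)
{
	if (val < 1 || val > USHRT_MAX)
		return (EINVAL);
	somaxconn = val;
	return (0);
}
```

Writing 0 or 65536 fails and keeps the old limit. Writing 1024 succeeds.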
svn commit: r241781 - in head: lib/libc/sys sys/kern
Author: andre
Date: Sat Oct 20 12:53:14 2012
New Revision: 241781
URL: http://svn.freebsd.org/changeset/base/241781

Log:
  Hide the unfortunately named sysctl kern.ipc.somaxconn from sysctl -a
  output and replace it with a new visible sysctl kern.ipc.soacceptqueue
  of the same functionality.  It specifies the maximum length of the
  accept queue on a listen socket.

  The old kern.ipc.somaxconn remains available for reading and writing
  for compatibility reasons so that existing programs, scripts and
  configurations continue to work.  There are no plans to ever remove
  the original and now hidden kern.ipc.somaxconn.

Modified:
  head/lib/libc/sys/listen.2
  head/sys/kern/uipc_socket.c

Modified: head/lib/libc/sys/listen.2
==============================================================================
--- head/lib/libc/sys/listen.2	Sat Oct 20 12:07:48 2012	(r241780)
+++ head/lib/libc/sys/listen.2	Sat Oct 20 12:53:14 2012	(r241781)
@@ -28,7 +28,7 @@
 .\"	From: @(#)listen.2	8.2 (Berkeley) 12/11/93
 .\" $FreeBSD$
 .\"
-.Dd August 29, 2005
+.Dd October 20, 2012
 .Dt LISTEN 2
 .Os
 .Sh NAME
@@ -102,15 +102,15 @@ of service attacks are no longer necessa
 The
 .Xr sysctl 3
 MIB variable
-.Va kern.ipc.somaxconn
+.Va kern.ipc.soacceptqueue
 specifies a hard limit on
 .Fa backlog ;
 if a value greater than
-.Va kern.ipc.somaxconn
+.Va kern.ipc.soacceptqueue
 or less than zero is specified,
 .Fa backlog
 is silently forced to
-.Va kern.ipc.somaxconn .
+.Va kern.ipc.soacceptqueue .
 .Sh INTERACTION WITH ACCEPT FILTERS
 When accept filtering is used on a socket, a second queue will
 be used to hold sockets that have connected, but have not yet
@@ -168,3 +168,17 @@ at run-time, and to use a negative
 .Fa backlog
 to request the maximum allowable value, was introduced in
 .Fx 2.2 .
+The
+.Va kern.ipc.somaxconn
+.Xr sysctl 3
+has been replaced with
+.Va kern.ipc.soacceptqueue
+in
+.Fx 10.0
+to prevent confusion its actual functionality.
+The original
+.Xr sysctl 3
+.Va kern.ipc.somaxconn
+is still available but hidden from a
+.Xr sysctl 3
+-a output so that existing applications and scripts continue to work.

Modified: head/sys/kern/uipc_socket.c
==============================================================================
--- head/sys/kern/uipc_socket.c	Sat Oct 20 12:07:48 2012	(r241780)
+++ head/sys/kern/uipc_socket.c	Sat Oct 20 12:53:14 2012	(r241781)
@@ -185,6 +185,8 @@ MALLOC_DEFINE(M_PCB, "pcb", "protocol co
 /*
  * Limit on the number of connections in the listen queue waiting
  * for accept(2).
+ * NB: The orginal sysctl somaxconn is still available but hidden
+ * to prevent confusion about the actually purpose of this number.
  */
 static int somaxconn = SOMAXCONN;
 
@@ -205,9 +207,13 @@ sysctl_somaxconn(SYSCTL_HANDLER_ARGS)
 	somaxconn = val;
 	return (0);
 }
-SYSCTL_PROC(_kern_ipc, KIPC_SOMAXCONN, somaxconn, CTLTYPE_UINT | CTLFLAG_RW,
+SYSCTL_PROC(_kern_ipc, OID_AUTO, soacceptqueue, CTLTYPE_UINT | CTLFLAG_RW,
     0, sizeof(int), sysctl_somaxconn, "I",
     "Maximum listen socket pending connection accept queue size");
+SYSCTL_PROC(_kern_ipc, KIPC_SOMAXCONN, somaxconn,
+    CTLTYPE_UINT | CTLFLAG_RW | CTLFLAG_SKIP,
+    0, sizeof(int), sysctl_somaxconn, "I",
+    "Maximum listen socket pending connection accept queue size (compat)");
 
 static int numopensockets;
 SYSCTL_INT(_kern_ipc, OID_AUTO, numopensockets, CTLFLAG_RD,
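The manual page rule is simple. A `backlog` that is negative, or larger than `kern.ipc.soacceptqueue`, is silently replaced by that limit, which is why a negative backlog works as "give me the maximum". The function below is a hypothetical model of that documented rule, not the kernel's listen path.

```c
#include <assert.h>

/*
 * Hypothetical model of the listen(2) rule in the manual page diff:
 * a backlog that is negative or larger than kern.ipc.soacceptqueue
 * is silently replaced by the kern.ipc.soacceptqueue value.
 */
static int
effective_backlog(int backlog, int soacceptqueue)
{
	assert(soacceptqueue >= 1);
	if (backlog < 0 || backlog > soacceptqueue)
		return (soacceptqueue);
	return (backlog);
}
```

With a limit of 128, `listen(s, -1)` and `listen(s, 4096)` both end up with a queue of 128. `listen(s, 10)` keeps 10.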
svn commit: r241789 - in head: lib/libc/sys sys/kern
Author: andre
Date: Sat Oct 20 19:38:22 2012
New Revision: 241789
URL: http://svn.freebsd.org/changeset/base/241789

Log:
  Grammar fixes to r241781.

  Submitted by:	alc

Modified:
  head/lib/libc/sys/listen.2
  head/sys/kern/uipc_socket.c

Modified: head/lib/libc/sys/listen.2
==============================================================================
--- head/lib/libc/sys/listen.2	Sat Oct 20 18:13:20 2012	(r241788)
+++ head/lib/libc/sys/listen.2	Sat Oct 20 19:38:22 2012	(r241789)
@@ -175,7 +175,7 @@ has been replaced with
 .Va kern.ipc.soacceptqueue
 in
 .Fx 10.0
-to prevent confusion its actual functionality.
+to prevent confusion about its actual functionality.
 The original
 .Xr sysctl 3
 .Va kern.ipc.somaxconn

Modified: head/sys/kern/uipc_socket.c
==============================================================================
--- head/sys/kern/uipc_socket.c	Sat Oct 20 18:13:20 2012	(r241788)
+++ head/sys/kern/uipc_socket.c	Sat Oct 20 19:38:22 2012	(r241789)
@@ -186,7 +186,7 @@ MALLOC_DEFINE(M_PCB, "pcb", "protocol co
  * Limit on the number of connections in the listen queue waiting
  * for accept(2).
  * NB: The orginal sysctl somaxconn is still available but hidden
- * to prevent confusion about the actually purpose of this number.
+ * to prevent confusion about the actual purpose of this number.
  */
 static int somaxconn = SOMAXCONN;
Re: svn commit: r241781 - in head: lib/libc/sys sys/kern
On 20.10.2012 19:23, Alan Cox wrote:
> There are a couple of minor grammar issues in the text.  See below.

Thank you.  Fixed in r241789.

-- 
Andre

> Alan
>
> On 10/20/2012 07:53, Andre Oppermann wrote:
>> Author: andre
>> Date: Sat Oct 20 12:53:14 2012
>> New Revision: 241781
>> URL: http://svn.freebsd.org/changeset/base/241781
>>
>> Log:
>>   Hide the unfortunately named sysctl kern.ipc.somaxconn from sysctl -a
>>   output and replace it with a new visible sysctl kern.ipc.soacceptqueue
>>   of the same functionality.  It specifies the maximum length of the
>>   accept queue on a listen socket.
>>
>>   The old kern.ipc.somaxconn remains available for reading and writing
>>   for compatibility reasons so that existing programs, scripts and
>>   configurations continue to work.  There are no plans to ever remove
>>   the original and now hidden kern.ipc.somaxconn.
>>
>> Modified:
>>   head/lib/libc/sys/listen.2
>>   head/sys/kern/uipc_socket.c
>>
>> Modified: head/lib/libc/sys/listen.2
>> ==============================================================================
>> --- head/lib/libc/sys/listen.2	Sat Oct 20 12:07:48 2012	(r241780)
>> +++ head/lib/libc/sys/listen.2	Sat Oct 20 12:53:14 2012	(r241781)
>> @@ -28,7 +28,7 @@
>>  .\"	From: @(#)listen.2	8.2 (Berkeley) 12/11/93
>>  .\" $FreeBSD$
>>  .\"
>> -.Dd August 29, 2005
>> +.Dd October 20, 2012
>>  .Dt LISTEN 2
>>  .Os
>>  .Sh NAME
>> @@ -102,15 +102,15 @@ of service attacks are no longer necessa
>>  The
>>  .Xr sysctl 3
>>  MIB variable
>> -.Va kern.ipc.somaxconn
>> +.Va kern.ipc.soacceptqueue
>>  specifies a hard limit on
>>  .Fa backlog ;
>>  if a value greater than
>> -.Va kern.ipc.somaxconn
>> +.Va kern.ipc.soacceptqueue
>>  or less than zero is specified,
>>  .Fa backlog
>>  is silently forced to
>> -.Va kern.ipc.somaxconn .
>> +.Va kern.ipc.soacceptqueue .
>>  .Sh INTERACTION WITH ACCEPT FILTERS
>>  When accept filtering is used on a socket, a second queue will
>>  be used to hold sockets that have connected, but have not yet
>> @@ -168,3 +168,17 @@ at run-time, and to use a negative
>>  .Fa backlog
>>  to request the maximum allowable value, was introduced in
>>  .Fx 2.2 .
>> +The
>> +.Va kern.ipc.somaxconn
>> +.Xr sysctl 3
>> +has been replaced with
>> +.Va kern.ipc.soacceptqueue
>> +in
>> +.Fx 10.0
>> +to prevent confusion its actual functionality.
>
> There is a missing word here: "... confusion about its ..."
>
>> +The original
>> +.Xr sysctl 3
>> +.Va kern.ipc.somaxconn
>> +is still available but hidden from a
>> +.Xr sysctl 3
>> +-a output so that existing applications and scripts continue to work.
>>
>> Modified: head/sys/kern/uipc_socket.c
>> ==============================================================================
>> --- head/sys/kern/uipc_socket.c	Sat Oct 20 12:07:48 2012	(r241780)
>> +++ head/sys/kern/uipc_socket.c	Sat Oct 20 12:53:14 2012	(r241781)
>> @@ -185,6 +185,8 @@ MALLOC_DEFINE(M_PCB, "pcb", "protocol co
>>  /*
>>   * Limit on the number of connections in the listen queue waiting
>>   * for accept(2).
>> + * NB: The orginal sysctl somaxconn is still available but hidden
>> + * to prevent confusion about the actually purpose of this number.
>
> "actually" should be "actual".
>
>>   */
>>  static int somaxconn = SOMAXCONN;
>>  
>> @@ -205,9 +207,13 @@ sysctl_somaxconn(SYSCTL_HANDLER_ARGS)
>>  	somaxconn = val;
>>  	return (0);
>>  }
>> -SYSCTL_PROC(_kern_ipc, KIPC_SOMAXCONN, somaxconn, CTLTYPE_UINT | CTLFLAG_RW,
>> +SYSCTL_PROC(_kern_ipc, OID_AUTO, soacceptqueue, CTLTYPE_UINT | CTLFLAG_RW,
>>      0, sizeof(int), sysctl_somaxconn, "I",
>>      "Maximum listen socket pending connection accept queue size");
>> +SYSCTL_PROC(_kern_ipc, KIPC_SOMAXCONN, somaxconn,
>> +    CTLTYPE_UINT | CTLFLAG_RW | CTLFLAG_SKIP,
>> +    0, sizeof(int), sysctl_somaxconn, "I",
>> +    "Maximum listen socket pending connection accept queue size (compat)");
>>  
>>  static int numopensockets;
>>  SYSCTL_INT(_kern_ipc, OID_AUTO, numopensockets, CTLFLAG_RD,
svn commit: r241892 - head/sys/mips/conf
Author: andre
Date: Mon Oct 22 15:04:23 2012
New Revision: 241892
URL: http://svn.freebsd.org/changeset/base/241892

Log:
Remove ZERO_COPY_SOCKETS from the kernel configuration as the current COW based approach is not safe and should not be used in production.

Modified:
head/sys/mips/conf/RT305X

Modified: head/sys/mips/conf/RT305X
==
--- head/sys/mips/conf/RT305X	Mon Oct 22 14:48:14 2012	(r241891)
+++ head/sys/mips/conf/RT305X	Mon Oct 22 15:04:23 2012	(r241892)
@@ -86,7 +86,6 @@ options 	SCSI_NO_OP_STRINGS
 options 	RWLOCK_NOINLINE
 options 	SX_NOINLINE
 options 	NO_SWAPPING
-options 	ZERO_COPY_SOCKETS
 
 options 	MROUTING	# Multicast routing
 options 	IPFIREWALL_DEFAULT_TO_ACCEPT
Re: svn commit: r241923 - in head/sys: netinet netipsec
On 23.10.2012 10:33, Gleb Smirnoff wrote:
Author: glebius
Date: Tue Oct 23 08:33:13 2012
New Revision: 241923
URL: http://svn.freebsd.org/changeset/base/241923

Log:
Do not reduce ip_len by the size of the IP header in ip_input() before passing a packet to protocol input routines. For several protocols this means that the protocol now needs to do the subtraction itself, and for the other half this means that we no longer need to add the header length back to the packet.

Yay! More Mammoth shit getting washed away! ;)

Please add an entry to UPDATING, as the convention of ip_len subtraction has been there since forever. That makes it easier to discover for third parties writing code.

--
Andre
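A hypothetical helper shows what the changed convention asks of a protocol input routine once ip_len still covers the IP header: the payload length is the total length minus the header length (IHL times 4). This sketch reads raw wire-format header bytes rather than struct ip, so it makes no claim about how the kernel stores ip_len internally. ipv4_payload_len() is an illustrative name, not a FreeBSD function.

```c
#include <stdint.h>

/*
 * Hypothetical helper: payload length of an IPv4 packet whose header
 * starts at 'hdr' (wire format). Bytes 2-3 hold the total length, which
 * includes the header, so the header length (IHL * 4) is subtracted here.
 * Returns -1 if the header is inconsistent.
 */
static int
ipv4_payload_len(const uint8_t *hdr)
{
	unsigned int ihl = (hdr[0] & 0x0fu) * 4;	/* header length in bytes */
	unsigned int total = ((unsigned int)hdr[2] << 8) | hdr[3];

	if (ihl < 20 || total < ihl)
		return (-1);
	return ((int)(total - ihl));
}
```

Under the old convention the subtraction had already happened in ip_input(); under the new one each consumer does it where it needs the payload size.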
svn commit: r241931 - in head/sys: conf kern
Author: andre
Date: Tue Oct 23 14:19:44 2012
New Revision: 241931
URL: http://svn.freebsd.org/changeset/base/241931

Log:
Replace the ill-named ZERO_COPY_SOCKET kernel option with two more appropriately named kernel options for the very distinct send and receive path.

options SOCKET_SEND_COW enables VM page copy-on-write based sending of data on an outbound socket.

NB: The COW based send mechanism is not safe and may result in kernel crashes.

options SOCKET_RECV_PFLIP enables VM kernel/userspace page flipping for special disposable pages attached as external storage to mbufs.

Only the naming of the kernel options is changed and their corresponding #ifdef sections are adjusted. No functionality is added or removed.

Discussed with: alc (mechanism and limitations of send side COW)

Modified:
head/sys/conf/NOTES
head/sys/conf/options
head/sys/kern/subr_uio.c
head/sys/kern/uipc_socket.c

Modified: head/sys/conf/NOTES
==
--- head/sys/conf/NOTES	Tue Oct 23 12:39:17 2012	(r241930)
+++ head/sys/conf/NOTES	Tue Oct 23 14:19:44 2012	(r241931)
@@ -964,12 +964,20 @@ options 	TCP_SIGNATURE	#include support
 # a smooth scheduling of the traffic.
 options 	DUMMYNET
 
-# Zero copy sockets support. This enables zero copy for sending and
-# receiving data via a socket. The send side works for any type of NIC,
-# the receive side only works for NICs that support MTUs greater than the
-# page size of your architecture and that support header splitting. See
-# zero_copy(9) for more details.
-options 	ZERO_COPY_SOCKETS
+# Zero copy sockets support is split into the send and receive path
+# which operate very differently.
+# For the send path the VM page with the data is wired into the kernel
+# and marked as COW (copy-on-write). If the application touches the
+# data while it is still in the send socket buffer the page is copied
+# and divorced from its kernel wiring (no longer zero copy).
+# The receive side requires explicit NIC driver support to create
+# disposable pages which are flipped from kernel to user-space VM.
+# See zero_copy(9) for more details.
+# XXX: The COW based send mechanism is not safe and may result in
+# kernel crashes.
+# XXX: None of the current NIC drivers support disposeable pages.
+options 	SOCKET_SEND_COW
+options 	SOCKET_RECV_PFLIP
 
 #
 # FILESYSTEM OPTIONS

Modified: head/sys/conf/options
==
--- head/sys/conf/options	Tue Oct 23 12:39:17 2012	(r241930)
+++ head/sys/conf/options	Tue Oct 23 14:19:44 2012	(r241931)
@@ -520,7 +520,8 @@ NGATM_CCATM	opt_netgraph.h
 # DRM options
 DRM_DEBUG	opt_drm.h
 
-ZERO_COPY_SOCKETS	opt_zero.h
+SOCKET_SEND_COW	opt_zero.h
+SOCKET_RECV_PFLIP	opt_zero.h
 TI_SF_BUF_JUMBO	opt_ti.h
 TI_JUMBO_HDRSPLIT	opt_ti.h
 BCE_JUMBO_HDRSPLIT	opt_bce.h

Modified: head/sys/kern/subr_uio.c
==
--- head/sys/kern/subr_uio.c	Tue Oct 23 12:39:17 2012	(r241930)
+++ head/sys/kern/subr_uio.c	Tue Oct 23 14:19:44 2012	(r241931)
@@ -57,7 +57,7 @@ __FBSDID("$FreeBSD$");
 #include <vm/vm_extern.h>
 #include <vm/vm_page.h>
 #include <vm/vm_map.h>
-#ifdef ZERO_COPY_SOCKETS
+#ifdef SOCKET_SEND_COW
 #include <vm/vm_object.h>
 #endif
 
@@ -66,7 +66,7 @@ SYSCTL_INT(_kern, KERN_IOV_MAX, iov_max,
 
 static int uiomove_faultflag(void *cp, int n, struct uio *uio, int nofault);
 
-#ifdef ZERO_COPY_SOCKETS
+#ifdef SOCKET_SEND_COW
 /* Declared in uipc_socket.c */
 extern int so_zero_copy_receive;
 
@@ -128,7 +128,7 @@ retry:
 	vm_map_lookup_done(map, entry);
 	return(KERN_SUCCESS);
 }
-#endif /* ZERO_COPY_SOCKETS */
+#endif /* SOCKET_SEND_COW */
 
 int
 copyin_nofault(const void *udaddr, void *kaddr, size_t len)
@@ -261,7 +261,7 @@ uiomove_frombuf(void *buf, int buflen, s
 	return (uiomove((char *)buf + offset, n, uio));
 }
 
-#ifdef ZERO_COPY_SOCKETS
+#ifdef SOCKET_RECV_PFLIP
 /*
  * Experimental support for zero-copy I/O
  */
@@ -356,7 +356,7 @@ uiomoveco(void *cp, int n, struct uio *u
 	}
 	return (0);
 }
-#endif /* ZERO_COPY_SOCKETS */
+#endif /* SOCKET_RECV_PFLIP */
 
 /*
  * Give next character to user as result of read.

Modified: head/sys/kern/uipc_socket.c
==
--- head/sys/kern/uipc_socket.c	Tue Oct 23 12:39:17 2012	(r241930)
+++ head/sys/kern/uipc_socket.c	Tue Oct 23 14:19:44 2012	(r241931)
@@ -219,17 +219,20 @@ static int numopensockets;
 SYSCTL_INT(_kern_ipc, OID_AUTO, numopensockets, CTLFLAG_RD,
     &numopensockets, 0, "Number of open sockets");
 
-#ifdef
svn commit: r241932 - head/share/man/man9
Author: andre
Date: Tue Oct 23 14:25:37 2012
New Revision: 241932
URL: http://svn.freebsd.org/changeset/base/241932

Log:
Update the zero_copy(9) man page to note the renamed kernel options and to warn about the unsafeness of COW based sends.

Modified:
head/share/man/man9/zero_copy.9

Modified: head/share/man/man9/zero_copy.9
==
--- head/share/man/man9/zero_copy.9	Tue Oct 23 14:19:44 2012	(r241931)
+++ head/share/man/man9/zero_copy.9	Tue Oct 23 14:25:37 2012	(r241932)
@@ -25,7 +25,7 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd December 5, 2004
+.Dd October 23, 2012
 .Dt ZERO_COPY 9
 .Os
 .Sh NAME
@@ -33,7 +33,8 @@
 .Nm zero_copy_sockets
 .Nd zero copy sockets code
 .Sh SYNOPSIS
-.Cd options ZERO_COPY_SOCKETS
+.Cd options SOCKET_SEND_COW
+.Cd options SOCKET_RECV_PFLIP
 .Sh DESCRIPTION
 The
 .Fx
@@ -155,6 +156,8 @@ variables respectively.
 .Xr sendfile 2 ,
 .Xr socket 2 ,
 .Xr ti 4
+.Sh BUGS
+The COW based send mechanism is not safe and may result in kernel crashes.
 .Sh HISTORY
 The zero copy sockets code first appeared in
 .Fx 5.0 ,
Re: svn commit: r241931 - in head/sys: conf kern
On 23.10.2012 16:42, Gleb Smirnoff wrote:
On Tue, Oct 23, 2012 at 02:19:45PM +0000, Andre Oppermann wrote:
A Author: andre
A Date: Tue Oct 23 14:19:44 2012
A New Revision: 241931
A URL: http://svn.freebsd.org/changeset/base/241931
A
A Log:
A Replace the ill-named ZERO_COPY_SOCKET kernel option with two
A more appropriately named kernel options for the very distinct
A send and receive path.
A
A options SOCKET_SEND_COW enables VM page copy-on-write based
A sending of data on an outbound socket.
A
A NB: The COW based send mechanism is not safe and may result
A in kernel crashes.
A
A options SOCKET_RECV_PFLIP enables VM kernel/userspace page
A flipping for special disposable pages attached as external
A storage to mbufs.
A
A Only the naming of the kernel options is changed and their
A corresponding #ifdef sections are adjusted. No functionality
A is added or removed.
A
A Discussed with: alc (mechanism and limitations of send side COW)

Users may call this a pointless POLA violation.

IMO, the old kernel option that we had for years, more than a decade, should remain and just imply the two new kernel options.

There shouldn't be any users. Zero copy send is broken and responsible for random kernel crashes. Zero copy receive isn't supported by any modern driver. Both range from useless to dangerous.

The main problem with ZERO_COPY_SOCKETS was that it sounded great, and who wouldn't want to have zero copy sockets? Unfortunately it doesn't work that way.

According to alc@, even if zero copy send worked it wouldn't be faster, because setting up page-based COW is a very expensive operation. Eventually he wants page-based COW to go away.

For zero copy send we're trying to come up with a sendfile-like approach where the page is simply wired into kernel space. The application then is not allowed to touch it until the socket buffer has released it again. The main issue here is how to provide feedback to the application when it is safe for reuse.

For zero copy receive I've been contacted by np@ to find a way to combine DDP into the socket buffer layer. I'm trying to work something out that isn't too horrible. A generic approach would hinge on page sized mbufs though.

--
Andre
Re: svn commit: r241931 - in head/sys: conf kern
On 23.10.2012 17:11, David Chisnall wrote:
On 23 Oct 2012, at 16:05, Andre Oppermann wrote:
For zero copy send we're trying to come up with a sendfile-like approach where the page is simply wired into kernel space. The application then is not allowed to touch it until the socket buffer has released it again. The main issue here is how to provide feedback to the application when it is safe for reuse.

It's been a few years since I used it, but I thought that aio_write() already provided this. The application may not modify the contents of the memory pointed to by aio_buf until after it has received notification that the write has finished. This happens either via a signal directly, a signal polled by kqueue, or a call to aio_return().

Indeed, that's one of the ways being explored. It requires the explicit cooperation of the application. I don't think there is any way around that.

--
Andre
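The aio_write() contract David describes can be sketched in userland with the POSIX AIO calls. The buffer handed to aio_write() stays off-limits until completion is observed, here by polling aio_error() and sleeping in aio_suspend(). Only then does aio_return() collect the result and make the buffer reusable. This illustrates only the buffer-ownership rule under the assumption of a working POSIX AIO implementation. It is not the zero copy send mechanism being designed, and aio_write_and_wait() is a hypothetical helper name. Signal or kqueue notification would replace the polling loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>	/* mkstemp(), used by the usage example */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>	/* pread(), close(), unlink(), used by the usage example */

/*
 * Hypothetical helper: queue an asynchronous write of 'len' bytes from
 * 'buf' at offset 'off' and wait for it. Between aio_write() and the
 * point where aio_error() stops returning EINPROGRESS, 'buf' belongs to
 * the request and must not be modified.
 */
static ssize_t
aio_write_and_wait(int fd, void *buf, size_t len, off_t off)
{
	struct aiocb cb;
	const struct aiocb *list[1];
	int err;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;
	cb.aio_sigevent.sigev_notify = SIGEV_NONE;	/* poll instead of a signal */
	list[0] = &cb;

	if (aio_write(&cb) != 0)
		return (-1);

	/* The buffer is on loan to the request: do not touch it here. */
	while ((err = aio_error(&cb)) == EINPROGRESS)
		(void)aio_suspend(list, 1, NULL);
	if (err != 0) {
		if (err > 0)
			errno = err;	/* error status of the request itself */
		return (-1);
	}
	/* Completion observed: collect the result; 'buf' may be reused now. */
	return (aio_return(&cb));
}
```

Blocking right after queuing the request defeats the purpose in a real application. The point of the sketch is only where the buffer becomes reusable. On older glibc versions, linking may require -lrt.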
svn commit: r241955 - head
Author: andre
Date: Tue Oct 23 16:33:43 2012
New Revision: 241955
URL: http://svn.freebsd.org/changeset/base/241955

Log:
Note the removal of the ZERO_COPY_SOCKETS kernel option in r241931 and provide a proper explanation.

Modified:
head/UPDATING

Modified: head/UPDATING
==
--- head/UPDATING	Tue Oct 23 16:12:17 2012	(r241954)
+++ head/UPDATING	Tue Oct 23 16:33:43 2012	(r241955)
@@ -25,6 +25,17 @@ NOTE TO PEOPLE WHO THINK THAT FreeBSD 10
 	ln -s 'abort:false,junk:false' /etc/malloc.conf.)
 
 20121023:
+	The ZERO_COPY_SOCKET kernel option has been removed and
+	split into SOCKET_SEND_COW and SOCKET_RECV_PFLIP.
+	NB: SOCKET_SEND_COW uses the VM page based copy-on-write
+	mechanism which is not safe and may result in kernel crashes.
+	NB: The SOCKET_RECV_PFLIP mechanism is useless as no current
+	driver supports disposeable external page sized mbuf storage.
+	Proper replacements for both zero-copy mechanisms are under
+	consideration and will eventually lead to complete removal
+	of the two kernel options.
+
+20121023:
 	The IPv4 network stack has been converted to network byte
 	order. The following modules need to be recompiled together
 	with kernel: carp(4), divert(4), gif(4), siftr(4), gre(4),
Re: svn commit: r241931 - in head/sys: conf kern
On 23.10.2012 17:21, Bryan Drewery wrote:
On 10/23/2012 10:05 AM, Andre Oppermann wrote:
There shouldn't be any users. Zero copy send is broken and responsible for random kernel crashes. Zero copy receive isn't supported by any modern driver. Both range from useless to dangerous.

I enabled this a few weeks ago, not knowing it was useless/dangerous. Perhaps an entry in UPDATING to note that this has been renamed and that it may not actually be useful?

Good idea. Will do.

Also, zero_copy(9) needs updating, as it references ZERO_COPY_SOCKETS.

Already done in r241932.

--
Andre
Re: svn commit: r241931 - in head/sys: conf kern
On 23.10.2012 18:05, Gleb Smirnoff wrote:
On Tue, Oct 23, 2012 at 05:05:48PM +0200, Andre Oppermann wrote:
A There shouldn't be any users. Zero copy send is broken and
A responsible for random kernel crashes. Zero copy receive isn't
A supported by any modern driver. Both range from useless to dangerous.
A
A The main problem with ZERO_COPY_SOCKETS was that it sounded great
A and who wouldn't want to have zero copy sockets? Unfortunately
A it doesn't work that way.

Okay, it turns out that there are users, even on the current@ mailing list, within a couple of hours of exposure. Can we keep the old option for compatibility?

No. They are not users. They simply fell for the promise of zero copy, which it isn't. It doesn't do what the users believe it does. It's useless for receive and dangerous for send.

I have updated NOTES and forwarded it to -current.

--
Andre
Re: svn commit: r242014 - head/sys/kern
On 24.10.2012 20:56, Jim Harris wrote:
On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd <adr...@freebsd.org> wrote:
On 24 October 2012 11:36, Jim Harris <jimhar...@freebsd.org> wrote:
Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

Ok, but..

 	struct mtx	tdq_lock;	/* run queue lock. */
+	char		pad[64 - sizeof(struct mtx)];

.. don't we have an existing compile time macro for the cache line size, which can be used here?

Yes, but I didn't use it for a couple of reasons:

1) struct tdq itself is currently using __aligned(64), so I wanted to keep it consistent.
2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to NetBurst-based processors having 128-byte cache sectors a while back. I had planned to start a separate thread on arch@ about this today on whether this was still appropriate.

See also the discussion on svn-src-all regarding global struct mtx alignment.

Thank you for proving my point. ;)

Let's go back and see how we can do this the sanest way. These are the options I see at the moment:

1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
2. use a macro like MTX_ALIGN that can be SMP/UP aware and in the future possibly change to a different compiler dependent align attribute
3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it automatically gets aligned in all cases, even when dynamically allocated.

Personally I'm undecided between #2 and #3. #1 is ugly. In favor of #3 is that there possibly isn't any case where you'd actually want the mutex to share a cache line with anything else, even a data structure.

--
Andre
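Here is one possible reading of options 2 and 3 combined: a single SMP-aware macro placed on the definition of the lock type. Every instance, including one embedded in a larger structure, then starts on its own cache line and is padded to the end of it. Everything in this sketch is an assumption for illustration: the 64-byte line size, the SKETCH_SMP switch, the MTX_ALIGN spelling, and the two-field stand-in for struct mtx.

```c
#include <stddef.h>

/* Assumptions for this sketch only. */
#define SKETCH_CACHE_LINE_SIZE	64
#define SKETCH_SMP		1

#if SKETCH_SMP
#define MTX_ALIGN	__attribute__((aligned(SKETCH_CACHE_LINE_SIZE)))
#else
#define MTX_ALIGN	/* UP: no other CPU to false-share with */
#endif

/* Stand-in lock type; the attribute on the definition affects every instance. */
struct sketch_mtx {
	const char		*lock_name;
	volatile unsigned long	 mtx_lock;
} MTX_ALIGN;

/* A structure that embeds the lock next to the data it protects. */
struct sketch_queue {
	int			q_flags;	/* stays on the line before the lock */
	struct sketch_mtx	q_lock;		/* starts on its own cache line */
	int			q_load;		/* pushed onto the following line */
};
```

On an LP64 system the embedding structure in this sketch grows from 32 to 192 bytes once the attribute is on the type. That space cost is what the objection to doing this for every sleep mutex is about.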
Re: svn commit: r242014 - head/sys/kern
On 24.10.2012 21:49, Jim Harris wrote:
On Wed, Oct 24, 2012 at 12:16 PM, Andre Oppermann <an...@freebsd.org> wrote:
snip
See also the discussion on svn-src-all regarding global struct mtx alignment.

Thank you for proving my point. ;)

Let's go back and see how we can do this the sanest way. These are the options I see at the moment:

1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
2. use a macro like MTX_ALIGN that can be SMP/UP aware and in the future possibly change to a different compiler dependent align attribute
3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it automatically gets aligned in all cases, even when dynamically allocated.

Personally I'm undecided between #2 and #3. #1 is ugly. In favor of #3 is that there possibly isn't any case where you'd actually want the mutex to share a cache line with anything else, even a data structure.

I've run my same tests with #3 as you describe, and I did see further noticeable improvement. I had a difficult time, though, quantifying the effect it would have on all of the different architectures. Putting it in ULE's tdq gained 60-70% of the overall benefit, and was well contained.

I just experimented with different specifications of alignment and couldn't get the globals aligned at all. This seems to be because the linker doesn't understand, or isn't passed, the alignment information when linking the kernel.

I agree that sprinkling it all over the place isn't pretty. But focused investigations into specific locks (spin mutexes, default mutexes, whatever) may find a few key additional ones that would benefit. I started down this path with the sleepq and turnstile locks, but none of those specifically showed noticeable improvement (at least in the tests I was running). There are still some additional ones I want to look at, but I haven't had the time yet.

This runs the very great risk of optimizing for today's available architectures and then needing rejiggering every five years, just as you've noticed the issue with 128B alignment from the NetBurst days. We never know how the next micro-architecture will behave. Micro-optimizing each individual invocation of common building blocks is the wrong path to go. I'd very much prefer the alignment *and* padding control to be done in one place for all of them, either through a magic macro or compiler __attribute__(whatever).

--
Andre
Re: svn commit: r242014 - head/sys/kern
On 24.10.2012 21:06, Attilio Rao wrote:
On Wed, Oct 24, 2012 at 8:00 PM, Jim Harris <jim.har...@gmail.com> wrote:
On Wed, Oct 24, 2012 at 11:43 AM, John Baldwin <j...@freebsd.org> wrote:
On Wednesday, October 24, 2012 2:36:41 pm Jim Harris wrote:
Author: jimharris
Date: Wed Oct 24 18:36:41 2012
New Revision: 242014
URL: http://svn.freebsd.org/changeset/base/242014

Log:
Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

This enables CPU searches (which read tdq_load) to operate independently of any contention on the spinlock. Some scheduler-intensive workloads running on an 8C single-socket SNB Xeon show considerable improvement with this change (2-3% perf improvement, 5-6% decrease in CPU util).

Sponsored by: Intel
Reviewed by: jeff

Modified:
head/sys/kern/sched_ule.c

Modified: head/sys/kern/sched_ule.c
==
--- head/sys/kern/sched_ule.c	Wed Oct 24 18:33:44 2012	(r242013)
+++ head/sys/kern/sched_ule.c	Wed Oct 24 18:36:41 2012	(r242014)
@@ -223,8 +223,13 @@ static int sched_idlespinthresh = -1;
  * locking in sched_pickcpu();
  */
 struct tdq {
-	/* Ordered to improve efficiency of cpu_search() and switch(). */
+	/*
+	 * Ordered to improve efficiency of cpu_search() and switch().
+	 * tdq_lock is padded to avoid false sharing with tdq_load and
+	 * tdq_cpu_idle.
+	 */
 	struct mtx	tdq_lock;	/* run queue lock. */
+	char		pad[64 - sizeof(struct mtx)];

Can this use 'tdq_lock __aligned(CACHE_LINE_SIZE)' instead?

No - that doesn't pad it. I believe that only works if it's global, i.e. not part of a data structure.

As I've already said in another thread, __aligned() doesn't work on object declarations, so that won't pad it either, whether it is global or part of a struct. It is just implemented as __attribute__((aligned(X))):
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html

Actually it seems gcc itself doesn't really care and it is up to the linker to honor that.

--
Andre
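Jim's answer can be checked with offsetof(). Putting the alignment attribute on the lock member only moves the start of that member. The next member is still laid out immediately after it, so it can share the lock's cache line. The explicit pad array from r242014 is what pushes tdq_load onto the next line. The sketch assumes 64-byte lines and uses a three-field stand-in for struct mtx whose real layout is not reproduced here.

```c
#include <stddef.h>

#define PAD_LINE	64	/* assumed cache line size, as in r242014 */

/* Stand-in for struct mtx; the real layout differs. */
struct fake_mtx {
	const char		*name;
	unsigned long		 flags;
	volatile unsigned long	 lock;
};

/* Variant A: alignment on the lock member itself. */
struct tdq_member_aligned {
	struct fake_mtx	tdq_lock __attribute__((aligned(PAD_LINE)));
	volatile int	tdq_load;	/* placed right after tdq_lock */
	int		tdq_cpu_idle;
};

/* Variant B: explicit padding, the approach taken in r242014. */
struct tdq_padded {
	struct fake_mtx	tdq_lock;
	char		pad[PAD_LINE - sizeof(struct fake_mtx)];
	volatile int	tdq_load;	/* starts on the next line */
	int		tdq_cpu_idle;
};
```

In variant A, tdq_load sits at offset sizeof(struct fake_mtx), inside the lock's line, although the structure as a whole is still rounded up to 64 bytes. Placing the attribute on tdq_load instead would also separate the two, because alignment on a member inserts padding before it.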
Re: svn commit: r242014 - head/sys/kern
On 24.10.2012 21:30, Alexander Motin wrote:
On 24.10.2012 22:16, Andre Oppermann wrote:
On 24.10.2012 20:56, Jim Harris wrote:
On Wed, Oct 24, 2012 at 11:41 AM, Adrian Chadd <adr...@freebsd.org> wrote:
On 24 October 2012 11:36, Jim Harris <jimhar...@freebsd.org> wrote:
Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.

Ok, but..

 	struct mtx	tdq_lock;	/* run queue lock. */
+	char		pad[64 - sizeof(struct mtx)];

.. don't we have an existing compile time macro for the cache line size, which can be used here?

Yes, but I didn't use it for a couple of reasons:

1) struct tdq itself is currently using __aligned(64), so I wanted to keep it consistent.
2) CACHE_LINE_SIZE is currently defined as 128 on x86, due to NetBurst-based processors having 128-byte cache sectors a while back. I had planned to start a separate thread on arch@ about this today on whether this was still appropriate.

See also the discussion on svn-src-all regarding global struct mtx alignment.

Thank you for proving my point. ;)

Let's go back and see how we can do this the sanest way. These are the options I see at the moment:

1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place
2. use a macro like MTX_ALIGN that can be SMP/UP aware and in the future possibly change to a different compiler dependent align attribute
3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it automatically gets aligned in all cases, even when dynamically allocated.

Personally I'm undecided between #2 and #3. #1 is ugly. In favor of #3 is that there possibly isn't any case where you'd actually want the mutex to share a cache line with anything else, even a data structure.

I'm sorry, could you give me a hint about the theory? I think I can agree that cache line sharing can be a problem in the case of spin locks -- the waiting thread will constantly try to access a page modified by another CPU, which I guess will cause cache line writes to RAM. But why is it so bad to share a lock with its respective data in the case of non-spin locks? Won't the benefit of free regular prefetch of the right data while grabbing the lock compensate for the penalties from relatively rare collisions?

Cliff Click describes it in detail: http://www.azulsystems.com/blog/cliff/2009-04-14-odds-ends

For a classic mutex it likely doesn't make much difference, since the cache line is exclusive anyway while the lock is held. On LL/SC systems there may be cache line dirtying on a failed locking attempt. For spin mutexes it hurts badly, as you noted. It hurts especially on RW mutexes, because a read lock dirties the cache line for all other CPUs. Here the RW mutex should be on its own cache line in all cases.

--
Andre
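The RW case can be made concrete. In a typical reader/writer lock, even a shared acquisition is an atomic read-modify-write of the lock word (here, incrementing a reader count), so every reader pulls the cache line in exclusive state. The sketch below uses C11 atomics and a made-up lock-word layout (top bit for a writer, remaining bits for the reader count). It is not FreeBSD's rwlock(9) implementation, only an illustration of the access pattern.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Made-up layout: top bit = writer holds the lock, low bits = reader count. */
#define RW_SKETCH_WRITER	0x80000000u

struct rw_sketch {
	_Alignas(64) atomic_uint state;	/* on its own (assumed 64-byte) line */
};

static bool
rw_sketch_try_rlock(struct rw_sketch *rw)
{
	unsigned int v = atomic_load_explicit(&rw->state, memory_order_relaxed);

	while ((v & RW_SKETCH_WRITER) == 0) {
		/* The store half of this CAS is what dirties the line for readers. */
		if (atomic_compare_exchange_weak_explicit(&rw->state, &v, v + 1,
		    memory_order_acquire, memory_order_relaxed))
			return (true);
		/* On failure 'v' has been reloaded; retry unless a writer appeared. */
	}
	return (false);
}

static void
rw_sketch_runlock(struct rw_sketch *rw)
{
	atomic_fetch_sub_explicit(&rw->state, 1, memory_order_release);
}
```

Any data placed in the same line as state would be invalidated in other CPUs' caches on every read acquisition. That is the argument for giving RW locks a line of their own.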
Re: svn commit: r242014 - head/sys/kern
On 24.10.2012 22:29, Attilio Rao wrote:
On Wed, Oct 24, 2012 at 9:25 PM, Andre Oppermann <an...@freebsd.org> wrote:
On 24.10.2012 21:06, Attilio Rao wrote:
As I've already said in another thread, __aligned() doesn't work on object declarations, so that won't pad it either, whether it is global or part of a struct. It is just implemented as __attribute__((aligned(X))):
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html

Actually it seems gcc itself doesn't really care and it is up to the linker to honor that.

Yes, but the concept being that if you use __aligned() properly (when defining a struct) the object will be correctly sized, so you will get padding automatically.

Yes. With __aligned() the element/structure should begin on an address evenly divisible by the align value *and* it should pad out any remaining space up to the next evenly divisible address.

The problem we have is that it apparently doesn't work correctly within gcc when creating structs, nor within the linker when placing such supposedly aligned structs in the .bss section (at least the padding is missing).

It seems to come down to either a) fixing gcc+ld; or b) hacking around it by magically padding the structs that require it.

--
Andre
Re: svn commit: r242014 - head/sys/kern
On 24.10.2012 22:55, Andre Oppermann wrote:
On 24.10.2012 22:29, Attilio Rao wrote:
On Wed, Oct 24, 2012 at 9:25 PM, Andre Oppermann <an...@freebsd.org> wrote:
On 24.10.2012 21:06, Attilio Rao wrote:
As I've already said in another thread, __aligned() doesn't work on object declarations, so that won't pad it either, whether it is global or part of a struct. It is just implemented as __attribute__((aligned(X))):
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Type-Attributes.html

Actually it seems gcc itself doesn't really care and it is up to the linker to honor that.

Yes, but the concept being that if you use __aligned() properly (when defining a struct) the object will be correctly sized, so you will get padding automatically.

Yes. With __aligned() the element/structure should begin on an address evenly divisible by the align value *and* it should pad out any remaining space up to the next evenly divisible address.

The problem we have is that it apparently doesn't work correctly within gcc when creating structs, nor within the linker when placing such supposedly aligned structs in the .bss section (at least the padding is missing).

I spoke too soon. Attilio is completely right in his assessment. It does work when done on the struct definition:

struct mtx {
	...
} __aligned(CACHE_LINE_SIZE);	/* works including .bss alignment padding */

When creating a struct (in globals at least) it doesn't work:

struct mtx __aligned(CACHE_LINE_SIZE) foo_mtx;	/* doesn't work */

It seems to come down to either a) fixing gcc+ld; or b) hacking around it by magically padding the structs that require it.

The question now becomes whether we can (should?) make the latter case above work or find another workaround.

--
Andre
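The difference between the two forms shows up in sizeof. With the attribute on the struct definition, the type's size is rounded up to a multiple of the alignment, so every object of that type carries its own trailing padding. With the attribute on a single object, only that object's start address is constrained. Its size is unchanged, so nothing keeps the next object out of the rest of the line. This is a userland sketch assuming GCC or Clang attribute syntax and a 64-byte line. struct plain_mtx and struct aligned_mtx are stand-ins, not struct mtx.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_LINE	64	/* assumed cache line size */

struct plain_mtx {
	const char		*name;
	volatile unsigned long	 lock;
};

/* Form 1: attribute on the type definition; sizeof is padded to 64. */
struct aligned_mtx {
	const char		*name;
	volatile unsigned long	 lock;
} __attribute__((aligned(ALIGN_LINE)));

/* Form 2: attribute on one object; its start is aligned, its size is not padded. */
static struct plain_mtx foo_mtx __attribute__((aligned(ALIGN_LINE)));

/* An object of the aligned type gets both alignment and padding. */
static struct aligned_mtx bar_mtx;
```

Whether the linker really places foo_mtx on a 64-byte boundary in the kernel is the part disputed in the thread. The sketch only shows that, even when it does, the object's size gives no room to the rest of the line.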
Re: svn commit: r242079 - in head: sbin/ipfw share/man/man4 sys/conf sys/net sys/netinet sys/netinet6 sys/netpfil/ipfw
On 25.10.2012 11:39, Andrey V. Elsukov wrote:
Author: ae
Date: Thu Oct 25 09:39:14 2012
New Revision: 242079
URL: http://svn.freebsd.org/changeset/base/242079

Log:
Remove the IPFIREWALL_FORWARD kernel option and make it possible to turn on the related functionality at runtime via the sysctl variable net.pfil.forward. It is turned off by default.

Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks

I still don't agree with naming the sysctl net.pfil.forward. This type of forwarding is a property of IPv4 and IPv6 and thus should be put there. Pfil hooking can be on layer 2, 2-bridging, 3 and who knows where else in the future. Forwarding works only for IPv4 and IPv6.

You haven't even replied to my comment on net@. Please change the sysctl location and name to its appropriate place.

Also, an MFC after 2 weeks must ensure that compiling with IPFIREWALL_FORWARD enables the sysctl at the same time to keep kernel configs within 9-stable working.

--
Andre
Re: svn commit: r242014 - head/sys/kern
On 25.10.2012 05:49, Bruce Evans wrote:
On Wed, 24 Oct 2012, Attilio Rao wrote:
On Wed, Oct 24, 2012 at 8:16 PM, Andre Oppermann <an...@freebsd.org> wrote:
...
Let's go back and see how we can do this the sanest way. These are the options I see at the moment:

1. sprinkle __aligned(CACHE_LINE_SIZE) all over the place

This is wrong because it doesn't give padding.

Unless it is sprinkled in struct declarations.

2. use a macro like MTX_ALIGN that can be SMP/UP aware and in the future possibly change to a different compiler dependent align attribute

What is this macro supposed to do? I don't understand that from your description.

3. embed __aligned(CACHE_LINE_SIZE) into struct mtx itself so it automatically gets aligned in all cases, even when dynamically allocated.

This works but I think it is overkill for structures including sleep mutexes, which are the vast majority. So I certainly wouldn't be in favor of such a patch.

This doesn't work either with fully dynamic (auto) allocations. Stack alignment is generally broken (limited, and pessimized for both space and time) in gcc (it works better in clang). On amd64, it is limited by the default of -mpreferred-stack-boundary=4. Since 2**4 is smaller than the cache line size and stack alignments larger than it are broken in gcc, __aligned(CACHE_LINE_SIZE) never works (except accidentally, 16/CACHE_LINE_SIZE of the time). On i386, we reduce the space/time pessimizations a little by overriding the default to -mpreferred-stack-boundary=2. 2**2 is even smaller than the cache line size. (The pessimizations are for both space and time, since time and code space are wasted for the code to keep the stack aligned, and cache space and thus also time are wasted for padding. Most functions don't benefit from more than sizeof(register_t) alignment.)

I'm not aware of stack allocated mutexes anywhere in the kernel. Even if there is a case, it's very special and unique.

I've verified that __aligned(CACHE_LINE_SIZE) on the definition of struct mtx itself (in sys/_mutex.h) correctly aligns and pads the global .bss resident mutexes for 64B and 128B cache line sizes.

Dynamic allocations via malloc() get whatever alignment malloc() gives. This is only required to be 4 or 8 or 16 or so (the maximum for a C object declared in conforming C, i.e. without __aligned()), but malloc() usually gives more. If it gives CACHE_LINE_SIZE, that is wasteful for most small allocations.

Stand-alone mutexes are normally not malloc'ed. They're always embedded into some larger structure they protect.

__builtin_alloca() is broken in gcc-3.3.3, but works in gcc-4.2.1, at least on i386. In gcc-3.3.3, it assumes that the stack has the default 16-byte alignment even if -mpreferred-stack-boundary=2 is in CFLAGS to say otherwise, and just subtracts from the stack pointer. In gcc-4.2.1, it does the necessary andl of the stack pointer, but only for 16-byte alignment.

It is another bug that there are no extensions of malloc() or alloca(). Since malloc() is in the library and may give CACHE_LINE_SIZE but __builtin_alloca() is in the compiler and only gives 16, these functions are not even as compatible as they should be.

I don't know of any mutexes allocated on the stack, but there are stack frames with mcontexts in them that need special alignment, so they cause problems on i386. They can't just be put on the stack due to the above bugs. They are laboriously allocated using malloc(). Since they are quite large (1 mcontext barely fits on the kernel stack), kib didn't like my alloca() method for allocating them.

You lost me here.

--
Andre
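For the dynamic case, a type's declared over-alignment is not something malloc() promises to honor. C11 aligned_alloc() takes the alignment explicitly. This is a userland sketch assuming a 64-byte line. The kernel's malloc(9) follows its own rules, which this does not model, and alloc_aligned_obj() is a hypothetical helper.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALLOC_LINE	64	/* assumed cache line size */

struct aligned_obj {
	int	counter;
} __attribute__((aligned(ALLOC_LINE)));

/*
 * malloc() only guarantees alignment suitable for fundamental types,
 * which may be less than the 64 bytes this type asks for, so request
 * the type's alignment explicitly.
 */
static struct aligned_obj *
alloc_aligned_obj(void)
{
	return (aligned_alloc(_Alignof(struct aligned_obj),
	    sizeof(struct aligned_obj)));
}
```

C11 requires the size passed to aligned_alloc() to be a multiple of the alignment. That holds here because the attribute on the type already rounds sizeof up to 64. The result is released with free() as usual.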
Re: svn commit: r242079 - in head: sbin/ipfw share/man/man4 sys/conf sys/net sys/netinet sys/netinet6 sys/netpfil/ipfw
On 25.10.2012 18:25, Andrey V. Elsukov wrote:
On 25.10.2012 19:54, Andre Oppermann wrote:
I still don't agree with naming the sysctl net.pfil.forward. This type of forwarding is a property of IPv4 and IPv6 and thus should be put there. Pfil hooking can be on layer 2, 2-bridging, 3 and who knows where else in the future. Forwarding works only for IPv4 and IPv6.

You haven't even replied to my comment on net@. Please change the sysctl location and name to its appropriate place.

Hi Andre,

There were two replies related to this subject. You did not reply to them, and I thought that you had come to agree.

I replied to your reply to mine. Other than that I didn't find anything else from you.

So, if not, what do you think about the name net.pfil.ipforward?

net.inet.ip.pfil_forward
net.inet6.ip6.pfil_forward

or something like that.

If you can show with your performance profiling that the sysctl isn't even necessary, you could leave it out completely and have pfil_forward enabled permanently. That would be even better for everybody.

Also, an MFC after 2 weeks must ensure that compiling with IPFIREWALL_FORWARD enables the sysctl at the same time to keep kernel configs within 9-stable working.

Yes, it will work like that.

Excellent. Thank you.

--
Andre
Re: svn commit: r242079 - in head: sbin/ipfw share/man/man4 sys/conf sys/net sys/netinet sys/netinet6 sys/netpfil/ipfw
On 26.10.2012 13:26, Gleb Smirnoff wrote:
On Thu, Oct 25, 2012 at 10:29:51PM +0200, Andre Oppermann wrote:
> On 25.10.2012 18:25, Andrey V. Elsukov wrote:
> On 25.10.2012 19:54, Andre Oppermann wrote:
> I still don't agree with naming the sysctl net.pfil.forward. This
> type of forwarding is a property of IPv4 and IPv6 and thus should
> be put there. Pfil hooking can be on layer 2, 2-bridging, 3 and
> who knows where else in the future. Forwarding works only for IPv4/IPv6.
>
> You haven't even replied to my comment on net@. Please change the
> sysctl location and name to its appropriate place.
>
> Hi Andre,
>
> There were two replies related to this subject; you did not reply to
> them and I thought that you agreed.
>
> I replied to your reply to mine. Other than that I didn't find
> anything else from you.
>
> So, if not, what do you think about the name net.pfil.ipforward?
>
> net.inet.ip.pfil_forward
> net.inet6.ip6.pfil_forward
>
> or something like that.
>
> If you can show with your performance profiling that the sysctl
> isn't even necessary, you could leave it out completely and have
> pfil_forward enabled permanently. That would be even better for
> everybody.

I'd prefer to have the sysctl. Benchmarking will definitely show no regression, because in the default case packets are tagless. But if packets carried 1 or 2 tags each, which don't actually belong to PACKET_TAG_IPFORWARD, then processing would be pessimized.

With M_FASTFWD_OURS I used an overlay of the protocol-specific M_PROTO[1-5] mbuf flags. The same can be done with M_IPFORWARD. The ipfw code then will not only add the m_tag but also set the M_IPFORWARD flag. That way no sysctl is required and the feature is always available. The overlay definition is in ip_var.h.

-- Andre
Re: svn commit: r242079 - in head: sbin/ipfw share/man/man4 sys/conf sys/net sys/netinet sys/netinet6 sys/netpfil/ipfw
On 26.10.2012 14:29, Andrey V. Elsukov wrote:
On 26.10.2012 15:43, Andre Oppermann wrote:
> If you can show with your performance profiling that the sysctl
> isn't even necessary, you could leave it out completely and have
> pfil_forward enabled permanently. That would be even better for
> everybody.

I'd prefer to have the sysctl. Benchmarking will definitely show no regression, because in the default case packets are tagless. But if packets carried 1 or 2 tags each, which don't actually belong to PACKET_TAG_IPFORWARD, then processing would be pessimized.

With M_FASTFWD_OURS I used an overlay of the protocol-specific M_PROTO[1-5] mbuf flags. The same can be done with M_IPFORWARD. The ipfw code then will not only add the m_tag but also set the M_IPFORWARD flag. That way no sysctl is required and the feature is always available. The overlay definition is in ip_var.h.

It seems we have only one bit in the m_flags that can be used, so maybe we should leave it for things that may appear in the future?

That's what the M_PROTO flags are for:

#define	M_IPFW_FORWARD	M_PROTO2	/* ip forwarding */

Of course you have to do the same for ip6.

The M_PROTO[1-5] flags are only valid within a protocol layer. For example, they get cleared in ip_output() before the packet is handed to layer 2.

-- Andre
Re: svn commit: r242079 - in head: sbin/ipfw share/man/man4 sys/conf sys/net sys/netinet sys/netinet6 sys/netpfil/ipfw
On 26.10.2012 15:24, Andre Oppermann wrote:
On 26.10.2012 14:29, Andrey V. Elsukov wrote:
On 26.10.2012 15:43, Andre Oppermann wrote:
> If you can show with your performance profiling that the sysctl
> isn't even necessary, you could leave it out completely and have
> pfil_forward enabled permanently. That would be even better for
> everybody.

I'd prefer to have the sysctl. Benchmarking will definitely show no regression, because in the default case packets are tagless. But if packets carried 1 or 2 tags each, which don't actually belong to PACKET_TAG_IPFORWARD, then processing would be pessimized.

With M_FASTFWD_OURS I used an overlay of the protocol-specific M_PROTO[1-5] mbuf flags. The same can be done with M_IPFORWARD. The ipfw code then will not only add the m_tag but also set the M_IPFORWARD flag. That way no sysctl is required and the feature is always available. The overlay definition is in ip_var.h.

It seems we have only one bit in the m_flags that can be used, so maybe we should leave it for things that may appear in the future?

That's what the M_PROTO flags are for:

#define	M_IPFW_FORWARD	M_PROTO2	/* ip forwarding */

Actually, looking at it technically, this isn't forwarding but specifying a different nexthop. Hence the #define and description should be more like:

#define	M_IP_NEXTHOP	M_PROTO2	/* explicit ip nexthop */

Of course the userspace ipfw feature naming and usage doesn't change. But within the kernel it's really nexthop manipulation within the forwarding path.

-- Andre

Of course you have to do the same for ip6.

The M_PROTO[1-5] flags are only valid within a protocol layer. For example, they get cleared in ip_output() before the packet is handed to layer 2.
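A user-space sketch of the flag overlay proposed in this thread, assuming made-up bit values (the real `M_PROTOn` values live in <sys/mbuf.h>) and a hypothetical reduced `struct pkt` in place of an mbuf. It shows the three properties discussed: the firewall sets a cheap flag alongside the tag, the forwarding path tests the flag before searching tags, and protocol flags are cleared before the packet leaves the layer:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define M_PROTO1	0x00000010u
#define M_PROTO2	0x00000020u
#define M_PROTO3	0x00000040u
#define M_PROTOFLAGS	(M_PROTO1 | M_PROTO2 | M_PROTO3)

/* Protocol-layer overlay, as an ip_var.h-style header would define it. */
#define M_IP_NEXTHOP	M_PROTO2	/* explicit ip nexthop */

struct pkt {			/* hypothetical, much-reduced mbuf */
	uint32_t flags;
	int ntags;		/* number of attached tags */
	bool has_nexthop_tag;	/* is one of them the nexthop tag? */
};

/* Firewall side: attach the tag and set the cheap flag bit. */
static void
fw_set_nexthop(struct pkt *p)
{
	p->has_nexthop_tag = true;
	p->ntags++;
	p->flags |= M_IP_NEXTHOP;
}

/*
 * Forwarding path: test the flag first, so untagged packets and packets
 * carrying unrelated tags never pay for a tag-list search.
 */
static bool
ip_wants_nexthop(const struct pkt *p)
{
	if ((p->flags & M_IP_NEXTHOP) == 0)
		return (false);
	return (p->has_nexthop_tag);	/* stands in for the tag lookup */
}

/* Leaving the IP layer: protocol flags mean nothing at layer 2. */
static void
ip_to_l2(struct pkt *p)
{
	p->flags &= ~M_PROTOFLAGS;
}
```

The same bit can be reused by another protocol layer precisely because it is cleared at the layer boundary.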
svn commit: r242151 - in head/sys: vm xen/evtchn
Author: andre
Date: Fri Oct 26 17:31:35 2012
New Revision: 242151
URL: http://svn.freebsd.org/changeset/base/242151

Log:
  Move the corresponding MTX_SYSINIT() next to their struct mtx
  declaration to make their relationship more obvious, as done with
  the other such mutexes.

Modified:
  head/sys/vm/vm_glue.c
  head/sys/xen/evtchn/evtchn.c

Modified: head/sys/vm/vm_glue.c
==============================================================================
--- head/sys/vm/vm_glue.c	Fri Oct 26 17:02:50 2012	(r242150)
+++ head/sys/vm/vm_glue.c	Fri Oct 26 17:31:35 2012	(r242151)
@@ -307,6 +307,8 @@ struct kstack_cache_entry *kstack_cache;
 static int kstack_cache_size = 128;
 static int kstacks;
 static struct mtx kstack_cache_mtx;
+MTX_SYSINIT(kstack_cache, &kstack_cache_mtx, "kstkch", MTX_DEF);
+
 SYSCTL_INT(_vm, OID_AUTO, kstack_cache_size, CTLFLAG_RW, &kstack_cache_size, 0,
     "");
 SYSCTL_INT(_vm, OID_AUTO, kstacks, CTLFLAG_RD, &kstacks, 0,
@@ -486,7 +488,6 @@ kstack_cache_init(void *nulll)
 	    EVENTHANDLER_PRI_ANY);
 }
 
-MTX_SYSINIT(kstack_cache, &kstack_cache_mtx, "kstkch", MTX_DEF);
 SYSINIT(vm_kstacks, SI_SUB_KTHREAD_INIT, SI_ORDER_ANY, kstack_cache_init, NULL);
 
 #ifndef NO_SWAPPING

Modified: head/sys/xen/evtchn/evtchn.c
==============================================================================
--- head/sys/xen/evtchn/evtchn.c	Fri Oct 26 17:02:50 2012	(r242150)
+++ head/sys/xen/evtchn/evtchn.c	Fri Oct 26 17:31:35 2012	(r242151)
@@ -44,7 +44,15 @@ static inline unsigned long __ffs(unsign
 	return word;
 }
 
+/*
+ * irq_mapping_update_lock: in order to allow an interrupt to occur in a critical
+ * 	section, to set pcpu->ipending (etc...) properly, we
+ *	must be able to get the icu lock, so it can't be
+ *	under witness.
+ */
 static struct mtx irq_mapping_update_lock;
+MTX_SYSINIT(irq_mapping_update_lock, &irq_mapping_update_lock, "xp", MTX_SPIN);
+
 static struct xenpic *xp;
 struct xenpic_intsrc {
 	struct intsrc xp_intsrc;
@@ -1130,11 +1138,4 @@ evtchn_init(void *dummy __unused)
 }
 
 SYSINIT(evtchn_init, SI_SUB_INTR, SI_ORDER_MIDDLE, evtchn_init, NULL);
-/*
- * irq_mapping_update_lock: in order to allow an interrupt to occur in a critical
- * 	section, to set pcpu->ipending (etc...) properly, we
- *	must be able to get the icu lock, so it can't be
- *	under witness.
- */
-MTX_SYSINIT(irq_mapping_update_lock, &irq_mapping_update_lock, "xp", MTX_SPIN);
Re: svn commit: r242161 - in head/sys: net netinet netpfil/pf
On 26.10.2012 23:06, Gleb Smirnoff wrote:
Author: glebius
Date: Fri Oct 26 21:06:33 2012
New Revision: 242161
URL: http://svn.freebsd.org/changeset/base/242161

Log:
  o Remove last argument to ip_fragment(), and obtain all needed information
    on checksums directly from mbuf flags. This simplifies code.
  o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in
    hardware. Some drivers may not announce CSUM_IP in their if_hwassist,
    although they try to do checksums if CSUM_IP is set on the mbuf. An
    example is em(4).

I don't follow your description here. Why work around a bug in a driver in ip_fragment() when we can fix the bug in the driver?

  o While here, consistently use CSUM_IP instead of its alias CSUM_DELAY_IP.
    After this change CSUM_DELAY_IP vanishes from the stack.

Good. :)

  Submitted by:	Sebastian Kuzminsky seb lineratesystems.com

-- Andre
svn commit: r242249 - head/sys/netinet
Author: andre
Date: Sun Oct 28 17:16:09 2012
New Revision: 242249
URL: http://svn.freebsd.org/changeset/base/242249

Log:
  Adjust the initial default CWND upon connection establishment to the
  new and increased values specified by RFC5681 Section 3.1.

  The even larger initial CWND per RFC3390, if enabled, is not affected.

  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_input.c

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Sun Oct 28 17:06:50 2012	(r242248)
+++ head/sys/netinet/tcp_input.c	Sun Oct 28 17:16:09 2012	(r242249)
@@ -351,8 +351,15 @@ cc_conn_init(struct tcpcb *tp)
 	if (V_tcp_do_rfc3390)
 		tp->snd_cwnd = min(4 * tp->t_maxseg,
 		    max(2 * tp->t_maxseg, 4380));
-	else
-		tp->snd_cwnd = tp->t_maxseg;
+	else {
+		/* Per RFC5681 Section 3.1 */
+		if (tp->t_maxseg > 2190)
+			tp->snd_cwnd = 2 * tp->t_maxseg;
+		else if (tp->t_maxseg > 1095)
+			tp->snd_cwnd = 3 * tp->t_maxseg;
+		else
+			tp->snd_cwnd = 4 * tp->t_maxseg;
+	}
 
 	if (CC_ALGO(tp)->conn_init != NULL)
 		CC_ALGO(tp)->conn_init(tp->ccv);
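The RFC 5681 Section 3.1 rule from the diff, next to the RFC 3390 formula it coexists with, as a standalone C sketch (the function names are hypothetical; all values are in bytes):

```c
#include <stdint.h>

/* Initial window per RFC 5681 Section 3.1 for a given sender MSS. */
static uint32_t
rfc5681_iw(uint32_t smss)
{
	if (smss > 2190)
		return (2 * smss);
	else if (smss > 1095)
		return (3 * smss);
	else
		return (4 * smss);
}

/* Initial window per RFC 3390: min(4 * SMSS, max(2 * SMSS, 4380)). */
static uint32_t
rfc3390_iw(uint32_t smss)
{
	uint32_t hi = (2 * smss > 4380) ? 2 * smss : 4380;

	return ((4 * smss < hi) ? 4 * smss : hi);
}
```

For the common Ethernet SMSS of 1460 bytes both formulas yield 4380 bytes, i.e. three segments.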
svn commit: r242250 - head/sys/netinet
Author: andre
Date: Sun Oct 28 17:25:08 2012
New Revision: 242250
URL: http://svn.freebsd.org/changeset/base/242250

Log:
  When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
  reduce the initial CWND to one segment. This reduction got lost
  some time ago due to a change in initialization ordering.

  Additionally in tcp_timer_rexmt() avoid entering fast recovery when
  we're still in TCPS_SYN_SENT state.

  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_syncache.c
  head/sys/netinet/tcp_timer.c

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Sun Oct 28 17:16:09 2012	(r242249)
+++ head/sys/netinet/tcp_input.c	Sun Oct 28 17:25:08 2012	(r242250)
@@ -345,10 +345,16 @@ cc_conn_init(struct tcpcb *tp)
 	/*
 	 * Set the initial slow-start flight size.
 	 *
-	 * RFC3390 says only do this if SYN or SYN/ACK didn't got lost.
-	 * XXX: We currently check only in syncache_socket for that.
-	 */
-	if (V_tcp_do_rfc3390)
+	 * RFC5681 Section 3.1 specifies the default conservative values.
+	 * RFC3390 specifies slightly more aggressive values.
+	 *
+	 * If a SYN or SYN/ACK was lost and retransmitted, we have to
+	 * reduce the initial CWND to one segment as congestion is likely
+	 * requiring us to be cautious.
+	 */
+	if (tp->snd_cwnd == 1)
+		tp->snd_cwnd = tp->t_maxseg;		/* SYN(-ACK) lost */
+	else if (V_tcp_do_rfc3390)
 		tp->snd_cwnd = min(4 * tp->t_maxseg,
 		    max(2 * tp->t_maxseg, 4380));
 	else {

Modified: head/sys/netinet/tcp_syncache.c
==============================================================================
--- head/sys/netinet/tcp_syncache.c	Sun Oct 28 17:16:09 2012	(r242249)
+++ head/sys/netinet/tcp_syncache.c	Sun Oct 28 17:25:08 2012	(r242250)
@@ -852,11 +852,12 @@ syncache_socket(struct syncache *sc, str
 	tcp_mss(tp, sc->sc_peer_mss);
 
 	/*
-	 * If the SYN,ACK was retransmitted, reset cwnd to 1 segment.
+	 * If the SYN,ACK was retransmitted, indicate that CWND to be
+	 * limited to one segment in cc_conn_init().
 	 * NB: sc_rxmits counts all SYN,ACK transmits, not just retransmits.
 	 */
 	if (sc->sc_rxmits > 1)
-		tp->snd_cwnd = tp->t_maxseg;
+		tp->snd_cwnd = 1;
 
 #ifdef TCP_OFFLOAD
 	/*

Modified: head/sys/netinet/tcp_timer.c
==============================================================================
--- head/sys/netinet/tcp_timer.c	Sun Oct 28 17:16:09 2012	(r242249)
+++ head/sys/netinet/tcp_timer.c	Sun Oct 28 17:25:08 2012	(r242250)
@@ -539,7 +539,13 @@ tcp_timer_rexmt(void * xtp)
 	}
 	INP_INFO_RUNLOCK(&V_tcbinfo);
 	headlocked = 0;
-	if (tp->t_rxtshift == 1) {
+	if (tp->t_state == TCPS_SYN_SENT) {
+		/*
+		 * If the SYN was retransmitted, indicate CWND to be
+		 * limited to 1 segment in cc_conn_init().
+		 */
+		tp->snd_cwnd = 1;
+	} else if (tp->t_rxtshift == 1) {
 		/*
 		 * first retransmit; record ssthresh and cwnd so they can
 		 * be recovered if this turns out to be a bad retransmit.
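The commit passes information from the SYN retransmit paths to cc_conn_init() by storing a marker value of 1 in snd_cwnd, a value no real window would have. A minimal sketch of that hand-off, using a hypothetical reduced `struct conn` instead of `struct tcpcb`:

```c
#include <stdint.h>

#define MIN_U32(a, b)	((a) < (b) ? (a) : (b))
#define MAX_U32(a, b)	((a) > (b) ? (a) : (b))

/* Hypothetical, reduced connection state. */
struct conn {
	uint32_t maxseg;	/* negotiated MSS */
	uint32_t snd_cwnd;	/* congestion window in bytes */
};

/*
 * Called when a SYN (or SYN|ACK) had to be retransmitted.  The MSS may
 * not be final yet, so only a 1-byte marker is recorded.
 */
static void
note_syn_retransmit(struct conn *c)
{
	c->snd_cwnd = 1;
}

/* Initial window selection once the MSS is known. */
static void
conn_init_cwnd(struct conn *c, int do_rfc3390)
{
	uint32_t mss = c->maxseg;

	if (c->snd_cwnd == 1)			/* SYN(-ACK) was lost */
		c->snd_cwnd = mss;
	else if (do_rfc3390)
		c->snd_cwnd = MIN_U32(4 * mss, MAX_U32(2 * mss, 4380));
	else if (mss > 2190)			/* RFC 5681 Section 3.1 */
		c->snd_cwnd = 2 * mss;
	else if (mss > 1095)
		c->snd_cwnd = 3 * mss;
	else
		c->snd_cwnd = 4 * mss;
}
```

An in-band marker avoids a new field in the connection structure, at the cost that readers must know 1 means "SYN lost" rather than a one-byte window.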
svn commit: r242251 - head/sys/netinet
Author: andre
Date: Sun Oct 28 17:30:28 2012
New Revision: 242251
URL: http://svn.freebsd.org/changeset/base/242251

Log:
  When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
  reduce the initial CWND to one segment. This reduction got lost
  some time ago due to a change in initialization ordering.

  Additionally in tcp_timer_rexmt() avoid entering fast recovery when
  we're still in TCPS_SYN_SENT state.

  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_output.c

Modified: head/sys/netinet/tcp_output.c
==============================================================================
--- head/sys/netinet/tcp_output.c	Sun Oct 28 17:25:08 2012	(r242250)
+++ head/sys/netinet/tcp_output.c	Sun Oct 28 17:30:28 2012	(r242251)
@@ -551,10 +551,14 @@ after_sack_rexmit:
 	 * max size segments, or at least 50% of the maximum possible
 	 * window, then want to send a window update to peer.
 	 * Skip this if the connection is in T/TCP half-open state.
-	 * Don't send pure window updates when the peer has closed
-	 * the connection and won't ever send more data.
+	 *
+	 * Don't send an independent window update if a delayed
+	 * ACK is pending (it will get piggy-backed on it) or the
+	 * remote side already has done a half-close and won't send
+	 * more data.
 	 */
 	if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) &&
+	    !(tp->t_flags & TF_DELACK) &&
 	    !TCPS_HAVERCVDFIN(tp->t_state)) {
 		/*
 		 * adv is the amount we can increase the window,
svn commit: r242252 - head/sys/netinet
Author: andre
Date: Sun Oct 28 17:40:35 2012
New Revision: 242252
URL: http://svn.freebsd.org/changeset/base/242252

Log:
  Prevent a flurry of forced window updates when an application is
  doing small reads on a (partially) filled receive socket buffer.

  Normally one would send a window update every time the available
  space in the socket buffer increases by two times MSS. This leads
  to a flurry of window updates that do not provide any meaningful
  new information to the sender. There still is available space in
  the window and the sender can continue sending data. All window
  updates then get carried by the regular ACKs. Only when the socket
  buffer was (almost) full and the window closed accordingly does a
  window update deliver new information and allow the sender to start
  sending more data again.

  Send a window update only when it opens the window by at least two
  MSS and either the socket buffer has less than 1/8 of its space
  available, or the available space increased by 1/4 of its full
  capacity, or the socket buffer is very small. The next regular data
  ACK will carry and report the exact window size again.

  Reported by:	sbruno
  Tested by:	darrenr
  Tested by:	Darren Baginski
  PR:		kern/116335
  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_output.c

Modified: head/sys/netinet/tcp_output.c
==============================================================================
--- head/sys/netinet/tcp_output.c	Sun Oct 28 17:30:28 2012	(r242251)
+++ head/sys/netinet/tcp_output.c	Sun Oct 28 17:40:35 2012	(r242252)
@@ -545,23 +545,39 @@ after_sack_rexmit:
 	}
 
 	/*
-	 * Compare available window to amount of window
-	 * known to peer (as advertised window less
-	 * next expected input).  If the difference is at least two
-	 * max size segments, or at least 50% of the maximum possible
-	 * window, then want to send a window update to peer.
-	 * Skip this if the connection is in T/TCP half-open state.
+	 * Sending of standalone window updates.
+	 *
+	 * Window updates important when we close our window due to a full
+	 * socket buffer and are opening it again after the application
+	 * reads data from it.  Once the window has opened again and the
+	 * remote end starts to send again the ACK clock takes over and
+	 * provides the most current window information.
+	 *
+	 * We must avoid to the silly window syndrome whereas every read
+	 * from the receive buffer, no matter how small, causes a window
+	 * update to be sent.  We also should avoid sending a flurry of
+	 * window updates when the socket buffer had queued a lot of data
+	 * and the application is doing small reads.
+	 *
+	 * Prevent a flurry of pointless window updates by only sending
+	 * an update when we can increase the advertized window by more
+	 * than 1/4th of the socket buffer capacity.  When the buffer is
+	 * getting full or is very small be more aggressive and send an
+	 * update whenever we can increase by two mss sized segments.
+	 * In all other situations the ACK's to new incoming data will
+	 * carry further window increases.
 	 *
 	 * Don't send an independent window update if a delayed
 	 * ACK is pending (it will get piggy-backed on it) or the
 	 * remote side already has done a half-close and won't send
-	 * more data.
+	 * more data.  Skip this if the connection is in T/TCP
+	 * half-open state.
 	 */
 	if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) &&
 	    !(tp->t_flags & TF_DELACK) &&
 	    !TCPS_HAVERCVDFIN(tp->t_state)) {
 		/*
-		 * adv is the amount we can increase the window,
+		 * adv is the amount we could increase the window,
 		 * taking into account that we are limited by
 		 * TCP_MAXWIN << tp->rcv_scale.
 		 */
@@ -581,9 +597,11 @@ after_sack_rexmit:
 		 */
 		if (oldwin >> tp->rcv_scale == (adv + oldwin) >> tp->rcv_scale)
 			goto dontupdate;
-		if (adv >= (long) (2 * tp->t_maxseg))
-			goto send;
-		if (2 * adv >= (long) so->so_rcv.sb_hiwat)
+
+		if (adv >= (long)(2 * tp->t_maxseg) &&
+		    (adv >= (long)(so->so_rcv.sb_hiwat / 4) ||
+		     recwin <= (long)(so->so_rcv.sb_hiwat / 8) ||
+		     so->so_rcv.sb_hiwat <= 8 * tp->t_maxseg))
 			goto send;
 	}
 dontupdate:
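The new send condition can be read as a single predicate. A sketch with plain `long` arguments standing in for the tcpcb and socket-buffer fields (the function and parameter names are hypothetical; the delayed-ACK, half-close and window-scale checks that precede it in tcp_output() are left out):

```c
#include <stdbool.h>

/*
 * adv    - how much the advertised window could grow (bytes)
 * recwin - receive window we could advertise now (bytes)
 * hiwat  - receive socket buffer capacity (so_rcv.sb_hiwat)
 * maxseg - MSS
 */
static bool
want_window_update(long adv, long recwin, long hiwat, long maxseg)
{
	if (adv < 2 * maxseg)
		return (false);		/* never for less than two segments */
	return (adv >= hiwat / 4 ||	/* opens by 1/4 of the buffer */
	    recwin <= hiwat / 8 ||	/* buffer is nearly full */
	    hiwat <= 8 * maxseg);	/* small buffer: be aggressive */
}
```

With a 64 KB buffer and a 1460-byte MSS, an increase of exactly two segments (2920 bytes) while half the buffer is still free no longer triggers an update; under the previous rule it would have.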
svn commit: r242253 - head/sys/netinet
Author: andre
Date: Sun Oct 28 17:59:46 2012
New Revision: 242253
URL: http://svn.freebsd.org/changeset/base/242253

Log:
  Simplify implementation of net.inet.tcp.reass.maxsegments and
  net.inet.tcp.reass.cursegments.

  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_reass.c

Modified: head/sys/netinet/tcp_reass.c
==============================================================================
--- head/sys/netinet/tcp_reass.c	Sun Oct 28 17:40:35 2012	(r242252)
+++ head/sys/netinet/tcp_reass.c	Sun Oct 28 17:59:46 2012	(r242253)
@@ -74,7 +74,6 @@ __FBSDID("$FreeBSD$");
 #include <netinet/tcp_debug.h>
 #endif /* TCPDEBUG */
 
-static int tcp_reass_sysctl_maxseg(SYSCTL_HANDLER_ARGS);
 static int tcp_reass_sysctl_qsize(SYSCTL_HANDLER_ARGS);
 
 static SYSCTL_NODE(_net_inet_tcp, OID_AUTO, reass, CTLFLAG_RW, 0,
@@ -82,16 +81,12 @@ static SYSCTL_NODE(_net_inet_tcp, OID_AU
 
 static VNET_DEFINE(int, tcp_reass_maxseg) = 0;
 #define	V_tcp_reass_maxseg		VNET(tcp_reass_maxseg)
-SYSCTL_VNET_PROC(_net_inet_tcp_reass, OID_AUTO, maxsegments,
-    CTLTYPE_INT | CTLFLAG_RDTUN,
-    &VNET_NAME(tcp_reass_maxseg), 0, tcp_reass_sysctl_maxseg, "I",
+SYSCTL_VNET_INT(_net_inet_tcp_reass, OID_AUTO, maxsegments, CTLFLAG_RDTUN,
+    &VNET_NAME(tcp_reass_maxseg), 0,
     "Global maximum number of TCP Segments in Reassembly Queue");
 
-static VNET_DEFINE(int, tcp_reass_qsize) = 0;
-#define	V_tcp_reass_qsize		VNET(tcp_reass_qsize)
 SYSCTL_VNET_PROC(_net_inet_tcp_reass, OID_AUTO, cursegments,
-    CTLTYPE_INT | CTLFLAG_RD,
-    &VNET_NAME(tcp_reass_qsize), 0, tcp_reass_sysctl_qsize, "I",
+    (CTLTYPE_INT | CTLFLAG_RD), NULL, 0, tcp_reass_sysctl_qsize, "I",
     "Global number of TCP Segments currently in Reassembly Queue");
 
 static VNET_DEFINE(int, tcp_reass_overflows) = 0;
@@ -109,8 +104,10 @@ static void
 tcp_reass_zone_change(void *tag)
 {
 
+	/* Set the zone limit and read back the effective value. */
 	V_tcp_reass_maxseg = nmbclusters / 16;
 	uma_zone_set_max(V_tcp_reass_zone, V_tcp_reass_maxseg);
+	V_tcp_reass_maxseg = uma_zone_get_max(V_tcp_reass_zone);
 }
 
 void
@@ -122,7 +119,9 @@ tcp_reass_init(void)
 	    &V_tcp_reass_maxseg);
 	V_tcp_reass_zone = uma_zcreate("tcpreass", sizeof (struct tseg_qent),
 	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);
+	/* Set the zone limit and read back the effective value. */
 	uma_zone_set_max(V_tcp_reass_zone, V_tcp_reass_maxseg);
+	V_tcp_reass_maxseg = uma_zone_get_max(V_tcp_reass_zone);
 	EVENTHANDLER_REGISTER(nmbclusters_change,
 	    tcp_reass_zone_change, NULL, EVENTHANDLER_PRI_ANY);
 }
@@ -156,17 +155,12 @@ tcp_reass_flush(struct tcpcb *tp)
 }
 
 static int
-tcp_reass_sysctl_maxseg(SYSCTL_HANDLER_ARGS)
-{
-	V_tcp_reass_maxseg = uma_zone_get_max(V_tcp_reass_zone);
-	return (sysctl_handle_int(oidp, arg1, arg2, req));
-}
-
-static int
 tcp_reass_sysctl_qsize(SYSCTL_HANDLER_ARGS)
 {
-	V_tcp_reass_qsize = uma_zone_get_cur(V_tcp_reass_zone);
-	return (sysctl_handle_int(oidp, arg1, arg2, req));
+	int qsize;
+
+	qsize = uma_zone_get_cur(V_tcp_reass_zone);
+	return (sysctl_handle_int(oidp, &qsize, sizeof(qsize), req));
 }
 
 int
svn commit: r242254 - head/sys/netinet
Author: andre
Date: Sun Oct 28 18:07:34 2012
New Revision: 242254
URL: http://svn.freebsd.org/changeset/base/242254

Log:
  Change the syncache count reporting the current number of entries
  from an unprotected u_int that reports garbage on SMP to a function
  based sysctl obtaining the current value from UMA.

  Also read back the actual cache_limit after page size rounding by UMA.

  PR:		kern/165879
  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_syncache.c
  head/sys/netinet/tcp_syncache.h

Modified: head/sys/netinet/tcp_syncache.c
==============================================================================
--- head/sys/netinet/tcp_syncache.c	Sun Oct 28 17:59:46 2012	(r242253)
+++ head/sys/netinet/tcp_syncache.c	Sun Oct 28 18:07:34 2012	(r242254)
@@ -123,6 +123,7 @@ struct syncache *syncache_lookup(struct 
 static int	 syncache_respond(struct syncache *);
 static struct	 socket *syncache_socket(struct syncache *, struct socket *,
 		    struct mbuf *m);
+static int	 syncache_sysctl_count(SYSCTL_HANDLER_ARGS);
 static void	 syncache_timeout(struct syncache *sc, struct syncache_head *sch,
 		    int docallout);
 static void	 syncache_timer(void *);
@@ -158,8 +159,8 @@ SYSCTL_VNET_UINT(_net_inet_tcp_syncache,
     &VNET_NAME(tcp_syncache.cache_limit), 0,
     "Overall entry limit for syncache");
 
-SYSCTL_VNET_UINT(_net_inet_tcp_syncache, OID_AUTO, count, CTLFLAG_RD,
-    &VNET_NAME(tcp_syncache.cache_count), 0,
+SYSCTL_VNET_PROC(_net_inet_tcp_syncache, OID_AUTO, count, (CTLTYPE_UINT|CTLFLAG_RD),
+    NULL, 0, syncache_sysctl_count, "IU",
     "Current number of entries in syncache");
 
 SYSCTL_VNET_UINT(_net_inet_tcp_syncache, OID_AUTO, hashsize, CTLFLAG_RDTUN,
@@ -225,7 +226,6 @@ syncache_init(void)
 {
 	int i;
 
-	V_tcp_syncache.cache_count = 0;
 	V_tcp_syncache.hashsize = TCP_SYNCACHE_HASHSIZE;
 	V_tcp_syncache.bucket_limit = TCP_SYNCACHE_BUCKETLIMIT;
 	V_tcp_syncache.rexmt_limit = SYNCACHE_MAXREXMTS;
@@ -269,6 +269,7 @@ syncache_init(void)
 	V_tcp_syncache.zone = uma_zcreate("syncache", sizeof(struct syncache),
 	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
 	uma_zone_set_max(V_tcp_syncache.zone, V_tcp_syncache.cache_limit);
+	V_tcp_syncache.cache_limit = uma_zone_get_max(V_tcp_syncache.zone);
 }
 
 #ifdef VIMAGE
@@ -296,8 +297,8 @@ syncache_destroy(void)
 		mtx_destroy(&sch->sch_mtx);
 	}
 
-	KASSERT(V_tcp_syncache.cache_count == 0, ("%s: cache_count %d not 0",
-	    __func__, V_tcp_syncache.cache_count));
+	KASSERT(uma_zone_get_cur(V_tcp_syncache.zone) == 0,
+	    ("%s: cache_count not 0", __func__));
 
 	/* Free the allocated global resources. */
 	uma_zdestroy(V_tcp_syncache.zone);
@@ -305,6 +306,15 @@ syncache_destroy(void)
 }
 #endif
 
+static int
+syncache_sysctl_count(SYSCTL_HANDLER_ARGS)
+{
+	int count;
+
+	count = uma_zone_get_cur(V_tcp_syncache.zone);
+	return (sysctl_handle_int(oidp, &count, sizeof(count), req));
+}
+
 /*
  * Inserts a syncache entry into the specified bucket row.
  * Locks and unlocks the syncache_head autonomously.
@@ -347,7 +357,6 @@ syncache_insert(struct syncache *sc, str
 
 	SCH_UNLOCK(sch);
 
-	V_tcp_syncache.cache_count++;
 	TCPSTAT_INC(tcps_sc_added);
 }
 
@@ -373,7 +382,6 @@ syncache_drop(struct syncache *sc, struc
 #endif
 
 	syncache_free(sc);
-	V_tcp_syncache.cache_count--;
 }
 
 /*
@@ -958,7 +966,6 @@ syncache_expand(struct in_conninfo *inc,
 				tod->tod_syncache_removed(tod, sc->sc_todctx);
 			}
 #endif
-			V_tcp_syncache.cache_count--;
 			SCH_UNLOCK(sch);
 		}
 

Modified: head/sys/netinet/tcp_syncache.h
==============================================================================
--- head/sys/netinet/tcp_syncache.h	Sun Oct 28 17:59:46 2012	(r242253)
+++ head/sys/netinet/tcp_syncache.h	Sun Oct 28 18:07:34 2012	(r242254)
@@ -112,7 +112,6 @@ struct tcp_syncache {
 	u_int	hashsize;
 	u_int	hashmask;
 	u_int	bucket_limit;
-	u_int	cache_count;		/* XXX: unprotected */
 	u_int	cache_limit;
 	u_int	rexmt_limit;
 	u_int	hash_secret;
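Both r242253 and r242254 follow the same pattern: set a limit on the UMA zone, read back the limit UMA actually applied, and report the current item count from the zone instead of keeping a separate, unprotected counter. The toy allocator below models that pattern in user space; it is not UMA, and the slab size of 32 items is an assumed value chosen to make the rounding visible:

```c
/* Assumed slab size; the limit is kept in whole slabs. */
#define ITEMS_PER_SLAB	32

struct zone {
	int max;	/* effective item limit */
	int cur;	/* items currently allocated */
};

/* Round the requested limit up to a whole number of slabs. */
static void
zone_set_max(struct zone *z, int nitems)
{
	int slabs = (nitems + ITEMS_PER_SLAB - 1) / ITEMS_PER_SLAB;

	z->max = slabs * ITEMS_PER_SLAB;
}

static int
zone_get_max(const struct zone *z)
{
	return (z->max);
}

/* The single source of truth a "count" sysctl handler would report. */
static int
zone_get_cur(const struct zone *z)
{
	return (z->cur);
}

/* Returns 0 on success, -1 once the effective limit is reached. */
static int
zone_alloc(struct zone *z)
{
	if (z->cur >= z->max)
		return (-1);
	z->cur++;
	return (0);
}

static void
zone_free(struct zone *z)
{
	z->cur--;
}
```

Reading the limit back means the tunable reports what the allocator enforces, and deriving the count from the allocator removes the racy increments and decrements the commit deletes.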
svn commit: r242255 - head/sys/netinet
Author: andre
Date: Sun Oct 28 18:33:52 2012
New Revision: 242255
URL: http://svn.freebsd.org/changeset/base/242255

Log:
  Allow arbitrary MSS sizes and don't care about the cluster size anymore.
  We've had more cluster sizes for quite some time now, and the originally
  imposed limits and the previously codified thoughts on efficiency gains
  are no longer true.

  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_input.c

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Sun Oct 28 18:07:34 2012	(r242254)
+++ head/sys/netinet/tcp_input.c	Sun Oct 28 18:33:52 2012	(r242255)
@@ -3322,10 +3322,8 @@ tcp_xmit_timer(struct tcpcb *tp, int rtt
 /*
  * Determine a reasonable value for maxseg size.
  * If the route is known, check route for mtu.
- * If none, use an mss that can be handled on the outgoing
- * interface without forcing IP to fragment; if bigger than
- * an mbuf cluster (MCLBYTES), round down to nearest multiple of MCLBYTES
- * to utilize large mbufs.  If no route is found, route has no mtu,
+ * If none, use an mss that can be handled on the outgoing interface
+ * without forcing IP to fragment.  If no route is found, route has no mtu,
  * or the destination isn't local, use a default, hopefully conservative
  * size (usually 512 or the default IP max size, but no more than the mtu
  * of the interface), as we can't discover anything about intervening
@@ -3506,13 +3504,6 @@ tcp_mss_update(struct tcpcb *tp, int off
 	    (tp->t_flags & TF_RCVD_TSTMP) == TF_RCVD_TSTMP))
 		mss -= TCPOLEN_TSTAMP_APPA;
 
-#if	(MCLBYTES & (MCLBYTES - 1)) == 0
-	if (mss > MCLBYTES)
-		mss &= ~(MCLBYTES-1);
-#else
-	if (mss > MCLBYTES)
-		mss = mss / MCLBYTES * MCLBYTES;
-#endif
 	tp->t_maxseg = mss;
 }
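For reference, the removed clamp behaves as follows. This standalone sketch assumes MCLBYTES is 2048 (the usual value) and wraps the deleted lines in a hypothetical function:

```c
/* Assumed cluster size; MCLBYTES is 2048 on common FreeBSD platforms. */
#define MCLBYTES	2048

/* Pre-r242255 behaviour: round an MSS above MCLBYTES down to a multiple. */
static int
old_mss_clamp(int mss)
{
#if (MCLBYTES & (MCLBYTES - 1)) == 0
	if (mss > MCLBYTES)
		mss &= ~(MCLBYTES - 1);		/* power of two: mask low bits */
#else
	if (mss > MCLBYTES)
		mss = mss / MCLBYTES * MCLBYTES;
#endif
	return (mss);
}
```

With a 9000-byte MTU and TCP timestamps (MSS 9000 - 40 - 12 = 8948), the old code sent 8192-byte segments; after this commit the full 8948 bytes are used.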
svn commit: r242256 - head/sys/kern
Author: andre
Date: Sun Oct 28 18:38:51 2012
New Revision: 242256
URL: http://svn.freebsd.org/changeset/base/242256

Log:
  Improve m_cat() by being able to also merge contents from M_EXT
  mbuf's by doing proper testing with M_WRITABLE().

  In m_collapse() replace an incomplete manual check for M_RDONLY
  with the M_WRITABLE() macro that also tests for shared buffers
  and other cases that make a particular mbuf immutable.

  MFC after:	2 weeks

Modified:
  head/sys/kern/uipc_mbuf.c

Modified: head/sys/kern/uipc_mbuf.c
==============================================================================
--- head/sys/kern/uipc_mbuf.c	Sun Oct 28 18:33:52 2012	(r242255)
+++ head/sys/kern/uipc_mbuf.c	Sun Oct 28 18:38:51 2012	(r242256)
@@ -911,8 +911,8 @@ m_cat(struct mbuf *m, struct mbuf *n)
 	while (m->m_next)
 		m = m->m_next;
 	while (n) {
-		if (m->m_flags & M_EXT ||
-		    m->m_data + m->m_len + n->m_len >= &m->m_dat[MLEN]) {
+		if (!M_WRITABLE(m) ||
+		    M_TRAILINGSPACE(m) < n->m_len) {
 			/* just join the two chains */
 			m->m_next = n;
 			return;
@@ -1584,7 +1584,7 @@ again:
 		n = m->m_next;
 		if (n == NULL)
 			break;
-		if ((m->m_flags & M_RDONLY) == 0 &&
+		if (M_WRITABLE(m) &&
 		    n->m_len < M_TRAILINGSPACE(m)) {
 			bcopy(mtod(n, void *), mtod(m, char *) + m->m_len,
 			    n->m_len);
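The merge rule that m_cat() now applies can be modelled with a toy buffer chain. In this sketch the hypothetical `struct buf` replaces struct mbuf, `writable` stands in for M_WRITABLE(), `trailing_space()` for M_TRAILINGSPACE(), and the 64-byte capacity is an assumed value:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BUFSZ	64	/* assumed per-buffer capacity */

/* Hypothetical, much-reduced mbuf: fixed storage plus a data window. */
struct buf {
	struct buf *next;
	bool writable;		/* stands in for M_WRITABLE(m) */
	size_t off, len;	/* data occupies store[off .. off + len) */
	char store[BUFSZ];
};

/* Free room after the data; none if the buffer may not be modified. */
static size_t
trailing_space(const struct buf *b)
{
	return (b->writable ? BUFSZ - (b->off + b->len) : 0);
}

/*
 * Append chain n to chain m: copy a buffer's bytes into the tail of m
 * while m is writable and has room, otherwise link the rest of n as is.
 */
static void
buf_cat(struct buf *m, struct buf *n)
{
	while (m->next != NULL)
		m = m->next;
	while (n != NULL) {
		if (!m->writable || trailing_space(m) < n->len) {
			m->next = n;		/* just join the two chains */
			return;
		}
		memcpy(m->store + m->off + m->len, n->store + n->off, n->len);
		m->len += n->len;
		n = n->next;	/* the real m_cat() frees the emptied mbuf here */
	}
}
```

Testing writability rather than M_EXT lets data be copied into a private external cluster, while shared or read-only storage is still only linked.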
svn commit: r242257 - head/sys/netinet
Author: andre
Date: Sun Oct 28 18:45:04 2012
New Revision: 242257
URL: http://svn.freebsd.org/changeset/base/242257

Log:
  Remove bogus 'else' in #ifdef that prevented the rttvar from being
  reset in tcp_timer_rexmt() on retransmit for IPv6 sessions.

  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_timer.c

Modified: head/sys/netinet/tcp_timer.c
==============================================================================
--- head/sys/netinet/tcp_timer.c	Sun Oct 28 18:38:51 2012	(r242256)
+++ head/sys/netinet/tcp_timer.c	Sun Oct 28 18:45:04 2012	(r242257)
@@ -596,7 +596,6 @@ tcp_timer_rexmt(void * xtp)
 #ifdef INET6
 		if ((tp->t_inpcb->inp_vflag & INP_IPV6) != 0)
 			in6_losing(tp->t_inpcb);
-		else
 #endif
 		tp->t_rttvar += (tp->t_srtt >> TCP_RTT_SHIFT);
 		tp->t_srtt = 0;
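The bug is a dangling `else` that crosses an `#ifdef`: with INET6 defined, the statement after `#endif` becomes the else-branch and is skipped for IPv6 connections. A self-contained reproduction with hypothetical names (`in6_losing_stub()` replaces in6_losing(), and `RTT_SHIFT` is an assumed stand-in for TCP_RTT_SHIFT):

```c
#include <stdbool.h>

#define INET6		/* pretend the kernel is built with IPv6 support */
#define RTT_SHIFT	5	/* assumed value for illustration */

struct rtt {
	int srtt;
	int rttvar;
};

static int losing_calls;	/* counts calls to the in6_losing() stand-in */

static void
in6_losing_stub(void)
{
	losing_calls++;
}

/* Before r242257: the 'else' swallows the rttvar update for IPv6. */
static void
reset_before(struct rtt *r, bool is_v6)
{
#ifdef INET6
	if (is_v6)
		in6_losing_stub();
	else
#endif
	r->rttvar += r->srtt >> RTT_SHIFT;
	r->srtt = 0;
}

/* After r242257: the update runs for every address family. */
static void
reset_after(struct rtt *r, bool is_v6)
{
#ifdef INET6
	if (is_v6)
		in6_losing_stub();
#endif
	r->rttvar += r->srtt >> RTT_SHIFT;
	r->srtt = 0;
}
```

Without INET6 defined both functions compile to the same code, which is why the bug only showed up for IPv6 sessions.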
svn commit: r242260 - head/sys/netinet
Author: andre
Date: Sun Oct 28 18:56:57 2012
New Revision: 242260
URL: http://svn.freebsd.org/changeset/base/242260

Log:
  When retransmitting SYN in TCPS_SYN_SENT state use TCPTV_RTOBASE,
  the default retransmit timeout, as base to calculate the backoff
  time until next try instead of the TCP_REXMTVAL() macro which only
  works correctly when we already have measured an actual RTT+RTTVAR.

  Before this change it would cause the first retransmit at RTOBASE,
  the next four at the same time (!) about 200ms later, and then
  another one again RTOBASE later.

  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_timer.c

Modified: head/sys/netinet/tcp_timer.c
==============================================================================
--- head/sys/netinet/tcp_timer.c	Sun Oct 28 18:53:28 2012	(r242259)
+++ head/sys/netinet/tcp_timer.c	Sun Oct 28 18:56:57 2012	(r242260)
@@ -572,7 +572,7 @@ tcp_timer_rexmt(void * xtp)
 	tp->t_flags &= ~TF_PREVVALID;
 	TCPSTAT_INC(tcps_rexmttimeo);
 	if (tp->t_state == TCPS_SYN_SENT)
-		rexmt = TCP_REXMTVAL(tp) * tcp_syn_backoff[tp->t_rxtshift];
+		rexmt = TCPTV_RTOBASE * tcp_syn_backoff[tp->t_rxtshift];
 	else
 		rexmt = TCP_REXMTVAL(tp) * tcp_backoff[tp->t_rxtshift];
 	TCPT_RANGESET(tp->t_rxtcur, rexmt,
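The effect of the base value on the SYN schedule can be shown with a small sketch. The constants are assumptions for illustration: timeouts are in milliseconds, `RTOBASE_MS` stands in for TCPTV_RTOBASE (3 s here), and `syn_backoff[]` mirrors the shape of tcp_syn_backoff[] (several tries at the base interval, then doubling):

```c
/* Assumed values, for illustration only. */
#define RTOBASE_MS	3000
#define MAXRXTSHIFT	12

static const int syn_backoff[MAXRXTSHIFT + 1] =
    { 1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 64, 64 };

/* Timeout before the next SYN retransmit for a given backoff shift. */
static int
syn_rexmt_ms(int base_ms, int shift)
{
	return (base_ms * syn_backoff[shift]);
}
```

Feeding syn_rexmt_ms() a base of roughly 200 ms, which the log above says TCP_REXMTVAL() produced before any RTT had been measured, gives the same short interval for each of the first retries; with RTOBASE_MS they are spaced 3 s apart.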
svn commit: r242261 - head/sys/netinet
Author: andre
Date: Sun Oct 28 19:02:07 2012
New Revision: 242261
URL: http://svn.freebsd.org/changeset/base/242261

Log:
  For retransmits of SYN|ACK from the syncache use the slightly more
  aggressive special tcp_syn_backoff[] retransmit schedule instead of
  the normal tcp_backoff[] schedule for established connections.

  MFC after:	2 weeks

Modified:
  head/sys/netinet/tcp_syncache.c
  head/sys/netinet/tcp_timer.h

Modified: head/sys/netinet/tcp_syncache.c
==============================================================================
--- head/sys/netinet/tcp_syncache.c	Sun Oct 28 18:56:57 2012	(r242260)
+++ head/sys/netinet/tcp_syncache.c	Sun Oct 28 19:02:07 2012	(r242261)
@@ -391,7 +391,7 @@ static void
 syncache_timeout(struct syncache *sc, struct syncache_head *sch, int docallout)
 {
 	sc->sc_rxttime = ticks +
-		TCPTV_RTOBASE * (tcp_backoff[sc->sc_rxmits]);
+		TCPTV_RTOBASE * (tcp_syn_backoff[sc->sc_rxmits]);
 	sc->sc_rxmits++;
 	if (TSTMP_LT(sc->sc_rxttime, sch->sch_nextc)) {
 		sch->sch_nextc = sc->sc_rxttime;

Modified: head/sys/netinet/tcp_timer.h
==============================================================================
--- head/sys/netinet/tcp_timer.h	Sun Oct 28 18:56:57 2012	(r242260)
+++ head/sys/netinet/tcp_timer.h	Sun Oct 28 19:02:07 2012	(r242261)
@@ -170,6 +170,7 @@ extern int tcp_rexmit_slop;
 extern int tcp_msl;
 extern int tcp_ttl;			/* time to live for TCP segs */
 extern int tcp_backoff[];
+extern int tcp_syn_backoff[];
 
 extern int tcp_finwait2_timeout;
 extern int tcp_fast_finwait2_recycle;
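To compare the two schedules for SYN|ACK retransmits from the syncache, the sketch below sums the backoff multipliers up to a given transmit, giving elapsed time in units of TCPTV_RTOBASE. Both tables are assumed copies of tcp_backoff[] and tcp_syn_backoff[] for illustration, not values taken from this commit:

```c
#define MAXRXTSHIFT	12

/* Assumed copies of the kernel tables, for illustration only. */
static const int backoff[MAXRXTSHIFT + 1] =
    { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 512, 512, 512 };
static const int syn_backoff[MAXRXTSHIFT + 1] =
    { 1, 1, 1, 1, 1, 2, 4, 8, 16, 32, 64, 64, 64 };

/*
 * Elapsed time, in base-RTO units, until the ntransmits-th timeout fires
 * when timeout i waits table[i] units (as syncache_timeout() schedules).
 */
static int
elapsed_units(const int *table, int ntransmits)
{
	int t = 0;

	for (int i = 0; i < ntransmits; i++)
		t += table[i];
	return (t);
}
```

Over the first three timeouts the normal table waits 1 + 2 + 4 = 7 base intervals, while the SYN table waits 3.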
svn commit: r242262 - head/sys/netinet
Author: andre Date: Sun Oct 28 19:16:22 2012 New Revision: 242262 URL: http://svn.freebsd.org/changeset/base/242262 Log: Simplify and enhance the window change/update acceptance logic, especially in the presence of bi-directional data transfers. snd_wl1 tracks the right edge, including data in the reassembly queue, of valid incoming data. This makes it like rcv_nxt plus reassembly. It never goes backwards, to prevent older, possibly reordered segments from updating the window. snd_wl2 tracks the left edge of sent data. This makes it a duplicate of snd_una. However, joining them right now is difficult due to separate update dependencies in different places in the code flow. snd_wnd tracks the current send window advertised by the peer. In tcp_output() the effective window is calculated by subtracting the data already in flight, snd_nxt less snd_una, from it. ACKs become the main clock of window updates and will always update the window when the left edge of what we sent is advanced. The ACK clock is the primary signaling mechanism in ongoing data transfers. This works reliably even in the presence of reordering, reassembly and retransmitted segments. The ACK clock is most important because it determines how much data we are allowed to inject into the network. Zero window updates, which get us out of persist mode, are crucial. Here a segment that neither moves ACK nor SEQ but enlarges WND is accepted. When the ACK clock is not active (that is, we're not, or no longer, sending any data) any segment that moves the extended right SEQ edge, including out-of-order segments, updates the window. This gives us updates especially during ping-pong transfers where the peer isn't done consuming the already acknowledged data from the receive buffer while responding with data. The SSH protocol is a prime candidate to benefit from the improved bi-directional window update logic, as it has its own windowing mechanism on top of TCP and frequently sends back protocol ACKs.
Tcpdump provided by: darrenr Tested by: darrenr MFC after: 2 weeks Modified: head/sys/netinet/tcp_input.c Modified: head/sys/netinet/tcp_input.c == --- head/sys/netinet/tcp_input.c Sun Oct 28 19:02:07 2012 (r242261) +++ head/sys/netinet/tcp_input.c Sun Oct 28 19:16:22 2012 (r242262) @@ -1714,7 +1714,7 @@ tcp_do_segment(struct mbuf *m, struct tc * Pull snd_wl1 up to prevent seq wrap relative to * th_seq. */ - tp->snd_wl1 = th->th_seq; + tp->snd_wl1 = th->th_seq + tlen; /* * Pull rcv_up up to prevent seq wrap relative to * rcv_nxt. @@ -2327,7 +2327,6 @@ tcp_do_segment(struct mbuf *m, struct tc if (tlen == 0 && (thflags & TH_FIN) == 0) (void) tcp_reass(tp, (struct tcphdr *)0, 0, (struct mbuf *)0); - tp->snd_wl1 = th->th_seq - 1; /* FALLTHROUGH */ /* @@ -2638,12 +2637,10 @@ process_ACK: SOCKBUF_LOCK(&so->so_snd); if (acked > so->so_snd.sb_cc) { - tp->snd_wnd -= so->so_snd.sb_cc; sbdrop_locked(&so->so_snd, (int)so->so_snd.sb_cc); ourfinisacked = 1; } else { sbdrop_locked(&so->so_snd, acked); - tp->snd_wnd -= acked; ourfinisacked = 0; } /* NB: sowwakeup_locked() does an implicit unlock. */ @@ -2733,24 +2730,56 @@ step6: INP_WLOCK_ASSERT(tp->t_inpcb); /* -* Update window information. -* Don't look at window if no ACK: TAC's send garbage on first SYN. +* Window update acceptance logic. We have to be careful not +* to accept window updates from old segments in the presence +* of reordering or duplication. +* +* A window update is valid when: +* - the segment ACK's new data. +* - the segment carries new data and its ACK is current. +* - the segment matches the current SEQ and ACK but increases +*the window. This is the escape from persist mode, if there +*data to be sent. +* +* XXXAO: The presence of new SACK information would allow to +* accept window updates during retransmits. We don't have an +* easy way to test for that the moment. +* +* NB: The other side isn't allowed to shrink the window when +* not sending or acking new data.
This behavior is strongly +* discouraged by RFC793, section 3.7, page 42 anyways. +* +* XXXAO: tiwin = minmss to avoid jitter?
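The diff excerpt ends before the new acceptance test itself, so what follows is only a hypothetical predicate written from the three rules in the comment and the snd_wl1/snd_wl2 semantics described in the log; it is not the committed code. SEQ_GT() is defined in the signed-difference style of tcp_seq.h, and all parameter names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Modular "a after b", in the style of SEQ_GT() from tcp_seq.h. */
#define SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)

/*
 * Hypothetical acceptance check. wl1 is the right edge of valid incoming
 * data (snd_wl1), wl2 the left edge of sent data (snd_wl2).
 */
static bool
window_update_ok(tcp_seq seq, uint32_t tlen, tcp_seq ack, uint32_t win,
    tcp_seq wl1, tcp_seq wl2, uint32_t snd_wnd)
{
	/* Rule 1: the segment ACKs new data. */
	if (SEQ_GT(ack, wl2))
		return (true);
	/* Rule 2: it carries new data and its ACK is current. */
	if (tlen > 0 && ack == wl2 && SEQ_GT(seq + tlen, wl1))
		return (true);
	/* Rule 3: same SEQ and ACK, no data, larger window (persist escape). */
	if (tlen == 0 && ack == wl2 && seq == wl1 && win > snd_wnd)
		return (true);
	return (false);
}
```

Rule 2 is what lets out-of-order data move the window while the ACK clock is idle, and the modular comparison keeps it correct when the sequence space wraps.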
svn commit: r242263 - head/sys/netinet
Author: andre Date: Sun Oct 28 19:20:23 2012 New Revision: 242263 URL: http://svn.freebsd.org/changeset/base/242263 Log: Add SACK_PERMIT to the list of TCP options that are switched off after retransmitting a SYN three times. MFC after: 2 weeks Modified: head/sys/netinet/tcp_timer.c Modified: head/sys/netinet/tcp_timer.c == --- head/sys/netinet/tcp_timer.c Sun Oct 28 19:16:22 2012 (r242262) +++ head/sys/netinet/tcp_timer.c Sun Oct 28 19:20:23 2012 (r242263) @@ -585,7 +585,7 @@ tcp_timer_rexmt(void * xtp) * unknown-to-them TCP options. */ if ((tp->t_state == TCPS_SYN_SENT) && (tp->t_rxtshift == 3)) - tp->t_flags &= ~(TF_REQ_SCALE|TF_REQ_TSTMP); + tp->t_flags &= ~(TF_REQ_SCALE|TF_REQ_TSTMP|TF_SACK_PERMIT); /* * If we backed off this far, our srtt estimate is probably bogus. * Clobber it so we'll take the next rtt measurement as our srtt;
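The change itself is a plain bit-mask clear. A minimal sketch of it follows; the TF_* values and state constants are hypothetical placeholders, not the real definitions from tcp_var.h and tcp_fsm.h.

```c
#include <assert.h>

/* Hypothetical placeholder values, not the real tcp_var.h definitions. */
#define TF_NODELAY	0x0004
#define TF_REQ_SCALE	0x0020
#define TF_REQ_TSTMP	0x0080
#define TF_SACK_PERMIT	0x0200

/* Hypothetical subset of the TCP state constants. */
enum { ST_SYN_SENT = 2, ST_ESTABLISHED = 4 };

/* Return flags with the optional SYN options dropped on the third retry. */
static unsigned int
syn_option_fallback(int state, int rxtshift, unsigned int flags)
{
	if (state == ST_SYN_SENT && rxtshift == 3)
		flags &= ~(unsigned int)(TF_REQ_SCALE | TF_REQ_TSTMP |
		    TF_SACK_PERMIT);
	return (flags);
}
```

Because the test is `== 3` rather than `>= 3`, the options are cleared once, on the third retransmit; the cleared bits then stay cleared, so later SYNs go out without those options. Unrelated flags such as TF_NODELAY are untouched.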
svn commit: r242264 - head/sys/netinet
Author: andre Date: Sun Oct 28 19:22:18 2012 New Revision: 242264 URL: http://svn.freebsd.org/changeset/base/242264 Log: Update comment to reflect the change made in r242263. MFC after: 2 weeks Modified: head/sys/netinet/tcp_timer.c Modified: head/sys/netinet/tcp_timer.c == --- head/sys/netinet/tcp_timer.c Sun Oct 28 19:20:23 2012 (r242263) +++ head/sys/netinet/tcp_timer.c Sun Oct 28 19:22:18 2012 (r242264) @@ -578,7 +578,7 @@ tcp_timer_rexmt(void * xtp) TCPT_RANGESET(tp->t_rxtcur, rexmt, tp->t_rttmin, TCPTV_REXMTMAX); /* -* Disable rfc1323 if we haven't got any response to +* Disable RFC1323 and SACK if we haven't got any response to * our third SYN to work-around some broken terminal servers * (most of which have hopefully been retired) that have bad VJ * header compression code which trashes TCP segments containing
svn commit: r242266 - head/sys/netinet
Author: andre Date: Sun Oct 28 19:47:46 2012 New Revision: 242266 URL: http://svn.freebsd.org/changeset/base/242266 Log: Increase the initial CWND to 10 segments as defined in IETF TCPM draft-ietf-tcpm-initcwnd-05. It explains why the increased initial window improves the overall performance of many web services without risking congestion collapse. As long as it remains a draft it is placed under a sysctl marking it as experimental: net.inet.tcp.experimental.initcwnd10 = 1 When it becomes an official RFC soon, the sysctl will be changed to the RFC number and moved to net.inet.tcp. This implementation differs from the RFC draft in that it is a bit more conservative in the case of packet loss on SYN or SYN|ACK because we haven't reduced the default RTO to 1 second yet. Also the restart window isn't yet increased as allowed. Both will be adjusted with upcoming changes. It is enabled by default. In Linux it has been enabled since kernel 3.0. MFC after: 2 weeks Modified: head/sys/netinet/tcp_input.c head/sys/netinet/tcp_var.h Modified: head/sys/netinet/tcp_input.c == --- head/sys/netinet/tcp_input.c Sun Oct 28 19:38:42 2012 (r242265) +++ head/sys/netinet/tcp_input.c Sun Oct 28 19:47:46 2012 (r242266) @@ -159,6 +159,14 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, &VNET_NAME(tcp_do_rfc3390), 0, "Enable RFC 3390 (Increasing TCP's Initial Congestion Window)"); +SYSCTL_NODE(_net_inet_tcp, OID_AUTO, experimental, CTLFLAG_RW, 0, +"Experimental TCP extensions"); + +VNET_DEFINE(int, tcp_do_initcwnd10) = 1; +SYSCTL_VNET_INT(_net_inet_tcp_experimental, OID_AUTO, initcwnd10, CTLFLAG_RW, +&VNET_NAME(tcp_do_initcwnd10), 0, +"Enable draft-ietf-tcpm-initcwnd-05 (Increasing initial CWND to 10)"); + VNET_DEFINE(int, tcp_do_rfc3465) = 1; SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, rfc3465, CTLFLAG_RW, &VNET_NAME(tcp_do_rfc3465), 0, @@ -347,6 +355,7 @@ cc_conn_init(struct tcpcb *tp) * * RFC5681 Section 3.1 specifies the default conservative values. * RFC3390 specifies slightly more aggressive values.
+* Draft-ietf-tcpm-initcwnd-05 increases it to ten segments. * * If a SYN or SYN/ACK was lost and retransmitted, we have to * reduce the initial CWND to one segment as congestion is likely @@ -354,6 +363,9 @@ cc_conn_init(struct tcpcb *tp) */ if (tp->snd_cwnd == 1) tp->snd_cwnd = tp->t_maxseg; /* SYN(-ACK) lost */ + else if (V_tcp_do_initcwnd10) + tp->snd_cwnd = min(10 * tp->t_maxseg, + max(2 * tp->t_maxseg, 14600)); else if (V_tcp_do_rfc3390) tp->snd_cwnd = min(4 * tp->t_maxseg, max(2 * tp->t_maxseg, 4380)); Modified: head/sys/netinet/tcp_var.h == --- head/sys/netinet/tcp_var.h Sun Oct 28 19:38:42 2012 (r242265) +++ head/sys/netinet/tcp_var.h Sun Oct 28 19:47:46 2012 (r242266) @@ -611,6 +611,7 @@ VNET_DECLARE(int, tcp_mssdflt); /* XXX */ VNET_DECLARE(int, tcp_minmss); VNET_DECLARE(int, tcp_delack_enabled); VNET_DECLARE(int, tcp_do_rfc3390); +VNET_DECLARE(int, tcp_do_initcwnd10); VNET_DECLARE(int, tcp_sendspace); VNET_DECLARE(int, tcp_recvspace); VNET_DECLARE(int, path_mtu_discovery); @@ -623,6 +624,7 @@ VNET_DECLARE(int, tcp_abc_l_var); #define V_tcp_minmss VNET(tcp_minmss) #define V_tcp_delack_enabled VNET(tcp_delack_enabled) #define V_tcp_do_rfc3390 VNET(tcp_do_rfc3390) +#define V_tcp_do_initcwnd10 VNET(tcp_do_initcwnd10) #define V_tcp_sendspace VNET(tcp_sendspace) #define V_tcp_recvspace VNET(tcp_recvspace) #define V_path_mtu_discovery VNET(path_mtu_discovery)
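The initial-window choice in cc_conn_init() comes down to a few min/max expressions. Here is a userland sketch under stated assumptions. syn_lost models the `snd_cwnd == 1` marker. The last branch uses the RFC 5681 section 3.1 values, because that part of the function is not in the diff. cw_min() and cw_max() are local helpers, not the kernel's min()/max().

```c
#include <assert.h>

static unsigned long
cw_min(unsigned long a, unsigned long b)
{
	return (a < b ? a : b);
}

static unsigned long
cw_max(unsigned long a, unsigned long b)
{
	return (a > b ? a : b);
}

/* Initial congestion window in bytes for a given maximum segment size. */
static unsigned long
initial_cwnd(unsigned long maxseg, int syn_lost, int initcwnd10, int rfc3390)
{
	assert(maxseg > 0);
	if (syn_lost)
		return (maxseg);		/* SYN(-ACK) lost: one segment */
	if (initcwnd10)
		return (cw_min(10 * maxseg, cw_max(2 * maxseg, 14600)));
	if (rfc3390)
		return (cw_min(4 * maxseg, cw_max(2 * maxseg, 4380)));
	/* RFC 5681 section 3.1 values; this branch is not in the diff. */
	if (maxseg > 2190)
		return (2 * maxseg);
	if (maxseg > 1095)
		return (3 * maxseg);
	return (4 * maxseg);
}
```

With a 1460-byte MSS the draft's formula yields 14600 bytes (ten segments), compared with 4380 bytes (three segments) under RFC 3390. With a 9000-byte jumbo MSS it drops to two segments (18000 bytes), because the upper bound is max(2 * MSS, 14600).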
svn commit: r242267 - head/sys/netinet
Author: andre Date: Sun Oct 28 19:58:20 2012 New Revision: 242267 URL: http://svn.freebsd.org/changeset/base/242267 Log: If the user has closed the socket then drop a persisting connection after a much reduced timeout. Typically web servers close their sockets quickly under the assumption that the TCP connection goes away as well. That is not entirely true, however. If the peer closed the window we're going to wait for a long time with lots of data in the send buffer. MFC after: 2 weeks Modified: head/sys/netinet/tcp_timer.c Modified: head/sys/netinet/tcp_timer.c == --- head/sys/netinet/tcp_timer.c Sun Oct 28 19:47:46 2012 (r242266) +++ head/sys/netinet/tcp_timer.c Sun Oct 28 19:58:20 2012 (r242267) @@ -447,6 +447,16 @@ tcp_timer_persist(void *xtp) tp = tcp_drop(tp, ETIMEDOUT); goto out; } + /* +* If the user has closed the socket then drop a persisting +* connection after a much reduced timeout. +*/ + if (tp->t_state > TCPS_CLOSE_WAIT && + (ticks - tp->t_rcvtime) >= TCPTV_PERSMAX) { + TCPSTAT_INC(tcps_persistdrop); + tp = tcp_drop(tp, ETIMEDOUT); + goto out; + } tcp_setpersist(tp); tp->t_flags |= TF_FORCEDATA; (void) tcp_output(tp);
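Here is a sketch of the new drop condition, assuming hz = 1000 and a 60 s TCPTV_PERSMAX (check tcp_timer.h). The state enumeration follows FreeBSD's tcp_fsm.h ordering. That ordering is what makes `state > TCPS_CLOSE_WAIT` mean the application has closed its side and our FIN is queued or already sent.

```c
#include <assert.h>
#include <stdbool.h>

/* TCP states in FreeBSD's tcp_fsm.h order, reproduced for illustration. */
enum tcp_state {
	TCPS_CLOSED, TCPS_LISTEN, TCPS_SYN_SENT, TCPS_SYN_RECEIVED,
	TCPS_ESTABLISHED, TCPS_CLOSE_WAIT, TCPS_FIN_WAIT_1, TCPS_CLOSING,
	TCPS_LAST_ACK, TCPS_FIN_WAIT_2, TCPS_TIME_WAIT
};

#define HZ		1000		/* assumed */
#define TCPTV_PERSMAX	(60 * HZ)	/* assumed: 60 s */

/*
 * Should the persist timer give up on a connection the user already
 * closed? now and last_rcv are tick counters; the unsigned subtraction
 * keeps the elapsed time correct across counter wrap-around.
 */
static bool
persist_should_drop(enum tcp_state state, int now, int last_rcv)
{
	int idle = (int)((unsigned int)now - (unsigned int)last_rcv);

	return (state > TCPS_CLOSE_WAIT && idle >= TCPTV_PERSMAX);
}
```

The effect is that a closed socket stuck behind a zero window is released after roughly one maximum persist interval without hearing from the peer, instead of persisting indefinitely.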
Re: svn commit: r242254 - head/sys/netinet
On 28.10.2012 21:07, Gleb Smirnoff wrote: On Sun, Oct 28, 2012 at 06:07:34PM +, Andre Oppermann wrote: A @@ -296,8 +297,8 @@ syncache_destroy(void) A mtx_destroy(&sch->sch_mtx); A } A A - KASSERT(V_tcp_syncache.cache_count == 0, ("%s: cache_count %d not 0", A - __func__, V_tcp_syncache.cache_count)); A + KASSERT(uma_zone_get_cur(V_tcp_syncache.zone) == 0, A + ("%s: cache_count not 0", __func__)); A A /* Free the allocated global resources. */ A uma_zdestroy(V_tcp_syncache.zone); btw, keg_dtor(), which is called in uma_zdestroy(), printfs a warning (even on a non-INVARIANTS kernel) if the keg had items in it. So a leak won't go unnoticed. Thanks, I didn't know that. I'll leave the KASSERT() in, if you don't mind, to make it a bit more forceful than a printf that gets overlooked too easily. -- Andre
Re: svn commit: r242161 - in head/sys: net netinet netpfil/pf
On 28.10.2012 00:01, Gleb Smirnoff wrote: On Sat, Oct 27, 2012 at 12:58:52PM +0200, Andre Oppermann wrote: A On 26.10.2012 23:06, Gleb Smirnoff wrote: A Author: glebius A Date: Fri Oct 26 21:06:33 2012 A New Revision: 242161 A URL: http://svn.freebsd.org/changeset/base/242161 A A Log: A o Remove last argument to ip_fragment(), and obtain all needed information A on checksums directly from mbuf flags. This simplifies code. A o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in A hardware. Some driver may not announce CSUM_IP in theur if_hwassist, A although try to do checksums if CSUM_IP set on mbuf. Example is em(4). A A I'm not getting your description here? Why work around a bug in a driver A in ip_fragment() when we can fix the bug in the driver? Well, that was actually bug in the stack and a very special driver that demonstrates it. I may even agree that driver is incorrect, but the stack was incorrect, too. Ah, OK. Do you intend to fix the driver as well? -- Andre
Re: svn commit: r242266 - head/sys/netinet
On 28.10.2012 22:03, Rui Paulo wrote: On 28 Oct 2012, at 12:47, Andre Oppermann an...@freebsd.org wrote: Author: andre Date: Sun Oct 28 19:47:46 2012 New Revision: 242266 URL: http://svn.freebsd.org/changeset/base/242266 Log: Increase the initial CWND to 10 segments as defined in IETF TCPM draft-ietf-tcpm-initcwnd-05. It explains why the increased initial window improves the overall performance of many web services without risking congestion collapse. As long as it remains a draft it is placed under a sysctl marking it as experimental: net.inet.tcp.experimental.initcwnd10 = 1 When it becomes an official RFC soon the sysctl will be changed to the RFC number and moved to net.inet.tcp. This implementation differs from the RFC draft in that it is a bit more conservative in the case of packet loss on SYN or SYN|ACK because we haven't reduced the default RTO to 1 second yet. Also the restart window isn't yet increased as allowed. Both will be adjusted with upcoming changes. Is is enabled by default. In Linux it is enabled since kernel 3.0. Didn't you also forget to point out the problems associated with it? http://tools.ietf.org/html/draft-gettys-iw10-considered-harmful-00 IW10 has been heavily discussed on IETF TCPM. A lot of research on the impact has been done and the overall result has been a significant improvement with very little downside. Linux adopted it as the default setting quite some time ago. The bufferbloat issue is certainly real and should not be neglected. However, the solution to bufferbloat is not to send fewer packets into the network. In fact that doesn't even make a difference, simply because other packets will take their place. Bufferbloat can only be fixed in the devices that actually do the buffering. A much discussed and apparently good approach seems to be the CoDel algorithm for active buffer management.
-- Andre
Re: svn commit: r242263 - head/sys/netinet
On 28.10.2012 22:26, Rui Paulo wrote: On 28 Oct 2012, at 12:20, Andre Oppermann an...@freebsd.org wrote: Author: andre Date: Sun Oct 28 19:20:23 2012 New Revision: 242263 URL: http://svn.freebsd.org/changeset/base/242263 Log: Add SACK_PERMIT to the list of TCP options that are switched off after retransmitting a SYN three times. MFC after: 2 weeks Modified: head/sys/netinet/tcp_timer.c Modified: head/sys/netinet/tcp_timer.c == --- head/sys/netinet/tcp_timer.c Sun Oct 28 19:16:22 2012 (r242262) +++ head/sys/netinet/tcp_timer.c Sun Oct 28 19:20:23 2012 (r242263) @@ -585,7 +585,7 @@ tcp_timer_rexmt(void * xtp) * unknown-to-them TCP options. */ if ((tp->t_state == TCPS_SYN_SENT) && (tp->t_rxtshift == 3)) - tp->t_flags &= ~(TF_REQ_SCALE|TF_REQ_TSTMP); + tp->t_flags &= ~(TF_REQ_SCALE|TF_REQ_TSTMP|TF_SACK_PERMIT); /* * If we backed off this far, our srtt estimate is probably bogus. * Clobber it so we'll take the next rtt measurement as our srtt; Do you have any data regarding this commit or you're just trying to make sure the SACK option follows the same behaviour of the WSCALE/TSTMP options? The latter. For the purpose of turning off the options after three tries it is contradictory to leave SACK on. There is discussion of scrapping this whole option disabling altogether. Until then, it's better to have the 'correct' behavior. -- Andre
Re: svn commit: r242261 - head/sys/netinet
On 28.10.2012 22:34, Rui Paulo wrote: On 28 Oct 2012, at 12:02, Andre Oppermann an...@freebsd.org wrote: Author: andre Date: Sun Oct 28 19:02:07 2012 New Revision: 242261 URL: http://svn.freebsd.org/changeset/base/242261 Log: For retransmits of SYN|ACK from the syncache use the slightly more aggressive special tcp_syn_backoff[] retransmit schedule instead of the normal tcp_backoff[] schedule for established connections. How did you came up with the values for tcp_syn_backoff? I obviously understand the aggressiveness, but did you measure any significant improvement in connection establishment time and if so, on what type of links? I didn't come up with the values. tcp_syn_backoff[] was introduced almost 12 years ago by jlemon. For syncache it got lost somewhere along the line. There has been recent talk by some large FreeBSD web server operators of reducing SYN|ACK retransmit timeouts. This change fixes a part of the problem. The recent RFC on reducing the RTO will fix the other part. -- Andre
Re: svn commit: r242261 - head/sys/netinet
On 28.10.2012 23:01, Rui Paulo wrote: On Oct 28, 2012, at 14:56, Andre Oppermann an...@freebsd.org wrote: On 28.10.2012 22:34, Rui Paulo wrote: On 28 Oct 2012, at 12:02, Andre Oppermann an...@freebsd.org wrote: Author: andre Date: Sun Oct 28 19:02:07 2012 New Revision: 242261 URL: http://svn.freebsd.org/changeset/base/242261 Log: For retransmits of SYN|ACK from the syncache use the slightly more aggressive special tcp_syn_backoff[] retransmit schedule instead of the normal tcp_backoff[] schedule for established connections. How did you came up with the values for tcp_syn_backoff? I obviously understand the aggressiveness, but did you measure any significant improvement in connection establishment time and if so, on what type of links? I didn't come up with the values. tcp_syn_backoff[] was introduced almost 12 years ago by jlemon. For syncache it got lost somewhere along the line. Oh, I see. I read it backwards. There has been recent talk by some large FreeBSD web server operators of reducing SYN|ACK retransmit timeouts. This change fixes a part of the problem. The recent RFC on reducing the RTO will fix the other part. Which RFC? I'm only aware of draft-hurtig-tcpm-rtorestart. RFC6298. -- Andre
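For reference, here is a sketch of the RFC 6298 retransmission timer computation that Andre points to. It follows the RFC's equations using floating-point milliseconds. It is not FreeBSD's fixed-point implementation, and the clock granularity passed in is an assumption supplied by the caller.

```c
#include <assert.h>

/* RFC 6298 constants; all times are in milliseconds. */
#define RTO_ALPHA	0.125	/* SRTT gain, 1/8 */
#define RTO_BETA	0.25	/* RTTVAR gain, 1/4 */
#define RTO_K		4.0
#define RTO_INIT_MS	1000.0	/* (2.1) RTO before any RTT sample */
#define RTO_MIN_MS	1000.0	/* (2.4) round RTO up to 1 second */

struct rto_state {
	int	have_sample;
	double	srtt;
	double	rttvar;
	double	rto;
	double	g;		/* clock granularity G */
};

static double
dmax(double a, double b)
{
	return (a > b ? a : b);
}

static void
rto_init(struct rto_state *s, double granularity)
{
	s->have_sample = 0;
	s->srtt = s->rttvar = 0.0;
	s->rto = RTO_INIT_MS;
	s->g = granularity;
}

/* Feed one RTT measurement r (ms) and recompute the RTO. */
static void
rto_sample(struct rto_state *s, double r)
{
	assert(r >= 0.0);
	if (!s->have_sample) {
		/* (2.2) first measurement */
		s->srtt = r;
		s->rttvar = r / 2.0;
		s->have_sample = 1;
	} else {
		/* (2.3) RTTVAR must use the old SRTT, so update it first. */
		double err = s->srtt > r ? s->srtt - r : r - s->srtt;

		s->rttvar = (1.0 - RTO_BETA) * s->rttvar + RTO_BETA * err;
		s->srtt = (1.0 - RTO_ALPHA) * s->srtt + RTO_ALPHA * r;
	}
	s->rto = s->srtt + dmax(s->g, RTO_K * s->rttvar);
	if (s->rto < RTO_MIN_MS)
		s->rto = RTO_MIN_MS;
}
```

The 1 s floor from rule (2.4) dominates on low-latency paths. For example, a first sample of 100 ms gives 100 + 4 * 50 = 300 ms, which is rounded up to 1000 ms. The initial value of 1 s, down from the older 3 s, is the RTO reduction the log of r242266 refers to.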
Re: svn commit: r242266 - head/sys/netinet
On 28.10.2012 22:44, Rui Paulo wrote: On 28 Oct 2012, at 14:33, Andre Oppermann an...@freebsd.org wrote: IW10 has been heavily discussed on IETF TCPM. A lot of research on the impact has been done and the overall result has been a significant improvement with very little downside. Linux has adopted it for quite some time already as default setting. I have followed the discussions at tcpm, but I did not find any conclusive evidence of the benefit of IW10. I'm sure it can help in multiple situations but, as always, there are tradeoffs. Section 6 of draft-ietf-tcpm-initcwnd never convinced me. Then please raise your points on TCPM. The bufferbloat issue is certainly real and should not be neglected. However the solution to bufferbloat is not to send less packets into the network. In fact that doesn't even make a difference simply because other packets with take their place. Right, my point is that sending more packets in an already congested link will negatively affect the throughput / latency of the network. I'm not saying that it won't help you download a YouTube video faster, but the overall fairness of TCP will be reduced. That's always the case. Reality is that the majority of links these days are very fast compared to twenty years ago. We can afford to be a bit more aggressive here. Otherwise, taking your point to the extreme would mean that IW can only ever be 1 MSS. Then there is the unfairness of low-RTT versus high-RTT transfers. But that's inherent in any end-to-end feedback system. Buffer bloat can only be fixed in the devices that actually do the buffering. A much discussed and apparently good approach seems to be the Codel algorithm for active buffer management. Are you working on CoDel? :-) I'm looking into how the whole interface stuff, including ALTQ, can be improved in an SMP world. -- Andre
Re: svn commit: r242161 - in head/sys: net netinet netpfil/pf
On 29.10.2012 22:40, YongHyeon PYUN wrote: On Mon, Oct 29, 2012 at 09:21:00AM +0400, Gleb Smirnoff wrote: On Mon, Oct 29, 2012 at 01:41:04PM -0700, YongHyeon PYUN wrote: Y On Sun, Oct 28, 2012 at 02:01:37AM +0400, Gleb Smirnoff wrote: Y On Sat, Oct 27, 2012 at 12:58:52PM +0200, Andre Oppermann wrote: Y A On 26.10.2012 23:06, Gleb Smirnoff wrote: Y A Author: glebius Y A Date: Fri Oct 26 21:06:33 2012 Y A New Revision: 242161 Y A URL: http://svn.freebsd.org/changeset/base/242161 Y A Y A Log: Y A o Remove last argument to ip_fragment(), and obtain all needed information Y A on checksums directly from mbuf flags. This simplifies code. Y A o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in Y Y I'm not sure whether ti(4)'s checksum offloading for IP fragmented Y packets(CSUM_IP_FRAGS) still works after this change. ti(4) Y requires CSUM_IP should be set for IP fragmented packets. Not sure Y whether it's a bug or not. I have a ti(4) controller but I don't Y remember where I can find it and don't have a link Y parter(1000baseSX) to test it. :-( ti(4) declares both CSUM_IP and CSUM_IP_FRAGS, so ip_fragment() won't do Because it supports both CSUM_IP and CSUM_IP_FRAGS. Probably ti(4) is the only controller that supports TCP/UDP checksum offloading for an IP fragmented packet. This is a bit weird if it doesn't do the fragmentation itself. Computing the IP header checksum doesn't differ for normal and fragmented packets. The protocol checksum (TCP or UDP) stays the same in the case of IP level fragmentation. It is only visible in the first fragment which includes the protocol header. software checksums, and thus won't clear these flags. Potentially a driver that announces one flag in if_hwassist but relies on couple of flags to be set on mbuf is not correct. If a driver can't do single checksum processing independently from others, then it should set or clear appropriate flags in if_hwassist as a group.
Hmm, then what would be best way to achieve CSUM_IP_FRAGS in driver? I don't have clear idea how to utilize the hardware feature. The stack should tell that the mbuf needs TCP/UDP checksum offloading for IP fragmented packet(i.e. CSUM_IP_FRAGS is not set by upper stack). As I said, there can't be fragment checksumming without hardware-based fragmentation. We have three cases here:
1. TSO, where the hardware does the segmentation and the TCP and IP header checksums for each generated packet.
2. IP packet fragmentation, where a packet is split and the IP header checksum is recomputed for each fragment, but the protocol csum stays the same and is not modified.
3. UDP fragmentation, where a large packet is sent to the hardware and it first generates the UDP checksum and then splits it into IP fragments, each with its own IP header checksum.
So we end up with these possible large send hardware offload capabilities:
TSO: including IPv4hdr and TCP checksumming
UDP fragmentation: including IPv4hdr and UDP checksumming
IP fragmentation: including IPv4hdr checksumming
Besides that we have the packet <= MTU sized offload capabilities:
TCP checksumming
UDP checksumming
SCTP checksumming
IPv4hdr checksumming
Y A hardware. Some driver may not announce CSUM_IP in theur if_hwassist, Oh, that was a typo! Software was meant. That explains quite a bit of confusion. -- Andre
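Case 2 is the one the stack can always do in software. Each fragment gets a fresh IPv4 header checksum, while the transport checksum inside the first fragment is left alone. Below is a minimal RFC 1071 sketch. The function names are hypothetical, and the sample header used to exercise it is a commonly cited example whose checksum is 0xb861.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * RFC 1071 Internet checksum over an IPv4 header stored as bytes in
 * network order; hlen is the header length in bytes (a multiple of 4).
 */
static uint16_t
ipv4_hdr_cksum(const uint8_t *hdr, size_t hlen)
{
	uint32_t sum = 0;

	assert(hlen >= 20 && hlen % 4 == 0);
	for (size_t i = 0; i + 1 < hlen; i += 2)
		sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}

/*
 * Rewrite one fragment's MF flag and offset (in 8-byte units), then
 * recompute the header checksum. DF must already be clear for a packet
 * that is being fragmented, so it is not preserved here.
 */
static void
ipv4_set_frag(uint8_t *hdr, size_t hlen, uint16_t off_units, int more)
{
	uint16_t field = (uint16_t)((more ? 0x2000 : 0) | (off_units & 0x1fff));
	uint16_t ck;

	hdr[6] = (uint8_t)(field >> 8);
	hdr[7] = (uint8_t)(field & 0xff);
	hdr[10] = hdr[11] = 0;		/* checksum field is zero while summing */
	ck = ipv4_hdr_cksum(hdr, hlen);
	hdr[10] = (uint8_t)(ck >> 8);
	hdr[11] = (uint8_t)(ck & 0xff);
}
```

A receiver that sums the whole header, including the stored checksum, gets 0 back when the header is intact. That is the property a driver or the stack relies on after rewriting each fragment's offset field.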
svn commit: r242306 - head/sys/kern
Author: andre Date: Mon Oct 29 12:14:57 2012 New Revision: 242306 URL: http://svn.freebsd.org/changeset/base/242306 Log: Add logging for socket attach failures in sonewconn() during accept(2). Include the pointer to the PCB so it can be attributed to a particular application by corresponding it to netstat -A output. MFC after: 2 weeks Modified: head/sys/kern/uipc_socket.c Modified: head/sys/kern/uipc_socket.c == --- head/sys/kern/uipc_socket.c Mon Oct 29 10:22:00 2012 (r242305) +++ head/sys/kern/uipc_socket.c Mon Oct 29 12:14:57 2012 (r242306) @@ -135,6 +135,7 @@ __FBSDID("$FreeBSD$"); #include <sys/sysctl.h> #include <sys/uio.h> #include <sys/jail.h> +#include <sys/syslog.h> #include <net/vnet.h> @@ -500,16 +501,24 @@ sonewconn(struct socket *head, int conns over = (head->so_qlen > 3 * head->so_qlimit / 2); ACCEPT_UNLOCK(); #ifdef REGRESSION - if (regression_sonewconn_earlytest && over) + if (regression_sonewconn_earlytest && over) { #else - if (over) + if (over) { #endif + log(LOG_DEBUG, "%s: pcb %p: Listen queue overflow: " + "%i already in queue awaiting acceptance\n", + __func__, head->so_pcb, over); return (NULL); + } VNET_ASSERT(head->so_vnet != NULL, ("%s:%d so_vnet is NULL, head=%p", __func__, __LINE__, head)); so = soalloc(head->so_vnet); - if (so == NULL) + if (so == NULL) { + log(LOG_DEBUG, "%s: pcb %p: New socket allocation failure: " + "limit reached or out of memory\n", + __func__, head->so_pcb); return (NULL); + } if ((head->so_options & SO_ACCEPTFILTER) != 0) connstatus = 0; so->so_head = head; @@ -526,9 +535,16 @@ sonewconn(struct socket *head, int conns knlist_init_mtx(&so->so_rcv.sb_sel.si_note, SOCKBUF_MTX(&so->so_rcv)); knlist_init_mtx(&so->so_snd.sb_sel.si_note, SOCKBUF_MTX(&so->so_snd)); VNET_SO_ASSERT(head); - if (soreserve(so, head->so_snd.sb_hiwat, head->so_rcv.sb_hiwat) || - (*so->so_proto->pr_usrreqs->pru_attach)(so, 0, NULL)) { + if (soreserve(so, head->so_snd.sb_hiwat, head->so_rcv.sb_hiwat)) { + sodealloc(so); + log(LOG_DEBUG, "%s: pcb %p: soreserve() failed\n", + __func__, head->so_pcb);
+ return (NULL); + } + if ((*so->so_proto->pr_usrreqs->pru_attach)(so, 0, NULL)) { sodealloc(so); + log(LOG_DEBUG, "%s: pcb %p: pru_attach() failed\n", + __func__, head->so_pcb); return (NULL); } so->so_rcv.sb_lowat = head->so_rcv.sb_lowat;
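The overflow test at the top of sonewconn() lets so_qlen grow to one and a half times so_qlimit before new connections are refused and logged. Isolated as a small function with a hypothetical name, it shows the integer-division behaviour for tiny backlogs:

```c
#include <assert.h>
#include <stdbool.h>

/* True when the accept queue exceeds 1.5x the listen backlog (qlimit). */
static bool
listen_queue_over(int qlen, int qlimit)
{
	assert(qlen >= 0 && qlimit >= 0);
	return (qlen > 3 * qlimit / 2);
}
```

With listen(2) backlog 128 the 193rd queued connection is the first refused. With backlog 1, `3 * 1 / 2` truncates to 1, so the second connection already overflows. One detail may be worth double-checking in the diff: the overflow message passes `over`, the 0/1 result of this comparison, to a `%i` conversion described as "already in queue". Printing head->so_qlen appears to be what the message intends.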
svn commit: r242308 - head/sys/netinet
Author: andre Date: Mon Oct 29 12:17:02 2012 New Revision: 242308 URL: http://svn.freebsd.org/changeset/base/242308 Log: Define the delayed ACK timeout value directly as hz/10 instead of obfuscating it by going through PR_FASTHZ. No functional change. MFC after: 2 weeks Modified: head/sys/netinet/tcp_timer.h Modified: head/sys/netinet/tcp_timer.h == --- head/sys/netinet/tcp_timer.h Mon Oct 29 12:16:19 2012 (r242307) +++ head/sys/netinet/tcp_timer.h Mon Oct 29 12:17:02 2012 (r242308) @@ -118,7 +118,7 @@ #define TCP_MAXRXTSHIFT 12 /* maximum retransmits */ -#define TCPTV_DELACK (hz / PR_FASTHZ / 2) /* 100ms timeout */ +#define TCPTV_DELACK ( hz/10 ) /* 100ms timeout */ #ifdef TCPTIMERS static const char *tcptimers[] =
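The "No functional change" claim rests on how nested integer division composes: for non-negative hz, (hz / 5) / 2 equals hz / 10. Here is a brute-force check, assuming PR_FASTHZ is 5 as in sys/protosw.h of that era:

```c
#include <assert.h>

#define PR_FASTHZ	5	/* assumed, as in sys/protosw.h */

/* Old and new spellings of the delayed-ACK timeout, in ticks. */
static int
delack_old(int hz)
{
	return (hz / PR_FASTHZ / 2);
}

static int
delack_new(int hz)
{
	return (hz / 10);
}

/* Return 1 if both spellings agree for every hz in [1, max_hz]. */
static int
delack_forms_agree(int max_hz)
{
	assert(max_hz > 0);
	for (int hz = 1; hz <= max_hz; hz++)
		if (delack_old(hz) != delack_new(hz))
			return (0);
	return (1);
}
```

With the common hz = 1000 both give 100 ticks, the 100 ms named in the comment. The identity floor(floor(n / a) / b) = floor(n / (a * b)) holds for all non-negative integers, so the loop is a sanity check rather than the proof.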
svn commit: r242311 - head/sys/netinet
Author: andre Date: Mon Oct 29 13:16:33 2012 New Revision: 242311 URL: http://svn.freebsd.org/changeset/base/242311 Log: Forced commit to provide the correct commit message to r242251: Defer sending an independent window update if a delayed ACK is pending saving a packet. The window update then gets piggy-backed on the next already scheduled ACK. Added grammar fixes as well. MFC after: 2 weeks Modified: head/sys/netinet/tcp_output.c Modified: head/sys/netinet/tcp_output.c == --- head/sys/netinet/tcp_output.c Mon Oct 29 12:37:39 2012 (r242310) +++ head/sys/netinet/tcp_output.c Mon Oct 29 13:16:33 2012 (r242311) @@ -547,13 +547,13 @@ after_sack_rexmit: /* * Sending of standalone window updates. * -* Window updates important when we close our window due to a full -* socket buffer and are opening it again after the application +* Window updates are important when we close our window due to a +* full socket buffer and are opening it again after the application * reads data from it. Once the window has opened again and the * remote end starts to send again the ACK clock takes over and * provides the most current window information. * -* We must avoid to the silly window syndrome whereas every read +* We must avoid the silly window syndrome whereas every read * from the receive buffer, no matter how small, causes a window * update to be sent. We also should avoid sending a flurry of * window updates when the socket buffer had queued a lot of data
Re: svn commit: r242251 - head/sys/netinet
On 28.10.2012 18:30, Andre Oppermann wrote: Author: andre Date: Sun Oct 28 17:30:28 2012 New Revision: 242251 URL: http://svn.freebsd.org/changeset/base/242251 Log: When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to reduce the initial CWND to one segment. This reduction got lost some time ago due to a change in initialization ordering. Additionally in tcp_timer_rexmt() avoid entering fast recovery when we're still in TCPS_SYN_SENT state. Oops, this was the wrong commit message for this change. Here is the correct one: Defer sending an independent window update if a delayed ACK is pending saving a packet. The window update then gets piggy-backed on the next already scheduled ACK. I've forced commit r242311 with some grammar fixes to provide this information. -- Andre MFC after: 2 weeks Modified: head/sys/netinet/tcp_output.c Modified: head/sys/netinet/tcp_output.c == --- head/sys/netinet/tcp_output.c Sun Oct 28 17:25:08 2012 (r242250) +++ head/sys/netinet/tcp_output.c Sun Oct 28 17:30:28 2012 (r242251) @@ -551,10 +551,14 @@ after_sack_rexmit: * max size segments, or at least 50% of the maximum possible * window, then want to send a window update to peer. * Skip this if the connection is in T/TCP half-open state. -* Don't send pure window updates when the peer has closed -* the connection and won't ever send more data. +* +* Don't send an independent window update if a delayed +* ACK is pending (it will get piggy-backed on it) or the +* remote side already has done a half-close and won't send +* more data. */ if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) && + !(tp->t_flags & TF_DELACK) && !TCPS_HAVERCVDFIN(tp->t_state)) { /* * adv is the amount we can increase the window,
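Here is a simplified sketch of the standalone window update decision including the new TF_DELACK test. The flag values are hypothetical. The thresholds (two full-sized segments, or at least half the receive buffer) are taken from the code comment rather than from the full tcp_output() logic, which has further details not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real TF_* values live in tcp_var.h. */
#define TF_NEEDSYN	0x01
#define TF_DELACK	0x02

/*
 * Should a pure window update be sent now?
 * recwin: window we could advertise; adv: how much it would grow;
 * hiwat: receive buffer size; rcvd_fin: the peer has half-closed.
 */
static bool
want_window_update(long recwin, long adv, long maxseg, long hiwat,
    unsigned int flags, bool rcvd_fin)
{
	assert(maxseg > 0);
	if (recwin <= 0 || (flags & TF_NEEDSYN) || rcvd_fin)
		return (false);
	/* A delayed ACK is already pending: let the update ride on it. */
	if (flags & TF_DELACK)
		return (false);
	/* Worth a packet: two full segments, or at least half the buffer. */
	return (adv >= 2 * maxseg || 2 * adv >= hiwat);
}
```

The saving comes from ordering: when a delayed ACK is pending, the update is suppressed here, and the ACK that fires at most TCPTV_DELACK later carries the current window anyway.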
Re: svn commit: r242161 - in head/sys: net netinet netpfil/pf
On 30.10.2012 03:25, YongHyeon PYUN wrote: On Mon, Oct 29, 2012 at 09:20:59AM +0100, Andre Oppermann wrote: On 29.10.2012 22:40, YongHyeon PYUN wrote: On Mon, Oct 29, 2012 at 09:21:00AM +0400, Gleb Smirnoff wrote: On Mon, Oct 29, 2012 at 01:41:04PM -0700, YongHyeon PYUN wrote: Y On Sun, Oct 28, 2012 at 02:01:37AM +0400, Gleb Smirnoff wrote: Y On Sat, Oct 27, 2012 at 12:58:52PM +0200, Andre Oppermann wrote: Y A On 26.10.2012 23:06, Gleb Smirnoff wrote: Y A Author: glebius Y A Date: Fri Oct 26 21:06:33 2012 Y A New Revision: 242161 Y A URL: http://svn.freebsd.org/changeset/base/242161 Y A Y A Log: Y A o Remove last argument to ip_fragment(), and obtain all needed information Y A on checksums directly from mbuf flags. This simplifies code. Y A o Clear CSUM_IP from the mbuf in ip_fragment() if we did checksums in Y Y I'm not sure whether ti(4)'s checksum offloading for IP fragmented Y packets (CSUM_IP_FRAGS) still works after this change. ti(4) Y requires CSUM_IP to be set for IP fragmented packets. Not sure Y whether it's a bug or not. I have a ti(4) controller but I don't Y remember where I can find it and don't have a link Y partner (1000baseSX) to test it. :-( ti(4) declares both CSUM_IP and CSUM_IP_FRAGS, so ip_fragment() won't do Because it supports both CSUM_IP and CSUM_IP_FRAGS. Probably ti(4) is the only controller that supports TCP/UDP checksum offloading for an IP fragmented packet. This is a bit weird if it doesn't do the fragmentation itself. Computing the IP header checksum doesn't differ for normal and fragmented packets. The protocol checksum (TCP or UDP) stays the same in the case of IP level fragmentation. It is only visible in the first fragment, which includes the protocol header. My interpretation of how CSUM_IP_FRAGS works is the following. - Only the pseudo-header checksum for TCP/UDP is computed by the upper stack. - The controller has no ability to fragment the packet, so it should be done in the upper stack (i.e. ip_output()).
- When ip_output() has to fragment the packet, it just fragments the packet without completing the TCP/UDP and IP checksums. If the controller does not support the CSUM_IP_FRAGS feature, ip_output() can't delay the TCP/UDP checksum at this stage. - The fragmented packets are sent to the driver. The driver sets the appropriate bits of the DMA descriptor based on the fragmentation flags of the mbuf (M_FRAG, M_LASTFRAG) and issues the frame to the controller. - The controller firmware queues the fragmented frames up in its internal memory and holds off sending them out since it has to compute the TCP/UDP checksum. When it sees the frame that marks the end of the fragmented packet, it finally computes the TCP/UDP checksum and sends each frame out to the wire, computing the IP checksum on the fly. The difference is which one (upper stack vs. controller) computes the TCP/UDP/IP checksums. Such a behavior doesn't make much sense and probably wasn't used at all in practice. It's very complex as well. Plus you can't guarantee that there won't be other packets slipping into the interface queue in an SMP world. IP fragmentation really isn't done for TCP within the kernel. We try to prevent it as it would have a huge performance impact. Hence the internal MTU discovery and the Don't Fragment bit set on TCP packets. IP fragmentation does happen for large locally generated UDP packets. There, however, the past absence of UDP fragmentation offload, coupled with UDP checksum offloading, caused all fragmentation to be done at the UDP level before the packet hits ip_output. The remaining use of IP fragmentation is when the machine is acting as a router and has to send packets out on an interface with a smaller MTU than the one they came in on. So the only two useful features regarding UDP+IP fragmentation are: 1. IP fragmentation including UDP checksum calculation for locally generated large UDP packets. This is the TSO for UDP. 2. Pure IP fragmentation for in-transit packets.
Here only the IP header checksum needs to be recalculated for each fragment. The layer 4 checksums (UDP, TCP and others) stay the same. -- Andre software checksums, and thus won't clear these flags. Potentially a driver that announces one flag in if_hwassist but relies on a couple of flags being set on the mbuf is not correct. If a driver can't do single checksum processing independently from others, then it should set or clear the appropriate flags in if_hwassist as a group. Hmm, then what would be the best way to achieve CSUM_IP_FRAGS in a driver? I don't have a clear idea how to utilize the hardware feature. The stack should tell that the mbuf needs TCP/UDP checksum offloading for an IP fragmented packet (i.e. CSUM_IP_FRAGS is not set by upper stack). As I said there can't be fragment checksumming without hardware It's up to the controller's firmware. It does not send the fragmented frame until it computes the TCP/UDP checksum. based fragmentation. We have three cases
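To make Andre's point about in-transit fragmentation concrete: each fragment gets its own total length and flags/offset word, so only the IPv4 header checksum is recomputed per fragment, while the UDP/TCP checksum travels untouched in the first fragment's payload. Below is a minimal userland sketch of the RFC 1071 header checksum and that per-fragment fixup. The function names are hypothetical, and this is not FreeBSD's in_cksum_hdr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 ones'-complement sum over an IPv4 header in network byte order. */
static uint16_t
ip_hdr_cksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}

/*
 * Per-fragment fixup: rewrite total length and the flags/offset word,
 * then recompute only the header checksum. The layer 4 checksum inside
 * the first fragment's payload is not touched.
 */
static void
ip_frag_fixup(uint8_t *hdr, uint16_t total_len, uint16_t flags_off)
{
	size_t hlen = (size_t)(hdr[0] & 0x0f) * 4;	/* IHL in 32-bit words */
	uint16_t ck;

	hdr[2] = total_len >> 8;
	hdr[3] = total_len & 0xff;
	hdr[6] = flags_off >> 8;
	hdr[7] = flags_off & 0xff;
	hdr[10] = hdr[11] = 0;			/* checksum field must be zero */
	ck = ip_hdr_cksum(hdr, hlen);
	hdr[10] = ck >> 8;
	hdr[11] = ck & 0xff;
}
```

A receiver can verify a header by running the same sum over it with the checksum in place. A valid header yields 0.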
Re: svn commit: r242402 - in head/sys: kern vm
On 31.10.2012 20:40, Ian Lepore wrote: On Thu, 2012-11-01 at 06:30 +1100, Peter Jeremy wrote: On 2012-Oct-31 18:57:37 +, Attilio Rao atti...@freebsd.org wrote: On 10/31/12, Adrian Chadd adr...@freebsd.org wrote: Right, but you didn't make it configurable for us embedded peeps who still care about memory usage. How is this possible without breaking the module/kernel ABI? Memory usage may override ABI compatibility in an embedded environment. All that assuming you can actually prove a real performance loss even in the new cases. The issue with padding on embedded systems is memory utilisation rather than performance. There are potential performance hits too, in that embedded systems tend to have tiny caches (16K L1 with no L2, that sort of thing), so purposely padding things so that large parts of a cache line aren't used for anything wastes a scarce resource. You can define CACHE_LINE_SIZE to 0 on those platforms. Or to make it even more granular there could be a CACHE_LINE_SIZE_LOCKS that is used for lock padding. -- Andre That said, I think a point Attilio was trying to make is that we won't see a large hit because this doesn't affect a large number of mutex instances. I'm willing to accept his expert advice on that, not in small part because I'm not sure how I'd go about disputing it. :) I'm really busy with $work right now, but things should calm down in a couple weeks, and I'd be willing to do some measurements on arm systems then, if I can get some help on how to generate useful data. -- Ian
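To put a rough number on the embedded memory concern: padding a lock to a full cache line costs the line size minus the lock size for every instance. The sketch below uses a hypothetical `CACHE_LINE_SIZE_LOCKS` knob along the lines floated above, plus a small stand-in lock struct (not `struct mtx` or `struct mtx_padalign`), to show the effect with C11 `_Alignas`. On a typical LP64 build with 64-byte lines, it adds 48 bytes per lock.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical knob modeled on the CACHE_LINE_SIZE_LOCKS idea above;
 * it is not an existing FreeBSD option. Assumed to be a power of two.
 */
#ifndef CACHE_LINE_SIZE_LOCKS
#define CACHE_LINE_SIZE_LOCKS	64
#endif

/* Stand-in for a lock; much smaller than the real struct mtx. */
struct small_lock {
	volatile unsigned long	 lock_word;
	const char		*name;
};

static_assert(sizeof(struct small_lock) <= CACHE_LINE_SIZE_LOCKS,
    "sketch assumes the lock fits in one cache line");

/* The same lock, aligned and padded so that it owns a whole cache line. */
struct padded_lock {
	_Alignas(CACHE_LINE_SIZE_LOCKS) struct small_lock l;
};

/* Bytes added per lock instance by the padding. */
static size_t
lock_pad_overhead(void)
{
	return (sizeof(struct padded_lock) - sizeof(struct small_lock));
}
```

Multiplying that overhead by the number of padded lock instances gives the figure Ian's measurements would be weighed against.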
Re: svn commit: r242402 - in head/sys: kern vm
On 31.10.2012 19:10, Attilio Rao wrote: On Wed, Oct 31, 2012 at 6:07 PM, Attilio Rao atti...@freebsd.org wrote: Author: attilio Date: Wed Oct 31 18:07:18 2012 New Revision: 242402 URL: http://svn.freebsd.org/changeset/base/242402 Log: Rework the known mutexes to benefit about staying on their own cache line in order to avoid manual frobbing but using struct mtx_padalign. Interested developers can now dig and look for other mutexes to convert and just do it. Please, however, try to enclose a description about the benchmark which lead you believe the necessity to pad the mutex and possibly some numbers, in particular when the lock belongs to structures or the ABI itself. Next steps involve porting the same mtx(9) changes to rwlock(9) and port pvh global pmap lock to rwlock_padalign. I'd say for an rwlock you can make it unconditional. The very purpose of it is to be acquired by multiple CPUs causing cache line dirtying for every concurrent reader. Rwlocks are only ever used because multiple concurrent readers are expected. -- Andre
Re: svn commit: r242402 - in head/sys: kern vm
On 01.11.2012 12:53, Attilio Rao wrote: On 10/31/12, Andre Oppermann an...@freebsd.org wrote: On 31.10.2012 19:10, Attilio Rao wrote: On Wed, Oct 31, 2012 at 6:07 PM, Attilio Rao atti...@freebsd.org wrote: Author: attilio Date: Wed Oct 31 18:07:18 2012 New Revision: 242402 URL: http://svn.freebsd.org/changeset/base/242402 Log: Rework the known mutexes to benefit about staying on their own cache line in order to avoid manual frobbing but using struct mtx_padalign. Interested developers can now dig and look for other mutexes to convert and just do it. Please, however, try to enclose a description about the benchmark which lead you believe the necessity to pad the mutex and possibly some numbers, in particular when the lock belongs to structures or the ABI itself. Next steps involve porting the same mtx(9) changes to rwlock(9) and port pvh global pmap lock to rwlock_padalign. I'd say for an rwlock you can make it unconditional. The very purpose of it is to be acquired by multiple CPUs causing cache line dirtying for every concurrent reader. Rwlocks are only ever used because multiple concurrent readers are expected. I thought about it, but I think the same arguments as for mutexes remain. The real problem is that having default rwlocks pad-aligned will put showstoppers for their usage in sensitive structures. For example, I have plans to use them in vm_object at some point to replace VM_OBJECT_LOCK and I do want to avoid the extra-bloat for such structures. Also, please keep in mind that there is no direct relation between read acquisition and high contention with the latter being the real reason for having pad-aligned locks. I do not agree. If there is no contention then there is no need for a rwlock, a normal mutex would be sufficient. A rwlock is used when multiple concurrent readers are expected. Each read lock and unlock dirties the cache line for all other CPUs. Please note that I don't want to prevent you from doing the work all over for rwlocks.
It's just that the use case for a non-padded rwlock is very narrow. -- Andre
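Andre's argument rests on the fact that a shared acquisition is still a write to the lock word. The toy reader/writer lock below is built on C11 atomics and is illustrative only, not rw(9). Every successful read lock and every unlock is an atomic read-modify-write on the same word, so readers on different CPUs keep pulling that cache line away from each other. Padding does not remove that traffic, but it stops it from also invalidating unrelated data that would otherwise share the line.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy reader/writer lock for illustration; not rw(9). Bit 0 marks a
 * writer, the remaining bits count readers.
 */
#define TOY_WRITER	1UL
#define TOY_ONE_READER	2UL

struct toy_rwlock {
	atomic_ulong	lock_word;
};

static bool
toy_try_rlock(struct toy_rwlock *rw)
{
	unsigned long v = atomic_load(&rw->lock_word);

	/* On CAS failure v is reloaded, so the writer bit is rechecked. */
	while (!(v & TOY_WRITER)) {
		if (atomic_compare_exchange_weak(&rw->lock_word, &v,
		    v + TOY_ONE_READER))
			return (true);	/* a reader just wrote the line */
	}
	return (false);
}

static void
toy_runlock(struct toy_rwlock *rw)
{
	/* Unlocking is another write to the same shared word. */
	unsigned long prev = atomic_fetch_sub(&rw->lock_word, TOY_ONE_READER);

	assert(prev >= TOY_ONE_READER);	/* caller must hold a read lock */
}

static bool
toy_try_wlock(struct toy_rwlock *rw)
{
	unsigned long expected = 0;

	return (atomic_compare_exchange_strong(&rw->lock_word, &expected,
	    TOY_WRITER));
}
```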
Re: svn commit: r242421 - head/sys/dev/ixgbe
On 01.11.2012 00:50, Jack F Vogel wrote: Author: jfv Date: Wed Oct 31 23:50:36 2012 New Revision: 242421 URL: http://svn.freebsd.org/changeset/base/242421 Log: A few important fixes: - Testing TSO6 has led me to discover that HW RSC is a problematic feature, it is ONLY designed to work with IPv4 in the first place, and if IP forwarding is done it can't be disabled as LRO in the stack, also initial testing we've done at Intel shows an equal performance using TSO[46] on the TX and LRO on RX, if you ran older code on 82599 or later hardware you actually could have detrimental performance for this reason. So I am disabling the feature by default and all our adapters will now use LRO instead. Yes, it's very important that LRO is *not* used when forwarding is enabled (= acting as a router). - If you have flow control off and multiple queues it was possible when the buffer of one queue becomes full that all RX movement is stalled, to eliminate this problem a feature bit is now set that will allow packets to be dropped when full rather than stall. Note, the default is to have flow control on, and this keeps this from happening. - Because of the recent fixes in the stack, LRO is now auto-disabled when problematic, so I have decided to enable it by default in the capabilities in the driver. A very important cautionary note here: LRO is only good when combined with very low RTTs (that is, in LAN environments). On anything over 5 ms it breaks the TCP ACK clock badly and performance will suffer greatly. This is because every ACK increases the congestion window. With a greatly reduced ACK rate the ramping up of CWND on startup and after a loss event is severely limited. Combined with ABC (appropriate byte counting), where the CWND increases only once per ACK by at most one MSS, the effect is even more pronounced. The higher the RTT goes the worse the effects become. I haven't checked yet whether our soft-LRO does ACK compression or not.
If it does, we need a workaround and some tcp_input magic to reduce the negative impact. I'm looking into it. - There are some 1G modules used by some customers, a couple small tweaks to properly support those in the media code. - A note: we have now done some testing of TSO6 and using LRO with IPv6 and it all works great!! Seeing line rate in both directions in best cases. Thanks bz for your excellent work!! Indeed! Modified: head/sys/dev/ixgbe/ixgbe.c -- Andre
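Andre's ACK clock concern can be put into rough numbers. The model below is a back-of-the-envelope sketch, not tcp_input() code. It assumes a full window is sent each RTT, that the receiver returns one ACK per `segs_per_ack` segments (about 2 with delayed ACKs, more when LRO coalesces), and that RFC 3465 byte counting limits slow-start growth to `abc_l` segments per ACK.

```c
#include <assert.h>

/*
 * Back-of-the-envelope slow-start model in whole segments (hypothetical,
 * for illustration). Assumes one full cwnd is sent per RTT, one ACK comes
 * back per segs_per_ack segments, and RFC 3465 byte counting caps growth
 * at abc_l segments per ACK. A trailing partial ACK is ignored.
 */
static unsigned int
cwnd_after_one_rtt(unsigned int cwnd, unsigned int segs_per_ack,
    unsigned int abc_l)
{
	unsigned int acks, per_ack;

	assert(segs_per_ack > 0);
	acks = cwnd / segs_per_ack;
	per_ack = segs_per_ack < abc_l ? segs_per_ack : abc_l;
	return (cwnd + acks * per_ack);
}
```

With cwnd at 10 segments and L = 1, ordinary delayed ACKs grow the window to 15 segments in one RTT. An LRO path that acknowledges all 10 segments with a single ACK only reaches 11, and the gap compounds every RTT.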
svn commit: r211315 - head/sys/netinet
Author: andre Date: Sat Aug 14 20:40:55 2010 New Revision: 211315 URL: http://svn.freebsd.org/changeset/base/211315 Log: Disable TCP inflight limiter by default. It was experimental and interferes with the normal congestion control algorithms by instating a separate, possibly lower, ceiling for the amount of data that is in flight to the remote host. With high speed internet connections the inflight limit frequently has been estimated too low due to the noisy nature of the RTT measurements. This code gives way for the upcoming pluggable congestion control framework. It is the task of the congestion control algorithm to set the congestion window and amount of inflight data without external interference. Reviewed by: lstewart MFC after: 1 week Removal after: 1 month Modified: head/sys/netinet/tcp_subr.c Modified: head/sys/netinet/tcp_subr.c == --- head/sys/netinet/tcp_subr.c Sat Aug 14 20:12:10 2010 (r211314) +++ head/sys/netinet/tcp_subr.c Sat Aug 14 20:40:55 2010 (r211315) @@ -221,7 +221,7 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, SYSCTL_NODE(_net_inet_tcp, OID_AUTO, inflight, CTLFLAG_RW, 0, "TCP inflight data limiting"); -static VNET_DEFINE(int, tcp_inflight_enable) = 1; +static VNET_DEFINE(int, tcp_inflight_enable) = 0; #define V_tcp_inflight_enable VNET(tcp_inflight_enable) SYSCTL_VNET_INT(_net_inet_tcp_inflight, OID_AUTO, enable, CTLFLAG_RW, VNET_NAME(tcp_inflight_enable), 0,
svn commit: r211316 - head/sys/netinet
Author: andre Date: Sat Aug 14 21:04:27 2010 New Revision: 211316 URL: http://svn.freebsd.org/changeset/base/211316 Log: Change the messages of the ICMP bad port bandwidth limiter from a kernel printf to a log output with the priority of LOG_NOTICE. This way the messages still show up in /var/log/messages but no longer spam the console every other second on busy servers that are port scanned: Limiting open port RST response from 114 to 100 packets/sec PR: kern/147352 Submitted by: Eugene Grosbein eugen-at-eg sd rdtc ru MFC after: 1 week Modified: head/sys/netinet/ip_icmp.c Modified: head/sys/netinet/ip_icmp.c == --- head/sys/netinet/ip_icmp.c Sat Aug 14 20:40:55 2010 (r211315) +++ head/sys/netinet/ip_icmp.c Sat Aug 14 21:04:27 2010 (r211316) @@ -42,6 +42,7 @@ __FBSDID("$FreeBSD$"); #include <sys/time.h> #include <sys/kernel.h> #include <sys/sysctl.h> +#include <sys/syslog.h> #include <net/if.h> #include <net/if_types.h> @@ -975,7 +976,7 @@ badport_bandlim(int which) * the previous behaviour at the expense of added complexity. */ if (V_icmplim_output && opps > V_icmplim) - printf("Limiting %s from %d to %d packets/sec\n", + log(LOG_NOTICE, "Limiting %s from %d to %d packets/sec\n", r->type, opps, V_icmplim); } return 0; /* okay to send packet */
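For readers unfamiliar with the limiter: it counts responses per second, suppresses those above net.inet.icmp.icmplim, and reports the overshoot. The sketch below is a simplified fixed-window version in userland C. The kernel itself uses ppsratecheck(9), and the struct and function names here are hypothetical. It reports through syslog(3) at LOG_NOTICE, the userland counterpart of the kernel log() call this commit switches to.

```c
#include <assert.h>
#include <syslog.h>

/* Hypothetical fixed one-second window; the kernel uses ppsratecheck(9). */
struct bandlim {
	long	window;		/* second the current count belongs to */
	int	count;		/* responses requested in that second */
};

/*
 * Returns 1 if a response may be sent, 0 if it must be suppressed.
 * When a new second starts and the previous one went over the limit,
 * one LOG_NOTICE line is emitted instead of a console printf.
 */
static int
bandlim_check(struct bandlim *bl, long now, int limit, const char *type)
{
	assert(limit > 0);
	if (now != bl->window) {
		if (bl->count > limit)
			syslog(LOG_NOTICE, "Limiting %s from %d to %d packets/sec",
			    type, bl->count, limit);
		bl->window = now;
		bl->count = 0;
	}
	bl->count++;
	return (bl->count <= limit);
}
```

With a limit of 100 and 114 RST responses requested in one second, 100 are sent, and a single notice is logged when the next second begins.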
svn commit: r211327 - head/sys/netinet
Author: andre Date: Sun Aug 15 09:30:13 2010 New Revision: 211327 URL: http://svn.freebsd.org/changeset/base/211327 Log: Add more logging points for failures in syncache_socket() to report when a new socket couldn't be created because one of in_pcbinshash(), in6_pcbconnect() or in_pcbconnect() failed. Logging is conditional on net.inet.tcp.log_debug being enabled. MFC after: 1 week Modified: head/sys/netinet/tcp_syncache.c Modified: head/sys/netinet/tcp_syncache.c == --- head/sys/netinet/tcp_syncache.c Sun Aug 15 08:49:07 2010 (r211326) +++ head/sys/netinet/tcp_syncache.c Sun Aug 15 09:30:13 2010 (r211327) @@ -627,6 +627,7 @@ syncache_socket(struct syncache *sc, str struct inpcb *inp = NULL; struct socket *so; struct tcpcb *tp; + int error = 0; char *s; INP_INFO_WLOCK_ASSERT(&V_tcbinfo); @@ -675,7 +676,7 @@ syncache_socket(struct syncache *sc, str } #endif inp->inp_lport = sc->sc_inc.inc_lport; - if (in_pcbinshash(inp) != 0) { + if ((error = in_pcbinshash(inp)) != 0) { /* * Undo the assignments above if we failed to * put the PCB on the hash lists. @@ -687,6 +688,12 @@ syncache_socket(struct syncache *sc, str #endif inp->inp_laddr.s_addr = INADDR_ANY; inp->inp_lport = 0; + if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { + log(LOG_DEBUG, "%s; %s: in_pcbinshash failed " + "with error %i\n", + s, __func__, error); + free(s, M_TCPLOG); + } goto abort; } #ifdef IPSEC @@ -721,9 +728,15 @@ syncache_socket(struct syncache *sc, str laddr6 = inp->in6p_laddr; if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) inp->in6p_laddr = sc->sc_inc.inc6_laddr; - if (in6_pcbconnect(inp, (struct sockaddr *)&sin6, - thread0.td_ucred)) { + if ((error = in6_pcbconnect(inp, (struct sockaddr *)&sin6, + thread0.td_ucred)) != 0) { inp->in6p_laddr = laddr6; + if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { + log(LOG_DEBUG, "%s; %s: in6_pcbconnect failed " + "with error %i\n", + s, __func__, error); + free(s, M_TCPLOG); + } goto abort; } /* Override flowlabel from in6_pcbconnect.
*/ @@ -750,9 +763,15 @@ syncache_socket(struct syncache *sc, str laddr = inp->inp_laddr; if (inp->inp_laddr.s_addr == INADDR_ANY) inp->inp_laddr = sc->sc_inc.inc_laddr; - if (in_pcbconnect(inp, (struct sockaddr *)&sin, - thread0.td_ucred)) { + if ((error = in_pcbconnect(inp, (struct sockaddr *)&sin, + thread0.td_ucred)) != 0) { inp->inp_laddr = laddr; + if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { + log(LOG_DEBUG, "%s; %s: in_pcbconnect failed " + "with error %i\n", + s, __func__, error); + free(s, M_TCPLOG); + } goto abort; } }
svn commit: r211333 - head/sys/netinet
Author: andre Date: Sun Aug 15 13:25:18 2010 New Revision: 211333 URL: http://svn.freebsd.org/changeset/base/211333 Log: Fix the interaction between 'ICMP fragmentation needed' MTU updates, path MTU discovery and the tcp_minmss limiter for very small MTUs. When the MTU suggested by the gateway via ICMP (or, if there isn't any, the next smaller step from ip_next_mtu()) is lower than the floor enforced by net.inet.tcp.minmss (default 216), the value is ignored and the default MSS (512) is used instead. However the DF flag in the IP header is still set in tcp_output(), preventing fragmentation by the gateway. Fix this by using tcp_minmss as the MSS and clearing the DF flag if the suggested MTU is too low. This turns off path MTU discovery for the remainder of the session and allows fragmentation to be done by the gateway. Only MTUs smaller than 256 are affected. The smallest official MTU specified is for AX.25 packet radio at 256 octets. PR: kern/146628 Tested by: Matthew Luckie mjl-at-luckie org nz MFC after: 1 week Modified: head/sys/netinet/tcp_output.c head/sys/netinet/tcp_subr.c Modified: head/sys/netinet/tcp_output.c == --- head/sys/netinet/tcp_output.c Sun Aug 15 13:07:08 2010 (r211332) +++ head/sys/netinet/tcp_output.c Sun Aug 15 13:25:18 2010 (r211333) @@ -1186,8 +1186,10 @@ timer: * This might not be the best thing to do according to RFC3390 * Section 2. However the tcp hostcache migitates the problem * so it affects only the first tcp connection with a host. +* +* NB: Don't set DF on small MTU/MSS to have a safe fallback.
*/ - if (V_path_mtu_discovery) + if (V_path_mtu_discovery && tp->t_maxopd > V_tcp_minmss) ip->ip_off |= IP_DF; error = ip_output(m, tp->t_inpcb->inp_options, NULL, Modified: head/sys/netinet/tcp_subr.c == --- head/sys/netinet/tcp_subr.c Sun Aug 15 13:07:08 2010 (r211332) +++ head/sys/netinet/tcp_subr.c Sun Aug 15 13:25:18 2010 (r211333) @@ -1339,11 +1339,9 @@ tcp_ctlinput(int cmd, struct sockaddr *s if (!mtu) mtu = ip_next_mtu(ip->ip_len, 1); - if (mtu < max(296, V_tcp_minmss -+ sizeof(struct tcpiphdr))) - mtu = 0; - if (!mtu) - mtu = V_tcp_mssdflt + if (mtu < V_tcp_minmss ++ sizeof(struct tcpiphdr)) + mtu = V_tcp_minmss + sizeof(struct tcpiphdr); /* * Only cache the the MTU if it
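The two hunks work together. An ICMP-suggested MTU below minmss plus the 40-byte IPv4+TCP header is raised to that floor, and DF is only set while the segment size stays above minmss, so a connection pinned at the floor falls back to router fragmentation. Below is a standalone sketch of both checks. The helper names are hypothetical, and plain parameters replace `V_tcp_minmss` and `tp->t_maxopd`.

```c
#include <assert.h>
#include <stdbool.h>

/* sizeof(struct tcpiphdr): 20-byte IPv4 overlay + 20-byte TCP header. */
#define TCPIP_HDR_LEN	40

/*
 * Raise an ICMP-suggested MTU to the minmss floor instead of replacing
 * it with the default MSS. Assumes the caller already substituted
 * ip_next_mtu() for a zero MTU.
 */
static unsigned int
clamp_icmp_mtu(unsigned int mtu, unsigned int minmss)
{
	assert(mtu > 0);
	if (mtu < minmss + TCPIP_HDR_LEN)
		mtu = minmss + TCPIP_HDR_LEN;
	return (mtu);
}

/* Set DF only while the segment size is above the floor. */
static bool
want_df(bool path_mtu_discovery, unsigned int maxopd, unsigned int minmss)
{
	return (path_mtu_discovery && maxopd > minmss);
}
```

With the default minmss of 216, a reported MTU of 200 is raised to 256. That yields a segment size of roughly 216 bytes, which is not above minmss, so DF is no longer set.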
Re: svn commit: r211327 - head/sys/netinet
On 15.08.2010 11:41, Bjoern A. Zeeb wrote: On Sun, 15 Aug 2010, Andre Oppermann wrote: Author: andre Date: Sun Aug 15 09:30:13 2010 New Revision: 211327 URL: http://svn.freebsd.org/changeset/base/211327 Log: Add more logging points for failures in syncache_socket() to report when a new socket couldn't be created because one of in_pcbinshash(), in6_pcbconnect() or in_pcbconnect() failed. Logging is conditional on net.inet.tcp.log_debug being enabled. MFC after: 1 week Modified: head/sys/netinet/tcp_syncache.c Modified: head/sys/netinet/tcp_syncache.c == --- head/sys/netinet/tcp_syncache.c Sun Aug 15 08:49:07 2010 (r211326) +++ head/sys/netinet/tcp_syncache.c Sun Aug 15 09:30:13 2010 (r211327) @@ -627,6 +627,7 @@ syncache_socket(struct syncache *sc, str struct inpcb *inp = NULL; struct socket *so; struct tcpcb *tp; + int error = 0; Is there any need to initialize here? No. Actually not. It was just my style of using safe initial values. But here the return value is the socket pointer or NULL. The error is not passed back directly. Fixed in r211332. Thanks for noticing and reporting. -- Andre
svn commit: r211396 - head/sys/vm
Author: andre Date: Mon Aug 16 14:24:00 2010 New Revision: 211396 URL: http://svn.freebsd.org/changeset/base/211396 Log: Add uma_zone_get_max() to obtain the effective limit after a call to uma_zone_set_max(). The UMA zone limit is not exactly set to the value supplied but rounded up to completely fill the backing store increment (a page normally). This can lead to surprising situations where the number of elements allocated from UMA is higher than the supplied limit value. The new get function reads back the effective value so that the supplied limit value can be adjusted to the real limit. Reviewed by: jeffr MFC after: 1 week Modified: head/sys/vm/uma.h head/sys/vm/uma_core.c Modified: head/sys/vm/uma.h == --- head/sys/vm/uma.h Mon Aug 16 12:37:17 2010 (r211395) +++ head/sys/vm/uma.h Mon Aug 16 14:24:00 2010 (r211396) @@ -459,6 +459,18 @@ int uma_zone_set_obj(uma_zone_t zone, st void uma_zone_set_max(uma_zone_t zone, int nitems); /* + * Obtains the effective limit on the number of items in a zone + * + * Arguments: + * zone The zone to obtain the effective limit from + * + * Return: + * 0 No limit + * int The effective limit of the zone + */ +int uma_zone_get_max(uma_zone_t zone); + +/* * The following two routines (uma_zone_set_init/fini) * are used to set the backend init/fini pair which acts on an * object as it becomes allocated and is placed in a slab within Modified: head/sys/vm/uma_core.c == --- head/sys/vm/uma_core.c Mon Aug 16 12:37:17 2010 (r211395) +++ head/sys/vm/uma_core.c Mon Aug 16 14:24:00 2010 (r211396) @@ -2797,6 +2797,24 @@ uma_zone_set_max(uma_zone_t zone, int ni } /* See uma.h */ +int +uma_zone_get_max(uma_zone_t zone) +{ + int nitems; + uma_keg_t keg; + + ZONE_LOCK(zone); + keg = zone_first_keg(zone); + if (keg->uk_maxpages) + nitems = keg->uk_maxpages * keg->uk_ipers; + else + nitems = 0; + ZONE_UNLOCK(zone); + + return (nitems); +} + +/* See uma.h */ void uma_zone_set_init(uma_zone_t zone, uma_init uminit) {
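The rounding that makes uma_zone_get_max() necessary can be modeled in a few lines. This simplified sketch assumes one page per slab and a fixed items-per-page count (`ipers`). It is not the uma_core.c code, and the helper name is hypothetical. In a kernel consumer, the corresponding pattern is to call uma_zone_set_max() and then read the effective value back with uma_zone_get_max().

```c
#include <assert.h>

/*
 * Simplified model of the limit rounding (hypothetical helper): one page
 * per slab, ipers items per page. The requested limit is rounded up to
 * whole pages, so the effective limit can exceed the request. 0 keeps
 * its "no limit" meaning.
 */
static int
effective_zone_limit(int nitems, int ipers)
{
	int pages;

	assert(ipers > 0 && nitems >= 0);
	if (nitems == 0)
		return (0);
	pages = (nitems + ipers - 1) / ipers;	/* round up to full pages */
	return (pages * ipers);
}
```

For example, asking for 100 items in a zone that fits 15 items per page yields an effective limit of 105.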
svn commit: r211464 - head/sys/netinet
Author: andre Date: Wed Aug 18 18:05:54 2010 New Revision: 211464 URL: http://svn.freebsd.org/changeset/base/211464 Log: If a TCP connection has been idle for one retransmit timeout or more it must reset its congestion window back to the initial window. RFC3390 has increased the initial window from 1 segment to up to 4 segments. The initial window increase of RFC3390 wasn't reflected into the restart window which remained at its original defaults of 4 segments for local and 1 segment for all other connections. Both values are controllable through sysctl net.inet.tcp.local_slowstart_flightsize and net.inet.tcp.slowstart_flightsize. The increase helps TCP's slow start algorithm to open up the congestion window much faster. Reviewed by: lstewart MFC after: 1 week Modified: head/sys/netinet/tcp_output.c head/sys/netinet/tcp_var.h Modified: head/sys/netinet/tcp_output.c == --- head/sys/netinet/tcp_output.c Wed Aug 18 17:40:10 2010 (r211463) +++ head/sys/netinet/tcp_output.c Wed Aug 18 18:05:54 2010 (r211464) @@ -140,7 +140,7 @@ tcp_output(struct tcpcb *tp) { struct socket *so = tp->t_inpcb->inp_socket; long len, recwin, sendwin; - int off, flags, error; + int off, flags, error, rw; struct mbuf *m; struct ip *ip = NULL; struct ipovly *ipov = NULL; @@ -176,23 +176,34 @@ tcp_output(struct tcpcb *tp) idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una); if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur) { /* -* We have been idle for a while and no acks are -* expected to clock out any data we send -- -* slow start to get ack clock running again. +* If we've been idle for more than one retransmit +* timeout the old congestion window is no longer +* current and we have to reduce it to the restart +* window before we can transmit again. * -* Set the slow-start flight size depending on whether -* this is a local network or not. +* The restart window is the initial window or the last +* CWND, whichever is smaller.
+* +* This is done to prevent us from flooding the path with +* a full CWND at wirespeed, overloading router and switch +* buffers along the way. +* +* See RFC5681 Section 4.1. Restarting Idle Connections. */ - int ss = V_ss_fltsz; + if (V_tcp_do_rfc3390) + rw = min(4 * tp->t_maxseg, + max(2 * tp->t_maxseg, 4380)); #ifdef INET6 - if (isipv6) { - if (in6_localaddr(&tp->t_inpcb->in6p_faddr)) - ss = V_ss_fltsz_local; - } else -#endif /* INET6 */ - if (in_localaddr(tp->t_inpcb->inp_faddr)) - ss = V_ss_fltsz_local; - tp->snd_cwnd = tp->t_maxseg * ss; + else if ((isipv6 ? in6_localaddr(&tp->t_inpcb->in6p_faddr) : + in_localaddr(tp->t_inpcb->inp_faddr))) +#else + else if (in_localaddr(tp->t_inpcb->inp_faddr)) +#endif + rw = V_ss_fltsz_local * tp->t_maxseg; + else + rw = V_ss_fltsz * tp->t_maxseg; + + tp->snd_cwnd = min(rw, tp->snd_cwnd); } tp->t_flags &= ~TF_LASTIDLE; if (idle) { Modified: head/sys/netinet/tcp_var.h == --- head/sys/netinet/tcp_var.h Wed Aug 18 17:40:10 2010 (r211463) +++ head/sys/netinet/tcp_var.h Wed Aug 18 18:05:54 2010 (r211464) @@ -565,6 +565,7 @@ extern int tcp_log_in_vain; VNET_DECLARE(int, tcp_mssdflt); /* XXX */ VNET_DECLARE(int, tcp_minmss); VNET_DECLARE(int, tcp_delack_enabled); +VNET_DECLARE(int, tcp_do_rfc3390); VNET_DECLARE(int, tcp_do_newreno); VNET_DECLARE(int, path_mtu_discovery); VNET_DECLARE(int, ss_fltsz); @@ -575,6 +576,7 @@ VNET_DECLARE(int, ss_fltsz_local); #define V_tcp_mssdflt VNET(tcp_mssdflt) #define V_tcp_minmss VNET(tcp_minmss) #define V_tcp_delack_enabled VNET(tcp_delack_enabled) +#define V_tcp_do_rfc3390 VNET(tcp_do_rfc3390) #define V_tcp_do_newreno VNET(tcp_do_newreno) #define V_path_mtu_discovery VNET(path_mtu_discovery) #define V_ss_fltsz VNET(ss_fltsz)
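Here is the new restart-window computation, pulled out of tcp_output() into a standalone function for illustration. Parameter names are hypothetical. `ss_fltsz` and `ss_fltsz_local` stand in for the two slowstart_flightsize sysctls, and `is_local` stands in for the in_localaddr()/in6_localaddr() test.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long
umin(unsigned long a, unsigned long b)
{
	return (a < b ? a : b);
}

static unsigned long
umax(unsigned long a, unsigned long b)
{
	return (a > b ? a : b);
}

/*
 * Restart window after an idle period (RFC 5681 section 4.1), following
 * the shape of r211464. All inputs are plain parameters here.
 */
static unsigned long
restart_cwnd(unsigned long cwnd, unsigned long maxseg, bool do_rfc3390,
    bool is_local, unsigned long ss_fltsz, unsigned long ss_fltsz_local)
{
	unsigned long rw;

	assert(maxseg > 0);
	if (do_rfc3390)
		/* RFC 3390 initial window: min(4*MSS, max(2*MSS, 4380)) */
		rw = umin(4 * maxseg, umax(2 * maxseg, 4380));
	else if (is_local)
		rw = ss_fltsz_local * maxseg;
	else
		rw = ss_fltsz * maxseg;

	/* Only ever shrink cwnd to the restart window, never grow it. */
	return (umin(rw, cwnd));
}
```

With a 1460-byte MSS and RFC 3390 enabled, a large idle cwnd is cut to 4380 bytes (three segments). A cwnd that is already smaller is left alone.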
Re: svn commit: r211503 - head/sys/mips/atheros
On 19.08.2010 13:53, Adrian Chadd wrote: Author: adrian Date: Thu Aug 19 11:53:55 2010 New Revision: 211503 URL: http://svn.freebsd.org/changeset/base/211503 Log: Add some initial AR724X chipset support. This is untested but should at least allow an AR724X to boot. Isn't this something that should be done on a project branch and merged back when in a good working state? The current code is lacking the detail needed to expose the PCIe bus. It is also lacking any NIC, PLL or flush/WB code. -- Andre
Re: svn commit: r211503 - head/sys/mips/atheros
On 19.08.2010 19:20, M. Warner Losh wrote: In message: 4c6d2933.9020...@freebsd.org Andre Oppermann an...@freebsd.org writes: : On 19.08.2010 13:53, Adrian Chadd wrote: : Author: adrian : Date: Thu Aug 19 11:53:55 2010 : New Revision: 211503 : URL: http://svn.freebsd.org/changeset/base/211503 : : Log: : Add some initial AR724X chipset support. : : This is untested but should at least allow an AR724X to boot. : : Isn't this something that should be done on a project branch and : merged back when in a good working state? We don't have a branch for mips stuff these days. This stuff is OK, since the AR724X is just being rolled out right now... For non AR724x systems, this won't affect anything... I was more concerned about tree breakage for non-tested code. When developing something bleeding edge it is often useful to just commit some stuff and have it sorted out later. In head this is more dangerous. A small AR724X development branch would be ideal for this. Branching is cheap with SVN these days. -- Andre
Re: svn commit: r211503 - head/sys/mips/atheros
On 19.08.2010 20:42, M. Warner Losh wrote: In message: 4c6d6fd7.7060...@freebsd.org Andre Oppermann an...@freebsd.org writes: : On 19.08.2010 19:20, M. Warner Losh wrote: : In message: 4c6d2933.9020...@freebsd.org : Andre Oppermann an...@freebsd.org writes: : : On 19.08.2010 13:53, Adrian Chadd wrote: : : Author: adrian : : Date: Thu Aug 19 11:53:55 2010 : : New Revision: 211503 : : URL: http://svn.freebsd.org/changeset/base/211503 : : : : Log: : : Add some initial AR724X chipset support. : : : : This is untested but should at least allow an AR724X to boot. : : : : Isn't this something that should be done on a project branch and : : merged back when in a good working state? : : We don't have a branch for mips stuff these days. This stuff is OK, : since the AR724X is just being rolled out right now... For non AR724x : systems, this won't affect anything... : : I was more concerned about tree breakage for non-tested code. When : developing something bleeding edge it is often useful to just commit : some stuff and have it sorted out later. In head this is more : dangerous. A small AR724X development branch would be ideal for : this. Branching is cheap with SVN these days. Merging isn't that cheap with svn. The svn:mergeinfo properties make them a pita. Given that this code won't break anything, except possibly the now-unsupported AR724x, I think a branch would be overkill. We'd have to drag that branch along all the time until we can get actual hardware to test it on, which is a high overhead. I didn't know that branching and merging aren't that easy with SVN after all. This was one of the supposed benefits of switching from CVS. If there is no risk of head breakage I don't mind at all. -- Andre
svn commit: r211874 - head/sys/netinet
Author: andre
Date: Fri Aug 27 12:34:53 2010
New Revision: 211874
URL: http://svn.freebsd.org/changeset/base/211874

Log:
  Use timestamp modulo comparison macro for automatic receive buffer
  scaling to correctly handle wrapping of ticks value.

  MFC after:	1 week

Modified:
  head/sys/netinet/tcp_input.c

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Fri Aug 27 11:08:11 2010	(r211873)
+++ head/sys/netinet/tcp_input.c	Fri Aug 27 12:34:53 2010	(r211874)
@@ -1441,7 +1441,7 @@ tcp_do_segment(struct mbuf *m, struct tc
 	if (V_tcp_do_autorcvbuf &&
 	    to.to_tsecr &&
 	    (so->so_rcv.sb_flags & SB_AUTOSIZE)) {
-		if (to.to_tsecr > tp->rfbuf_ts &&
+		if (TSTMP_GT(to.to_tsecr, tp->rfbuf_ts) &&
 		    to.to_tsecr - tp->rfbuf_ts < hz) {
 			if (tp->rfbuf_cnt >
 			    (so->so_rcv.sb_hiwat / 8 * 7) &&
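Why the macro matters: `ticks` is a free-running counter, so once it wraps, a plain `>` between an older and a newer timestamp gives the wrong answer. The following is a minimal user-space sketch. It assumes the macro has the serial-number form used in FreeBSD's `tcp_seq.h`, which casts the unsigned difference to a signed 32-bit value. The helper `naive_gt` is hypothetical and exists only for contrast.

```c
#include <stdint.h>

/*
 * Wrap-safe "a is later than b" for 32-bit timestamps: the unsigned
 * subtraction wraps modulo 2^32, and reading the result as a signed
 * value tells on which side of b the value a lies, provided the two
 * are less than 2^31 apart.
 */
#define TSTMP_GT(a, b)	((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)

/* Hypothetical helper: the plain comparison the commit replaced. */
static int
naive_gt(uint32_t a, uint32_t b)
{
	return (a > b);
}
```

Just after a wrap (for example, a = 5 and b = 0xFFFFFFF0), `TSTMP_GT` still reports a as later, while the plain comparison does not. Assuming both fields are unsigned 32-bit, the unchanged `to.to_tsecr - tp->rfbuf_ts < hz` test is likewise computed modulo 2^32.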
svn commit: r212731 - head/sys/netinet
Author: andre
Date: Thu Sep 16 12:13:06 2010
New Revision: 212731
URL: http://svn.freebsd.org/changeset/base/212731

Log:
  Improve comment to TCP_MINMSS by taking the wording from lstewart
  (with a small difference in the last paragraph though) as suggested
  by jhb.

  Clarify that the 'reviewed by' in r212653 by lstewart was for the
  functional change, not the comments in the committed version.

Modified:
  head/sys/netinet/tcp.h

Modified: head/sys/netinet/tcp.h
==============================================================================
--- head/sys/netinet/tcp.h	Thu Sep 16 12:05:46 2010	(r212730)
+++ head/sys/netinet/tcp.h	Thu Sep 16 12:13:06 2010	(r212731)
@@ -120,18 +120,18 @@ struct tcphdr {
 #define	TCP6_MSS	1220

 /*
- * Limit the lowest MSS we accept from path MTU discovery and the TCP SYN MSS
- * option.  Allowing too low values of MSS can consume significant amounts of
- * resources and be used as a form of a resource exhaustion attack.
+ * Limit the lowest MSS we accept for path MTU discovery and the TCP SYN MSS
+ * option.  Allowing low values of MSS can consume significant resources and
+ * be used to mount a resource exhaustion attack.
  * Connections requesting lower MSS values will be rounded up to this value
- * and the IP_DF flag is cleared to allow fragmentation along the path.
+ * and the IP_DF flag will be cleared to allow fragmentation along the path.
  *
  * See tcp_subr.c tcp_minmss SYSCTL declaration for more comments.  Setting
  * it to 0 disables the minmss check.
  *
- * The default value is fine for the smallest official link MTU (256 bytes,
- * AX.25 packet radio) in the Internet.  However it is very unlikely to come
- * across such low MTU interfaces these days (anno domini 2003).
+ * The default value is fine for TCP across the Internet's smallest official
+ * link MTU (256 bytes for AX.25 packet radio).  However, a connection is very
+ * unlikely to come across such low MTU interfaces these days (anno domini 2003).
  */
 #define	TCP_MINMSS 216
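Where the 216 comes from, and what "rounded up" means in practice, can be shown in a short sketch. It assumes 20-byte IPv4 and TCP headers without options. The struct and function names (`mss_decision`, `clamp_peer_mss`) and the three size macros are hypothetical. The kernel's real logic lives in its MSS handling and differs in detail.

```c
#include <stdbool.h>

#define AX25_LINK_MTU	256	/* smallest official link MTU (AX.25) */
#define IPV4_HDR_LEN	20	/* IPv4 header without options */
#define TCP_HDR_LEN	20	/* TCP header without options */

/* 256 - 20 - 20 = 216, the value of TCP_MINMSS. */
#define MIN_MSS		(AX25_LINK_MTU - IPV4_HDR_LEN - TCP_HDR_LEN)

struct mss_decision {
	int	mss;		/* MSS actually used */
	bool	set_df;		/* whether IP_DF stays set */
};

/*
 * Hypothetical helper: round a too-small MSS up to the minimum and
 * clear DF so routers may fragment along the path.  A minmss of 0
 * disables the check, as the tcp.h comment describes.
 */
static struct mss_decision
clamp_peer_mss(int offered, int minmss)
{
	struct mss_decision d = { offered, true };

	if (minmss > 0 && offered < minmss) {
		d.mss = minmss;
		d.set_df = false;
	}
	return (d);
}
```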
svn commit: r212769 - head/share/man/man4
Author: andre
Date: Thu Sep 16 22:11:55 2010
New Revision: 212769
URL: http://svn.freebsd.org/changeset/base/212769

Log:
  The inflight bandwidth limiter was removed in r212765.

Modified:
  head/share/man/man4/tcp.4

Modified: head/share/man/man4/tcp.4
==============================================================================
--- head/share/man/man4/tcp.4	Thu Sep 16 21:18:25 2010	(r212768)
+++ head/share/man/man4/tcp.4	Thu Sep 16 22:11:55 2010	(r212769)
@@ -32,7 +32,7 @@
 .\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
 .\" $FreeBSD$
 .\"
-.Dd August 16, 2008
+.Dd September 16, 2010
 .Dt TCP 4
 .Os
 .Sh NAME
@@ -383,72 +383,6 @@ code.
 For this reason, we use 200ms of slop and a near-0 minimum,
 which gives us an effective minimum of 200ms (similar to
 .Tn Linux ) .
-.It Va inflight.enable
-Enable
-.Tn TCP
-bandwidth-delay product limiting.
-An attempt will be made to calculate
-the bandwidth-delay product for each individual
-.Tn TCP
-connection, and limit
-the amount of inflight data being transmitted, to avoid building up
-unnecessary packets in the network.
-This option is recommended if you
-are serving a lot of data over connections with high bandwidth-delay
-products, such as modems, GigE links, and fast long-haul WANs, and/or
-you have configured your machine to accommodate large
-.Tn TCP
-windows.
-In such
-situations, without this option, you may experience high interactive
-latencies or packet loss due to the overloading of intermediate routers
-and switches.
-Note that bandwidth-delay product limiting only effects
-the transmit side of a
-.Tn TCP
-connection.
-.It Va inflight.debug
-Enable debugging for the bandwidth-delay product algorithm.
-.It Va inflight.min
-This puts a lower bound on the bandwidth-delay product window, in bytes.
-A value of 1024 is typically used for debugging.
-6000-16000 is more typical in a production installation.
-Setting this value too low may result in
-slow ramp-up times for bursty connections.
-Setting this value too high effectively disables the algorithm.
-.It Va inflight.max
-This puts an upper bound on the bandwidth-delay product window, in bytes.
-This value should not generally be modified, but may be used to set a
-global per-connection limit on queued data, potentially allowing you to
-intentionally set a less than optimum limit, to smooth data flow over a
-network while still being able to specify huge internal
-.Tn TCP
-buffers.
-.It Va inflight.stab
-The bandwidth-delay product algorithm requires a slightly larger window
-than it otherwise calculates for stability.
-This parameter determines the extra window in maximal packets / 10.
-The default value of 20 represents 2 maximal packets.
-Reducing this value is not recommended, but you may
-come across a situation with very slow links where the
-.Xr ping 8
-time
-reduction of the default inflight code is not sufficient.
-If this case occurs, you should first try reducing
-.Va inflight.min
-and, if that does not
-work, reduce both
-.Va inflight.min
-and
-.Va inflight.stab ,
-trying values of
-15, 10, or 5 for the latter.
-Never use a value less than 5.
-Reducing
-.Va inflight.stab
-can lead to upwards of a 20% underutilization of the link
-as well as reducing the algorithm's ability to adapt to changing
-situations and should only be done as a last resort.
 .It Va rfc3042
 Enable the Limited Transmit algorithm as described in RFC 3042.
 It helps avoid timeouts on lossy links and also when the congestion window
svn commit: r212803 - head/sys/netinet
Author: andre
Date: Fri Sep 17 22:05:27 2010
New Revision: 212803
URL: http://svn.freebsd.org/changeset/base/212803

Log:
  Rearrange the TSO code to make it more readable and to clearly separate
  the decision logic, of whether we can do TSO, and the calculation of the
  burst length into two distinct parts.

  Change the way the TSO burst length calculation is done.  While TSO
  could do bursts of 65535 bytes that can't be represented in ip_len
  together with the IP and TCP header.  Account for that and use
  IP_MAXPACKET instead of TCP_MAXWIN as base constant (both have the
  same value of 64K).  When more data is available prevent less than
  MSS sized segments from being sent during the current TSO burst.

  Add two more KASSERTs to ensure the integrity of the packets.

  Tested by:	Ben Wilber ben-at-desync com
  MFC after:	10 days

Modified:
  head/sys/netinet/tcp_output.c

Modified: head/sys/netinet/tcp_output.c
==============================================================================
--- head/sys/netinet/tcp_output.c	Fri Sep 17 21:53:56 2010	(r212802)
+++ head/sys/netinet/tcp_output.c	Fri Sep 17 22:05:27 2010	(r212803)
@@ -465,9 +465,8 @@ after_sack_rexmit:
 	}

 	/*
-	 * Truncate to the maximum segment length or enable TCP Segmentation
-	 * Offloading (if supported by hardware) and ensure that FIN is removed
-	 * if the length no longer contains the last data byte.
+	 * Decide if we can use TCP Segmentation Offloading (if supported by
+	 * hardware).
 	 *
 	 * TSO may only be used if we are in a pure bulk sending state.  The
 	 * presence of TCP-MD5, SACK retransmits, SACK advertizements and
@@ -475,10 +474,6 @@ after_sack_rexmit:
 	 * (except for the sequence number) for all generated packets.  This
 	 * makes it impossible to transmit any options which vary per generated
 	 * segment or packet.
-	 *
-	 * The length of TSO bursts is limited to TCP_MAXWIN.  That limit and
-	 * removal of FIN (if not already catched here) are handled later after
-	 * the exact length of the TCP options are known.
 	 */
 #ifdef IPSEC
 	/*
@@ -487,22 +482,15 @@ after_sack_rexmit:
 	 */
 	ipsec_optlen = ipsec_hdrsiz_tcp(tp);
 #endif
-	if (len > tp->t_maxseg) {
-		if ((tp->t_flags & TF_TSO) && V_tcp_do_tso &&
-		    ((tp->t_flags & TF_SIGNATURE) == 0) &&
-		    tp->rcv_numsacks == 0 && sack_rxmit == 0 &&
-		    tp->t_inpcb->inp_options == NULL &&
-		    tp->t_inpcb->in6p_options == NULL
+	if ((tp->t_flags & TF_TSO) && V_tcp_do_tso && len > tp->t_maxseg &&
+	    ((tp->t_flags & TF_SIGNATURE) == 0) &&
+	    tp->rcv_numsacks == 0 && sack_rxmit == 0 &&
 #ifdef IPSEC
-		    && ipsec_optlen == 0
+	    ipsec_optlen == 0 &&
 #endif
-		    ) {
-			tso = 1;
-		} else {
-			len = tp->t_maxseg;
-			sendalot = 1;
-		}
-	}
+	    tp->t_inpcb->inp_options == NULL &&
+	    tp->t_inpcb->in6p_options == NULL)
+		tso = 1;

 	if (sack_rxmit) {
 		if (SEQ_LT(p->rxmit + len, tp->snd_una + so->so_snd.sb_cc))
@@ -732,28 +720,53 @@ send:
 		 * bump the packet length beyond the t_maxopd length.
 		 * Clear the FIN bit because we cut off the tail of
 		 * the segment.
-		 *
-		 * When doing TSO limit a burst to TCP_MAXWIN minus the
-		 * IP, TCP and Options length to keep ip->ip_len from
-		 * overflowing.  Prevent the last segment from being
-		 * fractional thus making them all equal sized and set
-		 * the flag to continue sending.  TSO is disabled when
-		 * IP options or IPSEC are present.
 		 */
 		if (len + optlen + ipoptlen > tp->t_maxopd) {
 			flags &= ~TH_FIN;
+
 			if (tso) {
-				if (len > TCP_MAXWIN - hdrlen - optlen) {
-					len = TCP_MAXWIN - hdrlen - optlen;
-					len = len - (len % (tp->t_maxopd - optlen));
+				KASSERT(ipoptlen == 0,
+				    ("%s: TSO can't do IP options", __func__));
+
+				/*
+				 * Limit a burst to IP_MAXPACKET minus IP,
+				 * TCP and options length to keep ip->ip_len
+				 * from overflowing.
+				 */
+				if (len > IP_MAXPACKET - hdrlen) {
+					len = IP_MAXPACKET - hdrlen;
+					sendalot = 1;
+				}
+
+				/*
+				 * Prevent the last segment from being
+				 * fractional unless the send sockbuf can
+				 * be emptied.
+				 */
+				if (sendalot && off + len
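The log describes two burst-length rules. First, cap the burst so `ip_len` cannot overflow. Second, while more data is queued, trim the burst to a whole number of segments. The stand-alone function below sketches both rules. It is a simplified illustration, not the committed code, which is cut off above.

The parameter names mirror the diff. `IP_MAXPACKET` is taken to be 65535, as in `<netinet/ip.h>`. The exact trimming condition is an assumption based on the log.

```c
#include <assert.h>

#define IP_MAXPACKET	65535	/* maximum IPv4 packet size, as in <netinet/ip.h> */

/*
 * Hypothetical sketch of the TSO burst-length calculation.
 *   len      bytes we would like to send in this burst
 *   hdrlen   IP + TCP header length, including TCP options
 *   seglen   payload bytes per generated segment (t_maxopd - optlen)
 *   off      offset of the burst's first byte in the send buffer
 *   sb_cc    bytes currently in the send buffer
 * Returns the burst length; *sendalot is set when more data remains.
 */
static long
tso_burst_len(long len, long hdrlen, long seglen, long off, long sb_cc,
    int *sendalot)
{
	*sendalot = 0;

	/* Keep payload plus headers representable in the 16-bit ip_len. */
	if (len > IP_MAXPACKET - hdrlen) {
		len = IP_MAXPACKET - hdrlen;
		*sendalot = 1;
	}

	/*
	 * If this burst will not empty the send buffer, send whole
	 * segments only, so no short segment is emitted mid-stream.
	 */
	if (off + len < sb_cc && len > seglen) {
		len -= len % seglen;
		*sendalot = 1;
	}

	assert(len + hdrlen <= IP_MAXPACKET);
	return (len);
}
```

Take a 52-byte header (20 IP + 20 TCP + 12 bytes of timestamp option) and 1448-byte segments. A 100000-byte request is first capped at 65483 bytes. It is then trimmed to 65160 bytes, exactly 45 full segments.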
Re: svn commit: r212803 - head/sys/netinet
On 18.09.2010 13:34, Bjoern A. Zeeb wrote:
> On Fri, 17 Sep 2010, Andre Oppermann wrote:
>> @@ -487,22 +482,15 @@ after_sack_rexmit:
>>  	 */
>>  	ipsec_optlen = ipsec_hdrsiz_tcp(tp);
>>  #endif
>> -	if (len > tp->t_maxseg) {
>> -		if ((tp->t_flags & TF_TSO) && V_tcp_do_tso &&
>> -		    ((tp->t_flags & TF_SIGNATURE) == 0) &&
>> -		    tp->rcv_numsacks == 0 && sack_rxmit == 0 &&
>> -		    tp->t_inpcb->inp_options == NULL &&
>> -		    tp->t_inpcb->in6p_options == NULL
>> +	if ((tp->t_flags & TF_TSO) && V_tcp_do_tso && len > tp->t_maxseg &&
>> +	    ((tp->t_flags & TF_SIGNATURE) == 0) &&
>> +	    tp->rcv_numsacks == 0 && sack_rxmit == 0 &&
>>  #ifdef IPSEC
>> -		    && ipsec_optlen == 0
>> +	    ipsec_optlen == 0 &&
>>  #endif
>> -		    ) {
>> -			tso = 1;
>> -		} else {
>> -			len = tp->t_maxseg;
>> -			sendalot = 1;
>> -		}
>> -	}
>> +	    tp->t_inpcb->inp_options == NULL &&
>> +	    tp->t_inpcb->in6p_options == NULL)
>> +		tso = 1;
>
> In the non-TSO case you are no longer reducing len to tp->t_maxseg here,
> if it's larger, which I think breaks assumptions all the way down.

No assumptions are broken for the non-TSO case.  The value of len is only tested against t_maxseg for being equal or greater, and that always holds true.  Once the decision to send has been made, len is correctly limited in both the non-TSO and the TSO case.  Before, a bit of that was done in each of two places; it is now merged into one spot.

-- 
Andre
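Andre's point is that deciding whether to use TSO and limiting `len` are now separate steps. The decision half reduces to a single predicate. The sketch below expresses it with a hypothetical `tso_inputs` struct that stands in for the `tcpcb`/`inpcb` fields tested in the diff. It assumes IPSEC is compiled in.

```c
#include <stdbool.h>

/* Hypothetical flattening of the state the diff consults. */
struct tso_inputs {
	bool	iface_tso;	/* TF_TSO: the interface supports TSO */
	bool	tso_enabled;	/* V_tcp_do_tso sysctl */
	long	len;		/* bytes ready to send */
	long	maxseg;		/* tp->t_maxseg */
	bool	signature;	/* TF_SIGNATURE (TCP-MD5) in use */
	int	rcv_numsacks;	/* SACK blocks we must advertise */
	bool	sack_rxmit;	/* this send is a SACK retransmit */
	int	ipsec_optlen;	/* IPSEC header overhead */
	bool	ip_options;	/* inp_options or in6p_options set */
};

/*
 * TSO is only usable in a pure bulk-sending state: every generated
 * segment must carry identical headers apart from the sequence number.
 */
static bool
can_use_tso(const struct tso_inputs *in)
{
	return (in->iface_tso && in->tso_enabled &&
	    in->len > in->maxseg &&
	    !in->signature &&
	    in->rcv_numsacks == 0 && !in->sack_rxmit &&
	    in->ipsec_optlen == 0 &&
	    !in->ip_options);
}
```

Any trimming of `len`, TSO or not, then happens in one place, after the exact option lengths are known.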
Re: svn commit: r236959 - in head: share/man/man4 sys/netinet
On 12.06.2012 16:02, Michael Tuexen wrote:
> Author: tuexen
> Date: Tue Jun 12 14:02:38 2012
> New Revision: 236959
> URL: http://svn.freebsd.org/changeset/base/236959
>
> Log:
>   Add a IP_RECVTOS socket option to receive for received UDP/IPv4
>   packets a cmsg of type IP_RECVTOS which contains the TOS byte.
>   Much like IP_RECVTTL does for TTL. This allows to implement a
>   protocol on top of UDP and implementing ECN.

You may want to consider aliasing IP_RECVTOS to IP_TOS, as is done with IP_SENDSRCADDR and IP_RECVDSTADDR, to make replying to received UDP packets simpler.  That way IP_RECVTOS has the same IP socket option number as IP_TOS and can be used for direct TOS reflection.

-- 
Andre
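Here is roughly how a receiver consumes the new option: enable `IP_RECVTOS` on the socket, then walk the control messages returned by `recvmsg(2)`. This is a sketch only. It assumes a system that defines `IP_RECVTOS`, such as FreeBSD after this commit or Linux. Per the log, FreeBSD delivers the byte as a cmsg of type `IP_RECVTOS`, while Linux labels the same cmsg `IP_TOS`, so the loop accepts either. The helper `recv_tos_roundtrip` is hypothetical and sends a datagram to itself over loopback.

```c
#define _GNU_SOURCE		/* glibc: expose IP_RECVTOS; harmless elsewhere */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one UDP datagram to ourselves over
 * loopback with the given TOS byte and return the TOS byte reported
 * by the received control message, or -1 on any failure.
 */
static int
recv_tos_roundtrip(int tos)
{
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *c;
	union {
		struct cmsghdr	hdr;	/* forces cmsg alignment */
		char		buf[CMSG_SPACE(sizeof(int))];
	} ctl;
	char data[16];
	int on = 1, result = -1;
	int rx = socket(AF_INET, SOCK_DGRAM, 0);
	int tx = socket(AF_INET, SOCK_DGRAM, 0);

	if (rx < 0 || tx < 0)
		goto done;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);	/* port 0: kernel picks */
	if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
		goto done;

	/* Receiver: ask for the TOS byte of every incoming datagram. */
	if (setsockopt(rx, IPPROTO_IP, IP_RECVTOS, &on, sizeof(on)) < 0)
		goto done;
	/* Sender: mark outgoing datagrams with the requested TOS. */
	if (setsockopt(tx, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
		goto done;
	if (sendto(tx, "x", 1, 0, (struct sockaddr *)&addr, sizeof(addr)) != 1)
		goto done;

	iov.iov_base = data;
	iov.iov_len = sizeof(data);
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	if (recvmsg(rx, &msg, 0) < 0)
		goto done;

	for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
		/* FreeBSD tags the byte IP_RECVTOS; Linux tags it IP_TOS. */
		if (c->cmsg_level == IPPROTO_IP &&
		    (c->cmsg_type == IP_RECVTOS || c->cmsg_type == IP_TOS)) {
			result = *(unsigned char *)CMSG_DATA(c);
			break;
		}
	}
done:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return (result);
}
```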
svn commit: r227499 - head/share/man/man4
Author: andre
Date: Mon Nov 14 15:10:42 2011
New Revision: 227499
URL: http://svn.freebsd.org/changeset/base/227499

Log:
  Note the ip_len bug fixed in r226105 in the BUGS section.

Modified:
  head/share/man/man4/ip.4

Modified: head/share/man/man4/ip.4
==============================================================================
--- head/share/man/man4/ip.4	Mon Nov 14 15:10:01 2011	(r227498)
+++ head/share/man/man4/ip.4	Mon Nov 14 15:10:42 2011	(r227499)
@@ -32,7 +32,7 @@
 .\"     @(#)ip.4	8.2 (Berkeley) 11/30/93
 .\" $FreeBSD$
 .\"
-.Dd June 1, 2009
+.Dd November 14, 2011
 .Dt IP 4
 .Os
 .Sh NAME
@@ -847,3 +847,9 @@ The
 .Vt ip_mreqn
 structure appeared in
 .Tn Linux 2.4 .
+.Sh BUGS
+Before
+.Fx 10.0 packets received on raw IP sockets had the
+.Va ip_hl
+subtracted from the
+.Va ip_len field.
svn commit: r227500 - head/share/man/man4
Author: andre
Date: Mon Nov 14 15:14:42 2011
New Revision: 227500
URL: http://svn.freebsd.org/changeset/base/227500

Log:
  Remove mention of ss_fltsz and ss_fltsz_local which were retired
  in r226447.

Modified:
  head/share/man/man4/tcp.4

Modified: head/share/man/man4/tcp.4
==============================================================================
--- head/share/man/man4/tcp.4	Mon Nov 14 15:10:42 2011	(r227499)
+++ head/share/man/man4/tcp.4	Mon Nov 14 15:14:42 2011	(r227500)
@@ -38,7 +38,7 @@
 .\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
 .\" $FreeBSD$
 .\"
-.Dd September 15, 2011
+.Dd November 14, 2011
 .Dt TCP 4
 .Os
 .Sh NAME
@@ -290,14 +290,6 @@ That of
 2 results in any packets to closed ports being logged.
 Any value unlisted above disables the logging
 (default is 0, i.e., the logging is disabled).
-.It Va slowstart_flightsize
-The number of packets allowed to be in-flight during the
-.Tn TCP
-slow-start phase on a non-local network.
-.It Va local_slowstart_flightsize
-The number of packets allowed to be in-flight during the
-.Tn TCP
-slow-start phase to local machines in the same subnet.
 .It Va msl
 The Maximum Segment Lifetime, in milliseconds, for a packet.
 .It Va keepinit
@@ -411,15 +403,6 @@ maximum segment size.
 This helps throughput in general, but
 particularly affects short transfers and high-bandwidth large
 propagation-delay connections.
-.Pp
-When this feature is enabled, the
-.Va slowstart_flightsize
-and
-.Va local_slowstart_flightsize
-settings are not observed for new
-connection slow starts, but they are still used for slow starts
-that occur when the connection has been idle and starts sending
-again.
 .It Va sack.enable
 Enable support for RFC 2018, TCP Selective Acknowledgment option,
 which allows the receiver to inform the sender about all successfully
Re: svn commit: r227499 - head/share/man/man4
On 14.11.2011 16:38, Garrett Cooper wrote:
> On Mon, Nov 14, 2011 at 7:10 AM, Andre Oppermannan...@freebsd.org wrote:
>> Author: andre
>> Date: Mon Nov 14 15:10:42 2011
>> New Revision: 227499
>> URL: http://svn.freebsd.org/changeset/base/227499
>>
>> Log:
>>   Note the ip_len bug fixed in r226105 in the BUGS section.
>>
>> [...]
>> @@ -847,3 +847,9 @@ The
>>  .Vt ip_mreqn
>>  structure appeared in
>>  .Tn Linux 2.4 .
>> +.Sh BUGS
>> +Before
>> +.Fx 10.0 packets received on raw IP sockets had the
>> +.Va ip_hl
>> +subtracted from the
>> +.Va ip_len field.
>
> Isn't the fix going to be MFCed?

It was.  However, some ports depend on this bug, and given the late stage of the release cycle we decided to back out the MFC.

-- 
Andre
svn commit: r227501 - head/share/man/man4
Author: andre
Date: Mon Nov 14 15:57:03 2011
New Revision: 227501
URL: http://svn.freebsd.org/changeset/base/227501

Log:
  mdoc fix for r227499.

  Reported by:	brueffer

Modified:
  head/share/man/man4/ip.4

Modified: head/share/man/man4/ip.4
==============================================================================
--- head/share/man/man4/ip.4	Mon Nov 14 15:14:42 2011	(r227500)
+++ head/share/man/man4/ip.4	Mon Nov 14 15:57:03 2011	(r227501)
@@ -849,7 +849,8 @@ structure appeared in
 .Tn Linux 2.4 .
 .Sh BUGS
 Before
-.Fx 10.0 packets received on raw IP sockets had the
+.Fx 10.0
+packets received on raw IP sockets had the
 .Va ip_hl
 subtracted from the
 .Va ip_len field.
svn commit: r223839 - in head/sys: conf kern netinet
Author: andre
Date: Thu Jul 7 10:37:14 2011
New Revision: 223839
URL: http://svn.freebsd.org/changeset/base/223839

Log:
  Remove the TCP_SORECEIVE_STREAM compile time option.  The use of
  soreceive_stream() for TCP still has to be enabled with the loader
  tuneable net.inet.tcp.soreceive_stream.

  Suggested by:	trociny and others

Modified:
  head/sys/conf/options
  head/sys/kern/uipc_socket.c
  head/sys/netinet/tcp_subr.c

Modified: head/sys/conf/options
==============================================================================
--- head/sys/conf/options	Thu Jul 7 09:51:31 2011	(r223838)
+++ head/sys/conf/options	Thu Jul 7 10:37:14 2011	(r223839)
@@ -427,7 +427,6 @@ SLIP_IFF_OPTS		opt_slip.h
 TCPDEBUG
 TCP_OFFLOAD_DISABLE	opt_inet.h #Disable code to dispatch tcp offloading
 TCP_SIGNATURE		opt_inet.h
-TCP_SORECEIVE_STREAM	opt_inet.h
 VLAN_ARRAY		opt_vlan.h
 XBONEHACK
 FLOWTABLE		opt_route.h

Modified: head/sys/kern/uipc_socket.c
==============================================================================
--- head/sys/kern/uipc_socket.c	Thu Jul 7 09:51:31 2011	(r223838)
+++ head/sys/kern/uipc_socket.c	Thu Jul 7 10:37:14 2011	(r223839)
@@ -1915,7 +1915,6 @@ release:
 /*
  * Optimized version of soreceive() for stream (TCP) sockets.
  */
-#ifdef TCP_SORECEIVE_STREAM
 int
 soreceive_stream(struct socket *so, struct sockaddr **psa, struct uio *uio,
     struct mbuf **mp0, struct mbuf **controlp, int *flagsp)
@@ -2109,7 +2108,6 @@ out:
 	sbunlock(sb);
 	return (error);
 }
-#endif /* TCP_SORECEIVE_STREAM */

 /*
  * Optimized version of soreceive() for simple datagram cases from userspace.

Modified: head/sys/netinet/tcp_subr.c
==============================================================================
--- head/sys/netinet/tcp_subr.c	Thu Jul 7 09:51:31 2011	(r223838)
+++ head/sys/netinet/tcp_subr.c	Thu Jul 7 10:37:14 2011	(r223839)
@@ -206,11 +206,9 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
     &VNET_NAME(tcp_isn_reseed_interval), 0,
     "Seconds between reseeding of ISN secret");

-#ifdef TCP_SORECEIVE_STREAM
 static int tcp_soreceive_stream = 0;
 SYSCTL_INT(_net_inet_tcp, OID_AUTO, soreceive_stream, CTLFLAG_RDTUN,
     &tcp_soreceive_stream, 0, "Using soreceive_stream for TCP sockets");
-#endif

 #ifdef TCP_SIGNATURE
 static int tcp_sig_checksigs = 1;
@@ -337,13 +335,13 @@ tcp_init(void)
 	tcp_finwait2_timeout = TCPTV_FINWAIT2_TIMEOUT;
 	tcp_tcbhashsize = hashsize;

-#ifdef TCP_SORECEIVE_STREAM
 	TUNABLE_INT_FETCH("net.inet.tcp.soreceive_stream", &tcp_soreceive_stream);
 	if (tcp_soreceive_stream) {
 		tcp_usrreqs.pru_soreceive = soreceive_stream;
+#ifdef INET6
 		tcp6_usrreqs.pru_soreceive = soreceive_stream;
+#endif /* INET6 */
 	}
-#endif

 #ifdef INET6
 #define TCP_MINPROTOHDR (sizeof(struct ip6_hdr) + sizeof(struct tcphdr))
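The pattern in `tcp_init()` is a boot-time swap of a protocol method pointer, decided once from a loader tunable. The sketch below models it in user space with hypothetical stand-ins: `usrreqs_sketch`, `generic_receive`, `stream_receive`, and a plain integer in place of `TUNABLE_INT_FETCH`. The `INET6` guard mirrors the one the commit adds, so builds without IPv6 still compile.

```c
#define INET6	/* assumption: IPv6 support compiled in */

typedef int (*receive_fn)(void);

struct usrreqs_sketch {
	receive_fn	pru_soreceive;
};

/* Hypothetical receive paths; the return value just identifies them. */
static int generic_receive(void) { return (1); }
static int stream_receive(void) { return (2); }

static struct usrreqs_sketch tcp_usrreqs = { generic_receive };
#ifdef INET6
static struct usrreqs_sketch tcp6_usrreqs = { generic_receive };
#endif

/* Select the receive path once, at init time, from a tunable value. */
static void
tcp_init_sketch(int soreceive_stream_tunable)
{
	if (soreceive_stream_tunable) {
		tcp_usrreqs.pru_soreceive = stream_receive;
#ifdef INET6
		tcp6_usrreqs.pru_soreceive = stream_receive;
#endif
	}
}
```

The selection happens only during initialization. That is why the real knob is read-only at run time (`CTLFLAG_RDTUN` in the diff) and has to be set from the loader.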
Re: svn commit: r223862 - in head/sys: net netinet netinet6
On 08.07.2011 11:38, Marko Zec wrote:
> Author: zec
> Date: Fri Jul 8 09:38:33 2011
> New Revision: 223862
> URL: http://svn.freebsd.org/changeset/base/223862
>
> Log:
>   Permit ARP to proceed for IPv4 host routes for which the gateway is the
>   same as the host address.  This already works fine for INET6 and ND6.

Can you give an example of what this does?  Is it some sort of proxy ARP?

>   While here, remove two function pointers from struct lltable which are
>   only initialized but never used.

Ideally this would have been a separate commit because it has nothing to do with the primary functional change.

-- 
Andre

>   MFC after:	3 days
>
> Modified:
>   head/sys/net/if_llatbl.h
>   head/sys/netinet/in.c
>   head/sys/netinet6/in6.c
svn commit: r223863 - head/sys/kern
Author: andre
Date: Fri Jul 8 10:50:13 2011
New Revision: 223863
URL: http://svn.freebsd.org/changeset/base/223863

Log:
  In the experimental soreceive_stream():

   o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF
     is correctly returned on a remote connection close.
   o In the non-blocking socket test compare SS_NBIO against the so->so_state
     field instead of the incorrect sb->sb_state field.
   o Simplify the ENOTCONN test by removing cases that can't occur.

  Submitted by:	trociny (with some further tweaks by committer)
  Tested by:	trociny

Modified:
  head/sys/kern/uipc_socket.c

Modified: head/sys/kern/uipc_socket.c
==============================================================================
--- head/sys/kern/uipc_socket.c	Fri Jul 8 09:38:33 2011	(r223862)
+++ head/sys/kern/uipc_socket.c	Fri Jul 8 10:50:13 2011	(r223863)
@@ -1954,20 +1954,9 @@ soreceive_stream(struct socket *so, stru
 	}
 	oresid = uio->uio_resid;

-	/* We will never ever get anything unless we are connected. */
+	/* We will never ever get anything unless we are or were connected. */
 	if (!(so->so_state & (SS_ISCONNECTED|SS_ISDISCONNECTED))) {
-		/* When disconnecting there may be still some data left. */
-		if (sb->sb_cc > 0)
-			goto deliver;
-		if (!(so->so_state & SS_ISDISCONNECTED))
-			error = ENOTCONN;
-		goto out;
-	}
-
-	/* Socket buffer is empty and we shall not block. */
-	if (sb->sb_cc == 0 &&
-	    ((sb->sb_flags & SS_NBIO) || (flags & (MSG_DONTWAIT|MSG_NBIO)))) {
-		error = EAGAIN;
+		error = ENOTCONN;
 		goto out;
 	}

@@ -1994,6 +1983,13 @@ restart:
 		goto out;
 	}

+	/* Socket buffer is empty and we shall not block. */
+	if (sb->sb_cc == 0 &&
+	    ((so->so_state & SS_NBIO) || (flags & (MSG_DONTWAIT|MSG_NBIO)))) {
+		error = EAGAIN;
+		goto out;
+	}
+
 	/* Socket buffer got some data that we shall deliver now. */
 	if (sb->sb_cc > 0 && !(flags & MSG_WAITALL) &&
 	    ((sb->sb_flags & SS_NBIO) ||
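The order of the tests matters. Consider a non-blocking socket whose peer has closed and whose buffer is empty. If "would block" is checked first, the call returns `EAGAIN` forever and the reader never sees EOF.

The sketch below shows the corrected order in simplified form. It uses hypothetical field names and a hypothetical `rcv_action` result. It ignores locking, sleeping and `MSG_WAITALL`.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified socket state. */
struct sock_sketch {
	bool	connected;	/* SS_ISCONNECTED */
	bool	disconnected;	/* SS_ISDISCONNECTED */
	bool	cantrcvmore;	/* SBS_CANTRCVMORE: peer sent FIN */
	bool	nbio;		/* SS_NBIO, kept in so_state */
	long	sb_cc;		/* bytes in the receive buffer */
};

enum rcv_action { RCV_DELIVER, RCV_EOF, RCV_ERROR, RCV_SLEEP };

static enum rcv_action
receive_check(const struct sock_sketch *so, bool dontwait, int *error)
{
	*error = 0;

	/* Nothing can ever arrive unless we are or were connected. */
	if (!so->connected && !so->disconnected) {
		*error = ENOTCONN;
		return (RCV_ERROR);
	}
	/* Hand out any remaining data first, then report EOF. */
	if (so->sb_cc > 0)
		return (RCV_DELIVER);
	if (so->cantrcvmore)
		return (RCV_EOF);
	/* Only now may an empty, non-blocking socket fail with EAGAIN. */
	if (so->nbio || dontwait) {
		*error = EAGAIN;
		return (RCV_ERROR);
	}
	return (RCV_SLEEP);
}
```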
svn commit: r226105 - head/sys/netinet
Author: andre
Date: Fri Oct 7 13:43:01 2011
New Revision: 226105
URL: http://svn.freebsd.org/changeset/base/226105

Log:
  Add back the IP header length to the total packet length field on
  raw IP sockets.  It was deducted in ip_input() in preparation for
  protocols interested only in the payload.

  On raw sockets the IP header should be delivered as it at came in
  from the network except for the byte order swaps in some fields.

  This brings us in line with all other OS'es that provide raw IP
  sockets.

  Reported by:	Matthew Cini Sarreo mcins1-at-gmail.com
  MFC after:	3 days

Modified:
  head/sys/netinet/raw_ip.c

Modified: head/sys/netinet/raw_ip.c
==============================================================================
--- head/sys/netinet/raw_ip.c	Fri Oct 7 13:16:21 2011	(r226104)
+++ head/sys/netinet/raw_ip.c	Fri Oct 7 13:43:01 2011	(r226105)
@@ -289,6 +289,13 @@ rip_input(struct mbuf *m, int off)
 	last = NULL;

 	ifp = m->m_pkthdr.rcvif;
+	/*
+	 * Add back the IP header length which was
+	 * removed by ip_input().  Raw sockets do
+	 * not modify the packet except for some
+	 * byte order swaps.
+	 */
+	ip->ip_len += off;

 	hash = INP_PCBHASH_RAW(proto, ip->ip_src.s_addr,
 	    ip->ip_dst.s_addr, V_ripcbinfo.ipi_hashmask);
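For a raw-socket reader this means that, with the fix, `ip_len` covers the whole datagram again. The payload length is therefore `ip_len` minus the header length, where `ip_hl` counts 32-bit words.

The helper below is a hedged sketch. It assumes the caller has already put `ip_len` into host byte order, since the log notes that raw sockets still swap some fields. The name `raw_payload_len` and its `legacy` flag, for kernels that had already subtracted the header, are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Payload length of a datagram read from a raw IP socket.
 *   ip_len  total-length field, already in host byte order
 *   ip_hl   header length in 32-bit words (5 without options)
 *   legacy  true if the kernel already subtracted the header
 * Returns -1 if the fields are inconsistent.
 */
static int
raw_payload_len(uint16_t ip_len, unsigned ip_hl, bool legacy)
{
	unsigned hlen = ip_hl << 2;	/* words to bytes */

	if (hlen < 20)
		return (-1);
	if (legacy)
		return (ip_len);
	if (ip_len < hlen)
		return (-1);
	return (ip_len - hlen);
}
```

The BUGS entry later added to ip(4) records that releases before FreeBSD 10.0 behave like the `legacy` case.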
svn commit: r226113 - head/sys/netinet
Author: andre
Date: Fri Oct 7 16:39:03 2011
New Revision: 226113
URL: http://svn.freebsd.org/changeset/base/226113

Log:
  Prevent TCP sessions from stalling indefinitely in reassembly
  when reaching the zone limit of reassembly queue entries.

  When the zone limit was reached not even the missing segment
  that would complete the sequence space could be processed
  preventing the TCP session forever from making any further
  progress.

  Solve this deadlock by using a temporary on-stack queue entry
  for the missing segment followed by an immediate dequeue again
  by delivering the contiguous sequence space to the socket.

  Add logging under net.inet.tcp.log_debug for reassembly queue
  issues.

  Reviewed by:	lsteward (previous version)
  Tested by:	Steven Hartland killing-at-multiplay.co.uk
  MFC after:	3 days

Modified:
  head/sys/netinet/tcp_reass.c

Modified: head/sys/netinet/tcp_reass.c
==============================================================================
--- head/sys/netinet/tcp_reass.c	Fri Oct 7 16:09:44 2011	(r226112)
+++ head/sys/netinet/tcp_reass.c	Fri Oct 7 16:39:03 2011	(r226113)
@@ -177,7 +177,9 @@ tcp_reass(struct tcpcb *tp, struct tcphd
 	struct tseg_qent *nq;
 	struct tseg_qent *te = NULL;
 	struct socket *so = tp->t_inpcb->inp_socket;
+	char *s = NULL;
 	int flags;
+	struct tseg_qent tqs;

 	INP_WLOCK_ASSERT(tp->t_inpcb);

@@ -215,19 +217,40 @@ tcp_reass(struct tcpcb *tp, struct tcphd
 		TCPSTAT_INC(tcps_rcvmemdrop);
 		m_freem(m);
 		*tlenp = 0;
+		if ((s = tcp_log_addrs(&tp->t_inpcb->inp_inc, th, NULL, NULL))) {
+			log(LOG_DEBUG, "%s; %s: queue limit reached, "
+			    "segment dropped\n", s, __func__);
+			free(s, M_TCPLOG);
+		}
 		return (0);
 	}

 	/*
 	 * Allocate a new queue entry. If we can't, or hit the zone limit
 	 * just drop the pkt.
+	 *
+	 * Use a temporary structure on the stack for the missing segment
+	 * when the zone is exhausted. Otherwise we may get stuck.
 	 */
 	te = uma_zalloc(V_tcp_reass_zone, M_NOWAIT);
-	if (te == NULL) {
+	if (te == NULL && th->th_seq != tp->rcv_nxt) {
 		TCPSTAT_INC(tcps_rcvmemdrop);
 		m_freem(m);
 		*tlenp = 0;
+		if ((s = tcp_log_addrs(&tp->t_inpcb->inp_inc, th, NULL, NULL))) {
+			log(LOG_DEBUG, "%s; %s: global zone limit reached, "
+			    "segment dropped\n", s, __func__);
+			free(s, M_TCPLOG);
+		}
 		return (0);
+	} else if (th->th_seq == tp->rcv_nxt) {
+		bzero(&tqs, sizeof(struct tseg_qent));
+		te = &tqs;
+		if ((s = tcp_log_addrs(&tp->t_inpcb->inp_inc, th, NULL, NULL))) {
+			log(LOG_DEBUG, "%s; %s: global zone limit reached, "
+			    "using stack for missing segment\n", s, __func__);
+			free(s, M_TCPLOG);
+		}
 	}
 	tp->t_segqlen++;

@@ -304,6 +327,8 @@ tcp_reass(struct tcpcb *tp, struct tcphd
 	if (p == NULL) {
 		LIST_INSERT_HEAD(&tp->t_segq, te, tqe_q);
 	} else {
+		KASSERT(te != &tqs, ("%s: temporary stack based entry not "
+		    "first element in queue", __func__));
 		LIST_INSERT_AFTER(p, te, tqe_q);
 	}

@@ -327,7 +352,8 @@ present:
 			m_freem(q->tqe_m);
 		else
 			sbappendstream_locked(&so->so_rcv, q->tqe_m);
-		uma_zfree(V_tcp_reass_zone, q);
+		if (q != &tqs)
+			uma_zfree(V_tcp_reass_zone, q);
 		tp->t_segqlen--;
 		q = nq;
 	} while (q && q->tqe_th->th_seq == tp->rcv_nxt);
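The fallback rule can be shown in isolation. When the allocator is exhausted, only the segment that fills the hole at `rcv_nxt` is accepted. It uses a caller-provided stack entry that is never returned to the allocator.

The sketch below uses a hypothetical fixed-size pool in place of the UMA zone. It takes the stack path only when the allocation actually failed. By contrast, the `else if` in the diff as committed tests only `th->th_seq == tp->rcv_nxt`. Read literally, that branch would also replace a successfully allocated entry with `&tqs`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct qent {
	uint32_t	seq;	/* first sequence number of the segment */
	bool		in_use;
};

#define POOL_SIZE	2	/* hypothetical tiny zone limit */

static struct qent pool[POOL_SIZE];

static struct qent *
pool_alloc(void)
{
	for (size_t i = 0; i < POOL_SIZE; i++)
		if (!pool[i].in_use) {
			pool[i].in_use = true;
			return (&pool[i]);
		}
	return (NULL);		/* zone limit reached */
}

/*
 * Get a queue entry for a segment starting at seq.  Returns NULL when
 * the segment must be dropped, and stack_ent for the in-order segment
 * when the pool is exhausted.
 */
static struct qent *
reass_entry(uint32_t seq, uint32_t rcv_nxt, struct qent *stack_ent)
{
	struct qent *te = pool_alloc();

	if (te == NULL) {
		if (seq != rcv_nxt)
			return (NULL);	/* out of order: drop */
		memset(stack_ent, 0, sizeof(*stack_ent));
		te = stack_ent;		/* fills the hole: keep going */
	}
	te->seq = seq;
	return (te);
}

/* Release an entry after delivery; never free the stack entry. */
static void
reass_release(struct qent *q, const struct qent *stack_ent)
{
	if (q != stack_ent)
		q->in_use = false;
}
```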
svn commit: r226433 - head/sys/netinet
Author: andre
Date: Sun Oct 16 13:54:46 2011
New Revision: 226433
URL: http://svn.freebsd.org/changeset/base/226433

Log:
  Update the comment and description of tcp_sendspace and tcp_recvspace
  to better reflect their purpose.

  MFC after:	1 week

Modified:
  head/sys/netinet/tcp_usrreq.c

Modified: head/sys/netinet/tcp_usrreq.c
==============================================================================
--- head/sys/netinet/tcp_usrreq.c	Sun Oct 16 11:08:51 2011	(r226432)
+++ head/sys/netinet/tcp_usrreq.c	Sun Oct 16 13:54:46 2011	(r226433)
@@ -1498,16 +1498,15 @@ tcp_ctloutput(struct socket *so, struct
 #undef INP_WLOCK_RECHECK

 /*
- * tcp_sendspace and tcp_recvspace are the default send and receive window
- * sizes, respectively.  These are obsolescent (this information should
- * be set by the route).
+ * Set the initial send and receive socket buffer sizes for
+ * newly created TCP sockets.
  */
 u_long	tcp_sendspace = 1024*32;
 SYSCTL_ULONG(_net_inet_tcp, TCPCTL_SENDSPACE, sendspace, CTLFLAG_RW,
-    &tcp_sendspace , 0, "Maximum outgoing TCP datagram size");
+    &tcp_sendspace , 0, "Initial send socket buffer size");
 u_long	tcp_recvspace = 1024*64;
 SYSCTL_ULONG(_net_inet_tcp, TCPCTL_RECVSPACE, recvspace, CTLFLAG_RW,
-    &tcp_recvspace , 0, "Maximum incoming TCP datagram size");
+    &tcp_recvspace , 0, "Initial receive socket buffer size");

 /*
  * Attach TCP protocol to socket, allocating
svn commit: r226437 - head/sys/netinet
Author: andre
Date: Sun Oct 16 15:08:43 2011
New Revision: 226437
URL: http://svn.freebsd.org/changeset/base/226437

Log:
  VNET virtualize tcp_sendspace/tcp_recvspace and change the
  type to INT.  A long is not necessary as the TCP window is
  limited to 2**30.  A larger initial window isn't useful.

  MFC after:	1 week

Modified:
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_usrreq.c
  head/sys/netinet/tcp_var.h

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Sun Oct 16 14:30:28 2011	(r226436)
+++ head/sys/netinet/tcp_input.c	Sun Oct 16 15:08:43 2011	(r226437)
@@ -3517,7 +3517,7 @@ tcp_mss(struct tcpcb *tp, int offer)
 	 */
 	so = inp->inp_socket;
 	SOCKBUF_LOCK(&so->so_snd);
-	if ((so->so_snd.sb_hiwat == tcp_sendspace) && metrics.rmx_sendpipe)
+	if ((so->so_snd.sb_hiwat == V_tcp_sendspace) && metrics.rmx_sendpipe)
 		bufsize = metrics.rmx_sendpipe;
 	else
 		bufsize = so->so_snd.sb_hiwat;
@@ -3534,7 +3534,7 @@ tcp_mss(struct tcpcb *tp, int offer)
 	tp->t_maxseg = mss;

 	SOCKBUF_LOCK(&so->so_rcv);
-	if ((so->so_rcv.sb_hiwat == tcp_recvspace) && metrics.rmx_recvpipe)
+	if ((so->so_rcv.sb_hiwat == V_tcp_recvspace) && metrics.rmx_recvpipe)
 		bufsize = metrics.rmx_recvpipe;
 	else
 		bufsize = so->so_rcv.sb_hiwat;

Modified: head/sys/netinet/tcp_usrreq.c
==============================================================================
--- head/sys/netinet/tcp_usrreq.c	Sun Oct 16 14:30:28 2011	(r226436)
+++ head/sys/netinet/tcp_usrreq.c	Sun Oct 16 15:08:43 2011	(r226437)
@@ -1501,12 +1501,15 @@ tcp_ctloutput(struct socket *so, struct
  * Set the initial send and receive socket buffer sizes for
  * newly created TCP sockets.
  */
-u_long	tcp_sendspace = 1024*32;
-SYSCTL_ULONG(_net_inet_tcp, TCPCTL_SENDSPACE, sendspace, CTLFLAG_RW,
-    &tcp_sendspace , 0, "Initial send socket buffer size");
-u_long	tcp_recvspace = 1024*64;
-SYSCTL_ULONG(_net_inet_tcp, TCPCTL_RECVSPACE, recvspace, CTLFLAG_RW,
-    &tcp_recvspace , 0, "Initial receive socket buffer size");
+VNET_DEFINE(int, tcp_sendspace) = 1024*32;
+#define	V_tcp_sendspace	VNET(tcp_sendspace)
+SYSCTL_VNET_INT(_net_inet_tcp, TCPCTL_SENDSPACE, tcp_sendspace, CTLFLAG_RW,
+    &VNET_NAME(tcp_sendspace), 0, "Initial send socket buffer size");
+
+VNET_DEFINE(int, tcp_recvspace) = 1024*64
+#define	V_tcp_recvspace	VNET(tcp_recvspace)
+SYSCTL_VNET_INT(_net_inet_tcp, TCPCTL_RECVSPACE, tcp_recvspace, CTLFLAG_RW,
+    &VNET_NAME(tcp_recvspace), 0, "Initial receive socket buffer size");

 /*
  * Attach TCP protocol to socket, allocating
@@ -1521,7 +1524,7 @@ tcp_attach(struct socket *so)
 	int error;

 	if (so->so_snd.sb_hiwat == 0 || so->so_rcv.sb_hiwat == 0) {
-		error = soreserve(so, tcp_sendspace, tcp_recvspace);
+		error = soreserve(so, V_tcp_sendspace, V_tcp_recvspace);
 		if (error)
 			return (error);
 	}

Modified: head/sys/netinet/tcp_var.h
==============================================================================
--- head/sys/netinet/tcp_var.h	Sun Oct 16 14:30:28 2011	(r226436)
+++ head/sys/netinet/tcp_var.h	Sun Oct 16 15:08:43 2011	(r226437)
@@ -606,6 +606,8 @@ VNET_DECLARE(int, tcp_mssdflt);	/* XXX *
 VNET_DECLARE(int, tcp_minmss);
 VNET_DECLARE(int, tcp_delack_enabled);
 VNET_DECLARE(int, tcp_do_rfc3390);
+VNET_DECLARE(int, tcp_sendspace);
+VNET_DECLARE(int, tcp_recvspace);
 VNET_DECLARE(int, path_mtu_discovery);
 VNET_DECLARE(int, ss_fltsz);
 VNET_DECLARE(int, ss_fltsz_local);
@@ -618,6 +620,8 @@ VNET_DECLARE(int, tcp_abc_l_var);
 #define	V_tcp_minmss		VNET(tcp_minmss)
 #define	V_tcp_delack_enabled	VNET(tcp_delack_enabled)
 #define	V_tcp_do_rfc3390	VNET(tcp_do_rfc3390)
+#define	V_tcp_sendspace		VNET(tcp_sendspace)
+#define	V_tcp_recvspace		VNET(tcp_recvspace)
 #define	V_path_mtu_discovery	VNET(path_mtu_discovery)
 #define	V_ss_fltsz		VNET(ss_fltsz)
 #define	V_ss_fltsz_local	VNET(ss_fltsz_local)
@@ -716,8 +720,6 @@ void	 tcp_hc_updatemtu(struct in_conninf
 void	 tcp_hc_update(struct in_conninfo *, struct hc_metrics_lite *);
 extern	struct pr_usrreqs tcp_usrreqs;
-extern	u_long tcp_sendspace;
-extern	u_long tcp_recvspace;
 tcp_seq tcp_new_isn(struct tcpcb *);
 void	 tcp_sack_doack(struct tcpcb *, struct tcpopt *, tcp_seq);
svn commit: r226447 - head/sys/netinet
Author: andre
Date: Sun Oct 16 20:06:44 2011
New Revision: 226447
URL: http://svn.freebsd.org/changeset/base/226447

Log:
  Remove the ss_fltsz and ss_fltsz_local sysctl's which have long been
  superseded by the RFC3390 initial CWND sizing.

  Also remove the remnants of TCP_METRICS_CWND which used the TCP hostcache
  to set the initial CWND in a non-RFC compliant way.

  MFC after:	1 week

Modified:
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_output.c
  head/sys/netinet/tcp_var.h

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Sun Oct 16 19:46:52 2011	(r226446)
+++ head/sys/netinet/tcp_input.c	Sun Oct 16 20:06:44 2011	(r226447)
@@ -301,9 +301,6 @@ cc_conn_init(struct tcpcb *tp)
 	struct hc_metrics_lite metrics;
 	struct inpcb *inp = tp->t_inpcb;
 	int rtt;
-#ifdef INET6
-	int isipv6 = ((inp->inp_vflag & INP_IPV6) != 0) ? 1 : 0;
-#endif
 
 	INP_WLOCK_ASSERT(tp->t_inpcb);
 
@@ -337,49 +334,16 @@ cc_conn_init(struct tcpcb *tp)
 	}
 
 	/*
-	 * Set the slow-start flight size depending on whether this
-	 * is a local network or not.
-	 *
-	 * Extend this so we cache the cwnd too and retrieve it here.
-	 * Make cwnd even bigger than RFC3390 suggests but only if we
-	 * have previous experience with the remote host. Be careful
-	 * not make cwnd bigger than remote receive window or our own
-	 * send socket buffer. Maybe put some additional upper bound
-	 * on the retrieved cwnd. Should do incremental updates to
-	 * hostcache when cwnd collapses so next connection doesn't
-	 * overloads the path again.
-	 *
-	 * XXXAO: Initializing the CWND from the hostcache is broken
-	 * and in its current form not RFC conformant. It is disabled
-	 * until fixed or removed entirely.
+	 * Set the initial slow-start flight size.
 	 *
 	 * RFC3390 says only do this if SYN or SYN/ACK didn't got lost.
-	 * We currently check only in syncache_socket for that.
+	 * XXX: We currently check only in syncache_socket for that.
 	 */
-/* #define TCP_METRICS_CWND */
-#ifdef TCP_METRICS_CWND
-	if (metrics.rmx_cwnd)
-		tp->snd_cwnd = max(tp->t_maxseg, min(metrics.rmx_cwnd / 2,
-		    min(tp->snd_wnd, so->so_snd.sb_hiwat)));
-	else
-#endif
 	if (V_tcp_do_rfc3390)
 		tp->snd_cwnd = min(4 * tp->t_maxseg,
 		    max(2 * tp->t_maxseg, 4380));
-#ifdef INET6
-	else if (isipv6 && in6_localaddr(&inp->in6p_faddr))
-		tp->snd_cwnd = tp->t_maxseg * V_ss_fltsz_local;
-#endif
-#if defined(INET) && defined(INET6)
-	else if (!isipv6 && in_localaddr(inp->inp_faddr))
-		tp->snd_cwnd = tp->t_maxseg * V_ss_fltsz_local;
-#endif
-#ifdef INET
-	else if (in_localaddr(inp->inp_faddr))
-		tp->snd_cwnd = tp->t_maxseg * V_ss_fltsz_local;
-#endif
 	else
-		tp->snd_cwnd = tp->t_maxseg * V_ss_fltsz;
+		tp->snd_cwnd = tp->t_maxseg;
 
 	if (CC_ALGO(tp)->conn_init != NULL)
 		CC_ALGO(tp)->conn_init(tp->ccv);

Modified: head/sys/netinet/tcp_output.c
==============================================================================
--- head/sys/netinet/tcp_output.c	Sun Oct 16 19:46:52 2011	(r226446)
+++ head/sys/netinet/tcp_output.c	Sun Oct 16 20:06:44 2011	(r226447)
@@ -89,16 +89,6 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
 	VNET_NAME(path_mtu_discovery), 1,
 	"Enable Path MTU Discovery");
 
-VNET_DEFINE(int, ss_fltsz) = 1;
-SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, slowstart_flightsize, CTLFLAG_RW,
-	VNET_NAME(ss_fltsz), 1,
-	"Slow start flight size");
-
-VNET_DEFINE(int, ss_fltsz_local) = 4;
-SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, local_slowstart_flightsize,
-	CTLFLAG_RW, VNET_NAME(ss_fltsz_local), 1,
-	"Slow start flight size for local networks");
-
 VNET_DEFINE(int, tcp_do_tso) = 1;
 #define	V_tcp_do_tso		VNET(tcp_do_tso)
 SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, tso, CTLFLAG_RW,

Modified: head/sys/netinet/tcp_var.h
==============================================================================
--- head/sys/netinet/tcp_var.h	Sun Oct 16 19:46:52 2011	(r226446)
+++ head/sys/netinet/tcp_var.h	Sun Oct 16 20:06:44 2011	(r226447)
@@ -609,8 +609,6 @@ VNET_DECLARE(int, tcp_do_rfc3390);
 VNET_DECLARE(int, tcp_sendspace);
 VNET_DECLARE(int, tcp_recvspace);
 VNET_DECLARE(int, path_mtu_discovery);
-VNET_DECLARE(int, ss_fltsz);
-VNET_DECLARE(int, ss_fltsz_local);
 VNET_DECLARE(int, tcp_do_rfc3465);
 VNET_DECLARE(int, tcp_abc_l_var);
 #define	V_tcb			VNET(tcb)
@@ -623,8 +621,6 @@ VNET_DECLARE(int, tcp_abc_l_var);
 #define	V_tcp_sendspace		VNET(tcp_sendspace)
 #define	V_tcp_recvspace
svn commit: r226448 - head/sys/netinet
Author: andre
Date: Sun Oct 16 20:18:39 2011
New Revision: 226448
URL: http://svn.freebsd.org/changeset/base/226448

Log:
  Move the tcp_sendspace and tcp_recvspace sysctl's from the middle of
  tcp_usrreq.c to the top of tcp_output.c and tcp_input.c respectively
  next to the socket buffer autosizing controls.

  MFC after:	1 week

Modified:
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_output.c
  head/sys/netinet/tcp_usrreq.c

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Sun Oct 16 20:06:44 2011	(r226447)
+++ head/sys/netinet/tcp_input.c	Sun Oct 16 20:18:39 2011	(r226448)
@@ -183,6 +183,11 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
     VNET_NAME(tcp_insecure_rst), 0,
     "Follow the old (insecure) criteria for accepting RST packets");
 
+VNET_DEFINE(int, tcp_recvspace) = 1024*64
+#define	V_tcp_recvspace	VNET(tcp_recvspace)
+SYSCTL_VNET_INT(_net_inet_tcp, TCPCTL_RECVSPACE, tcp_recvspace, CTLFLAG_RW,
+    VNET_NAME(tcp_recvspace), 0, "Initial receive socket buffer size");
+
 VNET_DEFINE(int, tcp_do_autorcvbuf) = 1;
 #define	V_tcp_do_autorcvbuf	VNET(tcp_do_autorcvbuf)
 SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, recvbuf_auto, CTLFLAG_RW,

Modified: head/sys/netinet/tcp_output.c
==============================================================================
--- head/sys/netinet/tcp_output.c	Sun Oct 16 20:06:44 2011	(r226447)
+++ head/sys/netinet/tcp_output.c	Sun Oct 16 20:18:39 2011	(r226448)
@@ -95,6 +95,11 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
 	VNET_NAME(tcp_do_tso), 0,
 	"Enable TCP Segmentation Offload");
 
+VNET_DEFINE(int, tcp_sendspace) = 1024*32;
+#define	V_tcp_sendspace	VNET(tcp_sendspace)
+SYSCTL_VNET_INT(_net_inet_tcp, TCPCTL_SENDSPACE, tcp_sendspace, CTLFLAG_RW,
+	VNET_NAME(tcp_sendspace), 0, "Initial send socket buffer size");
+
 VNET_DEFINE(int, tcp_do_autosndbuf) = 1;
 #define	V_tcp_do_autosndbuf	VNET(tcp_do_autosndbuf)
 SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, sendbuf_auto, CTLFLAG_RW,

Modified: head/sys/netinet/tcp_usrreq.c
==============================================================================
--- head/sys/netinet/tcp_usrreq.c	Sun Oct 16 20:06:44 2011	(r226447)
+++ head/sys/netinet/tcp_usrreq.c	Sun Oct 16 20:18:39 2011	(r226448)
@@ -1498,20 +1498,6 @@ tcp_ctloutput(struct socket *so, struct
 #undef INP_WLOCK_RECHECK
 
 /*
- * Set the initial send and receive socket buffer sizes for
- * newly created TCP sockets.
- */
-VNET_DEFINE(int, tcp_sendspace) = 1024*32;
-#define	V_tcp_sendspace	VNET(tcp_sendspace)
-SYSCTL_VNET_INT(_net_inet_tcp, TCPCTL_SENDSPACE, tcp_sendspace, CTLFLAG_RW,
-    VNET_NAME(tcp_sendspace), 0, "Initial send socket buffer size");
-
-VNET_DEFINE(int, tcp_recvspace) = 1024*64
-#define	V_tcp_recvspace	VNET(tcp_recvspace)
-SYSCTL_VNET_INT(_net_inet_tcp, TCPCTL_RECVSPACE, tcp_recvspace, CTLFLAG_RW,
-    VNET_NAME(tcp_recvspace), 0, "Initial receive socket buffer size");
-
-/*
  * Attach TCP protocol to socket, allocating
  * internet protocol control block, tcp control block,
  * bufer space, and entering LISTEN state if to accept connections.
Re: svn commit: r226454 - head/sys/netinet
On 17.10.2011 02:16, Bjoern A. Zeeb wrote:
On 17. Oct 2011, at 00:05 , Bjoern A. Zeeb wrote:

Author: bz
Date: Mon Oct 17 00:05:31 2011
New Revision: 226454
URL: http://svn.freebsd.org/changeset/base/226454

Log:
  Add syntactic sugar missed in r226437 and then not added either when
  moving things around in r226448 but desperately needed to always make
  things compile successfully.

GENERIC and LINT did not fail on it as it expanded to:

	int tcp_recvspace = 1024*64

followed by:

#define SYSCTL_VNET_INT(parent, nbr, name, access, ptr, val, descr)	\
	SYSCTL_INT(parent, nbr, name, access, ptr, val, descr)
=
#define SYSCTL_INT(parent, nbr, name, access, ptr, val, descr)		\
	SYSCTL_ASSERT_TYPE(INT, ptr, parent, name);			\
	SYSCTL_OID(parent, nbr, name,					\
	    CTLTYPE_INT | CTLFLAG_MPSAFE | (access),			\
	    ptr, val, sysctl_handle_int, "I", descr)

and the SYSCTL_ASSERT_TYPE() expanding to nothing in

#define SYSCTL_ASSERT_TYPE(type, ptr, parent, name)

leaving just the ';' around; so it ended up as:

	int tcp_recvspace = 1024*64 ;

and an expanded SYSCTL_OID(...);

Oops, sorry for missing that one. And thanks for committing the fix.

--
Andre

  MFC after:	1 week

Modified:
  head/sys/netinet/tcp_input.c

Modified: head/sys/netinet/tcp_input.c
==============================================================================
--- head/sys/netinet/tcp_input.c	Sun Oct 16 22:24:04 2011	(r226453)
+++ head/sys/netinet/tcp_input.c	Mon Oct 17 00:05:31 2011	(r226454)
@@ -183,7 +183,7 @@ SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO,
     VNET_NAME(tcp_insecure_rst), 0,
     "Follow the old (insecure) criteria for accepting RST packets");
 
-VNET_DEFINE(int, tcp_recvspace) = 1024*64
+VNET_DEFINE(int, tcp_recvspace) = 1024*64;
 #define	V_tcp_recvspace	VNET(tcp_recvspace)
 SYSCTL_VNET_INT(_net_inet_tcp, TCPCTL_RECVSPACE, tcp_recvspace, CTLFLAG_RW,
     VNET_NAME(tcp_recvspace), 0, "Initial receive socket buffer size");