Re: question about fopen fd limit
Hello 盛慧华,

Here's another trick that may work: use funopen(3) and provide your own read, write, seek, and close functions for the high fds. You can basically make "cookie" a struct that contains your "int sized" fds.

    FILE *
    funopen(const void *cookie,
        int (*readfn)(void *, char *, int),
        int (*writefn)(void *, const char *, int),
        fpos_t (*seekfn)(void *, fpos_t, int),
        int (*closefn)(void *));

If you need more help, please make sure to email me directly so I can see your question.

-Alfred

On 12/23/16 12:48 AM, 盛慧华 wrote:

hi all,

Thank you for your advice. Solution 2 definitely broadened my horizons, but it may not be a good choice for my project, LoL. I will try mailing the freebsd-current mailing list; if libc is as you describe, maybe I should modify it myself. Thank you so much!

Are you KingSoft's Dr Zhang? Nice to meet you!

winson sheng

From: Hongjiang Zhang
Date: 2016-12-23 11:44
To: 盛慧华; freebsd-net
Subject: RE: RE: question about fopen fd limit

Ok. I know. There are two possible solutions:

1) Quick solution for the short term: change short to int in libc yourself, then buildworld and installworld. Pushing a libc change upstream may take a long time, especially since only a few people encounter this issue. You'd better send email to freebsd-current to confirm whether they accept your suggestion.

2) Workaround: first reserve a series of fds before opening TCP connections. For example, invoke open("/dev/null") for 1 times to get 1 fds. Those fd values are small enough to be held by a "short". After that, start the TCP connections. Once you need to fopen a file, call open("xxx") instead, and then use dup2(old_fd, new_fd) to exchange the two fds. Here old_fd is the one obtained by open("xxx"), and new_fd is one from your reserved fd range. Next, use fdopen(fd, mode). You have to manage the reserved fds yourself, including open/close.

In my eyes:

1) is the quick method, and requires no modifications to your logic.
2) needs you to maintain the reserved consecutive fd slots yourself, which increases the complexity of your logic.

Thanks
Hongjiang Zhang

From: 盛慧华 [mailto:hhsh...@corp.netease.com]
Sent: Friday, December 23, 2016 11:02 AM
To: Hongjiang Zhang; freebsd-net
Subject: Re: RE: question about fopen fd limit

hi all,

I am not mapping TCP to FILE; you misunderstood my meaning. For example, if my server's TCP connections already hold 32000 fds, fopen only has 767 fds left to use. The problem has nothing to do with the TCP fds, but with fopen.

In some particular situations my server will open 1k+ FILEs. That exceeds the fileno limit, overflow occurs, and my server can't open any more files. That's the problem.

So I felt that if the BSD project could officially change the FILE struct's fileno to an UNSIGNED SHORT, that might be an efficient and convenient solution, just for my case? An UNSIGNED SHORT fileno is enough for me, and I don't want to change a lot of functions that take a FILE * as their argument.

Thank you

winson sheng

From: Hongjiang Zhang
Date: 2016-12-23 10:17
To: 盛慧华; freebsd-net
Subject: RE: question about fopen fd limit

Why do you need to map TCP fd to FILE? It is difficult to modify the FILE structure. If possible, let us figure out some new design to meet your requirement.

-Original Message-
From: owner-freebsd-...@freebsd.org [mailto:owner-freebsd-...@freebsd.org] On Behalf Of ???
Sent: Thursday, December 22, 2016 11:57 PM
To: freebsd-net
Subject: question about fopen fd limit

hi all,

We are from NetEase, a Chinese game development company, and one of our products uses FreeBSD as its OS platform. This game has millions of players online, and each server may hold 25000+ TCP connections at the same time. Thanks to BSD and kqueue :)

For example, on one of our servers, a netstat command to list connections overall:

    netstat -an | grep 13396 (it's our listening port) | wc -l
    23221

Recently we did some performance optimization and raised this connection limit to 28000+ or 3+.
But we find Freebsd has a limit that this huge online number will take 28000+ fd, and bsd FILE * struct's fd only support to SHORT . such as .. struct __sFILE { ... short _file; /* (*) fileno, if Unix descriptor, else -1 */ ... so if our server want to fopen some file when we still hold this online number, the fd amount may easily exceed 32767, and fopen definitely return a err code. then the server will appear some fataly ERROR. we do a simple test and confirm this situation. then in fopen's code , we notice that we can use open to return a fd instread of fopen to avoid this overflow, as below 68 /* 1 * File descriptors are a full int, but _file is only a short. 2 * If we
Re: Does FreeBSD have sendmmsg or recvmmsg system calls?
On 1/26/16 4:39 PM, Luigi Rizzo wrote: On Tue, Jan 26, 2016 at 4:31 PM, Gary Jennejohn wrote: On Tue, 26 Jan 2016 17:46:52 -0500 (EST) Daniel Eischen wrote: On Tue, 26 Jan 2016, Gary Jennejohn wrote: On Tue, 26 Jan 2016 09:06:39 -0800 Luigi Rizzo wrote: On Tue, Jan 26, 2016 at 5:40 AM, Konstantin Belousov wrote: On Mon, Jan 25, 2016 at 11:22:13AM +0200, Boris Astardzhiev wrote: +ssize_t +recvmmsg(int s, struct mmsghdr *__restrict msgvec, size_t vlen, int flags, +const struct timespec *__restrict timeout) +{ + size_t i, rcvd; + ssize_t ret; + + if (timeout != NULL) { + fd_set fds; + int res; Please move all local definitions to the beginning of the function. This style recommendation was from 30 years ago and is bad programming practice, as it tends to complicate analysis for the human and increase the chance of improper usage of variables. We should move away from this for new code. Really? I personally find having all variables grouped together much easier to understand. Stumbling across declarations in the middle of the code in a for-loop, for example, takes me by surprise. I also greatly dislike initializing variables in their declarations. Maybe I'm just old fashioned since I have been writing C-code for more than 30 years. +1 Probably should be discouraged, but allowed on a case-by-case basis. One could argue that if you need to declaration blocks in the middle of code, then that code is too complex and should be broken out into a separate function. Right. And code like this int func(void) { int baz, zot; [some more code] if (zot < 5) { int baz = 3; [more code] } [some more code] } is even worse. The compiler (clang) seems to consider this to merely be a reinitialization of baz, but a human might be confused. oh please... :) This is simply an inner variable shadowing the outer one (which is another poor practice, flagged with -Wshadow ). When you exit the scope you get the external variable with its value, as you can see from the following code. 
    #include <stdio.h>

    int
    main(int ac, char *av[])
    {
            int baz = 5;

            printf("1 baz %d\n", baz);
            {
                    int baz = 3;

                    printf("2 baz %d\n", baz);
            }
            printf("3 baz %d\n", baz);
            return 0;
    }

I agree wholeheartedly with Luigi. I am also surprised that shadowed-variable warnings were not more widely understood.

It's time to move forward and make the code more readable and maintainable. Having scoped variables just makes sense. It's true that if you see very many of them, then it's likely time to introduce separate functions, but only in extreme cases, not on a case-by-case basis.

-Alfred

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: kern.ipc.sockbuf limits: anyone mind if I commit this?
On 11/10/15 3:13 PM, Adrian Chadd wrote:
> hiya,
>
> there's a PR with a patch:
>
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204438
> https://github.com/sparrc/freebsd/commit/157f90c55d1d54d33f41c6f7517de1a9c5f5e229
>
> Does anyone know why setting the limits isn't as simple as this patch?
> Does anyone mind if I just commit this?

I don't mind too heavily. The old behavior is bad and confusing, but at least it stops you; the new behavior will be odd and incorrect without warning.

More succinctly: silently "accepting" but actually changing the value passed in seems wrong.

It would seem the reason for the calculation is to actually limit the number of bytes of mbufs (not just data) to the max value? Is that true?

Maybe it makes sense to export sb_max_adj via sysctl and allow setting of it instead? Having silent clipping seems worse than an error.

-Alfred
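To make the "error versus silent clipping" distinction concrete, here is a small userland sketch. It assumes the adjustment scales sb_max by MCLBYTES / (MSIZE + MCLBYTES), so that the limit covers mbuf overhead as well as data, and it uses typical values for MSIZE and MCLBYTES. The function names and the choice of ENOBUFS as the error are illustrative assumptions, not a copy of the kernel code or of the patch in the PR.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed typical values; the real ones come from the kernel headers. */
#define MSIZE		256u	/* bytes per mbuf */
#define MCLBYTES	2048u	/* bytes per mbuf cluster */

/* Data bytes that fit when sb_max bounds data plus mbuf overhead. */
uint64_t
sb_max_adjusted(uint64_t sb_max)
{
	return (sb_max * MCLBYTES / (MSIZE + MCLBYTES));
}

/* Strict behavior: refuse a request that cannot be honored. */
int
reserve_strict(uint64_t sb_max, uint64_t want, uint64_t *got)
{
	if (want > sb_max_adjusted(sb_max))
		return (ENOBUFS);
	*got = want;
	return (0);
}

/* Clipping behavior: always "succeeds", but may grant less than asked. */
int
reserve_clipped(uint64_t sb_max, uint64_t want, uint64_t *got)
{
	uint64_t lim = sb_max_adjusted(sb_max);

	*got = (want > lim) ? lim : want;
	return (0);
}
```

With these assumed constants, asking for exactly sb_max bytes fails in the strict variant and is quietly reduced in the clipped one, which is the surprise the reply objects to.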
Re: Idle connections via accept_filter(9)
This is over 15 years old. I currently don't know of a great solution to this problem. Might make sense to create a timer that runs and refs the socket that will occasionally fire and cleanse out the old connections. Shouldn't be that hard to do. Sent from my iPhone > On Apr 27, 2015, at 9:19 AM, hiren panchasara > wrote: > >> On 04/27/15 at 09:10P, Adrian Chadd wrote: >> ask alfred? :) > > Thanks! CCing him. >> >> >> -a >> >> >>> On 27 April 2015 at 02:22, hiren panchasara >>> wrote: >>> Wanted to see if someone with understanding of accept_filter can >>> comment. >>> >>> cheers, >>> Hiren On 04/09/15 at 09:08P, hiren panchasara wrote: If a connections comes on a socket with accf_data(9) (for example) but never sends any data, it'll occupy resources via staying forever in listen queue of partial unaccepted connections (socket->so_incomp) which can be seen as incqlen in 'netstat -Lan'. Kernel will never pass this connection down to the application as the filter criteria hasn't been met (no data) and application would never know about this connection. What I am not sure is what would be the state of the connection and state of the socket when in this situation. We do come here after finishing 3WHS but before handing this over to the application i.e. before the accept(). From uipc_socket.c: * From the passive side, a socket is created with two queues of sockets: * so_incomp for connections in progress and so_comp for connections already * made and awaiting user acceptance. As a protocol is preparing incoming * connections, it creates a socket structure queued on so_incomp by calling * sonewconn(). When the connection is established, soisconnected() is * called, and transfers the socket structure to so_comp, making it available * to accept(). So, it looks like the connection would be in ESTABLISHED state but socket would be stuck in the so_incomp queue. Other than this special condition of accpet_filter, can such a situation occur? 
Any insight or help in understanding this scenario, and a way to clean up these connections, would be great. (I know TCP doesn't care or worry about idle connections; we have keepalives to check the health of the connection, but that's it, afaik.) Cheers, Hiren
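The periodic-timer idea from the reply at the top of this thread could look something like the following. This is a userland model only: struct pending_conn and sweep_idle() are hypothetical names, and real kernel code would have to take a reference on the listening socket, hold the proper locks, and abort the dropped connections rather than just forget them.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a socket parked on so_incomp by an accept filter. */
struct pending_conn {
	int	id;		/* placeholder for the real socket reference */
	time_t	queued;		/* when the handshake completed */
};

/*
 * Drop every entry that has waited longer than max_idle seconds.
 * Survivors are compacted to the front; the new length is returned.
 */
size_t
sweep_idle(struct pending_conn *q, size_t n, time_t now, time_t max_idle)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (now - q[i].queued <= max_idle)
			q[kept++] = q[i];
		/* else: a real implementation would abort the connection here */
	}
	return (kept);
}
```

A timer callback would call sweep_idle() with the current time every few seconds; the max_idle threshold is a policy knob that would presumably be exposed as a sysctl.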
Re: Implementing backpressure in the NFS server
On 2/25/15 5:08 PM, Garrett Wollman wrote: Here's the scenario: 1) A small number of (Linux) clients run a large number of processes (compute jobs) that read large files sequentially out of an NFS filesystem. Each process is reading from a different file. 2) The clients are behind a network bottleneck. 3) The Linux NFS client will issue NFS3PROC_READ RPCs (potentially including read-ahead) independently for each process. 4) The network bottleneck does not serve to limit the rate at which read RPCs can be issued, because the requests are small (it's only the responses that are large). 5) Even if the responses are delayed, causing one process to block, there are sufficient other processes that are still runnable to allow more reads to be issued. 6) On the server side, because these are requests for different file handles, they will get steered to different NFS service threads by the generic RPC queueing code. 7) Each service thread will process the read to completion, and then block when the reply is transmitted because the socket buffer is full. 8) As more reads continue to be issued by the clients, more and more service threads are stuck waiting for the socket buffer until all of the nfsd threads are blocked. 9) The server is now almost completely idle. Incoming requests can only be serviced when one of the nfsd threads finally manages to put its pending reply on the socket send queue, at which point it can return to the RPC code and pick up one request -- which, because the incoming queues are full of pending reads from the problem clients, is likely to get stuck in the same place. Lather, rinse, repeat. What should happen here? As an administrator, I can certainly increase the number of NFS service threads until there are sufficient threads available to handle all of the offered load -- but the load varies widely over time, and it's likely that I would run into other resource constraints if I did this without limit. (Is 1000 threads practical? 
What happens when a different mix of RPCs comes in -- will it livelock the server?) I'm of the opinion that we need at least one of the following things to mitigate this issue, but I don't have a good knowledge of the RPC code to have an idea how feasible this is: a) Admission control. RPCs should not be removed from the receive queue if the transmit queue is over some high-water mark. This will ensure that a problem client behind a network bottleneck like this one will eventually feel backpressure via TCP window contraction if nothing else. This will also make it more likely that other clients will still get their RPCs processed even if most service threads are taken up by the problem clients. b) Fairness scheduling. There should be some parameter, configurable by the administrator, that restricts the number of nfsd threads any one client can occupy, independent of how many requests it has pending. A really advanced scheduler would allow bursting over the limit for some small number of requests. Does anyone else have thoughts, or even implementation ideas, on this? The default number of threads is insanely low, the only reason I didn't bump them to FreeNAS levels (or higher) was because of the inevitable bikeshed/cryfest about Alfred touching defaults so I didn't bother. I kept them really small, because y'know people whine, and they are capped at ncpu * 8, it really should be higher imo. Just increase the nfs servers to something higher, I think we were at 256 threads in FreeNAS and it did us just fine. Higher seemed ok, except we lost a bit of performance. The only problem you might see is on SMALL machines where people will complain. So probably want an arch specific override or perhaps a memory based sliding scale. If that could become a FreeBSD default (with overrides for small memory machines and arches) that would be even better. 
I think your other suggestions are fine, however the problem is that:

1) they seem complex for an edge case
2) turning them on may tank performance for no good reason if the heuristic is met but we're not in the bad situation

That said, if you want to pursue those options, by all means please do.

-Alfred
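For reference, the two mitigations proposed in the quoted message, (a) admission control and (b) fairness scheduling, boil down to a check made before an RPC is taken off the receive queue. The sketch below is hypothetical throughout: the struct fields, names and thresholds are invented for illustration and do not correspond to anything in the existing RPC code.

```c
#include <stdbool.h>
#include <stddef.h>

/* All names and thresholds here are hypothetical. */
struct client_state {
	size_t	tx_queued;	/* reply bytes waiting in the send buffer */
	int	threads_busy;	/* service threads working for this client */
};

struct limits {
	size_t	tx_hiwat;	/* (a) send-queue high-water mark */
	int	max_threads;	/* (b) per-client thread cap */
};

/* Return true if the next RPC from this client may be dequeued. */
bool
may_dequeue(const struct client_state *c, const struct limits *l)
{
	if (c->tx_queued > l->tx_hiwat)
		return (false);	/* leave it queued so TCP pushes back */
	if (c->threads_busy >= l->max_threads)
		return (false);	/* client already holds its share */
	return (true);
}
```

The performance concern raised in the reply shows up here directly: a client that trips either threshold is throttled even when plenty of service threads are idle, so the thresholds would need careful defaults.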
Re: Adding new media types to if_media.h
On 2/25/15 5:11 PM, Gleb Smirnoff wrote: On Mon, Feb 16, 2015 at 07:50:56PM -0600, Mike Karels wrote: M> Well, I developed the prototype as I had planned, using a 64-bit media M> word, and found that I got about 100 files in GENERIC that didn't compile; M> they attempted to store "media words" in an int. My kingdom for a typedef. M> That didn't meet my goal of KPI compatibility, so I went to Plan B. M> M> Plan B is to steal an unused bit (RFU) to indicate an "extended" media M> type. I then used the variant/subtype field to store the extended type. M> Effectively, the previously unused bit doubles the effective size of the M> subtype field. Given that the previous 5-bit field lasted us 18 years, M> I figured that doubling it would last a while. I also changed the M> SIOGGIFMEDIA ioctl, splitting it for binary compatibility; extended M> types are all mapped to IFM_OTHER (31) using the old interface, but M> are visible using the new one. M> M> With these changes, I modified one driver (vtnet) to use an extended type, M> and the rest of GENERIC is happy. The changes to ifconfig are also fairly M> small. The patch is appended, where email programs will screw it up, M> or at ftp://ftp.karels.net/outgoing/if_media.patch. M> M> The VFAST subtype is a throw-away for testing. M> M> This seems like a reasonably pragmatic change to support the new 40 Gb/s M> media types until someone wants to design an improved but non-backward- M> compatible interface. I think it meets the goal of suitability for M> back-porting; it could be MFCed. I will dare to vote against the crowd. We can't and don't plan to preserve the driver KPI for the 11 branch. The plan, that I hope to accomplish by 11 is to provide a driver KPI, where drivers do not about struct ifnet, and other network stack stuff. Of course, that's a huge change in KPI. But we do it for the sake to avoid future changes. So, all this tricks with one extra bit seem unnecessary to me. 
I'd suggest introducing a new 'struct ifmedia' with enough space, and of course putting extra space in there. Give a new value to SIOCGIFMEDIA. Write new, clear code to handle it, without any extended-bit tricks.

For the sake of the userland API, save the current 'struct ifmedia' as 'struct oifmedia', and move the old ioctl value to OSIOCIGIFMEDIA. Write a function under BURN_BRIDGES that handles OSIOCIGIFMEDIA and tries to convert from ifmedia to oifmedia.

To summarise: the patch adds tricks to just double the ifmedia name space, not solving the problem forever. A new API is introduced, but the old limited one has no foreseeable plan for obsolescence, since the new one is tied to it. All the tricks are performed for the sake of driver KPI stability, which isn't planned to be kept for this major release cycle.

+1, rip the bandaid off.

-Alfred
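For readers trying to picture the "Plan B" being argued against, the extended-bit trick amounts to something like the sketch below. The bit positions and names are hypothetical (the real layout lives in if_media.h and in Mike's patch); only the general shape is taken from the thread: a 5-bit subtype, one stolen bit, and a fallback to "other" (31) for callers using the old interface.

```c
#include <stdint.h>

/* Hypothetical layout, for illustration only. */
#define SUBTYPE_MASK	0x1fu	/* original 5-bit subtype field */
#define EXTENDED_BIT	0x20u	/* formerly unused bit, now "extended" */
#define SUBTYPE_OTHER	31u	/* what old callers see for extended types */

/* Build a media word from an extended flag and a 5-bit subtype index. */
uint32_t
media_make(int extended, uint32_t index)
{
	return ((extended ? EXTENDED_BIT : 0u) | (index & SUBTYPE_MASK));
}

/* True if the word names one of the new, extended media types. */
int
media_is_extended(uint32_t word)
{
	return ((word & EXTENDED_BIT) != 0);
}

/* The 5-bit index, meaningful together with media_is_extended(). */
uint32_t
media_subtype(uint32_t word)
{
	return (word & SUBTYPE_MASK);
}

/* What the old ioctl would report: extended types collapse to "other". */
uint32_t
media_legacy_view(uint32_t word)
{
	return (media_is_extended(word) ? SUBTYPE_OTHER : media_subtype(word));
}
```

The sketch also shows the objection in miniature: the name space only doubles, and every consumer has to know about both the flag and the index.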
Re: [Differential] [Commented On] D1764: Factor out ip6_deletefraghdr()
Can you use the commit log string and try that?

Sent from my iPhone

> On Feb 15, 2015, at 5:32 PM, glebius (Gleb Smirnoff) wrote:
>
> glebius added a comment.
>
> Damn f*ckbrikator doesn't allow me to close the revision, since I don't own it.
>
> Kristof, looks like you will need to manually close all your revisions as I commit them. Or we can just leave some trash in this "pretty" software.
>
> REVISION DETAIL
>   https://reviews.freebsd.org/D1764
>
> To: kristof, ae, glebius
> Cc: ae, glebius, freebsd-net
Re: Fwd: Adding new media types to if_media.h
On 2/8/15 2:41 PM, Mike Karels wrote:
> To solve the second problem, I think the right approach would be to reduce this interface to a truly generic one, such as media type (e.g. Ethernet), generic flags, and perhaps generic status. Then there should be a separate media-specific interface for each type, such as Ethernet and 802.11. To a small extent, we already have that. Solving the second, more general problem requires a whole new driver KPI that will require surgery to every driver, which is not an exercise that I would consider. I am willing to do a prototype for -current for evaluation.
>
> Comments, alternatives, ?

Mike, I think we have enough people to chip in that breaking the KPI is not as bad as you think. I would like to hear the correct, long-term, less hackish proposal first.

Norse has a kernel team that is heavily invested in networking and can help with the transition.

If done right, likely by renaming ALL of the macros, it will be quite trivial to catch all bad cases and move us forward in one great leap.

-Alfred
Nasty bug in startup scripts with interface renaming.
If you happen to use interface renaming, there is a nasty bug lurking in the startup scripts. It seems newly introduced, but I am unsure.

Specifically, the following happens at boot time:

/etc/rc.d/netif is run without args. It gets the list of interfaces and for each interface it calls network_start().

However, in network_start() we have this:

    # Create cloned interfaces
    clone_up $cmdifn

    # Rename interfaces.
    ifnet_rename $cmdifn

    # Configure the interface(s).
    network_common ifn_start $cmdifn

Now it doesn't take much to realize that if 'ifnet_rename' renames 'cmdifn', then the subsequent call to 'network_common ifn_start $cmdifn' will pass a stale interface name as a parameter and cause a bunch of errors.

Example:

    cmdifn="vtnet0"

Therefore:

    # Rename interfaces.
    ifnet_rename vtnet0    # <- gets renamed here to derp0

    # Configure the interface(s).
    network_common ifn_start vtnet0    # <- this seems to cause an error since we're using the old name.

I looked at fixing ifnet_rename() to take a variable to assign to, so for instance the call could turn into something like:

    ifnet_rename cmdifn vtnet0

This way cmdifn would be set to 'derp0' and subsequent stuff would work. However, I then realized that ifnet_rename can take 0 args or MULTIPLE args, and will act on either all interfaces or the ones passed in. So passing another var becomes a problem.

I then realized that if I threw together a patch to fix it "the alfred way", people would probably be upset.

So I'm asking: any suggestions before I go about just fixing this?

-Alfred
Re: RFC: Enabling VIMAGE in GENERIC
On 11/17/14, 3:02 AM, Warner Losh wrote: On Nov 17, 2014, at 12:46 AM, Craig Rodrigues wrote: Hi, PROPOSAL == I would like to get feedback on the following proposal. In the head branch (CURRENT), I would like to enable VIMAGE with this commit: PATCH == Index: sys/conf/NOTES === --- sys/conf/NOTES (revision 274300) +++ sys/conf/NOTES (working copy) @@ -784,8 +784,8 @@ device mn # Munich32x/Falc54 Nx64kbit/sec cards. # Network stack virtualization. -#options VIMAGE -#options VNET_DEBUG # debug for VIMAGE +optionsVIMAGE +optionsVNET_DEBUG # debug for VIMAGE # # Network interfaces: I would like to enable VIMAGE for the following reasons: REASONS (1) VIMAGE cannot be enabled off to the side in a separate library or kernel module. When enabled, it is a kernel ABI incompatible change. This has impact on 3rd party code such as the kernel modules which come with VirtualBox. So the time to do it in CURRENT is now, otherwise we can't consider doing it until FreeBSD-12 timeframe, which is quite a while away. (2) VIMAGE is used in some 3rd party products, such as FreeNAS. These 3rd party products are mostly happy with VIMAGE, but sometimes they encounter problems, and FreeBSD doesn't see these problems because it is disabled by default. (3) Most of the major subsystems like ipfw and pf have been fixed for VIMAGE, and the only way to shake out the last few issues is to make it the default and get feedback from the community. ipfilter still needs to be VIMAGE-ified. (4) Not everyone uses bhyve. FreeBSD jails are an excellent virtualization platform for FreeBSD. Jails are still very popular and performant. VIMAGE makes jails even better by allowing per-jail network stacks. (5) Olivier Cochard-Labbe has provided good network performance results in VIMAGE vs. 
non-VIMAGE kernels: https://lists.freebsd.org/pipermail/freebsd-net/2014-October/040091.html

(6) Certain people like Vitaly "wishmaster" have been running VIMAGE jails in a production environment for quite a while, and would like to see it be the default.

ACTION PLAN
===

(1) Coordinate/communicate with portmgr, since this has kernel ABI implications.

(2) Work with clusteradm@, and try to get a test instance of one of the PF firewalls in the cluster working with a VIMAGE-enabled kernel.

(3) Take a pass through http://wiki.freebsd.org/VIMAGE/TODO and https://bugs.freebsd.org/bugzilla/buglist.cgi?quicksearch=vimage%20or%20vnet and try to clean things up. Get help from net@ developers to do this.

And if these don't get cleaned up?

If they are not cleaned up and stable by 11-RELEASE then we turn it off. That is simple.

(4) Take a pass on trying to VIMAGE-ify ipfilter. I'll need help from the ipfilter maintainers for this and some net@ developers.

And if this doesn't happen?

Well, we do have 2 other firewalls in the kernel to pick from, but we do need VIMAGE, so I will let you draw your own conclusions.

-Alfred
Re: performance of the swtich/case statements
Please run compiler with -O2 -S to get the assembly to see what will actually happen. thanks, -Alfred On 10/29/14 9:24 PM, bycn82 wrote: Hi, According to my understanding in Java programming, the compiler will automatically store the values into a table and jump to the correct one according to the value only when the condition values are in running number, for example. swtich(a){ case 1: code block 1 case 2: code block 2 case 3: code block 3 case 4: code block 4 default: code block 5 } it will be handled by an array 1-->code block 1 2-->code block 2 3-->code block 3 4-->code block 4 others-->code block 5 so when the value N is greater than or lesser than 1, it will be directly jump to the "code block 5" otherwise, it will jump to N, because call the cases are nice in running numbers, but when the cases are messy, it will by just like lots of if/else On Thu, Oct 30, 2014 at 6:30 AM, Erich Dollansky < erichsfreebsdl...@alogt.com> wrote: Hi, On Wed, 29 Oct 2014 22:39:34 +0800 "bycn82" wrote: It is using the switch/case statement to make the code clear in the I am not a C programmer, so I am not clear how the switch/case will be optimized by the compiler in FreeBSD. But I used to write a compiler by myself and I use a hash table to handle all the conditions in the case statements because my compiler don't care about performance!, But in C it is different, the case statement can only accept "int" values, so I don't think it will use hash or what , it should be directly use an array(), So whether it can be optimized it depends on the conditions in the switch/case statements, and I noticed that the cases statement in the 2 loops are not arranging the opcode in running number, so does the compiler smart enough to optimize it? I did not check recently. It was already a long, long time ago, that compilers checked the limits and used the values as an index into a table to jump to the code. I hope that this did not get changed. 
In other words, the order in the code does not matter. The only optimisation the compiler can do is not to use a table if the statement consists of only a small number of entries. Erich
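A quick way to follow the advice at the top of this thread is to compile two small switch statements, one with consecutive case values and one with scattered values, using something like `cc -O2 -S` and compare the assembly. The functions below are made-up test inputs, not code from ipfw, and what the compiler emits for each (jump table, lookup table, compare-and-branch chain) depends on the compiler and target, so the assembly is the thing to read.

```c
/* Consecutive case values (1..4): a natural candidate for a table. */
int
dense_dispatch(int op)
{
	switch (op) {
	case 1:
		return (10);
	case 2:
		return (20);
	case 3:
		return (30);
	case 4:
		return (40);
	default:
		return (-1);
	}
}

/* Scattered case values: the compiler may pick a different strategy. */
int
sparse_dispatch(int op)
{
	switch (op) {
	case 3:
		return (10);
	case 700:
		return (20);
	case 41000:
		return (30);
	case 9000000:
		return (40);
	default:
		return (-1);
	}
}
```

Both functions behave identically regardless of how the cases are ordered in the source; only the generated code may differ.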
Re: Multipath TCP for FreeBSD v0.4
Github offers an excellent system with comments and all that jazz for making pull requests. Super simple to use. On 9/17/14 3:34 PM, Eric Joyner wrote: As a random person without commit privileges, I hope so, too. --- - Eric Joyner On Wed, Sep 17, 2014 at 8:44 AM, Sean Bruno wrote: On Wed, 2014-09-17 at 12:58 +1000, Nigel Williams wrote: On 17/09/14 08:48, Sean Bruno wrote: On Mon, 2014-09-08 at 11:32 +1000, Nigel Williams wrote: Hi, We recently released a new tech report "Design Overview of Multipath TCP version 0.4 for FreeBSD-11" [1]. The report provides some details on various aspects of the implementation (session management, data-level retransmission etc), as of the most recent v0.4 patch [2]. cheers, nigel [1] http://caia.swin.edu.au/reports/140822A/CAIA-TR-140822A.pdf [2] http://caia.swin.edu.au/urp/newtcp/mptcp/tools.html Nigel: Hi! Are you folks interested in having this patchset incorporated into the main line of FreeBSD? I'm open to putting up a phabricator review for you folks at https://reviews.freebsd.org if that's something you guys want to do? sean Hi Sean, Thanks, but I think it's too early to put it into phabricator. The patch releases thus far are early test previews for those who are interested and perhaps willing to play around with. So in short, it's not production quality and not ready for committing to mainline. I'll continue to announce these patches on the mailing list for the time being. I'm of course open to feedback/suggestions/questions and will provide documentation with each release. cheers, nigel Noted. Thank you for the feedback. I hope, that someday, https://reviews.freebsd.org becomes more of a code review tool for users than it is being used for today. 
sean
Re: mbuf autotuning changes
On 9/6/13 12:10 PM, hiren panchasara wrote:
> tunable_mbinit() in kern_mbuf.c looks like this:
>
> 119         /*
> 120          * The default limit for all mbuf related memory is 1/2 of all
> 121          * available kernel memory (physical or kmem).
> 122          * At most it can be 3/4 of available kernel memory.
> 123          */
> 124         realmem = qmin((quad_t)physmem * PAGE_SIZE,
> 125             vm_map_max(kmem_map) - vm_map_min(kmem_map));
> 126         maxmbufmem = realmem / 2;
> 127         TUNABLE_QUAD_FETCH("kern.ipc.maxmbufmem", &maxmbufmem);
> 128         if (maxmbufmem > realmem / 4 * 3)
> 129                 maxmbufmem = realmem / 4 * 3;
>
> If I am reading the code correctly, we lose the value on line 126 when we do the FETCH on line 127.
>
> And after line 127, if we haven't specified kern.ipc.maxmbufmem (in loader.conf - I guess...), we set that value to 0.
>
> And because of that the if condition on line 128 is almost always false?
>
> What am I missing here?
>
> Thanks,
> Hiren

I think TUNABLE_*_FETCH will only write to the variable if it is explicitly set.

Meaning, unless the user actually sets a value in loader.conf, line 127 is a no-op.

-Alfred

-- 
Alfred Perlstein
Re: mbuf autotuning changes
On 9/6/13 12:36 PM, hiren panchasara wrote:
> On Fri, Sep 6, 2013 at 12:14 PM, Alfred Perlstein wrote:
>> On 9/6/13 12:10 PM, hiren panchasara wrote:
>>> tunable_mbinit() in kern_mbuf.c looks like this:
>>>
>>> 119         /*
>>> 120          * The default limit for all mbuf related memory is 1/2 of all
>>> 121          * available kernel memory (physical or kmem).
>>> 122          * At most it can be 3/4 of available kernel memory.
>>> 123          */
>>> 124         realmem = qmin((quad_t)physmem * PAGE_SIZE,
>>> 125             vm_map_max(kmem_map) - vm_map_min(kmem_map));
>>> 126         maxmbufmem = realmem / 2;
>>> 127         TUNABLE_QUAD_FETCH("kern.ipc.maxmbufmem", &maxmbufmem);
>>> 128         if (maxmbufmem > realmem / 4 * 3)
>>> 129                 maxmbufmem = realmem / 4 * 3;
>>>
>>> If I am reading the code correctly, we lose the value on line 126 when we do the FETCH on line 127.
>>>
>>> And after line 127, if we haven't specified kern.ipc.maxmbufmem (in loader.conf - I guess...), we set that value to 0.
>>>
>>> And because of that the if condition on line 128 is almost always false?
>>>
>>> What am I missing here?
>>>
>>> Thanks,
>>> Hiren
>>
>> I think TUNABLE_*_FETCH will only write to the variable if it is explicitly set.
>>
>> Meaning, unless the user actually sets a value in loader.conf, line 127 is a no-op.
>
> Thanks Navdeep and Alfred. That's correct. It's not touching the variable if it's not set.
>
> I guess the other TUNABLE_INT_FETCHs later in the function, which check for variable == 0, confused me, e.g. nmbclusters:
>
> 131         TUNABLE_INT_FETCH("kern.ipc.nmbclusters", &nmbclusters);
> 132         if (nmbclusters == 0)
> 133                 nmbclusters = maxmbufmem / MCLBYTES / 4;
>
> But those are global variables, so here we are just checking whether they are explicitly set or not. If not, we set them.
>
> For maxmbufmem, we set it to 1/2 of realmem, and if the user sets it explicitly then we make sure it's not more than 3/4 of realmem.

Yes. It's somewhat confusing.
I'm all for adding comments to this effect if you have the time and inclination. -Alfred
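The fetch-then-clamp behaviour discussed in this thread can be modelled in userland. This is only a sketch: tunable_quad_fetch() is a hypothetical stand-in for the kernel's TUNABLE_QUAD_FETCH, and a process environment variable stands in for a loader.conf tunable. The point it illustrates is that the default assigned before the fetch survives when nothing is set.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userland stand-in for TUNABLE_QUAD_FETCH: write *var only
 * if the named variable exists. Returns 1 if a value was fetched, 0 if
 * *var was left untouched.
 */
static int
tunable_quad_fetch(const char *name, int64_t *var)
{
	const char *s = getenv(name);
	char *end;
	long long v;

	if (s == NULL || *s == '\0')
		return (0);		/* not set: keep the caller's default */
	v = strtoll(s, &end, 0);
	if (*end != '\0')
		return (0);		/* malformed: keep the default too */
	*var = (int64_t)v;
	return (1);
}

/* Mirrors the logic of lines 126-129 quoted in the thread. */
static int64_t
compute_maxmbufmem(int64_t realmem)
{
	int64_t maxmbufmem = realmem / 2;	/* default: 1/2 of realmem */

	tunable_quad_fetch("kern.ipc.maxmbufmem", &maxmbufmem);
	if (maxmbufmem > realmem / 4 * 3)	/* cap: 3/4 of realmem */
		maxmbufmem = realmem / 4 * 3;
	return (maxmbufmem);
}
```

With realmem = 1000 and nothing set, the result stays at 500; an explicit 900 is capped to 750, and an explicit 600 is taken as-is.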
Re: LOCAL_CREDS are broken ?
On 8/29/13 11:48 AM, Yuri wrote:

The example below breaks with "Protocol not available". But what is wrong? Isn't this the correct usage? LOCAL_CREDS are only handled in kern/uipc_usrreq.c for AF_LOCAL, so it isn't clear why this doesn't work.

Yuri

--- example.c ---
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	int sock;
	int error;
	int oval = 1;

	error = socket(AF_LOCAL, SOCK_SEQPACKET, 0);
	if (error == -1) { perror("socket"); exit(-1); }
	sock = error;

	error = setsockopt(sock, SOL_SOCKET, LOCAL_CREDS, &oval, sizeof(oval));
	if (error) { perror("setsockopt"); exit(-1); }
	return (0);
}

Looks like SOCK_SEQPACKET doesn't support LOCAL_CREDS because its protosw doesn't contain the entry for:

	.pr_ctloutput = &uipc_ctloutput,

Have a look at src/sys/kern/uipc_usrreq.c at around lines 280-332:

static struct protosw localsw[] = {
{
	.pr_type =		SOCK_STREAM,
	.pr_domain =		&localdomain,
	.pr_flags =		PR_CONNREQUIRED|PR_WANTRCVD|PR_RIGHTS,
	.pr_ctloutput =		&uipc_ctloutput,
	.pr_usrreqs =		&uipc_usrreqs_stream
},
{
	.pr_type =		SOCK_DGRAM,
	.pr_domain =		&localdomain,
	.pr_flags =		PR_ATOMIC|PR_ADDR|PR_RIGHTS,
	.pr_ctloutput =		&uipc_ctloutput,
	.pr_usrreqs =		&uipc_usrreqs_dgram
},
{
	.pr_type =		SOCK_SEQPACKET,
	.pr_domain =		&localdomain,

	/*
	 * XXXRW: For now, PR_ADDR because soreceive will bump into them
	 * due to our use of sbappendaddr. A new sbappend variants is needed
	 * that supports both atomic record writes and control data.
	 */
	.pr_flags =		PR_ADDR|PR_ATOMIC|PR_CONNREQUIRED|PR_WANTRCVD|
				    PR_RIGHTS,
	.pr_usrreqs =		&uipc_usrreqs_seqpacket,
},
};

I wonder if this is just a bug/missing code!?

-Alfred
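To narrow down which socket types accept the option, a small probe along these lines may help. It is a sketch under assumptions: probe_local_creds() is a hypothetical helper, and it passes level 0 (SOL_LOCAL where that is defined) rather than SOL_SOCKET, which is how I read unix(4); please verify that against your release. On systems that do not define LOCAL_CREDS at all it simply reports ENOPROTOOPT.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Try to enable LOCAL_CREDS on a fresh AF_LOCAL socket of the given type.
 * Returns 0 on success, otherwise the errno reported by socket() or
 * setsockopt(). Hypothetical helper, not part of any library.
 */
static int
probe_local_creds(int socktype)
{
#ifdef LOCAL_CREDS
#ifdef SOL_LOCAL
	const int level = SOL_LOCAL;
#else
	const int level = 0;	/* assumption: unix(4) options live at level 0 */
#endif
	int on = 1, err = 0, sock;

	sock = socket(AF_LOCAL, socktype, 0);
	if (sock == -1)
		return (errno);
	if (setsockopt(sock, level, LOCAL_CREDS, &on, sizeof(on)) == -1)
		err = errno;
	close(sock);
	return (err);
#else
	(void)socktype;
	return (ENOPROTOOPT);	/* option not available on this platform */
#endif
}
```

Calling it for SOCK_STREAM, SOCK_DGRAM and SOCK_SEQPACKET in turn and printing strerror() of each result would show whether only the SEQPACKET entry is affected.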
Re: [rfc] migrate lagg to an rmlock
On 8/24/13 10:47 AM, Robert N. M. Watson wrote: On 24 Aug 2013, at 17:36, Alfred Perlstein wrote:

We should distinguish "lock contention" from "line contention". When acquiring a rwlock on multiple CPUs concurrently, the cache lines used to implement the lock are contended, as they must bounce between caches via the cache coherence protocol, also referred to as "contention". In the if_lagg code, I assume that the read-only acquire of the rwlock (and perhaps now rmlock) is for data stability rather than mutual exclusion -- e.g., to allow processing to completion against a stable version of the lagg configuration. As such, indeed, there should be no lock contention unless a configuration update takes place, and any line contention is a property of the locking primitive rather than data model.

There are a number of other places in the kernel where migration to an rmlock makes sense -- however, some care must be taken for four reasons: (1) while read locks don't experience line contention, write locking becomes observably more expensive so is not suitable for all rwlock line contention spots -- e.g., rmlocks might not be suitable for tcbinfo; (2) rmlocks, unlike rwlocks, implement reader priority propagation, so you must reason about; and (3) historically, rmlocks have not fully implemented WITNESS so you may get less good debugging output. if_lagg is a nice place to use rmlocks, as reconfigurations are very rare, and it's really all about long-term data stability.

Robert, what do you think about a quick swap of the ifnet structures to counter before 10.x?

Could you be more specific about the proposal you're making? Robert

The lagg patch referred to in the thread seems to indicate that zero locking is needed if we just switched to counter(9). That makes me wonder if we could do better with locking in other places if we switched to counter(9) while we have the chance.
This is the thread: http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html

> Perfect solution would be to convert ifnet(9) to counters(9), but this
> requires much more work, and unfortunately ABI change, so temporarily
> patch lagg(4) manually.
>
> We store counters in the softc, and once per second push their values
> to legacy ifnet counters.

-- Alfred Perlstein
Re: [rfc] migrate lagg to an rmlock
On 8/24/13 7:16 AM, Robert Watson wrote: On Sat, 24 Aug 2013, Alexander V. Chernikov wrote: On 24.08.2013 00:54, Adrian Chadd wrote:

I'd like to commit this to -10. It migrates the if_lagg locking from a rw lock to a rm lock. We see a bit of contention between the transmit and

We're running lagg with rmlock on several hundred heavily loaded machines, it really works better. However, there should not be any contention between receive and transmit side since there is actually no _real_ need to lock RX (and even use lagg receive code at all): http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html

We should distinguish "lock contention" from "line contention". When acquiring a rwlock on multiple CPUs concurrently, the cache lines used to implement the lock are contended, as they must bounce between caches via the cache coherence protocol, also referred to as "contention". In the if_lagg code, I assume that the read-only acquire of the rwlock (and perhaps now rmlock) is for data stability rather than mutual exclusion -- e.g., to allow processing to completion against a stable version of the lagg configuration. As such, indeed, there should be no lock contention unless a configuration update takes place, and any line contention is a property of the locking primitive rather than data model.

There are a number of other places in the kernel where migration to an rmlock makes sense -- however, some care must be taken for four reasons: (1) while read locks don't experience line contention, write locking becomes observably more expensive so is not suitable for all rwlock line contention spots -- e.g., rmlocks might not be suitable for tcbinfo; (2) rmlocks, unlike rwlocks, implement reader priority propagation, so you must reason about; and (3) historically, rmlocks have not fully implemented WITNESS so you may get less good debugging output.
if_lagg is a nice place to use rmlocks, as reconfigurations are very rare, and it's really all about long-term data stability. Robert

Robert, what do you think about a quick swap of the ifnet structures to counter before 10.x? -Alfred
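For readers unfamiliar with why per-CPU counters sidestep the line contention Robert describes, here is a rough userland model. It is not the counter(9) implementation; the names, the slot count and the 64-byte line size are all assumptions made for illustration. Each writer touches only its own padded slot, and a reader sums the slots when it wants a total.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS		4	/* stand-in for the number of CPUs (assumption) */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One slot per CPU, padded so two slots never share a cache line. */
struct pcpu_counter {
	struct {
		uint64_t val;
		char pad[CACHE_LINE - sizeof(uint64_t)];
	} slot[NSLOTS];
};

/* Update path: only the caller's own slot is written, so no lock is taken. */
static void
pcpu_counter_add(struct pcpu_counter *c, unsigned cpu, uint64_t inc)
{
	c->slot[cpu % NSLOTS].val += inc;
}

/* Read path: sum all slots; the result is a snapshot, not an exact total. */
static uint64_t
pcpu_counter_fetch(const struct pcpu_counter *c)
{
	uint64_t sum = 0;
	unsigned i;

	for (i = 0; i < NSLOTS; i++)
		sum += c->slot[i].val;
	return (sum);
}
```

The once-per-second push to the legacy ifnet counters mentioned in the quoted commit message would correspond to calling the fetch routine from a timer.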
Re: Making IB a first class citizen.
On 8/23/13 2:29 PM, Vijay Singh wrote:

We've been running with this change at work for some time and it doesn't seem to be impacting performance at all. We have a statically routed environment though. Also if we really want to optimize for performance wrt routing then IMHO we need to bring back route caching to the tcpcb. Just a thought.

Thanks Vijay, I'll give a little more time and then push this change in. -Alfred

Sent from my iPhone

On Aug 23, 2013, at 1:52 PM, Adrian Chadd wrote:

.. should just check to see what impact it has on performance in the general case. That may change the cache behaviour of the ARP / routing table code. -adrian

On 23 August 2013 09:50, Alfred Perlstein wrote:

Hello -net. This email is about making Infiniband a first class citizen of the FreeBSD kernel. Right now we have one #ifdef OFED in the src tree that makes compiling modules a real challenge: in sys/net/if_llatbl.h the size of "struct llentry" changes depending on whether OFED is compiled in or not, only by 16 bytes, because Infiniband uses 20 bytes for MAC. I am wondering if it would be OK to just unifdef this part to make Infiniband a first class citizen of the kernel. Otherwise maybe we can reverse the ifdef so that it's WITHOUT_OFED and by default have it on. I understand that we cannot do this for FreeBSD 9.x due to breaking network ABI, however I think we still have time to do so in FreeBSD 10.x. If there's no objection I'd like to push this change into head in the next day or two. The only difference is +16 bytes to the "struct llentry". Comments?
Making IB a first class citizen.
Hello -net. This email is about making Infiniband a first class citizen of the FreeBSD kernel. Right now we have one #ifdef OFED in the src tree that makes compiling modules a real challenge: in sys/net/if_llatbl.h the size of "struct llentry" changes depending on whether OFED is compiled in or not, only by 16 bytes, because Infiniband uses 20 bytes for MAC. I am wondering if it would be OK to just unifdef this part to make Infiniband a first class citizen of the kernel. Otherwise maybe we can reverse the ifdef so that it's WITHOUT_OFED and by default have it on. I understand that we cannot do this for FreeBSD 9.x due to breaking network ABI, however I think we still have time to do so in FreeBSD 10.x. If there's no objection I'd like to push this change into head in the next day or two. The only difference is +16 bytes to the "struct llentry". Comments?
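Where the 16 bytes come from can be shown with two unions. This is a sketch modelled on my recollection of the address storage in struct llentry; the union and member names here are assumptions, not copied from sys/net/if_llatbl.h, and the sizes assume an ABI where uint64_t is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address storage sized for a 6-byte Ethernet MAC (no OFED member). */
union ll_addr_small {
	uint64_t mac_aligned;
	uint16_t mac16[3];
};

/* The same storage with room for a 20-byte Infiniband address. */
union ll_addr_ib {
	uint64_t mac_aligned;
	uint16_t mac16[3];
	uint8_t  mac8[20];
};

/* How much the containing structure grows when the IB member is present. */
static size_t
ll_addr_growth(void)
{
	return (sizeof(union ll_addr_ib) - sizeof(union ll_addr_small));
}
```

The 20-byte member is rounded up to 24 by the 8-byte alignment, against 8 bytes without it, which gives the +16 quoted above. A module built with one layout and loaded into a kernel built with the other would see every later field of the structure at the wrong offset, hence the ABI concern.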
Re: 9-STABLE: Chelsio t4nex0: failed to pre-process config file: 2.
This looks like the result of forgetting to include the actual firmware in the kernel config and/or the firmware device itself. Can you check if you've included all the needed extra modules in the kernel config such as firmware(4) and the module for the card firmware itself? A trick you can use is to run "kldstat" after loading the module, you'll see which additional modules were needed for the device to work. Unfortunately the kernel can't autoload those modules while booting. I'm not sure if loader(8) picks up the deps either. -Alfred

On 6/2/13 6:22 PM, John wrote:

Hi Folks, I have a pair of Chelsio T4 cards installed in a new HP DL380 system. The driver does not load at boot time, failing with the message:

t4nex0: failed to pre-process config file: 2.

After the system has finished booting, if I then issue a 'kldload if_cxgbe' command, the driver loads correctly. Note, the driver loads correctly from the command prompt with or without the if_cxgbe_load in /boot/loader.conf. The message is coming from t4_main.c:partition_resources(). I don't see anything obvious that would cause this:

	rc = cfg ? upload_config_file(sc, cfg, &mtype, &maddr) : ENOENT;
	if (rc != 0) {
		mtype = FW_MEMTYPE_CF_FLASH;
		maddr = t4_flash_cfg_addr(sc);
	}

	bzero(&caps, sizeof(caps));
	caps.op_to_write = htobe32(V_FW_CMD_OP(FW_CAPS_CONFIG_CMD) |
	    F_FW_CMD_REQUEST | F_FW_CMD_READ);
	caps.cfvalid_to_len16 = htobe32(F_FW_CAPS_CONFIG_CMD_CFVALID |
	    V_FW_CAPS_CONFIG_CMD_MEMTYPE_CF(mtype) |
	    V_FW_CAPS_CONFIG_CMD_MEMADDR64K_CF(maddr >> 16) | FW_LEN16(caps));
	rc = -t4_wr_mbox(sc, sc->mbox, &caps, sizeof(caps), &caps);
	if (rc != 0) {
		device_printf(sc->dev,
		    "failed to pre-process config file: %d.\n", rc);
		return (rc);
	}

Has anyone run into this? Thanks, John

ps: And the output from loading the driver module by hand:

t4nex0: mem 0xf7cc-0xf7cf,0xf700-0xf77f,0xf6ff-0xf6ff1fff irq 26 at device 0.4 on pci7
t4nex0: installing firmware 1.8.4.0 on card.
cxgbe0: on t4nex0
cxgbe0: Ethernet address: 00:07:43:11:e9:00
cxgbe0: 16 txq, 8 rxq
cxgbe1: on t4nex0
cxgbe1: Ethernet address: 00:07:43:11:e9:08
cxgbe1: 16 txq, 8 rxq
cxgbe2: on t4nex0
cxgbe2: Ethernet address: 00:07:43:11:e9:10
cxgbe2: 16 txq, 8 rxq
cxgbe3: on t4nex0
cxgbe3: Ethernet address: 00:07:43:11:e9:18
cxgbe3: 16 txq, 8 rxq
t4nex0: PCIe x8, 4 ports, 34 MSI-X interrupts, 101 eq, 33 iq
t4nex1: mem 0xfbcc-0xfbcf,0xfb00-0xfb7f,0xfaff-0xfaff1fff irq 58 at device 0.4 on pci36
t4nex1: installing firmware 1.8.4.0 on card.
cxgbe4: on t4nex1
cxgbe4: Ethernet address: 00:07:43:11:e6:a0
cxgbe4: 16 txq, 8 rxq
cxgbe5: on t4nex1
cxgbe5: Ethernet address: 00:07:43:11:e6:a8
cxgbe5: 16 txq, 8 rxq
cxgbe6: on t4nex1
cxgbe6: Ethernet address: 00:07:43:11:e6:b0
cxgbe6: 16 txq, 8 rxq
cxgbe7: on t4nex1
cxgbe7: Ethernet address: 00:07:43:11:e6:b8
cxgbe7: 16 txq, 8 rxq
t4nex1: PCIe x8, 4 ports, 34 MSI-X interrupts, 101 eq, 33 iq
Re: Seeing EINVAL from writev on 8.0 to a non-blocking socket even though the data seems to hit the wire
On 5/1/13 8:03 PM, Richard Sharpe wrote:
> Hi folks,
>
> I am checking to see if there are any known bugs with respect to this
> in FreeBSD 8.0.
>
> Situation is that Samba 3.6.6 uses writev to a non-blocking socket to
> get the SMB2 requests on the wire.
>
> Intermittently, we see the writev return EINVAL even though the data
> has gotten on the wire. This I have verified by grabbing a capture and
> comparing the SMB Sequence number in the last outgoing packet on the
> wire vs the in-memory contents when we get EINVAL.
>
> Sometimes it occurs on a four-element IOVEC, sometimes we get EAGAIN
> on the four-element IOVEC and then we get EINVAL when retrying on a
> smaller IOVEC.
>
> Where should I look to check if there is some path where this might be
> happening? Is this even the correct mailing list?
>

What does the iovec look like when you get EINVAL? Can you sanity check it? Is there anything special about it? (zero length vecs?)

I think there are a few "maxvals" that if overrun cause EINVAL to be returned. example is if your iovec is somehow huge or has many, many elements.

-Alfred
Re: Is it possible to slow down the network interface?
On 4/2/13 4:25 PM, Yuri wrote: For the testing purposes, I would like to be able to control the maximum speed of the interface. There is this command 'ifconfig re0 media 10baseT/UTP' that is supposed to lower the speed to 10Mbps. However, it makes interface unusable on my system. All connections are broken, even the router had to be rebooted. Maybe this is the router issue. Is there any other, "soft" way to change maximum interface speed to a particular value? When somebody sends data too fast, OS sends back ICMP notifications that connection is jammed. My question is, is it possible to impose such condition artificially? Is 'ifconfig re0 media 10baseT/UTP' actually supposed to work transparently, or disconnects are to be expected? try dummynet, it lets you simulate slow or otherwise special networks. man 4 dummynet -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 2/11/13 3:10 AM, Andre Oppermann wrote: On 09.02.2013 15:41, Alfred Perlstein wrote: However, the end result must be far different than what has occurred so far. If the code was deemed unacceptable for general inclusion, then we must find a way to provide a light framework to accomplish the needs of the community member. We've got pluggable congestion control modules thanks to lstewart. You can implement any non-standard congestion control method by adding your own module. They can be compiled into the kernel or loaded as KLD. I consider implementing this as a CC module the correct approach instead of adding yet another sysctl. Doing a CC module like this is very easy. That sounds like a win. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 2/7/13 12:04 PM, George Neville-Neil wrote: On Feb 6, 2013, at 12:28 , Alfred Perlstein wrote: On 2/6/13 4:46 AM, John Baldwin wrote: On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are at a finite size. Also such a burst (especially on a long delay bandwidth network) cause your RTT to increase even if there is no drop which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is some thing that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing high-bw-delay or lan has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. In your long-delay bw link if you do burst out too many (and you never know how many that is since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. This is yet another time when I'm sad about how things happen in FreeBSD. 
A developer come forward with a non-default option that's very useful for some specific workloads, specifically one that contributes much time and $$$ to the project and the community rejects the patches even though it's been successful in other OSes. It makes zero sense. John, can you repost the patch? Maybe there is a way to refactor this somehow so it's like accept filters where we can plug in a hook for TCP? I am very disappointed, but not surprised. I take away the complete opposite feeling. This is how we work through these issues. It's clear from the discussion that this need not be a default in the system, and is a special case. We had a reasoned discussion of what would be best to do and at least two experts in TCP weighed in on the effect this change might have. Not everything proposed by a developer need go into the tree, in particular since these discussions are archived we can always revisit this later. This is exactly how collaborative development should look, whether or not the patch is integrated now, next week, next year, or ever. I agree that discussion is great, we have all learned quite a bit from it, about TCP and the dangers of adjusting buffering without considerable thought. I would not be involved in FreeBSD had this type of discussion and information not be discussed on the lists so readily. However, the end result must be far different than what has occurred so far. If the code was deemed unacceptable for general inclusion, then we must find a way to provide a light framework to accomplish the needs of the community member. Take for instance someone who is starting a company that needs this facility. Which OS will they choose? One who has integrated a useful feature? Or one who has rejected it and left that code in the mailing list archives? As much as expert opinion is valuable, it must include understanding and need of handling special cases and the ability to facilitate those special cases for our users and developers. 
-Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 2/6/13 4:46 AM, John Baldwin wrote: On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are at a finite size. Also such a burst (especially on a long delay bandwidth network) cause your RTT to increase even if there is no drop which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is some thing that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing high-bw-delay or lan has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. In your long-delay bw link if you do burst out too many (and you never know how many that is since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. This is yet another time when I'm sad about how things happen in FreeBSD. A developer come forward with a non-default option that's very useful for some specific workloads, specifically one that contributes much time and $$$ to the project and the community rejects the patches even though it's been successful in other OSes. It makes zero sense. 
John, can you repost the patch? Maybe there is a way to refactor this somehow so it's like accept filters where we can plug in a hook for TCP? I am very disappointed, but not surprised. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
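For reference, the max-burst rule Randall describes in the quoted text (clamp cwnd to no more than 4 packets above what is in flight) reduces to a few lines. This is only an illustration of that sentence, not code from any stack; the function name and the byte-based units are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define MAXBURST_PKTS 4	/* packets allowed beyond the data in flight */

/*
 * Return the congestion window to use, limited to the bytes currently in
 * flight plus MAXBURST_PKTS segments. All arguments are in bytes.
 */
static uint32_t
maxburst_clamp(uint32_t cwnd, uint32_t flight, uint32_t mss)
{
	/* 64-bit arithmetic so flight + 4 * mss cannot wrap. */
	uint64_t limit = (uint64_t)flight + (uint64_t)MAXBURST_PKTS * mss;

	return (cwnd > limit ? (uint32_t)limit : cwnd);
}
```

An idle connection has nothing in flight, so with a 1460-byte MSS a 100000-byte window would be cut to 5840 bytes for the first burst. That is the behaviour John's application wants to avoid and Randall wants to keep.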
Re: m_get2() name
On 2/1/13 7:04 AM, Gleb Smirnoff wrote:

Hi! The m_get2() function allocates a single mbuf with enough space to hold the specified amount of data. It can return either a single mbuf, an mbuf with a standard cluster, page size cluster, or jumbo cluster. It is already utilized in pfsync, bpf, libalias and soon to be utilized in ieee80211. There are probably more places in the stack where it can be used. The question is about its name. Once introduced, I just gave it the name "m_get2" to avoid discussion with myself about bikeshed colour and continue hacking. Now it is getting used more widely, and before we branch any stable branch off the head, we have a last chance to rename it to something more meaningful. Any ideas on a better name are welcome.

m_getbs - mbuf get buffer size. Conveniently also maps to: m_getbs - mbuf get bike shed.

This is a cool function. Maybe it should take an int *error arg as well for ENOBUFS/EINVAL?

-Alfred
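The size-class selection described above, together with the suggested error argument, might be sketched as follows. Everything here is hypothetical: the enum, the function name and the byte thresholds are placeholders chosen for illustration, not the kernel's MHLEN/MCLBYTES/MJUMPAGESIZE values, and no memory is actually allocated.

```c
#include <assert.h>
#include <errno.h>

enum mbuf_class { MC_NONE, MC_MBUF, MC_CLUSTER, MC_JUMBO_PAGE, MC_JUMBO_9K };

/* Placeholder capacities in bytes (assumptions, not kernel constants). */
#define SK_MBUF_DATA	200
#define SK_CLUSTER	2048
#define SK_JUMBO_PAGE	4096
#define SK_JUMBO_9K	9216

/*
 * Pick the smallest buffer class that holds 'size' bytes. On failure
 * returns MC_NONE and stores a reason in *error. A real allocator would
 * also report ENOBUFS when the chosen zone is exhausted; that case is
 * not modelled here.
 */
static enum mbuf_class
pick_mbuf_class(int size, int *error)
{
	*error = 0;
	if (size < 0 || size > SK_JUMBO_9K) {
		*error = EINVAL;	/* request can never be satisfied */
		return (MC_NONE);
	}
	if (size <= SK_MBUF_DATA)
		return (MC_MBUF);
	if (size <= SK_CLUSTER)
		return (MC_CLUSTER);
	if (size <= SK_JUMBO_PAGE)
		return (MC_JUMBO_PAGE);
	return (MC_JUMBO_9K);
}
```

The separate error value lets a caller tell "this size is not supported" apart from "try again later", which a bare NULL return cannot express.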
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 1/30/13 12:29 PM, Andre Oppermann wrote: On 30.01.2013 18:11, Alfred Perlstein wrote: On 1/30/13 11:58 AM, John Baldwin wrote: On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote: Yes, unfortunately I do object. This option, combined with the inflated CWND at the end of a burst, effectively removes much, if not all, of the congestion control mechanisms originally put in place to allow multiple [TCP] streams co-exist on the same pipe. Not having any decay or timeout makes it even worse by doing this burst after an arbitrary amount of time when network conditions and the congestion situation have certainly changed. You have completely ignored the fact that Linux has had this as a global option for years and the Internet has not melted. A socket option is far more fine-grained than their tunable (and requires code changes, not something a random sysadmin can just toggle as "tuning"). I agree with John here. While Andre's objection makes sense, since the majority of Linux/Unix hosts now have this as a global option I can't think of why you would force FreeBSD to be a final holdout. Unless OpenBSD, NetBSD, Solaris/Ilumos also support this it is hardly a majority of Linux/Unix hosts. And this isn't something a "sysadmin" should tune at all. My apologies, I should have been more clear. I was speaking of majority of install base, not majority of distros. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 1/30/13 11:58 AM, John Baldwin wrote: On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote: Yes, unfortunately I do object. This option, combined with the inflated CWND at the end of a burst, effectively removes much, if not all, of the congestion control mechanisms originally put in place to allow multiple [TCP] streams co-exist on the same pipe. Not having any decay or timeout makes it even worse by doing this burst after an arbitrary amount of time when network conditions and the congestion situation have certainly changed. You have completely ignored the fact that Linux has had this as a global option for years and the Internet has not melted. A socket option is far more fine-grained than their tunable (and requires code changes, not something a random sysadmin can just toggle as "tuning"). I agree with John here. While Andre's objection makes sense, since the majority of Linux/Unix hosts now have this as a global option I can't think of why you would force FreeBSD to be a final holdout. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
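To make the "requires code changes" point concrete, this is roughly what an application would have to do to opt in per socket. It is a sketch that assumes John's patch is applied and that TCP_IGNOREIDLE is a boolean option at the IPPROTO_TCP level, as the proposed tcp(4) text describes; set_ignoreidle() is a hypothetical wrapper. On a system without the patch the constant is undefined and the wrapper just reports ENOPROTOOPT.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <assert.h>
#include <errno.h>

/*
 * Enable or disable the proposed TCP_IGNOREIDLE option on a TCP socket.
 * Returns 0 on success, otherwise an errno value.
 */
static int
set_ignoreidle(int fd, int on)
{
#ifdef TCP_IGNOREIDLE
	if (setsockopt(fd, IPPROTO_TCP, TCP_IGNOREIDLE, &on, sizeof(on)) == -1)
		return (errno);
	return (0);
#else
	(void)fd;
	(void)on;
	return (ENOPROTOOPT);	/* kernel headers lack the proposed option */
#endif
}
```

Because each program has to make this call explicitly, the option cannot be flipped for a whole host from a tuning guide, which is the distinction being drawn from the Linux sysctl.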
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 1/24/13 11:14 AM, John Baldwin wrote: On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote: On 24.01.2013 03:31, Sepherosa Ziehau wrote: On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin wrote: On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote: On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. I think what you need is the RFC2861, however, you probably should ignore the "application-limited period" part of RFC2861. Hummm. It appears btw, that Linux uses RFC 2861, but has a global knob to disable it due to applictions having problems. When it is disabled, it doesn't decay the congestion window at all during idle handling. That is, it appears to act the same as if TCP_IGNOREIDLE were enabled. 
From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html: tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18) If enabled, provide RFC 2861 behavior and time out the congestion window after an idle period. An idle period is defined as the current RTO (retransmission timeout). If disabled, the congestion window will not be timed out after an idle period. Also, in this thread on tcp-m it appears no one on that list realizes that there are any implementations which follow the "SHOULD" in RFC 2581 for idle handling (which is what we do currently): Nah, I don't think the idle detection in FreeBSD follows the RFC2581/RFC5681 4.1 (the paragraph before the "SHOULD"). IMHO, that's probably why the author in the following email requestioned about the implementation of "SHOULD" in RFC2581/RFC5681. http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html So if we were to implement RFC 2861, the new socket option would be equivalent to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket basis rather than globally. Agree, per-socket option could be useful than global sysctls under certain situation. However, in addition to the per-socket option, could global sysctl nodes to disable idle_restart/idle_cwv help too? No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. What would happen if nobody would respect the traffic lights anymore? The problem with this argument is Linux has already had this as a tunable option for years and the Internet hasn't melted as a result. Besides that bursting into unknown network conditions is very likely to result in burst losses as well. TCP isn't good at recovering from it. In the end you most likely come out ahead if you decay the restartCWND. 
We have two cases primarily: a) long distance, medium to high RTT, and wildly varying bandwidth (a.k.a. the Internet); b) short distance, low RTT and mostly plenty of bandwidth (a.k.a. Datacenter). The former absolutely definately requires a decayed restartCWND. The latter less so but even there bursting at 10Gig TSO assisted wirespeed isn't going to end too happy more often than not. You forgot my case: c) dedicated long distance links with high bandwidth. Since this seems to be a burning issue I'll come up with a patch in the next days to add a decaying restartCWND that'll be fair and allow a very quick ramp up if no loss occurs. I think this could be useful. OTOH, I still think the TCP_IGNOREIDLE option is useful both with and without a decaying restartCWND? Linux seems to be doing just fine with it for what seems to be a long while. Can we get this committed? -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/
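As a point of comparison for the decaying restart window being discussed, here is a simplified take on the RFC 2861 idea: halve cwnd once per RTO of idle time, but never below the restart window. It is a sketch and not Andre's forthcoming patch; the function name, tick units and parameter set are assumptions, and the RFC's ssthresh adjustment is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Congestion window to use when sending resumes after an idle period.
 * cwnd and restart_win are in bytes; idle and rto share one time unit.
 */
static uint32_t
cwnd_after_idle(uint32_t cwnd, uint32_t restart_win, uint32_t idle,
    uint32_t rto)
{
	/* The floor is the restart window, but never above the old cwnd. */
	uint32_t floor = restart_win < cwnd ? restart_win : cwnd;
	uint32_t n;

	if (rto == 0)
		return (cwnd);
	/* One halving per full RTO spent idle, stopping at the floor. */
	for (n = idle / rto; n > 0 && cwnd > floor; n--)
		cwnd >>= 1;
	return (cwnd < floor ? floor : cwnd);
}
```

With a 64000-byte window and three RTOs of idle time this yields 8000 bytes; after a very long idle period it settles at the restart window. TCP_IGNOREIDLE as proposed corresponds to skipping this step and returning cwnd unchanged.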
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 1/22/13 12:11 PM, John Baldwin wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. This looks good, but it almost sounds like a bug for TCP to be doing this anyhow. Why would one want this behavior? Wouldn't it make sense to keep the window large until there was a problem rather than unconditionally chop it down? I almost think TCP is afraid that you might wind up swapping out a 10gig interface for a modem? I'm just not getting it. (probably simple oversight on my part). What do you think about also making this a sysctl for global on/off by default? -Alfred Index: share/man/man4/tcp.4 === --- share/man/man4/tcp.4(revision 245742) +++ share/man/man4/tcp.4(working copy) @@ -205,6 +205,18 @@ in the .Sx MIB Variables section further down. 
+.It Dv TCP_IGNOREIDLE +If a TCP connection is idle for more than one retransmit timeout, +it enters slow start when new data is available to transmit. +This avoids flooding the network with a full window of traffic at line rate. +It also allows the connection to adjust to changes to network conditions +that occurred while the connection was idle. A connection that sends +bursts of data separated by large idle periods can be permamently stuck in +slow start as a result. +The boolean option +.Dv TCP_IGNOREIDLE +disables the idle connection handling allowing connections to maintain the +existing congestion window when restarting after an idle period. .It Dv TCP_NODELAY Under most circumstances, .Tn TCP Index: sys/netinet/tcp_var.h === --- sys/netinet/tcp_var.h (revision 245742) +++ sys/netinet/tcp_var.h (working copy) @@ -230,6 +230,7 @@ #define TF_NEEDFIN 0x000800/* send FIN (implicit state) */ #define TF_NOPUSH 0x001000/* don't push */ #define TF_PREVVALID0x002000/* saved values for bad rxmit valid */ +#defineTF_IGNOREIDLE 0x004000/* connection is never idle */ #define TF_MORETOCOME 0x01/* More data to be appended to sock */ #define TF_LQ_OVERFLOW 0x02/* listen queue overflow */ #define TF_LASTIDLE 0x04/* connection was previously idle */ Index: sys/netinet/tcp_output.c === --- sys/netinet/tcp_output.c(revision 245742) +++ sys/netinet/tcp_output.c(working copy) @@ -206,7 +206,8 @@ * to send, then transmit; otherwise, investigate further. 
*/ idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una); - if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur) + if (!(tp->t_flags & TF_IGNOREIDLE) && + idle && ticks - tp->t_rcvtime >= tp->t_rxtcur) cc_after_idle(tp); tp->t_flags &= ~TF_LASTIDLE; if (idle) { Index: sys/netinet/tcp.h === --- sys/netinet/tcp.h (revision 245823) +++ sys/netinet/tcp.h (working copy) @@ -156,6 +156,7 @@ #define TCP_NODELAY 1 /* don't delay send to coalesce packets */ #if __BSD_VISIBLE #define TCP_MAXSEG 2 /* set maximum segment size */ +#defineTCP_IGNOREIDLE 3 /* disable idle connection handling */ #define TCP_NOPUSH4 /* don't push last block of write */ #define TCP_NOOPT 8 /* don't use TCP options */ #define TCP_MD5SIG16 /* use MD5 digests (RFC2385) */ Index: sys/netinet/tcp_usrreq.c === --- sys/netinet/tcp_usrreq.c(revision 245742) +++ sys/netinet/tcp_usrreq.c(working copy) @@ -1354,6 +1354,7 @@ case TCP_NODELAY:
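The tcp_output.c hunk above boils down to one extra condition in front of cc_after_idle(). As a minimal userland sketch of that decision (the struct, the SK_ flag values and the function name are hypothetical stand-ins, not the kernel's definitions):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values only; the kernel's TF_* bits differ. */
#define SK_TF_LASTIDLE   0x1u
#define SK_TF_IGNOREIDLE 0x2u

/* Hypothetical stand-in for the few struct tcpcb fields the check reads. */
struct sk_conn {
	uint32_t flags;
	uint32_t snd_max;	/* highest sequence number sent */
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	int32_t  t_rcvtime;	/* tick at which data was last received */
	int32_t  t_rxtcur;	/* current retransmit timeout, in ticks */
};

/*
 * Returns true when the connection would be handed to the congestion
 * control module's after-idle hook, i.e. when it would re-enter slow start.
 */
static bool
should_restart_after_idle(const struct sk_conn *c, int32_t now)
{
	bool idle = (c->flags & SK_TF_LASTIDLE) != 0 ||
	    c->snd_max == c->snd_una;

	if (c->flags & SK_TF_IGNOREIDLE)
		return (false);		/* option set: keep the window */
	return (idle && now - c->t_rcvtime >= c->t_rxtcur);
}
```

On a kernel built with this patch, an application would presumably enable the behaviour the usual way, with an int set to 1 passed to setsockopt(fd, IPPROTO_TCP, TCP_IGNOREIDLE, ...). That call is only meaningful where the patch is applied; the option number 3 is what this patch proposes, not something stock headers define.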
Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks
On 1/14/13 4:56 PM, John Baldwin wrote: On Monday, January 14, 2013 4:42:16 pm Alfred Perlstein wrote: Wouldn't a comment over the code suffice? Something like your email as a header would actually work very nicely! I think just using decimal would be more confusing than explicitly calling it out like: /* begin enumerated (not bitmask) socket option specifiers */ #define TCP_MAXSEG 0x02/* set maximum segment size */ #define TCP_NOPUSH 0x04/* don't push last block of write */ #define TCP_NOOPT 0x08/* don't use TCP options */ #define TCP_MD5SIG 0x10/* use MD5 digests (RFC2385) */ /* end enumerated socket option specifiers */ I have a patch I'll post next which will add a new option as '3'. I think that will make it more obvious and avoid having new options follow the old pattern. Any objection to adding the contents of that email as a comment section? It really would help. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [PATCH] Don't imply TCP and UDP socket options are bitmasks
Wouldn't a comment over the code suffice? Something like your email as a header would actually work very nicely! I think just using decimal would be more confusing than explicitly calling it out like: /* begin enumerated (not bitmask) socket option specifiers */ #define TCP_MAXSEG 0x02/* set maximum segment size */ #define TCP_NOPUSH 0x04/* don't push last block of write */ #define TCP_NOOPT 0x08/* don't use TCP options */ #define TCP_MD5SIG 0x10/* use MD5 digests (RFC2385) */ /* end enumerated socket option specifiers */ On 1/14/13 3:50 PM, John Baldwin wrote: The constants used for TCP and UDP socket options (TCP_NODELAY, etc.) are currently defined as hex values that are individual bits. However, socket options are never masked together, they are used as a simple enumeration of discrete values. Using a bitmask forces us to run out of bits and makes it harder for vendors to try to use a high range of values for local custom options (hoping that they never conflict with a new option value added in stock FreeBSD). The socket options in do use bitmasks for the low bits because they map directly to bits so_options, but then they start a simple enumeration at 0x1000. TCP and UDP socket options do not directly map to bits in a flags field in the PCB (e.g. TF_NODELAY != TCP_NODELAY). I would like to change the representation of the constants to be decimal instead of hex and encourage new options to fill in the gaps between the existing values. This would preserve the existing ABI but keep things more sane in the future (I believe). The diff is this: Index: netinet/tcp.h === --- netinet/tcp.h (revision 245225) +++ netinet/tcp.h (working copy) @@ -151,18 +151,18 @@ /* * User-settable options (used with setsockopt). 
*/ -#defineTCP_NODELAY 0x01/* don't delay send to coalesce packets */ +#defineTCP_NODELAY 1 /* don't delay send to coalesce packets */ #if __BSD_VISIBLE -#defineTCP_MAXSEG 0x02/* set maximum segment size */ -#define TCP_NOPUSH 0x04/* don't push last block of write */ -#define TCP_NOOPT 0x08/* don't use TCP options */ -#define TCP_MD5SIG 0x10/* use MD5 digests (RFC2385) */ -#defineTCP_INFO0x20/* retrieve tcp_info structure */ -#defineTCP_CONGESTION 0x40/* get/set congestion control algorithm */ -#defineTCP_KEEPINIT0x80/* N, time to establish connection */ -#defineTCP_KEEPIDLE0x100 /* L,N,X start keeplives after this period */ -#defineTCP_KEEPINTVL 0x200 /* L,N interval between keepalives */ -#defineTCP_KEEPCNT 0x400 /* L,N number of keepalives before close */ +#defineTCP_MAXSEG 2 /* set maximum segment size */ +#define TCP_NOPUSH 4 /* don't push last block of write */ +#define TCP_NOOPT 8 /* don't use TCP options */ +#define TCP_MD5SIG 16 /* use MD5 digests (RFC2385) */ +#defineTCP_INFO32 /* retrieve tcp_info structure */ +#defineTCP_CONGESTION 64 /* get/set congestion control algorithm */ +#defineTCP_KEEPINIT128 /* N, time to establish connection */ +#defineTCP_KEEPIDLE256 /* L,N,X start keeplives after this period */ +#defineTCP_KEEPINTVL 512 /* L,N interval between keepalives */ +#defineTCP_KEEPCNT 1024/* L,N number of keepalives before close */ #define TCP_CA_NAME_MAX 16 /* max congestion control name length */ Index: netinet/udp.h === --- netinet/udp.h (revision 245225) +++ netinet/udp.h (working copy) @@ -48,7 +48,7 @@ /* * User-settable options (used with setsockopt). */ -#defineUDP_ENCAP 0x01 +#defineUDP_ENCAP 1 /* ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
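John's argument is that option names are only ever compared for equality, so nothing is lost by filling the gaps between the old power-of-two values. A small sketch of that dispatch style follows; the XOPT_ names are made up to avoid clashing with system headers, and the values mirror the ones quoted in this thread (3 being the proposed TCP_IGNOREIDLE).

```c
/* Hypothetical mirror of the option numbers discussed in the thread. */
enum {
	XOPT_NODELAY = 1,
	XOPT_MAXSEG = 2,
	XOPT_IGNOREIDLE = 3,	/* fills the gap; not NODELAY|MAXSEG */
	XOPT_NOPUSH = 4,
	XOPT_NOOPT = 8
};

/* Options are dispatched by exact value, never tested with a bit mask. */
static const char *
xopt_name(int optname)
{
	switch (optname) {
	case XOPT_NODELAY:
		return ("NODELAY");
	case XOPT_MAXSEG:
		return ("MAXSEG");
	case XOPT_IGNOREIDLE:
		return ("IGNOREIDLE");
	case XOPT_NOPUSH:
		return ("NOPUSH");
	case XOPT_NOOPT:
		return ("NOOPT");
	default:
		return ("unknown");
	}
}
```

Because the lookup is a switch on the whole value, 3 is simply a third option and 5, 6 and 7 remain free for later use. Writing the constants in decimal or hex makes no difference to the ABI; it only changes what the header suggests to the next person adding an option.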
Re: FreeBSD boxes as a 'router'...
On 11/20/12 3:30 PM, Barney Cordoba wrote: --- On Tue, 11/20/12, Ingo Flaschberger wrote: From: Ingo Flaschberger Subject: Re: FreeBSD boxes as a 'router'... To: freebsd-net@freebsd.org Date: Tuesday, November 20, 2012, 6:04 PM Am 20.11.2012 23:49, schrieb Alfred Perlstein: On 11/20/12 2:42 PM, Jim Thompson wrote: On Nov 20, 2012, at 3:52 PM, Barney Cordoba wrote: You're entitled to your opinion, but experimental results have tended to show yours incorrect. Jim Agree with Jim. If you want pure packet performance you burn a core to run a polling loop. At new systems, without polling I had better performance and no live-locks, at old systems (Intel 82541GI) polling prevent live-locks. Best test: Loop a GigE Switch, inject a Packet and plug it into the test-box. Yeah, thats a good real-world test. To me "performance" is not "burning a cpu" to get some extra pps. Performance is not dropping buckets of packets. Performance is using less cpu to do the same amount of work. Is a machine that benchmarks at 998Mb/s at 95% cpu really a "higher performance" system than one that does 970Mb/s and uses 50% of the cpu? The measure of performance is to manage an entire load without dropping any packets. If your machine goes into live-lock, then you need more machine. Hacking it so that it drops packets is hardly a solution. Any free CPU is wasted CPU. (unless you're concerned about power consumption, then it's debatable). -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: FreeBSD boxes as a 'router'...
On 11/20/12 2:42 PM, Jim Thompson wrote:
> On Nov 20, 2012, at 3:52 PM, Barney Cordoba wrote:
>> Anyone who even mentions polling should be discounted altogether. Polling had value when you couldn't control the interrupt delays; but interrupt moderation allows you to pace the interrupts any way you like without the inefficiencies of polling.
>
> You're entitled to your opinion, but experimental results have tended to show yours incorrect.
>
> Jim

Agree with Jim. If you want pure packet performance you burn a core to run a polling loop.

-Alfred
Re: [CFT] ipfw SMP-ready dynamic states
Alexander, this is awesome. On 11/13/12 11:28 AM, Alexander V. Chernikov wrote: Hello list! Currently most ipfw operations with dynamic states (keep-state, check-state, limit) are serialized via IPFW_DYN_LOCK() which is per-vnet mutex lock. As a result, performance is limited to the same ~650kpps as in routing (in several cases). Patch changes the following: * global lock is changed to per-bucket mutex * state expiration is done in ipfw_tick every 1s. No expiration is done on forwarding path * hash table resize is done automatically and does not cause all states to be lost The only (architectural) problem I see is unlocked V_dyn_count increments. So, we can do the following: 1) lock increments/decrements via some separate mutex 2) do nothing 3) take some combined approach: Generally, we don't need value to be _exact_. As a result, we count total number of states in every ipfw_tick run and set V_dyn_count to new value. New states still increment V_dyn_count unlocked. What about using per-cpu PCPU counters, and then collecting them for display/reporting? -Alfred Performance: Synthetic traffic, ipfw with single allow ip from any to any rule: 2.4M. single keep-state ip from any to any: 2.2M. Some more tests should be taken (with large number of states, different types of traffic, etc), maybe I can do some next week. You need to run recent -current or merge r242631 and r242834 before applying this patch. ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
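To make the per-CPU counter suggestion concrete, here is a minimal userland sketch: each CPU only ever touches its own slot, and a reader sums the slots when it needs a number for display. The type, the slot count and the function names are assumptions for illustration and are not a kernel API.

```c
#include <stdint.h>

#define PC_NSLOTS 8	/* hypothetical CPU count */

struct percpu_count {
	struct {
		int64_t v;
		/* keep slots on separate 64-byte cache lines */
		char pad[64 - sizeof(int64_t)];
	} slot[PC_NSLOTS];
};

/* Called on the fast path; touches only the caller's own slot. */
static void
pc_add(struct percpu_count *pc, unsigned cpu, int64_t delta)
{
	pc->slot[cpu % PC_NSLOTS].v += delta;
}

/* Called rarely (reporting); the result is approximate while writers run. */
static int64_t
pc_sum(const struct percpu_count *pc)
{
	int64_t total = 0;
	unsigned i;

	for (i = 0; i < PC_NSLOTS; i++)
		total += pc->slot[i].v;
	return (total);
}
```

An individual slot can go negative when a state is created on one CPU and expired on another, but the sum stays correct. That fits the observation in the original mail that the count does not need to be exact at every instant.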
Re: auto tuning tcp
On 11/13/12 12:25 AM, Andre Oppermann wrote: On 13.11.2012 09:18, Alfred Perlstein wrote: On 11/13/12 12:06 AM, Andre Oppermann wrote: On 13.11.2012 07:45, Alfred Perlstein wrote: If you are concerned about the space/time tradeoff I'm pretty happy with making it 1/2, 1/4th, 1/8th the size of maxsockets. (smaller?) Would that work better? I'd go for 1/8 or even 1/16 with a lower bound of 512. More than that is excessive. I'm OK with 1/8. All I'm really going for is trying to make it somewhat better than 512 when un-tuned. > PS: Please note that my patch for mbuf and maxfiles tuning is not yet in HEAD, it's still sitting in my tcp_workqueue branch. I still have to search for derived values that may get totally out of whack with the new scaling scheme. This is cool! Thank you for the feedback. Would you like me to put this on a user branch somewhere for you to merge into your perf branch? I can put it into my branch and also merge it to HEAD with a "Submitted by: alfred" line. Thank you, that works. Note: it's not even compile tested at this point. I should be able to do so tomorrow. Are there other hashes to look at? I noticed a few more: UDBHASHSIZE netinet/tcp_hostcache.c:#define TCP_HOSTCACHE_HASHSIZE 512 netinet/sctp_constants.h:#define SCTP_TCBHASHSIZE 1024 netinet/sctp_constants.h:#define SCTP_PCBHASHSIZE 256 netinet/tcp_syncache.c:#define TCP_SYNCACHE_HASHSIZE512 Any of these look like good targets? I think most could be looked at. I've only glanced. I can provide deltas. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: auto tuning tcp
On 11/13/12 12:06 AM, Andre Oppermann wrote: On 13.11.2012 07:45, Alfred Perlstein wrote: On 11/12/12 10:23 PM, Peter Wemm wrote: On Mon, Nov 12, 2012 at 10:11 PM, Alfred Perlstein wrote: On 11/12/12 10:04 PM, Alfred Perlstein wrote: On 11/12/12 10:48 AM, Alfred Perlstein wrote: On 11/12/12 10:01 AM, Andre Oppermann wrote: I've already added the tunable "kern.maxmbufmem" which is in pages. That's probably not very convenient to work with. I can change it to a percentage of phymem/kva. Would that make you happy? It really makes sense to have the hash table be some relation to sockets rather than buffers. If you are hashing "foo-objects" you want the hash to be some relation to the max amount of "foo-objects" you'll see, not backwards derived from the number of "bar-objects" that "foo-objects" contain, right? Because we are hashing the sockets, right? not clusters. Maybe I'm wrong? I'm open to ideas. Hey Andre, the following patch is what I was thinking (uncompiled/untested), it basically rounds up the maxsockets to a power of 2 and replaces the default 512 tcb hashsize. It might make sense to make the auto-tuning default to a minimum of 512. There are a number of other hashes with static sizes that could make use of this logic provided it's not upside-down. Any thoughts on this? Tune the tcp pcb hash based on maxsockets. Be more forgiving of poorly chosen tunables by finding a closer power of two rather than clamping down to 512. Index: tcp_subr.c === Sorry, GUI mangled the patch... attaching a plain text version. Wait, you want to replace a hash with a flat array? Why even bother to call it a hash at that point? If you are concerned about the space/time tradeoff I'm pretty happy with making it 1/2, 1/4th, 1/8th the size of maxsockets. (smaller?) Would that work better? I'd go for 1/8 or even 1/16 with a lower bound of 512. More than that is excessive. I'm OK with 1/8. All I'm really going for is trying to make it somewhat better than 512 when un-tuned. 
The reason I chose to make it equal to max sockets was a space/time tradeoff: ideally a hash should have zero collisions, and if a user has enough memory for 250,000 sockets, then surely they have enough memory for 256,000 pointers. I agree in general. Though not all large memory servers serve a large number of connections. We have to find a tradeoff here. Having a perfect hash would certainly be laudable. As long as the average hash chain doesn't go beyond a few entries it's not a problem. If you strongly disagree then I am fine with a more conservative setting, just note that effectively the hash table will require 1/2 the factor that we go smaller in additional traversals when we max out the number of sockets. Meaning if the table is 1/4 the size of max sockets, when we hit that many tcp connections I think we'll see an order of average 2 linked list traversals to find a node. At 1/8, then that number becomes 4. I'm fine with that and claim that if you expect N sockets that you would also increase maxfiles/sockets to N*2 to have some headroom. That is a good point. I recall back in 2001 on a PII400 with a custom webserver I wrote having a huge benefit by upping this to 2^14 or maybe even 2^16, I forget, but suddenly my CPU usage went down a huge amount and I didn't have to worry about a load balancer or other tricks. I can certainly believe that. A hash size of 512 is no good if you have more than 4K connections. PS: Please note that my patch for mbuf and maxfiles tuning is not yet in HEAD, it's still sitting in my tcp_workqueue branch. I still have to search for derived values that may get totally out of whack with the new scaling scheme. This is cool! Thank you for the feedback. Would you like me to put this on a user branch somewhere for you to merge into your perf branch? -Alfred
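The space/time tradeoff being negotiated here can be put into numbers. The sketch below assumes the compromise reached in the thread (a fraction of maxsockets with a lower bound of 512) and additionally assumes rounding up to a power of two; the function names are hypothetical.

```c
#include <limits.h>

/*
 * Hypothetical sizing rule: maxsockets / divisor, rounded up to a power
 * of two, never below 512 buckets.
 */
static unsigned
tcb_buckets(unsigned maxsockets, unsigned divisor)
{
	unsigned target = (divisor != 0) ? maxsockets / divisor : maxsockets;
	unsigned n = 512;

	while (n < target && n <= UINT_MAX / 2)
		n <<= 1;
	return (n);
}

/* Average entries per bucket, rounded up, with nconn connections hashed. */
static unsigned
avg_chain_len(unsigned nconn, unsigned buckets)
{
	return (nconn / buckets + (nconn % buckets != 0));
}
```

For 250,000 sockets, a divisor of 8 gives 32768 buckets and at most about 8 entries per bucket when the table is full, while the fixed default of 512 gives roughly 489 per bucket. A successful lookup walks about half a chain on average, which is where the "half the factor" estimate above comes from.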
Re: auto tuning tcp
On 11/12/12 10:23 PM, Peter Wemm wrote: On Mon, Nov 12, 2012 at 10:11 PM, Alfred Perlstein wrote: On 11/12/12 10:04 PM, Alfred Perlstein wrote: On 11/12/12 10:48 AM, Alfred Perlstein wrote: On 11/12/12 10:01 AM, Andre Oppermann wrote: I've already added the tunable "kern.maxmbufmem" which is in pages. That's probably not very convenient to work with. I can change it to a percentage of phymem/kva. Would that make you happy? It really makes sense to have the hash table be some relation to sockets rather than buffers. If you are hashing "foo-objects" you want the hash to be some relation to the max amount of "foo-objects" you'll see, not backwards derived from the number of "bar-objects" that "foo-objects" contain, right? Because we are hashing the sockets, right? not clusters. Maybe I'm wrong? I'm open to ideas. Hey Andre, the following patch is what I was thinking (uncompiled/untested), it basically rounds up the maxsockets to a power of 2 and replaces the default 512 tcb hashsize. It might make sense to make the auto-tuning default to a minimum of 512. There are a number of other hashes with static sizes that could make use of this logic provided it's not upside-down. Any thoughts on this? Tune the tcp pcb hash based on maxsockets. Be more forgiving of poorly chosen tunables by finding a closer power of two rather than clamping down to 512. Index: tcp_subr.c === Sorry, GUI mangled the patch... attaching a plain text version. Wait, you want to replace a hash with a flat array? Why even bother to call it a hash at that point? If you are concerned about the space/time tradeoff I'm pretty happy with making it 1/2, 1/4th, 1/8th the size of maxsockets. (smaller?) Would that work better? The reason I chose to make it equal to max sockets was a space/time tradeoff, ideally a hash should have zero collisions and if a user has enough memory for 250,000 sockets, then surely they have enough memory for 256,000 pointers. 
If you strongly disagree then I am fine with a more conservative setting, just note that effectively the hash table will require 1/2 the factor that we go smaller in additional traversals when we max out the number of sockets. Meaning if the table is 1/4 the size of max sockets, when we hit that many tcp connections I think we'll see an order of average 2 linked list traversals to find a node. At 1/8, then that number becomes 4. I recall back in 2001 on a PII400 with a custom webserver I wrote having a huge benefit by upping this to 2^14 or maybe even 2^16, I forget, but suddenly my CPU went down a huge amount and I didn't have to worry about a load balancer or other tricks. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: auto tuning tcp
On 11/12/12 10:04 PM, Alfred Perlstein wrote: On 11/12/12 10:48 AM, Alfred Perlstein wrote: On 11/12/12 10:01 AM, Andre Oppermann wrote: I've already added the tunable "kern.maxmbufmem" which is in pages. That's probably not very convenient to work with. I can change it to a percentage of phymem/kva. Would that make you happy? It really makes sense to have the hash table be some relation to sockets rather than buffers. If you are hashing "foo-objects" you want the hash to be some relation to the max amount of "foo-objects" you'll see, not backwards derived from the number of "bar-objects" that "foo-objects" contain, right? Because we are hashing the sockets, right? not clusters. Maybe I'm wrong? I'm open to ideas. Hey Andre, the following patch is what I was thinking (uncompiled/untested), it basically rounds up the maxsockets to a power of 2 and replaces the default 512 tcb hashsize. It might make sense to make the auto-tuning default to a minimum of 512. There are a number of other hashes with static sizes that could make use of this logic provided it's not upside-down. Any thoughts on this? Tune the tcp pcb hash based on maxsockets. Be more forgiving of poorly chosen tunables by finding a closer power of two rather than clamping down to 512. Index: tcp_subr.c === Sorry, GUI mangled the patch... attaching a plain text version. Index: tcp_subr.c === --- tcp_subr.c (revision 242936) +++ tcp_subr.c (working copy) @@ -235,7 +235,7 @@ * variable net.inet.tcp.tcbhashsize */ #ifndef TCBHASHSIZE -#define TCBHASHSIZE512 +#define TCBHASHSIZE0 #endif /* @@ -282,6 +282,27 @@ return (0); } +/* + * Take a value and get the next power of 2 that doesn't overflow. + * Used to size the tcp_inpcb hash buckets. + */ +static int +maketcp_hashsize(int size) +{ + int hashsize; + + /* +* auto tune. +* get the next power of 2 higher than maxsockets. 
+*/ + hashsize = 1 << fls(maxsockets); + /* catch overflow, and just go one power of 2 smaller */ + if (hashsize < maxsockets) { + hashsize = 1 << (fls(maxsockets) - 1); + } + return hashsize; +} + void tcp_init(void) { @@ -296,9 +317,20 @@ hashsize = TCBHASHSIZE; TUNABLE_INT_FETCH("net.inet.tcp.tcbhashsize", &hashsize); + if (hashsize == 0) { + /* auto tune based on maxsockets */ + hashsize = maketcp_hashsize(maxsockets); + } + /* +* Be forgiving of admins that don't know to make the tunable +* a power of two. +*/ if (!powerof2(hashsize)) { - printf("WARNING: TCB hash size not a power of 2\n"); - hashsize = 512; /* safe default */ + int oldhashsize = hashsize; + + hashsize = maketcp_hashsize(hashsize); + printf("%s: WARNING: TCB hash size not a power of 2, " + "fixed %d -> %d\n", __func__, oldhashsize, hashsize); } in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, hashsize, hashsize, "tcp_inpcb", tcp_inpcb_init, NULL, UMA_ZONE_NOFREE, ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
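One detail in the draft above is worth spelling out: maketcp_hashsize() takes a size argument but reads the global maxsockets, so the later call maketcp_hashsize(hashsize) for a non-power-of-two tunable would ignore the admin's value. The patch is explicitly marked as uncompiled and untested, so this is a review note rather than a claim about committed code. A userland sketch that uses the parameter follows; sk_fls is a portable stand-in for fls(3), and both names are hypothetical.

```c
#include <limits.h>

/* Stand-in for fls(3): 1-based index of the highest set bit, 0 for 0. */
static int
sk_fls(unsigned v)
{
	int bit = 0;

	while (v != 0) {
		bit++;
		v >>= 1;
	}
	return (bit);
}

/*
 * Next power of two strictly above size, computed from the argument.
 * If that value would not fit in an int, fall back to the largest power
 * of two that does, by clamping the shift instead of shifting first
 * (1 << 31 on a 32-bit int is undefined behaviour).
 */
static int
sk_maketcp_hashsize(int size)
{
	int maxshift = (int)(sizeof(int) * CHAR_BIT) - 2;
	int shift;

	if (size <= 0)
		return (1);
	shift = sk_fls((unsigned)size);
	if (shift > maxshift)
		shift = maxshift;
	return (1 << shift);
}
```

Like the draft, this returns 1024 for an input of 512, because it looks for the power of two strictly above the input. An exact power of two supplied through the tunable never reaches this path in tcp_init(), since the powerof2() check lets it through untouched.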
Re: auto tuning tcp
On 11/12/12 10:48 AM, Alfred Perlstein wrote: On 11/12/12 10:01 AM, Andre Oppermann wrote: I've already added the tunable "kern.maxmbufmem" which is in pages. That's probably not very convenient to work with. I can change it to a percentage of phymem/kva. Would that make you happy? It really makes sense to have the hash table be some relation to sockets rather than buffers. If you are hashing "foo-objects" you want the hash to be some relation to the max amount of "foo-objects" you'll see, not backwards derived from the number of "bar-objects" that "foo-objects" contain, right? Because we are hashing the sockets, right? not clusters. Maybe I'm wrong? I'm open to ideas. Hey Andre, the following patch is what I was thinking (uncompiled/untested), it basically rounds up the maxsockets to a power of 2 and replaces the default 512 tcb hashsize. It might make sense to make the auto-tuning default to a minimum of 512. There are a number of other hashes with static sizes that could make use of this logic provided it's not upside-down. Any thoughts on this? Tune the tcp pcb hash based on maxsockets. Be more forgiving of poorly chosen tunables by finding a closer power of two rather than clamping down to 512. Index: tcp_subr.c === --- tcp_subr.c (revision 242936) +++ tcp_subr.c (working copy) @@ -235,7 +235,7 @@ * variable net.inet.tcp.tcbhashsize */ #ifndef TCBHASHSIZE -#define TCBHASHSIZE 512 +#define TCBHASHSIZE 0 #endif /* @@ -282,6 +282,27 @@ return (0); } +/* + * Take a value and get the next power of 2 that doesn't overflow. + * Used to size the tcp_inpcb hash buckets. + */ +static int +maketcp_hashsize(int size) +{ + int hashsize; + + /* + * auto tune. + * get the next power of 2 higher than maxsockets. 
+ */ + hashsize = 1 << fls(maxsockets); + /* catch overflow, and just go one power of 2 smaller */ + if (hashsize < maxsockets) { + hashsize = 1 << (fls(maxsockets) - 1); + } + return hashsize; +} + void tcp_init(void) { @@ -296,9 +317,20 @@ hashsize = TCBHASHSIZE; TUNABLE_INT_FETCH("net.inet.tcp.tcbhashsize", &hashsize); + if (hashsize == 0) { + /* auto tune based on maxsockets */ + hashsize = maketcp_hashsize(maxsockets); + } + /* + * Be forgiving of admins that don't know to make the tunable + * a power of two. + */ if (!powerof2(hashsize)) { - printf("WARNING: TCB hash size not a power of 2\n"); - hashsize = 512; /* safe default */ + int oldhashsize = hashsize; + + hashsize = maketcp_hashsize(hashsize); + printf("%s: WARNING: TCB hash size not a power of 2, " + "fixed %d -> %d\n", __func__, oldhashsize, hashsize); } in_pcbinfo_init(&V_tcbinfo, "tcp", &V_tcb, hashsize, hashsize, "tcp_inpcb", tcp_inpcb_init, NULL, UMA_ZONE_NOFREE, ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: auto tuning tcp
On 11/12/12 10:01 AM, Andre Oppermann wrote: On 12.11.2012 18:43, Alfred Perlstein wrote: On Nov 12, 2012, at 1:27 AM, Andre Oppermann wrote: On 12.11.2012 09:52, Alfred Perlstein wrote: On 11/11/12 11:28 PM, Andre Oppermann wrote: On 12.11.2012 08:10, Alfred Perlstein wrote: I noticed that TCBHASHSIZE does not autotune. What do you think of the following algorithm? Basically round down to next power of two based on nmbclusters / 64. Please wait out for a real fix of the various mbuf-whatever tuning issue I'll propose shortly. This approach may become inapproriate. Also the mbuf limits can be changed at runtime by sysctl. What is the timeline you are asking for to wait? http://svnweb.freebsd.org/changeset/base/242910 Very cool! So instead of nmbclusters, will maxsockets work? Ideas/suggestions? I've already added the tunable "kern.maxmbufmem" which is in pages. That's probably not very convenient to work with. I can change it to a percentage of phymem/kva. Would that make you happy? It really makes sense to have the hash table be some relation to sockets rather than buffers. If you are hashing "foo-objects" you want the hash to be some relation to the max amount of "foo-objects" you'll see, not backwards derived from the number of "bar-objects" that "foo-objects" contain, right? Because we are hashing the sockets, right? not clusters. Maybe I'm wrong? I'm open to ideas. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: auto tuning tcp
On Nov 12, 2012, at 1:27 AM, Andre Oppermann wrote:
> On 12.11.2012 09:52, Alfred Perlstein wrote:
>> On 11/11/12 11:28 PM, Andre Oppermann wrote:
>>> On 12.11.2012 08:10, Alfred Perlstein wrote:
>>>> I noticed that TCBHASHSIZE does not autotune.
>>>>
>>>> What do you think of the following algorithm?
>>>>
>>>> Basically round down to next power of two based on nmbclusters / 64.
>>>
>>> Please wait for a real fix of the various mbuf-whatever tuning
>>> issues I'll propose shortly. This approach may become inappropriate.
>>> Also the mbuf limits can be changed at runtime by sysctl.
>>>
>> What is the timeline you are asking for to wait?
>
> http://svnweb.freebsd.org/changeset/base/242910

Very cool! So instead of nmbclusters, will maxsockets work? Ideas/suggestions?

-Alfred.
Re: auto tuning tcp
On 11/11/12 11:28 PM, Andre Oppermann wrote:
> On 12.11.2012 08:10, Alfred Perlstein wrote:
>> I noticed that TCBHASHSIZE does not autotune.
>>
>> What do you think of the following algorithm?
>>
>> Basically round down to next power of two based on nmbclusters / 64.
>
> Please wait for a real fix of the various mbuf-whatever tuning
> issues I'll propose shortly. This approach may become inappropriate.
> Also the mbuf limits can be changed at runtime by sysctl.

What timeline are you asking me to wait for?

-Alfred
auto tuning tcp
I noticed that TCBHASHSIZE does not autotune.

What do you think of the following algorithm?

Basically, round nmbclusters / 64 down to a power of two, with a floor of 512.

-Alfred

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>	/* fls(3) */

int
main(int argc, char **argv)
{
	int nmbclusters;
	int pow2cl;

	if (argc < 2) {
		fprintf(stderr, "usage: %s nmbclusters\n", argv[0]);
		return (1);
	}
	nmbclusters = atoi(argv[1]);
	/* Largest power of two <= nmbclusters / 64 (0 if the quotient is 0). */
	pow2cl = (nmbclusters >= 64) ? 1 << (fls(nmbclusters / 64) - 1) : 0;
	/* Never go below the current default of 512. */
	if (pow2cl < 512)
		pow2cl = 512;
	printf("%d\n", pow2cl);
	return (0);
}
Re: Patch for ip6_sprintf(), please review
Thank you Doug, I will be committing this shortly. * Doug Barton [100516 12:21] wrote: > Someone at work has been reading > http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation :) > > This change follows the rules in that draft which will become and RFC as > soon as it finishes winding its way through the process, so I am > supportive of the change you are proposing. > > > Doug > > On 5/15/2010 11:22 PM, Alfred Perlstein wrote: > > Hello, > > > > The following patch seems appropriate to apply > > to fix the kernel ip6_sprintf() function. > > > > What it is doing is ensuring that when we > > abbreviate addresses that the longest string > > of zeros is shortend, not the first run of > > zeros. > > > > Our internal commit log is: > > problem: > > Unification of IPv6 address representation > > fix: > > recommended format of text representing an IPv6 address > > is summarized as follows. > > > > 1. omit leading zeros > > > > 2. "::" used to their maximum extent whenever possible > > > > 3. "::" used where shortens address the most > > > > 4. "::" used in the former part in case of a tie breaker > > > > 5. do not shorten one 16 bit 0 field > > > > 6. use lower case > > > > Present code in ip6_sprintf() is following rules 1,2,5,6. > > Adding fix for following other rules also.For following > > rules 3 and 4, finding out the index where to replace zero's > > with '::' and using that index. > > References: > > http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-04.html > > > > > > Diff is attached in text format. > > > > > > > > > > ___ > > freebsd-net@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" > > > > -- > > ... and that's just a little bit of history repeating. 
> -- Propellerheads

-- 
- Alfred Perlstein
.- AMA, VMOA #5191, 03 vmax, 92 gs500, 85 ch250, 07 zx10
.- FreeBSD committer
Re: Patch for ip6_sprintf(), please review
* Hiroki Sato [100517 22:43] wrote:
> Alfred Perlstein wrote
>   in <20100516062211.gc6...@elvis.mu.org>:
>
> al> The following patch seems appropriate to apply
> al> to fix the kernel ip6_sprintf() function.
> al>
> al> What it is doing is ensuring that when we
> al> abbreviate addresses that the longest string
> al> of zeros is shortened, not the first run of
> al> zeros.
> (snip)
> al> Diff is attached in text format.
>
> I think the code is correct and reasonable for commit.

Ok, I will do some final checks and commit shortly.

Thank you,
-Alfred
Patch for ip6_sprintf(), please review
Hello,

The following patch seems appropriate to apply to fix the kernel ip6_sprintf() function.

What it does is ensure that when we abbreviate addresses, the longest run of zeros is shortened, not the first run of zeros.

Our internal commit log is:

problem:
Unification of IPv6 address representation

fix:
The recommended format of text representing an IPv6 address is summarized as follows.

1. omit leading zeros
2. "::" used to their maximum extent whenever possible
3. "::" used where shortens address the most
4. "::" used in the former part in case of a tie breaker
5. do not shorten one 16 bit 0 field
6. use lower case

Present code in ip6_sprintf() is following rules 1,2,5,6. Adding fix for following the other rules also. For following rules 3 and 4, finding out the index where to replace zeros with '::' and using that index.

References:
http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-04.html

Diff is attached in text format.

-- 
- Alfred Perlstein
.- AMA, VMOA #5191, 03 vmax, 92 gs500, 85 ch250, 07 zx10
.- FreeBSD committer

Index: in6.c
===
--- in6.c	(revision 207329)
+++ in6.c	(working copy)
@@ -61,7 +61,7 @@
  */
 
 #include <sys/cdefs.h>
-__FBSDID("$FreeBSD$");
+__FBSDID("$FreeBSD: head/sys/netinet6/in6.c 207268 2010-04-27 09:47:14Z kib $");
 
 #include "opt_compat.h"
 #include "opt_inet.h"
@@ -1898,7 +1898,7 @@
 char *
 ip6_sprintf(char *ip6buf, const struct in6_addr *addr)
 {
-	int i;
+	int i, cnt = 0, maxcnt = 0, idx = 0, index = 0;
 	char *cp;
 	const u_int16_t *a = (const u_int16_t *)addr;
 	const u_int8_t *d;
@@ -1907,6 +1907,23 @@
 	cp = ip6buf;
 
 	for (i = 0; i < 8; i++) {
+		if (*(a + i) == 0) {
+			cnt++;
+			if (cnt == 1)
+				idx = i;
+		}
+		else if (maxcnt < cnt) {
+			maxcnt = cnt;
+			index = idx;
+			cnt = 0;
+		}
+	}
+	if (maxcnt < cnt) {
+		maxcnt = cnt;
+		index = idx;
+	}
+
+	for (i = 0; i < 8; i++) {
 		if (dcolon == 1) {
 			if (*a == 0) {
 				if (i == 7)
@@ -1917,7 +1934,7 @@
 				dcolon = 2;
 		}
 		if (*a == 0) {
-			if (dcolon == 0 && *(a + 1) == 0) {
+			if (dcolon == 0 && *(a + 1) == 0 && i == index) {
 				if (i == 0)
 					*cp++ = ':';
 				*cp++ = ':';
Re: Can't start mysql in jail
* Miroslav Lachman <000.f...@quip.cz> [090525 10:27] wrote: > Sam Wun wrote: > >Hi, > > > >This seems a common question, but it is a bit different. > >Production OS: FreeBSD 6.2 > >Source OS: FreeBSD 7.2 > > > >I created a jailed mysql 5.1 in my source OS FreeBSD 7.2, and then tar > > As you can see, there is different libc.so version, different threading > library, etc. > > So you can't run MySQL daemon build on different major version OS. You should be able to provided that you install the compat libraries. You may also need to use the libmap.conf facility to fixup threading library to point to libthr but I am unsure. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: ipv6 bugfix, need review.
* Doug Barton [081223 11:46] wrote:
> On Mon, 22 Dec 2008, Alfred Perlstein wrote:
>
> >Hey guys, we found a bug at Juniper and it resolves an issue
> >for us. I've been asked to forward this to FreeBSD, I honestly
> >am not that clear on the issue so I'm hoping someone can step
> >up to review this.
> >
> >Synopsis is:
> >
> >  The traffic class byte is set to 0x in the header of some
> >  BGP packets sent between interfaces that have IPv6 addresses,
> >  instead of the correct setting 0xc0 (INTERNETCONTROL).
> >
> >Fix is small and attached. One thing I am wondering, do we
> >need to check "if (inp)" ? I don't think so.
>
> How about adding an assert to the patch to prove this theory? :)
>
> I'll test it on my home box (which has IPv6) as soon as I'm done with the
> stuff I'm working on atm.
>
> hth,
>
> Doug

Thanks Doug, will do. Please let me know the results.

Do you know how to test whether this is actually being exercised? I guess you could add a sysctl that gets incremented each time this codepath is hit to test?

-- 
- Alfred Perlstein
ipv6 bugfix, need review.
Hey guys, we found a bug at Juniper, and the attached fix resolves the issue for us. I've been asked to forward this to FreeBSD. I honestly am not that clear on the issue, so I'm hoping someone can step up to review this.

Synopsis is:

  The traffic class byte is set to 0x in the header of some
  BGP packets sent between interfaces that have IPv6 addresses,
  instead of the correct setting 0xc0 (INTERNETCONTROL).

Fix is small and attached. One thing I am wondering: do we need to check "if (inp)"? I don't think so.

Index: bsd/sys/netinet/tcp_syncache.c
===
RCS file: /cvs/junos-2008/bsd/sys/netinet/tcp_syncache.c,v
retrieving revision 1.24
diff -p -u -r1.24 tcp_syncache.c
--- bsd/sys/netinet/tcp_syncache.c	29 Jul 2008 17:07:43 -	1.24
+++ bsd/sys/netinet/tcp_syncache.c	16 Dec 2008 19:23:31 -
@@ -1271,6 +1271,7 @@ syncache_respond(sc, m)
 	struct inpcb *inp;
 #ifdef INET6
 	struct ip6_hdr *ip6 = NULL;
+	int inp_tclass;
 #endif
 	struct rt_nexthop *minmtu_nh;
 	struct route_table *rtb = NULL;
@@ -1387,6 +1388,12 @@ syncache_respond(sc, m)
 		/* ip6_hlim is set after checksum */
 		ip6->ip6_flow &= ~IPV6_FLOWLABEL_MASK;
 		ip6->ip6_flow |= sc->sc_flowlabel;
+		/* Set the TC for IPv6 just like TOS for IPv4 */
+		ip6->ip6_flow &= ~IPV6_CLASS_MASK;
+		if (inp) {
+			inp_tclass = IPV6_GET_CLASS(inp->in6p_flowinfo);
+			ip6->ip6_flow |= IPV6_SET_CLASS(inp_tclass);
+		}
 
 		th = (struct tcphdr *)(ip6 + 1);
 	} else

-- 
- Alfred Perlstein
Re: ACE and FreeBSD
* Randall Stewart [081222 03:48] wrote: > Hi all: > > I am trying to get the latest ACE/TAO toolkit compiling with Head... > (the > port is marked broken in 7).. > > In the process of fixing things I found something I am not sure how > to approach.. for now I have just ifdef'd it out but maybe someone > can point me to the right method... > > They are using a ioctl -- SIOCGIFDATA -- to get access to the interface > packet counts and such. Now near as I can tell we don't have that > SIO. A google of someone a few years ago where the question was > asked turned up a, we don't need that instead we should have > access to this information via the sysctl. > > So my immediate thought, hey netstat does this.. and it probably uses > the sysctl... so I go and look at the code.. and tada.. it does a > kread() to get the actual if_data yuck. > > So, is there a sysctl that gets access to this information? I have > poked around in a sysctl -a -N and don't see anything that looks > promising.. > > Pointers to the right approach would be appreciated.. I am not sure > what the monitor stuff is used for.. but I would like to get this > toolkit fully functional if possible :-) You could expand SIOCGIFDATA, but you'd need to make a compat SIOCGIFODATA (OLD DATA) ioctl. Or you could export it maybe through the dev sysctl tree. I like the former. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: working directory within kernel code
* Ferner Cilloniz [081216 12:33] wrote:
> I am trying to determine the current working directory when a system
> call is issued. I'm interested in determining this from a kernel module.
>
> However, because system calls are only given a thread* and a void*,
> which gets cast, is there any way I can find out the cwd?

The thread should point to a proc, which should have a "current dir" vnode in it, or a pointer to a struct that has it... keep poking around.

-Alfred
Re: Timers in drivers vs userland
Have you tried using rtprio? You'll have to be really careful though so as not to jam up the system using it. -Alfred * Len Gross <[EMAIL PROTECTED]> [081018 17:28] wrote: > Slight correction; I should have said more accurate usleep, not "timer." > > -- Len > > On Sat, Oct 18, 2008 at 3:12 PM, Len Gross <[EMAIL PROTECTED]> wrote: > > If I place a timer directly in a driver (like Ethernet) will it be > > subject to less jitter and more consistency than if it were in > > Userland? > > > > I know FreeBSD is not "real time," but I need to be able to run a > > polling algorithm with about 1 ms accuracy. > > > > Thanks in advance. > > > > (Please tell me if there is a better list for this question.) > > > > -- Len > > > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "[EMAIL PROTECTED]" -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Closing connection from an accept_filter(9)
* David DeSimone <[EMAIL PROTECTED]> [081018 02:25] wrote: > Eugene M. Kim <[EMAIL PROTECTED]> wrote: > > > > Is it possible to close a connection from an accept filter, for > > example, in order to prevent an incoming connection with a malformed > > request body from ever reaching the userland? > > How would you propose to find out what is in the request body without > first accepting the connection? By writing a custom accept filter! :) -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Closing connection from an accept_filter(9)
* Eugene M. Kim <[EMAIL PROTECTED]> [081017 17:58] wrote: > Hello, > > Is it possible to close a connection from an accept filter, for example, > in order to prevent an incoming connection with a malformed request body > from ever reaching the userland? Probably, look at what happens inside of syncache or syncookies to sockets that are on accept queue but not yet "accepted". -Alfred > > Cheers, > Eugene > > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "[EMAIL PROTECTED]" -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Question regarding NFS
* Adam Stylinski <[EMAIL PROTECTED]> [080918 17:15] wrote: > Hello, > I am running an IPCop firewall for my entire network. I have a > wireless network device on the blue subnet which must access a freebsd NFS > server. In order to do this, I need to open a DMZ pinhole on a few select > ports. It's my understanding that NFS chooses random ports and I was > wondering if there was a way I could fix this. There is a good reason that > the subnet for the wireless is separate from the wired and I'd rather not > configure this thing over a VPN. The client connecting to the NFS server is > a voyage computer (pretty much a small debian). Also, if at all possible, > I'd like to keep performance reasonably high when large volumes of clients > are connecting to the NFS server, I'm not sure if binding to one port may or > may not make this impossible. I apologize for my stupidity and lack of > understanding when it comes to NFS. Any help would be gladly appreciated, > guys. _usually_ NFS uses port 2049 on the server side. I think the client may bind to a random low port, this would be annoying to change, but could be done with a kernel hack relatively easily. Look at the code in src/sys/nfsclient/nfs_socket.c, there's some code that that deals with binding sockets that you can play with. -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: too many open file descriptors messages since bind 9.4.2-P1 (port dns94)
FWIW, the userland scan of the files is not nearly as bad as what happens in the kernel when hundreds or thousands of objects are accessed that blow out the cache, oh and the locking that occurs as well. * Peter Jeremy <[EMAIL PROTECTED]> [080715 16:43] wrote: > On 2008-Jul-15 16:09:17 -0700, Bakul Shah <[EMAIL PROTECTED]> wrote: > >IIRC, when poll() returns n, you only look at the first n > >values in the pollfd array so it is a win when you expect a > >very small number of fds to be ready. In the select case you > >have to test the bit array until you see the last ready fd. > > No. Both poll(2) and select(2) return the number of FDs ready for > I/O. You need to scan the pollfd or fd_set array until you find that > many FDs ready. > > poll(2) is a win if you only need to test a small number of FDs > compared to the number of FDs that the process has open. In the case > of bind, you have a large number of FDs to test, of which you are > only expecting a very small number to be ready - if you don't > treat fd_set as opaque, select(2) allows you to quickly skip large > (roughly wordsize) chunks of un-interesting FDs. > > Note that, based on sys_generic.c in 7.x and -CURRENT, poll(2) is > limited to checking FD_SETSIZE descriptors, whilst select(2) has > no upper limit. > > -- > Peter Jeremy > Please excuse any delays as the result of my ISP's inability to implement > an MTA that is either RFC2821-compliant or matches their claimed behaviour. -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: FreeBSD network stack Vs others
* ithilgore -- <[EMAIL PROTECTED]> [080204 06:59] wrote:
> I'd like to learn what the basic differences (pros and cons) are between
> the FreeBSD network stack and the other OSs' (especially Linux).
>
> I know that Linux has had everything rewritten from scratch as far as the
> implementation of TCP/IP and the sockets are concerned and would like to
> know if this has made it actually more robust or state-of-the-art than
> FreeBSD's or the opposite.
>
> Some actual technical details and references would be appreciated.

Linux's stack wasn't rewritten from the BSD one; it was written from scratch.

Linux's TCP/IP stack has been rewritten many times over the years with the promise of large performance gains. The fact of the matter is that the performance on the "bleeding edge" of both systems, FreeBSD and Linux, is about the same.

From a BSD proponent's perspective, I would take the pragmatic viewpoint that every time Linux reinvents its stack to get performance or some other feature, FreeBSD isn't far behind with a relatively minor change to its stack to accomplish the same feat.

-Alfred
Re: Packet loss every 30.999 seconds
* David G Lawrence <[EMAIL PROTECTED]> [071221 23:31] wrote: > > > > Can you use a placeholder vnode as a place to restart the scan? > > > > you might have to mark it special so that other threads/things > > > > (getnewvnode()?) don't molest it, but it can provide for a convenient > > > > restart point. > > > > > >That was one of the solutions that I considered and rejected since it > > > would significantly increase the overhead of the loop. > > >The solution provided by Kostik Belousov that uses uio_yield looks like > > > a find solution. I intend to try it out on some servers RSN. > > > > Out of curiosity's sake, why would it make the loop slower? one > > would only add the placeholder when yielding, not for every iteration. > >Actually, I misread your suggestion and was thinking marker flag, > rather than placeholder vnode. Sorry about that. The current code > actually already uses a marker vnode. It is hidden and obfuscated in > the MNT_VNODE_FOREACH macro, further hidden in the __mnt_vnode_first/next > functions, so it should be safe from vnode reclaimation/free problems. That level of obscuring is a bit worrysome. Yes, I did mean placeholder vnode. Even so, is it of utility or not? Or is it already being used and I'm missing something and should just "utsl" at this point? -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Packet loss every 30.999 seconds
* David G Lawrence <[EMAIL PROTECTED]> [071221 15:42] wrote: > > >Unfortunately, the version of the patch that I sent out isn't going to > > > help your problem. It needs to yield at the top of the loop, but vp isn't > > > necessarily valid after the wakeup from the msleep. That's a problem that > > > I'm having trouble figuring out a solution to - the solutions that come > > > to mind will all significantly increase the overhead of the loop. > > > > I apologize for not reading the code as I am swamped, but a technique > > that Matt Dillon used for bufs might work here. > > > > Can you use a placeholder vnode as a place to restart the scan? > > you might have to mark it special so that other threads/things > > (getnewvnode()?) don't molest it, but it can provide for a convenient > > restart point. > >That was one of the solutions that I considered and rejected since it > would significantly increase the overhead of the loop. >The solution provided by Kostik Belousov that uses uio_yield looks like > a find solution. I intend to try it out on some servers RSN. Out of curiosity's sake, why would it make the loop slower? one would only add the placeholder when yielding, not for every iteration. -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Packet loss every 30.999 seconds
* David G Lawrence <[EMAIL PROTECTED]> [071219 09:12] wrote: > > >Try it with "find / -type f >/dev/null" to duplicate the problem > > >almost > > >instantly. > > > > I was able to verify last night that (cd /; tar -cpf -) > all.tar would > > trigger the problem. I'm working getting a test running with > > David's ffs_sync() workaround now, adding a few counters there should > > get this narrowed down a little more. > >Unfortunately, the version of the patch that I sent out isn't going to > help your problem. It needs to yield at the top of the loop, but vp isn't > necessarily valid after the wakeup from the msleep. That's a problem that > I'm having trouble figuring out a solution to - the solutions that come > to mind will all significantly increase the overhead of the loop. >As a very inadequate work-around, you might consider lowering > kern.maxvnodes to something like 2 - that might be low enough to > not trigger the problem, but also be high enough to not significantly > affect system I/O performance. I apologize for not reading the code as I am swamped, but a technique that Matt Dillon used for bufs might work here. Can you use a placeholder vnode as a place to restart the scan? you might have to mark it special so that other threads/things (getnewvnode()?) don't molest it, but it can provide for a convenient restart point. -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: bikeshed for all!
* Julian Elischer <[EMAIL PROTECTED]> [071212 15:13] wrote: > Alfred Perlstein wrote: > >try using "instance". > > > >"Oh I'm going to use the FOO routing instance." > > what do Juniper call it? "Instance" and "vrf". -Alfred > > > > >Works nicely. > > > >* Julian Elischer <[EMAIL PROTECTED]> [071212 14:34] wrote: > >>So, I'm playing with some multiple routing table support.. > >>the first version is a minimal impact version with very limited > >>functionality. > >>It's done that way so I can put it in RELENG_6/7 without breaking ABIs (I > >>hope). > >>Later there will be a more flexible version for-current. > >> > >>Here's the question.. > >> > >>I need a word to use to describe the network view one is currently on.. > >>e.g. if you are usinghe second routing table, you could say I've set xxx > >>to 1 > >>(0 based).. > >> > >> > >>current;y in my code I'm using 'universe' but I don't like that.. > >> > >>one could think of it as a routing plane.. > >>each routing plane has he same interfaces on it but they are logically > >>treated differently becasue each plane has a different routing table. > >> > >> > >>so here's an axample of it in use now... > >>the names should change... > >> > >>setuniverse 1 netstat -rn > >>[shows table 1] > >>setuniverse 2 route add 10.0.0.0/24 192.168.2.1 > >>setuinverse 1 route add 10.0.0.0/24 192.168.3.1 > >>setuniverse 2 route -n get 10.0.0.3 > >>[shows 192.168.2.1] > >>setuniverse 1 route -n get 10.0.0.3 > >>[shows 192.168.3.1] > >>setuniverse 2 start_apache > >>[appache starts, always using 192.168.2.1 to reach the 10.0.0 net. > >> > >> > >>also the syscall is setuniverse() > >> > >>so, you see I really need a better name > >>setrtab? > >> > >>rtab? rtbl? 
> >> > >>and the command should be called "" > >> > >> > >>___ > >>freebsd-net@freebsd.org mailing list > >>http://lists.freebsd.org/mailman/listinfo/freebsd-net > >>To unsubscribe, send any mail to "[EMAIL PROTECTED]" > > > > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "[EMAIL PROTECTED]" -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: bikeshed for all!
* Mike Silbersack <[EMAIL PROTECTED]> [071212 15:09] wrote: > > On Wed, 12 Dec 2007, Julian Elischer wrote: > > >So, I'm playing with some multiple routing table support.. > >the first version is a minimal impact version with very limited > >functionality. > >It's done that way so I can put it in RELENG_6/7 without breaking ABIs (I > >hope). > >Later there will be a more flexible version for-current. > > > >Here's the question.. > > > >I need a word to use to describe the network view one is currently on.. > >e.g. if you are usinghe second routing table, you could say I've set xxx > >to 1 > >(0 based).. > > In the spirit of your subject, why not call them 'sheds'? Because it's horrible. :) -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: bikeshed for all!
* Peter Wood <[EMAIL PROTECTED]> [071212 14:53] wrote: > > so, you see I really need a better name > > setrtab? > > > > rtab? rtbl? > > > > and the command should be called "" > > Would "vrf" (Virtual Routing and Forwarding) be to technical? From > experience Cisco's call it vrf, Junipers use routing-instance IIRC. Yes, Juniper calls it "instance", although, I'm quite sure I've heard "vrf" said over the cubes here. -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: bikeshed for all!
try using "instance". "Oh I'm going to use the FOO routing instance." Works nicely. * Julian Elischer <[EMAIL PROTECTED]> [071212 14:34] wrote: > So, I'm playing with some multiple routing table support.. > the first version is a minimal impact version with very limited > functionality. > It's done that way so I can put it in RELENG_6/7 without breaking ABIs (I > hope). > Later there will be a more flexible version for-current. > > Here's the question.. > > I need a word to use to describe the network view one is currently on.. > e.g. if you are usinghe second routing table, you could say I've set xxx to > 1 > (0 based).. > > > current;y in my code I'm using 'universe' but I don't like that.. > > one could think of it as a routing plane.. > each routing plane has he same interfaces on it but they are logically > treated differently becasue each plane has a different routing table. > > > so here's an axample of it in use now... > the names should change... > > setuniverse 1 netstat -rn > [shows table 1] > setuniverse 2 route add 10.0.0.0/24 192.168.2.1 > setuinverse 1 route add 10.0.0.0/24 192.168.3.1 > setuniverse 2 route -n get 10.0.0.3 > [shows 192.168.2.1] > setuniverse 1 route -n get 10.0.0.3 > [shows 192.168.3.1] > setuniverse 2 start_apache > [appache starts, always using 192.168.2.1 to reach the 10.0.0 net. > > > also the syscall is setuniverse() > > so, you see I really need a better name > setrtab? > > rtab? rtbl? > > and the command should be called "????" > > > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "[EMAIL PROTECTED]" -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Switch pfil(9) to rmlocks
* Robert Watson <[EMAIL PROTECTED]> [071126 12:37] wrote: > > On Fri, 23 Nov 2007, Max Laier wrote: > > >attached is a diff to switch the pfil(9) subsystem to rmlocks, which are > >more suited for the task. I'd like some exposure before doing the switch, > >but I don't expect any fallout. This email is going through the patched > >pfil already - twice. > > FYI, since people are experimenting with rmlocks as a substitute for > rwlocks, I played with moving the global rwlock used to protect the name > space and linkage of UNIX domain sockets to be an rmlock. Kris didn't see > any measurable change in performance for his MySQL benchmarks, but I > figured I'd post the patches as they give a sense of what change impact > things like reader state management have on code. Attached below. I have > no current plans to commit these changes as they appear not to offer > benefit (either because the rwlock overhead was negigible compared to other > costs in the benchmark, or because the read/write blend was too scewed > towards writes -- I think probably the former rather than the latter). I would track the read/write lock mix to get an idea of what the ratio is. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: accept filters and zero copy sockets
* Jonathan Noack <[EMAIL PROTECTED]> [071018 20:59] wrote: > I'm in the process of upgrading my web/database/nfs/jack-of-all-trades box > from 6.2 to RELENG_7. I figured now would be a good time to clean up my > kernel config files. I have the following in my old kernel config: > > # Statically Link in accept filters > options ACCEPT_FILTER_DATA > options ACCEPT_FILTER_HTTP > > # Zero copy sockets support. This enables "zero copy" for sending and > # receiving data via a socket. The send side works for any type of NIC, > # the receive side only works for NICs that support MTUs greater than the > # page size of your architecture and that support header splitting. See > # zero_copy(9) for more details. > options ZERO_COPY_SOCKETS > > Are these options still working/recommended? With all the changes to > networking over the years (this box was originally set up during the 4.x > days and has been upgraded many times) I have no idea if these are still > good things to have. Accept filters should certainly work, otherwise someone will get some noogies... -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Too many TIME_WAIT connections
* Jamie Ostrowski <[EMAIL PROTECTED]> [071001 16:02] wrote: >Hello - > >I've got a mailserver running FreeBSD 4.11 and Sendmail 8.13 that has > been running as a mailserver for a couple of years without any > load/connection problems. Here are my memory stats: > Mem: 71M Active, 265M Inact, 96M Wired, 24M Cache, 60M Buf, 36M Free > Swap: 2048M Total, 760K Used, 2047M Free > > Then all of a sudden we started experiencing dropped connections even though > the load average is generally around 2.0 or less. > > I found the problem today: there are currently 1300 socket connections > suspended at status TIME_WAIT on the incoming smtp port. > > I checked some of my kernel settings: > > kern.ipc.somaxconn = 128 > net.inet.tcp.msl: 3 > > I suspect this is a dos attack: they're just opening these connections, > and then let them hang there and they don't close them, so they just build > up and the machine rejects new connections. > > Based on my configuration, does anyone have some suggestions on how I > might tweak the system to overcome this (apparent?) DOS attack? You can tweak msl, but it probably makes more sense to use some form of firewall, ipfw, ipfilter, pf, etc on the box. you can use netstat to see the remote addresses, just block them. -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Quagga as border router
* Yuri Lukin <[EMAIL PROTECTED]> [070920 16:49] wrote: > On Thu, 20 Sep 2007 00:24:09 -0700, Alfred Perlstein wrote > > > > Juniper is based on FreeBSD. ;-) > > > > On old code from the 4.x days I think, right? In the current release, yes. Would you like a router based on 5.x? :) -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Quagga as border router
* Steve Bertrand <[EMAIL PROTECTED]> [070919 21:14] wrote: > >>> Essentially, I'd like a board with at *least* 6 PCI-X slots, and perhaps > >>> 8 RAM slots (if I can find justification that my router will work better > >>> with up to 16GB of memory). > > > > Why would you go with PCI-X? it's slow and getting end-of life.. > > > > go for PCI-Express. > > there are quad PCI-E gigabit cards available. > > Much lower packet latency. > > As per my last email to Sten and the list... > > I'm not a hardware person. PCI-E, PCI-X, I don't know the difference. > > It was assumed that others would understand what I wanted and be able to > make recommendations to me, and correct me on my terminology. > > All I do know is that there is something more than ISA slots, and 386's > now ;) > > My request wasn't for clarification on motherboard technicalities, it > was essentially a request on a recommendation for a hardware/software > platform based on FreeBSD, that could possibly replace a Cisco 7206-VXR > based on the NPE-G2 processing engine (or equivalent). Juniper is based on FreeBSD. ;-) -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: FreeBSD discarding received packets > MTU
* David Christensen <[EMAIL PROTECTED]> [070907 13:41] wrote: > > > I'm not completely opposed to making such a change, but I don't want > > > to make a default change in the driver's behavior that other people > > > may be depending upon (whether they are aware of it or not). A > > > tunable driver value could be the answer but I'm not entirely sure > > > how it would fare in the hardware at the high end of MTU > > values such > > > as 9000. > > > > Dave: > > > > Internet ettiquette demands being gracious in what you accept. > > The default policy of FreeBSD is to accept such packets. > > This is a really weird bug to track down. > > Other drivers support it. > > > > This isn't worth making a stand over, unless you're trying > > to hold users of YOUR driver hostage. > > > > I'm just being cautious about making changes before I understand > all of the implications. The driver's current behavior is > supported by IEEE 802.3 specification (802.3-2005, 4.2.4.2.1) > and is implemented in the same way for other operating systems > that are very widely deployed (including Windows and Linux) > without any reported problems. The existing bge driver which > was developed for FreeBSD 10 years ago also operates this way, > so all of my references for porting this driver happen to agree > on the same implementation. Which is all well and good, but the age of a bug does not a feature make. Please think of the four points I raised. I think it makes sense to possibly add a "enforce rx mtu" knob somewhere, but it should likely be turned off. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
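The "enforce rx mtu" knob suggested above could look roughly like the following. This is a sketch only: the names `rx_frame_ok`, `enforce_rx_mtu`, `ETHER_OVERHEAD` and `RX_BUF_BYTES` are hypothetical and are not taken from the bce, bge or em drivers, and the 2048-byte buffer is an assumed stand-in for the MCLBYTES limit mentioned elsewhere in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

#define ETHER_OVERHEAD	22	/* 14-byte header + 4-byte VLAN tag + 4-byte CRC */
#define RX_BUF_BYTES	2048	/* assumed receive buffer size (stand-in for MCLBYTES) */

/*
 * Decide whether a received frame should be passed up the stack.
 * With enforce_rx_mtu off (the lenient default argued for above), any
 * frame that fits in one receive buffer is accepted.  With it on,
 * frames larger than the MTU plus link-layer overhead are dropped.
 */
bool
rx_frame_ok(size_t frame_len, size_t mtu, bool enforce_rx_mtu)
{
	if (frame_len > RX_BUF_BYTES)
		return (false);		/* cannot fit in one receive buffer */
	if (enforce_rx_mtu)
		return (frame_len <= mtu + ETHER_OVERHEAD);
	return (true);			/* accept anything that fits */
}
```

With an MTU of 1500 the strict limit works out to 1522 bytes, so a 1550-byte frame is accepted only when the knob is off. A real driver that supports jumbo frames would size its buffers from the configured MTU; the sketch ignores that case.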
Re: FreeBSD discarding received packets > MTU
* David Christensen <[EMAIL PROTECTED]> [070907 10:48] wrote: > > > It could certainly be argued by some that Cisco is not standards > > > compliant in this case for sending an oversized Ethernet frame > > > and expecting everyone to accept it. Hardware has limitations > > > and assuming that all Ethernet controllers can support frames > > > greater than 1522 bytes is not reasonable. Fortunately there is > > > a suitable workaround which is setting a larger MTU for the > > > interface. What size do you use? How did you arrive at that > > > value? > > > > I use 1550 to make it work in the test harness. > > > > The trouble is that if I set the mtu to 1550, and the machine > > talks to another > > such machine with it's mtu also set to 1550 then they > > negotiate a maximum sized > > packet based on 1550, and the problem hits me again. This is > > a web proxy > > and that problem occurs when there are two layers of proxy > > and one proxy talks to > > another. I really just need it to to silently accept a packet some > > 32 bytes or so larger than the stated MTU. > > > > I see no reason for the driver to not do what the em driver > > does and allow > > itself to receive any packet up to the MCLBYTES size. > > > > We only hit this problem recently because the data interfaces on our > > devices are usually em NICs and we only just recently started > > allowing the > > users to use the built in (on DELL 2950) bce interfaces for > > this purpose. > > > > I'm not completely opposed to making such a change, but I don't want > to make a default change in the driver's behavior that other people > may be depending upon (whether they are aware of it or not). A > tunable driver value could be the answer but I'm not entirely sure > how it would fare in the hardware at the high end of MTU values such > as 9000. Dave: Internet ettiquette demands being gracious in what you accept. The default policy of FreeBSD is to accept such packets. This is a really weird bug to track down. 
Other drivers support it. This isn't worth making a stand over, unless you're trying to hold users of YOUR driver hostage. -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
take II: Allocating AF constants for vendors.
* Alfred Perlstein <[EMAIL PROTECTED]> [070821 14:13] wrote:
> Hello all,
>
> I would like to reserve about 64 entries for VENDOR specific address
> families in sys/socket.h.
>
> I think this will allow vendors to comfortably use the array of
> address families without worrying about overlap with FreeBSD
> protocols.
>
> If no one objects I plan to commit this in the next few days.
>
> The format will be along the lines of:
>
> AF_VENDOR0 -> AF_VENDOR63
>
> Suggestions?

Sam asked that I provide some numbers for this proposal, I have them,
however in the meanwhile another proposal I've floated was implementing
a reservation system where FreeBSD would allocate every even number in
the AF_ set of constants and leave the odd numbers for vendors.

Q: "What if a vendor wants to then contribute code to FreeBSD?"
A: They should have asked FreeBSD to reserve a number, now they can
   allocate a FreeBSD one. The numbers are specifically meant for
   internal address families.

Here's the numbers for simply bumping AF_MAX:

Here's what I have for sizing it up 59 entries.

===
GDB commands to get sizes of structures related to AF_MAX:

printf "AF_MAX: %d\n", sizeof(((struct ifnet *)0)->if_afdata) / sizeof(void*)
printf "struct netexport: %d\n", sizeof(struct netexport)
printf "struct ifnet: %d\n", sizeof(struct ifnet)
printf "route.c:rt_tables: %d\n", sizeof(rt_tables)

===
Data from AF_MAX = 37 (FreeBSD-stable)

AF_MAX: 37

Kernel size:
/usr/src/sys/i386/compile/JUNIPER_6_2_SMP % size kernel.debugSMALLMAX
   text    data     bss     dec     hex filename
5964450  791752  367916 7124118  6cb496 kernel.debugSMALLMAX

/usr/src/sys/i386/compile/JUNIPER_6_2_SMP % gdb kernel.debugSMALLMAX
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...
(gdb) printf "AF_MAX: %d\n", sizeof(((struct ifnet *)0)->if_afdata) / sizeof(void*)
AF_MAX: 37
(gdb) printf "struct netexport: %d\n", sizeof(struct netexport)
struct netexport: 316
(gdb) printf "struct ifnet: %d\n", sizeof(struct ifnet)
struct ifnet: 644
(gdb) printf "route.c:rt_tables: %d\n", sizeof(rt_tables)
route.c:rt_tables: 152
(gdb)

===
Data from AF_MAX = 96 (FreeBSD-stable + 59 entries)

AF_MAX: 96

/usr/src/sys/i386/compile/JUNIPER_6_2_SMP % size kernel.debug
   text    data     bss     dec     hex filename
5964450  791752  368140 7124342  6cb576 kernel.debug

.(14:22:56)([EMAIL PROTECTED]) !!! SANDBOX UNSET!!!
/usr/src/sys/i386/compile/JUNIPER_6_2_SMP % gdb kernel.debug
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...
(gdb) printf "AF_MAX: %d\n", sizeof(((struct ifnet *)0)->if_afdata) / sizeof(void*)
AF_MAX: 96
(gdb) printf "struct netexport: %d\n", sizeof(struct netexport)
struct netexport: 552
(gdb) printf "struct ifnet: %d\n", sizeof(struct ifnet)
struct ifnet: 880
(gdb) printf "route.c:rt_tables: %d\n", sizeof(rt_tables)
route.c:rt_tables: 388
(gdb) %

===
Summary of differences:

size:
   text    data     bss     dec     hex filename
5964450  791752  367916 7124118  6cb496 kernel.debugSMALLMAX
5964450  791752  368140 7124342  6cb576 kernel.debug

AF_MAX: 37
struct netexport: 316
struct ifnet: 644
route.c:rt_tables: 152

AF_MAX: 96
struct netexport: 552
struct ifnet: 880
route.c:rt_tables: 388

bss diff:           bytes: 224       percent: 1%
dec diff:           bytes: 224       percent: 1%
AF_MAX:             difference: 59   percent: 62%
struct netexport:   bytes: 236       percent: 43%
struct ifnet:       bytes: 236       percent: 27%
route.c:rt_tables:  bytes: 236       percent: 61%

===
Unknown: (I don't know how to get a static variable from gdb)

unknown: netatm/atm_if.c: -> atm_ifouttbl
===
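The 236-byte growth reported for each structure is what one pointer-sized slot per added address family would cost on i386: 59 new entries at 4 bytes each. A minimal sketch of that arithmetic follows; the helper name `af_array_growth` is hypothetical, and the pointer size is passed in explicitly so the i386 case can be checked on any host.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the FreeBSD tree): bytes added to an
 * array of pointers indexed by address family when AF_MAX grows from
 * old_max to new_max.
 */
size_t
af_array_growth(int old_max, int new_max, size_t ptr_size)
{
	assert(new_max >= old_max);
	return ((size_t)(new_max - old_max) * ptr_size);
}
```

Calling `af_array_growth(37, 96, 4)` gives 236, which matches the measured change in struct netexport (316 to 552), struct ifnet (644 to 880) and rt_tables (152 to 388). With 8-byte pointers the same bump would cost 472 bytes per array.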
Re: OS choice for an edge router
* Kirc Gover <[EMAIL PROTECTED]> [070906 11:10] wrote:
> We are in the stage of planning and research for a commercial development
> of an edge router that will be based mostly on OpenSource software. I would
> like to solicit for information and recommendation if FreeBSD is a suitable
> OS. The router is expected to withstand forwarding of sustained traffic
> from 10Mbps to 1Gbps and maybe more than that. Are there any known
> limitations of FreeBSD in terms of architecture and performance? Can I just
> take out a FreeBSD as is and put it with the hardware without any specific
> or major refinements in its code? I'm very much concerned with its
> capability in forwarding heavy sustained traffic. Packet loss should be at
> minimum and critical userland processes should work normally even under
> heavy load. Are there any known specific limitations of FreeBSD? I have
> browsed through the archives and found a lot of hangups, deadlocks and
> freeze issues. What is the usual or minimum hardware requirement? Is
> soekris box enough, or dual core or ASIC based platforms? I'm aware that
> there are so many FreeBSD based routers and network based devices in the
> market. Is this a way to go over realtime and embedded OS such as VxWorks
> and others (mostly commercial) without putting the licensing cost in
> picture? I really appreciate any help, suggestions and recommendations.
> More power to FreeBSD!
>
> Thanks
> Kirc

Kirc, do some research into Juniper routers. :)

1Gbps shouldn't be a problem for FreeBSD, however you may have to
do some custom tweaks that I can't get into for obvious reasons.

I don't think a Soekris would be sufficient for 1Gbps, however a
mid-range to high-end PC with good NICs and smart software should
suffice.

I think going with FreeBSD would be a great choice.

-- 
- Alfred Perlstein
(forw) Re: Allocating AF constants for vendors.
Bruce,

I haven't heard back from you on this. Can you please comment?
I'd like to add the policy to the header.

- Forwarded message from Alfred Perlstein <[EMAIL PROTECTED]> -

From: Alfred Perlstein <[EMAIL PROTECTED]>
To: "Bruce M. Simpson" <[EMAIL PROTECTED]>
Cc: Max Laier <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Subject: Re: Allocating AF constants for vendors.
Date: Tue, 4 Sep 2007 05:42:24 -0700
Message-ID: <[EMAIL PROTECTED]>
User-Agent: Mutt/1.4.2.3i
Sender: [EMAIL PROTECTED]

* Bruce M. Simpson <[EMAIL PROTECTED]> [070904 03:08] wrote:
> >As you can see we are deferring the "bloat".
> >Does that make sense?
> >
>
> I follow but it still doesn't really make sense.
>
> Granted, you are deferring the growth of arrays sized off AF_MAX but
> only ever by 1 slot.
> What if Vendor Z wants to add 25 entries at once?

Then as long as they allocate odd numbered entries they should
be fine. FreeBSD's AF_MAX does not need to change to accommodate
a vendor, it only has to restrict itself to even numbered slots.

> We would also be tying ourselves down to the notion of a vendor in any
> AF_ allocation. Is this an avenue that people are happy to pursue?

Yes, until the "horrific" problem of the statically sized arrays
is "fixed". Then the allocation policy can change.

-- 
- Alfred Perlstein

- End forwarded message -

-- 
- Alfred Perlstein
Re: Allocating AF constants for vendors.
* Randall Stewart <[EMAIL PROTECTED]> [070904 13:22] wrote:
> Alfred Perlstein wrote:
> >* Bruce M. Simpson <[EMAIL PROTECTED]> [070904 03:08] wrote:
> >
> >>>As you can see we are deferring the "bloat".
> >>>Does that make sense?
> >>>
> >>
> >>I follow but it still doesn't really make sense.
> >>
> >>Granted, you are deferring the growth of arrays sized off AF_MAX but
> >>only ever by 1 slot.
> >>What if Vendor Z wants to add 25 entries at once?
> >
> >
> >Then as long as they allocate odd numbered entries they should
> >be fine. FreeBSD's AF_MAX does not need to change to accommodate
> >a vendor, it only has to restrict itself to even numbered slots.
> >
> >
> >>We would also be tying ourselves down to the notion of a vendor in any
> >>AF_ allocation. Is this an avenue that people are happy to pursue?
> >
> >
> >Yes, until the "horrific" problem of the statically sized arrays
> >is "fixed". Then the allocation policy can change.
> >
> >
> So basically in this scheme we only have to "stumble" across an
> additional slot when we add a new one to FreeBSD.. i.e. some
> random vendor may assign 50 slots (in odd numbers) but FreeBSD
> would not see the growth until really 2 new AF_XXX's are added.
> Then you would have to bump it by 3, to cover the two
> new ones (reserving the vendor specific slots and thus causing
> allocations of unused things).

YES! Exactly.

> This seems like a reasonable compromise to me... I can't imagine
> where we would need to add a lot of new AF_XXX's.. of course
> maybe I just lack imagination :-D

Well, Freebsd or 5 added bluetooth, and freebsd 7 has some IEEE
thing added... sooo... the array is growing, but slowly.

-- 
- Alfred Perlstein
Re: Allocating AF constants for vendors.
* Bruce M. Simpson <[EMAIL PROTECTED]> [070904 03:08] wrote:
> >As you can see we are deferring the "bloat".
> >Does that make sense?
> >
>
> I follow but it still doesn't really make sense.
>
> Granted, you are deferring the growth of arrays sized off AF_MAX but
> only ever by 1 slot.
> What if Vendor Z wants to add 25 entries at once?

Then as long as they allocate odd numbered entries they should
be fine. FreeBSD's AF_MAX does not need to change to accommodate
a vendor, it only has to restrict itself to even numbered slots.

> We would also be tying ourselves down to the notion of a vendor in any
> AF_ allocation. Is this an avenue that people are happy to pursue?

Yes, until the "horrific" problem of the statically sized arrays
is "fixed". Then the allocation policy can change.

-- 
- Alfred Perlstein
Re: Allocating AF constants for vendors.
* Bruce M. Simpson <[EMAIL PROTECTED]> [070903 07:44] wrote:
> Alfred Perlstein wrote:
> >Ok, I'm not really sure what to do here. At Juniper we have approx
> >20 additional entries for AF_ constants. We also have theoretical
> >but not practical "problems" with sparseness and utility of this
> >list, meaning we have plenty of arrays in our version of ifnets and
> >route entries that are also "bloated" as well.
> >
>
> Can you merge them into the list in such a way that AF_MAX does not need
> to slide forward?
> Or do they need to be referenced from within the kernel tree itself?

They are referenced inside the kernel.

> Prevention of code bloat is better than the cure. Not having the code
> in front of me I couldn't say for sure if we're talking about a dozen
> bytes or several pages potentially being wasted, so it is impossible to
> judge.

Well, for the most part it's going to be something like
32*sizeof(void*), so 128 or 256 bytes depending on arch.

> One of my concerns is that we have ifnet.if_afdata, we're not really
> using it, it makes sense to use it for some things.

I'll have to look into this.

> Help from big companies as well as little folks is always appreciated,
> providing we can reach consensus.

YES! :)

> >Otherwise one other policy would be to specify an allocation
> >policy such that new AF_ constants are allocated only for even
> >numbers where odd numbers are left to vendors.
> >
> >This would slow the "bloat" and still provide vendors with something
> >useful.
> >
> >How does that sound?
> >
>
> EPARSE? I don't follow this at all.

Ok, let's say we guarantee that going forward, all odd AF_ constants
are vendor reserved.

So whenever FreeBSD allocates an AF constant, it should be even,
vendors can use odd.
That means that, from socket.h:

#define AF_ARP          35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_IEEE80211    37      /* IEEE 802.11 protocol */
#define AF_MAX          38

Now let's say FreeBSD wants to add an AF constant, the next one to
allocate would be 38, so we have:

#define AF_ARP          35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_IEEE80211    37      /* IEEE 802.11 protocol */
#define AF_NEWPROTO1    38      /* some awesome new protocol! */
#define AF_MAX          39

Ok, well that doesn't explain it much, however, shortly thereafter
we allocate another AF constant in FreeBSD, the list now looks like:

#define AF_ARP          35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_IEEE80211    37      /* IEEE 802.11 protocol */
#define AF_NEWPROTO1    38      /* some awesome new protocol! */
#define AF_VENDOR0      39      /* reserved for vendors. */
#define AF_NEWPROTO2    40      /* some awesome new protocol! */
#define AF_MAX          41

Soon another protocol is added:

#define AF_ARP          35
#define AF_BLUETOOTH    36      /* Bluetooth sockets */
#define AF_IEEE80211    37      /* IEEE 802.11 protocol */
#define AF_NEWPROTO1    38      /* some awesome new protocol! */
#define AF_VENDOR0      39      /* reserved for vendors. */
#define AF_NEWPROTO2    40      /* some awesome new protocol! */
#define AF_VENDOR1      41      /* reserved for vendors. */
#define AF_NEWPROTO3    42      /* some awesome new protocol! */
#define AF_MAX          43

As you can see we are deferring the "bloat".

Does that make sense?

-- 
- Alfred Perlstein
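The even/odd allocation rule walked through above can be expressed as a small sketch. The names `AF_POLICY_START`, `af_is_vendor_slot` and `af_next_freebsd` are hypothetical and do not exist in sys/socket.h; the value 38 is taken from the example, where it is the first constant handed out under the new rule.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * First constant allocated under the proposed policy.  Everything below
 * it (including the odd values 35 and 37) was assigned before the policy
 * existed and stays as it is.
 */
#define AF_POLICY_START	38

/* True if af is an odd slot at or above the policy start (vendor-owned). */
bool
af_is_vendor_slot(int af)
{
	return (af >= AF_POLICY_START && (af % 2) != 0);
}

/*
 * Given the current AF_MAX (one past the highest allocated constant),
 * return the next number FreeBSD itself may allocate: the first even
 * value at or above cur_max, skipping one odd vendor slot if needed.
 */
int
af_next_freebsd(int cur_max)
{
	assert(cur_max >= AF_POLICY_START);
	return ((cur_max % 2) == 0 ? cur_max : cur_max + 1);
}
```

Starting from AF_MAX 38 this yields 38, then 40 once AF_MAX is 39, then 42 once AF_MAX is 41, which is the sequence AF_NEWPROTO1, AF_NEWPROTO2, AF_NEWPROTO3 in the listing. AF_MAX itself grows by at most two per FreeBSD allocation, regardless of how many odd slots vendors use privately.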
Re: Allocating AF constants for vendors.
* Bruce M. Simpson <[EMAIL PROTECTED]> [070822 07:33] wrote: > I second Max. If you are going to introduce a bunch of AF_* constants > into the tree you have to be very careful as AF_MAX is used to size > arrays and figure out how many radix trie heads to allocate. Ok, I'm not really sure what to do here. At Juniper we have approx 20 additional entries for AF_ constants. We also have theoretical but not practical "problems" with spareness and utility of this list, meaning we have plenty of arrays in our version of ifnets and route entries that are also "bloated" as well. We happen not to find it a problem. Perhaps if $BIG_ROUTER_COMPANY is not concerned about this then that might be convincing enough to let it go? Perhaps if I tossed in that it would be my intention to share code to dynamically allocate the data if we ever did it ourselves. Otherwise one other policy would be to specify an allocation policy such that new AF_ constants are allocated only for even numbers where odd numbers are left to vendors. This would slow the "bloat" and still provide vendors with something useful. How does that sound? -Alfred > > It could be argued this wastes a bunch of CPU time and memory, though I > speculate 'not much' at the moment; I am just a bit concerned that we > have ifnet->if_afdata which is also sized based on AF_MAX, 37, even > though most of the protocols in it are never attached to ifnets. > > The only domain I've seen which really uses if_afdata is PF_INET6. > PF_INET does not use it at all. In my opinion, there are structures > per-family per-ifnet which really belong hung-off ifnet on a 1:1 basis > and would simplify some of the lazy allocations we have further down in > the stack. > > If AF_MAX increases significantly so will wasted memory. If you are > going to make any significant changes here, please considering moving > this stuff to a more dynamic method of allocation. 
> > On the other hand, if you don't need to reference these constants in the > kernel at all, and they will all exist beyond AF_MAX, then you can > disregard what I've said and append them to the rest of the list. > > That is pretty much what happens for the libpcap/bpf DLT constants > (which are not an exact analogue of the AF constants - we don't allocate > other, larger kernel structures based on their value). > > regards, > BMS > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "[EMAIL PROTECTED]" -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Allocating AF constants for vendors.
* Max Laier <[EMAIL PROTECTED]> [070822 14:38] wrote: > On Wednesday 22 August 2007, Bruce M. Simpson wrote: > [...] > > On the other hand, if you don't need to reference these constants in > > the kernel at all, and they will all exist beyond AF_MAX, then you can > > disregard what I've said and append them to the rest of the list. > > Please make sure to leave a bit of space between AF_MAX and your constants > so we could still grow AF_MAX if the need should ever arise. Hmm, that could work, but I think we have the same problem, we depend on AF_MAX. I could look into a more dynamic way of allocating... possibly. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Allocating AF constants for vendors.
I trimmed the sender of this because I got it in private mail, that said I thought it was a good bunch of questions so I am replying to it. > 64? are you intending to bump AF_MAX or allocate them sequentially such > that adding another AF will require AF_MAX to grow a lot? > > In general this seems like a bad idea to me. I suggest you need to > (publicly) explain what you are doing and why this is a good idea. The goal here is to allow vendors to add their own constants without worrying about conflicting with FreeBSD constants. It will allow vendors to maintain some semblance of binary compatibility against FreeBSD. If you look at libpcap: http://cvs.tcpdump.org/cgi-bin/cvsweb/libpcap/pcap/bpf.h?rev=1.15 You can see that Juniper has asked for some number of reserved "families", in our case, I think it would be a bit greedy to grow the list _just_ for Juniper, so I suggested something that would work for every vendor. As far as implementation details, either one works for me, do you have any particular preference? Other than the actual delta, will this have any noticeable negative impact that you can see? -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Allocating AF constants for vendors.
Hello all, I would like to reserve about 64 entries for VENDOR specific address families in sys/socket.h. I think this will allow vendors to comfortably use the array of address families without worrying about overlap with FreeBSD protocols. If no one objects I plan to commit this in the next few days. The format will be along the lines of: AF_VENDOR0 -> AF_VENDOR63 Suggestions? thank you, -- - Alfred Perlstein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: freebsd nfs version4 server
* Dave <[EMAIL PROTECTED]> [070615 19:06] wrote: > Hello, >Firewalling nfs i was reading some client docs and i found out that > FreeBSD has client support for the nfs v4. I was wondering if FreeBSD 6.2 > could act as an nfs v4 server? There's a patchset from Rick Maclem(sp?) that might do it. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Firewalling NFS
* Jeremie Le Hen <[EMAIL PROTECTED]> [070615 01:07] wrote:
> Hi,
>
> It appears nearly impossible to firewall a NFS server on FreeBSD.

It would be nearly impossible if one didn't know much about NFS.
Care to rephrase your assertion?

> The reason is that NFS related daemons use RPC, which means they
> don't bind to a deterministic port. Only mountd(8) can be requested to
> bind to a specific port or fail with the -p command-line switch.
> Is there any reason other than "no one has needed this yet" why this
> option is not available for nfsd(8), rpc.lockd(8) and rpc.statd(8)?

This is wrong, wrong and more wrong.

-- 
- Alfred Perlstein
Re: New driver coming soon.
That's typically left to the driver author's discretion, so go at it.

* Jack Vogel <[EMAIL PROTECTED]> [070530 17:53] wrote:
> I wanted to let everyone know that I will soon have a
> new 10G driver to add to the tree. It is a PCI Express
> MSI/X adapter, I would like to call this driver 'ix' rather
> than follow Linux who are calling it 'ixgbe'. It is not
> backwardly compatible with ixgb. Any objections
> to the name? It would be nice to get this in before
> 7 becomes a RELEASE, what time frame do I
> have for that?
>
> Cheers,
>
> Jack

-- 
- Alfred Perlstein
Re: NAT Traversal Patches ...
Matthew, can you provide links to the patches and surrounding
discussion? It may just be a matter of integration manpower...

* Matthew Grooms <[EMAIL PROTECTED]> [070511 08:08] wrote:
>
> All,
>
> I understand that FreeBSD is a volunteer project, but does anyone
> have any information regarding the status of the IPsec NAT Traversal
> patches and their inclusion with FreeBSD? I have seen them floating
> around this list for a few years now. At one point, there was an
> objection that concerned a possible legal issue related to patents. This
> can't be too much of a road block as Linux, OpenBSD and NetBSD all
> include support for NATT in official stable kernel sources. Fedora Core
> 6 even has the feature enabled by default in the generic kernel. Another
> objection I have seen was related to the patch only offering support for
> the KAME stack. But the most recent patch set also offers support for
> the Fast IPsec stack as well.
>
> Is the patch lacking sponsorship by a FreeBSD developer sponsor
> since the author does not have commit access? Maybe a developer looking
> at the patch is just short on time at the moment? If so, is there
> another developer that could maybe help out? Is there a technical reason
> why the patches have not been committed? If so, I don't think the
> author is aware so a little communication is required?
>
> Lastly, is there anything the community can do to help out? Maybe
> donating to a FreeBSD Foundation project that sponsors IPsec related
> work?
>
> Thanks,
>
> -Matthew

-- 
- Alfred Perlstein
Re: setsockopt() can not remove the accept filter
"size (%d vs expected %d)", len, sizeof(afa)); > - printf("ok 8 - setsockopt\n"); > + printf("ok 9 - setsockopt\n"); > > /* > - * Step 8: After setsockopt(). Should succeed and identify > + * Step 9: After setsockopt(). Should succeed and identify >* ACCF_NAME. >*/ > bzero(&afa, sizeof(afa)); > len = sizeof(afa); > ret = getsockopt(lso, SOL_SOCKET, SO_ACCEPTFILTER, &afa, &len); > if (ret != 0) > - errx(-1, "not ok 9 - getsockopt() after listen() setsockopt() " > + errx(-1, "not ok 10 - getsockopt() after listen() setsockopt() " > "failed with %d (%s)", errno, strerror(errno)); > if (len != sizeof(afa)) > - errx(-1, "not ok 9 - getsockopt() after setsockopet() after " > + errx(-1, "not ok 10 - getsockopt() after setsockopet() after " > "listen() returned wrong size (got %d expected %d)", len, > sizeof(afa)); > if (strcmp(afa.af_name, ACCF_NAME) != 0) > - errx(-1, "not ok 9 - getsockopt() after setsockopt() after " > + errx(-1, "not ok 10 - getsockopt() after setsockopt() after " > "listen() mismatch (got %s expected %s)", afa.af_name, > ACCF_NAME); > - printf("ok 9 - getsockopt\n"); > + printf("ok 10 - getsockopt\n"); > + > + /* > + * Step 10: Remove accept filter. After removing the accept filter > + * getsockopt() should fail with EINVAL. 
> + */ > + ret = setsockopt(lso, SOL_SOCKET, SO_ACCEPTFILTER, NULL, 0); > + if (ret != 0) > + errx(-1, "not ok 11 - setsockopt() after listen() " > + "failed with %d (%s)", errno, strerror(errno)); > + bzero(&afa, sizeof(afa)); > + len = sizeof(afa); > + ret = getsockopt(lso, SOL_SOCKET, SO_ACCEPTFILTER, &afa, &len); > + if (ret == 0) > + errx(-1, "not ok 11 - getsockopt() after removing " > + "the accept filter returns valid accept filter %s", > + afa.af_name); > + if (errno != EINVAL) > + errx(-1, "not ok 11 - getsockopt() after removing the accept" > + "filter failed with %d (%s)", errno, strerror(errno)); > + printf("ok 11 - setsockopt\n"); > > close(lso); > return (0); > %%% > > -- > Maxim Konovalov -- - Alfred Perlstein - email: [EMAIL PROTECTED] cell: 408-480-4684 ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Linux compatible rpc.lockd
* Bruce M Simpson <[EMAIL PROTECTED]> [041125 13:53] wrote:
> On Thu, Nov 25, 2004 at 08:18:12PM +0100, Björn Grönvall wrote:
> > I have made a patch to address PR kern/56461, in short the patch
> > provides two different options to be compatible with Linux lockd
> > implementations. It can also serve as a basis for a future more robust
> > rpc.lockd.
>
> Thank you for this. I looked at this around 8 months ago but abandoned
> further work on it because the approach I was taking required that
> nfs be refactored to use the nmount() API, and because I am not currently
> using NFS. It looks as though the two options implemented here help to
> address the problems I was having with making sure Linux servers got
> the right lock cookie response.
>
> Have you tested this in production and does it work well? If so I believe
> it should be committed, but I'd defer to Alfred for further review.

It looks non-invasive enough to be safe. Please see if you can get
a test run and commit it. I'm in the hospital and not able to do stuff.

-- 
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
(forw) Re: kern/72396: Incorrect network accounting with aliases.
I submitted a PR with a patch, but I think there may be a better fix, any ideas? -Alfred - Forwarded message from [EMAIL PROTECTED] - From: [EMAIL PROTECTED] Reply-To: [EMAIL PROTECTED], [EMAIL PROTECTED] To: Alfred Perlstein <[EMAIL PROTECTED]> Subject: Re: kern/72396: Incorrect network accounting with aliases. Date: Wed, 6 Oct 2004 17:50:29 GMT Message-Id: <[EMAIL PROTECTED]> Thank you very much for your problem report. It has the internal identification `kern/72396'. The individual assigned to look at your report is: freebsd-bugs. You can access the state of your problem report at any time via this link: http://www.freebsd.org/cgi/query-pr.cgi?pr=72396 >Category: kern >Responsible:freebsd-bugs >Synopsis: Incorrect network accounting with aliases. >Arrival-Date: Wed Oct 06 17:50:29 GMT 2004 - End forwarded message - -- - Alfred Perlstein - Research Engineering Development Inc. - email: [EMAIL PROTECTED] cell: 408-480-4684 ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: aio patch for review.
* Alan Cox <[EMAIL PROTECTED]> [040930 21:19] wrote:
> On Thu, Sep 30, 2004 at 02:18:14AM -0700, Alfred Perlstein wrote:
> > properly cover the socket buffer for operations that need locking.
> >
>
> Just to be clear, your point is that soreadable() and sowriteable()
> should be performed with the corresponding socket buffer locked.
> Correct? If so, yes, please go ahead and commit it.

Yup.

thank you,
--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
aio patch for review.
properly cover the socket buffer for operations that need locking.

please review.

Index: vfs_aio.c
===
RCS file: /home/ncvs/src/sys/kern/vfs_aio.c,v
retrieving revision 1.176
diff -u -r1.176 vfs_aio.c
--- vfs_aio.c	23 Sep 2004 14:45:04 -	1.176
+++ vfs_aio.c	30 Sep 2004 09:15:10 -
@@ -1297,6 +1297,7 @@
 	struct kevent kev;
 	struct kqueue *kq;
 	struct file *kq_fp;
+	struct sockbuf *sb;
 
 	aiocbe = uma_zalloc(aiocb_zone, M_WAITOK);
 	aiocbe->inputcharge = 0;
@@ -1451,29 +1452,28 @@
 		 * If it is not ready for io, then queue the aiocbe on the
 		 * socket, and set the flags so we get a call when sbnotify()
 		 * happens.
+		 *
+		 * Note if opcode is neither LIO_WRITE nor LIO_READ we lock
+		 * and unlock the snd sockbuf for no reason.
 		 */
 		so = fp->f_data;
+		sb = (opcode == LIO_READ) ? &so->so_rcv : &so->so_snd;
+		SOCKBUF_LOCK(sb);
 		s = splnet();
 		if (((opcode == LIO_READ) && (!soreadable(so))) || ((opcode == LIO_WRITE) && (!sowriteable(so)))) {
 			TAILQ_INSERT_TAIL(&so->so_aiojobq, aiocbe, list);
 			TAILQ_INSERT_TAIL(&ki->kaio_sockqueue, aiocbe, plist);
-			if (opcode == LIO_READ) {
-				SOCKBUF_LOCK(&so->so_rcv);
-				so->so_rcv.sb_flags |= SB_AIO;
-				SOCKBUF_UNLOCK(&so->so_rcv);
-			} else {
-				SOCKBUF_LOCK(&so->so_snd);
-				so->so_snd.sb_flags |= SB_AIO;
-				SOCKBUF_UNLOCK(&so->so_snd);
-			}
+			sb->sb_flags |= SB_AIO;
 			aiocbe->jobstate = JOBST_JOBQGLOBAL; /* XXX */
 			ki->kaio_queue_count++;
 			num_queue_count++;
+			SOCKBUF_UNLOCK(sb);
 			splx(s);
 			error = 0;
 			goto done;
 		}
+		SOCKBUF_UNLOCK(sb);
 		splx(s);
 	}

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
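The idea behind the patch above is that the readiness check and the flag update happen under the same lock, so the state cannot change between "is it ready?" and "mark it as waiting". A minimal userland sketch of that pattern follows. It uses pthreads rather than the kernel's SOCKBUF_LOCK(), and `struct chan`, `CH_AIO` and `chan_queue_if_not_ready()` are hypothetical names invented for the illustration, not anything from vfs_aio.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define CH_AIO 0x01	/* hypothetical flag, stands in for SB_AIO */

/* Hypothetical stand-in for a socket buffer protected by its own lock. */
struct chan {
	pthread_mutex_t	lock;
	bool		ready;	/* stands in for soreadable()/sowriteable() */
	int		flags;
	int		queued;	/* number of requests parked on this channel */
};

/*
 * Park a request on the channel if it is not ready for I/O.
 * The check and the flag update are done under one lock acquisition.
 * Returns true if the request was queued, false if the caller
 * should go ahead and do the I/O itself.
 */
static bool
chan_queue_if_not_ready(struct chan *ch)
{
	bool queued = false;

	assert(ch != NULL);
	pthread_mutex_lock(&ch->lock);
	if (!ch->ready) {
		ch->queued++;
		ch->flags |= CH_AIO;
		queued = true;
	}
	pthread_mutex_unlock(&ch->lock);
	return (queued);
}
```

Taking the lock once around both steps mirrors what the diff does by hoisting SOCKBUF_LOCK(sb) above the soreadable()/sowriteable() test; the old code only locked around the flag update.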
Re: kern/56461: FreeBSD client rpc.lockd incompatible with Linux server rpc.lockd
* Barney Wolff <[EMAIL PROTECTED]> [040618 14:09] wrote:
> On Fri, Jun 18, 2004 at 10:51:21AM -0700, Alfred Perlstein wrote:
> >
> > *Sigh* make it a sysctl, but can someone please lay the smack
> > down on the linuxiots and have them fix their crap?
> >
> > * Bruce M Simpson <[EMAIL PROTECTED]> [040618 04:50] wrote:
> > >
> > > Linux NFS advisory locks are broken and incompatible with the rest
> > > of the world. FreeBSD 5.x in particular uses BSD/OS derived NFS code
> > > and thus is affected. FreeBSD 4.x does not implement client-side NFS
> > > advisory locks.
> > >
> > > This problem is also documented as existing for MacOS X, IRIX and BSD/OS:
> > > http://www.netsys.com/bsdi-users/2002-04/msg00036.html
> > > http://www.uwsg.iu.edu/hypermail/linux/kernel/0311.0/0498.html
> > > http://lists.freebsd.org/pipermail/freebsd-hackers/2003-July/001833.html
> > > http://lists.freebsd.org/pipermail/freebsd-hackers/2003-April/000592.html
> > >
> > > The patch provided in the PR is verified to solve the problem, but
> > > it would be good to make this functionality optional at run-time,
> > > as many people are likely to be using Linux NFS shares read/write
> > > with advisory locks.
>
> Pardon an ignorant question, but what happens to unfortunate people who
> have to talk to both Linux and non-quirky servers at the same time? Is
> there a way to detect what flavor of server you're talking to and adjust
> accordingly? That would be far better than a sysctl.

Mount option? Can we do that these days?

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
Re: kern/56461: FreeBSD client rpc.lockd incompatible with Linux server rpc.lockd
This fucking sucks.

*Sigh* make it a sysctl, but can someone please lay the smack down on the linuxiots and have them fix their crap?

* Bruce M Simpson <[EMAIL PROTECTED]> [040618 04:50] wrote:
> I've attached my thoughts on this issue. I haven't gone ahead and
> committed the fix in the PR as it makes us just as braindead as Linux,
> but it would be good to be able to have this in GENERIC so that it
> can be enabled in those situations where it's needed.
>
> Regards,
> BMS

> Synopsis:
>
> Linux NFS advisory locks are broken and incompatible with the rest
> of the world. FreeBSD 5.x in particular uses BSD/OS derived NFS code
> and thus is affected. FreeBSD 4.x does not implement client-side NFS
> advisory locks.
>
> This problem is also documented as existing for MacOS X, IRIX and BSD/OS:
> http://www.netsys.com/bsdi-users/2002-04/msg00036.html
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0311.0/0498.html
> http://lists.freebsd.org/pipermail/freebsd-hackers/2003-July/001833.html
> http://lists.freebsd.org/pipermail/freebsd-hackers/2003-April/000592.html
>
> The patch provided in the PR is verified to solve the problem, but
> it would be good to make this functionality optional at run-time,
> as many people are likely to be using Linux NFS shares read/write
> with advisory locks.
>
> Walkthrough:
>
> The addition of pid_start to struct lockd_msg_ident is what triggered
> this problem. The offending member is referenced by the NFS code, and
> rpc.lockd itself.
>
> The kernel interface code for rpc.lockd resides in
> src/usr.sbin/rpc.lockd/kern.c.
>
> LOCKD_MSG is what gets passed from the kernel to rpc.lockd via the
> named pipe /var/run/lock.
>
> NFSCLNT_LOCKDANS is used by lockd to send a response back. struct
> lockd_ans is the structure passed via this syscall. The kernel code
> for this is in nfslockdans(), in src/sys/nfsclient/nfs_lock.c.
>
> Proposed solution:
>
> Actual NLM request conversion to/from the kernel happens in rpc.lockd;
> there are several places in kern.c, notably test_request() and
> lock_request(), which reference struct nlm4_testargs, struct nlm_testargs,
> struct nlm_lockargs, and struct nlm4_lockargs.
> These are defined in src/include/rpcsvc/nlm_prot.x.
>
> XXX Are the lockd cookies different from the regular NFS filehandles?
>
>     arg4.cookie.n_bytes = (char *)&msg->lm_msg_ident;
>     arg4.cookie.n_len = sizeof(msg->lm_msg_ident);
>
> There's no need to change this structure, just the number of bytes
> provided by it; the lm_msg_ident structure needs to change if we're
> doing Linux compatibility, and is probably best served by adding
> a sysctl to keep track of whether we're in this mode or not.
>
> So embedding a union of structs in lm_msg_ident is probably the way to go,
> and taking the sizeof() the embedded struct as appropriate.
>
> I would suggest adding a sysctl to the tree: vfs.nfs.pid_start_locks,
> "Use process start time as well as PID to differentiate client-side NFS locks".
> This should be referenced from nfslockdans() as per the original patch
> to check if the timercmp comparison should be skipped.

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
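The "union of structs plus sizeof()" suggestion in the quoted proposal can be sketched roughly as below. This is only an illustration under stated assumptions: the two layouts, the names `ident_compat`, `ident_full`, `msg_ident` and `cookie_len()` are hypothetical, and the real struct lockd_msg_ident and the eventual fix may look different. The `pid_start_locks` argument stands in for the proposed vfs.nfs.pid_start_locks sysctl.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <sys/types.h>

/* Assumed cookie layout without the process start time. */
struct ident_compat {
	pid_t	pid;
	int	msg_seq;
};

/* Assumed cookie layout that also carries the process start time. */
struct ident_full {
	pid_t		pid;
	struct timeval	pid_start;
	int		msg_seq;
};

/* Hypothetical union of the two layouts, as the proposal suggests. */
union msg_ident {
	struct ident_compat	compat;
	struct ident_full	full;
};

/*
 * Return how many cookie bytes to advertise in the NLM request
 * (the value that would be assigned to cookie.n_len).
 * pid_start_locks mirrors the proposed sysctl: nonzero means the
 * start time is part of the cookie.
 */
static size_t
cookie_len(int pid_start_locks)
{
	return (pid_start_locks ? sizeof(struct ident_full) :
	    sizeof(struct ident_compat));
}
```

With something like this, the `arg4.cookie.n_len = sizeof(msg->lm_msg_ident)` assignment quoted above would become a call that depends on the mode, while the pointer passed in `n_bytes` stays the same.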
Re: "netstat -m" and sendfile(2) statistics in STABLE
* Mike Silbersack <[EMAIL PROTECTED]> [040617 23:20] wrote:
>
> On Fri, 18 Jun 2004, Igor Sysoev wrote:
>
> >Hi,
> >
> >I read objections in cvs-all@ about netstat's output after MFC
> >of sendfile(2) statistics.
> >
> >How about "netstat -ms" ?
> >
> >Right now this switch combination is treated as simple "-m" in both -STABLE
> >and -CURRENT.
> >
> >
> >Igor Sysoev
> >http://sysoev.ru/en/
>
> I would prefer that sfbufs statistics either be kept in netstat -m, OR
> added to an entirely different program (perhaps vmstat). Making yet
> another netstat flag just because we're scared of confusing users is a
> noble compromise, but will in the end just make things more confusing.

I was going to suggest vmstat now that sfbufs are used for so many other things than just "sendfile bufs".

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
Re: "netstat -m" and sendfile(2) statistics in STABLE
* Igor Sysoev <[EMAIL PROTECTED]> [040617 22:52] wrote:
> Hi,
>
> I read objections in cvs-all@ about netstat's output after MFC
> of sendfile(2) statistics.
>
> How about "netstat -ms" ?
>
> Right now this switch combination is treated as simple "-m" in both -STABLE
> and -CURRENT.

I would love to see the sendfile stats moved to '-s'. If that's what you're proposing, then yes. :)

Oh last of the nits: changes to userland output make things like examples from documentation out of date which can obfuscate things and/or ruin docs for a release.

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
thanks all (was: Re: crossover between gigE?)
I had a cat5e cable, but either:
a) the box needed to reboot
b) 4.9 has a problem whereas 4-stable post 4.9 is ok with em0

I dunno, but it's working with a standard cat5e cable now after the upgrade.

* Michael Sierchio <[EMAIL PROTECTED]> [031220 14:13] wrote:
> Alfred Perlstein wrote:
> >Any suggestion of the kind of cable one should look for at Frys
> >to run between two gigE card (intel em0) to function as a crossover?
> >
> >
>
> I was under the impression that copper gigE cards were auto-sensing
> for polarity and it didn't matter whether you use a straight or crossover.
>
> --
>
> "Well," Brahma said, "even after ten thousand explanations, a fool is no
> wiser, but an intelligent man requires only two thousand five hundred."
> - The Mahabharata

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
crossover between gigE?
Any suggestion of the kind of cable one should look for at Frys to run between two gigE cards (intel em0) to function as a crossover?

--
- Alfred Perlstein
- Research Engineering Development Inc.
- email: [EMAIL PROTECTED] cell: 408-480-4684
Re: misc/44361: possible raw socket bug
It appears that we expect the ip_len and ip_off fields to be sent in host byte order as the stack will fix it to network byte order in ip_output.

Is this a bug or feature? :)

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-net" in the body of the message
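To make the byte-order question above concrete, here is a small sketch of the two conventions a raw-socket sender could follow when filling in those two fields. `struct hdr_fields` and `fill_fields()` are hypothetical stand-ins for the illustration (the real fields live in struct ip), and which convention a given kernel expects is exactly the open question in the message, so check ip(4) for the release in use.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two header fields under discussion. */
struct hdr_fields {
	uint16_t ip_len;	/* total length */
	uint16_t ip_off;	/* fragment offset and flags */
};

/*
 * Fill the fields before handing the packet to a raw socket.
 * host_order != 0: leave the values in host byte order and let the
 * stack convert them (the behavior described in the message).
 * host_order == 0: the caller converts to network byte order itself.
 */
static void
fill_fields(struct hdr_fields *h, uint16_t len, uint16_t off, int host_order)
{
	assert(h != NULL);
	h->ip_len = host_order ? len : htons(len);
	h->ip_off = host_order ? off : htons(off);
}
```

Code that has to run against both conventions would typically pick the `host_order` value at build time based on the target system.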
Re: IP Fragmentation
* shubha mr <[EMAIL PROTECTED]> [020717 03:50] wrote:
> Hi,
> I am writing a gigabit ethernet driver for one of the
> NICs. My hardware is capable of computing the checksum
> and hence I am enabling per-packet handling of
> TCP/IP/UDP checksum offload in transmit side. I would
> like to know if there is a way by which I can tell the
> upperguy that I will not be able to compute the tcp
> checksum for the fragmented packets. That is I want to
> indicate that checksum offload can be offloaded only
> for the non fragmented and hence complete packets
> only.

From mbuf.h:

#define CSUM_IP         0x0001  /* will csum IP */
#define CSUM_TCP        0x0002  /* will csum TCP */
#define CSUM_UDP        0x0004  /* will csum UDP */
#define CSUM_IP_FRAGS   0x0008  /* will csum IP fragments */
#define CSUM_FRAGMENT   0x0010  /* will do IP fragmentation */

Just use the first 3, have a look at the if_bge.c driver for an example.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'
Tax deductible donations for FreeBSD: http://www.freebsdfoundation.org/
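A sketch of what "just use the first 3" might look like in a driver's attach path is shown below. The flag values are the ones quoted from mbuf.h in the reply above; `struct fake_ifnet` and `xx_set_csum_caps()` are hypothetical stand-ins so the snippet compiles on its own, and the `if_hwassist` field name is used on the assumption that it matches the interface structure in the tree being targeted.

```c
#include <assert.h>
#include <stdint.h>

/* Values as quoted from mbuf.h in the reply above. */
#define CSUM_IP		0x0001	/* will csum IP */
#define CSUM_TCP	0x0002	/* will csum TCP */
#define CSUM_UDP	0x0004	/* will csum UDP */
#define CSUM_IP_FRAGS	0x0008	/* will csum IP fragments */
#define CSUM_FRAGMENT	0x0010	/* will do IP fragmentation */

/* Hypothetical stand-in for the interface structure a driver fills in. */
struct fake_ifnet {
	uint32_t if_hwassist;	/* checksum work the hardware takes over */
};

/*
 * Advertise IP, TCP and UDP checksum offload only.  Leaving out
 * CSUM_IP_FRAGS and CSUM_FRAGMENT tells the stack that the hardware
 * does not handle fragmented packets.
 */
static void
xx_set_csum_caps(struct fake_ifnet *ifp)
{
	ifp->if_hwassist = CSUM_IP | CSUM_TCP | CSUM_UDP;
}
```

The if_bge.c driver mentioned in the reply is the place to look for how a real driver wires this up.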
Re: mbuf external buffer reference counters
* Julian Elischer <[EMAIL PROTECTED]> [020712 00:00] wrote:
>
>
> On Thu, 11 Jul 2002, Alfred Perlstein wrote:
> >
> > That's true, but could someone explain how one can safely and
> > efficiently manipulate such a structure in an SMP environment?
>
> what does NetBSD do for that?

They don't!

*** waves skull staff exasperatedly ***

RORWLRLRLLRL