Re: (jail) problem and a (possible) solution ?
Terry,

I made an initial change to the kernel, reducing maxusers from 512 to 256 - you said that 3 gigs is right on the border of needing extra KVA or not, so I thought maybe this unnecessarily high maxusers might be pushing me over the top. However, as long as I was changing the kernel, I also added DDB. The bad news is, it crashed again. The good news is, I dropped to the debugger and got the wait channel info you wanted with `ps`. Here are the last four columns of ps output for the first two pages of processes (roughly 900 procs were running at the time of the halt, so of course I can't give you them all, especially since I am copying by hand):

3 select c0335140 local
3 select c0335140 trivial-rewrite
3 select c0335140 cleanup
3 select c0335140 smtpd
3 select c0335140 imapd
2                 httpd
2                 httpd
3 sbwait e5ff6a8c httpd
3 lockf  c89b7d40 httpd
3 sbwait e5fc8d0c httpd
2                 httpd
3 select c0335140 top
3 accept e5fc9ef6 httpd
3 select c0335140 imapd
3 select c0335140 couriertls
3 select c0335140 imapd
2                 couriertls
3 ttyin  c74aa630 bash
3 select c0335140 sshd
3 select c0335140 tt++

So there it all is. Does this confirm your feeling that I need to increase KVA? Or does it show you that one of the one or two other low probability problems is occurring?

thanks,
PT

On Sun, 23 Jun 2002, Terry Lambert wrote:
[Terry's message of 23 Jun quoted in full; trimmed - see his original post below.]

To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-hackers in the body of the message
Re: (jail) problem and a (possible) solution ?
Well, it should be noted that there are two things going on with swap. What I adjusted was the size of the swap_zone, which holds swblocks. These structures hold the VM-SWAP block mappings for things that are swapped out. The swap zone eats a lot more KVA than the radix tree holding the swap bitmaps.

The actual swap bitmaps are allocated from the M_SWAP malloc pool. These allocations are based on NSWAP * (largest_single_swap_area). NSWAP is usually 4. Having a single 2GB swap area is therefore somewhat expensive, but still nowhere near the size required to exhaust KVM (or even come close to exhausting KVM). It is just as expensive as having 4 x 2GB swap areas due to the way the bitmaps are allocated. The swap bitmaps eat around 2 bits per 4K block of swap, so a single 2GB swap area will eat 2G/4K x 2 / 8 x NSWAP(4) = 0.5 MB of ram. Not very much.

But, getting back to the swblocks... these use a zone, SWAPMETA (vmstat -z | less, search for SWAPMETA). The zone reserves KVA. A machine with 2GB of real memory will typically reserve around 10 MB of KVA to hold swblocks. Previously it reserved 20-40 MB of KVA, which really ate into available KVA. It should not be a problem now, but it's very easy for you to check. Multiply the size (160) against the LIMIT and you will get the approximate KVA reservation being used for the SWAPMETA zone.

--

Ok, history lesson over. Going over your original posting and the ps you just posted from ddb, there is not enough information to make any sort of diagnosis. It doesn't look like KVA exhaustion to me, and the ps does not show any deadlocks. I'm not sure what is going on. I think some more experimentation is necessary... e.g. breaking into DDB after it deadlocks and doing a full 'ps' (don't leave anything out this time), and potentially getting a kernel core dump (assuming you compiled the kernel -g and have a kernel.debug lying around that we can gdb the core against).
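Both of Matt's back-of-the-envelope numbers are easy to reproduce. The sketch below redoes the swap-bitmap arithmetic and the SWAPMETA size x LIMIT multiplication; the vmstat -z field positions are an assumption (check your own output before trusting the field numbers), and the sample line is hypothetical.

```shell
#!/bin/sh
# Matt's swap-bitmap arithmetic: 2 bits per 4K block, times NSWAP (4).
swap=$((2 * 1024 * 1024 * 1024))    # a single 2GB swap area
echo "$((swap / 4096 * 2 / 8 * 4 / 1024)) KB for swap bitmaps"

# SWAPMETA KVA reservation: zone item size times its LIMIT.
# Hypothetical `vmstat -z` line; the field positions ($2 = size,
# $3 = limit) are an assumption about the output format.
sample="SWAPMETA 160 65536 120 40"
echo "$sample" | awk '{ printf "%.1f MB of KVA reserved for SWAPMETA\n", $2 * $3 / 1048576 }'
```

On a live box you would pipe `vmstat -z | grep SWAPMETA` into the awk instead of the canned sample line; with size 160 and a limit of 65536 it comes out to the ~10 MB Matt mentions.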
-Matt Matthew Dillon [EMAIL PROTECTED]
Re: (jail) problem and a (possible) solution ?
A few items that deserve mention, and two questions:

a) this problem occurred back when the machine had 2 gigs in it - I actually (naively) added the third gig of physical ram to try to fix the problem.

b) another machine of mine is now exhibiting the same behavior - it has far fewer processes running (~500 vs ~1000) and it has only 2 gigs of RAM.

questions:

1) How do I give you an entire `ps` output from DDB? Is there a way to output it to a floppy or something? Or are you suggesting to copy down by hand ~1000 lines of ps output?

2) Any other suggestions as to what it is - if it doesn't look like KVA, and I reduced my swap from 2 gigs to 256 megs, and I reduced maxusers from 512 to 256 ... basically I have a perfectly healthy machine that crashes for no reason?

All of your help is greatly appreciated. It's just so frustrating to have it halt every day for no apparent reason - as you saw from the `top` output just as it halted the other day, the load is trivial.

--PT

On Mon, 24 Jun 2002, Matthew Dillon wrote:
[Matt's message quoted in full; trimmed - see his post above.]
Re: (jail) problem and a (possible) solution ?
Patrick Thomas wrote:
[Patrick's DDB ps output and question quoted in full; trimmed - see his post above.]

Matt Dillon is right, that there's nothing conclusive in the information you've posted. However... it provides room for additional speculation.

--

The number of select waits is reasonable. The sbwait makes me somewhat worried. It's obvious that you are running a large number of httpd's; the sbwait in this case could be reasonably assumed to be waits based on sendfile for a change in so->so_snd.sb_cc; if that's the case, then it may be that you are simply running out of mbufs, and are deadlocking. This can happen if you have enough data in the pipe that you can not receive more data (e.g. the m_pullup() in tcp_input() could fail before other things would fail).
If this is too much assumption, you can walk the entry off the process, and see if it's the address of the sb_cc for so_snd or for so_rcv for the process in question.

The way to cross-check this would be to run a continuous netstat -m, e.g.:

#!/bin/sh
while true
do
	netstat -m
	sleep 1
done

When the lockup comes, the interesting numbers are:

# netstat -m
3/64/5696 mbufs in use (current/peak/max)		<-- #3
3 mbufs allocated to data
0/40/1424 mbuf clusters in use (current/peak/max)	<-- #2
96 Kbytes allocated to network (2% of mb_map in use)
0 requests for memory denied				<-- #1
0 requests for memory delayed
0 calls to protocol drain routines

If there are a lot of denials, then you are out of mbuf memory and/or mbuf clusters (sendfile tends to eat clusters for breakfast; it's one of the reasons I dislike it immensely; the other is that the standards for the majority of wire protocols where you'd use it require CRLF termination, and UNIX text files have only LF termination). The current vs. peak vs. max will tell you how close to resource saturation you are. The ratio of clusters to mbufs will (effectively) tell you if you need to worry about adjusting the ratio because of sendfile.

The lockf could (maybe) be a deadlock, but if it were, everyone would be seeing it; it's incredibly doubtful, as long as the ps output you indicated was at all accurate.

Basically, if you have any denials, or if the number of mbuf clusters gets really large, then you could have a problem. It would also be interesting to see the output of:

# sysctl -a | grep tcp | grep space
net.inet.tcp.sendspace: 32768
net.inet.tcp.recvspace: 65536

A standard netstat would also tell you the contents of the Recv-Q and Send-Q columns. If they were non-zero, then you would basically be able to tell how much memory was being consumed by network traffic in and out. I guess the best way to deal with this would be to drop the size of the send or receive queues, until it didn't consume all your memory.
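Two pieces of the above are easy to mechanize: scanning the netstat -m output for nonzero denial counts, and putting numbers to the queue-size warning. Here is a sketch; the "requests for memory denied" wording is taken from the sample output above, the demo feeds canned input rather than a live netstat, and the socket count in the second part is an arbitrary example.

```shell
#!/bin/sh
# Part 1: warn if `netstat -m` reports any denied mbuf/cluster requests.
check_mbuf_denials() {
    awk '/requests for memory denied/ { if ($1 > 0) print "WARNING: " $1 " mbuf denials" }'
}
# Demo with canned input; live use would be: netstat -m | check_mbuf_denials
echo "0 requests for memory denied"  | check_mbuf_denials    # prints nothing
echo "12 requests for memory denied" | check_mbuf_denials    # prints a warning

# Part 2: worst case if every socket fills both queues to the maxima above.
sendspace=32768     # net.inet.tcp.sendspace
recvspace=65536     # net.inet.tcp.recvspace
sockets=1000        # arbitrary example count
echo "$(( (sendspace + recvspace) * sockets / 1048576 )) MB of mbuf space, worst case"
```

With 1000 sockets the bound works out to ~93 MB, which is Terry's point: the per-socket maxima are not budgeted against the mbuf pool, so the sum can easily exceed it.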
In general, the size of these queues is supposed to be a *maximum*, not a *mean*, so the number of sockets possible, times the maximum total of both, will often exceed the amount of available mbuf space.

An interesting attack that is moderately effective on FreeBSD boxes is to send with a very large size, and not send one of the fragments (e.g. the second one) to prevent fragment reassembly, and therefore saturate the reassembly queue. The Linux UDP NFS client code does this unintentionally, but you could believe that someone might be doing it intentionally, as well, which would also work against TCP. It's doubtful that you are being hit by a FreeBSD targeted attack, however.

-- Terry
Re: (jail) problem and a (possible) solution ?
:questions:
:
:1) How do I give you an entire `ps` output from DDB ? Is there a way to
:output it to a floppy or something ? Or are you suggesting to copy down
:by hand ~1000 lines of ps output ?

If you have a couple of machines you can use a null-modem cable and make the target machine's console the serial port by adding the following line to the target machine's /boot/loader.conf:

console=comconsole

(note: DDB will occur on the serial port now, not the main system console). Then on the machine you connected the serial port to you can 'tip com1' (I think). If you don't have a com1 in /etc/remote you can add one:

com1:dv=/dev/cuaa0:br#9600:pa=none:

In any case, this way the console will wind up on the serial port and you can leave yourself tipped in with a big window and then cut and paste when it drops into DDB and you do the ps.

The other thing you want to do is to make sure all your kernel builds are -g builds, which you can do by adding the following line to /usr/src/sys/i386/conf/YOURKERNEL (I'm assuming from prior messages that you are familiar with building kernels):

makeoptions DEBUG=-g

I also recommend:

options ALT_BREAK_TO_DEBUGGER

This will produce a kernel.debug as well as a kernel binary (only 'kernel' is installed, but kernel.debug will remain sitting in the compile dir). ALT_BREAK_TO_DEBUGGER allows you to break into DDB via the serial console by using CR ~ ^B (return, tilde, control-B) from your 'tip'.

Finally, make sure the swap partition is large enough to hold main memory so the kernel dumps core, and use the 'dumpdev' option in /etc/rc.conf to set the dump device. For example:

dumpdev=/dev/da0s1b

:2) Any other suggestions as to what it is - if it doesn't look like KVA,
:and I reduced my swap from 2gig to 256megs, and I reduced maxusers from
:512 to 256 ... basically I have a perfectly healthy machine that crashes
:for no reason ?
:
:All of your help is greatly appreciated.
It's just so frustrating to have
:it halt every day for no apparent reason - as you saw from the `top`
:output just as it halted the other day, the load is trivial.
:
:--PT

I don't know, but hopefully a full ps will give us a better window into the problem. Oh yah, you can also play with different memory configurations simply by setting a physical memory limit (<= actual physical ram in the box) in /boot/loader.conf, like this:

hw.physmem=256m

-Matt
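Matt's hw.physmem knob composes with the serial-console setting from his earlier instructions; a /boot/loader.conf for the flaky box might look like the fragment below (the 2048m figure is just an example value, not a recommendation):

```shell
# /boot/loader.conf on the target box -- example values, adjust to taste
console=comconsole    # DDB and the console move to the serial port
hw.physmem=2048m      # pretend the 3GB box has only 2GB of RAM
```

Both lines are read by the boot loader before the kernel starts, so a reboot is needed for either to take effect.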
Re: (jail) problem and a (possible) solution ?
Patrick Thomas wrote: 1) How do I give you an entire `ps` output from DDB ? Is there a way to output it to a floppy or something ? Or are you suggesting to copy down by hand ~1000 lines of ps output ? Serial console + terminal program with capture. -- Terry
Re: (jail) problem and a (possible) solution ?
It's obvious that you are running a large number of httpd's; the

Yes, we are running a lot of httpd's: ps auxw | grep httpd | wc -l = 288

The way to cross-check this would be to run a continuous netstat -m, e.g.:

Funny you should ask :) I was already doing that. Here is the output from a `netstat -m` run once per minute - the machine crashed sometime in the next 30-60 seconds after I got this output:

524/2576/34816 mbufs in use (current/peak/max)
500 mbufs allocated to data
24 mbufs allocated to packet headers
273/2254/8704 mbuf clusters in use (current/peak/max)
5152 Kbytes allocated to network (19% of mb_map in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

Basically, if you have any denials, or if the number of mbuf clusters gets really large, then you could have a problem.

Do you think it is reasonable that the above netstat -m output could, within 30 or so seconds, ramp up to the bad situation you are describing? Because it looks fairly benign to me...

I have three questions:

1. Forgetting about my particular problem for a moment, let's say you have to tune a machine to run 200+ httpd servers along with another 800 misc. processes, etc. What do you suggest setting, just to be safe (again, as a precaution - forgetting that in reality I am trying to fix a sick machine)? So far I have only tuned:

In my kernel:

maxusers=256 (was 512, change to 256 didn't help)
options SHMMAXPGS=16384
options SHMMAX=(SHMMAXPGS*PAGE_SIZE+1)
options SHMSEG=256
options SEMMNI=384
options SEMMNS=768
options SEMMNU=384
options SEMMAP=384

(all this SHM and SEM stuff is to run multiple postgres')

and at boot time:

sysctl -w jail.sysvipc_allowed=1
sysctl -w kern.ipc.shmall=65535
sysctl -w kern.ipc.shmmax=134217728
sysctl -w net.inet.tcp.syncookies=0

So anything obvious I am missing that you would tune for a 200+ httpd + 800 other processes machine?

2.
Let's say I was being targeted by that effective attack you spoke of... any way to immunize myself?

3. You spoke of:

# sysctl -a | grep tcp | grep space
net.inet.tcp.sendspace: 32768
net.inet.tcp.recvspace: 65536

I guess the best way to deal with this would be to drop the size of the send or receive queues, until it didn't consume all your memory. In general, the size of these queues is supposed to be a *maximum*, not a *mean*, so the number of sockets possible, times the maximum total of both, will often exceed the amount of available mbuf space.

a) are you saying to collect these sysctls regularly and try to see their values right at the crash?

b) where do I drop the size of the send or receive queues? (sysctl or kernel setting?)

thank you very much. I will try to get a full `ps` tonight when it crashes again :(

--PT

An interesting attack that is moderately effective on FreeBSD boxes is to send with a very large size, and not send one of the fragments (e.g. the second one) to prevent fragment reassembly, and therefore saturate the reassembly queue. The Linux UDP NFS client code does this unintentionally, but you could believe that someone might be doing it intentionally, as well, which would also work against TCP. It's doubtful that you are being hit by a FreeBSD targeted attack, however. -- Terry
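One small cross-check on the tuning list above (a sketch; PAGE_SIZE is assumed to be 4096 on i386): the compiled-in SHMMAX and the boot-time kern.ipc.shmmax disagree, and since the sysctl -w runs after boot, the larger sysctl value is what should end up in effect.

```shell
#!/bin/sh
# SHMMAX as compiled in: SHMMAXPGS * PAGE_SIZE (ignoring the +1 guard
# from the config line, which doesn't change the magnitude).
pagesize=4096        # assumed i386 PAGE_SIZE
shmmaxpgs=16384      # from the kernel config above
echo "compiled-in SHMMAX: $((shmmaxpgs * pagesize)) bytes"    # 64 MB
echo "boot-time kern.ipc.shmmax: 134217728 bytes"             # 128 MB
```

So the kernel option buys 64 MB and the rc-time sysctl then doubles it; nothing wrong with that, but it's worth knowing which number postgres actually sees.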
Re: (jail) problem and a (possible) solution ?
Patrick Thomas wrote: I think I'll just decrease my swap size from 2 gigs to 1 gig - is that a reasonable alternative that provides the same benefit and possible solution to this problem ? ...since basically 0 swap has ever been used on the machine anyway...

Not really. The code in machdep.c allocates pmaps for swapped memory based on the size of real memory, rather than based on available swap. The reason it does this is that you can (effectively) add an arbitrary amount of swap later with swapon, without the swap devices being known to the kernel at boot time. This makes it impossible to prereserve the number of pmap pages that will be needed for the actual amount of swap.

Matt Dillon made some autosizing changes after I complained about this before. My actual complaint was to implicate the size of real memory available relative to the size of the full address space. The change he made attempts to autosize, and doesn't quite mirror this policy directly. This code is not available in 4.5. I believe that it was back-ported to 4.6, but you would have to look at the CVS log on machdep.c to be sure about this -- it may only be in -current.

The upshot of this is that having a lot of memory reserves pmap entries at 4K per 4M of real OR virtual memory. The result of this is that at 4G of physical RAM, you actually end up allocating more pmap's than 1G of memory can contain, since the total of physical RAM plus swap over 1024 is larger than 1G minus the amount taken by an idle kernel, not including the page mappings. If you have 3G of real RAM (which you do), then you are on the borderline of running out. When you factor in the amount of *potential* swap that machdep.c reserves, plus tuning for maxfiles/sockets/inpcb/tcpcb/mbufs/etc. (if any), PLUS the RAM taken up for things associated with running over 1000 processes (as your system does), then you end up exhausting the amount of VM space available.

As I said before, though, the only way to know for sure if this is your real problem is to break to the debugger after the lockup (it's *not* a crash), and check out the wait channels for the processes that are unable to run. If you want a tweak for 4.5 that has about a 95% probability of masking the problem, then you need to up the KVA space. Unfortunately, it's not really possible to tell you where every byte of memory is going. Also, unfortunately, the pmap's for swappable memory are not themselves swappable (or this would not be a problem).

Probably, pmaps for swap and for file backing store for executables should be allocated when they are needed, not preallocated (they can be, if you are not out of RAM, or have RAM, but are out of KVA space in which to create mappings) [see growkernel]. Taking out 1G of physical memory from the box might also fix the problem without a kernel tweak, FWIW.

However, right now, you need to cause the problem, enter the debugger, and use ps in the debugger to examine the wait channels.

-- Terry
Re: (jail) problem and a (possible) solution ?
ok. I was just looking back at a previous comment you made:

Amusingly enough, you might actually have *better* luck with a lot less swap...

and thinking that even if removing most of the swap did not _solve/mask_ the problem, at least it would be a step in the same direction as upping KVA (even if it is not as large a step) but if that is not the case...

...then, has anyone written a HOWTO on upping it in 4.5-RELEASE? You mentioned to look back over your own old posts on the subject - before I jump in and try it, I want to confirm what I believe I understand: I need to set the KVA value in my kernel config _and_ edit those other two files in the kernel source, then just recompile my kernel. Sound like I'm on the right track?

Terry, thanks again for your help and for all the help you regularly give to other people pursuing items such as this on the various FreeBSD lists.

--PT

On Sun, 23 Jun 2002, Terry Lambert wrote:
[Terry's message of 23 Jun quoted in full; trimmed - see his original post above.]
Re: (jail) problem and a (possible) solution ?
Patrick Thomas wrote: ok. I was just looking back at a previous comment you made: Amusingly enough, you might actually have *better* luck with a lot less swap...

I meant reserve, not physical swap. I can see how it could have been confusing in context; sorry.

and thinking that even if removing most of the swap did not _solve/mask_ the problem, at least it would be a step in the same direction as upping KVA (even if it is not as large a step) but if that is not the case... ...then, has anyone written a HOWTO on upping it in 4.5-RELEASE ? You mentioned to look back over your own old posts on the subject - before I jump in and try it, I want to confirm what I believe to understand, I need to set the KVA value in my kernel config _and_ edit those other two files in the kernel source, then just recompile my kernel. Sound like I'm on the right track ?

Yes. That's the way to do it for 4.5, specifically.

FreeBSD really needs an internals book. But like I said, this changed between 4.5 and 4.6, and everyone who's buying books would be more interested in 5.x, and all the important things change too fast (writing an internals book is an ~2000 hour job, and that basically means that the important stuff can't change for a year, or you have to track it -- which inflates it to an ~3000 hour job). Basically, most of the important internal interfaces need to sit still so that a book can be written, or no book. Even so, the selling life of the book will be limited to the amount of time after publication that things actually sit still. Kirk McKusick is rumored to be writing one; so was Wes Peters. Alfred and I discussed a device driver book that both of us thought needed to be written. Etc.. But no book, yet. I really hesitate to put down an A-B-C set of steps when I know it's only applicable to a couple of versions, none of which is the current version. 8-(.
Terry, thanks again for your help and for all the help you regularly give to other people pursuing items such as this on the various FreeBSD lists. Eh, I'm noisy. 8-). You still need to run the debugger, I think. So far, this is all theory. It fits the facts, but I can think of two other very low probability ways to cause the same symptoms. -- Terry
Re: (jail) problem and a (possible) solution ?
jump in and try it, I want to confirm what I believe to understand, I need to set the KVA value in my kernel config _and_ edit those other two files in the kernel source, then just recompile my kernel. Sound like I'm on the right track ?

Yes. That's the way to do it for 4.5, specifically.

Because I am paranoid, I like to check the state of a measurement before making a change and then after, to see that what I did did indeed induce a change... I have this irrational fear that sometimes I make changes like this and nothing in fact changed, and I just don't know it :)

So, should I just look for the value of:

vm.zone_kmem_kvaspace: 179691520

to increase in size even though the physical RAM stays the same at 3 gigs, or is there some other measurement I should look at before and after the KVA increase to ensure that it worked (and yes, I know that if it doesn't work I probably will have an inoperable machine, but just out of curiosity...)

thanks,
PT
Re: (jail) problem and a (possible) solution ?
Terry, Patrick, et al,

What is the procedure in 4.5-RELEASE (please say just change KVA_PAGES=260 to KVA_PAGES=512) (snip) For 4.5, you have to hack ldscript.i386 and pmap.h. I've posted on how to do this before (should be in the archives).

Actually, in 4.5 you only need to set:

options KVA_PAGES=512

and recompile your kernel. It looks like 4.5-RELEASE was the first release version to _not_ require hacking sys/i386/include/pmap.h and sys/conf/ldscript.i386. As you can see by looking at a 4.5-RELEASE pmap.h:

#ifdef SMP
#define NKPDE (KVA_PAGES - 2)	/* addressable number of page tables/pde's */
#else
#define NKPDE (KVA_PAGES - 1)	/* addressable number of page tables/pde's */
#endif

the offsets that Terry spoke of are already in place. This is in contrast to 4.4-RELEASE:

#ifdef SMP
#define NKPDE 254		/* addressable number of page tables/pde's */
#else
#define NKPDE 255		/* addressable number of page tables/pde's */
#endif

where everything was hard coded to match the default KVA_PAGES value. Further, looking at ldscript.i386 we see in 4.5-RELEASE:

. = kernbase + 0x00100000 + SIZEOF_HEADERS;

whereas in 4.4-RELEASE and earlier, we saw:

. = 0xc0100000 + SIZEOF_HEADERS;

which means that in 4.4 you had to change 0xc0100000 to 0x80100000 for a 2-gig KVA. In 4.5, however, you don't have to change ldscript.i386 at all, because it is now a relative value that takes kernbase into account.

-

So, if you are running 4.0 - 4.4, you need to edit ldscript.i386 and change 0xc0100000 to 0x80100000 (for a 2-gig KVA), then you need to edit pmap.h and change the two lines I pasted above from 254 and 255 to 510 and 511, respectively. Finally, you need to set:

options KVA_PAGES=512

in your kernel config, then recompile your kernel. But, if you are running 4.5 or 4.6, from the code I pasted above, it looks like all you have to do is set:

options KVA_PAGES=512

in your kernel config, then recompile your kernel.
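For a feel of the sizes involved: each KVA_PAGES unit corresponds to one i386 page-directory entry, i.e. 4 MB of kernel virtual address space, so the settings discussed in this thread work out as below (a quick sketch of the arithmetic, assuming 4 MB PDEs).

```shell
#!/bin/sh
# KVA implied by KVA_PAGES: one PDE maps 4 MB on i386.
for pages in 256 260 512; do
    echo "KVA_PAGES=$pages -> $((pages * 4)) MB of KVA"
done
```

That is, 256 pages is 1 GB, the 4.5/4.6 default of 260 is 1 GB plus an extra 16 MB, and 512 is the 2 GB figure being recommended here.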
- Another explanation of this concept can be found here: http://www.kozubik.com/docs/original_kva_increase.txt

I am posting today mainly to get a little more information stored in the archives. In addition, I myself have a question regarding the default settings of 4.5 and 4.6 - by looking at the NKPDE values in the 4.4-RELEASE version of pmap.h, the values of 254 and 255 indicate that they are hard-coded for a default of KVA_PAGES=256; however, 4.5 and 4.6 have a KVA_PAGES=260 setting in LINT, which I assume is also the default ... why the increase of 4 since 4.4-RELEASE ?

- John Kozubik - [EMAIL PROTECTED] - http://www.kozubik.com

The pages are all going to be off-by-one from your calculations, for the recursive page mapping, or off-by-two if your kernel is an SMP kernel, for the per-CPU page, so remember that, or you will end up with a kernel that simply doesn't boot. The easiest way is to look at the numbers in pmap.h and figure out how they relate to 0xc000 (remember to OR in 0x0010 after your math, to count the kernel loading at 1M).

-- Terry
Re: (jail) problem and a (possible) solution ?
- So, if you are running 4.0 - 4.4, you need to edit ldscript.i386 and change 0xc010 to 0x8010 (for a 2-gig KVA), then you need to edit pmap.h and change the two lines I pasted above from 254 and 255 to 510 and 511, respectively. Finally, you need to set: options KVA_PAGES=512

An addendum - skip that last step (setting options KVA_PAGES=512 in your kernel config) for versions 4.0-4.4, as it did not yet exist as a config option at that time. Again, for 4.5 and 4.6, adding that line to your kernel config is _all_ you need to do.

If you are reading this from the archives, please see my previous post in this thread for specific details.

- John Kozubik - [EMAIL PROTECTED] - http://www.kozubik.com
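Putting John's corrected procedures side by side, a sketch (values as given in this thread; KVA_PAGES=512 yields a 2-gig KVA):

```
# 4.5 / 4.6 -- kernel config only, then recompile:
options         KVA_PAGES=512

# 4.0 - 4.4 -- two file edits (no KVA_PAGES option existed yet), then recompile:
#   sys/conf/ldscript.i386:    change 0xc010 to 0x8010
#   sys/i386/include/pmap.h:   change 254 -> 510 and 255 -> 511
```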
Re: (jail) problem and a (possible) solution ?
Patrick Thomas wrote:

Because I am paranoid, I like to check the state of a measurement before making a change and then after, to see that what I did did indeed induce a change ... I have this irrational fear that sometimes I make changes like this and nothing in fact changed, and I just don't know it :) So, should I just look for the value of: vm.zone_kmem_kvaspace: 179691520 to increase in size even though the physical RAM stays the same at 3 gigs, or is there some other measurement I should look at before and after the KVA increase to ensure that it worked (and yes, I know that if it doesn't work I probably will have an inoperable machine, but just out of curiosity...)

Yes. You will also see the kernel load address during the boot process, which you can interrupt/pause until you are satisfied.

-- Terry
Re: (jail) problem and a (possible) solution ?
John Kozubik wrote:

Terry, Patrick, et al, For 4.5, you have to hack ldscript.i386 and pmap.h. I've posted on how to do this before (should be in the archives). Actually, in 4.5 you only need to set: options KVA_PAGES=512 and recompile your kernel. It looks like 4.5-RELEASE was the first release version to _not_ require hacking sys/i386/include/pmap.h and sys/conf/ldscript.i386. As you can see by looking at a 4.5-RELEASE pmap.h:

#define NKPDE (KVA_PAGES - 2) /* addressable number of page tables/pde's */
#else
#define NKPDE (KVA_PAGES - 1) /* addressable number of page tables/pde's */

the offsets that Terry spoke of are already in place. This is in contrast to 4.4-RELEASE:

#define NKPDE 254 /* addressable number of page tables/pde's */
#else
#define NKPDE 255 /* addressable number of page tables/pde's */

Yes; this is 1.65.2.3. This is my bad. It's the system I was using as a reference; it has two kernel source trees; the first one has 1.65, the second is a RELENG_4, which makes it a 1.65.2.3.

Where everything was hard-coded to match the default KVA_PAGES value. Further, looking at ldscript.i386, we see in 4.5-RELEASE: . = kernbase + 0x0010 + SIZEOF_HEADERS; whereas in 4.4-RELEASE and earlier, we saw: . = 0xc010 + SIZEOF_HEADERS; Which means that in 4.4 you had to change 0xc010 to 0x8010 for a 2-gig KVA. In 4.5, however, you don't have to change ldscript.i386 at all, because it is now a relative value that takes kernbase into account.

Yes, this is 1.4.2.1. The commit comments for ldscript.i386 are incredibly misleading as to what the merge actually does. The derivation of kernbase itself is also dependent on a third change, which is not documented, either.

So, if you are running 4.0 - 4.4, you need to edit ldscript.i386 and change 0xc010 to 0x8010 (for a 2-gig KVA), then you need to edit pmap.h and change the two lines I pasted above from 254 and 255 to 510 and 511, respectively.
Finally, you need to set: options KVA_PAGES=512 in your kernel config, then recompile your kernel. But, if you are running 4.5 or 4.6, from the code I pasted above, it looks like all you have to do is set: options KVA_PAGES=512 in your kernel config, then recompile your kernel.

- Another explanation of this concept can be found here: http://www.kozubik.com/docs/original_kva_increase.txt I am posting today mainly to get a little more information stored in the archives. In addition, I myself have a question regarding the default settings of 4.5 and 4.6 - by looking at the NKPDE values in the 4.4-RELEASE version of pmap.h, the values of 254 and 255 indicate that they are hard-coded for a default of KVA_PAGES=256; however, 4.5 and 4.6 have a KVA_PAGES=260 setting in LINT, which I assume is also the default ... why the increase of 4 since 4.4-RELEASE ?

I believe that this would be because of the desire for the number of *usable* pages, since you have to subtract out the ones that are not global to all CPUs. The LINT value is *not* the default. It went in in 1.954 of NOTES (LINT is a generated file). I don't know why Peter did this. It says "and a test" in the commit, and since it's only comments and the option itself, I guess that means that the value of 260 is the test that the commit message was referencing.

So I guess 4.5 is actually OK, but one of my local boxes is not. My main frustration with this has always been that the information in the Handbook has always been insufficient to actually make the change and have it work. I guess I'm glad that it made it into 4.5, even if it surprised me.

-- Terry
Re: (jail) problem and a (possible) solution ?
Yes, I've had the same problem. One system runs just fine with its jails, and another crashes habitually. It has to do with a certain jail (and services). Our systems are set up to be able to move jails between them (great for backups and near-perfect uptime), and a certain set of jails always hangs the system in this way. I'm trying to narrow it down. Do you get a core dump or does it just hang?

Nate

----- Original Message -----
From: Patrick Thomas [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, June 21, 2002 16:43
Subject: (jail) problem and a (possible) solution ?

A test server of mine running a number of jails keeps locking up - but the odd thing about the lockup is that the userland stops, but the kernel keeps running (sockets can be opened, but the servers never respond on them, the machine still responds to pings, but logs show that all real activity stops). I just noticed today that some jails still have writable /dev/mem and /dev/kmem and /dev/io nodes. I think it is plausible that some kind of fiddling (writing) to these nodes is causing this kind of lockup. Is this assumption reasonable, or if some jail user fiddled with their /dev/mem or /dev/kmem or /dev/io node would it just totally crash out the machine and I _wouldn't_ still be able to ping the server after it crashes ?

thanks, PT
Re: (jail) problem and a (possible) solution ?
What it does is the userland hangs, but the kernel keeps running. When the system is crashed, I can still ping it successfully, and I can still open sockets (like I can open a connection to a jail's httpd or sshd, or the sshd of the underlying server itself) but nothing answers on the sockets - they just hang open. So everything stops running, but it is still up - still responds to pings ... syslog stops logging, though, and cron stops running.

Two questions for you:

1) do you allow them write access to their /dev/mem, /dev/kmem, /dev/io ?
2) does this sound like what you see? Can you still ping the crashed server ?

I'm mostly just curious if this kind of crash (userland hung but kernel running) is a possible outcome of someone in a jail fiddling with those /dev nodes, or if fiddling with /dev/mem or /dev/kmem or /dev/io would just lock the machine up hard and completely. Terry?

--PT

On Fri, 21 Jun 2002, Nielsen wrote:

Yes, I've had the same problem. One system runs just fine with its jails, and another crashes habitually. It has to do with a certain jail (and services). Our systems are set up to be able to move jails between them (great for backups and near-perfect uptime), and a certain set of jails always hangs the system in this way. I'm trying to narrow it down. Do you get a core dump or does it just hang?

Nate

----- Original Message -----
From: Patrick Thomas [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, June 21, 2002 16:43
Subject: (jail) problem and a (possible) solution ?

A test server of mine running a number of jails keeps locking up - but the odd thing about the lockup is that the userland stops, but the kernel keeps running (sockets can be opened, but the servers never respond on them, the machine still responds to pings, but logs show that all real activity stops). I just noticed today that some jails still have writable /dev/mem and /dev/kmem and /dev/io nodes. I think it is plausible that some kind of fiddling (writing) to these nodes is causing this kind of lockup.
Is this assumption reasonable, or if some jail user fiddled with their /dev/mem or /dev/kmem or /dev/io node would it just totally crash out the machine and I _wouldn't_ still be able to ping the server after it crashes ?

thanks, PT
Re: (jail) problem and a (possible) solution ?
* Patrick Thomas [EMAIL PROTECTED] [020622 01:56] wrote:

What it does is the userland hangs, but the kernel keeps running. ... I'm mostly just curious if this kind of crash (userland hung but kernel running) is a possible outcome of someone in a jail fiddling with those /dev nodes, or if fiddling with /dev/mem or /dev/kmem or /dev/io would just lock the machine up hard and completely. Terry?

This typically means some sort of deadlock has happened; if possible, getting a crash dump (this is detailed in the handbook, I think) would help. The reason why it seems like apps are responding is because the kernel is only processing interrupts; something has hung the scheduler or deadlocked the kernel somehow... FYI, the kernel is not running except when interrupted by a device.

-Alfred
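For the archives, a sketch of the crash-dump setup Alfred refers to. The device name is an example only - substitute your own swap partition, and note the dump device needs to be large enough to hold the dump; the Handbook's kernel-debugging chapter is the authoritative reference.

```
# /etc/rc.conf -- point crash dumps at a swap device (example name):
dumpdev="/dev/ad0s1b"

# kernel config -- build in the debugger so you can break in and run ps:
options         DDB

# at the db> prompt after breaking to the debugger on the console:
#   ps       list processes with their states and wait channels
#   panic    force a panic so savecore(8) can write the dump at next boot
```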
Re: (jail) problem and a (possible) solution ?
Patrick Thomas wrote:

What it does is the userland hangs, but the kernel keeps running. When the system is crashed, I can still ping it successfully, and I can still open sockets (like I can open a connection to a jail's httpd or sshd, or the sshd of the underlying server itself) but nothing answers on the sockets - they just hang open. So everything stops running, but it is still up - still responds to pings ... syslog stops logging, though, and cron stops running. Two questions for you: 1) do you allow them write access to their /dev/mem, /dev/kmem, /dev/io ? 2) does this sound like what you see? Can you still ping the crashed server ? I'm mostly just curious if this kind of crash (userland hung but kernel running) is a possible outcome of someone in a jail fiddling with those /dev nodes, or if fiddling with /dev/mem or /dev/kmem or /dev/io would just lock the machine up hard and completely. Terry?

I've kept quiet so far because I'm not the jail expert; Poul actually wrote the jail code, and there was someone else who understood it enough to recently add multiple IP support. Given your symptoms, I can pretty much guess where the problem is, but not really how to fix it, other than trial-and-error, since I tend to run jails on a number of my machines and make them do things they aren't supposed to do...

Knowing what version of FreeBSD you are running would be helpful.

That you can still ping indicates that both hardware interrupts and NETISR are running. That NETISR runs indicates that things are still calling splx(), which means things are still calling spl*() and coming back from it. The fact that you can still connect to servers that have active listens posted, but that you get no data, is also indicative that the NETISR is running, at least up to the accept.
It would be interesting to attempt a large number of connections, to see if the connections stop being accepted after you've tried more times than you set in listen(2) as the queue depth for the number of sockets allowed to sit there pending accept. If this happens (connection attempts start hanging, rather than being accepted), you know for certain that the process you are trying to talk to is not being scheduled to run.

Basically, this implies one of two things is happening:

1) Your scheduler lost its head entry, so it's not scheduling anything to run, OR

2) You've used up all your resources on the machine (usually memory), and all of your processes are hung on a copy-on-write or allocate request, pending being serviced by the kernel

If you can, compile the kernel for the box with the kernel debugger enabled, and break to the debugger on the console. Then type ps and see what you get back as the wait channel everything you are trying to connect to is waiting on. This should be very informative, and it should be easy to locate the problem from there. If you have to, you can look at the scheduler queues, if there is anything in runnable state, and find out what's not there.

Probably, it's not enough RAM, and your tuning parameters are set such that this isn't fatal to processes, when it should be. That you are able to ping, etc. guarantees that you are not out of mbufs, and that you can connect that you aren't out of inpcb's or tcpcb's -- but mbufs are freelisted, so that's to be expected there (may not need more) and the pcb's are allocated at boot time (so are sockets, based on maxfiles), so tuning any of them after boot can get you in trouble.

-- Terry
Re: (jail) problem and a (possible) solution ?
Terry, thanks for that informative email - just a quick reality check, though (for myself) - the last time this type of crash happened, I was running and watching `top` on the machine - and when it froze, the `top` output froze as well, and this was the last display on the screen:

last pid: 6603; load averages: 3.81, 1.84, 1.48
1032 processes: 1 running, 1026 sleeping, 5 zombie
CPU states: 1.8% user, 0.8% nice, 3.2% system, 0.1% interrupt, 94.1% idle
Mem: 1129M Active, 1404M Inact, 351M Wired, 103M Cache, 199M Buf, 28M Free
Swap: 2018M Total, 2732K Used, 2015M Free

Since all of the things you spoke of basically revolved around running out of memory, is it possible or reasonable to think that within the space of 1 second, I ran through 1404 megs inactive and 28 megs free memory ?

The machine is 4.5-RELEASE with 3 gigs of RAM. Swap never gets touched, although there is in fact 2 gigs of swap. `pstat -s` always shows 0% used.

I'll do the debug actions you suggested.

--PT
Re: (jail) problem and a (possible) solution ?
Patrick Thomas wrote:

Since all of the things you spoke of basically revolved around running out of memory, is it possible or reasonable to think that within the space of 1 second, I ran through 1404 megs inactive and 28 megs free memory ? The machine is 4.5-RELEASE with 3 gigs of RAM. Swap never gets touched, although there is in fact 2 gigs of swap. `pstat -s` always shows 0% used.

OK, there's memory, and then there's memory. The amount of swap you have, the fact that it's 4.5, and the amount of RAM you have imply to me that the problem is that you are out of pmap entries. You should up your KVA space to 2G or maybe even 3G; the default in 4.5 was 1G. Basically, I now think that you don't have enough memory to map how much memory and virtual memory you have. Amusingly enough, you might actually have *better* luck with a lot less swap...

If your KVA space is already enlarged above the default, then you can ignore this and just go ahead with the debugging to see what the wait channels for all the processes that won't run are stuck at.

-- Terry
Re: (jail) problem and a (possible) solution ?
1) do you allow them write access to their /dev/mem, /dev/kmem, /dev/io ?

Actually, I haven't yet let anyone else inside a jail with root capabilities. Will soon, though. So, no, probably not, unless there's a daemon which does just that.

2) does this sound like what you see? Can you still ping the crashed server ?

Kernel routing still works. And yes, ping too. But come to think of this, I've seen it on other (4.5, patched pretty much to date) machines I use exclusively as routers. These have no jails on them. In these cases, after uptimes of, let's say, 2 or 3 months, the machine's daemons stop responding, and although a socket can be opened (just barely), it closes again when the process listening on the other side doesn't pick it up. IPSEC, firewalls, kernel routing, and all that continue to function just fine. Like you said, it's just the userland stuff that has problems.

The strange thing is, on one of my machines I was (eventually) able to log in from the console, take the system down to single-user mode and back up, and then everything worked like a charm.

Nate
Re: (jail) problem and a (possible) solution ?
How do you increase KVA space these days ? I see that in earlier releases you had to edit /sys/conf/ldscript.i386 and /sys/i386/include/pmap.h and do all sorts of crazy stuff. What is the procedure in 4.5-RELEASE (please say just change KVA_PAGES=260 to KVA_PAGES=512) That's what you want me to do, right ? Is that all - can it be done just by changing that one value in my kernel config ?

Again, thank you Terry for all your help.

--PT

On Sat, 22 Jun 2002, Terry Lambert wrote:

Patrick Thomas wrote: Since all of the things you spoke of basically revolved around running out of memory, is it possible or reasonable to think that within the space of 1 second, I ran through 1404 megs inactive and 28 megs free memory ? machine is 4.5-RELEASE with 3 gigs ram. swap never gets touched, although there is in fact 2 gigs of swap. `pstat -s` always shows 0% used.

OK, there's memory, and then there's memory. The amount of swap you have, the fact that it's 4.5, and the amount of RAM you have imply to me that the problem is that you are out of pmap entries. You should up your KVA space to 2G or maybe even 3G; the default in 4.5 was 1G. Basically, I now think that you don't have enough memory to map how much memory and virtual memory you have. Amusingly enough, you might actually have *better* luck with a lot less swap... If your KVA space is already enlarged above the default, then you can ignore this and just go ahead with the debugging to see what the wait channels for all the processes that won't run are stuck at.

-- Terry
Re: (jail) problem and a (possible) solution ?
Patrick Thomas wrote:

How do you increase KVA space these days ? I see that in earlier releases you had to edit /sys/conf/ldscript.i386 and /sys/i386/include/pmap.h and do all sorts of crazy stuff. What is the procedure in 4.5-RELEASE (please say just change KVA_PAGES=260 to KVA_PAGES=512) That's what you want me to do, right ? Is that all - can it be done just by changing that one value in my kernel config ?

It's what I want you to do.

For 4.5, you have to hack ldscript.i386 and pmap.h. I've posted on how to do this before (should be in the archives).

The pages are all going to be off-by-one from your calculations, for the recursive page mapping, or off-by-two if your kernel is an SMP kernel, for the per-CPU page, so remember that, or you will end up with a kernel that simply doesn't boot. The easiest way is to look at the numbers in pmap.h and figure out how they relate to 0xc000 (remember to OR in 0x0010 after your math, to count the kernel loading at 1M).

-- Terry
Re: (jail) problem and a (possible) solution ?
I think I'll just decrease my swap size from 2 gigs to 1 gig - is that a reasonable alternative that provides the same benefit and possible solution to this problem ? ... since basically 0 swap has ever been used on the machine anyway ...

--PT

On Sat, 22 Jun 2002, Terry Lambert wrote:

Patrick Thomas wrote: How do you increase KVA space these days ? I see that in earlier releases you had to edit /sys/conf/ldscript.i386 and /sys/i386/include/pmap.h and do all sorts of crazy stuff. What is the procedure in 4.5-RELEASE (please say just change KVA_PAGES=260 to KVA_PAGES=512) That's what you want me to do, right ? Is that all - can it be done just by changing that one value in my kernel config ?

It's what I want you to do. For 4.5, you have to hack ldscript.i386 and pmap.h. I've posted on how to do this before (should be in the archives). The pages are all going to be off-by-one from your calculations, for the recursive page mapping, or off-by-two if your kernel is an SMP kernel, for the per-CPU page, so remember that, or you will end up with a kernel that simply doesn't boot. The easiest way is to look at the numbers in pmap.h and figure out how they relate to 0xc000 (remember to OR in 0x0010 after your math, to count the kernel loading at 1M).

-- Terry