Re: xargs -0 and -L
On Tue, Feb 02, 2010 at 09:32:43PM +0100, Ingo Schwarze wrote:
> Antti Harri wrote on Tue, Feb 02, 2010 at 07:31:57PM +0200:
> > xargs' -L switch isn't working when using -0 flag.
>
> After checking POSIX.1 (2008), i conclude that our implementation and manual are correct in this respect. The -L option is concerned with lines of arguments from standard input. ASCII nul characters do not delimit lines.

You seem to have misinterpreted the purpose of -print0 and -0. From Linux find(1):

    -print0
        True; print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output. This option corresponds to the -0 option of xargs.

Using -0 for xargs simply means "Use \0", as otherwise \n or whitespace are used. This also has nothing to do with POSIX compliancy, since using -0 in the first place breaks compliancy.

> > Tested also on OS X and Linux and they print two lines with -0.
>
> So you might wish to file bug reports with these operating systems.

I suggest OpenBSD rather change their -0 semantics to match those of every other vendor which implement -0 in xargs.

--
Atte Peltomäki atte.peltom...@iki.fi http://kameli.org
Your effort to remain what you are is what limits you
Re: some cleanup of of uvm_map.c
On Thu, Feb 04, 2010 at 11:51:11AM +0500, Anton Maksimenkov wrote:
> 2010/2/3 Ariane van der Steldt ari...@stack.nl:
> > On Tue, Feb 02, 2010 at 01:21:55AM +0500, Anton Maksimenkov wrote:
> > > uvm_map_lookup_entry()... in uvm_map_findspace()...
> > I'm pretty sure that function is bugged: I think it doesn't do wrap-around on the search for an entry.
>
> ?

Both functions are rather complex. And both functions pre-date random allocation. All this map-hint business has no reason to remain, since random allocation is here to stay and the hint has a high failure rate since then (if it is used, which it usually isn't anyway). The RB_TREE code in uvm_map_findspace is a pretty good solution given the map entry struct. You will need to understand the RB_AUGMENT hack that is only used in this place of the kernel though.

> 2010/2/3 Owain Ainsworth zer...@googlemail.com:
> > On Wed, Feb 03, 2010 at 04:33:50PM +0100, Ariane van der Steldt wrote:
> > > I'm pretty sure that function is bugged...
> > Yes. Henning has a bgpd core router where bgpd occasionally dies due to
>
> Let me introduce my idea. In fact, we have only two functions in uvm_map.c which search in maps. These are uvm_map_lookup_entry() and uvm_map_findspace(). uvm_map_lookup_entry() searches by address (VA), while uvm_map_findspace() searches by free space.

uvm_map_findspace() is also used when searching by addr: mmap(..., MAP_FIXED) requires a search by addr. uvm_map_findspace() should also be able to take an addr constraint: like Oga said, i386 has a segmented view of memory wrt W^X.

> And we have a RB_HEAD(uvm_tree, vm_map_entry) rbhead, which is indexed by address (see uvm_compare()). So this RB_TREE gives us a way to search by address. This is what the RB_TREE is used for. But when we try to use that RB_TREE to track free space and to SEARCH by free space, we do dirty and ugly things! Then we add a RB_AUGMENT hack (it is so ugly and hard to use right with others that you can't find it anywhere in the source tree... except in uvm_map.c) and other tricks...

Nah, RB_AUGMENT is easy. When an item in the tree is deleted, each node in the tree that is altered (position or insertion/removal) has each of its children and each of its parents RB_AUGMENTed. The RB_AUGMENT calls are done in such a way that each node is processed prior to RB_PARENT(node) being processed.

> In the end we got very unclear uvm_map.c code in these places, which is very hard to understand and track, and which is overloaded and overcomplicated by those tricks and crutches. We must stop it, because it ties our hands and brains. We can do things very clearly and easily. Let me show how exactly. Let's just add a second RB_TREE, which is indexed by space, let's call it RB_HEAD(uvm_tree_by_space, vm_map_entry) rbhead_by_space, for example. That uvm_tree_by_space will contain vm_map_entry'es sorted by free space. And that uvm_tree_by_space will be used only in uvm_map_findspace() to search for a vm_map_entry with the needed free space. RB_NFIND() is our best friend here! Simple, reasonably fast, clear and easy to understand.

I use that trick in uvm_pmemrange. It's a nice algorithm, easy to implement. Keep in mind however: you'll need to support random allocation, or the diff will be rejected.

I was actually thinking along the same lines: rb-tree of free space, ordered by size.

Let: start = first rb-entry with enough space for allocation
Let: end = rb_max(free tree)
Let: chosen = random chunk in [start, end]

You'll want indices on the tree, so the random generator can be given a number to use. (If the number is too large, you'll get bias on the largest segments, if too small, you won't consider them.) Once you have 'chosen', you need to generate a random address inside.

Let: offset = random number between 0 and (end-start)

Now you have a random address inside chosen: chosen.start + offset.

Disadvantage of the algorithm is that it requires 2 random numbers (the current algorithm only requires 1). Advantage is that this algorithm always produces a valid address, so no forward searching is necessary anymore.

The address may have to be shifted because pmap_prefer says so. If that's the case, do so whenever possible: it removes aliasing problems on some architectures (afaik pmap has a way of dealing with it, but it's hideously expensive when it has to).

> Actually, since we can have two or more vm_map_entry'es with equal space, each RB_TREE element will be a list of vm_map_entry'es with that equal space. We can use some additional logic here when we push/pop a vm_map_entry from that list - we can pop the vm_map_entry with the smallest start address, for example.

You can't simply take the lowest start address. It would make addresses completely predictable.

> Then we must free the first RB_TREE (uvm_tree, indexed by address) from tracking space/ownspace and other nasty tricks, and stop using it in uvm_map_findspace(). This uvm_tree must
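
The selection scheme outlined in the message above (find the first entry with enough space, pick a random chunk between it and the largest, then pick a random offset inside it) can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the uvm code: `struct free_chunk`, `first_fit_index()` and `pick_random_addr()` are hypothetical names, a size-sorted array stands in for the red-black tree, the offset range is assumed to be "chunk size minus allocation size", and page alignment and pmap_prefer adjustments are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free chunk; in the thread this role is played by vm_map_entry. */
struct free_chunk {
	uintptr_t start;	/* first free address */
	size_t    size;		/* bytes of free space */
};

/*
 * Index of the first chunk with size >= want in an array sorted by
 * ascending size, or n if none fits. This is the kind of lookup that
 * RB_NFIND() would do on a tree keyed by free space.
 */
static size_t
first_fit_index(const struct free_chunk *c, size_t n, size_t want)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (c[mid].size < want)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Pick an address for an allocation of `want` bytes. r1 and r2 are the
 * two random numbers the message mentions; the caller supplies them.
 * Returns 0 and stores the address in *out, or -1 if nothing fits.
 */
static int
pick_random_addr(const struct free_chunk *c, size_t n, size_t want,
    uint32_t r1, uint32_t r2, uintptr_t *out)
{
	size_t start, chosen, slack;

	assert(want > 0);
	start = first_fit_index(c, n, want);
	if (start == n)
		return -1;

	/* chosen = random chunk in [start, n - 1] */
	chosen = start + r1 % (n - start);

	/* offset = random number in [0, chunk size - want] (assumption) */
	slack = c[chosen].size - want;
	*out = c[chosen].start + r2 % (slack + 1);
	return 0;
}
```

Every chunk at or after `start` is large enough, so the result is always a valid address and no forward search is needed, which is the advantage the message points out. The plain modulo used here has the bias problem mentioned above when the random range is not a multiple of the number of candidates; a real implementation would want a uniform bounded generator.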
Re: some cleanup of of uvm_map.c
2010/2/3 Owain Ainsworth zer...@googlemail.com:
> ...you can't just go straight to the start of the map. Say i386 where W^X is done using segments. if you dump something right at the other end of the segment, that pretty much screws it. Perhaps going back to the min hint for the protection and trying to push just a little bit down may work?

2010/2/4 Ariane van der Steldt ari...@stack.nl:
> mmap(..., MAP_FIXED) requires a search by addr. uvm_map_findspace() should also be able to take an addr constraint: like Oga said, i386 has a segmented view of memory wrt W^X.
...
> One more problem you may face, is uvm_km_getpage starvation. If it happens, you'll be unable to allocate a vm_map_entry. In your design, you'll need 2 per allocation most of the time. If that's too much pressure, the kernel may starve itself and be unable to create new entries, becoming completely jammed. This is a problem on any non-pmap_direct architectures (like i386).

I remember about MAP_FIXED, I just didn't mention it in my not-so-short message ;)

And I don't want 2 vm_map_entry per allocation, I only need to keep each vm_map_entry in both trees. One vm_map_entry can contain 2 separate RB_TREE entries (for both trees), so each tree can work with that vm_map_entry independently.

Can anyone explain to me what the problem with i386 segments is? Or better, supply some links to docs which explain it. I don't understand what you mean: is the MAP_FIXED flag the problem, or is it used as a workaround for some problem?

--
antonvm
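
The point about keeping one vm_map_entry in both trees, rather than allocating two entries, can be illustrated with an intrusive structure that carries one set of links per index. The sketch below is an assumption-laden stand-in: `struct map_entry`, `struct map`, `map_insert()` and `map_find_space()` are hypothetical names, and sorted singly linked lists replace the two red-black trees to keep it short.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for vm_map_entry. The two link fields play the
 * role of two RB_ENTRY() members, one per tree.
 */
struct map_entry {
	uintptr_t start;		/* start address of the free range */
	size_t    space;		/* free space in bytes */
	struct map_entry *addr_next;	/* link in the by-address index */
	struct map_entry *space_next;	/* link in the by-space index */
};

struct map {
	struct map_entry *by_addr;	/* ascending start address */
	struct map_entry *by_space;	/* ascending free space */
};

/* Link one entry into both indexes; no second entry is allocated. */
static void
map_insert(struct map *m, struct map_entry *e)
{
	struct map_entry **pp;

	for (pp = &m->by_addr; *pp != NULL && (*pp)->start < e->start;
	    pp = &(*pp)->addr_next)
		continue;
	e->addr_next = *pp;
	*pp = e;

	for (pp = &m->by_space; *pp != NULL && (*pp)->space < e->space;
	    pp = &(*pp)->space_next)
		continue;
	e->space_next = *pp;
	*pp = e;
}

/* First entry with at least `want` bytes free, or NULL if none. */
static struct map_entry *
map_find_space(const struct map *m, size_t want)
{
	struct map_entry *e;

	for (e = m->by_space; e != NULL && e->space < want; e = e->space_next)
		continue;
	return e;
}
```

Because the links live inside the entry, inserting into the second index costs no extra allocation, which is the property claimed in the message. Removal would have to unlink the entry from both indexes.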
Re: xargs -0 and -L
On Thu, Feb 4, 2010 at 2:48 AM, Atte Peltomäki atte.peltom...@iki.fi wrote:
> On Tue, Feb 02, 2010 at 09:32:43PM +0100, Ingo Schwarze wrote:
> > Antti Harri wrote on Tue, Feb 02, 2010 at 07:31:57PM +0200:
> > > xargs' -L switch isn't working when using -0 flag.
> >
> > After checking POSIX.1 (2008), i conclude that our implementation and manual are correct in this respect. The -L option is concerned with lines of arguments from standard input. ASCII nul characters do not delimit lines.
>
> You seem to have misinterpreted the purpose of -print0 and -0. From Linux find(1):

We understand the purpose of -0. Are you sure you understand the difference between -n and -L?

AFAICT, the behavior you desire can be obtained portably and without requiring an alteration to the definition of 'line' using "-0 -n # -x".

Philip Guenther
Re: UBC?
On Thu, Feb 04, 2010 at 09:29:13AM -0700, Jeff Ross wrote:
> Jeff Ross wrote:
> > On Sat, 30 Jan 2010, Bob Beck wrote:
> > > Ooooh. nice one. Obviously ami couldn't get memory mappings and freaked out.
> > >
> > > While not completely necessary, I'd love for you to file that whole thing into sendbug() in a pr so we don't forget it. but that one I need to pester krw, art, dlg, and maybe marco about what ami is doing.
> > >
> > > note that the behaviour you see wrt free memory dropping but not hitting swap is what I expect. basically that makes the buffer cache subordinate to working set memory between 10 and 90% of physmem. the buffer cache will throw away pages before allowing the system to swap.
> > >
> > > Drop it back to 70% and tell me if you still get the same panic please. and if you have a fixed test case that reproduces this on your machine ( a load generator for postgres with clients) I'd love to have a copy in the pr as well.
> >
> > 70% produces the same panic.
> >
> > panic: pmap_enter: no pv entries available
> > Stopped at Debugger+0x4: leave
> > RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
> > IF RUNNING SMP, USE 'mach ddbcpu #' AND 'trace' ON OTHER PROCESSORS, TOO.
> > DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
> > ddb{0} trace
> > Debugger(7149,dfe3ede8,d08edf18,c,0) at Debugger+0x4
> > panic(d077d740,0,7000,d08ad6e0,) at panic+0x55
> > pmap_enter(d08f3520,e0031000,7149000,3,13) at pmap_enter+0x2e5
> > _bus_dmamem_map(d0875c40,d505fc44,1,6344,d505fc58,1,dfe3eebc,1) at _bus_dmamem_map+0x9c
> > ami_allocmem(d49b5800,6344,20,d0753ddc) at ami_allocmem+0x92
> > ami_mgmt(d49b5800,a1,4,0,0,6344,d49cd000,1) at ami_mgmt+0x268
> > ami_refresh_sensors(d49b5800,da987028,da987050,8000,da987028) at ami_refresh_sensors+0x25
> > sensor_task_work(d49b3d80,0,50,200286) at sensor_task_work+0x1f
> > workq_thread(d0863100) at workq_thread+0x32
> > Bad frame pointer: 0xd0a32e78
> > ddb{0}
> >
> > I'll skip the ps for this go round since it should be pretty much the same and go directly to sendbug, including the pg_bench script I use to trigger it.
> >
> > Thanks!
> >
> > Jeff
>
> Okay, I've been testing. I brought everything up to current, applied the ami.c patch sent by David Gwynne as modified by Philip Guenther, and the patch to bus_dma.c sent by Kenneth Westerback.
>
> I started by setting kern.bufcachepercent=60 and then moving down by 10 after each panic. Anything 20 or greater triggers the same panic as above.
>
> I then set it to 10 to see what would happen. The load ran okay, but I did get three "uvm_mapent_alloc: out of static map entries" entries in the console that seem to coincide with the end of one of the three pgbench runs and the start of the next.
> So I set it to 11 and got this:
>
> ddb{2} show panic
> malloc: out of space in kmem_map
> ddb{2} trace
> Debugger(3fff,c,d488a000,4,4000) at Debugger+0x4
> panic(d0752c20,0,4000,0,0) at panic+0x55
> malloc(4000,7f,0,da8980b4,7) at malloc+0x76
> vfs_getcwd_scandir(e002eee8,e002eecc,e002eed0,d4e80400,da24f184) at vfs_getcwd_scandir+0x123
> vfs_getcwd_common(da898154,0,e002ef10,d4e80400,200,1,da24f184,22) at vfs_getcwd_common+0x1f0
> sys___getcwd(da24f184,e002ef68,e002ef58,da24f184) at sys___getcwd+0x62
> syscall() at syscall+0x12b
> --- syscall (number 304) ---
> 0x1c028f25:
>
> I reported a similar panic back in December http://kerneltrap.org/mailarchive/openbsd-misc/2009/12/14/6309363 and was told I'd twisted the knobs too hard ;-)
>
> Here are the sysctl values I'm currently using:
>
> kern.maxproc=10240
> kern.maxfiles=20480
> kern.maxvnodes=6000
> kern.shminfo.shmseg=32
> kern.seminfo.semmni=256
> kern.seminfo.semmns=2048
> kern.shminfo.shmall=512000
> kern.shminfo.shmmax=76800
>
> About that time Owain Ainsworth sent his version of a fix to bus_dma.c so I applied that and built a new kernel and I still get panics when I adjust kern.bufcachepercent above 15 or so.
>
> Here's the latest panic, trace and ps with kern.bufcachepercent set to 20:
>
> ddb{0} show panic
> pmap_enter: no pv entries available
> ddb{0} trace
> Debugger(b6f30,dff02e1c,2000,c,0) at Debugger+0x4
> panic(d077d7a0,0,0,d0e7a980,1) at panic+0x55
> pmap_enter(d08f3520,e002d000,b6f3,7,13) at pmap_enter+0x2e5
> uvm_km_alloc1(d08ad720,2000,0,1) at uvm_km_alloc1+0xd5
> fork1(da8c5834,14,1,0,0) at fork1+0x100
> sys_fork(da8c5834,dff02f68,dff02f58,da8c5834) at sys_fork+0x38
> syscall() at syscall+0x12b
> --- syscall (number 2) ---
> ddb{0} ps
>    PID   PPID   PGRP    UID  S   FLAGS  WAIT          COMMAND
>  18032   3768  18032    503  3   0x208  flt_pmfail2   postgres
>  28815   3768  28815    503  3   0x208  inode         postgres
>  16262   3768  16262    503  2   0x208                postgres
>  15301   3768  15301    503  3   0x208  inode         postgres
>  16712   3768  16712    503  2   0x208                postgres
>   5959   3768   5959    503  3   0x208  flt_pmfail2   postgres
>  24166   3768
Re: UBC?
On Thu, Feb 4, 2010 at 2:21 PM, Jeff Ross jr...@openvistas.net wrote:
> kern.shminfo.shmall=512000 kern.shminfo.shmmax=76800

Oh, when I said it was safe to crank shmmax I didn't know you'd be setting the bufcache to huge numbers too. ;)

> available memory in the server. I tried setting mine to 980MBs (4GB of ram) and that meant setting kern.shminfo.shmmax to 1GB just to get postgres to start. Later I found out that each process on an i386 is limited to 1GB, so I backed that off to 600MB.

As I pointed out in that thread, the 1GB limit does not apply to shared memory.

> What I will try next is setting postgres's shared_buffers to 1/4 of 1GB, since that is really the amount of available memory I have per postmaster, and dropping kern.shminfo.shmmax accordingly. I know that will run but not terribly fast, unless maybe kicking kern.bufcachepercent back up to 90% will help and not cause a panic.

Even so, I'm not sure how you're interpreting these numbers. shmmax is total across all processes, not per process. Your statement seems to imply a shm setting that's per postmaster, based on dividing total RAM between them, which would also leave no memory free for your 90% buffer cache.

The long and short of the problem is that you have a certain amount of RAM that has to be shared by all the processes' private address spaces, their shm segments, and the buffer cache. Obviously this can't go over 100%. But on top of it, there's also kernel overhead (secondary resources) just to keep track of all the previous (primary resources). This takes both RAM and kernel address space (out of a rather restricted pool). The limits exist not just to constrain how many resources are used for a particular purpose, but also so there's an upper bound on the secondary resources used to track the primary ones.

It's like a rubber band. Bob's patch lets you stretch it farther in one direction, but you can't simultaneously stretch in a dozen ways at once.
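
The "can't go over 100%" argument can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope check using figures mentioned in the thread (4GB of RAM, a 90% buffer cache setting, the 600MB shared memory figure); `bufcache_ceiling_mb()` and `headroom_mb()` are hypothetical helper names. It treats kern.bufcachepercent as a ceiling rather than a reservation, since the thread says the buffer cache gives pages back before the system swaps, and it leaves out the kernel's own bookkeeping overhead entirely.

```c
#include <stdint.h>

/* Upper bound on buffer cache size (MB) for a given bufcachepercent. */
static uint64_t
bufcache_ceiling_mb(uint64_t physmem_mb, unsigned percent)
{
	return physmem_mb * percent / 100;
}

/*
 * Memory (MB) left for process-private pages and kernel overhead if the
 * buffer cache grew to its ceiling while the shared memory is fully in
 * use. A negative result means the settings add up to more than physical
 * memory, so something has to shrink.
 */
static int64_t
headroom_mb(uint64_t physmem_mb, unsigned percent, uint64_t shm_mb)
{
	return (int64_t)physmem_mb
	    - (int64_t)bufcache_ceiling_mb(physmem_mb, percent)
	    - (int64_t)shm_mb;
}
```

With 4096MB of RAM, a 90% ceiling is 3686MB, and adding 600MB of shared memory gives a headroom of -190MB before any private memory or kernel bookkeeping is counted. At 10% the same calculation leaves 3087MB.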
Re: xargs -0 and -L
Hi Atte,

Atte Peltomäki wrote on Thu, Feb 04, 2010 at 12:48:47PM +0200:
> On Tue, Feb 02, 2010 at 09:32:43PM +0100, Ingo Schwarze wrote:
> > Antti Harri wrote on Tue, Feb 02, 2010 at 07:31:57PM +0200:
> > > xargs' -L switch isn't working when using -0 flag.
> >
> > After checking POSIX.1 (2008), i conclude that our implementation and manual are correct in this respect. The -L option is concerned with lines of arguments from standard input. ASCII nul characters do not delimit lines.

[ snipped the Linux description of -print0, we agree on that ]

> Using -0 for xargs simply means "Use \0", as otherwise \n or whitespace are used.

That's horribly imprecise. Rather, it means "interpret NUL and only NUL as the argument separator". I do not think that -0 should change the meaning of the term "line". NUL is not a line separator, and -0 does not make it one. The effect of find(1) -print0 is to put all names on the same line, separated by NUL characters, whereas find(1) -print puts each name on its own line.

> This also has nothing to do with POSIX compliancy, since using -0 in the first place breaks compliancy.

No doubt, -0 is an _extension_ to POSIX and XPG (it doesn't break it, that makes a difference). When implementing the -0 extension, we must take care not to violate completely unrelated parts of the specification. In particular, i fail to see the point in giving the XPG standard options -I and -L non-standard meanings just because the non-standard -0 option happens to be in effect, too.

> > > Tested also on OS X and Linux and they print two lines with -0.
> >
> > So you might wish to file bug reports with these operating systems.
>
> I suggest OpenBSD rather change their -0 semantics to match those of every other vendor which implement -0 in xargs.

Nobody is discussing -0 semantics, it's -I and -L semantics that are at stake. And it looks like everyone else broke -L in their code.
http://www.freebsd.org/cgi/cvsweb.cgi/src/usr.bin/xargs/xargs.c.diff?r1=1.55;r2=1.56
http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/xargs/xargs.c?only_with_tag=MAIN#rev1.16

So, in 2005, FreeBSD introduced the same bug you report in Linux, and the commit message doesn't really give a clue what the point is:

    so that 'xargs -0 -I []' will do something sensible (namely: treat then '\0' as the EOL character...

I don't really understand that: Philip already pointed out the difference between

    # XPG
    schwa...@rhea $ printf "a b" | xargs -L1
    a b
    schwa...@rhea $ printf "a b" | xargs -n1
    a
    b
    # extension
    schwa...@rhea $ printf "a\0b" | xargs -0 -L1
    a b
    schwa...@rhea $ printf "a\0b" | xargs -0 -n1
    a
    b

where -n does in a standard way what other systems apparently abuse -L for.

An analogous argument applies to -I:

    # extension
    schwa...@rhea $ printf "a b" | xargs -I % echo % x %
    a b x a b
    schwa...@rhea $ printf "a b" | xargs -n1 -I % echo % x %
    a x a
    b x b
    # extension
    schwa...@rhea $ printf "a\0b" | xargs -0 -I % echo % x %
    a b x a b
    schwa...@rhea $ printf "a\0b" | xargs -0 -n1 -I % echo % x %
    a x a
    b x b

I admit that's all extending to XPG, since XPG does not require xargs -I to handle more than one input argument per input line, but handling it doesn't violate XPG and it's the traditional behaviour in FreeBSD since 2002 when -I was first ported from xMach.

I call it *useful* that -I and -n1 -I do different things. It is a nuisance that on other systems, -0 -I apparently now does the same as -0 -n1 -I.

Finally, in 2007, NetBSD copied the -L bug from FreeBSD.

What a mess... :-(

Yours,
Ingo
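
The disagreement in this thread comes down to how many commands xargs runs for NUL-separated input. The C sketch below models the two readings of -L under -0 next to -n; it is not xargs source code. All function names are hypothetical, the input is assumed to contain no newline characters, and empty arguments between consecutive NULs are skipped as a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Count arguments in a buffer where NUL is the only separator (-0). */
static size_t
count_nul_args(const char *buf, size_t len)
{
	size_t n = 0, i;
	int in_arg = 0;

	for (i = 0; i < len; i++) {
		if (buf[i] == '\0') {
			if (in_arg)
				n++;
			in_arg = 0;
		} else
			in_arg = 1;
	}
	return n + (in_arg ? 1 : 0);
}

/* -n N: one command per N arguments. */
static size_t
invocations_n(size_t nargs, size_t n)
{
	assert(n > 0);
	return (nargs + n - 1) / n;
}

/*
 * -0 -L N when NUL separates arguments but does not end a line: the
 * whole input is one line, so at most one command runs.
 */
static size_t
invocations_L_nul_not_eol(size_t nargs)
{
	return nargs > 0 ? 1 : 0;
}

/* -0 -L N when each NUL also counts as end of line: same count as -n N. */
static size_t
invocations_L_nul_is_eol(size_t nargs, size_t lines)
{
	return invocations_n(nargs, lines);
}
```

For the input "a\0b" the model gives two arguments: -n1 runs two commands under either reading, while -L1 runs one command if NUL is only a separator and two if NUL also ends a line. That matches the outputs quoted in the message above for the first reading and the "two lines" reported for OS X and Linux for the second.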
Re: UBC?
2010/2/5 Ted Unangst ted.unan...@gmail.com:
> On Thu, Feb 4, 2010 at 2:21 PM, Jeff Ross jr...@openvistas.net wrote:
> > kern.shminfo.shmall=512000 kern.shminfo.shmmax=76800
>
> Oh, when I said it was safe to crank shmmax I didn't know you'd be setting the bufcache to huge numbers too. ;)

Furthermore, the postgres documentation recommends not setting shared buffers to big values, because postgres itself depends on the buffer cache (it supposes that the buffer cache is big).

Jeff, since you can set the buffer cache to about 90% of RAM, you can set chared buffers to something not so big... say, 256Mb or so (I just show the way, not exact values), and decrease shmall/shmmax. And set the postgres parameter effective_cache_size to the estimated size of your buffer cache. After that postgres will actively use your buffer cache, I suppose.

And then you will show some results to us, right? It would be interesting to see whether all this confirms what the documentation says :-).

--
antonvm
Re: UBC?
2010/2/5 Anton Maksimenkov anton...@gmail.com:
> set chared buffers to something not so big... say, 256Mb or so (I just

I made a misprint. Of course it's shared, not chared. Sorry.

--
antonvm