Re: xargs -0 and -L

2010-02-04 Thread Atte Peltomäki
On Tue, Feb 02, 2010 at 09:32:43PM +0100, Ingo Schwarze wrote:
 Antti Harri wrote on Tue, Feb 02, 2010 at 07:31:57PM +0200:
 
  xargs' -L switch isn't working when using -0 flag.
 
 After checking POSIX.1 (2008), i conclude that our implementation and
 manual are correct in this respect.  The -L option is concerned
 with lines of arguments from standard input.  ASCII nul characters
 do not delimit lines.
 
You seem to have misinterpreted the purpose of -print0 and -0. From
Linux find(1):

-print0

True; print the full file name on the standard output, followed
by a null character (instead of the newline character that -print uses).
This allows file names that contain newlines or other types of white
space to be correctly interpreted by programs that process the find
output. This option corresponds to the -0 option of xargs.
 
Using -0 for xargs simply means "use \0 where \n or whitespace would
otherwise be used". This also has nothing to do with POSIX compliance,
since using -0 in the first place breaks compliance.

  Tested also on OS X and Linux and they print two lines with -0.
 
 So you might wish to file bug reports with these operating systems.

I suggest OpenBSD rather change their -0 semantics to match those of
every other vendor which implements -0 in xargs.
 
-- 
Atte Peltomäki
 atte.peltom...@iki.fi  http://kameli.org
Your effort to remain what you are is what limits you



Re: some cleanup of of uvm_map.c

2010-02-04 Thread Ariane van der Steldt
On Thu, Feb 04, 2010 at 11:51:11AM +0500, Anton Maksimenkov wrote:
 2010/2/3 Ariane van der Steldt ari...@stack.nl:
  On Tue, Feb 02, 2010 at 01:21:55AM +0500, Anton Maksimenkov wrote:
  uvm_map_lookup_entry()... in uvm_map_findspace()...
 
  I'm pretty sure that function is bugged: I think it doesn't do
  wrap-around on the search for an entry. ?
 
  Both functions are rather complex. And both functions pre-date random
  allocation. All this map-hint business has no reason to remain, since
  random allocation is here to stay and the hint has a high failure rate
  since then (if it is used, which it usually isn't anyway).
 
  The RB_TREE code in uvm_map_findspace is a pretty good solution given
  the map entry struct. You will need to understand the RB_AUGMENT hack
  that is only used in this place of the kernel though.
 
 2010/2/3 Owain Ainsworth zer...@googlemail.com:
  On Wed, Feb 03, 2010 at 04:33:50PM +0100, Ariane van der Steldt wrote:
  I'm pretty sure that function is bugged...
  Yes. Henning has a bgpd core router where bgpd occasionally dies due to
 
 
 Let me introduce my idea.
 
 In fact, we have only two functions in uvm_map.c which search
 in maps. These are uvm_map_lookup_entry() and uvm_map_findspace().
 uvm_map_lookup_entry() searches by address (VA), while
 uvm_map_findspace() searches by free space.

uvm_map_findspace() is also used when searching by addr:
mmap(..., MAP_FIXED) requires a search by addr.
uvm_map_findspace() should also be able to take an addr constraint: like
Oga said, i386 has a segmented view of memory wrt W^X.

 And we have a RB_HEAD(uvm_tree, vm_map_entry) rbhead, which is
 indexed by address (see uvm_compare()). So this RB_TREE
 gives us searching by address. This is what the
 RB_TREE is used for.
 But when we try to use that RB_TREE to track free space and to
 SEARCH by free space, we do dirty and ugly things!
 Then we add a RB_AUGMENT hack (it is so ugly and hard to use right
 with others that you can't find it anywhere in the source tree... except
 in uvm_map.c) and other tricks...

Nah, RB_AUGMENT is easy. When an item in the tree is deleted, each node
in the tree that is altered (position or insertion/removal) has each of
its children and each of its parents RB_AUGMENTed. The RB_AUGMENT calls
are done in such a way that each node is processed prior to
RB_PARENT(node) being processed.
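
To make the augmentation idea concrete, here is a minimal sketch in plain C.
It is not the sys/tree.h RB_AUGMENT macro and not the uvm_map.c code: the
tree is an unbalanced binary search tree, and struct entry, augment(),
augment_upward(), insert() and first_fit() are hypothetical names. The field
names ownspace and space only mirror the ones mentioned in this thread. What
it illustrates is the ordering: a node's cached maximum is recomputed from
its children before its parent is touched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a map entry; not the kernel's vm_map_entry. */
struct entry {
	unsigned long start;     /* key: start address */
	unsigned long ownspace;  /* free space directly after this entry */
	unsigned long space;     /* augmented: max ownspace in this subtree */
	struct entry *left, *right, *parent;
};

/* Recompute the cached subtree maximum of one node from its children. */
static void augment(struct entry *e)
{
	e->space = e->ownspace;
	if (e->left != NULL && e->left->space > e->space)
		e->space = e->left->space;
	if (e->right != NULL && e->right->space > e->space)
		e->space = e->right->space;
}

/* Walk from e to the root: every node is processed before its parent. */
static void augment_upward(struct entry *e)
{
	for (; e != NULL; e = e->parent)
		augment(e);
}

/* Plain (unbalanced) insert keyed by start address; returns the new root. */
static struct entry *insert(struct entry *root, struct entry *e)
{
	struct entry *p = NULL, *cur = root;

	assert(e != NULL);
	e->left = e->right = NULL;
	while (cur != NULL) {
		p = cur;
		cur = e->start < cur->start ? cur->left : cur->right;
	}
	e->parent = p;
	if (p == NULL)
		root = e;
	else if (e->start < p->start)
		p->left = e;
	else
		p->right = e;
	augment_upward(e);
	return root;
}

/* Lowest-address entry with ownspace >= need, or NULL if none exists. */
static struct entry *first_fit(struct entry *e, unsigned long need)
{
	while (e != NULL && e->space >= need) {
		if (e->left != NULL && e->left->space >= need)
			e = e->left;          /* a fit exists at a lower address */
		else if (e->ownspace >= need)
			return e;
		else
			e = e->right;         /* the fit must be on the right */
	}
	return NULL;
}
```

With the cached maximum in place, first_fit() can skip whole subtrees that
have no gap large enough, which is what makes a search by free space possible
on a tree ordered by address. A real red-black tree would also have to
re-augment the nodes moved by rotations.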

 In the end we got very unclear
 uvm_map.c code in these places, which is very hard to understand and
 track, and which is overloaded and overcomplicated by those tricks and
 crutches.
 We must stop it, because it ties our hands and brains.
 
 We can do things very clear and easy. Let me show how exactly.
 
 Let's just add a second RB_TREE, which is indexed by space; let's call
 it RB_HEAD(uvm_tree_by_space, vm_map_entry) rbhead_by_space, for
 example. That uvm_tree_by_space will contain vm_map_entry'es sorted by
 free space. And that uvm_tree_by_space will be used only in
 uvm_map_findspace() to search for a vm_map_entry with the needed free
 space. RB_NFIND() is our best friend here! Simple, reasonably fast,
 clear and easy to understand.

I use that trick in uvm_pmemrange. It's a nice algorithm, easy to
implement. Keep in mind however: you'll need to support random
allocation, or the diff will be rejected.

I was actually thinking along the same lines: rb-tree of free space,
ordered by size.
Let: start = first rb-entry with enough space for allocation
Let: end = rb_max(free tree)
Let: chosen = random chunk in [start, end]

You'll want indices on the tree, so the random generator can be given a
number to use. (If the number is too large, you'll get bias on the
largest segments, if too small, you won't consider them.)

Once you have 'chosen', you need to generate a random address inside.

Let: offset = random number between 0 and (end-start)
Now you have a random address inside chosen:
chosen.start + offset.

Disadvantage of the algorithm is that it requires 2 random numbers (the
current algorithm only requires 1). Advantage is that this algorithm
always produces a valid address, so no forward searching is necessary
anymore.
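
The selection steps above can be sketched as follows, under several
assumptions: a size-sorted array stands in for the rb-tree with indices,
"(end-start)" is read as the slack of the chosen chunk (its size minus the
request), addresses are byte-granular rather than page-aligned, and
pmap_prefer is ignored. All names (free_chunk, first_big_enough,
pick_random_addr, demo_rand_below) are made up for illustration.
demo_rand_below uses rand() with modulo bias and is only there so the sketch
runs; real code would want a proper uniform generator, something like
arc4random_uniform(3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical free chunk; the array stands in for a size-ordered rb-tree. */
struct free_chunk {
	uintptr_t start;
	size_t    size;
};

/* Generator type: returns a value in [0, upper), upper must be > 0. */
typedef uint32_t (*rand_below_fn)(uint32_t upper);

/* Demo generator only: biased and predictable, not for real use. */
static uint32_t demo_rand_below(uint32_t upper)
{
	return (uint32_t)rand() % upper;
}

/* Index of the first chunk with size >= need, or n if there is none. */
static size_t first_big_enough(const struct free_chunk *v, size_t n,
    size_t need)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (v[mid].size < need)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Pick a random address for `need` bytes. `v` must be sorted by size,
 * ascending. Returns 0 and stores the address in *out, or -1 if no
 * chunk is large enough.
 */
static int pick_random_addr(const struct free_chunk *v, size_t n,
    size_t need, rand_below_fn rnd, uintptr_t *out)
{
	const struct free_chunk *chosen;
	size_t start, slack;

	assert(need > 0 && rnd != NULL && out != NULL);
	start = first_big_enough(v, n, need);
	if (start == n)
		return -1;
	/* First random number: which chunk in [start, n - 1]. */
	chosen = &v[start + rnd((uint32_t)(n - start))];
	/* Second random number: offset inside the chosen chunk's slack. */
	slack = chosen->size - need;
	assert(slack < UINT32_MAX);
	*out = chosen->start + rnd((uint32_t)slack + 1);
	return 0;
}
```

Every candidate chunk is known to fit before it is picked, so the result is
always usable without a forward search, at the cost of the two rnd() calls
discussed above.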

The address may have to be shifted because pmap_prefer says so. If
that's the case, do so whenever possible: it removes aliasing problems
on some architectures (afaik pmap has a way of dealing with it, but
it's hideously expensive when it has to).

 Actually, since we can have two or more vm_map_entry'es with equal
 space, each RB_TREE element will be a list of vm_map_entry'es
 with that equal space. We can use some additional logic here when we
 push/pop a vm_map_entry from that list - we can pop the vm_map_entry with
 the smallest start address, for example.

You can't simply take the lowest start address. It would make addresses
completely predictable.

 Then we must free the first RB_TREE (uvm_tree, indexed by address)
 from tracking space/ownspace and other nasty tricks, and stop using it
 in uvm_map_findspace(). This uvm_tree must 

Re: some cleanup of of uvm_map.c

2010-02-04 Thread Anton Maksimenkov
2010/2/3 Owain Ainsworth zer...@googlemail.com:
 ...you can't just go straight to the start of the map.
 Say i386 where W^X is done using segments. if you dump
 something right at the other end of the segment, that pretty much screws
 it. Perhaps going back to the min hint for the protection and trying to
 push just a little bit down may work?

2010/2/4 Ariane van der Steldt ari...@stack.nl:
 mmap(..., MAP_FIXED) requires a search by addr.
 uvm_map_findspace() should also be able to take an addr constraint:
 like Oga said, i386 has a segmented view of memory wrt W^X.
...
 One more problem you may face, is uvm_km_getpage starvation. If it
 happens, you'll be unable to allocate a vm_map_entry. In your design,
 you'll need 2 per allocation most of the time. If that's too much
 pressure, the kernel may starve itself and be unable to create new
 entries, becoming completely jammed. This is a problem on any
 non-pmap_direct architectures (like i386).
I remember MAP_FIXED, I just didn't mention it in my not-so-short message ;)
And I don't want 2 vm_map_entry per allocation, I only need to keep
each vm_map_entry in both trees. One vm_map_entry can contain 2
separate RB_TREE entries (for both trees), so each tree can work with
that vm_map_entry independently.
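
A rough sketch of what "one entry, two sets of links" means, in plain C.
Sorted singly linked lists stand in for the two RB_TREEs to keep it short
(the real thing would presumably embed two RB_ENTRY fields instead), and
struct map_entry, struct map_index, index_insert(), find_space() and
find_addr() are all hypothetical names, not anything from uvm_map.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry: allocated once, linked into two indexes. */
struct map_entry {
	unsigned long start;              /* start address */
	unsigned long space;              /* free space after the entry */
	struct map_entry *next_by_addr;   /* links for the address index */
	struct map_entry *next_by_space;  /* links for the free-space index */
};

struct map_index {
	struct map_entry *by_addr;    /* ascending start */
	struct map_entry *by_space;   /* ascending space */
};

/* Link one entry into both indexes; no second allocation is needed. */
static void index_insert(struct map_index *ix, struct map_entry *e)
{
	struct map_entry **pp;

	assert(ix != NULL && e != NULL);
	for (pp = &ix->by_addr; *pp != NULL && (*pp)->start < e->start;
	    pp = &(*pp)->next_by_addr)
		;
	e->next_by_addr = *pp;
	*pp = e;

	for (pp = &ix->by_space; *pp != NULL && (*pp)->space < e->space;
	    pp = &(*pp)->next_by_space)
		;
	e->next_by_space = *pp;
	*pp = e;
}

/* Smallest entry with space >= need (the role RB_NFIND would play). */
static struct map_entry *find_space(const struct map_index *ix,
    unsigned long need)
{
	struct map_entry *e = ix->by_space;

	while (e != NULL && e->space < need)
		e = e->next_by_space;
	return e;
}

/* Last entry whose start is <= addr, or NULL. */
static struct map_entry *find_addr(const struct map_index *ix,
    unsigned long addr)
{
	struct map_entry *e, *hit = NULL;

	for (e = ix->by_addr; e != NULL && e->start <= addr;
	    e = e->next_by_addr)
		hit = e;
	return hit;
}
```

The lists make every operation O(n); they are only there to show that the
two indexes share the same objects and can be walked independently.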


Can anyone explain to me what the problem with i386 segments is? Or
better, supply some links to docs which explain it.
I don't understand what you mean - is the MAP_FIXED flag the problem, or
is it used as a workaround for some problem?
-- 
antonvm



Re: xargs -0 and -L

2010-02-04 Thread Philip Guenther
On Thu, Feb 4, 2010 at 2:48 AM, Atte Peltomäki atte.peltom...@iki.fi wrote:
 On Tue, Feb 02, 2010 at 09:32:43PM +0100, Ingo Schwarze wrote:
 Antti Harri wrote on Tue, Feb 02, 2010 at 07:31:57PM +0200:

  xargs' -L switch isn't working when using -0 flag.

 After checking POSIX.1 (2008), i conclude that our implementation and
 manual are correct in this respect.  The -L option is concerned
 with lines of arguments from standard input.  ASCII nul characters
 do not delimit lines.

 You seem to have misinterpreted the purpose of -print0 and -0. From
 Linux find(1):

We understand the purpose of -0.  Are you sure you understand the
difference between -n and -L?  AFAICT, the behavior you desire can be
obtained portably and without requiring an alteration to the
definition of 'line' using -0 -n # -x


Philip Guenther



Re: UBC?

2010-02-04 Thread Ariane van der Steldt
On Thu, Feb 04, 2010 at 09:29:13AM -0700, Jeff Ross wrote:
 Jeff Ross wrote:
  On Sat, 30 Jan 2010, Bob Beck wrote:
  
  Ooooh. nice one.  Obviously ami couldn't get memory mappings and 
  freaked out.
 
  While not completely necessary, I'd love for you to file that whole
  thing into sendbug() in a pr so we don't
  forget it. but that one I need to pester krw, art, dlg, and maybe
  marco about what ami is doing.
 
  note that the behaviour you see wrt free memory dropping but not
  hitting swap is what I expect.
  basically that makes the buffer cache subordinate to working set
  memory between 10 and 90% of
  physmem. the buffer cache will throw away pages before allowing the
  system to swap.
 
  Drop it back to 70% and tell me if you still get the same panic
  please.  and if you have a fixed test
  case that reproduces this on your machine ( a load generator for
  postgres with clients) I'd love to
  have a copy in the pr as well.
  
  70% produces the same panic.
  
  panic: pmap_enter: no pv entries available
  Stopped at  Debugger+0x4:   leave
  RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
  IF RUNNING SMP, USE 'mach ddbcpu #' AND 'trace' ON OTHER PROCESSORS, TOO.
  DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
  ddb{0} trace
  Debugger(7149,dfe3ede8,d08edf18,c,0) at Debugger+0x4
  panic(d077d740,0,7000,d08ad6e0,) at panic+0x55
  pmap_enter(d08f3520,e0031000,7149000,3,13) at pmap_enter+0x2e5
  _bus_dmamem_map(d0875c40,d505fc44,1,6344,d505fc58,1,dfe3eebc,1) at _bus_dmamem_map+0x9c
  ami_allocmem(d49b5800,6344,20,d0753ddc) at ami_allocmem+0x92
  ami_mgmt(d49b5800,a1,4,0,0,6344,d49cd000,1) at ami_mgmt+0x268
  ami_refresh_sensors(d49b5800,da987028,da987050,8000,da987028) at ami_refresh_sensors+0x25
  sensor_task_work(d49b3d80,0,50,200286) at sensor_task_work+0x1f
  workq_thread(d0863100) at workq_thread+0x32
  Bad frame pointer: 0xd0a32e78
  ddb{0}
  
  I'll skip the ps for this go round since it should be pretty much the 
  same and go directly to sendbug, including the pg_bench script I use to 
  trigger it.
  
  Thanks!
  
  Jeff
 
 Okay, I've been testing.  I brought everything up to current, applied the
 ami.c patch sent by David Gwynne as modified by Philip Guenther, and the 
 patch to bus_dma.c sent by Kenneth Westerback.
 
 I started by setting kern.bufcachepercent=60 and then moving down by 10 after
 each panic. Anything 20 or greater triggers the same panic as above.
 
 I then set it to 10 to see what would happen.  The load ran okay, but I did
 get three "uvm_mapent_alloc: out of static map entries" messages on the
 console that seem to coincide with the end of one of the three pgbench runs
 and the start of the next.
 
 So I set it to 11 and got this:
 
 ddb{2} show panic
 malloc: out of space in kmem_map
 ddb{2} trace
 Debugger(3fff,c,d488a000,4,4000) at Debugger+0x4
 panic(d0752c20,0,4000,0,0) at panic+0x55
 malloc(4000,7f,0,da8980b4,7) at malloc+0x76
 vfs_getcwd_scandir(e002eee8,e002eecc,e002eed0,d4e80400,da24f184) at vfs_getcwd_scandir+0x123
 vfs_getcwd_common(da898154,0,e002ef10,d4e80400,200,1,da24f184,22) at vfs_getcwd_common+0x1f0
 sys___getcwd(da24f184,e002ef68,e002ef58,da24f184) at sys___getcwd+0x62
 syscall() at syscall+0x12b
 --- syscall (number 304) ---
 0x1c028f25:
 
 I reported a similar panic back in December
   http://kerneltrap.org/mailarchive/openbsd-misc/2009/12/14/6309363
 and was told I'd twisted the knobs too hard ;-)
 
 Here are the sysctl values I'm currently using:
 
 kern.maxproc=10240
 kern.maxfiles=20480
 kern.maxvnodes=6000
 
 kern.shminfo.shmseg=32
 kern.seminfo.semmni=256
 kern.seminfo.semmns=2048
 kern.shminfo.shmall=512000
 kern.shminfo.shmmax=76800
 
 About that time Owain Ainsworth sent his version of a fix to bus_dma.c so I 
 applied that and built a new kernel and I still get panics when I adjust 
 kern.bufcachepercent above 15 or so.
 
 Here's the latest panic, trace and ps with kern.bufcachepercent set to 20:
 ddb{0} show panic
 pmap_enter: no pv entries available
 ddb{0} trace
 Debugger(b6f30,dff02e1c,2000,c,0) at Debugger+0x4
 panic(d077d7a0,0,0,d0e7a980,1) at panic+0x55
 pmap_enter(d08f3520,e002d000,b6f3,7,13) at pmap_enter+0x2e5
 uvm_km_alloc1(d08ad720,2000,0,1) at uvm_km_alloc1+0xd5
 fork1(da8c5834,14,1,0,0) at fork1+0x100
 sys_fork(da8c5834,dff02f68,dff02f58,da8c5834) at sys_fork+0x38
 syscall() at syscall+0x12b
 --- syscall (number 2) ---
 ddb{0} ps
     PID   PPID   PGRP  UID  S   FLAGS  WAIT         COMMAND
   18032   3768  18032  503  3   0x208  flt_pmfail2  postgres
   28815   3768  28815  503  3   0x208  inode        postgres
   16262   3768  16262  503  2   0x208               postgres
   15301   3768  15301  503  3   0x208  inode        postgres
   16712   3768  16712  503  2   0x208               postgres
    5959   3768   5959  503  3   0x208  flt_pmfail2  postgres
   24166   3768  

Re: UBC?

2010-02-04 Thread Ted Unangst
On Thu, Feb 4, 2010 at 2:21 PM, Jeff Ross jr...@openvistas.net wrote:
 kern.shminfo.shmall=512000
 kern.shminfo.shmmax=76800

Oh, when I said it was safe to crank shmmax I didn't know you'd be
setting the bufcache to huge numbers too.  ;)

 available memory in the server.  I tried setting mine to 980MBs (4GB of ram)
 and that meant setting kern.shminfo.shmmax to 1GB just to get postgres to
 start.  Later I found out that each process on an i386 is limited to 1GB, so
 I backed that off to 600MB.

As I pointed out in that thread, the 1GB limit does not apply to shared
memory.

 What I will try next is setting postgres's shared_buffers to 1/4 of 1GB,
 since that is really the amount of available memory I have per postmaster,
 and dropping kern.shminfo.shmmax accordingly.  I know that will run but not
 terribly fast, unless maybe kicking kern.bufcachepercent back up to 90% will
 help and not cause a panic.

Even so, I'm not sure how you're interpreting these numbers.  shmmax
is total across all processes, not per process.  Your statement seems to
imply a shm setting that's per postmaster, based on dividing total RAM
between them, which would also leave no memory free for your 90%
buffer cache.

The long and short of the problem is that you have a certain amount of
RAM that has to be shared by all the processes' private address
spaces, their shm segments, and the buffer cache.  Obviously this
can't go over 100%.  But on top of it, there's also kernel overhead
(secondary resources) just to keep track of all the previous (primary
resources).  This takes both RAM and kernel address space (out of a
rather restricted pool).  The limits exist not just to constrain how
many resources are used for a particular purpose, but also so there's
an upper bound on the secondary resources used to track the primary
ones.

It's like a rubber band.  Bob's patch lets you stretch it farther in
one direction, but you can't simultaneously stretch in a dozen ways at
once.



Re: xargs -0 and -L

2010-02-04 Thread Ingo Schwarze
Hi Atte,

Atte Peltomäki wrote on Thu, Feb 04, 2010 at 12:48:47PM +0200:
 On Tue, Feb 02, 2010 at 09:32:43PM +0100, Ingo Schwarze wrote:
 Antti Harri wrote on Tue, Feb 02, 2010 at 07:31:57PM +0200:

 xargs' -L switch isn't working when using -0 flag.

 After checking POSIX.1 (2008), i conclude that our implementation and
 manual are correct in this respect.  The -L option is concerned
 with lines of arguments from standard input.  ASCII nul characters
 do not delimit lines.

[ snipped the Linux description of -print0, we agree on that ]

 Using -0 for xargs simply means "use \0 where \n or whitespace would
 otherwise be used".

That's horribly imprecise.  Rather, it means interpret NUL and only NUL
as the argument separator.  I do not think that -0 should change the
meaning of the term line.  NUL is not a line separator, and -0 does
not make it one.  The effect of find(1) -print0 is to put all names on
the same line, separated by NUL characters, whereas find(1) -print puts
each name on its own line.

 This also has nothing to do with POSIX compliance, since using -0
 in the first place breaks compliance.

No doubt, -0 is an _extension_ to POSIX and XPG (it doesn't break it,
that makes a difference).

When implementing the -0 extension, we must take care not to violate
completely unrelated parts of the specification.  In particular, i fail
to see the point in giving the XPG standard options -I and -L
non-standard meanings just because the non-standard -0 option happens
to be in effect, too.

 Tested also on OS X and Linux and they print two lines with -0.

 So you might wish to file bug reports with these operating systems.

 I suggest OpenBSD rather change their -0 semantics to match those of
 every other vendor which implements -0 in xargs.

Nobody is discussing -0 semantics, it's -I and -L semantics that are at
stake.  And it looks like everyone else broke -L in their code.

http://www.freebsd.org/cgi/cvsweb.cgi/src/usr.bin/xargs/xargs.c.diff?r1=1.55;r2=1.56

http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/xargs/xargs.c?only_with_tag=MAIN#rev1.16

So, in 2005, FreeBSD introduced the same bug you report in Linux,
and the commit message doesn't really give a clue what the point is:

  so that 'xargs -0 -I []' will do something sensible
   (namely: treat then '\0' as the EOL character...

I don't really understand that:
Philip already pointed out the difference between

   # XPG
  schwa...@rhea $ printf 'a b' | xargs -L1
  a b
  schwa...@rhea $ printf 'a b' | xargs -n1
  a
  b

   # extension
  schwa...@rhea $ printf 'a\0b' | xargs -0 -L1
  a b
  schwa...@rhea $ printf 'a\0b' | xargs -0 -n1
  a
  b

where -n does in a standard way what other systems apparently
abuse -L for.  An analogous argument applies to -I:

   # extension
  schwa...@rhea $ printf 'a b' | xargs -I % echo % x %
  a b x a b
  schwa...@rhea $ printf 'a b' | xargs -n1 -I % echo % x %
  a x a
  b x b

   # extension
  schwa...@rhea $ printf 'a\0b' | xargs -0 -I % echo % x %
  a b x a b
  schwa...@rhea $ printf 'a\0b' | xargs -0 -n1 -I % echo % x %
  a x a
  b x b

I admit that's all an extension to XPG, since XPG does not require
xargs -I to handle more than one input argument per input line,
but handling it doesn't violate XPG and it's the traditional behaviour
in FreeBSD since 2002 when -I was first ported from xMach.

I call it *useful* that -I and -n1 -I do different things.
It is a nuisance that on other systems, -0 -I
apparently now does the same as -0 -n1 -I.

Finally, in 2007, NetBSD copied the -L bug from FreeBSD.

What a mess...  :-(

Yours,
  Ingo



Re: UBC?

2010-02-04 Thread Anton Maksimenkov
2010/2/5 Ted Unangst ted.unan...@gmail.com:
 On Thu, Feb 4, 2010 at 2:21 PM, Jeff Ross jr...@openvistas.net wrote:
 kern.shminfo.shmall=512000
 kern.shminfo.shmmax=76800

 Oh, when I said it was safe to crank shmmax I didn't know you'd be
 setting the bufcache to huge numbers too.  ;)

Furthermore, the postgres documentation recommends not setting shared
buffers to big values, because postgres itself depends on the buffer
cache (it assumes that the buffer cache is big).

Jeff, since you can set buffer cache to about 90% of RAM then you can
set chared buffers to something not so big... say, 256Mb or so (I just
show the way, not exact values), and decrease shmall/shmmax. And set
postgres parameter effective_cache_size to estimated size of your
buffer cache.
After that postgres will actively use your buffer cache, I suppose.
And then you will show some results to us, right?
It would be interesting to see if all this confirms what the
documentation says :-).
--
antonvm



Re: UBC?

2010-02-04 Thread Anton Maksimenkov
2010/2/5 Anton Maksimenkov anton...@gmail.com:
 set chared buffers to something not so big... say, 256Mb or so (I just
My typo. Of course it's shared, not chared. Sorry.
-- 
antonvm