On Saturday 10 January 2009 10:10:06 Denys Vlasenko wrote:
> On Monday 05 January 2009 23:59, Rob Landley wrote:
> > > It still destroys L1 dcache IIRC.
> >
> > You go out of L1 cache every time you touch a new cache line.  Fetching
> > that back from L2 is cheap.  If it destroyed the L2 cache, then I'd
> > worry.
>
> You effectively are saying L1 is worthless.
No, I'm saying that flushing L1 cache every 64k instructions or so isn't the end of the world.

Modern pipelined architectures are vulnerable to bubbles in the pipeline, which is why they do such aggressive instruction reordering and branch prediction, and include tricks like conditional assignments to avoid jumps.  L1 is there primarily to keep bubbles out of pipelines, and to optimize inner loops.  But the size of the thing is (on most embedded processors) maybe 16k combined instruction and data, and even a monster like the Core 2 Duo (with its 2 megabytes of L2 cache) only has 32k instruction and 32k data L1 cache.

Often a program's working set won't fit in L1 cache, and it's not trying to.  It's there so an individual _function's_ working set (well, data and stack, maybe not heap) fits in there.  Fetching data from L2 cache often has the same throughput as L1 cache on modern processors, but it has a few clock cycles of latency (on the Core 2 Duo, 14 cycles).  If you had to deal with that latency for every access you'd be in a world of hurt, but having to deal with it every few hundred cycles?  That's NORMAL.

Even in the nofork case we parse a line of shell script, digest the command line into argv[] and argc (dequoting and substituting environment variables), and then do a binary search through applet names to find the appropriate function to call; our L1 cache is probably full of _that_ stuff, so the new function is going to have to fault things in from L2 anyway.

I suspect that the glibc version of sprintf is larger than the L1 instruction cache, so the argument here amounts to implementing extensive infrastructure to avoid something about as expensive as an sprintf call.  (Show me benchmark numbers to prove otherwise.)

> > Cloning existing page tables doesn't require an mmu flush, nor does
> > switching between processes with existing page tables.  Slowaris invented
> > threading so they could avoid an mmu flush when switching between threads
> > with identical page tables; but Linux is smarter about page table
> > switching in general and only flushes what it needs to when switching
> > process contexts.  (Hence Linux processes being cheaper than solaris
> > threads.)  That's why the kernel area is mapped (but masked) into all
> > process address spaces, otherwise you'd need an MMU reload to make a
> > system call.
> >
> > My frame of reference here is that Ingo Molnar managed 2 million
> > fork/exit pairs per second on a threading benchmark five years ago,
>
> Ingo was creating _threads_ (processes with shared memory map).
> fork() is creating separate memory map.

Linux threads and processes are more or less the same thing:

http://tldp.org/FAQ/Threads-FAQ/Comparison.html

They boil down to the same system call with different flags.  The behavior difference is A) whether or not the heap is shared by default (the stack isn't, and you can allocate shared memory between any processes even after the fact; that's what /dev/shm is for), and B) things like process group membership, which are irrelevant in this context.

> Considering that typical L1 cache is 64k == 16 pages,

Define "typical".  The Cortex-A8 is a pretty high-end fire breathing arm chip, and it has 16k data and 16k instruction.

> and kernel stack requires two pages,

On a 64 bit host, sure.  On a 32 bit host 4k stacks have been a config option for years:

http://lwn.net/Articles/150580/

And the idea of making it the default even on 64 bit machines continues to crop up:

http://lwn.net/Articles/279229/

But the main kernel developers largely seem to have lost interest since most of them moved their workstations to 64 bit machines.  I note that the several year old 2 million/second figure quoted above was definitely allocating a new stack for each instance, because that's what threads _do_.
I also note there are some fun games the kernel can play with per-cpu stacks instead of per-process stacks, since the kernel stack is only needed when you're handling an interrupt or a syscall.  (I believe they actually have interrupt stacks in the per-processor data now.)  However, you need to allocate and initialize a task_struct for each process anyway, and if you're spending a page on that anyway you might as well use the rest of it for stack, which is what the kernel does with 4k stacks.

> page tables require about 6 pages
> minimum, new process touches its own userspace stack and usually a data
> page or two...
>
> I don't have this data handy now, but I instrumented the kernel
> to count page clearing ops sometime ago.
> I definitely saw more page clears per fork than L1 size.

This would _be_ flushing the L1 cache, yes?  I already agreed it does this, I just said it wasn't a big deal.  Again, did you benchmark this in the context of an actual shell script?

> > and the Linux guys
> > have made it a point of pride that threads and processes are essentially
> > the same thing on Linux and perform the same.
>
> I am not saying that processes in Linux are especially heavy.
> I think that both fork and exec syscalls have approximately
> the same cost (for small executables) - cost of losing L1
> contents.
Ok, _I'll_ benchmark it:

land...@driftwood:~$ cat temp.c
#include <stdio.h>
#include <sys/timeb.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  struct timeb tp, tp2;
  int temp, i;

  ftime(&tp);
  for (i = 0; i < 10000; i++) {
    if (fork()) wait(&temp);
    else {
      if (argc == 1) return 0;
      execlp("true", "true", (void *)0);
    }
  }
  ftime(&tp2);
  printf("time1=%lu %hu time2=%lu %hu\n", tp.time, tp.millitm,
    tp2.time, tp2.millitm);
}
land...@driftwood:~$ gcc temp.c
land...@driftwood:~$ ./a.out
time1=1231644395 179 time2=1231644396 754
land...@driftwood:~$ ./a.out woot
time1=1231644402 187 time2=1231644422 666

So 10,000 fork/exit pairs (where I actually wait for the return value, which is something the benchmarks I mentioned earlier didn't do, so there are two process context switches for each one) take just over a second and a half on my laptop.  Add in an exec of "true" (which just exits immediately) and it's just under 20 and a half seconds.

So the fork isn't _free_ (it's about 100 times as expensive as calling ftime()), but avoiding exec (with a path search and dynamic linking) of even a small program like "true" saves over 10 times as much as avoiding the fork.  My earlier point, that doing the fork() but avoiding the exec() saves you over 90% of the work for less than 10% of the effort, would appear to be borne out by this benchmark.

> > Have you benchmarked the difference?
>
> No, I didn't benchmark what is better - "fork but no exec"
> or "no fork but exec".  It is not that interesting, considering
> that we can only use "fork + no exec" (NOEXEC applets),
> and "no fork, and no exec" (NOFORK applets).
>
> IOW: at no time in busybox we need to decide what is less expensive
> - to fork or to exec?

So I suggest you're optimizing for the wrong things, and you tell me that's not what you're optimizing for.  I agree with you, it's just not the point I'm trying to make.

> And lastly, all applets except ash use vfork (on NOMMU;
> on MMU a few still use fork because it makes them simpler).
I agree that a nommu system can't avoid doing exec(), yes.  This particular optimization choice is not available to nommu systems.

> --
> vda

Rob
_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox
