On Wed, 14 Nov 2012, Chris Rees wrote:

> On 14 Nov 2012 18:49, "Konstantin Belousov" <kostik...@gmail.com> wrote:
>
>> On Wed, Nov 14, 2012 at 09:28:23AM -0800, David O'Brien wrote:
>>
>>> On Thu, Oct 25, 2012 at 11:18:06PM +0000, Simon J. Gerraty wrote:
>>>
>>>> Log:
>>>>   Merge bmake-20121010
>>>
>>> Hi Simon,
>>> I was kicking the tires on this and noticed bmake is dynamically linked.
>>>
>>> Can you change it to being statically linked?
>>>
>>> This issue most recently came up in freebsd-current.  See thread pieces
>>>
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033460.html
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033472.html
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033473.html
>>
>> As you see, I prefer not to introduce new statically linked binaries
>> into base.  If, by an unfortunate turn of events, bmake is changed to
>> be statically linked, please obey WITH_SHARED_TOOLCHAIN.
>
> Or a /rescue/bmake for when speed is a concern would also be acceptable.

Yes, the big rescue executable is probably even better than dynamic linkage
for pessimizing speeds.  Sizes on freefall now:

%    text          data     bss     dec     hex filename
%  130265          1988    9992  142245   22ba5 /bin/sh
% 5256762        133964 2220464 7611190  742336 /rescue/sh
% -r--r--r--  1 root  wheel  3738610 Nov 11 06:48 /usr/lib/libc.a

The dynamically linked /bin/sh is deceptively small, although it is larger
than the statically linked /bin/sh in FreeBSD-1 for a few new features.
When executed, it expands to 16.5MB virtual with 10MB RSS.  I don't know
how much of that is malloc bloat that wouldn't need to be copied on fork,
but it is a lot just to map.  /rescue/sh starts at 5MB and expands to
15.5MB with 9.25MB RSS when executed.  So it is slightly smaller, and its
slowness is determined by its non-locality.  Perhaps its non-locality is
not as good for pessimization as libc's.
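
(Virtual and resident sizes like these can be read straight from ps; e.g.,

    /bin/sh -c 'ps -o vsz,rss -p $$'

shows the virtual size and RSS of a freshly started sh.)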

I don't use dynamic linkage of course.  /bin/sh is bloated by static
linkage (or rather libc) in the FreeBSD-~5.2 that I usually run:

   text    data     bss     dec     hex filename
 649623    8192   64056  721871   b03cf /bin/sh

but this "only" expands to 864K with 580K RSS when executed.  This can be
forked a little faster than 10MB RSS.   In practice the timings for

    time whatever/sh -c 'for i in $(jot 1000 1); do echo -n; done'

are:

    freefall /bin/sh:    6.93 real 1.69 user 5.16 sys
    freefall /rescue/sh: 6.86 real 1.65 user 5.13 sys
    local    /bin/sh:    0.21 real 0.01 user 0.18 sys

freefall:
FreeBSD 10.0-CURRENT #4 r242881M: Sun Nov 11 05:30:05 UTC 2012
    r...@freefall.freebsd.org:/usr/obj/usr/src/sys/FREEFALL amd64
CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz (2666.82-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x206c2  Family = 0x6  Model = 0x2c  Stepping = 
2

local:
FreeBSD 5.2-CURRENT #4395: Sun Apr  8 12:15:03 EST 2012
    b...@besplex.bde.org:/c/obj/usr/src/sys/compile/BESPLEX.fw
...
CPU: AMD Athlon(tm) 64 Processor 3200+ (2010.05-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0xf48  Stepping = 8

freefall may be pessimized by INVARIANTS.  It is pessimized by /bin/echo
being dynamically linked.  Normally shells use builtin echo so the speed
of /bin/echo is unimportant.  There is also some strangeness in the timing
for /bin/echo specifically.  Changing 'echo -n' to
'/bin/rm -f /etc/nonesuch' or /usr/bin/true reduces the times on freefall
by almost a factor of 2, although rm is larger and has to do more:

freefall:
   text    data     bss     dec     hex filename
   2661     540       8    3209     c89 /bin/echo
  11026     884     152   12062    2f1e /bin/rm
   1420     484       8    1912     778 /usr/bin/true
(all dynamically linked to libc only.  truss verifies that rm does a little
more).
    freefall /bin/sh    echo: 6.93 real 1.69 user 5.16 sys
    freefall /bin/sh    rm:   3.83 real 0.91 user 2.84 sys
    freefall /bin/sh    true: 3.68 real 0.75 user 2.85 sys
    freefall /rescue/sh echo: 6.86 real 1.65 user 5.13 sys
    freefall /rescue/sh rm:   3.69 real 0.83 user 2.78 sys
    freefall /rescue/sh true: 3.67 real 0.85 user 2.74 sys
    local    /bin/sh    echo: 0.21 real 0.01 user 0.18 sys
    local    /bin/sh    rm:   0.22 real 0.02 user 0.19 sys
    local    /bin/sh    true: 0.18 real 0.01 user 0.17 sys
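
(For concreteness, the rm variant of the loop is the same command as
above with only the body replaced:

    time whatever/sh -c 'for i in $(jot 1000 1); do /bin/rm -f /etc/nonesuch; done'

and similarly for true.)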
local:
   text    data     bss     dec     hex filename
  11926      60     768   12754    31d2 /bin/echo
 380758    6752   61772  449282   6db02 /bin/rm
   1639      40     604    2283     8eb /usr/bin/true
(all statically linked; I managed to debloat crtso and libc enough for
/usr/bin/true to be small).  The sources for /bin/echo are excessively
optimized for space in the executable -- they have contortions to avoid
using printf.  But this is useless in -current, since crtso and libc
drag in printf, so that the null program int main(){} has size:

freefall (amd64):
   text    data     bss     dec     hex filename
 316370   12156   55184  383710   5dade null-static
   1452     484       8    1944     798 null-dynamic
local (i386):
   text    data     bss     dec     hex filename
   1490      40     604    2134     856 null-static
   1203     208      32    1443     5a3 null-dynamic
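
The two versions are built from the same int main(){}, with and without
-static, e.g.:

    cc -static -o null-static null.c
    cc -o null-dynamic null.c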

Putting this null program in the jot loop gives a truer indication of the
cost of a statically linked shell:

    freefall /bin/sh    null-static:  6.36 real 1.51 user 4.45 sys
    freefall /bin/sh    null-dynamic: 3.92 real 0.85 user 2.71 sys
    local    /bin/sh    null-static:  0.18 real 0.00 user 0.18 sys
    local    /bin/sh    null-dynamic: 0.58 real 0.09 user 0.49 sys

The last 2 lines show the expected large cost of dynamic linkage for
a small program (3 times slower), but the freefall lines show strangeness
-- static linkage is almost twice as slow, and almost as slow as
/bin/echo -n.  So to get a truer indication of the cost of a statically
linked shell, test with my favourite small program:

%%%
#include <sys/syscall.h>

        .globl  _start
_start:
        movl    $SYS_sync,%eax  # sync(2): flush filesystem buffers
        int     $0x80
        pushl   $0              # only to look like a sync library call (?)
        pushl   $0              # kernel expects a return-address slot here
        movl    $SYS_exit,%eax
        int     $0x80           # _exit(0); does not return
%%%

This is my sync.S source file for sync(1) on x86 (it must be built on
i386, using cc -o sync sync.S -nostdlib).

local:
   text    data     bss     dec     hex filename
     18       0       0      18      12 sync

It does the same amount of error checking as /usr/src/bin/sync.c (none),
which compiles to:

freefall:
   text    data     bss     dec     hex filename
 316330   12092   55184  383606   5da76 sync-static
   1503     492       8    2003     7d3 sync-dynamic

Putting this in the jot loop gives:

    local    /bin/sh    sync: 0.65 real 0.01 user 0.63 sys

but since sync is a heavyweight operation and I don't want to exercise
freefall's disks, remove the syscall from the program, so that it just
does _exit(0):

   text    data     bss     dec     hex filename
     11       0       0      11       b syncfree-sync

    freefall /bin/sh    syncfree-sync: 0.29 real 0.01 user 0.11 sys
    local    /bin/sh    syncfree-sync: 0.17 real 0.00 user 0.17 sys
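
For reference, the syscall-free variant is just sync.S with the sync
syscall deleted, something like:

%%%
#include <sys/syscall.h>

        .globl  _start
_start:
        pushl   $0              # exit status
        pushl   $0              # dummy return-address slot
        movl    $SYS_exit,%eax
        int     $0x80           # _exit(0)
%%%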

This shows that most of freefall's enormous slowness is for execing
its bloated executables, perhaps especially when they are on nfs
(oops).  Another test of null-static after copying it to /tmp shows
that nfs makes little difference.  However, syncfree-sync is much
faster when copied to /tmp (<= 0.08 seconds real; this test was not
done directly, but the result is read off from a later one).

Next, try bloating syncfree-sync with padding to the same size as
null-static:

%%%
#include <sys/syscall.h>

        .text
        .globl  _start
_start:
        pushl   $0              # exit status
        pushl   $0              # dummy return-address slot
        movl    $SYS_exit,%eax
        int     $0x80           # _exit(0)
        .space  316370-11      # pad text to null-static's text size
        .data
        .space  12156           # pad data to null-static's data size
        .bss
        .space  55184           # pad bss to null-static's bss size
%%%
   text    data     bss     dec     hex filename
 316370   12156   55184  383710   5dade bloated-syncfree-sync

    freefall /bin/sh bloated-syncfree-sync: 0.08 real 0.00 user 0.08 sys (zfs)
    freefall /bin/sh bloated-syncfree-sync: 0.30 real 0.00 user 0.13 sys (nfs)
    local    /bin/sh bloated-syncfree-sync: 0.21 real 0.00 user 0.21 sys (ffs)

This shows that the kernel is still quite fast and that the enormous
slowness on freefall is mainly in crtso.  I blame malloc() for this.
malloc() first increases the size of a null statically linked program
from ~1K text to 310K text.  Then it increases the startup time by a
factor of 50 or so.  For small utilities like echo and rm, the increases
are similar.  A small utility only needs to allocate about 8K of data
(for stdio buffers).  Since execing bloated-syncfree-sync is fast, a
small utility could do this allocation a few thousand times in the time
that crtso now takes to start up.  (The 300+K of padding only gives
enough for statically allocating 40 x 8K; expanding the padding by a
factor of 50 might slow down the exec to the crtso time, but gives
2000 x 8K.)  Of course, actually using the allocated areas will slow
down both the statically allocated and the dynamically allocated cases
a lot.

More tests with a large program on small data (put 'cc -c null.c' in
the jot loop, where null.c is int main(){}):

    freefall /bin/sh clang: 22.53 real  6.35 user 12.15 sys (nfs)
    freefall /bin/sh   gcc: 35.28 real 13.14 user 17.45 sys (nfs)
    local    /bin/sh    cc: 17.50 real  6.72 user  2.64 sys (ffs)

The crtso slowness seems to be very significant even here.  Assume that
it is 6 seconds per 1000 execs (6 ms per exec).  clang is monolithic and
does only 1 exec per cc -c.  gcc is a small driver program that execs
cc1 and as (it used to exec a separate cpp too).  So gcc does 3 execs
per cc -c, and 2 x 6 = 12 seconds extra for the 2 extra execs accounts
almost exactly for clang being 12.75 seconds faster.

The `local' time apparently shows a large accounting bug.  Actually, it
is because I left a shell loop for testing this running in the background.
All the other 'local' times are not much affected by this, since the
background loop has low priority, and scheduling works so that it is
rarely run in competition with the tiny programs in the other tests.
But here the cc's compete with it significantly.  After fixing this
and also running the freefall tests on zfs:

    freefall /bin/sh clang: 19.69 real  6.74 user 12.82 sys (zfs)
    freefall /bin/sh   gcc: 28.51 real 12.75 user 15.47 sys (zfs, gcc-4.2.1)
    local    /bin/sh    cc:  8.95 real  6.17 user  2.74 sys (ffs, gcc-3.3.3)

gcc-4.2.1 is only 35% slower than gcc-3.3.3 on larger source files when it
is run locally:

    local /bin/sh gcc: 120.1 real 112.4  user 7.4 sys (ffs, gcc-3.3.3 -O1 -S)
    local /bin/sh gcc: 164.6 real 155.8  user 8.1 sys (ffs, gcc-3.3.3 -O2 -S)
    local /bin/sh gcc: 161.9 real 148.0  user 8.1 sys (ffs, gcc-4.2.1 -O1 -S)
    local /bin/sh gcc: 202.4 real 193.6  user 8.0 sys (ffs, gcc-4.2.1 -O2 -S)

Maybe malloc() would be faster with MALLOC_PRODUCTION.  I use
/etc/malloc.conf -> aj locally.  freefall doesn't have /etc/malloc.conf.
MALLOC_OPTIONS no longer works, and MALLOC_CONF is too large for me to
understand, so I don't know how to turn off non-production features
dynamically.
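
If I had to guess from the jemalloc manpage, something like

    env MALLOC_CONF=junk:false,redzone:false,quarantine:0 ./null-static

might approximate a production malloc, but I have not verified that
these are the right knobs.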

Bruce