[tools-compilers] [perf-discuss] Application runs almost 2x slower on Nevada than Linux

2008-05-01 Thread johan...@sun.com
Yeah, I did some digging when I had a free moment.  The following is
the most germane to your issue.

5070823 poor malloc() performance for small byte sizes

-j

On Thu, May 01, 2008 at 05:36:26PM -0400, Matty wrote:
> We are building our application as a 32-bit entity on both Linux and
> Solaris, so our comparison should be apples to apples. Does anyone happen
> to know what the bug id of the small malloc issue is? I searched the
> opensolaris bug database, but wasn't able to dig this up.
> 
> Thanks,
> - Ryan

[tools-compilers] [perf-discuss] Application runs almost 2x slower on Nevada than Linux

2008-05-01 Thread johan...@sun.com
Part of the problem is that these allocations are very small:

# dtrace -n 'pid$target::malloc:entry { @a["allocsz"] = quantize(arg0); }' -c /tmp/xml

  allocsz   
           value  ------------- Distribution ------------- count
   1 | 0
   2 | 30   
   4 |@475  
   8 |@@   166  
  16 |@4300015  
  32 |@@@  2406 
  64 | 21   
 128 | 41   
 256 | 10   
 512 | 0
1024 | 10   
2048 | 10   
4096 | 0
8192 | 10   
   16384 | 0

After seeing this, I took a look at the exact breakdown of the
allocation sizes:

# dtrace -n 'pid$target::malloc:entry { @a[arg0] = count(); }' -c /tmp/xml

      12        1
      96        1
     200        1
      21       10
      43       10
      44       10
      51       10
      61       10
      75       10
      88       10
     128       10
     147       10
     181       10
     220       10
     440       10
    1024       10
    2048       10
    8194       10
       8       11
      52       11
       6       12
      36       14
      24       15
      33       20
       4       21
      17       21
       9       23
       3       30
      10       30
      13       30
      14       30
      25       30
      28       40
      11       41
      20       79
      40       90
       5       91
      16      250
       7      351
      48      381
      60     1850

The most frequent malloc call is to allocate 60 bytes.  I believe that
we have a known issue with small mallocs on Solaris.  There's a bug open
for this somewhere; however, I can't find its number at the moment.
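
If you want to gauge how much of the total runtime the allocator itself
accounts for, something along these lines should work (an untested sketch;
the "ns in malloc" label is just mine, vtimestamp only counts on-CPU time,
and the pid provider adds a bit of probe overhead of its own):

# dtrace -n 'pid$target::malloc:entry { self->ts = vtimestamp; }
  pid$target::malloc:return /self->ts/ {
      /* accumulate nanoseconds spent inside malloc() itself */
      @["ns in malloc"] = sum(vtimestamp - self->ts); self->ts = 0; }' -c /tmp/xml

Comparing that total against what time(1) reports would show how much of the
gap is really the allocator rather than libxml2 proper.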

Another issue you may have run into is 32-bit versus 64-bit compilation.
I was able to shave about 10 seconds off my runtime by compiling your
testcase as a 64-bit app instead of a 32-bit one:

$ gcc -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c
$ file xml
xml:            ELF 32-bit LSB executable 80386 Version 1 [FPU], dynamically
linked, not stripped, no debugging information available
$ ./xml
10 iter in 22.749836 sec

versus:

$ gcc -m64 -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c
$ file xml
xml:            ELF 64-bit LSB executable AMD64 Version 1, dynamically linked,
not stripped, no debugging information available
$ ./xml
10 iter in 13.785916 sec


-j


On Wed, Apr 30, 2008 at 06:44:31PM -0400, Matty wrote:
> On Wed, Apr 30, 2008 at 6:26 PM, David Lutz  wrote:
> > If your application is single threaded, you could try using the
> >  bsdmalloc library.  This is a fast malloc, but it is not multi-thread
> >  safe and will also tend to use more memory than the default
> >  malloc.  For  a comparison of different malloc libraries, look
> >  at the NOTES section at the end of umem_alloc(3MALLOC).
> >
> >  I got the following result with your example code:
> >
> >
> >  $ gcc -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c
> >  $ ./xml
> >  10 iter in 21.445672 sec
> >  $
> >  $ gcc -O3 -o xml `/usr/bin/xml2-config --libs --cflags` xml.c -lbsdmalloc
> >  $ ./xml
> >  10 iter in 12.761969 sec
> >  $
> >
> >  I got similar results using Sun Studio 12.
> >
> >  Again, bsdmalloc is not multi-thread safe, so use it with caution.
> 
> Thanks David. Does anyone happen to know why the memory allocation
> libraries in Solaris are so much slower than their Linux counterparts? If
> the various malloc implementations were a second or two slower, I could
> understand. But they appear to be 10 - 1

[tools-compilers] [perf-discuss] Application runs almost 2x slower on Nevada than Linux

2008-06-24 Thread johan...@sun.com
> Could you try an experiment and compile your sources with
> /usr/lib/libast.so.1 (you need to compile the sources with
> -I/usr/include/ast before /usr/include/, since libast uses a different
> symbol namespace and cannot be used to "intercept" other
> malloc()/free() calls the way libbsdmalloc can)? AFAIK that should cure
> the performance problem (assuming the library functions which heavily
> use malloc()/free() are compiled against libast headers, too) ...

What would this experiment prove?  The test program links to libxml2,
libz, libm, and libc.  It seems like it would be a lot of work to
re-compile all of these libraries against libast headers just for
amusement.  In the case of libc, I'm not even sure it's possible.

If libast has its own memory allocator, it would make a lot more sense
to move that functionality into an interposition library like we already
do for libumem, libmapmalloc, libwatchmalloc, libmtmalloc, and
libbsdmalloc.
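
For what it's worth, trying one of those allocators against the test program
doesn't require relinking anything; something like the following should do it
(I haven't re-run the numbers this way, and the library names assume a stock
Nevada install):

$ LD_PRELOAD=libumem.so.1 ./xml        # libumem's malloc/free instead of libc's
$ LD_PRELOAD=libmtmalloc.so.1 ./xml    # same idea with mtmalloc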

-j




[tools-compilers] [perf-discuss] Application runs almost 2x slower on Nevada than Linux

2008-04-28 Thread johan...@sun.com
Hey Dude,
I pulled down a copy of your test program and ran a few experiments.

$ time ./xml
10 iter in 22.715982 sec

real0m22.721s
user0m22.694s
sys 0m0.007s

This indicates that essentially all of the time is being spent in user mode,
so whatever is slower here than on Linux, it isn't time spent in the operating
system.

Just to double-check, I used the following DTrace invocation to look at the
time this application spends in syscalls:

# dtrace -n 'syscall:::entry /execname == "xml"/
      { self->traced = timestamp; self->func = probefunc; }
  syscall:::return /self->traced/
      { @a[self->func] = sum(timestamp - self->traced); @b[self->func] = count();
        self->func = (char *)0; self->traced = 0; }'

(This returns the time xml spends in syscalls, as well as the number of
syscalls made.)

Time (nanoseconds):

  getpid           1978
  sysconfig        2229
  sigpending       3071
  sysi86           3529
  getrlimit        4234
  setcontext       5763
  fstat64          7042
  close           16606
  write           19898
  getcwd          21302
  ioctl           23668
  munmap          25757
  brk             28445
  open            40712
  resolvepath     51777
  xstat           57616
  mmap           159275
  memcntl        267070

Number of invocations:

  fstat64             1
  getcwd              1
  getpid              1
  getrlimit           1
  ioctl               1
  sigpending          1
  sysconfig           1
  sysi86              1
  write               1
  setcontext          2
  memcntl             6
  close               7
  munmap              8
  open                8
  resolvepath         9
  xstat               9
  brk                10
  mmap               32

This agrees with the output from time(1).  You're spending a minuscule amount
of system time performing memory operations.

The next place to look would be at libxml2.  What library versions do
you have installed on the different machines?  If the library versions
don't match, or the library was built with a different set of options
or a different compiler on each machine, you'll get different results
for the benchmark.
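
A quick way to check on each box (assuming xml2-config is in your PATH) is:

$ xml2-config --version             # version of the installed libxml2
$ ldd ./xml | grep libxml2          # which copy the binary actually loads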

I ran my test on snv_84.  It has libxml2 2.6.23 installed.

Another approach would be to use DTrace's profile provider.  I obtained
usable results by performing the following:

# dtrace -n 'profile-1001 /execname == "xml"/ { @a[ustack()] = count(); }
  END { trunc(@a, 20); }' -c /tmp/xml

This returns the 20 most frequent stack traces (and the number of times
they occurred) when DTrace fires a profile probe at 1001 Hz.

xmlDictLookup shows up as a popular function, but that may be a red
herring.  I'll include the top 5 here, but you can run this command, or adjust
the truncation to view any amount you'd like:

  libc.so.1`_free_unlocked+0x4c
  libc.so.1`free+0x2b
  libxml2.so.2`xmlFreeNodeList+0x19e
  libxml2.so.2`xmlFreeNodeList+0xb6
  libxml2.so.2`xmlFreeNodeList+0xb6
  libxml2.so.2`xmlFreeDoc+0xac
  xml`main+0x55
  xml`_start+0x80
   61

  libxml2.so.2`xmlDictLookup+0x60
  libxml2.so.2`xmlParseNCName+0xab
  libxml2.so.2`xmlParseQName+0x44
  libxml2.so.2`xmlParseAttribute2+0x54
  libxml2.so.2`xmlParseStartTag2+0x246