Re: The optimization of malloc(3): FreeBSD vs GNU libc

2006-08-16 Thread Intron is my Internet alias

Jason Evans wrote:


李鑫 (LI Xin) wrote:

On Tue, 2006-08-15 at 02:38 +0300, Vladimir Kushnir wrote:

On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):

With jemalloc (without MY_MALLOC):
  ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
116.34 real   113.69 user 0.00 sys

With MY_MALLOC:
  ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
 45.30 real    44.29 user 0.00 sys


Have you turned off the debugging options, i.e. ln -sf
'aj' /etc/malloc.conf?


If you want to do a fair comparison, you should also define NO_MALLOC_EXTRAS 
when compiling malloc.c, in order to turn off the copious assertions, not 
to mention the statistics gathering code.


Before you do that though, it would be useful to turn statistics reporting 
on (add MALLOC_OPTIONS=P to your environment when running the test 
program) and see what malloc says is going on.


[I am away from my -current system at the moment, so can't benchmark the 
program.]  If I understand the code correctly (and assuming the command 
line parameters specified), it starts out by requesting 3517 2000-byte 
allocations from malloc, and never frees any of those allocations.


You're right. My friend's program evaluates an electromagnetic field
using the Finite-Difference Time-Domain (FDTD) method.



Both phkmalloc and jemalloc will fit two allocations in each page of 
memory.  phkmalloc will call sbrk() at least 1759 times.  jemalloc will 
call sbrk() approximately 6 times.  2kB allocations are a worst case for 
some of jemalloc's internal bookkeeping, but this shouldn't be a serious 
problem.  Fragmentation for these 2000-byte allocations should total 
approximately 6%.


The same bad case arises with mmap(2) if mmap(2) is used to obtain a small
memory block each time. A hierarchical memory management mechanism is
required, just like those of GNU libc and your new code.

The essence of this problem is that the operating system's memory management
can greatly affect how efficiently the CPU hardware works.

Actually, not only my friend's program can cause the problem, but also
the many applications that use strdup(3) frequently.

/usr/src/lib/libc/string/strdup.c:

char *
strdup(str)
	const char *str;
{
	size_t len;
	char *copy;

	len = strlen(str) + 1;
	if ((copy = malloc(len)) == NULL)
		return (NULL);
	memcpy(copy, str, len);
	return (copy);
}



malloc certainly incurs more overhead than a specialized sbrk()-based 
allocator, but I don't see any particular reason that jemalloc should 
perform suboptimally, as compared to any complete malloc implementation, 
for fdtd.  If you find otherwise, please tell me so that I can look into 
the issue.


Thanks,
Jason
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]




   From Beijing, China



Re: The optimization of malloc(3): FreeBSD vs GNU libc

2006-08-15 Thread Intron

Brooks Davis wrote:


On Tue, Aug 15, 2006 at 07:10:47AM +0800, Intron wrote:

One day, a friend told me that his program was 3 times slower under
FreeBSD 6.1 than under GNU/Linux (from Redhat 7.2 to Fedora Core 5).
I was astonished by the real repeatable performance difference on
AMD Athlon XP 2500+ (1.8GHz, 512KB L2 Cache).

After hacking, I found that the problem is nested in malloc(3) of
FreeBSD libc.

Download the testing program: http://ftp.intron.ac/tmp/fdtd.tar.bz2

You may try to compile the program WITHOUT the macro MY_MALLOC
defined (in Makefile) to use malloc(3) provided by FreeBSD 6.1.
Then, time the running of the binary (on Athlon XP 2500+):

#/usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
   165.24 real   164.19 user 0.02 sys

Please try to recompile the program (Remember to make clean)
WITH the macro MY_MALLOC defined (in Makefile) to use my own
simple implementation of malloc(3) (i.e. my_malloc() in cal.c).
And time the running again:

#/usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
    50.41 real    49.95 user 0.04 sys

You may repeat this testing again and again.

I guess this kind of performance difference comes from:

1. His program uses malloc(3) to obtain so many small memory blocks.

2. In this case, FreeBSD malloc(3) obtains small memory blocks from
   the kernel and passes them to the application.

   But malloc(3) of GNU libc obtains large memory blocks from the kernel,
   then splits them and reallocates them to the application in small blocks.

   You may verify my judgement with truss(1).

3. The way of FreeBSD malloc(3) makes VM page mapping too chaotic, which
   reduces the efficiency of CPU L2 Cache. In contrast, my my_malloc()
   simulates the behavior of GNU libc malloc(3) partially and avoids
   the over-chaos.

Callgrind is broken under FreeBSD, or I will verify my guess with it.

I have also verified the program on Intel Pentium 4 511 (2.8GHz, 1MB
L2 cache, running FreeBSD 6.1 i386 though this CPU supports EM64T)

/usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
  185.30 real   184.28 user 0.02 sys

/usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
   36.31 real    35.94 user 0.03 sys

NOTE: you probably cannot see the performance difference on CPU with
   small L2 cache such as Intel Celeron 1.7GHz with 128 KB L2 Cache.


In CURRENT we've replaced phkmalloc with jemalloc.  It would be useful
to see how this benchmark performs with that.  I believe it does similar
things.

-- Brooks


You're right.

Now with truss(1) I can see that malloc(3) on 7.0-CURRENT (4 days ago)
calls brk(2) to obtain 2MB each time. I will continue my testing.


   From Beijing, China



Re: The optimization of malloc(3): FreeBSD vs GNU libc

2006-08-15 Thread Intron

Vladimir Kushnir wrote:


Sorry for intrusion.

On Mon, 14 Aug 2006, Brooks Davis wrote:


On Tue, Aug 15, 2006 at 07:10:47AM +0800, Intron wrote:

One day, a friend told me that his program was 3 times slower under
FreeBSD 6.1 than under GNU/Linux (from Redhat 7.2 to Fedora Core 5).
I was astonished by the real repeatable performance difference on
AMD Athlon XP 2500+ (1.8GHz, 512KB L2 Cache).

After hacking, I found that the problem is nested in malloc(3) of
FreeBSD libc.

Download the testing program: http://ftp.intron.ac/tmp/fdtd.tar.bz2

You may try to compile the program WITHOUT the macro MY_MALLOC
defined (in Makefile) to use malloc(3) provided by FreeBSD 6.1.
Then, time the running of the binary (on Athlon XP 2500+):

#/usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
   165.24 real   164.19 user 0.02 sys

Please try to recompile the program (Remember to make clean)
WITH the macro MY_MALLOC defined (in Makefile) to use my own
simple implementation of malloc(3) (i.e. my_malloc() in cal.c).
And time the running again:

#/usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
    50.41 real    49.95 user 0.04 sys

You may repeat this testing again and again.

I guess this kind of performance difference comes from:

1. His program uses malloc(3) to obtain so many small memory blocks.

2. In this case, FreeBSD malloc(3) obtains small memory blocks from
   the kernel and passes them to the application.

   But malloc(3) of GNU libc obtains large memory blocks from the kernel,
   then splits them and reallocates them to the application in small blocks.

   You may verify my judgement with truss(1).

3. The way of FreeBSD malloc(3) makes VM page mapping too chaotic, which
   reduces the efficiency of CPU L2 Cache. In contrast, my my_malloc()
   simulates the behavior of GNU libc malloc(3) partially and avoids
   the over-chaos.

Callgrind is broken under FreeBSD, or I will verify my guess with it.

I have also verified the program on Intel Pentium 4 511 (2.8GHz, 1MB
L2 cache, running FreeBSD 6.1 i386 though this CPU supports EM64T)


/usr/bin/time ./fdtd.FreeBSD 500 500 1000

...
  185.30 real   184.28 user 0.02 sys


/usr/bin/time ./fdtd.FreeBSD 500 500 1000

...
   36.31 real    35.94 user 0.03 sys

NOTE: you probably cannot see the performance difference on CPU with
   small L2 cache such as Intel Celeron 1.7GHz with 128 KB L2 Cache.


In CURRENT we've replaced phkmalloc with jemalloc.  It would be useful
to see how this benchmark performs with that.  I believe it does similar
things.

-- Brooks


On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):

With jemalloc (without MY_MALLOC):
 ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
116.34 real   113.69 user 0.00 sys

With MY_MALLOC:
 ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
 45.30 real    44.29 user 0.00 sys

Regards,
Vladimir


How long has it been since you last CVSup-ed your source tree?

These days the source tree frequently fails to build, which leaves
the 7.0-CURRENT binaries on some users' computers out of date.


   From Beijing, China



Re: The optimization of malloc(3): FreeBSD vs GNU libc

2006-08-15 Thread Jason Evans

李鑫 (LI Xin) wrote:

On Tue, 2006-08-15 at 02:38 +0300, Vladimir Kushnir wrote:

On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):

With jemalloc (without MY_MALLOC):
  ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
116.34 real   113.69 user 0.00 sys

With MY_MALLOC:
  ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
 45.30 real    44.29 user 0.00 sys


Have you turned off the debugging options, i.e. ln -sf
'aj' /etc/malloc.conf?


If you want to do a fair comparison, you should also define 
NO_MALLOC_EXTRAS when compiling malloc.c, in order to turn off the 
copious assertions, not to mention the statistics gathering code.


Before you do that though, it would be useful to turn statistics 
reporting on (add MALLOC_OPTIONS=P to your environment when running the 
test program) and see what malloc says is going on.


[I am away from my -current system at the moment, so can't benchmark the 
program.]  If I understand the code correctly (and assuming the command 
line parameters specified), it starts out by requesting 3517 2000-byte 
allocations from malloc, and never frees any of those allocations.


Both phkmalloc and jemalloc will fit two allocations in each page of 
memory.  phkmalloc will call sbrk() at least 1759 times.  jemalloc will 
call sbrk() approximately 6 times.  2kB allocations are a worst case for 
some of jemalloc's internal bookkeeping, but this shouldn't be a serious 
problem.  Fragmentation for these 2000-byte allocations should total 
approximately 6%.


malloc certainly incurs more overhead than a specialized sbrk()-based 
allocator, but I don't see any particular reason that jemalloc should 
perform suboptimally, as compared to any complete malloc implementation, 
for fdtd.  If you find otherwise, please tell me so that I can look into 
the issue.


Thanks,
Jason


Re: The optimization of malloc(3): FreeBSD vs GNU libc

2006-08-14 Thread Brooks Davis
On Tue, Aug 15, 2006 at 07:10:47AM +0800, Intron wrote:
 One day, a friend told me that his program was 3 times slower under
 FreeBSD 6.1 than under GNU/Linux (from Redhat 7.2 to Fedora Core 5).
 I was astonished by the real repeatable performance difference on
 AMD Athlon XP 2500+ (1.8GHz, 512KB L2 Cache).
 
 After hacking, I found that the problem is nested in malloc(3) of
 FreeBSD libc.
 
 Download the testing program: http://ftp.intron.ac/tmp/fdtd.tar.bz2
 
 You may try to compile the program WITHOUT the macro MY_MALLOC
 defined (in Makefile) to use malloc(3) provided by FreeBSD 6.1.
 Then, time the running of the binary (on Athlon XP 2500+):
 
 #/usr/bin/time ./fdtd.FreeBSD 500 500 1000
 ...
165.24 real   164.19 user 0.02 sys
 
 Please try to recompile the program (Remember to make clean)
 WITH the macro MY_MALLOC defined (in Makefile) to use my own
 simple implementation of malloc(3) (i.e. my_malloc() in cal.c).
 And time the running again:
 
 #/usr/bin/time ./fdtd.FreeBSD 500 500 1000
 ...
 50.41 real    49.95 user 0.04 sys
 
 You may repeat this testing again and again.
 
 I guess this kind of performance difference comes from:
 
 1. His program uses malloc(3) to obtain so many small memory blocks.
 
 2. In this case, FreeBSD malloc(3) obtains small memory blocks from
    the kernel and passes them to the application.
 
    But malloc(3) of GNU libc obtains large memory blocks from the kernel,
    then splits them and reallocates them to the application in small blocks.
 
You may verify my judgement with truss(1).
 
 3. The way of FreeBSD malloc(3) makes VM page mapping too chaotic, which
reduces the efficiency of CPU L2 Cache. In contrast, my my_malloc()
simulates the behavior of GNU libc malloc(3) partially and avoids
the over-chaos.
 
 Callgrind is broken under FreeBSD, or I will verify my guess with it.
 
 I have also verified the program on Intel Pentium 4 511 (2.8GHz, 1MB
 L2 cache, running FreeBSD 6.1 i386 though this CPU supports EM64T)
 
 /usr/bin/time ./fdtd.FreeBSD 500 500 1000
 ...
   185.30 real   184.28 user 0.02 sys
 
 /usr/bin/time ./fdtd.FreeBSD 500 500 1000
 ...
 36.31 real    35.94 user 0.03 sys
 
 NOTE: you probably cannot see the performance difference on CPU with
small L2 cache such as Intel Celeron 1.7GHz with 128 KB L2 Cache.

In CURRENT we've replaced phkmalloc with jemalloc.  It would be useful
to see how this benchmark performs with that.  I believe it does similar
things.

-- Brooks




Re: The optimization of malloc(3): FreeBSD vs GNU libc

2006-08-14 Thread Vladimir Kushnir

Sorry for intrusion.

On Mon, 14 Aug 2006, Brooks Davis wrote:


On Tue, Aug 15, 2006 at 07:10:47AM +0800, Intron wrote:

One day, a friend told me that his program was 3 times slower under
FreeBSD 6.1 than under GNU/Linux (from Redhat 7.2 to Fedora Core 5).
I was astonished by the real repeatable performance difference on
AMD Athlon XP 2500+ (1.8GHz, 512KB L2 Cache).

After hacking, I found that the problem is nested in malloc(3) of
FreeBSD libc.

Download the testing program: http://ftp.intron.ac/tmp/fdtd.tar.bz2

You may try to compile the program WITHOUT the macro MY_MALLOC
defined (in Makefile) to use malloc(3) provided by FreeBSD 6.1.
Then, time the running of the binary (on Athlon XP 2500+):

#/usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
   165.24 real   164.19 user 0.02 sys

Please try to recompile the program (Remember to make clean)
WITH the macro MY_MALLOC defined (in Makefile) to use my own
simple implementation of malloc(3) (i.e. my_malloc() in cal.c).
And time the running again:

#/usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
    50.41 real    49.95 user 0.04 sys

You may repeat this testing again and again.

I guess this kind of performance difference comes from:

1. His program uses malloc(3) to obtain so many small memory blocks.

2. In this case, FreeBSD malloc(3) obtains small memory blocks from
   the kernel and passes them to the application.

   But malloc(3) of GNU libc obtains large memory blocks from the kernel,
   then splits them and reallocates them to the application in small blocks.

   You may verify my judgement with truss(1).

3. The way of FreeBSD malloc(3) makes VM page mapping too chaotic, which
   reduces the efficiency of CPU L2 Cache. In contrast, my my_malloc()
   simulates the behavior of GNU libc malloc(3) partially and avoids
   the over-chaos.

Callgrind is broken under FreeBSD, or I will verify my guess with it.

I have also verified the program on Intel Pentium 4 511 (2.8GHz, 1MB
L2 cache, running FreeBSD 6.1 i386 though this CPU supports EM64T)


/usr/bin/time ./fdtd.FreeBSD 500 500 1000

...
  185.30 real   184.28 user 0.02 sys


/usr/bin/time ./fdtd.FreeBSD 500 500 1000

...
   36.31 real    35.94 user 0.03 sys

NOTE: you probably cannot see the performance difference on CPU with
   small L2 cache such as Intel Celeron 1.7GHz with 128 KB L2 Cache.


In CURRENT we've replaced phkmalloc with jemalloc.  It would be useful
to see how this benchmark performs with that.  I believe it does similar
things.

-- Brooks


On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):

With jemalloc (without MY_MALLOC):
 ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
116.34 real   113.69 user 0.00 sys

With MY_MALLOC:
 ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
 45.30 real    44.29 user 0.00 sys

Regards,
Vladimir


Re: The optimization of malloc(3): FreeBSD vs GNU libc

2006-08-14 Thread 李鑫 (LI Xin)
On Tue, 2006-08-15 at 02:38 +0300, Vladimir Kushnir wrote:
 On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):
 
 With jemalloc (without MY_MALLOC):
   ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
 ...
 116.34 real   113.69 user 0.00 sys
 
 With MY_MALLOC:
   ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
 ...
  45.30 real    44.29 user 0.00 sys

Have you turned off the debugging options, i.e. ln -sf
'aj' /etc/malloc.conf?

Cheers,
-- 
Xin LI (delphij delphij net)    http://www.delphij.net/

