Re: The optimization of malloc(3): FreeBSD vs GNU libc
Jason Evans wrote:
> 李鑫 (LI Xin) wrote:
> > On 2006-08-15 at 02:38 +0300, Vladimir Kushnir wrote:
> > > On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):
> > >
> > > With jemalloc (without MY_MALLOC):
> > > ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > > ...
> > > 116.34 real 113.69 user 0.00 sys
> > >
> > > With MY_MALLOC:
> > > ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > > ...
> > > 45.30 real 44.29 user 0.00 sys
> >
> > Have you turned off the debugging options, i.e. ln -sf 'aj' /etc/malloc.conf?
>
> If you want to do a fair comparison, you should also define NO_MALLOC_EXTRAS when compiling malloc.c, in order to turn off the copious assertions, not to mention the statistics gathering code.
>
> Before you do that though, it would be useful to turn statistics reporting on (add MALLOC_OPTIONS=P to your environment when running the test program) and see what malloc says is going on.
>
> [I am away from my -current system at the moment, so can't benchmark the program.]
>
> If I understand the code correctly (and assuming the command line parameters specified), it starts out by requesting 3517 2000-byte allocations from malloc, and never frees any of those allocations.

You're right. My friend's program evaluates an electromagnetic field by the Finite-Difference Time-Domain method.

> Both phkmalloc and jemalloc will fit two allocations in each page of memory. phkmalloc will call sbrk() at least 1759 times. jemalloc will call sbrk() approximately 6 times. 2kB allocations are a worst case for some of jemalloc's internal bookkeeping, but this shouldn't be a serious problem. Fragmentation for these 2000-byte allocations should total approximately 6%.

The same bad case occurs with mmap(2) if mmap(2) is used to obtain a small memory block each time. A hierarchical memory management mechanism is required, just like those of GNU libc and your new code. The essence of this problem is that the operating system's memory management can greatly affect the working efficiency of the CPU hardware.

Actually, not only my friend's program can cause the problem, but also many applications that use strdup(3) frequently.

/usr/src/lib/libc/string/strdup.c:

    char *
    strdup(str)
            const char *str;
    {
            size_t len;
            char *copy;

            len = strlen(str) + 1;
            if ((copy = malloc(len)) == NULL)
                    return (NULL);
            memcpy(copy, str, len);
            return (copy);
    }

> malloc certainly incurs more overhead than a specialized sbrk()-based allocator, but I don't see any particular reason that jemalloc should perform suboptimally, as compared to any complete malloc implementation, for fdtd. If you find otherwise, please tell me so that I can look into the issue.
>
> Thanks,
> Jason

From Beijing, China

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to [EMAIL PROTECTED]
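The "hierarchical" scheme discussed in the message above (obtain large blocks from the system, then hand them out in small pieces) can be sketched roughly as below. This is only an illustration under stated assumptions, not the my_malloc() from cal.c and not the jemalloc or GNU libc implementation: the names arena, arena_alloc, ARENA_CHUNK_SIZE and ARENA_ALIGN are hypothetical, the 2 MB chunk size merely mirrors the brk(2) request size reported in this thread, and plain malloc() stands in for the brk(2)/mmap(2) call that would obtain each chunk. Like a simple sbrk()-based allocator, it never frees anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names; a sketch, not the my_malloc() from cal.c. */
#define ARENA_CHUNK_SIZE ((size_t)2 * 1024 * 1024) /* one large request: 2 MB */
#define ARENA_ALIGN      ((size_t)16)              /* assumed alignment */

struct arena {
    unsigned char *base;  /* start of the current large chunk */
    size_t used;          /* bytes already handed out from that chunk */
    size_t chunks;        /* how many large requests have been made */
};

/*
 * Carve `size` bytes out of the current chunk, requesting a new chunk
 * only when the current one is exhausted. Memory is never returned.
 */
static void *arena_alloc(struct arena *a, size_t size)
{
    size_t need;

    assert(a != NULL);
    if (size > ARENA_CHUNK_SIZE)
        return NULL;                      /* too large for this sketch */

    /* Round the request up to a multiple of the alignment. */
    need = (size + ARENA_ALIGN - 1) & ~(ARENA_ALIGN - 1);

    if (a->base == NULL || a->used + need > ARENA_CHUNK_SIZE) {
        /* malloc() stands in for brk(2)/mmap(2) here (an assumption). */
        unsigned char *chunk = malloc(ARENA_CHUNK_SIZE);
        if (chunk == NULL)
            return NULL;
        a->base = chunk;                  /* the old chunk stays allocated */
        a->used = 0;
        a->chunks++;
    }

    void *p = a->base + a->used;
    a->used += need;
    return p;
}
```

With a zero-initialized struct arena, the 3517 requests of 2000 bytes mentioned in this thread would be served from a handful of large chunks, laid out back to back, instead of one system request per page.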
Re: The optimization of malloc(3): FreeBSD vs GNU libc
Brooks Davis wrote:
> On Tue, Aug 15, 2006 at 07:10:47AM +0800, Intron wrote:
> > One day, a friend told me that his program was 3 times slower under FreeBSD 6.1 than under GNU/Linux (from Redhat 7.2 to Fedora Core 5). I was astonished by the real, repeatable performance difference on an AMD Athlon XP 2500+ (1.8GHz, 512KB L2 cache).
> >
> > After hacking, I found that the problem lies in malloc(3) of FreeBSD libc.
> >
> > Download the testing program: http://ftp.intron.ac/tmp/fdtd.tar.bz2
> >
> > You may try to compile the program WITHOUT the macro MY_MALLOC defined (in Makefile) to use the malloc(3) provided by FreeBSD 6.1. Then time the running of the binary (on Athlon XP 2500+):
> >
> > # /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 165.24 real 164.19 user 0.02 sys
> >
> > Please try to recompile the program (remember to make clean) WITH the macro MY_MALLOC defined (in Makefile) to use my own simple implementation of malloc(3) (i.e. my_malloc() in cal.c). And time the running again:
> >
> > # /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 50.41 real 49.95 user 0.04 sys
> >
> > You may repeat this test again and again. I guess this kind of performance difference comes from:
> >
> > 1. His program uses malloc(3) to obtain a great many small memory blocks.
> >
> > 2. In this case, FreeBSD malloc(3) obtains small memory blocks from the kernel and passes them to the application. But malloc(3) of GNU libc obtains large memory blocks from the kernel, splits them, and reallocates them as small blocks to the application. You may verify my judgement with truss(1).
> >
> > 3. The way FreeBSD malloc(3) works makes VM page mapping too chaotic, which reduces the efficiency of the CPU L2 cache. In contrast, my my_malloc() partially simulates the behavior of GNU libc malloc(3) and avoids the excessive chaos.
> >
> > Callgrind is broken under FreeBSD, or I would verify my guess with it.
> >
> > I have also verified the program on an Intel Pentium 4 511 (2.8GHz, 1MB L2 cache, running FreeBSD 6.1 i386 though this CPU supports EM64T):
> >
> > /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 185.30 real 184.28 user 0.02 sys
> >
> > /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 36.31 real 35.94 user 0.03 sys
> >
> > NOTE: you probably cannot see the performance difference on a CPU with a small L2 cache, such as an Intel Celeron 1.7GHz with 128 KB L2 cache.
>
> In CURRENT we've replaced phkmalloc with jemalloc. It would be useful to see how this benchmark performs with that. I believe it does similar things.
>
> -- Brooks

You're right. Now with truss(1) I can see that malloc(3) on 7.0-CURRENT (4 days ago) calls brk(2) to obtain 2MB each time. I will continue my testing.

From Beijing, China
Re: The optimization of malloc(3): FreeBSD vs GNU libc
Vladimir Kushnir wrote:
> Sorry for intrusion.
>
> On Mon, 14 Aug 2006, Brooks Davis wrote:
> > On Tue, Aug 15, 2006 at 07:10:47AM +0800, Intron wrote:
> > > One day, a friend told me that his program was 3 times slower under FreeBSD 6.1 than under GNU/Linux (from Redhat 7.2 to Fedora Core 5). I was astonished by the real, repeatable performance difference on an AMD Athlon XP 2500+ (1.8GHz, 512KB L2 cache).
> > >
> > > After hacking, I found that the problem lies in malloc(3) of FreeBSD libc.
> > >
> > > Download the testing program: http://ftp.intron.ac/tmp/fdtd.tar.bz2
> > >
> > > You may try to compile the program WITHOUT the macro MY_MALLOC defined (in Makefile) to use the malloc(3) provided by FreeBSD 6.1. Then time the running of the binary (on Athlon XP 2500+):
> > >
> > > # /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > > ...
> > > 165.24 real 164.19 user 0.02 sys
> > >
> > > Please try to recompile the program (remember to make clean) WITH the macro MY_MALLOC defined (in Makefile) to use my own simple implementation of malloc(3) (i.e. my_malloc() in cal.c). And time the running again:
> > >
> > > # /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > > ...
> > > 50.41 real 49.95 user 0.04 sys
> > >
> > > You may repeat this test again and again. I guess this kind of performance difference comes from:
> > >
> > > 1. His program uses malloc(3) to obtain a great many small memory blocks.
> > >
> > > 2. In this case, FreeBSD malloc(3) obtains small memory blocks from the kernel and passes them to the application. But malloc(3) of GNU libc obtains large memory blocks from the kernel, splits them, and reallocates them as small blocks to the application. You may verify my judgement with truss(1).
> > >
> > > 3. The way FreeBSD malloc(3) works makes VM page mapping too chaotic, which reduces the efficiency of the CPU L2 cache. In contrast, my my_malloc() partially simulates the behavior of GNU libc malloc(3) and avoids the excessive chaos.
> > >
> > > Callgrind is broken under FreeBSD, or I would verify my guess with it.
> > >
> > > I have also verified the program on an Intel Pentium 4 511 (2.8GHz, 1MB L2 cache, running FreeBSD 6.1 i386 though this CPU supports EM64T):
> > >
> > > /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > > ...
> > > 185.30 real 184.28 user 0.02 sys
> > >
> > > /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > > ...
> > > 36.31 real 35.94 user 0.03 sys
> > >
> > > NOTE: you probably cannot see the performance difference on a CPU with a small L2 cache, such as an Intel Celeron 1.7GHz with 128 KB L2 cache.
> >
> > In CURRENT we've replaced phkmalloc with jemalloc. It would be useful to see how this benchmark performs with that. I believe it does similar things.
> >
> > -- Brooks
>
> On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):
>
> With jemalloc (without MY_MALLOC):
> ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> ...
> 116.34 real 113.69 user 0.00 sys
>
> With MY_MALLOC:
> ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> ...
> 45.30 real 44.29 user 0.00 sys
>
> Regards,
> Vladimir

How long has it been since you last CVSup-ed your source tree? These days the source tree frequently fails to build, which leaves the 7.0-CURRENT binaries on some users' computers out of date.

From Beijing, China
Re: The optimization of malloc(3): FreeBSD vs GNU libc
李鑫 (LI Xin) wrote:
> On Tue, 2006-08-15 at 02:38 +0300, Vladimir Kushnir wrote:
> > On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):
> >
> > With jemalloc (without MY_MALLOC):
> > ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 116.34 real 113.69 user 0.00 sys
> >
> > With MY_MALLOC:
> > ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 45.30 real 44.29 user 0.00 sys
>
> Have you turned off the debugging options, i.e. ln -sf 'aj' /etc/malloc.conf?

If you want to do a fair comparison, you should also define NO_MALLOC_EXTRAS when compiling malloc.c, in order to turn off the copious assertions, not to mention the statistics gathering code.

Before you do that though, it would be useful to turn statistics reporting on (add MALLOC_OPTIONS=P to your environment when running the test program) and see what malloc says is going on.

[I am away from my -current system at the moment, so can't benchmark the program.]

If I understand the code correctly (and assuming the command line parameters specified), it starts out by requesting 3517 2000-byte allocations from malloc, and never frees any of those allocations. Both phkmalloc and jemalloc will fit two allocations in each page of memory. phkmalloc will call sbrk() at least 1759 times. jemalloc will call sbrk() approximately 6 times. 2kB allocations are a worst case for some of jemalloc's internal bookkeeping, but this shouldn't be a serious problem. Fragmentation for these 2000-byte allocations should total approximately 6%.

malloc certainly incurs more overhead than a specialized sbrk()-based allocator, but I don't see any particular reason that jemalloc should perform suboptimally, as compared to any complete malloc implementation, for fdtd. If you find otherwise, please tell me so that I can look into the issue.

Thanks,
Jason
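As a rough check of the page arithmetic in the message above, the following sketch assumes a 4096-byte page (typical for i386 and amd64, but an assumption here) and uses hypothetical helper names, pages_needed and slack_per_page. With two 2000-byte allocations per page, 3517 allocations need 1759 pages, which matches the "at least 1759" sbrk() calls quoted for phkmalloc. The 96 bytes left over in each page come to about 2.3% of the page; the sketch does not model allocator bookkeeping, so it does not by itself reproduce the approximately 6% fragmentation figure.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE ((size_t)4096)  /* assumed page size, not stated in the thread */

/* Pages needed when `per_page` objects share each page (ceiling division). */
static size_t pages_needed(size_t count, size_t per_page)
{
    assert(per_page > 0);
    return (count + per_page - 1) / per_page;
}

/* Bytes left unused in one page that holds `per_page` objects of `size` bytes. */
static size_t slack_per_page(size_t size, size_t per_page)
{
    assert(size * per_page <= PAGE_SIZE);
    return PAGE_SIZE - size * per_page;
}
```

Here pages_needed(3517, 2) evaluates to 1759 and slack_per_page(2000, 2) to 96.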
Re: The optimization of malloc(3): FreeBSD vs GNU libc
On Tue, Aug 15, 2006 at 07:10:47AM +0800, Intron wrote:
> One day, a friend told me that his program was 3 times slower under FreeBSD 6.1 than under GNU/Linux (from Redhat 7.2 to Fedora Core 5). I was astonished by the real, repeatable performance difference on an AMD Athlon XP 2500+ (1.8GHz, 512KB L2 cache).
>
> After hacking, I found that the problem lies in malloc(3) of FreeBSD libc.
>
> Download the testing program: http://ftp.intron.ac/tmp/fdtd.tar.bz2
>
> You may try to compile the program WITHOUT the macro MY_MALLOC defined (in Makefile) to use the malloc(3) provided by FreeBSD 6.1. Then time the running of the binary (on Athlon XP 2500+):
>
> # /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> ...
> 165.24 real 164.19 user 0.02 sys
>
> Please try to recompile the program (remember to make clean) WITH the macro MY_MALLOC defined (in Makefile) to use my own simple implementation of malloc(3) (i.e. my_malloc() in cal.c). And time the running again:
>
> # /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> ...
> 50.41 real 49.95 user 0.04 sys
>
> You may repeat this test again and again. I guess this kind of performance difference comes from:
>
> 1. His program uses malloc(3) to obtain a great many small memory blocks.
>
> 2. In this case, FreeBSD malloc(3) obtains small memory blocks from the kernel and passes them to the application. But malloc(3) of GNU libc obtains large memory blocks from the kernel, splits them, and reallocates them as small blocks to the application. You may verify my judgement with truss(1).
>
> 3. The way FreeBSD malloc(3) works makes VM page mapping too chaotic, which reduces the efficiency of the CPU L2 cache. In contrast, my my_malloc() partially simulates the behavior of GNU libc malloc(3) and avoids the excessive chaos.
>
> Callgrind is broken under FreeBSD, or I would verify my guess with it.
>
> I have also verified the program on an Intel Pentium 4 511 (2.8GHz, 1MB L2 cache, running FreeBSD 6.1 i386 though this CPU supports EM64T):
>
> /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> ...
> 185.30 real 184.28 user 0.02 sys
>
> /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> ...
> 36.31 real 35.94 user 0.03 sys
>
> NOTE: you probably cannot see the performance difference on a CPU with a small L2 cache, such as an Intel Celeron 1.7GHz with 128 KB L2 cache.

In CURRENT we've replaced phkmalloc with jemalloc. It would be useful to see how this benchmark performs with that. I believe it does similar things.

-- Brooks
Re: The optimization of malloc(3): FreeBSD vs GNU libc
Sorry for intrusion.

On Mon, 14 Aug 2006, Brooks Davis wrote:
> On Tue, Aug 15, 2006 at 07:10:47AM +0800, Intron wrote:
> > One day, a friend told me that his program was 3 times slower under FreeBSD 6.1 than under GNU/Linux (from Redhat 7.2 to Fedora Core 5). I was astonished by the real, repeatable performance difference on an AMD Athlon XP 2500+ (1.8GHz, 512KB L2 cache).
> >
> > After hacking, I found that the problem lies in malloc(3) of FreeBSD libc.
> >
> > Download the testing program: http://ftp.intron.ac/tmp/fdtd.tar.bz2
> >
> > You may try to compile the program WITHOUT the macro MY_MALLOC defined (in Makefile) to use the malloc(3) provided by FreeBSD 6.1. Then time the running of the binary (on Athlon XP 2500+):
> >
> > # /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 165.24 real 164.19 user 0.02 sys
> >
> > Please try to recompile the program (remember to make clean) WITH the macro MY_MALLOC defined (in Makefile) to use my own simple implementation of malloc(3) (i.e. my_malloc() in cal.c). And time the running again:
> >
> > # /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 50.41 real 49.95 user 0.04 sys
> >
> > You may repeat this test again and again. I guess this kind of performance difference comes from:
> >
> > 1. His program uses malloc(3) to obtain a great many small memory blocks.
> >
> > 2. In this case, FreeBSD malloc(3) obtains small memory blocks from the kernel and passes them to the application. But malloc(3) of GNU libc obtains large memory blocks from the kernel, splits them, and reallocates them as small blocks to the application. You may verify my judgement with truss(1).
> >
> > 3. The way FreeBSD malloc(3) works makes VM page mapping too chaotic, which reduces the efficiency of the CPU L2 cache. In contrast, my my_malloc() partially simulates the behavior of GNU libc malloc(3) and avoids the excessive chaos.
> >
> > Callgrind is broken under FreeBSD, or I would verify my guess with it.
> >
> > I have also verified the program on an Intel Pentium 4 511 (2.8GHz, 1MB L2 cache, running FreeBSD 6.1 i386 though this CPU supports EM64T):
> >
> > /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 185.30 real 184.28 user 0.02 sys
> >
> > /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> > ...
> > 36.31 real 35.94 user 0.03 sys
> >
> > NOTE: you probably cannot see the performance difference on a CPU with a small L2 cache, such as an Intel Celeron 1.7GHz with 128 KB L2 cache.
>
> In CURRENT we've replaced phkmalloc with jemalloc. It would be useful to see how this benchmark performs with that. I believe it does similar things.
>
> -- Brooks

On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):

With jemalloc (without MY_MALLOC):
~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
116.34 real 113.69 user 0.00 sys

With MY_MALLOC:
~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
...
45.30 real 44.29 user 0.00 sys

Regards,
Vladimir
Re: The optimization of malloc(3): FreeBSD vs GNU libc
On Tue, 2006-08-15 at 02:38 +0300, Vladimir Kushnir wrote:
> On -CURRENT amd64 (Athlon64 3000+, 512k L2 cache):
>
> With jemalloc (without MY_MALLOC):
> ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> ...
> 116.34 real 113.69 user 0.00 sys
>
> With MY_MALLOC:
> ~/fdtd /usr/bin/time ./fdtd.FreeBSD 500 500 1000
> ...
> 45.30 real 44.29 user 0.00 sys

Have you turned off the debugging options, i.e. ln -sf 'aj' /etc/malloc.conf?

Cheers,
--
Xin LI delphij delphij net http://www.delphij.net/