On Tue, Apr 26, 2011 at 6:40 AM, Mihai Badoiu <mbad...@gmail.com> wrote: > Hi, > I have terrible performance for multiplication when one number gets very > close to zero. I'm using cython by writing the following code: > cdef int i > cdef double x = 1.0 > for 0 <= i < 10000000: > x *= 0.8 > #x += 0.01 > print x > This code runs much much slower (20+ times slower) with the line x += 0.01 > uncommented. I looked at the deassembled code and it looks correct. > Moreover, it's just a few lines and by writing a C code (without python on > top), I get the same code, but it's much faster. I've also tried using sse, > but I get exactly the same behavior. The best candidate that I see so far > is that Python sets up the FPU in a different state than C. > Any advice on how to solve this performance problem? > thanks!
I'm getting almost the opposite result: with 1 operation (just the *=), it's about 6 times slower than with two operations (with the *= and +=). I investigated whether it was a GC issue; I conclude it's probably not - the number of objects in the heap was consistent between the fast and slow cython versions. I also split out the addition and multiplication into their own functions, foo1 and foo2, and profiled what was left of the original function, foo. This did not tell anything salient either, other than that the foo function had some hidden performance penalty. A line by line profiler might help, if Cython supports going down to that level; it seems that function by function profiling may not be granular enough for this problem. I found that the performance difference wasn't as marked as on your system - more like 6x. I also found that the duration of the faster of the two cython's was almost exactly the same as the duration of the C program I created to do the same thing. BTW, what version of CPython, and what version of Cython are you using? I used mostly CPython 3.1, with a Cython compiled from an early version that supported generators - which I believe has now been merged. You might try a few different versions of Cython and compare the results. What OS are you on? What kind of CPU? I'm on Ubuntu 10.10 with an AMD Athlon(tm) II X2 245 Processor. It might reveal something if you hand-coded a C extension module that does the same thing. And while this may be slow, it might also be revealing to try using decimal.Decimal arithmetic. Interesting that you get a semi-similar result in pure CPython. Please find below a shar of my experiments. #!/bin/sh # This is a shell archive (produced by GNU sharutils 4.9). # To extract the files from this archive, save it to some FILE, remove # everything before the `#!/bin/sh' line above, then type `sh FILE'. # lock_dir=_sh26876 # Made on 2011-04-26 12:08 PDT by <dstromberg@benchbox>. # Source directory was `/home/dstromberg/src/cython-slowdown'. # # Existing files will *not* be overwritten, unless `-c' is specified. # # This shar contains: # length mode name # ------ ---------- ------------------------------------------ # 463 -rw-r--r-- cs.m4 # 324 -rwxr-xr-x main # 787 -rw-r--r-- Makefile # 1954 -rw-r--r-- output # 180 -rw-r--r-- t.c # MD5SUM=${MD5SUM-md5sum} f=`${MD5SUM} --version | egrep '^md5sum .*(core|text)utils'` test -n "${f}" && md5check=true || md5check=false ${md5check} || \ echo 'Note: not verifying md5sums. Consider installing GNU coreutils.' save_IFS="${IFS}" IFS="${IFS}:" gettext_dir=FAILED locale_dir=FAILED first_param="$1" for dir in $PATH do if test "$gettext_dir" = FAILED && test -f $dir/gettext \ && ($dir/gettext --version >/dev/null 2>&1) then case `$dir/gettext --version 2>&1 | sed 1q` in *GNU*) gettext_dir=$dir ;; esac fi if test "$locale_dir" = FAILED && test -f $dir/shar \ && ($dir/shar --print-text-domain-dir >/dev/null 2>&1) then locale_dir=`$dir/shar --print-text-domain-dir` fi done IFS="$save_IFS" if test "$locale_dir" = FAILED || test "$gettext_dir" = FAILED then echo=echo else TEXTDOMAINDIR=$locale_dir export TEXTDOMAINDIR TEXTDOMAIN=sharutils export TEXTDOMAIN echo="$gettext_dir/gettext -s" fi if (echo "testing\c"; echo 1,2,3) | grep c >/dev/null then if (echo -n test; echo 1,2,3) | grep n >/dev/null then shar_n= shar_c=' ' else shar_n=-n shar_c= ; fi else shar_n= shar_c='\c' ; fi f=shar-touch.$$ st1=200112312359.59 st2=123123592001.59 st2tr=123123592001.5 # old SysV 14-char limit st3=1231235901 if touch -am -t ${st1} ${f} >/dev/null 2>&1 && \ test ! -f ${st1} && test -f ${f}; then shar_touch='touch -am -t $1$2$3$4$5$6.$7 "$8"' elif touch -am ${st2} ${f} >/dev/null 2>&1 && \ test ! -f ${st2} && test ! -f ${st2tr} && test -f ${f}; then shar_touch='touch -am $3$4$5$6$1$2.$7 "$8"' elif touch -am ${st3} ${f} >/dev/null 2>&1 && \ test ! -f ${st3} && test -f ${f}; then shar_touch='touch -am $3$4$5$6$2 "$8"' else shar_touch=: echo ${echo} 'WARNING: not restoring timestamps. Consider getting and installing GNU `touch'\'', distributed in GNU coreutils...' echo fi rm -f ${st1} ${st2} ${st2tr} ${st3} ${f} # if test ! -d ${lock_dir} ; then : else ${echo} "lock directory ${lock_dir} exists" exit 1 fi if mkdir ${lock_dir} then ${echo} "x - created lock directory ${lock_dir}." else ${echo} "x - failed to create lock directory ${lock_dir}." exit 1 fi # ============= cs.m4 ============== if test -f 'cs.m4' && test "$first_param" != -c; then ${echo} "x - SKIPPING cs.m4 (file already exists)" else ${echo} "x - extracting cs.m4 (text)" sed 's/^X//' << 'SHAR_EOF' > 'cs.m4' && X # cython: profile=True X import gc X cpdef foo(): X cdef int i X cdef double x = 1.0 X #for 0 <= i < 10000000: X for 0 <= i < 100000000: # if i % 10000000 == 0: # print(`len'(gc.get_objects())) X x *= 0.8 X # This code runs much much slower (20+ times slower) with the line X # x += 0.01 uncommented. IOW, with 1 operation, it's 20 times faster; X # with 2 operations, it's slower than it should be. X ifdef(`commented',`# ')x += 0.01 X #x += 0.0 X #print x X X SHAR_EOF (set 20 11 04 26 12 07 22 'cs.m4' eval "${shar_touch}") && \ chmod 0644 'cs.m4' if test $? -ne 0 then ${echo} "restore of cs.m4 failed" fi if ${md5check} then ( ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 'cs.m4': 'MD5 check failed' ) << \SHAR_EOF 56c23edc97b77a178c2fdf50399041c2 cs.m4 SHAR_EOF else test `LC_ALL=C wc -c < 'cs.m4'` -ne 463 && \ ${echo} "restoration warning: size of 'cs.m4' is not 463" fi fi # ============= main ============== if test -f 'main' && test "$first_param" != -c; then ${echo} "x - SKIPPING main (file already exists)" else ${echo} "x - extracting main (text)" sed 's/^X//' << 'SHAR_EOF' > 'main' && #!/usr/bin/python X import sys import pstats import cProfile X if sys.argv[1:] == [ 'cs1' ]: X import cs1 as cs elif sys.argv[1:] == [ 'cs2' ]: X import cs2 as cs else: X raise ValueError X cProfile.runctx("cs.foo()", globals(), locals(), "foo.prof") X s = pstats.Stats("foo.prof") s.strip_dirs().sort_stats("time").print_stats() X SHAR_EOF (set 20 11 04 26 11 22 14 'main' eval "${shar_touch}") && \ chmod 0755 'main' if test $? -ne 0 then ${echo} "restore of main failed" fi if ${md5check} then ( ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 'main': 'MD5 check failed' ) << \SHAR_EOF 0c18c9a91e04269b39fb115f37b375e2 main SHAR_EOF else test `LC_ALL=C wc -c < 'main'` -ne 324 && \ ${echo} "restoration warning: size of 'main' is not 324" fi fi # ============= Makefile ============== if test -f 'Makefile' && test "$first_param" != -c; then ${echo} "x - SKIPPING Makefile (file already exists)" else ${echo} "x - extracting Makefile (text)" sed 's/^X//' << 'SHAR_EOF' > 'Makefile' && X pyver=python-3.1 short_pyver=$(shell echo "$(pyver)" | sed 's/-//') python_dir=/usr/local/c$(pyver) X go: cs1.so cs2.so t X time $(python_dir)/bin/python ./main cs1 X time $(python_dir)/bin/python ./main cs2 X time ./t X t: t.c X gcc -o t t.c X cs1.pyx cs2.pyx: cs.m4 X m4 -Dcommented=1 < cs.m4 > cs1.pyx X m4 cs.m4 > cs2.pyx X $(python_dir)/bin/cython -a cs1.pyx X $(python_dir)/bin/cython -a cs2.pyx X cs1.so: cs1.pyx X $(python_dir)/bin/cython cs1.pyx X gcc -c cs1.c -I$(python_dir)/include/$(short_pyver) X gcc -o cs1.so -shared cs1.o -L$(python_dir)/lib -l$(short_pyver) X cs2.so: cs2.pyx X $(python_dir)/bin/cython cs2.pyx X gcc -c cs2.c -I$(python_dir)/include/$(short_pyver) X gcc -o cs2.so -shared cs2.o -L$(python_dir)/lib -l$(short_pyver) X clean: X rm -f cs[12].c *.o *.so *.pyc t *.pyx *.prof X SHAR_EOF (set 20 11 04 26 12 04 13 'Makefile' eval "${shar_touch}") && \ chmod 0644 'Makefile' if test $? -ne 0 then ${echo} "restore of Makefile failed" fi if ${md5check} then ( ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 'Makefile': 'MD5 check failed' ) << \SHAR_EOF dbfc58948b66668e1c3a3efc90d84dac Makefile SHAR_EOF else test `LC_ALL=C wc -c < 'Makefile'` -ne 787 && \ ${echo} "restoration warning: size of 'Makefile' is not 787" fi fi # ============= output ============== if test -f 'output' && test "$first_param" != -c; then ${echo} "x - SKIPPING output (file already exists)" else ${echo} "x - extracting output (text)" sed 's/^X//' << 'SHAR_EOF' > 'output' && /usr/local/cpython-3.1/bin/cython cs1.pyx gcc -c cs1.c -I/usr/local/cpython-3.1/include/python3.1 gcc -o cs1.so -shared cs1.o -L/usr/local/cpython-3.1/lib -lpython3.1 /usr/local/cpython-3.1/bin/cython cs2.pyx gcc -c cs2.c -I/usr/local/cpython-3.1/include/python3.1 gcc -o cs2.so -shared cs2.o -L/usr/local/cpython-3.1/lib -lpython3.1 time /usr/local/cpython-3.1/bin/python ./main cs1 Tue Apr 26 12:01:15 2011 foo.prof X X 5 function calls in 6.495 CPU seconds X X Ordered by: internal time X X ncalls tottime percall cumtime percall filename:lineno(function) X 1 6.495 6.495 6.495 6.495 cs1.pyx:6(foo) X 1 0.000 0.000 6.495 6.495 {built-in method exec} X 1 0.000 0.000 6.495 6.495 <string>:1(<module>) X 1 0.000 0.000 6.495 6.495 {built-in method foo} X 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} X X 5.92user 0.02system 0:06.51elapsed 91%CPU (0avgtext+0avgdata 23568maxresident)k 0inputs+8outputs (0major+1608minor)pagefaults 0swaps time /usr/local/cpython-3.1/bin/python ./main cs2 0.05 Tue Apr 26 12:01:16 2011 foo.prof X X 5 function calls in 0.987 CPU seconds X X Ordered by: internal time X X ncalls tottime percall cumtime percall filename:lineno(function) X 1 0.987 0.987 0.987 0.987 cs2.pyx:6(foo) X 1 0.000 0.000 0.987 0.987 {built-in method exec} X 1 0.000 0.000 0.987 0.987 <string>:1(<module>) X 1 0.000 0.000 0.987 0.987 {built-in method foo} X 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} X X 1.00user 0.00system 0:01.01elapsed 99%CPU (0avgtext+0avgdata 23584maxresident)k 0inputs+8outputs (0major+1609minor)pagefaults 0swaps time ./t 0.050000 0.87user 0.00system 0:00.87elapsed 99%CPU (0avgtext+0avgdata 1616maxresident)k 0inputs+0outputs (0major+140minor)pagefaults 0swaps SHAR_EOF (set 20 11 04 26 12 01 17 'output' eval "${shar_touch}") && \ chmod 0644 'output' if test $? -ne 0 then ${echo} "restore of output failed" fi if ${md5check} then ( ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 'output': 'MD5 check failed' ) << \SHAR_EOF c024cb5c585e47771f068d58396f5f17 output SHAR_EOF else test `LC_ALL=C wc -c < 'output'` -ne 1954 && \ ${echo} "restoration warning: size of 'output' is not 1954" fi fi # ============= t.c ============== if test -f 't.c' && test "$first_param" != -c; then ${echo} "x - SKIPPING t.c (file already exists)" else ${echo} "x - extracting t.c (text)" sed 's/^X//' << 'SHAR_EOF' > 't.c' && X #include <stdio.h> #include <stdlib.h> X void main() X { X int i; X double x = 1.0; X for (i=0; i < 100000000; i++) X { X x *= 0.8; X x += 0.01; X } X printf("%f\n", x); X exit(0); X } X SHAR_EOF (set 20 11 04 26 11 46 49 't.c' eval "${shar_touch}") && \ chmod 0644 't.c' if test $? -ne 0 then ${echo} "restore of t.c failed" fi if ${md5check} then ( ${MD5SUM} -c >/dev/null 2>&1 || ${echo} 't.c': 'MD5 check failed' ) << \SHAR_EOF ebeb48af881c2309d26b0da6a8b9256f t.c SHAR_EOF else test `LC_ALL=C wc -c < 't.c'` -ne 180 && \ ${echo} "restoration warning: size of 't.c' is not 180" fi fi if rm -fr ${lock_dir} then ${echo} "x - removed lock directory ${lock_dir}." else ${echo} "x - failed to remove lock directory ${lock_dir}." exit 1 fi exit 0 -- http://mail.python.org/mailman/listinfo/python-list