Eric, thanks for insisting on this. I had noticed it when I first saw it, only to forget about it again ...

The new timings on my machine are:

$: gcc -O3 -c the_lib.c -fPIC -fopenmp -ffast-math
$: gcc -shared -o the_lib.so the_lib.o -lgomp -lm
$: python2.5 the_python_prog.py
c_threads 1 time 0.000897128582001
c_threads 2 time 0.000540800094604
c_threads 3 time 0.00035933971405
c_threads 4 time 0.000529370307922
c_threads 5 time 0.00049122095108
c_threads 6 time 0.000540502071381
c_threads 7 time 0.000580079555511
c_threads 8 time 0.000643739700317
c_threads 9 time 0.000622930526733
c_threads 10 time 0.000680360794067
c_threads 11 time 0.000613269805908
c_threads 12 time 0.000633401870728
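(As an aside for anyone following along: the_python_prog.py itself was never posted, but the general ctypes pattern for calling into a compiled shared library from Python looks like the sketch below. Since the_lib.so isn't attached to this thread, the sketch loads the system libm instead; for the real thing you would use ctypes.CDLL("./the_lib.so") and declare the actual function's signature.)

```python
import ctypes
import ctypes.util

# For the real library this would be: ctypes.CDLL("./the_lib.so").
# Here we load the system math library so the example is self-contained.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Always declare argument and return types: ctypes assumes int by
# default, which silently corrupts double results.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # -> 1.4142135623730951
```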
That is, your OpenMP version is again fastest with 3 threads on my 4-core CPU. It is now 2.34x faster than my non-OpenMP code (which compares to scipy...cdist). And it is (only !?) 7% slower than the non-OpenMP code when running on 1 thread. (The speedup of 3 threads vs. 1 thread is 2.5x.)

So, that is pretty good !!

What I don't understand is why you started your first post with "I don't have the slightest idea what I'm doing" ;-)

Do you think one could get even better? And where does the 7% slow-down (for a single thread) come from? Is it possible to have the OpenMP option in a code without _any_ penalty for 1-core machines?

Thanks,
- Sebastian

On Thu, Feb 17, 2011 at 2:12 AM, Eric Carlson <ecarl...@eng.ua.edu> wrote:
> Sebastian,
> Optimization appears to be important here. I used no optimization in my
> previous post, so you could try the -O3 compile option:
>
> gcc -O3 -c my_lib.c -fPIC -fopenmp -ffast-math
>
> For na=329 and nb=340 I get (about a 7.5x speedup):
> c_threads 1 time 0.00103106021881
> c_threads 2 time 0.000528309345245
> c_threads 3 time 0.000362541675568
> c_threads 4 time 0.00028993844986
> c_threads 5 time 0.000287840366364
> c_threads 6 time 0.000264899730682
> c_threads 7 time 0.000244019031525
> c_threads 8 time 0.000242137908936
> c_threads 9 time 0.000232398509979
> c_threads 10 time 0.000227460861206
> c_threads 11 time 0.00021938085556
> c_threads 12 time 0.000216970443726
> c_threads 13 time 0.000215198993683
> c_threads 14 time 0.00021940946579
> c_threads 15 time 0.000204219818115
> c_threads 16 time 0.000216958522797
> c_threads 17 time 0.000219728946686
> c_threads 18 time 0.000199990272522
> c_threads 19 time 0.000157492160797
> c_threads 20 time 0.000171000957489
> c_threads 21 time 0.000147500038147
> c_threads 22 time 0.000141770839691
> c_threads 23 time 0.000137741565704
>
> For na=3290 and nb=3400 (about an 11.5x speedup):
> c_threads 1 time 0.100258581638
> c_threads 2 time 0.0501346611977
> c_threads 3 time 0.0335096096992
> c_threads 4 time 0.0253720903397
> c_threads 5 time 0.0208190107346
> c_threads 6 time 0.0173784399033
> c_threads 7 time 0.0148811817169
> c_threads 8 time 0.0130474209785
> c_threads 9 time 0.011598110199
> c_threads 10 time 0.0104278612137
> c_threads 11 time 0.00950778007507
> c_threads 12 time 0.00870131969452
> c_threads 13 time 0.015882730484
> c_threads 14 time 0.0148504400253
> c_threads 15 time 0.0139465212822
> c_threads 16 time 0.0130301308632
> c_threads 17 time 0.012240819931
> c_threads 18 time 0.011567029953
> c_threads 19 time 0.0109891605377
> c_threads 20 time 0.0104281497002
> c_threads 21 time 0.00992572069168
> c_threads 22 time 0.00957406997681
> c_threads 23 time 0.00936627149582
>
> For na=329 and nb=340, cdist comes in at 0.00111914873123, which is
> 1.085x slower than the C version on my system.
>
> For na=3290 and nb=3400, cdist gives 0.143441538811.
>
> Cheers,
> Eric
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion