Re: [Numpy-discussion] numexpr with the new iterator
On Tuesday 11 January 2011 06:45:28, Mark Wiebe wrote:
> On Mon, Jan 10, 2011 at 11:35 AM, Mark Wiebe wrote:
> > I'm a bit curious why the jump from 1 to 2 threads is scaling so
> > poorly. Your timings have improvement factors of 1.85, 1.68, 1.64,
> > and 1.79. Since the computation is trivial data parallelism, and I
> > believe it's still pretty far off the memory bandwidth limit, I
> > would expect a speedup of 1.95 or higher.
>
> It looks like it is the memory bandwidth which is limiting the
> scalability.

Indeed, this is an increasingly important problem for modern computers.
You may want to read:

http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf

;-)

> The slower operations scale much better than faster ones. Below are
> some timings of successively faster operations. When the operation
> is slow enough, it scales like I was expecting...
[clip]

Yeah, for another example of this with more threads, see:

http://code.google.com/p/numexpr/wiki/MultiThreadVM

OTOH, I was curious about the performance of the new iterator with
Intel's VML, but it seems to work decently too:

$ python bench/vml_timing.py (original numexpr, *no* VML support)
*** Numexpr vs NumPy speed-ups ***
Contiguous case: 1.72 (mean), 0.92 (min), 3.07 (max)
Strided case: 2.1 (mean), 0.98 (min), 3.52 (max)
Unaligned case: 2.35 (mean), 1.35 (min), 3.31 (max)

$ python bench/vml_timing.py (original numexpr, VML support)
*** Numexpr vs NumPy speed-ups ***
Contiguous case: 3.83 (mean), 1.1 (min), 10.19 (max)
Strided case: 3.21 (mean), 0.98 (min), 7.45 (max)
Unaligned case: 3.6 (mean), 1.47 (min), 7.87 (max)

$ python bench/vml_timing.py (new iter numexpr, VML support)
*** Numexpr vs NumPy speed-ups ***
Contiguous case: 3.56 (mean), 1.12 (min), 7.38 (max)
Strided case: 2.37 (mean), 0.09 (min), 7.63 (max)
Unaligned case: 3.56 (mean), 2.08 (min), 5.88 (max)

However, there are a couple of quirks here.

1) The original numexpr generally performs faster than the iter
version.
2) The strided case is quite a bit worse for the iter version.

I've isolated the tests that perform worse for the iter version, and
here are a couple of samples:

*** Expression: exp(f3)
numpy: 0.0135
numpy strided: 0.0144
numpy unaligned: 0.0200
numexpr: 0.0020    Speed-up of numexpr over numpy: 6.6584
numexpr strided: 0.1495    Speed-up of numexpr over numpy: 0.0962
numexpr unaligned: 0.0049    Speed-up of numexpr over numpy: 4.0859

*** Expression: sin(f3)>cos(f4)
numpy: 0.0291
numpy strided: 0.0366
numpy unaligned: 0.0407
numexpr: 0.0166    Speed-up of numexpr over numpy: 1.7518
numexpr strided: 0.1551    Speed-up of numexpr over numpy: 0.2361
numexpr unaligned: 0.0175    Speed-up of numexpr over numpy: 2.3246

Maybe you can shed some light on what's going on here (shall we
discuss this off-list so as to not bore people too much?).

--
Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
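As a sanity check, the speed-up figures reported by vml_timing.py are plain wall-clock ratios; recomputing them from the exp(f3) sample above (the small differences come from the timings being rounded for display):

```python
# Per-case speed-up = numpy time / numexpr time, using the (rounded)
# exp(f3) timings quoted above.
numpy_times   = {"contiguous": 0.0135, "strided": 0.0144, "unaligned": 0.0200}
numexpr_times = {"contiguous": 0.0020, "strided": 0.1495, "unaligned": 0.0049}

speedups = {k: numpy_times[k] / numexpr_times[k] for k in numpy_times}
for case, s in sorted(speedups.items()):
    print("%-10s %.4f" % (case, s))
# Only the strided case drops below 1.0, i.e. slower than plain NumPy.
```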
Re: [Numpy-discussion] numexpr with the new iterator
On Monday 10 January 2011 19:29:33, Mark Wiebe wrote:
> > so, the new code is just < 5% slower. I suppose that removing the
> > NPY_ITER_ALIGNED flag would give us a bit more performance, but
> > that's great as it is now. How did you do that? Your new_iter
> > branch in NumPy already deals with unaligned data, right?
>
> Take a look at lowlevel_strided_loops.c.src. In this case, the
> buffering setup code calls PyArray_GetDTypeTransferFunction, which
> in turn calls PyArray_GetStridedCopyFn, which on an x86 platform
> returns _aligned_strided_to_contig_size8. This function has a simple
> loop of copies using a npy_uint64 data type.

I see. Brilliant!

> > Well, if you can support reduce operations with your patch that
> > would be extremely good news, as I'm afraid that the current
> > reduce code is a bit broken in Numexpr (at least, I vaguely
> > remember seeing it working badly in some cases).
>
> Cool, I'll take a look at some point. I imagine with the most
> obvious implementation small reductions would perform poorly.

IMO, reductions like sum() or prod() are mainly limited by memory
access, so my advice would be to not over-optimize here, and just make
use of the new iterator. We can refine performance later on.

--
Francesc Alted
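The nested-iteration reduce pattern discussed above can be sketched in pure Python: an outer loop hands out fixed-size blocks, a one-dimensional inner loop does the actual accumulation over each block, and the partial results are combined at the end. This is an illustration of the shape of the algorithm only, not numexpr's C implementation:

```python
# Sketch of a blocked reduction: outer loop over blocks, inner
# one-dimensional loop over each block, partials combined at the end.
def blocked_sum(data, block_size=4096):
    partials = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]  # inner, 1-D loop
        acc = 0.0
        for x in block:
            acc += x
        partials.append(acc)
    return sum(partials)  # combine per-block partial reductions

print(blocked_sum(list(range(10))))  # -> 45.0
```

The same structure is what lets each thread reduce its own blocks independently, with only the cheap final combine step serialized.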
Re: [Numpy-discussion] numexpr with the new iterator
On Mon, Jan 10, 2011 at 11:35 AM, Mark Wiebe wrote:
> I'm a bit curious why the jump from 1 to 2 threads is scaling so
> poorly. Your timings have improvement factors of 1.85, 1.68, 1.64,
> and 1.79. Since the computation is trivial data parallelism, and I
> believe it's still pretty far off the memory bandwidth limit, I
> would expect a speedup of 1.95 or higher.

It looks like it is the memory bandwidth which is limiting the
scalability. The slower operations scale much better than faster ones.
Below are some timings of successively faster operations. When the
operation is slow enough, it scales like I was expecting...

-Mark

Computing: 'cos(x**1.1) + sin(x**1.3) + tan(x**2.3)' with 2000 points
Using numpy:
*** Time elapsed: 14.47
Using numexpr:
*** Time elapsed for 1 threads: 12.659000
*** Time elapsed for 2 threads: 6.357000
*** Ratio from 1 to 2 threads: 1.991348
Using numexpr_iter:
*** Time elapsed for 1 threads: 12.573000
*** Time elapsed for 2 threads: 6.398000
*** Ratio from 1 to 2 threads: 1.965145

Computing: 'x**2.345' with 2000 points
Using numpy:
*** Time elapsed: 3.506
Using numexpr:
*** Time elapsed for 1 threads: 3.375000
*** Time elapsed for 2 threads: 1.747000
*** Ratio from 1 to 2 threads: 1.931883
Using numexpr_iter:
*** Time elapsed for 1 threads: 3.266000
*** Time elapsed for 2 threads: 1.76
*** Ratio from 1 to 2 threads: 1.855682

Computing: '1*x+2*x+3*x+4*x+5*x+6*x+7*x+8*x+9*x+10*x+11*x+12*x+13*x+14*x' with 2000 points
Using numpy:
*** Time elapsed: 9.774
Using numexpr:
*** Time elapsed for 1 threads: 1.314000
*** Time elapsed for 2 threads: 0.703000
*** Ratio from 1 to 2 threads: 1.869132
Using numexpr_iter:
*** Time elapsed for 1 threads: 1.257000
*** Time elapsed for 2 threads: 0.683000
*** Ratio from 1 to 2 threads: 1.840410

Computing: 'x+2.345' with 2000 points
Using numpy:
*** Time elapsed: 0.343
Using numexpr:
*** Time elapsed for 1 threads: 0.348000
*** Time elapsed for 2 threads: 0.30
*** Ratio from 1 to 2 threads: 1.16
Using numexpr_iter:
*** Time elapsed for 1 threads: 0.354000
*** Time elapsed for 2 threads: 0.293000
*** Ratio from 1 to 2 threads: 1.208191
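The bandwidth ceiling is easy to estimate: an elementwise kernel must stream every input and output array through memory at least once, so once bytes-moved divided by elapsed time approaches the machine's memory bandwidth, extra threads stop helping. A back-of-the-envelope sketch (the array size, stream count, and timing below are hypothetical, not taken from the benchmark output above):

```python
# Rough effective-bandwidth estimate for an elementwise kernel:
# bytes moved = elements * itemsize * number_of_streams (inputs + output).
# All numbers here are hypothetical, for illustration only.
def effective_bandwidth_gbs(n_elements, itemsize, n_streams, seconds):
    return n_elements * itemsize * n_streams / seconds / 1e9

# e.g. 10**8 float64 elements, one input + one output stream, 0.3 s:
bw = effective_bandwidth_gbs(10**8, 8, 2, 0.3)
print(round(bw, 2))  # -> 5.33 (GB/s); once this nears the machine's
                     # memory bandwidth, adding threads cannot help
```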
Re: [Numpy-discussion] numexpr with the new iterator
I'm a bit curious why the jump from 1 to 2 threads is scaling so
poorly. Your timings have improvement factors of 1.85, 1.68, 1.64, and
1.79. Since the computation is trivial data parallelism, and I believe
it's still pretty far off the memory bandwidth limit, I would expect a
speedup of 1.95 or higher.

One reason I suggest TBB is that it can produce a pretty good schedule
while still adapting to load produced by other processes and threads.
Numexpr currently does that well, but simply dividing the data into
one piece per thread doesn't handle that case very well, and makes it
possible that one thread spends a fair bit of time finishing up while
the others idle at the end. Perhaps using Cilk would be a better
option than TBB, since the code could remain in C.

-Mark

On Mon, Jan 10, 2011 at 3:55 AM, Francesc Alted wrote:
> On Monday 10 January 2011 11:05:27, Francesc Alted wrote:
> > Also, I'd like to try out the new thread scheduling that you
> > suggested to me privately (i.e. T0T1T0T1... vs T0T0...T1T1...).
>
> I've just implemented the new partition schema in numexpr
> (T0T0...T1T1..., the original being T0T1T0T1...). I'm attaching the
> patch for this. The results are a bit confusing. For example, using
> the attached benchmark (poly.py), I get these results for a common
> dual-core, non-NUMA machine:
>
> With the T0T1...T0T1... (original) schema:
> [clip: dual-core timings]
>
> With the T0T0...T1T1... (new) schema:
> [clip: dual-core timings]
>
> which is around 10% slower (2 threads) than the original partition.
>
> The results are a bit different on a NUMA machine (8 physical cores,
> 16 logical cores via hyper-threading):
> [clip: per-thread timings for 1-16 threads, both schemas]
>
> In this case, the performance is similar, with perhaps a slight
> advantage for the new partition scheme, but I don't know if it is
> worth making it the default (probably not, as this partition
> performs clearly worse on non-NUMA machines). At any rate, both
> partitions perform very close to the aggregated memory bandwidth of
> NUMA machines (around 10 GB/s in the above case).
>
> In general, I don't think there is much point in using Intel's TBB
> in numexpr because the existing implementation already hits memory
> bandwidth limits pretty early (around 10 threads in the latter
> example).
>
> --
> Francesc Alted
Re: [Numpy-discussion] numexpr with the new iterator
On Mon, Jan 10, 2011 at 9:47 AM, Francesc Alted wrote:
> so, the new code is just < 5% slower. I suppose that removing the
> NPY_ITER_ALIGNED flag would give us a bit more performance, but
> that's great as it is now. How did you do that? Your new_iter branch
> in NumPy already deals with unaligned data, right?

Take a look at lowlevel_strided_loops.c.src. In this case, the
buffering setup code calls PyArray_GetDTypeTransferFunction, which in
turn calls PyArray_GetStridedCopyFn, which on an x86 platform returns
_aligned_strided_to_contig_size8. This function has a simple loop of
copies using a npy_uint64 data type.

> > The new code also needs support for the reduce operation. I didn't
> > look too closely at the code for that, but a nested iteration
> > pattern is probably appropriate. If the inner loop is just allowed
> > to be one dimension, it could be done without actually creating
> > the inner iterator.
>
> Well, if you can support reduce operations with your patch that
> would be extremely good news, as I'm afraid that the current reduce
> code is a bit broken in Numexpr (at least, I vaguely remember seeing
> it working badly in some cases).

Cool, I'll take a look at some point. I imagine with the most obvious
implementation small reductions would perform poorly.

-Mark
Re: [Numpy-discussion] numexpr with the new iterator
On Monday 10 January 2011 17:54:16, Mark Wiebe wrote:
> > Apparently, you forgot to add the new_iterator_pywrap.h file.
>
> Oops, that's added now.

Excellent. It works now.

> The aligned case should just be a matter of conditionally removing
> the NPY_ITER_ALIGNED flag in two places.

Wow, the support for unaligned data in the current `evaluate_iter()`
seems pretty nice already:

$ python unaligned-simple.py
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Numexpr version: 1.5.dev
NumPy version: 2.0.0.dev-ebc963d
Python version: 2.6.1 (r261:67515, Feb 3 2009, 17:34:37)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]]
Platform: linux2-x86_64
AMD/Intel CPU? True
VML available? False
Detected cores: 2
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
NumPy aligned: 0.658 s
NumPy unaligned: 1.597 s
Numexpr aligned: 0.59 s
Numexpr aligned (new iter): 0.59 s
Numexpr unaligned: 0.51 s
Numexpr unaligned (new iter): 0.528 s

so, the new code is just < 5% slower. I suppose that removing the
NPY_ITER_ALIGNED flag would give us a bit more performance, but that's
great as it is now. How did you do that? Your new_iter branch in NumPy
already deals with unaligned data, right?

> The new code also needs support for the reduce operation. I didn't
> look too closely at the code for that, but a nested iteration
> pattern is probably appropriate. If the inner loop is just allowed
> to be one dimension, it could be done without actually creating the
> inner iterator.

Well, if you can support reduce operations with your patch that would
be extremely good news, as I'm afraid that the current reduce code is
a bit broken in Numexpr (at least, I vaguely remember seeing it
working badly in some cases).

--
Francesc Alted
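The unaligned use case mentioned here (columns of structured arrays) is easy to reproduce; a minimal sketch of how such a column ends up unaligned:

```python
import numpy as np

# A packed structured dtype: the leading 1-byte field pushes the
# float64 field to byte offset 1, and the 9-byte itemsize makes every
# element's address misaligned for 8-byte access.
a = np.zeros(1000, dtype=[('c', 'i1'), ('x', 'f8')])
col = a['x']

print(a.dtype.itemsize)      # -> 9
print(col.flags['ALIGNED'])  # the unaligned case that forces
                             # numexpr/NumPy to buffer-copy the data
```

Operating directly on `col` is exactly the pattern benchmarked by unaligned-simple.py above.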
Re: [Numpy-discussion] numexpr with the new iterator
On Mon, Jan 10, 2011 at 2:05 AM, Francesc Alted wrote:
> Your patch looks mostly fine to my eyes; good job! Unfortunately,
> I've been unable to compile your new_iterator branch of NumPy:
>
> numpy/core/src/multiarray/multiarraymodule.c:45:33: fatal error:
> new_iterator_pywrap.h: El fitxer o directori no existeix
>
> Apparently, you forgot to add the new_iterator_pywrap.h file.

Oops, that's added now.

> My idea would be to merge your patch in numexpr and make the new
> `evaluate_iter()` the default (i.e. make it `evaluate()`). However,
> by looking into the code, it seems to me that unaligned arrays (an
> important use case when operating with columns of structured arrays)
> may need more fine-tuning for Intel platforms. When I can compile
> the new_iterator branch, I'll give unaligned data benchmarks a try.

The aligned case should just be a matter of conditionally removing the
NPY_ITER_ALIGNED flag in two places.

The new code also needs support for the reduce operation. I didn't
look too closely at the code for that, but a nested iteration pattern
is probably appropriate. If the inner loop is just allowed to be one
dimension, it could be done without actually creating the inner
iterator.

-Mark
Re: [Numpy-discussion] numexpr with the new iterator
On Monday 10 January 2011 11:05:27, Francesc Alted wrote:
> Also, I'd like to try out the new thread scheduling that you
> suggested to me privately (i.e. T0T1T0T1... vs T0T0...T1T1...).

I've just implemented the new partition schema in numexpr
(T0T0...T1T1..., the original being T0T1T0T1...). I'm attaching the
patch for this. The results are a bit confusing. For example, using
the attached benchmark (poly.py), I get these results for a common
dual-core, non-NUMA machine:

With the T0T1...T0T1... (original) schema:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.497
Using numexpr:
*** Time elapsed for 1 threads: 1.279000
*** Time elapsed for 2 threads: 0.688000

With the T0T0...T1T1... (new) schema:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.454
Using numexpr:
*** Time elapsed for 1 threads: 1.268000
*** Time elapsed for 2 threads: 0.754000

which is around 10% slower (2 threads) than the original partition.

The results are a bit different on a NUMA machine (8 physical cores,
16 logical cores via hyper-threading):

With the T0T1...T0T1... (original) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.005
Using numexpr:
*** Time elapsed for 1 threads: 1.109000
*** Time elapsed for 2 threads: 0.677000
*** Time elapsed for 3 threads: 0.496000
*** Time elapsed for 4 threads: 0.394000
*** Time elapsed for 5 threads: 0.324000
*** Time elapsed for 6 threads: 0.287000
*** Time elapsed for 7 threads: 0.247000
*** Time elapsed for 8 threads: 0.234000
*** Time elapsed for 9 threads: 0.242000
*** Time elapsed for 10 threads: 0.239000
*** Time elapsed for 11 threads: 0.241000
*** Time elapsed for 12 threads: 0.235000
*** Time elapsed for 13 threads: 0.226000
*** Time elapsed for 14 threads: 0.214000
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.218000

With the T0T0...T1T1... (new) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.003
Using numexpr:
*** Time elapsed for 1 threads: 1.106000
*** Time elapsed for 2 threads: 0.617000
*** Time elapsed for 3 threads: 0.442000
*** Time elapsed for 4 threads: 0.345000
*** Time elapsed for 5 threads: 0.296000
*** Time elapsed for 6 threads: 0.257000
*** Time elapsed for 7 threads: 0.237000
*** Time elapsed for 8 threads: 0.26
*** Time elapsed for 9 threads: 0.245000
*** Time elapsed for 10 threads: 0.261000
*** Time elapsed for 11 threads: 0.238000
*** Time elapsed for 12 threads: 0.21
*** Time elapsed for 13 threads: 0.218000
*** Time elapsed for 14 threads: 0.20
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.198000

In this case, the performance is similar, with perhaps a slight
advantage for the new partition scheme, but I don't know if it is
worth making it the default (probably not, as this partition performs
clearly worse on non-NUMA machines). At any rate, both partitions
perform very close to the aggregated memory bandwidth of NUMA
machines (around 10 GB/s in the above case).

In general, I don't think there is much point in using Intel's TBB in
numexpr because the existing implementation already hits memory
bandwidth limits pretty early (around 10 threads in the latter
example).

--
Francesc Alted

Index: numexpr/interpreter.c
===================================================================
--- numexpr/interpreter.c	(revision 260)
+++ numexpr/interpreter.c	(working copy)
@@ -59,8 +59,6 @@
 int end_threads = 0;            /* should exisiting threads end? */
 pthread_t threads[MAX_THREADS]; /* opaque structure for threads */
 int tids[MAX_THREADS];          /* ID per each thread */
-intp gindex;                    /* global index for all threads */
-int init_sentinels_done;        /* sentinels initialized? */
 int giveup;                     /* should parallel code giveup? */
 int force_serial;               /* force serial code instead of parallel? */
 int pid = 0;                    /* the PID for this process */
@@ -1072,7 +1070,7 @@
     return 0;
 }

-/* VM engine for each threadi (general) */
+/* VM engine for each thread (general) */
 static inline int
 vm_engine_thread(char **mem, intp index, intp block_size,
                  struct vm_params params, int *pc_error)
@@ -1086,11 +1084,11 @@
 /* Do the worker job for a certain thread */
 void *th_worker(void *tids)
 {
-/* int tid = *(int *)tids; */
-intp index;                     /* private copy of gindex */
+int tid = *(int *)tids;
+intp index;
 /* Parameters for threads */
-intp start;
-intp vlen;
+intp start, stop;
+intp vlen, nblocks, th_nblocks;
 intp block_size;
 struct vm_params params;
 int *pc_error;
@@ -1103,8 +1101,6 @@
 while (1) {
-init_sentinels_done = 0;        /* sentinels have to be initialised yet */
-
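For reference, the two scheduling schemes being benchmarked can be modeled in a few lines of pure Python as a map from block index to the thread that processes it (a simplified sketch of the idea, not the C patch above):

```python
# Simplified model of the two block-scheduling schemes:
# - interleaved (T0T1T0T1...): blocks are dealt out round-robin
# - chunked (T0T0...T1T1...): each thread gets one contiguous run
def owner_interleaved(block, nthreads):
    return block % nthreads

def owner_chunked(block, nblocks, nthreads):
    per_thread = -(-nblocks // nthreads)  # ceiling division
    return block // per_thread

nblocks, nthreads = 8, 2
print([owner_interleaved(b, nthreads) for b in range(nblocks)])
# -> [0, 1, 0, 1, 0, 1, 0, 1]
print([owner_chunked(b, nblocks, nthreads) for b in range(nblocks)])
# -> [0, 0, 0, 0, 1, 1, 1, 1]
```

The chunked scheme gives each thread one long contiguous region (friendlier to NUMA memory locality), while the interleaved scheme keeps all threads working near the same region until the end (friendlier to shared caches on non-NUMA machines), which matches the measured behaviour above.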
Re: [Numpy-discussion] numexpr with the new iterator
On Sunday 09 January 2011 23:45:02, Mark Wiebe wrote:
> As a benchmark of C-based iterator usage, and to make it work
> properly in a multi-threaded context, I've updated numexpr to use
> the new iterator. In addition to some performance improvements, this
> also made it easy to add optional out= and order= parameters to the
> evaluate function. The numexpr repository with this update is
> available here:
>
> https://github.com/m-paradox/numexpr
>
> To use it, you need the new_iterator branch of NumPy from here:
>
> https://github.com/m-paradox/numpy
>
> In all cases tested, the iterator version of numexpr's evaluate
> function matches or beats the standard version. The timing results
> are below, with some explanatory comments placed inline:
[clip]

Your patch looks mostly fine to my eyes; good job! Unfortunately, I've
been unable to compile your new_iterator branch of NumPy:

numpy/core/src/multiarray/multiarraymodule.c:45:33: fatal error:
new_iterator_pywrap.h: El fitxer o directori no existeix

Apparently, you forgot to add the new_iterator_pywrap.h file.

My idea would be to merge your patch in numexpr and make the new
`evaluate_iter()` the default (i.e. make it `evaluate()`). However, by
looking into the code, it seems to me that unaligned arrays (an
important use case when operating with columns of structured arrays)
may need more fine-tuning for Intel platforms. When I can compile the
new_iterator branch, I'll give unaligned data benchmarks a try.

Also, I'd like to try out the new thread scheduling that you suggested
to me privately (i.e. T0T1T0T1... vs T0T0...T1T1...).

Thanks!

--
Francesc Alted
Re: [Numpy-discussion] numexpr with the new iterator
That's right, essentially all I've done is replace the code that
handled preparing the arrays and producing blocks of values for the
inner loops. There are three new parameters to evaluate_iter as well:
an "out=" parameter just like ufuncs have, an "order=" parameter which
controls the layout of the output if it's created by the function, and
a "casting=" parameter which controls what kinds of data conversions
are permitted.

-Mark

On Sun, Jan 9, 2011 at 3:33 PM, John Salvatier wrote:
> Is evaluate_iter basically numexpr but using your numpy branch, or
> are there other changes?
>
> On Sun, Jan 9, 2011 at 2:45 PM, Mark Wiebe wrote:
> > As a benchmark of C-based iterator usage, and to make it work
> > properly in a multi-threaded context, I've updated numexpr to use
> > the new iterator. In addition to some performance improvements,
> > this also made it easy to add optional out= and order= parameters
> > to the evaluate function. The numexpr repository with this update
> > is available here:
> >
> > https://github.com/m-paradox/numexpr
> >
> > To use it, you need the new_iterator branch of NumPy from here:
> >
> > https://github.com/m-paradox/numpy
> >
> > In all cases tested, the iterator version of numexpr's evaluate
> > function matches or beats the standard version. The timing results
> > are below, with some explanatory comments placed inline:
> >
> > [clip: IPython timing session]
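The out= and casting= semantics described here follow NumPy's ufunc conventions; a sketch in plain NumPy (not numexpr itself, which may not be installed) of the behaviour the new parameters mirror:

```python
import numpy as np

a = np.arange(5, dtype=np.float64)
b = np.ones(5, dtype=np.float64)

# out=: write the result into a preallocated array instead of
# allocating a fresh one (exactly as ufuncs do)
out = np.empty(5, dtype=np.float64)
np.add(a, b, out=out)
print(out)  # -> [1. 2. 3. 4. 5.]

# casting=: 'same_kind' permits float64 -> float32 on output,
# while 'safe' rejects the precision-losing conversion
out32 = np.empty(5, dtype=np.float32)
np.add(a, b, out=out32, casting='same_kind')  # fine
try:
    np.add(a, b, out=out32, casting='safe')   # would lose precision
except TypeError:
    print("safe casting refused float64 -> float32")
```

The order= parameter similarly matches NumPy's order argument ('C'/'F') for controlling the memory layout of a freshly created output.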
Re: [Numpy-discussion] numexpr with the new iterator
Is evaluate_iter basically numexpr but using your numpy branch, or are
there other changes?

On Sun, Jan 9, 2011 at 2:45 PM, Mark Wiebe wrote:
> As a benchmark of C-based iterator usage, and to make it work
> properly in a multi-threaded context, I've updated numexpr to use
> the new iterator. In addition to some performance improvements, this
> also made it easy to add optional out= and order= parameters to the
> evaluate function. The numexpr repository with this update is
> available here:
>
> https://github.com/m-paradox/numexpr
>
> To use it, you need the new_iterator branch of NumPy from here:
>
> https://github.com/m-paradox/numpy
>
> In all cases tested, the iterator version of numexpr's evaluate
> function matches or beats the standard version. The timing results
> are below, with some explanatory comments placed inline:
>
> -Mark
>
> In [1]: import numexpr as ne
>
> # numexpr front page example
>
> In [2]: a = np.arange(1e6)
> In [3]: b = np.arange(1e6)
>
> In [4]: timeit a**2 + b**2 + 2*a*b
> 1 loops, best of 3: 121 ms per loop
>
> In [5]: ne.set_num_threads(1)
>
> # iterator version performance matches standard version
>
> In [6]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
> 10 loops, best of 3: 24.8 ms per loop
> In [7]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
> 10 loops, best of 3: 24.3 ms per loop
>
> In [8]: ne.set_num_threads(2)
>
> # iterator version performance matches standard version
>
> In [9]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
> 10 loops, best of 3: 21 ms per loop
> In [10]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
> 10 loops, best of 3: 20.5 ms per loop
>
> # numexpr front page example with a 10x bigger array
>
> In [11]: a = np.arange(1e7)
> In [12]: b = np.arange(1e7)
>
> In [13]: ne.set_num_threads(2)
>
> # the iterator version performance improvement is due to
> # a small task scheduler tweak
>
> In [14]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
> 1 loops, best of 3: 282 ms per loop
> In [15]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
> 1 loops, best of 3: 255 ms per loop
>
> # numexpr front page example with a Fortran contiguous array
>
> In [16]: a = np.arange(1e7).reshape(10,100,100,100).T
> In [17]: b = np.arange(1e7).reshape(10,100,100,100).T
>
> In [18]: timeit a**2 + b**2 + 2*a*b
> 1 loops, best of 3: 3.22 s per loop
>
> In [19]: ne.set_num_threads(1)
>
> # even with a C-ordered output, the iterator version performs better
>
> In [20]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
> 1 loops, best of 3: 3.74 s per loop
> In [21]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
> 1 loops, best of 3: 379 ms per loop
> In [22]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
> 1 loops, best of 3: 2.03 s per loop
>
> In [23]: ne.set_num_threads(2)
>
> # the standard version just uses 1 thread here, I believe;
> # the iterator version performs the same as for the flat
> # 1e7-sized array
>
> In [24]: timeit ne.evaluate("a**2 + b**2 + 2*a*b")
> 1 loops, best of 3: 3.92 s per loop
> In [25]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b")
> 1 loops, best of 3: 254 ms per loop
> In [26]: timeit ne.evaluate_iter("a**2 + b**2 + 2*a*b", order='C')
> 1 loops, best of 3: 1.74 s per loop