Re: [Numpy-discussion] numexpr with the new iterator
On Mon, Jan 10, 2011 at 11:35 AM, Mark Wiebe wrote:
> I'm a bit curious why the jump from 1 to 2 threads is scaling so poorly.
> Your timings have improvement factors of 1.85, 1.68, 1.64, and 1.79. Since
> the computation is trivial data parallelism, and I believe it's still pretty
> far off the memory bandwidth limit, I would expect a speedup of 1.95 or
> higher.

It looks like it is the memory bandwidth which is limiting the
scalability. The slower operations scale much better than faster ones.
Below are some timings of successively faster operations. When the
operation is slow enough, it scales like I was expecting...

-Mark

Computing: 'cos(x**1.1) + sin(x**1.3) + tan(x**2.3)' with 2000 points
Using numpy:
*** Time elapsed: 14.47
Using numexpr:
*** Time elapsed for 1 threads: 12.659000
*** Time elapsed for 2 threads: 6.357000
*** Ratio from 1 to 2 threads: 1.991348
Using numexpr_iter:
*** Time elapsed for 1 threads: 12.573000
*** Time elapsed for 2 threads: 6.398000
*** Ratio from 1 to 2 threads: 1.965145

Computing: 'x**2.345' with 2000 points
Using numpy:
*** Time elapsed: 3.506
Using numexpr:
*** Time elapsed for 1 threads: 3.375000
*** Time elapsed for 2 threads: 1.747000
*** Ratio from 1 to 2 threads: 1.931883
Using numexpr_iter:
*** Time elapsed for 1 threads: 3.266000
*** Time elapsed for 2 threads: 1.76
*** Ratio from 1 to 2 threads: 1.855682

Computing: '1*x+2*x+3*x+4*x+5*x+6*x+7*x+8*x+9*x+10*x+11*x+12*x+13*x+14*x' with 2000 points
Using numpy:
*** Time elapsed: 9.774
Using numexpr:
*** Time elapsed for 1 threads: 1.314000
*** Time elapsed for 2 threads: 0.703000
*** Ratio from 1 to 2 threads: 1.869132
Using numexpr_iter:
*** Time elapsed for 1 threads: 1.257000
*** Time elapsed for 2 threads: 0.683000
*** Ratio from 1 to 2 threads: 1.840410

Computing: 'x+2.345' with 2000 points
Using numpy:
*** Time elapsed: 0.343
Using numexpr:
*** Time elapsed for 1 threads: 0.348000
*** Time elapsed for 2 threads: 0.30
*** Ratio from 1 to 2 threads: 1.16
Using numexpr_iter:
*** Time elapsed for 1 threads: 0.354000
*** Time elapsed for 2 threads: 0.293000
*** Ratio from 1 to 2 threads: 1.208191

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numexpr with the new iterator
I'm a bit curious why the jump from 1 to 2 threads is scaling so poorly.

Your timings have improvement factors of 1.85, 1.68, 1.64, and 1.79.
Since the computation is trivial data parallelism, and I believe it's
still pretty far off the memory bandwidth limit, I would expect a
speedup of 1.95 or higher.

One reason I suggest TBB is that it can produce a pretty good schedule
while still adapting to load produced by other processes and threads.
Numexpr currently does that well, but simply dividing the data into one
piece per thread doesn't handle that case very well, and makes it
possible that one thread spends a fair bit of time finishing up while
the others idle at the end. Perhaps using Cilk would be a better option
than TBB, since the code could remain in C.

-Mark

On Mon, Jan 10, 2011 at 3:55 AM, Francesc Alted wrote:
> A Monday 10 January 2011 11:05:27 Francesc Alted escrigué:
> > Also, I'd like to try out the new thread scheduling that you
> > suggested to me privately (i.e. T0T1T0T1... vs T0T0...T1T1...).
>
> I've just implemented the new partition schema in numexpr
> (T0T0...T1T1..., being the original T0T1T0T1...). I'm attaching the
> patch for this. The results are a bit confusing. For example, using
> the attached benchmark (poly.py), I get these results for a common
> dual-core, non-NUMA machine:
>
> With the T0T1...T0T1... (original) schema:
>
> Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
> Using numpy:
> *** Time elapsed: 3.497
> Using numexpr:
> *** Time elapsed for 1 threads: 1.279000
> *** Time elapsed for 2 threads: 0.688000
>
> With the T0T0...T1T1... (new) schema:
>
> Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
> Using numpy:
> *** Time elapsed: 3.454
> Using numexpr:
> *** Time elapsed for 1 threads: 1.268000
> *** Time elapsed for 2 threads: 0.754000
>
> which is around 10% slower (2 threads) than the original partition.
>
> The results are a bit different on a NUMA machine (8 physical cores,
> 16 logical cores via hyper-threading):
>
> With the T0T1...T0T1... (original) partition:
>
> Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
> Using numpy:
> *** Time elapsed: 3.005
> Using numexpr:
> *** Time elapsed for 1 threads: 1.109000
> *** Time elapsed for 2 threads: 0.677000
> *** Time elapsed for 3 threads: 0.496000
> *** Time elapsed for 4 threads: 0.394000
> *** Time elapsed for 5 threads: 0.324000
> *** Time elapsed for 6 threads: 0.287000
> *** Time elapsed for 7 threads: 0.247000
> *** Time elapsed for 8 threads: 0.234000
> *** Time elapsed for 9 threads: 0.242000
> *** Time elapsed for 10 threads: 0.239000
> *** Time elapsed for 11 threads: 0.241000
> *** Time elapsed for 12 threads: 0.235000
> *** Time elapsed for 13 threads: 0.226000
> *** Time elapsed for 14 threads: 0.214000
> *** Time elapsed for 15 threads: 0.235000
> *** Time elapsed for 16 threads: 0.218000
>
> With the T0T0...T1T1... (new) partition:
>
> Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
> Using numpy:
> *** Time elapsed: 3.003
> Using numexpr:
> *** Time elapsed for 1 threads: 1.106000
> *** Time elapsed for 2 threads: 0.617000
> *** Time elapsed for 3 threads: 0.442000
> *** Time elapsed for 4 threads: 0.345000
> *** Time elapsed for 5 threads: 0.296000
> *** Time elapsed for 6 threads: 0.257000
> *** Time elapsed for 7 threads: 0.237000
> *** Time elapsed for 8 threads: 0.26
> *** Time elapsed for 9 threads: 0.245000
> *** Time elapsed for 10 threads: 0.261000
> *** Time elapsed for 11 threads: 0.238000
> *** Time elapsed for 12 threads: 0.21
> *** Time elapsed for 13 threads: 0.218000
> *** Time elapsed for 14 threads: 0.20
> *** Time elapsed for 15 threads: 0.235000
> *** Time elapsed for 16 threads: 0.198000
>
> In this case, the performance is similar, with perhaps a slight
> advantage for the new partition scheme, but I don't know if it is
> worth making it the default (probably not, as this partition performs
> clearly worse on non-NUMA machines). At any rate, both partitions
> perform very close to the aggregated memory bandwidth of NUMA
> machines (around 10 GB/s in the above case).
>
> In general, I don't think there is much point in using Intel's TBB in
> numexpr because the existing implementation already hits memory
> bandwidth limits pretty early (around 10 threads in the latter
> example).
>
> --
> Francesc Alted
Re: [Numpy-discussion] indexing of rank-0 structured arrays: why not?
On Mon, Jan 10, 2011 at 12:15, Nils Becker wrote:
> Robert,
>
> your answer does work: after indexing with () I can then further index
> into the datatype.
>
> In [115]: a_rank_0[()][0]
> Out[115]: 0.0
>
> I guess I just found the fact confusing that a_rank_1[0] and a_rank_0
> compare and print equal but behave differently under indexing.

They do not print equal. Many things compare equal but do not behave
the same.

> More precisely if I do
>
> In [117]: b = a_rank_1[0]
>
> then
>
> In [118]: b.shape
> Out[118]: ()
>
> and
>
> In [120]: a_rank_0 == b
> Out[120]: True
>
> but
>
> In [119]: b[0]
> Out[119]: 0.0
>
> works but a_rank_0[0] doesn't. I thought b is a rank-0 array which it
> apparently is not since it can be indexed. So maybe b[0] should fail
> for consistency?

No, b is a record scalar. It can be indexed because it is often
convenient to treat such records like tuples. This replaces the default
indexing behavior of scalars (which is to simply disallow indexing).
a_rank_0 is an array, so the array indexing semantics are the default,
and we do not change them.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth." -- Umberto Eco
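[Editorial note: a minimal session illustrating the distinction Robert describes. The two-field dtype is an assumption, reconstructed from the (0.0, 0.0) output elsewhere in the thread.]

```python
import numpy as np

# Hypothetical two-field dtype, analogous to the one in this thread.
dt = np.dtype([('x', '<f8'), ('y', '<f8')])

a_rank_1 = np.zeros(1, dtype=dt)
a_rank_0 = np.zeros((), dtype=dt)

b = a_rank_1[0]        # a record (void) scalar, not a rank-0 array
print(type(b))         # numpy.void: record scalars allow tuple-like indexing
print(b[0])            # first field of the record -> 0.0

print(a_rank_0['x'])   # field access works on the rank-0 array
print(a_rank_0[()])    # the empty tuple extracts the record scalar
# a_rank_0[0] raises IndexError: the array has no axis to index with an integer
```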
Re: [Numpy-discussion] numexpr with the new iterator
On Mon, Jan 10, 2011 at 9:47 AM, Francesc Alted wrote:
> so, the new code is just < 5% slower. I suppose that removing the
> NPY_ITER_ALIGNED flag would give us a bit more performance, but that's
> great as it is now. How did you do that? Your new_iter branch in NumPy
> already deals with unaligned data, right?

Take a look at lowlevel_strided_loops.c.src. In this case, the
buffering setup code calls PyArray_GetDTypeTransferFunction, which in
turn calls PyArray_GetStridedCopyFn, which on an x86 platform returns
_aligned_strided_to_contig_size8. This function has a simple loop of
copies using a npy_uint64 data type.

> > The new code also needs support for the reduce operation. I didn't
> > look too closely at the code for that, but a nested iteration
> > pattern is probably appropriate. If the inner loop is just allowed
> > to be one dimension, it could be done without actually creating the
> > inner iterator.
>
> Well, if you can support reduce operations with your patch that would
> be extremely good news as I'm afraid that the current reduce code is a
> bit broken in Numexpr (at least, I vaguely remember seeing it working
> badly in some cases).

Cool, I'll take a look at some point. I imagine with the most obvious
implementation small reductions would perform poorly.

-Mark
Re: [Numpy-discussion] indexing of rank-0 structured arrays: why not?
Robert,

your answer does work: after indexing with () I can then further index
into the datatype.

In [115]: a_rank_0[()][0]
Out[115]: 0.0

I guess I just found the fact confusing that a_rank_1[0] and a_rank_0
compare and print equal but behave differently under indexing.

More precisely if I do

In [117]: b = a_rank_1[0]

then

In [118]: b.shape
Out[118]: ()

and

In [120]: a_rank_0 == b
Out[120]: True

but

In [119]: b[0]
Out[119]: 0.0

works but a_rank_0[0] doesn't. I thought b is a rank-0 array, which it
apparently is not since it can be indexed. So maybe b[0] should fail
for consistency?

N.
Re: [Numpy-discussion] numexpr with the new iterator
A Monday 10 January 2011 17:54:16 Mark Wiebe escrigué:
> > Apparently, you forgot to add the new_iterator_pywrap.h file.
>
> Oops, that's added now.

Excellent. It works now.

> The aligned case should just be a matter of conditionally removing
> the NPY_ITER_ALIGNED flag in two places.

Wow, the support for unaligned data in the current `evaluate_iter()`
seems pretty nice already:

$ python unaligned-simple.py
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Numexpr version:  1.5.dev
NumPy version:    2.0.0.dev-ebc963d
Python version:   2.6.1 (r261:67515, Feb 3 2009, 17:34:37)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]]
Platform:         linux2-x86_64
AMD/Intel CPU?    True
VML available?    False
Detected cores:   2
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
NumPy aligned:                 0.658 s
NumPy unaligned:               1.597 s
Numexpr aligned:               0.59 s
Numexpr aligned (new iter):    0.59 s
Numexpr unaligned:             0.51 s
Numexpr unaligned (new_iter):  0.528 s

so, the new code is just < 5% slower. I suppose that removing the
NPY_ITER_ALIGNED flag would give us a bit more performance, but that's
great as it is now. How did you do that? Your new_iter branch in NumPy
already deals with unaligned data, right?

> The new code also needs support for the reduce operation. I didn't
> look too closely at the code for that, but a nested iteration
> pattern is probably appropriate. If the inner loop is just allowed
> to be one dimension, it could be done without actually creating the
> inner iterator.

Well, if you can support reduce operations with your patch that would
be extremely good news as I'm afraid that the current reduce code is a
bit broken in Numexpr (at least, I vaguely remember seeing it working
badly in some cases).

--
Francesc Alted
Re: [Numpy-discussion] indexing of rank-0 structured arrays: why not?
On Mon, Jan 10, 2011 at 10:08, Nils Becker wrote:
> Hi,
>
> I noticed that I can index into a dtype when I take an element
> of a rank-1 array but not if I make a rank-0 array directly. This
> seems inconsistent. A bug?

Not a bug. Since there is no axis, you cannot use integers to index
into a rank-0 array. Use an empty tuple instead.

[~]
|1> dt = np.dtype([('x', '<f8'), ('y', '<f8')])

[~]
|2> a_rank_0 = np.zeros((), dtype=dt)

[~]
|3> a_rank_0[()]
(0.0, 0.0)

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth." -- Umberto Eco
Re: [Numpy-discussion] numexpr with the new iterator
On Mon, Jan 10, 2011 at 2:05 AM, Francesc Alted wrote:
>
> Your patch looks mostly fine to my eyes; good job! Unfortunately, I've
> been unable to compile your new_iterator branch of NumPy:
>
> numpy/core/src/multiarray/multiarraymodule.c:45:33: fatal error:
> new_iterator_pywrap.h: El fitxer o directori no existeix
>
> Apparently, you forgot to add the new_iterator_pywrap.h file.

Oops, that's added now.

> My idea would be to merge your patch in numexpr and make the new
> `evaluate_iter()` the default (i.e. make it `evaluate()`). However, by
> looking into the code, it seems to me that unaligned arrays (this is
> an important use case when operating with columns of structured
> arrays) may need more fine-tuning for Intel platforms. When I can
> compile the new_iterator branch, I'll give a try at unaligned data
> benchs.

The aligned case should just be a matter of conditionally removing the
NPY_ITER_ALIGNED flag in two places.

The new code also needs support for the reduce operation. I didn't look
too closely at the code for that, but a nested iteration pattern is
probably appropriate. If the inner loop is just allowed to be one
dimension, it could be done without actually creating the inner
iterator.

-Mark
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
Hi,

Spatial hashes are the common solution. Another common optimization is
using the distance squared for collision detection, since you do not
need the expensive sqrt for this calc.

cu.

On Mon, Jan 10, 2011 at 3:25 PM, Pascal wrote:
> Hi,
>
> On 01/10/2011 09:09 AM, EMMEL Thomas wrote:
> >
> > No I didn't, due to the fact that these values are coordinates in 3D
> > (x,y,z). In fact I work with a list/array/tuple of arrays with 10 to
> > 1M of elements or more.
> > What I need to do is to calculate the distance of each of these
> > elements (coordinates) to a given coordinate and filter for the
> > nearest. The brute force method would look like this:
> >
> > #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > def bruteForceSearch(points, point):
> >
> >     minpt = min([(vec2Norm(pt, point), pt, i)
> >                  for i, pt in enumerate(points)], key=itemgetter(0))
> >     return sqrt(minpt[0]), minpt[1], minpt[2]
> >
> > #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > def vec2Norm(pt1,pt2):
> >     xDis = pt1[0]-pt2[0]
> >     yDis = pt1[1]-pt2[1]
> >     zDis = pt1[2]-pt2[2]
> >     return xDis*xDis+yDis*yDis+zDis*zDis
>
> I am not sure I understood the problem properly, but here is what I
> would use to calculate a distance from horizontally stacked vectors
> (big):
>
> ref=numpy.array([0.1,0.2,0.3])
> big=numpy.random.randn(100, 3)
>
> big=numpy.add(big,-ref)
> distsquared=numpy.sum(big**2, axis=1)
>
> Pascal
[Numpy-discussion] indexing of rank-0 structured arrays: why not?
Hi,

I noticed that I can index into a dtype when I take an element of a
rank-1 array but not if I make a rank-0 array directly. This seems
inconsistent. A bug?

Nils

In [76]: np.version.version
Out[76]: '1.5.1'

In [78]: dt = np.dtype([('x', '<f8'), ('y', '<f8')])
[...]
IndexError: 0-d arrays can't be indexed

In [87]: a_rank_0['x']
Out[87]: array(0.0)
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
Hi,

On 01/10/2011 09:09 AM, EMMEL Thomas wrote:
>
> No I didn't, due to the fact that these values are coordinates in 3D
> (x,y,z). In fact I work with a list/array/tuple of arrays with 10 to
> 1M of elements or more.
> What I need to do is to calculate the distance of each of these
> elements (coordinates) to a given coordinate and filter for the
> nearest. The brute force method would look like this:
>
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> def bruteForceSearch(points, point):
>
>     minpt = min([(vec2Norm(pt, point), pt, i)
>                  for i, pt in enumerate(points)], key=itemgetter(0))
>     return sqrt(minpt[0]), minpt[1], minpt[2]
>
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> def vec2Norm(pt1,pt2):
>     xDis = pt1[0]-pt2[0]
>     yDis = pt1[1]-pt2[1]
>     zDis = pt1[2]-pt2[2]
>     return xDis*xDis+yDis*yDis+zDis*zDis

I am not sure I understood the problem properly, but here is what I
would use to calculate a distance from horizontally stacked vectors
(big):

ref=numpy.array([0.1,0.2,0.3])
big=numpy.random.randn(100, 3)

big=numpy.add(big,-ref)
distsquared=numpy.sum(big**2, axis=1)

Pascal
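[Editorial note: Pascal's squared-distance idea extends directly to the nearest-point query Thomas described, by adding an argmin over the squared norms. A minimal sketch (the function name is mine, not from the thread):]

```python
import numpy as np

def nearest_point(points, point):
    """Return (distance, nearest point, index) for the row of `points`
    (an (N, 3) array) closest to `point`, fully vectorized."""
    diff = points - point                       # broadcast over all rows
    dist2 = np.einsum('ij,ij->i', diff, diff)   # row-wise squared norms
    i = np.argmin(dist2)                        # sqrt only once, at the end
    return np.sqrt(dist2[i]), points[i], i

pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0]])
d, p, i = nearest_point(pts, np.array([0.9, 0.1, 0.0]))
# i -> 1, d -> sqrt(0.02)
```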
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
Hey back...

> > #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > def bruteForceSearch(points, point):
> >
> >     minpt = min([(vec2Norm(pt, point), pt, i)
> >                  for i, pt in enumerate(points)], key=itemgetter(0))
> >     return sqrt(minpt[0]), minpt[1], minpt[2]
> >
> > #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > def vec2Norm(pt1,pt2):
> >     xDis = pt1[0]-pt2[0]
> >     yDis = pt1[1]-pt2[1]
> >     zDis = pt1[2]-pt2[2]
> >     return xDis*xDis+yDis*yDis+zDis*zDis
> >
> > I have a more clever method but it still takes a lot of time in the
> > vec2norm-function.
> > If you like I can attach a running example.
>
> if you use the vec2Norm function as you wrote it there, this code is
> not vectorized at all, and as such of course numpy would be slowest
> as it has the most overhead and no advantages for non-vectorized
> code; you simply can't write python code like that and expect it to
> be fast for these kinds of calculations.
>
> Your function should look more like this:
>
> import numpy as np
>
> def bruteForceSearch(points, point):
>     dists = points - point
>     # that may need point[None,:] or such for broadcasting to work
>     dists *= dists
>     dists = dists.sum(1)
>     I = np.argmin(dists)
>     return np.sqrt(dists[I]), points[I], I
>
> If points is small, this may not help much (though compared to this
> exact code my guess is it probably would); if points is larger it
> should speed up things tremendously (unless you run into RAM
> problems). It may be that you need to fiddle around with axes, I did
> not check the code.
> If this is not good enough for you, you will need to port it (and
> maybe the next outer loop as well) to Cython or write it in C/C++ and
> make sure it can optimize things right. Also I think somewhere in
> scipy there were some distance tools that may be already in C and
> nicely fast, but not sure.
>
> I hope I got this right and it helps,
>
> Sebastian

I see the point, and it was very helpful for understanding the behavior
of the arrays a bit better. Your version improves the bruteForceSearch,
which becomes up to 6 times faster.

But in the case of a leaf of a kd-tree you end up with 50, 20, 10 or
fewer points, where the speed-up is reversed. In this particular case
34000 runs take 90s with your method and 50s with mine (not the
bruteForce).

I see now the limits of the arrays, but of course I also see the
chances, and - coming back to my original question - it seems that
Numeric arrays were faster for my kind of application, but they might
be slower for larger amounts of data.

Regards

Thomas
[Numpy-discussion] Drawing circles in a numpy array
Hi all,

I have this problem: given some point, draw a circle centered at this
point with radius r. I'm doing that using numpy this way (snippet code
from here [1]):

>>> # Create the initial black and white image
>>> import numpy as np
>>> from scipy import ndimage
>>> a = np.zeros((512, 512)).astype(uint8)  # unsigned integer type needed by watershed
>>> y, x = np.ogrid[0:512, 0:512]
>>> m1 = ((y-200)**2 + (x-100)**2 < 30**2)
>>> m2 = ((y-350)**2 + (x-400)**2 < 20**2)
>>> m3 = ((y-260)**2 + (x-200)**2 < 20**2)
>>> a[m1+m2+m3] = 1
>>> imshow(a, cmap=cm.gray)  # left plot in the image above

The problem is that it has to evaluate all values from 0 to the image
size (in the snippet, from 0 to 512 in the X and Y dimensions). Is
there a faster way of doing that, without evaluating all those values?
For example: only evaluate from 0 to 30, in a circle centered at
(0, 0) with radius 30.

Thanks!

Thiago Franco de Moraes

[1] - http://www.scipy.org/Cookbook/Watershed
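[Editorial note: one answer to the question above is to restrict the ogrid to the circle's bounding box, so the mask is evaluated over at most (2r+1)^2 pixels instead of the whole image. A sketch under that assumption, not from the thread:]

```python
import numpy as np

def draw_circle(img, cy, cx, r, value=1):
    """Set `value` inside the circle centered at (cy, cx) with radius r,
    evaluating the mask only over the circle's bounding box."""
    h, w = img.shape
    y0, y1 = max(cy - r, 0), min(cy + r + 1, h)   # clip box to the image
    x0, x1 = max(cx - r, 0), min(cx + r + 1, w)
    y, x = np.ogrid[y0:y1, x0:x1]                 # small open grid only
    mask = (y - cy)**2 + (x - cx)**2 < r**2
    img[y0:y1, x0:x1][mask] = value               # slicing returns a view
    return img

a = np.zeros((512, 512), dtype=np.uint8)
draw_circle(a, 200, 100, 30)                      # same circle as m1 above
```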
Re: [Numpy-discussion] numexpr with the new iterator
A Monday 10 January 2011 11:05:27 Francesc Alted escrigué:
> Also, I'd like to try out the new thread scheduling that you
> suggested to me privately (i.e. T0T1T0T1... vs T0T0...T1T1...).

I've just implemented the new partition schema in numexpr
(T0T0...T1T1..., being the original T0T1T0T1...). I'm attaching the
patch for this. The results are a bit confusing. For example, using
the attached benchmark (poly.py), I get these results for a common
dual-core, non-NUMA machine:

With the T0T1...T0T1... (original) schema:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.497
Using numexpr:
*** Time elapsed for 1 threads: 1.279000
*** Time elapsed for 2 threads: 0.688000

With the T0T0...T1T1... (new) schema:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.454
Using numexpr:
*** Time elapsed for 1 threads: 1.268000
*** Time elapsed for 2 threads: 0.754000

which is around 10% slower (2 threads) than the original partition.

The results are a bit different on a NUMA machine (8 physical cores, 16
logical cores via hyper-threading):

With the T0T1...T0T1... (original) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.005
Using numexpr:
*** Time elapsed for 1 threads: 1.109000
*** Time elapsed for 2 threads: 0.677000
*** Time elapsed for 3 threads: 0.496000
*** Time elapsed for 4 threads: 0.394000
*** Time elapsed for 5 threads: 0.324000
*** Time elapsed for 6 threads: 0.287000
*** Time elapsed for 7 threads: 0.247000
*** Time elapsed for 8 threads: 0.234000
*** Time elapsed for 9 threads: 0.242000
*** Time elapsed for 10 threads: 0.239000
*** Time elapsed for 11 threads: 0.241000
*** Time elapsed for 12 threads: 0.235000
*** Time elapsed for 13 threads: 0.226000
*** Time elapsed for 14 threads: 0.214000
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.218000

With the T0T0...T1T1... (new) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.003
Using numexpr:
*** Time elapsed for 1 threads: 1.106000
*** Time elapsed for 2 threads: 0.617000
*** Time elapsed for 3 threads: 0.442000
*** Time elapsed for 4 threads: 0.345000
*** Time elapsed for 5 threads: 0.296000
*** Time elapsed for 6 threads: 0.257000
*** Time elapsed for 7 threads: 0.237000
*** Time elapsed for 8 threads: 0.26
*** Time elapsed for 9 threads: 0.245000
*** Time elapsed for 10 threads: 0.261000
*** Time elapsed for 11 threads: 0.238000
*** Time elapsed for 12 threads: 0.21
*** Time elapsed for 13 threads: 0.218000
*** Time elapsed for 14 threads: 0.20
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.198000

In this case, the performance is similar, with perhaps a slight
advantage for the new partition scheme, but I don't know if it is worth
making it the default (probably not, as this partition performs clearly
worse on non-NUMA machines). At any rate, both partitions perform very
close to the aggregated memory bandwidth of NUMA machines (around 10
GB/s in the above case).

In general, I don't think there is much point in using Intel's TBB in
numexpr because the existing implementation already hits memory
bandwidth limits pretty early (around 10 threads in the latter example).

--
Francesc Alted

Index: numexpr/interpreter.c
===================================================================
--- numexpr/interpreter.c       (revision 260)
+++ numexpr/interpreter.c       (working copy)
@@ -59,8 +59,6 @@
 int end_threads = 0;            /* should exisiting threads end? */
 pthread_t threads[MAX_THREADS]; /* opaque structure for threads */
 int tids[MAX_THREADS];          /* ID per each thread */
-intp gindex;                    /* global index for all threads */
-int init_sentinels_done;        /* sentinels initialized? */
 int giveup;                     /* should parallel code giveup? */
 int force_serial;               /* force serial code instead of parallel? */
 int pid = 0;                    /* the PID for this process */
@@ -1072,7 +1070,7 @@
     return 0;
 }

-/* VM engine for each threadi (general) */
+/* VM engine for each thread (general) */
 static inline int
 vm_engine_thread(char **mem, intp index, intp block_size,
                  struct vm_params params, int *pc_error)
@@ -1086,11 +1084,11 @@
 /* Do the worker job for a certain thread */
 void *th_worker(void *tids)
 {
-    /* int tid = *(int *)tids; */
-    intp index;                 /* private copy of gindex */
+    int tid = *(int *)tids;
+    intp index;
     /* Parameters for threads */
-    intp start;
-    intp vlen;
+    intp start, stop;
+    intp vlen, nblocks, th_nblocks;
     intp block_size;
     struct vm_params params;
     int *pc_error;
@@ -1103,8 +1101,6 @@

     while (1) {

-        init_sentinels_done = 0;  /* sentinels have to be initialised yet */
-
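[Editorial note: a standalone sketch of the two block-partition schemes being compared above — not code from the patch. In the interleaved T0T1T0T1... scheme each thread takes every nthreads-th block; in the blocked T0T0...T1T1... scheme each thread takes one contiguous range of blocks.]

```python
def interleaved_blocks(nblocks, nthreads, tid):
    """T0T1T0T1...: thread `tid` processes every `nthreads`-th block."""
    return list(range(tid, nblocks, nthreads))

def contiguous_blocks(nblocks, nthreads, tid):
    """T0T0...T1T1...: thread `tid` processes one contiguous chunk."""
    per_thread = (nblocks + nthreads - 1) // nthreads  # ceiling division
    start = tid * per_thread
    stop = min(start + per_thread, nblocks)
    return list(range(start, stop))

# With 8 blocks and 2 threads:
# interleaved_blocks(8, 2, 0) -> [0, 2, 4, 6]
# contiguous_blocks(8, 2, 1)  -> [4, 5, 6, 7]
```

On a NUMA machine the contiguous scheme tends to keep each thread's blocks in memory attached to that thread's node, which is consistent with the small advantage Francesc measured there.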
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
Hey,

On Mon, 2011-01-10 at 08:09 +, EMMEL Thomas wrote:
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> def bruteForceSearch(points, point):
>
>     minpt = min([(vec2Norm(pt, point), pt, i)
>                  for i, pt in enumerate(points)], key=itemgetter(0))
>     return sqrt(minpt[0]), minpt[1], minpt[2]
>
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> def vec2Norm(pt1,pt2):
>     xDis = pt1[0]-pt2[0]
>     yDis = pt1[1]-pt2[1]
>     zDis = pt1[2]-pt2[2]
>     return xDis*xDis+yDis*yDis+zDis*zDis
>
> I have a more clever method but it still takes a lot of time in the
> vec2norm-function.
> If you like I can attach a running example.

if you use the vec2Norm function as you wrote it there, this code is
not vectorized at all, and as such of course numpy would be slowest as
it has the most overhead and no advantages for non-vectorized code; you
simply can't write python code like that and expect it to be fast for
these kinds of calculations.

Your function should look more like this:

import numpy as np

def bruteForceSearch(points, point):
    dists = points - point
    # that may need point[None,:] or such for broadcasting to work
    dists *= dists
    dists = dists.sum(1)
    I = np.argmin(dists)
    return np.sqrt(dists[I]), points[I], I

If points is small, this may not help much (though compared to this
exact code my guess is it probably would); if points is larger it
should speed up things tremendously (unless you run into RAM problems).
It may be that you need to fiddle around with axes, I did not check the
code.

If this is not good enough for you, you will need to port it (and maybe
the next outer loop as well) to Cython or write it in C/C++ and make
sure it can optimize things right. Also I think somewhere in scipy
there were some distance tools that may be already in C and nicely
fast, but not sure.

I hope I got this right and it helps,

Sebastian
Re: [Numpy-discussion] numexpr with the new iterator
A Sunday 09 January 2011 23:45:02 Mark Wiebe escrigué:
> As a benchmark of C-based iterator usage and to make it work properly
> in a multi-threaded context, I've updated numexpr to use the new
> iterator. In addition to some performance improvements, this also
> made it easy to add optional out= and order= parameters to the
> evaluate function. The numexpr repository with this update is
> available here:
>
> https://github.com/m-paradox/numexpr
>
> To use it, you need the new_iterator branch of NumPy from here:
>
> https://github.com/m-paradox/numpy
>
> In all cases tested, the iterator version of numexpr's evaluate
> function matches or beats the standard version. The timing results
> are below, with some explanatory comments placed inline:

[clip]

Your patch looks mostly fine to my eyes; good job! Unfortunately, I've
been unable to compile your new_iterator branch of NumPy:

numpy/core/src/multiarray/multiarraymodule.c:45:33: fatal error:
new_iterator_pywrap.h: El fitxer o directori no existeix

Apparently, you forgot to add the new_iterator_pywrap.h file.

My idea would be to merge your patch in numexpr and make the new
`evaluate_iter()` the default (i.e. make it `evaluate()`). However, by
looking into the code, it seems to me that unaligned arrays (this is an
important use case when operating with columns of structured arrays)
may need more fine-tuning for Intel platforms. When I can compile the
new_iterator branch, I'll give a try at unaligned data benchs.

Also, I'd like to try out the new thread scheduling that you suggested
to me privately (i.e. T0T1T0T1... vs T0T0...T1T1...).

Thanks!

--
Francesc Alted
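[Editorial note: a short sketch of how the unaligned case Francesc mentions arises from structured-array columns. The field layout is an assumption chosen so the float column lands at an odd byte offset.]

```python
import numpy as np

# A packed structured dtype: the 8-byte 'y' field sits at byte offset 1,
# so any view of that column is unaligned for 8-byte doubles.
dt = np.dtype([('x', 'i1'), ('y', '<f8')])   # itemsize 9, no padding
arr = np.zeros(1000, dtype=dt)

col = arr['y']              # strided view into the record array
print(col.strides)          # (9,): one record per element
print(col.flags.aligned)    # False on platforms with 8-byte double alignment
```

Operating on `col` is exactly the access pattern the unaligned benchmarks above exercise: each element must be fetched with a strided, misaligned copy before the math kernel can run on it.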
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
> -Original Message- > From: numpy-discussion-boun...@scipy.org [mailto:numpy-discussion- > boun...@scipy.org] On Behalf Of David Cournapeau > Sent: Montag, 10. Januar 2011 10:15 > To: Discussion of Numerical Python > Subject: Re: [Numpy-discussion] speed of numpy.ndarray compared > toNumeric.array > > On Mon, Jan 10, 2011 at 6:04 PM, EMMEL Thomas > wrote: > > > > > Yes, of course and my real implementation uses exactly these methods, > > but there are still issues with the arrays. > > Did you try kd-trees in scipy ? > > David David, No, I didn't, however, my method is very similar and as far as I understood kd-trees, they need some time for pre-conditioning the search-area and this is the same as I did. In fact I think my method is more or less the same as a kd-tree. The problem remains that I need to calculate the distance of some points at a certain point in my code (when I am in a leaf of a kd-tree). For example when I use 10 points I end up in a leaf of my kd-tree where I need to calculate the distance for only 100 points or less (depends on the tree). The problem still remains and I use cProfile to get into the details. Most of the time it takes is in vec2Norm, everything else is very short but I need to call it as often as I have points (again 10) and this is why 10*0.001s takes some time. For numpy.ndarray this is 0.002s-0.003s, for Numeric.array 0.001-0.002s and for tuple ~0.001s (values from cProfile). And, by the way, the same problem appears when I need to calculate the cross-product of several vectors. In this case I have a geometry in 3D with a surface of thousands of triangles and I need to calculate the normal of each of these triangles. Again, doing a loop over tuples is faster than arrays, although in this case numpy.cross is twice as fast as Numeric.cross_product. Thomas This email and any attachments are intended solely for the use of the individual or entity to whom it is addressed and may be confidential and/or privileged. 
If you are not one of the named recipients or have received this email in error, (i) you should not read, disclose, or copy it, (ii) please notify sender of your receipt by reply email and delete this email and all attachments, (iii) Dassault Systemes does not accept or assume any liability or responsibility for any use of or reliance on this email. For other languages, go to http://www.3ds.com/terms/email-disclaimer.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
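[Archive note] The triangle-normal case Thomas describes can be vectorized so that a single numpy.cross call handles all triangles at once, instead of a Python loop over tuples. A minimal sketch; the array layout (a vertex table plus an index table) and all variable names are my own illustration, not from the thread:

```python
import numpy as np

# vertices: (V, 3) coordinates; triangles: (T, 3) vertex indices
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
triangles = np.array([[0, 1, 2],
                      [0, 1, 3]])

tri = vertices[triangles]          # (T, 3, 3): the three corners of each triangle
edge1 = tri[:, 1] - tri[:, 0]      # first edge of each triangle
edge2 = tri[:, 2] - tri[:, 0]      # second edge
normals = np.cross(edge1, edge2)   # (T, 3) un-normalized normals, one cross call
normals /= np.linalg.norm(normals, axis=1, keepdims=True)  # unit length
```

The per-call overhead of numpy is paid once for the whole surface rather than once per triangle, which is where the loop-over-tuples comparison becomes unfair to arrays.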
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
On Mon, Jan 10, 2011 at 6:04 PM, EMMEL Thomas wrote:
>
> Yes, of course and my real implementation uses exactly these methods,
> but there are still issues with the arrays.

Did you try kd-trees in scipy?

David
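[Archive note] The scipy kd-tree David refers to lives in scipy.spatial. A minimal sketch of a nearest-neighbour query of the kind discussed in this thread; the point counts and variable names are my own illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100000, 3)    # N 3-D coordinates
query = np.array([0.5, 0.5, 0.5])

tree = cKDTree(points)                # O(N log N) build, done once
dist, idx = tree.query(query)         # nearest neighbour: distance and row index
nearest = points[idx]
```

Each subsequent `tree.query` is roughly O(log N), so the build cost is amortized over many lookups.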
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
> On Mon, Jan 10, 2011 at 5:09 PM, EMMEL Thomas wrote:
> > To John:
> >
> >> Did you try larger arrays/tuples? I would guess that makes a significant
> >> difference.
> >
> > No I didn't, due to the fact that these values are coordinates in 3D (x, y, z).
> > In fact I work with a list/array/tuple of arrays with 10 to 1M of elements or more.
> > What I need to do is to calculate the distance of each of these elements
> > (coordinates) to a given coordinate and filter for the nearest.
>
> Note that for this exact problem, there are much better methods than
> brute force (O(N^2) for N vectors), e.g. kd-trees, which work
> very well in low dimension. This will matter much more than Numeric vs
> numpy.
>
> cheers,
>
> David

David,

Yes, of course, and my real implementation uses exactly these methods, but there are still issues with the arrays.

Example: if I used brute force it would take ~5000s for a particular example to find all points in a list of points. Theoretically it should be possible to get to O(N*log(N)), which would mean ~2s in my case. My method needs ~28s with tuples, but it takes ~30s with Numeric arrays and ~60s and more with numpy.ndarrays! I just use the brute-force method since it delivers the most reusable results for performance testing; the other methods depend a bit on the distribution of points in space.

Thomas
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
On Mon, Jan 10, 2011 at 5:09 PM, EMMEL Thomas wrote:
> To John:
>
>> Did you try larger arrays/tuples? I would guess that makes a significant
>> difference.
>
> No I didn't, due to the fact that these values are coordinates in 3D (x, y, z).
> In fact I work with a list/array/tuple of arrays with 10 to 1M of
> elements or more.
> What I need to do is to calculate the distance of each of these elements
> (coordinates) to a given coordinate and filter for the nearest.

Note that for this exact problem, there are much better methods than brute force (O(N^2) for N vectors), e.g. kd-trees, which work very well in low dimension. This will matter much more than Numeric vs numpy.

cheers,

David
Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array
To John:

> Did you try larger arrays/tuples? I would guess that makes a significant
> difference.

No I didn't, due to the fact that these values are coordinates in 3D (x, y, z). In fact I work with a list/array/tuple of arrays with 10 to 1M elements or more. What I need to do is to calculate the distance of each of these elements (coordinates) to a given coordinate and filter for the nearest. The brute-force method would look like this:

    from math import sqrt
    from operator import itemgetter

    def bruteForceSearch(points, point):
        # pick the triple (squared distance, point, index) with the
        # smallest squared distance, then take the root once at the end
        minpt = min([(vec2Norm(pt, point), pt, i)
                     for i, pt in enumerate(points)],
                    key=itemgetter(0))
        return sqrt(minpt[0]), minpt[1], minpt[2]

    def vec2Norm(pt1, pt2):
        # squared Euclidean distance between two 3-D points
        xDis = pt1[0] - pt2[0]
        yDis = pt1[1] - pt2[1]
        zDis = pt1[2] - pt2[2]
        return xDis*xDis + yDis*yDis + zDis*zDis

I have a more clever method, but it still takes a lot of time in the vec2Norm function. If you like I can attach a running example.

To Ben:

> Don't know how much of an impact it would have, but those timeit statements
> for array creation include the import process, which are going to be
> different for each module and are probably not indicative of the speed of
> array creation.

No, timeit counts the time for the statement in the first argument only; the import isn't included in the time.

Thomas
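[Archive note] The per-point Python call into vec2Norm is what dominates the profile above; the same search can be written as a handful of whole-array NumPy operations so the inner loop runs in C. A minimal sketch with the same return signature as bruteForceSearch; the function name is my own, not from the thread:

```python
import numpy as np

def brute_force_search_np(points, point):
    """Vectorized nearest-point search.

    points: sequence of N 3-D coordinates, point: one 3-D coordinate.
    Returns (distance, nearest_point, index), like bruteForceSearch.
    """
    pts = np.asarray(points, dtype=float)
    diff = pts - np.asarray(point, dtype=float)  # broadcast (N, 3) - (3,)
    sq = np.einsum('ij,ij->i', diff, diff)       # per-row squared distances
    i = int(np.argmin(sq))
    return float(np.sqrt(sq[i])), pts[i], i
```

One array expression replaces N calls to vec2Norm, which is exactly the regime where numpy.ndarray overtakes tuples and Numeric.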