Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Mark Wiebe
On Mon, Jan 10, 2011 at 11:35 AM, Mark Wiebe  wrote:

> I'm a bit curious why the jump from 1 to 2 threads is scaling so poorly.
>  Your timings have improvement factors of 1.85, 1.68, 1.64, and 1.79.  Since
> the computation is trivial data parallelism, and I believe it's still pretty
> far off the memory bandwidth limit, I would expect a speedup of 1.95 or
> higher.


It looks like it is the memory bandwidth which is limiting the scalability.
 The slower operations scale much better than faster ones.  Below are some
timings of successively faster operations.  When the operation is slow
enough, it scales like I was expecting...

-Mark

Computing: 'cos(x**1.1) + sin(x**1.3) + tan(x**2.3)' with 2000 points
Using numpy:
*** Time elapsed: 14.47
Using numexpr:
*** Time elapsed for 1 threads: 12.659000
*** Time elapsed for 2 threads: 6.357000
*** Ratio from 1 to 2 threads: 1.991348
Using numexpr_iter:
*** Time elapsed for 1 threads: 12.573000
*** Time elapsed for 2 threads: 6.398000
*** Ratio from 1 to 2 threads: 1.965145

Computing: 'x**2.345' with 2000 points
Using numpy:
*** Time elapsed: 3.506
Using numexpr:
*** Time elapsed for 1 threads: 3.375000
*** Time elapsed for 2 threads: 1.747000
*** Ratio from 1 to 2 threads: 1.931883
Using numexpr_iter:
*** Time elapsed for 1 threads: 3.266000
*** Time elapsed for 2 threads: 1.76
*** Ratio from 1 to 2 threads: 1.855682

Computing: '1*x+2*x+3*x+4*x+5*x+6*x+7*x+8*x+9*x+10*x+11*x+12*x+13*x+14*x'
with 2000 points
Using numpy:
*** Time elapsed: 9.774
Using numexpr:
*** Time elapsed for 1 threads: 1.314000
*** Time elapsed for 2 threads: 0.703000
*** Ratio from 1 to 2 threads: 1.869132
Using numexpr_iter:
*** Time elapsed for 1 threads: 1.257000
*** Time elapsed for 2 threads: 0.683000
*** Ratio from 1 to 2 threads: 1.840410

Computing: 'x+2.345' with 2000 points
Using numpy:
*** Time elapsed: 0.343
Using numexpr:
*** Time elapsed for 1 threads: 0.348000
*** Time elapsed for 2 threads: 0.30
*** Ratio from 1 to 2 threads: 1.16
Using numexpr_iter:
*** Time elapsed for 1 threads: 0.354000
*** Time elapsed for 2 threads: 0.293000
*** Ratio from 1 to 2 threads: 1.208191
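
For context, a minimal sketch of the kind of harness that produces output in this form (expression, array size and timing method are illustrative; the actual benchmark script differs):

import time
import numpy as np
import numexpr as ne

x = np.linspace(0, 1, 10000000)                    # illustrative size
expr = 'cos(x**1.1) + sin(x**1.3) + tan(x**2.3)'

for nthreads in (1, 2):
    ne.set_num_threads(nthreads)
    t0 = time.time()
    ne.evaluate(expr)
    print('*** Time elapsed for %d threads: %f' % (nthreads, time.time() - t0))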


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Mark Wiebe
I'm a bit curious why the jump from 1 to 2 threads is scaling so poorly.
 Your timings have improvement factors of 1.85, 1.68, 1.64, and 1.79.  Since
the computation is trivial data parallelism, and I believe it's still pretty
far off the memory bandwidth limit, I would expect a speedup of 1.95 or
higher.

One reason I suggest TBB is that it can produce a pretty good schedule while
still adapting to load produced by other processes and threads.  Numexpr
currently does that well, but simply dividing the data into one piece per
thread doesn't handle that case very well, and makes it possible that one
thread spends a fair bit of time finishing up while the others idle at the
end.  Perhaps using Cilk would be a better option than TBB, since the code
could remain in C.

-Mark

On Mon, Jan 10, 2011 at 3:55 AM, Francesc Alted  wrote:

> On Monday 10 January 2011 11:05:27, Francesc Alted wrote:
> > Also, I'd like to try out the new thread scheduling that you
> > suggested to me privately (i.e. T0T1T0T1...  vs T0T0...T1T1...).
>
> I've just implemented the new partition schema in numexpr
> (T0T0...T1T1..., being the original T0T1T0T1...).  I'm attaching the
> patch for this.  The results are a bit confusing.  For example, using
> the attached benchmark (poly.py), I get these results for a common dual-
> core machine, non-NUMA machine:
>
> With the T0T1...T0T1... (original) schema:
>
> Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
> Using numpy:
> *** Time elapsed: 3.497
> Using numexpr:
> *** Time elapsed for 1 threads: 1.279000
> *** Time elapsed for 2 threads: 0.688000
>
> With the T0T0...T1T1... (new) schema:
>
> Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
> Using numpy:
> *** Time elapsed: 3.454
> Using numexpr:
> *** Time elapsed for 1 threads: 1.268000
> *** Time elapsed for 2 threads: 0.754000
>
> which is around 10% slower (with 2 threads) than the original partition.
>
> The results are a bit different on a NUMA machine (8 physical cores, 16
> logical cores via hyper-threading):
>
> With the T0T1...T0T1... (original) partition:
>
> Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
> Using numpy:
> *** Time elapsed: 3.005
> Using numexpr:
> *** Time elapsed for 1 threads: 1.109000
> *** Time elapsed for 2 threads: 0.677000
> *** Time elapsed for 3 threads: 0.496000
> *** Time elapsed for 4 threads: 0.394000
> *** Time elapsed for 5 threads: 0.324000
> *** Time elapsed for 6 threads: 0.287000
> *** Time elapsed for 7 threads: 0.247000
> *** Time elapsed for 8 threads: 0.234000
> *** Time elapsed for 9 threads: 0.242000
> *** Time elapsed for 10 threads: 0.239000
> *** Time elapsed for 11 threads: 0.241000
> *** Time elapsed for 12 threads: 0.235000
> *** Time elapsed for 13 threads: 0.226000
> *** Time elapsed for 14 threads: 0.214000
> *** Time elapsed for 15 threads: 0.235000
> *** Time elapsed for 16 threads: 0.218000
>
> With the T0T0...T1T1... (new) partition:
>
> Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
> Using numpy:
> *** Time elapsed: 3.003
> Using numexpr:
> *** Time elapsed for 1 threads: 1.106000
> *** Time elapsed for 2 threads: 0.617000
> *** Time elapsed for 3 threads: 0.442000
> *** Time elapsed for 4 threads: 0.345000
> *** Time elapsed for 5 threads: 0.296000
> *** Time elapsed for 6 threads: 0.257000
> *** Time elapsed for 7 threads: 0.237000
> *** Time elapsed for 8 threads: 0.26
> *** Time elapsed for 9 threads: 0.245000
> *** Time elapsed for 10 threads: 0.261000
> *** Time elapsed for 11 threads: 0.238000
> *** Time elapsed for 12 threads: 0.21
> *** Time elapsed for 13 threads: 0.218000
> *** Time elapsed for 14 threads: 0.20
> *** Time elapsed for 15 threads: 0.235000
> *** Time elapsed for 16 threads: 0.198000
>
> In this case, the performance is similar, with perhaps a slight
> advantage for the new partition scheme, but I don't know if it is worth
> making it the default (probably not, as this partition performs clearly
> worse on non-NUMA machines).  At any rate, both partitions perform very
> close to the aggregated memory bandwidth of NUMA machines (around 10
> GB/s in the above case).
>
> In general, I don't think there is much point in using Intel's TBB in
> numexpr because the existing implementation already hits memory
> bandwidth limits pretty early (around 10 threads in the latter example).
>
> --
> Francesc Alted
>


Re: [Numpy-discussion] indexing of rank-0 structured arrays: why not?

2011-01-10 Thread Robert Kern
On Mon, Jan 10, 2011 at 12:15, Nils Becker  wrote:
> Robert,
>
> your answer does work: after indexing with () I can then further index
> into the datatype.
>
> In [115]: a_rank_0[()][0]
> Out[115]: 0.0
>
> I guess I just found the fact confusing that a_rank_1[0] and a_rank_0
> compare and print equal but behave differently under indexing.

They do not print equal. Many things compare equal but do not behave the same.

> More precisely if I do
> In [117]: b = a_rank_1[0]
>
> then
>
> In [118]: b.shape
> Out[118]: ()
>
> and
>
> In [120]: a_rank_0 == b
> Out[120]: True
>
> but
>
> In [119]: b[0]
> Out[119]: 0.0
>
> works but a_rank_0[0] doesn't. I thought b is a rank-0 array which it
> apparently is not since it can be indexed. So maybe b[0] should fail for
> consistency?

No, b is a record scalar. It can be indexed because it is often
convenient to treat such records like tuples. This replaces the default
indexing behavior of scalars (which is to simply disallow indexing).
a_rank_0 is an array, so the array indexing semantics are the default,
and we do not change them.
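
A small demonstration of the distinction (the dtype and values here are made up, not Nils' originals):

import numpy as np

dt = np.dtype([('x', '<f8'), ('y', '<f8')])
a_rank_1 = np.zeros(1, dtype=dt)
a_rank_0 = np.zeros((), dtype=dt)

b = a_rank_1[0]          # a record (void) scalar
print(b[0], b['x'])      # tuple-like indexing and field access both work on the scalar

print(a_rank_0[()])      # indexing the 0-d array with () yields that scalar
print(a_rank_0['x'])     # field access works directly on the 0-d array
# a_rank_0[0] raises IndexError: integer indexing is not defined for a 0-d array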

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Mark Wiebe
On Mon, Jan 10, 2011 at 9:47 AM, Francesc Alted  wrote:

> 
>
> so, the new code is just < 5% slower.  I suppose that removing the
> NPY_ITER_ALIGNED flag would give us a bit more performance, but that's
> great as it is now.  How did you do that?  Your new_iter branch in NumPy
> already deals with unaligned data, right?
>

Take a look at  lowlevel_strided_loops.c.src.  In this case, the buffering
setup code calls PyArray_GetDTypeTransferFunction, which in turn calls
PyArray_GetStridedCopyFn, which on an x86 platform returns
_aligned_strided_to_contig_size8.  This function has a simple loop of copies
using a npy_uint64 data type.

> The new code also needs support for the reduce operation.  I didn't
> > look too closely at the code for that, but a nested iteration
> > pattern is probably appropriate.  If the inner loop is just allowed
> > to be one dimension, it could be done without actually creating the
> > inner iterator.
>
> Well, if you can support reduce operations with your patch that would be
> extremely good news as I'm afraid that the current reduce code is a bit
> broken in Numexpr (at least, I vaguely remember seeing it working badly
> in some cases).
>

Cool, I'll take a look at some point.  I imagine with the most obvious
implementation small reductions would perform poorly.

-Mark


Re: [Numpy-discussion] indexing of rank-0 structured arrays: why not?

2011-01-10 Thread Nils Becker
Robert,

your answer does work: after indexing with () I can then further index
into the datatype.

In [115]: a_rank_0[()][0]
Out[115]: 0.0

I guess I just found the fact confusing that a_rank_1[0] and a_rank_0
compare and print equal but behave differently under indexing.

More precisely if I do
In [117]: b = a_rank_1[0]

then

In [118]: b.shape
Out[118]: ()

and

In [120]: a_rank_0 == b
Out[120]: True

but

In [119]: b[0]
Out[119]: 0.0

works but a_rank_0[0] doesn't. I thought b is a rank-0 array which it
apparently is not since it can be indexed. So maybe b[0] should fail for
consistency?

N.


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Francesc Alted
On Monday 10 January 2011 17:54:16, Mark Wiebe wrote:
> > Apparently, you forgot to add the new_iterator_pywrap.h file.
> 
> Oops,  that's added now.

Excellent.  It works now.

> The aligned case should just be a matter of conditionally removing
> the NPY_ITER_ALIGNED flag in two places.

Wow, the support for unaligned in current `evaluate_iter()` seems pretty 
nice already:

$ python unaligned-simple.py 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Numexpr version:   1.5.dev
NumPy version: 2.0.0.dev-ebc963d
Python version:2.6.1 (r261:67515, Feb  3 2009, 17:34:37) 
[GCC 4.3.2 [gcc-4_3-branch revision 141291]]
Platform:  linux2-x86_64
AMD/Intel CPU? True
VML available? False
Detected cores:2
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
NumPy aligned:  0.658 s
NumPy unaligned:1.597 s
Numexpr aligned:0.59 s
Numexpr aligned (new iter): 0.59 s
Numexpr unaligned:  0.51 s
Numexpr unaligned (new_iter):   0.528 s

so, the new code is just < 5% slower.  I suppose that removing the 
NPY_ITER_ALIGNED flag would give us a bit more performance, but that's 
great as it is now.  How did you do that?  Your new_iter branch in NumPy 
already deals with unaligned data, right?
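
As a side note on the unaligned use case, this is a sketch of how such data typically arises (field names and sizes are made up; whether it is handled fast is exactly what the benchmark above measures):

import numpy as np
import numexpr as ne

dt = np.dtype([('tag', 'S1'), ('x', 'f8')])   # the 1-byte field pushes 'x' off 8-byte alignment
table = np.zeros(1000000, dtype=dt)
x = table['x']                                # an unaligned, strided view

print(x.flags.aligned)                        # False for this packed dtype
print(ne.evaluate('x + 2.345')[:3])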

> The new code also needs support for the reduce operation.  I didn't
> look too closely at the code for that, but a nested iteration
> pattern is probably appropriate.  If the inner loop is just allowed
> to be one dimension, it could be done without actually creating the
> inner iterator.

Well, if you can support reduce operations with your patch that would be 
extremely good news as I'm afraid that the current reduce code is a bit 
broken in Numexpr (at least, I vaguely remember seeing it working badly 
in some cases).

-- 
Francesc Alted


Re: [Numpy-discussion] indexing of rank-0 structured arrays: why not?

2011-01-10 Thread Robert Kern
On Mon, Jan 10, 2011 at 10:08, Nils Becker  wrote:
> Hi,
>
> I noticed that I can index into a dtype when I take an element
> of a rank-1 array but not if I make a rank-0 array directly. This seems
> inconsistent. A bug?

Not a bug. Since there is no axis, you cannot use integers to index
into a rank-0 array. Use an empty tuple instead.

[~]
|1> dt = np.dtype([('x', '<f8'), ('y', '<f8')])

[~]
|2> a_rank_0 = np.zeros((), dtype=dt)

[~]
|3> a_rank_0[()]
(0.0, 0.0)

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Mark Wiebe
On Mon, Jan 10, 2011 at 2:05 AM, Francesc Alted  wrote:

> 
>
> Your patch looks mostly fine to my eyes; good job!  Unfortunately, I've
> been unable to compile your new_iterator branch of NumPy:
>
> numpy/core/src/multiarray/multiarraymodule.c:45:33: fatal error:
> new_iterator_pywrap.h: No such file or directory
>
> Apparently, you forgot to add the new_iterator_pywrap.h file.
>

Oops,  that's added now.


> My idea would be to merge your patch in numexpr and make the new
> `evaluate_iter()` the default (i.e. make it `evaluate()`).  However, by
> looking into the code, it seems to me that unaligned arrays (this is an
> important use case when operating with columns of structured arrays) may
> need more fine-tuning for Intel platforms.  When I can compile the
> new_iterator branch, I'll give a try at unaligned data benchs.
>

The aligned case should just be a matter of conditionally removing the
NPY_ITER_ALIGNED flag in two places.

The new code also needs support for the reduce operation.  I didn't look too
closely at the code for that, but a nested iteration pattern is probably
appropriate.  If the inner loop is just allowed to be one dimension, it
could be done without actually creating the inner iterator.

-Mark


Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array

2011-01-10 Thread René Dudfield
Hi,

Spatial hashes are the common solution.

Another common optimization is using the distance squared for
collision detection.  Since you do not need the expensive sqrt for
this calc.
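
A rough sketch of the spatial-hash idea (cell size, helper names and shapes are made up for illustration; it also sticks to squared distances):

from collections import defaultdict
import numpy as np

def build_hash(points, cell):
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple((p // cell).astype(int))].append(i)
    return grid

def nearest(points, grid, q, cell):
    # look only at the query's cell and its 26 neighbours
    # (correct only if the true nearest neighbour lies within one cell)
    cx, cy, cz = (np.asarray(q) // cell).astype(int)
    cand = [i for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
            for i in grid.get((cx + dx, cy + dy, cz + dz), ())]
    d2 = ((points[cand] - q) ** 2).sum(axis=1)   # squared distances, no sqrt
    return cand[int(np.argmin(d2))], d2.min()

points = np.random.rand(100000, 3)
grid = build_hash(points, cell=0.05)
idx, d2 = nearest(points, grid, np.array([0.1, 0.2, 0.3]), cell=0.05)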

cu.



On Mon, Jan 10, 2011 at 3:25 PM, Pascal  wrote:
> Hi,
>
> On 01/10/2011 09:09 AM, EMMEL Thomas wrote:
>>
>> No I didn't, due to the fact that these values are coordinates in 3D (x,y,z).
>> In fact I work with a list/array/tuple of arrays with 10 to 1M of 
>> elements or more.
>> What I need to do is to calculate the distance of each of these elements 
>> (coordinates)
>> to a given coordinate and filter for the nearest.
>> The brute force method would look like this:
>>
>>
>> #~
>> def bruteForceSearch(points, point):
>>
>>      minpt = min([(vec2Norm(pt, point), pt, i)
>>                   for i, pt in enumerate(points)], key=itemgetter(0))
>>      return sqrt(minpt[0]), minpt[1], minpt[2]
>>
>> #~~
>> def vec2Norm(pt1,pt2):
>>      xDis = pt1[0]-pt2[0]
>>      yDis = pt1[1]-pt2[1]
>>      zDis = pt1[2]-pt2[2]
>>      return xDis*xDis+yDis*yDis+zDis*zDis
>>
>
> I am not sure I understood the problem properly but here what I would
> use to calculate a distance from horizontally stacked vectors (big):
>
> ref=numpy.array([0.1,0.2,0.3])
> big=numpy.random.randn(100, 3)
>
> big=numpy.add(big,-ref)
> distsquared=numpy.sum(big**2, axis=1)
>
> Pascal


[Numpy-discussion] indexing of rank-0 structured arrays: why not?

2011-01-10 Thread Nils Becker
Hi,

I noticed that I can index into a dtype when I take an element
of a rank-1 array but not if I make a rank-0 array directly. This seems
inconsistent. A bug?

Nils


In [76]: np.version.version
Out[76]: '1.5.1'

In [78]: dt = np.dtype([('x', '<f8'), ('y', '<f8')])

[... the construction of a_rank_1 and a_rank_0 and the failing a_rank_0[0]
call were stripped by the archive; the traceback ended with ...]

IndexError: 0-d arrays can't be indexed

In [87]: a_rank_0['x']
Out[87]: array(0.0)


Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array

2011-01-10 Thread Pascal
Hi,

On 01/10/2011 09:09 AM, EMMEL Thomas wrote:
>
> No I didn't, due to the fact that these values are coordinates in 3D (x,y,z).
> In fact I work with a list/array/tuple of arrays with 10 to 1M of 
> elements or more.
> What I need to do is to calculate the distance of each of these elements 
> (coordinates)
> to a given coordinate and filter for the nearest.
> The brute force method would look like this:
>
>
> #~
> def bruteForceSearch(points, point):
>
>  minpt = min([(vec2Norm(pt, point), pt, i)
>   for i, pt in enumerate(points)], key=itemgetter(0))
>  return sqrt(minpt[0]), minpt[1], minpt[2]
>
> #~~
> def vec2Norm(pt1,pt2):
>  xDis = pt1[0]-pt2[0]
>  yDis = pt1[1]-pt2[1]
>  zDis = pt1[2]-pt2[2]
>  return xDis*xDis+yDis*yDis+zDis*zDis
>

I am not sure I understood the problem properly but here what I would 
use to calculate a distance from horizontally stacked vectors (big):

ref=numpy.array([0.1,0.2,0.3])
big=numpy.random.randn(100, 3)

big=numpy.add(big,-ref)
distsquared=numpy.sum(big**2, axis=1)
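
Continuing the snippet above (same variables, and assuming numpy is imported as in the snippet), picking the nearest point would then be:

i = numpy.argmin(distsquared)        # index of the nearest point
dist = numpy.sqrt(distsquared[i])
nearest = big[i] + ref               # undo the shift to recover the original point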

Pascal


Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array

2011-01-10 Thread EMMEL Thomas
Hey back...
> > #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > def bruteForceSearch(points, point):
> >
> >     minpt = min([(vec2Norm(pt, point), pt, i)
> >                  for i, pt in enumerate(points)], key=itemgetter(0))
> >     return sqrt(minpt[0]), minpt[1], minpt[2]
> >
> >
> > #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > def vec2Norm(pt1,pt2):
> >     xDis = pt1[0]-pt2[0]
> >     yDis = pt1[1]-pt2[1]
> >     zDis = pt1[2]-pt2[2]
> >     return xDis*xDis+yDis*yDis+zDis*zDis
> >
> > I have a more clever method but it still takes a lot of time in the
> vec2norm-function.
> > If you like I can attach a running example.
> >
> 
> if you use the vec2Norm function as you wrote it there, this code is 
> not vectorized at all, and as such of course numpy would be slowest as 
> it has the most overhead and no advantages for non vectorized code, 
> you simply can't write python code like that and expect it to be fast 
> for these kind of calculations.
> 
> Your function should look more like this:
> 
> import numpy as np
> 
> def bruteForceSearch(points, point):
> dists = points - point
> # that may need point[None,:] or such for broadcasting to work
> dists *= dists
> dists = dists.sum(1)
> I = np.argmin(dists)
> return sqrt(dists[I]), points[I], I
> 
> If points is small, this may not help much (though compared to this 
> exact code my guess is it probably would), if points is larger it 
> should speed up things tremendously (unless you run into RAM 
> problems). It may be that you need to fiddle around with axes, I did 
> not check the code.
> If this is not good enough for you (you will need to port it (and 
> maybe the next outer loop as well) to Cython or write it in C/C++ and 
> make sure it can optimize things right. Also I think somewhere in 
> scipy there were some distance tools that may be already in C and nice 
> fast, but not sure.
> 
> I hope I got this right and it helps,
> 
> Sebastian
> 

I see the point, and it was very helpful for understanding the behavior of the 
arrays a bit better. Your approach improved the bruteForceSearch, which is now up 
to 6 times faster.
But in a leaf of a kd-tree you end up with 50, 20, 10 or fewer points, 
where the speed-up is reversed. In this particular case 34000 runs take 90s 
with your method and 50s with mine (not the bruteForce).
I now see the limits of the arrays, but of course I also see the opportunities, and - 
coming back to my original question - it seems that Numeric arrays were faster for my 
kind of application, though they might be slower for larger amounts of data.

Regards

Thomas





[Numpy-discussion] Drawing circles in a numpy array

2011-01-10 Thread totonixs...@gmail.com
Hi all,

I have this problem: given a point, draw a circle centered at that
point with radius r. I'm doing that using numpy this way (snippet code
from here [1]):

>>> # Create the initial black and white image
>>> import numpy as np
>>> from scipy import ndimage
>>> a = np.zeros((512, 512)).astype(np.uint8)  # unsigned integer type needed by watershed
>>> y, x = np.ogrid[0:512, 0:512]
>>> m1 = ((y-200)**2 + (x-100)**2 < 30**2)
>>> m2 = ((y-350)**2 + (x-400)**2 < 20**2)
>>> m3 = ((y-260)**2 + (x-200)**2 < 20**2)
>>> a[m1+m2+m3]=1
>>> imshow(a, cmap = cm.gray)# left plot in the image above

The problem is that it has to evaluate all values from 0 up to the image
size (in the snippet, from 0 to 512 in the X and Y dimensions). Is there a
faster way of doing that, without evaluating all those values? For
example, evaluating only from 0 to 30 for a circle centered at (0, 0)
with radius 30.
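
One sketch of the bounding-box idea (the helper name and parameters are made up for illustration):

import numpy as np

def draw_circle(a, cy, cx, r, value=1):
    # touch only the square bounding box of the circle
    ny, nx = a.shape
    y0, y1 = max(cy - r, 0), min(cy + r + 1, ny)
    x0, x1 = max(cx - r, 0), min(cx + r + 1, nx)
    y, x = np.ogrid[y0:y1, x0:x1]
    mask = (y - cy)**2 + (x - cx)**2 < r**2
    a[y0:y1, x0:x1][mask] = value

a = np.zeros((512, 512), dtype=np.uint8)
draw_circle(a, 200, 100, 30)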

Thanks!
Thiago Franco de Moraes

[1] - http://www.scipy.org/Cookbook/Watershed


Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Francesc Alted
On Monday 10 January 2011 11:05:27, Francesc Alted wrote:
> Also, I'd like to try out the new thread scheduling that you
> suggested to me privately (i.e. T0T1T0T1...  vs T0T0...T1T1...).

I've just implemented the new partition scheme in numexpr 
(T0T0...T1T1..., the original being T0T1T0T1...).  I'm attaching the 
patch for this.  The results are a bit confusing.  For example, using 
the attached benchmark (poly.py), I get these results for a common 
dual-core, non-NUMA machine:

With the T0T1...T0T1... (original) schema:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.497
Using numexpr:
*** Time elapsed for 1 threads: 1.279000
*** Time elapsed for 2 threads: 0.688000

With the T0T0...T1T1... (new) schema:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.454
Using numexpr:
*** Time elapsed for 1 threads: 1.268000
*** Time elapsed for 2 threads: 0.754000

which is around 10% slower (with 2 threads) than the original partition.

The results are a bit different on a NUMA machine (8 physical cores, 16 
logical cores via hyper-threading):

With the T0T1...T0T1... (original) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.005
Using numexpr:
*** Time elapsed for 1 threads: 1.109000
*** Time elapsed for 2 threads: 0.677000
*** Time elapsed for 3 threads: 0.496000
*** Time elapsed for 4 threads: 0.394000
*** Time elapsed for 5 threads: 0.324000
*** Time elapsed for 6 threads: 0.287000
*** Time elapsed for 7 threads: 0.247000
*** Time elapsed for 8 threads: 0.234000
*** Time elapsed for 9 threads: 0.242000
*** Time elapsed for 10 threads: 0.239000
*** Time elapsed for 11 threads: 0.241000
*** Time elapsed for 12 threads: 0.235000
*** Time elapsed for 13 threads: 0.226000
*** Time elapsed for 14 threads: 0.214000
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.218000

With the T0T0...T1T1... (new) partition:

Computing: '((.25*x + .75)*x - 1.5)*x - 2' with 1 points
Using numpy:
*** Time elapsed: 3.003
Using numexpr:
*** Time elapsed for 1 threads: 1.106000
*** Time elapsed for 2 threads: 0.617000
*** Time elapsed for 3 threads: 0.442000
*** Time elapsed for 4 threads: 0.345000
*** Time elapsed for 5 threads: 0.296000
*** Time elapsed for 6 threads: 0.257000
*** Time elapsed for 7 threads: 0.237000
*** Time elapsed for 8 threads: 0.26
*** Time elapsed for 9 threads: 0.245000
*** Time elapsed for 10 threads: 0.261000
*** Time elapsed for 11 threads: 0.238000
*** Time elapsed for 12 threads: 0.21
*** Time elapsed for 13 threads: 0.218000
*** Time elapsed for 14 threads: 0.20
*** Time elapsed for 15 threads: 0.235000
*** Time elapsed for 16 threads: 0.198000

In this case, the performance is similar, with perhaps a slight 
advantage for the new partition scheme, but I don't know if it is worth 
making it the default (probably not, as this partition performs clearly 
worse on non-NUMA machines).  At any rate, both partitions perform very 
close to the aggregated memory bandwidth of NUMA machines (around 10 
GB/s in the above case).

In general, I don't think there is much point in using Intel's TBB in 
numexpr because the existing implementation already hits memory 
bandwidth limits pretty early (around 10 threads in the latter example).
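
For illustration, here is a small sketch of how the two schemes assign blocks of the array to threads (function and variable names are made up; the real logic lives in interpreter.c):

def interleaved_blocks(tid, nthreads, vlen, block_size):
    """T0T1T0T1...: thread `tid` takes every nthreads-th block."""
    return [(start, min(start + block_size, vlen))
            for start in range(tid * block_size, vlen, nthreads * block_size)]

def contiguous_blocks(tid, nthreads, vlen, block_size):
    """T0T0...T1T1...: thread `tid` takes one contiguous run of blocks."""
    nblocks = -(-vlen // block_size)            # ceiling division
    per_thread = -(-nblocks // nthreads)
    start = tid * per_thread * block_size
    stop = min(start + per_thread * block_size, vlen)
    return [(s, min(s + block_size, stop)) for s in range(start, stop, block_size)]

# Example with 2 threads, 10 elements, block size 2:
#   interleaved_blocks(0, 2, 10, 2) -> [(0, 2), (4, 6), (8, 10)]
#   contiguous_blocks(0, 2, 10, 2)  -> [(0, 2), (2, 4), (4, 6)]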

-- 
Francesc Alted
Index: numexpr/interpreter.c
===
--- numexpr/interpreter.c	(revision 260)
+++ numexpr/interpreter.c	(working copy)
@@ -59,8 +59,6 @@
 int end_threads = 0; /* should exisiting threads end? */
 pthread_t threads[MAX_THREADS];  /* opaque structure for threads */
 int tids[MAX_THREADS];   /* ID per each thread */
-intp gindex; /* global index for all threads */
-int init_sentinels_done; /* sentinels initialized? */
 int giveup;  /* should parallel code giveup? */
 int force_serial;/* force serial code instead of parallel? */
 int pid = 0; /* the PID for this process */
@@ -1072,7 +1070,7 @@
 return 0;
 }
 
-/* VM engine for each threadi (general) */
+/* VM engine for each thread (general) */
 static inline int
 vm_engine_thread(char **mem, intp index, intp block_size,
   struct vm_params params, int *pc_error)
@@ -1086,11 +1084,11 @@
 /* Do the worker job for a certain thread */
 void *th_worker(void *tids)
 {
-/* int tid = *(int *)tids; */
-intp index; /* private copy of gindex */
+int tid = *(int *)tids;
+intp index;
 /* Parameters for threads */
-intp start;
-intp vlen;
+intp start, stop;
+intp vlen, nblocks, th_nblocks;
 intp block_size;
 struct vm_params params;
 int *pc_error;
@@ -1103,8 +1101,6 @@
 
 while (1) {
 
-init_sentinels_done = 0; /* sentinels have to be initialised yet */
-
  

Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array

2011-01-10 Thread Sebastian Berg
Hey,

On Mon, 2011-01-10 at 08:09 +, EMMEL Thomas wrote:
> #~
> def bruteForceSearch(points, point):
> 
> minpt = min([(vec2Norm(pt, point), pt, i)
>  for i, pt in enumerate(points)], key=itemgetter(0))
> return sqrt(minpt[0]), minpt[1], minpt[2] 
> 
> #~~
> def vec2Norm(pt1,pt2):
> xDis = pt1[0]-pt2[0]
> yDis = pt1[1]-pt2[1]
> zDis = pt1[2]-pt2[2]
> return xDis*xDis+yDis*yDis+zDis*zDis
> 
> I have a more clever method but it still takes a lot of time in the 
> vec2norm-function.
> If you like I can attach a running example.
> 

If you use the vec2Norm function as you wrote it there, this code is not
vectorized at all, and as such numpy would of course be slowest, as it
has the most overhead and no advantages for non-vectorized code. You
simply can't write Python code like that and expect it to be fast for
these kinds of calculations.

Your function should look more like this:

import numpy as np

def bruteForceSearch(points, point):
    dists = points - point
    # that may need point[None,:] or such for broadcasting to work
    dists *= dists
    dists = dists.sum(1)
    I = np.argmin(dists)
    return np.sqrt(dists[I]), points[I], I
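
A quick usage check (made-up data):

pts = np.random.rand(1000, 3)
dist, nearest_pt, idx = bruteForceSearch(pts, np.array([0.5, 0.5, 0.5]))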

If points is small, this may not help much (though compared to this
exact code my guess is it probably would); if points is larger, it should
speed things up tremendously (unless you run into RAM problems). It may
be that you need to fiddle around with the axes; I did not check the code.
If this is not good enough for you, you will need to port it (and maybe
the next outer loop as well) to Cython, or write it in C/C++ and make
sure it can optimize things right. Also, I think somewhere in scipy there
are some distance tools that may already be in C and nicely fast, but I'm
not sure.

I hope I got this right and it helps,

Sebastian



Re: [Numpy-discussion] numexpr with the new iterator

2011-01-10 Thread Francesc Alted
On Sunday 09 January 2011 23:45:02, Mark Wiebe wrote:
> As a benchmark of C-based iterator usage and to make it work properly
> in a multi-threaded context, I've updated numexpr to use the new
> iterator.  In addition to some performance improvements, this also
> made it easy to add optional out= and order= parameters to the
> evaluate function.  The numexpr repository with this update is
> available here:
> 
> https://github.com/m-paradox/numexpr
> 
> To use it, you need the new_iterator branch of NumPy from here:
> 
> https://github.com/m-paradox/numpy
> 
> In all cases tested, the iterator version of numexpr's evaluate
> function matches or beats the standard version.  The timing results
> are below, with some explanatory comments placed inline:
[clip]
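
(A minimal, illustrative sketch of the out= parameter mentioned in the quoted announcement; the array and expression are made up, not taken from the thread:)

import numpy as np
import numexpr as ne

a = np.arange(1000000, dtype=np.float64)
out = np.empty_like(a)
ne.evaluate('2*a + 1', out=out)   # result written into the preallocated array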

Your patch looks mostly fine to my eyes; good job!  Unfortunately, I've 
been unable to compile your new_iterator branch of NumPy:

numpy/core/src/multiarray/multiarraymodule.c:45:33: fatal error: 
new_iterator_pywrap.h: No such file or directory

Apparently, you forgot to add the new_iterator_pywrap.h file.

My idea would be to merge your patch in numexpr and make the new 
`evaluate_iter()` the default (i.e. make it `evaluate()`).  However, by 
looking into the code, it seems to me that unaligned arrays (this is an 
important use case when operating with columns of structured arrays) may 
need more fine-tuning for Intel platforms.  When I can compile the 
new_iterator branch, I'll give a try at unaligned data benchs.  

Also, I'd like to try out the new thread scheduling that you suggested 
to me privately (i.e. T0T1T0T1...  vs T0T0...T1T1...).

Thanks!

-- 
Francesc Alted


Re: [Numpy-discussion] speed of numpy.ndarray comparedtoNumeric.array

2011-01-10 Thread EMMEL Thomas
> -Original Message-
> From: numpy-discussion-boun...@scipy.org [mailto:numpy-discussion-
> boun...@scipy.org] On Behalf Of David Cournapeau
> Sent: Montag, 10. Januar 2011 10:15
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] speed of numpy.ndarray compared
> toNumeric.array
> 
> On Mon, Jan 10, 2011 at 6:04 PM, EMMEL Thomas 
> wrote:
> 
> >
> > Yes, of course and my real implementation uses exactly these methods,
> > but there are still issues with the arrays.
> 
> Did you try kd-trees in scipy ?
> 
> David

David,

No, I didn't; however, my method is very similar, and as far as I understand
kd-trees, they need some time for preconditioning the search area, which is 
the same thing I do. In fact I think my method is more or less the same as a 
kd-tree.
The problem remains that I need to calculate the distance of some points at
a certain point in my code (when I am in a leaf of a kd-tree). 
For example, when I use 10 points I end up in a leaf of my kd-tree where I 
need to calculate the distance for only 100 points or less (depends on the tree).
The problem still remains, and I use cProfile to get into the details.
Most of the time is spent in vec2Norm; everything else is very short, but
I need to call it as often as I have points (again 10), and this is why
10*0.001s takes some time. For numpy.ndarray this is 0.002s-0.003s,
for Numeric.array 0.001-0.002s and for tuples ~0.001s (values from cProfile).

And, by the way, the same problem appears when I need to calculate the 
cross product of several vectors.
In this case I have a 3D geometry with a surface of thousands of triangles, 
and I need to calculate the normal of each of these triangles.
Again, looping over tuples is faster than over arrays, although in this case
numpy.cross is twice as fast as Numeric.cross_product.
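
For the cross-product case, a vectorized sketch (the array shape is illustrative):

import numpy as np

tri = np.random.rand(10000, 3, 3)        # (ntriangles, vertex, xyz)
normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
normals /= np.sqrt((normals ** 2).sum(axis=1))[:, None]   # unit normals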

Thomas




Re: [Numpy-discussion] speed of numpy.ndarray compared toNumeric.array

2011-01-10 Thread David Cournapeau
On Mon, Jan 10, 2011 at 6:04 PM, EMMEL Thomas  wrote:

>
> Yes, of course and my real implementation uses exactly these methods,
> but there are still issues with the arrays.

Did you try kd-trees in scipy ?

David


Re: [Numpy-discussion] speed of numpy.ndarray compared toNumeric.array

2011-01-10 Thread EMMEL Thomas
> On Mon, Jan 10, 2011 at 5:09 PM, EMMEL Thomas 
> wrote:
> > To John:
> >
> >> Did you try larger arrays/tuples? I would guess that makes a
> significant
> >> difference.
> >
> > No I didn't, due to the fact that these values are coordinates in 3D
> (x,y,z).
> > In fact I work with a list/array/tuple of arrays with 10 to 1M of
> elements or more.
> > What I need to do is to calculate the distance of each of these
> elements (coordinates)
> > to a given coordinate and filter for the nearest.
> 
> Note that for this exact problem, there are much better methods than
> brute force (O(N^2) for N vectors), through e.g. kd-trees, which work
> very well in low-dimension. This will matter much more than numeric vs
> numpy
> 
> cheers,
> 
> David

David,

Yes, of course, and my real implementation uses exactly these methods,
but there are still issues with the arrays.
Example:
If I used brute force, it would take ~5000s for a particular example to 
find all points in a list of points. Theoretically it should be possible to
get to O(N*log(N)), which would mean ~2s in my case.
My method needs ~28s with tuples, but it takes ~30s with Numeric arrays
and ~60s or more with numpy.ndarrays!
I just use the brute-force method since it delivers the most reusable results
for performance testing; the other methods are a bit dependent on the 
distribution of points in space.

Thomas




Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array

2011-01-10 Thread David Cournapeau
On Mon, Jan 10, 2011 at 5:09 PM, EMMEL Thomas  wrote:
> To John:
>
>> Did you try larger arrays/tuples? I would guess that makes a significant
>> difference.
>
> No I didn't, due to the fact that these values are coordinates in 3D (x,y,z).
> In fact I work with a list/array/tuple of arrays with 10 to 1M of 
> elements or more.
> What I need to do is to calculate the distance of each of these elements 
> (coordinates)
> to a given coordinate and filter for the nearest.

Note that for this exact problem, there are much better methods than
brute force (O(N^2) for N vectors), through e.g. kd-trees, which work
very well in low-dimension. This will matter much more than numeric vs
numpy
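
A minimal sketch of the scipy kd-tree route (sizes are illustrative):

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100000, 3)       # illustrative point cloud
tree = cKDTree(points)                   # built once
dist, idx = tree.query([0.1, 0.2, 0.3])  # nearest neighbour: distance and index
nearest = points[idx]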

cheers,

David


Re: [Numpy-discussion] speed of numpy.ndarray compared to Numeric.array

2011-01-10 Thread EMMEL Thomas
To John:

> Did you try larger arrays/tuples? I would guess that makes a significant
> difference.

No I didn't, due to the fact that these values are coordinates in 3D (x,y,z).
In fact I work with a list/array/tuple of arrays with 10 to 1M of elements 
or more.
What I need to do is to calculate the distance of each of these elements 
(coordinates)
to a given coordinate and filter for the nearest.
The brute force method would look like this:


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from math import sqrt
from operator import itemgetter

def bruteForceSearch(points, point):

    minpt = min([(vec2Norm(pt, point), pt, i)
                 for i, pt in enumerate(points)], key=itemgetter(0))
    return sqrt(minpt[0]), minpt[1], minpt[2]

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def vec2Norm(pt1, pt2):
    xDis = pt1[0]-pt2[0]
    yDis = pt1[1]-pt2[1]
    zDis = pt1[2]-pt2[2]
    return xDis*xDis+yDis*yDis+zDis*zDis

I have a more clever method but it still takes a lot of time in the 
vec2norm-function.
If you like I can attach a running example.

To Ben:

> Don't know how much of an impact it would have, but those timeit statements
> for array creation include the import process, which are going to be
> different for each module and are probably not indicative of the speed of
> array creation.

No, the timeit statement counts only the time for the statement in its first
argument; the import isn't included in the timing.

Thomas
