Re: [Numpy-discussion] record arrays initialization

2012-05-03 Thread Keith Goodman
On Wed, May 2, 2012 at 4:46 PM, Kevin Jacobs jac...@bioinformed.com
bioinfor...@gmail.com wrote:

 The cKDTree implementation is more than 4 times faster than the brute-force
 approach:

 T = scipy.spatial.cKDTree(targets)

 In [11]: %timeit foo1(element, targets)   # Brute force
 1000 loops, best of 3: 385 us per loop

 In [12]: %timeit foo2(element, T)         # cKDTree
 1 loops, best of 3: 83.5 us per loop

 In [13]: 385/83.5
 Out[13]: 4.610778443113772

Brute force plus Cython beats cKDTree for knn with k=1 and Euclidean distance:

I[5] targets = np.random.uniform(0, 8, (5000, 7))
I[6] element = np.arange(1, 8, dtype=np.float64)

I[7] T = scipy.spatial.cKDTree(targets)
I[8] timeit T.query(element)
1 loops, best of 3: 36.1 us per loop

I[9] timeit nn(targets, element)
1 loops, best of 3: 28.5 us per loop

What about lower dimensions (2 instead of 7) where cKDTree gets faster?

I[18] element = np.arange(1,3, dtype=np.float64)
I[19] targets = np.random.uniform(0,8,(5000,2))

I[20] T = scipy.spatial.cKDTree(targets)
I[21] timeit T.query(element)
1 loops, best of 3: 27.5 us per loop

I[22] timeit nn(targets, element)
10 loops, best of 3: 11.6 us per loop

I could add nn to the bottleneck package. Is the k=1, Euclidean
distance case too specialized?
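
For reference, a pure-NumPy sketch of what nn computes (the fast version is
the Cython prototype linked below; the function body here is my own):

import numpy as np

def nn_numpy(targets, element):
    # brute-force 1-nearest-neighbor, Euclidean distance
    d2 = ((targets - element)**2).sum(axis=1)
    i = d2.argmin()
    return np.sqrt(d2[i]), i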

Prototype (messy) code is here:
https://github.com/kwgoodman/bottleneck/issues/45

For a smaller speedup (~2x), foo1 could use bottleneck's sum of squares
function, bn.ss().
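
Something along these lines, assuming bottleneck is installed (a sketch, not
benchmarked here):

import numpy as np
import bottleneck as bn

def foo1_ss(element, targets):
    # same as foo1, but with bn.ss() for the sum of squares
    d2 = bn.ss(element - targets, axis=1)
    i = d2.argmin()
    return np.sqrt(d2[i]), i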


Re: [Numpy-discussion] record arrays initialization

2012-05-03 Thread Moroney, Catherine M (388D)
 
 
On Thu, May 3, 2012 at 10:00 AM, Keith Goodman kwgood...@gmail.com wrote:

 I could add nn to the bottleneck package. Is the k=1, Euclidean
 distance case too specialized?

Thank you everybody for your advice.  I will actually be clustering the
dataset for use in the production code (which must be in Fortran), using
scipy.cluster.vq.kmeans2 to generate and populate the clusters.  A brute-force
search, even in Fortran, will still be too slow for the final code.

But the kmeans2 clustering isn't perfect: it doesn't always give results
identical to the brute-force search, so I want to analyze the differences on
a test data-set.
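
For concreteness, a minimal sketch of the cluster-then-search idea (k=50 and
the function are my own, and this is exactly the step that can disagree with
brute force, when the true nearest neighbor lands in a different cluster):

import numpy as np
from scipy.cluster.vq import kmeans2

targets = np.random.uniform(0, 8, (5000, 7))
centroids, labels = kmeans2(targets, 50, minit='points')

def nn_clustered(element):
    # nearest centroid first, then brute force within that cluster only
    c = ((centroids - element)**2).sum(axis=1).argmin()
    members = np.flatnonzero(labels == c)
    d2 = ((targets[members] - element)**2).sum(axis=1)
    j = d2.argmin()
    return np.sqrt(d2[j]), members[j]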

Catherine



[Numpy-discussion] record arrays initialization

2012-05-02 Thread Moroney, Catherine M (388D)
Thanks to Perry for some very useful off-list conversation.  I realize that
I wasn't being clear at all in my earlier description of the problem, so here
it is in a nutshell:

Find the best match in an array t(5000, 7) for a single vector e(7).  Now
scale it up so e is (128, 512, 7) and I want to return a (128, 512) array of
the t-identifiers that are the best match for e.  Best match is defined as
the minimum Euclidean distance.

I'm going to try three ways: (a) brute force and lots of looping in python,
(b) constructing a function to find the match for a single instance of e and
vectorizing it, and (c) coding it in Fortran.  I'll be curious to see the
performance figures.

Two smaller questions:

A)  How do I most efficiently construct a record array from a single array?
I want to do the following, but it segfaults on me when I try to print b.

vtype = [('x', numpy.ndarray)]
a = numpy.arange(0, 16).reshape(4,4)
b = numpy.recarray((4), dtype=vtype, buf=a)

print a
print b

What is the most efficient way of constructing b from the values of a?  In
real life, a is (128*512*7) and I want b to be (128, 512) with the 'x'
component being a 7-value numpy array.

and

B)  If I'm vectorizing a function (single) to find the best match for
a single element of e within t, how do I pass the entire array t into
the function without having it parcelled down to its individual elements?

i.e.

def single(element, targets):
    # element is one 7-vector; targets is an object wrapping the
    # (5000, 7) array as targets.data (see the workaround below)
    nlen = element.shape[0]
    nvec = targets.data.shape[0]
    x = element.reshape(1, nlen).repeat(nvec, axis=0)

    diffs = ((x - targets.data)**2).sum(axis=1)
    diffs = numpy.sqrt(diffs)
    return numpy.argmin(diffs, axis=0)

multiple = numpy.vectorize(single)
x = multiple(all_elements, target)

where all_elements is similar to b in my first example, and target
is a 2-d array.  The above code doesn't work because target gets reduced
to a single element by the time it reaches single, and I need to see the
whole array when I'm down in single.

I found a work-around by encapsulating target into a single object and passing
in the object, but I'm curious if there's a better way of doing this.
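
For anyone curious, a minimal sketch of that workaround (the Wrap class is
hypothetical; numpy.vectorize treats an arbitrary object as a single 0-d
element, so it reaches single() intact):

class Wrap(object):
    # wraps the target array so vectorize does not broadcast over it
    def __init__(self, data):
        self.data = data

multiple = numpy.vectorize(single)
x = multiple(all_elements, Wrap(target))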

I hope I've explained myself better this time around,

Catherine


Re: [Numpy-discussion] record arrays initialization

2012-05-02 Thread Stéfan van der Walt
On Wed, May 2, 2012 at 2:45 PM, Moroney, Catherine M (388D)
catherine.m.moro...@jpl.nasa.gov wrote:
 Find the best match in an array t(5000, 7) for a single vector e(7).  Now
 scale it up so e is (128, 512, 7) and I want to return a (128, 512) array
 of the t-identifiers that are the best match for e.  Best match is defined
 as the minimum Euclidean distance.

 I'm going to try three ways: (a) brute force and lots of looping in python,
 (b) constructing a function to find the match for a single instance of e and
 vectorizing it, and (c) coding it in Fortran.  I'll be curious to see the
 performance figures.

I'd use a mixture of (a) and (b):  break the t(N, 7) up into blocks of, say,
(1000, 7), compute the best match in each using broadcasting, and then
combine your results to find the best of the best.  This strategy should be
best for very large N.  For moderate N, where broadcasting easily fits into
memory, the answer given in reply to your original email would do the trick.
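
A minimal sketch of that blocked strategy (the function name and block size
are my own):

import numpy as np

def best_match_blocked(e, t, block=1000):
    # e: (M, 7) query vectors, t: (N, 7) targets; returns (M,) indices into t
    m = e.shape[0]
    best_d2 = np.empty(m)
    best_d2.fill(np.inf)
    best_i = np.zeros(m, dtype=int)
    for start in range(0, t.shape[0], block):
        chunk = t[start:start + block]
        # broadcast (M, 1, 7) against (B, 7) -> (M, B) squared distances
        d2 = ((e[:, None, :] - chunk)**2).sum(axis=2)
        i = d2.argmin(axis=1)
        d = d2[np.arange(m), i]
        better = d < best_d2
        best_d2[better] = d[better]
        best_i[better] = start + i[better]
    return best_i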

 A)  How do I most efficiently construct a record array from a single array?
 I want to do the following, but it segfaults on me when i try to print b.

 vtype = [('x', numpy.ndarray)]
 a = numpy.arange(0, 16).reshape(4,4)
 b = numpy.recarray((4), dtype=vtype, buf=a)

I prefer not to use record arrays, and stick to structured arrays:

In [11]: vtype = np.dtype([('x', (np.float, 4))])

In [12]: a = np.arange(16.).reshape((4,4))

In [13]: a.view(vtype)
Out[13]:
array([[([0.0, 1.0, 2.0, 3.0],)],
       [([4.0, 5.0, 6.0, 7.0],)],
       [([8.0, 9.0, 10.0, 11.0],)],
       [([12.0, 13.0, 14.0, 15.0],)]],
      dtype=[('x', 'f8', (4,))])
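
A follow-up sketch: the view keeps a trailing unit dimension, which can be
reshaped away to get one record per row:

b = a.view(vtype).reshape(-1)   # shape (4,): one record per row of a
b['x'][2]                       # -> array([  8.,   9.,  10.,  11.])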

 B)  If I'm vectorizing a function (single) to find the best match for
 a single element of e within t, how do I pass the entire array t into
 the function without having it parcelled down to its individual elements?

I think the new dtype just makes your life more difficult here.  Simply do:

In [49]: np.sum(a - elements.T, axis=1)
Out[49]: array([  0.,  16.,  32.,  48.])

Stéfan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] record arrays initialization

2012-05-02 Thread Kevin Jacobs jac...@bioinformed.com
On Wed, May 2, 2012 at 5:45 PM, Moroney, Catherine M (388D) 
catherine.m.moro...@jpl.nasa.gov wrote:

 Thanks to Perry for some very useful off-list conversation.   I realize
 that
 I wasn't being clear at all in my earlier description of the problem so
 here it is
 in a nutshell:

 Find the best match in an array t(5000, 7) for a single vector e(7).  Now
 scale
 it up so e is (128, 512, 7) and I want to return a (128, 512) array of the
 t-identifiers
 that are the best match for e.  Best match is defined as the minimum
 Euclidean distance.


It sounds like you want to find the nearest neighbor to a point in a
high-dimensional space. This sounds like a job for a spatial data structure
like a KD-tree.  See:

http://docs.scipy.org/doc/scipy/reference/spatial.html
http://mloss.org/software/view/143/
http://www.mrzv.org/software/pyANN/
etc.
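
A minimal usage sketch for the scipy option (targets and element as in the
original post):

from scipy.spatial import cKDTree

T = cKDTree(targets)          # build once from the (5000, 7) reference set
dist, idx = T.query(element)  # nearest neighbor of a single 7-vector
# T.query also accepts an (M, 7) batch and returns length-M arrays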

-Kevin


Re: [Numpy-discussion] record arrays initialization

2012-05-02 Thread Aronne Merrelli
On Wed, May 2, 2012 at 5:27 PM, Kevin Jacobs jac...@bioinformed.com
bioinfor...@gmail.com wrote:
 On Wed, May 2, 2012 at 5:45 PM, Moroney, Catherine M (388D)
 catherine.m.moro...@jpl.nasa.gov wrote:

 Thanks to Perry for some very useful off-list conversation.  I realize
 that I wasn't being clear at all in my earlier description of the problem,
 so here it is in a nutshell:

 Find the best match in an array t(5000, 7) for a single vector e(7).  Now
 scale it up so e is (128, 512, 7) and I want to return a (128, 512) array
 of the t-identifiers that are the best match for e.  Best match is defined
 as the minimum Euclidean distance.


 It sounds like you want to find the nearest neighbor to a point in a
 high-dimensional space. This sounds like a job for a spacial data structure
 like a KD-tree.  See:

 http://docs.scipy.org/doc/scipy/reference/spatial.html
 http://mloss.org/software/view/143/
 http://www.mrzv.org/software/pyANN/
 etc.

In general this is a good suggestion - I was going to mention it earlier -
but I think for this particular problem it is not better than the brute
force and argmin() NumPy approach. On my laptop, the KDTree query is about
a factor of 7 slower (ignoring the time cost to create the KDTree):

In [45]: def foo1(element, targets):
             distance_squared = ((element - targets)**2).sum(1)
             min_index = distance_squared.argmin()
             return sqrt(distance_squared[min_index]), min_index

In [46]: def foo2(element, T):
             return T.query(element)


In [47]: element = np.arange(1,8)
In [48]: targets = np.random.uniform(0,8,(5000,7))
In [49]: T = scipy.spatial.KDTree(targets)
In [50]: %timeit foo1(element, targets)
1000 loops, best of 3: 401 us per loop
In [51]: %timeit foo2(element, T)
100 loops, best of 3: 2.92 ms per loop

Just to make sure they say the same thing:

In [53]: foo1(element, targets)
Out[53]: (1.8173671152898632, 570)
In [54]: foo2(element, T)
Out[54]: (1.8173671152898632, 570)


I think KDTree is more optimal for larger searches (more than 5000
elements) and fewer dimensions. For example, with 500,000 elements and 2
dimensions, I get 34 ms for NumPy and 2 ms for the KDTree.

Back to the original question: at 400 us per search, even 128x512 = 65,536
elements would be about 26 seconds total, I think. That might not be too
bad.

Cheers,
Aronne


[Numpy-discussion] record arrays initialization

2012-05-02 Thread Moroney, Catherine M (388D)


On May 2, 2012, at 3:23 PM, numpy-discussion-requ...@scipy.org wrote:

 A)  How do I most efficiently construct a record array from a single array?
 I want to do the following, but it segfaults on me when I try to print b.

 vtype = [('x', numpy.ndarray)]
 a = numpy.arange(0, 16).reshape(4,4)
 b = numpy.recarray((4), dtype=vtype, buf=a)

 I prefer not to use record arrays, and stick to structured arrays:

 In [11]: vtype = np.dtype([('x', (np.float, 4))])

 In [12]: a = np.arange(16.).reshape((4,4))

 In [13]: a.view(vtype)
 Out[13]:
 array([[([0.0, 1.0, 2.0, 3.0],)],
        [([4.0, 5.0, 6.0, 7.0],)],
        [([8.0, 9.0, 10.0, 11.0],)],
        [([12.0, 13.0, 14.0, 15.0],)]],
       dtype=[('x', 'f8', (4,))])

Using structured arrays is making my code complex when I try to call the
vectorized function.  If I stick to the original record arrays, what's the
best way of initializing b from a without doing a row-by-row copy?

Catherine


Re: [Numpy-discussion] record arrays initialization

2012-05-02 Thread Kevin Jacobs jac...@bioinformed.com
On Wed, May 2, 2012 at 7:25 PM, Aronne Merrelli
aronne.merre...@gmail.com wrote:

 In general this is a good suggestion - I was going to mention it
 earlier - but I think for this particular problem it is not better
 than the brute force and argmin() NumPy approach. On my laptop, the
 KDTree query is about a factor of 7 slower (ignoring the time cost to
 create the KDTree)


The cKDTree implementation is more than 4 times faster than the brute-force
approach:

T = scipy.spatial.cKDTree(targets)

In [11]: %timeit foo1(element, targets)   # Brute force
1000 loops, best of 3: 385 us per loop

In [12]: %timeit foo2(element, T) # cKDTree
1 loops, best of 3: 83.5 us per loop

In [13]: 385/83.5
Out[13]: 4.610778443113772


A FLANN implementation should be even faster--perhaps by as much as another
factor of two.

-Kevin


Re: [Numpy-discussion] record arrays initialization

2012-05-02 Thread Stéfan van der Walt
On Wed, May 2, 2012 at 4:26 PM, Moroney, Catherine M (388D)
catherine.m.moro...@jpl.nasa.gov wrote:
 Using structured arrays is making my code complex when I try to call the
 vectorized function.  If I stick to the original record arrays, what's the
 best way of initializing b from a without doing a row-by-row copy?

What does your original data look like?  It seems like `a` is already
what you need after the reshape?
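
That is, something like this sketch, with raw standing in for the real
(hypothetical) flat input:

import numpy as np

raw = np.arange(128 * 512 * 7, dtype=np.float64)  # stand-in for the data
a = raw.reshape(128, 512, 7)
a[10, 20]   # the 7-vector for pixel (10, 20); no record array required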

Stéfan


Re: [Numpy-discussion] record arrays initialization

2012-05-02 Thread Stéfan van der Walt
On Wed, May 2, 2012 at 4:46 PM, Kevin Jacobs jac...@bioinformed.com
bioinfor...@gmail.com wrote:
 A FLANN implementation should be even faster--perhaps by as much as another
 factor of two.

I guess it depends on whether you care about the "Approximate" in
"Fast Library for Approximate Nearest Neighbors".

Stéfan


Re: [Numpy-discussion] record arrays initialization

2012-05-02 Thread Benjamin Root
On Wednesday, May 2, 2012, Stéfan van der Walt wrote:

 On Wed, May 2, 2012 at 4:46 PM, Kevin Jacobs jac...@bioinformed.com
 bioinfor...@gmail.com wrote:
  A FLANN implementation should be even faster--perhaps by as much as
 another
  factor of two.

 I guess it depends on whether you care about the "Approximate" in
 "Fast Library for Approximate Nearest Neighbors".

 Stéfan


This is why I love following these lists!  I don't think I ever would have
come across this method on my own. Nifty!

Ben Root


Re: [Numpy-discussion] record arrays initialization

2012-05-02 Thread Aronne Merrelli
On Wed, May 2, 2012 at 6:46 PM, Kevin Jacobs jac...@bioinformed.com
bioinfor...@gmail.com wrote:

 The cKDTree implementation is more than 4 times faster than the brute-force
 approach:

 T = scipy.spatial.cKDTree(targets)

 In [11]: %timeit foo1(element, targets)   # Brute force
 1000 loops, best of 3: 385 us per loop

 In [12]: %timeit foo2(element, T)         # cKDTree
 1 loops, best of 3: 83.5 us per loop

 In [13]: 385/83.5
 Out[13]: 4.610778443113772

Wow, not sure how I missed that! It even seems to scale better than
linear (some of that 83us is call overhead, I guess):

In [35]: %timeit foo2(element, T)
1 loops, best of 3: 115 us per loop
In [36]: elements = np.random.uniform(0,8,[128,512,7])
In [37]: %timeit foo2(elements.reshape((128*512,7)), T)
1 loops, best of 3: 2.66 s per loop

So only 2.7 seconds to search the whole set. Not bad!
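
And to get back to the (128, 512) grid of t-identifiers the OP asked for,
something like:

dists, idxs = T.query(elements.reshape(128 * 512, 7))
idxs = idxs.reshape(128, 512)   # index into targets for each grid cell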

Cheers,
Aronne