Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-29 Thread Christopher Barker
Anne Archibald wrote:
> I suggest the creation of
> a new submodule of scipy, scipy.spatial,

+1

Here's one to consider:

http://pypi.python.org/pypi/Rtree

and perhaps other stuff from:

http://trac.gispython.org/spatialindex/wiki

which I think is LGPL -- can scipy use that?

By the way, a general purpose bounding box class could be kind of handy 
-- I have one that is a start. It's a subclass of numpy arrays.

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[EMAIL PROTECTED]
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-29 Thread Charles R Harris
On Mon, Sep 29, 2008 at 9:24 PM, Anne Archibald
<[EMAIL PROTECTED]> wrote:

> Hi,
>
> Once again there has been a thread on the numpy/scipy mailing lists
> requesting (essentially) some form of spatial data structure. Pointers
> have been posted to ANN (sadly LGPLed and in C++) as well as a handful
> of pure-python implementations of kd-trees. I suggest the creation of
> a new submodule of scipy, scipy.spatial, to contain spatial data
> structures and algorithms. Specifically, I propose it contain a
> kd-tree implementation providing nearest-neighbor, approximate
> nearest-neighbor, and all-points-near queries. I have a few other
> suggestions for things it might contain, but kd-trees seem like a good
> first step.
>
> 2008/9/27 Nathan Bell <[EMAIL PROTECTED]>:
> > On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald
> > <[EMAIL PROTECTED]> wrote:
> >>
> >> I think a kd-tree implementation would be a valuable addition to
> >> scipy, perhaps in a submodule scipy.spatial that might eventually
> >> contain other spatial data structures and algorithms. What do you
> >> think? Should we have one? Should it be based on Sturla Molden's code,
> >> if the license permits? I am willing to contribute one, if not.
> >
> > +1
>
> Judging that your vote and mine are enough in the absence of
> dissenting voices, I have written an implementation based on yours,
> Sturla Molden's, and the algorithms described by the authors of the
> ANN library. Before integrating it into scipy, though, I'd like to
> send it around for comments.
>
> Particular issues:
>
> * It's pure python right now, but with some effort it could be
> partially or completely cythonized. This is probably a good idea in
> the long run. In the meantime I have crudely vectorized it so that
> users can at least avoid looping in their own code.
> * It is able to return the r nearest neighbors, optionally subject to
> a maximum distance limit, as well as approximate nearest neighbors.
> * It is not currently set up to conveniently return all points within
> some fixed distance of the target point, but this could easily be
> added.
> * It returns distances and indices of nearest neighbors in the original
> array.
> * The testing code is, frankly, a mess. I need to look into using nose
> in a sensible fashion.
> * The license is the scipy license.
>
> I am particularly concerned about providing a convenient return
> format. The natural return from a query is a list of neighbors, since
> it may have variable length (there may not be r elements in the tree,
> or you might have supplied a maximum distance which doesn't contain r
> points). For a single query, it's simple to return a python list
> (should it be sorted? currently it's a heap). But if you want to
> vectorize the process, storing the results in an array becomes
> cumbersome. One option is an object array full of lists; another,
which I currently use, is an array with nonexistent points marked
> with an infinite distance and an invalid index. A third would be to
> return masked arrays. How do you recommend handling this variable (or
> potentially-variable) sized output?
>

I think lists are the way to go. They should be pretty fast down at the C
level. I use them myself for collecting the elements of equivalence classes
and such that can be quite dynamic when new elements/relations are added.
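For concreteness, the padded-array convention Anne describes (an infinite distance plus an invalid index marking missing neighbors) can be sketched with brute force; `query_padded` and everything in it are made up for this illustration, not code from the actual implementation:

```python
import numpy as np

def query_padded(data, x, k):
    """Return the k nearest neighbors of x, padding missing entries
    with an infinite distance and an invalid index (len(data))."""
    d = np.sqrt(((data - x) ** 2).sum(axis=1))
    order = np.argsort(d)[:k]
    dist = np.full(k, np.inf)
    idx = np.full(k, len(data))   # out-of-range index marks "no neighbor"
    dist[:len(order)] = d[order]
    idx[:len(order)] = order
    return dist, idx

data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
dist, idx = query_padded(data, np.array([0.1, 0.0]), k=5)
# only 3 points exist, so the last two slots stay (inf, invalid)
```

The object-array-of-lists alternative would instead return ragged Python lists, which is friendlier per query but harder to post-process with array operations.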


>
> > If you're implementing one, I would highly recommend the
> > "left-balanced" kd-tree.
> >
> http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf
>
> Research suggests that, at least in high dimensions, a more geometric
> balancing criterion can produce vastly better results. I used the
> "sliding midpoint" rule, which doesn't allow such a numerical
> balancing but does ensure that you don't have too many long skinny
> cells (since a sphere tends to cut very many of these).
>
> I've also been thinking about what else would go in scipy.spatial. I
> think it would be valuable to have a reasonably efficient distance
> matrix function (we seem to get that question a lot, and the answer's
> not trivial) as well as a sparse distance matrix function based on the
> kd-trees. The module could potentially also swallow the (currently
> sandboxed?) delaunay code.
>

It would be good to define the interface separately from the algorithm so
that the latter can be tweaked without affecting the API.
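The distance-matrix function Anne mentions can be sketched in pure NumPy with broadcasting (illustrative only; a scipy version would presumably want to be more memory-conscious for large inputs):

```python
import numpy as np

def distance_matrix(X, Y):
    """D[i, j] = Euclidean distance between X[i] and Y[j],
    via the standard broadcasting trick."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    diff = X[:, None, :] - Y[None, :, :]   # shape (len(X), len(Y), dim)
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, 4.0]])
D = distance_matrix(X, Y)   # shape (2, 3)
```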

Chuck


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Peter
On Tue, Sep 30, 2008 at 5:10 AM, Christopher Barker
<[EMAIL PROTECTED]> wrote:
>
> Anne Archibald wrote:
>> I suggest the creation of
>> a new submodule of scipy, scipy.spatial,
>
> +1
>
> Here's one to consider:
> http://pypi.python.org/pypi/Rtree
> and perhaps other stuff from:
> http://trac.gispython.org/spatialindex/wiki
> which I think is LGPL -- can scipy use that?

There is also a KDTree module in Biopython (which is under a BSD/MIT
style licence),
http://biopython.org/SRC/biopython/Bio/KDTree/

The current version is in C, there is an older version available in
the CVS history in C++ too,
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/KDTree/?cvsroot=biopython

Peter


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Anne Archibald
2008/9/30 Peter <[EMAIL PROTECTED]>:
> [Peter's message, with Christopher Barker's links, quoted in full --
> snipped; see above.]

I think the profusion of different implementations is an argument for
including this in scipy. I think it is also an argument for providing
a standard interface with (at least potentially) several different
implementations. At the moment, that proposed interface looks like:

T = KDTree(data)

distances, indices = T.query(xs) # single nearest neighbor

distances, indices = T.query(xs, k=10) # ten nearest neighbors

distances, indices = T.query(xs, k=None, distance_upper_bound=1.0)  # all within 1 of xs

In the first two cases, missing neighbors are represented with an
infinite distance and an invalid index. In the last case, distances
and indices are both either lists (if there's only one query point) or
object arrays of lists (if there are many query points). If only one
neighbor is requested, the array does not have a dimension of length 1
in the which-neighbor position. If (potentially) many neighbors are
returned, they are sorted by distance, nearest first.
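A brute-force stand-in for this proposed interface (purely illustrative; `BruteKDTreeSketch` is a made-up name, not Anne's kd-tree, and only single query points are handled here) might look like:

```python
import numpy as np

class BruteKDTreeSketch:
    """Minimal brute-force sketch of the proposed query() interface."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def query(self, x, k=1, distance_upper_bound=np.inf):
        d = np.sqrt(((self.data - np.asarray(x)) ** 2).sum(axis=1))
        if k is None:
            # all neighbors within the bound, sorted nearest-first
            hits = np.flatnonzero(d <= distance_upper_bound)
            order = hits[np.argsort(d[hits])]
            return list(d[order]), list(order)
        order = np.argsort(d)[:k]
        dist = np.full(k, np.inf)            # missing -> infinite distance
        idx = np.full(k, len(self.data))     # missing -> invalid index
        ok = d[order] <= distance_upper_bound
        dist[:ok.sum()] = d[order][ok]
        idx[:ok.sum()] = order[ok]
        if k == 1:
            return dist[0], idx[0]   # no length-1 axis, as proposed
        return dist, idx

T = BruteKDTreeSketch([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
d1, i1 = T.query([0.2, 0.0])                  # single nearest neighbor
dk, ik = T.query([0.2, 0.0], k=2)             # two nearest
da, ia = T.query([0.2, 0.0], k=None, distance_upper_bound=1.0)
```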

What do you think of this interface?

It may make sense to provide additional kinds of query - nearest
neighbors between two trees, for example - some of which would be
available only for some implementations.

Anne


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Gael Varoquaux
On Tue, Sep 30, 2008 at 05:31:17PM -0400, Anne Archibald wrote:
> T = KDTree(data)

> distances, indices = T.query(xs) # single nearest neighbor

> distances, indices = T.query(xs, k=10) # ten nearest neighbors

> distances, indices = T.query(xs, k=None, distance_upper_bound=1.0)  # all within 1 of xs

> What do you think of this interface?

k=None in the third call to T.query seems redundant. It should be
possible to put in some logic so that the call is simply

distances, indices = T.query(xs, distance_upper_bound=1.0)

My 2 cents,

Gaël


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Anne Archibald
2008/9/30 Gael Varoquaux <[EMAIL PROTECTED]>:
> On Tue, Sep 30, 2008 at 05:31:17PM -0400, Anne Archibald wrote:
>> T = KDTree(data)
>
>> distances, indices = T.query(xs) # single nearest neighbor
>
>> distances, indices = T.query(xs, k=10) # ten nearest neighbors
>
>> distances, indices = T.query(xs, k=None, distance_upper_bound=1.0)  # all within 1 of xs
>
>> What do you think of this interface?
>
> k=None in the third call to T.query seems redundant. It should be
> possible to put in some logic so that the call is simply
>
> distances, indices = T.query(xs, distance_upper_bound=1.0)

Well, the problem with this is that you often want to provide a
distance upper bound as well as a number of nearest neighbors. For
example, suppose you are trying to find the point of closest approach
of a path to the set of points stored in a kd-tree. You do the first
query normally, but then since the second point is close to the first,
you can accelerate the search dramatically by providing an upper bound
of the distance from the first point's nearest neighbor to the second
point. With a little luck, this will save ever having to visit much of
the tree.
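The closest-approach trick can be sketched as follows (brute force standing in for the kd-tree, and all names made up for the illustration). By the triangle inequality, the next point's nearest neighbor is at most the previous distance plus the step length away, so the bound never excludes the true nearest:

```python
import numpy as np

def nearest(data, x, bound=np.inf):
    """Nearest neighbor of x, or (inf, -1) if none lies within bound.
    A kd-tree would use bound to prune; brute force just checks it."""
    d = np.sqrt(((data - x) ** 2).sum(axis=1))
    i = int(np.argmin(d))
    return (d[i], i) if d[i] <= bound else (np.inf, -1)

rng = np.random.default_rng(0)
data = rng.random((100, 2))
path = np.linspace([0, 0], [1, 1], 20)   # query points along a path

best = np.inf
bound = np.inf
for j, p in enumerate(path):
    d, i = nearest(data, p, bound)
    if d < best:
        best = d
    # the next query point is at most d + step from p's nearest neighbor
    step = np.linalg.norm(path[min(j + 1, len(path) - 1)] - p)
    bound = d + step if np.isfinite(d) else np.inf
```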

It is also possible, of course, to provide wrappers to query() that do
just one thing; I had this in mind more for fiddling the output
(returning only the distance to the nearest neighbor, for instance).

Anne


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Nathan Bell
On Tue, Sep 30, 2008 at 6:10 PM, Anne Archibald
<[EMAIL PROTECTED]> wrote:
>
> Well, the problem with this is that you often want to provide a
> distance upper bound as well as a number of nearest neighbors. For

This use case is also important in scattered data interpolation, so we
definitely want to support it.

-- 
Nathan Bell [EMAIL PROTECTED]
http://graphics.cs.uiuc.edu/~wnbell/


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Gael Varoquaux
On Tue, Sep 30, 2008 at 06:10:46PM -0400, Anne Archibald wrote:
> > k=None in the third call to T.query seems redundant. It should be
> > possible to put in some logic so that the call is simply

> > distances, indices = T.query(xs, distance_upper_bound=1.0)

> Well, the problem with this is that you often want to provide a
> distance upper bound as well as a number of nearest neighbors.

Absolutely. I just think k should default to None, when
distance_upper_bound is specified. k=None could be interpreted as k=1
when distance_upper_bound is not specified.

My 2 cents,

Gaël


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Barry Wark
On Mon, Sep 29, 2008 at 8:24 PM, Anne Archibald
<[EMAIL PROTECTED]> wrote:
> [Anne's original proposal and her exchange with Nathan Bell, quoted in
> full, snipped -- see the start of the thread above.]

Anne,

Thanks for taking this on. The scikits.ann has licensing issues (as
noted above), so it would be nice to have a clean-room implementation
in scipy. I am happy to port the scikits.ann API to the final API that
you choose, however, if you think that would be helpful.

Cheers,
Barry


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-09-30 Thread Nathan Bell
On Wed, Oct 1, 2008 at 1:26 AM, Gael Varoquaux
<[EMAIL PROTECTED]> wrote:
>
> Absolutely. I just think k should default to None, when
> distance_upper_bound is specified. k=None could be interpreted as k=1
> when distance_upper_bound is not specified.
>

Why not expose the various possibilities through different names?

# nearest k points (possibly fewer)
query_nearest(pt, k=1)

# all points within given distance
query_sphere(pt, distance)

#nearest k points within given distance (possibly fewer)
query(pt, k, distance)

Few people will use the last form, but it's useful nevertheless.
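A sketch of this layered design, with one general routine behind the named front-ends (the function names come from the proposal above; the brute-force bodies are illustrative, not a real kd-tree):

```python
import numpy as np

def _query(data, pt, k, distance):
    """General routine: up to k nearest points within distance, sorted."""
    d = np.sqrt(((np.asarray(data) - np.asarray(pt)) ** 2).sum(axis=1))
    order = np.argsort(d)
    order = order[d[order] <= distance][:k]
    return list(d[order]), list(order)

def query_nearest(data, pt, k=1):
    # nearest k points (possibly fewer)
    return _query(data, pt, k, np.inf)

def query_sphere(data, pt, distance):
    # all points within the given distance
    return _query(data, pt, len(data), distance)

pts = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
d, i = query_nearest(pts, [0.1, 0.0], k=2)    # two nearest
ds, js = query_sphere(pts, [0.1, 0.0], 1.0)   # all within distance 1
```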

-- 
Nathan Bell [EMAIL PROTECTED]
http://graphics.cs.uiuc.edu/~wnbell/


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-01 Thread Anne Archibald
2008/10/1 Gael Varoquaux <[EMAIL PROTECTED]>:
> On Tue, Sep 30, 2008 at 06:10:46PM -0400, Anne Archibald wrote:
>> > k=None in the third call to T.query seems redundant. It should be
>> > possible to put in some logic so that the call is simply
>
>> > distances, indices = T.query(xs, distance_upper_bound=1.0)
>
>> Well, the problem with this is that you often want to provide a
>> distance upper bound as well as a number of nearest neighbors.
>
> Absolutely. I just think k should default to None, when
> distance_upper_bound is specified. k=None could be interpreted as k=1
> when distance_upper_bound is not specified.

That seems very confusing. Better perhaps to have a function
query_all_neighbors, even if it's just a wrapper.

Anne


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-01 Thread Anne Archibald
2008/10/1 Barry Wark <[EMAIL PROTECTED]>:

> Thanks for taking this on. The scikits.ann has licensing issues (as
> noted above), so it would be nice to have a clean-room implementation
> in scipy. I am happy to port the scikits.ann API to the final API that
> you choose, however, if you think that would be helpful.

That's a nice idea. I'm not totally sure yet how much it's going to be
possible for different implementations to be plug-in replacements, but
it sure would be nice if users could use ANN transparently.

Anne


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-01 Thread James Philbin
> distances, indices = T.query(xs) # single nearest neighbor
I'm not sure if it's implied, but can xs be an NxD matrix here, i.e. a
query for all N points rather than just one? This would reduce the
Python call overhead for large queries.

Also, I have some C++ code for locality-sensitive hashing which might be useful.

James


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-02 Thread David Bolme
I also like the idea of a scipy.spatial library.  For the research I  
do in machine learning and computer vision we are often interested in  
specifying different distance measures.  It would be nice to have a  
way to specify the distance measure.  I would like to see a standard  
set included: City Block, Euclidean, Correlation, etc as well as a  
capability for a user defined distance or similarity function.




Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-02 Thread Matthieu Brucher
2008/10/2 David Bolme <[EMAIL PROTECTED]>:
> I also like the idea of a scipy.spatial library.  For the research I
> do in machine learning and computer vision we are often interested in
> specifying different distance measures.  It would be nice to have a
> way to specify the distance measure.  I would like to see a standard
> set included: City Block, Euclidean, Correlation, etc as well as a
> capability for a user defined distance or similarity function.

You mean similarity or dissimilarity? Distance is a dissimilarity but
correlation is a similarity measure.

Matthieu
-- 
French PhD student
Information System Engineer
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-02 Thread David Bolme
It may be useful to have an interface that handles both cases:  
similarity and dissimilarity.  Often I have seen "Nearest Neighbor"  
algorithms that look for maximum similarity instead of minimum  
distance.  In my field (biometrics) we often deal with very  
specialized distance or similarity measures.  I would like to see  
support for user defined distance and similarity functions.  It should  
be easy to implement by passing a function object to the KNN class.  I  
am not sure if kd-trees or other fast algorithms are compatible with  
similarities or non-Euclidean norms, however I would be willing to
implement an exhaustive search KNN that would support user defined  
functions.

On Oct 2, 2008, at 2:01 PM, Matthieu Brucher wrote:

> [Matthieu's message quoted in full -- snipped; see above.]



Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-02 Thread Anne Archibald
2008/10/2 David Bolme <[EMAIL PROTECTED]>:
> It may be useful to have an interface that handles both cases:
> similarity and dissimilarity.  Often I have seen "Nearest Neighbor"
> algorithms that look for maximum similarity instead of minimum
> distance.  In my field (biometrics) we often deal with very
> specialized distance or similarity measures.  I would like to see
> support for user defined distance and similarity functions.  It should
> be easy to implement by passing a function object to the KNN class.  I
> am not sure if kd-trees or other fast algorithms are compatible with
> similarities or non-Euclidean norms, however I would be willing to
> implement an exhaustive search KNN that would support user defined
> functions.

kd-trees can only work for distance measures which have certain
special properties (in particular, you have to be able to bound them
based on coordinate differences). This is just fine for all the
Minkowski p-norms (so in particular, Euclidean distance,
maximum-coordinate-difference, and Manhattan distance) and in fact the
current implementation already supports all of these. I don't think
that correlation can be made into such a distance measure - the
neighborhoods are the wrong shape. In fact the basic space is
projective n-1 space rather than affine n-space, so I think you're
going to need some very different algorithm. If you make a metric
space out of it - define d(A,B) to be the angle between A and B - then
cover trees can serve as a spatial data structure for nearest-neighbor
search. Cover trees may be worth implementing, as they're a very
generic data structure, suitable for (among other things)
low-dimensional data in high-dimensional spaces.
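For concreteness, the angular metric d(A, B) = angle between A and B can be written down and spot-checked numerically (illustrative sketch; a cover tree would index exactly this kind of metric):

```python
import numpy as np

def angle(a, b):
    """Angle between vectors a and b: a true metric on directions,
    satisfying the triangle inequality (spherical distance)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(c, -1.0, 1.0))  # clip guards rounding error

rng = np.random.default_rng(1)
A, B, C = rng.standard_normal((3, 5))
# triangle inequality: d(A, C) <= d(A, B) + d(B, C)
ok = angle(A, C) <= angle(A, B) + angle(B, C) + 1e-12
```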

Anne


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-03 Thread David Bolme
I remember reading a paper or book that stated that for data that has  
been normalized, correlation and Euclidean distance are equivalent and will
produce the same knn results.  To this end I spent a couple hours this  
afternoon doing the math.  This document is the result.

http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

I believe that with mean subtracted and unit length vectors, a  
Euclidean knn algorithm will produce the same result as if the
vectors were compared using correlation.  I am not sure if kd-trees  
will perform well on the normalized vectors as they have a very  
specific geometry.  If my math checks out it may be worth adding  
Pearson's correlation as a default option or as a separate class.
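This identity is easy to check numerically: for mean-subtracted, unit-length vectors u and v, ||u - v||^2 = 2 - 2 u.v = 2(1 - r), where r is Pearson's correlation, so Euclidean nearest-neighbor order and correlation order coincide (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))   # candidate vectors
q = rng.standard_normal(8)         # query vector

def normalize(v):
    """Subtract the mean, then scale to unit length."""
    v = v - v.mean(axis=-1, keepdims=True)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

Xn, qn = normalize(X), normalize(q)
eucl = np.linalg.norm(Xn - qn, axis=1)                     # distances
corr = np.array([np.corrcoef(q, x)[0, 1] for x in X])      # correlations

nearest_by_euclidean = int(np.argmin(eucl))
best_by_correlation = int(np.argmax(corr))
```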

I have also spent a little time looking at kd-trees and the kdtree  
code.  It looks good.  As I understand it kd-trees only work well when  
the number of datapoints (N) is larger than 2^D, where D is the  
dimensionality of those points.  This will not work well for many of  
my computer vision problems because often D is large.  As Anne  
suggested I will probably look at cover trees because often times the  
data are "low-dimensional data in high-dimensional spaces".  I have  
been told though that with a large D there is no known fast
algorithm for knn.

Another problem is that the distances and similarity measures used in  
biometrics and computer vision are often very specialized and may or  
may not conform to the underlying assumptions of fast algorithms.  I  
think for this reason I will need an exhaustive search algorithm.  I  
will code it up modeled after Anne's interface and hopefully it will  
make it into the spatial module.

I think that kd-trees and the spatial module are a good contribution  
to scipy.  I have also enjoyed learning more about norms, correlation,  
and fast knn algorithms.

Thanks,
Dave


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-03 Thread Anne Archibald
2008/10/3 David Bolme <[EMAIL PROTECTED]>:
> I remember reading a paper or book that stated that for data that has
> been normalized correlation and Euclidean are equivalent and will
> produce the same knn results.  To this end I spent a couple hours this
> afternoon doing the math.  This document is the result.
>
> http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

Yes, you're right, even without the mean subtraction they all lie on a
hypersphere in Euclidean space. It's a little more awkward if you want
to identify x and -x.

> I believe that with mean subtracted and unit length vectors, a
> Euclidean knn algorithm will produce the same result as if the
> vectors were compared using correlation.  I am not sure if kd-trees
> will perform well on the normalized vectors as they have a very
> specific geometry.  If my math checks out it may be worth adding
> Pearson's correlation as a default option or as a separate class.

Actually it's probably easier if the user just does the prenormalization.

> I have also spent a little time looking at kd-trees and the kdtree
> code.  It looks good.  As I understand it kd-trees only work well when
> the number of datapoints (N) is larger than 2^D, where D is the
> dimensionality of those points.  This will not work well for many of
> my computer vision problems because often D is large.  As Anne
> suggested I will probably look at cover trees because often times the
> data are "low-dimensional data in high-dimensional spaces".  I have
> been told though that with a large D there is no known fast
> algorithm for knn.

Pretty much true. Though if the intrinsic dimensionality is low, cover
trees should be all right.

> Another problem is that the distances and similarity measures used in
> biometrics and computer vision are often very specialized and may or
> may not conform to the underlying assumptions of fast algorithms.  I
> think for this reason I will need an exhaustive search algorithm.  I
> will code it up modeled after Anne's interface and hopefully it will
> make it into the spatial module.

Metric spaces are quite general - edit distance for strings, for
example, is an adequate distance measure. But brute-force is
definitely worth having too.

If I get the test suite cleaned up, it should be possible to just drop
an arbitrary k-nearest-neighbors class into it and get a reasonably
thorough collection of unit tests.

Anne


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-08 Thread Rob Hetland


My version of a wrapper of ANN is attached.  I wrote it when I had  
some issues installing the scikits.ann package.  It uses ctypes, and  
might be useful in deciding on an API.


Please feel free to take what you like,

-Rob




[attachment: ann.tar.gz -- GNU Zip compressed data]



Rob Hetland, Associate Professor
Dept. of Oceanography, Texas A&M University
http://pong.tamu.edu/~rob
phone: 979-458-0096, fax: 979-845-6331





Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-08 Thread Rob Hetland

On Oct 8, 2008, at 1:36 PM, Anne Archibald wrote:

> How important did you find the ability to select priority versus
> non-priority search?
>
> How important did you find the ability to select other splitting  
> rules?


In both cases, I included those things just because they were options  
in ANN, rather than any real need on my part.

The things that I found important (and lacking in some of the other  
many kd_tree implementations) were the ability to select an arbitrary  
number of nearest neighbors returned and specifying a max search radius.

I can see that having a choice of splitting rules may be handy for  
unusual point distributions, but I have always used this on fairly  
regularly spaced data.

-Rob


Rob Hetland, Associate Professor
Dept. of Oceanography, Texas A&M University
http://pong.tamu.edu/~rob
phone: 979-458-0096, fax: 979-845-6331





Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-09 Thread David Bolme
I have written up basic nearest neighbor algorithm.  It does a brute  
force search so it will be slower than kdtrees as the number of points  
gets large.  It should however work well for high dimensional data.  I  
have also added the option for user defined distance measures.  The  
user can set a default "p". As a float, "p" behaves as it does in the 
kd-tree code (a Minkowski exponent); it can also be a function that 
computes a distance matrix, or the measure can be selected using the 
strings "Manhattan", "Euclidean", or "Correlation".

https://pyvision.svn.sourceforge.net/svnroot/pyvision/trunk/src/pyvision/vector/knn.py
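A hedged sketch of how such a dispatch on "p" might look (the names and structure here are illustrative, not the actual pyvision knn API):

```python
import numpy as np

def resolve_metric(p):
    """Turn p -- float, callable, or metric name -- into a function that
    computes an (n, m) distance matrix between row-vector sets A and B.
    Illustrative sketch only; not the actual pyvision interface."""
    if callable(p):
        return p
    if isinstance(p, (int, float)):
        # Minkowski p-norm, matching the float behavior described above
        return lambda A, B: np.linalg.norm(
            A[:, None, :] - B[None, :, :], ord=p, axis=-1)
    if p == "Manhattan":
        return resolve_metric(1.0)
    if p == "Euclidean":
        return resolve_metric(2.0)
    if p == "Correlation":
        def corr_dist(A, B):
            A = A - A.mean(axis=1, keepdims=True)
            B = B - B.mean(axis=1, keepdims=True)
            A = A / np.linalg.norm(A, axis=1, keepdims=True)
            B = B / np.linalg.norm(B, axis=1, keepdims=True)
            return 1.0 - A @ B.T  # 1 - Pearson correlation
        return corr_dist
    raise ValueError(f"unknown metric {p!r}")
```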

The interface is similar to Anne's code and in many cases can be used  
as a drop in replacement.  I have posted the code to my own project  
because I have a short term need and I do not have access to the scipy  
repository.  Feel free to include the code with scipy under the scipy  
license.

I did find a typo in your documentation.
typo "trie -> tree" - ... kd-tree is a binary trie, each of whose ...

Also, I found the use of k in the documentation somewhat confusing, as  
it denotes both the dimensionality of the data points in the kd-tree  
and the number of neighbors in k-nearest neighbors.

>> I believe that with mean subtracted and unit length vectors, a
>> Euclidean knn algorithm will produces the same result as if the
>> vectors were compared using correlation.  I am not sure if kd-trees
>> will perform well on the normalized vectors as they have a very
>> specific geometry.  If my math checks out it may be worth adding
>> Pearson's correlation as a default option or as a separate class.
>
> Actually it's probably easier if the user just does the  
> prenormalization.

I agree.  I think we should keep your class as-is and maybe create a  
class that wraps the kdtree and handles the normalization for  
correlation.  I would also like to look at cover trees, however that  
will have to wait a couple months until I have more time.
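The normalization claim in the quoted exchange is easy to check numerically: after mean subtraction and scaling to unit length, squared Euclidean distance is an affine function of correlation, so the two neighbor rankings agree.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 10))
X = X - X.mean(axis=1, keepdims=True)             # mean-subtract each row
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale to unit length

a, b = X[0], X[1]
corr = a @ b                     # Pearson correlation of the rows
sq_dist = np.sum((a - b) ** 2)   # squared Euclidean distance
# ||a - b||^2 = 2 - 2*corr, so sorting by distance == sorting by -corr
assert np.isclose(sq_dist, 2.0 - 2.0 * corr)
```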

Dave
---
http://www.cs.colostate.edu/~bolme


Re: [Numpy-discussion] Proposal: scipy.spatial

2008-10-12 Thread Anne Archibald
2008/10/9 David Bolme <[EMAIL PROTECTED]>:
> I have written up basic nearest neighbor algorithm.  It does a brute
> force search so it will be slower than kdtrees as the number of points
> gets large.  It should however work well for high dimensional data.  I
> have also added the option for user defined distance measures.  The
> user can set a default "p".  "p" has the same functionality if it is a
> float.  "p" can also be a function that computes a distance matrix or
> the measure can be selected using the strings: "Manhattan",
> "Euclidean", or "Correlation".
>
> https://pyvision.svn.sourceforge.net/svnroot/pyvision/trunk/src/pyvision/vector/knn.py

This is interesting. I would point out, though, that if you want a
Minkowski norm, it may be more efficient (that is, faster) to use the
new compiled kd-tree implementation with leafsize set to the size of
your data. This is written in compiled code and uses short-circuit
distance evaluation, and may be much faster for high-dimensional
problems.
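For concreteness, the trick Anne describes looks like this, assuming the scipy.spatial API that eventually grew out of this thread:

```python
import numpy as np
from scipy.spatial import cKDTree

# With leafsize equal to the number of points, the tree is one big leaf,
# so query() reduces to a compiled brute-force scan with short-circuit
# distance evaluation -- often competitive for high-dimensional data.
data = np.random.default_rng(1).standard_normal((1000, 50))
tree = cKDTree(data, leafsize=len(data))
dist, idx = tree.query(data[0], k=3)
```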

Given that, this should perhaps go in with other generic metric space
code. I have a functional implementation of ball trees (though I don't
know how efficient they are), and am looking into implementing cover
trees.

> The interface is similar to Anne's code and in many cases can be used
> as a drop in replacement.  I have posted the code to my own project
> because I have a short term need and I do not have access to the scipy
> repository.  Feel free to include the code with scipy under the scipy
> license.
>
> I did find a typo your documentation.
> typo "trie -> tree" - ... kd-tree is a binary trie, each of whose ...

That's not actually a typo: a trie is a tree in which all the data is
stored at leaf nodes. Basic kd-trees use the nodes themselves to
define splitting planes; you can actually construct one with no extra
storage at all, just by choosing an appropriate order for your array
of points. This implementation chooses splitting planes that may not
pass through any point, so all the points get stored in leaves.
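A minimal sketch of that trie layout (illustrative only, not Anne's implementation): inner nodes store just a splitting plane, and every point ends up in a leaf.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Leaf:
    points: List[Tuple[float, ...]]

@dataclass
class Inner:
    axis: int        # dimension the plane splits on
    split: float     # plane position; need not pass through any point
    less: "Node"
    greater: "Node"

Node = Union[Leaf, Inner]

def build(points, axis=0, leafsize=2, m=None):
    """Build a trie-style kd-tree: data only at the leaves."""
    if m is None:
        m = len(points[0])
    if len(points) <= leafsize:
        return Leaf(points)
    vals = sorted(p[axis] for p in points)
    # midpoint between the two middle coordinates -- may miss every point
    split = 0.5 * (vals[len(vals) // 2 - 1] + vals[len(vals) // 2])
    less = [p for p in points if p[axis] <= split]
    greater = [p for p in points if p[axis] > split]
    if not less or not greater:  # degenerate split: stop here
        return Leaf(points)
    nxt = (axis + 1) % m
    return Inner(axis, split,
                 build(less, nxt, leafsize, m),
                 build(greater, nxt, leafsize, m))
```

A plain kd-tree would instead store one point at each inner node; the trie variant trades a little storage for freedom in choosing splitting planes.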

> Also I found the use of k in the documentation some what confusing as
> it is the dimensionality of the data points in the kd-tree and the
> number of neighbors for k-nearest neighbors.

That's a good point. I changed the dimension of the kd-tree to m throughout.

Anne