Re: [Numpy-discussion] Proposal: scipy.spatial
Anne Archibald wrote:
> I suggest the creation of a new submodule of scipy, scipy.spatial,

+1

Here's one to consider:
http://pypi.python.org/pypi/Rtree
and perhaps other stuff from:
http://trac.gispython.org/spatialindex/wiki
which I think is LGPL -- can scipy use that?

By the way, a general-purpose bounding box class could be kind of handy -- I have one that is a start. It's a subclass of numpy arrays.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R          (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115      (206) 526-6317 main reception
[EMAIL PROTECTED]

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Proposal: scipy.spatial
On Mon, Sep 29, 2008 at 9:24 PM, Anne Archibald <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Once again there has been a thread on the numpy/scipy mailing lists requesting (essentially) some form of spatial data structure. Pointers have been posted to ANN (sadly LGPLed and in C++) as well as a handful of pure-python implementations of kd-trees. I suggest the creation of a new submodule of scipy, scipy.spatial, to contain spatial data structures and algorithms. Specifically, I propose it contain a kd-tree implementation providing nearest-neighbor, approximate nearest-neighbor, and all-points-near queries. I have a few other suggestions for things it might contain, but kd-trees seem like a good first step.
>
> 2008/9/27 Nathan Bell <[EMAIL PROTECTED]>:
>> On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald <[EMAIL PROTECTED]> wrote:
>>> I think a kd-tree implementation would be a valuable addition to scipy, perhaps in a submodule scipy.spatial that might eventually contain other spatial data structures and algorithms. What do you think? Should we have one? Should it be based on Sturla Molden's code, if the license permits? I am willing to contribute one, if not.
>>
>> +1
>
> Judging that your vote and mine are enough in the absence of dissenting voices, I have written an implementation based on yours, Sturla Molden's, and the algorithms described by the authors of the ANN library. Before integrating it into scipy, though, I'd like to send it around for comments.
>
> Particular issues:
>
> * It's pure python right now, but with some effort it could be partially or completely cythonized. This is probably a good idea in the long run. In the meantime I have crudely vectorized it so that users can at least avoid looping in their own code.
> * It is able to return the r nearest neighbors, optionally subject to a maximum distance limit, as well as approximate nearest neighbors.
> * It is not currently set up to conveniently return all points within some fixed distance of the target point, but this could easily be added.
> * It returns distances and indices of nearest neighbors in the original array.
> * The testing code is, frankly, a mess. I need to look into using nose in a sensible fashion.
> * The license is the scipy license.
>
> I am particularly concerned about providing a convenient return format. The natural return from a query is a list of neighbors, since it may have variable length (there may not be r elements in the tree, or you might have supplied a maximum distance which doesn't contain r points). For a single query, it's simple to return a python list (should it be sorted? currently it's a heap). But if you want to vectorize the process, storing the results in an array becomes cumbersome. One option is an object array full of lists; another, which I currently use, is an array with nonexistent points marked with an infinite distance and an invalid index. A third would be to return masked arrays. How do you recommend handling this variable (or potentially-variable) sized output?

I think lists are the way to go. They should be pretty fast down at the C level. I use them myself for collecting the elements of equivalence classes and such that can be quite dynamic when new elements/relations are added.

>> If you're implementing one, I would highly recommend the "left-balanced" kd-tree.
>> http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf
>
> Research suggests that at least in high dimension a more geometric balancing criterion can produce vastly better results. I used the "sliding midpoint" rule, which doesn't allow such a numerical balancing but does ensure that you don't have too many long skinny cells (since a sphere tends to cut very many of these).
>
> I've also been thinking about what else would go in scipy.spatial. I think it would be valuable to have a reasonably efficient distance matrix function (we seem to get that question a lot, and the answer's not trivial) as well as a sparse distance matrix function based on the kd-trees. The module could potentially also swallow the (currently sandboxed?) delaunay code.

It would be good to define the interface separately from the algorithm so that the latter can be tweaked without affecting the API.

Chuck
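Anne's second return-format option - marking nonexistent neighbors with an infinite distance and an invalid index - can be sketched in a few lines of plain Python. This is purely illustrative; the helper name and the out-of-range sentinel index are the editor's assumptions, not the proposed scipy API:

```python
import math

def pad_neighbors(results, k, n_points):
    """Pad ragged per-query neighbor lists to fixed-size rows of length k.

    results: list of per-query lists of (distance, index) pairs, sorted by
    distance. Missing entries are filled with (inf, n_points), where
    n_points is one past the largest valid index, so it is clearly invalid.
    """
    distances, indices = [], []
    for row in results:
        ds = [d for d, _ in row] + [math.inf] * (k - len(row))
        ix = [i for _, i in row] + [n_points] * (k - len(row))
        distances.append(ds[:k])
        indices.append(ix[:k])
    return distances, indices

# Two query points: the first found 2 neighbors, the second only 1.
ragged = [[(0.0, 3), (1.5, 7)], [(0.2, 1)]]
dists, idx = pad_neighbors(ragged, k=3, n_points=10)
```

The padded arrays are rectangular and easy to vectorize over, at the cost of a sentinel convention the caller must know about.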
Re: [Numpy-discussion] Proposal: scipy.spatial
On Tue, Sep 30, 2008 at 5:10 AM, Christopher Barker <[EMAIL PROTECTED]> wrote:
> Anne Archibald wrote:
>> I suggest the creation of a new submodule of scipy, scipy.spatial,
>
> +1
>
> Here's one to consider:
> http://pypi.python.org/pypi/Rtree
> and perhaps other stuff from:
> http://trac.gispython.org/spatialindex/wiki
> which I think is LGPL -- can scipy use that?

There is also a KDTree module in Biopython (which is under a BSD/MIT style licence),
http://biopython.org/SRC/biopython/Bio/KDTree/

The current version is in C; there is an older version available in the CVS history in C++ too,
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/KDTree/?cvsroot=biopython

Peter
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/9/30 Peter <[EMAIL PROTECTED]>:
> There is also a KDTree module in Biopython (which is under a BSD/MIT style licence),
> http://biopython.org/SRC/biopython/Bio/KDTree/
>
> The current version is in C, there is an older version available in the CVS history in C++ too,
> http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/KDTree/?cvsroot=biopython

I think the profusion of different implementations is an argument for including this in scipy. I think it is also an argument for providing a standard interface with (at least potentially) several different implementations. At the moment, that proposed interface looks like:

    T = KDTree(data)

    distances, indices = T.query(xs)  # single nearest neighbor
    distances, indices = T.query(xs, k=10)  # ten nearest neighbors
    distances, indices = T.query(xs, k=None, distance_upper_bound=1.0)  # all within 1 of x

In the first two cases, missing neighbors are represented with an infinite distance and an invalid index. In the last case, distances and indices are both either lists (if there's only one query point) or object arrays of lists (if there are many query points). If only one neighbor is requested, the array does not have a dimension of length 1 in the which-neighbor position. If (potentially) many neighbors are returned, they are sorted by distance, nearest first.

What do you think of this interface? It may make sense to provide additional kinds of query - nearest neighbors between two trees, for example - some of which would be available only for some implementations.

Anne
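The semantics of the proposed calls can be pinned down with a tiny brute-force stand-in. This is illustrative only: BruteTree is not the proposed class (the real proposal is a kd-tree), and the exact return conventions here are the editor's reading of the post:

```python
import math

class BruteTree:
    """Illustrative brute-force stand-in for the proposed KDTree interface."""

    def __init__(self, data):
        self.data = list(data)  # sequence of point tuples

    def query(self, x, k=1, distance_upper_bound=math.inf):
        """Return (distances, indices) of neighbors of x, nearest first.

        k=None means "return all neighbors within distance_upper_bound".
        """
        hits = sorted(
            (math.dist(x, p), i)
            for i, p in enumerate(self.data)
            if math.dist(x, p) <= distance_upper_bound
        )
        if k is not None:
            hits = hits[:k]
        return [d for d, _ in hits], [i for _, i in hits]

T = BruteTree([(0, 0), (1, 0), (3, 0)])
d, i = T.query((0.1, 0), k=2)                                   # two nearest
d_all, i_all = T.query((0.1, 0), k=None, distance_upper_bound=1.0)  # all within 1
```

A kd-tree answers the same queries without visiting every point, which is the entire reason for building the tree.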
Re: [Numpy-discussion] Proposal: scipy.spatial
On Tue, Sep 30, 2008 at 05:31:17PM -0400, Anne Archibald wrote:
> T = KDTree(data)
>
> distances, indices = T.query(xs)  # single nearest neighbor
> distances, indices = T.query(xs, k=10)  # ten nearest neighbors
> distances, indices = T.query(xs, k=None, distance_upper_bound=1.0)  # all within 1 of x
>
> What do you think of this interface?

k=None in the third call to T.query seems redundant. It should be possible to add some logic so that the call is simply

    distances, indices = T.query(xs, distance_upper_bound=1.0)

My 2 cents,
Gaël
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/9/30 Gael Varoquaux <[EMAIL PROTECTED]>:
> k=None in the third call to T.query seems redundant. It should be possible to add some logic so that the call is simply
>
> distances, indices = T.query(xs, distance_upper_bound=1.0)

Well, the problem with this is that you often want to provide a distance upper bound as well as a number of nearest neighbors. For example, suppose you are trying to find the point of closest approach of a path to the set of points stored in a kd-tree. You do the first query normally, but then, since the second point is close to the first, you can accelerate the search dramatically by providing an upper bound: the distance from the first point's nearest neighbor to the second point. With a little luck, this will save ever having to visit much of the tree.

It is also possible, of course, to provide wrappers to query() that do just one thing; I had this in mind more for fiddling the output (returning only the distance to the nearest neighbor, for instance).

Anne
Re: [Numpy-discussion] Proposal: scipy.spatial
On Tue, Sep 30, 2008 at 6:10 PM, Anne Archibald <[EMAIL PROTECTED]> wrote:
> Well, the problem with this is that you often want to provide a distance upper bound as well as a number of nearest neighbors.

This use case is also important in scattered data interpolation, so we definitely want to support it.

--
Nathan Bell [EMAIL PROTECTED]
http://graphics.cs.uiuc.edu/~wnbell/
Re: [Numpy-discussion] Proposal: scipy.spatial
On Tue, Sep 30, 2008 at 06:10:46PM -0400, Anne Archibald wrote:
>> k=None in the third call to T.query seems redundant. It should be possible to add some logic so that the call is simply
>> distances, indices = T.query(xs, distance_upper_bound=1.0)
> Well, the problem with this is that you often want to provide a distance upper bound as well as a number of nearest neighbors.

Absolutely. I just think k should default to None when distance_upper_bound is specified. k=None could be interpreted as k=1 when distance_upper_bound is not specified.

My 2 cents,
Gaël
Re: [Numpy-discussion] Proposal: scipy.spatial
On Mon, Sep 29, 2008 at 8:24 PM, Anne Archibald <[EMAIL PROTECTED]> wrote:
> Once again there has been a thread on the numpy/scipy mailing lists requesting (essentially) some form of spatial data structure. Pointers have been posted to ANN (sadly LGPLed and in C++) as well as a handful of pure-python implementations of kd-trees. I suggest the creation of a new submodule of scipy, scipy.spatial, to contain spatial data structures and algorithms.
> [...]

Anne,

Thanks for taking this on. The scikits.ann has licensing issues (as noted above), so it would be nice to have a clean-room implementation in scipy. I am happy to port the scikits.ann API to the final API that you choose, however, if you think that would be helpful.

Cheers,
Barry
Re: [Numpy-discussion] Proposal: scipy.spatial
On Wed, Oct 1, 2008 at 1:26 AM, Gael Varoquaux <[EMAIL PROTECTED]> wrote:
> Absolutely. I just think k should default to None when distance_upper_bound is specified. k=None could be interpreted as k=1 when distance_upper_bound is not specified.

Why not expose the various possibilities through different names?

    # nearest k points (possibly fewer)
    query_nearest(pt, k=1)

    # all points within given distance
    query_sphere(pt, distance)

    # nearest k points within given distance (possibly fewer)
    query(pt, k, distance)

Few people will use the last form, but it's useful nevertheless.

--
Nathan Bell [EMAIL PROTECTED]
http://graphics.cs.uiuc.edu/~wnbell/
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/1 Gael Varoquaux <[EMAIL PROTECTED]>:
> Absolutely. I just think k should default to None when distance_upper_bound is specified. k=None could be interpreted as k=1 when distance_upper_bound is not specified.

That seems very confusing. Better perhaps to have a function query_all_neighbors, even if it's just a wrapper.

Anne
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/1 Barry Wark <[EMAIL PROTECTED]>:
> Thanks for taking this on. The scikits.ann has licensing issues (as noted above), so it would be nice to have a clean-room implementation in scipy. I am happy to port the scikits.ann API to the final API that you choose, however, if you think that would be helpful.

That's a nice idea. I'm not totally sure yet how much it's going to be possible for different implementations to be plug-in replacements, but it sure would be nice if users could use ANN transparently.

Anne
Re: [Numpy-discussion] Proposal: scipy.spatial
> distances, indices = T.query(xs)  # single nearest neighbor

I'm not sure if it's implied, but can xs be an NxD matrix here, i.e. query for all N points rather than just one? This will reduce the python call overhead for large queries.

Also, I have some C++ code for locality sensitive hashing which might be useful?

James
Re: [Numpy-discussion] Proposal: scipy.spatial
I also like the idea of a scipy.spatial library. For the research I do in machine learning and computer vision, we are often interested in specifying different distance measures, so it would be nice to have a way to select one. I would like to see a standard set included - City Block, Euclidean, Correlation, etc. - as well as a capability for a user-defined distance or similarity function.
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/2 David Bolme <[EMAIL PROTECTED]>:
> I also like the idea of a scipy.spatial library. For the research I do in machine learning and computer vision we are often interested in specifying different distance measures. It would be nice to have a way to specify the distance measure. I would like to see a standard set included: City Block, Euclidean, Correlation, etc as well as a capability for a user defined distance or similarity function.

Do you mean similarity or dissimilarity? Distance is a dissimilarity, but correlation is a similarity measure.

Matthieu
--
French PhD student
Information System Engineer
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Re: [Numpy-discussion] Proposal: scipy.spatial
It may be useful to have an interface that handles both cases: similarity and dissimilarity. Often I have seen "Nearest Neighbor" algorithms that look for maximum similarity instead of minimum distance. In my field (biometrics) we often deal with very specialized distance or similarity measures. I would like to see support for user-defined distance and similarity functions. It should be easy to implement by passing a function object to the KNN class. I am not sure if kd-trees or other fast algorithms are compatible with similarities or non-Euclidean norms; however, I would be willing to implement an exhaustive-search KNN that would support user-defined functions.

On Oct 2, 2008, at 2:01 PM, Matthieu Brucher wrote:
> Do you mean similarity or dissimilarity? Distance is a dissimilarity, but correlation is a similarity measure.
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/2 David Bolme <[EMAIL PROTECTED]>:
> It may be useful to have an interface that handles both cases: similarity and dissimilarity. [...] I am not sure if kd-trees or other fast algorithms are compatible with similarities or non-euclidian norms, however I would be willing to implement an exhaustive search KNN that would support user defined functions.

kd-trees can only work for distance measures which have certain special properties (in particular, you have to be able to bound them based on coordinate differences). This is just fine for all the Minkowski p-norms (so in particular, Euclidean distance, maximum-coordinate-difference, and Manhattan distance), and in fact the current implementation already supports all of these.

I don't think that correlation can be made into such a distance measure - the neighborhoods are the wrong shape. In fact the basic space is projective n-1 space rather than affine n-space, so I think you're going to need some very different algorithm. If you make a metric space out of it - define d(A,B) to be the angle between A and B - then cover trees can serve as a spatial data structure for nearest-neighbor search. Cover trees may be worth implementing, as they're a very generic data structure, suitable for (among other things) low-dimensional data in high-dimensional spaces.

Anne
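The Minkowski family Anne refers to is compact to write down; the property a kd-tree exploits is that each per-coordinate difference is itself a lower bound on the total distance. A sketch (the function name is illustrative, not from the implementation under discussion):

```python
import math

def minkowski(x, y, p=2.0):
    """Minkowski p-norm distance; p=inf gives the max coordinate difference."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# For every p >= 1, each |x_i - y_i| is a lower bound on minkowski(x, y, p),
# which is what lets a kd-tree prune whole cells using one coordinate.
x, y = (0.0, 3.0), (4.0, 0.0)
manhattan = minkowski(x, y, 1)         # p=1: city block
euclidean = minkowski(x, y, 2)         # p=2: Euclidean
chebyshev = minkowski(x, y, math.inf)  # p=inf: max coordinate difference
```

Correlation has no such per-coordinate bound, which is why Anne points toward a different structure (cover trees on an angular metric) for that case.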
Re: [Numpy-discussion] Proposal: scipy.spatial
I remember reading a paper or book that stated that for normalized data, correlation and Euclidean distance are equivalent and will produce the same knn results. To this end I spent a couple hours this afternoon doing the math. This document is the result.

http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

I believe that with mean-subtracted and unit-length vectors, a Euclidean knn algorithm will produce the same result as if the vectors were compared using correlation. I am not sure if kd-trees will perform well on the normalized vectors, as they have a very specific geometry. If my math checks out, it may be worth adding Pearson's correlation as a default option or as a separate class.

I have also spent a little time looking at kd-trees and the kdtree code. It looks good. As I understand it, kd-trees only work well when the number of datapoints (N) is larger than 2^D, where D is the dimensionality of those points. This will not work well for many of my computer vision problems because often D is large. As Anne suggested, I will probably look at cover trees, because often the data are "low-dimensional data in high-dimensional spaces". I have been told, though, that with a large D there is no known fast algorithm for knn.

Another problem is that the distances and similarity measures used in biometrics and computer vision are often very specialized and may or may not conform to the underlying assumptions of fast algorithms. I think for this reason I will need an exhaustive search algorithm. I will code it up modeled after Anne's interface and hopefully it will make it into the spatial module.

I think that kd-trees and the spatial module are a good contribution to scipy. I have also enjoyed learning more about norms, correlation, and fast knn algorithms.

Thanks,
Dave
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/3 David Bolme <[EMAIL PROTECTED]>:
> I remember reading a paper or book that stated that for data that has been normalized, correlation and Euclidean are equivalent and will produce the same knn results. To this end I spent a couple hours this afternoon doing the math. This document is the result.
>
> http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

Yes, you're right; even without the mean subtraction they all lie on a hypersphere in Euclidean space. It's a little more awkward if you want to identify x and -x.

> I believe that with mean subtracted and unit length vectors, a Euclidean knn algorithm will produce the same result as if the vectors were compared using correlation. I am not sure if kd-trees will perform well on the normalized vectors as they have a very specific geometry. If my math checks out it may be worth adding Pearson's correlation as a default option or as a separate class.

Actually it's probably easier if the user just does the prenormalization.

> I have also spent a little time looking at kd-trees and the kdtree code. It looks good. As I understand it kd-trees only work well when the number of datapoints (N) is larger than 2^D, where D is the dimensionality of those points. This will not work well for many of my computer vision problems because often D is large. As Anne suggested I will probably look at cover trees because often the data are "low-dimensional data in high-dimensional spaces". I have been told though that with a large D there is no known fast algorithm for knn.

Pretty much true. Though if the intrinsic dimensionality is low, cover trees should be all right.

> Another problem is that the distances and similarity measures used in biometrics and computer vision are often very specialized and may or may not conform to the underlying assumptions of fast algorithms. I think for this reason I will need an exhaustive search algorithm. I will code it up modeled after Anne's interface and hopefully it will make it into the spatial module.

Metric spaces are quite general - edit distance for strings, for example, is an adequate distance measure. But brute force is definitely worth having too. If I get the test suite cleaned up, it should be possible to just drop an arbitrary k-nearest-neighbors class into it and get a reasonably thorough collection of unit tests.

Anne
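The equivalence being discussed is easy to verify numerically: for mean-subtracted, unit-length vectors x̂ and ŷ, ||x̂ - ŷ||² = ||x̂||² + ||ŷ||² - 2 x̂·ŷ = 2(1 - r), where r is Pearson's correlation, so Euclidean knn on the normalized vectors ranks neighbors exactly as correlation would. A stdlib-only check (helper names are illustrative):

```python
import math

def normalize(v):
    """Subtract the mean, then scale to unit length."""
    m = sum(v) / len(v)
    c = [a - m for a in v]
    n = math.sqrt(sum(a * a for a in c))
    return [a / n for a in c]

def pearson(x, y):
    """Pearson correlation as a dot product of normalized vectors."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))

x, y = [1.0, 2.0, 4.0, 3.0], [2.0, 1.0, 5.0, 6.0]
xn, yn = normalize(x), normalize(y)
sq_dist = sum((a - b) ** 2 for a, b in zip(xn, yn))
# Identity: squared Euclidean distance on normalized vectors = 2(1 - r)
```

Since 2(1 - r) is a monotonically decreasing function of r, minimizing Euclidean distance on the normalized data and maximizing correlation select the same neighbors.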
Re: [Numpy-discussion] Proposal: scipy.spatial
My version of a wrapper of ANN is attached. I wrote it when I had some issues installing the scikits.ann package. It uses ctypes, and might be useful in deciding on an API.

Please feel free to take what you like,

-Rob

[attachment: ann.tar.gz]

Rob Hetland, Associate Professor
Dept. of Oceanography, Texas A&M University
http://pong.tamu.edu/~rob
phone: 979-458-0096, fax: 979-845-6331
Re: [Numpy-discussion] Proposal: scipy.spatial
On Oct 8, 2008, at 1:36 PM, Anne Archibald wrote:
> How important did you find the ability to select priority versus non-priority search?
>
> How important did you find the ability to select other splitting rules?

In both cases, I included those things just because they were options in ANN, rather than out of any real need on my part. The things that I found important (and lacking in some of the many other kd-tree implementations) were the ability to select an arbitrary number of nearest neighbors returned and to specify a max search radius. I can see that having a choice of splitting rules may be handy for unusual point distributions, but I have always used this on fairly regularly spaced data.

-Rob

Rob Hetland, Associate Professor
Dept. of Oceanography, Texas A&M University
http://pong.tamu.edu/~rob
phone: 979-458-0096, fax: 979-845-6331
Re: [Numpy-discussion] Proposal: scipy.spatial
I have written up a basic nearest neighbor algorithm. It does a brute-force search, so it will be slower than kd-trees as the number of points gets large. It should, however, work well for high-dimensional data. I have also added the option for user-defined distance measures. The user can set a default "p": "p" keeps its usual meaning if it is a float, but it can also be a function that computes a distance matrix, or the measure can be selected using the strings "Manhattan", "Euclidean", or "Correlation".

https://pyvision.svn.sourceforge.net/svnroot/pyvision/trunk/src/pyvision/vector/knn.py

The interface is similar to Anne's code and in many cases can be used as a drop-in replacement. I have posted the code to my own project because I have a short-term need and I do not have access to the scipy repository. Feel free to include the code with scipy under the scipy license.

I did find a typo in your documentation ("trie" -> "tree"):

    ... kd-tree is a binary trie, each of whose ...

Also, I found the use of k in the documentation somewhat confusing, as it is both the dimensionality of the data points in the kd-tree and the number of neighbors for k-nearest neighbors.

>> I believe that with mean subtracted and unit length vectors, a Euclidean knn algorithm will produce the same result as if the vectors were compared using correlation. I am not sure if kd-trees will perform well on the normalized vectors as they have a very specific geometry. If my math checks out it may be worth adding Pearson's correlation as a default option or as a separate class.
>
> Actually it's probably easier if the user just does the prenormalization.

I agree. I think we should keep your class as-is and maybe create a class that wraps the kdtree and handles the normalization for correlation. I would also like to look at cover trees; however, that will have to wait a couple months until I have more time.

Dave
---
http://www.cs.colostate.edu/~bolme
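The pluggable-measure design described above comes down to a few lines when done by exhaustive search. This sketch is illustrative only (the names are not taken from knn.py): the measure may be a string naming a built-in or any callable d(x, y):

```python
import math

# Built-in measures; a user-supplied callable bypasses this table.
_MEASURES = {
    "Manhattan": lambda x, y: sum(abs(a - b) for a, b in zip(x, y)),
    "Euclidean": lambda x, y: math.dist(x, y),
}

def knn(data, query, k=1, measure="Euclidean"):
    """Exhaustive k-nearest-neighbor search with a user-defined measure.

    measure is a string key into _MEASURES or a callable d(x, y).
    Returns (distances, indices), nearest first.
    """
    d = _MEASURES[measure] if isinstance(measure, str) else measure
    hits = sorted((d(query, p), i) for i, p in enumerate(data))[:k]
    return [dist for dist, _ in hits], [i for _, i in hits]

pts = [(0, 0), (2, 2), (5, 1)]
_, idx = knn(pts, (0.5, 0.5), k=2, measure="Manhattan")
cheb = lambda x, y: max(abs(a - b) for a, b in zip(x, y))
_, idx_c = knn(pts, (0.5, 0.5), k=1, measure=cheb)  # arbitrary callable
```

Exhaustive search makes no assumptions about the measure at all, which is exactly the property needed for the specialized biometrics measures mentioned above - at the price of O(N) work per query.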
Re: [Numpy-discussion] Proposal: scipy.spatial
2008/10/9 David Bolme <[EMAIL PROTECTED]>:
> I have written up a basic nearest neighbor algorithm. It does a brute force search so it will be slower than kd-trees as the number of points gets large. It should however work well for high dimensional data. I have also added the option for user defined distance measures.
>
> https://pyvision.svn.sourceforge.net/svnroot/pyvision/trunk/src/pyvision/vector/knn.py

This is interesting. I would point out, though, that if you want a Minkowski norm, it may be more efficient (that is, faster) to use the new compiled kd-tree implementation with leafsize set to the size of your data. This is written in compiled code and uses short-circuit distance evaluation, and may be much faster for high-dimensional problems. Given that, this should perhaps go in with other generic metric space code. I have a functional implementation of ball trees (though I don't know how efficient they are), and am looking into implementing cover trees.

> I did find a typo in your documentation.
> typo "trie -> tree" - ... kd-tree is a binary trie, each of whose ...

That's not actually a typo: a trie is a tree in which all the data is stored at leaf nodes. Basic kd-trees use the nodes themselves to define splitting planes; you can actually construct one with no extra storage at all, just by choosing an appropriate order for your array of points. This implementation chooses splitting planes that may not pass through any point, so all the points get stored in leaves.

> Also I found the use of k in the documentation somewhat confusing as it is both the dimensionality of the data points in the kd-tree and the number of neighbors for k-nearest neighbors.

That's a good point. I changed the dimension of the kd-tree to m throughout.

Anne
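The "short-circuit distance evaluation" Anne mentions is the trick of abandoning a squared-distance computation as soon as the running sum exceeds the best distance found so far; in high dimension most candidates are rejected after only a few coordinates. A sketch of the idea (illustrative names, not the compiled implementation):

```python
def sq_dist_truncated(x, y, bound):
    """Squared Euclidean distance, abandoned once it exceeds bound.

    When the running sum passes bound, the returned value is > bound but
    not necessarily the true squared distance - which is all a
    nearest-neighbor search needs to know to reject the candidate.
    """
    total = 0.0
    for a, b in zip(x, y):
        total += (a - b) ** 2
        if total > bound:
            break
    return total

def nearest(data, query):
    """Brute-force nearest neighbor using short-circuit evaluation."""
    best, best_i = float("inf"), -1
    for i, p in enumerate(data):
        d = sq_dist_truncated(query, p, best)
        if d < best:
            best, best_i = d, i
    return best_i

pts = [(10.0, 10.0, 10.0), (0.0, 1.0, 0.0), (0.0, 0.0, 0.5)]
```

The same truncated evaluation drops into a kd-tree search unchanged: the current k-th-best distance serves as the bound when scanning the points in a leaf.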