On Mon, Sep 29, 2008 at 9:24 PM, Anne Archibald <[EMAIL PROTECTED]>wrote:
> Hi, > > Once again there has been a thread on the numpy/scipy mailing lists > requesting (essentially) some form of spatial data structure. Pointers > have been posted to ANN (sadly LGPLed and in C++) as well as a handful > of pure-python implementations of kd-trees. I suggest the creation of > a new submodule of scipy, scipy.spatial, to contain spatial data > structures and algorithms. Specifically, I propose it contain a > kd-tree implementation providing nearest-neighbor, approximate > nearest-neighbor, and all-points-near queries. I have a few other > suggestions for things it might contain, but kd-trees seem like a good > first step. > > 2008/9/27 Nathan Bell <[EMAIL PROTECTED]>: > > On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald > > <[EMAIL PROTECTED]> wrote: > >> > >> I think a kd-tree implementation would be a valuable addition to > >> scipy, perhaps in a submodule scipy.spatial that might eventually > >> contain other spatial data structures and algorithms. What do you > >> think? Should we have one? Should it be based on Sturla Molden's code, > >> if the license permits? I am willing to contribute one, if not. > > > > +1 > > Judging that your vote and mine are enough in the absence of > dissenting voices, I have written an implementation based on yours, > Sturla Molden's, and the algorithms described by the authors of the > ANN library. Before integrating it into scipy, though, I'd like to > send it around for comments. > > Particular issues: > > * It's pure python right now, but with some effort it could be > partially or completely cythonized. This is probably a good idea in > the long run. In the meantime I have crudely vectorized it so that > users can at least avoid looping in their own code. > * It is able to return the r nearest neighbors, optionally subject to > a maximum distance limit, as well as approximate nearest neighbors. > * It is not currently set up to conveniently return all points within > some fixed distance of the target point, but this could easily be > added. > * It returns distances and indices of nearest neighbors in the original > array. > * The testing code is, frankly, a mess. I need to look into using nose > in a sensible fashion. > * The license is the scipy license. > > I am particularly concerned about providing a convenient return > format. The natural return from a query is a list of neighbors, since > it may have variable length (there may not be r elements in the tree, > or you might have supplied a maximum distance which doesn't contain r > points). For a single query, it's simple to return a python list > (should it be sorted? currently it's a heap). But if you want to > vectorize the process, storing the results in an array becomes > cumbersome. One option is an object array full of lists; another, > which, I currently use, is an array with nonexistent points marked > with an infinite distance and an invalid index. A third would be to > return masked arrays. How do you recommend handling this variable (or > potentially-variable) sized output? > I think lists are the way to go. They should be pretty fast down at the C level. I use them myself for collecting the elements of equivalence classes and such that can be quite dynamic when new elements/relations are added. > > > If you're implementing one, I would highly recommend the > > "left-balanced" kd-tree. > > > http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf > > Research suggests that at least in high dimension a more geometric > balancing criterion can produce vastly better results. I used the > "sliding midpoint" rule, which doesn't allow such a numerical > balancing but does ensure that you don't have too many long skinny > cells (since a sphere tends to cut very many of these). > > I've also been thinking about what else would go in scipy.spatial. I > think it would be valuable to have a reasonably efficient distance > matrix function (we seem to get that question a lot, and the answer's > not trivial) as well as a sparse distance matrix function based on the > kd-trees. The module could potentially also swallow the (currently > sandboxed?) delaunay code. > It would be good to define the interface separately from the algorithm so that the latter can be tweaked without affecting the API. Chuck
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion