I remember reading a paper or book stating that, for data that has been normalized, correlation and Euclidean distance are equivalent and will produce the same knn results. To that end, I spent a couple of hours this afternoon doing the math; this document is the result.
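In short, the math comes down to one identity: for unit-length vectors x and y, ||x - y||^2 = ||x||^2 + ||y||^2 - 2*dot(x, y) = 2 - 2*dot(x, y), and when the vectors are also mean-subtracted, dot(x, y) is exactly Pearson's correlation. Euclidean distance is then a strictly decreasing function of correlation, so the two give identical knn rankings. A quick numerical check (a minimal NumPy sketch; the sizes, seed, and names are arbitrary):

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 16)   # 100 data points, D = 16
q = np.random.randn(16)        # query point

def normalize(v):
    # mean-subtract, then scale to unit length
    v = v - v.mean(axis=-1, keepdims=True)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

Xn, qn = normalize(X), normalize(q)

# Euclidean distances on the normalized vectors ...
d = np.sqrt(((Xn - qn) ** 2).sum(axis=1))
# ... versus Pearson correlation on the raw vectors.
r = np.array([np.corrcoef(x, q)[0, 1] for x in X])

# the identity d**2 == 2 - 2*r holds, so the orderings agree
assert np.allclose(d ** 2, 2 - 2 * r)
k = 5
assert (np.argsort(d)[:k] == np.argsort(-r)[:k]).all()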
The writeup is here:

http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

I believe that with mean-subtracted, unit-length vectors, a Euclidean knn algorithm will produce the same result as if the vectors were compared using correlation. I am not sure kd-trees will perform well on the normalized vectors, though, as they have a very specific geometry (they all lie on the unit hypersphere). If my math checks out, it may be worth adding Pearson's correlation as a default option or as a separate class.

I have also spent a little time looking at kd-trees and the kdtree code, and it looks good. As I understand it, kd-trees only work well when the number of data points N is larger than 2^D, where D is the dimensionality of those points. This will not work well for many of my computer vision problems, because D is often large. As Anne suggested, I will probably look at cover trees, since the data are often "low-dimensional data in high-dimensional spaces". I have been told, though, that for large D there is no known fast algorithm for knn.

Another problem is that the distance and similarity measures used in biometrics and computer vision are often very specialized and may or may not conform to the underlying assumptions of the fast algorithms. For that reason I think I will also need an exhaustive search algorithm. I will code it up modeled after Anne's interface (a rough sketch is in the P.S. below), and hopefully it will make it into the spatial module.

I think that kd-trees and the spatial module are a good contribution to scipy. I have also enjoyed learning more about norms, correlation, and fast knn algorithms.

Thanks,
Dave
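P.S. Here is a rough sketch of the kind of exhaustive search I have in mind. The class name is made up, and the query(x, k) signature just mirrors the kd-tree's query (distances and indices back); the point is that the distance is a pluggable callable, so a specialized biometric measure can be dropped in:

import numpy as np

class BruteForceNeighbors(object):
    """Exhaustive knn search with a pluggable distance function."""

    def __init__(self, data, distance=None):
        self.data = np.asarray(data)
        # default to Euclidean; any callable (a, b) -> float works
        if distance is None:
            distance = lambda a, b: np.sqrt(((a - b) ** 2).sum())
        self.distance = distance

    def query(self, x, k=1):
        # score every point -- O(N) distance evaluations, with no
        # geometric assumptions about the measure
        d = np.array([self.distance(x, p) for p in self.data])
        idx = np.argsort(d)[:k]
        return d[idx], idx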
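For example, a correlation-based search would just be BruteForceNeighbors(data, distance=lambda a, b: 1 - np.corrcoef(a, b)[0, 1]). Since query only uses the distances to rank candidates, the callable does not even have to be a true metric, which is exactly why the exhaustive search makes no assumptions that a specialized measure could violate.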