2008/10/3 David Bolme <[EMAIL PROTECTED]>:

> I remember reading a paper or book that stated that for data that has
> been normalized, correlation and Euclidean distance are equivalent and
> will produce the same knn results. To this end I spent a couple of
> hours this afternoon doing the math. This document is the result.
>
> http://www.cs.colostate.edu/~bolme/bolme08euclidean.pdf

Yes, you're right; even without the mean subtraction they all lie on a
hypersphere in Euclidean space. It's a little more awkward if you want
to identify x and -x.
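For concreteness, here is a quick NumPy sketch of why the two orderings
agree (the normalize helper is just illustration): once each vector is
mean-subtracted and scaled to unit length, the Pearson correlation of x
and y is simply the dot product x.y, and ||x - y||^2 = 2 - 2 x.y, so
Euclidean distance is a monotone decreasing function of correlation and
both criteria pick the same k nearest neighbors.

import numpy as np

def normalize(data):
    """Subtract each row's mean and scale each row to unit length."""
    centered = data - data.mean(axis=1)[:, None]
    return centered / np.sqrt((centered**2).sum(axis=1))[:, None]

rng = np.random.RandomState(0)
gallery = normalize(rng.randn(50, 10))   # 50 stored vectors
probe = normalize(rng.randn(1, 10))[0]   # one query vector

corr = gallery.dot(probe)                           # Pearson correlation
dist = np.sqrt(((gallery - probe)**2).sum(axis=1))  # Euclidean distance

# Smallest distance and largest correlation give the same neighbors.
k = 5
assert (np.argsort(dist)[:k] == np.argsort(-corr)[:k]).all()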
> I believe that with mean-subtracted and unit-length vectors, a
> Euclidean knn algorithm will produce the same result as if the
> vectors were compared using correlation. I am not sure if kd-trees
> will perform well on the normalized vectors, as they have a very
> specific geometry. If my math checks out, it may be worth adding
> Pearson's correlation as a default option or as a separate class.

Actually, it's probably easier if the user just does the
prenormalization.

> I have also spent a little time looking at kd-trees and the kdtree
> code. It looks good. As I understand it, kd-trees only work well when
> the number of data points (N) is larger than 2^D, where D is the
> dimensionality of those points. This will not work well for many of
> my computer vision problems, because often D is large. As Anne
> suggested, I will probably look at cover trees, because oftentimes
> the data are "low-dimensional data in high-dimensional spaces". I
> have been told, though, that with a large D there is no known fast
> algorithm for knn.

Pretty much true. Though if the intrinsic dimensionality is low, cover
trees should be all right.

> Another problem is that the distances and similarity measures used in
> biometrics and computer vision are often very specialized and may or
> may not conform to the underlying assumptions of fast algorithms. I
> think for this reason I will need an exhaustive search algorithm. I
> will code it up modeled after Anne's interface, and hopefully it will
> make it into the spatial module.

Metric spaces are quite general: edit distance for strings, for
example, is a perfectly adequate distance measure. But brute force is
definitely worth having too. If I get the test suite cleaned up, it
should be possible to just drop an arbitrary k-nearest-neighbors class
into it and get a reasonably thorough collection of unit tests.
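To make the interface concrete, here is a rough sketch of what such an
exhaustive-search class might look like. It returns (distances,
indices) the way the kd-tree's query method does; the class name, the
pluggable metric argument, and the helper function are just
illustration, not settled API.

import numpy as np

def euclidean(data, x):
    """Euclidean distances from the point x to every row of data."""
    return np.sqrt(((data - x)**2).sum(axis=1))

class BruteForceNN(object):
    """Exhaustive k-nearest-neighbor search under an arbitrary metric.

    Each query costs O(N*D), but no geometric assumptions are made,
    so specialized similarity measures work where kd-trees do not.
    """
    def __init__(self, data, metric=euclidean):
        self.data = np.asarray(data)
        self.metric = metric

    def query(self, x, k=1):
        """Return (distances, indices) of the k nearest points to x."""
        d = self.metric(self.data, np.asarray(x))
        idx = np.argsort(d)[:k]
        return d[idx], idx

# For example:
#   nn = BruteForceNN(np.random.randn(1000, 100))
#   distances, indices = nn.query(np.random.randn(100), k=5)

Anne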