MooMaster wrote:
> Now we can't calculate a meaningful Euclidean distance for something
> like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
> distance or something overly complicated, so instead we'll use a
> simple quantization scheme of enumerating the set of values within the
> column domain and replacing the strings with numbers (i.e. Iris-setosa
> = 1, iris-versicolor=2).

I'd calculate the distance as

    def string_dist(x, y, weight=1):
        # 0 if the two labels match, weight if they differ
        return weight * (x != y)

You don't get a high resolution in that dimension, but you don't introduce an element of randomness, either.

Peter
--
http://mail.python.org/mailman/listinfo/python-list
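
For illustration, here is a minimal sketch of how a match/mismatch distance like this might be folded into a Euclidean distance over rows that mix numbers and class labels. It assumes equal labels should contribute 0 and differing labels should contribute weight. The helper name mixed_dist and the sample rows are hypothetical and not from the thread.

```python
import math


def string_dist(x, y, weight=1):
    # Assumption: identical labels are at distance 0, different labels at `weight`.
    return weight * (x != y)


def mixed_dist(row_a, row_b, weight=1):
    # Hypothetical helper: Euclidean distance over two equally long rows.
    # Numeric columns contribute their difference, string columns contribute
    # 0 or `weight` via string_dist.
    total = 0.0
    for a, b in zip(row_a, row_b):
        if isinstance(a, str) or isinstance(b, str):
            d = string_dist(a, b, weight)
        else:
            d = a - b
        total += d * d
    return math.sqrt(total)


# Made-up rows for illustration only.
same_class = mixed_dist((5.0, 3.0, "Iris-setosa"), (5.0, 3.0, "Iris-setosa"))
other_class = mixed_dist((5.0, 3.0, "Iris-setosa"), (5.0, 3.0, "Iris-versicolor"))
```

With this scheme every pair of distinct labels is equally far apart, so the result does not depend on the order in which the labels happen to be enumerated. The weight parameter controls how much a class mismatch counts relative to the numeric columns.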