On Mon, Jun 7, 2010 at 7:50 AM, Robin Anil <[email protected]> wrote:
> .... But won't the similarity metrics need to be different for such a
> vector?

Generally, Euclidean or L_1 distances are about all that make sense for
these vectors. For clustering, I worry that I don't take IDF into account
(there is some provision for that in the AdaptiveWordEncoder, though). For
most learning applications, IDF shouldn't matter, except that it might make
convergence faster by reducing the size of the largest eigenvalue.

> About the dictionary based trace. I need to actually see how the trace is
> useful. Do you keep track of the most important feature from those that
> go into a particular hashed location?

Right now, I pretty much assume that there are no collisions. That isn't
always all that great an assumption. To get rid of the problem, it is
probably pretty easy to do a relaxation step: generate an explanation for a
vector and then generate the vector for that explanation. If there are
collisions, this second vector will differ slightly from the original, and
the explanation of the difference should get us much closer to the
original.

> In clustering, we need to show the cluster centroids and the top features
> in it for text. I don't know if that is useful for types of data other
> than text. With these vectors how would the cluster dumper change?

I think that for general ideas about vector content, this style should be
fine. The cluster dumper could just print the explanation of the input
vectors. I have found explanations very helpful in mixed settings so far,
but that is because I am printing out classification models. For most
clustering applications, that might be a different story.
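To make the collision discussion concrete, here is a minimal Python sketch
of the idea (not Mahout's actual encoder): hashed encoding, a dictionary-
based explanation that assumes no collisions, and the relaxation step of
regenerating a vector from its explanation and diffing. The hash function,
dimension, and helper names are all illustrative; the toy hash (word length
mod DIM) is deliberately collision-prone so the effect is visible.

```python
DIM = 8  # tiny vector size for illustration; real hashed encoders use far more slots

def slot(word):
    # toy hash: word length mod DIM (deliberately collision-prone;
    # a real encoder would use something like MurmurHash)
    return len(word) % DIM

def encode(words):
    # hashed encoding: each word adds weight 1.0 to its slot
    v = [0.0] * DIM
    for w in words:
        v[slot(w)] += 1.0
    return v

def explain(v, dictionary):
    # dictionary-based trace that assumes no collisions: every dictionary
    # word is credited with the full weight of whatever slot it hashes to
    return {w: v[slot(w)] for w in dictionary if v[slot(w)] != 0.0}

def regenerate(explanation):
    # turn an explanation back into a hashed vector
    v = [0.0] * DIM
    for w, weight in explanation.items():
        v[slot(w)] += weight
    return v

dictionary = ["cat", "dog", "fish"]
v = encode(["cat", "fish"])    # "cat" and "dog" collide (both length 3)
expl = explain(v, dictionary)  # wrongly credits both "cat" and "dog" with 1.0
v2 = regenerate(expl)          # the shared slot now holds 2.0 instead of 1.0

# non-zero entries mark the slots where collisions inflated the explanation
diff = [b - a for a, b in zip(v, v2)]
```

A second pass that explains `diff` rather than `v` would then reveal which
dictionary words were over-attributed, which is the relaxation step
described above.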
