On Mon, Jun 7, 2010 at 7:50 AM, Robin Anil <[email protected]> wrote:
> .... But won't the similarity metrics need to be different for such a
> vector?

Generally, Euclidean or L_1 distances are about all that make sense for
these vectors. For clustering, I worry that I don't take IDF into account
(there is some provision for that in the AdaptiveWordEncoder, though). For
most learning applications, IDF shouldn't matter, except that it might make
convergence faster by reducing the size of the largest eigenvalue.

> About the dictionary based trace. I need to actually see how the trace is
> useful. Do you keep track of the most important feature from those that
> go into a particular hashed location?

Right now, I pretty much assume that there are no collisions. That isn't
always all that great an assumption. To get rid of the problem, it is
probably pretty easy to do a relaxation step: generate an explanation for a
vector and then generate the vector for that explanation. If there are
collisions, this second vector will differ slightly from the original, and
the explanation of the difference should get us much closer to the
original.

> In clustering, we need to show the cluster centroids and the top features
> in it for text. I don't know if that is useful for types of data other
> than text. With these vectors how would the cluster dumper change?

I think that for general ideas about vector content, this style should be
fine. The cluster dumper could just print the explanation of the input
vectors. I have found explanations very helpful in mixed settings so far,
but that is because I am printing out classification models. For most
clustering applications, that might be a different story.
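To make the collision discussion concrete, here is a minimal Python sketch
of the idea (not Mahout's actual encoder): hashed encoding, a dictionary-
based explanation that assumes no collisions, and the relaxation step of
regenerating a vector from its explanation and diffing. The hash function,
dimension, and helper names are all illustrative; the toy hash (word length
mod DIM) is deliberately collision-prone so the effect is visible.

```python
DIM = 8  # tiny vector size for illustration; real hashed encoders use far more slots

def slot(word):
    # toy hash: word length mod DIM (deliberately collision-prone;
    # a real encoder would use something like MurmurHash)
    return len(word) % DIM

def encode(words):
    # hashed encoding: each word adds weight 1.0 to its slot
    v = [0.0] * DIM
    for w in words:
        v[slot(w)] += 1.0
    return v

def explain(v, dictionary):
    # dictionary-based trace that assumes no collisions: every dictionary
    # word is credited with the full weight of whatever slot it hashes to
    return {w: v[slot(w)] for w in dictionary if v[slot(w)] != 0.0}

def regenerate(explanation):
    # turn an explanation back into a hashed vector
    v = [0.0] * DIM
    for w, weight in explanation.items():
        v[slot(w)] += weight
    return v

dictionary = ["cat", "dog", "fish"]
v = encode(["cat", "fish"])    # "cat" and "dog" collide (both length 3)
expl = explain(v, dictionary)  # wrongly credits both "cat" and "dog" with 1.0
v2 = regenerate(expl)          # the shared slot now holds 2.0 instead of 1.0

# non-zero entries mark the slots where collisions inflated the explanation
diff = [b - a for a, b in zip(v, v2)]
```

A second pass that explains `diff` rather than `v` would then reveal which
dictionary words were over-attributed, which is the relaxation step
described above.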
