On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten <fr...@frankscholten.nl> wrote:

> On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > Yes.  Hashing vector encoders will preserve distances when used with
> > multiple probes.
> >
>
> So if a token occurs twice in a document, the first occurrence will be
> mapped to a given location, and when the token is hashed the second time
> it will be mapped to a different location, right?
>

No.  The same token will always hash to the same location(s).
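
Roughly like this, using the Mahout encoders (a quick sketch from memory,
so double-check class and package names against your Mahout version):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class SameTokenSamePositions {
      public static void main(String[] args) {
        // One encoder, two probes: each word hashes to two positions.
        FeatureVectorEncoder encoder = new StaticWordValueEncoder("text");
        encoder.setProbes(2);

        Vector v1 = new RandomAccessSparseVector(1000);
        Vector v2 = new RandomAccessSparseVector(1000);

        // Hashing is deterministic, so encoding the same token into two
        // different vectors marks the same positions in both.
        encoder.addToVector("hadoop", v1);
        encoder.addToVector("hadoop", v2);

        System.out.println("v1 = " + v1);
        System.out.println("v2 = " + v2);
      }
    }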


> I am wondering whether, with many probes and a large enough vector, this
> process mimics TF weighting, since documents that have a high TF for a
> given token will have the same positions marked in the vector. As Suneel
> said, when we then use the Hamming distance, the vectors that are close to
> each other should end up in the same cluster.
>

Hamming distance doesn't quite work because you want collisions to sum
rather than OR.  Also, if you apply weights to the words, these weights will
be added at all of the probe locations for the words.  This means we still
need a plus/times/L2 dot product rather than a plus/AND/L1 dot product like
the Hamming distance uses.
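
To make the sum-versus-OR point concrete, here is a small sketch (encoder
class names from memory again): encoding the same word repeatedly keeps
adding its weight at the probe positions, so the vector carries summed
weights rather than bits.

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class SumNotOr {
      public static void main(String[] args) {
        FeatureVectorEncoder encoder = new StaticWordValueEncoder("text");
        encoder.setProbes(2);

        Vector v = new RandomAccessSparseVector(1000);

        // Each call *adds* the word's weight at its probe positions, so a
        // token that occurs three times contributes 3x the weight there,
        // rather than just flipping a bit once (the OR interpretation).
        encoder.addToVector("hadoop", v);
        encoder.addToVector("hadoop", v);
        encoder.addToVector("hadoop", v);

        // The values at the probed positions are now 3x the word weight
        // (plus any collisions from other words), which is why a
        // plus/times/L2 dot product or Euclidean distance is the right way
        // to compare these vectors, not a bit-wise Hamming distance.
        System.out.println(v);
      }
    }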

>
> > Interpretation becomes somewhat difficult, but there is code available to
> > reverse engineer labels on hashed vectors.
>
>
> I saw that AdaptiveWordEncoder has a built-in dictionary, so I can see
> which words it has seen, but I don't see how to go from a position, or
> several positions, in the vector to labels. Is there an example in the
> code I can look at?
>

Yes.  The newsgroups example applies.
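
If I remember the API correctly, the encoders can be handed a trace
dictionary that records which positions each word was hashed to, and
inverting that mapping is how the newsgroups/SGD code (I believe) puts
labels back on hashed features. A sketch:

    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class TraceDictionaryExample {
      public static void main(String[] args) {
        // The trace dictionary records, for every word the encoder sees,
        // which vector positions it was hashed to. Inverting it gives a
        // position -> words mapping that can be used to label clusters.
        Map<String, Set<Integer>> trace = new TreeMap<String, Set<Integer>>();

        FeatureVectorEncoder encoder = new StaticWordValueEncoder("text");
        encoder.setProbes(2);
        // setTraceDictionary is from memory; check FeatureVectorEncoder
        // in your Mahout version.
        encoder.setTraceDictionary(trace);

        Vector v = new RandomAccessSparseVector(1000);
        encoder.addToVector("reuters", v);
        encoder.addToVector("grain", v);

        // Prints each word with the positions it was hashed to.
        System.out.println(trace);
      }
    }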

The AdaptiveWordEncoder counts the word occurrences that it sees and derives
an IDF weight from the resulting counts.  For this to work, it assumes that
all instances of the AWE see roughly the same distribution of words.  That
is fine for lots of applications and not fine for lots of others.
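
For example, roughly (the actual class is AdaptiveWordValueEncoder, if I
recall the name right):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;

    public class AdaptiveWeightSketch {
      public static void main(String[] args) {
        AdaptiveWordValueEncoder encoder = new AdaptiveWordValueEncoder("text");
        encoder.setProbes(2);

        // Feed the encoder a skewed stream of words. It keeps counts
        // internally, so "the" becomes common and gets a low weight while
        // "corpus" stays rare and keeps a high weight.
        Vector v = new RandomAccessSparseVector(1000);
        for (int i = 0; i < 100; i++) {
          encoder.addToVector("the", v);
        }
        encoder.addToVector("corpus", v);

        // The values added for "the" shrink as its count grows; the value
        // for "corpus" stays comparatively large. Because the counts are
        // accumulated per encoder instance, every instance must see
        // roughly the same word distribution for the weights to agree.
        System.out.println(v);
      }
    }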


>
>
> > IDF weighting is slightly tricky, but quite doable if you keep a
> > dictionary
> > of, say, the most common 50-200 thousand words and assume all others have
> > constant and equal frequency.
> >
>
> How would IDF weighting work in conjunction with hashing? First build up a
> dictionary of the 50-200 thousand most common words and pass that into the
> vector encoders? The drawback of this is that you have another pass through
> the data and another 'input' to keep track of and configure. But maybe it
> has to be like that.


With hashing, you still have the option of applying a weight to the hashed
representation of each word.  The question is what weight.

To build a small dictionary, you don't have to go through all of the data,
just enough to get reasonably accurate weights for most words.  All words
not yet seen can be assumed to be rare and thus get the nominal "rare-word"
weight.

Keeping track of the dictionary of weights is, indeed, a pain.
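
If I remember right, StaticWordValueEncoder supports exactly this pattern:
you hand it a (partial) weight dictionary and a default weight for words
that are not in it. Something like this, with made-up weights:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class DictionaryWeightedEncoding {
      public static void main(String[] args) {
        // Partial IDF-style dictionary built from a sample of the corpus.
        // The weights here are placeholders, not real IDF values.
        Map<String, Double> idf = new HashMap<String, Double>();
        idf.put("the", 0.1);
        idf.put("reuters", 2.5);

        StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
        encoder.setProbes(2);
        // setDictionary/setMissingValueWeight are from memory; verify the
        // method names on StaticWordValueEncoder in your Mahout version.
        encoder.setDictionary(idf);
        // Any word not in the dictionary is treated as rare and gets this
        // nominal "rare-word" weight.
        encoder.setMissingValueWeight(5.0);

        Vector v = new RandomAccessSparseVector(1000);
        encoder.addToVector("the", v);        // weighted 0.1
        encoder.addToVector("reuters", v);    // weighted 2.5
        encoder.addToVector("zeugma", v);     // not in dictionary -> 5.0
        System.out.println(v);
      }
    }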



> The
> reason I like the hashed encoders is that vectorizing can be done in a
> streaming manner at the last possible moment. With the current tools you
> have to do: data -> data2seq -> seq2sparse -> kmeans.
>

Indeed.  That is the great virtue.


>
> If this approach is doable, I would like to code up a non-Hadoop Java
> example using the Reuters dataset which vectorizes each doc using the
> hashing encoders and configures KMeans with the Hamming distance, and then
> write some code to get the labels.
>

Use Euclidean distance, not Hamming.

You can definitely use the AWE here if you randomize document ordering.
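
A rough outline of the encode-and-compare part, with made-up strings
standing in for Reuters articles and class names from memory (the k-means
step would be configured with the same EuclideanDistanceMeasure):

    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.AdaptiveWordValueEncoder;

    public class EncodeAndCompare {
      static Vector encode(AdaptiveWordValueEncoder encoder, String doc) {
        Vector v = new RandomAccessSparseVector(10000);
        // Trivial whitespace tokenization; a real example would use an
        // analyzer appropriate for the Reuters text.
        for (String token : doc.toLowerCase().split("\\s+")) {
          encoder.addToVector(token, v);
        }
        return v;
      }

      public static void main(String[] args) {
        AdaptiveWordValueEncoder encoder = new AdaptiveWordValueEncoder("body");
        encoder.setProbes(2);

        Vector a = encode(encoder, "wheat prices rise on export news");
        Vector b = encode(encoder, "grain and wheat exports increase");
        Vector c = encode(encoder, "central bank raises interest rates");

        // Compare hashed documents with Euclidean distance rather than
        // Hamming distance.
        EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
        System.out.println("a-b: " + measure.distance(a, b));
        System.out.println("a-c: " + measure.distance(a, c));
      }
    }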
