On Thu, Dec 13, 2012 at 2:29 PM, Brandon Root <brandonr...@gmail.com> wrote:

> This is a question regarding the new KNN library that Ted Dunning and Dan
> Filimon are working on (as I understand it'll be in Mahout 0.8) so I hope
> this is the appropriate list for this question instead of mahout-dev.
>

Either one.  This is a bit dev-ish, but the readership of the two lists
should be nearly identical because the user list is pretty low volume.


> First off, it's great. I was looking for a streaming k-means library (or
> writing my own) to integrate with Storm and have -- as with all things with
> Mahout -- been really impressed. Naturally taking the appropriate
> I'm-using-this-new-code-at-my-peril attitude, I had a few questions.
>

Aw shucks.


> Right now I'm running StreamingKMeans with the Twitter streaming API.


I hope that you are tracking Dan's repo on GitHub, which is where all the
latest fixes appear first.


> When
> I iterate through each cluster using the FastProjectionSearch, I'm
> occasionally hitting a concurrent modification exception because (of
> course) I'm trying to perform the search while vectors are added in a
> different thread. Do you have any plans to make the code more
> concurrency friendly, or is it more sensible to pause and wait for the
> FastProjectionSearch to finish before adding more vectors? Or am I
> totally missing something? As I understand it, there are performance
> implications to using concurrent collections; is that why you're
> steering clear thus far?
>

I don't think so.  My rationale is that the searchers should not be
required to be any more concurrency friendly than, say, an ArrayList.  In
fact, I would make a case that they should probably be even more
unfriendly, in the sense that they should, if at all possible, detect
concurrent modifications and warn about them.

Put the other way around, it might make more sense: requiring full thread
safety could easily cause significant performance problems that could only
be resolved at a higher level anyway, so thread safety should not be
required of the searchers themselves.

My own recommendation would be that you wrap the searcher of your choice in
a class that delegates operations in a synchronized fashion.  In
particular, I think that it would be good to buffer additions up for (say)
500 ms or a few thousand insertions before getting in the way of searches.
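
Concretely, I am thinking of something like the sketch below.  The
Searcher interface is a stand-in of my own for whatever the real searcher
API looks like (FastProjectionSearch's actual signatures may well differ),
so treat it as illustrative rather than definitive:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: this Searcher interface is a stand-in for the
    // real searcher API; check Dan's repo for the actual classes.
    interface Searcher<V> {
      void add(V vector);
      List<V> search(V query, int limit);
    }

    class BufferedSearcher<V> {
      private final Searcher<V> delegate;
      private final List<V> pending = new ArrayList<V>();
      private final int maxPending;  // flush after, say, a few thousand adds

      BufferedSearcher(Searcher<V> delegate, int maxPending) {
        this.delegate = delegate;
        this.maxPending = maxPending;
      }

      // All mutation is synchronized and buffered, so nothing can touch
      // the underlying search structure while a search is in progress.
      synchronized void add(V vector) {
        pending.add(vector);
        if (pending.size() >= maxPending) {
          flush();
        }
      }

      synchronized List<V> search(V query, int limit) {
        flush();  // fold buffered additions in before searching
        return delegate.search(query, limit);
      }

      private void flush() {  // callers already hold the monitor
        for (V v : pending) {
          delegate.add(v);
        }
        pending.clear();
      }
    }

The 500 ms flavor is the same thing with a timestamp check next to the
size check.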


> Because I am clustering text, I have run into the issue Dan talked about
> here https://github.com/dfilimon/knn/issues/1, and have found that
> clusters aren't too stable with a large(er) number of dimensions.


If you look at some of the literature, particularly the ball k-means paper,
there is a strong case to be made that clustering algorithms should be held
to a standard of stability only for so-called "well-clusterable" data.
Data that has clear optima can be clustered quickly (with strong
guarantees), while data without such optima will be very expensive to
cluster in any kind of stable fashion.

That said, I think that there may be issues with the current implementation
that make stability worse than it might be.  Dan is doing yeoman duty here
to make sure that things are working as well as they should be.

In spite of these issues, there are still a few things that can be done to
improve the situation.

First, you can start the on-line clustering with your old cluster centroids
as initial conditions.  This is very likely to give you lots of stability.
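
In code this amounts to nothing more than pushing the last run's centroids
through before any new data.  The Clusterer interface here is a
hypothetical stand-in of mine, not the real API:

    import org.apache.mahout.math.Vector;

    // Hypothetical stand-in for whatever incremental clusterer you use;
    // the real class and method names will differ.
    interface Clusterer {
      void cluster(Vector v);
    }

    class WarmStart {
      // Seed with the previous run's centroids so the new clustering
      // starts near the old optimum and stays comparable to it, then
      // stream the new data as usual.
      static void seedAndRun(Clusterer clusterer,
                             Iterable<Vector> previousCentroids,
                             Iterable<Vector> newData) {
        for (Vector centroid : previousCentroids) {
          clusterer.cluster(centroid);
        }
        for (Vector point : newData) {
          clusterer.cluster(point);
        }
      }
    }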

Secondly, you can use semi-supervised clustering in which you include a
target variable in your data.  If you run through your labeled data first,
you will start the clustering off with what amounts to a supervised
classifier.  At that point, you can zero out the supervised coordinates
from the cluster centroids and continue clustering unlabeled data in an
unsupervised fashion.  This can be combined with the first idea.
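
To make the mechanics concrete, here is a sketch of the vector surgery
involved.  The layout (one-hot label coordinates appended at the end) and
the label weight are assumptions of mine for illustration:

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    class SemiSupervised {
      // Append one-hot target coordinates to a document vector.  The
      // label weight controls how strongly the target pulls clustering.
      static Vector augment(Vector features, int label, int numLabels,
                            double labelWeight) {
        Vector augmented = new DenseVector(features.size() + numLabels);
        for (int i = 0; i < features.size(); i++) {
          augmented.set(i, features.get(i));
        }
        augmented.set(features.size() + label, labelWeight);
        return augmented;
      }

      // After the labeled pass, zero the supervised coordinates in each
      // centroid before continuing on unlabeled data.
      static void clearLabelCoordinates(Vector centroid, int numFeatures) {
        for (int i = numFeatures; i < centroid.size(); i++) {
          centroid.set(i, 0);
        }
      }
    }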

Thirdly, you can introduce L_1 regularization into the clustering,
especially during the ball k-means phase.  L_1 regularization will tend to
make your final cluster centroids sparse in your feature space, which in
the case of text corresponds to easily explainable clusters (because they
only require a few words to describe).
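
The standard way to apply L_1 in an update like this is a soft-threshold
(proximal) step on each centroid coordinate after the centroid is
recomputed.  This is my own illustration, not anything that exists in the
code yet:

    import org.apache.mahout.math.Vector;

    class L1 {
      // Shrink each coordinate toward zero by lambda and clip at zero.
      // Coordinates that end up at zero make the centroid sparse, which
      // for text means the cluster is describable by a few words.
      static void softThreshold(Vector centroid, double lambda) {
        for (int i = 0; i < centroid.size(); i++) {
          double x = centroid.get(i);
          double magnitude = Math.max(0, Math.abs(x) - lambda);
          centroid.set(i, Math.signum(x) * magnitude);
        }
      }
    }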

The first approach is pretty likely to help your results.  You can apply it
either in the sketch phase, in the ball k-means phase, or in both phases.
This will make one clustering result resemble the next one if the data
itself doesn't change a lot.

The second approach is less of a sure thing, but I have seen it do
extraordinary things and dramatically improve results.

The third approach is much more research-y, but it has substantial
potential.


> I'm happy to play
> around with the math a little bit, but I'd love to hear if you've made any
> progress or have other suggestions.


My only really strong suggestion with regard to the code is to use Dan's
version until we get this into Mahout.


> What I'm trying to do is (roughly)
> cluster tweets relating to a topic so that I can look for patterns in the
> conversations.


This is a great thing.  It would be fantastic if you could produce a demo
along these lines.


> It would be preferable to keep many of the clusters
> consistent, so that I can monitor how they ebb or flow. If one cluster (for
> example) contains tweets about Obama with words like "socialist"
> "communist" "Kenya" and "Fox News" it would be preferable to keep that
> cluster relatively stable so that I could watch how that conversation
> changes.


Absolutely.  Completely agree.


> .... Is there any way I can help?
>

Make that demo work!

Make a video to show it off.

Bring back more feedback.
