Hi guys,

First of all - Happy New Year - I wish you a healthy and successful 2010!

I would like to give some feedback, and to ask some questions as well :).

I find the classification above very useful (I actually found some details
on the behavior of the k-means and Dirichlet algorithms that I had not seen
before in the docs around Mahout).
I played around with Carrot2 for two weeks, and it really offers a great
level of usability and simplicity, but I had to give up on it since my very
first practical clustering task required clustering 23K+ documents. While I
was able to run the Lingo algorithm on 5,000 docs with 6 GB of heap memory
in approx. 2 minutes, it "failed" when I tried to cluster 8,000 docs: it
just ran for 30 minutes until I had to kill the app, which was clearly not
practical compared to the 2 minutes for 5,000 docs.
That was the point where I had to give up on Carrot2 and started learning
Mahout/Hadoop.

I have managed to cluster my 23,000+ docs with Mahout/k-means in something
like 10 minutes (in standalone mode - no parallel processing at all; I have
not even used all of my (3 :-)) cores with Hadoop/Mahout yet), but I am
still learning and still trying to analyze whether the resulting clusters
are really meaningful for my docs.

One thing I can tell already is that I definitely, desperately, need
stopword removal (which worked like a charm in Carrot2), but I am
struggling a little bit with the Mahout code to come up with the right way
to apply stopwords. As far as I know there is no such feature in the code
yet. I could use any piece of advice from you guys on where and how to
apply stopwords in Mahout's clustering algorithms. Or should I rather do it
during the text-to-vector phase? It would be valuable for me, though, to be
able to come back later to the complete context of a document (i.e. with
the stopwords still inside). Maybe that is a question of its own: how can I
easily go back from clusters to the original docs (and not just vectors)?
Perhaps some kind of mapper that maps vectors back to the original
documents (e.g. a sort of URL for a document based on the vector id/index
or something?) - see the sketch below.
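To make that idea concrete, here is a minimal sketch of what I have in mind
(the DocIndex class and its methods are entirely hypothetical, not an
existing Mahout API): while converting the text to vectors, I would record
each document's location under the same index its vector gets, and use that
map to walk back from cluster members to documents:

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper: remembers vectorIndex -> document location.
    public class DocIndex {
      private final Map<Integer, String> indexToUrl = new HashMap<Integer, String>();
      private int next = 0;

      // Call once per document, in the same order the vectors are written.
      public int register(String url) {
        indexToUrl.put(next, url);
        return next++;
      }

      // After clustering: resolve a cluster member back to its document.
      public String lookup(int vectorIndex) {
        return indexToUrl.get(vectorIndex);
      }

      public void save(File f) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(f));
        for (Map.Entry<Integer, String> e : indexToUrl.entrySet()) {
          out.println(e.getKey() + "\t" + e.getValue());
        }
        out.close();
      }
    }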

So, I definitely need to apply stopwords - without them I cannot make any
use of my clusters. At least the ones I got after running one iteration of
k-means over my original docs (Lucene-derived vectors) were not really
practical, as I ended up with lots of clusters based on words that should
have been stopped.
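If the text-to-vector phase is indeed the right place, is the following
roughly the idea? This is only a sketch assuming the Lucene 2.9/3.0 API
(the Version constant and constructor may differ in other releases); the
documents would be indexed with this analyzer, and Mahout's Lucene vector
driver would then be pointed at the resulting index:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class StopwordAnalyzerFactory {
      // Tokens in the stop set never reach the Lucene index, so they
      // never reach the vectors derived from that index either.
      public static Analyzer create() {
        Set<String> stopWords = new HashSet<String>(
            Arrays.asList("the", "and", "of", "to", "in")); // extend as needed
        return new StandardAnalyzer(Version.LUCENE_30, stopWords);
      }
    }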

Another question (maybe I simply overlooked something in the docs): as far
as I understand, one should really run the Canopy algorithm first to get
some "good" seeds for the initial clusters, and only then go for k-means
clustering, but I could not find an example doing so. As far as I can tell,
the examples show how to run the different algorithms (Canopy, k-means)
separately, not chained one after the other. Did I miss something? Can I
get more guidance on how to chain Canopy and k-means?
Yet another question: I read in some docs that k-means actually has to be
run multiple times, but I could not find that in the examples either. How
can I do that? Below is how I currently picture the chaining.
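This is only a guess pieced together from reading the driver classes; the
runJob signatures, the distance-measure package name, and the t1/t2 and
iteration values are all assumptions and probably differ between Mahout
releases:

    import org.apache.mahout.clustering.canopy.CanopyDriver;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;

    public class CanopyThenKMeans {
      public static void main(String[] args) throws Exception {
        String vectors  = "output/vectors";   // text-to-vector output
        String canopies = "output/canopies";  // seed clusters from Canopy
        String clusters = "output/kmeans";    // final k-means clusters

        // Canopy scans the data once and emits seed centroids; the t1/t2
        // distance thresholds effectively decide how many seeds come out.
        CanopyDriver.runJob(vectors, canopies,
            "org.apache.mahout.common.distance.EuclideanDistanceMeasure",
            250.0, 120.0);

        // k-means then refines those seeds. The "multiple runs" appear to
        // be the iterations the driver performs internally (at most 10
        // here), stopping early once the centroids move less than the
        // convergence delta.
        KMeansDriver.runJob(vectors, canopies, clusters,
            "org.apache.mahout.common.distance.EuclideanDistanceMeasure",
            0.001, 10, 1);
      }
    }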

One thing is for sure - I cannot go back to Carrot2. It is very useful for
small volumes of docs but will not work for what I am heading towards:
my initial attempt is to cluster 23K+ docs, but then I would like to move
on to my complete set of 100K+ docs.

I would definitely like to be in control of the number of clusters
(parametrized).
I also think I will get better results if I apply stemming. What would be
your recommendation when using Mahout? Should I do the stemming somewhere
in the input vector forming as well? Or do you expect this to be part of
the clustering algorithms? A sketch of what I mean follows.
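Here is how I imagine handling both stemming and stopping on the
vector-forming side, so that the clustering algorithms themselves stay
untouched. Again only a sketch against the Lucene 2.9/3.0 token stream API
(these constructors changed in later versions):

    import java.io.Reader;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Sketch: stemming and stopping happen while forming the input
    // vectors, not inside the clustering algorithms.
    public class StemmingAnalyzer extends Analyzer {
      private final Set<String> stopWords;

      public StemmingAnalyzer(Set<String> stopWords) {
        this.stopWords = stopWords;
      }

      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(true, stream, stopWords); // true = keep position increments
        stream = new PorterStemFilter(stream);            // Porter stemming, English only
        return stream;
      }
    }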

It is also really essential for me to have "updateable" algorithms, as I am
adding new documents on a daily basis, and I would definitely like to have
them clustered immediately (incrementally). I do not know if this is what
is called "classification" in Mahout; I have not reached those examples yet
(I wanted to really get acquainted with the clustering first).
And that is not all: I not only want new documents clustered against the
existing clusters, I also want the clusters themselves to be able to change
as new docs come in.
Of course one would not observe new clusters popping up after a single new
doc is added to the analysis, but the clusters should really be
adaptable/updateable with new docs.
Would that be possible with k-means, for example? The best I have come up
with so far for the "immediate" half is sketched below.
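For clustering a new document immediately, the simplest thing I can think
of is assigning it to the nearest existing centroid, and only re-running
k-means periodically to let the clusters themselves move. A minimal sketch
(the Vector and DistanceMeasure import paths are from the trunk I looked at
and may differ in your version):

    import java.util.List;

    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.Vector;

    public class IncrementalAssigner {
      // Assign a freshly vectorized document to the nearest existing
      // k-means centroid; the centroids themselves are not moved here.
      public static int assign(Vector doc, List<Vector> centroids) {
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
          double d = measure.distance(centroids.get(i), doc);
          if (d < bestDist) {
            bestDist = d;
            best = i;
          }
        }
        return best;
      }
    }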

I could give more feedback, and possibly more questions :), when I get a
little bit further with this.

Best regards,
Bogdan


On Thu, Dec 31, 2009 at 10:39 PM, Ted Dunning <[email protected]> wrote:

> On Thu, Dec 31, 2009 at 9:10 AM, Grant Ingersoll <[email protected]> wrote:
>
> > As some of you may know, I'm working on a book (it's a long time coming,
> > but I'm getting there) about open source techniques for working with
> > text.
> > ...
> > Based on my research, it seems people typically divide up the clustering
> > space into two approaches: hierarchical and flat/partitioning.
>
>
> There are a number of ways of dividing the space of all clustering
> techniques.  Whether the final result is hierarchical is an important
> criterion.
>
> Other important characteristics include:
>
> - are cluster memberships hard or soft (k-means = hard, LDA and Dirichlet =
> soft)
>
> - can the clustering algorithm be viewed in a probabilistic framework
> (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
> neighbors = not so much)
>
> - is the definition of a cluster abstract enough to be flexible with regard
> to whether a cluster is a model, or does it require stronger limits?
> (k-means = symmetric Gaussian with equal variance, Dirichlet = almost any
> probabilistic model)
>
> - is the algorithm updateable or does it have to run from scratch (k-means,
> Dirichlet = yes, agglomerative clustering = not easily)
>
> - is the algorithm non-parametric (which for clustering pretty much reduces
> to whether the number and complexity of clusters can increase without bound
> as the amount of data increases, Dirichlet = yes)
>
> - does the algorithm operationally take linear time in the size of the data
> (k-means yes, LDA = not sure, Dirichlet = pretty much, agglomerative
> clustering = no for most algorithms)
>
> - can the algorithm make use of pre-existing knowledge or user adjustments?
> (k-means yes, Dirichlet yes)
>
> Note that it is pretty easy to adapt several algorithms like k-means to be
> hierarchical.
>
>
> > In overlaying that knowledge with what we have for techniques in Mahout,
> > I'm a bit stumped about where things like LDA and Dirichlet fit into those
> > two approaches, or is there perhaps a third that I'm missing?  They don't
> > seem particularly hierarchical, but they don't seem flat either, if that
> > makes any sense, given the probabilistic/mixture nature of the algorithms.
> > Perhaps I should forgo the traditional division that previous authors have
> > taken and just talk about a suite of techniques at a little lower level?
> > Thoughts?
> >
>
> I think that some of these distinctions are interesting but I think it is
> also very easy to confuse newcomers with too many distinctions.  A big part
> of the confusion has to do with the fact that none of these distinctions is
> comprehensive, nor are any of these completely clear cut.
>
>
> > The other thing I'm interested in is people's real world feedback on using
> > clustering to solve their text related problems.  For instance, what type of
> > feature reduction did you do (stopword removal, stemming, etc.)?  What
> > algorithms worked for you?  What didn't work?
>
>
> My experience has been that almost any reasonable implementation of an
> algorithm in the k-means family will work reasonably well (this includes
> LDA
> and Dirichlet effectively) and none of them are completely excellent.
> Success with text clustering strongly depends on setting expectations
> correctly in the interface.  If the user expects the clustering to be
> useful but not definitive then they tend to find clustering to be high
> value.  If the user expects the clustering to be absolutely definitive,
> they
> will be sorely disappointed.  For cases where perfection is expected, it is
> often important to have good integration between clustering and human
> edits.
>
> The Carrot2 implementation is among the best that I have used in terms of
> the quality of clustering small batches of results such as search result
> lists.
>



-- 
Bogdan Vatkov
email: [email protected]
phone: +359 889 197 756
