Thanks for your answers, Jeff, find replies below:

On Wed, Jan 5, 2011 at 5:32 PM, Jeff Eastman <[email protected]> wrote:

>
>
> -----Original Message-----
> From: Pere Ferrera [mailto:[email protected]]
> Sent: Wednesday, January 05, 2011 4:37 AM
> To: [email protected]
> Subject: two-tiered clustering
>
> Hello all,
>
> I have a clustering problem involving a huge data set.
> [jeff] I'm having some similar problems with my own data so let's
> collaborate to see if we can improve things.
>
> My knowledge of clustering techniques is very limited, so I just tried a few
> algorithms. Because I can regression-test them against a valid clustered
> set, I found a configuration that gives the best accuracy with no false
> positives (MeanShift Canopy, Tanimoto distance). However, because the number
> of output clusters is roughly linear in the number of documents to cluster,
> I had problems scaling out with MeanShift - because of the single Reducer
> step. I have read on the mailing list that MeanShift is a good fit if the
> number of clusters grows logarithmically, not linearly, with input size.
> [jeff] I keep being surprised that MeanShift Canopy performs as well as it
> does. I'm pretty sure there is no other implementation like it anywhere. I
> think one can increase the number of reducers early in the iterations to
> overcome some of the scalability limitations you cite, but the canopies
> gather a list of all their contained pointIds, and this runs into another
> scalability issue. There are ways, with logging, to eliminate this memory
> growth, but I feel it is likely just whack-a-mole.
>

[pere] That sounded interesting, can you elaborate a bit more on this? The
problem, as I understood it - but maybe I'm wrong - is that the reducer step
cannot be parallelized because it has to merge the results of the n mappers
into one final clustering result (hence the single reducer). Because the
complexity of this merge depends on the number of clusters that will be
generated (which are accumulated in the Clusterer), we have a fixed bound
here that is independent of the parallelism of the Job.
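
Just to check we are talking about the same thing, this is roughly how I
picture that merge - a hand-written sketch with made-up names, not Mahout's
actual reducer code. All mapper-side canopies funnel into the single reducer,
and each incoming canopy is tested against the clusters kept so far, so the
work grows with the number of output clusters no matter how many mappers ran
(the distance being Tanimoto, 1 - a.b / (|a|^2 + |b|^2 - a.b)):

import java.util.ArrayList;
import java.util.List;

// Sketch only: illustrative types, not Mahout's MeanShift reducer.
public class SingleReducerMergeSketch {

  static class Canopy {
    double[] center;
    Canopy(double[] center) { this.center = center; }
    void absorb(Canopy other) { /* update center, counts, contained pointIds ... */ }
  }

  // All canopies emitted by the mappers arrive here; each one is compared
  // against the list accumulated so far, so the single reducer does
  // O(#mapperCanopies * #clusters) work regardless of how many mappers ran.
  static List<Canopy> merge(List<Canopy> mapperCanopies, double t1) {
    List<Canopy> merged = new ArrayList<Canopy>();
    for (Canopy incoming : mapperCanopies) {
      boolean absorbed = false;
      for (Canopy kept : merged) { // grows with the number of output clusters
        if (tanimotoDistance(kept.center, incoming.center) < t1) {
          kept.absorb(incoming);
          absorbed = true;
          break;
        }
      }
      if (!absorbed) {
        merged.add(incoming);
      }
    }
    return merged;
  }

  // Tanimoto distance: 1 - (a.b) / (|a|^2 + |b|^2 - a.b)
  static double tanimotoDistance(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return 1.0 - dot / (na + nb - dot);
  }
}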

I'm sure some of you have already racked your brains over an alternative
implementation of the algorithm that overcomes this limitation, so I assume
it's not an easy problem at all.

> Then I considered K-Means because it seems to scale well regardless of the
> number of clusters. With a good choice of K, I got some false positives but
> very good coverage. To avoid the false positives, I have read that I would
> need to choose a good initial configuration, but I can't apply Canopy first
> because it won't scale to a large number of clusters, and there's no
> K-means++ implemented yet.
> [jeff] K-Means works well too, but we lack a good way to prime it with -k
> clusters. Canopy has scalability problems, as you note, and the random
> sampling works only so-so. We have a JIRA for K-means++.
>
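[pere] On K-means++: in case it helps the JIRA discussion, the seeding step by
itself is quite simple. A minimal in-memory sketch of the idea (plain Java, just
an illustration, not Mahout code) would be:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of k-means++ seeding: the first center is picked uniformly at random,
// and each further center is picked with probability proportional to its squared
// distance to the nearest center chosen so far, spreading the seeds over the data.
public class KMeansPlusPlusSeedingSketch {

  static List<double[]> seed(List<double[]> points, int k, Random rnd) {
    List<double[]> centers = new ArrayList<double[]>();
    centers.add(points.get(rnd.nextInt(points.size())));
    double[] d2 = new double[points.size()]; // squared distance to nearest chosen center
    while (centers.size() < k) {
      double total = 0;
      for (int i = 0; i < points.size(); i++) {
        d2[i] = nearestSquaredDistance(points.get(i), centers);
        total += d2[i];
      }
      double u = rnd.nextDouble() * total; // sample an index proportionally to d2
      int chosen = 0;
      for (double cumulative = 0; chosen < points.size() - 1; chosen++) {
        cumulative += d2[chosen];
        if (cumulative >= u) {
          break;
        }
      }
      centers.add(points.get(chosen));
    }
    return centers;
  }

  static double nearestSquaredDistance(double[] p, List<double[]> centers) {
    double best = Double.MAX_VALUE;
    for (double[] c : centers) {
      double d = 0;
      for (int j = 0; j < p.length; j++) {
        double diff = p[j] - c[j];
        d += diff * diff;
      }
      if (d < best) {
        best = d;
      }
    }
    return best;
  }
}
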
> Then I thought that I could apply a multi-level technique: use K-Means with
> an approximate and pessimistic (low) number of clusters, and then use
> MeanShift inside each cluster to refine the results and avoid false
> positives. I didn't know if that was a good idea, but I found a reference to
> something similar (two-tiered clustering) on page 215 of "Mahout in Action".
>
> Even though I have this solution working fine, I have the feeling that it
> has become too complex. To try it out, I wrote some custom code. In my
> current solution I have a job that runs on the output of a first clustering
> and launches as many map tasks as clusters were generated, performs a second
> clustering inside each of them, and accumulates the output of each mapper
> (so there is no Reducer step).
>
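[pere] For reference, the shape of my custom code is roughly the following,
heavily simplified and with made-up class names (the real second-level
clustering is elided). The input is the output of the coarse clustering, split
so that each map task sees exactly one coarse cluster, and each mapper writes
its refined clusters directly since there are zero reducers:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondTierDriverSketch {

  // Hypothetical mapper: each input split holds the points of one coarse cluster.
  public static class SecondTierMapperSketch
      extends Mapper<LongWritable, Text, Text, Text> {

    private final List<Text> points = new ArrayList<Text>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      points.add(new Text(value)); // buffer the points of this coarse cluster
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // In the real code the second-level (MeanShift) clustering runs here over
      // the buffered points; this placeholder just writes them back out unchanged.
      for (Text point : points) {
        context.write(new Text("refined-cluster"), point);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "second-tier clustering");
    job.setJarByClass(SecondTierDriverSketch.class);
    job.setMapperClass(SecondTierMapperSketch.class);
    job.setNumReduceTasks(0); // map-only: each mapper's output is final, no merge step
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // coarse clustering output
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // refined clusters
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
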
> I was wondering if there is a better way to accomplish that. Is there
> something already implemented to execute a two-tiered (or n-tiered)
> clustering?
> I also thought that maybe you could point me to a better way of solving my
> problem.
> [jeff] Nothing is implemented out of the box for n-tiered clustering, though
> I believe that the results of any clustering can be used as the prior for
> most of our iterative algorithms.


[pere] Correct me if I'm wrong, but iterations do not reduce the size of the
problem, whereas n-tiering does. I don't know how useful it would be to have
tiering formally implemented, but tell me if I can help with it.


> Another way to approach the problem might be to try Dirichlet clustering to
> prime k-means. It is non-parametric and could give you an objective opinion
> on the best value of k for your data. Dirichlet can use
> DistanceMeasureModels, e.g. Tanimoto, Cosine, Manhattan, etc.
>

[pere] Thanks, I did some quick tests with Dirichlet, nothing formal, so
it'll be worth giving it a deeper try.
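
If it helps anyone else's intuition about the non-parametric part, here is a toy
simulation of the Chinese Restaurant Process prior behind Dirichlet process
clustering (plain Java sketch, not Mahout's implementation): each point joins an
existing cluster with probability proportional to that cluster's size, or opens
a new cluster with probability proportional to alpha, so the number of clusters
is driven by the data and grows roughly like alpha * ln(n):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy Chinese Restaurant Process: the prior only, no likelihoods or Gibbs steps.
public class CrpSketch {
  public static void main(String[] args) {
    double alpha = 1.0; // concentration parameter
    int n = 100000;     // number of points
    Random rnd = new Random(42);
    List<Integer> clusterSizes = new ArrayList<Integer>();
    for (int i = 0; i < n; i++) {
      // Existing cluster k is chosen with probability size(k) / (i + alpha),
      // a brand-new cluster with probability alpha / (i + alpha).
      double u = rnd.nextDouble() * (i + alpha);
      double cumulative = 0;
      int assigned = -1;
      for (int k = 0; k < clusterSizes.size(); k++) {
        cumulative += clusterSizes.get(k);
        if (u < cumulative) {
          assigned = k;
          break;
        }
      }
      if (assigned >= 0) {
        clusterSizes.set(assigned, clusterSizes.get(assigned) + 1);
      } else {
        clusterSizes.add(1); // open a new cluster
      }
    }
    // The expected number of clusters is roughly alpha * ln(n), i.e. logarithmic in n.
    System.out.println("clusters: " + clusterSizes.size());
  }
}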

Pere.
