[mahout 0.9 | k-means] methodology for selecting k to cluster very large datasets

2015-09-15 Thread hsharma mailinglists
Hello,


I have some questions around large-scale clustering. I would like to
arrive at a methodology I can use to determine an appropriate value
of K for K-means clustering (at least for my scenario, if not in
general). More details follow below (apologies for the verbosity,
but I wanted to provide as much context as I could).

By 'large' I mean to imply:

* large in the number of points to cluster
* large in the dimensionality of the vector/feature-space
* large in the number of clusters


It would be doubly great if someone here has had to perform clustering
in a similar setting (similar in terms of the number of datapoints,
the nature/type of data being clustered, the number of clusters being
formed, etc.) and is willing to share their war story.

:: The context ::

I'm trying to cluster ~ 18 million sparse term-frequency vectors with
Mahout 0.9. Although these vectors were not derived from documents in
a text corpus, they were generated based on each data point being
associated with a finite number of discrete symbols. The size of the
overall symbol-set is quite large, hence the sparsity of the vector
representation. I assumed that I don't require idf-weighting because
none of these symbols can occur more than once per datapoint/vector.
Additionally, each vector has a fixed number of non-zero dimensions.

I'm using the seq2sparse program to generate these tf vectors from the
raw symbol-based inputs. For my case, the notion of
similarity/distance is based on the overlap between the symbol-sets of
the two vectors under consideration. Hence, I'm using the Tanimoto
distance measure (and therefore the SequentialAccessSparseVector
implementation).
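
For intuition: since no symbol occurs more than once per vector and
every vector has the same number of non-zero entries, the Tanimoto
distance here effectively reduces to one minus the Jaccard overlap of
the two symbol sets. A tiny, purely illustrative Java sketch of that
set-level view (not the Mahout implementation, which operates on
Vector instances):

    import java.util.HashSet;
    import java.util.Set;

    public class TanimotoSketch {
      // 1 - |intersection| / |union|; with both sets of size m,
      // this is 1 - overlap / (2m - overlap).
      static double tanimotoDistance(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        int overlap = intersection.size();
        int union = a.size() + b.size() - overlap;
        return 1.0 - (double) overlap / union;
      }
    }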

Also, my Hadoop cluster has 6 worker nodes (in case the Hadoop
infrastructure should influence the methodology for K-selection).



:: My questions ::

My main question here is:

* How do I go about coming up with a value of K, given the sizes and
nature of my dataset and the feature-space?


I'm aware that this Wikipedia page lists a bunch of approaches for
K-selection
(en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set),
along with the motivation behind each. But before I go down the path
of implementing one of those approaches, I'd love to hear from users
here who have had to do something similar:

* How did you select K for your large-scale clustering scenario?
* Also, was that K so large (w.r.t. the number of available Hadoop
worker nodes) that the native MapReduce implementation of Mahout
K-means could not be used as-is due to running-time issues? If so,
how did you come up with an alternative or augmenting approach to
MapReduce-based K-means?


Further, I have an idea of my own for coming up with a ball-park
value of K (and the K initial centroids). It would be great to get
feedback on whether it's worth pursuing (or whether it's better to go
with one of the approaches on the Wikipedia page referred to above,
or with an altogether different approach that you used). The idea is
this:

Since the notion of distance/similarity for my data is based on
overlap of identically-sized symbol-sets, I can compute co-occurrence
frequencies of the symbol n-sets (how many times a particular set of n
symbols appears across all vectors of the dataset). I then identify
the symbol n-sets with the highest co-occurrence frequencies (where
'highest' means above a percentile-based cutoff on those frequencies).
This restricted set of symbol n-sets then gives me an estimate of what
K _could_ be, and also some indication of how to choose those K
initial centroids (I'm thinking I will just pick K datapoints/vectors
at random such that they contain these high-frequency symbol n-sets.)
Is this overkill/nonsensical for my purpose?
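
In case it helps to make the idea concrete, here is a rough, purely
illustrative in-memory sketch of the counting step (all names and the
percentile cutoff are placeholders; the real version would be a
MapReduce job, and enumerating all n-subsets is only practical for
small n):

    import java.util.*;

    public class NSetKEstimate {

      // Count each symbol n-set across all datapoints, keep only
      // those above a frequency-percentile cutoff, and use the number
      // of survivors as a ball-park K.
      static int estimateK(List<Set<String>> datapoints, int n,
                           double percentile) {
        Map<Set<String>, Integer> counts =
            new HashMap<Set<String>, Integer>();
        for (Set<String> point : datapoints) {
          for (Set<String> nset : nSubsets(new ArrayList<String>(point), n)) {
            Integer c = counts.get(nset);
            counts.put(nset, c == null ? 1 : c + 1);
          }
        }
        List<Integer> freqs = new ArrayList<Integer>(counts.values());
        Collections.sort(freqs);
        int cutoff = freqs.get((int) Math.min(freqs.size() - 1,
            Math.floor(percentile * freqs.size())));
        int k = 0;
        for (int f : counts.values()) {
          if (f >= cutoff) {
            k++;
          }
        }
        return k;
      }

      // All size-n subsets of one point's symbols (combinatorial, so
      // only feasible for small n).
      static List<Set<String>> nSubsets(List<String> symbols, int n) {
        List<Set<String>> result = new ArrayList<Set<String>>();
        collect(symbols, n, 0, new ArrayDeque<String>(), result);
        return result;
      }

      private static void collect(List<String> symbols, int n, int start,
                                  Deque<String> current,
                                  List<Set<String>> out) {
        if (current.size() == n) {
          out.add(new HashSet<String>(current));
          return;
        }
        for (int i = start; i < symbols.size(); i++) {
          current.addLast(symbols.get(i));
          collect(symbols, n, i + 1, current, out);
          current.removeLast();
        }
      }
    }

The K initial centroids would then be sampled from datapoints that
contain the surviving high-frequency n-sets.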


As a final note, I do realize that using k-means with sparse vectors
may not be a good idea (given that the centroid vectors themselves are
going to be a lot denser than the datapoint vectors). However, I'm
tempted to test it at my dataset's size, since I'm getting decent
clustering quality on much smaller subsets of my overall corpus. (Yes,
the K-selection issue exists there as well, but those subsets have
been small enough to experiment with a range of K-values for each of
them and reach convergence in a relatively small number of
quick-running iterations. Attempting to do the same with the overall
dataset is a challenge, because each iteration takes a really long
time to run, even with K much smaller than 18 million; hence I'm
trying to avoid having to experiment with a range of K-values and
hoping to use a more principled, tried-and-tested approach.)



Thanks very much in advance!

- Harsh


Re: [mahout 0.9 | k-means] methodology for selecting k to cluster very large datasets

2015-09-15 Thread Ted Dunning
My own feeling is that the right answer is to look at the average squared
distance on your training data and on held-out data.

As long as these values are nearly the same, you likely have a value of k
that is smaller than (or equal to) the optimal one. When the average
squared distance is significantly less on the training data than on the
held-out data, you have too large a value of k.
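
In rough code, the check is just this (a minimal, Mahout-agnostic
sketch; the distance is whatever measure you cluster with):

    import java.util.List;

    public class HeldOutCheck {

      interface Distance {
        double distance(double[] a, double[] b);
      }

      // Average squared distance from each point to its nearest
      // centroid. Compare this value on the training set and on the
      // held-out set for each candidate k: a clearly lower training
      // value suggests k is too large.
      static double avgSquaredDistance(List<double[]> points,
                                       List<double[]> centroids,
                                       Distance d) {
        double total = 0.0;
        for (double[] p : points) {
          double best = Double.POSITIVE_INFINITY;
          for (double[] c : centroids) {
            double dist = d.distance(p, c);
            best = Math.min(best, dist * dist);
          }
          total += best;
        }
        return total / points.size();
      }
    }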

This definition assumes that you are using clusters to describe your data.
You may have a different goal in mind that would require a different
criterion.


