Here is the outline. For details of the APIs, please refer to the Samsara manual [2]; I will not repeat it here.

Assume your training data input is an m x n matrix A. For simplicity, let's assume it is a DRM with int row keys, i.e., DrmLike[Int].

Initialization: classic k-means starts by selecting initial clusters, by sampling them out of the data. You can do that using the sampling api [1], thus forming a k x n in-memory matrix C (the current centroids); C is therefore of Mahout's in-core Matrix type. You then proceed by alternating between cluster assignments and recomputation of the centroid matrix C until convergence, based on some test or simply limited by an epoch count budget, your choice.
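To make that concrete, here is a minimal sketch of the initialization. It assumes drmA: DrmLike[Int] already exists, k is chosen by the caller, and drmSampleKRows is the sampling api that [1] points to; treat the exact signature as approximate for your version:

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // drmA: DrmLike[Int] is the m x n training data, assumed already loaded.
    val k = 10 // hypothetical cluster count, chosen by the caller

    // Sample k rows of A to form the initial k x n in-core centroid matrix C [1].
    val inCoreC: Matrix = drmSampleKRows(drmA, k, replacement = false)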
Cluster assignments: here we go over the current generation of A and recompute the centroid index for each row in A. Once we recompute the index, we put it into the row key. You can do that by assigning centroid indices to the keys of A using the operator mapBlock() (details in [2], [3], [4]). You also need to broadcast C in order to be able to access it in an efficient manner inside the mapBlock() closure; examples of that are plentiful in [2]. Essentially, in mapBlock you reform the row keys to reflect the cluster index in C. While going over A, you have a "nearest neighbor" problem to solve between each row of A and the centroids C. This is the bulk of the computation, really, and there are a few tricks that can speed this step up in both exact and approximate manners, but you can start with a naive search; see the first sketch below.

Centroid recomputation: once you have assigned centroid indices to the keys of matrix A, you want to do an aggregating transpose of A to compute, essentially, the average of the rows of A grouped by centroid key. The trick is to compute (1|A)', which results in a matrix whose first row holds the counts of the cluster rows and whose remaining rows hold their sums. This is the part I find difficult to explain without LaTeX graphics. In Samsara, the construction of (1|A)' corresponds to the DRM expression (1 cbind A).t (again, see [2]). So when you compute, say, B = (1|A)', then B is (n+1) x k, and each column of B corresponds to a cluster 1..k: the first element is the number of points in the cluster, and the rest of the column is the sum of all its points. So in order to arrive at an updated matrix C, we need to collect B into memory and slice out the counts (the first row) from the rest of it. To compute C: take B(2:,:), divide each row elementwise by the counts in B(1,:), and transpose the result back to k x n (watch out for empty clusters with 0 elements; they cause NaNs in the newly computed C and a lack of convergence). This operation obviously uses subblocking and row-wise iteration over B, for which I am again making reference to [2]. A sketch of this step, and of the overall driver loop, follows after the assignment sketch below.
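For the assignment step, a hedged sketch following the broadcast-plus-mapBlock pattern shown in [2]. The helper name assignClusters and the use of Vector.getDistanceSquared for the naive exact search are my choices, not anything prescribed by the manual:

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // Rewrite the row keys of A with the index of the nearest centroid in C.
    // Assumes an implicit DistributedContext is in scope, as in [2].
    def assignClusters(drmA: DrmLike[Int], inCoreC: Matrix)
                      (implicit ctx: DistributedContext): DrmLike[Int] = {
      val bcastC = drmBroadcast(inCoreC) // efficient access to C inside the closure

      drmA.mapBlock() { case (keys, block) =>
        val c = bcastC.value
        val newKeys = new Array[Int](block.nrow)
        for (r <- 0 until block.nrow) {
          val row = block(r, ::)
          // Naive exact search: nearest centroid by squared Euclidean distance.
          var best = 0
          var bestDist = Double.PositiveInfinity
          for (ci <- 0 until c.nrow) {
            val d = row.getDistanceSquared(c(ci, ::))
            if (d < bestDist) { bestDist = d; best = ci }
          }
          newKeys(r) = best
        }
        newKeys -> block
      }
    }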
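For the update step, a sketch built on the (1 cbind A).t expression from the text. Note the code uses 0-based indexing where the prose uses 1-based slices; updateCentroids is again my name, and keeping the old centroid for an empty cluster is one simple way around the NaN issue mentioned above:

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // Recompute C from an A whose row keys already hold cluster indices 0..k-1.
    def updateCentroids(drmA: DrmLike[Int], oldC: Matrix): Matrix = {
      // B = (1|A)'. The transpose aggregates (sums) rows of (1|A) sharing a key,
      // so B is (n+1) x k: row 0 holds counts, rows 1..n hold per-cluster sums.
      // Depending on how nrow metadata is tracked after key reassignment, B may
      // come back with more than k columns; columns 0..k-1 are the ones we need.
      val inCoreB = (1 cbind drmA).t.collect

      val k = oldC.nrow
      val n = oldC.ncol
      val newC = oldC.like() // empty k x n matrix of the same type
      for (j <- 0 until k) {
        val col = inCoreB.viewColumn(j)
        val count = col(0)
        if (count > 0)
          newC(j, ::) := col.viewPart(1, n) / count // sum / count = cluster mean
        else
          newC(j, ::) := oldC(j, ::) // empty cluster: keep old centroid, avoid NaNs
      }
      newC
    }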
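Finally, continuing from the sketches above (same imports, and drmA, k, assignClusters, updateCentroids in scope), the alternation loop itself; maxEpochs and tol are placeholders standing in for "some test or an epoch count budget":

    // Alternate assignment and centroid update until the epoch budget is
    // exhausted or the centroids stop moving.
    val maxEpochs = 20 // placeholder epoch budget
    val tol = 1e-6     // placeholder movement tolerance

    var inCoreC: Matrix = drmSampleKRows(drmA, k, replacement = false)
    var moved = Double.PositiveInfinity
    var epoch = 0
    while (epoch < maxEpochs && moved > tol) {
      val drmAssigned = assignClusters(drmA, inCoreC)
      val nextC = updateCentroids(drmAssigned, inCoreC)
      moved = (nextC - inCoreC).norm // Frobenius norm of the centroid shift
      inCoreC = nextC
      epoch += 1
    }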
[1] sampling api:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable:
http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again dated but largely viable for the purpose of this exercise:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html
[4] mapBlock etc.:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps

On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> @Dmitriy can you please again tell me the approach to move ahead.
>
> Thanks
> Parth Khatwani
>
> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
> > Yes, I am unable to figure out the way ahead.
> > Like how to create the augmented matrix A := (0|D) which you have
> > mentioned.
> >
> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> >> Was my reply to your post on @user a bit confusing?
> >>
> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
> >> h2016...@pilani.bits-pilani.ac.in> wrote:
> >>
> >> > Sir,
> >> > I am trying to write the k-means clustering algorithm using Mahout
> >> > Samsara, but I am a bit confused about how to leverage the
> >> > Distributed Row Matrix for the same. Can anybody help me with the
> >> > same?
> >> >
> >> > Thanks
> >> > Parth Khatwani