Here is the outline. For details of the APIs, please refer to the Samsara manual [2]; I will not repeat it here.

Assume your training data input is an m x n matrix A. For simplicity, let's assume it is a DRM with int row keys, i.e., DrmLike[Int].

Initialization: classic k-means starts by selecting initial clusters, by sampling them out of the data. You can do that using the sampling api [1], thus forming a k x n in-memory matrix C (the current centroids); C is therefore of Mahout's in-core Matrix type. You then proceed by alternating between cluster assignments and recomputation of the centroid matrix C until convergence, based on some test or simply limited by an epoch count budget, your choice.
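To make that concrete, here is a minimal sketch of the initialization. It assumes drmA: DrmLike[Int] already exists, k is chosen by the caller, and drmSampleKRows is the sampling api that [1] points to; treat the exact signature as approximate for your version:

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // drmA: DrmLike[Int] is the m x n training data, assumed already loaded.
    val k = 10 // hypothetical cluster count, chosen by the caller

    // Sample k rows of A to form the initial k x n in-core centroid matrix C [1].
    val inCoreC: Matrix = drmSampleKRows(drmA, k, replacement = false)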
Cluster assignments: here we go over the current generation of A and recompute the centroid index for each row in A. Once we recompute the index, we put it into the row key. You can do that by assigning centroid indices to the keys of A using the operator mapBlock() (details in [2], [3], [4]). You also need to broadcast C in order to be able to access it in an efficient manner inside the mapBlock() closure; examples of that are plentiful in [2]. Essentially, in mapBlock you reform the row keys to reflect the cluster index in C. While going over A, you have a "nearest neighbor" problem to solve between each row of A and the centroids C. This is the bulk of the computation, really, and there are a few tricks that can speed this step up in both exact and approximate manners, but you can start with a naive search; see the first sketch below.

Centroid recomputation: once you have assigned centroid indices to the keys of matrix A, you want to do an aggregating transpose of A to compute, essentially, the average of the rows of A grouped by centroid key. The trick is to compute (1|A)', which results in a matrix whose first row holds the counts of the cluster rows and whose remaining rows hold their sums. This is the part I find difficult to explain without LaTeX graphics. In Samsara, the construction of (1|A)' corresponds to the DRM expression (1 cbind A).t (again, see [2]). So when you compute, say, B = (1|A)', then B is (n+1) x k, and each column of B corresponds to a cluster 1..k: the first element is the number of points in the cluster, and the rest of the column is the sum of all its points. So in order to arrive at an updated matrix C, we need to collect B into memory and slice out the counts (the first row) from the rest of it. To compute C: take B(2:,:), divide each row elementwise by the counts in B(1,:), and transpose the result back to k x n (watch out for empty clusters with 0 elements; they cause NaNs in the newly computed C and a lack of convergence). This operation obviously uses subblocking and row-wise iteration over B, for which I am again making reference to [2]. A sketch of this step, and of the overall driver loop, follows after the assignment sketch below.
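For the assignment step, a hedged sketch following the broadcast-plus-mapBlock pattern shown in [2]. The helper name assignClusters and the use of Vector.getDistanceSquared for the naive exact search are my choices, not anything prescribed by the manual:

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // Rewrite the row keys of A with the index of the nearest centroid in C.
    // Assumes an implicit DistributedContext is in scope, as in [2].
    def assignClusters(drmA: DrmLike[Int], inCoreC: Matrix)
                      (implicit ctx: DistributedContext): DrmLike[Int] = {
      val bcastC = drmBroadcast(inCoreC) // efficient access to C inside the closure

      drmA.mapBlock() { case (keys, block) =>
        val c = bcastC.value
        val newKeys = new Array[Int](block.nrow)
        for (r <- 0 until block.nrow) {
          val row = block(r, ::)
          // Naive exact search: nearest centroid by squared Euclidean distance.
          var best = 0
          var bestDist = Double.PositiveInfinity
          for (ci <- 0 until c.nrow) {
            val d = row.getDistanceSquared(c(ci, ::))
            if (d < bestDist) { bestDist = d; best = ci }
          }
          newKeys(r) = best
        }
        newKeys -> block
      }
    }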
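For the update step, a sketch built on the (1 cbind A).t expression from the text. Note the code uses 0-based indexing where the prose uses 1-based slices; updateCentroids is again my name, and keeping the old centroid for an empty cluster is one simple way around the NaN issue mentioned above:

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // Recompute C from an A whose row keys already hold cluster indices 0..k-1.
    def updateCentroids(drmA: DrmLike[Int], oldC: Matrix): Matrix = {
      // B = (1|A)'. The transpose aggregates (sums) rows of (1|A) sharing a key,
      // so B is (n+1) x k: row 0 holds counts, rows 1..n hold per-cluster sums.
      // Depending on how nrow metadata is tracked after key reassignment, B may
      // come back with more than k columns; columns 0..k-1 are the ones we need.
      val inCoreB = (1 cbind drmA).t.collect

      val k = oldC.nrow
      val n = oldC.ncol
      val newC = oldC.like() // empty k x n matrix of the same type
      for (j <- 0 until k) {
        val col = inCoreB.viewColumn(j)
        val count = col(0)
        if (count > 0)
          newC(j, ::) := col.viewPart(1, n) / count // sum / count = cluster mean
        else
          newC(j, ::) := oldC(j, ::) // empty cluster: keep old centroid, avoid NaNs
      }
      newC
    }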
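Finally, continuing from the sketches above (same imports, and drmA, k, assignClusters, updateCentroids in scope), the alternation loop itself; maxEpochs and tol are placeholders standing in for "some test or an epoch count budget":

    // Alternate assignment and centroid update until the epoch budget is
    // exhausted or the centroids stop moving.
    val maxEpochs = 20 // placeholder epoch budget
    val tol = 1e-6     // placeholder movement tolerance

    var inCoreC: Matrix = drmSampleKRows(drmA, k, replacement = false)
    var moved = Double.PositiveInfinity
    var epoch = 0
    while (epoch < maxEpochs && moved > tol) {
      val drmAssigned = assignClusters(drmA, inCoreC)
      val nextC = updateCentroids(drmAssigned, inCoreC)
      moved = (nextC - inCoreC).norm // Frobenius norm of the centroid shift
      inCoreC = nextC
      epoch += 1
    }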
[1] sampling api:
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable:
http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again dated but largely viable for the purpose of this exercise:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html
[4] mapBlock etc.:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps

On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:

> @Dmitriy can you please again tell me the approach to move ahead.
>
> Thanks
> Parth Khatwani
>
> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
> > Yes, I am unable to figure out the way ahead.
> > Like how to create the augmented matrix A := (0|D) which you have
> > mentioned.
> >
> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> >> Was my reply to your post on @user a bit confusing?
> >>
> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
> >> h2016...@pilani.bits-pilani.ac.in> wrote:
> >>
> >> > Sir,
> >> > I am trying to write the k-means clustering algorithm using Mahout
> >> > Samsara, but I am a bit confused about how to leverage the
> >> > Distributed Row Matrix for the same. Can anybody help me with the
> >> > same?
> >> >
> >> > Thanks
> >> > Parth Khatwani