A Few Questions regarding Mahout
Hi everyone,

I've been talking with Trevor over email, and he shared some documents with me containing content that he (along with a few others) has been developing to make Mahout easily accessible to newbies like myself. I've gone through the planned blog posts titled "Why Mahout", "Getting Started with Mahout", "Algorithms Framework", and "Building Apache Mahout from Source", and I have to say, I have a lot of questions. Since Trevor is on vacation and the deadline for final proposal submission is fast approaching, I thought I'd post my questions on the dev forum. So here is my big list of questions. I hope those of you who were or are involved in the development of these blog posts will be able to help me. Some of the questions are vague or abstract; I suggest you answer them as if you were explaining to a layman.

1. Could you describe the high-level structure of Mahout?

2. What plans are in the pipeline for Mahout's development in the months to come?

3. How does contributing a new algorithm to Mahout work? When I was reading the doc "Getting Started with Mahout", the example implemented Ordinary Least Squares Regression in Samsara, Mahout's DSL. I had something different in mind before reading the blog posts: I had thought I would be contributing a distributed algorithm to Mahout from scratch, written in Scala, and making it available as a package that Mahout users can import and use.

4. In general, is the plan to contribute all future algorithms using Samsara only? If so, what would be the limitations and advantages of this decision?

5. What are the building blocks of Mahout that enable distributed processing? The blog post mentions the Distributed Row Matrix. Are there any other distributed data structures available? If not, won't the algorithms that can become part of the Mahout framework be limited to those that can be reduced to a linear algebra problem?

6. What is expected of a newbie in the community? What is the learning curve to become an active contributor to Mahout? Are there any specific books or blog posts I can read that would make the process easier?

7. Could you give me some background on how the development of Mahout has been going? Not the motivation or inspiration that led to Mahout's conception, but something like what work has gone on between the previous release and the current release candidate.

8. What was the high-level motivation for developing Samsara, Mahout's own DSL?

Regards,
Aditya
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
P.S. 1: This assumes row-wise construction of A based on a training set of m n-dimensional points.

P.S. 2: Since we are doing multiple passes over A, it may make sense to make sure it is committed to the Spark cache (by using the checkpoint API), if Spark is used.

On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov wrote:
> here is the outline. For details of APIs, please refer to the samsara
> manual [2]; I will not be repeating it. [...]
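The (1|A)' aggregating-transpose trick that the outline in this thread relies on can be checked with a small plain-Python analogue. This is illustrative only: none of these names are Mahout API. Rows of (1|A) that share a cluster key sum into one column of B, so B is (n+1) x k with counts in the first row and coordinate sums below; the new centroids are then B(2:,:) divided row-wise by B(1,:).

```python
def aggregated_transpose(keys, A, k):
    """Plain-Python analogue of the aggregating transpose of (1|A):
    rows of (1|A) with the same key are summed into one column of B.
    B is (n+1) x k: row 0 holds cluster counts (from the prepended
    column of 1s), rows 1..n hold per-cluster coordinate sums."""
    n = len(A[0])
    B = [[0.0] * k for _ in range(n + 1)]
    for key, row in zip(keys, A):
        B[0][key] += 1.0                 # the 1 aggregates into a count
        for j, x in enumerate(row):
            B[j + 1][key] += x           # coordinates aggregate into sums
    return B

def updated_centroids(B):
    # C <- B(2:,:) with each row divided elementwise by B(1,:)
    # (1-based slicing as in the thread; 0-based here).
    # Guard empty clusters, which would otherwise divide by zero.
    counts, sums = B[0], B[1:]
    k = len(counts)
    return [[(row[c] / counts[c]) if counts[c] else float("nan") for row in sums]
            for c in range(k)]
```

For example, for A = [[0, 0], [0, 2], [10, 10], [10, 12]] with keys [0, 0, 1, 1], B comes out with counts 2 and 2 in its first row, and the updated centroids are (0, 1) and (10, 11).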
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
here is the outline. For details of APIs, please refer to the samsara manual [2]; I will not be repeating it.

Assume your training data input is an m x n matrix A. For simplicity, let's assume it's a DRM with int row keys, i.e., DrmLike[Int].

Initialization:

First, classic k-means starts by selecting initial clusters by sampling them out. You can do that by using the sampling api [1], thus forming a k x n in-memory matrix C (current centroids). C is therefore of Mahout's Matrix type.

You then proceed by alternating between cluster assignments and recomputing the centroid matrix C till convergence, based on some test or simply limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and recompute the centroid index for each row in A. Once we recompute an index, we put it into the row key. You can do that by assigning centroid indices to keys of A using the operator mapBlock() (details in [2], [3], [4]). You also need to broadcast C in order to be able to access it in an efficient manner inside the mapBlock() closure. Examples of that are plenty in [2]. Essentially, in mapBlock, you'd reform the row keys to reflect the cluster index in C. While going over A, you'd have a "nearest neighbor" problem to solve for each row of A against the centroids C. This is the bulk of the computation really, and there are a few tricks that can speed this step up in both exact and approximate manner, but you can start with a naive search.

Centroid recomputation:

Once you have assigned centroids to the keys of matrix A, you'd want to do an aggregating transpose of A to compute, essentially, the average of the rows of A grouped by the centroid key. The trick is to do a computation of (1|A)', which results in a matrix of the shape (counts / sums of cluster rows). This is the part I find difficult to explain without latex graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression

(1 cbind A).t (again, see [2]).

So when you compute, say,

B = (1 | A)',

then B is (n+1) x k, so each column contains a vector corresponding to a cluster 1..k. In such a column, the first element is the # of points in the cluster, and the rest corresponds to the sum of all its points. So in order to arrive at an updated matrix C, we need to collect B into memory and slice out the counters (first row) from the rest of it.

So, to compute C:

C <- B(2:,:), each row divided by B(1,:)

(watch out for empty clusters with 0 elements; these will cause lack of convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B, for which I am again making reference to [2].

[1] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149
[2] Samsara manual, a bit dated but viable: http://apache.github.io/mahout/doc/ScalaSparkBindings.html
[3] scaladoc, again dated but largely viable for the purpose of this exercise: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html
[4] mapBlock etc.: http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps

On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:
> @Dmitriy can you please again tell me the approach to move ahead. [...]
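For readers who want to see the whole loop end to end, here is a minimal plain-Python sketch of the algorithm outlined above. This is illustrative only: none of these names are Mahout API. In Samsara, the assignment step would run inside mapBlock() over the DRM with C broadcast, and the count/sum aggregation would be expressed as (1 cbind A).t rather than an explicit loop.

```python
import random

def dist2(p, q):
    # squared Euclidean distance between two n-dimensional points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    # naive nearest-neighbor search over the k centroids
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))

def kmeans(points, k, max_epochs=20, seed=0):
    # initialization: sample k initial centroids from the data
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    n = len(points[0])
    for _ in range(max_epochs):  # epoch count budget
        # assignment step: recompute the centroid index ("row key") per point
        keys = [nearest(p, centroids) for p in points]
        # recomputation step: per-cluster counts and coordinate sums
        # (the role played by the aggregating transpose of (1|A) in Samsara)
        counts = [0] * k
        sums = [[0.0] * n for _ in range(k)]
        for key, p in zip(keys, points):
            counts[key] += 1
            for j, x in enumerate(p):
                sums[key][j] += x
        new_centroids = []
        for c in range(k):
            if counts[c] == 0:
                # empty-cluster guard: keep the old centroid rather than divide by 0
                new_centroids.append(centroids[c])
            else:
                new_centroids.append([s / counts[c] for s in sums[c]])
        if new_centroids == centroids:  # convergence test
            break
        centroids = new_centroids
    return centroids
```

On a toy 1-D input such as [[0.0], [1.0], [10.0], [11.0]] with k = 2, this converges to centroids 0.5 and 10.5 regardless of which pair of points is sampled initially.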
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
@Dmitriy, can you please tell me the approach to move ahead again?

Thanks
Parth Khatwani

On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:
> yes i am unable to figure out the way ahead. [...]
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Yes, I am unable to figure out the way ahead, like how to create the augmented matrix A := (0|D) which you have mentioned.

On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov wrote:
> was my reply to your post on @user a bit confusing? [...]
Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Was my reply to your post on @user a bit confusing?

On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT <
h2016...@pilani.bits-pilani.ac.in> wrote:
> Sir,
> I am trying to write the k-means clustering algorithm using Mahout Samsara,
> but I am a bit confused about how to leverage the Distributed Row Matrix
> for it. [...]
Trying to write the KMeans Clustering Using "Apache Mahout Samsara"
Sir,

I am trying to write the k-means clustering algorithm using Mahout Samsara, but I am a bit confused about how to leverage the Distributed Row Matrix for it. Can anybody help me with this?

Thanks
Parth Khatwani