Few Questions regarding Mahout

2017-03-31 Thread Aditya
Hi everyone,

I've been talking with Trevor over email, and he shared some documents with
me. They contained content that he (along with a few others) has been
developing to make Mahout easily accessible to newbies like myself.

I've gone through the planned blog posts titled "Why Mahout", "Getting
Started with Mahout", "Algorithms Framework" and "Building Apache Mahout
from Source", and I have to say, I've got a lot of questions. Since Trevor
is on vacation and the deadline for final proposal submission is fast
approaching, I thought I'd post my questions on the dev forum.

So here goes the big list of my questions. I hope those of you who were or
are involved in the development of these blog posts will be able to help
me. Some of the questions are vague or abstract; I suggest you answer them
as if you're explaining to a layman.

1. Could you explain the high-level structure of Mahout to me?

2. What plans are in the pipeline for Mahout's development in the months to
come?

3. How does contributing a new algorithm to Mahout work? When I was
reading the doc "Getting Started with Mahout", the example implemented
Ordinary Least Squares Regression in Samsara, Mahout's DSL.
I had something different in mind before reading the blog posts: I had
thought that I would be contributing a distributed algorithm to Mahout
from scratch, written in Scala, and making it available as a package
(which users can import and use) to users of Mahout.

4. In general, is there a plan to contribute future algorithms using
Samsara only? If so, what will be the limitations and advantages of this
decision? That is, is there a plan to write all of the algorithms that
will become part of Mahout in the future in Samsara?

5. What are the building blocks of Mahout that enable distributed
processing? The blog post mentions the Distributed Row Matrix. Are there
any other distributed data structures available? If not, won't the
algorithms that can become part of the Mahout framework in the future be
limited, i.e., excluding algorithms that cannot be reduced to a linear
algebra problem?

6. What is expected of a newbie in the community? How steep is the learning
curve to becoming an active contributor to Mahout? Are there any specific
books / blog posts I can read that will make the process easier?

7. Also, could you give me some background on how the development of
Mahout has been going? Not the motivation / inspiration that led to
Mahout's conception, but something like what work has gone on between the
previous release and the current release candidate.

8. What was the high-level motivation for developing Mahout's own DSL,
Samsara?

Regards,
Aditya


Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread Dmitriy Lyubimov
ps1: this assumes row-wise construction of A based on a training set of m
n-dimensional points.
ps2: since we are doing multiple passes over A, it may make sense to make
sure it is committed to the Spark cache (by using the checkpoint API), if
Spark is used.



Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread Dmitriy Lyubimov
Here is the outline. For details of the APIs, please refer to the Samsara
manual [2]; I will not be repeating it here.

Assume your training data input is an m x n matrix A. For simplicity, let's
assume it's a DRM with int row keys, i.e., DrmLike[Int].

Initialization:

First, classic k-means starts by selecting initial clusters by sampling
them out. You can do that by using the sampling API [1], thus forming a
k x n in-memory matrix C (the current centroids). C is therefore of
Mahout's Matrix type.

You then proceed by alternating between cluster assignment and recomputing
the centroid matrix C until convergence, based on some test or simply
limited by an epoch count budget, your choice.

Cluster assignments: here, we go over the current generation of A and
recompute the centroid index for each row in A. Once we recompute an index,
we put it into the row key. You can do that by assigning centroid indices
to the keys of A using the operator mapblock() (details in [2], [3], [4]).
You also need to broadcast C in order to be able to access it in an
efficient manner inside the mapblock() closure. Plenty of examples of that
are given in [2]. Essentially, in mapblock, you'd reform the row keys to
reflect the cluster index in C. While going over A, you'd have a "nearest
neighbor" problem to solve for each row of A against the centroids C. This
is the bulk of the computation really, and there are a few tricks that can
speed this step up in both exact and approximate manners, but you can start
with a naive search.
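The naive nearest-centroid search can be sketched in plain Scala on toy
in-memory data (all values here are hypothetical); inside Mahout, the same
per-row logic would run in the mapblock() closure against the broadcast C:

```scala
// Squared Euclidean distance between two points.
def dist2(p: Array[Double], q: Array[Double]): Double =
  p.indices.map { i => val d = p(i) - q(i); d * d }.sum

// Index of the nearest centroid to a given row -- the naive search
// that the cluster-assignment step performs for every row of A.
def closestCentroid(row: Array[Double],
                    centroids: Array[Array[Double]]): Int =
  centroids.indices.minBy(j => dist2(row, centroids(j)))

// Hypothetical toy centroids and point.
val c = Array(Array(0.0, 0.0), Array(10.0, 10.0))
val idx = closestCentroid(Array(9.0, 8.5), c)  // nearest is centroid 1
```

In the real mapblock() closure, you would loop over the rows of each block
of A and write the returned index into the block's row keys.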

Centroid recomputation:
once you have assigned centroids to the keys of matrix A, you'd want to do
an aggregating transpose of A to compute, essentially, the average of the
rows of A grouped by the centroid key. The trick is to do a computation of
(1|A)', which results in a matrix of the shape (counts / sums of cluster
rows). This is the part I find difficult to explain without LaTeX graphics.

In Samsara, construction of (1|A)' corresponds to the DRM expression

(1 cbind A).t (again, see [2]).

So when you compute, say,

B = (1 | A)',

then B is (n+1) x k, so each column contains a vector corresponding to a
cluster 1..k. In such a column, the first element is the number of points
in the cluster, and the rest corresponds to the sum of all its points. So
in order to arrive at an updated matrix C, we need to collect B into memory
and slice out the counters (the first row) from the rest of it.

So, to compute C:

C <- B (2:,:) each row divided by B(1,:)

(watch out for empty clusters with 0 elements; these will cause lack of
convergence and NaNs in the newly computed C).

This operation obviously uses subblocking and row-wise iteration over B,
for which I am again referring you to [2].
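The (1|A)' trick and the centroid update can be illustrated with plain
in-memory Scala arrays standing in for the DRM (the data and names below
are a hypothetical toy example; in Samsara the aggregating transpose of
(1 cbind A) produces B distributedly when the row keys hold cluster
indices):

```scala
// Toy data: rows of A with their assigned cluster keys (hypothetical).
val a = Array(
  Array(1.0, 2.0),   // assigned to cluster 0
  Array(3.0, 4.0),   // assigned to cluster 0
  Array(10.0, 10.0)  // assigned to cluster 1
)
val keys = Array(0, 0, 1)
val k = 2
val n = a(0).length

// B is (n+1) x k: row 0 carries per-cluster counts, rows 1..n carry
// per-cluster coordinate sums -- exactly what the aggregating transpose
// of (1 cbind A) yields when the row keys are cluster indices.
val b = Array.ofDim[Double](n + 1, k)
for (i <- a.indices) {
  b(0)(keys(i)) += 1.0
  for (j <- 0 until n) b(j + 1)(keys(i)) += a(i)(j)
}

// C <- B(2:,:) with each row divided element-wise by the counts row
// B(1,:), guarding against empty clusters (which would produce NaNs).
// cNew is k x n: one centroid per row.
val cNew = Array.tabulate(k, n) { (cl, j) =>
  if (b(0)(cl) == 0.0) 0.0 else b(j + 1)(cl) / b(0)(cl)
}
// cluster 0 centroid: (2.0, 3.0); cluster 1 centroid: (10.0, 10.0)
```

The guard for empty clusters here simply zeroes the centroid; in practice
you may prefer to re-sample a new centroid instead, as hinted at in the
caveat above.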


[1]
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L149

[2] Samsara manual, a bit dated but viable,
http://apache.github.io/mahout/doc/ScalaSparkBindings.html

[3] scaladoc, again, dated but largely viable for the purpose of this
exercise:
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm

[4] mapblock etc.
http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps



Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread KHATWANI PARTH BHARAT
@Dmitriy, can you please tell me again the approach to move ahead?


Thanks
Parth Khatwani




Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread KHATWANI PARTH BHARAT
Yes, I am unable to figure out the way ahead, e.g., how to create the
augmented matrix A := (0|D) which you have mentioned.




Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread Dmitriy Lyubimov
Was my reply to your post on @user a bit confusing?


Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-03-31 Thread KHATWANI PARTH BHARAT
Sir,
I am trying to write the k-means clustering algorithm using Mahout Samsara,
but I am a bit confused about how to leverage the Distributed Row Matrix
for the same. Can anybody help me with this?

Thanks
Parth Khatwani