Re: Samsara's learning curve

2017-03-27 Thread Dmitriy Lyubimov
I believe writing in the DSL is simple enough, especially if you have some
familiarity with Scala on top of R (or, in my case, R on top of Scala,
perhaps :). I've implemented about a couple dozen customized algorithms that
used the distributed Samsara algebra at least to some degree, and I can
reliably attest that none of them ever exceeded 100 lines or so, and that it
significantly reduced the time I spend writing algebra on top of Spark
and some other backends I use in proprietary settings. I am now mostly
doing non-algebraic improvements, because writing the algebra is the easy part.
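
Just to give a flavor of what "writing algebra" means here, a minimal sketch
(not from the book; drmX and drmY are placeholder names for a feature matrix
and a target column already loaded as DRMs) of closed-form ordinary least
squares in the DSL:

    // the usual Samsara imports
    import org.apache.mahout.math._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // drmX: distributed n x m feature matrix, drmY: distributed n x 1 target column;
    // an implicit distributed context is assumed to be in scope
    val mxXtX = (drmX.t %*% drmX).collect   // m x m normal-equations matrix, brought in-core
    val mxXtY = (drmX.t %*% drmY).collect   // m x 1 right-hand side
    val beta  = solve(mxXtX, mxXtY)         // small in-core solve; beta is m x 1

That is really the whole distributed part of it.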

The most difficult part, however, at least for me (and as you will see as you
go along with the book), was not the peculiarities of the R-like bindings but
the algorithm reformulations. Traditional "in-memory" algorithms do not
work on shared-nothing backends: even though you could program them, they
simply will not perform.

The main reasons some of the traditional algorithms do not work at scale
are that they either require random memory access or (more often) are
simply super-linear w.r.t. input size, so as one scales infrastructure at
linear cost, one still gets a smaller-than-expected increment in
performance per unit of input (if any at all, at some point).

Hence, some mathematically, or should I say statistically, motivated tricks
are usually still required. As the book describes, linearly or sub-linearly
scalable sketches, random projections, dimensionality reductions, etc. are
needed to alleviate the scalability issues of the super-linear algorithms.
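
For instance, the random-projection part of such tricks is itself just a line
or two of DSL code (a sketch; drmA, d and k are placeholders, and I use the
deterministic "virtual" random matrix helper from mahout-math, if I recall the
helper's name right):

    import org.apache.mahout.math.Matrices
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // drmA: distributed n x d matrix; omega: in-core d x k random projection matrix
    val omega = Matrices.symmetricUniformView(d, k, 1234)
    val drmB  = drmA %*% omega   // n x k distributed sketch of A, computed lazily

The statistical argument for why such a sketch preserves enough structure is
the hard part, not the code.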

To your question, I have had a couple of people do pieces of various
projects with Samsara before, but they had me as a coworker. I am
personally not aware of any outside developers beyond the people already on the
project @ Apache and my co-workers, although in all honesty I feel that has
more to do with the maturity and modest marketing of the public version of
Samsara than with any difficulty of adoption.

-d



On Sun, Mar 26, 2017 at 9:15 AM, Gustavo Frederico <
gustavo.freder...@thinkwrap.com> wrote:

> I read Lyubimov's and Palumbo's book on Mahout Samsara up to chapter 4
> ( Distributed Algebra ). I have some familiarity with R, I did study
> linear algebra and calculus in undergrad. In my master's I studied
> statistical pattern recognition and researched a number of ML
> algorithms in my thesis - spending more time on SVMs. This is to ask:
> what is the learning curve of Samsara? How complicated is it to work with
> distributed algebra to create an algorithm? Can someone share an
> example of how long she/he took to go from algorithm conception to
> implementation?
>
> Thanks
>
> Gustavo
>


Re: Samsara's learning curve

2017-03-27 Thread Trevor Grant
I tend to agree with D.

For example, I set out to do the 'Eigenfaces problem' last year and wrote
a blog post on it.  It ended up being about 4 lines of Samsara code (+ imports);
the "hardest" part was loading images into vectors, and then vectors back
into images (it wasn't awful, but I was new to Scala).

In addition to the modest marketing and the lack of introductory tutorials,
another issue is that to really use Mahout-Samsara in the first place you need
a fairly good grasp of linear algebra, which gives it significantly less mass
appeal than, say, mllib/sklearn/etc. Your
I-just-got-my-data-science-certificate-from-Coursera data scientists simply
aren't equipped to use Mahout.  Your advanced-R-type data scientists can
use it, but unless they have a problem that is too big for a single machine,
they have no motivation to (this may change with native solvers, more
algorithms, etc.), and even given motivation the question then becomes: learn
Mahout, OR come up with a clever trick to stay on a single machine.

But yeah, it's a fairly easy and pleasant framework.  If you have the proper
motivation, there is simply nothing else like it.

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*




Re: Samsara's learning curve

2017-03-29 Thread Pat Ferrel
While I agree with D and T, I’ll add a few things to watch out for.

One of the hardest things to learn is the new model of execution; it's not
quite Spark or any other compute engine. You need to create contexts that
virtualize the actual compute engine, but you will probably need to use the
actual compute engine too. Switching back and forth is fairly simple, but it
must be learned and could be documented better.

The other missing bit is dataframes. R and Spark have them in different forms,
but Mahout largely ignores the issue of real-world object ids. Again, not very
hard to work around, and here's hoping support is added in a future rev.





Re: Samsara's learning curve

2017-03-29 Thread Dmitriy Lyubimov
On Wed, Mar 29, 2017 at 9:26 AM, Pat Ferrel  wrote:

>
> The other missing bit is dataframes. R and Spark have them in different
> forms, but Mahout largely ignores the issue of real-world object ids.


Mahout only supports matrices and vectors, not data frames.

Data frames imply a mix of various types of data that has yet to be converted
to numerical data before it can be consumed by an algebraic algorithm (in R
this is usually done via a formula). Unfortunately Mahout has no extension for
formulas. In practice, native data frames (e.g., Spark data frames
specifically) usually work reasonably well for vectorization of non-numerical data.
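
E.g., a rough sketch of going from a Spark data frame to a Mahout distributed
matrix (the column names here are made up; drmWrap being the Spark-bindings
entry point for wrapping an RDD of (key, vector) pairs, if memory serves):

    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.sparkbindings._

    // df: a Spark DataFrame, assumed to exist, with a String "id" column
    // and already-numeric columns f1, f2, f3
    val rdd = df.rdd.map { row =>
      val vec: Vector = dvec(row.getAs[Double]("f1"),
                             row.getAs[Double]("f2"),
                             row.getAs[Double]("f3"))
      row.getAs[String]("id") -> vec
    }
    val drmA = drmWrap(rdd, ncol = 3)   // String-keyed DRM ready for the algebra DSL

The categorical-to-numeric encoding itself (one-hot and so on) is what you'd
do on the data frame side before this step.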

Distributed matrices indeed do not support column labels, and row labels are
only quasi-supported, in the sense that for transposition purposes a row label
behaves like an unordered row index; i.e., one can either have row labels with
limited transposition semantics, or integer labels that are interpreted as
column indices for transposition purposes, but not both.

Another option is to use Mahout NamedVectors for row labeling, but that is not
supported consistently across the elementary solvers.




Re: Samsara's learning curve

2017-03-29 Thread Dmitriy Lyubimov
On Wed, Mar 29, 2017 at 9:26 AM, Pat Ferrel  wrote:

> While I agree with D and T, I’ll add a few things to watch out for.
>
> One of the hardest things to learn is the new model of execution; it's not
> quite Spark or any other compute engine. You need to create contexts that
> virtualize the actual compute engine, but you will probably need to use the
> actual compute engine too. Switching back and forth is fairly simple, but it
> must be learned and could be documented better.


Mahout indeed abstracts the native engine's context by wrapping it in a
DistributedMahoutContext. This is done largely so that algebraic expressions
can be completely backend-agnostic.

Obtaining the native engine context is easy, although at that point the code
acquires native engine dependencies and is no longer backend-agnostic. E.g.,
the code to unwrap the Spark context from a Mahout context (dc) is

val sparkContext = dc.asInstanceOf[SparkDistributedMahoutContext].sc

i.e., we simply cast the abstract context to the concrete implementation
expected for the engine, at which point backend-specific structures
such as SparkContext are readily available.
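
Creating the wrapped context in the first place, for the Spark backend, is a
one-liner too (a sketch; mahoutSparkContext is the factory in the Spark
bindings, if I recall the name right):

    import org.apache.mahout.sparkbindings._

    // creates (or wraps) a SparkContext and returns the Mahout distributed context;
    // declaring it implicit lets the algebra DSL pick it up automatically
    implicit val mahoutCtx = mahoutSparkContext(masterUrl = "local[4]", appName = "samsara-demo")

    // all backend-agnostic DSL algebra runs against this implicit context;
    // the cast shown above is only needed when you want the raw SparkContext back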


Re: Samsara's learning curve

2017-03-29 Thread Dmitriy Lyubimov
One more word on row labels.

It seems the historical DRM interpretation of row keys (as indexes vs. labels)
has been a bit unfortunate.

But in the end it turned out to have some strange synergy. E.g., if you
compute a big SVD,

val (U, V, s) = dssvd(A, ...)

then it doesn't matter whether the rows of A are labeled by strings or by
their ordinal Int indices; it is all transparent to the underlying pipeline.
All it means is that matrix U will have keys of the same type and with the
same semantics as the keys of A (e.g., either document labels of String type,
or a matrix row index of Int type). Moreover, not only is dssvd's user-facing
API oblivious to the key type of A, it turns out its implementation is
oblivious to the true semantics of A's row keys as well.
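
Concretely, something like this (a sketch; drmA here stands for, say, a
document-term DRM keyed by String document ids):

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.decompositions._

    // drmA: DrmLike[String] -- rows keyed by document label
    val (drmU, drmV, s) = dssvd(drmA, k = 50, p = 15, q = 1)

    // drmU comes back as DrmLike[String]: it carries A's row keys through unchanged;
    // drmV is keyed by the ordinal column index (Int), and s is an in-core vector
    // of the k singular values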

This mostly comes down to the simple notion that the self-square A'A is
logically oblivious to the row index type as well, and that any matrix A
inside an optimization plan can actually be formed as A' if needed, as long as
it doesn't hit an optimization barrier (i.e., is not collected or saved).




Re: Samsara's learning curve

2017-06-05 Thread Trevor Grant
Fwiw-

I think I'm about 10 hours into multilayer perceptrons, with maybe another 2 to
go for docs and the last unit tests.  It could have been quicker, but I already
have follow-on things I want to do, and I am building it so that it will be
easily extendable (to LSTMs, convolutional nets, etc.). If I had taken some
shortcuts it could probably have been done in 5-7, and a large part of that
time was remembering how back-propagation works and getting lost in my own
indices.



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*

