Re: Samsara's learning curve
I believe writing in the DSL is simple enough, especially if you have some familiarity with Scala on top of R (or, in my case, R on top of Scala, perhaps :). I've implemented about a couple dozen customized algorithms that used distributed Samsara algebra at least to some degree, and I think I can reliably attest that none of them ever exceeded 100 lines or so, and that it significantly reduced the time I dedicated to writing algebra on top of Spark and some other backends I use in proprietary settings. I am now mostly doing non-algebraic improvements because writing algebra is easy.

The most difficult part, however, at least for me, and as you will see as you go along with the book, was not the peculiarities of the R-like bindings but the algorithm reformulations. Traditional "in-memory" algorithms do not work on shared-nothing backends; even though you could program them, they simply will not perform.

The main reason some of the traditional algorithms do not work at scale is that they either require random memory access or (more often) are simply super-linear w.r.t. input size, so as one scales infrastructure at linear cost, one still incurs a less-than-expected increment in performance per unit of input (if any at all, at some point).

Hence, some mathematically, or should I say statistically, motivated tricks are usually still required. As the book describes, linearly or sub-linearly scalable sketches, random projections, dimensionality reductions, etc. are required to alleviate the scalability issues of the super-linear algorithms.

To your question, I've had a couple of people doing some pieces on various projects with Samsara before, but they had me as a coworker. I am personally not aware of any outside developers beyond people already on the project @ Apache and my co-workers, although in all honesty I feel it has more to do with the maturity and modest marketing of the public version of Samsara than necessarily the difficulty of adoption.
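To give one concrete example of such a trick, a Gaussian random projection reduces dimensionality in a single linear pass over the rows, so it scales linearly with input size. A minimal self-contained sketch in plain Scala (no Mahout dependencies; the sizes are made up):

```scala
import scala.util.Random

object RandomProjection {
  // Project n x d row-major data down to k dimensions with a Gaussian
  // random matrix scaled by 1/sqrt(k) (Johnson-Lindenstrauss style).
  def project(data: Array[Array[Double]], k: Int, seed: Long = 42L): Array[Array[Double]] = {
    val d = data.head.length
    val rnd = new Random(seed)
    // d x k projection matrix; entries ~ N(0, 1) / sqrt(k)
    val proj = Array.fill(d, k)(rnd.nextGaussian() / math.sqrt(k))
    // One pass over the rows; each row is handled independently,
    // which is exactly what a shared-nothing backend wants.
    data.map { row =>
      Array.tabulate(k) { j =>
        var s = 0.0
        var i = 0
        while (i < d) { s += row(i) * proj(i)(j); i += 1 }
        s
      }
    }
  }
}
```

Because each row is projected independently, the map trivially distributes; norms (and hence pairwise distances) are approximately preserved for moderate k.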
-d

On Sun, Mar 26, 2017 at 9:15 AM, Gustavo Frederico <gustavo.freder...@thinkwrap.com> wrote:
> I read Lyubimov's and Palumbo's book on Mahout Samsara up to chapter 4
> (Distributed Algebra). I have some familiarity with R; I did study
> linear algebra and calculus in undergrad. In my master's I studied
> statistical pattern recognition and researched a number of ML
> algorithms in my thesis, spending more time on SVMs. This is to ask:
> what is the learning curve of Samsara? How complicated is it to work with
> distributed algebra to create an algorithm? Can someone share an
> example of how long she/he took to go from algorithm conception to
> implementation?
>
> Thanks
>
> Gustavo
Re: Samsara's learning curve
I tend to agree with D.

For example, I set out to do the 'Eigenfaces problem' last year, and wrote a blog on it. It ended up being about 4 lines of Samsara code (+ imports); the "hardest" part was loading images into vectors, and then vectors back into images (wasn't awful, but I was new to Scala).

In addition to the modest marketing and the lack of introductory tutorials, there is the fact that to really use Mahout-Samsara in the first place you need to have a fairly good grasp of linear algebra, which gives it significantly less mass appeal than, say, mllib/sklearn/etc. Your I-just-got-my-data-science-certificate-from-coursera data scientists simply aren't equipped to use Mahout. Your advanced-R-type data scientists can use it, but unless they have a problem that is too big for a single machine, they have no motivation to use it (that may change with native solvers, more algorithms, etc.), and even given motivation the question then becomes: learn Mahout OR come up with a clever trick for being able to stay on a single machine.

But yeah, a fairly easy and pleasant framework. If you have the proper motivation, there is simply nothing else like it.

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Mon, Mar 27, 2017 at 12:32 PM, Dmitriy Lyubimov wrote:
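PS: the image wrangling I mentioned really just amounts to flattening pixel rows into one long vector and reshaping back; something like this in plain Scala (a toy 2-D array standing in for a grayscale image, no Mahout here):

```scala
object ImageVec {
  // Flatten a grayscale image (rows x cols of pixel intensities)
  // into one long row-major vector.
  def toVector(img: Array[Array[Double]]): Array[Double] = img.flatten

  // Reshape a flat vector back into a rows x cols image.
  def toImage(v: Array[Double], rows: Int, cols: Int): Array[Array[Double]] = {
    require(v.length == rows * cols, "vector length must equal rows * cols")
    Array.tabulate(rows, cols)((r, c) => v(r * cols + c))
  }
}
```

The round trip is lossless; the only thing to keep straight is the row-major index arithmetic.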
Re: Samsara's learning curve
While I agree with D and T, I’ll add a few things to watch out for.

One of the hardest things to learn is the new model of execution; it’s not quite Spark or any other compute engine. You need to create contexts that have virtualized the actual compute engine. But you will probably need to use the actual compute engine too. Switching back and forth is fairly simple but must be learned and could be documented better.

The other missing bit is dataframes. R and Spark have them in different forms, but Mahout largely ignores the issue of real-world object ids. Again, not very hard to work around, and here’s hoping it's added in a future rev.

On Mar 27, 2017, at 1:38 PM, Trevor Grant wrote:
Re: Samsara's learning curve
On Wed, Mar 29, 2017 at 9:26 AM, Pat Ferrel wrote:
>
> The other missing bit is dataframes. R and Spark have them in different
> forms but Mahout largely ignores the issue of real world object ids.

Mahout only supports matrices and vectors, not data frames.

Data frames imply a mix of various types of data which has yet to be converted to numerical data before it can be consumed by an algebraic algorithm (in R this is usually done via a formula). Unfortunately, Mahout has no extension for formulas. That said, native data frames (e.g., Spark data frames specifically) usually work reasonably well for vectorization of non-numerical data.

Distributed matrices indeed do not support column labels, and row labels are quasi-supported, meaning they share their label nature with an unordered row index for transposition purposes; i.e., one can either have row labels and limited transposition semantics, or one can have integer labels interpreted as a column index for transposition purposes, but not both.

Another way is to use Mahout NamedVectors for the purposes of row labeling, but this is not supported consistently in any given elementary solver.
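To make "vectorization" concrete, here is a minimal one-hot encoding sketch in plain Scala (no Spark or Mahout involved; the column layout and names are made up):

```scala
object Vectorize {
  // One-hot encode a categorical column and append it to a numeric column,
  // turning mixed-type "data frame" rows into purely numeric vectors.
  // Returns the vectors plus the generated column names.
  def encode(rows: Seq[(Double, String)]): (Seq[Array[Double]], Seq[String]) = {
    // Stable dictionary of observed category levels.
    val levels = rows.map(_._2).distinct.sorted
    val index = levels.zipWithIndex.toMap
    val vecs = rows.map { case (x, cat) =>
      val v = Array.fill(1 + levels.size)(0.0)
      v(0) = x                 // numeric column passes through
      v(1 + index(cat)) = 1.0  // one-hot slot for the category
      v
    }
    (vecs, "x" +: levels.map("cat=" + _))
  }
}
```

A real pipeline would build the level dictionary in one distributed pass and broadcast it, but the shape of the result is the same.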
Re: Samsara's learning curve
On Wed, Mar 29, 2017 at 9:26 AM, Pat Ferrel wrote:
> While I agree with D and T, I’ll add a few things to watch out for.
>
> One of the hardest things to learn is the new model of execution, it’s not
> quite Spark or any other compute engine. You need to create contexts that
> have virtualized the actual compute engine. But you will probably need to
> use the actual compute engine too. Switching back and forth is fairly
> simple but must be learned and could be documented better.

Mahout indeed abstracts the native engine's context by wrapping it into a DistributedMahoutContext. This is done largely to enable algebraic expressions to be completely backend-agnostic. Obtaining the native engine context is easy, although at that point the code acquires native engine dependencies and is no longer backend-agnostic. E.g., the code to unwrap the Spark context from a Mahout context (dc) is

    val sparkContext = dc.asInstanceOf[SparkDistributedMahoutContext].sc

i.e., we simply cast the abstract context to the concrete expected engine's implementation of it, at which point backend-specific structures such as SparkContext are readily available.
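The pattern is just an ordinary downcast from an engine-agnostic trait to an engine-specific implementation; a self-contained sketch of the same idea (toy stand-in classes, not the actual Mahout or Spark types):

```scala
// Engine-agnostic context, as the algebra layer would see it.
trait DistributedContext

// Toy stand-in for a native engine handle such as SparkContext.
final class FakeSparkContext(val appName: String)

// Engine-specific context wrapping the native handle.
final class SparkLikeContext(val sc: FakeSparkContext) extends DistributedContext

object ContextDemo {
  // Backend-agnostic code holds only a DistributedContext...
  def unwrap(dc: DistributedContext): FakeSparkContext =
    // ...until it deliberately opts into engine specifics via a cast,
    // at which point it is no longer backend-agnostic.
    dc.asInstanceOf[SparkLikeContext].sc
}
```

The cast fails at runtime if the program is handed a different backend's context, which is exactly the portability you give up by unwrapping.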
Re: Samsara's learning curve
One more word on row labels. It seems like the historical DRM interpretation of row keys (as indexes vs. labels) has been a bit unfortunate, but in the end it turned out it often has some strange synergy. E.g., if you compute a big SVD,

    val (U, V, s) = dssvd(A, ...)

then it doesn't matter whether the rows of A are labeled by strings or by their ordinal Int indices; it is all transparent to the underlying pipeline. All it means is that matrix U will have the same type of keys and the same semantics as the keys of A (either, e.g., document labels of a String type, or a matrix row index of Int type). Moreover, not only is dssvd's user-facing API oblivious to the key type of A, it turns out its implementation is oblivious to the true semantics of the row keys of A as well. This mostly comes down to the simple notion that the self-square A'A is logically oblivious to the row index type as well, and that any matrix A inside an optimization plan can actually be formed as A' if needed, as long as it doesn't meet an optimization barrier (i.e., is collected or saved).

On Wed, Mar 29, 2017 at 9:37 AM, Dmitriy Lyubimov wrote:
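PS: the key-type obliviousness is essentially just parametric polymorphism over the row key. A toy self-contained sketch (a made-up Drm[K] stand-in, not Mahout's actual DRM classes):

```scala
// Toy distributed-row-matrix stand-in: row key of type K -> dense row.
final case class Drm[K](rows: Map[K, Array[Double]])

object Algebra {
  // The self-square A'A collapses the row dimension, so the result no
  // longer mentions K at all: the computation is oblivious to whether
  // keys are Int ordinals or String document labels.
  def ata[K](a: Drm[K]): Array[Array[Double]] = {
    val n = a.rows.values.head.length
    val out = Array.fill(n, n)(0.0)
    for (row <- a.rows.values; i <- 0 until n; j <- 0 until n)
      out(i)(j) += row(i) * row(j)
    out
  }
}
```

Because K appears only in the type signature, an Int-keyed and a String-keyed matrix with the same row contents produce identical A'A.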
Re: Samsara's learning curve
Fwiw, I think I'm about 10 hours into multi-layer perceptrons, with maybe another 2 to go for docs and the last unit tests. Could have been quicker, but I already have follow-on things I want to do, and am building them so that it will be easily extendable (to LSTMs, convolutional nets, etc.). If I had taken some shortcuts, it could have been done in probably 5-7, and a large part of that was remembering how back-propagation works, and getting lost in my own indices.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Wed, Mar 29, 2017 at 11:26 AM, Pat Ferrel wrote:
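PS: for anyone curious where the indices bite, here is a toy single-sample, single-hidden-layer back-propagation step in plain Scala (layer sizes and learning rate made up; nothing Mahout-specific):

```scala
object TinyBackprop {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // One gradient-descent step for a 1-hidden-layer net on a single
  // (x, y) pair with squared error; weights are updated in place.
  // Index bookkeeping: w1(j)(i) connects input i -> hidden j,
  //                    w2(j)    connects hidden j -> the single output.
  def step(x: Array[Double], y: Double,
           w1: Array[Array[Double]], w2: Array[Double], lr: Double): Double = {
    val h = w1.map(row => sigmoid(row.zip(x).map { case (w, xi) => w * xi }.sum))
    val out = sigmoid(w2.zip(h).map { case (w, hi) => w * hi }.sum)
    val err = out - y
    // delta at the output: dE/d(net_out) = (out - y) * out * (1 - out)
    val dOut = err * out * (1 - out)
    // deltas at the hidden layer: back through w2, times sigmoid'
    val dHid = h.indices.map(j => dOut * w2(j) * h(j) * (1 - h(j)))
    for (j <- w2.indices) w2(j) -= lr * dOut * h(j)
    for (j <- w1.indices; i <- x.indices) w1(j)(i) -= lr * dHid(j) * x(i)
    0.5 * err * err // loss at the weights before this update
  }
}
```

The part that gets confusing at scale is exactly this index bookkeeping once the loops become distributed matrix expressions.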