Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Isabel Drost-Fromm

Hi,

On Fri, Sep 16, 2016 at 11:36:03PM -0700, Andrew Musselman wrote:
> and we're thinking about just how many pre-built algorithms we
> should include in the library versus working on performance behind the
> scenes.

To pick this question up: I've been watching Mahout from a distance for quite
some time. From what limited background I have on Samsara, I really like its
approach of being able to run on more than one execution engine.

To give some advice to downstream users in the field: what would be your advice
for people tasked with concrete use cases (such as fraud detection, anomaly
detection, learning search ranking functions, or building a recommender system)?
Is that something that can still be done with Mahout? What would it take to get
from raw data to a finished system? Is there something we can do to help users
get that accomplished? Is there even interest from users in such a
use-case-based perspective? If so, would there be interest among the Mahout
committers in helping users publicly create docs/examples/modules to support
these use cases?


Isabel



Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Florent Empis
Hi,

I am in the same spot as Isabel.
I used to use and understand most of the «old» standalone Mahout, and I now do
some data transformation with Spark, but I am not sure where Samsara fits in
the ecosystem.
We also do quite a bit of computation in R.
Basically, we are willing to learn and to support the project, for instance by
buying the books Rob mentioned, but a short doc with the outline Isabel
describes would be great!

Many thanks,

Florent


Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Trevor Grant
Hello Isabel and Florent,

I'm currently working on a side-by-side demo of R / Python / SparkML(Mllib)
/ Mahout, but in very broad strokes here is how I would compare them:

R- Most statistical functionality.  Most flexibility.  Implement your own
algorithms- mathematically expressive language.  Worst performance- handles
only "small" data sets.  Language is 'math centric'. Easy to extend /
create new algos

Python (sklearn/scikit) - Some mathematical / statistical functionality,
more focused on machine learning. Machine learning library very
sophisticated though.  Much better performance than R, still only single
node. "small to medium" data sets. Language is 'programmer centric'.
Somewhat difficult to extend / create new algos

SparkML / MLlib - Very limited mathematical functionality (usually collects
to driver to do anything of substance).  Machine learning rudimentary
compared to sklearn, but still non-trivial; one of the best available.
Excellent performance, well suited to "big" data sets. Language is
'programmer centric'. Very difficult to extend / create new algos.

(FlinkML - Fits in same spot as SparkML, but significantly less developed)

Mahout - Good mathematical functionality.  Good performance relative to
underlying engine (possibly superior with MAHOUT-1885).  Language is 'math
centric'.  Well suited to "medium and big" data sets. Fairly easy to extend
/ create new algos (MAHOUT-1856)

I hope that provides a high level comparison.

Re use cases- the tool to use depends on the job at hand.
Highly advanced mathematical model, small dataset or sampling from full
dataset OK -> Use R
Machine learning on small to medium data set or sampling from full dataset
OK -> Use Python / sklearn
Less sophisticated machine learning on Large dataset -> SparkML
Custom mathematical/statistical model on medium to large data -> Mahout

^^ All of this is just my opinion.
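
To make the rules of thumb above concrete, here is a minimal Python sketch
that encodes them as a lookup function. The function name, the "math" / "ml"
task labels, and the choice to treat "sampling OK" as equivalent to a small
dataset are assumptions made for illustration. The thread only gives the four
rules in prose, and this sketch merges "highly advanced" and "custom" models
into a single "math" category.

```python
def suggest_tool(task, data_size, sampling_ok=False):
    """Map a task kind and data size to the tool suggested by the rules of thumb.

    task:        "math" (mathematical/statistical model) or "ml" (machine learning)
    data_size:   "small", "medium", or "large"
    sampling_ok: True if working on a sample of the full dataset is acceptable
    """
    if task not in ("math", "ml"):
        raise ValueError("task must be 'math' or 'ml'")
    if data_size not in ("small", "medium", "large"):
        raise ValueError("data_size must be 'small', 'medium', or 'large'")

    # Assumption: if sampling is acceptable, the problem behaves like a small one.
    effective_size = "small" if sampling_ok else data_size

    if task == "math":
        # Small data -> R; medium or large data -> Mahout.
        return "R" if effective_size == "small" else "Mahout"
    # Machine learning: large data -> SparkML; otherwise Python / sklearn.
    return "SparkML" if effective_size == "large" else "Python / sklearn"


print(suggest_tool("math", "large"))                  # Mahout
print(suggest_tool("ml", "large", sampling_ok=True))  # Python / sklearn
```

Like the rules themselves, this is only one opinionated way to slice the
decision; real projects will weigh team skills and existing infrastructure
as well.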

Re: integration-

We're working on that too.  Recently MAHOUT-1896 added convenience methods
for interacting with MLlib-type RDDs and DataFrames:
https://issues.apache.org/jira/browse/MAHOUT-1896

(No support yet for SparkML type dataframes, or spitting DRMs back out into
RDDs/DataFrames).

Finally, docs: There has been talk for some time of migrating the
website from the CMS to Jekyll, and it's something I strongly support.  The CMS
makes it difficult to keep up with documentation, and Jekyll would open up
documentation/website maintenance to contributors.

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Pat Ferrel
My perspective comes from the data side. I work in recommenders, and that means
log analysis for huge amounts of data. Even a small shop doing this will
immediately run out of capacity in Python or R on a single node. MLlib is a
set of prepackaged algorithms that will work (mostly) with big data. Mahout
Samsara is the only general linear algebra tool I know of that will natively
let you interactively run R-like code on any size cluster, then polish it for
production, all without changing tools or language.

Going from analytics to recommenders means a jump in data size of several 
orders of magnitude and this is just one example.



Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Ted Dunning
From my perspective, the state of the art of machine learning is with
systems like TensorFlow and dl4j. If you can deal with the limits of a
non-clustered GPU system, then Theano and Caffe are very useful. Keras
papers over the differences between back-ends nicely.

TensorFlow and Theano can do a lot of mathematical and linear (tensor,
actually) algebra work nicely, especially if there is an optimization
problem lurking.

NVIDIA also has a very strong commercial offering that supports their GPU
clustering well.

Spark ML lags very far behind this state of the art, but is still useful
for simpler situations.

For recommendations, the situation is very different.  Almost all
applications are most easily and often most accurately solved using an
indicator-based approach and the go-to implementation of this is Mahout.

There is a lot of noise in the world about factorization-based
recommendation using ALS and such, but the noise is not warranted.
Deploying a recommender in a search engine is just better.
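
As a rough illustration of what an indicator-based approach can look like,
here is a minimal single-machine sketch in plain Python. It assumes that
"indicators" are item pairs whose co-occurrence in user histories is flagged
as anomalous by a log-likelihood ratio (LLR) test on a 2x2 contingency table.
This is not Mahout's code: the function names, the threshold, and the sample
data are hypothetical, and the extra filter that keeps only pairs co-occurring
more often than chance is an assumption of this sketch (LLR on its own also
scores pairs that co-occur unusually rarely).

```python
import math
from collections import defaultdict
from itertools import combinations


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """LLR score for a 2x2 table: k11 = both items, k12 = only A,
    k21 = only B, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def indicators(histories, threshold=1.0):
    """Return {item: [(indicator_item, score), ...]} sorted by descending score."""
    n_users = len(histories)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in histories.values():
        unique = sorted(set(items))
        for item in unique:
            item_count[item] += 1
        for a, b in combinations(unique, 2):
            pair_count[(a, b)] += 1

    result = defaultdict(list)
    for (a, b), k11 in pair_count.items():
        k12 = item_count[a] - k11
        k21 = item_count[b] - k11
        k22 = n_users - k11 - k12 - k21
        # Sketch-specific assumption: skip pairs that co-occur no more
        # often than expected under independence.
        if k11 * n_users <= item_count[a] * item_count[b]:
            continue
        score = llr(k11, k12, k21, k22)
        if score >= threshold:
            result[a].append((b, score))
            result[b].append((a, score))
    return {item: sorted(pairs, key=lambda p: -p[1])
            for item, pairs in result.items()}


# Hypothetical purchase histories, for illustration only.
sample = {
    "u1": ["a", "b"], "u2": ["a", "b"], "u3": ["a", "b"],
    "u4": ["c"], "u5": ["c"], "u6": ["c", "a"],
}
print(indicators(sample))
```

In a search-engine deployment, each item's indicator list would typically be
indexed as a field of that item's document, and a user's recent history would
be used as the query. That serving step is not shown here.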

I have not personally used Samsara much, but the idea of a strong optimizer
over the top of a nice syntax for linear algebra is a good one.


Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread scott cote
Trevor gave a great presentation at our user group.  It was live streamed on
Periscope.  Trevor - maybe you could share the URL?  I don’t have it handy at
the moment.

SCott

Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Keith Aumiller
I was just watching it. ;)

https://trevorgrant.org/

Thanks Trevor!


Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Florent Empis
From my point of view, Mahout as a whole has shifted from what it was in
2009-2012.
At the time, Mahout (and Mahout in Action is a great testimony of that era)
was a sum of bricks, full of relatively high-level mathematical concepts but
usable by what I'd call (myself included) wanna-be data scientists.
With an approach akin to "data science for hackers", it was possible to
build a crude but working ML tool, such as a recommender. My memory is
lacking, but I think my first experiments with Taste date from 2008. I had,
at the time, no intimate mathematical knowledge of what the code written by
Ted, Sean & many others did. I fed order histories into it and got back
recommendations that "made sense". I then did the same with kNN & more.
Over the years, I got better at understanding the mathematical concepts,
thanks in particular to scientists who took the time to explain to me,
a tech/data guy, the mathematical concepts behind the blocks I
blindly used.

I'd say that "Mahout today" is a distributed mathematical toolbox. Nothing
wrong with that, absolutely nothing. It has its purposes, but I feel it's no
longer aimed at "tech people wanting to have a go at machine learning".

When I take a look at my company code repository, even though I'm less and
less involved in day to day design decisions, I see that its "lively"
components are indeed using stuff like Tensorflow & dl4j.

My scientific credentials are obviously way less impressive than Ted's
(whom I have had the pleasure of meeting a few times, along with quite a few
MapR employees), but I reach exactly the same conclusion coming from a
tech/functional background: for recommendation, don't bother reinventing
the wheel or using "fancy" ALS stuff (been there, done that, saw no
impressive gain in practical use cases). Buy an off-the-shelf solution
(disclaimer: I sell one ;-) ) or build it from Mahout Taste and do some
data wrangling with a search engine (but if you're in a hurry, definitely
go and talk to vendors; a few caveats apply :-) ). For everything else
ML-related, have a go at TensorFlow implementations related to your use case;
you will find books that are as didactic as Mahout in Action was 6 years
ago.

All in all: congrats to the Mahout team, past and current contributors. You
did a damn good job and got me into this field, for which I am very
grateful!


Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Dmitriy Lyubimov
On Tue, Jan 31, 2017 at 3:01 AM, Isabel Drost-Fromm wrote:

> To give some advice to downstream users in the field - what would be your
> advice for people tasked with concrete use cases (stuff like fraud
> detection, anomaly detection, learning search ranking functions, building
> a recommender system)?


If you are an off-the-shelf practitioner (as at most smaller startups
without a chief scientist), with very few exceptions you might want to look
for an off-the-shelf solution where one exists, and most likely it does not
exist on Samsara in the open domain. Apart from several applied
off-the-shelf components, Mahout has not (hopefully just not yet) developed a
comprehensive set of things to use.

The off-the-shelf components currently are cross-occurrence recommendations
(which still require a real-time serving component taken from elsewhere),
SVD/PCA, some algebra, and naive/complement Bayes at scale.

Most of the bigger companies I have worked for never deal with completely
off-the-shelf open source solutions. Their problem always requires more
understanding. (E.g., much as the cross-occurrence (COO) recommender is
wonderful, I don't think Netflix would entertain taking Mahout's COO and
running it verbatim.)

It is quite common that companies invest in their own specific
understanding of their problem and requirements, and in a specific solution
to their problem through iterative experimentation with different
methodologies, most of which are either new-ish enough or proprietary
enough that a public solution does not exist.

That latter case was pretty much the motivation for Samsara. If you are a
practitioner solving numerical problems through an experimentation cycle,
Mahout is much more useful than any of the off-the-shelf collections.

So the idea, first, is to get an R-like platform out for the practitioners,
and then grow packages (just like with R). The platform obviously needs work,
which unfortunately is not, in my opinion, sufficiently sponsored at the
moment by industry or academia, compared to other projects.

> Is there even interest from users in such a use case based perspective?
> If so, would there be interest among the Mahout committers to help
> users publicly create docs/examples/modules to support these use cases?

Yes.