[jira] [Comment Edited] (MAHOUT-1894) Add support for Spark 2x backend

2017-02-17 Thread Andrew Weienr (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872778#comment-15872778
 ] 

Andrew Weienr edited comment on MAHOUT-1894 at 2/18/17 12:08 AM:
-

I followed the instructions from [~rawkintrevo] here: 
https://issues.apache.org/jira/browse/MAHOUT-1894?focusedCommentId=15871928&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15871928

Was able to run the scala shell against all 3 versions of Spark without errors.
My system:
OS X El Capitan 10.11.6 (15G1217)
Maven 3.3.9
Java 1.8.0_101

One other note.  Where the instructions say "$ bin mahout spark-shell" they 
should actually say
"$ bin/mahout spark-shell" (just in case there are any newbies helping test)



was (Author: weienran):
I followed the instructions from [~rawkintrevo] here: 
https://issues.apache.org/jira/browse/MAHOUT-1894?focusedCommentId=15871928&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15871928

Was able to run the scala shell against all 3 versions of Spark without errors.
My system:
OS X El Capitan 10.11.6 (15G1217)
Maven 3.3.9
Java 1.8.0_101

One other note.  Where the instructions say "$ bin mahout spark-shell" they 
should actually say "$ bin/mahout spark-shell" (just in case there are any 
newbies helping test)


> Add support for Spark 2x backend
> 
>
>                 Key: MAHOUT-1894
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1894
>             Project: Mahout
>          Issue Type: Task
>          Components: spark
>    Affects Versions: 0.13.0
>            Reporter: Suneel Marthi
>            Assignee: Trevor Grant
>            Priority: Critical
>             Fix For: 1.0.0, 0.13.0, 0.14.0
>
>
> add support for Spark 2.x as backend execution engine.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1894) Add support for Spark 2x backend

2017-02-17 Thread Andrew Weienr (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872778#comment-15872778
 ] 

Andrew Weienr commented on MAHOUT-1894:
---

I followed the instructions from [~rawkintrevo] here: 
https://issues.apache.org/jira/browse/MAHOUT-1894?focusedCommentId=15871928&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15871928

Was able to run the scala shell against all 3 versions of Spark without errors.
My system:
OS X El Capitan 10.11.6 (15G1217)
Maven 3.3.9
Java 1.8.0_101

One other note.  Where the instructions say "$ bin mahout spark-shell" they 
should actually say "$ bin/mahout spark-shell" (just in case there are any 
newbies helping test)




Re: Contributing an algorithm for samsara

2017-02-17 Thread Dmitriy Lyubimov
in particular, this is the samsara implementation of double-weighted als:
https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626


On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov  wrote:

> Jim,
>
> if ALS is of interest, and as far as weighted ALS is concerned (since we
> already have a trivial regularized ALS in the "decompositions" package),
> here's an uncommitted samsara-compatible patch from a while back:
> https://issues.apache.org/jira/browse/MAHOUT-1365
>
> it combines weights on both data points (a.k.a. "implicit feedback" als)
> and regularization rates (paper references are given). We combine both
> approaches in one (which is novel, i guess, yet simple enough).
> Obviously the final solver can also be used with pure reg-rate
> regularization if wanted, making it equivalent to one of the papers.
>
> You may know the implicit feedback paper from mllib's implicit als, but
> unlike how it was done over there (as a use-case sort of problem that takes
> input before features are even extracted), we split the problem into a pure
> algebraic solver (double-weighted ALS math) and leave the feature extraction
> outside of this issue per se (it can be added as a separate adapter).
>
> The reason for that is that a use-case-specific implementation
> does not necessarily leave room for feature extraction that is
> different from the use case described in the paper, partially consumed
> streamed videos (e.g., instead of videos one could count visits or clicks or
> add-to-cart events, which may need additional hyperparameters found for them
> as part of feature extraction and converting observations into "weights").
>
> The biggest problem with these ALS methods, however, is that all
> hyperparameters require multidimensional crossvalidation and optimization.
> I think i mentioned it before in the list of desired solutions: as it stands,
> Mahout does not have a hyperparameter fitting routine.
>
> In practice, when using this kind of ALS, we have a case of
> multidimensional hyperparameter optimization. One hyperparameter comes from
> the fitter (reg rate, or base reg rate in case of weighted regularization),
> and the others come from the feature extraction process. E.g., in the
> original paper they introduce (at least) 2 formulas to extract weights from
> the streaming video observations, and each of them had one parameter, alpha,
> which in the context of the whole problem effectively becomes yet another
> hyperparameter to fit. In other use cases, when your confidence measurement
> may be coming from different sources and observations, the confidence
> extraction may actually have even more hyperparameters to fit than just
> one. And when we have a multidimensional case, simple approaches (like grid
> or random search) become either cost prohibitive or ineffective, due to the
> curse of dimensionality.
>
> At the time i was contributing that method, i was using it in conjunction
> with a multidimensional bayesian optimizer, but the company that i wrote it
> for did not have it approved for contribution (unlike weighted als) at that
> time.
>
> Anyhow, perhaps you could read the algebra in both ALS papers there and
> ask questions, and we could worry about hyperparameter optimization and
> performance a bit later.
>
> On the feature extraction front (as in implicit feedback als per Koren
> et al.), this is an ideal use case for a more general R-like formula
> approach, which is also on the desired list of things to have.
>
> So i guess we really have 3 problems here:
> (1) double-weighted ALS
> (2) bayesian optimization and crossvalidation in an n-dimensional
> hyperparameter space
> (3) feature extraction per (preferably R-like) formula.
>
>
> -d
>
>
> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo 
> wrote:
>
>> +1 to glms
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: Trevor Grant 
>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>> To: dev@mahout.apache.org
>> Subject: Re: Contributing an algorithm for samsara
>>
>> Jim is right, and I would take it one further and say, it would be best to
>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>> from there a Logistic regression is a trivial extension.
>>
>> Buyer beware- GLMs will be a bit of work- doable, but that would be
>> jumping
>> in neck first for both Jim and Saikat...
>>
>> MAHOUT-1928 and MAHOUT-1929
>>
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>>
>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs
>> are
>> in there.
>>
>> If you have an algorithm you are particularly intimate with, or explicitly
>> need/want- feel free to open a JIRA and assign to yourself.
>>
>> There is also a 

Re: Contributing an algorithm for samsara

2017-02-17 Thread Dmitriy Lyubimov
Jim,

if ALS is of interest, and as far as weighted ALS is concerned (since we
already have a trivial regularized ALS in the "decompositions" package),
here's an uncommitted samsara-compatible patch from a while back:
https://issues.apache.org/jira/browse/MAHOUT-1365

it combines weights on both data points (a.k.a. "implicit feedback" als) and
regularization rates (paper references are given). We combine both
approaches in one (which is novel, i guess, yet simple enough).
Obviously the final solver can also be used with pure reg-rate regularization
if wanted, making it equivalent to one of the papers.
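
For readers who want the algebra in front of them, here is a sketch of what a doubly weighted ALS objective generally looks like. It is a generic form written under assumptions, not a transcription of the MAHOUT-1365 patch: the symbols c_{ui} (per-observation weight), p_{ui} (observed preference), w_u and w_i (per-row regularization weights) and \lambda (base regularization rate) are introduced here purely for illustration.

```latex
% Weighted squared error plus weighted regularization (illustrative notation)
\min_{X,Y}\; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  \;+\; \lambda \Bigl( \sum_u w_u \lVert x_u \rVert^2 + \sum_i w_i \lVert y_i \rVert^2 \Bigr)

% Holding Y fixed, each row x_u has a closed-form update, with
% C^u = \operatorname{diag}(c_{u1}, \dots, c_{un}) and p_u = (p_{u1}, \dots, p_{un})^{\top}:
x_u = \bigl( Y^{\top} C^u Y + \lambda\, w_u I \bigr)^{-1} Y^{\top} C^u\, p_u
```

With all c_{ui} = 1 only the weighted regularization remains, and with all w_u = w_i = 1 only the data-point weights remain; the update for y_i is symmetric.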

You may know the implicit feedback paper from mllib's implicit als, but unlike
how it was done over there (as a use-case sort of problem that takes input
before features are even extracted), we split the problem into a pure algebraic
solver (double-weighted ALS math) and leave the feature extraction outside
of this issue per se (it can be added as a separate adapter).

The reason for that is that a use-case-specific implementation
does not necessarily leave room for feature extraction that is
different from the use case described in the paper, partially consumed
streamed videos (e.g., instead of videos one could count visits or clicks or
add-to-cart events, which may need additional hyperparameters found for them
as part of feature extraction and converting observations into "weights").
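
As an illustration of turning observations into weights, the first two lines below sketch the observation-to-confidence mappings commonly attributed to the implicit feedback paper (Hu, Koren, Volinsky), where r_{ui} is the raw observation and \alpha, \epsilon are tunable. The third line is a purely hypothetical extension for click and add-to-cart counts; the names n^{click}, n^{cart}, \alpha_1 and \alpha_2 are assumptions made up for this example.

```latex
% Linear and logarithmic confidence mappings
c_{ui} = 1 + \alpha\, r_{ui}
c_{ui} = 1 + \alpha \log\bigl(1 + r_{ui}/\epsilon\bigr)

% Hypothetical multi-source variant: one extra hyperparameter per event type
c_{ui} = 1 + \alpha_1\, n^{\mathrm{click}}_{ui} + \alpha_2\, n^{\mathrm{cart}}_{ui}
```

Each additional event type adds a parameter that has to be fitted together with the regularization rate.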

The biggest problem with these ALS methods, however, is that all
hyperparameters require multidimensional crossvalidation and optimization.
I think i mentioned it before in the list of desired solutions: as it stands,
Mahout does not have a hyperparameter fitting routine.

In practice, when using this kind of ALS, we have a case of
multidimensional hyperparameter optimization. One hyperparameter comes from the
fitter (reg rate, or base reg rate in case of weighted regularization), and
the others come from the feature extraction process. E.g., in the original paper
they introduce (at least) 2 formulas to extract weights from the
streaming video observations, and each of them had one parameter, alpha,
which in the context of the whole problem effectively becomes yet another
hyperparameter to fit. In other use cases, when your confidence measurement
may be coming from different sources and observations, the confidence
extraction may actually have even more hyperparameters to fit than just
one. And when we have a multidimensional case, simple approaches (like grid
or random search) become either cost prohibitive or ineffective, due to the
curse of dimensionality.
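
To put rough numbers on the cost of a full grid search, here is a small shell sketch. The function name `grid_runs` and the figures (10 candidate values per hyperparameter, 5 cross-validation folds) are illustrative assumptions, not anything taken from Mahout.

```sh
#!/bin/sh
# grid_runs DIMS POINTS FOLDS
# Prints how many model fits a full grid search needs:
# POINTS^DIMS grid cells, each cross-validated FOLDS times.
grid_runs() {
  dims=$1 points=$2 folds=$3
  runs=$folds
  i=0
  while [ "$i" -lt "$dims" ]; do
    runs=$((runs * points))
    i=$((i + 1))
  done
  echo "$runs"
}

# Illustrative numbers only: 10 candidate values per hyperparameter, 5 folds.
for d in 1 2 3 4; do
  echo "hyperparameters=$d fits=$(grid_runs "$d" 10 5)"
done
```

Every extra hyperparameter multiplies the number of ALS fits by the grid resolution, which is why a smarter search such as bayesian optimization becomes attractive once feature extraction adds its own parameters.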

At the time i was contributing that method, i was using it in conjunction
with a multidimensional bayesian optimizer, but the company that i wrote it
for did not have it approved for contribution (unlike weighted als) at that
time.

Anyhow, perhaps you could read the algebra in both ALS papers there and ask
questions, and we could worry about hyperparameter optimization and
performance a bit later.

On the feature extraction front (as in implicit feedback als per Koren
et al.), this is an ideal use case for a more general R-like formula approach,
which is also on the desired list of things to have.

So i guess we really have 3 problems here:
(1) double-weighted ALS
(2) bayesian optimization and crossvalidation in an n-dimensional
hyperparameter space
(3) feature extraction per (preferably R-like) formula.


-d


On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo  wrote:

> +1 to glms
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
>  Original message 
> From: Trevor Grant 
> Date: 02/17/2017 6:56 AM (GMT-08:00)
> To: dev@mahout.apache.org
> Subject: Re: Contributing an algorithm for samsara
>
> Jim is right, and I would take it one further and say, it would be best to
> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
> from there a Logistic regression is a trivial extension.
>
> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
> in neck first for both Jim and Saikat...
>
> MAHOUT-1928 and MAHOUT-1929
>
> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>
> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
> in there.
>
> If you have an algorithm you are particularly intimate with, or explicitly
> need/want- feel free to open a JIRA and assign to yourself.
>
> There is also a case to be made for implementing the ALS...
>
> 1) It's a much better 'beginner' project.
> 2) Mahout has some world-class recommenders; a toy ALS implementation might
> help us think through how the other recommenders (e.g. CCO) will 'fit' into
> the framework. E.g., ALS could be the toy-prototype recommender that helps us
> think through building out that section of the framework.
>
>
>
> Trevor Grant
> Data 

RE: Contributing an algorithm for samsara

2017-02-17 Thread Andrew Palumbo
+1 to glms



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Trevor Grant 
Date: 02/17/2017 6:56 AM (GMT-08:00)
To: dev@mahout.apache.org
Subject: Re: Contributing an algorithm for samsara

Jim is right, and I would take it one further and say, it would be best to
implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
from there a Logistic regression is a trivial extension.

Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
in neck first for both Jim and Saikat...

MAHOUT-1928 and MAHOUT-1929

https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
in there.

If you have an algorithm you are particularly intimate with, or explicitly
need/want- feel free to open a JIRA and assign to yourself.

There is also a case to be made for implementing the ALS...

1) It's a much better 'beginner' project.
2) Mahout has some world-class recommenders; a toy ALS implementation might
help us think through how the other recommenders (e.g. CCO) will 'fit' into
the framework. E.g., ALS could be the toy-prototype recommender that helps us
think through building out that section of the framework.



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski  wrote:

> My own thoughts are that logistic regression seems a more "generalized"
> and hence more useful algo to be factored in... At least in the
> use cases that I've been toying with.
>
> So I'd like to help out with that if wanted...
>
> > On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal  wrote:
> >
> > Trevor et al,
> >
> > I'd like to contribute an algorithm or two in samsara using spark as I would
> > like to do a compare and contrast with mahout with R server for a data
> > science pipeline, machine learning repo that I'm working on, in looking at
> > the list of algorithms
> > (https://mahout.apache.org/users/basics/algorithms.html) is there an
> > algorithm for spark that would be beneficial for the community, my use cases
> > would typically be around clustering or real time machine learning for
> > building recommendations on the fly. The algorithms I see that could
> > potentially be useful are: 1) Matrix Factorization with ALS 2) Logistic
> > regression with SVD.
> >
> > Any thoughts/guidance or recommendations would be very helpful.
> > Thanks in advance.
>
>


[jira] [Commented] (MAHOUT-1894) Add support for Spark 2x backend

2017-02-17 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872157#comment-15872157
 ] 

Saikat Kanjilal commented on MAHOUT-1894:
-

[~rawkintrevo] I've already done all this without any errors, is there other 
testing I can help with on this?



Re: Contributing an algorithm for samsara

2017-02-17 Thread Saikat Kanjilal
To start this off I figure we should spend some time understanding the current 
implementations and theory before we dig deep into implementing this in mahout:


1) 
https://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/

2) 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala


3) 
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/decompositions/ALS.scala


4) https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/




Jim, I would suggest we spend some time researching and digging into these 
resources and circle back next week to get this off the ground. Let me know if 
you want to meet offline as well. I would recommend that the next step be a 
design proposal to the dev list on how the implementation will fit into the 
current samsara algorithms. What do you think?

Regards


From: Jim Jagielski 
Sent: Friday, February 17, 2017 8:18 AM
To: dev@mahout.apache.org
Subject: Re: Contributing an algorithm for samsara

Sounds good to me. +1

> On Feb 17, 2017, at 11:15 AM, Saikat Kanjilal  wrote:
>
> Jim,
> What do you say we start with ALS and then tackle glm?
>
>
> Sent from my iPhone
>
>> On Feb 17, 2017, at 6:56 AM, Trevor Grant  wrote:
>>
>> Jim is right, and I would take it one further and say, it would be best to
>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>> from there a Logistic regression is a trivial extension.
>>
>> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
>> in neck first for both Jim and Saikat...
>>
>> MAHOUT-1928 and MAHOUT-1929
>>
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>>
>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
>> in there.
>>
>> If you have an algorithm you are particularly intimate with, or explicitly
>> need/want- feel free to open a JIRA and assign to yourself.
>>
>> There is also a case to be made for implementing the ALS...
>>
>> 1) It's a much better 'beginner' project.
>> 2) Mahout has some world-class recommenders; a toy ALS implementation might
>> help us think through how the other recommenders (e.g. CCO) will 'fit' into
>> the framework. E.g., ALS could be the toy-prototype recommender that helps us
>> think through building out that section of the framework.
>>
>>
>>
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo

Re: Contributing an algorithm for samsara

2017-02-17 Thread Jim Jagielski
Sounds good to me. +1

> On Feb 17, 2017, at 11:15 AM, Saikat Kanjilal  wrote:
> 
> Jim,
> What do you say we start with ALS and then tackle glm?
> 
> 
> Sent from my iPhone
> 
>> On Feb 17, 2017, at 6:56 AM, Trevor Grant  wrote:
>> 
>> Jim is right, and I would take it one further and say, it would be best to
>> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
>> from there a Logistic regression is a trivial extension.
>> 
>> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
>> in neck first for both Jim and Saikat...
>> 
>> MAHOUT-1928 and MAHOUT-1929
>> 
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>> 
>> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
>> in there.
>> 
>> If you have an algorithm you are particularly intimate with, or explicitly
>> need/want- feel free to open a JIRA and assign to yourself.
>> 
>> There is also a case to be made for implementing the ALS...
>> 
>> 1) It's a much better 'beginner' project.
>> 2) Mahout has some world-class recommenders; a toy ALS implementation might
>> help us think through how the other recommenders (e.g. CCO) will 'fit' into
>> the framework. E.g., ALS could be the toy-prototype recommender that helps us
>> think through building out that section of the framework.
>> 
>> 
>> 
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>> 
>> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
>> 
>> 
>>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski  wrote:
>>> 
>>> My own thoughts are that logistic regression seems a more "generalized"
>>> and hence more useful algo to be factored in... At least in the
>>> use cases that I've been toying with.
>>> 
>>> So I'd like to help out with that if wanted...
>>> 
>>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal  wrote:
>>>> 
>>>> Trevor et al,
>>>> 
>>>> I'd like to contribute an algorithm or two in samsara using spark as I would
>>>> like to do a compare and contrast with mahout with R server for a data
>>>> science pipeline, machine learning repo that I'm working on, in looking at
>>>> the list of algorithms
>>>> (https://mahout.apache.org/users/basics/algorithms.html) is there an
>>>> algorithm for spark that would be beneficial for the community, my use cases
>>>> would typically be around clustering or real time machine learning for
>>>> building recommendations on the fly. The algorithms I see that could
>>>> potentially be useful are: 1) Matrix Factorization with ALS 2) Logistic
>>>> regression with SVD.
>>>> 
>>>> Any thoughts/guidance or recommendations would be very helpful.
>>>> Thanks in advance.
>>> 
>>> 



Re: Contributing an algorithm for samsara

2017-02-17 Thread Saikat Kanjilal
Jim,
What do you say we start with ALS and then tackle glm?


Sent from my iPhone

> On Feb 17, 2017, at 6:56 AM, Trevor Grant  wrote:
> 
> Jim is right, and I would take it one further and say, it would be best to
> implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
> from there a Logistic regression is a trivial extension.
> 
> Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
> in neck first for both Jim and Saikat...
> 
> MAHOUT-1928 and MAHOUT-1929
> 
> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
> 
> ^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
> in there.
> 
> If you have an algorithm you are particularly intimate with, or explicitly
> need/want- feel free to open a JIRA and assign to yourself.
> 
> There is also a case to be made for implementing the ALS...
> 
> 1) It's a much better 'beginner' project.
> 2) Mahout has some world-class recommenders; a toy ALS implementation might
> help us think through how the other recommenders (e.g. CCO) will 'fit' into
> the framework. E.g., ALS could be the toy-prototype recommender that helps us
> think through building out that section of the framework.
> 
> 
> 
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
> 
> *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> 
> 
>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski  wrote:
>> 
>> My own thoughts are that logistic regression seems a more "generalized"
>> and hence more useful algo to be factored in... At least in the
>> use cases that I've been toying with.
>> 
>> So I'd like to help out with that if wanted...
>> 
>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal  wrote:
>>> 
>>> Trevor et al,
>>> 
>>> I'd like to contribute an algorithm or two in samsara using spark as I would
>>> like to do a compare and contrast with mahout with R server for a data
>>> science pipeline, machine learning repo that I'm working on, in looking at
>>> the list of algorithms
>>> (https://mahout.apache.org/users/basics/algorithms.html) is there an
>>> algorithm for spark that would be beneficial for the community, my use cases
>>> would typically be around clustering or real time machine learning for
>>> building recommendations on the fly. The algorithms I see that could
>>> potentially be useful are: 1) Matrix Factorization with ALS 2) Logistic
>>> regression with SVD.
>>> 
>>> Any thoughts/guidance or recommendations would be very helpful.
>>> Thanks in advance.
>> 
>> 


Re: Contributing an algorithm for samsara

2017-02-17 Thread Trevor Grant
Jim is right, and I would take it one further and say, it would be best to
implement GLMs https://en.wikipedia.org/wiki/Generalized_linear_model ,
from there a Logistic regression is a trivial extension.
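
To make the "trivial extension" concrete: a GLM ties the mean of the response to a linear predictor through a link function, and logistic regression is the special case with a Bernoulli response and the logit link. The notation below is standard textbook notation chosen for this sketch; none of the symbols come from Mahout code.

```latex
% General GLM: link function g applied to the conditional mean
g\bigl(\mathbb{E}[y \mid x]\bigr) = x^{\top}\beta

% Logistic regression: y \sim \mathrm{Bernoulli}(\mu) with the logit link
g(\mu) = \log\frac{\mu}{1-\mu}
\quad\Longrightarrow\quad
\mu = \frac{1}{1 + e^{-x^{\top}\beta}}
```

Once a fitter exists for the general form, swapping in a different response distribution and link gives the other members of the family.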

Buyer beware- GLMs will be a bit of work- doable, but that would be jumping
in neck first for both Jim and Saikat...

MAHOUT-1928 and MAHOUT-1929

https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

^^ currently open JIRAs around Algorithms- you'll see Logistic and GLMs are
in there.

If you have an algorithm you are particularly intimate with, or explicitly
need/want- feel free to open a JIRA and assign to yourself.

There is also a case to be made for implementing the ALS...

1) It's a much better 'beginner' project.
2) Mahout has some world-class recommenders; a toy ALS implementation might
help us think through how the other recommenders (e.g. CCO) will 'fit' into
the framework. E.g., ALS could be the toy-prototype recommender that helps us
think through building out that section of the framework.



Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski  wrote:

> My own thoughts are that logistic regression seems a more "generalized"
> and hence more useful algo to be factored in... At least in the
> use cases that I've been toying with.
>
> So I'd like to help out with that if wanted...
>
> > On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal  wrote:
> >
> > Trevor et al,
> >
> > I'd like to contribute an algorithm or two in samsara using spark as I would
> > like to do a compare and contrast with mahout with R server for a data
> > science pipeline, machine learning repo that I'm working on, in looking at
> > the list of algorithms
> > (https://mahout.apache.org/users/basics/algorithms.html) is there an
> > algorithm for spark that would be beneficial for the community, my use cases
> > would typically be around clustering or real time machine learning for
> > building recommendations on the fly. The algorithms I see that could
> > potentially be useful are: 1) Matrix Factorization with ALS 2) Logistic
> > regression with SVD.
> >
> > Any thoughts/guidance or recommendations would be very helpful.
> > Thanks in advance.
>
>


[jira] [Commented] (MAHOUT-1894) Add support for Spark 2x backend

2017-02-17 Thread Trevor Grant (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871928#comment-15871928
 ] 

Trevor Grant commented on MAHOUT-1894:
--

@apalumbo is still reporting issues wherever he tries this.

I want to make a general call for testers to see where the 'gotcha' is. 

Here are instructions for testing- please help.

Step 1. Clone Mahout-1894

```sh
$ git clone https://github.com/rawkintrevo/mahout 
$ cd mahout
$ git checkout mahout-1894
```

Step 2. Download various Sparks
```sh
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
$ for f in *.tgz; do tar -xzf "$f"; done
```
(only if those are the only tgz's in the directory)

Step 3. Iteratively Build Mahout and Test Shell

A) Spark 1.6.3
```sh
$ mvn clean package -DskipTests -Dspark.version=1.6.3
$ export SPARK_HOME=/path/to/spark/1.6.3
$ bin mahout spark-shell
```
In the shell...
```scala
scala> :load examples/bin/SparseSparseDrmTimer.mscala
```
^^ Should run without error...
Ctrl+C to close.

B) Spark 2.0.2

```sh
$ mvn clean package -DskipTests -Dspark.version=2.0.2
$ export SPARK_HOME=/path/to/spark/2.0.2
$ bin mahout spark-shell
```
In the shell...
```scala
scala> :load examples/bin/SparseSparseDrmTimer.mscala
```
^^ Should run without error...
Ctrl+C to close.


C) Spark 2.1.0
```sh
$ mvn clean package -DskipTests -Dspark.version=2.1.0
$ export SPARK_HOME=/path/to/spark/2.1.0
$ bin mahout spark-shell
```
In the shell...

```scala
scala> :load examples/bin/SparseSparseDrmTimer.mscala
```
^^ Should run without error...
Ctrl+C to close.
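
The three build-and-test rounds in Step 3 follow the same pattern, so they can be wrapped in a small helper. This is a sketch under assumptions: `SPARK_ROOT`, `spark_home_for` and `build_and_shell` are hypothetical names, and the directory layout assumes the Step 2 tarballs were unpacked under `SPARK_ROOT` with their default directory names. It invokes the launcher as `bin/mahout` (with the slash), as a tester points out elsewhere on this issue.

```sh
#!/bin/sh
# Hypothetical helper for Step 3. SPARK_ROOT is an assumed location holding
# the unpacked Spark distributions from Step 2 (defaults to the current dir).
SPARK_ROOT="${SPARK_ROOT:-$PWD}"

# Map a Spark version to its unpacked directory, following the tarball
# names in Step 2 (1.6.3 was built for hadoop2.6, the 2.x ones for hadoop2.7).
spark_home_for() {
  case "$1" in
    1.6.3) echo "$SPARK_ROOT/spark-$1-bin-hadoop2.6" ;;
    *)     echo "$SPARK_ROOT/spark-$1-bin-hadoop2.7" ;;
  esac
}

# Build Mahout against one Spark version, then start the Mahout shell.
# Run from the root of the Mahout checkout.
build_and_shell() {
  version="$1"
  mvn clean package -DskipTests -Dspark.version="$version" || return 1
  SPARK_HOME="$(spark_home_for "$version")" bin/mahout spark-shell
}

# Usage, one version at a time (load the .mscala script inside each shell):
#   build_and_shell 1.6.3
#   build_and_shell 2.0.2
#   build_and_shell 2.1.0
```

Keeping the build and the `SPARK_HOME` setting in one function makes it harder to test a Mahout build against a mismatched Spark distribution.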



Re: Contributing an algorithm for samsara

2017-02-17 Thread Jim Jagielski
My own thoughts are that logistic regression seems a more "generalized"
and hence more useful algo to be factored in... At least in the
use cases that I've been toying with.

So I'd like to help out with that if wanted...

> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal  wrote:
> 
> Trevor et al,
> 
> I'd like to contribute an algorithm or two in samsara using spark as I would 
> like to do a compare and contrast with mahout with R server for a data 
> science pipeline, machine learning repo that I'm working on, in looking at 
> the list of algorithms 
> (https://mahout.apache.org/users/basics/algorithms.html) is there an 
> algorithm for spark that would be beneficial for the community, my use cases 
> would typically be around clustering or real time machine learning for 
> building recommendations on the fly. The algorithms I see that could 
> potentially be useful are: 1) Matrix Factorization with ALS 2) Logistic 
> regression with SVD.
> 
> Any thoughts/guidance or recommendations would be very helpful.
> Thanks in advance.



Re: Intro from a lurker

2017-02-17 Thread Jim Jagielski
Yes, please! Thx!
> On Feb 10, 2017, at 11:55 PM, Andrew Musselman  
> wrote:
> 
> Sounds good, thanks. Happy to invite you to prep and release chats if you'd
> like; let us know.
> 
> On Fri, Feb 10, 2017 at 4:06 AM, Jim Jagielski  wrote:
> 
>> Wow... I don't think I've EVER encountered a welcome like this!
>> 
>> Thanks for all the info and pointers... I plan to dig in over the
>> weekend and really digest the emails and see where I can make some
>> immediate (or semi-immediate ;) ) contributions.
>> 
>> Cheers!
>>