Re: Have Friedman's glmnet algo running in Spark

2015-08-04 Thread mike
 My friends and I are continuing work on the algorithm. You are right that 
there are two elements to Friedman's glmnet algorithm. One is the use of 
coordinate descent for minimizing penalized regression with an absolute value 
penalty, and the other is managing the regularization parameters. Friedman's 
algorithm does return the entire regularization path. We have had to get 
fairly deep into the mechanics of linear algebra. The tricky part has been 
arranging the matrix and vector multiplications to minimize the compute times 
(e.g. big time differences between multiplying by a submatrix versus 
multiplying by the columns in the submatrix, etc.).
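In case it helps readers follow the discussion: the inner loop Friedman uses is plain cyclic coordinate descent, where each coefficient update is a soft-thresholding step against the current residual. The sketch below is a minimal single-machine numpy illustration of that update; it is not our Spark code, and the function names and sweep count are ours:

```python
import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * max(|z| - gamma, 0): the closed-form
    # minimizer of a one-dimensional quadratic plus an L1 penalty.
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.
    Assumes the columns of X are centered and have mean squared value 1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()  # running residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            # correlation of column j with the partial residual
            z = (X[:, j] @ r) / n + b[j]
            b_new = soft_threshold(z, lam)
            if b_new != b[j]:
                r -= X[:, j] * (b_new - b[j])  # keep the residual current
                b[j] = b_new
    return b
```

Keeping the residual vector up to date is what makes each coordinate update cheap: the data are touched one column at a time, never the whole matrix at once.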

All of the versions we've produced generate a multitude of solutions (default = 
100) for a range of different values of the regularization parameter. The 
solutions always cover the most heavily penalized end of the curve. The number 
of solutions generated depends on how fine the steps are and how close the 
solutions get to the fully saturated (un-penalized) solution. Default values 
for these work about 80% of the time.
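For concreteness, the grid of penalty values in the glmnet scheme is laid out so that the largest value is the smallest penalty at which every coefficient is exactly zero, with the rest log-spaced down from it. A numpy sketch (the count and ratio defaults here are assumptions, chosen to mirror the defaults mentioned above):

```python
import numpy as np

def lambda_grid(X, y, n_lambdas=100, eps=1e-3):
    """Log-spaced penalty path, most heavily penalized end first.
    lam_max = max_j |x_j . y| / n is the smallest penalty for which
    the all-zero coefficient vector is optimal (for centered data)."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / n
    return np.geomspace(lam_max, eps * lam_max, n_lambdas)
```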

Personally, I've always found it useful to have the entire regularization path. 
One way or another, that's always required to get a final solution. It's just a 
question of whether the points on the path are generated by hunting and pecking 
or done all in one shot systematically.
mike






-Original Message-
From: Patrick [mailto:petz2...@gmail.com]
Sent: Tuesday, August 4, 2015 12:50 AM
To: d...@spark.apache.org
Subject: Re: Have Friedman's glmnet algo running in Spark

I have a follow up on this: I see on JIRA that the idea of having a GLMNET 
implementation was more or less abandoned, since an OWLQN implementation was 
chosen to construct a model using L1/L2 regularization. However, GLMNET has the 
property of "returning a multitude of models (corresponding to different values 
of penalty parameters [for the regularization])". I think this is not the case 
in the OWLQN implementation. However, this would be really helpful to compare 
the accuracy of models with different regParam values. As far as I understood, 
this would avoid a costly cross-validation step over a possibly large set of 
regParam values.

Joseph Bradley wrote
> Some of this discussion seems valuable enough to preserve on the JIRA; can
> we move it there (and copy any relevant discussions from previous emails
> as needed)?
>
> On Wed, Feb 25, 2015 at 10:35 AM, <mike@> wrote:


Re: Have Friedman's glmnet algo running in Spark

2015-08-04 Thread Patrick
I have a follow up on this: 
I see on JIRA that the idea of having a GLMNET implementation was more or
less abandoned, since an OWLQN implementation was chosen to construct a model
using L1/L2 regularization.

However, GLMNET has the property of "returning a multitude of models
(corresponding to different values of penalty parameters [for the
regularization])". I think this is not the case in the OWLQN implementation. 
However, this would be really helpful to compare the accuracy of models with
different regParam values. 
As far as I understood, this would avoid a costly cross-validation step
over a possibly large set of regParam values. 
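(One way to see why the path is cheap to produce: each fit is warm-started from the solution at the previous, slightly larger penalty, so most fits need only a few iterations. The sketch below illustrates the idea with plain proximal gradient (ISTA) standing in for the real per-point solver; it is not the GLMNET or OWLQN code, and all names are illustrative.)

```python
import numpy as np

def ista(X, y, lam, beta0, n_iter=100):
    """Proximal gradient for (1/2n)||y - X b||^2 + lam * ||b||_1,
    started from beta0. A stand-in for any single-penalty solver."""
    n = X.shape[0]
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    b = beta0.copy()
    for _ in range(n_iter):
        g = X.T @ (X @ b - y) / n      # gradient of the smooth part
        z = b - g / L                  # gradient step ...
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # ... then prox
    return b

def warm_start_path(X, y, lambdas):
    """Fit every penalty in the grid, seeding each fit with the previous
    solution. lambdas should run from largest to smallest."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        beta = ista(X, y, lam, beta)
        path.append((lam, beta.copy()))
    return path
```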


Joseph Bradley wrote
> Some of this discussion seems valuable enough to preserve on the JIRA; can
> we move it there (and copy any relevant discussions from previous emails
> as
> needed)?
> 
> On Wed, Feb 25, 2015 at 10:35 AM, <

> mike@

> > wrote:





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Have-Friedman-s-glmnet-algo-running-in-Spark-tp10692p13587.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Have Friedman's glmnet algo running in Spark

2015-02-25 Thread Joseph Bradley
Some of this discussion seems valuable enough to preserve on the JIRA; can
we move it there (and copy any relevant discussions from previous emails as
needed)?

On Wed, Feb 25, 2015 at 10:35 AM,  wrote:

> Hi Debasish,
> Any method that generates point solutions to the minimization problem
> could simply be run a number of times to generate the coefficient paths as
> a function of the penalty parameter.  I think the only issues are how easy
> the method is to use and how much training and developer time is required
> to produce an answer.
>
> With regard to training time, Friedman says in his paper that they found
> problems where glmnet would generate the entire coefficient path more
> rapidly than sophisticated single point methods would generate single point
> solutions - not all problems, but some problems.  Ryan Tibshirani (Robert's
> son) who's a professor and researcher at CMU in convex function
> optimization has echoed that assertion for the particular case of the
> elasticnet penalty function (that's from slides of his that are available
> online).  So there's an open question about the training speed that I
> believe we can answer in fairly short order.  I'm eager to explore that.
> Does OWLQN do a pass through the data for each iteration?  The linear
> version of GLMNET does not.  On the other hand, OWLQN may be able to take
> coarser steps through parameter space.
>
> With regard to developer time, glmnet doesn't require the user to supply a
> starting point for the penalty parameter.  It calculates the starting
> point.  That makes it completely automatic.  You've probably been through
> the process of manually searching regularization parameter space with SVM.
> Pick out a set of regularization parameter values like 10 raised to the (-2
> through +5 in steps of 1).  See if there's a minimum in the range and if
> not shift to the right or left.  One of the reasons I pick up glmnet first
> for a new problem is that you just drop in the training set and out pop the
> coefficient curves.  Usually the defaults work.  One time out of 50 (or so)
> it doesn't converge.  It alerts you that it didn't converge and you change
> one parameter and rerun.  If you also drop in a test set then it even picks
> the optimum solution and produces an estimate of out-of-sample error.
>
> We're going to make some speed/scaling runs on the synthetic data sets (in
> a range of sizes) that are used in Spark for testing linear regression.  We
> need some wider data sets.  Joseph mentioned some that we'll look at.  I've
> got a gene expression data set that's 30k wide by 15k tall.  That takes a
> few hours to train using R version of glmnet.  We're also talking to some
> biology friends to find other interesting data sets.
>
> I really am eager to see the comparisons.  And happy to help you tailor
> OWLQN to generate coefficient paths.  We might be able to produce a hybrid
> of Friedman's algorithm using his basic algorithm outline but substituting
> OWLQN for his round-robin coordinate descent.  But I'm a little concerned
> that it's the round-robin coordinate descent that makes it possible to skip
> passing through the full data set for 4 out of 5 iterations.  We might be
> able to work a way around that.
>
> I'm just eager to have parallel versions of the tools available.  I'll
> keep you posted on our results.  We should aim for running one another's
> code.  I'll check with my colleagues and see when we'll have something we
> can hand out.  We've delayed putting together a release version in favor of
> generating some scaling results, as Joseph suggested.  Discussions like
> this may have some impact on what the release code looks like.
> Mike
>
>
>
>
>
> -Original Message-
> *From:* Debasish Das [mailto:debasish.da...@gmail.com]
> *Sent:* Wednesday, February 25, 2015 08:50 AM
> *To:* 'Joseph Bradley'
> *Cc:* m...@mbowles.com, 'dev'
> *Subject:* Re: Have Friedman's glmnet algo running in Spark
>
> Any reason why the regularization path cannot be implemented using current
> owlqn pr ?
>
> We can change owlqn in breeze to fit your needs...
>  On Feb 24, 2015 3:27 PM, "Joseph Bradley"  wrote:
>
>> Hi Mike,
>>
>> I'm not aware of a "standard" big dataset, but there are a number
>> available:
>> * The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
>> instances but not # features):
>> www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
>> * I've used this text dataset from which one can generate lots of n-gram
>> features (but not many i

Re: Have Friedman's glmnet algo running in Spark

2015-02-25 Thread mike
 Hi Debasish,
Any method that generates point solutions to the minimization problem could 
simply be run a number of times to generate the coefficient paths as a function 
of the penalty parameter. I think the only issues are how easy the method is to 
use and how much training and developer time is required to produce an answer.

With regard to training time, Friedman says in his paper that they found 
problems where glmnet would generate the entire coefficient path more rapidly 
than sophisticated single point methods would generate single point solutions - 
not all problems, but some problems. Ryan Tibshirani (Robert's son) who's a 
professor and researcher at CMU in convex function optimization has echoed that 
assertion for the particular case of the elasticnet penalty function (that's 
from slides of his that are available online). So there's an open question 
about the training speed that I believe we can answer in fairly short order. 
I'm eager to explore that. Does OWLQN do a pass through the data for each 
iteration? The linear version of GLMNET does not. On the other hand, OWLQN may 
be able to take coarser steps through parameter space.

With regard to developer time, glmnet doesn't require the user to supply a 
starting point for the penalty parameter. It calculates the starting point. 
That makes it completely automatic. You've probably been through the process of 
manually searching regularization parameter space with SVM. Pick out a set of 
regularization parameter values like 10 raised to the (-2 through +5 in steps 
of 1). See if there's a minimum in the range and if not shift to the right or 
left. One of the reasons I pick up glmnet first for a new problem is that you 
just drop in the training set and out pop the coefficient curves. Usually the 
defaults work. One time out of 50 (or so) it doesn't converge. It alerts you 
that it didn't converge and you change one parameter and rerun. If you also 
drop in a test set then it even picks the optimum solution and produces an 
estimate of out-of-sample error.
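The last step Mike describes, picking the optimum from the path with a test set, is itself only a few lines once the path exists. Something like the following (an illustrative sketch; the names are ours, not from any of the implementations discussed):

```python
import numpy as np

def pick_best(path, X_val, y_val):
    """path: list of (lam, coef) pairs, e.g. most to least penalized.
    Returns the pair whose coefficients give the lowest validation MSE."""
    def mse(coef):
        resid = y_val - X_val @ coef
        return float(resid @ resid) / len(y_val)
    return min(path, key=lambda pair: mse(pair[1]))
```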

We're going to make some speed/scaling runs on the synthetic data sets (in a 
range of sizes) that are used in Spark for testing linear regression. We need 
some wider data sets. Joseph mentioned some that we'll look at. I've got a gene 
expression data set that's 30k wide by 15k tall. That takes a few hours to 
train using R version of glmnet. We're also talking to some biology friends to 
find other interesting data sets.

I really am eager to see the comparisons. And happy to help you tailor OWLQN to 
generate coefficient paths. We might be able to produce a hybrid of Friedman's 
algorithm using his basic algorithm outline but substituting OWLQN for his 
round-robin coordinate descent. But I'm a little concerned that it's the 
round-robin coordinate descent that makes it possible to skip passing through 
the full data set for 4 out of 5 iterations. We might be able to work a way 
around that.

I'm just eager to have parallel versions of the tools available. I'll keep you 
posted on our results. We should aim for running one another's code. I'll check 
with my colleagues and see when we'll have something we can hand out. We've 
delayed putting together a release version in favor of generating some scaling 
results, as Joseph suggested. Discussions like this may have some impact on 
what the release code looks like.
Mike






-Original Message-
From: Debasish Das [mailto:debasish.da...@gmail.com]
Sent: Wednesday, February 25, 2015 08:50 AM
To: 'Joseph Bradley'
Cc: m...@mbowles.com, 'dev'
Subject: Re: Have Friedman's glmnet algo running in Spark

Any reason why the regularization path cannot be implemented using current 
owlqn pr ?
We can change owlqn in breeze to fit your needs...

On Feb 24, 2015 3:27 PM, "Joseph Bradley"  wrote:
Hi Mike,

I'm not aware of a "standard" big dataset, but there are a number available:
* The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
instances but not # features):
www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
* I've used this text dataset from which one can generate lots of n-gram
features (but not many instances): http://www.ark.cs.cmu.edu/10K/
* I've seen some papers use the KDD Cup datasets, which might be the best
option I know of. The KDD Cup 2012 track 2 one seems promising.

Good luck!
Joseph

On Tue, Feb 24, 2015 at 1:56 PM,  wrote:

> Joseph,
> Thanks for your reply. We'll take the steps you suggest - generate some
> timing comparisons and post them in the GLMNET JIRA with a link from the
> OWLQN JIRA.
>
> We've got the regression version of GLMNET programmed. The regression
> version only requires a pass through the data each time the active set of
> coefficients changes. That'

Re: Have Friedman's glmnet algo running in Spark

2015-02-25 Thread Debasish Das
Any reason why the regularization path cannot be implemented using current
owlqn pr ?

We can change owlqn in breeze to fit your needs...
 On Feb 24, 2015 3:27 PM, "Joseph Bradley"  wrote:

> Hi Mike,
>
> I'm not aware of a "standard" big dataset, but there are a number
> available:
> * The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
> instances but not # features):
> www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
> * I've used this text dataset from which one can generate lots of n-gram
> features (but not many instances): http://www.ark.cs.cmu.edu/10K/
> * I've seen some papers use the KDD Cup datasets, which might be the best
> option I know of.  The KDD Cup 2012 track 2 one seems promising.
>
> Good luck!
> Joseph
>
> On Tue, Feb 24, 2015 at 1:56 PM,  wrote:
>
> > Joseph,
> > Thanks for your reply.  We'll take the steps you suggest - generate some
> > timing comparisons and post them in the GLMNET JIRA with a link from the
> > OWLQN JIRA.
> >
> > We've got the regression version of GLMNET programmed.  The regression
> > version only requires a pass through the data each time the active set of
> > coefficients changes.  That's usually less than or equal to the number of
> > decrements in the penalty coefficient (typical default = 100).  The
> > intermediate iterations can be done using results of previous passes
> > through the full data set.  We're expecting the number of data passes
> will
> > be independent of either number of rows or columns in the data set.
> We're
> > eager to demonstrate this scaling.  Do you have any suggestions regarding
> > data sets for large scale regression problems?  It would be nice to
> > demonstrate scaling for both number of rows and number of columns.
> >
> > Thanks for your help.
> > Mike
> >
> > -Original Message-
> > *From:* Joseph Bradley [mailto:jos...@databricks.com]
> > *Sent:* Sunday, February 22, 2015 06:48 PM
> > *To:* m...@mbowles.com
> > *Cc:* dev@spark.apache.org
> > *Subject:* Re: Have Friedman's glmnet algo running in Spark
> >
> > Hi Mike, glmnet has definitely been very successful, and it would be great
> > to see how we can improve optimization in MLlib! There is some related work
> > ongoing; here are the JIRAs: GLMNET implementation in Spark,
> > LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package.
> > The GLMNET JIRA has actually been closed in favor of the latter JIRA.
> > However, if you're getting good results in your experiments, could you
> > please post them on the GLMNET JIRA and link them from the other JIRA? If
> > it's faster and more scalable, that would be great to find out. As far as
> > where the code should go and the APIs, that can be discussed on the JIRA. I
> > hope this helps, and I'll keep an eye out for updates on the JIRAs!
> > Joseph
> >
> > On Thu, Feb 19, 2015 at 10:59 AM,  wrote:
> >
> > > Dev List,
> > > A couple of colleagues and I have gotten several versions of glmnet algo
> > > coded and running on Spark RDD. glmnet algo (
> > > http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> > > generating coefficient paths solving penalized regression with elastic net
> > > penalties. The algorithm runs fast by taking an approach that generates
> > > solutions for a wide variety of penalty parameters. We're able to integrate
> > > into Mllib class structure a couple of different ways. The algorithm may
> > > fit better into the new pipeline structure since it naturally returns a
> > > multitude of models (corresponding to different values of penalty
> > > parameters). That appears to fit better into pipeline than Mllib linear
> > > regression (for example).
> > >
> > > We've got regression running with the speed optimizations that Friedman
> > > recommends. We'll start working on the logistic regression version next.
> > >
> > > We're eager to make the code available as open source and would like to
> > > get some feedback about how best to do that. Any thoughts?
> > > Mike Bowles.
> >
> >
>


Re: Have Friedman's glmnet algo running in Spark

2015-02-24 Thread Joseph Bradley
Hi Mike,

I'm not aware of a "standard" big dataset, but there are a number available:
* The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
instances but not # features):
www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
* I've used this text dataset from which one can generate lots of n-gram
features (but not many instances): http://www.ark.cs.cmu.edu/10K/
* I've seen some papers use the KDD Cup datasets, which might be the best
option I know of.  The KDD Cup 2012 track 2 one seems promising.

Good luck!
Joseph

On Tue, Feb 24, 2015 at 1:56 PM,  wrote:

> Joseph,
> Thanks for your reply.  We'll take the steps you suggest - generate some
> timing comparisons and post them in the GLMNET JIRA with a link from the
> OWLQN JIRA.
>
> We've got the regression version of GLMNET programmed.  The regression
> version only requires a pass through the data each time the active set of
> coefficients changes.  That's usually less than or equal to the number of
> decrements in the penalty coefficient (typical default = 100).  The
> intermediate iterations can be done using results of previous passes
> through the full data set.  We're expecting the number of data passes will
> be independent of either number of rows or columns in the data set.  We're
> eager to demonstrate this scaling.  Do you have any suggestions regarding
> data sets for large scale regression problems?  It would be nice to
> demonstrate scaling for both number of rows and number of columns.
>
> Thanks for your help.
> Mike
>
> -Original Message-
> *From:* Joseph Bradley [mailto:jos...@databricks.com]
> *Sent:* Sunday, February 22, 2015 06:48 PM
> *To:* m...@mbowles.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Have Friedman's glmnet algo running in Spark
>
> Hi Mike, glmnet has definitely been very successful, and it would be great
> to see how we can improve optimization in MLlib! There is some related work
> ongoing; here are the JIRAs: GLMNET implementation in Spark
> LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
> The GLMNET JIRA has actually been closed in favor of the latter JIRA.
> However, if you're getting good results in your experiments, could you
> please post them on the GLMNET JIRA and link them from the other JIRA? If
> it's faster and more scalable, that would be great to find out. As far as
> where the code should go and the APIs, that can be discussed on the JIRA. I
> hope this helps, and I'll keep an eye out for updates on the JIRAs!
>
> Joseph
>
> On Thu, Feb 19, 2015 at 10:59 AM,  wrote:
>
> > Dev List,
> > A couple of colleagues and I have gotten several versions of glmnet algo
> > coded and running on Spark RDD. glmnet algo (
> > http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> > generating coefficient paths solving penalized regression with elastic net
> > penalties. The algorithm runs fast by taking an approach that generates
> > solutions for a wide variety of penalty parameters. We're able to integrate
> > into Mllib class structure a couple of different ways. The algorithm may
> > fit better into the new pipeline structure since it naturally returns a
> > multitude of models (corresponding to different values of penalty
> > parameters). That appears to fit better into pipeline than Mllib linear
> > regression (for example).
> >
> > We've got regression running with the speed optimizations that Friedman
> > recommends. We'll start working on the logistic regression version next.
> >
> > We're eager to make the code available as open source and would like to
> > get some feedback about how best to do that. Any thoughts?
> > Mike Bowles.
>
>


Re: Have Friedman's glmnet algo running in Spark

2015-02-24 Thread mike
 Joseph,
Thanks for your reply. We'll take the steps you suggest - generate some timing 
comparisons and post them in the GLMNET JIRA with a link from the OWLQN JIRA.

We've got the regression version of GLMNET programmed. The regression version 
only requires a pass through the data each time the active set of coefficients 
changes. That's usually less than or equal to the number of decrements in the 
penalty coefficient (typical default = 100). The intermediate iterations can be 
done using results of previous passes through the full data set. We're 
expecting the number of data passes will be independent of either number of 
rows or columns in the data set. We're eager to demonstrate this scaling. Do 
you have any suggestions regarding data sets for large scale regression 
problems? It would be nice to demonstrate scaling for both number of rows and 
number of columns.
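The pass-counting argument can be made concrete with the "covariance updates" from Friedman's paper: X'y is computed in one pass up front, and a column of the Gram matrix costs one further data pass, paid only when that feature first enters the active set; every inner coordinate-descent sweep then runs from the cache alone. A hypothetical single-machine numpy sketch of the bookkeeping (not our Spark implementation; names are ours):

```python
import numpy as np

class CovarianceCache:
    """Lazily caches X'y and columns of X'X so that coordinate updates
    need only p-sized arrays between data passes."""
    def __init__(self, X, y):
        self.X = X
        self.n = X.shape[0]
        self.Xty = X.T @ y  # one full pass over the data
        self.gram = {}      # j -> X' x_j, filled on first activation

    def gram_col(self, j):
        if j not in self.gram:  # a data pass, only when j becomes active
            self.gram[j] = self.X.T @ self.X[:, j]
        return self.gram[j]

    def resid_corr(self, j, beta, active):
        """(1/n) x_j . (y - X beta), the quantity soft-thresholded in each
        update, computed from cached inner products only; 'active' must
        include every index with a nonzero coefficient."""
        g = self.Xty[j]
        for k in active:
            if beta[k] != 0.0:
                g -= self.gram_col(k)[j] * beta[k]
        return g / self.n
```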

Thanks for your help.
Mike

-Original Message-
From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Sunday, February 22, 2015 06:48 PM
To: m...@mbowles.com
Cc: dev@spark.apache.org
Subject: Re: Have Friedman's glmnet algo running in Spark

Hi Mike,glmnet has definitely been very successful, and it would be great to 
seehow we can improve optimization in MLlib! There is some related workongoing; 
here are the JIRAs:GLMNET implementation in SparkLinearRegression with L1/L2 
(elastic net) using OWLQN in new ML packageThe GLMNET JIRA has actually been 
closed in favor of the latter JIRA.However, if you're getting good results in 
your experiments, could youplease post them on the GLMNET JIRA and link them 
from the other JIRA? Ifit's faster and more scalable, that would be great to 
find out.As far as where the code should go and the APIs, that can be discussed 
onthe JIRA.I hope this helps, and I'll keep an eye out for updates on the 
JIRAs!JosephOn Thu, Feb 19, 2015 at 10:59 AM,  wrote:> Dev List,> A couple of 
colleagues and I have gotten several versions of glmnet algo> coded and running 
on Spark RDD. glmnet algo (> http://www.jstatsoft.org/v33/i01/paper) is a very 
fast algorithm for> generating coefficient paths solving penalized regression 
with elastic net> penalties. The algorithm runs fast by taking an approach that 
generates> solutions for a wide variety of penalty parameter. We're able to 
integrate> into Mllib class structure a couple of different ways. The algorithm 
may> fit better into the new pipeline structure since it naturally returns a> 
multitide of models (corresponding to different vales of penalty> parameters). 
That appears to fit better into pipeline than Mllib linear> regression (for 
example).>> We've got regression running with the speed optimizations that 
Friedman> recommends. We'll start working on the logistic regression version 
next.>> We're eager to make the code available as open source and would like 
to> get some feedback about how best to do that. Any thoughts?> Mike Bowles.>>>


Re: Have Friedman's glmnet algo running in Spark

2015-02-22 Thread Joseph Bradley
Hi Mike,

glmnet has definitely been very successful, and it would be great to see
how we can improve optimization in MLlib!  There is some related work
ongoing; here are the JIRAs:

GLMNET implementation in Spark


LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package


The GLMNET JIRA has actually been closed in favor of the latter JIRA.
However, if you're getting good results in your experiments, could you
please post them on the GLMNET JIRA and link them from the other JIRA?  If
it's faster and more scalable, that would be great to find out.

As far as where the code should go and the APIs, that can be discussed on
the JIRA.

I hope this helps, and I'll keep an eye out for updates on the JIRAs!

Joseph


On Thu, Feb 19, 2015 at 10:59 AM,  wrote:

> Dev List,
> A couple of colleagues and I have gotten several versions of glmnet algo
> coded and running on Spark RDD. glmnet algo (
> http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> generating coefficient paths solving penalized regression with elastic net
> penalties. The algorithm runs fast by taking an approach that generates
> solutions for a wide variety of penalty parameters. We're able to integrate
> into Mllib class structure a couple of different ways. The algorithm may
> fit better into the new pipeline structure since it naturally returns a
> multitude of models (corresponding to different values of penalty
> parameters). That appears to fit better into pipeline than Mllib linear
> regression (for example).
>
> We've got regression running with the speed optimizations that Friedman
> recommends. We'll start working on the logistic regression version next.
>
> We're eager to make the code available as open source and would like to
> get some feedback about how best to do that. Any thoughts?
> Mike Bowles.
>
>
>