Re: Have Friedman's glmnet algo running in Spark

mike Wed, 25 Feb 2015 10:37:55 -0800

Hi Debasish,
Any method that generates point solutions to the minimization problem could 
simply be run a number of times to generate the coefficient paths as a function 
of the penalty parameter. I think the only issues are how easy the method is to 
use and how much training and developer time is required to produce an answer.
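
A minimal sketch of that "run it a number of times" idea, in Python. The names 
coefficient_path and toy_solver are hypothetical, and the warm start (seeding 
each solve with the previous solution) is an assumption rather than something 
any particular solver requires. The toy solver is only a stand-in: it returns 
the closed-form lasso solution for an orthonormal design with made-up inner 
products, so any real point solver would replace it.

```python
from typing import Callable, List, Sequence

# A point solver maps (penalty, starting coefficients) -> fitted coefficients.
Solver = Callable[[float, List[float]], List[float]]


def coefficient_path(solve_point: Solver,
                     penalties: Sequence[float],
                     n_features: int) -> List[List[float]]:
    """Build a coefficient path by calling a single-penalty solver repeatedly.

    Penalties are visited from largest to smallest, and each solve is seeded
    with the previous solution (a warm start).
    """
    beta = [0.0] * n_features
    path = []
    for lam in sorted(penalties, reverse=True):
        beta = solve_point(lam, beta)
        path.append(list(beta))
    return path


def toy_solver(lam: float, start: List[float]) -> List[float]:
    """Hypothetical stand-in for a real point solver.

    Closed-form lasso solution for an orthonormal design whose scaled inner
    products X^T y / n are (1.5, -0.4). These numbers are made up, and the
    warm start is ignored because the answer is available in closed form.
    """
    def soft(z: float, t: float) -> float:
        if z > t:
            return z - t
        if z < -t:
            return z + t
        return 0.0

    return [soft(c, lam) for c in (1.5, -0.4)]
```

Calling coefficient_path(toy_solver, [0.1, 1.0, 0.5], 2) returns one 
coefficient vector per penalty, ordered from the largest penalty to the 
smallest.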

With regard to training time, Friedman says in his paper that they found 
problems where glmnet would generate the entire coefficient path more rapidly 
than sophisticated single point methods would generate single point solutions - 
not all problems, but some problems. Ryan Tibshirani (Robert's son), a 
professor at CMU who does research in convex optimization, has echoed that 
assertion for the particular case of the elastic net penalty (that's from 
slides of his that are available online). So there's an open question about 
training speed that I believe we can answer in fairly short order, and I'm 
eager to explore it. Does OWLQN do a pass through the data for each 
iteration? The linear version of glmnet does not. On the other hand, OWLQN may 
be able to take coarser steps through parameter space.
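
To make the "no data pass per iteration" point concrete, here is a minimal 
Python sketch of coordinate descent for the linear case. It assumes the 
objective from Friedman's paper, 
(1/(2n))*||y - X*b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2), with no 
intercept. The function names are hypothetical. As a simplification, the full 
Gram matrix is accumulated in a single up-front pass; glmnet itself computes 
inner products only as variables enter the active set.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def gram_statistics(rows, y):
    """One pass over the data: returns G = X^T X / n and c = X^T y / n."""
    n, p = len(rows), len(rows[0])
    G = [[0.0] * p for _ in range(p)]
    c = [0.0] * p
    for row, target in zip(rows, y):
        for j in range(p):
            c[j] += row[j] * target / n
            for k in range(p):
                G[j][k] += row[j] * row[k] / n
    return G, c


def coordinate_descent(G, c, lam, alpha=1.0, beta=None,
                       tol=1e-10, max_sweeps=1000):
    """Round-robin coordinate descent that reads only G and c, never the rows.

    Assumes every column has nonzero variance, so G[j][j] > 0.
    """
    p = len(c)
    beta = list(beta) if beta is not None else [0.0] * p
    for _ in range(max_sweeps):
        largest_change = 0.0
        for j in range(p):
            # Partial residual correlation, excluding coordinate j itself.
            rho = c[j] - sum(G[j][k] * beta[k] for k in range(p) if k != j)
            new = soft_threshold(rho, lam * alpha) / (G[j][j] + lam * (1.0 - alpha))
            largest_change = max(largest_change, abs(new - beta[j]))
            beta[j] = new
        if largest_change < tol:
            break
    return beta
```

Once gram_statistics has run, coordinate_descent can be called for as many 
penalty values as needed without touching the rows again. The cost is O(p^2) 
memory for G, which is why this form suits tall data better than very wide 
data.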

With regard to developer time, glmnet doesn't require the user to supply a 
starting point for the penalty parameter. It calculates the starting point. 
That makes it completely automatic. You've probably been through the process of 
manually searching regularization parameter space with SVM: pick a set of 
regularization parameter values such as 10^-2 through 10^5 in steps of one 
power of ten, see if there's a minimum in the range and, if not, shift to the 
right or left. One of the reasons I pick up glmnet first for a new problem is 
that you just drop in the training set and out pop the coefficient curves. 
Usually the defaults work. One time out of 50 (or so) it doesn't converge. It 
alerts you that it didn't converge and you change one parameter and rerun. If 
you also drop in a test set then it even picks the optimum solution and 
produces an estimate of out-of-sample error.
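
The automatic starting point can be sketched in a few lines of Python. This 
assumes the elastic net objective from Friedman's paper, 
(1/(2n))*||y - X*b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2), with 
every column and the response centered. Under that assumption the smallest 
penalty that zeroes every coefficient is max_j |x_j . y| / (n*alpha). The 
function names are hypothetical; the defaults of 100 values and a 1e-4 ratio 
mirror the R package's nlambda and lambda.min.ratio for the case of more rows 
than columns.

```python
import math


def lambda_max(columns, y, alpha=1.0):
    """Smallest penalty at which all coefficients are zero.

    `columns` is a list of feature columns. Assumes each column and y are
    centered, and that alpha > 0 (a pure ridge penalty has no finite start).
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    n = len(y)
    largest = max(abs(sum(x_i * y_i for x_i, y_i in zip(col, y)))
                  for col in columns)
    return largest / (n * alpha)


def lambda_grid(lam_max, n_lambda=100, min_ratio=1e-4):
    """Log-spaced penalties from lam_max down to min_ratio * lam_max."""
    step = math.log(min_ratio) / (n_lambda - 1)
    return [lam_max * math.exp(k * step) for k in range(n_lambda)]
```

The grid replaces the manual "shift left or right" search: the largest value 
is known to give the all-zero model, and each smaller value lets more 
coefficients enter.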

We're going to make some speed/scaling runs on the synthetic data sets (in a 
range of sizes) that are used in Spark for testing linear regression. We need 
some wider data sets. Joseph mentioned some that we'll look at. I've got a gene 
expression data set that's 30k wide by 15k tall. That takes a few hours to 
train using the R version of glmnet. We're also talking to some biology friends to 
find other interesting data sets.

I really am eager to see the comparisons. And happy to help you tailor OWLQN to 
generate coefficient paths. We might be able to produce a hybrid of Friedman's 
algorithm using his basic algorithm outline but substituting OWLQN for his 
round-robin coordinate descent. But I'm a little concerned that it's the 
round-robin coordinate descent that makes it possible to skip passing through 
the full data set for 4 out of 5 iterations. We might be able to work a way 
around that.

I'm just eager to have parallel versions of the tools available. I'll keep you 
posted on our results. We should aim for running one another's code. I'll check 
with my colleagues and see when we'll have something we can hand out. We've 
delayed putting together a release version in favor of generating some scaling 
results, as Joseph suggested. Discussions like this may have some impact on 
what the release code looks like.
Mike

-----Original Message-----
From: Debasish Das [mailto:debasish.da...@gmail.com]
Sent: Wednesday, February 25, 2015 08:50 AM
To: 'Joseph Bradley'
Cc: m...@mbowles.com, 'dev'
Subject: Re: Have Friedman's glmnet algo running in Spark

Any reason why the regularization path cannot be implemented using the current 
OWLQN PR?
We can change OWLQN in Breeze to fit your needs...

On Feb 24, 2015 3:27 PM, "Joseph Bradley" <jos...@databricks.com> wrote:
Hi Mike,

I'm not aware of a "standard" big dataset, but there are a number available:
* The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in #
instances but not # features):
www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
* I've used this text dataset from which one can generate lots of n-gram
features (but not many instances): http://www.ark.cs.cmu.edu/10K/
* I've seen some papers use the KDD Cup datasets, which might be the best
option I know of. The KDD Cup 2012 track 2 one seems promising.

Good luck!
Joseph

On Tue, Feb 24, 2015 at 1:56 PM, <m...@mbowles.com> wrote:

> Joseph,
> Thanks for your reply. We'll take the steps you suggest - generate some
> timing comparisons and post them in the GLMNET JIRA with a link from the
> OWLQN JIRA.
>
> We've got the regression version of GLMNET programmed. The regression
> version only requires a pass through the data each time the active set of
> coefficients changes. That's usually less than or equal to the number of
> decrements in the penalty coefficient (typical default = 100). The
> intermediate iterations can be done using results of previous passes
> through the full data set. We're expecting the number of data passes will
> be independent of either number of rows or columns in the data set. We're
> eager to demonstrate this scaling. Do you have any suggestions regarding
> data sets for large scale regression problems? It would be nice to
> demonstrate scaling for both number of rows and number of columns.
>
> Thanks for your help.
> Mike
>
> -----Original Message-----
> *From:* Joseph Bradley [mailto:jos...@databricks.com]
> *Sent:* Sunday, February 22, 2015 06:48 PM
> *To:* m...@mbowles.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Have Friedman's glmnet algo running in Spark
>
> Hi Mike,
>
> glmnet has definitely been very successful, and it would be great to see
> how we can improve optimization in MLlib! There is some related work
> ongoing; here are the JIRAs:
> * GLMNET implementation in Spark
> * LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
>
> The GLMNET JIRA has actually been closed in favor of the latter JIRA.
> However, if you're getting good results in your experiments, could you
> please post them on the GLMNET JIRA and link them from the other JIRA? If
> it's faster and more scalable, that would be great to find out. As far as
> where the code should go and the APIs, that can be discussed on the JIRA.
>
> I hope this helps, and I'll keep an eye out for updates on the JIRAs!
>
> Joseph
>
> On Thu, Feb 19, 2015 at 10:59 AM, wrote:
>
> > Dev List,
> > A couple of colleagues and I have gotten several versions of the glmnet
> > algo coded and running on Spark RDD. The glmnet algo
> > (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> > generating coefficient paths solving penalized regression with elastic
> > net penalties. The algorithm runs fast by taking an approach that
> > generates solutions for a wide variety of penalty parameters. We're able
> > to integrate into the MLlib class structure a couple of different ways.
> > The algorithm may fit better into the new pipeline structure since it
> > naturally returns a multitude of models (corresponding to different
> > values of penalty parameters). That appears to fit better into pipeline
> > than MLlib linear regression (for example).
> >
> > We've got regression running with the speed optimizations that Friedman
> > recommends. We'll start working on the logistic regression version next.
> >
> > We're eager to make the code available as open source and would like to
> > get some feedback about how best to do that. Any thoughts?
> > Mike Bowles.
>
>
