Some of this discussion seems valuable enough to preserve on the JIRA; can we move it there (and copy any relevant discussions from previous emails as needed)?
On Wed, Feb 25, 2015 at 10:35 AM, <m...@mbowles.com> wrote:

> Hi Debasish,
>
> Any method that generates point solutions to the minimization problem could simply be run a number of times to generate the coefficient paths as a function of the penalty parameter. I think the only issues are how easy the method is to use and how much training and developer time is required to produce an answer.
>
> With regard to training time, Friedman says in his paper that they found problems where glmnet would generate the entire coefficient path more rapidly than sophisticated single-point methods would generate single point solutions - not all problems, but some. Ryan Tibshirani (Robert's son), a professor and researcher at CMU in convex optimization, has echoed that assertion for the particular case of the elastic net penalty function (that's from slides of his that are available online). So there's an open question about training speed that I believe we can answer in fairly short order. I'm eager to explore that. Does OWLQN do a pass through the data for each iteration? The linear version of GLMNET does not. On the other hand, OWLQN may be able to take coarser steps through parameter space.
>
> With regard to developer time, glmnet doesn't require the user to supply a starting point for the penalty parameter. It calculates the starting point itself, which makes it completely automatic. You've probably been through the process of manually searching regularization parameter space with SVM: pick a set of regularization parameter values like 10^-2 through 10^5 in steps of a decade, see whether there's a minimum in the range, and if not, shift the range right or left. One of the reasons I pick up glmnet first for a new problem is that you just drop in the training set and out pop the coefficient curves. Usually the defaults work. One time out of 50 (or so) it doesn't converge; it alerts you that it didn't converge, and you change one parameter and rerun. If you also drop in a test set, it even picks the optimum solution and produces an estimate of out-of-sample error.
>
> We're going to make some speed/scaling runs on the synthetic data sets (in a range of sizes) that are used in Spark for testing linear regression. We need some wider data sets; Joseph mentioned some that we'll look at. I've got a gene expression data set that's 30k wide by 15k tall - that takes a few hours to train using the R version of glmnet. We're also talking to some biology friends to find other interesting data sets.
>
> I really am eager to see the comparisons, and happy to help you tailor OWLQN to generate coefficient paths. We might be able to produce a hybrid of Friedman's algorithm, using his basic algorithm outline but substituting OWLQN for his round-robin coordinate descent. But I'm a little concerned that it's the round-robin coordinate descent that makes it possible to skip passing through the full data set for 4 out of 5 iterations. We might be able to work a way around that.
>
> I'm just eager to have parallel versions of the tools available. I'll keep you posted on our results. We should aim for running one another's code; I'll check with my colleagues and see when we'll have something we can hand out. We've delayed putting together a release version in favor of generating some scaling results, as Joseph suggested. Discussions like this may have some impact on what the release code looks like.
>
> Mike
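(Adding a sketch here so it can carry over to the JIRA with the thread: the path idea Mike describes is just a point solver rerun over a decreasing penalty grid, warm-starting each solve from the previous solution. `solveAt` below is a hypothetical stand-in for an OWLQN-style point solver, not an existing MLlib or Breeze API.)

    // Sketch: coefficient path via warm starts over a decreasing penalty grid.
    // `solveAt` is a hypothetical (lambda, warm start) => beta point solver.
    def coefficientPath(
        lambdas: Seq[Double],    // decreasing penalty grid
        numFeatures: Int,
        solveAt: (Double, Array[Double]) => Array[Double]): Seq[(Double, Array[Double])] = {
      var warm = Array.fill(numFeatures)(0.0)
      lambdas.map { lambda =>
        warm = solveAt(lambda, warm)   // warm start keeps successive solves cheap
        (lambda, warm)
      }
    }

    // A manual grid like the SVM search Mike mentions: 10^-2 .. 10^5 in decade
    // steps, sorted descending for the path.
    val grid = (-2 to 5).map(e => math.pow(10.0, e)).sortBy(-_)

    // glmnet's automatic starting point (per the paper): lambda_max = max_j |<x_j, y>| / (N * alpha)
    // is the smallest penalty at which all coefficients are zero; the default path is
    // ~100 log-spaced values from lambda_max down to a small fraction of it.

With warm starts, each grid point typically needs only a few solver iterations, which is part of why generating the full path can be competitive with a single cold-start solve.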
> -----Original Message-----
> From: Debasish Das [mailto:debasish.da...@gmail.com]
> Sent: Wednesday, February 25, 2015 08:50 AM
> To: 'Joseph Bradley'
> Cc: m...@mbowles.com, 'dev'
> Subject: Re: Have Friedman's glmnet algo running in Spark
>
> Any reason why the regularization path cannot be implemented using the current OWLQN PR?
>
> We can change OWLQN in Breeze to fit your needs...
>
> On Feb 24, 2015 3:27 PM, "Joseph Bradley" <jos...@databricks.com> wrote:
>
>> Hi Mike,
>>
>> I'm not aware of a "standard" big dataset, but there are a number available:
>> * The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in # instances but not # features): www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
>> * I've used this text dataset from which one can generate lots of n-gram features (but not many instances): http://www.ark.cs.cmu.edu/10K/
>> * I've seen some papers use the KDD Cup datasets, which might be the best option I know of. The KDD Cup 2012 track 2 one seems promising.
>>
>> Good luck!
>> Joseph
>>
>> On Tue, Feb 24, 2015 at 1:56 PM, <m...@mbowles.com> wrote:
>>
>>> Joseph,
>>> Thanks for your reply. We'll take the steps you suggest - generate some timing comparisons and post them in the GLMNET JIRA with a link from the OWLQN JIRA.
>>>
>>> We've got the regression version of GLMNET programmed. The regression version only requires a pass through the data each time the active set of coefficients changes. That's usually less than or equal to the number of decrements in the penalty coefficient (typical default = 100). The intermediate iterations can be done using results of previous passes through the full data set. We're expecting the number of data passes to be independent of both the number of rows and the number of columns in the data set, and we're eager to demonstrate this scaling. Do you have any suggestions regarding data sets for large-scale regression problems? It would be nice to demonstrate scaling in both the number of rows and the number of columns.
>>>
>>> Thanks for your help.
>>> Mike
>>>
>>> -----Original Message-----
>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>>> Sent: Sunday, February 22, 2015 06:48 PM
>>> To: m...@mbowles.com
>>> Cc: dev@spark.apache.org
>>> Subject: Re: Have Friedman's glmnet algo running in Spark
>>>
>>> Hi Mike, glmnet has definitely been very successful, and it would be great to see how we can improve optimization in MLlib! There is some related work ongoing; here are the JIRAs: "GLMNET implementation in Spark" and "LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package". The GLMNET JIRA has actually been closed in favor of the latter JIRA. However, if you're getting good results in your experiments, could you please post them on the GLMNET JIRA and link them from the other JIRA? If it's faster and more scalable, that would be great to find out. As far as where the code should go and the APIs, that can be discussed on the JIRA. I hope this helps, and I'll keep an eye out for updates on the JIRAs!
>>> Joseph
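(Another note worth preserving on the JIRA: Mike's point above about touching the full data set only when the active set changes matches what Friedman's paper calls "covariance updates", if I'm reading it right - cache <x_j, y> once and <x_j, x_k> for the active features k, and the inner coordinate-descent sweeps then run entirely on those cached inner products. A rough sketch with hypothetical names, not MLlib API:)

    // Rough sketch of glmnet's "covariance updates": with <x_j, y> cached for
    // all j and <x_j, x_k> cached for active k, the coordinate gradient
    //   (1/N)<x_j, r> = (1/N)( <x_j, y> - sum over active k of <x_j, x_k> * beta_k )
    // needs no pass over the data; a full pass is required only to extend the
    // cache when a new feature enters the active set.
    def innerGradient(j: Int,
                      xty: Array[Double],            // <x_j, y> for all j, one pass
                      gram: Map[Int, Array[Double]], // k -> <x_., x_k>, active k only
                      beta: Map[Int, Double],        // current nonzero coefficients
                      n: Long): Double = {
      val cross = beta.iterator.map { case (k, bk) => gram(k)(j) * bk }.sum
      (xty(j) - cross) / n
    }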
>>> On Thu, Feb 19, 2015 at 10:59 AM, <m...@mbowles.com> wrote:
>>>
>>>> Dev List,
>>>> A couple of colleagues and I have gotten several versions of the glmnet algo coded and running on Spark RDDs. The glmnet algo (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for generating coefficient paths that solve penalized regression with elastic net penalties. The algorithm runs fast by taking an approach that generates solutions for a wide range of penalty parameter values. We're able to integrate it into the MLlib class structure in a couple of different ways. The algorithm may fit better into the new pipeline structure, since it naturally returns a multitude of models (corresponding to different values of the penalty parameters); that appears to fit pipelines better than MLlib linear regression (for example) does.
>>>>
>>>> We've got regression running with the speed optimizations that Friedman recommends. We'll start working on the logistic regression version next.
>>>>
>>>> We're eager to make the code available as open source and would like to get some feedback about how best to do that. Any thoughts?
>>>> Mike Bowles.
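(For completeness when this moves to the JIRA: the problem being solved and the inner update, taken from the paper linked above, with predictors assumed standardized. The Scala below is an illustrative sketch, not proposed API.)

    // Elastic net objective from the glmnet paper (predictors standardized):
    //   (1/2N) * sum_i (y_i - x_i . beta)^2
    //     + lambda * ( alpha * ||beta||_1 + (1 - alpha)/2 * ||beta||_2^2 )
    // Soft-thresholding operator S(z, g) = sign(z) * max(|z| - g, 0).
    def softThreshold(z: Double, gamma: Double): Double =
      math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

    // One round-robin coordinate-descent update for coordinate j:
    //   beta_j <- S( (1/N)<x_j, r> + beta_j, lambda * alpha ) / (1 + lambda * (1 - alpha))
    // where r = y - X * beta is the current residual.
    def updateCoordinate(xjDotR: Double, betaJ: Double, n: Long,
                         lambda: Double, alpha: Double): Double =
      softThreshold(xjDotR / n + betaJ, lambda * alpha) / (1.0 + lambda * (1.0 - alpha))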