Hi Mike,

I'm not aware of a "standard" big dataset, but there are a number available:

* The YearPredictionMSD dataset from the LIBSVM datasets is sizeable (in # instances but not # features): www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
* I've used this text dataset from which one can generate lots of n-gram features (but not many instances): http://www.ark.cs.cmu.edu/10K/
* I've seen some papers use the KDD Cup datasets, which might be the best option I know of. The KDD Cup 2012 track 2 one seems promising.

Good luck!
Joseph

On Tue, Feb 24, 2015 at 1:56 PM, <m...@mbowles.com> wrote:

> Joseph,
>
> Thanks for your reply. We'll take the steps you suggest - generate some
> timing comparisons and post them in the GLMNET JIRA with a link from the
> OWLQN JIRA.
>
> We've got the regression version of GLMNET programmed. The regression
> version only requires a pass through the data each time the active set of
> coefficients changes. That's usually less than or equal to the number of
> decrements in the penalty coefficient (typical default = 100). The
> intermediate iterations can be done using results of previous passes
> through the full data set. We're expecting the number of data passes will
> be independent of either number of rows or columns in the data set. We're
> eager to demonstrate this scaling. Do you have any suggestions regarding
> data sets for large scale regression problems? It would be nice to
> demonstrate scaling for both number of rows and number of columns.
>
> Thanks for your help.
> Mike
>
> -----Original Message-----
> *From:* Joseph Bradley [mailto:jos...@databricks.com]
> *Sent:* Sunday, February 22, 2015 06:48 PM
> *To:* m...@mbowles.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Have Friedman's glmnet algo running in Spark
>
> Hi Mike,
>
> glmnet has definitely been very successful, and it would be great
> to see how we can improve optimization in MLlib! There is some related work
> ongoing; here are the JIRAs:
>
> * GLMNET implementation in Spark
> * LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
>
> The GLMNET JIRA has actually been closed in favor of the latter JIRA.
> However, if you're getting good results in your experiments, could you
> please post them on the GLMNET JIRA and link them from the other JIRA? If
> it's faster and more scalable, that would be great to find out. As far as
> where the code should go and the APIs, that can be discussed on the JIRA. I
> hope this helps, and I'll keep an eye out for updates on the JIRAs!
>
> Joseph
>
> On Thu, Feb 19, 2015 at 10:59 AM, wrote:
>
> > Dev List,
> >
> > A couple of colleagues and I have gotten several versions of the glmnet
> > algo coded and running on Spark RDD. The glmnet algo
> > (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
> > generating coefficient paths solving penalized regression with elastic net
> > penalties. The algorithm runs fast by taking an approach that generates
> > solutions for a wide variety of penalty parameters. We're able to integrate
> > into the MLlib class structure a couple of different ways. The algorithm may
> > fit better into the new pipeline structure since it naturally returns a
> > multitude of models (corresponding to different values of penalty
> > parameters). That appears to fit better into pipeline than MLlib linear
> > regression (for example).
> >
> > We've got regression running with the speed optimizations that Friedman
> > recommends. We'll start working on the logistic regression version next.
> >
> > We're eager to make the code available as open source and would like to
> > get some feedback about how best to do that. Any thoughts?
> >
> > Mike Bowles.
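
The coefficient-path idea discussed in this thread can be illustrated with a minimal single-machine sketch. It is not the Spark code Mike describes, nor the reference glmnet implementation: it uses plain cyclic coordinate descent with warm starts over a geometrically decreasing penalty sequence, and it leaves out the active-set and data-pass optimizations mentioned in the thread. The function names (`soft_threshold`, `elastic_net_path`), the parameters (`alpha`, `n_lambdas`, `eps`, `tol`, `max_sweeps`) and the assumption that `y` and the columns of `X` are already centered (so no intercept is fit) are all choices made for this illustration.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_path(X, y, alpha=1.0, n_lambdas=100, eps=1e-3,
                     tol=1e-8, max_sweeps=1000):
    """Return [(lambda, coefficients), ...] for a decreasing penalty sequence.

    Minimizes (1/2n)*||y - X b||^2 + lambda*(alpha*||b||_1 + (1-alpha)/2*||b||^2).
    Assumption (hypothetical setup): X is a list of rows, y a list of floats,
    both already centered, so no intercept is estimated.
    """
    if not 0.0 < alpha <= 1.0 or n_lambdas < 2:
        raise ValueError("need 0 < alpha <= 1 and n_lambdas >= 2")
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(v * v for v in col) / n for col in cols]

    # Smallest penalty for which every coefficient is zero.
    lam_max = max(abs(sum(c * yi for c, yi in zip(col, y))) / n
                  for col in cols) / alpha
    ratio = eps ** (1.0 / (n_lambdas - 1))
    lambdas = [lam_max * ratio ** k for k in range(n_lambdas)]

    beta = [0.0] * p
    residual = list(y)  # y - X*beta, kept up to date after every update
    path = []
    for lam in lambdas:  # warm start: beta carries over between penalties
        for _ in range(max_sweeps):
            max_change = 0.0
            for j in range(p):
                if col_sq[j] == 0.0:
                    continue  # constant (all-zero) column, nothing to fit
                old = beta[j]
                # Correlation of column j with the partial residual.
                rho = (sum(c * r for c, r in zip(cols[j], residual)) / n
                       + col_sq[j] * old)
                new = (soft_threshold(rho, lam * alpha)
                       / (col_sq[j] + lam * (1.0 - alpha)))
                if new != old:
                    delta = new - old
                    residual = [r - delta * c
                                for r, c in zip(residual, cols[j])]
                    beta[j] = new
                    max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        path.append((lam, list(beta)))
    return path
```

Calling `elastic_net_path(X, y, alpha=1.0)` yields one coefficient vector per penalty value, which is the "multitude of models" the original message refers to. Each sweep here touches the full data, so this sketch does not show the reduced number of data passes that the thread attributes to the optimized version.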