Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Also adding dev list in case anyone else has ideas / views. On Sat, 12 Mar 2016 at 12:52, Nick Pentreath wrote: > Thanks for the feedback. > > I think Spark can certainly meet your use case when your data size scales > up, as the actual model dimension is very small - you will need to use > thos

Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Thanks for the feedback. I think Spark can certainly meet your use case when your data size scales up, as the actual model dimension is very small - you will need to use those indexers or some other mapping mechanism. There is ongoing work for Spark 2.0 to make it easier to use models outside of

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
Thanks for the pointer to those indexers, those are some good examples. A good way to go for the trainer and any scoring done in Spark. I will definitely have to deal with scoring in non-Spark systems though. I think I will need to scale up beyond what single-node liblinear can practically provide

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Nick Pentreath
Ok, I think I understand things better now. For Spark's current implementation, you would need to map those features as you mention. You could also use say StringIndexer -> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with the mapping and training (e.g. http://spark.apache.o

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath wrote: > Would you mind letting us know the # training examples in the datasets? > Also, what do your features look like? Are they text, categorical etc? You > mention that most rows only have a few features, and all rows together have > a few 10,00

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Nick Pentreath
Would you mind letting us know the # training examples in the datasets? Also, what do your features look like? Are they text, categorical etc? You mention that most rows only have a few features, and all rows together have a few 10,000s features, yet your max feature value is 20 million. How are yo

Re: Spark ML - Scaling logistic regression for many features

2016-03-10 Thread Daniel Siegmann
Hi Nick, Thanks for the feedback and the pointers. I tried coalescing to fewer partitions and improved the situation dramatically. As you suggested, it is communication overhead dominating the overall runtime. The training run I mentioned originally had 900 partitions. Each tree aggregation has t

Re: Spark ML - Scaling logistic regression for many features

2016-03-09 Thread Nick Pentreath
Hi Daniel The bottleneck in Spark ML is most likely (a) the fact that the weight vector itself is dense, and (b) the related communication via the driver. A tree aggregation mechanism is used for computing gradient sums (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apac

Re: Spark ML - Scaling logistic regression for many features

2016-03-08 Thread Daniel Siegmann
Just for the heck of it I tried the old MLlib implementation, but it had the same scalability problem. Anyone familiar with the logistic regression implementation who could weigh in? On Mon, Mar 7, 2016 at 5:35 PM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > We're using SparseVector

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Michał Zieliński
We're using SparseVector columns in a DataFrame, so they are definitely supported. But maybe for LR some implicit magic is happening inside. On 7 March 2016 at 23:04, Devin Jones wrote: > I could be wrong but its possible that toDF populates a dataframe which I > understand do not support sparse

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Devin Jones
I could be wrong but its possible that toDF populates a dataframe which I understand do not support sparsevectors at the moment. If you use the MlLib logistic regression implementation (not ml) you can pass the RDD[LabeledPoint] data type directly to the learner. http://spark.apache.org/docs/late

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
Yes, it is a SparseVector. Most rows only have a few features, and all the rows together only have tens of thousands of features, but the vector size is ~ 20 million because that is the largest feature. On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones wrote: > Hi, > > Which data structure are you usi

Re: Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Devin Jones
Hi, Which data structure are you using to train the model? If you haven't tried yet, you should consider the SparseVector http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann wrote: > I recently tri

Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
I recently tried to a model using org.apache.spark.ml.classification.LogisticRegression on a data set where the feature vector size was around ~20 million. It did *not* go well. It took around 10 hours to train on a substantial cluster. Additionally, it pulled a lot data back to the driver - I even