I recently tried to train a model using
org.apache.spark.ml.classification.LogisticRegression on a data set where
the feature vector size was around ~20 million. It did *not* go well. It
took around 10 hours to train on a substantial cluster. Additionally, it
pulled a lot of data back to the driver - I eventually set --conf
spark.driver.memory=128g --conf spark.driver.maxResultSize=112g when
submitting.

Attempting the same application on the same cluster with the feature vector
size reduced to 100k took only ~9 minutes. Clearly there is an issue with
scaling to large numbers of features. I'm not doing anything fancy in my
app, here's the relevant code:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setRegParam(1)
val model = lr.fit(trainingSet.toDF())

In comparison, a coworker trained a logistic regression model on her
*laptop* using the Java library liblinear in just a few minutes. That's
with the same ~20 million-dimensional feature vectors. This suggests to me there is
some issue with Spark ML's implementation of logistic regression which is
limiting its scalability.

Note that my feature vectors are *very* sparse. The maximum feature index is
around 20 million, but I think only tens of thousands of features are actually
present.
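
To be concrete, each feature vector looks roughly like the sketch below (the
indices and values are made up, and the import assumes Spark 2.x's ml.linalg
package - under 1.x it would be mllib.linalg):

import org.apache.spark.ml.linalg.Vectors

// Nominal dimension is ~20 million, but only a handful of entries are non-zero.
val features = Vectors.sparse(20000000,
  Array(3, 1048576, 19999999),  // feature indices (illustrative)
  Array(1.0, 0.5, 2.0))         // corresponding values (illustrative)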

Has anyone run into this? Any idea where the bottleneck is or how this
problem might be solved?

One solution, of course, is to implement some dimensionality reduction. I'd
really like to avoid this, as it's just another thing to deal with - it's not
so hard to put it into the trainer, but then anything doing scoring will need
the same logic. Unless Spark ML supports this out of the box? Is there an easy
way to save / load a model along with the dimensionality reduction logic, so
that when transform is called on the model it handles the dimensionality
reduction transparently?
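
If Spark ML does support that, I imagine it would look something like the
sketch below - wrapping the reduction step and the classifier in a Pipeline so
the fitted PipelineModel can be saved and reloaded as one unit (the
ChiSqSelector, its parameters, and the path are just placeholders, not
something I've actually run):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.ChiSqSelector

// Hypothetical reduction stage: keep only the most relevant features.
val selector = new ChiSqSelector()
  .setNumTopFeatures(50000)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val lr = new LogisticRegression()
  .setFeaturesCol("selectedFeatures")
  .setRegParam(1)

// Fit both stages together; the resulting PipelineModel applies the
// reduction automatically whenever transform() is called.
val pipeline = new Pipeline().setStages(Array(selector, lr))
val model = pipeline.fit(trainingSet.toDF())

model.write.overwrite().save("/some/model/path")
val reloaded = PipelineModel.load("/some/model/path")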

Any advice would be appreciated.

~Daniel Siegmann
