Daniel Siegmann created SPARK-14464:
---------------------------------------

             Summary: Logistic regression performs poorly for very large 
vectors, even when the number of non-zero features is small
                 Key: SPARK-14464
                 URL: https://issues.apache.org/jira/browse/SPARK-14464
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 1.6.0
            Reporter: Daniel Siegmann


When training (a.k.a. fitting) 
org.apache.spark.ml.classification.LogisticRegression, aggregation is done 
using arrays (which are dense structures). This is the case regardless of 
whether the features of each instance are stored in a sparse or a dense vector.

When the feature vectors are very large, performance is poor because there's a 
lot of overhead in transmitting these large arrays across workers.

However, just because the feature vectors are large doesn't mean all these 
features are actually being used. If the actual features are sparse, very large 
arrays are being allocated unnecessarily.
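
To put rough numbers on the overhead (the figures below are illustrative, based 
on the use case described at the end of this issue): a dense aggregation buffer 
for a 20-million-dimensional feature space is roughly 160 MB of doubles, while 
a sparse structure holding ~8,000 active features would be on the order of 
100 KB.

{code:scala}
// Back-of-the-envelope sizes; numbers are illustrative only.
val numFeatures = 20000000   // declared feature-vector size
val numNonZero  = 8000       // features actually present in the data

val denseBufferBytes  = numFeatures.toLong * 8L        // one Double per dimension, ~160 MB
val sparseBufferBytes = numNonZero.toLong * (8L + 4L)  // Double value + Int index, ~96 KB

println(s"dense aggregation buffer: ${denseBufferBytes / 1000000} MB")
println(s"sparse equivalent:        ${sparseBufferBytes / 1000} KB")
{code}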

To address this case, there should be an option to aggregate using sparse 
vectors. It should be up to the user to set this explicitly as a parameter on 
the estimator, since the user should have some idea whether it is necessary for 
their particular case.
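
A rough sketch of how such a parameter might look from the user's side. Note 
that setSparseAggregation does not exist today; the name is invented here 
purely for illustration:

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)
// Proposed (hypothetical) switch telling the optimizer to aggregate
// gradients in a sparse structure instead of a dense array:
// lr.setSparseAggregation(true)
{code}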

As an example, I have a use case where the feature vector size is around 20 
million. However, there are only 7-8 thousand non-zero features. The time to 
train a model with only 10 workers is 1.3 hours on my test cluster. With 100 
workers this balloons to over 10 hours on the same cluster!
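
For reference, a minimal sketch of the shape of data involved, assuming Spark 
1.6 (where ml pipelines take org.apache.spark.mllib.linalg vectors in the 
features column) and a spark-shell session with sc and sqlContext in scope; 
the indices and values below are random placeholders rather than my real data:

{code:scala}
import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.classification.LogisticRegression

val numFeatures  = 20000000  // declared feature-vector size
val activePerRow = 8000      // non-zero features per instance

// Sparse feature vectors: tiny on their own, despite the huge declared size.
val data = sc.parallelize(0 until 10000).map { i =>
  val rng = new Random(i)
  val indices = Array.fill(activePerRow)(rng.nextInt(numFeatures)).distinct.sorted
  val values  = Array.fill(indices.length)(rng.nextDouble())
  (i % 2.0, Vectors.sparse(numFeatures, indices, values))
}
val df = sqlContext.createDataFrame(data).toDF("label", "features")

// During fitting, gradient aggregation still allocates dense arrays of
// length numFeatures, which is where the time goes.
val model = new LogisticRegression().setMaxIter(10).fit(df)
{code}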


