[ https://issues.apache.org/jira/browse/SPARK-14464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230990#comment-15230990 ]
Daniel Siegmann commented on SPARK-14464:
-----------------------------------------

I am working on a fix. I have used it successfully and will submit a PR once I've integrated my changes into Spark. With my solution, the example case I described takes only 10-11 minutes and requires less memory - and that holds regardless of whether 10 or 100 workers are used!

> Logistic regression performs poorly for very large vectors, even when the
> number of non-zero features is small
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14464
>                 URL: https://issues.apache.org/jira/browse/SPARK-14464
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.6.0
>            Reporter: Daniel Siegmann
>
> When training (a.k.a. fitting) org.apache.spark.ml.classification.LogisticRegression, aggregation is done using arrays (which are dense structures). This is the case regardless of whether the features of each instance are stored in a sparse or a dense vector.
>
> When the feature vectors are very large, performance is poor because there is a lot of overhead in transmitting these large arrays between workers. However, just because the feature vectors are large doesn't mean all of those features are actually being used. If the actual features are sparse, very large arrays are being allocated unnecessarily.
>
> To address this case, there should be an option to aggregate using sparse vectors. It should be up to the user to set this explicitly as a parameter on the estimator, since the user should have some idea whether it is necessary for their particular case.
>
> As an example, I have a use case where the feature vector size is around 20 million, but there are only 7-8 thousand non-zero features. The time to train a model with only 10 workers is 1.3 hours on my test cluster. With 100 workers this balloons to over 10 hours on the same cluster! Also, spark.driver.maxResultSize is set to 112 GB to accommodate the data being pulled back to the driver.
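For context, here is a minimal standalone Scala sketch of the contrast described in the issue. It is illustrative only - the object, method names, and data structures are hypothetical, not the actual LogisticRegression internals or the pending patch. It shows why merging partial gradients as full-size dense arrays costs O(numFeatures) per merge (and per network transfer between workers), while a sparse representation costs only O(number of non-zeros):

{code}
// Illustrative sketch only - names and structure are hypothetical,
// not Spark internals.
import scala.collection.mutable

object SparseAggregationSketch {
  val numFeatures = 20000000 // ~20 million features, as in the reported use case
  val numNonZero  = 8000     // ~7-8 thousand features actually used

  // Dense merge: every partial result is a full-size array, so each merge
  // (and each transfer between workers) touches all ~20M entries (~160 MB).
  def mergeDense(a: Array[Double], b: Array[Double]): Array[Double] = {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }

  // Sparse merge: a partial result carries only the indices it touched, so
  // the cost scales with the number of non-zero features, not the vector size.
  def mergeSparse(a: mutable.Map[Int, Double],
                  b: Map[Int, Double]): mutable.Map[Int, Double] = {
    for ((i, v) <- b) a(i) = a.getOrElse(i, 0.0) + v
    a
  }

  def main(args: Array[String]): Unit = {
    // A partial gradient that touches only two of the ~20M feature indices.
    val sparsePartial = Map(42 -> 1.0, 1999999 -> 0.5)
    val acc = mergeSparse(mutable.Map.empty, sparsePartial)
    println(s"merged ${acc.size} entries instead of allocating $numFeatures")
  }
}
{code}

In the real estimator, the dense path corresponds to the gradient array carried through the aggregation during fitting (e.g., via RDD.treeAggregate); the proposal above would let the user opt in to a sparse representation when they know their non-zero count is small relative to the vector size.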