[ https://issues.apache.org/jira/browse/SPARK-14464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382142#comment-15382142 ]

Nick Pentreath commented on SPARK-14464:
----------------------------------------

[~daniel.siegmann.aol] would you mind trying out the latest RC for Spark 2.0? 
I'm interested to see how much SPARK-16008 helps your use case (though I still 
agree that enabling LR to handle sparsity, as in your PR, is worthwhile). Thanks!

> Logistic regression performs poorly for very large vectors, even when the 
> number of non-zero features is small
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14464
>                 URL: https://issues.apache.org/jira/browse/SPARK-14464
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.6.0
>            Reporter: Daniel Siegmann
>
> When training (a.k.a. fitting) 
> org.apache.spark.ml.classification.LogisticRegression, aggregation is done 
> using arrays (which are dense structures). This is the case regardless of 
> whether the features of each instance are stored in a sparse or a dense 
> vector.
> When the feature vectors are very large, performance is poor because there's 
> a lot of overhead in transmitting these large arrays across workers.
> However, just because the feature vectors are large doesn't mean all these 
> features are actually being used. If the actual features are sparse, very 
> large arrays are being allocated unnecessarily.
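> As a rough illustration (a minimal sketch of the dense-aggregation pattern 
> described above, not Spark's actual internal code), each aggregation buffer 
> is an Array[Double] sized to the full feature dimension. At 20 million 
> features that is roughly 160 MB per buffer (20e6 doubles x 8 bytes), no 
> matter how sparse the instances are:
> {code:scala}
> import org.apache.spark.ml.linalg.Vectors
> 
> val numFeatures = 20000000 // ~20 million, as in the use case below
> 
> // Dense aggregation buffer, allocated regardless of input sparsity:
> // 20e6 doubles * 8 bytes ~= 160 MB per buffer.
> val gradientBuffer = new Array[Double](numFeatures)
> 
> // A typical instance: only a few thousand of the 20 million features
> // are non-zero (indices and values here are made up for illustration).
> val instance = Vectors.sparse(
>   numFeatures,
>   Array(3, 42, 1000000),  // indices of non-zero features
>   Array(1.0, 0.5, 2.0)    // corresponding values
> )
> 
> // The per-instance update touches only the non-zero slots...
> instance.foreachActive { (i, v) => gradientBuffer(i) += v }
> 
> // ...but the full ~160 MB buffer is still what gets serialized and
> // combined across workers during aggregation.
> {code}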
> To address this, there should be an option to aggregate using sparse 
> vectors. It should be up to the user to set this explicitly as a parameter on 
> the estimator, since the user should have some idea whether it is necessary 
> for their particular case.
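> From the user's side, the proposed option might look like the following. 
> This is a hypothetical sketch: setAggregateUsingSparseVectors is a 
> placeholder name for illustration, not an existing Spark parameter.
> {code:scala}
> import org.apache.spark.ml.classification.LogisticRegression
> 
> val lr = new LogisticRegression()
>   .setMaxIter(100)
>   .setRegParam(0.01)
>   // Proposed (hypothetical) switch: aggregate gradients in sparse
>   // vectors instead of dense arrays. Off by default; the user opts in
>   // when they know their effective feature set is small.
>   // .setAggregateUsingSparseVectors(true)
> {code}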
> As an example, I have a use case where the feature vector size is around 20 
> million, but only 7,000 to 8,000 of those features are non-zero. The time 
> to train a model with only 10 workers is 1.3 hours on my test cluster. With 
> 100 workers this balloons to over 10 hours on the same cluster! Also, 
> spark.driver.maxResultSize is set to 112 GB to accommodate the data being 
> pulled back to the driver.


