[ https://issues.apache.org/jira/browse/SPARK-14464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230990#comment-15230990 ]
Daniel Siegmann commented on SPARK-14464:
-----------------------------------------

I am working on a fix. I have used it successfully and will submit a PR once I've integrated my changes into Spark. With my solution, the example case I described takes only 10-11 minutes and requires less memory - and that holds regardless of whether 10 or 100 workers are used!

> Logistic regression performs poorly for very large vectors, even when the
> number of non-zero features is small
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14464
>                 URL: https://issues.apache.org/jira/browse/SPARK-14464
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.6.0
>            Reporter: Daniel Siegmann
>
> When training (a.k.a. fitting) org.apache.spark.ml.classification.LogisticRegression, aggregation is done using arrays (which are dense structures). This is the case regardless of whether the features of each instance are stored in a sparse or a dense vector.
>
> When the feature vectors are very large, performance is poor because there is a lot of overhead in transmitting these large arrays between workers. However, just because the feature vectors are large doesn't mean all of those features are actually being used. If the actual features are sparse, very large arrays are being allocated unnecessarily.
>
> To address this case, there should be an option to aggregate using sparse vectors. It should be up to the user to set this explicitly as a parameter on the estimator, since the user should have some idea whether it is necessary for their particular case.
>
> As an example, I have a use case where the feature vector size is around 20 million, but there are only 7-8 thousand non-zero features. The time to train a model with only 10 workers is 1.3 hours on my test cluster. With 100 workers this balloons to over 10 hours on the same cluster! Also, spark.driver.maxResultSize is set to 112 GB to accommodate the data being pulled back to the driver.
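For context, here is a minimal standalone Scala sketch of the contrast described in the issue. It is illustrative only - the object, method names, and data structures are hypothetical, not the actual LogisticRegression internals or the pending patch. It shows why merging partial gradients as full-size dense arrays costs O(numFeatures) per merge (and per network transfer between workers), while a sparse representation costs only O(number of non-zeros):

{code}
// Illustrative sketch only - names and structure are hypothetical,
// not Spark internals.
import scala.collection.mutable

object SparseAggregationSketch {
  val numFeatures = 20000000 // ~20 million features, as in the reported use case
  val numNonZero  = 8000     // ~7-8 thousand features actually used

  // Dense merge: every partial result is a full-size array, so each merge
  // (and each transfer between workers) touches all ~20M entries (~160 MB).
  def mergeDense(a: Array[Double], b: Array[Double]): Array[Double] = {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }
    a
  }

  // Sparse merge: a partial result carries only the indices it touched, so
  // the cost scales with the number of non-zero features, not the vector size.
  def mergeSparse(a: mutable.Map[Int, Double],
                  b: Map[Int, Double]): mutable.Map[Int, Double] = {
    for ((i, v) <- b) a(i) = a.getOrElse(i, 0.0) + v
    a
  }

  def main(args: Array[String]): Unit = {
    // A partial gradient that touches only two of the ~20M feature indices.
    val sparsePartial = Map(42 -> 1.0, 1999999 -> 0.5)
    val acc = mergeSparse(mutable.Map.empty, sparsePartial)
    println(s"merged ${acc.size} entries instead of allocating $numFeatures")
  }
}
{code}

In the real estimator, the dense path corresponds to the gradient array carried through the aggregation during fitting (e.g., via RDD.treeAggregate); the proposal above would let the user opt in to a sparse representation when they know their non-zero count is small relative to the vector size.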