[ 
https://issues.apache.org/jira/browse/SPARK-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335359#comment-15335359
 ] 

Apache Spark commented on SPARK-16008:
--------------------------------------

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/13729

> ML Logistic Regression aggregator serializes unnecessary data
> -------------------------------------------------------------
>
>                 Key: SPARK-16008
>                 URL: https://issues.apache.org/jira/browse/SPARK-16008
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Seth Hendrickson
>
> The LogisticRegressionAggregator class is used to collect gradient updates in
> the ML logistic regression algorithm. The class stores a reference to the
> coefficients array, whose length equals the number of features. It also
> stores a reference to an array of standard deviations, which also has length
> numFeatures. When a task completes, the aggregator is serialized, and with it
> a copy of the two arrays. These arrays don't need to be serialized (only the
> gradient updates are being aggregated). This causes performance issues when
> the number of features is large and can trigger excess garbage collection
> when the executor doesn't have much spare memory.
> This results in serializing 2 * numFeatures excess values. When multiclass
> logistic regression is implemented, the excess will be numFeatures +
> numClasses * numFeatures.
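
The overhead described above can be reproduced outside Spark with plain Java serialization. The sketch below is only an illustration under stated assumptions: the class names EagerAggregator and LeanAggregator, the field names, and the helper serializedSize are hypothetical and do not come from the Spark code base. Marking the read-only arrays as transient is one general JVM technique for keeping them out of the serialized form, and it is not necessarily the approach taken in the linked pull request.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class AggregatorSketch {

    // Hypothetical stand-in for the aggregator in the issue: it keeps
    // references to two read-only input arrays next to the gradient sum,
    // so all three arrays are written when the object is serialized.
    static class EagerAggregator implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] coefficients; // read-only input, length numFeatures
        final double[] featuresStd;  // read-only input, length numFeatures
        final double[] gradientSum;  // the only state that must be sent back

        EagerAggregator(double[] coefficients, double[] featuresStd) {
            this.coefficients = coefficients;
            this.featuresStd = featuresStd;
            this.gradientSum = new double[coefficients.length];
        }
    }

    // Same state, but the read-only inputs are transient, so Java
    // serialization skips them and only gradientSum is written.
    static class LeanAggregator implements Serializable {
        private static final long serialVersionUID = 1L;
        transient double[] coefficients; // not serialized
        transient double[] featuresStd;  // not serialized
        final double[] gradientSum;

        LeanAggregator(double[] coefficients, double[] featuresStd) {
            this.coefficients = coefficients;
            this.featuresStd = featuresStd;
            this.gradientSum = new double[coefficients.length];
        }
    }

    // Returns the number of bytes Java serialization produces for obj.
    static int serializedSize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int numFeatures = 100_000; // arbitrary size chosen for the illustration
        double[] coefficients = new double[numFeatures];
        double[] featuresStd = new double[numFeatures];

        int eager = serializedSize(new EagerAggregator(coefficients, featuresStd));
        int lean = serializedSize(new LeanAggregator(coefficients, featuresStd));

        System.out.println("eager bytes: " + eager);
        System.out.println("lean bytes:  " + lean);
        System.out.println("excess:      " + (eager - lean));
    }
}
```

With 8-byte doubles, the excess printed by this sketch should be at least 2 * numFeatures * 8 bytes, which corresponds to the 2 * numFeatures excess values mentioned in the description.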



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

