[ https://issues.apache.org/jira/browse/SPARK-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335359#comment-15335359 ]
Apache Spark commented on SPARK-16008:
--------------------------------------

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/13729

> ML Logistic Regression aggregator serializes unnecessary data
> -------------------------------------------------------------
>
>                 Key: SPARK-16008
>                 URL: https://issues.apache.org/jira/browse/SPARK-16008
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Seth Hendrickson
>
> The LogisticRegressionAggregator class is used to collect gradient updates
> in the ML logistic regression algorithm. The class stores a reference to the
> coefficients array, whose length equals the number of features. It also
> stores a reference to an array of standard deviations, which also has length
> numFeatures. When a task completes, it serializes the class, which also
> serializes a copy of the two arrays. These arrays don't need to be serialized
> (only the gradient updates are being aggregated). This causes performance
> issues when the number of features is large and can trigger excess garbage
> collection when the executor doesn't have much spare memory.
> This results in serializing 2*numFeatures values of excess data. When
> multiclass logistic regression is implemented, the excess will be
> numFeatures + numClasses * numFeatures.
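
The overhead described in the issue can be illustrated with a small, self-contained sketch. It is written in Python and uses pickle as a stand-in for the JVM serialization Spark performs on task results, so it shows the effect rather than Spark's actual code. The names NaiveAggregator, LeanAggregator, add_gradient and serialized_size are hypothetical, and passing the read-only arrays into add() is only one possible way to keep them out of the serialized object; it is not necessarily what the pull request does.

```python
import math
import pickle
from array import array


def add_gradient(gradient_sum, coefficients, features_std, features, label):
    """Add one instance's logistic-loss gradient to gradient_sum in place."""
    margin = 0.0
    for i, x in enumerate(features):
        if features_std[i] != 0.0:
            margin += coefficients[i] * x / features_std[i]
    # sigmoid(margin) - label; fine for the small margins used in this sketch
    multiplier = 1.0 / (1.0 + math.exp(-margin)) - label
    for i, x in enumerate(features):
        if features_std[i] != 0.0:
            gradient_sum[i] += multiplier * x / features_std[i]


class NaiveAggregator:
    """Keeps the read-only inputs as fields, so they are serialized as well."""

    def __init__(self, coefficients, features_std):
        self.coefficients = coefficients
        self.features_std = features_std
        self.gradient_sum = array("d", [0.0] * len(coefficients))

    def add(self, features, label):
        add_gradient(self.gradient_sum, self.coefficients,
                     self.features_std, features, label)
        return self


class LeanAggregator:
    """Holds only the running gradient; inputs are passed to add() instead."""

    def __init__(self, num_features):
        self.gradient_sum = array("d", [0.0] * num_features)

    def add(self, features, label, coefficients, features_std):
        add_gradient(self.gradient_sum, coefficients, features_std,
                     features, label)
        return self


def serialized_size(obj):
    """Number of bytes pickle produces for obj."""
    return len(pickle.dumps(obj))


if __name__ == "__main__":
    n = 100_000  # numFeatures
    coefficients = array("d", [0.0] * n)
    features_std = array("d", [1.0] * n)
    print("naive:", serialized_size(NaiveAggregator(coefficients, features_std)))
    print("lean: ", serialized_size(LeanAggregator(n)))
```

Both aggregators accumulate the same gradient, but the naive one carries two extra arrays of numFeatures doubles every time it is serialized, which corresponds to the 2*numFeatures excess mentioned in the description.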