Seth Hendrickson created SPARK-16008:
----------------------------------------

             Summary: ML Logistic Regression aggregator serializes unnecessary data
                 Key: SPARK-16008
                 URL: https://issues.apache.org/jira/browse/SPARK-16008
             Project: Spark
          Issue Type: Bug
          Components: ML
            Reporter: Seth Hendrickson


The LogisticRegressionAggregator class is used to collect gradient updates in
the ML logistic regression algorithm. The class stores a reference to the
coefficients array, whose length equals the number of features. It also stores
a reference to an array of standard deviations, which also has length
numFeatures. When a task completes, it serializes the class, which also
serializes a copy of the two arrays. These arrays don't need to be serialized
(only the gradient updates are being aggregated). This causes performance
issues when the number of features is large and can trigger excess garbage
collection when the executor doesn't have much spare memory.

This results in serializing 2 * numFeatures excess values per aggregator. When
multiclass logistic regression is implemented, the excess will be
numFeatures + numClasses * numFeatures.
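
The effect described above can be reproduced outside Spark with plain Java
serialization. The following is only a minimal sketch under stated assumptions:
EagerAggregator, LeanAggregator and AggregatorSerializationSketch are
hypothetical stand-ins, not the actual Spark classes, and marking the read-only
arrays transient is just one possible way to keep them out of the serialized
form, not necessarily the fix that will be adopted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the aggregator described above: the read-only
// inputs are ordinary fields, so serialization writes them with the gradient.
class EagerAggregator implements Serializable {
    private static final long serialVersionUID = 1L;
    final double[] coefficients;   // length numFeatures, read-only input
    final double[] featuresStd;    // length numFeatures, read-only input
    final double[] gradientSum;    // the only state that needs aggregating

    EagerAggregator(double[] coefficients, double[] featuresStd) {
        this.coefficients = coefficients;
        this.featuresStd = featuresStd;
        this.gradientSum = new double[coefficients.length];
    }
}

// Same state, but the read-only inputs are transient, so only gradientSum
// is written. After deserialization the transient fields are null.
class LeanAggregator implements Serializable {
    private static final long serialVersionUID = 1L;
    final transient double[] coefficients;
    final transient double[] featuresStd;
    final double[] gradientSum;

    LeanAggregator(double[] coefficients, double[] featuresStd) {
        this.coefficients = coefficients;
        this.featuresStd = featuresStd;
        this.gradientSum = new double[coefficients.length];
    }
}

class AggregatorSerializationSketch {
    // Number of bytes produced by Java serialization for the given object.
    static int serializedSize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int numFeatures = 100_000;
        double[] coefficients = new double[numFeatures];
        double[] featuresStd = new double[numFeatures];

        int eager = serializedSize(new EagerAggregator(coefficients, featuresStd));
        int lean = serializedSize(new LeanAggregator(coefficients, featuresStd));

        System.out.println("eager bytes: " + eager);
        System.out.println("lean bytes:  " + lean);
        System.out.println("excess:      " + (eager - lean));
    }
}
```

Each double takes 8 bytes, so the excess reported by the sketch should be at
least 2 * 8 * numFeatures bytes, matching the 2 * numFeatures figure above.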



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)