Seth Hendrickson created SPARK-19747:
----------------------------------------

             Summary: Consolidate code in ML aggregators
                 Key: SPARK-19747
                 URL: https://issues.apache.org/jira/browse/SPARK-19747
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Seth Hendrickson
            Priority: Minor


Many algorithms in Spark ML are posed as the optimization of a differentiable loss 
function over a parameter vector. We implement these by having a loss function 
accumulate the gradient using an Aggregator class with methods that amount 
to a {{seqOp}} and {{combOp}}. So nearly every algorithm that follows this 
pattern implements its own cost function class and aggregator class; the classes are 
completely separate from one another yet share probably 80% of the same code. 
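To make the duplication concrete, here is a deliberately simplified sketch of what each algorithm ends up writing today. The names ({{SimpleLeastSquaresAggregator}}, {{SimpleLeastSquaresCostFun}}) are hypothetical, not Spark's actual classes, and the real versions carry much more state (standardization, intercepts, weights):

```scala
// Hypothetical, stripped-down version of the per-algorithm pair described above.
class SimpleLeastSquaresAggregator(coefficients: Array[Double]) {
  private var lossSum = 0.0
  private var weightSum = 0.0
  private val gradientSum = new Array[Double](coefficients.length)

  // seqOp: fold one (label, features) instance into the running sums.
  def add(label: Double, features: Array[Double]): this.type = {
    val margin = features.zip(coefficients).map { case (x, c) => x * c }.sum
    val error = margin - label
    lossSum += 0.5 * error * error
    for (i <- features.indices) gradientSum(i) += error * features(i)
    weightSum += 1.0
    this
  }

  // combOp: merge a partial aggregator computed on another partition.
  def merge(other: SimpleLeastSquaresAggregator): this.type = {
    lossSum += other.lossSum
    weightSum += other.weightSum
    for (i <- gradientSum.indices) gradientSum(i) += other.gradientSum(i)
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSum.map(_ / weightSum)
}

// ...plus a companion cost function whose only real job is to drive the
// aggregator over the data (a Seq here stands in for an RDD/treeAggregate).
class SimpleLeastSquaresCostFun(data: Seq[(Double, Array[Double])]) {
  def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    val agg = data.foldLeft(new SimpleLeastSquaresAggregator(coefficients)) {
      case (a, (label, features)) => a.add(label, features)
    }
    (agg.loss, agg.gradient)
  }
}
```

Everything except the body of {{add}} is boilerplate that gets re-written, with small variations, for each algorithm.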

I think it is important to clean things like this up, and if we do it 
properly it will make the code much more maintainable, readable, and bug-free. 
It will also reduce the overhead of future implementations.

The design is of course open for discussion, but I think we should aim to:
1. Have all aggregators share a parent class, so that each one only needs to 
implement the {{add}} function, which is really the only place the current 
aggregators differ.
2. Have a single, generic cost function that is parameterized by the aggregator 
type. This replaces the many near-identical cost function implementations and 
greatly reduces the amount of duplicated code.
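A rough sketch of what points 1 and 2 could look like. All names here are illustrative, not a committed API; the trait implements {{merge}} (the {{combOp}}) and the summary getters once, a concrete aggregator supplies only {{add}} (the {{seqOp}}), and one generic cost function (sketched over a plain Seq rather than an RDD) works for any aggregator type:

```scala
// Shared parent: everything except add is written exactly once.
trait DiffLossAggregator[Agg <: DiffLossAggregator[Agg]] { self: Agg =>
  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected val dim: Int
  protected lazy val gradientSumArray: Array[Double] = new Array[Double](dim)

  // The only algorithm-specific piece (the seqOp).
  def add(label: Double, features: Array[Double]): Agg

  // The combOp, shared by every aggregator.
  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    for (i <- 0 until dim) gradientSumArray(i) += other.gradientSumArray(i)
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSumArray.map(_ / weightSum)
}

// One generic cost function for all algorithms, parameterized by Agg.
class GenericCostFun[Agg <: DiffLossAggregator[Agg]](
    data: Seq[(Double, Array[Double])],
    newAggregator: () => Agg) {
  def calculate(): (Double, Array[Double]) = {
    val agg = data.foldLeft(newAggregator()) { case (a, (l, f)) => a.add(l, f) }
    (agg.loss, agg.gradient)
  }
}

// A concrete aggregator now only has to define add.
class LeastSquaresAgg(coefficients: Array[Double])
    extends DiffLossAggregator[LeastSquaresAgg] {
  protected val dim: Int = coefficients.length
  def add(label: Double, features: Array[Double]): LeastSquaresAgg = {
    val margin = features.zip(coefficients).map { case (x, c) => x * c }.sum
    val error = margin - label
    lossSum += 0.5 * error * error
    for (i <- features.indices) gradientSumArray(i) += error * features(i)
    weightSum += 1.0
    this
  }
}
```

The F-bounded type parameter ({{Agg <: DiffLossAggregator[Agg]}}) is one way to let {{merge}} take and return the concrete aggregator type; the exact encoding is open for discussion.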



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
