Kyle Ellrott created SPARK-2372:
-----------------------------------

             Summary: Grouped Optimization/Learning
                 Key: SPARK-2372
                 URL: https://issues.apache.org/jira/browse/SPARK-2372
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
    Affects Versions: 1.0.1, 1.1.0, 1.0.2
            Reporter: Kyle Ellrott


The purpose of this patch is to enable MLlib to better handle scenarios where 
the user wants to run learning on multiple feature/label sets at the same 
time. Rather than scheduling each learning task separately, this patch lets the 
user create a single RDD with an Int key that identifies the 'group' each 
entry belongs to.

This patch establishes the GroupedOptimizer trait, for which 
GroupedGradientDescent has been implemented. This system differs from the 
original Optimizer trait in that the original optimize method accepted 
RDD[(Double, Vector)], whereas the new GroupedOptimizer accepts 
RDD[(Int, (Double, Vector))].
The difference is that the GroupedOptimizer uses a 'group' ID key in the RDD to 
multiplex multiple optimization operations in the same RDD.
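
To illustrate the multiplexing idea, here is a minimal sketch that uses plain 
Scala collections in place of an RDD and a least-squares gradient. It is not 
the code from the patch: the names GroupedSketch and optimizeGrouped, the 
Array[Double] stand-in for Vector, and the choice of loss are all assumptions 
made for this example.

```scala
// Minimal sketch only: plain Scala collections stand in for an RDD.
// GroupedSketch and optimizeGrouped are hypothetical names, not the patch's API.
object GroupedSketch {
  type Example = (Double, Array[Double]) // (label, features)

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Runs least-squares gradient descent for every group, one shared pass per iteration. */
  def optimizeGrouped(
      data: Seq[(Int, Example)],
      initial: Map[Int, Array[Double]],
      stepSize: Double,
      numIterations: Int): Map[Int, Array[Double]] = {
    var weights = initial
    for (_ <- 1 to numIterations) {
      // Each record contributes only to the gradient of its own group.
      // With an RDD this step would be a keyed aggregation.
      val gradSums: Map[Int, (Array[Double], Int)] =
        data.groupBy(_._1).map { case (group, rows) =>
          val w = weights(group)
          val zero = Array.fill(w.length)(0.0)
          val sum = rows.foldLeft(zero) { case (acc, (_, (label, x))) =>
            val err = dot(w, x) - label
            acc.zip(x).map { case (g, xi) => g + err * xi }
          }
          (group, (sum, rows.size))
        }
      // Update every group's weight vector with its own averaged gradient.
      weights = weights.map { case (group, w) =>
        gradSums.get(group) match {
          case Some((g, n)) =>
            (group, w.zip(g).map { case (wi, gi) => wi - stepSize * gi / n })
          case None => (group, w) // no data for this group: leave weights as-is
        }
      }
    }
    weights
  }
}
```

With two groups whose labels follow y = 2x and y = -3x respectively, calling 
optimizeGrouped with zero initial weights and a small step size should drive 
the two weight vectors toward 2.0 and -3.0 independently, even though all 
records live in the same collection.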

This patch also establishes the GroupedGeneralizedLinearAlgorithm trait, whose 
'run' method takes RDD[(Int, LabeledPoint)] as input in place of 
RDD[LabeledPoint].

This patch also provides a unit test and a utility that takes the results of 
MLUtils.kFold and turns them into a single grouped RDD, ready for simultaneous 
learning.
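
As a rough sketch of what such a utility might do, the snippet below tags each 
fold's training records with the fold index and concatenates them. 
MLUtils.kFold returns an array of (training, validation) pairs; here plain Seq 
values stand in for the RDDs, and the names FoldGrouping and 
groupTrainingFolds are hypothetical, not the ones used in the patch.

```scala
// Minimal sketch only: Seq stands in for RDD, and the names are hypothetical.
object FoldGrouping {
  /** Tags each fold's training records with the fold index and concatenates them. */
  def groupTrainingFolds[T](folds: Seq[(Seq[T], Seq[T])]): Seq[(Int, T)] =
    folds.zipWithIndex.flatMap { case ((training, _), foldId) =>
      // With RDDs this would be a map per fold followed by a union.
      training.map(record => (foldId, record))
    }
}
```

The resulting (foldId, record) pairs have the keyed shape that the grouped 
'run' method described above expects, so every fold can be trained in one job.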

https://github.com/apache/spark/pull/1292




--
This message was sent by Atlassian JIRA
(v6.2#6252)
