[jira] [Comment Edited] (SPARK-2372) Grouped Optimization/Learning
[ https://issues.apache.org/jira/browse/SPARK-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094668#comment-14094668 ]

Kyle Ellrott edited comment on SPARK-2372 at 8/13/14 6:06 AM:
--------------------------------------------------------------

GroupedBinaryClassificationMetrics has been added to the pull request connected to this issue.

GroupedBinaryClassificationMetrics is a rewrite of the BinaryClassificationMetrics methods, but it works on an RDD[(KEY, (Double, Double))] structure (rather than the RDD[(Double, Double)] that BinaryClassificationMetrics takes), where KEY is a generic type used to identify each data set.

A unit test is included to validate that these functions work the same way as the BinaryClassificationMetrics implementations.

https://github.com/kellrott/spark/commit/dcabb2f6a39c0940afc39e809a50601f46e50162

was (Author: kellrott):
GroupedBinaryClassificationMetrics has been added to the pull request connected to this issue.

GroupedBinaryClassificationMetrics is a rewrite of the BinaryClassificationMetrics methods, but it works on an RDD[(KEY, (Double, Double))] structure (rather than the RDD[(Double, Double)] that BinaryClassificationMetrics takes), where KEY is a generic type used to identify each data set. Methods now return Map[KEY, Double], with a separate score for each data set, rather than a single Double.

A unit test is included to validate that these functions work the same way as the BinaryClassificationMetrics implementations.

https://github.com/kellrott/spark/commit/dcabb2f6a39c0940afc39e809a50601f46e50162

Grouped Optimization/Learning
-----------------------------

                 Key: SPARK-2372
                 URL: https://issues.apache.org/jira/browse/SPARK-2372
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
    Affects Versions: 1.0.1, 1.1.0, 1.0.2
            Reporter: Kyle Ellrott
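To make the grouped layout concrete, here is a minimal sketch of a per-group metric over an RDD[(K, (score, label))], reduced to one value per key in a single pass. This is not the PR's actual code; groupedAccuracy, its signature, and the fixed-threshold metric are invented here for illustration.

    import scala.reflect.ClassTag

    import org.apache.spark.SparkContext._ // pair-RDD implicits needed on Spark 1.x
    import org.apache.spark.rdd.RDD

    // Hypothetical helper, not the PR's code: accuracy at a fixed threshold,
    // computed independently for every key and returned in the Map[KEY, Double]
    // shape described above.
    def groupedAccuracy[K: ClassTag](scoreAndLabelsByKey: RDD[(K, (Double, Double))],
                                     threshold: Double = 0.5): Map[K, Double] = {
      scoreAndLabelsByKey
        .mapValues { case (score, label) =>
          val predicted = if (score >= threshold) 1.0 else 0.0
          (if (predicted == label) 1L else 0L, 1L) // (correct, total)
        }
        .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
        .mapValues { case (correct, total) => correct.toDouble / total }
        .collect()
        .toMap
    }

The ROC/PR computations in BinaryClassificationMetrics would follow the same shape, with the per-key reduction replaced by the corresponding per-key curve calculation.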
[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094673#comment-14094673 ]

Kyle Ellrott commented on SPARK-1486:
-------------------------------------

It would be helpful to get some feedback on whether the work being done for SPARK-2372 would help with this issue.

Support multi-model training in MLlib
-------------------------------------

                 Key: SPARK-1486
                 URL: https://issues.apache.org/jira/browse/SPARK-1486
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng
            Priority: Critical

It is rare in practice to train just one model with a given set of parameters. Usually, multiple models are trained with different sets of parameters, and the best is selected based on performance on a validation set. MLlib should provide native support for multi-model training/scoring. This requires decoupling concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow.

There are different approaches to multi-model training (option 1 is sketched after this message):

0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler).
1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans).
2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work.
3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker.

Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB.

This task will be divided into sub-tasks, and this JIRA is created to discuss the design and track overall progress.
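Option 1 can be sketched as a single pass that evaluates the gradients of several candidate weight vectors against one cached copy of the data. The example below is illustrative only (invented names, not an MLlib API), using least squares for concreteness:

    import org.apache.spark.rdd.RDD

    // Illustrative sketch of option 1: one pass over cached data computes
    // least-squares gradients for every candidate model at once. `data`
    // holds (label, features) pairs; `weights(m)` is the m-th candidate.
    def multiModelGradients(data: RDD[(Double, Array[Double])],
                            weights: Array[Array[Double]]): Array[Array[Double]] = {
      val numFeatures = weights(0).length
      val zero = Array.fill(weights.length)(new Array[Double](numFeatures))
      data.aggregate(zero)(
        (grads, point) => {
          val (label, x) = point
          var m = 0
          while (m < weights.length) {
            var dot = 0.0
            var j = 0
            while (j < numFeatures) { dot += weights(m)(j) * x(j); j += 1 }
            val err = dot - label // d/dw of 0.5 * (w.x - y)^2 is (w.x - y) * x
            j = 0
            while (j < numFeatures) { grads(m)(j) += err * x(j); j += 1 }
            m += 1
          }
          grads
        },
        (g1, g2) => {
          for (m <- g1.indices; j <- 0 until numFeatures) g1(m)(j) += g2(m)(j)
          g1
        })
    }

Each candidate can then take its own descent step from the shared result, so training k models costs roughly one data scan per iteration instead of k.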
[jira] [Commented] (SPARK-2372) Grouped Optimization/Learning
[ https://issues.apache.org/jira/browse/SPARK-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094668#comment-14094668 ]

Kyle Ellrott commented on SPARK-2372:
-------------------------------------

GroupedBinaryClassificationMetrics has been added to the pull request connected to this issue.

GroupedBinaryClassificationMetrics is a rewrite of the BinaryClassificationMetrics methods, but it works on an RDD[(KEY, (Double, Double))] structure (rather than the RDD[(Double, Double)] that BinaryClassificationMetrics takes), where KEY is a generic type used to identify each data set. Methods now return Map[KEY, Double], with a separate score for each data set, rather than a single Double.

A unit test is included to validate that these functions work the same way as the BinaryClassificationMetrics implementations.

https://github.com/kellrott/spark/commit/dcabb2f6a39c0940afc39e809a50601f46e50162

Grouped Optimization/Learning
-----------------------------

                 Key: SPARK-2372
                 URL: https://issues.apache.org/jira/browse/SPARK-2372
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
    Affects Versions: 1.0.1, 1.1.0, 1.0.2
            Reporter: Kyle Ellrott
[jira] [Created] (SPARK-2372) Grouped Optimization/Learning
Kyle Ellrott created SPARK-2372:
-----------------------------------

             Summary: Grouped Optimization/Learning
                 Key: SPARK-2372
                 URL: https://issues.apache.org/jira/browse/SPARK-2372
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
    Affects Versions: 1.0.1, 1.1.0, 1.0.2
            Reporter: Kyle Ellrott

The purpose of this patch is to enable MLlib to better handle scenarios where the user wants to do learning on multiple feature/label sets at the same time. Rather than scheduling each learning task separately, this patch lets the user create a single RDD with an Int key representing the 'group' that sets of entries belong to.

This patch establishes the GroupedOptimizer trait, for which GroupedGradientDescent has been implemented. This system differs from the original Optimizer trait in that where the original optimize method accepted RDD[(Double, Vector)], the new GroupedOptimizer accepts RDD[(Int, (Double, Vector))]. That is, the GroupedOptimizer uses a 'group' ID key in the RDD to multiplex multiple optimization operations within the same RDD.

This patch also establishes the GroupedGeneralizedLinearAlgorithm trait, in which the 'run' method's RDD[LabeledPoint] input is replaced with RDD[(Int, LabeledPoint)].

This patch also provides a unit test and a utility to take the results of MLUtils.kFold and turn them into a single grouped RDD, ready for simultaneous learning.

https://github.com/apache/spark/pull/1292
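As a rough sketch of the multiplexing idea (invented names and update scheme, not the patch's GroupedGradientDescent), one gradient-descent step for least squares can be computed for every group at once by aggregating gradients per key:

    import org.apache.spark.SparkContext._ // pair-RDD implicits needed on Spark 1.x
    import org.apache.spark.rdd.RDD

    // Hypothetical sketch, not the PR's implementation. `data` is keyed by
    // group ID as in the patch; `weights` holds the current weight vector for
    // each group. Assumes every group contributes at least one data point.
    def groupedStep(data: RDD[(Int, (Double, Array[Double]))],
                    weights: Map[Int, Array[Double]],
                    stepSize: Double): Map[Int, Array[Double]] = {
      val grads = data
        .map { case (group, (label, x)) =>
          val w = weights(group) // shipped in the closure; fine for small maps
          var dot = 0.0
          for (j <- x.indices) dot += w(j) * x(j)
          val err = dot - label // gradient of 0.5 * (w.x - y)^2 is (w.x - y) * x
          (group, (x.map(_ * err), 1L))
        }
        .reduceByKey((a, b) =>
          (Array.tabulate(a._1.length)(j => a._1(j) + b._1(j)), a._2 + b._2))
        .collectAsMap()

      weights.map { case (group, w) =>
        val (grad, count) = grads(group)
        group -> Array.tabulate(w.length)(j => w(j) - stepSize * grad(j) / count)
      }
    }

All groups share one scan of the RDD per iteration, which is the scheduling win the patch description is after.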
[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052783#comment-14052783 ]

Kyle Ellrott commented on SPARK-1486:
-------------------------------------

In the case where you are using the same training/optimization procedure, you can group the sets of samples together using a key. There is an example of this in the pull request linked to SPARK-2372. This could be a very effective way to train all of the different folds from a kFold split simultaneously (see the sketch after this message).

Support multi-model training in MLlib
-------------------------------------

                 Key: SPARK-1486
                 URL: https://issues.apache.org/jira/browse/SPARK-1486
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng
            Priority: Critical
             Fix For: 1.1.0
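That grouping can be sketched as follows. Only MLUtils.kFold and its (training, validation) output pairs are actual MLlib API here; foldsToGroupedRDD is an invented helper name for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    // Illustrative helper: tag each fold's training split with its fold index
    // and union everything into one RDD[(Int, LabeledPoint)], ready for the
    // grouped learners described in SPARK-2372.
    def foldsToGroupedRDD(sc: SparkContext,
                          data: RDD[LabeledPoint],
                          numFolds: Int,
                          seed: Int): RDD[(Int, LabeledPoint)] = {
      val folds = MLUtils.kFold(data, numFolds, seed) // Array[(train, validation)]
      val keyed = folds.zipWithIndex.map { case ((train, _), fold) =>
        train.map(point => (fold, point))
      }
      sc.union(keyed)
    }

A grouped learner can then fit all numFolds training sets in one job, and grouped metrics can score each fold's model against its held-out split by the same key.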