[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib

2014-09-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139549#comment-14139549
 ] 

Apache Spark commented on SPARK-1486:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/2451

 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Burak Yavuz
Priority: Critical

 It is rare in practice to train just one model with a given set of 
 parameters. Usually, multiple models are trained with different sets of 
 parameters, and the best one is selected based on its performance on the 
 validation set. MLlib should provide native support for multi-model 
 training/scoring. This requires decoupling concepts like problem, 
 formulation, algorithm, parameter set, and model, which are missing in MLlib 
 now. MLI implements similar concepts, which we can borrow. There are 
 different approaches for multi-model training:
 0) Keep one copy of the data, and train models one after another (or maybe in 
 parallel, depending on the scheduler).
 1) Keep one copy of the data, and train multiple models at the same time 
 (similar to `runs` in KMeans).
 2) Make multiple copies of the data (still stored in a distributed fashion), 
 and use more cores to distribute the work.
 3) Collect the data, make the entire dataset available on workers, and train 
 one or more models on each worker.
 Users should be able to choose which execution mode they want to use. Note 
 that 3) could cover many use cases in practice when the training data is not 
 huge, e.g., 1GB.
 This task will be divided into sub-tasks and this JIRA is created to discuss 
 the design and track the overall progress.
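
As a rough illustration of approach 1) in the description above, the following plain-Python sketch trains several ridge-regression models in the same pass over a single copy of the data, then picks the best one on a validation set. It is not MLlib code: the function names (`train_models_together`, `select_best`), the toy data, and the choice of gradient descent with an L2 penalty are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (features, label)


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear prediction: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_models_together(
    data: Sequence[Sample],
    reg_params: Sequence[float],
    step_size: float = 0.5,
    iterations: int = 500,
) -> List[List[float]]:
    """Train one model per regularization value, sharing each data pass."""
    n, dim = len(data), len(data[0][0])
    models = [[0.0] * dim for _ in reg_params]
    for _ in range(iterations):
        grads = [[0.0] * dim for _ in reg_params]
        # One scan of the data accumulates gradients for every model.
        for features, label in data:
            for m, weights in enumerate(models):
                error = predict(weights, features) - label
                for j, x in enumerate(features):
                    grads[m][j] += error * x
        # Per-model update: averaged loss gradient plus the L2 penalty term.
        for m, reg in enumerate(reg_params):
            for j in range(dim):
                models[m][j] -= step_size * (grads[m][j] / n + reg * models[m][j])
    return models


def mean_squared_error(weights: Sequence[float], data: Sequence[Sample]) -> float:
    return sum((predict(weights, f) - y) ** 2 for f, y in data) / len(data)


def select_best(models, reg_params, validation):
    """Return (mse, reg_param, weights) of the model with lowest validation MSE."""
    scored = [
        (mean_squared_error(w, validation), reg, w)
        for w, reg in zip(models, reg_params)
    ]
    return min(scored, key=lambda item: item[0])


# Hypothetical toy data: labels follow y = 2x exactly.
train = [([x], 2.0 * x) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
validation = [([x], 2.0 * x) for x in (0.1, 0.6, 0.9)]
reg_params = [0.0, 0.1, 1.0]

models = train_models_together(train, reg_params)
best_mse, best_reg, best_weights = select_best(models, reg_params, validation)
```

The point of the sketch is only that the inner loop over the data is shared by all parameter settings, so the cost of scanning the data is paid once per iteration rather than once per model.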



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib

2014-09-16 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136696#comment-14136696
 ] 

Anant Daksh Asthana commented on SPARK-1486:


That sounds very true and relevant. I am completely with you on this one.

On Tue, Sep 16, 2014 at 5:50 PM, Xiangrui Meng (JIRA) j...@apache.org



 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Burak Yavuz
Priority: Critical




[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib

2014-08-12 Thread Kyle Ellrott (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094673#comment-14094673
 ] 

Kyle Ellrott commented on SPARK-1486:
-

It would be helpful to get some feedback on whether the work being done for 
SPARK-2372 would help with this issue.

 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical




[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib

2014-07-15 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062339#comment-14062339
 ] 

Erik Erlandson commented on SPARK-1486:
---

Does the development on this issue effectively subsume SPARK-1457 and/or SPARK-1856?


 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.1.0




[jira] [Commented] (SPARK-1486) Support multi-model training in MLlib

2014-07-04 Thread Kyle Ellrott (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052783#comment-14052783
 ] 

Kyle Ellrott commented on SPARK-1486:
-

In the case where you are using the same training/optimization procedure, you 
can group together the sets of samples using a key. There is an example of this 
in the pull request linked to SPARK-2372. This could be a very effective way to 
deal with simultaneously training all of the different folds from a kFolds 
split.
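
The keying idea can be sketched in plain Python as follows. This is not the code from the SPARK-2372 pull request; `key_by_training_fold`, `fit_slope`, and `train_per_key` are hypothetical names, the round-robin fold assignment is an assumption, and on Spark the grouping step would presumably be done with a keyed RDD operation rather than a dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def key_by_training_fold(
    samples: Sequence[Sample], k: int
) -> Iterator[Tuple[int, Sample]]:
    """Emit (fold, sample) for every fold whose training set contains the sample.

    Sample i is held out of fold i % k and used for training in all other folds.
    """
    for index, sample in enumerate(samples):
        held_out = index % k
        for fold in range(k):
            if fold != held_out:
                yield fold, sample


def fit_slope(group: Sequence[Sample]) -> float:
    """Least-squares slope of a line through the origin: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in group) / sum(x * x for x, _ in group)


def train_per_key(keyed: Iterable[Tuple[int, Sample]]) -> Dict[int, float]:
    """Group samples by key and fit one model per group with the same procedure."""
    groups: Dict[int, List[Sample]] = defaultdict(list)
    for key, sample in keyed:
        groups[key].append(sample)
    return {key: fit_slope(group) for key, group in groups.items()}


# Hypothetical toy data: y = 3x, split into 3 folds.
samples = [(float(x), 3.0 * x) for x in range(1, 7)]
fold_models = train_per_key(key_by_training_fold(samples, k=3))
```

Each sample is replicated once per fold that trains on it, so this trades extra data volume for the ability to fit all folds in one grouped pass.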

 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.1.0

