[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4591:
-------------------------------------
    Description: 
This is an umbrella JIRA for porting spark.mllib implementations to use the 
DataFrame-based API defined under spark.ml.  We want to achieve critical 
feature parity for the next release.

h3. Instructions for 3 subtask types

*Review tasks*: detailed review of a subpackage to identify feature gaps 
between spark.mllib and spark.ml.
* Should be listed as a subtask of this umbrella.
* Review subtasks cover major algorithm groups.  To pick up a review subtask, 
please:
** Comment that you are working on it.
** Compare the public APIs of spark.ml vs. spark.mllib.
** Comment on all missing items within spark.ml: algorithms, models, methods, 
features, etc.
** Check for existing JIRAs covering those items.  If there is no existing 
JIRA, create one, and link it to your comment.
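
The comparison step above can be partly mechanized. What follows is only a minimal sketch, not an official tool: it assumes you have already collected the public member names of each class by hand (for example from the Scaladoc), and that classes in the two packages can be matched by name. Every class and method name in it is a hypothetical placeholder.

```python
# Sketch only: diff two hand-collected API listings.
# All class and method names are hypothetical placeholders, not real Spark APIs.

def missing_in_ml(mllib_api, ml_api):
    """Map each class name to the sorted members present in mllib_api but absent from ml_api."""
    gaps = {}
    for cls, members in mllib_api.items():
        # A class absent from ml_api entirely is treated as having no members there.
        absent = set(members) - set(ml_api.get(cls, ()))
        if absent:
            gaps[cls] = sorted(absent)
    return gaps


# Placeholder listings standing in for the public APIs under review.
mllib_api = {
    "AlgorithmA": ["setParamX", "setParamY", "setParamZ"],
    "AlgorithmB": ["setParamX"],
}
ml_api = {
    "AlgorithmA": ["setParamX", "setParamY"],
}

gaps = missing_in_ml(mllib_api, ml_api)
for cls in sorted(gaps):
    print(f"{cls}: missing {', '.join(gaps[cls])}")
```

Each reported gap would still need a manual check against existing JIRAs before a new one is filed, since matching by name cannot tell a renamed method from a missing one.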

*Critical tasks*: higher-priority missing features that are required for this 
umbrella JIRA.
* Should be linked as "requires" links.

*Other tasks*: lower-priority missing features that can be completed after the 
critical tasks.
* Should be linked as "related to" links.

h4. Excluded items

This does *not* include:
* Python: We can compare Scala vs. Python in spark.ml itself.
* Moving linalg to spark.ml: [SPARK-13944]
* Streaming ML: Requires stabilizing some internal APIs of structured streaming 
first

h3. TODO list

*Critical issues*
* [SPARK-14501]: Frequent Pattern Mining
* [SPARK-14709]: linear SVM

*Lower priority issues*
* Missing methods within algorithms (see Issue Links below)
* evaluation submodule
* stat submodule (should probably be covered in DataFrames)
* Developer-facing submodules:
** optimization
** random, rdd
** util

*To be prioritized*
* single-instance prediction: [SPARK-10413]
* PMML: [SPARK-11171]


  was:
This is an umbrella JIRA for porting spark.mllib implementations to use the 
DataFrame-based API defined under spark.ml.  We want to achieve critical 
feature parity for the next release.

h3. Instructions for 3 subtask types

*Review tasks*: detailed review of a subpackage to identify feature gaps 
between spark.mllib and spark.ml.
* Should be listed as a subtask of this umbrella.
* Review subtasks cover major algorithm groups.  To pick up a review subtask, 
please:
** Comment that you are working on it.
** Compare the public APIs of spark.ml vs. spark.mllib.
** Comment on all missing items within spark.ml: algorithms, models, methods, 
features, etc.
** Check for existing JIRAs covering those items.  If there is no existing 
JIRA, create one, and link it to your comment.

*Critical tasks*: higher priority missing features which are required for this 
umbrella JIRA.
* Should be linked as "requires" links.

*Other tasks*: lower priority missing features which can be completed after the 
critical tasks.
* Should be linked as "related to" links.

h4. Excluded items

This does *not* include Python.  We can compare Scala vs. Python in spark.ml 
itself.

This also excludes moving linalg to spark.ml: [SPARK-13944]

This does not include the following items (but could eventually):
* Streaming ML
* pmml



> Algorithm/model parity for spark.ml (Scala)
> -------------------------------------------
>
>                 Key: SPARK-4591
>                 URL: https://issues.apache.org/jira/browse/SPARK-4591
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML
>            Reporter: Xiangrui Meng
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


