[ https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-4591:
-------------------------------------
    Description:
This is an umbrella JIRA for porting spark.mllib implementations to use the DataFrame-based API defined under spark.ml. We want to achieve critical feature parity for the next release.

h3. Instructions for 3 subtask types

*Review tasks*: detailed review of a subpackage to identify feature gaps between spark.mllib and spark.ml.
* Should be listed as a subtask of this umbrella.
* Review subtasks cover major algorithm groups. To pick up a review subtask, please:
** Comment that you are working on it.
** Compare the public APIs of spark.ml vs. spark.mllib.
** Comment on all missing items within spark.ml: algorithms, models, methods, features, etc.
** Check for existing JIRAs covering those items. If there is no existing JIRA, create one, and link it to your comment.

*Critical tasks*: higher priority missing features which are required for this umbrella JIRA.
* Should be linked as "requires" links.

*Other tasks*: lower priority missing features which can be completed after the critical tasks.
* Should be linked as "related to" links.

h4. Excluded items

This does *not* include:
* Python: We can compare Scala vs. Python in spark.ml itself.
* Moving linalg to spark.ml: [SPARK-13944]
* Streaming ML: Requires stabilizing some internal APIs of structured streaming first

h3. TODO list

*Critical issues*
* [SPARK-14501]: Frequent Pattern Mining
* [SPARK-14709]: linear SVM

*Lower priority issues*
* Missing methods within algorithms (see Issue Links below)
* evaluation submodule
* stat submodule (should probably be covered in DataFrames)
* Developer-facing submodules:
** optimization
** random, rdd
** util

*To be prioritized*
* single-instance prediction: [SPARK-10413]
* pmml [SPARK-11171]

  was:
This is an umbrella JIRA for porting spark.mllib implementations to use the DataFrame-based API defined under spark.ml. We want to achieve critical feature parity for the next release.

h3. Instructions for 3 subtask types

*Review tasks*: detailed review of a subpackage to identify feature gaps between spark.mllib and spark.ml.
* Should be listed as a subtask of this umbrella.
* Review subtasks cover major algorithm groups. To pick up a review subtask, please:
** Comment that you are working on it.
** Compare the public APIs of spark.ml vs. spark.mllib.
** Comment on all missing items within spark.ml: algorithms, models, methods, features, etc.
** Check for existing JIRAs covering those items. If there is no existing JIRA, create one, and link it to your comment.

*Critical tasks*: higher priority missing features which are required for this umbrella JIRA.
* Should be linked as "requires" links.

*Other tasks*: lower priority missing features which can be completed after the critical tasks.
* Should be linked as "related to" links.

h4. Excluded items

This does *not* include Python. We can compare Scala vs. Python in spark.ml itself.

This also excludes moving linalg to spark.ml: [SPARK-13944]

This does not include the following items (but could eventually):
* Streaming ML
* pmml


> Algorithm/model parity for spark.ml (Scala)
> -------------------------------------------
>
>                 Key: SPARK-4591
>                 URL: https://issues.apache.org/jira/browse/SPARK-4591
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the
> DataFrame-based API defined under spark.ml. We want to achieve critical
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups. To pick up a review subtask,
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods,
> features, etc.
> ** Check for existing JIRAs covering those items. If there is no existing
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after
> the critical tasks.
> * Should be linked as "related to" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)