[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15947:
Description: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. Note that this includes loading old saved models.
(was: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. Note that this include loading old saved models.)

> Make pipeline components backward compatible with old vector columns in Scala/Java
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. Note that this includes loading old saved models.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
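The behavior SPARK-15947 asks for (accept an old vector column, convert it automatically, emit a warning) can be sketched without Spark itself. Everything below is an illustrative stand-in, not Spark's actual API: `OldVector`/`NewVector` play the role of `org.apache.spark.mllib.linalg.Vector` and `org.apache.spark.ml.linalg.Vector`, and `accept_vector_column` plays the role of a pipeline component's input handling.

```python
import warnings

class OldVector:
    """Stand-in for the pre-2.0 spark.mllib vector type."""
    def __init__(self, values):
        self.values = list(values)

class NewVector:
    """Stand-in for the 2.0 spark.ml vector type."""
    def __init__(self, values):
        self.values = list(values)

def accept_vector_column(column):
    """Accept either vector type: pass new vectors through unchanged and
    convert old ones, warning so users notice they should migrate."""
    out = []
    for v in column:
        if isinstance(v, OldVector):
            warnings.warn("old spark.mllib vector column detected; "
                          "converting to spark.ml vectors")
            v = NewVector(v.values)
        out.append(v)
    return out
```

The point of the warning rather than an error is exactly the JIRA's goal: old pipelines and old saved models keep working in 2.0 while nudging users toward the new column type.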
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15947:
Description: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. Note that this include loading old saved models.
(was: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0.)

> Make pipeline components backward compatible with old vector columns in Scala/Java
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. Note that this include loading old saved models.
[jira] [Commented] (SPARK-15944) Make spark.ml package backward compatible with spark.mllib vectors
[ https://issues.apache.org/jira/browse/SPARK-15944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15329675#comment-15329675 ]

Xiangrui Meng commented on SPARK-15944:
We won't deprecate those utils before we deprecate the RDD-based API.

> Make spark.ml package backward compatible with spark.mllib vectors
>
> Key: SPARK-15944
> URL: https://issues.apache.org/jira/browse/SPARK-15944
> Project: Spark
> Issue Type: Umbrella
> Components: ML, MLlib
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Critical
>
> During QA, we found that it is not trivial to convert a DataFrame with old vector columns to new vector columns. So it would be easier for users to migrate their datasets and pipelines if we:
> 1) provide utils to convert DataFrames with vector columns
> 2) automatically detect and convert old vector columns in ML pipelines
> This is an umbrella JIRA to track the progress.
[jira] [Updated] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python
[ https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15948:
Description: Same as SPARK-15947 but for Python.
(was: Same as SPARK-15974 but for Python.)

> Make pipeline components backward compatible with old vector columns in Python
>
> Key: SPARK-15948
> URL: https://issues.apache.org/jira/browse/SPARK-15948
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Xiangrui Meng
>
> Same as SPARK-15947 but for Python.
[jira] [Created] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python
Xiangrui Meng created SPARK-15948:
Summary: Make pipeline components backward compatible with old vector columns in Python
Key: SPARK-15948
URL: https://issues.apache.org/jira/browse/SPARK-15948
Project: Spark
Issue Type: Sub-task
Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python
[ https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15948:
Description: Same as SPARK-15974 but for Python.

> Make pipeline components backward compatible with old vector columns in Python
>
> Key: SPARK-15948
> URL: https://issues.apache.org/jira/browse/SPARK-15948
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Xiangrui Meng
>
> Same as SPARK-15974 but for Python.
[jira] [Created] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
Xiangrui Meng created SPARK-15947:
Summary: Make pipeline components backward compatible with old vector columns in Scala/Java
Key: SPARK-15947
URL: https://issues.apache.org/jira/browse/SPARK-15947
Project: Spark
Issue Type: Sub-task
Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng

After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0.
[jira] [Created] (SPARK-15946) Wrap the conversion utils in Python
Xiangrui Meng created SPARK-15946:
Summary: Wrap the conversion utils in Python
Key: SPARK-15946
URL: https://issues.apache.org/jira/browse/SPARK-15946
Project: Spark
Issue Type: Sub-task
Reporter: Xiangrui Meng

This is to wrap SPARK-15943 in Python. So Python users can use it to convert DataFrames with vector columns.
[jira] [Updated] (SPARK-15946) Wrap the conversion utils in Python
[ https://issues.apache.org/jira/browse/SPARK-15946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15946:
Description: This is to wrap SPARK-15945 in Python. So Python users can use it to convert DataFrames with vector columns.
(was: This is to wrap SPARK-15943 in Python. So Python users can use it to convert DataFrames with vector columns.)

> Wrap the conversion utils in Python
>
> Key: SPARK-15946
> URL: https://issues.apache.org/jira/browse/SPARK-15946
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Xiangrui Meng
>
> This is to wrap SPARK-15945 in Python. So Python users can use it to convert DataFrames with vector columns.
[jira] [Created] (SPARK-15945) Implement conversion utils in Scala/Java
Xiangrui Meng created SPARK-15945:
Summary: Implement conversion utils in Scala/Java
Key: SPARK-15945
URL: https://issues.apache.org/jira/browse/SPARK-15945
Project: Spark
Issue Type: Sub-task
Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

This is to provide conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually.
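The user-facing utility SPARK-15945 proposes (convert chosen vector columns of a DataFrame, or all of them if none are named) can be sketched on a plain list-of-dicts stand-in for a DataFrame. All names below are illustrative, not Spark's API; `OldVector`/`NewVector` again stand in for the `spark.mllib` and `spark.ml` vector types.

```python
class OldVector:
    """Stand-in for the pre-2.0 spark.mllib vector type."""
    def __init__(self, values):
        self.values = list(values)

class NewVector:
    """Stand-in for the 2.0 spark.ml vector type."""
    def __init__(self, values):
        self.values = list(values)

def convert_vector_columns(rows, cols=None):
    """Return a copy of `rows` (list of dicts standing in for a DataFrame)
    with old vectors converted to new ones.

    cols=None means: detect and convert every old vector column;
    otherwise only the named columns are touched."""
    out = []
    for row in rows:
        new_row = dict(row)
        for name, value in row.items():
            if isinstance(value, OldVector) and (cols is None or name in cols):
                new_row[name] = NewVector(value.values)
        out.append(new_row)
    return out
```

Keeping the conversion explicit and column-scoped is what lets users migrate datasets "manually", as the description says, instead of relying only on the automatic pipeline-level conversion tracked in SPARK-15947.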
[jira] [Created] (SPARK-15944) Make spark.ml package backward compatible with spark.mllib vectors
Xiangrui Meng created SPARK-15944:
Summary: Make spark.ml package backward compatible with spark.mllib vectors
Key: SPARK-15944
URL: https://issues.apache.org/jira/browse/SPARK-15944
Project: Spark
Issue Type: Umbrella
Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

During QA, we found that it is not trivial to convert a DataFrame with old vector columns to new vector columns. So it would be easier for users to migrate their datasets and pipelines if we:
1) provide utils to convert DataFrames with vector columns
2) automatically detect and convert old vector columns in ML pipelines
This is an umbrella JIRA to track the progress.
[jira] [Updated] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
[ https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15364:
Assignee: Liang-Chi Hsieh

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
> Now picklers for both new and old vectors are implemented under PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement them under `spark.ml.python` instead. I set the target to 2.1 since those are private APIs.
[jira] [Updated] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
[ https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15364:
Target Version/s: 2.0.0 (was: 2.1.0)

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Fix For: 2.0.0
>
> Now picklers for both new and old vectors are implemented under PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement them under `spark.ml.python` instead. I set the target to 2.1 since those are private APIs.
[jira] [Resolved] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
[ https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-15364.
Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13219
[https://github.com/apache/spark/pull/13219]

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Fix For: 2.0.0
>
> Now picklers for both new and old vectors are implemented under PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement them under `spark.ml.python` instead. I set the target to 2.1 since those are private APIs.
[jira] [Updated] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15799:
Target Version/s: 2.1.0

> Release SparkR on CRAN
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can use SparkR easily in an existing R environment and have other packages built on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?
[jira] [Updated] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15581:
Description:

This is a master list for MLlib improvements we are working on for the next release. Please view this as a wish list rather than a definite plan, for we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delay in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there exist no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
* Remember to add the `@Since("VERSION")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps to improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link them properly.
* Add a "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and documentation if applicable.

h1. Roadmap (*WIP*)

This is NOT [a complete list of MLlib JIRAs for 2.1|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. We only include umbrella JIRAs and high-level tasks.

Major efforts in this release:
* Feature parity for the DataFrames-based API (`spark.ml`), relative to the RDD-based API
* ML persistence
* Python API feature parity and test coverage
* R API expansion and improvements
* Note about new features: As usual, we expect to expand the feature set of MLlib. However, we will prioritize API parity, bug fixes, and improvements over new features. Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for it, but new features, APIs, and improvements will only be added to `spark.ml`.

h2. Critical feature parity in DataFrame-based API

* Umbrella JIRA: [SPARK-4591]

h2. Persistence

* Complete persistence within MLlib
** Python tuning (SPARK-13786)
* MLlib in R format: compatibility with other languages (SPARK-15572)
* Impose backwards compatibility for persistence (SPARK-15573)

h2. Python API

* Standardize unit tests for Scala and Python to improve and consolidate test coverage for Params, persistence, and other common functionality (SPARK-15571)
* Improve Python API handling of Params, persistence (SPARK-14771) (SPARK-14706)
** Note: The linked JIRAs for this are incomplete. More to be created...
** Related: Implement Python meta-algorithms in Scala (to simplify persistence) (SPARK-15574)
* Feature parity: The main goal of the Python API is to have feature parity with the Scala/Java API. You can find a [complete list here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC]. The tasks fall into two major categories:
** Python API for missing methods (SPARK-14813)
** Python API for new algorithms. Committers should create a JIRA for the Python API after merging a public feature in Scala/Java.

h2. SparkR

* Improve R formula support and implementation (SPARK-15540)
* Various
[jira] [Created] (SPARK-15799) Release SparkR on CRAN
Xiangrui Meng created SPARK-15799:
Summary: Release SparkR on CRAN
Key: SPARK-15799
URL: https://issues.apache.org/jira/browse/SPARK-15799
Project: Spark
Issue Type: New Feature
Components: SparkR
Reporter: Xiangrui Meng

Story: "As an R user, I would like to see SparkR released on CRAN, so I can use SparkR easily in an existing R environment and have other packages built on top of SparkR."

I made this JIRA with the following questions in mind:
* Are there known issues that prevent us releasing SparkR on CRAN?
* Do we want to package Spark jars in the SparkR release?
* Are there license issues?
* How does it fit into Spark's release process?
[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314816#comment-15314816 ]

Xiangrui Meng commented on SPARK-15740:
The proposal looks good to me. Please also try to measure the memory requirement so we can easily tell whether the issue is fixed or not. Triggering Jenkins maven builds is not convenient.

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]
[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313198#comment-15313198 ]

Xiangrui Meng commented on SPARK-15740:
[~tmnd91] Could you run the test and estimate how much ram does it need? Btw, we should set spark.kryoserializer.buffer.max to a small value instead of creating a big array. Do you have time to look into this issue?

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]
[jira] [Comment Edited] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313198#comment-15313198 ]

Xiangrui Meng edited comment on SPARK-15740 at 6/2/16 10:24 PM:
[~tmnd91] Could you run the test and estimate how much ram does it need? Btw, we should set spark.kryoserializer.buffer.max to a small value instead of creating a big array for the test. Do you have time to look into this issue?

was (Author: mengxr):
[~tmnd91] Could you run the test and estimate how much ram does it need? Btw, we should set spark.kryoserializer.buffer.max to a small value instead of creating a big array. Do you have time to look into this issue?

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]
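The suggestion in the comment above — lower the Kryo buffer ceiling for the test instead of allocating a huge array — would amount to a test-local override of a real Spark setting, `spark.kryoserializer.buffer.max`. A sketch of such an override (the value shown is illustrative, not what the fix actually used):

```
# test-scoped Spark configuration, spark-defaults style
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  64k
```

With a small ceiling, serializing a model larger than the buffer exercises the same overflow path the "big model" test targets, without the test itself needing gigabytes of heap.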
[jira] [Updated] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
[ https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-15740:
Description: [~andrewor14] noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed. I'm going to disable the test first and leave this open for a proper fix. cc [~tmnd91]
(was: [~andrewor14] noticed some OOM errors caused by "test big model load / save" in Word2VecSuite. I'm going to disable the test first and leave this open for a proper fix. cc [~tmnd91])

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.0.0
> Reporter: Xiangrui Meng
> Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]
[jira] [Created] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
Xiangrui Meng created SPARK-15740:
Summary: Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
Key: SPARK-15740
URL: https://issues.apache.org/jira/browse/SPARK-15740
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Priority: Critical

[~andrewor14] noticed some OOM errors caused by "test big model load / save" in Word2VecSuite. I'm going to disable the test first and leave this open for a proper fix. cc [~tmnd91]
[jira] [Resolved] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency
[ https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13944. --- Resolution: Fixed Fix Version/s: 2.0.0 > Separate out local linear algebra as a standalone module without Spark > dependency > - > > Key: SPARK-13944 > URL: https://issues.apache.org/jira/browse/SPARK-13944 > Project: Spark > Issue Type: New Feature > Components: Build, ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: DB Tsai >Priority: Blocker > Fix For: 2.0.0 > > > Separate out linear algebra as a standalone module without Spark dependency > to simplify production deployment. We can call the new module > spark-mllib-local, which might contain local models in the future. > The major issue is to remove dependencies on user-defined types. > The package name will be changed from mllib to ml. For example, Vector will > be changed from `org.apache.spark.mllib.linalg.Vector` to > `org.apache.spark.ml.linalg.Vector`. The return vector type in the new ML > pipeline will be the one in the ml package; however, the existing mllib code will > not be touched. As a result, this will potentially break the API. Also, when > a vector is loaded from an mllib vector by Spark SQL, it will > be automatically converted into the one in the ml package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-14529) Consolidate mllib and mllib-local into one mllib folder
[ https://issues.apache.org/jira/browse/SPARK-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-14529. - Resolution: Won't Fix Marked the issue as won't fix. The main reason is that mllib-local might be used by external packages directly. > Consolidate mllib and mllib-local into one mllib folder > --- > > Key: SPARK-14529 > URL: https://issues.apache.org/jira/browse/SPARK-14529 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Minor > > In the 2.0 QA period (to avoid conflicts with other PRs), this task will > consolidate `mllib/src` into `mllib/mllib/src` and `mllib-local/src` into > `mllib/mllib-local/src`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
[ https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-15043. - Resolution: Fixed Fixed as part of SPARK-15030. > Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr > - > > Key: SPARK-15043 > URL: https://issues.apache.org/jira/browse/SPARK-15043 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Sean Owen >Priority: Critical > > It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become > flaky: > https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr > The first observed failure was in > https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816 > {code} > java.lang.AssertionError: expected:<0.9986422261219262> but > was:<0.9986422261219272> > at > org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75) > {code} > I'm going to ignore this test now, but we need to come back and fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
[ https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15043: -- Fix Version/s: 2.0.0 > Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr > - > > Key: SPARK-15043 > URL: https://issues.apache.org/jira/browse/SPARK-15043 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Sean Owen >Priority: Critical > Fix For: 2.0.0 > > > It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become > flaky: > https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr > The first observed failure was in > https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816 > {code} > java.lang.AssertionError: expected:<0.9986422261219262> but > was:<0.9986422261219272> > at > org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75) > {code} > I'm going to ignore this test now, but we need to come back and fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
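The failure above is a classic floating-point flake: the expected and actual correlations differ only around the 1e-15 level, which can depend on summation order and platform. A common fix for such tests (sketched below in plain Java as an illustration, not the actual Spark patch) is to compare doubles with a tolerance instead of exact equality:

```java
// Sketch: why exact double equality is fragile in tests, and how a
// tolerance-based comparison (like JUnit's assertEquals(expected, actual, delta))
// accepts tiny rounding differences.
public class ToleranceCheck {
    // Compare two doubles within an absolute tolerance.
    static boolean almostEqual(double expected, double actual, double delta) {
        return Math.abs(expected - actual) <= delta;
    }

    public static void main(String[] args) {
        // The two values from the reported failure differ only in the last digit.
        double expected = 0.9986422261219262;
        double actual = 0.9986422261219272;

        // Exact equality fails on the rounding difference...
        System.out.println(expected == actual);                   // false
        // ...but a 1e-10 tolerance accepts it.
        System.out.println(almostEqual(expected, actual, 1e-10)); // true
    }
}
```

With this pattern the test asserts correctness up to numerical noise rather than bit-for-bit reproducibility.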
[jira] [Commented] (SPARK-14529) Consolidate mllib and mllib-local into one mllib folder
[ https://issues.apache.org/jira/browse/SPARK-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299511#comment-15299511 ] Xiangrui Meng commented on SPARK-14529: --- We should decide whether we want to make this change in 2.0. I don't have a strong preference on which folder layout is better, so I would +1 keeping the current layout since it doesn't require code changes. How does that sound? > Consolidate mllib and mllib-local into one mllib folder > --- > > Key: SPARK-14529 > URL: https://issues.apache.org/jira/browse/SPARK-14529 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Minor > > In the 2.0 QA period (to avoid conflicts with other PRs), this task will > consolidate `mllib/src` into `mllib/mllib/src` and `mllib-local/src` into > `mllib/mllib-local/src`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15447) Performance test for ALS in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15447: -- Labels: QA (was: ) > Performance test for ALS in Spark 2.0 > - > > Key: SPARK-15447 > URL: https://issues.apache.org/jira/browse/SPARK-15447 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > Labels: QA > > We made several changes to ALS in 2.0. It is necessary to run some tests to > avoid performance regression. We should test (synthetic) datasets from 1 > million ratings to 1 billion ratings. > cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance > tests? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15447) Performance test for ALS in Spark 2.0
Xiangrui Meng created SPARK-15447: - Summary: Performance test for ALS in Spark 2.0 Key: SPARK-15447 URL: https://issues.apache.org/jira/browse/SPARK-15447 Project: Spark Issue Type: Task Components: ML Affects Versions: 2.0.0 Reporter: Xiangrui Meng Priority: Critical We made several changes to ALS in 2.0. It is necessary to run some tests to avoid performance regression. We should test (synthetic) datasets from 1 million ratings to 1 billion ratings. cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance tests? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15222) SparkR ML examples update in 2.0
[ https://issues.apache.org/jira/browse/SPARK-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15222. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13000 [https://github.com/apache/spark/pull/13000] > SparkR ML examples update in 2.0 > > > Key: SPARK-15222 > URL: https://issues.apache.org/jira/browse/SPARK-15222 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Update example code in examples/src/main/r/ml.R to reflect the new algorithms. > * spark.glm and glm > * spark.survreg > * spark.naiveBayes > * spark.kmeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15222) SparkR ML examples update in 2.0
[ https://issues.apache.org/jira/browse/SPARK-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15222: -- Assignee: Yanbo Liang > SparkR ML examples update in 2.0 > > > Key: SPARK-15222 > URL: https://issues.apache.org/jira/browse/SPARK-15222 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Update example code in examples/src/main/r/ml.R to reflect the new algorithms. > * spark.glm and glm > * spark.survreg > * spark.naiveBayes > * spark.kmeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type
[ https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15153: -- Shepherd: Xiangrui Meng > SparkR spark.naiveBayes throws error when label is numeric type > --- > > Key: SPARK-15153 > URL: https://issues.apache.org/jira/browse/SPARK-15153 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > When the label of dataset is numeric type, SparkR spark.naiveBayes will throw > error. This bug is easy to reproduce: > {code} > t <- as.data.frame(Titanic) > t1 <- t[t$Freq > 0, -5] > t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1) > t2 <- t1[-4] > df <- suppressWarnings(createDataFrame(sqlContext, t2)) > m <- spark.naiveBayes(df, NumericSurvived ~ .) > 16/05/05 03:26:17 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.NaiveBayesWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.ClassCastException: > org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to > org.apache.spark.ml.attribute.NominalAttribute > at > org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66) > at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at io.netty.channel.AbstractChannelHandlerContext.invo > 
{code} > In RFormula, the response variable type could be string or numeric. If it's > a string, RFormula will transform it to a label of DoubleType via StringIndexer > and set the corresponding column metadata; otherwise, RFormula will directly use > it as the label when training the model (and assumes that it was numbered from 0, > ..., maxLabelIndex). > When we extract labels in ml.r.NaiveBayesWrapper, we should handle it > according to the type of the response variable (string or numeric). > cc [~mengxr] [~josephkb] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type
[ https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15153: -- Assignee: Yanbo Liang > SparkR spark.naiveBayes throws error when label is numeric type > --- > > Key: SPARK-15153 > URL: https://issues.apache.org/jira/browse/SPARK-15153 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > When the label of dataset is numeric type, SparkR spark.naiveBayes will throw > error. This bug is easy to reproduce: > {code} > t <- as.data.frame(Titanic) > t1 <- t[t$Freq > 0, -5] > t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1) > t2 <- t1[-4] > df <- suppressWarnings(createDataFrame(sqlContext, t2)) > m <- spark.naiveBayes(df, NumericSurvived ~ .) > 16/05/05 03:26:17 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.NaiveBayesWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.ClassCastException: > org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to > org.apache.spark.ml.attribute.NominalAttribute > at > org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66) > at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at io.netty.channel.AbstractChannelHandlerContext.invo > 
{code} > In RFormula, the response variable type could be string or numeric. If it's > a string, RFormula will transform it to a label of DoubleType via StringIndexer > and set the corresponding column metadata; otherwise, RFormula will directly use > it as the label when training the model (and assumes that it was numbered from 0, > ..., maxLabelIndex). > When we extract labels in ml.r.NaiveBayesWrapper, we should handle it > according to the type of the response variable (string or numeric). > cc [~mengxr] [~josephkb] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
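The two label paths described in SPARK-15153 can be illustrated with a minimal, self-contained Java sketch. This is hypothetical stand-in code, not Spark's RFormula, StringIndexer, or NaiveBayesWrapper implementation: string responses get indexed to double labels with their level names recorded, while numeric responses are used as-is.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of handling a string vs. numeric response variable.
public class LabelHandling {
    // Index string labels in order of first appearance (a simplified
    // StringIndexer, without frequency-based ordering), recording level names.
    static double[] indexStringLabels(List<String> labels, List<String> levelsOut) {
        Map<String, Integer> index = new LinkedHashMap<>();
        double[] out = new double[labels.size()];
        for (int i = 0; i < labels.size(); i++) {
            index.putIfAbsent(labels.get(i), index.size());
            out[i] = index.get(labels.get(i));
        }
        levelsOut.addAll(index.keySet());
        return out;
    }

    // Numeric labels pass through unchanged (assumed 0.0 ... maxLabelIndex);
    // their "level names" fall back to the string form of the values.
    static double[] passThroughNumericLabels(double[] labels, List<String> levelsOut) {
        for (double l : labels) {
            String name = String.valueOf(l);
            if (!levelsOut.contains(name)) levelsOut.add(name);
        }
        return labels;
    }

    public static void main(String[] args) {
        List<String> stringLevels = new ArrayList<>();
        indexStringLabels(List.of("No", "Yes", "No"), stringLevels);
        System.out.println(stringLevels);  // [No, Yes]

        List<String> numericLevels = new ArrayList<>();
        passThroughNumericLabels(new double[]{0.0, 1.0, 0.0}, numericLevels);
        System.out.println(numericLevels); // [0.0, 1.0]
    }
}
```

The reported ClassCastException arises because only the first path attaches nominal column metadata; a wrapper that unconditionally expects that metadata fails on the second path.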
[jira] [Updated] (SPARK-15339) ML 2.0 QA: Scala APIs and code audit for regression
[ https://issues.apache.org/jira/browse/SPARK-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15339: -- Assignee: Yanbo Liang > ML 2.0 QA: Scala APIs and code audit for regression > --- > > Key: SPARK-15339 > URL: https://issues.apache.org/jira/browse/SPARK-15339 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.0.0 > > > ML 2.0 QA: Scala APIs and code audit for regression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15339) ML 2.0 QA: Scala APIs and code audit for regression
[ https://issues.apache.org/jira/browse/SPARK-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15339. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13129 [https://github.com/apache/spark/pull/13129] > ML 2.0 QA: Scala APIs and code audit for regression > --- > > Key: SPARK-15339 > URL: https://issues.apache.org/jira/browse/SPARK-15339 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > Fix For: 2.0.0 > > > ML 2.0 QA: Scala APIs and code audit for regression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15394) ML user guide typos and grammar audit
[ https://issues.apache.org/jira/browse/SPARK-15394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15394: -- Fix Version/s: 2.0.0 > ML user guide typos and grammar audit > - > > Key: SPARK-15394 > URL: https://issues.apache.org/jira/browse/SPARK-15394 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Trivial > Fix For: 2.0.0 > > > Audit the wording in ml user guides. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15394) ML user guide typos and grammar audit
[ https://issues.apache.org/jira/browse/SPARK-15394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15394: -- Assignee: Seth Hendrickson > ML user guide typos and grammar audit > - > > Key: SPARK-15394 > URL: https://issues.apache.org/jira/browse/SPARK-15394 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Trivial > Fix For: 2.0.0 > > > Audit the wording in ml user guides. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15398) Update the warning message to recommend ML usage
[ https://issues.apache.org/jira/browse/SPARK-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15398. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13190 [https://github.com/apache/spark/pull/13190] > Update the warning message to recommend ML usage > > > Key: SPARK-15398 > URL: https://issues.apache.org/jira/browse/SPARK-15398 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > Fix For: 2.0.0 > > > Update the warning message in the example, and recommend users use ML instead > of MLlib > from > {code} > def showWarning() { > System.err.println( > """WARN: This is a naive implementation of Logistic Regression and is > given as an example! > |Please use either > org.apache.spark.mllib.classification.LogisticRegressionWithSGD or > |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS > |for more conventional use. > """.stripMargin) > } > {code} > to > {code} > def showWarning() { > System.err.println( > """WARN: This is a naive implementation of Logistic Regression and is > given as an example! > |Please use org.apache.spark.ml.classification.LogisticRegression > |for more conventional use. > """.stripMargin) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15398) Update the warning message to recommend ML usage
[ https://issues.apache.org/jira/browse/SPARK-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15398: -- Assignee: zhengruifeng > Update the warning message to recommend ML usage > > > Key: SPARK-15398 > URL: https://issues.apache.org/jira/browse/SPARK-15398 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.0.0 > > > Update the warning message in the example, and recommend users use ML instead > of MLlib > from > {code} > def showWarning() { > System.err.println( > """WARN: This is a naive implementation of Logistic Regression and is > given as an example! > |Please use either > org.apache.spark.mllib.classification.LogisticRegressionWithSGD or > |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS > |for more conventional use. > """.stripMargin) > } > {code} > to > {code} > def showWarning() { > System.err.println( > """WARN: This is a naive implementation of Logistic Regression and is > given as an example! > |Please use org.apache.spark.ml.classification.LogisticRegression > |for more conventional use. > """.stripMargin) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15363: -- Assignee: Miao Wang > Example code shouldn't use VectorImplicits._, asML/fromML > - > > Key: SPARK-15363 > URL: https://issues.apache.org/jira/browse/SPARK-15363 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Miao Wang > Fix For: 2.0.0 > > > In SPARK-14615, we use VectorImplicits._ and asML in example code to > minimize the changes in that PR. However, these are private APIs, which > shouldn't appear in the example code. We should consider updating them during > QA. > https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15363. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13213 [https://github.com/apache/spark/pull/13213] > Example code shouldn't use VectorImplicits._, asML/fromML > - > > Key: SPARK-15363 > URL: https://issues.apache.org/jira/browse/SPARK-15363 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Xiangrui Meng > Fix For: 2.0.0 > > > In SPARK-14615, we use VectorImplicits._ and asML in example code to > minimize the changes in that PR. However, these are private APIs, which > shouldn't appear in the example code. We should consider updating them during > QA. > https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15172) Warning message should explicitly tell user initial coefficients is ignored if its size doesn't match expected size in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15172: -- Fix Version/s: (was: 2.1.0) 2.0.0 > Warning message should explicitly tell user initial coefficients is ignored > if its size doesn't match expected size in LogisticRegression > - > > Key: SPARK-15172 > URL: https://issues.apache.org/jira/browse/SPARK-15172 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: ding >Assignee: ding >Priority: Trivial > Fix For: 2.0.0 > > > In the ML LogisticRegression code, if the size of the initial coefficients > doesn't match the expected size, the initial coefficients will be ignored. We > should explicitly tell the user this. Besides, logging the size of the initial > coefficients is more straightforward than logging their values > when a size mismatch happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15296) Refactor All Java Tests that use SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15296. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13101 [https://github.com/apache/spark/pull/13101] > Refactor All Java Tests that use SparkSession > - > > Key: SPARK-15296 > URL: https://issues.apache.org/jira/browse/SPARK-15296 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, Tests >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Minor > Fix For: 2.0.0 > > > There's a lot of duplicate code in the Java tests: the {{setUp()}} and {{tearDown()}} > of most Java test classes in ML/MLlib. So we will create a {{SharedSparkSession}} class that has common code for > {{setUp}} and {{tearDown}}, and the other test classes just extend it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15341) Add documentation for `model.write` to clarify `summary` was not saved
[ https://issues.apache.org/jira/browse/SPARK-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15341: -- Assignee: Yanbo Liang > Add documentation for `model.write` to clarify `summary` was not saved > --- > > Key: SPARK-15341 > URL: https://issues.apache.org/jira/browse/SPARK-15341 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Currently in model.write, we don't save the summary (if applicable). We should add > documentation to clarify this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15341) Add documentation for `model.write` to clarify `summary` was not saved
[ https://issues.apache.org/jira/browse/SPARK-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15341: -- Fix Version/s: 2.0.0 > Add documentation for `model.write` to clarify `summary` was not saved > --- > > Key: SPARK-15341 > URL: https://issues.apache.org/jira/browse/SPARK-15341 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Currently in model.write, we don't save the summary (if applicable). We should add > documentation to clarify this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15414) Make the mllib,ml linalg type conversion APIs public
[ https://issues.apache.org/jira/browse/SPARK-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15414: -- Assignee: Sandeep Singh > Make the mllib,ml linalg type conversion APIs public > > > Key: SPARK-15414 > URL: https://issues.apache.org/jira/browse/SPARK-15414 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Sandeep Singh > Fix For: 2.0.0 > > > We should open up the APIs for converting between new, old linear algebra > types (in spark.mllib.linalg): > * Vector.asML > * Vectors.fromML > * same for Sparse/Dense and for Matrices > I made these private originally, but they will be useful for users > transitioning workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15414) Make the mllib,ml linalg type conversion APIs public
[ https://issues.apache.org/jira/browse/SPARK-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15414. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13202 [https://github.com/apache/spark/pull/13202] > Make the mllib,ml linalg type conversion APIs public > > > Key: SPARK-15414 > URL: https://issues.apache.org/jira/browse/SPARK-15414 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley > Fix For: 2.0.0 > > > We should open up the APIs for converting between new, old linear algebra > types (in spark.mllib.linalg): > * Vector.asML > * Vectors.fromML > * same for Sparse/Dense and for Matrices > I made these private originally, but they will be useful for users > transitioning workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
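The conversion pattern this issue makes public (`asML` on the old type, `Vectors.fromML` on the companion) can be illustrated with a self-contained Java sketch. `OldDenseVector` and `NewDenseVector` below are hypothetical stand-ins for the real `org.apache.spark.mllib.linalg` and `org.apache.spark.ml.linalg` classes, showing only the shape of the API, not Spark's implementation:

```java
import java.util.Arrays;

// Sketch of the old <-> new vector conversion pattern (stand-in types).
public class VectorConversion {
    static final class OldDenseVector {
        final double[] values;
        OldDenseVector(double[] values) { this.values = values; }
        // Mirrors Vector.asML: convert the old type to the new one.
        NewDenseVector asML() { return new NewDenseVector(values.clone()); }
    }

    static final class NewDenseVector {
        final double[] values;
        NewDenseVector(double[] values) { this.values = values; }
    }

    // Mirrors Vectors.fromML: convert a new vector back to the old type.
    static OldDenseVector fromML(NewDenseVector v) {
        return new OldDenseVector(v.values.clone());
    }

    public static void main(String[] args) {
        OldDenseVector old = new OldDenseVector(new double[]{1.0, 0.0, 3.0});
        NewDenseVector ml = old.asML();   // old -> new
        OldDenseVector back = fromML(ml); // new -> old
        System.out.println(Arrays.equals(old.values, back.values)); // true
    }
}
```

Making both directions public lets users migrating workloads round-trip values at the boundary between spark.mllib and spark.ml code.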
[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292403#comment-15292403 ] Xiangrui Meng commented on SPARK-15363: --- No. I think we need to make the converters between the new and old vectors public (WIP); then the example code won't need the implicits. Another option is to make the implicits public. > Example code shouldn't use VectorImplicits._, asML/fromML > - > > Key: SPARK-15363 > URL: https://issues.apache.org/jira/browse/SPARK-15363 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Xiangrui Meng > > In SPARK-14615, we use VectorImplicits._ and asML in example code to > minimize the changes in that PR. However, these are private APIs, which > shouldn't appear in the example code. We should consider updating them during > QA. > https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms
[ https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14615. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12627 [https://github.com/apache/spark/pull/12627] > Use the new ML Vector and Matrix in the ML pipeline based algorithms > - > > Key: SPARK-14615 > URL: https://issues.apache.org/jira/browse/SPARK-14615 > Project: Spark > Issue Type: Sub-task > Components: Build, ML >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Blocker > Fix For: 2.0.0 > > > Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new > vector and matrix type in the new ml pipeline based apis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms
[ https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14615: -- Priority: Blocker (was: Major) > Use the new ML Vector and Matrix in the ML pipeline based algorithms > - > > Key: SPARK-14615 > URL: https://issues.apache.org/jira/browse/SPARK-14615 > Project: Spark > Issue Type: Sub-task > Components: Build, ML >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Blocker > > Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new > vector and matrix types in the new ML pipeline based APIs.
[jira] [Created] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
Xiangrui Meng created SPARK-15364: - Summary: Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python Key: SPARK-15364 URL: https://issues.apache.org/jira/browse/SPARK-15364 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.0.0 Reporter: Xiangrui Meng Now picklers for both new and old vectors are implemented under PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement them under `spark.ml.python` instead. I set the target to 2.1 since those are private APIs.
[jira] [Updated] (SPARK-15363) Example code shouldn't use VectorImplicits._
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15363: -- Description: In SPARK-14615, we use VectorImplicits._ and asML in example code to minimize the changes in that PR. However, this is a private API, which shouldn't appear in the example code. We should consider updating them during QA. https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala was: In SPARK-14615, we use VectorImplicits._ in example code to minimize the changes in that PR. However, this is a private API, which shouldn't appear in the example code. We should consider updating them during QA. https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala > Example code shouldn't use VectorImplicits._ > > > Key: SPARK-15363 > URL: https://issues.apache.org/jira/browse/SPARK-15363 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Xiangrui Meng > > In SPARK-14615, we use VectorImplicits._ and asML in example code to > minimize the changes in that PR. However, this is a private API, which > shouldn't appear in the example code. We should consider updating them during > QA. > https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala
[jira] [Updated] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15363: -- Summary: Example code shouldn't use VectorImplicits._, asML/fromML (was: Example code shouldn't use VectorImplicits._) > Example code shouldn't use VectorImplicits._, asML/fromML > - > > Key: SPARK-15363 > URL: https://issues.apache.org/jira/browse/SPARK-15363 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Xiangrui Meng > > In SPARK-14615, we use VectorImplicits._ and asML in example code to > minimize the changes in that PR. However, this is a private API, which > shouldn't appear in the example code. We should consider updating them during > QA. > https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala
[jira] [Updated] (SPARK-15363) Example code shouldn't use VectorImplicits._
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15363: -- Description: In SPARK-14615, we use VectorImplicits._ in example code to minimize the changes in that PR. However, this is a private API, which shouldn't appear in the example code. We should consider updating them during QA. https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala was: In SPARK-14615, we use VectorImplicits._ in example code to minimize the changes in that PR. However, this is a private API, which shouldn't appear in the example code. We should consider updating them during QA. > Example code shouldn't use VectorImplicits._ > > > Key: SPARK-15363 > URL: https://issues.apache.org/jira/browse/SPARK-15363 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Xiangrui Meng > > In SPARK-14615, we use VectorImplicits._ in example code to minimize the > changes in that PR. However, this is a private API, which shouldn't appear in > the example code. We should consider updating them during QA. > https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala
[jira] [Created] (SPARK-15363) Example code shouldn't use VectorImplicits._
Xiangrui Meng created SPARK-15363: - Summary: Example code shouldn't use VectorImplicits._ Key: SPARK-15363 URL: https://issues.apache.org/jira/browse/SPARK-15363 Project: Spark Issue Type: Improvement Components: Documentation, ML Reporter: Xiangrui Meng In SPARK-14615, we use VectorImplicits._ in example code to minimize the changes in that PR. However, this is a private API, which shouldn't appear in the example code. We should consider updating them during QA.
[jira] [Updated] (SPARK-14906) Copy pyspark.mllib.linalg to pyspark.ml.linalg
[ https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14906: -- Summary: Copy pyspark.mllib.linalg to pyspark.ml.linalg (was: Move VectorUDT and MatrixUDT in PySpark to new ML package) > Copy pyspark.mllib.linalg to pyspark.ml.linalg > -- > > Key: SPARK-14906 > URL: https://issues.apache.org/jira/browse/SPARK-14906 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > As we move VectorUDT and MatrixUDT in Scala to the new ml package, the PySpark > code should be moved too.
[jira] [Updated] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package
[ https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14906: -- Assignee: Liang-Chi Hsieh > Move VectorUDT and MatrixUDT in PySpark to new ML package > - > > Key: SPARK-14906 > URL: https://issues.apache.org/jira/browse/SPARK-14906 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > As we move VectorUDT and MatrixUDT in Scala to the new ml package, the PySpark > code should be moved too.
[jira] [Resolved] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package
[ https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14906. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13099 [https://github.com/apache/spark/pull/13099] > Move VectorUDT and MatrixUDT in PySpark to new ML package > - > > Key: SPARK-14906 > URL: https://issues.apache.org/jira/browse/SPARK-14906 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > As we move VectorUDT and MatrixUDT in Scala to the new ml package, the PySpark > code should be moved too.
[jira] [Resolved] (SPARK-15268) Make JavaTypeInference work with UDTRegistration
[ https://issues.apache.org/jira/browse/SPARK-15268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15268. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13046 [https://github.com/apache/spark/pull/13046] > Make JavaTypeInference work with UDTRegistration > > > Key: SPARK-15268 > URL: https://issues.apache.org/jira/browse/SPARK-15268 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > We have a private UDTRegistration API to register user-defined types. > Currently JavaTypeInference can't work with it. We should make it work.
[jira] [Updated] (SPARK-15268) Make JavaTypeInference work with UDTRegistration
[ https://issues.apache.org/jira/browse/SPARK-15268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15268: -- Assignee: Liang-Chi Hsieh > Make JavaTypeInference work with UDTRegistration > > > Key: SPARK-15268 > URL: https://issues.apache.org/jira/browse/SPARK-15268 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > We have a private UDTRegistration API to register user-defined types. > Currently JavaTypeInference can't work with it. We should make it work.
[jira] [Updated] (SPARK-14050) Add multiple languages support for Stop Words Remover
[ https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14050: -- Assignee: Burak KÖSE > Add multiple languages support for Stop Words Remover > - > > Key: SPARK-14050 > URL: https://issues.apache.org/jira/browse/SPARK-14050 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Burak KÖSE >Assignee: Burak KÖSE > Fix For: 2.0.0 > >
[jira] [Resolved] (SPARK-14050) Add multiple languages support for Stop Words Remover
[ https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14050. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12843 [https://github.com/apache/spark/pull/12843] > Add multiple languages support for Stop Words Remover > - > > Key: SPARK-14050 > URL: https://issues.apache.org/jira/browse/SPARK-14050 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Burak KÖSE > Fix For: 2.0.0 > >
[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269218#comment-15269218 ] Xiangrui Meng commented on SPARK-15027: --- Ah, I see the problems now. We do need the hash partitioner to accelerate queries from the driver and probably joins. What if we convert the factors using `repartition(blocks, "id")` before we return the factors? It should come with a hash partitioner, but it might be different from the one we used in ALS. #2 seems like a bug. Could you provide a minimal example that can reproduce it? Given the pending issues, it seems that we should target this to 2.1. Sounds good? > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization.
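The point of `repartition(blocks, "id")` is that rows are hash-partitioned by the `id` column, so all factors for a given id land in one partition and driver lookups or joins only touch that partition. The idea can be sketched in plain Python (this models the concept only; Spark's actual partitioner uses Murmur3, not Python's `hash`):

```python
def assign_partition(user_id, num_partitions):
    # Rows with equal ids always map to the same partition, which is
    # what makes point lookups and co-partitioned joins cheap.
    return hash(user_id) % num_partitions

num_partitions = 8
factors = {uid: [0.1 * uid] for uid in range(100)}  # toy id -> factor vector

partitions = [[] for _ in range(num_partitions)]
for uid in factors:
    partitions[assign_partition(uid, num_partitions)].append(uid)

# Every id lives in exactly one partition, and its location is
# recomputable from the id alone -- no full scan needed.
assert sum(len(p) for p in partitions) == len(factors)
assert all(assign_partition(uid, num_partitions) == i
           for i, part in enumerate(partitions) for uid in part)
```

The caveat in the comment also shows up here: a consumer repartitioning with a different `num_partitions` (or a different hash) gets a layout incompatible with the one used inside ALS.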
[jira] [Resolved] (SPARK-6717) Clear shuffle files after checkpointing in ALS
[ https://issues.apache.org/jira/browse/SPARK-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6717. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11919 [https://github.com/apache/spark/pull/11919] > Clear shuffle files after checkpointing in ALS > -- > > Key: SPARK-6717 > URL: https://issues.apache.org/jira/browse/SPARK-6717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: holdenk > Labels: als > Fix For: 2.0.0 > > > In ALS iterations, we checkpoint RDDs to cut lineage and to reduce shuffle > files. However, whether to clean shuffle files depends on the system GC, > which may not be triggered in ALS iterations. So after checkpointing, before > we let the RDD object go out of scope, we should clean its shuffle > dependencies explicitly. This function could either stay inside ALS or go to > Core. > Without this feature, we can call System.gc() periodically to clean shuffle > files of RDDs that went out of scope.
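The failure mode described above — cleanup that is tied to garbage collection of the RDD object, so shuffle files linger until the GC happens to run — can be mimicked in plain Python with a finalizer. This is only a conceptual sketch (Spark's real cleanup goes through the JVM ContextCleaner, and `System.gc()` is the JVM analogue of the explicit `gc.collect()` below):

```python
import gc
import weakref

cleaned = []  # stands in for "shuffle files actually deleted"

class FakeRDD:
    """Toy stand-in for an RDD holding a shuffle dependency."""
    def __init__(self, shuffle_id):
        self.shuffle_id = shuffle_id

def checkpoint_and_drop(shuffle_id):
    rdd = FakeRDD(shuffle_id)
    # Cleanup only fires when the collector reclaims the object...
    weakref.finalize(rdd, cleaned.append, shuffle_id)
    # ...so the shuffle "files" survive until collection happens.

checkpoint_and_drop(42)
gc.collect()  # the periodic System.gc()-style nudge the issue mentions
assert cleaned == [42]
```

The fix in the PR is the other branch of the description: clean the shuffle dependency explicitly after checkpointing instead of waiting for the collector.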
[jira] [Created] (SPARK-15064) Locale support in StopWordsRemover
Xiangrui Meng created SPARK-15064: - Summary: Locale support in StopWordsRemover Key: SPARK-15064 URL: https://issues.apache.org/jira/browse/SPARK-15064 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.0.0 Reporter: Xiangrui Meng We support case-insensitive filtering (the default) in StopWordsRemover. However, case-insensitive matching depends on the locale and region, which cannot be explicitly set in StopWordsRemover. We should consider adding this support in MLlib.
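Why case-insensitive matching is locale-sensitive shows up clearly in the classic Turkish example: uppercase "I" lowercases to dotless "ı" (U+0131) in Turkish, and "İ" (U+0130) lowercases to "i", so a locale-unaware `lower()` misses Turkish stop words. A minimal sketch (the `turkish_lower` helper is a hypothetical two-rule illustration, not a full ICU-style case mapping):

```python
# Root-locale lowering: "I" -> "i".
assert "I".lower() == "i"

def turkish_lower(s):
    # Apply the two Turkish-specific mappings before the generic lowering.
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

assert turkish_lower("I") == "\u0131"   # I -> dotless i
assert turkish_lower("\u0130") == "i"   # dotted capital I -> i

stop_words = {"\u0131"}  # a Turkish stop-word list containing dotless i
assert "I".lower() not in stop_words     # locale-unaware match fails
assert turkish_lower("I") in stop_words  # locale-aware match succeeds
```

This is exactly the gap the issue asks to close: the matching rule must be a per-transformer setting, not whatever the JVM default locale happens to be.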
[jira] [Resolved] (SPARK-15030) Support formula in spark.kmeans in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15030. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12813 [https://github.com/apache/spark/pull/12813] > Support formula in spark.kmeans in SparkR > - > > Key: SPARK-15030 > URL: https://issues.apache.org/jira/browse/SPARK-15030 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > Fix For: 2.0.0 > > > In SparkR, spark.kmeans takes a DataFrame with double columns. This is > different from other ML methods we implemented, which support R model > formulas. We should add support for that as well. > {code:none} > spark.kmeans(data = df, formula = ~ lat + lon, ...) > {code}
[jira] [Resolved] (SPARK-14653) Remove NumericParser and jackson dependency from mllib-local
[ https://issues.apache.org/jira/browse/SPARK-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14653. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12802 [https://github.com/apache/spark/pull/12802] > Remove NumericParser and jackson dependency from mllib-local > > > Key: SPARK-14653 > URL: https://issues.apache.org/jira/browse/SPARK-14653 > Project: Spark > Issue Type: Sub-task > Components: Build, ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > After SPARK-14549, we should remove NumericParser and jackson from > mllib-local, which were introduced very early and are now replaced by UDTs.
[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15027: -- Target Version/s: (was: 2.0.0) > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization.
[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265238#comment-15265238 ] Xiangrui Meng commented on SPARK-15027: --- It might be tricky to use Dataset due to encoders and generic ID types. But if we use DataFrame as input and output, it seems feasible. It would be great if you could take a look. > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization.
[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15027: -- Assignee: (was: Xiangrui Meng) > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization.
[jira] [Comment Edited] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265229#comment-15265229 ] Xiangrui Meng edited comment on SPARK-15027 at 4/30/16 7:50 AM: Just API change. I guess there are still gaps to use DataFrame for the implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer API. was (Author: mengxr): No, just API change. I guess there are still gaps to use DataFrame for the implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer API. > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization.
[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265229#comment-15265229 ] Xiangrui Meng commented on SPARK-15027: --- No, just API change. I guess there are still gaps to use DataFrame for the implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer API. > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization.
[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15027: -- Description: We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` to be consistent with other APIs under spark.ml and it also leaves space for Tungsten-based optimization. (was: This continues the work from SPARK-14412 to update `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and `finalRDDStorageLevel` to `finalStorageLevel`. We should also update `ALS.train` to use `Dataset` instead of `RDD`.) > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization.
[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15027: -- Summary: ALS.train should use DataFrame instead of RDD (was: ml.ALS params and ALS.train should not depend on RDD) > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This continues the work from SPARK-14412 to update > `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and > `finalRDDStorageLevel` to `finalStorageLevel`. We should also update > `ALS.train` to use `Dataset` instead of `RDD`.
[jira] [Commented] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package
[ https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265216#comment-15265216 ] Xiangrui Meng commented on SPARK-14906: --- [~viirya] To confirm the scope of this JIRA, does it cover moving (or aliasing) `pyspark.mllib.linalg` to `pyspark.ml.linalg`? > Move VectorUDT and MatrixUDT in PySpark to new ML package > - > > Key: SPARK-14906 > URL: https://issues.apache.org/jira/browse/SPARK-14906 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Liang-Chi Hsieh > > As we move VectorUDT and MatrixUDT in Scala to new ml package, the PySpark > codes should be moved too.
[jira] [Created] (SPARK-15030) Support formula in spark.kmeans in SparkR
Xiangrui Meng created SPARK-15030: - Summary: Support formula in spark.kmeans in SparkR Key: SPARK-15030 URL: https://issues.apache.org/jira/browse/SPARK-15030 Project: Spark Issue Type: New Feature Components: ML, SparkR Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Yanbo Liang In SparkR, spark.kmeans takes a DataFrame with double columns. This is different from other ML methods we implemented, which support R model formulas. We should add support for that as well. {code:none} spark.kmeans(data = df, formula = ~ lat + lon, ...) {code}
[jira] [Resolved] (SPARK-14831) Make ML APIs in SparkR consistent
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14831. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12789 [https://github.com/apache/spark/pull/12789] > Make ML APIs in SparkR consistent > - > > Key: SPARK-14831 > URL: https://issues.apache.org/jira/browse/SPARK-14831 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Timothy Hunter >Priority: Critical > Fix For: 2.0.0 > > > In current master, we have 4 ML methods in SparkR: > {code:none} > glm(formula, family, data, ...) > kmeans(data, centers, ...) > naiveBayes(formula, data, ...) > survreg(formula, data, ...) > {code} > We tried to keep the signatures similar to existing ones in R. However, if we > put them together, they are not consistent. One example is k-means, which > doesn't accept a formula. Instead of looking at each method independently, we > might want to update the signature of kmeans to > {code:none} > kmeans(formula, data, centers, ...) > {code} > We can also discuss possible global changes here. For example, `glm` puts > `family` before `data` while `kmeans` puts `centers` after `data`. This is > not consistent. And logically, the formula doesn't mean anything without > being associated with a DataFrame. So it makes more sense to me to have the > following signature: > {code:none} > algorithm(df, formula, [required params], [optional params]) > {code} > If we make this change, we might want to avoid name collisions because they > have different signatures. We can use `ml.kmeans`, `ml.glm`, etc. > Sorry for discussing API changes at the last minute. But I think it would be > better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]
[jira] [Resolved] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14850. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12640 [https://github.com/apache/spark/pull/12640] > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > In SPARK-9390, we switched to use GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specialize GenericArrayData or use a different container. > cc: [~cloud_fan] [~yhuai]
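The boxing cost behind this issue has a close analogue in Python: a `list` of floats holds one heap-allocated object per element (header plus pointer), while `array('d', ...)` stores raw 8-byte doubles contiguously, much like a primitive-specialized array versus a generic object array on the JVM. A sketch (Python sizes are illustrative, not a JVM measurement):

```python
import sys
from array import array

n = 10_000
boxed = [float(i) for i in range(n)]   # one PyFloat object per element
primitive = array("d", range(n))       # contiguous raw doubles

# Boxed total = list's pointer array plus every element object.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
primitive_bytes = sys.getsizeof(primitive)

# The per-element object headers and pointers dominate the boxed layout.
assert primitive_bytes < boxed_bytes
```

Beyond memory, the boxed layout also costs cache locality and a dereference per element access, which is why an unspecialized GenericArrayData "might hurt MLlib performance badly".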
[jira] [Created] (SPARK-15027) ml.ALS params and ALS.train should not depend on RDD
Xiangrui Meng created SPARK-15027: - Summary: ml.ALS params and ALS.train should not depend on RDD Key: SPARK-15027 URL: https://issues.apache.org/jira/browse/SPARK-15027 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng This continues the work from SPARK-14412 to update `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and `finalRDDStorageLevel` to `finalStorageLevel`. We should also update `ALS.train` to use `Dataset` instead of `RDD`.
[jira] [Updated] (SPARK-14412) spark.ml ALS preferred storage level Params
[ https://issues.apache.org/jira/browse/SPARK-14412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14412: -- Assignee: Nick Pentreath > spark.ml ALS preferred storage level Params > -- > > Key: SPARK-14412 > URL: https://issues.apache.org/jira/browse/SPARK-14412 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Minor > Fix For: 2.0.0 > > > spark.mllib ALS supports {{setIntermediateRDDStorageLevel}} and > {{setFinalRDDStorageLevel}}. Those should be added as Params in spark.ml, > but they should be in group "expertParam" since few users will need them.
[jira] [Resolved] (SPARK-14412) spark.ml ALS preferred storage level Params
[ https://issues.apache.org/jira/browse/SPARK-14412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14412. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12660 [https://github.com/apache/spark/pull/12660] > spark.ml ALS preferred storage level Params > -- > > Key: SPARK-14412 > URL: https://issues.apache.org/jira/browse/SPARK-14412 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > spark.mllib ALS supports {{setIntermediateRDDStorageLevel}} and > {{setFinalRDDStorageLevel}}. Those should be added as Params in spark.ml, > but they should be in group "expertParam" since few users will need them.
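The spark.ml Param pattern referenced in SPARK-14412 validates a value when it is set and falls back to a default otherwise. A minimal self-contained Java sketch of a storage-level Param restricted to a fixed set of level names (class name, allowed values, and default are illustrative assumptions, not the actual spark.ml classes):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an "expert" string Param validated against a set of
// storage-level names, mirroring the spark.ml Param pattern (not Spark code).
public class StorageLevelParam {
    private static final List<String> ALLOWED =
        Arrays.asList("MEMORY_ONLY", "MEMORY_AND_DISK", "DISK_ONLY", "NONE");

    private String value = "MEMORY_AND_DISK"; // illustrative default

    public StorageLevelParam set(String level) {
        if (!ALLOWED.contains(level)) {
            throw new IllegalArgumentException("Invalid storage level: " + level);
        }
        this.value = level;
        return this; // fluent, like the setter style of spark.ml estimators
    }

    public String get() {
        return value;
    }
}
```

Rejecting bad values at set time rather than at fit time is the point of routing these through Params: a typo fails fast instead of surfacing mid-job.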
[jira] [Assigned] (SPARK-14653) Remove NumericParser and jackson dependency from mllib-local
[ https://issues.apache.org/jira/browse/SPARK-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-14653: - Assignee: Xiangrui Meng > Remove NumericParser and jackson dependency from mllib-local > > > Key: SPARK-14653 > URL: https://issues.apache.org/jira/browse/SPARK-14653 > Project: Spark > Issue Type: Sub-task > Components: Build, ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-14549, we should remove NumericParser and jackson from > mllib-local, which were introduced very early on and have now been replaced by UDTs.
[jira] [Updated] (SPARK-14311) Model persistence in SparkR 2.0
[ https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14311: -- Target Version/s: 2.0.0 Fix Version/s: 2.0.0 > Model persistence in SparkR 2.0 > --- > > Key: SPARK-14311 > URL: https://issues.apache.org/jira/browse/SPARK-14311 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, > naive Bayes, and AFT survival regression. Users can fit models, get summary, > and make predictions. However, they cannot save/load the models yet. > ML models in SparkR are wrappers around ML pipelines. So it should be > straightforward to implement model persistence. We need to think more about > the API. R uses save/load for objects and datasets (also objects). It is > possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But > I'm not sure whether load can be overloaded easily. I propose the following > API: > {code} > model <- glm(formula, data = df) > ml.save(model, path, mode = "overwrite") > model2 <- ml.load(path) > {code} > We defined wrappers as S4 classes. So `ml.save` is an S4 method and `ml.load` > is an S3 method (correct me if I'm wrong).
[jira] [Updated] (SPARK-14311) Model persistence in SparkR 2.0
[ https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14311: -- Summary: Model persistence in SparkR 2.0 (was: Model persistence in SparkR) > Model persistence in SparkR 2.0 > --- > > Key: SPARK-14311 > URL: https://issues.apache.org/jira/browse/SPARK-14311 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, > naive Bayes, and AFT survival regression. Users can fit models, get summary, > and make predictions. However, they cannot save/load the models yet. > ML models in SparkR are wrappers around ML pipelines. So it should be > straightforward to implement model persistence. We need to think more about > the API. R uses save/load for objects and datasets (also objects). It is > possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But > I'm not sure whether load can be overloaded easily. I propose the following > API: > {code} > model <- glm(formula, data = df) > ml.save(model, path, mode = "overwrite") > model2 <- ml.load(path) > {code} > We defined wrappers as S4 classes. So `ml.save` is an S4 method and `ml.load` > is an S3 method (correct me if I'm wrong).
[jira] [Resolved] (SPARK-14311) Model persistence in SparkR 2.0
[ https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14311. --- Resolution: Fixed > Model persistence in SparkR 2.0 > --- > > Key: SPARK-14311 > URL: https://issues.apache.org/jira/browse/SPARK-14311 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, > naive Bayes, and AFT survival regression. Users can fit models, get summary, > and make predictions. However, they cannot save/load the models yet. > ML models in SparkR are wrappers around ML pipelines. So it should be > straightforward to implement model persistence. We need to think more about > the API. R uses save/load for objects and datasets (also objects). It is > possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But > I'm not sure whether load can be overloaded easily. I propose the following > API: > {code} > model <- glm(formula, data = df) > ml.save(model, path, mode = "overwrite") > model2 <- ml.load(path) > {code} > We defined wrappers as S4 classes. So `ml.save` is an S4 method and `ml.load` > is an S3 method (correct me if I'm wrong).
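The proposed `ml.save(model, path, mode = "overwrite")` implies a writer that fails on an existing path unless overwrite mode is requested. A minimal self-contained Java sketch of that semantic (hypothetical helper, not the SparkR or MLWriter implementation):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of save-with-overwrite semantics: an existing file at
// the target path fails a plain save but is replaced when mode is "overwrite".
public class OverwriteSave {
    public static void save(String contents, Path path, String mode) throws IOException {
        if (Files.exists(path)) {
            if (!"overwrite".equals(mode)) {
                throw new IOException("Path already exists: " + path);
            }
            Files.delete(path); // overwrite requested: clear the old artifact
        }
        Files.write(path, contents.getBytes(StandardCharsets.UTF_8));
    }
}
```

Failing by default and requiring an explicit overwrite flag is the conservative choice for model persistence, since silently clobbering a saved model is hard to recover from.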
[jira] [Updated] (SPARK-13786) Pyspark ml.tuning support export/import
[ https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13786: -- Fix Version/s: (was: 2.0.0) > Pyspark ml.tuning support export/import > --- > > Key: SPARK-13786 > URL: https://issues.apache.org/jira/browse/SPARK-13786 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > This should follow whatever implementation is chosen for Pipeline (since > these are all meta-algorithms). > Note this will also require persistence for Evaluators. Hopefully that can > leverage the Java implementations; there is no real need to make Python > Evaluators be MLWritable, as far as I can tell.
[jira] [Reopened] (SPARK-13786) Pyspark ml.tuning support export/import
[ https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-13786: --- Reopening this issue since we reverted the change. > Pyspark ml.tuning support export/import > --- > > Key: SPARK-13786 > URL: https://issues.apache.org/jira/browse/SPARK-13786 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > This should follow whatever implementation is chosen for Pipeline (since > these are all meta-algorithms). > Note this will also require persistence for Evaluators. Hopefully that can > leverage the Java implementations; there is no real need to make Python > Evaluators be MLWritable, as far as I can tell.
[jira] [Resolved] (SPARK-13786) Pyspark ml.tuning support export/import
[ https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13786. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12782 [https://github.com/apache/spark/pull/12782] > Pyspark ml.tuning support export/import > --- > > Key: SPARK-13786 > URL: https://issues.apache.org/jira/browse/SPARK-13786 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > Fix For: 2.0.0 > > > This should follow whatever implementation is chosen for Pipeline (since > these are all meta-algorithms). > Note this will also require persistence for Evaluators. Hopefully that can > leverage the Java implementations; there is no real need to make Python > Evaluators be MLWritable, as far as I can tell.
[jira] [Updated] (SPARK-14059) Define R wrappers under org.apache.spark.ml.r
[ https://issues.apache.org/jira/browse/SPARK-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14059: -- Assignee: Yanbo Liang > Define R wrappers under org.apache.spark.ml.r > - > > Key: SPARK-14059 > URL: https://issues.apache.org/jira/browse/SPARK-14059 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 1.6.1 >Reporter: Xiangrui Meng >Assignee: Yanbo Liang >Priority: Minor > > Currently, the wrapper files are under .../ml/r but the wrapper classes are > defined under ...ml.api.r, which doesn't follow the package convention. We should > move all wrappers under ml.r. > This should happen after we have merged the other MLlib/R wrappers, to avoid merge > conflicts.
[jira] [Resolved] (SPARK-14059) Define R wrappers under org.apache.spark.ml.r
[ https://issues.apache.org/jira/browse/SPARK-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14059. --- Resolution: Fixed > Define R wrappers under org.apache.spark.ml.r > - > > Key: SPARK-14059 > URL: https://issues.apache.org/jira/browse/SPARK-14059 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 1.6.1 >Reporter: Xiangrui Meng >Priority: Minor > > Currently, the wrapper files are under .../ml/r but the wrapper classes are > defined under ...ml.api.r, which doesn't follow the package convention. We should > move all wrappers under ml.r. > This should happen after we have merged the other MLlib/R wrappers, to avoid merge > conflicts.
[jira] [Created] (SPARK-15010) Lots of error messages about accumulator in Spark shell when a task takes some time to run
Xiangrui Meng created SPARK-15010: - Summary: Lots of error messages about accumulator in Spark shell when a task takes some time to run Key: SPARK-15010 URL: https://issues.apache.org/jira/browse/SPARK-15010 Project: Spark Issue Type: Bug Components: Spark Core, Spark Shell Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Wenchen Fan Priority: Blocker
{code:none}
16/04/29 11:59:23 ERROR Utils: Uncaught exception in thread heartbeat-receiver-event-loop-thread
java.lang.UnsupportedOperationException: Can't read accumulator value in task
	at org.apache.spark.NewAccumulator.value(NewAccumulator.scala:137)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5$$anonfun$apply$9$$anonfun$apply$10.apply(TaskSchedulerImpl.scala:394)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5$$anonfun$apply$9$$anonfun$apply$10.apply(TaskSchedulerImpl.scala:394)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:394)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:392)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5.apply(TaskSchedulerImpl.scala:392)
	at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5.apply(TaskSchedulerImpl.scala:391)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
	at org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:391)
	at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:128)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219)
	at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:127)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
16/04/29 11:59:33 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@1cd9105c,BlockManagerId(driver, 192.168.99.1, 60533))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
	at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
	at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:494)
	at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:523) at
[jira] [Created] (SPARK-15006) Generated JavaDoc should hide package private objects
Xiangrui Meng created SPARK-15006: - Summary: Generated JavaDoc should hide package private objects Key: SPARK-15006 URL: https://issues.apache.org/jira/browse/SPARK-15006 Project: Spark Issue Type: Improvement Components: Build, Documentation Affects Versions: 2.0.0 Reporter: Xiangrui Meng After we switched to the official release of genjavadoc in SPARK-14511, package private objects are no longer hidden in the generated JavaDoc. This JIRA is to track this upstream issue and update genjavadoc in Spark once a fix lands upstream.
[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264328#comment-15264328 ] Xiangrui Meng commented on SPARK-14831: --- Talked to [~timhunter] offline and he will submit a PR soon. > Make ML APIs in SparkR consistent > - > > Key: SPARK-14831 > URL: https://issues.apache.org/jira/browse/SPARK-14831 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Timothy Hunter >Priority: Critical > > In current master, we have 4 ML methods in SparkR: > {code:none} > glm(formula, family, data, ...) > kmeans(data, centers, ...) > naiveBayes(formula, data, ...) > survreg(formula, data, ...) > {code} > We tried to keep the signatures similar to existing ones in R. However, if we > put them together, they are not consistent. One example is k-means, which > doesn't accept a formula. Instead of looking at each method independently, we > might want to update the signature of kmeans to > {code:none} > kmeans(formula, data, centers, ...) > {code} > We can also discuss possible global changes here. For example, `glm` puts > `family` before `data` while `kmeans` puts `centers` after `data`. This is > not consistent. And logically, the formula doesn't mean anything without > associating with a DataFrame. So it makes more sense to me to have the > following signature: > {code:none} > algorithm(df, formula, [required params], [optional params]) > {code} > If we make this change, we might want to avoid name collisions because they > have different signatures. We can use `ml.kmeans`, `ml.glm`, etc. > Sorry for discussing API changes at the last minute. But I think it would be > better to have consistent signatures in SparkR. > cc: [~shivaram] [~josephkb] [~yanboliang]
[jira] [Resolved] (SPARK-14314) K-means model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14314. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12778 [https://github.com/apache/spark/pull/12778] > K-means model persistence in SparkR > --- > > Key: SPARK-14314 > URL: https://issues.apache.org/jira/browse/SPARK-14314 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Gayathri Murali > Fix For: 2.0.0 > >
[jira] [Resolved] (SPARK-14315) GLMs model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14315. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12778 [https://github.com/apache/spark/pull/12778] > GLMs model persistence in SparkR > > > Key: SPARK-14315 > URL: https://issues.apache.org/jira/browse/SPARK-14315 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Gayathri Murali > Fix For: 2.0.0 > >
[jira] [Updated] (SPARK-14831) Make ML APIs in SparkR consistent
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14831: -- Assignee: Timothy Hunter (was: Xiangrui Meng) > Make ML APIs in SparkR consistent > - > > Key: SPARK-14831 > URL: https://issues.apache.org/jira/browse/SPARK-14831 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Timothy Hunter >Priority: Critical > > In current master, we have 4 ML methods in SparkR: > {code:none} > glm(formula, family, data, ...) > kmeans(data, centers, ...) > naiveBayes(formula, data, ...) > survreg(formula, data, ...) > {code} > We tried to keep the signatures similar to existing ones in R. However, if we > put them together, they are not consistent. One example is k-means, which > doesn't accept a formula. Instead of looking at each method independently, we > might want to update the signature of kmeans to > {code:none} > kmeans(formula, data, centers, ...) > {code} > We can also discuss possible global changes here. For example, `glm` puts > `family` before `data` while `kmeans` puts `centers` after `data`. This is > not consistent. And logically, the formula doesn't mean anything without > associating with a DataFrame. So it makes more sense to me to have the > following signature: > {code:none} > algorithm(df, formula, [required params], [optional params]) > {code} > If we make this change, we might want to avoid name collisions because they > have different signatures. We can use `ml.kmeans`, `ml.glm`, etc. > Sorry for discussing API changes at the last minute. But I think it would be > better to have consistent signatures in SparkR. > cc: [~shivaram] [~josephkb] [~yanboliang]
[jira] [Resolved] (SPARK-7264) SparkR API for parallel functions
[ https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7264. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12426 [https://github.com/apache/spark/pull/12426] > SparkR API for parallel functions > - > > Key: SPARK-7264 > URL: https://issues.apache.org/jira/browse/SPARK-7264 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Timothy Hunter > Fix For: 2.0.0 > > > This is a JIRA to discuss design proposals for enabling parallel R > computation in SparkR without exposing the entire RDD API. > The rationale for this is that the RDD API has a number of low level > functions and we would like to expose a more light-weight API that is both > friendly to R users and easy to maintain. > http://goo.gl/GLHKZI has a first cut design doc.
[jira] [Resolved] (SPARK-14487) User Defined Type registration without SQLUserDefinedType annotation
[ https://issues.apache.org/jira/browse/SPARK-14487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14487. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12259 [https://github.com/apache/spark/pull/12259] > User Defined Type registration without SQLUserDefinedType annotation > > > Key: SPARK-14487 > URL: https://issues.apache.org/jira/browse/SPARK-14487 > Project: Spark > Issue Type: Sub-task >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > Currently we use the SQLUserDefinedType annotation to register UDTs for user > classes. However, by doing this, we add a Spark dependency to user classes. > For some user classes, it is unnecessary to add such a dependency, which will > increase deployment difficulty. > We should provide an alternative approach to register UDTs for user classes > without the SQLUserDefinedType annotation.
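The alternative to annotation-based registration described above is an explicit registry that maps a user class to its converter, so the user class carries no framework dependency. A minimal self-contained Java sketch of the idea (class and method names here are hypothetical, not Spark's actual registration API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry sketch: user classes are mapped to converter
// functions at registration time, so the user class itself needs no
// framework annotation or dependency.
public class UdtRegistry {
    private static final Map<Class<?>, Function<Object, String>> REGISTRY = new HashMap<>();

    public static <T> void register(Class<T> cls, Function<Object, String> toCatalyst) {
        REGISTRY.put(cls, toCatalyst);
    }

    public static String convert(Object value) {
        Function<Object, String> f = REGISTRY.get(value.getClass());
        if (f == null) {
            throw new IllegalArgumentException("No UDT registered for " + value.getClass());
        }
        return f.apply(value);
    }
}
```

The design trade-off is exactly the one the ticket names: registration moves from the class definition (annotation) to a side table the application populates at startup, decoupling user code from Spark.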
[jira] [Updated] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14850: -- Assignee: Wenchen Fan > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Blocker > > In SPARK-9390, we switched to using GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specializing GenericArrayData or using a different container. > cc: [~cloud_fan] [~yhuai]