[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java

2016-06-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15947:
--
Description: After SPARK-15945, we should make ALL pipeline components 
accept old vector columns as input and do the conversion automatically 
(probably with a warning message), in order to smooth the migration to 2.0. 
Note that this includes loading old saved models.  (was: After SPARK-15945, we 
should make ALL pipeline components accept old vector columns as input and do 
the conversion automatically (probably with a warning message), in order to 
smooth the migration to 2.0. Note that this include loading old saved models.)

> Make pipeline components backward compatible with old vector columns in 
> Scala/Java
> --
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. Note that this includes 
> loading old saved models.
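
For illustration, a minimal sketch of the compatibility shim described above
(a hypothetical helper; it assumes the SPARK-15945 utils land as
MLUtils.convertVectorColumnsToML):

{code}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.DataFrame

// Hypothetical helper: if `col` still uses the old mllib vector type,
// convert it with a warning; otherwise pass the DataFrame through unchanged.
def ensureMLVectorColumn(df: DataFrame, col: String): DataFrame = {
  val dt = df.schema(col).dataType
  if (dt.getClass.getName.startsWith("org.apache.spark.mllib.linalg")) {
    println(s"WARN: column '$col' uses the old mllib Vector type; converting.")
    MLUtils.convertVectorColumnsToML(df, col)
  } else {
    df
  }
}
{code}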






[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java

2016-06-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15947:
--
Description: After SPARK-15945, we should make ALL pipeline components 
accept old vector columns as input and do the conversion automatically 
(probably with a warning message), in order to smooth the migration to 2.0. 
Note that this include loading old saved models.  (was: After SPARK-15945, we 
should make ALL pipeline components accept old vector columns as input and do 
the conversion automatically (probably with a warning message), in order to 
smooth the migration to 2.0.)

> Make pipeline components backward compatible with old vector columns in 
> Scala/Java
> --
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. Note that this include 
> loading old saved models.






[jira] [Commented] (SPARK-15944) Make spark.ml package backward compatible with spark.mllib vectors

2016-06-14 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15329675#comment-15329675
 ] 

Xiangrui Meng commented on SPARK-15944:
---

We won't deprecate those utils before we deprecate the RDD-based API.

> Make spark.ml package backward compatible with spark.mllib vectors
> --
>
> Key: SPARK-15944
> URL: https://issues.apache.org/jira/browse/SPARK-15944
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> During QA, we found that it is not trivial to convert a DataFrame with old 
> vector columns to new vector columns. So it would be easier for users to 
> migrate their datasets and pipelines if we:
> 1) provide utils to convert DataFrames with vector columns
> 2) automatically detect and convert old vector columns in ML pipelines
> This is an umbrella JIRA to track the progress.






[jira] [Updated] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python

2016-06-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15948:
--
Description: Same as SPARK-15947 but for Python.  (was: Same as SPARK-15974 
but for Python.)

> Make pipeline components backward compatible with old vector columns in Python
> --
>
> Key: SPARK-15948
> URL: https://issues.apache.org/jira/browse/SPARK-15948
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>
> Same as SPARK-15947 but for Python.






[jira] [Created] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python

2016-06-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15948:
-

 Summary: Make pipeline components backward compatible with old 
vector columns in Python
 Key: SPARK-15948
 URL: https://issues.apache.org/jira/browse/SPARK-15948
 Project: Spark
  Issue Type: Sub-task
Reporter: Xiangrui Meng









[jira] [Updated] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python

2016-06-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15948:
--
Description: Same as SPARK-15974 but for Python.

> Make pipeline components backward compatible with old vector columns in Python
> --
>
> Key: SPARK-15948
> URL: https://issues.apache.org/jira/browse/SPARK-15948
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>
> Same as SPARK-15974 but for Python.






[jira] [Created] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java

2016-06-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15947:
-

 Summary: Make pipeline components backward compatible with old 
vector columns in Scala/Java
 Key: SPARK-15947
 URL: https://issues.apache.org/jira/browse/SPARK-15947
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng


After SPARK-15945, we should make ALL pipeline components accept old vector 
columns as input and do the conversion automatically (probably with a warning 
message), in order to smooth the migration to 2.0.






[jira] [Created] (SPARK-15946) Wrap the conversion utils in Python

2016-06-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15946:
-

 Summary: Wrap the conversion utils in Python
 Key: SPARK-15946
 URL: https://issues.apache.org/jira/browse/SPARK-15946
 Project: Spark
  Issue Type: Sub-task
Reporter: Xiangrui Meng


This is to wrap SPARK-15943 in Python. So Python users can use it to convert 
DataFrames with vector columns.






[jira] [Updated] (SPARK-15946) Wrap the conversion utils in Python

2016-06-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15946:
--
Description: This is to wrap SPARK-15945 in Python. So Python users can use 
it to convert DataFrames with vector columns.  (was: This is to wrap 
SPARK-15943 in Python. So Python users can use it to convert DataFrames with 
vector columns.)

> Wrap the conversion utils in Python
> ---
>
> Key: SPARK-15946
> URL: https://issues.apache.org/jira/browse/SPARK-15946
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>
> This is to wrap SPARK-15945 in Python. So Python users can use it to convert 
> DataFrames with vector columns.






[jira] [Created] (SPARK-15945) Implement conversion utils in Scala/Java

2016-06-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15945:
-

 Summary: Implement conversion utils in Scala/Java
 Key: SPARK-15945
 URL: https://issues.apache.org/jira/browse/SPARK-15945
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


This is to provide conversion utils between old/new vector columns in a 
DataFrame, so users can migrate their datasets and pipelines manually.
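
A rough usage sketch, assuming the utils land in MLUtils as
convertVectorColumnsToML/FromML (which is where they eventually went):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("vector-migration").getOrCreate()
import spark.implicits._

// A DataFrame with an old-style mllib vector column.
val oldDF = Seq((0L, Vectors.dense(1.0, 2.0))).toDF("id", "features")

// Convert the named column to the new ml.linalg vector type, and back.
val mlDF = MLUtils.convertVectorColumnsToML(oldDF, "features")
val back = MLUtils.convertVectorColumnsFromML(mlDF, "features")
{code}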






[jira] [Created] (SPARK-15944) Make spark.ml package backward compatible with spark.mllib vectors

2016-06-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15944:
-

 Summary: Make spark.ml package backward compatible with 
spark.mllib vectors
 Key: SPARK-15944
 URL: https://issues.apache.org/jira/browse/SPARK-15944
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


During QA, we found that it is not trivial to convert a DataFrame with old 
vector columns to new vector columns. So it would be easier for users to 
migrate their datasets and pipelines if we:

1) provide utils to convert DataFrames with vector columns
2) automatically detect and convert old vector columns in ML pipelines

This is an umbrella JIRA to track the progress.






[jira] [Updated] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python

2016-06-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15364:
--
Assignee: Liang-Chi Hsieh

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
> ---
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Currently, picklers for both new and old vectors are implemented under 
> PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement 
> them under `spark.ml.python` instead. I set the target to 2.1 since those are 
> private APIs.






[jira] [Updated] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python

2016-06-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15364:
--
Target Version/s: 2.0.0  (was: 2.1.0)

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
> ---
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
> Fix For: 2.0.0
>
>
> Currently, picklers for both new and old vectors are implemented under 
> PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement 
> them under `spark.ml.python` instead. I set the target to 2.1 since those are 
> private APIs.






[jira] [Resolved] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python

2016-06-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15364.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13219
[https://github.com/apache/spark/pull/13219]

> Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python
> ---
>
> Key: SPARK-15364
> URL: https://issues.apache.org/jira/browse/SPARK-15364
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
> Fix For: 2.0.0
>
>
> Currently, picklers for both new and old vectors are implemented under 
> PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement 
> them under `spark.ml.python` instead. I set the target to 2.1 since those are 
> private APIs.






[jira] [Updated] (SPARK-15799) Release SparkR on CRAN

2016-06-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15799:
--
Target Version/s: 2.1.0

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?






[jira] [Updated] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15581:
--
Description: 
This is a master list for MLlib improvements we are working on for the next 
release. Please view this as a wish list rather than a definite plan, for we 
don't have an accurate estimate of available resources. Due to limited review 
bandwidth, features appearing on this list will get higher priority during code 
review. But feel free to suggest new items to the list in comments. We are 
experimenting with this process. Your feedback would be greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, learning the development 
process while tackling a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on a feature, so that we avoid duplicate work. For small 
features, you don't need to wait for the JIRA to be assigned.
* For medium/big features or features with dependencies, please get the JIRA 
assigned before coding and keep the ETA updated on the JIRA page. If there is 
no activity on the JIRA page for a certain amount of time, the JIRA should be 
released to other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add the `@Since("VERSION")` annotation to new public APIs (see 
the sketch after this list).
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps to improve others' code as well as yours.
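
A minimal sketch of the annotation on a hypothetical new API (the names below
are made up; `@Since` is org.apache.spark.annotation.Since and only compiles
inside Spark's own source tree):

{code}
import org.apache.spark.annotation.Since

@Since("2.1.0")
class ExampleFeaturizer @Since("2.1.0") (val inputCol: String) {

  /** A new public member added in the same release gets the annotation too. */
  @Since("2.1.0")
  def describe: String = s"ExampleFeaturizer(inputCol=$inputCol)"
}
{code}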

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add a "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if applicable.

h1. Roadmap (*WIP*)

This is NOT [a complete list of MLlib JIRAs for 2.1| 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
 We only include umbrella JIRAs and high-level tasks.

Major efforts in this release:
* Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
RDD-based API
* ML persistence
* Python API feature parity and test coverage
* R API expansion and improvements
* Note about new features: As usual, we expect to expand the feature set of 
MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
over new features.

Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
it, but new features, APIs, and improvements will only be added to `spark.ml`.

h2. Critical feature parity in DataFrame-based API

* Umbrella JIRA: [SPARK-4591]

h2. Persistence

* Complete persistence within MLlib
** Python tuning (SPARK-13786)
* MLlib in R format: compatibility with other languages (SPARK-15572)
* Impose backwards compatibility for persistence (SPARK-15573)

h2. Python API
* Standardize unit tests for Scala and Python to improve and consolidate test 
coverage for Params, persistence, and other common functionality (SPARK-15571)
* Improve Python API handling of Params, persistence (SPARK-14771) (SPARK-14706)
** Note: The linked JIRAs for this are incomplete.  More to be created...
** Related: Implement Python meta-algorithms in Scala (to simplify persistence) 
(SPARK-15574)
* Feature parity: The main goal of the Python API is to have feature parity 
with the Scala/Java API. You can find a [complete list here| 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC].
 The tasks fall into two major categories:
** Python API for missing methods (SPARK-14813)
** Python API for new algorithms. Committers should create a JIRA for the 
Python API after merging a public feature in Scala/Java.

h2. SparkR
* Improve R formula support and implementation (SPARK-15540)
* Various 

[jira] [Created] (SPARK-15799) Release SparkR on CRAN

2016-06-07 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15799:
-

 Summary: Release SparkR on CRAN
 Key: SPARK-15799
 URL: https://issues.apache.org/jira/browse/SPARK-15799
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Xiangrui Meng


Story: "As an R user, I would like to see SparkR released on CRAN, so I can use 
SparkR easily in an existing R environment and have other packages built on top 
of SparkR."

I made this JIRA with the following questions in mind:
* Are there known issues that prevent us releasing SparkR on CRAN?
* Do we want to package Spark jars in the SparkR release?
* Are there license issues?
* How does it fit into Spark's release process?







[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-06-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314816#comment-15314816
 ] 

Xiangrui Meng commented on SPARK-15740:
---

The proposal looks good to me. Please also try to measure the memory 
requirement so we can easily tell whether the issue is fixed or not. Triggering 
Jenkins maven builds is not convenient.

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]






[jira] [Commented] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-06-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313198#comment-15313198
 ] 

Xiangrui Meng commented on SPARK-15740:
---

[~tmnd91] Could you run the test and estimate how much RAM it needs? Btw, 
we should set spark.kryoserializer.buffer.max to a small value instead of 
creating a big array. Do you have time to look into this issue?
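
A minimal sketch of that suggestion (the 64k cap and local master are
illustrative, not from the issue):

{code}
import org.apache.spark.SparkConf

// Lower the Kryo buffer cap in the test config so the "big model" save/load
// path is exercised without allocating a huge array.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("Word2VecSuite")
  .set("spark.kryoserializer.buffer.max", "64k")
{code}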

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]






[jira] [Comment Edited] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-06-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313198#comment-15313198
 ] 

Xiangrui Meng edited comment on SPARK-15740 at 6/2/16 10:24 PM:


[~tmnd91] Could you run the test and estimate how much RAM it needs? Btw, 
we should set spark.kryoserializer.buffer.max to a small value instead of 
creating a big array for the test. Do you have time to look into this issue?


was (Author: mengxr):
[~tmnd91] Could you run the test and estimate how much ram does it need? Btw, 
we should set spark.kryoserializer.buffer.max to a small value instead of 
creating a big array. Do you have time to look into this issue?

> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]






[jira] [Updated] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-06-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15740:
--
Description: 
[~andrewor14] noticed some OOM errors caused by "test big model load / save" in 
Word2VecSuite, e.g., 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
 It doesn't show up in the test result because it was OOMed.

I'm going to disable the test first and leave this open for a proper fix.

cc [~tmnd91]

  was:
[~andrewor14] noticed some OOM errors caused by "test big model load / save" in 
Word2VecSuite.

I'm going to disable the test first and leave this open for a proper fix.

cc [~tmnd91]


> Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
> 
>
> Key: SPARK-15740
> URL: https://issues.apache.org/jira/browse/SPARK-15740
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> [~andrewor14] noticed some OOM errors caused by "test big model load / save" 
> in Word2VecSuite, e.g., 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull.
>  It doesn't show up in the test result because it was OOMed.
> I'm going to disable the test first and leave this open for a proper fix.
> cc [~tmnd91]






[jira] [Created] (SPARK-15740) Word2VecSuite "big model load / save" caused OOM in maven jenkins builds

2016-06-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15740:
-

 Summary: Word2VecSuite "big model load / save" caused OOM in maven 
jenkins builds
 Key: SPARK-15740
 URL: https://issues.apache.org/jira/browse/SPARK-15740
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Priority: Critical


[~andrewor14] noticed some OOM errors caused by "test big model load / save" in 
Word2VecSuite.

I'm going to disable the test first and leave this open for a proper fix.

cc [~tmnd91]






[jira] [Resolved] (SPARK-13944) Separate out local linear algebra as a standalone module without Spark dependency

2016-06-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13944.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Separate out local linear algebra as a standalone module without Spark 
> dependency
> -
>
> Key: SPARK-13944
> URL: https://issues.apache.org/jira/browse/SPARK-13944
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Separate out linear algebra as a standalone module without Spark dependency 
> to simplify production deployment. We can call the new module 
> spark-mllib-local, which might contain local models in the future.
> The major issue is to remove dependencies on user-defined types.
> The package name will be changed from mllib to ml. For example, Vector will 
> be changed from `org.apache.spark.mllib.linalg.Vector` to 
> `org.apache.spark.ml.linalg.Vector`. The vector type returned by the new ML 
> pipeline will be the one in the ml package; however, the existing mllib code 
> will not be touched. As a result, this will potentially break the API. Also, 
> when an mllib vector is loaded by Spark SQL, it will be automatically 
> converted into the one in the ml package.
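
For example, converting between the two vector types with the helpers that
shipped with this change (a sketch; asML and Vectors.fromML are the 2.0
helper names):

{code}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vector => NewVector}

val oldVec = OldVectors.dense(1.0, 2.0)   // org.apache.spark.mllib.linalg.Vector
val newVec: NewVector = oldVec.asML       // convert to org.apache.spark.ml.linalg.Vector
val roundTrip = OldVectors.fromML(newVec) // and back
{code}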






[jira] [Closed] (SPARK-14529) Consolidate mllib and mllib-local into one mllib folder

2016-06-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-14529.
-
Resolution: Won't Fix

Marked the issue as won't fix. The main reason is that mllib-local might be 
used by external packages directly.

> Consolidate mllib and mllib-local into one mllib folder
> ---
>
> Key: SPARK-14529
> URL: https://issues.apache.org/jira/browse/SPARK-14529
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Minor
>
> In the 2.0 QA period (to avoid conflicts with other PRs), this task will 
> consolidate `mllib/src` into `mllib/mllib/src` and `mllib-local/src` into 
> `mllib/mllib-local/src`.






[jira] [Closed] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-05-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-15043.
-
Resolution: Fixed

Fixed as part of SPARK-15030.

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Critical
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.
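
The failing assertion compares doubles for exact equality; a tolerance-based
check would avoid the flakiness (a sketch using JUnit's delta overload; the
suite itself is Java, shown here in Scala for brevity):

{code}
import org.junit.Assert.assertEquals

// The two correlation values above differ only in the last few bits,
// so compare with a small tolerance instead of exact equality.
val expected = 0.9986422261219262
val actual = 0.9986422261219272
assertEquals(expected, actual, 1e-12)
{code}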






[jira] [Updated] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

2016-05-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15043:
--
Fix Version/s: 2.0.0

> Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
> -
>
> Key: SPARK-15043
> URL: https://issues.apache.org/jira/browse/SPARK-15043
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Critical
> Fix For: 2.0.0
>
>
> It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become 
> flaky:
> https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr
> The first observed failure was in 
> https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816
> {code}
> java.lang.AssertionError: expected:<0.9986422261219262> but 
> was:<0.9986422261219272>
>   at 
> org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75)
> {code}
> I'm going to ignore this test now, but we need to come back and fix it.







[jira] [Commented] (SPARK-14529) Consolidate mllib and mllib-local into one mllib folder

2016-05-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299511#comment-15299511
 ] 

Xiangrui Meng commented on SPARK-14529:
---

We should decide whether we want to make this change in 2.0. I don't have a 
strong preference about which folder layout is better, so I would +1 keeping 
the current layout since it doesn't require code changes. How does that sound?

> Consolidate mllib and mllib-local into one mllib folder
> ---
>
> Key: SPARK-14529
> URL: https://issues.apache.org/jira/browse/SPARK-14529
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Minor
>
> In the 2.0 QA period (to avoid conflicts with other PRs), this task will 
> consolidate `mllib/src` into `mllib/mllib/src` and `mllib-local/src` into 
> `mllib/mllib-local/src`.






[jira] [Updated] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15447:
--
Labels: QA  (was: )

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?






[jira] [Created] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-05-20 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15447:
-

 Summary: Performance test for ALS in Spark 2.0
 Key: SPARK-15447
 URL: https://issues.apache.org/jira/browse/SPARK-15447
 Project: Spark
  Issue Type: Task
  Components: ML
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Priority: Critical


We made several changes to ALS in 2.0. It is necessary to run some tests to 
avoid performance regression. We should test (synthetic) datasets from 1 
million ratings to 1 billion ratings.

cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
tests?
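
A minimal sketch of such a harness (the synthetic data, sizes, and rank are
illustrative; scale numRatings toward 10^9 on a real cluster):

{code}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("als-perf").getOrCreate()
import spark.implicits._

// Generate synthetic (user, item, rating) triples.
val numRatings = 1000000L
val ratings = spark.range(numRatings).map { i =>
  val rng = new scala.util.Random(i)
  (rng.nextInt(100000), rng.nextInt(10000), rng.nextInt(5) + 1.0f)
}.toDF("user", "item", "rating")

val als = new ALS().setUserCol("user").setItemCol("item").setRatingCol("rating").setRank(10)
val start = System.nanoTime()
als.fit(ratings)
println(s"ALS fit took ${(System.nanoTime() - start) / 1e9} seconds")
{code}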






[jira] [Resolved] (SPARK-15222) SparkR ML examples update in 2.0

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15222.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13000
[https://github.com/apache/spark/pull/13000]

> SparkR ML examples update in 2.0
> 
>
> Key: SPARK-15222
> URL: https://issues.apache.org/jira/browse/SPARK-15222
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Update example code in examples/src/main/r/ml.R to reflect the new algorithms.
> * spark.glm and glm
> * spark.survreg
> * spark.naiveBayes
> * spark.kmeans






[jira] [Updated] (SPARK-15222) SparkR ML examples update in 2.0

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15222:
--
Assignee: Yanbo Liang

> SparkR ML examples update in 2.0
> 
>
> Key: SPARK-15222
> URL: https://issues.apache.org/jira/browse/SPARK-15222
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Update example code in examples/src/main/r/ml.R to reflect the new algorithms.
> * spark.glm and glm
> * spark.survreg
> * spark.naiveBayes
> * spark.kmeans






[jira] [Updated] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15153:
--
Shepherd: Xiangrui Meng

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is numeric, SparkR spark.naiveBayes throws an 
> error. This bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable type can be string or numeric. If it's 
> string, RFormula will transform it to a DoubleType label via StringIndexer 
> and set the corresponding column metadata; otherwise, RFormula will use it 
> directly as the label when training the model (and assumes it is numbered 
> from 0, ..., maxLabelIndex). 
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle it 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]
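
A sketch of the fix direction (a hypothetical helper; the real change would
live in NaiveBayesWrapper.fit):

{code}
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: read label values from column metadata when the
// response was string-indexed; otherwise derive them from the numeric column.
def extractLabels(df: DataFrame, labelCol: String): Array[String] = {
  Attribute.fromStructField(df.schema(labelCol)) match {
    case nominal: NominalAttribute if nominal.values.isDefined =>
      nominal.values.get
    case _ =>
      df.select(labelCol).distinct().collect()
        .map(_.getDouble(0)).sorted.map(_.toString)
  }
}
{code}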






[jira] [Updated] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15153:
--
Assignee: Yanbo Liang

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is numeric, SparkR spark.naiveBayes throws an 
> error. This bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable type can be string or numeric. If it's 
> string, RFormula will transform it to a DoubleType label via StringIndexer 
> and set the corresponding column metadata; otherwise, RFormula will use it 
> directly as the label when training the model (and assumes it is numbered 
> from 0, ..., maxLabelIndex). 
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle it 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]






[jira] [Updated] (SPARK-15339) ML 2.0 QA: Scala APIs and code audit for regression

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15339:
--
Assignee: Yanbo Liang

> ML 2.0 QA: Scala APIs and code audit for regression
> ---
>
> Key: SPARK-15339
> URL: https://issues.apache.org/jira/browse/SPARK-15339
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> ML 2.0 QA: Scala APIs and code audit for regression






[jira] [Resolved] (SPARK-15339) ML 2.0 QA: Scala APIs and code audit for regression

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15339.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13129
[https://github.com/apache/spark/pull/13129]

> ML 2.0 QA: Scala APIs and code audit for regression
> ---
>
> Key: SPARK-15339
> URL: https://issues.apache.org/jira/browse/SPARK-15339
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
> Fix For: 2.0.0
>
>
> ML 2.0 QA: Scala APIs and code audit for regression






[jira] [Updated] (SPARK-15394) ML user guide typos and grammar audit

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15394:
--
Fix Version/s: 2.0.0

> ML user guide typos and grammar audit
> -
>
> Key: SPARK-15394
> URL: https://issues.apache.org/jira/browse/SPARK-15394
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Audit the wording in ml user guides.






[jira] [Updated] (SPARK-15394) ML user guide typos and grammar audit

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15394:
--
Assignee: Seth Hendrickson

> ML user guide typos and grammar audit
> -
>
> Key: SPARK-15394
> URL: https://issues.apache.org/jira/browse/SPARK-15394
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Audit the wording in ml user guides.






[jira] [Resolved] (SPARK-15398) Update the warning message to recommend ML usage

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15398.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13190
[https://github.com/apache/spark/pull/13190]

> Update the warning message to recommend ML usage
> 
>
> Key: SPARK-15398
> URL: https://issues.apache.org/jira/browse/SPARK-15398
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 2.0.0
>
>
> Update the warning message in the example, and recommend that users use ML 
> instead of MLlib
> from
> {code}
>   def showWarning() {
> System.err.println(
>   """WARN: This is a naive implementation of Logistic Regression and is 
> given as an example!
> |Please use either 
> org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
> |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
> |for more conventional use.
>   """.stripMargin)
>   }
> {code}
> to
> {code}
>   def showWarning() {
> System.err.println(
>   """WARN: This is a naive implementation of Logistic Regression and is 
> given as an example!
> |Please use org.apache.spark.ml.classification.LogisticRegression
> |for more conventional use.
>   """.stripMargin)
>   }
> {code}






[jira] [Updated] (SPARK-15398) Update the warning message to recommend ML usage

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15398:
--
Assignee: zhengruifeng

> Update the warning message to recommend ML usage
> 
>
> Key: SPARK-15398
> URL: https://issues.apache.org/jira/browse/SPARK-15398
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.0.0
>
>
> Update the warning message in the example, and recommend that users use ML 
> instead of MLlib
> from
> {code}
>   def showWarning() {
> System.err.println(
>   """WARN: This is a naive implementation of Logistic Regression and is 
> given as an example!
> |Please use either 
> org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
> |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
> |for more conventional use.
>   """.stripMargin)
>   }
> {code}
> to
> {code}
>   def showWarning() {
> System.err.println(
>   """WARN: This is a naive implementation of Logistic Regression and is 
> given as an example!
> |Please use org.apache.spark.ml.classification.LogisticRegression
> |for more conventional use.
>   """.stripMargin)
>   }
> {code}






[jira] [Updated] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15363:
--
Assignee: Miao Wang

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Miao Wang
> Fix For: 2.0.0
>
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to 
> minimize the changes in that PR. However, these are private APIs, which 
> shouldn't appear in the example code. We should consider updating them during 
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15363.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13213
[https://github.com/apache/spark/pull/13213]

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
> Fix For: 2.0.0
>
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to 
> minimize the changes in that PR. However, these are private APIs, which 
> shouldn't appear in the example code. We should consider updating them during 
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15172) Warning message should explicitly tell user initial coefficients is ignored if its size doesn't match expected size in LogisticRegression

2016-05-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15172:
--
Fix Version/s: (was: 2.1.0)
   2.0.0

> Warning message should explicitly tell user initial coefficients is ignored 
> if its size doesn't match expected size in LogisticRegression
> -
>
> Key: SPARK-15172
> URL: https://issues.apache.org/jira/browse/SPARK-15172
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: ding
>Assignee: ding
>Priority: Trivial
> Fix For: 2.0.0
>
>
> In the ML LogisticRegression code, if the size of the initial coefficients 
> doesn't match the expected size, the initial coefficients are ignored. We 
> should explicitly tell the user this. Besides, logging the size of the 
> initial coefficients is more straightforward than logging their values when 
> a size mismatch happens.
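
A minimal sketch of the proposed warning (illustrative names 
`initialCoefficients`/`numFeatures`; `logWarning` assumes Spark's Logging 
trait is mixed in):

{code}
// Log the sizes, not the coefficient values, when a mismatch happens.
if (initialCoefficients.size != numFeatures) {
  logWarning(s"Initial coefficients will be ignored: their size " +
    s"(${initialCoefficients.size}) does not match the expected size " +
    s"($numFeatures).")
}
{code}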



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15296) Refactor All Java Tests that use SparkSession

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15296.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13101
[https://github.com/apache/spark/pull/13101]

> Refactor All Java Tests that use SparkSession
> -
>
> Key: SPARK-15296
> URL: https://issues.apache.org/jira/browse/SPARK-15296
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Tests
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Minor
> Fix For: 2.0.0
>
>
> There's a lot of duplicate code in Java tests: the {{setUp()}} and 
> {{tearDown()}} of most Java test classes in ML/MLlib.
> So we will create a {{SharedSparkSession}} class that holds the common 
> {{setUp}} and {{tearDown}} code, and the other classes just extend it.
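
A sketch of the shared-session idea (written in Scala for brevity; the class 
and method names here are illustrative, not the actual ones):

{code}
import org.apache.spark.sql.SparkSession

abstract class SharedSparkSessionBase {
  protected var spark: SparkSession = _

  // Common setUp: create one local SparkSession per suite.
  def setUp(): Unit = {
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("test")
      .getOrCreate()
  }

  // Common tearDown: stop the session and drop the reference.
  def tearDown(): Unit = {
    if (spark != null) {
      spark.stop()
      spark = null
    }
  }
}
{code}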



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15341) Add documentation for `model.write` to clarify `summary` was not saved

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15341:
--
Assignee: Yanbo Liang

> Add documentation for `model.write` to clarify `summary` was not saved 
> ---
>
> Key: SPARK-15341
> URL: https://issues.apache.org/jira/browse/SPARK-15341
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, model.write does not save the summary (if applicable). We should 
> add documentation to clarify this.
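
A minimal sketch of the behavior being documented (assuming a freshly fit 
LogisticRegressionModel named `model`):

{code}
import org.apache.spark.ml.classification.LogisticRegressionModel

model.summary                                  // available on the in-memory model
model.write.overwrite().save("/tmp/lr-model")  // the summary is not persisted
val loaded = LogisticRegressionModel.load("/tmp/lr-model")
loaded.hasSummary                              // false; calling summary here throws
{code}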



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15341) Add documentation for `model.write` to clarify `summary` was not saved

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15341:
--
Fix Version/s: 2.0.0

> Add documentation for `model.write` to clarify `summary` was not saved 
> ---
>
> Key: SPARK-15341
> URL: https://issues.apache.org/jira/browse/SPARK-15341
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, model.write does not save the summary (if applicable). We should 
> add documentation to clarify this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15414) Make the mllib,ml linalg type conversion APIs public

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15414:
--
Assignee: Sandeep Singh

> Make the mllib,ml linalg type conversion APIs public
> 
>
> Key: SPARK-15414
> URL: https://issues.apache.org/jira/browse/SPARK-15414
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Sandeep Singh
> Fix For: 2.0.0
>
>
> We should open up the APIs for converting between the new and old linear 
> algebra types (in spark.mllib.linalg):
> * Vector.asML
> * Vectors.fromML
> * same for Sparse/Dense and for Matrices
> I made these private originally, but they will be useful for users 
> transitioning workloads.
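
A minimal round-trip sketch of these converters once public (per the list 
above):

{code}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

val oldVec = OldVectors.dense(1.0, 2.0, 3.0)
val newVec = oldVec.asML                 // spark.mllib -> spark.ml
val back   = OldVectors.fromML(newVec)   // spark.ml -> spark.mllib
assert(back == oldVec)
{code}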



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15414) Make the mllib,ml linalg type conversion APIs public

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15414.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13202
[https://github.com/apache/spark/pull/13202]

> Make the mllib,ml linalg type conversion APIs public
> 
>
> Key: SPARK-15414
> URL: https://issues.apache.org/jira/browse/SPARK-15414
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> We should open up the APIs for converting between the new and old linear 
> algebra types (in spark.mllib.linalg):
> * Vector.asML
> * Vectors.fromML
> * same for Sparse/Dense and for Matrices
> I made these private originally, but they will be useful for users 
> transitioning workloads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292403#comment-15292403
 ] 

Xiangrui Meng commented on SPARK-15363:
---

No. I think we need to make the converters between the new and old vectors 
public (WIP); then the example code won't need the implicits. Another option is 
to make the implicits public.

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to 
> minimize the changes in that PR. However, these are private APIs, which 
> shouldn't appear in the example code. We should consider updating them during 
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms

2016-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14615.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12627
[https://github.com/apache/spark/pull/12627]

> Use the new ML Vector and Matrix in the ML pipeline based algorithms 
> -
>
> Key: SPARK-14615
> URL: https://issues.apache.org/jira/browse/SPARK-14615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Once SPARK-14487 and SPARK-14549 are merged, we will migrate to the new 
> vector and matrix types in the new ML pipeline-based APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14615) Use the new ML Vector and Matrix in the ML pipeline based algorithms

2016-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14615:
--
Priority: Blocker  (was: Major)

> Use the new ML Vector and Matrix in the ML pipeline based algorithms 
> -
>
> Key: SPARK-14615
> URL: https://issues.apache.org/jira/browse/SPARK-14615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Blocker
>
> Once SPARK-14487 and SPARK-14549 are merged, we will migrate to the new 
> vector and matrix types in the new ML pipeline-based APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15364) Implement Python picklers for ml.Vector and ml.Matrix under spark.ml.python

2016-05-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15364:
-

 Summary: Implement Python picklers for ml.Vector and ml.Matrix 
under spark.ml.python
 Key: SPARK-15364
 URL: https://issues.apache.org/jira/browse/SPARK-15364
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.0.0
Reporter: Xiangrui Meng


Currently, the picklers for both the new and old vectors are implemented under 
PythonMLlibAPI. To separate spark.mllib from spark.ml, we should implement them 
under `spark.ml.python` instead. I set the target to 2.1 since those are 
private APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15363) Example code shouldn't use VectorImplicits._

2016-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15363:
--
Description: 
In SPARK-14615, we use VectorImplicits._ and asML in example code to minimize 
the changes in that PR. However, these are private APIs, which shouldn't appear 
in the example code. We should consider updating them during QA.

https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala

  was:
In SPARK-14615, we use VectorImplicits._ in example code to minimize the 
changes in that PR. However, this is a private API, which shouldn't appear in 
the example code. We should consider updating them during QA.

https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala


> Example code shouldn't use VectorImplicits._
> 
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to 
> minimize the changes in that PR. However, these are private APIs, which 
> shouldn't appear in the example code. We should consider updating them during 
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15363:
--
Summary: Example code shouldn't use VectorImplicits._, asML/fromML  (was: 
Example code shouldn't use VectorImplicits._)

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to 
> minimize the changes in that PR. However, these are private APIs, which 
> shouldn't appear in the example code. We should consider updating them during 
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15363) Example code shouldn't use VectorImplicits._

2016-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15363:
--
Description: 
In SPARK-14615, we use VectorImplicits._ in example code to minimize the 
changes in that PR. However, this is a private API, which shouldn't appear in 
the example code. We should consider updating them during QA.

https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala

  was:In SPARK-14615, we use VectorImplicits._ in example code to minimize the 
changes in that PR. However, this is a private API, which shouldn't appear in 
the example code. We should consider updating them during QA.


> Example code shouldn't use VectorImplicits._
> 
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ in example code to minimize the 
> changes in that PR. However, this is a private API, which shouldn't appear in 
> the example code. We should consider updating them during QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15363) Example code shouldn't use VectorImplicits._

2016-05-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15363:
-

 Summary: Example code shouldn't use VectorImplicits._
 Key: SPARK-15363
 URL: https://issues.apache.org/jira/browse/SPARK-15363
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Reporter: Xiangrui Meng


In SPARK-14615, we use VectorImplicits._ in example code to minimize the 
changes in that PR. However, this is a private API, which shouldn't appear in 
the example code. We should consider updating them during QA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14906) Copy pyspark.mllib.linalg to pyspark.ml.linalg

2016-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14906:
--
Summary: Copy pyspark.mllib.linalg to pyspark.ml.linalg  (was: Move 
VectorUDT and MatrixUDT in PySpark to new ML package)

> Copy pyspark.mllib.linalg to pyspark.ml.linalg
> --
>
> Key: SPARK-14906
> URL: https://issues.apache.org/jira/browse/SPARK-14906
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> As we move VectorUDT and MatrixUDT in Scala to the new ml package, the 
> PySpark code should be moved too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package

2016-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14906:
--
Assignee: Liang-Chi Hsieh

> Move VectorUDT and MatrixUDT in PySpark to new ML package
> -
>
> Key: SPARK-14906
> URL: https://issues.apache.org/jira/browse/SPARK-14906
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> As we move VectorUDT and MatrixUDT in Scala to the new ml package, the 
> PySpark code should be moved too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package

2016-05-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14906.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13099
[https://github.com/apache/spark/pull/13099]

> Move VectorUDT and MatrixUDT in PySpark to new ML package
> -
>
> Key: SPARK-14906
> URL: https://issues.apache.org/jira/browse/SPARK-14906
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> As we move VectorUDT and MatrixUDT in Scala to the new ml package, the 
> PySpark code should be moved too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15268) Make JavaTypeInference work with UDTRegistration

2016-05-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15268.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13046
[https://github.com/apache/spark/pull/13046]

> Make JavaTypeInference work with UDTRegistration
> 
>
> Key: SPARK-15268
> URL: https://issues.apache.org/jira/browse/SPARK-15268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> We have a private UDTRegistration API to register user-defined types. 
> Currently, JavaTypeInference doesn't work with it. We should make it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15268) Make JavaTypeInference work with UDTRegistration

2016-05-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15268:
--
Assignee: Liang-Chi Hsieh

> Make JavaTypeInference work with UDTRegistration
> 
>
> Key: SPARK-15268
> URL: https://issues.apache.org/jira/browse/SPARK-15268
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> We have a private UDTRegistration API to register user-defined types. 
> Currently, JavaTypeInference doesn't work with it. We should make it work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14050) Add multiple languages support for Stop Words Remover

2016-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14050:
--
Assignee: Burak KÖSE

> Add multiple languages support for Stop Words Remover
> -
>
> Key: SPARK-14050
> URL: https://issues.apache.org/jira/browse/SPARK-14050
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Burak KÖSE
>Assignee: Burak KÖSE
> Fix For: 2.0.0
>
>
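
A usage sketch of the resulting API (assuming the `loadDefaultStopWords` 
helper added by the resolving PR):

{code}
import org.apache.spark.ml.feature.StopWordsRemover

// Load the bundled stop words for a given language instead of the
// English-only default.
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords("turkish"))
{code}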




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14050) Add multiple languages support for Stop Words Remover

2016-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14050.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12843
[https://github.com/apache/spark/pull/12843]

> Add multiple languages support for Stop Words Remover
> -
>
> Key: SPARK-14050
> URL: https://issues.apache.org/jira/browse/SPARK-14050
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Burak KÖSE
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-05-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15269218#comment-15269218
 ] 

Xiangrui Meng commented on SPARK-15027:
---

Ah, I see the problems now. We do need the hash partitioner to accelerate 
queries from the driver, and probably joins. What if we repartition the factors 
using `repartition(blocks, "id")` before returning them? That should come with 
a hash partitioner, though it might differ from the one we used inside ALS. 
#2 seems like a bug. Could you provide a minimal example that reproduces it?

Given the pending issues, it seems that we should target this to 2.1. Sounds 
good?
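
A minimal sketch of that suggestion (hypothetical names; `userFactors` is the 
factor DataFrame and `numBlocks` the block count used by ALS):

{code}
import org.apache.spark.sql.functions.col

// Repartition the factor DataFrame by id before returning it, so driver
// lookups and joins can benefit from the hash partitioning.
val partitionedFactors = userFactors.repartition(numBlocks, col("id"))
{code}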

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of 
> `RDD`, to be consistent with the other APIs under spark.ml; it also leaves 
> room for Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6717) Clear shuffle files after checkpointing in ALS

2016-05-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6717.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11919
[https://github.com/apache/spark/pull/11919]

> Clear shuffle files after checkpointing in ALS
> --
>
> Key: SPARK-6717
> URL: https://issues.apache.org/jira/browse/SPARK-6717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: holdenk
>  Labels: als
> Fix For: 2.0.0
>
>
> In ALS iterations, we checkpoint RDDs to cut lineage and to reduce shuffle 
> files. However, whether to clean shuffle files depends on the system GC, 
> which may not be triggered in ALS iterations. So after checkpointing, before 
> we let the RDD object go out of scope, we should clean its shuffle 
> dependencies explicitly. This function could either stay inside ALS or go to 
> Core.
> Without this feature, we can call System.gc() periodically to clean shuffle 
> files of RDDs that went out of scope.
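
A sketch of the stopgap described in the last sentence (illustrative; 
`iteration` and `interval` are assumed loop variables inside ALS):

{code}
// Periodically trigger the JVM GC so the ContextCleaner can remove
// shuffle files of RDDs that have gone out of scope.
if (iteration % interval == 0) {
  System.gc()
}
{code}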



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15064) Locale support in StopWordsRemover

2016-05-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15064:
-

 Summary: Locale support in StopWordsRemover
 Key: SPARK-15064
 URL: https://issues.apache.org/jira/browse/SPARK-15064
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.0.0
Reporter: Xiangrui Meng


We support case-insensitive filtering (the default) in StopWordsRemover. 
However, case-insensitive matching depends on the locale and region, which 
cannot be explicitly set in StopWordsRemover. We should consider adding this 
support in MLlib.
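
A tiny sketch of why the locale matters for case-insensitive matching 
(the classic Turkish dotless-i example):

{code}
import java.util.Locale

val tr = new Locale("tr")
"I".toLowerCase(Locale.ROOT)  // "i"
"I".toLowerCase(tr)           // "\u0131" (dotless i) under the Turkish locale
{code}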



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15030) Support formula in spark.kmeans in SparkR

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15030.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12813
[https://github.com/apache/spark/pull/12813]

> Support formula in spark.kmeans in SparkR
> -
>
> Key: SPARK-15030
> URL: https://issues.apache.org/jira/browse/SPARK-15030
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> In SparkR, spark.kmeans takes a DataFrame with double columns. This is 
> different from the other ML methods we implemented, which support R model 
> formulas. We should add formula support here as well.
> {code:none}
> spark.kmeans(data = df, formula = ~ lat + lon, ...)
> {code}
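
For comparison, a Scala-side sketch of what the formula support amounts to 
(assuming a DataFrame `df` with lat/lon columns):

{code}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the formula's columns into a features vector, then cluster.
val assembler = new VectorAssembler()
  .setInputCols(Array("lat", "lon"))
  .setOutputCol("features")
val kmeans = new KMeans().setK(3).setFeaturesCol("features")
val model  = kmeans.fit(assembler.transform(df))
{code}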



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14653) Remove NumericParser and jackson dependency from mllib-local

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14653.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12802
[https://github.com/apache/spark/pull/12802]

> Remove NumericParser and jackson dependency from mllib-local
> 
>
> Key: SPARK-14653
> URL: https://issues.apache.org/jira/browse/SPARK-14653
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.0.0
>
>
> After SPARK-14549, we should remove NumericParser and jackson from 
> mllib-local, which were introduced very early on and are now replaced by UDTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15027:
--
Target Version/s:   (was: 2.0.0)

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of 
> `RDD`, to be consistent with the other APIs under spark.ml; it also leaves 
> room for Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265238#comment-15265238
 ] 

Xiangrui Meng commented on SPARK-15027:
---

It might be tricky to use Dataset due to encoders and generic ID types. But if 
we use DataFrame as input and output, it seems feasible. It would be great if 
you could take a look.
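
A sketch of why the typed Dataset route is tricky (hypothetical signature, not 
the actual ALS code):

{code}
import org.apache.spark.sql.{Dataset, Encoder}

object AlsSketch {
  case class Rating[ID](user: ID, item: ID, rating: Float)

  // A typed Dataset needs an Encoder for the generic element type,
  // threaded through implicitly; a DataFrame stays untyped and avoids this.
  def train[ID](ratings: Dataset[Rating[ID]])(
      implicit enc: Encoder[Rating[ID]]): Unit = {
    // training body elided
  }
}
{code}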

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of 
> `RDD`, to be consistent with the other APIs under spark.ml; it also leaves 
> room for Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15027:
--
Assignee: (was: Xiangrui Meng)

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of 
> `RDD`, to be consistent with the other APIs under spark.ml; it also leaves 
> room for Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265229#comment-15265229
 ] 

Xiangrui Meng edited comment on SPARK-15027 at 4/30/16 7:50 AM:


Just an API change. I guess there are still gaps in using DataFrame for the 
implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer 
API.


was (Author: mengxr):
No, just an API change. I guess there are still gaps in using DataFrame for the 
implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer 
API.

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of 
> `RDD`, to be consistent with the other APIs under spark.ml; it also leaves 
> room for Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265229#comment-15265229
 ] 

Xiangrui Meng commented on SPARK-15027:
---

No, just an API change. I guess there are still gaps in using DataFrame for the 
implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer 
API.

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of 
> `RDD`, to be consistent with the other APIs under spark.ml; it also leaves 
> room for Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15027:
--
Description: We should also update `ALS.train` to use `Dataset/DataFrame` 
instead of `RDD`, to be consistent with the other APIs under spark.ml; it also 
leaves room for Tungsten-based optimization.  (was: This continues the work 
from SPARK-14412 to update `intermediateRDDStorageLevel` to 
`intermediateStorageLevel`, and `finalRDDStorageLevel` to `finalStorageLevel`. 
We should also update `ALS.train` to use `Dataset` instead of `RDD`.)

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We should also update `ALS.train` to use `Dataset/DataFrame` instead of 
> `RDD`, to be consistent with the other APIs under spark.ml; it also leaves 
> room for Tungsten-based optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15027:
--
Summary: ALS.train should use DataFrame instead of RDD  (was: ml.ALS params 
and ALS.train should not depend on RDD)

> ALS.train should use DataFrame instead of RDD
> -
>
> Key: SPARK-15027
> URL: https://issues.apache.org/jira/browse/SPARK-15027
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This continues the work from SPARK-14412 to update 
> `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and 
> `finalRDDStorageLevel` to `finalStorageLevel`. We should also update 
> `ALS.train` to use `Dataset` instead of `RDD`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package

2016-04-30 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265216#comment-15265216
 ] 

Xiangrui Meng commented on SPARK-14906:
---

[~viirya] To confirm the scope of this JIRA, does it cover moving (or aliasing) 
`pyspark.mllib.linalg` to `pyspark.ml.linalg`? 

> Move VectorUDT and MatrixUDT in PySpark to new ML package
> -
>
> Key: SPARK-14906
> URL: https://issues.apache.org/jira/browse/SPARK-14906
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Liang-Chi Hsieh
>
> As we move VectorUDT and MatrixUDT in Scala to the new ml package, the 
> PySpark code should be moved too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15030) Support formula in spark.kmeans in SparkR

2016-04-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15030:
-

 Summary: Support formula in spark.kmeans in SparkR
 Key: SPARK-15030
 URL: https://issues.apache.org/jira/browse/SPARK-15030
 Project: Spark
  Issue Type: New Feature
  Components: ML, SparkR
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Yanbo Liang


In SparkR, spark.kmeans takes a DataFrame with double columns. This is 
different from the other ML methods we implemented, which support R model 
formulas. We should add formula support here as well.

{code:none}
spark.kmeans(data = df, formula = ~ lat + lon, ...)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14831.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12789
[https://github.com/apache/spark/pull/12789]

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>Priority: Critical
> Fix For: 2.0.0
>
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes in the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-30 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14850.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12640
[https://github.com/apache/spark/pull/12640]

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to using GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specializing GenericArrayData or using a different container.
> cc: [~cloud_fan] [~yhuai]
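
A tiny sketch of the boxing cost in plain Scala (not the catalyst code itself):

{code}
// An Array[Double] stores unboxed primitives contiguously; putting the
// same values behind Array[Any] forces one java.lang.Double object each.
val primitive: Array[Double] = Array(1.0, 2.0, 3.0)
val boxed: Array[Any] = primitive.map(x => x: Any)
{code}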



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15027) ml.ALS params and ALS.train should not depend on RDD

2016-04-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15027:
-

 Summary: ml.ALS params and ALS.train should not depend on RDD
 Key: SPARK-15027
 URL: https://issues.apache.org/jira/browse/SPARK-15027
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


This continues the work from SPARK-14412 to update 
`intermediateRDDStorageLevel` to `intermediateStorageLevel`, and 
`finalRDDStorageLevel` to `finalStorageLevel`. We should also update 
`ALS.train` to use `Dataset` instead of `RDD`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14412) spark.ml ALS preferred storage level Params

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14412:
--
Assignee: Nick Pentreath

> spark.ml ALS preferred storage level Params
> --
>
> Key: SPARK-14412
> URL: https://issues.apache.org/jira/browse/SPARK-14412
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.0.0
>
>
> spark.mllib ALS supports {{setIntermediateRDDStorageLevel}} and 
> {{setFinalRDDStorageLevel}}.  Those should be added as Params in spark.ml, 
> but they should be in group "expertParam" since few users will need them.
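
A usage sketch of the new expert params (assuming the string-valued setters 
added by the resolving PR, without the "RDD" infix):

{code}
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setIntermediateStorageLevel("MEMORY_AND_DISK")
  .setFinalStorageLevel("MEMORY_AND_DISK")
{code}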



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14412) spark.ml ALS preferred storage level Params

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14412.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12660
[https://github.com/apache/spark/pull/12660]

> spark.ml ALS preferred storage level Params
> --
>
> Key: SPARK-14412
> URL: https://issues.apache.org/jira/browse/SPARK-14412
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>
> spark.mllib ALS supports {{setIntermediateRDDStorageLevel}} and 
> {{setFinalRDDStorageLevel}}.  Those should be added as Params in spark.ml, 
> but they should be in group "expertParam" since few users will need them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14653) Remove NumericParser and jackson dependency from mllib-local

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-14653:
-

Assignee: Xiangrui Meng

> Remove NumericParser and jackson dependency from mllib-local
> 
>
> Key: SPARK-14653
> URL: https://issues.apache.org/jira/browse/SPARK-14653
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-14549, we should remove NumericParser and jackson from 
> mllib-local, which were introduced very early on and are now replaced by UDTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14311) Model persistence in SparkR 2.0

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14311:
--
Target Version/s: 2.0.0
   Fix Version/s: 2.0.0

> Model persistence in SparkR 2.0
> ---
>
> Key: SPARK-14311
> URL: https://issues.apache.org/jira/browse/SPARK-14311
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.0.0
>
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, 
> naive Bayes, and AFT survival regression. Users can fit models, get summary, 
> and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be 
> straightforward to implement model persistence. We need to think more about 
> the API. R uses save/load for objects and datasets (also objects). It is 
> possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But 
> I'm not sure whether load can be overloaded easily. I propose the following 
> API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined the wrappers as S4 classes. So `ml.save` is an S4 method and 
> ml.load is an S3 method (correct me if I'm wrong).
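
For reference, the Scala-side persistence these wrappers build on 
(spark.ml MLWritable/MLReadable; `pipelineModel` and `path` are assumed):

{code}
import org.apache.spark.ml.PipelineModel

pipelineModel.write.overwrite().save(path)
val restored = PipelineModel.load(path)
{code}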



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14311) Model persistence in SparkR 2.0

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14311:
--
Summary: Model persistence in SparkR 2.0  (was: Model persistence in SparkR)

> Model persistence in SparkR 2.0
> ---
>
> Key: SPARK-14311
> URL: https://issues.apache.org/jira/browse/SPARK-14311
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, 
> naive Bayes, and AFT survival regression. Users can fit models, get summary, 
> and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be 
> straightforward to implement model persistence. We need to think more about 
> the API. R uses save/load for objects and datasets (also objects). It is 
> possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But 
> I'm not sure whether load can be overloaded easily. I propose the following 
> API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined the wrappers as S4 classes. So `ml.save` is an S4 method and 
> ml.load is an S3 method (correct me if I'm wrong).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14311) Model persistence in SparkR 2.0

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14311.
---
Resolution: Fixed

> Model persistence in SparkR 2.0
> ---
>
> Key: SPARK-14311
> URL: https://issues.apache.org/jira/browse/SPARK-14311
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, 
> naive Bayes, and AFT survival regression. Users can fit models, get summary, 
> and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be 
> straightforward to implement model persistence. We need to think more about 
> the API. R uses save/load for objects and datasets (also objects). It is 
> possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But 
> I'm not sure whether load can be overloaded easily. I propose the following 
> API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined the wrappers as S4 classes. So `ml.save` is an S4 method and 
> ml.load is an S3 method (correct me if I'm wrong).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13786) Pyspark ml.tuning support export/import

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13786:
--
Fix Version/s: (was: 2.0.0)

> Pyspark ml.tuning support export/import
> ---
>
> Key: SPARK-13786
> URL: https://issues.apache.org/jira/browse/SPARK-13786
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This should follow whatever implementation is chosen for Pipeline (since 
> these are all meta-algorithms).
> Note this will also require persistence for Evaluators. Hopefully that can 
> leverage the Java implementations; there is no real need to make Python 
> Evaluators MLWritable, as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13786) Pyspark ml.tuning support export/import

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-13786:
---

Re-open the issue since we reverted the change.

> Pyspark ml.tuning support export/import
> ---
>
> Key: SPARK-13786
> URL: https://issues.apache.org/jira/browse/SPARK-13786
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This should follow whatever implementation is chosen for Pipeline (since 
> these are all meta-algorithms).
> Note this will also require persistence for Evaluators. Hopefully that can 
> leverage the Java implementations; there is no real need to make Python 
> Evaluators MLWritable, as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13786) Pyspark ml.tuning support export/import

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13786.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12782
[https://github.com/apache/spark/pull/12782]

> Pyspark ml.tuning support export/import
> ---
>
> Key: SPARK-13786
> URL: https://issues.apache.org/jira/browse/SPARK-13786
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> This should follow whatever implementation is chosen for Pipeline (since 
> these are all meta-algorithms).
> Note this will also require persistence for Evaluators. Hopefully that can 
> leverage the Java implementations; there is no real need to make Python 
> Evaluators MLWritable, as far as I can tell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14059) Define R wrappers under org.apache.spark.ml.r

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14059:
--
Assignee: Yanbo Liang

> Define R wrappers under org.apache.spark.ml.r
> -
>
> Key: SPARK-14059
> URL: https://issues.apache.org/jira/browse/SPARK-14059
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 1.6.1
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Minor
>
> Currently, the wrapper files are under .../ml/r but the wrapper classes are 
> defined under ...ml.api.r, which doesn't follow the package convention. We 
> should move all the wrappers under ml.r.
> This should happen after we merge the other MLlib/R wrappers, to avoid merge 
> conflicts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14059) Define R wrappers under org.apache.spark.ml.r

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14059.
---
Resolution: Fixed

> Define R wrappers under org.apache.spark.ml.r
> -
>
> Key: SPARK-14059
> URL: https://issues.apache.org/jira/browse/SPARK-14059
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 1.6.1
>Reporter: Xiangrui Meng
>Priority: Minor
>
> Currently, the wrapper files are under .../ml/r, but the wrapper classes are 
> defined under ...ml.api.r, which doesn't follow the package naming 
> convention. We should move all wrappers under ml.r.
> This should happen after we merge the other MLlib/R wrappers, to avoid merge 
> conflicts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15010) Lots of error messages about accumulator in Spark shell when a task takes some time to run

2016-04-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15010:
-

 Summary: Lots of error messages about accumulator in Spark shell 
when a task takes some time to run
 Key: SPARK-15010
 URL: https://issues.apache.org/jira/browse/SPARK-15010
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Shell
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Wenchen Fan
Priority: Blocker


{code:none}
16/04/29 11:59:23 ERROR Utils: Uncaught exception in thread 
heartbeat-receiver-event-loop-thread
java.lang.UnsupportedOperationException: Can't read accumulator value in task
at org.apache.spark.NewAccumulator.value(NewAccumulator.scala:137)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5$$anonfun$apply$9$$anonfun$apply$10.apply(TaskSchedulerImpl.scala:394)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5$$anonfun$apply$9$$anonfun$apply$10.apply(TaskSchedulerImpl.scala:394)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:394)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:392)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5.apply(TaskSchedulerImpl.scala:392)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$5.apply(TaskSchedulerImpl.scala:391)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:391)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:128)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:127)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/04/29 11:59:33 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(driver,[Lscala.Tuple2;@1cd9105c,BlockManagerId(driver, 192.168.99.1, 
60533))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.executor.heartbeatInterval
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:494)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:523)
at 
{code}
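A rough reproduction sketch, assuming a local spark-shell on this build: keep 
a single task running longer than spark.executor.heartbeatInterval (10 seconds 
by default, per the log above), so heartbeats report while the task's 
accumulators are still in flight.

{code:none}
// Run inside spark-shell: one task that sleeps past the heartbeat interval.
sc.parallelize(1 to 1, 1).foreach { _ =>
  Thread.sleep(60 * 1000L) // heartbeats fire mid-task and hit the error above
}
{code}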

[jira] [Created] (SPARK-15006) Generated JavaDoc should hide package private objects

2016-04-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-15006:
-

 Summary: Generated JavaDoc should hide package private objects
 Key: SPARK-15006
 URL: https://issues.apache.org/jira/browse/SPARK-15006
 Project: Spark
  Issue Type: Improvement
  Components: Build, Documentation
Affects Versions: 2.0.0
Reporter: Xiangrui Meng


After switching to the official release of genjavadoc in SPARK-14511, package 
private objects are no longer hidden in the generated JavaDoc. This JIRA tracks 
that upstream issue; we will update genjavadoc in Spark once a fix lands 
upstream.
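For concreteness, a package-private object of the kind that should stay 
hidden; the names are illustrative:

{code:none}
package org.apache.spark.example

// Visible only inside org.apache.spark, so it should be omitted from the
// public JavaDoc that genjavadoc generates; currently it leaks through.
private[spark] object InternalHelper {
  def doWork(): Unit = ()
}
{code}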



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264328#comment-15264328
 ] 

Xiangrui Meng commented on SPARK-14831:
---

Talked to [~timhunter] offline and he will submit a PR soon.

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data`, while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, a formula doesn't mean anything without being 
> associated with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14314) K-means model persistence in SparkR

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14314.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12778
[https://github.com/apache/spark/pull/12778]

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Gayathri Murali
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14315) GLMs model persistence in SparkR

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14315.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12778
[https://github.com/apache/spark/pull/12778]

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Gayathri Murali
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14831:
--
Assignee: Timothy Hunter  (was: Xiangrui Meng)

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data`, while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, a formula doesn't mean anything without being 
> associated with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7264) SparkR API for parallel functions

2016-04-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7264.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12426
[https://github.com/apache/spark/pull/12426]

> SparkR API for parallel functions
> -
>
> Key: SPARK-7264
> URL: https://issues.apache.org/jira/browse/SPARK-7264
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Timothy Hunter
> Fix For: 2.0.0
>
>
> This is a JIRA to discuss design proposals for enabling parallel R 
> computation in SparkR without exposing the entire RDD API. 
> The rationale for this is that the RDD API has a number of low-level 
> functions, and we would like to expose a more lightweight API that is both 
> friendly to R users and easy to maintain.
> http://goo.gl/GLHKZI has a first-cut design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14487) User Defined Type registration without SQLUserDefinedType annotation

2016-04-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14487.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12259
[https://github.com/apache/spark/pull/12259]

> User Defined Type registration without SQLUserDefinedType annotation
> 
>
> Key: SPARK-14487
> URL: https://issues.apache.org/jira/browse/SPARK-14487
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Currently we use the SQLUserDefinedType annotation to register UDTs for user 
> classes. However, by doing this, we add a Spark dependency to user classes.
> For some user classes, it is unnecessary to add such a dependency, which 
> will only increase deployment difficulty.
> We should provide an alternative approach to registering UDTs for user 
> classes without the SQLUserDefinedType annotation.
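A hedged sketch of what annotation-free registration could look like; the 
registry object and its string-based signature are assumptions for 
illustration, not necessarily the merged API:

{code:none}
// Hypothetical registry call: associate a user class with its UDT by name,
// keeping the user class itself free of any Spark import or annotation.
UDTRegistration.register(
  "com.example.MyPoint",          // plain user class, no Spark dependency
  "com.example.spark.MyPointUDT"  // UDT implementation kept in glue code
)
{code}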



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2016-04-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14850:
--
Assignee: Wenchen Fan

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Blocker
>
> In SPARK-9390, we switched to using GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types, which might hurt MLlib performance badly. We should 
> consider either specializing GenericArrayData or using a different container.
> cc: [~cloud_fan] [~yhuai]
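A minimal illustration of the boxing concern, kept independent of the exact 
GenericArrayData constructors: storing primitives behind Array[Any] allocates 
one wrapper object per element.

{code:none}
val values: Array[Double] = Array(1.0, 2.0, 3.0)

// What an unspecialized container effectively does: every Double is boxed
// into a java.lang.Double before being stored as Any.
val boxed: Array[Any] = values.map(v => v: Any)

// A primitive-specialized container would keep the raw double[] and avoid
// this per-element allocation on the UDT serialization hot path.
{code}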



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


