[jira] [Comment Edited] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes

2016-06-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293316#comment-15293316
 ] 

Nick Pentreath edited comment on SPARK-14810 at 6/20/16 8:17 AM:
-

Audited list of changes since {{1.6.0}} - these are "false positives" due to 
being private, @Experimental, @DeveloperApi, etc.:
* SPARK-13686 - Add a constructor parameter `regParam` to 
(Streaming)LinearRegressionWithSGD
* SPARK-13664 - Replace HadoopFsRelation with FileFormat
* SPARK-11622 - Make LibSVMRelation extends HadoopFsRelation and Add 
LibSVMOutputWriter
* SPARK-13920 - MiMa checks should apply to @Experimental and @DeveloperApi APIs
* SPARK-11011 - UserDefinedType serialization should be strongly typed
* SPARK-13817 - Re-enable MiMa and remove object DataFrame
* SPARK-13927 - add row/column iterator to local matrices - (add methods to 
sealed trait)
* SPARK-13948 - MiMa check should catch visibility changes to `private` - 
(DataFrame -> Dataset)
* SPARK-11262 - Unit test for gradient, loss layers, memory management - 
(private class)
* SPARK-13430 - moved featureCol from LinearRegressionModelSummary to 
LinearRegressionSummary - (private class)
* SPARK-13048 - keepLastCheckpoint option for LDA EM optimizer - (private class)
* SPARK-14734 - Add conversions between mllib and ml Vector, Matrix types - 
(private methods added)
* SPARK-14861 - Replace internal usages of SQLContext with SparkSession - 
(private class)

Binary incompatible changes (an illustrative MiMa exclusion sketch follows this list):
* SPARK-14615 - Use new ML Vector and Matrix in pipeline API
** Any {{UnaryTransformer[Vector, ...]}}:
*** {{ElementwiseProduct}}
*** {{Normalizer}}
*** {{PolynomialExpansion}}
** model values:
*** {{coefficients}} in {{LinearRegressionModel}}, {{LogisticRegressionModel}} 
and {{AFTModel}}
*** {{pc}} in {{PCAModel}}
*** {{idf}} in {{IDFModel}}
*** {{originalMin}}/{{originalMax}} in {{MinMaxScalerModel}}
*** {{mean}}/{{std}} in {{StandardScalerModel}}
* SPARK-14814 - Fix the java compatibility issue for the output of 
{{spark.mllib.tree.model.DecisionTreeModel.predict}} method.
* SPARK-14089 - Remove methods that have been deprecated since 1.1, 1.2, 1.3, 
1.4, and 1.5
* SPARK-14952 - Remove methods deprecated in 1.6
* DataFrame -> Dataset changes for Java (this of course applies to all of 
Spark SQL)
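
Such intentional breaks are typically whitelisted so the MiMa build check passes; 
a hypothetical exclusion entry in the style of {{project/MimaExcludes.scala}} 
(illustrative rule and member name only, not the actual committed entries):

{code}
// Illustrative MiMa filter: whitelist an intentional result-type change
// flagged by the binary-compatibility check (e.g. PCAModel.pc moving from
// mllib.linalg to ml.linalg types under SPARK-14615).
import com.typesafe.tools.mima.core._

ProblemFilters.exclude[IncompatibleResultTypeProblem](
  "org.apache.spark.ml.feature.PCAModel.pc")
{code}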


> ML, Graph 2.0 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-14810
> URL: https://issues.apache.org/jira/browse/SPARK-14810
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>    Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.

[jira] [Created] (SPARK-16063) Add getStorageLevel to Dataset

2016-06-20 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-16063:
--

 Summary: Add getStorageLevel to Dataset
 Key: SPARK-16063
 URL: https://issues.apache.org/jira/browse/SPARK-16063
 Project: Spark
  Issue Type: Improvement
Reporter: Nick Pentreath
Assignee: Nick Pentreath
Priority: Minor


SPARK-11905 added {{cache}}/{{persist}} to {{Dataset}}. We should add 
{{Dataset.getStorageLevel}}, analogous to {{RDD.getStorageLevel}}.
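
A minimal sketch of what this would enable, with the proposed method shown as a 
hypothetical (it mirrors {{RDD.getStorageLevel}} and does not exist yet):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

val ds = spark.range(100)
ds.persist(StorageLevel.MEMORY_AND_DISK)  // existing API from SPARK-11905

// Proposed (hypothetical until implemented): inspect the level a Dataset is
// persisted at, just as RDD.getStorageLevel already allows for RDDs.
// val level: StorageLevel = ds.getStorageLevel
// assert(level == StorageLevel.MEMORY_AND_DISK)
{code}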






[jira] [Commented] (SPARK-15501) ML 2.0 QA: Scala APIs audit for recommendation

2016-06-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335961#comment-15335961
 ] 

Nick Pentreath commented on SPARK-15501:


It's done - resolved it.

> ML 2.0 QA: Scala APIs audit for recommendation
> --
>
> Key: SPARK-15501
> URL: https://issues.apache.org/jira/browse/SPARK-15501
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Blocker
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-15501) ML 2.0 QA: Scala APIs audit for recommendation

2016-06-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15501.

   Resolution: Fixed
Fix Version/s: 2.0.0

> ML 2.0 QA: Scala APIs audit for recommendation
> --
>
> Key: SPARK-15501
> URL: https://issues.apache.org/jira/browse/SPARK-15501
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Blocker
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-06-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15447.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
> Fix For: 2.0.0
>
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?
> Links:
> [Results 
> spreadsheet|https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing]
> [Raw results for 
> SPARK-14891|https://docs.google.com/document/d/1tlWFCv8zWJuxv_gfAhd-57TKURVyrYkF9v4FLl4Lpn0/edit?usp=sharing]
> [Raw results for 
> SPARK-6716|https://docs.google.com/document/d/12qLLX84Dg-XJAgoSQzmb0-bSncjTHhg7A_JJcQneDiE/edit?usp=sharing]






[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-06-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335956#comment-15335956
 ] 

Nick Pentreath commented on SPARK-15447:


Finalized results in the linked Google sheet. Also posted raw results in two 
linked Google docs.

[~mengxr] I didn't manage to run 1 billion ratings but did run 250mm (30mm 
users, 10mm items, 250mm ratings). I didn't see any potential performance 
regression issues for checkpointing changes (comparing RDD-based APIs between 
2.0.0 and 1.6.1) or DF changes (comparing DF-based APIs between 2.0.0 and 
1.6.1). I'm resolving this ticket, but let me know if you come up with any 
questions or concerns.
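
For reference, a rough sketch of the kind of synthetic-ratings run described 
above (assumed data shapes and ALS settings; not the actual benchmark harness 
behind the linked results):

{code}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder().appName("als-benchmark").getOrCreate()

// Synthetic ratings: 30m users, 10m items, 250m ratings (shapes from the comment).
val ratings = spark.range(250000000L).select(
  (rand(1) * 30000000).cast("int").as("user"),
  (rand(2) * 10000000).cast("int").as("item"),
  (rand(3) * 5).cast("float").as("rating"))

val als = new ALS()
  .setUserCol("user").setItemCol("item").setRatingCol("rating")
  .setRank(10).setMaxIter(10).setCheckpointInterval(5)  // assumed settings

val start = System.nanoTime()
val model = als.fit(ratings)
println(s"ALS fit took ${(System.nanoTime() - start) / 1e9} seconds")
{code}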

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?
> Links:
> [Results 
> spreadsheet|https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing]
> [Raw results for 
> SPARK-14891|https://docs.google.com/document/d/1tlWFCv8zWJuxv_gfAhd-57TKURVyrYkF9v4FLl4Lpn0/edit?usp=sharing]
> [Raw results for 
> SPARK-6716|https://docs.google.com/document/d/12qLLX84Dg-XJAgoSQzmb0-bSncjTHhg7A_JJcQneDiE/edit?usp=sharing]






[jira] [Updated] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-06-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15447:
---
Description: 
We made several changes to ALS in 2.0. It is necessary to run some tests to 
avoid performance regression. We should test (synthetic) datasets from 1 
million ratings to 1 billion ratings.

cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
tests?

Links:
[Results 
spreadsheet|https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing]
[Raw results for 
SPARK-14891|https://docs.google.com/document/d/1tlWFCv8zWJuxv_gfAhd-57TKURVyrYkF9v4FLl4Lpn0/edit?usp=sharing]
[Raw results for 
SPARK-6716|https://docs.google.com/document/d/12qLLX84Dg-XJAgoSQzmb0-bSncjTHhg7A_JJcQneDiE/edit?usp=sharing]

  was:
We made several changes to ALS in 2.0. It is necessary to run some tests to 
avoid performance regression. We should test (synthetic) datasets from 1 
million ratings to 1 billion ratings.

cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
tests?

Links:
[Results 
spreadsheet|https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing]


> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?
> Links:
> [Results 
> spreadsheet|https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing]
> [Raw results for 
> SPARK-14891|https://docs.google.com/document/d/1tlWFCv8zWJuxv_gfAhd-57TKURVyrYkF9v4FLl4Lpn0/edit?usp=sharing]
> [Raw results for 
> SPARK-6716|https://docs.google.com/document/d/12qLLX84Dg-XJAgoSQzmb0-bSncjTHhg7A_JJcQneDiE/edit?usp=sharing]






[jira] [Updated] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-06-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15447:
---
Description: 
We made several changes to ALS in 2.0. It is necessary to run some tests to 
avoid performance regression. We should test (synthetic) datasets from 1 
million ratings to 1 billion ratings.

cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
tests?

Links:
[Results 
spreadsheet|https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing]

  was:
We made several changes to ALS in 2.0. It is necessary to run some tests to 
avoid performance regression. We should test (synthetic) datasets from 1 
million ratings to 1 billion ratings.

cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
tests?


> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?
> Links:
> [Results 
> spreadsheet|https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing]






[jira] [Commented] (SPARK-15995) Gradient Boosted Trees - handling of Categorical Inputs

2016-06-17 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335801#comment-15335801
 ] 

Nick Pentreath commented on SPARK-15995:


cc [~sethah] 

> Gradient Boosted Trees - handling of Categorical Inputs
> ---
>
> Key: SPARK-15995
> URL: https://issues.apache.org/jira/browse/SPARK-15995
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Taylor Baldwin
>
> Gradient Boosted trees appear to handle all inputs as continuous, or at least 
> ordered, values. The trees returned in the Gradient Boosted model have nodes 
> for categorical values containing a split that operates on the threshold, not 
> the category value. This treats categorical values as if their ordering were 
> significant, which is not reasonable to assume.
> Both Random Forest and Decision Trees accept the map of categorical feature 
> info, while Gradient Boosted trees do not. Random Forest and Decision Trees 
> provide nodes for categorical values whose splits have the categories 
> populated.
> According to the website documentation, Gradient Boosted trees should handle 
> categorical features, yet there is no perceivable way to provide the 
> categorical information that would let them be handled as categories rather 
> than continuous values.
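
For context, the categorical-features map the reporter refers to looks like 
this in the RDD-based {{DecisionTree}} API (illustrative parameter values):

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.rdd.RDD

def trainWithCategoricals(data: RDD[LabeledPoint]) = {
  // Feature 0 has 3 categories, feature 2 has 5; all other features are continuous.
  val categoricalFeaturesInfo = Map(0 -> 3, 2 -> 5)
  DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo,
    impurity = "gini", maxDepth = 5, maxBins = 32)
}
{code}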






[jira] [Updated] (SPARK-16008) ML Logistic Regression aggregator serializes unnecessary data

2016-06-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-16008:
---
Assignee: Seth Hendrickson

> ML Logistic Regression aggregator serializes unnecessary data
> -
>
> Key: SPARK-16008
> URL: https://issues.apache.org/jira/browse/SPARK-16008
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> The LogisticRegressionAggregator class is used to collect gradient updates in 
> the ML logistic regression algorithm. The class stores a reference to the 
> coefficients array, whose length equals the number of features. It also stores 
> a reference to an array of standard deviations, also of length numFeatures. 
> When a task is completed it serializes the class, which also serializes a copy 
> of the two arrays. These arrays don't need to be serialized (only the gradient 
> updates are being aggregated). This causes performance issues when the number 
> of features is large and can trigger excess garbage collection when the 
> executor doesn't have much spare memory.
> This results in serializing 2*numFeatures excess data. When multiclass 
> logistic regression is implemented, the excess will be numFeatures + 
> numClasses * numFeatures.
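
A minimal sketch of the general pattern for this kind of fix (ship the 
read-only arrays via a {{Broadcast}} so only the gradient buffer is serialized 
back with task results); this is illustrative, not the actual patch:

{code}
import org.apache.spark.broadcast.Broadcast

// Illustrative aggregator: the large coefficients array arrives via a
// Broadcast, so it is not copied into every serialized task result; only
// gradientSum travels back to the driver.
class GradientAggregator(bcCoefficients: Broadcast[Array[Double]])
    extends Serializable {

  private val dim = bcCoefficients.value.length
  private val gradientSum = new Array[Double](dim)

  def add(features: Array[Double], error: Double): this.type = {
    var i = 0
    while (i < dim) {
      gradientSum(i) += error * features(i)  // accumulate gradient only
      i += 1
    }
    this
  }

  def merge(other: GradientAggregator): this.type = {
    var i = 0
    while (i < dim) {
      gradientSum(i) += other.gradientSum(i)
      i += 1
    }
    this
  }
}
{code}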






[jira] [Updated] (SPARK-15997) Audit ml.feature: Update documentation for ml feature transformers

2016-06-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15997:
---
Assignee: Gayathri Murali

> Audit ml.feature: Update documentation for ml feature transformers
> -
>
> Key: SPARK-15997
> URL: https://issues.apache.org/jira/browse/SPARK-15997
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Gayathri Murali
>Assignee: Gayathri Murali
>
> This JIRA is a subtask of SPARK-15100 and improves documentation for new 
> features added to:
> 1. HashingTF
> 2. CountVectorizer
> 3. QuantileDiscretizer






[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-06-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1530#comment-1530
 ] 

Nick Pentreath commented on SPARK-15447:


Almost there - I'll be able to close this off by Friday




> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?






[jira] [Commented] (SPARK-15746) SchemaUtils.checkColumnType with VectorUDT prints instance details in error message

2016-06-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327237#comment-15327237
 ] 

Nick Pentreath commented on SPARK-15746:


I think you can go ahead now - I also vote for the {{case object VectorUDT}} 
approach.

> SchemaUtils.checkColumnType with VectorUDT prints instance details in error 
> message
> ---
>
> Key: SPARK-15746
> URL: https://issues.apache.org/jira/browse/SPARK-15746
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>    Reporter: Nick Pentreath
>Priority: Minor
>
> Currently, many feature transformers in {{ml}} use 
> {{SchemaUtils.checkColumnType(schema, ..., new VectorUDT)}} to check the 
> column type is a ({{ml.linalg}}) vector.
> The resulting error message contains "instance" info for the {{VectorUDT}}, 
> i.e. something like this:
> {code}
> java.lang.IllegalArgumentException: requirement failed: Column features must 
> be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually 
> StringType.
> {code}
> A solution would either be to amend {{SchemaUtils.checkColumnType}} to print 
> the error message using {{getClass.getName}}, or to create a {{private[spark] 
> case object VectorUDT extends VectorUDT}} for convenience, since it is used 
> so often (and incidentally this would make it easier to put {{VectorUDT}} 
> into lists of data types e.g. schema validation, UDAFs etc).






[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327220#comment-15327220
 ] 

Nick Pentreath commented on SPARK-15904:


Could you explain why you're using K>3000 when your dataset has dimension ~2000?

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything's fine, although at the end of K-Means (after the number of 
> iterations, the cost function value and the running time are printed) there's 
> a nice "Removing RDD  from persistence list" stage. However, during this stage 
> there's high memory pressure. Weird, since RDDs are about to be removed. 
> Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application though spark-submit with --driver-memory 10G






[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML

2016-06-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327193#comment-15327193
 ] 

Nick Pentreath commented on SPARK-15790:


Yes, I've just looked at things in the concrete classes - params & methods 
defined in the traits etc are not annotated.

> Audit @Since annotations in ML
> --
>
> Key: SPARK-15790
> URL: https://issues.apache.org/jira/browse/SPARK-15790
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>
> Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
> missing and add annotations to public API constructors, vals and methods.






[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML

2016-06-13 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327028#comment-15327028
 ] 

Nick Pentreath commented on SPARK-15790:


Ah thanks - missed that umbrella. It's actually really the {{ml.feature}} 
classes mostly, and that PR seems to have stalled. I've started on a new one to 
cover the feature package.

> Audit @Since annotations in ML
> --
>
> Key: SPARK-15790
> URL: https://issues.apache.org/jira/browse/SPARK-15790
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>
> Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
> missing and add annotations to public API constructors, vals and methods.






[jira] [Resolved] (SPARK-15788) PySpark IDFModel missing "idf" property

2016-06-09 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15788.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13540
[https://github.com/apache/spark/pull/13540]

> PySpark IDFModel missing "idf" property
> ---
>
> Key: SPARK-15788
> URL: https://issues.apache.org/jira/browse/SPARK-15788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Nick Pentreath
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Scala {{IDFModel}} has a method {{def idf: Vector = idfModel.idf.asML}} - 
> this should be exposed on the Python side as a property






[jira] [Created] (SPARK-15790) Audit @Since annotations in ML

2016-06-06 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-15790:
--

 Summary: Audit @Since annotations in ML
 Key: SPARK-15790
 URL: https://issues.apache.org/jira/browse/SPARK-15790
 Project: Spark
  Issue Type: Documentation
  Components: ML, PySpark
Reporter: Nick Pentreath
Assignee: Nick Pentreath


Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
missing and add annotations to public API constructors, vals and methods.
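
For reference, the annotation pattern being audited looks like this (a minimal 
illustration with an assumed class and version):

{code}
import org.apache.spark.annotation.Since

// @Since marks the Spark version that introduced each public API element;
// the class name and version below are illustrative only.
@Since("2.0.0")
class ExampleTransformer @Since("2.0.0") (
    @Since("2.0.0") val uid: String) {

  @Since("2.0.0")
  def describe: String = uid
}
{code}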






[jira] [Created] (SPARK-15788) PySpark IDFModel missing "idf" property

2016-06-06 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-15788:
--

 Summary: PySpark IDFModel missing "idf" property
 Key: SPARK-15788
 URL: https://issues.apache.org/jira/browse/SPARK-15788
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Nick Pentreath
Priority: Trivial


Scala {{IDFModel}} has a method {{def idf: Vector = idfModel.idf.asML}} - this 
should be exposed on the Python side as a property






Re: Welcoming Yanbo Liang as a committer

2016-06-04 Thread Nick Pentreath
Congratulations Yanbo and welcome
On Sat, 4 Jun 2016 at 10:17, Hortonworks  wrote:

> Congratulations, Yanbo
>
> Zhan Zhang
>
> Sent from my iPhone
>
> > On Jun 3, 2016, at 8:39 PM, Dongjoon Hyun  wrote:
> >
> > Congratulations
>


[jira] [Updated] (SPARK-15761) pyspark shell should load if PYSPARK_DRIVER_PYTHON is ipython and Python3

2016-06-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15761:
---
Assignee: Manoj Kumar

> pyspark shell should load if PYSPARK_DRIVER_PYTHON is ipython and Python3
> 
>
> Key: SPARK-15761
> URL: https://issues.apache.org/jira/browse/SPARK-15761
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Manoj Kumar
>Assignee: Manoj Kumar
>Priority: Minor
>
> My default python is ipython3 and it is odd that it fails with "IPython 
> requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON"






[jira] [Resolved] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier

2016-06-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15168.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12943
[https://github.com/apache/spark/pull/12943]

> Add missing params to Python's MultilayerPerceptronClassifier
> -
>
> Key: SPARK-15168
> URL: https://issues.apache.org/jira/browse/SPARK-15168
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
> Fix For: 2.0.0
>
>
> MultilayerPerceptronClassifier is missing step size, solver, and weights. Add 
> these params.






[jira] [Updated] (SPARK-15168) Add missing params to Python's MultilayerPerceptronClassifier

2016-06-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15168:
---
Assignee: holdenk

> Add missing params to Python's MultilayerPerceptronClassifier
> -
>
> Key: SPARK-15168
> URL: https://issues.apache.org/jira/browse/SPARK-15168
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> MultilayerPerceptronClassifier is missing step size, solver, and weights. Add 
> these params.






[jira] [Commented] (SPARK-15746) SchemaUtils.checkColumnType with VectorUDT prints instance details in error message

2016-06-03 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314627#comment-15314627
 ] 

Nick Pentreath commented on SPARK-15746:


I'd say hold off on working on it until we decide which approach to take, but 
once that is done sure.

> SchemaUtils.checkColumnType with VectorUDT prints instance details in error 
> message
> ---
>
> Key: SPARK-15746
> URL: https://issues.apache.org/jira/browse/SPARK-15746
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>    Reporter: Nick Pentreath
>Priority: Minor
>
> Currently, many feature transformers in {{ml}} use 
> {{SchemaUtils.checkColumnType(schema, ..., new VectorUDT)}} to check the 
> column type is a ({{ml.linalg}}) vector.
> The resulting error message contains "instance" info for the {{VectorUDT}}, 
> i.e. something like this:
> {code}
> java.lang.IllegalArgumentException: requirement failed: Column features must 
> be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually 
> StringType.
> {code}
> A solution would either be to amend {{SchemaUtils.checkColumnType}} to print 
> the error message using {{getClass.getName}}, or to create a {{private[spark] 
> case object VectorUDT extends VectorUDT}} for convenience, since it is used 
> so often (and incidentally this would make it easier to put {{VectorUDT}} 
> into lists of data types e.g. schema validation, UDAFs etc).






[jira] [Commented] (SPARK-14811) ML, Graph 2.0 QA: API: New Scala APIs, docs

2016-06-03 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314489#comment-15314489
 ] 

Nick Pentreath commented on SPARK-14811:


Yes, that does make sense. I will take a pass through and try to add {{Since}} 
where it has been missed, at least for stuff added in 2.0.0

> ML, Graph 2.0 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-14811
> URL: https://issues.apache.org/jira/browse/SPARK-14811
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Comment Edited] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-06-03 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314441#comment-15314441
 ] 

Nick Pentreath edited comment on SPARK-15447 at 6/3/16 5:22 PM:


Added a second tab to the sheet for testing DF-based API from 2.0.0-SNAPSHOT vs 
1.6.1 for SPARK-14891. Again, 2.0 is faster and no performance regression 
overall.

Test time seems a bit worse on average in 2.0 - but the 1.6.1 result has very 
large variance, so it's inconclusive whether performance is really worse.


was (Author: mlnick):
Added a second tab to the sheet for testing DF-based API from 2.0.0-SNAPSHOT vs 
1.6.1 for SPARK-14891. Again, 2.0 is faster and no performance regression.

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?






[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-06-03 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314441#comment-15314441
 ] 

Nick Pentreath commented on SPARK-15447:


Added a second tab to the sheet for testing DF-based API from 2.0.0-SNAPSHOT vs 
1.6.1 for SPARK-14891. Again, 2.0 is faster and no performance regression.

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?






[jira] [Updated] (SPARK-15746) SchemaUtils.checkColumnType with VectorUDT prints instance details in error message

2016-06-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15746:
---
Summary: SchemaUtils.checkColumnType with VectorUDT prints instance details 
in error message  (was: SchemaUtils.checkColumnType with VectorUDT prints 
instance details)

> SchemaUtils.checkColumnType with VectorUDT prints instance details in error 
> message
> ---
>
> Key: SPARK-15746
> URL: https://issues.apache.org/jira/browse/SPARK-15746
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>    Reporter: Nick Pentreath
>Priority: Minor
>
> Currently, many feature transformers in {{ml}} use 
> {{SchemaUtils.checkColumnType(schema, ..., new VectorUDT)}} to check the 
> column type is a ({{ml.linalg}}) vector.
> The resulting error message contains "instance" info for the {{VectorUDT}}, 
> i.e. something like this:
> {code}
> java.lang.IllegalArgumentException: requirement failed: Column features must 
> be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually 
> StringType.
> {code}
> A solution would either be to amend {{SchemaUtils.checkColumnType}} to print 
> the error message using {{getClass.getName}}, or to create a {{private[spark] 
> case object VectorUDT extends VectorUDT}} for convenience, since it is used 
> so often (and incidentally this would make it easier to put {{VectorUDT}} 
> into lists of data types e.g. schema validation, UDAFs etc).






[jira] [Created] (SPARK-15746) SchemaUtils.checkColumnType with VectorUDT prints instance details

2016-06-02 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-15746:
--

 Summary: SchemaUtils.checkColumnType with VectorUDT prints 
instance details
 Key: SPARK-15746
 URL: https://issues.apache.org/jira/browse/SPARK-15746
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Nick Pentreath
Priority: Minor


Currently, many feature transformers in {{ml}} use 
{{SchemaUtils.checkColumnType(schema, ..., new VectorUDT)}} to check the column 
type is a ({{ml.linalg}}) vector.

The resulting error message contains "instance" info for the {{VectorUDT}}, 
i.e. something like this:
{code}
java.lang.IllegalArgumentException: requirement failed: Column features must be 
of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually 
StringType.
{code}

A solution would either be to amend {{SchemaUtils.checkColumnType}} to print 
the error message using {{getClass.getName}}, or to create a {{private[spark] 
case object VectorUDT extends VectorUDT}} for convenience, since it is used so 
often (and incidentally this would make it easier to put {{VectorUDT}} into 
lists of data types e.g. schema validation, UDAFs etc).
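
A rough sketch of the first option (reporting the class name rather than the 
instance); the method shape is assumed from the description above, not the 
committed fix:

{code}
import org.apache.spark.sql.types.{DataType, StructType}

object SchemaUtilsSketch {
  // Prints "org.apache.spark.ml.linalg.VectorUDT" instead of
  // "org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7" in the failure message.
  def checkColumnType(schema: StructType, colName: String, dataType: DataType): Unit = {
    val actualDataType = schema(colName).dataType
    require(actualDataType.equals(dataType),
      s"Column $colName must be of type ${dataType.getClass.getName} " +
        s"but was actually ${actualDataType.getClass.getName}.")
  }
}
{code}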






[jira] [Resolved] (SPARK-15668) ml.feature: update check schema to avoid confusion when users use MLlib.vector as input type

2016-06-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15668.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13411
[https://github.com/apache/spark/pull/13411]

> ml.feature: update check schema to avoid confusion when users use MLlib.vector 
> as input type
> ---
>
> Key: SPARK-15668
> URL: https://issues.apache.org/jira/browse/SPARK-15668
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> As we use ml.vector to replace mllib.vector in ml, users who use mllib.vector 
> as the input column will get an error. Yet the error message is confusing:
> s"Input column ${$(inputCol)} must be a vector column")
> The input column is probably already an mllib vector. Update the message to 
> avoid the confusion.
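
A hedged sketch of what a clearer check could look like (hypothetical helper; 
it matches on class names so it does not depend on the package-private UDT types):

{code}
import org.apache.spark.sql.types.StructType

def validateVectorColumn(schema: StructType, inputCol: String): Unit = {
  val typeName = schema(inputCol).dataType.getClass.getName
  typeName match {
    case "org.apache.spark.ml.linalg.VectorUDT" => ()  // expected new-style vector
    case "org.apache.spark.mllib.linalg.VectorUDT" =>
      throw new IllegalArgumentException(
        s"Column $inputCol is an org.apache.spark.mllib.linalg.Vector; convert it " +
          "to ml.linalg.Vector (e.g. MLUtils.convertVectorColumnsToML) first.")
    case other =>
      throw new IllegalArgumentException(
        s"Column $inputCol must be a vector column but was of type $other.")
  }
}
{code}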






[jira] [Updated] (SPARK-15139) PySpark TreeEnsemble missing methods

2016-06-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15139:
---
Assignee: holdenk

> PySpark TreeEnsemble missing methods
> 
>
> Key: SPARK-15139
> URL: https://issues.apache.org/jira/browse/SPARK-15139
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.0.0
>
>
> TreeEnsemble class is missing some accessor methods compared to Scala API






Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Fair enough.

However, if you take a look at the deployment guide (
http://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies)
you will see that the generally advised approach is to package your app
dependencies into a fat JAR and submit (possibly using the --jars option
too). This also means you specify the Scala and other library versions in
your project pom.xml or sbt file, avoiding having to manually decide which
artefact to include on your classpath  :)
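
A build.sbt sketch of that approach (versions taken from this thread; the 
sbt-assembly plugin is assumed to be configured in project/plugins.sbt):

{code}
// Spark is "provided" so it stays out of the fat JAR; app dependencies get bundled.
name := "my-es-spark-app"
scalaVersion := "2.10.6"  // must match the Scala version of your Spark distribution

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"          % "1.6.1" % "provided",
  "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2"
)

// Build with `sbt assembly`, then submit the single assembly JAR:
//   spark-submit --class com.example.Main target/scala-2.10/my-es-spark-app-assembly-0.1.0.jar
{code}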

On Thu, 2 Jun 2016 at 16:06 Kevin Burton <bur...@spinn3r.com> wrote:

> Yeah.. thanks Nick. Figured that out since your last email... I deleted
> the 2.10 by accident but then put 2+2 together.
>
> Got it working now.
>
> Still sticking to my story that it's somewhat complicated to setup :)
>
> Kevin
>
> On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Which Scala version is Spark built against? I'd guess it's 2.10 since
>> you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.
>>
>>
>> On Thu, 2 Jun 2016 at 15:50 Kevin Burton <bur...@spinn3r.com> wrote:
>>
>>> Thanks.
>>>
>>> I'm trying to run it in a standalone cluster with an existing / large
>>> 100 node ES install.
>>>
>>> I'm using the standard 1.6.1 -2.6 distribution with
>>> elasticsearch-hadoop-2.3.2...
>>>
>>> I *think* I'm only supposed to use the
>>> elasticsearch-spark_2.11-2.3.2.jar with it...
>>>
>>> but now I get the following exception:
>>>
>>>
>>> java.lang.NoSuchMethodError:
>>> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>>> at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
>>> at
>>> org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
>>> at
>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
>>> at
>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
>>> at $iwC$$iwC$$iwC.<init>(<console>:61)
>>> at $iwC$$iwC.<init>(<console>:63)
>>> at $iwC.<init>(<console>:65)
>>> at <init>(<console>:67)
>>> at .<init>(<console>:71)
>>> at .<clinit>(<console>)
>>> at .<init>(<console>:7)
>>> at .<clinit>(<console>)
>>> at $print(<console>)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>> at
>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>>> at
>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>>> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>>> at
>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>>> at
>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>>> at
>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>>> at
>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>>> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>>> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>>> at org.apache.spark.repl.SparkILoop.org
>>> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>>> at
>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>>> at
>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>>> at
>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>>> at
>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Which Scala version is Spark built against? I'd guess it's 2.10 since
you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.


On Thu, 2 Jun 2016 at 15:50 Kevin Burton <bur...@spinn3r.com> wrote:

> Thanks.
>
> I'm trying to run it in a standalone cluster with an existing / large 100
> node ES install.
>
> I'm using the standard 1.6.1 -2.6 distribution with
> elasticsearch-hadoop-2.3.2...
>
> I *think* I'm only supposed to use the
> elasticsearch-spark_2.11-2.3.2.jar with it...
>
> but now I get the following exception:
>
>
> java.lang.NoSuchMethodError:
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
> at
> org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
> at $iwC$$iwC$$iwC.<init>(<console>:61)
> at $iwC$$iwC.<init>(<console>:63)
> at $iwC.<init>(<console>:65)
> at <init>(<console>:67)
> at .<init>(<console>:71)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> On Thu, Jun 2, 2016 at 3:45 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Hey there
>>
>> When I used es-hadoop, I just pulled in the dependency into my pom.xml,
>> with spark as a "provided" dependency, and built a fat jar with assembly.
>>
>> Then with spark-submit use the --jars option to include your assembly jar
>> (IIRC I sometimes also needed to use --driver-classpath too, but perhaps
>> not with recent Spark versions).
>>
>>
>>
>> On Thu, 2 Jun 2016 at 15:

[jira] [Resolved] (SPARK-15139) PySpark TreeEnsemble missing methods

2016-06-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15139.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12919
[https://github.com/apache/spark/pull/12919]

> PySpark TreeEnsemble missing methods
> 
>
> Key: SPARK-15139
> URL: https://issues.apache.org/jira/browse/SPARK-15139
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
> Fix For: 2.0.0
>
>
> TreeEnsemble class is missing some accessor methods compared to Scala API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier

2016-06-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15092.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12919
[https://github.com/apache/spark/pull/12919]

> toDebugString missing from ML DecisionTreeClassifier
> 
>
> Key: SPARK-15092
> URL: https://issues.apache.org/jira/browse/SPARK-15092
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
> Environment: HDP 2.3.4, Red Hat 6.7
>Reporter: Ivan SPM
>Assignee: holdenk
>Priority: Minor
>  Labels: features
> Fix For: 2.0.0
>
>
> The attribute toDebugString is missing from the DecisionTreeClassifier and 
> DecisionTreeClassifierModel from ML. The attribute exists on the MLLib 
> DecisionTree model. 
> There's no way to check or print the model tree structure from the ML.
> The basic code for it is this:
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import VectorAssembler, StringIndexer
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features')
> pipe = Pipeline(stages=[target_index, assembler, cl])
> model = pipe.fit(df_train)
> # Prediction and model evaluation
> predictions = model.transform(df_test)
> mc_evaluator = MulticlassClassificationEvaluator(
> labelCol="target_idx", predictionCol="prediction", metricName="precision")
> accuracy = mc_evaluator.evaluate(predictions)
> print("Test Error = {}".format(1.0 - accuracy))
> now it would be great to be able to do what is being done on the MLLib model:
> print model.toDebugString(),  # it already has newline
> DecisionTreeModel classifier of depth 1 with 3 nodes
>   If (feature 0 <= 0.0)
>Predict: 0.0
>   Else (feature 0 > 0.0)
>Predict: 1.0
> but there's no toDebugString attribute either to the pipeline model or the 
> DecisionTreeClassifier model:
> cl.toDebugString()
> Attribute Error
> https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html
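
For reference, the Scala API already exposes this; a minimal sketch of pulling the tree description out of a fitted pipeline (assumes the fitted {{model}} PipelineModel from the snippet above):

{code}
import org.apache.spark.ml.classification.DecisionTreeClassificationModel

val tree = model.stages.last.asInstanceOf[DecisionTreeClassificationModel]
println(tree.toDebugString)
{code}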



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Hey there

When I used es-hadoop, I just pulled in the dependency into my pom.xml,
with spark as a "provided" dependency, and built a fat jar with assembly.

Then with spark-submit use the --jars option to include your assembly jar
(IIRC I sometimes also needed to use --driver-classpath too, but perhaps
not with recent Spark versions).
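
In sbt terms the same setup looks roughly like this (a sketch only; assumes
the sbt-assembly plugin is enabled in project/plugins.sbt, and the project
name is illustrative):

// build.sbt: Spark is "provided" by the cluster, the connector is bundled
name := "es-spark-job"

scalaVersion := "2.10.6" // must match the Scala version of your Spark build

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  // only the Spark connector; also adding elasticsearch-hadoop triggers
  // the "Multiple ES-Hadoop versions detected" error seen below
  "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.3.2"
)

Running sbt assembly then produces a single fat jar to pass to spark-submit
via --jars.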



On Thu, 2 Jun 2016 at 15:34 Kevin Burton  wrote:

> I'm trying to get Spark 1.6.1 to work with elasticsearch-hadoop 2.3.2...
> needless to say it's not super easy.
>
> I wish there was an easier way to get this stuff to work.. Last time I
> tried to use spark more I was having similar problems with classpath setup
> and Cassandra.
>
> Seems a huge opportunity to make this easier for new developers.  This
> stuff isn't rocket science but it can (needlessly) waste a ton of time.
>
> ... anyway... I have since figured out I have to pick out *specific*
> jars from the elasticsearch-hadoop distribution and use those.
>
> Right now I'm using :
>
>
> SPARK_CLASSPATH=/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar:/usr/share/apache-spark/lib/*
>
> ... but I"m getting:
>
> java.lang.NoClassDefFoundError: Could not initialize class
> org.elasticsearch.hadoop.util.Version
> at
> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
> at
> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
> at
> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> ... but I think it's caused by this:
>
> 16/06/03 00:26:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> localhost): java.lang.Error: Multiple ES-Hadoop versions detected in the
> classpath; please use only one
> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
>
> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
>
> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar
>
> at org.elasticsearch.hadoop.util.Version.<clinit>(Version.java:73)
> at
> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
> at
> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
> at
> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> .. still tracking this down but was wondering if there is something obvious
> I'm doing wrong.  I'm going to take out elasticsearch-hadoop-2.3.2.jar and
> try again.
>
> Lots of trial and error here :-/
>
> Kevin
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


[jira] [Comment Edited] (SPARK-14811) ML, Graph 2.0 QA: API: New Scala APIs, docs

2016-06-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313208#comment-15313208
 ] 

Nick Pentreath edited comment on SPARK-14811 at 6/2/16 10:31 PM:
-

[~josephkb] [~yanboliang] [~srowen] Question on this - we seem to be 
inconsistent with the {{@Since}} annotations on param setters. Generally there 
are none on the getters in shared traits or the class itself. But some classes 
(e.g. {{ALS}}) have put in full coverage for the param setter methods (e.g. 
{{setXXX}}). Do we want to try to do this across the board? Or do it for 
{{2.0.0}}? In which case we missed a few (e.g. {{setRelativeError}} in 
{{QuantileDiscretizer}}).
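
For context, the setter coverage in question follows the standard ML setter pattern (a sketch of the existing convention, not a proposed change):

{code}
@Since("2.0.0")
def setRelativeError(value: Double): this.type = set(relativeError, value)
{code}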


was (Author: mlnick):
[~josephkb] [~yanboliang] [~srowen] Question on this - we seem to be 
inconsistent with the {{@Since}} annotations on param setters. Generally there 
are none on the getters in shared traits or the class itself. But some classes 
(e.g. {{ALS}}) have put in full coverage for the param setter methods (e.g. 
{{setXXX}}). Do we want to try to do this across the board? Or do it for 
{{2.0.0}}? In which case we missed a few (e.g. {{setRelativeError}} in 
{{QuantileDiscretizer}}).

> ML, Graph 2.0 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-14811
> URL: https://issues.apache.org/jira/browse/SPARK-14811
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14811) ML, Graph 2.0 QA: API: New Scala APIs, docs

2016-06-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313208#comment-15313208
 ] 

Nick Pentreath edited comment on SPARK-14811 at 6/2/16 10:31 PM:
-

[~josephkb] [~yanboliang] [~srowen] Question on this - we seem to be 
inconsistent with the {{@Since}} annotations on param setters. Generally there 
are none on the getters in shared traits or the class itself. But some classes 
(e.g. {{ALS}}) have put in full coverage for the param setter methods (e.g. 
{{setXXX}}). Do we want to try to do this across the board? Or do it for 
{{2.0.0}}? In which case we missed a few (e.g. {{setRelativeError}} in 
{{QuantileDiscretizer}}).


was (Author: mlnick):
Question on this - we seem to be inconsistent with the {{@Since}} annotations 
on param setters. Generally there are none on the getters in shared traits or 
the class itself. But some classes (e.g. {{ALS}}) have put in full coverage for 
the param setter methods (e.g. {{setXXX}}). Do we want to try to do this across 
the board? Or do it for {{2.0.0}}? In which case we missed a few (e.g. 
{{setRelativeError}} in {{QuantileDiscretizer}}).

> ML, Graph 2.0 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-14811
> URL: https://issues.apache.org/jira/browse/SPARK-14811
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14811) ML, Graph 2.0 QA: API: New Scala APIs, docs

2016-06-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313208#comment-15313208
 ] 

Nick Pentreath commented on SPARK-14811:


Question on this - we seem to be inconsistent with the {{@Since}} annotations 
on param setters. Generally there are none on the getters in shared traits or 
the class itself. But some classes (e.g. {{ALS}}) have put in full coverage for 
the param setter methods (e.g. {{setXXX}}). Do we want to try to do this across 
the board? Or do it for {{2.0.0}}? In which case we missed a few (e.g. 
{{setRelativeError}} in {{QuantileDiscretizer}}).

> ML, Graph 2.0 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-14811
> URL: https://issues.apache.org/jira/browse/SPARK-14811
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15668) ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type

2016-06-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15668:
---
Assignee: yuhao yang

> ml.feature: update check schema to avoid confusion when user use MLlib.vector 
> as input type
> ---
>
> Key: SPARK-15668
> URL: https://issues.apache.org/jira/browse/SPARK-15668
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>
> As ml.Vector replaces mllib.Vector in ml, users who pass an mllib.Vector
> input column will get an error. Yet the error message is confusing:
> s"Input column ${$(inputCol)} must be a vector column")
> The input column probably already is a vector, just the old mllib kind.
> Update the message to avoid the confusion.
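
A sketch of the clearer check being asked for (illustrative wording, not the merged patch; {{VectorUDT}} here is ML's internal SQL type for the new vectors):

{code}
require(schema($(inputCol)).dataType.isInstanceOf[VectorUDT],
  s"Input column ${$(inputCol)} must be a new ml.linalg.Vector column; " +
  "mllib.linalg.Vector input can be converted with " +
  "MLUtils.convertVectorColumnsToML.")
{code}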



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala

2016-06-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15164:
---
Assignee: holdenk

> Mark classification algorithms as experimental where marked so in scala
> ---
>
> Key: SPARK-15164
> URL: https://issues.apache.org/jira/browse/SPARK-15164
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15162) Update PySpark LogisticRegression threshold PyDoc to be as complete as Scaladoc

2016-06-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15162:
---
Assignee: holdenk

> Update PySpark LogisticRegression threshold PyDoc to be as complete as 
> Scaladoc
> ---
>
> Key: SPARK-15162
> URL: https://issues.apache.org/jira/browse/SPARK-15162
> Project: Spark
>  Issue Type: Improvement
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> The PyDoc for setting and getting the threshold in logistic regression 
> doesn't have the same level of detail as the Scaladoc does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes

2016-06-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293316#comment-15293316
 ] 

Nick Pentreath edited comment on SPARK-14810 at 6/1/16 5:56 PM:


List of changes since {{1.6.0}} audited - these are "false positives" due to 
being private, @Experimental, DeveloperAPI, etc:
* SPARK-13686 - Add a constructor parameter `regParam` to 
(Streaming)LinearRegressionWithSGD
* SPARK-13664 - Replace HadoopFsRelation with FileFormat
* SPARK-11622 - Make LibSVMRelation extends HadoopFsRelation and Add 
LibSVMOutputWriter
* SPARK-13920 - MIMA checks should apply to @Experimental and @DeveloperAPI APIs
* SPARK-11011 - UserDefinedType serialization should be strongly typed
* SPARK-13817 - Re-enable MiMA and removes object DataFrame
* SPARK-13927 - add row/column iterator to local matrices - (add methods to 
sealed trait)
* SPARK-13948 - MiMa Check should catch if the visibility change to `private` - 
(DataFrame -> Dataset)
* SPARK-11262 - Unit test for gradient, loss layers, memory management - 
(private class)
* SPARK-13430 - moved featureCol from LinearRegressionModelSummary to 
LinearRegressionSummary - (private class)
* SPARK-13048 - keepLastCheckpoint option for LDA EM optimizer - (private class)
* SPARK-14734 - Add conversions between mllib and ml Vector, Matrix types - 
(private methods added)
* SPARK-14861 - Replace internal usages of SQLContext with SparkSession - 
(private class)

Binary incompatible changes:
* SPARK-14814 - Fix the java compatibility issue for the output of 
{{spark.mllib.tree.model.DecisionTreeModel.predict}} method.
* SPARK-14089 - Remove methods that have been deprecated since 1.1, 1.2, 1.3, 
1.4, and 1.5 
* SPARK-14952 - Remove methods deprecated in 1.6
* DataFrame -> Dataset changes for Java (this of course applies for all of 
Spark SQL)


was (Author: mlnick):
List of changes since {{1.6.0}} audited - these are "false positives" due to 
being private, @Experimental, DeveloperAPI, etc:
* SPARK-13686 - Add a constructor parameter `regParam` to 
(Streaming)LinearRegressionWithSGD
* SPARK-13664 - Replace HadoopFsRelation with FileFormat
* SPARK-11622 - Make LibSVMRelation extends HadoopFsRelation and Add 
LibSVMOutputWriter
* SPARK-13920 - MIMA checks should apply to @Experimental and @DeveloperAPI APIs
* SPARK-11011 - UserDefinedType serialization should be strongly typed
* SPARK-13817 - Re-enable MiMA and removes object DataFrame
* SPARK-13927 - add row/column iterator to local matrices - (add methods to 
sealed trait)
* SPARK-13948 - MiMa Check should catch if the visibility change to `private` - 
(DataFrame -> Dataset)
* SPARK-11262 - Unit test for gradient, loss layers, memory management - 
(private class)
* SPARK-13430 - moved featureCol from LinearRegressionModelSummary to 
LinearRegressionSummary - (private class)
* SPARK-13048 - keepLastCheckpoint option for LDA EM optimizer - (private class)
* SPARK-14734 - Add conversions between mllib and ml Vector, Matrix types - 
(private methods added)
* SPARK-14861 - Replace internal usages of SQLContext with SparkSession - 
(private class)

Binary incompatible changes:
* SPARK-14089 - Remove methods that have been deprecated since 1.1, 1.2, 1.3, 
1.4, and 1.5 
* SPARK-14952 - Remove methods deprecated in 1.6
* DataFrame -> Dataset changes for Java (this of course applies for all of 
Spark SQL)

> ML, Graph 2.0 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-14810
> URL: https://issues.apache.org/jira/browse/SPARK-14810
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>    Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15587) ML 2.0 QA: Scala APIs audit for feature

2016-06-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15587:
---
Assignee: Yanbo Liang

> ML 2.0 QA: Scala APIs audit for feature
> ---
>
> Key: SPARK-15587
> URL: https://issues.apache.org/jira/browse/SPARK-15587
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> See containing JIRA for details: [SPARK-14811]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15587) ML 2.0 QA: Scala APIs audit for feature

2016-06-01 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15587.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13410
[https://github.com/apache/spark/pull/13410]

> ML 2.0 QA: Scala APIs audit for feature
> ---
>
> Key: SPARK-15587
> URL: https://issues.apache.org/jira/browse/SPARK-15587
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Reporter: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> See containing JIRA for details: [SPARK-14811]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-05-31 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15308797#comment-15308797
 ] 

Nick Pentreath commented on SPARK-15447:


Created a Google sheet with initial results: 
https://docs.google.com/spreadsheets/d/1iX5LisfXcZSTCHp8VPoo5z-eCO85A5VsZDtZ5e475ks/edit?usp=sharing

So far for SPARK-6717 I've just used {{spark-perf}} to compare the RDD-based 
APIs (as the checkpointing only impacts the RDD-based {{train}} method). From 
these results there are no red flags, and 2.0 is actually faster in general 
relative to 1.6. Checkpointing does add a minor overhead (but this overhead is 
consistent across versions and again better in 2.0).

There is something a little weird about the 1.6 results for the 10m ratings 
case, but I'm not sure what's going on there - I've rerun a few times with the 
same result.

Also, I haven't managed to get to 1b ratings yet due to cluster size; I will 
keep working on it.
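
For anyone reproducing these runs, the code path exercised is the RDD-based 
{{train}} with checkpointing enabled (a minimal sketch; the checkpoint 
directory and an existing {{RDD[Rating]}} named {{ratings}} are assumptions):

{code}
import org.apache.spark.mllib.recommendation.ALS

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints") // without this, checkpointing is skipped
val model = ALS.train(ratings, rank = 10, iterations = 20, lambda = 0.01)
{code}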

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15575) Remove breeze from dependencies?

2016-05-27 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304546#comment-15304546
 ] 

Nick Pentreath commented on SPARK-15575:


What specifically are the "performance issues" with Breeze as it stands 
currently?

> Remove breeze from dependencies?
> 
>
> Key: SPARK-15575
> URL: https://issues.apache.org/jira/browse/SPARK-15575
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for discussing whether we should remove Breeze from the 
> dependencies of MLlib.  The main issues with Breeze are Scala 2.12 support 
> and performance issues.
> There are a few paths:
> # Keep dependency.  This could be OK, especially if the Scala version issues 
> are fixed within Breeze.
> # Remove dependency
> ## Implement our own linear algebra operators as needed
> ## Design a way to build Spark using custom linalg libraries of the user's 
> choice.  E.g., you could build MLlib using Breeze, or any other library 
> supporting the required operations.  This might require significant work.  
> See [SPARK-6442] for related discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15492) Binarization scala example copy & paste to spark-shell error

2016-05-26 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15492.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13266
[https://github.com/apache/spark/pull/13266]

> Binarization scala example copy & paste to spark-shell error
> 
>
> Key: SPARK-15492
> URL: https://issues.apache.org/jira/browse/SPARK-15492
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.0.0
>
>
> The Binarization Scala example declares val dataFrame: DataFrame = 
> spark.createDataFrame(data).toDF("label", "feature"), which can't be pasted 
> into the spark-shell as DataFrame is not imported. Compared with other 
> examples, this explicit type is not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15500) Remove defaults in storage level param doc in ALS

2016-05-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15500.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13277
[https://github.com/apache/spark/pull/13277]

> Remove defaults in storage level param doc in ALS
> -
>
> Key: SPARK-15500
> URL: https://issues.apache.org/jira/browse/SPARK-15500
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.0.0
>
>
> Pending a decision on approach for SPARK-15130, I'm removing the "Default: 
> MEMORY_AND_DISK" part of the built-in {{Param}} doc for ALS storage level 
> params (both Scala and Python). This fixes up the output of {{explainParams}} 
> so that defaults are not shown twice.
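
The duplication was easy to see from the shell (sketch; output abridged):

{code}
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
println(als.explainParams())
// before this fix the storage level params printed the default twice, e.g.:
//   intermediateStorageLevel: StorageLevel for intermediate datasets. Default:
//   MEMORY_AND_DISK. ... (default: MEMORY_AND_DISK)
{code}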



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Cannot build master with sbt

2016-05-25 Thread Nick Pentreath
I've filed https://issues.apache.org/jira/browse/SPARK-15525

For now, you would have to check out sbt-antlr4 at
https://github.com/ihji/sbt-antlr4/commit/23eab68b392681a7a09f6766850785afe8dfa53d
(since
I don't see any branches or tags in the github repo for different
versions), and sbt publishLocal to get the dependency locally.

On Wed, 25 May 2016 at 15:13 Yiannis Gkoufas  wrote:

> Hi there,
>
> I have cloned the latest version from github.
> I am using scala 2.10.x
> When I invoke
>
> build/sbt clean package
>
> I get the exceptions because for the sbt-antlr library:
>
> [warn] module not found: com.simplytyped#sbt-antlr4;0.7.10
> [warn]  typesafe-ivy-releases: tried
> [warn]
> https://repo.typesafe.com/typesafe/ivy-releases/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
> [warn]  sbt-plugin-releases: tried
> [warn]
> https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
> [warn]  local: tried
> [warn]
> /home/johngouf/.ivy2/local/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
> [warn]  public: tried
> [warn]
> https://repo1.maven.org/maven2/com/simplytyped/sbt-antlr4_2.10_0.13/0.7.10/sbt-antlr4-0.7.10.pom
> [warn]  simplytyped: tried
> [warn]
> http://simplytyped.github.io/repo/releases/com/simplytyped/sbt-antlr4_2.10_0.13/0.7.10/sbt-antlr4-0.7.10.pom
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [warn] ::
> [warn] ::  UNRESOLVED DEPENDENCIES ::
> [warn] ::
> [warn] :: com.simplytyped#sbt-antlr4;0.7.10: not found
> [warn] ::
> [warn]
> [warn] Note: Some unresolved dependencies have extra attributes.
> Check that these dependencies exist with the requested attributes.
> [warn] com.simplytyped:sbt-antlr4:0.7.10 (scalaVersion=2.10,
> sbtVersion=0.13)
> [warn]
> [warn] Note: Unresolved dependencies path:
> [warn] com.simplytyped:sbt-antlr4:0.7.10 (scalaVersion=2.10,
> sbtVersion=0.13) (/home/johngouf/IOT/spark/project/plugins.sbt#L26-27)
> [warn]   +- plugins:plugins:0.1-SNAPSHOT (scalaVersion=2.10,
> sbtVersion=0.13)
> sbt.ResolveException: unresolved dependency:
> com.simplytyped#sbt-antlr4;0.7.10: not found
>
> Any idea what is the problem here?
>
> Thanks!
>


[jira] [Created] (SPARK-15525) Clean sbt build fails to resolve sbt-antlr4 plugin

2016-05-25 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-15525:
--

 Summary: Clean sbt build fails to resolve sbt-antlr4 plugin
 Key: SPARK-15525
 URL: https://issues.apache.org/jira/browse/SPARK-15525
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
Reporter: Nick Pentreath


The sbt-antlr4 plugin repository is no longer available at its previous 
location, and has been moved to bintray (refer 
[here|https://github.com/ihji/sbt-antlr4/commit/77e8f74457e17adad25293720b84ea706deb27f7]).

This causes a from-scratch sbt build to fail to resolve the plugin, e.g. 

{code}
[info] Resolving com.simplytyped#sbt-antlr4;0.7.10 ...
[warn]  module not found: com.simplytyped#sbt-antlr4;0.7.10
[warn]  typesafe-ivy-releases: tried
[warn]   
https://repo.typesafe.com/typesafe/ivy-releases/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
[warn]  sbt-plugin-releases: tried
[warn]   
https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
[warn]  local: tried
[warn]   
/home/npentreath/.ivy2/local/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
[warn]  public: tried
[warn]   
https://repo1.maven.org/maven2/com/simplytyped/sbt-antlr4_2.10_0.13/0.7.10/sbt-antlr4-0.7.10.pom
[warn]  simplytyped: tried
[warn]   
http://simplytyped.github.io/repo/releases/com/simplytyped/sbt-antlr4_2.10_0.13/0.7.10/sbt-antlr4-0.7.10.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: com.simplytyped#sbt-antlr4;0.7.10: not found
[warn]  ::
[warn]
[warn]  Note: Some unresolved dependencies have extra attributes.  Check that 
these dependencies exist with the requested attributes.
[warn]  com.simplytyped:sbt-antlr4:0.7.10 (scalaVersion=2.10, 
sbtVersion=0.13)
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]  com.simplytyped:sbt-antlr4:0.7.10 (scalaVersion=2.10, 
sbtVersion=0.13) (/home/npentreath/spark/project/plugins.sbt#L26-27)
[warn]+- plugins:plugins:0.1-SNAPSHOT (scalaVersion=2.10, 
sbtVersion=0.13)
sbt.ResolveException: unresolved dependency: com.simplytyped#sbt-antlr4;0.7.10: 
not found
{code}

Unfortunately, it also appears the older artefacts have been removed from the 
github.io repo (refer 
[here|https://github.com/simplytyped/simplytyped.github.io/commit/986efba2e3ec75fa4313275496868cbe9fcfc95b])
 but not added to bintray, so changing the resolver doesn't help:

{code}
[info] Resolving com.simplytyped#sbt-antlr4;0.7.10 ...
[warn]  module not found: com.simplytyped#sbt-antlr4;0.7.10
[warn]  typesafe-ivy-releases: tried
[warn]   
https://repo.typesafe.com/typesafe/ivy-releases/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
[warn]  sbt-plugin-releases: tried
[warn]   
https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
[warn]  local: tried
[warn]   
/home/npentreath/.ivy2/local/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
[warn]  public: tried
[warn]   
https://repo1.maven.org/maven2/com/simplytyped/sbt-antlr4_2.10_0.13/0.7.10/sbt-antlr4-0.7.10.pom
[warn]  bintray-simplytyped: tried
[warn]   
http://dl.bintray.com/simplytyped/sbt-plugins/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: com.simplytyped#sbt-antlr4;0.7.10: not found
[warn]  ::
[warn]
[warn]  Note: Some unresolved dependencies have extra attributes.  Check that 
these dependencies exist with the requested attributes.
[warn]  com.simplytyped:sbt-antlr4:0.7.10 (scalaVersion=2.10, 
sbtVersion=0.13)
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]  com.simplytyped:sbt-antlr4:0.7.10 (scalaVersion=2.10, 
sbtVersion=0.13) (/home/npentreath/spark/project/plugins.sbt#L26-27)
[warn]+- plugins:plugins:0.1-SNAPSHOT (scalaVersion=2.10, 
sbtVersion=0.13)
sbt.ResolveException: unresolved dependency: com.simplytyped#sbt-antlr4;0.7.10: 
not found
{code}

I've 
[commented|https://github.com/ihji/sbt-antlr4/commit/77e8f74457e17adad25293720b84ea706deb27f7#commitcomment-17611397]
 on the relevant commit to ask the author to publish the older artefacts to 
bintray, in which case we can update the resolver in {{plugins.sbt}}.
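
For reference, once the artefacts are published the fix would be along these lines in {{project/plugins.sbt}} (sketch):

{code}
// point ivy at the bintray location for the plugin
resolvers += Resolver.url("bintray-simplytyped",
  url("http://dl.bintray.com/simplytyped/sbt-plugins/"))(Resolver.ivyStylePatterns)

addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.7.10")
{code}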



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail

[jira] [Resolved] (SPARK-15504) Could MatrixFactorizationModel support recommend for some users only ?

2016-05-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15504.

Resolution: Duplicate

Please see SPARK-10802, which already exists.

For the old RDD-based API, it is unlikely that this will be supported directly. 
However, SPARK-13857 will allow this as part of the DataFrame-based API.

> Could MatrixFactorizationModel support recommend for some users only ?
> --
>
> Key: SPARK-15504
> URL: https://issues.apache.org/jira/browse/SPARK-15504
> Project: Spark
>  Issue Type: Wish
>  Components: MLlib
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.1
>Reporter: Hai
>Priority: Trivial
>  Labels: features, performance
>
> I have used the ALS algorithm to train a model, and I want to recommend 
> products for some users only, not all users in the model. The ways I can use 
> the MatrixFactorizationModel API are either recommendProducts(user: Int, 
> num: Int): Array[Rating], which means recommending products one user at a 
> time on the Spark driver, or recommendProductsForUsers(num: Int): RDD[(Int, 
> Array[Rating])], which runs on the cluster but wastes time computing 
> recommendations for users I don't want to recommend products for. So I think 
> an API such as recommendProductsForUsers(users: RDD[Int], num: Int): 
> RDD[(Int, Array[Rating])] would best match my case.
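
Until such an API exists, one cluster-side workaround is to join the full recommendations against the wanted users (a sketch; it still pays the cost of computing recommendations for everyone, which is the point of the request):

{code}
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def recommendForSome(model: MatrixFactorizationModel,
                     users: RDD[Int], num: Int): RDD[(Int, Array[Rating])] = {
  val wanted = users.map((_, ()))      // pair RDD keyed by user id
  model.recommendProductsForUsers(num) // recommendations for all users
    .join(wanted)                      // keep only the wanted users
    .mapValues { case (recs, _) => recs }
}
{code}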



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15501) ML 2.0 QA: Scala APIs audit for recommendation

2016-05-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15501:
---
Component/s: ML
 Documentation

> ML 2.0 QA: Scala APIs audit for recommendation
> --
>
> Key: SPARK-15501
> URL: https://issues.apache.org/jira/browse/SPARK-15501
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15500) Remove defaults in storage level param doc in ALS

2016-05-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15500:
---
Component/s: PySpark
 ML
 Documentation

> Remove defaults in storage level param doc in ALS
> -
>
> Key: SPARK-15500
> URL: https://issues.apache.org/jira/browse/SPARK-15500
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
>
> Pending a decision on approach for SPARK-15130, I'm removing the "Default: 
> MEMORY_AND_DISK" part of the built-in {{Param}} doc for ALS storage level 
> params (both Scala and Python). This fixes up the output of {{explainParams}} 
> so that defaults are not shown twice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15502) Add note in ML ALS docs that user / item column only supports Int

2016-05-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-15502:
--

Assignee: Nick Pentreath

> Add note in ML ALS docs that user / item column only supports Int
> -
>
> Key: SPARK-15502
> URL: https://issues.apache.org/jira/browse/SPARK-15502
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
>Priority: Minor
>
> Currently, ALS only supports {{Integer}} user/item ids. SPARK-14891 added 
> actual validation for this, but does allow any numeric type for these columns 
> so long as the ids are within {{Integer}} value range. Add a note to the user 
> guide to this effect.
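
As a user-side illustration of the constraint (a sketch; the column names and the {{raw}} DataFrame are assumptions):

{code}
import org.apache.spark.ml.recommendation.ALS

// ids stored as e.g. Long must fit in Int range and be cast before fitting
val ratings = raw.selectExpr(
  "CAST(userId AS INT) AS userId",
  "CAST(itemId AS INT) AS itemId",
  "rating")
val als = new ALS().setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
{code}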



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15502) Add note in ML ALS docs that user / item column only supports Int

2016-05-24 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-15502:
--

 Summary: Add note in ML ALS docs that user / item column only 
supports Int
 Key: SPARK-15502
 URL: https://issues.apache.org/jira/browse/SPARK-15502
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML, PySpark
Reporter: Nick Pentreath
Priority: Minor


Currently, ALS only supports {{Integer}} user/item ids. SPARK-14891 added 
actual validation for this, but does allow any numeric type for these columns 
so long as the ids are within {{Integer}} value range. Add a note to the user 
guide to this effect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15501) ML 2.0 QA: Scala APIs audit for recommendation

2016-05-24 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-15501:
--

 Summary: ML 2.0 QA: Scala APIs audit for recommendation
 Key: SPARK-15501
 URL: https://issues.apache.org/jira/browse/SPARK-15501
 Project: Spark
  Issue Type: Improvement
Reporter: Nick Pentreath
Assignee: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15500) Remove defaults in storage level param doc in ALS

2016-05-24 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-15500:
--

 Summary: Remove defaults in storage level param doc in ALS
 Key: SPARK-15500
 URL: https://issues.apache.org/jira/browse/SPARK-15500
 Project: Spark
  Issue Type: Documentation
Reporter: Nick Pentreath
Assignee: Nick Pentreath
Priority: Minor


Pending a decision on approach for SPARK-15130, I'm removing the "Default: 
MEMORY_AND_DISK" part of the built-in {{Param}} doc for ALS storage level 
params (both Scala and Python). This fixes up the output of {{explainParams}} 
so that defaults are not shown twice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc

2016-05-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15254:
---
Description: The ML pipeline Cross Validation Scaladoc & PyDoc is very 
sparse - we should fill this out with a more concrete description.  (was: The 
ML pipeline Cross Validation Scaladoc & PyDoc is very spares - we should fill 
this out with a more concrete description.)

> Improve ML pipeline Cross Validation Scaladoc & PyDoc
> -
>
> Key: SPARK-15254
> URL: https://issues.apache.org/jira/browse/SPARK-15254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Priority: Minor
>
> The ML pipeline Cross Validation Scaladoc & PyDoc is very sparse - we should 
> fill this out with a more concrete description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc

2016-05-24 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297871#comment-15297871
 ] 

Nick Pentreath commented on SPARK-15254:


Please go ahead!

> Improve ML pipeline Cross Validation Scaladoc & PyDoc
> -
>
> Key: SPARK-15254
> URL: https://issues.apache.org/jira/browse/SPARK-15254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Priority: Minor
>
> The ML pipeline Cross Validation Scaladoc & PyDoc is very sparse - we should 
> fill this out with a more concrete description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15442.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13228
[https://github.com/apache/spark/pull/13228]

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15492) Binarization scala example copy & paste to spark-shell error

2016-05-24 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15492:
---
Assignee: Miao Wang

> Binarization scala example copy & paste to spark-shell error
> 
>
> Key: SPARK-15492
> URL: https://issues.apache.org/jira/browse/SPARK-15492
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Minor
>
> The Binarization Scala example declares val dataFrame: DataFrame = 
> spark.createDataFrame(data).toDF("label", "feature"), which can't be pasted 
> into the spark-shell as DataFrame is not imported. Compared with other 
> examples, this explicit type is not required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] PredictionIO incubation proposal

2016-05-24 Thread Nick Pentreath
Hi everyone

I just want to make it clear that my suggestion was in no way some sort of
attempt to hijack the project or push a corporate agenda.

For me personally, I have not been directly involved in PredictionIO, that
is true. I have however spent the past 3 years prior to joining IBM
building from scratch, single-handedly, a commercial SaaS product that is
at its core fundamentally very similar in terms of architecture and design
(though admittedly less general as it was focused on the recommendation
space). Also, I've had some chats with Simon over the past couple of years
and also recently specifically about this proposal, hence my interest.

I can't speak for Mike directly, but certainly I see a potential SystemML
integration in the future as something interesting for both projects (I'm
not suggesting it should be worked on immediately as a primary focus).

In any case, I see the proposal is in voting stage and it appears the vote
will easily pass, so all the best with Apache PredictionIO (incubating)!
We'll look at getting involved where it makes sense and where we can add value.

Nick

On Fri, 20 May 2016 at 18:14 Pat Ferrel  wrote:

> +1 for the current committer list, but please, anyone interested get
> familiar, we will need more help soon!
>
> Also I’d like to bring up the template gallery again. Plugins may be
> problematic in other projects but pio does nothing of interest *without* a
> template. There are some examples in the core repo but...
>
> Questions:
> 1) can the gallery be transferred? This is just a listing of templates
> that may be maintained by external people and is the source from which they
> are downloaded by default.
> 2) which templates are proposed for the transfer? Didn’t see that spelled
> out beyond the included examples.
>
> On May 20, 2016, at 8:53 AM, Suneel Marthi  wrote:
>
> The current list is good to go and includes all (both present and former)
> PIO folks.
> I am fine with going for Voting with the present list.
>
> +1
>
> On Fri, May 20, 2016 at 11:47 AM, Andrew Purtell 
> wrote:
>
> > The current list of initial committers was that provided me by the
> > PredictionIO folks so I have every reason to believe they all have a
> stake
> > at entering incubation.
> >
> > It's totally fine with me if we stick to that list. I am just trying to
> > facilitate the fairest process possible.
> >
> >
> > On Friday, May 20, 2016, Roman Shaposhnik  wrote:
> >
> >> On Thu, May 19, 2016 at 9:16 PM, Suneel Marthi wrote:
> >>> I definitely have concerns about too many folks becoming initial
> >> committers
> >>> and bringing their own corporate agendas to this project.
> >>>
> >>> I suggest that first we vote PIO into incubator then bring in those
> > less
> >>> experienced with the project. We have a good start with people who have
> >>> worked on the project from several orgs. Let us get organized first and
> >>> then bring in new people.
> >>
> >> I think this is a reasonable concern. Andrew, any chance you can look
> > over
> >> the names of initial committers and let us know who has had a stake in
> > the
> >> project before entering the incubation vs. those who are trying to join
> > in
> >> as
> >> part of the ASF Incubation.
> >>
> >> I'm not saying we need to pass judgement one way or the other yet, but
> it
> >> will be a very useful data point to know before voting.
> >>
> >> Thanks,
> >> Roman.
> >>
> >> -
> >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >> 
> >> For additional commands, e-mail: general-h...@incubator.apache.org
> >> 
> >>
> >>
> >
> > --
> > Best regards,
> >
> >   - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>


[jira] [Assigned] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-05-23 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-15447:
--

Assignee: Nick Pentreath

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>    Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Nick Pentreath
+1 (binding)
On Mon, 23 May 2016 at 04:19, Matei Zaharia  wrote:

> Correction, let's run this for 72 hours, so until 9 PM EST May 25th.
>
> > On May 22, 2016, at 8:34 PM, Matei Zaharia 
> wrote:
> >
> > It looks like the discussion thread on this has only had positive
> replies, so I'm going to call a VOTE. The proposal is to remove the
> maintainer process in
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
> <
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers>
> given that it doesn't seem to have had a huge impact on the project, and it
> can unnecessarily create friction in contributing. We already have +1s from
> Mridul, Tom, Andrew Or and Imran on that thread.
> >
> > I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
> >
> > Matei
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294116#comment-15294116
 ] 

Nick Pentreath commented on SPARK-15447:


[~mengxr] yes will aim to run some tests during early next week.

> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-15442:
--

Assignee: Nick Pentreath

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293753#comment-15293753
 ] 

Nick Pentreath edited comment on SPARK-15442 at 5/20/16 5:18 PM:
-

When do you plan to submit a PR? I'm just about there on one already. But if 
you want to submit one go ahead and I'll review.


was (Author: mlnick):
When do you plan to submit a PR? I'm just about there on one already.

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293753#comment-15293753
 ] 

Nick Pentreath commented on SPARK-15442:


When do you plan to submit a PR? I'm just about there on one already.

> PySpark QuantileDiscretizer missing "relativeError" param
> -
>
> Key: SPARK-15442
> URL: https://issues.apache.org/jira/browse/SPARK-15442
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nick Pentreath
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293318#comment-15293318
 ] 

Nick Pentreath edited comment on SPARK-14810 at 5/20/16 1:04 PM:
-

Yeah makes sense - I've moved the listing to the comments for posterity. I will 
make another pass through before we get to RC stage to check for new ones and 
anything missed. At that point we can add to the migration guide.


was (Author: mlnick):
Yeah makes sense - I've moved the listing to the comments for posterity. I will 
make another pass through before we get to RC stage to check for new ones and 
anything missed

> ML, Graph 2.0 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-14810
> URL: https://issues.apache.org/jira/browse/SPARK-14810
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280592#comment-15280592
 ] 

Nick Pentreath edited comment on SPARK-14810 at 5/20/16 1:00 PM:
-

[~josephkb] [~mengxr] [~srowen] I've made a pass through this. I think I've 
audited all the excludes added to {{MimaExcludes}} (but will take another pass 
to double check). The majority of excludes added relate to (a) private classes 
/ methods; (b) @Experimental / DeveloperAPI; (c) adding methods to sealed 
traits; and (d) the change {{DataFrame}} -> {{Dataset}}.

(d) is a binary incompatible change but affects Java for all of Spark (as we 
know). So I've not worried about that.

I will check SPARK-13920 again as it added a lot of excludes (most of them 
appear to be for {{DataFrame}} -> {{Dataset}} or private, and all @Experimental 
/ DeveloperAPI, but still good to know if anything did change).

So far the two issues are removing deprecated methods:
* SPARK-14089 - 1.1-1.5
** {{BinaryClassificationEvaluator.setScoreCol}}
** {{LBFGS.setMaxNumIterations}} - DeveloperAPI
** {{RDDFunctions.treeReduce}} and {{treeAggregate}} - DeveloperAPI
** {{mllib.tree.Strategy.defaultStategy}} - appears to be a spelling error in 
the method name.
** {{mllib.tree.Node.build}} 
** {{MLUtils}} libsvm loaders for multiclass and load/save labeledData methods
* SPARK-14952 - 1.6
** {{ml.LinearRegression.weights}} - @Experimental
** {{ml.LogisticRegression.weights}} - @Experimental

So these are incompatible changes, but I assume they are OK. I'm just wondering 
how we prefer to document these changes - migration guide, or somewhere else?


was (Author: mlnick):
[~josephkb] [~mengxr] [~srowen] I've made a pass through this. I think I've 
audited all the excludes added to {{MimaExcludes}} (but will take another pass 
to double check). The majority of excludes added relate to (a) private classes 
/ methods; (b) @Experimental / DeveloperAPI; (c) adding methods to sealed 
traits; and (d) the change {{DataFrame}} -> {{Dataset}}.

(d) is a binary incompatible change but affects Java for all of Spark (as we 
know). So I've not worried about that.

I will check SPARK-13920 again as it added a lot of excludes (most of them 
appear to be for {{DataFrame}} -> {{Dataset}} or private, and all @Experimental 
/ DeveloperAPI, but still good to know if anything did change).

So far the two issues are removing deprecated methods:
* SPARK-14089 - 1.1-1.5
** {{BinaryClassificationEvaluator.setScoreCol}}
** {{LBFGS.setMaxNumIterations}} - DeveloperAPI
** {{RDDFunctions.treeReduce}} and {{treeAggregate}} - DeveloperAPI
** {{mllib.tree.Strategy.defaultStategy}} - appears to be a spelling error in 
the method name.
** {{mllib.tree.Node.build}} 
** {{MLUtils}} libsvm loaders for multiclass and load/save labeledData methods
* SPARK-14952 - 1.6
** {{ml.LinearRegression.weights}} - @Experimental
** {{ml.LogisticRegression.weights}} - @Experimental

So these are incompatible changes, but I assume they are OK. I'm just wondering 
how we prefer to document these changes - migration guide, or somewhere else?

> ML, Graph 2.0 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-14810
> URL: https://issues.apache.org/jira/browse/SPARK-14810
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293318#comment-15293318
 ] 

Nick Pentreath commented on SPARK-14810:


Yeah makes sense - I've moved the listing to the comments for posterity. I will 
make another pass through before we get to RC stage to check for new ones and 
anything missed.

> ML, Graph 2.0 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-14810
> URL: https://issues.apache.org/jira/browse/SPARK-14810
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>    Assignee: Nick Pentreath
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes

2016-05-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-14810:
---
Description: 
Generate a list of binary incompatible changes using MiMa and create new JIRAs 
for issues found. Filter out false positives as needed.

If you want to take this task, look at the analogous task from the previous 
release QA, and ping the Assignee for advice.

  was:
Generate a list of binary incompatible changes using MiMa and create new JIRAs 
for issues found. Filter out false positives as needed.

If you want to take this task, look at the analogous task from the previous 
release QA, and ping the Assignee for advice.

List of changes since {{1.6.0}} audited - these are "false positives" due to 
being private, @Experimental, DeveloperAPI, etc:
* SPARK-13686 - Add a constructor parameter `regParam` to 
(Streaming)LinearRegressionWithSGD
* SPARK-13664 - Replace HadoopFsRelation with FileFormat
* SPARK-11622 - Make LibSVMRelation extends HadoopFsRelation and Add 
LibSVMOutputWriter
* SPARK-13920 - MIMA checks should apply to @Experimental and @DeveloperAPI APIs
* SPARK-11011 - UserDefinedType serialization should be strongly typed
* SPARK-13817 - Re-enable MiMA and removes object DataFrame
* SPARK-13927 - add row/column iterator to local matrices - (add methods to 
sealed trait)
* SPARK-13948 - MiMa Check should catch if the visibility change to `private` - 
(DataFrame -> Dataset)
* SPARK-11262 - Unit test for gradient, loss layers, memory management - 
(private class)
* SPARK-13430 - moved featureCol from LinearRegressionModelSummary to 
LinearRegressionSummary - (private class)
* SPARK-13048 - keepLastCheckpoint option for LDA EM optimizer - (private class)
* SPARK-14734 - Add conversions between mllib and ml Vector, Matrix types - 
(private methods added)
* SPARK-14861 - Replace internal usages of SQLContext with SparkSession - 
(private class)

Binary incompatible changes:
* SPARK-14089 - Remove methods that have been deprecated since 1.1, 1.2, 1.3, 
1.4, and 1.5 
* SPARK-14952 - Remove methods deprecated in 1.6
* DataFrame -> Dataset changes for Java (this of course applies for all of 
Spark SQL)


> ML, Graph 2.0 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-14810
> URL: https://issues.apache.org/jira/browse/SPARK-14810
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293316#comment-15293316
 ] 

Nick Pentreath commented on SPARK-14810:


List of changes since {{1.6.0}} audited - these are "false positives" due to 
being private, @Experimental, DeveloperAPI, etc:
* SPARK-13686 - Add a constructor parameter `regParam` to 
(Streaming)LinearRegressionWithSGD
* SPARK-13664 - Replace HadoopFsRelation with FileFormat
* SPARK-11622 - Make LibSVMRelation extends HadoopFsRelation and Add 
LibSVMOutputWriter
* SPARK-13920 - MIMA checks should apply to @Experimental and @DeveloperAPI APIs
* SPARK-11011 - UserDefinedType serialization should be strongly typed
* SPARK-13817 - Re-enable MiMA and removes object DataFrame
* SPARK-13927 - add row/column iterator to local matrices - (add methods to 
sealed trait)
* SPARK-13948 - MiMa Check should catch if the visibility change to `private` - 
(DataFrame -> Dataset)
* SPARK-11262 - Unit test for gradient, loss layers, memory management - 
(private class)
* SPARK-13430 - moved featureCol from LinearRegressionModelSummary to 
LinearRegressionSummary - (private class)
* SPARK-13048 - keepLastCheckpoint option for LDA EM optimizer - (private class)
* SPARK-14734 - Add conversions between mllib and ml Vector, Matrix types - 
(private methods added)
* SPARK-14861 - Replace internal usages of SQLContext with SparkSession - 
(private class)

Binary incompatible changes:
* SPARK-14089 - Remove methods that have been deprecated since 1.1, 1.2, 1.3, 
1.4, and 1.5 
* SPARK-14952 - Remove methods deprecated in 1.6
* DataFrame -> Dataset changes for Java (this of course applies for all of 
Spark SQL)

> ML, Graph 2.0 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-14810
> URL: https://issues.apache.org/jira/browse/SPARK-14810
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15412) Improve linear & isotonic regression methods PyDocs

2016-05-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15412:
---
Assignee: holdenk

> Improve linear & isotonic regression methods PyDocs
> ---
>
> Key: SPARK-15412
> URL: https://issues.apache.org/jira/browse/SPARK-15412
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
>
> Very minor, but the LinearRegression & IsotonicRegression PyDocs are missing 
> a link, have a shorter description of boundaries, and aren't using list mode 
> for the types of regularization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15444) Default value mismatch of param linkPredictionCol for GeneralizedLinearRegression

2016-05-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15444:
---
Assignee: Liang-Chi Hsieh

> Default value mismatch of param linkPredictionCol for  
> GeneralizedLinearRegression
> --
>
> Key: SPARK-15444
> URL: https://issues.apache.org/jira/browse/SPARK-15444
> Project: Spark
>  Issue Type: Test
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Fix For: 2.0.0
>
>
> There is a default value mismatch of param linkPredictionCol for 
> GeneralizedLinearRegression between PySpark and Scala. This causes ml.tests 
> to fail.
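
For anyone auditing similar params, a quick way to eyeball the Python-side 
default against the Scala one (a sketch, assuming an active Spark 2.0 session):

{code}
from pyspark.ml.regression import GeneralizedLinearRegression

# Prints the param's doc plus its default/user-supplied value on the Python
# side, which has to stay in sync with the Scala implementation.
glr = GeneralizedLinearRegression()
print(glr.explainParam("linkPredictionCol"))
{code}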



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15444) Default value mismatch of param linkPredictionCol for GeneralizedLinearRegression

2016-05-20 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15444.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13220
[https://github.com/apache/spark/pull/13220]

> Default value mismatch of param linkPredictionCol for  
> GeneralizedLinearRegression
> --
>
> Key: SPARK-15444
> URL: https://issues.apache.org/jira/browse/SPARK-15444
> Project: Spark
>  Issue Type: Test
>Reporter: Liang-Chi Hsieh
>Priority: Blocker
> Fix For: 2.0.0
>
>
> There is a default value mismatch of param linkPredictionCol for 
> GeneralizedLinearRegression between PySpark and Scala. This causes ml.tests 
> to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15100) Audit: ml.feature

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292899#comment-15292899
 ] 

Nick Pentreath commented on SPARK-15100:


I created SPARK-15442 for #1.

> Audit: ml.feature
> -
>
> Key: SPARK-15100
> URL: https://issues.apache.org/jira/browse/SPARK-15100
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15100) Audit: ml.feature

2016-05-20 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292895#comment-15292895
 ] 

Nick Pentreath commented on SPARK-15100:


I'm not sure we need to set each and every possible parameter in each example, 
especially things that have sane defaults (like relativeError) or are expert 
params.

> Audit: ml.feature
> -
>
> Key: SPARK-15100
> URL: https://issues.apache.org/jira/browse/SPARK-15100
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15442) PySpark QuantileDiscretizer missing "relativeError" param

2016-05-20 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-15442:
--

 Summary: PySpark QuantileDiscretizer missing "relativeError" param
 Key: SPARK-15442
 URL: https://issues.apache.org/jira/browse/SPARK-15442
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.0.0
Reporter: Nick Pentreath
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15316) PySpark GeneralizedLinearRegression missing linkPredictionCol param

2016-05-19 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15316.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13106
[https://github.com/apache/spark/pull/13106]

> PySpark GeneralizedLinearRegression missing linkPredictionCol param
> ---
>
> Key: SPARK-15316
> URL: https://issues.apache.org/jira/browse/SPARK-15316
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.0.0
>
>
> PySpark's GeneralizedLinearRegression is missing the linkPredictionCol param.
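
For illustration, with the param added the Python usage should match the Scala 
API (a sketch; {{train}} and {{test}} are assumed DataFrames):

{code}
from pyspark.ml.regression import GeneralizedLinearRegression

# linkPredictionCol stores the prediction on the link-function scale.
glr = GeneralizedLinearRegression(family="poisson", link="log",
                                  linkPredictionCol="linkPred")
model = glr.fit(train)
model.transform(test).select("prediction", "linkPred").show()
{code}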



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14891) ALS in ML never validates input schema

2016-05-18 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-14891.

   Resolution: Fixed
Fix Version/s: 2.0.0

> ALS in ML never validates input schema
> --
>
> Key: SPARK-14891
> URL: https://issues.apache.org/jira/browse/SPARK-14891
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>    Reporter: Nick Pentreath
>    Assignee: Nick Pentreath
> Fix For: 2.0.0
>
>
> Currently, {{ALS.fit}} never validates the input schema. There is a 
> {{transformSchema}} impl that calls {{validateAndTransformSchema}}, but it is 
> never called in either {{ALS.fit}} or {{ALSModel.transform}}.
> This was highlighted in SPARK-13857 (and failing PySpark tests 
> [here|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56849/consoleFull]) 
> when adding a call to {{transformSchema}} in {{ALSModel.transform}} that actually 
> validates the input schema. The PySpark docstring tests result in Long inputs 
> by default, which fail validation as Int is required.
> Currently, the inputs for user and item ids are cast to Int, with no input 
> type validation (or warning message). So users could pass in Long, Float, 
> Double, etc. It's also not made clear anywhere in the docs that only Int 
> types for user and item are supported.
> Enforcing validation seems the best option but might break user code that 
> previously "just worked" especially in PySpark. 
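
A minimal sketch of the behaviour described above (assuming an active {{spark}} 
session; Python ints are inferred as LongType, which is what trips validation):

{code}
from pyspark.ml.recommendation import ALS

ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0)],
    ["user", "item", "rating"])  # user/item columns are inferred as LongType

als = ALS(userCol="user", itemCol="item", ratingCol="rating")
# Today the Long ids are silently cast to Int; with strict schema validation
# this fit would instead fail fast with a type error.
model = als.fit(ratings)
{code}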



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14978) PySpark TrainValidationSplitModel should support validationMetrics

2016-05-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288827#comment-15288827
 ] 

Nick Pentreath commented on SPARK-14978:


thanks!

> PySpark TrainValidationSplitModel should support validationMetrics
> --
>
> Key: SPARK-14978
> URL: https://issues.apache.org/jira/browse/SPARK-14978
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kai Jiang
>Assignee: Takuya Kuwahara
> Fix For: 2.0.0
>
>
> validationMetrics in TrainValidationSplitModel should also be supported in 
> pyspark.ml.tuning
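
A sketch of the intended Python usage, mirroring the Scala API ({{train}} is an 
assumed DataFrame of features/label):

{code}
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LinearRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid,
                           evaluator=RegressionEvaluator())
model = tvs.fit(train)
print(model.validationMetrics)  # one metric per param-grid entry, as in Scala
{code}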



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14978) PySpark TrainValidationSplitModel should support validationMetrics

2016-05-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288790#comment-15288790
 ] 

Nick Pentreath commented on SPARK-14978:


[~srowen] how do I add JIRA username {{taku-k}} to the contributor group?

> PySpark TrainValidationSplitModel should support validationMetrics
> --
>
> Key: SPARK-14978
> URL: https://issues.apache.org/jira/browse/SPARK-14978
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kai Jiang
> Fix For: 2.0.0
>
>
> validationMetrics in TrainValidationSplitModel should also be supported in 
> pyspark.ml.tuning



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15378) Unable to load NLTK in spark RDD pipeline

2016-05-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288665#comment-15288665
 ] 

Nick Pentreath edited comment on SPARK-15378 at 5/18/16 9:17 AM:
-

If you are trying to run on a cluster, then either the library needs to be 
installed on each worker node, or you can distribute libraries using the 
{{--py-files}} option of {{spark-submit}}. Please see [submitting applications 
guide|http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management]
 for details.


was (Author: mlnick):
If you are trying to run on a cluster, then either the library needs to be 
installed on each worker node, or you can distribute libraries using the 
{{--py-files}} option of {{spark-submit}}. Please see [submitting applications 
guide|http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management]
 for details.

>  Unable to load NLTK in spark RDD pipeline
> --
>
> Key: SPARK-15378
> URL: https://issues.apache.org/jira/browse/SPARK-15378
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
> Environment: spark version 1.6.1
>Reporter: Krishna Prasad
>  Labels: RDD, spark, spark-submit
>
> h1.Info: 
> * spark version 1.6.1
> * python version 2.7.9
> * I have installed NLTK and it's working fine with the following code, which 
> I am running in the *pyspark shell*
> {code}
> >>> from nltk.tokenize import word_tokenize
>   >>> text = "Hello, this is testing of nltk in pyspark, mainly 
> word_tokenize functions in nltk.tokenize, working fine with PySpark, please 
> see the below example"
>   >>> text
>   //'Hello, this is testing of nltk in pyspark, mainly word_tokenize 
> functions in nltk.tokenize, working fine with PySpark, please see the below 
> example'
>   >>> word_token  = word_tokenize(text)
>   >>> word_token
>   //['Hello', ',', 'this', 'is', 'testing', 'of', 'nltk', 'in', 
> 'pyspark', ',', 'mainly', 'word_tokenize', 'functions', 'in', 
> 'nltk.tokenize', ',', 'working', 'fine', 'with', 'PySpark', ',', 'please', 
> 'see', 'the', 'below', 'example']
>   >>>
> {code}
> h1.Problem:
> When I try to run it using Spark's built-in method `map`, it throws an error: 
> *ImportError: No module named nltk.tokenize*
> {code}
> >>> from nltk.tokenize import word_tokenize
>   >>> rdd = sc.parallelize(["This is first sentence for tokenization", 
> "second line, we need to tokenize"])
>   >>> rdd_tokens = rdd.map(lambda sentence : word_tokenize(sentence))
>   >>> rdd_tokens
>   // PythonRDD[2] at RDD at PythonRDD.scala:43
>   >>> rdd_tokens.collect()
> {code}
> h2. Full stack trace: 
> {code}
>   >>> from nltk.tokenize import word_tokenize
>   >>> rdd = sc.parallelize(["This is first sentence for tokenization", 
> "second line, we need to tokenize"])
>   >>> rdd_tokens = rdd.map(lambda sentence : word_tokenize(sentence))
>   >>> rdd_tokens
>   // PythonRDD[2] at RDD at PythonRDD.scala:43
>   >>> rdd_tokens.collect()
>   16/05/17 17:06:48 WARN 
> org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 
> 16, spark-w-0.c.clean-feat-131014.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
> File "/usr/lib/spark/python/pyspark/worker.py", line 98, in 
> main
>   command = pickleSer._read_with_length(infile)
> File "/usr/lib/spark/python/pyspark/serializers.py", line 
> 164, in _read_with_length
>   return self.loads(obj)
> File "/usr/lib/spark/python/pyspark/serializers.py", line 
> 422, in loads
>   return pickle.loads(obj)
>   ImportError: No module named nltk.tokenize
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   

[jira] [Commented] (SPARK-15378) Unable to load NLTK in spark RDD pipeline

2016-05-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288665#comment-15288665
 ] 

Nick Pentreath commented on SPARK-15378:


If you are trying to run on a cluster, then either the library needs to be 
installed on each worker node, or you can distribute libraries using the 
{{--py-files}} option of {{spark-submit}}. Please see [submitting applications 
guide|http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management]
 for details.
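
A sketch of one way to do this from code, equivalent to {{--py-files}} (the zip 
path is hypothetical, and the library and its data still need to be resolvable 
on the executors):

{code}
# Ship the package to the executors; equivalent to spark-submit --py-files.
sc.addPyFile("/path/to/nltk.zip")  # hypothetical path; sc is the SparkContext

def tokenize(sentence):
    # Import inside the function so the module is resolved on each executor.
    from nltk.tokenize import word_tokenize
    return word_tokenize(sentence)

rdd = sc.parallelize(["This is first sentence for tokenization",
                      "second line, we need to tokenize"])
print(rdd.map(tokenize).collect())
{code}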

>  Unable to load NLTK in spark RDD pipeline
> --
>
> Key: SPARK-15378
> URL: https://issues.apache.org/jira/browse/SPARK-15378
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
> Environment: spark version 1.6.1
>Reporter: Krishna Prasad
>  Labels: RDD, spark, spark-submit
>
> h1.Info: 
> * spark version 1.6.1
> * python version 2.7.9
> * I have installed NLTK and it's working fine with the following code, which 
> I am running in the *pyspark shell*
> {code}
> >>> from nltk.tokenize import word_tokenize
>   >>> text = "Hello, this is testing of nltk in pyspark, mainly 
> word_tokenize functions in nltk.tokenize, working fine with PySpark, please 
> see the below example"
>   >>> text
>   //'Hello, this is testing of nltk in pyspark, mainly word_tokenize 
> functions in nltk.tokenize, working fine with PySpark, please see the below 
> example'
>   >>> word_token  = word_tokenize(text)
>   >>> word_token
>   //['Hello', ',', 'this', 'is', 'testing', 'of', 'nltk', 'in', 
> 'pyspark', ',', 'mainly', 'word_tokenize', 'functions', 'in', 
> 'nltk.tokenize', ',', 'working', 'fine', 'with', 'PySpark', ',', 'please', 
> 'see', 'the', 'below', 'example']
>   >>>
> {code}
> h1.Problem:
> When I try to run it using Spark's built-in method `map`, it throws an error: 
> *ImportError: No module named nltk.tokenize*
> {code}
> >>> from nltk.tokenize import word_tokenize
>   >>> rdd = sc.parallelize(["This is first sentence for tokenization", 
> "second line, we need to tokenize"])
>   >>> rdd_tokens = rdd.map(lambda sentence : word_tokenize(sentence))
>   >>> rdd_tokens
>   // PythonRDD[2] at RDD at PythonRDD.scala:43
>   >>> rdd_tokens.collect()
> {code}
> h2. Full stack trace: 
> {code}
>   >>> from nltk.tokenize import word_tokenize
>   >>> rdd = sc.parallelize(["This is first sentence for tokenization", 
> "second line, we need to tokenize"])
>   >>> rdd_tokens = rdd.map(lambda sentence : word_tokenize(sentence))
>   >>> rdd_tokens
>   // PythonRDD[2] at RDD at PythonRDD.scala:43
>   >>> rdd_tokens.collect()
>   16/05/17 17:06:48 WARN 
> org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 
> 16, spark-w-0.c.clean-feat-131014.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
> File "/usr/lib/spark/python/pyspark/worker.py", line 98, in 
> main
>   command = pickleSer._read_with_length(infile)
> File "/usr/lib/spark/python/pyspark/serializers.py", line 
> 164, in _read_with_length
>   return self.loads(obj)
> File "/usr/lib/spark/python/pyspark/serializers.py", line 
> 422, in loads
>   return pickle.loads(obj)
>   ImportError: No module named nltk.tokenize
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExe

[jira] [Resolved] (SPARK-14978) PySpark TrainValidationSplitModel should support validationMetrics

2016-05-18 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-14978.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12767
[https://github.com/apache/spark/pull/12767]

> PySpark TrainValidationSplitModel should support validationMetrics
> --
>
> Key: SPARK-14978
> URL: https://issues.apache.org/jira/browse/SPARK-14978
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kai Jiang
> Fix For: 2.0.0
>
>
> validationMetrics in TrainValidationSplitModel should also be supported in 
> pyspark.ml.tuning



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [DISCUSS] PredictionIO incubation proposal

2016-05-17 Thread Nick Pentreath
Hi there

I'm glad to see the proposal to incubate PredictionIO. In my previous life
as a startup co-founder, I kept a close eye on the project, and it would be
fantastic to see it become an Apache incubating project!

The folks working on Apache Spark and Apache SystemML (incubating) here at
IBM are excited about the possibilities for integrating PredictionIO and
SystemML (Mike Dusenberry is a committer on that project), as well
as further improving Spark integration (I'm a PMC member on that project).

Mike and I, together with Luciano (who is a mentor on this proposal) would
like to volunteer our services as initial committers, if that is agreeable.

Kind regards
Nick
mln...@apache.org



>
> -- Forwarded message --
> From: Andrew Purtell 
> To: "general@incubator.apache.org" 
> Cc:
> Date: Fri, 13 May 2016 13:41:38 -0700
> Subject: [DISCUSS] PredictionIO incubation proposal
> Greetings,
>
> It is my pleasure to propose the PredictionIO project for incubation at the
> Apache Software Foundation. PredictionIO is a popular open source Machine
> Learning Server built on top of a state-of-the-art open source stack,
> including several Apache technologies, that enables developers to manage and
> deploy production-ready predictive services for various kinds of machine
> learning tasks, with more than 400 production deployments around the world
> and a growing contributor community.
>
>
> The text of the proposal is included below and is also available at
> https://wiki.apache.org/incubator/PredictionIO
>
> Best regards,
> Andrew Purtell
>
>
> = PredictionIO Proposal =
>
> === Abstract ===
> PredictionIO is an open source Machine Learning Server built on top of
> state-of-the-art open source stack, that enables developers to manage and
> deploy production-ready predictive services for various kinds of machine
> learning tasks.
>
> === Proposal ===
> The PredictionIO platform consists of the following components:
>
>  * PredictionIO framework - provides the machine learning stack for
>  building, evaluating and deploying engines with machine learning
>  algorithms. It uses Apache Spark for processing.
>
>  * Event Server - the machine learning analytics layer for unifying events
>  from multiple platforms. It can use Apache HBase or any JDBC backend
>  as its data store.
>
> The PredictionIO community also maintains a Template Gallery, a place to
> publish and download (free or proprietary) engine templates for different
> types of machine learning applications, which is a complementary part of the
> project. At this point we exclude the Template Gallery from the proposal,
> as it has a separate set of contributors and we’re not familiar with an
> Apache-approved mechanism to maintain such a gallery.
>
> You can find the Template Gallery at https://templates.prediction.io/
>
> === Background ===
> PredictionIO was started with a mission to democratize and bring machine
> learning to the masses.
>
> Machine learning has traditionally been a luxury for big companies like
> Google, Facebook, and Netflix. There are ML libraries and tools lying
> around the internet, but the effort of putting them all together as a
> production-ready infrastructure is a very resource-intensive task, hardly
> within reach of individuals or small businesses.
>
> PredictionIO is a production-ready, full stack machine learning system that
> allows organizations of any scale to quickly deploy machine learning
> capabilities. It comes with official and community-contributed machine
> learning engine templates that are easy to customize.
>
> === Rationale ===
> As the usage and number of contributors to PredictionIO have grown bigger and
> more diverse, we have sought an independent framework for the project
> to keep thriving. We believe the Apache foundation is a great fit. Joining
> Apache would ensure that tried and true processes and procedures are in
> place for the growing number of organizations interested in contributing
> to PredictionIO. PredictionIO is also a good fit for the Apache foundation.
> PredictionIO was built on top of several Apache projects (HBase, Spark,
> Hadoop). We are familiar with the Apache process and believe that the
> democratic and meritocratic nature of the foundation aligns with the
> project goals.
>
> === Initial Goals ===
> The initial milestones will be to move the existing codebase to Apache and
> integrate with the Apache development process. Once this is accomplished,
> we plan for incremental development and releases that follow the Apache
> guidelines, as well as growing our developer and user communities.
>
> === Current Status ===
> PredictionIO has undergone nine minor releases and many patches.
> PredictionIO is being used in production by Salesforce.com as well as many
> other organizations and apps. The PredictionIO codebase is currently
> hosted at GitHub, which will 

[jira] [Resolved] (SPARK-15182) Copy MLlib doc to ML: ml.feature

2016-05-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15182.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12957
[https://github.com/apache/spark/pull/12957]

> Copy MLlib doc to ML: ml.feature
> 
>
> Key: SPARK-15182
> URL: https://issues.apache.org/jira/browse/SPARK-15182
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15182) Copy MLlib doc to ML: ml.feature

2016-05-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15182:
---
Assignee: yuhao yang

> Copy MLlib doc to ML: ml.feature
> 
>
> Key: SPARK-15182
> URL: https://issues.apache.org/jira/browse/SPARK-15182
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14434) User guide doc and examples for GaussianMixture in spark.ml

2016-05-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-14434:
---
Assignee: Miao Wang

> User guide doc and examples for GaussianMixture in spark.ml
> ---
>
> Key: SPARK-14434
> URL: https://issues.apache.org/jira/browse/SPARK-14434
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
> Fix For: 2.0.0
>
>
> This should ideally happen after a Python API is added by [SPARK-14433]
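
Once the Python API from SPARK-14433 is in, the guide example would presumably 
look something like this minimal sketch ({{dataset}} is an assumed DataFrame 
with a "features" column):

{code}
from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(k=2)
model = gmm.fit(dataset)
model.gaussiansDF.show()  # per-component mean and covariance
{code}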



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14434) User guide doc and examples for GaussianMixture in spark.ml

2016-05-17 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-14434.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12788
[https://github.com/apache/spark/pull/12788]

> User guide doc and examples for GaussianMixture in spark.ml
> ---
>
> Key: SPARK-14434
> URL: https://issues.apache.org/jira/browse/SPARK-14434
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> This should ideally happen after a Python API is added by [SPARK-14433]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14709) spark.ml API for linear SVM

2016-05-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284304#comment-15284304
 ] 

Nick Pentreath commented on SPARK-14709:


It would be great to get the list of references for the SMO impl, as well as 
some concrete performance numbers (e.g. vs. the other models in Spark).

> spark.ml API for linear SVM
> ---
>
> Key: SPARK-14709
> URL: https://issues.apache.org/jira/browse/SPARK-14709
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Provide API for SVM algorithm for DataFrames.  I would recommend using 
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.
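
For illustration, a DataFrame-based API mimicking the existing 
spark.ml.classification estimators might look like the sketch below (the class 
name {{LinearSVC}} and its params are hypothetical - nothing here exists yet):

{code}
from pyspark.ml.classification import LinearSVC  # hypothetical; does not exist yet

# train/test are assumed DataFrames with "features" and "label" columns.
svm = LinearSVC(maxIter=100, regParam=0.1)
model = svm.fit(train)
predictions = model.transform(test)
{code}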



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14979) Add examples for GeneralizedLinearRegression

2016-05-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-14979.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12754
[https://github.com/apache/spark/pull/12754]

> Add examples for GeneralizedLinearRegression
> 
>
> Key: SPARK-14979
> URL: https://issues.apache.org/jira/browse/SPARK-14979
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add Scala/Java/Python examples for GeneralizedLinearRegression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15316) PySpark GeneralizedLinearRegression missing linkPredictionCol param

2016-05-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15316:
---
Assignee: holdenk

> PySpark GeneralizedLinearRegression missing linkPredictionCol param
> ---
>
> Key: SPARK-15316
> URL: https://issues.apache.org/jira/browse/SPARK-15316
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
>
> PySpark's GeneralizedLinearRegression is missing the linkPredictionCol param.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15305) spark.ml document Bisecting k-means has the incorrect format

2016-05-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15305:
---
Assignee: Miao Wang

> spark.ml document Bisecting k-means has the incorrect format
> -
>
> Key: SPARK-15305
> URL: https://issues.apache.org/jira/browse/SPARK-15305
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Miao Wang
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.0.0
>
>
> In the generated ml-clustering.html, the Bisecting k-means section has the incorrect format. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15305) spark.ml document Bisecting k-means has the incorrect format

2016-05-16 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-15305.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13083
[https://github.com/apache/spark/pull/13083]

> spark.ml document Bisecting k-means has the incorrect format
> -
>
> Key: SPARK-15305
> URL: https://issues.apache.org/jira/browse/SPARK-15305
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Miao Wang
>Priority: Minor
> Fix For: 2.0.0
>
>
> In the generated ml-clustering.html, the Bisecting k-means section has the incorrect format. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15186) Add user guide for Generalized Linear Regression.

2016-05-13 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-15186:
---
Assignee: Seth Hendrickson

> Add user guide for Generalized Linear Regression.
> -
>
> Key: SPARK-15186
> URL: https://issues.apache.org/jira/browse/SPARK-15186
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> We should add a user guide for the new GLR interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14979) Add examples for GeneralizedLinearRegression

2016-05-13 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-14979:
---
Assignee: Yanbo Liang

> Add examples for GeneralizedLinearRegression
> 
>
> Key: SPARK-14979
> URL: https://issues.apache.org/jira/browse/SPARK-14979
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Add Scala/Java/Python examples for GeneralizedLinearRegression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


