[jira] [Closed] (SPARK-8402) Add DP means clustering to MLlib

2017-05-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-8402. - Resolution: Won't Fix > Add DP means clustering to ML

[jira] [Commented] (SPARK-8402) Add DP means clustering to MLlib

2017-05-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006060#comment-16006060 ] Nick Pentreath commented on SPARK-8402: --- I'm afraid I would say there is not sufficient demand

[jira] [Commented] (SPARK-11669) Python interface to SparkR GLM module

2017-05-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006056#comment-16006056 ] Nick Pentreath commented on SPARK-11669: I think this can be closed as it's done

[jira] [Commented] (SPARK-20503) ML 2.2 QA: API: Python API coverage

2017-05-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006038#comment-16006038 ] Nick Pentreath commented on SPARK-20503: If SPARK-20602 and/or SPARK-20348 are completed, Python

[jira] [Updated] (SPARK-20679) Let ML ALS recommend for a subset of users/items

2017-05-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20679: --- Summary: Let ML ALS recommend for a subset of users/items (was: Let ALS recommend

[jira] [Commented] (SPARK-20679) Let ALS recommend for a subset of users/items

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002360#comment-16002360 ] Nick Pentreath commented on SPARK-20679: I'm working on this > Let ALS recommend for a sub

[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002343#comment-16002343 ] Nick Pentreath commented on SPARK-10802: Hey folks - since the {{ALSModel}} in the ML API now

[jira] [Created] (SPARK-20679) Let ALS recommend for a subset of users/items

2017-05-09 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20679: -- Summary: Let ALS recommend for a subset of users/items Key: SPARK-20679 URL: https://issues.apache.org/jira/browse/SPARK-20679 Project: Spark Issue Type

[jira] [Comment Edited] (SPARK-10408) Autoencoder

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002326#comment-16002326 ] Nick Pentreath edited comment on SPARK-10408 at 5/9/17 9:00 AM: What

[jira] [Commented] (SPARK-10408) Autoencoder

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002326#comment-16002326 ] Nick Pentreath commented on SPARK-10408: What is the status here? I think it's fairly safe to say

[jira] [Commented] (SPARK-6323) Large rank matrix factorization with Nonlinear loss and constraints

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16002310#comment-16002310 ] Nick Pentreath commented on SPARK-6323: --- I think it is safe to say this will not be feasible

[jira] [Resolved] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20587. Resolution: Fixed Fix Version/s: 2.2.1 Issue resolved by pull request 17845 [https

[jira] [Resolved] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-11968. Resolution: Fixed Fix Version/s: 2.2.1 Issue resolved by pull request 17742 [https

[jira] [Assigned] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-11968: -- Assignee: Peng Meng (was: Nick Pentreath) > ALS recommend all methods spend m

[jira] [Assigned] (SPARK-20677) Clean up ALS recommend all improvement code.

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-20677: -- Assignee: Nick Pentreath > Clean up ALS recommend all improvement c

[jira] [Updated] (SPARK-20677) Clean up ALS recommend all improvement code.

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20677: --- Description: SPARK-11968 and SPARK-20587 added performance improvements to the "reco

[jira] [Updated] (SPARK-20677) Clean up ALS recommend all improvement code.

2017-05-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20677: --- Description: SPARK-11968 and SPARK-20587 added performance improvements to the "reco

[jira] [Created] (SPARK-20677) Clean up ALS recommend all improvement code.

2017-05-09 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20677: -- Summary: Clean up ALS recommend all improvement code. Key: SPARK-20677 URL: https://issues.apache.org/jira/browse/SPARK-20677 Project: Spark Issue Type

[jira] [Updated] (SPARK-20596) Improve ALS recommend all test cases

2017-05-08 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20596: --- Fix Version/s: (was: 2.2.0) 2.2.1 > Improve ALS recommend all t

[jira] [Resolved] (SPARK-20596) Improve ALS recommend all test cases

2017-05-08 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20596. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17860 [https

[jira] [Commented] (SPARK-20503) ML 2.2 QA: API: Python API coverage

2017-05-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996676#comment-15996676 ] Nick Pentreath commented on SPARK-20503: cc [~holdenk] [~bryanc] [~zero323]? I can take

[jira] [Commented] (SPARK-20501) ML, Graph 2.2 QA: API: New Scala APIs, docs

2017-05-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996675#comment-15996675 ] Nick Pentreath commented on SPARK-20501: Things that would need to be checked include

[jira] [Updated] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella

2017-05-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20499: --- Description: This JIRA lists tasks for the next Spark release's QA period for MLlib

[jira] [Updated] (SPARK-20596) Improve ALS recommend all test cases

2017-05-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20596: --- Component/s: Tests > Improve ALS recommend all test ca

[jira] [Created] (SPARK-20596) Improve ALS recommend all test cases

2017-05-04 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20596: -- Summary: Improve ALS recommend all test cases Key: SPARK-20596 URL: https://issues.apache.org/jira/browse/SPARK-20596 Project: Spark Issue Type: Test

[jira] [Updated] (SPARK-20596) Improve ALS recommend all test cases

2017-05-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20596: --- Affects Version/s: (was: 2.1.0) 2.2.0 > Improve ALS recommend

[jira] [Created] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-03 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20587: -- Summary: Improve performance of ML ALS recommendForAll Key: SPARK-20587 URL: https://issues.apache.org/jira/browse/SPARK-20587 Project: Spark Issue Type

[jira] [Resolved] (SPARK-6227) PCA and SVD for PySpark

2017-05-03 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-6227. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17621 [https

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-02 Thread Nick Pentreath
I won't +1 just given that it seems certain there will be another RC and there are the outstanding ML QA blocker issues. But clean build and test for JVM and Python tests LGTM on CentOS Linux 7.2.1511, OpenJDK 1.8.0_111 On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft

[jira] [Created] (SPARK-20553) Update ALS examples for ML to illustrate recommend all

2017-05-02 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20553: -- Summary: Update ALS examples for ML to illustrate recommend all Key: SPARK-20553 URL: https://issues.apache.org/jira/browse/SPARK-20553 Project: Spark

[jira] [Resolved] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items

2017-05-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20300. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17622 [https

[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-05-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992497#comment-15992497 ] Nick Pentreath commented on SPARK-20443: Interesting - though it appears to me that {{2048

[jira] [Commented] (SPARK-20551) ImportError adding custom class from jar in pyspark

2017-05-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992491#comment-15992491 ] Nick Pentreath commented on SPARK-20551: Yes I agree that it appears you're trying to import Java

[jira] [Closed] (SPARK-20551) ImportError adding custom class from jar in pyspark

2017-05-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-20551. -- Resolution: Not A Problem > ImportError adding custom class from jar in pysp

[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-05-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992475#comment-15992475 ] Nick Pentreath commented on SPARK-20443: Were these tests against existing master? Because SPARK

[jira] [Assigned] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items

2017-04-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-20300: -- Assignee: Nick Pentreath > Python API for ALSModel.recommendForAllUsers,It

[jira] [Commented] (SPARK-20469) Add a method to display DataFrame schema in PipelineStage

2017-04-27 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987219#comment-15987219 ] Nick Pentreath commented on SPARK-20469: Pipeline stages themselves have no concept of schema

[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-26 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984533#comment-15984533 ] Nick Pentreath commented on SPARK-11968: Thanks - in the meantime I will take a look at the PR

[jira] [Commented] (SPARK-20443) The blockSize of MLLIB ALS should be setting by the User

2017-04-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983050#comment-15983050 ] Nick Pentreath commented on SPARK-20443: Your PR for SPARK-20446 / SPARK-11968 should largely

[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983048#comment-15983048 ] Nick Pentreath commented on SPARK-11968: [~peng.m...@intel.com] would you mind posting your

[jira] [Closed] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-20446. -- Resolution: Duplicate > Optimize the process of MLLIB ALS recommendFor

[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983026#comment-15983026 ] Nick Pentreath commented on SPARK-11968: Note, there is a solution proposed in SPARK-20446. I've

[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983018#comment-15983018 ] Nick Pentreath commented on SPARK-20446: By the way when I say it is a duplicate I mean

[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-04-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982695#comment-15982695 ] Nick Pentreath commented on SPARK-13857: I'm going to close this as superseded by SPARK-19535

[jira] [Closed] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-04-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-13857. -- Resolution: Duplicate > Feature parity for ALS ML with ML

Re: pyspark vector

2017-04-25 Thread Nick Pentreath
Well the 3 in this case is the size of the sparse vector. This equates to the number of features, which for CountVectorizer (I assume that's what you're using) is also vocab size (number of unique terms). On Tue, 25 Apr 2017 at 04:06 Peyman Mohajerian wrote: > setVocabSize >
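The point about vector size can be illustrated with a small pure-Python sketch. It is not the Spark implementation: the helper names `fit_vocabulary` and `to_sparse` are hypothetical, and the vocabulary is ordered alphabetically here for simplicity, whereas Spark's CountVectorizer orders terms by corpus frequency. It only shows that the first element of the sparse representation is the vocabulary size, not the number of non-zero entries.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Collect the unique terms across all tokenized documents (sorted)."""
    return sorted({term for doc in docs for term in doc})


def to_sparse(doc, vocab):
    """Return (size, indices, values) for one tokenized document.

    size is always len(vocab), i.e. the number of features.
    """
    index = {term: i for i, term in enumerate(vocab)}
    counts = Counter(term for term in doc if term in index)
    indices = sorted(index[term] for term in counts)
    values = [float(counts[vocab[i]]) for i in indices]
    return len(vocab), indices, values


docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocab = fit_vocabulary(docs)          # ['a', 'b', 'c']
print(to_sparse(docs[1], vocab))      # (3, [0, 1, 2], [2.0, 2.0, 1.0])
print(to_sparse(["a"], vocab))        # (3, [0], [1.0])
```

Both vectors report a size of 3 because the vocabulary has three unique terms, even though the second one has only a single non-zero entry.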

[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981648#comment-15981648 ] Nick Pentreath commented on SPARK-20446: By "compare to DataFrame implementation"

[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981134#comment-15981134 ] Nick Pentreath commented on SPARK-20446: Also would be good to compare to the new {{DataFrame

[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981129#comment-15981129 ] Nick Pentreath commented on SPARK-20446: Anyway I'd like to compare the approaches and see which

[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981122#comment-15981122 ] Nick Pentreath commented on SPARK-20446: The GC would come from the temp result array

[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981066#comment-15981066 ] Nick Pentreath commented on SPARK-20446: This is really a duplicate of https://issues.apache.org

[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15978157#comment-15978157 ] Nick Pentreath commented on SPARK-20392: cc [~viirya] > Slow performance when calling fit on

[jira] [Resolved] (SPARK-20097) Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR

2017-04-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20097. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17431 [https

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Nick Pentreath
Why not use the RandomForest from Spark ML? On Sun, 9 Apr 2017 at 16:01, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > I have already posted this question to the StackOverflow > . >

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
dently. That sounds like something which > could be ran in parallel. > > > On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > What is the size of training data (number examples, number features)? > Dense or sparse features? How man

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
What is the size of training data (number examples, number features)? Dense or sparse features? How many classes? What commands are you using to submit your job via spark-submit? On Fri, 7 Apr 2017 at 13:12 Aseem Bansal wrote: > When using spark ml's LogisticRegression,

[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib

2017-04-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960537#comment-15960537 ] Nick Pentreath commented on SPARK-4038: --- I don't think there can be a reasonable expectation

[jira] [Commented] (SPARK-17716) Hidden Markov Model (HMM)

2017-04-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960534#comment-15960534 ] Nick Pentreath commented on SPARK-17716: I don't think we can expect sufficient committer

[jira] [Commented] (SPARK-3903) Create general data loading method for LabeledPoints

2017-04-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960531#comment-15960531 ] Nick Pentreath commented on SPARK-3903: --- I think given the move to DataFrames, and that we can load

[jira] [Commented] (SPARK-7674) R-like stats for ML models

2017-04-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960530#comment-15960530 ] Nick Pentreath commented on SPARK-7674: --- Is this JIRA still open? Can it be resolved

[jira] [Commented] (SPARK-12210) Small example that shows how to integrate spark.mllib with spark.ml

2017-04-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960523#comment-15960523 ] Nick Pentreath commented on SPARK-12210: Is this required any more? I guess we are close enough

[jira] [Assigned] (SPARK-20076) Python interface for ml.stats.Correlation

2017-04-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-20076: -- Assignee: Liang-Chi Hsieh > Python interface for ml.stats.Correlat

[jira] [Resolved] (SPARK-20076) Python interface for ml.stats.Correlation

2017-04-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20076. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17494 [https

[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator

2017-04-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958492#comment-15958492 ] Nick Pentreath commented on SPARK-19979: I think we could add a note to the user guide. However I

[jira] [Assigned] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-04-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19953: -- Assignee: Bryan Cutler > RandomForest Models should use the UID of Estimator when

[jira] [Resolved] (SPARK-19953) RandomForest Models should use the UID of Estimator when fit

2017-04-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19953. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17296 [https

[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan

2017-04-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954904#comment-15954904 ] Nick Pentreath commented on SPARK-20203: I see there is a comment in the code that says

[jira] [Commented] (SPARK-20047) Constrained Logistic Regression

2017-04-03 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953551#comment-15953551 ] Nick Pentreath commented on SPARK-20047: Is this really targeted for 2.2.0? > Constrai

[jira] [Assigned] (SPARK-19969) Doc and examples for Imputer

2017-04-03 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19969: -- Assignee: yuhao yang > Doc and examples for Impu

[jira] [Resolved] (SPARK-19969) Doc and examples for Imputer

2017-04-03 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19969. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17324 [https

[jira] [Assigned] (SPARK-19985) Some ML Models error when copy or do not set parent

2017-04-03 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19985: -- Assignee: Bryan Cutler > Some ML Models error when copy or do not set par

[jira] [Resolved] (SPARK-19985) Some ML Models error when copy or do not set parent

2017-04-03 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19985. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17326 [https

Re: Collaborative filtering steps in spark

2017-03-29 Thread Nick Pentreath
No, it does a random initialization. It does use a slightly different approach from pure normal random - it chooses non-negative draws which results in very slightly better results empirically. In practice I'm not sure if the average rating approach will make a big difference (it's been a long
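A minimal sketch of what "non-negative draws" could look like, in plain Python rather than Spark code. The function name `init_factor` is hypothetical, and both taking the absolute value of a standard normal draw and scaling the vector to unit length are assumptions of this sketch; the message above only says that the initial values are random and non-negative.

```python
import math
import random


def init_factor(rank, rng):
    """Draw one latent factor vector with non-negative entries.

    Assumption: each entry is |N(0, 1)| and the vector is then scaled
    to unit L2 norm. The message does not specify these details.
    """
    draws = [abs(rng.gauss(0.0, 1.0)) for _ in range(rank)]
    norm = math.sqrt(sum(x * x for x in draws))
    return [x / norm for x in draws]


rng = random.Random(42)            # fixed seed so the run is repeatable
factor = init_factor(10, rng)
print(min(factor) >= 0.0)          # every entry is non-negative
```

A pure normal initialization would differ only in dropping the `abs` call, which is the comparison the reply is making.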

[jira] [Commented] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2017-03-29 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947114#comment-15947114 ] Nick Pentreath commented on SPARK-14174: The actual fix in the PR is pretty small - essentially

[jira] [Assigned] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-15040: -- Assignee: Nick Pentreath > PySpark impl for ml.feature.Impu

[jira] [Resolved] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-15040. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17316 [https

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
send a patch. > > On 23 March 2017 at 13:49, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > Yup, that is true and a reasonable clarification of the doc. > > > > On Thu, 23 Mar 2017 at 00:03 chris snow <chsnow...@gmail.com> wrote: > >> >

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
Yup, that is true and a reasonable clarification of the doc. On Thu, 23 Mar 2017 at 00:03 chris snow wrote: > The documentation for collaborative filtering is as follows: > > === > Scaling of the regularization parameter > > Since v1.1, we scale the regularization parameter

[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20043: --- Labels: starter (was: ) > CrossValidatorModel loader does not recognize impurity "

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Nick Pentreath
As for SPARK-19759 , I don't think that needs to be targeted for 2.1.1 so we don't need to worry about it On Tue, 21 Mar 2017 at 13:49 Holden Karau wrote: > I agree with Michael, I think we've got some outstanding issues

[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15934905#comment-15934905 ] Nick Pentreath commented on SPARK-20043: I just noticed the error message you put above says

[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20043: --- Docs Text: (was: I saved a CrossValidatorModel with a decision tree and a random forest. I

[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-20043: --- Description: I saved a CrossValidatorModel with a decision tree and a random forest. I use

Re: Contributing to Spark

2017-03-19 Thread Nick Pentreath
If you have experience and interest in Python then PySpark is a good area to look into. Yes, adding things like tests & documentation is a good starting point. Start out relatively small and go from there. Adding new wrappers to python for ML is useful for slightly larger tasks. On Mon, 20

[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928854#comment-15928854 ] Nick Pentreath commented on SPARK-19969: Ok - I can help on it but probably only some time next

[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator

2017-03-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928545#comment-15928545 ] Nick Pentreath commented on SPARK-19979: I wonder if this fits in as a sort of sub-task of SPARK

[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928537#comment-15928537 ] Nick Pentreath commented on SPARK-19969: No haven't done the doc or examples - I seem to recall

[jira] [Commented] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928180#comment-15928180 ] Nick Pentreath commented on SPARK-15040: Sorry, I did not see your comment - I opened a [PR

[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928193#comment-15928193 ] Nick Pentreath commented on SPARK-19899: +1 on {{itemsCol}} - feel free to send a PR

[jira] [Assigned] (SPARK-13568) Create feature transformer to impute missing values

2017-03-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-13568: -- Assignee: yuhao yang > Create feature transformer to impute missing val

[jira] [Resolved] (SPARK-13568) Create feature transformer to impute missing values

2017-03-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-13568. Resolution: Fixed Fix Version/s: 2.2.0 > Create feature transformer to imp

[jira] [Created] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-19969: -- Summary: Doc and examples for Imputer Key: SPARK-19969 URL: https://issues.apache.org/jira/browse/SPARK-19969 Project: Spark Issue Type: Documentation

Re: Should we consider a Spark 2.1.1 release?

2017-03-16 Thread Nick Pentreath
Spark 1.5.1 had 87 issues fix version 1 month after 1.5.0. Spark 1.6.1 had 123 issues 2 months after 1.6.0 2.0.1 was larger (317 issues) at 3 months after 2.0.0 - makes sense due to how large a release it was. We are at 185 for 2.1.1 and 3 months after (and not released yet so it could slip

[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927601#comment-15927601 ] Nick Pentreath commented on SPARK-19962: You may also want to take a look at https

[jira] [Commented] (SPARK-19957) Inconsist KMeans initialization mode behavior between ML and MLlib

2017-03-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925723#comment-15925723 ] Nick Pentreath commented on SPARK-19957: See https://issues.apache.org/jira/browse/SPARK-16832

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902649#comment-15902649 ] Nick Pentreath commented on SPARK-14409: [~josephkb] in reference to your [PR comment|https

[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902639#comment-15902639 ] Nick Pentreath edited comment on SPARK-14409 at 3/9/17 8:05 AM: I

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902639#comment-15902639 ] Nick Pentreath commented on SPARK-14409: I commented on the [PR for SPARK-19535|https

[jira] [Commented] (SPARK-13969) Extend input format that feature hashing can handle

2017-03-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900825#comment-15900825 ] Nick Pentreath commented on SPARK-13969: I think {{HashingTF}} and {{FeatureHasher

[jira] [Commented] (SPARK-19848) Regex Support in StopWordsRemover

2017-03-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899250#comment-15899250 ] Nick Pentreath commented on SPARK-19848: Perhaps the ML pipeline components mentioned [here
