[jira] [Commented] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella

2018-01-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335526#comment-16335526 ] Nick Pentreath commented on SPARK-23105: Certain of the ML QA sub-tasks are marked {{Blocker

[jira] [Commented] (SPARK-13964) Feature hashing improvements

2018-01-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334599#comment-16334599 ] Nick Pentreath commented on SPARK-13964: Yes, that's certainly something I'd like to see added

Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Nick Pentreath
At least one of their comparisons is flawed. The Spark ML version of linear regression (*note* they use linear regression and not logistic regression, it is not clear why) uses L-BFGS as the solver, not SGD (as MLlib uses). Hence it is typically going to be slower. However, it should in most

[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-01-19 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332252#comment-16332252 ] Nick Pentreath commented on SPARK-23154: SGTM > Document backwards compatibility guarant

Re: [ML] Allow CrossValidation ParamGrid on SVMWithSGD

2018-01-19 Thread Nick Pentreath
SVMWithSGD sits in the older "mllib" package and is not directly compatible with the DataFrame API. I suppose one could write an ML-API wrapper around it. However, there is LinearSVC in Spark 2.2.x: http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine
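
As a rough illustration of the LinearSVC route mentioned above, the sketch below runs a CrossValidator over a small regParam grid. It assumes Spark 2.2 or later on the classpath and a local SparkSession; the object name, the toy data and the grid values are made up for the example and do not come from the thread.

```scala
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.{DataFrame, SparkSession}

object LinearSvcGridSearch {
  // Hypothetical toy data: a binary label and a two-dimensional feature vector.
  def toyData(spark: SparkSession): DataFrame = {
    val rows = Seq.tabulate(40) { i =>
      val label = if (i % 2 == 0) 1.0 else 0.0
      (label, Vectors.dense(label * 2.0 + (i % 5) * 0.1, 1.0 - label))
    }
    spark.createDataFrame(rows).toDF("label", "features")
  }

  // Cross-validate LinearSVC over a small grid of regularization strengths.
  def tune(training: DataFrame): CrossValidatorModel = {
    val svc = new LinearSVC().setMaxIter(20)

    val grid = new ParamGridBuilder()
      .addGrid(svc.regParam, Array(0.01, 0.1, 1.0))
      .build()

    new CrossValidator()
      .setEstimator(svc)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new BinaryClassificationEvaluator()) // reads the rawPrediction column
      .setNumFolds(3)
      .fit(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("svc-cv").getOrCreate()
    val model = tune(toyData(spark))
    // One average metric per grid point, in grid order.
    println(model.avgMetrics.mkString(", "))
    spark.stop()
  }
}
```

Because LinearSVC is a regular spark.ml Estimator, no wrapper around the older RDD-based SVMWithSGD is needed for this kind of grid search.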

[jira] [Assigned] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-19 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-23048: -- Assignee: Liang-Chi Hsieh > Update mllib docs to replace OneHotEnco

[jira] [Resolved] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-19 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-23048. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20257 [https

[jira] [Resolved] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-19 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-23127. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20293 [https

[jira] [Assigned] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-19 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-23127: -- Assignee: Nick Pentreath > Update FeatureHasher user guide for catCols parame

[jira] [Updated] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-17 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-23127: --- Description: SPARK-22801 added the {{categoricalCols}} parameter and updated the Scala
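
For readers unfamiliar with the parameter this issue documents, here is a minimal sketch of how it might be used, assuming Spark 2.3 on the classpath. The column names, data and object name are hypothetical; the snippet only shows that a numeric column listed in categoricalCols is hashed as a category rather than as a real value.

```scala
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.sql.SparkSession

object HasherCategoricalCols {
  // "rating" is numeric, but listing it in categoricalCols makes the hasher
  // treat each distinct value as a category instead of a real-valued feature.
  def buildHasher(): FeatureHasher =
    new FeatureHasher()
      .setInputCols("rating", "country")
      .setCategoricalCols(Array("rating"))
      .setOutputCol("features")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("hasher").getOrCreate()
    // Hypothetical input rows.
    val df = spark
      .createDataFrame(Seq((1.0, "ZA"), (3.0, "US"), (1.0, "US")))
      .toDF("rating", "country")
    buildHasher().transform(df).show(false)
    spark.stop()
  }
}
```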

[jira] [Created] (SPARK-23127) Update FeatureHasher user guide for catCols parameter

2018-01-17 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-23127: -- Summary: Update FeatureHasher user guide for catCols parameter Key: SPARK-23127 URL: https://issues.apache.org/jira/browse/SPARK-23127 Project: Spark

[jira] [Commented] (SPARK-23060) RDD's apply function

2018-01-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326866#comment-16326866 ] Nick Pentreath commented on SPARK-23060: I agree I don't see enough of a compelling case

[jira] [Resolved] (SPARK-21108) convert LinearSVC to aggregator framework

2018-01-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21108. Resolution: Fixed > convert LinearSVC to aggregator framew

[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: Chunsheng Ji > Update Python API for MultilayerPerceptronClassifierMo

[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: (was: Weichen Xu) > Update Python

[jira] [Assigned] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21856: -- Assignee: Weichen Xu > Update Python API for MultilayerPerceptronClassifierMo

[jira] [Resolved] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-01-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21856. Resolution: Fixed > Update Python API for MultilayerPerceptronClassifierMo

[jira] [Commented] (SPARK-22943) OneHotEncoder supports manual specification of categorySizes

2018-01-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326151#comment-16326151 ] Nick Pentreath commented on SPARK-22943: Does the new estimator & model version of OHE s

[jira] [Resolved] (SPARK-22993) checkpointInterval param doc should be clearer

2018-01-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22993. Resolution: Fixed > checkpointInterval param doc should be clea

[jira] [Assigned] (SPARK-22993) checkpointInterval param doc should be clearer

2018-01-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-22993: -- Assignee: Seth Hendrickson > checkpointInterval param doc should be clea

[jira] [Commented] (SPARK-22871) Add GBT+LR Algorithm in MLlib

2017-12-31 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307210#comment-16307210 ] Nick Pentreath commented on SPARK-22871: Tree-based feature transformation is covered in SPARK

[jira] [Resolved] (SPARK-22801) Allow FeatureHasher to specify numeric columns to treat as categorical

2017-12-31 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22801. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19991 [https

[jira] [Assigned] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-12-31 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-22397: -- Assignee: Huaxin Gao > Add multiple column support to QuantileDiscreti

[jira] [Resolved] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-12-31 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22397. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19715 [https

[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-22799: --- Description: See the related discussion: https://issues.apache.org/jira/browse/SPARK-8418

[jira] [Created] (SPARK-22801) Allow FeatureHasher to specify numeric columns to treat as categorical

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22801: -- Summary: Allow FeatureHasher to specify numeric columns to treat as categorical Key: SPARK-22801 URL: https://issues.apache.org/jira/browse/SPARK-22801 Project

[jira] [Comment Edited] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292320#comment-16292320 ] Nick Pentreath edited comment on SPARK-8418 at 12/15/17 10:40 AM

[jira] [Updated] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-22799: --- Issue Type: Improvement (was: New Feature) > Bucketizer should throw exception if sin

[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292320#comment-16292320 ] Nick Pentreath commented on SPARK-8418: --- Created SPARK-22796, SPARK-22797 and SPARK-22798 to track

[jira] [Created] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22799: -- Summary: Bucketizer should throw exception if single- and multi-column params are both set Key: SPARK-22799 URL: https://issues.apache.org/jira/browse/SPARK-22799

[jira] [Created] (SPARK-22798) Add multiple column support to PySpark StringIndexer

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22798: -- Summary: Add multiple column support to PySpark StringIndexer Key: SPARK-22798 URL: https://issues.apache.org/jira/browse/SPARK-22798 Project: Spark

[jira] [Created] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22797: -- Summary: Add multiple column support to PySpark Bucketizer Key: SPARK-22797 URL: https://issues.apache.org/jira/browse/SPARK-22797 Project: Spark Issue

[jira] [Created] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2017-12-15 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22796: -- Summary: Add multiple column support to PySpark QuantileDiscretizer Key: SPARK-22796 URL: https://issues.apache.org/jira/browse/SPARK-22796 Project: Spark

[jira] [Commented] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-12-13 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288993#comment-16288993 ] Nick Pentreath commented on SPARK-19357: I've thought about this and taken a look at the proposed

[jira] [Resolved] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2017-12-12 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22700. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19894 [https

[jira] [Assigned] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2017-12-12 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-22700: -- Assignee: zhengruifeng > Bucketizer.transform incorrectly drops row containing

[jira] [Resolved] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22690. Resolution: Fixed > Imputer inherit HasOutputC

[jira] [Updated] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-22690: --- Fix Version/s: 2.3.0 > Imputer inherit HasOutputC

[jira] [Assigned] (SPARK-22690) Imputer inherit HasOutputCols

2017-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-22690: -- Assignee: zhengruifeng > Imputer inherit HasOutputC

[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-12-01 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16275426#comment-16275426 ] Nick Pentreath commented on SPARK-8418: --- *1 I’m ok with throwing an exception. We can update

Re: CrossValidation distribution - is it in the roadmap?

2017-11-29 Thread Nick Pentreath
Hi Tomasz Parallel evaluation for CrossValidation and TrainValidationSplit was added for Spark 2.3 in https://issues.apache.org/jira/browse/SPARK-19357 On Wed, 29 Nov 2017 at 16:31 Tomasz Dudek wrote: > Hey, > > is there a way to make the following code: > > val
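
A minimal sketch of what the Spark 2.3 feature looks like from the caller's side, assuming Spark 2.3 on the classpath. The estimator, grid values and parallelism level are arbitrary choices for illustration, not taken from the thread.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object ParallelCrossValidation {
  // Build a CrossValidator that fits up to `threads` candidate models at a time.
  def build(threads: Int): CrossValidator = {
    val lr = new LogisticRegression().setMaxIter(10)

    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
      .build()

    new CrossValidator()
      .setEstimator(lr)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setNumFolds(3)
      .setParallelism(threads) // added in Spark 2.3; the default of 1 keeps evaluation serial
  }
}
```

Calling `ParallelCrossValidation.build(4).fit(trainingData)` on a DataFrame with label and features columns would then evaluate several parameter settings concurrently; TrainValidationSplit exposes the same setter.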

Re: does "Deep Learning Pipelines" scale out linearly?

2017-11-22 Thread Nick Pentreath
For that package specifically it’s best to see if they have a mailing list and if not perhaps ask on github issues. Having said that perhaps the folks involved in that package will reply here too. On Wed, 22 Nov 2017 at 20:03, Andy Davidson wrote: > I am starting

[jira] [Assigned] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-11-10 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-20199: -- Assignee: pralabhkumar > GradientBoostedTreesModel doesn't h

[jira] [Resolved] (SPARK-20199) GradientBoostedTreesModel doesn't have featureSubsetStrategy parameter

2017-11-10 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20199. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18118 [https

Re: Timeline for Spark 2.3

2017-11-09 Thread Nick Pentreath
+1 I think that’s practical On Fri, 10 Nov 2017 at 03:13, Erik Erlandson wrote: > +1 on extending the deadline. It will significantly improve the logistics > for upstreaming the Kubernetes back-end. Also agreed, on the general > realities of reduced bandwidth over the

[jira] [Resolved] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-11-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20542. Resolution: Fixed Fix Version/s: 2.3.0 > Add an API into Bucketizer that can

[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-31 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226448#comment-16226448 ] Nick Pentreath commented on SPARK-13030: I just think it makes sense for OHE to be an Estimator

Re: StringIndexer on several columns in a DataFrame with Scala

2017-10-30 Thread Nick Pentreath
For now, you must follow this approach of constructing a pipeline consisting of a StringIndexer for each categorical column. See https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to allow multiple columns for StringIndexer, which is being worked on currently. The reason
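
The approach described above can be sketched as follows, assuming Spark 2.x on the classpath. The column names, the `_idx` suffix and the object name are made up for the example.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object MultiColumnIndexing {
  // One StringIndexer per categorical column, chained in a single Pipeline.
  def indexAll(categoricalCols: Seq[String]): Pipeline = {
    val stages: Array[PipelineStage] = categoricalCols.map { c =>
      val indexer: PipelineStage =
        new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
      indexer
    }.toArray
    new Pipeline().setStages(stages)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("indexing").getOrCreate()
    // Hypothetical input with two categorical string columns.
    val df = spark
      .createDataFrame(Seq(("red", "S"), ("blue", "M"), ("red", "L")))
      .toDF("color", "size")
    val indexed = indexAll(Seq("color", "size")).fit(df).transform(df)
    indexed.show()
    spark.stop()
  }
}
```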

[jira] [Updated] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-22397: --- Description: Once SPARK-20542 adds multi column support to {{Bucketizer}}, we can add multi

[jira] [Commented] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224459#comment-16224459 ] Nick Pentreath commented on SPARK-22397: [~huaxing] is working on this and will submit a PR

[jira] [Created] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22397: -- Summary: Add multiple column support to QuantileDiscretizer Key: SPARK-22397 URL: https://issues.apache.org/jira/browse/SPARK-22397 Project: Spark Issue

[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-10-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16224454#comment-16224454 ] Nick Pentreath commented on SPARK-8418: --- Adding SPARK-13030, since the new version

[jira] [Commented] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219006#comment-16219006 ] Nick Pentreath commented on SPARK-22346: SPARK-19141 mentions another option which may work

[jira] [Commented] (SPARK-22331) Strength consistency for supporting string params: case-insensitive or not

2017-10-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214747#comment-16214747 ] Nick Pentreath commented on SPARK-22331: I can't think of any examples offhand where case

[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207133#comment-16207133 ] Nick Pentreath commented on SPARK-22289: I think option (2) is the more general fix here

[jira] [Assigned] (SPARK-20542) Add an API into Bucketizer that can bin a lot of columns all at once

2017-10-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-20542: -- Assignee: Liang-Chi Hsieh > Add an API into Bucketizer that can bin a lot of colu

[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2017-10-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196673#comment-16196673 ] Nick Pentreath commented on SPARK-10802: SPARK-20679 has been completed for the new ML API. I've

[jira] [Resolved] (SPARK-10802) Let ALS recommend for subset of data

2017-10-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-10802. Resolution: Won't Fix > Let ALS recommend for subset of d

[jira] [Assigned] (SPARK-20679) Let ML ALS recommend for a subset of users/items

2017-10-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-20679: -- Assignee: Nick Pentreath > Let ML ALS recommend for a subset of users/it

[jira] [Resolved] (SPARK-20679) Let ML ALS recommend for a subset of users/items

2017-10-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-20679. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18748 [https

[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector

2017-10-08 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196288#comment-16196288 ] Nick Pentreath commented on SPARK-22115: Best keep it private for now. There's been lot

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
eases/spark-release-2-2-0.html#known-issues > before due to this reason. > I believe It should be fine and probably we should note if possible. I > believe this should not be a regression anyway as, if I understood > correctly, it was there from the very first place. > > Thank

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Checked sigs & hashes. Tested on RHEL build/mvn -Phadoop-2.7 -Phive -Pyarn test passed Python tests passed I ran R tests and am getting some failures: https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to recall similar issues on a previous release but I thought it was

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-04 Thread Nick Pentreath
Ah right! Was using a new cloud instance and didn't realize I was logged in as root! thanks On Tue, 3 Oct 2017 at 21:13 Marcelo Vanzin <van...@cloudera.com> wrote: > Maybe you're running as root (or the admin account on your OS)? > > On Tue, Oct 3, 2017 at 12:12 PM, Nick Pentreath

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Nick Pentreath
Hmm I'm consistently getting this error in core tests: - SPARK-3697: ignore directories that cannot be read. *** FAILED *** 2 was not equal to 1 (FsHistoryProviderSuite.scala:146) Anyone else? Any insight? Perhaps it's my set up. >> >> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau

[jira] [Commented] (SPARK-22115) Add operator for linalg Matrix and Vector

2017-10-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187740#comment-16187740 ] Nick Pentreath commented on SPARK-22115: Do we plan to make this private? Or are you suggesting

Re: Should Flume integration be behind a profile?

2017-10-02 Thread Nick Pentreath
I'd agree with #1 or #2. Deprecation now seems fine. Perhaps this should be raised on the user list also? And perhaps it makes sense to look at moving the Flume support into Apache Bahir if there is interest (I've cc'ed Bahir dev list here)? That way the current state of the connector could keep

Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Nick Pentreath
Congratulations! >> >> Matei Zaharia wrote >> > Hi all, >> > >> > The Spark PMC recently added Tejas Patil as a committer on the >> > project. Tejas has been contributing across several areas of Spark for >> > a while, focusing especially on scalability issues and SQL. Please >> > join me in

Re: How to run MLlib's word2vec in CBOW mode?

2017-09-28 Thread Nick Pentreath
MLlib currently doesn't support CBOW - there is an open PR for it (see https://issues.apache.org/jira/browse/SPARK-20372). On Thu, 28 Sep 2017 at 09:56 pun wrote: > Hello, > My understanding is that word2vec can be run in two modes: > >- continuous bag-of-words

[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-09-25 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180073#comment-16180073 ] Nick Pentreath commented on SPARK-13030: Yes definitely needs to support multi column. [~viirya

[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-09-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177939#comment-16177939 ] Nick Pentreath commented on SPARK-13030: It's ugly but we can introduce a new class

[jira] [Resolved] (SPARK-22061) Add pipeline model of SVM

2017-09-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-22061. Resolution: Won't Fix > Add pipeline model of

[jira] [Commented] (SPARK-22061) Add pipeline model of SVM

2017-09-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175124#comment-16175124 ] Nick Pentreath commented on SPARK-22061: Agreed, this already exists. I closed this issue. >

[jira] [Assigned] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21958: -- Assignee: Travis Hegner > Attempting to save large Word2Vec model hangs dri

[jira] [Resolved] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21958. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19191 [https

[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167806#comment-16167806 ] Nick Pentreath commented on SPARK-22021: Why a JavaScript function? I think this is not a good

[jira] [Commented] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160862#comment-16160862 ] Nick Pentreath commented on SPARK-21958: Seems like your proposal could improve things - but yeah

[jira] [Assigned] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19357: -- Assignee: Bryan Cutler > Parallel Model Evaluation for ML Tuning: Sc

[jira] [Resolved] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala

2017-09-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19357. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 16774 [https

[jira] [Comment Edited] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes

2017-09-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154901#comment-16154901 ] Nick Pentreath edited comment on SPARK-21926 at 9/6/17 6:54 AM: For #2

[jira] [Commented] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes

2017-09-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154901#comment-16154901 ] Nick Pentreath commented on SPARK-21926: For #2, (a) is definitely the correct solution. > S

[jira] [Resolved] (SPARK-15790) Audit @Since annotations in ML

2017-09-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-15790. Resolution: Fixed > Audit @Since annotations in

Re: isCached

2017-09-01 Thread Nick Pentreath
> On Fri, Sep 1, 2017 at 11:46 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > >> Dataset does have storageLevel. So you can use isCached = (storageLevel >> != StorageLevel.NONE) as a test. >> >> Arguably isCached could be added to dataset too, sh

Re: isCached

2017-09-01 Thread Nick Pentreath
Dataset does have storageLevel. So you can use isCached = (storageLevel != StorageLevel.NONE) as a test. Arguably isCached could be added to dataset too, shouldn't be a controversial change. On Fri, 1 Sep 2017 at 17:31, Nathan Kronenfeld wrote: > I'm currently
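
The test suggested above can be wrapped in a small helper. This is only a sketch, assuming Spark 2.1 or later (where Dataset.storageLevel is available); the object and method names are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.storage.StorageLevel

object CacheCheck {
  // A Dataset counts as cached when its storage level is anything other than NONE.
  def isCached(ds: Dataset[_]): Boolean =
    ds.storageLevel != StorageLevel.NONE

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cache-check").getOrCreate()
    val ds = spark.range(10)
    println(s"before cache(): ${isCached(ds)}")
    ds.cache()
    println(s"after cache(): ${isCached(ds)}")
    ds.unpersist()
    spark.stop()
  }
}
```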

Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for each release. I think generally we catch all breaking and most major behavior changes On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun wrote: > +1 > > On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li

[jira] [Resolved] (SPARK-21469) Add doc and example for FeatureHasher

2017-08-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21469. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19024 [https

[jira] [Assigned] (SPARK-21469) Add doc and example for FeatureHasher

2017-08-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21469: -- Assignee: Bryan Cutler > Add doc and example for FeatureHas

[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-08-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138007#comment-16138007 ] Nick Pentreath commented on SPARK-21086: Ok - I commented on the PR. Agree it makes sense

[jira] [Commented] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-08-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136630#comment-16136630 ] Nick Pentreath commented on SPARK-21799: Refer to SPARK-18608 and SPARK-19422. There is some work

[jira] [Assigned] (SPARK-21468) FeatureHasher Python API

2017-08-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21468: -- Assignee: Nick Pentreath > FeatureHasher Python

[jira] [Resolved] (SPARK-21468) FeatureHasher Python API

2017-08-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21468. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18970 [https

[jira] [Commented] (SPARK-4981) Add a streaming singular value decomposition

2017-08-18 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16132118#comment-16132118 ] Nick Pentreath commented on SPARK-4981: --- Hey folks, as interesting as this would be, I think it's

[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching

2017-08-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128512#comment-16128512 ] Nick Pentreath commented on SPARK-21742: Isn't the solution to set a fixed seed for the randomly

[jira] [Resolved] (SPARK-13969) Extend input format that feature hashing can handle

2017-08-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-13969. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18513 [https

[jira] [Assigned] (SPARK-13969) Extend input format that feature hashing can handle

2017-08-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-13969: -- Assignee: Nick Pentreath > Extend input format that feature hashing can han

[jira] [Commented] (SPARK-21723) Can't write LibSVM - key not found: numFeatures

2017-08-15 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126939#comment-16126939 ] Nick Pentreath commented on SPARK-21723: Yes, we should definitely be able to write LibSVM format

[jira] [Commented] (SPARK-21086) CrossValidator, TrainValidationSplit should preserve all models after fitting

2017-08-03 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112623#comment-16112623 ] Nick Pentreath commented on SPARK-21086: I just want to understand _why_ folks want to keep all

[jira] [Commented] (SPARK-21624) Optimize communication cost of RF/GBT/DT

2017-08-03 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112489#comment-16112489 ] Nick Pentreath commented on SPARK-21624: I wonder if it makes sense to make it a {{Vector
