[jira] [Updated] (SPARK-18274) Memory leak in PySpark StringIndexer

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18274: --- Fix Version/s: (was: 2.1.1) 2.1.0 > Memory leak in PySp

[jira] [Commented] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731415#comment-15731415 ] Nick Pentreath commented on SPARK-18318: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Commented] (SPARK-18319) ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731413#comment-15731413 ] Nick Pentreath commented on SPARK-18319: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Updated] (SPARK-18319) ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18319: --- Fix Version/s: (was: 2.2.0) 2.1.0 > ML, Graph 2.1 QA:

[jira] [Updated] (SPARK-18319) ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18319: --- Fix Version/s: (was: 2.1.1) > ML, Graph 2.1 QA: API: Experimental, DeveloperApi, fi

[jira] [Updated] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18318: --- Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0

[jira] [Updated] (SPARK-18592) Move DT/RF/GBT Param setter methods to subclasses

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18592: --- Fix Version/s: (was: 2.2.0) > Move DT/RF/GBT Param setter methods to subclas

[jira] [Commented] (SPARK-18320) ML 2.1 QA: API: Python API coverage

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731411#comment-15731411 ] Nick Pentreath commented on SPARK-18320: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Updated] (SPARK-18324) ML, Graph 2.1 QA: Programming guide update and migration guide

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18324: --- Fix Version/s: (was: 2.2.0) > ML, Graph 2.1 QA: Programming guide update and migrat

[jira] [Updated] (SPARK-18408) API Improvements for LSH

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18408: --- Fix Version/s: (was: 2.2.0) > API Improvements for

[jira] [Updated] (SPARK-18320) ML 2.1 QA: API: Python API coverage

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18320: --- Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0

[jira] [Commented] (SPARK-18408) API Improvements for LSH

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731407#comment-15731407 ] Nick Pentreath commented on SPARK-18408: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Updated] (SPARK-18324) ML, Graph 2.1 QA: Programming guide update and migration guide

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18324: --- Fix Version/s: (was: 2.1.1) 2.1.0 > ML, Graph 2.1 QA: Programm

[jira] [Commented] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731408#comment-15731408 ] Nick Pentreath commented on SPARK-18366: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Commented] (SPARK-18324) ML, Graph 2.1 QA: Programming guide update and migration guide

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731410#comment-15731410 ] Nick Pentreath commented on SPARK-18324: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Commented] (SPARK-18612) Leaked broadcasted variable Mllib

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731402#comment-15731402 ] Nick Pentreath commented on SPARK-18612: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Updated] (SPARK-18592) Move DT/RF/GBT Param setter methods to subclasses

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18592: --- Fix Version/s: (was: 2.1.1) 2.1.0 > Move DT/RF/GBT Param set

[jira] [Commented] (SPARK-18592) Move DT/RF/GBT Param setter methods to subclasses

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731406#comment-15731406 ] Nick Pentreath commented on SPARK-18592: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Updated] (SPARK-18612) Leaked broadcasted variable Mllib

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18612: --- Fix Version/s: (was: 2.1.1) 2.1.0 > Leaked broadcasted variable Ml

[jira] [Updated] (SPARK-18408) API Improvements for LSH

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18408: --- Fix Version/s: (was: 2.1.1) 2.1.0 > API Improvements for

[jira] [Updated] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18366: --- Fix Version/s: (was: 2.1.1) 2.1.0 > Add handleInvalid to Pysp

Re: unhelpful exception thrown on predict() when ALS trained model doesn't contain user or product?

2016-12-06 Thread Nick Pentreath
Indeed, it's being tracked here: https://issues.apache.org/jira/browse/SPARK-18230 though no PR has been opened yet. On Tue, 6 Dec 2016 at 13:36 chris snow wrote: > I'm using the MatrixFactorizationModel.predict() method and encountered > the following exception: > > Name:

[jira] [Commented] (SPARK-18704) CrossValidator should preserve more tuning statistics

2016-12-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15721448#comment-15721448 ] Nick Pentreath commented on SPARK-18704: Yeah, I like this idea. I've also been finding

[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2016-11-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711095#comment-15711095 ] Nick Pentreath commented on SPARK-12347: Since the PR is still WIP and this is not a blocker

[jira] [Updated] (SPARK-12347) Write script to run all MLlib examples for testing

2016-11-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-12347: --- Target Version/s: 2.2.0 (was: 2.1.0) > Write script to run all MLlib examples for test

[jira] [Updated] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18366: --- Assignee: Sandeep Singh > Add handleInvalid to Pyspark for QuantileDiscreti

[jira] [Updated] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18366: --- Fix Version/s: (was: 2.1.0) 2.1.1 > Add handleInvalid to Pysp

[jira] [Resolved] (SPARK-18366) Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer

2016-11-30 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-18366. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15817 [https

Re: Why don't we imp some adaptive learning rate methods, such as adadelat, adam?

2016-11-30 Thread Nick Pentreath
Check out https://github.com/VinceShieh/Spark-AdaOptimizer On Wed, 30 Nov 2016 at 10:52 WangJianfei wrote: > Hi devs: > Normally, the adaptive learning rate methods can have a faster > convergence > than standard SGD, so why don't we implement them? > see the link

[jira] [Commented] (SPARK-18616) Pure Python Implementation of MLWritable for use in Pipeline

2016-11-28 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704472#comment-15704472 ] Nick Pentreath commented on SPARK-18616: Just a note that generally committers set Target Version

[jira] [Updated] (SPARK-18616) Pure Python Implementation of MLWritable for use in Pipeline

2016-11-28 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18616: --- Target Version/s: (was: 2.0.2) > Pure Python Implementation of MLWritable for

[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-11-28 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15701839#comment-15701839 ] Nick Pentreath commented on SPARK-18608: I've also been meaning to log this for a little while

[jira] [Created] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-11-28 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-18608: -- Summary: Spark ML algorithms that check RDD cache level for internal caching double-cache data Key: SPARK-18608 URL: https://issues.apache.org/jira/browse/SPARK-18608

Re: how to print auc & prc for GBTClassifier, which is okay for RandomForestClassifier

2016-11-28 Thread Nick Pentreath
This is because currently GBTClassifier doesn't extend the ClassificationModel abstract class, which in turn has the rawPredictionCol and related methods for generating that column. I'm actually not sure off hand whether this was because the GBT implementation could not produce the raw prediction

[jira] [Updated] (SPARK-18450) Add AND-amplification to Locality Sensitive Hashing

2016-11-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18450: --- Component/s: ML > Add AND-amplification to Locality Sensitive Hash

[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH

2016-11-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18454: --- Component/s: ML > Changes to fix Nearest Neighbor Search for

[jira] [Updated] (SPARK-18408) API Improvements for LSH

2016-11-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18408: --- Component/s: ML > API Improvements for

[jira] [Resolved] (SPARK-18456) Use matrix abstraction for LogisitRegression coefficients during training

2016-11-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-18456. Resolution: Fixed Assignee: Seth Hendrickson Fix Version/s: 2.1.0 >

[jira] [Commented] (SPARK-18023) Adam optimizer

2016-11-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682471#comment-15682471 ] Nick Pentreath commented on SPARK-18023: Linking SPARK-17136 which is really a blocker for adding

[jira] [Commented] (SPARK-16377) Spark MLlib: MultilayerPerceptronClassifier - error while training

2016-11-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682458#comment-15682458 ] Nick Pentreath commented on SPARK-16377: Is this still a bug? As per your above comment seems we

[jira] [Commented] (SPARK-6346) Use faster converging optimization method in MLlib

2016-11-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15682455#comment-15682455 ] Nick Pentreath commented on SPARK-6346: --- I think we can close this ticket? It's pretty old

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it would also be super useful :) On Fri, 18 Nov 2016 at 05:29 Holden Karau wrote: > I've been working on a blog post around this and hope to have it published > early next month > > On Nov 17,

[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml

2016-11-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670210#comment-15670210 ] Nick Pentreath commented on SPARK-18441: Yes, it would be good to understand what this is all

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
alyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) On Mon, Nov 14, 2016 at 1:44 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote: DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. But you may prefer to just

Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the doubles from the test score and label DF. But you may prefer to just use spark.ml evaluators, which work with DataFrames. Try BinaryClassificationEvaluator. On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma
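
The "map to extract the doubles" step can be pictured without Spark. The following is a minimal plain-Python sketch, not Spark code: `Row` here is a hypothetical namedtuple standing in for a Spark Row, and `to_score_and_label` and `area_under_roc` are illustrative names. It converts row-like records into (score, label) pairs of floats and computes the area under the ROC curve by pairwise comparison.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row holding a score and a label.
Row = namedtuple("Row", ["score", "label"])

def to_score_and_label(row):
    """Extract plain floats from a row-like record (the 'map' step)."""
    return (float(row.score), float(row.label))

def area_under_roc(pairs):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in pairs if y == 1.0]
    neg = [s for s, y in pairs if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

rows = [Row(0.9, 1), Row(0.8, 0), Row(0.7, 1), Row(0.2, 0)]
pairs = [to_score_and_label(r) for r in rows]
auc = area_under_roc(pairs)
```

The pairwise loop is quadratic and only meant to show what the metric measures; an evaluator that works directly on DataFrames, as suggested in the message, avoids the manual conversion altogether.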

Re: Nearest neighbour search

2016-11-14 Thread Nick Pentreath
LSH-based NN search and similarity join should be out in Spark 2.1 - there's a little work being done still to clear up the APIs and some functionality. Check out https://issues.apache.org/jira/browse/SPARK-5992 On Mon, 14 Nov 2016 at 16:12, Kevin Mellott wrote: >

Re: Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread Nick Pentreath
For now OHE supports a single column. So you have to have 1000 OHE in a pipeline. However you can add them programmatically so it is not too bad. If the cardinality of each feature is quite low, it should be workable. After that use VectorAssembler to stitch the vectors together (which accepts
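
The idea of building one encoder per column in a loop and then stitching the per-column vectors together can be sketched in plain Python. This is an illustration of the pattern only, under the assumption of small in-memory data; `fit_one_hot`, `one_hot`, and `encode_rows` are hypothetical helper names, and the sketch does not reproduce the options or output format of Spark's own encoder and assembler.

```python
def fit_one_hot(values):
    """Learn a category -> position mapping for one column (sorted for determinism)."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}

def one_hot(value, mapping):
    """Return a dense 0/1 vector with a single 1 at the category's position."""
    vec = [0.0] * len(mapping)
    vec[mapping[value]] = 1.0
    return vec

def encode_rows(rows, columns):
    """Fit one encoder per column, then concatenate the per-column vectors."""
    encoders = {c: fit_one_hot([r[c] for r in rows]) for c in columns}
    out = []
    for r in rows:
        features = []
        for c in columns:
            features.extend(one_hot(r[c], encoders[c]))
        out.append(features)
    return out

rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "L"},
    {"color": "red", "size": "L"},
]
encoded = encode_rows(rows, ["color", "size"])
```

Because the encoders are created in a dictionary comprehension over the column names, going from two columns to a thousand changes only the list passed in, which is the point made in the message about adding stages programmatically.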

[jira] [Issue Comment Deleted] (SPARK-18341) Eliminate use of SingularMatrixException in WeightedLeastSquares logic

2016-11-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18341: --- Comment: was deleted (was: Just for interest - why is an error code more desirable

[jira] [Commented] (SPARK-18341) Eliminate use of SingularMatrixException in WeightedLeastSquares logic

2016-11-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657117#comment-15657117 ] Nick Pentreath commented on SPARK-18341: Just for interest - why is an error code more desirable

[jira] [Commented] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll

2016-11-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636190#comment-15636190 ] Nick Pentreath commented on SPARK-18235: This duplicates SPARK-13857. Please feel free to comment

[jira] [Updated] (SPARK-17772) Add helper testing methods for instance weighting

2016-11-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-17772: --- Assignee: Seth Hendrickson Target Version/s: 2.1.0 > Add helper testing meth

[jira] [Resolved] (SPARK-17138) Python API for multinomial logistic regression

2016-11-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-17138. Resolution: Fixed Assignee: Weichen Xu Fix Version/s: 2.1.0 > Python

[jira] [Updated] (SPARK-18060) Avoid unnecessary standardization in multinomial logistic regression training

2016-11-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18060: --- Assignee: Seth Hendrickson Target Version/s: 2.1.0 > Avoid unnecess

Re: Question about using collaborative filtering in MLlib

2016-11-03 Thread Nick Pentreath
I have a PR for it - https://github.com/apache/spark/pull/12574 Sadly I've been tied up and haven't had a chance to work further on it. The main issue outstanding is deciding on the transform semantics as well as performance testing. Any comments / feedback welcome especially on transform

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
Oh also you mention 20 partitions. Is that how many you have? How many ratings? It may be worth trying to repartition to a larger number of partitions. On Fri, 21 Oct 2016 at 17:04, Nick Pentreath <nick.pentre...@gmail.com> wrote: > I wonder if you can try with setting different blocks

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
t was going out of memory with the default size too. > > On Fri, Oct 21, 2016 at 5:31 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > Did you try not setting the blocks parameter? It will then try to set it > automatically for your data size. > On Fri, 21 Oct 2016

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
lock size to 20,000 also results in the same. So there is > something I don't understand about how this is working. > > BTW, I am trying to find 50 latent factors (rank = 50). > > Do you have some insights as to how I should tweak things to get this > working? > > Thanks, > Nik >

Re: [Spark ML] Using GBTClassifier in OneVsRest

2016-10-21 Thread Nick Pentreath
Currently no - GBT implements the predictors, not the classifier interface. It might be possible to wrap it in a wrapper that extends the Classifier trait. Hopefully GBT will support multi-class at some point. But you can use RandomForest which does support multi-class. On Fri, 21 Oct 2016 at

Re: ALS.trainImplicit block sizes

2016-10-21 Thread Nick Pentreath
The blocks params will set both user and item blocks. Spark 2.0 supports user and item blocks for PySpark: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation On Fri, 21 Oct 2016 at 08:12 Nikhil Mishra wrote: > Hi, > > I

Re: Making more features in Logistic Regression

2016-10-18 Thread Nick Pentreath
You can use the PolynomialExpansion in Spark ML ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion ) On Tue, 18 Oct 2016 at 21:47 miro wrote: > Yes, I was thinking going down this road: > > >
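
For readers unfamiliar with what a polynomial expansion produces, here is a minimal plain-Python sketch. It is an assumption-laden illustration, not the Spark implementation: `polynomial_expansion` is a hypothetical function, and the ordering of the output terms here is not claimed to match Spark's.

```python
from itertools import combinations_with_replacement

def polynomial_expansion(features, degree):
    """Return all monomials of the inputs with total degree 1..degree.

    The constant term is omitted. Terms are ordered by degree, then by
    the index combinations that itertools generates.
    """
    expanded = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(features)), d):
            term = 1.0
            for i in combo:
                term *= features[i]
            expanded.append(term)
    return expanded

# For (x, y) = (2, 3) and degree 2 this yields x, y, x*x, x*y, y*y.
expanded = polynomial_expansion([2.0, 3.0], 2)
```

The number of output features grows quickly with the degree and the input dimension, so this kind of expansion is usually practical only for a modest number of input features.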

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-11 Thread Nick Pentreath
; > > > Sincerely, > > > > DB Tsai > > -- > > Web: https://www.dbtsai.com > > PGP Key ID: 0xAF08DF8D > > > > > > On Thu, Oct 6, 2016 at 4:09 AM, Nick Pentreath <nick.pentre...@gmail.com>

[jira] [Comment Edited] (SPARK-17784) Add fromCenters method for KMeans

2016-10-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564946#comment-15564946 ] Nick Pentreath edited comment on SPARK-17784 at 10/11/16 8:59 AM: -- It's

[jira] [Commented] (SPARK-17784) Add fromCenters method for KMeans

2016-10-11 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15564946#comment-15564946 ] Nick Pentreath commented on SPARK-17784: It's actually to create a new `KMeans` estimator I

[jira] [Updated] (SPARK-14501) spark.ml parity for fpm - frequent items

2016-10-10 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-14501: --- Target Version/s: 2.1.0 > spark.ml parity for fpm - frequent it

Re: why spark ml package doesn't contain svm algorithm

2016-09-27 Thread Nick Pentreath
There is a JIRA and PR for it - https://issues.apache.org/jira/browse/SPARK-14709 On Tue, 27 Sep 2016 at 09:10 hxw黄祥为 wrote: > I have found that the spark ml package implements the naivebayes algorithm and the > source code is simple. > > I am confused why the spark ml package doesn’t

[jira] [Closed] (SPARK-17407) Unable to update structured stream from CSV

2016-09-27 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-17407. -- Resolution: Not A Problem > Unable to update structured stream from

Re: Spark MLlib ALS algorithm

2016-09-24 Thread Nick Pentreath
The scale factor was only to scale up the number of ratings in the dataset for performance testing purposes, to illustrate the scalability of Spark ALS. It is not something you would normally do on your training dataset. On Fri, 23 Sep 2016 at 20:07, Roshani Nagmote

Re: Similar Items

2016-09-21 Thread Nick Pentreath
Sorry, the original repo: https://github.com/karlhigley/spark-neighbors On Wed, 21 Sep 2016 at 13:09 Nick Pentreath <nick.pentre...@gmail.com> wrote: > I should also point out another library I had not come across before : > https://github.com/sethah/spark-neighbors > > >

Re: Similar Items

2016-09-21 Thread Nick Pentreath
in a mere 65 seconds! Thanks so much for the help! > > On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com> > wrote: > >> Thanks Nick - those examples will help a ton!! >> >> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath < >> nick

Re: Similar Items

2016-09-20 Thread Nick Pentreath
documents 1 and 2 need to be compared to one > another (via cosine similarity) because they both contain the token > 'hockey'. I will investigate the methods that you recommended to see if > they may resolve our problem. > > Thanks, > Kevin > > On Tue, Sep 20, 2016 at 1:45 AM,

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search or other learning-to-rank use cases, so it should be a little more generic than pure recommender settings.
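
To make "arbitrary relevance scores" concrete, here is a small plain-Python sketch of NDCG with graded relevance. It is one common formulation, using a gain of 2^rel - 1 and a log2 discount, and is not taken from Spark's RankingMetrics; `dcg` and `ndcg_at_k` are hypothetical names.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: sum of (2^rel - 1) / log2(position + 1)."""
    return sum((2.0 ** rel - 1.0) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg_at_k(predicted_ids, relevance_by_id, k):
    """NDCG@k for a ranked list, given graded relevance per item id.

    Items missing from relevance_by_id are treated as relevance 0.
    """
    gains = [relevance_by_id.get(item, 0.0) for item in predicted_ids[:k]]
    ideal = sorted(relevance_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

relevance = {"a": 3.0, "b": 2.0, "c": 0.0}
perfect = ndcg_at_k(["a", "b", "c"], relevance, 3)
reversed_score = ndcg_at_k(["c", "b", "a"], relevance, 3)
```

With binary relevance (every relevant item scored 1) the gain term reduces to 1 for relevant items and 0 otherwise, so the graded version is a generalization of the binary case rather than a different metric.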

Re: Similar Items

2016-09-20 Thread Nick Pentreath
How many products do you have? How large are your vectors? It could be that SVD / LSA could be helpful. But if you have many products then trying to compute all-pair similarity with brute force is not going to be scalable. In this case you may want to investigate hashing (LSH) techniques. On
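
As a rough picture of why hashing avoids the all-pairs comparison, here is a minimal plain-Python sketch of random-hyperplane LSH for cosine similarity. It is a toy under stated assumptions (in-memory data, a single hash table, a fixed seed) and not the Spark implementation; `make_hyperplanes`, `signature`, and `candidate_pairs` are hypothetical names.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )

def candidate_pairs(vectors, hyperplanes):
    """Group ids by signature; only ids sharing a bucket become candidates."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(key)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs

vectors = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],     # same direction as "a"
    "c": [-1.0, -2.0, -3.0],  # opposite direction to "a"
}
planes = make_hyperplanes(dim=3, num_bits=8)
pairs = candidate_pairs(vectors, planes)
```

Exact similarity is then computed only for the candidate pairs. A real setup would typically use several hash tables so that similar but not identical directions still have a good chance of colliding in at least one of them.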

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-19 Thread Nick Pentreath
Try als.setCheckpointInterval ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS@setCheckpointInterval(checkpointInterval:Int):ALS.this.type ) On Mon, 19 Sep 2016 at 20:01 Roshani Nagmote wrote: > Hello Sean, > > Can

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-19 Thread Nick Pentreath
The PR already exists for adding RankingEvaluator to ML - https://github.com/apache/spark/pull/12461. I need to revive and review it. DB, your review would be welcome too (and also on https://github.com/apache/spark/issues/12574 which has implications for the semantics of ranking metrics in the

Re: weightCol doesn't seem to be handled properly in PySpark

2016-09-12 Thread Nick Pentreath
Could you create a JIRA ticket for it? https://issues.apache.org/jira/browse/SPARK On Thu, 8 Sep 2016 at 07:50 evanzamir wrote: > When I am trying to use LinearRegression, it seems that unless there is a > column specified with weights, it will raise a py4j error. Seems

Re: Organizing Spark ML example packages

2016-09-12 Thread Nick Pentreath
lia...@gmail.com> wrote: >> >>> This sounds good to me, and it will make ML examples more neatly. >>> >>> 2016-04-14 5:28 GMT-07:00 Nick Pentreath <nick.pentre...@gmail.com>: >>> >>>> Hey Spark devs >>>> >>>>

[jira] [Commented] (SPARK-17479) Fix LDA example in docs

2016-09-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478058#comment-15478058 ] Nick Pentreath commented on SPARK-17479: I just ran Scala, Java and Python examples of {{ml

[jira] [Commented] (SPARK-17479) Fix LDA example in docs

2016-09-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478050#comment-15478050 ] Nick Pentreath commented on SPARK-17479: I do see the data file: https://github.com/apache/spark

[jira] [Commented] (SYSTEMML-903) [Python API] Sparse to dense conversion is not yet implemented

2016-09-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SYSTEMML-903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15477670#comment-15477670 ] Nick Pentreath commented on SYSTEMML-903: - cc [~deron] [~niketanpansare] > [Python API] Spa

[jira] [Updated] (SYSTEMML-903) [Python API] Sparse to dense conversion is not yet implemented

2016-09-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SYSTEMML-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SYSTEMML-903: Description: Hitting this exception when doing something (admittedly trivial) in Python

[jira] [Updated] (SYSTEMML-903) [Python API] Sparse to dense conversion is not yet implemented

2016-09-09 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SYSTEMML-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SYSTEMML-903: Summary: [Python API] Sparse to dense conversion is not yet implemented (was: [Python

[jira] [Created] (SYSTEMML-903) [Python APISparse to dense conversion is not yet implemented

2016-09-09 Thread Nick Pentreath (JIRA)
Nick Pentreath created SYSTEMML-903: --- Summary: [Python APISparse to dense conversion is not yet implemented Key: SYSTEMML-903 URL: https://issues.apache.org/jira/browse/SYSTEMML-903 Project

Re: How to convert an ArrayType to DenseVector within DataFrame?

2016-09-08 Thread Nick Pentreath
You can use a udf like this: [PySpark shell startup banner, version 2.0.0] Using Python version 2.7.12 (default, Jul 2 2016 17:43:17) SparkSession available as 'spark'. In [1]: from

[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

2016-09-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471563#comment-15471563 ] Nick Pentreath commented on SPARK-17094: It's true that constructor doesn't exist. It could

Re: I noticed LinearRegression sometimes produces negative R^2 values

2016-09-06 Thread Nick Pentreath
That does seem strange. Can you provide an example to reproduce? On Tue, 6 Sep 2016 at 21:49 evanzamir wrote: > Am I misinterpreting what r2() in the LinearRegression Model summary means? > By definition, R^2 should never be a negative number! > > > > -- > View this
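
On the quoted claim that R^2 can never be negative: under the common definition R^2 = 1 - SS_res / SS_tot, the value is negative whenever the predictions fit worse than simply predicting the mean of the targets, which can happen, for example, on data the model was not fit to or when no intercept is fit. The plain-Python sketch below only illustrates that arithmetic; it does not diagnose the behaviour reported in this thread, and `r2` is a hypothetical helper.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0]
r2_perfect = r2(y, [1.0, 2.0, 3.0])  # exact predictions
r2_mean = r2(y, [2.0, 2.0, 2.0])     # always predict the mean
r2_bad = r2(y, [3.0, 2.0, 1.0])      # predictions in reverse order
```

Here SS_tot is 2, so the reversed predictions (SS_res = 8) give 1 - 8/2 = -3, while predicting the mean gives exactly 0.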

[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466957#comment-15466957 ] Nick Pentreath commented on SPARK-17400: Could you explain further why you want to min-max scale

[jira] [Comment Edited] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-05 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464360#comment-15464360 ] Nick Pentreath edited comment on SPARK-17400 at 9/5/16 7:42 AM: Can you

[jira] [Commented] (SPARK-17400) MinMaxScaler.transform() outputs DenseVector by default, which causes poor performance

2016-09-05 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464360#comment-15464360 ] Nick Pentreath commented on SPARK-17400: Can you comment more on the performance issue - are you

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Nick Pentreath
at 15:37 Nick Pentreath <nick.pentre...@gmail.com> wrote: > Right now you are correct that Spark ML APIs do not support predicting on > a single instance (whether Vector for the models or a Row for a pipeline). > > See https://issues.apache.org/jira/browse/SPARK-10413 and > http

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Nick Pentreath
Right now you are correct that Spark ML APIs do not support predicting on a single instance (whether Vector for the models or a Row for a pipeline). See https://issues.apache.org/jira/browse/SPARK-10413 and https://issues.apache.org/jira/browse/SPARK-16431 (duplicate) for some discussion. There

Re: Equivalent of "predict" function from LogisticRegressionWithLBFGS in OneVsRest with LogisticRegression classifier (Spark 2.0)

2016-08-29 Thread Nick Pentreath
Try this: val df = spark.createDataFrame(Seq(Vectors.dense(Array(10, 590, 190, 700))).map(Tuple1.apply)).toDF("features") On Sun, 28 Aug 2016 at 11:06 yaroslav wrote: > Hi, > > We use such kind of logic for training our model > > val model = new

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
y and >> it will work. I was wondering if I could do this in Spark/Scala with my >> limited knowledge >> >> Cheers >> >> Dr Mich Talebzadeh >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh

Re: Breaking down text String into Array elements

2016-08-23 Thread Nick Pentreath
What is "text"? i.e. what is the "val text = ..." definition? If text is a String itself then indeed sc.parallelize(Array(text)) is doing the correct thing in this case. On Tue, 23 Aug 2016 at 19:42 Mich Talebzadeh wrote: > I am sure someone know this :) > > Created

Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nick Pentreath
It's not impossible for a Transformer to output multiple columns - it's simply that none of the current ones do. It's true that it might be a relatively less common use case in general. But take StringIndexer for example. It turns strings (categorical features) into ints (0-based indexes).
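The string-to-index mapping described above can be sketched in plain Python. This is an illustration of the idea only, not Spark's implementation: `fit_string_indexer` is a hypothetical helper, and while ordering labels by descending frequency mirrors StringIndexer's documented default, the alphabetical tie-break used here is an assumption of the sketch.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct string to a 0-based index (hypothetical helper).

    The most frequent label gets index 0; ties are broken alphabetically,
    which is an assumption made for this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# "a" appears three times, "c" twice, "b" once.
label_to_index = fit_string_indexer(["a", "b", "c", "a", "a", "c"])

# Applying the fitted mapping yields a single output column of indexes.
indexed = [label_to_index[v] for v in ["a", "b", "c"]]
```

Here the mapping produces one output column from one input column, which is the single-column pattern the message refers to.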

[jira] [Comment Edited] (SPARK-13030) Change OneHotEncoder to Estimator

2016-08-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431319#comment-15431319 ] Nick Pentreath edited comment on SPARK-13030 at 8/22/16 6:08 PM: - Yes I

[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2016-08-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431319#comment-15431319 ] Nick Pentreath commented on SPARK-13030: Yes I also agree OHE needs to be an {{Estimator

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-22 Thread Nick Pentreath
I believe it may be because of this issue ( https://issues.apache.org/jira/browse/SPARK-13030). OHE is not an estimator - hence in cases where the number of categories differs between train and test, it's not usable in the current form. It's tricky to work around, though one option is to use
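To illustrate why an encoder that is not fitted can yield mismatched vector sizes, here is a plain-Python sketch. Both functions are hypothetical and do not reproduce Spark's OneHotEncoder (for instance, they ignore its option to drop the last category); the sketch only assumes that a stateless encoder sizes its output from whatever data it is given, while a fitted one fixes the size from the training data.

```python
def one_hot_stateless(indices):
    """Encode using the number of categories seen in THIS data only."""
    size = max(indices) + 1
    return [[1.0 if i == idx else 0.0 for i in range(size)] for idx in indices]


def fit_one_hot(train_indices):
    """Learn the category count once, then return a reusable transform."""
    size = max(train_indices) + 1

    def transform(indices):
        # The vector width comes from the training data, not from the input.
        return [[1.0 if i == idx else 0.0 for i in range(size)] for idx in indices]

    return transform


train = [0, 1, 2]  # three categories in training
test = [0, 1]      # category 2 is absent at test time

stateless_train_width = len(one_hot_stateless(train)[0])
stateless_test_width = len(one_hot_stateless(test)[0])

encoder = fit_one_hot(train)
fitted_test_width = len(encoder(test)[0])
```

In this toy case the stateless encoder produces width 3 on train but width 2 on test, while the fitted encoder keeps width 3 for both, so a downstream model sees a consistent feature size.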

[jira] [Resolved] (SPARK-15113) Add missing numFeatures & numClasses to wrapped JavaClassificationModel

2016-08-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-15113. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 12889 [https

[jira] [Updated] (SPARK-15113) Add missing numFeatures & numClasses to wrapped JavaClassificationModel

2016-08-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-15113: --- Assignee: holdenk > Add missing numFeatures & numClasses to wrapped JavaClassificati
