[jira] [Comment Edited] (SPARK-19848) Regex Support in StopWordsRemover

2017-03-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899250#comment-15899250 ] Nick Pentreath edited comment on SPARK-19848 at 3/7/17 11:06 AM

Re: Check if dataframe is empty

2017-03-07 Thread Nick Pentreath
I believe take on an empty dataset will return an empty Array rather than throw an exception. df.take(1).isEmpty should work. On Tue, 7 Mar 2017 at 07:42, Deepak Sharma wrote: > If the df is empty, the .take would return > java.util.NoSuchElementException. > This can be
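
The check suggested in the reply above can be sketched as follows. This is a minimal illustration, assuming Spark 2.x with a local SparkSession; the object name `EmptyCheck` and the helper `isEmptyDF` are hypothetical names chosen for this example and are not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyCheck {
  // take(1) fetches at most one row to the driver and returns an Array[Row];
  // on an empty DataFrame that array is empty, so no exception is involved.
  def isEmptyDF(df: DataFrame): Boolean = df.take(1).isEmpty

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: run outside a cluster).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-check")
      .getOrCreate()
    import spark.implicits._

    val empty = spark.emptyDataFrame
    val nonEmpty = Seq(1, 2, 3).toDF("value")

    println(isEmptyDF(empty))    // true
    println(isEmptyDF(nonEmpty)) // false

    spark.stop()
  }
}
```

Using take(1) rather than count() avoids scanning the whole dataset just to learn whether a single row exists.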

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898855#comment-15898855 ] Nick Pentreath commented on SPARK-14409: [~josephkb] the proposed input schema above encompasses

[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896933#comment-15896933 ] Nick Pentreath edited comment on SPARK-14409 at 3/6/17 9:07 AM: I've

[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896933#comment-15896933 ] Nick Pentreath edited comment on SPARK-14409 at 3/6/17 9:06 AM: I've

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896933#comment-15896933 ] Nick Pentreath commented on SPARK-14409: I've thought about this a lot over the past few days

[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2017-03-04 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895629#comment-15895629 ] Nick Pentreath commented on SPARK-7146: --- Personally I support developer API - these are going

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-03-04 Thread Nick Pentreath
Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from SPARK-19498 specifically to discuss opening up sharedParams traits. On Fri, 3 Mar 2017 at 23:17 Shouheng Yi wrote: > Hi Spark dev list, > > > > Thank you guys so much for all your inputs.

[jira] [Commented] (SPARK-19339) StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next on empty iterator

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893858#comment-15893858 ] Nick Pentreath commented on SPARK-19339: This should be addressed by SPARK-19573 - empty (or all

[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893821#comment-15893821 ] Nick Pentreath commented on SPARK-19714: If you feel that handling values outside the bucket

[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893811#comment-15893811 ] Nick Pentreath commented on SPARK-19747: Also agree we should be able to extract out the penalty

[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893810#comment-15893810 ] Nick Pentreath commented on SPARK-19747: [~yuhaoyan] for {{SGDClassifier}} it would

[jira] [Resolved] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19345. Resolution: Fixed Fix Version/s: 2.2.0 > Add doc for "coldStartStrateg

[jira] [Updated] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-19345: --- Priority: Minor (was: Major) > Add doc for "coldStartStrategy"

[jira] [Updated] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-19704: --- Fix Version/s: 2.2.0 > AFTSurvivalRegression should support numeric censor

[jira] [Assigned] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19704: -- Assignee: zhengruifeng > AFTSurvivalRegression should support numeric censor

[jira] [Resolved] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19704. Resolution: Fixed > AFTSurvivalRegression should support numeric censor

[jira] [Assigned] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19733: -- Assignee: Vasilis Vryniotis > ALS performs unnecessary casting on item and user

[jira] [Resolved] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-03-02 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19733. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17059 [https

[jira] [Resolved] (SPARK-19787) Different default regParam values in ALS

2017-03-01 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19787. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17121 [https

[jira] [Assigned] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS

2017-02-28 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19345: -- Assignee: Nick Pentreath > Add doc for "coldStartStrategy"

[jira] [Resolved] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml

2017-02-28 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-14489. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 12896 [https

[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-02-27 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885636#comment-15885636 ] Nick Pentreath commented on SPARK-11968: While working on performance testing for ALS parity I've

[jira] [Reopened] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-02-27 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reopened SPARK-11968: Assignee: Nick Pentreath > ALS recommend all methods spend most of time in

[jira] [Commented] (SPARK-19141) VectorAssembler metadata causing memory issues

2017-02-27 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885625#comment-15885625 ] Nick Pentreath commented on SPARK-19141: Hi there - I've also run into issues with larger-scale

[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-27 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885315#comment-15885315 ] Nick Pentreath commented on SPARK-19714: I also agree that the naming of {{splits}} could

[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-02-26 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885298#comment-15885298 ] Nick Pentreath commented on SPARK-19747: Big +1 for this! I agree we really should be able

[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882224#comment-15882224 ] Nick Pentreath commented on SPARK-19714: Another alternative is that we do expand the "in

[jira] [Comment Edited] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882216#comment-15882216 ] Nick Pentreath edited comment on SPARK-19714 at 2/24/17 8:35 AM: - I agree

[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882216#comment-15882216 ] Nick Pentreath commented on SPARK-19714: I agree that the parameter naming is perhaps misleading

Re: Feedback on MLlib roadmap process proposal

2017-02-24 Thread Nick Pentreath
h low-level libraries. > > Tim > > > On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > Sorry for being late to the discussion. I think Joseph, Sean and others > have covered the issues well. > > Overall I like the pr

[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882206#comment-15882206 ] Nick Pentreath commented on SPARK-18813: FYI I've started going through a few of the top Watched

[jira] [Closed] (SPARK-10041) Proposal of Parameter Server Interface for Spark

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-10041. -- Resolution: Won't Fix > Proposal of Parameter Server Interface for Sp

[jira] [Reopened] (SPARK-10041) Proposal of Parameter Server Interface for Spark

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reopened SPARK-10041: > Proposal of Parameter Server Interface for Sp

[jira] [Commented] (SPARK-10041) Proposal of Parameter Server Interface for Spark

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882198#comment-15882198 ] Nick Pentreath commented on SPARK-10041: I think it is safe to say this is not going to be part

[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882187#comment-15882187 ] Nick Pentreath commented on SPARK-2336: --- I think it's safe to say that this now lives in a Spark

[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882182#comment-15882182 ] Nick Pentreath commented on SPARK-6567: --- This JIRA has been around for a while without any movement

[jira] [Commented] (SPARK-3434) Distributed block matrix

2017-02-24 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882179#comment-15882179 ] Nick Pentreath commented on SPARK-3434: --- This JIRA only has SPARK-3976 open. There was an old PR

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882174#comment-15882174 ] Nick Pentreath commented on SPARK-14409: The other option is to work with [~danilo.ascione] PR

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882163#comment-15882163 ] Nick Pentreath commented on SPARK-14409: [~roberto.mirizzi] the {{goodThreshold}} param seems

[jira] [Resolved] (SPARK-14084) Parallel training jobs in model selection

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-14084. Resolution: Duplicate Target Version/s: (was: ) > Parallel training j

[jira] [Commented] (SPARK-14084) Parallel training jobs in model selection

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882123#comment-15882123 ] Nick Pentreath commented on SPARK-14084: I guess we could have put SPARK-19071 into this ticket

[jira] [Comment Edited] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882113#comment-15882113 ] Nick Pentreath edited comment on SPARK-3246 at 2/24/17 7:15 AM: Since

[jira] [Comment Edited] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882113#comment-15882113 ] Nick Pentreath edited comment on SPARK-3246 at 2/24/17 7:16 AM: Since

[jira] [Closed] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath closed SPARK-3246. - Resolution: Won't Fix > Support weighted SVMWithSGD for classification of unbalanced data

[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882113#comment-15882113 ] Nick Pentreath commented on SPARK-3246: --- Since {{mllib}} is in maintenance mode and {{LinearSVC

[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880628#comment-15880628 ] Nick Pentreath commented on SPARK-19634: Ah I see it was discussed in the design doc - will go

[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880623#comment-15880623 ] Nick Pentreath commented on SPARK-19634: Thanks [~timhunter]. In terms of performance, we expect

Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Nick Pentreath
Sorry for being late to the discussion. I think Joseph, Sean and others have covered the issues well. Overall I like the proposed cleaned up roadmap & process (thanks Joseph!). As for the actual critical roadmap items mentioned on SPARK-18813, I think it makes sense and will comment a bit further

[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880387#comment-15880387 ] Nick Pentreath commented on SPARK-18813: Thanks for this Joseph and everyone for the comments

Re: Implementation of RNN/LSTM in Spark

2017-02-23 Thread Nick Pentreath
The short answer is that there is none, and one is highly unlikely to land inside Spark MLlib any time in the near future. The best bets are to look at other DL libraries - for JVM there is Deeplearning4J and BigDL (there are others but these seem to be the most comprehensive I have come across) - that run

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880324#comment-15880324 ] Nick Pentreath commented on SPARK-14409: [~roberto.mirizzi] If using the current {{ALS.transform

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880312#comment-15880312 ] Nick Pentreath commented on SPARK-14409: [~danilo.ascione] Yes, your solution is generic assuming

[jira] [Commented] (SPARK-19668) Multiple NGram sizes

2017-02-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880131#comment-15880131 ] Nick Pentreath commented on SPARK-19668: The simplest will be to keep the existing param and make

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-22 Thread Nick Pentreath
helpful > :) For instance, the similarity threshold, the number of hash tables, the > bucket width, etc... > > Thanks! > > On Mon, Feb 13, 2017 at 3:21 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > The original Uber authors provided this performanc

[jira] [Resolved] (SPARK-19679) Destroy broadcasted object without blocking

2017-02-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19679. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17016 [https

[jira] [Assigned] (SPARK-19679) Destroy broadcasted object without blocking

2017-02-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19679: -- Assignee: zhengruifeng > Destroy broadcasted object without block

[jira] [Assigned] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19694: -- Assignee: zhengruifeng > Add missing 'setTopicDistributionCol' for LDAMo

[jira] [Resolved] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-22 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19694. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17021 [https

[jira] [Comment Edited] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-21 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875552#comment-15875552 ] Nick Pentreath edited comment on SPARK-18454 at 2/21/17 8:00 AM: - Can you

[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875552#comment-15875552 ] Nick Pentreath commented on SPARK-18454: Can you also comment on http://mail-archives.apache.org

[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-02-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875465#comment-15875465 ] Nick Pentreath commented on SPARK-18608: [~podongfeng] [~yuhaoyan] I'm not aware of anyone

[jira] [Commented] (SPARK-19668) Multiple NGram sizes

2017-02-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875437#comment-15875437 ] Nick Pentreath commented on SPARK-19668: I'd say a range is feasible. The current API doesn't

[jira] [Commented] (SPARK-19573) Make NaN/null handling consistent in approxQuantile

2017-02-20 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874321#comment-15874321 ] Nick Pentreath commented on SPARK-19573: cc [~timhunter] - can you take a look at the discussion

[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755 ] Nick Pentreath commented on SPARK-19208: Ah right I see - yes rewrite rules would be a good

[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755 ] Nick Pentreath edited comment on SPARK-19208 at 2/14/17 9:42 PM: - Ah

[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866675#comment-15866675 ] Nick Pentreath commented on SPARK-19208: When I said "estimator-like", I didn't mean

[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-13 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864872#comment-15864872 ] Nick Pentreath commented on SPARK-14503: Seems {{PrefixSpan}} even takes different input: {{Array

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-13 Thread Nick Pentreath
) but the > error still happens. And it happens when I call similarity join. After > transformation, the size of dataset is about 4G. > > 2017-02-11 3:07 GMT+07:00 Nick Pentreath <nick.pentre...@gmail.com>: > > What other params are you using for the lsh transformer? >

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Nick Pentreath
What other params are you using for the lsh transformer? Are the issues occurring during transform or during the similarity join? On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan wrote: > hi Das, > In general, I will apply them to larger datasets, so I want to use LSH, >

Re: Google Summer of Code 2017 is coming

2017-02-05 Thread Nick Pentreath
I think Sean raises valid points - that the result is highly dependent on the particular student, project and mentor involved, and that the actual required time investment is very significant. Having said that, it's not all bad certainly. Scikit-learn started as a GSoC project 10 years ago!

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Nick Pentreath
Hi Maciej, if you're seeing a regression from 1.6 -> 2.0 *both using DataFrames* then that seems to point to some other underlying issue as the root cause. Even though adding checkpointing should help, we should understand why it's different between 1.6 and 2.0. On Thu, 2 Feb 2017 at 08:22

[jira] [Commented] (SPARK-19422) Cache input data in algorithms

2017-02-01 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848401#comment-15848401 ] Nick Pentreath commented on SPARK-19422: Please see SPARK-18608 - the fix you propose in the PR

[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-01 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848108#comment-15848108 ] Nick Pentreath edited comment on SPARK-19208 at 2/1/17 8:09 AM: Another

[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-01 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848108#comment-15848108 ] Nick Pentreath commented on SPARK-19208: Another option would be an "Estimator" like

[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15834336#comment-15834336 ] Nick Pentreath commented on SPARK-18392: [~yunn] I was wondering if you will be working

[jira] [Updated] (SPARK-18704) CrossValidator should preserve more tuning statistics

2017-01-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18704: --- Shepherd: Nick Pentreath > CrossValidator should preserve more tuning statist

[jira] [Commented] (SPARK-19071) Optimizations for ML Pipeline Tuning

2017-01-23 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15834232#comment-15834232 ] Nick Pentreath commented on SPARK-19071: Thanks [~bryanc] for the design and working on the PoCs

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826186#comment-15826186 ] Nick Pentreath commented on SPARK-14409: Yes to be more clear, I would expect that the {{k

[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-17 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826096#comment-15826096 ] Nick Pentreath commented on SPARK-19208: If we're going to look at performance optimization here

[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-01-17 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826078#comment-15826078 ] Nick Pentreath commented on SPARK-14409: [~danilo.ascione] [~roberto.mirizzi] thanks for the code

Re: ML PIC

2017-01-16 Thread Nick Pentreath
this have some opportunity for newbs (like me) to volunteer some > time? > > Sent from my iPhone > > On Dec 21, 2016, at 9:08 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > It is part of the general feature parity roadmap. I can't recall offhand > any

[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2017-01-16 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823701#comment-15823701 ] Nick Pentreath commented on SPARK-19217: I don't understand why bq. You can't save

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
Yup - it's because almost all model data in Spark ML (model coefficients) is "small" - i.e. non-distributed. If you look at ALS you'll see there is no repartitioning, since the factor DataFrames can be large. On Fri, 13 Jan 2017 at 19:42, Sean Owen wrote: > You're referring to

[jira] [Issue Comment Deleted] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-12 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-13857: --- Comment: was deleted (was: My view is in practice brute-force is never going to be efficient

[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-12 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820747#comment-15820747 ] Nick Pentreath commented on SPARK-13857: My view is in practice brute-force is never going

[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-01-12 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820746#comment-15820746 ] Nick Pentreath commented on SPARK-13857: My view is in practice brute-force is never going

Re: ML PIC

2016-12-21 Thread Nick Pentreath
It is part of the general feature parity roadmap. I can't recall offhand any blocker reasons; it's just resources. On Wed, 21 Dec 2016 at 17:05, Robert Hamilton wrote: > Hi all. Is it on the roadmap to have an > Spark.ml.clustering.PowerIterationClustering? Are there

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely because HashingTF returns ml vectors while you need mllib vectors for RowMatrix. I'd recommend using the vector conversion utils (I think in mllib.linalg.Vectors but I'm on mobile right now so can't recall exactly). There are util methods for converting single vectors as well as

Re: 2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Nick Pentreath
I went ahead and re-marked all the existing 2.1.1 fix version JIRAs (that had gone into branch-2.1 since RC1 but before RC2) for Spark ML to 2.1.0 On Thu, 8 Dec 2016 at 09:20 Reynold Xin wrote: > Thanks. >

[jira] [Commented] (SPARK-18633) Add multiclass logistic regression summary python example and document

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731424#comment-15731424 ] Nick Pentreath commented on SPARK-18633: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Updated] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18081: --- Fix Version/s: (was: 2.2.0) > Locality Sensitive Hashing (LSH) User Gu

[jira] [Updated] (SPARK-18633) Add multiclass logistic regression summary python example and document

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18633: --- Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0

[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731419#comment-15731419 ] Nick Pentreath commented on SPARK-18081: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Commented] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731421#comment-15731421 ] Nick Pentreath commented on SPARK-15819: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Commented] (SPARK-18274) Memory leak in PySpark StringIndexer

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731418#comment-15731418 ] Nick Pentreath commented on SPARK-18274: Went ahead and re-marked fix version to {{2.1.0}} since

[jira] [Updated] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-15819: --- Fix Version/s: (was: 2.1.1) (was: 2.2.0) 2.1.0

[jira] [Updated] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-12-07 Thread Nick Pentreath (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-18081: --- Fix Version/s: (was: 2.1.1) 2.1.0 > Locality Sensitive Hashing (
