[GitHub] spark issue #17130: [SPARK-19791] [ML] Add doc and example for fpgrowth

2017-03-13 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17130 The updated transform looks good; thanks for pinging! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have

[GitHub] spark issue #17233: [SPARK-11569][ML] Fix StringIndexer to handle null value...

2017-03-13 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17233 I'll take a look --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wish

[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17233#discussion_r105719645 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -122,6 +122,86 @@ class StringIndexerSuite assert

[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17233#discussion_r105721463 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -188,35 +189,45 @@ class StringIndexerModel

[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17233#discussion_r10571 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -122,6 +122,86 @@ class StringIndexerSuite assert

[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17233#discussion_r105706598 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala --- @@ -39,20 +39,21 @@ import

[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/17233#discussion_r105719023 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala --- @@ -122,6 +122,86 @@ class StringIndexerSuite assert

[GitHub] spark issue #17218: [SPARK-19281][WIP][PYTHON][ML] spark.ml Python API for F...

2017-03-13 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17218 True, if minSupport can be shared, then that's OK. confidence won't be shared though. --- If your project is set up for it, you can reply to this email and have your reply appear on

[GitHub] spark issue #17215: [MINOR][ML] Improve MLWriter overwrite error message

2017-03-13 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17215 Thanks! Merging with master --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and

[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

2017-03-13 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17110 Ping @imatiach-msft any more comments after the update? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #17283: [SPARK-19940][ML][MINOR] FPGrowthModel.transform should ...

2017-03-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17283 Thanks for fixing this issue! LGTM Merging with master Stating the JIRA number for a bug fix is reasonable, though it's most useful if the bug appears in an actual release. -

[GitHub] spark issue #17233: [SPARK-11569][ML] Fix StringIndexer to handle null value...

2017-03-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17233 LGTM Merging with master Thanks for the improvement! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/17237 Thanks for the PR! I just merged the fix for https://issues.apache.org/jira/browse/SPARK-11569 which will affect this PR. Would you mind updating this PR to include SPARK-11569's handli

[GitHub] spark issue #15779: [SPARK-17748][ML] Minor cleanups to one-pass linear regr...

2016-11-08 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15779 Thanks! I'll merge this with master and branch-2.1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #15779: [SPARK-17748][ML] Minor cleanups to one-pass linear regr...

2016-11-08 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15779 Linking commit for branch-2.1 since it hasn't linked automatically: [https://github.com/apache/spark/commit/21bbf94b41fbd193e370a3820131e449aaf0e3db] --- If your project is set up for it

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-08 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15800 I disagree about this: you can do multiple probes with MinHash. For each hash function, the probability of a collision is equal to the Jaccard similarity coefficient: [https://en.wikipedia.org

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-08 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15148 @sethah > What is the intended use of the output column generated by transform? As an alternative set of features with decreased dimensionality? I agree it's ma

[GitHub] spark pull request #15800: [SPARK-18334] MinHash should use binary hash dist...

2016-11-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15800#discussion_r87245481 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -32,13 +32,7 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-09 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15148 @MLnick I agree with most of your comments. A few responses: > In terms of transform - I disagree somewhat that the main use case is "dimensionality reduction". Perhaps the

[GitHub] spark issue #15748: [SPARK-18240][ML][PySpark] Add Summary of BiKMeans and G...

2016-11-09 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15748 I do want to add this at some point, though we need to focus on QA for now --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-09 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15148 Good points: Array of Vectors sounds good to me. There has been a lot of discussion. I'm going to try to summarize things in a follow-up JIRA, which I'll link here shortly.

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-09 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15800 I'm not convinced that it is useful to think about AND and OR amplification for MinHash for approxNearestNeighbors. Do you have a reference describing it? I just can't think of a bet

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-09 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15148 Phew, done! https://issues.apache.org/jira/browse/SPARK-18392 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-09 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15800 Limiting this discussion to MinHash: Say we construct a hash table of L x K functions (for doing both OR and AND amplification). For approxNearestNeighbors, we need to ask, "What i

[GitHub] spark issue #15795: [SPARK-18081] Add user guide for Locality Sensitive Hash...

2016-11-09 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15795 Linking [https://issues.apache.org/jira/browse/SPARK-18392] since it will alter the public API for LSH --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #15787: [SPARK-18286][ML] Add Scala/Java examples for MinHash an...

2016-11-09 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15787 Linking [https://issues.apache.org/jira/browse/SPARK-18392] since it will alter the public API for LSH --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15800 I agree @sethah and I are on the same page. Two clarifications about @Yunni 's post: * I'm not sure what you mean by "your method will only have 1 indicator for each row.&quo

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15800 > Yes, I mean we average K 0/1 indicators for each Vector, and get L averaged indicators. Do you agree with this part? I'm afraid I still disagree. There's a fundament

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15800 @sethah I misread what you wrote earlier: I still want to compute a single estimate, rather than L separate ones. --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15800 You're talking about enlarging search ranges, or iterations, a few times, but I really do not think it makes sense to have multiple iterations in Spark. Iterations are useful for avoidi

[GitHub] spark issue #15843: [SPARK-18274][ML][PYSPARK] Memory leak in PySpark String...

2016-11-10 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15843 Thanks a lot for finding & reporting this! The fix should probably go in JavaWrapper, not JavaModel, right? I tested this manually (in JavaWrapper), and it seems to fix the problem

[GitHub] spark issue #15843: [SPARK-18274][ML][PYSPARK] Memory leak in PySpark String...

2016-11-10 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15843 You're right! It's another bug: copy should be implemented in JavaParams, not JavaModel. I'm sending this PR to fix that: https://github.com/techaddict/spark/pull/1

[GitHub] spark issue #15843: [SPARK-18274][ML][PYSPARK] Memory leak in PySpark String...

2016-11-11 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15843 Good point @holdenk --- @techaddict could you also please update the PR title to say "JavaWrapper" instead of "StringIndexer"? --- If your project is set up for it, you can

[GitHub] spark issue #15843: [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWr...

2016-11-11 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15843 I'm now wondering if ```__del___``` should be in JavaParams instead of JavaWrapper since JavaWrapper does not override ```copy```. Do yall agree it will be safer if it's moved there

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-11 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15800 @Yunni I guess we should remove it from the public API. I'm OK with leaving the code there and making it private for now. *One response:* > Enlarging search ranges

[GitHub] spark issue #15843: [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWr...

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15843 I don't see a need to do deep copies of model summaries, but I agree I don't like how JavaWrapper is ambiguous about whether it does shallow or deep copies. I'd say the

[GitHub] spark issue #15817: [SPARK-18366][PYSPARK] Add handleInvalid to Pyspark for ...

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15817 Can you please add "[ML]" to the PR description? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your projec

[GitHub] spark issue #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15874 Can you please add "[ML]" to the PR title? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15279 Can you please change the title to have: "SPARK-12347" -> "[SPARK-12347]"? --- If your project is set up for it, you can reply to this email and have your reply appear on

[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15279 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if

[GitHub] spark issue #15817: [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark ...

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15817 Can you please implement the Param directly in Bucketizer and QuantileDiscretizer? Just like in Scala, HasHandleInvalid has built-in Param doc which applies to existing use cases but not

[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-16 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15874 I'll take a look --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wish

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88521575 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala --- @@ -86,9 +94,24 @@ class MinHashSuite extends SparkFunSuite with

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88359067 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala --- @@ -43,70 +43,72 @@ class RandomProjectionSuite

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88351984 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -31,36 +31,34 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88353870 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -92,18 +93,17 @@ class MinHashModel private[ml] ( * LSH class for

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88359545 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala --- @@ -115,64 +117,83 @@ class RandomProjectionSuite

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88349223 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala --- @@ -147,15 +151,17 @@ class RandomProjection(override

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88351861 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -31,36 +31,34 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15874 Other comments: **MinHash** Looking yet again at this, I think it's using a technically incorrect hash function. It is *not* a perfect hash function. It can hash 2

[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-18 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15874 @Yunni Thanks for the updates! I don't think we should include AND-amplification for 2.1 since we're already in QA. But it'd be nice to get it in 2.2. Also, 2.2 will give us pl

[GitHub] spark issue #15788: [SPARK-18291][SparkR][ML] SparkR glm predict should outp...

2016-11-21 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15788 @felixcheung @yanboliang + @shivaram @mengxr Just saw this change. IIRC we intentionally made spark.glm match R's glm behavior for the binomial family, though I see it was a little diff

[GitHub] spark issue #15777: [SPARK-18282][ML][PYSPARK] Add python clustering summari...

2016-11-21 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15777 @yanboliang I noticed the JIRA is still pending. Are there follow-up tasks? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If

[GitHub] spark pull request #15913: [SPARK-18481][ML] ML 2.1 QA: Remove deprecated me...

2016-11-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15913#discussion_r89537785 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -443,12 +429,11 @@ private[ml] trait GBTParams extends TreeEnsembleParams

[GitHub] spark pull request #15913: [SPARK-18481][ML] ML 2.1 QA: Remove deprecated me...

2016-11-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15913#discussion_r89537864 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -317,8 +317,32 @@ private[ml] trait TreeEnsembleParams extends

[GitHub] spark pull request #15913: [SPARK-18481][ML] ML 2.1 QA: Remove deprecated me...

2016-11-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15913#discussion_r89537771 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -443,12 +425,11 @@ private[ml] trait GBTParams extends TreeEnsembleParams

[GitHub] spark pull request #15913: [SPARK-18481][ML] ML 2.1 QA: Remove deprecated me...

2016-11-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15913#discussion_r89537340 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -547,21 +547,6 @@ trait Params extends Identifiable with Serializable

[GitHub] spark issue #15913: [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods f...

2016-11-24 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15913 One more thing: This breaks several public APIs, although in mostly minor ways and with good reasons. Could you please update the JIRA to list the public API changes? --- If your project is

[GitHub] spark pull request #16017: [SPARK-18592][ML] Move DT/RF/GBT Param setter met...

2016-11-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16017#discussion_r89694871 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -52,33 +52,49 @@ class DecisionTreeClassifier

[GitHub] spark pull request #16017: [SPARK-18592][ML] Move DT/RF/GBT Param setter met...

2016-11-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16017#discussion_r89806761 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala --- @@ -52,33 +52,49 @@ class DecisionTreeClassifier

[GitHub] spark pull request #16017: [SPARK-18592][ML] Move DT/RF/GBT Param setter met...

2016-11-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16017#discussion_r89819028 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -134,27 +150,31 @@ private[ml] trait DecisionTreeParams extends

[GitHub] spark issue #16017: [SPARK-18592][ML] Move DT/RF/GBT Param setter methods to...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16017 LGTM pending a check on whether the deprecations affect Estimators in addition to Models Thanks for the follow-up! --- If your project is set up for it, you can reply to this email and have

[GitHub] spark pull request #16017: [SPARK-18592][ML] Move DT/RF/GBT Param setter met...

2016-11-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16017#discussion_r89824958 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -107,25 +107,41 @@ private[ml] trait DecisionTreeParams extends

[GitHub] spark issue #16011: [SPARK-18587][ML] Remove handleInvalid from QuantileDisc...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16011 I agree with @MLnick that we should not make this change. We need users to be able to set all Params in Estimators, including the Params of the Models they produce. If this is confusing for

[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15874 LGTM Thanks everyone! Merging with master and branch-2.1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15874 Well, I'm having trouble merging b/c of bad wifi during travel. Ping @yanboliang @MLnick @mengxr would one of you mind merging this with master and branch-2.1? @sethah and I having both

[GitHub] spark pull request #15788: [SPARK-18291][SparkR][ML] SparkR glm predict shou...

2016-11-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15788#discussion_r89840613 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/GeneralizedLinearRegressionWrapper.scala --- @@ -52,7 +59,15 @@ private[r] class

[GitHub] spark issue #15788: [SPARK-18291][SparkR][ML] SparkR glm predict should outp...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15788 I agree that R makes it much easier to convert back to the string label. My main worries are: * changing the API, which may break some user code * hiding probabilities from R users

[GitHub] spark issue #15843: [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWr...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15843 Sorry for the slow response on this. Given the time pressure for 2.1, let's go with option 1 for now with a follow-up task to implement option 2. It would be great to include this fix i

[GitHub] spark issue #15972: [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, Deve...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15972 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so

[GitHub] spark issue #15972: [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, Deve...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15972 Quick note: These items will need to be updated in pyspark too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark issue #15972: [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, Deve...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15972 Has anyone requested that LDAModel be unsealed? I'm trying to recall. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If

[GitHub] spark issue #15972: [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, Deve...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15972 This looks good, except for the pyspark updates. Thank you @hhbyyh ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #15768: [SPARK-18080][ML][PySpark] Locality Sensitive Hashing (L...

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15768 This can now proceed since [https://github.com/apache/spark/pull/15874] has been merged. Sorry for the delay! This will need to slip to 2.2 --- If your project is set up for it, you can reply

[GitHub] spark issue #16023: [SPARK-18599] Add Spectral LDA algorithm

2016-11-28 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16023 Per the comments on the JIRA, could you please close this issue pending further discussion? Thank you! --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark pull request #16017: [SPARK-18592][ML] Move DT/RF/GBT Param setter met...

2016-11-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16017#discussion_r90072953 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala --- @@ -107,25 +107,41 @@ private[ml] trait DecisionTreeParams extends

[GitHub] spark issue #16017: [SPARK-18592][ML] Move DT/RF/GBT Param setter methods to...

2016-11-29 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16017 LGTM Merging with master and branch-2.1 Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark pull request #16009: [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scal...

2016-11-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/16009#discussion_r90088312 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -49,15 +49,13 @@ private[feature] trait ChiSqSelectorParams extends

[GitHub] spark issue #15972: [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, Deve...

2016-11-29 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15972 LGTM merging with master and branch-2.1 Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #15571: [SPARK-18034] Upgrade to MiMa 0.1.11 to fix flakiness

2016-10-20 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15571 @JoshRosen I could imagine us introducing this same issue in the future with more Params in sharedParams.scala (like ```HasAggregationDepth```). As long as it's fine for us add MimaExcludes

[GitHub] spark issue #15571: [SPARK-18034] Upgrade to MiMa 0.1.11 to fix flakiness

2016-10-20 Thread jkbradley
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15571 They currently aren't, but a lot of users want them to be. To make them available outside of Spark, we could create corresponding Java interfaces for them and expose those interfaces inste

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84561834 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Si

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84562692 --- Diff: docs/ml-features.md --- @@ -1104,9 +1104,11 @@ for more details on the API. `QuantileDiscretizer` takes a column with continuous features

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563374 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -100,6 +102,24 @@ final class QuantileDiscretizer @Since

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563479 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Si

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563480 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala --- @@ -73,15 +74,51 @@ final class Bucketizer @Since("1.4.0") (@Si

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563237 --- Diff: python/pyspark/ml/feature.py --- @@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563388 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala --- @@ -100,6 +102,24 @@ final class QuantileDiscretizer @Since

[GitHub] spark pull request #15428: [SPARK-17219][ML] enhanced NaN value handling in ...

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15428#discussion_r84563124 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala --- @@ -82,14 +82,23 @@ class QuantileDiscretizerSuite

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84566581 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84566224 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/BitSamplingSuite.scala --- @@ -0,0 +1,100 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84566504 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84566599 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -245,50 +246,50 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84565977 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BitSampling.scala --- @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-21 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84566821 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587191 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,340 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587195 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587197 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84588545 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-10-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84589114 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software

<    5   6   7   8   9   10   11   12   13   14   >