[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95429690 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -159,14 +157,21 @@ class GBTClassifier @Since("

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95429802 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -159,14 +157,21 @@ class GBTClassifier @Since("

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95432804 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -159,14 +157,21 @@ class GBTClassifier @Since("

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95454611 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -215,10 +224,21 @@ class GBTClassificationModel private[ml

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95474280 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala --- @@ -20,6 +20,15 @@ package org.apache.spark.mllib.tree.loss import

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95457108 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -159,14 +157,21 @@ class GBTClassifier @Since("

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95474360 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala --- @@ -52,4 +61,8 @@ object LogLoss extends Loss { // The following

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95454890 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -241,19 +261,42 @@ class GBTClassificationModel private[ml

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95455906 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -241,19 +261,42 @@ class GBTClassificationModel private[ml

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95455485 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -241,19 +261,42 @@ class GBTClassificationModel private[ml

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95492631 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -241,19 +261,42 @@ class GBTClassificationModel private[ml

[GitHub] spark pull request #16377: [SPARK-18036][ML][MLLIB] Fixing decision trees ha...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16377#discussion_r95493260 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -161,6 +161,33 @@ class RandomForestSuite extends SparkFunSuite

[GitHub] spark pull request #16377: [SPARK-18036][ML][MLLIB] Fixing decision trees ha...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16377#discussion_r95494883 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -176,6 +203,18 @@ class RandomForestSuite extends SparkFunSuite

[GitHub] spark pull request #16377: [SPARK-18036][ML][MLLIB] Fixing decision trees ha...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16377#discussion_r95494917 --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala --- @@ -170,12 +197,24 @@ class RandomForestSuite extends SparkFunSuite

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95499165 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -275,18 +316,33 @@ class GBTClassificationModel private[ml

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95497945 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala --- @@ -66,10 +72,157 @@ class GBTClassifierSuite extends

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95495249 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -275,18 +316,33 @@ class GBTClassificationModel private[ml

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95495553 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala --- @@ -20,6 +20,12 @@ package org.apache.spark.mllib.tree.loss import

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95498992 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala --- @@ -20,6 +20,12 @@ package org.apache.spark.mllib.tree.loss import

[GitHub] spark issue #16377: [SPARK-18036][ML][MLLIB] Fixing decision trees handling ...

2017-01-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16377 This LGTM, thanks! ping @jkbradley --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95672358 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -275,18 +316,30 @@ class GBTClassificationModel private[ml

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95667604 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -159,14 +158,21 @@ class GBTClassifier @Since("

[GitHub] spark pull request #16441: [SPARK-14975][ML] Fixed GBTClassifier to predict ...

2017-01-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16441#discussion_r95670584 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala --- @@ -193,6 +199,8 @@ object GBTClassifier extends

[GitHub] spark issue #16571: [SPARK-19208][ML] MaxAbsScaler and MinMaxScaler are very...

2017-01-13 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16571 Can we please discuss, on the JIRA, whether this is something we actually want to do? @srowen raises a point that I tend to agree with, so I'd prefer not to proceed with code review until w

[GitHub] spark pull request #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should l...

2017-01-20 Thread sethah
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/16661 [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features ## What changes were proposed in this pull request? The following test will fail on current master

[GitHub] spark issue #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should limit th...

2017-01-21 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16661 ping @yanboliang

[GitHub] spark pull request #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should l...

2017-01-23 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16661#discussion_r97428335 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala --- @@ -53,6 +53,19 @@ class GaussianMixtureSuite extends

[GitHub] spark pull request #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should l...

2017-01-23 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16661#discussion_r97428409 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala --- @@ -272,6 +277,10 @@ class GaussianMixture private

[GitHub] spark issue #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should limit th...

2017-01-23 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16661 @imatiach-msft Spark committers must push the changes. As long as at least one committer is aware of the changes there is probably nothing left to do.

[GitHub] spark issue #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should limit th...

2017-01-23 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16661 Thanks for the review @srowen and @imatiach-msft!

[GitHub] spark pull request #16661: [SPARK-19313][ML][MLLIB] GaussianMixture should l...

2017-01-23 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/16661#discussion_r97479326 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala --- @@ -486,6 +491,9 @@ class GaussianMixture @Since("

[GitHub] spark issue #16377: [SPARK-18036][ML][MLLIB] Fixing decision trees handling ...

2017-01-24 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16377 Thanks @jkbradley!

[GitHub] spark issue #16557: [SPARK-18693][ML][MLLIB] ML Evaluators should use weight...

2017-01-26 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/16557 +1 for breaking it up, maybe starting with regression. Also, just because something hasn't been reviewed in two weeks does not mean that there is no interest in it. Two weeks is not all that

[GitHub] spark pull request #15628: [SPARK-17471][ML] Add compressed method to ML mat...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15628#discussion_r86358529 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -1076,4 +1240,15 @@ object Matrices { SparseMatrix.fromCOO

[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation logs to ML training...

2016-11-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15671 I created [SPARK-18253](https://issues.apache.org/jira/browse/SPARK-18253) to track it. We may have to get to it after 2.1 QA period.

[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/11119 @yinxusen Status update?

[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation logs to ML training...

2016-11-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15671 @jkbradley Thanks for bringing that up. I'm ok with alternate solutions provided they don't require someone to remember to manually add or manually except a new param, and that we can ensu

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r86436188 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1486,57 +1489,75 @@ private class LogisticAggregator

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r86434451 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -489,13 +485,14 @@ class LogisticRegression @Since("

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r86433115 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1486,57 +1489,75 @@ private class LogisticAggregator

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86459287 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala --- @@ -137,10 +172,11 @@ object MLTestingUtils extends SparkFunSuite

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86464072 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -91,7 +103,20 @@ abstract class Predictor[ // Cast LabelCol to

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86463958 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -91,7 +103,20 @@ abstract class Predictor[ // Cast LabelCol to

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86460724 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala --- @@ -47,18 +48,49 @@ object MLTestingUtils extends SparkFunSuite

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86463845 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -59,10 +69,12 @@ private[ml] trait PredictorParams extends Params

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86461964 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala --- @@ -86,7 +86,7 @@ private[regression] trait IsotonicRegressionBase

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86457706 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -51,6 +51,16 @@ private[ml] trait PredictorParams extends Params

[GitHub] spark issue #15762: [SPARK-18235][ML] ml.ALSModel function parity: ALSModel ...

2016-11-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15762 Looks like a duplicate of https://github.com/apache/spark/pull/12574 ?

[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...

2016-11-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15314 LGTM after typo is fixed. ping @jkbradley @srowen

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double nu...

2016-11-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86569986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -70,8 +68,8 @@ private[ml] trait PredictorParams extends Params

[GitHub] spark pull request #15773: [SPARK-18276][ML] ML models should copy the train...

2016-11-04 Thread sethah
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/15773 [SPARK-18276][ML] ML models should copy the training summary and set parent ## What changes were proposed in this pull request? Only some of the models which contain a training summary

[GitHub] spark issue #15773: [SPARK-18276][ML] ML models should copy the training sum...

2016-11-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15773 @yanboliang mind having a look?

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-04 Thread sethah
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/15777 [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and

[GitHub] spark pull request #13557: [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KM...

2016-11-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13557#discussion_r86653603 --- Diff: python/pyspark/ml/clustering.py --- @@ -201,7 +202,74 @@ def computeCost(self, dataset): """ return

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r86653456 --- Diff: python/pyspark/ml/tests.py --- @@ -1097,6 +1097,42 @@ def test_logistic_regression_summary(self): sameSummary = model.evaluate(df

[GitHub] spark issue #13557: [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of...

2016-11-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13557 I created [SPARK-18282](https://issues.apache.org/jira/browse/SPARK-18282) and the PR: https://github.com/apache/spark/pull/15777 to implement this interface for GMM and BisectingKMeans. These two

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r86653312 --- Diff: python/pyspark/ml/classification.py --- @@ -309,13 +309,16 @@ def interceptVector(self): @since("2.0.0") def su

[GitHub] spark issue #15779: [SPARK-17748][ML] Minor cleanups to one-pass linear regr...

2016-11-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15779 +1 on removing the use of exceptions. I thought it was a bit of an awkward solution to begin with. Thanks a lot for this pr, I will take a look soon.

[GitHub] spark pull request #15779: [SPARK-17748][ML] Minor cleanups to one-pass line...

2016-11-05 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15779#discussion_r86670216 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -156,7 +157,7 @@ private[ml] class QuasiNewtonSolver

[GitHub] spark pull request #15779: [SPARK-17748][ML] Minor cleanups to one-pass line...

2016-11-05 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15779#discussion_r86670363 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -166,6 +166,9 @@ class LinearRegression @Since("1.3.0"

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-05 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 I apologize for coming late to this, but I am taking a look at some of the documentation now. For `RandomProjection` class there are two links: one to wikipedia entry on stable distributions and one

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-06 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 @karlhigley Thanks for your detailed response. From the amplification section on [Wikipedia](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification), it is pretty clear to me that

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-06 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r86719955 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,194 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 Ok, I'm looking more closely at this algorithm versus the literature. I agree that there is a lot of inconsistent terminology which is probably leading to some of the confusion here.

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 So I'll try to summarize the AND/OR amplification and how I think it fits into the current API right now. LSH relies on a single hashing function `h(x)` which is (R, cR, p1, p2)-sensitive which

[GitHub] spark issue #15768: [SPARK-18080][ML][PySpark] Locality Sensitive Hashing (L...

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15768 I began to review this, but got sidetracked with a lot of the details we are currently discussing on the [original LSH PR](https://github.com/apache/spark/pull/15148).

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 I was using L to refer to the number of compound hash functions, but you're right that in my explanation L was the "OR" parameter and d was the "AND" parameter. Thi
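
The AND/OR amplification discussed in the comments above can be illustrated with the standard textbook formula. This is only a minimal sketch, not code from the pull request: it assumes the base hash functions are independent, uses `d` for the "AND" parameter and `L` for the "OR" parameter as in the comment above, and the function name `amplified_collision_probability` is hypothetical.

```python
def amplified_collision_probability(p: float, d: int, L: int) -> float:
    """Collision probability after AND/OR amplification.

    p: probability that a single base hash function maps two points
       to the same value.
    d: number of base hashes combined with AND into one compound hash.
    L: number of compound hashes combined with OR.
    Assumes all base hash functions are drawn independently.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if d < 1 or L < 1:
        raise ValueError("d and L must be positive integers")
    and_prob = p ** d                    # all d base hashes must agree
    return 1.0 - (1.0 - and_prob) ** L   # at least one compound hash agrees


# Illustrative values only: p1 for near pairs, p2 for far pairs.
near = amplified_collision_probability(0.9, d=3, L=5)
far = amplified_collision_probability(0.3, d=3, L=5)
```

With these illustrative inputs the near-pair probability rises and the far-pair probability falls relative to the single-hash values, which is the point of combining AND with OR.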

[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 @MLnick I updated it with your suggested wording for the comments.

[GitHub] spark pull request #15779: [SPARK-17748][ML] Minor cleanups to one-pass line...

2016-11-08 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15779#discussion_r87014522 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -404,6 +406,13 @@ object LinearRegression extends

[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-08 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/11119 This is probably going to miss 2.1 since we are officially in QA now, just as an fyi.

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-08 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 Using this as hashing distance for near-neighbor search doesn't make sense to me. If there aren't enough candidates where the distance is zero, we'll select some candidates who ha

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-08 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 Good point. Maybe we can log a warning when multi-probing is called with MinHash - to say that it will result in running brute force knn when there aren't enough candidates.

[GitHub] spark issue #15779: [SPARK-17748][ML] Minor cleanups to one-pass linear regr...

2016-11-08 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15779 LGTM

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-09 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 @jkbradley Your updated summary above is in line with my view as well - that "multi-probing" as described in the paper doesn't translate exactly to MinHash, but that it does ma

[GitHub] spark issue #15074: [SPARK-17520] Implement a better __eq__ for SparseMatrix

2016-11-09 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15074 Something like `foreachActive` for matrices would enable a better solution, but if we don't go that route then I agree with @thunterdb about comparing sparse matrices with the same tran
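
As a rough illustration of the `foreachActive` idea mentioned above, the sketch below compares two sparse matrices by their stored non-zero entries rather than by their raw arrays. It is not the PySpark implementation: the tuple layout `(num_rows, num_cols, col_ptrs, row_indices, values)` and the names `foreach_active`, `nonzero_entries` and `sparse_equal` are hypothetical, and it assumes both matrices are in column-major (CSC) layout with no duplicate entries.

```python
from typing import Dict, Iterator, Sequence, Tuple

# Hypothetical CSC representation: (num_rows, num_cols, col_ptrs, row_indices, values)
CSC = Tuple[int, int, Sequence[int], Sequence[int], Sequence[float]]


def foreach_active(m: CSC) -> Iterator[Tuple[int, int, float]]:
    """Yield (row, col, value) for every explicitly stored entry."""
    _, num_cols, col_ptrs, row_indices, values = m
    for col in range(num_cols):
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            yield row_indices[k], col, values[k]


def nonzero_entries(m: CSC) -> Dict[Tuple[int, int], float]:
    """Map (row, col) to value, dropping explicitly stored zeros."""
    return {(i, j): v for i, j, v in foreach_active(m) if v != 0.0}


def sparse_equal(a: CSC, b: CSC) -> bool:
    """Equal shapes and equal non-zero entries, regardless of stored zeros."""
    return a[:2] == b[:2] and nonzero_entries(a) == nonzero_entries(b)


# Same logical matrix; the second one stores an explicit zero at (1, 0).
a = (2, 2, [0, 1, 2], [0, 1], [1.0, 2.0])
b = (2, 2, [0, 2, 3], [0, 1, 1], [1.0, 0.0, 2.0])
same = sparse_equal(a, b)
```

Comparing dictionaries of active entries avoids treating two matrices as different merely because one stores explicit zeros, at the cost of materializing the entries in memory.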

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-09 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 If we were to use a matrix for the output, then when we do `approxSimilarityJoin` we would want to explode the output column by matrix rows, assuming the matrix structure was

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-09 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r87275543 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -489,13 +485,14 @@ class LogisticRegression @Since("

[GitHub] spark pull request #15800: [SPARK-18334] MinHash should use binary hash dist...

2016-11-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15800#discussion_r87429950 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -76,7 +72,19 @@ class MinHashModel private[ml] ( @Since("

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 I agree with @jkbradley's suggested approach. One key point here (for MinHash): If a query point vector q hashes to some MinHash Vector [5.0, 22.0, 13.0] the best candidates will be

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 I think that we would have the following hash distance signature: `def hashDistance(x: Vector, y: Vector): Double` Then in `approxNearestNeighbors` we would

[GitHub] spark pull request #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to w...

2016-11-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15683#discussion_r87487460 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -83,10 +83,11 @@ class

[GitHub] spark pull request #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to w...

2016-11-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15683#discussion_r87487494 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -453,6 +454,8 @@ class

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r87504228 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -489,13 +485,14 @@ class LogisticRegression @Since("

[GitHub] spark pull request #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to w...

2016-11-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15683#discussion_r87609107 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -453,6 +464,56 @@ class

[GitHub] spark issue #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to wrong re...

2016-11-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15683 @actuaryzhang Thanks a lot for correcting this! I just had a small comment to make the additional test shorter. --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 @jkbradley Thanks for clarifying, I see your argument now. I agree that it makes sense from a statistical perspective. Still, I have not seen a single paper that describes anything quite exactly

[GitHub] spark pull request #15817: [SPARK-18366][PYSPARK] Add handleInvalid to Pyspa...

2016-11-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15817#discussion_r87617849 --- Diff: python/pyspark/ml/feature.py --- @@ -1163,9 +1184,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab

[GitHub] spark pull request #15817: [SPARK-18366][PYSPARK] Add handleInvalid to Pyspa...

2016-11-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15817#discussion_r87617539 --- Diff: python/pyspark/ml/feature.py --- @@ -158,21 +158,28 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, Jav

[GitHub] spark pull request #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to w...

2016-11-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15683#discussion_r87639131 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -88,6 +89,12 @@ class

[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR

2016-11-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 Thanks for the detailed explanation @dbtsai. +1 for doing this in a separate PR, since I'd imagine we want to run all the performance tests again as well.

[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR

2016-11-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 Thanks @dbtsai!

[GitHub] spark pull request #15881: [SPARK-18434][ML] Add missing ParamValidations fo...

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15881#discussion_r87826206 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -171,7 +171,10 @@ class LinearRegression @Since("1.3.0"

[GitHub] spark issue #15777: [SPARK-18282][ML][PYSPARK] Add python clustering summari...

2016-11-14 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15777 ping @yanboliang

[GitHub] spark issue #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15874 Thanks @yunni, I can take a look at this today. I would prefer to separate the addition of "AND-amplification" into another PR since the other changes I believe we'd like to get in

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r87841411 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -132,7 +132,7 @@ class BisectingKMeansModel private[ml

[GitHub] spark pull request #15881: [SPARK-18434][ML] Add missing ParamValidations fo...

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15881#discussion_r87887739 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -171,7 +171,10 @@ class LinearRegression @Since("1.3.0"

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87910679 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -74,9 +72,12 @@ class MinHashModel private[ml

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87875995 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -74,9 +72,12 @@ class MinHashModel private[ml

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87844308 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -144,12 +152,12 @@ class MinHash(override val uid: String) extends LSH

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87904353 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala --- @@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest
