[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-21 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r89013284 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -155,8 +148,30 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-20 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r88833295 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -95,8 +95,7 @@ class BisectingKMeansModel private[ml

[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15874 @jkbradley Thanks for checking that, that is the conclusion I drew as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88753014 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -31,36 +31,40 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark issue #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15874 I will take a look.

[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/11119 @yinxusen I took a look at the updates. Will you be able to create the design doc that Joseph mentioned?

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88725427 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88724978 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88714626 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -124,7 +147,8 @@ class KMeansModel private[ml] ( @Since("

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88713108 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -414,6 +414,8 @@ object KMeans { val RANDOM = "r

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88722547 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -145,18 +150,67 @@ class KMeansSuite extends SparkFunSuite with

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88713359 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -284,11 +309,26 @@ class KMeans @Since("1.5.0") (

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88715322 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -306,6 +346,25 @@ class KMeans @Since("1.5.0") ( @Si

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88725396 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf

[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11119#discussion_r88713635 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -284,11 +309,26 @@ class KMeans @Since("1.5.0") (

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-17 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88536087 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15831: [SPARK-18385][ML] Make the transformer's natively...

2016-11-17 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15831#discussion_r88530411 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -243,6 +244,42 @@ final class ChiSqSelectorModel private[ml

[GitHub] spark issue #15831: [SPARK-18385][ML] Make the transformer's natively in ml ...

2016-11-17 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15831 I see this patch was created as a result of the PR that separated the ml/mllib linalg packages, to avoid some inefficiencies in conversion. However, it also is a partial step toward feature parity

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-17 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r88495818 --- Diff: python/pyspark/ml/tests.py --- @@ -1097,6 +1097,44 @@ def test_logistic_regression_summary(self): sameSummary = model.evaluate(df

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-17 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r88483092 --- Diff: python/pyspark/ml/tests.py --- @@ -1097,6 +1097,44 @@ def test_logistic_regression_summary(self): sameSummary = model.evaluate(df

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-16 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r88269525 --- Diff: python/pyspark/ml/clustering.py --- @@ -346,6 +453,27 @@ def computeCost(self, dataset): """ return

[GitHub] spark pull request #15874: [Spark-18408][ML] API Improvements for LSH

2016-11-15 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r88142430 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark issue #15893: [SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for c...

2016-11-15 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15893 cc @MLnick @dbtsai

[GitHub] spark pull request #15893: [SPARK-18456][ML][FOLLOWUP] Use matrix abstractio...

2016-11-15 Thread sethah
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/15893 [SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training ## What changes were proposed in this pull request? This is a follow up to some of the

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906133 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -31,13 +31,9 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906309 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType @Since

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87844941 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -106,22 +123,24 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906709 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType @Since

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87878252 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -102,8 +103,7 @@ class MinHashModel private[ml

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87908012 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -125,11 +125,11 @@ class MinHash(override val uid: String) extends LSH

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87922281 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87874869 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -66,10 +66,10 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]] s

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87904353 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala --- @@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87875688 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType @Since

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87928721 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala --- @@ -89,23 +90,25 @@ class RandomProjectionModel private[ml

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87876322 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -102,8 +103,7 @@ class MinHashModel private[ml

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87871105 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -106,22 +106,24 @@ private[ml] abstract class LSHModel[T <: LSHMode

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87874663 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -35,26 +35,26 @@ private[ml] trait LSHParams extends HasInputCol with

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87910679 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -74,9 +72,12 @@ class MinHashModel private[ml

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87875995 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -74,9 +72,12 @@ class MinHashModel private[ml

[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87844308 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -144,12 +152,12 @@ class MinHash(override val uid: String) extends LSH

[GitHub] spark pull request #15881: [SPARK-18434][ML] Add missing ParamValidations fo...

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15881#discussion_r87887739 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -171,7 +171,10 @@ class LinearRegression @Since("1.3.0"

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r87841411 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala --- @@ -132,7 +132,7 @@ class BisectingKMeansModel private[ml

[GitHub] spark issue #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15874 Thanks @yunni, I can take a look at this today. I would prefer to separate the addition of "AND-amplification" into another PR since the other changes I believe we'd like to get in

[GitHub] spark issue #15777: [SPARK-18282][ML][PYSPARK] Add python clustering summari...

2016-11-14 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15777 ping @yanboliang

[GitHub] spark pull request #15881: [SPARK-18434][ML] Add missing ParamValidations fo...

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15881#discussion_r87826206 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -171,7 +171,10 @@ class LinearRegression @Since("1.3.0"

[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR

2016-11-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 Thanks @dbtsai!

[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR

2016-11-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 Thanks for the detailed explanation @dbtsai. +1 for doing this in a separate PR, since I'd imagine we want to run all the performance tests again as well.

[GitHub] spark pull request #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to w...

2016-11-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15683#discussion_r87639131 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -88,6 +89,12 @@ class

[GitHub] spark pull request #15817: [SPARK-18366][PYSPARK] Add handleInvalid to Pyspa...

2016-11-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15817#discussion_r87617539 --- Diff: python/pyspark/ml/feature.py --- @@ -158,21 +158,28 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, Jav

[GitHub] spark pull request #15817: [SPARK-18366][PYSPARK] Add handleInvalid to Pyspa...

2016-11-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15817#discussion_r87617849 --- Diff: python/pyspark/ml/feature.py --- @@ -1163,9 +1184,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 @jkbradley Thanks for clarifying, I see your argument now. I agree that it makes sense from a statistical perspective. Still, I have not seen a single paper that describes anything quite exactly

[GitHub] spark issue #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to wrong re...

2016-11-11 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15683 @actuaryzhang Thanks a lot for correcting this! I just had a small comment to make the additional test shorter.

[GitHub] spark pull request #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to w...

2016-11-11 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15683#discussion_r87609107 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -453,6 +464,56 @@ class

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r87504228 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -489,13 +485,14 @@ class LogisticRegression @Since("

[GitHub] spark pull request #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to w...

2016-11-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15683#discussion_r87487494 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -453,6 +454,8 @@ class

[GitHub] spark pull request #15683: [SPARK-18166][MLlib] Fix Poisson GLM bug due to w...

2016-11-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15683#discussion_r87487460 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -83,10 +83,11 @@ class

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 I think that we would have the following hash distance signature: `def hashDistance(x: Vector, y: Vector): Double` Then in `approxNearestNeighbors` we would
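
For readers skimming this listing, here is a minimal sketch of what a "binary" hash distance with the shape of the signature quoted above might look like. This is an assumption about the idea under discussion, not the code from the pull request: it uses plain `Array[Double]` in place of Spark's `Vector`, and the object name `HashDistanceSketch` is hypothetical.

```scala
// Hypothetical sketch, not the Spark implementation: a "binary" hash
// distance over plain arrays standing in for Spark ML Vectors.
object HashDistanceSketch {
  // Returns 0.0 when at least one hash position agrees, 1.0 otherwise.
  def hashDistance(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "hash vectors must have the same length")
    if (x.zip(y).exists { case (a, b) => a == b }) 0.0 else 1.0
  }
}
```

Under this reading, a candidate either shares a bucket with the query in some hash position or it does not; there is no notion of being "closer" in hash space.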

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-10 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 I agree with @jkbradley's suggested approach. One key point here (for MinHash): If a query point vector q hashes to some MinHash Vector [5.0, 22.0, 13.0] the best candidates will be

[GitHub] spark pull request #15800: [SPARK-18334] MinHash should use binary hash dist...

2016-11-10 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15800#discussion_r87429950 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -76,7 +72,19 @@ class MinHashModel private[ml] ( @Since("

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-09 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r87275543 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -489,13 +485,14 @@ class LogisticRegression @Since("

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-09 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 If we were to use a matrix for the output, then when we do `approxSimilarityJoin` we would want to explode the output column by matrix rows, assuming the matrix structure was

[GitHub] spark issue #15074: [SPARK-17520] Implement a better __eq__ for SparseMatrix

2016-11-09 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15074 Something like `foreachActive` for matrices would enable a better solution, but if we don't go that route then I agree with @thunterdb about comparing sparse matrices with the same tran

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-09 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 @jkbradley Your updated summary above is in line with my view as well - that "multi-probing" as described in the paper doesn't translate exactly to MinHash, but that it does ma

[GitHub] spark issue #15779: [SPARK-17748][ML] Minor cleanups to one-pass linear regr...

2016-11-08 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15779 LGTM

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-08 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 Good point. Maybe we can log a warning when multi-probing is called with MinHash - to say that it will result in running brute force knn when there aren't enough candidates.

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

2016-11-08 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15800 Using this as hashing distance for near-neighbor search doesn't make sense to me. If there aren't enough candidates where the distance is zero, we'll select some candidates who ha

[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-08 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/11119 This is probably going to miss 2.1 since we are officially in QA now, just as an fyi.

[GitHub] spark pull request #15779: [SPARK-17748][ML] Minor cleanups to one-pass line...

2016-11-08 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15779#discussion_r87014522 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -404,6 +406,13 @@ object LinearRegression extends

[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 @MLnick I updated it with your suggested wording for the comments.

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 I was using L to refer to the number of compound hash functions, but you're right that in my explanation L was the "OR" parameter and d was the "AND" parameter. Thi

[GitHub] spark issue #15768: [SPARK-18080][ML][PySpark] Locality Sensitive Hashing (L...

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15768 I began to review this, but got sidetracked with a lot of the details we are currently discussing on the [original LSH PR](https://github.com/apache/spark/pull/15148).

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 So I'll try to summarize the AND/OR amplification and how I think it fits into the current API right now. LSH relies on a single hashing function `h(x)` which is (R, cR, p1, p2)-sensitive which
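
As background for the AND/OR amplification discussed in this thread, the usual textbook statement can be written as follows. This is a sketch under stated assumptions, not a description of the Spark API: `dist` denotes the underlying distance, `p` the single-function collision probability, and, following the naming used elsewhere in the thread, `d` is the "AND" parameter and `L` the "OR" parameter.

```latex
% (R, cR, p_1, p_2)-sensitivity of a single hash function h:
\Pr[h(x) = h(y)] \ge p_1 \quad \text{if } \mathrm{dist}(x, y) \le R,
\qquad
\Pr[h(x) = h(y)] \le p_2 \quad \text{if } \mathrm{dist}(x, y) \ge cR.

% AND-amplification: all d independent hash functions must collide.
P_{\mathrm{AND}} = p^{d}

% OR-amplification over L compound hashes: at least one of them collides.
P_{\mathrm{AND\text{-}OR}} = 1 - \left(1 - p^{d}\right)^{L}
```

Increasing `d` lowers the collision probability for both near and far pairs, while increasing `L` raises it again, which is why the two parameters are typically tuned together.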

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 Ok, I'm looking more closely at this algorithm versus the literature. I agree that there is a lot of inconsistent terminology which is probably leading to some of the confusion here.

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-06 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r86719955 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala --- @@ -0,0 +1,194 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-06 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 @karlhigley Thanks for your detailed response. From the amplification section on [Wikipedia](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification), it is pretty clear to me that

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-05 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 I apologize for coming late to this, but I am taking a look at some of the documentation now. For `RandomProjection` class there are two links: one to wikipedia entry on stable distributions and one

[GitHub] spark pull request #15779: [SPARK-17748][ML] Minor cleanups to one-pass line...

2016-11-05 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15779#discussion_r86670363 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -166,6 +166,9 @@ class LinearRegression @Since("1.3.0"

[GitHub] spark pull request #15779: [SPARK-17748][ML] Minor cleanups to one-pass line...

2016-11-05 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15779#discussion_r86670216 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -156,7 +157,7 @@ private[ml] class QuasiNewtonSolver

[GitHub] spark issue #15779: [SPARK-17748][ML] Minor cleanups to one-pass linear regr...

2016-11-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15779 +1 on removing the use of exceptions. I thought it was a bit of an awkward solution to begin with. Thanks a lot for this pr, I will take a look soon.

[GitHub] spark issue #13557: [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of...

2016-11-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/13557 I created [SPARK-18282](https://issues.apache.org/jira/browse/SPARK-18282) and the PR: https://github.com/apache/spark/pull/15777 to implement this interface for GMM and BisectingKMeans. These two

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r86653312 --- Diff: python/pyspark/ml/classification.py --- @@ -309,13 +309,16 @@ def interceptVector(self): @since("2.0.0") def su

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15777#discussion_r86653456 --- Diff: python/pyspark/ml/tests.py --- @@ -1097,6 +1097,42 @@ def test_logistic_regression_summary(self): sameSummary = model.evaluate(df

[GitHub] spark pull request #13557: [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KM...

2016-11-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13557#discussion_r86653603 --- Diff: python/pyspark/ml/clustering.py --- @@ -201,7 +202,74 @@ def computeCost(self, dataset): """ return

[GitHub] spark pull request #15777: [SPARK-18282][ML][PYSPARK] Add python clustering ...

2016-11-04 Thread sethah
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/15777 [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and

[GitHub] spark issue #15773: [SPARK-18276][ML] ML models should copy the training sum...

2016-11-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15773 @yanboliang mind having a look?

[GitHub] spark pull request #15773: [SPARK-18276][ML] ML models should copy the train...

2016-11-04 Thread sethah
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/15773 [SPARK-18276][ML] ML models should copy the training summary and set parent ## What changes were proposed in this pull request? Only some of the models which contain a training summary

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double nu...

2016-11-04 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86569986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -70,8 +68,8 @@ private[ml] trait PredictorParams extends Params

[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double numeric d...

2016-11-04 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15314 LGTM after typo is fixed. ping @jkbradley @srowen

[GitHub] spark issue #15762: [SPARK-18235][ML] ml.ALSModel function parity: ALSModel ...

2016-11-03 Thread sethah
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15762 Looks like a duplicate of https://github.com/apache/spark/pull/12574 ?

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86461964 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala --- @@ -86,7 +86,7 @@ private[regression] trait IsotonicRegressionBase

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86457706 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -51,6 +51,16 @@ private[ml] trait PredictorParams extends Params

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86463845 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -59,10 +69,12 @@ private[ml] trait PredictorParams extends Params

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86460724 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala --- @@ -47,18 +48,49 @@ object MLTestingUtils extends SparkFunSuite

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86459287 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala --- @@ -137,10 +172,11 @@ object MLTestingUtils extends SparkFunSuite

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86464072 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -91,7 +103,20 @@ abstract class Predictor[ // Cast LabelCol to

[GitHub] spark pull request #15314: [SPARK-17747][ML] WeightCol support non-double da...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15314#discussion_r86463958 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala --- @@ -91,7 +103,20 @@ abstract class Predictor[ // Cast LabelCol to

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r86434451 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -489,13 +485,14 @@ class LogisticRegression @Since("

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r86433115 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1486,57 +1489,75 @@ private class LogisticAggregator

[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

2016-11-03 Thread sethah
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15593#discussion_r86436188 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1486,57 +1489,75 @@ private class LogisticAggregator
