[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread karlhigley
Github user karlhigley commented on the issue: https://github.com/apache/spark/pull/15148 @jkbradley: "Multi-probe" seems like a standard term, and I think this is the [original paper](http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf) that coined it.

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-07 Thread karlhigley
Github user karlhigley commented on the issue: https://github.com/apache/spark/pull/15148 @sethah: Your description of the combination of AND and OR amplification from the literature matches my understanding, and the combination of the two is what I was aiming for in spark-neighbors
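For readers unfamiliar with the AND/OR amplification mentioned in this comment, here is a minimal sketch of the standard collision-probability calculation from the LSH literature. It assumes independent hash functions that each collide with probability `p`. The object and method names are hypothetical and are not taken from the PR or from spark-neighbors.

```scala
// Illustrative only: names are hypothetical, not Spark or spark-neighbors APIs.
object LshAmplificationSketch {
  // AND amplification: all k hash functions in one table must collide.
  def andProbability(p: Double, k: Int): Double =
    math.pow(p, k)

  // OR amplification: at least one of numTables independent tables collides.
  def orProbability(p: Double, numTables: Int): Double =
    1.0 - math.pow(1.0 - p, numTables)

  // Combined scheme: AND within each table, then OR across tables,
  // giving 1 - (1 - p^k)^numTables.
  def andThenOr(p: Double, k: Int, numTables: Int): Double =
    orProbability(andProbability(p, k), numTables)
}
```

Under these assumptions, raising `k` suppresses collisions between distant points, while raising `numTables` recovers collisions between near points.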

[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-11-05 Thread karlhigley
Github user karlhigley commented on the issue: https://github.com/apache/spark/pull/15148 @sethah: I think you're right that there's a discrepancy here, and I'm embarrassed that I didn't see it when I first reviewed the PR. On a reread of the source and your comment above, it looks

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-09-25 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r80393070 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala --- @@ -0,0 +1,290 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-09-25 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r80392464 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala --- @@ -0,0 +1,290 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-09-25 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r80392692 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala --- @@ -0,0 +1,290 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-12-06 Thread karlhigley
Github user karlhigley closed the pull request at: https://github.com/apache/spark/pull/9843

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-28 Thread karlhigley
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/9843#issuecomment-160314300 I understand the issue you're pointing out, but it hasn't been a practical problem, even with hundreds of thousands of terms. The inclusion of explicit zeros

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-27 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r46051533 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -211,14 +213,17 @@ private object IDFModel { val n = v.size

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-27 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r46051762 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -211,14 +213,17 @@ private object IDFModel { val n = v.size

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-26 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r46013493 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -211,14 +213,16 @@ private object IDFModel { val n = v.size

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-26 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r46013714 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -218,7 +218,7 @@ private object IDFModel { newValues(k

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-26 Thread karlhigley
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/9843#issuecomment-160004428 Updated to use an `ArrayBuffer` instead of `toSparse` to omit explicit zeros in returned SparseVectors.
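The `ArrayBuffer` approach described in this comment might look roughly like the sketch below. This is not the code from the PR: it uses plain Scala arrays in place of Spark's vector types, and the object name, method name, and parameters are hypothetical.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only: a stand-in for the idea, not Spark's IDFModel code.
object SparseIdfSketch {
  // Scales the stored entries of a sparse vector by per-feature IDF weights
  // and keeps only the entries whose scaled value is non-zero.
  def transformSparse(
      indices: Array[Int],
      values: Array[Double],
      idf: Array[Double]): (Array[Int], Array[Double]) = {
    val newIndices = ArrayBuffer.empty[Int]
    val newValues = ArrayBuffer.empty[Double]
    var k = 0
    while (k < indices.length) {
      val scaled = values(k) * idf(indices(k))
      if (scaled != 0.0) {
        // Append only non-zero results, so no explicit zeros are stored.
        newIndices += indices(k)
        newValues += scaled
      }
      k += 1
    }
    (newIndices.toArray, newValues.toArray)
  }
}
```

Building the output incrementally avoids a second pass over a dense or zero-padded intermediate, at the cost of buffer growth while appending.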

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-26 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r46010460 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -218,7 +218,7 @@ private object IDFModel { newValues(k

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-26 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r46017996 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -211,14 +213,17 @@ private object IDFModel { val n = v.size

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-26 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r46000917 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -218,7 +218,7 @@ private object IDFModel { newValues(k

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-25 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r45901371 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -218,7 +218,7 @@ private object IDFModel { newValues(k

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-20 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/9843#discussion_r45481318 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala --- @@ -218,7 +218,7 @@ private object IDFModel { newValues(k

[GitHub] spark pull request: [SPARK-7857][MLLIB] Prevent IDFModel from retu...

2015-11-19 Thread karlhigley
GitHub user karlhigley opened a pull request: https://github.com/apache/spark/pull/9843 [SPARK-7857][MLLIB] Prevent IDFModel from returning zeros in SparseVectors When the IDF model's minDocFreq parameter is set to a non-zero threshold, the IDF for any feature

[GitHub] spark pull request: [SPARK-7334] [MLLIB] Random Projection for Dim...

2015-06-22 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/6613#discussion_r32949092 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/RandomProjection.scala --- @@ -0,0 +1,148 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-09 Thread karlhigley
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66309244 Re: (2) Regular and Robust in the same class It's possible to implement, but I don't want to turn class hierarchy inside out. It just violates OOP principles

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-10-08 Thread karlhigley
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-58428847 Those changes fixed the task size growth. I think it grew by 0.5 MB on each iteration, which no longer happens now that the background is broadcast.

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-10-08 Thread karlhigley
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-58435188 Yes, iterations still increase in length. Since the function is recursive and the background changes, maybe it was picking up a copy of the updated background

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-10-02 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/1269#discussion_r18336322 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/topicmodeling/topicmodels/RobustPLSA.scala --- @@ -0,0 +1,180

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-10-02 Thread karlhigley
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-57661277 With RobustPLSA, I'm seeing the size of the serialized tasks grow with each iteration, which doesn't seem to happen with PLSA. Not quite sure why; the only parameter

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-10-02 Thread karlhigley
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-57674145 I did notice that the iterations took longer and longer, but wasn't sure if that was expected or not. I'm training the model on a dataset with 400k documents

[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-10-01 Thread karlhigley
Github user karlhigley commented on a diff in the pull request: https://github.com/apache/spark/pull/1269#discussion_r18312090 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/topicmodeling/topicmodels/RobustPLSA.scala --- @@ -0,0 +1,180

[GitHub] spark pull request: SPARK-1216. Add a OneHotEncoder for handling c...

2014-07-05 Thread karlhigley
Github user karlhigley commented on the pull request: https://github.com/apache/spark/pull/304#issuecomment-48093398 This looked useful, so I tried it out. It works as expected so long as the feature array is an `Array[Any]`. If the features are all categorical, then the feature