Github user karlhigley commented on the issue:
https://github.com/apache/spark/pull/15148
@jkbradley: "Multi-probe" seems like a standard term, and I think this is
the [original paper](http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf)
that coined it.
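For context, multi-probe LSH extends a standard LSH lookup by also checking buckets whose hash signatures are small perturbations of the query's signature, trading extra probes for fewer hash tables. A minimal Python sketch of the idea (the probing order here is the naive ±1 enumeration; the linked paper derives a query-directed order instead, and all names here are illustrative, not from the PR):

```python
from itertools import combinations

def _signed(k):
    # all 2^k combinations of +1/-1 deltas
    if k == 0:
        yield ()
        return
    for rest in _signed(k - 1):
        yield rest + (1,)
        yield rest + (-1,)

def probe_sequence(signature, max_perturb=1):
    """Yield the base bucket, then buckets reached by perturbing up to
    `max_perturb` hash components by +/-1."""
    yield tuple(signature)
    n = len(signature)
    for k in range(1, max_perturb + 1):
        for idxs in combinations(range(n), k):
            for deltas in _signed(k):
                probe = list(signature)
                for i, d in zip(idxs, deltas):
                    probe[i] += d
                yield tuple(probe)
```

With a two-component signature and one perturbation, this visits the base bucket plus the four neighbors reached by shifting either component up or down.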
Github user karlhigley commented on the issue:
https://github.com/apache/spark/pull/15148
@sethah: Your description of the combination of AND and OR amplification
from the literature matches my understanding, and the combination of the two is
what I was aiming for in spark-neighbors
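As background, the AND/OR combination from the literature concatenates several hash bits within a band (AND amplification, which lowers false positives) and declares a candidate pair whenever any band matches in full (OR amplification, which lowers false negatives). A hedged Python sketch using random-hyperplane hashing — function and parameter names are illustrative, not from the PR:

```python
import random

def make_band_hashes(num_bands, rows_per_band, dim, seed=0):
    """Build a banded sign-random-projection LSH scheme: each band ANDs
    `rows_per_band` hash bits, and candidacy is the OR over bands."""
    rng = random.Random(seed)
    planes = [[[rng.gauss(0, 1) for _ in range(dim)]
               for _ in range(rows_per_band)]
              for _ in range(num_bands)]

    def signature(v):
        # one bit per hyperplane: which side of the plane v falls on
        return [tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0)
                      for plane in band)
                for band in planes]

    def is_candidate(v, w):
        # OR over bands of an AND (full tuple equality) within each band
        return any(a == b for a, b in zip(signature(v), signature(w)))

    return signature, is_candidate
```

Vectors pointing in the same direction get identical sign patterns, so they are candidates in every band.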
Github user karlhigley commented on the issue:
https://github.com/apache/spark/pull/15148
@sethah: I think you're right that there's a discrepancy here, and I'm
embarrassed that I didn't see it when I first reviewed the PR. On a reread of
the source and your comment above, it looks
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/15148#discussion_r80393070
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala ---
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/15148#discussion_r80392464
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala ---
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/15148#discussion_r80392692
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala ---
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user karlhigley closed the pull request at:
https://github.com/apache/spark/pull/9843
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/9843#issuecomment-160314300
I understand the issue you're pointing out, but it hasn't been a practical
problem, even with hundreds of thousands of terms. The inclusion of explicit
zeros
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r46051533
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -211,14 +213,17 @@ private object IDFModel {
val n = v.size
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r46051762
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -211,14 +213,17 @@ private object IDFModel {
val n = v.size
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r46013493
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -211,14 +213,16 @@ private object IDFModel {
val n = v.size
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r46013714
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -218,7 +218,7 @@ private object IDFModel {
newValues(k
Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/9843#issuecomment-160004428
Updated to use an `ArrayBuffer` instead of `toSparse` to omit explicit
zeros in returned SparseVectors.
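The approach described above can be sketched in Python (the PR itself is Scala, and these names are illustrative): lists stand in for the `ArrayBuffer`s, and a weighted value is appended only when it is nonzero, so zero IDF weights never become explicit zeros in the returned sparse vector:

```python
def idf_transform_sparse(indices, values, idf):
    """Scale a sparse vector (parallel index/value arrays) by per-feature
    IDF weights, keeping only nonzero products in the output buffers."""
    new_indices, new_values = [], []   # stand-ins for ArrayBuffers
    for i, v in zip(indices, values):
        weighted = v * idf[i]
        if weighted != 0.0:            # skip instead of emitting an explicit zero
            new_indices.append(i)
            new_values.append(weighted)
    return new_indices, new_values
```

Compared with computing a dense result and calling something like `toSparse`, this never materializes the zero entries in the first place.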
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r46010460
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -218,7 +218,7 @@ private object IDFModel {
newValues(k
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r46017996
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -211,14 +213,17 @@ private object IDFModel {
val n = v.size
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r46000917
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -218,7 +218,7 @@ private object IDFModel {
newValues(k
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r45901371
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -218,7 +218,7 @@ private object IDFModel {
newValues(k
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/9843#discussion_r45481318
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -218,7 +218,7 @@ private object IDFModel {
newValues(k
GitHub user karlhigley opened a pull request:
https://github.com/apache/spark/pull/9843
[SPARK-7857][MLLIB] Prevent IDFModel from returning zeros in SparseVectors
When the IDF model's minDocFreq parameter is set to a non-zero threshold, the
IDF for any feature whose document frequency falls below the threshold is set
to zero, which leaves explicit zeros in the transformed SparseVectors.
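For reference, MLlib computes a smoothed IDF of log((m + 1) / (df + 1)), and `minDocFreq` forces the weight to zero for features seen in too few documents. A Python sketch of that behavior (a simplified model for illustration, not the Scala source):

```python
import math

def idf_weights(doc_freq, num_docs, min_doc_freq):
    """Smoothed IDF per feature: log((m + 1) / (df + 1)), with the weight
    zeroed when the feature's document frequency is below the threshold.
    These zero weights are what produced explicit zeros in the
    transformed SparseVectors."""
    return [math.log((num_docs + 1.0) / (df + 1.0)) if df >= min_doc_freq
            else 0.0
            for df in doc_freq]
```

A feature with df = 0 under `min_doc_freq = 1` gets weight 0.0, while a feature appearing in 4 of 9 documents gets log(10/5) = log 2.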
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/6613#discussion_r32949092
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/RandomProjection.scala ---
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software
Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-66309244
Re: (2) Regular and Robust in the same class
It's possible to implement, but I don't want to turn the class hierarchy
inside out. It just violates OOP principles
Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-58428847
Those changes fixed the task size growth. I think it grew by 0.5 MB on
each iteration, which no longer happens now that the background is broadcast.
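The fix described above works because a broadcast variable ships a large value to each executor once, while the serialized task closure carries only a small handle to it. A toy Python model of that serialization-size distinction (Spark's real `Broadcast` is implemented quite differently; this only illustrates why the per-iteration task growth stops):

```python
import pickle

class Broadcast:
    """Toy stand-in for a broadcast variable: pickling the handle captures
    only a small id, never the payload."""
    _registry = {}

    def __init__(self, value):
        self._id = len(Broadcast._registry)
        Broadcast._registry[self._id] = value

    @property
    def value(self):
        return Broadcast._registry[self._id]

    def __getstate__(self):
        return self._id          # only the id travels with a serialized task

    def __setstate__(self, state):
        self._id = state

background = [0.0] * 100_000     # large model state, updated each iteration
bc = Broadcast(background)

# In Spark, a task closure that references `background` directly ships the
# whole list with every task; one that references `bc` ships only the handle.
captured_size = len(pickle.dumps(background))
handle_size = len(pickle.dumps(bc))
```

If each iteration's closure captured a fresh copy of the updated background, serialized task size would grow by roughly the payload size every iteration, matching the ~0.5 MB growth observed.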
Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-58435188
Yes, iterations still increase in length.
Since the function is recursive and the background changes, maybe it was
picking up a copy of the updated background
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/1269#discussion_r18336322
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/topicmodeling/topicmodels/RobustPLSA.scala
---
@@ -0,0 +1,180
Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-57661277
With RobustPLSA, I'm seeing the size of the serialized tasks grow with each
iteration, which doesn't seem to happen with PLSA. Not quite sure why; the
only parameter
Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/1269#issuecomment-57674145
I did notice that the iterations took longer and longer, but wasn't sure if
that was expected or not.
I'm training the model on a dataset with 400k documents
Github user karlhigley commented on a diff in the pull request:
https://github.com/apache/spark/pull/1269#discussion_r18312090
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/topicmodeling/topicmodels/RobustPLSA.scala
---
@@ -0,0 +1,180
Github user karlhigley commented on the pull request:
https://github.com/apache/spark/pull/304#issuecomment-48093398
This looked useful, so I tried it out. It works as expected so long as the
feature array is an `Array[Any]`. If the features are all categorical, then
the feature