Re: Selecting the top 100 records per group by?

2016-09-10 Thread Karl Higley
Would `topByKey` help? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L42 Best, Karl On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton wrote: > I'm trying to figure out a way to group by and return the top
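MLlib's `topByKey` keeps the N largest values per key using a bounded priority queue, so the full per-group list is never materialized. As a language-agnostic illustration (not the MLlib code itself), the same semantics can be sketched in plain Python with a per-key min-heap:

```python
import heapq
from collections import defaultdict

def top_by_key(pairs, num):
    """Keep the `num` largest values per key, mimicking the semantics
    of MLlib's topByKey (bounded priority queue per key)."""
    heaps = defaultdict(list)
    for key, value in pairs:
        heap = heaps[key]
        if len(heap) < num:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # New value beats the smallest retained one; swap it in.
            heapq.heapreplace(heap, value)
    return {k: sorted(h, reverse=True) for k, h in heaps.items()}

records = [("a", 5), ("a", 1), ("a", 9), ("b", 2), ("b", 7)]
print(top_by_key(records, 2))  # {'a': [9, 5], 'b': [7, 2]}
```

Because each key holds at most `num` elements, memory stays bounded even when some groups are huge, which is the advantage over a groupBy-then-sort approach.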

Re: Locality sensitive hashing

2016-07-24 Thread Karl Higley
Hi Janardhan, I collected some LSH papers while working on an RDD-based implementation. Links at the end of the README here: https://github.com/karlhigley/spark-neighbors Keep me posted on what you come up with! Best, Karl On Sun, Jul 24, 2016 at 9:54 AM janardhan shetty

Re: How to recommend most similar users using Spark ML

2016-07-17 Thread Karl Higley
There are also some Spark packages for finding approximate nearest neighbors using locality sensitive hashing: https://spark-packages.org/?q=tags%3Alsh On Fri, Jul 15, 2016 at 7:45 AM nguyen duc Tuan wrote: > Hi jeremycod, > If you want to find top N nearest neighbors for

Re: Compute

2016-04-27 Thread Karl Higley
I'm sorry for the ugly title of the email. I forgot to check it before sending. > > 2016-04-28 10:10 GMT+07:00 Karl Higley <kmhig...@gmail.com>: > >> One idea is to avoid materializing the pairs of points before computing >> the distances between them. You could do that using the LSH sign


Re: Compute

2016-04-27 Thread Karl Higley
One idea is to avoid materializing the pairs of points before computing the distances between them. You could do that using the LSH signatures by building (Signature, (Int, Vector)) tuples, grouping by signature, and then iterating pairwise over the resulting lists of points to compute the
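The approach described above, grouping `(Signature, (Int, Vector))` tuples by signature and comparing points pairwise only within each bucket, can be sketched in standalone Python (an illustration of the idea, not the Spark implementation; names like `distances_within_buckets` are hypothetical):

```python
from collections import defaultdict
from itertools import combinations
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def distances_within_buckets(points):
    """points: iterable of (signature, (id, vector)) tuples.
    Pairwise distances are computed only inside each signature bucket,
    avoiding the full cross-product of all points."""
    buckets = defaultdict(list)
    for sig, (pid, vec) in points:
        buckets[sig].append((pid, vec))
    results = []
    for members in buckets.values():
        for (id1, v1), (id2, v2) in combinations(members, 2):
            results.append(((id1, id2), cosine_distance(v1, v2)))
    return results

pts = [("01", (1, [1.0, 0.0])), ("01", (2, [1.0, 0.0])), ("10", (3, [0.0, 1.0]))]
print(distances_within_buckets(pts))  # only the pair (1, 2) is compared
```

In Spark terms the grouping step corresponds to a `groupByKey` on the signature, so the candidate pairs are generated lazily per bucket rather than materialized up front.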

Re: Reindexing in graphx

2016-02-25 Thread Karl Higley
For real time graph mutations and queries, you might consider a graph database like Neo4j or TitanDB. Titan can be backed by HBase, which you're already using, so that's probably worth a look. On Thu, Feb 25, 2016, 9:55 AM Udbhav Agarwal wrote: > That’s a good thing

Re: Computing hamming distance over large data set

2016-02-11 Thread Karl Higley
Hi, It sounds like you're trying to solve the approximate nearest neighbor (ANN) problem. With a large dataset, parallelizing a brute force O(n^2) approach isn't likely to help all that much, because the number of pairwise comparisons grows quickly as the size of the dataset increases. I'd look
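For Hamming or angular distance, the LSH family usually used is signed random projection: each bit of a point's signature is the sign of its dot product with a random hyperplane, so nearby vectors tend to share signature bits and land in the same bucket. A minimal sketch of the signature step (an illustrative stand-in, not code from the thread):

```python
import random

def signed_random_projection(vec, planes):
    """Bit signature for sign-random-projection LSH: one bit per random
    hyperplane, set by the sign of the dot product. Vectors with a small
    angle between them tend to agree on most bits."""
    bits = []
    for plane in planes:
        dot = sum(a * b for a, b in zip(vec, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

random.seed(7)
planes = [[random.gauss(0, 1) for _ in range(3)] for _ in range(8)]
sig = signed_random_projection([1.0, 0.2, 0.1], planes)
print(sig)  # an 8-bit signature string
```

Bucketing points by signature (or by bands of signature bits) is what turns the O(n^2) all-pairs comparison into comparisons within much smaller candidate sets.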

Re: Product similarity with TF/IDF and Cosine similarity (DIMSUM)

2016-02-03 Thread Karl Higley
Hi Alan, I'm slow responding, so you may have already figured this out. Just in case, though: val approxEntries = mat.columnSimilarities(0.1) approxEntries.first() res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676) The above is returning the cosine similarity between columns 1638
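The `((1638, 966248), 0.632...)` entry above is the cosine similarity between a pair of matrix columns; `columnSimilarities(0.1)` estimates these via DIMSUM sampling with the given threshold. To make the quantity concrete, here is a small exact computation of column-wise cosine similarity in plain Python (an illustrative sketch, not the DIMSUM sampling algorithm itself):

```python
import math

def column_cosine(matrix, i, j):
    """Exact cosine similarity between columns i and j of a row-major matrix:
    dot(col_i, col_j) / (||col_i|| * ||col_j||)."""
    col_i = [row[i] for row in matrix]
    col_j = [row[j] for row in matrix]
    dot = sum(a * b for a, b in zip(col_i, col_j))
    norm = math.sqrt(sum(a * a for a in col_i)) * math.sqrt(sum(b * b for b in col_j))
    return dot / norm

rows = [[1.0, 2.0], [0.0, 1.0], [2.0, 4.0]]
print(round(column_cosine(rows, 0, 1), 4))  # 0.9759
```

DIMSUM trades a small, bounded error for a large reduction in shuffle size; a threshold of 0.1 tells it which similarity magnitudes it must preserve accurately.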

Re: Spark : merging object with approximation

2015-11-25 Thread Karl Higley
Hi, What merge behavior do you want when A~=B, B~=C but A!=C? Should the merge emit ABC? AB and BC? Something else? Best, Karl On Sat, Nov 21, 2015 at 5:24 AM OcterA wrote: > Hello, > > I have a set of X data (around 30M entry), I have to do a batch to merge > data which
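The A~=B, B~=C question above is the classic transitivity problem in approximate deduplication: if merges are taken transitively, the pairs chain into one group ABC even though A and C don't match directly. That behavior is exactly connected components over the match pairs, which a union-find structure computes cheaply (an illustrative sketch; function names are hypothetical):

```python
def merge_components(pairs, items):
    """Transitive merging via union-find: if A~=B and B~=C, then A, B,
    and C end up in one group even though A and C never matched."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)
    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())

print(merge_components([("A", "B"), ("B", "C")], ["A", "B", "C", "D"]))
# [['A', 'B', 'C'], ['D']]
```

If emitting ABC is not the desired behavior, the alternative (emit AB and BC separately) requires clique-style matching instead, which is why the question of intended merge semantics matters before choosing an algorithm.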