Using the SoundCloud implementation of LSH (cosine-lsh-join-spark), I was
able to process a 22K product dataset in a mere 65 seconds! Thanks so much
for the help!
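
In case it helps anyone who finds this thread later, the wiring is roughly
as follows (a minimal sketch from my reading of the project's README; the
constructor arguments and parameter values here are illustrative, not tuned):

    import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
    import org.apache.spark.storage.StorageLevel
    import com.soundcloud.lsh.Lsh

    // tfidfRows: RDD[IndexedRow] - one TF-IDF SparseVector per product,
    // keyed by product ID
    val matrix = new IndexedRowMatrix(tfidfRows)

    val lsh = new Lsh(
      minCosineSimilarity = 0.5,  // discard pairs below this similarity
      dimensions = 150,           // random hyperplanes per hash signature
      numNeighbours = 150,
      numPermutations = 10,       // more permutations = better recall
      partitions = 200,
      storageLevel = StorageLevel.MEMORY_AND_DISK
    )

    // CoordinateMatrix of (productIdA, productIdB, cosine similarity),
    // computed only for hash-collision candidates rather than all pairs
    val similarities = lsh.join(matrix)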
On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:

> Thanks Nick - those examples will help a ton!!
>
> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>
>> A few options include:
>>
>> https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
>> seems quite scalable, too, from what I've looked at.
>> https://github.com/soundcloud/cosine-lsh-join-spark - I haven't used
>> this one, but it looks like it should do exactly what you need.
>> https://github.com/mrsqueeze/spark-hash
>>
>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>>
>>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>>> products at a time (as an isolated set of products). Within this set
>>> of products (which represents all products for a particular supplier),
>>> I am also analyzing each category separately. The largest categories
>>> typically have around 10K products.
>>>
>>> That being said, when calculating IDFs for the 10K product set we end
>>> up with roughly 12K unique tokens. In other words, our vectors are 12K
>>> columns wide (although they are represented using SparseVectors). We
>>> have a step that attempts to locate all documents that share the same
>>> tokens, and for those items we calculate the cosine similarity.
>>> However, the part that identifies documents with shared tokens is the
>>> bottleneck.
>>>
>>> For this portion, we map our data down to the individual tokens
>>> contained in each document. For example:
>>>
>>> DocumentId | Description
>>> -----------+---------------------
>>> 1          | Easton Hockey Stick
>>> 2          | Bauer Hockey Gloves
>>>
>>> In this case, we'd map to the following:
>>>
>>> (1, 'Easton')
>>> (1, 'Hockey')
>>> (1, 'Stick')
>>> (2, 'Bauer')
>>> (2, 'Hockey')
>>> (2, 'Gloves')
>>>
>>> Our goal is to aggregate this data as follows; however, our code that
>>> currently does this does not perform well. In the realistic 12K-product
>>> scenario, this resulted in 430K document/token tuples.
>>>
>>> ((1, 2), ['Hockey'])
>>>
>>> This tells us that documents 1 and 2 need to be compared to one
>>> another (via cosine similarity) because they both contain the token
>>> 'hockey'.
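>>>
>>> In code, that aggregation step looks roughly like the following (a
>>> simplified sketch of the approach, with illustrative names, not our
>>> exact implementation):
>>>
>>> // docTokens: RDD[(Long, Seq[String])] - document ID plus its tokens
>>> val tokenToDoc = docTokens.flatMap { case (docId, tokens) =>
>>>   tokens.map(token => (token, docId))
>>> }.distinct()
>>>
>>> val sharedTokenPairs = tokenToDoc
>>>   .join(tokenToDoc)                     // (token, (docA, docB))
>>>   .filter { case (_, (a, b)) => a < b } // keep each unordered pair once
>>>   .map { case (token, pair) => (pair, token) }
>>>   .groupByKey()                         // ((docA, docB), shared tokens)
>>>
>>> The self-join on tokens is what hurts: a token that appears in n
>>> documents produces on the order of n^2 pairs, so common tokens like
>>> 'hockey' dominate the shuffle.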
>>> I will investigate the methods that you recommended to see if they
>>> may resolve our problem.
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>
>>>> How many products do you have? How large are your vectors?
>>>>
>>>> It could be that SVD / LSA could be helpful. But if you have many
>>>> products, then trying to compute all-pairs similarity with brute
>>>> force is not going to be scalable. In this case you may want to
>>>> investigate hashing (LSH) techniques.
>>>>
>>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm trying to write a Spark application that will detect similar
>>>>> items (in this case, products) based on their descriptions. I've got
>>>>> an ML pipeline that transforms the product data to a TF-IDF
>>>>> representation using the following components (a sketch of the
>>>>> assembled pipeline is included at the bottom of this thread).
>>>>>
>>>>> - *RegexTokenizer* - strips out non-word characters, resulting in a
>>>>> list of tokens
>>>>> - *StopWordsRemover* - removes common "stop words", such as "the",
>>>>> "and", etc.
>>>>> - *HashingTF* - assigns a numeric "hash" to each token and
>>>>> calculates the term frequency
>>>>> - *IDF* - computes the inverse document frequency
>>>>>
>>>>> After this pipeline evaluates, I'm left with a SparseVector that
>>>>> represents the TF-IDF weights of the tokens for each product. As a
>>>>> next step, I'd like to compare each vector against the others to
>>>>> detect similarities.
>>>>>
>>>>> Does anybody know of a straightforward way to do this in Spark? I
>>>>> tried creating a UDF (that used the Breeze linear algebra methods
>>>>> internally); however, that did not scale well.
>>>>>
>>>>> Thanks,
>>>>> Kevin
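
For posterity, here is roughly how the pipeline described in the original
message is assembled in Spark ML (a minimal sketch; the column names and
token pattern are illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}

    val tokenizer = new RegexTokenizer()
      .setInputCol("description")
      .setOutputCol("tokens")
      .setPattern("\\W")  // split on runs of non-word characters

    val remover = new StopWordsRemover()
      .setInputCol("tokens")
      .setOutputCol("filtered")

    val tf = new HashingTF()
      .setInputCol("filtered")
      .setOutputCol("tf")

    val idf = new IDF()
      .setInputCol("tf")
      .setOutputCol("tfidf")

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, remover, tf, idf))

    // products is a DataFrame with a "description" column; each row of
    // the result carries its TF-IDF weights as a SparseVector in "tfidf"
    val tfidf = pipeline.fit(products).transform(products)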