How many products do you have, and how large are your vectors? SVD / LSA could be helpful for reducing the dimensionality. But if you have many products, computing all-pairs similarity by brute force is not going to scale; in that case you may want to investigate locality-sensitive hashing (LSH) techniques. A rough sketch is below.
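
Something like the following, as a sketch only: it assumes the BucketedRandomProjectionLSH estimator in spark.ml (available from Spark 2.1 on), and that tfidfDF is the output of your TF-IDF pipeline with the vectors in a "features" column and a unique "id" column.

import org.apache.spark.ml.feature.{BucketedRandomProjectionLSH, Normalizer}

// L2-normalise the TF-IDF vectors first, so that Euclidean distance is a
// monotone function of cosine similarity (dist^2 = 2 - 2*cos on unit vectors).
val normalized = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(2.0)
  .transform(tfidfDF)              // tfidfDF = output of your pipeline (assumed)

val lsh = new BucketedRandomProjectionLSH()
  .setInputCol("normFeatures")
  .setOutputCol("hashes")
  .setBucketLength(2.0)            // tuning knobs; adjust for your data
  .setNumHashTables(3)

val model = lsh.fit(normalized)

// Approximate self-join: candidate pairs closer than the distance threshold.
// A Euclidean threshold of 0.6 on unit vectors corresponds to cosine ~0.82.
val pairs = model.approxSimilarityJoin(normalized, normalized, 0.6, "distCol")
  .filter("datasetA.id < datasetB.id")   // drop self-matches and mirrored pairs (assumes an "id" column)

The normalisation step matters because the random-projection LSH buckets by Euclidean distance, not cosine; on unit-length vectors the two are interchangeable.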
On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:

> Hi all,
>
> I'm trying to write a Spark application that will detect similar items (in
> this case products) based on their descriptions. I've got an ML pipeline
> that transforms the product data to TF-IDF representation, using the
> following components.
>
>    - *RegexTokenizer* - strips out non-word characters, results in a list
>      of tokens
>    - *StopWordsRemover* - removes common "stop words", such as "the",
>      "and", etc.
>    - *HashingTF* - assigns a numeric "hash" to each token and calculates
>      the term frequency
>    - *IDF* - computes the inverse document frequency
>
> After this pipeline evaluates, I'm left with a SparseVector that
> represents the inverse document frequency of tokens for each product. As a
> next step, I'd like to be able to compare each vector to one another, to
> detect similarities.
>
> Does anybody know of a straightforward way to do this in Spark? I tried
> creating a UDF (that used the Breeze linear algebra methods internally);
> however, that did not scale well.
>
> Thanks,
> Kevin
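
For reference, the pipeline described above would assemble roughly like this; the column names, the "description" input column, the numFeatures value, and the "products" DataFrame are all placeholders.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}

val tokenizer = new RegexTokenizer()
  .setInputCol("description")      // assumed name of the product text column
  .setOutputCol("tokens")
  .setPattern("\\W+")              // split on non-word characters

val remover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")

val hashingTF = new HashingTF()
  .setInputCol("filtered")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18)         // placeholder size of the hashing space

val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, remover, hashingTF, idf))

val tfidfDF = pipeline.fit(products).transform(products)   // products = your input DataFrame (assumed)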