Related question: is there anything that does scalable matrix multiplication on Spark? For example, we have that long list of vectors and want to construct the similarity matrix: v * T(v). In R it would be: v %*% t(v) Thanks, Pete
On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote: > Hi all, > > I'm trying to write a Spark application that will detect similar items (in > this case products) based on their descriptions. I've got an ML pipeline > that transforms the product data to TF-IDF representation, using the > following components. > > - *RegexTokenizer* - strips out non-word characters, results in a list > of tokens > - *StopWordsRemover* - removes common "stop words", such as "the", > "and", etc. > - *HashingTF* - assigns a numeric "hash" to each token and calculates > the term frequency > - *IDF* - computes the inverse document frequency > > After this pipeline evaluates, I'm left with a SparseVector that > represents the inverse document frequency of tokens for each product. As a > next step, I'd like to be able to compare each vector to one another, to > detect similarities. > > Does anybody know of a straightforward way to do this in Spark? I tried > creating a UDF (that used the Breeze linear algebra methods internally); > however, that did not scale well. > > Thanks, > Kevin >