Something we are working on for purely content-based similarity is using a KNN 
engine (a search engine), but creating the features from word2vec and an NER 
(Named Entity Recognizer).
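
As a rough sketch of the feature-generation step (assuming gensim for the 
word2vec model and spaCy for the NER -- neither tool is required, they just 
make the idea concrete):

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_sm")  # pretrained pipeline with NER

    def doc_features(text, w2v):
        # w2v is a trained gensim word2vec model (or similar)
        parsed = nlp(text)
        tokens = [t.text.lower() for t in parsed if t.is_alpha]
        # word2vec field: average the vectors of in-vocabulary tokens
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        w2v_field = (np.mean(vecs, axis=0) if vecs
                     else np.zeros(w2v.vector_size))
        # NER field: the named entities, kept as label:text strings
        ner_field = [ent.label_ + ":" + ent.text for ent in parsed.ents]
        return {"w2v": w2v_field.tolist(), "entities": ner_field,
                "text": text}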

Putting the generated features into fields of a doc can really help with 
similarity, because w2v and NER produce semantic features. You can also try 
n-grams or skip-grams. These features are not very helpful for search, but 
they work well for similarity.
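
For the n-gram / skip-gram fields, something as simple as this is enough 
(illustrative only -- word-level, no filtering):

    def ngrams(tokens, n=2):
        return [" ".join(tokens[i:i + n])
                for i in range(len(tokens) - n + 1)]

    def skipgrams(tokens, gap=1):
        # pairs of words with exactly `gap` tokens skipped between them
        return [tokens[i] + " " + tokens[i + gap + 1]
                for i in range(len(tokens) - gap - 1)]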

The query to the KNN engine is itself a document, with each field mapped to 
the corresponding field of the index. The result is the k nearest neighbors 
to the query doc.
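
Purely to illustrate the shape of that lookup (a query document in, the k 
nearest documents out), here is the same idea with scikit-learn's 
NearestNeighbors standing in for the search engine, using only the w2v 
field. indexed_docs and query_doc are hypothetical outputs of the feature 
step above; a real setup would query the engine across all fields:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    index_vectors = np.array([d["w2v"] for d in indexed_docs])
    knn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(index_vectors)

    distances, neighbor_ids = knn.kneighbors(np.array([query_doc["w2v"]]))
    # neighbor_ids[0] holds the positions of the 10 most similar documents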


> On Feb 14, 2016, at 11:05 AM, David Starina <david.star...@gmail.com> wrote:
> 
> Charles, thank you, I will check that out.
> 
> Ted, I am looking for semantic similarity. Unfortunately, I do not have any
> data on the usage of the documents (if by usage you mean user behavior).
> 
> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
>> Did you want textual similarity?
>> 
>> Or semantic similarity?
>> 
>> The actual semantics of a message can be opaque from the content, but clear
>> from the usage.
>> 
>> 
>> 
>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlesce...@me.com> wrote:
>> 
>>> David,
>>> LDA or LSI can work quite nicely for similarity (YMMV of course depending
>>> on the characteristics of your documents).
>>> You basically use the dot product of the square roots of the vectors for
>>> LDA -- a search for Hellinger or Bhattacharyya distance will lead you to a
>>> good similarity or distance measure.
>>> As I recall, Spark does provide an LDA implementation. Gensim provides an
>>> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
>>> looking at, particularly for a large dataset.
>>> Hope this is useful.
>>> Cheers
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I need to build a system to determine the N (e.g. 10) most similar
>>>> documents to a given document. I have some (theoretical) knowledge of
>>>> Mahout algorithms, but not enough to build the system. Can you give me
>>>> some suggestions?
>>>> 
>>>> At first I was researching Latent Semantic Analysis for the task, but
>>>> since Mahout doesn't support it, I started researching some other
>>>> options. I got a hint that instead of LSA, you can use LDA (Latent
>>>> Dirichlet allocation) in Mahout to achieve similar and even better
>>>> results.
>>>> 
>>>> However ... and this is where I got confused ... LDA is a clustering
>>>> algorithm. What I need is not to cluster the documents into N clusters -
>>>> I need a matrix (similar to TF-IDF) from which I can calculate some sort
>>>> of distance for any two documents, to get the N most similar documents
>>>> for any given document.
>>>> 
>>>> How do I achieve that? My idea (still mostly theoretical, since I have
>>>> some problems with running the LDA algorithm) was to extract some number
>>>> of topics with LDA, but not to cluster the documents with the help of
>>>> these topics - rather, to get a matrix with documents as one dimension
>>>> and topics as the other. I was guessing I could then use this matrix as
>>>> input to a row-similarity algorithm.
>>>> 
>>>> Is this the correct concept? Or am I missing something?
>>>> 
>>>> And since LDA is not supported on Spark/Samsara, how could I achieve
>>>> similar results on Spark?
>>>> 
>>>> 
>>>> Thanks in advance,
>>>> David
>>> 
>> 
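
As a footnote to Charles's suggestion above, the Hellinger / Bhattacharyya 
measure he describes comes down to a couple of lines, given two LDA topic 
distributions (probability vectors) p and q:

    import numpy as np

    def bhattacharyya_coefficient(p, q):
        # the "dot product of the square roots of the vectors"
        return float(np.sqrt(np.asarray(p)).dot(np.sqrt(np.asarray(q))))

    def hellinger_distance(p, q):
        return float(np.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q))))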
