Re: Document similarity
You might also look at this paper by Wallach: http://maroo.cs.umass.edu/pub/web/getpdf.php?id=1101

Sent from my iPhone
Re: Document similarity
Well, there is also an online method of LDA in Spark ... Pat, is there any documentation on the method you described?
Re: Document similarity
The method I described calculates similarity on the fly but requires new docs to go through feature extraction before similarity can be queried. The length of time to do feature extraction is short compared to training LDA.

Another method that gets at semantic similarity uses adaptive skip-grams for text features: http://arxiv.org/abs/1502.07257. I haven't tried this, but a friend saw a presentation about using this method to create features for a search engine, which showed a favorable comparison with word2vec.

If you want to use LDA, note that it is an unsupervised categorization method. To use it, the cluster descriptors (a vector of important terms) can be compared to the analyzed incoming document using a KNN/search engine. This will give you a list of the closest clusters but doesn't really give you documents, which is your goal I think. LDA should be re-run periodically to generate new clusters. Do you want to know cluster inclusion, or get a list of similar docs?
Re: Document similarity
Guys, one more question ... Are there some incremental methods to do this? I don't want to run the whole job again once a new document is added. In the case of LDA ... I guess the best way is to calculate the topics on the new document using the topics from the previous LDA run, and then every once in a while recalculate the topics with the new documents?
Re: Document similarity
Something we are working on for purely content-based similarity is using a KNN engine (search engine) but creating features from word2vec and an NER (Named Entity Recognizer).

Putting the generated features into fields of a doc can really help with similarity because w2v and NER create semantic features. You can also try n-grams or skip-grams. These features are not very helpful for search, but for similarity they work well.

The query to the KNN engine is a document, each field mapped to the corresponding field of the index. The result is the k nearest neighbors to the query doc.
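A rough plain-Java sketch of the field-wise KNN idea Pat describes (this is not code from the thread; in practice the index would live in a search engine such as Solr or Elasticsearch, and the class and field names here are illustrative). Each document is a set of named fields, each holding a sparse feature vector built from text, word2vec, or NER output, and the query document is scored field by field against every indexed document:

    import java.util.*;

    // Minimal sketch: score indexed docs against a query doc by per-field cosine
    // similarity and return the k nearest neighbors.
    public class FieldwiseKnn {

      // A field vector is a sparse map from term/feature to weight.
      static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
          Double bv = b.get(e.getKey());
          if (bv != null) dot += e.getValue() * bv;
          na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
      }

      // index: docId -> (field name -> field vector); queryDoc: field name -> field vector
      static List<Map.Entry<String, Double>> kNearest(
          Map<String, Map<String, Map<String, Double>>> index,
          Map<String, Map<String, Double>> queryDoc,
          int k) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Map<String, Double>>> doc : index.entrySet()) {
          double s = 0;
          for (Map.Entry<String, Map<String, Double>> field : queryDoc.entrySet()) {
            Map<String, Double> indexed = doc.getValue().get(field.getKey());
            // each query field is scored only against the corresponding index field
            if (indexed != null) s += cosine(field.getValue(), indexed);
          }
          scores.put(doc.getKey(), s);
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        return ranked.subList(0, Math.min(k, ranked.size()));
      }
    }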
Re: Document similarity
Charles, thank you, I will check that out.

Ted, I am looking for semantic similarity. Unfortunately, I do not have any data on the usage of the documents (if by usage you mean user behavior).
Document similarity
Hi,

I need to build a system to determine the N (e.g. 10) most similar documents to a given document. I have some (theoretical) knowledge of Mahout algorithms, but not enough to build the system. Can you give me some suggestions?

At first I was researching Latent Semantic Analysis for the task, but since Mahout doesn't support it, I started researching some other options. I got a hint that instead of LSA, you can use LDA (Latent Dirichlet allocation) in Mahout to achieve similar and even better results.

However ... and this is where I got confused ... LDA is a clustering algorithm. What I need is not to cluster the documents into N clusters - I need to get a matrix (similar to TF-IDF) from which I can calculate some sort of a distance for any two documents, to get the N most similar documents for any given document.

How do I achieve that? My idea was (still mostly theoretical, since I have some problems with running the LDA algorithm) to extract some number of topics with LDA, but not to cluster the documents with the help of these topics - rather to get a matrix with documents as one dimension and topics as the other dimension. I was guessing I could then use this matrix as an input to a row-similarity algorithm.

Is this the correct concept? Or am I missing something?

And, since LDA is not supported on Spark/Samsara, how could I achieve similar results on Spark?

Thanks in advance,
David
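For concreteness, a minimal sketch (not from the thread) of the workflow David has in mind: given a documents-by-topics matrix produced by LDA, treat each row as the document's vector and return the N most similar rows for a query document. Cosine similarity is used here only as a stand-in for whatever row-similarity measure is chosen later in the thread:

    import java.util.*;

    // Sketch: N most similar documents for a query row of a doc x topic matrix.
    public class TopNSimilarDocs {

      static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          na += a[i] * a[i];
          nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
      }

      // docTopic[d] is the topic vector of document d; returns indexes of the N most similar docs.
      static int[] topN(double[][] docTopic, int queryDoc, int n) {
        Integer[] candidates = new Integer[docTopic.length];
        double[] scores = new double[docTopic.length];
        for (int d = 0; d < docTopic.length; d++) {
          candidates[d] = d;
          // exclude the query doc itself by giving it the worst possible score
          scores[d] = (d == queryDoc) ? Double.NEGATIVE_INFINITY : cosine(docTopic[queryDoc], docTopic[d]);
        }
        Arrays.sort(candidates, (x, y) -> Double.compare(scores[y], scores[x]));
        int[] result = new int[Math.min(n, docTopic.length - 1)];
        for (int i = 0; i < result.length; i++) result[i] = candidates[i];
        return result;
      }
    }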
Re: Document similarity
Did you want textual similarity? Or semantic similarity?

The actual semantics of a message can be opaque from the content, but clear from the usage.

On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl wrote:
> David,
> LDA or LSI can work quite nicely for similarity (YMMV of course depending on the characterization of your documents).
> You basically use the dot product of the square roots of the vectors for LDA -- if you do a search for Hellinger or Bhattacharyya distance, that will lead you to a good similarity or distance measure.
> As I recall, Spark does provide an LDA implementation. Gensim provides an API for doing LDA similarity out of the box. Vowpal Wabbit is also worth looking at, particularly for a large dataset.
> Hope this is useful.
> Cheers
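A small sketch of the measure Charles mentions, assuming each document is represented by an LDA topic distribution (a probability vector summing to 1). The Bhattacharyya coefficient is the dot product of the square-rooted vectors, and the Hellinger distance follows from it; the vectors in main are made up purely for illustration:

    // Sketch of Bhattacharyya / Hellinger comparison of topic distributions.
    public class TopicDistance {

      static double bhattacharyyaCoefficient(double[] p, double[] q) {
        double bc = 0;
        for (int i = 0; i < p.length; i++) {
          bc += Math.sqrt(p[i] * q[i]);
        }
        return bc;  // 1 for identical distributions, 0 for disjoint support
      }

      static double hellingerDistance(double[] p, double[] q) {
        // H(p, q) = sqrt(1 - BC(p, q)); small distance means similar topic mixtures
        return Math.sqrt(Math.max(0, 1 - bhattacharyyaCoefficient(p, q)));
      }

      public static void main(String[] args) {
        double[] docA = {0.7, 0.2, 0.1};
        double[] docB = {0.6, 0.3, 0.1};
        System.out.println(hellingerDistance(docA, docB));
      }
    }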
Is Feature Hashing appropriate for document to document similarity calculations?
We have a system that needs to be able to incrementally calculate document-document text similarity metrics as new documents are seen. I'm trying to understand if using feature hashing with, for example, a StaticWordValueEncoder is appropriate for this kind of use case. Our text documents can contain web content, so the size of the feature vector is really only bound by the number of words in the language.

Currently our implementation uses a simple vector-based bag-of-words type model to create 'one cell per word' feature vectors for each document, and then we use cosine similarity to determine document-to-document similarity. We are not using Mahout. The issue with this approach is that the one-cell-per-word feature vectors require the use of a singleton dictionary object to turn words into vector indexes, so we can only index one document at a time.

I've been reading through the Mahout archives and the Mahout in Action book looking to see if Mahout has any answers to help with incremental parallelized vector generation, but it seems like the Mahout seq2sparse processes have the same 'batch' issue. I've seen various posts referring to using feature hashing as a way around this, and the classifiers in part 3 of Mahout in Action explain how to use feature hashing to encode text-like features. I'm just too green to know whether it's appropriate for our use case - particularly whether the multiple probes recommended when using feature hashing with text, and the likelihood of feature collisions, will significantly compromise our cosine similarity calculations.

Thanks for any insights,
Martin
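For reference, a minimal sketch of the hashed alternative Martin is asking about, in plain Java rather than Mahout's encoders: each word is hashed into a fixed-size vector (with a couple of probes to soften collisions), so no shared dictionary is needed and documents can be vectorized independently and incrementally. The dimension and probe count below are illustrative choices, not recommendations:

    import java.util.*;

    // Sketch of the hashing trick applied to bag-of-words vectors plus cosine similarity.
    public class HashedBagOfWords {

      static final int DIMENSION = 1 << 18;  // fixed vector size, independent of vocabulary
      static final int PROBES = 2;           // hash each word into a couple of cells to soften collisions

      static double[] vectorize(String text) {
        double[] v = new double[DIMENSION];
        for (String word : text.toLowerCase().split("\\W+")) {
          if (word.isEmpty()) continue;
          for (int probe = 0; probe < PROBES; probe++) {
            int h = (word + "#" + probe).hashCode();
            v[Math.floorMod(h, DIMENSION)] += 1.0;
          }
        }
        return v;
      }

      static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
      }

      public static void main(String[] args) {
        double[] d1 = vectorize("feature hashing for document similarity");
        double[] d2 = vectorize("document similarity with hashed features");
        System.out.println(cosine(d1, d2));
      }
    }

Because the vector length is fixed, two processes can vectorize different documents at the same time without coordinating over a dictionary, which is the incremental property being asked about.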
Re: Is Feature Hashing appropriate for document to document similarity calculations?
Hi Martin,

I guess you should be fine with the StaticWordValueEncoder. Following, e.g., this discussion on this list - it is about clustering but matches some of your questions:
http://mahout.markmail.org/search/?q=hashing%20clustering#query:hashing%20clustering+page:1+mid:eitskeb7pkk3pupr+state:results

What I am missing is a clear / big picture on the latter of the two most mentioned benefits of hashing:
- dictionary / vectorization (with collisions and a small error on average)
- dimensionality reduction

For the dimensionality reduction part, there is the minHash clustering, home-brewed k-means with feature vector encoders, plus the new streaming k-means stuff having its own random projection and hash implementations. Any guidance: cool!

For you, Martin, I think you will have to roll your own incremental vector generation, but it should be really straightforward.

Cheers,
Johannes
Re: Is Feature Hashing appropriate for document to document similarity calculations?
Hashed feature vectors are an excellent choice for the unknown vocabulary problem.

One problem you will have is that the static weighting won't by default weight rare words more highly than common words. One way to deal with this is to build a dictionary on a small subset of documents and assume all missing words are relatively rare and can have a default weight. You definitely need some weighting according to prevalence.
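A sketch of the weighting scheme Ted suggests, under the assumption that IDF-like weights are estimated from a modest sample of documents and that anything unseen in the sample is treated as rare and given a default (high) weight. The class name and parameter choices are illustrative:

    import java.util.*;

    // Sketch: IDF-style weights from a sample of documents, with a default weight
    // for words that never appeared in the sample (assumed to be rare).
    public class SampledIdfWeights {

      private final Map<String, Double> idf = new HashMap<>();
      private final double defaultWeight;

      SampledIdfWeights(List<Set<String>> sampleDocs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Set<String> doc : sampleDocs) {
          for (String word : doc) docFreq.merge(word, 1, Integer::sum);
        }
        int n = sampleDocs.size();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
          idf.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        // words never seen in the sample get the largest weight, as if they occurred once
        defaultWeight = Math.log(n);
      }

      double weight(String word) {
        return idf.getOrDefault(word, defaultWeight);
      }
    }

These weights would multiply the hashed counts before the cosine computation, so common words stop dominating the similarity score.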
Re: Pairwise Document Similarity
Thanks Grant, and apologies for not actually reading the JIRA properly - it was not intended as a criticism of the excellent Recommender work in Taste. I see now that the intention is to improve performance for specific use cases.

My intention is to build a scalable near-duplicate document detection capability using Mahout. The use case that I'm researching will require a scalable approach across a large corpus of documents (50m and growing). I see that there is another thread today on this topic.

I will go ahead and implement the solution as outlined in my original post, delve further into the LSH implementation embedded in the clustering code, and share my findings with the community. I'm looking to test this on a large corpus using EMR in order to get a feeling for the timings.

Cheers,
Niall
Re: Pairwise Document Similarity
On Jul 22, 2011, at 7:23 AM, Niall Riddell wrote:
> I've gone through MIA and felt the rowsimilarityjob was a possibility; however, I understand that a JIRA has been raised to make this potentially less general, and in its current form it may not match my performance/cost criteria (i.e. high/low).

I don't think the goal of the JIRA issue is to make it less general, it's to make the cases that can benefit from smarter use of the co-occurrences scale better. I see no reason why the existing format can't also be maintained for those similarity measures that can't benefit from more map-side calculation.

-Grant
Re: Plagiarism - document similarity
Thanks to all. I need to start from the beginning theory - you are speaking Arabic :) to me. In other words, I need a less theoretical approach, or some real code to put my hands on. Excuse this raw approach, but I need an algorithm that is fast to implement and understand, to use in a real-world scenario, possibly now ;). Alternatively, I need a basic text(book) to start reading so I can come to understand what you are saying.

Thanks again
Re: Plagiarism - document similarity
Hi Luca,

Again, I have to emphasize: read what I gave you. The algorithm in my link was explained for non-scientists, and if you download Solr you will find the class, so you can have a look at how they implemented that algorithm. Easier would mean that someone else writes the code for you ;).

Regards,
Em
Re: Plagiarism - document similarity
I've seen people doing all kinds of things to detect this. A few directions to research around:

- suffix trees / suffix arrays to detect longest common subsequences (perfect matches only, though),
- bioinformatics, in particular gene sequencing to detect long near-matching sequences (a variation of the above; I'm not familiar with any particular algorithms, but I imagine this is a well-explored space given the funds they receive ;),
- techniques for fuzzy matching / near-duplicate detection, combined with arbitrary document chunking or method-specific data (for example shingles); a MinHash-style sketch of this last direction is shown below.

These should yield you tons of reading material to start with (Google Scholar, CiteSeer). Sorry for not being any more specific.

Dawid

On Mon, Jul 11, 2011 at 9:15 AM, Luca Natti ilmioam...@gmail.com wrote:
> Yes, I'm interested in plagiarism applied to research papers, university notes, theses. Any theory and *best* snippets of code/examples are very appreciated! Thanks in advance for your help.
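As a concrete example of the shingling / fuzzy-matching direction in the last bullet, here is a small MinHash sketch (illustrative only, not from the thread): each document's word shingles are summarized by a fixed-length signature, and the fraction of matching signature positions estimates the Jaccard similarity of the shingle sets. The hash family and signature length are simplistic placeholder choices:

    import java.util.*;

    // Sketch: MinHash signatures over word shingles to estimate Jaccard similarity.
    public class MinHashSimilarity {

      static final int NUM_HASHES = 128;

      static Set<String> shingles(String text, int size) {
        String[] w = text.toLowerCase().split("\\W+");
        Set<String> s = new HashSet<>();
        for (int i = 0; i + size <= w.length; i++) {
          s.add(String.join(" ", Arrays.copyOfRange(w, i, i + size)));
        }
        return s;
      }

      static int[] signature(Set<String> shingles) {
        int[] sig = new int[NUM_HASHES];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String sh : shingles) {
          int base = sh.hashCode();
          for (int i = 0; i < NUM_HASHES; i++) {
            // cheap illustrative family of hash functions derived from the base hash
            int h = base ^ (0x9e3779b9 * (i + 1));
            h ^= (h >>> 16);
            sig[i] = Math.min(sig[i], h);
          }
        }
        return sig;
      }

      // Fraction of matching signature positions approximates the Jaccard similarity.
      static double estimatedJaccard(String a, String b) {
        int[] sa = signature(shingles(a, 3));
        int[] sb = signature(shingles(b, 3));
        int same = 0;
        for (int i = 0; i < NUM_HASHES; i++) if (sa[i] == sb[i]) same++;
        return same / (double) NUM_HASHES;
      }
    }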
Re: Plagiarism - document similarity
Hi Luca,

How about quoting another researcher's work? Are you also interested in the amount of quotes with respect to the whole document? I think it is not impossible to let an algorithm find out whether some subsequences in both documents are correctly marked as quotes, but it might be hard. Depending on your business case, you might find that there will be a lot of false positives when judging someone's work as plagiarism.

Another idea for finding similarity between the content of two documents is implemented in Nutch. Fortunately I found a piece of documentation in the Solr API docs where you can read about it:
http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html

You could do something like that for content blocks of a document (several sentences or a fixed window of words). This way you are able to find similarities between documents where the author has rewritten a part of another researcher's work - phrases where the longest common subsequence is small, but a human would see the similarities between both documents and the possibility of plagiarism.

Regards,
Em
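A simplified sketch of the idea behind a text-profile signature (this is not the actual Nutch/Solr TextProfileSignature implementation, and the quantization rule below is a placeholder): quantize word counts within a content block, drop low-frequency words, and hash the sorted profile, so lightly reworded or reordered blocks often collapse to the same signature:

    import java.util.*;

    // Sketch: a quantized word-frequency profile hashed into a block signature.
    public class TextProfile {

      static String signature(String block) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : block.toLowerCase().split("\\W+")) {
          if (w.length() > 2) counts.merge(w, 1, Integer::sum);
        }
        int max = counts.values().stream().max(Integer::compare).orElse(0);
        int quant = Math.max(2, max / 4);  // quantization step grows with the block's top frequency
        List<String> profile = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          int q = (e.getValue() / quant) * quant;  // low-frequency words quantize to 0 and are dropped
          if (q > 0) profile.add(e.getKey() + ":" + q);
        }
        Collections.sort(profile);
        return Integer.toHexString(String.join(" ", profile).hashCode());
      }
    }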
Re: Plagiarism - document similarity
Something that also gives false positives is OK, because we can check by hand for the final decision on the doc. I need more specific directions with some examples, because we have little time to implement this.
Re: Plagiarism - document similarity
On Jul 11, 2011, at 3:00am, Luca Natti wrote:
> Something that also gives false positives is OK, because we can check by hand for the final decision on the doc. I need more specific directions with some examples, because we have little time to implement this.

See "Winnowing: Local Algorithms for Document Fingerprinting" by Schleimer, Wilkerson and Aiken, and any papers on MOSS, an implementation of the ideas contained in that paper, for detecting plagiarism.

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
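A much-simplified sketch of the winnowing idea from that paper (not the MOSS implementation): hash every k-character gram, slide a window of w consecutive hashes, and keep the minimum of each window as a fingerprint; documents that share many fingerprints share long matching passages. The k and w values are arbitrary illustration choices, and the paper's tie-breaking and positional bookkeeping are omitted:

    import java.util.*;

    // Sketch: winnowing-style fingerprints and a containment score between two docs.
    public class Winnowing {

      static Set<Integer> fingerprints(String text, int k, int w) {
        String t = text.toLowerCase().replaceAll("\\s+", " ");
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= t.length(); i++) {
          hashes.add(t.substring(i, i + k).hashCode());
        }
        Set<Integer> selected = new HashSet<>();
        for (int start = 0; start + w <= hashes.size(); start++) {
          int min = Integer.MAX_VALUE;
          for (int j = start; j < start + w; j++) {
            min = Math.min(min, hashes.get(j));
          }
          selected.add(min);  // repeated selections collapse in the set
        }
        return selected;
      }

      // Fraction of doc A's fingerprints that also occur in doc B.
      static double containment(String a, String b) {
        Set<Integer> fa = fingerprints(a, 5, 4);
        Set<Integer> fb = fingerprints(b, 5, 4);
        if (fa.isEmpty()) return 0;
        int shared = 0;
        for (int f : fa) if (fb.contains(f)) shared++;
        return shared / (double) fa.size();
      }
    }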
Re: document similarity
I think by "puzzling" he meant piecing together work from other papers into some extension of previous work.

-Xavier

On 7/9/11 8:14 AM, Andrew Clegg wrote:
> If 'puzzling' means direct plagiarism, then some sort of longest-common-subsequence might be a better metric.
> If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for me.
Re: document similarity
Disregard my last message. Sorry!
Re: Plagiarism - document similarity
Easier to simply index all, say, three-word phrases and use a TF-IDF score. This will give you a good proxy for sequence similarity. Documents should either be chopped on paragraph boundaries to have a roughly constant length, or the score should not be normalized by document length.

The log likelihood ratio (LLR) test can be useful to extract good query features from the subject document. TF-IDF score is a reasonable proxy for this, although it does lead to some problems. The reason TF-IDF works as a query term selection method, and why it fails, can be seen from the fact that TF-IDF is very close to one of the most important terms in the LLR score.

On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg andrew.clegg+mah...@gmail.com wrote:
> On 11 July 2011 08:19, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
>> - bioinformatics, in particular gene sequencing to detect long near-matching sequences (a variation of the above, I'm not familiar with any particular algorithms, but I imagine this is a well explored space
>
> The classic is Smith-Waterman: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
> This approach has been used in general text processing tasks too, e.g.:
> http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
>
>> given the funds they receive ;),
>
> Hah! Less so these days I'm afraid :-)
>
> Andrew.
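A small sketch of the phrase-indexing idea Ted describes (illustrative code, not from the thread): every three-word phrase becomes an indexing term, and two documents are scored by their shared phrases weighted by an IDF map that is assumed to be built over the whole corpus elsewhere. As suggested in the message, the score is deliberately not normalized by document length:

    import java.util.*;

    // Sketch: word-trigram ("shingle") overlap weighted by phrase IDF.
    public class PhraseOverlap {

      static List<String> wordTrigrams(String text) {
        String[] w = text.toLowerCase().split("\\W+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 < w.length; i++) {
          grams.add(w[i] + " " + w[i + 1] + " " + w[i + 2]);
        }
        return grams;
      }

      static Map<String, Integer> termFreq(List<String> terms) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : terms) tf.merge(t, 1, Integer::sum);
        return tf;
      }

      // TF-IDF weighted dot product over shared trigrams; rare shared phrases dominate.
      static double overlapScore(String docA, String docB, Map<String, Double> idf) {
        Map<String, Integer> tfA = termFreq(wordTrigrams(docA));
        Map<String, Integer> tfB = termFreq(wordTrigrams(docB));
        double score = 0;
        for (Map.Entry<String, Integer> e : tfA.entrySet()) {
          Integer fb = tfB.get(e.getKey());
          if (fb == null) continue;
          double w = idf.getOrDefault(e.getKey(), 1.0);
          score += e.getValue() * fb * w * w;
        }
        return score;
      }
    }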
Re: document similarity
You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis

On 8 July 2011 12:47, Luca Natti ilmioam...@gmail.com wrote:
> Is there a way to compute similarity between docs? And similarity by paragraphs? What we want to tell is whether a research paper is original or made by puzzling together other works. Thanks!
Re: Generating a Document Similarity Matrix
Hi Kris, I think the best way would be to manually join the names to the result after executing the job. --sebastian Am 02.07.2010 18:22, schrieb Kris Jack: Hi Sebastian, I am currently using your code with NamedVectors in my input. In the output, however, the names seem to be missing. Would there be a way to include them? Thanks, Kris 2010/6/29 Sebastian Schelter ssc.o...@googlemail.com Hi Kris, I'm glad I could help you and it's really cool that you are testing my patches on real data. I'm looking forward to hearing more! -sebastian Am 29.06.2010 11:25, schrieb Kris Jack: Hi Sebastian, You really are very kind! I have taken your code and run it to print out the contents of the output file. There are indeed only 37,952 results so that gives me more confidence in the vector dumper. I'm not sure why there was a memory problem though, seeing as it seems to have output the results correctly. Now I just have to match them up with my original lucene ids and see how it is performing. I'll keep you posted with the results. Thanks, Kris 2010/6/28 Sebastian Schelter ssc.o...@googlemail.com Hi Kris, Unfortunately I'm not familiar with the VectorDumper code (and a quick look didn't help either), so I can't help you with the OutOfMemoryError. It could be possible that only 37,952 results are found for an input of 500,000 vectors, it really depends on the actual data. If you're sure that there should be more results, you could provide me with a sample input file and I'll try to find out why there aren't more results. I wrote a small class for you that dumps the output file of the job to the console, (I tested it with the output of my unit-tests), maybe that can help us find the source of the problem. -sebastian public class MatrixReader extends AbstractJob { public static void main(String[] args) throws Exception { ToolRunner.run(new MatrixReader(), args); } @Override public int run(String[] args) throws Exception { addInputOption(); MapString,String parsedArgs = parseArguments(args); if (parsedArgs == null) { return -1; } Configuration conf = getConf(); FileSystem fs = FileSystem.get(conf); Path vectorFile = fs.listStatus(getInputPath(), TasteHadoopUtils.PARTS_FILTER)[0].getPath(); SequenceFile.Reader reader = null; try { reader = new SequenceFile.Reader(fs, vectorFile, conf); IntWritable key = new IntWritable(); VectorWritable value = new VectorWritable(); while (reader.next(key, value)) { int row = key.get(); System.out.print(String.valueOf(key.get()) + : ); IteratorElement elementsIterator = value.get().iterateNonZero(); String separator = ; while (elementsIterator.hasNext()) { Element element = elementsIterator.next(); System.out.print(separator + String.valueOf(element.index()) + , + String.valueOf(element.get())); separator = ;; } System.out.print(\n); } } finally { reader.close(); } return 0; } } Am 28.06.2010 17:18, schrieb Kris Jack: Hi, I am now using the version of org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian has written and has been added to the trunk. Thanks again for that! I can generate an output file that should contain a list of documents with their top 100* *most similar documents. 
I am having problems, however, in converting the output file into a readable format using mahout's vectordump:

$ ./mahout vectordump --seqFile similarRows --output results.out --printKey
no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
Input Path: /home/kris/similarRows
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
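[Editor's note] For Sebastian's suggestion above to join the names back in manually, one rough sketch (the path is a placeholder, and it assumes the input rows were written as IntWritable keys with NamedVector payloads inside VectorWritable) is to read the original input matrix once and build a row-id-to-name map:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class RowNameJoin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<Integer, String> names = new HashMap<Integer, String>();

    // One pass over the original input matrix: remember which name belongs to which row id.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("/path/to/input/matrix"), conf);
    try {
      IntWritable row = new IntWritable();
      VectorWritable vec = new VectorWritable();
      while (reader.next(row, vec)) {
        Vector v = vec.get();
        if (v instanceof NamedVector) {
          names.put(row.get(), ((NamedVector) v).getName());
        }
      }
    } finally {
      reader.close();
    }
    // names.get(rowId) now recovers the document name for any row id
    // appearing in the RowSimilarityJob output.
    System.out.println("loaded " + names.size() + " row names");
  }
}

With that map in memory, the int keys in the similarity output can be translated back to document names while dumping or post-processing the results.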
Re: Generating a Document Similarity Matrix
Thanks Ted, I got that working. Unfortunately, the matrix multiplication job is taking far longer than I hoped. With just over 10 million documents, 10 mappers and 10 reducers, I can't get it to complete the job in under 48 hours. Perhaps you have an idea for speeding it up? I have already been quite ruthless with making the vectors sparse. I did not include terms that appeared in over 1% of the corpus and only kept terms that appeared at least 50 times. Is it normal that the matrix multiplication map-reduce task should take so long to process with this quantity of data and resources available or do you think that my system is not configured properly? Thanks, Kris 2010/6/15 Ted Dunning ted.dunn...@gmail.com Thresholds are generally dangerous. It is usually preferable to specify the sparseness you want (1%, 0.2%, whatever), sort the results in descending score order using Hadoop's built-in capabilities and just drop the rest. On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote: I was wondering if there was an interesting way to do this with the current mahout code such as requesting that the Vector accumulator returns only elements that have values greater than a given threshold, sorting the vector by value rather than key, or something else?
Re: Generating a Document Similarity Matrix
Hi Kris, maybe you want to give the patch from https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not yet tested it with larger data, but I would be happy to get some feedback for it and maybe it helps you with your use case. -sebastian Am 18.06.2010 18:46, schrieb Kris Jack: Thanks Ted, I got that working. Unfortunately, the matrix multiplication job is taking far longer than I hoped. With just over 10 million documents, 10 mappers and 10 reducers, I can't get it to complete the job in under 48 hours. Perhaps you have an idea for speeding it up? I have already been quite ruthless with making the vectors sparse. I did not include terms that appeared in over 1% of the corpus and only kept terms that appeared at least 50 times. Is it normal that the matrix multiplication map-reduce task should take so long to process with this quantity of data and resources available or do you think that my system is not configured properly? Thanks, Kris 2010/6/15 Ted Dunning ted.dunn...@gmail.com Thresholds are generally dangerous. It is usually preferable to specify the sparseness you want (1%, 0.2%, whatever), sort the results in descending score order using Hadoop's built-in capabilities and just drop the rest. On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote: I was wondering if there was an interesting way to do this with the current mahout code such as requesting that the Vector accumulator returns only elements that have values greater than a given threshold, sorting the vector by value rather than key, or something else?
Re: Generating a Document Similarity Matrix
Thanks Sebastian, I'll give it a try! 2010/6/18 Sebastian Schelter ssc.o...@googlemail.com Hi Kris, maybe you want to give the patch from https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not yet tested it with larger data, but I would be happy to get some feedback for it and maybe it helps you with your use case. -sebastian Am 18.06.2010 18:46, schrieb Kris Jack: Thanks Ted, I got that working. Unfortunately, the matrix multiplication job is taking far longer than I hoped. With just over 10 million documents, 10 mappers and 10 reducers, I can't get it to complete the job in under 48 hours. Perhaps you have an idea for speeding it up? I have already been quite ruthless with making the vectors sparse. I did not include terms that appeared in over 1% of the corpus and only kept terms that appeared at least 50 times. Is it normal that the matrix multiplication map-reduce task should take so long to process with this quantity of data and resources available or do you think that my system is not configured properly? Thanks, Kris 2010/6/15 Ted Dunning ted.dunn...@gmail.com Thresholds are generally dangerous. It is usually preferable to specify the sparseness you want (1%, 0.2%, whatever), sort the results in descending score order using Hadoop's built-in capabilities and just drop the rest. On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote: I was wondering if there was an interesting way to do this with the current mahout code such as requesting that the Vector accumulator returns only elements that have values greater than a given threshold, sorting the vector by value rather than key, or something else? -- Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
Re: Generating a Document Similarity Matrix
Hi, I have gone through the matrix code quite thoroughly and have managed to get an implementation of document similarity working. Like you said, however, the rows of the similarity matrix are very dense and take a while to calculate. I have made the original document-term matrix more sparse by removing terms that have a document frequency of 1% and now I would like to trim out row entries that do not have high similarity values. As you suggested, I would like to keep only the top x entries by modifying the reducer of the matrix multiply job. I was wondering if there was an interesting way to do this with the current mahout code such as requesting that the Vector accumulator returns only elements that have values greater than a given threshold, sorting the vector by value rather than key, or something else? Thanks, Kris 2010/6/8 Jake Mannix jake.man...@gmail.com Hi Kris, So you already know how to make a sparse feature matrix out of your Solr index, based on Grant's instructions? Once you have that matrix loaded into HDFS, then the following Java code should create a document-document similarity matrix:

//
String inputPath = "/path/to/matrix/on/hdfs";
String tmpPath = "/tmp/matrixmultiplyspace";
int numDocuments = // whatever your numDocuments is
int numTerms = // total number of terms in the matrix
DistributedRowMatrix text = new DistributedRowMatrix(inputPath, tmpPath, numDocuments, numTerms);
JobConf conf = new JobConf("similarity job");
text.configure(conf);
DistributedRowMatrix transpose = text.transpose();
DistributedRowMatrix similarity = transpose.times(transpose);
System.out.println("Similarity matrix lives: " + similarity.getRowPath());
//

Now, the rows of this similarity matrix are going to be way too dense, so you'll want to write a small map-reduce job (well, no reduce is necessary) to run over this matrix and trim out all the unuseful entries of each row, but that shouldn't be too hard to do. Of course, to do it really efficiently, that functionality could be folded into the reducer of the matrix multiply job, and done in the same pass over the data as that one. -jake On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Jake, Thanks for that. The first solution that you suggest is more like what I was imagining. Please excuse me, I'm new to Mahout and don't know how to use it to generate the full document-document similarity matrix. I would rather not have to re-implement the moreLikeThis algorithm that, although rather straightforward, may take time for a newbie to MapReduce like me. Could you guide me a little in finding the relevant Mahout code for generating the matrix or is it not really designed for that? For the moment, I would be happy to have an off-line batch version working. Also, it is desirable to take advantage of the text processing features that I have already configured using solr, so I would prefer to read in the feature vectors for the documents from a lucene index, as I am doing at present (e.g. http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/ ). Thanks, Kris 2010/6/8 Jake Mannix jake.man...@gmail.com Hi Kris, If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc...). Then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline.
The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine. An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested), to reduce the search space (as a lucene Filter) and speed up the query. -jake On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Olivier, Thanks for your suggestions. I have over 10 million documents and they have quite a lot of meta-data associated with them including rather large text fields. It is possible to tweak the moreLikeThis function from solr. I have tried changing the parameters ( http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300ms without sacrificing the quality of the results too much. I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene so I may give that a try. I'll keep you posted if I come up with other solutions
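[Editor's note] The LSH implementation Jake mentions is not shown in this thread, but the general random-hyperplane idea can be sketched as follows; this is a hypothetical class, not the co-worker's code, and it assumes the era's Mahout Vector API (iterateNonZero), deriving hyperplane components on the fly so nothing large has to be stored:

import java.util.Iterator;
import java.util.Random;

import org.apache.mahout.math.Vector;

// Sketch of a random-hyperplane LSH signature for cosine similarity.
public class RandomHyperplaneSignature {

  private final int numBits;  // at most 64 so the signature fits in a long
  private final long seed;

  public RandomHyperplaneSignature(int numBits, long seed) {
    this.numBits = numBits;
    this.seed = seed;
  }

  public long signature(Vector doc) {
    long sig = 0L;
    for (int bit = 0; bit < numBits; bit++) {
      double dot = 0.0;
      Iterator<Vector.Element> it = doc.iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        dot += e.get() * component(bit, e.index());
      }
      if (dot >= 0.0) {
        sig |= 1L << bit;   // which side of hyperplane 'bit' the document falls on
      }
    }
    return sig;
  }

  // Deterministic pseudo-random Gaussian component of hyperplane 'bit' at dimension 'index'.
  private double component(int bit, int index) {
    Random rnd = new Random(seed ^ (bit * 0x9E3779B97F4A7C15L) ^ (index * 0xC2B2AE3D27D4EB4FL));
    return rnd.nextGaussian();
  }
}

Documents whose signatures agree on most bits are likely to be cosine-similar, so the signature (or a few bands of it) could be indexed as a field and used as a Lucene Filter to shrink the candidate set before the expensive comparison.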
Re: Generating a Document Similarity Matrix
Thresholds are generally dangerous. It is usually preferable to specify the sparseness you want (1%, 0.2%, whatever), sort the results in descending score order using Hadoop's built-in capabilities and just drop the rest. On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote: I was wondering if there was an interesting way to do this with the current mahout code such as requesting that the Vector accumulator returns only elements that have values greater than a given threshold, sorting the vector by value rather than key, or something else?
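[Editor's note] A minimal sketch of what Ted suggests, keeping only the N highest-scoring entries of each similarity row instead of applying a fixed value threshold; the helper class is hypothetical and assumes Mahout's Vector and RandomAccessSparseVector types, and it would sit in (or be called from) the reducer of the matrix multiply job:

import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Keep only the n highest-valued entries of a similarity row and drop everything else.
public class TopNRow {

  public static Vector topN(Vector row, int n) {
    // min-heap of the best (index, value) pairs seen so far
    PriorityQueue<double[]> best = new PriorityQueue<double[]>(n, new Comparator<double[]>() {
      public int compare(double[] a, double[] b) {
        return Double.compare(a[1], b[1]);
      }
    });
    Iterator<Vector.Element> it = row.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      if (best.size() < n) {
        best.add(new double[] { e.index(), e.get() });
      } else if (e.get() > best.peek()[1]) {
        best.poll();
        best.add(new double[] { e.index(), e.get() });
      }
    }
    Vector trimmed = new RandomAccessSparseVector(row.size(), n);
    for (double[] entry : best) {
      trimmed.setQuick((int) entry[0], entry[1]);
    }
    return trimmed;
  }
}

Bounding each row to N entries also bounds the total output to N times the number of documents regardless of how the scores happen to be distributed, which a fixed value threshold cannot guarantee.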
Re: Generating a Document Similarity Matrix
I could try to make the similarity computation work on the rows of a DistributedRowMatrix with several metrics (similar to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob) and would concentrate on the implementation of it as a mathematical operation not specific to any domain. So it would be left up to the users to convert their documents to vectors and maybe do things like stemming or stopword removal to reduce the computation overhead when they use this job for text documents. I could start working on that 2 weeks from now, though. Tell me if that's welcome and I'll go and create a jira issue :) -sebastian 2010/6/9 Jake Mannix jake.man...@gmail.com On Tue, Jun 8, 2010 at 4:45 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: The relation between these two problems (document similarity and item similarity in CF) is exactly like Sean pointed out: In the paper a document is a vector of term frequencies and the paper shows how to compute the pairwise similarities between those. To use this for collaborative filtering you actually just have to replace the document with an item which is a vector of user preferences. Yep, a vector is a vector is a vector. (And when you're me, even if you are *not* a vector, you might be a vector. ;) ) It shouldn't be too hard to make this work on a DistributedRowMatrix too, I think. You already mentioned you wanna have it that way some time in MAHOUT-362 :) Well indeed I did! -jake
Re: Generating a Document Similarity Matrix
Hi Jake, Thanks for that. I'll give it a try with cosine similarity first off and as I get more experience I'll try and implement some other similarity methods. Kris 2010/6/9 Jake Mannix jake.man...@gmail.com On Wed, Jun 9, 2010 at 5:11 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi, Thanks very much for the code. In implementing it, I got a little stuck on specifying the JobConf's similarity job - JobConf conf = new JobConf("similarity job"); I assume that I should define here how I would like two vectors to be compared with one another? Please do correct me if that's wrong. If so, however, could you point me to any examples of what this code should look like (e.g. cosine similarity)? I'm sure that these kinds of similarity jobs must already exist in Mahout but being new to both Mahout and MapReduce, I'm not sure where to find them. In the sample I mentioned (using sparse matrix multiplication), you don't get to choose the similarity - if the input vectors are unit-length normalized, then the computation is cosine similarity. You would have to write your own map-reduce job to do a different one. -jake Thanks, Kris 2010/6/8 Jake Mannix jake.man...@gmail.com Hi Kris, So you already know how to make a sparse feature matrix out of your Solr index, based on Grant's instructions? Once you have that matrix loaded into HDFS, then the following Java code should create a document-document similarity matrix:

//
String inputPath = "/path/to/matrix/on/hdfs";
String tmpPath = "/tmp/matrixmultiplyspace";
int numDocuments = // whatever your numDocuments is
int numTerms = // total number of terms in the matrix
DistributedRowMatrix text = new DistributedRowMatrix(inputPath, tmpPath, numDocuments, numTerms);
JobConf conf = new JobConf("similarity job");
text.configure(conf);
DistributedRowMatrix transpose = text.transpose();
DistributedRowMatrix similarity = transpose.times(transpose);
System.out.println("Similarity matrix lives: " + similarity.getRowPath());
//

Now, the rows of this similarity matrix are going to be way too dense, so you'll want to write a small map-reduce job (well, no reduce is necessary) to run over this matrix and trim out all the unuseful entries of each row, but that shouldn't be too hard to do. Of course, to do it really efficiently, that functionality could be folded into the reducer of the matrix multiply job, and done in the same pass over the data as that one. -jake On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Jake, Thanks for that. The first solution that you suggest is more like what I was imagining. Please excuse me, I'm new to Mahout and don't know how to use it to generate the full document-document similarity matrix. I would rather not have to re-implement the moreLikeThis algorithm that, although rather straightforward, may take time for a newbie to MapReduce like me. Could you guide me a little in finding the relevant Mahout code for generating the matrix or is it not really designed for that? For the moment, I would be happy to have an off-line batch version working. Also, it is desirable to take advantage of the text processing features that I have already configured using solr, so I would prefer to read in the feature vectors for the documents from a lucene index, as I am doing at present (e.g. http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/ ).
Thanks, Kris 2010/6/8 Jake Mannix jake.man...@gmail.com Hi Kris, If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc...). Then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline. The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine. An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested), to reduce the search space (as a lucene Filter) and speed up the query. -jake On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Olivier, Thanks for your suggestions. I have over 10 million documents and they have quite
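[Editor's note] Jake's point that the multiplication gives cosine similarity only for unit-length rows suggests a small normalization pass before building the matrix product. A rough sketch follows; the paths are placeholders, and for 10 million documents this would more realistically be a map-only job than a single-process loop:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public class NormalizeRows {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader in = new SequenceFile.Reader(fs, new Path("/path/to/matrix"), conf);
    SequenceFile.Writer out = SequenceFile.createWriter(fs, conf, new Path("/path/to/normalized-matrix"),
        IntWritable.class, VectorWritable.class);
    try {
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();
      while (in.next(key, value)) {
        // scale each document row to unit length so the later multiplication yields cosine similarity
        out.append(key, new VectorWritable(value.get().normalize()));
      }
    } finally {
      in.close();
      out.close();
    }
  }
}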
Re: Generating a Document Similarity Matrix
Well I'm not sure they're unique, they're just vectors. Would that not be the best neutral representation for things like this? What was the comment about keying by ints vs longs earlier? If unifying that helps bring things closer together I can look at it, if I can understand the issue. On Wed, Jun 9, 2010 at 6:56 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: The ItemSimilarityJob cannot be directly used as it's not working on a DistributedRowMatrix but on data structures unique to collaborative filtering, so if you ask me I'd say that a separate job would be required.
Re: Generating a Document Similarity Matrix
On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix jake.man...@gmail.com wrote: The ItemSimilarityJob actually uses implementations of the Vector class hierarchy? I think that's the issue - if the on-disk and in-mapper representations are never Vectors, then they won't interoperate with any of the matrix operations... Yes they are Vectors. And yeah, keying on ints is necessary for now, unless we want to make a new matrix type (at least for distributed matrices) which keys on longs (which actually might be a good idea: now that we're using VInt and VLong, the disk space and network usage should not be adversely affected - just the in-memory representation). Oh I see. Well that's not a problem. Already, IDs have to be mapped to ints to be used as dimensions in a Vector. So in most cases things are keyed by these int pseudo-IDs. That's OK too. A matrix is a bunch of vectors -- at least, that's a nice structure for a SequenceFile. Row (or col) ID mapped to row (column) vector. Is that not what other jobs are using? What's the better alternative we could think about converging on?
Generating a Document Similarity Matrix
Hi everyone, I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please? Many thanks, Kris
Re: Generating a Document Similarity Matrix
Hi Kris, If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc...). Then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline. The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine. An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested), to reduce the search space (as a lucene Filter) and speed up the query. -jake On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Olivier, Thanks for your suggestions. I have over 10 million documents and they have quite a lot of meta-data associated with them including rather large text fields. It is possible to tweak the moreLikeThis function from solr. I have tried changing the parameters ( http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300ms without sacrificing the quality of the results too much. I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene so I may give that a try. I'll keep you posted if I come up with other solutions. Thanks, Kris 2010/6/8 Olivier Grisel olivier.gri...@ensta.org 2010/6/8 Kris Jack mrkrisj...@gmail.com: Hi everyone, I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please? How many documents do you have in your index? Have you tried to tweak the MoreLikeThis parameters ? (I don't know if it's possible using the solr interface, I use it directly using the lucene java API) For instance you can trade off recall for speed by decreasing the number of terms to use in the query and trade recall for precision and speed by increasing the percentage of terms that should match. You could also use Mahout implementation of SVD to build low dimensional semantic vectors representing your documents (a.k.a. Latent Semantic Indexing) and then index those transformed frequency vectors in a dedicated lucene index (or document field provided you name the resulting terms with something that does not match real life terms present in other). However using standard SVD will probably result in dense (as opposed to sparse) low dimensional semantic vectors. I don't think lucene's lookup performance is good with dense frequency vectors even though the number of terms is greatly reduced by SVD. 
Hence it would probably be better to either threshold to keep the top 100 absolute values of each semantic vector before indexing (probably the simpler solution) or use a sparsifying, penalty-constrained variant of SVD / LSI. You should have a look at the literature on sparse coding or sparse dictionary learning, Sparse-PCA and more generally L1 penalty regression methods such as the Lasso and LARS. I don't know about any library for sparse semantic coding of documents that works automatically with lucene. Probably some non-trivial coding is needed there. Another alternative is finding low dimensional (64 or 32 components) dense codes, binary thresholding them and storing the integer codes in the DB or the lucene index, and then building smart exact match queries to find all documents lying in the Hamming ball of size 1 or 2 of the reference document's binary code. But I think this approach, while promising for web-scale document collections, is even more experimental and requires very good low dimensional encoders (I don't think linear models such as SVD are good enough for reducing sparse 10e6-component vectors to dense 64-component vectors; non-linear encoders such as Stacked Restricted Boltzmann Machines are probably a better choice). In any case let us know about your results, I am really interested in practical yet scalable solutions to this problem. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -- Dr
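[Editor's note] Olivier's simpler option, keeping only the top-k components of each dense SVD/LSI vector by absolute value before indexing, might look roughly like this; the helper is hypothetical and uses Mahout's Vector types:

import java.util.Arrays;
import java.util.Comparator;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Keep only the k largest-magnitude components of a dense semantic vector
// and return a sparse vector suitable for indexing.
public class SparsifySemanticVector {

  public static Vector topKByMagnitude(final Vector dense, int k) {
    Integer[] indexes = new Integer[dense.size()];
    for (int i = 0; i < indexes.length; i++) {
      indexes[i] = i;
    }
    // sort dimensions by absolute value, largest first
    Arrays.sort(indexes, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        return Double.compare(Math.abs(dense.getQuick(b)), Math.abs(dense.getQuick(a)));
      }
    });
    Vector sparse = new RandomAccessSparseVector(dense.size(), k);
    for (int i = 0; i < k && i < indexes.length; i++) {
      sparse.setQuick(indexes[i], dense.getQuick(indexes[i]));
    }
    return sparse;
  }
}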
Re: Generating a Document Similarity Matrix
Sorry, copied the previous link from the wrong tab :/ Meant to be https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html for reading in lucene vectors. 2010/6/8 Kris Jack mrkrisj...@gmail.com Hi Jake, Thanks for that. The first solution that you suggest is more like what I was imagining. Please excuse me, I'm new to Mahout and don't know how to use it to generate the full document-document similarity matrix. I would rather not have to re-implement the moreLikeThis algorithm that, although rather straight forward, may take time for a newbie to MapReduce like me. Could you guide me a little in finding the relevant Mahout code for generating the matrix or is it not really designed for that? For the moment, I would be happy to have an off-line batch version working. Also, it is desirable to take advantage of the text processing features that I have already configured using solr, so I would prefer to read in the feature vectors for the documents from a lucene index, as I am doing at present (e.g. http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/ ). Thanks, Kris 2010/6/8 Jake Mannix jake.man...@gmail.com Hi Kris, If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc...). Then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline. The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine. An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested), to reduce the search space (as a lucene Filter) and speed up the query. -jake On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Olivier, Thanks for your suggestions. I have over 10 million documents and they have quite a lot of meta-data associated with them including rather large text fields. It is possible to tweak the moreLikeThis function from solr. I have tried changing the parameters ( http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300ms without sacrificing the quality of the results too much. I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene so I may give that a try. I'll keep you posted if I come up with other solutions. Thanks, Kris 2010/6/8 Olivier Grisel olivier.gri...@ensta.org 2010/6/8 Kris Jack mrkrisj...@gmail.com: Hi everyone, I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please? 
How many documents do you have in your index? Have you tried to tweak the MoreLikeThis parameters? (I don't know if it's possible using the solr interface, I use it directly using the lucene java API) For instance you can trade off recall for speed by decreasing the number of terms to use in the query and trade recall for precision and speed by increasing the percentage of terms that should match. You could also use Mahout's implementation of SVD to build low dimensional semantic vectors representing your documents (a.k.a. Latent Semantic Indexing) and then index those transformed frequency vectors in a dedicated lucene index (or document field provided you name the resulting terms with something that does not match real life terms present in other). However using standard SVD will probably result in dense (as opposed to sparse) low dimensional semantic vectors. I don't think lucene's lookup performance is good with dense frequency vectors even though the number of terms is greatly reduced by SVD. Hence it would probably be better to either threshold the top 100 absolute values of each semantic vector before indexing (probably the simpler solution) or use a sparsifying penalty-constrained
Re: Generating a Document Similarity Matrix
Hi Kris, So you already know how to make a sparse feature matrix out of your Solr index, based on Grant's instructions? Once you have that matrix loaded into HDFS, then the following Java code should create a document-document similarity matrix:

//
String inputPath = "/path/to/matrix/on/hdfs";
String tmpPath = "/tmp/matrixmultiplyspace";
int numDocuments = // whatever your numDocuments is
int numTerms = // total number of terms in the matrix
DistributedRowMatrix text = new DistributedRowMatrix(inputPath, tmpPath, numDocuments, numTerms);
JobConf conf = new JobConf("similarity job");
text.configure(conf);
DistributedRowMatrix transpose = text.transpose();
DistributedRowMatrix similarity = transpose.times(transpose);
System.out.println("Similarity matrix lives: " + similarity.getRowPath());
//

Now, the rows of this similarity matrix are going to be way too dense, so you'll want to write a small map-reduce job (well, no reduce is necessary) to run over this matrix and trim out all the unuseful entries of each row, but that shouldn't be too hard to do. Of course, to do it really efficiently, that functionality could be folded into the reducer of the matrix multiply job, and done in the same pass over the data as that one. -jake On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Jake, Thanks for that. The first solution that you suggest is more like what I was imagining. Please excuse me, I'm new to Mahout and don't know how to use it to generate the full document-document similarity matrix. I would rather not have to re-implement the moreLikeThis algorithm that, although rather straightforward, may take time for a newbie to MapReduce like me. Could you guide me a little in finding the relevant Mahout code for generating the matrix or is it not really designed for that? For the moment, I would be happy to have an off-line batch version working. Also, it is desirable to take advantage of the text processing features that I have already configured using solr, so I would prefer to read in the feature vectors for the documents from a lucene index, as I am doing at present (e.g. http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/ ). Thanks, Kris 2010/6/8 Jake Mannix jake.man...@gmail.com Hi Kris, If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc...). Then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline. The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine. An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested), to reduce the search space (as a lucene Filter) and speed up the query. -jake On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Olivier, Thanks for your suggestions. I have over 10 million documents and they have quite a lot of meta-data associated with them including rather large text fields. It is possible to tweak the moreLikeThis function from solr.
I have tried changing the parameters ( http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300ms without sacrificing the quality of the results too much. I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene so I may give that a try. I'll keep you posted if I come up with other solutions. Thanks, Kris 2010/6/8 Olivier Grisel olivier.gri...@ensta.org 2010/6/8 Kris Jack mrkrisj...@gmail.com: Hi everyone, I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please? How many documents do you have in your index? Have you tried to tweak
Re: Generating a Document Similarity Matrix
Hi Kris, actually the code to compute the item-to-item similarities in the collaborative filtering part of mahout (which at the first look seems to be a totally different problem than yours) is based on a paper that deals with computing the pairwise similarity of text documents in a very simple way. Maybe that could be helpful to you: Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf -sebastian 2010/6/8 Kris Jack mrkrisj...@gmail.com Hi everyone, I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please? Many thanks, Kris
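[Editor's note] The core idea of the Elsayed et al. paper, grouping postings by term and summing partial products per document pair, can be illustrated with a small in-memory sketch. This is only an illustration of the algorithm, not the Mahout job itself, which does the same work in MapReduce passes; the class name is hypothetical:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.mahout.math.Vector;

// Invert documents into per-term postings, then let every pair of postings
// for the same term contribute a partial product to that document pair.
public class PairwiseSimilaritySketch {

  public static Map<Long, Double> pairwiseDotProducts(Map<Integer, Vector> docs) {
    // "map" phase: term index -> list of (docId, weight) postings
    Map<Integer, List<double[]>> postings = new HashMap<Integer, List<double[]>>();
    for (Map.Entry<Integer, Vector> doc : docs.entrySet()) {
      Iterator<Vector.Element> it = doc.getValue().iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        List<double[]> list = postings.get(e.index());
        if (list == null) {
          list = new ArrayList<double[]>();
          postings.put(e.index(), list);
        }
        list.add(new double[] { doc.getKey(), e.get() });
      }
    }
    // "reduce" phase: accumulate w1 * w2 for every document pair sharing a term
    Map<Long, Double> sims = new HashMap<Long, Double>();
    for (List<double[]> list : postings.values()) {
      for (int i = 0; i < list.size(); i++) {
        for (int j = i + 1; j < list.size(); j++) {
          int d1 = (int) list.get(i)[0];
          int d2 = (int) list.get(j)[0];
          long pair = (((long) Math.min(d1, d2)) << 32) | Math.max(d1, d2);  // canonical pair key
          Double current = sims.get(pair);
          sims.put(pair, (current == null ? 0.0 : current) + list.get(i)[1] * list.get(j)[1]);
        }
      }
    }
    return sims;
  }
}

If the input rows are unit-normalized, the accumulated dot products are exactly the cosine similarities.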
Re: Generating a Document Similarity Matrix
The code in mahout CF is doing that? I don't think that's right, we don't do anything that fancy right now, do we Sean? -jake On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter ssc.o...@googlemail.comwrote: Hi Kris, actually the code to compute the item-to-item similarities in the collaborative filtering part of mahout (which at the first look seems to be a totally different problem than yours) is based on a paper that deals with computing the pairwise similarity of text documents in a very simple way. Maybe that could be helpful to you: Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf -sebastian 2010/6/8 Kris Jack mrkrisj...@gmail.com Hi everyone, I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please? Many thanks, Kris
Re: Generating a Document Similarity Matrix
Sort of, there is a separate job to compute all item-item similarities under a variety of metrics. This is what Sebastian wrote. It's not used in the co-occurrence recommender (but could be -- vaguely a to-do here.) But sure if you're willing to think of a doc as an item vector of preferences from words then this works fine to compute doc similarity under these metrics. On Wed, Jun 9, 2010 at 12:52 AM, Jake Mannix jake.man...@gmail.com wrote: The code in mahout CF is doing that? I don't think that's right, we don't do anything that fancy right now, do we Sean? -jake On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter ssc.o...@googlemail.comwrote: Hi Kris, actually the code to compute the item-to-item similarities in the collaborative filtering part of mahout (which at the first look seems to be a totally different problem than yours) is based on a paper that deals with computing the pairwise similarity of text documents in a very simple way. Maybe that could be helpful to you: Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf -sebastian 2010/6/8 Kris Jack mrkrisj...@gmail.com Hi everyone, I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked-up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please? Many thanks, Kris
Re: Generating a Document Similarity Matrix
On Tue, Jun 8, 2010 at 3:56 PM, Olivier Grisel olivier.gri...@ensta.org wrote: 2010/6/8 Jake Mannix jake.man...@gmail.com: Hi Kris, If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc...). Then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline. For 10e6 documents it might not be reasonable to generate the complete document-document similarity matrix: 1e12 components = a couple of terabytes of similarity values just to find the top N afterwards: Nope, this isn't what happens in what I described: when you take a sparseDocumentMatrix.transpose().times(itself), the scaling does not go as N^2*M, with N^2 outputs - the calculation is sparse, only computing the entries which are nonzero. If you pre-sparsify the documents a little (remove all terms which occur in more than 1% of all documents, or something like that), this sparse calculation is even faster - it scales as sum_{i=1...N}(k_i)^2, where k_i is the number of nonzero elements in document i. If all documents were the same length (k), then this scales as N*k^2, and the total number of nonzero entries in the output is far less than N^2 if k << N, M. Getting rid of the common terms (even *lots* of them) beforehand is still a very good idea. -jake
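[Editor's note] To put rough numbers on Jake's scaling argument (illustrative values only, not measurements from this thread):

public class ScalingEstimate {
  public static void main(String[] args) {
    long numDocs = 10000000L;  // ~10 million documents, the corpus size discussed above
    long k = 100L;             // assumed nonzero terms per document after pruning common terms

    long sparsePartialProducts = numDocs * k * k;   // ~ N * k^2 = 1e11 multiply-adds
    long densePairCount = numDocs * numDocs;        // a full doc-doc matrix would hold ~ 1e14 cells

    System.out.println("sparse multiply partial products ~ " + sparsePartialProducts);
    System.out.println("full doc-doc matrix entries      ~ " + densePairCount);
  }
}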
Re: Generating a Document Similarity Matrix
On Tue, Jun 8, 2010 at 4:45 PM, Sebastian Schelter ssc.o...@googlemail.comwrote: The relation between these two problems (document similarity and item similarity in CF) is exactly like Sean pointed out: In the paper a document is a vector of term frequencies and the paper shows how to compute the pairwise similarities between those. To use this for collaborative filtering you actually just have to replace the document with an item which is a vector of user preferences. Yep, a vector is a vector is a vector. (And when you're me, even if you are *not* a vector, you might be a vector. ;) ) It shouldn't be too hard to make this work on a DistributedRowMatrix too, I think. You already mentioned you wanna have it that way some time in MAHOUT-362 :) Well indeed I did! -jake