Re: Document similarity

2016-03-11 Thread Charles Earl
You might also look at this paper by Wallach
http://maroo.cs.umass.edu/pub/web/getpdf.php?id=1101

Sent from my iPhone

> On Mar 11, 2016, at 8:11 AM, David Starina  wrote:
> 
> Well, there is also an online method of LDA in Spark ... Pat, is there any
> documentation on the method you described?
> 
>> On Wed, Feb 24, 2016 at 6:10 PM, Pat Ferrel  wrote:
>> 
>> The method I described calculates similarity on the fly but requires new
>> docs to go through feature extraction before similarity can be queried. The
>> length of time to do feature extraction is short compared to training LDA.
>> 
>> Another method that gets at semantic similarity uses adaptive skip-grams
>> for text features. http://arxiv.org/abs/1502.07257 I haven’t tried this
>> but a friend saw a presentation about using this method to create features
>> for a search engine which showed a favorable comparison with word2vec.
>> 
>> If you want to use LDA note that it is an unsupervised categorization
>> method. To use it, the cluster descriptors (a vector of important terms)
>> can be compared to the analyzed incoming document using a KNN/search
>> engine. This will give you a list of the closest clusters but doesn’t
>> really give you documents, which is your goal I think. LDA should be re-run
>> periodically to generate new clusters. Do you want to know cluster
>> inclusion or get a list of similar docs?
>> 
>> On Feb 23, 2016, at 1:01 PM, David Starina 
>> wrote:
>> 
>> Guys, one more question ... Are there some incremental methods to do this?
>> I don't want to run the whole job again once a new document is added. In
>> case of LDA ... I guess the best way is to calculate the topics on the new
>> document using the topics from the previous LDA run ... And then every once
>> in a while to recalculate the topics with the new documents?
>> 
>> On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel 
>> wrote:
>> 
>>> Something we are working on for purely content based similarity is using
>> a
>>> KNN engine (search engine) but creating features from word2vec and an NER
>>> (Named Entity Recognizer).
>>> 
>>> putting the generated features into fields of a doc can really help with
>>> similarity because w2v and NER create semantic features. You can also try
>>> n-grams or skip-grams. These features are not very helpful for search but
>>> for  similarity they work well.
>>> 
>>> The query to the KNN engine is a document, each field mapped to the
>>> corresponding field of the index. The result is the k nearest neighbors
>> to
>>> the query doc.
>>> 
>>> 
> On Feb 14, 2016, at 11:05 AM, David Starina 
 wrote:
 
 Charles, thank you, I will check that out.
 
 Ted, I am looking for semantic similarity. Unfortunately, I do not have
>>> any
 data on the usage of the documents (if by usage you mean user behavior).
 
> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning 
 wrote:
 
> Did you want textual similarity?
> 
> Or semantic similarity?
> 
> The actual semantics of a message can be opaque from the content, but
>>> clear
> from the usage.
> 
> 
> 
> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl 
>>> wrote:
> 
>> David,
>> LDA or LSI can work quite nicely for similarity (YMMV of course
>>> depending
>> on the characterization of your documents).
>> You basically use the dot product of the square roots of the vectors
>>> for
>> LDA -- if you do a search for Hellinger or Bhattacharyya distance
>>> that
>> will lead you to a good similarity or distance measure.
>> As I recall, Spark does provide an LDA implementation. Gensim provides
>>> an
>> API for doing LDA similarity out of the box. Vowpal Wabbit is also
>>> worth
>> looking at, particularly for a large dataset.
>> Hope this is useful.
>> Cheers
>> 
>> Sent from my iPhone
>> 
 On Feb 14, 2016, at 8:14 AM, David Starina 
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> I need to build a system to determine N (i.e. 10) most similar
> documents
>> to
>>> a given document. I have some (theoretical) knowledge of Mahout
>> algorithms,
>>> but not enough to build the system. Can you give me some suggestions?
>>> 
>>> At first I was researching Latent Semantic Analysis for the task, but
>> since
>>> Mahout doesn't support it, I started researching some other options.
>> I
>> got
>>> a hint that instead of LSA, you can use LDA (Latent Dirichlet
> allocation)
>>> in Mahout to achieve similar and even better results.
>>> 
>>> However ... and this is where I got confused ... LDA is a clustering
>>> algorithm. However, what I need is not to cluster the documents into
>> N
>>> clusters - I need to get a matrix 

Re: Document similarity

2016-03-11 Thread David Starina
Well, there is also an online method of LDA in Spark ... Pat, is there any
documentation on the method you described?
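
For reference, training with the online optimizer looks roughly like this -- an untested sketch against the MLlib API as I remember it (circa Spark 1.5/1.6), with a toy corpus standing in for a real vectorization step:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

import scala.Tuple2;

public class OnlineLdaSketch {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("online-lda").setMaster("local[*]"));

    // Toy corpus of (docId, term-frequency vector) pairs; in practice these
    // would come from a real vectorization step over the document collection.
    List<Tuple2<Long, Vector>> docs = Arrays.asList(
        new Tuple2<>(0L, Vectors.dense(1.0, 0.0, 3.0, 0.0)),
        new Tuple2<>(1L, Vectors.dense(0.0, 2.0, 0.0, 1.0)),
        new Tuple2<>(2L, Vectors.dense(1.0, 0.0, 2.0, 1.0)));
    JavaPairRDD<Long, Vector> corpus = sc.parallelizePairs(docs).cache();

    // "online" selects the mini-batch variational optimizer, so topics are
    // fit in passes over subsets of the corpus rather than full-batch EM.
    LDAModel model = new LDA()
        .setK(2)
        .setMaxIterations(50)
        .setOptimizer("online")
        .run(corpus);

    // Top terms per topic: the "cluster descriptors" discussed in this thread.
    Tuple2<int[], double[]>[] topics = model.describeTopics(3);
    for (int t = 0; t < topics.length; t++) {
      System.out.println("topic " + t + ": term ids " + Arrays.toString(topics[t]._1()));
    }
    sc.stop();
  }
}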

On Wed, Feb 24, 2016 at 6:10 PM, Pat Ferrel  wrote:

> The method I described calculates similarity on the fly but requires new
> docs to go through feature extraction before similarity can be queried. The
> length of time to do feature extraction is short compared to training LDA.
>
> Another method that gets at semantic similarity uses adaptive skip-grams
> for text features. http://arxiv.org/abs/1502.07257 I haven’t tried this
> but a friend saw a presentation about using this method to create features
> for a search engine which showed a favorable comparison with word2vec.
>
> If you want to use LDA note that it is an unsupervised categorization
> method. To use it, the cluster descriptors (a vector of important terms)
> can be compared to the analyzed incoming document using a KNN/search
> engine. This will give you a list of the closest clusters but doesn’t
> really give you documents, which is your goal I think. LDA should be re-run
> periodically to generate new clusters. Do you want to know cluster
> inclusion or get a list of similar docs?
>
> On Feb 23, 2016, at 1:01 PM, David Starina 
> wrote:
>
> Guys, one more question ... Are there some incremental methods to do this?
> I don't want to run the whole job again once a new document is added. In
> case of LDA ... I guess the best way is to calculate the topics on the new
> document using the topics from the previous LDA run ... And then every once
> in a while to recalculate the topics with the new documents?
>
> On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel 
> wrote:
>
> > Something we are working on for purely content based similarity is using
> a
> > KNN engine (search engine) but creating features from word2vec and an NER
> > (Named Entity Recognizer).
> >
> > putting the generated features into fields of a doc can really help with
> > similarity because w2v and NER create semantic features. You can also try
> > n-grams or skip-grams. These features are not very helpful for search but
> > for  similarity they work well.
> >
> > The query to the KNN engine is a document, each field mapped to the
> > corresponding field of the index. The result is the k nearest neighbors
> to
> > the query doc.
> >
> >
> >> On Feb 14, 2016, at 11:05 AM, David Starina 
> > wrote:
> >>
> >> Charles, thank you, I will check that out.
> >>
> >> Ted, I am looking for semantic similarity. Unfortunately, I do not have
> > any
> >> data on the usage of the documents (if by usage you mean user behavior).
> >>
> >> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning 
> > wrote:
> >>
> >>> Did you want textual similarity?
> >>>
> >>> Or semantic similarity?
> >>>
> >>> The actual semantics of a message can be opaque from the content, but
> > clear
> >>> from the usage.
> >>>
> >>>
> >>>
> >>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl 
> > wrote:
> >>>
>  David,
>  LDA or LSI can work quite nicely for similarity (YMMV of course
> > depending
>  on the characterization of your documents).
>  You basically use the dot product of the square roots of the vectors
> > for
>  LDA -- if you do a search for Hellinger or Bhattacharyya distance
> > that
>  will lead you to a good similarity or distance measure.
>  As I recall, Spark does provide an LDA implementation. Gensim provides
> > an
>  API for doing LDA similarity out of the box. Vowpal Wabbit is also
> > worth
>  looking at, particularly for a large dataset.
>  Hope this is useful.
>  Cheers
> 
>  Sent from my iPhone
> 
> > On Feb 14, 2016, at 8:14 AM, David Starina 
>  wrote:
> >
> > Hi,
> >
> > I need to build a system to determine N (i.e. 10) most similar
> >>> documents
>  to
> > a given document. I have some (theoretical) knowledge of Mahout
>  algorithms,
> > but not enough to build the system. Can you give me some suggestions?
> >
> > At first I was researching Latent Semantic Analysis for the task, but
>  since
> > Mahout doesn't support it, I started researching some other options.
> I
>  got
> > a hint that instead of LSA, you can use LDA (Latent Dirichlet
> >>> allocation)
> > in Mahout to achieve similar and even better results.
> >
> > However ... and this is where I got confused ... LDA is a clustering
> > algorithm. However, what I need is not to cluster the documents into
> N
> > clusters - I need to get a matrix (similar to TF-IDF) from which I
> can
> > calculate some sort of a distance for any two documents to get N most
> > similar documents for any given document.
> >
> > How do I achieve that? My idea was (still mostly theoretical, since I
>  have
> > some 

Re: Document similarity

2016-02-24 Thread Pat Ferrel
The method I described calculates similarity on the fly but requires new docs 
to go through feature extraction before similarity can be queried. The length 
of time to do feature extraction is short compared to training LDA.

Another method that gets at semantic similarity uses adaptive skip-grams for 
text features. http://arxiv.org/abs/1502.07257 I haven’t tried this but a 
friend saw a presentation about using this method to create features for a 
search engine which showed a favorable comparison with word2vec.

If you want to use LDA note that it is an unsupervised categorization method. 
To use it, the cluster descriptors (a vector of important terms) can be 
compared to the analyzed incoming document using a KNN/search engine. This will 
give you a list of the closest clusters but doesn’t really give you documents, 
which is your goal I think. LDA should be re-run periodically to generate new 
clusters. Do you want to know cluster inclusion or get a list of similar docs?

On Feb 23, 2016, at 1:01 PM, David Starina  wrote:

Guys, one more question ... Are there some incremental methods to do this?
I don't want to run the whole job again once a new document is added. In
case of LDA ... I guess the best way is to calculate the topics on the new
document using the topics from the previous LDA run ... And then every once
in a while to recalculate the topics with the new documents?

On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel  wrote:

> Something we are working on for purely content based similarity is using a
> KNN engine (search engine) but creating features from word2vec and an NER
> (Named Entity Recognizer).
> 
> putting the generated features into fields of a doc can really help with
> similarity because w2v and NER create semantic features. You can also try
> n-grams or skip-grams. These features are not very helpful for search but
> for  similarity they work well.
> 
> The query to the KNN engine is a document, each field mapped to the
> corresponding field of the index. The result is the k nearest neighbors to
> the query doc.
> 
> 
>> On Feb 14, 2016, at 11:05 AM, David Starina 
> wrote:
>> 
>> Charles, thank you, I will check that out.
>> 
>> Ted, I am looking for semantic similarity. Unfortunately, I do not have
> any
>> data on the usage of the documents (if by usage you mean user behavior).
>> 
>> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning 
> wrote:
>> 
>>> Did you want textual similarity?
>>> 
>>> Or semantic similarity?
>>> 
>>> The actual semantics of a message can be opaque from the content, but
> clear
>>> from the usage.
>>> 
>>> 
>>> 
>>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl 
> wrote:
>>> 
 David,
 LDA or LSI can work quite nicely for similarity (YMMV of course
> depending
 on the characterization of your documents).
 You basically use the dot product of the square roots of the vectors
> for
 LDA -- if you do a search for Hellinger or Bhattacharyya distance
> that
 will lead you to a good similarity or distance measure.
 As I recall, Spark does provide an LDA implementation. Gensim provides
> an
 API for doing LDA similarity out of the box. Vowpal Wabbit is also
> worth
 looking at, particularly for a large dataset.
 Hope this is useful.
 Cheers
 
 Sent from my iPhone
 
> On Feb 14, 2016, at 8:14 AM, David Starina 
 wrote:
> 
> Hi,
> 
> I need to build a system to determine N (i.e. 10) most similar
>>> documents
 to
> a given document. I have some (theoretical) knowledge of Mahout
 algorithms,
> but not enough to build the system. Can you give me some suggestions?
> 
> At first I was researching Latent Semantic Analysis for the task, but
 since
> Mahout doesn't support it, I started researching some other options. I
 got
> a hint that instead of LSA, you can use LDA (Latent Dirichlet
>>> allocation)
> in Mahout to achieve similar and even better results.
> 
> However ... and this is where I got confused ... LDA is a clustering
> algorithm. However, what I need is not to cluster the documents into N
> clusters - I need to get a matrix (similar to TF-IDF) from which I can
> calculate some sort of a distance for any two documents to get N most
> similar documents for any given document.
> 
> How do I achieve that? My idea was (still mostly theoretical, since I
 have
> some problems with running the LDA algorithm) to extract some number
> of
> topics with LDA, but I need not cluster the documents with the help of
 this
> topics, but to get the matrix of documents as one dimension and topics
>>> as
> the other dimension. I was guessing I could then use this matrix as an
> input to row-similarity algorithm.
> 
> Is this the correct concept? Or am I 

Re: Document similarity

2016-02-23 Thread David Starina
Guys, one more question ... Are there some incremental methods to do this?
I don't want to run the whole job again once a new document is added. In
case of LDA ... I guess the best way is to calculate the topics on the new
document using the topics from the previous LDA run ... And then every once
in a while to recalculate the topics with the new documents?

On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel  wrote:

> Something we are working on for purely content based similarity is using a
> KNN engine (search engine) but creating features from word2vec and an NER
> (Named Entity Recognizer).
>
> putting the generated features into fields of a doc can really help with
> similarity because w2v and NER create semantic features. You can also try
> n-grams or skip-grams. These features are not very helpful for search but
> for  similarity they work well.
>
> The query to the KNN engine is a document, each field mapped to the
> corresponding field of the index. The result is the k nearest neighbors to
> the query doc.
>
>
> > On Feb 14, 2016, at 11:05 AM, David Starina 
> wrote:
> >
> > Charles, thank you, I will check that out.
> >
> > Ted, I am looking for semantic similarity. Unfortunately, I do not have
> any
> > data on the usage of the documents (if by usage you mean user behavior).
> >
> > On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning 
> wrote:
> >
> >> Did you want textual similarity?
> >>
> >> Or semantic similarity?
> >>
> >> The actual semantics of a message can be opaque from the content, but
> clear
> >> from the usage.
> >>
> >>
> >>
> >> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl 
> wrote:
> >>
> >>> David,
> >>> LDA or LSI can work quite nicely for similarity (YMMV of course
> depending
> >>> on the characterization of your documents).
> >>> You basically use the dot product of the square roots of the vectors
> for
> >>> LDA -- if you do a search for Hellinger or Bhattacharyya distance
> > that
> >>> will lead you to a good similarity or distance measure.
> >>> As I recall, Spark does provide an LDA implementation. Gensim provides
> an
> >>> API for doing LDA similarity out of the box. Vowpal Wabbit is also
> worth
> >>> looking at, particularly for a large dataset.
> >>> Hope this is useful.
> >>> Cheers
> >>>
> >>> Sent from my iPhone
> >>>
>  On Feb 14, 2016, at 8:14 AM, David Starina 
> >>> wrote:
> 
>  Hi,
> 
>  I need to build a system to determine N (i.e. 10) most similar
> >> documents
> >>> to
>  a given document. I have some (theoretical) knowledge of Mahout
> >>> algorithms,
>  but not enough to build the system. Can you give me some suggestions?
> 
>  At first I was researching Latent Semantic Analysis for the task, but
> >>> since
>  Mahout doesn't support it, I started researching some other options. I
> >>> got
>  a hint that instead of LSA, you can use LDA (Latent Dirichlet
> >> allocation)
>  in Mahout to achieve similar and even better results.
> 
>  However ... and this is where I got confused ... LDA is a clustering
>  algorithm. However, what I need is not to cluster the documents into N
>  clusters - I need to get a matrix (similar to TF-IDF) from which I can
>  calculate some sort of a distance for any two documents to get N most
>  similar documents for any given document.
> 
>  How do I achieve that? My idea was (still mostly theoretical, since I
> >>> have
>  some problems with running the LDA algorithm) to extract some number
> of
>  topics with LDA, but I need not cluster the documents with the help of
> >>> this
>  topics, but to get the matrix of documents as one dimension and topics
> >> as
>  the other dimension. I was guessing I could then use this matrix as an
>  input to row-similarity algorithm.
> 
>  Is this the correct concept? Or am I missing something?
> 
>  And, since LDA is not supported on Spark/Samsara, how could I achieve
>  similar results on Spark?
> 
> 
>  Thanks in advance,
>  David
> >>>
> >>
>
>


Re: Document similarity

2016-02-14 Thread Pat Ferrel
Something we are working on for purely content based similarity is using a KNN 
engine (search engine) but creating features from word2vec and an NER (Named 
Entity Recognizer).

Putting the generated features into fields of a doc can really help with
similarity because w2v and NER create semantic features. You can also try
n-grams or skip-grams. These features are not very helpful for search but for
similarity they work well.

The query to the KNN engine is a document, each field mapped to the 
corresponding field of the index. The result is the k nearest neighbors to the 
query doc.
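
A rough sketch of the query side (Lucene 5.x-style BooleanQuery, from memory; the field names "text", "entities" and "w2v_clusters" are just placeholders for whatever your feature extraction produces):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class FieldedSimilarityQuerySketch {

  // Build one query from a "query document": each feature field is matched
  // against the field of the same name in the index, OR-ed together so the
  // engine ranks neighbours by how many semantic features they share.
  public static Query buildQuery(String text, String entities, String w2vClusters)
      throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    BooleanQuery.Builder query = new BooleanQuery.Builder();
    query.add(new QueryParser("text", analyzer).parse(QueryParser.escape(text)),
        BooleanClause.Occur.SHOULD);
    query.add(new QueryParser("entities", analyzer).parse(QueryParser.escape(entities)),
        BooleanClause.Occur.SHOULD);
    query.add(new QueryParser("w2v_clusters", analyzer).parse(QueryParser.escape(w2vClusters)),
        BooleanClause.Occur.SHOULD);
    return query.build();
  }

  public static void main(String[] args) throws Exception {
    // The field values come from prior feature extraction (analysis, NER, and
    // binning word2vec vectors into cluster-id tokens) -- assumed here.
    Query q = buildQuery(
        "jaguar spotted near the river bank",
        "LOC river_bank",
        "w2v_cluster_17 w2v_cluster_203");
    System.out.println(q);
  }
}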


> On Feb 14, 2016, at 11:05 AM, David Starina  wrote:
> 
> Charles, thank you, I will check that out.
> 
> Ted, I am looking for semantic similarity. Unfortunately, I do not have any
> data on the usage of the documents (if by usage you mean user behavior).
> 
> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning  wrote:
> 
>> Did you want textual similarity?
>> 
>> Or semantic similarity?
>> 
>> The actual semantics of a message can be opaque from the content, but clear
>> from the usage.
>> 
>> 
>> 
>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl  wrote:
>> 
>>> David,
>>> LDA or LSI can work quite nicely for similarity (YMMV of course depending
>>> on the characterization of your documents).
>>> You basically use the dot product of the square roots of the vectors for
>>> LDA -- if you do a search for Hellinger or Bhattacharyya distance that
>>> will lead you to a good similarity or distance measure.
>>> As I recall, Spark does provide an LDA implementation. Gensim provides an
>>> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
>>> looking at, particularly for a large dataset.
>>> Hope this is useful.
>>> Cheers
>>> 
>>> Sent from my iPhone
>>> 
 On Feb 14, 2016, at 8:14 AM, David Starina 
>>> wrote:
 
 Hi,
 
 I need to build a system to determine N (i.e. 10) most similar
>> documents
>>> to
 a given document. I have some (theoretical) knowledge of Mahout
>>> algorithms,
 but not enough to build the system. Can you give me some suggestions?
 
 At first I was researching Latent Semantic Analysis for the task, but
>>> since
 Mahout doesn't support it, I started researching some other options. I
>>> got
 a hint that instead of LSA, you can use LDA (Latent Dirichlet
>> allocation)
 in Mahout to achieve similar and even better results.
 
 However ... and this is where I got confused ... LDA is a clustering
 algorithm. However, what I need is not to cluster the documents into N
 clusters - I need to get a matrix (similar to TF-IDF) from which I can
 calculate some sort of a distance for any two documents to get N most
 similar documents for any given document.
 
 How do I achieve that? My idea was (still mostly theoretical, since I
>>> have
 some problems with running the LDA algorithm) to extract some number of
 topics with LDA, but I need not cluster the documents with the help of
>>> this
 topics, but to get the matrix of documents as one dimension and topics
>> as
 the other dimension. I was guessing I could then use this matrix as an
 input to row-similarity algorithm.

 Is this the correct concept? Or am I missing something?

 And, since LDA is not supported on Spark/Samsara, how could I achieve
 similar results on Spark?
 
 
 Thanks in advance,
 David
>>> 
>> 



Re: Document similarity

2016-02-14 Thread David Starina
Charles, thank you, I will check that out.

Ted, I am looking for semantic similarity. Unfortunately, I do not have any
data on the usage of the documents (if by usage you mean user behavior).

On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning  wrote:

> Did you want textual similarity?
>
> Or semantic similarity?
>
> The actual semantics of a message can be opaque from the content, but clear
> from the usage.
>
>
>
> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl  wrote:
>
> > David,
> > LDA or LSI can work quite nicely for similarity (YMMV of course depending
> > on the characterization of your documents).
> > You basically use the dot product of the square roots of the vectors for
> > LDA -- if you do a search for Hellinger or Bhattacharyya distance that
> > will lead you to a good similarity or distance measure.
> > As I recall, Spark does provide an LDA implementation. Gensim provides an
> > API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
> > looking at, particularly for a large dataset.
> > Hope this is useful.
> > Cheers
> >
> > Sent from my iPhone
> >
> > > On Feb 14, 2016, at 8:14 AM, David Starina 
> > wrote:
> > >
> > > Hi,
> > >
> > > I need to build a system to determine N (i.e. 10) most similar
> documents
> > to
> > > a given document. I have some (theoretical) knowledge of Mahout
> > algorithms,
> > > but not enough to build the system. Can you give me some suggestions?
> > >
> > > At first I was researching Latent Semantic Analysis for the task, but
> > since
> > > Mahout doesn't support it, I started researching some other options. I
> > got
> > > a hint that instead of LSA, you can use LDA (Latent Dirichlet
> allocation)
> > > in Mahout to achieve similar and even better results.
> > >
> > > However ... and this is where I got confused ... LDA is a clustering
> > > algorithm. However, what I need is not to cluster the documents into N
> > > clusters - I need to get a matrix (similar to TF-IDF) from which I can
> > > calculate some sort of a distance for any two documents to get N most
> > > similar documents for any given document.
> > >
> > > How do I achieve that? My idea was (still mostly theoretical, since I
> > have
> > > some problems with running the LDA algorithm) to extract some number of
> > > topics with LDA, but I need not cluster the documents with the help of
> > this
> > > topics, but to get the matrix of documents as one dimension and topics
> as
> > > the other dimension. I was guessing I could then use this matrix as an
> > > input to row-similarity algorithm.
> > >
> > > Is this the correct concept? Or am I missing something?
> > >
> > > And, since LDA is not supported on Spark/Samsara, how could I achieve
> > > similar results on Spark?
> > >
> > >
> > > Thanks in advance,
> > > David
> >
>


Document similarity

2016-02-14 Thread David Starina
Hi,

I need to build a system to determine N (i.e. 10) most similar documents to
a given document. I have some (theoretical) knowledge of Mahout algorithms,
but not enough to build the system. Can you give me some suggestions?

At first I was researching Latent Semantic Analysis for the task, but since
Mahout doesn't support it, I started researching some other options. I got
a hint that instead of LSA, you can use LDA (Latent Dirichlet allocation)
in Mahout to achieve similar and even better results.

However ... and this is where I got confused ... LDA is a clustering
algorithm. However, what I need is not to cluster the documents into N
clusters - I need to get a matrix (similar to TF-IDF) from which I can
calculate some sort of a distance for any two documents to get N most
similar documents for any given document.

How do I achieve that? My idea was (still mostly theoretical, since I have
some problems with running the LDA algorithm) to extract some number of
topics with LDA, but I need not cluster the documents with the help of this
topics, but to get the matrix of documents as one dimension and topics as
the other dimension. I was guessing I could then use this matrix as an
input to row-similarity algorithm.
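
Concretely, the comparison over those document/topic rows would be something like this rough sketch (toy numbers; the real rows would come from LDA inference, and the square-root dot product is the Bhattacharyya-style measure mentioned elsewhere in the thread):

import java.util.Arrays;

public class TopicMatrixSimilaritySketch {

  // Bhattacharyya coefficient between two topic distributions: the dot
  // product of their element-wise square roots (1.0 means identical).
  static double similarity(double[] p, double[] q) {
    double sim = 0.0;
    for (int k = 0; k < p.length; k++) {
      sim += Math.sqrt(p[k]) * Math.sqrt(q[k]);
    }
    return sim;
  }

  public static void main(String[] args) {
    // Toy doc x topic matrix; each row is a document's topic distribution.
    double[][] docTopic = {
        {0.7, 0.2, 0.1},
        {0.6, 0.3, 0.1},
        {0.1, 0.1, 0.8}};

    int query = 0;
    Integer[] order = {0, 1, 2};
    // Rank the other documents by similarity to the query row.
    Arrays.sort(order, (a, b) ->
        Double.compare(similarity(docTopic[query], docTopic[b]),
                       similarity(docTopic[query], docTopic[a])));
    for (int doc : order) {
      if (doc != query) {
        System.out.println("doc " + doc + " sim " + similarity(docTopic[query], docTopic[doc]));
      }
    }
  }
}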

Is this the correct concept? Or am I missing something?

And, since LDA is not supported on Spark/Samsara, how could I achieve
similar results on Spark?


Thanks in advance,
David


Re: Document similarity

2016-02-14 Thread Ted Dunning
Did you want textual similarity?

Or semantic similarity?

The actual semantics of a message can be opaque from the content, but clear
from the usage.



On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl  wrote:

> David,
> LDA or LSI can work quite nicely for similarity (YMMV of course depending
> on the characterization of your documents).
> You basically use the dot product of the square roots of the vectors for
> LDA -- if you do a search for Hellinger or Bhattacharyya distance that
> will lead you to a good similarity or distance measure.
> As I recall, Spark does provide an LDA implementation. Gensim provides an
> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
> looking at, particularly for a large dataset.
> Hope this is useful.
> Cheers
>
> Sent from my iPhone
>
> > On Feb 14, 2016, at 8:14 AM, David Starina 
> wrote:
> >
> > Hi,
> >
> > I need to build a system to determine N (i.e. 10) most similar documents
> to
> > a given document. I have some (theoretical) knowledge of Mahout
> algorithms,
> > but not enough to build the system. Can you give me some suggestions?
> >
> > At first I was researching Latent Semantic Analysis for the task, but
> since
> > Mahout doesn't support it, I started researching some other options. I
> got
> > a hint that instead of LSA, you can use LDA (Latent Dirichlet allocation)
> > in Mahout to achieve similar and even better results.
> >
> > However ... and this is where I got confused ... LDA is a clustering
> > algorithm. However, what I need is not to cluster the documents into N
> > clusters - I need to get a matrix (similar to TF-IDF) from which I can
> > calculate some sort of a distance for any two documents to get N most
> > similar documents for any given document.
> >
> > How do I achieve that? My idea was (still mostly theoretical, since I
> have
> > some problems with running the LDA algorithm) to extract some number of
> > topics with LDA, but I need not cluster the documents with the help of
> this
> > topics, but to get the matrix of documents as one dimension and topics as
> > the other dimension. I was guessing I could then use this matrix as an
> > input to row-similarity algorithm.
> >
> > Is this the correct concept? Or am I missing something?
> >
> > And, since LDA is not supported on Spark/Samsara, how could I achieve
> > similar results on Spark?
> >
> >
> > Thanks in advance,
> > David
>


Is Feature Hashing appropriate for document to document similarity calculations?

2013-04-24 Thread Martin Bayly
We have a system that needs to be able to incrementally calculate
document-document text similarity metrics as new documents are seen.  I'm
trying to understand if using feature hashing with for example, a
StaticWordValueEncoder is appropriate for this kind of use case.  Our text
documents can contain web content so the size of the feature vector is
really only bound by the number of words in the language.

Currently our implementation uses a simple vector based bag of words type
model to create 'one cell per word' feature vectors for each document and
then we use cosine similarity to determine document to document similarity.
 We are not using Mahout.

The issue with this approach is that the one cell per word feature vectors
require the use of a singleton dictionary object to turn words into vector
indexes, so we can only index one document at a time.

I've been reading through the Mahout archives and the Mahout in Action book
looking to see if Mahout has any answers to help with incremental
parallelized vector generation, but it seems like the Mahout seq2sparse
processes have the same 'batch' issue.  I've seen various posts referring
to using feature hashing as a way around this and the classifiers in part 3
of Mahout in Action explain how to use feature hashing to encode text like
features.

I'm just too green to know whether it's appropriate for our use case.
 Particularly whether the multiple probes recommended when using feature
hashing with text, and the likelihood of feature collisions, will
significantly compromise our cosine similarity calculations.

Thanks for any insights
Martin


Re: Is Feature Hashing appropriate for document to document similarity calculations?

2013-04-24 Thread Johannes Schulte
Hi Martin,

I guess you should be fine with the StaticWordValueEncoder; following, e.g.,
this discussion on this list, it is about clustering but matches some of
your questions:

http://mahout.markmail.org/search/?q=hashing%20clustering#query:hashing%20clustering+page:1+mid:eitskeb7pkk3pupr+state:results

What I am missing is a clear / big picture on the latter of the two most
mentioned benefits of hashing:
- dictionary / vectorization (with collisions and small error on average)
- dimensionality reduction

For the dimensionality reduction part, there is the minHash clustering,
home-brewed k-means with feature vector encoders, plus the new streaming
k-means stuff having its own random projection and hash implementations. Any
guidance: cool!

For you, Martin, I think you will have to roll your own incremental vector
generation, but it should be really straightforward.

Cheers,

Johannes



On Wed, Apr 24, 2013 at 6:32 PM, Martin Bayly mar...@parkbayly.net wrote:

 We have a system that needs to be able to incrementally calculate
 document-document text similarity metrics as new documents are seen.  I'm
 trying to understand if using feature hashing with for example, a
 StaticWordValueEncoder is appropriate for this kind of use case.  Our text
 documents can contain web content so the size of the feature vector is
 really only bound by the number of words in the language.

 Currently our implementation uses a simple vector based bag of words type
 model to create 'one cell per word' feature vectors for each document and
 then we use cosine similarity to determine document to document similarity.
  We are not using Mahout.

 The issue with this approach is that the one cell per word feature vectors
 require the use of a singleton dictionary object to turn words into vector
 indexes, so we can only index one document at a time.

 I've been reading through the Mahout archives and the Mahout in Action book
 looking to see if Mahout has any answers to help with incremental
 parallelized vector generation, but it seems like the Mahout seq2sparse
 processes have the same 'batch' issue.  I've seen various posts referring
 to using feature hashing as a way around this and the classifiers in part 3
 of Mahout in Action explain how to use feature hashing to encode text like
 features.

 I'm just too green to know whether it's appropriate for our use case.
  Particularly whether the multiple probes recommended when using feature
  hashing with text, and the likelihood of feature collisions will
 significantly compromise our cosine similarity calculations.

 Thanks for any insights
 Martin



Re: Is Feature Hashing appropriate for document to document similarity calculations?

2013-04-24 Thread Ted Dunning
Hashed feature vectors are an excellent choice for the unknown vocabulary
problem.

One problem you will have is that the static weighting won't by default
weight rare words more highly than common words.  One way to deal with this
is to build a dictionary on a small subset of documents and assume all
missing words are relatively rare and can have a default weight.

You definitely need some weighting according to prevalence.
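
A minimal sketch of that setup with StaticWordValueEncoder, from memory of the Mahout in Action examples (method names may differ slightly between versions); the weights map plays the role of the dictionary built on a sampled subset, and unseen words fall back to a higher default weight:

import java.util.HashMap;
import java.util.Map;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedDocVectorSketch {

  static final int CARDINALITY = 1 << 18;     // fixed hashed-vector size

  static Vector encode(String text, Map<String, Double> idfLikeWeights) {
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("text");
    encoder.setProbes(2);                      // a couple of probes to soften collisions
    encoder.setDictionary(idfLikeWeights);     // static weights from a sampled subset of docs
    encoder.setMissingValueWeight(3.0);        // assume unseen words are rare, weight them up
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    for (String word : text.toLowerCase().split("\\W+")) {
      encoder.addToVector(word, v);
    }
    return v;
  }

  public static void main(String[] args) {
    Map<String, Double> weights = new HashMap<>();
    weights.put("the", 0.1);                   // common words get low static weights
    weights.put("reactor", 4.2);

    Vector a = encode("the reactor design document", weights);
    Vector b = encode("design notes for the new reactor", weights);
    // Cosine distance is 1 - cosine similarity.
    double distance = new CosineDistanceMeasure().distance(a, b);
    System.out.println("cosine similarity: " + (1.0 - distance));
  }
}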



On Wed, Apr 24, 2013 at 12:32 PM, Johannes Schulte 
johannes.schu...@gmail.com wrote:

 Hi Martin,

 i guess you should be fine with the StaticWordValueEncoder , following e.g.
 this discussion on this list, it is about clustering but matches some of
 your questions


 http://mahout.markmail.org/search/?q=hashing%20clustering#query:hashing%20clustering+page:1+mid:eitskeb7pkk3pupr+state:results

 What i am missing is a clear / big picture on the latter of the two most
 mentioned benefits of hashing:
 - dictionary / vectorization (with collisions and small error on average)
 - dimensionality reduction

 for the dimensionality reduction part, there is the minHash-Clustering,
 home brewed k-means with feature vector encoders plus the new streaming
 k-means stuff having own random projection and hash implementations. Any
 guidance: cool!

 For you Martin, i think you will have to roll your own incremental vector
 generation but it should be really straightforward

 Cheers,

 Johannes



 On Wed, Apr 24, 2013 at 6:32 PM, Martin Bayly mar...@parkbayly.net
 wrote:

  We have a system that needs to be able to incrementally calculate
  document-document text similarity metrics as new documents are seen.  I'm
  trying to understand if using feature hashing with for example, a
  StaticWordValueEncoder is appropriate for this kind of use case.  Our
 text
  documents can contain web content so the size of the feature vector is
  really only bound by the number of words in the language.
 
  Currently our implementation uses a simple vector based bag of words type
  model to create 'one cell per word' feature vectors for each document and
  then we use cosine similarity to determine document to document
 similarity.
   We are not using Mahout.
 
  The issue with this approach is that the one cell per word feature
 vectors
  require the use of a singleton dictionary object to turn words into
 vector
  indexes, so we can only index one document at a time.
 
  I've been reading through the Mahout archives and the Mahout in Action
 book
  looking to see if Mahout has any answers to help with incremental
  parallelized vector generation, but it seems like the Mahout seq2sparse
  processes have the same 'batch' issue.  I've seen various posts referring
  to using feature hashing as a way around this and the classifiers in
 part 3
  of Mahout in Action explain how to use feature hashing to encode text
 like
  features.
 
  I'm just too green to know whether it's appropriate for our use case.
   Particularly whether the multiple probes recommended when using feature
  hashing with text, and the likelihood of feature collisions will
  significantly compromise our cosine similarity calculations.
 
  Thanks for any insights
  Martin
 



Re: Pairwise Document Similarity

2011-07-29 Thread Niall Riddell
Thanks Grant, and apologies for not actually reading the JIRA properly; it was
not intended as a criticism of the excellent Recommender work in Taste.  I
see now that the intention is to improve performance for specific use
cases.

My intention is to build a scalable near duplicate document detection
capability using Mahout.  The use case that I'm researching will require a
scalable approach across a large corpus of documents (50m and growing).  I
see that there is another thread today on this topic.

I will go ahead and implement the solution as outlined in my original post
and delve further into the LSH implementation embedded in the clustering
code and share my findings with the community.  I'm looking to test this on
a large corpus using EMR in order to get a feeling for the timings

Cheers Niall

On Jul 23, 2011 7:55 AM, Grant Ingersoll gsing...@apache.org wrote:


 On Jul 22, 2011, at 7:23 AM, Niall Riddell wrote:
 
 
  I've gone through MIA and felt that the RowSimilarityJob was a
  possibility, however I understand that a JIRA has been raised to make
  this potentially less general and in its current form it may not
  match my performance/cost criteria (i.e. high/low).

 I don't think the goal of the JIRA issue is to make it less general, it's
to make the cases that can benefit from smarter use of the co-occurrences
scale better.  I see no reason why the existing format can't also be
maintained for those similarity measures that can't benefit from more map
side calculation.

 -Grant



Re: Pairwise Document Similarity

2011-07-22 Thread Grant Ingersoll

On Jul 22, 2011, at 7:23 AM, Niall Riddell wrote:
 
 
 I've gone through MIA and felt that the RowSimilarityJob was a
 possibility, however I understand that a JIRA has been raised to make
 this potentially less general and in its current form it may not
 match my performance/cost criteria (i.e. high/low).

I don't think the goal of the JIRA issue is to make it less general, it's to 
make the cases that can benefit from smarter use of the co-occurrences scale 
better.  I see no reason why the existing format can't also be maintained for 
those similarity measures that can't benefit from more map side calculation.

-Grant



Re: Plagiarism - document similarity

2011-07-12 Thread Luca Natti
Thanks to all,

I need to start from the basic theory;
you are speaking Arabic :) to me. In other words, I need
a less theoretical approach, some real code to put my
hands on.
Excuse this raw approach, but I need an algorithm that is fast to implement
and understand,
to use in a real-world scenario, possibly now ;).
Alternatively, I need a basic textbook to start reading so that I can come to
understand what you are saying.

thanks again

On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Easier to simply index all, say, three word phrases and use a TF-IDF score.
  This will give you a good proxy for sequence similarity.  Documents should
 either be chopped on paragraph boundaries to have a roughly constant length
 or the score should not be normalized by document length.

 Log likelihood ratio (LLR) test can be useful to extract good query
 features
 from the subject document.  TF-IDF score is a reasonable proxy for this
 although it does lead to some problems.  The reason TF-IDF works as a query
 term selection method and why it fails can be seen from the fact that
 TF-IDF
 is very close to one of the most important terms in the LLR score.

 On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg 
 andrew.clegg+mah...@gmail.com
  wrote:

  On 11 July 2011 08:19, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
   - bioinformatics, in particular gene sequencing to detect long
   near-matching sequences (a variation of the above, I'm not familiar
   with any particular algorithms, but I imagine this is a well explored
   space
 
  The classic is Smith & Waterman:
 
  http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
 
  This approach been used in general text processing tasks too, e.g.:
 
 
 
 http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
 
   given the funds they receive ;),
 
  Hah! Less so these days I'm afraid :-)
 
  Andrew.
 
  --
 
  http://tinyurl.com/andrew-clegg-linkedin |
 http://twitter.com/andrew_clegg
 



Re: Plagiarism - document similarity

2011-07-12 Thread Em
Hi Luca,

again, I have to emphasize: read what I gave you.
The algorithm in my link was explained for non-scientists, and if you
download Solr you will find the class, so you can have a look at how
they implemented that algorithm.

Easier would mean that someone else writes the code for you ;).

Regards,
Em

Am 12.07.2011 09:58, schrieb Luca Natti:
 Thanks to all ,
 
 i need to start from the beginning theory ,
 you are speaking arab :) to me, or in other words i need
 a less theoretical approach, or in other words some real code to put my
 hands on.
 Excuse this raw approach but i need a real fast to implement and understand
 algorithm
 to use in real world scenario possibly now ;) .
 Alternatively i need a basic text(book) to start reading and arrive to
 understand what you are saying.
 
 thanks again
 
 On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Easier to simply index all, say, three word phrases and use a TF-IDF score.
  This will give you a good proxy for sequence similarity.  Documents should
 either be chopped on paragraph boundaries to have a roughly constant length
 or the score should not be normalized by document length.

 Log likelihood ratio (LLR) test can be useful to extract good query
 features
 from the subject document.  TF-IDF score is a reasonable proxy for this
 although it does lead to some problems.  The reason TF-IDF works as a query
 term selection method and why it fails can be seen from the fact that
 TF-IDF
 is very close to one of the most important terms in the LLR score.

 On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg 
 andrew.clegg+mah...@gmail.com
 wrote:

 On 11 July 2011 08:19, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
 - bioinformatics, in particular gene sequencing to detect long
 near-matching sequences (a variation of the above, I'm not familiar
 with any particular algorithms, but I imagine this is a well explored
 space

  The classic is Smith & Waterman:

 http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

 This approach been used in general text processing tasks too, e.g.:



 http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf

 given the funds they receive ;),

 Hah! Less so these days I'm afraid :-)

 Andrew.

 --

 http://tinyurl.com/andrew-clegg-linkedin |
 http://twitter.com/andrew_clegg


 


Re: Plagiarism - document similarity

2011-07-11 Thread Dawid Weiss
I've seen people doing all kinds of things to detect this. A few
directions to research around:

- suffix trees/ suffix arrays to detect longest common subsequences
(perfect matches though),
- bioinformatics, in particular gene sequencing to detect long
near-matching sequences (a variation of the above, I'm not familiar
with any particular algorithms, but I imagine this is a well explored
space given the funds they receive ;),
- techniques for fuzzy matching/ near-duplicate detection, but
combined with arbitrary document chunking or method-specific data (for
example shingles).

These should yield you tons of reading material to start with (Google
Scholar, Citeseer). Sorry for not being any more specific.
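
As a brute-force illustration of what the suffix-tree/suffix-array approaches compute efficiently, the longest common substring can be found with a quadratic dynamic program (fine for paragraphs, far too slow for whole corpora):

public class LongestCommonSubstringSketch {

  // Dynamic-programming longest common substring: a long shared run of
  // characters (or tokens) between two documents is the kind of evidence the
  // suffix-structure approaches find, just without their speed.
  static String longestCommonSubstring(String a, String b) {
    int[][] len = new int[a.length() + 1][b.length() + 1];
    int best = 0;
    int endInA = 0;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        if (a.charAt(i - 1) == b.charAt(j - 1)) {
          len[i][j] = len[i - 1][j - 1] + 1;
          if (len[i][j] > best) {
            best = len[i][j];
            endInA = i;
          }
        }
      }
    }
    return a.substring(endInA - best, endInA);
  }

  public static void main(String[] args) {
    System.out.println(longestCommonSubstring(
        "we propose a novel method for topic inference",
        "in this paper we propose a novel method for plagiarism detection"));
  }
}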

Dawid

On Mon, Jul 11, 2011 at 9:15 AM, Luca Natti ilmioam...@gmail.com wrote:
 yes, i'm interested in plagiarism applied to research papers, university
 notes, thesis.
 Any theory and *best* snippets of code/examples is very appreciated!
 thanks in advance for your help,


 On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg andrew.cl...@gmail.com wrote:

 If 'puzzling' means direct plagiarism, then some sort of
 longest-common-subsequence might be a better metric.

 If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for
 me.

 On Friday, 8 July 2011, Sergey Bartunov sbos@gmail.com wrote:
  You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis
 
  On 8 July 2011 12:47, Luca Natti ilmioam...@gmail.com wrote:
  Is there a  way to compute similarity between docs?
  And similarity by paragraphs?
 
  What We want to tell is if a research paper is original or made by
  puzzling other works.
 
  thanks!
 
 

 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg




Re: Plagiarism - document similarity

2011-07-11 Thread Em
Hi Luca,

how about quoting another researcher's work? Are you also interested in
the amount of quotes in respect to the whole document? I think it is not
impossible to let an algorithm find out whether some subsequences in
both documents are correctly marked, but it might be hard. Depending on
your business-case you might find out that there will be a lot of
false-positives when judging someone's work as plagiarism.

Another idea to find out similarity between the content of two documents
is implemented in Nutch. Fortunately I found a piece of documentation in
the solr-api-docs where you can read about it:
http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html

You could do something like that for content-blocks of a document
(several sentences or a fixed window of words). This way you are able to
find out similarities between documents where the author has rewritten a
part of another researcher's work.
This way you are able to find out phrases where the
longest-common-subsequence is small but a human would see the
 similarities between both documents and the possibility of plagiarism.
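
A crude sketch of that block-signature idea (this is not the Nutch/Solr implementation, just the gist: quantize term frequencies per word window, hash the profile, and compare the resulting signature sets):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class BlockSignatureSketch {

  // A TextProfileSignature-like fingerprint for one block of words: quantize
  // token frequencies, sort the surviving tokens, and hash the result, so
  // small rewrites can still map to the same signature.
  static String signature(List<String> block) throws Exception {
    Map<String, Integer> freq = new HashMap<>();
    for (String w : block) {
      freq.merge(w, 1, Integer::sum);
    }
    int quant = Math.max(1, Collections.max(freq.values()) / 2);   // quantization step
    List<String> kept = new ArrayList<>();
    for (Map.Entry<String, Integer> e : freq.entrySet()) {
      if (e.getValue() / quant > 0) {
        kept.add(e.getKey() + ":" + (e.getValue() / quant));
      }
    }
    Collections.sort(kept);
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(String.join(" ", kept).getBytes(StandardCharsets.UTF_8));
    return Base64.getEncoder().encodeToString(digest);
  }

  // Fraction of word-window signatures of doc a that also occur in doc b.
  static double overlap(String a, String b, int window) throws Exception {
    Set<String> sigsA = windowSignatures(a, window);
    if (sigsA.isEmpty()) {
      return 0.0;
    }
    Set<String> shared = new HashSet<>(sigsA);
    shared.retainAll(windowSignatures(b, window));
    return (double) shared.size() / sigsA.size();
  }

  static Set<String> windowSignatures(String text, int window) throws Exception {
    List<String> words = Arrays.asList(text.toLowerCase().split("\\W+"));
    Set<String> sigs = new HashSet<>();
    for (int i = 0; i + window <= words.size(); i++) {
      sigs.add(signature(words.subList(i, i + window)));
    }
    return sigs;
  }

  public static void main(String[] args) throws Exception {
    String original = "we propose a novel approach to topic based document similarity in this work";
    String rewrite = "in this work we propose a novel approach to topic based document similarity";
    System.out.println("block overlap: " + overlap(original, rewrite, 8));
  }
}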

Regards,
Em

Am 11.07.2011 09:15, schrieb Luca Natti:
 yes, i'm interested in plagiarism applied to research papers, university
 notes, thesis.
 Any theory and *best* snippets of code/examples is very appreciated!
 thanks in advance for your help,
 
 
 On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg andrew.cl...@gmail.com wrote:
 
 If 'puzzling' means direct plagiarism, then some sort of
 longest-common-subsequence might be a better metric.

 If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for
 me.

 On Friday, 8 July 2011, Sergey Bartunov sbos@gmail.com wrote:
 You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis

 On 8 July 2011 12:47, Luca Natti ilmioam...@gmail.com wrote:
 Is there a  way to compute similarity between docs?
 And similarity by paragraphs?

 What We want to tell is if a research paper is original or made by
 puzzling other works.

 thanks!



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg

 


Re: Plagiarism - document similarity

2011-07-11 Thread Luca Natti
Something that also gives false positives is OK,
because we can check by hand for the final decision on the doc.

I need more specific directions with some examples,
because we have little time to implement this.


On Mon, Jul 11, 2011 at 10:35 AM, Em mailformailingli...@yahoo.de wrote:

 Hi Luca,

 how about quoting another researcher's work? Are you also interested in
 the amount of quotes in respect to the whole document? I think it is not
 impossible to let an algorithm find out whether some subsequences in
 both documents are correctly marked, but it might be hard. Depending on
 your business-case you might find out that there will be a lot of
 false-positives when judging someone's work as plagiarism.

 Another idea to find out similarity between the content of two documents
 is implemented in Nutch. Fortunately I found a piece of documentation in
 the solr-api-docs where you can read about it:

 http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html

 You could do something like that for content-blocks of a document
 (several sentences or a fixed window of words). This way you are able to
 find out similarities between documents where the author has rewritten a
 part of another researcher's work.
 This way you are able to find out phrases where the
 longest-common-subsequence is small but a human would see the
  similarities between both documents and the possibility of plagiarism.

 Regards,
 Em

 Am 11.07.2011 09:15, schrieb Luca Natti:
  yes, i'm interested in plagiarism applied to research papers, university
  notes, thesis.
  Any theory and *best* snippets of code/examples is very appreciated!
  thanks in advance for your help,
 
 
  On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg andrew.cl...@gmail.com
 wrote:
 
  If 'puzzling' means direct plagiarism, then some sort of
  longest-common-subsequence might be a better metric.
 
  If this isn't what the OP meant, then sorry! 'Puzzling' is a new term
 for
  me.
 
  On Friday, 8 July 2011, Sergey Bartunov sbos@gmail.com wrote:
  You may start from
 http://en.wikipedia.org/wiki/Latent_semantic_analysis
 
  On 8 July 2011 12:47, Luca Natti ilmioam...@gmail.com wrote:
  Is there a  way to compute similarity between docs?
  And similarity by paragraphs?
 
  What We want to tell is if a research paper is original or made by
  puzzling other works.
 
  thanks!
 
 
 
  --
 
  http://tinyurl.com/andrew-clegg-linkedin |
 http://twitter.com/andrew_clegg
 
 



Re: Plagiarism - document similarity

2011-07-11 Thread Ken Krugler

On Jul 11, 2011, at 3:00am, Luca Natti wrote:

 Somethig that gives also false positives ok,
 because we can check by hand for the final decision on the doc.
 
 I need more specific directions with some examples ,
 because we have few time to implement this.

See Winnowing: Local Algorithms for Document Fingerprinting by Schleimer, 
Wilderson and Aiken.

And any papers on MOSS, an implementation of the ideas contained in that paper, 
for detecting plagiarism.
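
The core of the winnowing scheme is small enough to sketch (a simplified version; the paper also records fingerprint positions and uses a more careful tie-breaking rule):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WinnowingSketch {

  // Winnowing fingerprint selection (Schleimer et al.): hash every k-gram,
  // then in each window of w consecutive hashes keep the minimum. Documents
  // sharing a sufficiently long run of text will share a selected fingerprint.
  static Set<Integer> fingerprints(String text, int k, int w) {
    String s = text.toLowerCase().replaceAll("\\W+", "");   // crude normalization
    List<Integer> hashes = new ArrayList<>();
    for (int i = 0; i + k <= s.length(); i++) {
      hashes.add(s.substring(i, i + k).hashCode());
    }
    Set<Integer> selected = new HashSet<>();
    for (int i = 0; i + w <= hashes.size(); i++) {
      int min = hashes.get(i);
      for (int j = i + 1; j < i + w; j++) {
        min = Math.min(min, hashes.get(j));
      }
      selected.add(min);
    }
    return selected;
  }

  public static void main(String[] args) {
    Set<Integer> a = fingerprints("the quick brown fox jumps over the lazy dog", 5, 4);
    Set<Integer> b = fingerprints("a quick brown fox jumped over one lazy dog", 5, 4);
    Set<Integer> shared = new HashSet<>(a);
    shared.retainAll(b);
    System.out.println("shared fingerprints: " + shared.size());
  }
}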

-- Ken

 On Mon, Jul 11, 2011 at 10:35 AM, Em mailformailingli...@yahoo.de wrote:
 
 Hi Luca,
 
 how about quoting another researcher's work? Are you also interested in
 the amount of quotes in respect to the whole document? I think it is not
 impossible to let an algorithm find out whether some subsequences in
 both documents are correctly marked, but it might be hard. Depending on
 your business-case you might find out that there will be a lot of
 false-positives when judging someone's work as plagiarism.
 
 Another idea to find out similarity between the content of two documents
 is implemented in Nutch. Fortunately I found a piece of documentation in
 the solr-api-docs where you can read about it:
 
 http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
 
 You could do something like that for content-blocks of a document
 (several sentences or a fixed window of words). This way you are able to
 find out similarities between documents where the author has rewritten a
 part of another researcher's work.
 This way you are able to find out phrases where the
 longest-common-subsequence is small but a human would see the
  similarities between both documents and the possibility of plagiarism.
 
 Regards,
 Em
 
 Am 11.07.2011 09:15, schrieb Luca Natti:
 yes, i'm interested in plagiarism applied to research papers, university
 notes, thesis.
 Any theory and *best* snippets of code/examples is very appreciated!
 thanks in advance for your help,
 
 
 On Sat, Jul 9, 2011 at 5:14 PM, Andrew Clegg andrew.cl...@gmail.com
 wrote:
 
 If 'puzzling' means direct plagiarism, then some sort of
 longest-common-subsequence might be a better metric.
 
 If this isn't what the OP meant, then sorry! 'Puzzling' is a new term
 for
 me.
 
 On Friday, 8 July 2011, Sergey Bartunov sbos@gmail.com wrote:
 You may start from
 http://en.wikipedia.org/wiki/Latent_semantic_analysis
 
 On 8 July 2011 12:47, Luca Natti ilmioam...@gmail.com wrote:
 Is there a  way to compute similarity between docs?
 And similarity by paragraphs?
 
 What We want to tell is if a research paper is original or made by
 puzzling other works.
 
 thanks!
 
 
 
 --
 
 http://tinyurl.com/andrew-clegg-linkedin |
 http://twitter.com/andrew_clegg
 
 
 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions








Re: document similarity

2011-07-11 Thread Xavier Stevens
I think by puzzling he meant piecing together work from other papers
into some extension of previous work.


-Xavier


On 7/9/11 8:14 AM, Andrew Clegg wrote:
 If 'puzzling' means direct plagiarism, then some sort of
 longest-common-subsequence might be a better metric.

 If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for me.

 On Friday, 8 July 2011, Sergey Bartunov sbos@gmail.com wrote:
 You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis

 On 8 July 2011 12:47, Luca Natti ilmioam...@gmail.com wrote:
 Is there a  way to compute similarity between docs?
 And similarity by paragraphs?

 What We want to tell is if a research paper is original or made by
 puzzling other works.

 thanks!



Re: document similarity

2011-07-11 Thread Xavier Stevens
Disregard my last message. Sorry!

On 7/11/11 8:35 AM, Xavier Stevens wrote:
 I think by puzzling he meant piecing together work from other papers
 into some extension of previous work.


 -Xavier


 On 7/9/11 8:14 AM, Andrew Clegg wrote:
 If 'puzzling' means direct plagiarism, then some sort of
 longest-common-subsequence might be a better metric.

 If this isn't what the OP meant, then sorry! 'Puzzling' is a new term for me.

 On Friday, 8 July 2011, Sergey Bartunov sbos@gmail.com wrote:
 You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis

 On 8 July 2011 12:47, Luca Natti ilmioam...@gmail.com wrote:
 Is there a  way to compute similarity between docs?
 And similarity by paragraphs?

 What We want to tell is if a research paper is original or made by
 puzzling other works.

 thanks!



Re: Plagiarism - document similarity

2011-07-11 Thread Ted Dunning
Easier to simply index all, say, three-word phrases and use a TF-IDF score.
 This will give you a good proxy for sequence similarity.  Documents should
either be chopped on paragraph boundaries to have a roughly constant length
or the score should not be normalized by document length.

Log likelihood ratio (LLR) test can be useful to extract good query features
from the subject document.  TF-IDF score is a reasonable proxy for this
although it does lead to some problems.  The reason TF-IDF works as a query
term selection method and why it fails can be seen from the fact that TF-IDF
is very close to one of the most important terms in the LLR score.
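
Something like the following produces those three-word phrases; Lucene's ShingleFilter does the same thing during analysis, so normally you would not hand-roll it:

import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

  // Overlapping three-word phrases ("shingles"); indexing these as terms and
  // scoring with TF-IDF approximates matching on copied word sequences.
  static List<String> threeWordShingles(String text) {
    String[] words = text.toLowerCase().split("\\W+");
    List<String> shingles = new ArrayList<>();
    for (int i = 0; i + 2 < words.length; i++) {
      shingles.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
    }
    return shingles;
  }

  public static void main(String[] args) {
    System.out.println(threeWordShingles(
        "the quick brown fox jumps over the lazy dog"));
    // -> [the quick brown, quick brown fox, brown fox jumps, ...]
  }
}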

On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg andrew.clegg+mah...@gmail.com
 wrote:

 On 11 July 2011 08:19, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
  - bioinformatics, in particular gene sequencing to detect long
  near-matching sequences (a variation of the above, I'm not familiar
  with any particular algorithms, but I imagine this is a well explored
  space

  The classic is Smith & Waterman:

 http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

 This approach been used in general text processing tasks too, e.g.:


 http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf

  given the funds they receive ;),

 Hah! Less so these days I'm afraid :-)

 Andrew.

 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



Re: document similarity

2011-07-08 Thread Sergey Bartunov
You may start from http://en.wikipedia.org/wiki/Latent_semantic_analysis

On 8 July 2011 12:47, Luca Natti ilmioam...@gmail.com wrote:
 Is there a  way to compute similarity between docs?
 And similarity by paragraphs?

 What We want to tell is if a research paper is original or made by
 puzzling other works.

 thanks!



Re: Generating a Document Similarity Matrix

2010-07-02 Thread Sebastian Schelter
Hi Kris,

I think the best way would be to manually join the names to the result
after executing the job.

--sebastian

Am 02.07.2010 18:22, schrieb Kris Jack:
 Hi Sebastian,

 I am currently using your code with NamedVectors in my input.  In the
 output, however, the names seem to be missing.  Would there be a way to
 include them?

 Thanks,
 Kris



 2010/6/29 Sebastian Schelter ssc.o...@googlemail.com

   
 Hi Kris,

 I'm glad I could help you and it's really cool that you are testing my
 patches on real data. I'm looking forward to hearing more!

 -sebastian

 Am 29.06.2010 11:25, schrieb Kris Jack:
 
 Hi Sebastian,

 You really are very kind!  I have taken your code and run it to print out
 the contents of the output file.  There are indeed only 37,952 results so
 that gives me more confidence in the vector dumper.  I'm not sure why there
 was a memory problem though, seeing as it seems to have output the results
 correctly.  Now I just have to match them up with my original lucene ids and
 see how it is performing.  I'll keep you posted with the results.

 Thanks,
 Kris



 2010/6/28 Sebastian Schelter ssc.o...@googlemail.com


   
 Hi Kris,

 Unfortunately I'm not familiar with the VectorDumper code (and a quick
 look didn't help either), so I can't help you with the OutOfMemoryError.

 It could be possible that only 37,952 results are found for an input of
 500,000 vectors, it really depends on the actual data. If you're sure
 that there should be more results, you could provide me with a sample
 input file and I'll try to find out why there aren't more results.

 I wrote a small class for you that dumps the output file of the job to
 the console, (I tested it with the output of my unit-tests), maybe that
 can help us find the source of the problem.

 -sebastian

 public class MatrixReader extends AbstractJob {

   public static void main(String[] args) throws Exception {
     ToolRunner.run(new MatrixReader(), args);
   }

   @Override
   public int run(String[] args) throws Exception {

     addInputOption();

     Map<String,String> parsedArgs = parseArguments(args);
     if (parsedArgs == null) {
       return -1;
     }

     Configuration conf = getConf();
     FileSystem fs = FileSystem.get(conf);

     Path vectorFile = fs.listStatus(getInputPath(),
         TasteHadoopUtils.PARTS_FILTER)[0].getPath();

     SequenceFile.Reader reader = null;
     try {
       reader = new SequenceFile.Reader(fs, vectorFile, conf);
       IntWritable key = new IntWritable();
       VectorWritable value = new VectorWritable();

       while (reader.next(key, value)) {
         System.out.print(String.valueOf(key.get()) + " : ");
         Iterator<Element> elementsIterator = value.get().iterateNonZero();
         String separator = "";
         while (elementsIterator.hasNext()) {
           Element element = elementsIterator.next();
           System.out.print(separator + String.valueOf(element.index())
               + "," + String.valueOf(element.get()));
           separator = ";";
         }
         System.out.print("\n");
       }
     } finally {
       reader.close();
     }
     return 0;
   }
 }

 Am 28.06.2010 17:18, schrieb Kris Jack:

 
  Hi,

  I am now using the version of
  org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian
  has written and has been added to the trunk.  Thanks again for that!  I
  can generate an output file that should contain a list of documents with
  their top 100 most similar documents.  I am having problems, however, in
  converting the output file into a readable format using mahout's
  vectordump:

  $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
  no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
  Input Path: /home/kris/similarRows
  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
  at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
  at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
  at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
  at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at

Re: Generating a Document Similarity Matrix

2010-06-18 Thread Kris Jack
Thanks Ted,

I got that working.  Unfortunately, the matrix multiplication job is taking
far longer than I hoped.  With just over 10 million documents, 10 mappers
and 10 reducers, I can't get it to complete the job in under 48 hours.

Perhaps you have an idea for speeding it up?  I have already been quite
ruthless with making the vectors sparse.  I did not include terms that
appeared in over 1% of the corpus and only kept terms that appeared at least
50 times.  Is it normal that the matrix multiplication map reduce task
should take so long to process with this quantity of data and resources
available or do you think that my system is not configured properly?

Thanks,
Kris



2010/6/15 Ted Dunning ted.dunn...@gmail.com

 Thresholds are generally dangerous.  It is usually preferable to specify the
 sparseness you want (1%, 0.2%, whatever), sort the results in descending
 score order using Hadoop's builtin capabilities and just drop the rest.

 On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote:

   I was wondering if there was an
  interesting way to do this with the current mahout code such as
 requesting
  that the Vector accumulator returns only elements that have values
 greater
  than a given threshold, sorting the vector by value rather than key, or
  something else?
 



Re: Generating a Document Similarity Matrix

2010-06-18 Thread Sebastian Schelter
Hi Kris,

maybe you want to give the patch from
https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
tested it with larger data yet, but I would be happy to get some
feedback for it and maybe it helps you with your usecase.

-sebastian

Am 18.06.2010 18:46, schrieb Kris Jack:
 Thanks Ted,

 I got that working.  Unfortunately, the matrix multiplication job is taking
 far longer than I hoped.  With just over 10 million documents, 10 mappers
 and 10 reducers, I can't get it to complete the job in under 48 hours.

 Perhaps you have an idea for speeding it up?  I have already been quite
 ruthless with making the vectors sparse.  I did not include terms that
 appeared in over 1% of the corpus and only kept terms that appeared at least
 50 times.  Is it normal that the matrix multiplication map reduce task
 should take so long to process with this quantity of data and resources
 available or do you think that my system is not configured properly?

 Thanks,
 Kris



 2010/6/15 Ted Dunning ted.dunn...@gmail.com

   
 Thresholds are generally dangerous.  It is usually preferable to specify the
 sparseness you want (1%, 0.2%, whatever), sort the results in descending
 score order using Hadoop's builtin capabilities and just drop the rest.

 On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote:

 
  I was wondering if there was an
 interesting way to do this with the current mahout code such as
   
 requesting
 
 that the Vector accumulator returns only elements that have values
   
 greater
 
 than a given threshold, sorting the vector by value rather than key, or
 something else?

   
 
   



Re: Generating a Document Similarity Matrix

2010-06-18 Thread Kris Jack
Thanks Sebastian, I'll give it a try!



2010/6/18 Sebastian Schelter ssc.o...@googlemail.com

 Hi Kris,

 maybe you want to give the patch from
  https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
  tested it with larger data yet, but I would be happy to get some
 feedback for it and maybe it helps you with your usecase.

 -sebastian

 Am 18.06.2010 18:46, schrieb Kris Jack:
  Thanks Ted,
 
  I got that working.  Unfortunately, the matrix multiplication job is
 taking
  far longer than I hoped.  With just over 10 million documents, 10 mappers
  and 10 reducers, I can't get it to complete the job in under 48 hours.
 
  Perhaps you have an idea for speeding it up?  I have already been quite
  ruthless with making the vectors sparse.  I did not include terms that
  appeared in over 1% of the corpus and only kept terms that appeared at
 least
  50 times.  Is it normal that the matrix multiplication map reduce task
  should take so long to process with this quantity of data and resources
  available or do you think that my system is not configured properly?
 
  Thanks,
  Kris
 
 
 
  2010/6/15 Ted Dunning ted.dunn...@gmail.com
 
 
   Thresholds are generally dangerous.  It is usually preferable to specify
 the
  sparseness you want (1%, 0.2%, whatever), sort the results in descending
  score order using Hadoop's builtin capabilities and just drop the rest.
 
  On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com
 wrote:
 
 
   I was wondering if there was an
  interesting way to do this with the current mahout code such as
 
  requesting
 
  that the Vector accumulator returns only elements that have values
 
  greater
 
  than a given threshold, sorting the vector by value rather than key, or
  something else?
 
 
 
 




-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/


Re: Generating a Document Similarity Matrix

2010-06-15 Thread Kris Jack
Hi,

I have gone through the matrix code quite thoroughly and have managed to get
an implementation of document similarity working.  Like you said, however,
the rows of the similarity matrix are very dense and take a while to
calculate.  I have made the original document-term matrix more sparse by
removing terms that have a document frequency of > 1% and now I would like
to trim out row entries that do not have high similarity values.  As you
suggested, I would like to keep only the top x entries by modifying the
reducer of the matrix multiply job.  I was wondering if there was an
interesting way to do this with the current mahout code such as requesting
that the Vector accumulator returns only elements that have values greater
than a given threshold, sorting the vector by value rather than key, or
something else?

Thanks,
Kris



2010/6/8 Jake Mannix jake.man...@gmail.com

 Hi Kris,

  So you already know how to make a sparse feature matrix out of
 your Solr index, based on Grant's instructions?  Once you have that
 matrix loaded into HDFS, then the following Java code should
 create a document-document similarity matrix:

 //
  String inputPath = "/path/to/matrix/on/hdfs";
  String tmpPath = "/tmp/matrixmultiplyspace";
  int numDocuments = // whatever your numDocuments is
  int numTerms = // total number of terms in the matrix

  DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
      tmpPath, numDocuments, numTerms);
  JobConf conf = new JobConf("similarity job");
  text.configure(conf);

  DistributedRowMatrix transpose = text.transpose();

  DistributedRowMatrix similarity = transpose.times(transpose);
  System.out.println("Similarity matrix lives: " + similarity.getRowPath());
 //

 Now, the rows of this similarity are going to be way too dense, so
 you'll want to write a small map-reduce job (well, no reduce is necessary)
 to run over this matrix and trim out all the unuseful entries of each
 row, but that shouldn't be too hard to do.

 Of course, to do it really efficiently, that functionality could be folded
 into the reducer of the matrix multiply job, and done in the same pass over
 the data as that one.

  -jake



 On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack mrkrisj...@gmail.com wrote:

  Hi Jake,
 
  Thanks for that.  The first solution that you suggest is more like what I
  was imagining.
 
  Please excuse me, I'm new to Mahout and don't know how to use it to
  generate
  the full document-document similarity matrix.  I would rather not have to
  re-implement the moreLikeThis algorithm that, although rather straight
  forward, may take time for a newbie to MapReduce like me.  Could you
 guide
  me a little in finding the relevant Mahout code for generating the matrix
  or
  is it not really designed for that?
 
  For the moment, I would be happy to have an off-line batch version
 working.
  Also, it is desirable to take advantage of the text processing features
  that
  I have already configured using solr, so I would prefer to read in the
  feature vectors for the documents from a lucene index, as I am doing at
  present (e.g.
 
 
 http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
  ).
 
  Thanks,
  Kris
 
 
 
  2010/6/8 Jake Mannix jake.man...@gmail.com
 
   Hi Kris,
  
If you generate a full document-document similarity matrix offline,
 and
   then make sure to sparsify the rows (trim off all similarities below a
   threshold, or only take the top N for each row, etc...).  Then encoding
   these values directly in the index would indeed allow for *superfast*
   MoreLikeThis functionality, because you've already computed all
   of the similar results offline.
  
The only downside is that it won't apply to newly indexed documents.
   If your indexing setup is such that you don't fold in new documents
 live,
   but do so in batch, then this should be fine.
  
An alternative is to use something like a Locality Sensitive Hash
   (something one of my co-workers is writing up a nice implementation
   of now, and I'm going to get him to contribute it once it's fully
  tested),
   to reduce the search space (as a lucene Filter) and speed up the
   query.
  
-jake
  
   On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com
 wrote:
  
Hi Olivier,
   
Thanks for your suggestions.  I have over 10 million documents and
 they
have
quite a lot of meta-data associated with them including rather large
  text
fields.  It is possible to tweak the moreLikeThis function from solr.
   I
have tried changing the parameters (
http://wiki.apache.org/solr/MoreLikeThis)
but am not managing to get results in under 300ms without sacrificing
  the
quality of the results too much.
   
I suspect that there would be gains to be made from reducing the
dimensionality of the feature vectors before indexing with lucene so
 I
   may
give that a try.  I'll keep you posted if I come up with other
  solutions

Re: Generating a Document Similarity Matrix

2010-06-15 Thread Ted Dunning
Thresholds are generally dangerous.  It is usually preferable to specify the
sparseness you want (1%, 0.2%, whatever), sort the results in descending
score order using Hadoop's builtin capabilities and just drop the rest.

On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote:

  I was wondering if there was an
 interesting way to do this with the current mahout code such as requesting
 that the Vector accumulator returns only elements that have values greater
 than a given threshold, sorting the vector by value rather than key, or
 something else?



Re: Generating a Document Similarity Matrix

2010-06-09 Thread Sebastian Schelter
I could try to make the similarity computation work on the rows of a
DistributedRowMatrix with several metrics (similar to
o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob) and would
concentrate on the implementation of it as a mathematical operation not
specific to any domain.

So it would be left up to the users to convert their documents to vectors
and maybe do things like stemming or stopword removal to reduce the
computation overhead, when they use this job for text documents.

I could start working on that in 2 weeks from now, though. Tell me if that's
welcome and I'll go and create a jira issue :)

-sebastian


2010/6/9 Jake Mannix jake.man...@gmail.com

 On Tue, Jun 8, 2010 at 4:45 PM, Sebastian Schelter
  ssc.o...@googlemail.com wrote:

  The relation between these two problems (document similarity and item
  similarity in CF) is exactly like Sean pointed out: In the paper a
 document
  is a vector of term frequencies and the paper shows how to compute the
  pairwise similarities between those. To use this for collaborative
  filtering
  you actually just have to replace the document with an item which is a

 vector of user preferences.
 

 Yep, a vector is a vector is a vector.  (And when you're me, even if you
 are *not* a vector, you might be a vector. ;) )


  It shouldn't be too hard to make this work on a DistributedRowMatrix too,
 I
  think. You already mentioned you wanna have it that way some time
  in MAHOUT-362 :)
 

 Well indeed I did!

  -jake



Re: Generating a Document Similarity Matrix

2010-06-09 Thread Kris Jack
Hi Jake,

Thanks for that.  I'll give it a try with cosine similarity first off and as
I get more experience I'll try and implement some other similarity methods.

Kris



2010/6/9 Jake Mannix jake.man...@gmail.com

 On Wed, Jun 9, 2010 at 5:11 AM, Kris Jack mrkrisj...@gmail.com wrote:

  Hi,
 
  Thanks very for the code.  In implementing it, I got a little stuck on
   specifying the JobConf's "similarity job" -
  
    JobConf conf = new JobConf("similarity job");
 
  I assume that I should define here how I would like two vectors to be
  compared with one another?  Please do correct me if that's wrong.  If so,
  however, could you point me to any examples of what this code should look
  like (e.g. cosine similarity)?  I'm sure that these kinds of similarity
  jobs
  must already exist in Mahout but being new to both Mahout and MapReduce,
  I'm
  not sure where to find them.
 

 In the sample I mentioned (using sparse matrix multiplication), you don't
 get to choose the similarity - if the input vectors are unit-length
 normalized, then the computation is cosine similarity.  You would have to
 write your own map-reduce job to do a different one.

  -jake
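
For reference, a minimal sketch of that unit-length normalization step using
Mahout's Vector API (the job wiring around it is omitted; normalize() divides
by the Euclidean norm):

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class UnitNormalize {

    // returns a copy of the term vector scaled to unit length, so that the
    // entries of transpose().times(...) become cosine similarities
    public static Vector toUnitLength(Vector termFrequencies) {
      return termFrequencies.normalize();
    }

    public static void main(String[] args) {
      Vector v = new RandomAccessSparseVector(5);
      v.set(0, 3.0);
      v.set(3, 4.0);
      System.out.println(toUnitLength(v).norm(2));   // prints 1.0 (up to rounding)
    }
  }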




 
  Thanks,
  Kris
 
 
 
  2010/6/8 Jake Mannix jake.man...@gmail.com
 
   Hi Kris,
  
So you already know how to make a sparse feature matrix out of
   your Solr index, based on Grant's instructions?  Once you have that
   matrix loaded into HDFS, then the following Java code should
   create a document-document similarity matrix:
  
    //
 String inputPath = "/path/to/matrix/on/hdfs";
 String tmpPath = "/tmp/matrixmultiplyspace";
 int numDocuments = // whatever your numDocuments is
 int numTerms = // total number of terms in the matrix

 DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
   tmpPath, numDocuments, numTerms);
 JobConf conf = new JobConf("similarity job");
 text.configure(conf);

 DistributedRowMatrix transpose = text.transpose();

 DistributedRowMatrix similarity = transpose.times(transpose);
 System.out.println("Similarity matrix lives: " +
   similarity.getRowPath());
    //
  
   Now, the rows of this similarity are going to be way too dense, so
   you'll want to write a small map-reduce job (well, no reduce is
  necessary)
   to run over this matrix and trim out all the unuseful entries of each
   row, but that shouldn't be too hard to do.
  
   Of course, to do it really efficiently, that functionality could be
  folded
   into the reducer of the matrix multiply job, and done in the same pass
  over
   the data as that one.
  
-jake
  
  
  
   On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack mrkrisj...@gmail.com
 wrote:
  
Hi Jake,
   
Thanks for that.  The first solution that you suggest is more like
 what
  I
was imagining.
   
Please excuse me, I'm new to Mahout and don't know how to use it to
generate
the full document-document similarity matrix.  I would rather not
 have
  to
re-implement the moreLikeThis algorithm that, although rather
 straight
forward, may take time for a newbie to MapReduce like me.  Could you
   guide
me a little in finding the relevant Mahout code for generating the
  matrix
or
is it not really designed for that?
   
For the moment, I would be happy to have an off-line batch version
   working.
Also, it is desirable to take advantage of the text processing
 features
that
I have already configured using solr, so I would prefer to read in
 the
feature vectors for the documents from a lucene index, as I am doing
 at
present (e.g.
   
   
  
 
 http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
).
   
Thanks,
Kris
   
   
   
2010/6/8 Jake Mannix jake.man...@gmail.com
   
 Hi Kris,

  If you generate a full document-document similarity matrix
 offline,
   and
 then make sure to sparsify the rows (trim off all similarities
 below
  a
 threshold, or only take the top N for each row, etc...).  Then
  encoding
 these values directly in the index would indeed allow for
 *superfast*
 MoreLikeThis functionality, because you've already computed all
 of the similar results offline.

  The only downside is that it won't apply to newly indexed
 documents.
 If your indexing setup is such that you don't fold in new documents
   live,
 but do so in batch, then this should be fine.

  An alternative is to use something like a Locality Sensitive Hash
 (something one of my co-workers is writing up a nice implementation
 of now, and I'm going to get him to contribute it once it's fully
tested),
 to reduce the search space (as a lucene Filter) and speed up the
 query.

  -jake

 On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com
   wrote:

  Hi Olivier,
 
  Thanks for your suggestions.  I have over 10 million documents
 and
   they
  have
  quite

Re: Generating a Document Similarity Matrix

2010-06-09 Thread Sean Owen
Well I'm not sure they're unique, they're just vectors. Would that not
be the best neutral representation for things like this?

What was the comment about keying by ints vs longs earlier? If
unifying that helps bring things closer together I can look at it, if
I can understand the issue.

On Wed, Jun 9, 2010 at 6:56 PM, Sebastian Schelter
ssc.o...@googlemail.com wrote:
 The ItemSimilarityJob cannot be directly used as it's not working on a
 DistributedRowMatrix but on data structures unique to collaborative
 filtering, so if you ask me I'd say that a separate job would be required.



Re: Generating a Document Similarity Matrix

2010-06-09 Thread Sean Owen
On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix jake.man...@gmail.com wrote:
 The ItemSimilarityJob actually uses implementations of the Vector
 class hierarchy?  I think that's the issue - if the on-disk and in-mapper
 representations are never Vectors, then they won't interoperate with
 any of the matrix operations...

Yes they are Vectors.

 And yeah, keying on ints is necessary for now, unless we want to
 make a new matrix type (at least for distributed matrices) which
 keys on longs (which actually might be a good idea: now that
 we're using VInt and VLong, the disk space and network usage
  should not be adversely affected - just the in-memory
 representation).

Oh I see. Well that's not a problem. Already, IDs have to be mapped to
ints to be used as dimensions in a Vector. So in most cases things are
keyed by these int pseudo-IDs. That's OK too.

A matrix is a bunch of vectors -- at least, that's a nice structure
for a SequenceFile. Row (or col) ID mapped to row (column) vector.

Is that not what other jobs are using?
What's the better alternative we could think about converging on?


Generating a Document Similarity Matrix

2010-06-08 Thread Kris Jack
Hi everyone,

I currently use lucene's moreLikeThis function through solr to find
documents that are related to one another.  A single call, however, takes
around 4 seconds to complete and I would like to reduce this.  I got to
thinking that I might be able to use Mahout to generate a document
similarity matrix offline that could then be looked-up in real time for
serving.  Is this a reasonable use of Mahout?  If so, what functions will
generate a document similarity matrix?  Also, I would like to be able to
keep the text processing advantages provided through lucene so it would help
if I could still use my lucene index.  If not, then could you recommend any
alternative solutions please?

Many thanks,
Kris


Re: Generating a Document Similarity Matrix

2010-06-08 Thread Jake Mannix
Hi Kris,

  If you generate a full document-document similarity matrix offline, and
then make sure to sparsify the rows (trim off all similarities below a
threshold, or only take the top N for each row, etc...).  Then encoding
these values directly in the index would indeed allow for *superfast*
MoreLikeThis functionality, because you've already computed all
of the similar results offline.

  The only downside is that it won't apply to newly indexed documents.
If your indexing setup is such that you don't fold in new documents live,
but do so in batch, then this should be fine.

  An alternative is to use something like a Locality Sensitive Hash
(something one of my co-workers is writing up a nice implementation
of now, and I'm going to get him to contribute it once it's fully tested),
to reduce the search space (as a lucene Filter) and speed up the
query.

  -jake

On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com wrote:

 Hi Olivier,

 Thanks for your suggestions.  I have over 10 million documents and they
 have
 quite a lot of meta-data associated with them including rather large text
 fields.  It is possible to tweak the moreLikeThis function from solr.  I
 have tried changing the parameters (
 http://wiki.apache.org/solr/MoreLikeThis)
 but am not managing to get results in under 300ms without sacrificing the
 quality of the results too much.

 I suspect that there would be gains to be made from reducing the
 dimensionality of the feature vectors before indexing with lucene so I may
 give that a try.  I'll keep you posted if I come up with other solutions.

 Thanks,
 Kris



 2010/6/8 Olivier Grisel olivier.gri...@ensta.org

  2010/6/8 Kris Jack mrkrisj...@gmail.com:
   Hi everyone,
  
   I currently use lucene's moreLikeThis function through solr to find
   documents that are related to one another.  A single call, however,
 takes
   around 4 seconds to complete and I would like to reduce this.  I got to
   thinking that I might be able to use Mahout to generate a document
   similarity matrix offline that could then be looked-up in real time for
   serving.  Is this a reasonable use of Mahout?  If so, what functions
 will
   generate a document similarity matrix?  Also, I would like to be able
 to
   keep the text processing advantages provided through lucene so it would
  help
   if I could still use my lucene index.  If not, then could you recommend
  any
   alternative solutions please?
 
  How many documents do you have in your index? Have you tried to tweak
  the MoreLikeThis parameters ? (I don't know if it's possible using the
  solr interface, I use it directly using the lucene java API)
 
  For instance you can trade off recall for speed by decreasing the
  number of terms to use in the query and trade recall for precision and
  speed by increasing the percentage of terms that should match.
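
As a concrete illustration of those two knobs (a sketch only, written against
the Lucene 3.x contrib MoreLikeThis API; the field name "text", the document
id, and the specific thresholds are placeholders):

  import java.io.IOException;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.similar.MoreLikeThis;

  public class MltTuning {

    public static Query tunedMoreLikeThis(IndexReader reader, int docId)
        throws IOException {
      MoreLikeThis mlt = new MoreLikeThis(reader);
      mlt.setFieldNames(new String[] { "text" });
      mlt.setMinTermFreq(2);      // ignore terms that are rare within the document
      mlt.setMinDocFreq(5);       // ignore terms that are rare in the corpus
      mlt.setMaxQueryTerms(15);   // fewer query terms: faster, but lower recall
      Query query = mlt.like(docId);
      if (query instanceof BooleanQuery) {
        // require a fraction of the selected terms to match: more precision
        // and speed at the cost of recall
        BooleanQuery bq = (BooleanQuery) query;
        bq.setMinimumNumberShouldMatch((int) Math.ceil(0.3 * bq.clauses().size()));
      }
      return query;
    }
  }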
 
  You could also use Mahout implementation of SVD to build low
  dimensional semantic vectors representing your documents (a.k.a.
  Latent Semantic Indexing) and then index those transformed frequency
  vectors in a dedicated lucene index (or document field provided you
  name the resulting terms with something that does not match real life
  terms present in other). However using standard SVD will probably
  result in dense (as opposed to sparse) low dimensional semantic
  vectors. I don't think lucene's lookup performance is good with dense
  frequency vectors even though the number of terms is greatly reduced
  by SVD. Hence it would probably be better to either threshold the top
  100 absolute values of each semantic vectors before indexing (probably
   the simpler solution) or using a sparsifying penalty constrained
  variant of SVD / LSI. You should have a look at the literature on
  sparse coding or sparse dictionary learning, Sparse-PCA and more
  generally L1 penalty regression methods such as the Lasso and LARS. I
   don't know about any library for sparse semantic coding of documents
  that works automatically with lucene. Probably some non trivial coding
  is needed there.
 
  Another alternative is finding low dimensional (64 or 32 components)
   dense codes, binary-threshold them, and store the integer codes
   in the DB or the lucene index, and then build smart exact-match queries
  to find all document lying in the hamming ball of size 1 or 2 of the
  reference document's binary code. But I think this approach while
  promising for web scale document collections is even more experimental
   and requires very good low dimensional encoders (I don't think linear
  models such as SVD are good enough for reducing sparse 10e6 components
  vectors to dense 64 components vectors, non linear encoders such as
  Stacked Restricted Boltzmann Machines are probably a better choice).
 
  In any case let us know about your results, I am really interested on
  practical yet scalable solutions to this problem.
 
  --
  Olivier
  http://twitter.com/ogrisel - http://github.com/ogrisel
 



 --
 Dr

Re: Generating a Document Similarity Matrix

2010-06-08 Thread Kris Jack
Sorry, copied the previous link from the wrong tab :/  Meant to be
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html for reading
in lucene vectors.



2010/6/8 Kris Jack mrkrisj...@gmail.com

 Hi Jake,

 Thanks for that.  The first solution that you suggest is more like what I
 was imagining.

 Please excuse me, I'm new to Mahout and don't know how to use it to
 generate the full document-document similarity matrix.  I would rather not
 have to re-implement the moreLikeThis algorithm that, although rather
 straight forward, may take time for a newbie to MapReduce like me.  Could
 you guide me a little in finding the relevant Mahout code for generating the
 matrix or is it not really designed for that?

 For the moment, I would be happy to have an off-line batch version
 working.  Also, it is desirable to take advantage of the text processing
 features that I have already configured using solr, so I would prefer to
 read in the feature vectors for the documents from a lucene index, as I am
 doing at present (e.g.
 http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
 ).

 Thanks,
 Kris



 2010/6/8 Jake Mannix jake.man...@gmail.com

 Hi Kris,

  If you generate a full document-document similarity matrix offline, and
 then make sure to sparsify the rows (trim off all similarities below a
 threshold, or only take the top N for each row, etc...).  Then encoding
 these values directly in the index would indeed allow for *superfast*
 MoreLikeThis functionality, because you've already computed all
 of the similar results offline.

  The only downside is that it won't apply to newly indexed documents.
 If your indexing setup is such that you don't fold in new documents live,
 but do so in batch, then this should be fine.

  An alternative is to use something like a Locality Sensitive Hash
 (something one of my co-workers is writing up a nice implementation
 of now, and I'm going to get him to contribute it once it's fully tested),
 to reduce the search space (as a lucene Filter) and speed up the
 query.

  -jake

 On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com wrote:

  Hi Olivier,
 
  Thanks for your suggestions.  I have over 10 million documents and they
  have
  quite a lot of meta-data associated with them including rather large
 text
  fields.  It is possible to tweak the moreLikeThis function from solr.  I
  have tried changing the parameters (
  http://wiki.apache.org/solr/MoreLikeThis)
  but am not managing to get results in under 300ms without sacrificing
 the
  quality of the results too much.
 
  I suspect that there would be gains to be made from reducing the
  dimensionality of the feature vectors before indexing with lucene so I
 may
  give that a try.  I'll keep you posted if I come up with other
 solutions.
 
  Thanks,
  Kris
 
 
 
  2010/6/8 Olivier Grisel olivier.gri...@ensta.org
 
   2010/6/8 Kris Jack mrkrisj...@gmail.com:
Hi everyone,
   
I currently use lucene's moreLikeThis function through solr to find
documents that are related to one another.  A single call, however,
  takes
around 4 seconds to complete and I would like to reduce this.  I got
 to
thinking that I might be able to use Mahout to generate a document
similarity matrix offline that could then be looked-up in real time
 for
serving.  Is this a reasonable use of Mahout?  If so, what functions
  will
generate a document similarity matrix?  Also, I would like to be
 able
  to
keep the text processing advantages provided through lucene so it
 would
   help
if I could still use my lucene index.  If not, then could you
 recommend
   any
alternative solutions please?
  
   How many documents do you have in your index? Have you tried to tweak
   the MoreLikeThis parameters ? (I don't know if it's possible using the
   solr interface, I use it directly using the lucene java API)
  
   For instance you can trade off recall for speed by decreasing the
   number of terms to use in the query and trade recall for precision and
   speed by increasing the percentage of terms that should match.
  
   You could also use Mahout implementation of SVD to build low
   dimensional semantic vectors representing your documents (a.k.a.
   Latent Semantic Indexing) and then index those transformed frequency
   vectors in a dedicated lucene index (or document field provided you
   name the resulting terms with something that does not match real life
   terms present in other). However using standard SVD will probably
   result in dense (as opposed to sparse) low dimensional semantic
   vectors. I don't think lucene's lookup performance is good with dense
   frequency vectors even though the number of terms is greatly reduced
   by SVD. Hence it would probably be better to either threshold the top
   100 absolute values of each semantic vectors before indexing (probably
    the simpler solution) or using a sparsifying penalty constrained

Re: Generating a Document Similarity Matrix

2010-06-08 Thread Jake Mannix
Hi Kris,

  So you already know how to make a sparse feature matrix out of
your Solr index, based on Grant's instructions?  Once you have that
matrix loaded into HDFS, then the following Java code should
create a document-document similarity matrix:

//
  String inputPath = "/path/to/matrix/on/hdfs";
  String tmpPath = "/tmp/matrixmultiplyspace";
  int numDocuments = // whatever your numDocuments is
  int numTerms = // total number of terms in the matrix

  DistributedRowMatrix text = new DistributedRowMatrix(inputPath,
      tmpPath, numDocuments, numTerms);
  JobConf conf = new JobConf("similarity job");
  text.configure(conf);

  DistributedRowMatrix transpose = text.transpose();

  DistributedRowMatrix similarity = transpose.times(transpose);
  System.out.println("Similarity matrix lives: " + similarity.getRowPath());
//

Now, the rows of this similarity are going to be way too dense, so
you'll want to write a small map-reduce job (well, no reduce is necessary)
to run over this matrix and trim out all the unuseful entries of each
row, but that shouldn't be too hard to do.

Of course, to do it really efficiently, that functionality could be folded
into the reducer of the matrix multiply job, and done in the same pass over
the data as that one.
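
A minimal sketch of that trimming pass (map-only, keeping the N largest
entries per row; the class name and the hard-coded N are illustrative, and it
assumes the IntWritable/VectorWritable row format used elsewhere in this
thread):

  import java.io.IOException;
  import java.util.Comparator;
  import java.util.Iterator;
  import java.util.PriorityQueue;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.Vector.Element;
  import org.apache.mahout.math.VectorWritable;

  public class TopNRowMapper
      extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    private static final int N = 100;

    @Override
    protected void map(IntWritable row, VectorWritable value, Context ctx)
        throws IOException, InterruptedException {
      Vector similarities = value.get();

      // smallest-similarity-on-top queue of the N best (similarity, index) pairs
      PriorityQueue<double[]> top = new PriorityQueue<double[]>(N + 1,
          new Comparator<double[]>() {
            public int compare(double[] a, double[] b) {
              return Double.compare(a[0], b[0]);
            }
          });

      Iterator<Element> it = similarities.iterateNonZero();
      while (it.hasNext()) {
        Element e = it.next();
        top.add(new double[] { e.get(), e.index() });
        if (top.size() > N) {
          top.poll();   // drop the currently smallest similarity
        }
      }

      Vector trimmed = new RandomAccessSparseVector(similarities.size());
      for (double[] pair : top) {
        trimmed.set((int) pair[1], pair[0]);
      }
      ctx.write(row, new VectorWritable(trimmed));
    }
  }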

  -jake



On Tue, Jun 8, 2010 at 9:44 AM, Kris Jack mrkrisj...@gmail.com wrote:

 Hi Jake,

 Thanks for that.  The first solution that you suggest is more like what I
 was imagining.

 Please excuse me, I'm new to Mahout and don't know how to use it to
 generate
 the full document-document similarity matrix.  I would rather not have to
 re-implement the moreLikeThis algorithm that, although rather straight
 forward, may take time for a newbie to MapReduce like me.  Could you guide
 me a little in finding the relevant Mahout code for generating the matrix
 or
 is it not really designed for that?

 For the moment, I would be happy to have an off-line batch version working.
 Also, it is desirable to take advantage of the text processing features
 that
 I have already configured using solr, so I would prefer to read in the
 feature vectors for the documents from a lucene index, as I am doing at
 present (e.g.

 http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
 ).

 Thanks,
 Kris



 2010/6/8 Jake Mannix jake.man...@gmail.com

  Hi Kris,
 
   If you generate a full document-document similarity matrix offline, and
  then make sure to sparsify the rows (trim off all similarities below a
  threshold, or only take the top N for each row, etc...).  Then encoding
  these values directly in the index would indeed allow for *superfast*
  MoreLikeThis functionality, because you've already computed all
  of the similar results offline.
 
   The only downside is that it won't apply to newly indexed documents.
  If your indexing setup is such that you don't fold in new documents live,
  but do so in batch, then this should be fine.
 
   An alternative is to use something like a Locality Sensitive Hash
  (something one of my co-workers is writing up a nice implementation
  of now, and I'm going to get him to contribute it once it's fully
 tested),
  to reduce the search space (as a lucene Filter) and speed up the
  query.
 
   -jake
 
  On Tue, Jun 8, 2010 at 8:11 AM, Kris Jack mrkrisj...@gmail.com wrote:
 
   Hi Olivier,
  
   Thanks for your suggestions.  I have over 10 million documents and they
   have
   quite a lot of meta-data associated with them including rather large
 text
   fields.  It is possible to tweak the moreLikeThis function from solr.
  I
   have tried changing the parameters (
   http://wiki.apache.org/solr/MoreLikeThis)
   but am not managing to get results in under 300ms without sacrificing
 the
   quality of the results too much.
  
   I suspect that there would be gains to be made from reducing the
   dimensionality of the feature vectors before indexing with lucene so I
  may
   give that a try.  I'll keep you posted if I come up with other
 solutions.
  
   Thanks,
   Kris
  
  
  
   2010/6/8 Olivier Grisel olivier.gri...@ensta.org
  
2010/6/8 Kris Jack mrkrisj...@gmail.com:
 Hi everyone,

 I currently use lucene's moreLikeThis function through solr to find
 documents that are related to one another.  A single call, however,
   takes
 around 4 seconds to complete and I would like to reduce this.  I
 got
  to
 thinking that I might be able to use Mahout to generate a document
 similarity matrix offline that could then be looked-up in real time
  for
 serving.  Is this a reasonable use of Mahout?  If so, what
 functions
   will
 generate a document similarity matrix?  Also, I would like to be
 able
   to
 keep the text processing advantages provided through lucene so it
  would
help
 if I could still use my lucene index.  If not, then could you
  recommend
any
 alternative solutions please?
   
How many documents do you have in your index? Have you tried to tweak

Re: Generating a Document Similarity Matrix

2010-06-08 Thread Sebastian Schelter
Hi Kris,

actually the code to compute the item-to-item similarities in the
collaborative filtering part of mahout (which at the first look seems to be
a totally different problem than yours) is based on a paper that deals with
computing the pairwise similarity of text documents in a very simple way.
Maybe that  could be helpful to you:

Elsayed et al: Pairwise Document Similarity in Large Collections with
MapReduce
http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

-sebastian
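
The core idea of that paper, illustrated in memory rather than in MapReduce (a
toy sketch on a modern JDK; it ignores the length normalization and the
document-frequency cut that the paper discusses):

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class PairwiseSimilarity {

    public static void main(String[] args) {
      // toy corpus: docId -> (term -> weight)
      Map<Integer, Map<String, Double>> docs = new HashMap<>();
      docs.put(0, Map.of("mahout", 1.0, "hadoop", 2.0));
      docs.put(1, Map.of("hadoop", 1.0, "lucene", 3.0));
      docs.put(2, Map.of("gardening", 5.0));

      // "map" phase: invert to term -> postings of (docId, weight)
      Map<String, List<Map.Entry<Integer, Double>>> postings = new HashMap<>();
      for (Map.Entry<Integer, Map<String, Double>> doc : docs.entrySet()) {
        for (Map.Entry<String, Double> term : doc.getValue().entrySet()) {
          postings.computeIfAbsent(term.getKey(), t -> new ArrayList<>())
                  .add(Map.entry(doc.getKey(), term.getValue()));
        }
      }

      // "reduce" phase: each term adds w_a * w_b to every document pair (a, b)
      // in its posting list, so pairs sharing no terms are never touched
      Map<String, Double> pairSims = new HashMap<>();
      for (List<Map.Entry<Integer, Double>> posting : postings.values()) {
        for (int i = 0; i < posting.size(); i++) {
          for (int j = i + 1; j < posting.size(); j++) {
            int a = Math.min(posting.get(i).getKey(), posting.get(j).getKey());
            int b = Math.max(posting.get(i).getKey(), posting.get(j).getKey());
            double contribution = posting.get(i).getValue() * posting.get(j).getValue();
            pairSims.merge(a + "," + b, contribution, Double::sum);
          }
        }
      }
      System.out.println(pairSims);   // {0,1=2.0}: docs 0 and 1 share only "hadoop"
    }
  }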


2010/6/8 Kris Jack mrkrisj...@gmail.com

 Hi everyone,

 I currently use lucene's moreLikeThis function through solr to find
 documents that are related to one another.  A single call, however, takes
 around 4 seconds to complete and I would like to reduce this.  I got to
 thinking that I might be able to use Mahout to generate a document
 similarity matrix offline that could then be looked-up in real time for
 serving.  Is this a reasonable use of Mahout?  If so, what functions will
 generate a document similarity matrix?  Also, I would like to be able to
 keep the text processing advantages provided through lucene so it would
 help
 if I could still use my lucene index.  If not, then could you recommend any
 alternative solutions please?

 Many thanks,
 Kris



Re: Generating a Document Similarity Matrix

2010-06-08 Thread Jake Mannix
The code in mahout CF is doing that?  I don't think that's right, we don't
do anything that fancy right now, do we Sean?

  -jake

On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter
 ssc.o...@googlemail.com wrote:

 Hi Kris,

 actually the code to compute the item-to-item similarities in the
 collaborative filtering part of mahout (which at the first look seems to be
 a totally different problem than yours) is based on a paper that deals with
 computing the pairwise similarity of text documents in a very simple way.
 Maybe that  could be helpful to you:

 Elsayed et al: Pairwise Document Similarity in Large Collections with
 MapReduce

 http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

 -sebastian


 2010/6/8 Kris Jack mrkrisj...@gmail.com

  Hi everyone,
 
  I currently use lucene's moreLikeThis function through solr to find
  documents that are related to one another.  A single call, however, takes
  around 4 seconds to complete and I would like to reduce this.  I got to
  thinking that I might be able to use Mahout to generate a document
  similarity matrix offline that could then be looked-up in real time for
  serving.  Is this a reasonable use of Mahout?  If so, what functions will
  generate a document similarity matrix?  Also, I would like to be able to
  keep the text processing advantages provided through lucene so it would
  help
  if I could still use my lucene index.  If not, then could you recommend
 any
  alternative solutions please?
 
  Many thanks,
  Kris
 



Re: Generating a Document Similarity Matrix

2010-06-08 Thread Sean Owen
Sort of, there is a separate job to compute all item-item similarities
under a variety of metrics. This is what Sebastian wrote. It's not
used in the co-occurrence recommender (but could be -- vaguely a to-do
here.)

But sure, if you're willing to think of a doc as an item whose vector of
preferences comes from words, then this works fine to compute doc
similarity under these metrics.

On Wed, Jun 9, 2010 at 12:52 AM, Jake Mannix jake.man...@gmail.com wrote:
 The code in mahout CF is doing that?  I don't think that's right, we don't
 do anything that fancy right now, do we Sean?

  -jake

 On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter
  ssc.o...@googlemail.com wrote:

 Hi Kris,

 actually the code to compute the item-to-item similarities in the
 collaborative filtering part of mahout (which at the first look seems to be
 a totally different problem than yours) is based on a paper that deals with
 computing the pairwise similarity of text documents in a very simple way.
 Maybe that  could be helpful to you:

 Elsayed et al: Pairwise Document Similarity in Large Collections with
 MapReduce

 http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

 -sebastian


 2010/6/8 Kris Jack mrkrisj...@gmail.com

  Hi everyone,
 
  I currently use lucene's moreLikeThis function through solr to find
  documents that are related to one another.  A single call, however, takes
  around 4 seconds to complete and I would like to reduce this.  I got to
  thinking that I might be able to use Mahout to generate a document
  similarity matrix offline that could then be looked-up in real time for
  serving.  Is this a reasonable use of Mahout?  If so, what functions will
  generate a document similarity matrix?  Also, I would like to be able to
  keep the text processing advantages provided through lucene so it would
  help
  if I could still use my lucene index.  If not, then could you recommend
 any
  alternative solutions please?
 
  Many thanks,
  Kris
 




Re: Generating a Document Similarity Matrix

2010-06-08 Thread Jake Mannix
On Tue, Jun 8, 2010 at 3:56 PM, Olivier Grisel olivier.gri...@ensta.org wrote:

 2010/6/8 Jake Mannix jake.man...@gmail.com:
  Hi Kris,
 
   If you generate a full document-document similarity matrix offline, and
  then make sure to sparsify the rows (trim off all similarities below a
  threshold, or only take the top N for each row, etc...).  Then encoding
  these values directly in the index would indeed allow for *superfast*
  MoreLikeThis functionality, because you've already computed all
  of the similar results offline.

 For 10e6 documents it might not be reasonable to generate the complete
 document-document similarity matrix: 1e12 components = a couple of
 terabytes of similarity values just to find the top N
 afterwards:


Nope, this isn't what happens in what I described: when you take
a sparseDocumentMatrix.transpose().times(itself), the scaling does
not go N^2*M, with N^2 outputs - the calculation is sparse, only
computing the entries which are nonzero.  If you pre-sparsify the
documents a little (remove all terms which occur in more than
1% of all documents, or something like that), this sparse calculation
is even faster - it scales as sum_{i=1...N}(k_i)^2, where k_i is the
number of nonzero elements in document i.  If all documents were
the same length (k), then this scales as N*k^2, and the total
number of nonzero entries in the output is far less than N^2 if
k << N, M.
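
To put rough numbers on that (a back-of-the-envelope illustration only, using
the roughly 10 million documents mentioned earlier in the thread and assuming
about 100 non-zero terms per document after pruning):

  sum_{i=1...N}(k_i)^2 ~= N*k^2 = 10^7 * 100^2 = 10^11 partial products,
  versus N^2 = 10^14 candidate cells in a dense document-document matrix.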

Getting rid of the common terms (even *lots* of them) beforehand
is still a very good idea.

   -jake


Re: Generating a Document Similarity Matrix

2010-06-08 Thread Jake Mannix
On Tue, Jun 8, 2010 at 4:45 PM, Sebastian Schelter
 ssc.o...@googlemail.com wrote:

 The relation between these two problems (document similarity and item
 similarity in CF) is exactly like Sean pointed out: In the paper a document
 is a vector of term frequencies and the paper shows how to compute the
 pairwise similarities between those. To use this for collaborative
 filtering
 you actually just have to replace the document with an item which is a

vector of user preferences.


Yep, a vector is a vector is a vector.  (And when you're me, even if you
are *not* a vector, you might be a vector. ;) )


 It shouldn't be too hard to make this work on a DistributedRowMatrix too, I
 think. You already mentioned you wanna have it that way some time
 in MAHOUT-362 :)


Well indeed I did!

  -jake