Re: similarity of two texts

2004-06-02 Thread Terry Steichen
Erik, Could you expand on this just a wee bit, perhaps with an example of how to compute this vector angle? TIA, Terry - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, June 01, 2004 9:39 AM Subject: Re: similarity of two

Re: similarity of two texts

2004-06-02 Thread David Spencer
/Vector_Space_Search_Engine_Theory.pdf TIA, Terry - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, June 01, 2004 9:39 AM Subject: Re: similarity of two texts On Jun 1, 2004, at 9:24 AM, Grant Ingersoll wrote: Hey Eric, Eri*K* :) What did you

Re: similarity of two texts

2004-06-02 Thread Erik Hatcher
On Jun 2, 2004, at 1:39 PM, David Spencer wrote: Erik, Could you expand on this just a wee bit, perhaps with an example of how to compute this vector angle? I'm tempted to write the code to see how it works, but FYI this doc seems to nicely explain the concepts:

Re: similarity of two texts - another question

2004-06-02 Thread Gerard Sychay
Hmm, the term vector does not have to consist of only term frequencies, does it? To give weight to rare terms, could you create a term vector of (TF*IDF) values for each term? Then, a distance function would measure how many terms two vectors have in common, giving weight to how many rare terms
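
To make the TF*IDF suggestion concrete, here is a rough sketch in plain Java, not code from the thread. It assumes the per-document term frequencies and the per-term document frequencies have already been pulled out of the index into ordinary maps. The class name TfIdfWeighting, its methods, and the idf variant ln(numDocs / docFreq) are all assumptions chosen for illustration; Lucene's own scoring uses a different formula.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TfIdfWeighting {

    // Turns raw term frequencies into TF*IDF weights.
    // Assumption: idf = ln(numDocs / docFreq), one common textbook variant.
    static Map<String, Double> weight(Map<String, Integer> termFreqs,
                                      Map<String, Integer> docFreqs,
                                      int numDocs) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Terms with no recorded document frequency are treated as df = 1.
            int df = docFreqs.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) numDocs / df);
            weighted.put(e.getKey(), e.getValue() * idf);
        }
        return weighted;
    }

    // Cosine of the angle between two sparse weighted vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // no usable terms, so report "no similarity"
        }
        return dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up numbers: "the" occurs in every document, the other terms are rare.
        Map<String, Integer> docFreqs = Map.of("the", 100, "lucene", 2, "hadoop", 2);
        Map<String, Double> a = weight(Map.of("the", 5, "lucene", 1), docFreqs, 100);
        Map<String, Double> b = weight(Map.of("the", 5, "hadoop", 1), docFreqs, 100);
        System.out.println(cosine(a, b));
    }
}
```

With this weighting, a term that appears in every document gets weight zero, so two texts that share only very common words no longer look similar, while shared rare terms dominate the result.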

Re: similarity of two texts - another question

2004-06-02 Thread David Spencer
Gerard Sychay wrote: Hmm, the term vector does not have to consist of only term frequencies, does it? To give weight to rare terms, could you create a term vector of (TF*IDF) values for each term? Then, a distance function would measure how many terms two vectors have in common, giving weight to

Re: similarity of two texts

2004-06-01 Thread Erik Hatcher
On May 31, 2004, at 2:17 PM, Stefan Groschupf wrote: Lucene can't help you. What about using term vectors though? I've been able to do rudimentary document similarity calculations using the new support in Lucene 1.4. Search the 'net for more info on term vectors and the formulas needed

Re: similarity of two texts

2004-06-01 Thread sg
Quoting Erik Hatcher [EMAIL PROTECTED]: On May 31, 2004, at 2:17 PM, Stefan Groschupf wrote: Lucene can't help you. What about using term vectors though? I've been able to do rudimentary document similarity calculations using the new support in Lucene 1.4. Oops?! Is it built into Lucene

Re: similarity of two texts

2004-06-01 Thread Erik Hatcher
On Jun 1, 2004, at 6:06 AM, [EMAIL PROTECTED] wrote: Quoting Erik Hatcher [EMAIL PROTECTED]: On May 31, 2004, at 2:17 PM, Stefan Groschupf wrote: Lucene can't help you. What about using term vectors though? I've been able to do rudimentary document similarity calculations using the new support

Re: similarity of two texts

2004-06-01 Thread uddam chukmol
Thanks, guys, for your invaluable help and ideas. I'll take a look at Lucene 1.4 and tell you more about whether it can deal with my problem.

Re: similarity of two texts

2004-06-01 Thread Grant Ingersoll
Hey Eric, What did you do to calc similarity? I haven't had time, but was thinking of ways to add the ability to get the similarity score (as calculated when doing a search) given a term vector (or just a document id). Any ideas on how to approach this would be appreciated. The scoring in

Re: similarity of two texts

2004-06-01 Thread Erik Hatcher
On Jun 1, 2004, at 9:24 AM, Grant Ingersoll wrote: Hey Eric, Eri*K* :) What did you do to calc similarity? I computed the angle between two vectors. The vectors are obtained from IndexReader.getTermFreqVector(docId, field). I haven't had time, but was thinking of ways to add the ability to
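
As an illustration of the "angle between two vectors" idea, here is a minimal sketch in plain Java. It is not necessarily the code Erik used. It assumes the term frequencies have already been copied out of the term vectors (for example, those returned by IndexReader.getTermFreqVector(docId, field)) into ordinary maps from term to count. The class name TermVectorAngle and its methods are hypothetical and not part of Lucene.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TermVectorAngle {

    // Cosine of the angle between two sparse term-frequency vectors:
    // dot(a, b) / (|a| * |b|). 1.0 means same direction, 0.0 means no shared terms.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector has no direction; treat as dissimilar
        }
        return dot / (normA * normB);
    }

    // Angle in radians; the clamp guards against rounding slightly above 1.0.
    static double angle(Map<String, Integer> a, Map<String, Integer> b) {
        return Math.acos(Math.min(1.0, cosine(a, b)));
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int freq : v.values()) {
            sum += (double) freq * freq;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up term counts for two short texts.
        Map<String, Integer> doc1 = Map.of("lucene", 2, "search", 1, "index", 1);
        Map<String, Integer> doc2 = Map.of("lucene", 1, "search", 3);
        System.out.println("cosine = " + cosine(doc1, doc2));
        System.out.println("angle  = " + angle(doc1, doc2) + " rad");
    }
}
```

A smaller angle (cosine closer to 1) means the two texts use terms in more similar proportions. Raw frequencies treat every term alike, so very common words can dominate unless they are filtered out or down-weighted.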

Re: similarity of two texts

2004-06-01 Thread Grant Ingersoll
Sorry about the misspelling, Erik! Thanks for the insight. Explain is my friend as an end user, but it, too, is confusing at the code level! At some point I will have time to dig deeper and step through the scoring code. [EMAIL PROTECTED] 06/01/04 09:39AM On Jun 1, 2004, at 9:24 AM, Grant

Re: similarity of two texts - another question

2004-06-01 Thread David Spencer
Erik Hatcher wrote: On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote: Well, a question again, how does Lucene compute the score between a document and a query? And I might add, thus, this approach to similarity gives more weight to rare terms that match, which one might want for this kind of

similarity of two texts

2004-05-31 Thread uddam chukmol
Hi, I'm a newbie to Lucene and heard that it helps in the information retrieval process. However, my problem is not really related to information retrieval but to the comparison of two texts. I think Lucene may help resolve it. I would like to have a clue on how to compare two given

Re: similarity of two texts

2004-05-31 Thread Stefan Groschupf
Lucene can't help you. Search for text classification or text clustering. Browse the tools section at www.text-mining.org; there you may find tools that can help you with this task. In general, some keywords for your further search: Feature extraction from text. Data mining algorithms