Weighted cosine similarity calculation using Lucene

2012-04-20 Thread Kasun Perera
I have documents that are marked up with Taxonomy and Ontology terms
separately.
When I calculate the document similarity, I want to give higher weights to
those Taxonomy terms and Ontology terms.


When I index the document, I have defined the Document content, Taxonomy
and Ontology terms as Fields for each document like this in my program.


    Field ontologyTerm = new Field(fiboterms, fiboTermList[curDocNo],
            Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);

    Field taxonomyTerm = new Field(taxoterms, taxoTermList[curDocNo],
            Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);

    Field document = new Field(docNames[curDocNo], strRdElt,
            Field.TermVector.YES);



I'm using Lucene's TermFreqVector functions on the index to calculate TF-IDF
values, and then I calculate the cosine similarity between two documents from
those TF-IDF values.
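
As a rough illustration of the calculation described above (not the poster's actual code), the sketch below builds TF-IDF vectors from term-frequency maps and compares them with cosine similarity. The class `TfIdfCosine`, its method names, and the toy counts are hypothetical. The plain maps stand in for the terms and frequencies that a TermFreqVector exposes through getTerms() and getTermFrequencies(), and the IDF formula ln(numDocs / docFreq) is an assumption, since the thread does not say which variant is used.

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdfCosine {

    /** Builds a TF-IDF vector for one document: tf * ln(numDocs / docFreq). */
    public static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                            Map<String, Integer> docFreqs,
                                            int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Assumption: a term missing from docFreqs occurs in one document.
            int df = docFreqs.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) numDocs / df);
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    /** Cosine similarity of two sparse vectors; 0.0 if either has zero length. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy term frequencies for two documents in a 4-document corpus.
        Map<String, Integer> doc1 = new HashMap<>();
        doc1.put("bond", 2);
        doc1.put("equity", 1);
        Map<String, Integer> doc2 = new HashMap<>();
        doc2.put("bond", 1);
        doc2.put("loan", 3);

        // Number of documents containing each term (made-up values).
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("bond", 2);
        docFreqs.put("equity", 1);
        docFreqs.put("loan", 1);

        double sim = cosine(tfIdf(doc1, docFreqs, 4), tfIdf(doc2, docFreqs, 4));
        System.out.println("cosine similarity = " + sim);
    }
}
```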


To give weights to the Ontology and Taxonomy terms when calculating the cosine
similarity, what I can do is programmatically multiply the Taxonomy and
Ontology term frequencies by a defined weight factor before calculating the
TF-IDF scores. Will this give higher weight to the Taxonomy and Ontology terms
in the document similarity calculation?
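
One way to check the idea on toy data is sketched below. It is a minimal sketch under stated assumptions: all fields of a document are merged into a single vector with field-prefixed keys, IDF is left out (a weight applied to tf carries straight through tf * idf, which is linear in tf), and the class `FieldWeightedVector`, its methods, and the sample terms are hypothetical. Because cosine similarity is scale-invariant, the weights only matter relative to each other: scaling every field by the same factor changes nothing, and a weight has no effect if each field is compared in isolation.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldWeightedVector {

    /** Adds one field's term frequencies, scaled by weight, to the combined vector. */
    public static void addField(Map<String, Double> combined, String field,
                                Map<String, Integer> termFreqs, double weight) {
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Prefix with the field name so the same term in two fields
            // stays in separate dimensions.
            combined.merge(field + ":" + e.getKey(), weight * e.getValue(), Double::sum);
        }
    }

    /** Cosine similarity of two sparse vectors; 0.0 if either has zero length. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, sumA = 0.0, sumB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sumA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double x : b.values()) {
            sumB += x * x;
        }
        return (sumA == 0.0 || sumB == 0.0) ? 0.0 : dot / Math.sqrt(sumA * sumB);
    }

    /** Similarity of two toy documents that share one taxonomy term. */
    public static double demo(double contentWeight, double taxoWeight) {
        Map<String, Integer> contentA = new HashMap<>();
        contentA.put("market", 3);
        contentA.put("report", 1);
        Map<String, Integer> contentB = new HashMap<>();
        contentB.put("weather", 3);
        contentB.put("report", 1);
        Map<String, Integer> taxo = new HashMap<>();
        taxo.put("bond", 1); // both documents carry this taxonomy term

        Map<String, Double> a = new HashMap<>();
        addField(a, "content", contentA, contentWeight);
        addField(a, "taxoterms", taxo, taxoWeight);
        Map<String, Double> b = new HashMap<>();
        addField(b, "content", contentB, contentWeight);
        addField(b, "taxoterms", taxo, taxoWeight);
        return cosine(a, b);
    }

    public static void main(String[] args) {
        System.out.println("equal weights:       " + demo(1.0, 1.0));
        System.out.println("taxonomy weighted 3: " + demo(1.0, 3.0));
        System.out.println("all fields times 5:  " + demo(5.0, 5.0));
    }
}
```

On this toy data the similarity is 2/11 (about 0.18) with equal weights and 10/19 (about 0.53) when the taxonomy field is weighted by 3, so the shared taxonomy term does count for more. Multiplying every field by 5 gives 2/11 again.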


Are there Lucene functions that can be used to give higher weights to
certain fields when calculating TF-IDF values using TermFreqVector? Can I
just use the setBoost() function for this purpose, and if so, how?

-- 
Regards

Kasun Perera


Re: Weighted cosine similarity calculation using Lucene

2012-04-20 Thread Erick Erickson
Maybe I'm missing something here, but why not just boost the
terms in the fields at query time?

Best
Erick


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Weighted cosine similarity calculation using Lucene

2012-04-20 Thread Kasun Perera
Hi Erick

On Fri, Apr 20, 2012 at 5:14 PM, Erick Erickson erickerick...@gmail.com wrote:

 Maybe I'm missing something here, but why not just boost the
 terms in the fields at query time?


Yes, I can boost the fields at query time. But I'm using TermFreqVector to
get the term frequencies, then calculating the TF-IDF values for the
documents, and then calculating the cosine similarity from those TF-IDF values.
The Field.setBoost() function has NO effect on the term frequencies.
Is there any other way to do the boosting that will affect the
term frequencies?
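
One alternative that leaves the term frequencies untouched, offered here only as a sketch and not something proposed in this thread, is to compute a cosine similarity per field and then combine the per-field scores with a weighted average. The class `PerFieldSimilarity`, the method `combine`, and the sample scores and weights are all hypothetical.

```java
public class PerFieldSimilarity {

    /** Weighted average of per-field similarity scores. */
    public static double combine(double[] similarities, double[] weights) {
        if (similarities.length != weights.length) {
            throw new IllegalArgumentException("one weight per similarity is required");
        }
        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            weighted += weights[i] * similarities[i];
            total += weights[i];
        }
        // Dividing by the weight total keeps the result in the same range
        // as the input scores.
        return total == 0.0 ? 0.0 : weighted / total;
    }

    public static void main(String[] args) {
        // Made-up per-field cosine scores: content, taxonomy, ontology.
        double[] sims = {0.2, 0.9, 0.6};
        // Taxonomy and ontology fields count three times as much as content.
        double[] weights = {1.0, 3.0, 3.0};
        System.out.println("combined similarity = " + combine(sims, weights));
    }
}
```

With this approach the per-field TF-IDF vectors stay exactly as computed from the index, and the field weights are applied only when the scores are merged.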

Thanks







-- 
Regards

Kasun Perera