Weighted cosine similarity calculation using Lucene
I have documents that are marked up with Taxonomy and Ontology terms separately. When I calculate document similarity, I want to give higher weights to those Taxonomy and Ontology terms.

When I index a document, I define the document content, Taxonomy terms and Ontology terms as Fields for each document, like this in my program:

    Field ontologyTerm = new Field(fiboterms, fiboTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
    Field taxonomyTerm = new Field(taxoterms, taxoTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
    Field document = new Field(docNames[curDocNo], strRdElt, Field.TermVector.YES);

I'm using the Lucene index's TermFreqVector functions to calculate TF-IDF values, and then calculating the cosine similarity between two documents from those TF-IDF values.

To give weights to the Ontology and Taxonomy terms when calculating the cosine similarity, what I can do is programmatically multiply the Taxonomy and Ontology term frequencies by a defined weight factor before calculating the TF-IDF scores. Will this give higher weight to Taxonomy and Ontology terms in the document similarity calculation?

Are there Lucene functions that can be used to give higher weights to certain fields when calculating TF-IDF values using TermFreqVector? Can I just use the setBoost() function for this purpose, and if so, how?

--
Regards
Kasun Perera
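The approach described above (multiplying each field's term frequencies by a weight factor before forming TF-IDF values) can be sketched independently of Lucene. This is only a minimal illustration, not the poster's code: the class name WeightedCosine, the helpers addField and cosine, the field-name prefix on each dimension, and the fallback IDF of 1.0 for unknown terms are all assumptions. The term-frequency maps are assumed to have been read from each field's term vector beforehand.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class WeightedCosine {

    /**
     * Adds fieldWeight * tf * idf for every term of one field to a combined
     * document vector. Terms missing from the idf map get an IDF of 1.0.
     */
    static void addField(Map<String, Double> vector, String field,
                         Map<String, Integer> termFreqs,
                         Map<String, Double> idf, double fieldWeight) {
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Prefix with the field name so the same word in two fields
            // remains two separate dimensions.
            String dim = field + ":" + e.getKey();
            double termIdf = idf.getOrDefault(e.getKey(), 1.0);
            double value = fieldWeight * e.getValue() * termIdf;
            vector.merge(dim, value, Double::sum);
        }
    }

    /** Plain cosine similarity between two sparse vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

As a small check with made-up data: take content terms {bond: 2, market: 1} for one document and {equity: 2, market: 1} for another, both sharing the taxonomy term bond once, and all IDF values equal to 1.0. The similarity is 1/3 with a taxonomy weight of 1.0 and 5/7 with a taxonomy weight of 3.0, so in this sketch the shared taxonomy term does count for more once its field is weighted up.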
Re: Weighted cosine similarity calculation using Lucene
Maybe I'm missing something here, but why not just boost the terms in the fields at query time?

Best
Erick

On Fri, Apr 20, 2012 at 4:20 AM, Kasun Perera kas...@opensource.lk wrote:
> I have documents that are marked up with Taxonomy and Ontology terms separately. [...]

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Weighted cosine similarity calculation using Lucene
Hi Erick,

On Fri, Apr 20, 2012 at 5:14 PM, Erick Erickson erickerick...@gmail.com wrote:
> Maybe I'm missing something here, but why not just boost the terms in the fields at query time?

Yes, I can boost the fields at query time. But I'm using the TermFreqVector to get the term frequencies, then calculating the TF-IDF values for the documents, and then calculating the cosine similarity from those TF-IDF values. The Field.setBoost() function has no effect on the term frequencies. Is there any other way to do the boosting that will affect the term frequencies?

Thanks

--
Regards
Kasun Perera
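If the term frequencies read from the term vectors are unaffected by setBoost(), as stated above, the weighting has to be applied in the similarity code itself. Written out, the effect of multiplying each field's term frequencies by a weight looks roughly like this. The symbols are assumptions introduced for illustration only: w_f is the weight of field f, tf_{f,t}(d) is the frequency of term t in field f of document d, and idf_t is the inverse document frequency of t.

```latex
\[
x_{f,t}(d) = w_f \,\mathrm{tf}_{f,t}(d)\,\mathrm{idf}_t
\]

\[
\cos(d_1, d_2) =
\frac{\sum_f w_f^2 \sum_t \mathrm{tf}_{f,t}(d_1)\,\mathrm{tf}_{f,t}(d_2)\,\mathrm{idf}_t^2}
     {\sqrt{\sum_f w_f^2 \sum_t \bigl(\mathrm{tf}_{f,t}(d_1)\,\mathrm{idf}_t\bigr)^2}\;
      \sqrt{\sum_f w_f^2 \sum_t \bigl(\mathrm{tf}_{f,t}(d_2)\,\mathrm{idf}_t\bigr)^2}}
\]
```

Two things follow from this form. Each weight enters squared, because both documents' vectors are scaled by the same w_f. And multiplying every w_f by a common constant leaves the cosine unchanged, so only the ratio between the field weights matters.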