RE: DisjunctionMaxQuery and scoring
Hi,

> I think BooleanQuery bq = new BooleanQuery(false); doesn't quite accomplish the desired name IN (dick, rich) scoring behavior. This is because (name:dick | name:rich) with coord=false would score the 'document' Dick Rich higher than Rich because the former has two term matches and the latter only one.
>
> In contrast, I think the desire is that one and only one of the terms in the document match those in the BooleanQuery so that Rich would score higher than Dick Rich, given document length normalization. It's almost like a desire for
> BooleanQuery bq = new BooleanQuery(false); bq.set*Maximum*NumberShouldMatch(1);

In that case DisjunctionMaxQuery is the way to go: it counts only the hit with the highest score and does not add the scores together, so coord or no coord doesn't matter here.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
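To make the difference between the two combination rules concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the per-term score (1 / sqrt(field length)) is only an assumed stand-in for length normalization, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class SumVsMaxSketch {

    // Toy per-term score: 1 / sqrt(number of terms in the field).
    // This is an assumed stand-in for length normalization, not Lucene's formula.
    static double termScore(List<String> docTerms, String queryTerm) {
        if (!docTerms.contains(queryTerm)) {
            return 0.0;
        }
        return 1.0 / Math.sqrt(docTerms.size());
    }

    // BooleanQuery-like combination with coord disabled: clause scores are added.
    static double sumScore(List<String> docTerms, List<String> queryTerms) {
        double total = 0.0;
        for (String q : queryTerms) {
            total += termScore(docTerms, q);
        }
        return total;
    }

    // DisjunctionMaxQuery-like combination (no tie-breaker): only the best clause counts.
    static double maxScore(List<String> docTerms, List<String> queryTerms) {
        double best = 0.0;
        for (String q : queryTerms) {
            best = Math.max(best, termScore(docTerms, q));
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("dick", "rich");
        List<String> dickRich = Arrays.asList("dick", "rich");
        List<String> rich = Arrays.asList("rich");

        System.out.println("sum, Dick Rich: " + sumScore(dickRich, query));
        System.out.println("sum, Rich:      " + sumScore(rich, query));
        System.out.println("max, Dick Rich: " + maxScore(dickRich, query));
        System.out.println("max, Rich:      " + maxScore(rich, query));
    }
}
```

Under these toy assumptions, adding the clause scores ranks "Dick Rich" above "Rich", while taking the maximum ranks "Rich" first because its shorter field gives the single matching term a higher score.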
Re: DisjunctionMaxQuery and scoring
On Thu, Apr 19, 2012 at 8:32 PM, David Murgatroyd dmu...@gmail.com wrote:

> In contrast, I think the desire is that one and only one of the terms in the document match those in the BooleanQuery so that Rich would score higher than Dick Rich, given document length normalization. It's almost like a desire for
> BooleanQuery bq = new BooleanQuery(false); bq.set*Maximum*NumberShouldMatch(1);

You can, by returning a customized weight with a coord impl that PUNISHES documents that match > 1 sub. Take a look at http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/queries/src/java/org/apache/lucene/queries/BoostingQuery.java for some inspiration, especially this part:

    BooleanQuery result = new BooleanQuery() {
      @Override
      public Weight createWeight(IndexSearcher searcher) throws IOException {
        return new BooleanWeight(searcher, false) {
          @Override
          public float coord(int overlap, int max) {
            // your logic here when overlap == 1, > 1, etc
          }
        };
      }
    };

--
lucidimagination.com
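As a rough illustration of what such coord logic could look like, here is a plain-Java sketch that stands apart from Lucene. The penalty value and the names `MULTI_MATCH_PENALTY`, `coord` and `score` are assumptions made for this example; inside a real BooleanWeight subclass only the body of coord would carry over.

```java
public class CoordPenaltySketch {

    // Hypothetical factor applied when more than one clause matches.
    static final float MULTI_MATCH_PENALTY = 0.5f;

    // Mirrors the shape of coord(int overlap, int max): overlap is the number
    // of matching clauses, maxOverlap the number of clauses in the query.
    static float coord(int overlap, int maxOverlap) {
        if (overlap <= 0) {
            return 0.0f;              // nothing matched
        }
        if (overlap == 1) {
            return 1.0f;              // exactly one match: no penalty
        }
        return MULTI_MATCH_PENALTY;   // several matches: punish the document
    }

    // The summed clause scores are multiplied by the coord factor.
    static float score(float summedClauseScores, int overlap, int maxOverlap) {
        return summedClauseScores * coord(overlap, maxOverlap);
    }

    public static void main(String[] args) {
        // Toy numbers: "Dick Rich" matches both clauses, "Rich" matches one.
        System.out.println("Dick Rich: " + score(1.414f, 2, 2));
        System.out.println("Rich:      " + score(1.0f, 1, 2));
    }
}
```

With these made-up numbers the document matching both clauses ends up below the one matching a single clause, which is the ordering David describes.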
RE: DisjunctionMaxQuery and scoring
Hi,

Ah sorry, I misunderstood, you wanted to score the duplicate match lower! To achieve this, you have to change the coord function in the similarity/BooleanWeight used for this query.

Either way: if you want a group of terms to contribute only one score when at least one of the terms matches (like SQL IN), without the scores being added up, DisjunctionMaxQuery is fine. I think this is what Benson asked for.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
Weighted cosine similarity calculation using Lucene
I have documents that are marked up with Taxonomy and Ontology terms separately. When I calculate the document similarity, I want to give higher weights to those Taxonomy terms and Ontology terms.

When I index a document, I define the document content, the Taxonomy terms and the Ontology terms as Fields, like this:

    Field ontologyTerm = new Field(fiboterms, fiboTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
    Field taxonomyTerm = new Field(taxoterms, taxoTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
    Field document = new Field(docNames[curDocNo], strRdElt, Field.TermVector.YES);

I'm using the TermFreqVector functions of the Lucene index to calculate TFIDF values, and then I calculate the cosine similarity between two documents from those TFIDF values.

To give weights to the Ontology and Taxonomy terms when calculating the cosine similarity, what I can do is programmatically multiply the Taxonomy and Ontology term frequencies by a defined weight factor before calculating the TFIDF scores. Will this give higher weight to Taxonomy and Ontology terms in the document similarity calculation?

Are there Lucene functions that can be used to give higher weights to certain fields when calculating TFIDF values using TermFreqVector? Can I just use the setBoost() function for this purpose, and if so, how?

--
Regards
Kasun Perera
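The approach described above (multiply each field's term frequencies by a weight, then compute TFIDF and the cosine) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it does not read from a Lucene index, the term frequencies and IDF values are passed in as maps, terms without an IDF entry are treated as IDF 1.0, and all class, method and field names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class WeightedCosineSketch {

    // Adds one field's terms to a document vector. Each component is
    // fieldWeight * tf * idf, keyed by "field:term" so fields stay separate.
    static void addField(Map<String, Double> vector, String field,
                         Map<String, Integer> termFreqs, Map<String, Double> idf,
                         double fieldWeight) {
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Assumption for this toy: a missing IDF entry counts as 1.0.
            double idfValue = idf.getOrDefault(e.getKey(), 1.0);
            vector.put(field + ":" + e.getKey(), fieldWeight * e.getValue() * idfValue);
        }
    }

    // Plain cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Two toy documents sharing a content term but differing in taxonomy term.
    static double demoSimilarity(double taxoWeight) {
        Map<String, Double> idf = new HashMap<>();
        Map<String, Double> docA = new HashMap<>();
        addField(docA, "content", Collections.singletonMap("apple", 1), idf, 1.0);
        addField(docA, "taxoterms", Collections.singletonMap("bond", 1), idf, taxoWeight);
        Map<String, Double> docB = new HashMap<>();
        addField(docB, "content", Collections.singletonMap("apple", 1), idf, 1.0);
        addField(docB, "taxoterms", Collections.singletonMap("equity", 1), idf, taxoWeight);
        return cosine(docA, docB);
    }

    public static void main(String[] args) {
        System.out.println("taxonomy weight 1: " + demoSimilarity(1.0));
        System.out.println("taxonomy weight 3: " + demoSimilarity(3.0));
    }
}
```

One point worth keeping in mind: the cosine is unchanged when every component of a vector is scaled by the same factor, so a weight only changes the result when fields with different weights are combined into one vector, as in this sketch. Here a higher taxonomy weight makes the taxonomy mismatch count for more, so the similarity drops.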
Re: Field value vs TokenStream
On 18.04.2012 20:06, Uwe Schindler wrote:

> Hi,
>
> You should inform yourself about the difference between stored and indexed fields: The tokens in the .tis file are in fact the analyzed tokens retrieved from the TokenStream. This is controlled by the Field parameter Field.Index. The Field.Store parameter has nothing to do with indexing: if a field is marked as stored, the full and unchanged string / binary is stored in the stored fields file (.fdt). Stored fields are used

Thanks for that clarification!

Best,
Carsten

--
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238
Re: DisjunctionMaxQuery and scoring
Uwe and Robert,

Thanks. David and I are two peas in one pod here at Basis.

--benson
Re: Weighted cosine similarity calculation using Lucene
Maybe I'm missing something here, but why not just boost the terms in the fields at query time?

Best
Erick
Highlighter and Shingles...
Hi,

Are there any notes on making the highlighter work consistently with a shingle-generated index? I have a situation where complete matches highlight OK, but partial matches do not, leading to a number of blank previews...

Our analyser looks like this:

    TokenStream result = new StopFilter(Version.LUCENE_36,
        new ShingleFilter(
            new StopFilter(Version.LUCENE_36,
                new LowerCaseFilter(Version.LUCENE_36,
                    new StandardFilter(Version.LUCENE_36,
                        new StandardTokenizer(Version.LUCENE_36, reader))),
                STOP_CHARS_SET)),
        STOP_WORDS_SET);

--
Rgds.
Dawn Raison
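For readers unfamiliar with shingles, the following plain-Java sketch shows the kind of token stream a shingling step produces. It assumes the settings I believe are ShingleFilter's defaults (unigrams kept, bigrams added, single-space separator) and ignores stop-word filtering; the class and method names are hypothetical, and it is not meant as a diagnosis of the highlighting problem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Emits each token followed by the two-word shingle that starts at it.
    static List<String> unigramsAndBigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the single words interleaved with the overlapping word pairs.
        System.out.println(unigramsAndBigrams(Arrays.asList("quick", "brown", "fox")));
    }
}
```

The index therefore holds both single words and overlapping word pairs, which is worth bearing in mind when comparing what a query matched with what the highlighter sees.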
RE: Highlighter and Shingles...
Hi Dawn,

Can you give an example of a partial match?

Steve
Re: Weighted cosine similarity calculation using Lucene
Hi Erick

On Fri, Apr 20, 2012 at 5:14 PM, Erick Erickson erickerick...@gmail.com wrote:

> Maybe I'm missing something here, but why not just boost the terms in the fields at query time?

Yes, I can boost the fields at query time. But I'm using the TermFreqVector to get the term frequencies, then calculating the TFIDF values for the documents, and then calculating the cosine similarity from those TFIDF values. The field.setBoost() function has NO effect on term frequencies. Is there any other way to do the boosting that does affect the term frequencies?

Thanks

--
Regards
Kasun Perera