RE: DisjunctionMaxQuery and scoring

2012-04-20 Thread Uwe Schindler
Hi,
 I think BooleanQuery bq = new BooleanQuery(false); doesn't quite accomplish
 the desired name IN (dick, rich) scoring behavior. This is because
 (name:dick | name:rich) with coord=false would score the 'document' Dick Rich
 higher than Rich because the former has two term matches and the latter only
 one.
 In contrast, I think the desire is that one and only one of the terms in the
 document match those in the BooleanQuery so that Rich would score higher
 than Dick Rich, given document length normalization. It's almost like a
 desire for
   BooleanQuery bq = new BooleanQuery(false);
   bq.set*Maximum*NumberShouldMatch(1);

In that case DisjunctionMaxQuery is the way to go: it counts only the hit
with the highest score and does not add the scores together (coord or no
coord doesn't matter here).
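
A minimal sketch of the difference, in plain Java rather than Lucene, may help here. The class DisMaxSketch and the per-term scores are hypothetical; the only assumptions are that a boolean OR adds the scores of its matching clauses, while a disjunction-max takes the best clause score plus a tie-breaker fraction of the others (a tie-breaker of 0 gives the pure maximum).

```java
import java.util.Arrays;

/** Toy comparison of additive and max-based clause scoring (no Lucene involved). */
final class DisMaxSketch {

    /** Additive combination: every matching clause contributes its score. */
    static double sumScore(double[] clauseScores) {
        return Arrays.stream(clauseScores).sum();
    }

    /** Max-based combination: best clause, plus tieBreaker times the remaining scores. */
    static double disMaxScore(double[] clauseScores, double tieBreaker) {
        double max = Arrays.stream(clauseScores).max().orElse(0.0);
        double sum = Arrays.stream(clauseScores).sum();
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Hypothetical scores for the clauses name:dick and name:rich.
        // A clause that does not match contributes 0; the two-term document
        // gets lower per-term scores because of length normalization.
        double[] rich = {0.0, 1.0};      // document "Rich"
        double[] dickRich = {0.7, 0.7};  // document "Dick Rich"

        System.out.println("sum:    Rich=" + sumScore(rich)
                + "  Dick Rich=" + sumScore(dickRich));
        System.out.println("dismax: Rich=" + disMaxScore(rich, 0.0)
                + "  Dick Rich=" + disMaxScore(dickRich, 0.0));
    }
}
```

With these made-up numbers the additive form ranks Dick Rich first, while the max-based form ranks Rich first, which is the ordering David describes.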


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DisjunctionMaxQuery and scoring

2012-04-20 Thread Robert Muir
On Thu, Apr 19, 2012 at 8:32 PM, David Murgatroyd dmu...@gmail.com wrote:
 In contrast, I think the desire is that one and only one of the terms in the
 document match those in the BooleanQuery so that Rich would score higher
 than Dick Rich, given document length normalization. It's almost like a
 desire for
   BooleanQuery bq = new BooleanQuery(false);
   bq.set*Maximum*NumberShouldMatch(1);


You can, by returning a customized weight with a coord implementation that
PUNISHES documents that match > 1 sub-clause.

Take a look at 
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/queries/src/java/org/apache/lucene/queries/BoostingQuery.java
for some inspiration, especially this part:

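BooleanQuery result = new BooleanQuery() {
  @Override
  public Weight createWeight(IndexSearcher searcher) throws IOException {
    return new BooleanWeight(searcher, false) {
      @Override
      public float coord(int overlap, int max) {
        // your logic here when overlap == 1, > 1, etc.
        // (illustrative placeholder only: keep a single match as-is,
        // scale down documents that match more than one clause)
        return overlap <= 1 ? 1.0f : 1.0f / overlap;
      }
    };
  }
};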
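
To see what such a coord factor does to the ranking, here is a toy model in plain Java. It is not Lucene code: the class CoordSketch, the 1/overlap penalty, and the scores are all assumptions for illustration. The score is the sum of the matching clause scores multiplied by coord(overlap, max).

```java
/** Toy model of a coord factor that penalizes documents matching more than one clause. */
final class CoordSketch {

    /** Hypothetical coord: full credit for exactly one match, 1/overlap for more. */
    static float coord(int overlap, int max) {
        if (overlap <= 0) {
            return 0.0f;
        }
        return overlap == 1 ? 1.0f : 1.0f / overlap;
    }

    /** Sum of the matching clause scores, scaled by the coord factor. */
    static float score(float[] clauseScores) {
        int overlap = 0;
        float sum = 0.0f;
        for (float s : clauseScores) {
            if (s > 0.0f) {   // a clause that does not match scores 0
                overlap++;
                sum += s;
            }
        }
        return sum * coord(overlap, clauseScores.length);
    }

    public static void main(String[] args) {
        float rich = score(new float[] {0.0f, 1.0f});      // only name:rich matches
        float dickRich = score(new float[] {0.7f, 0.7f});  // both clauses match
        System.out.println("Rich=" + rich + "  Dick Rich=" + dickRich);
    }
}
```

A harsher penalty (for example returning 0 when overlap > 1) would come closer to the "one and only one" behavior asked for; the choice is a tuning decision.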

-- 
lucidimagination.com




RE: DisjunctionMaxQuery and scoring

2012-04-20 Thread Uwe Schindler
Hi,

Ah sorry, I misunderstood: you wanted to score the duplicate match lower! To
achieve this, you have to change the coord function in the
similarity/BooleanWeight used for this query.

Either way: if you want a group of terms that yields only one score when at
least one of the terms matches (SQL IN), without adding the scores together,
DisjunctionMaxQuery is fine. I think this is what Benson asked for.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de





Weighted cosine similarity calculation using Lucene

2012-04-20 Thread Kasun Perera
I have documents that are marked up with Taxonomy and Ontology terms
separately.
When I calculate the document similarity, I want to give higher weights to
those Taxonomy terms and Ontology terms.


When I index the document, I have defined the Document content, Taxonomy
and Ontology terms as Fields for each document like this in my program.


Field ontologyTerm = new Field(fiboterms, fiboTermList[curDocNo],
    Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);

Field taxonomyTerm = new Field(taxoterms, taxoTermList[curDocNo],
    Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);

Field document = new Field(docNames[curDocNo], strRdElt,
    Field.TermVector.YES);



I'm using Lucene's TermFreqVector functions to calculate TFIDF values, and
then I calculate the cosine similarity between two documents using those
TFIDF values.


To give weight to Ontology and Taxonomy terms when calculating the cosine
similarity, what I can do is programmatically multiply the Taxonomy and
Ontology term frequencies by a defined weight factor before calculating the
TFIDF scores. Will this give higher weight to Taxonomy and Ontology terms in
the document similarity calculation?


Are there Lucene functions that can be used to give higher weights to
certain fields when calculating TFIDF values using TermFreqVector? Can I
just use the setBoost() function for this purpose, and if so, how?
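
A minimal sketch of the weighting idea described above, in plain Java with no Lucene calls. Everything here is an assumption for illustration: the class WeightedCosineSketch, the "field:term" keys, the default idf of 1.0, and the sample terms. Each field's term frequencies are multiplied by a per-field weight and by idf before being added to the document vector, and the cosine is then taken over those vectors.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy weighted cosine similarity over per-field term frequencies. */
final class WeightedCosineSketch {

    /** Adds fieldWeight * tf * idf for every term of one field to the document vector. */
    static void addField(Map<String, Double> vector, String field,
                         Map<String, Integer> termFreqs,
                         Map<String, Double> idf, double fieldWeight) {
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String key = field + ":" + e.getKey();        // keep fields apart
            double termIdf = idf.getOrDefault(key, 1.0);  // assumption: idf 1.0 if unknown
            vector.merge(key, fieldWeight * e.getValue() * termIdf, Double::sum);
        }
    }

    /** Cosine of the angle between two sparse vectors; 0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Double> idf = new HashMap<>();  // empty: every term gets idf 1.0
        for (double taxoWeight : new double[] {1.0, 3.0}) {
            Map<String, Double> a = new HashMap<>();
            Map<String, Double> b = new HashMap<>();
            addField(a, "content", Map.of("apple", 1), idf, 1.0);
            addField(a, "taxoterms", Map.of("fruit", 1), idf, taxoWeight);
            addField(b, "content", Map.of("banana", 1), idf, 1.0);
            addField(b, "taxoterms", Map.of("fruit", 1), idf, taxoWeight);
            System.out.println("taxonomy weight " + taxoWeight
                    + " -> cosine " + cosine(a, b));
        }
    }
}
```

In this toy setup, two documents that share only a taxonomy term become more similar as the taxonomy weight grows. Note that only the relative weights matter: multiplying every field by the same factor leaves the cosine unchanged.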

-- 
Regards

Kasun Perera


Re: Field value vs TokenStream

2012-04-20 Thread Carsten Schnober
On 18.04.2012 20:06, Uwe Schindler wrote:

Hi,

 You should inform yourself about the difference between stored and
 indexed fields: The tokens in the .tis file are in fact the analyzed
 tokens retrieved from the TokenStream. This is controlled by the Field
 parameter Field.Index. The Field.Store parameter has nothing to do with
 indexing: if a field is marked as stored, the full and unchanged string /
 binary is stored in the stored fields file (.fdt). Stored fields are used

Thanks for that clarification!
Best,
Carsten

-- 
Carsten Schnober
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
http://korap.ids-mannheim.de/ | Tel.: +49-(0)621-1581-238




Re: DisjunctionMaxQuery and scoring

2012-04-20 Thread Benson Margulies
Uwe and Robert,

Thanks. David and I are two peas in one pod here at Basis.

--benson




Re: Weighted cosine similarity calculation using Lucene

2012-04-20 Thread Erick Erickson
Maybe I'm missing something here, but why not just boost the
terms in the fields at query time?

Best
Erick




Highlighter and Shingles...

2012-04-20 Thread Dawn Zoë Raison

Hi,

Are there any notes on making the highlighter work consistently with a 
shingle generated index?
I have a situation where complete matches highlight OK, but partial 
matches do not - leading to a number of blank previews...


Our analyser looks like this:

TokenStream result =
    new StopFilter(Version.LUCENE_36,
        new ShingleFilter(
            new StopFilter(Version.LUCENE_36,
                new LowerCaseFilter(Version.LUCENE_36,
                    new StandardFilter(Version.LUCENE_36,
                        new StandardTokenizer(Version.LUCENE_36, reader)
                    )
                ),
                STOP_CHARS_SET)
        ),
        STOP_WORDS_SET);
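
For readers unfamiliar with shingles, the sketch below imitates, in plain Java and without Lucene, the kind of terms a shingle filter that emits single words plus word pairs adds to an index. The class ShingleSketch and its method are hypothetical, and the sketch ignores stop-word filtering as well as the positions and offsets a highlighter works from; it only shows that multi-word shingles end up as terms of their own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy shingle generator: unigrams plus word n-grams up to maxShingleSize. */
final class ShingleSketch {

    /** For each start position, emits the token and then each longer shingle. */
    static List<String> shingles(List<String> tokens, int maxShingleSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int n = 0; n < maxShingleSize && start + n < tokens.size(); n++) {
                if (n > 0) {
                    sb.append(' ');  // shingle parts joined by a single space
                }
                sb.append(tokens.get(start + n));
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles(Arrays.asList("quick", "brown", "fox"), 2));
    }
}
```

A query term that was indexed only as part of a longer shingle, or a shingle query that matches only part of the text, is the kind of case worth checking when previews come back blank.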

--

Rgds.
*Dawn Raison*



RE: Highlighter and Shingles...

2012-04-20 Thread Steven A Rowe
Hi Dawn,

Can you give an example of a partial match?

Steve




Re: Weighted cosine similarity calculation using Lucene

2012-04-20 Thread Kasun Perera
Hi Erick

On Fri, Apr 20, 2012 at 5:14 PM, Erick Erickson erickerick...@gmail.com wrote:

 Maybe I'm missing something here, but why not just boost the
 terms in the fields at query time?


Yes, I can boost the fields at query time. But I'm using TermFreqVector to
get term frequencies, then calculating the TFIDF values for the documents,
and then calculating the cosine similarity from those TFIDF values.
The Field.setBoost() function has NO effect on term frequencies.
Is there any other way to do the boosting that does affect the
term frequencies?

Thanks







-- 
Regards

Kasun Perera