Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Terry Steichen
Joachim,

I believe you'll have to replace the default Similarity class with one of
your own.  Not sure exactly what the settings should be - maybe some other
list members can give you specifics.  Otherwise, you'll probably have to
experiment with it.

Regards,

Terry

- Original Message -
From: Joachim Schreiber [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, March 23, 2004 10:05 AM
Subject: Similarity - position in Field[] effects scoring - how to change?


 Hello,

 I have run into the following problem. Perhaps somebody can help me.

 I have an index with different IDs in the same field,
 something like

 s
 s45678565
 s87854546

 Situation: I have different documents with the entry s in the same
 index.


 document 1)

 s324235678565
 s324dssd5678565
 s45678324565
 s
 s8785454324326


 document 2)

 s324235678565
 s
 s45678324565
 s8785454324326



 When I search for   s:   I receive both docs, but document 1 gets a
 better score than document 2.
 The position of s in doc 1 is Field[4] and in doc 2 it's Field[2],
 so this seems to affect scoring.

 How can I disable this behaviour so that doc 1 gets the same score as
 doc 2? Which method do I have to override in DefaultSimilarity?
 Does anybody have an idea?

 Thanks

 yo







 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]







Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Julien Nioche
Joachim,

Why don't you use the explain method of IndexSearcher?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html

This is the best way to find out why your documents score differently. I
suspect the lengthNorm method, which is applied at indexing time.

Julien





RE: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber
Thanks to Daniel, the solution is quite simple.

Use the latest CVS source from HEAD and try the new sorting feature; it
works very well ;-)

This should be documented somewhere, perhaps in the wiki!

Cool new feature!

yo






Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Ype Kingma
On Tuesday 23 March 2004 16:05, Joachim Schreiber wrote:
 Hallo,

 I run in following problem. Perhaps somebody can help me.

 I have a index with different ids in the same field
 something like

 s
 s45678565
 s87854546

 Situation: I have different documents with the entry s in the
 same index.


 document 1)

 s324235678565
 s324dssd5678565
 s45678324565
 s
 s8785454324326


 document 2)

 s324235678565
 s
 s45678324565
 s8785454324326



 when I search for   s:   I receive both docs, but document 1 has
 a better scoring than document 2.

Since the s field of document 2 is shorter, I'd expect document 2 to score 
higher. As mentioned, lengthNorm() is responsible for this.
Something does not add up here. Are the documents in the same index?
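
To make the lengthNorm() point concrete, here is a minimal sketch. It assumes the default formula is 1/sqrt(number of terms in the field) and that each id is indexed as exactly one term; both are assumptions about the setup, and the class name is made up for illustration. It is plain Java, not Lucene code:

```java
// Sketch only: mirrors the assumed 1/sqrt(numTerms) formula, not Lucene's own source.
final class LengthNormSketch {

    /** Length normalisation factor for a field that holds numTerms terms. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        float doc1 = lengthNorm(5); // document 1: five ids in the field
        float doc2 = lengthNorm(4); // document 2: four ids in the field
        System.out.println("doc 1 factor = " + doc1);
        System.out.println("doc 2 factor = " + doc2);
        System.out.println("doc 2 favoured: " + (doc2 > doc1));
    }
}
```

With five ids, document 1 gets the smaller factor (about 0.447 against 0.5), so on length alone document 2 would be expected to come out ahead. A Similarity subclass whose lengthNorm() returns a constant should flatten this difference, but since the factor is applied at indexing time the documents would presumably have to be re-indexed.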

 The position of s in doc 1 is Field[4] and in doc 2 it's
 Field[2], so this seems to effect scoring.

Lucene's default scoring is independent of absolute term positions.

 How can I disable this behaviour, so doc 1 has the same scoring as doc 2???

Simply ignore the score. The easiest way is to use the low-level scoring API
with your own HitCollector. Just make sure not to retrieve document field
values until you have collected all your hits.
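
A rough sketch of that idea follows. The abstract class is a hypothetical stand-in for Lucene's HitCollector (assumed here to be a callback of the form collect(int doc, float score)), so that the example compiles without Lucene on the classpath; the class names are invented for illustration:

```java
import java.util.BitSet;

// Hypothetical stand-in for Lucene's HitCollector callback.
abstract class HitCollectorSketch {
    abstract void collect(int doc, float score);
}

// Records which documents matched and throws the score away.
final class DocIdCollector extends HitCollectorSketch {
    private final BitSet hits = new BitSet();

    @Override
    void collect(int doc, float score) {
        hits.set(doc); // score is deliberately ignored
    }

    BitSet hits() {
        return hits;
    }

    public static void main(String[] args) {
        DocIdCollector collector = new DocIdCollector();
        // In real use the searcher would invoke collect() once per matching document.
        collector.collect(7, 0.9f);
        collector.collect(2, 0.1f);
        collector.collect(5, 0.4f);

        // Walk the collected ids in ascending (index) order; stored fields
        // would only be fetched here, after collection has finished.
        BitSet hits = collector.hits();
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            System.out.println("hit doc id " + doc);
        }
    }
}
```

Keeping only a bit per document stays cheap even for result sets of several hundred thousand hits, because no stored field is touched during collection.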

 Which method do I have to overwrite in DefaultSimilarity.
 Has anybody any idea, any help.

In which order do you want the resulting documents presented?
The low-level API gives them in index order when the query consists
of a single search term, afaik.

Regards,
Ype





Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber

 Why don't you use the method explain of IndexSearcher?

 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html

 This is the best way to find why your documents are different. I suspect
 the lengthNorm method, which is used at indexation time.

Yes, but I don't think this is a good choice, because we would have to
retrieve all the docs. That is not possible, because I have searches that
return 300,000 hits and more.


yo





Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber
Terry,


 I believe you'll have to replace the default Similarity class with one of
 your own.  Not sure exactly what the settings should be - maybe some other
 list members can give you specifics.  Otherwise, you'll probably have to
 experiment with it.

I tried the new sort feature from CVS and it works well!

But it's interesting: it seems to me that nobody knows exactly how scoring
works ;-)

thanks

yo






Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Joachim Schreiber
 On Tuesday 23 March 2004 16:05, Joachim Schreiber wrote:
  Hallo,
 
  I run in following problem. Perhaps somebody can help me.
 
  I have a index with different ids in the same field
  something like
 
  s
  s45678565
  s87854546
 
  Situation: I have different documents with the entry s in the
  same index.
 
 
  document 1)
 
  s324235678565
  s324dssd5678565
  s45678324565
  s
  s8785454324326
 
 
  document 2)
 
  s324235678565
  s
  s45678324565
  s8785454324326
 
 
 
  when I search for   s:   I receive both docs, but document 1 has
  a better scoring than document 2.

 Since the s field of document 2 is shorter, I'd expect document 2 to score
 higher. As mentioned, lengthNorm() is responsible for this.
 Something does not add up here. Are the documents in the same index?

  The position of s in doc 1 is Field[4] and in doc 2 it's
  Field[2], so this seems to effect scoring.

 Lucene's default scoring is independent of absolute term positions.


hm...

  How can I disable this behaviour, so doc 1 has the same scoring as doc 2???

 Simply ignore the score. The easiest way is to use the low level scoring API
 with your own HitCollector. Just make sure not to retrieve document field
 values until you collected all your hits.

Do you think it's possible to order by, e.g., a date field without
retrieving all the values from the index?


  Which method do I have to overwrite in DefaultSimilarity.
  Has anybody any idea, any help.

 In which order to you want the resulting documents presented?
 The low level api gives them in index order when the query consists
 of single search term, afaik.

Index order is OK, but not very flexible.

Regards,
yo


 Regards,
 Ype





Re: Similarity - position in Field[] effects scoring - how to change?

2004-03-23 Thread Ype Kingma
Joachim,

...

 you think its possible to order by e.g. date field without retrieving all
 the values from the index??

Yes, the new sorting feature from CVS does that, see Doug's
last note on the subject. (It might have been on lucene-dev,
I didn't keep a copy).
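
For the curious, the principle can be sketched in plain Java. This is not the CVS code; it only assumes, as I understand the feature, that one value per document is held in an in-memory array indexed by document id, so that hits can be ordered without loading stored fields. All names below are hypothetical:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch only: orders hit doc ids by a per-document value cached in memory.
final class FieldSortSketch {

    /** Returns the hit doc ids ordered by fieldValueByDoc[docId], ascending. */
    static Integer[] sortByField(int[] hitDocs, String[] fieldValueByDoc) {
        Integer[] sorted = new Integer[hitDocs.length];
        for (int i = 0; i < hitDocs.length; i++) {
            sorted[i] = hitDocs[i];
        }
        Arrays.sort(sorted, Comparator.comparing((Integer doc) -> fieldValueByDoc[doc]));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical date values in yyyyMMdd form, one per document id 0..3.
        String[] dateByDoc = {"20040323", "20040101", "20040215", "20031231"};
        int[] hits = {0, 2, 3}; // doc ids as a collector would deliver them
        System.out.println(Arrays.toString(sortByField(hits, dateByDoc)));
    }
}
```

Dates written as yyyyMMdd strings sort chronologically under plain string comparison, which is why the sketch gets away with a String array.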

Have fun,
Ype

