Re: Can I do boosting based on term postions?

Paul Elschot Fri, 03 Aug 2007 12:12:09 -0700

On Friday 03 August 2007 20:35, Shailendra Sharma wrote:
> Paul,
> 
> If I understand Cedric right, he wants to have different boosting depending
> on search term positions in the document. By using SpanFirstQuery he will
> only be able to consider terms up to a particular position, but he won't be
> able to do something like the following:
>   a) Give 100% boosting to matches in the first 100 words.
>   b) Give 80% boosting to matches in the next 100 words.
>   c) Give 60% boosting to matches in the next 100 words.

> It can be done by writing a DisjunctionMaxQuery containing multiple
> SpanFirstQuery clauses with different boosts, but I see that as a workaround
> only, not a direct and efficient solution.

You're right, but SpanFirstQuery needs only a minor modification
for this to work.

This modification to SpanFirstQuery would be that the Spans
returned by SpanFirstQuery.getSpans() must always return 0
from its start() method. Then the slop passed to sloppyFreq(slop)
would be the distance from the beginning of the indexed field
to the end of the Spans of the SpanQuery passed to SpanFirstQuery.

Then the following should work:

Term firstTerm = .... ;

SpanFirstQuery sfq = new SpanFirstQuery(
    new SpanTermQuery(firstTerm),
    Integer.MAX_VALUE) {
  ...
  public Similarity getSimilarity(Searcher searcher) {
    return new Similarity() {
      ...
      // With the modification above, slop is the distance from the
      // start of the field to the end of the match.
      public float sloppyFreq(int slop) {
        return (slop < 100) ? 1.0f
             : (slop < 200) ? 0.8f
             : (slop < 300) ? 0.6f
             : 0.4f; // etc. etc.
      }
    };
  }
};
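
To check the tier boundaries on their own, here is a rough standalone
sketch in plain Java with no Lucene classes involved. The names
PositionTiers and tierWeight are made up for illustration, and it assumes
that slop is the distance from the start of the field to the end of the
match, which is only true once SpanFirstQuery is modified as described above.

```java
public class PositionTiers {

    // Hypothetical helper, not part of Lucene: maps the distance from the
    // start of the field to the end of the match onto a tiered weight,
    // mirroring the body of sloppyFreq() in the snippet above.
    static float tierWeight(int slop) {
        return (slop < 100) ? 1.0f
             : (slop < 200) ? 0.8f
             : (slop < 300) ? 0.6f
             : 0.4f;
    }

    public static void main(String[] args) {
        // A few sample end positions, one per tier.
        int[] positions = {5, 150, 250, 1000};
        for (int p : positions) {
            System.out.println(p + " -> " + tierWeight(p));
        }
    }
}
```

Matches past position 300 still get a non-zero weight here, so documents
with late matches are scored lower but not discarded.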


Actually, I'm a bit surprised that SpanFirstQuery does not work that
way now.

Regards,
Paul Elschot


> 
> Cedric,
> 
> I am sending the implementation of SpanTermQuery to your gmail account
> (the lucene mailing list is bouncing email with attachments). I have named
> the class VSpanTermQuery (I have followed the same package hierarchy as
> lucene). You also need to extend the VSimilarity class, which requires
> implementing the method scoreSpan(..).
> 
> Let me know how it goes. I have done some testing, but I need to test it
> extensively before submitting it to contrib.
> 
> Thanks,
> Shailendra
> 
> On 8/3/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> >
> > Cedric,
> >
> > You can choose the end limit for SpanFirstQuery yourself.
> >
> > Regards,
> > Paul Elschot
> >
> >
> > On Friday 03 August 2007 05:38, Cedric Ho wrote:
> > > Hi Paul,
> > >
> > > Doesn't SpanFirstQuery only match terms with a position less than a
> > > certain end position?
> > >
> > > I am rather looking for a query that would score a document higher for
> > > terms that appear near the start, but not totally discard those with
> > > terms that appear near the end.
> > >
> > > Regards,
> > > Cedric
> > >
> > > On 8/2/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > > Cedric,
> > > >
> > > > SpanFirstQuery could be a solution without payloads.
> > > > You may want to give it your own Similarity.sloppyFreq() .
> > > >
> > > > Regards,
> > > > Paul Elschot
> > > >
> > > > On Thursday 02 August 2007 04:07, Cedric Ho wrote:
> > > > > Thanks for the quick response =)
> > > > >
> > > > > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote:
> > > > > > > Yes, it is easily doable through the "Payload" facility. During the
> > > > > > > indexing process (mainly tokenization), you need to push this extra
> > > > > > > information into each token. You can then use BoostingTermQuery to
> > > > > > > include the Payload value in the score. You also need to implement
> > > > > > > Similarity for this (mainly the scorePayload method).
> > > > >
> > > > > > If I store, say, a custom boost factor as a Payload, does it mean
> > > > > > that I will store one more byte per term per document in the index
> > > > > > file? So the index file would be much larger?
> > > > >
> > > > > >
> > > > > > > The other way is to extend SpanTermQuery, which already calculates
> > > > > > > the position of the match. You just need to use this position value
> > > > > > > in the score calculation.
> > > > >
> > > > > > I see that SpanTermQuery takes a TermPositions from the indexReader
> > > > > > and I can get the term position from there. However, I am not sure
> > > > > > how to incorporate it into the score calculation. Would you mind
> > > > > > giving a little more detail on this?
> > > > >
> > > > > >
> > > > > > > One possible advantage of the SpanTermQuery approach is that you can
> > > > > > > play around without re-creating indices every time.
> > > > > >
> > > > > > Thanks,
> > > > > > Shailendra Sharma,
> > > > > > CTO, Ver se' Innovation Pvt. Ltd.
> > > > > > Bangalore, India
> > > > > >
> > > > > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > > I was wondering if it is possible to do boosting by search terms'
> > > > > > > > position in the document.
> > > > > > > >
> > > > > > > > For example:
> > > > > > > > search terms that appear in the first 100 words, or the first 10%
> > > > > > > > of words, or in the first two paragraphs would be given a higher
> > > > > > > > score.
> > > > > > > >
> > > > > > > > Is it achievable using the new Payload function in Lucene 2.2?
> > > > > > > > Or are there any easier ways to achieve this?
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Cedric
> > > > > > >
> > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > > > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > > Thanks,
> > > > > Cedric
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
