Re: Boosting Vs Sorting

Erick Erickson Mon, 24 Dec 2007 07:10:42 -0800

I don't think there's any real formula for the "correct" boosts, it's more a
matter of
experimentation.


I'd look at some of the raw scores that come back (not the normalized ones)
and
go from there. See HitCollector (?).

Best
Erick

On Dec 24, 2007 8:29 AM, Rakesh Shete <[EMAIL PROTECTED]> wrote:

>
> Hi Eric,
>
> Thanks for the help. Most of it makes sense to me and I amgoing ahead with
> query boosting during search instead of index time boosting.
>
> Could just tell me how do we arrive at the correct boost values for a
> query with multiple fields? My understanding is that this will vary as per
> the data that is being searched.
>
> My query after boosting would look something like this:
>
> +(i_title:sorting*^6.0 i_description:sorting*^3.5 i_detailedInfo:sorting*^
> 1.5 i_tags:sorting*^1.2) -i_published:false +i_topicsClasses.id:1*
>
> The reason for using high values is that I want to make sure that multiple
> occurrences of the search string in say field 'i_description' should not be
> treated as more relevant than a single match in the 'i_title' field.
>
> Do you have some guideline using which I can arrive at the correct boost
> value in the above scenario?
>
> --Rakesh S
>
> > Date: Fri, 21 Dec 2007 17:15:29 -0500
> > From: [EMAIL PROTECTED]
> > To: [email protected]
> > Subject: Re: Boosting Vs Sorting
> >
> > See below...
> >
> > On Dec 21, 2007 12:50 PM, Rakesh Shete <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > Hi Eric,
> > >
> > > >> I don't see how sorting relates to your problem at all....
> > >
> > > Could you just explain how is sorting different from boosting?
> > >
> > > I have been trying to figure this out. Going through "Lucene In
> Action" my
> > > understanding of  sorting is that it will kind of second level of
> ordering
> > > after the query results have been scored (Not sure if the relevance
> > > established
> > > by scoring is lost in this process).
> > >
> >
> > I often think of sorting as being orthogonal to boosting. They're really
> > unrelated.
> > Boosting changes how the scoring of documents work. Sorting ignores
> scoring
> > and arranges the results lexically. You can *only* sort on fields that
> are a
> > single
> > token. I'm cheating a little here and you can implement your own
> > sorts, but that's another story.
> >
> > Maybe this would help. Say you were indexing books and wanted the
> results
> > presented to the user by title. You could index a "titlesort" field that
> had
> > the
> > title lowercased and all spaces replaced with underscores. Then, you
> could
> > sort the result of all books containing "solar energy" by title. Where's
> the
> > score
> > here? The only relevance score has here is that no book in the result
> set
> > will
> > have a score of 0.
> >
> > I did, at one point, have to sort by score then sub-sort by title. That
> is,
> > present the user with the top scoring documents sub-sorted by title.
> > This involved using relevancy as the primary sort and sub-sorting by
> > title. But the problem here is that scores of 0.98374 wouldn't be in the
> > same bucket as a score of 0.98375. Search the mail archive for
> > "bucket" and you should see that discussion.
> >
> >
> >
> > >
> > >
> > > >> Is it *really* better for your users to see a low-relevance query
> > > >> that happens to have the exact words in it before a very-high
> > > >> ranking but not quite exact response?
> > >
> > > Nopes. Thats the last thing my product manager will want.
> > >
> > > Lets take an example to simplify this:
> > >
> > > I have fields like title, description, tags. Now when I search for a
> term
> > > "Indoor Photography" then I would like the results with exact match in
> > > title to be
> > > more important than in description or tags. However, if there is an
> exact
> > > match in description
> > > then it should be given more preference than the partial match in
> title.
> > >
> > > Going by the points mentioned below and as per one of your posts
> > > (
> > >
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200609.mbox/[EMAIL 
> PROTECTED]
> > > )
> > > I understand that I need to specify query time boosting like this:
> > >
> > > title:Indoor Photography^2.5 description:Indoor Photography^1.5 tags:
> > > Indoor Photography^1.2
> > >
> >
> > That would go some distance towards what you want, but watch the syntax.
> > You might be better off constructing your own BooleanQuery. The syntax
> above
> > would actually parse something like title:Indoor
> default_field:Photography^
> > 2.5. You
> > need parentheses. Also think about phrase queries....
> >
> > Hope this helps
> > Erick
> >
> >
> > >
> > > Let me know if this would help my cause.
> > >
> > > Thnx for ur time n the valuable info.
> > >
> > > --Rakesh S
> > >
> > >
> > >
> > >
> > >
> > > > Date: Fri, 21 Dec 2007 09:53:02 -0500
> > > > From: [EMAIL PROTECTED]
> > > > To: [email protected]
> > > > Subject: Re: Boosting Vs Sorting
> > > >
> > > > OK, I'm trying to adjust to a Mac and my keyboard shortcuts
> sometimes
> > > > lead me to send the mail when I didn't intend. Sorry about that...
> > > >
> > > > So, leaving aside how you form your "similar" query, I *think* you
> > > > want to form two clauses, your "exact" and your "similar" and
> > > > boost them individually, combined in a boolean query.
> > > >
> > > > This will still interleave the results I think. But it's also a
> valid
> > > > question whether this is good or bad. Is it *really* better for your
> > > > users to see a low-relevance query that happens to have the exact
> > > > words in it before a very-high ranking but not quite exact response?
> > > > That, of course it up to your product manager....
> > > >
> > > > If it is really a requirement, it seems to me that you would be able
> to
> > > > just form the two queries independently, then just post-process
> them.
> > > > One query is the exact version, and the second query is the similar
> one.
> > > > Then just combine the results as you please by iterating the hits
> > > > object for the exact query then following it by the same for the
> > > similar.
> > > >
> > > > I don't see how sorting relates to your problem at all....
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On Dec 21, 2007 9:46 AM, Erick Erickson <[EMAIL PROTECTED]>
> wrote:
> > > >
> > > > > From my perspective, index-time boosting and sorting are apples
> > > > > and oranges.
> > > > >
> > > > > According to a post from Hoss, index-time boosting is a way of
> > > > > saying that "Field x in this document is more important than
> > > > > field x in other documents". Query-time boosts are a way of
> > > > > saying "I care about field X more than field Y across *all*
> > > > > documents".
> > > > >
> > > > > So index time boosting doesn't seem to relate to your problem
> since
> > > > > you really want to compare field x across all documents. It seems
> > > > > that query-time boosting is more relevant.
> > > > >
> > > > > So, leaving aside how you form your "similar" q
> > > > >
> > > > >
> > > > > On Dec 20, 2007 10:50 PM, Rakesh Shete < [EMAIL PROTECTED]>
> > > wrote:
> > > > >
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am using Hibernate Search (http://www.hibernate.org/410.html)
> > > which is
> > > > > > a wrapper around Lucene for performing search over info stored
> in
> > > the DB. I
> > > > > > have questions related to Lucene boosting Vs sorting:
> > > > > >
> > > > > > Is index time boosting of documents and fields better than
> > > specifying
> > > > > > sorting parameters at search time?
> > > > > >
> > > > > > I have been browsing through the Lucene mail archives for an
> answer
> > > to
> > > > > > this. Going through them and reading on stuff related to Lucene
> > > scoring, my
> > > > > > understanding is that if I know upfront at index time that the
> > > relevance
> > > > > > order of results is based on certain fields, then, it is better
> to
> > > have
> > > > > > index time boosting of documents and fields. Am I right here?
> > > > > >
> > > > > > My requirements are like:
> > > > > > Results having an exact match to the input query string should
> have
> > > > > > highest preference followed by an exact match with field1,
> field2,
> > > field3
> > > > > > and then followed by search query substring (or near match)
> match
> > > with
> > > > > > field1, field2, field3.
> > > > > >
> > > > > > Any suggestions are most welcome.
> > > > > >
> > > > > > --Rakesh S
> > > > > >
> > > > > >
> _________________________________________________________________
> > > > > > Post free property ads on Yello Classifieds now! www.yello.in
> > > > > > http://ss1.richmedia.in/recurl.asp?pid=219
> > > > >
> > > > >
> > > > >
> > >
> > > _________________________________________________________________
> > > Post free property ads on Yello Classifieds now! www.yello.in
> > > http://ss1.richmedia.in/recurl.asp?pid=219
> > >
>
> _________________________________________________________________
> Post free property ads on Yello Classifieds now! www.yello.in
> http://ss1.richmedia.in/recurl.asp?pid=219
>

Re: Boosting Vs Sorting

Reply via email to