Can Lucene tell which field matched?

2008-11-06 Thread Dora

Hi 

I am new to Lucene and am working on a search module for some XML data.

I need to provide a "search all" feature that looks in all XML fields.
Apparently Lucene (2.4.0) does not provide such a "search all" facility, so
I have to build a query that pairs my search term with every available XML
element.

Assume that I am searching an address book (a fictitious example for
illustration) made up of contacts (my Lucene documents), each containing
several fields such as name, address, city, ...
 
So my search for "paul" inside my addressbook will look like:
name:paul OR address:paul OR city:paul and so on... 
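
A minimal plain-Java sketch of how such a query string could be assembled from a list of field names. The class and method names are hypothetical, the term is neither escaped nor analyzed, and the resulting string would still have to be handed to a query parser:

```java
import java.util.List;

final class SearchAllQuerySketch {
    /** Joins "field:term" clauses with OR, one clause per field name. */
    static String buildQuery(List<String> fields, String term) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            // Assumption: the term contains no characters that need escaping.
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }
}
```

With the fields name, address and city and the term paul, this yields the query shown above.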

Lucene will then tell me which contacts match my query, but is there a way
to know which field(s) matched the request?
The goal is to display the XML with the matching fields highlighted.

I did not find anything like this in Lucene, so it seems that the only way
is to perform an additional search field by field...

So if I have 100 fields per document (I told you my address book was a
fictitious example; the XML I am working on is a little more complex) and
get 100 results that I want to display in a list, this means that I would
need to perform 10,000 additional search requests...

Please tell me that there is a better way to do the job...
-- 
View this message in context: 
http://www.nabble.com/Can-Lucene-tells-which-field-matched---tp20357552p20357552.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



possible score value

2008-11-06 Thread Francisco Borges
Hello,

I have been going through the scoring documentation and code.

I expected Lucene to enforce score values in the range [0, 1].
But from what I can grasp from the code and docs, score values can be
greater than one.

Does Lucene consider score values greater than 1 valid?

Kind regards,
-- 
Francisco


Re: possible score value

2008-11-06 Thread Anshum
Hi Francisco,

Did you come across
  scoreNorm = 1.0f / topDocs.getMaxScore();
or something of this sort in Hits?
As far as I know, the raw score can be more than 1, but the scores are
finally divided by the maxScore of the matched doc set, i.e. this sets an
upper limit of 1 (for the top-scoring doc).
Hope this clarifies things! :)
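
The normalization described above can be sketched in plain Java. This is a model rather than the Hits source: the class and method names are hypothetical, and it assumes (from memory of the 2.4 Hits code) that the scale factor is only applied when the top raw score exceeds 1, so scores that are already at or below 1 are left as they are.

```java
final class ScoreNormSketch {
    /** Returns a copy of rawScores scaled so that no value exceeds 1. */
    static float[] normalize(float[] rawScores) {
        float maxScore = 0.0f;
        for (float s : rawScores) {
            maxScore = Math.max(maxScore, s);
        }
        // Assumption: only scale down when the best raw score is above 1.
        float scoreNorm = 1.0f;
        if (rawScores.length > 0 && maxScore > 1.0f) {
            scoreNorm = 1.0f / maxScore;
        }
        float[] normalized = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] * scoreNorm;
        }
        return normalized;
    }
}
```

Raw scores returned by other search methods (for example in TopDocs) would not pass through this step, which is consistent with seeing values above 1 in the Similarity and Weight calculations.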

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw




Re: possible score value

2008-11-06 Thread Francisco Borges
Hello Anshum,

No, I hadn't seen that. I had only gone through the Similarity and Weight
classes and worked through their calculations.

Thank you very much for the clarification!

Kind regards,
Francisco



Re: Can Lucene tell which field matched?

2008-11-06 Thread Stefan Trcek
On Thursday 06 November 2008 10:18:45 Dora wrote:

> Lucene will then tell me which contacts match my query, but is there
> a way to know which field(s) matched the request ?
> The goal is to display the XML with the matching fields highlighted.

I think org.apache.lucene.search.highlight.Highlighter will do the job.

Stefan




Re: BoostingTermQuery scoring

2008-11-06 Thread Grant Ingersoll
Not sure, but it sounds like you are interested in a higher level  
Query, kind of like the BooleanQuery, but then part of it sounds like  
it is per document, right?  Is it that you want to deal with multiple  
payloads in a document, or multiple BTQs in a bigger query?

On Nov 4, 2008, at 9:42 AM, Peter Keegan wrote:


> I'm using BoostingTermQuery to boost the score of documents with terms
> containing payloads (boost value > 1). I'd like to change the scoring
> behavior such that if a query contains multiple BoostingTermQuery terms
> (either required or optional), documents containing more matching terms
> with payloads always score higher than documents with fewer terms with
> payloads. Currently, if one of the terms has a high IDF weight and contains
> a boosting payload but no payloads on other matching terms, it may score
> higher than docs with other matching terms with payloads and lower IDF.
>
> I think what I need is a way to increase the weight of a matching term in
> BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to do
> this. Any suggestions?
>
> Thanks,
> Peter


--
Grant Ingersoll


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ













RE: Can Lucene tell which field matched?

2008-11-06 Thread Daan de Wit
Hi,

I have implemented such a solution using the query explanation.
IndexSearcher has an explain(Query query, int document) method that
returns an Explanation object; you can ask the Explanation whether
it is a match with #isMatch(). You still need to repeat this for each
found document, though.
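
A rough plain-Java model of the per-field check. Nothing here is Lucene API: fieldMatches is a hypothetical stand-in for one explain() call on a single-field query followed by isMatch(), and a document is modelled as a map from field name to text. The point is the shape of the loop, which runs one check per field for every hit:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class FieldMatchSketch {
    /** Hypothetical stand-in for explain(singleFieldQuery, doc).isMatch(). */
    static boolean fieldMatches(Map<String, String> doc, String field, String term) {
        String value = doc.get(field);
        if (value == null) {
            return false;
        }
        // Crude tokenization: lowercase, then split on non-word characters.
        for (String token : value.toLowerCase().split("\\W+")) {
            if (token.equals(term)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the names of the fields of one hit that contain the term. */
    static List<String> matchingFields(Map<String, String> doc, String term) {
        List<String> matched = new ArrayList<String>();
        for (String field : doc.keySet()) {
            if (fieldMatches(doc, field, term)) {
                matched.add(field);
            }
        }
        return matched;
    }
}
```

The cost is still proportional to hits times fields, but the checks only run for the documents that are actually displayed.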

Daan




RE: Can Lucene tell which field matched?

2008-11-06 Thread Ulrich Vachon
Hi Daan,

Could you share an example of your implementation?

Thanks,
Ulrich VACHON 




What does Sort.RELEVANCE do?

2008-11-06 Thread Teruhiko Kurosaka
I can pass Sort.RELEVANCE to Searcher.search, as in:

hits = searcher.search(q, Sort.RELEVANCE); // deprecated method, used for brevity

What is the real effect of specifying the Sort argument like this?

Does Sort.RELEVANCE sort the hits in order of the score
described in Sect. 3.3, "Understanding Lucene scoring",
of Lucene in Action? If I use the search method without
a sort argument, is that equivalent to specifying
Sort.INDEXORDER?


T. "Kuro" Kurosaka, Basis Technology
San Francisco, California, U.S.A.
 




"Global" Field question (thread-safe)?

2008-11-06 Thread Glen Newton
I have a use case where I want all of my documents to have, in
addition to their other fields, a single field=value.
An example use is where I have multiple Lucene indexes that I search
in parallel, but still need to distinguish them.
Index 1: All documents have: source="a1"
Index 2: All documents have: source="a2"

This is a common use case that has previously been discussed on this list.

The particular question I have is: when I am indexing, can I create a
single Field and use it for all Documents? Note that I am in a
multithreaded environment, so many Documents are created and will have
this same Field added to them, and are subsequently indexed.

So are there any threading issues with this particular usage?

thanks,

Glen




Re: BoostingTermQuery scoring

2008-11-06 Thread Peter Keegan
Let me give some background on the problem behind my question.

Our index contains many fields (title, body, date, city, etc). Most queries
search all fields, but for best performance, we create an additional
'contents' field that contains all terms from all fields so that only one
field needs to be searched. Some fields, like title and city, are boosted by
a factor of 5. In order to make term boosting work, we create an additional
field 'boost' that contains all the terms from the boosted fields (title,
city).

Then, at search time, a query for "petroleum engineer" gets rewritten to:
(+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer).
Note that the two clauses are OR'd so that a term that exists in both fields
will get a higher weight in the 'boost' field. This works quite well at
boosting documents with terms that exist in the boosted fields. However, it
doesn't work properly if excluded terms are added, for example:

(+contents:petroleum +contents:engineer -contents:drilling)
(+boost:petroleum +boost:engineer -boost:drilling)

If a document contains the term 'drilling' in the 'body' field, but not in
the 'title' or 'city' field, a false hit occurs.
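
The false hit can be reproduced with a small plain-Java model of the two OR'd clauses. The class and method names are hypothetical and each field is reduced to a set of terms; this sketches the boolean logic only, not Lucene's scoring:

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class FalseHitSketch {
    /** True if every required term is present and no excluded term is present. */
    static boolean clauseMatches(Set<String> fieldTerms, List<String> required,
                                 List<String> excluded) {
        return fieldTerms.containsAll(required)
                && Collections.disjoint(fieldTerms, excluded);
    }

    /** The two clauses are optional (OR'd): either one matching makes a hit. */
    static boolean documentMatches(Set<String> contents, Set<String> boost,
                                   List<String> required, List<String> excluded) {
        return clauseMatches(contents, required, excluded)
                || clauseMatches(boost, required, excluded);
    }
}
```

For a document whose 'contents' terms are petroleum, engineer and drilling but whose 'boost' terms are only petroleum and engineer, the 'contents' clause rejects the document while the 'boost' clause accepts it, so the document is returned even though it contains the excluded term.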

Enter payloads and 'BoostingTermQuery'. At indexing time, as terms are added
to the 'contents' field, they are assigned a payload (value=5) if the term
also exists in one of the boosted fields. The 'scorePayload' method in our
Similarity class returns the payload value as a score. The query no longer
contains the 'boost' fields and is simply:

+contents:petroleum +contents:engineer -contents:drilling

The goal is to make the payload technique behave like the 'boost'
field technique. The problem is that relevance scores of the top hits are
sometimes quite different. The reason is that the IDF value for a given
term in the 'boost' field is often much higher than for the same term in the
'contents' field. This makes sense because the 'boost' field contains a
fairly small subset of the 'contents' field. Even with a payload of '5', a
low IDF in the 'contents' field usually erases the effect of the payload.
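
To get a feel for the size of the effect, here is a small sketch that assumes the idf formula used by DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)). The class name and the document counts in the usage note are made up for illustration:

```java
final class IdfSketch {
    /** idf as assumed for DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```

With a hypothetical index of 1,000,000 documents, a term that occurs in 100,000 documents in 'contents' but in only 1,000 documents in 'boost' has an idf of roughly 3.3 in 'contents' and roughly 7.9 in 'boost'. Since idf enters the classic score formula squared, that gap is a factor of about 5.7, which is more than a payload of 5 can make up.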

I have found a fairly simple (albeit inelegant) solution that seems to work.
The 'boost' field is still created as before, but it is only used to compute
IDF values for the weight class 'BoostingTermQuery.BoostingTermWeight'. I had
to make this class 'public' so that I could override the IDF value as
follows:

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.payloads.BoostingTermQuery;

public class MNSBoostingTermQuery extends BoostingTermQuery {
  public MNSBoostingTermQuery(Term term) {
    super(term);
  }

  protected class MNSBoostingTermWeight extends BoostingTermQuery.BoostingTermWeight {
    public MNSBoostingTermWeight(BoostingTermQuery query, Searcher searcher)
        throws IOException {
      super(query, searcher);
      // Recompute IDF based on the 'boost' field: map each query term to the
      // same text in 'boost' and ask the Similarity for the IDF of those terms.
      HashSet newTerms = new HashSet();
      Iterator i = terms.iterator();
      while (i.hasNext()) {
        Term term = (Term) i.next();
        newTerms.add(new Term("boost", term.text()));
      }
      this.idf = this.query.getSimilarity(searcher).idf(newTerms, searcher);
    }
  }
}

Any thoughts about a better implementation are welcome.

Peter






Re: "Global" Field question (thread-safe)?

2008-11-06 Thread Michael McCandless


The field never changes across all docs?  If so, this will work fine.

Mike




Re: What does Sort.RELEVANCE do?

2008-11-06 Thread Michael McCandless


Section 5.1.2 of LIA also explains this.

Sort.RELEVANCE sorts by relevance score, descending, breaking ties by
doc ID, ascending, and is the default if you don't specify a sort order.

Sort.INDEXORDER sorts only by doc ID, which is not the default sort.
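
The two orderings can be modelled with plain Java comparators. This is a sketch of the ordering rules as described above, not of how Lucene implements them; the class, the ScoredDoc type and the method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

final class RelevanceOrderSketch {
    static final class ScoredDoc {
        final int doc;
        final float score;

        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Score descending; ties broken by doc ID ascending. */
    static List<ScoredDoc> byRelevance(List<ScoredDoc> hits) {
        List<ScoredDoc> sorted = new ArrayList<ScoredDoc>(hits);
        sorted.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        return sorted;
    }

    /** Doc ID ascending only; the score is ignored. */
    static List<ScoredDoc> byIndexOrder(List<ScoredDoc> hits) {
        List<ScoredDoc> sorted = new ArrayList<ScoredDoc>(hits);
        sorted.sort((a, b) -> Integer.compare(a.doc, b.doc));
        return sorted;
    }
}
```

Two hits with the same score therefore come back in index order under the relevance sort, while the index-order sort can place a low-scoring document first.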

Mike




Re: "Global" Field question (thread-safe)?

2008-11-06 Thread Glen Newton
Thanks!  :-)




Re: BoostingTermQuery scoring

2008-11-06 Thread Peter Keegan
I've discovered another flaw in using this technique:

(+contents:petroleum +contents:engineer +contents:refinery)
(+boost:petroleum +boost:engineer +boost:refinery)

It's possible that the first clause will produce a matching doc and none of
the terms in the second clause are used to score that doc. Yet another
reason to use BoostingTermQuery.

Peter



RE: BoostingTermQuery scoring

2008-11-06 Thread Steven A Rowe
Hi Peter,

On 11/06/2008 at 4:25 PM, Peter Keegan wrote:
> I've discovered another flaw in using this technique:
> 
> (+contents:petroleum +contents:engineer +contents:refinery)
> (+boost:petroleum +boost:engineer +boost:refinery)
> 
> It's possible that the first clause will produce a matching
> doc and none of the terms in the second clause are used to
> score that doc. Yet another reason to use BoostingTermQuery.

I think you could address this, without BTQ, using something like:

  boost:(+petroleum +engineer +refinery)
  (+contents:(+petroleum +engineer +refinery)
   +((*:* -boost:petroleum)
     (*:* -boost:engineer)
     (*:* -boost:refinery)))

The last three lines give you the set of documents that are missing at least
one of the terms in the "boost" field. The *:* thingy, indicating a
MatchAllDocsQuery, is necessary to get all documents that don't have a given
term; Lucene's (sub-)query document exclusion operation needs a non-empty set
on which to operate.
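
The set logic behind those three clauses can be sketched in plain Java, with documents reduced to integer IDs and each term to the set of documents that contain it in the "boost" field. The class and method names are hypothetical, and this models only which documents are selected, not how Lucene evaluates the query:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class MissingTermSketch {
    /** Documents that lack at least one of the terms. */
    static Set<Integer> missingAtLeastOne(Set<Integer> allDocs,
                                          List<Set<Integer>> docsPerTerm) {
        Set<Integer> result = new TreeSet<Integer>();
        for (Set<Integer> docsWithTerm : docsPerTerm) {
            // "*:*" supplies the starting set of all documents ...
            Set<Integer> complement = new TreeSet<Integer>(allDocs);
            // ... from which "-boost:term" removes the documents that have the term.
            complement.removeAll(docsWithTerm);
            // The three sub-clauses are optional, so their results are OR'd.
            result.addAll(complement);
        }
        return result;
    }
}
```

Without the starting set there would be nothing to subtract from, which is why a purely negative clause on its own selects no documents.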

On 11/06/2008 at 1:08 PM, Peter Keegan wrote:
> Then, at search time, a query for "petroleum engineer" gets rewritten
> to: (+contents:petroleum +contents:engineer) (+boost:petroleum
> +boost:engineer). Note that the two clauses are OR'd so that a term that
> exists in both fields will get a higher weight in the 'boost' field.
> This works quite well at boosting documents with terms that exist in the
> boosted fields. However, it doesn't work properly if excluded terms are
> added, for example:
> 
> (+contents:petroleum +contents:engineer -contents:drilling)
> (+boost:petroleum +boost:engineer -boost:drilling)
> 
> If a document contains the term 'drilling' in the 'body'
> field, but not in the 'title' or 'city' field, a false hit occurs.

I think you could address this problem like this:

  +(boost:(+petroleum +engineer)
    (+contents:(+petroleum +engineer)
     +((*:* -boost:petroleum)
       (*:* -boost:engineer))))
  -contents:drilling

You don't have to include "-boost:drilling", because this condition is entailed
by "-contents:drilling".

Steve




Boosting results

2008-11-06 Thread Scott Smith
I'm interested in comments on the following problem.

I have a set of documents that fall into 3 categories. Call these
categories A, B, and C. Each document has an indexed, non-tokenized
field called "category" which contains A, B, or C (the categories are
mutually exclusive).

All of the documents contain a field called "body" which contains a
bunch of text. This field is indexed and tokenized.

So, I want to do a search which looks something like:

(category:A OR category:B) AND body:fred

I want all of the category A documents to come before the category B
documents. Effectively, I want the category A documents first
(sorted by relevancy) and then the category B documents (sorted by
relevancy).

I thought I could do this by boosting the category portion of the query,
but that doesn't seem to work consistently. I was setting the boost on
the category A term to 1.0 and the boost on the category B term to 0.0.

Any thoughts on how to skin this?

Scott



Re: Boosting results

2008-11-06 Thread Erick Erickson
It seems to me that the easiest thing would be to fire two queries and
then just concatenate the results:

category:A AND body:fred

category:B AND body:fred

If you really, really didn't want to fire two queries, you could create
filters on category A and category B and make a couple of
passes through your results, checking whether the returned documents are in
each filter, but you'd still concatenate the results. Actually, in your
specific example you could get by with one filter on A.

You could also consider a custom scorer that adds 1,000,000 to the score of
every category A document.

How much were you boosting by? What happens if you boost by a very large
factor, as in ridiculously large?
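
The ordering that the two-query approach produces can be modelled as a single sort in plain Java: category first, then score descending within each category. The class, the Hit type and the method name are hypothetical, and this stands in for the concatenation rather than for any Lucene call:

```java
import java.util.ArrayList;
import java.util.List;

final class CategoryOrderSketch {
    static final class Hit {
        final int doc;
        final String category;
        final float score;

        Hit(int doc, String category, float score) {
            this.doc = doc;
            this.category = category;
            this.score = score;
        }
    }

    /** Category ascending ("A" before "B"), then score descending. */
    static List<Hit> categoryThenScore(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        sorted.sort((a, b) -> {
            int byCategory = a.category.compareTo(b.category);
            if (byCategory != 0) {
                return byCategory;
            }
            return Float.compare(b.score, a.score);
        });
        return sorted;
    }
}
```

This also hints at why a modest boost may not be enough: a boost only scales the contribution of the category clause, so a category B document with a strong body match can still outscore a category A document with a weak one unless the category contribution dwarfs every possible body score.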

Best
Erick
