Re: Efficient Query Evaluation using a Two-Level Retrieval Process

J. Delgado Mon, 16 Nov 2009 10:10:28 -0800

On Mon, Nov 16, 2009 at 9:44 AM, Earwin Burrfoot <ear...@gmail.com> wrote:
> This algo is strictly tied to sort-by-score, if I understand it correctly.
> Lucene has queries and sorting decoupled (except for allowOutOfOrder
> mess), so implementing it would require some really fat hacks.
>


According to the paper on Indexing Boolean Expression (using the WAND
algo), sorting can be done based on scores that are determined based
weight assignment to key-value pairs:

http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

So I believe this can be generalized to sorting by any doc attributes
given the proper weight assignment model

Of course, the devil-is-in-the-details :-(

-- Joaquin


> On Mon, Nov 16, 2009 at 20:26, J. Delgado <joaquin.delg...@gmail.com> wrote:
>> As I understood it setMinimumNumberShouldMatch(int min) Is used to
>> specify a minimum number of the optional BooleanClauses which must be
>> satisfied.
>>
>> I haven't seen the implementation of setMinimumNumberShouldMatch but
>> it seems a bit different than what is intended with the WAND operator,
>> which can take any real number as threshold θ
>>
>> As stated in the paper:
>>
>> WAND(X1,w1, . . . Xk,wk, θ) is true iff X 1≤i≤k and SUM(xiwi)≥ θ
>>
>> where xi is the indicator variable for Xi, that is xi =  1, if Xi is
>> true 0, otherwise.
>>
>> Observe that WAND can be used to implement AND
>> and OR via
>> AND(X1,X2, . . .Xk) ≡ WAND(X1, 1,X2, 1, . . . Xk, 1, k),
>> and
>> OR(X1,X2, . ..Xk) ≡ WAND(X1, 1,X2, 1, . ..Xk, 1, 1).
>>
>> What I find interesting is the idea of using a first pass using the
>> upper bound (maximal) contribution of a term on any document score and
>> the dynamic setting of the threshold θ to skip or to fully evaluate a
>> document..
>>
>> As stated in the paper:
>>
>> "Given this setup our preliminary scoring consists of evaluating
>> for each document d
>> WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
>> where Xi is an indicator variable for the presence of query term i in
>> document d and the threshold θ is varied during
>> the algorithm as explained below. If WAND evaluates to true, then the
>> document d undergoes a full evaluation.
>> The threshold θ is set dynamically by the algorithm based on the
>> minimum score m among the top n results found so
>> far, where n is the number of requested documents. The larger the
>> threshold, the more documents will be skipped
>> and thus we will need to compute full scores for fewer documents."
>>
>> I think its worth a try...
>>
>> -- Joaquin
>>
>> On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki <a...@getopt.org> wrote:
>>>
>>> J. Delgado wrote:
>>>>
>>>> Here is the link to the paper.
>>>> http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf
>>>>
>>>> A more recent application of the use and extension of the WAND operator for
>>>> indexing of Boolean expressions:
>>>> http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf
>>>>
>>>> -- Joaquin
>>>>
>>>>
>>>> On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar <
>>>> shalinman...@gmail.com> wrote:
>>>>
>>>>> Hey Joaquin,
>>>>>
>>>>> The mailing list strips off attachments. Can you please upload it 
>>>>> somewhere
>>>>> and give us the link?
>>>>>
>>>>> On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado <joaquin.delg...@gmail.com
>>>>>>
>>>>>> wrote:
>>>>>> Please find attached the paper on "Efficient Query Evaluation using a
>>>>>> Two-Level Retrieval Process". I believe that such approach may improve
>>>>>
>>>>> the
>>>>>>
>>>>>> way Lucene/Solr evaluates queries today.
>>>
>>> The functionality of WAND (weak AND) is already implemented in Lucene, if I 
>>> understand it correctly - this is the BooleanQuery.setMinShouldMatch(int). 
>>> Lucene implements this probably differently from the algorithm described in 
>>> the paper, so there may be still some benefits from comparing the 
>>> algorithms in Lucene's BooleanScorer[2] with this one ...
>>>
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: Efficient Query Evaluation using a Two-Level Retrieval Process

Reply via email to