inverted index pruning

Li Li Tue, 15 Feb 2011 02:58:18 -0800

hi all,
    I recently read a paper "Pruning Policies for Two-Tiered Inverted
Index with Correctness Guarantee". It's idea is interesting and I
have some questions and like to share with you.
    it's idea is pruning unlikely documents for certain terms
    e.g.
    term1  d1  d3  d6  |  d9  d7 d8
    term2  d1  d6  d8  |  d3  d4 d5
    we have 2 terms here -- term1 and term2
    we perform a "and query" for searching documents that contain both
the 2 terms
    if our score function consider 2 types of scores: document static
score and term related document score
    for simplicity, the static score is the page_rank of a document
and term related score is tf*idf and final score is linear combination
such as page_rank + tf*idf
    the pruning criterion is : we only keep the documents whose
page_rank is top N of all the  docList for this term and tf*idf is
also
top N
    take above example  d1 d3 d5's page_rank is top 3 of all the 5
documents containing term1 and also d1's tf *idf is top 3 of all the 5
documents
   Then we want to get top 3 documents
   we evaluate d1 d3  d6 d8 because it's in pruned docList
   d1 and d6 's information is complete so we can calculate the
accurate score of d1
   d3 d8's information is incomplete, we can only calculate the upper
bound of them.
   we select top 3 document by score. suppose the result is d6 d1 d8 d3
   because d8 and d3 is upper bound score, we can compare the real
score of d8 and d3. So the search is failed and we have to search
un-pruned index for answer
   But if we want to get top 2 documents. we can know d1 and d6 are
the answer because d6, d8 and other documents' score is less
than them


   The experiments in this paper shows that we can use pruned
index(30% size of full index) to answer 90%+ queries
   This result is excited but we have a problem in using it in lucene.
   for "and query" in lucene, we use skipList to speed up queries.
   but this pruning algorithm cannot use pruned index.
   we can check it by above example.
   if we use skipList, we can skip d3. but d3's upper bound score may
larger than d1. if we don't score d3, we cannot know it.
   although we need only score docList in pruned index(30%), it may be
slower than "and query" in full index because we can
use skipList to skip many docs.
   Anyone has good idea for this problem? if we can solve this
problem, we can improve performance well.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

inverted index pruning

Reply via email to