Re: inverted index pruning

2011-02-16 Thread Li Li
great things.
But I think the patch is different from the method in that paper.
my colleague had tested this patch but don't get good results
(I don't know the detail well, and he just tell me his experience)

2011/2/15 Andrzej Bialecki a...@getopt.org:
 On 2/15/11 11:57 AM, Li Li wrote:

 hi all,
     I recently read a paper Pruning Policies for Two-Tiered Inverted
 Index with Correctness Guarantee. It's idea is interesting and I
 have some questions and like to share with you.

 Please take a look at LUCENE-1812, LUCENE-2632 and my presentation from
 Apache EuroCon 2010 in Prague, Munching and Crunching.


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



inverted index pruning

2011-02-15 Thread Li Li
hi all,
I recently read a paper Pruning Policies for Two-Tiered Inverted
Index with Correctness Guarantee. It's idea is interesting and I
have some questions and like to share with you.
it's idea is pruning unlikely documents for certain terms
e.g.
term1  d1  d3  d6  |  d9  d7 d8
term2  d1  d6  d8  |  d3  d4 d5
we have 2 terms here -- term1 and term2
we perform a and query for searching documents that contain both
the 2 terms
if our score function consider 2 types of scores: document static
score and term related document score
for simplicity, the static score is the page_rank of a document
and term related score is tf*idf and final score is linear combination
such as page_rank + tf*idf
the pruning criterion is : we only keep the documents whose
page_rank is top N of all the  docList for this term and tf*idf is
also
top N
take above example  d1 d3 d5's page_rank is top 3 of all the 5
documents containing term1 and also d1's tf *idf is top 3 of all the 5
documents
   Then we want to get top 3 documents
   we evaluate d1 d3  d6 d8 because it's in pruned docList
   d1 and d6 's information is complete so we can calculate the
accurate score of d1
   d3 d8's information is incomplete, we can only calculate the upper
bound of them.
   we select top 3 document by score. suppose the result is d6 d1 d8 d3
   because d8 and d3 is upper bound score, we can compare the real
score of d8 and d3. So the search is failed and we have to search
un-pruned index for answer
   But if we want to get top 2 documents. we can know d1 and d6 are
the answer because d6, d8 and other documents' score is less
than them

   The experiments in this paper shows that we can use pruned
index(30% size of full index) to answer 90%+ queries
   This result is excited but we have a problem in using it in lucene.
   for and query in lucene, we use skipList to speed up queries.
   but this pruning algorithm cannot use pruned index.
   we can check it by above example.
   if we use skipList, we can skip d3. but d3's upper bound score may
larger than d1. if we don't score d3, we cannot know it.
   although we need only score docList in pruned index(30%), it may be
slower than and query in full index because we can
use skipList to skip many docs.
   Anyone has good idea for this problem? if we can solve this
problem, we can improve performance well.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: inverted index pruning

2011-02-15 Thread Andrzej Bialecki

On 2/15/11 11:57 AM, Li Li wrote:

hi all,
 I recently read a paper Pruning Policies for Two-Tiered Inverted
Index with Correctness Guarantee. It's idea is interesting and I
have some questions and like to share with you.


Please take a look at LUCENE-1812, LUCENE-2632 and my presentation from 
Apache EuroCon 2010 in Prague, Munching and Crunching.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org