hi all, I recently read a paper "Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee". It's idea is interesting and I have some questions and like to share with you. it's idea is pruning unlikely documents for certain terms e.g. term1 d1 d3 d6 | d9 d7 d8 term2 d1 d6 d8 | d3 d4 d5 we have 2 terms here -- term1 and term2 we perform a "and query" for searching documents that contain both the 2 terms if our score function consider 2 types of scores: document static score and term related document score for simplicity, the static score is the page_rank of a document and term related score is tf*idf and final score is linear combination such as page_rank + tf*idf the pruning criterion is : we only keep the documents whose page_rank is top N of all the docList for this term and tf*idf is also top N take above example d1 d3 d5's page_rank is top 3 of all the 5 documents containing term1 and also d1's tf *idf is top 3 of all the 5 documents Then we want to get top 3 documents we evaluate d1 d3 d6 d8 because it's in pruned docList d1 and d6 's information is complete so we can calculate the accurate score of d1 d3 d8's information is incomplete, we can only calculate the upper bound of them. we select top 3 document by score. suppose the result is d6 d1 d8 d3 because d8 and d3 is upper bound score, we can compare the real score of d8 and d3. So the search is failed and we have to search un-pruned index for answer But if we want to get top 2 documents. we can know d1 and d6 are the answer because d6, d8 and other documents' score is less than them
The experiments in this paper shows that we can use pruned index(30% size of full index) to answer 90%+ queries This result is excited but we have a problem in using it in lucene. for "and query" in lucene, we use skipList to speed up queries. but this pruning algorithm cannot use pruned index. we can check it by above example. if we use skipList, we can skip d3. but d3's upper bound score may larger than d1. if we don't score d3, we cannot know it. although we need only score docList in pruned index(30%), it may be slower than "and query" in full index because we can use skipList to skip many docs. Anyone has good idea for this problem? if we can solve this problem, we can improve performance well. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org