Re: inverted index pruning
great things. But I think the patch is different from the method in that paper. my colleague had tested this patch but don't get good results (I don't know the detail well, and he just tell me his experience) 2011/2/15 Andrzej Bialecki a...@getopt.org: On 2/15/11 11:57 AM, Li Li wrote: hi all, I recently read a paper Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee. It's idea is interesting and I have some questions and like to share with you. Please take a look at LUCENE-1812, LUCENE-2632 and my presentation from Apache EuroCon 2010 in Prague, Munching and Crunching. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
inverted index pruning
hi all, I recently read a paper Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee. It's idea is interesting and I have some questions and like to share with you. it's idea is pruning unlikely documents for certain terms e.g. term1 d1 d3 d6 | d9 d7 d8 term2 d1 d6 d8 | d3 d4 d5 we have 2 terms here -- term1 and term2 we perform a and query for searching documents that contain both the 2 terms if our score function consider 2 types of scores: document static score and term related document score for simplicity, the static score is the page_rank of a document and term related score is tf*idf and final score is linear combination such as page_rank + tf*idf the pruning criterion is : we only keep the documents whose page_rank is top N of all the docList for this term and tf*idf is also top N take above example d1 d3 d5's page_rank is top 3 of all the 5 documents containing term1 and also d1's tf *idf is top 3 of all the 5 documents Then we want to get top 3 documents we evaluate d1 d3 d6 d8 because it's in pruned docList d1 and d6 's information is complete so we can calculate the accurate score of d1 d3 d8's information is incomplete, we can only calculate the upper bound of them. we select top 3 document by score. suppose the result is d6 d1 d8 d3 because d8 and d3 is upper bound score, we can compare the real score of d8 and d3. So the search is failed and we have to search un-pruned index for answer But if we want to get top 2 documents. we can know d1 and d6 are the answer because d6, d8 and other documents' score is less than them The experiments in this paper shows that we can use pruned index(30% size of full index) to answer 90%+ queries This result is excited but we have a problem in using it in lucene. for and query in lucene, we use skipList to speed up queries. but this pruning algorithm cannot use pruned index. we can check it by above example. if we use skipList, we can skip d3. but d3's upper bound score may larger than d1. if we don't score d3, we cannot know it. although we need only score docList in pruned index(30%), it may be slower than and query in full index because we can use skipList to skip many docs. Anyone has good idea for this problem? if we can solve this problem, we can improve performance well. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: inverted index pruning
On 2/15/11 11:57 AM, Li Li wrote: hi all, I recently read a paper Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee. It's idea is interesting and I have some questions and like to share with you. Please take a look at LUCENE-1812, LUCENE-2632 and my presentation from Apache EuroCon 2010 in Prague, Munching and Crunching. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org