Our recent experiments show that PFOR is not a good solution for AND
queries. We tested it with our own dataset and our users' queries, and
in most cases PFOR is slower than VInt.

We think the reason is that most queries are very likely to contain a
low-frequency term, so scoring takes most of the time and decoding does
not. For example, in our index the df of "beijing" is 2557916 and the df
of "park" is 2313201. Both are high-frequency terms, but only 1552
documents contain both. VInt only needs to decode 1552 documents, while
PFOR may have to decode many blocks.

Most search engines use AND queries, so PFOR is only good for OR queries
and for AND queries whose terms are all high frequency. We therefore
have to give it up in our application.

What about a partial decoder for PFOR? For queries whose terms are all
high frequency we would use the normal PFOR decoder, and for queries
with low-frequency terms we would use the partial decoder. But a partial
decoder for PFOR may need many if/else branches and will be slower.
Does anyone have a solution for this?

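To make the cost argument concrete, here is a rough sketch (plain Python, not Lucene code) that counts how many fixed-size blocks a block codec would have to decode in full when a frequent term's posting list is advanced to a handful of candidate documents. The block size of 128, the function names and the sample data are all assumptions made for illustration; they are not taken from the PFor patch or from our index.

```python
from bisect import bisect_left

BLOCK_SIZE = 128  # assumed block size, for illustration only


def blocks_touched(postings, targets, block_size=BLOCK_SIZE):
    """Return the indexes of the blocks that must be decoded to advance
    `postings` (sorted doc IDs) to each doc ID in `targets`."""
    touched = set()
    for target in targets:
        pos = bisect_left(postings, target)  # first entry >= target
        if pos < len(postings):
            touched.add(pos // block_size)
    return touched


def ints_decoded(postings, targets, block_size=BLOCK_SIZE):
    """Count doc IDs decoded if every touched block is decoded in full."""
    total = 0
    for block in blocks_touched(postings, targets, block_size):
        start = block * block_size
        # The last block of a list may be shorter than block_size.
        total += min(block_size, len(postings) - start)
    return total


if __name__ == "__main__":
    # Hypothetical data: a frequent term in every 2nd document and a
    # rare term with only three postings.
    frequent = list(range(0, 1_000_000, 2))
    rare = [10, 400_000, 999_998]
    print("candidates:", len(rare))
    print("blocks decoded:", len(blocks_touched(frequent, rare)))
    print("doc IDs decoded:", ints_decoded(frequent, rare))
```

The sketch ignores skip lists and the cost of exceptions in PFOR, so it only shows the shape of the problem: the work grows with the number of blocks touched, not with the number of matching documents.
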
2010/12/27 Li Li <[email protected]>:
> I integrated pfor codec into lucene 2.9.3 and the search time
> comparison is as follows:
>                           single term   and query   or query
> VINT in lucene 2.9.3      11.2          36.5        38.6
> PFor in lucene 2.9.3      8.7           27.6        33.4
> VINT in lucene 4 branch   10.6          26.5        35.4
> PFor in lucene 4 branch   8.1           22.5        30.7
>
> My test terms are high frequency terms because we are interested in "bad case"
> It seems lucene 4 branch's implementation of and query (conjunction
> query) is so well optimized that even for VINT codec, it's faster than
> PFor in lucene 2.9.3. Could anyone tell me what optimization is done?
> Is storing docIDs and freqs separately making it faster? Or anything
> else?
>
> Another question: is there anyone interested in integrating pfor
> codec into lucene 2.9.3 like me (we have to use lucene 2.9 and solr
> 1.4)? And how do I contribute this patch?
>
> 2010/12/24 Michael McCandless <[email protected]>:
>> Well, an early patch somewhere was able to run PFor on trunk, but the
>> performance wasn't great because the trunk bulk-read API is a
>> bottleneck (this is why the bulk postings branch was created).
>>
>> Mike
>>
>> On Wed, Dec 22, 2010 at 9:45 PM, Li Li <[email protected]> wrote:
>>> I used the bulkpostings
>>> branch(https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene)
>>> does trunk have PForDelta decoder/encoder ?
>>>
>>> 2010/12/23 Michael McCandless <[email protected]>:
>>>> Those are nice speedups!
>>>>
>>>> Did you use the 4.0 branch (ie trunk) or the bulkpostings branch for this
>>>> test?
>>>>
>>>> Mike
>>>>
>>>> On Tue, Dec 21, 2010 at 9:59 PM, Li Li <[email protected]> wrote:
>>>>> great improvement!
>>>>> I did a test in our data set. doc count is about 2M+ and index size
>>>>> after optimization is about 13.3GB (including fdt)
>>>>> it seems lucene4's index format is better than lucene2.9.3. and PFor
>>>>> gives good results.
>>>>> Besides BlockEncoder for frq and pos. is there any other modification
>>>>> for lucene 4?
>>>>>
>>>>> decoder \ avg time        single word(ms)   and query(ms)   or query(ms)
>>>>> VINT in lucene 2.9        11.2              36.5            38.6
>>>>> VINT in lucene 4 branch   10.6              26.5            35.4
>>>>> PFor in lucene 4 branch   8.1               22.5            30.7
>>>>> 2010/12/21 Li Li <[email protected]>:
>>>>>>> OK we should have a look at that one still. We need to converge on a
>>>>>>> good default codec for 4.0. Fortunately it's trivial to take any int
>>>>>>> block encoder (fixed or variable block) and make a Lucene codec out of
>>>>>>> it!
>>>>>>
>>>>>> I suggest you not use this one. I fixed dozens of bugs but it
>>>>>> still failed with random tests. Its code is hand coded rather
>>>>>> than generated by program. But we may learn something from it.
>>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
