On Tuesday 06 October 2009 23:59:12 eks dev wrote:
> Paul,
> the example I used to make my point was extreme, but realistic.
> Imagine 100 million docs, sorted on field user_rights, where a term user_rights:XX
> selects 40 million of them (user rights...). To encode this, you need a format with
> two integers (for more such intervals you would need slightly more, but
> nevertheless much less than for OpenBitSet, VInts, PFor...). Strictly
> speaking this term is dense, but it is highly compressible and could be inlined
> with the pulsing trick...

Well, I've been considering adding compressed consecutive ranges to
SortedVIntList, but I did not get any further than considering it.
This sounds like the perfect use case for that.
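
To make the idea concrete, here is a minimal sketch of storing doc ids as
consecutive ranges. It is not the SortedVIntList API and not anything in
Lucene: the class IntervalDocIdList and all of its methods are hypothetical
names made up for this illustration, and the ranges are kept as plain ints
rather than being VInt-compressed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
final class IntervalDocIdList {
    // Flattened [start, endInclusive] pairs, sorted, non-overlapping.
    private final int[] bounds;

    private IntervalDocIdList(int[] bounds) {
        this.bounds = bounds;
    }

    /** A single run of consecutive doc ids: two integers in total. */
    static IntervalDocIdList ofRange(int start, int endInclusive) {
        return new IntervalDocIdList(new int[] {start, endInclusive});
    }

    /** Collapses strictly increasing doc ids into consecutive ranges. */
    static IntervalDocIdList fromSortedDocIds(int[] docIds) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < docIds.length; i++) {
            int start = docIds[i];
            int end = start;
            // Extend the current range while the next id is adjacent.
            while (i + 1 < docIds.length && docIds[i + 1] == end + 1) {
                end = docIds[++i];
            }
            out.add(start);
            out.add(end);
        }
        int[] b = new int[out.size()];
        for (int k = 0; k < b.length; k++) {
            b[k] = out.get(k);
        }
        return new IntervalDocIdList(b);
    }

    int intervalCount() {
        return bounds.length / 2;
    }

    /** Number of doc ids covered by all ranges. */
    long cardinality() {
        long n = 0;
        for (int i = 0; i < bounds.length; i += 2) {
            n += (long) bounds[i + 1] - bounds[i] + 1;
        }
        return n;
    }

    /** Binary search over the ranges. */
    boolean contains(int docId) {
        int lo = 0;
        int hi = intervalCount() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (docId < bounds[2 * mid]) {
                hi = mid - 1;
            } else if (docId > bounds[2 * mid + 1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }
}
```

With this, the 40 million docs from the example would be
IntervalDocIdList.ofRange(0, 39999999), assuming they sit at the start of
the sorted index: one range, two integers.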

Regards,
Paul Elschot


> 
> cheers, eks  
> 
> 
> 
> 
> >
> >From: Paul Elschot <paul.elsc...@xs4all.nl>
> >To: java-dev@lucene.apache.org
> >Sent: Tuesday, 6 October, 2009 23:33:03
> >Subject: Re: [jira] Commented: (LUCENE-1410) PFOR implementation
> >
> >Eks,
> >
> >
> >> 
> >>>     [ 
> >>> https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762742#action_12762742
> >>>  ] 
> >>> 
> >>> Eks Dev commented on LUCENE-1410:
> >>> ---------------------------------
> >>> 
> >>> Mike, 
> >>> That is definitely the way to go: distribution-dependent encoding, where 
> >>> every Term gets individual treatment.
> >>> 
> >>> Take, for example, the simple but not all that rare case where the index 
> >>> gets sorted on some of the indexed fields (we use this really extensively, 
> >>> e.g. a presorted doc collection on user_rights/zip/city, all indexed). There 
> >>> you get perfectly "compressible" postings by simply managing intervals of 
> >>> set bits. Updates distort this picture, but we rebuild the index periodically 
> >>> and all is good again. At the moment we load them into RAM as Filters 
> >>> in IntervalSets. If that were possible in Lucene, we wouldn't bother 
> >>> with Filters (VInt decoding on such super dense fields was killing us, 
> >>> even in RAMDirectory) ... 
> >
> >
> >You could try switching the Filter to OpenBitSet when that takes fewer bytes 
> >than SortedVIntList.
> >
> >
> >Regards,
> >Paul Elschot
> >
> >
> >
> 
> 
>       
