Lucene has gained several new compressed DocIdSet implementations that are 
technically very interesting: PForDeltaDocIdSet, WAH8DocIdSet, 
EliasFanoDocIdSet, … any more?  Yet it's difficult (at least for me) to 
understand their pros and cons well enough to choose among them.  They all seem 
great, yet why do we have three?  Only one is actually used by Lucene itself: 
WAH8DocIdSet, in CachingWrapperFilter.  The Javadocs are hit and miss; the JIRA 
issues have lots of fascinating background, but it's time-consuming to distill.  
I think it would be very useful to summarize the key characteristics in the 
class-level Javadocs: not so much implementation details as information that 
helps a user choose one over another.  As a bonus, the package-level Javadocs 
could include a table showing relative performance characteristics.

Related to this, I'm wondering whether it makes sense for a codec's postings 
(assuming no doc freq and no positions?) to be implemented as a serialized 
version of one of these compressed doc id sets.  I think it would be really 
great, not just for compression but also because it might support 
Terms.advance(), since some of these compressed formats have indexes.
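
To make the "indexes help advance()" point concrete, here is a rough toy 
sketch. It is not Lucene code and not how any of the three classes above is 
implemented; the class name DeltaDocIds, the block size, and the layout are 
all made up for illustration. Doc ids are stored as gaps, and a small sampled 
index of absolute doc ids lets advance() jump to the right block instead of 
decoding from the start.

```java
/** Hypothetical toy class (not a Lucene API): gap-encoded doc ids plus a sampled skip index. */
class DeltaDocIds {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private static final int BLOCK = 4;   // assumed sampling interval for the skip index

    private final int[] deltas;           // gap to the previous doc id (first entry is the doc id itself)
    private final int[] skipDocs;         // absolute doc id at positions 0, BLOCK, 2*BLOCK, ...

    /** Expects strictly increasing, non-negative doc ids. */
    DeltaDocIds(int[] sortedDocIds) {
        deltas = new int[sortedDocIds.length];
        skipDocs = new int[(sortedDocIds.length + BLOCK - 1) / BLOCK];
        int prev = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - prev;
            prev = sortedDocIds[i];
            if (i % BLOCK == 0) {
                skipDocs[i / BLOCK] = sortedDocIds[i];
            }
        }
    }

    /** Returns the first doc id >= target, or NO_MORE_DOCS if there is none. */
    int advance(int target) {
        if (skipDocs.length == 0) {
            return NO_MORE_DOCS;
        }
        // Use the skip index to find the last block whose first doc id is <= target.
        // (A linear scan keeps the sketch short; a binary search would also work.)
        int block = 0;
        while (block + 1 < skipDocs.length && skipDocs[block + 1] <= target) {
            block++;
        }
        // Decode gaps only from the start of that block.
        int doc = skipDocs[block];
        int i = block * BLOCK;
        while (doc < target) {
            i++;
            if (i >= deltas.length) {
                return NO_MORE_DOCS;
            }
            doc += deltas[i];
        }
        return doc;
    }

    public static void main(String[] args) {
        DeltaDocIds set = new DeltaDocIds(new int[] {3, 7, 8, 20, 21, 50, 51, 90, 100});
        System.out.println(set.advance(22));  // prints 50
    }
}
```

A real serialized postings format would pack the gaps into bits or bytes 
rather than an int[], but the shape of the idea is the same: the index bounds 
how much has to be decoded per advance() call.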

~ David
