On Sep 30, 2008, at 11:19 AM, Nathan Kurz wrote: > How are you thinking about using PForDelta?
The current ScorePosting implementation in KinoSearch encodes 4 kinds of data, all in a single stream per field per segment: * doc num * term frequency * boost * positions The conclusion that Mike McCandless and I reached, now reinforced by recent benchmarking, is that we need to break up that data into separate streams. Current Lucene divides up the data into three streams per segment: * doc_num/freq (one ".frq" file for all fields) * boost (one ".fXXX" file per field) * positions (one ".prx" file for all fields) I think we should use four streams -- one for each data type. doc_num, freq and positions are all integers; those three streams would be encoded using PForDelta. The ideal implementation of MatchPosting only requires a single stream of document numbers. That would use PForDelta as well. (Note that the current KS MatchPosting class is a temporary hack that doesn't match the ideal implementation.) The logical extension of the KinoSearch model would be to use three files *per* *field* for postings data in each segment. I think that's probably too many files. Instead, I think we should establish dedicated files for each data type. All fields assigned to a Posting class which uses that type would store their data in shared files. For MatchPosting (in the segment named "seg_e8a"): * doc num => seg_e8a.match For ScorePosting: * doc num => seg_e8a.match * term frequency => seg_e8a.freq * boost => seg_e8a.boost * positions => seg_e8a.prox For RichPosting: * doc num => seg_e8a.match * term frequency => seg_e8a.freq * positions => seg_e8a.prox * per-pos boost => seg_e8a.proxboost Thinking ahead to PayloadPosting: * doc num => seg_e8a.match * string payload => seg_e8a.payload Straightforward term queries would always access the ".match" file. Term frequency would be read from the ".freq" file or default to 0 (resulting in a successful match but a score of 0.0). Boost would be read from the ".boost" file or default to 1.0. Positions data, however, would not be read during term queries -- eliminating the current search-time speed weakness. In a previous post, I mentioned that tricky part about transitioning the current KS code base to PForDelta is that PForDelta uses block encoding. I think what we'll have to do is store both the file position where the block starts and the offset into the block in the TermInfo we get when looking up the term in a Lexicon. PS: For the lucy-dev mail archives, the paper comparing PForDelta to VByte and other schemes is at <http://www2008.org/papers/pdf/p387-zhangA.pdf>, and the paper which describes PForDelta in depth is at <http://homepages.cwi.nl/~heman/downloads/msthesis.pdf>. Marvin Humphrey Rectangular Research http://www.rectangular.com/
