Re: PForDelta compression

Marvin Humphrey Tue, 30 Sep 2008 14:10:08 -0700

On Sep 30, 2008, at 11:19 AM, Nathan Kurz wrote:

> How are you thinking about using PForDelta?


The current ScorePosting implementation in KinoSearch encodes 4 kinds of data,
all in a single stream per field per segment:

  * doc num
  * term frequency
  * boost
  * positions

The conclusion that Mike McCandless and I reached, now reinforced by recent
benchmarking, is that we need to break up that data into separate streams.  

Current Lucene divides up the data into three streams per segment:

  * doc_num/freq (one ".frq" file for all fields)
  * boost        (one ".fXXX" file per field)
  * positions    (one ".prx" file for all fields)

I think we should use four streams -- one for each data type. doc_num, freq and
positions are all integers; those three streams would be encoded using
PForDelta.
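
To make the idea concrete, here is a rough sketch of delta encoding plus a
"patched" small-integer frame, which is the general shape of PForDelta.  It is
a simplification under stated assumptions: the function names are hypothetical,
the slots are kept as a Python list rather than bit-packed, and exceptions are
stored as (index, value) pairs rather than in the layout the papers describe.

```python
def pfor_encode(doc_nums, bits):
    """Delta-encode ascending doc numbers into `bits`-wide slots.

    Deltas too large for a slot are recorded as exceptions.
    Simplified illustration only, not the on-disk format.
    """
    limit = 1 << bits
    slots = []        # one small integer per value (bit-packed in a real codec)
    exceptions = []   # (index, full delta) for deltas that do not fit
    prev = 0
    for i, doc_num in enumerate(doc_nums):
        delta = doc_num - prev
        prev = doc_num
        if delta < limit:
            slots.append(delta)
        else:
            slots.append(0)              # placeholder, patched during decode
            exceptions.append((i, delta))
    return slots, exceptions


def pfor_decode(slots, exceptions):
    """Patch the exceptions back in, then undo the delta encoding."""
    deltas = list(slots)
    for i, delta in exceptions:
        deltas[i] = delta
    doc_nums = []
    total = 0
    for delta in deltas:
        total += delta
        doc_nums.append(total)
    return doc_nums
```

The point of the split is that the common case (small deltas) decodes in a
tight loop with no branching, and the rare large values are fixed up in a
separate pass.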

The ideal implementation of MatchPosting only requires a single stream of
document numbers.  That would use PForDelta as well. (Note that the current KS
MatchPosting class is a temporary hack that doesn't match the ideal
implementation.)

The logical extension of the KinoSearch model would be to use three files *per*
*field* for postings data in each segment.  I think that's probably too many
files.  Instead, I think we should establish dedicated files for each data
type.  All fields assigned to a Posting class which uses that type would store
their data in shared files.

For MatchPosting (in the segment named "seg_e8a"):

  * doc num        => seg_e8a.match

For ScorePosting:

  * doc num        => seg_e8a.match
  * term frequency => seg_e8a.freq
  * boost          => seg_e8a.boost
  * positions      => seg_e8a.prox

For RichPosting:

  * doc num        => seg_e8a.match
  * term frequency => seg_e8a.freq
  * positions      => seg_e8a.prox
  * per-pos boost  => seg_e8a.proxboost

Thinking ahead to PayloadPosting:

  * doc num        => seg_e8a.match
  * string payload => seg_e8a.payload
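
The mapping above could be sketched as a simple table from Posting class to
file extensions.  The table contents come straight from the lists above; the
names STREAMS and segment_files, and the idea of passing in a dict of field
name to Posting class name, are assumptions made for illustration only.

```python
# Posting class => extensions of the shared per-segment files it uses.
STREAMS = {
    "MatchPosting":   ["match"],
    "ScorePosting":   ["match", "freq", "boost", "prox"],
    "RichPosting":    ["match", "freq", "prox", "proxboost"],
    "PayloadPosting": ["match", "payload"],
}


def segment_files(seg_name, field_posting_classes):
    """Return the postings files a segment needs.

    `field_posting_classes` maps field name => Posting class name.
    Fields that share a data type share a file, so each extension
    appears once no matter how many fields use it.
    """
    exts = set()
    for posting_class in field_posting_classes.values():
        exts.update(STREAMS[posting_class])
    return sorted(f"{seg_name}.{ext}" for ext in exts)
```

With one ScorePosting field and one MatchPosting field, the segment ends up
with four postings files rather than one set of files per field.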

Straightforward term queries would always access the ".match" file.  Term
frequency would be read from the ".freq" file or default to 0 (resulting in a
successful match but a score of 0.0).  Boost would be read from the ".boost"
file or default to 1.0.  Positions data, however, would not be read during term
queries -- eliminating the current search-time speed weakness.
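
A minimal sketch of those defaults, assuming each stream has already been
decoded into a list indexed by posting number (the `streams` dict and the
function name are hypothetical, not existing KS API):

```python
def read_term_posting(streams, index):
    """Fetch what a plain term query needs for one posting.

    `streams` maps a file extension to its decoded values; a stream
    that is absent for this field falls back to the stated default.
    The "prox" stream is never consulted here.
    """
    doc_num = streams["match"][index]
    freq = streams["freq"][index] if "freq" in streams else 0
    boost = streams["boost"][index] if "boost" in streams else 1.0
    return doc_num, freq, boost
```

A MatchPosting field supplies only "match", so it yields a hit with a
frequency of 0 and a boost of 1.0.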

In a previous post, I mentioned that the tricky part about transitioning the
current KS code base to PForDelta is that PForDelta uses block encoding.  I
think what we'll have to do is store both the file position where the block
starts and the offset into the block in the TermInfo we get when looking up the
term in a Lexicon.
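
Roughly, the TermInfo would carry a two-part address.  The sketch below is
only an illustration: the field names, the `read_block` callback, and the
assumption that a decoded block is a plain list are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TermInfo:
    doc_freq: int          # number of postings for the term
    block_file_pos: int    # file position where the enclosing block starts
    offset_in_block: int   # index of the term's first value inside that block


def first_values(term_info, read_block):
    """Decode the term's starting block and skip to its first value.

    `read_block(file_pos)` is assumed to return one decoded block as a
    list of integers.  A term whose postings span several blocks would
    keep reading from the following block; that is omitted here.
    """
    block = read_block(term_info.block_file_pos)
    end = term_info.offset_in_block + term_info.doc_freq
    return block[term_info.offset_in_block:end]
```

Because a block has to be decoded as a unit, the file position alone cannot
point at a term's first posting; the offset says where to start within the
decoded block.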

PS: For the lucy-dev mail archives, the paper comparing PForDelta to VByte and
other schemes is at <http://www2008.org/papers/pdf/p387-zhangA.pdf>, and the
paper which describes PForDelta in depth is at
<http://homepages.cwi.nl/~heman/downloads/msthesis.pdf>.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
