Re: Understanding Lucene's File Format

Michael McCandless Fri, 17 Sep 2010 02:24:11 -0700

The entry for each term in the terms dict stores a long file offset
pointer into the .frq file, and another long for the .prx file.


But, these longs are delta-coded, so as you scan you have to sum up
these deltas to get the absolute file pointers.
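
To make the delta coding concrete, here is a tiny Python sketch. The
delta values are made up for illustration and this is not Lucene's
actual reader code; it only shows how a running sum turns the stored
deltas back into absolute .frq and .prx pointers.

```python
from itertools import accumulate

# Hypothetical per-term deltas, in term order, as they might appear in the
# terms dict. Each value is the distance from the previous term's pointer.
freq_deltas = [0, 120, 35, 410]    # deltas into the .frq file
prox_deltas = [0, 300, 90, 1025]   # deltas into the .prx file

# A running sum recovers the absolute file pointer for every term.
freq_pointers = list(accumulate(freq_deltas))
prox_pointers = list(accumulate(prox_deltas))

print(freq_pointers)
print(prox_pointers)
```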

The terms index (once loaded into RAM) has these pointers as absolute
longs, too.

So when looking up a term, we first binary search the terms index for
the nearest indexed term at or before the one you seek, then seek to
that spot in the terms dict, then scan forward, summing the deltas.
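
A rough sketch of that lookup in Python, under some loud assumptions:
the terms dict is modelled as an in-memory list rather than a file,
and TERMS_DICT, INDEX_INTERVAL, build_terms_index and lookup are all
hypothetical names for illustration, not Lucene APIs. The index keeps
every Nth term together with its absolute pointers, so the scan only
has to sum the deltas that follow the indexed entry.

```python
from bisect import bisect_right

# Hypothetical terms dict, sorted by term: (term, freq_delta, prox_delta).
TERMS_DICT = [
    ("apple", 0, 0),
    ("banana", 120, 300),
    ("cherry", 35, 90),
    ("date", 410, 1025),
    ("elder", 60, 200),
    ("fig", 15, 40),
]
INDEX_INTERVAL = 2  # hypothetical: index every 2nd term


def build_terms_index(terms_dict, interval):
    """Return (term, position, abs_freq_ptr, abs_prox_ptr) for every Nth term."""
    index = []
    freq = prox = 0
    for pos, (term, freq_delta, prox_delta) in enumerate(terms_dict):
        freq += freq_delta
        prox += prox_delta
        if pos % interval == 0:
            index.append((term, pos, freq, prox))
    return index


def lookup(term, terms_dict, index):
    """Return absolute (freq_ptr, prox_ptr) for term, or None if absent."""
    # Binary search for the last indexed term that is <= the sought term.
    keys = [entry[0] for entry in index]
    i = bisect_right(keys, term) - 1
    if i < 0:
        return None  # term sorts before the first indexed term
    _, pos, freq, prox = index[i]
    # Scan forward from the indexed entry, summing deltas as we go.
    while terms_dict[pos][0] < term:
        pos += 1
        if pos >= len(terms_dict):
            return None
        freq += terms_dict[pos][1]
        prox += terms_dict[pos][2]
    return (freq, prox) if terms_dict[pos][0] == term else None


index = build_terms_index(TERMS_DICT, INDEX_INTERVAL)
print(lookup("date", TERMS_DICT, index))
```

With an interval of N, the scan touches at most N - 1 entries after
the binary search, which is why relative pointers in the terms dict
do not make lookups slow.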

Mike

On Thu, Sep 16, 2010 at 3:53 PM, Giovanni Fernandez-Kincade
<gfernandez-kinc...@capitaliq.com> wrote:
> Hi,
> I've been trying to understand Lucene's file format and I keep getting hung 
> up on one detail - how can Lucene quickly find the frequency data (or 
> proximity data) for a particular term? According to the file formats page on 
> the Lucene 
> website<http://lucene.apache.org/java/2_2_0/fileformats.html#Term%20Dictionary>,
>  the FreqDelta field in the Term Info file (.tis) is relative to the previous 
> term. How is this helpful? The few references I've found on the web for this 
> subject make it sound like the Term Dictionary has direct pointers to the 
> frequency data for a given term, but that isn't consistent with the 
> aforementioned reference.
>
> Thanks for your help,
> Gio.
>
