Hi,

On Thu, Apr 18, 2013 at 3:46 PM, Gaurav Ranjan
<gaurav.ranjan.i...@gmail.com> wrote:
> I am a student and studying the functionality of Lucene for my project work.
> The DocDelta example on this link is not clear
> http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html?is-external=true
> ,
>
> Please explain the first part how we are getting 15,8,3 as the TermFreqs
> for the example.

The term appears once in doc 7 and 3 times in doc 11. In real-world
cases, freqs are very often equal to 1, so Lucene40PostingsFormat
tries to use as little data as possible (one bit here) when the freq
is 1. Here are the steps performed:

1. Raw doc IDs and freqs -> 7, 1, 11, 3
2. Delta-encoded doc IDs -> 7, 1, 4, 3 (11 - 7 = 4)
3. Multiply deltas by 2 -> 14, 1, 8, 3
4. When the frequency is 1, omit it and add one to the doc delta -> 15, 8, 3
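
The encoding steps above can be sketched in Java as follows. This is only
an illustration of the arithmetic, not Lucene's actual code: the class and
method names are made up, and it ignores how the resulting integers are
serialized on disk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class DocDeltaEncoder {
    // docIds must be sorted in increasing order; freqs[i] is the term
    // frequency in docIds[i].
    static List<Integer> encode(int[] docIds, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - previousDoc; // step 2: delta-encode
            previousDoc = docIds[i];
            if (freqs[i] == 1) {
                out.add(delta * 2 + 1);          // steps 3 and 4: odd value, freq omitted
            } else {
                out.add(delta * 2);              // step 3: even value, freq follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Doc 7 with freq 1, doc 11 with freq 3.
        System.out.println(encode(new int[] {7, 11}, new int[] {1, 3})); // prints [15, 8, 3]
    }
}
```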

To decode, just perform the steps in reverse order:
1. Encoded data -> 15, 8, 3
2. When the doc delta is odd, it means that the frequency is omitted
and equal to 1 -> 15, 1, 8, 3
3. Divide deltas by 2, rounding down -> 7, 1, 4, 3
4. Restore absolute doc IDs -> 7, 1, 11, 3
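
Decoding can be sketched the same way. Again, this is a minimal
illustration with made-up names (not Lucene's reader code), and it assumes
the input is a well-formed sequence like the one in the example: the low
bit of each doc value tells whether an explicit frequency follows.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
final class DocDeltaDecoder {
    // Returns a flat array of (docId, freq) pairs.
    static int[] decode(int[] encoded) {
        int[] out = new int[encoded.length * 2]; // upper bound on the output size
        int n = 0;
        int doc = 0;
        for (int i = 0; i < encoded.length; i++) {
            int code = encoded[i];
            doc += code >>> 1;                   // halve (rounding down) and undo the delta
            // Odd value: freq was omitted and is 1. Even value: freq is the next entry.
            int freq = (code & 1) != 0 ? 1 : encoded[++i];
            out[n++] = doc;
            out[n++] = freq;
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decode(new int[] {15, 8, 3}))); // prints [7, 1, 11, 3]
    }
}
```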

-- 
Adrien
