+1 to what Mikhail wrote. This is how postings work, for example: instead of
interleaving doc IDs and frequencies, they always store a block of 128 doc
IDs followed by a block of 128 frequencies.
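
To make the difference between the two layouts concrete, here is a minimal
sketch in plain Java. It is only an illustration: BlockLayoutSketch and its
methods are made-up names, not Lucene APIs, and it uses plain int arrays
rather than any real postings encoding.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Lucene.
final class BlockLayoutSketch {
  static final int BLOCK_SIZE = 128; // block size mentioned above

  // Interleaved layout: docId0, freq0, docId1, freq1, ...
  static int[] interleaved(int[] docIds, int[] freqs) {
    checkSameLength(docIds, freqs);
    int[] out = new int[docIds.length * 2];
    for (int i = 0; i < docIds.length; i++) {
      out[2 * i] = docIds[i];
      out[2 * i + 1] = freqs[i];
    }
    return out;
  }

  // Blocked layout: all doc IDs of the block first, then all frequencies.
  static int[] blocked(int[] docIds, int[] freqs) {
    checkSameLength(docIds, freqs);
    int[] out = new int[docIds.length * 2];
    System.arraycopy(docIds, 0, out, 0, docIds.length);
    System.arraycopy(freqs, 0, out, docIds.length, freqs.length);
    return out;
  }

  private static void checkSameLength(int[] a, int[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("arrays must have the same length");
    }
  }

  public static void main(String[] args) {
    int[] docIds = {3, 7, 9};
    int[] freqs = {1, 4, 2};
    System.out.println(Arrays.toString(interleaved(docIds, freqs)));
    System.out.println(Arrays.toString(blocked(docIds, freqs)));
  }
}
```

With the blocked layout, each run holds values of a single kind, so a run of
docFreqs and a run of postings start offsets could each be packed with its
own fixed bit width.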

For reference, bit packing feels space-inefficient for this kind of data. I
would expect docFreqs to follow a Zipfian distribution, so the number of bits
per docFreq would be driven by the highest docFreq in the block even though
most values might be very low. Do you need random access into these doc
freqs and postings start offsets, or will you decode data for an entire
block every time anyway?
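
A rough way to see the cost, as a sketch only: the class and method names
below are hypothetical, and the 128-value block with a single large docFreq
is made-up data, not a measurement.

```java
// Hypothetical illustration only; the block contents are invented.
final class BitWidthCostSketch {
  // Bits needed to represent a non-negative value (at least 1 bit).
  static int bitsRequired(long value) {
    if (value < 0) {
      throw new IllegalArgumentException("value must be non-negative");
    }
    return Math.max(1, 64 - Long.numberOfLeadingZeros(value));
  }

  // Fixed-width packing: every value pays for the widest value in the block.
  static long fixedWidthBits(long[] block) {
    int width = 1;
    for (long v : block) {
      width = Math.max(width, bitsRequired(v));
    }
    return (long) block.length * width;
  }

  // Sum of each value's own minimal width. This ignores the per-value
  // length information a real variable-width encoding would also store.
  static long sumOfMinimalWidths(long[] block) {
    long total = 0;
    for (long v : block) {
      total += bitsRequired(v);
    }
    return total;
  }

  public static void main(String[] args) {
    long[] block = new long[128];
    java.util.Arrays.fill(block, 1L); // most docFreqs are tiny
    block[0] = 1000L;                 // one frequent term dominates
    System.out.println("fixed width: " + fixedWidthBits(block) + " bits");
    System.out.println("sum of minimal widths: " + sumOfMinimalWidths(block) + " bits");
  }
}
```

The second number is only an idealized floor for comparison, not something a
real format would reach, but the gap shows how a single outlier sets the
width for the whole block.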


On Tue, Oct 17, 2023 at 8:39 AM Mikhail Khludnev <m...@apache.org> wrote:

> Hello Tony
> Is it possible to write a block of docfreqs and then a block of
> postingoffsets?
> Or why not write them as 10-bit integers and then split them into a quad
> and a sextet in the posting format code?
>
> On Mon, Oct 16, 2023 at 11:50 PM Dongyu Xu <dongyu...@hotmail.com> wrote:
>
>> Hi devs,
>>
>> As I was working on https://github.com/apache/lucene/issues/12513 I
>> needed to compress positive integers which are used to locate postings etc.
>>
>> To put it concretely, I will need to pack a few values per term
>> contiguously, and those values can have different bit widths. For example,
>> consider that we need to encode docFreq and postingsStartOffset per term,
>> where docFreq takes 4 bits and postingsStartOffset takes 6 bits. We
>> expect to write the following for two terms.
>>
>> ```
>> | Term1                                     | Term2                                     |
>> | docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) | postingsStartOffset(6bit) |
>> ```
>>
>> On the read path, I expect to locate the offset for a term first and
>> then read two values that have different bit widths.
>>
>> In the spirit of not reinventing the wheel, I explored the
>> existing PackedInts util classes, and I believe there is no support for this
>> at the moment. The biggest gap I found is that the existing classes expect
>> to write/read values of the same bit width.
>>
>> I'm writing to get feedback from y'all to see if I missed anything.
>>
>> Cheers,
>> Tony X
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


-- 
Adrien
