+1 to what Mikhail wrote. This is, for example, how postings work: instead of interleaving doc IDs and frequencies, they always store a block of 128 doc IDs followed by a block of 128 frequencies.
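
To make the layout concrete, here is a rough sketch of the "block of doc IDs, then block of frequencies" idea. It is only an illustration: `BlockLayoutSketch`, `encodeBlock`, `decodeDocIds`, `decodeFreqs` and `bitsRequired` are hypothetical names, plain `int[]` arrays stand in for the packed on-disk representation, and none of this is the actual Lucene postings code.

```java
import java.util.Arrays;

/** Hypothetical illustration of a non-interleaved block layout; not Lucene's real code. */
final class BlockLayoutSketch {
    // Block size taken from the description above.
    static final int BLOCK_SIZE = 128;

    /** Lays out one block as [128 doc IDs][128 freqs] instead of id, freq, id, freq, ... */
    static int[] encodeBlock(int[] docIds, int[] freqs) {
        if (docIds.length != BLOCK_SIZE || freqs.length != BLOCK_SIZE) {
            throw new IllegalArgumentException("expected " + BLOCK_SIZE + " values per array");
        }
        int[] block = new int[2 * BLOCK_SIZE];
        System.arraycopy(docIds, 0, block, 0, BLOCK_SIZE);
        System.arraycopy(freqs, 0, block, BLOCK_SIZE, BLOCK_SIZE);
        return block;
    }

    /** Reads back only the doc IDs half of the block. */
    static int[] decodeDocIds(int[] block) {
        return Arrays.copyOfRange(block, 0, BLOCK_SIZE);
    }

    /** Reads back only the frequencies half of the block. */
    static int[] decodeFreqs(int[] block) {
        return Arrays.copyOfRange(block, BLOCK_SIZE, 2 * BLOCK_SIZE);
    }

    /**
     * Bits needed to represent the largest (non-negative) value in a sub-block.
     * Because each sub-block holds a single kind of value, it can use its own width.
     */
    static int bitsRequired(int[] values) {
        int max = 0;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }
}
```

With this layout every sub-block holds values of one kind, so each can be packed with a single bit width (say 4 bits for the docFreq block and 6 bits for the offsets block), which is the case the existing same-width utilities already handle.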

For reference, bit packing feels space-inefficient for this kind of data. I would expect docFreqs to have a Zipfian distribution, so you would end up using a number of bits per docFreq that is driven by the highest docFreq in the block, while most values might be very low.

Do you need random access into these doc freqs and postings start offsets, or will you decode data for an entire block every time anyway?

On Tue, Oct 17, 2023 at 8:39 AM Mikhail Khludnev <m...@apache.org> wrote:

> Hello Tony
> Is it possible to write a block of docfreqs and then a block of
> postingoffsets?
> Or why not write them as 10-bit integers and then split to quad and sextet
> in the posting format code?
>
> On Mon, Oct 16, 2023 at 11:50 PM Dongyu Xu <dongyu...@hotmail.com> wrote:
>
>> Hi devs,
>>
>> As I was working on https://github.com/apache/lucene/issues/12513 I
>> needed to compress positive integers which are used to locate postings etc.
>>
>> To put it concretely, I will need to pack a few values per term
>> contiguously and those values can have different bit-width. For example,
>> consider that we need to encode docFreq and postingsStartOffset per term
>> and docFreq takes 4 bit and the postingsStartOffset takes 6 bit. We
>> expect to write the following for two terms.
>>
>> ```
>> Term1 | Term2
>>
>> docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) |
>> postingsStartOffset(6bit)
>>
>> ```
>>
>> On the read path, I expect to locate the offset for a term first and
>> followed by reading two values that have different bit-width.
>>
>> In the spirit of not re-inventing necessarily, I tried to explore the
>> existing PackedInts util classes and I believe there is no support for this
>> at the moment. The biggest gap I found is that the existing classes expect
>> to write/read values of same bit-width.
>>
>> I'm writing to get feedback from yall to see if I missed anything.
>>
>> Cheers,
>> Tony X
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>

--
Adrien