If you want to test a different block size (say 64 or 256), I really
recommend just forking a separate codec for the experiment.

There will likely be higher-level changes you need to make, not just
changing a number. For example, if you just increased this number to 256
without doing anything else, I wouldn't be surprised if you saw worse
performance. More of the postings would be vint-encoded than with 128
(the tail of each postings list that doesn't fill a whole block), which
might have some consequences. The skip data layout might be inappropriate
too. These things are optimized for blocks of 128.
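To make the vint point concrete, here is a rough sketch in plain Java. It
is not Lucene code: the class and method names are made up for
illustration, and it simply assumes that full blocks are bit-packed and
whatever is left over at the end of a postings list is vint-encoded.

```java
// Illustrative only, not Lucene code. Assumption: full blocks are
// bit-packed and the leftover tail of each postings list is vint-encoded.
public final class BlockTailSketch {

    /** Postings in the partial last block, i.e. the vint-encoded tail. */
    public static int vintTail(int docFreq, int blockSize) {
        return docFreq % blockSize;
    }

    /** Postings that land in full, bit-packed blocks. */
    public static int packed(int docFreq, int blockSize) {
        return docFreq - vintTail(docFreq, blockSize);
    }

    public static void main(String[] args) {
        // Hypothetical per-term document frequencies.
        int[] docFreqs = {100, 200, 300, 1000};
        for (int df : docFreqs) {
            System.out.printf("docFreq=%d  tail@128=%d  tail@256=%d%n",
                df, vintTail(df, 128), vintTail(df, 256));
        }
    }
}
```

Under that assumption, a term with 200 postings has 72 of them in the vint
tail at a block size of 128, but all 200 at a block size of 256, since it
never fills a single block.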

Just in general, I recommend making a codec for benchmarking
experiments. Tools like luceneutil support comparing codecs against each
other anyway, so you can easily compare fairly against the existing codec.
Also, it should be much easier/faster to just make a new codec and adapt it
to test what you want!

I think it is an antipattern to make stuff within the codec "flexible": it
is autogenerated decompression code :) I am concerned such "flexibility"
would create barriers to future optimizations. For example, we should
be able to experiment with converting this compression code over to the
explicit Vector API in Java.

On Sat, Feb 27, 2021 at 4:29 PM Greg Miller <gsmil...@gmail.com> wrote:

> Hi folks!
>
> I've been a bit curious to test out different block size configurations in
> the Lucene postings list format, but thought I'd reach out to the community
> here first to see what work may have gone into this previously. I'm
> essentially interested in benchmarking different block size configurations
> on the real-world application of Lucene I'm working on.
>
> If my understanding of the code is correct, I know we're currently
> encoding compressed runs of 128 docs per block, relying on ForUtil for
> encoding/decoding purposes. It looks like we define this in
> ForUtil#BLOCK_SIZE (and reference it in a few external classes), but also
> know that it's not as simple as just changing that one definition. It
> appears much of the logic in ForUtil relies on the assumption of 128
> docs-per-block.
>
> I'm toying with the idea of making ForUtil a bit more flexible to allow
> for different block sizes to be tested in order to run the benchmarking I'd
> like to run, but the class looks heavily optimized to generate SIMD
> instructions (I think?), so that might be folly. Before I start hacking on
> a local branch to see what I can learn, is there any prior work that might
> be useful to be aware of? Anyone gone down this path and have some
> learnings to share? Any thoughts would be much appreciated!
>
> Cheers,
> -Greg
>
