Hi folks! I've been a bit curious to test out different block size configurations in the Lucene postings list format, but thought I'd reach out to the community here first to see what work may have gone into this previously. I'm essentially interested in benchmarking different block size configurations on the real-world application of Lucene I'm working on.
If my understanding of the code is correct, we currently encode compressed runs of 128 docs per block, relying on ForUtil for encoding and decoding. It looks like this is defined in ForUtil#BLOCK_SIZE (and referenced in a few external classes), but I also know it's not as simple as changing that one definition: much of the logic in ForUtil appears to rely on the assumption of 128 docs per block.

I'm toying with the idea of making ForUtil a bit more flexible so that different block sizes can be tested for the benchmarking I'd like to run, but the class looks heavily optimized to generate SIMD instructions (I think?), so that might be folly.

Before I start hacking on a local branch to see what I can learn, is there any prior work that would be useful to be aware of? Has anyone gone down this path and come away with learnings to share? Any thoughts would be much appreciated!

Cheers,
-Greg
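
P.S. In case it helps frame the question, here is a rough sketch of the kind of experiment I have in mind. To be clear, this is not ForUtil and does not use any Lucene API: the class name BlockPackSketch and its methods are made up for illustration, and it is a plain scalar bit-packer with none of the SIMD-friendly tricks. It only shows the block size as a parameter instead of a hard-coded 128, with any tail shorter than a full block left unpacked.

```java
import java.util.Arrays;

// Hypothetical sketch, NOT Lucene's ForUtil: packs doc ID deltas in fixed-size
// blocks, using one bit width per block, with the block size as a parameter.
public class BlockPackSketch {

    // Bits needed for the largest value in the block (at least 1).
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            or |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    // Packs values back to back into longs. Assumes every value fits in
    // bitsPerValue bits (1..32), e.g. as computed by bitsRequired.
    static long[] pack(int[] values, int bitsPerValue) {
        long[] out = new long[(values.length * bitsPerValue + 63) / 64];
        long bitPos = 0;
        for (int v : values) {
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long bits = v & 0xFFFFFFFFL;
            out[word] |= bits << shift;
            if (shift + bitsPerValue > 64) {
                // Value straddles two longs: spill the high bits into the next one.
                out[word + 1] |= bits >>> (64 - shift);
            }
            bitPos += bitsPerValue;
        }
        return out;
    }

    // Inverse of pack.
    static int[] unpack(long[] packed, int count, int bitsPerValue) {
        int[] values = new int[count];
        long mask = (1L << bitsPerValue) - 1;
        long bitPos = 0;
        for (int i = 0; i < count; i++) {
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bitsPerValue > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            values[i] = (int) (v & mask);
            bitPos += bitsPerValue;
        }
        return values;
    }

    // Packs and unpacks every full block of the given size; a tail shorter
    // than one block is copied through unpacked in this sketch.
    static int[] roundTrip(int[] deltas, int blockSize) {
        int[] out = Arrays.copyOf(deltas, deltas.length);
        for (int start = 0; start + blockSize <= deltas.length; start += blockSize) {
            int[] block = Arrays.copyOfRange(deltas, start, start + blockSize);
            int bpv = bitsRequired(block);
            long[] packed = pack(block, bpv);
            System.arraycopy(unpack(packed, blockSize, bpv), 0, out, start, blockSize);
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic deltas standing in for gaps between sorted doc IDs.
        int[] deltas = new int[300];
        for (int i = 0; i < deltas.length; i++) {
            deltas[i] = (i * 37) % 1000 + 1;
        }
        for (int blockSize : new int[] {64, 128, 256}) {
            boolean ok = Arrays.equals(deltas, roundTrip(deltas, blockSize));
            System.out.println("blockSize=" + blockSize + " roundTripOk=" + ok);
        }
    }
}
```

A real benchmark would of course need to go through the actual postings format and measure index size and query latency on real data. This only illustrates which knob I'd like to turn.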