Hi Greg,

I like that Lucene can scale to index sizes that are much larger than the amount of main memory, so I would like the default codec to keep optimizing for sequential reads. We do random access for some parts of the index, like the terms index and the points index, but the expectation is that they are so small that they would always fit in the page cache (they were actually in the JVM heap not long ago). A skip index for every postings list feels like something that could be much larger than the terms index, so I don't think it would be a good fit for the default codec.
We could still implement your idea in lucene/codecs for users who can force-load their index into memory; the speedup looks significant for queries that do lots of skipping! These results make me wonder if there are other ways we could get similar benefits while keeping a sequential access pattern when reading postings? (A rough sketch of the flat skip structure is included after the quoted message below.)

On Tue, Apr 27, 2021 at 21:03, Greg Miller <[email protected]> wrote:

> Hey folks-
>
> I've been experimenting with a different approach to implementing skip
> lists in Lucene, and would be curious if anyone else has tried
> something similar, or has early thoughts/feedback on what I'm doing.
>
> The idea is generally a simplification of what Lucene currently does,
> targeted at improving QPS at search time. Instead of treating skip
> lists as forward-iterating structures, I'm indexing a single, flat
> structure that I binary search when advancing. I'm indexing the same
> data present in the lowest level of the current structures (i.e., skip
> pointers to each block of 128 docs), and then binary searching those
> pointers to find the candidate block for the target docID (instead of
> only seeking forward).
>
> Before describing this idea in more detail (or opening a Jira), I'd be
> curious how this community thinks about disk access operations and
> what platforms/use-cases we primarily optimize for these days. The
> approach I've been experimenting with is essentially a non-starter if
> we assume that index accesses generally involve actually going to
> disk, especially if those disks spin. Random seeks all over the place
> are probably a terrible idea if they are actually hitting disk. In
> the cases I've been experimenting with, indexes are generally hot, so
> random seeks aren't much of a concern. Do we tend to optimize with
> this assumption in mind, or are we still pretty careful about
> use-cases that are actually doing a lot of disk IO?
>
> There are some other tricky implications with the approach I'm
> experimenting with (some edge cases around Impacts, and index size
> growth due to not being able to do as much delta-encoding), but it's
> not worth going into those until addressing the main idea of
> whether or not random seeks even make sense.
>
> I've done some early benchmarking with luceneutil (wikimedium10m
> specifically) and the idea looks like it might have some promise. I
> don't really see any regressions to QPS*, while a few tasks seem to
> show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow:
> 11.1%, OrHighLow: 12.3%).
> * (note: I've disabled Impacts in both baseline and candidate for
> these early experiments because of some challenges related to them...
> so these results carry a major asterisk right now and more work would
> certainly need to be done)
>
> Thanks in advance for any feedback! I'm completely open to shooting
> down this idea early if there are some fundamental flaws, or
> alternatively opening up a Jira to investigate this further if folks
> think it's reasonable to explore more :)
>
> Cheers,
> -Greg
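For anyone who wants to see the shape of the idea in code, below is a minimal sketch of the flat skip structure Greg describes: a single sorted array recording the last docID of each 128-doc postings block, binary searched on advance(target) instead of walking a multi-level skip list forward. All names here (FlatSkipSketch, lastDocInBlock, blockFilePointer) are hypothetical; this is not the actual Lucene implementation.

    import java.util.Arrays;

    // Hypothetical sketch, not Lucene's actual skip implementation: one flat,
    // sorted array holding the last docID of each 128-doc postings block,
    // binary searched when advancing instead of iterating skip levels forward.
    public class FlatSkipSketch {

      private final int[] lastDocInBlock;    // last docID of each block, ascending
      private final long[] blockFilePointer; // hypothetical pointer to each encoded block

      FlatSkipSketch(int[] lastDocInBlock, long[] blockFilePointer) {
        this.lastDocInBlock = lastDocInBlock;
        this.blockFilePointer = blockFilePointer;
      }

      // Index of the single candidate block for the target docID: the first
      // block whose last docID is >= target, or -1 if target is past the end.
      int blockFor(int target) {
        int idx = Arrays.binarySearch(lastDocInBlock, target);
        if (idx < 0) {
          idx = -idx - 1; // insertion point = first block ending at or after target
        }
        return idx < lastDocInBlock.length ? idx : -1;
      }

      public static void main(String[] args) {
        // Three blocks of 128 docs each; the last docIDs are made up for the demo.
        FlatSkipSketch skip =
            new FlatSkipSketch(new int[] {150, 410, 999}, new long[] {0L, 512L, 1024L});
        int block = skip.blockFor(200); // -> 1: only block 1 can contain docID 200
        System.out.println("candidate block " + block
            + " at file pointer " + skip.blockFilePointer[block]);
      }
    }

With one flat array per postings list, advancing to any target becomes a log-time lookup from anywhere in the list rather than a forward walk, which matches the reported gains on skip-heavy queries; the trade-off is exactly the random access pattern discussed above, which only pays off when the structure is resident in memory.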
