Thanks for the reply, Adrien! It makes sense to ensure the default codec continues to scale for large indexes that can't fit in a machine's physical memory. Thanks for the thoughts and context on the terms/points indexes, etc.
I'll look into how this idea could be spun off into a separate lucene/codecs implementation and open a Jira issue to track that work a little later this week. I'll also spend a little time thinking about whether or not there might be some sort of hybrid solution that enables some of these gains while maintaining the sequential access. I don't have anything off the top of my head, but maybe if I put a draft of my change up as a PR (under lucene/codecs) it will spark some other ideas!

Thanks again for your thoughts!

Cheers,
-Greg

On Tue, Apr 27, 2021 at 2:23 PM Adrien Grand <[email protected]> wrote:
>
> Hi Greg,
>
> I like that Lucene can scale to index sizes that are much larger than the amount of main memory, so I would like the default codec to keep optimizing for sequential reads. We do random access for some parts of the index, like the terms index and the points index, but the expectation is that they are so small that they would always fit in the page cache (they were actually in the JVM heap not long ago). A skip index for every postings list feels like something that could be much larger than the terms index, so I don't think it would be a good fit for the default codec.
>
> We could still implement your idea in lucene/codecs for users who can force-load their index into memory; the speedup looks significant for queries that do lots of skipping!
>
> These results make me wonder if there are other ways we could get similar benefits while keeping a sequential access pattern when reading postings?
>
> On Tue, Apr 27, 2021 at 9:03 PM Greg Miller <[email protected]> wrote:
>>
>> Hey folks-
>>
>> I've been experimenting with a different approach to implementing skip lists in Lucene, and would be curious whether anyone else has tried something similar, or has early thoughts/feedback on what I'm doing.
>>
>> The idea is generally a simplification of what Lucene currently does, targeted at improving QPS at search time. Instead of treating skip lists as forward-iterating structures, I'm indexing a single, flat structure that I binary search when advancing. I'm indexing the same data present in the lowest level of the current structures (i.e., skip pointers to each block of 128 docs), and then binary searching those pointers to find the candidate block for the target docID (instead of only seeking forward).
>>
>> Before describing this idea in more detail (or opening a Jira), I'd be curious how this community thinks about disk access operations and what platforms/use-cases we primarily optimize for these days. The approach I've been experimenting with is essentially a non-starter if we assume that index accesses generally involve actually going to disk, especially if those disks spin. Random seeks all over the place are probably a terrible idea if they are actually hitting disk. In the cases I've been experimenting with, indexes are generally hot, so random seeks aren't much of a concern. Do we tend to optimize with this assumption in mind, or are we still pretty careful about use-cases that are actually doing a lot of disk IO?
>>
>> There are some other tricky implications with the approach I'm experimenting with (some edge cases around Impacts, and index size growth due to not being able to do as much delta-encoding), but it's not worth going into those until addressing the main question of whether random seeks even make sense.
>>
>> I've done some early benchmarking with luceneutil (wikimedium10m specifically) and the idea looks like it might have some promise. I don't really see any regressions to QPS*, while a few tasks seem to show significant QPS improvements (AndHighLow: 9.8%, OrNotHighLow: 11.1%, OrHighLow: 12.3%).
>>
>> * (note: I've disabled Impacts in both the baseline and candidate for these early experiments because of some challenges related to them... so these results carry a major asterisk right now and more work would certainly need to be done)
>>
>> Thanks in advance for any feedback! I'm completely open to shooting down this idea early if there are some fundamental flaws, or alternatively opening up a Jira to investigate this further if folks think it's reasonable to explore more :)
>>
>> Cheers,
>> -Greg
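For context, here is a minimal sketch of the "flat skip pointers + binary search" access pattern Greg describes in the quoted message. It assumes one entry per block of 128 docs, recording that block's last docID and a file pointer to the block. The names (FlatSkipData, blockFor) are purely illustrative; this is not Lucene's API and not the actual patch.

import java.util.Arrays;

final class FlatSkipData {
  // One entry per block of 128 docs: the last docID contained in the block,
  // and the file pointer where that block's postings start.
  private final int[] lastDocInBlock;   // sorted ascending
  private final long[] blockFilePointer;

  FlatSkipData(int[] lastDocInBlock, long[] blockFilePointer) {
    this.lastDocInBlock = lastDocInBlock;
    this.blockFilePointer = blockFilePointer;
  }

  /**
   * Returns the file pointer of the first block that may contain a doc >= target,
   * or -1 if no such block exists. This is a plain binary search over a flat
   * array rather than a forward walk over multi-level skip data.
   */
  long blockFor(int target) {
    int idx = Arrays.binarySearch(lastDocInBlock, target);
    if (idx < 0) {
      idx = -idx - 1; // insertion point: first block whose last doc is >= target
    }
    return idx < lastDocInBlock.length ? blockFilePointer[idx] : -1;
  }

  public static void main(String[] args) {
    // Three blocks whose last docIDs are 127, 310 and 1023 (hypothetical values).
    FlatSkipData skip = new FlatSkipData(
        new int[] {127, 310, 1023},
        new long[] {0L, 4096L, 8192L});
    System.out.println(skip.blockFor(200));  // 4096: doc 200 can only be in the 2nd block
    System.out.println(skip.blockFor(2000)); // -1: past the last block
  }
}

The trade-off under discussion in the thread is visible here: advance(target) can jump directly to any candidate block, including one behind the current position, but each lookup is a random access into the skip data rather than a sequential read.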
