On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> The segments_N file can be different, that's fine: after that, we then
> re-use SegmentReaders when they are in common between the two commit
> points. Each segments_N file refers to many segments...

Yes, you are totally right - I didn't follow the code far enough the first
time around. :) This is an excellent idea, actually - I can probably keep
the maintained commit points in an MRU data structure (e.g. a LinkedHashMap
with access order), and simply grab the most recently opened reader to pass
in when obtaining a new one from the new commit point, to maximize segment
reader reuse.

> You can set it (min and max) as high as you want; the only hard
> requirement is that max >= 2*(min-1), I believe.

Looks like this is used inside Lucene41PostingsFormat, which simply passes
in those defaults - so you are effectively saying the minimum (and
therefore, maximum) block size can be raised to reduce the size of the
terms index inside those TreeMap nodes?

> > We are already using a customized codec though, so perhaps adding
> > this to the codec is okay and transparent?
>
> Hmmm :) Customized in what manner?

We need to have the ability to turn off stored fields compression, so there
is one codec for the case where the system is configured that way. The
other one exists for when compression is on, but there I tweaked the stored
fields format to bias it toward decompression speed and to use a smaller
chunk size, based on some empirical observations from the tests we ran. I am
guessing I'll just add another customization to both codecs that deals with
the block sizing for the postings format, and see what difference that
makes...
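
For what it's worth, here is a minimal sketch of the MRU idea mentioned above, just to make it concrete. It uses only java.util, and the class name MruReaderCache, the generic reader type R, and keying by commit generation are all placeholders of mine, not anything from Lucene. In real code R would presumably be DirectoryReader, and the value returned by mostRecent() would be the old reader handed in when opening on the new commit point.

```java
import java.util.LinkedHashMap;

/** Hypothetical MRU holder for open readers, keyed by commit generation. */
class MruReaderCache<R> {
    // accessOrder=true: iteration runs from least to most recently accessed.
    private final LinkedHashMap<Long, R> readers =
            new LinkedHashMap<>(16, 0.75f, true);

    /** Registers a reader for a commit generation; it becomes the most recent entry. */
    void put(long generation, R reader) {
        readers.put(generation, reader);
    }

    /** Looks up a reader; a hit also marks it as most recently used. */
    R get(long generation) {
        return readers.get(generation);
    }

    /** Returns the most recently used reader, or null if nothing is cached. */
    R mostRecent() {
        R last = null;
        // Iterating the values view does not change the access order.
        for (R r : readers.values()) {
            last = r;
        }
        return last;
    }

    public static void main(String[] args) {
        MruReaderCache<String> cache = new MruReaderCache<>();
        cache.put(1L, "reader@gen1");
        cache.put(2L, "reader@gen2");
        cache.get(1L); // touching generation 1 makes it the most recent
        System.out.println(cache.mostRecent()); // prints "reader@gen1"
    }
}
```

This leaves out eviction and closing of readers entirely; walking the map to find the last entry is O(n), which seems fine for a handful of retained commit points.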