On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>> maybe that's because we have one huge monolithic implementation
>
> Doesn't the DocValues branch solve this?

Hopefully DocValues will replace FieldCache over time; maybe some day
we can deprecate & remove FieldCache.

But we still have work to do there, I believe; eg we don't have
comparators for all types (on the docvalues branch) yet.

> Also, instead of trying to implement clever ways of compressing
> strings in the field cache, which probably won't bear fruit, I'd
> prefer to look at [eventually] MMap'ing (using DV) the field caches to
> avoid the loading and heap costs, which are significant. I'm not sure
> if we can easily MMap packed ints and the shared byte[], though it
> seems fairly doable?

In fact, the packed ints and the byte[] packing of terms data are very
much amenable/necessary for using MMap, far more so than the separate
objects we had before.

I agree we should make an mmap option, though I would generally
recommend against apps using mmap for these caches. We load these
caches so that we'll have fast random access to potentially a great
many documents during collection of one query (eg for sorting). When
you mmap them you let the OS decide when to swap stuff out, which means
you pick up potentially high query latency waiting for these pages to
swap back in. Various other data structures in Lucene need this fast
random access (norms, del docs, terms index) and that's why we put them
in RAM. I do agree for all else (the laaaarge postings), MMap is great.

Of course the OS swaps out process RAM anyway, so... it's kinda moot
(unless you've fixed your OS to not do this, which I always do!).

I think a more productive area of exploration (to reduce RAM usage)
would be to make a StringFieldComparator that doesn't need full access
to all terms data, ie, one that operates per segment yet only does a
"few" ord lookups when merging the results across segments. If "few" is
small enough we can just use the seek-by-ord from the terms dict to do
them.
This would be a huge RAM reduction because we could then sort by string
fields (eg "title" field) without needing all term bytes randomly
accessible.

Mike

http://blog.mikemccandless.com
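
To make the per-segment comparator idea above concrete, here is a rough
sketch in plain Java. It is not Lucene code: Segment, Hit, OrdSortSketch,
topDocsByOrd and lookupTerm are all hypothetical names made up for
illustration, and a sorted String[] stands in for the terms dict's
seek-by-ord. It assumes each segment exposes a per-doc ord that sorts
the same way as the term it points to. Within a segment, docs are ranked
by comparing ints only; term bytes are resolved just for the n
candidates each segment contributes to the merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for one segment; not a Lucene class. */
final class Segment {
    private final String[] sortedTerms; // plays the role of the terms dict
    private final int[] docToOrd;       // per-doc ordinal into sortedTerms
    int lookups = 0;                    // counts simulated seek-by-ord calls

    Segment(String[] sortedTerms, int[] docToOrd) {
        this.sortedTerms = sortedTerms;
        this.docToOrd = docToOrd;
    }

    int maxDoc() { return docToOrd.length; }

    int ord(int doc) { return docToOrd[doc]; }

    /** Simulated seek-by-ord: the only place term data is touched. */
    String lookupTerm(int ord) {
        lookups++;
        return sortedTerms[ord];
    }
}

/** One merged result: where the doc lives plus its resolved term. */
final class Hit {
    final int segment;
    final int doc;
    final String term;

    Hit(int segment, int doc, String term) {
        this.segment = segment;
        this.doc = doc;
        this.term = term;
    }
}

final class OrdSortSketch {
    /** Best n docs of one segment (lowest ords first), using int compares only. */
    static List<Integer> topDocsByOrd(Segment seg, int n) {
        Comparator<Integer> byOrd = (a, b) -> {
            int c = Integer.compare(seg.ord(a), seg.ord(b));
            return c != 0 ? c : Integer.compare(a, b); // tie-break on doc id
        };
        // Reversed comparator: the worst kept doc sits on top and is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>(byOrd.reversed());
        for (int doc = 0; doc < seg.maxDoc(); doc++) {
            heap.add(doc);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<Integer> kept = new ArrayList<>(heap);
        kept.sort(byOrd);
        return kept;
    }

    /** Merge per-segment winners; at most n term lookups per segment. */
    static List<Hit> topN(List<Segment> segments, int n) {
        List<Hit> candidates = new ArrayList<>();
        for (int s = 0; s < segments.size(); s++) {
            Segment seg = segments.get(s);
            for (int doc : topDocsByOrd(seg, n)) {
                candidates.add(new Hit(s, doc, seg.lookupTerm(seg.ord(doc))));
            }
        }
        // Ords are not comparable across segments, so the merge compares terms.
        candidates.sort(Comparator.comparing((Hit h) -> h.term)
                .thenComparingInt(h -> h.segment)
                .thenComparingInt(h -> h.doc));
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }

    public static void main(String[] args) {
        Segment s0 = new Segment(new String[] {"apple", "mango", "zebra"}, new int[] {2, 0, 1});
        Segment s1 = new Segment(new String[] {"banana", "cherry"}, new int[] {1, 0, 1, 1});
        for (Hit h : topN(Arrays.asList(s0, s1), 2)) {
            System.out.println(h.term + " (segment " + h.segment + ", doc " + h.doc + ")");
        }
        System.out.println("term lookups: " + (s0.lookups + s1.lookups));
    }
}
```

In this sketch the number of term lookups is bounded by (number of
segments) x n, independent of how many docs or unique terms the index
holds. Whether that is "few" enough for seek-by-ord to be fast in
practice is exactly the open question.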