I read over the LUCENE-1458 comments again. Interesting. I think
the most compelling argument is that the various files we
normally load into the heap are, after merging, already in the
OS IO cache. If we can simply reuse the IO cache rather than
allocate a bunch of redundant arrays on the heap, we could be
better off. I think this is very compelling for field caches,
delDocs, and bitsets that are tied to segments and reloaded
after each merge.

I think it's possible to write some basic benchmarks comparing a
byte[] BitVector vs. a MappedByteBuffer BitVector and see what
happens.
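
Something along these lines is what I have in mind, purely a
sketch (the class name, file name and iteration counts are made
up, and it compares raw byte[] access against
MappedByteBuffer.get() rather than the real BitVector class):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Random;

public class BitGetBench {
  public static void main(String[] args) throws Exception {
    int numBits = 10_000_000;
    int numBytes = (numBits + 7) >> 3;

    // Heap-resident bits
    byte[] heapBits = new byte[numBytes];

    // Same number of bits backed by an mmapped file
    MappedByteBuffer mapped;
    try (RandomAccessFile raf = new RandomAccessFile("bits.bin", "rw")) {
      raf.setLength(numBytes);
      mapped = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, numBytes);
    }

    Random rnd = new Random(42);
    long hits = 0;
    long t0 = System.nanoTime();
    for (int i = 0; i < 50_000_000; i++) {
      int bit = rnd.nextInt(numBits);
      if ((heapBits[bit >> 3] & (1 << (bit & 7))) != 0) hits++;
    }
    long heapNs = System.nanoTime() - t0;

    rnd = new Random(42);
    t0 = System.nanoTime();
    for (int i = 0; i < 50_000_000; i++) {
      int bit = rnd.nextInt(numBits);
      if ((mapped.get(bit >> 3) & (1 << (bit & 7))) != 0) hits++;
    }
    long mappedNs = System.nanoTime() - t0;

    System.out.println("heap=" + heapNs / 1e6 + "ms mapped="
        + mappedNs / 1e6 + "ms (hits=" + hits + ")");
  }
}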

The other potentially interesting angle here is realtime
updates, where we could implement an mmapped page type of
system so blocks of this stuff can be updated in near realtime,
directly in the mmapped space (similar to how, in heap land
with LUCENE-1526, we're looking at breaking up the byte[] into
a byte[][]). A rough sketch of what I mean is below.
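
For example (again just a sketch with made-up names; a real
version would need to deal with remapping, force()/flushing and
growing the file):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Bits live in fixed-size pages, each page its own MappedByteBuffer
// region over the same file, so one page can be updated (or remapped)
// without touching the rest -- mirroring the byte[] -> byte[][] split
// discussed for the heap case.
class PagedMappedBits {
  private static final int PAGE_BYTES = 1 << 16;   // 64 KB pages
  private final MappedByteBuffer[] pages;

  PagedMappedBits(String path, long numBits) throws IOException {
    long numBytes = (numBits + 7) >> 3;
    int numPages = (int) ((numBytes + PAGE_BYTES - 1) / PAGE_BYTES);
    pages = new MappedByteBuffer[numPages];
    try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
      raf.setLength(numBytes);
      FileChannel ch = raf.getChannel();
      for (int p = 0; p < numPages; p++) {
        long off = (long) p * PAGE_BYTES;
        long len = Math.min(PAGE_BYTES, numBytes - off);
        pages[p] = ch.map(FileChannel.MapMode.READ_WRITE, off, len);
      }
    }
  }

  boolean get(long bit) {
    long byteIndex = bit >> 3;
    MappedByteBuffer page = pages[(int) (byteIndex / PAGE_BYTES)];
    return (page.get((int) (byteIndex % PAGE_BYTES)) & (1 << (int) (bit & 7))) != 0;
  }

  void set(long bit) {
    long byteIndex = bit >> 3;
    MappedByteBuffer page = pages[(int) (byteIndex / PAGE_BYTES)];
    int idx = (int) (byteIndex % PAGE_BYTES);
    page.put(idx, (byte) (page.get(idx) | (1 << (int) (bit & 7))));
  }
}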

Also, if we assume the data is mmapped, I don't think it matters
as much whether the updates on disk are sequential (whereas
today we try to keep all our files optimized for sequential
reads). Of course I could be completely wrong. :)

On Wed, Jun 10, 2009 at 5:19 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Wed, Jun 10, 2009 at 7:23 PM, Jason
> Rutherglen<jason.rutherg...@gmail.com> wrote:
> > Cool! Sounds like with LUCENE-1458 we can experiment with some
> > of these things. Does CSF become just another codec?
>
> I believe LUCENE-1458 currently only makes terms dict & postings
> pluggable...
>
> >> I'm leery of having terms dict live entirely on disk, though
> >> we should certainly explore it.
> >
> > Yeah, it should theoretically help with reloading; it could use
> > a skip list (as we have a disk version of that implemented)
> > instead of binary search. It seems like with things like
> > TrieRange (which potentially adds many fields and terms) it
> > could be useful to let the OS IO cache decide what we need in
> > RAM and what we don't; otherwise we're constantly at risk of
> > exceeding heap usage. There'll be other potential RAM issues
> > (such as page faults), but it seems like users will constantly
> > be up against the inability to precalculate the Java heap usage
> > of data structures (whereas file-based data usage can be
> > measured). Norms are another example, and with flexible
> > indexing (and scoring?) there may be additional fields the user
> > may want to change dynamically which, if completely loaded into
> > the heap, could cause OOM problems.
> >
> > I guess I personally think it would be great not to worry about
> > exceeding the heap with Lucene apps (as it's a guessing game);
> > then one can simply analyze the OS-level IO cache/swap space to
> > see if the app could slow down because the machine doesn't have
> > enough RAM. I think this would remove one of the major
> > differences between a Java-based search engine and a C++-based
> > one.
>
> Marvin and I discussed this quite a bit already in LUCENE-1458... we
> should make it pluggable and then try both -- let the machine tell
> us ;)
>
> Mike
>