Re: Realtime Search

Marvin Humphrey Fri, 26 Dec 2008 13:53:39 -0800

Robert,

Three exchanges ago in this thread, you made the incorrect assumption that the
motivation behind using mmap was read speed, and that memory mapping was being
waved around as some sort of magic wand:


    Is there something that I am missing? I see lots of references to
    using "memory mapped" files to "dramatically" improve performance.

    I don't think this is the case at all. At the lowest levels, it is
    somewhat more efficient from a CPU standpoint, but with a decent OS
    cache the IO performance difference is going to be negligible.

In response, I indicated that the mmap design had been discussed in JIRA, and
pointed you at a particular issue.

    There have been substantial discussions about this design in JIRA,
    notably LUCENE-1458.

    The "dramatic" improvement is WRT opening/reopening an IndexReader.

Apparently, you did not go back to read that JIRA thread, because you
subsequently offered a critique of a purely invented design you assumed we
must have arrived at, and continued to argue with a straw man about read
speed:

    1. with "fixed" size terms, the additional IO (larger pages) probably  
    offsets a lot of the random access benefit. This is why "compressed"  
    disks on a fast machine (CPU) are often faster than "uncompressed" -  
    more data is read during every IO access.

While my reply did not specifically point back to LUCENE-1458 again, I hoped
that having your foolish assumption exposed would motivate you to go back and
read it, so that you could offer an informed critique of the *actual* design.
I also linked to a specific comment in LUCENE-831 which explained how mmap
applied to sort caches.

    Additionally, sort caches would be written at index time in three files, and
    memory mapped as laid out in
    <https://issues.apache.org/jira/browse/LUCENE-831?focusedCommentId=12656150#action_12656150>.

Apparently you still didn't go back and read up, because you subsequently made
a third incorrect assumption, this time about plans to do away with the term
dictionary index.  In response I griped about JIRA again, using slightly
stronger but still intentionally indirect language.

    No.  That idea was entertained briefly and quickly discarded.  There seems
    to be an awful lot of irrelevant noise in the current thread arising due
    to lack of familiarity with the ongoing discussions in JIRA.

Unfortunately, this must not have worked either, because you have now offered a
fourth message based on incorrect assumptions which would have been remedied by
bringing yourself up to date with the relevant JIRA threads.

> That could very well be, but I was referencing your statement:
> 
> "1) Design index formats that can be memory mapped rather than slurped,
>      bringing the cost of opening/reopening an IndexReader down to a
>      negligible level."
> 
> The only reason to do this (or have it happen) is if you perform a binary
> search on the term index.

No.  As discussed in LUCENE-1458, LUCENE-1483, the specific link I pointed you
towards in LUCENE-831, the message where I provided you with that link, and
elsewhere in this thread... loading the term dictionary index is important, but
the cost pales in comparison to the cost of loading sort caches.  
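
To make the open-cost distinction concrete, here is a rough sketch of "slurping"
versus memory mapping a sort cache.  It is an illustration only, not code from
Lucy or Lucene: the class name SortCacheOpenSketch, the methods slurp, map and
ordAt, and the file layout (one big-endian 32-bit ordinal per document) are all
assumptions made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: the names and the file layout are assumptions.
class SortCacheOpenSketch {

    // "Slurp": read and decode every ordinal onto the heap up front.
    // The work done at open time grows with the size of the file.
    static int[] slurp(Path file) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(file));
        int[] ords = new int[buf.remaining() / Integer.BYTES];
        for (int i = 0; i < ords.length; i++) {
            ords[i] = buf.getInt();
        }
        return ords;
    }

    // "Map": ask the OS to map the file.  No ordinals are decoded here;
    // pages are brought in by the OS when they are first touched.
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Look up one document's ordinal directly in the mapped bytes.
    static int ordAt(ByteBuffer mapped, int docId) {
        return mapped.getInt(docId * Integer.BYTES);
    }

    public static void main(String[] args) throws IOException {
        // Write a small throwaway cache file: ordinal for doc i is i * 3.
        Path file = Files.createTempFile("sortcache", ".bin");
        file.toFile().deleteOnExit();
        int docCount = 1000;
        ByteBuffer out = ByteBuffer.allocate(docCount * Integer.BYTES);
        for (int i = 0; i < docCount; i++) {
            out.putInt(i * 3);
        }
        Files.write(file, out.array());

        int[] heapOrds = slurp(file);
        MappedByteBuffer mappedOrds = map(file);
        System.out.println("slurped: " + heapOrds[7]);
        System.out.println("mapped:  " + ordAt(mappedOrds, 7));
    }
}
```

Both paths return the same ordinals.  The difference is when the work happens:
slurp touches every byte before the reader is usable, while map defers the I/O
to the OS cache.  That is why the sketch speaks to open/reopen cost and not to
read speed.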

> Using a 2 file system is going to be WAY slower - I'll bet lunch. It might be
> workable if the files were on a striped drive, or put each file on a different
> drive/controller, but requiring such specially configured hardware is not a
> good idea. In the common case (single drive), you are going to be seeking all
> over the place.

Mike McCandless and I had an extensive debate about the pros and cons of
depending on the OS cache to hold the term dictionary index under LUCENE-1458.
The concerns you express here were fully addressed, and even resolved under an
"agree to disagree" design.

> Also, the mmap is only suitable for 64 bit platforms, since there is no way
> in Java to unmap, you are going to run out of address space as segments are
> rewritten.

The discussion of how the mmap design translates from Lucy to Lucene is an
important one, but I despair of having it if we have to rehash all of
LUCENE-1458, LUCENE-831, and possibly LUCENE-1476 and LUCENE-1483 because you
cannot be troubled to bring yourself up to speed before commenting.

You are obviously knowledgeable on the subject of low level memory issues.  Me
and Mike McCandless ain't exactly chopped liver, though, and neither are a lot
of other people around here who *are* bothering to keep up with the threads in
JIRA.  I request that you show the rest of us more respect.  Our time is
valuable, too.

Marvin Humphrey


