> I read from Peter Keegan's recent postings:
> "The Lucene server is using MMapDirectory. I'm running
> the jvm with -Xmx16000M. Peak memory usage of the jvm
> on Linux is about 6GB and 7.8GB on Windows."
>
> We don't have nearly as much memory as Peter but I
> wonder whether he is gaining anything with such
> a large heap.
My application gets better throughput with more VM, but that is
probably due to heavy use of ByteBuffers in the application, not
VM for Lucene.

Peter

On 3/12/06, kent.fitch <[EMAIL PROTECTED]> wrote:
>
> I thought I'd post some good news about MMapDirectory, as
> the comments in the release notes are quite downbeat about
> its performance. In some environments MMapDirectory
> provides a big improvement.
>
> Our test application is an index of 11.4 million
> documents which are derived from MARC (bibliographic)
> catalogue records. Our aim is to build a system
> to demonstrate relevance ranking and result clustering
> for library union catalogue searching (a "union"
> catalogue accumulates/merges records from multiple
> libraries).
>
> Our main index component sizes:
>   fdt  17 GB
>   fdx  91 MB
>   tis  82 MB
>   frq  45 MB
>   prx  11 MB
>   tii  1.2 MB
>
> We have a separate Lucene index (not discussed further)
> which stores the MARC records.
>
> Each document has many fields. We'll probably reduce the
> number after we decide on the best search strategies, but
> lots of fields give us lots of flexibility whilst testing
> search and ranking strategies.
>
> Stored and unindexed fields, used for summary results:
>   display title
>   display author
>   display publication details
>   holdingsCount (number of libraries holding)
>
> Tokenized indices:
>   title
>   author
>   subject
>   genre
>   keyword (all text)
>
> Keyword (untokenized) indices:
>   title
>   author
>   subject
>   genre
>   audience
>   Dewey/LC classification
>   language
>   isbn/issn
>   publication date (date range code)
>   unique bibliographic id
>
> "Wildcard" tokenized indices, created by a custom "stub"
> analyzer which reduces a term to its first few characters:
>   title
>   author
>   subject
>   keyword
>
> Field boosts are set for some fields. For example, "title",
> "sub title", "series title" and "component title" are all
> stored as "title" but with different field boosts (as a
> match on the normal title is deemed more relevant than a
> match on the series title).
>
> The document boost is set to the square root of the
> holdingsCount (favouring "popular" resources).
>
> The user interface supports searching and refining searches
> on specific fields, but the most common search is created
> from a single Google-style search box. Here's a typical
> query generated from a two-word search:
>
>   +(titleWords:"franz kafka^4.0"
>     authorWords:"franz kafka^3.0"
>     subjectWords:"franz kafka^3.0"
>     keywords:"franz kafka^1.4"
>     title:franz kafka^4.0
>     (+titleWords:franz +titleWords:kafka^3.0)
>     author:franz kafka^3.0
>     (+authorWords:franz +authorWords:kafka^2.0)
>     subject:franz kafka^3.0
>     (+subjectWords:franz +subjectWords:kafka^1.5)
>     (+genreWords:franz +genreWords:kafka^2.0)
>     (+keywords:franz +keywords:kafka)
>     (+titleWildcard:fra +titleWildcard:kaf^0.7)
>     (+authorWildcard:fra +authorWildcard:kaf^0.7)
>     (+subjectWildcard:fra +subjectWildcard:kaf^0.7)
>     (+keywordWildcard:fra +keywordWildcard:kaf^0.2)
>   )
>
> It generated 1635 hits. We then read the first 700
> documents in the hit list and extract the date, subject,
> author, genre, Dewey/LC classification and audience
> fields for each, accumulating the popularity of each value.
>
> Using this data, for each of the subject, author, genre,
> Dewey/LC and audience categories, we find the 30 most
> popular field values, and for each of these we query the
> index to find its frequency in the entire index.
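For illustration, a minimal sketch of what a "stub" analyzer of the kind
described above might look like against the pre-2.9 TokenStream API. The
class name, the prefix length of 3 (chosen to match the "fra"/"kaf" stubs
in the query above) and the use of LowerCaseTokenizer are assumptions, not
details taken from Kent's code.

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Hypothetical "stub" analyzer: keeps only the first few characters of
    // each term, so that e.g. "kafka" is indexed (and searched) as "kaf".
    public class StubAnalyzer extends Analyzer {

        private static final int STUB_LENGTH = 3; // assumed prefix length

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new StubFilter(new LowerCaseTokenizer(reader));
        }

        private static class StubFilter extends TokenFilter {
            StubFilter(TokenStream input) {
                super(input);
            }

            public Token next() throws IOException {
                Token t = input.next();
                if (t == null) return null;
                String text = t.termText();
                if (text.length() <= STUB_LENGTH) return t;
                // Replace the term text with its first few characters,
                // keeping the original offsets and token type.
                return new Token(text.substring(0, STUB_LENGTH),
                                 t.startOffset(), t.endOffset(), t.type());
            }
        }
    }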
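Similarly, a rough sketch of how a boosted multi-field query along the lines
of the one shown above could be assembled with BooleanQuery and the
BooleanClause.Occur API (available from Lucene 1.9 onwards). Only a subset
of the clauses is shown, and the helper names, structure and boost wiring
are assumptions rather than Kent's actual code.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Hypothetical helper: turns the words typed into the single search box
    // into one boosted clause per field, roughly like the query shown above.
    public class UnionCatalogueQueryBuilder {

        // Phrase clause, e.g. titleWords:"franz kafka" with a boost.
        private static Query phrase(String field, String[] words, float boost) {
            PhraseQuery q = new PhraseQuery();
            for (int i = 0; i < words.length; i++) {
                q.add(new Term(field, words[i]));
            }
            q.setBoost(boost);
            return q;
        }

        // "All words must occur" clause, e.g. (+titleWords:franz +titleWords:kafka).
        private static Query allWords(String field, String[] words, float boost) {
            BooleanQuery q = new BooleanQuery();
            for (int i = 0; i < words.length; i++) {
                q.add(new TermQuery(new Term(field, words[i])), BooleanClause.Occur.MUST);
            }
            q.setBoost(boost);
            return q;
        }

        public static Query build(String[] words) {
            BooleanQuery any = new BooleanQuery();
            // Phrase matches on the tokenized fields, boosted highest.
            any.add(phrase("titleWords", words, 4.0f), BooleanClause.Occur.SHOULD);
            any.add(phrase("authorWords", words, 3.0f), BooleanClause.Occur.SHOULD);
            any.add(phrase("subjectWords", words, 3.0f), BooleanClause.Occur.SHOULD);
            any.add(phrase("keywords", words, 1.4f), BooleanClause.Occur.SHOULD);
            // All-words matches, boosted a little lower. The "wildcard" stub
            // clauses and the untokenized-field clauses are omitted here.
            any.add(allWords("titleWords", words, 3.0f), BooleanClause.Occur.SHOULD);
            any.add(allWords("authorWords", words, 2.0f), BooleanClause.Occur.SHOULD);
            any.add(allWords("subjectWords", words, 1.5f), BooleanClause.Occur.SHOULD);
            any.add(allWords("keywords", words, 1.0f), BooleanClause.Occur.SHOULD);
            // Wrapping the disjunction as a required clause mirrors the
            // leading "+(" in the query printed above.
            BooleanQuery query = new BooleanQuery();
            query.add(any, BooleanClause.Occur.MUST);
            return query;
        }
    }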
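And a sketch of the clustering pass described above: read up to the first
700 hits, tally the values of one category field, keep the 30 most popular,
and look up each value's frequency across the whole index (here via
Searcher.docFreq). It assumes the category values are stored fields and uses
the Hits API of that Lucene generation; the names and structure are again
assumptions, not Kent's implementation.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Hypothetical helper illustrating the "cluster the first 700 hits" step.
    public class HitClusterer {

        // Count how often each value of `field` occurs in the first `sample`
        // hits, then return the `top` most popular values together with their
        // whole-index document frequency.
        public static List topValues(IndexSearcher searcher, Query query,
                                     String field, int sample, int top)
                throws IOException {
            Hits hits = searcher.search(query);
            Map counts = new HashMap();              // field value -> count in sample
            int n = Math.min(sample, hits.length()); // e.g. sample = 700
            for (int i = 0; i < n; i++) {
                Document doc = hits.doc(i);
                String[] values = doc.getValues(field);
                if (values == null) continue;
                for (int j = 0; j < values.length; j++) {
                    Integer c = (Integer) counts.get(values[j]);
                    counts.put(values[j], new Integer(c == null ? 1 : c.intValue() + 1));
                }
            }
            // Sort values by popularity within the sample, most popular first.
            List entries = new ArrayList(counts.entrySet());
            Collections.sort(entries, new Comparator() {
                public int compare(Object a, Object b) {
                    int ca = ((Integer) ((Map.Entry) a).getValue()).intValue();
                    int cb = ((Integer) ((Map.Entry) b).getValue()).intValue();
                    return cb - ca;
                }
            });
            // For the top values (e.g. top = 30), also fetch the frequency of
            // the term in the entire index, used to scale the rendered text.
            List result = new ArrayList();
            for (int i = 0; i < Math.min(top, entries.size()); i++) {
                Map.Entry e = (Map.Entry) entries.get(i);
                String value = (String) e.getKey();
                int indexFreq = searcher.docFreq(new Term(field, value));
                result.add(new Object[] { value, e.getValue(), new Integer(indexFreq) });
            }
            return result;
        }
    }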
>
> We then render the first 100 document results (title,
> author, publication details, holdings) and the top 30
> values for each of subject, author, genre, Dewey/LC and
> audience, ordering each list by the popularity of the term
> in the hit results (the sample of the first 700) and sizing
> the rendered text according to the frequency of the term in
> the entire database (a bit like the Flickr tag popularity
> lists). We also render a graph of hit results by date
> range.
>
> The initial search is very quick - typically a few tens of
> milliseconds. The "clustering" takes much longer: reading
> up to 700 records, extracting all those fields, sorting to
> get the top 30 in each field category, and looking up the
> frequency of each term in the database.
>
> The test machine was a SunFire440 with 2 x 1.593GHz
> UltraSPARC-IIIi processors and 8GB of memory running
> Solaris 9, Java 1.5 in 64-bit mode, and Jetty. The Lucene
> data directory is stored on a local 10K SCSI disk.
>
> The benchmark consisted of running 13,142 representative
> and unique search phrases collected from another system.
> The search phrases are unsorted. The client (testing)
> system is run on another, unloaded computer and was
> configured to run a varying number of threads representing
> different loads. The results discussed here were produced
> with 3 threads - 3 simultaneous requests - and the response
> rates are as seen by the client at the completion of the
> request, when the last byte of the response is received.
>
> The Jetty/Lucene JVM was run with:
>   -ms1200M -mx1200M -d64 -server -verbose:gc
> The JVM (and file cache!) were "warmed up" with
> a first pass before results were recorded.
>
> The first test was run with Lucene 1.4.1 and achieved
> 3.3 responses/sec. (The time to generate the response
> includes the Jetty "client" time to process and render
> the result.) CPU utilisation was low (typically <= 40%),
> but IO rates were not very high. We mirrored the
> disk and achieved 3.7 responses/sec, but CPU
> utilisation still rarely went over 50%.
>
> We moved to Lucene 1.9.1 and with the same configuration
> (mirrored) achieved 3.6 responses/sec.
>
> We then set the parameter
>   -Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.MMapDirectory
> to use MMapDirectory and achieved 8.1 responses/sec
> with very high CPU utilisation (over 90%). Running
> a separate (unseen) set of 10,743 search terms without
> a "warm up" achieved 7.8 responses/sec.
>
> With 3 simultaneous requests, the total response time
> profile as recorded at the client was:
>
>   < 200ms for 40% of requests
>   < 500ms for 80% of requests
>   < 800ms for 90% of requests
>
> I read from Peter Keegan's recent postings:
> "The Lucene server is using MMapDirectory. I'm running
> the jvm with -Xmx16000M. Peak memory usage of the jvm
> on Linux is about 6GB and 7.8GB on Windows."
>
> We don't have nearly as much memory as Peter, but I
> wonder whether he is gaining anything with such
> a large heap. The file buffers allocated via MMap
> reside outside the JVM heap. We notice that although
> our JVM heap is 1.2GB max (and regularly reduces
> to ~400MB after GC), the process expands to use all
> available memory with MMap, and a big Java heap
> means less memory for MMap to use(?).
>
> We are very happy with Lucene/MMap performance given the
> extensive processing undertaken for the result clustering.
>
> Kent Fitch
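For reference, in this generation of Lucene the directory implementation is
chosen via the system property mentioned in the message, so it can be set
either on the JVM command line (as Kent did) or programmatically before the
index is first opened. A minimal sketch; the index path is a placeholder.

    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class OpenWithMMap {
        public static void main(String[] args) throws IOException {
            // Equivalent to passing
            //   -Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.MMapDirectory
            // on the JVM command line; must be set before FSDirectory is
            // first used, since the property is read when the class loads.
            System.setProperty("org.apache.lucene.FSDirectory.class",
                               "org.apache.lucene.store.MMapDirectory");

            // "/path/to/index" is a placeholder, not the path from the message.
            FSDirectory dir = FSDirectory.getDirectory("/path/to/index", false);
            IndexSearcher searcher = new IndexSearcher(dir);
            // ... run queries ...
            searcher.close();
            dir.close();
        }
    }

Since the mapped buffers live outside the Java heap, keeping -Xmx modest
leaves more of the machine's memory for the operating system to hold index
pages resident, which is the point Kent raises about very large heaps.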