These are great results!  Thanks for posting.

I'd be curious if you'd get better indexing throughput by using a single IndexWriter, fed by all 8 indexing threads, with an 8X bigger RAM buffer, instead of 8 IndexWriters that merge in the end.

How long does that final merge take now?

Also, 64 threads doing document construction seems too high? You may be losing some performance to the cost of thread context switching.

Did you use autoCommit=false? I think it should help since you have so many stored fields and some term vectors.
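
For a sense of scale, here is a quick back-of-the-envelope sketch in Java that uses only the figures from Glen's post. The class and method names are made up for illustration and it does not touch Lucene at all. It adds up the eight per-writer RAM buffers, which is roughly what a single shared IndexWriter would get if handed the same memory, and re-derives the throughput from the 20.5 hour total.

```java
// Back-of-the-envelope check of the reported benchmark figures.
// All names here are hypothetical; only the numbers come from the post.
class BenchmarkFigures {
    // RAM buffer sizes (MB) of the eight parallel IndexWriters, as reported.
    static final int[] RAM_BUFFERS_MB = {64, 67, 70, 73, 76, 79, 83, 85};

    // Total buffer memory across all eight writers, in MB.
    static int combinedRamBufferMb() {
        int total = 0;
        for (int mb : RAM_BUFFERS_MB) {
            total += mb;
        }
        return total;
    }

    // Average wall-clock seconds spent per 1000 documents.
    static double secondsPer1000Docs(long docs, double hours) {
        return (hours * 3600.0) / (docs / 1000.0);
    }

    // Average documents indexed per second.
    static double docsPerSecond(long docs, double hours) {
        return docs / (hours * 3600.0);
    }

    public static void main(String[] args) {
        long docs = 6404464L;   // number of source documents
        double hours = 20.5;    // total indexing time

        System.out.println("Combined RAM buffer (MB): " + combinedRamBufferMb());
        System.out.printf("Seconds per 1000 docs: %.2f%n", secondsPer1000Docs(docs, hours));
        System.out.printf("Docs per second: %.2f%n", docsPerSecond(docs, hours));
    }
}
```

The combined buffer comes to 597 MB, and the per-1000-docs time works out to about 11.5 seconds, which agrees with the reported figure.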

Mike

Glen Newton wrote:
Cass,
Thanks for converting it. I've posted it to my blog:
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html

Sorry for the XML tags: I guess I followed the instructions on the
Lucene performance benchmarks page too literally ("Post these figures
to the lucene-user mailing list using this template.").

Sorry if it hurt your eyes!  :-)

-Glen

On 15/04/2008, Cass Costello <[EMAIL PROTECTED]> wrote:
I just did that so I could read it. :) I'll leave it up until Glen resends
 or posts it somewhere...
 http://www.casscostello.com/?page_id=28




On Tue, Apr 15, 2008 at 5:18 PM, Ian Holsman <[EMAIL PROTECTED]> wrote:

Hi Glen.
can you resend this in plain text?
or put the HTML up on a server somewhere and point to it with a brief
summary in the post?
I'd love to look and read it, all those tags are making me go blind.


Glen Newton wrote:

<benchmark>
 <ul>
 <p>
 <b>Hardware Environment</b><br/>
 <li><i>Dedicated machine for indexing</i>: yes</li>
 <li><i>CPU</i>: Dual processor dual core Xeon CPU 3.00GHz;
hyperthreading ON for 8 virtual cores</li>
 <li><i>RAM</i>: 8GB</li>
 <li><i>Drive configuration</i>: Dell EMC AX150 storage array fibre
channel</li>
 </p>
 <p>
 <b>Software environment</b><br/>
 <li><i>Lucene Version</i>: 2.3.1</li>
 <li><i>Java Version</i>:  Java(TM) SE Runtime Environment (build
1.6.0_02-b05)</li>
 <li><i>Java VM</i>: Java HotSpot(TM) 64-Bit Server VM (build
1.6.0_02-b05, mixed mode)</li>
 <li><i>OS Version</i>: Linux OpenSUSE 10.2 (64-bit X86-64)</li>
 <li><i>Location of index</i>: Filesystem, on attached storage</li>
 </p>
 <p>
 <b>Lucene indexing variables</b><br/>
 <li><i>Number of source documents</i>: 6,404,464</li>
<li><i>Total filesize of source documents</i>: 141GB; Note that this
is only the full-text: the metadata (title, author(s), abstract,
keywords, journal name) are in addition to this</li>
 <li><i>Average filesize of source documents</i>:
22KB + metadata (see above)</li>
<li><i>Source documents storage location</i>: Filesystem</li>
 <li><i>File type of source documents</i>: text (PDFs converted to
text then gzipped)</li>
 <li><i>Parser(s) used, if any</i>: None, but the files were gzipped and had to
be un-gzipped by the Java application, which also did the indexing</li>
 <li><i>Analyzer(s) used</i>: StandardAnalyzer</li>
 <li><i>Number of fields per document</i>: 24</li>
 <li><i>Type of fields</i>: all text; 20 stored; 3 indexed and
tokenized with term vectors (full-text [not stored], title, abstract);
10 stored with no parsing</li>
 <li><i>Index persistence</i>: FSDirectory</li>
 <li><i>Index size</i>: 83GB</li>
 <li><i>Number of terms</i>: 143,298,010</li>
 </p>
 <p>
 <b>Figures</b><br/>
 <li><i>Time taken (in ms/s as an average of at least 3 indexing
runs)</i>: 20.5 hours</li>
 <li><i>Time taken / 1000 docs indexed</i>: 11.5 seconds </li>
 <li><i>Memory consumption</i>:  -Xms4000m  -Xmx6000m</li>
 <li><i>Query speed</i>: average time a query takes, type
  of queries (e.g. simple one-term query, phrase query),
  not measuring any overhead outside Lucene</li>
 </p>
 <p>
 <b>Notes</b><br/>
 <li><i>Notes</i>:
      <ul>
        <li>
These are journal articles, so the additional fields besides the
full-text are bibliographic metadata, such as title, authors,
abstract, keywords, journal name, volume, issue, start page, year.
        </li>
        <li>Java command line directives: -XX:+AggressiveOpts
-XX:+ScavengeBeforeFullGC -XX:-UseParallelGC   -server  -Xms4000m
-Xmx6000m
        </li>
        <li>Highly multithreaded & pipelined architecture using
java.util.concurrent.ThreadPoolExecutor
        </li>
<li>File-system reading and un-gzipping are performed by multiple threads
        </li>
<li>Eight separate parallel IndexWriters are fed by the pipeline
(creation of Document objects occurs in parallel with 64 threads),
and merged at the end into a single index. Each parallel index had a slightly
different RAM_BUFFER_SIZE_MB (64, 67, 70, 73, 76, 79, 83, 85 MB
respectively), so that flushing wouldn't all happen at the same time.
        </li>
        <li>
          Contact: glen DOT newton AT nrc-cnrc DOT gc DOT ca
        </li>

      </ul>
</li>
 </p>
 </ul>
</benchmark>






---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--
 Lego timeline:
http://cache.gizmodo.com/assets/resources/2008/01/lego-brick4-timeline.jpg


