RE: Lucene Speed under diff JVMs

2002-12-06 Thread Jonathan Reichhold
It doesn't surprise me that the IBM JDK is faster at indexing.  In my
experience, this JVM is better optimized for this kind of workload.

I did some serious load testing with various JVM implementations from Sun
and IBM and found the opposite when it came to searching: Lucene searches
were fastest under Sun 1.4.1.  That JVM was consequently able to handle a
higher load (faster responses mean more queries/second).  The IBM JVM was
drastically slower at handling queries.  I've never tried JRockit since I
don't like the cost.

The index for my tests had 7 million records and 6 major fields.  Queries
were randomly chosen from a list of 2 million real user queries.  The
query load was meant to simulate real load on a production site.  This
was all done on a single 1U, two-processor Red Hat Linux 7.2 box with
1 GB of RAM.  Query times were very good compared to previous indexing
methods.

Jonathan

-Original Message-
From: Armbrust, Daniel C. [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, December 05, 2002 2:47 PM
To: 'Lucene Users List'
Subject: Lucene Speed under diff JVMs


This may be of use to people who want to make Lucene index faster.
Also, I'm curious which JVM most people run Lucene under, and whether
anyone else has seen results like this:

I'm using the class that Otis wrote (see his message from about 3 weeks
ago) for testing the scalability of Lucene (more results on that later),
and I first tried running it under different versions of Java to see
where it runs fastest.  The class simply creates an index out of
randomly generated documents.

All of the following were running on a dual CPU 1 GHz PIII Windows 2000
machine that wasn't doing much else during the benchmark.  The indexing
program was single threaded, so it only used one of the processors of
the machine.

java version "1.3.1_04"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1_04-b02)
Java HotSpot(TM) Client VM (build 1.3.1_04-b02, mixed mode)

42 seconds/1000 documents

java version "1.4.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-b21)
Java HotSpot(TM) Client VM (build 1.4.1-b21, mixed mode)

42 seconds/1000 documents

Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_01)
BEA WebLogic JRockit(R) Virtual Machine (build
8.0_Beta-1.4.1_01-win32-CROSIS-20021105-1617, Native Threads,
Generational Concurrent Garbage Collector)

35 seconds/1000 documents

java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1)
Classic VM (build 1.3.1, J2RE 1.3.1 IBM Windows 32 build
cn131-20020403 (JIT enabled: jitc))

27 seconds/1000 documents


As you can see, the IBM JVM pretty much smoked Sun's, and beat out
JRockit as well.  Just a hunch, but it wouldn't surprise me if search
times were also faster under the IBM JDK.  Has anyone else come to this
conclusion?


Dan


RE: prevent re-indexing

2002-12-09 Thread Jonathan Reichhold
I agree with Otis on this.  In the application that does your indexing,
save the time you last started indexing.  Then, the next time you index,
read the previous timestamp back in and index only the files modified
since that date.  This doesn't handle deletes, but those would require a
bit more work.
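The filter step above can be sketched roughly as follows.  This is just
a minimal illustration of the last-modified check (the class name and
structure are my own, not anything from Lucene); the files it returns
are the candidates you would feed to your indexer.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class IncrementalIndexFilter {

    // Recursively collect the files under 'root' whose last-modified
    // time is after 'lastRun' (milliseconds since the epoch).  Only
    // these files need to be re-indexed on this pass.
    static List<File> modifiedSince(File root, long lastRun) {
        List<File> changed = new ArrayList<>();
        File[] entries = root.listFiles();
        if (entries == null) {
            return changed;  // not a directory, or unreadable
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                changed.addAll(modifiedSince(f, lastRun));
            } else if (f.lastModified() > lastRun) {
                changed.add(f);
            }
        }
        return changed;
    }
}
```

After a successful run you would write the run's start time back out
(e.g. to a small properties file) so the next pass can read it in.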

Jonathan

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] 
Sent: Monday, December 09, 2002 1:20 PM
To: Lucene Users List
Subject: Re: prevent re-indexing


That's an application specific behaviour that you need to add to your
indexing app.

Otis

--- host unknown <[EMAIL PROTECTED]> wrote:
> Hi all,
> 
> I have a rather large file system that I'm indexing (php/html files
> actually).  I'm re-indexing on a daily basis; however, I don't want or
> need to re-index 95+% of my files, since they're not going to change.
> 
> Is there currently the capability to look at the last-modified date
> and check it against the file that has already been indexed before
> re-indexing the file?  Or is this something that needs to be
> implemented?
> 
> Thanks again,
> Dominic
> madison.com
> 
> PS.  Thanks for the quick responses last time... the spider is
> starting to behave :-)


RE: Lucene Benchmarks and Information

2002-12-20 Thread Jonathan Reichhold
A question on the queries you used: what sort of distribution of terms
did they have?  I.e., were all the queries single random words, or did
you add in multi-word queries and phrases?

I'm impressed with the results; I just want to understand the testing
methodology better.

JR

-Original Message-
From: Armbrust, Daniel C. [mailto:[EMAIL PROTECTED]] 
Sent: Friday, December 20, 2002 8:57 AM
To: 'Lucene Users List'
Subject: Lucene Benchmarks and Information


I've been running some scalability tests on Lucene over the past couple
of weeks.  While there may be some flaws in my methods, I think the
results will be useful for people who want an idea of how Lucene will
scale.  If anyone has any questions about what I did, or wants
clarification on something, I'll be happy to provide it.

I'll start by filling out the form


  Hardware Environment
* Dedicated machine for indexing: yes
* CPU: 1 2.53 GHz Pentium 4
* RAM: Self-explanatory
* Drive configuration: 100 GB 7200 RPM IDE, 80 GB 7200 RPM IDE

  Software environment
* Java Version: java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1)
Classic VM (build 1.3.1, J2RE 1.3.1 IBM Windows 32 build
cn131-20020403 (JIT enabled: jitc))
* OS Version: Win XP SP1
* Location of index: Local File Systems

  Lucene indexing variables
* Number of source documents: 43,779,000
* Total filesize of source documents: ~350 GB -- never stored
(documents were randomly generated)
* Average filesize of source documents: 8 KB
* Source documents storage location: Generated while indexing, never
written to disk
* File type of source documents: text
* Parser(s) used, if any: None
* Analyzer(s) used: Standard Analyzer
* Number of fields per document: 2
* Type of fields: text, Unstored
* Index persistence: FSDirectory

  Figures
* Time taken (in ms/s as an average of at least 3 indexing runs):
See notes below
* Time taken / 1000 docs indexed: 6.5 seconds/1000, not counting
optimization time.  15 seconds/1000 when optimizing every 100,000
documents, and building an index to ~ 5 million documents.  Above 5
million documents, optimization took too much time.  See notes below.
* Memory consumption: ~ 200 mb
    * Index size: 70.7 GB

  Notes
* Notes: The documents were randomly generated on the fly as part of
the indexing process from a list of ~100,000 words whose average length
was 7 characters.  Each document had 3 words in the title and 500 words
in the body.
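The document generation described above can be sketched as follows.
This is not Otis's actual class, just a hedged illustration of the
idea: pick n random words from a word list to build each field (3 for
the title, 500 for the body).

```java
import java.util.Random;
import java.util.StringJoiner;

public class RandomDocGenerator {
    private final String[] wordList;  // stand-in for the ~100,000-word list
    private final Random rnd;

    RandomDocGenerator(String[] wordList, long seed) {
        this.wordList = wordList;
        this.rnd = new Random(seed);
    }

    // Build one synthetic field value of 'n' randomly chosen words,
    // e.g. randomField(3) for a title, randomField(500) for a body.
    String randomField(int n) {
        StringJoiner sj = new StringJoiner(" ");
        for (int i = 0; i < n; i++) {
            sj.add(wordList[rnd.nextInt(wordList.length)]);
        }
        return sj.toString();
    }
}
```

Each generated string would then go into an unstored text field of a
Lucene Document before being handed to the IndexWriter.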

While I was trying to build this index, the biggest limitation of Lucene
that I ran into was optimization.  Optimization kills the indexer's
performance once you get 3-5 million documents into an index.  On my
Windows XP box, I had to re-optimize every 100,000 documents to keep
from running out of file handles.  While I could build a 5-million-
document index in 24 hours, I could only add about another million over
the next 24 hours, due to the pain of the optimizer recopying the entire
index over and over again (about 10 GB at that point), and it would only
get worse from there.  So, to build an index this large, I built several
~5-million-document indexes and then merged them at the end into a
single index.  The second issue (though not really a problem) was that
you have to have at least double the disk space available while building
the index that you need when you are done.  I could have kept growing
the index, but I ran out of disk space.
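The batching strategy described above can be sketched structurally as
follows.  Note this is a simplified stand-in, not real Lucene code: the
lists model per-directory sub-indexes, and merge() models what would be
a single IndexWriter.addIndexes(...) call into a fresh index followed
by one final optimize(), so the expensive full-copy step happens only
once instead of every 100,000 documents.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedIndexBuild {
    // Stand-in for the ~5,000,000 documents per sub-index.
    static final int BATCH_SIZE = 5;

    // Partition the document stream into fixed-size batches; each batch
    // would be written and optimized in its own index directory.
    static List<List<String>> buildSubIndexes(List<String> docs) {
        List<List<String>> subIndexes = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += BATCH_SIZE) {
            subIndexes.add(new ArrayList<>(
                docs.subList(i, Math.min(i + BATCH_SIZE, docs.size()))));
        }
        return subIndexes;
    }

    // Merge all sub-indexes into one, preserving order; in Lucene this
    // is the one-time addIndexes + optimize pass at the very end.
    static List<String> merge(List<List<String>> subIndexes) {
        List<String> merged = new ArrayList<>();
        for (List<String> sub : subIndexes) {
            merged.addAll(sub);
        }
        return merged;
    }
}
```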

When I was done building indexes, I ran some queries against them to see
how search performance varied with the size of the index.  Here are my
results for various index sizes.

Index size (GB)    ms per query
4.5                383
7.9                283
10                 89
12.7               112
52.5               694
70.7               944


These numbers are an average of 3 runs of 500 randomly generated queries
tossed at the index (single-threaded) on the same hardware that built
the index.  The queries were randomly generated; about 50% of them had 0
results and 50% had 1 or more results.

I was happy to see that these numbers make a nice linear plot
(attached).  I'm not sure what other comments to add here, other than to
thank the authors of Lucene for their great design and implementation.

If anyone has anything else they would like me to test on this index
before I dump it, speak up quickly; I have to pull out one of the hard
drives this weekend to pass it on to its real owner.

Dan


