Hi

Thanks for the suggestions. This would require us to change the index and right now we literally have millions of documents stored in current index format. I'll bear it in mind, but I am not entirely sure how I would go about implementing the change at this point.

Much appreciate

Jamie


h t wrote:
1. redefine the archivedate field as YYmmDD format,
2. add another field using timestamp for sort use.
3. use RangeFilter to get result and then sort by timestamp.

2008/2/27, Jamie <[EMAIL PROTECTED]>:
Hi Michael & Others

Ok. I've gathered some more statistics from a different machine for your
analysis.
(I had to switch machines because the original one was in production and
my tests were interfering).

Here are the statistics from the new machine:

Total Documents: 1.2 million
Results Returned:  900k
Store Size 238G (size of original documents)
Index Size 1.2G (lucene index size)
Index / Store Ratio 0.5%

The search query is as follows:

archivedate:[d20071229010000 TO d20080228235900]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~why there is an
extra 'd' ?~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As you can see, I am using a range query to search between specific dates.
Question: should this query be moved to a filter rather? I did not do
this as I needed to have the option to sort on date.

There are no other specific filters applied and in this example sorting
is turned off.

On this particular machine the search time varies between 2.64 seconds
and about 5 seconds.

The limitations of this machine are that it does uses a normal IDE drive
to house the index, not a SATA drive

IOStat Statistics

Linux 2.6.20-15-server 27/02/2008

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         20.25    0.00    3.23    0.34    0.00   76.19

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               7.12        50.67       186.41   38936841  143240688

See attached for hardware info and the CPU call tree (taken from YourKit).

I would appreciate your recommendations.


Jamie


h t wrote:
Hi Michael,
I guess the hotspot of lucene is
org.apache.lucene.search.IndexSearcher.search()

Hi Jamie,
What's the original text size of a million emails?
I estimate the size of an email is around 100k, is this true?
When you doing search, what kind keywords did you input, words or short
sentence?
How many results return?
Did you use filter to shrink the results size?

2008/2/27, Michael Stoppelman <[EMAIL PROTECTED]>:
  So you're saying searches are taking 10 seconds on a 5G index? If so
that
seems ungodly slow.
If you're on *nix, have you watched your iostat statistics? Maybe
something
is hammering your hds.
Something seems amiss.

What lucene methods were pointed to as hotspots by YourKit?




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--
Stimulus Software - MailArchiva
Email Archiving And Compliance
USA Tel: +1-713-366-8072 ext 3
UK Tel: +44-20-80991035 ext 3
Email: [EMAIL PROTECTED]
Web: http://www.mailarchiva.com

To receive MailArchiva Enterprise Edition product announcements, send a message to: <[EMAIL PROTECTED]>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to