RE: Search performance with one index vs. many indexes

2005-02-28 Thread Runde, Kevin
Follow-up to the article from Friday.




RE: Search performance with one index vs. many indexes

2005-02-28 Thread Runde, Kevin
Hi All,

Sorry about that; please disregard that last email. I must not be fully
awake yet.

Sorry,
Kevin Runde 




Re: Search performance with one index vs. many indexes

2005-02-27 Thread Morus Walter
Jochen Franke writes:
> Topic: Search performance with large numbers of indexes vs. one large index
>
> My questions are:
>
> - Is the size of the wordlist the problem?
> - Would we be a lot faster if we had a smaller number
>   of files per index?

Sure.
Look:
Index lookup of a word is O(ln(n)), where n is the number of words.
Index lookup of a word in k indexes, each having m words, is O(k*ln(m)).
In the best case all word lists are distinct (purely theoretical),
that is n = k*m, or m = n/k.
For n = 15 million, k = 800:
ln(n) = 16.5
k*ln(n/k) = 7871
In a realistic case, m is much bigger, since the word lists won't be
distinct. But it's the linear factor k that bites you.
In the worst case (all words in all indexes) you have
k*ln(n) = 13218.8

HTH
Morus
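
For anyone who wants to reproduce the arithmetic, here is a small
stand-alone Java sketch (no Lucene involved; the class name is made up)
that prints the three estimates for the numbers quoted above,
n = 15 million words and k = 800 indexes:

public class LookupCostEstimate {
    public static void main(String[] args) {
        double n = 15e6; // total number of distinct words
        double k = 800;  // number of indexes

        // One big index: a single dictionary lookup, O(ln n).
        System.out.println("ln(n)     = " + Math.log(n));         // ~16.5

        // Best case: k indexes with disjoint word lists of n/k words each.
        System.out.println("k*ln(n/k) = " + k * Math.log(n / k)); // ~7871

        // Worst case: every word occurs in every index.
        System.out.println("k*ln(n)   = " + k * Math.log(n));     // ~13219
    }
}

Whatever the actual overlap between the word lists, the result only moves
between the two k-scaled estimates; the factor k in front is what dominates.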




Search performance with one index vs. many indexes

2005-02-25 Thread Jochen Franke
Topic: Search performance with large numbers of indexes vs. one large index
Hello,
We are experiencing a performance problem when using large
numbers of indexes.
We have an application with about:
- 6 million documents
- one index of about 7 GB
- probably 10 to 15 million different words in that index
The creation of the index out of one DB (where the documents
come from) with two processors takes about 20 hours.
For several reasons (e.g. parallelizing the index creation), we
created several indexes by splitting the documents into logical groups.
We first created an artificial benchmark:
- 10 million documents
- 500 indexes (about 3 files per index)
- 10 GB of index altogether
- about 5,000 randomly selected words
Querying these indexes took about 0.4s per query, so it was only
twice the time of querying a single index, which was fine for us.
We did the same with one index merged out of the 500 indexes.
The Lucene search performance was fine here as well (about 0.2s per
query on our machine).
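
For reference, merging a set of on-disk indexes into a single index can
be done with IndexWriter.addIndexes. A minimal sketch, assuming the
Lucene 1.x API current at the time; the paths are made up and the count
of 500 comes from the benchmark above:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        // Target index that will hold the merged contents (path is hypothetical).
        IndexWriter writer =
            new IndexWriter("/data/index-merged", new StandardAnalyzer(), true);

        // Open the 500 per-group indexes (paths are hypothetical).
        Directory[] parts = new Directory[500];
        for (int i = 0; i < parts.length; i++) {
            parts[i] = FSDirectory.getDirectory("/data/index-" + i, false);
        }

        // Merge them into the target index and optimize it down to few files.
        writer.addIndexes(parts);
        writer.optimize();
        writer.close();
    }
}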


We then implemented the real thing, which is:
- 6 million documents
- 800 indexes (with about 28 files per index)
- about 7 GB index size
- probably 10 to 15 million different words in that index
We now have a query performance of 4-8 seconds per query.
The test with the real data in one index has not been finished yet.
My questions are:
- Is the size of the wordlist the problem?
- Would we be a lot faster if we had a smaller number
  of files per index?
- Is 500-1000 still a reasonable number of indexes?
- Is there a more or less linear relationship between
  the number of indexes and the execution time of the query
  (as all indexes have to be checked and the results have
  to be merged)?
- Are there any parameters that could be configured for
  that use case?
- Should we implement any specialized classes specific to our use case?
Thanks,
Jochen Franke
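
For context, querying many indexes at once is typically done through
MultiSearcher, which fans the query out to every sub-index and merges
the hits, so each term lookup is repeated once per index. A minimal
sketch, assuming the Lucene 1.x API of the time; the paths, the field
name and the count of 800 are placeholders based on the setup described
above:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class SearchManyIndexes {
    public static void main(String[] args) throws Exception {
        // Open one searcher per index (paths are hypothetical).
        Searchable[] searchers = new Searchable[800];
        for (int i = 0; i < searchers.length; i++) {
            searchers[i] = new IndexSearcher("/data/index-" + i);
        }

        // MultiSearcher runs the query against every sub-index and
        // merges the results into one hit list.
        MultiSearcher searcher = new MultiSearcher(searchers);

        Query query =
            QueryParser.parse("some query", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");

        searcher.close();
    }
}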