RE: Search performance with one index vs. many indexes
Follow Up to the article from Friday

-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 1:30 AM
To: Lucene Users List
Subject: Re: Search performance with one index vs. many indexes

[quoted reply trimmed; Morus Walter's message appears in full later in this thread]
RE: Search performance with one index vs. many indexes
Hi All,

Sorry about that, please disregard that last email. I must not be fully awake yet.

Sorry,
Kevin Runde

-----Original Message-----
From: Runde, Kevin [mailto:[EMAIL PROTECTED]
Sent: Monday, February 28, 2005 7:34 AM
To: Lucene Users List
Subject: RE: Search performance with one index vs. many indexes

[quoted messages trimmed; Morus Walter's reply appears in full later in this thread]
Re: Search performance with one index vs. many indexes
Jochen Franke writes:

> Topic: Search performance with large numbers of indexes vs. one large index
>
> My questions are:
> - Is the size of the wordlist the problem?
> - Would we be a lot faster, when we have a smaller number of files per index?

Sure. Look:

Index lookup of a word is O(ln(n)), where n is the number of words.
Index lookup of a word in k indexes having m words each is O(k ln(m)).

In the best case all word lists are distinct (purely theoretical), that is
n = k*m, or m = n/k.

For n = 15 million, k = 800:

  ln(n)     = 16.5
  k*ln(n/k) = 7871

In a realistic case, m is much bigger, since the word lists won't be
distinct. But it's the linear factor k that bites you. In the worst case
(all words in all indexes) you have

  k*ln(n) = 13218.8

HTH
Morus
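(For anyone who wants to check the figures above: the three costs can be reproduced with a few lines of Python. This is just the arithmetic from the post, not anything Lucene-specific; n and k are the values Morus used.)

```python
import math

n = 15_000_000  # total number of distinct words
k = 800         # number of separate indexes

one_index  = math.log(n)          # single merged index: ln(n)
best_case  = k * math.log(n / k)  # k indexes with disjoint word lists: k*ln(n/k)
worst_case = k * math.log(n)      # every word in every index: k*ln(n)

print(round(one_index, 1), round(best_case), round(worst_case, 1))
# prints: 16.5 7871 13218.8
```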
Search performance with one index vs. many indexes
Topic: Search performance with large numbers of indexes vs. one large index

Hello,

we are experiencing a performance problem when using large numbers of
indexes. We have an application with
- about 6 million documents
- one index of about 7 GB
- probably 10 to 15 million different words in that index.

The creation of the index out of one DB (where the documents come from)
takes about 20 hours with two processors. For several reasons (e.g.
parallelizing the index creation), we created several indexes by splitting
the documents into logical groups.

We first created an artificial benchmark:
- 10 million documents
- 500 indexes (with about 3 files per index)
- 10 GB of index altogether
- about 5,000 randomly selected words

Querying these indexes took about 0.4 s per query, so it was only twice the
time of querying a single index, which was fine for us. We did the same with
one index merged out of the 500 indexes. The Lucene search performance was
fine here as well (about 0.2 s per query on our machine).

We then implemented the real thing, which is:
- 6 million documents
- 800 indexes (with about 28 files per index)
- about 7 GB index size
- probably 10 to 15 million different words in that index.

We now have a query performance of 4-8 seconds per query. The test with the
real data in one index has not been finished so far.

My questions are:
- Is the size of the wordlist the problem?
- Would we be a lot faster, when we have a smaller number of files per index?
- Is 500-1000 still a reasonable number of indexes?
- Is there a more or less linear relationship between the number of indexes
  and the execution time of the query (as all indexes have to be checked and
  the results have to be merged)?
- Are there any parameters that could be configured for that use case?
- Should we implement any specialized classes specific to our use case?

Thanks,

Jochen Franke
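(On the "linear relationship" question: the effect is easy to see with a toy model. The sketch below is NOT how Lucene stores its term dictionary; it just models each index's wordlist as a sorted Python list and counts binary-search steps. Splitting one dictionary over k indexes means every query term is binary-searched k times, so the probe count scales with k. All names and sizes here are made up for illustration.)

```python
import bisect

def probe_count(dictionaries, term):
    """Look up `term` in every per-index term dictionary (a sorted list)
    and return the total number of binary-search steps performed.
    Every dictionary must be consulted, even if the term is absent."""
    steps = 0
    for words in dictionaries:
        steps += len(words).bit_length()  # ~log2(len) comparisons per bisect
        bisect.bisect_left(words, term)
    return steps

# one merged dictionary of 100,000 terms ...
merged = [[f"term{i:06d}" for i in range(100_000)]]
# ... versus the same terms split over 100 per-index dictionaries
split = [[f"term{i:06d}" for i in range(j * 1_000, (j + 1) * 1_000)]
         for j in range(100)]

print(probe_count(merged, "term054321"))  # 17  (~log2 of 100,000)
print(probe_count(split, "term054321"))   # 1000 (the factor k dominates)
```

The same linear factor applies again when the per-index hit lists are merged into one result, which is consistent with the 0.4 s vs. 0.2 s benchmark numbers above and with the slowdown at 800 indexes.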