> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
>
> > Performance should be more-or-less linear: a two-million document
> > index will be almost twice as slow to search as a one-million
> > document index.
>
> Is performance linear in the number of documents? I would naively
> think it would be linear in the number of terms. Also, how does
> document size change things? Is searching 1 million 100k documents
> twice as fast as 1 million 200k documents?
Some more precise statements:

The cost to search for a term is proportional to the number of documents that contain that term. The cost to search for a phrase is proportional to the sum of the number of occurrences of its constituent terms. The cost to execute a boolean query is the sum of the costs of its sub-queries. Longer documents contain more terms: usually both more unique terms and more occurrences of each.

Total vocabulary size is not a big factor in search performance. When you open an index, Lucene does read one out of every 128 unique terms into a table, so an index with a large number of unique terms will be slower to open. Looking up query terms in that table is also slower for bigger indexes, but that lookup time is not significant in overall performance.

At index open, Lucene also reads one byte per document per indexed field (the normalization factor), so an index with lots of documents and fields will also be slower to open. But once opened, the cost of searching depends mostly on the frequency characteristics of the query terms. And since IndexReaders and Searchers are thread safe, you don't need to open indexes very often.

Doug
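To illustrate that last point, here is a minimal sketch of opening an index once and sharing a single IndexSearcher across threads. It is only a sketch, assuming the Lucene 1.x-era API (string-path IndexSearcher constructor, static QueryParser.parse, Hits); the index path, the "contents" field name, and the query string are placeholders, not anything from this thread.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  public class SharedSearcherExample {
      public static void main(String[] args) throws Exception {
          // Open the index once: this is the expensive step, where the
          // term table and the per-document norms are read.
          final IndexSearcher searcher = new IndexSearcher("/path/to/index"); // placeholder path

          // IndexSearcher is thread safe, so multiple threads can share
          // this one instance instead of reopening the index.
          Runnable searchTask = new Runnable() {
              public void run() {
                  try {
                      Query query = QueryParser.parse("foo bar", "contents",
                                                      new StandardAnalyzer());
                      Hits hits = searcher.search(query);
                      System.out.println(hits.length() + " matching documents");
                  } catch (Exception e) {
                      e.printStackTrace();
                  }
              }
          };

          // Two concurrent searches against the same open index.
          Thread t1 = new Thread(searchTask);
          Thread t2 = new Thread(searchTask);
          t1.start();
          t2.start();
          t1.join();
          t2.join();

          searcher.close();
      }
  }

Only the construction of the searcher pays the open-time costs described above; each search afterwards pays only for the postings of its own query terms.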
