Scaling per machine should be roughly linear. Network overhead is minimal because the Lucene objects being passed around are small. Google mentions in one of their early white papers on scaling, http://labs.google.com/papers/googlecluster-ieee.pdf, that they split the index into sub-indexes, now popularly called shards, and have an individual thread perform the search over each one. Executed in parallel (e.g. with ParallelMultiSearcher, which spawns a thread per sub-searcher rather than using a thread pool), the response time will be faster than with a single thread, assuming at least part of each index is in the system cache. A query is essentially an iteration over the index, so it is easy to see how parallelization speeds up response times. Queries per second are best scaled by adding more hardware with the same indexes replicated on each server, and then further dividing those servers into what can be termed cells: sets of servers each holding a different part of the total index.
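To make the idea concrete, here is a minimal sketch of the search-then-merge pattern described above, using only java.util.concurrent rather than the Lucene API. The `Hit` record, `searchShard`, and `parallelSearch` are hypothetical stand-ins: each "shard" is just a list of pre-scored documents, searched on its own thread, with the partial results merged into one top-k list, which is roughly what ParallelMultiSearcher does across its sub-searchers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardSearchSketch {

    // Toy stand-in for a scored search result from one shard.
    record Hit(String docId, float score) {}

    // Toy stand-in for searching one shard: just filter its documents.
    static List<Hit> searchShard(List<Hit> shard, float minScore) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : shard) {
            if (h.score() >= minScore) out.add(h);
        }
        return out;
    }

    // Search every shard in parallel (one task per shard), then merge the
    // partial results into a single list ordered by descending score and
    // keep the top k. This is the shard-and-merge step in miniature.
    static List<Hit> parallelSearch(List<List<Hit>> shards, float minScore, int k)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<Hit>>> futures = new ArrayList<>();
            for (List<Hit> shard : shards) {
                futures.add(pool.submit(() -> searchShard(shard, minScore)));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : futures) {
                merged.addAll(f.get());
            }
            merged.sort(Comparator.comparingDouble((Hit h) -> -h.score()));
            return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<Hit>> shards = List.of(
                List.of(new Hit("a", 0.9f), new Hit("b", 0.2f)),
                List.of(new Hit("c", 0.7f), new Hit("d", 0.5f)));
        // The two shards are searched concurrently; the merge preserves
        // global score order across shards.
        System.out.println(parallelSearch(shards, 0.3f, 2));
    }
}
```

Note the wall-clock win here comes from each shard being small enough to sit in cache and from the per-shard searches overlapping in time; the merge itself is cheap because each shard returns only its top candidates.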
Having a large index on a single machine does not scale well because most of the index will not fit in the system cache; as the index grows, so does the response time. Dividing the index into shards and cells spreads the total index across the system caches of many machines, an approach proven at the big G. The general assumption is that hardware is cheap and easy to add, and search systems can take advantage of this by parallelizing as much as possible, per server and per application.

On Wed, Jul 16, 2008 at 9:41 AM, Karl Wettin <[EMAIL PROTECTED]> wrote:
> Is there some sort of a scaling strategies listing available? I think there
> is a Wiki page missing.
>
> What are the typical problems I'll encounter when distributing the search
> over multiple machines?
>
> Do people split up their index per node or do they use the complete index
> and restrict what part to search in using filters? The latter would be good
> for the scores, right? Then how do I calculate the cost in speed for the
> score with better quality? I mean, splitting the index in two and searching
> on two machines using ParallelMultiSearcher probably means that I'll get
> something like 30% speed improvement and not 100%. Or?
>
> Is there something to win by using multiple threads each restricted to a
> part each of the same index on a single machine, compared to a single
> thread? Or is it all I/O? That would mean there is something to gain if the
> index was on SSD or in RAM, right?
>
>
> karl