The scaling per machine should be linear.  The overhead from the network is
minimal because the Lucene result objects that travel over the wire are
small.  Google mentions in one of their early white papers on scaling,
http://labs.google.com/papers/googlecluster-ieee.pdf, that they have
sub-indexes, now popularly called shards, over which individual threads each
perform a search.  Executed in parallel (e.g. with ParallelMultiSearcher,
which does not use thread pooling), the response time will be lower than
with a single thread, assuming part of the indexes is in the system cache.
A query is essentially an iteration over the index, so it is easy to see how
parallelization speeds up response times.  Queries per second, on the other
hand, should ideally be scaled by adding more hardware with the same indexes
replicated on each server, and then by further dividing those servers into
what can be termed cells, each cell serving a different set of indexes.
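To make the thread-per-shard idea concrete, here is a minimal self-contained
sketch in plain Java.  It is not Lucene's ParallelMultiSearcher; each "shard"
is just an in-memory array of strings, and the "search" is the bare iteration
described above, fanned out one task per shard and merged at the end:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardSearchSketch {
    // A "shard" here is just a slice of documents; in Lucene it would be a sub-index.
    static List<String> searchShard(String[] shard, String term) {
        List<String> hits = new ArrayList<String>();
        for (String doc : shard) {            // a query is simply an iteration
            if (doc.contains(term)) hits.add(doc);
        }
        return hits;
    }

    // Fan the same query out to every shard, one thread per shard, then merge.
    static List<String> parallelSearch(String[][] shards, final String term)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.length);
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (final String[] shard : shards) {
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() { return searchShard(shard, term); }
            }));
        }
        List<String> merged = new ArrayList<String>();
        for (Future<List<String>> f : futures) merged.addAll(f.get());
        pool.shutdown();
        return merged;
    }

    public static void main(String[] args) throws Exception {
        String[][] shards = {
            { "lucene in action", "java concurrency" },
            { "scaling lucene", "distributed systems" },
        };
        List<String> hits = parallelSearch(shards, "lucene");
        System.out.println(hits.size()); // prints 2
    }
}
```

With each shard small enough to stay in the cache of its machine (or
searched by its own thread on one machine), the wall-clock time is roughly
that of the largest shard rather than the sum of all of them.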

Having a large index on a single machine does not scale well because most of
the index will not fit in the system cache; as the index grows, so does the
response time.  Dividing the index up into shards and cells allows for
efficient scaling, which is proven at the big G, because it puts more of the
total index in the system caches of many machines.
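As an illustration of the shard/cell split (the names and the hash scheme
here are my own, not anything prescribed by Lucene or the Google paper), a
document can be routed to a shard by hashing its id, and each shard can be
held by a cell of replica servers, one of which is picked per query to
spread the load:

```java
public class ShardRouter {
    // Route a document to one of nShards sub-indexes by hashing its id.
    static int shardFor(String docId, int nShards) {
        return Math.abs(docId.hashCode() % nShards);
    }

    // Every server in a shard's cell holds the same sub-index; rotate over
    // the replicas so query load is spread evenly.
    static String replicaFor(int shard, String[][] cells, int queryNo) {
        String[] cell = cells[shard];
        return cell[queryNo % cell.length];
    }

    public static void main(String[] args) {
        String[][] cells = {
            { "serverA1", "serverA2" },   // cell of replicas holding shard 0
            { "serverB1", "serverB2" },   // cell of replicas holding shard 1
        };
        int shard = shardFor("doc-42", cells.length);
        System.out.println(replicaFor(shard, cells, 0));
    }
}
```

Adding replicas to a cell raises queries per second; adding cells (more
shards) keeps each sub-index small enough to stay in the system cache.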

The general assumption is that hardware is cheap and can be added easily;
search systems can take advantage of this and parallelize as much as
possible, per server and per application.

On Wed, Jul 16, 2008 at 9:41 AM, Karl Wettin <[EMAIL PROTECTED]> wrote:

> Is there some sort of a scaling strategies listing available? I think there
> is a Wiki page missing.
>
> What are the typical problems I'll encounter when distributing the search
> over multiple machines?
>
> Do people split up their index per node or do they use the complete index
> and restrict what part to search in using filters? The latter would be good
> for the scores, right? Then how do I calculate the cost in speed for the
> score with better quality? I mean, splitting the index in two and searching
> on two machines using ParallelMultiSearcher probably means that I'll get
> something like 30% speed improvement and not 100%. Or?
>
> Is there something to win by using multiple threads each restricted to a
> part each of the same index on a single machine, compared to a single
> thread? Or is it all I/O? That would mean there is something to gain if the
> index was on SSD or in RAM, right?
>
>
>      karl
>
