On 8/2/2011 4:44 AM, Bernd Fehling wrote:
Is there any knowledge on this list about the performance
crossover between a single index and sharding and
when to change from a single index to sharding?
E.g. if index size is larger than 150GB and num of docs is
more than 25 mio. then it is better to change from single index
to sharding and have two shards.
Or something like this...
Sure, solr might even handle 50 mio. docs but performance is going down
and a sharded system with distributed search will be faster than
a single index, or not?
The answer I've always seen here boils down to "it depends on a large
number of variables unique to every situation." The nature of your data
will affect things, like the number of fields, number of unique terms
per field, etc. If you have really complicated queries, that will slow
things down.
Probably the greatest limiting factor is memory. Having enough free
memory to fit the entire index into the operating system's disk cache is
the best thing you can do for performance. This is memory over and
above whatever you give to your Java heap. If you have a 150GB index
and you can afford machines with at least 192GB of RAM, a single index
would perform very well, once it is warmed up. Performance on a cold
index would not be very good. In a sharded scenario, you want to try
and size each machine so that its piece fits into RAM.
Next would be disk I/O. Any data that won't fit in the disk cache must
be retrieved from disk, which is typically the weakest link in the
chain. If you can put your index on solid state disks, that's almost as
good as having the index entirely in memory. Performance on a cold
index with SSD would be incredible.
Having a lot of high speed CPU available will help, but not as much as
memory and I/O.
Index rebuild time is another consideration that might lead you to go
distributed, as long as your data source can keep up with multiple readers.
My own index is too big to fit in RAM, even sharded. Each of the six
large shards is getting close to 19GB. Each machine has 14GB of RAM
(it's a virtual environment with three large shards per physical host)
and has 3GB allocated to Java. I am in the process of upgrading the
memory, at which point it will fit, but our growth will exceed the
maximum server memory again in the next year or so. I have plans to
eliminate the virtualization and have three shards in cores on each server.
I know this isn't really what you were looking for, but there are no
simple answers to your question.
Thanks,
Shawn