On 8/2/2011 4:44 AM, Bernd Fehling wrote:
> Is there any knowledge on this list about the performance
> crossover between a single index and sharding and
> when to change from a single index to sharding?
>
> E.g. if index size is larger than 150GB and num of docs is
> more than 25 mio. then it is better to change from single index
> to sharding and have two shards.
> Or something like this...
>
> Sure, solr might even handle 50 mio. docs but performance is going down
> and a sharded system with distributed search will be faster than
> a single index, or not?

The answer I've always seen here boils down to "it depends on a large number of variables unique to every situation." The nature of your data will affect things, like the number of fields, number of unique terms per field, etc. If you have really complicated queries, that will slow things down.

Probably the greatest limiting factor is memory. Having enough free memory to fit the entire index into the operating system's disk cache is the best thing you can do for performance. This is memory over and above whatever you give to your Java heap. If you have a 150GB index and you can afford machines with at least 192GB of RAM, a single index would perform very well once it is warmed up. Performance on a cold index would not be very good. In a sharded scenario, try to size each machine so that its piece of the index fits into RAM.
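As a rough illustration of that sizing rule, here is a minimal Python sketch. The function names and the 8GB heap figure are made up for the example, and it assumes the index splits evenly across shards, which real data may not do. It only estimates how many shards are needed so that each piece fits in the memory left over for the disk cache.

```python
import math

def page_cache_gb(ram_gb, heap_gb, other_gb=0.0):
    """RAM left for the OS disk cache after the Java heap and other processes."""
    return max(ram_gb - heap_gb - other_gb, 0.0)

def min_shards(index_gb, ram_gb, heap_gb, other_gb=0.0):
    """Smallest shard count (one shard per machine) whose pieces fit in the disk cache.

    Assumes the index divides evenly across shards.
    """
    cache = page_cache_gb(ram_gb, heap_gb, other_gb)
    if cache <= 0:
        raise ValueError("no memory left for the disk cache")
    return max(1, math.ceil(index_gb / cache))

# 150GB index on 192GB machines, with an assumed 8GB heap: one index is enough.
print(min_shards(150, 192, 8))  # 1
# Same index on hypothetical 64GB machines: 56GB of cache each, so three shards.
print(min_shards(150, 64, 8))   # 3
```

This is only a starting point. Query complexity, warming, and growth all shift the real number.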

Next would be disk I/O. Any data that won't fit in the disk cache must be retrieved from disk, which is typically the weakest link in the chain. If you can put your index on solid state disks, that's almost as good as having the index entirely in memory. Performance on a cold index with SSD would still be very good, because random reads are so much faster than on spinning disks.

Having a lot of high speed CPU available will help, but not as much as memory and I/O.

Index rebuild time is another consideration that might lead you to go distributed, as long as your data source can keep up with multiple readers.

My own index is too big to fit in RAM, even sharded. Each of the six large shards is getting close to 19GB. Each machine has 14GB of RAM (it's a virtual environment with three large shards per physical host) and has 3GB allocated to Java. I am in the process of upgrading the memory, at which point it will fit, but our growth will exceed the maximum server memory again in the next year or so. I have plans to eliminate the virtualization and have three shards in cores on each server.
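To put numbers on that, here is a quick back-of-the-envelope check in Python using the figures above (19GB shard, 14GB RAM, 3GB heap). The helper name is made up for illustration, and the result is an upper bound, since the OS and other processes also need some of that memory.

```python
def cached_fraction(index_gb, ram_gb, heap_gb):
    """Upper bound on the share of an index that can sit in the OS disk cache."""
    cache_gb = max(ram_gb - heap_gb, 0)
    return min(cache_gb / index_gb, 1.0)

# One large shard per virtual machine: 19GB of index, 14GB RAM, 3GB Java heap.
frac = cached_fraction(index_gb=19, ram_gb=14, heap_gb=3)
print(f"{frac:.0%}")  # 58%
```

So at best a little over half of each large shard can be cached at any one time.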

I know this isn't really what you were looking for, but there are no simple answers to your question.

Thanks,
Shawn
