Grant,

Thanks a lot for the answers. Please see my replies below.

> > 1) Should we do sharding or not?
> > If we start without sharding, how hard will it be to enable it?
> > Is it just some config changes + the index rebuild or is it more?
> 
> There will be operations setup, etc.  And you'll have to add in the
> appropriate query stuff.
> 
> Your install and requirements aren't that large, so I doubt you'll
> need sharding, but it always depends on your exact configuration.
> I've seen indexes as big as 80 million docs on a single machine, but
> the docs were smaller in size.
> 
> > My personal opinion is to go without sharding at first and enable it
> > later if do get a lot of documents.
> 
> Sounds reasonable.

One more question - is it worth trying to keep the whole index in
memory and only shard once it no longer fits? To me that seems like a
bit of overhead, but I may be very wrong here.
What's the recommended ratio between the parts kept in RAM and on the
HDDs?

> > 2) How should we organize our clusters to ensure redundancy?
> >
> > Should we have 2 or more identical Masters (means that all the
> > updates/optimisations/etc. are done for every one of them)?
> >
> > An alternative, afaik, is to reconfigure one slave to become the new
> > Master, how hard is that?
> 
> I don't have a good answer here, maybe someone else can chime in.  I
> know master failover is a concern, but I'm not sure how others handle
> it right now.  Would be good to have people share their approach.
> That being said, it seems reasonable to me to have identical masters.

I found this thread related to this issue:
http://www.nabble.com/High-Availability-deployment-to13094489.html#a13098729

I guess it depends on how easily we can fill the gap between the last
commit and the moment the Master goes down. Most likely, we'll have to
have 2 Masters.
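
If we do end up with 2 identical Masters, I assume our indexing code
would just apply every update to both of them, roughly like this (a
SolrJ sketch; the hostnames and fields are made up):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DualMasterIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical hosts - both masters receive every update,
        // so either one can take over if the other goes down
        SolrServer[] masters = {
            new CommonsHttpSolrServer("http://master1:8983/solr"),
            new CommonsHttpSolrServer("http://master2:8983/solr")
        };

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "example document");

        // Apply the same update and commit to each master
        for (SolrServer master : masters) {
            master.add(doc);
            master.commit();
        }
    }
}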


> > 3) Basically, we can get servers of two kinds:
> > * Single Processor, Dual Core Opteron 2214HE
> > * 2 GB DDR2 SDRAM
> > * 1 x 250 GB (7200 RPM) SATA Drive(s)
> >
> > * Dual Processor, Quad Core 5335
> > * 16 GB Memory (Fully Buffered)
> > * 2 x 73 GB (10k RPM) 2.5" SAS Drive(s), RAID 1
> >
> > The second - more powerful - one is more expensive, of course.
> 
> Get as much RAM as you can afford.  Surely there is an in between
> machine as well that might balance cost and capabilities.  The first
> machine seems a bit light, especially in memory.

Fair enough.

> > How can we take advantage of the multiprocessor/multicore servers?
> >
> > Is there some special setup required to make, say, 2 instances of
> > SOLR run on the same server using different processors/cores?
> 
> See the Core Admin stuff http://wiki.apache.org/solr/CoreAdmin.  Solr
> is thread-safe by design (so it's a bug, if you hit issues).  You can
> send it documents on multiple threads and it will be fine.
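
Good to know about the thread-safety. Just to check that I understand
the multi-threaded part correctly: something like the sketch below
should be safe against a single instance, right? (The URL, field name
and thread count are made up.)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelFeeder {
    public static void main(String[] args) throws Exception {
        // One shared client; concurrent add() calls are safe
        final SolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 1000; i++) {
            final int id = i;
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", "doc-" + id);
                        solr.add(doc);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        solr.commit();  // one commit after all threads finish
    }
}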

Hmmm, it seems that several cores are supposed to handle different
indexes:
http://wiki.apache.org/solr/MultipleIndexes#head-e517417ef9b96e32168b2cf35ab6ff393f360d59
<< Solr1.3 added support for multiple "Solr Cores" in a single
deployment of Solr -- each Solr Core has it's own index. For more
information please see CoreAdmin.>>

As we are going to have just one index, the only way I see to use
multiple cores is to configure a Master on Core 1 and a Slave on Core 2,
or 2 Slaves on 2 cores.
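
Concretely, I'm picturing something like this on a single box (the host
and core names are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TwoCoresOneBox {
    public static void main(String[] args) throws Exception {
        // Hypothetical host and core names; both cores live
        // in the same Solr deployment
        SolrServer masterCore =
            new CommonsHttpSolrServer("http://search1:8983/solr/master");
        SolrServer slaveCore =
            new CommonsHttpSolrServer("http://search1:8983/solr/slave");

        // Updates go to the master core...
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        masterCore.add(doc);
        masterCore.commit();

        // ...while queries hit the slave core
        System.out.println(slaveCore.query(new SolrQuery("id:doc-1")));
    }
}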

Am I missing something here?

> > 4) Does it make much difference to get a more powerful Master?
> >
> > Or, on the contrary, as slaves will be queried more often, they
> > should be the better ones? Maybe just the HDDs for the slaves
> > should be as fast as possible?
> 
> Depends on where your bottlenecks are.  Are you getting a lot of
> queries or a lot of updates?

Both, but more queries than updates. That means we shouldn't neglect
the slaves, I guess?


> As for HDDs, people have noted some nice speedups in Lucene using
> Solid-state drives, if you can afford them.  Fast I/O is good if
> you're retrieving whole documents, but once things are warmed up more
> RAM is most important, I think, as many things can be cached.


> > 5) How many slaves does it make sense to have per one Master?
> > What's (roughly) the performance gain from 1 to 2, 2 -> 3, etc?
> > When does it stop making sense to add more slaves?
> 
> I suppose it's when you can handle your peak load, but I don't have
> numbers.  One of the keys is to incrementally test and see what makes
> sense for your scenario.

Right, the numbers given in other responses (thanks Karl and Lars) look
impressive, so we'll consider this option.

> > As far as I understand, it depends mainly on the size of the index.
> > However, I'd guess the time required to do a push for too many
> > slaves can be a problem too, correct?
> 
> The biggest problem for slaves is if the master does an optimization,
> in which case the whole snapshot must be downloaded versus incremental
> additions can be handled by getting just the deltas.

Our initial idea is to send batch updates several times per day rather
than individual real-time updates, then commit and run an optimization
after each batch, as advised here:
http://wiki.apache.org/solr/CollectionDistribution#head-cf174eea2524ae45171a8486a13eea8b6f511f8b
<<We are presuming optimizations should be run once following large
batch-like updates to the collection and/or once a day.>>
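
In code, I imagine our batch job would boil down to something like this
(a SolrJ sketch; the URL, field names and batch size are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class NightlyBatchJob {
    public static void main(String[] args) throws Exception {
        // Placeholder master URL
        SolrServer master =
            new CommonsHttpSolrServer("http://master1:8983/solr");

        // Collect the whole batch and send it in one request
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            batch.add(doc);
        }
        master.add(batch);

        // Commit once per batch, then optimize so the slaves pull
        // a single clean snapshot on their next poll
        master.commit();
        master.optimize();
    }
}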

Once the index is optimized, the slaves will get it on their next pull,
so there will be only a few (or no) incremental updates. New snapshots
won't appear very often, though, so it shouldn't be a problem for
several slaves to fetch them, correct?

Thanks,
Andrey.
