It's usually best to use compositeId routing. It assigns each document to a shard by hashing its ID, which spreads the documents (and so the load) evenly. Otherwise, _you_ have to be responsible for making sure that the docs are reasonably evenly distributed, which can be a pain.
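
To make the skew concrete, here is a rough sketch comparing per-shard document counts when documents are spread by a hash of their ID versus placed one category per shard. The category sizes are invented for illustration (15 categories, 5 million docs in total), and the hash is a generic bit mixer standing in for Solr's real routing, which uses MurmurHash3 over per-shard hash ranges. Treat it as arithmetic under those assumptions, not as a model of Solr internals.

```java
import java.util.Arrays;

/** Illustration only: per-shard document counts under two routing schemes. */
final class RoutingSkewSketch {

    // Hypothetical, deliberately skewed docs-per-category figures
    // (15 categories, 5,000,000 docs in total). Not taken from the thread.
    static final int[] CATEGORY_SIZES = {
        1_500_000, 1_000_000, 700_000, 500_000, 400_000,
        300_000, 200_000, 150_000, 100_000, 60_000,
        40_000, 20_000, 15_000, 10_000, 5_000
    };

    // Stand-in for hash routing: mix the doc id (SplitMix64-style finalizer)
    // and take it modulo the shard count. This is NOT Solr's actual algorithm.
    static int hashShard(long docId, int numShards) {
        long z = docId + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        z ^= (z >>> 31);
        return (int) Math.floorMod(z, (long) numShards);
    }

    // Count docs per shard when every doc is routed by the hash of its id.
    static long[] countByHash(int[] categorySizes, int numShards) {
        long[] counts = new long[numShards];
        long docId = 0;
        for (int size : categorySizes) {
            for (int i = 0; i < size; i++) {
                counts[hashShard(docId++, numShards)]++;
            }
        }
        return counts;
    }

    // Count docs per shard when each category gets its own shard.
    static long[] countByCategory(int[] categorySizes) {
        long[] counts = new long[categorySizes.length];
        for (int c = 0; c < categorySizes.length; c++) {
            counts[c] = categorySizes[c];
        }
        return counts;
    }

    // Ratio of the largest shard to the smallest one; 1.0 means perfectly even.
    static double maxToMinRatio(long[] counts) {
        long max = Arrays.stream(counts).max().getAsLong();
        long min = Arrays.stream(counts).min().getAsLong();
        return (double) max / min;
    }

    public static void main(String[] args) {
        long[] hashed = countByHash(CATEGORY_SIZES, 15);
        long[] byCategory = countByCategory(CATEGORY_SIZES);
        System.out.printf("hash routing:     largest/smallest = %.3f%n", maxToMinRatio(hashed));
        System.out.printf("category routing: largest/smallest = %.1f%n", maxToMinRatio(byCategory));
    }
}
```

With these made-up sizes the category-per-shard layout has a largest-to-smallest ratio of 300 by construction, while the hashed layout stays close to 1. The same imbalance would apply to update traffic if updates follow the category sizes.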

Implicit routing is usually best in situations where you index to a particular shard for a while and then move on to another shard. Think of news stories that you want to keep for 30 days and then dispose of: implicit routing lets you add and remove shards on a daily basis. That doesn't sound particularly suitable for your situation.

But I do have to ask: why are you sharding at all? 5M docs is a fairly small index by modern standards, and there's some inevitable overhead with sharding that you could avoid. Mostly I'm asking whether you've stress-tested with that query and update rate. The 7,000 updates/second do worry me a bit with a single-shard solution, but if you get adequate response times under that load, then there's no need to shard. Use all the hardware to support querying.

Sharding will improve indexing throughput without doubt; Solr scales roughly linearly with the number of shards. Do use CloudSolrClient for your updates, as it routes docs to the correct leader, avoiding one extra hop.

Given your soft commit setting of 5 seconds, I infer that the allowable time for updates to become searchable is quite small, indicating that NRT replicas are the way to go.

I'll also say that this commit rate is pretty aggressive given your volume. Is it really necessary for it to be that short? Your caches are going to be pretty useless since they won't stick around for very long. Look carefully at the autowarming time: to make any good use of your filterCache, you'll have to autowarm it some, and if you do, you need to ensure that the autowarming time is less than your soft commit interval.

Best,
Erick

On Thu, Jan 3, 2019 at 10:34 PM Doss <itsmed...@gmail.com> wrote:
>
> Hi,
>
> We are planning to setup a SOLR cloud with 6 nodes for 3 million records
> (expected to grow to 5 million in a year), with 150 fields and over all
> index would come around 120GB.
>
> We plan to use NRT with 5 sec soft commit and 1 min hard commit.
>
> Expected query volume would be 5000 select hits per second and 7000 inserts
> / updates per second.
>
> Our records can be classified under 15 categories, but they will not have
> even number of records, few categories will have more number of records.
>
> Queries will also come in the same pattern, that is., categories with high
> number of records will get high volume of select / updates.
>
> For this situation we are confused in choosing what type of sharding would
> help us in better performance in both select and updates?
>
> Composite / implicit - Composite with 15 shards or implicit based on 15
> categories.
>
> Our select queries will have minimum 15 filters in fq, with extensive
> function queries used in sort.
>
> Updates will have 6 integer fields, 5 string fields and 4 string/integer
> fields with multi valued.
>
> If we choose implicit to boost select performance, our updates will be
> heavy on few shards (major category shards), will this be a problem?
>
> For our kind of situation which replica Type can we choose? All NRT or NRT
> with TLOG ?
>
> Thanks in advance!
>
> Best,
> Doss.