On 8/1/2014 4:19 AM, anand.mahajan wrote:
> My current deployment:
> i) I'm using Solr 4.8 and have set up a SolrCloud with 6 dedicated
> machines - 24 Core + 96 GB RAM each.
> ii) There are over 190M docs in the SolrCloud at the moment (for all
> replicas it's consuming 2340GB of disk overall, which implies each doc
> is about 5-8kb in size.)
> iii) The docs are split into 36 shards, with 3 replicas per shard (108
> Solr Jetty processes in all, split over 6 servers, leaving about 18
> Jetty JVMs running on each host).
<snip>

> 2. Should I have been better served had I deployed a single Jetty Solr
> instance per server with multiple cores running inside? The servers do
> start to swap after a couple of days of Solr uptime - right now we
> reboot the entire cluster every 4 days.

Others have already mentioned the problems with autoCommit being far too
frequent, so I'll just echo their advice to increase the intervals.

You should DEFINITELY have exactly one Jetty process per server. One
Solr process can handle *many* shard replicas (cores). With 18 per
server, that's a LOT of overhead (especially memory) that is not
required.

> 3. The routing key is not able to effectively balance the docs across
> the available shards - there are a few shards with just about 2M docs,
> and others with over 11M docs. Shall I split the larger shards? But I
> do not have more nodes / hardware to allocate to this deployment. In
> that case, would splitting up the large shards give better read-write
> throughput?
>
> 4. To remain with the current hardware - would it help if I remove 1
> replica each from a shard? But that would mean even when just 1 node
> goes down for a shard there would be only 1 live node left that would
> not serve the write requests.

Why not just let Solr handle routing automatically with the compositeId
router? Chances are excellent that this will result in well-balanced
shards. Unless you want completely manual sharding (not controlled at
all by SolrCloud), don't complicate it by trying to influence the
routing.

I think that when you say "1 live node left that would not serve the
write requests" above, you may have a misconception about SolrCloud.
*ALL* replicas carry the same indexing load. Although the replication
handler is required when you use SolrCloud, replication is *NOT* how the
data gets onto all replicas. Each update request makes its way to the
shard leader, then the shard leader sends that update request to all
replicas, and each one independently indexes the content.
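To make the commit advice concrete, here's a sketch of less aggressive
settings in solrconfig.xml. The intervals shown are illustrative
assumptions, not tuned recommendations for your cluster - pick values
based on how quickly new docs must become visible:

```xml
<!-- Hard commit: flush to stable storage on a long interval. With
     openSearcher=false, a hard commit does not open a new searcher,
     which keeps it relatively cheap. -->
<autoCommit>
  <maxTime>300000</maxTime>          <!-- 5 minutes; assumed value -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: controls document visibility. This is what should be
     relatively frequent, not the hard commit. -->
<autoSoftCommit>
  <maxTime>60000</maxTime>           <!-- 1 minute; assumed value -->
</autoSoftCommit>
```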
Replication only gets used when something goes wrong and a shard needs
recovery.

> 5. Also, is there a way to control where the split-shard replicas go?
> Is there a pattern / rule that Solr follows when it creates replicas
> for split shards?
>
> 6. I read somewhere that creating a core costs the OS one thread and a
> file handle. Since a core represents an index in its entirety, would
> it not be allocated the configured number of write threads? (The
> default is 8.)
>
> 7. The Zookeeper cluster is deployed on the same boxes as the Solr
> instances - would separating the ZK cluster out help?

I don't know for sure what rules are followed when creating the
replicas (cores). SolrCloud will make sure that replicas end up on
different nodes, with a node being a Solr JVM, *NOT* a machine. If one
node already has fewer replicas onboard than another, it will likely be
preferred ... but I actually don't know whether that logic is
incorporated.

Each core probably does create a thread, but there will be far more
than one file handle. A Lucene index (there is one in every core) is
normally composed of dozens or hundreds of files, each of which
requires a file handle. Each network connection also uses file handles.

If load is light, putting ZK on the same nodes/disks as Solr is not a
big deal. If load is heavy, you will want the ZK database to have its
own dedicated disk spindle(s) ... but ZK's CPU requirements are usually
very small. Completely separate servers are not usually required, but
if you can do that, you will be much better protected against
performance problems.

Thanks,
Shawn