Hi All,

TL;DR version: We think we want to explore Lucene/Solr 4.0 and SolrCloud, but I’m not sure whether there is any good doco or any articles on how to make architecture choices when chopping up big indexes… and what other general considerations are part of the equation?
====

I’m throwing this post out to the public to see if any kind and knowledgeable individuals can provide some educated feedback on the options our team is currently considering for the future architecture of our Solr indexes.

We have a loose collection of Solr indexes, each with a specific purpose and differing schemas and document makeup, containing just over 300 million documents with varying degrees of full-text. Our existing architecture is showing its age: it is really just the setup used for small/medium indexes scaled upwards.

The biggest individual index is around 140 million documents and currently exists as a Master/Slave setup, with the Master receiving all writes in the background and the 3 load-balanced slaves updating on a 5 minute poll interval. The master index is 451gb on disk, and the 3 slaves are running JVMs with RAM allocations of 21gb (right now, anyway). We are struggling under the traffic load and/or the scale of our indexes (mainly the latter, I think). We know this isn’t the best way to run things, but the index in question is a fairly new addition, and each time we run into issues we tend to make small changes to improve things in the short term: bumping the RAM allocation up, toying with poll intervals, garbage collection config, etc.

We’ve historically run into issues with facet queries generating a lot of bloat on some types of fields. These had to be solved through internal modifications, but I expect we’ll have to review this with the new version anyway. Related to that, there are some question marks over generating good facet data from a sharded approach.

In particular, though, we are really struggling with garbage collection on the slave machines around the time the slave/master sync occurs, because multiple copies of the index are held in memory until all searchers have de-referenced the old index. The machines typically either crash from OOM when a third and/or fourth copy of the index appears because really old searchers won’t ‘let go’ (hence we play with widening poll intervals), or, more rarely, they become perpetually locked in GC and have to be restarted (not 100% sure why, but large heap allocations aren’t helping, and cache warming may be a culprit).

The team has lots of things we want to try to improve matters, but given the scale of the systems it is very hard to just try things out without considerable resourcing implications. The entire ecosystem is spread across 7 machines resourced in the 64gb-100gb RAM range (this is just me poking around our servers… not a thorough assessment). Each machine runs several JVMs, so that for each ‘type’ of index there are typically 2-4 load-balanced slaves available at any given time. One of those machines is used exclusively as the Master for all indexes and receives no search traffic… just lots of write traffic.

I believe the answers to some of these questions will depend heavily on schemas and documents, so I don’t imagine anyone can answer them better than we can after testing and benchmarking… but right now we are still trying to choose where to start, so broad ideas are very welcome.
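For concreteness, the master/slave sync is just Solr’s standard ReplicationHandler; the slave side of our solrconfig.xml is roughly this shape (hostname and core name are placeholders, not our real ones):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <!-- placeholder URL; ours points at the dedicated master box -->
      <str name="masterUrl">http://our-master:8983/solr/bigindex/replication</str>
      <!-- hh:mm:ss - this is the 5 minute poll mentioned above -->
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>

Nothing exotic there; the pain is all in what happens on the slaves around each pull.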
The kinds of things we are currently thinking about:

- Moving to v4.0 (we’ve only just completed our v3.5 upgrade) to take advantage of the reduced RAM consumption from https://issues.apache.org/jira/browse/LUCENE-2380. We are hoping this has the double-whammy impact of improving garbage collection as well: lots of full-text data should mean lots of Strings, and thus lots of savings from this change.

- Moving to a basic sharded approach. We’ve only just started testing this, and I’m not involved, so I’m not sure what early results we’ve got yet.

- Given that we’d like to move to v4.0 anyway, I believe this opens up the option of a SolrCloud implementation… my suspicion is that this is where the money is at… but I’d be happy to hear feedback (good or bad) from people who are using it in production.

- Hardware: we are not certain that the current approach of a few colossal machines is any better than lots of smaller clustered machines… and it is prohibitively expensive to experiment here. We don’t think our current setup of SSDs and fibre-channel connections is creating too many bottlenecks on I/O, and we rarely see other hardware-related issues, but again I’d be curious whether people have observed contradictory evidence. My suspicion is that with the changes above, our current hardware would handle the load far better than it currently does.

- Are there any pros and cons documented out there for making decisions on sharding and resource allocation? Like:
  - Good guidelines for choosing the number of shards versus individual shard sizes.
  - Garbage collection implications of high search load with nearly non-stop write activity? A few big JVMs versus many small JVMs?
  - Load balancing in SolrCloud during commits? This very much overlaps with the above point. We had toyed with the idea of scripting our own method of removing a machine from the load balancer, running a ‘commit’ (pull from master) on it, then putting it back into rotation with the highest priority to receive searches (there’s a rough sketch of what we had in mind in the P.S. below)… not knowing whether SolrCloud already does this, though. Our current pull interval setup is fairly primitive, but changing it while avoiding stale search results is complicated.
  - When we shard, what are we sacrificing? Performance/quality of faceting? This one’s a real noob question from me, since I still have to RTFM. It is, however, our primary concern with sharding.

- There are some more technical optimisations being considered as well, such as playing with Linux page sizes. I also noticed a reference here: http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html on ‘tuning swappiness down to 0’.

- There are also some questions over how efficient Solr/Lucene 4 is while a Slave is pulling the Master’s index segments, i.e. is there much improvement here? For example, per-segment searching and per-segment faceting (are these just goals, or have they already been met?), which should make reopening an index with only minor segment changes extremely cheap.

- Related to the above point, there was some recollection amongst the team here of Solr’s field caches causing issues for auto-warming. Is this still current, or even correct? I could be misunderstanding it myself.

Anyway, I realise this is quite a brain dump, but I’m hoping there are others who’ve looked at these sorts of things and are willing to share. If there are any useful outcomes from our changes and benchmarking that others would be interested in, I’m also happy to follow up with contributions to public doco. Particularly in relation to the last two points (where I’m passing on information I don’t 100% understand myself right now), we can even devote some development resources to problems the community is already aware of that just need some dev time, if there are performance gains to be had.

Ta,
Greg
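P.S. To make the load-balancer rotation idea above a bit more concrete, below is roughly the sketch we had in mind. It is just that, a sketch: the two load balancer calls are placeholders for whatever our LB’s API actually exposes, I’m assuming the slave’s /replication handler reports isReplicating in its ‘details’ response (as the admin page seems to use), and it doesn’t yet account for searcher warming after the pull, which is part of what bites us now.

  import java.io.InputStream;
  import java.net.URL;
  import java.util.Scanner;

  /**
   * Sketch: take one slave out of rotation, force an immediate
   * replication pull, wait for the pull to finish, then put the
   * slave back into the load balancer.
   */
  public class SlaveRotation {

      void rotateAndSync(String slaveCoreUrl) throws Exception {
          removeFromLoadBalancer(slaveCoreUrl);  // placeholder, LB-specific
          // Standard ReplicationHandler command to trigger a pull right now:
          httpGet(slaveCoreUrl + "/replication?command=fetchindex");
          // Poll the handler until it says the pull has finished
          // (assumes the default XML response format):
          while (httpGet(slaveCoreUrl + "/replication?command=details")
                  .contains("<str name=\"isReplicating\">true</str>")) {
              Thread.sleep(5000);
          }
          addToLoadBalancer(slaveCoreUrl);       // placeholder, LB-specific
      }

      // Minimal HTTP GET returning the response body as a string.
      private String httpGet(String url) throws Exception {
          try (InputStream in = new URL(url).openStream();
               Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
              return s.hasNext() ? s.next() : "";
          }
      }

      private void removeFromLoadBalancer(String url) { /* LB-specific */ }
      private void addToLoadBalancer(String url) { /* LB-specific */ }
  }

If SolrCloud already handles the equivalent of this during commits, that alone would be a strong argument for it from our side.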