Yes, totally agree. We run 500M+ docs in a (non-cloud) Solr 4, and it even performs reasonably well on commodity hardware with lots of faceting and concurrent indexing! OK, you need a lot of RAM to keep faceting happy, but it works.
++1 for the automagic shard creator. We've been looking into doing this sort of thing internally - i.e. when a shard reaches a certain size/number of docs, it creates 'sub-shards' to which new commits are sent, and queries to the 'parent' shard include them. The concept works, as long as you don't try any non-distributed stuff - it's one reason why all our fields are always single-valued. There are also other implications like cleanup, deletes and security to take into account, to name a few.

A cool side effect of sub-sharding (for lack of a snappy term) is that the parent shard then stops suffering from auto-warming latency due to commits (we do a fair amount of committing). In theory, you could carry on sub-sharding until your hardware starts gasping for air.

On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam <bram.van...@intix.eu> wrote:

> On 01/04/2015 02:22 AM, Jack Krupansky wrote:
>
>> The reality doesn't seem to
>> be there today. 50 to 100 million documents, yes, but beyond that takes
>> some kind of "heroic" effort, whether a much beefier box, very careful and
>> limited data modeling or limiting of query capabilities or tolerance of
>> higher latency, expert tuning, etc.
>>
>
> I disagree. On the scale, at least. Up until 500M Solr performs "well"
> (read: well enough considering the scale) in a single shard on a single box
> of commodity hardware. Without any tuning or heroic efforts. Sure, some
> queries aren't as snappy as you'd like, and sure, indexing and querying at
> the same time will be somewhat unpleasant, but it will work, and it will
> work well enough.
>
> Will it work for thousands of concurrent users? Of course not. Anyone who
> is after that sort of thing won't find themselves in this scenario -- they
> will throw hardware at the problem.
>
> There is something to be said for making sharding less painful. It would
> be nice if, for instance, Solr would automagically create a new shard once
> some magic number was reached (2B at the latest, I guess). But then that'll
> break some query features ... :-(
>
> The reason we're using single large instances (sometimes on beefy
> hardware) is that SolrCloud is a pain. Not just from an administrative
> point of view (though that seems to be getting better, kudos for that!),
> but mostly because some queries cannot be executed with distributed=true.
> Our users, at least, prefer a slow query over an impossible query.
>
> Actually, this 2B limit is a good thing. It'll help me convince
> $management to donate some of our time to Solr :-)
>
> - Bram
>
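
For anyone trying to picture the sub-sharding idea from the top of this message, here is a rough Python sketch of the bookkeeping only. It is not our internal code: the class names, the shard URLs, the `_subN` naming and the tiny threshold are all made up for illustration, and actually creating cores, cleanup, deletes and security are left out. The one real Solr piece is the `shards` request parameter, which a distributed query uses to list the shards to search.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Shard:
    url: str            # hypothetical address, e.g. "host1:8983/solr/docs"
    num_docs: int = 0   # documents indexed into this shard so far


@dataclass
class ShardFamily:
    """A 'parent' shard plus the sub-shards spawned once it fills up."""
    parent: Shard
    max_docs: int                       # hypothetical per-shard threshold
    subs: List[Shard] = field(default_factory=list)

    def write_target(self) -> Shard:
        """Return the shard that should receive new documents.

        Once the newest shard has reached max_docs, a new sub-shard is
        registered and all later writes go there. The parent therefore
        stops seeing commits (and the auto-warming they trigger).
        """
        current = self.subs[-1] if self.subs else self.parent
        if current.num_docs >= self.max_docs:
            # Naming scheme is an assumption made for this sketch.
            current = Shard(url=f"{self.parent.url}_sub{len(self.subs) + 1}")
            self.subs.append(current)
        return current

    def add_docs(self, count: int) -> Shard:
        """Record a batch of new documents; batches are not split across shards."""
        target = self.write_target()
        target.num_docs += count
        return target

    def query_params(self, q: str) -> Dict[str, str]:
        """Build request parameters so a query to the parent fans out to every sub-shard."""
        shard_list = ",".join(s.url for s in [self.parent, *self.subs])
        return {"q": q, "shards": shard_list}


if __name__ == "__main__":
    family = ShardFamily(parent=Shard("host1:8983/solr/docs"), max_docs=10)
    family.add_docs(10)   # fills the parent
    family.add_docs(5)    # parent is full, so a sub-shard is created
    print(family.query_params("*:*"))
```

The parameters returned by `query_params` would be passed to whatever HTTP client sends the query to the parent shard's select handler. Anything that does not work with distributed search will not work through this either, which is the restriction mentioned above.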