Thanks Jack for your response. But I think Arnon's question was different. If you need to index 10,000 different collections of documents in Solr (say each collection denotes someone's Dropbox files), then you have two options: index all collections in one Solr collection and add a field like collectionID to each document and each query, or index each user's private collection in a separate Solr collection.
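
To make the two options concrete, here is a rough SolrJ sketch of what query time looks like in each setup. The collection names, the collectionID field, and the example query are my own placeholders, not anything from Arnon's setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TenantQuerySketch {
  public static void main(String[] args) throws Exception {
    // Option 1: one big collection ("docs"); every query must carry a
    // per-user filter, otherwise users can see each other's documents.
    try (HttpSolrClient shared = new HttpSolrClient("http://localhost:8983/solr/docs")) {
      SolrQuery q = new SolrQuery("quarterly report");
      q.addFilterQuery("collectionID:user123"); // hypothetical tenant field
      QueryResponse rsp = shared.query(q);
      System.out.println("shared index hits: " + rsp.getResults().getNumFound());
    }

    // Option 2: one collection per user ("user123"); the collection itself
    // is the isolation boundary, so no filter is needed.
    try (HttpSolrClient perUser = new HttpSolrClient("http://localhost:8983/solr/user123")) {
      QueryResponse rsp = perUser.query(new SolrQuery("quarterly report"));
      System.out.println("per-user index hits: " + rsp.getResults().getNumFound());
    }
  }
}
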
The pros of the latter are that you don't need to add a collectionID filter to each query. There is also a security/privacy benefit (and a search-quality one): a user can only ever search what he has access to -- e.g. he cannot get a spelling correction for words that never appeared in his own documents, nor suggestions drawn from other users' documents (although the 'context' feature of some Lucene suggesters allows restricting that too). From a quality standpoint, you also don't mix term statistics across users.

So from a single node's point of view, you can either index 100M documents in one index (collection, shard, replica -- whatever -- a single Solr core), or spread them across 10,000 such cores. From a node-capacity perspective the two are the same: the same number of documents is indexed overall, and the same query workload is served. So the question is purely about Solr and its collection management -- is there anything in that process that prevents one from managing thousands of collections on a single node, or within a single SolrCloud instance? If so, what is it -- is it the ZooKeeper watchers? Is there a thread per collection at work? Something else?

Shai

On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> As a general rule, there are only two ways that Solr scales to large
> numbers: a large number of documents and a moderate number of nodes
> (shards and replicas). All other parameters should be kept relatively
> small, like dozens or low hundreds. Even shards and replicas should
> probably be kept down to that same guidance of dozens or low hundreds.
>
> Tens of millions of documents should be no problem. I recommend 100
> million as the rough limit of documents per node. Of course it all
> depends on your particular data model, data, hardware, and network, so
> that number could be smaller or larger.
>
> The main guidance has always been to simply do a proof-of-concept
> implementation to test for your particular data model and data values.
>
> -- Jack Krupansky
>
> On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev <arn...@il.ibm.com> wrote:
>
> > We're running some tests on Solr and would like to have a deeper
> > understanding of its limitations.
> >
> > Specifically, we have tens of millions of documents (say 50M) and are
> > comparing several "#collections X #docs_per_collection" configurations.
> > For example, we could have a single collection with 50M docs, or 5000
> > collections with 10K docs each.
> > When trying to create the 5000 collections, we start getting frequent
> > errors after 1000-1500 collections have been created. It feels like
> > some limit has been reached.
> > These tests were done on a single node, plus an additional node for
> > replicas.
> >
> > Can someone elaborate on what could limit Solr to a high number of
> > collections (if at all)?
> > I.e., if we wanted to have 5K or 10K (or 100K) collections, is there
> > anything in Solr that would prevent it? Where would it break?
> >
> > Thanks,
> > Arnon