Thanks Jack for your response. But I think Arnon's question was different.

If you need to index 10,000 different collections of documents in Solr (say
a collection denotes someone's Dropbox files), then you have two options:
index all collections in one Solr collection and add a field like
collectionID to each document and to each query, or index each user's
private collection in a separate Solr collection.
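
For illustration, here is a minimal SolrJ sketch of the two approaches (the
collection names, the "user42" tenant ID and the Solr URL are placeholders I
made up; the collectionID field is the one mentioned above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TenantQueryExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {

            // Option 1: one big collection; every query gets a filter on the
            // tenant's collectionID so users only ever see their own documents.
            SolrQuery q1 = new SolrQuery("quarterly report");
            q1.addFilterQuery("collectionID:user42");
            QueryResponse shared = client.query("all_dropboxes", q1);

            // Option 2: one collection per tenant; the collection itself is the
            // isolation boundary, so no filter query is needed.
            SolrQuery q2 = new SolrQuery("quarterly report");
            QueryResponse dedicated = client.query("dropbox_user42", q2);

            System.out.println(shared.getResults().getNumFound());
            System.out.println(dedicated.getResults().getNumFound());
        }
    }
}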

The pros of the latter are that you don't need to add a collectionID filter
to each query. It is also better from a security/privacy standpoint (and for
search quality): a user can only ever search what he has access to -- e.g. he
cannot get a spelling correction for words that never appeared in his own
documents, nor document suggestions (even though the 'context' feature in
some of Lucene's suggesters allows one to achieve that too). And from a
quality standpoint, you don't mix term statistics across different users'
collections, etc.

So from a single node's point of view, you can either index 100M documents
in one index (collection, shard, replica -- whatever -- a single Solr core),
or in 10,000 such cores. From a node-capacity perspective the two are the
same -- the same number of documents will be indexed overall, the same query
workload will be served, etc.

So the question is purely about Solr and its collection management -- is
there anything in that process that prevents one from managing thousands of
collections on a single node, or within a single SolrCloud instance? If so,
what is it -- is it the ZK watchers? Is there a thread per collection at
work? Something else?
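
To make "thousands of collections" concrete, here is a minimal SolrJ sketch
of creating one collection per user (a recent SolrJ is assumed and the exact
builder calls differ between versions; the ZK address, config set name and
naming scheme are placeholders):

import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateManyCollections {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(List.of("zk1:2181"), Optional.empty())
                     .build()) {
            for (int i = 0; i < 10_000; i++) {
                // One single-shard, single-replica collection per user, all
                // sharing the same config set in ZooKeeper.
                CollectionAdminRequest
                    .createCollection("dropbox_user" + i, "shared_conf", 1, 1)
                    .process(client);
            }
        }
    }
}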

Shai

On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> As a general rule, there are only two ways that Solr scales to large
> numbers: a large number of documents and a moderate number of nodes (shards
> and replicas). All other parameters should be kept relatively small, like
> dozens or low hundreds. Even shards and replicas should probably be kept
> down to that same guidance of dozens or low hundreds.
>
> Tens of millions of documents should be no problem. I recommend 100 million
> as the rough limit of documents per node. Of course it all depends on your
> particular data model, data, hardware, and network, so that number could be
> smaller or larger.
>
> The main guidance has always been to simply do a proof of concept
> implementation to test for your particular data model and data values.
>
> -- Jack Krupansky
>
> On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev <arn...@il.ibm.com> wrote:
>
> > We're running some tests on Solr and would like to have a deeper
> > understanding of its limitations.
> >
> > Specifically, we have tens of millions of documents (say 50M) and are
> > comparing several "#collections X #docs_per_collection" configurations.
> > For example, we could have a single collection with 50M docs or 5000
> > collections with 10K docs each.
> > When trying to create the 5000 collections, we start getting frequent
> > errors after 1000-1500 collections have been created. Feels like some
> > limit has been reached.
> > These tests are done on a single node, plus an additional node for replicas.
> >
> > Can someone elaborate on what could limit Solr's ability to handle a high
> > number of collections (if anything)?
> > i.e. if we wanted to have 5K or 10K (or 100K) collections, is there
> > anything in Solr that can prevent it? Where would it break?
> >
> > Thanks,
> > Arnon
>
