Thank you for the replies. The shard-per-user approach is interesting. We will look into it as well.
The errors we're getting with ~1500 collections vary depending on the
action (restarting the server, creating a new collection, etc.). The most
frequent ones are:

1. Connection refused when starting Solr (happens when Solr fails to
start):

java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:806)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
453171 [main-SendThread(localhost.localdomain:2181)] WARN
org.apache.zookeeper.ClientCnxn - Session 0x14df5cd0f900008 for server
null, unexpected error, closing socket connection and attempting reconnect

2. "Error getting leader" when starting Solr (happens when Solr does
start):

org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:871)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:783)
    at org.apache.solr.cloud.ZkController.register(ZkController.java:731)
    at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
    at java.lang.Thread.run(Thread.java:809)
Caused by: org.apache.solr.common.SolrException: No registered leader was
found after waiting for 1560000ms, collection: owner_234409 slice: shard1
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:531)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderUrl(ZkStateReader.java:505)
    at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:850)
    ... 6 more
3. "Collection already exists" (though it does not) when trying to create
a collection:

15/06/2015, 11:35:41 WARN OverseerCollectionProcessor
OverseerCollectionProcessor.processMessage : createcollection
15/06/2015, 11:35:41 ERROR OverseerCollectionProcessor Collection
createcollection of createcollection
failed:org.apache.solr.common.SolrException: collection already exists:
owner_484011
    at org.apache.solr.cloud.OverseerCollectionProcessor.createCollection(OverseerCollectionProcessor.java:1545)
    at org.apache.solr.cloud.OverseerCollectionProcessor.processMessage(OverseerCollectionProcessor.java:385)
    at org.apache.solr.cloud.OverseerCollectionProcessor.run(OverseerCollectionProcessor.java:198)
    at java.lang.Thread.run(Thread.java:809)


From: Erick Erickson <erickerick...@gmail.com>
To: solr-user@lucene.apache.org
Date: 14/06/2015 08:47 PM
Subject: Re: Limitation on Collections Number

re: hybrid approach. Hmmm, _assuming_ that no single user has a really
huge number of documents, you might be able to use a single collection
(or a much smaller group of collections) by using custom routing. That
allows you to send all the docs for a particular user to a particular
shard. There are some obvious issues here with long-tail users: most of
your users have +/- X docs on average, and three of them have 100,000X
docs. There are probably some not-so-obvious gotchas too...

True, for user X you'd send sub-requests to all shards, but all but one
of them wouldn't find anything, so they would _probably_ be close to
no-ops. Conceptually, each shard then becomes N of your current
collections. Maybe there's a sweet spot performance-wise where you're
hosting some number of users per shard (or aggregate N docs per shard,
or...).
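To make the custom-routing idea concrete, here is a minimal sketch (the
user and document ids are made up for illustration) of building document
ids for SolrCloud's default compositeId router, which hashes the part of
the id before "!" so that all documents sharing a prefix land on the same
shard:

```python
# Sketch of compositeId routing keys: an id of the form "user!doc" is
# routed by hashing only the "user" prefix, co-locating one user's
# documents on a single shard. Ids here are illustrative, not real data.

def route_key(user_id: str, doc_id: str) -> str:
    """Build a compositeId-routed document id of the form 'user!doc'."""
    return f"{user_id}!{doc_id}"

docs = [
    {"id": route_key("owner_234409", "doc-1"), "text": "first doc"},
    {"id": route_key("owner_234409", "doc-2"), "text": "second doc"},
    {"id": route_key("owner_484011", "doc-1"), "text": "another user"},
]

# Documents with the same prefix hash to the same shard.
prefixes = sorted({d["id"].split("!", 1)[0] for d in docs})
print(prefixes)  # ['owner_234409', 'owner_484011']
```

At query time, passing a _route_=owner_234409! parameter should let Solr
send the request only to the shard holding that prefix, rather than
fanning out to every shard.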
Of course there's more maintenance here; in particular you have to manage
the size of the shards yourself, since the possibility of them getting
lopsided is higher, etc.

FWIW,
Erick

On Sun, Jun 14, 2015 at 9:48 AM, Shai Erera <ser...@gmail.com> wrote:

>> My answer remains the same - a large number of collections (cores) in a
>> single Solr instance is not one of the ways in which Solr is designed
>> to scale. To repeat, there are only two ways to scale Solr: number of
>> documents and number of nodes.
>
> Jack, I understand that, but I still feel you're missing the point. We
> didn't ask about scaling Solr at all - it's a question about indexing
> strategy when you need to index multiple disparate collections of
> documents -- one collection with a collectionID field, or a Solr
> collection per set of documents.
>
>> If you are _not_ in SolrCloud, then there's the "Lots of cores"
>> solution, see: http://wiki.apache.org/solr/LotsOfCores. Pay attention
>> to the warning at the top: NOT FOR SOLRCLOUD!
>
> Thanks Erick. We did read this a while ago. We are in SolrCloud mode
> because we want to keep a replica per collection, and SolrCloud makes
> that easy for us. However, we aren't in a typical SolrCloud setup, where
> we just need to index 1B documents and sharding + replication comes to
> our aid.
>
> If we were not in SolrCloud mode, I imagine we'd need to manage the
> replicas ourselves and also index each document to both replicas
> manually? That is, there is no way in _non_ SolrCloud mode to tell two
> cores that they are replicas of one another - correct?
>
>> A user may sign on and search her documents just a few times a day,
>> for a few minutes at a time.
>
> This is almost true -- you may visit your Dropbox once an hour (or it
> may be open in the background on your computer), but the server still
> receives documents (e.g. shares) from other users frequently, and needs
> to index them into your collection.
> Not saying this isn't a good fit, just mentioning that it's not only the
> user who can update his/her collection, and therefore one's collection
> may be constantly active. Eventually this needs to be benchmarked.
>
> Our benchmarks show that with 1000 such collections, we achieve
> significantly better response times from the multi-collection setup (one
> Solr collection per user) than from the single-collection setup (one
> Solr collection for *all* users, with a collectionID field added to all
> documents). Our next step is to try a hybrid mode where we store groups
> of users in the same Solr collection, but not all of them in the same
> Solr collection. So if Solr works well with 1000 collections, maybe we
> will index 10 users in one such collection ... we'll give it a try.
>
> I think SOLR-7191 may solve the general use case, though I haven't yet
> read through it thoroughly.
>
> Shai
>
> On Sun, Jun 14, 2015 at 6:50 PM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
>> Yes, there are some known problems when scaling to a large number of
>> collections, say 1000 or above. See
>> https://issues.apache.org/jira/browse/SOLR-7191
>>
>> On Sun, Jun 14, 2015 at 8:30 PM, Shai Erera <ser...@gmail.com> wrote:
>>
>> > Thanks Jack for your response. But I think Arnon's question was
>> > different.
>> >
>> > If you need to index 10,000 different collections of documents in
>> > Solr (say a collection denotes someone's Dropbox files), then you
>> > have two options: index all collections in one Solr collection and
>> > add a field like collectionID to each document and query, or index
>> > each user's private collection in a different Solr collection.
>> >
>> > The pros of the latter is that you don't need to add a collectionID
>> > filter to each query. Also, from a security/privacy standpoint (and
>> > search quality) - a user can only ever search what he has access to
>> > -- e.g.
>> > he cannot get a spelling correction for words he never saw in his
>> > documents, nor document suggestions (even though the 'context' in
>> > some of the Lucene suggesters allows one to do that too). From a
>> > quality standpoint, you don't mix different term statistics etc.
>> >
>> > So from a single node's point of view, you can either index 100M
>> > documents in one index (collection, shard, replica -- whatever -- a
>> > single Solr core), or in 10,000 such cores. From a node-capacity
>> > perspective the two are the same -- the same number of documents will
>> > be indexed overall, the same query workload served, etc.
>> >
>> > So the question is purely about Solr and its collections management
>> > -- is there anything in that process that can prevent one from
>> > managing thousands of collections on a single node, or within a
>> > single SolrCloud instance? If so, what is it -- is it the ZK
>> > watchers? Is there a thread per collection at work? Others?
>> >
>> > Shai
>> >
>> > On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky <
>> > jack.krupan...@gmail.com> wrote:
>> >
>> > > As a general rule, there are only two ways that Solr scales to
>> > > large numbers: a large number of documents and a moderate number of
>> > > nodes (shards and replicas). All other parameters should be kept
>> > > relatively small, like dozens or low hundreds. Even shards and
>> > > replicas should probably be kept down to that same guidance of
>> > > dozens or low hundreds.
>> > >
>> > > Tens of millions of documents should be no problem. I recommend 100
>> > > million as the rough limit of documents per node. Of course it all
>> > > depends on your particular data model, data, hardware, and network,
>> > > so that number could be smaller or larger.
>> > >
>> > > The main guidance has always been to simply do a proof-of-concept
>> > > implementation to test for your particular data model and data
>> > > values.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev <arn...@il.ibm.com>
>> > > wrote:
>> > >
>> > > > We're running some tests on Solr and would like to have a deeper
>> > > > understanding of its limitations.
>> > > >
>> > > > Specifically, we have tens of millions of documents (say 50M) and
>> > > > are comparing several "#collections X #docs_per_collection"
>> > > > configurations. For example, we could have a single collection
>> > > > with 50M docs, or 5000 collections with 10K docs each.
>> > > > When trying to create the 5000 collections, we start getting
>> > > > frequent errors after 1000-1500 collections have been created.
>> > > > It feels like some limit has been reached.
>> > > > These tests are done on a single node, plus an additional node
>> > > > for replicas.
>> > > >
>> > > > Can someone elaborate on what could limit Solr to a high number
>> > > > of collections (if at all)?
>> > > > I.e., if we wanted to have 5K or 10K (or 100K) collections, is
>> > > > there anything in Solr that would prevent it? Where would it
>> > > > break?
>> > > >
>> > > > Thanks,
>> > > > Arnon
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
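As a footnote to the hybrid approach discussed above (groups of users per
Solr collection rather than one collection each), the bucketing could
look roughly like the following sketch. The collection-naming scheme and
the ownerID field are assumptions for illustration, not code from our
system:

```python
# Minimal sketch of the hybrid approach: hash each user into one of a
# fixed number of group collections, and confine queries to that user's
# documents with an ownerID filter query. Names are illustrative only.
import zlib

NUM_GROUPS = 100  # e.g. 1000 users -> ~10 users per group collection

def collection_for(user_id: str) -> str:
    """Deterministically pick the group collection holding this user."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % NUM_GROUPS
    return f"owners_group_{bucket:03d}"

def query_params(user_id: str, q: str) -> dict:
    """Query parameters confining a search to one user's documents."""
    return {
        "collection": collection_for(user_id),
        "q": q,
        "fq": f"ownerID:{user_id}",  # filter to this user's docs only
    }

# The mapping is stable, so a user's documents are always indexed into,
# and searched in, the same group collection.
params = query_params("owner_234409", "text:hello")
print(params["collection"], params["fq"])
```

Long-tail users with very large document counts (Erick's point above)
would likely need to be special-cased into dedicated collections rather
than hashed into a shared group.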