Re: Re: feedback on Solr 4.x LotsOfCores feature
bq: sharing the underlying solrconfig object the configset introduced in JIRA SOLR-4478 seems to be the solution for non-SolrCloud mode

SOLR-4478 will NOT share the underlying config objects; it simply shares the underlying directory. Each core will, at least as presently envisioned, simply read the files that exist there and create its own solrconfig object. Schema objects may be shared, but not config objects. It may turn out to be relatively easy to do in the configset situation, but last time I looked at sharing the underlying config object it was too fraught with problems.

bq: 15K cores is around 4 minutes

I find this very odd. On my laptop, with a spinning disk, I think I was seeing 1K cores discovered/sec. You're seeing roughly 16x slower, so I have no idea what's going on here. If this is just reading the files, you should be seeing horrible disk contention. Are you on some kind of networked drive?

bq: To do that in background and to block on that request until core discovery is complete, should not work for us (due to the worst case).

What other choices are there? Either you have to do it up front or with some kind of blocking. Hmmm, I suppose you could keep some kind of custom store (DB? File? ZooKeeper?) that would keep the last known layout. You'd still have a worst-case situation where the core you were trying to load wasn't in your persistent store and you'd _still_ have to wait for the discovery process to complete.

bq: and we will use the cores Auto option to create load or only load the core on

Interesting. I can see how this could all work without any core discovery, but it does require a very specific setup.
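The "custom store that keeps the last known layout" idea above could be sketched roughly as below: persist the core-name-to-directory mapping once it has been discovered, reload it at startup, and fall back to a full disk walk only on a cache miss. This is a minimal, hypothetical sketch; CoreLayoutCache and its methods are illustrative names, not Solr APIs.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical last-known-layout store backed by a properties file,
// so a restart can skip walking 50K core directories.
public class CoreLayoutCache {
    private final Properties layout = new Properties();
    private final Path cacheFile;

    public CoreLayoutCache(Path cacheFile) throws IOException {
        this.cacheFile = cacheFile;
        if (Files.exists(cacheFile)) {
            try (InputStream in = Files.newInputStream(cacheFile)) {
                layout.load(in);  // fast path: no directory walk needed
            }
        }
    }

    // Look up a core; an empty result would trigger (slow) discovery in the caller.
    public Optional<Path> instanceDir(String coreName) {
        String dir = layout.getProperty(coreName);
        return dir == null ? Optional.empty() : Optional.of(Paths.get(dir));
    }

    // Record a discovered core and persist the whole layout.
    public void record(String coreName, Path instanceDir) throws IOException {
        layout.setProperty(coreName, instanceDir.toString());
        try (OutputStream out = Files.newOutputStream(cacheFile)) {
            layout.store(out, "last known core layout");
        }
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("layout", ".properties");
        CoreLayoutCache cache = new CoreLayoutCache(f);
        cache.record("user1", Paths.get("/solr/data/user1"));
        // A fresh instance reloads the persisted layout without any disk walk.
        System.out.println(new CoreLayoutCache(f).instanceDir("user1"));
    }
}
```

As the email notes, this does not remove the worst case: a core absent from the store still forces a wait for full discovery.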
On Thu, Oct 10, 2013 at 11:42 AM, Soyez Olivier olivier.so...@worldline.com wrote:

The corresponding patch for Solr 4.2.1 LotsOfCores can be found in SOLR-5316, including the new Cores options:
- numBuckets: create a subdirectory based on a hash on the corename % numBuckets in the core dataDir
- Auto, with three different values:
  1) false: default behaviour
  2) createLoad: create the core if it does not exist, and load it on the fly on the first incoming request (update, select)
  3) onlyLoad: load the core on the fly on the first incoming request (update, select), if it exists on disk

Concerning:
- sharing the underlying solrconfig object: the configset introduced in JIRA SOLR-4478 seems to be the solution for non-SolrCloud mode. We need to test it for our use case. If another solution exists, please tell me. We are very interested in such functionality and willing to contribute, if we can.
- the possibility of LotsOfCores in SolrCloud: we don't know in detail how SolrCloud works, but one possible limit is the maximum number of entries that can be added to a ZooKeeper node. Maybe a solution would be just some kind of hashing in the ZooKeeper tree.
- the time to discover cores in Solr 4.4: with a spinning disk under Linux, all cores with transient=true and loadOnStartup=false, and the Linux buffer cache empty before starting Solr, 15K cores take around 4 minutes. It's linear in the number of cores, so for 50K it's more than 13 minutes. In fact, it corresponds to the time needed to read all the core.properties files. Doing that in the background and blocking requests until core discovery is complete would not work for us (due to the worst case). So we will just disable core discovery, because we don't need to know all the cores from the start: we start Solr without any core entries in solr.xml, and use the cores Auto option to create-and-load, or only load, the core on the fly, based on the existence of the core on disk (absolute path calculated from the core name).
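The numBuckets option described above (hash the core name modulo numBuckets to pick a subdirectory, so tens of thousands of cores don't land in one flat directory) might look like this in Java. This is an illustrative sketch under assumed names; bucketFor and dataDirFor are not the actual SOLR-5316 code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the numBuckets core-directory layout:
// <baseDir>/<hash(coreName) % numBuckets>/<coreName>
public class CoreBuckets {
    // Pick a bucket in [0, numBuckets) from the core name.
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int bucketFor(String coreName, int numBuckets) {
        return Math.floorMod(coreName.hashCode(), numBuckets);
    }

    // Build the core's data directory, spreading cores across numBuckets subdirectories.
    static Path dataDirFor(Path baseDir, String coreName, int numBuckets) {
        return baseDir.resolve(Integer.toString(bucketFor(coreName, numBuckets)))
                      .resolve(coreName);
    }

    public static void main(String[] args) {
        // e.g. /solr/data/<bucket>/user12345
        System.out.println(dataDirFor(Paths.get("/solr/data"), "user12345", 256));
    }
}
```

Because the bucket is a pure function of the core name, the absolute path can be recomputed on any incoming request, which is what makes the "no core discovery" setup above workable.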
Thanks for your interest,

Olivier

From: Erick Erickson [erickerick...@gmail.com]
Sent: Monday, 7 October 2013 14:33
To: solr-user@lucene.apache.org
Subject: Re: feedback on Solr 4.x LotsOfCores feature

Thanks for the great writeup! It's always interesting to see how a feature plays out in the real world. A couple of questions though:

bq: We added 2 Cores options :

Do you mean you patched Solr? If so, are you willing to share the code back? If both are yes, please open a JIRA, attach the patch and assign it to me.

bq: the number of file descriptors, it used a lot (need to increase global max and per process fd)

Right, this makes sense since you have a bunch of cores, all with their own descriptors open. I'm assuming that you hit a rather high max number and it stays pretty steady.

bq: the overhead to parse solrconfig.xml and load dependencies to open each core

Right, I tried to look at sharing the underlying solrconfig object but it seemed pretty hairy. There are some extensive comments in the JIRA about the problems I foresaw. There may be some action on this in the future.

bq: lotsOfCores doesn’t work with SolrCloud

Right, we haven't concentrated on that; it's an interesting problem. In particular it's not clear what happens when nodes go up/down, replicate, resync, all that.

bq: When you start, it spend a lot of times to discover cores due to a big

How long? I tried 15K cores on my laptop and I think I was getting 15-second delays, or roughly 1K cores discovered/second. Is your delay on the order of 50 seconds with 50K cores? I'm not sure how you could do that in the background, but I haven't thought about it much. I tried multi-threading core discovery and that didn't help (SSD disk); I assumed the problem was mostly I/O contention (but didn't prove it). What if a request came in for a core before you'd found it? I'm not sure what the right behavior would be, except perhaps to block on that request until core discovery was complete. Hm.
How would that work for your case? That seems do-able. BTW, so far you get the prize for the most cores on a node, I think. Thanks again for the great feedback!

Erick

On Mon, Oct 7, 2013 at 3:53 AM, Soyez Olivier olivier.so...@worldline.com wrote:

Hello,

In my company, we use Solr in production to offer full-text search on mailboxes. We host dozens of millions of mailboxes, but only webmail users have this feature (a few million). We have the following use case:
- non-static indexes, with more updates (indexing and deleting) than select requests (ratio 7:1)
- homogeneous configuration for all indexes
- not many concurrent users

We started to index mailboxes with Solr 1.4 in 2010, on a subset of 400,000 users:
- we had a cluster of 50 servers, 4 Solr instances per server, 2,000 users per Solr instance
- we grew to 6,000 users per Solr instance, 8 Solr instances per server, 60GB per index (~2 million users)
- we upgraded to Solr 3.5 in 2012

As the indexes grew, IOPS and response times increased more and more.
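The "block on that request until core discovery is complete" behaviour discussed earlier in the thread could be sketched with a CountDownLatch: discovery runs in a background thread, and any request arriving early simply waits for it to finish. This is a hypothetical structure for illustration, not actual Solr code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical gate: requests for a core block until background discovery completes.
public class DiscoveryGate {
    private final CountDownLatch discovered = new CountDownLatch(1);
    private final Map<String, String> cores = new ConcurrentHashMap<>();

    // Background thread: walk the disk (e.g. read core.properties files), then open the gate.
    public void runDiscovery(Map<String, String> found) {
        cores.putAll(found);
        discovered.countDown();  // waiting requests may now proceed
    }

    // Request path: waits, in the worst case, for the full discovery time.
    public String lookup(String coreName) throws InterruptedException {
        discovered.await();
        return cores.get(coreName);
    }

    public static void main(String[] args) throws Exception {
        DiscoveryGate gate = new DiscoveryGate();
        new Thread(() -> gate.runDiscovery(
                java.util.Collections.singletonMap("user1", "/solr/data/0/user1"))).start();
        System.out.println(gate.lookup("user1"));  // blocks until discovery is done
    }
}
```

The worst case both emails worry about is visible here: with 50K cores, the very first request could wait for the entire discovery pass, which is exactly why Olivier's setup disables discovery and computes core paths from names instead.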