RE: how to do offline adding/updating index
Thanks to all, I got it done by using multicore.

vishal parekh
RE: how to do offline adding/updating index
You can also turn off automatic replication polling, and just manually issue a 'replicate' command to the slave exactly when you want, without relying on it being triggered by optimization or whatever. (Well, probably not 'manually'; probably some custom update process you run will issue the 'replicate' command to the slave when appropriate for your strategy.)

This is useful in case you want to replicate without an optimize, but not on every commit. (An optimize will result in more files being 'new' for replication, possibly all of them, whereas a replication without an optimize, if most of the index remains the same and only a few documents were added/updated, will only result in some new files being pulled.) Or if you wanted to replicate after an optimize, but not EVERY optimize.

Or of course, you could just set the replication's poll time to some high number, like an hour or whatever, so it'll only replicate once an hour no matter how many commits happen more often than that. There are trade-offs either way in flexibility/control and performance.

As far as performance goes, you may just have to measure in your individual actual context, as much of a pain as that can be. There seem to be lots of significant variables.

From: kenf_nc [ken.fos...@realestate.com]
Sent: Tuesday, May 10, 2011 4:01 PM
To: solr-user@lucene.apache.org
Subject: Re: how to do offline adding/updating index

Master/slave replication does this out of the box, easily. Just set the slave to update on optimize only. Then you can update the master as much as you want. When you are ready to update the slave (the search instance), just optimize the master. On the slave's next cycle check it will refresh itself, quickly, efficiently, with minimal impact to search performance. No need to build extra moving parts for swapping search servers or anything like that.
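A minimal sketch of that manual-trigger approach, assuming the standard ReplicationHandler registered at /replication and hypothetical host names. What the message above calls the 'replicate' command corresponds to the slave-side fetchindex command:

  http://slave-host:8983/solr/replication?command=disablepoll   (stop the slave from polling on its own)
  http://slave-host:8983/solr/replication?command=fetchindex    (pull the current index from the master now)
  http://slave-host:8983/solr/replication?command=enablepoll    (resume automatic polling, if wanted)

The 'poll once an hour' variant is just a long pollInterval in the slave's solrconfig.xml:

  <str name="pollInterval">01:00:00</str>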
RE: how to do offline adding/updating index
Theoretically, a commit alone should have a negligible effect on the slave, because of the same aspect of Solr architecture that makes too-frequent commits problematic: an existing Searcher continues to serve requests off the old version of the index until the new commit (plus all its warming) is complete, at which point the newly warmed Searcher switches into action. That holds so long as there's enough RAM available for both, and enough CPU that committing and warming the new index don't starve everything else out. (This is where the 'too frequent commits' problem comes in: so many overlapping commits that you run out of RAM and/or CPU.)

However, this same 'theoretical' logic could be used to argue that you should be able to commit directly to the 'slave', without any replication at all, with no performance implications, which doesn't seem to match actually observed results. So maybe it should be taken with a grain of salt and investigated empirically.

For that matter, it has seemed to me that even in the master/slave setup I use, there is SOME performance impact while the commit is going on, although I haven't benchmarked it carefully; it's just an impression. But it hasn't been a disastrous one, and in the replication scenario it lasts a relatively short time.

Running master and slave on the very same server (one with a whole bunch of cores and plenty of RAM), there haven't seemed to be any performance implications on searching the slave while 'add'ing to the master (in a completely separate Java container). It's only when actually doing the replication pull (and its inherent commit on the slave) that I see any effect.

From: kenf_nc [ken.fos...@realestate.com]
Sent: Wednesday, May 11, 2011 9:46 AM
To: solr-user@lucene.apache.org
Subject: Re: how to do offline adding/updating index

My understanding is that the Master has done all the indexing, and that replication is a series of file copies to a temp directory, then a move and a commit. The slave only gets hit with the effects of a commit, so whatever warming queries are in place run, and the caches get reset. Doing commits too often is a problem in any situation with Solr, and I wouldn't recommend it here. However, the original question implied commits would occur approximately once an hour; that is easily within the capabilities of the system. Fine-tuning of warming queries should minimize any performance impact. Any effects should also be relatively constant; they should not be wildly affected by the size of the update or the number of documents. Warming query results may be slightly different with new documents, but on the other hand, your new documents are now in cache, ready for fast search, so it's a reasonable trade-off.
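For reference, the warming and cache behavior being discussed lives in solrconfig.xml. A rough sketch of the relevant knobs, with a made-up sort field and arbitrary sizes, just to show where they sit:

  <query>
    <!-- re-populate part of the cache from the old searcher after each commit -->
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

    <!-- warming queries run against the new searcher before it starts serving traffic -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str><str name="sort">last_modified desc</str></lst>
      </arr>
    </listener>

    <!-- don't serve from an unwarmed searcher; cap concurrent warming to bound RAM/CPU -->
    <useColdSearcher>false</useColdSearcher>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>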
Re: how to do offline adding/updating index
My understanding is that the Master has done all the indexing, and that replication is a series of file copies to a temp directory, then a move and a commit. The slave only gets hit with the effects of a commit, so whatever warming queries are in place run, and the caches get reset.

Doing commits too often is a problem in any situation with Solr, and I wouldn't recommend it here. However, the original question implied commits would occur approximately once an hour; that is easily within the capabilities of the system. Fine-tuning of warming queries should minimize any performance impact.

Any effects should also be relatively constant; they should not be wildly affected by the size of the update or the number of documents. Warming query results may be slightly different with new documents, but on the other hand, your new documents are now in cache, ready for fast search, so it's a reasonable trade-off.
Re: how to do offline adding/updating index
Replicating large files can be bad for the OS page cache, as files being written are also written to the page cache. Search latency can grow due to the I/O needed to get the current index version back into memory. Also, Solr cache warming can cause a doubling of your heap usage. Frequent replication in an environment with large files and high query load is something one should measure before going into production.

> Thanks - that sounds like what I was hoping for. So the I/O during
> replication will have *some* impact on search performance, but
> presumably much less than reindexing and merging/optimizing?
>
> -Mike
>
> > Master/slave replication does this out of the box, easily. Just set the
> > slave to update on Optimize only. Then you can update the master as much
> > as you want. When you are ready to update the slave (the search
> > instance), just optimize the master. On the slave's next cycle check it
> > will refresh itself, quickly, efficiently, minimal impact to search
> > performance. No need to build extra moving parts for swapping search
> > servers or anything like that.
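If you do want to measure it, the ReplicationHandler itself exposes some of the numbers. A small sketch with hypothetical host names; both commands are standard:

  http://slave-host:8983/solr/replication?command=details       (replication status, index size, generation, files being pulled)
  http://master-host:8983/solr/replication?command=indexversion (the master's version/generation the slave compares against)

Watching those around a replication cycle, together with OS-level I/O and page-cache stats, is one way to see what a pull actually costs under your query load.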
Re: how to do offline adding/updating index
Thanks - that sounds like what I was hoping for. So the I/O during replication will have *some* impact on search performance, but presumably much less than reindexing and merging/optimizing?

-Mike

Master/slave replication does this out of the box, easily. Just set the slave to update on optimize only. Then you can update the master as much as you want. When you are ready to update the slave (the search instance), just optimize the master. On the slave's next cycle check it will refresh itself, quickly, efficiently, with minimal impact to search performance. No need to build extra moving parts for swapping search servers or anything like that.
Re: how to do offline adding/updating index
Master/slave replication does this out of the box, easily. Just set the slave to update on optimize only. Then you can update the master as much as you want. When you are ready to update the slave (the search instance), just optimize the master. On the slave's next cycle check it will refresh itself, quickly, efficiently, with minimal impact to search performance. No need to build extra moving parts for swapping search servers or anything like that.
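A minimal sketch of that setup in solrconfig.xml, with a hypothetical master host name. The master publishes a new replicable commit point only after an optimize, so the slave keeps polling but finds nothing new to pull until the master has actually been optimized:

On the master:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">optimize</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
    </lst>
  </requestHandler>

On the slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>

confFiles is optional; it just lists configuration files to copy along with the index.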
Re: how to do offline adding/updating index
I think the key question here is what's the best way to perform indexing without affecting search performance, or without affecting it much. If you have a batch of documents to index (say a daily batch that takes an hour to index and merge), you'd like to do that on an offline system and then, when ready, bring that index up for searching. But using Lucene's multiple commit points assumes you use the same box for search and indexing, doesn't it?

Something like this is what I have in mind (simple 2-server config here):

Box 1 is live and searching; Box 2 is offline and ready to index.
Loading begins on Box 2...
Loading completes on Box 2... commit, optimize.
Swap Box 1 and Box 2 (with a load balancer or application config?).
Box 2 is live and searching; Box 1 is offline and ready to index.

To make the best use of your resources, you'd then like to start using Box 1 for searching as well (until indexing starts up again). Perhaps if your load balancing is clever enough, it could be sensitive to the decreased performance of the indexing box and just send more requests to the other one(s). That's probably ideal.

-Mike S

Under the hood, Lucene can support this by keeping multiple commit points in the index. So you'd make a new commit whenever you finish indexing the updates from each hour, and record that this is the last "searchable" commit. Then you are free to commit while indexing the next hour's worth of changes, but these commits are not marked as searchable. But... this is a low-level Lucene capability, and I don't know of any plans for Solr to support multiple commit points in the index.

Mike

http://blog.mikemccandless.com

On Tue, May 10, 2011 at 9:22 AM, vrpar...@gmail.com wrote:

Hello all,

Indexing with DataImportHandler runs every hour (new records will be added, some records will be updated). Note: large data.

The requirement is that while indexing is in progress, searching (on already indexed data) should not be affected.

So should I use multicore with merge and swap, or delta query, or any other way?

Thanks
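The same swap idea can also be done on one box with two Solr cores instead of two servers and a load balancer: index into an 'on-deck' core, then flip it with the live core using the CoreAdmin SWAP command. A sketch with hypothetical core names:

  http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=ondeck

After the swap, 'live' serves the freshly built index and 'ondeck' holds the old one, ready to receive the next indexing run.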
Re: how to do offline adding/updating index
Under the hood, Lucene can support this by keeping multiple commit points in the index. So you'd make a new commit whenever you finish indexing the updates from each hour, and record that this is the last "searchable" commit. Then you are free to commit while indexing the next hour's worth of changes, but these commits are not marked as searchable.

But... this is a low-level Lucene capability, and I don't know of any plans for Solr to support multiple commit points in the index.

Mike

http://blog.mikemccandless.com

On Tue, May 10, 2011 at 9:22 AM, vrpar...@gmail.com wrote:
> Hello all,
>
> Indexing with DataImportHandler runs every hour (new records will be added,
> some records will be updated). Note: large data.
>
> The requirement is that while indexing is in progress, searching (on already
> indexed data) should not be affected.
>
> So should I use multicore with merge and swap, or delta query, or any other
> way?
>
> Thanks
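To make that mechanism a bit more concrete, here is a rough, hypothetical sketch against a recent Lucene Java API (class and method names have shifted across versions, and as noted above none of this is exposed through Solr): keep every commit point alive, tag the 'searchable' one with commit user data, and open a reader on that specific commit.

  import java.nio.file.Paths;
  import java.util.Collections;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.*;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class SearchableCommits {
    public static void main(String[] args) throws Exception {
      Directory dir = FSDirectory.open(Paths.get("/path/to/index"));

      IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
      cfg.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE); // keep all commit points; nothing is purged
      IndexWriter writer = new IndexWriter(dir, cfg);

      // ... index this hour's adds/updates ...

      // mark this commit as the one searchers should use
      writer.setLiveCommitData(Collections.singletonMap("searchable", "true").entrySet());
      writer.commit();

      // clear the marker so intermediate commits are not treated as searchable
      writer.setLiveCommitData(Collections.singletonMap("searchable", "false").entrySet());

      // ... keep indexing and committing the next hour's changes ...

      // search side: open the newest commit that carries the marker
      DirectoryReader reader = null;
      for (IndexCommit c : DirectoryReader.listCommits(dir)) { // oldest to newest
        if ("true".equals(c.getUserData().get("searchable"))) {
          reader = DirectoryReader.open(c);
        }
      }
      // ... build an IndexSearcher from 'reader' and serve queries ...
      writer.close();
    }
  }

A real setup would use a custom IndexDeletionPolicy that prunes commits older than the last searchable one, rather than NoDeletionPolicy keeping everything forever.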
RE: how to do offline adding/updating index
One approach is to use Solr's replication features: index to a 'master', and periodically replicate to a 'slave' on which all the searching is done. That's what I do; my master and slave are in fact on the same server (one with a bunch of CPUs and RAM, however), although not as alternate cores in a multi-core setup. I in fact put them in different containers (different Tomcat or Jetty instances) to isolate them as much as possible (I don't want an accidental OOM on one affecting the other). This seems to work out pretty well, although I think that while the replication operation is actually going on, performance on the slave is indeed affected somewhat; it's not completely without side effects. It's possible that using some kind of 'swapping' technique would eliminate that, as you suggest, but I haven't tried it.

Certainly a delta query for indexing imports is always a good idea if it will work for you, but with or without it you'll probably need some other setup to isolate your indexing from your searching: either replication, or a method of 'swapping' where you index to a new Solr index and then swap the indexes out.

From: vrpar...@gmail.com [vrpar...@gmail.com]
Sent: Tuesday, May 10, 2011 9:22 AM
To: solr-user@lucene.apache.org
Subject: how to do offline adding/updating index

Hello all,

Indexing with DataImportHandler runs every hour (new records will be added, some records will be updated). Note: large data.

The requirement is that while indexing is in progress, searching (on already indexed data) should not be affected.

So should I use multicore with merge and swap, or delta query, or any other way?

Thanks
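For the delta-query option mentioned above, a rough sketch of what a DataImportHandler delta setup can look like; the table, column, and key names here are made up, but the ${dataimporter.last_index_time} and ${dih.delta.id} variables are standard DIH:

  <entity name="item" pk="id"
          query="SELECT * FROM item"
          deltaQuery="SELECT id FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT * FROM item WHERE id = '${dih.delta.id}'"/>

The hourly job then re-indexes only the changed rows by calling:

  http://localhost:8983/solr/dataimport?command=delta-import&clean=false&commit=true

Note this only reduces the indexing cost on the master; the searching side still needs the replication (or swapping) isolation discussed in this thread.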