RE: how to do offline adding/updating index

2011-06-03 Thread vrpar...@gmail.com
Thanks to all,

I got it done using multicore.

vishal parekh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p3019219.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: how to do offline adding/updating index

2011-05-11 Thread Jonathan Rochkind
You can also turn off automatic replication polling, and just manually issue a 
'replicate' command to the slave exactly when you want, without relying on it being 
triggered by an optimize or anything else. (Well, probably not 'manually'; more 
likely some custom update process you run would issue the 'replicate' command to 
the slave when appropriate for your strategy.)

This is useful if you want to replicate without an optimize, but not on every 
commit. (An optimize will result in more files being 'new' for replication, 
possibly all of them, whereas a replication without an optimize, if most of the 
index stays the same and only a few documents were added or updated, will only 
pull a few new files.) It also helps if you want to replicate after an optimize, 
but not after EVERY optimize.

Or, of course, you could just set the replication poll interval to some high 
value, like an hour, so the slave replicates at most once an hour no matter how 
many commits happen in between.

There are trade-offs either way between flexibility/control and performance. As 
far as performance goes, you may just have to measure in your own actual context, 
as much of a pain as that can be; there seem to be a lot of significant variables.

From: kenf_nc [ken.fos...@realestate.com]
Sent: Tuesday, May 10, 2011 4:01 PM
To: solr-user@lucene.apache.org
Subject: Re: how to do offline adding/updating index

Master/slave replication does this out of the box, easily. Just set the slave
to update on Optimize only. Then you can update the master as much as you
want. When you are ready to update the slave (the search instance), just
optimize the master. On the slave's next cycle check it will refresh itself,
quickly, efficiently, minimal impact to search performance. No need to build
extra moving parts for swapping search servers or anything like that.



RE: how to do offline adding/updating index

2011-05-11 Thread Jonathan Rochkind
Theoretically, a commit alone should have a negligible effect on the slave, 
because of the same aspect of Solr's architecture that makes too-frequent commits 
problematic: an existing Searcher continues to serve requests off the old version 
of the index until the new commit (plus all its warming) is complete, at which 
point the newly warmed Searcher switches into action.

That holds so long as there's enough RAM available for both operations, and 
enough CPU that committing and warming the new Searcher doesn't starve out 
searching. (This is where the 'too frequent commit' problem comes in: with enough 
overlapping commits, you run out of RAM and/or CPU.)
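solrconfig.xml has a guard for exactly this overlap (the exact placement of the element in the file varies by Solr version, and 2 is just an illustrative value):

```xml
<!-- Refuse a commit's new Searcher if this many are already warming,
     rather than letting overlapping commits exhaust RAM/CPU. -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```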

However, this same 'theoretical' logic could be used to argue that you should be 
able to commit directly to the 'slave', without any replication at all, with no 
performance implications, which doesn't seem to match actually observed results. 
So maybe it should be taken with a grain of salt and investigated empirically. 
For that matter, it has seemed to me that even in the master-slave setup I use, 
there is SOME performance impact while the commit is going on, although I haven't 
benchmarked it well; that's just an impression. But it hasn't been a disastrous 
one, and in the replication scenario it lasts a relatively short time.

Running master and slave on the very same server (one with a whole bunch of 
cores and plenty of RAM), I haven't seen any performance impact on searching the 
slave while 'add'ing to the master (in a completely separate Java container), 
only when actually doing the replication pull (and its inherent commit on the 
slave).

From: kenf_nc [ken.fos...@realestate.com]
Sent: Wednesday, May 11, 2011 9:46 AM
To: solr-user@lucene.apache.org
Subject: Re: how to do offline adding/updating index

My understanding is that the master has done all the indexing, and that
replication is a series of file copies to a temp directory, then a move and a
commit. The slave only gets hit with the effects of a commit, so whatever
warming queries are in place run, and the caches get reset. Committing too
often is a problem in any situation with Solr, and I wouldn't recommend it
here. However, the original question implied commits would occur approximately
once an hour, which is easily within the capabilities of the system.
Fine-tuning of warming queries should minimize any performance impact. The
effects should also be a relatively linear constant; they should not be wildly
affected by the size of the update or the number of documents. Warming query
results may be slightly different with new documents, but on the other hand,
your new documents are now in cache ready for fast search, so it's a
reasonable trade-off.



Re: how to do offline adding/updating index

2011-05-11 Thread kenf_nc
My understanding is that the master has done all the indexing, and that
replication is a series of file copies to a temp directory, then a move and a
commit. The slave only gets hit with the effects of a commit, so whatever
warming queries are in place run, and the caches get reset. Committing too
often is a problem in any situation with Solr, and I wouldn't recommend it
here. However, the original question implied commits would occur approximately
once an hour, which is easily within the capabilities of the system.
Fine-tuning of warming queries should minimize any performance impact. The
effects should also be a relatively linear constant; they should not be wildly
affected by the size of the update or the number of documents. Warming query
results may be slightly different with new documents, but on the other hand,
your new documents are now in cache ready for fast search, so it's a
reasonable trade-off.
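The warming queries mentioned above are configured in solrconfig.xml as searcher-event listeners. A minimal sketch (the query itself is a placeholder; you'd use something representative of real traffic so the caches that matter get populated):

```xml
<!-- Runs whenever a commit produces a new Searcher, before it goes live. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- placeholder warming query -->
    <lst>
      <str name="q">solr</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```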

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2927336.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to do offline adding/updating index

2011-05-10 Thread Markus Jelsma
Replicating large files can be bad for the OS page cache, as files being written 
are also written to the page cache. Search latency can grow due to the I/O needed 
to get the current index version back into memory. Also, Solr cache warming can 
cause a doubling of your heap usage.

Frequent replication in an environment with large files and a high query load is 
something one should measure before going into production.

> Thanks - that sounds like what I was hoping for.  So the I/O during
> replication will have *some* impact on search performance, but
> presumably much less than reindexing and merging/optimizing?
> 
> -Mike
> 
> > Master/slave replication does this out of the box, easily. Just set the
> > slave to update on Optimize only. Then you can update the master as much
> > as you want. When you are ready to update the slave (the search
> > instance), just optimize the master. On the slave's next cycle check it
> > will refresh itself, quickly, efficiently, minimal impact to search
> > performance. No need to build extra moving parts for swapping search
> > servers or anything like that.
> > 


Re: how to do offline adding/updating index

2011-05-10 Thread Mike Sokolov
Thanks - that sounds like what I was hoping for.  So the I/O during 
replication will have *some* impact on search performance, but 
presumably much less than reindexing and merging/optimizing?


-Mike


Master/slave replication does this out of the box, easily. Just set the slave
to update on Optimize only. Then you can update the master as much as you
want. When you are ready to update the slave (the search instance), just
optimize the master. On the slave's next cycle check it will refresh itself,
quickly, efficiently, minimal impact to search performance. No need to build
extra moving parts for swapping search servers or anything like that.



Re: how to do offline adding/updating index

2011-05-10 Thread kenf_nc
Master/slave replication does this out of the box, easily. Just set the slave
to update on optimize only. Then you can update the master as much as you
want. When you are ready to update the slave (the search instance), just
optimize the master. On the slave's next poll cycle it will refresh itself
quickly and efficiently, with minimal impact on search performance. No need to
build extra moving parts for swapping search servers or anything like that.
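In solrconfig.xml terms, 'update on optimize only' is expressed with `replicateAfter` on the master. A sketch in the Solr 1.4-style replication config; the master host, poll interval, and conf files are placeholders:

```xml
<!-- Master: publish a new index version only after an optimize. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- Slave: poll the master periodically for a new version. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:01:00</str>
  </lst>
</requestHandler>
```

Between optimizes, the slave's polls find nothing new to pull, so searching proceeds undisturbed while the master indexes.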

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2924426.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to do offline adding/updating index

2011-05-10 Thread Mike Sokolov
I think the key question here is what's the best way to perform indexing 
without affecting search performance, or without affecting it much.  If 
you have a batch of documents to index (say a daily batch that takes an 
hour to index and merge), you'd like to do that on an offline system 
and then, when ready, bring that index up for searching.  But using 
Lucene's multiple commit points assumes you use the same box for search 
and indexing, doesn't it?


Something like this is what I have in mind (simple 2-server config here):

Box 1 is live and searching
Box 2 is offline and ready to index

loading begins on Box 2...
loading complete on Box 2 ...
commit, optimize

Swap Box 1 and Box 2 (with a load balancer or application config?)
Box 2 is live and searching
Box 1 is offline and ready to index
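Scripted, the flow above might look like the following sketch. The hostnames and the load-balancer command are placeholders for whatever your environment actually provides; the CoreAdmin SWAP line shows the single-box, two-core variant the original question mentioned. Commands are echoed rather than executed so the sketch stands on its own:

```shell
# Hypothetical hosts; box2 is the offline indexer in this cycle.
LIVE="http://box1:8983/solr"
OFFLINE="http://box2:8983/solr"

# When loading on the offline box is complete: optimize (implies commit).
OPTIMIZE="curl -s '$OFFLINE/update?optimize=true'"

# Then point traffic at the freshly optimized box.  'lb-admin' is a
# placeholder for your real load balancer's control interface.
SWITCH="lb-admin set-live box2"

# Single-box alternative: two cores swapped atomically via CoreAdmin.
CORE_SWAP="curl -s '$LIVE/admin/cores?action=SWAP&core=live&other=ondeck'"

# Echoed rather than executed, so the sketch runs without live servers.
echo "$OPTIMIZE"
echo "$SWITCH"
echo "$CORE_SWAP"
```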

To make the best use of your resources, you'd then like to start using 
Box 1 for searching (until indexing starts up again).  Perhaps if your 
load balancing is clever enough, it could be sensitive to the decreased 
performance of the indexing box and just send more requests to the other 
one(s).  That's probably ideal.


-Mike S


Under the hood, Lucene can support this by keeping multiple commit
points in the index.

So you'd make a new commit whenever you finish indexing the updates
from each hour, and record that this is the last "searchable" commit.

Then you are free to commit while indexing the next hour's worth of
changes, but these commits are not marked as searchable.

But... this is a low level Lucene capability and I don't know of any
plans for Solr to support multiple commit points in the index.

Mike

http://blog.mikemccandless.com

On Tue, May 10, 2011 at 9:22 AM, vrpar...@gmail.com wrote:

Hello all,

Indexing with DataImportHandler runs every hour (new records will be added,
some records will be updated). Note: this is large data.

The requirement is that while indexing is in progress, searching (on
already-indexed data) should not be affected.

So should I use multicore with merge and swap, or a delta query, or some
other way?

Thanks

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-do-offline-adding-updating-index-tp2923035p2923035.html
Sent from the Solr - User mailing list archive at Nabble.com.

 


Re: how to do offline adding/updating index

2011-05-10 Thread Michael McCandless
Under the hood, Lucene can support this by keeping multiple commit
points in the index.

So you'd make a new commit whenever you finish indexing the updates
from each hour, and record that this is the last "searchable" commit.

Then you are free to commit while indexing the next hour's worth of
changes, but these commits are not marked as searchable.

But... this is a low level Lucene capability and I don't know of any
plans for Solr to support multiple commit points in the index.

Mike

http://blog.mikemccandless.com

On Tue, May 10, 2011 at 9:22 AM, vrpar...@gmail.com  wrote:
> Hello all,
>
> Indexing with DataImportHandler runs every hour (new records will be added,
> some records will be updated). Note: this is large data.
>
> The requirement is that while indexing is in progress, searching (on
> already-indexed data) should not be affected.
>
> So should I use multicore with merge and swap, or a delta query, or some
> other way?
>
> Thanks
>


RE: how to do offline adding/updating index

2011-05-10 Thread Jonathan Rochkind
One approach is to use Solr's replication features: index to a 'master', and 
periodically replicate to a 'slave' on which all the searching is done.

That's what I do; my master and slave are in fact on the same server (one with 
a bunch of CPUs and RAM, however), although not as alternate cores in a 
multi-core setup. I in fact put them in different containers (different Tomcat 
or Jetty instances) to isolate them as much as possible (you don't want an 
accidental OOM in one affecting the other). This seems to work out pretty well, 
although I think that while the replication operation is actually going on, 
performance on the slave is indeed affected somewhat; it's not completely 
without side effects.

It's possible that using some kind of 'swapping' technique would eliminate that, 
as you suggest, but I haven't tried it. Certainly a delta query for indexing 
imports is always a good idea if it will work for you, but with or without one 
you'll probably need some other setup in addition, to isolate your indexing from 
your searching: either replication, or a method of 'swapping', indexing to a new 
Solr index and then swapping the indexes out.
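For the delta-query option, here is a DataImportHandler entity sketch, assuming a hypothetical `item` table with a `last_modified` column; you'd trigger it with `command=delta-import` instead of `full-import`:

```xml
<!-- data-config.xml entity: full import plus hourly delta imports. -->
<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item
                          WHERE id = '${dataimporter.delta.id}'"/>
```

The delta query only finds the IDs changed since the last run, so each hourly import touches far fewer rows, but as noted above it doesn't by itself isolate searching from indexing.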

From: vrpar...@gmail.com [vrpar...@gmail.com]
Sent: Tuesday, May 10, 2011 9:22 AM
To: solr-user@lucene.apache.org
Subject: how to do offline adding/updating index

Hello all,

Indexing with DataImportHandler runs every hour (new records will be added,
some records will be updated). Note: this is large data.

The requirement is that while indexing is in progress, searching (on
already-indexed data) should not be affected.

So should I use multicore with merge and swap, or a delta query, or some
other way?

Thanks
