Re: Best practices to rebuild index on live system

2010-11-11 Thread Shawn Heisey


On 11/11/2010 4:45 PM, Robert Gründler wrote:

So far, I can only think of two scenarios for rebuilding the index if we need to 
update the schema after the rollout:

1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After 
importing, switch the application to cores A1, B1, C1

This will most likely cause a corrupt index, as the database might get 
inserts/updates/deletes during the 1.5 hours of indexing.

2. Put the live system in read-only mode and rebuild the index during that 
time. This will ensure data integrity in the index, with the drawback that 
users cannot write to the app.


I can tell you how we handle this.  The actual build system is more 
complicated than I have mentioned here, involving replication and error 
handling, but this is the basic idea.  This isn't the only possible 
approach, but it does work.


I have 6 main static shards and one incremental shard, each on their own 
machine (Xen VM, actually).  Data is distributed by taking the Did value 
(primary key in the database) and doing a "mod 6" on it, the resulting 
value is the static shard number.
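The routing rule above can be sketched in a few lines (the names are illustrative, not from the original post):

```python
NUM_STATIC_SHARDS = 6

def static_shard_for(did: int) -> int:
    """Route a document to one of the 6 static shards by doing "mod 6"
    on its Did value (the primary key in the database)."""
    return did % NUM_STATIC_SHARDS
```

Because routing depends only on the primary key, any process can compute a document's shard without a lookup table.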


The system tracks two values at all times - minDid and maxDid.  The 
static shards have Did values <= minDid.  The incremental is > minDid 
and <= maxDid.  Once an hour, I write the current Did value to an RRD.  
Once a day, I use that RRD to figure out the Did value corresponding to 
one week ago.  All documents > minDid and <= newMinDid are 
delta-imported into the static indexes and deleted from the incremental 
index, and minDid is updated.
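A minimal sketch of that bookkeeping, assuming the boundaries work as described (function and variable names are mine, not Shawn's):

```python
def index_for(did: int, min_did: int, max_did: int) -> str:
    """Static shards hold Did <= minDid; the incremental shard holds
    minDid < Did <= maxDid."""
    if did <= min_did:
        return "static"
    if did <= max_did:
        return "incremental"
    return "unindexed"

def daily_promotion(min_did: int, new_min_did: int):
    """Documents with minDid < Did <= newMinDid get delta-imported into the
    static shards and deleted from the incremental; minDid then advances.
    Returns the new minDid and the (exclusive, inclusive) Did range moved."""
    assert new_min_did >= min_did
    return new_min_did, (min_did, new_min_did)
```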


When it comes time to rebuild, I first rebuild the static indexes in a 
core named "build" which takes 5-6 hours.  When that's done, I rebuild 
the incremental in its build core, which only takes about 10 minutes.  
Then on all the machines, I swap the build and live cores.  While all 
the static builds are happening, the incremental continues to get new 
content, until it too is rebuilt.
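The swap step maps onto the CoreAdmin SWAP action, which atomically exchanges two cores' names. A hedged sketch of building that request (host, port, and core names are assumptions):

```python
from urllib.parse import urlencode

def swap_cores_url(solr_base: str, build_core: str, live_core: str) -> str:
    """Build a CoreAdmin SWAP request; SWAP exchanges the two cores'
    names atomically, so searchers move to the freshly built index."""
    query = urlencode({"action": "SWAP", "core": build_core, "other": live_core})
    return f"{solr_base}/admin/cores?{query}"

# On each shard machine, something like:
#   urllib.request.urlopen(swap_cores_url("http://localhost:8983/solr", "build", "live"))
```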


Shawn



Re: Best practices to rebuild index on live system

2010-11-11 Thread Erick Erickson
If by "corrupt index" you mean an index that's just not quite
up to date, could you do a delta import? In other words, how
do you make your Solr index reflect changes to the DB even
without a schema change? Could you extend that method
to handle your use case?

So the scenario is something like this:
1. Record the time.
2. Rebuild the index.
3. Import all changes since you recorded the original time.
4. Switch cores or replicate.
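The steps above can be sketched as a driver function; the callables are placeholders for whatever full-import, delta-import, and core-switch mechanism you actually use:

```python
def rebuild_with_catchup(record_time, full_import, delta_import, switch):
    """Record the time, do the long full rebuild, delta-import everything
    that changed since the recorded time, then switch cores (or replicate)."""
    started = record_time()
    full_import()
    delta_import(since=started)
    switch()
    return started
```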

Best
Erick

2010/11/11 Robert Gründler 

> Hi again,
>
> we're coming closer to the rollout of our newly created Solr/Lucene based
> search, and I'm wondering
> how people handle changes to their schema on live systems.
>
> In our case, we have 3 cores (i.e. A, B, C), where the largest one takes about
> 1.5 hours for a full dataimport from the relational
> database. The index is updated in real time, through post
> insert/update/delete events in our ORM.
>
> So far, I can only think of two scenarios for rebuilding the index if we
> need to update the schema after the rollout:
>
> 1. Create 3 more cores (A1,B1,C1) - Import the data from the database -
> After importing, switch the application to cores A1, B1, C1
>
> This will most likely cause a corrupt index, as the database might get
> inserts/updates/deletes during the 1.5 hours of indexing.
>
> 2. Put the live system in read-only mode and rebuild the index during that
> time. This will ensure data integrity in the index, with the drawback that
> users cannot write to the app.
>
> Does Solr provide any built-in approaches to this problem?
>
>
> best
>
> -robert
>
>
>
>


Re: Best practices to rebuild index on live system

2010-11-11 Thread Jonathan Rochkind
You can do something similar to your case #1 with Solr replication, which 
handles a lot of the details for you instead of you manually switching 
cores and such. Index to a new core, then tell your production Solr to 
be a slave replicating from that new master core. It may still have some 
of the same downsides as your scenario #1, since it's essentially the same 
thing, but with Solr replication taking care of some of the nuts and 
bolts for you.
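That pull can be triggered through the ReplicationHandler's fetchindex command on the slave; a sketch of building the request, where hosts and core names are assumptions (masterUrl overrides the configured master for this one fetch):

```python
from urllib.parse import urlencode

def fetch_index_url(slave_base: str, core: str, master_url: str) -> str:
    """Ask a slave core's /replication handler to pull the index from
    the given master (command=fetchindex)."""
    query = urlencode({"command": "fetchindex", "masterUrl": master_url})
    return f"{slave_base}/{core}/replication?{query}"
```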


I haven't heard of any better solutions. In general, Solr is not 
really so great at use cases where the index changes frequently in 
response to user actions; it doesn't seem to have been designed 
that way.


You could store all your user-created data in an external store (RDBMS 
or NoSQL), as well as indexing it, and then when you rebuild the index 
you can get it all from there, so you won't lose any of it. To get along 
with Solr's assumptions, it usually works best to avoid treating a Solr 
index as the canonical storage location of any data -- Solr 
isn't really designed to be storage, it's designed to be an index.  
Always have the canonical storage location of any data be some actual 
store, with Solr just being an index. That approach tends to make it 
easier to work out things like this, although there can still be some 
tricks. (For instance, after you're done building your new index, but before 
you replicate it to production, you might have to check the canonical 
store for any data that changed between the time you started your 
re-index and now -- and then re-index that. And then any data that 
changed between the time your second re-index began and... this could go 
on forever.)
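The catch-up chase in that parenthetical can be bounded; a sketch under assumed, hypothetical store/index interfaces (`changed_since` and `update` are placeholders, not a real API):

```python
def catch_up(store, index, start_time, now, max_rounds=5):
    """Re-index whatever changed in the canonical store since the last pass,
    repeating until a pass finds nothing.  If it never converges within
    max_rounds, the caller pauses writes briefly to close the final gap."""
    since = start_time
    for _ in range(max_rounds):
        pass_started = now()              # stamp before reading so nothing slips through
        changed = store.changed_since(since)
        if not changed:
            return True                   # converged: index is caught up
        index.update(changed)
        since = pass_started
    return False                          # still churning; a short write freeze finishes it
```

Each pass shrinks the window to however much changed during the previous pass, so on a system that isn't write-saturated it converges quickly.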


Robert Gründler wrote:

Hi again,

we're coming closer to the rollout of our newly created Solr/Lucene based 
search, and I'm wondering
how people handle changes to their schema on live systems. 


In our case, we have 3 cores (i.e. A, B, C), where the largest one takes about 1.5 
hours for a full dataimport from the relational
database. The index is updated in real time, through post 
insert/update/delete events in our ORM.

So far, I can only think of two scenarios for rebuilding the index if we need to 
update the schema after the rollout:

1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After 
importing, switch the application to cores A1, B1, C1

This will most likely cause a corrupt index, as the database might get 
inserts/updates/deletes during the 1.5 hours of indexing.

2. Put the live system in read-only mode and rebuild the index during that 
time. This will ensure data integrity in the index, with the drawback that 
users cannot write to the app.

Does Solr provide any built-in approaches to this problem?


best

-robert