Re: Syncing lucene index with a database

Chris Lu Thu, 26 Mar 2009 15:11:55 -0700

There are many things you need to synchronize with database. Besidesjust changed fields, you may need to deal with deleted database records,etc.

In general, it's not efficient to pull over data that's changingoften.and may not have much effect on search. It'll overload Luceneunnecessarily. Just leave them in the database.If the constantly changing data is absolutely needed, you can choose todo incremental indexing, selecting only inserted/changed records, andmerge with existing index.

There are several common ways to deal with deleted records also.


Seems you are not sure the proper index structure yet.

I think you can use DBSight Free version, to rapidly prototype andexperiment with all these choices, without coding any XML etc.


--

Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 
Million Euro funding!


Matt Schraeder wrote:

I'm new to Lucene and just beginning my project of adding it to our web
app.  We are indexing data from a MS SQL 2000 database and building
full-text search from it.

Everything I have read says that building the index is a resource heavy

operation so we should use it sparingly.  For the most part the database
table we are working from is updated once a day so as soon as the table
itself is updated we can rebuild our Lucene indexes.  However, there are
a few feilds that get updated with a cronjob every 15 minutes.  In terms
of speed and efficiency, what would be a better system for keeping our
data synced between the database and Lucene?

Of course one option would be to rebuild the Lucene index each time the

cronjob runs to keep the database and Lucene index synced.  We could
either return the entire database table, loop through the rows, get a
row's document in lucene remove/readd it, and do that for each row.
Alternatively after we update the main table we return just the rows
that were changed, loop through those and remove/readd them in lucene,

and do that for just the rows that have changed.Alternatively I have thought of using Lucene purely for search to

return just the primary key of items from our database table, then query
the database for those items and get the most up to date data from the
database to actually display our search results.  This would let us use
Lucene's superior searching capabilities and searching speed, but would
still require us to pull the data to be displayed from the database.

Another option is that we could do the same, but only return the fields

that could change frequently.  This would use Lucene to store and index
the majority of what is displayed on a search results page, only using
the database to return the 2 or 3 fields that might change in a search
for each row that lucene returns.

I'm honestly not sure what the "proper" choice should be, or if it

really depends on our own test cases.  Is it perfectly okay to run an
index update every 15 minutes? How much difference would it make in

terms of search time to search with lucene AND pull from the database?My main issue with searching with lucene but getting the actual data

from the database is that it seems like that would make our current
search system that is entirely database driven to run slower.



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Syncing lucene index with a database

Reply via email to