--- "Rob Staveley (Tom)" <[EMAIL PROTECTED]> wrote:

> Doug says:
> 
> > 1. On the index master, periodically checkpoint the index. Every
> > minute or so the IndexWriter is closed and a 'cp -lr index index.DATE'
> > command is executed from Java, where DATE is the current date and
> > time. This efficiently makes a copy of the index when it's in a
> > consistent state by constructing a tree of hard links. If Lucene
> > re-writes any files (e.g., the segments file) a new inode is created
> > and the copy is unchanged.
> 
> How can that be so? When the segments file is re-written, surely it
> will clobber the copy rather than creating a new inode, because it has
> the same name... wouldn't it?
>
> 
> What makes it different from (say)...
> 
>       mkdir x
>       echo original > x/x.txt
>       cp -lr x x.copy
>       echo update > x/x.txt
>       diff x/x.txt x.copy/x.txt
> 
> ...where x.copy/x.txt has "update" rather than "original" (certainly
> on Linux).

I agree that in your example both x/x.txt and x.copy/x.txt will
contain "update"; however, if the Lucene internals do this:

        mkdir x
        echo original > x/x.txt
        cp -lr x x.copy
-->     rm x/x.txt
        echo update > x/x.txt
        diff x/x.txt x.copy/x.txt

Then x/x.txt will have a different inode from x.copy/x.txt and their
contents will differ, right?
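
For what it's worth, here's a rough sketch of how Doug's checkpoint
step might be driven from Java (the class name, paths, and timestamp
format are all made up, and it assumes a Linux/GNU cp; the caller must
have closed the IndexWriter first so the index is in a consistent
state):

        import java.io.IOException;
        import java.text.SimpleDateFormat;
        import java.util.Date;

        public class Checkpointer {
            // Snapshot the index directory with hard links. Files that
            // Lucene later rewrites (e.g. segments) get fresh inodes,
            // so the snapshot keeps pointing at the old, consistent data.
            public static void checkpoint(String indexDir)
                    throws IOException, InterruptedException {
                String stamp =
                    new SimpleDateFormat("yyyyMMdd.HHmmss").format(new Date());
                Process p = Runtime.getRuntime().exec(
                    new String[] { "cp", "-lr", indexDir, indexDir + "." + stamp });
                if (p.waitFor() != 0) {
                    throw new IOException("cp -lr exited with " + p.exitValue());
                }
            }
        }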

JAMES
> 
> -----Original Message-----
> From: James Pine [mailto:[EMAIL PROTECTED] 
> Sent: 06 July 2006 20:09
> To: java-user@lucene.apache.org
> Subject: RE: Managing a large archival (and constantly changing) database
> 
> Hey,
> 
> I found this thread to be very useful when deciding upon an indexing
> strategy.
> 
> http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12700.html
> 
> The system I work on has 3 million or so documents, and it was (until
> a non-Lucene performance issue came up) set up to add/delete new
> documents every 15 minutes in a similar manner to that described in
> the thread. We were adding/deleting a few thousand documents every 15
> minutes during peak traffic. We have a dedicated indexing machine and
> distribute portions of our index across multiple machines, but you
> could still follow the pattern all on one box, just with separate
> processes/threads.
> 
> Even though Lucene allows certain types of index operations to happen
> concurrently with search activity, IMHO, if you can decouple the
> indexing process from the searching process, your system as a whole
> will be more flexible and scalable, with only a little extra
> maintenance overhead.
> 
> JAMES
> 
> --- Larry Ogrodnek <[EMAIL PROTECTED]> wrote:
> 
> > We have a similar setup, although probably only 1/5th the number of
> > documents and updates. I'd suggest just making periodic index
> > backups.
> > 
> > I've been storing my index as follows:
> > 
> > <workdir>/<index-name>/data/ (Lucene index directory)
> > <workdir>/<index-name>/backups/
> > 
> > The "data" is what's passed into
> > IndexWriter/IndexReader.  Additionally, I
> create/update a .last_update 
> > file, which just contains the timestamp of when
> the last update was 
> > started, so when the app starts up it only needs
> to retrieve updates 
> > from the db since then.
> > 
> > Periodically the app copies the contents of data into a new
> > directory in backups named by the date/time, e.g.
> > backups/2006-07-04.110051. If needed, I can delete data and replace
> > the contents with the latest backup, and the app will only retrieve
> > records updated since the backup was made (using the backup's
> > .last_update)...
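> > 
> > In case it helps, a minimal sketch of the .last_update bookkeeping
> > (class and method names are invented; error handling omitted). The
> > data directory itself is copied into backups/<date-time> by plain
> > file copying:
> > 
> >         import java.io.*;
> > 
> >         public class LastUpdate {
> >             // Record when the current update pass started, so that
> >             // after a restart (or a restore from backups/) we only
> >             // pull records changed since then.
> >             public static void write(File dir, long startedAt)
> >                     throws IOException {
> >                 Writer w = new FileWriter(new File(dir, ".last_update"));
> >                 try {
> >                     w.write(Long.toString(startedAt));
> >                 } finally {
> >                     w.close();
> >                 }
> >             }
> > 
> >             public static long read(File dir) throws IOException {
> >                 BufferedReader r = new BufferedReader(
> >                     new FileReader(new File(dir, ".last_update")));
> >                 try {
> >                     return Long.parseLong(r.readLine().trim());
> >                 } finally {
> >                     r.close();
> >                 }
> >             }
> >         }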
> > 
> > I'd recommend making complete index creation from scratch a normal
> > operation as much as possible (but you're right, for that number of
> > documents it will take a while). It's been really helpful here when
> > doing additional deploys for testing, or deciding we want to index
> > things differently, etc...
> > 
> > -larry
> > 
> > 
> > -----Original Message-----
> > From: Scott Smith [mailto:[EMAIL PROTECTED]
> > 
> > Sent: Thursday, July 06, 2006 1:48 PM
> > To: lucene-user@jakarta.apache.org
> > Subject: Managing a large archival (and constantly changing) database
> > 
> > I've been asked to do a project which provides full-text search for
> > a large database of articles. The expectation is that most of the
> > articles are fairly small (<2k bytes). There will be an initial
> > population of around 400,000 articles. There will then be
> > approximately 2000 new articles added each day; they need to be
> > added in "real time" (within a few minutes of arrival), but will be
> > spread out during the day. So, roughly another 700,000 articles
> > each year.
> > 
> >  
> > 
> > I've read enough to believe that having a Lucene database of
> > several million articles is doable. And adding 2000 articles per
> > day wouldn't seem to be that many. My concern is the real-time
> > nature of the application. I'm a bit nervous (perhaps without
> > justification) about simply growing one monolithic Lucene database.
> > Should there be a crash, the database would be unusable and I'd
> > have to rebuild from scratch (which, based on my experience, would
> > take hours).
> > 
> >  
> > 
> > Some of my thoughts were:
> > 
> > 1)     Have monthly databases and use MultiSearcher to search
> > across them (see the MultiSearcher sketch below, after this quoted
> > message). That way my exposure to a corrupted database is limited
> > to this month's database. This would also seem to give me somewhat
> > better control, meaning: a) if the search was generating lots of
> > hits, I could display the results a month at a time and not bury
> > them with output. It would also spread the search CPU load out
> > better and not prevent other individuals from doing a search. If
> > there were very few results, I could sleep between each month's
> > search and, again, not lock everyone else out from searches.
> > 
> > 2)     Have a "this month's" searchable and an "everything else"
> > searchable. At the beginning of each month, I would consolidate the
> > previous month's database into the "everything else" searchable.
> > This would give more consistent results for relevancy-ranked
> > searches. But it means that a bad search could return lots of
=== message truncated ===
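
Regarding option 1) in Scott's message above, a hedged sketch of
searching per-month indexes with MultiSearcher (directory paths and
the field name are invented, and this is written against the Lucene
API of that era):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.queryParser.QueryParser;
        import org.apache.lucene.search.*;

        public class MonthlySearch {
            public static void main(String[] args) throws Exception {
                // One Lucene index directory per month.
                String[] months =
                    { "index/2006-05", "index/2006-06", "index/2006-07" };
                Searchable[] searchables = new Searchable[months.length];
                for (int i = 0; i < months.length; i++) {
                    searchables[i] = new IndexSearcher(months[i]);
                }
                // MultiSearcher merges hits across the monthly indexes
                // and returns a single ranked result list.
                MultiSearcher searcher = new MultiSearcher(searchables);
                Query q = new QueryParser("body",
                    new StandardAnalyzer()).parse("lucene");
                Hits hits = searcher.search(q);
                System.out.println(hits.length() + " matching articles");
                searcher.close();
            }
        }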

