--- "Rob Staveley (Tom)" <[EMAIL PROTECTED]> wrote:
> Doug says:
>
> > 1. On the index master, periodically checkpoint the index. Every
> > minute or so the IndexWriter is closed and a 'cp -lr index
> > index.DATE' command is executed from Java, where DATE is the current
> > date and time. This efficiently makes a copy of the index when it's
> > in a consistent state by constructing a tree of hard links. If Lucene
> > re-writes any files (e.g., the segments file) a new inode is created
> > and the copy is unchanged.
>
> How can that be so? When the segments file is re-written it will
> surely clobber the copy rather than creating a new INODE, because it
> has the same name... wouldn't it?
>
>
> What makes it different from (say)...
>
> mkdir x
> echo original > x/x.txt
> cp -lr x x.copy
> echo update > x/x.txt
> diff x/x.txt x.copy/x.txt
>
> ...where x.copy/x.txt has "update" rather than "original" (certainly
> on Linux).

I agree that in your example both x/x.txt and x.copy/x.txt will
contain "update". However, if the Lucene internals instead do this:

    mkdir x
    echo original > x/x.txt
    cp -lr x x.copy
    rm x/x.txt            # <-- the added step
    echo update > x/x.txt
    diff x/x.txt x.copy/x.txt

then x/x.txt will have a different inode from x.copy/x.txt and their
contents will differ, right?

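
For anyone who wants to check both cases side by side, here is a small
sketch. It assumes a cp that supports -l (GNU cp does) and a filesystem
that supports hard links; the scratch directory and the y/ names are
only there for illustration. It says nothing about which pattern Lucene
itself follows, only how hard links behave in each case.

```shell
#!/bin/sh
# Sketch: overwrite-in-place vs. remove-then-recreate on a hard-linked copy.
set -e
work=$(mktemp -d)   # scratch directory (illustrative)
cd "$work"

# Case 1: overwrite in place. The redirect truncates and rewrites the
# existing inode, so the hard-linked copy shows the new contents too.
mkdir x
echo original > x/x.txt
cp -lr x x.copy
echo update > x/x.txt
in_place=$(cat x.copy/x.txt)

# Case 2: remove first, then write. rm only drops the name x-side;
# the copy still holds the old inode, and the write creates a new one.
mkdir y
echo original > y/y.txt
cp -lr y y.copy
rm y/y.txt
echo update > y/y.txt
recreated=$(cat y.copy/y.txt)

echo "overwrite in place:   copy contains '$in_place'"
echo "remove then recreate: copy contains '$recreated'"
```

In the first case the copy ends up containing "update"; in the second
it still contains "original".
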
JAMES
>
> -----Original Message-----
> From: James Pine [mailto:[EMAIL PROTECTED]
> Sent: 06 July 2006 20:09
> To: [email protected]
> Subject: RE: Managing a large archival (and constantly changing) database
>
> Hey,
>
> I found this thread to be very useful when deciding upon an indexing
> strategy.
>
> http://www.mail-archive.com/[email protected]/msg12700.html
>
> The system I work on has 3 million or so documents and it was (until a
> non-lucene performance issue came up) set up to add/delete new
> documents every 15 minutes in a similar manner as described in the
> thread. We were adding/deleting a few thousand documents every 15
> minutes, during peak traffic. We have a dedicated indexing machine and
> distribute portions of our index across multiple machines, but you
> could still follow the pattern all on one box, just with separate
> processes/threads.
>
> Even though lucene allows certain types of index operations to happen
> concurrently with search activity, IMHO, if you can decouple the
> indexing process from the searching process your system as a whole
> will be more flexible and scalable with only a little extra
> maintenance overhead.
>
> JAMES
>
> --- Larry Ogrodnek <[EMAIL PROTECTED]> wrote:
>
> > We have a similar setup, although probably only 1/5th the number of
> > documents and updates. I'd suggest just making periodic index
> > backups.
> >
> > I've been storing my index as follows:
> >
> > <workdir>/<index-name>/data/ (lucene index directory)
> > <workdir>/<index-name>/backups/
> >
> > The "data" is what's passed into IndexWriter/IndexReader.
> > Additionally, I create/update a .last_update file, which just
> > contains the timestamp of when the last update was started, so when
> > the app starts up it only needs to retrieve updates from the db
> > since then.
> >
> > Periodically the app copies the contents of data into a new
> > directory in backups named by the date/time, e.g.
> > backups/2007-07-04.110051. If needed, I can delete data and replace
> > the contents with the latest backup, and the app will only retrieve
> > records updated since the backup was made (using the backup's
> > .last_update)...
> >
> > I'd recommend making the complete index creation from scratch a
> > normal operation as much as possible (but you're right, for that
> > number of documents it will take a while). It's been really helpful
> > here when doing additional deploys for testing, or deciding we want
> > to index things differently, etc...
> >
> > -larry
> >
> >
> > -----Original Message-----
> > From: Scott Smith [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, July 06, 2006 1:48 PM
> > To: [email protected]
> > Subject: Managing a large archival (and constantly changing) database
> >
> > I've been asked to do a project which provides full-text search for
> > a large database of articles. The expectation is that most of the
> > articles are fairly small (<2k bytes). There will be an initial
> > population of around 400,000 articles. There will then be
> > approximately 2000 new articles added each day (they need to be
> > added in "real time" (within a few minutes of arrival), but will be
> > spread out during the day). So, roughly another 700,000 articles
> > each year.
> >
> > I've read enough to believe that having a lucene database of several
> > million articles is doable. And, adding 2000 articles per day
> > wouldn't seem to be that many. My concern is the real-time nature of
> > the application. I'm a bit nervous (perhaps without justification)
> > at simply growing one monolithic lucene database. Should there be a
> > crash, the database will be unusable and I'll have to rebuild from
> > scratch (which, based on my experience, would be hours of time).
> >
> > Some of my thoughts were:
> >
> > 1) having monthly databases and using MultiSearcher to search across
> > them. That way my exposure for a corrupted database is limited to
> > this month's database. This would also seem to give me somewhat
> > better control--meaning a) if the search was generating lots of
> > hits, I could display the results a month at a time and not bury
> > them with output. It would also spread their search CPU out better
> > and not prevent other individuals from doing a search. If there were
> > very few results, I could sleep between each month's search and
> > again, not lock everyone else out from searches.
> >
> > 2) Have a "this month's" searchable and an "everything else"
> > searchable. At the beginning of each month, I would consolidate the
> > previous month's database into the "everything else" searchable.
> > This would give more consistent results for relevancy ranked
> > searches. But, it means that a bad search could return lots of
=== message truncated ===