We have worked on this problem at the server level as well, and we have
open sourced it at:
http://code.google.com/p/zoie/

Wiki on the realtime aspect: http://code.google.com/p/zoie/wiki/ZoieSystem

-John

On Fri, Dec 26, 2008 at 12:34 PM, Robert Engels <reng...@ix.netcom.com> wrote:

> If you move to the "either embedded or server" model, the post-reopen is
> trivial, as the structures can be created as the segment is written.
>
> It is the networked shared-access model that causes a lot of these
> optimizations to be far more complex than needed.
>
> Would it maybe be simpler to move to the "embedded or server" model, and
> add a network shared file (e.g. NFS) access model as a layer? The latter
> is going to perform far worse anyway.
>
> I guess I don't understand why Lucene continues to try and support this
> model. NO ONE does it any more. This is the way MS Access worked, and
> everyone that wanted performance needed to move to SQL Server for the
> server model.
>
> -----Original Message-----
> >From: Marvin Humphrey <mar...@rectangular.com>
> >Sent: Dec 26, 2008 12:53 PM
> >To: java-dev@lucene.apache.org
> >Subject: Re: Realtime Search
> >
> >On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote:
> >> > 4) Allow 2 concurrent writers: one for small, fast updates, and one
> >> > for big background merges.
> >>
> >> Marvin, can you describe this in more detail?
> >
> >The goal is to improve worst-case write performance.
> >
> >Currently, writes are quick most of the time, but occasionally you'll
> >trigger a big merge and get stuck. To solve this problem, we can assign
> >a merge policy to our primary writer which tells it to merge no more
> >than mergeThreshold documents. The value of mergeThreshold will need
> >tuning depending on document size, change rate, and so on, but the idea
> >is that we want this writer to do as much merging as it can while still
> >keeping worst-case write performance down to an acceptable number.
> >
> >Doing only small merges just puts off the day of reckoning, of course.
> >By avoiding big consolidations, we are slowly accumulating
> >small-to-medium sized segments and causing a gradual degradation of
> >search-time performance.
> >
> >What we'd like is a separate write process, operating (mostly) in the
> >background, dedicated solely to merging segments which contain at least
> >mergeThreshold docs.
> >
> >If all we have to do is add documents to the index, adding that second
> >write process isn't a big deal. We have to worry about competition for
> >segment, snapshot, and temp file names, but that's about it.
> >
> >Deletions make matters more complicated, but with a tombstone-based
> >deletions mechanism, the problems are solvable.
> >
> >When the background merge writer starts up, it will see a particular
> >view of the index in time, including deletions. It will perform nearly
> >all of its operations based on this view of the index, mapping around
> >documents which were marked as deleted at init time.
> >
> >In between the time when the background merge writer starts up and the
> >time it finishes consolidating segment data, we assume that the primary
> >writer will have modified the index.
> >
> > * New docs have been added in new segments.
> > * Tombstones have been added which suppress documents in segments
> >   which didn't even exist when the background merge writer started up.
> > * Tombstones have been added which suppress documents in segments
> >   which existed when the background merge writer started up, but were
> >   not merged.
> > * Tombstones have been added which suppress documents in segments
> >   which have just been merged.
> >
> >Only the last category of deletions matters.
> >
> >At this point, the background merge writer acquires an exclusive write
> >lock on the index. It examines recently added tombstones, translates
> >the document numbers and writes a tombstone file against itself. Then
> >it writes the snapshot file to commit its changes and releases the
> >write lock.
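To make the tombstone-translation step just described concrete, here is a
minimal Java sketch of what the background merge writer might do once it
holds the exclusive write lock. Every name in it (DocMap, Tombstone,
translateTombstones) is hypothetical, invented for illustration; this is
not the Lucy/KS or Lucene API, and a real implementation would go on to
write the tombstone file and the snapshot file before releasing the lock.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    final class BackgroundMergeCommit {

        /** Maps (old segment, old doc number) to doc numbers in the merged
         *  segment. Built during the merge, from the init-time view. */
        static final class DocMap {
            private final Map<String, int[]> oldToNew; // segment name -> new ids
            DocMap(Map<String, int[]> oldToNew) { this.oldToNew = oldToNew; }
            /** Returns the merged doc number, or -1 if the doc's segment
             *  was not consumed by this merge. */
            int remap(String oldSegment, int oldDocId) {
                int[] mapping = oldToNew.get(oldSegment);
                return mapping == null ? -1 : mapping[oldDocId];
            }
        }

        /** A deletion recorded by the primary writer while we were merging. */
        static final class Tombstone {
            final String segment;
            final int docId;
            Tombstone(String segment, int docId) {
                this.segment = segment;
                this.docId = docId;
            }
        }

        /** Only tombstones against segments that were just merged need
         *  translating; tombstones against other segments still apply
         *  verbatim and are skipped here. */
        static List<Tombstone> translateTombstones(List<Tombstone> recent,
                                                   DocMap docMap,
                                                   String mergedSegment) {
            List<Tombstone> translated = new ArrayList<Tombstone>();
            for (Tombstone t : recent) {
                int newDocId = docMap.remap(t.segment, t.docId);
                if (newDocId >= 0) {
                    // The deleted doc now lives in the merged segment under
                    // a new doc number; rewrite the tombstone accordingly.
                    translated.add(new Tombstone(mergedSegment, newDocId));
                }
            }
            return translated;
        }
    }

The exclusive lock only needs to cover this remapping and the commit; the
merge itself ran concurrently with the primary writer, against the
init-time view of the index.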
> >Worst-case update performance for the system is now the sum of the time
> >it takes the background merge writer to consolidate tombstones and the
> >worst-case performance of the primary writer.
> >
> >> It sounds like this is your solution for "decoupling" segment changes
> >> due to merges from changes from docs being indexed, from a reader's
> >> standpoint?
> >
> >It's true that we are decoupling the process of making logical changes
> >to the index from the process of internal consolidation. I probably
> >wouldn't describe that as being done from the reader's standpoint,
> >though.
> >
> >With mmap and data structures optimized for it, we basically solve the
> >read-time responsiveness cost problem. From the client perspective, the
> >delay between firing off a change order and seeing that change made
> >live is now dominated by the time it takes to actually update the
> >index. The time between the commit and having an IndexReader which can
> >see that commit is negligible in comparison.
> >
> >> Since you are using mmap to achieve near zero brand-new IndexReader
> >> creation, whereas in Lucene we are moving towards achieving real-time
> >> by always reopening a current IndexReader (not a brand new one), it
> >> seems like you should not actually have to worry about the case of
> >> reopening a reader after a large merge has finished?
> >
> >Even though we can rely on mmap rather than slurping, there are
> >potentially a lot of files to open and a lot of JSON-encoded metadata
> >to parse, so I'm not certain that Lucy/KS will never have to worry
> >about the time it takes to open a new IndexReader. Fortunately, we can
> >implement reopen() if we need to.
> >
> >> We need to deal with this case (background the warming) because
> >> creating that new SegmentReader (on the newly merged segment) can
> >> take a non-trivial amount of time.
> >
> >Yes. Without mmap or some other solution, I think improvements to
> >worst-case update performance in Lucene will continue to be constrained
> >by post-commit IndexReader opening costs.
> >
> >Marvin Humphrey
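The reopen-plus-background-warming approach discussed above can be
sketched roughly as follows. The wrapper class, its warm() method, and
the executor wiring are assumptions for illustration, not an existing
Lucene facility; IndexReader.reopen(), IndexSearcher, and
MatchAllDocsQuery are the only real Lucene APIs used here, as they
existed circa Lucene 2.4.

    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicReference;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;

    final class WarmingReopener {
        private final AtomicReference<IndexSearcher> current;
        private final ExecutorService warmer =
            Executors.newSingleThreadExecutor();

        WarmingReopener(IndexReader initial) {
            current = new AtomicReference<IndexSearcher>(
                new IndexSearcher(initial));
        }

        /** Call after each commit: reopen() shares unchanged segments, so
         *  only the newly written (e.g. just-merged) segments cost anything. */
        void maybeReopen() {
            warmer.submit(new Runnable() {
                public void run() {
                    try {
                        IndexReader old = current.get().getIndexReader();
                        IndexReader fresh = old.reopen();
                        if (fresh != old) {
                            IndexSearcher warmed = new IndexSearcher(fresh);
                            warm(warmed);        // pay the warming cost off-line
                            current.set(warmed); // swap in, now that it's hot
                            old.close();         // NB: see caveat below
                        }
                    } catch (IOException e) {
                        // log and retry after the next commit
                    }
                }
            });
        }

        /** Touch the new segments so their data structures are loaded
         *  before live queries ever see the reader. */
        void warm(IndexSearcher searcher) throws IOException {
            searcher.search(new MatchAllDocsQuery(), 10);
        }

        IndexSearcher acquire() { return current.get(); }
    }

Caveat: closing the old reader while searches may still be in flight is
unsafe; real code would reference-count readers and close each one only
when its last search completes. The sketch glosses over that.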