There is also the distributed model - but in that case each node is running 
some sort of server anyway (as in Hadoop).

It seems that the distributed model would be easier to develop on top of 
Hadoop than the embedded model would be.

-----Original Message-----
>From: Robert Engels <reng...@ix.netcom.com>
>Sent: Dec 26, 2008 2:34 PM
>To: java-dev@lucene.apache.org
>Subject: Re: Realtime Search
>
>If you move to either the embedded or the server model, the post-commit 
>reopen is trivial, as the structures can be created as the segment is written.
>
>It is the networked shared access model that causes a lot of these 
>optimizations to be far more complex than needed.
>
>Would it maybe be simpler to move to the "embedded or server" model, and add a 
>network shared file (e.g. nfs) access model as a layer?  The latter is going 
>to perform far worse anyway.
>
>I guess I don't understand why Lucene continues to try to support this model. 
>NO ONE does it any more.  This is the way MS Access worked, and everyone that 
>wanted performance needed to move to SQL server for the server model.
>
>
>-----Original Message-----
>>From: Marvin Humphrey <mar...@rectangular.com>
>>Sent: Dec 26, 2008 12:53 PM
>>To: java-dev@lucene.apache.org
>>Subject: Re: Realtime Search
>>
>>On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote:
>>> >  4) Allow 2 concurrent writers: one for small, fast updates, and one for
>>> >     big background merges.
>>> 
>>> Marvin can you describe more detail here? 
>>
>>The goal is to improve worst-case write performance.  
>>
>>Currently, writes are quick most of the time, but occasionally you'll trigger
>>a big merge and get stuck.  To solve this problem, we can assign a merge
>>policy to our primary writer which tells it to merge no more than
>>mergeThreshold documents.  The value of mergeThreshold will need tuning
>>depending on document size, change rate, and so on, but the idea is that we
>>want this writer to do as much merging as it can while still keeping
>>worst-case write performance down to an acceptable number.
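>>
>>In Lucene terms, a minimal sketch of that capped primary writer might look
>>like this (assuming the 2.4-era API; the mergeThreshold value is purely
>>illustrative and would need tuning as described above):
>>
>>    import org.apache.lucene.analysis.SimpleAnalyzer;
>>    import org.apache.lucene.index.IndexWriter;
>>    import org.apache.lucene.index.LogDocMergePolicy;
>>    import org.apache.lucene.store.FSDirectory;
>>
>>    public class CappedWriter {
>>        public static void main(String[] args) throws Exception {
>>            IndexWriter writer = new IndexWriter(
>>                FSDirectory.getDirectory(args[0]), new SimpleAnalyzer(),
>>                IndexWriter.MaxFieldLength.UNLIMITED);
>>            // Segments above mergeThreshold docs are never selected for
>>            // merging, so the cost of any single merge stays bounded.
>>            int mergeThreshold = 10000;  // illustrative; needs tuning
>>            LogDocMergePolicy policy = new LogDocMergePolicy();
>>            policy.setMaxMergeDocs(mergeThreshold);
>>            writer.setMergePolicy(policy);
>>            writer.close();
>>        }
>>    }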
>>
>>Doing only small merges just puts off the day of reckoning, of course.  By
>>avoiding big consolidations, we are slowly accumulating small-to-medium sized
>>segments and causing a gradual degradation of search-time performance.
>>
>>What we'd like is a separate write process, operating (mostly) in the
>>background, dedicated solely to merging segments which contain at least
>>mergeThreshold docs.
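>>
>>A hypothetical skeleton of that second process (none of these names are real
>>Lucene or Lucy APIs; it only illustrates the division of labor):
>>
>>    public class BackgroundMerger implements Runnable {
>>        private final int mergeThreshold;
>>        private volatile boolean stopped = false;
>>
>>        public BackgroundMerger(int mergeThreshold) {
>>            this.mergeThreshold = mergeThreshold;
>>        }
>>
>>        public void run() {
>>            while (!stopped) {
>>                // Consolidate only segments holding at least
>>                // mergeThreshold docs; the primary writer leaves those
>>                // alone, so the two writers rarely collide.
>>                mergeLargeSegments();
>>                try {
>>                    Thread.sleep(10000);  // then sleep for a while
>>                } catch (InterruptedException e) {
>>                    return;
>>                }
>>            }
>>        }
>>
>>        public void stop() { stopped = true; }
>>
>>        private void mergeLargeSegments() {
>>            // Stub: take a point-in-time view of the index, rewrite the
>>            // big segments as one new segment, then commit under the
>>            // write lock (details below).
>>        }
>>    }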
>>
>>If all we have to do is add documents to the index, adding that second write
>>process isn't a big deal.  We have to worry about competition for segment,
>>snapshot, and temp file names, but that's about it.
>>
>>Deletions make matters more complicated, but with a tombstone-based deletions
>>mechanism, the problems are solvable.
>>
>>When the background merge writer starts up, it will see a particular view of
>>the index in time, including deletions.  It will perform nearly all of its
>>operations based on this view of the index, mapping around documents which
>>were marked as deleted at init time.
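>>
>>Mapping around those deletions amounts to an old-to-new document number
>>remapping.  A sketch, as a hypothetical method on the BackgroundMerger
>>above:
>>
>>    // Build a map from a merged segment's old doc numbers to doc
>>    // numbers in the new consolidated segment; -1 marks docs dropped
>>    // because an init-time tombstone suppressed them.
>>    static int[] buildDocMap(java.util.BitSet deletedAtInit, int maxDoc,
>>                             int newDocBase) {
>>        int[] docMap = new int[maxDoc];
>>        int newDocId = newDocBase;
>>        for (int oldDocId = 0; oldDocId < maxDoc; oldDocId++) {
>>            if (deletedAtInit.get(oldDocId)) {
>>                docMap[oldDocId] = -1;          // mapped around
>>            } else {
>>                docMap[oldDocId] = newDocId++;  // survives the merge
>>            }
>>        }
>>        return docMap;
>>    }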
>>
>>In between the time when the background merge writer starts up and the time it
>>finishes consolidating segment data, we assume that the primary writer will
>>have modified the index.
>>
>>  * New docs have been added in new segments.
>>  * Tombstones have been added which suppress documents in segments which
>>    didn't even exist when the background merge writer started up.
>>  * Tombstones have been added which suppress documents in segments which
>>    existed when the background merge writer started up, but were not merged.
>>  * Tombstones have been added which suppress documents in segments which have
>>    just been merged.
>>
>>Only the last category of deletions matters.
>>
>>At this point, the background merge writer acquires an exclusive write lock on
>>the index.  It examines recently added tombstones, translates the document
>>numbers and writes a tombstone file against itself.  Then it writes the
>>snapshot file to commit its changes and releases the write lock.
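>>
>>Continuing the sketch, the translation step might look like this (again,
>>hypothetical names, with docMap coming from buildDocMap() above):
>>
>>    // Under the write lock: carry deletions that arrived against the
>>    // just-merged segments forward into the new segment's numbering.
>>    static java.util.BitSet translateLateTombstones(
>>            java.util.BitSet lateTombstones,  // old doc numbers
>>            int[] docMap, int newMaxDoc) {
>>        java.util.BitSet translated = new java.util.BitSet(newMaxDoc);
>>        for (int oldDocId = lateTombstones.nextSetBit(0); oldDocId >= 0;
>>                oldDocId = lateTombstones.nextSetBit(oldDocId + 1)) {
>>            int newDocId = docMap[oldDocId];
>>            if (newDocId != -1) {          // doc survived the merge ...
>>                translated.set(newDocId);  // ... so keep its deletion
>>            }
>>        }
>>        // Written as a tombstone file against the new segment, followed
>>        // by the snapshot file that commits everything.
>>        return translated;
>>    }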
>>
>>Worst-case update performance for the system is now the sum of the time it
>>takes the background merge writer to consolidate tombstones and the
>>worst-case performance of the primary writer.
>>
>>> It sounds like this is your solution for "decoupling" segments changes due
>>> to merges from changes from docs being indexed, from a reader's standpoint?
>>
>>It's true that we are decoupling the process of making logical changes to the
>>index from the process of internal consolidation.  I probably wouldn't
>>describe that as being done from the reader's standpoint, though.
>>
>>With mmap and data structures optimized for it, we basically solve the
>>read-time responsiveness cost problem.  From the client perspective, the delay
>>between firing off a change order and seeing that change made live is now
>>dominated by the time it takes to actually update the index.  The time between
>>the commit and having an IndexReader which can see that commit is negligible
>>in comparison.
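>>
>>Lucy/KS does this in C, but the principle translates.  A minimal Java
>>illustration of why mmap makes opening cheap:
>>
>>    import java.io.RandomAccessFile;
>>    import java.nio.MappedByteBuffer;
>>    import java.nio.channels.FileChannel;
>>
>>    public class MmapOpen {
>>        public static void main(String[] args) throws Exception {
>>            // Mapping costs no bulk read; the OS pages data in lazily
>>            // on first access, so the file is usable immediately.
>>            RandomAccessFile file = new RandomAccessFile(args[0], "r");
>>            MappedByteBuffer data = file.getChannel().map(
>>                FileChannel.MapMode.READ_ONLY, 0, file.length());
>>            System.out.println("mapped " + data.capacity() + " bytes");
>>        }
>>    }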
>>
>>> Since you are using mmap to achieve near zero brand-new IndexReader
>>> creation, whereas in Lucene we are moving towards achieving real-time
>>> by always reopening a current IndexReader (not a brand new one), it
>>> seems like you should not actually have to worry about the case of
>>> reopening a reader after a large merge has finished?
>>
>>Even though we can rely on mmap rather than slurping, there are potentially a
>>lot of files to open and a lot of JSON-encoded metadata to parse, so I'm not
>>certain that Lucy/KS will never have to worry about the time it takes to open
>>a new IndexReader.  Fortunately, we can implement reopen() if we need to.
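>>
>>For reference, the Lucene idiom we'd mimic (2.4-era API):
>>
>>    import org.apache.lucene.index.IndexReader;
>>    import org.apache.lucene.store.FSDirectory;
>>
>>    public class ReopenSketch {
>>        public static void main(String[] args) throws Exception {
>>            IndexReader reader =
>>                IndexReader.open(FSDirectory.getDirectory(args[0]));
>>            // ... the index changes ...
>>            // reopen() shares unchanged segments with the old reader,
>>            // so only newly written segments pay an opening cost.
>>            IndexReader newReader = reader.reopen();
>>            if (newReader != reader) {
>>                reader.close();
>>                reader = newReader;
>>            }
>>            reader.close();
>>        }
>>    }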
>>
>>> We need to deal with this case (background the warming) because
>>> creating that new SegmentReader (on the newly merged segment) can take
>>> a non-trivial amount of time.
>>
>>Yes.  Without mmap or some other solution, I think improvements to worst-case
>>update performance in Lucene will continue to be constrained by post-commit
>>IndexReader opening costs.
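>>
>>A hypothetical warming step, the sort of thing "background the warming"
>>implies (my names, not a Lucene API):
>>
>>    import org.apache.lucene.index.IndexReader;
>>    import org.apache.lucene.search.IndexSearcher;
>>    import org.apache.lucene.search.MatchAllDocsQuery;
>>
>>    public class Warmer {
>>        // Run a throwaway search against the freshly (re)opened reader
>>        // so norms, caches, etc. load before it serves live traffic.
>>        public static void warm(IndexReader reader) throws Exception {
>>            IndexSearcher searcher = new IndexSearcher(reader);
>>            searcher.search(new MatchAllDocsQuery(), null, 10);
>>            // A real warmer would also run representative sorts and
>>            // filters against the fields the application uses.
>>        }
>>    }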
>>
>>Marvin Humphrey