We have worked on this problem at the server level as well, and we have
open sourced it at:
http://code.google.com/p/zoie/

Wiki on the realtime aspect: http://code.google.com/p/zoie/wiki/ZoieSystem

-John

On Fri, Dec 26, 2008 at 12:34 PM, Robert Engels <reng...@ix.netcom.com> wrote:

> If you move to the "either embedded or server" model, the post-reopen is
> trivial, as the structures can be created as the segment is written.
>
> It is the networked shared-access model that causes a lot of these
> optimizations to be far more complex than needed.
>
> Would it maybe be simpler to move to the "embedded or server" model, and
> add a network shared file (e.g. NFS) access model as a layer? The latter
> is going to perform far worse anyway.
>
> I guess I don't understand why Lucene continues to try and support this
> model. NO ONE does it any more. This is the way MS Access worked, and
> everyone that wanted performance needed to move to SQL Server for the
> server model.
>
> -----Original Message-----
> >From: Marvin Humphrey <mar...@rectangular.com>
> >Sent: Dec 26, 2008 12:53 PM
> >To: java-dev@lucene.apache.org
> >Subject: Re: Realtime Search
> >
> >On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote:
> >> > 4) Allow 2 concurrent writers: one for small, fast updates, and one
> >> > for big background merges.
> >>
> >> Marvin, can you describe this in more detail?
> >
> >The goal is to improve worst-case write performance.
> >
> >Currently, writes are quick most of the time, but occasionally you'll
> >trigger a big merge and get stuck. To solve this problem, we can assign
> >a merge policy to our primary writer which tells it to merge no more
> >than mergeThreshold documents. The value of mergeThreshold will need
> >tuning depending on document size, change rate, and so on, but the idea
> >is that we want this writer to do as much merging as it can while still
> >keeping worst-case write performance down to an acceptable number.
> >
> >Doing only small merges just puts off the day of reckoning, of course.
> >By avoiding big consolidations, we are slowly accumulating
> >small-to-medium sized segments and causing a gradual degradation of
> >search-time performance.
> >
> >What we'd like is a separate write process, operating (mostly) in the
> >background, dedicated solely to merging segments which contain at least
> >mergeThreshold docs.
> >
> >If all we have to do is add documents to the index, adding that second
> >write process isn't a big deal. We have to worry about competition for
> >segment, snapshot, and temp file names, but that's about it.
> >
> >Deletions make matters more complicated, but with a tombstone-based
> >deletions mechanism, the problems are solvable.
> >
> >When the background merge writer starts up, it will see a particular
> >view of the index in time, including deletions. It will perform nearly
> >all of its operations based on this view of the index, mapping around
> >documents which were marked as deleted at init time.
> >
> >In between the time when the background merge writer starts up and the
> >time it finishes consolidating segment data, we assume that the primary
> >writer will have modified the index.
> >
> > * New docs have been added in new segments.
> > * Tombstones have been added which suppress documents in segments
> >   which didn't even exist when the background merge writer started up.
> > * Tombstones have been added which suppress documents in segments
> >   which existed when the background merge writer started up, but were
> >   not merged.
> > * Tombstones have been added which suppress documents in segments
> >   which have just been merged.
> >
> >Only the last category of deletions matters.
> >
> >At this point, the background merge writer acquires an exclusive write
> >lock on the index. It examines recently added tombstones, translates
> >the document numbers and writes a tombstone file against itself. Then
> >it writes the snapshot file to commit its changes and releases the
> >write lock.
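To make the tombstone-translation step just described concrete, here is a
minimal Java sketch of what the background merge writer might do once it
holds the exclusive write lock. Every name in it (DocMap, Tombstone,
translateTombstones) is hypothetical, invented for illustration; this is
not the Lucy/KS or Lucene API, and a real implementation would go on to
write the tombstone file and the snapshot file before releasing the lock.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    final class BackgroundMergeCommit {

        /** Maps (old segment, old doc number) to doc numbers in the merged
         *  segment. Built during the merge, from the init-time view. */
        static final class DocMap {
            private final Map<String, int[]> oldToNew; // segment name -> new ids
            DocMap(Map<String, int[]> oldToNew) { this.oldToNew = oldToNew; }
            /** Returns the merged doc number, or -1 if the doc's segment
             *  was not consumed by this merge. */
            int remap(String oldSegment, int oldDocId) {
                int[] mapping = oldToNew.get(oldSegment);
                return mapping == null ? -1 : mapping[oldDocId];
            }
        }

        /** A deletion recorded by the primary writer while we were merging. */
        static final class Tombstone {
            final String segment;
            final int docId;
            Tombstone(String segment, int docId) {
                this.segment = segment;
                this.docId = docId;
            }
        }

        /** Only tombstones against segments that were just merged need
         *  translating; tombstones against other segments still apply
         *  verbatim and are skipped here. */
        static List<Tombstone> translateTombstones(List<Tombstone> recent,
                                                   DocMap docMap,
                                                   String mergedSegment) {
            List<Tombstone> translated = new ArrayList<Tombstone>();
            for (Tombstone t : recent) {
                int newDocId = docMap.remap(t.segment, t.docId);
                if (newDocId >= 0) {
                    // The deleted doc now lives in the merged segment under
                    // a new doc number; rewrite the tombstone accordingly.
                    translated.add(new Tombstone(mergedSegment, newDocId));
                }
            }
            return translated;
        }
    }

The exclusive lock only needs to cover this remapping and the commit; the
merge itself ran concurrently with the primary writer, against the
init-time view of the index.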
> >Worst-case update performance for the system is now the sum of the time
> >it takes the background merge writer to consolidate tombstones and the
> >worst-case performance of the primary writer.
> >
> >> It sounds like this is your solution for "decoupling" segment changes
> >> due to merges from changes from docs being indexed, from a reader's
> >> standpoint?
> >
> >It's true that we are decoupling the process of making logical changes
> >to the index from the process of internal consolidation. I probably
> >wouldn't describe that as being done from the reader's standpoint,
> >though.
> >
> >With mmap and data structures optimized for it, we basically solve the
> >read-time responsiveness cost problem. From the client perspective, the
> >delay between firing off a change order and seeing that change made
> >live is now dominated by the time it takes to actually update the
> >index. The time between the commit and having an IndexReader which can
> >see that commit is negligible in comparison.
> >
> >> Since you are using mmap to achieve near zero brand-new IndexReader
> >> creation, whereas in Lucene we are moving towards achieving real-time
> >> by always reopening a current IndexReader (not a brand new one), it
> >> seems like you should not actually have to worry about the case of
> >> reopening a reader after a large merge has finished?
> >
> >Even though we can rely on mmap rather than slurping, there are
> >potentially a lot of files to open and a lot of JSON-encoded metadata
> >to parse, so I'm not certain that Lucy/KS will never have to worry
> >about the time it takes to open a new IndexReader. Fortunately, we can
> >implement reopen() if we need to.
> >
> >> We need to deal with this case (background the warming) because
> >> creating that new SegmentReader (on the newly merged segment) can
> >> take a non-trivial amount of time.
> >
> >Yes. Without mmap or some other solution, I think improvements to
> >worst-case update performance in Lucene will continue to be constrained
> >by post-commit IndexReader opening costs.
> >
> >Marvin Humphrey
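The reopen-plus-background-warming approach discussed above can be
sketched roughly as follows. The wrapper class, its warm() method, and
the executor wiring are assumptions for illustration, not an existing
Lucene facility; IndexReader.reopen(), IndexSearcher, and
MatchAllDocsQuery are the only real Lucene APIs used here, as they
existed circa Lucene 2.4.

    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicReference;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;

    final class WarmingReopener {
        private final AtomicReference<IndexSearcher> current;
        private final ExecutorService warmer =
            Executors.newSingleThreadExecutor();

        WarmingReopener(IndexReader initial) {
            current = new AtomicReference<IndexSearcher>(
                new IndexSearcher(initial));
        }

        /** Call after each commit: reopen() shares unchanged segments, so
         *  only the newly written (e.g. just-merged) segments cost anything. */
        void maybeReopen() {
            warmer.submit(new Runnable() {
                public void run() {
                    try {
                        IndexReader old = current.get().getIndexReader();
                        IndexReader fresh = old.reopen();
                        if (fresh != old) {
                            IndexSearcher warmed = new IndexSearcher(fresh);
                            warm(warmed);        // pay the warming cost off-line
                            current.set(warmed); // swap in, now that it's hot
                            old.close();         // NB: see caveat below
                        }
                    } catch (IOException e) {
                        // log and retry after the next commit
                    }
                }
            });
        }

        /** Touch the new segments so their data structures are loaded
         *  before live queries ever see the reader. */
        void warm(IndexSearcher searcher) throws IOException {
            searcher.search(new MatchAllDocsQuery(), 10);
        }

        IndexSearcher acquire() { return current.get(); }
    }

Caveat: closing the old reader while searches may still be in flight is
unsafe; real code would reference-count readers and close each one only
when its last search completes. The sketch glosses over that.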