Jason,

On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Today near realtime search (with or without SSDs) comes at a
> price, that is reduced indexing speed due to continued in RAM
> merging. People typically hack something together where indexes
> are held in a RAMDir until being flushed to disk. The problem
> with this is, merging in the background becomes really tricky
> unless it's performed inside of IndexWriter (see LUCENE-1313 and
> IW.getReader). There is the Zoie system which uses the RAMDir
> solution, however it's implemented using a customized deleted
> doc set based on a bloomfilter backed by an inefficient RB tree
> which slows down queries. There's always a trade off when trying
> to build an NRT system, currently.

I'm not sure what numbers you are using to justify saying that zoie "slows down queries" - latency at LinkedIn using zoie has a typical median response time of 4-8ms at the searcher node level (slower at the broker due to a lot of custom stuff that happens before queries are actually sent to the nodes), while dealing with sustained rapid indexing throughput, all with basically zero time between indexing event and index visibility (i.e. true real-time, not "near real time", unless indexing events are coming in *very* fast).

You say there's a tradeoff, but as you should remember from your time at LinkedIn, we do distributed realtime faceted search while maintaining extremely low latency and still indexing sometimes more than a thousand new docs a minute per node (I should dredge up some new numbers to verify what that is exactly these days).

> Deletes can pile up in segments so the
> BalancedSegmentMergePolicy could be used to remove those faster
> than LogMergePolicy, however I haven't tested it, and it may be
> trying to not do large segment merges altogether which IMO
> is less than ideal because query performance soon degrades
> (similar to an unoptimized index).

Not optimizing all the way has shown in our case to actually be *better* than the "optimal" case of a 1-segment index, at least in the case of realtime indexing at a rapid update pace.

-jake