Here's a blog post describing some details of Twitter's approach: http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html
And here's a talk Michael did last October (Lucene Revolutions): http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter Twitter's case is simpler since they never delete ;) So we have to fix that to do it in Lucene... there are also various open issues that begin to explore some of the ideas here. But this ("immediate consistency") would be a deep and complex change, and I don't see many apps that actually require it. Mike McCandless http://blog.mikemccandless.com On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko <ita...@code972.com> wrote: > Thanks for your detailed answer. We'll have to tackle this and see whats > more important to us then. I'd definitely love to hear Zoie has overcame all > that... > > > Any pointers to Michael Busch's approach? I take this has something to do > with the core itself or index format, probably using the Flex version? > > > Itamar. > > > On 12/06/2011 23:12, Michael McCandless wrote: > >> > From what I understand of Zoie (and it's been some time since I last >> looked... so this could be wrong now), the biggest difference vs NRT >> is that Zoie aims for "immediate consistency", ie index changes are >> always made visible to the very next query, vs NRT which is >> "controlled consistency", a blend between immediate and eventual >> consistency where your app decides when the changes must become >> visible. >> >> But in exchange for that, Zoie pays a price: each search has a higher >> cost per collected hit, since it must post-filter for deleted docs. >> And since Zoie necessarily adds complexity, there's more risk; eg >> there were some nasty Zoie bugs that took quite some time to track >> down (under https://issues.apache.org/jira/browse/LUCENE-2729). >> >> Anyway, I don't think that's a good tradeoff, in general, for our >> users, because very few apps truly require immediate consistency from >> Lucene (can anyone give an example where their app depends on >> immediate consistency...?). I think it's better to spend time during >> reopen so that searches aren't slower. >> >> That said, Lucene has already incorporated one big part of Zoie >> (caching small segments in RAM) via the new NRTCachingDirectory (in >> contrib/misc). Also, the upcoming NRTManager >> (https://issues.apache.org/jira/browse/LUCENE-2955) adds control over >> visibility of specific indexing changes to queries that need to see >> the changes. >> >> Finally, even better would be to not have to make any tradeoff >> whatsoever ;) Twitter's approach (created by Michael Busch) seems to >> bring immediate consistency with no search performance hit, so if we >> do anything here likely it'll be similar to what Michael has done >> (though, those changes are not simple either!). >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko<ita...@code972.com> >> wrote: >>> >>> Mike, >>> >>> >>> Speaking of NRT, and completely off-topic, I know: Lucene's NRT >>> apparently >>> isn't fast enough if Zoie was needed, and now that Zoie is around are >>> there >>> any plans to make it Lucene's default? or: why would one still use NRT >>> when >>> Zoie seem to work much better? >>> >>> >>> Itamar. >>> >>> >>> On 12/06/2011 13:16, Michael McCandless wrote: >>> >>>> Remember that memory-mapping is not a panacea: at the end of the day, >>>> if there just isn't enough RAM on the machine to keep your full >>>> "working set" hot, then the OS will have to hit the disk, regardless >>>> of whether the access is through MMap or a "traditional" IO request. >>>> >>>> That said, on Fedora Linux anyway, I generally see better performance >>>> from MMap than from NIOFSDir; eg see the 2nd chart here: >>>> >>>> >>>> >>>> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html >>>> >>>> Mike McCandless >>>> >>>> http://blog.mikemccandless.com >>>> >>>> On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko<ita...@code972.com> >>>> wrote: >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> The whole point of my question was to find out if and how to make >>>>> balancing >>>>> on the SAME machine. Apparently thats not going to help and at a >>>>> certain >>>>> point we will just have to prompt the user to buy more hardware... >>>>> >>>>> >>>>> Out of curiosity, isn't there anything that we can do to avoid that? >>>>> for >>>>> instance using memory-mapped files for the indexes? anything that would >>>>> help >>>>> us overcome OS limitations of that sort... >>>>> >>>>> >>>>> Also, you mention a scheduled job to check for performance degradation; >>>>> any >>>>> idea how serious such a drop should be for sharding to be really >>>>> beneficial? >>>>> or is it application specific too? >>>>> >>>>> >>>>> Itamar. >>>>> >>>>> >>>>> On 12/06/2011 06:43, Shai Erera wrote: >>>>> >>>>>> I agree w/ Erick, there is no cutoff point (index size for that >>>>>> matter) >>>>>> above which you start sharding. >>>>>> >>>>>> What you can do is create a scheduled job in your system that runs a >>>>>> select >>>>>> list of queries and monitors their performance. Once it degrades, it >>>>>> shards >>>>>> the index by either splitting it (you can use IndexSplitter under >>>>>> contrib) >>>>>> or create a new shard, and direct new documents to it. >>>>>> >>>>>> I think I read somewhere, not sure if it was in Solr or ElasticSearch >>>>>> documentation, about a Balancer object, which moves shards around in >>>>>> order >>>>>> to balance the load on the cluster. You can implement something >>>>>> similar >>>>>> which tries to balance the index sizes, creates new shards on-the-fly, >>>>>> even >>>>>> merge shards if suddenly a whole source is being removed from the >>>>>> system >>>>>> etc. >>>>>> >>>>>> Also, note that the 'largest index size' threshold is really a machine >>>>>> constraint and not Lucene's. So if you decide that 10 GB is your >>>>>> cutoff, >>>>>> it >>>>>> is pointless to create 10x10GB shards on the same machine -- searching >>>>>> them >>>>>> is just like searching a 100GB index w/ 10x10GB segments. Perhaps it's >>>>>> even >>>>>> worse because you consume more RAM when the indexes are split (e.g., >>>>>> terms >>>>>> index, field infos etc.). >>>>>> >>>>>> Shai >>>>>> >>>>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick >>>>>> Erickson<erickerick...@gmail.com>wrote: >>>>>> >>>>>>> <<<We can't assume anything about the machine running it, >>>>>>> so testing won't really tell us much>>> >>>>>>> >>>>>>> Hmmm, then it's pretty hopeless I think. Problem is that >>>>>>> anything you say about running on a machine with >>>>>>> 2G available memory on a single processor is completely >>>>>>> incomparable to running on a machine with 64G of >>>>>>> memory available for Lucene and 16 processors. >>>>>>> >>>>>>> There's really no such thing as an "optimum" Lucene index >>>>>>> size, it always relates to the characteristics of the >>>>>>> underlying hardware. >>>>>>> >>>>>>> I think the best you can do is actually test on various >>>>>>> configurations, then at least you can say "on configuration >>>>>>> X this is the tipping point". >>>>>>> >>>>>>> Sorry there isn't a better answer that I know of, but... >>>>>>> >>>>>>> Best >>>>>>> Erick >>>>>>> >>>>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar >>>>>>> Syn-Hershko<ita...@code972.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> I know Lucene indexes to be at their optimum up to a certain size - >>>>>>>> said >>>>>>> >>>>>>> to >>>>>>>> >>>>>>>> be around several GBs. I haven't found a good discussion over this, >>>>>>>> but >>>>>>> >>>>>>> its >>>>>>>> >>>>>>>> my understanding that at some point its better to split an index >>>>>>>> into >>>>>>> >>>>>>> parts >>>>>>>> >>>>>>>> (a la sharding) than to continue searching on a huge-size index. I >>>>>>>> assume >>>>>>>> this has to do with OS and IO configurations. Can anyone point me to >>>>>>>> more >>>>>>>> info on this? >>>>>>>> >>>>>>>> We have a product that is using Lucene for various searches, and at >>>>>>>> the >>>>>>>> moment each type of search is using its own Lucene index. We plan on >>>>>>>> refactoring the way it works and to combine all indexes into one - >>>>>>>> making >>>>>>>> the whole system more robust and with a smaller memory footprint, >>>>>>>> among >>>>>>>> other things. >>>>>>>> >>>>>>>> Assuming the above is true, we are interested in knowing how to do >>>>>>>> this >>>>>>>> correctly. Initially all our indexes will be run in one big index, >>>>>>>> but >>>>>>>> if >>>>>>> >>>>>>> at >>>>>>>> >>>>>>>> some index size there is a severe performance degradation we would >>>>>>>> like >>>>>>> >>>>>>> to >>>>>>>> >>>>>>>> handle that correctly by starting a new FSDirectory index to flush >>>>>>>> into, >>>>>>> >>>>>>> or >>>>>>>> >>>>>>>> by re-indexing and moving large indexes into their own Lucene index. >>>>>>>> >>>>>>>> Are there are any guidelines for measuring or estimating this >>>>>>>> correctly? >>>>>>>> what we should be aware of while considering all that? We can't >>>>>>>> assume >>>>>>>> anything about the machine running it, so testing won't really tell >>>>>>>> us >>>>>>>> much... >>>>>>>> >>>>>>>> Thanks in advance for any input on this, >>>>>>>> >>>>>>>> Itamar. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> --------------------------------------------------------------------- >>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>>>> >>>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>>> >>>>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >>>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org