Hi Otis,

Thanks for elaborating on this, and for the link (funny!).

I have quite a big dataset that is growing all the time. The problems I am starting to face are pretty predictable:

1. Scalability: this includes indexing time (currently some days; hours or even minutes would be better, if that's possible) along with handling the rapid growth.
2. Robustness: the entire system (distributed, single server, or anything else) should be fault-tolerant, e.g. if one shard goes down, another takes over (master-slave scheme).
3. Some apps that we run on SOLR are pretty computationally demanding, like faceting over uni-, bi-, and trigrams of hundreds of millions of documents (index size of half a TB), so a single server holding one shard of the data does not seem to be enough for realtime search.

This is just for a bit of background.

I agree with you that Hadoop and the cloud probably suit massive batch processes better than realtime search. I wonder whether anyone out there has made SOLR shine through the cloud for realtime search over large datasets.

By "SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)" I mean what you've done for your customers using EC2. Are any guidelines or articles on setting up indices on HDFS available, either openly or for a fee?

To sum up, I didn't mean to create a buzz about cloud solutions in this thread; I was just wondering what is practically available or going on in SOLR development in this regard.

Thanks,
Dmitry

On Fri, Mar 25, 2011 at 10:28 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Hi Dan,
>
> This feels a bit like a buzzword soup.... with mushrooms. :)
>
> MR jobs, at least the ones in Hadoopland, are very batch oriented, so that
> wouldn't be very suitable for most search applications. There are some
> technologies like Riak that combine MR and search. Let me use this funny
> little link: http://lmgtfy.com/?q=riak%20mapreduce%20search
>
> Sure, you can put indices on HDFS (but don't expect searches to be fast).
> Sure, you can create indices using MapReduce; we've done that successfully
> for customers, bringing long indexing jobs from many hours to minutes by
> using, yes, a cluster of machines (actually EC2 instances).
> But when you say "more into SOLR on the cloud (e.g. HDFS + MR + cloud of
> commodity machines)", I can't actually picture what precisely you mean...
>
> Otis
> ---
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
> > From: Dmitry Kan <dmitry....@gmail.com>
> > To: solr-user@lucene.apache.org
> > Cc: Upayavira <u...@odoko.co.uk>
> > Sent: Fri, March 25, 2011 8:26:33 AM
> > Subject: Re: solr on the cloud
> >
> > Hi, Upayavira
> >
> > Probably I'm confusing the terms here. When I say "distributed faceting"
> > I'm more into SOLR on the cloud (e.g. HDFS + MR + cloud of commodity
> > machines) rather than into traditional multicore/sharded SOLR on a single
> > or multiple servers with non-distributed file systems (is that what you
> > mean when you refer to "distribution of facet requests across hosts"?)
> >
> > On Fri, Mar 25, 2011 at 1:57 PM, Upayavira <u...@odoko.co.uk> wrote:
> >
> > > On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" <dmitry....@gmail.com>
> > > wrote:
> > > > Hi Yonik,
> > > >
> > > > Oh, this is great. Is distributed faceting available in the trunk?
> > > > What is the basic server setup needed for trying this out, is it
> > > > cloud with HDFS and SOLR with zookeepers?
> > > > Any chance to see the related documentation? :)
> > >
> > > Distributed faceting has been available for a long time, and is
> > > available in the 1.4.1 release.
> > >
> > > The distribution of facet requests across hosts happens in the
> > > background. There's no real difference (in query syntax) between a
> > > standard facet query and a distributed one.
> > >
> > > i.e. you don't need SolrCloud nor Zookeeper for it.
> > > (they may provide other benefits, but you don't need them for
> > > distributed faceting).
> > >
> > > Upayavira
> > >
> > > > On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley
> > > > <yo...@lucidimagination.com> wrote:
> > > >
> > > > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan <dmitry....@gmail.com>
> > > > > wrote:
> > > > > > Basically, of high interest is checking out the Map-Reduce for
> > > > > > distributed faceting, is it even possible with the trunk?
> > > > >
> > > > > Solr already has distributed faceting, and it's much more
> > > > > performant than a map-reduce implementation would be.
> > > > >
> > > > > I've also seen a product use the term "map reduce" incorrectly...
> > > > > as in, we "map" the request to each shard, and then "reduce" the
> > > > > results to a single list (of course, that's not actually
> > > > > map-reduce at all ;-)
> > > > >
> > > >
> > > > :) this sounds pretty strange to me as well. It was only my guess,
> > > > that if you have MR as computational model and a cloud beneath it,
> > > > you could naturally map facet fields to their counts inside single
> > > > documents (no matter, where they are, be it shards or "single"
> > > > index) and pass them onto reducers.
> > > >
> > > > > -Yonik
> > > > > http://www.lucenerevolution.org -- Lucene/Solr User Conference,
> > > > > May 25-26, San Francisco
> > > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Dmitry Kan
> > > >
> > > ---
> > > Enterprise Search Consultant at Sourcesense UK,
> > > Making Sense of Open Source
> > >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
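
For reference, a minimal sketch of the point made in the quoted replies: a distributed facet request looks like a standard one, with the shard list passed in Solr's `shards` parameter, and the per-shard counts are then combined into a single list. The host names, the `ngram` field, and both helper functions are hypothetical and only for illustration. The merge step is a simplification of what Solr does internally, not its actual implementation.

```python
from collections import Counter
from urllib.parse import urlencode


def distributed_facet_url(base_url, shards, facet_field, query="*:*"):
    """Build a select URL for a facet query spread over several shards.

    The only difference from a single-server facet query is the
    comma-separated `shards` parameter.
    """
    params = {
        "q": query,
        "rows": 0,                     # only facet counts are wanted
        "facet": "true",
        "facet.field": facet_field,
        "shards": ",".join(shards),    # hypothetical shard addresses
    }
    return f"{base_url}/select?{urlencode(params)}"


def merge_facet_counts(per_shard_counts):
    """Sum per-shard {term: count} mappings into one sorted list.

    Simplified illustration of the "gather" step; the receiving Solr
    node performs this merge itself for a real distributed request.
    """
    total = Counter()
    for counts in per_shard_counts:
        total.update(counts)
    return total.most_common()


url = distributed_facet_url(
    "http://host1:8983/solr",
    ["host1:8983/solr", "host2:8983/solr"],
    "ngram",
)
merged = merge_facet_counts([{"a": 2, "b": 1}, {"a": 3, "c": 4}])
```

In practice the client only sends the URL; the merge function is shown solely to make the scatter-gather idea concrete, and it is not MapReduce in the Hadoop sense.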