Thanks, Jason, this looks very relevant!

On Fri, Mar 25, 2011 at 11:26 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Dmitry,
>
> If you're planning on using HBase you can take a look at
> https://issues.apache.org/jira/browse/HBASE-3529 I think we may even
> have a reasonable solution for reading the index [randomly] out of
> HDFS. Benchmarking'll be implemented next. It's not production
> ready, suggestions are welcome.
>
> Jason
>
> On Fri, Mar 25, 2011 at 2:03 PM, Dmitry Kan <dmitry....@gmail.com> wrote:
> > Hi Otis,
> >
> > Thanks for elaborating on this and the link (funny!).
> >
> > I have quite a big dataset growing all the time. The problems that I start
> > facing are pretty much predictable:
> > 1. Scalability: this includes indexing time (now some days!, better hours or
> > even minutes, if that's possible) along with handling the rapid growth
> > 2. Robustness: the entire system (distributed or single server or anything
> > else) should be fault-tolerant, e.g. if one shard goes down, other catches
> > up (master-slave scheme)
> > 3. Some apps that we run on SOLR are pretty computationally demanding, like
> > faceting over one+bi+trigrams of hundreds of millions of documents (index
> > size of half a TB) ---> single server with a shard of data does not seem to
> > be enough for realtime search.
> >
> > This is just for a bit of a background. I agree with you on that hadoop and
> > cloud probably best suit massive batch processes rather than realtime
> > search. I'm sure, if anyone out there made SOLR shine through the cloud for
> > realtime search over large datasets.
> >
> > By "SOLR on the cloud (e.g. HDFS + MR + cloud of
> > commodity machines)" I mean what you've done for your customers using EC2.
> > Any chance, the guidelines/articles for/on setting indices on HDFS are
> > available in some open / paid area?
> >
> > To sum this up, I didn't mean to create a buzz on the cloud solutions in
> > this thread, just was wondering what is practically available / going on in
> > SOLR development in this regard.
> >
> > Thanks,
> >
> > Dmitry
> >
> > On Fri, Mar 25, 2011 at 10:28 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> >
> >> Hi Dan,
> >>
> >> This feels a bit like a buzzword soup.... with mushrooms. :)
> >>
> >> MR jobs, at least the ones in Hadoopland, are very batch oriented, so that
> >> wouldn't be very suitable for most search applications. There are some
> >> technologies like Riak that combine MR and search. Let me use this funny
> >> little link: http://lmgtfy.com/?q=riak%20mapreduce%20search
> >>
> >> Sure, you can put indices on HDFS (but don't expect searches to be fast).
> >> Sure you can create indices using MapReduce, we've done that successfully for
> >> customers bringing long indexing jobs from many hours to minutes by using,
> >> yes, a cluster of machines (actually EC2 instances).
> >> But when you say "more into SOLR on the cloud (e.g. HDFS + MR + cloud of
> >> commodity machines)", I can't actually picture what precisely you mean...
> >>
> >> Otis
> >> ---
> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> >> Lucene ecosystem search :: http://search-lucene.com/
> >>
> >> ----- Original Message ----
> >> > From: Dmitry Kan <dmitry....@gmail.com>
> >> > To: solr-user@lucene.apache.org
> >> > Cc: Upayavira <u...@odoko.co.uk>
> >> > Sent: Fri, March 25, 2011 8:26:33 AM
> >> > Subject: Re: solr on the cloud
> >> >
> >> > Hi, Upayavira
> >> >
> >> > Probably I'm confusing the terms here. When I say "distributed faceting" I'm
> >> > more into SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)
> >> > rather than into traditional multicore/sharded SOLR on a single or multiple
> >> > servers with non-distributed file systems (is that what you mean when you
> >> > refer to "distribution of facet requests across hosts"?)
> >> >
> >> > On Fri, Mar 25, 2011 at 1:57 PM, Upayavira <u...@odoko.co.uk> wrote:
> >> >
> >> > > On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" <dmitry....@gmail.com>
> >> > > wrote:
> >> > > > Hi Yonik,
> >> > > >
> >> > > > Oh, this is great. Is distributed faceting available in the trunk? What
> >> > > > is the basic server setup needed for trying this out, is it cloud with
> >> > > > HDFS and SOLR with ZooKeepers?
> >> > > > Any chance to see the related documentation? :)
> >> > >
> >> > > Distributed faceting has been available for a long time, and is
> >> > > available in the 1.4.1 release.
> >> > >
> >> > > The distribution of facet requests across hosts happens in the
> >> > > background. There's no real difference (in query syntax) between a
> >> > > standard facet query and a distributed one.
> >> > >
> >> > > i.e. you don't need SolrCloud nor Zookeeper for it. (they may provide
> >> > > other benefits, but you don't need them for distributed faceting).
> >> > >
> >> > > Upayavira
> >> > >
> >> > > > On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley
> >> > > > <yo...@lucidimagination.com> wrote:
> >> > > >
> >> > > > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan <dmitry....@gmail.com>
> >> > > > > wrote:
> >> > > > > > Basically, of high interest is checking out the Map-Reduce for
> >> > > > > > distributed faceting, is it even possible with the trunk?
> >> > > > >
> >> > > > > Solr already has distributed faceting, and it's much more performant
> >> > > > > than a map-reduce implementation would be.
> >> > > > >
> >> > > > > I've also seen a product use the term "map reduce" incorrectly... as in,
> >> > > > > we "map" the request to each shard, and then "reduce" the results to a
> >> > > > > single list (of course, that's not actually map-reduce at all ;-)
> >> > > > >
> >> > > >
> >> > > > :) this sounds pretty strange to me as well.
> >> > > > It was only my guess, that if
> >> > > > you have MR as computational model and a cloud beneath it, you could
> >> > > > naturally map facet fields to their counts inside single documents (no
> >> > > > matter, where they are, be it shards or "single" index) and pass them
> >> > > > onto reducers.
> >> > > >
> >> > > > > -Yonik
> >> > > > > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> >> > > > > 25-26, San Francisco
> >> > > > >
> >> > > >
> >> > > > --
> >> > > > Regards,
> >> > > >
> >> > > > Dmitry Kan
> >> > > >
> >> > > ---
> >> > > Enterprise Search Consultant at Sourcesense UK,
> >> > > Making Sense of Open Source
> >> > >
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Dmitry Kan
> >> >
> >>
> >
>

--
Regards,

Dmitry Kan
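
For reference, Upayavira's point in the quoted thread (a distributed facet query uses the same syntax as a standard one, with no SolrCloud or ZooKeeper needed) can be sketched as follows. In Solr's distributed search the request carries an extra `shards` parameter listing the cores to query, and the facet parameters stay the same. This is only a minimal sketch: the host names, the `category` field, and the helper name `distributed_facet_url` are hypothetical and not taken from the thread.

```python
from urllib.parse import urlencode


def distributed_facet_url(base_url, shards, facet_field, query="*:*"):
    """Build a Solr select URL that facets on one field across several shards.

    base_url    -- the Solr core that receives the request (hypothetical host)
    shards      -- list of "host:port/path" strings, one per shard (hypothetical)
    facet_field -- name of the field to facet on (hypothetical)
    """
    params = {
        "q": query,
        "rows": 0,                    # only facet counts are wanted, no documents
        "facet": "true",
        "facet.field": facet_field,
        "shards": ",".join(shards),   # the only addition over a single-server query
    }
    return base_url + "/select?" + urlencode(params)


url = distributed_facet_url(
    "http://host1:8983/solr",
    ["host1:8983/solr", "host2:8983/solr"],
    "category",
)
print(url)
```

Dropping the `shards` entry from `params` gives the ordinary single-server facet request, which is the sense in which the two queries do not differ in syntax.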