Re: solr on the cloud

Dmitry Kan Fri, 25 Mar 2011 14:03:58 -0700

Hi Otis,

Thanks for elaborating on this and the link (funny!).


I have quite a big dataset that is growing all the time. The problems I am
starting to face are pretty predictable:
1. Scalability: this includes indexing time (currently some days; hours or
even minutes would be better, if that's possible) along with handling the
rapid growth.
2. Robustness: the entire system (distributed or single server or anything
else) should be fault-tolerant, e.g. if one shard goes down, another one
takes over (master-slave scheme).
3. Some apps that we run on SOLR are pretty computationally demanding, like
faceting over uni-, bi- and trigrams of hundreds of millions of documents
(index size of half a TB), so a single server with a shard of data does not
seem to be enough for realtime search.

This is just a bit of background. I agree with you that Hadoop and the
cloud probably suit massive batch processes better than realtime search. I
wonder whether anyone out there has made SOLR shine through the cloud for
realtime search over large datasets.

By "SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)" I mean
what you've done for your customers using EC2. Is there any chance that
guidelines or articles on setting up indices on HDFS are available in some
open or paid area?

To sum this up, I didn't mean to create buzz about cloud solutions in this
thread; I was just wondering what is practically available or in progress in
SOLR development in this regard.
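
For reference, regarding point 3 above and the distributed faceting that
Upayavira mentions below: as far as I understand it, a distributed facet
request is an ordinary facet request plus the `shards` parameter. A minimal
sketch follows; the host names, the field name "trigrams" and the helper
function are made up for illustration and are not my actual setup.

```python
from urllib.parse import urlencode


def distributed_facet_url(base_url, shards, query, facet_field, limit=10):
    """Build a Solr select URL that facets across several shards.

    base_url    -- the Solr instance that receives the request (hypothetical)
    shards      -- list of "host:port/path" strings, one per shard (hypothetical)
    facet_field -- field to facet on (hypothetical name)
    """
    params = {
        "q": query,
        # The shards parameter is what makes the request distributed;
        # the receiving node queries each shard and merges the counts.
        "shards": ",".join(shards),
        "facet": "true",
        "facet.field": facet_field,
        "facet.limit": limit,
        "rows": 0,  # only the facet counts are wanted, not the documents
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)


url = distributed_facet_url(
    "http://shard1.example.com:8983/solr",
    ["shard1.example.com:8983/solr", "shard2.example.com:8983/solr"],
    "*:*",
    "trigrams",
)
print(url)
```

Apart from `shards`, the parameters are the same as for a single-index facet
query, which matches what was said in the thread about the query syntax.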

Thanks,

Dmitry


On Fri, Mar 25, 2011 at 10:28 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hi Dan,
>
> This feels a bit like a buzzword soup.... with mushrooms. :)
>
> MR jobs, at least the ones in Hadoopland, are very batch oriented, so that
> wouldn't be very suitable for most search applications.  There are some
> technologies like Riak that combine MR and search.  Let me use this funny
> little link: http://lmgtfy.com/?q=riak%20mapreduce%20search
>
>
> Sure, you can put indices on HDFS (but don't expect searches to be fast).
> Sure you can create indices using MapReduce, we've done that successfully for
> customers bringing long indexing jobs from many hours to minutes by using,
> yes, a cluster of machines (actually EC2 instances).
> But when you say "more into SOLR on the cloud (e.g. HDFS + MR +  cloud of
> commodity machines)", I can't actually picture what precisely you mean...
>
>
> Otis
> ---
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message ----
> > From: Dmitry Kan <dmitry....@gmail.com>
> > To: solr-user@lucene.apache.org
> > Cc: Upayavira <u...@odoko.co.uk>
> > Sent: Fri, March 25, 2011 8:26:33 AM
> > Subject: Re: solr on the cloud
> >
> > Hi, Upayavira
> >
> > > Probably I'm confusing the terms here. When I say "distributed faceting"
> > > I'm more into SOLR on the cloud (e.g. HDFS + MR + cloud of commodity
> > > machines) rather than into traditional multicore/sharded SOLR on a single
> > > or multiple servers with non-distributed file systems (is that what you
> > > mean when you refer to "distribution of facet requests across hosts"?)
> >
> > On Fri, Mar 25, 2011 at 1:57 PM, Upayavira <u...@odoko.co.uk>  wrote:
> >
> > >
> > >
> > > On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan"  <dmitry....@gmail.com>
> > >  wrote:
> > > > Hi Yonik,
> > > >
> > > > > Oh, this is great. Is distributed faceting available in the trunk?
> > > > > What is the basic server setup needed for trying this out, is it
> > > > > cloud with HDFS and SOLR with zookeepers?
> > > > > Any chance to see the related documentation? :)
> > >
> > > Distributed faceting has been  available for a long time, and is
> > > available in the 1.4.1  release.
> > >
> > > The distribution of facet requests across hosts happens  in the
> > > background. There's no real difference (in query syntax) between  a
> > > standard facet query and a distributed one.
> > >
> > > i.e. you  don't need SolrCloud nor Zookeeper for it. (they may provide
> > > other  benefits, but you don't need them for distributed faceting).
> > >
> > >  Upayavira
> > >
> > > > On Fri, Mar 25, 2011 at 1:35 PM, Yonik  Seeley
> > > > <yo...@lucidimagination.com>wrote:
> > >  >
> > > > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan <dmitry....@gmail.com>
> > >  wrote:
> > > > > > Basically, of high interest is checking out the  Map-Reduce for
> > > > > distributed
> > > > > > faceting, is  it even possible with the trunk?
> > > > >
> > > > > Solr  already has distributed faceting, and it's much more
> performant
> > > >  > than a map-reduce implementation would be.
> > > > >
> > > >  > I've also seen a product use the term "map reduce" incorrectly...
>  as
> > > in,
> > > > > we "map" the request to each shard, and then  "reduce" the results
> to a
> > > > > single list (of course, that's not  actually map-reduce at all ;-)
> > > > >
> > > > >
> > > > > :) this sounds pretty strange to me as well. It was only my guess, that
> > > > > if you have MR as computational model and a cloud beneath it, you could
> > > > > naturally map facet fields to their counts inside single documents (no
> > > > > matter, where they are, be it shards or "single" index) and pass them
> > > > > onto reducers.
> > > > >
> > > > > > -Yonik
> > > > > > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > > > > > 25-26, San Francisco
> > > > >
> > > >
> > >  >
> > > >
> > > > --
> > > > Regards,
> > > >
> > >  > Dmitry Kan
> > > >
> > > ---
> > > Enterprise Search Consultant at  Sourcesense UK,
> > > Making Sense of Open  Source
> > >
> > >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>
