Thanks, Jason, this looks very relevant!

On Fri, Mar 25, 2011 at 11:26 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> Dmitry,
>
> If you're planning on using HBase you can take a look at
> https://issues.apache.org/jira/browse/HBASE-3529  I think we may even
> have a reasonable solution for reading the index [randomly] out of
> HDFS.  Benchmarking'll be implemented next.  It's not production
> ready, suggestions are welcome.
>
> Jason
>
> On Fri, Mar 25, 2011 at 2:03 PM, Dmitry Kan <dmitry....@gmail.com> wrote:
> > Hi Otis,
> >
> > Thanks for elaborating on this and the link (funny!).
> >
> > I have quite a big dataset that is growing all the time. The problems I am
> > starting to face are fairly predictable:
> > 1. Scalability: this includes indexing time (currently some days; hours or
> > even minutes would be better, if possible) along with handling the rapid
> > growth.
> > 2. Robustness: the entire system (distributed, single server or anything
> > else) should be fault-tolerant, e.g. if one shard goes down, another takes
> > over (master-slave scheme).
> > 3. Some apps that we run on SOLR are computationally demanding, like
> > faceting over uni-, bi- and trigrams of hundreds of millions of documents
> > (index size of half a TB), so a single server with a shard of data does not
> > seem to be enough for realtime search.
> >
> > This is just for a bit of background. I agree with you that Hadoop and the
> > cloud probably suit massive batch processes better than realtime search.
> > I'm not sure whether anyone out there has made SOLR shine through the cloud
> > for realtime search over large datasets.
> >
> > By "SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)" I
> > mean what you've done for your customers using EC2. Is there any chance
> > that guidelines or articles on setting up indices on HDFS are available in
> > some open or paid area?
> >
> > To sum this up, I didn't mean to create a buzz about cloud solutions in
> > this thread; I was just wondering what is practically available or going
> > on in SOLR development in this regard.
> >
> > Thanks,
> >
> > Dmitry
> >
> >
> > On Fri, Mar 25, 2011 at 10:28 PM, Otis Gospodnetic <
> > otis_gospodne...@yahoo.com> wrote:
> >
> >> Hi Dan,
> >>
> >> This feels a bit like a buzzword soup.... with mushrooms. :)
> >>
> >> MR jobs, at least the ones in Hadoopland, are very batch oriented, so that
> >> wouldn't be very suitable for most search applications.  There are some
> >> technologies like Riak that combine MR and search.  Let me use this funny
> >> little link: http://lmgtfy.com/?q=riak%20mapreduce%20search
> >>
> >>
> >> Sure, you can put indices on HDFS (but don't expect searches to be fast).
> >> Sure, you can create indices using MapReduce; we've done that successfully
> >> for customers, bringing long indexing jobs from many hours down to minutes
> >> by using, yes, a cluster of machines (actually EC2 instances).
> >> But when you say "more into SOLR on the cloud (e.g. HDFS + MR + cloud of
> >> commodity machines)", I can't actually picture what precisely you mean...
> >>
> >>
> >> Otis
> >> ---
> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> >> Lucene ecosystem search :: http://search-lucene.com/
> >>
> >>
> >>
> >> ----- Original Message ----
> >> > From: Dmitry Kan <dmitry....@gmail.com>
> >> > To: solr-user@lucene.apache.org
> >> > Cc: Upayavira <u...@odoko.co.uk>
> >> > Sent: Fri, March 25, 2011 8:26:33 AM
> >> > Subject: Re: solr on the cloud
> >> >
> >> > Hi, Upayavira
> >> >
> >> > Probably I'm confusing the terms here. When I say "distributed faceting"
> >> > I'm more into SOLR on the cloud (e.g. HDFS + MR + cloud of commodity
> >> > machines) rather than traditional multicore/sharded SOLR on a single or
> >> > multiple servers with non-distributed file systems (is that what you
> >> > mean when you refer to "distribution of facet requests across hosts"?)
> >> >
> >> > On Fri, Mar 25, 2011 at 1:57 PM, Upayavira <u...@odoko.co.uk>  wrote:
> >> >
> >> > >
> >> > >
> >> > > On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" <dmitry....@gmail.com>
> >> > > wrote:
> >> > > > Hi Yonik,
> >> > > >
> >> > > > Oh, this is great. Is distributed faceting available in the trunk?
> >> > > > What is the basic server setup needed for trying this out: is it a
> >> > > > cloud with HDFS, and SOLR with ZooKeepers?
> >> > > > Any chance to see the related documentation? :)
> >> > >
> >> > > Distributed faceting has been available for a long time, and is
> >> > > available in the 1.4.1 release.
> >> > >
> >> > > The distribution of facet requests across hosts happens in the
> >> > > background. There's no real difference (in query syntax) between a
> >> > > standard facet query and a distributed one.
> >> > >
> >> > > i.e. you don't need SolrCloud or ZooKeeper for it (they may provide
> >> > > other benefits, but you don't need them for distributed faceting).
> >> > >
> >> > >  Upayavira
> >> > >
> >> > > > On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley
> >> > > > <yo...@lucidimagination.com> wrote:
> >> > > >
> >> > > > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan <dmitry....@gmail.com>
> >> > > > > wrote:
> >> > > > > > Basically, of high interest is checking out the Map-Reduce for
> >> > > > > > distributed faceting, is it even possible with the trunk?
> >> > > > >
> >> > > > > Solr already has distributed faceting, and it's much more performant
> >> > > > > than a map-reduce implementation would be.
> >> > > > >
> >> > > > > I've also seen a product use the term "map reduce" incorrectly... as
> >> > > > > in, we "map" the request to each shard, and then "reduce" the results
> >> > > > > to a single list (of course, that's not actually map-reduce at all ;-)
> >> > > > >
> >> > > > >
> >> > > > :) this sounds pretty strange to me as well. It was only my guess that
> >> > > > if you have MR as the computational model and a cloud beneath it, you
> >> > > > could naturally map facet fields to their counts inside single
> >> > > > documents (no matter where they are, be it shards or a "single" index)
> >> > > > and pass them on to reducers.
> >> > > >
> >> > > >
> >> > > > > -Yonik
> >> > > > > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> >> > > > > 25-26, San Francisco
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Regards,
> >> > > >
> >> > > > Dmitry Kan
> >> > > >
> >> > > ---
> >> > > Enterprise Search Consultant at Sourcesense UK,
> >> > > Making Sense of Open Source
> >> > >
> >> > >
> >> >
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Dmitry Kan
> >> >
> >>
> >
>



-- 
Regards,

Dmitry Kan
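
For readers following the thread: the point made above, that a distributed
facet query uses the same syntax as a standard one and that per-shard results
are simply combined into a single list, can be sketched roughly as follows.
This is only an illustration, not Solr's actual implementation. The `shards`,
`facet` and `facet.field` request parameters are standard Solr parameters; the
host names, the `category` field and both helper functions are hypothetical,
and the merge step ignores the refinement Solr performs for top-N facets.

```python
from collections import Counter
from urllib.parse import urlencode


def build_distributed_facet_url(base_url, shards, field, query="*:*"):
    """Build a facet query URL; adding `shards` is what makes it distributed.

    The host names passed in `shards` are placeholders for illustration.
    """
    params = {
        "q": query,
        "rows": 0,                    # only facet counts are wanted, no documents
        "facet": "true",
        "facet.field": field,
        "shards": ",".join(shards),   # omit this for a non-distributed query
    }
    return base_url + "/select?" + urlencode(params)


def merge_facet_counts(per_shard_counts, limit=10):
    """Sum facet counts returned by each shard and keep the top `limit` values.

    Simplified: a real implementation must also handle values that fall
    outside a shard's own top-N list.
    """
    total = Counter()
    for counts in per_shard_counts:
        total.update(counts)
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


url = build_distributed_facet_url(
    "http://host1:8983/solr",
    ["host1:8983/solr", "host2:8983/solr"],
    "category",
)
merged = merge_facet_counts([{"a": 2, "b": 1}, {"a": 1, "c": 5}])
```

Apart from the `shards` parameter, the request is identical to a single-server
facet query, so no MapReduce or ZooKeeper setup is involved in this sketch.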
