Thanks Otis.

I really appreciate the details offered here. This was very helpful
information.

I'm going to go through Solandra and ElasticSearch and see if those make
sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's
two recommendations for SolrCloud so far), so I will give that a shot when
it is available. However, do you know when SolrCloud IS expected to be
available?

Thanks again!

Warm regards,
Safdar



On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello Ali,
>
> > I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
> > using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
> > crawled + indexed every *4 weeks*, with a search latency of less than 0.5
> > seconds.
>
>
> That's fine.  Whether it's doable with any tech will depend on how much
> hardware you give it, among other things.
>
> > Needless to mention, the search index needs to scale to 5 billion pages. It
> > is also possible that I might need to store multiple indexes -- one for
> > crawled content, and one for ancillary data that is also very large. Each
> > of these indices would likely require a logically distributed and
> > replicated index.
>
>
> Yup, OK.
>
> > However, I would like for such a system to be homogeneous with the Hadoop
> > infrastructure that is already installed on the cluster (for the crawl). In
> > other words, I would much prefer if the replication and distribution of the
> > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> > using another scalability framework (such as SolrCloud). In addition, it
> > would be ideal if this environment was flexible enough to be dynamically
> > scaled based on the size requirements of the index and the search traffic
> > at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
> > enough to automatically provision additional processing power into the
> > cluster without requiring server restarts).
>
>
> There is no such thing just yet: no Search+Hadoop/HDFS in a box.  There
> was an attempt to automatically index HBase content, but that was either
> not completed or not committed into HBase.
>
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> > be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
> > mature enough and would be the right architectural choice to go along with
> > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
> > above.
>
>
> Here is a summary on all of them:
> * Search on HBase - I assume you are referring to the same thing I
> mentioned above.  Not ready.
> * Solandra - uses Cassandra+Solr, plus DataStax now has a different
> (commercial) offering that combines search and Cassandra.  Looks good.
> * Lily - data stored in an HBase cluster gets indexed to separate Solr
> instance(s) on the side.  Not really integrated the way you want it to be.
> * ElasticSearch - solid at this point, the most dynamic solution today,
> can scale well (we are working on a maaaany-B documents index and hundreds
> of nodes with ElasticSearch right now), etc.  But again, not integrated
> with Hadoop the way you want it.
> * IndexTank - has some technical weaknesses, not integrated with Hadoop,
> not sure about its future considering LinkedIn uses Zoie and Sensei already.
> * And there is SolrCloud, which is coming soon and will be solid, but is
> again not integrated.
>
> If I were you and I had to pick today - I'd pick ElasticSearch if I were
> completely open.  If I had Solr bias I'd give SolrCloud a try first.
>
> > Lastly, how much hardware (assuming a medium sized EC2 instance) would you
> > estimate I would need with this setup, for regular web-data (HTML text) at
> > this scale?
>
> I don't know off the top of my head, but I'm guessing several hundred
> for serving search requests.
>
> HTH,
>
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
>
> Scalable Performance Monitoring - http://sematext.com/spm/index.html
>
>
> > Any architectural guidance would be greatly appreciated. The more details
> > provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar
> >
>
