Thanks, Darren.

Actually, I would like the system to be homogeneous - i.e., to use
Hadoop-based tools that already provide all the necessary scaling for the
Lucene index (in terms of throughput, read/write latency, etc.). Since
SolrCloud adds its own layer of sharding/replication outside Hadoop, I feel
that using SolrCloud would be redundant - a step in the opposite direction,
which is what I'm trying to avoid in the first place. Or am I mistaken?

Thanks,
Safdar


On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni <dar...@ontrenet.com> wrote:

> You could use SolrCloud (for the automatic scaling) and just mount an
> HDFS directory via FUSE [1], then configure Solr to use that directory
> for its data.
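>
> For example, a minimal sketch (the host name, port, and paths here are
> placeholders; see [1] for the exact package names and options):
>
>   # create a mount point and mount HDFS on it via hadoop-fuse-dfs
>   mkdir -p /mnt/hdfs
>   hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs
>
>   # then point Solr's data directory at the mount, in solrconfig.xml:
>   <dataDir>/mnt/hdfs/solr/data</dataDir>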
>
> [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
>
> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
> > Hi,
> >
> > I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
> > using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
> > crawled + indexed every *4 weeks*, with a search latency of less than 0.5
> > seconds.
> >
> > Needless to say, the search index needs to scale to 5 billion pages. It
> > is also possible that I might need to store multiple indexes -- one for
> > crawled content, and one for ancillary data that is also very large. Each
> > of these indexes would likely need to be logically distributed and
> > replicated.
> >
> > However, I would like such a system to be homogeneous with the Hadoop
> > infrastructure that is already installed on the cluster (for the crawl).
> > In other words, I would much prefer that the replication and distribution
> > of the Solr/Lucene index be done automagically on top of Hadoop/HDFS,
> > instead of using another scalability framework (such as SolrCloud). In
> > addition, it would be ideal if this environment were flexible enough to
> > be scaled dynamically based on the size of the index and the search
> > traffic at the time (i.e., if it is deployed on an Amazon cluster, it
> > should be easy to automatically provision additional processing power
> > into the cluster without requiring server restarts).
> >
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> > be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> > Lily, ElasticSearch, IndexTank, etc., but I'm really unsure which of these
> > is mature enough and would be the right architectural choice to go along
> > with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
> > aspects above.
> >
> > Lastly, how much hardware (assuming a medium-sized EC2 instance) would
> > you estimate I would need for this setup, for regular web data (HTML
> > text) at this scale?
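> >
> > For a rough sense of scale (every number below is a back-of-envelope
> > assumption on my part, not a measurement): at ~10 KB of HTML text per
> > page, 5 billion pages comes to about 50 TB of raw content per crawl
> > cycle, and even if the Lucene index is only a fraction of that, spreading
> > a multi-terabyte index at a few hundred GB per node already implies
> > dozens of nodes before replication.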
> >
> > Any architectural guidance would be greatly appreciated. The more details
> > provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar
>
