Hi,

Serving the index from a FUSE-mounted HDFS directory won't give you the
performance you need unless you have enough RAM on the Solr box to cache
the whole index in memory.
Have you tested this yourself?
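
A quick back-of-envelope check of the RAM requirement can be sketched as
below. This is only an illustration, not a measured sizing: the function
name, the per-document index size, the RAM per node and the fraction of
RAM left for the OS page cache are all assumptions you would replace with
numbers from your own test index. Only the 5 billion document count comes
from this thread.

```python
import math


def nodes_to_cache_index(num_docs, index_bytes_per_doc, ram_bytes_per_node,
                         cache_fraction=0.5, replicas=1):
    """Estimate how many nodes are needed to hold the whole index in RAM.

    All inputs are assumptions supplied by the caller; nothing here is
    measured from a real Solr installation.
    """
    if num_docs <= 0 or index_bytes_per_doc <= 0 or ram_bytes_per_node <= 0:
        raise ValueError("sizes and counts must be positive")
    if not 0 < cache_fraction <= 1:
        raise ValueError("cache_fraction must be in (0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")

    # Total bytes on disk across all replicas of the index.
    total_index_bytes = num_docs * index_bytes_per_doc * replicas
    # RAM per node that is realistically left for caching index files.
    usable_cache_bytes = ram_bytes_per_node * cache_fraction
    return math.ceil(total_index_bytes / usable_cache_bytes)


if __name__ == "__main__":
    nodes = nodes_to_cache_index(
        num_docs=5_000_000_000,         # from the original question
        index_bytes_per_doc=2_000,      # ASSUMPTION: measure on a sample index
        ram_bytes_per_node=4 * 2**30,   # ASSUMPTION: 4 GiB per node
        cache_fraction=0.5,             # ASSUMPTION: half of RAM for page cache
    )
    print("Estimated nodes to cache the full index:", nodes)
```

Index a representative sample of the crawl first, divide the resulting
index size by the number of documents, and plug that in for
index_bytes_per_doc before trusting the result.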

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 12. apr. 2012, at 15:27, Darren Govoni wrote:

> You could use SolrCloud (for the automatic scaling) and just mount a
> FUSE[1] HDFS directory and configure Solr to use that directory for its
> data.
> 
> [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
> 
> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
>> Hi,
>> 
>> I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
>> using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
>> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
>> seconds.
>> 
>> Needless to mention, the search index needs to scale to 5 billion pages. It
>> is also possible that I might need to store multiple indexes -- one for
>> crawled content, and one for ancillary data that is also very large. Each
>> of these indices would likely require a logically distributed and
>> replicated index.
>> 
>> However, I would like such a system to be homogeneous with the Hadoop
>> infrastructure that is already installed on the cluster (for the crawl). In
>> other words, I would much prefer that the replication and distribution of the
>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>> using another scalability framework (such as SolrCloud). In addition, it
>> would be ideal if this environment were flexible enough to be dynamically
>> scaled based on the size requirements of the index and the search traffic
>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>> enough to automatically provision additional processing power into the
>> cluster without requiring server restarts).
>> 
>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>> Lily, ElasticSearch, IndexTank etc., but I'm really unsure which of these is
>> mature enough and would be the right architectural choice to go along with
>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>> above.
>> 
>> Lastly, how much hardware (assuming medium-sized EC2 instances) would you
>> estimate I would need with this setup, for regular web data (HTML text) at
>> this scale?
>> 
>> Any architectural guidance would be greatly appreciated. The more details
>> provided, the wider my grin :).
>> 
>> Many many thanks in advance.
>> 
>> Thanks,
>> Safdar
> 
> 
