Thanks Darren. Actually, I would like the system to be homogeneous, i.e., to use Hadoop-based tools that already provide all the necessary scaling for the Lucene index (in terms of throughput, read/write latency, etc.). Since SolrCloud adds its own layer of sharding/replication outside Hadoop, I feel that using it would be redundant, and a step in the opposite direction from what I'm trying to achieve in the first place. Or am I mistaken?

Thanks,
Safdar

On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni <dar...@ontrenet.com> wrote:

> You could use SolrCloud (for the automatic scaling) and just mount a
> fuse[1] HDFS directory and configure solr to use that directory for its
> data.
>
> [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
>
> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
> > Hi,
> >
> > I'm trying to setup a large scale *Crawl + Index + Search*
> > infrastructure using Nutch and Solr/Lucene. The targeted scale is
> > *5 Billion web pages*, crawled + indexed every *4 weeks*, with a
> > search latency of less than 0.5 seconds.
> >
> > Needless to mention, the search index needs to scale to 5 Billion
> > pages. It is also possible that I might need to store multiple
> > indexes -- one for crawled content, and one for ancillary data that
> > is also very large. Each of these indices would likely require a
> > logically distributed and replicated index.
> >
> > However, I would like for such a system to be homogenous with the
> > Hadoop infrastructure that is already installed on the cluster (for
> > the crawl). In other words, I would much prefer if the replication
> > and distribution of the Solr/Lucene index be done automagically on
> > top of Hadoop/HDFS, instead of using another scalability framework
> > (such as SolrCloud). In addition, it would be ideal if this
> > environment was flexible enough to be dynamically scaled based on the
> > size requirements of the index and the search traffic at the time
> > (i.e. if it is deployed on an Amazon cluster, it should be easy
> > enough to automatically provision additional processing power into
> > the cluster without requiring server re-starts).
> >
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem
> > would be ideal for this scenario. I've heard mention of
> > Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm
> > really unsure which of these is mature enough and would be the right
> > architectural choice to go along with a Nutch crawler setup, and to
> > also satisfy the dynamic/auto-scaling aspects above.
> >
> > Lastly, how much hardware (assuming a medium sized EC2 instance)
> > would you estimate my needing with this setup, for regular web-data
> > (HTML text) at this scale?
> >
> > Any architectural guidance would be greatly appreciated. The more
> > details provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar
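
For a rough sense of the crawl side of the hardware question, the only figures given in the thread (5 billion pages, re-crawled every 4 weeks) already fix a minimum sustained fetch rate. The sketch below is just that arithmetic; the function name is made up for illustration, and it says nothing about index size or search hardware, which depend on numbers the thread does not give.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_crawl_rate(total_pages: int, cycle_days: int) -> float:
    """Sustained pages/second needed to fetch total_pages once per cycle.

    Hypothetical helper for illustration; assumes the crawl runs
    continuously for the whole cycle with no downtime.
    """
    return total_pages / (cycle_days * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Figures from the original question: 5 billion pages every 4 weeks.
    rate = required_crawl_rate(5_000_000_000, 28)
    print(f"about {rate:.0f} pages/second, sustained")
```

Any downtime, politeness delays, or re-fetching of failed pages would push the required peak rate above this average.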