You could use SolrCloud (for the automatic scaling) and just mount an HDFS directory via FUSE[1], then configure Solr to use that directory for its data.
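
As a rough sketch of what that could look like, the snippet below only builds and prints the two command lines involved; it does not run them. It assumes the hadoop-fuse-dfs wrapper described at [1] and the solr.data.dir system property that the example solrconfig.xml reads for its dataDir. The NameNode host, port, and mount point are made-up placeholders, so adjust them for your cluster.

```python
import shlex


def fuse_mount_command(namenode_host, namenode_port, mount_point):
    """Build the command that mounts HDFS at mount_point via hadoop-fuse-dfs."""
    return ["hadoop-fuse-dfs", f"dfs://{namenode_host}:{namenode_port}", mount_point]


def solr_start_command(data_dir):
    """Build the command that starts the example Solr (Jetty) server with its
    index directory set through the solr.data.dir system property."""
    return ["java", f"-Dsolr.data.dir={data_dir}", "-jar", "start.jar"]


if __name__ == "__main__":
    # Hypothetical values, not taken from the original thread.
    mount_point = "/mnt/hdfs"
    print(shlex.join(fuse_mount_command("namenode.example.com", 8020, mount_point)))
    print(shlex.join(solr_start_command(mount_point + "/solr/data")))
```

Each Solr node would need its own data directory under the mount, since two nodes should not write to the same index.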

[1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS

On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
> Hi,
>
> I'm trying to setup a large scale *Crawl + Index + Search* infrastructure
> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
> seconds.
>
> Needless to mention, the search index needs to scale to 5Billion pages. It
> is also possible that I might need to store multiple indexes -- one for
> crawled content, and one for ancillary data that is also very large. Each
> of these indices would likely require a logically distributed and
> replicated index.
>
> However, I would like for such a system to be homogenous with the Hadoop
> infrastructure that is already installed on the cluster (for the crawl). In
> other words, I would much prefer if the replication and distribution of the
> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> using another scalability framework (such as SolrCloud). In addition, it
> would be ideal if this environment was flexible enough to be dynamically
> scaled based on the size requirements of the index and the search traffic
> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
> enough to automatically provision additional processing power into the
> cluster without requiring server re-starts).
>
> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
> mature enough and would be the right architectural choice to go along with
> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
> above.
>
> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
> estimate my needing with this setup, for regular web-data (HTML text) at
> this scale?
>
> Any architectural guidance would be greatly appreciated. The more details
> provided, the wider my grin :).
>
> Many many thanks in advance.
>
> Thanks,
> Safdar