SolrCloud's replication (or any other tech-specific replication) isn't going to "just work" on top of Hadoop's replication. But with some significant custom coding, anything should be possible. Interesting idea.
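To give a sense of what that custom coding usually looks like: the common workaround is to let HDFS replicate whole index directories -- build each shard locally in the indexing job, publish it into HDFS, and have the search nodes pull down a local copy, because stock Lucene can't serve queries straight out of HDFS without a custom Directory implementation. A rough, untested sketch (class name and paths are made up):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch only: lean on HDFS replication by shipping whole Lucene index
 * directories in and out of the cluster. Paths are illustrative.
 */
public class HdfsIndexShipper {

    private final FileSystem fs;

    public HdfsIndexShipper(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
    }

    /** Publish a locally built index (e.g. from a Nutch indexing job) into HDFS. */
    public void publish(String localIndexDir, String hdfsIndexDir) throws IOException {
        // delSrc=false keeps the local copy; overwrite=true replaces an older shard.
        fs.copyFromLocalFile(false, true, new Path(localIndexDir), new Path(hdfsIndexDir));
        // From here on, HDFS replicates the index files like any other blocks.
    }

    /** Pull a replicated copy down to a search node's local disk for serving. */
    public void fetch(String hdfsIndexDir, String localIndexDir) throws IOException {
        fs.copyToLocalFile(new Path(hdfsIndexDir), new Path(localIndexDir));
        // Point Solr/Lucene at localIndexDir; serving directly out of HDFS
        // would need a custom Lucene Directory implementation on top of this.
    }

    public static void main(String[] args) throws IOException {
        HdfsIndexShipper shipper = new HdfsIndexShipper(new Configuration());
        shipper.publish("/tmp/nutch-index", "/indexes/web/shard-000"); // example paths
        shipper.fetch("/indexes/web/shard-000", "/var/solr/data/index");
    }
}

That covers replication of the bits, but none of SolrCloud's live sharding, routing, or failover -- all of that would still be on you.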
------- Original Message -------
On 4/12/2012 09:21 AM Ali S Kureishy wrote:

Thanks Darren.

Actually, I would like the system to be homogeneous - i.e., to use
Hadoop-based tools that already provide all the necessary scaling for the
Lucene index (in terms of throughput, latency of writes/reads, etc.). Since
SolrCloud adds its own layer of sharding/replication that is outside Hadoop,
I feel that using SolrCloud would be redundant, and a step in the opposite
direction, which is what I'm trying to avoid in the first place. Or am I
mistaken?

Thanks,
Safdar


On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni <dar...@ontrenet.com> wrote:

> You could use SolrCloud (for the automatic scaling) and just mount a
> fuse[1] HDFS directory and configure Solr to use that directory for its
> data.
>
> [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
>
> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
> > Hi,
> >
> > I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
> > using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
> > crawled + indexed every *4 weeks*, with a search latency of less than
> > 0.5 seconds.
> >
> > Needless to say, the search index needs to scale to 5 billion pages. It
> > is also possible that I might need to store multiple indexes -- one for
> > crawled content, and one for ancillary data that is also very large. Each
> > of these indices would likely require a logically distributed and
> > replicated index.
> >
> > However, I would like such a system to be homogeneous with the Hadoop
> > infrastructure that is already installed on the cluster (for the crawl).
> > In other words, I would much prefer that the replication and distribution
> > of the Solr/Lucene index be done automagically on top of Hadoop/HDFS,
> > instead of using another scalability framework (such as SolrCloud). In
> > addition, it would be ideal if this environment were flexible enough to
> > be dynamically scaled based on the size requirements of the index and the
> > search traffic at the time (i.e., if it is deployed on an Amazon cluster,
> > it should be easy enough to automatically provision additional processing
> > power into the cluster without requiring server restarts).
> >
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> > be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> > Lily, ElasticSearch, IndexTank, etc., but I'm really unsure which of these
> > is mature enough and would be the right architectural choice to go along
> > with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
> > aspects above.
> >
> > Lastly, how much hardware (assuming a medium-sized EC2 instance) would
> > you estimate I need with this setup, for regular web data (HTML text) at
> > this scale?
> >
> > Any architectural guidance would be greatly appreciated. The more details
> > provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar
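P.S. On the fuse-mount suggestion quoted above: once HDFS is mounted through hadoop-fuse-dfs (say at /mnt/hdfs -- an example mount point, not a standard one), Solr and Lucene just see an ordinary local directory. A quick, untested sanity check against the Lucene 3.x API:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Sketch only: open a Lucene index through a fuse-mounted HDFS path. */
public class FuseMountCheck {
    public static void main(String[] args) throws IOException {
        // /mnt/hdfs/... is a made-up path; use wherever you mounted HDFS.
        Directory dir = FSDirectory.open(new File("/mnt/hdfs/solr/data/index"));
        IndexReader reader = IndexReader.open(dir);
        try {
            System.out.println("numDocs = " + reader.numDocs());
        } finally {
            reader.close();
            dir.close();
        }
    }
}

Bear in mind the fuse layer adds overhead on every read, so at 5 billion pages you'd want to benchmark it before committing to that route.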