SolrCloud, or any other technology-specific replication, isn't going to
'just work' with Hadoop replication. But with some significant custom
coding, anything should be possible. Interesting idea.
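
For reference, the fuse-mount approach quoted below would look roughly like
this (a rough sketch only, untested; the mount point, NameNode host/port and
Solr data path are my own placeholders, and hadoop-fuse-dfs is the CDH
package described at the link in [1] below):

    # mount HDFS through FUSE (assumes a NameNode listening at namenode:8020)
    mkdir -p /mnt/hdfs
    hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs

    # then point Solr's index at the mounted path, e.g. in solrconfig.xml:
    #   <dataDir>/mnt/hdfs/solr/data</dataDir>

That keeps the index data on HDFS, though whether Lucene performs acceptably
through a FUSE mount is part of the custom work mentioned above.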

------- Original Message -------
On 4/12/2012 09:21 AM Ali S Kureishy wrote:

Thanks Darren.

Actually, I would like the system to be homogeneous - i.e., use Hadoop-based
tools that already provide all the necessary scaling for the Lucene index
(in terms of throughput, latency of writes/reads, etc.). Since SolrCloud adds
its own layer of sharding/replication that is outside Hadoop, I feel that
using SolrCloud would be redundant, and a step in the opposite direction,
which is what I'm trying to avoid in the first place. Or am I mistaken?

Thanks,
Safdar


On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni <dar...@ontrenet.com> wrote:

> You could use SolrCloud (for the automatic scaling) and just mount a
> FUSE [1] HDFS directory and configure Solr to use that directory for its
> data.
>
> [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
>
> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
> > Hi,
> >
> > I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
> > using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
> > crawled + indexed every *4 weeks*, with a search latency of less than 0.5
> > seconds.
> >
> > Needless to mention, the search index needs to scale to 5 billion pages.
> > It is also possible that I might need to store multiple indexes -- one for
> > crawled content, and one for ancillary data that is also very large. Each
> > of these indices would likely require a logically distributed and
> > replicated index.
> >
> > However, I would like such a system to be homogeneous with the Hadoop
> > infrastructure that is already installed on the cluster (for the crawl).
> > In other words, I would much prefer that the replication and distribution
> > of the Solr/Lucene index be done automagically on top of Hadoop/HDFS,
> > instead of using another scalability framework (such as SolrCloud). In
> > addition, it would be ideal if this environment were flexible enough to be
> > dynamically scaled based on the size requirements of the index and the
> > search traffic at the time (i.e., if it is deployed on an Amazon cluster,
> > it should be easy enough to automatically provision additional processing
> > power into the cluster without requiring server restarts).
> >
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> > be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> > Lily, ElasticSearch, IndexTank, etc., but I'm really unsure which of these
> > is mature enough and would be the right architectural choice to go along
> > with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
> > aspects above.
> >
> > Lastly, how much hardware (assuming a medium-sized EC2 instance) would you
> > estimate I'd need with this setup, for regular web data (HTML text) at
> > this scale?
> >
> > Any architectural guidance would be greatly appreciated. The more details
> > provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar
>
>
>
