Thanks Otis. I really appreciate the details offered here. This was very helpful information.
I'm going to go through Solandra and Elastic Search and see if those make sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two recommendations for SolrCloud so far), so I will give that a shot when it is available. However, do you know when SolrCloud IS expected to be available?

Thanks again!

Warm regards,
Safdar

On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Hello Ali,
>
> > I'm trying to setup a large scale *Crawl + Index + Search* infrastructure
> > using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> > crawled + indexed every *4 weeks*, with a search latency of less than 0.5
> > seconds.
>
> That's fine. Whether it's doable with any tech will depend on how much
> hardware you give it, among other things.
>
> > Needless to mention, the search index needs to scale to 5Billion pages. It
> > is also possible that I might need to store multiple indexes -- one for
> > crawled content, and one for ancillary data that is also very large. Each
> > of these indices would likely require a logically distributed and
> > replicated index.
>
> Yup, OK.
>
> > However, I would like for such a system to be homogenous with the Hadoop
> > infrastructure that is already installed on the cluster (for the crawl). In
> > other words, I would much prefer if the replication and distribution of the
> > Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> > using another scalability framework (such as SolrCloud). In addition, it
> > would be ideal if this environment was flexible enough to be dynamically
> > scaled based on the size requirements of the index and the search traffic
> > at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
> > enough to automatically provision additional processing power into the
> > cluster without requiring server re-starts).
>
> There is no such thing just yet.
> There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to
> automatically index HBase content, but that was either not completed or not
> committed into HBase.
>
> > However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> > be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> > Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
> > mature enough and would be the right architectural choice to go along with
> > a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
> > above.
>
> Here is a summary of all of them:
> * Search on HBase - I assume you are referring to the same thing I
>   mentioned above. Not ready.
> * Solandra - uses Cassandra+Solr, plus DataStax now has a different
>   (commercial) offering that combines search and Cassandra. Looks good.
> * Lily - data stored in HBase cluster gets indexed to a separate Solr
>   instance(s) on the side. Not really integrated the way you want it to be.
> * ElasticSearch - solid at this point, the most dynamic solution today,
>   can scale well (we are working on a maaaany-B documents index and hundreds
>   of nodes with ElasticSearch right now), etc. But again, not integrated
>   with Hadoop the way you want it.
> * IndexTank - has some technical weaknesses, not integrated with Hadoop,
>   not sure about its future considering LinkedIn uses Zoie and Sensei already.
> * And there is SolrCloud, which is coming soon and will be solid, but is
>   again not integrated.
>
> If I were you and I had to pick today - I'd pick ElasticSearch if I were
> completely open. If I had a Solr bias I'd give SolrCloud a try first.
>
> > Lastly, how much hardware (assuming a medium sized EC2 instance) would you
> > estimate my needing with this setup, for regular web-data (HTML text) at
> > this scale?
>
> I don't know off the top of my head, but I'm guessing several hundred
> for serving search requests.
>
> HTH,
>
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Scalable Performance Monitoring - http://sematext.com/spm/index.html
>
> > Any architectural guidance would be greatly appreciated. The more details
> > provided, the wider my grin :).
> >
> > Many many thanks in advance.
> >
> > Thanks,
> > Safdar