Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Otis Gospodnetic Sat, 14 Apr 2012 14:27:44 -0700

Hello,

Unfortunately I don't know exactly when the SolrCloud release will be ready, 
but we've used trunk versions in the past and didn't have major issues.


Otis 
----
Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html



>________________________________
> From: Ali S Kureishy <safdar.kurei...@gmail.com>
>To: Otis Gospodnetic <otis_gospodne...@yahoo.com> 
>Cc: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org> 
>Sent: Friday, April 13, 2012 7:16 PM
>Subject: Re: Options for automagically Scaling Solr (without needing 
>distributed index/replication) in a Hadoop environment
> 
>
>Thanks Otis.
>
>
>I really appreciate the details offered here. This was very helpful 
>information.
>
>
>I'm going to go through Solandra and Elastic Search and see if those make 
>sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two 
>recommendations for SolrCloud so far), so I will give that a shot when it is 
>available. However, do you know when SolrCloud IS expected to be available?
>
>
>Thanks again!
>
>
>Warm regards,
>Safdar
>
>
>
>
>
>On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> 
>wrote:
>
>>Hello Ali,
>>
>>
>>> I'm trying to set up a large scale *Crawl + Index + Search* infrastructure
>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>>> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
>>> seconds.
>>
>>
>>That's fine.  Whether it's doable with any tech will depend on how much 
>>hardware you give it, among other things.
>>
>>
>>> Needless to mention, the search index needs to scale to 5 Billion pages. It
>>> is also possible that I might need to store multiple indexes -- one for
>>> crawled content, and one for ancillary data that is also very large. Each
>>> of these indices would likely require a logically distributed and
>>> replicated index.
>>
>>
>>Yup, OK.
>>
>>
>>> However, I would like for such a system to be homogeneous with the Hadoop
>>> infrastructure that is already installed on the cluster (for the crawl). In
>>> other words, I would much prefer if the replication and distribution of the
>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>>> using another scalability framework (such as SolrCloud). In addition, it
>>> would be ideal if this environment was flexible enough to be dynamically
>>> scaled based on the size requirements of the index and the search traffic
>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>>> enough to automatically provision additional processing power into the
>>> cluster without requiring server re-starts).
>>
>>
>>There is no such thing just yet: no Search+Hadoop/HDFS in a box.  There was 
>>an attempt to automatically index HBase content, but that was either not 
>>completed or not committed into HBase.
>>
>>
>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>>> mature enough and would be the right architectural choice to go along with
>>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>>> above.
>>
>>
>>Here is a summary of all of them:
>>* Search on HBase - I assume you are referring to the same thing I mentioned 
>>above.  Not ready.
>>* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
>>(commercial) offering that combines search and Cassandra.  Looks good.
>>* Lily - data stored in an HBase cluster gets indexed to separate Solr 
>>instance(s) on the side.  Not really integrated the way you want it to be.
>>* ElasticSearch - solid at this point, the most dynamic solution today, can 
>>scale well (we are working on a maaaany-B documents index and hundreds of 
>>nodes with ElasticSearch right now), etc.  But again, not integrated with 
>>Hadoop the way you want it.
>>* IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
>>sure about its future considering LinkedIn uses Zoie and Sensei already.
>>* And there is SolrCloud, which is coming soon and will be solid, but is 
>>again not integrated.
>>
>>If I were you and I had to pick today - I'd pick ElasticSearch if I were 
>>completely open.  If I had Solr bias I'd give SolrCloud a try first.
>>
>>
>>> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
>>> estimate my needing with this setup, for regular web-data (HTML text) at
>>> this scale?
>>
>>I don't know off the top of my head, but I'm guessing several hundred 
>>instances for serving search requests.
>>
>>HTH,
>>
>>Otis
>>--
>>Search Analytics - http://sematext.com/search-analytics/index.html
>>
>>Scalable Performance Monitoring - http://sematext.com/spm/index.html
>>
>>
>>
>>> Any architectural guidance would be greatly appreciated. The more details
>>> provided, the wider my grin :).
>>>
>>> Many many thanks in advance.
>>>
>>> Thanks,
>>> Safdar
>>>
>>
>
>
>
