Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Lance Norskog Sat, 14 Apr 2012 16:11:37 -0700

It sounds like you really want the final map/reduce phase to put Solr
index files into HDFS. Solr has a feature to do this called 'Embedded
Solr'. This packages Solr as a library instead of an HTTP servlet. The
Solr committers mostly hate it and want it to go away, but it is
useful for exactly this problem.


There is some integration work here, both to bolt Embedded Solr to the
Hadoop output libraries and also some trickery to write out the HDFS
files. HDFS files are append-only, and most of the codecs (Lucene
segment formats) like to seek a lot. Then at the end it needs a way to
tell SolrCloud about the files.
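
One common way around the append-only problem is to build the index in
local scratch space, where seeking is free, and only then copy the
finished files out front to back. Below is a minimal sketch of that
pattern, not a real Solr or Hadoop integration: the function names are
made up for illustration, a plain directory stands in for HDFS, and a
toy file writer stands in for the Lucene codec.

```python
import shutil
import tempfile
from pathlib import Path


def build_index_locally(docs, scratch_dir):
    """Hypothetical stand-in for the indexer (Embedded Solr / Lucene).

    It mimics a writer that seeks backwards to patch a header after the
    body has been written, which an append-only store cannot support.
    """
    segment = Path(scratch_dir) / "segment_0.dat"
    with open(segment, "w+b") as f:
        f.write(b"\x00" * 8)                   # placeholder header
        for doc in docs:
            f.write(doc.encode("utf-8") + b"\n")
        f.seek(0)                              # random access, local disk only
        f.write(len(docs).to_bytes(8, "big"))  # patch in the document count
    return [segment]


def publish_sequentially(files, dest_dir):
    """Copy finished files front to back; dest_dir stands in for HDFS."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    published = []
    for src in files:
        target = dest / src.name
        with open(src, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)      # purely sequential write
        published.append(target)
    return published


def run_reduce_task(docs, dest_dir):
    """One reducer's work: index into scratch space, then publish."""
    with tempfile.TemporaryDirectory() as scratch:
        files = build_index_locally(docs, scratch)
        return publish_sequentially(files, dest_dir)
```

Calling run_reduce_task(["a", "bb"], "/tmp/shard-0") would leave one
finished segment file in /tmp/shard-0. The step of telling SolrCloud
about the new files is left out here.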

If someone wants a great Summer Of Code project, Hadoop->Lucene
indexes->SolrCloud would be a lot of fun and make you widely loved by
people with money. I'm not kidding. Do a good job of this and write
clean code, and you'll get offers for very cool jobs.

On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Hello,
>
> Unfortunately I don't know exactly when the SolrCloud release will be ready,
> but we've used trunk versions in the past and didn't have major issues.
>
> Otis
> ----
> Performance Monitoring SaaS for Solr - 
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
>
>>________________________________
>> From: Ali S Kureishy <safdar.kurei...@gmail.com>
>>To: Otis Gospodnetic <otis_gospodne...@yahoo.com>
>>Cc: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>>Sent: Friday, April 13, 2012 7:16 PM
>>Subject: Re: Options for automagically Scaling Solr (without needing 
>>distributed index/replication) in a Hadoop environment
>>
>>
>>Thanks Otis.
>>
>>
>>I really appreciate the details offered here. This was very helpful 
>>information.
>>
>>
>>I'm going to go through Solandra and Elastic Search and see if those make 
>>sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two 
>>recommendations for SolrCloud so far), so I will give that a shot when it is 
>>available. However, do you know when SolrCloud IS expected to be available?
>>
>>
>>Thanks again!
>>
>>
>>Warm regards,
>>Safdar
>>
>>
>>
>>
>>
>>On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic 
>><otis_gospodne...@yahoo.com> wrote:
>>
>>Hello Ali,
>>>
>>>
>>>> I'm trying to set up a large scale *Crawl + Index + Search* infrastructure
>>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>>>> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
>>>> seconds.
>>>
>>>
>>>That's fine.  Whether it's doable with any tech will depend on how much 
>>>hardware you give it, among other things.
>>>
>>>
>>>> Needless to mention, the search index needs to scale to 5 Billion pages. It
>>>> is also possible that I might need to store multiple indexes -- one for
>>>> crawled content, and one for ancillary data that is also very large. Each
>>>> of these indices would likely require a logically distributed and
>>>> replicated index.
>>>
>>>
>>>Yup, OK.
>>>
>>>
>>>> However, I would like for such a system to be homogeneous with the Hadoop
>>>> infrastructure that is already installed on the cluster (for the crawl). In
>>>> other words, I would much prefer if the replication and distribution of the
>>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>>>> using another scalability framework (such as SolrCloud). In addition, it
>>>> would be ideal if this environment was flexible enough to be dynamically
>>>> scaled based on the size requirements of the index and the search traffic
>>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>>>> enough to automatically provision additional processing power into the
>>>> cluster without requiring server re-starts).
>>>
>>>
>>>There is no such thing just yet: no Search+Hadoop/HDFS in a box.  There was 
>>>an attempt to automatically index HBase content, but that was either not 
>>>completed or not committed into HBase.
>>>
>>>
>>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>>>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>>>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>>>> mature enough and would be the right architectural choice to go along with
>>>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>>>> above.
>>>
>>>
>>>Here is a summary of all of them:
>>>* Search on HBase - I assume you are referring to the same thing I mentioned 
>>>above.  Not ready.
>>>* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
>>>(commercial) offering that combines search and Cassandra.  Looks good.
>>>* Lily - data stored in HBase cluster gets indexed to a separate Solr 
>>>instance(s)  on the side.  Not really integrated the way you want it to be.
>>>* ElasticSearch - solid at this point, the most dynamic solution today, can 
>>>scale well (we are working on a maaaany-B documents index and hundreds of 
>>>nodes with ElasticSearch right now), etc.  But again, not integrated with 
>>>Hadoop the way you want it.
>>>* IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
>>>sure about its future considering LinkedIn uses Zoie and Sensei already.
>>>* And there is SolrCloud, which is coming soon and will be solid, but is 
>>>again not integrated.
>>>
>>>If I were you and I had to pick today - I'd pick ElasticSearch if I were 
>>>completely open.  If I had Solr bias I'd give SolrCloud a try first.
>>>
>>>
>>>> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
>>>> estimate my needing with this setup, for regular web-data (HTML text) at
>>>> this scale?
>>>
>>>I don't know off the top of my head, but I'm guessing several hundred 
>>>instances for serving search requests.
>>>
>>>HTH,
>>>
>>>Otis
>>>--
>>>Search Analytics - http://sematext.com/search-analytics/index.html
>>>
>>>Scalable Performance Monitoring - http://sematext.com/spm/index.html
>>>
>>>
>>>
>>>> Any architectural guidance would be greatly appreciated. The more details
>>>> provided, the wider my grin :).
>>>>
>>>> Many many thanks in advance.
>>>>
>>>> Thanks,
>>>> Safdar
>>>>
>>>
>>
>>
>>



-- 
Lance Norskog
goks...@gmail.com
