AFAIK it cannot. You can only add new shards by creating a new index, and
you will then need to index new data into that new index. Index aliases are
mainly useful on the search side. This means that you need to plan for
this when you implement your indexing logic. The query logic, on the other
hand, does not need to change, as you only add new indices and give them
all the same alias.
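
For example (the index and alias names below are hypothetical), the indexing
side can route new documents to a time-based index, while the query side
always targets the alias; a minimal sketch in Python:

```python
from datetime import date

SEARCH_ALIAS = "pages"  # hypothetical alias covering all page indices


def index_for(day: date) -> str:
    """Indexing side: pick the index for the given day.

    New monthly indices are created up front and added to SEARCH_ALIAS,
    so this naming scheme is the part you must plan for in advance.
    """
    return f"pages-{day.year:04d}-{day.month:02d}"


def search_target() -> str:
    """Query side: always search the alias; it transparently covers
    every index that has been added to it, so query code never changes."""
    return SEARCH_ALIAS
```

With this scheme, adding capacity is just creating next month's index and
adding it to the alias; only `index_for` knows about concrete index names.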

I am not an expert on this, but I think that index splitting and re-sharding
can be expensive for a [near] real-time search system, and the point is that
you can probably use different techniques to support your large-scale
needs. Index aliasing and routing in Elasticsearch can help a lot in
supporting various large-scale data scenarios; check the following thread
on the ES mailing list for some examples:
https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ
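
To illustrate why the shard count is fixed up front: a document's shard is
a deterministic function of a hash of its routing value modulo the number
of shards, so changing the shard count would leave existing documents on
the wrong shards. The hash below is a simplified stand-in, not
Elasticsearch's actual implementation:

```python
import zlib

NUM_SHARDS = 5  # fixed at index creation time


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    # Simplified stand-in for Elasticsearch's internal routing hash:
    # the target shard is hash(routing) modulo the shard count, so both
    # indexing and get/update requests land on the same shard.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards
```

If `num_shards` changed from 5 to 6, `shard_for` would map most existing
routing values to different shards, which is why resizing means re-indexing
into a new index rather than an in-place change.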

Just to sum it up: the fact that Elasticsearch has a fixed number of
shards per index and does not support resharding or index splitting does
not mean you cannot scale your data easily.

(I was not following this whole thread in every detail, so maybe you have
specific needs that can only be solved by splitting or resharding; in that
case I would recommend asking further questions on the ES mailing list. I
do not want to get into a system X vs. system Y flame war here...)

Regards,
Lukas

On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> I'm curious how on the fly updates are handled as a new shard is added
> to an alias.  Eg, how does the system know to which shard to send an
> update?
>
> On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček <lukas.vl...@gmail.com>
> wrote:
> > Hi,
> >
> > speaking about ES I think it would be fair to mention that one has to
> > specify number of shards upfront when the index is created - that is
> > correct, however, it is possible to give an index one or more aliases,
> > which basically means that you can add new indices on the fly and give
> > them the same alias, which is then used to search against. Given that
> > you can add/remove indices, nodes and aliases on the fly, I think there
> > is a way to handle a growing data set with ease. If anyone is
> > interested, such a scenario has been discussed in detail on the ES
> > mailing list.
> >
> > Regards,
> > Lukas
> >
> > On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen <
> > jason.rutherg...@gmail.com> wrote:
> >
> >> One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
> >> ability to redistribute shards across servers.  Meaning, as a single
> >> shard grows too large, splitting the shard while taking live updates.
> >>
> >> How do you plan on elastically adding more servers without this feature?
> >>
> >> Cassandra and HBase handle elasticity in their own ways.  Cassandra
> >> has successfully implemented the Dynamo model and HBase uses the
> >> traditional BigTable 'split'.  Both systems are complex, though both
> >> are at a singular level of maturity.
> >>
> >> Also Cassandra [successfully] implements multiple data center support,
> >> is that available in SC or ES?
> >>
> >> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
> >> <otis_gospodne...@yahoo.com> wrote:
> >> > Hello Ali,
> >> >
> >> >> I'm trying to set up a large scale *Crawl + Index + Search*
> >> >> infrastructure using Nutch and Solr/Lucene. The targeted scale is
> >> >> *5 Billion web pages*, crawled + indexed every *4 weeks*, with a
> >> >> search latency of less than 0.5 seconds.
> >> >
> >> >
> >> > That's fine.  Whether it's doable with any tech will depend on how
> >> > much hardware you give it, among other things.
> >> >
> >> >> Needless to mention, the search index needs to scale to 5 Billion
> >> >> pages. It is also possible that I might need to store multiple
> >> >> indexes -- one for crawled content, and one for ancillary data that
> >> >> is also very large. Each of these indices would likely require a
> >> >> logically distributed and replicated index.
> >> >
> >> >
> >> > Yup, OK.
> >> >
> >> >> However, I would like for such a system to be homogeneous with the
> >> >> Hadoop infrastructure that is already installed on the cluster (for
> >> >> the crawl). In other words, I would much prefer if the replication
> >> >> and distribution of the Solr/Lucene index were done automagically on
> >> >> top of Hadoop/HDFS, instead of using another scalability framework
> >> >> (such as SolrCloud). In addition, it would be ideal if this
> >> >> environment were flexible enough to be dynamically scaled based on
> >> >> the size requirements of the index and the search traffic at the
> >> >> time (i.e. if it is deployed on an Amazon cluster, it should be easy
> >> >> enough to automatically provision additional processing power into
> >> >> the cluster without requiring server restarts).
> >> >
> >> >
> >> > There is no such thing just yet.
> >> > There is no Search+Hadoop/HDFS in a box just yet.  There was an
> >> > attempt to automatically index HBase content, but that was either not
> >> > completed or not committed into HBase.
> >> >
> >> >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem
> >> >> would be ideal for this scenario. I've heard mention of
> >> >> Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc., but
> >> >> I'm really unsure which of these is mature enough and would be the
> >> >> right architectural choice to go along with a Nutch crawler setup,
> >> >> and to also satisfy the dynamic/auto-scaling aspects above.
> >> >
> >> >
> >> > Here is a summary on all of them:
> >> > * Search on HBase - I assume you are referring to the same thing I
> >> >   mentioned above.  Not ready.
> >> > * Solandra - uses Cassandra+Solr, plus DataStax now has a different
> >> >   (commercial) offering that combines search and Cassandra.  Looks
> >> >   good.
> >> > * Lily - data stored in an HBase cluster gets indexed to separate
> >> >   Solr instance(s) on the side.  Not really integrated the way you
> >> >   want it to be.
> >> > * ElasticSearch - solid at this point, the most dynamic solution
> >> >   today, can scale well (we are working on a maaaany-B documents
> >> >   index and hundreds of nodes with ElasticSearch right now), etc.
> >> >   But again, not integrated with Hadoop the way you want it.
> >> > * IndexTank - has some technical weaknesses, not integrated with
> >> >   Hadoop, not sure about its future considering LinkedIn uses Zoie
> >> >   and Sensei already.
> >> > * And there is SolrCloud, which is coming soon and will be solid,
> >> >   but is again not integrated.
> >> >
> >> > If I were you and I had to pick today - I'd pick ElasticSearch if I
> >> > were completely open.  If I had a Solr bias I'd give SolrCloud a try
> >> > first.
> >> >
> >> >> Lastly, how much hardware (assuming a medium sized EC2 instance)
> >> >> would you estimate my needing with this setup, for regular web-data
> >> >> (HTML text) at this scale?
> >> >
> >> > I don't know off the top of my head, but I'm guessing several
> >> > hundred for serving search requests.
> >> >
> >> > HTH,
> >> >
> >> > Otis
> >> > --
> >> > Search Analytics - http://sematext.com/search-analytics/index.html
> >> >
> >> > Scalable Performance Monitoring - http://sematext.com/spm/index.html
> >> >
> >> >
> >> >> Any architectural guidance would be greatly appreciated. The more
> >> >> details provided, the wider my grin :).
> >> >>
> >> >> Many many thanks in advance.
> >> >>
> >> >> Thanks,
> >> >> Safdar
> >> >>
> >>
>
