Re: Scaling strategies without shard splitting

2014-10-17 Thread Ian Rose
Hey Nik -

Thanks for the response.

- Ian


On Mon, Oct 13, 2014 at 4:28 PM, Nikolas Everett <nik9...@gmail.com> wrote:



 On Mon, Oct 13, 2014 at 11:12 AM, Ian Rose <ianr...@fullstory.com> wrote:

 Hi -

 My team has used Solr in its single-node configuration (without
 SolrCloud) for a few years now.  In our current product we are now looking
 at transitioning to SolrCloud, but before we made that leap I wanted to
 also take a good look at whether ElasticSearch would be a better fit for
 our needs.  Although ES has some nice advantages (such as automatic shard
 rebalancing) I'm trying to figure out how to live in a world without shard
 splitting.  In brief, our situation is as follows:

  - We use one index (collection in Solr) per customer.
  - The indexes are going to vary quite a bit in size, following something
 like a power-law distribution with many small indexes (let's guess < 250k
 documents), some medium-sized indexes (up to a few million documents), and a
 few large indexes (hundreds of millions of documents).
  - So the number of shards required per index will vary greatly, and will
 be hard to predict accurately at creation time.

 How do people generally approach this kind of problem?  Do you just make
 a best guess at the appropriate number of shards for each new index and
 then do a full re-index (with more shards) if the number of documents grows
 bigger than expected?


 I'm in a pretty similar boat and have done just fine without shard
 splitting.  I maintain the search indexes for about 900 wikis
 (http://noc.wikimedia.org/conf/all.dblist).  Each wiki gets two
 Elasticsearch indexes, and those indexes vary a ton in size, update rate, and
 query rate.  Most wikis get a single shard for all of their indexes,
 but many of them use more
 (https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/747fc7436226774d1735775c2ef41c911d59b5d2/wmf-config%2FInitialiseSettings.php#L13828).
 I basically just guesstimated and reindexed the ones that were too big into
 more shards.

 We have a script that creates a new index with the new configuration, then
 copies all the documents from the old index to the new one, and then swaps the
 aliases (which we use for updates and queries) over to the new index.  Then it
 re-does any updates or deletes that occurred since the copy script started.
 Having something like that is pretty common.  I rarely use it to change the
 sharding configuration - it's much more common that I'll use it to change
 how a field in the document is analyzed.
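
 Roughly, in Python - the index names, shard count, and settings here are
 all hypothetical, and the catch-up pass for updates that arrive mid-copy
 is left out, so treat this as a sketch of the copy-and-swap idea rather
 than our actual script:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()

    OLD, NEW, ALIAS = "customer_v1", "customer_v2", "customer"  # hypothetical names

    # Create the replacement index with the new shard count / analysis config.
    es.indices.create(index=NEW, body={"settings": {"number_of_shards": 4}})

    # Stream every document out of the old index and bulk-index it into the new one.
    actions = ({"_index": NEW, "_type": hit["_type"], "_id": hit["_id"],
                "_source": hit["_source"]}
               for hit in helpers.scan(es, index=OLD))
    helpers.bulk(es, actions)

    # Atomically repoint the alias that updates and queries go through.
    es.indices.update_aliases(body={"actions": [
        {"remove": {"index": OLD, "alias": ALIAS}},
        {"add": {"index": NEW, "alias": ALIAS}},
    ]})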

 Elasticsearch also has another way to handle this problem (we don't use it
 for other reasons) where you create a single index for all customers and
 then filter them at query time.  You also add routing values to your
 documents and queries so all documents from the same customer get routed to
 the same shard.  That way you can serve queries for a single customer out
 of one shard, which is pretty cool.  For larger customers that don't fit on
 a single shard, you still create indexes just for them.
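
 A rough sketch of that shared-index pattern with the Python client - the
 index name and customer field are made up, and the filtered query is the
 1.x-era syntax:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Routing by customer ID pins all of acme's documents to one shard.
    es.index(index="all_customers", doc_type="doc", id="1", routing="acme",
             body={"customer": "acme", "text": "quarterly report"})

    # Search only acme's shard; the term filter guards against other
    # customers whose routing happened to land on the same shard.
    res = es.search(index="all_customers", routing="acme", body={
        "query": {"filtered": {
            "query": {"match": {"text": "report"}},
            "filter": {"term": {"customer": "acme"}},
        }}
    })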

 One thing to watch out for, though, is that Elasticsearch doesn't use the
 shard's size when determining where to place the shard.  It'll check to
 make sure the shard won't fill the disk beyond some percentage, but it won't
 try to spread out the large shards, so you can get somewhat unbalanced disk
 usage.  I have an open pull request for something to do that, so this probably
 won't be true forever, but it is true for now.
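
 If I remember right, that percentage check is the disk-threshold
 allocation decider, which you can tune through the cluster settings API;
 assuming so, the values below are the documented defaults:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Shards stop being allocated to a node past the low watermark and get
    # moved off past the high one; neither balances by shard size.
    es.cluster.put_settings(body={"transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }})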

 How big are your documents and how frequently do you think you'll need
 shard splitting?  If your documents are pretty small you may be able to get
 away with just reindexing all of them for the customer when you need more
 shards, like I do.  It sure isn't optimal but it gets the job done.

 Another way to do things: once a customer gets too big, you create a
 new index and route all of their new data there.  You then have to query both
 indexes.  This is _kind of_ how people handle log messages, and it might
 work, depending on your use case.
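
 A sketch of that rollover idea (all names hypothetical): new writes go to
 the fresh index, and a read alias - or a comma-separated index list - lets
 queries span both:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Once acme_1 gets big, new documents for the customer go to acme_2 only.
    es.index(index="acme_2", doc_type="doc", body={"text": "new data"})

    # Keep a read alias spanning all of the customer's indexes...
    es.indices.update_aliases(body={"actions": [
        {"add": {"index": "acme_1", "alias": "acme_read"}},
        {"add": {"index": "acme_2", "alias": "acme_read"}},
    ]})

    # ...so queries hit both without callers having to track index names.
    res = es.search(index="acme_read",
                    body={"query": {"match": {"text": "data"}}})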

 Nik


Scaling strategies without shard splitting

2014-10-13 Thread Ian Rose
Hi -

My team has used Solr in its single-node configuration (without SolrCloud) 
for a few years now.  In our current product we are now looking at 
transitioning to SolrCloud, but before we made that leap I wanted to also 
take a good look at whether ElasticSearch would be a better fit for our 
needs.  Although ES has some nice advantages (such as automatic shard 
rebalancing) I'm trying to figure out how to live in a world without shard 
splitting.  In brief, our situation is as follows:

 - We use one index (collection in Solr) per customer.
 - The indexes are going to vary quite a bit in size, following something 
like a power-law distribution with many small indexes (let's guess < 250k 
documents), some medium-sized indexes (up to a few million documents), and a 
few large indexes (hundreds of millions of documents).
 - So the number of shards required per index will vary greatly, and will 
be hard to predict accurately at creation time.

How do people generally approach this kind of problem?  Do you just make a 
best guess at the appropriate number of shards for each new index and then 
do a full re-index (with more shards) if the number of documents grows 
bigger than expected?

Thanks!
- Ian
