Re: Scaling strategies without shard splitting

2014-10-17 Thread Ian Rose
Hey Nik -

Thanks for the response.

- Ian


On Mon, Oct 13, 2014 at 4:28 PM, Nikolas Everett <nik9...@gmail.com> wrote:



 On Mon, Oct 13, 2014 at 11:12 AM, Ian Rose <ianr...@fullstory.com> wrote:

 Hi -

 My team has used Solr in its single-node configuration (without
 SolrCloud) for a few years now.  In our current product we are now looking
 at transitioning to SolrCloud, but before we made that leap I wanted to
 also take a good look at whether ElasticSearch would be a better fit for
 our needs.  Although ES has some nice advantages (such as automatic shard
 rebalancing) I'm trying to figure out how to live in a world without shard
 splitting.  In brief, our situation is as follows:

  - We use one index (collection in Solr) per customer.
  - The indexes are going to vary quite a bit in size, following something
 like a power-law distribution with many small indexes (let's guess < 250k
 documents), some medium-sized indexes (up to a few million documents), and a
 few large indexes (hundreds of millions of documents).
  - So the number of shards required per index will vary greatly, and will
 be hard to predict accurately at creation time.

 How do people generally approach this kind of problem?  Do you just make
 a best guess at the appropriate number of shards for each new index and
 then do a full re-index (with more shards) if the number of documents grows
 bigger than expected?


 I'm in a pretty similar boat and have done just fine without shard
 splitting.  I maintain the search indexes for about 900 wikis
 (http://noc.wikimedia.org/conf/all.dblist).  Each wiki gets two
 Elasticsearch indexes, and those indexes vary a ton in size, update rate, and
 query rate.  Most wikis get a single shard for all of their indexes,
 but many of them use more
 (https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/747fc7436226774d1735775c2ef41c911d59b5d2/wmf-config%2FInitialiseSettings.php#L13828).
 I basically just guesstimated and reindexed the ones that were too big into
 more shards.

 We have a script that creates a new index with the new configuration, then
 copies all the documents from the old index to the new one, and then swaps the
 aliases (which we use for updates and queries) over to the new index.  Then it
 re-does any updates or deletes that occurred since the copy script started.
 Having something like that is pretty common.  I rarely use it to change the
 sharding configuration - it's much more common that I'll use it to change
 how a field in the document is analyzed.
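
 Roughly, in Python - the index names, shard count, and settings here are
 all hypothetical, and the catch-up pass for updates that arrive mid-copy
 is left out, so treat this as a sketch of the copy-and-swap idea rather
 than our actual script:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()

    OLD, NEW, ALIAS = "customer_v1", "customer_v2", "customer"  # hypothetical names

    # Create the replacement index with the new shard count / analysis config.
    es.indices.create(index=NEW, body={"settings": {"number_of_shards": 4}})

    # Stream every document out of the old index and bulk-index it into the new one.
    actions = ({"_index": NEW, "_type": hit["_type"], "_id": hit["_id"],
                "_source": hit["_source"]}
               for hit in helpers.scan(es, index=OLD))
    helpers.bulk(es, actions)

    # Atomically repoint the alias that updates and queries go through.
    es.indices.update_aliases(body={"actions": [
        {"remove": {"index": OLD, "alias": ALIAS}},
        {"add": {"index": NEW, "alias": ALIAS}},
    ]})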

 Elasticsearch also has another way to handle this problem (we don't use it
 for other reasons) where you create a single index for all customers and
 then filter them at query time.  You also add routing values to your
 documents and queries so all documents from the same customer get routed to
 the same shard.  That way you can serve queries for a single customer out
 of one shard, which is pretty cool.  For larger customers that don't fit on
 a single shard, you still create indexes just for them.
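
 A rough sketch of that shared-index pattern with the Python client - the
 index name and customer field are made up, and the filtered query is the
 1.x-era syntax:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Routing by customer ID pins all of acme's documents to one shard.
    es.index(index="all_customers", doc_type="doc", id="1", routing="acme",
             body={"customer": "acme", "text": "quarterly report"})

    # Search only acme's shard; the term filter guards against other
    # customers whose routing happened to land on the same shard.
    res = es.search(index="all_customers", routing="acme", body={
        "query": {"filtered": {
            "query": {"match": {"text": "report"}},
            "filter": {"term": {"customer": "acme"}},
        }}
    })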

 One thing to watch out for, though, is that Elasticsearch doesn't use the
 shard's size when determining where to place the shard.  It'll check to
 make sure the shard won't fill the disk beyond some percentage, but it won't
 try to spread out the large shards, so you can get somewhat unbalanced disk
 usage.  I have an open pull request for something to do that, so this probably
 won't be true forever, but it is true for now.
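
 If I remember right, that percentage check is the disk-threshold
 allocation decider, which you can tune through the cluster settings API;
 assuming so, the values below are the documented defaults:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Shards stop being allocated to a node past the low watermark and get
    # moved off past the high one; neither balances by shard size.
    es.cluster.put_settings(body={"transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }})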

 How big are your documents and how frequently do you think you'll need
 shard splitting?  If your documents are pretty small you may be able to get
 away with just reindexing all of them for the customer when you need more
 shards, like I do.  It sure isn't optimal but it gets the job done.

 Another way to do things: once a customer gets too big, you create a
 new index and route all of their new data there.  You then have to query both
 indexes.  This is _kind of_ how people handle log messages, and it might
 work, depending on your use case.
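
 A sketch of that rollover idea (all names hypothetical): new writes go to
 the fresh index, and a read alias - or a comma-separated index list - lets
 queries span both:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Once acme_1 gets big, new documents for the customer go to acme_2 only.
    es.index(index="acme_2", doc_type="doc", body={"text": "new data"})

    # Keep a read alias spanning all of the customer's indexes...
    es.indices.update_aliases(body={"actions": [
        {"add": {"index": "acme_1", "alias": "acme_read"}},
        {"add": {"index": "acme_2", "alias": "acme_read"}},
    ]})

    # ...so queries hit both without callers having to track index names.
    res = es.search(index="acme_read",
                    body={"query": {"match": {"text": "data"}}})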

 Nik


Scaling strategies without shard splitting

2014-10-13 Thread Ian Rose
Hi -

My team has used Solr in its single-node configuration (without SolrCloud) 
for a few years now.  In our current product we are now looking at 
transitioning to SolrCloud, but before we made that leap I wanted to also 
take a good look at whether ElasticSearch would be a better fit for our 
needs.  Although ES has some nice advantages (such as automatic shard 
rebalancing) I'm trying to figure out how to live in a world without shard 
splitting.  In brief, our situation is as follows:

 - We use one index (collection in Solr) per customer.
 - The indexes are going to vary quite a bit in size, following something 
like a power-law distribution with many small indexes (let's guess < 250k 
documents), some medium-sized indexes (up to a few million documents), and a 
few large indexes (hundreds of millions of documents).
 - So the number of shards required per index will vary greatly, and will 
be hard to predict accurately at creation time.

How do people generally approach this kind of problem?  Do you just make a 
best guess at the appropriate number of shards for each new index and then 
do a full re-index (with more shards) if the number of documents grows 
bigger than expected?

Thanks!
- Ian
