Hey Nik - Thanks for the response.
- Ian

On Mon, Oct 13, 2014 at 4:28 PM, Nikolas Everett <nik9...@gmail.com> wrote:

> On Mon, Oct 13, 2014 at 11:12 AM, Ian Rose <ianr...@fullstory.com> wrote:
>
>> Hi -
>>
>> My team has used Solr in its single-node configuration (without
>> SolrCloud) for a few years now. In our current product we are now looking
>> at transitioning to SolrCloud, but before we make that leap I wanted to
>> take a good look at whether Elasticsearch would be a better fit for our
>> needs. Although ES has some nice advantages (such as automatic shard
>> rebalancing), I'm trying to figure out how to live in a world without
>> shard splitting. In brief, our situation is as follows:
>>
>> - We use one index ("collection" in Solr) per customer.
>> - The indexes vary quite a bit in size, following something like a
>> power-law distribution: many small indexes (let's guess < 250k
>> documents), some medium-sized indexes (up to a few million documents),
>> and a few large indexes (hundreds of millions of documents).
>> - So the number of shards required per index will vary greatly and will
>> be hard to predict accurately at creation time.
>>
>> How do people generally approach this kind of problem? Do you just make
>> a best guess at the appropriate number of shards for each new index and
>> then do a full re-index (with more shards) if the number of documents
>> grows bigger than expected?
>
> I'm in a pretty similar boat and have done just fine without shard
> splitting. I maintain the search index for about 900 wikis
> <http://noc.wikimedia.org/conf/all.dblist>. Each wiki gets two
> Elasticsearch indexes, and those indexes vary in size, update rate, and
> query rate a ton. Most wikis get a single shard for all of their indexes,
> but many of them use more
> <https://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/747fc7436226774d1735775c2ef41c911d59b5d2/wmf-config%2FInitialiseSettings.php#L13828>.
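The "best guess" the question asks about usually boils down to picking an initial shard count from the expected document volume. As a purely hypothetical illustration (the docs-per-shard target below is an assumption for the sketch, not an Elasticsearch recommendation):

```python
import math

# Hypothetical sizing target - an assumption for illustration only,
# not a figure from Elasticsearch documentation.
DOCS_PER_SHARD = 5_000_000

def guess_shards(expected_docs: int) -> int:
    """Guess an initial shard count from expected document volume,
    with a floor of one shard."""
    return max(1, math.ceil(expected_docs / DOCS_PER_SHARD))

# The small / medium / large customers from the power-law distribution above:
print(guess_shards(250_000))      # small index  -> 1
print(guess_shards(3_000_000))    # medium index -> 1
print(guess_shards(300_000_000))  # large index  -> 60
```

Whatever the formula, the point of the thread is that the guess will be wrong for some customers, and the fix is to reindex those into more shards.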
> I basically just guesstimated and reindexed the ones that were too big
> into more shards.
>
> We have a script that creates a new index with the new configuration,
> copies all the documents from the old index to the new one, and then
> swaps the aliases (which we use for updates and queries) over to the new
> index. Then it re-does any updates or deletes that occurred since the
> copy script started. Having something like that is pretty common. I
> rarely use it to change sharding configuration - it's much more common
> that I'll use it to change how a field in the document is analyzed.
>
> Elasticsearch also has another way to handle this problem (we don't use
> it for other reasons) where you create a single index for all customers
> and then filter them at query time. You also add routing values to your
> documents and queries so all documents from the same customer get routed
> to the same shard. That way you can serve queries for a single customer
> out of one shard, which is pretty cool. For larger customers that don't
> fit on a single shard you still create indexes just for them.
>
> One thing to watch out for, though, is that Elasticsearch doesn't use the
> shard's size when determining where to place the shard. It'll check to
> make sure the shard won't fill the disk beyond some percentage, but it
> won't try to spread out the large shards, so you can get somewhat
> unbalanced disk usage. I have an open pull request for something to do
> that, so this probably won't be true forever, but it is true for now.
>
> How big are your documents, and how frequently do you think you'll need
> shard splitting? If your documents are pretty small you may be able to
> get away with just reindexing all of them for the customer when you need
> more shards, like I do. It sure isn't optimal, but it gets the job done.
>
> Another way to do things is, once a customer gets too big, you create a
> new index and route all of their new data there. You then have to query
> both indexes.
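The "create a new index once a customer outgrows the old one" idea at the end can be sketched roughly as below. The index-naming scheme, the doc-count threshold, and the in-memory bookkeeping are all made up for illustration; a real system would track this state durably rather than in dictionaries:

```python
from collections import defaultdict

MAX_DOCS_PER_INDEX = 1_000  # illustrative threshold, far smaller than real life

customer_indexes = defaultdict(list)  # customer -> ordered list of index names
doc_counts = defaultdict(int)         # index name -> documents written so far

def write_index(customer: str) -> str:
    """Return the index new documents for this customer should go to,
    rolling over to a fresh index once the current one is 'full'."""
    names = customer_indexes[customer]
    if not names or doc_counts[names[-1]] >= MAX_DOCS_PER_INDEX:
        names.append(f"{customer}-{len(names) + 1}")
    return names[-1]

def index_doc(customer: str, doc: dict) -> str:
    name = write_index(customer)
    doc_counts[name] += 1
    return name

def query_indexes(customer: str) -> list:
    # The cost of this scheme: a query for the customer has to fan out
    # over every index they have ever been given.
    return list(customer_indexes[customer])

for i in range(2_500):
    index_doc("acme", {"id": i})
print(query_indexes("acme"))  # ['acme-1', 'acme-2', 'acme-3']
```

Writes always land in a single (the newest) index, which is why, as the next message notes, this is close to how people commonly handle time-series data like logs.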
> This is _kind of_ how people handle log messages, and it might work,
> depending on your use case.
>
> Nik
>
> --
> You received this message because you are subscribed to a topic in the
> Google Groups "elasticsearch" group.
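The key property behind the shared-index-plus-routing approach described above is that a hash of the routing value selects the shard, so one customer's documents (and any query carrying the same routing value) all land on a single shard. A simplified model of that behavior, with md5 standing in for Elasticsearch's actual routing hash (which is a different function):

```python
import hashlib

NUM_SHARDS = 5

def shard_for(routing: str) -> int:
    """Map a routing value to a shard number. md5 is a stand-in here -
    Elasticsearch uses its own hash function, not md5 - but the
    deterministic hash-modulo-shard-count structure is the point."""
    digest = hashlib.md5(routing.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every document routed with the same customer id maps to one shard,
# so a query sent with that routing value only needs to touch one shard.
shards = {shard_for("customer-42") for _ in range(100)}
print(len(shards))  # 1
```

This also shows why the approach breaks down for customers too big for one shard: the hash pins all of their data to a single shard regardless of its size, which is why the message above says you still create dedicated indexes for the largest customers.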