Interesting, ok. A colleague went to an Elasticsearch training and was told that, given a default index with N shards, keeping index sizes similar was critical for maintaining consistent search performance. I guess that could play out like this: a two-billion-record index would have a huge number of unique terms, while a smaller index of, say, 100k records would have a substantially smaller term set, right? Dealing with content from sources like the Twitter public API, I would anticipate fairly linear growth in both unique terms and overall index size. That ultimately leads back to the original scenario, where a larger index is comparatively slower to search because of its necessarily larger dictionary. It seems as though there'd still be room for the kind of automatic scaling via a template system described above?
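To make the routing idea concrete, here's a rough sketch of the "lowest-numbered not-full index" selection, independent of whether it lives in a river plugin or in the ingestion code. The function name `target_index` and the `doc_counts` dict are my own placeholders; in practice the counts would come from something like a count query against the cluster, and 1,000,000 is just the maximum-index-size example from the earlier message:

```python
from datetime import date

MAX_DOCS = 1_000_000  # example maximum records per child index

def target_index(doc_counts, project="project", day=None, max_docs=MAX_DOCS):
    """Pick the lowest-numbered child index for `day` that still has room,
    rolling over to a new _NNN suffix once all existing ones are full.

    `doc_counts` maps index name -> current document count.
    """
    day = day or date.today()
    prefix = f"{project}_{day:%Y_%m_%d}_"
    n = 0
    while doc_counts.get(f"{prefix}{n:03d}", 0) >= max_docs:
        n += 1
    return f"{prefix}{n:03d}"

# Example: the first daily index is full, the second still has room.
counts = {"project_2014_04_08_000": 1_000_000,
          "project_2014_04_08_001": 250_000}
print(target_index(counts, day=date(2014, 4, 8)))  # project_2014_04_08_001
```

The nice property is that every ingestion stream can call the same selection logic, which is essentially the load-balancing concern the plugin was meant to centralize.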
On Wednesday, April 9, 2014 7:38:35 AM UTC-7, Jörg Prante wrote:
>
> The number of documents is not relevant to the search time.
>
> Important factors for search time are the type of query, shard size, the
> number of unique terms (the dictionary size), the number of segments,
> network latency, disk drive latency, ...
>
> Maybe you mean equal distribution of docs with same average size across
> shards. This means a search does not have to wait for nodes that must
> search in larger shards.
>
> I do not think this needs a river plugin, since equal distribution of docs
> over the shards is the default.
>
> Jörg
>
> On Tue, Apr 8, 2014 at 9:03 PM, Josh Harrison <hij...@gmail.com> wrote:
>
>> I have heard that ideally, you want to have a similar number of documents
>> per shard for optimal search times, is that correct?
>>
>> I have data volumes that are just all over the place, from 100k to tens
>> of millions in a week.
>>
>> I'm thinking about a river plugin that could:
>> Take a mapping object as a template
>> Define a template for child index names (project_\YYYY_\MM_\DD_\NNN =
>> project_2014_04_08_000, etc)
>> Define index shard count (5)
>> Define maximum index size (1,000,000)
>> Define a listening endpoint of some sort
>>
>> Documents would stream into the listening endpoint however you wanted:
>> rivers, bulk loads using an API, etc. They would be automatically routed to
>> the lowest-numbered not-full index. So on a given day you could end up with
>> fifteen indexes, or eighty, or two, but they'd all be a maximum of N
>> records.
>>
>> A plugin seems desirable in this case, as it frees you from needing to
>> write the load balancing into every ingestion stream you've got.
>>
>> Is this a reasonable solution to this problem? Am I overcomplicating
>> things?
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8537dab8-8831-42a5-97b0-92367d3753ca%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.