Interesting, OK. A colleague attended Elasticsearch training and was told 
that, given a default index with N shards, keeping index sizes similar was 
critical for maintaining consistent search performance. I guess that could 
play out like this: a two-billion-record index would have a huge number of 
unique terms, while a smaller index of, say, 100k records would have a 
substantially smaller term set, right?
Dealing with content from sources like the Twitter public API, I would 
anticipate fairly linear growth in both unique terms and overall index 
size. That leads back to the scenario I described initially, where a larger 
index is comparatively slower to search because of its necessarily larger 
dictionary. It seems as though there would still be room for the kind of 
automatic scaling via a template system described in my earlier message?
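
For the template half of that idea, something like the following might 
work. This is a minimal sketch in Python against the plain REST API, 
assuming a local node, an index-name pattern of project_*, and a 
hypothetical "tweet" mapping:

import json
import requests

ES = "http://localhost:9200"  # assumed local node

# Any new index whose name matches "project_*" picks up these settings
# and mappings automatically, so the daily child indexes
# (project_2014_04_08_000, ...) need no per-index setup.
template = {
    "template": "project_*",
    "settings": {"number_of_shards": 5},
    "mappings": {
        "tweet": {  # hypothetical type name
            "properties": {
                "text": {"type": "string"},
                "created_at": {"type": "date"},
            }
        }
    },
}

print(requests.put(ES + "/_template/project_template",
                   data=json.dumps(template)).json())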

On Wednesday, April 9, 2014 7:38:35 AM UTC-7, Jörg Prante wrote:
>
> The number of documents is not relevant to the search time.
>
> Important factors for search time are the type of query, shard size, the 
> number of unique terms (the dictionary size), the number of segments, 
> network latency, disk drive latency, ...
>
> Maybe you mean an equal distribution of docs of the same average size 
> across shards. That way a search does not have to wait for nodes that 
> must search larger shards.
>
> I do not think this needs a river plugin, since equal distribution of docs 
> over the shards is the default.
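>
> To confirm that the distribution is even, something like this works (a 
> sketch; the local host is an assumption):
>
> import requests
>
> # List every shard with its doc count and size; an uneven spread of
> # docs would show up directly in these rows.
> print(requests.get("http://localhost:9200/_cat/shards?v").text)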
>
> Jörg
>
>
> On Tue, Apr 8, 2014 at 9:03 PM, Josh Harrison <hij...@gmail.com> wrote:
>
>> I have heard that, ideally, you want a similar number of documents per 
>> shard for optimal search times. Is that correct?
>>
>> I have data volumes that are all over the place, from 100k documents to 
>> tens of millions in a week.
>>
>> I'm thinking about a river plugin that could:
>> Take a mapping object as a template
>> Define a template for child index names (project_YYYY_MM_DD_NNN = 
>> project_2014_04_08_000, etc.)
>> Define index shard count (5)
>> Define maximum index size (1,000,000)
>> Define a listening endpoint of some sort
>>
>> Documents would stream into the listening endpoint however you wanted: 
>> rivers, bulk loads via an API, etc. They would be automatically routed to 
>> the lowest-numbered not-full index. So on a given day you could end up 
>> with fifteen indexes, or eighty, or two, but each would hold a maximum of 
>> N records.
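>>
>> As a rough sketch of that routing logic (Python; the endpoint, the day 
>> format, and the "tweet" type are assumptions, not part of the plugin):
>>
>> import json
>> import requests
>>
>> ES = "http://localhost:9200"   # assumed endpoint
>> MAX_DOCS = 1000000             # maximum index size from above
>>
>> def target_index(day, n=0):
>>     """Return the lowest-numbered not-full index for the day,
>>     moving to the next suffix once an index reaches MAX_DOCS."""
>>     while True:
>>         name = "project_%s_%03d" % (day, n)
>>         r = requests.get(ES + "/" + name + "/_count")
>>         if r.status_code == 404:
>>             return name  # not created yet; first write creates it
>>         if r.json()["count"] < MAX_DOCS:
>>             return name
>>         n += 1
>>
>> def ingest(day, doc):
>>     requests.post(ES + "/" + target_index(day) + "/tweet",
>>                   data=json.dumps(doc))
>>
>> In practice you would cache the current index and use the bulk API 
>> rather than counting on every document, but the shape is the same.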
>>
>> A plugin seems desirable in this case, as it frees you from needing to 
>> write the load balancing into every ingestion stream you've got.
>>
>> Is this a reasonable solution to this problem? Am I overcomplicating 
>> things? 
>>  
>
>
