Hi Kireet, thanks for your answer and sorry for the late response. More 
shards don't help. They slow down the system, because each shard carries 
quite some overhead to maintain a Lucene index, and the smaller the shards, 
the bigger the relative overhead. Having more shards improves indexing 
performance and allows a big index to be distributed across machines, but I 
don't have a cluster with a lot of machines. I could observe these negative 
effects while testing with 20 shards.
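As a rough illustration of this trade-off, here is a toy cost model (all numbers are made up; this is not an Elasticsearch benchmark): splitting a fixed dataset into more shards shrinks each index, but every shard adds a fixed overhead, and on a single machine the shards compete for the same CPUs.

```python
# Toy cost model, illustration only: every number here is invented.
def query_cost(total_docs, shards, per_shard_overhead, cores):
    # work per shard: its slice of the data plus a fixed per-shard overhead
    per_shard_work = total_docs / shards + per_shard_overhead
    # shards run in parallel only up to the number of cores;
    # beyond that they queue on the same CPUs
    parallelism = min(shards, cores)
    return per_shard_work * shards / parallelism

# with one effective core (e.g. a weak virtualized CPU), 20 shards
# cost more than a single shard for the same 2.5M documents
one = query_cost(2_500_000, 1, per_shard_overhead=100_000, cores=1)
twenty = query_cost(2_500_000, 20, per_shard_overhead=100_000, cores=1)
assert twenty > one
```

With many real cores the picture changes, which is why sharding pays off on a cluster but not necessarily on a single box.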

It would be very cool if somebody could answer or comment on the questions 
summarized at the end of my post. Thanks again.





On Friday, July 11, 2014 3:02:50 AM UTC+2, Kireet Reddy wrote:
>
> I would test using multiple primary shards on a single machine. Since your 
> dataset seems to fit into RAM, this could help for these longer latency 
> queries.
>
> On Thursday, July 10, 2014 12:24:26 AM UTC-7, Fin Sekun wrote:
>>
>> Any hints?
>>
>>
>>
>> On Monday, July 7, 2014 3:51:19 PM UTC+2, Fin Sekun wrote:
>>>
>>>
>>> Hi,
>>>
>>>
>>> *SCENARIO*
>>>
>>> Our Elasticsearch database has ~2.5 million entries. Each entry has the 
>>> three analyzed fields "match", "sec_match" and "thi_match" (each contains 
>>> 3-20 words), which are used in this query:
>>> https://gist.github.com/anonymous/a8d1142512e5625e4e91
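The exact query is in the gist above. Purely as a hypothetical sketch of what a disjunction over the three fields can look like (the field names are from the post; the multi_match structure is an assumption, and the real query may differ):

```python
import json

def build_query(text):
    # hypothetical sketch: a multi_match over the three analyzed fields;
    # the actual query from the gist may be structured differently
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["match", "sec_match", "thi_match"],
            }
        }
    }

body = json.dumps(build_query("six or more keywords here"))
```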
>>>
>>>
>>> ES runs on two *types of servers*:
>>> (1) Real servers (the system has direct access to physical CPUs, no 
>>> virtualization) of the newest generation - very performant!
>>> (2) Cloud servers with virtualized CPUs - poor CPUs, but this is typical 
>>> for cloud services.
>>>
>>> See https://gist.github.com/anonymous/3098b142c2bab51feecc for (1) and 
>>> (2) CPU details.
>>>
>>>
>>> *ES settings:*
>>> ES version 1.2.0 (jdk1.8.0_05)
>>> ES_HEAP_SIZE = 512m (we also tested with 1024m with same results)
>>> vm.max_map_count = 262144
>>> ulimit -n 64000
>>> ulimit -l unlimited
>>> index.number_of_shards: 1
>>> index.number_of_replicas: 0
>>> index.store.type: mmapfs
>>> threadpool.search.type: fixed
>>> threadpool.search.size: 75
>>> threadpool.search.queue_size: 5000
>>>
>>>
>>> *Infrastructure*:
>>> As you can see above, we don't use the cluster feature of ES (1 shard, 0 
>>> replicas). The reason is that our hosting infrastructure is spread across 
>>> different providers.
>>> Upside: we aren't dependent on a single hosting provider. Downside: our 
>>> servers aren't in the same LAN.
>>>
>>> This means:
>>> - We cannot use ES sharding, because synchronisation via WAN (internet) 
>>> doesn't seem like a useful solution.
>>> - So, every ES server has the complete dataset, and we configured only 
>>> one shard and no replicas for higher performance.
>>> - We have a distribution process that updates the ES data on every host 
>>> frequently. This process is fine for us, because updates aren't very 
>>> frequent and perfect just-in-time ES synchronisation isn't necessary for 
>>> our business case.
>>> - If a server goes down/crashes, the central load balancer removes it 
>>> (the resulting minimal packet loss is acceptable).
>>>  
>>>
>>>
>>>
>>> *PROBLEM*
>>>
>>> For long query terms (6 or more keywords), we see very high CPU load, 
>>> even on the high-performance server (1), and this leads to high response 
>>> times: 1-4s on server (1), 8-20s on server (2). The system parameters 
>>> while querying:
>>> - Very high load (usually 100%) on the CPU running the search thread 
>>> (the other CPUs are idle in our test scenario)
>>> - No I/O load (the hard disks are fine)
>>> - No RAM bottlenecks
>>>
>>> So we think the file caching is working fine, because we have no I/O 
>>> problems and the garbage collector seems to be happy (jstat shows very 
>>> few GCs). The CPU is the problem, and the ES hot threads output points 
>>> to the Scorer module:
>>> https://gist.github.com/anonymous/9cecfd512cb533114b7d 
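For reference, the hot threads report above comes from Elasticsearch's `/_nodes/hot_threads` diagnostic API; a minimal sketch of calling it from Python (host, port and sampling parameters are just example values):

```python
from urllib.request import urlopen

def hot_threads_url(host="localhost", port=9200, threads=3, interval="500ms"):
    # report the `threads` busiest threads per node, sampled over `interval`
    return ("http://%s:%d/_nodes/hot_threads?threads=%d&interval=%s"
            % (host, port, threads, interval))

def fetch_hot_threads(**kwargs):
    # run against a live node; returns the plain-text report
    with urlopen(hot_threads_url(**kwargs)) as resp:
        return resp.read().decode("utf-8")
```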
>>>
>>>
>>>
>>>
>>> *SUMMARY/ASSUMPTIONS*
>>>
>>> - Our database isn't very big and the query isn't very complex.
>>> - ES is designed for huge amounts of data, but the key is 
>>> clustering/sharding: distributing data to many servers means smaller 
>>> indices, and smaller indices lead to lower CPU load and shorter response 
>>> times.
>>> - So, our database isn't big, but it is too big for a single CPU, which 
>>> means that especially low-performance (virtual) CPUs can only be used in 
>>> sharded environments.
>>>
>>> If we don't want to lose our provider independence, we have only the 
>>> following two options:
>>>
>>> 1) A simpler query (I think not possible in our case)
>>> 2) A smaller database
>>>
>>>
>>>
>>>
>>> *QUESTIONS*
>>>
>>> Are our assumptions correct? Especially:
>>>
>>> - Is clustering/sharding (i.e. small indices) the main key to 
>>> performance - that is, the only way to prevent overloaded (virtual) 
>>> CPUs?
>>> - Is it right that clustering is only useful/possible in LANs?
>>> - Do you have any ES configuration or architecture hints regarding our 
>>> preference for using multiple hosting providers?
>>>
>>>
>>>
>>> Thank you. Rgds
>>> Fin
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7e1b3a52-23b4-4a22-8433-985e07ae7904%40googlegroups.com.
