Hi,

*SCENARIO*

Our Elasticsearch database has ~2.5 million entries. Each entry has three 
analyzed fields, "match", "sec_match" and "thi_match" (each containing 
3-20 words), which are used in this query:
https://gist.github.com/anonymous/a8d1142512e5625e4e91
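
Since the query is only linked, here is a minimal sketch of roughly what it 
looks like, assuming a multi_match over the three fields (the field names 
are taken from above; the index name, boosts and exact query type are 
illustrative assumptions, not our actual query):

import json
import requests  # plain HTTP client; the official elasticsearch-py client works too

ES_SEARCH_URL = "http://localhost:9200/myindex/_search"  # hypothetical index name

# Illustrative assumption: a multi_match query over the three analyzed fields.
# The real query is in the gist linked above.
query = {
    "query": {
        "multi_match": {
            "query": "six or more keywords entered by the user",
            "fields": ["match^3", "sec_match^2", "thi_match"],
        }
    },
    "size": 10,
}

resp = requests.post(ES_SEARCH_URL, data=json.dumps(query))
print(resp.json()["hits"]["total"])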


ES runs on two *types of servers*:
(1) Bare-metal servers of the newest generation (direct access to physical 
CPUs, no virtualization) - very fast.
(2) Cloud servers with virtualized CPUs - weak CPUs, but that is typical 
for cloud services.

See https://gist.github.com/anonymous/3098b142c2bab51feecc for the CPU 
details of (1) and (2).


*ES settings:*
ES version 1.2.0 (jdk1.8.0_05)
ES_HEAP_SIZE = 512m (we also tested with 1024m, with the same results)
vm.max_map_count = 262144
ulimit -n 64000
ulimit -l unlimited
index.number_of_shards: 1
index.number_of_replicas: 0
index.store.type: mmapfs
threadpool.search.type: fixed
threadpool.search.size: 75
threadpool.search.queue_size: 5000
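
For reference, these settings can be checked at runtime via the REST API; 
a small sketch (host and index name are hypothetical):

import requests

ES = "http://localhost:9200"

# Index-level settings: number_of_shards, number_of_replicas, store type
print(requests.get(ES + "/myindex/_settings").json())

# Node info, including JVM details and the configured search thread pool
print(requests.get(ES + "/_nodes").json())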


*Infrastructure*:
As you can see above, we don't use the cluster feature of ES (1 shard, 0 
replicas). The reason is that our hosting infrastructure is spread across 
different providers.
Upside: we aren't dependent on a single hosting provider. Downside: our 
servers aren't in the same LAN.

This means:
- We cannot use ES sharding, because shard synchronisation over a WAN (the 
internet) does not seem practical.
- Therefore every ES server holds the complete dataset, and we configured a 
single shard with no replicas for better performance.
- We have a distribution process that periodically updates the ES data on 
every host (see the sketch after this list). This works for us, because 
updates aren't very frequent and perfect just-in-time ES synchronisation 
isn't necessary for our business case.
- If a server goes down or crashes, the central load balancer removes it 
(the resulting small amount of lost traffic is acceptable).
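
The distribution process mentioned above follows roughly this pattern (a 
simplified sketch, not our actual tooling; index and alias names are 
hypothetical): bulk-index the full dataset into a fresh index on each host, 
then switch an alias so searches never hit a half-built index.

import json
import requests

ES = "http://localhost:9200"   # run against each host in turn
NEW_INDEX = "data_v2"          # hypothetical new index
OLD_INDEX = "data_v1"          # hypothetical previous index
ALIAS = "data"                 # hypothetical alias used by all queries

def bulk_index(docs):
    # Standard _bulk body: one action line plus one source line per document.
    lines = []
    for i, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_index": NEW_INDEX, "_type": "entry", "_id": i}}))
        lines.append(json.dumps(doc))
    requests.post(ES + "/_bulk", data="\n".join(lines) + "\n")

def switch_alias():
    # Atomically repoint the alias from the old index to the new one.
    actions = {"actions": [
        {"remove": {"index": OLD_INDEX, "alias": ALIAS}},
        {"add": {"index": NEW_INDEX, "alias": ALIAS}},
    ]}
    requests.post(ES + "/_aliases", data=json.dumps(actions))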
 



*PROBLEM*

For long queries (6 or more keywords) we see very high CPU load, even on 
the high-performance server (1), which leads to long response times: 
1-4 s on server (1), 8-20 s on server (2). System behaviour while querying:
- Very high load (usually 100%) on the CPU core running the search thread 
(the other cores are idle in our test scenario)
- No I/O load (the hard disks are fine)
- No RAM bottlenecks

So we think the file system cache is working fine, because we see no I/O 
problems and the garbage collector seems to be happy (jstat shows very few 
GCs). The CPU is the bottleneck, and the ES hot threads output points to the 
Lucene Scorer:
https://gist.github.com/anonymous/9cecfd512cb533114b7d 
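
The dump above was captured with the standard nodes hot threads API; a 
minimal sketch of how to grab it while a slow query is running (host is 
illustrative):

import requests

# Capture the busiest threads while the slow query is executing.
resp = requests.get("http://localhost:9200/_nodes/hot_threads",
                    params={"threads": 3})
print(resp.text)  # plain-text stack traces; in our case they point to the Lucene Scorer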




*SUMMARY/ASSUMPTIONS*

- Our database isn't very big and the query isn't very complex.
- ES is designed for huge amounts of data, but the key is 
clustering/sharding: distributing the data across many servers means smaller 
indices, and smaller indices lead to lower CPU load and shorter response 
times.
- So our database isn't big, but it is too big for a single CPU core, which 
means that low-performance (virtual) CPUs in particular can only be used in 
a sharded environment.

If we don't want to lose provider independence, we have only the following 
two options:

1) A simpler query (probably not possible in our case)
2) A smaller database




*QUESTIONS*

Are our assumptions correct? Especially:

- Is clustering/sharding (i.e. small indices) the main key to performance, 
that is, the only way to prevent overloading (virtual) CPUs?
- Is it correct that clustering is only useful/possible within a LAN?
- Do you have any ES configuration or architecture hints, given our 
preference for using multiple hosting providers?



Thank you. Rgds
Fin
