Hi Kireet, thanks for your answer and sorry for the late response. More shards doesn't help. It will slow down the system because each shard takes quite some overhead to maintain a Lucene index and, the smaller the shards, the bigger the overhead. Having more shards enhances the indexing performance and allows to distribute a big index across machines, but I don't have a cluster with a lot of machines. I could observe this negative effects while testing with 20 shards.
It would be very cool if somebody could answer/comment to the question summarized at the end of my post. Thanks again. On Friday, July 11, 2014 3:02:50 AM UTC+2, Kireet Reddy wrote: > > I would test using multiple primary shards on a single machine. Since your > dataset seems to fit into RAM, this could help for these longer latency > queries. > > On Thursday, July 10, 2014 12:24:26 AM UTC-7, Fin Sekun wrote: >> >> Any hints? >> >> >> >> On Monday, July 7, 2014 3:51:19 PM UTC+2, Fin Sekun wrote: >>> >>> >>> Hi, >>> >>> >>> *SCENARIO* >>> >>> Our Elasticsearch database has ~2.5 million entries. Each entry has the >>> three analyzed fields "match", "sec_match" and "thi_match" (all contains >>> 3-20 words) that will be used in this query: >>> https://gist.github.com/anonymous/a8d1142512e5625e4e91 >>> >>> >>> ES runs on two *types of servers*: >>> (1) Real servers (system has direct access to real CPUs, no >>> virtualization) of newest generation - Very performant! >>> (2) Cloud servers with virtualized CPUs - Poor CPUs, but this is generic >>> for cloud services. >>> >>> See https://gist.github.com/anonymous/3098b142c2bab51feecc for (1) and >>> (2) CPU details. >>> >>> >>> *ES settings:* >>> ES version 1.2.0 (jdk1.8.0_05) >>> ES_HEAP_SIZE = 512m (we also tested with 1024m with same results) >>> vm.max_map_count = 262144 >>> ulimit -n 64000 >>> ulimit -l unlimited >>> index.number_of_shards: 1 >>> index.number_of_replicas: 0 >>> index.store.type: mmapfs >>> threadpool.search.type: fixed >>> threadpool.search.size: 75 >>> threadpool.search.queue_size: 5000 >>> >>> >>> *Infrastructure*: >>> As you can see above, we don't use the cluster feature of ES (1 shard, 0 >>> replicas). The reason is that our hosting infrastructure is based on >>> different providers. >>> Upside: We aren't dependent on a single hosting provider. Downside: Our >>> servers aren't in the same LAN. >>> >>> This means: >>> - We cannot use ES sharding, because synchronisation via WAN (internet) >>> seems not a useful solution. >>> - So, every ES-server has the complete dataset and we configured only >>> one shard and no replicas for higher performance. >>> - We have a distribution process that updates the ES data on every host >>> frequently. This process is fine for us, because updates aren't very often >>> and perfect just-in-time ES synchronisation isn't necessary for our >>> business case. >>> - If a server goes down/crashs, the central loadbalancer removes it (the >>> resulting minimal packet lost is acceptable). >>> >>> >>> >>> >>> *PROBLEM* >>> >>> For long query terms (6 and more keywords), we have very high CPU loads, >>> even on the high performance server (1), and this leads to high response >>> times: 1-4sec on server (1), 8-20sec on server (2). The system parameters >>> while querying: >>> - Very high load (usually 100%) for the thread responsible CPU (the >>> other CPUs are idle in our test scenario) >>> - No I/O load (the harddisks are fine) >>> - No RAM bottlenecks >>> >>> So, we think the file caching is working fine, because we have no I/O >>> problems and the garbage collector seams to be happy (jstat shows very few >>> GCs). The CPU is the problem, and ES hot-threads point to the Scorer module: >>> https://gist.github.com/anonymous/9cecfd512cb533114b7d >>> >>> >>> >>> >>> *SUMMARY/ASSUMPTIONS* >>> >>> - Our database size isn't very big and the query not very complex. >>> - ES is designed for huge amount of data, but the key is >>> clustering/sharding: Data distribution to many servers means smaller >>> indices, smaller indices leads to fewer CPU load and short response times. >>> - So, our database isn't big, but to big for a single CPU and this means >>> especially low performance (virtual) CPUs can only be used in sharding >>> environments. >>> >>> If we don't want to lost the provider independency, we have only the >>> following two options: >>> >>> 1) Simpler query (I think not possible in our case) >>> 2) Smaller database >>> >>> >>> >>> >>> *QUESTIONS* >>> >>> Are our assumptions correct? Especially: >>> >>> - Is clustering/sharding (also small indices) the main key to >>> performance, that means the only possibility to prevent overloaded >>> (virtual) CPUs? >>> - Is it right that clustering is only useful/possible in LANs? >>> - Do you have any ES configuration or architecture hints regarding our >>> preference for using multiple hosting providers? >>> >>> >>> >>> Thank you. Rgds >>> Fin >>> >>> -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7e1b3a52-23b4-4a22-8433-985e07ae7904%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.