Hi Elasticsearch list :)

I'm having some trouble running Elasticsearch on r3.large (HVM
virtualization) instances in AWS. The short story is that, as soon as I put
any significant load on them, some requests take a very long time (for
example, Indices Stats) and I see disconnect/timeout errors in the logs.
Has anyone else experienced similar things, or does anyone have ideas for a
solution other than avoiding HVM instances?

More detailed symptoms:
- if there's very little load on them (say, 2GB of data on each node, a few
queries and indexing operations), all is well
- by "significant load", I mean some 10GB of data, a few queries per
minute, 100 docs indexed per second (4K per doc, <10 fields). By no means
"overload", CPU rarely tops 20%, no significant GC, nothing suspicious in
any of the metrics SPM <http://sematext.com/spm/> collects. The only clue
is that, for the time the problem appears, we get heartbeat alerts because
requests to the stats APIs take too long
- by "some requests take very long time", I mean that some queries take
miliseconds (as I would expect them), and some take 10 minutes or so.
Eventually succeeding (at least this was the case for the manual requests
I've sent)
- sometimes, nodes get temporarily dropped from the cluster, but things
quickly come back to green. Sometimes, though, shards get stuck while
relocating
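
For reference, here's a minimal sketch of the kind of load and latency
check I'm describing. Host and index names are placeholders, and it's
plain-stdlib Python rather than the actual setup; the numbers match the
figures above:

import json
import time
import urllib.request

ES = "http://es01:9200"  # placeholder node address
# ~4K per doc, <10 fields, as described above
DOC = json.dumps({"field%d" % i: "x" * 400 for i in range(10)}).encode()

def request(path, data=None):
    # POST when a body is given, GET otherwise; returns elapsed seconds
    req = urllib.request.Request(ES + path, data=data,
                                 headers={"Content-Type": "application/json"})
    start = time.time()
    urllib.request.urlopen(req).read()
    return time.time() - start

while True:  # runs until interrupted
    tick = time.time()
    for _ in range(100):            # ~100 docs indexed per second
        request("/load-test/doc", DOC)
    took = request("/_stats")       # the Indices Stats call that stalls
    print("indices stats took %.3fs" % took)
    time.sleep(max(0.0, 1.0 - (time.time() - tick)))

On the m3.large cluster the stats call comes back in milliseconds under
this load; on r3.large it's where the multi-minute responses show up.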

Things I've tried:
- different ES versions and machine sizes: the same problem appears on
0.90.7 with r3.xlarge instances; I'm currently on 1.1.1 with r3.large
- tore down all machines, launched new ones, and redeployed. Same thing
- different JVM (1.7) versions: Oracle u25, u45, u55, u60, OpenJDK u51.
Same thing everywhere
- spawned the same number of m3.large machines (same specs as r3.large,
except with half the RAM and paravirtual instead of HVM virtualization).
The problem magically went away with the same data and load

Here are some Node Disconnected exceptions:
[2014-06-18 13:05:35,058][WARN ][search.action            ] [es01] Failed to send release search context
org.elasticsearch.transport.NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][search/freeContext] disconnected
[2014-06-18 13:05:35,058][DEBUG][action.admin.indices.stats] [es01] [83f0223f-4222-4a57-a918-ff424924f002_2014-05-20][1], node[oOlO-iewR3qnAuQkT28vfw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@3339f285]
org.elasticsearch.transport.NodeDisconnectedException: [es02][inet[/10.140.1.84:9300]][indices/stats/s] disconnected

I've enabled TRACE logging on both transport and discovery, and all I see
are connection timeouts and exceptions, like:

[2014-06-16 07:29:19,039][TRACE][transport.netty          ] [es01] close connection exception caught on transport layer [[id: 0x190d8444]], disconnecting from relevant node

Or, more verbose:

[2014-06-16 07:29:19,060][TRACE][transport.netty          ] [es01] connect exception caught on transport layer [[id: 0x6816c0fe]]
org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: es03/10.171.39.244:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2014-06-16 07:29:19,060][TRACE][discovery.zen.ping.unicast] [es01] [1] failed to connect to [#zen_unicast_7#][es01][inet[es04/10.79.155.249:9300]]
org.elasticsearch.transport.ConnectTransportException: [][inet[es04/10.79.155.249:9300]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:683)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:643)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:610)
    at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:133)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:279)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: es03/10.171.39.244:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    ... 3 more
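
In case anyone wants to reproduce the same logging: I believe logger
levels can also be bumped at runtime through the cluster settings API on
1.x, along these lines (a sketch, with the same placeholder host as above):

import json
import urllib.request

# Bump transport/discovery loggers to TRACE at runtime via the cluster
# settings API (dynamic "logger.*" settings; believed to work on 1.x)
body = json.dumps({
    "transient": {
        "logger.transport": "TRACE",
        "logger.discovery": "TRACE",
    }
}).encode()
req = urllib.request.Request("http://es01:9200/_cluster/settings",
                             data=body, method="PUT")
print(urllib.request.urlopen(req).read())

Setting them in logging.yml and restarting should work just as well, but
the transient settings are easier to roll back once the logs get noisy.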

I'd appreciate any information, pointers, or intuition you may have!

Thanks and best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
