There is (literally) nothing in the log of either data node after the 
node-joined events, and nothing in the master log between index recovery and 
the first error message.

No queries are run before the errors start occurring (access to the nodes is 
blocked by a firewall, so the only communication is between the nodes 
themselves). We have 50% of the RAM on each node allocated to the heap (4GB 
each).
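
For reference, the 4GB heaps here are set via ES_HEAP_SIZE; assuming the 
Ubuntu .deb install (the path is an assumption), that lives in 
/etc/default/elasticsearch:

```shell
# /etc/default/elasticsearch (path assumes the Ubuntu/Debian package install)
ES_HEAP_SIZE=4g
```

The allocation can then be double-checked per node with something like 
`curl localhost:9200/_cat/nodes?v`, which should report each node's heap.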

This cluster operated without issue under 1.1.2. Did something change 
between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?


On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>
> Generally, ReceiveTimeoutTransportException is due to network disconnects 
> or to a node failing to respond under heavy load. What does the log 
> of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little heap 
> allocated. The rule of thumb is 1/2 of available memory, but <= 31GB.
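
(As an aside, that sizing rule of thumb — half of physical RAM, capped below 
32GB so the JVM keeps using compressed object pointers — can be sketched as:)

```shell
# Sketch of the heap sizing rule of thumb: half of physical RAM,
# capped at 31 GB to stay under the compressed-oops cutoff.
recommended_heap_gb() {
    ram_gb=$1
    half=$(( ram_gb / 2 ))
    if [ "$half" -gt 31 ]; then echo 31; else echo "$half"; fi
}

recommended_heap_gb 8     # prints 4 — matches the 4GB heaps on these nodes
recommended_heap_gb 128   # prints 31
```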
>
> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>
>>
>> ES Version: 1.3.5
>>
>> OS: Ubuntu 14.04.1 LTS
>>
>> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM at AWS
>>
>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>
>>
>> *After upgrading from ES 1.1.2...*
>>
>>
>> 1. Startup ES on master
>> 2. All nodes join cluster
>> 3. [2014-12-03 20:30:54,789][INFO ][gateway                  ] 
>> [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>> 4. Checked health a few times
>>
>>
>> curl -XGET localhost:9200/_cat/health?v
>>
>>
>> 5. 6 minutes after cluster recovery initiates (and 5:20 after the 
>> recovery finishes), the log on the master node (10.0.1.18) reports:
>>
>>
>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] 
>> [ip-10-0-1-18.ec2.internal] failed to execute on node 
>> [pYi3z5PgRh6msJX_armz_A]
>>
>> org.elasticsearch.transport.ReceiveTimeoutTransportException: 
>> [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] 
>> request_id [17564] timed out after [15001ms]
>>
>> at 
>> org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>> 6. Every 30 or 60 seconds, the above error is reported for one or more of 
>> the data nodes
>>
>> 7. During this time, requests (search, index, etc.) don’t return. They 
>> hang until the error state temporarily resolves itself (after a varying 
>> period of roughly 15-20 minutes), at which point the expected result is 
>> returned.
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/50dfaccc-b8c6-4f72-afad-d641078d42e5%40googlegroups.com.