Re: Timeouts on Node Stats API?

2014-03-17 Thread Xiao Yu
I still don't have any definitive logs or traces that point to the exact 
cause of this situation, but it appears to be some weird scheduling bug with 
hyper-threading. Our nodes are running on OpenJDK 7u25 with hyper-threaded 
CPUs, which causes ES to report 2x the number of physical cores as 
"available_processors". After setting the "processors" setting to the number 
of physical cores we are no longer seeing this issue.
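(A minimal sketch of the relevant elasticsearch.yml line; the value 12 here is 
just an example for a box with 12 physical cores, not our actual count:)

# size thread pools against physical cores rather than hyper-threaded siblings
processors: 12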

We've been running in this configuration all weekend with no recurrence, 
whereas we were seeing this issue every couple of hours before.

--Xiao



Re: Timeouts on Node Stats API?

2014-03-14 Thread Xiao Yu


> Anything in the logs or slow logs?  You're sure slow GCs aren't impacting 
> performance?
>

There's nothing in the logs on the node other than slow queries before or 
during the problematic period. The slow query logs show the same types of 
queries that we already know to be slow (MLT queries with function rescores). 
As a percentage of queries executed, there is no bump in the number of slow 
queries before or during the problematic period. (Yes, during the problematic 
period the broken node actually executes and returns query requests; it's not 
clear to me whether it's simply routing queries to other nodes or whether its 
shards are actually executing queries as well.)

In addition, before the problem occurs there is no increase in ES threads or 
in heap or non-heap memory use, and the number of GC cycles remains consistent 
at about one every 2 seconds on the node. There are no long GC cycles and the 
node never drops from the cluster. During the problematic period the cluster 
reports that it's in a green state; however, all of our logging indicates that 
no indexing operations complete cluster-wide (we should be seeing 
100-500 / sec under normal load).



Re: Timeouts on Node Stats API?

2014-03-14 Thread Xiao Yu


> Can you do a hot_threads while this is happening?
>

Just for good measure I also checked hot threads for blocking and waiting; 
nothing interesting there either.
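(For reference, the hot_threads calls for the wait/block views look roughly 
like this; a sketch, assuming the HTTP API is reachable on localhost:9200 and 
using the default 500ms interval:)

# wait- and block-bound hot threads for the node in question
curl -s 'http://localhost:9200/_nodes/es1.global.search.sat.wordpress.com/hot_threads?type=wait&interval=500ms'
curl -s 'http://localhost:9200/_nodes/es1.global.search.sat.wordpress.com/hot_threads?type=block&interval=500ms'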

::: [es1.global.search.sat.wordpress.com][7fiNB_thTk-GRDKe4yQITA][inet[/76.74.248.134:9300]]{dc=sat, parity=1, master=false}

0.0% (0s out of 500ms) wait usage by thread 'Finalizer'
 10/10 snapshots sharing following 4 elements
   java.lang.Object.wait(Native Method)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

0.0% (0s out of 500ms) wait usage by thread 'Signal Dispatcher'
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot

0.0% (0s out of 500ms) wait usage by thread 'elasticsearch[es1.global.search.sat.wordpress.com][[timer]]'
 10/10 snapshots sharing following 2 elements
   java.lang.Thread.sleep(Native Method)
   org.elasticsearch.threadpool.ThreadPool$EstimatedTimeThread.run(ThreadPool.java:511)
 
::: [es1.global.search.sat.wordpress.com][7fiNB_thTk-GRDKe4yQITA][inet[/76.74.248.134:9300]]{dc=sat, parity=1, master=false}

0.0% (0s out of 500ms) block usage by thread 'Finalizer'
 10/10 snapshots sharing following 4 elements
   java.lang.Object.wait(Native Method)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

0.0% (0s out of 500ms) block usage by thread 'Signal Dispatcher'
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot

0.0% (0s out of 500ms) block usage by thread 'elasticsearch[es1.global.search.sat.wordpress.com][[timer]]'
 10/10 snapshots sharing following 2 elements
   java.lang.Thread.sleep(Native Method)
   org.elasticsearch.threadpool.ThreadPool$EstimatedTimeThread.run(ThreadPool.java:511)
 
While all this is happening, indexing operations start to see curl timeouts 
in our application logs.



Re: Timeouts on Node Stats API?

2014-03-13 Thread Xiao Yu
After restarting the node I see logs like those in the following gist, which 
seem to suggest there are some internal networking issues, perhaps?

https://gist.github.com/xyu/9541662




Re: Timeouts on Node Stats API?

2014-03-13 Thread Xiao Yu

>
> Can you do a hot_threads while this is happening?
>

I took a couple samples from 2 nodes that were experiencing this issue:

https://gist.github.com/xyu/9541604

It seems like the problematic nodes are just doing normal searching 
operations?



Re: Node not joining cluster on boot

2014-03-13 Thread Xiao Yu
Total shot in the dark here but try taking the hashmark out of the node 
names and see if that helps?
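(e.g. something along these lines in each node's elasticsearch.yml, purely as 
an illustration:)

# node names without the "#" character
node.name: "Node ES 1"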

On Thursday, March 13, 2014 5:31:30 AM UTC-4, Guillaume Loetscher wrote:
>
> Sure
>
> Node # 1:
> root@es_node1:~# grep -E '^[^#]' /etc/elasticsearch/elasticsearch.yml 
> cluster.name: logstash
> node.name: "Node ES #1"
> node.master: true
> node.data: true
> index.number_of_shards: 2
> index.number_of_replicas: 1
> discovery.zen.ping.timeout: 10s
>
> Node #2 :
> root@es_node2:~# grep -E '^[^#]' /etc/elasticsearch/elasticsearch.yml
> cluster.name: logstash
> node.name: "Node ES #2"
> node.master: true
> node.data: true
> index.number_of_shards: 2
> index.number_of_replicas: 1
> discovery.zen.ping.timeout: 10s
>
>
>
>
>
> On Thursday, March 13, 2014 at 10:15:16 UTC+1, David Pilato wrote:
>>
>> did you set the same cluster name on both nodes?
>>
>> -- 
>> *David Pilato* | *Technical Advocate* | *Elasticsearch.com*
>> @dadoonet  | 
>> @elasticsearchfr
>>
>>
>> On March 13, 2014 at 09:57:35, Guillaume Loetscher (ster...@gmail.com) 
>> wrote:
>>
>> Hi,
>>
>> First, thanks for the answers and remarks.
>>
>> You are both right, the issue I'm currently facing leads to a 
>> "split-brain" situation, where Node #1 & Node #2 are both masters, each 
>> living its own life on its side. I'll look at changing my configuration 
>> and the number of nodes in order to limit this situation (I already 
>> checked this article about split-brain in ES).
>>
>> However, this split-brain situation is the result of the problem with the 
>> discovery / broadcast, which is represented in the log of Node #2 here :
>>  [2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node ES #2] 
>> received ping response ping_response{target [[Node ES 
>> #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}], 
>> master [[Node ES 
>> #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}], 
>> cluster_name[logstash]} with no matching id [1]
>>  
>> So, the connectivity between Node #1 (which is the first one online, and 
>> therefore master) and Node #2 is established, as the log on Node #2 clearly 
>> says "received ping response", but with an ID that didn't match.
>>
>> This is apparently why Node #2 didn't join the cluster on Node #1, and 
>> this is the specific issue I want to resolve.
>>
>> Thanks,
>>
>> On Thursday, March 13, 2014 at 07:03:35 UTC+1, David Pilato wrote:
>>>
>>> Bonjour :-)
>>>
>>> You should set min_master_nodes to 2. Although I'd recommend having 3 
>>> nodes instead of 2.
>>>
>>> --
>>> David ;-) 
>>> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>>>  
>>> On March 12, 2014 at 23:58, Guillaume Loetscher wrote:
>>>
>>>  Hi,
>>>
>>> I've begun to test Elasticsearch recently, on a little mockup I've 
>>> designed.
>>>
>>> Currently, I'm running two nodes on two LXC (v0.9) containers. Those 
>>> containers are linked using veth to a bridge declared on the host.
>>>
>>> When I start the first node, the cluster starts, but when I start the 
>>> second node a bit later, it seems to get some information from the other 
>>> node, but it always ends with the same "no matching id" error.
>>>
>>> Here's what I'm doing :
>>>
>>> I start the LXC container of the first node :
>>>  root@lada:~# date && lxc-start -n es_node1 -d
>>> mercredi 12 mars 2014, 22:59:39 (UTC+0100)
>>>  
>>>
>>>
>>> I logon the node, check the log file :
>>>  [2014-03-12 21:59:41,927][INFO ][node ] [Node ES #1] 
>>> version[0.90.12], pid[1129], build[26feed7/2014-02-25T15:38:23Z]
>>> [2014-03-12 21:59:41,928][INFO ][node ] [Node ES #1] 
>>> initializing ...
>>> [2014-03-12 21:59:41,944][INFO ][plugins  ] [Node ES #1] 
>>> loaded [], sites []
>>> [2014-03-12 21:59:47,262][INFO ][node ] [Node ES #1] 
>>> initialized
>>> [2014-03-12 21:59:47,263][INFO ][node ] [Node ES #1] 
>>> starting ...
>>> [2014-03-12 21:59:47,485][INFO ][transport] [Node ES #1] 
>>> bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
>>> 172.16.0.100:9300]}
>>> [2014-03-12 21:59:57,573][INFO ][cluster.service  ] [Node ES #1] 
>>> new_master [Node ES 
>>> #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}, reason: 
>>> zen-disco-join (elected_as_master)
>>> [2014-03-12 21:59:57,657][INFO ][discovery] [Node ES #1] 
>>> logstash/LbMQazWXR9uB6Q7R2xLxGQ
>>> [2014-03-12 21:59:57,733][INFO ][http ] [Node ES #1] 
>>> bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
>>> 172.16.0.100:9200]}
>>> [2014-03-12 21:59:57,735][INFO ][node ] [Node ES #1] 
>>> started
>>> [2014-03-12 21:59:59,569][INFO ][gateway  ] [Node ES #1] 
>>> recovered [2] indices into cl

Re: Timeouts on Node Stats API?

2014-03-13 Thread Xiao Yu
After restarting nodes I'm also getting a bunch of errors for calls to the 
index stats API *after* the node has come back up. Seems like there's some 
issue here where a stats API call fails, does not time out, and causes a 
backup of other calls until a thread pool is full?

Mar 13 12:22:31 esm1.global.search.sat.wordpress.com [2014-03-13 
12:22:31,063][DEBUG][action.admin.indices.stats] 
[esm1.global.search.sat.wordpress.com] [test-lang-analyzers-0][18], 
node[pnlbpsoZRlWbVfgIPSb9vg], [R], s[STARTED]: Failed to execute 
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@35193b8f]
Mar 13 12:22:31 esm1.global.search.sat.wordpress.com 
org.elasticsearch.transport.NodeDisconnectedException: 
[es4.global.search.sat.wordpress.com][inet[es4.global.search.sat.wordpress.com/76.74.248.144:9300]][indices/stats/s]
 
disconnected

It looks like these requests get put into the "MANAGEMENT" thread pool 
(https://github.com/elasticsearch/elasticsearch/search?q=ThreadPool.Names.MANAGEMENT&type=Code), 
which we've left at the default configs.
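(For what it's worth, a rough way to watch that pool while this is happening; 
just a sketch, and on 0.90 the exact stats URL / flags may differ, e.g. 
/_cluster/nodes/stats:)

# look at queue depth / rejections for the management pool on each node
curl -s 'http://localhost:9200/_nodes/stats?thread_pool=true&pretty'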

On Wednesday, March 12, 2014 5:13:35 PM UTC-4, Xiao Yu wrote:
>
> Hello,
>
> We have a cluster that's still running on 0.90.9 and it's recently 
> developed an interesting issue. Random (data) nodes within our cluster will 
> occasionally stop responding to the node stats API and we see errors like 
> the following in our cluster logs on the master node.
>
> Mar 12 20:53:17 esm1.global.search.iad.wordpress.com [2014-03-12 
> 20:53:17,945][DEBUG][action.admin.cluster.node.stats] [
> esm1.global.search.iad.wordpress.com] failed to execute on node 
> [CBIR6UWfSvqPIHOSgJ3c2Q]
> Mar 12 20:53:17 esm1.global.search.iad.wordpress.com 
> org.elasticsearch.transport.ReceiveTimeoutTransportException: [
> es3.global.search.iad.wordpress.com][inet[
> 66.155.9.130/66.155.9.130:9300]][cluster/nodes/stats/n] 
> request_id [12395955] timed out after [15001ms]
> Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at 
> org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
> Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at 
> java.lang.Thread.run(Thread.java:724)
>
> While this is happening our cluster appears to function "normally": 
> queries made to the problematic node process and return normally, and 
> judging by the eth traffic and load on the box it appears to even be 
> handling queries and rebalancing shards. The only way to solve this 
> problem appears to be to reboot the node, at which point it disconnects, 
> then rejoins the cluster and functions like any other node.
>
> Mar 12 21:02:45 esm1.global.search.iad.wordpress.com [2014-03-12 
> 21:02:45,991][INFO ][action.admin.cluster.node.shutdown] [
> esm1.global.search.iad.wordpress.com] [partial_cluster_shutdown]: 
> requested, shutting down [[CBIR6UWfSvqPIHOSgJ3c2Q]] in [1s]
> Mar 12 21:02:46 esm1.global.search.iad.wordpress.com [2014-03-12 
> 21:02:46,995][INFO ][action.admin.cluster.node.shutdown] [
> esm1.global.search.iad.wordpress.com] [partial_cluster_shutdown]: done 
> shutting down [[CBIR6UWfSvqPIHOSgJ3c2Q]]
> ...
> Mar 12 21:03:40 esm1.global.search.iad.wordpress.com 
> org.elasticsearch.transport.NodeDisconnectedException: [
> es3.global.search.iad.wordpress.com][inet[
> 66.155.9.130/66.155.9.130:9300]][indices/stats/s] disconnected
> ...
> Mar 12 21:03:41 esm1.global.search.iad.wordpress.com [2014-03-12 
> 21:03:41,045][INFO ][cluster.service  ] [
> esm1.global.search.iad.wordpress.com] removed {[
> es3.global.search.iad.wordpress.com][CBIR6UWfSvqPIHOSgJ3c2Q][inet[
> 66.155.9.130/66.155.9.130:9300]]{dc=iad, 
> parity=1, master=false},}, reason: zen-disco-node_left([
> es3.global.search.iad.wordpress.com][CBIR6UWfSvqPIHOSgJ3c2Q][inet[/66.155.9.130:9300]]{dc=iad,
>  
> parity=1, master=false})
> Mar 12 21:04:29 esm1.global.search.iad.wordpress.com [2014-03-12 
> 21:04:29,077][INFO ][cluster.service  ] [
> esm1.global.search.iad.wordpress.com] added {[
> es3.global.search.iad.wordpress.com][4Nj-a0SxRTuagLelneThQg][inet[/66.155.9.130:9300]]{dc=iad,
>  
> parity=1, master=false},}, reason: zen-disco-receive(join from node[[
> es3.global.search.iad.wordpress.com][4Nj-a0SxRTuagLelneT

Re: Node not joining cluster on boot

2014-03-12 Thread Xiao Yu
Sounds like you have a standard split-brain problem. The best way to solve 
this is to set discovery.zen.minimum_master_nodes to 2 for your cluster, so 
that both nodes must be up to elect a single master. This does mean your 
cluster will not function with just one node.
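(A minimal sketch of the setting, to go in elasticsearch.yml on both nodes; in 
general the value should be a quorum of master-eligible nodes, i.e. 
floor(N/2) + 1:)

# require a quorum of 2 master-eligible nodes before electing a master
discovery.zen.minimum_master_nodes: 2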

On Wednesday, March 12, 2014 6:58:16 PM UTC-4, Guillaume Loetscher wrote:
>
> Hi,
>
> I've begun to test Elasticsearch recently, on a little mockup I've 
> designed.
>
> Currently, I'm running two nodes on two LXC (v0.9) containers. Those 
> containers are linked using veth to a bridge declared on the host.
>
> When I start the first node, the cluster starts, but when I start the 
> second node a bit later, it seems to get some information from the other 
> node, but it always ends with the same "no matching id" error.
>
> Here's what I'm doing :
>
> I start the LXC container of the first node : 
> root@lada:~# date && lxc-start -n es_node1 -d
> mercredi 12 mars 2014, 22:59:39 (UTC+0100)
>
>
>
> I logon the node, check the log file :
> [2014-03-12 21:59:41,927][INFO ][node ] [Node ES #1] 
> version[0.90.12], pid[1129], build[26feed7/2014-02-25T15:38:23Z]
> [2014-03-12 21:59:41,928][INFO ][node ] [Node ES #1] 
> initializing ...
> [2014-03-12 21:59:41,944][INFO ][plugins  ] [Node ES #1] 
> loaded [], sites []
> [2014-03-12 21:59:47,262][INFO ][node ] [Node ES #1] 
> initialized
> [2014-03-12 21:59:47,263][INFO ][node ] [Node ES #1] 
> starting ...
> [2014-03-12 21:59:47,485][INFO ][transport] [Node ES #1] 
> bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
> 172.16.0.100:9300]}
> [2014-03-12 21:59:57,573][INFO ][cluster.service  ] [Node ES #1] 
> new_master [Node ES 
> #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}, reason: 
> zen-disco-join (elected_as_master)
> [2014-03-12 21:59:57,657][INFO ][discovery] [Node ES #1] 
> logstash/LbMQazWXR9uB6Q7R2xLxGQ
> [2014-03-12 21:59:57,733][INFO ][http ] [Node ES #1] 
> bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
> 172.16.0.100:9200]}
> [2014-03-12 21:59:57,735][INFO ][node ] [Node ES #1] 
> started
> [2014-03-12 21:59:59,569][INFO ][gateway  ] [Node ES #1] 
> recovered [2] indices into cluster_state
>
>
>
>  Then I start the second node :
> root@lada:/var/lib/lxc/kibana# date && lxc-start -n es_node2 -d
> mercredi 12 mars 2014, 23:02:59 (UTC+0100)
>
>
>
> Logon on the second node, and open the log :
> [2014-03-12 22:03:02,126][INFO ][node ] [Node ES #2] 
> version[0.90.12], pid[1128], build[26feed7/2014-02-25T15:38:23Z]
> [2014-03-12 22:03:02,127][INFO ][node ] [Node ES #2] 
> initializing ...
> [2014-03-12 22:03:02,141][INFO ][plugins  ] [Node ES #2] 
> loaded [], sites []
> [2014-03-12 22:03:07,352][INFO ][node ] [Node ES #2] 
> initialized
> [2014-03-12 22:03:07,352][INFO ][node ] [Node ES #2] 
> starting ...
> [2014-03-12 22:03:07,557][INFO ][transport] [Node ES #2] 
> bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
> 172.16.0.101:9300]}
> [2014-03-12 22:03:17,637][INFO ][cluster.service  ] [Node ES #2] 
> new_master [Node ES 
> #2][0nNCsZrFS6y95G1ld-v_rA][inet[/172.16.0.101:9300]]{master=true}, reason: 
> zen-disco-join (elected_as_master)
> [2014-03-12 22:03:17,718][INFO ][discovery] [Node ES #2] 
> logstash/0nNCsZrFS6y95G1ld-v_rA
> [2014-03-12 22:03:17,783][INFO ][http ] [Node ES #2] 
> bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
> 172.16.0.101:9200]}
> [2014-03-12 22:03:17,785][INFO ][node ] [Node ES #2] 
> started
> [2014-03-12 22:03:19,550][INFO ][gateway  ] [Node ES #2] 
> recovered [2] indices into cluster_state
> [2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node ES #2] 
> received ping response ping_response{target [[Node ES 
> #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}], master 
> [[Node ES 
> #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}], 
> cluster_name[logstash]} with no matching id [1]
>
>
> At that point, each node considered themselves as master.
>
> Here's my configuration for each node (same for node 1, except the 
> node.name) :
> cluster.name: logstash
> node.name: "Node ES #2"
> node.master: true
> node.data: true
> index.number_of_shards: 2
> index.number_of_replicas: 1
> discovery.zen.ping.timeout: 10s
>
> The bridge on my host is setup to forward immediately every new interfaces 
> so I don't think the problem is here. Here's the bridge config :
> auto br1
> iface br1 inet static
> address 172.16.0.254
> netmask 255.255.255.0
> bridge_ports regex veth_.*
> bridge_spt off
> bridge_maxw

Timeouts on Node Stats API?

2014-03-12 Thread Xiao Yu
Hello,

We have a cluster that's still running on 0.90.9 and it's recently 
developed an interesting issue. Random (data) nodes within our cluster will 
occasionally stop responding to the node stats API and we see errors like 
the following in our cluster logs on the master node.

Mar 12 20:53:17 esm1.global.search.iad.wordpress.com [2014-03-12 
20:53:17,945][DEBUG][action.admin.cluster.node.stats] 
[esm1.global.search.iad.wordpress.com] failed to execute on node 
[CBIR6UWfSvqPIHOSgJ3c2Q]
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com 
org.elasticsearch.transport.ReceiveTimeoutTransportException: 
[es3.global.search.iad.wordpress.com][inet[66.155.9.130/66.155.9.130:9300]][cluster/nodes/stats/n]
 
request_id [12395955] timed out after [15001ms]
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at 
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at 
java.lang.Thread.run(Thread.java:724)

While this is happening our cluster appears to function "normally": queries 
made to the problematic node process and return normally, and judging by the 
eth traffic and load on the box it appears to even be handling queries and 
rebalancing shards. The only way to solve this problem appears to be to 
reboot the node, at which point it disconnects, then rejoins the cluster and 
functions like any other node.

Mar 12 21:02:45 esm1.global.search.iad.wordpress.com [2014-03-12 
21:02:45,991][INFO ][action.admin.cluster.node.shutdown] 
[esm1.global.search.iad.wordpress.com] [partial_cluster_shutdown]: 
requested, shutting down [[CBIR6UWfSvqPIHOSgJ3c2Q]] in [1s]
Mar 12 21:02:46 esm1.global.search.iad.wordpress.com [2014-03-12 
21:02:46,995][INFO ][action.admin.cluster.node.shutdown] 
[esm1.global.search.iad.wordpress.com] [partial_cluster_shutdown]: done 
shutting down [[CBIR6UWfSvqPIHOSgJ3c2Q]]
...
Mar 12 21:03:40 esm1.global.search.iad.wordpress.com 
org.elasticsearch.transport.NodeDisconnectedException: 
[es3.global.search.iad.wordpress.com][inet[66.155.9.130/66.155.9.130:9300]][indices/stats/s]
 
disconnected
...
Mar 12 21:03:41 esm1.global.search.iad.wordpress.com [2014-03-12 
21:03:41,045][INFO ][cluster.service  ] 
[esm1.global.search.iad.wordpress.com] removed 
{[es3.global.search.iad.wordpress.com][CBIR6UWfSvqPIHOSgJ3c2Q][inet[66.155.9.130/66.155.9.130:9300]]{dc=iad,
 
parity=1, master=false},}, reason: 
zen-disco-node_left([es3.global.search.iad.wordpress.com][CBIR6UWfSvqPIHOSgJ3c2Q][inet[/66.155.9.130:9300]]{dc=iad,
 
parity=1, master=false})
Mar 12 21:04:29 esm1.global.search.iad.wordpress.com [2014-03-12 
21:04:29,077][INFO ][cluster.service  ] 
[esm1.global.search.iad.wordpress.com] added 
{[es3.global.search.iad.wordpress.com][4Nj-a0SxRTuagLelneThQg][inet[/66.155.9.130:9300]]{dc=iad,
 
parity=1, master=false},}, reason: zen-disco-receive(join from 
node[[es3.global.search.iad.wordpress.com][4Nj-a0SxRTuagLelneThQg][inet[/66.155.9.130:9300]]{dc=iad,
 
parity=1, master=false}])

All this sometimes causes shards to relocate or go into a recovery state 
needlessly.

Nothing appears in the problematic node's logs aside from some slow queries 
similar to what's on all the other data nodes, and looking at our monitoring 
systems there are no anomalies in CPU / memory / thread usage.

Any ideas as to what else I should check?
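(For anyone following along, the stats call that is timing out looks roughly 
like the following; a sketch, and on 0.90 the REST path may be 
/_cluster/nodes/.../stats rather than /_nodes/.../stats:)

# node stats for the problematic node; this is the request that times out after ~15s
curl -s 'http://localhost:9200/_nodes/es3.global.search.iad.wordpress.com/stats?pretty'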
