Re: elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this

2014-11-27 Thread Martijn v Groningen
This looks like a mapping issue to me (not 100% sure). A document that is
in the translog has a string field (with value: 'finished'), but that field
is mapped as a number field (long, integer, double, etc.) in the mapping.
This causes the number format exception you're seeing in your logs when the
document is replayed from the translog as part of recovery, which then
prevents the shard from getting started.
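
To illustrate (just a minimal sketch with a made-up index and field name, not
taken from your cluster): with dynamic mapping the first document fixes the
field type, and a later document that sends a string for the same field is
then rejected with the same root cause you see in the recovery logs:

# the first document dynamically maps the field as a long
curl -XPUT 'localhost:9200/test-index/logs/1' -d '{"state": 1}'
# a later document with a string value for that field fails to parse
curl -XPUT 'localhost:9200/test-index/logs/2' -d '{"state": "finished"}'
# => MapperParsingException[failed to parse [state]];
#    nested: NumberFormatException[For input string: "finished"]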

These problems can occur when new fields are introduced at index time and
also when numeric_detection is enabled in the mapping (which makes these
errors more likely). Is this the case in your ES setup?

Can you also check the mappings of the logstash-2014.11.27 index and see
which fields could possibly contain 'finished'? Unfortunately the field name
isn't included in the errors.
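
If it helps, the full mapping can be pulled with the request below; look for
fields typed as long/integer/double that could plausibly receive a value like
'finished', and check whether numeric_detection is set to true anywhere:

curl 'localhost:9200/logstash-2014.11.27/_mapping?pretty'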

On 27 November 2014 at 11:19, Jilles van Gurp jillesvang...@gmail.com
wrote:

 Our production cluster went yellow last night after our logstash index
 rolled over to the next version. I've seen this happen before but this time
 I decided to properly diagnose and seek some feedback on what might be
 going on.

 So, I'd love some feedback on what is going on. I'm happy to keep this
 cluster in a yellow state for a limited time to get some help from people
 in this group trying to diagnose this properly and maybe help some others
 who face the same issues. However, I will need to fix this one way or
 another before end of business day today. I plan to perform a rolling
 restart to see if node reinitialization fixes things. If not, I'll remove
 the problematic logstash index and move on. I'd love suggestions for less
 intrusive solutions. I don't like losing data and rolling restarts are kind
 of tedious to babysit. Tends to take 45 minutes or so.
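
 If I do end up doing the rolling restart, the rough per-node sequence would
 be something like the sketch below (disable allocation, restart the node,
 re-enable allocation, wait for recovery, move on to the next node):

 # keep shards from being shuffled around while a node is down
 curl -XPUT 'localhost:9200/_cluster/settings' -d '
 { "transient" : { "cluster.routing.allocation.enable" : "none" } }'
 # restart elasticsearch on the node and wait for it to rejoin, then:
 curl -XPUT 'localhost:9200/_cluster/settings' -d '
 { "transient" : { "cluster.routing.allocation.enable" : "all" } }'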

 Below is some information I've gathered. Let me know if you need me to
 extract more data.

 First the obvious:

 {
   "status" : 200,
   "name" : "192.168.1.13",
   "cluster_name" : "linko_elasticsearch",
   "version" : {
     "number" : "1.4.0",
     "build_hash" : "bc94bd81298f81c656893ab1d30a99356066",
     "build_timestamp" : "2014-11-05T14:26:12Z",
     "build_snapshot" : false,
     "lucene_version" : "4.10.2"
   },
   "tagline" : "You Know, for Search"
 }

 [linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
 {
   "cluster_name" : "linko_elasticsearch",
   "status" : "yellow",
   "timed_out" : false,
   "number_of_nodes" : 5,
   "number_of_data_nodes" : 3,
   "active_primary_shards" : 221,
   "active_shards" : 619,
   "relocating_shards" : 0,
   "initializing_shards" : 2,
   "unassigned_shards" : 1
 }

 So we're yellow and the reason is initializing and unassigned shards. We
 have five nodes, of which three are data nodes. It seems we are hitting
 some kind of resilience issue. The three machines have plenty of disk space
 and memory.
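
 To see exactly which shards are stuck, rather than just the counts, something
 like this should list them:

 curl 'localhost:9200/_cat/shards?v' | grep -v STARTED
 # prints index, shard number, p/r (primary or replica), state and node for
 # every shard that is not STARTED, i.e. the INITIALIZING/UNASSIGNED ones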

 I found this in the log of one of our es nodes:
 [2014-11-27 10:15:12,585][WARN ][cluster.action.shard ] [192.168.1.13]
 [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
 node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
 [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message
 [RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from
 [192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/
 192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][
 es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested:
 RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]];
 nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution
 failed]; nested:
 RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]];
 nested: NumberFormatException[For input string: "finished"]; ]]

 On the mentioned node there's a corresponding message:
 [2014-11-27 10:17:54,187][WARN ][cluster.action.shard ] [192.168.1.14]
 [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
 node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
 [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform
 [indices:data/write/bulk[s]] on replica, message
 [RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]];
 nested: NumberFormatException[For input string: "finished"]; ]]

 All three data nodes have similar messages happening over and over again.

 Our cluster has been up for a couple of weeks and seems pretty happy
 otherwise. I deleted some older logstash indices a few days ago. The
 cluster has logstash data and a few smallish indices we use for our
 inbot.io service. The issue appears to be related to the logstash index
 rollover. Our app servers and kibana talk to the two non-data nodes that we
 run on both our application servers.

 My next stop was kibana which we use on the same cluster with the logstash
 index that is probably causing us issues. Looking at that, I noticed a few
 interesting things:

- logstash indexing seems to be fine (good) and it appears there has been no
data loss yet

Re: elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this

2014-11-27 Thread Jilles van Gurp
Thanks for the explanation. I suspect many logstash users might be running
into this one, since you typically use a dynamic mapping with logstash. We
have some idea where this is happening, though, and we can probably fix it
properly. This happened during index rollover, and we are indeed indexing a
lot of things via logstash almost continuously.
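
Our likely fix is to pin the type of the suspect field in the logstash index
template so that tomorrow's index can't dynamically map it as a number again.
Roughly like the sketch below; the 'status' field name is only a placeholder
until we've confirmed which field it actually is:

curl -XPUT 'localhost:9200/_template/logstash_string_fix' -d '
{
  "template" : "logstash-*",
  "mappings" : {
    "_default_" : {
      "numeric_detection" : false,
      "properties" : {
        "status" : { "type" : "string" }
      }
    }
  }
}'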

Jilles




Re: elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this

2014-11-27 Thread Martijn v Groningen
If the field you suspect is causing this is a string field in the mapping,
then you can try to close and open the index. This will sync the in-memory
representation of the mapping with what is in the cluster state.
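
Something like this (note that the index won't serve searches or accept
writes while it is closed):

curl -XPOST 'localhost:9200/logstash-2014.11.27/_close'
curl -XPOST 'localhost:9200/logstash-2014.11.27/_open'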


Re: elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this

2014-11-27 Thread Jilles van Gurp
BTW, I should mention that I also filed a bug for this earlier today:
 https://github.com/elasticsearch/elasticsearch/issues/8684

Clinton Gormley kindly replied to that and provided some additional insight.

It indeed seems our mapping is part of the problem, but there's also the ES
side of things, where it shouldn't get into this state. Apparently a fix for
that part is coming.

Best,

Jilles

On Thursday, November 27, 2014 11:19:20 AM UTC+1, Jilles van Gurp wrote:

 My next stop was kibana which we use on the same cluster with the logstash 
 index that is probably causing us issues. Looking at that, I noticed a few 
 interesting things:

- logstash indexing seems to be fine (good) and it appears there has 
been no data loss yet
- our cpu load jumped around midnight and has sort of stayed up on all
three nodes. We measure this using collectd, and both mean and max load
jumped to around 1 at about the time the index rollover happened.

 My next step was using curl -XGET 'localhost:9200/_cat/recovery?v'

 All the