Re: elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this
This looks like a mapping issue to me (not 100% sure). A document in the translog has a string field (with the value 'finished'), but that field is mapped as a number field (long, integer, double, etc.) in the mapping. This causes the number format exception you're seeing in your logs when that document is replayed from the translog as part of recovery, which in turn prevents the shard from getting started. These problems can occur when new fields are introduced at index time, and also when numeric_detection is enabled in the mapping (which makes these errors more likely). Is this the case in your ES setup? Can you also check the mappings of the logstash-2014.11.27 index and see which fields could possibly contain 'finished'? Unfortunately the field name didn't get included in your errors.
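For anyone wanting to check this, pulling the full mapping and scanning it for numeric fields is quick. A rough sketch, assuming the default port; the grep pattern only matches the pretty-printed output and is just a coarse filter:

    curl -XGET 'localhost:9200/logstash-2014.11.27/_mapping?pretty' > mapping.json
    # numeric fields are the candidates that could choke on the string 'finished'
    grep -B2 -E '"type" : "(long|integer|short|double|float)"' mapping.json
    # also check whether numeric detection is enabled anywhere
    grep numeric_detection mapping.json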
On 27 November 2014 at 11:19, Jilles van Gurp jillesvang...@gmail.com wrote:

Our production cluster went yellow last night after our logstash index rolled over to the next version. I've seen this happen before, but this time I decided to diagnose it properly and seek some feedback on what might be going on. I'm happy to keep this cluster in a yellow state for a limited time to get some help from people in this group diagnosing it properly, and maybe help some others who face the same issues. However, I will need to fix this one way or another before the end of the business day today. I plan to perform a rolling restart to see if node reinitialization fixes things. If not, I'll remove the problematic logstash index and move on. I'd love suggestions for less intrusive solutions. I don't like losing data, and rolling restarts are tedious to babysit; they tend to take 45 minutes or so.

Below is some information I've gathered. Let me know if you need me to extract more data. First the obvious:

    {
      "status" : 200,
      "name" : "192.168.1.13",
      "cluster_name" : "linko_elasticsearch",
      "version" : {
        "number" : "1.4.0",
        "build_hash" : "bc94bd81298f81c656893ab1d30a99356066",
        "build_timestamp" : "2014-11-05T14:26:12Z",
        "build_snapshot" : false,
        "lucene_version" : "4.10.2"
      },
      "tagline" : "You Know, for Search"
    }

    [linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
    {
      "cluster_name" : "linko_elasticsearch",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 5,
      "number_of_data_nodes" : 3,
      "active_primary_shards" : 221,
      "active_shards" : 619,
      "relocating_shards" : 0,
      "initializing_shards" : 2,
      "unassigned_shards" : 1
    }

So we're yellow, and the reason is initializing and unassigned shards. We have five nodes, of which three are data nodes. It seems we are hitting some kind of resilience issue. The three machines have plenty of disk space and memory.

I found this in the log of one of our es nodes:

    [2014-11-27 10:15:12,585][WARN ][cluster.action.shard ] [192.168.1.13] [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message [RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from [192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested: RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution failed]; nested: RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]]; nested: NumberFormatException[For input string: finished]; ]]

On the mentioned node there's a corresponding message:

    [2014-11-27 10:17:54,187][WARN ][cluster.action.shard ] [192.168.1.14] [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]]; nested: NumberFormatException[For input string: finished]; ]]

All three data nodes have similar messages happening over and over again. Our cluster has been up for a couple of weeks and seems pretty happy otherwise. I deleted some older logstash indices a few days ago. The cluster has logstash data and a few smallish indices we use for our inbot.io service. The issue appears to be related to the logstash index rollover. Our app servers and kibana talk to the two non-data nodes that we run on both our application servers.

My next stop was kibana, which we use on the same cluster with the logstash index that is probably causing us issues. Looking at that, I noticed a few interesting things:

- logstash indexing seems to be fine (good) and it appears there has been no data loss yet
- our cpu load jumped around midnight and sort of stayed up on all three nodes. We measure this using collectd, and both mean and max load jumped to around 1 around the time the index rollover happened.

My next step was using curl -XGET 'localhost:9200/_cat/recovery?v'. All the
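As a side note, the cat APIs (available in 1.4) can list exactly which shards are stuck. A small sketch, assuming the default host and port:

    # header line plus any shard that is not in STARTED state
    curl -s 'localhost:9200/_cat/shards?v' | grep -v STARTED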
Re: elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this
Thanks for the explanation. I suspect many logstash users might be running into this one, since you typically use a dynamic mapping with logstash. We have some idea of where this is happening, though, and we can probably fix it properly. It happened during index rollover, and we are indeed indexing a lot of things via logstash almost continuously.

Jilles

On Thursday, November 27, 2014 4:06:21 PM UTC+1, Martijn v Groningen wrote:

This looks like a mapping issue to me (not 100% sure). A document that is in the translog has a string field (with value: 'finished'), but it is mapped as a number field in the mapping. [...]
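If the offending field can be identified, one way to fix this properly for future rollovers is to pin its type in an index template, so dynamic mapping can't derive a numeric type from the first event of the day. A minimal sketch using the 1.x template API; the template name is made up and 'status' is a placeholder, since the actual field name never made it into the error:

    curl -XPUT 'localhost:9200/_template/logstash_string_fields' -d '{
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "numeric_detection": false,
          "properties": {
            "status": { "type": "string" }
          }
        }
      }
    }'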
Re: elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this
If the field you suspect is causing this is a string field in the mapping, then you can try to close and open the index. This will sync the in-memory representation of the mapping with what is in the cluster state.

On 27 November 2014 at 16:49, Jilles van Gurp jillesvang...@gmail.com wrote:

Thanks for the explanation. I suspect many logstash users might be running into this one, since you typically use a dynamic mapping with logstash. [...]
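For reference, that close/open round trip is just two calls (assuming the default host and port; note the index briefly stops serving requests while closed):

    curl -XPOST 'localhost:9200/logstash-2014.11.27/_close'
    curl -XPOST 'localhost:9200/logstash-2014.11.27/_open'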
Re: elasticsearch 1.4.0 yellow with shards stuck initializing - need help diagnosing this
BTW, I should mention that I also filed a bug for this earlier today: https://github.com/elasticsearch/elasticsearch/issues/8684

Clinton Gormley kindly replied to that and provided some additional insight. It indeed seems our mapping is part of the problem, but there's also the ES side of things, where it shouldn't be able to get into this state in the first place. Apparently a fix for that part is coming.

Best,
Jilles

On Thursday, November 27, 2014 11:19:20 AM UTC+1, Jilles van Gurp wrote:

Our production cluster went yellow last night after our logstash index rolled over to the next version. [...]
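Worth noting, although nobody in the thread suggested it: since the failing shards are replicas (the [R] in the log lines), a commonly used, less intrusive workaround than a rolling restart is to drop the replicas for just that index and add them back, which discards the broken replica recovery and rebuilds fresh copies from the primary. Fix the mapping first (e.g. via the close/open above), otherwise a bad document still sitting in the primary's translog can trip the same exception again. A sketch, assuming the default host and port:

    curl -XPUT 'localhost:9200/logstash-2014.11.27/_settings' -d '{ "index": { "number_of_replicas": 0 } }'
    # wait for the index to settle, then restore the replica count
    curl -XPUT 'localhost:9200/logstash-2014.11.27/_settings' -d '{ "index": { "number_of_replicas": 1 } }'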