This looks like a mapping issue to me (not 100% sure). A document that is
in the translog has a string field (with the value 'finished'), but that
field is mapped as a number (long, integer, double, etc.) in the mapping.
This causes the NumberFormatException you're seeing in your logs when the
document is replayed from the translog as part of the recovery, which then
prevents the shard from starting.
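
As an illustration (the index, type, and field names below are made up, not
taken from your cluster), the same class of error can be reproduced by
indexing a string into a field that was already dynamically mapped as a
number:

curl -XPUT 'localhost:9200/test/doc/1' -d '{"status": 42}'
# "status" is now dynamically mapped as long
curl -XPUT 'localhost:9200/test/doc/2' -d '{"status": "finished"}'
# fails with a mapper parsing error caused by:
# NumberFormatException[For input string: "finished"]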

These problems can occur when new fields are introduced at index time, and
having numeric_detection enabled in the mapping makes these errors more
likely. Is this the case in your ES setup?
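
For reference, numeric_detection is enabled per type in the mapping; here is
a minimal sketch (my_index and my_type are placeholders):

curl -XPUT 'localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}'

With it enabled, a new field whose first value is a numeric-looking string
such as "42" gets mapped as a number, so a later value like "finished" in
the same field can no longer be parsed.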

Can you also check the mappings of the logstash-2014.11.27 index and see
which fields could possibly contain 'finished'? Unfortunately the field
name isn't included in the errors.
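
Something along these lines should dump the mappings; depending on how the
output is formatted, a grep can help narrow down which fields are numeric:

curl -XGET 'localhost:9200/logstash-2014.11.27/_mapping?pretty'
curl -s 'localhost:9200/logstash-2014.11.27/_mapping?pretty' | grep -B2 '"type" : "long"'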

On 27 November 2014 at 11:19, Jilles van Gurp <jillesvang...@gmail.com>
wrote:

> Our production cluster went yellow last night after our logstash index
> rolled over to the next day's index. I've seen this happen before, but this time
> I decided to properly diagnose and seek some feedback on what might be
> going on.
>
> So, I'd love some feedback on what is going on. I'm happy to keep this
> cluster in a yellow state for a limited time to get some help from people
> in this group trying to diagnose this properly and maybe help some others
> who face the same issues. However, I will need to fix this one way or
> another before end of business day today. I plan to perform a rolling
> restart to see if node reinitialization fixes things. If not, I'll remove
> the problematic logstash index and move on. I'd love suggestions for less
> intrusive solutions. I don't like losing data, and rolling restarts are kind
> of tedious to babysit; they tend to take 45 minutes or so.
>
> Below is some information I've gathered. Let me know if you need me to
> extract more data.
>
> First the obvious:
>
> {
>   "status" : 200,
>   "name" : "192.168.1.13",
>   "cluster_name" : "linko_elasticsearch",
>   "version" : {
>     "number" : "1.4.0",
>     "build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
>     "build_timestamp" : "2014-11-05T14:26:12Z",
>     "build_snapshot" : false,
>     "lucene_version" : "4.10.2"
>   },
>   "tagline" : "You Know, for Search"
> }
>
> [linko@app2 elasticsearch]$ curl localhost:9200/_cluster/health?pretty
> {
>   "cluster_name" : "linko_elasticsearch",
>   "status" : "yellow",
>   "timed_out" : false,
>   "number_of_nodes" : 5,
>   "number_of_data_nodes" : 3,
>   "active_primary_shards" : 221,
>   "active_shards" : 619,
>   "relocating_shards" : 0,
>   "initializing_shards" : 2,
>   "unassigned_shards" : 1
> }
>
> So we're yellow and the reason is initializing and unassigned shards. We
> have five nodes, of which three are data nodes. It seems we are hitting
> some kind of resilience issue. The three machines have plenty of disk space
> and memory.
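>
> For reference, something like this (assuming the _cat API is reachable on
> one of the nodes) lists exactly which shards are in a state other than
> STARTED:
>
> curl -s 'localhost:9200/_cat/shards?v' | grep -v STARTED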
>
> I found this in the log of one of our es nodes:
> [2014-11-27 10:15:12,585][WARN ][cluster.action.shard     ] [192.168.1.13]
> [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
> node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
> [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message
> [RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from
> [192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/
> 192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][
> es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested:
> RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]];
> nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution
> failed]; nested:
> RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]];
> nested: NumberFormatException[For input string: "finished"]; ]]
>
> On the mentioned node there's a corresponding message:
> [2014-11-27 10:17:54,187][WARN ][cluster.action.shard     ] [192.168.1.14]
> [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4],
> node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID
> [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform
> [indices:data/write/bulk[s]] on replica, message
> [RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]];
> nested: NumberFormatException[For input string: "finished"]; ]]
>
> All three data nodes have similar messages happening over and over again.
>
> Our cluster has been up for a couple of weeks and seems pretty happy
> otherwise. I deleted some older logstash indices a few days ago. The
> cluster has logstash data and a few smallish indices we use for our
> inbot.io service. The issue appears to be related to the logstash index
> rollover. Our app servers and kibana talk to the two non data nodes that we
> run on both our application servers.
>
> My next stop was Kibana, which runs against the same cluster and the logstash
> index that is probably causing us issues. Looking at that, I noticed a few
> interesting things:
>
>    - logstash indexing seems to be fine (good) and it appears there has
>    been no data loss yet
>    - our CPU load jumped around midnight and sort of stayed up on all
>    three nodes. We measure this using collectd, and both mean and max load
>    jumped to around 1 at the time the index rollover happened.
>
> My next step was using curl -XGET 'localhost:9200/_cat/recovery?v'
>
> All the indices listed there looked fine. I'll spare you the output but
> everything appeared to be in the 'done' stage.
>
> Finally, I did
> [linko@es3 elasticsearch]$ curl -XGET
> 'localhost:9200/_cluster/health/logstash-2014.11.27/?pretty'
> {
>   "cluster_name" : "linko_elasticsearch",
>   "status" : "yellow",
>   "timed_out" : false,
>   "number_of_nodes" : 5,
>   "number_of_data_nodes" : 3,
>   "active_primary_shards" : 5,
>   "active_shards" : 12,
>   "relocating_shards" : 0,
>   "initializing_shards" : 2,
>   "unassigned_shards" : 1
> }
>
> So, that confirms last night's new logstash index is the issue and it
> seems to have issues initializing two of its shards. With some Googling,
> I also found a recently closed issue (e.g.
> https://github.com/elasticsearch/elasticsearch/issues/6808) that seems
> related to this for 1.3.x. However, I'm on 1.4, where this is supposedly
> fixed. The NumberFormatException mentioned in the logs seems suspicious for
> sure.
>
> My questions are: what is going on, how do I fix it properly and without
> data loss or downtime, and how do I stop this from happening? A rolling
> restart might fix things (I've seen that happen before).
>
> Maybe there is some magic curl incantation that nudges things along that
> I've overlooked? I've dug around on Google but all I seem to find is advice
> along the lines of deleting indices and/or maybe trying this or that by
> poking around with various APIs in the Elasticsearch internals. I'm
> looking for something a bit more responsible/structured and am a bit wary
> of the 'shotgun' approach here. Yes, I know I can fix things by deleting
> stuff or restarting stuff and doing a Hail Mary. However, my main concern
> is that the need for this seems to be a regularly recurring thing (I've done
> it a dozen times over the course of the last year for different issues). What
> I really need is simple instructions that I can give to our people to fix
> things when things go bad: when x happens do this and you'll be fine. The
> last thing I want is ops people who get in the habit of deleting indices or
> doing random rolling restarts out of magical thinking. That sounds like
> something that could cause more issues than it solves. I know from
> experience a botched rolling restart is not nice to deal with.
>



-- 
Kind regards,

Martijn van Groningen
