I guess you have hit the following combination of conditions:
- you insert data with bulk indexing
- your index uses dynamic mapping and already holds a huge number of field mappings
- bulk requests span many nodes / shards / replicas and introduce tons
of new fields into the dynamic mapping
- you do not wait for bulk responses before sending new bulk requests
That is, ES works hard to create the new field mappings, but the updated
mapping does not reach the other nodes before new bulks arrive there. Such
a node sees only that there must be a mapping for a new field, but the
cluster state it holds has none to present, although the field is already
being mapped elsewhere.
Maybe the cluster state is not sent at all, or it could not be read fully
from disk, or it is stuck somewhere else.
ES tries hard to prevent such conditions by assigning high priority to
cluster state messages that are sent throughout the cluster. Also, ES
avoids flooding of such messages.
Your observation is correct: the longer you run bulk indexing with the
same kind of data (random data excepted), the fewer new field mappings
appear, and so the fewer new cluster states have to be published.
You can try the following to tackle this challenge:
- pre-create the field mappings for your indexes, or even better,
pre-create indices and disable dynamic mapping, so no cluster state changes
have to be propagated
- switch to synchronous bulk requests, or reduce concurrency in your bulk
requests, so that the bulk indexing routine waits until the cluster state
changes are consistent on all nodes
- reduce the (perhaps huge) number of field mappings (more a question about
the type of data you index)
- reduce the number of nodes (obviously an anti-pattern)
- or reduce the replica level (always a good thing for efficiency while using
bulk indexing), to give the cluster some breathing room to broadcast the new
cluster states to the corresponding nodes in a shorter time
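A rough sketch of how the first, second and last suggestions could fit together, in Python for brevity. The function names and the `send` callable are hypothetical; the index name, the type `key` and the field `sData` are taken from the original question, and the `string` field type assumes a 1.x cluster. It builds a create-index body with a closed mapping and no replicas, serializes documents into the bulk format, and sends one bulk at a time.

```python
import json


def build_index_body(doc_type, properties, replicas=0):
    """Build a create-index request body with an explicit, closed mapping."""
    return {
        "settings": {"number_of_replicas": replicas},
        "mappings": {
            doc_type: {
                # "strict" rejects documents that contain unmapped fields,
                # so bulk requests can no longer trigger mapping updates.
                "dynamic": "strict",
                "properties": properties,
            }
        },
    }


def build_bulk_payload(index, doc_type, docs):
    """Serialize (doc_id, source) pairs into the newline-delimited bulk format."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


def send_bulks_sequentially(batches, send):
    """Send one bulk at a time, waiting for each response before the next.

    `send` is a hypothetical callable that POSTs a payload to /_bulk and
    returns the parsed JSON response. Returns the positions of the batches
    whose response reported item-level errors.
    """
    failed = []
    for number, payload in enumerate(batches):
        response = send(payload)  # blocks until the cluster has answered
        if response.get("errors"):
            failed.append(number)
    return failed
```

With `"dynamic": "strict"`, documents that carry an unmapped field are rejected instead of extending the mapping, so this only fits if the set of fields is known up front. The replica count can be raised again through the index settings once the bulk load has finished.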
Jörg
On Mon, Jun 16, 2014 at 10:34 PM, Brooke Babcock brookebabc...@gmail.com
wrote:
Thanks for the reply.
We've checked the log files on all the nodes - no errors or warnings.
Disks were practically empty - it was a fresh cluster, fresh index.
We have noticed that the problem occurs less frequently the more data we
send to the cluster. Our latest theory is that it corrects itself
(meaning, we are able to get by _id again) once a flush occurs. So by
sending it more data, we are ensuring that flushes happen more often.
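One way to check the flush theory directly, as a rough sketch: fetch an affected document by _id, trigger a flush on the index explicitly, and fetch it again. The helper names and the `request` callable are hypothetical; `_flush` is the regular index flush endpoint.

```python
def doc_url(base, index, doc_type, doc_id):
    """URL for a GET by _id, e.g. .../data-2014.06.06/key/test2."""
    return "%s/%s/%s/%s" % (base, index, doc_type, doc_id)


def flush_url(base, index):
    """URL of the flush endpoint for one index."""
    return "%s/%s/_flush" % (base, index)


def check_flush_theory(base, index, doc_type, doc_id, request):
    """Return (status_before_flush, status_after_flush) for a GET by _id.

    `request(method, url)` is a hypothetical callable that performs the
    HTTP call and returns the response status code.
    """
    before = request("GET", doc_url(base, index, doc_type, doc_id))
    request("POST", flush_url(base, index))
    after = request("GET", doc_url(base, index, doc_type, doc_id))
    return before, after
```

A result of (500, 200) for a document that is currently in the bad state would support the theory; (500, 500) would suggest the flush is not what clears the condition.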
On Monday, June 16, 2014 8:05:15 AM UTC-5, Alexander Reelsen wrote:
Hey,
from a bird's-eye view, it seems as if writing into the translog fails at
some stage. Can you check your log files for any weird exceptions that
occur before this happens? Also, did you run out of disk space at any
point when this happened?
--Alex
On Fri, Jun 6, 2014 at 8:39 PM, Brooke Babcock brooke...@gmail.com
wrote:
In one part of our application we use Elasticsearch as an object store.
Therefore, when indexing, we supply our own _id. Likewise, when accessing a
document we use the simple GET method to fetch by _id. This has worked well
for us, up until recently. Normally, this is what we get:
curl -XGET 'http://127.0.0.1:9200/data-2014.06.06/key/test1?pretty=true'
{
  "_index" : "data-2014.06.06",
  "_type" : "key",
  "_id" : "test1",
  "_version" : 1,
  "found" : true,
  "_source":{"sData":"test data 1"}
}
Now, we often encounter a recently indexed document that throws the
following error when we try to fetch it:
curl -XGET 'http://127.0.0.1:9200/data-2014.06.06/key/test2?pretty=true'
{
  "error":"IllegalArgumentException[No type mapped for [43]]",
  "status":500
}
This condition persists anywhere from 1 to 25 minutes or so, at which
point we no longer receive the error for that document and the GET succeeds
as normal. From that point on, we are able to consistently retrieve that
document by _id without issue. But, soon after, we will find a different
newly indexed document caught in the same bad state.
We know the documents are successfully indexed. Our bulk sender (which
uses the Java transport client) indicates no error during indexing and
we are still able to locate the document by doing an ids query, such as:
curl -XPOST 'http://127.0.0.1:9200/data-2014.06.06/key/_search?pretty=true' -d '
{
  "query": {
    "ids": {
      "values": ["test2"]
    }
  }
}'
Which responds:
{
  "took": 543,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [ {
      "_index": "data-2014.06.06",
      "_type": "key",
      "_id": "test2",
      "_score": 1.0,
      "_source":{"sData": "test data 2"}
    } ]
  }
}
We first noticed this behavior in version 1.2.0. When we upgraded to
1.2.1, we deleted all indexes and started with a fresh cluster. We hoped
our problem would be solved by the big fix that came in 1.2.1, but we are
still regularly seeing