Also related https://github.com/elastic/elasticsearch/issues/10447

On 17 April 2015 at 12:37, Charlie Moad <charlie.m...@geofeedia.com> wrote:

> This was tracked down to a problem with Ubuntu 14.04 running under Xen (in
> AWS). The latest Ubuntu kernel fixes the bug, so I did a rolling "apt-get
> update; apt-get dist-upgrade; reboot" across all nodes, which appears to
> have resolved the issue.
>
> For reference:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
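>
> In case it helps anyone hitting the same kernel bug, here is a minimal
> sketch of the per-node procedure. It assumes default ports and a node
> reachable on localhost; pausing shard allocation around each reboot is
> optional, but avoids unnecessary shard shuffling while a node is down.
> These commands are illustrative, not necessarily the exact steps used here:
>
>   # Stop the cluster from relocating shards while the node is offline
>   curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
>   {"transient": {"cluster.routing.allocation.enable": "none"}}'
>
>   # On the node being upgraded (one node at a time):
>   sudo apt-get update && sudo apt-get dist-upgrade -y
>   sudo reboot
>
>   # After the node rejoins the cluster, re-enable allocation
>   curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
>   {"transient": {"cluster.routing.allocation.enable": "all"}}'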
>
>
> On Thursday, April 16, 2015 at 11:20:06 AM UTC-4, Charlie Moad wrote:
>>
>> A few days ago we started to receive a lot of timeouts across our
>> cluster. This is causing shard allocation to fail and leaving the
>> cluster in a perpetual red/yellow state.
>>
>> Examples:
>> [2015-04-16 15:04:50,970][DEBUG][action.admin.cluster.node.stats] [coordinator02] failed to execute on node [1rfWT-mXTZmF_NzR_h1IZw]
>> org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][cluster:monitor/nodes/stats[n]] request_id [3680727] timed out after [15001ms]
>>         at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> [2015-04-16 15:03:26,105][WARN ][gateway.local            ] [coordinator02] [global.y2014m01d30.v2][0]: failed to list shard stores on node [1rfWT-mXTZmF_NzR_h1IZw]
>> org.elasticsearch.action.FailedNodeException: Failed node [1rfWT-mXTZmF_NzR_h1IZw]
>>         at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
>>         at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
>>         at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
>>         at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [3677537] timed out after [30001ms]
>>         ... 4 more
>>
>> I believe I have tracked this down to the management thread pool being
>> saturated on our data nodes and not responding to requests. Our cluster
>> has three master-only nodes (no data) and three data-only nodes (no
>> master). I increased the maximum pool size from 5 to 20, and the pools
>> on the data nodes immediately grew to the new maximum, but I'm still
>> seeing the errors.
>>
>> host          management.type management.active management.size management.queue management.queueSize management.rejected management.largest management.completed management.min management.max management.keepAlive
>> coordinator01 scaling         1                 2               0                                     0                   2                  37884                1              20             5m
>> search02      scaling         1                 20              0                                     0                   20                 1945337              1              20             5m
>> search01      scaling         1                 20              0                                     0                   20                 2034838              1              20             5m
>> search03      scaling         1                 20              0                                     0                   20                 1862848              1              20             5m
>> coordinator03 scaling         1                 2               0                                     0                   2                  37875                1              20             5m
>> coordinator02 scaling         2                 5               0                                     0                   5                  44127                1              20             5m
>>
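>> (For anyone who wants to reproduce this view: output in this format can be
>> pulled from the cat thread pool API; the column list below simply mirrors
>> the headers shown, and localhost:9200 is a placeholder for any node in the
>> cluster:
>>
>>   curl 'http://localhost:9200/_cat/thread_pool?v&h=host,management.type,management.active,management.size,management.queue,management.queueSize,management.rejected,management.largest,management.completed,management.min,management.max,management.keepAlive'
>>
>> )
>>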
>> How can I address this problem?
>>
>> Thanks,
>>      Charlie
>>