Re: Master keeps forgetting nodes

2015-04-08 Thread João Costa
Both _cat/indices and _cat/shards appear to be working during the cluster 
failure.
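
For reference, a minimal sketch of that kind of check (the host and port are 
assumptions; only the Python standard library is used):

    import urllib.request

    # Point BASE at any reachable node in the affected cluster (assumed address).
    BASE = "http://localhost:9200"

    for endpoint in ("/_cat/nodes?v", "/_cat/indices?v", "/_cat/shards?v"):
        try:
            with urllib.request.urlopen(BASE + endpoint, timeout=10) as resp:
                print("==>", endpoint)
                print(resp.read().decode("utf-8"))
        except Exception as exc:
            # A 503 here usually means the node currently has no elected master.
            print("==>", endpoint, "failed:", exc)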

On Tuesday, April 7, 2015 at 14:05:02 UTC+1, João Costa wrote:
>
> All machines are in the same region; the AZs are different, though.
>
> When you say "check the _cat outputs", you mean making a call to 
> _cat/indices or _cat/shards when I know that the cluster is down, correct?
> I'll try to do that, then.
>
> On Monday, April 6, 2015 at 23:32:51 UTC+1, Mark Walkom wrote:
>>
>> The next time this happens, can you check the _cat outputs, take a look at 
>> https://github.com/elastic/elasticsearch/issues/10447 and see if it's 
>> similar behaviour?
>>
>> On 7 April 2015 at 07:09, Mark Walkom wrote:
>>
>>> Are you running across AZs, or regions?
>>>
>>> On 6 April 2015 at 21:01, João Costa wrote:
>>>
>>>> Slight update: The same problem also happens on another cluster with the 
>>>> same configuration on another AWS account.
>>>> While this does not happen on my test account, that's probably related 
>>>> to the fact that those instances are regularly rebooted.
>>>>
>>>>
>>>> On Monday, April 6, 2015 at 11:42:07 UTC+1, João Costa wrote:
>>>>>
>>>>> I have 2 EC2 instances in an AWS account where it appears that the master 
>>>>> keeps forgetting about the slave node.
>>>>>
>>>>> In the slave node logs (I removed the IPs and timestamps for simplicity; 
>>>>> the master is "Cordelia Frost" and the slave is "Chronos"):
>>>>>
>>>>> [discovery.zen.fd] [Chronos] [master] pinging a master [Cordelia Frost] but 
>>>>> we do not exists on it, act as if its master failure
>>>>> [discovery.zen.fd] [Chronos] [master] stopping fault detection against 
>>>>> master [Cordelia Frost], reason [master failure, do not exists on 
>>>>> master, act as master failure]
>>>>> [discovery.ec2] [Chronos] master_left [Cordelia Frost], reason [do 
>>>>> not exists on master, act as master failure]
>>>>> [discovery.ec2] [Chronos] master left (reason = do not exists on 
>>>>> master, act as master failure), current nodes: {[Chronos]}
>>>>> [cluster.service] [Chronos] removed {[Cordelia Frost]}, reason: 
>>>>> zen-disco-master_failed ([Cordelia Frost])
>>>>> [discovery.ec2] [Chronos] using dynamic discovery nodes
>>>>> [discovery.ec2] [Chronos] using dynamic discovery nodes
>>>>> [discovery.ec2] [Chronos] using dynamic discovery nodes
>>>>> [discovery.ec2] [Chronos] filtered ping responses: 
>>>>> (filter_client[true], filter_data[false])
>>>>> --> ping_response{node [Cordelia Frost], id[353], master [Cordelia 
>>>>> Frost], hasJoinedOnce [true], cluster_name[cluster]}
>>>>> [discovery.zen.publish] [Chronos] received cluster state version 232374
>>>>> [discovery.zen.fd] [Chronos] [master] restarting fault detection 
>>>>> against master [Cordelia Frost], reason [new cluster state received 
>>>>> and we are monitoring the wrong master [null]]
>>>>> [discovery.ec2] [Chronos] got first state from fresh master
>>>>> [cluster.service] [Chronos] detected_master [Cordelia Frost], added 
>>>>> {[Cordelia Frost]}, reason: zen-disco-receive(from master [Cordelia 
>>>>> Frost])
>>>>>
>>>>> "Chronos" then receives the cluster state and everything goes back to 
>>>>> normal.
>>>>> This happens at fairly regular intervals (usually about once per hour, 
>>>>> although sometimes it takes longer). Any idea what could be causing this?
>>>>>
>>>>> I have a ping timeout of 15s on discovery.ec2, so I think that ping 
>>>>> latency should not be the problem. I also do hourly snapshots with 
>>>>> curator, in case that's relevant.
>>>>> Finally, I also have another Elasticsearch cluster with the same 
>>>>> configuration on a different AWS account (used for testing purposes), and 
>>>>> that problem has never occurred. Can this be related to the AWS region?
>>>>>
>>>
>>>
>>



Re: Master keeps forgetting nodes

2015-04-07 Thread João Costa
All machines are in the same region; the AZs are different, though.

When you say "check the _cat outputs", you mean making a call to 
_cat/indices or _cat/shards when I know that the cluster is down, correct?
I'll try to do that, then.

On Monday, April 6, 2015 at 23:32:51 UTC+1, Mark Walkom wrote:
>
> The next time this happens, can you check the _cat outputs, take a look at 
> https://github.com/elastic/elasticsearch/issues/10447 and see if it's 
> similar behaviour?
>
> On 7 April 2015 at 07:09, Mark Walkom wrote:
>
>> Are you running across AZs, or regions?
>>
>> On 6 April 2015 at 21:01, João Costa wrote:
>>
>>> Slight update: The same problem also happens on another cluster with the 
>>> same configuration on another AWS account.
>>> While this does not happen on my test account, that's probably related 
>>> to the fact that those instances are regularly rebooted.
>>>
>>>
>>> On Monday, April 6, 2015 at 11:42:07 UTC+1, João Costa wrote:
>>>>
>>>> I have 2 EC2 instances in an AWS account where it appears that the master 
>>>> keeps forgetting about the slave node.
>>>>
>>>> In the slave node logs (I removed the IPs and timestamps for simplicity; 
>>>> the master is "Cordelia Frost" and the slave is "Chronos"):
>>>>
>>>> [discovery.zen.fd] [Chronos] [master] pinging a master [Cordelia Frost] but 
>>>> we do not exists on it, act as if its master failure
>>>> [discovery.zen.fd] [Chronos] [master] stopping fault detection against 
>>>> master [Cordelia Frost], reason [master failure, do not exists on 
>>>> master, act as master failure]
>>>> [discovery.ec2] [Chronos] master_left [Cordelia Frost], reason [do not 
>>>> exists on master, act as master failure]
>>>> [discovery.ec2] [Chronos] master left (reason = do not exists on 
>>>> master, act as master failure), current nodes: {[Chronos]}
>>>> [cluster.service] [Chronos] removed {[Cordelia Frost]}, reason: 
>>>> zen-disco-master_failed ([Cordelia Frost])
>>>> [discovery.ec2] [Chronos] using dynamic discovery nodes
>>>> [discovery.ec2] [Chronos] using dynamic discovery nodes
>>>> [discovery.ec2] [Chronos] using dynamic discovery nodes
>>>> [discovery.ec2] [Chronos] filtered ping responses: 
>>>> (filter_client[true], filter_data[false])
>>>> --> ping_response{node [Cordelia Frost], id[353], master [Cordelia 
>>>> Frost], hasJoinedOnce [true], cluster_name[cluster]}
>>>> [discovery.zen.publish] [Chronos] received cluster state version 232374
>>>> [discovery.zen.fd] [Chronos] [master] restarting fault detection 
>>>> against master [Cordelia Frost], reason [new cluster state received 
>>>> and we are monitoring the wrong master [null]]
>>>> [discovery.ec2] [Chronos] got first state from fresh master
>>>> [cluster.service] [Chronos] detected_master [Cordelia Frost], added 
>>>> {[Cordelia Frost]}, reason: zen-disco-receive(from master [Cordelia 
>>>> Frost])
>>>>
>>>> "Chronos" then receives the cluster state and everything goes back to 
>>>> normal.
>>>> This happens at fairly regular intervals (usually about once per hour, 
>>>> although sometimes it takes longer). Any idea what could be causing this?
>>>>
>>>> I have a ping timeout of 15s on discovery.ec2, so I think that ping 
>>>> latency should not be the problem. I also do hourly snapshots with 
>>>> curator, in case that's relevant.
>>>> Finally, I also have another Elasticsearch cluster with the same 
>>>> configuration on a different AWS account (used for testing purposes), and 
>>>> that problem has never occurred. Can this be related to the AWS region?
>>>>
>>
>>
>



Re: Master keeps forgetting nodes

2015-04-06 Thread João Costa
Slight update: The same problem also happens on another cluster with the 
same configuration on another AWS account.
While this does not happen on my test account, that's probably related to 
the fact that those instances are regularly rebooted.

On Monday, April 6, 2015 at 11:42:07 UTC+1, João Costa wrote:
>
> I have 2 EC2 instances in an AWS account where it appears that the master 
> keeps forgetting about the slave node.
>
> In the slave node logs (I removed the IPs and timestamps for simplicity; 
> the master is "Cordelia Frost" and the slave is "Chronos"):
>
> [discovery.zen.fd] [Chronos] [master] pinging a master [Cordelia Frost] but 
> we do not exists on it, act as if its master failure
> [discovery.zen.fd] [Chronos] [master] stopping fault detection against 
> master [Cordelia Frost], reason [master failure, do not exists on master, 
> act as master failure]
> [discovery.ec2] [Chronos] master_left [Cordelia Frost], reason [do not 
> exists on master, act as master failure]
> [discovery.ec2] [Chronos] master left (reason = do not exists on master, 
> act as master failure), current nodes: {[Chronos]}
> [cluster.service] [Chronos] removed {[Cordelia Frost]}, reason: 
> zen-disco-master_failed ([Cordelia Frost])
> [discovery.ec2] [Chronos] using dynamic discovery nodes
> [discovery.ec2] [Chronos] using dynamic discovery nodes
> [discovery.ec2] [Chronos] using dynamic discovery nodes
> [discovery.ec2] [Chronos] filtered ping responses: (filter_client[true], 
> filter_data[false])
> --> ping_response{node [Cordelia Frost], id[353], master [Cordelia 
> Frost], hasJoinedOnce [true], cluster_name[cluster]}
> [discovery.zen.publish] [Chronos] received cluster state version 232374
> [discovery.zen.fd] [Chronos] [master] restarting fault detection against 
> master [Cordelia Frost], reason [new cluster state received and we are 
> monitoring the wrong master [null]]
> [discovery.ec2] [Chronos] got first state from fresh master
> [cluster.service] [Chronos] detected_master [Cordelia Frost], added 
> {[Cordelia Frost]}, reason: zen-disco-receive(from master [Cordelia 
> Frost])
>
> "Chronos" then receives the cluster state and everything goes back to 
> normal.
> This happens at fairly regular intervals (usually about once per hour, 
> although sometimes it takes longer). Any idea what could be causing this?
>
> I have a ping timeout of 15s on discovery.ec2, so I think that ping 
> latency should not be the problem. I also do hourly snapshots with curator, 
> in case that's relevant.
> Finally, I also have another Elasticsearch cluster with the same 
> configuration on a different AWS account (used for testing purposes), and 
> that problem has never occurred. Can this be related to the AWS region?
>



Master keeps forgetting nodes

2015-04-06 Thread João Costa
I have 2 EC2 instances in an AWS account where it appears that the master 
keeps forgetting about the slave node.

In the slave node logs (I removed the IPs and timestamps for simplicity; the 
master is "Cordelia Frost" and the slave is "Chronos"):

[discovery.zen.fd] [Chronos] [master] pinging a master [Cordelia Frost] but 
we do not exists on it, act as if its master failure
[discovery.zen.fd] [Chronos] [master] stopping fault detection against 
master [Cordelia Frost], reason [master failure, do not exists on master, 
act as master failure]
[discovery.ec2] [Chronos] master_left [Cordelia Frost], reason [do not 
exists on master, act as master failure]
[discovery.ec2] [Chronos] master left (reason = do not exists on master, 
act as master failure), current nodes: {[Chronos]}
[cluster.service] [Chronos] removed {[Cordelia Frost]}, reason: 
zen-disco-master_failed ([Cordelia Frost])
[discovery.ec2] [Chronos] using dynamic discovery nodes
[discovery.ec2] [Chronos] using dynamic discovery nodes
[discovery.ec2] [Chronos] using dynamic discovery nodes
[discovery.ec2] [Chronos] filtered ping responses: (filter_client[true], 
filter_data[false])
--> ping_response{node [Cordelia Frost], id[353], master [Cordelia 
Frost], hasJoinedOnce [true], cluster_name[cluster]}
[discovery.zen.publish] [Chronos] received cluster state version 232374
[discovery.zen.fd] [Chronos] [master] restarting fault detection against 
master [Cordelia Frost], reason [new cluster state received and we are 
monitoring the wrong master [null]]
[discovery.ec2] [Chronos] got first state from fresh master
[cluster.service] [Chronos] detected_master [Cordelia Frost], added 
{[Cordelia Frost]}, reason: zen-disco-receive(from master [Cordelia Frost])

"Chronos" then receives the cluster state and everything goes back to 
normal.
This happens at fairly regular intervals (usually about once per hour, 
although sometimes it takes longer). Any idea what could be causing this?

I have a ping timeout of 15s on discovery.ec2, so I think that ping latency 
should not be the problem. I also do hourly snapshots with curator, in case 
that's relevant.
Finally, I also have another Elasticsearch cluster with the same 
configuration on a different AWS account (used for testing purposes), and 
that problem has never occurred. Can this be related to the AWS region?
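
To pin down exactly when the node disappears, a small poller along these 
lines could log every change in what _cat/nodes reports (the node address 
and the 30-second polling interval here are assumptions, not part of the 
original setup):

    import time
    import urllib.request

    BASE = "http://localhost:9200"   # assumed address of the data node ("Chronos")

    last = None
    while True:
        try:
            with urllib.request.urlopen(BASE + "/_cat/nodes?h=name,master", timeout=10) as resp:
                # One line per node: node name plus a master flag ("*" marks the elected master).
                view = sorted(line.strip() for line in resp.read().decode("utf-8").splitlines() if line.strip())
        except Exception as exc:
            view = ["request failed: %s" % exc]
        if view != last:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), view)
            last = view
        time.sleep(30)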



Re: What is the best practice for periodic snapshotting with aws-cloud+s3

2014-11-20 Thread João Costa
Hello,

Sorry for hijacking this thread, but I'm currently also pondering the best 
way to perform periodic snapshots in AWS.

My main concern is that we are using blue-green deployment with ephemeral 
storage on EC2, so if for some reason there is a problem with the cluster, 
we might lose a lot of data; therefore I would rather take frequent snapshots 
(for this reason, we are still using the deprecated S3 gateway).

The thing is, you claim that "Having too many snapshots is problematic" and 
that one should "prune old snapshots". Since snapshots are incremental, 
this will imply data loss, correct?
Also, is the problem related to the number of snapshots or the size of the 
data? Is there any way to merge old snapshots into one? Would this solve 
the problem?
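
For context, here is a rough sketch of what pruning old snapshots could look 
like against the snapshot API; the repository name "s3_backup" and the 
retention count are hypothetical placeholders:

    import json
    import urllib.request

    BASE = "http://localhost:9200"   # assumed node address
    REPO = "s3_backup"               # hypothetical repository name
    KEEP = 84                        # e.g. one week of 2-hourly snapshots

    # List every snapshot in the repository (this call can be slow on large repositories).
    with urllib.request.urlopen("%s/_snapshot/%s/_all" % (BASE, REPO), timeout=300) as resp:
        snapshots = json.loads(resp.read().decode("utf-8"))["snapshots"]

    # Assumes snapshot names embed their creation time, so sorting by name is
    # chronological; otherwise sort on the "start_time_in_millis" field instead.
    for snap in sorted(snapshots, key=lambda s: s["snapshot"])[:-KEEP]:
        req = urllib.request.Request(
            "%s/_snapshot/%s/%s" % (BASE, REPO, snap["snapshot"]), method="DELETE")
        urllib.request.urlopen(req, timeout=600)
        print("deleted", snap["snapshot"])

Note that deleting a snapshot only removes files that no remaining snapshot 
still references, so the newer snapshots that are kept stay fully restorable.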

Finally, if I create a cronjob to make automatic snapshots, can I run into 
problems if two instances attempt to create a snapshot with the same name 
at the same time?
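
As an illustration of the cron-job approach, a per-run timestamped name 
avoids reusing the same snapshot name; the repository name and node address 
below are again hypothetical:

    import time
    import urllib.error
    import urllib.request

    BASE = "http://localhost:9200"   # assumed node address
    REPO = "s3_backup"               # hypothetical repository name

    # If two machines still race and another snapshot is already running,
    # Elasticsearch rejects the second create request with an error rather than
    # corrupting anything, so logging the failure and moving on is enough here.
    name = "snapshot-" + time.strftime("%Y.%m.%d-%H%M%S")
    req = urllib.request.Request(
        "%s/_snapshot/%s/%s?wait_for_completion=true" % (BASE, REPO, name), method="PUT")
    try:
        urllib.request.urlopen(req, timeout=3600)
        print("snapshot", name, "completed")
    except urllib.error.HTTPError as exc:
        print("snapshot", name, "was not created:", exc)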
Also, what's the best way to take a snapshot on shutdown? Should I put a 
script in init.d/rc.0 to run on shutdown before Elasticsearch shuts down? 
I've seen cases where the EC2 instances have "not so graceful" shutdowns, 
so it would be wonderful if there were a better way to do this at the 
cluster level (i.e., if node A notices that node B is not responding, it 
automatically takes a snapshot).

Sorry if some of these questions don't make much sense; I'm still quite new 
to Elasticsearch and have not completely understood the new snapshot feature.

On Friday, November 14, 2014 at 08:19:42 UTC, Sally Ahn wrote:
>
> Yes, I am now seeing the snapshots complete in about 2 minutes after 
> switching to a new, empty bucket.
> I'm not sure why the initial request to snapshot to the empty repo was 
> hanging because the snapshot did in fact complete in about 2 minutes, 
> according to the S3 timestamp.
> Time to automate deletion of old snapshots. :)
> Thanks for the response!
>
> On Thursday, November 13, 2014 9:35:20 PM UTC-8, Igor Motov wrote:
>>
>> Having too many snapshots is problematic. Each snapshot is done in an 
>> incremental manner, so in order to figure out what has changed and what is 
>> available, all snapshots in the repository need to be scanned, which takes 
>> more time as the number of snapshots grows. I would recommend pruning old 
>> snapshots as time goes by, or starting snapshots into a new bucket/directory 
>> if you really need to maintain 2-hour resolution for 2-month-old snapshots. 
>> The get command can sometimes hang because it's throttled by the ongoing 
>> snapshot.
>>
>>
>> On Wednesday, November 12, 2014 9:02:33 PM UTC-10, Sally Ahn wrote:
>>>
>>> I am also interested in this topic.
>>> We were snapshotting our cluster of two nodes every 2 hours (invoked via 
>>> a cron job) to an S3 repository (we were running ES 1.2.2 with 
>>> cloud-aws-plugin version 2.2.0, then we upgraded to ES 1.4.0 with 
>>> cloud-aws-plugin 2.4.0 but are still seeing issues described below).
>>> I've been seeing an increase in the time it takes to complete a snapshot 
>>> with each subsequent snapshot. 
>>> I see a thread where someone else was seeing the same thing, but that 
>>> thread seems to have died.
>>> In my case, snapshots have gone from taking ~5 minutes to taking about 
>>> an hour, even between snapshots where data does not seem to have changed. 
>>>
>>> For example, you can see below a list of the snapshots stored in my S3 
>>> repo. Each snapshot is named with a timestamp of when my cron job invoked 
>>> the snapshot process. The S3 timestamp on the left shows the completion 
>>> time of that snapshot, and it's clear that it's steadily increasing:
>>>
>>> 2014-09-30 10:05   686   s3:///snapshot-2014.09.30-10:00:01
>>> 2014-09-30 12:05   686   s3:///snapshot-2014.09.30-12:00:01
>>> 2014-09-30 14:05   736   s3:///snapshot-2014.09.30-14:00:01
>>> 2014-09-30 16:05   736   s3:///snapshot-2014.09.30-16:00:01
>>> ...
>>> 2014-11-08 00:52  1488   s3:///snapshot-2014.11.08-00:00:01
>>> 2014-11-08 02:54  1488   s3:///snapshot-2014.11.08-02:00:01
>>> ...
>>> 2014-11-08 14:54  1488   s3:///snapshot-2014.11.08-14:00:01
>>> 2014-11-08 16:53  1488   s3:///snapshot-2014.11.08-16:00:01
>>> ...
>>> 2014-11-11 07:00  1638   s3:///snapshot-2014.11.11-06:00:01
>>> 2014-11-11 08:58  1638   s3:///snapshot-2014.11.11-08:00:01
>>> 2014-11-11 10:58  1638   s3:///snapshot-2014.11.11-10:00:01
>>> 2014-11-11 12:59  1638   s3:///snapshot-2014.11.11-12:00:01
>>> 2014-11-11 15:00  1638   s3:///snapshot-2014.11.11-14:00:01
>>> 2014-11-11 17:00  1638   s3:///snapshot-2014.11.11-16:00:01
>>>
>>> I suspected that this gradual increase was related to the accumulation 
>>> of old snapshots after I tested the following:
>>> 1. I created a brand new clust