Oops, please use this link for the code instead.

https://gist.github.com/bwhaley/eee6a0f61636862515aa
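For anyone skimming the thread, the script at that gist boils down to the following check. This is a minimal sketch, not the gist verbatim: the function names, region, and CLI shape are my own, and it assumes boto (the library ec2_asg itself uses).

```python
#!/usr/bin/env python
# Sketch of the workaround discussed below: exit 0 only once every
# instance in the ASG reports InService in the ELB.
import sys


def all_in_service(states):
    """True only if every ELB instance state string is 'InService'."""
    return bool(states) and all(s == "InService" for s in states)


def main(asg_name, elb_name, region="us-east-1"):
    # The AWS-touching part, shown for shape only.
    import boto.ec2.autoscale
    import boto.ec2.elb

    asg_conn = boto.ec2.autoscale.connect_to_region(region)
    elb_conn = boto.ec2.elb.connect_to_region(region)

    group = asg_conn.get_all_groups(names=[asg_name])[0]
    instance_ids = [i.instance_id for i in group.instances]

    # DescribeInstanceHealth is the ELB's view -- the one that lags
    # behind the ASG's view during the grace period.
    health = elb_conn.describe_instance_health(elb_name, instances=instance_ids)
    return 0 if all_in_service([h.state for h in health]) else 1


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

An Ansible task can then retry the script with `register`, `until: result.rc == 0`, `retries`, and `delay`, which is the "until" loop described below.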


On Monday, November 24, 2014 9:44:21 AM UTC-8, Ben Whaley wrote:
>
> Hi James,
>
> Thanks for your reply.
>
> Interesting point about the HealthCheckGracePeriod option. I wasn't aware 
> of its role here. I am indeed using it, in fact according to the docs it is 
> a required option for ELB health checks. I had it set to 180, and I just 
> tried it with lower values of 10 and 1 second. In both cases the behavior 
> is the same: the autoscale group considers the instances healthy (because 
> of the grace period, even at the lower value) and as a result ansible moves 
> on before the instances are InService in the ELB. Even with the 
> HealthCheckGracePeriod at the lowest possible value of 1 second, a race 
> exists between the module's health check and the ELB grace period.
>
> I've worked around this for now with a script that does the following:
> - Find the instances in the ASG
> - Check the ELB to determine if they are healthy or not
> - Exit 1 if not, 0 if yes
>
> Then I use an ansible task with an "until" loop to check the return code. 
> The script is here:
>
> https://gist.github.com/anonymous/05e99828848ee565ed33
>
> Happy to work this into an ansible module if you think this is useful. Or 
> did I misunderstand the point about the health check grace period?
>
> Thanks,
> Ben
>
>
> On Monday, November 24, 2014 7:25:58 AM UTC-8, James Martin wrote:
>>
>> Ben,
>>
>> Thanks for the question.  Considering this: 
>> http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html
>>
>> "Auto Scaling marks an instance unhealthy if the calls to the Amazon EC2 
>> action DescribeInstanceStatus return any state other than running, the 
>> system status shows impaired, or the calls to Elastic Load Balancing action 
>> DescribeInstanceHealth returns OutOfService in the instance state field."
>>
>> To determine instance health, the module fetches an ASG object in boto 
>> and checks the health_status attribute of each instance in the ASG, 
>> which is either "healthy" or "unhealthy".  Are you using an instance 
>> grace period option for the ELB? See HealthCheckGracePeriod in 
>> http://docs.aws.amazon.com/AutoScaling/latest/APIReference/API_CreateAutoScalingGroup.html. 
>> This option is configurable with the health_check_period setting in the 
>> ec2_asg module.  By default it is 500 seconds, which would prematurely 
>> report instances as healthy: for the duration of the grace period, every 
>> instance is marked healthy regardless of its actual ELB state.
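To make that race concrete, here is roughly the check being described, with the comparison pulled into a helper. This is my own sketch, not the module's code: the group name and region are placeholders, and I'm assuming the attribute compares (case-insensitively here, to hedge) against "Healthy".

```python
# Roughly the ASG-side health check described above. The helper isolates
# the comparison so it can be reasoned about without touching AWS.


def asg_considers_viable(health_statuses):
    """Mirror the module's test: every instance's ASG-side health_status
    must read healthy. During HealthCheckGracePeriod this is True even
    while the ELB still reports the same instances OutOfService."""
    return bool(health_statuses) and all(
        s.lower() == "healthy" for s in health_statuses
    )


def fetch_asg_health(asg_name, region="us-east-1"):
    # The AWS-touching part, shown for shape only (boto, as the module uses).
    import boto.ec2.autoscale

    conn = boto.ec2.autoscale.connect_to_region(region)
    group = conn.get_all_groups(names=[asg_name])[0]
    return [i.health_status for i in group.instances]
```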
>>
>> - James
>>
>>
>>
>>
>>
>> On Saturday, November 22, 2014 5:39:28 PM UTC-5, Ben Whaley wrote:
>>>
>>> Hi all,
>>>
>>> Sorry for resurrecting an old thread, but I wanted to mention my 
>>> experience so far using ec2_asg & ec2_lc for code deploys.
>>>
>>> I'm more or less following the methods described in this helpful repo
>>>
>>> https://github.com/ansible/immutablish-deploys
>>>
>>> I believe the dual_asg role is accepted as the more reliable method for 
>>> deployments. If a deployment uses two ASGs, it's possible to just delete 
>>> the new ASG and everything goes back to normal. This is the "Netflix" 
>>> manner of releasing updates.
>>>
>>> The thing I'm finding though is that instances become "viable" well 
>>> before they're actually InService in the ELB. From the ec2_asg code and by 
>>> running ansible in verbose mode it's clear that ansible considers an 
>>> instance viable once AWS indicates that instances are Healthy and 
>>> InService. Checking via the AWS CLI tool, I can see that the ASG shows 
>>> instances as Healthy and InService, but the ELB shows OutOfService. 
>>>
>>> The AWS docs are clear about the behavior of autoscale instances with 
>>> health check type ELB: "For each call, if the Elastic Load Balancing action 
>>> returns any state other than InService, the instance is marked as 
>>> unhealthy." But this is not actually the case. 
>>>
>>> Has anyone else encountered this? Any suggested workarounds or fixes?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Thursday, September 11, 2014 12:54:25 PM UTC-7, Scott Anderson wrote:
>>>>
>>>> On Sep 11, 2014, at 3:26 PM, James Martin <jma...@ansible.com> wrote:
>>>>
>>>>> I think we’re probably going to move to a system that uses a tier of 
>>>>> proxies and two ELBs. That way we can update the idle ELB, change out the 
>>>>> AMIs, and bring the updated ELB up behind an alternate domain for the 
>>>>> blue-green testing. Then when everything checks out, switch the proxies 
>>>>> to the updated ELB and take down the remaining, now idle ELB.
>>>>>
>>>>>
>>>> Not following this exactly -- what's your tier of proxies?  You have a 
>>>> group of proxies (haproxy, nginx) behind a load balancer that point to 
>>>> your 
>>>> application?
>>>>
>>>>
>>>> Yes, nginx or some other HA-ish thing. If it’s nginx then you can 
>>>> maintain a brochure site even if something horrible happens to the 
>>>> application.
>>>>
>>>>  
>>>>
>>>>> Amazon would suggest using Route53 to point to the new ELB, but 
>>>>> there’s too great a chance of faulty DNS caching breaking a switch to a 
>>>>> new 
>>>>> ELB. Plus there’s a 60s TTL to start with regardless, even in the absence 
>>>>> of caching.
>>>>>
>>>>
>>>> Quite right.  There are some interesting things you can do with tools 
>>>> you could run on the hosts that would redirect traffic from blue hosts to 
>>>> the green LB, socat being one.  After you notice no more traffic coming to 
>>>> blue, you can terminate it.
>>>>
>>>>
>>>> That’s an interesting idea, but it fails if people are behind a caching 
>>>> DNS and they visit after you’ve terminated the blue traffic but before 
>>>> their caching DNS lets go of the record.
>>>>
>>>> You're right, I did miss that.  By checking the AMI, you're only 
>>>> updating the instance if the AMI changes.  If you are checking the 
>>>> launch config, you are updating the instances if any component of the 
>>>> launch config has changed -- AMI, instance type, address type, etc.
>>>>
>>>>
>>>> That’s true, but if I’m changing instance types I’ll generally just 
>>>> cycle_all. Because of the connection draining and the parallelism of 
>>>> instance creation, it’s just as quick to do all of them instead of only 
>>>> the ones that need changing. That said, it’s an obvious optimization for 
>>>> sure.
>>>>
>>>>
>>>>> Using the ASG to do the provisioning might be preferable if it’s 
>>>>> reliable. At first I went that route, but I was having problems with the 
>>>>> ASG’s provisioning being non-deterministic. Manually creating the 
>>>>> instances seems to ensure that things happen in a particular order and 
>>>>> with predictable speed. As mentioned, the manual method definitely works 
>>>>> every time, although I need to add some more timeout and error checking 
>>>>> (like what happens if I ask for 3 new instances and only get 2).
>>>>>
>>>>>
>>>> I didn't have any issues with the ASG doing the provisioning, but I 
>>>> would say nothing is predictable with AWS :).  
>>>>
>>>>
>>>> Very true. Over the past few months I’ve had several working processes 
>>>> just fail with no warning. The most recent is AWS sometimes refusing to 
>>>> return the current list of AMIs. Prior to that it was the Available status 
>>>> on an AMI not really meaning available. Now I check the list of returned 
>>>> AMIs in a loop until the one I’m looking for shows up, Available status 
>>>> notwithstanding. Very frustrating. Things could be worse, however: the API 
>>>> could be run by Facebook...
>>>>
>>>>
>>>>> I have a separate task that cleans up the old AMIs and LCs, 
>>>>> incidentally. I keep the most recent around as a backup for quick 
>>>>> rollbacks.
>>>>>
>>>>
>>>> That's cool, care to share?
>>>>  
>>>>
>>>>
>>>> I think I’ve posted it before, but here’s the important bit. After 
>>>> deleting everything but the oldest backup AMI (determined by naming 
>>>> convention or tags), delete any LC that doesn’t have an associated AMI:
>>>>
>>>> def delete_launch_configs(asg_connection, ec2_connection, module):
>>>>     # Assumes boto (including boto.exception) is imported at module level.
>>>>     changed = False
>>>>
>>>>     launch_configs = asg_connection.get_all_launch_configurations()
>>>>
>>>>     for config in launch_configs:
>>>>         try:
>>>>             images = ec2_connection.get_all_images(image_ids=[config.image_id])
>>>>         except boto.exception.EC2ResponseError:
>>>>             # boto raises InvalidAMIID.NotFound for a deleted AMI rather
>>>>             # than returning an empty list, so treat that as "no image".
>>>>             images = []
>>>>
>>>>         if not images:
>>>>             config.delete()
>>>>             changed = True
>>>>
>>>>     module.exit_json(changed=changed)
>>>>
>>>>
>>>> -scott
>>>>
>>>>
>>>>

