Re: [lopsa-tech] Amazon AWS becoming unreliable

Tom Perrine Fri, 26 Sep 2014 09:17:00 -0700

All the patching aside, the network (and some of the server hardware)
within the AWS East zones has been... brittle, for at least 1-2
years.  We have pinger hosts in all our datacenters, and in every AZ
where we have instances.

This info is mostly from an internal study we did last year, so it may
no longer be completely valid.

It has been common, almost expected, for US-East to see >80% packet
loss, especially to AWS EU, for minutes or even hours at a time. Since
our tests use ICMP and/or UDP (depending on how we wrote them), we
believe this indicates congestion within AWS' network: ICMP and UDP
tend to be the first packets dropped when routers/links congest.
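
For anyone who wants to track the same thing, here is a minimal sketch
of the bookkeeping side, assuming your pinger already records how many
probes it sent and how many replies came back per time window. The
function names and the (timestamp, sent, received) layout are made up
for illustration; this is not the tooling from our study.

```python
def loss_percent(sent, received):
    """Percentage of probes in one window that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def lossy_windows(samples, threshold=80.0):
    """Return (timestamp, loss%) for windows whose loss exceeds threshold.

    samples: iterable of (timestamp, probes_sent, replies_received).
    The 80% default mirrors the figure mentioned above; adjust to taste.
    """
    flagged = []
    for timestamp, sent, received in samples:
        loss = loss_percent(sent, received)
        if loss > threshold:
            flagged.append((timestamp, loss))
    return flagged
```

Feeding it one tuple per minute per source/destination pair should make
the "minutes or even hours" stretches easy to pick out.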

US-East also seems to have the oldest hardware and the highest instance
failure rates. This was especially true for the smaller instances,
which are believed to be on the older server hardware. We measured
"half life to failure" of groups of instances. Among tiny/smalls, lots
(80%?) died within 30 days at the outside, and many died within 48
hours of launch.  Larger instances had much longer lifetimes.
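
One way to compute a number like that, as a rough sketch: treat "half
life to failure" as the age by which half of a launch group has died.
That definition is my assumption for this example, as are the function
names and the convention that None means "still running".

```python
def half_life(lifetimes_hours):
    """Age (hours) by which half of the group had failed.

    lifetimes_hours: one entry per instance; hours from launch to
    failure, or None if the instance is still alive.
    Returns None if fewer than half of the group has failed so far.
    """
    n = len(lifetimes_hours)
    failed = sorted(t for t in lifetimes_hours if t is not None)
    needed = (n + 1) // 2  # failures required to reach half the group
    if n == 0 or len(failed) < needed:
        return None
    return failed[needed - 1]


def fraction_failed_within(lifetimes_hours, cutoff_hours):
    """Fraction of the group that failed at or before cutoff_hours."""
    if not lifetimes_hours:
        raise ValueError("empty group")
    dead = sum(
        1 for t in lifetimes_hours if t is not None and t <= cutoff_hours
    )
    return dead / len(lifetimes_hours)
```

Run it per instance size and per AZ, with cutoffs of 48 and 720 hours,
to compare groups on the same two windows mentioned above.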

Of course, it is Amazon *web* services, so as long as TCP/80 and
TCP/443 are working, few will notice it.

I'm guessing that your connections to Zabbix are not TCP?  Try opening
a long-term TCP connection to the host, maybe SSH with keepalives. My
theory is that the TCP session will be fine and ride through the
Zabbix interruptions you're seeing.

At least until your server is rebooted out from under you :-)
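
If you'd rather not tie up an SSH session for the test, a plain TCP
socket with keepalives turned on does the same job. This is only a
sketch: the function names are invented, the host and port are whatever
TCP service you already have listening on the instance, and the
keepalive timing is left at the OS defaults.

```python
import socket
import time


def open_keepalive_connection(host, port, timeout=10.0):
    """Open a TCP connection and enable kernel keepalive probes on it."""
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


def hold_and_watch(sock, duration_s, poll_s=1.0):
    """Hold the connection open for duration_s seconds.

    Returns "alive" if the session survived the whole period,
    "closed" if the peer closed it, or "error" if the OS reported
    the connection as broken (reset, keepalive timeout, and so on).
    Any data the peer sends is read and discarded.
    """
    sock.settimeout(poll_s)
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        try:
            data = sock.recv(4096)
        except socket.timeout:
            continue  # nothing received in this poll; still connected
        except OSError:
            return "error"
        if data == b"":
            return "closed"
    return "alive"
```

Run it alongside Zabbix for a day or so. If Zabbix reports gaps while
hold_and_watch keeps returning "alive", that points at non-TCP loss
rather than the host actually going away.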

On Fri, Sep 26, 2014 at 5:03 AM, Bill Bogstad <[email protected]> wrote:
> On Fri, Sep 26, 2014 at 1:24 PM, Derek Balling <[email protected]> wrote:
>> On Sep 26, 2014, at 7:22 AM, Sean Lally <[email protected]> wrote:
>>
>> Haven't seen that, but they are doing a bunch of scheduled reboots that
>> started yesterday.  Guessing they're patching for bash...
>>
>>
>> The reboots started before bash. The current running-theory is there's a bug
>> in Xen that allows a guest to pierce the hypervisor and get outside.
>
> I would say this is confirmed by this Amazon AWS blog post:
>
> http://aws.amazon.com/blogs/aws/ec2-maintenance-update/
>
> They apparently have an Oct. 1st deadline before the bug is made public.
>
> Bill Bogstad
> _______________________________________________
> Tech mailing list
> [email protected]
> https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
> This list provided by the League of Professional System Administrators
>  http://lopsa.org/