I think you meant that people were spinning up boxes (not spot instances) and so everyone with a spot instance was losing theirs. I would have to say that market forces were at work here, so that part of the system worked correctly. What was bad is that the EBS volumes were losing power without being told to shut down properly. Thus, when they came back up they were marked inconsistent, so you had to rebuild the volume. Everyone else was doing the same thing, so there was huge contention for resources. Every problem they experience just makes the service that much better. My colo wouldn't have had this trouble, but then they are not trying to house Pinterest, Netflix, and every other company out there.

On Thu, Jul 5, 2012 at 5:48 PM, Kevin Wright <kev.lee.wri...@gmail.com> wrote:

> Interestingly, there was another failure mode not outlined in that
> summary. Mostly because it was "by design" and can only be considered a
> failure mode for users of Amazon's cloud, but not a failure of the cloud
> itself.
>
> What we (Zeebox) noticed was that machines in other regions were going
> down as well. Most notably in us-west, but even some in eu-west. The
> machines in question were "spot" instances, where you bid a price and the
> real-time value of an instance is based on demand. If the value is below
> your bid, then you have a machine. When it goes over your bid, you lose it.
>
> It's an ideal model in many circumstances. I'll leave you to decide
> whether it worked exactly as it should in this instance, or if it can be
> classed as another level of the cascading failures :)
>
> As us-east went down, people turned to spot instances to make up for the
> lost capacity. In turn, this drove up the price, and anyone who had a spot
> instance happily doing its thing found themselves outbid and machine-less.
>
> And yes, it happened to us; though I'll add that we don't use spot
> instances for anything which would affect the user experience! They're
> better suited for continuous load testing and other similar tasks where a
> vanishing machine isn't too painful.
>
> On 5 July 2012 19:28, Carl of the Posse <carl.qu...@gmail.com> wrote:
>
>> Amazon posted a nice summary of what went wrong with their systems:
>>
>> http://aws.amazon.com/message/67457/
>>
>> Problems with backup power, and then most importantly, problems with load
>> balancing control were what made the zone outage hard to work around.
>>
>> We (Netflix) might post a blog explaining how that affected us, the
>> internal issues that resulted and what we are doing about it. I'll reply to
>> this group if we do.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Java Posse" group.
> To post to this group, send email to javaposse@googlegroups.com.
> To unsubscribe from this group, send email to
> javaposse+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/javaposse?hl=en.

--
Robert Casto
www.robertcasto.com
www.sellerstoolbox.com
www.lakotaeastbands.org