On Tue, Aug 25, 2009 at 11:59 PM, Shane Hathaway <[email protected]> wrote:
> In the last distributed system I helped build, we didn't feel good about
> having a central point of control (and failure), but in the end we
> decided that a fully distributed system would add unjustifiable
> complexity and expense. Fully distributed systems seem to grow
> behaviors that are as hard to fix as human communication problems.

Yeah, my favorite example of a fully distributed system that seemed "to
grow behaviors that are as hard to fix as human communication problems"
was the Amazon S3 messaging system that carried *gossip*. How human-like
is that?

"""
At 9:41am PDT, we determined that servers within Amazon S3 were having
problems communicating with each other. As background information,
Amazon S3 uses a gossip protocol to quickly spread server state
information throughout the system. This allows Amazon S3 to quickly
route around failed or unreachable servers, among other things. When one
server connects to another as part of processing a customer's request,
it starts by gossiping about the system state. Only after gossip is
completed will the server send along the information related to the
customer request. On Sunday, we saw a large number of servers that were
spending almost all of their time gossiping and a disproportionate
amount of servers that had failed while gossiping. With a large number
of servers gossiping and failing while gossiping, Amazon S3 wasn't able
to successfully process many customer requests.
"""

http://status.aws.amazon.com/s3-20080720.html

Also watch out for backbiting, speaking ill of others, spite and slander.

Best,
Gabe

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
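
P.S. For anyone who hasn't seen the pattern, here is a minimal Python sketch of "gossip first, then send the request" as the AWS note describes it. Everything in it is an assumption made for illustration: the Node class, its method names, and the "higher version wins" merge rule are hypothetical, and none of it is Amazon's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical server that gossips its view of peers before each request."""
    name: str
    # peer name -> (version, alive); assumed rule: the higher version wins.
    view: dict = field(default_factory=dict)

    def observe(self, peer, alive):
        """Record a first-hand observation about a peer, bumping its version."""
        version = self.view.get(peer, (0, True))[0] + 1
        self.view[peer] = (version, alive)

    def merge(self, other_view):
        """Adopt any entry that is newer than what this node already knows."""
        for peer, (version, alive) in other_view.items():
            if version > self.view.get(peer, (0, True))[0]:
                self.view[peer] = (version, alive)

    def gossip_with(self, other):
        """Exchange state in both directions."""
        snapshot = dict(self.view)  # copy so both sides merge pre-exchange state
        self.merge(other.view)
        other.merge(snapshot)

    def send_request(self, other, payload):
        """Gossip first; only afterwards hand over the customer request."""
        self.gossip_with(other)
        return other.handle(payload)

    def handle(self, payload):
        return f"{self.name} handled {payload}"
```

If node a marks c as down and then sends a request to b, b learns about c as a side effect of serving that request. The flip side is that every request pays the gossip cost up front, which is the coupling the status report describes going wrong.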
