[google-appengine] Re: a fundamental increase in App Engine transparency and locality is needed

Bryan A. Pendleton Fri, 06 Mar 2009 12:16:42 -0800

Jon,

I think we'd all agree that an automatically published report is
better than (a) or (b). However, because there's Secret Sauce running
behind App Engine, surely including a myriad of infrastructure details
that you'll never be allowed to publish detailed data on, (c) will
probably never be enough either.


Or, well, I'll ask the question: what infrastructure was involved in
the most recent maintenance, and what about its performance
characteristics changed? Measuring/publishing various reports that we
all *could* generate ourselves *is* helpful, but what of the "behind
the scenes" stuff that can break?

Based on increases in latency, I could come up with all kinds of
explanations (I've worked with Hadoop for a number of years, so I know
of the general types of odd conditions that come up with a cluster of
machines attempting to work in concert but managed by failure-tolerant
software). Did a network switch fail that caused a large fraction of
the DataStore nodes to become unreachable? If so, would you ever be
willing to publish switch performance analytics as part of the App
Engine health status? Or, more generally, was it something
misconfigured or that just doesn't perform as well as predicted? If
misconfigured, can it be hot-updated, or is another downtime necessary
to repair the mistake?

The discussion of dedicated clusters does bring up a related question:
Google apparently does have some cluster update process, as evidenced
by the way features roll out to different users of Gmail at different
rates. I assume, based on everything we've been (not) told about
AppEngine, that AppEngine is essentially running as one single
instance of whatever granularity Gmail and similar apps are maintained
at Google. Could it be split into several? Then, presumably,
maintenance and upgrade periods could be scheduled somewhat
independently for different pools.

Ok, enough conjecture on my part. My point is, I think a PR-sanitized
message is still going to be necessary unless you guys truly disclose
the inner workings of the infrastructure that provides AppEngine,
which seems unlikely ever to happen completely.

-Bryan

On Mar 6, 2:37 pm, Jon McAlister <jon...@google.com> wrote:
> This is a very interesting thread. And yes, indeed, cluster stability
> and uptime and transparency are all of the utmost importance to us
> here on the App Engine team.
>
> I also agree that a fundamental increase in transparency would serve
> everyone's interests.
>
> As far as I can tell, transparency through a crisis occurs in one of
> three ways: (a) direct communication, (b) broadcasts of pr-approved
> messages, and (c) raw data.
>
> As an engineer, I like (c) more than (a) or (b). It's objective, and
> it can be distributed more efficiently. It's more efficient than (b)
> because we don't have to wordsmith it or get it reviewed and approved,
> it just goes out immediately. It's more efficient than (a) because it
> happens automatically; you don't take up the time of an engineer who
> would otherwise be busy trying to solve the problem.
>
> This is why our team has made it a priority to be very transparent
> about real-time, detailed performance data with our Status Site:
> http://code.google.com/status/appengine. I encourage everyone who
> cares about their application to spend a bit of time learning that
> site. Make sure to click through to the detailed graphs as well.
> Further, if there is more data you would like us to export, then let
> us know. It is in everyone's best interests for us to export more raw
> data.
>
> It is my personal belief that by continuing to iterate and improve
> upon our Status Site we will achieve the fundamental increase in
> transparency, rather than through Daniel's proposal of dedicated hardware,
> direct communication, and promises as to when problems will be fixed.
> That said, Daniel does bring up the excellent point that there is not a
> particularly efficient way for developers to report live production
> problems other than this group. Any thoughts on systems that we should
> look into here to make this more efficient?
>
> Jon
>
>
>
> On Fri, Mar 6, 2009 at 9:25 AM, johnP <j...@thinkwave.com> wrote:
>
> > My experience has shown that customers are generally OK with a
> > response:  There is a problem.  We are aware of it and people are
> > working on it.  It is estimated to be resolved in X time.
>
> > So if we (as vendors) are able to act as a positive PR army for
> > AppEngine, we need to be quickly armed with the data necessary to
> > accurately answer the above questions.
>
> > If vendors (on the front lines of PR Triage) don't have this data, a
> > blame-game will start.  No, we have no %$#$% idea what's wrong.  No -
> > those #$$#&^%$ are not saying anything.  Etc.
>
> > So far, I'd say the following.  Google has been almost good about
> > keeping the information flowing.  There have been periods of 'radio
> > silence'; and the official status report usually reports that "There
> > has been an anomaly but we have determined that nobody was affected
> > (am I a nobody?)".  But overall, they have responded.  And have
> > resolved issues.  And have upgraded the product.  And in the end, have
> > built something really special.
>
> > And I agree with Greg about sleeping at night.
>
> > johnP
>
> > On Mar 6, 5:42 am, peterk <peter.ke...@gmail.com> wrote:
> >> It's an interesting issue...I think we're all happy for things to
> >> behave pretty much like a black box when stuff is working as it's
> >> meant to, but are we so happy with that when things aren't working?
>
> >> Personally I don't need to know who specifically is responsible for
> >> the machines my apps run on, to be able to contact them directly etc.
> >> And I don't think that's really a viable option.
>
> >> However, I agree more feedback on causes and fix schedules when things
> >> go south would be good, along with frequent updates. I think 'naming
> >> and shaming' Google on infrastructure letdowns, on your app, would be
> >> an option for you too, to lay the responsibility with Google.
> >> Significant issues, particularly on popular apps with decent
> >> visibility, are surely embarrassing for Google given their reputation.
> >> Serving them a dose of that embarrassment by telling your visitors 1)
> >> we're on Google's infrastructure, 2) they're having problems right now,
> >> would probably give Google more incentive to avoid repetition of these
> >> issues.. :) If a number of popular apps were frequently reporting
> >> issues with Google, word about that would get around.
>
> >> From listening to talks from GAE folk though, I get the impression
> >> stability is of paramount importance..so I'm hopeful issues like those
> >> recently will be exceptions.
>
> >> On Mar 6, 2:24 am, dsw <daniel.wilker...@gmail.com> wrote:
>
> >> > I have long wanted to build apps on Google App Engine and have learned
> >> > a lot about it in preparation for doing so.  However the one problem
> >> > is if I have a customer and their app goes down for an hour and they
> >> > call me and say "what happened?" and "how can we prevent that in the
> >> > future?" my only response will be "I don't know" and "we can't."
> >> > These are unacceptable answers.
>
> >> > If you want App Engine to "cross the chasm" and become really for real
> >> > then at the very least what you need to do is provide the kind of
> >> > depth of sight into your infrastructure that you the Google engineers
> >> > have.  Further you need some kind of locality in the cloud: it would
> >> > help if there were some way of ensuring reliability by knowing that
> >> > (1) I have bought space on some particular cluster of machines (2)
> >> > which is now stable and more apps are not being added to it; I should
> >> > know (3) who is maintaining that cluster and (4) be able to send them
> >> > a trouble ticket and (5) have some idea of what is wrong and how long
> >> > it is going to take them to fix it.
>
> >> > This opaque cloud utility of compute stuff is a fantasy: some locality
> >> > and transparency will be needed or App Engine will never be really for
> >> > real.
