I should probably also chime in with a positive note - on the bright
side, I haven't heard anyone complain about the datastore in a very
long time.  The HRD does seem to deliver on its promise.  Now we just
need all the rest of the infrastructure to come up to this same level
of robustness!

Jeff

On Wed, Sep 12, 2012 at 8:33 PM, Thomas Wiradikusuma
<wiradikus...@gmail.com> wrote:
> Hi Jeff,
>
> I'm sorry for your loss. I agree completely with your message and your
> recommendation to Google. Hi GAE team, at the very least, please let us know
> when you plan to roll out upgrades and stuff. It might not break stuff, but
> we need to know so we can be prepared (at least we won't send a newsletter to
> thousands of people saying "Hey, check out our website now", or schedule a
> pitch to an investor).
>
>
>
> On Thursday, 13 September 2012 08:35:39 UTC+8, Jeff Schnitzer wrote:
>>
>> On Wed, Sep 12, 2012 at 3:12 PM, Kaan Soral <kaan...@gmail.com> wrote:
>> > This is why I love App Engine: when a problem occurs, instead of having a
>> > heart attack or committing suicide, you can just wait for it to be
>> > resolved.
>>
>> Hmmm.  This really unfortunately timed incident may have cost us an
>> important client, so I'm not feeling the love.
>>
>> I have quite a lot of experience building and running large online
>> systems prior to embracing GAE and my products have never had as much
>> downtime as I've had over the last year.  It hasn't always been
>> Google's fault (the entire .st registry going down for 8+ hours really
>> sucked[1]) but it usually has been.  See:
>>
>>  * Instance startup time ballooning by 3X and hitting deadlines
>> (multiple occasions)
>>  * GAE blocking CloudFlare with an undocumented security system
>>  * This incident, where Java instances started mysteriously failing
>>
>> Would waiting have fixed these issues?  I'm not convinced.  Google may
>> have smart people running GAE but they aren't watching _my_ app,
>> they're just watching for an uptick in the number of complaints.  If
>> you're doing something slightly unusual (say, running a CF reverse
>> proxy), you might be statistical noise.  Apparently this Java problem
>> _was_ widespread, but I had no way of knowing that.
>>
>> GAE's value proposition is that it's better to have Google's smart
>> engineers building and maintaining your infrastructure.  But my site
>> would be more reliable if I had one dumb person (possibly me) who
>> cares specifically about _my_ infrastructure.  I've screwed up
>> deployments and upgrades in production before, but at least I'm aware
>> when changes happen, get immediate feedback, and can fix the problem
>> right then and there.
>>
>> With GAE, the only thing I can do when my alarms go off is to whine as
>> loudly as possible.  But there is no feedback!  I have no way of
>> knowing if Google is working on the problem or if they're still
>> waiting for more complaints that will never materialize.  Will I be
>> down for 15 minutes, 1 hour, 2 hours, 8 hours, forever?  How long do
>> you want to wait?
>>
>> This feels like a fundamental flaw in the PaaS concept, destined to
>> produce multiple-hour downtimes at irregular intervals.  The feedback
>> loop is too slow (and lossy if the problem is not widespread).
>> There's no amount of QA or testing that will prevent failures in a
>> system as big and complicated as GAE.  So the only reasonable option is
>> to get that feedback loop shorter.  How can that happen?  Some ideas:
>>
>>  * Google could announce when they are rolling out changes.  I don't
>> need release notes (although it would be nice to know what to watch
>> for) but I'd like to know when I should pay extra attention, or when
>> not to schedule client demos.  Facebook does something like this, rolling out
>> platform changes on specific days of the week (which I long ago
>> stopped caring about).
>>
>>  * Google could make extra support channels available during this
>> time.  Hell, use Twitter.  Think of us as your QA staff - if we see
>> something amiss, we'd like to let you know.
>>
>>  * Google could be more transparent about problems as they happen.
>> When you know there is an issue, let us know.  Right now I must assume
>> that any problem Google hasn't acknowledged is a problem Google
>> doesn't know about; if you acknowledged problems, I could stop spamming
>> @google.com addresses.
>>
>>  * Google could monitor our apps, and compare error rates before
>> rollout to error rates after rollout.  Ideally you'd break this down
>> by component; figure out which apps use the search api, so when you
>> roll out changes to the search system, you're specifically watching
>> for an uptick in 500 errors from those apps.  Something like that.
>>
>> Any other ideas?  I really like GAE and I really like the PaaS
>> concept.  But reliability is really a problem.  It's probably going to
>> be an even bigger problem in the future as GAE (hopefully)
>> adds new features and gets a bigger footprint.  More moving parts
>> means more failures.
>>
>> Jeff
>>
>> P.S. Paying $6k/yr for Premier Support is not the answer.  Whether or
>> not that would solve my problem, that doesn't solve GAE's problem.
>>
>>    [1]:
>> http://blorn.com/post/29851770158/beware-cutesy-two-letter-tlds-for-your-domain-name
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/OsO531fxrSAJ.
>
> To post to this group, send email to google-appengine@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengine+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
