Re: [OpenStack-Infra] Suggestion for helping gate users deal with crisis...

2013-09-25  Thierry Carrez
James E. Blair wrote:
 Clint Byrum cl...@fewbar.com writes:
 
 Hello infra rockstars. First and foremost, thank you for keeping the
 well oiled machinery of the OpenStack infrastructure running. It is a
 marvel of modern engineering, and I am not just saying that because I
 am prone to hyperbole.

 Last night while the gate was exploding a few of us noticed, and
 weren't really sure what to do. ttx was whitelisted in the statusbot,
 but untrained in how to handle it. I dug through jenkins configs and
 logs but I am completely ignorant of zuul and thus would have done
 more damage than not had I been able to coax anything out of the system
 (luckily I am also completely unprivileged.. good job :).
 
 Cool, we can do something about that.  Thierry, please see:
 
   http://ci.openstack.org/irc.html#statusbot

I actually read it during the crisis but thought that the commands had
to be issued via PRIVMSG :) Then that led me to look for the nick
statusbot was running on, which was a red herring.

Proposed doc clarification @ https://review.openstack.org/48209

-- 
Thierry Carrez (ttx)



Re: [OpenStack-Infra] Suggestion for helping gate users deal with crisis...

2013-09-24  James E. Blair
Clint Byrum cl...@fewbar.com writes:

 Hello infra rockstars. First and foremost, thank you for keeping the
 well oiled machinery of the OpenStack infrastructure running. It is a
 marvel of modern engineering, and I am not just saying that because I
 am prone to hyperbole.

 Last night while the gate was exploding a few of us noticed, and
 weren't really sure what to do. ttx was whitelisted in the statusbot,
 but untrained in how to handle it. I dug through jenkins configs and
 logs but I am completely ignorant of zuul and thus would have done
 more damage than not had I been able to coax anything out of the system
 (luckily I am also completely unprivileged.. good job :).

Cool, we can do something about that.  Thierry, please see:

  http://ci.openstack.org/irc.html#statusbot

It's easy to add people to the statusbot whitelist without them needing
to be infra rockstars -- I'm happy to add anyone who shows an ability to
write a coherent status update.

We also have quite a few more features that we want from statusbot; it
should be used in more channels (but we need automated management of
channel permissions), and status alerts should show up on Gerrit and
Zuul status pages.  I believe if it were more useful, and therefore
used, and therefore visible, people would know where to look and what to
expect.  If anyone wants to hack on ircbots in python or add some fun
javascript to some web pages, come chat with us.
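
As a concrete illustration of what hacking on IRC bots in Python at that
level can look like, here is a minimal, hypothetical sketch of the
whitelist plus "#status" command pattern discussed in this thread.  It is
not the actual statusbot code: the function names, the whitelist file
format, and the behaviour are assumptions made purely for illustration.

#!/usr/bin/env python
# Minimal sketch (not the real statusbot) of the pattern described above:
# act only on "#status ..." lines sent in-channel by whitelisted nicks.
# All names and formats here are invented for illustration.

import json


def load_whitelist(path):
    """Load the set of nicks allowed to issue status updates.

    Assumes a JSON file containing a list of nick strings, e.g.
    ["ttx", "jeblair"].  The real whitelist format may differ.
    """
    with open(path) as f:
        return set(json.load(f))


def handle_channel_message(nick, message, whitelist):
    """Return a status announcement if the message is a valid command.

    Only "#status <text>" messages from whitelisted nicks are acted on;
    everything else is ignored.  Commands arrive in the channel, not via
    PRIVMSG, which was the point of confusion elsewhere in this thread.
    """
    if not message.startswith('#status '):
        return None
    if nick not in whitelist:
        return None
    text = message[len('#status '):].strip()
    if not text:
        return None
    # A real bot would also log the update, change channel topics, and
    # surface the alert on the Gerrit and Zuul status pages.
    return 'NOTICE: %s' % text


if __name__ == '__main__':
    whitelist = {'ttx', 'jeblair'}  # stand-in for load_whitelist(...)
    print(handle_channel_message(
        'ttx', '#status PyPI mirror problems causing gate failures.',
        whitelist))
    print(handle_channel_message(
        'random_nick', '#status everything is fine', whitelist))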

 I'd like to suggest that infra develop a playbook for dealing
 with crises. This is not just for those of you with the power to fix
 things. This is a public document that helps people understand what to do,
 who to wait for, who to contact, and how to do so when things are broken.

 Statusbot works well as an Officer Barbrady-style "nothing to see here,
 move along", but it is not so useful in helping to get the ball rolling
 on a solution after hours in the US. Had there been a playbook with
 roles listed, the statusbot would have been put to use. As in:

 In the event of any failure, the statusbot should be updated by someone
 who is whitelisted in this file [link to the file] in git. Those
 individuals can send a message in this format, in #openstack-infra to
 update the status:

#status PyPI mirror problems causing gate failures. Please stand by...

 This should be in a wiki page or published document somewhere that is
 linked basically everywhere. This allows those who see a failure as
 a crisis to click through and find a warm fuzzy set of options to take. It
 also helps take the burden off the infra team for educating everyone on
 how to deal with a crisis. It is especially helpful in scaling the team
 out, as new members can learn how the team operates in general via the
 playbook rather than having to wait for a crisis to happen.

 Anyway, just a suggestion. As I don't know the plays, I cannot write
 this page, but I would have been able to share the link with the few
 others who were affected by the outage last night, and that might have
 reduced their stress level a bit.

Now you know where that documentation would live if it existed.  We try
to document everything about the system on ci.openstack.org.  If you are
at all curious about the project infrastructure, I highly recommend it.

Our goal is not to be project gatekeepers.  We have robots for that.
Our goal is to facilitate everyone's participation in the project
infrastructure.  In most cases, privileged access is not required to
triage or solve problems.  I believe some was used in this case (to
manually add a package to the pypi mirror in order to speed up the
solution), but generally when something breaks, anyone in the project
has the power to fix it.  The best way to get the ball rolling on a
solution is for someone to start working on it.  And yes, then someone
should post a status update so people know it's being worked on.

I'm wary of writing a playbook that says "if something breaks, contact
so-and-so to fix it."  That's not how this thing works.  It's more like
"if something breaks, start trying to fix it."  And while I understand
that not everyone is capable of diagnosing and fixing every problem,
quite a number of people have managed to track down, diagnose, and fix
problems in this system without being infra rockstars.

Keep in mind that this failure, and indeed most failures, are not
infrastructure failures.  They are actually the gate working as
designed.  It is _supposed_ to 'break' when it is not possible to test
changes under the constraints we have set.  So a more traditional
enterprise service playbook doesn't help -- there are no simple levers
to pull, every such problem is different and requires a unique solution,
and everyone is empowered to create that solution.

I think your idea is a good one -- we should have more documentation to
help people understand what happens and where to look when things go
wrong.  I hope I've conveyed an idea of what I think the character of
that documentation should be.

-Jim


Re: [OpenStack-Infra] Suggestion for helping gate users deal with crisis...

2013-09-24  Clint Byrum
Excerpts from jeblair's message of 2013-09-24 10:15:10 -0700:
 Clint Byrum cl...@fewbar.com writes:
  I'd like to suggest that infra develop a playbook for dealing
  with crises. This is not just for those of you with the power to fix
  things. This is a public document that helps people understand what to do,
  who to wait for, who to contact, and how to do so when things are broken.
 
  Statusbot works well as an Officer Barbrady-style "nothing to see here,
  move along", but it is not so useful in helping to get the ball rolling
  on a solution after hours in the US. Had there been a playbook with
  roles listed, the statusbot would have been put to use. As in:
 
  In the event of any failure, the statusbot should be updated by someone
  who is whitelisted in this file [link to the file] in git. Those
  individuals can send a message in this format, in #openstack-infra to
  update the status:
 
 #status PyPI mirror problems causing gate failures. Please stand by...
 
  This should be in a wiki page or published document somewhere that is
  linked basically everywhere. This allows those who see a failure as
  a crisis to click through and find a warm fuzzy set of options to take. It
  also helps take the burden off the infra team for educating everyone on
  how to deal with a crisis. It is especially helpful in scaling the team
  out, as new members can learn how the team operates in general via the
  playbook rather than having to wait for a crisis to happen.
 
  Anyway, just a suggestion. As I don't know the plays, I cannot write
  this page, but I would have been able to share the link with the few
  others who were affected by the outage last night, and that might have
  reduced their stress level a bit.
 
 Now you know where that documentation would live if it existed.  We try
 to document everything about the system on ci.openstack.org.  If you are
 at all curious about the project infrastructure, I highly recommend it.
 
 Our goal is not to be project gatekeepers.  We have robots for that.
 Our goal is to facilitate everyone's participation in the project
 infrastructure.  In most cases, privileged access is not required to
 triage or solve problems.  I believe some was used in this case (to
 manually add a package to the pypi mirror in order to speed up the
 solution), but generally when something breaks, anyone in the project
 has the power to fix it.  The best way to get the ball rolling on a
 solution is for someone to start working on it.  And yes, then someone
 should post a status update so people know it's being worked on.
 
 I'm wary of writing a playbook that says "if something breaks, contact
 so-and-so to fix it."  That's not how this thing works.  It's more like
 "if something breaks, start trying to fix it."  And while I understand
 that not everyone is capable of diagnosing and fixing every problem,
 quite a number of people have managed to track down, diagnose, and fix
 problems in this system without being infra rockstars.
 

Indeed, the playbook I have in mind is not "this is how things are,
and these are the people who do the things". Ideally each playbook entry
is tied to a bug that has long-term ordering problems or just a ton of
work pending.

 Keep in mind that this failure, and indeed most failures, are not
 infrastructure failures.  They are actually the gate working as
 designed.  It is _supposed_ to 'break' when it is not possible to test
 changes under the constraints we have set.  So a more traditional
 enterprise service playbook doesn't help -- there are no simple levers
 to pull, every such problem is different and requires a unique solution,
 and everyone is empowered to create that solution.
 

Indeed, I have noticed that there are rarely easy answers, as all of
the easy answers are automated properly. :)

I think what I'm looking for is something along the lines of "this is
what is supposed to be happening right now", for those who are not
involved day to day in that decision stream.
