Re: [OpenStack-Infra] proposal: custom favicon for review.o.o

2020-01-29 Thread Ian Wienand
On Wed, Jan 29, 2020 at 06:35:28AM +, Sorin Sbarnea wrote:
> I guess that means that you are not against the idea.

I know it's probably not what you want to hear, but since favicons
seem to be becoming a component of branding, like a logo, I think
you'd do well to run your proposed work past someone with the
expertise to evaluate it against whatever branding standards we have
(I imagine someone on the TC would have such contacts from the
Foundation or whoever does marketing).

If you just make something up and send it, you're probably going to
get review questions like "how can we know this meets the branding
standards to be the logo on our most popular website" or "is this the
right size, format etc. for browsers in 2020", which are things
upstream marketing and web people could sign off on.  So, personally,
I'd suggest a bit of pre-coordination; that would make any resulting
technical changes very non-controversial.

-i


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] tarballs.openstack.org to AFS publishing gameplan

2020-01-29 Thread Ian Wienand
On Wed, Jan 29, 2020 at 05:21:49AM +, Jeremy Stanley wrote:
> Of course I meant from /(.*) to tarballs.opendev.org/openstack/$1 so
> that clients actually get directed to the correct files. ;)

Ahh yes, sorry; you mentioned that in IRC and I should have
incorporated it.  I'm happy with that approach; we can also have it
in place and test it by overriding our hosts files before any
cut-over.

-i


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] proposal: custom favicon for review.o.o

2020-01-28 Thread Ian Wienand
On Fri, Jan 24, 2020 at 09:32:00AM +, Sorin Sbarnea wrote:
> We are currently using default Gerrit favicon on
> https://review.opendev.org and I would like to propose changing it
> in order to ease differentiation between it and other gerrit servers
> we may work with.

I did notice Google started putting favicons next to search results
recently too, but then apparently reverted the change [1].

> How hard it would be to override it? (where)

I'm 99% sure it's built in from [2] and there's no way to override it
at runtime.  It looks like for robots.txt we tell the Apache that
fronts Gerrit to look elsewhere [3]; I imagine the same would need to
be done for favicon.ico.
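
For illustration, the override would be something along these lines; a
sketch only, the real change would go through the vhost template in
[3], and the source file and paths here are made up:

# Sketch: serve a local favicon instead of the one bundled in the
# Gerrit war, mirroring the robots.txt approach.  In the real template
# these directives would sit before the catch-all ProxyPass to Gerrit.
install -m 0644 our-favicon.ico /var/www/favicon.ico
cat <<'EOF' >> /etc/apache2/sites-enabled/gerrit.conf
ProxyPass /favicon.ico !
Alias /favicon.ico /var/www/favicon.ico
EOF
systemctl reload apache2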

... also be aware that the upcoming containerisation of Gerrit
probably invalidates all of that.

-i

[1] 
https://www.theverge.com/2020/1/24/21080424/google-search-result-ads-desktop-favicon-redesign-backtrack-controversial-experiment
[2] 
https://opendev.org/opendev/gerrit/src/branch/openstack/2.13.12/gerrit-war/src/main/webapp
[3] 
https://opendev.org/opendev/puppet-gerrit/src/branch/master/templates/gerrit.vhost.erb#L71


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] tarballs.openstack.org to AFS publishing gameplan

2020-01-28 Thread Ian Wienand
Hello,

We're at the point of implementing the tarballs.openstack.org
publishing changes from [1], and I would like to propose some
low-level plans for feedback, at a level of detail beyond what is in
the spec.

We currently have tarballs.opendev.org which publishes content from
/afs/openstack.org/project/opendev.org/tarballs.  This is hosted on
the physical server files02.openstack.org and managed by puppet [2].

 1) I propose we move tarballs.opendev.org to be served by
static01.opendev.org and configured via ansible

Because:

 * it's one less thing running on a Xenial host with puppet, which we
   don't want to maintain.
 * it will live alongside tarballs.openstack.org, per below

The /afs/openstack.org/project/opendev.org directory is currently a
single AFS volume "project.opendev" and contains subdirectories:

 docs tarballs www

opendev.org jobs currently write their tarball content into the AFS
location, which is periodically "vos released" by [3].

 2) I propose we make a separate volume, with separate quota, and
mount it at /afs/openstack.org/project/tarballs.opendev.org.  We
copy the current data to that location, modify the opendev.org
tarball publishing jobs to use that location, and set up the same
periodic release (a rough command sketch follows the list below).

Because:

 * Although the volume is currently tiny (<100MB), it will become
   quite large when combined with the ~140GB of openstack content
 * this seems distinctly separate from the docs and www data
 * we have content for other hosts under /afs/openstack.org/project
   like this, so it fits logically.
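
For reference, creating the new volume from 2) would be roughly along
these lines (run with admin credentials via k5start or -localauth as
usual; the volume name, quota and read-only site are assumptions that
we'd settle at implementation time):

# Sketch only: volume name, quota and replica placement are assumptions
vos create -server afs01.dfw.openstack.org -partition a \
    -name project.tarballs -maxquota 300000000
# mount it via the read-write path
fs mkmount /afs/.openstack.org/project/tarballs.opendev.org project.tarballs
vos addsite -server afs02.dfw.openstack.org -partition a -id project.tarballs
vos release -v project.tarballs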

The next steps are described in the spec; with this in place, we copy
the current openstack tarballs from
static.openstack.org:/srv/static/tarballs to
/afs/openstack.org/project/tarballs.opendev.org/openstack/

We then update the openstack tarball publishing jobs to publish to
this new location via AFS (we should be able to make this happen in
parallel, initially).

Finally, we need to serve these files.

 3) I propose we make tarballs.openstack.org a vhost on
static.opendev.org that serves the
/afs/openstack.org/project/tarballs.opendev.org/openstack/
directory (a rough vhost sketch follows the list below).

Because

 * This is transparent for tarballs.openstack.org; all URLs work with
   no redirection, etc.
 * anyone hitting tarballs.opendev.org will see top-level project
   directories (openstack, zuul, airship, etc.) which makes sense.
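
For what it's worth, the extra vhost is about as simple as it gets;
something like the following (a sketch only: TLS and logging omitted,
and the real thing would be templated via system-config rather than
hand-written):

cat <<'EOF' > /etc/apache2/sites-available/tarballs.openstack.org.conf
<VirtualHost *:80>
    ServerName tarballs.openstack.org
    DocumentRoot /afs/openstack.org/project/tarballs.opendev.org/openstack
    <Directory /afs/openstack.org/project/tarballs.opendev.org/openstack>
        Options Indexes FollowSymLinks
        Require all granted
    </Directory>
</VirtualHost>
EOF
a2ensite tarballs.openstack.org && systemctl reload apache2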

I think this will get us where we want to be.

Any feedback welcome, thanks.  We will keep track of things in [4].

[1] https://docs.opendev.org/opendev/infra-specs/latest/specs/retire-static.html
[2] 
https://opendev.org/opendev/system-config/src/branch/master/manifests/site.pp#L441
[3] 
https://opendev.org/opendev/system-config/src/branch/master/modules/openstack_project/files/openafs/release-volumes.py
[4] https://storyboard.openstack.org/#!/story/2006598


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Creating OpenDev control-plane docker images and naming

2019-12-02 Thread Ian Wienand
On Tue, Nov 26, 2019 at 05:31:07PM +1100, Ian Wienand wrote:
> What I would propose is that projects do *not* have a single,
> top-level Dockerfile, but only (potentially many) specifically
> name-spaced versions.

> [2] I started looking at installing these together from a Dockerfile
> in system-config.  The problem is that you have a "build context",
> basically the directory the Dockerfile is in and everything under
> it.

I started looking closely at this, and I think I have reversed my
position from above.  That is, I think we should keep the
OpenDev-related dockerfiles in system-config.

[1] is a change in system-config to add jobs to build openstacksdk,
diskimage-builder and nodepool-[builder|launcher] containers.  It does
this by having these projects as required-projects: in the job
configuration and copying the Dockerfile into the Zuul-checked-out
source (and using that as the build context).  A bit ugly, but it
works.
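
In shell terms the job is doing roughly this (paths follow Zuul's
usual src/ checkout layout; the location of the Dockerfile within
system-config is illustrative):

# Build nodepool-builder from the Zuul-checked-out source, using a
# Dockerfile kept in system-config; directory names are illustrative
cd ~zuul/src/opendev.org/zuul/nodepool
cp ~zuul/src/opendev.org/opendev/system-config/docker/nodepool-builder/Dockerfile .
docker build -t nodepool-builder:ci .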

However, using these jobs for nodepool's CI requires importing them
into zuul/nodepool.  This is tested with [2].

However, Zuul just reports:

  This change depends on a change with an invalid configuration.

It only depends-on [1], which has a valid configuration, at least in
the opendev tenant.

I think that this has to do with the zuul tenant not having the
projects that are used by required-projects: in the new system-config
jobs [3], but I am not certain it doesn't have something else to do
with the config errors at [4].  I have filed [5] because, at a
minimum, a more helpful error would be good.

-i

[1] https://review.opendev.org/696000
[2] https://review.opendev.org/696486
[3] https://review.opendev.org/696859
[4] https://zuul.opendev.org/t/zuul/config-errors
[5] https://storyboard.openstack.org/#!/story/2006968


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] Creating OpenDev control-plane docker images and naming

2019-11-25 Thread Ian Wienand
Hello,

I'm trying to get us to a point where we can use nodepool container
images in production, particularly because I want to use updated tools
available in later distributions than our current Xenial builders [1].

We have hit the hardest problem: naming :)

To build a speculative nodepool-builder container image that is
suitable for a CI job (the prerequisite for production), we need to
somehow layer openstacksdk, diskimage-builder and finally nodepool
itself into one image for testing. [2]

These all live in different namespaces, and the links between them are
not always clear.  Maybe a builder doesn't need diskimage-builder if
images come from elsewhere.  Maybe a launcher doesn't need
openstacksdk if it's talking to some other cloud.

This becomes weird when the zuul/nodepool-builder image depends on
opendev/python-base but also openstack/diskimage-builder and
openstack/openstacksdk.  You've got 3 different namespaces crossing
with no clear indication of what is supposed to work together.

I feel like we've been (or at least I have been) thinking that each
project will have *a* Dockerfile that produces some canonical image
for that project.  I think I've come to the conclusion this is
infeasible.

There can't be a single container that suits everyone, and indeed this
isn't the Zen of containers anyway.

What I would propose is that projects do *not* have a single,
top-level Dockerfile, but only (potentially many) specifically
name-spaced versions.

So for example, everything in the opendev/ namespace will be expected
to build from opendev/python-base.  Even though dib, openstacksdk and
zuul come from different source-repo namespaces, it will make sense
to have:

  opendev/python-base
  +-> opendev/openstacksdk
  +-> opendev/diskimage-builder
  +-> opendev/nodepool-builder

because these containers are expected to work together as the opendev
control-plane containers.  Since opendev/nodepool-builder is defined
as an image that is expected to make RAX-compatible, OpenStack-uploadable
images, it makes logical sense for it to bundle the kitchen sink.

I would expect that nodepool would also have a Dockerfile.zuul to
create images in the zuul/ namespace as the "reference"
implementation.  Maybe that looks a lot like Dockerfile.opendev -- but
then again maybe it makes different choices and does stuff like
Windows support etc. that the opendev ecosystem will not be interested
in.  You can still build and test these images just the same; we'll
just know they're targeted at doing something different.
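
Concretely, the build side would then look something like the
following; the file names and tags follow the scheme above but the
exact contexts are illustrative:

# Each repo carries namespaced Dockerfiles rather than a single
# top-level one; opendev/* images are expected to declare
# FROM opendev/python-base
docker build -f Dockerfile.opendev -t opendev/diskimage-builder diskimage-builder/
docker build -f Dockerfile.opendev -t opendev/nodepool-builder  nodepool/
# ... while a Dockerfile.zuul in the same repo produces the zuul/ image
docker build -f Dockerfile.zuul    -t zuul/nodepool-builder     nodepool/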

As an example:

  https://review.opendev.org/696015 - create opendev/openstacksdk image
  https://review.opendev.org/693971 - create opendev/diskimage-builder

(a nodepool change will follow, but it's a bit harder as it's
cross-tenant so projects need to be imported).

Perhaps codifying that there's no such thing as *a* Dockerfile, and
possibly some rules about what happens in the opendev/ namespace, is
spec-worthy; I'm not sure.

I hope this makes some sense!

Otherwise, I'd be interested in any and all ideas of how we basically
convert the nodepool-functional-openstack-base job to containers (that
means bringing up a devstack and testing nodepool, dib & openstacksdk
with full Depends-On: support to make sure it can build, upload and
boot).  I consider that a prerequisite before we start rolling
anything out in production.

-i

[1] I know we have ideas to work around the limitations of using host
tools to build images, but one thing at a time! :)

[2] I started looking at installing these together from a Dockerfile
in system-config.  The problem is that you have a "build context",
basically the directory the Dockerfile is in and everything under
it.  You can't reference anything outside this.  This does not
play well with Zuul, which has checked out the code for dib,
openstacksdk & nodepool into three different sibling directories.
So to speculatively build them together, you have to start copying
Zuul checkouts of code underneath your system-config Dockerfile,
which is crazy.  It doesn't use any of the speculative build
registry stuff and just feels wrong because you're not building
small parts on top of each other, as Docker is designed to do.  I
still don't really know how it will work across all the projects
for testing either.
  https://review.opendev.org/696000


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] [zuul-jobs] configure-mirrors: deprecate mirroring configuration for easy_install

2019-11-24 Thread Ian Wienand
Hello,

Today I force-merged [5] to avoid widespread gate breakage.  Because
the change is in zuul-jobs, we have a policy of announcing
deprecations.  I've written the following but have not yet sent it to
zuul-announce (per policy), as I'm not 100% confident in the
explanation.

I'd appreciate it if, once proof-read, someone could send it out
(modified or otherwise).

Thanks,

-i

--

Hello,

The recent release of setuptools 42.0.0 has broken the method used by
the configure-mirrors role to ensure that easy_install (the older
method of installing packages, before pip came into widespread use
[1]) would only access the PyPI mirror.

The prior mirror setup code would set the "allow_hosts" whitelist to
the mirror host exclusively in pydistutils.cfg.  This would avoid
easy_install "leaking" access outside the specified mirror.

Change [2] in setuptools means that pip is now used to fetch packages.
Since it does not implement the constraints of the "allow_hosts"
setting, specifying this option has become an error condition.  This
is reported as:

 the `allow-hosts` option is not supported when using pip to install requirements

It has been pointed out [3] that this prior code would break any
dependency_links [4] that might be specified for the package (as the
external URLs will not match the whitelist).  Overall, there is no
desire to work around this behaviour, as easy_install is considered
deprecated for any current use.

In short, this means the only solution is to remove the now
conflicting configuration from pydistutils.cfg.  Due to the urgency of
this update, it has been merged with [5] before our usual 2-week
deprecation notice.

The result of this is that jobs still using easy_install with an
older setuptools (perhaps in a virtualenv) may not correctly access
the specified mirror.  Assuming jobs have access to PyPI they would
still work, although without the benefits of a local mirror.  If such
jobs are firewalled from upstream they may now fail.  We consider the
chance of jobs using this legacy install method in this situation to
be very low.

Please contact zuul-discuss [6] with any concerns.

We now return you to your regularly scheduled programming :)

[1] https://packaging.python.org/discussions/pip-vs-easy-install/
[2] 
https://github.com/pypa/setuptools/commit/d6948c636f5e657ac56911b71b7a459d326d8389
[3] https://github.com/pypa/setuptools/issues/1916
[4] https://python-packaging.readthedocs.io/en/latest/dependencies.html
[5] https://review.opendev.org/695821
[6] http://lists.zuul-ci.org/cgi-bin/mailman/listinfo/zuul-discuss


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] CentOS 8 as a Python 3-only base image

2019-09-29 Thread Ian Wienand
On Fri, Sep 27, 2019 at 11:09:22AM +, Jeremy Stanley wrote:
> I'd eventually love to see us stop preinstalling pip and virtualenv
> entirely, allowing jobs to take care of doing that at runtime if
> they need to use them.

You'd think, right?  :) But it is a bit of a can of worms ...

So pip is a part of Python 3 ... "dnf install python3" brings in
python3-pip unconditionally.  So there will always be a pip on the
host.

For CentOS 8 that's pip version 9.something (upstream is on
19.something).  This is where we've traditionally had problems:
requirements etc. use some syntax feature that tickles a bug in old
pip and we're back to trying to override the default version.  I think
we can agree to try and mitigate that in jobs, rather than in base
images.

But as an additional complication, CentOS 8 ships its
"platform-python", which is used by internal tools like dnf.  The
thing is, we have Python tools that could probably reasonably be
considered platform tools, like "glean", which sets up our networking.
I am not sure if "/usr/libexec/platform-python -m pip install glean"
is considered an abuse or a good way to install against a stable
Python version.  I'll go with the latter ...

But ... platform-python doesn't have virtualenv (separate package on
Python 3).  Python documentation says that "venv" is a good way to
create a virtual environment and basically suggests it can do things
better than virtualenv because it's part of the base Python and so
doesn't have to have a bunch of hacks.  Then the virtualenv
documentation throws some shade at venv saying "a subset of
[virtualenv] has been integrated into the standard library" and lists
why virtualenv is better.  Now we have *three* choices for a virtual
environment: venv with either platform python or packaged python, or
virtualenv with packaged python.  Which should an element choose, if
it wants to set up tools on the host during image build?  And how do
we stop every element having to hard-code all this logic into itself
over and over?
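
Spelled out, the three options on a CentOS 8 image look like this
(target paths are just examples):

# venv from the platform Python
/usr/libexec/platform-python -m venv /tmp/venv-platform
# venv from the packaged python3
python3 -m venv /tmp/venv-packaged
# virtualenv (a separate package) with the packaged python3
virtualenv -p python3 /tmp/venv-virtualenv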

Where I came down on this is:

https://review.opendev.org/684462 : this stops installing from source
on CentOS 8, which I think we all agree on.  It makes some opinionated
decisions in creating DIB_PYTHON_PIP and DIB_PYTHON_VIRTUALENV
variables that will "do the right thing" when used by elements:

 * Python 2 first era (trusty/centos7) will use python2 pip and virtualenv
 * Python 3 era (bionic/fedora) will use python3 pip and venv (*not*
   virtualenv)
 * RHEL8/CentOS 8 will use platform-python pip & venv
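
The idea is that an element's install script can then stay
distro-agnostic with something like the following; a sketch, assuming
the variables expand to full pip / venv-creation command lines as per
the change above, and with the second package name purely made up:

# Install tools into the image without caring which Python/venv
# flavour this distro uses; DIB_PYTHON_* come from the change above
${DIB_PYTHON_PIP} install glean
${DIB_PYTHON_VIRTUALENV} /opt/mytool-venv   # "mytool" is a hypothetical example
/opt/mytool-venv/bin/pip install mytool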

https://review.opendev.org/685643 : above in action; installing glean
correctly on all supported distros.

-i


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] CentOS 8 as a Python 3-only base image

2019-09-27 Thread Ian Wienand
Hello,

All our current images use dib's "pip-and-virtualenv" element to
ensure the latest pip/setuptools/virtualenv are installed, with
/usr/bin/pip installing Python 2 packages and /usr/bin/pip3 installing
Python 3 packages.

The upshot of this is that all our base images have Python 2 and 3
installed (even "python 3 first" distros like Bionic).

We have to decide whether we want to continue this with CentOS 8;
specifically, the change at [1].

Installing pip and virtualenv from upstream sources has a long history
full of bugs and workarounds nobody wants to think about (if you do
want to think about it, you can start at [2]).

A major problem has been that we have to put these packages on "hold",
to avoid the situation where the packaged versions are re-installed
over the upstream versions, creating a really big mess of mixed up
versions.

I'm thinking that CentOS 8 is a good place to stop this.  We just
won't support, in dib, installing pip/virtualenv from source for
CentOS 8.  We will hope that the packaged versions of the tools keep
working, but *if* we do require fixes to the system packages, we will
implement them inside jobs directly, rather than on the base images.

I think in the 2019 world this is increasingly less likely, as we have
less reliance on older practices like mixing system-wide installs
(umm, yes devstack ... but we have a lot of work getting centos8
stable there anyway), and the Zuul v3 world makes it much easier to
deploy isolated fixes as roles should we need to.

If we take this path, the images will be Python 3 only -- we recently
switched Ansible's "ansible_python_interpreter" to Python 3 for Fedora
30, and after a little debugging I think that is ready to go.  Of
course jobs can install the Python 2 environment should they desire.

Any comments here or in the review [1] are welcome.

Thanks,

-i

[1] https://review.opendev.org/684462
[2] 
https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/pip-and-virtualenv/install.d/pip-and-virtualenv-source-install/04-install-pip#L73


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] opensource infra: server sizes

2019-08-14 Thread Ian Wienand
On Tue, Aug 13, 2019 at 11:44:36AM +0200, Shadi Akiki wrote:
> 2- how the allocated resource can be downsized (which I was hoping to find
> in the opendev/system-config repo)

You are correct that the sizing details for control plane servers are
not really listed anywhere.

This is really an artifact of us manually creating control-plane
servers.  When we create a new control-plane server, we use the launch
tooling in [1] where you will see we manually select a flavor size.
This is dependent on the cloud we launch the server in and the flavors
they provide us.

There isn't really a strict rule on what flavor is chosen; it's more
art than science :) Basically we pick the smallest flavor that seems
appropriate for what the server is doing.

After the server is created the exact flavor used is not recorded
separately (i.e. other than querying nova directly).  So there is no
central YAML file or anything with the server and the flavor it was
created with.  Sometimes the cloud provider will provide us with
custom flavors, or ask us to use a particular variant.

So in terms of resizing the servers, we are limited to the flavors
provided to us by the providers, which vary.  In terms of the
practicality of resizing, as I'm sure you know this can be harder or
easier depending on a big variety of things on the provider side.  We
have resized servers before when it became clear they were not
performing (recently adding swap to the gitea servers comes to mind).
It varies depending on the type of service; for something not
load-balanced that requires production downtime, it's a very manual
process.

Nobody is opposed to making any of this more programmatic, I'm sure.
It's just a trade-off between the development time to create and
maintain that, and how often we actually start control-plane servers.

In terms of ask.o.o, that is an "8 GB Performance" flavor, as defined
by RAX.  It was rebuilt as an 8GB node (from 4GB) when we upgraded it
to Xenial, as investigation at the time showed 4GB was a bit tight
[2].  8GB is the next quantum of flavor provided by RAX above 4GB.

I hope this helps!

-i

[1] https://opendev.org/opendev/system-config/src/branch/master/launch
[2] http://lists.openstack.org/pipermail/openstack-dev/2018-April/129078.html


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] Weekly Infra Team Meeting Agenda for August 13, 2019

2019-08-12 Thread Ian Wienand
We will be meeting tomorrow at 19:00 UTC in #openstack-meeting on freenode with 
this agenda:

== Agenda for next meeting ==

* Announcements

* Actions from last meeting

* Specs approval

* Priority Efforts (Standing meeting agenda items. Please expand if you have 
subtopics.)
** 
[http://specs.openstack.org/openstack-infra/infra-specs/specs/task-tracker.html 
A Task Tracker for OpenStack]
** 
[http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html
 Update Config Management]
*** topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev

* General topics
** Trusty Upgrade Progress (clarkb 20190813)
*** Next steps for hosting job logs in swift
** AFS mirroring status (ianw 20190813)
*** Debian buster updates are not populated by reprepro but are assumed to be 
present by our mirror setup roles.
** PTG Planning (clarkb 20190813)
*** https://etherpad.openstack.org/p/OpenDev-Shanghai-PTG-2019
** New backup server (ianw 20190813)
*** https://review.opendev.org/#/c/675537

* Open discussion

Thanks,

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] opendev.org downtime Thu Jul 25 07:00 UTC 2019

2019-07-25 Thread Ian Wienand
Hello,

We received reports of connectivity issues to opendev.org at about
06:30 [1].

After some initial investigation, I could not contact
gitea-lb01.opendev.org via ipv4 or 6.

Upon checking its console I saw a range of kernel errors that suggest
the host was probably having issues with its disk [2].

I attempted to hard-reboot it, and it went into an error state.  The
initial error in the server status was

 {'message': 'Timed out during operation: cannot acquire state change lock 
(held by monitor=remoteDispatchDomainCreateWithFlags)', 'code': 500, 'created': 
'2019-07-25T07:25:25Z'}

After a short period, I tried again and got a different error state

 {'message': "internal error: process exited while connecting to monitor: 
lc=,keyid=masterKey0,iv=jHURYcYDkXqGBu4pC24bew==,format=base64 -drive 
'file=rbd:volumes/volume-41553c15-6b12-4137-a318-7caf6a9eb44c:id=cinder:auth_supported=cephx\\;none:mon_host=172.24.0.56\\:6789",
 'code': 500, 'created': '2019-07-25T07:27:21Z'}

The vexxhost status page [3] is currently not showing any outages in
the sjc1 region where this resides.

I think this probably requires vexxhost to confirm the status of the
load-balancer VM.

I tried to launch a new node, at least to have one ready in case of
bigger issues.  This failed with errors about the image service [4].
This further suggests there might be some storage issues on the
backend.

I then checked on the gitea* backend servers, and they have similar
messages in their kernel logs referring to storage too (I should have
done this first, probably).  So this again suggests it is a
region-wide issue.

I have reached out to mnaser on IRC.  I think he is usually GMT-4, so
that gives a few hours before we can expect a response; by then, more
experienced gitea admins will be around too.  Given it appears to be a
backend provider issue, I will not take further action at this point.

Thanks,

-i

[1] 
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2019-07-25.log.html#t2019-07-25T06:36:51
[2] http://paste.openstack.org/show/754834/
[3] https://status.vexxhost.com/
[4] http://paste.openstack.org/show/754835/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Meeting Agenda for July 9, 2019

2019-07-09 Thread Ian Wienand
On Mon, Jul 08, 2019 at 02:36:11PM -0700, Clark Boylan wrote:
> ** Mirror setup updates (clarkb 20190709)
> *** Do we replace existing mirrors with new opendev mirrors running openafs 
> 1.8.3?

I won't make it to the meeting tomorrow, sorry, but here's the current
status, which is largely reflected in
 https://etherpad.openstack.org/p/opendev-mirror-afs

The kafs-based servers have been paused for now due to the hard
crashes in fscache, which require us to monitor them very closely,
something that wasn't happening over the holiday breaks.

 https://review.opendev.org/#/c/669231/

dhowells is on vacation till at least the 15th, and there is no real
prospect of those issues being looked at until after then.

There are some changes in the afs-next kernel branch for us to try,
which should help with the "volume offline" issues we saw being
reported when a "vos release" was happening (basically making kafs
switch to the other server better).  I believe that capturing logs
from our AFS servers helped debug these issues.

I can take an action item to build a kernel with them and switch it
back in for testing late next week when I am back (or someone else
can, if they like).  This will give us enough for a Tested-By: flag
when sending those changes upstream to Linus.

Once everyone is back, we can look more closely at the fscache issues,
which are currently a blocker for future work.

I'm not aware of any issues with openafs 1.8.3 based mirrors.  If we
need any new mirrors, or feel the need to replace those in production,
we should be fine bringing them up with that.

Thanks,

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] ARA 1.0 deployment plans

2019-06-17 Thread Ian Wienand
On Tue, Jun 11, 2019 at 04:39:58PM -0400, David Moreau Simard wrote:
> Although it was first implemented as somewhat of a hack to address the
> lack of scalability of HTML generation, I've gotten to like the design
> principle of isolating a job's result in a single database.
> 
> It easy to scale and keeps latency to a minimum compared to a central
> database server.

I've been ruminating on how all this can work given some constraints
of

- keep current model of "click on a link in the logs, see the ara
  results"

- no middleware to intercept such clicks with logs on swift

- don't actually know where the logs are if using swift (not just
  logs.openstack.org/xy/123456/) which makes it harder to find job
  artefacts like sqlite db's post job run (have to query gerrit or
  zuul results db?)

- some jobs, like in system-config, have "nested" ARA reports from
  subnodes; essentially reporting twice.

Can the ARA backend import a sqlite run after the fact?  I agree that
having globally distributed jobs send results piecemeal back to a
central db isn't going to work; but if a job logged everything to a
local db as now, and we then uploaded that to a central location in a
post-run step, that might work?  Although we can't run
services/middleware on the logs directly, we could store the results
as we see fit and run services on a separate host.

If, say, you had a role that sent the generated ARA sqlite.db to
ara.opendev.org and got back a UUID, then it could write into the logs
an ara-report/index.html which might just be a straight 301 redirect
to https://ara.opendev.org/UUID.  This satisfies the "just click on
it" part.
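
Hand-waving a lot, the post-run side might look something like the
following; the upload endpoint, its response format, and the use of a
meta refresh (a static file can't emit a real 301) are all
assumptions:

# Sketch only: the endpoint and response format are hypothetical
uuid=$(curl -sf -F "db=@ara-report/ansible.sqlite" https://ara.opendev.org/upload)
cat > ara-report/index.html <<EOF
<html><head><meta http-equiv="refresh"
  content="0; url=https://ara.opendev.org/${uuid}/"></head></html>
EOF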

It seems that "all" that needs to happen is that requests for
https://ara.opendev.org/uuid/api/v1/... to query either just the
results for "uuid" in the db.

And could the ara-web app (which is presumably then just statically
served from that host) know that when started as
https://ara.opendev.org/uuid it should talk to
https://ara.opendev.org/uuid/api/...?

I think, though, this might be relying on a feature of the ara REST
server that doesn't exist -- the idea of unique "runs"?  Is that
something you'd have to paper over with, say, wsgi starting a separate
ara REST process/thread to respond to each incoming
/uuid/api/... request (maybe the process just starts pointing to
/opt/logs/uuid/results.sqlite)?

This doesn't have to grow indefinitely; we can similarly just have a
cron query to delete rows older than X weeks.

Easy in theory, of course ;)

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] Meeting Agenda for June 18, 2019

2019-06-17 Thread Ian Wienand
== Agenda for next meeting ==

* Announcements

* Actions from last meeting

* Specs approval

* Priority Efforts (Standing meeting agenda items. Please expand if you have 
subtopics.)
** 
[http://specs.openstack.org/openstack-infra/infra-specs/specs/task-tracker.html 
A Task Tracker for OpenStack]
** 
[http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html
 Update Config Management]
*** topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev
*** Next steps

* General topics
** Trusty Upgrade Progress (ianw 20190618)
** https mirror update (ianw 20190618)
*** kafs in production update
*** https://review.opendev.org/#/q/status:open+branch:master+topic:kafs

* Open discussion


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] ARA 1.0 deployment plans

2019-06-11 Thread Ian Wienand
Hello,

I started to look at the system-config base -devel job, which runs
Ansible & ARA from master (this job has been quite useful in flagging
issues early across Ansible, testinfra, ARA etc, but it takes a bit of
work for us to keep it stable...)

It seems ARA 1.0 has moved in some directions we're not handling right
now.  Playing with [1] I've got ARA generating and uploading its
database.

Currently, Apache matches an ara-report/ directory on
logs.openstack.org and sends the request to the ARA wsgi application,
which serves the response from the sqlite db in that directory [2].

If I'm understanding, we now need ara-web [3] to display the report
page we all enjoy.  However this web app currently only gets data from
an ARA server instance that provides a REST interface with the info?

I'm not really seeing how this fits with the current middleware
deployment? (unfortunately [4], or an analogue in the new release,
seems to have disappeared).  Do we now host a separate ARA server on
logs.openstack.org on some known port that knows how to turn
/*/ara-report/ URL requests into access of the .sqlite db on disk and
thus provide the REST interface?  And then somehow we host an ara-web
instance that knows how to request from this?

Given that I can't see us wanting to do a bunch of puppet hacking to
get new services on logs.openstack.org, and yet it would also take
fairly non-trivial effort to migrate the extant bits and pieces on
that server to an all-Ansible environment, I think we have to give
some thought to how we'll roll this out (plus add in containers,
possibly logs on swift, etc ... for extra complexity :)

So does anyone have thoughts on a high-level view of how this might
hang together?

-i

[1] https://review.opendev.org/#/c/664478/
[2] 
https://opendev.org/opendev/puppet-openstackci/src/branch/master/templates/logs.vhost.erb
[3] https://github.com/ansible-community/ara-web
[4] https://ara.readthedocs.io/en/stable/advanced.html

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Zanata broken on Bionic

2019-04-08 Thread Ian Wienand
On Tue, Apr 02, 2019 at 12:28:31PM +0200, Frank Kloeker wrote:
> The OpenStack I18n team was aware about the fact, that we will run into an
> unsupported platform in the near future and started an investigation about
> the renew of translation platform on [1].
> [1] 
> https://blueprints.launchpad.net/openstack-i18n/+spec/renew-translation-platform

I took an action item in the infra meeting to do some investigation.
From the notes above, it looks like the last iteration came down to
Zanata vs Pootle.  However, when I look at [1], Pootle doesn't look
terribly active.

It looks like Fedora hasn't made any choices around which way to
go, but weblate has been suggested [2].

Looking at weblate, it seems to have a few things going for it from
the infra point of view

* it seems active
* it's a python/django app which fits our deployments and general
  skills better than java
* has a docker project [3] so has interest in containerisation
* we currently put translations in, and propose them via jobs
  triggered periodically using the zanata CLI tool as described at
  [4].  weblate has a command-line client that looks to me like it can
  do roughly what we do now [5] ... essentially integrate with jobs to
  upload new translations into the tool, and extract the translations
  and put them into gerrit.
* That said, it also seems we could integrate with it more "directly"
  [6]; it seems it can trigger imports of translations from git repos
  via webhooks (focused on github, but we could do similar with a post
  job) and also propose updates directly to gerrit (using git-review;
  documentation is light on this feature but it is there).  It looks
  like (if I'm reading it right) we could move all configuration in a
  .weblate file per-repo, which suits our distributed model.

> My recommendation would be to leave it as it is and to decide how to
> proceed.

Overall, yeah, if it ain't broke, don't fix it :)

The other thing is, I noticed that weblate has hosted options.  If the
CI integration is such that it's importing via webhooks and proposing
reviews, then it seems like this is essentially an unprivileged app.
We have sunk a lot of collective time and resources into the Zanata
deployment, so we should probably do a real cost-benefit analysis once
we have some more insights.

-i


[1] https://github.com/translate/pootle/commits/master
[2] 
https://lists.fedoraproject.org/archives/list/tr...@lists.fedoraproject.org/thread/PZUT5ABMNDVYBD7OUBEGVXM7YVW6RZKQ/#4J7BJQWOJDEBACSHDIB6MYWEEXHES6CW
[3] https://github.com/WeblateOrg/docker
[4] https://docs.openstack.org/i18n/latest/infra.html
[5] https://docs.weblate.org/en/latest/wlc.html
[6] https://docs.weblate.org/en/latest/admin/continuous.html

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] Meeting agenda for March 26, 2019

2019-03-25 Thread Ian Wienand
== Agenda for next meeting ==

* Announcements
** Clarkb remains on vacation March 25-28

* Actions from last meeting

* Specs approval

* Priority Efforts (Standing meeting agenda items. Please expand if you have 
subtopics.)
** 
[http://specs.openstack.org/openstack-infra/infra-specs/specs/task-tracker.html 
A Task Tracker for OpenStack]
** 
[http://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html
 Update Config Management]
*** topic:puppet-4 and topic:update-cfg-mgmt
*** Zuul as CD engine
** OpenDev
*** https://storyboard.openstack.org/#!/story/2004627

* General topics
** PTG planning (clarkb 20190319 / ianw 20190326)
*** https://etherpad.openstack.org/2019-denver-ptg-infra-planning

* Open discussion

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Zanata broken on Bionic

2019-03-25 Thread Ian Wienand
On Fri, Mar 15, 2019 at 11:01:44AM +0100, Andreas Jaeger wrote:
> Anybody remembers or can reach out to Zanata folks for help on
> fixing this for good, please?

From internal communication with people previously involved with
Zanata, it seems the team has disbanded and there is no current
support or, at this time, planned future development.  So
unfortunately it seems there are no "Zanata folks" at this point :(

It's a shame considering there has been significant work integrating
it into workflows, but I think we have to work under the assumption
upstream will remain inactive.

Falling back to "if it ain't broke" we can just continue with the
status quo with the proposal job running on Xenial and its java
versions for the forseeable future.  Should we reach a point
post-Xenial support lifespan, we could even consider a more limited
deployment of both the proposal job and server using containers etc.
Yes, this is how corporations end up in 2019 with RHEL5 servers
running Python 2.4 :)

Ultimately though, it's probably something the I18n team needs to
discuss and infra can help with any decisions made.

-i

[1] https://review.openstack.org/#/q/topic:zanata/translations

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] [Release-job-failures] Release of openstack-infra/jenkins-job-builder failed

2018-12-10 Thread Ian Wienand
On Fri, Dec 07, 2018 at 12:02:00PM +0100, Thierry Carrez wrote:
> Looks like the readthedocs integration for JJB is misconfigured, causing the
> trigger-readthedocs-webhook to fail ?

Thanks for pointing this out.  After investigation it doesn't appear
to be misconfigured in any way, but it seems that RTD has started
enforcing the need for CSRF tokens on the POST we use to notify it to
build.

This appears to be new behaviour, and possibly incorrectly applied
upstream (I'm struggling to think why it's necessary here).

I've filed

 https://github.com/rtfd/readthedocs.org/issues/4986

which hopefully can open a conversation about this.  Let's see what
comes of that...

*If* we have no choice but to move to token-based authentication, I
did write the role to handle that.  But it involves every project
maintaining its own secrets and us having to rework the jobs, which is
not difficult but also not trivial.  So let's hope it doesn't come to
that ...

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Proposed changes to how we run our meeting

2018-11-20 Thread Ian Wienand
On Sun, Nov 18, 2018 at 11:09:29AM -0800, Clark Boylan wrote:
> Both ideas seem sound to me and I think we should try to implement
> them for the Infra team. I propose that we require agenda updates 24
> hours prior to the meeting start time and if there are no agenda
> updates we cancel the meeting. Curious to hear if others think this
> will be helpful and if 24 hours is enough lead time to be helpful.

My concern here is that we have standing items for priority task
updates that are essentially always there, plus action item follow-up
from the prior meeting.  Personally I often find them very useful.

Having attended many waffling in-person weekly "status update"
meetings etc., I feel the infra one *is* very agenda-focused.  I also
think there is never an expectation that anyone is in the meeting; in
fact we actively understand and expect that people aren't there.

So I think it would be fine to send out the agenda 24 hours in
advance, and make a rule that new items added after that slip to the
next week, so that if there's nothing of particular interest people
can plan to skip.

This would involve managing the wiki page better, IMO.  I always try
to tag my items with my name and date for discussion because clearing
it out is an asynchronous operation.  What if we made the final thing
in the meeting, after general discussion, "reset agenda", so we have a
synchronisation point, and then clearly mark on the wiki page that
it's now for the next meeting date?

But I don't like the idea of infra skipping the meeting in general.
Apart from the aforementioned standing items, people start thinking
"oh, my thing is just little, I don't want to call a meeting for it",
which is the opposite of what we want if we're trying to keep
communication flowing.  For people actively involved but remote, like
myself, losing the meeting means losing a very valuable hour for
catching up on what's happening, even with just the regular updates.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Control Plane Server Upgrade Sprint Planning

2018-09-18 Thread Ian Wienand
On Mon, Sep 17, 2018 at 04:09:03PM -0700, Clark Boylan wrote:
> October 15-19 may be our best week for this. Does that week work?

Post school-holidays here so SGTM :)

> Let me know if you are working on upgrading any servers/services and
> I will do what I can to help review changes and make that happen as
> well.

I will start on graphite.o.o as I now have some experience getting it
listening on ipv6 :) I think it's mostly package installs and a few
templating bits, ergo it might work well as ansible roles (i.e. we
don't have to rewrite tricky logic).  I'll see how it starts to look ...

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Launch node and the new bridge server

2018-08-28 Thread Ian Wienand

On 08/28/2018 09:48 AM, Clark Boylan wrote:

> On Mon, Aug 27, 2018, at 4:21 PM, Clark Boylan wrote:
> One quick new observation. launch-node.py does not install puppet at
> all so the subsequent ansible runs on the newly launched instances
> will fail when attempting to stop the puppet service (and will
> continue on to fail to run puppet as well I think).


I think we should manage puppet on the hosts from Ansible; we did
discuss that we could just manually run
system-config:install_puppet.sh after launching the node; but while
that script does contain some useful things for getting various puppet
versions, it also carries a lot of extra cruft from years gone by.

I've proposed the roles to install puppet in [1].  This runs the roles
under Zuul for integration testing.

For the control-plane, we need a slight tweak to the inventory writer
to pass through groups [2] and then we can add the roles to the base
playbook [3].

Thanks,

-i

[1] https://review.openstack.org/596968
[2] https://review.openstack.org/596994
[3] https://review.openstack.org/596997

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Request to keep Fedora 28 images around until Fedora 30 comes out

2018-08-02 Thread Ian Wienand
On 08/03/2018 04:45 AM, Clark Boylan wrote:
> On Thu, Aug 2, 2018, at 9:57 AM, Alex Schultz wrote:
> As a note, Fedora 28 does come with python2.7. It is installed so
> that Zuul related ansible things can execute under python2 on the
> test nodes. There is the possibility that ansible's python3 support
> is working well enough that we could switch to it, but that
> requires testing and updates to software and images and config.

Python 3-only images are possible -- dib has a whole "dib-python"
thing for running python scripts inside the chroot in a distro- and
version-independent way -- but not with the pip-and-virtualenv element
setup we use, as that drags in both Python versions [1].  You can go
through a "git log" of that element to see some of the many problems :)

OpenStack has always managed to tickle bugs in
pip/setuptools/virtualenv, which is why we go to the effort of
installing the latest out-of-band.  This is not to say it couldn't be
reconsidered, especially for a distro like Fedora which ships
up-to-date packages.  But this would definitely be the first port of
call for anyone interested in going down that path in infra.

-i

[1] 
https://git.openstack.org/cgit/openstack/diskimage-builder/tree/diskimage_builder/elements/pip-and-virtualenv/install.d/pip-and-virtualenv-source-install/04-install-pip

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] mirror.opensuse : AFS file name size issues

2018-06-17 Thread Ian Wienand
Hi,

It seems like the opensuse mirror has been on a bit of a growth spurt
[1].  Monitoring alerted me that the volume had not released for
several days, which led me to look at the logs.

The rsync is failing with "File too large (27)" as it goes through
the tumbleweed sync.

As it turns out, AFS has a hard limit on the combined size of the
file names within a directory.  There are a couple of threads [2]
around from people who have found this out in pretty much the same way
as me ... when it starts failing :)

So you have 64k slots per directory, and file metadata+name takes up
slots per the formula:

 /* Find out how many entries are required to store a name. */
 int
 afs_dir_NameBlobs(char *name)
 {
 int i;
 i = strlen(name) + 1;
 return 1 + ((i + 15) >> 5);
 }

This means we have a problem with the large opensuse
tumbleweed/repo/oss/x86_64 directory, which has a lot of files with
quite long names.  Please check my command/math, but if you run the
following command:

 $ rsync --list-only rsync://mirrors.rit.edu/opensuse/tumbleweed/repo/oss/x86_64/ \
   | awk '
 function slots(x) {
   i = length(x)+1;
   return 1 + rshift((i+15), 5)
 }
 { n += slots($5) }
 END {print n}
'

I come out with 82285, which is significantly more than the 64k slots
available.

I don't know what to do here, and it's really going to be up to people
interested in opensuse.  The most immediate thing is unnecessary
packages could be pruned from tumbleweed/repo/oss/x86_64/ during the
rsync.  Where unnecessary is in the eye of the beholder ... :)
See my note below, but it may have to be quite under 64k.

If we have any sway with upstream, maybe they could shard this
directory; similar to debian [3] or fedora [4] (that said, centos does
the same thing [5], but due to less packages and shorter names etc
it's only about 40% allocated).

Note that (open)AFS doesn't hard-link across directories, so some sort
of "rsync into smaller directories then hardlink tree" doesn't really
work.

Ideas, suggestions, reviews welcome :)

-- ps

There's an additional complication in that the slots fragment over
time and file names must be contiguous.  This means that in practice
you get even less.

There is potential to "defrag" (I bet post Windows 95 you never
thought you'd hear that again :) by rebuilding the directories with
the salvager [6].  However, there are additional complications
again...

To do this simply, we have to run a complete salvage of the *entire*
partition.  Although I have added "-salvagedirs" to afs01's
salvageserver (via [7]) in an attempt to do this for just one volume,
it turns out this is not obeyed until after [8], which is not in the
Xenial AFS version we use.  I really do not want to salvage all the
other volumes, most of which are huge.  The other option is to create
a new AFS server, move the volume to it so it's the only thing on the
partition, run the salvage there, and then move it back [9].
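
For reference, that shuffle would be along the lines of the following,
with afs03 being a hypothetical new fileserver:

# Move the volume to a dedicated server, salvage it there, move it back
vos move -id mirror.opensuse -fromserver afs01.dfw.openstack.org \
    -frompartition a -toserver afs03.dfw.openstack.org -topartition a -localauth
# ... run the salvager with -salvagedirs against the now single-volume
# partition on afs03, then ...
vos move -id mirror.opensuse -fromserver afs03.dfw.openstack.org \
    -frompartition a -toserver afs01.dfw.openstack.org -topartition a -localauth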

I actually suspect an rm -rf * might also do it, and probably be
faster, because we'd only move the data down once from the remote
mirror, rather than to a new server and back.

But defragging is rather secondary if the directory is oversubscribed
anyway.

-i

[1] 
http://grafana02.openstack.org/d/ACtl1JSmz/afs?orgId=1&from=now-7d&to=now&panelId=28&fullscreen
[2] https://lists.openafs.org/pipermail/openafs-info/2016-July/041859.html
[3] http://mirror.iad.rax.openstack.org/debian/pool/main/
[4] 
http://mirror.iad.rax.openstack.org/fedora/releases/28/Everything/x86_64/os/Packages/
[5] http://mirror.iad.rax.openstack.org/centos/7/os/x86_64/Packages/
[6] http://docs.openafs.org/Reference/8/salvager.html
[7] https://docs.openstack.org/infra/system-config/afs.html#updating-settings
[8] https://gerrit.openafs.org/#/c/12461/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] afs02 r/o volume mirrors - resolved

2018-05-26 Thread Ian Wienand

On 05/25/2018 08:00 PM, Ian Wienand wrote:

> I am now re-running the sync in a root screen on afs02 with -localauth
> so it won't timeout.


I've now finished syncing back all R/O volumes on afs02, and the update
cron jobs have been running successfully.

Thanks,

-i


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] afs02 r/o volume mirrors - ongoing incident

2018-05-25 Thread Ian Wienand

On 05/24/2018 11:36 PM, Ian Wienand wrote:

> Thanks to the help of Jeffrey Altman [1], we have managed to get
> mirror.pypi starting to resync again.


And thanks to user error on my part, identified by jeblair: in the
rush of all this I ran this under k5start on mirror-update instead of
on one of the afs hosts with -localauth, so the ticket timed out and
the release failed.
---
root@mirror-update01:~# k5start -t -f /etc/afsadmin.keytab 
service/afsadmin -- vos release mirror.pypi

Kerberos initialization for service/afsad...@openstack.org

Release failed: rxk: authentication expired
Could not end transaction on a ro volume: rxk: authentication expired
 Could not update VLDB entry for volume 536870931
Failed to end transaction on the release clone 536870932
Could not release lock on the VLDB entry for volume 536870931
rxk: authentication expired
Error in vos release command.
rxk: authentication expired
---

If it is any consolation, it's the type of mistake you only make once :)

I am now re-running the sync in a root screen on afs02 with -localauth
so it won't timeout.  Expect it to finish about 20 hours from this
mail :/

Thanks,

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] afs02 r/o volume mirrors - ongoing incident

2018-05-24 Thread Ian Wienand
On 05/24/2018 08:45 PM, Ian Wienand wrote:
> On 05/24/2018 05:40 PM, Ian Wienand wrote:
>> In an effort to resolve this, the afs01 & 02 servers were restarted to
>> clear all old transactions, and for the affected mirrors I essentially
>> removed their read-only copies and re-added them with:
> 
> It seems this theory of removing the volumes and re-adding them is not
> sufficient to get things working; "vos release" is still failing.  I
> have sent a message to the openafs-devel list [1] with details and
> logs.

Thanks to the help of Jeffrey Altman [1], we have managed to get
mirror.pypi starting to resync again.  This is running in the root
screen on mirror-update.o.o (sorry, I forgot the "-v" on the command).

For reference, you can look at the transaction and see it receiving
data, e.g.

 root@afs02:/var/log/openafs# vos status -verbose -server localhost -localauth 
 Total transactions: 1
 --
 transaction: 62  created: Thu May 24 12:58:23 2018
 lastActiveTime: Thu May 24 12:58:23 2018
 volumeStatus: 
 volume: 536870932  partition: /vicepa  procedure: Restore
 packetRead: 2044135  lastReceiveTime: Thu May 24 13:33:17 2018
 packetSend: 1  lastSendTime: Thu May 24 13:33:17 2018
 --

Assuming this goes OK over the next few hours, that leaves
mirror.ubuntu and mirror.ubuntu-ports as the last two out-of-sync
mirrors.  As we do not want to run large releases in parallel, we can
tackle this when pypi is back in sync.

Thanks,

-i

[1] 
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-05-24.log.html#t2018-05-24T12:57:39

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] afs02 r/o volume mirrors - ongoing incident

2018-05-24 Thread Ian Wienand
On 05/24/2018 05:40 PM, Ian Wienand wrote:
> In an effort to resolve this, the afs01 & 02 servers were restarted to
> clear all old transactions, and for the affected mirrors I essentially
> removed their read-only copies and re-added them with:

It seems this theory of removing the volumes and re-adding them is not
sufficient to get things working; "vos release" is still failing.  I
have sent a message to the openafs-devel list [1] with details and
logs.

We should probably see if any help can be gained from there.

If not, I'm starting to think that removing all R/O volumes, doing an
"rm -rf /vicepa/*" on afs02 and then starting the R/O mirrors again
might be an option?

If we critically need the mirrors updated, we can "vos remove" the R/O
volumes from any mirror and run an update just on afs01.  However,
note that mirror-update.o.o is still in the emergency file and all
cron jobs are stopped.

-i

[1] https://lists.openafs.org/pipermail/openafs-devel/2018-May/020491.html

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] afs02 r/o volume mirrors - ongoing incident

2018-05-24 Thread Ian Wienand
Hi,

We were notified of an issue around 22:45 GMT [1] with the volumes
backing the storage on afs02.dfw.o.o, which holds R/O mirrors for our
AFS volumes.

It seems that during this time there were a number of "vos release"s
in flight, or started, that ended up with volumes in a range of
unreliable states that made them un-releaseable (essentially halting
mirror updates).

Several of the volumes were recoverable with a manual "vos unlock" and
re-releasing the volume.  However, others were not.

To keep it short, fairly extensive debugging took place [2], but we
had corrupt volumes and deadlocked transactions between afs01 & afs02
with no reasonable solution.

In an effort to resolve this, the afs01 & 02 servers were restarted to
clear all old transactions, and for the affected mirrors I essentially
removed their read-only copies and re-added them with:

 k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos unlock $MIRROR
 k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos remove \
     -server afs02.dfw.openstack.org -partition a -id $MIRROR.readonly
 k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos release -v $MIRROR
 k5start -t -f /etc/afsadmin.keytab service/afsadmin -- vos addsite \
     -server afs02.dfw.openstack.org -partition a -id $MIRROR

The following volumes needed to be recovered

 mirror.fedora
 mirror.pypi
 mirror.ubuntu
 mirror.ubuntu-ports
 mirror.debian

(these are the largest repositories, and maybe it's no surprise that's
why they became corrupt?)

I have placed mirror-update.o.o in the emergency file, and commented
out all cron jobs on it.

Right now, I am running a script in a screen as the root user on
mirror-update.o.o to "vos release" these in sequence
(/root/release.sh).  Hopefully, this brings things back into sync by
recreating the volumes.  If not, more debugging will be required :/
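
For the record, the script is just a loop along these lines (a sketch;
the real /root/release.sh may differ slightly):

  #!/bin/bash -e
  # re-release each recovered volume, one at a time
  for vol in mirror.fedora mirror.pypi mirror.ubuntu \
             mirror.ubuntu-ports mirror.debian; do
      k5start -t -f /etc/afsadmin.keytab service/afsadmin -- \
          vos release -v $vol
  done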

Please feel free to check in on this, otherwise I will update tomorrow
.au time

-i

[1] 
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-05-23.log.html#t2018-05-23T22:43:46
[2] 
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-05-24.log.html#t2018-05-24T04:01:21

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Selecting New Priority Effort(s)

2018-04-09 Thread Ian Wienand

On 04/06/2018 11:37 PM, Jens Harbott wrote:

I didn't intend to say that this was easier. My comment was related
to the efforts in https://review.openstack.org/558991 , which could
be avoided if we decided to deploy askbot on Xenial with
Ansible. The amount of work needed to perform the latter task would
not change, but we could skip the intermediate step, assuming that
we would start implementing 1) now instead of deciding to do it at a
later stage.


I disagree with this; having found a myriad of issues it's *still*
simpler than re-writing the whole thing IMO.

It doesn't matter, ansible, puppet, chef, bash scripts -- the
underlying problem is that we choose support libraries for postgres,
solr, celery, askbot, logs etc etc, get it to deploy, then forget
about it until the next LTS release 2 years later.  Of course the
whole world has moved on, but we're pinned to old versions of
everything and never tested on new platforms.

What *would* have helped is a rspec test that even just simply applies
the manifest on new platforms.  We have great infrastructure for these
tests; but most of our modules don't actually *run* anything (e.g.,
here are ethercalc and etherpad-lite issues too [1,2]).

These make it so much easier to collaborate; we can all see the result
of changes, link to logs, get input on what's going wrong, etc etc.

-i

[1] https://review.openstack.org/527822
[2] https://review.openstack.org/528130

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Problems setting up my own OpenStack Infrastructure

2018-04-04 Thread Ian Wienand

   * Puppet doesn't create the /var/log/nodepool/images log directory


Note that since [1] the builder log output changed; previously it went
through python logging into the directory you mention, now it is
written into log files directly in /var/log/nodepool/builds (by
default)


   * The command "service nodepool-builder start" seems to start a
     nodepool process that immediately aborts


You may be seeing the result of a bad logging configuration file.  In
this case, the daemonise happens correctly (so systemd thinks it
worked) but it crashes soon after, before any useful logging is
captured. I have a change out for that in [2] (reviews appreciated :)

Let me see how far I can get on my own. Thanks much for the offer to
tutor me on the IRC; I will watch out for you in my morning. Our
time difference is between 13 hours (EDT) and 16 hours (PDT) if you
are located in the continental US, i.e. 7pm EDT is 8am next day here
in Japan.


FWIW there are a couple of us in APAC who are happy to help too.  IRC
will always be the most immediate way however :)

-i

[1] https://review.openstack.org/#/c/542386/
[2] https://review.openstack.org/#/c/547889/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Options for logstash of ansible tasks

2018-03-28 Thread Ian Wienand
On 03/28/2018 11:30 AM, James E. Blair wrote:
> As soon as I say that, it makes me think that the solution to this
> really should be in the log processor.  Whether it's a grok filter, or
> just us parsing the lines looking for task start/stop -- that's where we
> can associate the extra data with every line from a task.  We can even
> generate a uuid right there in the log processor.

I'd agree the logstash level is probably where to do this.  How to
achieve that ...

In trying to bootstrap myself on the internals of this, one thing I've
found is that the multi-line filter [1] is deprecated for the
multiline codec plugin [2].

We make extensive use of this deprecated filter [3].  It's not clear
how we can go about migrating away from it?  The input is coming in as
"json_lines" as basically a json-dict -- with a tag that we then use
different multi-line matches for.

From what I can tell, it seems like the work of dealing with
multiple lines has actually largely been put into filebeat [5], which
is analogous to our logstash-workers (it feeds the files into
logstash).

Ergo, do we have to add multi-line support to the logstash-pipeline,
so that events sent into logstash are already bundled together?

-i

[1] https://www.elastic.co/guide/en/logstash/2.4/plugins-filters-multiline.html
[2] 
https://www.elastic.co/guide/en/logstash/current/plugins-codecs-multiline.html
[3] 
https://git.openstack.org/cgit/openstack-infra/logstash-filters/tree/filters/openstack-filters.conf
[4] 
http://git.openstack.org/cgit/openstack-infra/system-config/tree/modules/openstack_project/templates/logstash/input.conf.erb
[5] 
https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] Options for logstash of ansible tasks

2018-03-27 Thread Ian Wienand
I wanted to query for a failing ansible task; specifically what would
appear in the console log as

 2018-03-27 15:07:49.294630 | 
 2018-03-27 15:07:49.295143 | TASK [configure-unbound : Check for IPv6]
 2018-03-27 15:07:49.368062 | primary | skipping: Conditional result was False
 2018-03-27 15:07:49.400755 | 

While I can do

 message:"configure-unbound : Check for IPv6"

I want to correlate that with a result, looking also for the matching

 skipping: Conditional result was False

as the result of the task.  AFAICT, there is no way in kibana to
enforce a match on consecutive lines like this (as it has no concept
they are consecutive).

I considered a few things.  We could conceivably group everything
between "TASK" and a blank " | " into a single entry with a multiline
filter.  It was pointed out that this would make, for example, the
entire devstack log as a single entry, however.

The closest other thing I could find was "aggregate" [1]; but this
relies on having a unique task-id to group things together with.
Ansible doesn't give us this in the logs and AFAIK doesn't have a
concept of a uuid for tasks.

So I'm at a bit of a loss as to how we could effectively index ansible
tasks so we can determine the intermediate values or results of
individual tasks?  Any ideas?

-i

[1] 
https://www.elastic.co/guide/en/logstash/current/plugins-filters-aggregate.html

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding new etcd binaries to tarballs.o.o

2018-03-27 Thread Ian Wienand

On 03/28/2018 01:04 AM, Jeremy Stanley wrote:

I would be remiss if I failed to remind people that the *manually*
installed etcd release there was supposed to be a one-time stop-gap,
and we were promised it would be followed shortly with some sort of
job which made updating it not-manual. We're coming up on a year and
it looks like people have given in and manually added newer etcd
releases at least once since. If this file were important to
testing, I'd have expected someone to find time to take care of it
so that we don't have to. If that effort has been abandoned by the
people who originally convinced us to implement this "temporary"
workaround, we should remove it until it can be supported properly.


In reality we did fix it, as described with the
use-from-cache-or-download changes in the prior mail.  I even just
realised I submitted and forgot about [1] which never got reviewed to
remove the tarballs.o.o pointer -- that setting then got copied into
the new devstack zuulv3 jobs [2].

Anyway, we got there in the end :) I'll add to my todo list to clear
them from tarballs.o.o once this settles out.

-i

[1] https://review.openstack.org/#/c/508022/
[2] https://review.openstack.org/#/c/554977/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding new etcd binaries to tarballs.o.o

2018-03-26 Thread Ian Wienand
On 03/27/2018 09:25 AM, Tony Breeds wrote:
> Can we please add the appropriate files for the 3.3.2 (or 3.2.17)
> release of etcd added to tarballs.o.o

ISTR that we had problems even getting them from there during runs, so
moved to caching this.  I had to check, this isn't well documented ...

The nodepool element caching code [1] should be getting the image name
from [2], which gets the URL for the tarball via the environment
variables in stackrc.  dib then stuffs that tarball into /opt/cache on
all our images.

In the running devstack code, we use get_extra_file [3] which should
look for the tarball in the on-disk cache, or otherwise download it
[4].

Ergo, I'm pretty sure these files on tarballs.o.o are unused.  Bumping
the version in devstack should "just work" -- it will download
directly until the next day's builds come online with the file cached.
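
i.e. the pattern in lib/etcd3 boils down to something like this (a
rough sketch from memory; the variable name is illustrative, see [3]
and [4] for the real thing):

  # returns the cached copy (stuffed under /opt/cache by the image
  # build) if present, otherwise downloads it into the devstack files dir
  etcd_tarball=$(get_extra_file "$ETCD_DOWNLOAD_URL")
  sudo tar xzvf "$etcd_tarball" -C /opt/stack/bin --strip-components=1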

> I realise that 18.04 is just around the corner but doing this now gives
> us scope to land [4] soon and consider stable branches etc while we
> transition to bionic images and then dismantle the devstack
> infrastructure for consuming these tarballs
> [4] https://review.openstack.org/#/c/554977/1

I think we can discuss this in that review, but it seems likely from
our discussions in IRC that 3.2 will be the best choice here.  It is
in bionic & fedora; so we can shortcut all of this and install from
packages there.

-i

[1] 
https://git.openstack.org/cgit/openstack-infra/project-config/tree/nodepool/elements/cache-devstack/extra-data.d/55-cache-devstack-repos#n84
[2] 
https://git.openstack.org/cgit/openstack-dev/devstack/tree/tools/image_list.sh#n50
[3] https://git.openstack.org/cgit/openstack-dev/devstack/tree/lib/etcd3#n101
[4] https://git.openstack.org/cgit/openstack-dev/devstack/tree/functions#n59

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] [infra][nova] Corrupt nova-specs repo

2018-03-04 Thread Ian Wienand
On 06/30/2017 04:11 PM, Ian Wienand wrote:
> Unfortunately it seems the nova-specs repo has undergone some
> corruption, currently manifesting itself in an inability to be pushed
> to github for replication.

We haven't cleaned this up, due to wanting to do it during a rename
transition which hasn't happened yet due to zuulv3 rollout.

We had reports that github replication was not working.  Upon checking
the queue, nova-specs was suspicious.

...
07141063  Mar-02 08:04  (retry 3810) [d7122c96] push 
g...@github.com:openstack/nova-specs.git
4e27c57e waiting  Mar-02 08:12  [ee1b1935] push 
g...@github.com:openstack/networking-bagpipe.git
... so on ...
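
For reference, that listing comes from the gerrit ssh interface,
roughly:

  ssh -p 29418 review.openstack.org gerrit show-queue --wide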

Checking out the logs, nova-specs tries to push itself and fails
constantly, per the previous mail.  However, usually we get an error
and things continue on; e.g.

[2018-03-02 08:04:56,439] [d7122c96] Cannot replicate to 
g...@github.com:openstack/nova-specs.git
org.eclipse.jgit.errors.TransportException: 
g...@github.com:openstack/nova-specs.git: error occurred during unpacking on 
the remote end: index-pack abnormal exit

Something seems to have happened at

[2018-03-02 08:05:58,065] [d7122c96] Push to 
g...@github.com:openstack/nova-specs.git references:

Because this never returned an error, or seemingly never returned at all.  From that
point, no more attempts were made by the replication thread(s) to push
to github; jobs were queued but nothing happened.  I killed that task,
but no progress appeared to be made and the replication queue
continued to climb.  I couldn't find any other useful messages in the
logs; but they would be around that time if they were there.

I've restarted gerrit and replication appears to be moving again.  I'm
thinking maybe we should attempt to fix this separate to renames,
because at a minimum it makes debugging quite hard as it floods the
logs.  I'll bring it up in this week's meeting.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding ARM64 cloud to infra

2018-02-22 Thread Ian Wienand
On 02/02/2018 05:15 PM, Ian Wienand wrote:
> - Once that is done, it should be straight forward to add a
>nodepool-builder in the cloud and have it build images, and zuul
>should be able to launch them just like any other node (famous last
>words).

This roughly turned out to be correct :)

In short, we now have ready xenial arm64 based nodes.  If you request
an ubuntu-xenial-arm64 node it should "just work"

There are some caveats:

 - I have manually installed a diskimage-builder with the changes from
   [1] downwards onto nb03.openstack.org.  These need to be finalised
   and a release tagged before we can remove nb03 from the emergency
   file (just means, don't run puppet on it).  Reviews welcome!

 - I want to merge [2] and related changes to expose the image build
   logs, and also the webapp end-points so we can monitor active
   nodes, etc.  It will take some baby-sitting so I plan on doing this
   next week.

 - We have mirror.cn1.linaro.openstack.org, but it's not mirroring
   anything that useful for arm64.  We need to sort out mirroring of
   ubuntu ports, maybe some wheel builds, etc.

 - There's currently capacity for 8 nodes.  So please take that into
   account when adding jobs.

Everything seems in good shape at the moment.  For posterity, here is
the first ever arm64 ready node:

 nodepool@nl03:/var/log/nodepool$ nodepool list | grep arm64
 | 0002683657 | linaro-cn1 | ubuntu-xenial-arm64 | 
c7bb6da6-52e5-4aab-88f1-ec0f1b392a0c | 211.148.24.200  |
| ready| 00:00:03:43 | unlocked |

:)

-i

[1] https://review.openstack.org/547161
[2] https://review.openstack.org/543671

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] [nodepool] Restricting images to specific nodepool builders

2018-02-19 Thread Ian Wienand

On 02/20/2018 02:23 AM, Paul Belanger wrote:

Why not just split the builder configuration file? I don't see a
need to add code to do this.


I'm happy with this; I was just coming at it from an angle of not
splitting the config file, but KISS :)


I did submit support homing diskimage builds to specific builder[2] a while
back, which is more inline with what ianw is asking. This allows us to assign
images to builders, if set.



[2] https://review.openstack.org/461239/


Only comment on this is that I think it might be better to avoid
putting specific hostnames in there directly; but rather add meta-data
to diskimage configurations describing the features they need on the
builder, and have the builder then only choose those builds it knows
it can do.  Feels more natural for the message-queue/scale-out type
environment where we can add/drop hosts at will.

We've two real examples to inform design; needing the Xenial build
host when all the others were trusty, and now the arm64 based ones.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] [nodepool] Restricting images to specific nodepool builders

2018-02-18 Thread Ian Wienand
Hi,

How should we go about restricting certain image builds to specific
nodepool builder instances?  My immediate issue is with ARM64 image
builds, which I only want to happen on a builder hosted in an ARM64
cloud.

Currently, the builders go through the image list and check "is the
existing image missing or too old, if so, build" [1].  Additionally,
all builders share a configuration file [2]; so builders don't know
"who they are".

I'd propose we add an arbitrary tag/match system so that builders can
pickup only those builds they mark themselves capable of building?

e.g. diskimages would specify required builder tags similar to:

---
diskimages:
  - name: arm64-ubuntu-xenial
elements:
  - block-device-efi
  - vm
  - ubuntu-minimal
  ...
env-vars:
  TMPDIR: /opt/dib_tmp
  DIB_CHECKSUM: '1'
  ...
builder-requires:
  architecture: arm64
---

The nodepool.yaml would grow another section similar:

---
builder-provides:
  architecture: arm64
  something_else_unique_about_this_builder: true
---

For OpenStack, we would template this section in the config file via
puppet in [2], ensuring above that only our theoretical ARM64 build
machine had that section in it's config.

The nodepool-builder build loop can then check that its
builder-provides section has all the tags specified in an image's
"builder-requires" section before deciding to start building.

Thoughts welcome :)

-i

[1] 
https://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/builder.py#n607
[2] 
https://git.openstack.org/cgit/openstack-infra/project-config/tree/nodepool/nodepool.yaml

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding ARM64 cloud to infra

2018-02-01 Thread Ian Wienand

Hi,

A quick status update on the integration of the Linaro aarch64 cloud

- Everything is integrated into the system-config cloud-launcher bits,
  so all auth tokens are in place, keys are deploying, etc.

- I've started with a mirror.  So far only a minor change to puppet
  required for the ports sources list [1].  It's a bit bespoke at the
  moment but up as mirror.cn1.linaro.openstack.org.

- AFS is not supported out-of-the-box.  There is a series at [2] that
  I've been working on today, with some success.  I have custom
  packages at [3] which seem to work and can see our mirror
  directories.  I plan to puppet this in for our immediate needs, and
  keep working to get it integrated properly upstream.

- For building images, we are getting closer.  The series at [4] is
  still very WIP but can produce a working gpt+efi image.  I don't see
  any real blockers there; work will continue to make sure we get the
  interface if not perfect, at least not something we totally regret
  later :)

- Once that is done, it should be straight forward to add a
  nodepool-builder in the cloud and have it build images, and zuul
  should be able to launch them just like any other node (famous last
  words).

Thanks all,

-i

[1] https://review.openstack.org/539083
[2] https://gerrit.openafs.org/11940
[3] https://tarballs.openstack.org/package-afs-aarch64/
[4] https://review.openstack.org/#/c/539731/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding ARM64 cloud to infra

2018-01-18 Thread Ian Wienand

On 01/13/2018 03:54 AM, Marcin Juszkiewicz wrote:

UEFI expects GPT and DIB is completely not prepared for it.


I feel like we've made good progress on this part, with sufficient
GPT support in [1] to get started on the EFI part

... which is obviously where the magic is here.  This is my first
rodeo building something that boots on aarch64, but not yours I've
noticed :)

I've started writing some notes at [2] and anyone is welcome to edit,
expand, add notes on testing etc etc.  I've been reading through the
cirros implementation and have more of a handle on it; I'm guessing
we'll need to do something similar in taking distro grub packages and
put them in place manually.  Any notes on testing very welcome :)
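
For the record, by "do something similar" I mean roughly the following
inside the image chroot (a hand-wavy sketch; package and flag choices
still need to be confirmed by testing):

  # use the distro grub packages rather than building our own
  apt-get install -y grub-efi-arm64
  # install to the removable-media path so we do not depend on
  # NVRAM/firmware boot entries existing in the cloud
  grub-install --target=arm64-efi --efi-directory=/boot/efi \
      --removable --no-nvram
  update-grub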

Cheers,

-i

[1] https://review.openstack.org/#/c/533490/
[2] https://etherpad.openstack.org/p/dib-efi

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding ARM64 cloud to infra

2018-01-15 Thread Ian Wienand

On 01/16/2018 12:11 AM, Frank Jansen wrote:

do you have any insight into the availability of a physical
environment for the ARM64 cloud?



I’m curious, as there may be a need for downstream testing, which I
would assume will want to make use of our existing OSP CI framework.


Sorry, not 100% sure what you mean here?  I think the theory is that
this would be an ARM64 based cloud attached to OpenStack infra and
thus run any jobs infra could ...

-i


___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding ARM64 cloud to infra

2018-01-15 Thread Ian Wienand

On 01/13/2018 01:26 PM, Ian Wienand wrote:

In terms of implementation, since you've already looked, I think
essentially diskimage_builder/block_device/level1.py create() will
need some moderate re-factoring to call a gpt implementation in
response to a gpt label, which could translate self.partitions into a
format for calling parted via our existing exec_sudo.



bringing up a sample config and test, then working backwards from what
calls we expect to see


I've started down this path with

 https://review.openstack.org/#/c/533490/

... still very wip

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding ARM64 cloud to infra

2018-01-12 Thread Ian Wienand
On 01/13/2018 05:01 AM, Jeremy Stanley wrote:
> On 2018-01-12 17:54:20 +0100 (+0100), Marcin Juszkiewicz wrote:
> [...]
>> UEFI expects GPT and DIB is completely not prepared for it. I made
>> block-layout-arm64.yaml file and got it used just to see "sorry,
>> mbr expected" message.
> 
> I concur. It looks like the DIB team would welcome work toward GPT
> support based on the label entry at
> https://docs.openstack.org/diskimage-builder/latest/user_guide/building_an_image.html#module-partitioning
> and I find https://bugzilla.redhat.com/show_bug.cgi?id=1488557
> suggesting there's probably also interest within Red Hat for it as
> well.

Yes, it would be welcome.  So far it's been a bit of a "nice to have"
which has kept it low priority, but a concrete user could help our
focus here.

>> You have whole Python class to create MBR bit by bit when few
>> calls to 'sfdisk/gdisk' shell commands do the same.
> 
> Well, the comments at
> http://git.openstack.org/cgit/openstack/diskimage-builder/tree/diskimage_builder/block_device/level1/mbr.py?id=5d5fa06#n28
> make some attempt at explaining why it doesn't just do that instead
> (at least as of ~7 months ago?).

I agree with the broad argument of this sentiment; that writing a
binary-level GPT implementation is out of scope for dib (and the
existing MBR one is, with hindsight, something I would have pushed
back on more).

dib-block-device being in python is a double edged sword -- on the one
hand it's harder to drop in a few lines like in shell, but on the
other hand it has proper data structures, unit testing, logging and
config-reading abilities -- things that all are rather ugly, or get
lost with shell.  The code is not perfect, but doing more things like
[1,2] to enhance and better use libraries will help everyone (and
notice that's making it easier to translate directly to parted, no
coincidence :)

The GPL linkage issue, as described in the code, prevents us doing the
obvious thing and calling directly via python.  But I believe we will
be OK just making system() calls to parted to configure GPT;
especially given the clearly modular nature of it all.

In terms of implementation, since you've already looked, I think
essentially diskimage_builder/block_device/level1.py create() will
need some moderate re-factoring to call a gpt implementation in
response to a gpt label, which could translate self.partitions into a
format for calling parted via our existing exec_sudo.

This is highly amenable to a test-driven development scenario as we
have some pretty good existing unit tests for various parts of the
partitioning to template from (for example, tests/test_lvm.py).  So
bringing up a sample config and test, then working backwards from what
calls we expect to see is probably a great way to start.  Even if you
just want to provide some (pseudo)shell examples based on your
experience and any thoughts on the yaml config files it would be
helpful.
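
To give a rough idea of the sort of calls I'd expect dib-block-device
to end up making (sizes and partition names made up):

  IMAGE=/tmp/image.raw
  parted -s $IMAGE -- mklabel gpt
  parted -s $IMAGE -- mkpart ESP fat32 1MiB 512MiB
  # on gpt the "boot" flag marks the EFI system partition
  parted -s $IMAGE -- set 1 boot on
  parted -s $IMAGE -- mkpart root ext4 512MiB 100%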

--

I try to run the meetings described in [3] if there is anything on the
agenda.  The cadence is probably not appropriate for this, we can do
much better via mail here, or #openstack-dib in IRC.  I hope we can
collaborate in a positive way; as I mentioned I think as a first step
we'd be best working backwards from what we expect to see in terms of
configuration, partition layout and parted calls.

Thanks,

-i

[1] https://review.openstack.org/#/c/503574/
[2] https://review.openstack.org/#/c/503572/
[3] https://wiki.openstack.org/wiki/Meetings/diskimage-builder

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Adding ARM64 cloud to infra

2018-01-11 Thread Ian Wienand

On 01/10/2018 08:41 PM, Gema Gomez wrote:

1. Control-plane project that will host a nodepool builder with 8 vCPUs,
8 GB RAM, 1TB storage on a Cinder volume for the image building scratch
space.

Does this mean you're planning on using diskimage-builder to produce
the images to run tests on?  I've seen occasional ARM things come by,
but of course diskimage-builder doesn't have CI for it (yet :) so its
status is probably "unknown".

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] ze04 & #532575

2018-01-10 Thread Ian Wienand
Hi,

To avoid you having to pull apart the logs starting ~ [1], we
determined that ze04.o.o was externally rebooted at 01:00UTC (there is
a rather weird support ticket which you can look at, which is assigned
to a rackspace employee but in our queue, saying the host became
unresponsive).

Unfortunately that left a bunch of jobs orphaned and necessitated a
restart of zuul.

However, recent changes to not run the executor as root [2] were thus
partially rolled out on ze04 as it came up after reboot.  As a
consequence when the host came back up the executor was running as
root with an invalid finger server.

The executor on ze04 has been stopped, and the host placed in the
emergency file to avoid it coming back.  There are now some in-flight
patches to complete this transition, which will need to be staged a
bit more manually.

The other executors have been left as is, based on the KISS theory
they shouldn't restart and pick up the code until this has been dealt
with.

Thanks,

-i


[1] 
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2018-01-11.log.html#t2018-01-11T01:09:20
[2] https://review.openstack.org/#/c/532575/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Xenial Upgrade Sprint Recap

2017-12-18 Thread Ian Wienand

On 12/19/2017 01:53 AM, James E. Blair wrote:

Ian Wienand  writes:


There's a bunch of stuff that wouldn't show up until live, but we
probably could have got a lot of prep work out of the way if the
integration tests were doing something.  I didn't realise that although
we run the tests, most of our modules don't actually have any tests
run ... even something very simple like "apply without failures"


Don't the apply tests do that?


Not really; since they do a --noop run they find things like syntax
issues, dependency loops, missing statements etc; but this does leave a
lot of room for other failures.

For example, our version of puppet-nodejs was warning on Xenial "this
platform not supported, I'll try to use sensible defaults", which
passed through the apply tests -- but wasn't actually working when it
came to really getting nodejs on the system alongside
etherpad/ethercalc.
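
To put it another way, the apply test boils down to something like the
first command below (paths illustrative); it's the second that would
actually hit the platform problems:

  # what the apply tests do -- compile and simulate, change nothing
  puppet apply --noop --modulepath=/etc/puppet/modules manifests/site.pp
  # what would have caught the nodejs issue -- actually enforce it
  puppet apply --detailed-exitcodes --modulepath=/etc/puppet/modules manifests/site.pp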

I also think there was some false sense of security since (now called)
legacy-puppet-beaker-rspec-infra was working ... even though *it* was
a noop too.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Xenial Upgrade Sprint Recap

2017-12-17 Thread Ian Wienand


On 12/16/2017 10:17 AM, Clark Boylan wrote:
> As the week ends there are a few services that are still in progress.
> Hope to get them done shortly:
>* etherpad.openstack.org
>* ethercalc.openstack.org
>* status.openstack.org

Working on these, all pretty much held up due to nodejs issues but
reviews are out.

There's a bunch of stuff that wouldn't show up until live, but we
probably could have got a lot of prep work out of the way if the
integration tests were doing something.  I didn't realise that although
we run the tests, most of our modules don't actually have any tests
run ... even something very simple like "apply without failures"

Of our integration modules, only a few have tests per below.  If we
have a pre-sprint next time to get some basic testing of these modules
against 18.04 I think that would be helpful.

That said, this has been a particularly big one due to our renaming of
things into numeric groups and upstart->systemd changes.

-i

$ for d in puppet-*; do \
    printf "%-30s" "$d"; \
    if ls $d/spec/acceptance/*_spec.rb > /dev/null 2>&1; then echo "YES"; else echo "NO"; fi; \
  done
puppet-accessbot  YES
puppet-ansibleYES
puppet-askbot NO
puppet-asterisk   NO
puppet-bandersnatch   YES
puppet-bugdaystatsNO
puppet-bupNO
puppet-cgit   YES
puppet-diskimage_builder  YES
puppet-drupal NO
puppet-elastic_recheckNO
puppet-elasticsearch  YES
puppet-ethercalc  NO
puppet-etherpad_lite  NO
puppet-exim   NO
puppet-germqttNO
puppet-gerrit YES
puppet-gerritbot  NO
puppet-github NO
puppet-grafyaml   NO
puppet-graphite   YES
puppet-havegedYES
puppet-hound  YES
puppet-httpd  YES
puppet-infracloud YES
puppet-iptables   NO
puppet-jeepyb NO
puppet-jenkinsYES
puppet-kerberos   NO
puppet-kibana NO
puppet-lodgeitYES
puppet-log_processor  NO
puppet-logrotate  NO
puppet-logstash   YES
puppet-lpmqtt NO
puppet-mailmanNO
puppet-mediawiki  NO
puppet-meetbotNO
puppet-mosquitto  NO
puppet-mqtt_statsdNO
puppet-mysql_backup   NO
puppet-nodepool   NO
puppet-odsreg NO
puppet-openafsNO
puppet-openstackciYES
puppet-openstack_health   YES
puppet-openstackidNO
puppet-os_client_config   NO
puppet-packagekit NO
puppet-pgsql_backup   NO
puppet-phabricatorNO
puppet-pipYES
puppet-planet NO
puppet-project_config NO
puppet-ptgbot NO
puppet-redis  NO
puppet-refstack   NO
puppet-reviewday  NO
puppet-simpleproxyNO
puppet-snmpd  NO
puppet-sshNO
puppet-ssl_cert_check NO
puppet-stackalytics   NO
puppet-statusbot  NO
puppet-storyboard NO
puppet-subunit2sqlNO
puppet-sudoersNO
puppet-tmpreaper  NO
puppet-ulimit NO
puppet-unattended_upgradesNO
puppet-unboundNO
puppet-user   NO
puppet-zanata NO
puppet-zuul   YES

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Gate Issues

2017-12-08 Thread Ian Wienand

On 12/08/2017 08:38 PM, Ian Wienand wrote:

However, the gate did not become healthy.  Upon further investigation,
the executors are very frequently failing jobs with

  2017-12-08 06:41:10,412 ERROR zuul.AnsibleJob: [build: 
11062f1cca144052afb733813cdb16d8] Exception while executing job
  Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 
588, in execute
  str(self.job.unique))
File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 
702, in _execute
File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 
1157, in prepareAnsibleFiles
File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 
500, in make_inventory_dict
  for name in node['name']:
  TypeError: unhashable type: 'list'

This is leading to the very high "retry_limit" failures.

We suspect change [3] as this did some changes in the node area.
[3] https://review.openstack.org/521324


It was quickly pointed out by frickler that jobs to ze04 were working,
which made it clear that actually the executors just needed to be
restarted to pick up these changes too.  I've done that and things are
looking better.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] Gate Issues

2017-12-08 Thread Ian Wienand
Hello,

Just to save people reverse-engineering IRC logs...

At ~04:00UTC frickler called out that things had been sitting in the
gate for ~17 hours.

Upon investigation, one of the stuck jobs was a
legacy-tempest-dsvm-neutron-full job
(bba5d98bb7b14b99afb539a75ee86a80) as part of
https://review.openstack.org/475955

Checking the zuul logs, it had sent that to ze04

  2017-12-07 15:06:20,962 DEBUG zuul.Pipeline.openstack.gate: Build > started

However, zuul-executor was not running on ze04.  I believe there were
issues with this host yesterday.  "/etc/init.d/zuul-executor start" and
"service zuul-executor start" reported as OK, but didn't actually
start the daemon.  Rather than debug, I just used
_SYSTEMCTL_SKIP_REDIRECT=1 and that got it going.  We should look into
that, I've noticed similar things with zuul-scheduler too.
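
For the record, the workaround was along the lines of:

  _SYSTEMCTL_SKIP_REDIRECT=1 /etc/init.d/zuul-executor start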

At this point, the evidence suggested zuul was waiting for jobs that
would never return.  Thus I saved the queues, restarted zuul-scheduler
and re-queued.

Soon after frickler again noticed that releasenotes jobs were now
failing with "could not import extension openstackdocstheme" [1].  We
suspect [2].

However, the gate did not become healthy.  Upon further investigation,
the executors are very frequently failing jobs with

 2017-12-08 06:41:10,412 ERROR zuul.AnsibleJob: [build: 
11062f1cca144052afb733813cdb16d8] Exception while executing job
 Traceback (most recent call last):
   File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 
588, in execute
 str(self.job.unique))
   File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 
702, in _execute
   File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 
1157, in prepareAnsibleFiles
   File "/usr/local/lib/python3.5/dist-packages/zuul/executor/server.py", line 
500, in make_inventory_dict
 for name in node['name']:
 TypeError: unhashable type: 'list'

This is leading to the very high "retry_limit" failures.

We suspect change [3] as this did some changes in the node area.  I
did not want to revert this via a force-merge, and I unfortunately don't
have time to do something like apply manually on the host and babysit
(I did not have time for a short email, so I sent a long one instead :)

At this point, I sent the alert to warn people the gate is unstable,
which is about the latest state.

Good luck,

-i

[1] 
http://logs.openstack.org/95/526595/1/check/build-openstack-releasenotes/f38ccb4/job-output.txt.gz
[2] https://review.openstack.org/525688
[3] https://review.openstack.org/521324

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Caching zanata-cli?

2017-12-03 Thread Ian Wienand
On 12/04/2017 09:54 AM, Andreas Jaeger wrote:
> ERROR: Failure downloading 
> https://search.maven.org/remotecontent?filepath=org/zanata/zanata-cli/3.8.1/zanata-cli-3.8.1-dist.tar.gz,
>  
> HTTP Error 503: Service Unavailable: Back-end server is at capacity
> 
> Could we cache this, please? Any takers?

There are several ways we could do this

 1. Stick it on tarballs.o.o -- which isn't local but may be more reliable
 2. Actually mirror via AFS -- a bit of a pain to setup for one file
 3. cache via reverse proxy -- possible
 4. add to CI images -- easy to do and avoid remote failures.

So I've proposed 4 in [1] and we can discuss further...

-i

[1] https://review.openstack.org/525050

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Sydney Infra evening

2017-11-07 Thread Ian Wienand
Let's meet at the swirly fountain pit at about 6:10pm

Preliminary plan is a ferry, dinner, walk and drinks

Not to sound like your Mum/Mom, but a light jacket and comfortable shoes
are suggested :)

-i

On 1 Nov. 2017 10:59 am, "Ian Wienand"  wrote:

On 10/18/2017 05:37 PM, Ian Wienand wrote:

> Hi all,
>
> As discussed in the meeting, I've started a page for planning an infra
> evening in Sydney (but note -- ALL welcome)
>
>https://ethercalc.openstack.org/lx7zv5denrb9
>

It looks like Wednesday night (8th) and the more active/pub crawl
option for those interested.

Cheers,

-i
___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Add member to upstream-institute-virtual-environment-core group

2017-11-01 Thread Ian Wienand
On 11/01/2017 09:27 PM, Ian Y. Choi wrote:
> Could you please add "Mark Korondi" in 
> upstream-institute-virtual-environment-core group?
> He is the bootstrapper of the project: 

It seems Mark has managed to get two gerrit accounts:

| registered_on       | full_name    | preferred_email        | contact_filed_on    |
|---------------------+--------------+------------------------+---------------------|
| 2015-12-16 17:55:29 | Mark Korondi | korondi.m...@gmail.com | 2014-03-06 21:07:35 |
| 2017-01-07 22:33:55 | Mark Korondi | korondi.m...@gmail.com | NULL                |

I have removed the second one and added to the group (I also added
yourself in case of issues).

Mark -- if you're having issues, reach out in #openstack-infra

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Sydney Infra evening

2017-10-31 Thread Ian Wienand

On 10/18/2017 05:37 PM, Ian Wienand wrote:

Hi all,

As discussed in the meeting, I've started a page for planning an infra
evening in Sydney (but note -- ALL welcome)

   https://ethercalc.openstack.org/lx7zv5denrb9


It looks like Wednesday night (8th) and the more active/pub crawl
option for those interested.

Cheers,

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] Sydney Infra evening

2017-10-17 Thread Ian Wienand

Hi all,

As discussed in the meeting, I've started a page for planning an infra
evening in Sydney (but note -- ALL welcome)

  https://ethercalc.openstack.org/lx7zv5denrb9

I put an active, less active and easy option.  Just fill it in and
we'll see where we're at.

Cheers,

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] Nominating new project-config and zuul job cores

2017-10-17 Thread Ian Wienand

On 10/14/2017 03:25 AM, Clark Boylan wrote:

I'd like to nominate a few people to be core on our job related config
repos. Dmsimard, mnaser, and jlk have been doing some great reviews
particularly around the Zuul v3 transition. In recognition of this work
I propose that we give them even more responsibility and make them all
cores on project-config, openstack-zuul-jobs, and zuul-jobs.

Please chime in with your feedback.


++ nice to see a lively project!

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] [openstack-dev] [all] Zuul v3 Rollout Update - devstack-gate issues edition

2017-10-12 Thread Ian Wienand
On 10/12/2017 05:52 PM, Ian Wienand wrote:
> I tried this in order, firstly recreating references.db (didn't help)
> and so I have started the checksums.db recreation.  This is now
> running; I just moved the old one out of the way

Well, that didn't go so well.  The output flooded stuff and then it
died.

---
...
Within references.db subtable references at get: No such file or directory
BDB0134 read: 0x11989b0, 4096: No such file or directory
Internal error of the underlying BerkeleyDB database:
Within references.db subtable references at get: No such file or directory
BDB0134 read: 0x11989b0, 4096: No such file or directory
Internal error of the underlying BerkeleyDB database:
Within references.db subtable references at get: No such file or directory
37 files were added but not used.
The next deleteunreferenced call will delete them.
BDB0151 fsync: Connection timed out
BDB0164 close: Connection timed out
./db/checksums.db: Connection timed out
BDB3028 ./db/checksums.db: unable to flush: Connection timed out
db_close(checksums.db, pool): Connection timed out
Error creating './db/version.new': Connection timed out(errno is 110)
Error 110 deleting lock file './db/lockfile': Connection timed out!
There have been errors!
---

Presumably this matches up with the AFS errors logged

---
[Thu Oct 12 09:19:59 2017] afs: Lost contact with file server 104.130.138.161 
in cell openstack.org (code -512) (all multi-homed ip addresses down for the 
server)
[Thu Oct 12 09:19:59 2017] afs: Lost contact with file server 104.130.138.161 
in cell openstack.org (code -512) (all multi-homed ip addresses down for the 
server)
[Thu Oct 12 09:19:59 2017] afs: failed to store file (110)
[Thu Oct 12 09:20:02 2017] afs: failed to store file (110)
[Thu Oct 12 09:20:10 2017] afs: file server 104.130.138.161 in cell 
openstack.org is back up (code 0) (multi-homed address; other same-host 
interfaces may still be down)
[Thu Oct 12 09:20:10 2017] afs: file server 104.130.138.161 in cell 
openstack.org is back up (code 0) (multi-homed address; other same-host 
interfaces may still be down)
---

I restarted for good luck, but if these are transient network issues, I
guess it will just happen again.  ping shows no packet loss, but very
occasional latency spikes, fwiw.

We restarted mirror-update; maybe it's worth restarting the AFS
servers too?

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] [openstack-dev] [all] Zuul v3 Rollout Update - devstack-gate issues edition

2017-10-11 Thread Ian Wienand
(moving to infra)

On 10/12/2017 04:28 PM, Ian Wienand wrote:
> mirrors provide, leading apt to great confusion.  Some debugging notes
> on reprepro at [1], but I have to conclude the .db files are corrupt
> and I have no idea how to recreate these other than to start again.

I ran the reprepro under strace, and the last thing that comes out is

 3170  pread(6, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 
90521600) = 4096

then it just stops with cpu at 100%.  lsof tells us

 reprepro 3170 root6u   REG   0,25  90628096 2537568 
/afs/.openstack.org/mirror/ubuntu/db/checksums.db

so, that db seems as likely as any to be causing the problems

pabelanger pointed to some recovery instructions at [1] previously.

I tried this in order, firstly recreating references.db (didn't help)
and so I have started the checksums.db recreation.  This is now
running; I just moved the old one out of the way

  root@mirror-update:/afs/.openstack.org/mirror/ubuntu/db# ls -lh
  total 1.1G
  -rw-r--r-- 1 10004 root 1.6M Oct 12 06:38 checksums.db
  -rw-r--r-- 1 10004 root  87M Oct 12 02:59 checksums.db.old

This started at about 06:30, meaning ~5 minutes/mb so I think around 6
hours till this is finished, hopefully (it's dragging everything
across afs).

Please take any of this over; it's running on mirror-update:

  screen(9683)─┬─bash(9684)───su(9917)───bash(9918)
               ├─bash(10466)───k5start(3755)───bash(3758)─┬─find(3996)
               │                                          └─reprepro(3997)

Note I'm holding the cron lock with:

 root 10957  9918  0 06:46 pts/000:00:00 flock -n 
/var/run/reprepro/ubuntu.lock bash -c while true; do sleep 1000; done

(ps, I think we need to make those dirs on reboot:
https://review.openstack.org/511380)

-i

[1] https://github.com/esc/reprepro/blob/master/docs/recovery

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] [incident] OVH-BHS1 mirror disappeared

2017-09-20 Thread Ian Wienand

At around Sep 21 02:30UTC mirror01.bhs1.ovh.openstack.org became
uncontactable and jobs in the region started to fail.

The server was in an ACTIVE state but uncontactable.  I attempted to
get a console but either a log or url request returned 500 (request
id's below if it helps).

 ... console url show ...
The server has either erred or is incapable of performing the requested 
operation. (HTTP 500) (Request-ID: req-5da4cba2-efe8-4dfb-a8a7-faf490075c89)
 ...  console log show ...
The server has either erred or is incapable of performing the requested 
operation. (HTTP 500) (Request-ID: req-80beb593-b565-42eb-8a97-b2a208e3d865)

I could not figure out how to log into the web console with our
credentials.

I attempted to hard-reboot it, and it currently appears stuck in
HARD_REBOOT.  Thus I have placed nodepool.o.o in the emergency file
and set max-servers for the ovh-bhs1 region to 0

I have left it at this, as hopefully it will be beneficial for both
OVH and us to diagnose the issue since the host was definitely not
expected to disappear.  After this we can restore or rebuild it as
required.

Thanks,

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] citycloud lon1 mirror postmortem

2017-08-10 Thread Ian Wienand

Hi,

In response to sdague reporting that citycloud jobs were timing out, I
investigated the mirror, suspecting it was not providing data fast enough.

There were some 170 htcacheclean jobs running, and the host had a load
over 100.  I killed all these, but performance was still unacceptable.

I suspected networking, but since the host was in such a bad state I
decided to reboot it.  Unfortunately it would get an address from DHCP
but seemed to have DNS issues ... eventually it would ping but nothing
else was working.

nodepool.o.o was placed in the emergency file and I removed lon1 to
avoid jobs going there.

I used the citycloud live chat, and Kim helpfully investigated and
ended up migrating mirror.lon1.citycloud.openstack.org to a new
compute node.  This appeared to fix things, for us at least.

nodepool.o.o is removed from the emergency file and original config
restored.

With hindsight, the excessive htcacheclean processes were clearly due
to a feedback loop: slow runs, caused by the network/DNS issues,
bunching up over time.  However, I still think we could
minimise further issues running it under a lock [1].  Other than that,
not sure there is much else we can do, I think this was largely an
upstream issue.
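
i.e. something roughly like this (flags and limits illustrative; see
the review for the real change):

  flock -n /var/run/htcacheclean.lock \
      htcacheclean -n -t -p /var/cache/apache2/proxy -l 20480M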

Cheers,

-i

[1] https://review.openstack.org/#/c/492481/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] [infra][nova] Corrupt nova-specs repo

2017-06-29 Thread Ian Wienand
Hi,

Unfortunately it seems the nova-specs repo has undergone some
corruption, currently manifesting itself in an inability to be pushed
to github for replication.

Upon examination, it seems there's a problem with a symlink and
probably jgit messing things up making duplicate files.  I have filed
a gerrit bug at [1] (although it's probably jgit, but it's just a
start).

Anyway, that leaves us the problem of cleaning up the repo into a
pushable state.  Here's my suggestion after some investigation:

The following are corrupt

---
$ git fsck
Checking object directories: 100% (256/256), done.
error in tree a494151b3c661dd9b6edc7b31764a2e2995bd60c: contains duplicate file 
entries
error in tree 26057d370ac90bc01c1cfa56be8bd381618e2b3e: contains duplicate file 
entries
error in tree 57423f5165f0f1f939e2ce141659234cbb5dbd4e: contains duplicate file 
entries
error in tree 05fd99ef56cd24c403424ac8d8183fea33399970: contains duplicate file 
entries
---

After some detective work [2], I related all these bad objects to the
refs that hold them.  It looks as follows

---
fsck-bad: a494151b3c661dd9b6edc7b31764a2e2995bd60c
 -> 5fa34732b45f4afff3950253c74d7df11b0a4a36 refs/changes/26/463526/9

fsck-bad: 26057d370ac90bc01c1cfa56be8bd381618e2b3e
 -> 47128a23c2aad12761aa0df5742206806c1dfbb8 refs/changes/26/463526/8
 -> 7cf8302eb30b722a00b4d7e08b49e9b1cd5aacf4 refs/changes/26/463526/7
 -> 818dc055b971cd2b78260fd17d0b90652fb276fb refs/changes/26/463526/6

fsck-bad: 57423f5165f0f1f939e2ce141659234cbb5dbd4e

 -> 25bd72248682b584fb88cc01061e60a5a620463f refs/changes/26/463526/3
 -> c7e385eaa4f45b92e9e51dd2c49e799ab182ac2c refs/changes/26/463526/4
 -> 4b8870bbeda2320564d1a66580ba6e44fbd9a4a2 refs/changes/26/463526/5

fsck-bad: 05fd99ef56cd24c403424ac8d8183fea33399970
 -> e8161966418dc820a4499460b664d87864c4ce24 refs/changes/26/463526/2
---

So you may notice this is refs/changes/26/463526/[2-9]

Just deleting these refs and expiring the objects might be the easiest
way to go here, and seems to get things purged and fix up fsck

---
$ for i in `seq 2 9`; do
>  git update-ref -d refs/changes/26/463526/$i
> done

$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Counting objects: 44756, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (43850/43850), done.
Writing objects: 100% (44756/44756), done.
Total 44756 (delta 31885), reused 12846 (delta 0)

$ git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (44756/44756), done.
---

I'm thinking if we then force push that to github, we're pretty much
OK ... a few intermediate reviews will be gone but I don't think
they're important in this context.

I had a quick play with "git ls-tree", editing the file, "git mktree",
"git replace" and then trying to use filter-branch, but couldn't get
it to work.  Suggestions welcome; you can play with the repo from [1]
I would say.
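
For reference, the sort of thing I was attempting there was:

  # dump the bad tree and remove the duplicate entry by hand
  git ls-tree a494151b3c661dd9b6edc7b31764a2e2995bd60c > tree.txt
  $EDITOR tree.txt
  # make a fixed tree object and substitute it for the bad one
  fixed=$(git mktree < tree.txt)
  git replace a494151b3c661dd9b6edc7b31764a2e2995bd60c $fixed
  # then try to rewrite history to make the replacement permanent
  git filter-branch -- --all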

Thanks,

-i

[1] https://bugs.chromium.org/p/gerrit/issues/detail?id=6622
[2] "git log --all --format=raw --raw -t --no-abbrev" and search for
the change sha, then find it in "git show-refs"

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

[OpenStack-Infra] static.o.o : skipping logs fsck at boot

2017-03-31 Thread Ian Wienand

Hi,

Thanks to Yolanda for driving the logs.o.o recovery after some RAX
maintenance on the log volumes.

Just some thoughts -- firstly this host boots up into a graphical
console which gives a pretty but not terribly informative logo screen
after you've pulled up the emergency console (at least it's not java
any more?).  I presume nobody would be unhappy if we fiddle the grub
defaults to make this boot in text mode.

What do we think about turning off boot fsck for the logs volume (and
others)?  I admit we rarely reboot this, but when we do, it doesn't
seem strictly necessary to hold up boot for the long check on this
large volume.  While main-static volume is fsck-ing, /srv/static/logs
is available from main-static and can buffer logs until its finished.
This is what we're doing right now, but I had to get into the
aforementioned console and convince it to continue booting.
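
Concretely, the sort of changes I mean (a sketch only; device names
made up):

  # /etc/default/grub -- boot with a plain text console instead of
  # the splash screen (then run update-grub)
  GRUB_CMDLINE_LINUX_DEFAULT=""
  GRUB_TERMINAL=console

  # /etc/fstab -- set the final fs_passno field to 0 so the logs
  # volume is skipped by the boot-time fsck
  /dev/main/logs  /srv/static/logs  ext4  defaults,noatime  0  0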

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Moving DIB to infra

2017-03-17 Thread Ian Wienand
On 03/16/2017 11:34 PM, Jeremy Stanley wrote:
> I'd also like to be certain the current DIB contributors are
> entirely disinterested in forming a separate official team in
> OpenStack as I doubt the TC would reject such a proposal (I'd
> happily support it).

Assuming "interested" means you had more than a couple of trivial
changes in the last release period would leave a voting group of maybe
5 people [1]?  It seems like a lot of bureaucracy to start up a whole
team for that?

Quite a lot of brain-power seems to have been spent on this so far.
Personally I don't see the difference between TripleO cores who
technically have power but don't use it or infra cores who technically
have power but don't use it.  I'm just finding it hard to find a hook
to engage with the whole thing.  If people feel strongly about moving
it under infra ok, but I'm not sure what difference it makes.

-i

[1] 
http://stackalytics.com/?module=diskimage-builder&metric=commits&release=ocata

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Request to add initial blazar-release team member

2017-03-08 Thread Ian Wienand

On 03/08/2017 03:45 PM, Masahito MUROI wrote:

This is a request mail to add me into blazar-release team[1] as an
initial member of the team.


Done

Thanks

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Ask.o.o down

2017-03-07 Thread Ian Wienand

On 03/07/2017 07:20 PM, Gene Kuo wrote:

These errors do line up to the time where it's down.
However, I have no idea what cause apache to seg fault.


Something disappearing underneath it would be my suspicion

Anyway, I added "CoreDumpDirectory /var/cache/apache2" to
/etc/apache2/apache2.conf manually (don't think it's puppet managed?)

Let's see if we can pick up a core dump; we can
at least then trace it back.
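
If nothing turns up there, we may also need to raise the core size
limit for the workers; from memory that is something like adding the
following to /etc/apache2/envvars:

  # allow unlimited-size core files for the apache processes
  ulimit -c unlimited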

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Ask.o.o down

2017-03-07 Thread Ian Wienand
On 03/07/2017 06:46 PM, Gene Kuo wrote:
> I found that ask.o.o is down again. 

I restarted apache

---
root@ask:/var/log/apache2# date
Tue Mar  7 07:54:26 UTC 2017

[Tue Mar 07 06:01:38.469993 2017] [core:notice] [pid 19511:tid 140460060575616] 
AH00052: child pid 19517 exit signal Segmentation fault (11)
[Tue Mar 07 06:31:21.397621 2017] [mpm_event:notice] [pid 19511:tid 
140460060575616] AH00493: SIGUSR1 received.  Doing graceful restart
[Tue Mar 07 06:31:21.529687 2017] [core:notice] [pid 19511] AH00060: seg fault 
or similar nasty error detected in the parent process
---

These errors probably line up with the failure start?

Here are the recent failures I can find; do these line up with the
failure times?  They do not seem to happen at a consistent time, as
the suggested cron job would imply.

---
error.log.1   :[Tue Mar 07 06:01:38.469993 2017] [core:notice] [pid 19511:tid 
140460060575616] AH00052: child pid 19517 exit signal Segmentation fault (11)
error.log.2.gz:[Sun Mar 05 17:42:53.457689 2017] [core:notice] [pid 18065:tid 
140399259293568] AH00052: child pid 16820 exit signal Segmentation fault (11)
error.log.4.gz:[Fri Mar 03 18:21:50.242282 2017] [core:notice] [pid 6066:tid 
140255921559424] AH00052: child pid 6072 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 08:40:37.026106 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 10324 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 13:38:41.474969 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 11891 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 13:57:44.712564 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 9635 exit signal Segmentation fault (11)
error.log.6.gz:[Wed Mar 01 22:03:17.717681 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 19925 exit signal Segmentation fault (11)
error.log.6.gz:[Thu Mar 02 05:51:14.236546 2017] [core:notice] [pid 9628:tid 
140257025800064] AH00052: child pid 15214 exit signal Segmentation fault (11)
---

I'm really not sure what happened with the logs; the 8th rotation
seems to have disappeared and then they get really old.

---
-rw-r- 1 root adm  481 Mar  1 06:54 error.log.7.gz
-rw-r- 1 root adm 4831 Oct  3 06:44 error.log.9.gz
---

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] Dropped connections from static.o.o

2017-01-29 Thread Ian Wienand
Hi,

We got a report of CI jobs failing with disconnects when
downloading from tarballs.openstack.org.  The file in question is
a largish container for kolla-kubernetes [1]

ISTR this is not the first time we've had complaints about this, but
I'm not sure if we ever came up with a solution.

Below are some of the failed jobs with the ip, start & failure
time of the download.

---
http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-ceph-nv/21e4361/console.html
inet 146.20.105.26
2017-01-30 00:00:08.109056 | + curl ... 
http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:04:44.002456 | curl: (18) transfer closed with 88598540 bytes 
remaining to read

http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-ceph-multi-nv/fe0849b/console.html
inet 146.20.105.198
2017-01-30 00:00:08.471434 | + curl -o ... 
http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:04:42.685201 | curl: (18) transfer closed with 542002092 bytes 
remaining to read

http://logs.openstack.org/98/426598/5/check/gate-kolla-kubernetes-deploy-centos-binary-2-external-ovs-nv/38030a1/console.html
inet6 2001:4800:1ae1:18:f816:3eff:fe9b:f9e2/64 
2017-01-30 00:01:07.306370 | + curl -o ... 
http://tarballs.openstack.org/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2
2017-01-30 00:03:34.810258 | curl: (18) transfer closed with 222546512 bytes 
remaining to read
---

At first, there is not much correlation and two of the requests
appear to not be logged at all.

---
root@static:/var/log/apache2# grep 'centos-binary-ceph.tar.bz2' 
tarballs.openstack.org_*  | grep '30/Jan'
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe9e:4ccf - - 
[30/Jan/2017:00:00:33 +] "GET 
/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 
1395496732 "-" "curl/7.29.0"
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe67:2ca6 - - 
[30/Jan/2017:00:00:47 +] "GET 
/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 
1395496732 "-" "curl/7.29.0"
tarballs.openstack.org_access.log:2001:4800:1ae1:18:f816:3eff:fe9b:f9e2 - - 
[30/Jan/2017:00:01:07 +] "GET 
/kolla-kubernetes/gate/containers//centos-binary-ceph.tar.bz2 HTTP/1.1" 200 
1395496732 "-" "curl/7.29.0"
---

However, I went to the generic apache error log around that time and
found the following

---
[Sun Jan 29 23:59:16.284909 2017] [mpm_event:error] [pid 1967:tid 
140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:00:02.334021 2017] [mpm_event:error] [pid 1967:tid 
140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:00:04.336258 2017] [mpm_event:error] [pid 1967:tid 
140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:01:48.449350 2017] [mpm_event:error] [pid 1967:tid 
140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:02:25.490781 2017] [mpm_event:error] [pid 1967:tid 
140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
[Mon Jan 30 00:03:10.539081 2017] [mpm_event:error] [pid 1967:tid 
140205198583680] AH00485: scoreboard is full, not at MaxRequestWorkers
...
---

I think this is a smoking gun for the issue, because this issue leads
to the death of the serving process, which gets logged a little later.
Correlating this it seems like a few of these time-stamps match up
with when the reported jobs reported they got disconnected.

---
[Mon Jan 30 00:03:31.562290 2017] [core:notice] [pid 1967:tid 140205198583680] 
AH00052: child pid 14410 exit signal Segmentation fault (11)
[Mon Jan 30 00:03:35.566735 2017] [core:notice] [pid 1967:tid 140205198583680] 
AH00052: child pid 20378 exit signal Segmentation fault (11)
...
[Mon Jan 30 00:04:17.614883 2017] [core:notice] [pid 1967:tid 140205198583680] 
AH00052: child pid 4126 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:17.614951 2017] [core:notice] [pid 1967:tid 140205198583680] 
AH00052: child pid 4204 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:22.621893 2017] [core:notice] [pid 1967:tid 140205198583680] 
AH00052: child pid 4290 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:37.638901 2017] [core:notice] [pid 1967:tid 140205198583680] 
AH00052: child pid 4358 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:41.643388 2017] [core:notice] [pid 1967:tid 140205198583680] 
AH00052: child pid 4324 exit signal Segmentation fault (11)
[Mon Jan 30 00:04:42.645053 2017] [core:notice] [pid 1967:tid 140205198583680] 
AH00052: child pid 20485 exit signal Segmentation fault (11)
---

This ungraceful exit might also explain the lack of logs.
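
For anyone else digging through this, the relevant timestamps can be
pulled out of the rotated error logs with something like the below
(AH00485 is the scoreboard message, AH00052 the child exits):

---
# zgrep copes with both the compressed and uncompressed rotations
zgrep -h 'AH00485\|AH00052' /var/log/apache2/error.log*
---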

Unfortunately, the prognosis for this issue is not great.  The
original bug [2] seems to show it is a systemic issue and it is
discussed in the documentation [3] which says in short

 This mpm showed some scalability bottlenecks in the past leading to
 the following error: "scoreboard is full, not at MaxRequestWorkers".

Re: [OpenStack-Infra] translate-dev wildfly/zanata/trove issues

2017-01-15 Thread Ian Wienand

On 01/16/2017 03:15 PM, Ian Y. Choi wrote:

Note that this issue is re-generatable: I am able recreate the issue on
translate-dev
: When I create a new version from openstack-manuals from master branch
[1] - 20107950 words,
there is no further web responses from translate-dev.o.o around after
3/4 of total words were processed
like [2-4]. It was completely fine for a version creation with 2610150
words.


One thing I noticed was that the trove db for translate-dev seems to
be set to 2gb (I'm guessing this means working RAM) and this might be
a bit tight?  Maybe the remote mysql needs more than that; fails
transactions and leaves things in a tricky state?  Does anyone know
what sort of demands are placed on the db?
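
If anyone wants to poke at it, an obvious first check is how much
memory the remote mysql actually thinks it has to play with -- roughly
the below, with the host and credentials taken from the zanata config
(the variable names here are just placeholders):

---
mysql -h $TROVE_HOST -u $ZANATA_DB_USER -p \
  -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW GLOBAL STATUS LIKE 'Threads_connected';"
---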

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] translate-dev wildfly/zanata/trove issues

2017-01-15 Thread Ian Wienand
Hi,

I was alerted to translate-dev performance issues today.  Indeed, it
seemed that things were going crazy with the java wildfly process
sucking up all CPU.

At first there didn't seem to be anything in the logs.  Java was
clearly going mad however, with the following threads going flat-out.

--- gone crazy processes ---
14807 wildfly   20   0  9.892g 4.729g  34232 R 93.7 60.7  26:56.73 java
14804 wildfly   20   0  9.892g 4.729g  34232 R 92.4 60.7  26:51.91 java
14806 wildfly   20   0  9.892g 4.729g  34232 R 92.4 60.7  26:53.92 java
14808 wildfly   20   0  9.892g 4.729g  34232 R 92.4 60.7  26:53.97 java
14810 wildfly   20   0  9.892g 4.729g  34232 R 92.4 60.7  26:56.28 java
14809 wildfly   20   0  9.892g 4.729g  34232 R 92.1 60.7  26:57.74 java
14803 wildfly   20   0  9.892g 4.729g  34232 R 91.1 60.7  26:54.90 java
14805 wildfly   20   0  9.892g 4.729g  34232 R 90.4 60.7  26:52.44 java
---

Hoping to find an easy smoking-gun, I made the java process dump its
threads to see what these are doing
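
(For reference, a dump like the one below comes from either of the
usual hotspot mechanisms, e.g.

  kill -3 <wildfly-pid>    # SIGQUIT; the dump lands on the jvm's stdout/console log
  jstack <wildfly-pid>     # if the jdk tools are installed

run against the main wildfly java process.)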

--- thread dump ---

14807  14804  14806  14808  14810  14809  14803, 14805
0x39d7 0x39d4 0x39d6 0x39d8 0x39da 0x39d9 0x39d3 0x39d5

"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x7fb2a8026000 nid=0x39d7 
runnable 
"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x7fb2a8026000 nid=0x39d7 
runnable 
"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x7fb2a8026000 nid=0x39d7 
runnable 
"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x7fb2a8026000 nid=0x39d7 
runnable 

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x7fb2a8021000 nid=0x39d4 
runnable 
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x7fb2a8021000 nid=0x39d4 
runnable 
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x7fb2a8021000 nid=0x39d4 
runnable 
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x7fb2a8021000 nid=0x39d4 
runnable 

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x7fb2a8024800 nid=0x39d6 
runnable 
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x7fb2a8024800 nid=0x39d6 
runnable 
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x7fb2a8024800 nid=0x39d6 
runnable 
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x7fb2a8024800 nid=0x39d6 
runnable 

"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x7fb2a8028000 nid=0x39d8 
runnable 
"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x7fb2a8028000 nid=0x39d8 
runnable 
"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x7fb2a8028000 nid=0x39d8 
runnable 
"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x7fb2a8028000 nid=0x39d8 
runnable 

"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x7fb2a802b800 nid=0x39da 
runnable 
"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x7fb2a802b800 nid=0x39da 
runnable 
"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x7fb2a802b800 nid=0x39da 
runnable 
"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x7fb2a802b800 nid=0x39da 
runnable 

"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x7fb2a8029800 nid=0x39d9 
runnable 
"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x7fb2a8029800 nid=0x39d9 
runnable 
"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x7fb2a8029800 nid=0x39d9 
runnable 
"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x7fb2a8029800 nid=0x39d9 
runnable 

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x7fb2a801f000 nid=0x39d3 
runnable 
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x7fb2a801f000 nid=0x39d3 
runnable 
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x7fb2a801f000 nid=0x39d3 
runnable 
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x7fb2a801f000 nid=0x39d3 
runnable
---

Unfortunately, these all appear to be GC threads so there's nothing
really obvious there.

However, eventually in the log you start getting stuff like

---
2017-01-16T01:58:00,844Z SEVERE 
[javax.enterprise.resource.webcontainer.jsf.application] (default task-43) 
Error Rendering View[/project/project.xhtml]: javax.el.ELException: 
/WEB-INF/layout/project/settings-tab-languages.xhtml @117,88 rendered="#{not 
projectHome.hasLocaleAlias(locale)}": javax.persistence.PersistenceException: 
org.hibernate.HibernateException: Transaction was rolled back in a different 
thread!
...
2017-01-16T01:08:48,805Z WARN  
[org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (pool-5-thread-1) SQL Error: 
0, SQLState: null
2017-01-16T01:08:48,806Z ERROR 
[org.hibernate.engine.jdbc.spi.SqlExceptionHelper] (pool-5-thread-1) 
javax.resource.ResourceException: IJ000460: Error checking for a transaction
2017-01-16T01:08:48,813Z ERROR [org.hibernate.AssertionFailure] 
(pool-5-thread-1) HHH99: an assertion failure occured (this may indicate a 
bug in Hibernate, but is more likely due to unsafe use of the session): 
org.hibernate.exception.GenericJDBCException: Could not open connection
2017-01-16T01:08:48,813Z WARN  [com.arjuna.ats.jta] (pool-5-thread-1) 
ARJUNA016029: SynchronizationImple.afterCompletion - failed for 
org.hibernate.engine.transaction.synchronization.internal.RegisteredSynchronization@3fc123

Re: [OpenStack-Infra] timeout, shells and ansible 2.2

2017-01-10 Thread Ian Wienand
On 01/11/2017 04:53 PM, Ian Wienand wrote:
> The thing is, ansible 2.0.2.0 seems to do all the ssh stuff very
> differently so this doesn't appear to happen.

I tell a lie; this same thing actually happens with 2.0.2.0

I'm wondering if just nobody has run "run_all.sh" by hand (on a
terminal) since [1].  I know we've frequently run the playbooks
individually by hand, but not under timeout like that.

After some more searching I've proposed [2]
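
The gist of [2] is the extra flag here:

---
# --foreground stops timeout putting the command into its own
# (background) process group, so tty signals like SIGTTOU behave
# normally when the script is run from a terminal
timeout --foreground -k 2m 120m ansible-playbook -vvv ...
---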

At the bottom of all this yak shaving, I think it's fine to upgrade
puppetmaster to 2.2.1-rc3 to work around any potential for the recent
CVE.  I've proposed [3] and [4] for that.

-i

[1] https://review.openstack.org/#/c/311913/
[2] https://review.openstack.org/418722 Run timeout with --foreground
[3] https://review.openstack.org/418652
[4] https://review.openstack.org/418671

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] timeout, shells and ansible 2.2

2017-01-10 Thread Ian Wienand

So I'm trying to test ansible 2.2.1-rc3 on puppetmaster.  If as root
you try

# /root/ianw/ansible/bin/activate # ansible 2.2.1-rc3 venv
# /root/ianw/run_all.sh # run_all but with --check to dry-run

You should see the problem.  ansible-playbook just stops; when you
take a look at the wait channel, you can see it's all blocked on a
signal.

---
root@puppetmaster:/root/ianw/ansible# ps -aefl | grep ansible
0 S root  4202  1010  0  80   0 -  2927 wait   03:06 pts/15   00:00:00 
timeout -k 2m 120m ansible-playbook -vvv 
0 T root  4203  4202  0  80   0 - 44381 signal 03:06 pts/15   00:00:02 
/root/ianw/ansible/bin/python /root/ianw/
1 T root  4331  4203  0  80   0 - 46533 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
1 T root  4360  4203  0  80   0 - 46596 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
4 T root  4393  4331  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
4 T root  4398  4360  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
1 T root  4400  4203  0  80   0 - 46661 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
4 T root  4410  4400  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
1 T root  4412  4203  0  80   0 - 46661 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
1 T root  4421  4203  0  80   0 - 46725 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
4 T root  4422  4412  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
1 T root  4428  4203  0  80   0 - 46982 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
1 T root  4431  4203  0  80   0 - 46725 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
4 T root  4436  4421  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
1 T root  4438  4203  0  80   0 - 46789 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
4 T root  4446  4438  0  80   0 - 11551 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
1 T root  4447  4203  0  80   0 - 46792 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
4 T root  4448  4447  0  80   0 - 11109 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
1 T root  4452  4203  0  80   0 - 46856 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
4 T root  4455  4452  0  80   0 - 11037 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
4 T root  4461  4431  0  80   0 - 11005 signal 03:06 pts/15   00:00:00 ssh 
-C -o ControlMaster=auto -o ControlPe
1 T root  4462  4428  0  80   0 - 46725 signal 03:06 pts/15   00:00:00 
/root/ianw/ansible/bin/python /root/ianw/
---

stracing any of those gives

---
root@puppetmaster:/root/ianw/ansible# strace -p 4412
Process 4412 attached
--- stopped by SIGTTOU ---
---

I know from experience that SIGTTOU is just bad news, it means crappy
tricky terminal stuff ahead.  I started looking through an strace to
see where this comes up

---
(ansible)root@puppetmaster:/tmp# cat /tmp/output.log  | grep TTOU | grep kill
21029 kill(21029, SIGTTOU 
---

Having a look at pid 21029 it's an ssh process, that ansible is
launching via some pipe/fifo related system, but it's going to

---
execve("/usr/bin/ssh" 21029 connect(3, {sa_family=AF_INET6,
sin6_port=htons(22), inet_pton(AF_INET6, 
"2001:4800:7819:105:be76:4eff:fe04:a5b2", ...)
---

Well guess what; this host obviously doesn't have keys deployed and is
throwing up a password prompt.

---
# ssh 2001:4800:7819:105:be76:4eff:fe04:a5b2
root@2001:4800:7819:105:be76:4eff:fe04:a5b2's password:
---

(The other thing is -- what is this host and why is shade picking it
up?  but we have to handle this.  also maybe at the time it was a
host key unknown message, but same same)

It's at this point we can see ssh opening /dev/tty and then, bang, in
comes our SIGTTOU

---
21029 open("/dev/tty", O_RDWR 
21029 <... open resumed> )  = 4
21029 ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS 

21029 <... ioctl resumed> , {B38400 opost isig icanon echo ...}) = 0
21029 ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 
{B38400 opost isig icanon echo ...}) = 0
21029 ioctl(4, SNDCTL_TMR_CONTINUE or SNDRV_TIMER_IOCTL_GPARAMS or TCSETSF 

21029 <... ioctl resumed> , {B38400 opost isig icanon echo ...}) = ? 
ERESTARTSYS (To be restarted if SA_RESTART is set)
21029 --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
---

Which is all great EXCEPT it seems there is an, AFAICT, unresolved bug
in "timeout" that means it has incorrectly reset the SIGTTOU handlers
to default (?) so everything just completely stops. 

[OpenStack-Infra] Gerrit downtime on Thursday 2017-01-12 at 20:00 UTC

2017-01-10 Thread Ian Wienand
Hi everyone,

On Thursday, January 12th from approximately 20:00 through 20:30 UTC
Gerrit will be unavailable while we complete project renames.

Currently, we plan on renaming the following projects:

 Nomad -> Cyborg
  - openstack/nomad -> openstack/cyborg

 Nimble -> Mogan 
  - openstack/nimble -> openstack/mogan
  - openstack/python-nimbleclient -> openstack/python-moganclient
  - openstack/nimble-specs -> openstack/mogan-specs

Existing reviews, project watches, etc, for these projects will all be
carried over.

This list is subject to change. If you need a rename, please be sure
to get your project-config change in soon so we can review it and add
it to 
https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Upcoming_Project_Renames

If you have any questions about the maintenance, please reply here or
contact us in #openstack-infra on freenode.

-i 

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] [nodepool] Heads up: 1.26.0 dib release for Xenial/glean network issues

2016-12-21 Thread Ian Wienand

Hi,

We found a regression where python3-only Xenial images have a messed
up pip, and incorrectly installs glean.  The result is that the system
boots but no network.

Because dib builds images for a wide range of platforms, some of which
ship python3 only, we need a way to call python scripts that is
version agnostic.  For this reason we have the dib-python element,
which installs a local dib-python binary which can be used as a #!
script.  This script decides basically to call python or python3 as
appropriate.  A recent change made this explicit [1] and removed the
(theoretically) redundant python2 install.
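
Conceptually dib-python is nothing more than a tiny wrapper along these
lines -- this is just the idea, not the actual element code:

---
#!/bin/bash
# scripts use "#!/usr/local/bin/dib-python" and this picks whichever
# interpreter the image actually ships
if command -v python2 >/dev/null 2>&1; then
    exec python2 "$@"
else
    exec python3 "$@"
fi
---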

We believe this is due pollution of the VIRTUAL_ENV variable into the
building chroot and some magic that happens in site.py to fiddle paths
[2].  But we haven't quite sorted that out.  Of course, it is very
worrying that this all got past CI and we will be investigating that
too.

I have merged and released in 1.26.0 a hack [3] to ensure python2 is
installed for Xenial while we work on a better solution.

infra-root should be aware of this if there are any problems with
Xenial image generation that result in uncontactable hosts.  I believe
this will get us through the holidays.

Cheers & Happy Holidays/Merry Christmas all,

-i

[1] https://review.openstack.org/408288/
[2] https://review.openstack.org/413487/
[3] https://review.openstack.org/413410/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] [zuulv3] Zookeeper on CentOS 7

2016-12-06 Thread Ian Wienand
(I know this isn't the greatest place to discuss packaging, but this
seems like somewhere we can get interested people together)

After first looking a year ago (!) I've gone back to have another poke
at Zookeeper on CentOS 7 packages.  This is going to be required for
zuulv3.

As you can see from an attempted build-log [1] there are a bunch of
requirements.  Some of these are more problematic than others.  The
following etherpad has a range of info, but here's where I think we
need to go:

 https://etherpad.openstack.org/p/zookeeper-epel7

1) netty is a hard requirement; ZK can't work without it.  This seems
   to be rather bad news, because the dependency chain here is long.
   At [1], I have attempted builds of netty's dependencies; as you can
   see they have some extensive requirements of their own.

   This may actually be quite a bit to untangle, and I think we need
   to focus the discussion firstly on if this can actually be done.
   Without netty, I don't see there's anything further to do.  I have
   filed [2].

1a) I'm not clear on what exactly objectweb-pom brings, but it's a
build-dependency for >F21.  I have filed [3].  It may be a hard
dependency, but it does currently build at least.

2) Ivy is a dependency manager and ivy-local is part of the Fedora
   java packaging infrastructure.  We are not going to get that
   backported.  However, it seems that we could modify the build to
   not use ivy, but hack in dependencies manually [4]

3) checkstyle, jdiff, jtoaster all seem to be related to parts of the
   build we can skip such as test-suites, documentation and contrib
   tools.  I *think* that just means we cut bits out of build.xml

tl;dr -- this is a nightmare really; but netty and its
dependencies are where to start.

HOWEVER there is another option.  Take the whole upstream release and
shoe-horn it into an RPM.  Luckily I searched because someone already
did that [5] and with a bit of tweaking we can build a package in COPR
[6].  If you're interested, give it a try and we can iterate on any
issues.
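
Roughly the steps would be the following -- this is from memory and
untested, and the final package name may well differ:

---
yum install -y yum-plugin-copr
yum copr enable iwienand/zookeeper-el7
yum install -y zookeeper
---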

Now it's not really "packaged" as such, and obviously not going to be
officially distributed ... but maybe this will do?

-i

[1] 
https://copr-be.cloud.fedoraproject.org/results/ggillies/rdo-newton-extras/epel-7-x86_64/00484851-zookeeper/root.log.gz
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1402199
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1402195
[4] 
https://lists.fedoraproject.org/pipermail/java-devel/2015-November/005705.html
[5] https://github.com/id/zookeeper-el7-rpm/
[6] https://copr.fedorainfracloud.org/coprs/iwienand/zookeeper-el7

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] pypi volume downtime

2016-12-05 Thread Ian Wienand
On 6 Dec. 2016 3:13 am, "Kevin L. Mitchell" wrote:
> On Mon, 2016-12-05 at 15:30 +1100, Ian Wienand wrote:
>
> For the record, those log entries are from December 2nd, rather than
> February: US date conventions.

Heh, yep :).  In one of the openafs files it has at the top

 /* 1/1/89: NB:  this stuff is all going to be replaced.  Don't take it too seriously */

So maybe I'll excuse it from modern conventions :).  But funny how it's
not even the standard syslog messiness.  That makes more sense, but is
still quite a bit before the problem arose.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

Re: [OpenStack-Infra] pypi volume downtime

2016-12-04 Thread Ian Wienand

On 12/05/2016 03:30 PM, Ian Wienand wrote:

So I think the only side-effect at the moment is that while the
bandersnatch cron update is running, AFS is locked and thus the
mirrors will not get a new volume release until this sync is done;
i.e. our pypi mirrors are a bit behind.


As of right now, I believe the sync is all done and a bandersnatch run
has completed successfully; i.e. I think everything is up-to-date

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] ask.openstack.org full disk

2016-12-04 Thread Ian Wienand

On 11/23/2016 01:51 AM, Jeremy Stanley wrote:

Thanks! I removed a few old manual backups from some of our homedirs
(mostly mine!) freeing up a few more GB on the rootfs. The biggest
offender though seems to be /var/log/jetty which has about a week of
retention. Whatever's rotating these daily at midnight UTC (doesn't
seem to be logrotate doing it) isn't compressing them, so they're up
to nearly 13GB now (which is a lot on a 40GB rootfs).


This happened again today.

I removed all the 2016_11_*.log files from there, which freed up about
16GiB.  Upon more investigation, it seems jetty is making these log
files directly, and "rotating" by just logging to a file with the
current day and removing files after a few days.

To change this to something smarter that can compress too, I think you
need to put log4j classes into the classpath and configure jetty to
use the slf4j facade to redirect logs and ... I dunno, it got too
hard.  fungi also went in and live-configured it to be less verbose.  I
think this also requires more configuration to be made persistent.

I've proposed [1] to just compress the logs and cleanup in cron which
seems like the KISS approach, at least for now.
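
The shape of it is just a couple of find(1) calls from cron; the
retention numbers here are indicative only, [1] is authoritative:

---
# compress anything older than a day, drop compressed logs after ~30 days
find /var/log/jetty -name '*.log' -mtime +1 -exec gzip {} \;
find /var/log/jetty -name '*.log.gz' -mtime +30 -delete
---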

-i

[1] https://review.openstack.org/#/c/406670/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] pypi volume downtime

2016-12-04 Thread Ian Wienand
Hi,

Today I was alerted to jobs failing on IRC, further investigation
showed the pypi volume did not seem to be responding on the mirror
servers.

---
 ianw@mirror:/afs/openstack.org/mirror$ ls pypi
 ls: cannot access pypi: Connection timed out
---

The bandersnatch logs suggested the vos release was not working, and a
manual attempt confirmed this

---
 root@mirror-update:~# k5start -t -f /etc/afsadmin.keytab service/afsadmin -- 
vos release -v mirror.pypi
 Kerberos initialization for service/afsad...@openstack.org

 mirror.pypi 
 RWrite: 536870931 ROnly: 536870932 RClone: 536870932 
 number of sites -> 3
server afs01.dfw.openstack.org partition /vicepa RW Site  -- New release
server afs01.dfw.openstack.org partition /vicepa RO Site  -- New release
server afs02.dfw.openstack.org partition /vicepa RO Site  -- Old release
 Failed to start transaction on RW clone 536870932
 Volume not attached, does not exist, or not on line
 Error in vos release command.
 Volume not attached, does not exist, or not on line
---

I figured afs01 must be having issues.  The problem seems to have
appeared at this point (note the .old logs, because I restarted
things, which seems to be the point it rotates):

--- FileLog.old ---
 Sun Dec  4 23:36:06 2016 Volume 536870932 offline: not in service
 Sun Dec  4 23:41:03 2016 fssync: breaking all call backs for volume 536870932
 Sun Dec  4 23:46:05 2016 fssync: breaking all call backs for volume 536870932
 Sun Dec  4 23:46:05 2016 VRequestSalvage: volume 536870932 online salvaged too 
many times; forced offline.

This then made the volume server unhappy:

--- VolserLog.old ---
 Sun Dec  4 23:45:58 2016 1 Volser: Clone: Recloning volume 536870931 to volume 
536870932
 Sun Dec  4 23:46:11 2016 SYNC_ask: negative response on circuit 'FSSYNC'
 Sun Dec  4 23:46:11 2016 FSYNC_askfs: FSSYNC request denied for reason=0
 Sun Dec  4 23:46:11 2016 VAttachVolume: attach of volume 536870932 apparently 
denied by file server
 Sun Dec  4 23:46:11 2016 attach2: forcing vol 536870932 to error state (state 
0 flags 0x0 ec 103)

As for the root cause, I don't see anything else particularly
insightful in the logs.  The salvage server logs, implicated above,
end in February, which isn't very helpful

--- SalsrvLog.old ---
 12/02/2016 04:19:59 SALVAGING VOLUME 536870931.
 12/02/2016 04:19:59 mirror.pypi (536870931) updated 12/02/2016 04:15
 12/02/2016 04:20:02 totalInodes 1931509
 12/02/2016 04:53:31 Salvaged mirror.pypi (536870931): 1931502 files, 442808916 
blocks

I looked through syslog & other bits and pieces looking for anything
suspicious around the same time, and didn't see anything.

There may have been a less heavy-handed approach, but I tried a
restart of the openafs services on afs01 with the hope it would
re-attach, and it appears to have done so.  At this point, the mirrors
could access the pypi volume again.
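
For the record, the restart amounted to something like the following on
afs01 (service name as I recall it on our file servers):

---
service openafs-fileserver restart
# then check the volume looks attached/online again
vos examine mirror.pypi -localauth
---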

I have started a manual vos release on mirror-update.o.o.  This seems
to have decided to recreate the volume on afs02.dfw.o.o which is still
going as I write this

---
 root@mirror-update:~# k5start -t -f /etc/afsadmin.keytab service/afsadmin -- 
vos release -v mirror.pypi
 Kerberos initialization for service/afsad...@openstack.org
 mirror.pypi 
 RWrite: 536870931 ROnly: 536870932 RClone: 536870932 
 number of sites -> 3
server afs01.dfw.openstack.org partition /vicepa RW Site  -- New release
server afs01.dfw.openstack.org partition /vicepa RO Site  -- New release
server afs02.dfw.openstack.org partition /vicepa RO Site  -- Old release
 This is a completion of a previous release
 Starting transaction on cloned volume 536870932... done
 Deleting extant RO_DONTUSE site on afs02.dfw.openstack.org... done
 Creating new volume 536870932 on replication site afs02.dfw.openstack.org:  
done
 Starting ForwardMulti from 536870932 to 536870932 on afs02.dfw.openstack.org 
(full release).
 [ongoing]
---

That is where we're at right now.  I did not really expect that to
happen and rather stupidly didn't run that "vos release" in a screen
session.  So I think the only side-effect at the moment is that while
the bandersnatch cron update is running, AFS is locked and thus the
mirrors will not get a new volume release until this sync is done;
i.e. our pypi mirrors are a bit behind.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] ask.openstack.org full disk

2016-11-20 Thread Ian Wienand
Hi all,

I was alerted at about 04:00 UTC that ask.openstack.org was giving 500
errors.

When I logged in, the first thing I found was that it was out of disk
[1].  After a little poking, the safest way to clear some space seemed
to be the apt cache, which gave some breathing room.
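
For the record, clearing that is roughly:

---
apt-get clean    # drops the cached .debs under /var/cache/apt/archives
df -h /          # confirm the rootfs has some headroom again
---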

After this, an error similar to the following persisted

---
mod_wsgi (pid=6706): Exception occurred processing WSGI script 
'/srv/askbot-site/config/django.wsgi'.
 Traceback (most recent call last):
   File 
"/usr/askbot-env/lib/python2.7/site-packages/django/core/handlers/wsgi.py", 
line 255, in __call_

 response = self.get_response(request)
   File 
"/usr/askbot-env/lib/python2.7/site-packages/django/core/handlers/base.py", 
line 176, in get_res

 response = self.handle_uncaught_exception(request, resolver, 
sys.exc_info())
   File 
"/usr/askbot-env/lib/python2.7/site-packages/django/core/handlers/base.py", 
line 218, in handle_

 if resolver.urlconf_module is None:
   File 
"/usr/askbot-env/lib/python2.7/site-packages/django/core/urlresolvers.py", line 
361, in urlconf_

 self._urlconf_module = import_module(self.urlconf_name)
   File 
"/usr/askbot-env/lib/python2.7/site-packages/django/utils/importlib.py", line 
35, in import_modu

 __import__(name)
   File "/srv/askbot-site/config/urls.py", line 12, in 
 from askbot.views.error import internal_error as handler500
   File 
"/usr/askbot-env/lib/python2.7/site-packages/askbot-0.7.53-py2.7.egg/askbot/views/__init__.py",
 

 from askbot.views import api_v1
 ImportError: cannot import name api_v1
---

AFAICT everything seemed OK, so my assumption was/is that something
became out of sync in the /usr/askbot-env virtualenv while the host
was out of disk.  I re-installed the askbot package with pip in the
virtualenv from /srv/askbot-site and this seems to have restored
functionality.

I restarted apache & celery

Disk is still tight on this host.  Someone who knows a little more
about the service might like to go clear out anything else that is
unnecessary.

-i

[1] 
http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=1&leaf_id=156

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] [nodepool.o.o] Image builder cleanup

2016-11-07 Thread Ian Wienand
On 11/07/2016 04:08 PM, Ian Wienand wrote:
> I have started some image builds now
> to see what the deal is.  I will keep an eye on them.

So we have fresh images for everything but fedora (time to delete
fedora23, just haven't got around to it, will debug fedora24 unless
anyone else wants to)

+--++-++---+-+
| ID   | Image  | Filename| Version 
   | State | Age |
+--++-++---+-+
| 1415 | centos-7   | /opt/nodepool_dib/centos-7-1478169240   | 
1478169240 | ready | 03:18:22:22 |
| 1439 | centos-7   | /opt/nodepool_dib/centos-7-1478478511   | 
1478478511 | ready | 00:07:40:12 |
| 1413 | debian-jessie  | /opt/nodepool_dib/debian-jessie-1478169240  | 
1478169240 | ready | 03:19:38:30 |
| 1440 | debian-jessie  | /opt/nodepool_dib/debian-jessie-1478488745  | 
1478488745 | ready | 00:04:45:21 |
| 1411 | fedora-23  | /opt/nodepool_dib/fedora-23-1478169240  | 
1478169240 | ready | 03:21:21:22 |
| 1418 | fedora-23  | /opt/nodepool_dib/fedora-23-1478255640  | 
1478255640 | ready | 02:21:18:15 |
| 1354 | fedora-24  | /opt/nodepool_dib/fedora-24-1477305240  | 
1477305240 | ready | 13:20:15:49 |
| 1361 | fedora-24  | /opt/nodepool_dib/fedora-24-1477391640  | 
1477391640 | ready | 12:20:08:51 |
| 1349 | ubuntu-precise | /opt/nodepool_dib/ubuntu-precise-1477218840 | 
1477218840 | ready | 14:17:04:05 |
| 1443 | ubuntu-precise | /opt/nodepool_dib/ubuntu-precise-1478495141 | 
1478495141 | ready | 00:02:48:14 |
| 1416 | ubuntu-trusty  | /opt/nodepool_dib/ubuntu-trusty-1478169240  | 
1478169240 | ready | 03:16:39:32 |
| 1444 | ubuntu-trusty  | /opt/nodepool_dib/ubuntu-trusty-1478500378  | 
1478500378 | ready | 00:01:24:14 |
| 1417 | ubuntu-xenial  | /opt/nodepool_dib/ubuntu-xenial-1478169240  | 
1478169240 | ready | 03:14:41:39 |
| 1445 | ubuntu-xenial  | /opt/nodepool_dib/ubuntu-xenial-1478505418  | 
1478505418 | ready | 00:00:02:46 |
+--++-++---+-+

I kicked off a centos upload ... it's going.

> It's currently very hard to debug the upload process.  I'll soon
> propose some changes to split the upload logs out into provider log
> files, similar to the way we split the build logs out into separate
> files.  I think this will help to diagnose issues on specific
> providers much quicker.

Reviews for this at

 https://review.openstack.org/#/q/status:open+topic:image-log-split

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] [nodepool.o.o] Image builder cleanup

2016-11-06 Thread Ian Wienand
Hi all,

I noticed that nodepool was failing to build, out of space again.  We
haven't had a build in about 3 days.

Unlike last time, there wasn't anything to cleanup in the cache; it
all seemed to be images.

---
ianw@nodepool:/opt$ sudo du -sh ./*/
16G ./dib_cache/
12G ./dib_tmp/
704K./gear/
16K ./lost+found/
7.2M./nodepool/
914G./nodepool_dib/
66M ./system-config/
5.6G./test_images/
---

The image list at the time I started looked like

nodepool@nodepool:~$ nodepool dib-image-list
2016-11-06 23:41:11,267 INFO gear.Connection.nodepool: Disconnected from 
zuul.openstack.org port 4730
2016-11-06 23:41:11,311 INFO gear.Connection.nodepool: Connected to 
zuul.openstack.org port 4730
+--++-++---+-+
| ID   | Image  | Filename| Version 
   | State | Age |
+--++-++---+-+
| 1357 | centos-7   | /opt/nodepool_dib/centos-7-1477305240   | 
1477305240 | ready | 13:08:03:05 |
| 1415 | centos-7   | /opt/nodepool_dib/centos-7-1478169240   | 
1478169240 | ready | 03:08:42:23 |
| 1355 | debian-jessie  | /opt/nodepool_dib/debian-jessie-1477305240  | 
1477305240 | ready | 13:09:21:24 |
| 1413 | debian-jessie  | /opt/nodepool_dib/debian-jessie-1478169240  | 
1478169240 | ready | 03:09:58:31 |
| 1411 | fedora-23  | /opt/nodepool_dib/fedora-23-1478169240  | 
1478169240 | ready | 03:11:41:23 |
| 1418 | fedora-23  | /opt/nodepool_dib/fedora-23-1478255640  | 
1478255640 | ready | 02:11:38:16 |
| 1354 | fedora-24  | /opt/nodepool_dib/fedora-24-1477305240  | 
1477305240 | ready | 13:10:35:50 |
| 1361 | fedora-24  | /opt/nodepool_dib/fedora-24-1477391640  | 
1477391640 | ready | 12:10:28:52 |
| 1342 | ubuntu-precise | /opt/nodepool_dib/ubuntu-precise-1477132440 | 
1477132440 | ready | 15:07:10:03 |
| 1349 | ubuntu-precise | /opt/nodepool_dib/ubuntu-precise-1477218840 | 
1477218840 | ready | 14:07:24:06 |
| 1344 | ubuntu-trusty  | /opt/nodepool_dib/ubuntu-trusty-1477132440  | 
1477132440 | ready | 15:04:45:19 |
| 1416 | ubuntu-trusty  | /opt/nodepool_dib/ubuntu-trusty-1478169240  | 
1478169240 | ready | 03:06:59:33 |
| 1345 | ubuntu-xenial  | /opt/nodepool_dib/ubuntu-xenial-1477132440  | 
1477132440 | ready | 15:03:23:41 |
| 1417 | ubuntu-xenial  | /opt/nodepool_dib/ubuntu-xenial-1478169240  | 
1478169240 | ready | 03:05:01:40 |
+--++-++---+-+

Well, there were a lot of left-over builds in /opt/nodepool_dib; I've
dumped the list into /opt/nodepool_dib/ianw-cleanup-2016-11.07.txt

I removed all the old builds listed in that file (i.e. all builds not
listed above).  This got us to a usable amount of free space

 /dev/mapper/main-nodepoolbuild 1008G  579G  430G  58% /opt

I then noticed that nodepool was stuck building *a lot* of old images

 nodepool@nodepool:/opt/nodepool_dib$ nodepool image-list | grep building | wc 
-l
 826

I went through and did an image-delete on each of these building
instances to clear things out.  I have started some image builds now
to see what the deal is.  I will keep an eye on them.
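
The cleanup itself was roughly a loop like this (column position and
exact arguments from memory, so double-check before reuse):

---
nodepool image-list | grep building | awk -F'|' '{print $2}' \
    | xargs -r -n1 nodepool image-delete
---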

It's currently very hard to debug the upload process.  I'll soon
propose some changes to split the upload logs out into provider log
files, similar to the way we split the build logs out into separate
files.  I think this will help to diagnose issues on specific providers
much quicker.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Scheduling a Zuul meeting

2016-11-02 Thread Ian Wienand

On 11/03/2016 04:30 AM, James E. Blair wrote:

Please let me know if the proposed time (Monday, 20:00 UTC) works for
you, or if an alternate time would be better.


This should be fine for us antipodeans :) 19:00 is also OK, but starts
getting pretty early in (our) winter

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] pypi mirrors out of sync

2016-09-22 Thread Ian Wienand
On 09/22/2016 12:28 PM, Tony Breeds wrote:
> Checking pypi[2] shows:
>  ...
> openstacksdk-0.9.7.tar.gz
> openstacksdk-0.8.6-py2.py3-none-any.whl
> openstacksdk-0.9.7-py2.py3-none-any.whl
> openstacksdk-0.7.3.tar.gz
> ...
> But the mirror for that job[3] shows:
> ...
> openstacksdk-0.9.5.tar.gz
> openstacksdk-0.9.6.tar.gz
> ...
>
> [2] https://pypi.python.org/simple/openstacksdk/
> [3] http://mirror.mtl01.internap.openstack.org/pypi/simple/openstacksdk/


So fungi started the process for a manual sync [1]; I had a look at
the screen session he started about 30 minutes later.  I was too hasty
and didn't save the screen session from fungi's run, but the sync that
started seemed to be incomplete.  I did capture

  2016-09-22 04:44:41,143 INFO: Resuming interrupted sync from local todo list.

from it, which might be related.

Anyway, I re-ran the mirror and it seemed to actually sync.  There are
logs in mirror-update:root/screen0.log from when I turned on logging in
the screen session.

Once this finished I did a vos release on the pypi volume and it all
seemed to go fine.  The above 0.9.7 package appears in the mirror.

So, very sorry for killing the output of the first run that might have
been a good debugging point.  I'm guessing it cleared out something or
other which let the second manual run work.  I have exited the screen
session to release the mirror flock; but clearly we'll want to check
on this for the next little bit.

-i

[1] 
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-09-22.log.html#t2016-09-22T04:00:27

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Migrated review-dev.openstack.org

2016-08-25 Thread Ian Wienand

On 08/26/2016 02:57 PM, Ian Wienand wrote:
> Hi all,

SSL needs to be updated.  I will speak with experts in this field
(i.e. fungi).


It's a self-signed certificate.  I must have just forgot that I
accepted the old one.  So that's good (thanks fungi)


There's no ipv6 hosts in there, no entries for review-dev and lots of
others.  If this isn't user error, I can look into it further.


Turns out it's a known problem [1]

I have applied that patch manually to ~root/rackdns-venv and it works.
The upstream project looks pretty dead ... if someone knows what we
should be doing in 2016, point me to it and I can update the instructions.

Thanks,

-i

[1] https://github.com/kwminnick/rackspace-dns-cli/pull/1/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] Migrated review-dev.openstack.org

2016-08-25 Thread Ian Wienand
Hi all,

Pursuant to our discussion at [1] I have migrated this host.

I created a new 100GiB cinder volume and copied the old ~gerrit2 to
this.  This is now mounted at ~gerrit2 on the new 30GiB host in a
manner similar to review.openstack.org

SSL needs to be updated.  I will speak with experts in this field
(i.e. fungi).

I hit a number of issues.  For whatever reason, new hosts do not come
up with working ipv6 addresses, necessitating some work-arounds.  See
the series at [2].

Another thing was that the rackdns command from [3] does not seem to
show all our hosts?  I tried

 ianw@puppetmaster:~$ . ~root/ci-launch/openstack-rs-nova.sh
 ianw@puppetmaster:~$ . ~root/rackdns-venv/bin/activate
 ianw@puppetmaster:~$ rackdns record-list openstack.org

There's no ipv6 hosts in there, no entries for review-dev and lots of
others.  If this isn't user error, I can look into it further.  So I
ended up modifying the DNS via the web-interface where all the hosts
were listed correctly.

It took me a little too long to realise that the new host was trying
to share the remote db with the old host, causing all sorts of havoc.
Something to think about when writing puppet, anyway.

I'll remove the old host once we're satisfied the new one is working.

Thanks,

-i


[1] 
http://eavesdrop.openstack.org/meetings/infra/2016/infra.2016-08-16-19.02.html
[2] https://review.openstack.org/#/q/status:open+topic:launch-node
[3] 
https://git.openstack.org/cgit/openstack-infra/system-config/tree/launch/dns.py

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] A tool for slurping gerrit changes in to bug updates

2016-05-25 Thread Ian Wienand

On 05/26/2016 04:43 AM, Sean Dague wrote:

One thing I've been thinking a bit about is whether the event stream
could get into something like MQTT easily.


Although larger in scope than just gerrit, Fedora has something very
similar to this with fedmsg [1]

It is a pretty cool idea to have everything that's happening exposed
in a common place with a documented format.

-i

[1] http://www.fedmsg.com/en/latest/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] [infra] Jobs failing : "No matching distribution found for "

2016-05-10 Thread Ian Wienand
So it seems the just released pip 8.1.2 has brought in a new version
of setuptools with it, which creates canonical names per [1] by
replacing "." with "-".

The upshot is that pip is now looking for the wrong name on our local
mirrors.  e.g.

---
 $ pip --version
pip 8.1.2 from /tmp/foo/lib/python2.7/site-packages (python 2.7)
$ pip --verbose  install --trusted-host mirror.ord.rax.openstack.org -i 
http://mirror.ord.rax.openstack.org/pypi/simple 'oslo.config>=3.9.0'
Collecting oslo.config>=3.9.0
  1 location(s) to search for versions of oslo.config:
  * http://mirror.ord.rax.openstack.org/pypi/simple/oslo-config/
  Getting page http://mirror.ord.rax.openstack.org/pypi/simple/oslo-config/
  Starting new HTTP connection (1): mirror.ord.rax.openstack.org
  "GET /pypi/simple/oslo-config/ HTTP/1.1" 404 222
  Could not fetch URL 
http://mirror.ord.rax.openstack.org/pypi/simple/oslo-config/: 404 Client Error: 
Not Found for url: http://mirror.ord.rax.openstack.org/pypi/simple/oslo-config/ 
- skipping
  Could not find a version that satisfies the requirement oslo.config>=3.9.0 
(from versions: )
---

(note oslo-config, not oslo.config).  Compare to

---
$ pip --verbose install --trusted-host mirror.ord.rax.openstack.org -i 
http://mirror.ord.rax.openstack.org/pypi/simple 'oslo.config>=3.9.0'
You are using pip version 6.0.8, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting oslo.config>=3.9.0
  Getting page http://mirror.ord.rax.openstack.org/pypi/simple/oslo.config/
  Starting new HTTP connection (1): mirror.ord.rax.openstack.org
  "GET /pypi/simple/oslo.config/ HTTP/1.1" 200 2491
---

I think infra jobs that run on bare-precise are hitting this
currently, because that image was just built.  Other jobs *might* be
isolated from this for a bit, until the new pip gets out there on
images, but "winter is coming", as they say...

There is [2] available to make bandersnatch use the new names.
However, I wonder if this might have the effect of breaking the
mirrors for old versions of pip that ask for the "."?

pypi proper does not seem affected, just our mirrors.
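
A quick way to see which form a given mirror answers to:

---
# new pip asks for the normalized name; old pip asks for the literal one
curl -sI http://mirror.ord.rax.openstack.org/pypi/simple/oslo-config/ | head -1
curl -sI http://mirror.ord.rax.openstack.org/pypi/simple/oslo.config/ | head -1
---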

I think probably working with bandersnatch to get a fixed version ASAP
is probably the best way forward, rather than us trying to pin to old
pip versions.

-i

[1] https://www.python.org/dev/peps/pep-0503/
[2] 
https://bitbucket.org/pypa/bandersnatch/pull-requests/20/fully-implement-pep-503-normalization/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Test images and libvirt watchdog

2016-04-06 Thread Ian Wienand
On 04/07/2016 04:17 AM, Horváth Ferenc wrote:
> What should be the next step when I have the minified image?

I would imagine you would add it to devstack around [1] as a
dependency for tempest.  Having it in this list will get it uploaded
to glance for test runs, and image-builds will automatically cache it.
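
From memory the hook is the IMAGE_URLS list in stackrc, so the change
would look something like the below (the URL is obviously just a
placeholder):

---
IMAGE_URLS+=",https://example.org/minified-watchdog-image.qcow2"
---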

-i

[1] https://git.openstack.org/cgit/openstack-dev/devstack/tree/stackrc#n653

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] On image building and CI environments

2016-04-06 Thread Ian Wienand
Hi all,

I spent a bit of time putting together a high-level view of the many
changes we've worked on to get our image building & platform support
to where it is today

 https://www.technovelty.org/openstack/image-building-in-openstack-ci.html

It's a bit long, but I hope it can help introduce interested people to
the current environment and ongoing work in this area.

Cheers,

-i

(p.s. thanks to the couple of #infra pre-readers for their feedback,
of course more welcome :)

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] Path forward for Centos & Fedora testing

2016-01-26 Thread Ian Wienand

Hi,

It has been a pretty crazy month with a lot of action on many fronts,
so I thought I'd call out Centos and Fedora testing for those
interested.

There are really 3 Centos environments in flight

 - snapshot images
 - DIB images built ontop of the upstream cloud-image release
 - DIB based "minimal" builds

Snapshot images are in active use on RAX and were used on HP Cloud.
These use the underlying image provided by the upstream provider and
run our setup scripts ontop of that.

The diskimage-builder (DIB) images built ontop of the upstream
released cloud-images have been building and working, but never really
been used in production.  The main reason is that they are based on
cloud-init, and thus incompatible with networking setup on RAX, so it
was not possible to deploy universally.  As described below, rather
than putting in more work here, we've been working on fixing the
"minimal" builds -- these builds create the vm image from essentially
a blank chroot, and use a different suite of tools such as glean and
growroot rather than cloud-init for maximum compatibility with our
provider platforms.

Because the snapshot images are based on the upstream providers image,
any customisation done there usually breaks our scripts.  Thanks to
jeblair and mordred, we discovered that the OVH Centos7 image indeed
does not work.  However, as an intermediate step, we can use the DIB
cloud-image based builds there.

---

Fedora has been one heck of a mess.  Several huge things happened in
Fedora 22 -- the default package installer switched to dnf, the
default puppet changed to version 4 and we made the move to "minimal"
builds.  Any one of these things is a big change -- all together has
meant a lot of work untangling things into a working state.  In the
last little while we can add to that major issues with pip and further
changes to nodepool's building mechanisms.

With a lot of effort from all concerned, we have working Fedora 23 nodes
based on fedora-minimal images in RAX (i.e. zuul knows about them and
tries to run jobs on them).  I believe at this point, devstack jobs
should be OK once we fix up issues with increasing disk-space [1,2].
Once we are stable on RAX, we can roll this out to other providers.

---

The Fedora work has had the nice effect of shaking out most of the
issues with centos-minimal builds too.  We have just one small issue
blocking builds [3].

Thus our path forward for the next little while will be standardising
all testing on our centos-minimal & fedora-minimal based DIB images
across all providers.  I hope to switch RAX to centos-minimal builds
soon, and deploy both fedora-minimal and centos-minimal more widely.

Thanks,

-i

[1] https://review.openstack.org/271862
[2] https://review.openstack.org/271907
[3] https://review.openstack.org/272857

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] Use of conditional includes, etc. in jobs

2015-09-06 Thread Ian Wienand
Hi,

With more and more plugins, etc, within our various projects, I've
seen some jobs coming in with things like

  if [ -f /path/to/hook.sh ]; then
 . /path/to/hook.sh
  fi

and some similar "conditional execution" idioms.

Clearly we don't want to go overboard and deny maintainers flexibility
in providing various parts of their jobs.  However, my concern with
this sort of thing is that if these files go missing, there is high
potential for silent failure.  There's nothing worse than thinking
your jobs are doing something then finding out (probably a long time
later) they are not due to a silent, unreported failure.

My preference is to see the jobs being strict around things like
sourcing files or calling functions; thus issues like files not being
there or paths changing will then result in a loud failure.
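
i.e. rather than the silent conditional above, something like

  # fail loudly if the hook has gone missing, rather than silently
  # skipping it
  if [ ! -f /path/to/hook.sh ]; then
      echo "hook.sh missing!" >&2
      exit 1
  fi
  . /path/to/hook.sh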

This is really a minor thing; certainly devstack-gate isn't free of it
and you can argue around how jobs would fail.  My thought is just to
include this sort of "failure hardening" as part of the general
reviewing Zeitgeist.

Thanks,

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] [jjb] What's the deal with {{?

2015-08-13 Thread Ian Wienand

On 08/13/2015 08:19 PM, Darragh Bailey wrote:

macros do not get substitution performed unless you provide a
variable to be substituted in.


Thanks; that makes some sense when you grok what's going on,
especially as to why job-templates require it but other macros don't.

I have proposed [1] to better document the situation as it stands.


I wonder if jinja templating would avoid some of the quirks we run
into around using python's string formatting for substitution?



So I feel like jjb could probably do better even given the status-quo
-- if it bailed on missing parameters to shell-builders (or, always
expanded -- in essence the same thing), then we would *always* just
put "${{FOO}}" when we want "${FOO}" in the output.

As it stands, sometimes we take the "short-cut" of letting no
parameters represent "pass-through" -- but that leads to the rather
confusing inconsistency we have now.

I proposed [2]; jjb is more complex than I expected (duh!) so
interested if it can be made to work.

-i

[1] https://review.openstack.org/212952
[2] https://review.openstack.org/212980

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] [jjb] What's the deal with {{?

2015-08-12 Thread Ian Wienand

Hi,

Just trying to get my head around this from [1]:

---

- builder:
name: test_builder
builders:
  - shell: |
   echo ${FOO_1}
   echo ${{FOO_2}}

- job-template:
name: '{foo}-test'
builders:
  - test_builder
  - shell: |
   echo ${{FOO_3}}

- project:
name: 'foo'
jobs:
  - '{foo}-test':
 foo: bar

---

that's going to output a job basically

---
echo ${FOO_1}
echo ${{FOO_2}}
echo ${FOO_3}
---

Why do I *not* get a "FOO_1 parameter missing" for test_builder?  If I
do

---

  - test_builder:
  FOO_1: bar

---

it does actually come out with "echo $bar" as you might expect.

Or the same question in reverse: why *do* I get an error about a
missing parameter if I have just "${FOO_3}" in the job-template?

I can't find a clear explanation for this, although there might be
one I'm missing.  If I can find one, I'll add it to some sort of
documentation.

-i

[1] https://review.openstack.org/#/c/212246

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] [openstack-dev] [all][infra] CI System is broken

2015-07-29 Thread Ian Wienand

On 07/29/2015 07:33 PM, Andreas Jaeger wrote:

Currently Zuul is stuck and not processing any events at all, thus no
jobs are checked or gated.


I think whatever happened has happened again; if jhesketh is out it
might be a few hours from this email before people with the right
access are back online to fix it.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


[OpenStack-Infra] [nodepool] devstack-node build checker Wed Jun 17 20:55:25 EDT 2015 : FAIL

2015-06-17 Thread Ian Wienand
nodechecker run at Wed Jun 17 20:55:19 EDT 2015
--

PASS: http://nodepool.openstack.org/rax-dfw.devstack-centos7.log

FAIL: http://nodepool.openstack.org/rax-iad.devstack-centos7.log

2015-06-17 14:34:26,310 INFO nodepool.image.build.rax-iad.devstack-centos7:     main()
2015-06-17 14:34:26,318 INFO nodepool.image.build.rax-iad.devstack-centos7:   File "/opt/nodepool-scripts/cache_git_repos.py", line 84, in main
2015-06-17 14:34:26,319 INFO nodepool.image.build.rax-iad.devstack-centos7:     raise Exception('Failed to clone %s' % m.group(1))
2015-06-17 14:34:26,321 INFO nodepool.image.build.rax-iad.devstack-centos7: Exception: Failed to clone openstack/nova


FAIL: http://nodepool.openstack.org/rax-ord.devstack-centos7.log

2015-06-17 14:43:08,054 INFO nodepool.image.build.rax-ord.devstack-centos7:     main()
2015-06-17 14:43:08,054 INFO nodepool.image.build.rax-ord.devstack-centos7:   File "/opt/nodepool-scripts/cache_git_repos.py", line 84, in main
2015-06-17 14:43:08,055 INFO nodepool.image.build.rax-ord.devstack-centos7:     raise Exception('Failed to clone %s' % m.group(1))
2015-06-17 14:43:08,056 INFO nodepool.image.build.rax-ord.devstack-centos7: Exception: Failed to clone openstack/python-ironic-inspector-client


FAIL: http://nodepool.openstack.org/hpcloud-b1.devstack-centos7.log

2015-06-17 14:43:25,362 INFO nodepool.image.build.hpcloud-b1.devstack-centos7:     main()
2015-06-17 14:43:25,363 INFO nodepool.image.build.hpcloud-b1.devstack-centos7:   File "/opt/nodepool-scripts/cache_git_repos.py", line 84, in main
2015-06-17 14:43:25,364 INFO nodepool.image.build.hpcloud-b1.devstack-centos7:     raise Exception('Failed to clone %s' % m.group(1))
2015-06-17 14:43:25,364 INFO nodepool.image.build.hpcloud-b1.devstack-centos7: Exception: Failed to clone openstack/nova


FAIL: http://nodepool.openstack.org/hpcloud-b2.devstack-centos7.log

2015-06-17 14:43:38,949 INFO nodepool.image.build.hpcloud-b2.devstack-centos7:     main()
2015-06-17 14:43:38,951 INFO nodepool.image.build.hpcloud-b2.devstack-centos7:   File "/opt/nodepool-scripts/cache_git_repos.py", line 84, in main
2015-06-17 14:43:38,952 INFO nodepool.image.build.hpcloud-b2.devstack-centos7:     raise Exception('Failed to clone %s' % m.group(1))
2015-06-17 14:43:38,952 INFO nodepool.image.build.hpcloud-b2.devstack-centos7: Exception: Failed to clone openstack/horizon


FAIL: http://nodepool.openstack.org/hpcloud-b3.devstack-centos7.log

2015-06-16 14:41:55,565 INFO nodepool.image.build.hpcloud-b3.devstack-centos7:     main()
2015-06-16 14:41:55,565 INFO nodepool.image.build.hpcloud-b3.devstack-centos7:   File "/opt/nodepool-scripts/cache_git_repos.py", line 84, in main
2015-06-16 14:41:55,565 INFO nodepool.image.build.hpcloud-b3.devstack-centos7:     raise Exception('Failed to clone %s' % m.group(1))
2015-06-16 14:41:55,566 INFO nodepool.image.build.hpcloud-b3.devstack-centos7: Exception: Failed to clone stackforge/cathead


FAIL: http://nodepool.openstack.org/hpcloud-b4.devstack-centos7.log

2015-06-17 14:43:46,951 INFO nodepool.image.build.hpcloud-b4.devstack-centos7:     main()
2015-06-17 14:43:46,951 INFO nodepool.image.build.hpcloud-b4.devstack-centos7:   File "/opt/nodepool-scripts/cache_git_repos.py", line 84, in main
2015-06-17 14:43:46,951 INFO nodepool.image.build.hpcloud-b4.devstack-centos7:     raise Exception('Failed to clone %s' % m.group(1))
2015-06-17 14:43:46,951 INFO nodepool.image.build.hpcloud-b4.devstack-centos7: Exception: Failed to clone openstack/horizon


FAIL: http://nodepool.openstack.org/hpcloud-b5.devstack-centos7.log

2015-06-17 14:55:26,637 INFO nodepool.image.build.hpcloud-b5.devstack-centos7:     main()
2015-06-17 14:55:26,638 INFO nodepool.image.build.hpcloud-b5.devstack-centos7:   File "/opt/nodepool-scripts/cache_git_repos.py", line 84, in main
2015-06-17 14:55:26,642 INFO nodepool.image.build.hpcloud-b5.devstack-centos7:     raise Exception('Failed to clone %s' % m.group(1))
2015-06-17 14:55:26,642 INFO nodepool.image.build.hpcloud-b5.devstack-centos7: Exception: Failed to clone openstack/python-ironic-inspector-client


PASS: http://nodepool.openstack.org/rax-dfw.devstack-f21.log

FAIL: http://nodepool.openstack.org/rax-iad.devstack-f21.log

2015-06-17 14:41:25,828 INFO nodepool.image.build.rax-iad.devstack-f21:     main()
2015-06-17 14:41:25,828 INFO nodepool.image.build.rax-iad.devstack-f21:   File "/opt/nodepool-scripts/cache_git_repos.py", line 84, in main
2015-06-17 14:41:25,828 INFO nodepool.image.build.rax-iad.devstack-f21:     raise Exception('Failed to clone %s' % m.group(1))
2015-06-17 14:41:25,828 INFO nodepool.image.build.rax-iad.devstack-f21: Exception: Failed to clone openstack/horizon


FAIL: http://nodepool.openstack.org/rax-ord.devstack-f21.log

2015-06-17 14:35:35,954 INFO nodepool.image.build.rax-ord.devstack-f21:     main()
2015-06-17 14:35:35,954 IN

[OpenStack-Infra] On failing image builds

2015-06-17 Thread Ian Wienand
Hi,

I spent some time last week figuring out issues with CentOS kernel
failures, which turned out to have already been fixed by a recent
update; that update had not been applied to some nodes because their
image builds were failing.

This prompted me to look a bit more closely at builds with [1].  The
results are not great.  We are having a lot of failures even in just
the centos/fedora builds I've been looking at [2]; on some days most
images fail to build.

Now I know there are things in motion here.  jhesketh is looking at the
git timeout issues, which are the major cause of problems (note
especially that the Saturday and Sunday builds go much better than at
other times, presumably because things are under less load).

I know there is a spec out for better testing of images before
deployment, which is slightly related.  I know there's also a change
out there for a full REST API in nodepool.

Anyway, to avoid more problems like this, I think what I should do now
is expand this script to monitor all the images, not just
centos/fedora, and echo the output to the infra list.  Having sentinels
in the log files [4] would make this more reliable.
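
For example, once a sentinel is there the checker loop could be
roughly something like this (untested sketch; "SENTINEL: BUILD OK"
stands in for whatever marker [4] ends up writing, and the image names
would really come from the nodepool config rather than being
hard-coded):

  for image in rax-dfw.devstack-centos7 rax-iad.devstack-centos7; do
      if curl -sf "http://nodepool.openstack.org/${image}.log" | \
              grep -q "SENTINEL: BUILD OK"; then
          echo "PASS: ${image}"
      else
          echo "FAIL: ${image}"
      fi
  done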

That way we can quickly identify issues with builds without a manual
process of digging through log files, hopefully notice patterns of
failure, and distribute some of the load of checking on things.

-i

[1] https://github.com/ianw/nodechecker
[2] http://people.redhat.com/~iwienand/nodechecker-output/
[3] https://review.openstack.org/139598
[4] https://review.openstack.org/190889

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Nodepool for ubuntu vivid.

2015-06-04 Thread Ian Wienand
On 06/05/2015 12:02 PM, Tony Breeds wrote:
>> Don't plan on doing much else
> 
> Ever? or just for sometime while I get it going?

The thing is that it's only ever one commit away from not-going.  Given
the rate of change not just of OpenStack, but of everything it sits
on top of, you should expect to spend considerable time just
maintaining it.

-i

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra


Re: [OpenStack-Infra] Nodepool for ubuntu vivid.

2015-06-04 Thread Ian Wienand

On 06/05/2015 10:46 AM, Tony Breeds wrote:

Hi All,
I'd like to test the current dev release of ubuntu (15.04) in the gate.
If I read nodepool.yaml.erb correctly then there currently isn't a nodepool
image defined for this.


There is not.  The first thing is probably to make sure
diskimage-builder can create vivid images (maybe it already can, I
don't know).
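
(Untested, and the element names / variables are from memory so check
the diskimage-builder docs, but something roughly like

  DIB_RELEASE=vivid disk-image-create -o devstack-vivid ubuntu vm

should tell you fairly quickly whether the ubuntu element copes with
vivid yet.)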


At that point, you should boot that image and try out the devstack-gate
scripts by following [1], to ensure we can do all the stuff puppet
normally does.


After that, you can propose adding the images; see the f22 example [2].

After that, nodepool will build them; you can see the logs at [3].  It
never quite works the same there, so expect it to fail at first :)

After that, in theory, you should have nodes ready to go.  We can then
bring up a job that runs on those nodes ... there are a few tricks, but
we can cross that bridge when we get there.  Again, the Fedora job is a
good template.


2) Have someone on the infra team help/mentor me through the process
   for 15.04 so that I can do 15.10 by myself.


I've been through all this multiple times with Red Hat platforms.
Happy to help.


Also counsel me on what I'm signing up for ;P


Don't plan on doing much else

-i

[1] 
http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/README.rst#n165

[2] https://review.openstack.org/186619
[3] http://nodepool.openstack.org/

___
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra

