Re: [openstack-dev] [tripleo] CI job failures

2016-03-08 Thread Richard Su



On 03/08/2016 09:58 AM, Derek Higgins wrote:

On 7 March 2016 at 18:22, Ben Nemec  wrote:

On 03/07/2016 11:33 AM, Derek Higgins wrote:

On 7 March 2016 at 15:24, Derek Higgins  wrote:

On 6 March 2016 at 16:58, James Slagle  wrote:

On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi  wrote:

I'm kind of hijacking Dan's e-mail, but I would like to propose some
technical improvements to stop having so many CI failures.


1/ Stop creating swap files. We don't have SSDs; IMHO it is a terrible
mistake to swap to files to make up for not having enough RAM. In my
experience, swapping on non-SSD disks is even worse than not having
enough RAM. We should stop doing that, I think.

We have been relying on swap in tripleo-ci for a little while. While
not ideal, it has been an effective way to at least be able to test
what we've been testing given the amount of physical RAM that is
available.

Ok, so I have a few points here; in places where I'm making
assumptions I'll try to point that out.

o Yes I agree using swap should be avoided if at all possible

o We are currently looking into adding more RAM to our testenv hosts,
at which point we can afford to be a little more liberal with memory
and this problem should become less of an issue. Having said that:

o Even though using swap is bad, if we have some processes with a
large memory footprint that don't require constant access to a portion
of that footprint, swapping it out for the duration of the CI test isn't
as expensive as it sounds (assuming it doesn't need to be swapped
back in and the kernel has selected good candidates to swap out).

o The hosts that run the undercloud and overcloud testenvs have 64G
of RAM each; they each host 4 testenvs, and each testenv running an
HA job can use up to 21G of RAM, so we have overcommitted there
(4 x 21G = 84G against 64G of physical RAM). This is only a problem
if a testenv host gets 4 HA jobs that are started around the same time
(and as a result each has 4 overcloud nodes running at the same time);
to allow this to happen without VMs being killed by the OOM killer
we've also enabled swap there. The majority of the time this swap isn't
in use; it only matters if all 4 testenvs are in use simultaneously and
all running the second half of a CI test at the same time.

o The overcloud nodes are VMs running with an "unsafe" disk caching
mechanism; this causes sync requests from the guest to be ignored, and
as a result, if the instances hosted on these nodes go into swap, that
swap will be cached on the host as long as RAM is available, i.e. swap
being used in the undercloud or overcloud isn't synced to the disk on
the host unless it has to be.

o What I'd like us to avoid is simply bumping up the memory every time
we hit an OOM error without at least
   1. Explaining why we need more memory all of a sudden
   2. Looking into a way we may be able to avoid simply bumping the RAM
(at peak times we are memory constrained)

As an example, let's take a look at the swap usage on the undercloud of
a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G of
swap enabled via a swapfile; the overcloud deploy started at 22:07:46
and finished at 22:28:06.

In the graph you'll see a spike in memory being swapped out around
22:09; this corresponds almost exactly to when the overcloud image is
being downloaded from swift[3]. Looking at the top output at the end of
the test, you'll see that swift-proxy is using over 500M of memory[4].

I'd much prefer we spend time looking into why the swift proxy is
using this much memory rather than blindly bumping the memory allocated
to the VM; perhaps we have something configured incorrectly or we've
hit a bug in swift.
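
For anyone who wants to poke at this on a live undercloud, here's a
rough sketch (just reading VmRSS/VmSwap out of /proc, not part of our
CI tooling) that lists which processes are holding memory and which
have been swapped out:

#!/usr/bin/env python
# Rough sketch only: list the biggest users of resident memory and swap
# on a node by reading VmRSS/VmSwap from /proc/<pid>/status. Handy for
# confirming suspicions like "swift-proxy is sitting on 500M".
import os


def memory_report():
    rows = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/status' % pid) as f:
                fields = dict(line.split(':', 1) for line in f if ':' in line)
        except IOError:  # the process exited while we were reading
            continue

        def kb(key):
            # fields look like "VmRSS:\t  123456 kB"; kernel threads
            # have no VmRSS/VmSwap, so default them to 0
            return int(fields.get(key, '0 kB').split()[0])

        rows.append((kb('VmRSS'), kb('VmSwap'),
                     fields.get('Name', '?').strip()))
    return sorted(rows, reverse=True)


if __name__ == '__main__':
    print('%10s %10s  %s' % ('RSS(kB)', 'Swap(kB)', 'name'))
    for rss, swap, name in memory_report()[:15]:
        print('%10d %10d  %s' % (rss, swap, name))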

Having said all that, we can bump the memory allocated to each node, but
we have to accept 1 of 2 possible consequences:
1. We'll end up using the swap on the testenv hosts more than we
currently are, or
2. We'll have to reduce the number of testenvs per host from 4 down
to 3, wiping out 25% of our capacity.

Thinking about this a little more, we could do a radical experiment
for a week and just do this, i.e. bump up the RAM on each env and
accept that we lose 25% of our capacity. Maybe it doesn't matter; if our
success rate goes up then we'd be running fewer rechecks anyway.
The downside is that we'd probably hit fewer timing errors (assuming
the tight resources are what's showing them up). I say downside because
this just means downstream users might hit them more often if CI
isn't. Anyway, maybe worth discussing at tomorrow's meeting.

+1 to reducing the number of testenvs and allocating more memory to
each.  The huge number of rechecks we're having to do is definitely
contributing to our CI load in a big way, so if we could cut those down
by 50% I bet it would offset the lost testenvs.  And it would reduce
developer aggravation by about a million percent. :-)
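
A back-of-the-envelope version of that, with made-up numbers purely to
show the break-even point:

# Made-up numbers, just to illustrate the trade-off: 4 testenvs where on
# average every job needs one recheck, versus 3 testenvs where the
# recheck rate is cut in half.
envs_now, runs_per_job_now = 4, 2.0    # every job run ~twice (assumed)
envs_new, runs_per_job_new = 3, 1.5    # rechecks cut by 50% (hoped for)

print(envs_now / runs_per_job_now)     # 2.0 "useful" jobs' worth of capacity
print(envs_new / runs_per_job_new)     # 2.0 -- break even, before counting
                                       #        saved developer time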

Also, on some level I'm not too concerned about the absolute minimum
memory use case.  Nobody deploying OpenStack in the real world is doing
so on 4 GB nodes.  I doubt 99% of them ar

Re: [openstack-dev] [tripleo] Location of TripleO REST API

2015-11-25 Thread Richard Su



On 11/24/2015 07:27 AM, Dougal Matthews wrote:



On 24 November 2015 at 07:45, Richard Su  wrote:




On 11/17/2015 07:31 AM, Tzu-Mainn Chen wrote:






On 10 November 2015 at 15:08, Tzu-Mainn Chen  wrote:

Hi all,

At the last IRC meeting it was agreed that the new TripleO REST API
should forgo the Tuskar name, and simply be called... the TripleO API.
There's one more point of discussion: where should the API live?  There
are two possibilities:

a) Put it in tripleo-common, where the business logic lives.  If we do
this, it would make sense to rename tripleo-common to simply tripleo.


+1 - I think this makes most sense if we are not going to support the
tripleo repo as a library.


Okay, this seems to be the consensus, which is great.

The leftover question is how to package the renamed repo. 'tripleo' is
already intuitively in use by tripleo-incubator.

In IRC, bnemec and trown suggested splitting the renamed repo into two
packages - 'python-tripleo' and 'tripleo-api' - which seems sensible to me.

What do others think?




I have started the process of renaming the repo with these patches:
https://review.openstack.org/#/c/247834/
https://review.gerrithub.io/#/c/252864/

Jan made an interesting suggestion that it may be easier to create a new
repo named tripleo and move the tripleo-common code there. With
renaming, I'm already seeing some complications, with the tripleo-common
package builds failing in CI until the updated spec is merged.

What do folks think about this? I am unsure which is more complicated:
creating a new repo and all the setup that goes with it, or renaming the
existing repo and fixing CI issues along the way.


I'm not sure which is easier or better, but if we do create a new repo 
we need to make sure we carry over the git history.


Good idea. I have submitted a request to create the new repo.

https://review.openstack.org/#/c/249521/



- Richard



b) Put it in its own repo, tripleo-api

The first option made a lot of sense to people on IRC, as the proposed
API is a very thin layer that's bound closely to the code in
tripleo-common.  The major objection is that renaming is not trivial;
however it was mentioned that renaming might not be *too* bad... as long
as it's done sooner rather than later.

What do people think?


Thanks,
Tzu-Mainn Chen




Re: [openstack-dev] [tripleo] Location of TripleO REST API

2015-11-23 Thread Richard Su



On 11/17/2015 07:31 AM, Tzu-Mainn Chen wrote:






On 10 November 2015 at 15:08, Tzu-Mainn Chen  wrote:

Hi all,

At the last IRC meeting it was agreed that the new TripleO REST API
should forgo the Tuskar name, and simply be called... the TripleO API.
There's one more point of discussion: where should the API live?  There
are two possibilities:

a) Put it in tripleo-common, where the business logic lives.  If we do
this, it would make sense to rename tripleo-common to simply tripleo.


+1 - I think this makes most sense if we are not going to support
the tripleo repo as a library.


Okay, this seems to be the consensus, which is great.

The leftover question is how to package the renamed repo. 'tripleo' is
already intuitively in use by tripleo-incubator.

In IRC, bnemec and trown suggested splitting the renamed repo into two
packages - 'python-tripleo' and 'tripleo-api' - which seems sensible to me.

What do others think?




I have started the process of renaming the repo with these patches:
https://review.openstack.org/#/c/247834/
https://review.gerrithub.io/#/c/252864/

Jan made an interesting suggestion that it may be easier to create a new
repo named tripleo and move the tripleo-common code there. With
renaming, I'm already seeing some complications, with the tripleo-common
package builds failing in CI until the updated spec is merged.

What do folks think about this? I am unsure which is more complicated:
creating a new repo and all the setup that goes with it, or renaming the
existing repo and fixing CI issues along the way.


- Richard



b) Put it in its own repo, tripleo-api

The first option made a lot of sense to people on IRC, as the proposed
API is a very thin layer that's bound closely to the code in
tripleo-common.  The major objection is that renaming is not trivial;
however it was mentioned that renaming might not be *too* bad... as long
as it's done sooner rather than later.

What do people think?


Thanks,
Tzu-Mainn Chen




[openstack-dev] [TripleO] Switching SELinux to enforcing mode spec

2014-07-22 Thread Richard Su

Hello,

As discussed earlier this morning, we are working towards switching 
SELinux to enforcing mode in tripleo. The work required is detailed in 
this spec: https://review.openstack.org/#/c/108168/. I welcome 
additional comments and suggestions.


Thank you,

Richard



[openstack-dev] [Ceilometer] [TripleO] adding process/service monitoring

2014-01-27 Thread Richard Su
Hi,

I have been looking into how to add process/service monitoring to
tripleo. Here I want to be able to detect when an OpenStack-dependent
component that is deployed on an instance has failed. When a failure
has occurred, I want to be notified and eventually see it in Tuskar.

Ceilometer doesn't handle this particular use case today. So I have been
doing some research, and there are many options out there that provide
process checks: Nagios, Sensu, Zabbix, and Monit. I am a bit wary of
pulling one of these options into tripleo. There are increased
operational and maintenance costs with each of them. And physical device
monitoring is currently in the works for Ceilometer, lessening the need
for some of the other abilities that another monitoring tool would
provide.

For the particular use case of monitoring processes/services, at a high
level, I am considering writing a simple daemon to perform the check.
Checks and failures are written out as messages to the notification bus.
Interested parties like Tuskar or Ceilometer can subscribe to these
messages.
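
Roughly, I'm picturing something like the sketch below. The service
names and interval are just placeholders, the check itself is the
crudest thing that could work (a substring match against
/proc/*/cmdline), and notify() is stubbed out - a real version would
publish the payload onto the notification bus for Tuskar or Ceilometer
to consume:

#!/usr/bin/env python
# Sketch of the proposed check daemon, not a finished design.
# Assumptions: SERVICES and INTERVAL are hypothetical placeholders, the
# check is a crude substring match against /proc/*/cmdline, and notify()
# just prints JSON where a real daemon would put a message on the bus.
import json
import os
import time

SERVICES = ['nova-compute', 'neutron-l3-agent']   # placeholder watch list
INTERVAL = 30                                     # seconds between checks


def process_running(name):
    """Very rough check: does any process command line mention `name`?"""
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/cmdline' % pid) as f:
                cmdline = f.read().replace('\0', ' ')
        except IOError:  # the process exited while we were scanning
            continue
        if name in cmdline:
            return True
    return False


def notify(event_type, payload):
    # Stub: interested parties (Tuskar, Ceilometer) would subscribe to
    # these messages on the notification bus.
    print(json.dumps({'event_type': event_type, 'payload': payload}))


def main():
    last_state = {}
    while True:
        for svc in SERVICES:
            up = process_running(svc)
            if last_state.get(svc) != up:  # only report state transitions
                notify('servicemon.process.state',
                       {'service': svc, 'running': up})
                last_state[svc] = up
        time.sleep(INTERVAL)


if __name__ == '__main__':
    main()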

In general does this sound like a reasonable approach?

There is also the question of how to configure or figure out which
processes we are interested in monitoring. I need to do more research
here, but I'm considering either looking at the elements listed by
diskimage-builder or looking at the orc post-configure.d scripts to
find services that are restarted.

I welcome your feedback and suggestions.

- Richard Su
