Re: [openstack-dev] [all] QPID incompatible with python 3 and untested in gate -- what to do?

2015-04-14 Thread Clint Byrum
Excerpts from Matt Riedemann's message of 2015-04-14 13:13:47 -0700:
 
 On 4/14/2015 12:22 PM, Clint Byrum wrote:
  Hello! There's been some recent progress on python3 compatibility for
  core libraries that OpenStack depends on[1], and this is likely to open
  the flood gates for even more python3 problems to be found and fixed.
 
  Recently a proposal was made to make oslo.messaging start to run python3
  tests[2], and it was found that qpid-python is not python3 compatible yet.
 
  This presents us with questions: Is anyone using QPID, and if so, should
  we add gate testing for it? If not, can we deprecate the driver? In the
  most recent survey results I could find [3] I don't even see message
  broker mentioned, whereas Databases in use do vary somewhat.
 
  Currently it would appear that only oslo.messaging runs functional tests
  against QPID. I was unable to locate integration testing for it, but I
  may not know all of the places to dig around to find that.
 
  So, please let us know if QPID is important to you. Otherwise it may be
  time to unburden ourselves of its maintenance.
 
  [1] https://pypi.python.org/pypi/eventlet/0.17.3
  [2] https://review.openstack.org/#/c/172135/
  [3] 
  http://superuser.openstack.org/articles/openstack-user-survey-insights-november-2014
 
 
 
 FWIW IBM Cloud Manager with OpenStack is still shipping qpid 0.30.  We 
 switched to the default deployment being RabbitMQ in Kilo though (maybe 
 even Juno).  But we do have a support matrix tested with qpid as the rpc 
 backend.  Our mainline paths are tested with rabbitmq though since 
 that's the default backend for us now.
 

So, I think we can count this as another point toward removing impl_qpid.



Re: [openstack-dev] [all] QPID incompatible with python 3 and untested in gate -- what to do?

2015-04-14 Thread Clint Byrum
Excerpts from Ken Giusti's message of 2015-04-14 12:54:20 -0700:
 Just to be clear: you're asking specifically about the 0-10 based
 impl_qpid.py driver, correct?   This is the driver that is used for
 the qpid:// transport (aka rpc_backend).
 
 I ask because I'm maintaining the AMQP 1.0 driver (transport
 amqp://) that can also be used with qpidd.
 
 However, the AMQP 1.0 driver isn't yet Python 3 compatible due to its
 dependency on Proton, which has yet to be ported to python 3 - though
 that's currently being worked on [1].
 
 I'm planning on porting the AMQP 1.0 driver once the dependent
 libraries are available.
 
 [1]: https://issues.apache.org/jira/browse/PROTON-490
 

Thanks Ken. Yes that's what I mean.

That there is already an alternative with active work toward Python 3
seems like another point toward removal of impl_qpid.



Re: [openstack-dev] [TripleO] Consistent variable documentation for diskimage-builder elements

2015-04-13 Thread Clint Byrum
Excerpts from Dan Prince's message of 2015-04-13 14:07:28 -0700:
 On Tue, 2015-04-07 at 21:06 +, Gregory Haynes wrote:
  Hello,
  
  I'd like to propose a standard for consistently documenting our
  diskimage-builder elements. I have pushed a review which transforms the
  apt-sources element to this format[1][2]. Essentially, I'd like to move
  in the direction of making all our element README.rst's contain a
  subsection called Environment Variables with a Definition List[3] where
  each entry is the environment variable. Under that environment variable
  we will have a field list[4] with Required, Default, Description, and
  optionally Example.
  
  The goal here is that rather than users being presented with a wall of
  text that they need to dig through to remember the name of a variable,
  there is a quick way for them to get the information they need. It also
  should help us to remember to document the vital bits of information for
  each variable we use.
  
  Thoughts?
 
 I like the direction of the cleanup. +2
 
 I do wonder how we'll enforce consistency in making sure future changes
 adhere to the new format. It would be nice to have a CI check on these
 things so people don't constantly need to debate the correct syntax,
 etc.

I agree Dan, which is why I'd like to make sure these are machine
readable and consistent. I think it would actually make sense to make
our argument isolation efforts utilize this format, as that would make
sure that these are consistent with the code as well.
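
For anyone who hasn't clicked through to Greg's review, the shape being
proposed is roughly this (an illustrative sketch of the format, not the
actual apt-sources text):

  Environment Variables
  ---------------------

  DIB_APT_SOURCES
    :Required: No
    :Default: (the distro's stock sources.list)
    :Description: Path to a file on the build host whose contents will
      replace /etc/apt/sources.list inside the image.
    :Example: ``DIB_APT_SOURCES=/etc/apt/custom.sources.list``

Something that regular is trivial for tooling to parse back out of the
README, which is what makes the argument isolation tie-in attractive.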



Re: [openstack-dev] [all] how to send messages (and events) to our users

2015-04-08 Thread Clint Byrum
Excerpts from Angus Salkeld's message of 2015-04-06 19:55:37 -0700:
 Hi all
 
 For quite some time we (Heat team) have wanted to be able to send messages
 to our
 users (by user I do not mean the Operator, but the User that is interacting
 with the client).
 
 What do I mean by user messages, and how do they differ from our current
 log messages
 and notifications?
 - Our current logs are for the operator and have information that the user
 should not have
   (ip addresses, hostnames, configuration options, other tenant info etc..)
 - Our notifications (that Ceilometer uses) *could* be used, but I am not
 sure if it quite fits.
   (they seem a bit heavyweight for a log message and aimed at higher-level
 events)
 
 These messages could be (based on Heat's use case):
 
 - Specific user oriented log messages (distinct from our normal operator
 logs)

These currently go in the Heat events API, yes?

 - Deprecation messages (if they are using old resource properties/template
 features)

I think this could fit with the bits above.

 - Progress and resource state changes (an application doesn't want to poll
 an api for a state change)

These also go in the current Heat events.

 - Automated actions (autoscaling events, time based actions)

As do these?

 - Potentially integrated server logs (from in guest agents)
 
 I wanted to raise this to [all] as it would be great to have a general
 solution that
 all projects can make use of.
 
 What do we have now:
 - The user can not get any kind of log message from services. The closest
 thing
   ATM is the notifications in Ceilometer, but I have the feeling that these
 have a different aim.
 - nova console log
 - Heat has a DB event table for users (we have long wanted to get rid of
 this)

So if we forget the DB part of it, the API is also lacking things like
pagination and search that one would want in an event/logging API.

 
 What do other clouds provide:
 - https://devcenter.heroku.com/articles/logging
 - https://cloud.google.com/logging/docs/
 - https://aws.amazon.com/blogs/aws/cloudwatch-log-service/
 - http://aws.amazon.com/cloudtrail/
 (other examples...)
 
 What are some options we could investigate:
 1. remote syslog
 The user provides an rsyslog server IP/port and we send their messages
 to that.
 [pros] simple, and the user could also send their server's log messages
 to the same
   rsyslog - great visibility into what is going on.
 
   There are great tools like loggly/logstash/papertrailapp that
 source logs from remote syslog
   It leaves the user in control of what tools they get to use.
 
 [cons] Would we become a spam agent (just sending traffic to an
 IP/Port) - I guess that's how remote syslog
works. I am not sure if this is an issue or not?
 
    This might be a lesser solution for the use case of an
  application that doesn't want to poll an API for a state change
 
   I am not sure how we would integrate this with horizon.
 

I think this one puts too much burden on the user to setup a good
receiver.
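
That said, the sending side would be almost trivial for us; a rough
sketch in Python (endpoint and message invented purely for illustration):

  import logging
  import logging.handlers

  # The user hands us this endpoint when they opt in to remote logging.
  user_syslog = ('203.0.113.10', 514)

  logger = logging.getLogger('heat.user_messages')
  logger.setLevel(logging.INFO)
  # SysLogHandler sends standard syslog datagrams over UDP by default.
  logger.addHandler(logging.handlers.SysLogHandler(address=user_syslog))

  logger.info('stack mystack: my_server entered CREATE_COMPLETE')

The hard parts are the multi-tenant bookkeeping and, as you say, not
turning ourselves into a spam cannon.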

 2. Zaqar
 We send the messages to a queue in Zaqar.
 [pros] multi tenant OpenStack project for messaging!
 
 [cons] I don't think Zaqar is installed in most installations (tho'
 please correct me here if this
is wrong). I know Mirantis does not currently support Zaqar,
 so that would be a problem for me.
 
   There is not the level of external tooling like in option 1
 (logstash and friends)


I agree with your con, and would also add that after the long
discussions we had in the past we had some concerns about scaling.

 3. Other options:
Please chip in with suggestions/links!
 

There's this:

https://wiki.openstack.org/wiki/Cue

I think that could be a bit like 1, but provide the user with an easy
target for the messages.

I also want to point out that what I'd actually rather see is that all
of the services provide functionality like this. Users would be served
by having an event stream from Nova telling them when their instances
are active, deleted, stopped, started, error, etc.

Also, I really liked Sandy's suggestion to use the notifications on the
backend, and then funnel them into something that the user can consume.
The project they have, yagi, for putting them into atom feeds is pretty
interesting. If we could give people a simple API that says "subscribe
to Nova/Cinder/Heat/etc. notifications for instance X", and put them
in an atom feed, that seems like something that would make sense as
an under-the-cloud service that would be relatively low cost and would
ultimately reduce load on API servers.
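
As a strawman (resource names completely hypothetical), the API could be
as small as:

  POST /v1/subscriptions
  {"source": "nova", "resource_id": "<instance uuid>",
   "events": ["compute.instance.*"], "format": "atom"}

  GET /v1/subscriptions/<id>/feed  ->  Atom feed of matching notifications

Everything behind that is just a notification consumer filtering on the
backend, which is close to what yagi already does per Sandy's description.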



Re: [openstack-dev] [TripleO] Consistent variable documentation for diskimage-builder elements

2015-04-08 Thread Clint Byrum
Excerpts from Gregory Haynes's message of 2015-04-07 14:06:52 -0700:
 Hello,
 
 I'd like to propose a standard for consistently documenting our
 diskimage-builder elements. I have pushed a review which transforms the
 apt-sources element to this format[1][2]. Essentially, I'd like to move
 in the direction of making all our element README.rst's contain a
 subsection called Environment Variables with a Definition List[3] where
 each entry is the environment variable. Under that environment variable
 we will have a field list[4] with Required, Default, Description, and
 optionally Example.
 
 The goal here is that rather than users being presented with a wall of
 text that they need to dig through to remember the name of a variable,
 there is a quick way for them to get the information they need. It also
 should help us to remember to document the vital bits of information for
 each variable we use.
 
 Thoughts?

I discussed a format for something similar here:

https://review.openstack.org/#/c/162267/

Perhaps we could merge the effort.

The design and implementation in that might take some time, but if we
can document the variables at the same time we prepare the inputs for
isolation, that seems like a winning path forward.



Re: [openstack-dev] [Keystone] SQLite support (migrations, work-arounds, and more), is it worth it?

2015-04-06 Thread Clint Byrum
Excerpts from Boris Bobrov's message of 2015-04-03 18:29:08 -0700:
 On Saturday 04 April 2015 03:55:59 Morgan Fainberg wrote:
  I am looking forward to the Liberty cycle and seeing the special casing we
  do for SQLite in our migrations (and elsewhere). My inclination is that we
  should (similar to the deprecation of eventlet) deprecate support for
  SQLite in Keystone. In Liberty we will have a full functional test suite
  that can (and will) be used to validate everything against much more real
  environments instead of in-process “eventlet-like” test-keystone-services;
  the “Restful test cases” will no longer be part of the standard unit tests
  (as they are functional testing). With this change I'm inclined to call
  SQLite what it is (a non-production DB) and say we should look at
  dropping migration support for SQLite and the custom work-arounds.
  
  Most deployers and developers (as far as I know) use devstack and MySQL or
  Postgres to really suss out DB interactions.
  
  I am looking for feedback from the community on the general stance for
  SQLite, and more specifically the benefit (if any) of supporting it in
  Keystone.
 
 +1. Drop it and clean up tons of code used for support of sqlite only.
 
 Doing tests with mysql is as easy as with sqlite (mysqladmin drop -f;
 mysqladmin create for a reset), and using it by default will finally make
 people test their code on real RDBMSes.
 

Please please please be careful with that and make sure the database
name is _always_ random in tests... or even better, write a fixture to
spin up a mysqld inside a private tempdir. That would be a really cool
thing for oslo.db to provide actually.

I'm just thinking some poor sap runs the test suite with the wrong
.my.cnf in the wrong place and poof there went keystone's db.
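
Something like this is all it would take (rough sketch only; none of this
is existing oslo.db API, and the names are made up):

  import uuid

  import fixtures
  import sqlalchemy

  class TemporaryDatabase(fixtures.Fixture):
      # Creates a uniquely named MySQL database and drops it on cleanup.

      def __init__(self, admin_url='mysql+pymysql://root:@localhost'):
          super(TemporaryDatabase, self).__init__()
          self.admin_url = admin_url

      def _setUp(self):
          self.name = 'test_%s' % uuid.uuid4().hex
          engine = sqlalchemy.create_engine(self.admin_url)
          engine.execute('CREATE DATABASE %s' % self.name)
          self.url = '%s/%s' % (self.admin_url, self.name)
          self.addCleanup(engine.execute, 'DROP DATABASE %s' % self.name)

The next step up from that is a fixture that launches its own mysqld with
--datadir pointed at a tempdir, so the tests never even see the system
daemon.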



Re: [openstack-dev] [Infra] Use of heat for CI of OpenStack

2015-04-03 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2015-04-03 10:08:07 -0700:
 Monty Taylor wrote:
  On 04/03/2015 08:55 AM, Maish Saidel-Keesing wrote:
  I was wondering..
 
  Is the OpenStack CI/CD Infra using Heat in any way? Do the commits
  trigger a new build of DevStack/OpenStack that is based on a Heat
  Template or just the provisioning of a regular instance and then
  deployment of code on top of that?
 
  Nope - we do not use heat - we use a program called nodepool:
 
  http://git.openstack.org/cgit/openstack-infra/nodepool/
 
  Which uses the nova api to provision servers. These servers are
  currently registered as jenkins slaves so that the workload run on them
  is defined as a jenkins job.
 
  There are a few reasons we do not use heat for this - none of them I
  think of as negative against heat:
 
  - Our pool spans 4 regions of 2 public clouds. Heat runs in a cloud, the
  positioning is wrong
  - Our pool is predominantly single-machines that are used once - which
  means a heat template would add extra complexity for not much gain.
  - Our current system predates the existence of heat. It is also highly
  specific to the task at hand - namely, ensuring that there are always
  test nodes available.
 
 Can these things be fixed? Heat afaik isn't a frozen piece of software... 
 It would be pretty neat to use the projects that we have that others are 
 using if we could. Might be an interesting summit topic/idea?
 

Monty said these aren't negatives. They're just aspects. Heat is
supposed to alleviate you from needing to build something specific like
Nodepool. But it wasn't there, and nodepool is specific to the task,
so there's no point in using Heat for it. It's like using a hammer and
nails to build your scaffolding because you don't have a nailgun. Was
it ideal? No, but at this point the scaffolding is built... no point in
tearing it down so you can build it again, only faster.



Re: [openstack-dev] [nova] how to handle vendor-specific API microversions?

2015-03-31 Thread Clint Byrum
Excerpts from Lingxian Kong's message of 2015-03-23 21:11:28 -0700:
 2015-03-21 23:31 GMT+08:00 Monty Taylor mord...@inaugust.com:
 
  I would vote that we not make this pleasant or easy for vendors who are
  wanting to add a feature to the API. As a person who uses several clouds
  daily, I can tell you that a vendor chosing to do that is VERY mean to
  users, and provides absolutely no value to anyone, other than allowing
  someone to make a divergent differentiated fork.
 
  Just don't do it. Seriously. It makes life very difficult for people
  trying to consume these things.
 
  The API is not the place for divergence.
 
 But, what if some vendors have already implemented some on-premise
 features using the Nova extension mechanism, to achieve a strategy of
 product differentiation based on OpenStack? IMHO, DefCore has already
 given some advice about what OpenStack is (you must pass a lot of
 predefined tests). If vendors cannot provide extra, backwards-compatible
 features by themselves, they will lose a little competitiveness in
 their product.
 
 I'm not very sure whether my understanding is right, but I am really
 concerned about what the right direction is for vendors and
 providers.
 

What is being suggested is that those vendors need to write an API
that stands alone, apart from OpenStack's API's, with its own client
libraries and programs. This is to make it clear, those things are not
OpenStack. Extensions sort of hide in the shadows, and it is very hard
for a user to distinguish what they can depend on.

Think about the very nice GNU-specific extensions that glibc provides.
If someone is writing an app that may need to land on many systems, they
must at least know to put those calls behind a layer of indirection that
they can focus on when porting. Same thing here.

Nobody wants to harm the ecosystem or discourage vendors from pushing into
corners where upstream might take too long to catch up. But OpenStack
isn't going to facilitate those things at the expense of the end-user
ecosystem.



Re: [openstack-dev] [all] Do we need release announcements for all the things?

2015-03-23 Thread Clint Byrum
Excerpts from Russell Bryant's message of 2015-03-23 07:51:43 -0700:
 On 03/23/2015 10:47 AM, Thierry Carrez wrote:
  Kuvaja, Erno wrote:
  [...]
  This is one of the benefits/caveats of having a single dev mailing list. 
  There is lots of noise for everyone, but this particular noise is one of 
  those I think we should not get rid of.
  
  One area where we could work to remove noise would be to move new core
  reviewers nomination/suggestion threads out of the ML. They are mostly
  useless IMHO (only +1s), and PTLs are empowered to make the call anyway.
  
  That's one area where the PTL could move to ask for forgiveness model.
  If we really want a feedback mechanism, we could look for a way to move
  that to Gerrit or some other lightweight voting tool.
 
 In my experience, there's usually some behind the scenes discussion in
 advance, anyway.  Nobody really wants to propose someone publicly that
 might get a -1.
 

[with some hesitation for irony..] +1



Re: [openstack-dev] [Nova][Cinder] Questions re progress

2015-03-18 Thread Clint Byrum
Excerpts from Adam Lawson's message of 2015-03-18 11:25:37 -0700:
 The aim is cloud storage that isn't affected by a host failure and major
 players who deploy hyper-scaling clouds architect them to prevent that from
 happening. To me that's cloud 101. A physical machine goes down, data
 disappears, VMs using it fail, and folks scratch their heads and ask "this
 was in the cloud, right?" That's the indication of a service failure, not a
 feature.


Ceph provides this for cinder installations that use it.

 I'm just a very big proponent of cloud arch that provides a seamless
 abstraction between the service and the hardware. Ceph and DRBD are decent
 enough. But tying data access to a single host by design is a mistake IMHO
 so I'm asking why we do things the way we do and whether that's the way
 it's always going to be.
 

Why do you say Ceph is only "decent"? It solves all of the issues you're
talking about, and does so on commodity hardware.

 Of course this bumps into the question whether all apps hosted in the cloud
 should be cloud aware or whether the cloud should have some tolerance for
 legacy apps that are not written that way.
 

Using volumes is more expensive than using specialized scale-out storage,
aka cloud aware storage. But finding and migrating to that scale-out
storage takes time and has a cost too, so volumes have their place and
always will.

So, can you be more clear, what is it that you're suggesting isn't
available now?



Re: [openstack-dev] [all] Do we need release announcements for all the things?

2015-03-13 Thread Clint Byrum
Excerpts from Doug Hellmann's message of 2015-03-13 08:06:43 -0700:
 
 On Fri, Mar 13, 2015, at 06:57 AM, Thierry Carrez wrote:
  Clint Byrum wrote:
   I spend a not-insignificant amount of time deciding which threads to
   read and which to fully ignore each day, so extra threads mean extra
   work, even with a streamlined workflow of single-key-press-per-thread.
   
   So I'm wondering what people are getting from these announcements being
   on the discussion list. I feel like they'd be better off in a weekly
   digest, on a web page somewhere, or perhaps with a tag that could be
   filtered out for those that don't benefit from them.
  
  The first value of a release announcement is (obviously) to let people
  know something was released. There is a bit of a paradox there with some
  announcements being posted to openstack-announce (in theory low-traffic
  and high-attention), and some announcements being posted to
  openstack-dev (high-traffic and medium-attention). Where is the line
  drawn?
  
  The second value of a release announcement is the thread it creates in
  case immediate issues are spotted. I kind of like that some
  python-*client release announcements are followed up by a "this broke
  the world" thread, all in a single convenient package. Delaying
  announcements defeats that purpose.
  
  We need to adapt our current (restricted) usage of openstack-announce to
  a big-tent less-hierarchical future anyway: if we continue to split
  announcements, which projects are deemed important enough to be
  granted openstack-announce access?
  
  Personally in the future I'm not opposed to allowing any openstack
  project (big-tent definition) to post to openstack-announce (ideally in
  a standard / autogenerated format) with reply-to set to openstack-dev.
  We could use a separate list, but then release and OSSA announcements
  are the only thing we use -announce for currently, so I'm not sure it's
  worth it.
  
  So I'm +1 on using a specific list (and setting reply-to to -dev), and
  I'm suggesting openstack-announce should be reused to avoid creating two
  classes of deliverables (-announce worthy and not).
 
 We had complaints in the past when we *didn't* send release
 announcements because people were then unaware of why a new release
 might be causing changes in behavior, so we built a bunch of tools to
 make it easy to create uniform and informative release note emails
 containing the level of detail people wanted. So far those are only
 being used by Oslo, but we're moving the scripts to the release-tools
 repo to make them easy for all library maintainers to use.
 

This is really what I'm asking about. If people were unhappy when we
didn't send them, then it makes sense to keep sending them.

 These announcements are primarily for our developer community and the
 folks at the distros who need to know to package the new versions. Are
 we going to start having non-dev folks who subscribe to the announce
 list complain about the release announcements for libraries, then? Are
 enough developers subscribed to the announce list that they will see the
 release messages to meet the original needs we were trying to meet?
 

I hope I don't come across as complaining. I archive them very rapidly
without ever looking at the content currently. Sometimes they come up in
my searches for topics and then having them in the single timeline is
great, but I have an email reader that supports this without changing
the list behavior. I am more wondering if people who aren't as optimized
as I am have trouble keeping up with them. And having a few less things
to archive manually would certainly be nicer for me, but is a secondary
goal.

I haven't seen very much interest in changing things, mostly people in
support of keeping them as-is. So I suspect people are not annoyed about
this in particular, and we can close the book on this thread.



Re: [openstack-dev] [nova] readout from Philly Operators Meetup

2015-03-12 Thread Clint Byrum
Excerpts from Sean Dague's message of 2015-03-11 05:59:10 -0700:
 =
  Additional Interesting Bits
 =
 
 Rabbit
 --
 
 There was a whole session on Rabbit -
 https://etherpad.openstack.org/p/PHL-ops-rabbit-queue
 
 Rabbit is a top operational concern for most large sites. Almost all
  sites have a "restart everything that talks to rabbit" script because
  during rabbit HA operations queues tend to blackhole.
 
 All other queue systems OpenStack supports are worse than Rabbit (from
 experience in that room).
 
 oslo.messaging  1.6.0 was a significant regression in dependability
 from the incubator code. It now seems to be getting better but still a
 lot of issues. (L112)
 
 Operators *really* want the concept in
 https://review.openstack.org/#/c/146047/ landed. (I asked them to
 provide such feedback in gerrit).
 

This reminded me that there are other options that need investigation.

A few of us have been looking at what it might take to use something
in between RabbitMQ and ZeroMQ for RPC and notifications. Some initial
forays into inspecting Gearman (which infra has successfully used for
quite some time as the backend of Zuul) look promising. A few notes:

* The Gearman protocol is crazy simple. There are currently 4 known gearman
  server implementations: Perl, Java, C, and Python (written and
  maintained by our own infra team). http://gearman.org/download/ for
  the others, and https://pypi.python.org/pypi/gear for the python one.

* Gearman has no pub/sub capability built in for 1:N comms. However, it
  is fairly straight forward to write workers that will rebroadcast
  messages to subscribers.

* Gearman's security model is not very rich. Mostly, if you have been
  authenticated to the gearman server (only the C server actually even
  supports any type of authentication, via SSL client certs), you can
  do whatever you want including consuming all the messages in a queue
  or filling up a queue with nonsense. This has been raised as a concern
  in the past and might warrant extra work to add support to the python
  server and/or add ACL support.

Part of our motivation for this is that some of us are going to be
deploying a cloud soon and none of us are excited about deploying and
supporting RabbitMQ. So we may be proposing specs to add Gearman as a
deployment option soon.
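
To give a feel for how little code it takes to drive, here is a minimal
echo worker and client using the gear library (written from memory, so
treat the exact method names as assumptions and check them against the
gear docs):

  import gear

  def run_worker():
      # Register a function with the gearman server and answer jobs forever.
      worker = gear.Worker('demo-worker')
      worker.addServer('127.0.0.1')
      worker.registerFunction('echo')
      while True:
          job = worker.getJob()                # blocks until a job arrives
          job.sendWorkComplete(job.arguments)  # send the payload straight back

  def run_client():
      client = gear.Client()
      client.addServer('127.0.0.1')
      client.waitForServer()
      client.submitJob(gear.Job('echo', 'hello'))

A 1:N notification fan-out would just be a worker like that which
re-submits each payload onto per-subscriber functions.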



Re: [openstack-dev] [nova] readout from Philly Operators Meetup

2015-03-12 Thread Clint Byrum
Excerpts from Doug Hellmann's message of 2015-03-12 10:04:57 -0700:
 
 On Thu, Mar 12, 2015, at 12:47 PM, Clint Byrum wrote:
  Excerpts from Sean Dague's message of 2015-03-11 05:59:10 -0700:
   =
Additional Interesting Bits
   =
   
   Rabbit
   --
   
   There was a whole session on Rabbit -
   https://etherpad.openstack.org/p/PHL-ops-rabbit-queue
   
   Rabbit is a top operational concern for most large sites. Almost all
    sites have a "restart everything that talks to rabbit" script because
    during rabbit HA operations queues tend to blackhole.
   
   All other queue systems OpenStack supports are worse than Rabbit (from
   experience in that room).
   
   oslo.messaging  1.6.0 was a significant regression in dependability
   from the incubator code. It now seems to be getting better but still a
   lot of issues. (L112)
   
   Operators *really* want the concept in
   https://review.openstack.org/#/c/146047/ landed. (I asked them to
   provide such feedback in gerrit).
   
  
  This reminded me that there are other options that need investigation.
  
  A few of us have been looking at what it might take to use something
  in between RabbitMQ and ZeroMQ for RPC and notifications. Some initial
  forays into inspecting Gearman (which infra has successfully used for
  quite some time as the backend of Zuul) look promising. A few notes:
  
  * The Gearman protocol is crazy simple. There are currently 4 known
  gearman
server implementations: Perl, Java, C, and Python (written and
maintained by our own infra team). http://gearman.org/download/ for
the others, and https://pypi.python.org/pypi/gear for the python one.
  
  * Gearman has no pub/sub capability built in for 1:N comms. However, it
is fairly straight forward to write workers that will rebroadcast
messages to subscribers.
  
  * Gearman's security model is not very rich. Mostly, if you have been
authenticated to the gearman server (only the C server actually even
supports any type of authentication, via SSL client certs), you can
do whatever you want including consuming all the messages in a queue
or filling up a queue with nonsense. This has been raised as a concern
in the past and might warrant extra work to add support to the python
server and/or add ACL support.
  
  Part of our motivation for this is that some of us are going to be
  deploying a cloud soon and none of us are excited about deploying and
  supporting RabbitMQ. So we may be proposing specs to add Gearman as a
  deployment option soon.
 
 That sounds really intriguing, and I look forward to reading it and
 learning more about gearman.
 
 Be forewarned that oslo.messaging is pretty badly understaffed right
 now. Most of the original contributors have moved on, either to other
 parts of OpenStack or out of the community entirely. We can use more
 messaging experts to help with reviews and improvements. 
 

Noted, and subscribed in gertty. :)



Re: [openstack-dev] [nova] readout from Philly Operators Meetup

2015-03-12 Thread Clint Byrum
Excerpts from Sean Dague's message of 2015-03-12 09:59:35 -0700:
 On 03/12/2015 12:47 PM, Clint Byrum wrote:
  Excerpts from Sean Dague's message of 2015-03-11 05:59:10 -0700:
  =
   Additional Interesting Bits
  =
 
  Rabbit
  --
 
  There was a whole session on Rabbit -
  https://etherpad.openstack.org/p/PHL-ops-rabbit-queue
 
  Rabbit is a top operational concern for most large sites. Almost all
   sites have a "restart everything that talks to rabbit" script because
   during rabbit HA operations queues tend to blackhole.
 
  All other queue systems OpenStack supports are worse than Rabbit (from
  experience in that room).
 
  oslo.messaging  1.6.0 was a significant regression in dependability
  from the incubator code. It now seems to be getting better but still a
  lot of issues. (L112)
 
  Operators *really* want the concept in
  https://review.openstack.org/#/c/146047/ landed. (I asked them to
  provide such feedback in gerrit).
 
  
  This reminded me that there are other options that need investigation.
  
  A few of us have been looking at what it might take to use something
  in between RabbitMQ and ZeroMQ for RPC and notifications. Some initial
  forays into inspecting Gearman (which infra has successfully used for
  quite some time as the backend of Zuul) look promising. A few notes:
  
  * The Gearman protocol is crazy simple. There are currently 4 known gearman
server implementations: Perl, Java, C, and Python (written and
maintained by our own infra team). http://gearman.org/download/ for
the others, and https://pypi.python.org/pypi/gear for the python one.
  
  * Gearman has no pub/sub capability built in for 1:N comms. However, it
is fairly straight forward to write workers that will rebroadcast
messages to subscribers.
  
  * Gearman's security model is not very rich. Mostly, if you have been
authenticated to the gearman server (only the C server actually even
supports any type of authentication, via SSL client certs), you can
do whatever you want including consuming all the messages in a queue
or filling up a queue with nonsense. This has been raised as a concern
in the past and might warrant extra work to add support to the python
server and/or add ACL support.
  
  Part of our motivation for this is that some of us are going to be
  deploying a cloud soon and none of us are excited about deploying and
  supporting RabbitMQ. So we may be proposing specs to add Gearman as a
  deployment option soon.
 
 I think experimentation of other models is good. There was some
 conversation that maybe Kafka was a better model as well. However,
 realize that services are quite chatty at this point and push pretty
 large payloads through that bus. The HA story is also quite important,
 because the underlying message architecture assumes reliable delivery
 for some of the messages, and if they fall on the floor, you'll get
 either leaked resources, or broken resources. It's actually the HA
 recovery piece of Rabbit (and when it doesn't HA recover correctly)
 that's seemingly the sharp edge most people are hitting.

Kafka is definitely another one I'd like to keep an eye on, but have
zero experience using.

Chatty is good for gearman, I don't see a problem with that. It's
particularly good at the sort of "send, wait for response, act" model
that I see used often in RPC. Large payloads can be a little expensive
as Gearman will keep them all in memory, but I wonder what you mean by
large. 1MB/message is no big deal if the rate of sending is 50/s (that
is only about 50MB/s of payload passing through).

Reliable delivery is handled several ways:

* Synchronous senders that can hang around and wait for a reply will
  work well as gearman will simply retry those messages if receivers
  have problems. If the gearmand itself dies in this case, client
  libraries should re-send to the next one in the list of servers.

* Async messages can be stored and forwarded. This scales out really
  nicely, but does complicate things in similar ways to ZeroMQ by
  needing a store-and-forward worker on each box.

* Enable persistence in the C server. This one is really darn slow IMO,
  and gives back a lot of Gearman's advantage at being mostly in-memory
  and scaling out. It works a lot like RabbitMQ's shovel HA method where
  you recover from a down node by loading all of the jobs into another
  node's memory. There are multiple options for the store, including
  sqlite, redis, tokyocabinet, and good old fashioned MySQL. My personal
  experience was with tokyocabinet (which I co-wrote the driver for)
  on top of DRBD. We hit delivery rates in the thousands / second with
  small messages, and that was with 2006 CPU's and hard disks. I imagine
  modern hardware with battery-backed write cache and SSDs can deliver
  quite a bit more.

 
 So... experimentation is good, but also important to realize how much is
 provided for by the infrastructure that's there.
 

Indeed, I suspect

[openstack-dev] [all] Do we need release announcements for all the things?

2015-03-12 Thread Clint Byrum
I spend a not-insignificant amount of time deciding which threads to
read and which to fully ignore each day, so extra threads mean extra
work, even with a streamlined workflow of single-key-press-per-thread.

So I'm wondering what people are getting from these announcements being
on the discussion list. I feel like they'd be better off in a weekly
digest, on a web page somewhere, or perhaps with a tag that could be
filtered out for those that don't benefit from them.



Re: [openstack-dev] [all] Do we need release announcements for all the things?

2015-03-12 Thread Clint Byrum
Excerpts from Jeremy Stanley's message of 2015-03-12 13:58:20 -0700:
 On 2015-03-12 13:22:04 -0700 (-0700), Clint Byrum wrote:
 [...]
  So I'm wondering what people are getting from these announcements
  being on the discussion list.
 [...]
 
 The main thing I get from them is that they're being recorded to a
 (theoretically) immutable archive indexed by a lot of other systems.
 Some day I'd love for them to include checksums of the release
 artifacts and be OpenPGP-signed by a release delegate for whatever
 project is releasing, and for those people to also try to get their
 keys signed by one another and members of the community at large.
 

I had not considered the value of that, but it seems like a good thing.

 Sure, we could divert them to a different list (openstack-announce
 was suggested in another reply), but I suspect that most people
 subscribed to -dev are also subscribed to -announce and so it
 wouldn't effectively decrease their E-mail volume. On the other
 hand, a lot more people should be subscribed to -announce so that's
 probably a good idea anyway?

Moving them to openstack-announce would be the opposite of reducing the
impact on the signal to noise ratio, at least for anyone who does read
that list. I prioritize openstack-announce since I would assume
announcements there are mostly important things reserved for a
low-traffic list.

So I think a tag seems like a reasonable way to keep them on the list,
but allow for automated de-prioritization of them by those who don't
want to see them.

Could we maybe have a [release] tag mandated for these?



Re: [openstack-dev] [Keystone]ON DELETE RESTRICT VS ON DELETE CASCADE

2015-03-09 Thread Clint Byrum
Excerpts from David Stanek's message of 2015-03-08 11:18:05 -0700:
 On Sun, Mar 8, 2015 at 1:37 PM, Mike Bayer mba...@redhat.com wrote:
 
  can you elaborate on your reasoning that FK constraints should be used less
  overall?  or do you just mean that the client side should be mirroring the
  same
  rules that would be enforced by the FKs?
 
 
 I don't think he means that we will use them less.  Our SQL backends are
 full of them.  What Keystone can't do is rely on them because not all
 implementations of our backends support FKs.
 

Note that they're also a huge waste of SQL performance. It's _far_ cheaper
to scale out application servers and garbage-collect using background jobs
like pt-archiver than it will ever be to scale out a consistent data-store
and do every single little bit of housekeeping in real time. So even
on SQL backends, I'd recommend just disabling and dropping FK constraints
if you expect any more than the bare minimum usage of Keystone.
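
To be concrete about what I mean by a background job, expired-token
cleanup with pt-archiver is roughly a one-liner (illustrative invocation;
double-check the flags and table/column names against your deployment):

  pt-archiver --source h=localhost,D=keystone,t=token \
      --where "expires < NOW() - INTERVAL 1 DAY" --purge --limit 1000

That runs off to the side on its own schedule instead of taxing every
single write.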



Re: [openstack-dev] [Keystone]ON DELETE RESTRICT VS ON DELETE CASCADE

2015-03-09 Thread Clint Byrum
Excerpts from Mike Bayer's message of 2015-03-09 10:26:37 -0700:
 
 Clint Byrum cl...@fewbar.com wrote:
 
  Excerpts from David Stanek's message of 2015-03-08 11:18:05 -0700:
  On Sun, Mar 8, 2015 at 1:37 PM, Mike Bayer mba...@redhat.com wrote:
  
  can you elaborate on your reasoning that FK constraints should be used 
  less
  overall?  or do you just mean that the client side should be mirroring the
  same
  rules that would be enforced by the FKs?
  
  I don't think he means that we will use them less.  Our SQL backends are
  full of them.  What Keystone can't do is rely on them because not all
  implementations of our backends support FKs.
  
  Note that they're also a huge waste of SQL performance. It's _far_ cheaper
  to scale out application servers and garbage-collect using background jobs
  like pt-archiver than it will ever be to scale out a consistent data-store
  and do every single little bit of house keeping in real time.  So even
  on SQL backends, I'd recommend just disabling and dropping FK constraints
  if you expect any more than the bare minimum usage of Keystone.
 
 Im about -1000 on disabling foreign key constraints. Any decision based on
 “performance” IMHO has to be proven with benchmarks. Foreign keys on modern
 databases like MySQL and Postgresql do not add overhead to any significant
 degree compared to just the workings of the Python code itself (which means,
 a benchmark here should be illustrating a tangible impact on the python
 application itself). OTOH, the prospect of a database with failed
 referential integrity is a recipe for disaster.   
 

So I think I didn't speak clearly enough here. The benchmarks are of
course needed, but there's a tipping point when write activity gets to
a certain level where it's cheaper to let it get a little skewed and
correct asynchronously. This is not unique to SQL, this is all large
scale distributed systems. There's probably a super cool formula for it
too, but roughly it is

(num_trans_per_s * cost_of_fk_check_per_trans)

versus

(error_cost * error_rate)+(cost_find_all_errors/seconds_to_find_all_errors)
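
To make that concrete with invented numbers: at 500 writes/sec and half a
millisecond of extra lock time per FK check, the left side costs you a
quarter second of serialized work every second, forever. If a dangling row
costs an hour of cleanup once a month and a nightly sweep finds them all
in a few seconds, the right side rounds to nothing. Flip the numbers
around and the conclusion flips with them.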

So it's not really something I think one can blindly accept as better,
but rather something that one needs to calculate for themselves. You say
cost_of_fk_check_per_trans is negligible, but that has been measured as
not true in the past:

http://www.percona.com/blog/2010/09/20/instrumentation-and-the-cost-of-foreign-keys/

That article demonstrates that the FK adds lock contention in
InnoDB. There's more. With NDB (MySQL cluster) it's an 18% performance
hit on raw throughput:

http://johanandersson.blogspot.com/2013/06/benchmarking-performance-impact-of.html

Though that could be artificially inflated due to being a raw benchmark.

Now, where that point is with Keystone I don't know. The point is, if you
write the code relying on the constraints' existence, Keystone becomes a
vertically scaling app that cannot ever scale out beyond whatever that
limit is.



Re: [openstack-dev] [Keystone]ON DELETE RESTRICT VS ON DELETE CASCADE

2015-03-09 Thread Clint Byrum
Excerpts from Mike Bayer's message of 2015-03-09 17:26:36 -0700:
 
 Clint Byrum cl...@fewbar.com wrote:
 
  
  So I think I didn't speak clearly enough here. The benchmarks are of
  course needed, but there's a tipping point when write activity gets to
  a certain level where it's cheaper to let it get a little skewed and
  correct asynchronously. This is not unique to SQL, this is all large
  scale distributed systems. There's probably a super cool formula for it
  too, but roughly it is
  
  (num_trans_per_s * cost_of_fk_check_per_trans)
  
  versus
  
  (error_cost * error_rate)+(cost_find_all_errors/seconds_to_find_all_errors)
 
 Well the error cost here would be a database that would be “corrupted”,
 meaning it has rows which no longer refer to things that exist and the
 database is now in a case where it may very well be unusable by the
 application, without being rolled back to some known state. 
 

That's not a cost, that's a situation. What's the actual cost to the
user? "May very well be unusable" implies uncertainty, which is certainly
a risk, but the cost is unknown. Typically one must estimate the cost
with each error found.

 If Keystone truly doesn’t care about ACID it might want to consider MyISAM
 tables, which are faster for read-heavy workloads, though these aren’t
 compatible with Galera.
 

Please try to refrain from using false equivalence. ACID stands for
Atomicity, Consistency, Isolation, Durability. Nowhere in there does it
stand for referential integrity. If Keystone uses transactions
properly, ACID is preserved. Also I don't think it is productive to
bring up MyISAM in any serious conversation about databases.

  So it's not really something I think one can blindly accept as better,
  but rather something that one needs to calculate for themselves. You say
  cost_of_fk_check_per_trans is negligible, but that has been measured as
  not true in the past:
  
  http://www.percona.com/blog/2010/09/20/instrumentation-and-the-cost-of-foreign-keys/
 
 That’s not a surprising case because the “parent” row being modified is
 being referred to by the “child” row that’s still in transaction. This is an
 implementation detail of the ACID guarantees which one gets when they use a
 relational database. If Keystone’s relational backend in fact has a
 performance bottleneck due to an operation like this, it should be visited
 individually. But I think it’s extremely unlikely this is actually the case.
 

Lock contention is a real thing that will inevitably slow down transaction
speed if not carefully avoided. One less query (which is what FK checks
end up being) means one less read lock taken and one less place to have
to think through.

In practical matters, the fact that identity and assignment are not
allowed to FK does practically shut down most of the real possibilities
of this type of contention.

  That article demonstrates that the FK adds lock contention in
  InnoDB. There's more. With NDB (MySQL cluster) it's an 18% performance
  hit on raw throughput:
  
  http://johanandersson.blogspot.com/2013/06/benchmarking-performance-impact-of.html
 
 For NDB cluster, foreign key support was only added to that system two years
 ago, in version 5.6.10 in 2013. This is clearly not a system designed to
 support foreign keys in the first place, the feature is entirely bleeding
 edge for that specific system, and performance like that is entirely
  atypical for database systems outside of NDB cluster. Specifically
 with Openstack, the clustering solution usually used is Galera which has no
 such performance issue.
 
 So sure, if you’re using NDB cluster, FOREIGN KEY support is
 bleeding edge and you may very well want to disable constraints as you’re
 using a system that wasn’t designed with this use case in mind. But because
 using a relational database is somewhat pointless if you don’t need ACID,
 I’d probably use Galera instead.
 

NDB is probably overkill for Keystone until we get up into the millions
of users scale. One day maybe :). The point is that this is a
high-performance DB with high-performance demands, and it is 18% slower
for some types of operations when FKs are added.

  
  Now, where that point is with Keystone I don't know. The point is, if you
  write the code relying on the existence, Keystone becomes a vertically
  scaling app that cannot ever scale out beyond whatever that limit is.
 
 There seems to be some misunderstanding that using foreign keys to enforce
  referential integrity implies that the application is now dependent
 on these constraints being in place. I notice that the conversation was
 originally talking a bit about allowing rows to be deleted using CASCADE,
 and my original question referred to the notion of foreign key use
 *overall*, not specifically as a means to offer automatic deletion of
 related rows with CASCADE.   The use of foreign key constraints
 in openstack applications does not imply an unbreakable reliance
 upon them at all, for two

Re: [openstack-dev] [all] Re-evaluating the suitability of the 6 month release cycle

2015-03-04 Thread Clint Byrum
Excerpts from Thierry Carrez's message of 2015-03-04 02:19:48 -0800:
 James Bottomley wrote:
  On Tue, 2015-03-03 at 11:59 +0100, Thierry Carrez wrote:
  James Bottomley wrote:
  Actually, this is possible: look at Linux, it freezes for 10 weeks of a
  12 week release cycle (or 6 weeks of an 8 week one).  More on this
  below.
 
  I'd be careful with comparisons with the Linux kernel. First it's a
  single bit of software, not a collection of interconnected projects.
  
   Well, we do have interconnection: the kernel on its own doesn't do
  anything without a userspace.  The theory was that we didn't have to be
  like BSD (coupled user space and kernel) and we could rely on others
  (principally the GNU project in the early days) to provide the userspace
  and that we could decouple kernel development from the userspace
  releases.  Threading models were, I think, the biggest challenges to
  this assumption, but we survived.
 
 Right. My point there is that you only release one thing. We release a
 lot more pieces. There is (was?) downstream value in coordinating those
 releases, which is a factor in our ability to do it more often than
 twice a year.
 

I think the value of coordinated releases has been agreed upon for a
long time. This thread is more about the cost, don't you think?

  Second it's at a very different evolution/maturity point (20 years old
  vs. 0-4 years old for OpenStack projects).
  
  Yes, but I thought I covered this in the email: you can see that at the
  4 year point in its lifecycle, the kernel was behaving very differently
  (and in fact more similar to OpenStack).  The question I thought was
  still valid is whether anything was learnable from the way the kernel
  evolved later.  I think the key issue, which you seem to have in
  OpenStack is that the separate develop/stabilise phases caused
  frustration to build up in our system which (nine years later) led the
  kernel to adopt the main branch stabilisation with overlapping subsystem
  development cycle.
 
 I agree with you: the evolution the kernel went through is almost a
 natural law, and I know we won't stay in the current model forever. I'm
 just not sure we have reached the level of general stability that makes
 it possible to change *just now*. I welcome brainstorming and discussion
 on future evolutions, though, and intend to lead a cross-project session
 discussion on that in Vancouver.
 

I don't believe that the kernel reached maturity as a point of
eventuality. Just like humans aren't going to jump across the Grand
Canyon no matter how strong they get, it will take a concerted effort
that may put other goals on hold to build a bridge. With the kernel
there was a clear moment where leadership had tried a few things and
then just had to make it clear that all the code goes in one place, but
instability would not be tolerated. They crossed that chasm, and while
there have been chaotic branches and ruffled feathers, once everybody
got over the paradox, it's been business as usual since then with the
model James describes.

I think the less mature a project is, the wider that chasm is, but I
don't think it's ever going to be an easy thing to do. Since we don't
have a dictator to force us to cross the chasm, we should really think
about planning for the crossing ASAP.

   Finally it sits at a
  different layer, so there is less need for documentation/translations to
  be shipped with the software release.
  
  It's certainly a lot less than you, but we have the entire system call
  man pages.  It's an official project of the kernel:
  
  https://www.kernel.org/doc/man-pages/
  
  And we maintain translations for it
  
  https://www.kernel.org/doc/man-pages/translations.html
 
 By translations I meant strings in the software itself, not doc
 translations. We don't translate docs upstream either :) I guess we
 could drop those (and/or downstream them in a way) if that was the last
 thing holding up adding more agility.
 
 So in summary, yes we can (and do) learn from kernel history, but those
 projects are sufficiently different that the precise timeframes and
 numbers can't really be compared. Apples and oranges are both fruits
 which mature (and rot if left unchecked), but they evolve at different
 speeds :)
 

I'm not super excited about being an apple or an orange, since neither
are sentient and thus cannot collaborate on a better existence than
rotting.



Re: [openstack-dev] auto-abandon changesets considered harmful (was Re: [stable][all] Revisiting the 6 month release cycle [metrics])

2015-03-02 Thread Clint Byrum
Excerpts from Doug Wiegley's message of 2015-03-02 12:47:14 -0800:
 
  On Mar 2, 2015, at 1:13 PM, James E. Blair cor...@inaugust.com wrote:
  
  Stefano branched this thread from an older one to talk about
  auto-abandon.  In the previous thread, I believe I explained my
  concerns, but since the topic split, perhaps it would be good to
  summarize why this is an issue.
  
  1) A core reviewer forcefully abandoning a change contributed by someone
  else can be a very negative action.  It's one thing for a contributor to
  say "I have abandoned this effort"; it's very different for a core
  reviewer to do that for them.  It is a very strong action and signal,
  and should not be taken lightly.
 
 I'm not arguing against better tooling, queries, or additional comment 
 warnings.  All of those are good things. But I think some of the push back in 
 this thread is challenging this notion that abandoning is negative, which you 
 seem to be treating as a given.
 
 I don't. At all. And I don't think I'm alone.
 
 I also don't understand your point that the review becomes invisible, since 
 it's a simple gerrit query to see closed reviews, and your own contention is 
 that gerrit queries solve this in the other direction, so it can't be too 
 hard in this one, either. I've done that many times to find mine and others' 
 abandoned reviews, the most recent example being resurrecting all of the 
 lbaas v2 reviews after it slipped out of juno and eventually was put into 
 it's own repo.  Some of those reviews were abandoned, others not, and it was 
 roughly equivalent to find them, open or not, and then re-tool those for the 
 latest changes to master.
 

You are correct in saying that just like users can query for a proper
queue of things they should look at, people can also query for abandoned
patches.

However, I'm not sure these are actually the same things.

One is a simple query to hide things you don't want.

The other is a simple query to find things you don't know are missing.
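
Concretely (example queries only, adjust the operators to taste):

  status:open project:openstack/heat age:4w label:Code-Review=0
  status:abandoned project:openstack/heat owner:self

The first you run because you want a tidy review queue. The second you
only think to run if you already suspect something fell off the radar,
which is exactly the problem.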



Re: [openstack-dev] [all] Re-evaluating the suitability of the 6 month release cycle

2015-03-02 Thread Clint Byrum
Excerpts from Angus Salkeld's message of 2015-03-02 17:08:15 -0800:
 On Tue, Mar 3, 2015 at 9:45 AM, James Bottomley 
 james.bottom...@hansenpartnership.com wrote:
 
  On Tue, 2015-02-24 at 12:05 +0100, Thierry Carrez wrote:
   Daniel P. Berrange wrote:
[...]
The key observations

   
The first key observation from the schedule is that although we have
a 6 month release cycle, we in fact make 4 releases in that six
months because there are 3 milestones releases approx 6-7 weeks apart
from each other, in addition to the final release. So one of the key
burdens of a more frequent release cycle is already being felt, to
some degree.
   
The second observation is that thanks to the need to support a
continuous deployment models, the GIT master branches are generally
considered to be production ready at all times. The tree does not
typically go through periods of major instability that can be seen
in other projects, particular those which lack such comprehensive
testing infrastructure.
   
The third observation is that due to the relatively long cycle, and
increasing amounts of process, the work accomplished during the
cycles is becoming increasingly bursty. This is in turn causing
unacceptably long delays for contributors when their work is unlucky
enough to not get accepted during certain critical short windows of
opportunity in the cycle.
   
The first two observations strongly suggest that the choice of 6
months as a cycle length is a fairly arbitrary decision that can be
changed without unreasonable pain. The third observation suggests a
much shorter cycle length would smooth out the bumps and lead to a
more efficient & satisfying development process for all involved.
  
   I think you're judging the cycle from the perspective of developers
   only. 6 months was not an arbitrary decision. Translations and
   documentation teams basically need a month of feature/string freeze in
   order to complete their work. Since we can't reasonably freeze one month
   every 2 months, we picked 6 months.
 
  Actually, this is possible: look at Linux, it freezes for 10 weeks of a
  12 month release cycle (or 6 weeks of an 8 week one).  More on this
  below.
 
   It's also worth noting that we were on a 3-month cycle at the start of
   OpenStack. That was dropped after a cataclysmic release that managed the
    feat of (a) not having anything significant done, and (b) having out of
   date documentation and translations.
  
   While I agree that the packagers and stable teams can opt to skip a
   release, the docs, translations or security teams don't really have that
   luxury... Please go beyond the developers needs and consider the needs
   of the other teams.
  
   Random other comments below:
  
[...]
Release schedule

   
First the releases would probably be best attached to a set of
pre-determined fixed dates that don't ever vary from year to year.
eg releases happen Feb 1st, Apr 1st, Jun 1st, Aug 1st, Oct 1st, and
Dec 1st. If a particular release slips, don't alter following release
dates, just shorten the length of the dev cycle, so it becomes fully
self-correcting. The even numbered months are suggested to avoid a
release landing in xmas/new year :-)
  
   The Feb 1 release would probably be pretty empty :)
  
[...]
Stable branches
---
   
The consequences of a 2 month release cycle appear fairly severe for
the stable branch maint teams at first sight. This is not, however,
an insurmountable problem. The linux kernel shows an easy way forward
with their approach of only maintaining stable branches for a subset
of major releases, based around user / vendor demand. So it is still
entirely conceivable that the stable team only provide stable branch
releases for 2 out of the 6 yearly releases. ie no additional burden
over what they face today. Of course they might decide they want to
do more stable branches, but maintain each for a shorter time. So I
could equally see them choosing todo 3 or 4 stable branches a year.
Whatever is most effective for those involved and those consuming
them is fine.
  
   Stable branches may have the luxury of skipping releases and designate a
   stable one from time to time (I reject the Linux comparison because
   the kernel is at a very different moment in software lifecycle). The
   trick being, making one release special is sure to recreate the peak
   issues you're trying to solve.
 
  I don't disagree with the observation about different points in the
  lifecycle, but perhaps it might be instructive to ask if the linux
  kernel ever had a period in its development history that looks somewhat
  like OpenStack does now.  I would claim it did: before 2.6, we had the
  odd/even develop/stabilise cycle.  The theory driving it was that we
  

Re: [openstack-dev] [Ironic] Adding vendor drivers in Ironic

2015-03-01 Thread Clint Byrum
Excerpts from Gary Kotton's message of 2015-03-01 02:32:37 -0800:
 Hi,
 I am just relaying pain-points that we encountered in neutron. As I have
 said below it makes the development process a lot quicker for people
 working on external drivers. I personally believe that it fragments the
 community and feel that the external drivers lose the community
 contributions and inputs.

I think you're right that this does change the dynamic in the
community. One way to lower the barrier is to go ahead and define the
plugin API very strongly, but then delegate control of drivers in-tree
to active maintainers, rather than in external repositories. If a driver
falls below the line in terms of maintenance, then it can be deprecated.
And if a maintainer feels strongly that they cannot include the driver
with Ironic for whatever reason, the plugin API being strongly defined
will allow them to do so.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Adding vendor drivers in Ironic

2015-02-28 Thread Clint Byrum
I'm not sure I understand your statement Gary. If Ironic defines
what is effectively a plugin API, and the vendor drivers are careful
to utilize that API properly, the two sets of code can be released
entirely independent of one another. This is how modules work in the
kernel, X.org drivers work, and etc. etc. Of course, vendors could be
irresponsible and break compatibility with older releases of Ironic,
but that is not in their best interest, so I don't see why anybody would
need to tightly couple.

As far as where generic code goes, that seems obvious: it all has to go
into Ironic and be hidden behind the plugin API.

Excerpts from Gary Kotton's message of 2015-02-28 09:28:55 -0800:
 Hi,
 There are pros and cons for what you have mentioned. My concern, and I 
 mentioned them with the neutron driver decomposition, is that we are are 
 loosing the community inputs and contributions. Yes, one can certainly move 
 faster and freer (which is a huge pain point in the community). How are 
 generic code changes percolated to your repo? Do you have an automatic CI 
 that detects this? Please note that when itonic release you will need to 
 release your repo so that the relationship is 1:1...
 Thanks
 Gary
 
 From: Ramakrishnan G 
 rameshg87.openst...@gmail.com
 Reply-To: OpenStack List 
 openstack-dev@lists.openstack.org
 Date: Saturday, February 28, 2015 at 8:28 AM
 To: OpenStack List 
 openstack-dev@lists.openstack.org
 Subject: [openstack-dev] [Ironic] Adding vendor drivers in Ironic
 
 
 Hello All,
 
 This is about adding vendor drivers in Ironic.
 
 In Kilo, we have many vendor drivers getting added in Ironic which is a very 
 good thing.  But something I noticed is that, most of these reviews have lots 
 of hardware-specific code in them.  This is something most of the other 
 Ironic folks cannot understand unless they go and read the hardware manuals 
 of the vendor hardware about what is being done.  Otherwise we just need to 
 blindly mark the file as reviewed.
 
 Now let me pitch in with our story about this.  We added a vendor driver for 
 HP Proliant hardware (the *ilo drivers in Ironic).  Initially we proposed 
 this same thing in Ironic that we will add all the hardware specific code in 
 Ironic itself under the directory drivers/modules/ilo.  But few of the Ironic 
 folks didn't agree on this (Devananda especially who is from my company :)). 
 So we created a new module proliantutils, hosted in our own github and 
 recently moved it to stackforge.  We gave a limited set of APIs for Ironic to 
 use - like get_host_power_status(), set_host_power(), get_one_time_boot(), 
 set_one_time_boot(), etc. (Entire list is here 
 https://github.com/stackforge/proliantutils/blob/master/proliantutils/ilo/operations.py).
 
 We have only seen benefits in doing it.  Let me bring in some examples:
 
 1) We tried to add support for some lower version of servers.  We could do 
 this without making any changes in Ironic (Review in proliantutils 
 https://review.openstack.org/#/c/153945/)
 2) We are adding support for newer models of servers (earlier we used to talk 
 to servers in a protocol called RIBCL; for newer servers we will use a protocol 
 called RIS) - We could do this with just 14 lines of actual code change in 
 Ironic (this was needed mainly because we didn't think we will have to use a 
 new protocol itself when we started) - 
 https://review.openstack.org/#/c/154403/
 
 Now talking about the advantages of putting hardware-specific code in Ironic:
 
 1) It's reviewed by Openstack community and tested:
 No. I doubt if I throw in 600 lines of new iLO specific code that is here 
 (https://github.com/stackforge/proliantutils/blob/master/proliantutils/ilo/ris.py)
  for Ironic folks, they will hardly take a look at it.  And regarding 
 testing, it's not tested in the gate unless we have a 3rd party CI for it.  
 [We (iLO drivers) also don't have 3rd party CI right now, but we are working 
 on it.]
 
 2) Everything gets packaged into distributions automatically:
 Now the hardware-specific code that we add in Ironic under 
 drivers/modules/vendor/ will get packaged into distributions, but this code 
 in turn will have dependencies  which needs to be installed manually by the 
 

Re: [openstack-dev] [all] creating a unified developer reference manual

2015-02-27 Thread Clint Byrum
Excerpts from Ben Nemec's message of 2015-02-27 09:25:37 -0800:
 On 02/27/2015 03:54 AM, Thierry Carrez wrote:
  Doug Hellmann wrote:
  Maybe some of the folks in the meeting who felt more strongly that it
  should be a separate document can respond with their thoughts?
  
  I don't feel very strongly and could survive this landing in
  openstack-specs. My objection was the following:
  
  - Specs are for designing the solution and implementation plan to a
  specific problem. They are mainly used by developers and reviewers
  during implementation as a clear reference rationale for change and
  approved plan. Once they are fully implemented, they are kept for
  history purpose, not for constant reference.
  
  - Guidelines/developer doc are for all developers (old and new) to
  converge on best practices on topics that are not directly implemented
  as hacking rules. They are constantly used by everyone (not just
  developers/reviewers of a given feature) and never become history.
  
  Putting guidelines doc in the middle of specs makes it a bit less
  discoverable imho, especially by our new developers. It's harder to
  determine which are still current and you should read. An OpenStack
  developer doc sounds like a much better entry point.
  
  That said, the devil is in the details, and some efforts start as specs
  (for existing code to catch up with the recommendation) and become
  guidelines (for future code being written). That is the case of the log
  levels spec: it is both a spec and a guideline. Personally I wouldn't
  object if that was posted in both areas, or if the relevant pieces were
  copied, once the current code has caught up, from the spec to a dev
  guideline.
  
  In the eventlet case, it's only a set of best practices / guidelines:
  there is no specific problem to solve, no catch-up plan for existing
  code to implement. Only a collection of recommendations if you get to
  write future eventlet-based code. Those won't start or end. Which is why
  I think it should go straight to a developer doc.
  
 
 Well, this whole spec arose because we found out there was existing code
 that was doing bad things with eventlet monkey patching that needed to
 be fixed.  The specific problem is actually being worked concurrently
 with the spec because everyone involved has agreed on a solution, which
 became one of the guidelines in the spec.  I'd be surprised if there
 aren't other projects that need similar changes to be in line with the
 new recommendations though.  I'd hope that future projects will follow
 the guidelines, but they were actually written for the purpose of
 eliminating as many potential eventlet gotchas in our _current_ code as
 possible.  Coming up with a specific list of changes needed is tough
 until we have agreement on the best practices though, which is why the
 first work item is a somewhat vague audit and fix all the things point.
 
 Personally, I would expect most best practice/guideline type specs to be
 similar.  Nobody's going to take the time to write up a spec about
 something everyone's already doing - they're going to do it because one
 or a few projects have found something that works well and they think
 everyone should be doing it.  So I think your point about most of these
 things moving from spec to guideline throughout their lifetime is spot
 on, I'm just wondering if it's worth complicating the workflow for that
 process.  Herding the cats for something big like the log guidelines is
 hard enough without requiring two separate documents for the immediate
 work and the long-term information.
 
 That said, I agree with the points about publishing this stuff under a
 developer reference doc rather than specs, and if that can't be done in
 a single repo maybe we do have to split.  I'd still prefer to keep it
 all in one repo though - moving a doc between directories is a lot
 simpler than moving it between repos (and also doesn't lose any previous
 discussion in Gerrit).
 

There are numerous wiki pages that have a wealth of knowledge, but
very poor history attached. Such as:

  * https://wiki.openstack.org/wiki/Python3
  * https://wiki.openstack.org/wiki/GitCommitMessages
  * https://wiki.openstack.org/wiki/Gerrit_Workflow
  * https://wiki.openstack.org/wiki/Getting_The_Code
  * https://wiki.openstack.org/wiki/Testr

Just having these in git would be useful, and having the full change
history with the same care given to commit messages and with reviewers
would, I think, improve the content and usability of these.

Since these are not even close to specs, but make excellent static
documents, I think having them in a cross-project developer
documentation repository makes a lot of sense.

That said, I do think that we would need to have a much lower bar for
reviews (maybe just 1 * +2, but let it sit at least 3 days or something
to allow for exposure).

I'd be quite happy to help with an effort to convert the wiki pages
above into rst and move forward with things 

Re: [openstack-dev] [all] Replace eventlet with asyncio

2015-02-25 Thread Clint Byrum
Excerpts from Victor Stinner's message of 2015-02-25 02:12:05 -0800:
 Hi,
 
  I also just put up another proposal to consider:
  https://review.openstack.org/#/c/156711/
  Sew over eventlet + patching with threads
 
 My asyncio spec is unclear about WSGI, I just wrote
 
 The spec doesn't change OpenStack components running WSGI servers
 like nova-api. The specific problem of using asyncio with WSGI will
 need a separated spec.
 
 Joshua's threads spec proposes:
 
 I would prefer to let applications such as apache or others handle
 the request as they see fit and just make sure that our applications
 provide wsgi entrypoints that are stateless and can be horizontally
 scaled as needed (aka remove all eventlet and thread ... semantics
 and usage from these entrypoints entirely).
 
 Keystone wants to do the same:
 https://review.openstack.org/#/c/157495/
 Deprecate Eventlet Deployment in favor of wsgi containers
 
 This deprecates Eventlet support in documentation and on invocation
 of keystone-all.
 
 I agree: we don't need concurrency in the code handling a single HTTP 
 request: use blocking functions calls. You should rely on highly efficient 
 HTTP servers like Apache, nginx, werkzeug, etc. There is a lot of choice, 
 just pick your favorite server ;-) Each HTTP request is handled in a thread. 
 You can use N processes and each process running M threads. It's a common 
 architecture design which is efficient.
 
 For database accesses, just use regular blocking calls (no need to modify 
 SQLAlchemy). According to Mike Bayer's benchmark (*), it's even the fastest 
 method if your code is database intensive. You may share a pool of database 
 connections between the threads, but a connection should only be used by a 
 single thread.
 
 (*) http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/
 
 I don't think that we need a spec if everybody already agree on the design :-)
 

+1

This leaves a few pieces of python which don't operate via HTTP
requests. There are likely more, but these come to mind:

* Nova conductor
* Nova scheduler/Gantt
* Nova compute
* Neutron agents
* Heat engine

I don't have a good answer for them, but my gut says none of these
gets as crazy with concurrency as the API services which have to talk
to all the clients with their terrible TCP stacks, and awful network
connectivity. The list above is always just talking on local buses, and
thus can likely just stay on eventlet, or use a multiprocessing model to
take advantage of local CPUs too. I know for Heat's engine, we saw quite
an improvement in performance of Heat just by running multiple engines.
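
A minimal sketch of what I mean by a multiprocessing model for the
engine-style services (the names and the worker body are invented here
for illustration, this is not any project's actual code):

    # Run one single-threaded worker process per local CPU instead of
    # relying on a single eventlet hub in one process.
    import multiprocessing
    import os

    def run_engine(worker_id):
        # Placeholder for the real service loop, e.g. consuming RPC
        # messages from the local bus and dispatching them.
        print("engine worker %d running in pid %d" % (worker_id, os.getpid()))

    if __name__ == "__main__":
        workers = [multiprocessing.Process(target=run_engine, args=(i,))
                   for i in range(multiprocessing.cpu_count())]
        for w in workers:
            w.start()
        for w in workers:
            w.join()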

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] db-level locks, non-blocking algorithms, active/active DB clusters and IPAM

2015-02-25 Thread Clint Byrum
Excerpts from Salvatore Orlando's message of 2015-02-23 04:07:38 -0800:
 Lazy-Stacker summary:
 I am doing some work on Neutron IPAM code for IP Allocation, and I need to
 find out whether it's better to use db locking queries (SELECT ... FOR UPDATE)
 or some sort of non-blocking algorithm.
 Some measures suggest that for this specific problem db-level locking is
 more efficient even when using multi-master DB clusters, which kind of
 counters recent findings by other contributors [2]... but also backs those
 from others [7].
 

Thanks Salvatore, the story and data you produced is quite interesting.

 
 With the test on the Galera cluster I was expecting a terrible slowdown in
 A-1 because of deadlocks caused by certification failures. I was extremely
 disappointed that the slowdown I measured however does not make any of the
 other algorithms a viable alternative.
 On the Galera cluster I did not run extensive collections for A-2. Indeed
 primary key violations seem to trigger db deadlocks because of failed write
 set certification too (but I have not yet tested this).
 I run tests with 10 threads on each node, for a total of 30 workers. Some
 results are available at [15]. There was indeed a slow down in A-1 (about
 20%), whereas A-3 performance stayed pretty much constant. Regardless, A-1
 was still at least 3 times faster than A-3.
 As A-3's queries are mostly select (about 75% of them) use of caches might
 make it a lot faster; also the algorithm is probably inefficient and can be
 optimised in several areas. Still, I suspect it can be made faster than
 A-1. At this stage I am leaning towards adoption db-level-locks with
 retries for Neutron's IPAM. However, since I never trust myself, I wonder
 if there is something important that I'm neglecting and will hit me down
 the road.
 

The thing is, nobody should actually be running blindly with writes
being sprayed out to all nodes in a Galera cluster. So A-1 won't slow
down _at all_ if you just use Galera as an ACTIVE/PASSIVE write master.
It won't scale any worse for writes, since all writes go to all nodes
anyway. For reads we can very easily start to identify hot-spot reads
that can be sent to all nodes and are tolerant of a few seconds latency.

 In the medium term, there are a few things we might consider for Neutron's
 built-in IPAM.
 1) Move the allocation logic out of the driver, thus making IPAM an
 independent service. The API workers will then communicate with the IPAM
 service through a message bus, where IP allocation requests will be
 naturally serialized

This would rely on said message bus guaranteeing ordered delivery. That
is going to scale far worse, and be more complicated to maintain, than
Galera with a few retries on failover.

 2) Use 3-party software as dogpile, zookeeper but even memcached to
 implement distributed coordination. I have nothing against it, and I reckon
 Neutron can only benefit for it (in case you're considering of arguing that
 it does not scale, please also provide solid arguments to support your
 claim!). Nevertheless, I do believe API request processing should proceed
 undisturbed as much as possible. If processing an API requests requires
 distributed coordination among several components then it probably means
 that an asynchronous paradigm is more suitable for that API request.
 

If we all decide that having a load balancer sending all writes and
reads to one Galera node is not acceptable for some reason, then we
should consider a distributed locking method that might scale better,
like ZK/etcd or the like. But I think just figuring out why we want to
send all writes and reads to all nodes is a better short/medium term
goal.
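
For reference, the db-level-locks-with-retries pattern being weighed here
(A-1) boils down to roughly this; the table, columns and the exception
class are invented for illustration, this is not Neutron's actual code:

    # Allocate an IP by locking a candidate row, retrying when the commit
    # fails certification (surfaced as a deadlock error on Galera failover).
    class DBDeadlock(Exception):
        """Stand-in for the real DB driver's deadlock exception."""

    def allocate_ip(conn, subnet_id, max_retries=5):
        for _ in range(max_retries):
            try:
                cur = conn.cursor()
                cur.execute("SELECT ip FROM available_ips "
                            "WHERE subnet_id = %s LIMIT 1 FOR UPDATE",
                            (subnet_id,))
                row = cur.fetchone()
                if row is None:
                    raise RuntimeError("subnet exhausted")
                cur.execute("DELETE FROM available_ips "
                            "WHERE subnet_id = %s AND ip = %s",
                            (subnet_id, row[0]))
                conn.commit()
                return row[0]
            except DBDeadlock:
                conn.rollback()
        raise RuntimeError("gave up after %d retries" % max_retries)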

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] H302 considered harmful

2015-02-25 Thread Clint Byrum
Excerpts from Duncan Thomas's message of 2015-02-25 10:51:00 -0800:
 Hi
 
 So a review [1] was recently submitted to cinder to fix up all of the H302
 violations, and turn on the automated check for them. This is certainly a
 reasonable suggestion given the number of manual reviews that -1 for this
 issue, however I'm far from convinced it actually makes the code more
 readable,
 
 Is there anybody who'd like to step forward in defence of this rule and
 explain why it is an improvement? I don't discount for a moment the
 possibility I'm missing something, and welcome the education in that case

I think we've had this conclusion a few times before, but let me
resurrect it:

The reason we have hacking and flake8 and pep8 and etc. etc. is so that
code reviews don't descend into nit picking and style spraying.

I'd personally have a private conversation with anyone who mentioned
this, or any other rule that is in hacking/etc., in a review. I want to
know why people think it is a good idea to bombard users with rules that
are already called out explicitly in automation.

Let the robots do their job, and they will let you do yours (until the
singularity, at which point your job will be hiding from the robots).
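
(For anyone who hasn't looked it up: H302 is the 'import only modules'
check, so the robot would be flagging things roughly like this --
illustrative snippet, not taken from the Cinder review:)

    # H302 violation: importing a name out of a module.
    from os.path import join
    path = join('/var', 'log')

    # H302-clean: import the module and reference through it.
    import os.path
    path = os.path.join('/var', 'log')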

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] H302 considered harmful

2015-02-25 Thread Clint Byrum
Excerpts from Duncan Thomas's message of 2015-02-25 12:51:35 -0800:
 Clint
 
 This rule is not currently enabled in Cinder. This review fixes up all
 cases and enables it, which is absolutely 100% the right thing to do if we
 decide to implement this rule.
 
 The purpose of this thread is to understand the value of the rule. We
 should either enforce it, or else explicitly decide to ignore it, and
 educate reviewers who manually comment on it.
 
 I lean against the rule, but there are certainly enough comments coming in
 that I'll look and think again, which is a good result for the thread.
 

Thanks for your thoughts Duncan, they are appreciated.

I believe that what's being missed here is that arguing for or against the
rule, or even taking time to try and understand it, is far more costly
than simply following it if it is enabled, or ignoring it if it is not
enabled.

I don't think any of us want to be project historians, so we should
just make sure to have a good commit message when we turn it on or off,
and otherwise move forward with the actual development of OpenStack.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] Stepping down as TripleO PTL

2015-02-18 Thread Clint Byrum
Excerpts from Clint Byrum's message of 2015-02-17 08:52:46 -0800:
 Excerpts from Anita Kuno's message of 2015-02-17 07:38:01 -0800:
  On 02/17/2015 09:21 AM, Clint Byrum wrote:
   There has been a recent monumental shift in my focus around OpenStack,
   and it has required me to take most of my attention off TripleO. Given
   that, I don't think it is in the best interest of the project that I
   continue as PTL for the Kilo cycle.
   
   I'd like to suggest that we hold an immediate election for a replacement
   who can be 100% focused on the project.
   
   Thanks everyone for your hard work up to this point. I hope that one day
   soon TripleO can deliver on the promise of a self-deploying OpenStack
   that is stable and automated enough to sit in the gate for many if not
   all OpenStack projects.
   
   
   
   __
   OpenStack Development Mailing List (not for usage questions)
   Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
   http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
   
  So in the middle of a release, changing PTLs can take 3 avenues:
  
  1) The new PTL is appointed. Usually there is a leadership candidate in
  waiting which the rest of the project feels it can rally around until
  the next election. The stepping down PTL takes the pulse of the
  developers on the project and informs us on the mailing list who the
  appointed PTL is. Barring any huge disagreement, we continue on with
  work and the appointed PTL has the option of standing for election in
  the next election round. The appointment lasts until the next round of
  elections.
  
 
 Thanks for letting me know about this Anita.
 
 I'd like to appoint somebody, but I need to have some discussions with a
 few people first. As luck would have it, some of those people will be in
 Seattle with us for the mid-cycle starting tomorrow.
 
  2) We have an election, in which case we need candidates and some dates.
  Let me know if we want to exercise this option so that Tristan and I can
  organize some dates.
  
 
 Let's wait a bit until I figure out if there's a clear and willing
 appointee. That should be clear by Thursday.

Ok, we talked this morning, and James Slagle has agreed to step in as
the PTL for the rest of this cycle. So I hereby appoint him so.

Thanks everyone!


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] Stepping down as TripleO PTL

2015-02-17 Thread Clint Byrum
Excerpts from Anita Kuno's message of 2015-02-17 07:38:01 -0800:
 On 02/17/2015 09:21 AM, Clint Byrum wrote:
  There has been a recent monumental shift in my focus around OpenStack,
  and it has required me to take most of my attention off TripleO. Given
  that, I don't think it is in the best interest of the project that I
  continue as PTL for the Kilo cycle.
  
  I'd like to suggest that we hold an immediate election for a replacement
  who can be 100% focused on the project.
  
  Thanks everyone for your hard work up to this point. I hope that one day
  soon TripleO can deliver on the promise of a self-deploying OpenStack
  that is stable and automated enough to sit in the gate for many if not
  all OpenStack projects.
  
  
  
  __
  OpenStack Development Mailing List (not for usage questions)
  Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
  
 So in the middle of a release, changing PTLs can take 3 avenues:
 
 1) The new PTL is appointed. Usually there is a leadership candidate in
 waiting which the rest of the project feels it can rally around until
 the next election. The stepping down PTL takes the pulse of the
 developers on the project and informs us on the mailing list who the
 appointed PTL is. Barring any huge disagreement, we continue on with
 work and the appointed PTL has the option of standing for election in
 the next election round. The appointment lasts until the next round of
 elections.
 

Thanks for letting me know about this Anita.

I'd like to appoint somebody, but I need to have some discussions with a
few people first. As luck would have it, some of those people will be in
Seattle with us for the mid-cycle starting tomorrow.

 2) We have an election, in which case we need candidates and some dates.
 Let me know if we want to exercise this option so that Tristan and I can
 organize some dates.
 

Let's wait a bit until I figure out if there's a clear and willing
appointee. That should be clear by Thursday.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] Repurposing HP CI regions

2015-02-17 Thread Clint Byrum
FYI: Recently HP's focus for deployment has changed, and as such, some of
the resources we had dedicated to TripleO are being redistributed. As a
result, the HP CI region won't be returning to the pool (it is currently
removed due to some stability issues). Nor will we be adding region #2,
which never quite made it into the pool.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] Stepping down as TripleO PTL

2015-02-17 Thread Clint Byrum
There has been a recent monumental shift in my focus around OpenStack,
and it has required me to take most of my attention off TripleO. Given
that, I don't think it is in the best interest of the project that I
continue as PTL for the Kilo cycle.

I'd like to suggest that we hold an immediate election for a replacement
who can be 100% focused on the project.

Thanks everyone for your hard work up to this point. I hope that one day
soon TripleO can deliver on the promise of a self-deploying OpenStack
that is stable and automated enough to sit in the gate for many if not
all OpenStack projects.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] Lets keep our community open, lets fight for it

2015-02-17 Thread Clint Byrum
Excerpts from Daniel P. Berrange's message of 2015-02-17 02:37:50 -0800:
 On Wed, Feb 11, 2015 at 03:14:39PM +0100, Stefano Maffulli wrote:
   ## Cores are *NOT* special
   
   At some point, for some reason that is unknown to me, this message
   changed and the feeling of core's being some kind of superheros became
   a thing. It's gotten far enough to the point that I've came to know
   that some projects even have private (flagged with +s), password
   protected, irc channels for core reviewers.
  
  This is seriously disturbing.
  
  If you're one of those core reviewers hanging out on a private channel,
  please contact me privately: I'd love to hear from you why we failed as
  a community at convincing you that an open channel is the place to be.
  
  No public shaming, please: education first.
 
 I've been thinking about these last few lines a bit, I'm not entirely
 comfortable with the dynamic this sets up.
 
 What primarily concerns me is the issue of community accountability. A core
 feature of OpenStack's project & individual team governance is the idea
 of democratic elections, where the individual contributors can vote in
 people who they think will lead OpenStack in a positive way, or conversely
 hold leadership to account by voting them out next time. The ability of
 individuals contributors to exercise this freedom though, relies on the
 voters being well informed about what is happening in the community.
 
 If cases of bad community behaviour, such as use of passwd protected IRC
 channels, are always primarily dealt with via further private communications,
 then we are denying the voters the information they need to hold people to
 account. I can understand the desire to avoid publically shaming people
 right away, because the accusations may be false, or may be arising from a
 simple mis-understanding, but at some point genuine issues like this need
 to be public. Without this we make it difficult for contributors to make
 an informed decision at future elections.
 
 Right now, this thread has left me wondering whether there are still any
 projects which are using password protected IRC channels, or whether they
 have all been deleted, and whether I will be unwittingly voting for people
 who supported their use in future openstack elections.
 

Shaming a person is a last resort, when that person may not listen to
reason. It's sometimes necessary to bring shame to a practice, but even
then, those who are participating are now draped in shame as well and
will have a hard time saving face.

However, if we show respect to peoples' ideas, and take the time not
only to educate them on our values, but also to educate ourselves about
what motivates that practice, then I think we will have a much easier
time changing, or even accepting, these behaviors.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] Lets keep our community open, lets fight for it

2015-02-17 Thread Clint Byrum
Excerpts from Ed Leafe's message of 2015-02-17 10:11:01 -0800:
 On Feb 17, 2015, at 11:29 AM, Clint Byrum cl...@fewbar.com wrote:
 
  Shaming a person is a last resort, when that person may not listen to
  reason. It's sometimes necessary to bring shame to a practice, but even
  then, those who are participating are now draped in shame as well and
  will have a hard time saving face.
 
 Why must pointing out that someone is doing something incorrectly necessarily 
 be shaming? Those of us who review code do that all the time; telling someone 
 that there is a better way to code something is certainly not shaming, since 
 we all benefit from those suggestions.
 

Funny you should bring that up, that may be an entirely new branch of this
thread which is how harmful some of our review practices are to overall
community harmony. I definitely think there's a small amount of unhealthy
shaming in reviews, and a not small amount of non-constructive criticism.

Saying This code is not covered by tests. or You could make this less
complex by using a generator. is constructive criticism that has as
little shaming effect as possible without beating around the bush. This
is the very definition of _educating_.

However, being entirely subjective and attacking stylistic issues
(please know that I'm not claiming innocence at all here) does damage to
the relationship between coder and review team. Of course, a discussion
of style has a place, but I believe that place is in a private
conversation, not out in the open where it will almost certainly bring
shame to the submitter.

 Sure, you can also be a jerk about how you tell someone they can improve, but 
 that's certainly not the norm in this community.
 

I agree that the subjective stylistic nit picking comes in a polite way.
I think that only softens the blow to someone's ego and still conveys a
level of disrespect that will eventually erode the level of trust
between the submitter and the project as a whole.

So, somewhat ironically, I think the right place to make subjective
observations about someone's work is in a private message.

Unfortunately, I think humans are quite subjective themselves, and so
what might be too harsh and shameful to one ego, might be just the right
thing to educate the next. Calibration of one's criticism practices is
one of those things I'm sure most of us geeks would like to think we
don't have to worry about. However, I think it is worthwhile to consider
it before making any critique, especially when one doesn't know the
recipient of the critique extremely well.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] stepping down as core reviewer

2015-02-15 Thread Clint Byrum
Thanks Robert. I share most of your views on this. The project will
certainly miss your reviews. I'll go ahead and remove you from the
permissions and stats.

Excerpts from Robert Collins's message of 2015-02-15 13:40:02 -0800:
 Hi, I've really not been pulling my weight as a core reviewer in
 TripleO since late last year when personal issues really threw me for
 a while. While those are behind me now, and I had a good break over
 the christmas and new year period, I'm sufficiently out of touch with
 the current (fantastic) progress being made that I don't feel
 comfortable +2'ing anything except the most trivial things.
 
 Now the answer to that is to get stuck back in, page in the current
 blueprints and charge ahead - but...
 
 One of the things I found myself reflecting on during my break was the
 extreme fragility of the things we were deploying in TripleO - most of
 our time is spent fixing fallout from unintended, unexpected
 consequences in the system. I think its time to put some effort
 directly in on that in a proactive fashion rather than just reacting
 to whichever failure du jour is breaking deployments / scale /
 performance.
 
 So for the last couple of weeks I've been digging into the Nova
 (initially) bugtracker and code with an eye to 'how did we get this
 bug in the first place', and refreshing my paranoid
 distributed-systems-ops mindset: I'll be writing more about that
 separately, but its clear to me that there's enough meat there - both
 analysis, discussion, and hopefully execution - that it would be
 self-deceptive for me to think I'll be able to meaningfully contribute
 to TripleO in the short term.
 
 I'm super excited by Kolla - I think that containers really address
 the big set of hurdles we had with image based deployments, and if we
 can one-way-or-another get cinder and Ironic running out of
 containers, we should have a pretty lovely deployment story. But I
 still think helping on the upstream stuff more is more important for
 now. We'll see where we're at in a cycle or two :)
 
 -Rob
 

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [keystone] Deprecation of the auth_token fragments

2015-02-14 Thread Clint Byrum
Excerpts from Thomas Goirand's message of 2015-02-14 16:48:01 -0800:
 Hi,
 
 I've seen messages in the logs telling us that we should move to the
 identity_uri.
 
 I don't really like the identity_uri, which contains all of the
 information in a single directive, which means that a script that would
 edit it would need a lot more parsing work than simple key/value pair
 logic. This is error prone. The fragments don't have this issue.
 
 So, could we decide to:
 1/ Not remove the auth fragments
 2/ Remove the deprecation warnings
 

Automation has tended away from parsing and editing files in place for
a long time now. Typically you'd have a source of truth with all the
values, and a tool to turn that into a URL during file generation. This
isn't error prone in my experience.
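
As a trivial illustration, the single directive falls out of the
key/value source of truth at generation time (a sketch with made-up
values, not a recommendation for any particular tool):

    # Build identity_uri from separate settings while rendering the
    # config file, instead of editing the rendered file afterwards.
    auth = {'protocol': 'https', 'host': 'keystone.example.com', 'port': 35357}
    identity_uri = '%(protocol)s://%(host)s:%(port)d/' % auth
    # -> 'https://keystone.example.com:35357/'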

I don't really know why the single URL is preferred, but I don't think
the argument that it makes parsing and editing the config file with
external tools harder is strong enough to roll this deprecation back.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] Lets keep our community open, lets fight for it

2015-02-12 Thread Clint Byrum
Excerpts from Flavio Percoco's message of 2015-02-12 00:13:35 -0800:
 On 11/02/15 09:37 -0800, Clint Byrum wrote:
 Excerpts from Stefano Maffulli's message of 2015-02-11 06:14:39 -0800:
  On Wed, 2015-02-11 at 10:55 +0100, Flavio Percoco wrote:
   This email is dedicated to the openness of our community/project.
 
  It's good to have a reminder every now and then. Thank you Flavio for
  caring enough to notice bad patterns and for raising a flag.
 
   ## Keep discussions open
  
   I don't believe there's anything wrong about kicking off some
   discussions in private channels about specs/bugs. I don't believe
   there's anything wrong in having calls to speed up some discussions.
   HOWEVER, I believe it's *completely* wrong to consider those private
   discussions sufficient.
  [...]
 
  Well said. Conversations can happen anywhere and any time, but they
  should stay in open and accessible channels. Consensus needs to be built
  and decisions need to be shared, agreed upon by the community at large
  (and mailing lists are the most accessible media we have).
 
  That said, it is very hard to generalize and I'd rather deal/solve
  specific examples. Sometimes, I'm sure there are episodes when a fast
  decision was needed and a limited amount of people had to carry the
  burden of responsibility. Life is hard, software development is hard and
  general rules sometimes need to be adapted to the reality. Again, too
  much generalization here for what I'm comfortable with.
 
  Maybe it's worth repeating that I'm personally (and in my role)
  available to listen and mediate in cases when communication seems to
  happen behind closed doors. If you think something unhealthy is
  happening, talk to me (confidentiality assured).
 
   ## Mailing List vs IRC Channel
  
   I get it, our mailing list is freaking busy, keeping up with it is
   hard and time consuming and that leads to lots of IRC discussions.
 
  Not sure I agree with the causality but, the facts are those: traffic on
  the list and on IRC is very high (although not increasing anymore
  [1][2]).
 
I
   don't think there's anything wrong with that but I believe it's wrong
   to expect *EVERYONE* to be in the IRC channel when those discussions
   happen.
 
  Email is hard, I have the feeling that the vast majority of people use
  bad (they all suck, no joke) email clients. Lots and lots of email is
  even worse. Most contributors commit very few patches: the investment
  for them to configure their MUA to filter our traffic is too high.
 
  I have added more topics today to the openstack-dev list[3]. Maybe,
  besides filtering on the receiving end, we may spend some time
  explaining how to use mailman topics? I'll draft something on Ask, it
  may help those that have limited interest in OpenStack.
 
  What else can we do to make things better?
 
 
 I am one of those people who has a highly optimized MUA for mailing list
 reading. It is still hard. Even with one keypress to kill threads from
 view forever, and full text index searching, I still find it takes me
 an hour just to filter the don't want to see from the want to see
 threads each day.
 
 The filtering on the list-server side I think is not known by everybody,
 and it might be a good idea to socialize it even more, and maybe even
 invest in making the UI for it really straight forward for people to
 use.
 
 That said, even if you just choose [all], and [yourproject], some
 [yourproject] tags are pretty busy.
 
 Would it be helpful if we share our email clients configs so that
 others can use them? I guess we could have a section for this in the
 wiki page.
 
 I'm sure each one of us has his/her own server-side filters so, I
 guess we could start with those.
 

Great idea Flavio. I went ahead and created a github repository with my
sup-mail hook which tags everything with openstack-dev. The mail client
itself is where most of the magic happens, but being able to read all
the openstack-dev things and then all the not openstack-dev things
is quite important to my email workflow.

I called the repository FERK for Firehose Email Reading Kit. I'm
happy to merge pull requests if people want to share their other email
client configurations and also things like procmail filters.

https://github.com/SpamapS/ferk

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] Lets keep our community open, lets fight for it

2015-02-11 Thread Clint Byrum
Excerpts from Stefano Maffulli's message of 2015-02-11 06:14:39 -0800:
 On Wed, 2015-02-11 at 10:55 +0100, Flavio Percoco wrote:
  This email is dedicated to the openness of our community/project.
 
 It's good to have a reminder every now and then. Thank you Flavio for
 caring enough to notice bad patterns and for raising a flag. 
 
  ## Keep discussions open
  
  I don't believe there's anything wrong about kicking off some
  discussions in private channels about specs/bugs. I don't believe
  there's anything wrong in having calls to speed up some discussions.
  HOWEVER, I believe it's *completely* wrong to consider those private
  discussions sufficient. 
 [...]
 
 Well said. Conversations can happen anywhere and any time, but they
 should stay in open and accessible channels. Consensus needs to be built
 and decisions need to be shared, agreed upon by the community at large
 (and mailing lists are the most accessible media we have). 
 
 That said, it is very hard to generalize and I'd rather deal/solve
 specific examples. Sometimes, I'm sure there are episodes when a fast
 decision was needed and a limited amount of people had to carry the
 burden of responsibility. Life is hard, software development is hard and
 general rules sometimes need to be adapted to the reality. Again, too
 much generalization here for what I'm comfortable with.
 
 Maybe it's worth repeating that I'm personally (and in my role)
 available to listen and mediate in cases when communication seems to
 happen behind closed doors. If you think something unhealthy is
 happening, talk to me (confidentiality assured).
 
  ## Mailing List vs IRC Channel
  
  I get it, our mailing list is freaking busy, keeping up with it is
  hard and time consuming and that leads to lots of IRC discussions.
 
 Not sure I agree with the causality but, the facts are those: traffic on
 the list and on IRC is very high (although not increasing anymore
 [1][2]).
 
   I
  don't think there's anything wrong with that but I believe it's wrong
  to expect *EVERYONE* to be in the IRC channel when those discussions
  happen.
 
 Email is hard, I have the feeling that the vast majority of people use
 bad (they all suck, no joke) email clients. Lots and lots of email is
 even worse. Most contributors commit very few patches: the investment
 for them to configure their MUA to filter our traffic is too high.
 
 I have added more topics today to the openstack-dev list[3]. Maybe,
 besides filtering on the receiving end, we may spend some time
 explaining how to use mailman topics? I'll draft something on Ask, it
 may help those that have limited interest in OpenStack.
 
 What else can we do to make things better?
 

I am one of those people who has a highly optimized MUA for mailing list
reading. It is still hard. Even with one keypress to kill threads from
view forever, and full text index searching, I still find it takes me
an hour just to filter the don't want to see from the want to see
threads each day.

The filtering on the list-server side I think is not known by everybody,
and it might be a good idea to socialize it even more, and maybe even
invest in making the UI for it really straight forward for people to
use.

That said, even if you just choose [all], and [yourproject], some
[yourproject] tags are pretty busy.

  ## Cores are *NOT* special
  
  At some point, for some reason that is unknown to me, this message
  changed and the feeling of cores being some kind of superheroes became
  a thing. It's gotten far enough to the point that I've come to know
  that some projects even have private (flagged with +s), password
  protected, irc channels for core reviewers.
 
 This is seriously disturbing.
 
 If you're one of those core reviewers hanging out on a private channel,
 please contact me privately: I'd love to hear from you why we failed as
 a community at convincing you that an open channel is the place to be.
 
 No public shaming, please: education first.
 

I really like what you had to say above. I think we can do better and
I don't really blame those who've worked around OpenStack's problems
with their own solution. Whether or not that solution is in fact quite
dangerous for the project as a whole is another matter that we should
consider separately from why did these people feel a need to isolate
themselves?

I am confident this community will find a solution that works well
enough that we can move past this swiftly.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] Lets keep our community open, lets fight for it

2015-02-11 Thread Clint Byrum
Excerpts from Nikola Đipanov's message of 2015-02-11 05:26:47 -0800:
 On 02/11/2015 02:13 PM, Sean Dague wrote:
  
  If core team members start dropping off external IRC where they are
  communicating across corporate boundaries, then the local tribal effects
  start taking over. You get people start talking about the upstream as
  them. The moment we get into us vs. them, we've got a problem.
  Especially when the upstream project is them.
  
 
 A lot of assumptions being presented as fact here.
 
 I believe the technical term for the above is 'slippery slope fallacy'.
 

I don't see that fallacy, though it could descend into that if people
keep pushing in that direction. Where I think Sean did a nice job
stopping short of the slippery slope is that he only identified the step
that is happening _now_, not the next step.

I tend to agree that right now, if core team members are not talking
on IRC to other core members in the open, whether inside or outside
corporate boundaries, then we do see an us vs. them mentality happen.
It's not that I think that's the next step: I have personally seen that
happening and will work hard to stop it. I think Sean has probably seen
his share of it too, as that is what he described in detail without
publicly shaming anyone or any company (well done Sean).

 We can and _must_ do much better than this on this mailing list! Let's
 drag the discussion level back up!

I'm certain we can always improve, and I appreciate you taking the time
to have a Gandalf moment to stop the Balrog of fallacy from  entering
this thread. We seriously can't let the discussion slip down that
slope.. oh wait.

That said, I do want us to talk about uncomfortable things when
necessary. I think this thread is not something where it will be entirely
productive to stay 100% positive throughout. We might just have to use
some negative language along side our positive suggestions to make sure
people have an efficient way to measure their own behavior.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Glance][Artifacts] Object Version format: SemVer vs pep440

2015-02-10 Thread Clint Byrum
Excerpts from Alexander Tivelkov's message of 2015-02-10 07:28:55 -0800:
 Hi folks,
 
 One of the key features that we are adding to Glance with the
 introduction of Artifacts is the ability to have multiple versions of
 the same object in the repository: this gives us the possibility to
 query for the latest version of something, keep track on the changes
 history, and build various continuous delivery solutions on top of
 Artifact Repository.
 
 We need to determine the format and rules we will use to define,
 increment and compare versions of artifacts in the repository. There
 are two alternatives we have to choose from, and we are seeking advice
 on this choice.
 
 First, there is Semantic Versioning specification, available at [1].
 It is a very generic spec, widely used and adopted in many areas of
 software development. It is quite straightforward: 3 mandatory numeric
 components for version number, plus optional string labels for
 pre-release versions and build metadata.
 
 And then there is PEP-440 spec, which is a recommended approach to
 identifying versions and specifying dependencies when distributing
 Python. It is a pythonic way to set versions of python packages,
 including PIP version strings.
 
 Conceptually PEP-440 and Semantic Versioning are similar in purpose,
 but slightly different in syntax. Notably, the count of version number
 components and rules of version precedence resolution differ between
 PEP-440 and SemVer. Unfortunately, the two version string formats are
 not compatible, so we have to choose one or the other.
 
 According to my initial vision, the Artifact Repository should be as
 generic as possible in terms of potential adoption. The artifacts were
 never supposed to be python packages only, and even the projects which
 will create and use these artifacts are not mandatory limited to be
 pythonic, the developers of that projects may not be python
 developers! So, I'd really wanted to avoid any python-specific
 notations, such as PEP-440 for artifacts.
 
 I've put this vision into a spec [3] which also contains a proposal on
 how to convert the semver-compatible version strings into the
 comparable values which may be mapped to database types, so a database
 table may be queried, ordered and filtered by the object version.
 
 So, we need some feedback on this topic. Would you prefer artifacts to
 be versioned with SemVer or with PEP-440 notation? Are you interested
 in having some generic utility which will map versions (in either
 format) to database columns? If so, which version format would you
 prefer?
 
 We are on a tight schedule here, as we want to begin landing
 artifact-related code soon. So, I would appreciate your feedback
 during this week: here in the ML or in the comments to [3] review.
 

Hi. This is really interesting work and I'm glad Glance is growing into
an artifact catalog as I think it will assist cloud users and UI
development at the same time.

It seems to me that there are really only two reasons to care about the
content of the versions: sorting, and filtering. You want to make sure
if people upload artifacts named myapp like this:

myapp:1.0 myapp:2.0 myapp:1.1

That when they say show me the newest myapp they get 2.0, not 1.1.

And if they say show me the newest myapp in the 1.x series they get 1.1.

I am a little worried this is not something that can or should be made
generic in a micro service.

Here's a thought: You could just have the version, series, and sequence,
and let users manage the sequencing themselves on the client side. This
way if users want to use the _extremely_ difficult to program for Debian
packaging version format, you don't have to figure out how to make 1.0~special
less than 1.0 and more than 0.9.

To start with, you can have a default strategy of a single series, and
max(sequence)+1000 if unspecified. Then teach the clients the various
semvers/pep440's/etc. etc. and let them choose their own sequencing and
series strategy.
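
To make that concrete, here is a rough sketch of one possible client-side
sequencing strategy for plain three-part versions. The names and field
widths are invented for illustration; the point is only that the server
ever sees an opaque, sortable integer:

import re

def sequence_for(version):
    # Map a 'major.minor.patch' string to a sortable integer. This is
    # purely a client-side convention; the repository only stores and
    # orders the resulting sequence number.
    m = re.match(r'^(\d+)\.(\d+)\.(\d+)$', version)
    if m is None:
        raise ValueError('unsupported version string: %s' % version)
    major, minor, patch = (int(p) for p in m.groups())
    return (major * 1000 + minor) * 1000 + patch

# "show me the newest myapp" is ORDER BY sequence DESC LIMIT 1, and
# "newest in the 1.x series" is just a range query on the sequence column.
assert sequence_for('1.1.0') < sequence_for('2.0.0')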

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-09 Thread Clint Byrum
Excerpts from Jay Pipes's message of 2015-02-09 12:36:45 -0800:
 CAS is preferred because it is measurably faster and more 
 obstruction-free than SELECT FOR UPDATE. A colleague of mine is almost 
 ready to publish documentation showing a benchmark of this that shows 
 nearly a 100% decrease in total amount of lock/wait time using CAS 
 versus waiting for the coarser-level certification timeout to retry the 
 transactions. As mentioned above, I believe this is due to the dramatic 
 decrease in ROLLBACKs.
 

I think the missing piece of the puzzle for me was that each ROLLBACK is
an expensive operation. I figured it was like a non-local return (i.e.
'raise' in python or 'throw' in java) and thus not measurably different.
But now that I think of it, there is likely quite a bit of optimization
around the query path, and not so much around the rollback path.

The bottom of this rabbit hole is simply exquisite, isn't it? :)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-09 Thread Clint Byrum
Excerpts from Jay Pipes's message of 2015-02-09 10:15:10 -0800:
 On 02/09/2015 01:02 PM, Attila Fazekas wrote:
  I do not see why not to use `FOR UPDATE` even with multi-writer or
  Is the retry/swap way really solves anything here.
 snip
  Am I missed something ?
 
 Yes. Galera does not replicate the (internal to InnnoDB) row-level locks 
 that are needed to support SELECT FOR UPDATE statements across multiple 
 cluster nodes.
 
 https://groups.google.com/forum/#!msg/codership-team/Au1jVFKQv8o/QYV_Z_t5YAEJ
 

Attila acknowledged that. What Attila was saying was that by using it
with Galera, the box that is doing the FOR UPDATE locks will simply fail
upon commit because a conflicting commit has already happened and arrived
from the node that accepted the write. Further what Attila is saying is
that this means there is not such an obvious advantage to the CAS method,
since the rollback and the # updated rows == 0 are effectively equivalent
at this point, seeing as the prior commit has already arrived and thus
will not need to wait to fail certification and be rolled back.

I am not entirely certain that is true though, as I think what will
happen in sequential order is:

writer1: UPDATE books SET genre = 'Scifi' WHERE genre = 'sciencefiction';
writer1: -- send in-progress update to cluster
writer2: SELECT FOR UPDATE books WHERE id=3;
writer1: COMMIT
writer1: -- try to certify commit in cluster
** Here is where I stop knowing for sure what happens **
writer2: certifies writer1's transaction or blocks?
writer2: UPDATE books SET genre = 'sciencefiction' WHERE id=3;
writer2: COMMIT -- One of them is rolled back.

So, at that point where I'm not sure (please some Galera expert tell
me):

If what happens is as I suggest, writer1's transaction is certified,
then that just means the lock sticks around blocking stuff on writer2,
but that the data is updated and it is certain that writer2's commit will
be rolled back. However, if it blocks waiting on the lock to resolve,
then I'm at a loss to determine which transaction would be rolled back,
but I am thinking that it makes sense that the transaction from writer2
would be rolled back, because the commit is later.

All this is to say that usually the reason for SELECT FOR UPDATE is not
only to do an update (the transactional semantics handle that), but
also to prevent the old row from being seen again, which, as Jay says,
it cannot do on a Galera cluster. So I believe you are both correct:

* Attila, yes I think you're right that CAS is not any more efficient
at replacing SELECT FOR UPDATE from a blocking standpoint.

* Jay, yes I think you're right that SELECT FOR UPDATE is not the right
thing to use to do such reads, because one is relying on locks that are
meaningless on a Galera cluster.

Where I think the CAS ends up being the preferred method for this sort
of thing is where one considers that it won't hold a meaningless lock
while the transaction is completed and then rolled back.
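
To spell out the pattern for the archives, a minimal sketch of that
compare-and-swap style UPDATE, with sqlite3 and a made-up table standing
in for the real schema (the real code lives behind SQLAlchemy, of course):

import sqlite3

# Stand-in schema purely for illustration.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, vm_state TEXT)")
conn.execute("INSERT INTO instances VALUES (1, 'stopped')")

expected = 'stopped'  # the state we read earlier, without FOR UPDATE
cur = conn.execute(
    "UPDATE instances SET vm_state = 'deleting' "
    "WHERE id = ? AND vm_state = ?", (1, expected))
conn.commit()

if cur.rowcount == 0:
    # Lost the race: some other writer changed vm_state after we read it.
    # Re-read and retry (or give up), but nothing ever sat on a row lock
    # that Galera would not have replicated anyway.
    pass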

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat] operators vs users for choosing convergence engine

2015-02-06 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2015-02-06 06:25:57 -0800:
 On 03/02/15 14:12, Clint Byrum wrote:
  The visible change in making things parallel was minimal. In talking
  about convergence, it's become clear that users can and should expect
  something radically different when they issue stack updates. I'd love to
  say that it can be done to just bind convergence into the old ways, but
  doing so would also remove the benefit of having it.
 
  Also allowing resume wasn't a new behavior, it was fixing a bug really
  (that state was lost on failed operations). Convergence is a pretty
  different beast from the current model,
 
 That's not actually the case for Phase 1; really nothing much should 
 change from the user point of view, except that if you issue an update 
 before a previous one is finished then you won't get an error back any more.
 
 
 In any event, I think Angus's comment on the review is correct, we 
 actually have two different problems here. One is how to land the code, 
 and a config option is indisputably the right choice here: until many, 
 many blueprints have landed then the convergence code path will do 
 literally nothing at all. There is no conceivable advantage to users for 
 opting in to that.
 
 The second question, which we can continue to discuss, is whether to 
 allow individual users to opt in/out once operators have enabled the 
 convergence flow path. I'm not convinced that there is anything 
 particular special about this feature that warrants such a choice more 
 than any other feature that we have developed in the past. However, I 
 don't think we need to decide until around the time that we're preparing 
 to flip the default on. By that time we should have better information 
 about the level of stability we're dealing with, and we can get input 
 from operators on what kind of additional steps we should take to 
 maintain stability in the face of possible regressions.
 

All good points and it seems like a plan is forming that will help
operators deploy rapidly without forcing users to scramble too much.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Clint Byrum
Excerpts from Angus Lees's message of 2015-02-04 16:59:31 -0800:
 On Thu Feb 05 2015 at 9:02:49 AM Robert Collins robe...@robertcollins.net
 wrote:
 
  On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
   How interesting,
  
   Why are people using galera if it behaves like this? :-/
 
  Because its actually fairly normal. In fact its an instance of point 7
  on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
  oldest wiki pages :).
 
  In more detail, consider what happens in full isolation when you have
  the A and B example given, but B starts its transaction before A.
 
  B BEGIN
  A BEGIN
  A INSERT foo
  A COMMIT
  B SELECT foo - NULL
 
 
 Note that this still makes sense from each of A and B's individual view of
 the world.
 
 If I understood correctly, the big change with Galera that Matthew is
 highlighting is that read-after-write may not be consistent from the pov of
 a single thread.
 

No, that's not a complete picture.

What Matthew is highlighting is that after a commit, a new transaction
may not see the write if it is done on a separate node in the cluster.

In a single thread, using a single database session, a read after a
successful commit is guaranteed to read a version of the database
that existed after that commit. What it may not be consistent with is
subsequent writes which may have happened after the commit on other
servers, unless you use the sync wait.

 Not having read-after-write is *really* hard to code to (see for example x86
 SMP cache coherency, C++ threading semantics, etc which all provide
 read-after-write for this reason).  This is particularly true when the
 affected operations are hidden behind an ORM - it isn't clear what might
 involve a database call and sequencers (or logical clocks, etc) aren't made
 explicit in the API.
 
 I strongly suggest just enabling wsrep_causal_reads on all galera sessions,
 unless you can guarantee that the high-level task is purely read-only, and
 then moving on to something else ;)  If we choose performance over
 correctness here then we're just signing up for lots of debugging of hard
 to reproduce race conditions, and the fixes are going to look like what
 wsrep_casual_reads does anyway.
 
 (Mind you, exposing sequencers at every API interaction would be awesome,
 and I look forward to a future framework and toolchain that makes that easy
 to do correctly)
 

I'd like to see actual examples where that will matter. Meanwhile making
all selects wait for the cluster will basically just ruin responsiveness
and waste tons of time, so we should be careful to think this through
before making any blanket policy.

I'd also like to see consideration given to systems that handle
distributed consistency in a more active manner. etcd and Zookeeper are
both such systems, and might serve as efficient guards for critical
sections without raising latency.
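
For example, a guard with ZooKeeper via kazoo could look roughly like
this (the hosts and lock path are placeholders, and whether the extra
round trip is worth it is exactly what we'd need to measure):

from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

# Only one worker in the whole cluster gets into the critical section at
# a time, regardless of which Galera node it talks to.
lock = zk.Lock('/openstack/locks/instance-0001', 'worker-a')
with lock:
    # do the read-modify-write against the database here
    pass

zk.stop()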

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-05 Thread Clint Byrum
Excerpts from Avishay Traeger's message of 2015-02-04 22:19:53 -0800:
 On Wed, Feb 4, 2015 at 11:00 PM, Robert Collins robe...@robertcollins.net
 wrote:
 
  On 5 February 2015 at 10:24, Joshua Harlow harlo...@outlook.com wrote:
   How interesting,
  
   Why are people using galera if it behaves like this? :-/
 
  Because its actually fairly normal. In fact its an instance of point 7
  on https://wiki.openstack.org/wiki/BasicDesignTenets - one of our
  oldest wiki pages :).
 
 
 When I hear MySQL I don't exactly think of eventual consistency (#7),
 scalability (#1), horizontal scalability (#4), etc.
 For the past few months I have been advocating implementing an alternative
 to db/sqlalchemy, but of course it's a huge undertaking.  NoSQL (or even
 distributed key-value stores) should be considered IMO.  Just some food for
 thought :)
 

I know it is popular to think that MySQL* == old slow and low-scale, but
that is only popular with those who have not actually tried to scale
MySQL. You may want to have a chat with the people running MySQL at
Google, Facebook, and a long tail of not quite as big sites but still
massively bigger than most clouds. Note that many of the people who
helped those companies scale up are involved directly with OpenStack.

The NoSQL bits that are popular out there make the easy part easy. There
is no magic bullet for the hard part, which is when you need to do both
synchronous and asynchronous. Factor in its maturity and the breadth of
talent available, and I'll choose MySQL for this task every time.

* Please let's also give a nod to our friends working on MariaDB, a
  MySQL-compatible fork that many find preferable and, for the purposes
  of this discussion, equivalent.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder][neutron][security] Rootwrap on root-intensive nodes

2015-02-04 Thread Clint Byrum
Excerpts from Tristan Cacqueray's message of 2015-02-04 09:02:19 -0800:
 On 02/04/2015 06:57 AM, Daniel P. Berrange wrote:
  On Wed, Feb 04, 2015 at 11:58:03AM +0100, Thierry Carrez wrote:
  What solutions do we have ?
 
  (1) we could get our act together and audit and fix those filter
  definitions. Remove superfluous usage of root rights, make use of
  advanced filters for where we actually need them. We have been preaching
  for that at many many design summits. This is a lot of work though...
  There were such efforts in the past, but they were never completed for
  some types of nodes. Worse, the bad filter definitions kept coming back,
  since developers take shortcuts, reviewers may not have sufficient
  security awareness to detect crappy filter definitions, and I don't
  think we can design a gate test that would have such awareness.
 
  (2) bite the bullet and accept that some types of nodes actually need
  root rights for so many different things, they should just run as root
  anyway. I know a few distributions which won't be very pleased by such a
  prospect, but that would be a more honest approach (rather than claiming
  we provide efficient isolation when we really don't). An added benefit
  is that we could replace a number of shell calls by Python code, which
  would simplify the code and increase performance.
 
  (3) intermediary solution where we would run as the nova user but run
  sudo COMMAND directly (instead of sudo nova-rootwrap CONFIG COMMAND).
  That would leave it up to distros to choose between a blanket sudoer or
  maintain their own filtering rules. I think it's a bit hypocritical
  though (pretend the distros could filter if they wanted it, when we
  dropped the towel on doing that ourselves). I'm also not convinced it's
  more secure than solution 2, and it prevents from reducing the number of
  shell-outs, which I think is a worthy idea.
 
  In all cases I would not drop the baby with the bath water, and keep
  rootwrap for all the cases where root rights are needed on a very
  specific set of commands (like neutron, or nova's api-metadata). The
  daemon mode should address the performance issue for the projects making
  a lot of calls.
  
  
  (4) I think that ultimately we need to ditch rootwrap and provide a proper
  privilege separated, formal RPC mechanism for each project.
  
  eg instead of having a rootwrap command, or rootwrap server attempting
  to validate safety of
  
  qemu-img create -f qcow2 
  /var/lib/nova/instances/instance1/disk.qcow2
  
  we should have a  nova-compute-worker daemon running as root, that accepts
  an RPC command from nova-compute running unprivileged. eg
  
  CreateImage(instane0001, qcow2, disk.qcow)
  
  This immediately makes it trivial to validate that we're not trying to
  trick qemu-img into overwriting some key system file.
  
  This is certainly alot more work than trying to patchup rootwrap, but
  it would provide a level of security that rootwrap can never achieve IMHO.
  
 
 This 4th idea sounds interesting, though we are assuming this new service
 running as root would be exempt from bugs, especially if it uses the same
 libraries as non-root services... For example a major bug in python would
 give attacker direct root access while the rootwrap solution would in
 theory keep the intruder at the sudo level...
 

I don't believe that anyone assumes the new service would be without
bugs. But just like the OpenSSH team saw years ago, privilege separation
means that you can absolutely know what is running as root, and what is
not. So when you decide to commit your resources to code audits, you
_start_ with the things that run with elevated privileges.

 
 For completeness, I'd like to propose a more long-term solution:
 
 (5) Get ride of root! Seriously, OpenStack could support security mechanism
 like SELinux or AppArmor in order to properly isolate service and let them
 run what they need to run.
 
 For what it's worth, the underlying issue here is having a single almighty super
 user: root, and thus we should, at least, consider solutions that remove the
 need for such powers (e.g. kernel module loading, ptrace or raw socket).
 

We don't need a security module to drop all of those capabilities
entirely and run as a hobbled root user. By my measure, this process for
nova-compute would only need CAP_NET_ADMIN, CAP_SYS_ADMIN and CAP_KILL.
These capabilities can be audited per-agent and even verified as needed
simply by running integration tests without each one to see what breaks.
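
As a sketch of how that verification could work (the test command name is
a placeholder, and capsh has to be run as root to shrink the bounding set):

import subprocess

# Re-run the agent's integration tests with one capability at a time
# removed from the bounding set, and note what breaks.
for cap in ('cap_net_admin', 'cap_sys_admin', 'cap_kill'):
    rc = subprocess.call(['capsh', '--drop=%s' % cap, '--', '-c',
                          'run-nova-compute-integration-tests'])
    print('%s dropped -> exit code %d' % (cap, rc))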

 Besides, as long as sensitive processes are not contained at the system level,
 the attack surface for a non-root user is still very wide (e.g. system calls,
 setuid binaries, ipc, ...)
 
 
 While this might sound impossible to implement upstream because it's too
 vendor specific or just because of other technical difficulties,
 I guess it still deserves a mention in this thread.
 

I think OpenStack can do its part by making privilege 

Re: [openstack-dev] [nova][cinder][neutron][security] Rootwrap on root-intensive nodes

2015-02-04 Thread Clint Byrum
Excerpts from Daniel P. Berrange's message of 2015-02-04 03:57:53 -0800:
 On Wed, Feb 04, 2015 at 11:58:03AM +0100, Thierry Carrez wrote:
  The first one is performance -- each call would spawn a Python
  interpreter which would then call the system command. This was fine when
  there were just a few calls here and there, not so much when it's called
  a hundred times in a row. During the Juno cycle, a daemon mode was added
  to solve this issue. It is significantly faster than running sudo
  directly (the often-suggested alternative). Projects still have to start
  adopting it though. Neutron and Cinder have started work to do that in Kilo.
  
  The second problem is the quality of the filter definitions. Rootwrap is
  a framework to enable isolation. It's only as good as the filters each
  project defines. Most of them rely on CommandFilters that do not check
  any argument, instead of using more powerful filters (which are arguably
  more painful to maintain). Developers routinely add filter definitions
  that basically remove any isolation that might have been there, like
  allowing blank dd, tee, chown or chmod.
 
 I think this is really the key point which shows rootwrap as a concept
 is broken by design IMHO. Root wrap is essentially trying to provide an
 API for invoking privileged operations, but instead of actually designing
 an explicit API for the operations, we've done it via an implicit one based on
 command args. From a security POV I think this approach is doomed to
 failure, as command arg strings are far too expressive a concept
 to deal with.
 
  What solutions do we have ?
  
  (1) we could get our act together and audit and fix those filter
  definitions. Remove superfluous usage of root rights, make use of
  advanced filters for where we actually need them. We have been preaching
  for that at many many design summits. This is a lot of work though...
  There were such efforts in the past, but they were never completed for
  some types of nodes. Worse, the bad filter definitions kept coming back,
  since developers take shortcuts, reviewers may not have sufficient
  security awareness to detect crappy filter definitions, and I don't
  think we can design a gate test that would have such awareness.
  
  (2) bite the bullet and accept that some types of nodes actually need
  root rights for so many different things, they should just run as root
  anyway. I know a few distributions which won't be very pleased by such a
  prospect, but that would be a more honest approach (rather than claiming
  we provide efficient isolation when we really don't). An added benefit
  is that we could replace a number of shell calls by Python code, which
  would simplify the code and increase performance.
  
  (3) intermediary solution where we would run as the nova user but run
  sudo COMMAND directly (instead of sudo nova-rootwrap CONFIG COMMAND).
  That would leave it up to distros to choose between a blanket sudoer or
  maintain their own filtering rules. I think it's a bit hypocritical
  though (pretend the distros could filter if they wanted it, when we
  dropped the towel on doing that ourselves). I'm also not convinced it's
  more secure than solution 2, and it prevents from reducing the number of
  shell-outs, which I think is a worthy idea.
  
  In all cases I would not drop the baby with the bath water, and keep
  rootwrap for all the cases where root rights are needed on a very
  specific set of commands (like neutron, or nova's api-metadata). The
  daemon mode should address the performance issue for the projects making
  a lot of calls.
 
 
 (4) I think that ultimately we need to ditch rootwrap and provide a proper
 privilege separated, formal RPC mechanism for each project.
 
 eg instead of having a rootwrap command, or rootwrap server attempting
 to validate safety of
 
 qemu-img create -f qcow2 /var/lib/nova/instances/instance1/disk.qcow2
 
 we should have a  nova-compute-worker daemon running as root, that accepts
 an RPC command from nova-compute running unprivileged. eg
 
 CreateImage(instane0001, qcow2, disk.qcow)
 
 This immediately makes it trivial to validate that we're not trying to
 trick qemu-img into overwriting some key system file.
 
 This is certainly alot more work than trying to patchup rootwrap, but
 it would provide a level of security that rootwrap can never achieve IMHO.
 

+1, I think you're right on, Daniel. Count me in for future discussions
and work on this.
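
To make the shape of it concrete, one handler in such a nova-compute-worker
might look roughly like this; the names, directory layout and RPC plumbing
are all assumptions for illustration, not an existing API:

import os
import subprocess

INSTANCES_DIR = '/var/lib/nova/instances'  # assumed layout

def create_image(instance_id, disk_format, file_name, size_gb):
    # Called over RPC by the unprivileged nova-compute. Validation is
    # done on structured arguments, not on a raw command string.
    if disk_format not in ('qcow2', 'raw'):
        raise ValueError('unsupported format: %s' % disk_format)
    for part in (instance_id, file_name):
        if '/' in part or part in ('.', '..'):
            raise ValueError('path components are not allowed')
    path = os.path.join(INSTANCES_DIR, instance_id, file_name)
    subprocess.check_call(
        ['qemu-img', 'create', '-f', disk_format, path, '%dG' % int(size_gb)])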

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Clint Byrum
Excerpts from Matthew Booth's message of 2015-02-04 08:30:32 -0800:
 * Write followed by read on a different node can return stale data
 
 During a commit, Galera replicates a transaction out to all other db
 nodes. Due to its design, Galera knows these transactions will be
 successfully committed to the remote node eventually[2], but it doesn't
 commit them straight away. The remote node will check these outstanding
 replication transactions for write conflicts on commit, but not for
 read. This means that you can do:
 
 A: start transaction;
 A: insert into foo values(1)
 A: commit;
 B: select * from foo; -- May not contain the value we inserted above[3]
 
 This means that even for 'synchronous' slaves, if a client makes an RPC
 call which writes a row to write master A, then another RPC call which
 expects to read that row from synchronous slave node B, there's no
 default guarantee that it'll be there.
 
 Galera exposes a session variable which will fix this: wsrep_sync_wait
 (or wsrep_causal_reads on older mysql). However, this isn't the default.
 It presumably has a performance cost, but I don't know what it is, or
 how it scales with various workloads.
 

wsrep_sync_wait/wsrep_causal_reads doesn't actually hit the cluster
any harder; it simply tells the local Galera node: if you're not caught
up to the highest known sync point, don't answer queries yet. So it
will slow down that particular query as it waits for an update from the
leader about the sync point and, if necessary, waits for the local engine
to catch up to that point. However, it isn't going to push that query
off to all the other boxes or anything like that.
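
A reasonable middle ground, then, is to opt in per session only around the
reads that genuinely need read-after-write across nodes. Roughly, with
placeholder credentials and table, and PyMySQL purely as an example client:

import pymysql

conn = pymysql.connect(host='galera-node-2', user='nova',
                       password='secret', db='nova')
cur = conn.cursor()
# wsrep_causal_reads = ON is the equivalent knob on older releases.
cur.execute("SET SESSION wsrep_sync_wait = 1")
cur.execute("SELECT vm_state FROM instances WHERE uuid = %s",
            ('some-instance-uuid',))
row = cur.fetchone()
# Drop back to fast, purely local reads for everything else.
cur.execute("SET SESSION wsrep_sync_wait = 0")
cur.close()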

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

2015-02-04 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2015-02-04 13:24:20 -0800:
 How interesting,
 
 Why are people using galera if it behaves like this? :-/
 

Note that any true MVCC database will roll back transactions on
conflicts. One must always have a deadlock detection algorithm of
some kind.

Galera behaves like this because it is enormously costly to be synchronous
at all times for everything. So it is synchronous when you want it to be,
and async when you don't.

Note that it's likely NDB (aka MySQL Cluster) would work fairly well
for OpenStack's workloads, and does not suffer from this. However, it
requires low latency high bandwidth links between all nodes (infiniband
recommended) or it will just plain suck. So Galera is a cheaper, easier
to tune and reason about option.

 Are the people that are using it know/aware that this happens? :-/
 

I think the problem really is that it is somewhat de facto, and used
without being tested. The gate doesn't set up a three node Galera db and
test that OpenStack works right. Also it is inherently a race condition,
and thus will be a hard one to test.

That's where having knowledge of it, and taking the time to engineer a
solution that makes sense, is really the best course I can think of.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat] operators vs users for choosing convergence engine

2015-02-03 Thread Clint Byrum
Excerpts from Angus Salkeld's message of 2015-02-03 02:40:44 -0800:
 On Tue, Feb 3, 2015 at 10:52 AM, Steve Baker sba...@redhat.com wrote:
 
  A spec has been raised to add a config option to allow operators to choose
  whether to use the new convergence engine for stack operations. For some
  context you should read the spec first [1]
 
  Rather than doing this, I would like to propose the following:
  * Users can (optionally) choose which engine to use by specifying an
  engine parameter on stack-create (choice of classic or convergence)
  * Operators can set a config option which determines which engine to use
  if the user makes no explicit choice
  * Heat developers will set the default config option from classic to
  convergence when convergence is deemed sufficiently mature
 
  I realize it is not ideal to expose this kind of internal implementation
  detail to the user, but choosing convergence _will_ result in different
  stack behaviour (such as multiple concurrent update operations) so there is
  an argument for giving the user the choice. Given enough supporting
  documentation they can choose whether convergence might be worth trying for
  a given stack (for example, a large stack which receives frequent updates)
 
  Operators likely won't feel they have enough knowledge to make the call
  that a heat install should be switched to using all convergence, and users
  will never be able to try it until the operators do (or the default
  switches).
 
  Finally, there are also some benefits to heat developers. Creating a whole
  new gate job to test convergence-enabled heat will consume its share of CI
  resource. I'm hoping to make it possible for some of our functional tests
  to run against a number of scenarios/environments. Being able to run tests
  under classic and convergence scenarios in one test run will be a great
  help (for performance profiling too).
 
 
 Hi
 
 I didn't have a good initial response to this, but it's growing on me. One
 issue is the specific option that we expose, it's not nice having
 a dead option once we totally switch over and remove classic. So is it
 worth coming up with a real feature that convergence-phase-1 enables
 and use that (like enable-concurrent-updates). Then we need to think if we
 would actually want to keep that feature around (as in
 once classic is gone is it possible to maintain
 disable-concurrent-update).
 

There are other features of convergence that will be less obvious.
Having stack operations survive a restart of the engines is a pretty big
one that might be hard to grasp at first, but will be appreciated by
users. Also, being able to push a bigger stack in will be a large benefit,
though perhaps not one that is realized on day 1.

Anyway, I'd prefer that they just be versioned, and not named. The names
are too implementation specific. A v1 stack will be expected to work
with v1 stack tested templates and parameters for as long as we support
v1 stacks.

A v2 stack will be expected to work similarly, but may act differently,
and thus a user can treat this as another API update that they need to
deal with. The features will be a force multiplier, but the recommendation
of the team by removing the experimental tag will be the primary
motivator. And for operators, when they're comfortable with new stacks all
going to v2 they can enable that as the default. If they trust the Heat
developers, they can just go to v2 as default when the Heat devs say so.

Once we get all of the example templates to work with v2 and write some
new v2-specific stacks, thats the time to write a migration tool and
deprecate v1.

So, to be clear, I'm fully in support of Steve Baker's suggestion to let
the users choose which engine to use. However, I think we should treat
it not as an engine choice, but as an interface choice. The fact that
it takes a whole new engine to support the new features of the interface
is the implementation detail that no end-user needs to care about.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat] operators vs users for choosing convergence engine

2015-02-03 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2015-02-03 10:00:44 -0800:
 On 02/02/15 19:52, Steve Baker wrote:
  A spec has been raised to add a config option to allow operators to
  choose whether to use the new convergence engine for stack operations.
  For some context you should read the spec first [1]
 
  Rather than doing this, I would like to propose the following:
 
 I am strongly, strongly opposed to making this part of the API.
 
  * Users can (optionally) choose which engine to use by specifying an
  engine parameter on stack-create (choice of classic or convergence)
  * Operators can set a config option which determines which engine to use
  if the user makes no explicit choice
  * Heat developers will set the default config option from classic to
  convergence when convergence is deemed sufficiently mature
 
 We'd also need a way for operators to prevent users from enabling 
 convergence if they're not ready to support it.
 

This would be relatively simple to do by providing a list of the
supported stack versions.
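
Something roughly like this on the Heat side would do it; the option name
is hypothetical, not anything that exists today:

from oslo_config import cfg

CONF = cfg.CONF
CONF.register_opts([
    cfg.ListOpt('enabled_stack_versions', default=['v1'],
                help='Stack interface versions users may request.'),
], group='stacks')

def check_requested_version(requested):
    # Reject a stack-create that asks for an interface version the
    # operator has not enabled.
    if requested not in CONF.stacks.enabled_stack_versions:
        raise ValueError('stack version %s is not enabled here' % requested)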

  I realize it is not ideal to expose this kind of internal implementation
  detail to the user, but choosing convergence _will_ result in different
  stack behaviour (such as multiple concurrent update operations) so there
  is an argument for giving the user the choice. Given enough supporting
  documentation they can choose whether convergence might be worth trying
  for a given stack (for example, a large stack which receives frequent
  updates)
 
 It's supposed to be a strict improvement; we don't need to ask 
 permission. We have made major changes of this type in practically every 
 Heat release. When we switched from creating resources serially to 
 creating them in parallel in Havana we didn't ask permission. We just 
 did it. We when started allowing users to recover from a failed 
 operation in Juno we didn't ask permission. We just did it. We don't 
 need to ask permission to allow concurrent updates. We can just do it.
 

The visible change in making things parallel was minimal. In talking
about convergence, it's become clear that users can and should expect
something radically different when they issue stack updates. I'd love to
say that it can be done to just bind convergence into the old ways, but
doing so would also remove the benefit of having it.

Also, allowing resume wasn't a new behavior; it was really fixing a bug
(that state was lost on failed operations). Convergence is a pretty
different beast from the current model, and letting users fall back
to the old one means that when things break they can solve their own
problem while the operator and devs figure it out. The operator may know
what is breaking their side, but they may have very little idea of what
is happening on the end-user's side.

 The only difference here is that we are being a bit smarter and 
 uncoupling our development schedule from the release cycle. There are 15 
 other blueprints, essentially all of which have to be complete before 
 convergence is usable at all. It won't do *anything at all* until we are 
 at least 12 blueprints in. The config option buys us time to land them 
 without the risk of something half-finished appearing in the release 
 (trunk-chasers will also thank us). It has no other legitimate purpose IMO.
 

The config option only really allows an operator to go forward. If
the users start expecting concurrent updates and resiliency, and then
all their stacks are rolled back to the old engine because #reasons,
this puts pressure on the operator. This will make operators delay the
forward progress onto convergence for as long as possible.

I'm also not entirely sure rolling the config option back to the old
setting would even be possible without breaking any in-progress stacks.

 The goal is IN NO WAY to maintain separate code paths in the long term. 
 The config option is simply a development strategy to allow us to land 
 code without screwing up a release and while maintaining as much test 
 coverage as possible.
 

Nobody plans to maintain the Keystone v2 domainless implementation forever
either. But letting users consider domains and other v3 options for a while
means that the ecosystem grows more naturally without giving up ground
to instability. Once the v3 adoption rate is high enough, people will
likely look at removing the old code because nobody uses it. In my
opinion OpenStack has been far too eager to deprecate and remove things
that users rely on, but I do think this will happen and should happen
eventually.

  Operators likely won't feel they have enough knowledge to make the call
  that a heat install should be switched to using all convergence, and
  users will never be able to try it until the operators do (or the
  default switches).
 
 Hardly anyone should have to make a call. We should flip the default as 
 soon as all of the blueprints have landed (i.e. as soon as it works at 
 all), provided that a release is not imminent. (Realistically, at this 
 

Re: [openstack-dev] [Heat][Keystone] Native keystone resources in Heat

2015-01-29 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2015-01-29 08:41:36 -0800:
 I got a question today about creating keystone users/roles/tenants in 
 Heat templates. We currently support creating users via the 
 AWS::IAM::User resource, but we don't have a native equivalent.
 
 IIUC keystone now allows you to add users to a domain that is otherwise 
 backed by a read-only backend (i.e. LDAP). If this means that it's now 
 possible to configure a cloud so that one need not be an admin to create 
 users then I think it would be a really useful thing to expose in Heat. 
 Does anyone know if that's the case?
 

I think you got that a little backward. Keystone lets you have domains
that are read/write, and domains that are read-only. So you can have
the real users in LDAP and then give a different class of user their
own keystone-only domain that they can control.

That is a bit tangential to the real functionality gap, which I think
is a corner case but worth exploring. Being able to create a user in a
domain that the user provides credentials for is a useful thing. A user
may want to deploy their own instance control mechanism (like standalone
Heat!) for instance, and having a limited-access user for this created
by a domain admin with credentials that are only ever stored in Heat
seems like a win. Some care is needed to make sure the role can't just
'stack show' on Heat and grab the admin creds, but that seems like
something that would go in a deployer guide... something like "Make sure
domain admins know not to give delegated users the 'heat-user' role."

 I think roles and tenants are likely to remain admin-only, but we have 
 precedent for including resources like that in /contrib... this seems 
 like it would be comparably useful.
 

I feel like admin-only things will matter as soon as real multi-cloud
support exists in Heat. What I really want is to have a Heat in my
management cloud that reaches into my managed cloud when necessary.
Right now in TripleO we have to keep anything admin-only out of the
Heat templates and run the utilities from os-cloud-config somewhere
because we don't want admin credentials (even just trusts) in Heat. But
if we could use the deployment (under) cloud's Heat to reach into the
user (over) cloud to add users, roles, networks, etc., then that would
maintain the separation our security auditors desire.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo.db] PyMySQL review

2015-01-29 Thread Clint Byrum
Excerpts from Vishvananda Ishaya's message of 2015-01-29 10:21:58 -0800:
 
 On Jan 29, 2015, at 8:57 AM, Roman Podoliaka rpodoly...@mirantis.com wrote:
 
  Jeremy,
  
  I don't have exact numbers, so yeah, it's just an assumption based on
  looking at the nova-api/scheduler logs with connection_debug set to
  100.
  
  But that's a good point you are making here: it will be interesting to
  see what difference enabling of PyMySQL will make for tempest/rally
  workloads, rather than just running synthetic tests. I'm going to give
  it a try on my devstack installation.
 
 
 FWIW I tested this a while ago on some perf tests on nova and cinder that we
 run internally and I found pymysql to be slower by about 10%. It appears that
 we were cpu bound in python more often than we were blocking talking to the
 db. I do recall someone doing a similar test in neutron saw some speedup,
 however. On our side we also exposed a few race conditions which made it less
 stable. We hit a few hard deadlocks in volume create IIRC. 
 
 I don’t think switching is going to give us much benefit right away. We will
 need a few optimizations and bugfixes in other areas (particularly in our
 sqlalchemy usage) before we will derive any benefit from the switch.
 

No magic bullets, right? I think we can all resolve this statement in
our heads though: "fast and never concurrent" will eventually lose to
"concurrent and potentially fast with optimizations".

The question is, how long does the hare (python-mysqldb) have to sleep
before the tortoise (PyMySQL) wins? Right now it's still a close race, but
if the tortoise even gains 10% speed, it likely becomes no contest at all.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo.db] PyMySQL review

2015-01-28 Thread Clint Byrum
Excerpts from Johannes Erdfelt's message of 2015-01-28 15:33:25 -0800:
 On Wed, Jan 28, 2015, Mike Bayer mba...@redhat.com wrote:
  I can envision turning this driver into a total monster, adding
  C-speedups where needed but without getting in the way of async
  patching, adding new APIs for explicit async, and everything else.
  However, I’ve no idea what the developers have an appetite for.
 
 This is great information. I appreciate the work on evaluating it.
 
 Can I bring up the alternative of dropping eventlet and switching to
 native threads?
 
 We spend a lot of time working on the various incompatibilies between
 eventlet and other libraries we use. It also restricts us by making it
 difficult to use an entire class of python modules (that use C
 extensions for performance, etc).
 
 I personally have spent more time than I wish to admit fixing bugs in
 eventlet and troubleshooting problems we've had.
 
 And it's never been clear to me why we *need* to use eventlet or
 green threads in general.
 
 Our modern Nova appears to only be weakly tied to eventlet and greenlet.
 I think we would spend less time replacing eventlet with native threads
 than we'll spend in the future trying to fit our code and dependencies
 into the eventlet shaped hole we currently have.
 
 I'm not as familiar with the code in other OpenStack projects, but from
 what I have seen, they appear to be similar to Nova and are only weakly
 tied to eventlet/greenlet.

As is often the case with threading, a reason to avoid using it is
that libraries often aren't able or willing to assert thread safety.

That said, one way to fix that is to fix those libraries that we do
want to use to be thread safe. :)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tc] do we really need project tags in the governance repository?

2015-01-27 Thread Clint Byrum
Excerpts from Thierry Carrez's message of 2015-01-27 02:46:03 -0800:
 Doug Hellmann wrote:
  On Mon, Jan 26, 2015, at 12:02 PM, Thierry Carrez wrote:
  [...]
  I'm open to alternative suggestions on where the list of tags, their
  definition and the list projects they apply to should live. If you don't
  like that being in the governance repository, what would have your
  preference ?
  
  From the very beginning I have taken the position that tags are by
  themselves not sufficiently useful for evaluating projects. If someone
  wants to choose between Ceilometer, Monasca, or StackTach, we're
  unlikely to come up with tags that will let them do that. They need
  in-depth discussions of deployment options, performance characteristics,
  and feature trade-offs.
 
 They are still useful to give people a chance to discover that those 3
 are competing in the same space, and potentially get an idea of which
 one (if any) is deployed on more than one public cloud, better
 documented, or security-supported. I agree with you that an
 (opinionated) article comparing those 3 solutions would be a nice thing
 to have, but I'm just saying that basic, clearly-defined reference
 project metadata still has a lot of value, especially as we grow the
 number of projects.
 

I agree with your statement that summary reference metadata is useful. I
agree with Doug that it is inappropriate for the TC to assign it.

  That said, I object to only saying this is all information that can be
  found elsewhere or should live elsewhere, because that is just keeping
  the current situation -- where that information exists somewhere but
  can't be efficiently found by our downstream consumers. We need a
  taxonomy and clear definitions for tags, so that our users can easily
  find, understand and navigate such project metadata.
  
  As someone new to the project, I would not think to look in the
  governance documents for state information about a project. I would
  search for things like install guide openstack or component list
  openstack and expect to find them in the documentation. So I think
  putting the information in those (or similar) places will actually make
  it easier to find for someone that hasn't been involved in the
  discussion of tags and the governance repository.
 
 The idea here is to have the reference information in some
 Gerrit-controlled repository (currently openstack/governance, but I'm
 open to moving this elsewhere), and have that reference information
 consumed by the openstack.org website when you navigate to the
 Software section, to present a browseable/searchable list of projects
 with project metadata. I don't expect anyone to read the YAML file from
 the governance repository. On the other hand, the software section of
 the openstack.org website is by far the most visited page of all our web
 properties, so I expect most people to see that.
 

Just like we gather docs and specs into single websites, we could also
gather project metadata. Let the projects set their tags. One thing
that might make sense for the TC to do is to elevate certain tags to
a more important status and provide guidance on when to use them.
However, the actual project-to-tag mapping would work quite well
as a single file in whatever repository the project team thinks would
be the best starting point for a new user.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Openstack-operators] [openstack-operators] [Keystone] flush expired tokens and moves deleted instance

2015-01-27 Thread Clint Byrum

Excerpts from Tim Bell's message of 2015-01-25 22:10:10 -0800:
 This is often mentioned as one of those items which catches every OpenStack 
 cloud operator at some time. It's not clear to me that there could not be a 
 scheduled job built into the system with a default frequency (configurable, 
 ideally).
 
 If we are all configuring this as a cron job, is there a reason that it could 
 not be built into the code ?
 
It has come up before.

The main reason not to build it into the code is that it's even better to
just _never store tokens_:

https://blueprints.launchpad.net/keystone/+spec/non-persistent-tokens
http://git.openstack.org/cgit/openstack/keystone-specs/plain/specs/juno/non-persistent-tokens.rst

or just use certs:

https://blueprints.launchpad.net/keystone/+spec/keystone-tokenless-authz-with-x509-ssl-client-cert

The general thought is that putting lots of things in the database that
don't need to be stored anywhere is a bad idea. The need for the cron
job is just a symptom of that bug.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Openstack-operators] [openstack-operators] [Keystone] flush expired tokens and moves deleted instance

2015-01-27 Thread Clint Byrum
The problem with running in memcached is that now you have to keep _EVERY_
token in RAM. This is not any cheaper than cleaning out a giant on-disk
table.

Also worth noting is that memcached can produce frustrating results unless
you run it with -M. That is because without -M, your tokens may be removed
well before their expiration, and well before memcached fills up, if the
slabs that were allocated in the early days of running are full.
Also, single users that have many tokens will overrun memcached's per-item
size limit with their token ID list.

There's no magic bullet.. just trade-offs that may or may not work well
for your site.

Excerpts from John Dewey's message of 2015-01-27 10:41:33 -0800:
 This is one reason to use the memcached backend. Why replicate these tokens 
 in the first place. 
 
 On Tuesday, January 27, 2015 at 10:21 AM, Clint Byrum wrote:
 
  
  Excerpts from Tim Bell's message of 2015-01-25 22:10:10 -0800:
   This is often mentioned as one of those items which catches every 
   OpenStack cloud operator at some time. It's not clear to me that there 
   could not be a scheduled job built into the system with a default 
   frequency (configurable, ideally).
   
   If we are all configuring this as a cron job, is there a reason that it 
   could not be built into the code ?
  It has come up before.
  
  The main reason not to build it into the code as it's even better to
  just _never store tokens_:
  
  https://blueprints.launchpad.net/keystone/+spec/non-persistent-tokens
  http://git.openstack.org/cgit/openstack/keystone-specs/plain/specs/juno/non-persistent-tokens.rst
  
  or just use certs:
  
  https://blueprints.launchpad.net/keystone/+spec/keystone-tokenless-authz-with-x509-ssl-client-cert
  
  The general thought is that putting lots of things in the database that
  don't need to be stored anywhere is a bad idea. The need for the cron
  job is just a symptom of that bug.
  
  __
  OpenStack Development Mailing List (not for usage questions)
  Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe 
  (mailto:openstack-dev-requ...@lists.openstack.org?subject:unsubscribe)
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
  
  

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tc][python-clients] More freedom for all python clients

2015-01-26 Thread Clint Byrum
Excerpts from Robert Collins's message of 2015-01-26 12:29:37 -0800:
 On 27 January 2015 at 09:01, Joe Gordon joe.gord...@gmail.com wrote:
 
 
  On Wed, Jan 21, 2015 at 5:03 AM, Sean Dague s...@dague.net wrote:
 
  On 01/20/2015 08:15 PM, Robert Collins wrote:
   On 21 January 2015 at 10:21, Clark Boylan cboy...@sapwetik.org wrote:
   ...
   This ml thread came up in the TC meeting today and I am responding here
   to catch the thread up with the meeting. The soft update option is the
   suggested fix for non openstack projects that want to have most of
   their
   requirements managed by global requirements.
  
   For the project structure reform opening things up we should consider
   loosening the criteria to get on the list and make it primarily based
   on
   technical criteria such as py3k support, license compatibility,
   upstream
   support/activity, and so on (basically the current criteria with less
   of
   a focus on where the project comes from if it is otherwise healthy).
   Then individual projects would choose the subset they need to depend
   on.
   This model should be viable with different domains as well if we go
   that
   route.
  
   The following is not from the TC meeting but addressing other portions
   of this conversation:
  
   At least one concern with this option is that as the number of total
   requirements goes up is the difficulty in debugging installation
   conflicts becomes more difficult too. I have suggested that we could
   write tools to help with this. Install bisection based on pip logs for
   example, but these tools are still theoretical so I may be
   overestimating their usefulness.
  
   To address the community scaling aspect I think you push a lot of work
   back on deployers/users if we don't curate requirements for anything
   that ends up tagged as production ready (or whatever the equivalent
   tag becomes). Essentially we are saying this doesn't scale for us so
   now you deal with the fallout. Have fun, which isn't very friendly to
   people consuming the software. We already have an absurd number of
   requirements and management of them has appeared to scale. I don't
   foresee my workload going up if we open up the list as suggested.
  
   Perhaps I missed something, but the initial request wasn't about
   random packages, it was about other stackforge clients - these are
   things in the ecosystem! I'm glad we have technical solutions, but it
   just seems odd to me that adding them would ever have been
   controversial.
 
  Well, I think Clark and I have different opinions of how much of a pain
  unwinding the requirements are, and how long these tend to leave the
  gate broken. I am happy to also put it in a somebody elses problem
  field for resolving the issues. :)
 
  Honestly, I think we're actually at a different point, where we need to
  stop assuming that the sane way to deal with python is to install it
  into system libraries, and just put every service in a venv and get rid
  of global requirements entirely. Global requirements was a scaling fix
  for getting to 10 coexisting projects. I don't think it actually works
  well with 50 ecosystem projects. Which is why I proposed the domains
  solution instead.
 
 
  ++ using per service virtual environments would help us avoid a whole class
  of nasty issues. On the flip side doing this makes things harder for distros
  to find a set of non-conflicting dependencies etc.
 
 
   On the pip solver side, joe gordon was working on a thing to install a
   fixed set of packages by bypassing the pip resolver... not sure how
   thats progressing.
 
  I think if we are talking seriously about bypassing the pip resolver, we
  should step back and think about that fact. Because now we're producting
  a custom installation process that will produce an answer for us, which
  is completely different than any answer that anyone else is getting for
  how to get a coherent system.
 
 
  Fully agreed, I am looking into avoiding pips dependency solver for stable
  branches only right now. But using per service venvs would be even better.
 

moved post to bottom for us backwards folk who see the quotes in
original order
 TripleO has done per service venvs for a couple years now, and it
 doesn't solve the fragility issue that our unbounded deps cause. It
 avoids most but not all conflicting deps within OpenStack, and none of
 the 'upstream broke us' cases.
 

Note that we are not testing per-service venvs anymore because of the
extremely high cost of building the controller images with so many separate
venvs. We just put the openstack namespaced pieces in one big openstack
venv now.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] Part two of core reviewer team -- pruning

2015-01-15 Thread Clint Byrum
Hello! Now that we've added James, I have some suggestions for members
that should be dropped.

I have communicated with some of these individuals and confirmed they
are not interested in continuing. So for posterity sake I'm noting
these two removals, effective immediately:

Tzu-mainn Chen
Imre Farkas

The following members have not submitted any reviews in 90 days, but I
wasn't able to contact them immediately. If you are on this list and
would like to stay or step down, please reply directly to me or on the
list. Since we still have data blips from the traditional December
break, I don't think we should think too much about the numbers beyond
this until next month.


Members with no reviews in 90 days:

Martyn Taylor
Radomir Dopieralski

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] nominating James Polley for tripleo-core

2015-01-15 Thread Clint Byrum
In about 24 hours we've seen 9 core +1's, one non-core +1, and only one
dissenting opinion from James himself, which I think we have properly
dismissed. With my nomination counting as an additional +1, that is 10,
which is 50% of the 20 cores active the last 90 days.

I believe this vote has carried. Please welcome James Polley to the
TripleO core reviewer team. :)

Excerpts from Clint Byrum's message of 2015-01-14 10:14:45 -0800:
 Hello! It has been a while since we expanded our review team. The
 numbers aren't easy to read with recent dips caused by the summit and
 holidays. However, I believe James has demonstrated superb review skills
 and a commitment to the project that shows broad awareness of the
 project.
 
 Below are the results of a meta-review I did, selecting recent reviews
 by James with comments and a final score. I didn't find any reviews by
 James that I objected to.
 
 https://review.openstack.org/#/c/133554/ -- Took charge and provided
 valuable feedback. +2
 https://review.openstack.org/#/c/114360/ -- Good -1 asking for better
 commit message and then timely follow-up +1 with positive comments for
 more improvement. +2
 https://review.openstack.org/#/c/138947/ -- Simpler review, +1'd on Dec.
 19 and no follow-up since. Allowing 2 weeks for holiday vacation, this
 is only really about 7 - 10 working days and acceptable. +2
 https://review.openstack.org/#/c/146731/ -- Very thoughtful -1 review of
 recent change with alternatives to the approach submitted as patches.
 https://review.openstack.org/#/c/139876/ -- Simpler review, +1'd in
 agreement with everyone else. +1
 https://review.openstack.org/#/c/142621/ -- Thoughtful +1 with
 consideration for other reviewers. +2
 https://review.openstack.org/#/c/113983/ -- Thorough spec review with
 grammar pedantry noted as something that would not prevent a positive
 review score. +2
 
 All current tripleo-core members are invited to vote at this time. Thank
 you!

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] nominating James Polley for tripleo-core

2015-01-15 Thread Clint Byrum
Excerpts from Chuck Carlino's message of 2015-01-15 09:43:41 -0800:
 On 01/15/2015 08:49 AM, Alexis Lee wrote:
  Clint Byrum said on Wed, Jan 14, 2015 at 10:14:45AM -0800:
  holidays. However, I believe James has demonstrated superb review skills
  and a commitment to the project that shows broad awareness of the
  project.
  Big +1. Thanks for taking the time to meta-review, Clint.
 
 
  Alexis
 
 I don't get a vote, but just wanted to point out James' excellent 
 contributions in chasing down neutron issues.
 
 Hmm, now that I've said it, I'm not entirely certain he'd have wanted me 
 to :P
 

Awesome, so James is our Neutron person now. :)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] nominating James Polley for tripleo-core

2015-01-14 Thread Clint Byrum
Excerpts from James Polley's message of 2015-01-14 12:46:37 -0800:
 Thanks for the nomination Clint (and +1s from people who have already
 responded)
 
 At this stage, I believe we've traditionally[1] asked[2] the potential new
 Core Reviewer to commit to 3 reviews per work-day.
 
 I don't feel that that's a commitment I can make at this point. It's not
 something I've been able to achieve in the past - I've come close over the
 last 30 days, but the 90 day report shows me barely above 2 per day. I
 think my current throughput is something I can commit to maintaining, and
 I'd like to think that it can grow over time; but I don't think I can
 commit to doing anything more than I've already been able to do.
 
 If the rest of the core reviewers think I'm still making a valuable
 contribution, I'm more than happy to accept this nomination.
 

IMO we need to re-evaluate that requirement. None of us has done a great
job at sustaining it; however, as a team we've managed to at least get
enough reviews done to keep the tubes flowing. I know that at one point
we got really backed up, but what solved that was a combination of a few
less patches getting submitted (probably because of the long wait time)
and a few more reviewers being added. So having more good reviewers like
yourself seems more important than having more perfect reviewers.

Also the main reason for wanting people to do 3 per day is to maintain
familiarity with the code. I think you've been able to remain familiar
at your traditional rate just fine.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] nominating James Polley for tripleo-core

2015-01-14 Thread Clint Byrum
Hello! It has been a while since we expanded our review team. The
numbers aren't easy to read with recent dips caused by the summit and
holidays. However, I believe James has demonstrated superb review skills
and a commitment to the project that shows broad awareness of the
project.

Below are the results of a meta-review I did, selecting recent reviews
by James with comments and a final score. I didn't find any reviews by
James that I objected to.

https://review.openstack.org/#/c/133554/ -- Took charge and provided
valuable feedback. +2
https://review.openstack.org/#/c/114360/ -- Good -1 asking for better
commit message and then timely follow-up +1 with positive comments for
more improvement. +2
https://review.openstack.org/#/c/138947/ -- Simpler review, +1'd on Dec.
19 and no follow-up since. Allowing 2 weeks for holiday vacation, this
is only really about 7 - 10 working days and acceptable. +2
https://review.openstack.org/#/c/146731/ -- Very thoughtful -1 review of
recent change with alternatives to the approach submitted as patches.
https://review.openstack.org/#/c/139876/ -- Simpler review, +1'd in
agreement with everyone else. +1
https://review.openstack.org/#/c/142621/ -- Thoughtful +1 with
consideration for other reviewers. +2
https://review.openstack.org/#/c/113983/ -- Thorough spec review with
grammar pedantry noted as something that would not prevent a positive
review score. +2

All current tripleo-core members are invited to vote at this time. Thank
you!

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

2015-01-09 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2015-01-09 14:57:21 -0800:
 On 08/01/15 05:39, Anant Patil wrote:
  1. The stack was failing when there were single disjoint resources or
  just one resource in template. The graph did not include this resource
  due to a minor bug in dependency_names(). I have added a test case and
  fix here:
  https://github.com/anantpatil/heat-convergence-prototype/commit/b58abd77cf596475ecf3f19ed38adf8ad3bb6b3b
 
 Thanks, sorry about that! I will push a patch to fix it up.
 
  2. The resource graph is created with keys in both forward order
  traversal and reverse order traversal and the update will finish the
  forward order and attempt the reverse order. If this is the case, then
  the update-replaced resources will be deleted before the update is
  complete and if the update fails, the old resource is not available for
  roll-back; a new resource has to be created then. I have added a test
  case at the above mentioned location.
 
  In our PoC, the updates (concurrent updates) won't remove a
  update-replaced resource until all the resources are updated, and
  resource clean-up phase is started.
 
 Hmmm, this is a really interesting question actually. That's certainly 
 not how Heat works at the moment; we've always assumed that rollback is 
 best-effort at recovering the exact resources you had before. It would 
 be great to have users weigh in on how they expect this to behave. I'm 
 curious now what CloudFormation does.
 
 I'm reluctant to change it though because I'm pretty sure this is 
 definitely *not* how you would want e.g. a rolling update of an 
 autoscaling group to happen.
 
  It is unacceptable to remove the old
  resource to be rolled-back to since it may have changes which the user
  doesn't want to lose;
 
 If they didn't want to lose it they shouldn't have tried an update that 
 would replace it. If an update causes a replacement or an interruption 
 to service then I consider the same fair game for the rollback - the 
 user has already given us permission for that kind of change. (Whether 
 the user's consent was informed is a separate question, addressed by 
 Ryan's update-preview work.)
 

In the original vision we had for using scaled groups to manage, say,
nova-compute nodes, you definitely can't create new servers, so you
can't just create all the new instances without de-allocating some.

That said, that's why we are using in-place methods like rebuild.

I think it would be acceptable to have cleanup run asynchronously,
and to have rollback re-create anything that has already been cleaned up.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat][tripleo] Making diskimage-builder install from forked repo?

2015-01-08 Thread Clint Byrum
Excerpts from Steven Hardy's message of 2015-01-08 09:37:55 -0800:
 Hi all,
 
 I'm trying to test a fedora-software-config image with some updated
 components.  I need:
 
 - Install latest master os-apply-config (the commit I want isn't released)
 - Install os-refresh-config fork from https://review.openstack.org/#/c/145764
 
 I can't even get the o-a-c from master part working:
 
 export PATH=${PWD}/dib-utils/bin:$PATH
 export
 ELEMENTS_PATH=tripleo-image-elements/elements:heat-templates/hot/software-config/elements
 export DIB_INSTALLTYPE_os_apply_config=source
 
 diskimage-builder/bin/disk-image-create vm fedora selinux-permissive \
   os-collect-config os-refresh-config os-apply-config \
   heat-config-ansible \
   heat-config-cfn-init \
   heat-config-docker \
   heat-config-puppet \
   heat-config-salt \
   heat-config-script \
   ntp \
   -o fedora-software-config.qcow2
 
 This is what I'm doing; both tools end up as pip-installed versions AFAICS,
 so I've had to resort to manually hacking the image post-DiB using
 virt-copy-in.
 
 Pretty sure there's a way to make DiB do this, but I don't know what. Anyone
 able to share some clues?  Do I have to hack the elements, or is there a
 better way?
 
 The docs are pretty sparse, so any help would be much appreciated! :)
 

Hi Steve. The os-*-config tools represent a bit of a quandary for us,
as we want to test and run with released versions, not latest git, so
the elements just install from pypi. I believe we use devpi in testing
to test new commits to the tools themselves.

So you can probably setup a devpi instance locally, and upload the
commits you want to it, and then build the image with the 'pypi' element
added and this:

PYPI_MIRROR_URL=http://localhost:3141/

See diskimage-builder/elements/pypi/README.md for more info on how to
set this up.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat][tripleo] Making diskimage-builder install from forked repo?

2015-01-08 Thread Clint Byrum
Excerpts from Chris Jones's message of 2015-01-08 10:16:14 -0800:
 Hi
 
  On 8 Jan 2015, at 17:58, Clint Byrum cl...@fewbar.com wrote:
  
  Excerpts from Steven Hardy's message of 2015-01-08 09:37:55 -0800:
  So you can probably setup a devpi instance locally, and upload the
  commits you want to it, and then build the image with the 'pypi' element
 
 Given that we have a pretty good release frequency of all our tools, is this 
 burden on devs/testers actually justified at this point, versus the potential 
 consistency we could have with source repo flexibility in other openstack 
 components?
 

We've been discussing in #tripleo and I think you're right. I think we
can solve this by just switching to source-repositories and providing
a relatively simple tool to set release tags when people want released
versions only.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] Switching CI back to amd64

2015-01-07 Thread Clint Byrum
Excerpts from Derek Higgins's message of 2015-01-07 02:51:41 -0800:
 Hi All,
 I intended to bring this up at this mornings meeting but the train I
 was on had no power sockets (and I had no battery) so sending to the
 list instead.
 
 We currently run our CI on images built for i386; we took this
 decision a while back to save memory (at the time it allowed us to move
 the amount of memory required in our VMs from 4G to 2G; exactly where in
 those bands the hard requirements are I don't know).
 
 Since then we have had to move back to 3G for the i386 VM as 2G was no
 longer enough so the saving in memory is no longer as dramatic.
 
 Now that the difference isn't as dramatic, I propose we switch back to
 amd64 (with 4G VMs) in order to CI on what would be closer to a
 production deployment, and before making the switch I wanted to throw the
 idea out there for others to digest.
 
 This obviously would impact our capacity as we will have to reduce the
 number of testenvs per testenv hosts. Our capacity (in RH1 and roughly
 speaking) allows us to run about 1440 ci jobs per day. I believe we can
 make the switch and still keep capacity above 1200 with a few other changes
 1. Add some more testenv hosts, we have 2 unused hosts at the moment and
 we can probably take 2 of the compute nodes from the overcloud.
 2. Kill VM's at the end of each CI test (as opposed to leaving them
 running until the next CI test kills them), allowing us to more
 successfully overcommit on RAM
 3. maybe look into adding swap on the test env hosts, they don't
 currently have any, so overcommitting RAM is a problem that the OOM
 killer is handling from time to time (I only noticed this yesterday).
 
 The other benefit to doing this is that if we were to ever want to CI
 images built with packages (this has come up in previous meetings) we
 wouldn't need to provide i386 packages just for CI, while the rest of
 the world uses amd64.

+1 on all counts.

It's also important to note that we should actually have a whole new
rack of servers added to capacity soon (I think soon is about 6 months
so far, but we are at least committed to it). So this would be, at worst,
a temporary loss of 240 jobs per day.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [TripleO] Meetup Reminder

2015-01-04 Thread Clint Byrum
Happy New Year!

Just a friendly reminder to those of you who are interested in TripleO,
we have a three-day Meetup scheduled for February 18-20 in Seattle, WA.
All are welcome, though space is limited to 30 participants. Thus far we
have 8 people signed up in the etherpad:

https://etherpad.openstack.org/p/kilo-tripleo-midcycle-meetup

Please do add yourself to the list if you intend to come. There is
information on hotels and we will add any event notifications and agenda
items that come up.

Thanks, and see you all there!

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [heat] Application level HA via Heat

2014-12-24 Thread Clint Byrum
Excerpts from Renat Akhmerov's message of 2014-12-24 03:40:22 -0800:
 Hi
 
  Ok, I'm quite happy to accept this may be a better long-term solution, but
  can anyone comment on the current maturity level of Mistral?  Questions
  which spring to mind are:
  
  - Is the DSL stable now?
 
 You can think “yes” because although we keep adding new features we do it in 
 a backwards compatible manner. I personally try to be very cautious about 
 this.
 
  - What's the roadmap re incubation (there are a lot of TBD's here:
 https://wiki.openstack.org/wiki/Mistral/Incubation)
 
 Ooh yeah, this page is very very obsolete which is actually my fault because 
 I didn’t pay a lot of attention to this after I heard all these rumors about 
 TC changing the whole approach around getting projects incubated/integrated.
 
 I think incubation readiness from a technical perspective is good (various 
 style checks, procedures etc.), even if there’s still something that we need 
 to adjust it must not be difficult and time consuming. The main question for 
 the last half a year has been “What OpenStack program best fits Mistral?”. So 
 far we’ve had two candidates: Orchestration and some new program (e.g. 
 Workflow Service). However, nothing is decided yet on that.
 

It's probably worth re-thinking the discussion above given the governance
changes that are being worked on:

http://governance.openstack.org/resolutions/20141202-project-structure-reform-spec.html

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

2014-12-18 Thread Clint Byrum
Excerpts from Anant Patil's message of 2014-12-16 07:36:58 -0800:
 On 16-Dec-14 00:59, Clint Byrum wrote:
  Excerpts from Anant Patil's message of 2014-12-15 07:15:30 -0800:
  On 13-Dec-14 05:42, Zane Bitter wrote:
  On 12/12/14 05:29, Murugan, Visnusaran wrote:
 
 
  -Original Message-
  From: Zane Bitter [mailto:zbit...@redhat.com]
  Sent: Friday, December 12, 2014 6:37 AM
  To: openstack-dev@lists.openstack.org
  Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
  showdown
 
  On 11/12/14 08:26, Murugan, Visnusaran wrote:
  [Murugan, Visnusaran]
  In case of rollback where we have to cleanup earlier version of
  resources,
  we could get the order from old template. We'd prefer not to have a
  graph table.
 
  In theory you could get it by keeping old templates around. But that
  means keeping a lot of templates, and it will be hard to keep track
  of when you want to delete them. It also means that when starting an
  update you'll need to load every existing previous version of the
  template in order to calculate the dependencies. It also leaves the
  dependencies in an ambiguous state when a resource fails, and
  although that can be worked around it will be a giant pain to 
  implement.
 
 
  Agree that looking to all templates for a delete is not good. But
  barring complexity, we feel we could achieve it by way of having an
  update and a delete stream for a stack update operation. I will
  elaborate in detail in the etherpad sometime tomorrow :)
 
  I agree that I'd prefer not to have a graph table. After trying a
  couple of different things I decided to store the dependencies in the
  Resource table, where we can read or write them virtually for free
  because it turns out that we are always reading or updating the
  Resource itself at exactly the same time anyway.
 
 
  Not sure how this will work in an update scenario when a resource does
  not change and its dependencies do.
 
  We'll always update the requirements, even when the properties don't
  change.
 
 
  Can you elaborate a bit on rollback.
 
  I didn't do anything special to handle rollback. It's possible that we 
  need to - obviously the difference in the UpdateReplace + rollback case 
  is that the replaced resource is now the one we want to keep, and yet 
  the replaced_by/replaces dependency will force the newer (replacement) 
  resource to be checked for deletion first, which is an inversion of the 
  usual order.
 
 
  This is where the version is so handy! For UpdateReplaced ones, there is
  an older version to go back to. This version could just be template ID,
  as I mentioned in another e-mail. All resources are at the current
  template ID if they are found in the current template, even if there is
  no need to update them. Otherwise, they need to be cleaned-up in the
  order given in the previous templates.
 
  I think the template ID is used as version as far as I can see in Zane's
  PoC. If the resource template key doesn't match the current template
  key, the resource is deleted. The version is a misnomer here, but that
  field (template id) is used as though we had versions of resources.
 
  However, I tried to think of a scenario where that would cause problems 
  and I couldn't come up with one. Provided we know the actual, real-world 
  dependencies of each resource I don't think the ordering of those two 
  checks matters.
 
  In fact, I currently can't think of a case where the dependency order 
  between replacement and replaced resources matters at all. It matters in 
  the current Heat implementation because resources are artificially 
  segmented into the current and backup stacks, but with a holistic view 
  of dependencies that may well not be required. I tried taking that line 
  out of the simulator code and all the tests still passed. If anybody can 
  think of a scenario in which it would make a difference, I would be very 
  interested to hear it.
 
  In any event though, it should be no problem to reverse the direction of 
  that one edge in these particular circumstances if it does turn out to 
  be a problem.
 
  We had an approach with depends_on
  and needed_by columns in ResourceTable. But dropped it when we figured 
  out
  we had too many DB operations for Update.
 
  Yeah, I initially ran into this problem too - you have a bunch of nodes 
  that are waiting on the current node, and now you have to go look them 
  all up in the database to see what else they're waiting on in order to 
  tell if they're ready to be triggered.
 
  It turns out the answer is to distribute the writes but centralise the 
  reads. So at the start of the update, we read all of the Resources, 
  obtain their dependencies and build one central graph[1]. We then make
  that graph available to each resource (either by passing it as a 
  notification parameter, or storing it somewhere central in the DB that 
  they will all have to read anyway, i.e. the Stack). But when we update a 
  dependency we

Re: [openstack-dev] Do all OpenStack daemons support sd_notify?

2014-12-15 Thread Clint Byrum
Excerpts from Ihar Hrachyshka's message of 2014-12-15 07:21:04 -0800:
 Hash: SHA512
 
 On 14/12/14 09:45, Thomas Goirand wrote:
  Hi,
  
  As I am slowing fixing all systemd issues for the daemons of
  OpenStack in Debian (and hopefully, have this ready before the
  freeze of Jessie), I was wondering what kind of Type= directive to
  put on the systemd .service files. I have noticed that in Fedora,
  there's Type=notify. So my question is:
  
  Do all OpenStack daemons, as a rule, support the DBus sd_notify
  thing? Should I always use Type=notify for systemd .service files?
  Can this be called a general rule with no exception?
 
 (I will talk about neutron only.)
 
 I guess Type=notify is supposed to be used with daemons that use
 Service class from oslo-incubator that provides systemd notification
 mechanism, or call to systemd.notify_once() otherwise.
 
 In terms of Neutron, neutron-server process is doing it, metadata
 agent also seems to do it, while OVS agent seems to not. So it really
 should depend on each service and the way it's implemented. You cannot
 just assume that every Neutron service reports back to systemd.
 
 In terms of Fedora, we have Type=notify for neutron-server service only.
 
 BTW now that more distributions are interested in shipping unit files
 for services, should we upstream them and ship the same thing in all
 interested distributions?
 

Since we can expect the five currently implemented OS's in TripleO to all
have systemd by default soon (Debian, Fedora, openSUSE, RHEL, Ubuntu),
it would make a lot of sense for us to make the systemd unit files that
TripleO generates set Type=notify wherever possible. So hopefully we can
actually make such a guarantee upstream sometime in the not-so-distant
future, especially since our CI will run two of the more distinct forks,
Ubuntu and Fedora.
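
For reference, here is a minimal sketch of what supporting Type=notify
amounts to for a daemon (roughly what the oslo-incubator systemd helper
does; the function name here is purely illustrative): once initialized,
the process writes READY=1 to the datagram socket systemd passes in
$NOTIFY_SOCKET.

import os
import socket

def notify_ready():
    # Only meaningful when systemd started us with Type=notify.
    addr = os.environ.get('NOTIFY_SOCKET')
    if not addr:
        return
    if addr.startswith('@'):
        # systemd uses '@' to denote an abstract namespace socket.
        addr = '\0' + addr[1:]
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.connect(addr)
        sock.sendall(b'READY=1')
    finally:
        sock.close()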

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

2014-12-15 Thread Clint Byrum
Excerpts from Anant Patil's message of 2014-12-15 07:15:30 -0800:
 On 13-Dec-14 05:42, Zane Bitter wrote:
  On 12/12/14 05:29, Murugan, Visnusaran wrote:
 
 
  -Original Message-
  From: Zane Bitter [mailto:zbit...@redhat.com]
  Sent: Friday, December 12, 2014 6:37 AM
  To: openstack-dev@lists.openstack.org
  Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
  showdown
 
  On 11/12/14 08:26, Murugan, Visnusaran wrote:
  [Murugan, Visnusaran]
  In case of rollback where we have to cleanup earlier version of
  resources,
  we could get the order from old template. We'd prefer not to have a
  graph table.
 
  In theory you could get it by keeping old templates around. But that
  means keeping a lot of templates, and it will be hard to keep track
  of when you want to delete them. It also means that when starting an
  update you'll need to load every existing previous version of the
  template in order to calculate the dependencies. It also leaves the
  dependencies in an ambiguous state when a resource fails, and
  although that can be worked around it will be a giant pain to implement.
 
 
  Agree that looking to all templates for a delete is not good. But
  barring complexity, we feel we could achieve it by way of having an
  update and a delete stream for a stack update operation. I will
  elaborate in detail in the etherpad sometime tomorrow :)
 
  I agree that I'd prefer not to have a graph table. After trying a
  couple of different things I decided to store the dependencies in the
  Resource table, where we can read or write them virtually for free
  because it turns out that we are always reading or updating the
  Resource itself at exactly the same time anyway.
 
 
  Not sure how this will work in an update scenario when a resource does
  not change and its dependencies do.
 
  We'll always update the requirements, even when the properties don't
  change.
 
 
  Can you elaborate a bit on rollback.
  
  I didn't do anything special to handle rollback. It's possible that we 
  need to - obviously the difference in the UpdateReplace + rollback case 
  is that the replaced resource is now the one we want to keep, and yet 
  the replaced_by/replaces dependency will force the newer (replacement) 
  resource to be checked for deletion first, which is an inversion of the 
  usual order.
  
 
 This is where the version is so handy! For UpdateReplaced ones, there is
 an older version to go back to. This version could just be template ID,
 as I mentioned in another e-mail. All resources are at the current
 template ID if they are found in the current template, even if there is
 no need to update them. Otherwise, they need to be cleaned-up in the
 order given in the previous templates.
 
 I think the template ID is used as version as far as I can see in Zane's
 PoC. If the resource template key doesn't match the current template
 key, the resource is deleted. The version is a misnomer here, but that
 field (template id) is used as though we had versions of resources.
 
  However, I tried to think of a scenario where that would cause problems 
  and I couldn't come up with one. Provided we know the actual, real-world 
  dependencies of each resource I don't think the ordering of those two 
  checks matters.
  
  In fact, I currently can't think of a case where the dependency order 
  between replacement and replaced resources matters at all. It matters in 
  the current Heat implementation because resources are artificially 
  segmented into the current and backup stacks, but with a holistic view 
  of dependencies that may well not be required. I tried taking that line 
  out of the simulator code and all the tests still passed. If anybody can 
  think of a scenario in which it would make a difference, I would be very 
  interested to hear it.
  
  In any event though, it should be no problem to reverse the direction of 
  that one edge in these particular circumstances if it does turn out to 
  be a problem.
  
  We had an approach with depends_on
  and needed_by columns in ResourceTable. But dropped it when we figured out
  we had too many DB operations for Update.
  
  Yeah, I initially ran into this problem too - you have a bunch of nodes 
  that are waiting on the current node, and now you have to go look them 
  all up in the database to see what else they're waiting on in order to 
  tell if they're ready to be triggered.
  
  It turns out the answer is to distribute the writes but centralise the 
  reads. So at the start of the update, we read all of the Resources, 
  obtain their dependencies and build one central graph[1]. We then make
  that graph available to each resource (either by passing it as a 
  notification parameter, or storing it somewhere central in the DB that 
  they will all have to read anyway, i.e. the Stack). But when we update a 
  dependency we don't update the central graph, we update the individual 
  Resource so there's no global lock required.
  
  [1] 
  

Re: [openstack-dev] Unsafe Abandon

2014-12-15 Thread Clint Byrum
Excerpts from Ari Rubenstein's message of 2014-12-15 12:32:08 -0800:
 Hi there,
 I'm new to the list, and trying to get more information about the following 
 issue:
 
 https://bugs.launchpad.net/heat/+bug/1353670
 Is there anyone on the list who can explain under what conditions a user 
 might hit this?  Workarounds?  ETA for a fix?

Hi Ari. Welcome, and thanks for your interest in OpenStack and Heat!

A bit of etiquette first: Please do not reply to existing threads to
start a new one. That is known as a hijack:

https://wiki.openstack.org/wiki/MailingListEtiquette#Changing_Subject

Also, for bugs, you'll find it's best to ask in the comments of the bug,
as those who are most interested and able to answer should already be
subscribed and can respond there.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] mid-cycle details -- CONFIRMED Feb. 18 - 20

2014-12-15 Thread Clint Byrum
I'm happy to announce we've cleared the schedule and the Mid-Cycle is
confirmed for February 18 - 20 in Seattle, WA at HP's downtown offices.

Please refer to the etherpad linked below for details including address
and instructions for access to the building.

PLEASE make sure you add yourself to the list of confirmed attendees
on the etherpad *BEFORE* booking travel. We have a hard limit of 30
participants, so if you are not certain you have a spot, please contact
me before booking travel.

Excerpts from Clint Byrum's message of 2014-12-01 14:58:58 -0800:
 Hello! I've received confirmation that our venue, the HP offices in
 downtown Seattle, will be available for the most-often-preferred
 least-often-cannot week of Feb 16 - 20.
 
 Our venue has a maximum of 20 participants, but I only have 16 possible
 attendees now. Please add yourself to that list _now_ if you will be
 joining us.
 
 I've asked our office staff to confirm Feb 18 - 20 (Wed-Fri). When they
 do, I will reply to this thread to let everyone know so you can all
 start to book travel. See the etherpad for travel details.
 
 https://etherpad.openstack.org/p/kilo-tripleo-midcycle-meetup

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] UniqueConstraint for name and tenant_id in security group

2014-12-11 Thread Clint Byrum
Excerpts from Jay Pipes's message of 2014-12-11 05:43:46 -0800:
 On 12/11/2014 07:22 AM, Anna Kamyshnikova wrote:
  Hello everyone!
 
  In neutron there is a rather old bug [1] about adding uniqueness for
  security group name and tenant id. I found this idea reasonable and
  started working on a fix for this bug [2]. I think it is good to add a
  unique constraint because:
 
  1) In nova there is such constraint for security groups
  https://github.com/openstack/nova/blob/stable/juno/nova/db/sqlalchemy/migrate_repo/versions/216_havana.py#L1155-L1157.
  So I think that it is rather disruptive that it is impossible to create
  a security group with the same name in nova, but possible in neutron.
  2) Users get confused having security groups with the same name.
 
  In a comment on the proposed change, Assaf Muller and Maru Newby objected to
  this solution and suggested another option, so I think we need more eyes
  on this change.
 
  I would like to ask you to share your thoughts on this topic.
  [1] - https://bugs.launchpad.net/neutron/+bug/1194579
  [2] - https://review.openstack.org/135006
 
 I'm generally in favor of making name attributes opaque, utf-8 strings 
 that are entirely user-defined and have no constraints on them. I 
 consider the name to be just a tag that the user places on some 
 resource. It is the resource's ID that is unique.
 
 I do realize that Nova takes a different approach to *some* resources, 
 including the security group name.
 
 End of the day, it's probably just a personal preference whether names 
 should be unique to a tenant/user or not.
 

The problem with this approach is that it requires the user to have an
external mechanism to achieve idempotency. By guaranteeing that an opaque
string the user submits will be unique, you allow the user to write much
simpler creation code that stays correct even when individual requests
fail unreliably. With unique names, something like this works:


while True:
    try:
        item = clientlib.find(name='foo')[0]
        break
    except NotFound:
        try:
            item = clientlib.create(name='foo')
            break
        except UniqueConflict:
            item = clientlib.find(name='foo')[0]
            break

You can keep retrying forever because you know only one thing with that
name will ever exist.

Without unique names, you have to write weird stuff like this to do a
retry.

while len(clientlib.find(name='foo')) < 1:
    try:
        item = clientlib.create(name='foo')
        found = clientlib.searchfor(name='foo')
        for found_item in found:
            # Clean up duplicates that an earlier timed-out attempt
            # may have left behind.
            if found_item.id != item.id:
                clientlib.delete(found_item.id)
    except Exception:
        continue

Name can certainly remain non-unique and free-form, but don't discount
the value of a unique value that the user specifies.
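
For completeness, enforcing this server side is a small amount of schema
work. A rough sketch (a hypothetical SQLAlchemy model, not the actual
neutron schema or the fix under review):

from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class SecurityGroup(Base):
    __tablename__ = 'securitygroups'
    id = Column(Integer, primary_key=True)
    tenant_id = Column(String(255))
    name = Column(String(255))
    # One name per tenant: a concurrent duplicate create fails at commit
    # time, which is what lets the simple client retry loop above work.
    __table_args__ = (
        UniqueConstraint('tenant_id', 'name',
                         name='uniq_securitygroups0tenant_id0name'),
    )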

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] mid-cycle details final draft

2014-12-10 Thread Clint Byrum
Just FYI, we ran into a last minute scheduling conflict with the venue
and are sorting it out, so please _do not book travel yet_. Worst case
it will move to Feb 16 - 18 instead of 18 - 20.

Excerpts from Clint Byrum's message of 2014-12-01 14:58:58 -0800:
 Hello! I've received confirmation that our venue, the HP offices in
 downtown Seattle, will be available for the most-often-preferred
 least-often-cannot week of Feb 16 - 20.
 
 Our venue has a maximum of 20 participants, but I only have 16 possible
 attendees now. Please add yourself to that list _now_ if you will be
 joining us.
 
 I've asked our office staff to confirm Feb 18 - 20 (Wed-Fri). When they
 do, I will reply to this thread to let everyone know so you can all
 start to book travel. See the etherpad for travel details.
 
 https://etherpad.openstack.org/p/kilo-tripleo-midcycle-meetup

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Fuel agent proposal

2014-12-09 Thread Clint Byrum
Excerpts from Yuriy Zveryanskyy's message of 2014-12-09 04:05:03 -0800:
 Good day Ironicers.
 
 I do not want to discuss questions like "Is feature X good for release
 Y?" or "Is feature Z in Ironic scope or not?".
 I want to get an answer for this: "Is Ironic a flexible, easily extendable
 and user-oriented solution for deployment?"

I surely hope it is.

 Yes, it is I think. IPA is great software, but Fuel Agent proposes a
 different, alternative way of deploying.

It's not fundamentally different, it is just capable of other things.

 Devananda wrote about pets and cattle, and maybe some want to manage 
 pets rather than cattle? Let
 users make a choice.

IMO this is too high-level of a discussion for Ironic to get bogged
down in. Disks can have partitions and be hosted in RAID controllers,
and these things _MUST_ come before an OS is put on the disks, but after
power control happens. Since Ironic does put OS's on disks, and control
power, I believe it is obligated to provide an interface for rich disk
configuration.

There are valid use cases for _both_ of those things in cattle, which
is a higher level problem that should not cloud the low level interface
discussion.

So IMO, Ironic needs to provide an interface for agents to richly
configure disks, whether IPA supports it or not.

Would I like to see these things in IPA so that there isn't a mismatch
of features? Yes. Does that matter _now_? Not really. The FuelAgent can
prove out the interface while the features migrate into IPA.

 We do not plan to change any Ironic API for the driver, internal or 
 external (as opposed to IPA, this was done for it).
 If there is no one to support Fuel Agent's driver, I think this
 driver should be removed from the Ironic tree (I heard
 this practice is used in the Linux kernel).
 

We have a _hyperv_ driver in Nova. I think we can have something
we're not entirely 100% on board with in Ironic.

All of that said, I would admonish FuelAgent developers to commit to
combining their agent with IPA long term. I would admonish Ironic
developers to be receptive to things that users want. It doesn't always
mean taking responsibility for implementations, but you _do_ need to
consider the pain of not providing interfaces and of forcing people to
remain out of tree (remember when Ironic's driver wasn't in Nova's tree?)

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] People of OpenStack (and their IRC nicks)

2014-12-09 Thread Clint Byrum
Excerpts from Angus Salkeld's message of 2014-12-09 15:25:59 -0800:
 On Wed, Dec 10, 2014 at 5:11 AM, Stefano Maffulli stef...@openstack.org
 wrote:
 
  On 12/09/2014 06:04 AM, Jeremy Stanley wrote:
   We already have a solution for tracking the contributor-IRC
   mapping--add it to your Foundation Member Profile. For example, mine
   is in there already:
  
   http://www.openstack.org/community/members/profile/5479
 
  I recommend updating the openstack.org member profile and add IRC
  nickname there (and while you're there, update your affiliation history).
 
  There is also a search engine on:
 
  http://www.openstack.org/community/members/
 
 
 Except that info doesn't appear nicely in review. Some people put their
 nick in their Full Name in
 gerrit. Hopefully Clint doesn't mind:
 
 https://review.openstack.org/#/q/owner:%22Clint+%27SpamapS%27+Byrum%22+status:open,n,z
 

Indeed, I really didn't like that I'd be reviewing somebody's change,
and talking to them on IRC, and not know if they knew who I was.

It also has the odd side effect that gerritbot triggers my IRC filters
when I 'git review'.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] Alternate meeting time

2014-12-05 Thread Clint Byrum
Excerpts from marios's message of 2014-12-04 02:40:23 -0800:
 On 04/12/14 11:40, James Polley wrote:
  Just taking a look at http://doodle.com/27ffgkdm5gxzr654 again - we've
  had 10 people respond so far. The winning time so far is Monday 2100UTC
  - 7 yes and one "If I have to".
 
 for me it currently shows 1200 UTC as the preferred time.
 
 So to be clear, we are voting here for the alternate meeting. The
 'original' meeting is at 1900UTC. If in fact 2100UTC ends up being the
 most popular, what would be the point of an alternating meeting that is
 only 2 hours apart in time?
 

Actually that's a good point. I didn't really think about it before I
voted, but the regular time is perfect for me, so perhaps I should
remove my vote, and anyone else who does not need the alternate time
should consider doing so as well.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] Managing no-mergepy template duplication

2014-12-05 Thread Clint Byrum
Excerpts from Steven Hardy's message of 2014-12-04 01:09:18 -0800:
 On Wed, Dec 03, 2014 at 06:54:48PM -0800, Clint Byrum wrote:
  Excerpts from Dan Prince's message of 2014-12-03 18:35:15 -0800:
   On Wed, 2014-12-03 at 10:11 +, Steven Hardy wrote:
Hi all,

Lately I've been spending more time looking at tripleo and doing some
reviews. I'm particularly interested in helping the no-mergepy and
subsequent puppet-software-config implementations mature (as well as
improving overcloud updates via heat).

Since Tomas's patch landed[1] to enable --no-mergepy in
tripleo-heat-templates, it's become apparent that frequently patches are
submitted which only update overcloud-source.yaml, so I've been trying 
to
catch these and ask for a corresponding change to e.g controller.yaml.

This raises the following questions:

1. Is it reasonable to -1 a patch and ask folks to update in both 
places?
   
   Yes! In fact until we abandon merge.py we shouldn't land anything that
   doesn't make the change in both places. Probably more important to make
   sure things go into the new (no-mergepy) templates though.
   
2. How are we going to handle this duplication and divergence?
   
   Move as quickly as possible to the new without-mergepy variants? That is
   my vote anyways.
   
3. What's the status of getting gating CI on the --no-mergepy templates?
   
   Devtest already supports it by simply setting an option (which sets an
   ENV variable). Just need to update tripleo-ci to do that and then make
   the switch.
   
4. What barriers exist (now that I've implemented[2] the eliding 
functionality
requested[3] for ResourceGroup) to moving to the --no-mergepy
implementation by default?
   
   None that I know of.
   
  
  I concur with Dan. Elide was the last reason not to use this.
 
 That's great news! :)
 
  One thing to consider is that there is no actual upgrade path from
  non-autoscaling-group based clouds, to auto-scaling-group based
  templates. We should consider how we'll do that before making it the
  default. So, I suggest we discuss possible upgrade paths and then move
  forward with switching one of the CI jobs to using the new templates.
 
 This is probably going to be really hard :(
 
 The sort of pattern which might work is:
 
 1. Abandon mergepy based stack
 2. Have helper script to reformat abandon data into nomergepy based adopt
 data
 3. Adopt stack
 
 Unforunately there are several abandon/adopt bugs we'll have to fix if we
 decide this is the way to go (original author hasn't maintained it, but we
 can pick up the slack if it's on the critical path for TripleO).
 
 An alternative could be the external resource feature Angus is looking at:
 
 https://review.openstack.org/#/c/134848/
 
 This would be more limited (we just reference rather than manage the
 existing resources), but potentially safer.
 
 The main risk here is import (or subsequent update) operations becoming
 destructive and replacing things, but I guess to some extent this is a risk
 with any change to tripleo-heat-templates.
 

So you and I talked on IRC, but I want to socialize what we talked about
more.

The abandon/adopt pipeline is a bit broken in Heat and hasn't proven to be
as useful as I'd hoped when it was first specced out. It seems too broad,
and relies on any tools understanding how to morph a whole new format
(the abandon json).

With external_reference, the external upgrade process just needs to know
how to morph the template. So if we're combining 8 existing servers into
an autoscaling group, we just need to know how to make an autoscaling
group with 8 servers as the external reference ids. This is, I think,
the shortest path to a working solution, as I feel the external
reference work in Heat is relatively straightforward and the spec has
widespread agreement.

There was another approach I mentioned, which is that we can teach Heat
how to morph resources. So we could teach Heat that servers can be made
into autoscaling groups, and vice-versa. This is a whole new feature
though, and IMO, something that should be tackled _after_ we make it
work with the external_reference feature, as this is basically a
superset of what we'll do externally.

 Has any thought been given to upgrade CI testing?  I'm thinking grenade or
 grenade-style testing here where we test maintaing a deployed overcloud
 over an upgrade of (some subset of) changes.
 
 I know the upgrade testing thing will be hard, but to me it's a key
 requirement to mature heat-driven updates vs those driven by external
 tooling.

Upgrade testing is vital to the future of the project IMO. We really
haven't validated the image based update method upstream yet. In Helion,
we're using tripleo-ansible for updates, and that works great, but we
need to get that or something similar into the pipeline for the gate,
or every user who adopts will be left with a ton of work if they want

Re: [openstack-dev] [TripleO] [Ironic] Do we want to remove Nova-bm support?

2014-12-04 Thread Clint Byrum
Excerpts from Steve Kowalik's message of 2014-12-03 20:47:19 -0800:
 Hi all,
 
 I'm becoming increasingly concerned about all of the code paths
 in tripleo-incubator that check $USE_IRONIC -eq 0 -- that is, use
 nova-baremetal rather than Ironic. We do not check nova-bm support in
 CI, haven't for at least a month, and I'm concerned that parts of it
 may be slowly bit-rotting.
 
 I think our documentation is fairly clear that nova-baremetal is
 deprecated and Ironic is the way forward, and I know it flies in the
 face of backwards-compatibility, but do we want to bite the bullet and
 remove nova-bm support?

Has Ironic settled on a migration path/tool from nova-bm? If yes, then
we should remove nova-bm support and point people at the migration
documentation.

If Ironic decided not to provide one, then we should just remove support
as well.

If Ironic just isn't done yet, then removing nova-bm in TripleO is
premature and we should wait for them to finish.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] Do we want to remove Nova-bm support?

2014-12-04 Thread Clint Byrum
Excerpts from Ben Nemec's message of 2014-12-04 11:12:10 -0800:
 FWIW, I think the correct thing to do here is to get our Juno jobs up
 and running and have one of them verify the nova-bm code paths for this
 cycle, and then remove it next cycle.
 
 That said, I have no idea how close we are to actually having Juno jobs
 and I agree that we have no idea if the nova-bm code actually works
 anymore (although that applies to backwards compat as a whole too).
 
 I guess I'm inclined to just leave it though.  AFAIK the nova-bm code
 isn't hurting anything, and if it does happen to be working and have a
 user then removing it would break them for no good reason.  If it's not
 working then it's not working and nobody's going to accidentally start
 using it.  The only real downside of leaving it is if it is working and
 someone would happen to override our defaults, ignore all the
 deprecation warnings, and start using it anyway.  I don't see that as a
 big concern.
 
 But I'm not super attached to nova-bm either, so just my 2 cents.
 

I think this is overly cautious, but I can't think of a moderately
cautious plan, so let's just land deprecation warning messages in the
image builds and devtest scripts. I don't know if there's much more we
can do without running the risk of yanking the rug out from under some silent
user out there.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] Using python logging.conf for openstack services

2014-12-04 Thread Clint Byrum
Excerpts from Gregory Haynes's message of 2014-12-04 15:20:53 -0800:
 Hello TripleOers,
 
 I got a patch together to move us off of our upstart "exec service |
 logger -t service" hack [1] and this got me wondering - why aren't we
 using the python logging.conf supported by most OpenStack projects [2]
 to write out logs to files in our desired location? 
 
 This is highly desirable for a couple reasons:
 
 * Less complexity / more straightforward. Basically we wouldn't have to
 run rsyslog or similar and have app config to talk to syslog then syslog
 config to put our logs where we want. We also don't have to battle with
 upstart + rsyslog vs systemd-journald differences and maintain two sets
 of configuration.
 

+1 for less complexity and for just using the normal OS logging
facilities that exist and are quite efficient.

 * We get actual control over formatting. This is a double edged sword in
 that AFAICT you *have* to control formatting if you're using a
 logging.conf with a custom log handler. This means it would be a bit of
 a divergence from our use the defaults policy but there are some
 logging formats in the OpenStack docs [3] named normal, maybe this
 could be acceptable? The big win here is we can avoid issues like having
 duplicate timestamps [4] (this issue still exists on Ubuntu, at least)
 without having to do two sets of configuration, one for upstart +
 rsyslog, one for systemd.
 

This thread might need to get an [all] tag, and we might need to ask the
question "why isn't syslog the default?" I don't have a good answer for
that, so I think we might want to consider doing it, but I especially
would like to hear from operators how many of them actually log things
to syslog vs. on disk or something else.

 * This makes setting custom logging configuration a lot more feasible
 for operators. As-is, if an operator wants to forward logs to an
 existing central log server we dont really have a good way for them to
 do this. We also have a requirement that we can come up with a way to
 expose the rsyslog/journald config options needed to do this to
 operators. If we are using logging.conf we can just use our existing
 passthrough-config system to let operators simply write out custom
 logging.conf files which are already documented by OpenStack.

As usual, the parameters we have should just be things that we want to
ask users about on nearly every deployment. So I concur that passthrough
is the way to go, since it will only be for those who don't want to use
syslog for logging.
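
As a concrete illustration (a hypothetical sketch only -- shown with the
stdlib dictConfig form for brevity rather than the logging.conf file the
services would actually consume, and the path and format string are just
assumptions), the per-service configuration could be as small as:

import logging.config

LOG_CONFIG = {
    'version': 1,
    'formatters': {
        'normal': {
            # One timestamp and one level per line; no second layer like
            # rsyslog or journald adding its own timestamp in front.
            'format': '%(asctime)s %(levelname)s %(name)s %(message)s',
        },
    },
    'handlers': {
        'file': {
            'class': 'logging.handlers.WatchedFileHandler',
            'filename': '/var/log/nova/nova-api.log',
            'formatter': 'normal',
        },
    },
    'root': {'handlers': ['file'], 'level': 'INFO'},
}

logging.config.dictConfig(LOG_CONFIG)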

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [diskimage-builder] Tracing levels for scripts (119023)

2014-12-03 Thread Clint Byrum
Excerpts from Chris Jones's message of 2014-12-03 02:47:30 -0800:
 Hi
 
 I am very sympathetic to this view. We have a patch in hand that improves the 
 situation. We also have disagreement about the ideal situation.
 
 I +2'd Ian's patch because it makes things work better than they do now. If 
 we can arrive at an ideal solution later, great, but the more I think about 
 logging from a multitude of bash scripts, and tricks like XTRACE_FD, the more 
 I think it's crazy and we should just incrementally improve the non-trace 
 logging as a separate exercise, leaving working tracing for true debugging 
 situations.
 

Forgive me, I am not pushing for an ideal situation, but I don't want a
regression.

Running without -x right now has authors xtracing as a rule. Meaning
that the moment this merges, the amount of output goes to almost nil
compared to what it is now.

Basically this is just more of the same OpenStack wrong-headed idea: "you
have to run in DEBUG logging mode to be able to understand any issue."

I'm totally willing to compromise on the ideal for something that is
good enough, but I'm saying this is not good enough _if_ it turns off
tracing for all scripts.

What if the patch is reworked to leave the current trace-all-the-time
mode in place, and we iterate on each script to make tracing conditional
as we add proper logging?

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] Managing no-mergepy template duplication

2014-12-03 Thread Clint Byrum
Excerpts from Dan Prince's message of 2014-12-03 18:35:15 -0800:
 On Wed, 2014-12-03 at 10:11 +, Steven Hardy wrote:
  Hi all,
  
  Lately I've been spending more time looking at tripleo and doing some
  reviews. I'm particularly interested in helping the no-mergepy and
  subsequent puppet-software-config implementations mature (as well as
  improving overcloud updates via heat).
  
  Since Tomas's patch landed[1] to enable --no-mergepy in
  tripleo-heat-templates, it's become apparent that frequently patches are
  submitted which only update overcloud-source.yaml, so I've been trying to
  catch these and ask for a corresponding change to e.g controller.yaml.
  
  This raises the following questions:
  
  1. Is it reasonable to -1 a patch and ask folks to update in both places?
 
 Yes! In fact until we abandon merge.py we shouldn't land anything that
 doesn't make the change in both places. Probably more important to make
 sure things go into the new (no-mergepy) templates though.
 
  2. How are we going to handle this duplication and divergence?
 
 Move as quickly as possible to the new without-mergepy variants? That is
 my vote anyways.
 
  3. What's the status of getting gating CI on the --no-mergepy templates?
 
 Devtest already supports it by simply setting an option (which sets an
 ENV variable). Just need to update tripleo-ci to do that and then make
 the switch.
 
  4. What barriers exist (now that I've implemented[2] the eliding 
  functionality
  requested[3] for ResourceGroup) to moving to the --no-mergepy
  implementation by default?
 
 None that I know of.
 

I concur with Dan. Elide was the last reason not to use this.

One thing to consider is that there is no actual upgrade path from
non-autoscaling-group based clouds, to auto-scaling-group based
templates. We should consider how we'll do that before making it the
default. So, I suggest we discuss possible upgrade paths and then move
forward with switching one of the CI jobs to using the new templates.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [qa] Should it be allowed to attach 2 interfaces from the same subnet to a VM?

2014-12-02 Thread Clint Byrum
Excerpts from Danny Choi (dannchoi)'s message of 2014-12-02 08:34:07 -0800:
 Hi Andrea,
 
 Though both interfaces come up, only one will respond to the ping from the
 neutron router.
 When I disable it, then the second one will respond to ping.
 So it looks like only one interface is useful at a time.
 

I believe both interfaces can be used independently by setting
arp_announce to 1 or 2. As in:

sysctl -w net.ipv4.conf.all.arp_announce=2

Might want to try both settings. The documentation is here:

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

2014-12-02 Thread Clint Byrum
Excerpts from Anant Patil's message of 2014-12-02 10:37:31 -0800:
 Yes, that's the synchronization block for which we use the stack lock.
 Currently, a thread spin waits to acquire the lock to enter this critical
 section.
 
 I don't really know how to do application-level transactions. Is there an
 external library for that? AFAIK, we cannot switch from DB transaction to
 application level to do some computation to resolve the next set of resources,
 submit to an async queue, and again continue with the DB transaction. The
 transactional method looks attractive and clean to me but I am limited by
 my knowledge. Or do you mean to have the DB transaction object made available
 via DB APIs and use them in the application? Please share your thoughts.
 

Every DB request has a context attached to the currently running
greenthread which has the DB session object. So yes, you do begin the
transaction in one db API call, and try to commit it in another after
having attempted any application logic.  The whole thing should always be
in a try/except retry loop to handle deadlocks and conflicts introduced
by multi-master sync replication like Galera.

For non-transactional backends, they would have to use a spin lock like
you're using now.
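
To make that concrete, here is a rough sketch of the pattern (hypothetical
code, not Heat's DB API; the table and column names are made up): do the
reads and the application logic inside one transaction, commit, and retry
the whole unit when the commit hits a deadlock or a Galera certification
failure.

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=create_engine('mysql://heat:heat@db/heat'))

def advance_stack(stack_id, max_retries=10):
    for attempt in range(max_retries):
        session = Session()
        try:
            # Read resource state and decide what to dispatch next, all
            # within the same transaction that records the decision.
            rows = session.execute(
                text('SELECT id, status FROM resource '
                     'WHERE stack_id = :sid'), {'sid': stack_id})
            ready = [r[0] for r in rows if r[1] == 'COMPLETE']
            session.execute(
                text('UPDATE stack SET updated_at = NOW() '
                     'WHERE id = :sid'), {'sid': stack_id})
            session.commit()
            return ready
        except OperationalError:
            # Deadlock or multi-master conflict (e.g. Galera): roll back
            # and retry the whole unit instead of spin-waiting on a lock.
            session.rollback()
        finally:
            session.close()
    raise RuntimeError('giving up after %d conflicts' % max_retries)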

 To avoid locking the stack, we were thinking of designating a single engine
 with responsibility of processing all notifications for a stack. All the
 workers will notify on the stack topic, which only one engine listens to
 and then the notifications end up in a queue (local to the engine, per
 stack), from where they are taken up one-by-one to continue the stack
 operation. The convergence jobs are produced by this engine for the stack
 it is responsible for, and they might end-up in any of the engines. But the
 notifications for a stack are directed to one engine to avoid contention
 for lock. The convergence load is leveled and distributed, and stack lock
 is not needed.
 

No don't do that. You already have that engine, it is the database engine
and it is intended to be used for synchronization and will achieve a
higher degree of concurrency in bigger stacks because it will only block
two workers when they try to inspect or change the same rows.
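
To make "only the contenders block" concrete, a per-row claim along these
lines (the Resource model and its columns are invented for illustration)
is all that's needed; the losers simply see a rowcount of 0 and move on:

    def claim_resource(session, resource_id, engine_id):
        # The database serializes these updates; only workers aiming at
        # the same row ever contend with each other.
        count = session.query(Resource).\
            filter_by(id=resource_id, engine_id=None).\
            update({'engine_id': engine_id}, synchronize_session=False)
        return bool(count)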

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [diskimage-builder] Tracing levels for scripts (119023)

2014-12-02 Thread Clint Byrum
Excerpts from Ian Wienand's message of 2014-12-02 11:22:31 -0800:
 On 12/02/2014 03:46 PM, Clint Byrum wrote:
  1) Conform all o-r-c scripts to the logging standards we have in
  OpenStack, or write new standards for diskimage-builder and conform
  them to those standards. Abolish non-conditional xtrace in any script
  conforming to the standards.
 
 Honestly in the list of things that need doing in openstack, this must
 be near the bottom.
 
 The whole reason I wrote this is because disk-image-create -x ...
 doesn't do what any reasonable person expects it to; i.e. trace all
 the scripts it starts.
 
 Having a way to trace execution of all d-i-b scripts is all that's
 needed and gives sufficient detail to debug issues.

Several developers have expressed concern about an all-or-nothing
approach to this. The concern is that when you turn off the all-or-nothing
trace, you also lose the author-intended info-level messages.

I for one find the idea of printing every cp, cat, echo and ls command out
rather frustratingly verbose when scanning logs from a normal run. Do I
want it sometimes? YES, but it will actually hinder normal image building
iteration if we only have a toggle of all trace or no trace.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown

2014-12-01 Thread Clint Byrum
Excerpts from Anant Patil's message of 2014-11-30 23:02:29 -0800:
 On 27-Nov-14 18:03, Murugan, Visnusaran wrote:
  Hi Zane,
  
   
  
  At this stage our implementation (as mentioned in wiki
  https://wiki.openstack.org/wiki/Heat/ConvergenceDesign) achieves your
  design goals.
  
   
  
  1.   In case of a parallel update, our implementation adjusts graph
  according to new template and waits for dispatched resource tasks to
  complete.
  
  2.   Reason for basing our PoC on Heat code:
  
  a.   To solve contention processing parent resource by all dependent
  resources in parallel.
  
  b.  To avoid porting issue from PoC to HeatBase. (just to be aware
  of potential issues asap)
  
  3.   Resource timeout would be helpful, but I guess its resource
  specific and has to come from template and default values from plugins.
  
  4.   We see resource notification aggregation and processing next
  level of resources without contention and with minimal DB usage as the
  problem area. We are working on the following approaches in *parallel.*
  
  a.   Use a Queue per stack to serialize notification.
  
  b.  Get parent ProcessLog (ResourceID, EngineID) and initiate
  convergence upon first child notification. Subsequent children who fail
  to get parent resource lock will directly send message to waiting parent
  task (topic=stack_id.parent_resource_id)
  
  Based on performance/feedback we can select either or a mashed version.
  
   
  
  Advantages:
  
  1.   Failed Resource tasks can be re-initiated after ProcessLog
  table lookup.
  
  2.   One worker == one resource.
  
  3.   Supports concurrent updates
  
  4.   Delete == update with empty stack
  
  5.   Rollback == update to previous known good/completed stack.
  
   
  
  Disadvantages:
  
  1.   Still holds stackLock (WIP to remove with ProcessLog)
  
   
  
  Completely understand your concern on reviewing our code, since commits
  are numerous and there is change of course at places.  Our start commit
  is [c1b3eb22f7ab6ea60b095f88982247dd249139bf] though this might not help :)
  
   
  
  Your Thoughts.
  
   
  
  Happy Thanksgiving.
  
  Vishnu.
  
   
  
  *From:*Angus Salkeld [mailto:asalk...@mirantis.com]
  *Sent:* Thursday, November 27, 2014 9:46 AM
  *To:* OpenStack Development Mailing List (not for usage questions)
  *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
  
   
  
  On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter zbit...@redhat.com wrote:
  
  A bunch of us have spent the last few weeks working independently on
  proof of concept designs for the convergence architecture. I think
  those efforts have now reached a sufficient level of maturity that
  we should start working together on synthesising them into a plan
  that everyone can forge ahead with. As a starting point I'm going to
  summarise my take on the three efforts; hopefully the authors of the
  other two will weigh in to give us their perspective.
  
  
  Zane's Proposal
  ===
  
  
  https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph
  
  I implemented this as a simulator of the algorithm rather than using
  the Heat codebase itself in order to be able to iterate rapidly on
  the design, and indeed I have changed my mind many, many times in
  the process of implementing it. Its notable departure from a
  realistic simulation is that it runs only one operation at a time -
  essentially giving up the ability to detect race conditions in
  exchange for a completely deterministic test framework. You just
  have to imagine where the locks need to be. Incidentally, the test
  framework is designed so that it can easily be ported to the actual
  Heat code base as functional tests so that the same scenarios could
  be used without modification, allowing us to have confidence that
  the eventual implementation is a faithful replication of the
  simulation (which can be rapidly experimented on, adjusted and
  tested when we inevitably run into implementation issues).
  
  This is a complete implementation of Phase 1 (i.e. using existing
  resource plugins), including update-during-update, resource
  clean-up, replace on update and rollback; with tests.
  
  Some of the design goals which were successfully incorporated:
  - Minimise changes to Heat (it's essentially a distributed version
  of the existing algorithm), and in particular to the database
  - Work with the existing plugin API
  - Limit total DB access for Resource/Stack to O(n) in the number of
  resources
  - Limit overall DB access to O(m) in the number of edges
  - Limit lock contention to only those operations actually contending
  (i.e. no global locks)
  - Each worker task deals with only one resource
  - Only read 

Re: [openstack-dev] [diskimage-builder] Tracing levels for scripts (119023)

2014-12-01 Thread Clint Byrum
Excerpts from James Slagle's message of 2014-11-28 11:27:20 -0800:
 On Thu, Nov 27, 2014 at 1:29 PM, Sullivan, Jon Paul
 jonpaul.sulli...@hp.com wrote:
  -Original Message-
  From: Ben Nemec [mailto:openst...@nemebean.com]
  Sent: 26 November 2014 17:03
  To: OpenStack Development Mailing List (not for usage questions)
  Subject: Re: [openstack-dev] [diskimage-builder] Tracing levels for
  scripts (119023)
 
  On 11/25/2014 10:58 PM, Ian Wienand wrote:
   Hi,
  
   My change [1] to enable a consistent tracing mechanism for the many
   scripts diskimage-builder runs during its build seems to have hit a
   stalemate.
  
   I hope we can agree that the current situation is not good.  When
   trying to develop with diskimage-builder, I find myself constantly
   going and fiddling with set -x in various scripts, requiring me
   re-running things needlessly as I try and trace what's happening.
   Conversley some scripts set -x all the time and give output when you
   don't want it.
  
   Now nodepool is using d-i-b more, it would be even nicer to have
   consistency in the tracing so relevant info is captured in the image
   build logs.
  
   The crux of the issue seems to be some disagreement between reviewers
   over having a single trace everything flag or a more fine-grained
   approach, as currently implemented after it was asked for in reviews.
  
   I must be honest, I feel a bit silly calling out essentially a
   four-line patch here.
 
  My objections are documented in the review, but basically boil down to
  the fact that it's not a four line patch, it's a 500+ line patch that
  does essentially the same thing as:
 
  set +e
  set -x
  export SHELLOPTS
 
  I don't think this is true, as there are many more things in SHELLOPTS than 
  just xtrace.  I think it is wrong to call the two approaches equivalent.
 
 
  in disk-image-create.  You do lose set -e in disk-image-create itself on
  debug runs because that's not something we can safely propagate,
  although we could work around that by unsetting it before calling hooks.
   FWIW I've used this method locally and it worked fine.
 
  So this does say that your alternative implementation has a difference from 
  the proposed one.  And that the difference has a negative impact.
 
 
  The only drawback is it doesn't allow the granularity of an if block in
  every script, but I don't personally see that as a particularly useful
  feature anyway.  I would like to hear from someone who requested that
  functionality as to what their use case is and how they would define the
  different debug levels before we merge an intrusive patch that would
  need to be added to every single new script in dib or tripleo going
  forward.
 
  So currently we have boilerplate to be added to all new elements, and that 
  boilerplate is:
 
  set -eux
  set -o pipefail
 
  This patch would change that boilerplate to:
 
  if [ ${DIB_DEBUG_TRACE:-0} -gt 0 ]; then
  set -x
  fi
  set -eu
  set -o pipefail
 
  So it's adding 3 lines.  It doesn't seem onerous, especially as most people 
  creating a new element will either copy an existing one or copy/paste the 
  header anyway.
 
  I think that giving control over what is effectively debug or non-debug 
  output is a desirable feature.
 
 I don't think it's debug vs non-debug. I think script writers that
 have explicitly used set -x previously have then operated under the
 assumption that they don't need to add any useful logging since it's
 running -x. In that case, this patch is actually harmful.
 

I believe James has hit the nail squarely on the head with the paragraph
above.

I propose a way forward for this:

1) Conform all o-r-c scripts to the logging standards we have in
OpenStack, or write new standards for diskimage-builder and conform
them to those standards. Abolish non-conditional xtrace in any script
conforming to the standards.

2) Once that is done, implement optional -x. I rather prefer the explicit
conditional set -x implementation over SHELLOPTS. As somebody else
pointed out, it feels like asking for unintended side-effects. But the
how is far less important than the what in this case, which step 1
will better define.

Anyone else have a better plan?

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] DB: transaction isolation and related questions

2014-11-19 Thread Clint Byrum
Excerpts from Mike Bayer's message of 2014-11-19 10:05:35 -0800:
 
  On Nov 18, 2014, at 1:38 PM, Eugene Nikanorov enikano...@mirantis.com 
  wrote:
  
  Hi neutron folks,
  
  There is an ongoing effort to refactor some neutron DB logic to be 
  compatible with galera/mysql which doesn't support locking 
  (with_lockmode('update')).
  
  Some code paths that used locking in the past were rewritten to retry the 
  operation if they detect that an object was modified concurrently.
  The problem here is that all DB operations (CRUD) are performed in the 
  scope of some transaction that makes complex operations to be executed in 
  atomic manner.
  For mysql the default transaction isolation level is 'REPEATABLE READ' 
  which means that once the code issue a query within a transaction, this 
  query will return the same result while in this transaction (e.g. the 
  snapshot is taken by the DB during the first query and then reused for the 
  same query).
  In other words, the retry logic like the following will not work:
  
  def allocate_obj():
      with session.begin(subtrans=True):
          for i in xrange(n_retries):
              obj = session.query(Model).filter_by(filters)
              count = session.query(Model).filter_by(id=obj.id).\
                  update({'allocated': True})
              if count:
                  return obj
  
  since usually methods like allocate_obj() is called from within another 
  transaction, we can't simply put transaction under 'for' loop to fix the 
  issue.
 
 has this been confirmed?  the point of systems like repeatable read is not 
 just that you read the “old” data, it’s also to ensure that updates to that 
 data either proceed or fail explicitly; locking is also used to prevent 
 concurrent access that can’t be reconciled.  A lower isolation removes these 
 advantages.  
 

Yes this is confirmed and fails reliably on Galera based systems.

 I ran a simple test in two MySQL sessions as follows:
 
 session 1:
 
 mysql> create table some_table(data integer) engine=innodb;
 Query OK, 0 rows affected (0.01 sec)
 
 mysql> insert into some_table(data) values (1);
 Query OK, 1 row affected (0.00 sec)
 
 mysql> begin;
 Query OK, 0 rows affected (0.00 sec)
 
 mysql> select data from some_table;
 +------+
 | data |
 +------+
 |    1 |
 +------+
 1 row in set (0.00 sec)
 
 
 session 2:
 
 mysql> begin;
 Query OK, 0 rows affected (0.00 sec)
 
 mysql> update some_table set data=2 where data=1;
 Query OK, 1 row affected (0.00 sec)
 Rows matched: 1  Changed: 1  Warnings: 0
 
 then back in session 1, I ran:
 
 mysql> update some_table set data=3 where data=1;
 
 this query blocked;  that’s because session 2 has placed a write lock on the 
 table.  this is the effect of repeatable read isolation.

With Galera this session might happen on another node. There is no
distributed lock, so this would not block...

 
 while it blocked, I went to session 2 and committed the in-progress 
 transaction:
 
 mysql> commit;
 Query OK, 0 rows affected (0.00 sec)
 
 then session 1 unblocked, and it reported, correctly, that zero rows were 
 affected:
 
 Query OK, 0 rows affected (7.29 sec)
 Rows matched: 0  Changed: 0  Warnings: 0
 
 the update had not taken place, as was stated by “rows matched”:
 
 mysql> select * from some_table;
 +------+
 | data |
 +------+
 |    1 |
 +------+
 1 row in set (0.00 sec)
 
 the code in question would do a retry at this point; it is checking the 
 number of rows matched, and that number is accurate.
 
 if our code did *not* block at the point of our UPDATE, then it would have 
 proceeded, and the other transaction would have overwritten what we just did, 
 when it committed.   I don’t know that read committed is necessarily any 
 better here.
 
 now perhaps, with Galera, none of this works correctly.  That would be a 
 different issue in which case sure, we should use whatever isolation is 
 recommended for Galera.  But I’d want to potentially peg it to the fact that 
 Galera is in use, or not.
 
 would love also to hear from Jay Pipes on this since he literally wrote the 
 book on MySQL ! :)

What you missed is that with Galera the commit that happened last would
be rolled back. This is a reality in many scenarios on SQL databases and
should be handled _regardless_ of Galera. It is a valid way to handle
deadlocks on single node DBs as well (pgsql will do this sometimes).

One simply cannot rely on multi-statement transactions to always succeed.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all] Scale out bug-triage by making it easier for people to contribute

2014-11-18 Thread Clint Byrum
Excerpts from Flavio Percoco's message of 2014-11-17 08:46:19 -0800:
 Greetings,
 
 Regardless of how big/small bugs backlog is for each project, I
 believe this is a common, annoying and difficult problem. At the oslo
 meeting today, we're talking about how to address our bug triage
 process and I proposed something that I've seen done in other
 communities (rust-language [0]) that I consider useful and a good
 option for OpenStack too.
 
 The process consist in a bot that sends an email to every *volunteer*
 with 10 bugs to review/triage for the week. Each volunteer follows the
 triage standards, applies tags and provides information on whether the
 bug is still valid or not. The volunteer doesn't have to fix the bug,
 just triage it.
 
 In openstack, we could have a job that does this and then have people
 from each team volunteer to help with triage. The benefits I see are:
 
 * Interested folks don't have to go through the list and filter the
 bugs they want to triage. The bot should be smart enough to pick the
 oldest, most critical, etc.
 
 * It's a totally opt-in process and volunteers can obviously ignore
 emails if they don't have time that week.
 
 * It helps scaling out the triage process without poking people around
 and without having to do a call for volunteers every meeting/cycle/etc
 
 The above doesn't solve the problme completely but just like reviews,
 it'd be an optional, completely opt-in process that people can sign up
 for.
 

My experience in Ubuntu, where we encouraged non-developers to triage
bugs, was that non-developers often ask the wrong questions and
sometimes even harm the process by putting something in the wrong
priority or state because of a lack of deep understanding.

Triage in a hospital is done by experienced nurses and doctors working
together, not by dedicated triagers. This is because it may not always be
obvious to somebody just how important a problem is. We have the same set of
problems. The most important thing is that developers see it as an
important task and take part. New volunteers should be getting involved
at every level, not just bug triage.

I think the best approach to this, like reviews, is to have a place
where users can go to drive the triage workload to 0. For instance, the
ubuntu server team had this report for triage:

http://reqorts.qa.ubuntu.com/reports/ubuntu-server/triage-report.html

Sadly, it looks like they're overwhelmed or have abandoned the effort
(I hope this doesn't say something about Ubuntu server itself..), but
the basic process was to move bugs off these lists. I'm sure if we ask
nice the author of that code will share it with us and we could adapt
it for OpenStack projects.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage

2014-11-14 Thread Clint Byrum
Excerpts from Chris Jones's message of 2014-11-14 00:42:48 -0800:
 Hi
 
 My thoughts:
 
 Shoe-horning the ephemeral partition into Cinder seems like a lot of pain for 
 almost no gain[1]. The only gain I can think of would be that we could bring 
 a node down, boot it into a special ramdisk that exposes the volume to the 
 network, so cindery operations (e.g. migration) could be performed, but I'm 
 not even sure if anyone is asking for that?
 
 Forcing Cinder to understand and track something it can never normally do 
 anything with, seems like we're just trying to squeeze ourselves into an 
 ever-shrinking VM costume!
 
 Having said that, preserve ephemeral is a terrible oxymoron, so if we can 
 do something about it, we probably should.
 
 How about instead, we teach Nova/Ironic about a concept of no ephemeral? 
 They make a partition on the first disk for the first image they deploy, and 
 then they never touch the other part(s) of the disk(s), until the instance is 
 destroyed. This creates one additional burden for operators, which is to 
 create and format a partition the first time they boot, but since this is a 
 very small number of commands, and something we could trivially bake into our 
 (root?) elements, I'm not sure it's a huge problem.
 
 This gets rid of the cognitive dissonance of preserving something that is 
 described as ephemeral, and (IMO) makes it extremely clear that OpenStack 
 isn't going to touch anything but the first partition of the first disk. If 
 this were baked into the flavour rather than something we tack onto a nova 
 rebuild command, it offers greater safety for operators, against the risk of 
 accidentallying a vital state partition with a misconstructed rebuild command.
 

+1

A predictable and simple rule seems like it would go a long way to
decoupling state preservation from rebuild, which I like very much.

There is, of course, the issue of decom then, but that has never been a
concern for TripleO, and for OnMetal, they think we're a bit daft trying
to preserve state while delivering new images anyway. :)

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] LTFS integration with OpenStack Swift for scenario like - Data Archival as a Service .

2014-11-14 Thread Clint Byrum
Excerpts from Samuel Merritt's message of 2014-11-14 10:06:53 -0800:
 On 11/13/14, 10:19 PM, Sachin Goswami wrote:
  In OpenStack Swift - xfs file system is integrated which provides a
  maximum file system size of 8 exbibytes minus one byte (2^63 - 1 bytes).
 
 Not exactly. The Swift storage nodes keep their data on POSIX 
 filesystems with support for extended attributes. While XFS filesystems 
 are typically used, XFS is not required.
 
  We are studying use of LTFS integration with OpenStack Swift for
  scenario like - *Data Archival as a Service* .
 
  Was integration of LTFS with Swift considered before? If so, can you
   please share your study output? Will integration of LTFS with Swift
  fit into existing Swift architecture ?
 
 Assuming it's POSIX enough and supports extended attributes, a tape 
 filesystem on a spinning disk might technically work, but I don't see it 
 performing well at all.
 
 If you're talking about using actual tapes for data storage, I can't 
 imagine that working out for you. Most clients aren't prepared to wait 
 multiple minutes for HTTP responses while a tape laboriously spins back 
 and forth, so they'll just time out.
 

Agreed. You'd need to have a separate API for freezing and thawing data
I think, similar to the way glacier works. However, my understanding of
glacier is that it is simply a massive bank of cheap disks which are
largely kept powered off until either a ton of requests for data on a
single disk arrive, or a certain amount of time has passed. The benefit
of this is that there is no intermediary storage required. The disks
are either online, and you can read your data, or offline, and you have
to wait.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
 On 13/11/14 03:29, Murugan, Visnusaran wrote:
  Hi all,
 
  Convergence-POC distributes stack operations by sending resource actions
  over RPC for any heat-engine to execute. Entire stack lifecycle will be
  controlled by worker/observer notifications. This distributed model has
  its own advantages and disadvantages.
 
  Any stack operation has a timeout and a single engine will be
  responsible for it. If that engine goes down, timeout is lost along with
  it. So a traditional way is for other engines to recreate timeout from
  scratch. Also a missed resource action notification will be detected
  only when stack operation timeout happens.
 
  To overcome this, we will need the following capability:
 
  1.Resource timeout (can be used for retry)
 
 I don't believe this is strictly needed for phase 1 (essentially we 
 don't have it now, so nothing gets worse).
 

We do have a stack timeout, and it stands to reason that we won't have a
single box with a timeout greenthread after this, so a strategy is
needed.

 For phase 2, yes, we'll want it. One thing we haven't discussed much is 
 that if we used Zaqar for this then the observer could claim a message 
 but not acknowledge it until it had processed it, so we could have 
 guaranteed delivery.


Frankly, if oslo.messaging doesn't support reliable delivery then we
need to add it. Zaqar should have nothing to do with this and is, IMO, a
poor choice at this stage, though I like the idea of using it in the
future so that we can make Heat more of an outside-the-cloud app.

  2.Recover from engine failure (loss of stack timeout, resource action
  notification)
 
  Suggestion:
 
  1.Use task queue like celery to host timeouts for both stack and resource.
 
 I believe Celery is more or less a non-starter as an OpenStack 
 dependency because it uses Kombu directly to talk to the queue, vs. 
 oslo.messaging which is an abstraction layer over Kombu, Qpid, ZeroMQ 
 and maybe others in the future. i.e. requiring Celery means that some 
 users would be forced to install Rabbit for the first time.

 One option would be to fork Celery and replace Kombu with oslo.messaging 
 as its abstraction layer. Good luck getting that maintained though, 
 since Celery _invented_ Kombu to be its abstraction layer.
 

A slight side point here: Kombu supports Qpid and ZeroMQ. Oslo.messaging
is more about having a unified API than a set of magic backends. It
actually boggles my mind why we didn't just use kombu (cue 20 reactions
with people saying it wasn't EXACTLY right), but I think we're committed
to oslo.messaging now. Anyway, celery would need no such refactor, as
kombu would be able to access the same bus as everything else just fine.

  2.Poll database for engine failures and restart timers/ retrigger
  resource retry (IMHO: This would be a traditional and weighs heavy)
 
  3.Migrate heat to use TaskFlow. (Too many code change)
 
 If it's just handling timed triggers (maybe this is closer to #2) and 
 not migrating the whole code base, then I don't see why it would be a 
 big change (or even a change at all - it's basically new functionality). 
 I'm not sure if TaskFlow has something like this already. If not we 
 could also look at what Mistral is doing with timed tasks and see if we 
 could spin some of it out into an Oslo library.
 

I feel like it boils down to something running periodically checking for
scheduled tasks that are due to run but have not run yet. I wonder if we
can actually look at Ironic for how they do this, because Ironic polls
power state of machines constantly, and uses a hash ring to make sure
only one conductor is polling any one machine at a time. If we broke
stacks up into a hash ring like that for the purpose of singleton tasks
like timeout checking, that might work out nicely.
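
Something like this toy ring is all it takes to decide ownership locally
(Ironic's real implementation also handles membership changes and replica
counts more carefully, so treat this strictly as a sketch):

    import bisect
    import hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

    class HashRing(object):
        # Map stack ids onto live engines; only the owner runs the
        # timeout task for a given stack.
        def __init__(self, engine_ids, replicas=64):
            self._ring = sorted((_hash('%s-%d' % (e, i)), e)
                                for e in engine_ids
                                for i in range(replicas))
            self._keys = [k for k, _ in self._ring]

        def owner(self, stack_id):
            idx = bisect.bisect(self._keys, _hash(stack_id)) % len(self._ring)
            return self._ring[idx][1]

Each engine would periodically check ring.owner(stack_id) against its own
id before touching a stack's timer, and rebuild the ring whenever the set
of live engines changes.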

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
 A question;
 
 How is using something like celery in heat vs taskflow in heat (or at least 
 concept [1]) 'to many code change'.
 
 Both seem like change of similar levels ;-)
 

I've tried a few times to dive into refactoring some things to use
TaskFlow at a shallow level, and have always gotten confused and
frustrated.

The amount of lines that are changed probably is the same. But the
massive shift in thinking is not an easy one to make. It may be worth some
thinking on providing a shorter bridge to TaskFlow adoption, because I'm
a huge fan of the idea and would _start_ something with it in a heartbeat,
but refactoring things to use it feels really weird to me.

 What was your metric for determining the code change either would have (out 
 of curiosity)?
 
 Perhaps u should look at [2], although I'm unclear on what the desired 
 functionality is here.
 
 Do u want the single engine to transfer its work to another engine when it 
 'goes down'? If so then the jobboard model + zookeper inherently does this.
 
 Or maybe u want something else? I'm probably confused because u seem to be 
 asking for resource timeouts + recover from engine failure (which seems like 
 a liveness issue and not a resource timeout one), those 2 things seem 
 separable.
 

I agree with you on this. It is definitely a liveness problem. The
resource timeout isn't something I've seen discussed before. We do have
a stack timeout, and we need to keep on honoring that, but we can do
that with a job that sleeps for the stack timeout if we have a liveness
guarantee that will resurrect the job (with the sleep shortened by the
time since stack-update-time) somewhere else if the original engine
can't complete the job.
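
Rough shape of what I mean, with invented attribute and helper names; the
key point is that the sleep is recomputed from the recorded update time
rather than from whenever the job happens to start:

    import time

    def stack_timeout_job(stack):
        # If this engine dies, the liveness guarantee re-runs the job on
        # another engine; deriving the sleep from updated_at keeps the
        # deadline fixed no matter where or when the job restarts.
        remaining = stack.timeout_secs - (time.time() - stack.updated_at)
        if remaining > 0:
            time.sleep(remaining)
        if not stack_is_complete(stack):
            mark_stack_failed(stack, 'Stack %s timed out' % stack.id)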

 [1] http://docs.openstack.org/developer/taskflow/jobs.html
 
 [2] 
 http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple
 
 On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran visnusaran.muru...@hp.com 
 wrote:
 
  Hi all,
   
  Convergence-POC distributes stack operations by sending resource actions 
  over RPC for any heat-engine to execute. Entire stack lifecycle will be 
  controlled by worker/observer notifications. This distributed model has its 
  own advantages and disadvantages.
   
  Any stack operation has a timeout and a single engine will be responsible 
  for it. If that engine goes down, timeout is lost along with it. So a 
  traditional way is for other engines to recreate timeout from scratch. Also 
  a missed resource action notification will be detected only when stack 
  operation timeout happens.
   
  To overcome this, we will need the following capability:
  1.   Resource timeout (can be used for retry)
  2.   Recover from engine failure (loss of stack timeout, resource 
  action notification)
   
   
  Suggestion:
  1.   Use task queue like celery to host timeouts for both stack and 
  resource.
  2.   Poll database for engine failures and restart timers/ retrigger 
  resource retry (IMHO: This would be a traditional and weighs heavy)
  3.   Migrate heat to use TaskFlow. (Too many code change)
   
  I am not suggesting we use Task Flow. Using celery will have very minimum 
  code change. (decorate appropriate functions)
   
   
  Your thoughts.
   
  -Vishnu
  IRC: ckmvishnu
  ___
  OpenStack-dev mailing list
  OpenStack-dev@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2014-11-13 09:55:43 -0800:
 On 13/11/14 09:58, Clint Byrum wrote:
  Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
  On 13/11/14 03:29, Murugan, Visnusaran wrote:
  Hi all,
 
  Convergence-POC distributes stack operations by sending resource actions
  over RPC for any heat-engine to execute. Entire stack lifecycle will be
  controlled by worker/observer notifications. This distributed model has
  its own advantages and disadvantages.
 
  Any stack operation has a timeout and a single engine will be
  responsible for it. If that engine goes down, timeout is lost along with
  it. So a traditional way is for other engines to recreate timeout from
  scratch. Also a missed resource action notification will be detected
  only when stack operation timeout happens.
 
  To overcome this, we will need the following capability:
 
  1.Resource timeout (can be used for retry)
 
  I don't believe this is strictly needed for phase 1 (essentially we
  don't have it now, so nothing gets worse).
 
 
  We do have a stack timeout, and it stands to reason that we won't have a
  single box with a timeout greenthread after this, so a strategy is
  needed.
 
 Right, that was 2, but I was talking specifically about the resource 
 retry. I think we agree on both points.
 
  For phase 2, yes, we'll want it. One thing we haven't discussed much is
  that if we used Zaqar for this then the observer could claim a message
  but not acknowledge it until it had processed it, so we could have
  guaranteed delivery.
 
 
  Frankly, if oslo.messaging doesn't support reliable delivery then we
  need to add it.
 
 That is straight-up impossible with AMQP. Either you ack the message and 
 risk losing it if the worker dies before processing is complete, or you 
 don't ack the message until it's processed and you become a blocker for 
 every other worker trying to pull jobs off the queue. It works fine when 
 you have only one worker; otherwise not so much. This is the crux of the 
 whole why isn't Zaqar just Rabbit debate.
 

I'm not sure we have the same understanding of AMQP, so hopefully we can
clarify here. This stackoverflow answer echoes my understanding:

http://stackoverflow.com/questions/17841843/rabbitmq-does-one-consumer-block-the-other-consumers-of-the-same-queue

Not ack'ing just means they might get retransmitted if we never ack. It
doesn't block other consumers. And as the link above quotes from the
AMQP spec, when there are multiple consumers, FIFO is not guaranteed.
Other consumers get other messages.

So just add the ability for a consumer to read, work, ack to
oslo.messaging, and this is mostly handled via AMQP. Of course that
also likely means no zeromq for Heat without accepting that messages
may be lost if workers die.
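
To be explicit, the consume-work-ack pattern at the kombu level looks
roughly like this (the queue wiring and do_resource_action are made up;
the point is just that the ack happens after the work, and the prefetch
keeps each worker to one unacked job at a time):

    from kombu import Connection, Exchange, Queue

    jobs = Queue('heat-jobs', Exchange('heat-jobs', type='direct'),
                 routing_key='heat-jobs', durable=True)

    def handle(body, message):
        do_resource_action(body)   # includes queueing newly-ready children
        message.ack()              # unacked messages are redelivered if we die

    with Connection('amqp://guest:guest@localhost//') as conn:
        with conn.Consumer(jobs, callbacks=[handle], prefetch_count=1):
            while True:
                conn.drain_events()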

Basically we need to add something that is not RPC but instead
jobqueue that mimics this:

http://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/rpc/dispatcher.py#n131

I've always been suspicious of this bit of code, as it basically means
that if anything fails between that call, and the one below it, we have
lost contact, but as long as clients are written to re-send when there
is a lack of reply, there shouldn't be a problem. But, for a job queue,
there is no reply, and so the worker would dispatch, and then
acknowledge after the dispatched call had returned (including having
completed the step where new messages are added to the queue for any
newly-possible children).

Just to be clear, I believe what Zaqar adds is the ability to peek at
a specific message ID and not affect it in the queue, which is entirely
different than ACK'ing the ones you've already received in your session.

 Most stuff in OpenStack gets around this by doing synchronous calls 
 across oslo.messaging, where there is an end-to-end ack. We don't want 
 that here though. We'll probably have to make do with having ways to 
 recover after a failure (kick off another update with the same data is 
 always an option). The hard part is that if something dies we don't 
 really want to wait until the stack timeout to start recovering.


I fully agree. Josh's point about using a coordination service like
Zookeeper to maintain liveness is an interesting one here. If we just
make sure that all the workers that have claimed work off the queue are
alive, that should be sufficient to prevent a hanging stack situation
like you describe above.

  Zaqar should have nothing to do with this and is, IMO, a
  poor choice at this stage, though I like the idea of using it in the
  future so that we can make Heat more of an outside-the-cloud app.
 
 I'm inclined to agree that it would be hard to force operators to deploy 
 Zaqar in order to be able to deploy Heat, and that we should probably be 
 cautious for that reason.
 
 That said, from a purely technical point of view it's not a poor choice 
 at all - it has *exactly* the semantics we want (unlike AMQP), and at 
 least

Re: [openstack-dev] [all] config options not correctly deprecated

2014-11-13 Thread Clint Byrum
Excerpts from Ben Nemec's message of 2014-11-13 15:20:47 -0800:
 On 11/10/2014 05:00 AM, Daniel P. Berrange wrote:
  On Mon, Nov 10, 2014 at 09:45:02AM +, Derek Higgins wrote:
  Tl;dr oslo.config wasn't logging warnings about deprecated config
  options, do we need to support them for another cycle?
  
  AFAIK, there has not been any change in oslo.config behaviour
  in the Juno release, as compared to previous releases. The
  oslo.config behaviour is that the generated sample config file
  contain all the deprecation information.
  
  The idea that oslo.config should issue log warnings is a decent RFE
  to make the use of deprecated config settings more visible.
  This is an enhancement though, not a bug.
  
  A set of patches to remove deprecated options in Nova was landed on
  Thursday[1], these were marked as deprecated during the juno dev cycle
  and got removed now that kilo has started.
  
  Yes, this is our standard practice - at the start of each release
  cycle, we delete anything that was marked as deprected in the
  previous release cycle. ie we give downstream users/apps 1 release
  cycle of grace to move to the new option names.
  
  Most of the deprecated config options are listed as deprecated in the
  documentation for nova.conf changes[2] linked to from the Nova upgrade
  section in the Juno release notes[3] (the deprecated cinder config
  options are not listed here along with the allowed_direct_url_schemes
  glance option).
  
  The sample nova.conf generated by oslo lists all the deprecations.
  
  For example, for cinder options it shows what the old config option
  name was.
  
[cinder]
  
#
# Options defined in nova.volume.cinder
#
  
# Info to match when looking for cinder in the service
# catalog. Format is: separated values of the form:
# service_type:service_name:endpoint_type (string value)
# Deprecated group/name - [DEFAULT]/cinder_catalog_info
#catalog_info=volume:cinder:publicURL
  
  Also note the deprecated name will not appear as an option in the
  sample config file at all, other than in this deprecation comment.
  
  
  My main worry is that there were no warnings about these options being
  deprecated in nova's logs (as a result they were still being used in
  tripleo), once I noticed tripleo's CI jobs were failing and discovered
  the reason I submitted 4 reverts to put back the deprecated options in
  nova[4] as I believe they should now be supported for another cycle
  (along with a fix to oslo.config to log warnings about their use). The 4
  patches have now been blocked as they go against our deprecation policy.
 
  I believe the correct way to handle this is to support these options for
  another cycle so that other operators don't get hit when upgrading to
  kilo. While at that same time fix oslo.config to report the deprecated
  options in kilo.
  
  I have marked this mail with the [all] tag because there are other
  projects using the same deprecated_name (or deprecated_group)
  parameter when adding config options, I think those projects also now
  need to support their deprecated options for another cycle.
  
  AFAIK, there's nothing different about Juno vs previous release cycles,
  so I don't see any reason to do anything different this time around.
  No matter what we do there is always a possibility that downstream
  apps / users will not notice and/or ignore the deprecation. We should
  certainly look at how to make deprecation more obvious, but I don't
  think we should change our policy just because an app missed the fact
  that these were deprecated.
 
 So the difference to me is that this cycle we are aware that we're
 creating a crappy experience for deployers.  In the past we didn't have
 anything in the CI environment simulating a real deployment so these
 sorts of issues went unnoticed.  IMHO telling deployers that they have
 to troll the sample configs and try to figure out which deprecated opts
 they're still using is not an acceptable answer.
 

I don't know if this is really fair, as all of the deprecated options do
appear here:

http://docs.openstack.org/juno/config-reference/content/nova-conf-changes-juno.html
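
Those tables get generated from the option declarations themselves, so the
deprecation is always captured somewhere. For reference, a declaration
looks roughly like this (illustrative, not the literal nova code):

    from oslo.config import cfg

    CONF = cfg.CONF
    CONF.register_opts([
        cfg.StrOpt('catalog_info',
                   default='volume:cinder:publicURL',
                   deprecated_name='cinder_catalog_info',
                   deprecated_group='DEFAULT',
                   help='Info to match when looking for cinder in the '
                        'service catalog.'),
    ], group='cinder')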

So the real bug is that in TripleO we're not paying attention to the
appropriate stream of deprecations. Logs on running systems are a mighty
big hammer when the documentation is being updated for us, and we're
just not paying attention in the right place.

BTW, where SHOULD continuous deployers pay attention for this stuff?

 Now that we do know, I think we need to address the issue.  The first
 step is to revert the deprecated removals - they're not hurting
 anything, and if we wait another cycle we can fix oslo.config and then
 remove them once deployers have had a reasonable chance to address the
 deprecation.
 

In this case, we can just fix the templates. Are we broken? Yes. Can we
fix it? YES! I would definitely appreciate the reverts preceding that,
so that we can land other things without 

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2014-11-13 14:01:14 -0800:
 On Nov 13, 2014, at 7:10 AM, Clint Byrum cl...@fewbar.com wrote:
 
  Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
  A question;
  
  How is using something like celery in heat vs taskflow in heat (or at 
  least concept [1]) 'to many code change'.
  
  Both seem like change of similar levels ;-)
  
  
  I've tried a few times to dive into refactoring some things to use
  TaskFlow at a shallow level, and have always gotten confused and
  frustrated.
  
  The amount of lines that are changed probably is the same. But the
  massive shift in thinking is not an easy one to make. It may be worth some
  thinking on providing a shorter bridge to TaskFlow adoption, because I'm
  a huge fan of the idea and would _start_ something with it in a heartbeat,
  but refactoring things to use it feels really weird to me.
 
 I wonder how I can make that better...
 
  Were the concepts that new/different? Maybe I just have more of a functional
 programming background and the way taskflow gets you to create tasks that are 
 later executed, order them ahead of time, and then *later* run them is still 
 a foreign concept for folks that have not done things with non-procedural 
 languages. What were the confusion points, how may I help address them? More 
 docs maybe, more examples, something else?

My feeling is that it is hard to let go of the language constructs that
_seem_ to solve the problems TaskFlow does, even though in fact they are
the problem because they're using the stack for control-flow where we
want that control-flow to yield to TaskFlow.

I also kind of feel like the Twisted folks answered a similar question
with inline callbacks and made things easier but more complex in
doing so. If I had a good answer I would give it to you though. :)

 
 I would agree that the jobboard[0] concept is different than the other parts 
 of taskflow, but it could be useful here:
 
 Basically at its core its a application of zookeeper where 'jobs' are posted 
 to a directory (using sequenced nodes in zookeeper, so that ordering is 
 retained). Entities then acquire ephemeral locks on those 'jobs' (these locks 
 will be released if the owner process disconnects, or fails...) and then work 
 on the contents of that job (where contents can be pretty much arbitrary). 
 This creates a highly available job queue (queue-like due to the node 
 sequencing[1]), and it sounds pretty similar to what zaqar could provide in 
 theory (except the zookeeper one is proven, battle-hardened, works and 
 exists...). But we should of course continue being scared of zookeeper, 
 because u know, who wants to use a tool where it would fit, haha (this is a 
 joke).
 

So ordering is a distraction from the task at hand. But the locks that
indicate liveness of the workers are very interesting to me. Since we
don't actually have requirements of ordering on the front-end of the task
(we do on the completion of certain tasks, but we can use a DB for that),
I wonder if we can just get the same effect with a durable queue that uses
a reliable messaging pattern where we don't ack until we're done. That
would achieve the goal of liveness.

 [0] 
 https://github.com/openstack/taskflow/blob/master/taskflow/jobs/jobboard.py#L25
  
 
 [1] 
 http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming
 
  
  What was your metric for determining the code change either would have 
  (out of curiosity)?
  
  Perhaps u should look at [2], although I'm unclear on what the desired 
  functionality is here.
  
  Do u want the single engine to transfer its work to another engine when it 
  'goes down'? If so then the jobboard model + zookeper inherently does this.
  
  Or maybe u want something else? I'm probably confused because u seem to be 
  asking for resource timeouts + recover from engine failure (which seems 
  like a liveness issue and not a resource timeout one), those 2 things seem 
  separable.
  
  
  I agree with you on this. It is definitely a liveness problem. The
  resource timeout isn't something I've seen discussed before. We do have
  a stack timeout, and we need to keep on honoring that, but we can do
  that with a job that sleeps for the stack timeout if we have a liveness
  guarantee that will resurrect the job (with the sleep shortened by the
  time since stack-update-time) somewhere else if the original engine
  can't complete the job.
  
  [1] http://docs.openstack.org/developer/taskflow/jobs.html
  
  [2] 
  http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple
  
  On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran 
  visnusaran.muru...@hp.com wrote:
  
  Hi all,
  
  Convergence-POC distributes stack operations by sending resource actions 
  over RPC for any heat-engine to execute. Entire stack lifecycle will be 
  controlled by worker/observer notifications. This distributed model has
