Dan, you are leaving out the parts of my response where I agree
with you and say that your "Option #2" is probably the thing we
should go with.
-jay
On 06/01/2018 12:22 PM, Dan Smith wrote:
So, you're saying the normal process is to try upgrading the Linux
kernel and associated low-level libs, wait the requisite amount of
time that takes (can be a long time) and just hope that everything
comes back OK? That doesn't sound like any upgrade I've ever seen.
I'm saying I think it's a process practiced by some to install the new
kernel and libs and then reboot to activate, yeah.
No, sorry if I wasn't clear. They can live-migrate the instances off
of the to-be-upgraded compute host. They would only need to
cold-migrate instances that use the aforementioned non-movable
resources.
I don't think it's reasonable to force people to move every
instance in their cloud (live or otherwise) in order to upgrade. That
means that people who currently do their upgrades in-place in one step
now have to do their upgrade in N steps, for N compute nodes. That
doesn't seem reasonable to me.
If we are going to go through the hassle of writing a bunch of
transformation code in order to keep operator action as low as
possible, I would prefer to consolidate all of this code into the
nova-manage (or nova-status) tool and put some sort of
attribute/marker on each compute node record to indicate whether a
"heal" operation has occurred for that compute node.
We need to know details of each compute node in order to do that. We
could make the tool external and something they run per-compute node,
but that still makes it N steps, even if the N steps are lighter
weight.
Someone (maybe Gibi?) on this thread had mentioned having the virt
driver (in update_provider_tree) do the whole "set reserved = total"
thing when first attempting to create the child providers. That would
work to prevent the scheduler from attempting to place workloads on
those child providers, but we would still need some marker on the
compute node to indicate to the nova-manage heal_nested_providers (or
whatever) command that the compute node has had its provider tree
validated/healed, right?
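
To make the "reserved = total" plus marker idea above concrete, here is a
minimal sketch in plain Python. It is not nova or placement code: the
Inventory and ComputeNode classes, the "healed" attribute, and both
functions are hypothetical names invented for illustration, and the real
work of moving existing allocations onto the child providers is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Inventory:
    """Hypothetical stand-in for one child provider's inventory."""
    total: int
    reserved: int = 0

    @property
    def available(self) -> int:
        return self.total - self.reserved


@dataclass
class ComputeNode:
    """Hypothetical compute node record carrying a 'healed' marker."""
    name: str
    healed: bool = False
    children: Dict[str, Inventory] = field(default_factory=dict)


def create_child_providers(node: ComputeNode, totals: Dict[str, int]) -> None:
    """What the virt driver might do on first creation: reserved = total,
    so nothing can be scheduled to the new child providers yet."""
    for child_name, total in totals.items():
        if child_name not in node.children:
            node.children[child_name] = Inventory(total=total, reserved=total)


def heal_nested_providers(
    node: ComputeNode, real_reserved: Optional[Dict[str, int]] = None
) -> bool:
    """What a heal command might do: skip nodes already marked, otherwise
    restore the real reserved values and set the marker.
    (Moving existing allocations is elided in this sketch.)"""
    if node.healed:
        return False
    real_reserved = real_reserved or {}
    for child_name, inv in node.children.items():
        inv.reserved = real_reserved.get(child_name, 0)
    node.healed = True
    return True
```

The marker is what makes the heal step safe to re-run: a second call on
the same node is a no-op, and until the first call the child providers
expose no available capacity.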
So that means you restart your cloud and it's basically locked up until
you perform the N steps to unlock N nodes? That also seems like it's not
going to make us very popular on the playground :)
I need to go read Eric's tome on how to handle the communication of
things from virt to compute so that this translation can be done. I'm
not saying I have the answer, I'm just saying that making this the
problem of the operators doesn't seem like a solution to me, and that we
should figure out how we're going to do this before we go down the
rabbit hole.
--Dan