[openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-28 Thread Sylvain Bauza
Hi,

I already told about that in a separate thread, but let's put it here too
for more visibility.

tl;dr: I suspect existing allocations are being lost when we upgrade a
compute service from Queens to Rocky, if those allocations are made against
inventories that are now provided by a child Resource Provider.


I started reviewing https://review.openstack.org/#/c/565487/ and bottom
patches to understand the logic with querying nested resource providers.
From what I understand, the scheduler will query Placement using the same
query but, thanks to a new microversion, will get back allocation candidates
that are not only root resource providers but also any possible children.

If so, that's great: in a rolling-upgrade scenario with mixed computes
(both Queens and Rocky), we will continue to return both old root RPs and
new child RPs as long as they can serve the same resource class request.
Accordingly, allocations made by the scheduler will be made against the
corresponding Resource Provider, whether it's a root RP (old way) or a
child RP (new way).

Do I still understand correctly? If yes, perfect, let's jump to my upgrade
concern.
Now, consider the Queens->Rocky compute upgrade. If I'm an operator and I
start deploying Rocky on one compute node, it will report to the Placement
API new inventories that are possibly nested.
In that situation, say for example with VGPU inventories, that would mean
that the compute node would stop reporting inventories for its root RP, but
would rather report inventories for at least one child RP.
In that model, do we reconcile the allocations that were already made
against the "root RP" inventory? I don't think so, hence my question here.
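To make the difference concrete, here is a minimal, self-contained sketch of the two candidate-selection behaviours; all names and structures are invented for illustration and are not the actual Placement implementation:

```python
# Old microversion: only a root RP's own inventory can satisfy a request.
# New microversion: a provider tree can satisfy it *collectively*.

def tree_inventory(rp):
    """Sum inventories over a resource provider and all its descendants."""
    total = dict(rp.get("inventory", {}))
    for child in rp.get("children", []):
        for rc, amount in tree_inventory(child).items():
            total[rc] = total.get(rc, 0) + amount
    return total

def candidates(roots, request, nested=False):
    """Return names of the roots able to serve {resource class: amount}."""
    result = []
    for root in roots:
        inv = tree_inventory(root) if nested else root.get("inventory", {})
        if all(inv.get(rc, 0) >= amount for rc, amount in request.items()):
            result.append(root["name"])
    return result

# A Queens-style compute (VGPU on the root RP) and a Rocky-style one
# (VGPU moved to a child RP).
queens_rp = {"name": "queens-node", "inventory": {"VCPU": 8, "VGPU": 2}}
rocky_rp = {"name": "rocky-node", "inventory": {"VCPU": 8},
            "children": [{"name": "pgpu0", "inventory": {"VGPU": 2}}]}

request = {"VCPU": 1, "VGPU": 1}
print(candidates([queens_rp, rocky_rp], request, nested=False))
print(candidates([queens_rp, rocky_rp], request, nested=True))
```

With the old behaviour only the Queens-style node is returned; with the nested behaviour both are, which is exactly the mixed-compute rolling-upgrade property described above.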

Thanks,
-Sylvain
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-28 Thread TETSURO NAKAMURA

Hi,

> Do I still understand correctly ? If yes, perfect, let's jump to my
> upgrade concern.

Yes, I think so. The old microversions look into only root providers and
give up providing resources if a root provider itself doesn't have
enough inventory for the requested resources. But the new microversion
also looks into the root's descendants and sees if the tree can provide
the requested resources *collectively*.


The tests from [1] should help you understand this: the VCPUs come from
the root (the compute host) and the SRIOV_NET_VFs from its grandchild.


[1] 
https://review.openstack.org/#/c/565487/15/nova/tests/functional/api/openstack/placement/gabbits/allocation-candidates.yaml@362


> In that situation, say for example with VGPU inventories, that would mean
> that the compute node would stop reporting inventories for its root RP, but
> would rather report inventories for at least one single child RP.
> In that model, do we reconcile the allocations that were already made
> against the "root RP" inventory ?

It would be nice to see Eric and Jay comment on this,
but if I'm not mistaken, when the virt driver stops reporting
inventories for its root RP, placement would internally try to delete
that inventory and raise an InventoryInUse exception if any allocations
still exist against that resource.


```
update_from_provider_tree() (nova/compute/resource_tracker.py)
  + _set_inventory_for_provider() (nova/scheduler/client/report.py)
    + put() - PUT /resource_providers/{uuid}/inventories with new
      inventories (scheduler/client/report.py)
      + set_inventories() (placement/handler/inventory.py)
        + _set_inventory() (placement/objects/resource_provider.py)
          + _delete_inventory_from_provider()
            (placement/objects/resource_provider.py)
            -> raise exception.InventoryInUse
```
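If I read the chain above correctly, the failure can be sketched like this (a toy model with invented class names, not the actual placement objects):

```python
class InventoryInUse(Exception):
    pass

class Provider:
    """Toy resource provider: inventories plus per-consumer allocations."""

    def __init__(self, name):
        self.name = name
        self.inventories = {}   # resource class -> total
        self.allocations = {}   # consumer -> {resource class: used}

    def set_inventories(self, new_inventories):
        # Resource classes missing from the new set are being deleted;
        # refuse the delete while any allocation still consumes them.
        for rc in set(self.inventories) - set(new_inventories):
            if any(rc in used for used in self.allocations.values()):
                raise InventoryInUse(
                    "%s still has allocations against %s" % (self.name, rc))
        self.inventories = dict(new_inventories)

root = Provider("compute1")
root.set_inventories({"VCPU": 8, "VGPU": 2})
root.allocations["instance-a"] = {"VGPU": 1}

# After the upgrade the virt driver reports VGPU on a child RP only, so
# the root's new inventory has no VGPU -- but instance-a still holds a
# VGPU allocation against the root.
try:
    root.set_inventories({"VCPU": 8})
except InventoryInUse as exc:
    print("refused:", exc)
```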

So we may need some trick, something like deleting the VGPU allocations
before upgrading and setting them again against the newly created child
after upgrading?





--
Tetsuro Nakamura 
NTT Network Service Systems Laboratories
TEL:0422 59 6914(National)/+81 422 59 6914(International)
3-9-11, Midori-Cho Musashino-Shi, Tokyo 180-8585 Japan





Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-29 Thread Sylvain Bauza
On Tue, May 29, 2018 at 3:08 AM, TETSURO NAKAMURA <
nakamura.tets...@lab.ntt.co.jp> wrote:

> Hi,
>
> > Do I still understand correctly ? If yes, perfect, let's jump to my
> upgrade
> > concern.
>
> Yes, I think. The old microversions look into only root providers and give
> up providing resources if a root provider itself doesn't have enough
> inventories for requested resources. But the new microversion looks into
> the root's descendents also and see if it can provide requested resources
> *collectively* in that tree.
>
> The tests from [1] would help you understand this, where VCPUs come from
> the root(compute host) and SRIOV_NET_VFs from its grandchild.
>
> [1] https://review.openstack.org/#/c/565487/15/nova/tests/functi
> onal/api/openstack/placement/gabbits/allocation-candidates.yaml@362
>
>
Yeah, I already saw those tests, but I wanted to make sure I was
understanding correctly.


> > In that situation, say for example with VGPU inventories, that would mean
> > that the compute node would stop reporting inventories for its root RP,
> but
> > would rather report inventories for at least one single child RP.
> > In that model, do we reconcile the allocations that were already made
> > against the "root RP" inventory ?
>
> It would be nice to see Eric and Jay comment on this,
> but if I'm not mistaken, when the virt driver stops reporting inventories
> for its root RP, placement would try to delete that inventory inside and
> raise InventoryInUse exception if any allocations still exist on that
> resource.
>
> ```
> update_from_provider_tree() (nova/compute/resource_tracker.py)
>   + _set_inventory_for_provider() (nova/scheduler/client/report.py)
>   + put() - PUT /resource_providers//inventories with new
> inventories (scheduler/client/report.py)
>   + set_inventories() (placement/handler/inventory.py)
>   + _set_inventory() (placement/objects/resource_proveider.py)
>   + _delete_inventory_from_provider()
> (placement/objects/resource_proveider.py)
>   -> raise exception.InventoryInUse
> ```
>
> So we need some trick something like deleting VGPU allocations before
> upgrading and set the allocation again for the created new child after
> upgrading?
>
>
I wonder if we should keep the existing inventory in the root RP, and
somehow just reserve the remaining resources (so Placement wouldn't return
that root RP for queries, but would still keep the allocations). But then,
where and how do we do this? In the resource tracker?
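As a rough sketch of that idea (invented field names, following the general shape of placement inventory records): a root RP whose reserved value equals its total would keep its allocations but stop being usable for new ones, since the usable capacity would drop to zero. Whether the Placement API actually accepts reserved equal to total may depend on its validation rules.

```python
def usable(inventory):
    """Capacity available for new allocations on one inventory record."""
    return (inventory["total"] - inventory["reserved"]) * inventory.get(
        "allocation_ratio", 1.0)

root_vgpu = {"total": 2, "reserved": 0}
assert usable(root_vgpu) == 2      # still a candidate for new VGPU requests

# "Retire" the root RP inventory without deleting it, so existing
# allocations stay valid but no new candidate is ever returned from it.
root_vgpu["reserved"] = root_vgpu["total"]
assert usable(root_vgpu) == 0
```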

-Sylvain



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-29 Thread Balázs Gibizer



On Tue, May 29, 2018 at 9:38 AM, Sylvain Bauza  
wrote:







I wonder if we should keep the existing inventory in the root RP, and
somehow just reserve the left resources (so Placement wouldn't pass that
root RP for queries, but would still have allocations). But then, where
and how to do this ? By the resource tracker ?




AFAIK it is the virt driver that decides to model the VGPU resource at a
different place in the RP tree, so I think it is the responsibility of
the same virt driver to move any existing allocation from the old place
to the new place during this change.


Cheers,
gibi


-Sylvain






Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-29 Thread Sylvain Bauza
2018-05-29 11:01 GMT+02:00 Balázs Gibizer :

> AFAIK it is the virt driver that decides to model the VGU resource at a
> different place in the RP tree so I think it is the responsibility of the
> same virt driver to move any existing allocation from the old place to the
> new place during this change.
>
>
No. Allocations are done by the scheduler or by the conductor. Virt drivers
only provide inventories.





Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-29 Thread Balázs Gibizer



On Tue, May 29, 2018 at 11:52 AM, Sylvain Bauza 
 wrote:



No. Allocations are done by the scheduler or by the conductor. Virt
drivers only provide inventories.


I understand that the allocation is made by the scheduler and the
conductor, but today the scheduler and the conductor do not have to know
the structure of the RP tree to make such allocations. Therefore, for
me, the scheduler and the conductor are a bad place to try to move
allocations around due to a change in the modelling of the resources in
the RP tree. On the other hand, the virt driver knows the structure of
the RP tree, so it has the necessary information to move the existing
allocation from the old place to the new place.


gibi







Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-29 Thread Sylvain Bauza
On Tue, May 29, 2018 at 11:02 AM, Balázs Gibizer wrote:

> AFAIK it is the virt driver that decides to model the VGU resource at a
> different place in the RP tree so I think it is the responsibility of
> the same virt driver to move any existing allocation from the old place
> to the new place during this change.
>
> Cheers,
> gibi
>

Why not, instead of moving the allocation, have the virt driver update
the root RP by raising the reserved value up to the total size?

That way, the virt driver wouldn't need to ask for an allocation but
could rather continue to provide inventories...

Thoughts?




Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-29 Thread Balázs Gibizer



On Tue, May 29, 2018 at 1:47 PM, Sylvain Bauza  
wrote:



Why not instead not move the allocation but rather have the virt driver
updating the root RP by modifying the reserved value to the total size?

That way, the virt driver wouldn't need to ask for an allocation but
rather continue to provide inventories...

Thoughts?


Keeping the old allocation at the old RP and adding a similar-sized
reservation in the new RP feels hackish, as those are not really
reserved GPUs but used GPUs, just from the old RP. If somebody sums up
the total reported GPUs in this setup via the placement API, then she
will get more GPUs in total than what is physically visible to the
hypervisor, as the GPUs that are part of the old allocation are reported
twice in two different totals. Could we just report fewer GPU
inventories to the new RP while the old RP still has GPU allocations?
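A minimal sketch of that last suggestion (helper and field names invented): the virt driver would only start reporting the VGPU inventory on the new child RP once nothing consumes it on the old root RP, so the same physical GPUs are never counted twice:

```python
def build_inventories(root_allocations, total_vgpu):
    """Decide where the virt driver should report the VGPU inventory."""
    still_used = sum(a.get("VGPU", 0) for a in root_allocations.values())
    if still_used:
        # Keep the legacy root inventory while anything still consumes it.
        return {"root": {"VGPU": total_vgpu}, "child": {}}
    # Once drained (e.g. instances migrated away), report it on the child.
    return {"root": {}, "child": {"VGPU": total_vgpu}}

print(build_inventories({"instance-a": {"VGPU": 1}}, 2))
print(build_inventories({}, 2))
```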


Some alternatives from my jetlagged brain:

a) Implement a move inventory/allocation API in placement. Given a
resource class, a source RP uuid and a destination RP uuid, placement
moves the inventory and allocations of that resource class from the
source RP to the destination RP. Then the virt driver can call this API
to move the allocation. This has an impact on fast-forward upgrades, as
it needs a running virt driver to do the allocation move.


b) For this I assume that live migrating an instance having a GPU
allocation on the old RP will allocate a GPU for that instance from the
new RP. In the virt driver, do not report GPUs to the new RP while there
are allocations for such GPUs in the old RP. Let the deployer live
migrate away the instances. When the virt driver detects that there are
no more GPU allocations on the old RP, it can delete the inventory from
the old RP and report it to the new RP.


c) For this I assume that there is no support for live migration of an
instance having a GPU. If there is a GPU allocation in the old RP, then
the virt driver does not report GPU inventory to the new RP, just
creates the new nested RPs. Provide a placement-manage command to do the
inventory + allocation copy from the old RP to the new RP.
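Alternative (a) could look roughly like this (a toy, in-memory model with invented structures, not a real placement API): the move transfers both the inventory record and every allocation of that resource class in one step, so no allocation is ever lost:

```python
def move_resource_class(rc, source, dest):
    """Move the inventory and allocations of `rc` from source to dest RP."""
    # 1. Move the inventory record itself.
    dest["inventories"][rc] = source["inventories"].pop(rc)
    # 2. Re-point every allocation of that class at the destination.
    for consumer, used in list(source["allocations"].items()):
        if rc in used:
            dest["allocations"].setdefault(consumer, {})[rc] = used.pop(rc)
            if not used:
                del source["allocations"][consumer]

root = {"inventories": {"VCPU": 8, "VGPU": 2},
        "allocations": {"instance-a": {"VCPU": 2, "VGPU": 1}}}
child = {"inventories": {}, "allocations": {}}

move_resource_class("VGPU", root, child)
# The root keeps VCPU; the child now owns both the VGPU inventory and
# instance-a's VGPU allocation.
print(root)
print(child)
```

Doing this server-side in one transaction is what would make it safe; doing it client-side as two separate calls would reintroduce the InventoryInUse window discussed earlier in the thread.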


Cheers,
gibi







Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-29 Thread Sylvain Bauza
On Tue, May 29, 2018 at 2:21 PM, Balázs Gibizer  wrote:

> Keeping the old allocaton at the old RP and adding a similar sized
> reservation in the new RP feels hackis as those are not really reserved
> GPUs but used GPUs just from the old RP. If somebody sums up the total
> reported GPUs in this setup via the placement API then she will get more
> GPUs in total that what is physically visible for the hypervisor as the
> GPUs part of the old allocation reported twice in two different total
> value. Could we just report less GPU inventories to the new RP until the
> old RP has GPU allocations?
>
>

We could keep the old inventory in the root RP for the previous vGPU
type already supported in Queens, and just add other inventories for the
other vGPU types now supported. That looks possibly like the simplest
option, as the virt driver knows that.



> Some alternatives from my jetlagged brain:
>
> a) Implement a move inventory/allocation API in placement. Given a
> resource class and a source RP uuid and a destination RP uuid placement
> moves the inventory and allocations of that resource class from the source
> RP to the destination RP. Then the virt drive can call this API to move the
> allocation. This has an impact on the fast forward upgrade as it needs
> running virt driver to do the allocation move.
>
>
Instead of having the virt driver do that (TBH, I don't like that, given
both the Xen and libvirt drivers have the same problem), we could write
a nova-manage upgrade call for that which would call the Placement API,
sure.


> b) For this I assume that live migrating an instance having a GPU
> allocation on the old RP will allocate GPU for that instance from the new
> RP. In the virt driver, do not report GPUs to the new RP while there are
> allocations for such GPUs in the old RP. Let the deployer live migrate away
> the instances. When the virt driver detects that there are no more GPU
> allocations on the old RP, it can delete the inventory from the old RP and
> report it to the new RP.
>
>
For the moment, vGPUs don't support live migration, even within QEMU. I
haven't checked, but IIUC when you live-migrate an instance that has
vGPUs, it will just migrate it without recreating the vGPUs.
Now, the problem is with the VGPU allocation; we should delete it then.

Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-30 Thread Balázs Gibizer



On Tue, May 29, 2018 at 3:12 PM, Sylvain Bauza  
wrote:



On Tue, May 29, 2018 at 2:21 PM, Balázs Gibizer 
 wrote:



On Tue, May 29, 2018 at 1:47 PM, Sylvain Bauza  
wrote:



Le mar. 29 mai 2018 à 11:02, Balázs Gibizer 
 a écrit :



On Tue, May 29, 2018 at 9:38 AM, Sylvain Bauza 
wrote:
>
>
> On Tue, May 29, 2018 at 3:08 AM, TETSURO NAKAMURA
>  wrote
>
>> > In that situation, say for example with VGPU inventories, that
>> would mean
>> > that the compute node would stop reporting inventories for its
>> root RP, but
>> > would rather report inventories for at least one single child 
RP.
>> > In that model, do we reconcile the allocations that were 
already

>> made
>> > against the "root RP" inventory ?
>>
>> It would be nice to see Eric and Jay comment on this,
>> but if I'm not mistaken, when the virt driver stops reporting
>> inventories for its root RP, placement would try to delete that
>> inventory inside and raise InventoryInUse exception if any
>> allocations still exist on that resource.
>>
>> ```
>> update_from_provider_tree() (nova/compute/resource_tracker.py)
>>   + _set_inventory_for_provider() (nova/scheduler/client/report.py)
>>   + put() - PUT /resource_providers//inventories with new inventories
>>     (nova/scheduler/client/report.py)
>>   + set_inventories() (placement/handler/inventory.py)
>>   + _set_inventory() (placement/objects/resource_provider.py)
>>   + _delete_inventory_from_provider() (placement/objects/resource_provider.py)
>>   -> raise exception.InventoryInUse
>> ```
>>
>> So we need some trick, something like deleting the VGPU allocations
>> before upgrading and setting them again against the newly created
>> child after upgrading?
>>
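The quoted call chain boils down to a simple rule, sketched here against a toy model (this is not the real placement code): deleting an inventory that still has allocations raises InventoryInUse.

```python
# Toy model of why the upgrade breaks: placement refuses to delete an
# inventory that still has allocations against it, mirroring the behaviour
# (not the code) of _delete_inventory_from_provider().
class InventoryInUse(Exception):
    pass

def delete_inventory(inventories, allocations, rp, rc):
    """allocations: iterable of (rp, rc, used) records still on file."""
    if any(a_rp == rp and a_rc == rc for a_rp, a_rc, _used in allocations):
        raise InventoryInUse(f"{rc} on {rp} still has allocations")
    del inventories[rp][rc]
```

So when the upgraded virt driver stops reporting VGPU on the root RP while a Queens-era instance still holds a VGPU allocation there, the inventory update fails.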
>
> I wonder if we should keep the existing inventory in the root RP, and
> somehow just reserve the remaining resources (so Placement wouldn't
> return that root RP for queries, but would still have allocations). But
> then, where and how to do this? By the resource tracker?
>

AFAIK it is the virt driver that decides to model the VGPU resource at a
different place in the RP tree, so I think it is the responsibility of
the same virt driver to move any existing allocation from the old place
to the new place during this change.

Cheers,
gibi


Why not, instead of moving the allocation, have the virt driver update
the root RP by setting the reserved value to the total size?


That way, the virt driver wouldn't need to ask for an allocation but
could rather continue to provide inventories...


Thoughts?


Keeping the old allocation at the old RP and adding a similarly sized
reservation in the new RP feels hackish, as those are not really
reserved GPUs but GPUs used from the old RP. If somebody sums up the
total reported GPUs in this setup via the placement API, she will get
more GPUs in total than what is physically visible to the hypervisor,
as the GPUs that are part of the old allocation are reported in two
different totals. Could we just report fewer GPU inventories to the new
RP while the old RP still has GPU allocations?





We could keep the old inventory in the root RP for the previous vGPU
type already supported in Queens and just add other inventories for
other vGPU types now supported. That is possibly the simplest option,
as the virt driver knows about those types.


That works for me. Can we somehow deprecate the previous, already
supported vGPU types to eventually get rid of the split inventory?






Some alternatives from my jetlagged brain:

a) Implement a move inventory/allocation API in placement. Given a
resource class, a source RP uuid and a destination RP uuid, placement
moves the inventory and allocations of that resource class from the
source RP to the destination RP. Then the virt driver can call this API
to move the allocation. This has an impact on the fast forward upgrade,
as it needs a running virt driver to do the allocation move.




Instead of having the virt driver do that (TBH, I don't like that
option, given both the Xen and libvirt drivers have the same problem),
we could write a nova-manage upgrade command that would call the
Placement API, sure.


nova-manage is another possible way, similar to my idea #c), but there
I imagined the logic in placement-manage instead of nova-manage.




b) For this I assume that live migrating an instance having a GPU
allocation on the old RP will allocate GPU for that instance from the
new RP. In the virt driver, do not report GPUs to the new RP while
there are allocations for such GPUs in the old RP. Let the deployer
live migrate away the instances. When the virt driver detects that
there are no more GPU allocations on the old RP, it can delete the
inventory from the old RP and report it to the new RP.




For the moment, vGPUs don't support live migration, even within QEMU.
I haven't checked, but IIUC when you live-migrate an instance that has
vGPUs, it will just migrate it without recreating the vGPUs.

Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Sylvain Bauza
On Wed, May 30, 2018 at 1:06 PM, Balázs Gibizer  wrote:

>
>
> On Tue, May 29, 2018 at 3:12 PM, Sylvain Bauza  wrote:
>
>>
>>
>> On Tue, May 29, 2018 at 2:21 PM, Balázs Gibizer <
>> balazs.gibi...@ericsson.com> wrote:
>>
>>>
>>>
>>> On Tue, May 29, 2018 at 1:47 PM, Sylvain Bauza 
>>> wrote:
>>>


 Le mar. 29 mai 2018 à 11:02, Balázs Gibizer <
 balazs.gibi...@ericsson.com> a écrit :

>
>
> On Tue, May 29, 2018 at 9:38 AM, Sylvain Bauza 
> wrote:
> >
> >
> > On Tue, May 29, 2018 at 3:08 AM, TETSURO NAKAMURA
> >  wrote
> >
> >> > In that situation, say for example with VGPU inventories, that
> >> would mean
> >> > that the compute node would stop reporting inventories for its
> >> root RP, but
> >> > would rather report inventories for at least one single child RP.
> >> > In that model, do we reconcile the allocations that were already
> >> made
> >> > against the "root RP" inventory ?
> >>
> >> It would be nice to see Eric and Jay comment on this,
> >> but if I'm not mistaken, when the virt driver stops reporting
> >> inventories for its root RP, placement would try to delete that
> >> inventory inside and raise InventoryInUse exception if any
> >> allocations still exist on that resource.
> >>
> >> ```
> >> update_from_provider_tree() (nova/compute/resource_tracker.py)
> >>   + _set_inventory_for_provider() (nova/scheduler/client/report.py)
> >>   + put() - PUT /resource_providers//inventories with
> >> new inventories (scheduler/client/report.py)
> >>   + set_inventories() (placement/handler/inventory.py)
> >>   + _set_inventory()
> >> (placement/objects/resource_provider.py)
> >>   + _delete_inventory_from_provider()
> >> (placement/objects/resource_provider.py)
> >>   -> raise exception.InventoryInUse
> >> ```
> >>
> >> So we need some trick, something like deleting VGPU allocations
> >> before upgrading and setting the allocation again for the created new
> >> child after upgrading?
> >>
> >
> > I wonder if we should keep the existing inventory in the root RP, and
> > somehow just reserve the left resources (so Placement wouldn't pass
> > that root RP for queries, but would still have allocations). But
> > then, where and how to do this ? By the resource tracker ?
> >
>
> AFAIK it is the virt driver that decides to model the VGPU resource at a
> different place in the RP tree so I think it is the responsibility of
> the same virt driver to move any existing allocation from the old place
> to the new place during this change.
>
> Cheers,
> gibi
>

 Why not, instead of moving the allocation, have the virt driver
 update the root RP by setting the reserved value to the total size?

 That way, the virt driver wouldn't need to ask for an allocation but
 rather continue to provide inventories...

 Thoughts?

>>>
>>> Keeping the old allocation at the old RP and adding a similarly sized
>>> reservation in the new RP feels hackish as those are not really reserved
>>> GPUs but GPUs used from the old RP. If somebody sums up the total
>>> reported GPUs in this setup via the placement API then she will get more
>>> GPUs in total than what is physically visible to the hypervisor, as the
>>> GPUs that are part of the old allocation are reported in two different
>>> totals. Could we just report fewer GPU inventories to the new RP while the
>>> old RP still has GPU allocations?
>>>
>>>
>>
>> We could keep the old inventory in the root RP for the previous vGPU type
>> already supported in Queens and just add other inventories for other vGPU
>> types now supported. That is possibly the simplest option, as the virt
>> driver knows about those types.
>>
>
> That works for me. Can we somehow deprecate the previous, already
> supported vGPU types to eventually get rid of the split inventory?
>
>
>>
>> Some alternatives from my jetlagged brain:
>>>
>>> a) Implement a move inventory/allocation API in placement. Given a
>>> resource class and a source RP uuid and a destination RP uuid placement
>>> moves the inventory and allocations of that resource class from the source
>>> RP to the destination RP. Then the virt driver can call this API to move the
>>> allocation. This has an impact on the fast forward upgrade as it needs a
>>> running virt driver to do the allocation move.
>>>
>>>
>> Instead of having the virt driver do that (TBH, I don't like that
>> given both the Xen and libvirt drivers have the same problem), we could write
>> a nova-manage upgrade command that would call the Placement API, sure.
>>
>
> nova-manage is another possible way, similar to my idea #c), but there I
> imagined the logic in placement-manage instead of nova-manage.
>
>
>> b) For this I assume that live migrating an i

Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Balázs Gibizer



On Thu, May 31, 2018 at 11:10 AM, Sylvain Bauza  
wrote:




After considering the whole approach and discussing with a couple of
folks over IRC, here is what I feel is the best approach for a seamless
upgrade:
 - VGPU inventory will be kept on root RP (for the first type) in 
Queens so that a compute service upgrade won't impact the DB
 - during Queens, operators can run a DB online migration script 
(like the ones we currently have in 
https://github.com/openstack/nova/blob/c2f42b0/nova/cmd/manage.py#L375) 
that will create a new resource provider for the first type and move 
the inventory and allocations to it.
 - it's the responsibility of the virt driver code to check whether a
child RP named after the first type already exists, to know whether to
update the inventory against the root RP or the child RP.
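A rough sketch of the virt-driver check described in the last bullet; the "<compute>_<type>" naming scheme and the helper are hypothetical.

```python
# Hypothetical sketch: report VGPU inventory against the child RP named
# after the first vGPU type if that child already exists (i.e. the data
# migration has run), otherwise keep using the root RP.
def pick_vgpu_target(existing_rp_names, compute_rp, first_type):
    child_name = f"{compute_rp}_{first_type}"  # naming scheme is illustrative
    if child_name in existing_rp_names:
        return child_name  # migrated: update inventory against the child RP
    return compute_rp      # not migrated yet: keep inventory on the root RP
```

This keeps the compute service working both before and after the operator runs the online data migration.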


Does it work for folks ?


+1 works for me
gibi

PS : we already have the plumbing in place in nova-manage and we're
still managing full Nova resources. I know we plan to move Placement
out of the nova tree, but for the Rocky timeframe, I feel we can
consider nova-manage as the best and quickest approach for the data
upgrade.


-Sylvain





__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Eric Fried
This seems reasonable, but...

On 05/31/2018 04:34 AM, Balázs Gibizer wrote:
> 
> 
> On Thu, May 31, 2018 at 11:10 AM, Sylvain Bauza  wrote:
>>>
>>
>> After considering the whole approach, discussing with a couple of
>> folks over IRC, here is what I feel is the best approach for a seamless
>> upgrade:
>>  - VGPU inventory will be kept on root RP (for the first type) in
>> Queens so that a compute service upgrade won't impact the DB
>>  - during Queens, operators can run a DB online migration script (like
-^^
Did you mean Rocky?

>> the ones we currently have in
>> https://github.com/openstack/nova/blob/c2f42b0/nova/cmd/manage.py#L375) that
>> will create a new resource provider for the first type and move the
>> inventory and allocations to it.
>>  - it's the responsibility of the virt driver code to check whether a
>> child RP with its name being the first type name already exists to
>> know whether to update the inventory against the root RP or the child RP.
>>
>> Does it work for folks ?
> 
> +1 works for me
> gibi
> 
>> PS : we already have the plumbing in place in nova-manage and we're
>> still managing full Nova resources. I know we plan to move Placement
>> out of the nova tree, but for the Rocky timeframe, I feel we can
>> consider nova-manage as the best and quickest approach for the data
>> upgrade.
>>
>> -Sylvain
>>
>>
> 
> 



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Naichuan Sun
I can do it on the xenserver side, although keeping the old inventory in the
compute node RP looks weird to me (it just works for one case: upgrade)...

-Original Message-
From: Eric Fried [mailto:openst...@fried.cc] 
Sent: Thursday, May 31, 2018 9:54 PM
To: OpenStack Development Mailing List (not for usage questions) 

Subject: Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested 
Resource Providers

This seems reasonable, but...

On 05/31/2018 04:34 AM, Balázs Gibizer wrote:
> 
> 
> On Thu, May 31, 2018 at 11:10 AM, Sylvain Bauza  wrote:
>>>
>>
>> After considering the whole approach, discussing with a couple of 
>> folks over IRC, here is what I feel is the best approach for a seamless 
>> upgrade:
>>  - VGPU inventory will be kept on root RP (for the first type) in 
>> Queens so that a compute service upgrade won't impact the DB
>>  - during Queens, operators can run a DB online migration script 
>> (like
-^^
Did you mean Rocky?

>> the ones we currently have in
>> https://github.com/openstack/nova/blob/c2f42b0/nova/cmd/manage.py#L375)
>> that will create a new resource provider for the first type and 
>> move the inventory and allocations to it.
>>  - it's the responsibility of the virt driver code to check whether a 
>> child RP with its name being the first type name already exists to 
>> know whether to update the inventory against the root RP or the child RP.
>>
>> Does it work for folks ?
> 
> +1 works for me
> gibi
> 
>> PS : we already have the plumbing in place in nova-manage and we're 
>> still managing full Nova resources. I know we plan to move Placement 
>> out of the nova tree, but for the Rocky timeframe, I feel we can 
>> consider nova-manage as the best and quickest approach for the data 
>> upgrade.
>>
>> -Sylvain
>>
>>
> 
> 



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Jay Pipes

On 05/29/2018 09:12 AM, Sylvain Bauza wrote:
We could keep the old inventory in the root RP for the previous vGPU 
type already supported in Queens and just add other inventories for 
other vGPU types now supported. That is possibly the simplest option 
as the virt driver knows about those types.


What do you mean by "vGPU type"? Are you referring to the multiple GPU 
types stuff where specific virt drivers know how to handle different 
vGPU vendor types? Or are you referring to a "non-nested VGPU inventory 
on the compute node provider" versus a "VGPU inventory on multiple child 
providers, each representing a different physical GPU (or physical GPU 
group in the case of Xen)"?


-jay



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Jay Pipes

On 05/30/2018 07:06 AM, Balázs Gibizer wrote:
nova-manage is another possible way, similar to my idea #c), but there 
I imagined the logic in placement-manage instead of nova-manage.


Please note there is no placement-manage CLI tool.

Best,
-jay



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Jay Pipes

On 05/31/2018 05:10 AM, Sylvain Bauza wrote:
After considering the whole approach and discussing with a couple of folks 
over IRC, here is what I feel is the best approach for a seamless upgrade:
  - VGPU inventory will be kept on root RP (for the first type) in 
Queens so that a compute service upgrade won't impact the DB
  - during Queens, operators can run a DB online migration script (like 
the ones we currently have in 
https://github.com/openstack/nova/blob/c2f42b0/nova/cmd/manage.py#L375) 
that will create a new resource provider for the first type and move the 
inventory and allocations to it.
  - it's the responsibility of the virt driver code to check whether a 
child RP with its name being the first type name already exists to know 
whether to update the inventory against the root RP or the child RP.


Does it work for folks ?


No, sorry, that doesn't work for me. It seems overly complex and 
fragile, especially considering that VGPUs are not moveable anyway (no 
support for live migrating them). Same goes for CPU pinning, NUMA 
topologies, PCI passthrough devices, SR-IOV PF/VFs and all the other 
"must have" features that have been added to the virt driver over the 
last 5 years.


My feeling is that we should not attempt to "migrate" any allocations or 
inventories between root or child providers within a compute node, period.


The virt drivers should simply error out of update_provider_tree() if 
there are ANY existing VMs on the host AND the virt driver wishes to 
begin tracking resources with nested providers.


The upgrade operation should look like this:

1) Upgrade placement
2) Upgrade nova-scheduler
3) start loop on compute nodes. for each compute node:
 3a) disable nova-compute service on node (to take it out of scheduling)
 3b) evacuate all existing VMs off of node
 3c) upgrade compute node (on restart, the compute node will see no
 VMs running on the node and will construct the provider tree inside
 update_provider_tree() with an appropriate set of child providers
 and inventories on those child providers)
 3d) enable nova-compute service on node

Which is virtually identical to the "normal" upgrade process whenever 
there are significant changes to the compute node -- such as upgrading 
libvirt or the kernel. Nested resource tracking is another such 
significant change and should be dealt with in a similar way, IMHO.
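The steps above can be sketched as a small orchestration loop; the disable/evacuate/upgrade/enable operations are stubs standing in for the real operator commands, not actual nova tooling.

```python
# Stub orchestration of the upgrade flow described above (illustrative only).
def rolling_upgrade(compute_nodes, disable, evacuate, upgrade, enable):
    upgrade("placement")        # 1) upgrade placement first
    upgrade("nova-scheduler")   # 2) then the scheduler
    for node in compute_nodes:  # 3) then each compute node in turn
        disable(node)   # 3a) take it out of scheduling
        evacuate(node)  # 3b) move all existing VMs off the node
        upgrade(node)   # 3c) restart builds the nested tree on an empty node
        enable(node)    # 3d) put it back into scheduling
```

The cost Dan objects to below is hidden in step 3b: every instance on every node has to be moved over the network before its host can be upgraded.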


Best,
-jay



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Dan Smith
> My feeling is that we should not attempt to "migrate" any allocations
> or inventories between root or child providers within a compute node,
> period.

While I agree this is the simplest approach, it does put a lot of
responsibility on the operators to do work to sidestep this issue, which
might not even apply to them (and knowing if it does might be
difficult).

> The virt drivers should simply error out of update_provider_tree() if
> there are ANY existing VMs on the host AND the virt driver wishes to
> begin tracking resources with nested providers.
>
> The upgrade operation should look like this:
>
> 1) Upgrade placement
> 2) Upgrade nova-scheduler
> 3) start loop on compute nodes. for each compute node:
>  3a) disable nova-compute service on node (to take it out of scheduling)
>  3b) evacuate all existing VMs off of node

You mean s/evacuate/cold migrate/ of course... :)

>  3c) upgrade compute node (on restart, the compute node will see no
>  VMs running on the node and will construct the provider tree inside
>  update_provider_tree() with an appropriate set of child providers
>  and inventories on those child providers)
>  3d) enable nova-compute service on node
>
> Which is virtually identical to the "normal" upgrade process whenever
> there are significant changes to the compute node -- such as upgrading
> libvirt or the kernel.

Not necessarily. It's totally legit (and I expect quite common) to just
reboot the host to take kernel changes, bringing back all the instances
that were there when it resumes. The "normal" case of moving things
around slide-puzzle-style applies to live migration (which isn't an
option here). I think people that can take downtime for the instances
would rather not have to move things around for no reason if the
instance has to get shut off anyway.

> Nested resource tracking is another such significant change and should
> be dealt with in a similar way, IMHO.

This basically says that for anyone to move to rocky, they will have to
cold migrate every single instance in order to do that upgrade right? I
mean, anyone with two socket machines or SRIOV NICs would end up with at
least one level of nesting, correct? Forcing everyone to move everything
to do an upgrade seems like a non-starter to me.

We also need to consider the case where people would be FFU'ing past
rocky (i.e. never running rocky computes). We've previously said that
we'd provide a way to push any needed transitions with everything
offline to facilitate that case, so I think we need to implement that
method anyway.

I kinda think we need to either:

1. Make everything perform the pivot on compute node start (which can be
   re-used by a CLI tool for the offline case)
2. Make everything default to non-nested inventory at first, and provide
   a way to migrate a compute node and its instances one at a time (in
   place) to roll through.

We can also document "or do the cold-migration slide puzzle thing" as an
alternative for people that feel that's more reasonable.

I just think that forcing people to take down their data plane to work
around our own data model is kinda evil and something we should be
avoiding at this level of project maturity. What we're really saying is
"we know how to translate A into B, but we require you to move many GBs
of data over the network and take some downtime because it's easier for
*us* than making it seamless."

--Dan



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Chris Dent

On Thu, 31 May 2018, Dan Smith wrote:


I kinda think we need to either:

1. Make everything perform the pivot on compute node start (which can be
  re-used by a CLI tool for the offline case)


This sounds effectively like: validate my inventory and allocations
at compute node start, correcting them as required (including the
kind of migration stuff related to nested). Is that right?

That's something I'd like to be the norm. It takes us back to a sort
of self-healing compute node.

Or am I missing something (forgive me, I've been on holiday).


I just think that forcing people to take down their data plane to work
around our own data model is kinda evil and something we should be
avoiding at this level of project maturity. What we're really saying is
"we know how to translate A into B, but we require you to move many GBs
of data over the network and take some downtime because it's easier for
*us* than making it seamless."


If we can do it, I agree that being not evil is good.

--
Chris Dent   ٩◔̯◔۶   https://anticdent.org/
freenode: cdent tw: @anticdent


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Eric Fried
> 1. Make everything perform the pivot on compute node start (which can be
>re-used by a CLI tool for the offline case)
> 2. Make everything default to non-nested inventory at first, and provide
>a way to migrate a compute node and its instances one at a time (in
>place) to roll through.

I agree that it sure would be nice to do ^ rather than requiring the
"slide puzzle" thing.

But how would this be accomplished, in light of the current "separation
of responsibilities" drawn at the virt driver interface, whereby the
virt driver isn't supposed to talk to placement directly, or know
anything about allocations?  Here's a first pass:

The virt driver, via the return value from update_provider_tree, tells
the resource tracker that "inventory of resource class A on provider B
has moved to provider C" for all applicable AxBxC.  E.g.

[ { 'from_resource_provider': <cn_rp_uuid>,
    'moved_resources': [VGPU: 4],
    'to_resource_provider': <gpu_rp1_uuid>
  },
  { 'from_resource_provider': <cn_rp_uuid>,
    'moved_resources': [VGPU: 4],
    'to_resource_provider': <gpu_rp2_uuid>
  },
  { 'from_resource_provider': <cn_rp_uuid>,
    'moved_resources': [
        SRIOV_NET_VF: 2,
        NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
        NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
    ],
    'to_resource_provider': <gpu_rp2_uuid>
  }
]

As today, the resource tracker takes the updated provider tree and
invokes [1] the report client method update_from_provider_tree [2] to
flush the changes to placement.  But now update_from_provider_tree also
accepts the return value from update_provider_tree and, for each "move":

- Creates provider C (as described in the provider_tree) if it doesn't
already exist.
- Creates/updates provider C's inventory as described in the
provider_tree (without yet updating provider B's inventory).  This ought
to create the inventory of resource class A on provider C.
- Discovers allocations of rc A on rp B and POSTs to move them to rp C*.
- Updates provider B's inventory.
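The ordering described above matters: create the destination inventory, move the allocations, and only then shrink the source inventory. A toy sketch of that order of operations, against an in-memory model rather than the real report client:

```python
# Illustrative only: apply one "move" record to a toy placement model of the
# form {"inventories": {rp: {rc: total}}, "allocations": {consumer: {rp: {rc: used}}}}.
def apply_move(placement, move):
    src = move["from_resource_provider"]
    dst = move["to_resource_provider"]
    # 1) Create provider C if it doesn't already exist.
    placement["inventories"].setdefault(dst, {})
    for rc, amount in move["moved_resources"].items():
        # 2) Create provider C's inventory before touching provider B's,
        #    so the resources never disappear from the tree.
        placement["inventories"][dst][rc] = amount
        # 3) Move allocations of rc from B to C.
        for consumer, by_rp in placement["allocations"].items():
            used = by_rp.get(src, {}).pop(rc, None)
            if used is not None:
                by_rp.setdefault(dst, {})[rc] = used
        # 4) Only now remove/shrink provider B's inventory, which would
        #    otherwise fail with InventoryInUse.
        placement["inventories"][src].pop(rc, None)
```

This ignores the "which allocation goes to which child" hole discussed next; it only shows why the destination-first ordering avoids InventoryInUse.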

(*There's a hole here: if we're splitting a glommed-together inventory
across multiple new child providers, as the VGPUs in the example, we
don't know which allocations to put where.  The virt driver should know
which instances own which specific inventory units, and would be able to
report that info within the data structure.  That's getting kinda close
to the virt driver mucking with allocations, but maybe it fits well
enough into this model to be acceptable?)

Note that the return value from update_provider_tree is optional, and
only used when the virt driver is indicating a "move" of this ilk.  If
it's None/[] then the RT/update_from_provider_tree flow is the same as
it is today.

If we can do it this way, we don't need a migration tool.  In fact, we
don't even need to restrict provider tree "reshaping" to release
boundaries.  As long as the virt driver understands its own data model
migrations and reports them properly via update_provider_tree, it can
shuffle its tree around whenever it wants.

Thoughts?

-efried

[1]
https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
[2]
https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Eric Fried
Rats, typo correction below.

On 05/31/2018 01:26 PM, Eric Fried wrote:
>> 1. Make everything perform the pivot on compute node start (which can be
>>re-used by a CLI tool for the offline case)
>> 2. Make everything default to non-nested inventory at first, and provide
>>a way to migrate a compute node and its instances one at a time (in
>>place) to roll through.
> 
> I agree that it sure would be nice to do ^ rather than requiring the
> "slide puzzle" thing.
> 
> But how would this be accomplished, in light of the current "separation
> of responsibilities" drawn at the virt driver interface, whereby the
> virt driver isn't supposed to talk to placement directly, or know
> anything about allocations?  Here's a first pass:
> 
> The virt driver, via the return value from update_provider_tree, tells
> the resource tracker that "inventory of resource class A on provider B
> has moved to provider C" for all applicable AxBxC.  E.g.
> 
> [ { 'from_resource_provider': <cn_rp_uuid>,
>     'moved_resources': [VGPU: 4],
>     'to_resource_provider': <gpu_rp1_uuid>
>   },
>   { 'from_resource_provider': <cn_rp_uuid>,
>     'moved_resources': [VGPU: 4],
>     'to_resource_provider': <gpu_rp2_uuid>
>   },
>   { 'from_resource_provider': <cn_rp_uuid>,
>     'moved_resources': [
>         SRIOV_NET_VF: 2,
>         NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
>         NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
>     ],
>     'to_resource_provider': <gpu_rp2_uuid>
---
s/gpu_rp2_uuid/sriovnic_rp_uuid/ or similar.

>   }
> ]
> 
> As today, the resource tracker takes the updated provider tree and
> invokes [1] the report client method update_from_provider_tree [2] to
> flush the changes to placement.  But now update_from_provider_tree also
> accepts the return value from update_provider_tree and, for each "move":
> 
> - Creates provider C (as described in the provider_tree) if it doesn't
> already exist.
> - Creates/updates provider C's inventory as described in the
> provider_tree (without yet updating provider B's inventory).  This ought
> to create the inventory of resource class A on provider C.
> - Discovers allocations of rc A on rp B and POSTs to move them to rp C*.
> - Updates provider B's inventory.
> 
> (*There's a hole here: if we're splitting a glommed-together inventory
> across multiple new child providers, as the VGPUs in the example, we
> don't know which allocations to put where.  The virt driver should know
> which instances own which specific inventory units, and would be able to
> report that info within the data structure.  That's getting kinda close
> to the virt driver mucking with allocations, but maybe it fits well
> enough into this model to be acceptable?)
> 
> Note that the return value from update_provider_tree is optional, and
> only used when the virt driver is indicating a "move" of this ilk.  If
> it's None/[] then the RT/update_from_provider_tree flow is the same as
> it is today.
> 
> If we can do it this way, we don't need a migration tool.  In fact, we
> don't even need to restrict provider tree "reshaping" to release
> boundaries.  As long as the virt driver understands its own data model
> migrations and reports them properly via update_provider_tree, it can
> shuffle its tree around whenever it wants.
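The four-step flow quoted above can be sketched in plain Python. Everything here is illustrative: `FakePlacement` and its method names (`ensure_provider`, `set_inventory`, `move_allocations`) are hypothetical stand-ins for report-client calls, not actual nova APIs.

```python
# Illustrative only: FakePlacement stands in for the report client, and
# every method name here is a hypothetical placeholder, not a real nova API.

class FakePlacement:
    """Records placement operations in order, for demonstration."""
    def __init__(self):
        self.ops = []

    def ensure_provider(self, rp):
        self.ops.append(('ensure', rp))

    def set_inventory(self, rp, inv):
        self.ops.append(('inventory', rp, dict(inv)))

    def move_allocations(self, src, dst, rc):
        self.ops.append(('move', src, dst, rc))


def apply_moves(inventories, moves, placement):
    """Apply each 'move' record in the order described above: create C,
    set C's inventory, move allocations from B to C, then update B."""
    for move in moves:
        src = move['from_resource_provider']
        dst = move['to_resource_provider']
        placement.ensure_provider(dst)                  # step 1
        placement.set_inventory(dst, inventories[dst])  # step 2
        for rc in move['moved_resources']:              # step 3
            placement.move_allocations(src, dst, rc)
        placement.set_inventory(src, inventories[src])  # step 4


moves = [{'from_resource_provider': 'cn_rp_uuid',
          'moved_resources': {'VGPU': 4},
          'to_resource_provider': 'gpu_rp1_uuid'}]
# Target inventories after the reshape: VGPU has left the root RP.
inventories = {'cn_rp_uuid': {'VCPU': 8}, 'gpu_rp1_uuid': {'VGPU': 4}}

placement = FakePlacement()
apply_moves(inventories, moves, placement)
```

The ordering is the point of the sketch: provider C's inventory exists before any allocation is moved onto it, and provider B's inventory only shrinks after its allocations are gone.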
> 
> Thoughts?
> 
> -efried
> 
> [1]
> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
> [2]
> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341
> 
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Chris Dent

On Thu, 31 May 2018, Eric Fried wrote:


But how would this be accomplished, in light of the current "separation
of responsibilities" drawn at the virt driver interface, whereby the
virt driver isn't supposed to talk to placement directly, or know
anything about allocations?  Here's a first pass:


For sake of discussion, how much (if any) easier would it be if we
got rid of this restriction?


the resource tracker that "inventory of resource class A on provider B
have moved to provider C" for all applicable AxBxC.  E.g.


traits too?


[ { 'from_resource_provider': <cn_rp_uuid>,
   'moved_resources': [VGPU: 4],
   'to_resource_provider': <gpu_rp1_uuid>

[snip]


If we can do it this way, we don't need a migration tool.  In fact, we
don't even need to restrict provider tree "reshaping" to release
boundaries.  As long as the virt driver understands its own data model
migrations and reports them properly via update_provider_tree, it can
shuffle its tree around whenever it wants.


Assuming the restriction is kept, your model seems at least worth
exploring. The fact that we are using what amounts to a DSL to pass
some additional instruction back from the virt driver feels squiffy
for some reason (probably because I'm not wed to the restriction),
but it is well-contained.

--
Chris Dent   ٩◔̯◔۶   https://anticdent.org/
freenode: cdent tw: @anticdent


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Sylvain Bauza
On Thu, May 31, 2018 at 3:54 PM, Eric Fried  wrote:

> This seems reasonable, but...
>
> On 05/31/2018 04:34 AM, Balázs Gibizer wrote:
> >
> >
> > On Thu, May 31, 2018 at 11:10 AM, Sylvain Bauza 
> wrote:
> >>>
> >>
> >> After considering the whole approach, discussing with a couple of
> >> folks over IRC, here is what I feel the best approach for a seamless
> >> upgrade :
> >>  - VGPU inventory will be kept on root RP (for the first type) in
> >> Queens so that a compute service upgrade won't impact the DB
> >>  - during Queens, operators can run a DB online migration script (like
> -^^
> Did you mean Rocky?
>


Oops, yeah of course. Queens > Rocky.

>
> >> the ones we currently have in
> >> https://github.com/openstack/nova/blob/c2f42b0/nova/cmd/manage.py#L375)
> that
> >> will create a new resource provider for the first type and move the
> >> inventory and allocations to it.
> >>  - it's the responsibility of the virt driver code to check whether a
> >> child RP with its name being the first type name already exists to
> >> know whether to update the inventory against the root RP or the child
> RP.
> >>
> >> Does it work for folks ?
> >
> > +1 works for me
> > gibi
> >
> >> PS : we already have the plumbing in place in nova-manage and we're
> >> still managing full Nova resources. I know we plan to move Placement
> >> out of the nova tree, but for the Rocky timeframe, I feel we can
> >> consider nova-manage as the best and quickest approach for the data
> >> upgrade.
> >>
> >> -Sylvain
> >>
> >>
> >
> >


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Eric Fried
Chris-

>> virt driver isn't supposed to talk to placement directly, or know
>> anything about allocations?
> 
> For sake of discussion, how much (if any) easier would it be if we
> got rid of this restriction?

At this point, having implemented the update_[from_]provider_tree flow
as we have, it would probably make things harder.  We still have to do
the same steps, but any bits we wanted to let the virt driver handle
would need some kind of weird callback dance.

But even if we scrapped update_[from_]provider_tree and redesigned from
first principles, virt drivers would have a lot of duplication of the
logic that currently resides in update_from_provider_tree.

So even though the restriction seems to make things awkward, having been
embroiled in this code as I have, I'm actually seeing how it keeps
things as clean and easy to reason about as can be expected for
something that's inherently as complicated as this.

>> the resource tracker that "inventory of resource class A on provider B
>> have moved to provider C" for all applicable AxBxC.  E.g.
> 
> traits too?

The traits are part of the updated provider tree itself.  The existing
logic in update_from_provider_tree handles shuffling those around.  I
don't think the RT needs to be told about any specific trait movement in
order to reason about moving allocations.  Do you see something I'm
missing there?

> The fact that we are using what amounts to a DSL to pass
> some additional instruction back from the virt driver feels squiffy

Yeah, I don't disagree.  The provider_tree object, and updating it via
update_provider_tree, is kind of a DSL already.  The list-of-dicts
format is just a strawman; we could make it an object or whatever (not
that that would make it less DSL-ish).

Perhaps an OVO :P
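For illustration, one entry of the list-of-dicts strawman could be typed as a small object along these lines. A real implementation would presumably be an oslo.versionedobjects class; the name and shape here are invented.

```python
from dataclasses import dataclass, field

@dataclass
class InventoryMove:
    """Typed sketch of one entry in the list-of-dicts strawman.
    Hypothetical: not an actual nova object."""
    from_resource_provider: str
    to_resource_provider: str
    moved_resources: dict = field(default_factory=dict)

move = InventoryMove(
    from_resource_provider='cn_rp_uuid',
    to_resource_provider='gpu_rp1_uuid',
    moved_resources={'VGPU': 4},
)
```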

-efried

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Sylvain Bauza
On Thu, May 31, 2018 at 4:34 PM, Jay Pipes  wrote:

> On 05/29/2018 09:12 AM, Sylvain Bauza wrote:
>
>> We could keep the old inventory in the root RP for the previous vGPU type
>> already supported in Queens and just add other inventories for other vGPU
>> types now supported. That looks possibly the simplest option, as the virt
>> driver knows that.
>>
>
> What do you mean by "vGPU type"? Are you referring to the multiple GPU
> types stuff where specific virt drivers know how to handle different vGPU
> vendor types? Or are you referring to a "non-nested VGPU inventory on the
> compute node provider" versus a "VGPU inventory on multiple child
> providers, each representing a different physical GPU (or physical GPU
> group in the case of Xen)"?
>
>
I speak about a "vGPU type" because it's how we agreed to have multiple
child RPs.
See
https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/add-support-for-vgpu.html#proposed-change

For Xen, a vGPU type is a Xen GPU group. For libvirt, it's just an mdev type.
Each pGPU can support multiple types. For the moment, we only support one
type, but my spec ( https://review.openstack.org/#/c/557065/ ) explains
more about that.
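To make the shape concrete, here is a rough sketch of the nested layout being discussed: one child RP per vGPU type (a libvirt mdev type, or a Xen GPU group), each carrying its own VGPU inventory. All names and totals below are invented for illustration.

```python
# Hypothetical shape of the nested layout under discussion: one child RP
# per vGPU type, each with its own VGPU inventory.

def build_gpu_tree(cn_name, vgpu_types):
    tree = {'name': cn_name,
            'inventory': {'VCPU': 16, 'MEMORY_MB': 32768},
            'children': []}
    for type_name, total in vgpu_types.items():
        tree['children'].append({
            # Child RP named after the vGPU type, which is the plan for
            # both the Xen and libvirt drivers.
            'name': '%s_%s' % (cn_name, type_name),
            'inventory': {'VGPU': total},
            'children': [],
        })
    return tree

tree = build_gpu_tree('compute1', {'nvidia-35': 16, 'nvidia-36': 8})
```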


-jay
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Sylvain Bauza
On Thu, May 31, 2018 at 5:00 PM, Jay Pipes  wrote:

> On 05/31/2018 05:10 AM, Sylvain Bauza wrote:
>
>> After considering the whole approach, discussing with a couple of folks
>> over IRC, here is what I feel the best approach for a seamless upgrade :
>>   - VGPU inventory will be kept on root RP (for the first type) in Queens
>> so that a compute service upgrade won't impact the DB
>>   - during Queens, operators can run a DB online migration script (like
> >> the ones we currently have in
> >> https://github.com/openstack/nova/blob/c2f42b0/nova/cmd/manage.py#L375)
> >> that will create a new
>> resource provider for the first type and move the inventory and allocations
>> to it.
>>   - it's the responsibility of the virt driver code to check whether a
>> child RP with its name being the first type name already exists to know
>> whether to update the inventory against the root RP or the child RP.
>>
>> Does it work for folks ?
>>
>
> No, sorry, that doesn't work for me. It seems overly complex and fragile,
> especially considering that VGPUs are not moveable anyway (no support for
> live migrating them). Same goes for CPU pinning, NUMA topologies, PCI
> passthrough devices, SR-IOV PF/VFs and all the other "must have" features
> that have been added to the virt driver over the last 5 years.
>
> My feeling is that we should not attempt to "migrate" any allocations or
> inventories between root or child providers within a compute node, period.
>
>
I don't understand why you're talking about *moving* an instance. My concern
was about upgrading a compute node to Rocky where some instances were
already there, and using vGPUs.


> The virt drivers should simply error out of update_provider_tree() if
> there are ANY existing VMs on the host AND the virt driver wishes to begin
> tracking resources with nested providers.
>
> The upgrade operation should look like this:
>
> 1) Upgrade placement
> 2) Upgrade nova-scheduler
> 3) start loop on compute nodes. for each compute node:
>  3a) disable nova-compute service on node (to take it out of scheduling)
>  3b) evacuate all existing VMs off of node
>  3c) upgrade compute node (on restart, the compute node will see no
>  VMs running on the node and will construct the provider tree inside
>  update_provider_tree() with an appropriate set of child providers
>  and inventories on those child providers)
>  3d) enable nova-compute service on node
>
> Which is virtually identical to the "normal" upgrade process whenever
> there are significant changes to the compute node -- such as upgrading
> libvirt or the kernel. Nested resource tracking is another such significant
> change and should be dealt with in a similar way, IMHO.
>
>
Upgrading to Rocky for vGPUs doesn't also require upgrading libvirt or the
kernel. So why should operators need to "evacuate" (I understood that as
"migrate") instances if they don't need to upgrade their host OS?

Best,
> -jay
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Sylvain Bauza
On Thu, May 31, 2018 at 7:09 PM, Dan Smith  wrote:

> > My feeling is that we should not attempt to "migrate" any allocations
> > or inventories between root or child providers within a compute node,
> > period.
>
> While I agree this is the simplest approach, it does put a lot of
> responsibility on the operators to do work to sidestep this issue, which
> might not even apply to them (and knowing if it does might be
> difficult).
>
>
Shit, I missed the point of why we were discussing migrations. When you
upgrade, you wanna move your workloads for upgrading your kernel and the
like. Gotcha.
But, I assume that's not something mandatory for a single upgrade (say
Queens>Rocky). In that case, you just wanna upgrade your compute without
moving your instances. Or you notified your users about a maintenance and
you know you have a minimal maintenance period for breaking them.
In both cases, adding more steps for upgrading seems a tricky and dangerous
path for those operators who are afraid of making a mistake.


> > The virt drivers should simply error out of update_provider_tree() if
> > there are ANY existing VMs on the host AND the virt driver wishes to
> > begin tracking resources with nested providers.
> >
> > The upgrade operation should look like this:
> >
> > 1) Upgrade placement
> > 2) Upgrade nova-scheduler
> > 3) start loop on compute nodes. for each compute node:
> >  3a) disable nova-compute service on node (to take it out of scheduling)
> >  3b) evacuate all existing VMs off of node
>
> You mean s/evacuate/cold migrate/ of course... :)
>
> >  3c) upgrade compute node (on restart, the compute node will see no
> >  VMs running on the node and will construct the provider tree inside
> >  update_provider_tree() with an appropriate set of child providers
> >  and inventories on those child providers)
> >  3d) enable nova-compute service on node
> >
> > Which is virtually identical to the "normal" upgrade process whenever
> > there are significant changes to the compute node -- such as upgrading
> > libvirt or the kernel.
>
> Not necessarily. It's totally legit (and I expect quite common) to just
> reboot the host to take kernel changes, bringing back all the instances
> that were there when it resumes. The "normal" case of moving things
> around slide-puzzle-style applies to live migration (which isn't an
> option here). I think people that can take downtime for the instances
> would rather not have to move things around for no reason if the
> instance has to get shut off anyway.
>
>
Yeah, exactly that. Accepting a downtime is fair, at the price of not having
a long list of operations to do during that downtime period.



> > Nested resource tracking is another such significant change and should
> > be dealt with in a similar way, IMHO.
>
> This basically says that for anyone to move to rocky, they will have to
> cold migrate every single instance in order to do that upgrade right? I
> mean, anyone with two socket machines or SRIOV NICs would end up with at
> least one level of nesting, correct? Forcing everyone to move everything
> to do an upgrade seems like a non-starter to me.
>
>
For the moment, we aren't providing NUMA topologies with nested RPs but
once we do that, yeah, that would imply the above, which sounds harsh to
hear from an operator perspective.



> We also need to consider the case where people would be FFU'ing past
> rocky (i.e. never running rocky computes). We've previously said that
> we'd provide a way to push any needed transitions with everything
> offline to facilitate that case, so I think we need to implement that
> method anyway.
>
> I kinda think we need to either:
>
> 1. Make everything perform the pivot on compute node start (which can be
>re-used by a CLI tool for the offline case)
>

That's another alternative I haven't explored yet. Thanks for the idea. We
already reconcile the world when we restart the compute service by checking
whether mediated devices exist, so that could be a good option actually.



> 2. Make everything default to non-nested inventory at first, and provide
>a way to migrate a compute node and its instances one at a time (in
>place) to roll through.
>
>
We could say that Rocky doesn't support multiple vGPU types until you
make the necessary DB migration that will create child RPs and the like.
That's yet another approach.

We can also document "or do the cold-migration slide puzzle thing" as an
> alternative for people that feel that's more reasonable.
>
> I just think that forcing people to take down their data plane to work
> around our own data model is kinda evil and something we should be
> avoiding at this level of project maturity. What we're really saying is
> "we know how to translate A into B, but we require you to move many GBs
> of data over the network and take some downtime because it's easier for
> *us* than making it seamless."
>
> --Dan
>

Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Jay Pipes

On 05/31/2018 01:09 PM, Dan Smith wrote:

My feeling is that we should not attempt to "migrate" any allocations
or inventories between root or child providers within a compute node,
period.


While I agree this is the simplest approach, it does put a lot of
responsibility on the operators to do work to sidestep this issue, which
might not even apply to them (and knowing if it does might be
difficult).


Perhaps, yes. Though the process I described is certainly not foreign to 
operators. It is a safe and well-practiced approach.



The virt drivers should simply error out of update_provider_tree() if
there are ANY existing VMs on the host AND the virt driver wishes to
begin tracking resources with nested providers.

The upgrade operation should look like this:

1) Upgrade placement
2) Upgrade nova-scheduler
3) start loop on compute nodes. for each compute node:
  3a) disable nova-compute service on node (to take it out of scheduling)
  3b) evacuate all existing VMs off of node


You mean s/evacuate/cold migrate/ of course... :)


I meant evacuate as in `nova host-evacuate-live`, with a fallback to 
`nova host-servers-migrate` if live migration isn't possible.



  3c) upgrade compute node (on restart, the compute node will see no
  VMs running on the node and will construct the provider tree inside
  update_provider_tree() with an appropriate set of child providers
  and inventories on those child providers)
  3d) enable nova-compute service on node

Which is virtually identical to the "normal" upgrade process whenever
there are significant changes to the compute node -- such as upgrading
libvirt or the kernel.


Not necessarily. It's totally legit (and I expect quite common) to just
reboot the host to take kernel changes, bringing back all the instances
that were there when it resumes.


So, you're saying the normal process is to try upgrading the Linux 
kernel and associated low-level libs, wait the requisite amount of time 
that takes (can be a long time) and just hope that everything comes back 
OK? That doesn't sound like any upgrade I've ever seen. All upgrade 
procedures I have seen attempt to get the workloads off of the compute 
host before trying anything major (and upgrading a linux kernel or 
low-level lib like libvirt is a major thing IMHO).


> The "normal" case of moving things
> around slide-puzzle-style applies to live migration (which isn't an
> option here).

Sorry, I was saying that for all the lovely resources that have been 
bolted on to Nova in the past 5 years (CPU pinning, NUMA topologies, PCI 
passthrough, SR-IOV PF/VFs, vGPUs, etc), that if the workload uses 
*those* resources, then live migration won't work and that the admin 
would need to fall back to nova host-servers-migrate. I wasn't saying 
that live migration for all workloads/instances would not be a possibility.



I think people that can take downtime for the instances would rather
not have to move things around for no reason if the instance has to
get shut off anyway.


Maybe. Not sure. But my line of thinking is stick to a single, already 
known procedure since that is safe and well-practiced.


Code that we don't have to write means code that doesn't have new bugs 
that we'll have to track down and fix.


I'm also thinking that we'd be tracking down and fixing those bugs while 
trying to put out a fire that was caused by trying to auto-heal 
everything at once on nova-compute startup and resulting in broken state 
and an inability of the nova-compute service to start again, essentially 
trapping instances on the failed host. ;)



Nested resource tracking is another such significant change and should
be dealt with in a similar way, IMHO.


This basically says that for anyone to move to rocky, they will have to
cold migrate every single instance in order to do that upgrade right?


No, sorry if I wasn't clear. They can live-migrate the instances off of 
the to-be-upgraded compute host. They would only need to cold-migrate 
instances that use the aforementioned non-movable resources.



I kinda think we need to either:

1. Make everything perform the pivot on compute node start (which can be
re-used by a CLI tool for the offline case)


2. Make everything default to non-nested inventory at first, and provide
a way to migrate a compute node and its instances one at a time (in
place) to roll through.


I would vote for Option #2 if it comes down to it.

If we are going to go through the hassle of writing a bunch of 
transformation code in order to keep operator action as low as possible, 
I would prefer to consolidate all of this code into the nova-manage (or 
nova-status) tool and put some sort of attribute/marker on each compute 
node record to indicate whether a "heal" operation has occurred for that 
compute node.


Kinda like what Matt's been playing with for the heal_allocations stuff.
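A minimal sketch of that per-node marker idea follows. The field name (`nested_healed`) and the overall shape are purely hypothetical, invented for illustration.

```python
# Sketch of the per-compute-node "heal" marker idea: a nova-manage-style
# tool pivots each unhealed node and marks it, so reruns are idempotent.

def heal_all(compute_nodes, heal_one):
    """Run heal_one() on every node not yet marked; return the count."""
    healed = 0
    for node in compute_nodes:
        if node.get('nested_healed'):
            continue  # this node's inventories/allocations already moved
        heal_one(node)
        node['nested_healed'] = True
        healed += 1
    return healed

nodes = [{'host': 'cn1'}, {'host': 'cn2', 'nested_healed': True}]
count = heal_all(nodes, lambda node: None)
```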

At least in that case, we'd have all the transform/heal code in a single 
place and we wouldn't need to ha

Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Sylvain Bauza
On Thu, May 31, 2018 at 7:44 PM, Chris Dent  wrote:

> On Thu, 31 May 2018, Dan Smith wrote:
>
> I kinda think we need to either:
>>
>> 1. Make everything perform the pivot on compute node start (which can be
>>   re-used by a CLI tool for the offline case)
>>
>
> This sounds effectively like: validate my inventory and allocations
> at compute node start, correcting them as required (including the
> kind of migration stuff related to nested). Is that right?
>
> That's something I'd like to be the norm. It takes us back to a sort
> of self-healing compute node.
>
> Or am I missing something (forgive me, I've been on holiday).
>


I think I understand the same as you. And I think it's actually the best
approach. Wow, Dan, you saved my life again. Should I call you Mitch
Buchannon?



>
> I just think that forcing people to take down their data plane to work
>> around our own data model is kinda evil and something we should be
>> avoiding at this level of project maturity. What we're really saying is
>> "we know how to translate A into B, but we require you to move many GBs
>> of data over the network and take some downtime because it's easier for
>> *us* than making it seamless."
>>
>
> If we can do it, I agree that being not evil is good.
>
> --
> Chris Dent   ٩◔̯◔۶   https://anticdent.org/
> freenode: cdent tw: @anticdent
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-05-31 Thread Sylvain Bauza
On Thu, May 31, 2018 at 8:26 PM, Eric Fried  wrote:

> > 1. Make everything perform the pivot on compute node start (which can be
> >re-used by a CLI tool for the offline case)
> > 2. Make everything default to non-nested inventory at first, and provide
> >a way to migrate a compute node and its instances one at a time (in
> >place) to roll through.
>
> I agree that it sure would be nice to do ^ rather than requiring the
> "slide puzzle" thing.
>
> But how would this be accomplished, in light of the current "separation
> of responsibilities" drawn at the virt driver interface, whereby the
> virt driver isn't supposed to talk to placement directly, or know
> anything about allocations?  Here's a first pass:
>
>

What we usually do is implement, either at the compute service level or
at the virt driver level, some init_host() method that will reconcile what
you want.
For example, we could imagine a non-virt-specific method (and I like
that because it's non-virt-specific) - i.e. called by compute's init_host()
- that would look up the compute root RP inventories and see whether one or
more inventories tied to specific resource classes have to be moved from
the root RP and attached to a child RP.
The only subtlety that would require a virt-specific update is the
name of the child RP (as both Xen and libvirt plan to use the child RP name
as the vGPU type identifier), but that's an implementation detail that a
possible virt driver update via the resource tracker could reconcile.
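As a rough illustration of that init_host()-time reconciliation: inventory of "nested" resource classes is split off the root RP and assigned to a child RP whose name the virt driver supplies. All names below are invented; the real split would be driven by the virt driver's provider tree.

```python
# Illustrative only: split a root RP's inventory so that "nested"
# resource classes move to child RPs named by the virt driver.

NESTED_CLASSES = {'VGPU'}  # classes that now belong on child RPs

def reconcile_root(root_inventory, child_name_for):
    """Return (new_root_inventory, {child_rp_name: inventory})."""
    new_root, children = {}, {}
    for rc, total in root_inventory.items():
        if rc in NESTED_CLASSES:
            # The child RP name (e.g. the vGPU type) is the only
            # virt-specific input to this otherwise generic step.
            children.setdefault(child_name_for(rc), {})[rc] = total
        else:
            new_root[rc] = total
    return new_root, children

root, children = reconcile_root(
    {'VCPU': 8, 'MEMORY_MB': 16384, 'VGPU': 4},
    lambda rc: 'compute1_nvidia-35')
```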


The virt driver, via the return value from update_provider_tree, tells
> the resource tracker that "inventory of resource class A on provider B
> have moved to provider C" for all applicable AxBxC.  E.g.
>
> [ { 'from_resource_provider': <cn_rp_uuid>,
> 'moved_resources': [VGPU: 4],
> 'to_resource_provider': <gpu_rp1_uuid>
>   },
>   { 'from_resource_provider': <cn_rp_uuid>,
> 'moved_resources': [VGPU: 4],
> 'to_resource_provider': <gpu_rp2_uuid>
>   },
>   { 'from_resource_provider': <cn_rp_uuid>,
> 'moved_resources': [
> SRIOV_NET_VF: 2,
> NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
> NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
> ],
> 'to_resource_provider': <gpu_rp2_uuid>
>   }
> ]
>
> As today, the resource tracker takes the updated provider tree and
> invokes [1] the report client method update_from_provider_tree [2] to
> flush the changes to placement.  But now update_from_provider_tree also
> accepts the return value from update_provider_tree and, for each "move":
>
> - Creates provider C (as described in the provider_tree) if it doesn't
> already exist.
> - Creates/updates provider C's inventory as described in the
> provider_tree (without yet updating provider B's inventory).  This ought
> to create the inventory of resource class A on provider C.
> - Discovers allocations of rc A on rp B and POSTs to move them to rp C*.
> - Updates provider B's inventory.
>
> (*There's a hole here: if we're splitting a glommed-together inventory
> across multiple new child providers, as the VGPUs in the example, we
> don't know which allocations to put where.  The virt driver should know
> which instances own which specific inventory units, and would be able to
> report that info within the data structure.  That's getting kinda close
> to the virt driver mucking with allocations, but maybe it fits well
> enough into this model to be acceptable?)
>
> Note that the return value from update_provider_tree is optional, and
> only used when the virt driver is indicating a "move" of this ilk.  If
> it's None/[] then the RT/update_from_provider_tree flow is the same as
> it is today.
>
> If we can do it this way, we don't need a migration tool.  In fact, we
> don't even need to restrict provider tree "reshaping" to release
> boundaries.  As long as the virt driver understands its own data model
> migrations and reports them properly via update_provider_tree, it can
> shuffle its tree around whenever it wants.
>
> Thoughts?
>
> -efried
>
> [1]
> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
> [2]
> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-01 Thread Eric Fried
Sylvain-

On 05/31/2018 02:41 PM, Sylvain Bauza wrote:
> 
> 
> On Thu, May 31, 2018 at 8:26 PM, Eric Fried  > wrote:
> 
> > 1. Make everything perform the pivot on compute node start (which can be
> >    re-used by a CLI tool for the offline case)
> > 2. Make everything default to non-nested inventory at first, and provide
> >    a way to migrate a compute node and its instances one at a time (in
> >    place) to roll through.
> 
> I agree that it sure would be nice to do ^ rather than requiring the
> "slide puzzle" thing.
> 
> But how would this be accomplished, in light of the current "separation
> of responsibilities" drawn at the virt driver interface, whereby the
> virt driver isn't supposed to talk to placement directly, or know
> anything about allocations?  Here's a first pass:
> 
> 
> 
> What we usually do is to implement either at the compute service level
> or at the virt driver level some init_host() method that will reconcile
> what you want.
> For example, we could just imagine a non-virt specific method (and I
> like that because it's non-virt specific) - ie. called by compute's
> init_host() that would lookup the compute root RP inventories, see
> whether one or more inventories tied to specific resource classes have
> to be moved from the root RP and be attached to a child RP.
> The only subtlety that would require a virt-specific update would be
> the name of the child RP (as both Xen and libvirt plan to use the child
> RP name as the vGPU type identifier) but that's an implementation detail
> that a possible virt driver update by the resource tracker would
> reconcile that.

The question was rhetorical; my suggestion (below) was an attempt at
designing exactly what you've described.  Let me know if I can
explain/clarify it further.  I'm looking for feedback as to whether it's
a viable approach.

> The virt driver, via the return value from update_provider_tree, tells
> the resource tracker that "inventory of resource class A on provider B
> have moved to provider C" for all applicable AxBxC.  E.g.
> 
> [ { 'from_resource_provider': <cn_rp_uuid>,
>     'moved_resources': [VGPU: 4],
>     'to_resource_provider': <gpu_rp1_uuid>
>   },
>   { 'from_resource_provider': <cn_rp_uuid>,
>     'moved_resources': [VGPU: 4],
>     'to_resource_provider': <gpu_rp2_uuid>
>   },
>   { 'from_resource_provider': <cn_rp_uuid>,
>     'moved_resources': [
>         SRIOV_NET_VF: 2,
>         NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
>         NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
>     ],
>     'to_resource_provider': <gpu_rp2_uuid>
>   }
> ]
> 
> As today, the resource tracker takes the updated provider tree and
> invokes [1] the report client method update_from_provider_tree [2] to
> flush the changes to placement.  But now update_from_provider_tree also
> accepts the return value from update_provider_tree and, for each "move":
> 
> - Creates provider C (as described in the provider_tree) if it doesn't
> already exist.
> - Creates/updates provider C's inventory as described in the
> provider_tree (without yet updating provider B's inventory).  This ought
> to create the inventory of resource class A on provider C.
> - Discovers allocations of rc A on rp B and POSTs to move them to rp C*.
> - Updates provider B's inventory.
> 
> (*There's a hole here: if we're splitting a glommed-together inventory
> across multiple new child providers, as the VGPUs in the example, we
> don't know which allocations to put where.  The virt driver should know
> which instances own which specific inventory units, and would be able to
> report that info within the data structure.  That's getting kinda close
> to the virt driver mucking with allocations, but maybe it fits well
> enough into this model to be acceptable?)
> 
> Note that the return value from update_provider_tree is optional, and
> only used when the virt driver is indicating a "move" of this ilk.  If
> it's None/[] then the RT/update_from_provider_tree flow is the same as
> it is today.
> 
> If we can do it this way, we don't need a migration tool.  In fact, we
> don't even need to restrict provider tree "reshaping" to release
> boundaries.  As long as the virt driver understands its own data model
> migrations and reports them properly via update_provider_tree, it can
> shuffle its tree around whenever it wants.
> 
> Thoughts?
> 
> -efried
> 
> [1]
> 
> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/compute/resource_tracker.py#L890
> 
> 
> [2]
> 
> https://github.com/openstack/nova/blob/8753c9a38667f984d385b4783c3c2fc34d7e8e1b/nova/scheduler/client/report.py#L1341
> 
> 
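[Editor's note: the flow quoted above can be sketched in miniature. Everything below is invented for illustration - plain dicts stand in for placement, and 'moved_resources' is modeled as a {resource_class: amount} mapping - the point is only the ordering: create provider C, grow C's inventory, move the allocations, and only then shrink provider B.]

```python
def apply_moves(providers, allocations, moves):
    """Illustrative only: apply "move" records to toy data structures.

    providers:   {rp_name: {resource_class: total}}
    allocations: {consumer_id: {rp_name: {resource_class: used}}}
    moves:       records shaped like the structure quoted above.
    """
    for move in moves:
        src = move['from_resource_provider']
        dst = move['to_resource_provider']
        # 1. Create provider C if it doesn't already exist.
        providers.setdefault(dst, {})
        for rc, amount in move['moved_resources'].items():
            # 2. Create/update C's inventory before touching B's.
            providers[dst][rc] = providers[dst].get(rc, 0) + amount
            # 3. Move existing allocations of rc from B to C.
            for consumer_allocs in allocations.values():
                used = consumer_allocs.get(src, {}).pop(rc, 0)
                if used:
                    consumer_allocs.setdefault(dst, {})[rc] = used
            # 4. Only now shrink B's inventory.
            providers[src][rc] -= amount
    return providers, allocations
```

Note this sequential ordering is exactly what Jay's reply downthread flags as racy against a live scheduler; the sketch shows the intent, not a safe implementation.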

Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-01 Thread Dan Smith
> So, you're saying the normal process is to try upgrading the Linux
> kernel and associated low-level libs, wait the requisite amount of
> time that takes (can be a long time) and just hope that everything
> comes back OK? That doesn't sound like any upgrade I've ever seen.

I'm saying I think it's a process practiced by some to install the new
kernel and libs and then reboot to activate, yeah.

> No, sorry if I wasn't clear. They can live-migrate the instances off
> of the to-be-upgraded compute host. They would only need to
> cold-migrate instances that use the aforementioned non-movable
> resources.

I don't think it's reasonable to force people to have to move every
instance in their cloud (live or otherwise) in order to upgrade. That
means that people who currently do their upgrades in-place in one step,
now have to do their upgrade in N steps, for N compute nodes. That
doesn't seem reasonable to me.

> If we are going to go through the hassle of writing a bunch of
> transformation code in order to keep operator action as low as
> possible, I would prefer to consolidate all of this code into the
> nova-manage (or nova-status) tool and put some sort of
> attribute/marker on each compute node record to indicate whether a
> "heal" operation has occurred for that compute node.

We need to know details of each compute node in order to do that. We
could make the tool external and something they run per-compute node,
but that still makes it N steps, even if the N steps are lighter
weight.

> Someone (maybe Gibi?) on this thread had mentioned having the virt
> driver (in update_provider_tree) do the whole set reserved = total
> thing when first attempting to create the child providers. That would
> work to prevent the scheduler from attempting to place workloads on
> those child providers, but we would still need some marker on the
> compute node to indicate to the nova-manage heal_nested_providers (or
> whatever) command that the compute node has had its provider tree
> validated/healed, right?

So that means you restart your cloud and it's basically locked up until
you perform the N steps to unlock N nodes? That also seems like it's not
going to make us very popular on the playground :)
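[Editor's note: for concreteness, the "set reserved = total" idea in the quoted paragraph might look like the following sketch. The helper is hypothetical, but the field names follow placement's inventory schema, where available capacity is total minus reserved - so reserved == total leaves the scheduler nothing to place against.]

```python
def blocked_inventory(total, resource_class='VGPU'):
    """Report a child RP's inventory with zero usable capacity."""
    return {
        resource_class: {
            'total': total,
            'reserved': total,      # reserved == total => nothing available
            'min_unit': 1,
            'max_unit': total,
            'step_size': 1,
            'allocation_ratio': 1.0,
        }
    }
```

A later "heal" step would drop reserved back to its real value once the node's allocations have been reconciled.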

I need to go read Eric's tome on how to handle the communication of
things from virt to compute so that this translation can be done. I'm
not saying I have the answer, I'm just saying that making this the
problem of the operators doesn't seem like a solution to me, and that we
should figure out how we're going to do this before we go down the
rabbit hole.

--Dan

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-01 Thread Jay Pipes
Dan, you are leaving out the parts of my response where I am agreeing 
with you and saying that your "Option #2" is probably the things we 
should go with.


-jay

On 06/01/2018 12:22 PM, Dan Smith wrote:






Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-01 Thread Jay Pipes

On 05/31/2018 02:26 PM, Eric Fried wrote:

1. Make everything perform the pivot on compute node start (which can be
re-used by a CLI tool for the offline case)
2. Make everything default to non-nested inventory at first, and provide
a way to migrate a compute node and its instances one at a time (in
place) to roll through.


I agree that it sure would be nice to do ^ rather than requiring the
"slide puzzle" thing.

But how would this be accomplished, in light of the current "separation
of responsibilities" drawn at the virt driver interface, whereby the
virt driver isn't supposed to talk to placement directly, or know
anything about allocations?
FWIW, I don't have a problem with the virt driver "knowing about 
allocations". What I have a problem with is the virt driver *claiming 
resources for an instance*.


That's what the whole placement-claims-resources effort was all about,
and I'm not interested in stepping back to the days of long racy claim
operations by having the compute nodes be responsible for claiming
resources.


That said, once the consumer generation microversion lands [1], it 
should be possible to *safely* modify an allocation set for a consumer 
(instance) and move allocation records for an instance from one provider 
to another.


[1] https://review.openstack.org/#/c/565604/
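[Editor's note: an abridged sketch of what such a generation-guarded allocation move might look like. The helper name and exact payload layout are invented (and required fields like project_id/user_id are omitted); the idea is that the entire allocation set is replaced in one PUT, with the consumer generation rejecting concurrent writers.]

```python
import copy

def build_move_payload(consumer_generation, old_allocs,
                       from_rp, to_rp, resource_class):
    """Build a PUT /allocations/{consumer_uuid} body that moves all of
    one resource class from ``from_rp`` to ``to_rp``."""
    allocs = copy.deepcopy(old_allocs)   # don't mutate the caller's view
    amount = allocs[from_rp]['resources'].pop(resource_class)
    if not allocs[from_rp]['resources']:
        del allocs[from_rp]              # drop the now-empty source entry
    allocs.setdefault(to_rp, {'resources': {}})
    allocs[to_rp]['resources'][resource_class] = amount
    return {
        'allocations': allocs,
        # Placement would refuse the write (409) if this generation is
        # stale, making the whole-set replacement safe against racers.
        'consumer_generation': consumer_generation,
    }
```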


Here's a first pass:

The virt driver, via the return value from update_provider_tree, tells
the resource tracker that "inventory of resource class A on provider B
have moved to provider C" for all applicable AxBxC.  E.g.

[ { 'from_resource_provider': ,
    'moved_resources': [VGPU: 4],
    'to_resource_provider': 
  },
  { 'from_resource_provider': ,
    'moved_resources': [VGPU: 4],
    'to_resource_provider': 
  },
  { 'from_resource_provider': ,
    'moved_resources': [
        SRIOV_NET_VF: 2,
        NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
        NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
    ],
    'to_resource_provider': 
  }
]

As today, the resource tracker takes the updated provider tree and
invokes [1] the report client method update_from_provider_tree [2] to
flush the changes to placement.  But now update_from_provider_tree also
accepts the return value from update_provider_tree and, for each "move":

- Creates provider C (as described in the provider_tree) if it doesn't
already exist.
- Creates/updates provider C's inventory as described in the
provider_tree (without yet updating provider B's inventory).  This ought
to create the inventory of resource class A on provider C.


Unfortunately, right here you'll introduce a race condition. As soon as 
this operation completes, the scheduler will have the ability to throw 
new instances on provider C and consume the inventory from it that you 
intend to give to the existing instance that is consuming from provider B.



- Discovers allocations of rc A on rp B and POSTs to move them to rp C*.


For each consumer of resources on rp B, right?


- Updates provider B's inventory.


Again, this is problematic because the scheduler will have already begun 
to place new instances on B's inventory, which could very well result in 
incorrect resource accounting on the node.


We basically need to have one giant new REST API call that accepts the 
list of "move instructions" and performs all of the instructions in a 
single transaction. :(



(*There's a hole here: if we're splitting a glommed-together inventory
across multiple new child providers, as the VGPUs in the example, we
don't know which allocations to put where.  The virt driver should know
which instances own which specific inventory units, and would be able to
report that info within the data structure.  That's getting kinda close
to the virt driver mucking with allocations, but maybe it fits well
enough into this model to be acceptable?)


Well, it's not really the virt driver *itself* mucking with the 
allocations. It's more that the virt driver is telling something *else* 
the move instructions that it feels are needed...



Note that the return value from update_provider_tree is optional, and
only used when the virt driver is indicating a "move" of this ilk.  If
it's None/[] then the RT/update_from_provider_tree flow is the same as
it is today.

If we can do it this way, we don't need a migration tool.  In fact, we
don't even need to restrict provider tree "reshaping" to release
boundaries.  As long as the virt driver understands its own data model
migrations and reports them properly via update_provider_tree, it can
shuffle its tree around whenever it wants.


Due to the many race conditions we would have in trying to fudge 
inventory amounts (the reserved/total thing) and allocation movement for 
>1 consumer at a time, I'm pretty sure the only safe thing to do is 
have a single new HTTP endpoint that would take this list of move 
operations and perform them atomically (on the placement server side of 
course).


Here's a strawman for how that HTTP endp
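[Editor's note: the strawman is cut off in the archive. Purely as illustration of the single-transaction idea - the endpoint, field names, and the "generation of None means create" convention are all invented here - such a request might carry the complete desired end state, inventories and allocations together, so the scheduler never observes an intermediate state:]

```python
import json

# Hypothetical body for a single atomic "reshape" call.  Provider
# generations guard the inventory writes; consumer generations guard
# the allocation writes.  Everything succeeds or nothing does.
reshape_request = {
    'inventories': {
        'cn1_uuid': {                    # root RP: VGPU inventory removed
            'resource_provider_generation': 7,
            'inventories': {'VCPU': {'total': 16}},
        },
        'pgpu0_uuid': {                  # new child RP owning the VGPUs
            'resource_provider_generation': None,   # None => create
            'inventories': {'VGPU': {'total': 8}},
        },
    },
    'allocations': {
        'instance1_uuid': {              # moved from cn1_uuid to pgpu0_uuid
            'allocations': {'pgpu0_uuid': {'resources': {'VGPU': 2}}},
            'consumer_generation': 1,
        },
    },
}
body = json.dumps(reshape_request)
```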

Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-01 Thread Dan Smith
> Dan, you are leaving out the parts of my response where I am agreeing
> with you and saying that your "Option #2" is probably the things we
> should go with.

No, what you said was:

>> I would vote for Option #2 if it comes down to it.

Implying (to me at least) that you still weren't in favor of either, but
would choose that as the least offensive option :)

I didn't quote it because I didn't have any response. I just wanted to
address the other assertions about what is and isn't a common upgrade
scenario, which I think is the important data we need to consider when
making a decision here.

I didn't mean to imply or hide anything with my message trimming, so
sorry if it came across as such.

--Dan



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-01 Thread Jay Pipes

On 06/01/2018 03:02 PM, Dan Smith wrote:

Dan, you are leaving out the parts of my response where I am agreeing
with you and saying that your "Option #2" is probably the things we
should go with.


No, what you said was:


I would vote for Option #2 if it comes down to it.


Implying (to me at least) that you still weren't in favor of either, but
would choose that as the least offensive option :)

I didn't quote it because I didn't have any response. I just wanted to
address the other assertions about what is and isn't a common upgrade
scenario, which I think is the important data we need to consider when
making a decision here.


Understood. I've now accepted the fact that we will need to do something
to transform the data model without requiring operators to move workloads.



I didn't mean to imply or hide anything with my message trimming, so
sorry if it came across as such.


No worries.

Best,
-jay



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-01 Thread Dan Smith
> FWIW, I don't have a problem with the virt driver "knowing about
> allocations". What I have a problem with is the virt driver *claiming
> resources for an instance*.

+1000.

> That's what the whole placement claims resources things was all about,
> and I'm not interested in stepping back to the days of long racy claim
> operations by having the compute nodes be responsible for claiming
> resources.
>
> That said, once the consumer generation microversion lands [1], it
> should be possible to *safely* modify an allocation set for a consumer
> (instance) and move allocation records for an instance from one
> provider to another.

Agreed. I'm hesitant to have the compute nodes arguing with the
scheduler even to patch things up, given the mess we just cleaned
up. The thing that I think makes this okay is that one compute node
cleaning/pivoting allocations for instances isn't going to be fighting
anything else whilst doing it. Migrations and new instance builds, where
it isn't clear whether the source/destination or scheduler/compute owns
the allocation, are a problem.

That said, we need to make sure we can handle the case where an instance
is in resize_confirm state across a boundary where we go from non-NRP to
NRP. It *should* be okay for the compute to handle this by updating the
instance's allocation held by the migration instead of the instance
itself, if the compute determines that it is the source.

--Dan



Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-04 Thread Eric Fried
There has been much discussion.  We've gotten to a point of an initial
proposal and are ready for more (hopefully smaller, hopefully
conclusive) discussion.

To that end, there will be a HANGOUT tomorrow (TUESDAY, JUNE 5TH) at
1500 UTC.  Be in #openstack-placement to get the link to join.

The strawpeople outlined below and discussed in the referenced etherpad
have been consolidated/distilled into a new etherpad [1] around which
the hangout discussion will be centered.

[1] https://etherpad.openstack.org/p/placement-making-the-(up)grade

Thanks,
efried

On 06/01/2018 01:12 PM, Jay Pipes wrote:
> On 05/31/2018 02:26 PM, Eric Fried wrote:
>>> 1. Make everything perform the pivot on compute node start (which can be
>>>     re-used by a CLI tool for the offline case)
>>> 2. Make everything default to non-nested inventory at first, and provide
>>>     a way to migrate a compute node and its instances one at a time (in
>>>     place) to roll through.
>>
>> I agree that it sure would be nice to do ^ rather than requiring the
>> "slide puzzle" thing.
>>
>> But how would this be accomplished, in light of the current "separation
>> of responsibilities" drawn at the virt driver interface, whereby the
>> virt driver isn't supposed to talk to placement directly, or know
>> anything about allocations?
> FWIW, I don't have a problem with the virt driver "knowing about
> allocations". What I have a problem with is the virt driver *claiming
> resources for an instance*.
> 
> That's what the whole placement claims resources things was all about,
> and I'm not interested in stepping back to the days of long racy claim
> operations by having the compute nodes be responsible for claiming
> resources.
> 
> That said, once the consumer generation microversion lands [1], it
> should be possible to *safely* modify an allocation set for a consumer
> (instance) and move allocation records for an instance from one provider
> to another.
> 
> [1] https://review.openstack.org/#/c/565604/
> 
>> Here's a first pass:
>>
>> The virt driver, via the return value from update_provider_tree, tells
>> the resource tracker that "inventory of resource class A on provider B
>> have moved to provider C" for all applicable AxBxC.  E.g.
>>
>> [ { 'from_resource_provider': ,
>>     'moved_resources': [VGPU: 4],
>>     'to_resource_provider': 
>>   },
>>   { 'from_resource_provider': ,
>>     'moved_resources': [VGPU: 4],
>>     'to_resource_provider': 
>>   },
>>   { 'from_resource_provider': ,
>>     'moved_resources': [
>>         SRIOV_NET_VF: 2,
>>         NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
>>         NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
>>     ],
>>     'to_resource_provider': 
>>   }
>> ]
>>
>> As today, the resource tracker takes the updated provider tree and
>> invokes [1] the report client method update_from_provider_tree [2] to
>> flush the changes to placement.  But now update_from_provider_tree also
>> accepts the return value from update_provider_tree and, for each "move":
>>
>> - Creates provider C (as described in the provider_tree) if it doesn't
>> already exist.
>> - Creates/updates provider C's inventory as described in the
>> provider_tree (without yet updating provider B's inventory).  This ought
>> to create the inventory of resource class A on provider C.
> 
> Unfortunately, right here you'll introduce a race condition. As soon as
> this operation completes, the scheduler will have the ability to throw
> new instances on provider C and consume the inventory from it that you
> intend to give to the existing instance that is consuming from provider B.
> 
>> - Discovers allocations of rc A on rp B and POSTs to move them to rp C*.
> 
> For each consumer of resources on rp B, right?
> 
>> - Updates provider B's inventory.
> 
> Again, this is problematic because the scheduler will have already begun
> to place new instances on B's inventory, which could very well result in
> incorrect resource accounting on the node.
> 
> We basically need to have one giant new REST API call that accepts the
> list of "move instructions" and performs all of the instructions in a
> single transaction. :(
> 
>> (*There's a hole here: if we're splitting a glommed-together inventory
>> across multiple new child providers, as the VGPUs in the example, we
>> don't know which allocations to put where.  The virt driver should know
>> which instances own which specific inventory units, and would be able to
>> report that info within the data structure.  That's getting kinda close
>> to the virt driver mucking with allocations, but maybe it fits well
>> enough into this model to be acceptable?)
> 
> Well, it's not really the virt driver *itself* mucking with the
> allocations. It's more that the virt driver is telling something *else*
> the move instructions that it feels are needed...
> 
>> Note that the return value from update_provider_tree is optional, and
>> only used when the virt driver is indicating a "move" of thi

Re: [openstack-dev] [nova] [placement] Upgrade concerns with nested Resource Providers

2018-06-08 Thread Eric Fried
There is now a blueprint [1] and draft spec [2].  Reviews welcomed.

[1] https://blueprints.launchpad.net/nova/+spec/reshape-provider-tree
[2] https://review.openstack.org/#/c/572583/

On 06/04/2018 06:00 PM, Eric Fried wrote:
> There has been much discussion.  We've gotten to a point of an initial
> proposal and are ready for more (hopefully smaller, hopefully
> conclusive) discussion.
> 
> To that end, there will be a HANGOUT tomorrow (TUESDAY, JUNE 5TH) at
> 1500 UTC.  Be in #openstack-placement to get the link to join.
> 
> The strawpeople outlined below and discussed in the referenced etherpad
> have been consolidated/distilled into a new etherpad [1] around which
> the hangout discussion will be centered.
> 
> [1] https://etherpad.openstack.org/p/placement-making-the-(up)grade
> 
> Thanks,
> efried
> 
> On 06/01/2018 01:12 PM, Jay Pipes wrote:
>> On 05/31/2018 02:26 PM, Eric Fried wrote:
>>>> 1. Make everything perform the pivot on compute node start (which can be
>>>>    re-used by a CLI tool for the offline case)
>>>> 2. Make everything default to non-nested inventory at first, and provide
>>>>    a way to migrate a compute node and its instances one at a time (in
>>>>    place) to roll through.
>>>
>>> I agree that it sure would be nice to do ^ rather than requiring the
>>> "slide puzzle" thing.
>>>
>>> But how would this be accomplished, in light of the current "separation
>>> of responsibilities" drawn at the virt driver interface, whereby the
>>> virt driver isn't supposed to talk to placement directly, or know
>>> anything about allocations?
>> FWIW, I don't have a problem with the virt driver "knowing about
>> allocations". What I have a problem with is the virt driver *claiming
>> resources for an instance*.
>>
>> That's what the whole placement claims resources things was all about,
>> and I'm not interested in stepping back to the days of long racy claim
>> operations by having the compute nodes be responsible for claiming
>> resources.
>>
>> That said, once the consumer generation microversion lands [1], it
>> should be possible to *safely* modify an allocation set for a consumer
>> (instance) and move allocation records for an instance from one provider
>> to another.
>>
>> [1] https://review.openstack.org/#/c/565604/
>>
>>> Here's a first pass:
>>>
>>> The virt driver, via the return value from update_provider_tree, tells
>>> the resource tracker that "inventory of resource class A on provider B
>>> have moved to provider C" for all applicable AxBxC.  E.g.
>>>
>>> [ { 'from_resource_provider': ,
>>>     'moved_resources': [VGPU: 4],
>>>     'to_resource_provider': 
>>>   },
>>>   { 'from_resource_provider': ,
>>>     'moved_resources': [VGPU: 4],
>>>     'to_resource_provider': 
>>>   },
>>>   { 'from_resource_provider': ,
>>>     'moved_resources': [
>>>         SRIOV_NET_VF: 2,
>>>         NET_BANDWIDTH_EGRESS_KILOBITS_PER_SECOND: 1000,
>>>         NET_BANDWIDTH_INGRESS_KILOBITS_PER_SECOND: 1000,
>>>     ],
>>>     'to_resource_provider': 
>>>   }
>>> ]
>>>
>>> As today, the resource tracker takes the updated provider tree and
>>> invokes [1] the report client method update_from_provider_tree [2] to
>>> flush the changes to placement.  But now update_from_provider_tree also
>>> accepts the return value from update_provider_tree and, for each "move":
>>>
>>> - Creates provider C (as described in the provider_tree) if it doesn't
>>> already exist.
>>> - Creates/updates provider C's inventory as described in the
>>> provider_tree (without yet updating provider B's inventory).  This ought
>>> to create the inventory of resource class A on provider C.
>>
>> Unfortunately, right here you'll introduce a race condition. As soon as
>> this operation completes, the scheduler will have the ability to throw
>> new instances on provider C and consume the inventory from it that you
>> intend to give to the existing instance that is consuming from provider B.
>>
>>> - Discovers allocations of rc A on rp B and POSTs to move them to rp C*.
>>
>> For each consumer of resources on rp B, right?
>>
>>> - Updates provider B's inventory.
>>
>> Again, this is problematic because the scheduler will have already begun
>> to place new instances on B's inventory, which could very well result in
>> incorrect resource accounting on the node.
>>
>> We basically need to have one giant new REST API call that accepts the
>> list of "move instructions" and performs all of the instructions in a
>> single transaction. :(
>>
>>> (*There's a hole here: if we're splitting a glommed-together inventory
>>> across multiple new child providers, as the VGPUs in the example, we
>>> don't know which allocations to put where.  The virt driver should know
>>> which instances own which specific inventory units, and would be able to
>>> report that info within the data structure.  That's getting kinda close
>>> to the virt driver mucking with allocations, but maybe it fits well
>>> enough into this model to be accept