Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
Thanks Giblet, will review this afternoon.

Best,
-jay

On 09/17/2018 09:10 AM, Balázs Gibizer wrote:
> Hi,
>
> Reworked and rebased the series based on this thread. The series
> starts here: https://review.openstack.org/#/c/591597
>
> Cheers,
> gibi
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
Hi,

Reworked and rebased the series based on this thread. The series starts
here: https://review.openstack.org/#/c/591597

Cheers,
gibi
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
On 08/22/2018 08:55 AM, Balázs Gibizer wrote:
> On Fri, Aug 17, 2018 at 5:40 PM, Eric Fried wrote:
>> gibi-
>>
>>>> - On migration, when we transfer the allocations in either
>>>> direction, a conflict means someone managed to resize (or otherwise
>>>> change allocations?) since the last time we pulled data. Given the
>>>> global lock in the report client, this should have been tough to
>>>> do. If it does happen, I would think any retry would need to be
>>>> done all the way back at the claim, which I imagine is higher up
>>>> than we should go. So again, I think we should fail the migration
>>>> and make the user retry.
>>>
>>> Do we want to fail the whole migration or just the migration step
>>> (e.g. confirm, revert)? The latter means that failure during confirm
>>> or revert would put the instance back to VERIFY_RESIZE, while the
>>> former would mean that in case of a conflict at confirm we try an
>>> automatic revert. But for a conflict at revert we can only put the
>>> instance into ERROR state.
>>
>> This again should be "impossible" to come across. What would the
>> behavior be if we hit, say, ValueError in this spot?
>
> I might not totally follow you. I see two options to choose from for
> the revert case:
>
> a) Allocation manipulation error during revert of a migration causes
> the instance to go to ERROR. -> the end user cannot retry the revert;
> the instance needs to be deleted.

I would say this one is correct, but not because the user did anything
wrong. Rather, *something inside Nova failed*, because technically Nova
shouldn't allow resource allocations to change while a server is in
CONFIRMING_RESIZE task state. If we didn't make the server go to an
ERROR state, I'm afraid we'd have no indication anywhere that this
improper situation ever happened, and we'd end up hiding some serious
data corruption bugs.

> b) Allocation manipulation error during revert of a migration causes
> the instance to go back to VERIFY_RESIZE state. -> the end user can
> retry the revert via the API.
>
> I see three options to choose from for the confirm case:
>
> a) Allocation manipulation error during confirm of a migration causes
> the instance to go to ERROR. -> the end user cannot retry the confirm;
> the instance needs to be deleted.

For the same reasons outlined above, I think this is the only safe
option.

Best,
-jay

> b) Allocation manipulation error during confirm of a migration causes
> the instance to go back to VERIFY_RESIZE state. -> the end user can
> retry the confirm via the API.
>
> c) Allocation manipulation error during confirm of a migration causes
> nova to automatically try to revert the migration. (For a failure
> during this revert the same options are available as for the generic
> revert case, see above.)
>
> We also need to consider live migration. It is similar in the sense
> that it also uses move_allocations, but it is different in that the
> end user doesn't explicitly confirm or revert a live migration.
>
> I'm looking for opinions about which option we should take in each
> case.
>
> gibi
>
>> -efried
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
Sorry for the delay in responding to this, Gibi and Eric. Comments
inline. tl;dr: go with option a).

On 08/16/2018 11:34 AM, Eric Fried wrote:
> Thanks for this, gibi.
>
> TL;DR: a).
>
> I didn't look, but I'm pretty sure we're not caching allocations in
> the report client. Today, nobody outside of nova (specifically the
> resource tracker via the report client) is supposed to be mucking with
> instance allocations, right? And given the global lock in the resource
> tracker, it should be pretty difficult to race e.g. a resize and a
> delete in any meaningful way.

It's not a global (i.e. multi-node) lock. It's a semaphore for just
that compute node. Migrations (mostly) involve more than one compute
node, so the compute node semaphore is useless in that regard, thus the
need to go with option a) and bail out on any change to the generation
of any of the consumers involved in the migration operation.

> So short term, IMO it is reasonable to treat any generation conflict
> as an error. No retries. Possible wrinkle on delete, where it should
> be a failure unless forced.

Agreed for all migration and deletion operations.

> Long term, I also can't come up with any scenario where it would be
> appropriate to do a narrowly-focused GET+merge/replace+retry. But
> implementing the above short-term plan shouldn't prevent us from
> adding retries for individual scenarios later if we do uncover places
> where it makes sense.

Neither can I. Safety first, IMHO.

Best,
-jay
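[Editorial sketch] To make the lock-scope point above concrete, here is
a minimal sketch (plain threading as a stand-in for nova's per-process
synchronization; put_allocations is a hypothetical helper, not nova's
report client) of why a process-local semaphore cannot serialize the
two compute nodes involved in a migration:

    import threading

    # One lock per compute *process* (i.e. per compute node), not
    # cluster-wide.
    COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()

    def update_instance_allocation(consumer_uuid, allocations,
                                   put_allocations):
        # Serializes allocation updates made from this node only. The
        # other compute node in a migration runs the same code under
        # its own lock instance, so placement's consumer generation is
        # the only mechanism that can detect the cross-node race.
        with COMPUTE_RESOURCE_SEMAPHORE:
            put_allocations(consumer_uuid, allocations)

Placement's generation check is what turns that otherwise silent race
into a 409 the caller can act on.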
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
b) sounds the most sane in both cases. I don't like the idea of "your
move operation failed and you have no recourse but to delete your
instance". And automatic retry sounds lovely, but potentially hairy to
implement (and we would need to account for the retries-failed scenario
anyway), so at least initially we should leave that out.

On 08/22/2018 07:55 AM, Balázs Gibizer wrote:
> On Fri, Aug 17, 2018 at 5:40 PM, Eric Fried wrote:
>> gibi-
>>
>>>> - On migration, when we transfer the allocations in either
>>>> direction, a conflict means someone managed to resize (or otherwise
>>>> change allocations?) since the last time we pulled data. Given the
>>>> global lock in the report client, this should have been tough to
>>>> do. If it does happen, I would think any retry would need to be
>>>> done all the way back at the claim, which I imagine is higher up
>>>> than we should go. So again, I think we should fail the migration
>>>> and make the user retry.
>>>
>>> Do we want to fail the whole migration or just the migration step
>>> (e.g. confirm, revert)? The latter means that failure during confirm
>>> or revert would put the instance back to VERIFY_RESIZE, while the
>>> former would mean that in case of a conflict at confirm we try an
>>> automatic revert. But for a conflict at revert we can only put the
>>> instance into ERROR state.
>>
>> This again should be "impossible" to come across. What would the
>> behavior be if we hit, say, ValueError in this spot?
>
> I might not totally follow you. I see two options to choose from for
> the revert case:
>
> a) Allocation manipulation error during revert of a migration causes
> the instance to go to ERROR. -> the end user cannot retry the revert;
> the instance needs to be deleted.
>
> b) Allocation manipulation error during revert of a migration causes
> the instance to go back to VERIFY_RESIZE state. -> the end user can
> retry the revert via the API.
>
> I see three options to choose from for the confirm case:
>
> a) Allocation manipulation error during confirm of a migration causes
> the instance to go to ERROR. -> the end user cannot retry the confirm;
> the instance needs to be deleted.
>
> b) Allocation manipulation error during confirm of a migration causes
> the instance to go back to VERIFY_RESIZE state. -> the end user can
> retry the confirm via the API.
>
> c) Allocation manipulation error during confirm of a migration causes
> nova to automatically try to revert the migration. (For a failure
> during this revert the same options are available as for the generic
> revert case, see above.)
>
> We also need to consider live migration. It is similar in the sense
> that it also uses move_allocations, but it is different in that the
> end user doesn't explicitly confirm or revert a live migration.
>
> I'm looking for opinions about which option we should take in each
> case.
>
> gibi
>
>> -efried
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
On Fri, Aug 17, 2018 at 5:40 PM, Eric Fried wrote:
> gibi-
>
>>> - On migration, when we transfer the allocations in either
>>> direction, a conflict means someone managed to resize (or otherwise
>>> change allocations?) since the last time we pulled data. Given the
>>> global lock in the report client, this should have been tough to do.
>>> If it does happen, I would think any retry would need to be done all
>>> the way back at the claim, which I imagine is higher up than we
>>> should go. So again, I think we should fail the migration and make
>>> the user retry.
>>
>> Do we want to fail the whole migration or just the migration step
>> (e.g. confirm, revert)? The latter means that failure during confirm
>> or revert would put the instance back to VERIFY_RESIZE, while the
>> former would mean that in case of a conflict at confirm we try an
>> automatic revert. But for a conflict at revert we can only put the
>> instance into ERROR state.
>
> This again should be "impossible" to come across. What would the
> behavior be if we hit, say, ValueError in this spot?

I might not totally follow you. I see two options to choose from for
the revert case:

a) Allocation manipulation error during revert of a migration causes
the instance to go to ERROR. -> the end user cannot retry the revert;
the instance needs to be deleted.

b) Allocation manipulation error during revert of a migration causes
the instance to go back to VERIFY_RESIZE state. -> the end user can
retry the revert via the API.

I see three options to choose from for the confirm case:

a) Allocation manipulation error during confirm of a migration causes
the instance to go to ERROR. -> the end user cannot retry the confirm;
the instance needs to be deleted.

b) Allocation manipulation error during confirm of a migration causes
the instance to go back to VERIFY_RESIZE state. -> the end user can
retry the confirm via the API.

c) Allocation manipulation error during confirm of a migration causes
nova to automatically try to revert the migration. (For a failure
during this revert the same options are available as for the generic
revert case, see above.)

We also need to consider live migration. It is similar in the sense
that it also uses move_allocations, but it is different in that the
end user doesn't explicitly confirm or revert a live migration.

I'm looking for opinions about which option we should take in each
case.

gibi

> -efried
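[Editorial sketch] A minimal, hedged illustration of options a) and b)
for the revert case; AllocationMoveFailed, move_allocations and the
vm_state values below are illustrative stand-ins, not a faithful copy
of nova's code:

    class AllocationMoveFailed(Exception):
        # Stand-in for placement reporting a consumer generation
        # conflict while moving allocations between consumers.
        pass

    def revert_resize(instance, migration, reportclient):
        try:
            # Move the allocations held by migration.uuid back to
            # instance.uuid in placement.
            reportclient.move_allocations(migration.uuid, instance.uuid)
        except AllocationMoveFailed:
            # Option a): surface the inconsistency loudly and stop; the
            # server can only be deleted after this.
            instance.vm_state = 'error'
            instance.save()
            raise
            # Option b) would instead restore the resized state so the
            # end user can retry the revert via the API:
            #   instance.vm_state = 'resized'  # shown as VERIFY_RESIZE
            #   instance.save()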
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
gibi-

>> - On migration, when we transfer the allocations in either direction,
>> a conflict means someone managed to resize (or otherwise change
>> allocations?) since the last time we pulled data. Given the global
>> lock in the report client, this should have been tough to do. If it
>> does happen, I would think any retry would need to be done all the
>> way back at the claim, which I imagine is higher up than we should
>> go. So again, I think we should fail the migration and make the user
>> retry.
>
> Do we want to fail the whole migration or just the migration step
> (e.g. confirm, revert)? The latter means that failure during confirm
> or revert would put the instance back to VERIFY_RESIZE, while the
> former would mean that in case of a conflict at confirm we try an
> automatic revert. But for a conflict at revert we can only put the
> instance into ERROR state.

This again should be "impossible" to come across. What would the
behavior be if we hit, say, ValueError in this spot?

-efried
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
On Thu, Aug 16, 2018 at 5:34 PM, Eric Fried wrote:
> Thanks for this, gibi.
>
> TL;DR: a).
>
> I didn't look, but I'm pretty sure we're not caching allocations in
> the report client. Today, nobody outside of nova (specifically the
> resource tracker via the report client) is supposed to be mucking with
> instance allocations, right? And given the global lock in the resource
> tracker, it should be pretty difficult to race e.g. a resize and a
> delete in any meaningful way.
>
> So short term, IMO it is reasonable to treat any generation conflict
> as an error. No retries. Possible wrinkle on delete, where it should
> be a failure unless forced.

Yes, today the instance_uuid and migration_uuid consumers in placement
are only changed by nova. Right now I don't have any example where nova
is racing with itself on an instance or migration consumer. We could
try hitting the Nova API in parallel with different server lifecycle
operations against the same server to see if we can find races. But
until such a race is discovered, we can go with option a).

> Long term, I also can't come up with any scenario where it would be
> appropriate to do a narrowly-focused GET+merge/replace+retry. But
> implementing the above short-term plan shouldn't prevent us from
> adding retries for individual scenarios later if we do uncover places
> where it makes sense.

Later, when resources consumed by a server are handled outside of nova,
like bandwidth from neutron and accelerators from cyborg, we might see
cases where nova is not the only module changing an instance_uuid
consumer. Then we have to decide how to handle that. I think one
solution could be to make sure Nova knows about the bandwidth and
accelerator resource needs of a server even if they are provided by
neutron or cyborg. This knowledge is needed anyway to support atomic
resource claims in the scheduler. For neutron ports this will be done
through the resource_request attribute of the port. So even if the
resource need of a port changes, nova can go back to neutron and query
the current need. This way nova can implement the following generic
algorithm for every operation where nova wants to change the
instance_uuid consumer in placement (see the code sketch after this
message):

* collect the server's current resource needs (might involve reading
  them from the flavor, from neutron ports, from cyborg accelerators)
  and apply the change nova wants to make (e.g. delete, move, resize)
* GET the current consumer view from placement
* merge the two and push the result back to placement

> Here's some stream-of-consciousness that led me to the above opinions:
>
> - On spawn, we send the allocation with a consumer gen of None because
> we expect the consumer not to exist. If it exists, that should be a
> hard fail. (Hopefully the only way this happens is a true UUID
> conflict.)
>
> - On migration, when we create the migration UUID, ditto above ^

I agree on both. I suggest returning HTTP 500, as we need a bug report
about these cases.

> - On migration, when we transfer the allocations in either direction,
> a conflict means someone managed to resize (or otherwise change
> allocations?) since the last time we pulled data. Given the global
> lock in the report client, this should have been tough to do. If it
> does happen, I would think any retry would need to be done all the way
> back at the claim, which I imagine is higher up than we should go. So
> again, I think we should fail the migration and make the user retry.

Do we want to fail the whole migration or just the migration step (e.g.
confirm, revert)? The latter means that failure during confirm or
revert would put the instance back to VERIFY_RESIZE, while the former
would mean that in case of a conflict at confirm we try an automatic
revert. But for a conflict at revert we can only put the instance into
ERROR state.

> - On destroy, a conflict again means someone managed a resize despite
> the global lock. If I'm deleting an instance and something about it
> changes, I would think I want the opportunity to reevaluate my
> decision to delete it. That said, I would definitely want a way to
> force it (in which case we can just use the DELETE call explicitly).
> But neither case should be a retry, and certainly there is no destroy
> scenario where I would want a "merging" of allocations to happen.

Good idea about allowing the delete to be forced. So a simple DELETE
/servers/{instance_uuid} could fail on a consumer conflict, but a POST
/servers/{instance_uuid}/action with a forceDelete body would use
DELETE /allocations and therefore ignore any consumer generation.

Cheers,
gibi

> Thanks,
> efried
>
> On 08/16/2018 06:43 AM, Balázs Gibizer wrote:
>> reformatted for readability, sorry:
>>
>> Hi,
>>
>> tl;dr: To properly use consumer generation (placement 1.28) in Nova
>> we need to decide how to handle consumer generation conflict from
>> Nova perspective:
>> a) Nova reads the current consumer_generation before the allocation
>> update operation and uses that generation in the allocation update
>> operation. [...]
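[Editorial sketch] A rough sketch of the read-merge-update loop gibi
describes above (option c)), assuming placement microversion 1.28, a
requests-style HTTP client, and hand-waved auth. The merge() helper,
the endpoint, and the payload handling are illustrative, not nova's
actual report client:

    import requests

    PLACEMENT = 'http://placement.example'            # assumed endpoint
    HEADERS = {'OpenStack-API-Version': 'placement 1.28',
               'X-Auth-Token': '<token>'}             # auth elided

    def merge(current, delta):
        # Hypothetical merge of placement's current view with the
        # change nova wants to make; here delta simply replaces the
        # per-provider resources, while a real merge would add or
        # subtract individual resource classes.
        merged = {rp: {'resources': dict(alloc['resources'])}
                  for rp, alloc in current.items()}
        merged.update(delta)
        return merged

    def apply_change(consumer_uuid, delta, retries=3):
        for _ in range(retries):
            # Read the consumer's current allocations and generation.
            current = requests.get(
                '%s/allocations/%s' % (PLACEMENT, consumer_uuid),
                headers=HEADERS).json()
            payload = {
                'allocations': merge(current['allocations'], delta),
                'consumer_generation': current.get('consumer_generation'),
                'project_id': current['project_id'],
                'user_id': current['user_id'],
            }
            resp = requests.put(
                '%s/allocations/%s' % (PLACEMENT, consumer_uuid),
                json=payload, headers=HEADERS)
            if resp.status_code != 409:
                return resp
            # 409: someone updated the consumer in between; re-read
            # the fresh generation and retry.
        raise RuntimeError('consumer generation conflict persisted')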
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
Thanks for this, gibi.

TL;DR: a).

I didn't look, but I'm pretty sure we're not caching allocations in the
report client. Today, nobody outside of nova (specifically the resource
tracker via the report client) is supposed to be mucking with instance
allocations, right? And given the global lock in the resource tracker,
it should be pretty difficult to race e.g. a resize and a delete in any
meaningful way.

So short term, IMO it is reasonable to treat any generation conflict as
an error. No retries. Possible wrinkle on delete, where it should be a
failure unless forced.

Long term, I also can't come up with any scenario where it would be
appropriate to do a narrowly-focused GET+merge/replace+retry. But
implementing the above short-term plan shouldn't prevent us from adding
retries for individual scenarios later if we do uncover places where it
makes sense.

Here's some stream-of-consciousness that led me to the above opinions:

- On spawn, we send the allocation with a consumer gen of None because
we expect the consumer not to exist. If it exists, that should be a
hard fail. (Hopefully the only way this happens is a true UUID
conflict.)

- On migration, when we create the migration UUID, ditto above ^

- On migration, when we transfer the allocations in either direction, a
conflict means someone managed to resize (or otherwise change
allocations?) since the last time we pulled data. Given the global lock
in the report client, this should have been tough to do. If it does
happen, I would think any retry would need to be done all the way back
at the claim, which I imagine is higher up than we should go. So again,
I think we should fail the migration and make the user retry.

- On destroy, a conflict again means someone managed a resize despite
the global lock. If I'm deleting an instance and something about it
changes, I would think I want the opportunity to reevaluate my decision
to delete it. That said, I would definitely want a way to force it (in
which case we can just use the DELETE call explicitly). But neither
case should be a retry, and certainly there is no destroy scenario
where I would want a "merging" of allocations to happen.

Thanks,
efried

On 08/16/2018 06:43 AM, Balázs Gibizer wrote:
> reformatted for readability, sorry:
>
> Hi,
>
> tl;dr: To properly use consumer generation (placement 1.28) in Nova we
> need to decide how to handle consumer generation conflict from Nova
> perspective:
>
> a) Nova reads the current consumer_generation before the allocation
> update operation and uses that generation in the allocation update
> operation. If the allocation is changed between the read and the
> update then nova fails the server lifecycle operation and lets the
> end user retry it.
>
> b) Like a), but in case of conflict nova blindly retries the
> read-and-update operation pair a couple of times and only fails the
> lifecycle operation if it runs out of retries.
>
> c) Nova stores its own view of the allocation. When a consumer's
> allocation needs to be modified then nova reads the current state of
> the consumer from placement. Then nova combines the two allocations
> to generate the new expected consumer state. In case of generation
> conflict nova retries the read-combine-update operation triplet.
>
> Which way should we go now?
>
> What should be our long term goal?
>
> Details:
>
> There are plenty of affected lifecycle operations. See the patch
> series starting at [1].
>
> For example:
>
> The current patch [1] that handles the delete server case implements
> option b). It simply reads the current consumer generation from
> placement and uses that to send a PUT /allocations/{instance_uuid}
> with "allocations": {} in its body.
>
> Here implementing option c) would mean that during server delete nova
> needs:
> 1) to compile its own view of the resource needs of the server
>    (currently based on the flavor, but in the future based on the
>    attached ports' resource requests as well)
> 2) then read the current allocation of the server from placement
> 3) then subtract the server's resource needs from the current
>    allocation and send the resulting allocation back in the update to
>    placement
>
> In the simple case this subtraction would result in an empty
> allocation sent to placement. Also, in this simple case c) has the
> same effect as b), which is currently implemented in [1].
>
> However, if somebody outside of nova modifies the allocation of this
> consumer in a way that nova does not know about such a changed
> resource need, then b) and c) will result in different placement state
> after server delete.
>
> I only know of one example, the change of a neutron port's resource
> request while the port is attached. (Note, it is out of scope in the
> first step of the bandwidth implementation.) In this specific example
> option c) can work if nova re-reads the port's resource request during
> delete when it recalculates its own view of the server's resource
> needs. But I don't know if every other resource (e.g. accelerators)
> used by a server can be / will be handled this way. [...]
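[Editorial sketch] For concreteness, roughly what the microversion 1.28
payload shapes described in efried's spawn/migration bullets look like;
the UUIDs and values below are made up for illustration:

    # Spawn (or creating the migration UUID): we expect the consumer
    # not to exist yet, so we send a null consumer generation. If the
    # consumer already exists, placement answers 409, which should be
    # a hard fail.
    new_consumer_payload = {
        'allocations': {
            # resource provider UUID (made up)
            'b107b472-1a7c-4983-a8de-9b2326a17a4a': {
                'resources': {'VCPU': 2, 'MEMORY_MB': 4096},
            },
        },
        'consumer_generation': None,
        'project_id': 'fake-project',
        'user_id': 'fake-user',
    }

    # Updating an existing consumer: echo the generation last read from
    # placement; a 409 then means someone changed the allocations since.
    update_payload = dict(new_consumer_payload, consumer_generation=5)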
Re: [openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
reformatted for readability, sorry:

Hi,

tl;dr: To properly use consumer generation (placement 1.28) in Nova we
need to decide how to handle consumer generation conflict from Nova
perspective:

a) Nova reads the current consumer_generation before the allocation
   update operation and uses that generation in the allocation update
   operation. If the allocation is changed between the read and the
   update then nova fails the server lifecycle operation and lets the
   end user retry it.

b) Like a), but in case of conflict nova blindly retries the
   read-and-update operation pair a couple of times and only fails the
   lifecycle operation if it runs out of retries.

c) Nova stores its own view of the allocation. When a consumer's
   allocation needs to be modified then nova reads the current state of
   the consumer from placement. Then nova combines the two allocations
   to generate the new expected consumer state. In case of generation
   conflict nova retries the read-combine-update operation triplet.

Which way should we go now?

What should be our long term goal?

Details:

There are plenty of affected lifecycle operations. See the patch series
starting at [1].

For example:

The current patch [1] that handles the delete server case implements
option b). It simply reads the current consumer generation from
placement and uses that to send a PUT /allocations/{instance_uuid} with
"allocations": {} in its body (see the sketch after this message).

Here implementing option c) would mean that during server delete nova
needs:
1) to compile its own view of the resource needs of the server
   (currently based on the flavor, but in the future based on the
   attached ports' resource requests as well)
2) then read the current allocation of the server from placement
3) then subtract the server's resource needs from the current
   allocation and send the resulting allocation back in the update to
   placement

In the simple case this subtraction would result in an empty allocation
sent to placement. Also, in this simple case c) has the same effect as
b), which is currently implemented in [1].

However, if somebody outside of nova modifies the allocation of this
consumer in a way that nova does not know about such a changed resource
need, then b) and c) will result in different placement state after
server delete.

I only know of one example, the change of a neutron port's resource
request while the port is attached. (Note, it is out of scope in the
first step of the bandwidth implementation.) In this specific example
option c) can work if nova re-reads the port's resource request during
delete when it recalculates its own view of the server's resource
needs. But I don't know if every other resource (e.g. accelerators)
used by a server can be / will be handled this way.

Other examples of affected lifecycle operations:

During a server migration, moving the source host allocation from the
instance_uuid to the migration_uuid fails with a consumer generation
conflict because of the instance_uuid consumer generation. [2]

Confirming a migration fails when the deletion of the source host
allocation fails due to the consumer generation conflict on the
migration_uuid consumer that is being emptied. [3]

During scheduling of a new server, putting the allocation to
instance_uuid fails because the scheduler assumes that it is a new
consumer and therefore uses consumer_generation: None for the
allocation, but placement reports a generation conflict. [4]

During a non-forced evacuation, the scheduler tries to claim the
resources on the destination host with the instance_uuid, but that
consumer already holds the source allocation, therefore the scheduler
cannot assume that the instance_uuid is a new consumer. [4]

[1] https://review.openstack.org/#/c/591597
[2] https://review.openstack.org/#/c/591810
[3] https://review.openstack.org/#/c/591811
[4] https://review.openstack.org/#/c/583667
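[Editorial sketch] A hedged sketch of what the option b) delete flow in
[1] amounts to, under the same assumptions as the earlier sketches
(microversion 1.28, requests-style client, auth elided); this is an
illustration, not the patch's actual code:

    import requests

    PLACEMENT = 'http://placement.example'
    HEADERS = {'OpenStack-API-Version': 'placement 1.28',
               'X-Auth-Token': '<token>'}

    def delete_allocations(instance_uuid, retries=3):
        for _ in range(retries):
            # Read the current consumer generation (and ids) from
            # placement.
            current = requests.get(
                '%s/allocations/%s' % (PLACEMENT, instance_uuid),
                headers=HEADERS).json()
            # Empty the consumer's allocations, guarded by the
            # generation just read.
            resp = requests.put(
                '%s/allocations/%s' % (PLACEMENT, instance_uuid),
                json={'allocations': {},
                      'consumer_generation':
                          current.get('consumer_generation'),
                      'project_id': current.get('project_id'),
                      'user_id': current.get('user_id')},
                headers=HEADERS)
            if resp.status_code != 409:
                return resp
            # 409 conflict: blindly re-read and retry, per option b).
        raise RuntimeError('could not empty allocations: repeated '
                           'generation conflicts')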
[openstack-dev] [nova] how nova should behave when placement returns consumer generation conflict
Hi,

tl;dr: To properly use consumer generation (placement 1.28) in Nova we
need to decide how to handle consumer generation conflict from Nova
perspective:

a) Nova reads the current consumer_generation before the allocation
   update operation and uses that generation in the allocation update
   operation. If the allocation is changed between the read and the
   update then nova fails the server lifecycle operation and lets the
   end user retry it.

b) Like a), but in case of conflict nova blindly retries the
   read-and-update operation pair a couple of times and only fails the
   lifecycle operation if it runs out of retries.

c) Nova stores its own view of the allocation. When a consumer's
   allocation needs to be modified then nova reads the current state of
   the consumer from placement. Then nova combines the two allocations
   to generate the new expected consumer state. In case of generation
   conflict nova retries the read-combine-update operation triplet.

Which way should we go now?

What should be our long term goal?

Details:

There are plenty of affected lifecycle operations. See the patch series
starting at [1].

For example:

The current patch [1] that handles the delete server case implements
option b). It simply reads the current consumer generation from
placement and uses that to send a PUT /allocations/{instance_uuid} with
"allocations": {} in its body.

Here implementing option c) would mean that during server delete nova
needs:
1) to compile its own view of the resource needs of the server
   (currently based on the flavor, but in the future based on the
   attached ports' resource requests as well)
2) then read the current allocation of the server from placement
3) then subtract the server's resource needs from the current
   allocation and send the resulting allocation back in the update to
   placement

In the simple case this subtraction would result in an empty allocation
sent to placement. Also, in this simple case c) has the same effect as
b), which is currently implemented in [1].

However, if somebody outside of nova modifies the allocation of this
consumer in a way that nova does not know about such a changed resource
need, then b) and c) will result in different placement state after
server delete.

I only know of one example, the change of a neutron port's resource
request while the port is attached. (Note, it is out of scope in the
first step of the bandwidth implementation.) In this specific example
option c) can work if nova re-reads the port's resource request during
delete when it recalculates its own view of the server's resource
needs. But I don't know if every other resource (e.g. accelerators)
used by a server can be / will be handled this way.

Other examples of affected lifecycle operations:

During a server migration, moving the source host allocation from the
instance_uuid to the migration_uuid fails with a consumer generation
conflict because of the instance_uuid consumer generation [2] (see the
sketch after this message for what this move looks like on the wire).

Confirming a migration fails when the deletion of the source host
allocation fails due to the consumer generation conflict on the
migration_uuid consumer that is being emptied. [3]

During scheduling of a new server, putting the allocation to
instance_uuid fails because the scheduler assumes that it is a new
consumer and therefore uses consumer_generation: None for the
allocation, but placement reports a generation conflict. [4]

During a non-forced evacuation, the scheduler tries to claim the
resources on the destination host with the instance_uuid, but that
consumer already holds the source allocation, therefore the scheduler
cannot assume that the instance_uuid is a new consumer. [4]

Cheers,
gibi

[1] https://review.openstack.org/#/c/591597
[2] https://review.openstack.org/#/c/591810
[3] https://review.openstack.org/#/c/591811
[4] https://review.openstack.org/#/c/583667
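[Editorial sketch] To make the conflict in [2] concrete, here is a
guess at what the allocation move looks like on the wire, assuming
placement's POST /allocations can update several consumers atomically
at microversion 1.28; the UUID placeholders and values are made up, and
this is not nova's report client code:

    # Move the source host allocation from the instance consumer to the
    # migration consumer in one atomic POST /allocations request.
    # Placement checks both consumer generations; if either is stale,
    # the whole request fails with 409, which is the conflict hit in [2].
    move_payload = {
        '<instance_uuid>': {
            'allocations': {},           # instance gives up the source host
            'consumer_generation': 2,    # as last read from placement
            'project_id': 'fake-project',
            'user_id': 'fake-user',
        },
        '<migration_uuid>': {
            'allocations': {
                '<source_rp_uuid>': {
                    'resources': {'VCPU': 2, 'MEMORY_MB': 4096},
                },
            },
            'consumer_generation': None,  # migration consumer is brand new
            'project_id': 'fake-project',
            'user_id': 'fake-user',
        },
    }
    # POSTed to {placement}/allocations with the 1.28 microversion header.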