Thanks, Eric. It looks like there are no good solutions even as candidates, only options with varying levels of unacceptability. It is funny that the option considered least unacceptable is to let the problem happen and then fail the request (the last one in your list).

Could I ask what the objection is to the scheme that applies multiple traits and removes one as needed, apart from the fact that it has races?

Regards,
Sundar

On 3/28/2018 11:48 AM, Eric Fried wrote:
Sundar-

        We're running across this issue in several places right now.   One
thing that's definitely not going to get traction is
automatically/implicitly tweaking inventory in one resource class when
an allocation is made on a different resource class (whether in the same
or different RPs).

        Slightly less of a nonstarter, but still likely to get significant
push-back, is the idea of tweaking traits on the fly.  For example, your
vGPU case might be modeled as:

PGPU_RP: {
   inventory: {
       CUSTOM_VGPU_TYPE_A: 2,
       CUSTOM_VGPU_TYPE_B: 4,
   }
   traits: [
       CUSTOM_VGPU_TYPE_A_CAPABLE,
       CUSTOM_VGPU_TYPE_B_CAPABLE,
   ]
}

        The request would come in for
resources=CUSTOM_VGPU_TYPE_A:1&required=CUSTOM_VGPU_TYPE_A_CAPABLE, resulting
in an allocation of CUSTOM_VGPU_TYPE_A:1.  Now while you're processing
that, you would *remove* CUSTOM_VGPU_TYPE_B_CAPABLE from the PGPU_RP.
So it doesn't matter that there's still inventory of
CUSTOM_VGPU_TYPE_B:4, because a request including
required=CUSTOM_VGPU_TYPE_B_CAPABLE won't be satisfied by this RP.
There's of course a window between when the initial allocation is made
and when you tweak the trait list.  If a competing request lands in
that window, you'll just have to fail the loser.  This would be like
any other failure in e.g. the spawn process: it would bubble up, the
allocation would be removed, and retries might happen, or whatever.
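
To make the trait-tweaking step concrete, something like the following
could do the removal against the Placement REST API
(PUT /resource_providers/{uuid}/traits, which needs a recent-enough
microversion; 1.6 here).  This is only a sketch: the endpoint, token
handling, and helper name are placeholders, and the generation check
only guards the trait update itself -- the allocation-vs-trait window
above still has to be handled by failing the loser.

import requests

PLACEMENT_URL = "http://placement.example.com"  # placeholder endpoint
HEADERS = {
    "X-Auth-Token": "<token>",                  # placeholder token
    "OpenStack-API-Version": "placement 1.6",   # traits API needs >= 1.6
}

def drop_trait(rp_uuid, trait_to_drop):
    """Remove one trait from a resource provider, generation-checked."""
    url = "%s/resource_providers/%s/traits" % (PLACEMENT_URL, rp_uuid)
    current = requests.get(url, headers=HEADERS).json()
    resp = requests.put(url, headers=HEADERS, json={
        # Placement answers 409 if the provider's generation changed
        # under us (e.g. a concurrent inventory/trait update); retry or
        # bail as appropriate.
        "resource_provider_generation":
            current["resource_provider_generation"],
        "traits": [t for t in current["traits"] if t != trait_to_drop],
    })
    return resp.status_code == 200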

        Like I said, you're likely to get a lot of resistance to this idea as
well.  (Though TBH, I'm not sure how we can stop you beyond -1'ing your
patches; there's nothing about placement that disallows it.)

        The simple-but-inefficient solution is that we'd still be able to make
allocations for vGPU type B, but you would have to fail right away when
it comes down to cyborg to attach the resource.  Which is code you
pretty much have to write anyway.  It's an improvement if cyborg gets
to be involved in the post-get-allocation-candidates weighing/filtering
step, because you can do that check at that point to help filter out
the candidates that would fail.  Of course there's still a race
condition there, but it's no different than for any other resource.
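
As a very rough illustration of that weighing/filtering step (the
candidate shape and the can_attach() check are made-up names, not an
existing cyborg API):

def filter_candidates(candidates, requested_function, can_attach):
    """Keep only candidates whose provider can actually attach the function.

    `candidates` is assumed to be a list of (rp_uuid, allocation_request)
    pairs and `can_attach(rp_uuid, function)` an assumed cyborg-side check.
    """
    # This narrows the race but doesn't close it: a candidate can still
    # go stale between this check and the attach on the compute node.
    return [(rp, alloc) for rp, alloc in candidates
            if can_attach(rp, requested_function)]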

efried

On 03/28/2018 12:27 PM, Nadathur, Sundar wrote:
Hi Eric and all,
     I should have clarified that this race condition happens only for
the case of devices with multiple functions. There is a prior thread
<http://lists.openstack.org/pipermail/openstack-dev/2018-March/127882.html>
about it. I was trying to get a solution within Cyborg, but that faces
this race condition as well.

IIUC, this situation is somewhat similar to the issue with vGPU types
<http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-03-27.log.html#t2018-03-27T13:41:00>
(thanks to Alex Xu for pointing this out). In the latter case, we could
start with an inventory of (vgpu-type-a: 2; vgpu-type-b: 4).  But, after
consuming a unit of vgpu-type-a, ideally the inventory should change
to: (vgpu-type-a: 1; vgpu-type-b: 0). With multi-function accelerators,
we start with an RP inventory of (region-type-A: 1, function-X: 4). But,
after consuming a unit of that function, ideally the inventory should
change to: (region-type-A: 0, function-X: 3).

I understand that this approach is controversial :) Also, one difference
from the vGPU case is that the set of vGPU types and their counts are
static, whereas an FPGA could be reprogrammed to yield more or fewer
functions. That said, we could hopefully keep this analogy in mind for
future discussions.

We probably will not support multi-function accelerators in Rocky. This
discussion is for the longer term.

Regards,
Sundar

On 3/23/2018 12:44 PM, Eric Fried wrote:
Sundar-

        First thought is to simplify by NOT keeping inventory information in
the cyborg db at all.  The provider record in the placement service
already knows the device (the provider ID, which you can look up in the
cyborg db), the host (the root_provider_uuid of the provider
representing the device), and the inventory, and (I hope) you'll be
augmenting it with traits indicating what functions it's capable of.
That way, you'll always get allocation candidates with devices that
*can* load the desired function; now you just have to engage your
weigher to prioritize the ones that already have it loaded, so you can
prefer those.
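
A small sketch of that weighing step, deliberately kept independent of
Nova's weigher plugin interface (the candidate list and the
already_loaded() lookup are assumptions, not existing APIs):

def weigh_candidates(candidates, function, already_loaded):
    """Sort candidate providers best-first.

    Placement has already limited `candidates` to providers *capable*
    of the function (via traits); `already_loaded(rp_uuid, function)`
    is an assumed cyborg-side lookup saying whether it is loaded now.
    """
    # Prefer providers that don't need a reprogram before they can
    # serve the request.
    return sorted(candidates,
                  key=lambda rp: already_loaded(rp, function),
                  reverse=True)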

        Am I missing something?

                efried

On 03/22/2018 11:27 PM, Nadathur, Sundar wrote:
Hi all,
     There seems to be a possibility of a race condition in the
Cyborg/Nova flow. Apologies for missing this earlier. (You can refer to
the proposed Cyborg/Nova spec
<https://review.openstack.org/#/c/554717/1/doc/specs/rocky/cyborg-nova-sched.rst>
for details.)

Consider the scenario where the flavor specifies a resource class for a
device type, and also specifies a function (e.g. encrypt) in the extra
specs. The Nova scheduler would only track the device type as a
resource, and Cyborg needs to track the availability of functions.
Further, to keep it simple, say all the functions exist all the time (no
reprogramming involved).

To recap, here is the scheduler flow for this case:

   * A request spec with a flavor comes to Nova conductor/scheduler. The
     flavor has a device type as a resource class, and a function in the
     extra specs.
   * Placement API returns the list of RPs (compute nodes) which contain
     the requested device types (but not necessarily the function).
   * Cyborg will provide a custom filter which queries the Cyborg DB. This
     needs to check which hosts contain the needed function and filter
     out the rest (a sketch of such a filter follows this list).
   * The scheduler selects one node from the filtered list, and the
     request goes to the compute node.
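
To make the filter step concrete, here is a minimal sketch, written as
plain functions rather than against Nova's filter plugin interface (the
get_free_units() lookup is a made-up name for the Cyborg DB query):

def host_passes(host, requested_function, get_free_units):
    """Pass a host only if the Cyborg DB still shows a free unit of the
    requested function on it."""
    return get_free_units(host, requested_function) > 0

def filter_hosts(hosts, requested_function, get_free_units):
    # This read is exactly what races: two requests can both see one
    # free unit before either decrement reaches the controller.
    return [h for h in hosts
            if host_passes(h, requested_function, get_free_units)]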

For the filter to work, the Cyborg DB needs to maintain a table with
triples of (host, function type, #free units). The filter checks if a
given host has one or more free units of the requested function type.
But, to keep the # free units up to date, Cyborg on the selected compute
node needs to notify the Cyborg API to decrement the #free units when an
instance is spawned, and to increment them when resources are released.
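
For concreteness, a toy version of that table and the
decrement/increment updates, using an in-memory sqlite database (the
schema and function names are made up for illustration):

import sqlite3

# Stand-in for the Cyborg DB table of (host, function type, #free units).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE function_units (
                  host TEXT, function TEXT, free_units INTEGER,
                  PRIMARY KEY (host, function))""")
db.execute("INSERT INTO function_units VALUES ('host1', 'encrypt', 1)")

def decrement(host, function):
    """Called when the selected compute node reports an instance spawn."""
    cur = db.execute(
        "UPDATE function_units SET free_units = free_units - 1 "
        "WHERE host = ? AND function = ? AND free_units > 0",
        (host, function))
    db.commit()
    return cur.rowcount == 1  # False if there was nothing left to consume

def increment(host, function):
    """Called when the resource is released."""
    db.execute(
        "UPDATE function_units SET free_units = free_units + 1 "
        "WHERE host = ? AND function = ?", (host, function))
    db.commit()

# Each UPDATE is atomic, but the filter's read happens much earlier in
# the scheduling flow, so two requests can still be approved against the
# same last unit -- which is the race described next.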

Therein lies the catch: this loop from the compute node to controller is
susceptible to race conditions. For example, if two simultaneous
requests each ask for function A, and there is only one unit of that
available, the Cyborg filter will approve both, both may land on the
same host, and one will fail. This is because Cyborg on the controller
does not decrement the resource usage from one request before it
processes the next request.

This is similar to this previous Nova scheduling issue
<https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/placement-claims.html>.
That was solved by having the scheduler claim a resource in Placement
for the selected node. I don't see an analog for Cyborg, since it would
not know which node is selected.

Thanks in advance for suggestions and solutions.

Regards,
Sundar







