Hi Sylvain,
Glad to know we are on the same page. I haven't updated the spec with
this proposal yet, in case I got more comments :). I will do so today.
Thanks,
Sundar
On 5/30/2018 12:34 AM, Sylvain Bauza wrote:
On Wed, May 30, 2018 at 1:33 AM, Nadathur, Sundar
<sundar.nadat...@intel.com> wrote:
Hi all,
The Cyborg/Nova scheduling spec [1] details what traits will be
applied to the resource providers that represent devices like
GPUs. Some of the traits referred to vendor names. I got feedback
that traits must not refer to products or specific models of
devices. I agree. However, we need some reference to device types
to enable matching the VM driver with the device.
TL;DR: We need some reference to device types, but we don't need
product names. I will update the spec [1] to clarify that. The rest
of this email explains why we need device types in traits, and what
traits we propose to include.
In general, an accelerator device is operated by two pieces of
software: a driver in the kernel (which may discover and handle
the PF for SR-IOV devices), and a driver/library in the guest
(which may handle the assigned VF).
The device assigned to the VM must match the driver/library
packaged in the VM. For this, the request must explicitly state
what category of devices it needs. For example, if the VM needs a
GPU, it needs to say whether it needs an AMD GPU or an Nvidia GPU,
since it may have the driver/libraries for that vendor alone. It
may also need to state what version of CUDA is needed, if it is an
Nvidia GPU. These aspects are necessarily vendor-specific.
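A minimal sketch of that matching, assuming the request carries a set
of required traits and a device's resource provider qualifies only if
it has all of them. The provider names and the CUSTOM_GPU_NVIDIA trait
are made up for illustration; this is not the scheduler's actual code.

```python
# Hypothetical inventory: resource provider name -> traits set on it.
providers = {
    "gpu_rp_0": {"CUSTOM_GPU_AMD"},
    "gpu_rp_1": {"CUSTOM_GPU_NVIDIA"},  # illustrative trait name
}


def candidates(required_traits, providers):
    """Return the providers that carry every required trait."""
    required = set(required_traits)
    # A provider matches only if the required set is a subset of its traits.
    return sorted(
        name for name, traits in providers.items() if required <= traits
    )


# A VM image that packages only the Nvidia driver/library asks for that
# vendor trait and never lands on the AMD device.
nvidia_only = candidates({"CUSTOM_GPU_NVIDIA"}, providers)
```

Here nvidia_only contains just "gpu_rp_1". A request with no required
traits would match both devices, which is exactly the mismatch the
vendor trait is meant to prevent.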
FWIW, the vGPU implementation for Nova has the same concern. We
want to provide traits to explicitly say "use this vGPU type", but
given it's related to a specific vendor, we can't just say "ask for
this frame buffer size, or just for the display heads", but rather "we
need a vGPU accepting a Quadro vDWS license".
Further, one driver/library version may handle multiple devices.
Since a new driver version may be backwards compatible, multiple
driver versions may manage the same device. The
development/release of the driver/library inside the VM should be
independent of the kernel driver for that device.
I agree.
For FPGAs, there is an additional twist as the VM may need
specific bitstream(s), and they match only specific device/region
types. The bitstream for a device from a vendor will not fit any
other device from the same vendor, let alone other vendors. IOW,
the region type is specific not just to a vendor but to a device
type within the vendor. So, it is essential to identify the device
type.
So, the proposed set of RCs and traits is as below. As we learn
more about actual usage by operators, we may need to evolve this set.
* There is a resource class per device category e.g.
CUSTOM_ACCELERATOR_GPU, CUSTOM_ACCELERATOR_FPGA.
* The resource provider that represents a device has the
following traits:
o Vendor/Category trait: e.g. CUSTOM_GPU_AMD,
CUSTOM_FPGA_XILINX.
o Device type trait which is a refinement of vendor/category
trait e.g. CUSTOM_FPGA_XILINX_VU9P.
NOTE: This is not a product or model, at least for FPGAs.
Multiple products may use the same FPGA chip.
NOTE: The reason for having both the vendor/category and
this one is that a flavor may ask for either, depending on
the granularity desired. IOW, if one driver can handle all
devices from a vendor (*eye roll*), the flavor can ask for
the vendor/category trait alone. If there are separate
drivers for different device families from the same
vendor, the flavor must specify the trait for the device
family.
NOTE: The equivalent trait for GPUs may be like
CUSTOM_GPU_NVIDIA_P90, but I'll let others decide if that
is a product or not.
I was about to propose the same for vGPUs in Nova, i.e. using custom
traits. The only concern is that we need operators to set the traits
directly using osc-placement instead of having Nova magically provide
those traits. But anyway, given operators need to set the vGPU types
they want, I think it's acceptable.
o For FPGAs, we have additional traits:
+ Functionality trait: e.g. CUSTOM_FPGA_COMPUTE,
CUSTOM_FPGA_NETWORK, CUSTOM_FPGA_STORAGE
+ Region type ID. e.g. CUSTOM_FPGA_INTEL_REGION_<uuid>.
+ Optionally, a function ID, indicating what function is
currently programmed in the region RP. e.g.
CUSTOM_FPGA_INTEL_FUNCTION_<uuid>. Not all
implementations may provide it. The function trait may
change on reprogramming, but it is not expected to be
frequent.
+ Possibly, CUSTOM_PROGRAMMABLE as a separate trait.
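To make the naming concrete, here is a rough sketch of how such trait
names and a matching flavor could be put together. It assumes custom
trait names must be CUSTOM_ followed by upper-case letters, digits and
underscores (at most 255 characters), so the hyphens in a UUID are
replaced, and it assumes Nova's "resources:<class>=<n>" and
"trait:<name>=required" flavor extra specs. The helper names and the
sample UUID are hypothetical.

```python
import re

# Assumed naming rule for custom traits.
_TRAIT_RE = re.compile(r"^CUSTOM_[A-Z0-9_]+$")


def custom_trait(*parts):
    """Join parts into a custom trait name, normalizing case and hyphens."""
    raw = "_".join(str(p) for p in parts).upper()
    # Collapse every run of characters outside A-Z/0-9 into one underscore.
    name = "CUSTOM_" + re.sub(r"[^A-Z0-9]+", "_", raw).strip("_")
    if len(name) > 255 or not _TRAIT_RE.match(name):
        raise ValueError("invalid custom trait name: %r" % name)
    return name


def flavor_extra_specs(resource_class, traits, amount=1):
    """Build flavor extra specs asking for one device with required traits."""
    specs = {"resources:%s" % resource_class: str(amount)}
    for trait in traits:
        specs["trait:%s" % trait] = "required"
    return specs


# Made-up region type ID, for illustration only.
region_trait = custom_trait(
    "FPGA", "INTEL", "REGION", "9d17b5c2-0c4e-4f0a-8a5b-2f1f0e6d3c7a"
)

# A flavor that needs one FPGA of a specific device type.
specs = flavor_extra_specs(
    "CUSTOM_ACCELERATOR_FPGA", ["CUSTOM_FPGA_XILINX_VU9P"]
)
```

A flavor that can work with any device from the vendor would pass
CUSTOM_FPGA_XILINX instead of the device type trait.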
[1] https://review.openstack.org/#/c/554717/
I'll try to review the spec as soon as I can.
-Sylvain
Thanks.
Regards,
Sundar
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev