On Fri, Feb 25, 2022 at 12:04 PM Gorka Eguileor <gegui...@redhat.com> wrote:
>
> On 24/02, Nir Soffer wrote:
> > On Thu, Feb 24, 2022 at 8:46 PM Gorka Eguileor <gegui...@redhat.com> wrote:
> > >
> > > On 24/02, Nir Soffer wrote:
> > > > On Thu, Feb 24, 2022 at 6:35 PM Muli Ben-Yehuda <m...@lightbitslabs.com> wrote:
> > > > >
> > > > > On Thu, Feb 24, 2022 at 6:28 PM Nir Soffer <nsof...@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Feb 24, 2022 at 6:10 PM Muli Ben-Yehuda <m...@lightbitslabs.com> wrote:
> > > > > > >
> > > > > > > On Thu, Feb 24, 2022 at 3:58 PM Nir Soffer <nsof...@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Feb 23, 2022 at 6:24 PM Muli Ben-Yehuda <m...@lightbitslabs.com> wrote:
> > > > > > > > >
> > > > > > > > > Thanks for the detailed instructions, Nir. I'm going to scrounge up some hardware.
> > > > > > > > > By the way, if anyone else would like to work on NVMe/TCP support, for an NVMe/TCP target you can either use Lightbits (talk to me offline for details) or use the upstream Linux NVMe/TCP target. Lightbits is a clustered storage system while upstream is a single target, but the client side should be close enough for vdsm/oVirt purposes.
> > > > > > > >
> > > > > > > > I played with NVMe/TCP a little bit, using qemu to create a virtual NVMe disk, export it using the kernel on one VM, and consume it on another VM.
> > > > > > > > https://futurewei-cloud.github.io/ARM-Datacenter/qemu/nvme-of-tcp-vms/
> > > > > > > >
> > > > > > > > One question about device naming - do we always get the same name for the device on all hosts?
> > > > > > >
> > > > > > > No, we do not, see below how we handle migration in os_brick.
> > > > > > >
> > > > > > > > To support VM migration, every device must have a unique name in the cluster. With multipath we always have a unique name, since we disable "friendly names", so we always have:
> > > > > > > >
> > > > > > > > /dev/mapper/{wwid}
> > > > > > > >
> > > > > > > > With rbd we also do not use /dev/rbdN but a unique path:
> > > > > > > >
> > > > > > > > /dev/rbd/poolname/volume-vol-id
> > > > > > > >
> > > > > > > > How do we ensure a cluster-unique device path? If os_brick does not handle it, we can do it in oVirt, for example:
> > > > > > > >
> > > > > > > > /run/vdsm/managedvolumes/{uuid} -> /dev/nvme7n42
> > > > > > > >
> > > > > > > > but I think this should be handled in cinderlib, since OpenStack has the same problem with migration.
> > > > > > >
> > > > > > > Indeed. Both the Lightbits LightOS connector and the nvmeof connector do this through the target-provided namespace (LUN) UUID. After connecting to the target, the connectors wait for the local friendly-named device file that has the right UUID to show up, and then return the friendly name. So different hosts will have different friendly names, but the VMs will be attached to the right namespace since we return the friendly name on the current host that has the right UUID. Does this also work for you?
> > > > > >
> > > > > > It will not work for oVirt.
> > > > > >
> > > > > > Migration in oVirt works like this:
> > > > > >
> > > > > > 1. Attach disks to the destination host
> > > > > > 2. Send the VM XML from the source host to the destination host, and start the VM in paused mode
> > > > > > 3. Start the migration on the source host
> > > > > > 4. When migration is done, start the CPU on the destination host
> > > > > > 5. Detach the disks from the source
> > > > > >
> > > > > > This will break in step 2, since the source XML refers to an nvme device that does not exist or is already used by another VM.
> > > > >
> > > > > Indeed.
> > > > >
> > > > > > To make this work, the VM XML must use the same path, existing on both hosts.
> > > > > >
> > > > > > The issue can be solved by a libvirt hook updating the paths before qemu is started on the destination, but I think the right way to handle this is to have the same path.
> > > > >
> > > > > You mentioned above that it can be handled in oVirt (cf. /run/vdsm/managedvolumes/{uuid} -> /dev/nvme7n42), which seems like a reasonable approach given the constraint imposed by the oVirt migration flow you outlined above. What information does vdsm need to create and use the /var/run/vdsm/managedvolumes/{uuid} link? Today the connector does (trimmed for brevity):
> > > > >
> > > > > def connect_volume(self, connection_properties):
> > > > >     device_info = {'type': 'block'}
> > > > >     uuid = connection_properties['uuid']
> > > > >     device_path = self._get_device_by_uuid(uuid)
> > > > >     device_info['path'] = device_path
> > > > >     return device_info
> > > >
> > > > I think we have 2 options:
> > > >
> > > > 1. unique path created by os_brick using the underlying uuid
> > > >
> > > > In this case the connector will return the uuid, and oVirt will use it to resolve the unique path that will be stored and used on the engine side to create the VM XML.
> > > >
> > > > I'm not sure how the connector should return this uuid. Looking in the current vdsm code:
> > > >
> > > > if vol_type in ("iscsi", "fibre_channel"):
> > > >     if "multipath_id" not in attachment:
> > > >         raise se.ManagedVolumeUnsupportedDevice(vol_id, attachment)
> > > >     # /dev/mapper/xxxyyy
> > > >     return os.path.join(DEV_MAPPER, attachment["multipath_id"])
> > > > elif vol_type == "rbd":
> > > >     # /dev/rbd/poolname/volume-vol-id
> > > >     return os.path.join(DEV_RBD, connection_info['data']['name'])
> > > >
> > > > os_brick does not have a uniform way to address different devices.
> > > >
> > > > Maybe Gorka can help with this.
> > >
> > > Hi,
> > >
> > > That is true, because in OpenStack we haven't had the need to have the same path on every host, or even on the same host across different connections.
> > >
> > > For nvme a new `elif` clause could be added there, though it will be a bit trickier, because the nvme connection properties format is a bit of a mess...
> > >
> > > We have 2 different formats for the nvme properties, and the wwid that appears in the /dev/disk/by-id/nvme-<wwid> symlink may or may not be the volume id: it may be the uuid from the connection info if present, or the nguid if the nvme device doesn't have a uuid.
> > >
> > > For these reasons I would recommend not relying on the connection information, and relying on the path from the attachment instead.
> > >
> > > Something like this should probably be fine:
> > >
> > > elif vol_type == 'nvme':
> > >     device_name = os.path.basename(attachment['path'])
> > >     controller = device_name.rsplit('n', 1)[0]
> > >     wwid_filename = f'/sys/class/nvme/{controller}/{device_name}/wwid'
> > >     with open(wwid_filename, 'r') as f:
> > >         uuid = f.read().strip()
> > >     return '/dev/disk/by-id/nvme-' + uuid
> >
> > Thanks Gorka!
> >
> > But isn't this duplicating logic already in os-brick?
> > https://github.com/openstack/os-brick/blob/56bf0272b55dcbbc7f5b03150973a80af1407f4f/os_brick/initiator/connectors/lightos.py#L193
>
> Hi Nir,
>
> Oh! I thought we were talking about the generic NVMe-oF connector; I didn't know this was specifically about the LightOS one.
>
> The link is used as an easy way to locate the volume; it doesn't mean that it is returned to the caller of the `connect_volume` method. In fact, we can see how that method actually returns the real path and not the link's path:
>
> def _check_device_exists_using_dev_lnk(self, uuid):
>     lnk_path = f"/dev/disk/by-id/nvme-uuid.{uuid}"
> --> if os.path.exists(lnk_path):
>     ^^^ Check link exists
>
> -->     devname = os.path.realpath(lnk_path)
>         ^^^ Get the real path for the symlink
>
> -->     if devname.startswith("/dev/nvme"):
>         ^^^ Make extra sure it's not pointing to something crazy
>
>             LOG.info("LIGHTOS: devpath %s detected for uuid %s",
>                      devname, uuid)
>
> -->         return devname
>             ^^^ Return it
>
>     return None
>
> > Another interesting detail is this wait:
> > https://github.com/openstack/os-brick/blob/56bf0272b55dcbbc7f5b03150973a80af1407f4f/os_brick/initiator/connectors/lightos.py#L228
> >
> > def _get_device_by_uuid(self, uuid):
> >     endtime = time.time() + self.WAIT_DEVICE_TIMEOUT
> >     while time.time() < endtime:
> >         try:
> >             device = self._check_device_exists_using_dev_lnk(uuid)
> >             if device:
> >                 return device
> >         except Exception as e:
> >             LOG.debug(f'LIGHTOS: {e}')
> >         device = self._check_device_exists_reading_block_class(uuid)
> >         if device:
> >             return device
> >
> >         time.sleep(1)
> >     return None
> >
> > The code does not explain why it tries to use the /dev/disk/by-id link and falls back to sysfs on errors. Based on our experience with udev, I guess that the author does not trust udev. I wonder if we can trust it as the stable device path.
>
> In my experience udev rules (which are different from udev itself) are less than reliable as a way of finding devices when working "in the wild". They are only reliable if you have full control over the host system and are sure nobody (admin or distro) can break things.
>
> For reference, at Red Hat we have an RFE to improve os-brick [1] and stop using symlinks altogether.
>
> While they are not 100% reliable in the wild, they are quite reliable once they are working on a specific system, which means that if we confirm they are working on a system we can rely on them as long as no changes are made to the system (and the CPU is not at 100% during attachment).
>
> [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1697319
>
> > If we can trust this path, maybe os_brick can return the stable path in a uniform way for all kinds of devices?
>
> I don't think this is likely to happen, because it has no real value for OpenStack, so it's unlikely to get prioritized (for coding and reviews).
Since we cannot get a stable path from os-brick, and a stable path is an oVirt-specific requirement, we need to handle this in oVirt, similar to the way we already handle multipath, rbd, and traditional storage. A rough sketch of what this could look like on the vdsm side follows below.

Nir
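For illustration only, here is a minimal sketch of the kind of helper vdsm could use to build that stable link from the attachment returned by os_brick. The module layout, function names, and the atomic-replace detail are hypothetical (this is not existing vdsm code); only the /run/vdsm/managedvolumes/{uuid} layout comes from the discussion above:

# Hypothetical sketch, not actual vdsm code. Assumes the os-brick
# attachment dict contains the real device path (e.g. /dev/nvme7n42)
# and that vol_id is the volume UUID the engine already knows.

import os

MANAGED_VOLUMES_DIR = "/run/vdsm/managedvolumes"  # assumed location


def create_stable_path(vol_id, attachment):
    """Create /run/vdsm/managedvolumes/{vol_id} -> real device path.

    The engine stores only the stable link in the VM XML, so the same
    path resolves on the source and destination hosts during migration,
    even if the kernel assigns different /dev/nvmeXnY names on each host.
    """
    stable_path = os.path.join(MANAGED_VOLUMES_DIR, vol_id)
    device_path = os.path.realpath(attachment["path"])

    os.makedirs(MANAGED_VOLUMES_DIR, exist_ok=True)
    # Replace any stale link atomically: create a temporary link and rename.
    tmp_path = stable_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.unlink(tmp_path)
    os.symlink(device_path, tmp_path)
    os.replace(tmp_path, stable_path)
    return stable_path


def remove_stable_path(vol_id):
    """Remove the stable link when the volume is detached from this host."""
    stable_path = os.path.join(MANAGED_VOLUMES_DIR, vol_id)
    if os.path.lexists(stable_path):
        os.unlink(stable_path)

With something like this, the flow would match what we already do for multipath and rbd: vdsm resolves the host-specific device after connect_volume() and exposes it under a cluster-unique path that the engine can safely put in the VM XML on both hosts.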