On Thu, Feb 24, 2022 at 8:46 PM Gorka Eguileor <gegui...@redhat.com> wrote:
>
> On 24/02, Nir Soffer wrote:
> > On Thu, Feb 24, 2022 at 6:35 PM Muli Ben-Yehuda <m...@lightbitslabs.com> wrote:
> > >
> > > On Thu, Feb 24, 2022 at 6:28 PM Nir Soffer <nsof...@redhat.com> wrote:
> > >>
> > >> On Thu, Feb 24, 2022 at 6:10 PM Muli Ben-Yehuda <m...@lightbitslabs.com> wrote:
> > >> >
> > >> > On Thu, Feb 24, 2022 at 3:58 PM Nir Soffer <nsof...@redhat.com> wrote:
> > >> >>
> > >> >> On Wed, Feb 23, 2022 at 6:24 PM Muli Ben-Yehuda <m...@lightbitslabs.com> wrote:
> > >> >> >
> > >> >> > Thanks for the detailed instructions, Nir. I'm going to scrounge up some hardware.
> > >> >> > By the way, if anyone else would like to work on NVMe/TCP support, for an
> > >> >> > NVMe/TCP target you can either use Lightbits (talk to me offline for details)
> > >> >> > or use the upstream Linux NVMe/TCP target. Lightbits is a clustered storage
> > >> >> > system while upstream is a single target, but the client side should be close
> > >> >> > enough for vdsm/oVirt purposes.
> > >> >>
> > >> >> I played with NVMe/TCP a little bit, using qemu to create a virtual NVMe disk,
> > >> >> exporting it with the kernel target on one VM and consuming it on another VM:
> > >> >> https://futurewei-cloud.github.io/ARM-Datacenter/qemu/nvme-of-tcp-vms/
> > >> >>
> > >> >> One question about device naming - do we always get the same device name on
> > >> >> all hosts?
> > >> >
> > >> > No, we do not, see below how we handle migration in os_brick.
> > >> >
> > >> >> To support VM migration, every device must have a unique name in the cluster.
> > >> >> With multipath we always have a unique name, since we disable "friendly names",
> > >> >> so we always have:
> > >> >>
> > >> >>     /dev/mapper/{wwid}
> > >> >>
> > >> >> With rbd we also do not use /dev/rbdN but a unique path:
> > >> >>
> > >> >>     /dev/rbd/poolname/volume-vol-id
> > >> >>
> > >> >> How do we ensure a cluster-unique device path? If os_brick does not handle it,
> > >> >> we can do it in oVirt, for example:
> > >> >>
> > >> >>     /run/vdsm/managedvolumes/{uuid} -> /dev/nvme7n42
> > >> >>
> > >> >> but I think this should be handled in cinderlib, since OpenStack has the same
> > >> >> problem with migration.
> > >> >
> > >> > Indeed. Both the Lightbits LightOS connector and the nvmeof connector do this
> > >> > through the target-provided namespace (LUN) UUID. After connecting to the
> > >> > target, the connectors wait for the local friendly-named device file that has
> > >> > the right UUID to show up, and then return the friendly name. So different
> > >> > hosts will have different friendly names, but the VMs will be attached to the
> > >> > right namespace, since we return the friendly name on the current host that
> > >> > has the right UUID. Does this also work for you?
> > >>
> > >> It will not work for oVirt.
> > >>
> > >> Migration in oVirt works like this:
> > >>
> > >> 1. Attach disks to the destination host
> > >> 2. Send the VM XML from the source host to the destination host, and start
> > >>    the VM in paused mode
> > >> 3. Start the migration on the source host
> > >> 4. When migration is done, start the CPU on the destination host
> > >> 5. Detach the disks from the source
> > >>
> > >> This will break in step 2, since the source XML refers to an nvme device that
> > >> does not exist on the destination, or is already used by another VM.
> > >
> > > Indeed.
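
(A minimal illustration of the naming problem, with made-up device names; it
assumes the udev-created /dev/disk/by-id/nvme-<uuid> symlink that comes up
later in this thread:)

    import os

    def local_nvme_node(namespace_uuid):
        # The by-id name derived from the namespace UUID is the same on every
        # host, but it resolves to a different local node on each one, e.g.
        # /dev/nvme7n42 on the source and /dev/nvme2n5 on the destination, so
        # a local node name baked into the VM XML cannot migrate.
        return os.path.realpath(f"/dev/disk/by-id/nvme-{namespace_uuid}")
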
> > >
> > >> To make this work, the VM XML must use the same path, existing on both hosts.
> > >>
> > >> The issue can be solved by a libvirt hook updating the paths before qemu is
> > >> started on the destination, but I think the right way to handle this is to
> > >> have the same path.
> > >
> > > You mentioned above that it can be handled in oVirt (cf.
> > > /run/vdsm/managedvolumes/{uuid} -> /dev/nvme7n42), which seems like a
> > > reasonable approach given the constraint imposed by the oVirt migration flow
> > > you outlined above. What information does vdsm need to create and use the
> > > /var/run/vdsm/managedvolumes/{uuid} link? Today the connector does (trimmed
> > > for brevity):
> > >
> > >     def connect_volume(self, connection_properties):
> > >         device_info = {'type': 'block'}
> > >         uuid = connection_properties['uuid']
> > >         device_path = self._get_device_by_uuid(uuid)
> > >         device_info['path'] = device_path
> > >         return device_info
> >
> > I think we have 2 options:
> >
> > 1. A unique path created by os_brick using the underlying uuid
> >
> > In this case the connector will return the uuid, and oVirt will use it to
> > resolve the unique path that will be stored and used on the engine side to
> > create the VM XML.
> >
> > I'm not sure how the connector should return this uuid. Looking at current
> > vdsm code:
> >
> >     if vol_type in ("iscsi", "fibre_channel"):
> >         if "multipath_id" not in attachment:
> >             raise se.ManagedVolumeUnsupportedDevice(vol_id, attachment)
> >         # /dev/mapper/xxxyyy
> >         return os.path.join(DEV_MAPPER, attachment["multipath_id"])
> >     elif vol_type == "rbd":
> >         # /dev/rbd/poolname/volume-vol-id
> >         return os.path.join(DEV_RBD, connection_info['data']['name'])
> >
> > os_brick does not have a uniform way to address different devices.
> >
> > Maybe Gorka can help with this.
>
> Hi,
>
> That is true, because in OpenStack we haven't had the need to have the same
> path on every host, or even on the same host during different connections.
>
> For nvme a new `elif` clause could be added there, though it will be a bit
> trickier, because the nvme connection properties format is a bit of a mess...
>
> We have 2 different formats for the nvme properties, and the wwid that appears
> in the symlink /dev/disk/by-id/nvme-<wwid> may or may not be the volume id: it
> may be the uuid in the connection info if present, or the nguid if the nvme
> device doesn't have a uuid.
>
> For these reasons I would recommend not relying on the connection information
> and relying on the path from the attachment instead.
>
> Something like this should probably be fine:
>
>     elif vol_type == 'nvme':
>         device_name = os.path.basename(attachment['path'])
>         controller = device_name.rsplit('n', 1)[0]
>         wwid_filename = f'/sys/class/nvme/{controller}/{device_name}/wwid'
>         with open(wwid_filename, 'r') as f:
>             uuid = f.read().strip()
>         return '/dev/disk/by-id/nvme-' + uuid
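
(A self-contained sketch of that suggestion, with an extra check that the by-id
link really resolves back to the attached device, since the wwid may be a uuid
or an nguid; the helper name is made up, and attachment['path'] follows the
snippets quoted above:)

    import os

    def nvme_stable_path(attachment):
        device = os.path.realpath(attachment['path'])      # e.g. /dev/nvme0n1
        device_name = os.path.basename(device)
        controller = device_name.rsplit('n', 1)[0]         # e.g. nvme0
        with open(f'/sys/class/nvme/{controller}/{device_name}/wwid') as f:
            wwid = f.read().strip()
        link = f'/dev/disk/by-id/nvme-{wwid}'
        # Do not trust a stale or mismatched udev link.
        if os.path.realpath(link) != device:
            raise RuntimeError(f'{link} does not resolve to {device}')
        return link

vdsm could then store this path in the attachment info, so the engine writes
the same path into the VM XML on both the source and the destination host.
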
Thanks Gorka! But isn't this duplicating logic already in os_brick?
https://github.com/openstack/os-brick/blob/56bf0272b55dcbbc7f5b03150973a80af1407f4f/os_brick/initiator/connectors/lightos.py#L193

Another interesting detail is this wait:
https://github.com/openstack/os-brick/blob/56bf0272b55dcbbc7f5b03150973a80af1407f4f/os_brick/initiator/connectors/lightos.py#L228

    def _get_device_by_uuid(self, uuid):
        endtime = time.time() + self.WAIT_DEVICE_TIMEOUT
        while time.time() < endtime:
            try:
                device = self._check_device_exists_using_dev_lnk(uuid)
                if device:
                    return device
            except Exception as e:
                LOG.debug(f'LIGHTOS: {e}')
            device = self._check_device_exists_reading_block_class(uuid)
            if device:
                return device
            time.sleep(1)
        return None

The code does not explain why it tries to use the /dev/disk/by-id link and
falls back to sysfs on errors. Based on our experience with udev, I guess that
the author does not trust udev. I wonder if we can trust it as the stable
device path.

If we can trust this path, maybe os_brick can return the stable path in a
uniform way for all kinds of devices?

> Cheers,
> Gorka.
>
> > 2. A unique path created by oVirt
> >
> > In this case oVirt will use the disk uuid already used in the
> > ManagedVolume.{attach,detach}_volume APIs:
> > https://github.com/oVirt/vdsm/blob/500c035903dd35180d71c97791e0ce4356fb77ad/lib/vdsm/api/vdsm-api.yml#L9734
> > https://github.com/oVirt/vdsm/blob/500c035903dd35180d71c97791e0ce4356fb77ad/lib/vdsm/api/vdsm-api.yml#L9749
> >
> > From oVirt's point of view, using the disk uuid seems better. It makes it
> > easy to debug when you can follow the uuid in all logs on different systems
> > and locate the actual disk using the same uuid.
> >
> > Nir
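
(A rough sketch of what option 2 quoted above could look like on the vdsm
side; the directory comes from the earlier example in this thread, but the
helper is illustrative, not an existing vdsm API, and the attachment path is
whatever os_brick returned:)

    import os

    MANAGED_VOLUMES_DIR = '/run/vdsm/managedvolumes'  # assumed location

    def create_stable_path(vol_id, attachment_path):
        # Publish a cluster-stable path keyed by the disk uuid that
        # ManagedVolume.attach_volume already receives, pointing at the
        # host-local device (e.g. /dev/nvme7n42).
        os.makedirs(MANAGED_VOLUMES_DIR, exist_ok=True)
        link = os.path.join(MANAGED_VOLUMES_DIR, vol_id)
        tmp = link + '.tmp'
        os.symlink(attachment_path, tmp)
        os.replace(tmp, link)  # atomic, safe if the link already exists
        return link

The engine would then put /run/vdsm/managedvolumes/{vol_id} in the VM XML, and
the same path would be valid on every host that attached the volume.
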