On Tue, Jul 24, 2018 at 5:51 AM, Nir Soffer <nsof...@redhat.com> wrote:
> On Mon, Jul 23, 2018 at 9:35 PM Ryan Bullock <rrb3...@gmail.com> wrote:
>
>> Hello All,
>>
>> We recently stood up a new oVirt install backed by an iSCSI SAN and it
>> has been working great, but there are a few quirks I am trying to iron
>> out.
>>
>> We have run into an issue where, when we fail over our SAN (for
>> maintenance or otherwise), any VM with a direct LUN gets paused and
>> doesn't resume. VMs without a direct LUN never paused.
>
> I guess the other VMs did get paused, but they were resumed
> automatically by the system, so from your point of view they did not
> "pause".
>
> You can check the vdsm log to see if the other VMs did pause and resume.
> I'm not sure the engine UI reports all pause and resume events.

Ah, OK. That would make sense. I had checked the events via the UI and it
didn't show any pauses, but I had not checked the actual vdsm logs on the
hosts. Unfortunately my logs for that period have rolled off. I noticed
this behaviour during our first firmware upgrade on our SAN about a month
ago. Since VM leases allowed us to maintain HA, I just put it on my list
of things to follow up on. Going forward I will make sure to double-check
the vdsm logs to see what is happening in the background.

>> Digging through posts on this list and reading some bug reports, it
>> seems like this is a known quirk with how oVirt handles direct LUNs (it
>> doesn't monitor the LUNs, so it won't resume the VM).
>
> Right.
>
> Can you file a bug for supporting this?
>
> Vdsm does monitor multipath events for all LUNs, but they are used only
> for reporting purposes, see:
> https://ovirt.org/develop/release-management/features/storage/multipath-events/
>
> We could use the events for resuming VMs that use multipath devices that
> became available again. This functionality will be even more important in
> the next version, since we plan to move to a LUN-per-disk model.

I will look at doing this.
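For anyone else checking their hosts: pause events can be found by grepping the vdsm log. A minimal sketch, assuming vdsm's usual log location (/var/log/vdsm/vdsm.log) and the "abnormal vm stop" wording vdsm uses when it pauses a VM on an I/O error (the exact message format is an assumption; check your own logs). A temp file stands in for the real log so the sketch is self-contained:

```shell
# Sample line modeled on vdsm's "abnormal vm stop" pause message
# (format is an assumption), written to a temp file for illustration.
log=$(mktemp)
printf '2018-07-23 18:35:02 INFO (libvirt/events) [virt.vm] abnormal vm stop device scsi0-0-0-1 error eio\n' > "$log"

# On a real host you would grep /var/log/vdsm/vdsm.log instead.
grep -c 'abnormal vm stop' "$log"   # prints 1: one pause event found
```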
At the very least I feel that the differences/limitations between storage
back-ends/methods should be documented, just so users don't run into any
surprises.

>> To get the VMs to automatically restart, I have attached VM leases to
>> them and that seems to work fine. Not as nice as a pause and resume,
>> but it minimizes downtime.
>
> Cool!
>
>> What I'm trying to understand is why the VMs with direct LUNs paused,
>> and the ones without didn't. My only speculation is that since the
>> non-direct disks use LVM on top of iSCSI, LVM is adding its own layer
>> of timeouts that masks the outage?
>
> I don't know of an additional retry mechanism in the data path for
> LVM-based disks. I think we use the same multipath failover behavior.
>
>> My other question is: how can I keep my VMs with direct LUNs from
>> pausing during short outages? Can I add configuration to my
>> multipath.conf for just the WWIDs of my direct LUNs to increase
>> 'no_path_retry' and prevent the VMs from pausing in the first place? I
>> know that in general you don't want to increase 'no_path_retry'
>> because it can cause timeout issues with vdsm and SPM operations (LVM
>> changes, etc.), but in the case of a direct LUN would it cause any
>> problems?
>
> You can add a drop-in multipath configuration that changes
> no_path_retry for a specific device or multipath.
>
> Increasing no_path_retry will cause larger delays when vdsm tries to
> access the LUNs via LVM commands, but the delay should occur only on the
> first access, when a LUN is not available.

Would that increased delay cause any sort of issue for oVirt (e.g.
thinking a node is offline/unresponsive) if set globally in
multipath.conf? Since a direct LUN doesn't use LVM, would this even be a
consideration if the increased delay were limited to the direct LUN only?
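For context on what the value means: no_path_retry counts path-checker polling intervals, so the time I/O is queued before failing is no_path_retry multiplied by the polling interval. A quick worked example (the 5-second polling interval is a commonly used default, assumed here rather than taken from this thread):

```shell
# no_path_retry counts checker intervals; with a 5s polling interval
# (a common default, assumed here), no_path_retry 12 queues I/O for
# 12 * 5 = 60 seconds before failing it back to the guest.
polling_interval=5
no_path_retry=12
echo $(( no_path_retry * polling_interval ))   # prints 60
```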
> Here is an example drop-in file:
>
> # cat /etc/multipath/conf.d/my.conf
> devices {
>     device {
>         vendor "my-vendor"
>         product "my-product"
>         # based on a 5 second monitor interval, queue I/O for
>         # 60 seconds when no path is available, before failing
>         no_path_retry 12
>     }
> }
>
> multipaths {
>     multipath {
>         wwid "my-wwid"
>         no_path_retry 12
>     }
> }

Yep, this was my plan.

> See "man multipath.conf" for more info.
>
> Nir

Thanks,
Ryan
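A sketch of sanity-checking a per-WWID drop-in like the one above before reloading multipathd. The WWID is a placeholder and the file is written to a temp path so the sketch runs anywhere; on a real host the file belongs in /etc/multipath/conf.d/ and is activated with "multipathd reconfigure":

```shell
# Write the per-WWID stanza to a temp file (stands in for
# /etc/multipath/conf.d/my.conf; the WWID is a placeholder).
conf=$(mktemp)
cat > "$conf" <<'EOF'
multipaths {
    multipath {
        wwid "my-wwid"
        no_path_retry 12
    }
}
EOF

# Confirm the setting is present before reloading; on a real host,
# follow up with: multipathd reconfigure
grep -c 'no_path_retry 12' "$conf"   # prints 1
```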
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/NXZHSYHIDVO4W2CCVU6G7SBFLELC7APV/