On Tue, Jul 24, 2018 at 5:51 AM, Nir Soffer <nsof...@redhat.com> wrote:

> On Mon, Jul 23, 2018 at 9:35 PM Ryan Bullock <rrb3...@gmail.com> wrote:
>
>> Hello All,
>>
>> We recently stood up a new oVirt install backed by an iSCSI SAN, and it
>> has been working great, but there are a few quirks I am trying to iron out.
>>
>> We have run into an issue where, when we fail over our SAN (for
>> maintenance or otherwise), any VM with a Direct LUN gets paused and doesn’t
>> resume. VMs without a Direct LUN never paused.
>>
>
> I guess the other VMs did get paused, but they were resumed
> automatically by the system, so from your point of view, they did
> not "pause".
>
> You can check the vdsm log to see if the other VMs did pause and resume. I'm
> not sure the engine UI reports all pause and resume events.
>
>

Ah, OK. That would make sense. I had checked the events via the UI and it
didn't show any pauses, but I had not checked the actual VDSM logs on the
hosts. Unfortunately my logs for that period have rolled off. I had
noticed this behaviour during our first firmware upgrade on our SAN about a
month ago. Since VM leases allowed us to maintain HA, I just put it on my
list of things to follow up on. Going forward I will make sure to double
check the VDSM logs to see what is happening in the background.
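
For example, something like this should surface those events (a rough
sketch; /var/log/vdsm/vdsm.log is the usual vdsm log location, but the
exact message wording may differ between versions):

    # search current and rotated vdsm logs for pause/resume activity
    # (zgrep also handles the compressed, rotated logs)
    zgrep -iE 'pause|resume' /var/log/vdsm/vdsm.log*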

>> Digging through posts on this list and reading some bug reports, it seems
>> like this is a known quirk with how oVirt handles Direct LUNs (it doesn't
>> monitor the LUNs, and so it won't resume the VM).
>>
>
> Right.
>
> Can you file a bug for supporting this?
>
> Vdsm does monitor multipath events for all LUNs, but they are used only
> for reporting purposes, see:
> https://ovirt.org/develop/release-management/features/storage/multipath-events/
>
> We could use the events for resuming VMs using the multipath devices that
> became available. This functionality will be even more important in the
> next version, since we plan to move to a LUN-per-disk model.
>
>

I will look at doing this. At the very least, I feel that the
differences/limitations between storage back-ends/methods should be
documented, just so users don't run into any surprises.

>> To get the VMs to automatically restart, I have attached VM leases to them,
>> and that seems to work fine; it's not as nice as a pause and resume, but it
>> minimizes downtime.
>>
>
> Cool!
>
>
>> What I’m trying to understand is why the VMs with Direct LUNs paused and
>> the ones without didn’t. My only speculation is that, since the non-Direct
>> disks use LVM on top of iSCSI, LVM adds its own layer of timeouts that
>> masks the outage?
>>
>
> I don't know of an additional retry mechanism in the data path for
> LVM-based disks. I think we use the same multipath failover behavior.
>
>
>> My other question is: how can I keep my VMs with Direct LUNs from pausing
>> during short outages? Can I add configuration to my multipath.conf for just
>> the WWIDs of my Direct LUNs, increasing ‘no_path_retry’ to prevent the VMs
>> from pausing in the first place? I know that in general you don’t want to
>> increase ‘no_path_retry’ because it can cause timeout issues with VDSM and
>> SPM operations (LVM changes, etc.). But in the case of a Direct LUN, would
>> it cause any problems?
>>
>
> You can add a drop-in multipath configuration that will change
> no_path_retry for a specific device type, or for a specific multipath
> (by WWID).
>
> Increasing no_path_retry will cause larger delays when vdsm tries to
> access the LUNs via LVM commands, but the delay should occur only on
> the first access when a LUN is not available.
>
>
Would that increased delay cause any sort of issues for oVirt (e.g.
thinking a node is offline/unresponsive) if set globally in multipath.conf?
Since a Direct LUN doesn't use LVM, would this even be a consideration if
the increased delay were limited to the Direct LUN only?
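
For reference, to scope such an override to just the Direct LUNs, I would
first pull their WWIDs from the active maps. A rough sketch, assuming
multipathd is already managing the devices on the host:

    # list the multipath maps; each map is shown with its WWID
    multipath -ll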

> Here is an example drop-in file:
>
> # cat /etc/multipath/conf.d/my.conf
> devices {
>     device {
>         vendor "my-vendor"
>         product "my-product"
>         # based on a 5 second monitor interval, queue I/O for
>         # 60 seconds when no path is available, before failing.
>         no_path_retry 12
>     }
> }
>
> multipaths {
>     multipath {
>         wwid "my-wwid"
>         no_path_retry 12
>     }
> }
>
>
Yep, this was my plan.
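
Once the drop-in is in place, I'll reload and verify with something like
the following (a sketch; "my-wwid" is just the placeholder from the example
above):

    # re-read multipath.conf and the conf.d drop-ins without a restart
    multipathd reconfigure
    # dump the merged running configuration to confirm the override applied
    multipathd show config | grep -A 1 my-wwid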

See "man multipath.conf" for more info.
>
> Nir
>

Thanks,

Ryan