[ovirt-users] Re: How to handle broken NFS storage?

2021-06-04 Thread Vojtech Juranek
On Friday, 4 June 2021 10:50:47 CEST David White via Users wrote:
> I'm trying to figure out how to keep a "broken" NFS mount point from causing
> the entire HCI cluster to crash.
> 
> HCI is working beautifully.
> Last night, I finished adding some NFS storage to the cluster - this is
> storage that I don't necessarily need to be HA, and I was hoping to store
> some backups and less-important VMs on, since my Gluster (sssd) storage
> availability is pretty limited.
> 
> But as a test, after I got everything setup, I stopped the nfs-server.
> This caused the entire cluster to go down, and several VMs - that are not
> stored on the NFS storage - went belly up.

What was the error? Was the NFS domain the master storage domain?

> Once I started the NFS server process again, HCI did what it was supposed to
> do, and was able to automatically recover. My concern is that NFS is a
> single point of failure, and if VMs that don't even rely on that storage
> are affected if the NFS storage goes away, then I don't want anything to do
> with it. On the other hand, I'm still struggling to come up with a good way
> to run on-site backups and snapshots without using up more gluster space on
> my (more expensive) sssd storage.
> 
> Is there any way to setup NFS storage for a Backup Domain - as well as a
> Data domain (for lesser important VMs) - such that, if the NFS server
> crashed, all of my non-NFS stuff would be unaffected?
> 
> Sent with ProtonMail Secure Email.





[ovirt-users] Re: How to handle broken NFS storage?

2021-06-04 Thread Nir Soffer
On Fri, Jun 4, 2021 at 12:11 PM David White via Users  wrote:
>
> I'm trying to figure out how to keep a "broken" NFS mount point from causing 
> the entire HCI cluster to crash.
>
> HCI is working beautifully.
> Last night, I finished adding some NFS storage to the cluster - this is 
> storage that I don't necessarily need to be HA, and I was hoping to store 
> some backups and less-important VMs on, since my Gluster (sssd) storage 
> availability is pretty limited.
>
> But as a test, after I got everything setup, I stopped the nfs-server.
> This caused the entire cluster to go down, and several VMs - that are not 
> stored on the NFS storage - went belly up.

Please explain in more detail "went belly up".

In general vms not using the NFS storage domain should not be affected, but
due to an unfortunate design of vdsm, all storage domains share the same
global lock, and when one storage domain has trouble it can cause delays in
operations on other domains. This may lead to timeouts and vms being reported
as non-responsive, but the actual vms should not be affected.

If you have a good way to reproduce the issue, please file a bug with all the
logs; we will try to improve this situation.

> Once I started the NFS server process again, HCI did what it was supposed to 
> do, and was able to automatically recover.
> My concern is that NFS is a single point of failure, and if VMs that don't 
> even rely on that storage are affected if the NFS storage goes away, then I 
> don't want anything to do with it.

You need to understand the actual effect on the vms before you reject NFS.

> On the other hand, I'm still struggling to come up with a good way to run 
> on-site backups and snapshots without using up more gluster space on my (more 
> expensive) sssd storage.

NFS is useful for this purpose. You don't need synchronous replication, and
you want the backups outside of your cluster so in case of disaster you can
restore the backups on another system.

Snapshots are always kept on the same storage, so they will not help here.

> Is there any way to setup NFS storage for a Backup Domain - as well as a Data 
> domain (for lesser important VMs) - such that, if the NFS server crashed, all 
> of my non-NFS stuff would be unaffected?

NFS storage domain will always affect other storage domains, but if you mount
your NFS storage outside of ovirt, the mount will not affect the system.

Then you can back up to this mount, for example using backup_vm.py:
https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py

Or use one of the backup solutions; none of them use a storage domain for
keeping the backups, so the mount should not affect the system.
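
For illustration, a minimal sketch of what a run could look like, assuming the
export is mounted at /mnt/backup outside of oVirt and that the engine
credentials live in an ovirt.conf section named "myengine" (the exact
backup_vm.py flags vary between SDK versions, so check the script's --help
first):

  # mount the backup export manually, outside of oVirt (hard mount recommended)
  mount -t nfs backup-server:/export/backup /mnt/backup

  # full backup of one vm, written under the external mount
  python3 backup_vm.py -c myengine full --backup-dir /mnt/backup/vm-backups <vm-uuid>

A nightly cron entry wrapping such a command gives you scheduled backups
without oVirt ever knowing about the NFS mount.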

Nir


[ovirt-users] Re: How to handle broken NFS storage?

2021-06-04 Thread David White via Users
When I stopped the NFS service, I was connected to a VM over ssh.
I was also connected to one of the physical hosts over ssh, and was running top.

I observed that server load continued to increase over time on the physical 
host.
Several of the VMs (perhaps all?), including the one I was connected to, went 
down due to an underlying storage issue.
It appears to me that HA VMs were restarted automatically. For example, I see 
the following in the oVirt Manager Event Log (domain names changed / redacted):


Jun 4, 2021, 4:25:42 AM
Highly Available VM server2.example.com failed. It will be restarted 
automatically.

Jun 4, 2021, 4:25:42 AM
Highly Available VM mail.example.com failed. It will be restarted automatically.

Jun 4, 2021, 4:25:42 AM
Highly Available VM core1.mgt.example.com failed. It will be restarted 
automatically.

Jun 4, 2021, 4:25:42 AM
VM cha1-shared.example.com has been paused due to unknown storage error.

Jun 4, 2021, 4:25:42 AM
VM server.example.org has been paused due to storage I/O problem.

Jun 4, 2021, 4:25:42 AM
VM server.example.com has been paused.

Jun 4, 2021, 4:25:42 AM
VM server.example.org has been paused.

Jun 4, 2021, 4:25:41 AM
VM server.example.org has been paused due to unknown storage error.

Jun 4, 2021, 4:25:41 AM
VM HostedEngine has been paused due to storage I/O problem.


During this outage, I also noticed that customer websites were not working.
So I clearly took an outage.

> If you have a good way to reproduce the issue please file a bug with
> all the logs, we try to improve this situation.

I don't have a separate lab environment, but if I'm able to reproduce the issue 
off hours, I may try to do so.
What logs would be helpful? 


> NFS storage domain will always affect other storage domains, but if you mount
> your NFS storage outside of ovirt, the mount will not affect the system.
> 

> Then you can backup to this mount, for example using backup_vm.py:
> https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py

If I'm understanding you correctly, it sounds like you're suggesting that I 
just connect 1 (or multiple) hosts to the NFS mount manually, and don't use the 
oVirt manager to build the backup domain. Then just run this script on a cron 
or something - is that correct?


Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Friday, June 4, 2021 12:29 PM, Nir Soffer  wrote:

> On Fri, Jun 4, 2021 at 12:11 PM David White via Users users@ovirt.org wrote:
> 

> > I'm trying to figure out how to keep a "broken" NFS mount point from 
> > causing the entire HCI cluster to crash.
> > HCI is working beautifully.
> > Last night, I finished adding some NFS storage to the cluster - this is 
> > storage that I don't necessarily need to be HA, and I was hoping to store 
> > some backups and less-important VMs on, since my Gluster (sssd) storage 
> > availability is pretty limited.
> > But as a test, after I got everything setup, I stopped the nfs-server.
> > This caused the entire cluster to go down, and several VMs - that are not 
> > stored on the NFS storage - went belly up.
> 

> Please explain in more detail "went belly up".
> 

> In general vms not using he nfs storage domain should not be affected, but
> due to unfortunate design of vdsm, all storage domain share the same global 
> lock
> and when one storage domain has trouble, it can cause delays in
> operations on other
> domains. This may lead to timeouts and vms reported as non-responsive,
> but the actual
> vms, should not be affected.
> 

> If you have a good way to reproduce the issue please file a bug with
> all the logs, we try
> to improve this situation.
> 

> > Once I started the NFS server process again, HCI did what it was supposed 
> > to do, and was able to automatically recover.
> > My concern is that NFS is a single point of failure, and if VMs that don't 
> > even rely on that storage are affected if the NFS storage goes away, then I 
> > don't want anything to do with it.
> 

> You need to understand the actual effect on the vms before you reject NFS.
> 

> > On the other hand, I'm still struggling to come up with a good way to run 
> > on-site backups and snapshots without using up more gluster space on my 
> > (more expensive) sssd storage.
> 

> NFS is useful for this purpose. You don't need synchronous replication, and
> you want the backups outside of your cluster so in case of disaster you can
> restore the backups on another system.
> 

> Snapshots are always on the same storage so it will not help.
> 

> > Is there any way to setup NFS storage for a Backup Domain - as well as a 
> > Data domain (for lesser important VMs) - such that, if the NFS server 
> > crashed, all of my non-NFS stuff would be unaffected?
> 

> NFS storage domain will always affect other storage domains, but if you mount
> your NFS storage outside of ovirt, the mount will not affect the system.
> 

> Then you can backup to this mount, for example using backup_vm.py:
> ht

[ovirt-users] Re: How to handle broken NFS storage?

2021-06-05 Thread Strahil Nikolov via Users
Exactly. oVirt will monitor storage only if it knows about it. I would use
autofs or systemd's '.mount' + '.automount' unit files to automatically mount
and unmount the NFS export.
In the worst case you will have only a few stale NFS mount points (if using
the hard mount option), and most of the time those can be recovered without a
reboot.
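
For example, a pair of unit files along these lines (server, export and mount
point are only placeholders; note that systemd requires the unit name to match
the mount point, so /mnt/backup needs mnt-backup.mount / mnt-backup.automount):

  # /etc/systemd/system/mnt-backup.mount
  [Unit]
  Description=NFS export for backups (kept outside of oVirt)

  [Mount]
  What=backup-server:/export/backup
  Where=/mnt/backup
  Type=nfs
  Options=hard,noatime

  # /etc/systemd/system/mnt-backup.automount
  [Unit]
  Description=Automount for the backup NFS export

  [Automount]
  Where=/mnt/backup
  TimeoutIdleSec=600

  [Install]
  WantedBy=multi-user.target

Enable only the automount unit (systemctl enable --now mnt-backup.automount);
the mount unit is then pulled in on first access and unmounted again after the
idle timeout.
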
Best Regards,
Strahil Nikolov



> If I'm understanding you correctly, it sounds like you're suggesting that I
> just connect 1 (or multiple) hosts to the NFS mount manually, and don't use the
> oVirt manager to build the backup domain. Then just run this script on a cron
> or something - is that correct?


[ovirt-users] Re: How to handle broken NFS storage?

2021-06-05 Thread Edward Berger
I could be mistaken, but I think the issue is that the oVirt engine believes
the NFS domain is mandatory for all hypervisor hosts unless you put the
storage domain into 'maintenance' mode.
It will notice it's down and start trying to fence the offending hypervisors,
which in turn tries to migrate VMs to other hypervisors that are also marked
bad because they can't reach the storage domain either. That's what I recall
seeing once, when I thought it was safe to temporarily take down an ISO NFS
domain a few years ago.

On Fri, Jun 4, 2021 at 5:11 AM David White via Users wrote:

> I'm trying to figure out how to keep a "broken" NFS mount point from
> causing the entire HCI cluster to crash.
>
> HCI is working beautifully.
> Last night, I finished adding some NFS storage to the cluster - this is
> storage that I don't necessarily need to be HA, and I was hoping to store
> some backups and less-important VMs on, since my Gluster (sssd) storage
> availability is pretty limited.
>
> But as a test, after I got everything setup, I stopped the nfs-server.
> This caused the entire cluster to go down, and several VMs - that are not
> stored on the NFS storage - went belly up.
>
> Once I started the NFS server process again, HCI did what it was supposed
> to do, and was able to automatically recover.
> My concern is that NFS is a single point of failure, and if VMs that don't
> even rely on that storage are affected if the NFS storage goes away, then I
> don't want anything to do with it.
> On the other hand, I'm still struggling to come up with a good way to run
> on-site backups and snapshots without using up more gluster space on my
> (more expensive) sssd storage.
>
> Is there any way to setup NFS storage for a Backup Domain - as well as a
> Data domain (for lesser important VMs) - such that, if the NFS server
> crashed, all of my non-NFS stuff would be unaffected?
>
>
> Sent with ProtonMail  Secure Email.
>


[ovirt-users] Re: How to handle broken NFS storage?

2021-06-06 Thread Strahil Nikolov via Users
That's correct. oVirt knows about and ensures that such an NFS domain is
available at all times, and it will take any actions it can to recover the
situation. When you use autofs outside of oVirt, you still have that backup
location, and it will also be unmounted when not in use.
Best Regards,
Strahil Nikolov


[ovirt-users] Re: How to handle broken NFS storage?

2021-06-06 Thread Nir Soffer
On Sat, Jun 5, 2021 at 3:25 AM David White via Users  wrote:
>
> When I stopped the NFS service, I was connect to a VM over ssh.
> I was also connected to one of the physical hosts over ssh, and was running 
> top.
>
> I observed that server load continued to increase over time on the physical 
> host.
> Several of the VMs (perhaps all?), including the one I was connected to, went 
> down due to an underlying storage issue.
> It appears to me that HA VMs were restarted automatically. For example, I see 
> the following in the oVirt Manager Event Log (domain names changed / 
> redacted):
>
>
> Jun 4, 2021, 4:25:42 AM
> Highly Available VM server2.example.com failed. It will be restarted 
> automatically.

Do you have a cdrom on an ISO storage domain, maybe on the same NFS server
that you stopped?

> Jun 4, 2021, 4:25:42 AM
> Highly Available VM mail.example.com failed. It will be restarted 
> automatically.
>
> Jun 4, 2021, 4:25:42 AM
> Highly Available VM core1.mgt.example.com failed. It will be restarted 
> automatically.
>
> Jun 4, 2021, 4:25:42 AM
> VM cha1-shared.example.com has been paused due to unknown storage error.
>
> Jun 4, 2021, 4:25:42 AM
> VM server.example.org has been paused due to storage I/O problem.
>
> Jun 4, 2021, 4:25:42 AM
> VM server.example.com has been paused.

I guess this vm was using the NFS server?

> Jun 4, 2021, 4:25:42 AM
> VM server.example.org has been paused.
>
> Jun 4, 2021, 4:25:41 AM
> VM server.example.org has been paused due to unknown storage error.
>
> Jun 4, 2021, 4:25:41 AM
> VM HostedEngine has been paused due to storage I/O problem.
>
>
> During this outage, I also noticed that customer websites were not working.
> So I clearly took an outage.
>
> > If you have a good way to reproduce the issue please file a bug with
> > all the logs, we try to improve this situation.
>
> I don't have a separate lab environment, but if I'm able to reproduce the 
> issue off hours, I may try to do so.
> What logs would be helpful?

/var/log/vdsm.log
/var/log/sanlock.log
/var/log/messages or output of journalctl
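
For example, something along these lines collects them from a host in one
archive (the time window and paths are only placeholders; on oVirt hosts the
vdsm logs are usually kept under /var/log/vdsm/):

  journalctl --since "2021-06-04 04:00" --until "2021-06-04 05:00" > /tmp/journal.txt
  tar czf /tmp/nfs-outage-logs.tar.gz /var/log/vdsm/ /var/log/sanlock.log /tmp/journal.txt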

> > NFS storage domain will always affect other storage domains, but if you 
> > mount
> > your NFS storage outside of ovirt, the mount will not affect the system.
> >
>
> > Then you can backup to this mount, for example using backup_vm.py:
> > https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py
>
> If I'm understanding you correctly, it sounds like you're suggesting that I 
> just connect 1 (or multiple) hosts to the NFS mount manually,

Yes

> and don't use the oVirt manager to build the backup domain. Then just run 
> this script on a cron or something - is that correct?

Yes.

You can run the backup in many ways; for example, you can run it via ssh
from another host, finding where the vms are running and connecting to that
host to perform the backup. This is outside of oVirt, since oVirt does not
have a built-in backup feature. We have a backup API and example code using
it, which can be used to build a backup solution.
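
As a sketch of the cron/ssh variant (hostnames, paths and the backup_vm.py
arguments are placeholders to adapt, not a tested recipe):

  # /etc/cron.d/ovirt-backup on a machine outside the cluster: nightly full backup at 02:00
  0 2 * * * root ssh host1.example.com 'python3 backup_vm.py -c myengine full --backup-dir /mnt/backup/vm-backups <vm-uuid>' >> /var/log/ovirt-backup.log 2>&1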

> Sent with ProtonMail Secure Email.
>
> ‐‐‐ Original Message ‐‐‐
> On Friday, June 4, 2021 12:29 PM, Nir Soffer  wrote:
>
> > On Fri, Jun 4, 2021 at 12:11 PM David White via Users users@ovirt.org wrote:
> >
>
> > > I'm trying to figure out how to keep a "broken" NFS mount point from 
> > > causing the entire HCI cluster to crash.
> > > HCI is working beautifully.
> > > Last night, I finished adding some NFS storage to the cluster - this is 
> > > storage that I don't necessarily need to be HA, and I was hoping to store 
> > > some backups and less-important VMs on, since my Gluster (sssd) storage 
> > > availability is pretty limited.
> > > But as a test, after I got everything setup, I stopped the nfs-server.
> > > This caused the entire cluster to go down, and several VMs - that are not 
> > > stored on the NFS storage - went belly up.
> >
>
> > Please explain in more detail "went belly up".
> >
>
> > In general vms not using he nfs storage domain should not be affected, but
> > due to unfortunate design of vdsm, all storage domain share the same global 
> > lock
> > and when one storage domain has trouble, it can cause delays in
> > operations on other
> > domains. This may lead to timeouts and vms reported as non-responsive,
> > but the actual
> > vms, should not be affected.
> >
>
> > If you have a good way to reproduce the issue please file a bug with
> > all the logs, we try
> > to improve this situation.
> >
>
> > > Once I started the NFS server process again, HCI did what it was supposed 
> > > to do, and was able to automatically recover.
> > > My concern is that NFS is a single point of failure, and if VMs that 
> > > don't even rely on that storage are affected if the NFS storage goes 
> > > away, then I don't want anything to do with it.
> >
>
> > You need to understand the actual effect on the vms before you reject NFS.
> >
>
> > > On the other hand, I'm still struggling to come up with a good way to run 
> 

[ovirt-users] Re: How to handle broken NFS storage?

2021-06-06 Thread Nir Soffer
On Sun, Jun 6, 2021 at 12:31 PM Nir Soffer  wrote:
>
> On Sat, Jun 5, 2021 at 3:25 AM David White via Users  wrote:
> >
> > When I stopped the NFS service, I was connect to a VM over ssh.
> > I was also connected to one of the physical hosts over ssh, and was running 
> > top.
> >
> > I observed that server load continued to increase over time on the physical 
> > host.
> > Several of the VMs (perhaps all?), including the one I was connected to, 
> > went down due to an underlying storage issue.
> > It appears to me that HA VMs were restarted automatically. For example, I 
> > see the following in the oVirt Manager Event Log (domain names changed / 
> > redacted):
> >
> >
> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM server2.example.com failed. It will be restarted 
> > automatically.
>
> Do  you have a cdrom on an ISO storage domain, maybe on the same NFS server
> that you stopped?

If you share the vm xml for the HA vms and the regular vms, it will be easier
to understand your system.

The best way is to use:

sudo virsh -r dumpxml {vm-name}
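
To grab it for every running vm on a host in one go, something like this
should work (the -r flag keeps the libvirt connection read-only):

  for vm in $(sudo virsh -r list --name); do
      sudo virsh -r dumpxml "$vm" > "/tmp/$vm.xml"
  done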

> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM mail.example.com failed. It will be restarted 
> > automatically.
> >
> > Jun 4, 2021, 4:25:42 AM
> > Highly Available VM core1.mgt.example.com failed. It will be restarted 
> > automatically.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM cha1-shared.example.com has been paused due to unknown storage error.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.org has been paused due to storage I/O problem.
> >
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.com has been paused.
>
> I guess this vm was using the NFS server?
>
> > Jun 4, 2021, 4:25:42 AM
> > VM server.example.org has been paused.
> >
> > Jun 4, 2021, 4:25:41 AM
> > VM server.example.org has been paused due to unknown storage error.
> >
> > Jun 4, 2021, 4:25:41 AM
> > VM HostedEngine has been paused due to storage I/O problem.
> >
> >
> > During this outage, I also noticed that customer websites were not working.
> > So I clearly took an outage.
> >
> > > If you have a good way to reproduce the issue please file a bug with
> > > all the logs, we try to improve this situation.
> >
> > I don't have a separate lab environment, but if I'm able to reproduce the 
> > issue off hours, I may try to do so.
> > What logs would be helpful?
>
> /var/log/vdsm.log
> /var/log/sanlock.log
> /var/log/messages or output of journalctl
>
> > > NFS storage domain will always affect other storage domains, but if you 
> > > mount
> > > your NFS storage outside of ovirt, the mount will not affect the system.
> > >
> >
> > > Then you can backup to this mount, for example using backup_vm.py:
> > > https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py
> >
> > If I'm understanding you correctly, it sounds like you're suggesting that I 
> > just connect 1 (or multiple) hosts to the NFS mount manually,
>
> Yes
>
> > and don't use the oVirt manager to build the backup domain. Then just run 
> > this script on a cron or something - is that correct?
>
> Yes.
>
> You can run the backup in many ways, for example you can run it via ssh
> from another host, finding where vms are running, and connecting to
> the host to perform a backup. This is outside of ovirt, since ovirt does not
> have built-in a backup feature. We have backup API and example code using it
> which can be used to build a backup solution.
>
> > Sent with ProtonMail Secure Email.
> >
> > ‐‐‐ Original Message ‐‐‐
> > On Friday, June 4, 2021 12:29 PM, Nir Soffer  wrote:
> >
> > > On Fri, Jun 4, 2021 at 12:11 PM David White via Users users@ovirt.org 
> > > wrote:
> > >
> >
> > > > I'm trying to figure out how to keep a "broken" NFS mount point from 
> > > > causing the entire HCI cluster to crash.
> > > > HCI is working beautifully.
> > > > Last night, I finished adding some NFS storage to the cluster - this is 
> > > > storage that I don't necessarily need to be HA, and I was hoping to 
> > > > store some backups and less-important VMs on, since my Gluster (sssd) 
> > > > storage availability is pretty limited.
> > > > But as a test, after I got everything setup, I stopped the nfs-server.
> > > > This caused the entire cluster to go down, and several VMs - that are 
> > > > not stored on the NFS storage - went belly up.
> > >
> >
> > > Please explain in more detail "went belly up".
> > >
> >
> > > In general vms not using he nfs storage domain should not be affected, but
> > > due to unfortunate design of vdsm, all storage domain share the same 
> > > global lock
> > > and when one storage domain has trouble, it can cause delays in
> > > operations on other
> > > domains. This may lead to timeouts and vms reported as non-responsive,
> > > but the actual
> > > vms, should not be affected.
> > >
> >
> > > If you have a good way to reproduce the issue please file a bug with
> > > all the logs, we try
> > > to improve this situation.
> > >
> >
> > > > Once I started the NFS server process ag