Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Since this thread shows up at the top of search results for "oVirt compellent", I should mention that this has been solved. The problem was a bad disk in the Compellent's tier 2 storage. The multipath.conf and iscsi.conf advice is still valid, though, and made oVirt more resilient when the Compellent was struggling. ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Let's continue this on Bugzilla. https://bugzilla.redhat.com/show_bug.cgi?id=1225162
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
- Original Message -
> From: "Chris Jones - BookIt.com Systems Administrator"
> To: users@ovirt.org
> Sent: Friday, May 22, 2015 8:55:37 PM
> Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
>
> > Is there maybe some IO problem on the iSCSI target side?
> > IIUIC the problem is some timeout, which could indicate that the target
> > is overloaded.
>
> Maybe. I need to check with Dell. I did manage to get it to be a little
> more stable with this config.
>
> defaults {
>     polling_interval 10
>     path_selector "round-robin 0"
>     path_grouping_policy multibus
>     getuid_callout "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
>     path_checker readsector0
>     rr_min_io_rq 100
>     max_fds 8192
>     rr_weight priorities
>     failback immediate
>     no_path_retry fail
>     user_friendly_names no

You should keep the "defaults" section unchanged, and add specific settings under the device section.

> }
> devices {
>     device {
>         vendor COMPELNT
>         product "Compellent Vol"
>         path_checker tur
>         no_path_retry fail

This is most likely missing some settings. You are *not* getting the settings from the "defaults" section above. For example, since you did not specify "failback immediate" here, failback for this device defaults to whatever default multipath chose, not the value set in "defaults" above.

>     }
> }
>
> I referenced it from
> http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_solutions/1315.how-to-configure-device-mapper-multipath.
> I modified it a bit since that is Red Hat 5 specific and there have been
> some changes.
>
> It's not crashing anymore but I'm still seeing storage warnings in
> engine.log. I'm going to be enabling jumbo frames and talking with Dell
> to figure out if it's something on the Compellent side. I'll update here
> once I find something out.

Let's continue this on Bugzilla.
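Since device sections do not inherit from "defaults", one way to follow the advice above is to spell every setting out inside the device stanza itself. The values below are the ones Nir suggests elsewhere in this thread; treat this as a sketch and verify it against your multipath version's builtin Compellent entry before deploying:

```
devices {
    device {
        vendor "COMPELNT"
        product "Compellent Vol"
        path_grouping_policy "multibus"
        path_checker "tur"
        features "0"
        hardware_handler "0"
        prio "const"
        failback "immediate"
        rr_weight "uniform"
        no_path_retry fail
    }
}
```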
See also this patch: https://gerrit.ovirt.org/41244 Nir
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
- Original Message - > From: "Chris Jones - BookIt.com Systems Administrator" > > To: users@ovirt.org > Sent: Friday, May 22, 2015 12:32:01 AM > Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via > iSCSI/Multipath > > On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator > wrote: > > I've applied the multipath.conf and iscsi.conf changes you recommended. > > It seems to be running better. I was able to bring up all the hosts and > > VMs without it falling apart. > > I take it back. This did not solve the issue. I tried batch starting the > VMs and half the nodes went down due to the same storage issues. VDSM > Logs again. > https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1 It is possible that the multipath configuration I suggested is not optimized correctly for your server, or that it is too old (last updated in 2013). Or you may have issues in the network or the storage server. I would follow up with the storage vendor. Nir
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
- Original Message - > From: "Chris Jones - BookIt.com Systems Administrator" > > To: users@ovirt.org > Sent: Thursday, May 21, 2015 10:49:23 PM > Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via > iSCSI/Multipath > > I've applied the multipath.conf and iscsi.conf changes you recommended. > It seems to be running better. I was able to bring up all the hosts and > VMs without it falling apart. > > I'm still seeing the domain "in problem" and "recovered from problem" > warnings in engine.log, though. They were happening only when hosts were > activating and when I was mass launching many VMs. Is this normal? > > 2015-05-21 15:31:32,264 WARN > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] > (org.ovirt.thread.pool-8-thread-13) domain > c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: > blade6c2.ism.ld > 2015-05-21 15:31:47,468 INFO > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] > (org.ovirt.thread.pool-8-thread-4) Domain > c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. > vds: blade6c2.ism.ld > > Here's the vdsm log from a node the engine was warning about > https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's > trimmed to just before and after it happened. > > What is that repostat command from your previous email, Nir? "repostat > vdsm.log" I don't see it on the engine or the node. Is it used to parse > the log? Where can I find it? It is available here: https://gerrit.ovirt.org/38749 Nir
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
> Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded.

Maybe. I need to check with Dell. I did manage to get it to be a little more stable with this config.

defaults {
    polling_interval 10
    path_selector "round-robin 0"
    path_grouping_policy multibus
    getuid_callout "/usr/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
    path_checker readsector0
    rr_min_io_rq 100
    max_fds 8192
    rr_weight priorities
    failback immediate
    no_path_retry fail
    user_friendly_names no
}
devices {
    device {
        vendor COMPELNT
        product "Compellent Vol"
        path_checker tur
        no_path_retry fail
    }
}

I referenced it from http://en.community.dell.com/techcenter/enterprise-solutions/w/oracle_solutions/1315.how-to-configure-device-mapper-multipath. I modified it a bit since that is Red Hat 5 specific and there have been some changes.

It's not crashing anymore but I'm still seeing storage warnings in engine.log. I'm going to be enabling jumbo frames and talking with Dell to figure out if it's something on the Compellent side. I'll update here once I find something out. Thanks again for all the help.
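When a config like the above is copied around by email, it is easy for a keyword and its value to fuse (e.g. `no_path_retryfail` instead of `no_path_retry fail`), which multipath will not parse as intended. A throwaway sanity check like this (my own illustrative sketch, not a standard tool) can flag such lines before you reload multipathd:

```shell
# Illustrative sketch: flag multipath.conf lines where a known keyword is
# fused to its value, e.g. "no_path_retryfail". Works on a temp file here.
conf=$(mktemp)
cat > "$conf" <<'EOF'
devices {
    device {
        vendor COMPELNT
        product "Compellent Vol"
        path_checker tur
        no_path_retryfail
    }
}
EOF
# A keyword immediately followed by a non-space character is suspect.
suspects=$(grep -cE '(no_path_retry|path_checker|failback)[^ ]' "$conf")
echo "suspect lines: $suspects"
rm -f "$conf"
```

Extend the keyword alternation to whichever settings your config actually uses.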
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
- Original Message - > On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator > wrote: > > I've applied the multipath.conf and iscsi.conf changes you recommended. > > It seems to be running better. I was able to bring up all the hosts and > > VMs without it falling apart. > > I take it back. This did not solve the issue. I tried batch starting the > VMs and half the nodes went down due to the same storage issues. VDSM Is there maybe some IO problem on the iSCSI target side? IIUIC the problem is some timeout, which could indicate that the target is overloaded. But maybe I get something wrong ... - fabian > Logs again. > https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
On 05/21/2015 03:49 PM, Chris Jones - BookIt.com Systems Administrator wrote: I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart. I take it back. This did not solve the issue. I tried batch starting the VMs and half the nodes went down due to the same storage issues. VDSM Logs again. https://www.dropbox.com/s/12sudzhaily72nb/vdsm_failures.log.gz?dl=1
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
I've applied the multipath.conf and iscsi.conf changes you recommended. It seems to be running better. I was able to bring up all the hosts and VMs without it falling apart.

I'm still seeing the domain "in problem" and "recovered from problem" warnings in engine.log, though. They were happening only when hosts were activating and when I was mass launching many VMs. Is this normal?

2015-05-21 15:31:32,264 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c2.ism.ld
2015-05-21 15:31:47,468 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c2.ism.ld

Here's the vdsm log from a node the engine was warning about: https://www.dropbox.com/s/yaubaxax1w499f1/vdsm2.log.gz?dl=1. It's trimmed to just before and after it happened.

What is that repostat command from your previous email, Nir? "repostat vdsm.log" — I don't see it on the engine or the node. Is it used to parse the log? Where can I find it?

Thanks again.
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
On 21.05.2015 02:48, Chris Jones - BookIt.com Systems Administrator wrote:
>> Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the builtin setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration, instead of the one vdsm configures.
>>
>> device {
>>     vendor "COMPELNT"
>>     product "Compellent Vol"
>>     path_grouping_policy "multibus"
>>     path_checker "tur"
>>     features "0"
>>     hardware_handler "0"
>>     prio "const"
>>     failback "immediate"
>>     rr_weight "uniform"
>>     no_path_retry fail
>> }
>
> I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal

I have this issue also. I am thinking about opening a BZ ;)

> but when I tried updating it, oVirt instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.
>
> I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.
>
> So I tried to be slick about it. I made the multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the node's capabilities, and part of what it does is read and then overwrite multipath.conf.
>
> How do I safely update multipath.conf?

In the second line of your multipath.conf, add:

# RHEV PRIVATE

Then, host deploy will ignore it and never change it.

>> To verify that your devices match this, you can check the devices' vendor and product strings in the output of "multipath -ll". I would like to see the output of this command.
>
> multipath -ll (default setup) can be seen here.
> http://paste.linux-help.org/view/430c7538

>> Another platform issue is a bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for io to complete on one path before failing the io request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.
>>
>> Multipath is trying to set this value to 5 seconds, but this value is reverting to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in rhel/centos 7.2. https://bugzilla.redhat.com/1139038
>>
>> This issue together with "no_path_retry queue" is a very bad mix for ovirt.
>>
>> You can fix this timeout by setting:
>>
>> # /etc/iscsi/iscsid.conf
>> node.session.timeo.replacement_timeout = 5
>
> I'll see if that's possible with persist. Will this change survive node upgrades?
>
> Thanks for the reply and the suggestions.

-- Daniel Helgenberger, m box bewegtbild GmbH
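The "# RHEV PRIVATE" trick above can be sketched as a one-liner. This demo exercises it on a throwaway copy; on a real node you would edit /etc/multipath.conf itself (and on ovirt-node also run `persist /etc/multipath.conf`). The sample first line follows the "# RHEV REVISION 1.1" header mentioned in this thread:

```shell
# Sketch: add "# RHEV PRIVATE" as the second line of multipath.conf so
# host deploy leaves the file alone. Demonstrated on a temp copy.
conf=$(mktemp)
printf '%s\n' '# RHEV REVISION 1.1' 'defaults {' '}' > "$conf"
sed -i '1a # RHEV PRIVATE' "$conf"   # insert the marker after line 1 (GNU sed)
second=$(sed -n '2p' "$conf")
echo "$second"
rm -f "$conf"
```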
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
On 05/21/2015 02:47 AM, Chris Jones - BookIt.com Systems Administrator wrote:
>> Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the builtin setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration, instead of the one vdsm configures.
>>
>> device {
>>     vendor "COMPELNT"
>>     product "Compellent Vol"
>>     path_grouping_policy "multibus"
>>     path_checker "tur"
>>     features "0"
>>     hardware_handler "0"
>>     prio "const"
>>     failback "immediate"
>>     rr_weight "uniform"
>>     no_path_retry fail
>> }
>
> I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal but when I tried updating it, oVirt instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.
>
> I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.
>
> So I tried to be slick about it. I made the multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the node's capabilities, and part of what it does is read and then overwrite multipath.conf.
>
> How do I safely update multipath.conf?

Somehow the multipath.conf that oVirt generates forces my HDD RAID controller disks /dev/sdb* and /dev/sdc* to change, so I had to blacklist them.
I was able to persist it by adding "# RHEV PRIVATE" right below the "# RHEV REVISION 1.1" line.

Hope this helps.

Met vriendelijke groet, With kind regards,

Jorick Astrego
Netbulae Virtualization Experts
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Chris, as you are using ovirt-node, after Nir's suggestions please also execute the command below to save the settings changes across reboots:

# persist /etc/iscsi/iscsid.conf

Thanks.

I will do so, but first I have to resolve not being able to update multipath.conf, as described in my previous email.
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the builtin setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration, instead of the one vdsm configures.

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}

I wish I could. We're using the CentOS 7 ovirt-node-iso. The multipath.conf is less than ideal but when I tried updating it, oVirt instantly overwrites it. To be clear, yes I know changes do not survive reboots and yes I know about persist, but it changes it while running. Live! Persist won't help there.

I also tried building a CentOS 7 "thick client" where I set up CentOS 7 first, added the oVirt repo, then let the engine provision it. Same problem with multipath.conf being overwritten with the default oVirt setup.

So I tried to be slick about it. I made the multipath.conf immutable. That prevented the engine from being able to activate the node. It would fail on a vds command that gets the node's capabilities, and part of what it does is read and then overwrite multipath.conf.

How do I safely update multipath.conf?

To verify that your devices match this, you can check the devices' vendor and product strings in the output of "multipath -ll". I would like to see the output of this command.

multipath -ll (default setup) can be seen here. http://paste.linux-help.org/view/430c7538

Another platform issue is a bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for io to complete on one path before failing the io request. So you may have one bad path causing a 120 second delay, while you could complete the request using another path.
Multipath is trying to set this value to 5 seconds, but this value is reverting to the default 120 seconds after a device has trouble. There is an open bug about this which we hope to get fixed in rhel/centos 7.2. https://bugzilla.redhat.com/1139038

This issue together with "no_path_retry queue" is a very bad mix for ovirt.

You can fix this timeout by setting:

# /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 5

I'll see if that's possible with persist. Will this change survive node upgrades?

Thanks for the reply and the suggestions.
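The iscsid.conf change suggested above can be scripted. This sketch exercises the edit on a temp copy; on a real node the file is /etc/iscsi/iscsid.conf, and on ovirt-node you would follow it with `persist /etc/iscsi/iscsid.conf`:

```shell
# Sketch: lower the iSCSI replacement timeout from the 120s default to 5s,
# demonstrated on a temp copy of the file.
conf=$(mktemp)
echo 'node.session.timeo.replacement_timeout = 120' > "$conf"
sed -i 's/^node\.session\.timeo\.replacement_timeout.*/node.session.timeo.replacement_timeout = 5/' "$conf"
# Read the value back to confirm the edit took effect.
timeout=$(awk -F' = ' '$1 == "node.session.timeo.replacement_timeout" {print $2}' "$conf")
echo "replacement_timeout is now $timeout"
rm -f "$conf"
```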
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
- Original Message - > From: "Chris Jones - BookIt.com Systems Administrator" > > To: users@ovirt.org > Sent: Thursday, May 21, 2015 12:49:50 AM > Subject: Re: [ovirt-users] oVirt Instability with Dell Compellent via > iSCSI/Multipath > > >> vdsm.log in the node side, will help here too. > > https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log > contains only the messages at and after the point when a host became > unresponsive due to storage issues.

According to the log, you have a real issue accessing storage from the host:

[nsoffer@thin untitled (master)]$ repostat vdsm.log
domain: 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2
  delay      avg: 0.000856 min: 0.00 max: 0.001168
  last check avg: 11.51 min: 0.30 max: 64.10
domain: 64101f40-0f10-471d-9f5f-44591f9e087d
  delay      avg: 0.008358 min: 0.00 max: 0.040269
  last check avg: 11.86 min: 0.30 max: 63.40
domain: 31e97cc8-6a10-4a45-8f25-95eba88b4dc0
  delay      avg: 0.007793 min: 0.000819 max: 0.041316
  last check avg: 11.47 min: 0.00 max: 70.20
domain: 842edf83-22c6-46cd-acaa-a1f76d61e545
  delay      avg: 0.000493 min: 0.000374 max: 0.000698
  last check avg: 4.86 min: 0.20 max: 9.90
domain: b050c455-5ab1-4107-b055-bfcc811195fc
  delay      avg: 0.002080 min: 0.00 max: 0.040142
  last check avg: 11.83 min: 0.00 max: 63.70
domain: c46adffc-614a-4fa2-9d2d-954f174f6a39
  delay      avg: 0.004798 min: 0.00 max: 0.041006
  last check avg: 18.42 min: 1.40 max: 102.90
domain: 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7
  delay      avg: 0.001002 min: 0.00 max: 0.001199
  last check avg: 11.56 min: 0.30 max: 61.70
domain: 20153412-f77a-4944-b252-ff06a78a1d64
  delay      avg: 0.003748 min: 0.00 max: 0.040903
  last check avg: 12.18 min: 0.00 max: 67.20
domain: 26929b89-d1ca-4718-90d6-b3a6da585451
  delay      avg: 0.000963 min: 0.00 max: 0.001209
  last check avg: 10.99 min: 0.00 max: 64.30
domain: 0137183b-ea40-49b1-b617-256f47367280
  delay      avg: 0.000881 min: 0.00 max: 0.001227
  last check avg: 11.086667 min: 0.10 max: 63.20

Note the high last check maximum value (e.g. 102 seconds).
Vdsm has a monitor thread for each domain, doing a read from one of the storage domain's special disks every 10 seconds. When we see a high last check value, it means that the monitor thread is stuck reading from the disk. This is an indicator that vms may have trouble accessing this storage domain, and engine is handling this by making the host non-operational, or, if all hosts cannot access the domain, making the domain inactive.

One of the known issues that can be related is bad multipath configuration. Some storage servers have bad builtin configuration embedded into multipath. In particular, using "no_path_retry queue", or "no_path_retry 60". This setting means that when the SCSI layer fails, and multipath does not have any active path, it will queue io forever (queue), or retry many times (e.g. 60) before failing the io request. This can lead to stuck processes, doing a read or write that never fails or takes many minutes to fail. Vdsm is not designed to handle such delays - a stuck thread may block other unrelated threads.

Vdsm includes special configuration for your storage vendor (COMPELNT), but maybe it does not match the product (Compellent Vol). See https://github.com/oVirt/vdsm/blob/master/lib/vdsm/tool/configurators/multipath.py#L57

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    no_path_retry fail
}

Another issue may be that the settings for COMPELNT/Compellent Vol are wrong; the setting we ship is missing a lot of settings that exist in the builtin setting, and this may have a bad effect. If your devices match this, I would try this multipath configuration, instead of the one vdsm configures.

device {
    vendor "COMPELNT"
    product "Compellent Vol"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "0"
    prio "const"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry fail
}

To verify that your devices match this, you can check the devices' vendor and product strings in the output of "multipath -ll".
I would like to see the output of this command.

Another platform issue is a bad default SCSI node.session.timeo.replacement_timeout value, which is set to 120 seconds. This setting means that the SCSI layer will wait 120 seconds for io to complete on one path before failing the io request.
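The repostat numbers quoted earlier in this thread can be read like this: the monitor does a check every 10 seconds, so any "last check" far beyond that means a stuck monitor thread. A toy illustration (sample values invented, not taken from the attached logs):

```shell
# Toy sketch: given a set of "last check" ages in seconds, the maximum is
# the figure to worry about when it dwarfs the 10s monitoring interval.
checks="0.3 9.8 102.9 11.5 64.1"
max=$(printf '%s\n' $checks | sort -g | tail -n1)   # word-split, sort numerically
echo "max last check: ${max}s"
```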
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
vdsm.log on the node side will help here too.

https://www.dropbox.com/s/zvnttmylmrd0hyx/vdsm.log.gz?dl=0. This log contains only the messages at and after the point when a host became unresponsive due to storage issues.

# rpm -qa | grep -i vdsm might help too.

vdsm-cli-4.16.14-0.el7.noarch
vdsm-reg-4.16.14-0.el7.noarch
ovirt-node-plugin-vdsm-0.2.2-5.el7.noarch
vdsm-python-zombiereaper-4.16.14-0.el7.noarch
vdsm-xmlrpc-4.16.14-0.el7.noarch
vdsm-yajsonrpc-4.16.14-0.el7.noarch
vdsm-4.16.14-0.el7.x86_64
vdsm-gluster-4.16.14-0.el7.noarch
vdsm-hook-ethtool-options-4.16.14-0.el7.noarch
vdsm-python-4.16.14-0.el7.noarch
vdsm-jsonrpc-4.16.14-0.el7.noarch

Hey Chris, please open a bug [1] for this, then we can track it and we can help to identify the issue.

I will do so.
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Sorry for the delay on this. I am in the process of reproducing the error to get the logs.

On 05/19/2015 07:31 PM, Douglas Schilling Landgraf wrote: Hello Chris,

On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:

Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN.
Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this.

2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem.
vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem.
vds: blade4c2.ism.ld 2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer. 2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld My troubleshooting steps so far: 1. Tailing engine.log for "in problem" and "recovered from problem" 2. Shutting down all the VMs. 3. Shutting down all but one node. 4. Bringing up one node at a time to see what the log reports. vdsm.log in the node side, will help here too. When only one node is active everything is fine. When a second node comes up, I begin to see the log output as shown above. I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out if it's oVirt or something that I'm doing wrong. We're cl
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Hi Chris,

I have an oVirt + Dell Compellent setup similar to yours (a previous model, not the SC8000) and I have sometimes faced similar issues. From my experience I can advise you to:

A) Check the links between the SAN and the servers: all paths, all configuration, all cabling. Everything should be set up correctly (all redundant paths green, server mappings, etc.) BEFORE installing oVirt. We had a running KVM environment before "upgrading" it to oVirt 3.5.1.

B) Also check that fencing is working, both manually and automatically (connections to iDRAC, etc.). This is a prerequisite for HA to work.

C) I have also noticed that when something is not going well on one of the shared storage domains, it brings down the whole cluster (the VMs keep running, but with a lot of headaches in the meantime). First of all, note that oVirt tries to stabilize the situation itself, for as long as ~15 minutes or more; it is slow in re-fencing, etc. Sometimes it enters a loop and you have to locate the problematic storage domain yourself. You want to check that multipath is working correctly on every server.

If you are having problems with just two nodes, I guess something is not really OK at the configuration level. I have 2 clusters, 12 hosts, and lots of shared storage domains working, and usually when something goes wrong it is because of a human error (like when I deleted a LUN on the SAN before destroying the storage domain in the oVirt interface). On the other hand, I have the overall impression that the system is not forgiving at all and that it is far from rock solid.

Cheers
AG
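AG's point (C), checking multipath on every server, can be scripted. The sketch below is illustrative only: the `multipath -ll` output in the variable is hypothetical (made-up WWID and paths; the COMPELNT vendor/product string matches the one discussed in this thread), and the threshold of two paths reflects the two-NIC setup described in the original post.

```shell
# Hypothetical `multipath -ll` output standing in for a live host;
# on a real node you would use:  sample=$(multipath -ll)
sample='36000d31000abc00000000000000000a1 dm-2 COMPELNT,Compellent Vol
size=1.0T features=0 hwhandler=0 wp=rw
`-+- policy=round-robin status=active
  |- 7:0:0:1 sdb 8:16 active ready running
  `- 8:0:0:1 sdc 8:32 active ready running'

# Count paths that are fully up; with 2 NICs to the SAN, anything
# below 2 means lost redundancy on this volume.
active_paths=$(printf '%s\n' "$sample" | grep -c 'active ready running')
echo "active paths: $active_paths"
if [ "$active_paths" -lt 2 ]; then
    echo "WARNING: lost path redundancy - check cabling and iSCSI sessions"
fi
```

Run per node (e.g. over ssh in a loop); pairing it with `iscsiadm -m session` gives a quick picture of whether flapping correlates with a dead path.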
Re: [ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Hello Chris,

On 05/19/2015 06:19 PM, Chris Jones - BookIt.com Systems Administrator wrote:
> [setup details and engine.log excerpt trimmed; see the original post below]
>
> My troubleshooting steps so far:
>
> 1. Tailing engine.log for "in problem" and "recovered from problem"
> 2. Shutting down all the VMs.
> 3. Shutting down all but one node.
> 4. Bringing up one node at a time to see what the log reports.

vdsm.log on the node side will help here too.

> [remainder of quoted message trimmed]
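Following the suggestion to look at vdsm.log, a minimal filter sketch is shown below. The two sample lines are placeholders, not real vdsm output, so the pipeline can be demonstrated self-contained; on a node you would grep the actual /var/log/vdsm/vdsm.log instead.

```shell
# Placeholder stand-in for /var/log/vdsm/vdsm.log - these lines are
# hypothetical, not real vdsm output.
log='Thread-13::ERROR::storage domain c46adffc monitor - read failure
Thread-14::DEBUG::task moving from state init to state preparing'

# Keep only ERROR/WARN lines; these are what tend to correlate with
# the engine-side "domain ... in problem" warnings.
matches=$(printf '%s\n' "$log" | grep -cE 'ERROR|WARN')
printf '%s\n' "$log" | grep -E 'ERROR|WARN'
echo "error/warn lines: $matches"
```

On a real node: `grep -E 'ERROR|WARN' /var/log/vdsm/vdsm.log | tail -n 50`, timed against the engine.log warnings.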
[ovirt-users] oVirt Instability with Dell Compellent via iSCSI/Multipath
Engine: oVirt Engine Version: 3.5.2-1.el7.centos
Nodes: oVirt Node - 3.5 - 0.999.201504280931.el7.centos
Remote storage: Dell Compellent SC8000
Storage setup: 2 NICs connected to the Compellent. Several domains backed by LUNs. Several VM disks using direct LUN.
Networking: Dell 10 Gb/s switches

I've been struggling with oVirt completely falling apart due to storage-related issues. By falling apart I mean most to all of the nodes suddenly losing contact with the storage domains. This results in an endless loop of the VMs on the failed nodes being migrated and remigrated as the nodes flap between responsive and unresponsive. During these times, engine.log looks like this:

2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld
2015-05-19 03:09:42,560 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:45,497 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-24) domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 in problem. vds: blade3c2.ism.ld
2015-05-19 03:09:51,713 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-46) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade4c2.ism.ld
2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld
2015-05-19 03:09:57,782 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) domain 26929b89-d1ca-4718-90d6-b3a6da585451:generic_data_1 in problem. vds: blade2c1.ism.ld
2015-05-19 03:09:57,783 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-6) Domain 0b1d36e4-7992-43c7-8ac0-740f7c2cadb7:ovirttest1 recovered from problem. vds: blade2c1.ism.ld
2015-05-19 03:10:00,639 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-31) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade4c1.ism.ld
2015-05-19 03:10:00,703 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) domain 64101f40-0f10-471d-9f5f-44591f9e087d:logging_1 in problem. vds: blade1c1.ism.ld
2015-05-19 03:10:00,712 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-4) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade3c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,931 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain 05c8fa9c-fcbf-4a17-a3c6-011696a1b9a2:ovirttest2 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:06,932 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 recovered from problem. vds: blade4c2.ism.ld
2015-05-19 03:10:06,933 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) Domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2015-05-19 03:10:09,929 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-16) domain b050c455-5ab1-4107-b055-bfcc811195fc:os_data_1 in problem. vds: blade3c1.ism.ld

My troubleshooting steps so far:

1. Tailing engine.log for "in problem" and "recovered from problem"
2. Shutting down all the VMs.
3. Shutting down all but one node.
4. Bringing up one node at a time to see what the log reports.

When only one node is active, everything is fine. When a second node comes up, I begin to see the log output shown above.

I've been struggling with this for over a month. I'm sure others have used oVirt with a Compellent and encountered (and worked around) similar problems. I'm looking for some help in figuring out whether it's oVirt or something that I'm doing wrong. We're close to giving up on oVirt completely because of this.

P.S. I've tested via bare metal and Proxmox with the Compellent. Not at the same scale, but it seems to work fine there.
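Troubleshooting step 1 above (tailing engine.log for "in problem" / "recovered from problem") can be turned into a per-domain flap count. A sketch: the two log lines below are copied from the excerpt in the post; on the engine host you would feed the real /var/log/ovirt-engine/engine.log through the same pipeline instead.

```shell
# Two engine.log lines taken from the excerpt above.
log='2015-05-19 03:09:42,443 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-50) domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 in problem. vds: blade6c1.ism.ld
2015-05-19 03:09:57,647 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-13) Domain c46adffc-614a-4fa2-9d2d-954f174f6a39:db_binlog_1 recovered from problem. vds: blade6c1.ism.ld'

# Pull the domain name out of the UUID:name field after "domain"/"Domain"
# and tally problem vs. recovery events per domain.
summary=$(printf '%s\n' "$log" |
  awk '/in problem|recovered from problem/ {
         for (i = 1; i < NF; i++)
           if ($i == "domain" || $i == "Domain") { split($(i+1), a, ":"); name = a[2] }
         print name, (($0 ~ /in problem/) ? "in_problem" : "recovered")
       }' | sort | uniq -c)
echo "$summary"
```

A domain whose "in_problem" count keeps climbing while others stay quiet points at one LUN (or one tier, as it turned out here) rather than a host-wide problem.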