Today I got an alert from our monitoring system that my EL7 MariaDB VM was down. I looked in oVirt and it showed the VM as "paused due to storage I/O problem". That VM is currently the only one attached to disks on an iSCSI storage domain, and that storage domain uses iSER (over DDR InfiniBand).
My deployment consists of two oVirt nodes, both running CentOS 6.5 with oVirt 3.4.3. The host that was running this VM has 16 running VMs in total. I looked at the VDSM logs on the hypervisor and found entries containing "No free file handlers in pool" (logs below). A quick Google search turned up old posts suggesting it was a ulimit problem. Right now, after resuming the paused VM, I see this on the VDSM host:

# lsof -u vdsm | wc -l
2377
# sudo -H -u vdsm -s ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1032295
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 12288
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

If the issue was the number of open files, is it the case that pausing the only VM using the iSCSI storage domain released the file descriptors? How do I go about finding the cause, so I can prevent future occurrences? Is the fix simply to increase the "nofile" ulimit for vdsm in /etc/security/limits.d/99-vdsm.conf?
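One thing I want to rule out first (a rough sketch, assuming a stock EL6 /proc layout and base-install pgrep/awk): `lsof -u vdsm` counts open files across all of vdsm's processes combined, but the nofile limit is enforced per process, and the file handlers in that pool are separate helper processes. Comparing each process's own fd count against its own soft limit should show whether any single process is actually close to 12288:

```shell
#!/bin/sh
# For every process owned by vdsm, print: PID, open-fd count versus its
# own per-process "Max open files" soft limit, and the start of its
# command line. RLIMIT_NOFILE is per process, so an aggregate
# "lsof -u vdsm" count can look high while no single process is
# anywhere near its limit.
for pid in $(pgrep -u vdsm); do
    [ -d "/proc/$pid/fd" ] || continue             # process may have exited
    fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    cmd=$(tr '\0' ' ' < "/proc/$pid/cmdline" | cut -c1-60)
    printf '%6s  %6s/%-6s  %s\n' "$pid" "$fds" "$limit" "$cmd"
done
```

My thinking: if one of the remoteFileHandler helpers is pinned at its limit, then raising nofile in /etc/security/limits.d/99-vdsm.conf looks like the right mitigation; if none are, the pool exhaustion more likely means all the handlers were blocked on unresponsive storage rather than out of descriptors. Does that reasoning hold?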
Thanks,
- Trey

LOGS:

Sep 12 12:10:49 ovirtnode01 vdsm TaskManager.Task ERROR Task=`f467714c-8948-4cc3-94af-efba6b67701c`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3055, in getVolumeSize
    apparentsize = str(dom.getVSize(imgUUID, volUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 314, in getVSize
    return self.oop.os.stat(volPath).st_size
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 312, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Sep 12 12:10:49 ovirtnode01 vdsm vm.Vm ERROR vmId=`967dce86-63c2-412a-97c5-d7c6f1af8dfb`::Unable to update the volume 92c6cfd0-c236-44ab-894d-cb80421dd865 (domain: 6da59b00-0de7-4219-960b-d581b27052b5 image: 06060f34-12c5-4396-bfe3-ad0f1d4b03fc) for the drive vda
Sep 12 12:10:54 ovirtnode01 sanlock[3379]: 2014-09-12 12:10:54-0500 1773913 [5185]: s1 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data/6da59b00-0de7-4219-960b-d581b27052b5/dom_md/ids
Sep 12 12:10:54 ovirtnode01 sanlock[3379]: 2014-09-12 12:10:54-0500 1773913 [5185]: s1 renewal error -202 delta_length 10 last_success 1773883
<repeats a few times>
Sep 12 12:10:58 ovirtnode01 vdsm vm.Vm ERROR vmId=`151972dc-0025-470e-bdfe-38d3b085a63c`::Unable to update the volume 1b589b1b-aeb8-4a7a-a005-d5042de47f36 (domain: 6da59b00-0de7-4219-960b-d581b27052b5 image: 5d9456a3-12e1-4e50-b899-b12430a4fdb9) for the drive vda
Sep 12 12:10:58 ovirtnode01 vdsm vm.Vm ERROR vmId=`b15f90ff-f359-447a-b309-2cf64d01d0ce`::Unable to update the volume 70a0fee4-2dd3-4460-94da-9ddbfb66845b (domain: 6da59b00-0de7-4219-960b-d581b27052b5 image: 4e133730-f549-4b49-a8e9-baedaca4c1f1) for the drive vda
<snip>

Then lines like this:

Sep 12 12:11:49 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:49-0500 1773968 [5185]: s1 delta_renew read rv -2 offset 0 /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data/6da59b00-0de7-4219-960b-d581b27052b5/dom_md/ids
Sep 12 12:11:49 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:49-0500 1773968 [5185]: s1 renewal error -2 delta_length 10 last_success 1773883
Sep 12 12:11:49 ovirtnode01 vdsm TaskManager.Task ERROR Task=`b39a6636-451e-409c-9a81-7392000206d3`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3055, in getVolumeSize
    apparentsize = str(dom.getVSize(imgUUID, volUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 314, in getVSize
    return self.oop.os.stat(volPath).st_size
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 297, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 150, in _recvAll
    raise Timeout()
Timeout
Sep 12 12:11:49 ovirtnode01 vdsm vm.Vm ERROR vmId=`ca242da9-b7ad-41e2-a26a-db2fea2e49d7`::Unable to update the volume 03baf076-ef59-41f6-9cab-8cd64cd89307 (domain: 6da59b00-0de7-4219-960b-d581b27052b5 image: d1960266-0a06-4b2a-a15d-141612aff740) for the drive vda
Sep 12 12:11:59 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:59-0500 1773978 [5185]: 6da59b00 close_task_aio 0 0x7f3dfc0008c0 busy
Sep 12 12:11:59 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:59-0500 1773978 [5185]: 6da59b00 close_task_aio 1 0x7f3dfc000910 busy
Sep 12 12:11:59 ovirtnode01 sanlock[3379]: 2014-09-12 12:11:59-0500 1773978 [5185]: 6da59b00 close_task_aio 2 0x7f3dfc000960 busy
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users