Meital,

I'm running the latest stable oVirt, 3.3.3, on CentOS 6.5. The nodes are deployments of the node ISO, CentOS 6 "oVirt Node - 3.0.1 - 1.0.2.el6".

I have no way of reproducing this just yet. I can confirm that it happens on all nodes in the cluster, and that this error pops up every time a node goes offline. Could the fact that lockd & statd were not running on the NFS host cause this error? Is there a known workaround?
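In case it matters for the lockd/statd angle: this is roughly how I'd bring both daemons up on the FreeBSD storage box. A sketch only, assuming the stock rc(8) services; the last line just confirms they registered with the portmapper:

    # On the FreeBSD NFS host: enable NLM (lockd) and NSM (statd) at boot
    echo 'rpc_lockd_enable="YES"' >> /etc/rc.conf
    echo 'rpc_statd_enable="YES"' >> /etc/rc.conf

    # Start them now; rpcbind must already be running for NFSv3
    service statd start
    service lockd start

    # Verify both are registered with the portmapper
    rpcinfo -p | egrep 'nlockmgr|status'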
On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <mbour...@redhat.com> wrote:

> Hi Johan,
>
> Please take a look at this error (from vdsm.log):
>
> Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
> Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
> Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
> Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
> Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
> Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
> Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
> Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
>
> And then, a few seconds later, you can see:
>
> MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
>
> Meaning that vdsm was restarted.
>
> Which oVirt version are you using?
> I see a few old bugs that describe the same behaviour, but with different reproduction steps, for example [1], [2].
> Can you think of any reproduction steps that might be causing this issue?
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
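That "No free file handlers in pool" abort reads as if vdsm's remote file handler pool drains while the NFS mount hangs, and vdsm ends up being restarted when it can't recover. If pool exhaustion itself is part of the problem, would raising the pool limits in /etc/vdsm/vdsm.conf be a reasonable stopgap? A sketch of what I mean; I took these option names from vdsm's config defaults as I understand them and have not verified them against 4.12.1, so treat the values as illustrative:

    [irs]
    # Size of the out-of-process file handler pool (assumed default: 100)
    process_pool_size = 200
    # Handlers a single storage domain may hold at once (assumed default: 10)
    process_pool_max_slots_per_domain = 20

    # Then restart vdsm on the node for the change to take effect:
    #   service vdsmd restart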
> ------------------------------
>
> From: "Johan Kooijman" <m...@johankooijman.com>
> To: "users" <users@ovirt.org>
> Sent: Tuesday, February 18, 2014 1:32:56 PM
> Subject: [Users] Nodes lose storage at random
>
> Hi All,
>
> We're seeing some weird issues in our oVirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS).
>
> Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don't lose their storage at that moment; it's just one, or two at a time.
>
> We've set up extra tooling to verify the storage performance at those moments and its availability to other systems. It's always online; the nodes just don't think so.
>
> The engine tells me this:
>
> 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
> 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
> 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
> 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
>
> The export and data domains live on NFS. There's another domain, ISO, that lives on the engine machine, also shared over NFS; that domain has no issues at all.
>
> Attached are the logfiles for the relevant time period, for both the engine server and the node. The node, by the way, is a deployment of the node ISO, not a full-blown installation.
>
> Any clues on where to begin searching? The NFS server shows no issues and has nothing in its logs. I did notice that the statd and lockd daemons were not running, but I wonder whether that can have anything to do with the issue.
>
> --
> Met vriendelijke groeten / With kind regards,
> Johan Kooijman
>
> m...@johankooijman.com
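For reference, the "extra tooling" mentioned above is essentially a probe of this kind, run from a client outside the cluster. A minimal sketch; the mount point and log path are placeholders:

    #!/bin/sh
    # Minimal NFS liveness probe: time a create/remove round trip every 5 seconds.
    MNT=/mnt/nfs-probe          # placeholder: independent mount of the same export
    LOG=/var/log/nfs-probe.log  # placeholder

    while true; do
        start=$(date +%s)
        if touch "$MNT/.probe.$$" && rm -f "$MNT/.probe.$$"; then
            echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') ok $(( $(date +%s) - start ))s" >> "$LOG"
        else
            echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') FAIL" >> "$LOG"
        fi
        sleep 5
    done

During every incident so far this probe kept reporting the export as reachable, which is why I suspect the problem sits on the node side rather than on the filer.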
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E m...@johankooijman.com