Hi Johan,

Can you please run something like this on the SPM node?

while true; do echo `date; ps ax | grep -i '[r]emotefilehandler' | wc -l` >> /tmp/handler_num.txt; sleep 1; done

(The [r] in the grep pattern keeps the grep process itself out of the count.)
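Or, if it's easier, a slightly expanded sketch of the same idea (untested; it assumes bash and the stock ps/grep on the node, and folds in the vdsm.conf check requested below):

#!/bin/bash
# Sketch: once a second, log a timestamped count of remoteFileHandler
# processes. The [r] in the pattern keeps grep from counting itself.
OUT=/tmp/handler_num.txt

# Record the pool-size setting once, for context (defaults to 10 if unset).
grep process_pool_max_slots_per_domain /etc/vdsm/vdsm.conf >> "$OUT" 2>/dev/null \
  || echo "process_pool_max_slots_per_domain not set (default 10)" >> "$OUT"

while true; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') $(ps ax | grep -ic '[r]emotefilehandler')" >> "$OUT"
  sleep 1
done

The maximum can then be pulled out afterwards with something like: sort -n -k3 /tmp/handler_num.txt | tail -1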
When it happens again, please stop the script and write here the maximum number and the time it happened. Also, please check whether "process_pool_max_slots_per_domain" is defined in /etc/vdsm/vdsm.conf, and if so, what the value is. (If it's not defined there, the default is 10.)

Thanks!

----- Original Message -----
> From: "Johan Kooijman" <m...@johankooijman.com>
> To: "Meital Bourvine" <mbour...@redhat.com>
> Cc: "users" <users@ovirt.org>
> Sent: Tuesday, February 18, 2014 2:55:11 PM
> Subject: Re: [Users] Nodes lose storage at random
>
> To follow up on this: the setup has only ~80 VMs active right now. The two
> bug reports are not in scope for this setup; the issues occur at random, even
> when there's no activity (creating/deleting VMs), and there are only 4
> directories in /rhev/data-center/mnt/.
>
> On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman <m...@johankooijman.com> wrote:
>
> > Meital,
> >
> > I'm running the latest stable oVirt, 3.3.3 on CentOS 6.5. For my nodes I
> > use the node ISO, CentOS 6 "oVirt Node - 3.0.1 - 1.0.2.el6".
> >
> > I have no way of reproducing it just yet. I can confirm that it's happening
> > on all nodes in the cluster, and every time a node goes offline this error
> > pops up.
> >
> > Could the fact that lockd & statd were not running on the NFS host cause
> > this error? Is there a workaround available that we know of?
> >
> > On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine <mbour...@redhat.com> wrote:
> >
> > > Hi Johan,
> > >
> > > Please take a look at this error (from vdsm.log):
> > >
> > > Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -> state preparing
> > > Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
> > > Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
> > > Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
> > > Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False)
> > > Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
> > > Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100
> > > Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool
> > > And then you can see after a few seconds:
> > >
> > > MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64)
> > >
> > > Meaning that vdsm was restarted.
> > >
> > > Which oVirt version are you using?
> > >
> > > I see that there are a few old bugs that describe the same behaviour, but
> > > with different reproduction steps, for example [1], [2].
> > >
> > > Can you think of any reproduction steps that might be causing this issue?
> > >
> > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
> > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
> > >
> > > > From: "Johan Kooijman" <m...@johankooijman.com>
> > > > To: "users" <users@ovirt.org>
> > > > Sent: Tuesday, February 18, 2014 1:32:56 PM
> > > > Subject: [Users] Nodes lose storage at random
> > > >
> > > > Hi All,
> > > >
> > > > We're seeing some weird issues in our oVirt setup. We have 4 nodes
> > > > connected and an NFS (v3) filestore (FreeBSD/ZFS).
> > > >
> > > > Once in a while, seemingly at random, a node loses its connection to
> > > > storage and recovers it a minute later. The other nodes usually don't
> > > > lose their storage at that moment; just one, or two at a time.
> > > >
> > > > We've set up extra tooling to verify the storage performance at those
> > > > moments and its availability to other systems. It's always online; the
> > > > nodes just don't think so.
> > > >
> > > > The engine tells me this:
> > > >
> > > > 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
> > > > 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5
> > > > 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
> > > > 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.).
> > > >
> > > > The export and data domains live over NFS. There's another domain, ISO,
> > > > that lives on the engine machine, also shared over NFS. That domain
> > > > doesn't have any issues at all.
> > > >
> > > > Attached are the logfiles for the relevant time period for both the
> > > > engine server and the node. The node, by the way, is a deployment of
> > > > the node ISO, not a full-blown installation.
> > > >
> > > > Any clues on where to begin searching? The NFS server shows no issues,
> > > > nor anything in its logs. I did notice that the statd and lockd daemons
> > > > were not running, but I wonder whether that can have anything to do
> > > > with the issue.
> > > >
> > > > --
> > > > Met vriendelijke groeten / With kind regards,
> > > > Johan Kooijman
> > > > m...@johankooijman.com
> >
> > --
> > Met vriendelijke groeten / With kind regards,
> > Johan Kooijman
> > T +31(0) 6 43 44 45 27
> > F +31(0) 162 82 00 01
> > E m...@johankooijman.com
>
> --
> Met vriendelijke groeten / With kind regards,
> Johan Kooijman
> T +31(0) 6 43 44 45 27
> F +31(0) 162 82 00 01
> E m...@johankooijman.com
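On the lockd & statd question above: whether both services are registered on the NFS host can be verified with rpcinfo. A minimal check (a sketch; <nfs-server> is a placeholder for the FreeBSD host's name or address):

rpcinfo -p <nfs-server> | egrep 'nlockmgr|status'

NFSv3 relies on nlockmgr (lockd) for file locking and status (statd) for lock recovery after reboots; if neither shows up, locking calls from the clients have nothing to talk to.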
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users