BTW, could you paste the output of the run_probes command once it finishes?
On Wed, Jul 30, 2014 at 4:58 PM, Ruben S. Montero <rsmont...@opennebula.org> wrote:
> This seems to be a bug: when collectd does not respond (because it is
> waiting for a sudo password), OpenNebula does not move the host to ERROR.
> The probes are designed not to start another collectd process, but we
> should probably check that a running one is actually working and send the
> ERROR message to OpenNebula.
>
> Pointer to the issue:
> http://dev.opennebula.org/issues/3118
>
> Cheers
>
> On Wed, Jul 30, 2014 at 4:53 PM, Steven Timm <t...@fnal.gov> wrote:
>
>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>
>>> Hi,
>>> 1.- monitor_ds.sh may use LVM commands (vgdisplay) that need sudo
>>> access. This should be set up automatically by the opennebula node
>>> packages.
>>>
>>> 2.- It is not a real daemon; the first time a host is monitored, a
>>> process is left behind to periodically send information. OpenNebula
>>> restarts it if no information is received within 3 monitor steps.
>>> Nothing needs to be set up...
>>>
>>> Cheers
>>
>> On further inspection I found that this collectd was running on my
>> nodes, and had obviously been failing up until now because sudoers was
>> not set up correctly. But there was nothing to warn us about it.
>> Nothing on the opennebula head node to even tell us that the
>> information was stale. No log file on the node to show the errors we
>> were getting. In short, it was just quietly dying and we had no idea.
>> How can we make sure this doesn't happen again in the future?
>>
>> Steve Timm
>>
>>> On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <t...@fnal.gov> wrote:
>>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>>
>>> Maybe you could try to execute the monitor probes on the node:
>>>
>>> 1. ssh to the node
>>> 2. Go to /var/tmp/one/im
>>> 3.
Execute run_probes kvm-probes
>>>
>>> When I do that (using sh -x) I get the following:
>>>
>>> -bash-4.1$ sh -x ./run_probes kvm-probes
>>> ++ dirname ./run_probes
>>> + source ./../scripts_common.sh
>>> ++ export LANG=C
>>> ++ LANG=C
>>> ++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>> ++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>> ++ AWK=awk
>>> ++ BASH=bash
>>> ++ CUT=cut
>>> ++ DATE=date
>>> ++ DD=dd
>>> ++ DF=df
>>> ++ DU=du
>>> ++ GREP=grep
>>> ++ ISCSIADM=iscsiadm
>>> ++ LVCREATE=lvcreate
>>> ++ LVREMOVE=lvremove
>>> ++ LVRENAME=lvrename
>>> ++ LVS=lvs
>>> ++ LN=ln
>>> ++ MD5SUM=md5sum
>>> ++ MKFS=mkfs
>>> ++ MKISOFS=genisoimage
>>> ++ MKSWAP=mkswap
>>> ++ QEMU_IMG=qemu-img
>>> ++ RADOS=rados
>>> ++ RBD=rbd
>>> ++ READLINK=readlink
>>> ++ RM=rm
>>> ++ SCP=scp
>>> ++ SED=sed
>>> ++ SSH=ssh
>>> ++ SUDO=sudo
>>> ++ SYNC=sync
>>> ++ TAR=tar
>>> ++ TGTADM=tgtadm
>>> ++ TGTADMIN=tgt-admin
>>> ++ TGTSETUPLUN=tgt-setup-lun-one
>>> ++ TR=tr
>>> ++ VGDISPLAY=vgdisplay
>>> ++ VMKFSTOOLS=vmkfstools
>>> ++ WGET=wget
>>> +++ uname -s
>>> ++ '[' xLinux = xLinux ']'
>>> ++ SED='sed -r'
>>> +++ basename ./run_probes
>>> ++ SCRIPT_NAME=run_probes
>>> + export LANG=C
>>> + LANG=C
>>> + HYPERVISOR_DIR=kvm-probes.d
>>> + ARGUMENTS=kvm-probes
>>> ++ dirname ./run_probes
>>> + SCRIPTS_DIR=.
>>> + cd .
>>> ++ '[' -d kvm-probes.d ']'
>>> ++ run_dir kvm-probes.d
>>> ++ cd kvm-probes.d
>>> +++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
>>> ++ for i in '`ls *`'
>>> ++ '[' -x architecture.sh ']'
>>> ++ ./architecture.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x collectd-client-shepherd.sh ']'
>>> ++ ./collectd-client-shepherd.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x cpu.sh ']'
>>> ++ ./cpu.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x kvm.rb ']'
>>> ++ ./kvm.rb kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x monitor_ds.sh ']'
>>> ++ ./monitor_ds.sh kvm-probes
>>> [sudo] password for oneadmin:
>>>
>>> and it stays hung at the password prompt for oneadmin.
>>>
>>> What's going on?
>>>
>>> Also, you mentioned a collectd -- are you saying that OpenNebula 4.6
>>> now needs to run a daemon on every single VM host? Where is it
>>> documented how to set it up?
>>>
>>> Steve
>>>
>>> Make sure you do not have a host using the same hostname fgtest14 and
>>> running a collectd process
>>>
>>> On Jul 29, 2014 4:35 PM, "Steven Timm" <t...@fnal.gov> wrote:
>>>
>>> I am still trying to debug a nasty monitoring inconsistency.
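[The hang at the [sudo] prompt above is the sudoers misconfiguration discussed earlier in this thread: oneadmin cannot run the LVM commands (vgdisplay) without a password. A sketch of the kind of sudoers entry involved follows; this is an assumption for illustration only -- the opennebula-node packages normally install their own version, and the exact command list and paths vary by distro and package version:

```
# /etc/sudoers.d/opennebula (illustrative; check what your
# opennebula-node package actually ships)
oneadmin ALL=(ALL) NOPASSWD: /sbin/vgdisplay
```

A quick way to verify the setup is to run `sudo -n vgdisplay` as oneadmin: with `-n` (non-interactive), sudo fails immediately instead of prompting, so a misconfiguration shows up as an error rather than a hung probe.]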
>>>
>>> -bash-4.1$ onevm list | grep fgtest14
>>>   26 oneadmin oneadmin fgt6x4-26    runn  6  4G   fgtest14 117d 19h50
>>>   27 oneadmin oneadmin fgt5x4-27    runn 10  4G   fgtest14 117d 17h57
>>>   28 oneadmin oneadmin fgt1x1-28    runn 10  4.1G fgtest14 117d 16h59
>>>   30 oneadmin oneadmin fgt5x1-30    runn  0  4G   fgtest14 116d 23h50
>>>   33 oneadmin oneadmin ip6sl5vda-33 runn  6  4G   fgtest14 116d 19h57
>>> -bash-4.1$ onehost list
>>>  ID NAME     CLUSTER RVM ALLOCATED_CPU   ALLOCATED_MEM     STAT
>>>   3 fgtest11 ipv6      0 0 / 400 (0%)    0K / 15.7G (0%)   on
>>>   4 fgtest12 ipv6      0 0 / 400 (0%)    0K / 15.7G (0%)   on
>>>   7 fgtest13 ipv6      0 0 / 800 (0%)    0K / 23.6G (0%)   on
>>>   8 fgtest14 ipv6      5 0 / 800 (0%)    0K / 23.6G (0%)   on
>>>   9 fgtest20 ipv6      3 300 / 800 (37%) 12G / 31.4G (38%) on
>>>  11 fgtest19 ipv6      0 0 / 800 (0%)    0K / 31.5G (0%)   on
>>> -bash-4.1$ onehost show 8
>>> HOST 8 INFORMATION
>>> ID                   : 8
>>> NAME                 : fgtest14
>>> CLUSTER              : ipv6
>>> STATE                : MONITORED
>>> IM_MAD               : kvm
>>> VM_MAD               : kvm
>>> VN_MAD               : dummy
>>> LAST MONITORING TIME : 07/29 09:25:45
>>>
>>> HOST SHARES
>>> TOTAL MEM            : 23.6G
>>> USED MEM (REAL)      : 876.4M
>>> USED MEM (ALLOCATED) : 0K
>>> TOTAL CPU            : 800
>>> USED CPU (REAL)      : 0
>>> USED CPU (ALLOCATED) : 0
>>> RUNNING VMS          : 5
>>>
>>> LOCAL SYSTEM DATASTORE #102 CAPACITY
>>> TOTAL:               : 548.8G
>>> USED:                : 175.3G
>>> FREE:                : 345.6G
>>>
>>> MONITORING INFORMATION
>>> ARCH="x86_64"
>>> CPUSPEED="2992"
>>> HOSTNAME="fgtest14.fnal.gov"
>>> HYPERVISOR="kvm"
>>> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
>>> NETRX="234844577"
>>> NETTX="21553126"
>>> RESERVED_CPU=""
>>> RESERVED_MEM=""
>>> VERSION="4.6.0"
>>>
>>> VIRTUAL MACHINES
>>>
>>>   ID USER     GROUP    NAME         STAT UCPU UMEM HOST     TIME
>>>   26 oneadmin oneadmin fgt6x4-26    runn    6 4G   fgtest14 117d 19h50
>>>   27 oneadmin oneadmin fgt5x4-27    runn   10 4G   fgtest14 117d 17h57
>>>   28 oneadmin oneadmin fgt1x1-28    runn   10 4.1G fgtest14 117d 17h00
>>>   30 oneadmin oneadmin fgt5x1-30    runn    0 4G   fgtest14 116d
23h50
>>>   33 oneadmin oneadmin ip6sl5vda-33 runn    6 4G   fgtest14 116d 19h57
>>> -----------------------------------------------------------------------
>>>
>>> All of this looks great, right? Just one problem: there are no VMs
>>> running on fgtest14, and there haven't been for 4 days.
>>>
>>> [root@fgtest14 ~]# virsh list
>>>  Id Name State
>>> ----------------------------------------------------
>>>
>>> [root@fgtest14 ~]#
>>>
>>> -----------------------------------------------------------------------
>>> Yet the monitoring reports no errors:
>>>
>>> Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8) successfully
>>> monitored.
>>>
>>> -----------------------------------------------------------------------
>>> At the same time, there is no evidence that ONE is actually trying to
>>> monitor these five VMs, or succeeding, yet they are still stuck in
>>> "runn", which means I can't do a onevm restart to restart them. (The
>>> VM images of these 5 VMs are still out there on the VM host, and I
>>> would like to save and restart them if I can.)
>>>
>>> What is the remotes command that ONE 4.6 would use to monitor this
>>> host? Can I do it manually and see what output I get?
>>>
>>> Are we dealing with some kind of a bug, or just a very confused
>>> system? Any help is appreciated. I have to get this sorted out before
>>> I dare deploy one4.x in production.
>>>
>>> Steve Timm
>>>
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D (630) 840-8525
>>> t...@fnal.gov http://home.fnal.gov/~timm/
>>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>>> Grid and Cloud Services Dept., Associate Dept.
>>> Head for Cloud Computing
>>> _______________________________________________
>>> Users mailing list
>>> Users@lists.opennebula.org
>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D (630) 840-8525
>>> t...@fnal.gov http://home.fnal.gov/~timm/
>>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>>>
>>> --
>>> Ruben S. Montero, PhD
>>> Project co-Lead and Chief Architect
>>> OpenNebula - Flexible Enterprise Cloud Made Simple
>>> www.OpenNebula.org | rsmont...@opennebula.org | @OpenNebula
>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D (630) 840-8525
>> t...@fnal.gov http://home.fnal.gov/~timm/
>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>
> --
> Ruben S. Montero, PhD
> Project co-Lead and Chief Architect
> OpenNebula - Flexible Enterprise Cloud Made Simple
> www.OpenNebula.org | rsmont...@opennebula.org | @OpenNebula

--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | rsmont...@opennebula.org | @OpenNebula
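[The staleness rule described in this thread -- collectd is restarted if no information is received within 3 monitor steps -- can be sketched as a standalone check for stale hosts. This is an illustration, not shipped OpenNebula code; it assumes GNU date and that MONITORING_INTERVAL in oned.conf is 60 seconds (the usual default -- verify yours):

```shell
# Sketch: flag a host as stale when its last monitoring time is older
# than 3 monitoring steps. Assumptions: GNU date (-d, +%s) and a
# 60-second MONITORING_INTERVAL; adjust both for your installation.
INTERVAL=${MONITORING_INTERVAL:-60}

is_stale() {
    # $1 = LAST MONITORING TIME (as shown by `onehost show`) converted
    # to a UNIX epoch; returns success (0) when the host looks stale
    local now
    now=$(date +%s)
    [ $(( now - $1 )) -gt $(( 3 * INTERVAL )) ]
}

# Example: a host last monitored 10 minutes ago is stale at 60 s steps
if is_stale "$(date -d '10 minutes ago' +%s)"; then
    echo "host is stale -- check its collectd-client process on the node"
fi
```

In practice the timestamp would be parsed from `onehost show` output (the `-x` flag of the CLI emits XML, which is easier to parse), and a stale host would be a cue to re-run the probes manually from /var/tmp/one/im as described above.]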
_______________________________________________
Users mailing list
Users@lists.opennebula.org
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org