This looks like a problem from the DB upgrade. See the inconsistency in fgtest14:

<RUNNING_VMS>5</RUNNING_VMS> .... <VMS></VMS>

That is why no action is taken on VM 26: it is not registered in the host (the <VMS> set is empty). I suggest stopping oned and running onedb fsck.

Cheers

On Wed, Jul 30, 2014 at 4:44 PM, Steven Timm <t...@fnal.gov> wrote:
> OK--I have now installed the opennebula-node-kvm rpm on
> all of the VM hosts (SURPRISE), made sure that the collectd
> that is running is the current one from OpenNebula 4.6,
> and verified that run_probes kvm-probes can
> run interactively as oneadmin on all of the nodes. The one on
> fgtest14 correctly reports that there are no running VMs,
> and the two machines that do have running VMs correctly report
> that they do.
>
> The only problem is that the five virtual machines OpenNebula still
> thinks are running on fgtest14 still report back as running,
> even though OpenNebula hasn't made any attempt to monitor them.
>
> How do we get things back into sync and tell OpenNebula that VM #26
> isn't really running anymore? Is there a way to force this VM into
> "unknown" state so we can do a onevm boot on it? Database hackery
> included? Even better, has someone come up with an XML hacker to
> do the XML substitution of one field in the huge mysql field?
>
> Even more important: it's clear that the monitoring was
> failing, and failing for a long time, because we didn't have the
> sudoers file that opennebula-node-kvm provides.
> But there was absolutely no warning of that; as far as the
> head node was concerned we were happy as a clam.
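A minimal sketch of the fsck suggestion above. Nothing is executed here: the script only prints the assumed command sequence. The service name and DB credentials are assumptions for a typical MySQL-backed 4.6 front-end; adjust for your installation and run the real commands as oneadmin.

```shell
# Dry-run sketch: print the assumed recovery sequence rather than run it.
# DB_USER/DB_NAME are assumptions; substitute your real values.
DB_USER=oneadmin
DB_NAME=opennebula

STEPS="service opennebula stop
onedb backup -u $DB_USER -d $DB_NAME
onedb fsck -u $DB_USER -d $DB_NAME
service opennebula start"

# fsck must run while oned is stopped; taking a backup first keeps the
# repair reversible.
printf '%s\n' "$STEPS"
```

The key point is the ordering: stop oned, back up, fsck, then start oned again.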
>
> ----
>
> The important pieces of output from run_probes kvm-probes
>
> fgtest19
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=33010680
> USEDMEMORY=1586216
> FREEMEMORY=31424464
> FREECPU=800.0
> USEDCPU=0.0
> NETRX=5958104400
> NETTX=2323329968
> DS_LOCATION_USED_MB=1924
> DS_LOCATION_TOTAL_MB=280380
> DS_LOCATION_FREE_MB=264129
> DS = [
> ID = 102,
> USED_MB = 1924,
> TOTAL_MB = 280380,
> FREE_MB = 264129
> ]
> HOSTNAME=fgtest19.fnal.gov
> VM_POLL=YES
> VM=[
> ID=55,
> DEPLOY_ID=one-55,
> POLL="NETRX=25289118 USEDCPU=0.0 NETTX=214808 USEDMEMORY=4194304 STATE=a" ]
> VERSION="4.6.0"
>
> fgtest20
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=32875804
> USEDMEMORY=8801100
> FREEMEMORY=24074704
> FREECPU=793.6
> USEDCPU=6.39999999999998
> NETRX=184155823062
> NETTX=58685116817
> DS_LOCATION_USED_MB=50049
> DS_LOCATION_TOTAL_MB=281012
> DS_LOCATION_FREE_MB=216499
> DS = [
> ID = 102,
> USED_MB = 50049,
> TOTAL_MB = 281012,
> FREE_MB = 216499
> ]
> HOSTNAME=fgtest20.fnal.gov
> VM_POLL=YES
> VM=[
> ID=31,
> DEPLOY_ID=one-31,
> POLL="NETRX=71728978887 USEDCPU=0.5 NETTX=54281255903 USEDMEMORY=4270812 STATE=a" ]
> VM=[
> ID=24,
> DEPLOY_ID=one-24,
> POLL="NETRX=2383960153 USEDCPU=0.0 NETTX=17345416 USEDMEMORY=4194304 STATE=a" ]
> VM=[
> ID=48,
> DEPLOY_ID=one-48,
> POLL="NETRX=2546074171 USEDCPU=0.0 NETTX=145782495 USEDMEMORY=4194304 STATE=a" ]
> VERSION="4.6.0"
>
> fgtest14
> ARCH=x86_64
> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
> HYPERVISOR=kvm
> TOTALCPU=800
> CPUSPEED=2992
> TOTALMEMORY=24736796
> USEDMEMORY=937004
> FREEMEMORY=23799792
> FREECPU=800.0
> USEDCPU=0.0
> NETRX=285471609
> NETTX=25467521
> DS_LOCATION_USED_MB=179498
> DS_LOCATION_TOTAL_MB=561999
> DS_LOCATION_FREE_MB=353864
> DS = [
> ID = 102,
> USED_MB = 179498,
> TOTAL_MB = 561999,
> FREE_MB = 353864
> ]
>
> -------------------------
> And the appropriate excerpts from oned.log:
>
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][D]: Restarting VM 26
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:22:05 2014 [DiM][E]: Could not restart VM 26, wrong state.
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [DiM][D]: Stopping VM 26
> /var/log/one/oned.log.20140728111811:Fri Jul 25 15:37:48 2014 [VMM][D]: VM 26 successfully monitored: STATE=-
> -----------------------------------
>
> This is the mysql row in host_pool for host fgtest14
>
> mysql> select * from host_pool where oid=8 \G
> *************************** 1. row ***************************
>  oid: 8
> name: fgtest14
> body: <HOST><ID>8</ID><NAME>fgtest14</NAME><STATE>2</STATE>
> <IM_MAD>kvm</IM_MAD><VM_MAD>kvm</VM_MAD><VN_MAD>dummy</VN_MAD>
> <LAST_MON_TIME>1406731190</LAST_MON_TIME><CLUSTER_ID>101</CLUSTER_ID>
> <CLUSTER>ipv6</CLUSTER><HOST_SHARE><DISK_USAGE>0</DISK_USAGE>
> <MEM_USAGE>0</MEM_USAGE><CPU_USAGE>0</CPU_USAGE>
> <MAX_DISK>561999</MAX_DISK><MAX_MEM>24736796</MAX_MEM><MAX_CPU>800</MAX_CPU>
> <FREE_DISK>353864</FREE_DISK><FREE_MEM>23802216</FREE_MEM><FREE_CPU>800</FREE_CPU>
> <USED_DISK>179498</USED_DISK><USED_MEM>934580</USED_MEM><USED_CPU>0</USED_CPU>
> <RUNNING_VMS>5</RUNNING_VMS>
> <DATASTORES><DS><FREE_MB><![CDATA[353864]]></FREE_MB><ID><![CDATA[102]]></ID>
> <TOTAL_MB><![CDATA[561999]]></TOTAL_MB><USED_MB><![CDATA[179498]]></USED_MB>
> </DS></DATASTORES></HOST_SHARE>
> <VMS></VMS>
> <TEMPLATE><ARCH><![CDATA[x86_64]]></ARCH><CPUSPEED><![CDATA[2992]]></CPUSPEED>
> <HOSTNAME><![CDATA[fgtest14.fnal.gov]]></HOSTNAME><HYPERVISOR><![CDATA[kvm]]></HYPERVISOR>
> <MODELNAME><![CDATA[Intel(R) Xeon(R) CPU E5450 @ 3.00GHz]]></MODELNAME>
> <NETRX><![CDATA[285677608]]></NETRX><NETTX><![CDATA[25489275]]></NETTX>
> <RESERVED_CPU><![CDATA[]]></RESERVED_CPU><RESERVED_MEM><![CDATA[]]></RESERVED_MEM>
> <VERSION><![CDATA[4.6.0]]></VERSION></TEMPLATE></HOST>
>         state: 2
> last_mon_time: 1406731190
>           uid: 0
>           gid: 0
>       owner_u: 1
>       group_u: 0
>       other_u: 0
>           cid: 101
> 1 row in set (0.00 sec)
>
> And this is the row in vm_pool for VM id 26
>
> *************************** 1. row ***************************
>  oid: 26
> name: fgt6x4-26
> body: <VM><ID>26</ID><UID>0</UID><GID>0</GID><UNAME>oneadmin</UNAME>
> <GNAME>oneadmin</GNAME><NAME>fgt6x4-26</NAME>
> <PERMISSIONS><OWNER_U>1</OWNER_U><OWNER_M>1</OWNER_M><OWNER_A>0</OWNER_A>
> <GROUP_U>0</GROUP_U><GROUP_M>0</GROUP_M><GROUP_A>0</GROUP_A>
> <OTHER_U>0</OTHER_U><OTHER_M>0</OTHER_M><OTHER_A>0</OTHER_A></PERMISSIONS>
> <LAST_POLL>1406320668</LAST_POLL><STATE>3</STATE><LCM_STATE>3</LCM_STATE>
> <RESCHED>0</RESCHED><STIME>1396463735</STIME><ETIME>0</ETIME>
> <DEPLOY_ID>one-26</DEPLOY_ID><MEMORY>4194304</MEMORY><CPU>6</CPU>
> <NET_TX>748982286</NET_TX><NET_RX>1588690678</NET_RX>
> <TEMPLATE><AUTOMATIC_REQUIREMENTS><![CDATA[CLUSTER_ID = 101 & !(PUBLIC_CLOUD = YES)]]></AUTOMATIC_REQUIREMENTS>
> <CONTEXT><CTX_USER><![CDATA[PFVTRVI+
> PElEPjA8L0lEPjxHSUQ+MDwvR0lEPjxHUk9VUFM+PElEPjA8L0lEPjwvR1JPVVBTPjxHTk
> FNRT5vbmVhZG1pbjwvR05BTUU+PE5BTUU+b25lYWRtaW48L05BTUU+
> PFBBU1NXT1JEPjFmNjQxYzdlMzZkZWU5MmUzNDQ0Mjk2NmI1OTYwMGJkMGE3
> ZmU5ZDQ8L1BBU1NXT1JEPjxBVVRIX0RSSVZFUj5jb3JlPC9BVVRIX0RSSVZF
> Uj48RU5BQkxFRD4xPC9FTkFCTEVEPjxURU1QTEFURT48VE9LRU5fUEFTU1dPUkQ+
> PCFbQ0RBVEFbNzFhYzU0OWM5MzhmNjA0NmY3NDEzMDI4Y2ZhOGNjODU2YzI2
> ZGNhNV1dPjwvVE9LRU5fUEFTU1dPUkQ+PC9URU1QTEFURT48REFUQVNUT1JFX1
> FVT1RBPjwvREFUQVNUT1JFX1FVT1RBPjxORVRXT1JLX1FVT1RBPjwvTkVUV0
> 9SS19RVU9UQT48Vk1fUVVPVEE+PC9WTV9RVU9UQT48SU1BR0VfUVVPVE
> E+PC9JTUFHRV9RVU9UQT48L1VTRVI+]]></CTX_USER>
> <DISK_ID><![CDATA[2]]></DISK_ID>
> <ETH0_DNS><![CDATA[131.225.0.254]]></ETH0_DNS>
> <ETH0_GATEWAY><![CDATA[131.225.41.200]]></ETH0_GATEWAY>
> <ETH0_IP><![CDATA[131.225.41.169]]></ETH0_IP>
> <ETH0_IPV6><![CDATA[2001:400:2410:29::169]]></ETH0_IPV6>
> <ETH0_MAC><![CDATA[00:16:3e:06:06:04]]></ETH0_MAC>
> <ETH0_MASK><![CDATA[255.255.255.128]]></ETH0_MASK>
> <FILES><![CDATA[/cloud/images/OpenNebula/scripts/one3.2/contextualization/init.sh
> /cloud/images/OpenNebula/scripts/one3.2/contextualization/credentials.sh
> /cloud/images/OpenNebula/scripts/one3.2/contextualization/kerberos.sh]]></FILES>
> <GATEWAY><![CDATA[131.225.41.200]]></GATEWAY>
> <INIT_SCRIPTS><![CDATA[init.sh credentials.sh kerberos.sh]]></INIT_SCRIPTS>
> <IP_PUBLIC><![CDATA[131.225.41.169]]></IP_PUBLIC>
> <NETMASK><![CDATA[255.255.255.128]]></NETMASK>
> <NETWORK><![CDATA[YES]]></NETWORK>
> <ROOT_PUBKEY><![CDATA[id_dsa.pub]]></ROOT_PUBKEY>
> <TARGET><![CDATA[hdc]]></TARGET>
> <USERNAME><![CDATA[opennebula]]></USERNAME>
> <USER_PUBKEY><![CDATA[id_dsa.pub]]></USER_PUBKEY></CONTEXT>
> <CPU><![CDATA[1]]></CPU>
> <DISK><CLONE><![CDATA[NO]]></CLONE><CLONE_TARGET><![CDATA[SYSTEM]]></CLONE_TARGET>
> <CLUSTER_ID><![CDATA[101]]></CLUSTER_ID><DATASTORE><![CDATA[ip6_img_ds]]></DATASTORE>
> <DATASTORE_ID><![CDATA[101]]></DATASTORE_ID><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX>
> <DISK_ID><![CDATA[0]]></DISK_ID><IMAGE><![CDATA[fgt6x4_os]]></IMAGE>
> <IMAGE_ID><![CDATA[5]]></IMAGE_ID><IMAGE_UNAME><![CDATA[oneadmin]]></IMAGE_UNAME>
> <LN_TARGET><![CDATA[SYSTEM]]></LN_TARGET><PERSISTENT><![CDATA[YES]]></PERSISTENT>
> <READONLY><![CDATA[NO]]></READONLY><SAVE><![CDATA[YES]]></SAVE>
> <SIZE><![CDATA[46080]]></SIZE>
> <SOURCE><![CDATA[/var/lib/one//datastores/101/3078b4235100008fbdbf9dff7eea95b1]]></SOURCE>
> <TARGET><![CDATA[vda]]></TARGET><TM_MAD><![CDATA[ssh]]></TM_MAD>
> <TYPE><![CDATA[FILE]]></TYPE></DISK>
> <DISK><DEV_PREFIX><![CDATA[hd]]></DEV_PREFIX><DISK_ID><![CDATA[1]]></DISK_ID>
> <SIZE><![CDATA[5120]]></SIZE><TARGET><![CDATA[vdb]]></TARGET>
> <TYPE><![CDATA[swap]]></TYPE></DISK>
> <FEATURES><ACPI><![CDATA[yes]]></ACPI></FEATURES>
> <GRAPHICS><AUTOPORT><![CDATA[yes]]></AUTOPORT><KEYMAP><![CDATA[en-us]]></KEYMAP>
> <LISTEN><![CDATA[127.0.0.1]]></LISTEN><PORT><![CDATA[5926]]></PORT>
> <TYPE><![CDATA[vnc]]></TYPE></GRAPHICS>
> <MEMORY><![CDATA[4096]]></MEMORY>
> <NIC><BRIDGE><![CDATA[br0]]></BRIDGE><CLUSTER_ID><![CDATA[101]]></CLUSTER_ID>
> <IP><![CDATA[131.225.41.169]]></IP><IP6_LINK><![CDATA[fe80::216:3eff:fe06:604]]></IP6_LINK>
> <MAC><![CDATA[00:16:3e:06:06:04]]></MAC><MODEL><![CDATA[virtio]]></MODEL>
> <NETWORK><![CDATA[Static_IPV6_Public]]></NETWORK><NETWORK_ID><![CDATA[1]]></NETWORK_ID>
> <NETWORK_UNAME><![CDATA[oneadmin]]></NETWORK_UNAME><NIC_ID><![CDATA[0]]></NIC_ID>
> <VLAN><![CDATA[NO]]></VLAN></NIC>
> <OS><ARCH><![CDATA[x86_64]]></ARCH></OS>
> <RAW><DATA><![CDATA[
> <devices>
>   <serial type='pty'>
>     <target port='0'/>
>   </serial>
>   <console type='pty'>
>     <target type='serial' port='0'/>
>   </console>
> </devices>]]></DATA><TYPE><![CDATA[kvm]]></TYPE></RAW>
> <TEMPLATE_ID><![CDATA[6]]></TEMPLATE_ID><VCPU><![CDATA[2]]></VCPU>
> <VMID><![CDATA[26]]></VMID></TEMPLATE>
> <USER_TEMPLATE><ERROR><![CDATA[Fri Jul 25 15:37:48 2014 : Error saving VM state: Could not save one-26 to /var/lib/one/datastores/102/26/checkpoint]]></ERROR>
> <NPTYPE><![CDATA[NPERNLM]]></NPTYPE><RANK><![CDATA[FREEMEMORY]]></RANK>
> <USERVO><![CDATA[test181818]]></USERVO></USER_TEMPLATE>
> <HISTORY_RECORDS><HISTORY><OID>26</OID><SEQ>0</SEQ><HOSTNAME>fgtest14</HOSTNAME>
> <HID>10</HID><CID>101</CID><STIME>1396463752</STIME><ETIME>0</ETIME>
> <VMMMAD>kvm</VMMMAD><VNMMAD>dummy</VNMMAD><TMMAD>ssh</TMMAD>
> <DS_LOCATION>/var/lib/one/datastores</DS_LOCATION><DS_ID>102</DS_ID>
> <PSTIME>1396463752</PSTIME><PETIME>1396465032</PETIME>
> <RSTIME>1396465032</RSTIME><RETIME>0</RETIME>
> <ESTIME>0</ESTIME><EETIME>0</EETIME><REASON>0</REASON><ACTION>0</ACTION>
> </HISTORY></HISTORY_RECORDS></VM>
>  uid: 0
>  gid: 0
> last_poll: 1406320668
>      state: 3
>  lcm_state: 3
>    owner_u: 1
>    group_u: 0
>    other_u: 0
> 1 row in set (0.00 sec)
>
> -------------------------------
>
> On Wed, 30 Jul 2014, Steven Timm wrote:
>
> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>
>>> Not really sure what can be going on... The monitor scripts return the
>>> information of all VMs running in the node. In 4.6 the
>>> monitoring system uses a push approach, through UDP, so you may have the
>>> information being reported by misbehaved monitoring
>>> daemons. Sometimes this may happen in dev environments if you are
>>> resetting the DB, ...
>>>
>>
>> When we ran the update to take this database from ONE 4.4 to ONE 4.6, one
>> host (the aforementioned fgtest14) and one datastore (image store 101) got
>> wiped out of the database. I reinserted them both and restarted
>> OpenNebula.
>>
>> Steve Timm
>>
>>> On Jul 28, 2014 6:32 PM, "Steven Timm" <t...@fnal.gov> wrote:
>>>
>>> I am currently dealing with an unexplained monitoring question
>>> in OpenNebula 4.6 on my development cloud.
>>>
>>> I frequently see OpenNebula report that the status of a ONe
>>> host is "ON" even in the case of a system misconfiguration where,
>>> given the credentials, it is impossible for OpenNebula to
>>> even ssh into the node as oneadmin.
>>>
>>> I've fixed all those instances and restarted OpenNebula,
>>> but OpenNebula still reports a number of VMs
>>> in state "running" even though the node they are running
>>> on was rebooted three days ago and is running no
>>> virtual machines whatsoever.
>>>
>>> I think I could be dealing with database corruption of some type
>>> (generated on the ONE 4.4 -> 4.6 update), or there could
>>> be some problem with the remote scripts on the nodes.
>>> I saw, and I think I fixed, the problems with the database
>>> corruption (namely, one of the hosts and one of the datastores
>>> got knocked out of the database for reasons unknown, and I
>>> re-inserted them).
>>> But in any case there is some
>>> error handling that is not working in the monitoring,
>>> and something is exiting with status 0 that shouldn't be.
>>>
>>> Ideas? Has anyone else seen something like this?
>>>
>>> Steve Timm
>>>
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D  (630) 840-8525
>>> t...@fnal.gov  http://home.fnal.gov/~timm/
>>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>>> _______________________________________________
>>> Users mailing list
>>> Users@lists.opennebula.org
>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>

--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | rsmont...@opennebula.org | @OpenNebula
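On the "XML hackery" question above: before editing anything by hand, the mismatch can at least be detected mechanically from the host's body column. A hedged sketch (the XML string is abbreviated from the fgtest14 host_pool dump in this thread; the sed/grep parsing is illustrative, not an official OpenNebula tool, and onedb fsck remains the supported fix):

```shell
# Hedged sketch: compare the RUNNING_VMS counter a host claims against the
# number of VM ids actually registered in its <VMS> element.
# The body below is abbreviated from the fgtest14 dump; real bodies are longer.
body='<HOST><HOST_SHARE><RUNNING_VMS>5</RUNNING_VMS></HOST_SHARE><VMS></VMS></HOST>'

# Extract the claimed counter and count the <ID> entries inside <VMS>
running=$(printf '%s\n' "$body" | sed -n 's/.*<RUNNING_VMS>\([0-9]*\)<\/RUNNING_VMS>.*/\1/p')
ids=$(printf '%s\n' "$body" | sed -n 's/.*<VMS>\(.*\)<\/VMS>.*/\1/p' | grep -o '<ID>[0-9]*</ID>' | wc -l)

if [ "$running" -ne "$ids" ]; then
  echo "MISMATCH: RUNNING_VMS=$running but $ids VM ids in <VMS>"
fi
```

For the fgtest14 body this reports a mismatch (5 claimed, 0 registered), which is exactly the inconsistency called out at the top of the thread.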
_______________________________________________
Users mailing list
Users@lists.opennebula.org
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org