BTW, could you paste the output of the run_probes command once it finishes?
On Wed, Jul 30, 2014 at 4:58 PM, Ruben S. Montero <rsmont...@opennebula.org> wrote:
> This seems to be a bug: when collectd does not respond (because it is
> waiting for a sudo password), OpenNebula does not move the host to ERROR.
> The probes are designed not to start another collectd process, but we
> should probably check that a running one is actually working and send the
> ERROR message to OpenNebula.
>
> Pointer to the issue:
> http://dev.opennebula.org/issues/3118
>
> Cheers
>
> On Wed, Jul 30, 2014 at 4:53 PM, Steven Timm <t...@fnal.gov> wrote:
>
>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>
>>> Hi,
>>> 1.- monitor_ds.sh may use LVM commands (vgdisplay) that need sudo
>>> access. This should be set up automatically by the opennebula node
>>> packages.
>>>
>>> 2.- It is not a real daemon; the first time a host is monitored, a
>>> process is left behind to periodically send information. OpenNebula
>>> restarts it if no information is received within 3 monitor steps.
>>> Nothing needs to be set up...
>>>
>>> Cheers
>>
>> On further inspection I found that this collectd was running on my
>> nodes, and had obviously been failing up until now because sudoers was
>> not set up correctly. But there was nothing to warn us about it.
>> Nothing on the opennebula head node to even tell us that the
>> information was stale. No log file on the node to show the errors we
>> were getting. In short, it was just quietly dying and we had no idea.
>> How can we make sure this doesn't happen again in the future?
>>
>> Steve Timm
>>
>>> On Wed, Jul 30, 2014 at 3:50 PM, Steven Timm <t...@fnal.gov> wrote:
>>> On Wed, 30 Jul 2014, Ruben S. Montero wrote:
>>>
>>> Maybe you could try to execute the monitor probes on the node:
>>>
>>> 1. ssh to the node
>>> 2. Go to /var/tmp/one/im
>>> 3.
Execute run_probes kvm-probes
>>>
>>> When I do that (using sh -x) I get the following:
>>>
>>> -bash-4.1$ sh -x ./run_probes kvm-probes
>>> ++ dirname ./run_probes
>>> + source ./../scripts_common.sh
>>> ++ export LANG=C
>>> ++ LANG=C
>>> ++ export PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>> ++ PATH=/bin:/sbin:/usr/bin:/usr/krb5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
>>> ++ AWK=awk
>>> ++ BASH=bash
>>> ++ CUT=cut
>>> ++ DATE=date
>>> ++ DD=dd
>>> ++ DF=df
>>> ++ DU=du
>>> ++ GREP=grep
>>> ++ ISCSIADM=iscsiadm
>>> ++ LVCREATE=lvcreate
>>> ++ LVREMOVE=lvremove
>>> ++ LVRENAME=lvrename
>>> ++ LVS=lvs
>>> ++ LN=ln
>>> ++ MD5SUM=md5sum
>>> ++ MKFS=mkfs
>>> ++ MKISOFS=genisoimage
>>> ++ MKSWAP=mkswap
>>> ++ QEMU_IMG=qemu-img
>>> ++ RADOS=rados
>>> ++ RBD=rbd
>>> ++ READLINK=readlink
>>> ++ RM=rm
>>> ++ SCP=scp
>>> ++ SED=sed
>>> ++ SSH=ssh
>>> ++ SUDO=sudo
>>> ++ SYNC=sync
>>> ++ TAR=tar
>>> ++ TGTADM=tgtadm
>>> ++ TGTADMIN=tgt-admin
>>> ++ TGTSETUPLUN=tgt-setup-lun-one
>>> ++ TR=tr
>>> ++ VGDISPLAY=vgdisplay
>>> ++ VMKFSTOOLS=vmkfstools
>>> ++ WGET=wget
>>> +++ uname -s
>>> ++ '[' xLinux = xLinux ']'
>>> ++ SED='sed -r'
>>> +++ basename ./run_probes
>>> ++ SCRIPT_NAME=run_probes
>>> + export LANG=C
>>> + LANG=C
>>> + HYPERVISOR_DIR=kvm-probes.d
>>> + ARGUMENTS=kvm-probes
>>> ++ dirname ./run_probes
>>> + SCRIPTS_DIR=.
>>> + cd .
>>> ++ '[' -d kvm-probes.d ']'
>>> ++ run_dir kvm-probes.d
>>> ++ cd kvm-probes.d
>>> +++ ls architecture.sh collectd-client-shepherd.sh cpu.sh kvm.rb monitor_ds.sh name.sh poll.sh version.sh
>>> ++ for i in '`ls *`'
>>> ++ '[' -x architecture.sh ']'
>>> ++ ./architecture.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x collectd-client-shepherd.sh ']'
>>> ++ ./collectd-client-shepherd.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x cpu.sh ']'
>>> ++ ./cpu.sh kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x kvm.rb ']'
>>> ++ ./kvm.rb kvm-probes
>>> ++ EXIT_CODE=0
>>> ++ '[' x0 '!=' x0 ']'
>>> ++ for i in '`ls *`'
>>> ++ '[' -x monitor_ds.sh ']'
>>> ++ ./monitor_ds.sh kvm-probes
>>> [sudo] password for oneadmin:
>>>
>>> and it stays hung at the password prompt for oneadmin.
>>>
>>> What's going on?
>>>
>>> Also, you mentioned a collectd -- are you saying that OpenNebula 4.6
>>> now needs to run a daemon on every single VM host? Where is it
>>> documented how to set it up?
>>>
>>> Steve
>>>
>>> Make sure you do not have a host using the same hostname fgtest14 and
>>> running a collectd process
>>>
>>> On Jul 29, 2014 4:35 PM, "Steven Timm" <t...@fnal.gov> wrote:
>>>
>>> I am still trying to debug a nasty monitoring inconsistency.
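[The hang at the [sudo] prompt above is the sudoers misconfiguration discussed earlier in this thread: oneadmin cannot run the LVM commands (vgdisplay) without a password. A sketch of the kind of sudoers entry involved follows; this is an assumption for illustration only -- the opennebula-node packages normally install their own version, and the exact command list and paths vary by distro and package version:

```
# /etc/sudoers.d/opennebula (illustrative; check what your
# opennebula-node package actually ships)
oneadmin ALL=(ALL) NOPASSWD: /sbin/vgdisplay
```

A quick way to verify the setup is to run `sudo -n vgdisplay` as oneadmin: with `-n` (non-interactive), sudo fails immediately instead of prompting, so a misconfiguration shows up as an error rather than a hung probe.]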
>>>
>>> -bash-4.1$ onevm list | grep fgtest14
>>>   26 oneadmin oneadmin fgt6x4-26    runn  6  4G   fgtest14 117d 19h50
>>>   27 oneadmin oneadmin fgt5x4-27    runn 10  4G   fgtest14 117d 17h57
>>>   28 oneadmin oneadmin fgt1x1-28    runn 10  4.1G fgtest14 117d 16h59
>>>   30 oneadmin oneadmin fgt5x1-30    runn  0  4G   fgtest14 116d 23h50
>>>   33 oneadmin oneadmin ip6sl5vda-33 runn  6  4G   fgtest14 116d 19h57
>>> -bash-4.1$ onehost list
>>>  ID NAME     CLUSTER RVM ALLOCATED_CPU   ALLOCATED_MEM     STAT
>>>   3 fgtest11 ipv6      0 0 / 400 (0%)    0K / 15.7G (0%)   on
>>>   4 fgtest12 ipv6      0 0 / 400 (0%)    0K / 15.7G (0%)   on
>>>   7 fgtest13 ipv6      0 0 / 800 (0%)    0K / 23.6G (0%)   on
>>>   8 fgtest14 ipv6      5 0 / 800 (0%)    0K / 23.6G (0%)   on
>>>   9 fgtest20 ipv6      3 300 / 800 (37%) 12G / 31.4G (38%) on
>>>  11 fgtest19 ipv6      0 0 / 800 (0%)    0K / 31.5G (0%)   on
>>> -bash-4.1$ onehost show 8
>>> HOST 8 INFORMATION
>>> ID                   : 8
>>> NAME                 : fgtest14
>>> CLUSTER              : ipv6
>>> STATE                : MONITORED
>>> IM_MAD               : kvm
>>> VM_MAD               : kvm
>>> VN_MAD               : dummy
>>> LAST MONITORING TIME : 07/29 09:25:45
>>>
>>> HOST SHARES
>>> TOTAL MEM            : 23.6G
>>> USED MEM (REAL)      : 876.4M
>>> USED MEM (ALLOCATED) : 0K
>>> TOTAL CPU            : 800
>>> USED CPU (REAL)      : 0
>>> USED CPU (ALLOCATED) : 0
>>> RUNNING VMS          : 5
>>>
>>> LOCAL SYSTEM DATASTORE #102 CAPACITY
>>> TOTAL:               : 548.8G
>>> USED:                : 175.3G
>>> FREE:                : 345.6G
>>>
>>> MONITORING INFORMATION
>>> ARCH="x86_64"
>>> CPUSPEED="2992"
>>> HOSTNAME="fgtest14.fnal.gov"
>>> HYPERVISOR="kvm"
>>> MODELNAME="Intel(R) Xeon(R) CPU E5450 @ 3.00GHz"
>>> NETRX="234844577"
>>> NETTX="21553126"
>>> RESERVED_CPU=""
>>> RESERVED_MEM=""
>>> VERSION="4.6.0"
>>>
>>> VIRTUAL MACHINES
>>>
>>>   ID USER     GROUP    NAME         STAT UCPU UMEM HOST     TIME
>>>   26 oneadmin oneadmin fgt6x4-26    runn    6 4G   fgtest14 117d 19h50
>>>   27 oneadmin oneadmin fgt5x4-27    runn   10 4G   fgtest14 117d 17h57
>>>   28 oneadmin oneadmin fgt1x1-28    runn   10 4.1G fgtest14 117d 17h00
>>>   30 oneadmin oneadmin fgt5x1-30    runn    0 4G   fgtest14 116d
23h50
>>>   33 oneadmin oneadmin ip6sl5vda-33 runn    6 4G   fgtest14 116d 19h57
>>> -----------------------------------------------------------------------
>>>
>>> All of this looks great, right? Just one problem: there are no VMs
>>> running on fgtest14, and there haven't been for 4 days.
>>>
>>> [root@fgtest14 ~]# virsh list
>>>  Id Name State
>>> ----------------------------------------------------
>>>
>>> [root@fgtest14 ~]#
>>>
>>> -----------------------------------------------------------------------
>>> Yet the monitoring reports no errors:
>>>
>>> Tue Jul 29 09:28:10 2014 [InM][D]: Host fgtest14 (8) successfully
>>> monitored.
>>>
>>> -----------------------------------------------------------------------
>>> At the same time, there is no evidence that ONE is actually trying to
>>> monitor these five VMs, or succeeding, yet they are still stuck in
>>> "runn", which means I can't do a onevm restart to restart them. (The
>>> VM images of these 5 VMs are still out there on the VM host, and I
>>> would like to save and restart them if I can.)
>>>
>>> What is the remotes command that ONE 4.6 would use to monitor this
>>> host? Can I do it manually and see what output I get?
>>>
>>> Are we dealing with some kind of a bug, or just a very confused
>>> system? Any help is appreciated. I have to get this sorted out before
>>> I dare deploy one4.x in production.
>>>
>>> Steve Timm
>>>
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D (630) 840-8525
>>> t...@fnal.gov http://home.fnal.gov/~timm/
>>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>>> Grid and Cloud Services Dept., Associate Dept.
>>> Head for Cloud Computing
>>> _______________________________________________
>>> Users mailing list
>>> Users@lists.opennebula.org
>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D (630) 840-8525
>>> t...@fnal.gov http://home.fnal.gov/~timm/
>>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>>>
>>> --
>>> Ruben S. Montero, PhD
>>> Project co-Lead and Chief Architect
>>> OpenNebula - Flexible Enterprise Cloud Made Simple
>>> www.OpenNebula.org | rsmont...@opennebula.org | @OpenNebula
>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D (630) 840-8525
>> t...@fnal.gov http://home.fnal.gov/~timm/
>> Fermilab Scientific Computing Division, Scientific Computing Services Quad.
>> Grid and Cloud Services Dept., Associate Dept. Head for Cloud Computing
>
> --
> Ruben S. Montero, PhD
> Project co-Lead and Chief Architect
> OpenNebula - Flexible Enterprise Cloud Made Simple
> www.OpenNebula.org | rsmont...@opennebula.org | @OpenNebula

--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - Flexible Enterprise Cloud Made Simple
www.OpenNebula.org | rsmont...@opennebula.org | @OpenNebula
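[The staleness rule described in this thread -- collectd is restarted if no information is received within 3 monitor steps -- can be sketched as a standalone check for stale hosts. This is an illustration, not shipped OpenNebula code; it assumes GNU date and that MONITORING_INTERVAL in oned.conf is 60 seconds (the usual default -- verify yours):

```shell
# Sketch: flag a host as stale when its last monitoring time is older
# than 3 monitoring steps. Assumptions: GNU date (-d, +%s) and a
# 60-second MONITORING_INTERVAL; adjust both for your installation.
INTERVAL=${MONITORING_INTERVAL:-60}

is_stale() {
    # $1 = LAST MONITORING TIME (as shown by `onehost show`) converted
    # to a UNIX epoch; returns success (0) when the host looks stale
    local now
    now=$(date +%s)
    [ $(( now - $1 )) -gt $(( 3 * INTERVAL )) ]
}

# Example: a host last monitored 10 minutes ago is stale at 60 s steps
if is_stale "$(date -d '10 minutes ago' +%s)"; then
    echo "host is stale -- check its collectd-client process on the node"
fi
```

In practice the timestamp would be parsed from `onehost show` output (the `-x` flag of the CLI emits XML, which is easier to parse), and a stale host would be a cue to re-run the probes manually from /var/tmp/one/im as described above.]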
_______________________________________________
Users mailing list
Users@lists.opennebula.org
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org