It seems that more people are having this problem, and we are looking at several ways to fix it. One problem with /var/run is that it is normally owned by root, so a process started by the oneadmin user cannot write there. On the frontend a new directory for OpenNebula pid files is created, but on the nodes it does not exist.
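To illustrate why the pid file location and its permissions matter, here is a minimal sketch of a pid file guard. It is not the actual collectd-client.rb code (which is Ruby). The function names pid_is_running and acquire_pid_file are made up for this example. The idea is that a client which cannot read or write its pid file should stop with an error rather than quietly start one more instance.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid exists (signal 0 only probes)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one pid
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_pid_file(path):
    """Hypothetical helper: try to claim `path` as this process's pid file.

    Returns True if this process now owns the pid file, and False if another
    live instance already holds it. A PermissionError from open() is left to
    propagate on purpose, so a pid file with wrong ownership or permissions
    stops the caller instead of letting a duplicate start.

    Note: the check and the write are not atomic; a real implementation
    would also need a lock to close that race.
    """
    try:
        with open(path) as f:
            old_pid = int(f.read().strip())
        if old_pid != os.getpid() and pid_is_running(old_pid):
            return False
    except (FileNotFoundError, ValueError):
        pass  # no pid file yet, or unparsable contents: safe to replace

    with open(path, "w") as f:
        f.write(str(os.getpid()))
    return True
```

A client would call acquire_pid_file("/tmp/one-collectd-client.pid") at startup and exit if it returns False. The same call works with any directory that oneadmin can write to, which is the part that /var/run does not give us on the nodes.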
On Tue, Jan 21, 2014 at 8:07 AM, Gerry O'Brien <ge...@scss.tcd.ie> wrote: > Hi Javier, > > See my previous email. Another scenario is when > "/tmp/one-collectd-client.pid" does not exist due to issues with /tmp. > > A change seems to have been made to put a pid file in /tmp instead of > /run or /var/run. > > Regards, > Gerry > > > > On 20/01/2014 17:44, Javier Fontan wrote: >> >> I've been trying to reproduce the problem, that is, making OpenNebula >> start a high amount of collectd-client processes. The only way I was >> able to do it is when the file "/tmp/one-collectd-client.pid" exists >> and has wrong permissions. Can you check the ownership and permissions >> of that file? >> >> On Mon, Jan 20, 2014 at 4:15 PM, Javier Fontan <jfon...@opennebula.org> >> wrote: >>> >>> The problem seems to be the high amount of collectd processes running. >>> Try killing all "collectd-client.rb" processes. There should be only >>> one running per host. >>> >>> In case you want to use the old method of monitoring you can follow this >>> guide: >>> >>> >>> http://docs.opennebula.org/stable/administration/monitoring/imsshpullg.html#imsshpullg >>> >>> On Mon, Jan 20, 2014 at 2:17 PM, Gerry O'Brien <ge...@scss.tcd.ie> wrote: >>>> >>>> Hi Ruben, >>>> >>>> Below is the output of 'ps -ef | grep one' on a host that has been >>>> disabled, rebooted and enabled. There are multiple versions of >>>> collectd-client.rb kvm running. >>>> >>>> >>>> We have discovered today a serious issue that is having an adverse >>>> effect on our DNS system. When the machines below was enabled, >>>> immediately >>>> our DNS server is flooded with requests from the host (see a sample >>>> below). >>>> Our logs show that this has only started happening since the >>>> upgrade to >>>> 4.4. If we don't get a fix for this we will have to go back to 4.2, >>>> which is >>>> something I really don't want to do. >>>> >>>> Regards, >>>> Gerry >>>> >>>> >>>> >>>> >>>> oneadmin 3628 1 0 13:04 ? 
00:00:00 ruby >>>> /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 4600 1 0 13:05 ? 00:00:00 ruby >>>> /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 6400 1 0 13:07 ? 00:00:00 ruby >>>> /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 9003 1 0 13:08 ? 00:00:00 ruby >>>> /var/tmp/one/im/kvm.d/collectd-client.rb kvm /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 12953 3628 0 13:10 ? 00:00:00 /bin/bash >>>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 12955 6400 0 13:10 ? 00:00:00 /bin/bash >>>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 12969 12953 0 13:10 ? 00:00:00 /bin/bash >>>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 12970 12969 0 13:10 ? 00:00:00 /bin/bash >>>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 12972 12955 0 13:10 ? 00:00:00 /bin/bash >>>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 12973 12972 0 13:10 ? 00:00:00 /bin/bash >>>> /var/tmp/one/im/kvm.d/../run_probes kvm-probes /var/lib/one//datastores >>>> 4124 >>>> 20 0 host101.scss.tcd.ie >>>> oneadmin 13029 12973 0 13:10 ? 00:00:00 /bin/bash >>>> ./monitor_ds.sh >>>> kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie >>>> oneadmin 13030 12970 0 13:10 ? 
00:00:00 /bin/bash >>>> ./monitor_ds.sh >>>> kvm-probes /var/lib/one//datastores 4124 20 0 host101.scss.tcd.ie >>>> >>>> >>>> >>>> -2014 13:14:26.675 client 134.226.59.101#52314: query: >>>> host101.scss.tcd.ie >>>> IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.680 client 134.226.59.101#51356: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.680 client 134.226.59.101#51356: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.822 client 134.226.59.101#47870: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.822 client 134.226.59.101#47870: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.824 client 134.226.59.101#58734: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.825 client 134.226.59.101#58734: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#39659: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#39659: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.952 client 134.226.59.101#53975: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:26.953 client 134.226.59.101#53975: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.108 client 134.226.59.101#36294: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.108 client 134.226.59.101#36294: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.109 client 134.226.59.101#59277: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.109 client 134.226.59.101#59277: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.347 client 134.226.59.101#49614: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 
13:14:27.348 client 134.226.59.101#49614: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.350 client 134.226.59.101#44058: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.357 client 134.226.59.101#44058: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.458 client 134.226.59.101#51830: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.458 client 134.226.59.101#51830: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.461 client 134.226.59.101#38419: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:27.461 client 134.226.59.101#38419: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:31.184 client 134.226.59.101#38617: query: >>>> host101.scss.tcd.ie IN A + (134.226.32.57) >>>> 20-Jan-2014 13:14:31.184 client 134.226.59.101#38617: query: >>>> host101.scss.tcd.ie IN AAAA + (134.226.32.57) >>>> 20-Jan-2014 13:14:31.302 client 134.226 >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 17/01/2014 17:45, Ruben S. Montero wrote: >>>>> >>>>> Hi Gerry >>>>> >>>>> Just to check, are you using 4.4 Final? We've seen this in the betas >>>>> and >>>>> "thought" we fixed for the final version. Also could you check that >>>>> there >>>>> are just one monitorization process at the hosts (collectd-client.sh, >>>>> or >>>>> equiv should be the name of the process) >>>>> >>>>> Also could you send us the lines from oned.log between Thu Jan 16 >>>>> 16:56:25 >>>>> 2014 and Thu Jan 16 17:25:43 2014; plus the first lines that includes >>>>> you >>>>> oned.conf values (we are interested specially in those related to >>>>> monitoring interval) >>>>> >>>>> >>>>> Cheers >>>>> >>>>> Ruben >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Jan 17, 2014 at 2:27 PM, Gerry O'Brien <ge...@scss.tcd.ie> >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Below is a truncated log file for a VM. 
The monitor continually >>>>>> cycles >>>>>> through finding the machine RUNNING and stat UNKNOWN. This occurs for >>>>>> many >>>>>> many machines at the same time. All machines were created by a script. >>>>>> >>>>>> The VMs are Microsoft Windows 7 64bit Enterprise. Individual >>>>>> context >>>>>> is created by a startup script. They run fine but eventually >>>>>> /var/log/one >>>>>> is going overflow. >>>>>> >>>>>> Restarting oned seems to fix the problem but this is hardly a >>>>>> long >>>>>> term solution. >>>>>> >>>>>> Any suggestions on what could be causing this? >>>>>> >>>>>> Regards, >>>>>> Gerry >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Thu Jan 16 16:56:21 2014 [DiM][I]: New VM state is ACTIVE. >>>>>> Thu Jan 16 16:56:22 2014 [LCM][I]: New VM state is PROLOG. >>>>>> Thu Jan 16 16:56:22 2014 [VM][I]: Virtual Machine has no context >>>>>> Thu Jan 16 16:56:22 2014 [LCM][I]: New VM state is BOOT >>>>>> Thu Jan 16 16:56:22 2014 [VMM][I]: Generating deployment file: >>>>>> /var/lib/one/vms/1788/deployment.0 >>>>>> Thu Jan 16 16:56:23 2014 [VMM][I]: ExitCode: 0 >>>>>> Thu Jan 16 16:56:23 2014 [VMM][I]: Successfully execute network driver >>>>>> operation: pre. >>>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: ExitCode: 0 >>>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: Successfully execute virtualization >>>>>> driver operation: deploy. >>>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: ExitCode: 0 >>>>>> Thu Jan 16 16:56:25 2014 [VMM][I]: Successfully execute network driver >>>>>> operation: post. 
>>>>>> Thu Jan 16 16:56:25 2014 [LCM][I]: New VM state is RUNNING >>>>>> Thu Jan 16 16:56:51 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 16:59:01 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 16:59:23 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:01:41 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:01:58 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:04:18 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:04:39 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:06:55 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:07:06 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:09:31 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:09:31 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:12:22 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:12:27 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:15:11 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:15:22 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:17:49 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:18:00 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:20:27 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:20:34 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:23:04 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:23:08 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> Thu Jan 16 17:25:41 2014 [VMM][I]: VM found again, state is RUNNING >>>>>> Thu Jan 16 17:25:43 2014 [LCM][I]: New VM state is UNKNOWN >>>>>> >>>>>> -- >>>>>> Gerry O'Brien >>>>>> >>>>>> Systems Manager >>>>>> School of Computer Science and Statistics >>>>>> Trinity College Dublin >>>>>> Dublin 2 >>>>>> IRELAND >>>>>> >>>>>> 00 353 1 896 1341 >>>>>> >>>>>> _______________________________________________ >>>>>> Users mailing list >>>>>> 
Users@lists.opennebula.org >>>>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org

--
Javier Fontán Muiños
Developer
OpenNebula - The Open Source Toolkit for Data Center Virtualization
www.OpenNebula.org | @OpenNebula | github.com/jfontan