Re: vcld memory leak

Andy Kurth Wed, 08 Oct 2014 07:20:05 -0700

There are probably 2 related problems -- (1) the health of the ESXi server
and (2) the VCL code not handling all cases when the health of the host
causes unexpected results.  More below...


On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <[email protected]> wrote:

> Hello,
> I've recently been hitting a memory leak with the vcl daemon (VCL version
> 2.3.2). The problem appears to be happening in the computer_not_being_used
> subroutine within new.pm (see attached log)
>
> The problem appears to start when during a reload there was an issue
> communicating with the VMWare server. This caused the VM to be left on the
> VMHost in a powered off state along with the deletion of the entries from
> the computerloadlog table
>
> *|6309|19812:19812|reload| ---- CRITICAL ---- *
> *|6309|19812:19812|reload| 2014-09-03
> 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error :
> Start tag expected, '<' not found*
> *|6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)*
> *|6309|19812:19812|reload| ^*
> *|6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)*
> *|6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)*
> *|6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)*
> *|6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)*
> *|6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)*
> *|6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line:
> 172)*
> *|6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)*
> *|6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)*
> *|6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line:
> 2471)*
> *|6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)*
> *|6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line:
> 1594)*
> *|6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)*
> *|6309|19812:19812|reload| (-12) new.pm <http://new.pm>, reload_image
> (line: 671)*
> *|6309|19812:19812|reload| (-13) new.pm <http://new.pm>, process (line:
> 291)*
> *|6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)*
> *2014-09-03
> 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing
> computerloadlog entries matching loadstate = begin*
> *2014-09-03
> 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted
> rows from computerloadlog for reservation id=19812*
>
>
Yes.  We are seeing this more and more as of late on our ESXi 4.1 servers.
This particular error only appears if you are using the vSphere SDK to
manage the host.  I believe the same underlying problem is described in the
following issue if SSH and vim-cmd is used to manage the host:
https://issues.apache.org/jira/browse/VCL-769

As a test on a server which is exhibiting the problem you described and to
determine if the problems are related, please try to SSH in and run the
following command:
vim-cmd hostsvc/datastore/info

If this displays an error then they are related.  Running 'services.sh
restart' on the host may solve the problem.  If not, then it's likely the
.pid files in /var/run became inconsistent with the running services.  Each
should contain the PID of the corresponding service.  If they contain the
wrong PID then 'services.sh restart' will fail to restart some services and
the problems will continue.  If you verify that 'services.sh restart'
doesn't fix the issue, I can try to write instructions on how to fix the
files manually.  I have added some code to VIM_SSH.pm to try to correct the
.pid files automatically.  This isn't possible with the vSphere SDK.

Please send the contents of each of these files from an affected host:
/var/run/vmware/vmware-hostd.PID
/var/run/vmware/vicimprovider.PID
/var/run/vmware/vmkdevmgr.pid
/var/run/vmware/vmkeventd.pid
/var/run/vmware/vmsyslogd.pid
/var/run/vmware/vmware-rhttpproxy.PID
/var/run/vmware/vmware-vpxa.PID

And the output from these commands:
ps -ef | grep hostd-worker
ps -ef | grep sfcb-vmware_bas
ps -ef | grep vmkdevmgr
ps -ef | grep vmkeventd
ps -ef | grep vmsyslogd
ps -ef | grep rhttpproxy-work
ps -ef | grep vpxa-worker



> Now when a new reservation comes in and the same vm is allocated for the
> reservation, the computer_not_being_used subroutine calls the
> $self->code_loop_timeout(sub{return !reservation_being_processed(@_)},
> [$competing_reservation_id], $message, $total_wait_seconds,
> $attempt_delay_seconds)) section (on line # 815 in new.pm) it receives a
> 0 from reservation_being_processed with message
>
> *"2014-10-07
> 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog
> 'begin' entry does NOT exist for reservation 19812" *
>
> The vcl daemon thinks that the reload has completed. This causes the same
> reservation to be processed over an over within computer_not_being_used
> causing memory spike's and eventually killing that vcld thread.
>
> Any ideas how the reservation_being_processed can handle the lack of
> "begin" entries when used along with the code_loop_timeout from
> computer_not_being_used or the DESTROY handler can make sure such
> reservations are purged, so it doesn't cause this issue?
>
>
The VCL code could be improved to better handle this problem.

The problem is probably due to when the error occurs in the vcld process
sequence -- very early on when the object to manage the VM host is being
initialized.  From the trace output, the vSphere_SDK.pm::_get_file_info
subroutine is calling the vSphere SDK's Vim::get_view subroutine.  This
fails miserably (probably due to the SDK not catching the error described
above) and causes the entire vcld child process to die abruptly.  The
problem occurs before the vcld changed the request state to 'pending'.  The
request state/laststate remains to be reload/reload or new/new when the
process dies.  As a result, vcld keeps trying the same sequence over and
over again.

It's possible to improve the code to catch this by wrapping all Vim::*
calls in an eval block.  I'll get this implemented for the next release.
Patching 2.3.2 may be possible but could also be ugly.  Vim::get_view is
called from many places in vSphere_SDK.pm.  Every one of these would need
to be updated.

Regards,
Andy


> Please let me know if you need any further clarification.
>
> Thanks.
>
> --
> Junaid Ali
>

Re: vcld memory leak

Reply via email to