The PIDs look correct. Is this host currently generating any of the 'Start tag expected...' error messages? If it is, then something else may be wrong. If not, and you see the error again in the future, I would check the PIDs again. It's pretty simple but tedious. In your file, the value in vmware-hostd.PID (2953236) should match the PID of the first hostd-worker process and the PPID of the other hostd-worker processes. The same goes for the other files and processes.
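A rough sketch of that check, run from the management node, is below. The host name, the file-to-process mapping, and the assumption that the first two columns of the host's ps output are the PID and PPID are placeholders to adjust for your environment; it illustrates the comparison rather than being tested code.

#!/usr/bin/perl
# Sketch: compare each .PID file on an ESXi host with the PIDs/PPIDs of the
# corresponding worker processes. The host name, the file-to-process mapping,
# and the ps column positions (PID, PPID) below are assumptions.
use strict;
use warnings;

my $host = 'vmhost1';    # hypothetical ESXi host name

my %pid_file_for_process = (
    'hostd-worker'    => '/var/run/vmware/vmware-hostd.PID',
    'rhttpproxy-work' => '/var/run/vmware/vmware-rhttpproxy.PID',
    'vpxa-worker'     => '/var/run/vmware/vmware-vpxa.PID',
);

for my $process (sort keys %pid_file_for_process) {
    my $file = $pid_file_for_process{$process};
    chomp(my $file_pid = `ssh $host cat $file 2>/dev/null`);

    # Collect [PID, PPID] for each matching worker process
    my @workers = map { [ (split)[0, 1] ] }
                  grep { /\Q$process\E/ }
                  `ssh $host ps -ef 2>/dev/null`;

    # The .PID file should match the first worker's PID and the other workers' PPIDs
    my $consistent = length($file_pid)
                  && @workers
                  && $workers[0][0] eq $file_pid
                  && !grep { $_->[1] ne $file_pid } @workers[1 .. $#workers];

    printf("%-16s file=%-8s %s\n", $process, ($file_pid || '?'), ($consistent ? 'consistent' : 'MISMATCH'));
}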
I have updated the code in the repository to catch the error which caused the process to die over and over again. I also applied similar changes to the vSphere_SDK.pm file which shipped with 2.3.2 and attached the result. You can try to swap your current file:

/usr/local/vcl/lib/VCL/Module/Provisioning/vSphere_SDK.pm

...with the attached file and restart the vcld service. I tested this on a simple reload and it worked. The updated file won't prevent the initial error from occurring, but it will catch the problem so that the vcld process doesn't abruptly die and repeatedly retry.

-Andy

On Wed, Oct 8, 2014 at 12:35 PM, Junaid Ali <[email protected]> wrote:

> Hi Andy,
> Thanks for the information.
>
> I was able to SSH into the vmhost and run the commands. The host runs the vim-cmd commands without any errors. Attached are the PIDs from the files as well as the ps command output, and they are consistent, so it may not be related to the PID mismatch. I looked back during the week of 9/3 when the original problem occurred and could not find anything out of place on the VMHost (please check the attached pdf report from vCenter) or the management node.
>
> Thanks.
> Junaid.
>
> On Wed, Oct 8, 2014 at 9:18 AM, Andy Kurth <[email protected]> wrote:
>
>> There are probably 2 related problems -- (1) the health of the ESXi server and (2) the VCL code not handling all cases when the health of the host causes unexpected results. More below...
>>
>> On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <[email protected]> wrote:
>>
>>> Hello,
>>> I've recently been hitting a memory leak with the vcl daemon (VCL version 2.3.2). The problem appears to be happening in the computer_not_being_used subroutine within new.pm (see attached log).
>>>
>>> The problem appears to start when, during a reload, there was an issue communicating with the VMware server.
>>> This caused the VM to be left on the VMHost in a powered-off state, along with the deletion of the entries from the computerloadlog table:
>>>
>>> |6309|19812:19812|reload| ---- CRITICAL ----
>>> |6309|19812:19812|reload| 2014-09-03 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : Start tag expected, '<' not found
>>> |6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)
>>> |6309|19812:19812|reload| ^
>>> |6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)
>>> |6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)
>>> |6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)
>>> |6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)
>>> |6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)
>>> |6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: 172)
>>> |6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)
>>> |6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)
>>> |6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: 2471)
>>> |6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)
>>> |6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: 1594)
>>> |6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)
>>> |6309|19812:19812|reload| (-12) new.pm, reload_image (line: 671)
>>> |6309|19812:19812|reload| (-13) new.pm, process (line: 291)
>>> |6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)
>>> 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing computerloadlog entries matching loadstate = begin
>>> 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted rows from computerloadlog for reservation id=19812
>>
>> Yes. We are seeing this more and more as of late on our ESXi 4.1 servers. This particular error only appears if you are using the vSphere SDK to manage the host. I believe the same underlying problem is described in the following issue if SSH and vim-cmd are used to manage the host:
>> https://issues.apache.org/jira/browse/VCL-769
>>
>> As a test on a server which is exhibiting the problem you described, and to determine if the problems are related, please try to SSH in and run the following command:
>>
>> vim-cmd hostsvc/datastore/info
>>
>> If this displays an error then they are related. Running 'services.sh restart' on the host may solve the problem. If not, then it's likely the .pid files in /var/run became inconsistent with the running services. Each should contain the PID of the corresponding service. If they contain the wrong PID, then 'services.sh restart' will fail to restart some services and the problems will continue. If you verify that 'services.sh restart' doesn't fix the issue, I can try to write instructions on how to fix the files manually. I have added some code to VIM_SSH.pm to try to correct the .pid files automatically. This isn't possible with the vSphere SDK.
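As an illustration of the test-and-restart sequence described in the message above, here is a minimal sketch run from the management node over SSH. The host name and the pattern used to detect an error in the output are assumptions, and this is not the code that was added to VIM_SSH.pm.

#!/usr/bin/perl
# Sketch: run the datastore health check and, if it fails, attempt
# 'services.sh restart' on the host. Host name and error pattern are assumptions.
use strict;
use warnings;

my $host = 'vmhost1';    # hypothetical ESXi host name

# If this command prints an error, the host is exhibiting the problem
my $output = `ssh $host "vim-cmd hostsvc/datastore/info" 2>&1`;

if ($? != 0 || $output =~ /error|fault|fail/i) {
    print "datastore info check failed on $host:\n$output\n";
    print "attempting 'services.sh restart'...\n";
    system("ssh", $host, "services.sh restart");
    # If some services fail to restart, compare the .pid files in
    # /var/run/vmware with the running processes, as described above.
}
else {
    print "datastore info check passed on $host\n";
}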
>> Please send the contents of each of these files from an affected host:
>>
>> /var/run/vmware/vmware-hostd.PID
>> /var/run/vmware/vicimprovider.PID
>> /var/run/vmware/vmkdevmgr.pid
>> /var/run/vmware/vmkeventd.pid
>> /var/run/vmware/vmsyslogd.pid
>> /var/run/vmware/vmware-rhttpproxy.PID
>> /var/run/vmware/vmware-vpxa.PID
>>
>> And the output from these commands:
>>
>> ps -ef | grep hostd-worker
>> ps -ef | grep sfcb-vmware_bas
>> ps -ef | grep vmkdevmgr
>> ps -ef | grep vmkeventd
>> ps -ef | grep vmsyslogd
>> ps -ef | grep rhttpproxy-work
>> ps -ef | grep vpxa-worker
>>
>>> Now when a new reservation comes in and the same VM is allocated for the reservation, the computer_not_being_used subroutine reaches the $self->code_loop_timeout(sub{return !reservation_being_processed(@_)}, [$competing_reservation_id], $message, $total_wait_seconds, $attempt_delay_seconds) call (on line 815 in new.pm). It receives a 0 from reservation_being_processed with the message:
>>>
>>> "2014-10-07 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog 'begin' entry does NOT exist for reservation 19812"
>>>
>>> The vcl daemon thinks that the reload has completed. This causes the same reservation to be processed over and over within computer_not_being_used, causing memory spikes and eventually killing that vcld thread.
>>>
>>> Any ideas how reservation_being_processed can handle the lack of "begin" entries when used along with code_loop_timeout from computer_not_being_used, or how the DESTROY handler can make sure such reservations are purged, so this issue doesn't occur?
>>
>> The VCL code could be improved to better handle this problem.
>>
>> The problem is probably due to when the error occurs in the vcld process sequence -- very early on, when the object to manage the VM host is being initialized. From the trace output, the vSphere_SDK.pm::_get_file_info subroutine is calling the vSphere SDK's Vim::get_view subroutine. This fails miserably (probably due to the SDK not catching the error described above) and causes the entire vcld child process to die abruptly. The problem occurs before vcld has changed the request state to 'pending'. The request state/laststate remains reload/reload or new/new when the process dies. As a result, vcld keeps trying the same sequence over and over again.
>>
>> It's possible to improve the code to catch this by wrapping all Vim::* calls in an eval block. I'll get this implemented for the next release. Patching 2.3.2 may be possible but could also be ugly. Vim::get_view is called from many places in vSphere_SDK.pm. Every one of these would need to be updated.
>>
>> Regards,
>> Andy
>>
>>> Please let me know if you need any further clarification.
>>>
>>> Thanks.
>>>
>>> --
>>> Junaid Ali
>
> --
> Junaid Ali
> Systems & Virtualization Engineer,
> Office of Technology Services/IIT,
> 10W, 31st Street,
> Stuart Building Room # 007,
> Chicago, IL - 60616
> Ph (O): 312-567-5836
> Ph (F): 312-567-5968
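For reference, here is a minimal sketch of the eval-block wrapping Andy describes, written as a hypothetical vSphere_SDK.pm-style helper. The subroutine name (_get_view_safe) is illustrative and not part of the shipped module; it assumes the vSphere SDK (Vim::get_view) and VCL::utils (notify, %ERRORS) are already loaded, as they are in that module.

# Sketch: trap the die raised when the host returns a bad response to
# Vim::get_view, so the failure is logged and reported to the caller instead
# of killing the vcld child process. The helper name is hypothetical.
sub _get_view_safe {
    my ($self, %args) = @_;

    my $view = eval { Vim::get_view(%args) };

    if ($@ || !defined $view) {
        notify($ERRORS{'WARNING'}, 0, "Vim::get_view failed: " . ($@ || 'no view object returned'));
        return;
    }
    return $view;
}

Each call site would then need to handle an undef return value instead of assuming the view exists, which is why patching 2.3.2 means touching every place in vSphere_SDK.pm where Vim::get_view is called.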
