Hi Andy,
Thanks for the reply and the quick patch. I didn't receive any attachment; could you please resend it?
Thanks.

On Wed, Oct 8, 2014 at 12:44 PM, Andy Kurth <[email protected]> wrote:

> The pids look correct. Is this host currently generating any of the 'Start tag expected...' error messages? If it is, then something else may be wrong. If not, then if you see the error in the future I would again check the pids. It's pretty simple but tedious. In your file, the value in vmware-hostd.PID (2953236) should match the PID of the first hostd-worker process and the PPID of the other hostd-worker processes. The same goes for the other files and processes.
>
> I have updated the code in the repository to catch the error which caused the process to die over and over again. I also applied similar changes to the vSphere_SDK.pm file that shipped with 2.3.2 and attached the result. You can try to swap your current file:
> /usr/local/vcl/lib/VCL/Module/Provisioning/vSphere_SDK.pm
> ...with the attached file and restart the vcld service.
>
> I tested this on a simple reload and it worked. The updated file won't prevent the initial error from occurring, but it will catch the problem so that the vcld process doesn't abruptly die and repeatedly retry.
>
> -Andy
>
> On Wed, Oct 8, 2014 at 12:35 PM, Junaid Ali <[email protected]> wrote:
>
>> Hi Andy,
>> Thanks for the information.
>> I was able to ssh into the vmhost and run the commands. It runs the vim-cmd commands without any errors. Attached are the PIDs from the files as well as the ps command output, and they are consistent. So it may not be related to a PID mismatch. I looked back during the week of 9/3 when the original problem occurred and could not find anything out of place on the VMHost (please check the attached pdf report from vCenter) or the management node.
>>
>> Thanks.
>> Junaid.
>>
>> On Wed, Oct 8, 2014 at 9:18 AM, Andy Kurth <[email protected]> wrote:
>>
>>> There are probably 2 related problems -- (1) the health of the ESXi server and (2) the VCL code not handling all cases when the health of the host causes unexpected results. More below...
>>>
>>> On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <[email protected]> wrote:
>>>
>>> > Hello,
>>> > I've recently been hitting a memory leak with the vcl daemon (VCL version 2.3.2). The problem appears to be happening in the computer_not_being_used subroutine within new.pm (see attached log).
>>> >
>>> > The problem appears to start when, during a reload, there was an issue communicating with the VMware server.
>>> > This caused the VM to be left on the VMHost in a powered off state, along with the deletion of the entries from the computerloadlog table:
>>> >
>>> > |6309|19812:19812|reload| ---- CRITICAL ----
>>> > |6309|19812:19812|reload| 2014-09-03 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : Start tag expected, '<' not found
>>> > |6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)
>>> > |6309|19812:19812|reload| ^
>>> > |6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)
>>> > |6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)
>>> > |6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)
>>> > |6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)
>>> > |6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)
>>> > |6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: 172)
>>> > |6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)
>>> > |6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)
>>> > |6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: 2471)
>>> > |6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)
>>> > |6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: 1594)
>>> > |6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)
>>> > |6309|19812:19812|reload| (-12) new.pm, reload_image (line: 671)
>>> > |6309|19812:19812|reload| (-13) new.pm, process (line: 291)
>>> > |6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)
>>> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing computerloadlog entries matching loadstate = begin
>>> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted rows from computerloadlog for reservation id=19812
>>> >
>>> Yes. We are seeing this more and more as of late on our ESXi 4.1 servers. This particular error only appears if you are using the vSphere SDK to manage the host. I believe the same underlying problem is described in the following issue if SSH and vim-cmd are used to manage the host:
>>> https://issues.apache.org/jira/browse/VCL-769
>>>
>>> As a test on a server which is exhibiting the problem you described, and to determine if the problems are related, please try to SSH in and run the following command:
>>> vim-cmd hostsvc/datastore/info
>>>
>>> If this displays an error then they are related. Running 'services.sh restart' on the host may solve the problem. If not, then it's likely the .pid files in /var/run became inconsistent with the running services. Each should contain the PID of the corresponding service. If they contain the wrong PID, then 'services.sh restart' will fail to restart some services and the problems will continue. If you verify that 'services.sh restart' doesn't fix the issue, I can try to write instructions on how to fix the files manually. I have added some code to VIM_SSH.pm to try to correct the .pid files automatically. This isn't possible with the vSphere SDK.
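For reference, a minimal sketch of the kind of .PID consistency check described above (the value recorded in vmware-hostd.PID versus the PID/PPID of the running hostd-worker processes), run from the management node rather than on the host. This is not from the thread or its attachments: the host name is a placeholder, key-based SSH access to the ESXi host is assumed, and the PID/PPID column positions (fields 2 and 3 of 'ps -ef') may need adjusting for the host's ps output.

#!/usr/bin/perl
# Rough sketch: compare the PID recorded in vmware-hostd.PID on an ESXi host
# with the running hostd-worker processes. Host name is a placeholder.
use strict;
use warnings;

my $host     = 'vmhost.example.edu';                   # placeholder, not from the thread
my $pid_file = '/var/run/vmware/vmware-hostd.PID';

# PID recorded in the .PID file
chomp(my $recorded_pid = `ssh $host cat $pid_file`);
die "could not read a numeric PID from $pid_file\n" unless $recorded_pid =~ /^\d+$/;

# Collect the PID (assumed column 2) and PPID (assumed column 3) of each hostd-worker process
my @workers;
for my $line (`ssh $host "ps -ef | grep hostd-worker | grep -v grep"`) {
    my @fields = split ' ', $line;
    next unless defined $fields[2] && $fields[1] =~ /^\d+$/;
    push @workers, { pid => $fields[1], ppid => $fields[2] };
}
die "no hostd-worker processes found\n" unless @workers;

# Each worker should either be the recorded process itself or have it as its parent
my @mismatches = grep { $_->{pid} != $recorded_pid && $_->{ppid} != $recorded_pid } @workers;
if (@mismatches) {
    print "MISMATCH: $pid_file contains $recorded_pid but " . scalar(@mismatches) .
          " hostd-worker process(es) do not reference it\n";
}
else {
    print "OK: hostd-worker processes are consistent with $pid_file\n";
}

The same check could be repeated for the other .PID files and worker process names listed below by changing the two variables at the top.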
>>>
>>> Please send the contents of each of these files from an affected host:
>>> /var/run/vmware/vmware-hostd.PID
>>> /var/run/vmware/vicimprovider.PID
>>> /var/run/vmware/vmkdevmgr.pid
>>> /var/run/vmware/vmkeventd.pid
>>> /var/run/vmware/vmsyslogd.pid
>>> /var/run/vmware/vmware-rhttpproxy.PID
>>> /var/run/vmware/vmware-vpxa.PID
>>>
>>> And the output from these commands:
>>> ps -ef | grep hostd-worker
>>> ps -ef | grep sfcb-vmware_bas
>>> ps -ef | grep vmkdevmgr
>>> ps -ef | grep vmkeventd
>>> ps -ef | grep vmsyslogd
>>> ps -ef | grep rhttpproxy-work
>>> ps -ef | grep vpxa-worker
>>>
>>> > Now when a new reservation comes in and the same VM is allocated for the reservation, the computer_not_being_used subroutine calls the $self->code_loop_timeout(sub{return !reservation_being_processed(@_)}, [$competing_reservation_id], $message, $total_wait_seconds, $attempt_delay_seconds) section (on line 815 in new.pm) and receives a 0 from reservation_being_processed with the message:
>>> >
>>> > "2014-10-07 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog 'begin' entry does NOT exist for reservation 19812"
>>> >
>>> > The vcl daemon thinks that the reload has completed. This causes the same reservation to be processed over and over within computer_not_being_used, causing memory spikes and eventually killing that vcld thread.
>>> >
>>> > Any ideas how reservation_being_processed can handle the lack of "begin" entries when used along with code_loop_timeout from computer_not_being_used, or how the DESTROY handler can make sure such reservations are purged, so this issue doesn't occur?
>>> >
>>> The VCL code could be improved to better handle this problem.
>>>
>>> The problem is probably due to when the error occurs in the vcld process sequence -- very early on, when the object to manage the VM host is being initialized. From the trace output, the vSphere_SDK.pm::_get_file_info subroutine is calling the vSphere SDK's Vim::get_view subroutine. This fails miserably (probably due to the SDK not catching the error described above) and causes the entire vcld child process to die abruptly. The problem occurs before vcld has changed the request state to 'pending'. The request state/laststate remains reload/reload or new/new when the process dies. As a result, vcld keeps trying the same sequence over and over again.
>>>
>>> It's possible to improve the code to catch this by wrapping all Vim::* calls in an eval block. I'll get this implemented for the next release. Patching 2.3.2 may be possible but could also be ugly. Vim::get_view is called from many places in vSphere_SDK.pm. Every one of these would need to be updated.
>>>
>>> Regards,
>>> Andy
>>>
>>> > Please let me know if you need any further clarification.
>>> >
>>> > Thanks.
>>> >
>>> > --
>>> > Junaid Ali
>>>
>>
>> --
>> Junaid Ali
>> Systems & Virtualization Engineer,
>> Office of Technology Services/IIT,
>> 10W, 31st Street,
>> Stuart Building Room # 007,
>> Chicago, IL - 60616
>> Ph (O): 312-567-5836
>> Ph (F): 312-567-5968
>

--
Junaid Ali
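As an illustration of the eval-wrapping pattern Andy describes for vSphere_SDK.pm, a minimal sketch follows. It is not the committed repository change or the attached patch: _get_view_safe is a hypothetical wrapper name, and notify() and %ERRORS are assumed to be available from VCL::utils as in the other VCL Perl modules.

# Sketch only -- a hypothetical wrapper that vSphere_SDK.pm subroutines could
# call in place of a bare Vim::get_view, so that an XML parse error such as
# "Start tag expected, '<' not found" is caught instead of killing the vcld
# child process.
sub _get_view_safe {
    my ($self, @args) = @_;
    my $view = eval { Vim::get_view(@args) };
    if ($@ || !defined $view) {
        notify($ERRORS{'WARNING'}, 0, "Vim::get_view failed: $@");
        return;
    }
    return $view;
}

# Callers would then check the return value rather than assuming success, e.g.:
#   my $file_query_view = $self->_get_view_safe(mo_ref => $mo_ref) || return;

With the call wrapped this way, a failure to parse the host's response becomes a logged warning and a normal subroutine failure, so the request can be retried or failed cleanly instead of leaving the state stuck at reload/reload while the child process dies repeatedly.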
