Try this: wget https://svn.apache.org/repos/asf/vcl/sandbox/patches/vSphere_SDK_2.3.2_updated.pm
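
The gist of the change is to keep an uncaught die() inside the SDK from taking down the whole vcld child process. Schematically, each SDK call ends up looking something like the following (a simplified sketch for illustration only, not the literal contents of the updated file -- the real vSphere_SDK.pm adds its own error handling and logging around this):

  # Vim::get_view() can die() deep inside the SDK (for example the LibXML
  # "Start tag expected" parser error when the host refuses the connection),
  # so the call is wrapped in eval and a failure is returned to the caller
  # instead of letting the vcld child process die.
  # $mo_ref is the managed object reference passed in by the surrounding subroutine.
  my $view;
  eval { $view = Vim::get_view(mo_ref => $mo_ref); };
  if ($@ || !defined($view)) {
     notify($ERRORS{'WARNING'}, 0, "failed to retrieve view, error: $@");
     return;
  }
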
Regards,
Andy

On Thu, Oct 9, 2014 at 10:32 AM, Junaid Ali <[email protected]> wrote:

> Hi Andy,
> Thanks for the reply and the quick patch. I didn't receive any attachment.
> Can you please resend the attachment?
>
> Thanks.
>
> On Wed, Oct 8, 2014 at 12:44 PM, Andy Kurth <[email protected]> wrote:
>
> > The pids look correct. Is this host currently generating any of the
> > 'Start tag expected...' error messages? If it is then something else may
> > be wrong. If not, then if you see the error in the future I would again
> > check the pids. It's pretty simple but tedious. In your file, the value
> > in vmware-hostd.PID (2953236) should match the PID of the first
> > hostd-worker process and the PPID of the other hostd-worker processes.
> > The same goes for the other files and processes.
> >
> > I have updated the code in the repository to catch the error which
> > caused the process to die over and over again. I also applied similar
> > changes to the vSphere_SDK.pm file that shipped with 2.3.2; the result
> > is the attached file. You can try to swap your current file:
> > /usr/local/vcl/lib/VCL/Module/Provisioning/vSphere_SDK.pm
> > ...with the attached file and restart the vcld service.
> >
> > I tested this on a simple reload and it worked. The updated file won't
> > prevent the initial error from occurring but it will catch the problem
> > so that the vcld process doesn't abruptly die and repeatedly retry.
> >
> > -Andy
> >
> > On Wed, Oct 8, 2014 at 12:35 PM, Junaid Ali <[email protected]> wrote:
> >
> >> Hi Andy,
> >> Thanks for the information. I was able to ssh into the vmhost and run
> >> the commands. It runs the vim-cmd commands without any errors. Attached
> >> are the PIDs from the files as well as the ps command output, and they
> >> are consistent. So it may not be related to a PID mismatch. I looked
> >> back during the week of 9/3 when the original problem occurred and
> >> could not find anything out of place on the VMHost (please check the
> >> attached PDF report from vCenter) or the management node.
> >>
> >> Thanks.
> >> Junaid.
> >>
> >> On Wed, Oct 8, 2014 at 9:18 AM, Andy Kurth <[email protected]> wrote:
> >>
> >>> There are probably 2 related problems -- (1) the health of the ESXi
> >>> server and (2) the VCL code not handling all cases when the health of
> >>> the host causes unexpected results. More below...
> >>>
> >>> On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <[email protected]> wrote:
> >>>
> >>> > Hello,
> >>> > I've recently been hitting a memory leak with the vcl daemon
> >>> > (VCL version 2.3.2). The problem appears to be happening in the
> >>> > computer_not_being_used subroutine within new.pm (see attached log).
> >>> >
> >>> > The problem appears to start when, during a reload, there was an
> >>> > issue communicating with the VMware server.
> >>> > This caused the VM to be left on the VMHost in a powered off state,
> >>> > along with the deletion of the entries from the computerloadlog table:
> >>> >
> >>> > |6309|19812:19812|reload| ---- CRITICAL ----
> >>> > |6309|19812:19812|reload| 2014-09-03 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : Start tag expected, '<' not found
> >>> > |6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)
> >>> > |6309|19812:19812|reload| ^
> >>> > |6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)
> >>> > |6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)
> >>> > |6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)
> >>> > |6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)
> >>> > |6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)
> >>> > |6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: 172)
> >>> > |6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)
> >>> > |6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)
> >>> > |6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: 2471)
> >>> > |6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)
> >>> > |6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: 1594)
> >>> > |6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)
> >>> > |6309|19812:19812|reload| (-12) new.pm, reload_image (line: 671)
> >>> > |6309|19812:19812|reload| (-13) new.pm, process (line: 291)
> >>> > |6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)
> >>> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing computerloadlog entries matching loadstate = begin
> >>> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted rows from computerloadlog for reservation id=19812
> >>>
> >>> Yes. We are seeing this more and more as of late on our ESXi 4.1
> >>> servers. This particular error only appears if you are using the
> >>> vSphere SDK to manage the host. I believe the same underlying problem
> >>> is described in the following issue if SSH and vim-cmd is used to
> >>> manage the host:
> >>> https://issues.apache.org/jira/browse/VCL-769
> >>>
> >>> As a test on a server which is exhibiting the problem you described,
> >>> and to determine if the problems are related, please try to SSH in and
> >>> run the following command:
> >>> vim-cmd hostsvc/datastore/info
> >>>
> >>> If this displays an error then they are related. Running 'services.sh
> >>> restart' on the host may solve the problem. If not, then it's likely
> >>> the .pid files in /var/run became inconsistent with the running
> >>> services. Each should contain the PID of the corresponding service. If
> >>> they contain the wrong PID then 'services.sh restart' will fail to
> >>> restart some services and the problems will continue. If you verify
> >>> that 'services.sh restart' doesn't fix the issue, I can try to write
> >>> instructions on how to fix the files manually. I have added some code
> >>> to VIM_SSH.pm to try to correct the .pid files automatically. This
> >>> isn't possible with the vSphere SDK.
> >>>
> >>> Please send the contents of each of these files from an affected host:
> >>> /var/run/vmware/vmware-hostd.PID
> >>> /var/run/vmware/vicimprovider.PID
> >>> /var/run/vmware/vmkdevmgr.pid
> >>> /var/run/vmware/vmkeventd.pid
> >>> /var/run/vmware/vmsyslogd.pid
> >>> /var/run/vmware/vmware-rhttpproxy.PID
> >>> /var/run/vmware/vmware-vpxa.PID
> >>>
> >>> And the output from these commands:
> >>> ps -ef | grep hostd-worker
> >>> ps -ef | grep sfcb-vmware_bas
> >>> ps -ef | grep vmkdevmgr
> >>> ps -ef | grep vmkeventd
> >>> ps -ef | grep vmsyslogd
> >>> ps -ef | grep rhttpproxy-work
> >>> ps -ef | grep vpxa-worker
> >>>
> >>> > Now when a new reservation comes in and the same VM is allocated for
> >>> > the reservation, the computer_not_being_used subroutine calls
> >>> > $self->code_loop_timeout(sub{return !reservation_being_processed(@_)},
> >>> > [$competing_reservation_id], $message, $total_wait_seconds,
> >>> > $attempt_delay_seconds) (on line 815 in new.pm) and receives a 0 from
> >>> > reservation_being_processed with the message:
> >>> >
> >>> > "2014-10-07 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog 'begin' entry does NOT exist for reservation 19812"
> >>> >
> >>> > The vcl daemon thinks that the reload has completed. This causes the
> >>> > same reservation to be processed over and over within
> >>> > computer_not_being_used, causing memory spikes and eventually killing
> >>> > that vcld thread.
> >>> >
> >>> > Any ideas how reservation_being_processed can handle the lack of
> >>> > "begin" entries when used along with the code_loop_timeout from
> >>> > computer_not_being_used, or how the DESTROY handler can make sure
> >>> > such reservations are purged, so it doesn't cause this issue?
> >>>
> >>> The VCL code could be improved to better handle this problem.
> >>>
> >>> The problem is probably due to when the error occurs in the vcld
> >>> process sequence -- very early on when the object to manage the VM
> >>> host is being initialized. From the trace output, the
> >>> vSphere_SDK.pm::_get_file_info subroutine is calling the vSphere SDK's
> >>> Vim::get_view subroutine. This fails miserably (probably due to the
> >>> SDK not catching the error described above) and causes the entire vcld
> >>> child process to die abruptly. The problem occurs before vcld changed
> >>> the request state to 'pending'. The request state/laststate remains
> >>> reload/reload or new/new when the process dies. As a result, vcld
> >>> keeps trying the same sequence over and over again.
> >>>
> >>> It's possible to improve the code to catch this by wrapping all Vim::*
> >>> calls in an eval block. I'll get this implemented for the next
> >>> release. Patching 2.3.2 may be possible but could also be ugly.
> >>> Vim::get_view is called from many places in vSphere_SDK.pm. Every one
> >>> of these would need to be updated.
> >>>
> >>> Regards,
> >>> Andy
> >>>
> >>> > Please let me know if you need any further clarification.
> >>> >
> >>> > Thanks.
> >>> >
> >>> > --
> >>> > Junaid Ali
> >>>
> >>
> >> --
> >> Junaid Ali
> >> Systems & Virtualization Engineer,
> >> Office of Technology Services/IIT,
> >> 10W, 31st Street,
> >> Stuart Building Room # 007,
> >> Chicago, IL - 60616
> >> Ph (O): 312-567-5836
> >> Ph (F): 312-567-5968
> >
>
> --
> Junaid Ali
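
A note on the .pid consistency check mentioned above, for anyone hitting this later: the rule is that each file under /var/run/vmware should contain the PID of its corresponding service, and the value in vmware-hostd.PID should be the PID of the first hostd-worker process and the PPID of the others. A rough way to express that check from the management node is sketched below (illustration only -- this is not the code that was added to VIM_SSH.pm, and the hostname is a placeholder):

  use strict;
  use warnings;

  # Hypothetical sketch: compare the PID recorded in vmware-hostd.PID with
  # the running hostd-worker processes on an ESXi host, over ssh.
  my $host = 'esxi-host';    # placeholder hostname

  my ($recorded_pid) = `ssh $host cat /var/run/vmware/vmware-hostd.PID` =~ /(\d+)/;
  die "no PID found in vmware-hostd.PID\n" unless defined $recorded_pid;

  my @worker_lines = grep { /hostd-worker/ && !/grep/ } `ssh $host ps -ef`;
  die "no hostd-worker processes found\n" unless @worker_lines;

  # The recorded PID should appear on every hostd-worker line: as the PID
  # of the first worker and as the PPID of the remaining workers.
  if (my @mismatches = grep { !/\b$recorded_pid\b/ } @worker_lines) {
      print "vmware-hostd.PID ($recorded_pid) does NOT match these hostd-worker processes:\n", @mismatches;
  }
  else {
      print "vmware-hostd.PID ($recorded_pid) is consistent with the running hostd-worker processes\n";
  }

The same idea applies to the other .pid files and their corresponding processes listed above.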

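On the point that Vim::get_view is called from many places in vSphere_SDK.pm: one way a 2.3.2 patch could avoid editing every call site would be to route the calls through a single wrapper so the eval only lives in one place. A hypothetical sketch (neither _get_view_safe nor this exact approach exists in the shipped vSphere_SDK.pm):

  # Hypothetical helper -- not part of the shipped vSphere_SDK.pm.
  # Call sites would invoke $self->_get_view_safe(...) instead of calling
  # Vim::get_view(...) directly, so the eval only has to be written once.
  sub _get_view_safe {
      my ($self, %args) = @_;

      my $view;
      eval { $view = Vim::get_view(%args); };
      if ($@ || !defined($view)) {
          notify($ERRORS{'WARNING'}, 0, "Vim::get_view failed, error: " . ($@ || 'no view object returned'));
          return;
      }
      return $view;
  }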