Hi Andy,
Thanks for the reply and the quick patch. I didn't receive any attachment; could you please resend it?
Thanks.

On Wed, Oct 8, 2014 at 12:44 PM, Andy Kurth <[email protected]> wrote:

> The pids look correct. Is this host currently generating any of the 'Start tag expected...' error messages? If it is, then something else may be wrong. If not, then if you see the error in the future I would again check the pids. It's pretty simple but tedious. In your file, the value in vmware-hostd.PID (2953236) should match the PID of the first hostd-worker process and the PPID of the other hostd-worker processes. The same goes for the other files and processes.
>
> I have updated the code in the repository to catch the error which caused the process to die over and over again. I also applied similar changes to the vSphere_SDK.pm file that shipped with 2.3.2 and attached the result. You can try to swap your current file:
> /usr/local/vcl/lib/VCL/Module/Provisioning/vSphere_SDK.pm
> ...with the attached file and restart the vcld service.
>
> I tested this on a simple reload and it worked. The updated file won't prevent the initial error from occurring, but it will catch the problem so that the vcld process doesn't abruptly die and repeatedly retry.
>
> -Andy
>
> On Wed, Oct 8, 2014 at 12:35 PM, Junaid Ali <[email protected]> wrote:
>
>> Hi Andy,
>> Thanks for the information.
>> I was able to ssh into the vmhost and run the commands. It runs the vim-cmd commands without any errors. Attached are the PIDs from the files as well as the ps command output, and they are consistent. So it may not be related to a PID mismatch. I looked back during the week of 9/3 when the original problem occurred and could not find anything out of place on the VMHost (please check the attached pdf report from vCenter) or the management node.
>>
>> Thanks.
>> Junaid.
>>
>> On Wed, Oct 8, 2014 at 9:18 AM, Andy Kurth <[email protected]> wrote:
>>
>>> There are probably 2 related problems -- (1) the health of the ESXi server and (2) the VCL code not handling all cases when the health of the host causes unexpected results. More below...
>>>
>>> On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <[email protected]> wrote:
>>>
>>> > Hello,
>>> > I've recently been hitting a memory leak with the vcl daemon (VCL version 2.3.2). The problem appears to be happening in the computer_not_being_used subroutine within new.pm (see attached log).
>>> >
>>> > The problem appears to start when, during a reload, there was an issue communicating with the VMware server.
>>> > This caused the VM to be left on the VMHost in a powered off state, along with the deletion of the entries from the computerloadlog table:
>>> >
>>> > |6309|19812:19812|reload| ---- CRITICAL ----
>>> > |6309|19812:19812|reload| 2014-09-03 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : Start tag expected, '<' not found
>>> > |6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)
>>> > |6309|19812:19812|reload| ^
>>> > |6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)
>>> > |6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)
>>> > |6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)
>>> > |6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)
>>> > |6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)
>>> > |6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: 172)
>>> > |6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)
>>> > |6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)
>>> > |6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: 2471)
>>> > |6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)
>>> > |6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: 1594)
>>> > |6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)
>>> > |6309|19812:19812|reload| (-12) new.pm, reload_image (line: 671)
>>> > |6309|19812:19812|reload| (-13) new.pm, process (line: 291)
>>> > |6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)
>>> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing computerloadlog entries matching loadstate = begin
>>> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted rows from computerloadlog for reservation id=19812
>>> >
>>> Yes. We are seeing this more and more as of late on our ESXi 4.1 servers. This particular error only appears if you are using the vSphere SDK to manage the host. I believe the same underlying problem is described in the following issue if SSH and vim-cmd are used to manage the host:
>>> https://issues.apache.org/jira/browse/VCL-769
>>>
>>> As a test on a server which is exhibiting the problem you described, and to determine if the problems are related, please try to SSH in and run the following command:
>>> vim-cmd hostsvc/datastore/info
>>>
>>> If this displays an error then they are related. Running 'services.sh restart' on the host may solve the problem. If not, then it's likely the .pid files in /var/run became inconsistent with the running services. Each should contain the PID of the corresponding service. If they contain the wrong PID, then 'services.sh restart' will fail to restart some services and the problems will continue. If you verify that 'services.sh restart' doesn't fix the issue, I can try to write instructions on how to fix the files manually. I have added some code to VIM_SSH.pm to try to correct the .pid files automatically. This isn't possible with the vSphere SDK.
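For reference, a minimal sketch of the kind of .PID consistency check described above (the value recorded in vmware-hostd.PID versus the PID/PPID of the running hostd-worker processes), run from the management node rather than on the host. This is not from the thread or its attachments: the host name is a placeholder, key-based SSH access to the ESXi host is assumed, and the PID/PPID column positions (fields 2 and 3 of 'ps -ef') may need adjusting for the host's ps output.

#!/usr/bin/perl
# Rough sketch: compare the PID recorded in vmware-hostd.PID on an ESXi host
# with the running hostd-worker processes. Host name is a placeholder.
use strict;
use warnings;

my $host     = 'vmhost.example.edu';                   # placeholder, not from the thread
my $pid_file = '/var/run/vmware/vmware-hostd.PID';

# PID recorded in the .PID file
chomp(my $recorded_pid = `ssh $host cat $pid_file`);
die "could not read a numeric PID from $pid_file\n" unless $recorded_pid =~ /^\d+$/;

# Collect the PID (assumed column 2) and PPID (assumed column 3) of each hostd-worker process
my @workers;
for my $line (`ssh $host "ps -ef | grep hostd-worker | grep -v grep"`) {
    my @fields = split ' ', $line;
    next unless defined $fields[2] && $fields[1] =~ /^\d+$/;
    push @workers, { pid => $fields[1], ppid => $fields[2] };
}
die "no hostd-worker processes found\n" unless @workers;

# Each worker should either be the recorded process itself or have it as its parent
my @mismatches = grep { $_->{pid} != $recorded_pid && $_->{ppid} != $recorded_pid } @workers;
if (@mismatches) {
    print "MISMATCH: $pid_file contains $recorded_pid but " . scalar(@mismatches) .
          " hostd-worker process(es) do not reference it\n";
}
else {
    print "OK: hostd-worker processes are consistent with $pid_file\n";
}

The same check could be repeated for the other .PID files and worker process names listed below by changing the two variables at the top.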
>>>
>>> Please send the contents of each of these files from an affected host:
>>> /var/run/vmware/vmware-hostd.PID
>>> /var/run/vmware/vicimprovider.PID
>>> /var/run/vmware/vmkdevmgr.pid
>>> /var/run/vmware/vmkeventd.pid
>>> /var/run/vmware/vmsyslogd.pid
>>> /var/run/vmware/vmware-rhttpproxy.PID
>>> /var/run/vmware/vmware-vpxa.PID
>>>
>>> And the output from these commands:
>>> ps -ef | grep hostd-worker
>>> ps -ef | grep sfcb-vmware_bas
>>> ps -ef | grep vmkdevmgr
>>> ps -ef | grep vmkeventd
>>> ps -ef | grep vmsyslogd
>>> ps -ef | grep rhttpproxy-work
>>> ps -ef | grep vpxa-worker
>>>
>>> > Now when a new reservation comes in and the same VM is allocated for the reservation, the computer_not_being_used subroutine calls the $self->code_loop_timeout(sub{return !reservation_being_processed(@_)}, [$competing_reservation_id], $message, $total_wait_seconds, $attempt_delay_seconds) section (on line 815 in new.pm) and receives a 0 from reservation_being_processed with the message:
>>> >
>>> > "2014-10-07 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog 'begin' entry does NOT exist for reservation 19812"
>>> >
>>> > The vcl daemon thinks that the reload has completed. This causes the same reservation to be processed over and over within computer_not_being_used, causing memory spikes and eventually killing that vcld thread.
>>> >
>>> > Any ideas how reservation_being_processed can handle the lack of "begin" entries when used along with code_loop_timeout from computer_not_being_used, or how the DESTROY handler can make sure such reservations are purged, so this issue doesn't occur?
>>> >
>>> The VCL code could be improved to better handle this problem.
>>>
>>> The problem is probably due to when the error occurs in the vcld process sequence -- very early on, when the object to manage the VM host is being initialized. From the trace output, the vSphere_SDK.pm::_get_file_info subroutine is calling the vSphere SDK's Vim::get_view subroutine. This fails miserably (probably due to the SDK not catching the error described above) and causes the entire vcld child process to die abruptly. The problem occurs before vcld has changed the request state to 'pending'. The request state/laststate remains reload/reload or new/new when the process dies. As a result, vcld keeps trying the same sequence over and over again.
>>>
>>> It's possible to improve the code to catch this by wrapping all Vim::* calls in an eval block. I'll get this implemented for the next release. Patching 2.3.2 may be possible but could also be ugly. Vim::get_view is called from many places in vSphere_SDK.pm. Every one of these would need to be updated.
>>>
>>> Regards,
>>> Andy
>>>
>>> > Please let me know if you need any further clarification.
>>> >
>>> > Thanks.
>>> >
>>> > --
>>> > Junaid Ali
>>>
>>
>> --
>> Junaid Ali
>> Systems & Virtualization Engineer,
>> Office of Technology Services/IIT,
>> 10W, 31st Street,
>> Stuart Building Room # 007,
>> Chicago, IL - 60616
>> Ph (O): 312-567-5836
>> Ph (F): 312-567-5968
>

--
Junaid Ali
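As an illustration of the eval-wrapping pattern Andy describes for vSphere_SDK.pm, a minimal sketch follows. It is not the committed repository change or the attached patch: _get_view_safe is a hypothetical wrapper name, and notify() and %ERRORS are assumed to be available from VCL::utils as in the other VCL Perl modules.

# Sketch only -- a hypothetical wrapper that vSphere_SDK.pm subroutines could
# call in place of a bare Vim::get_view, so that an XML parse error such as
# "Start tag expected, '<' not found" is caught instead of killing the vcld
# child process.
sub _get_view_safe {
    my ($self, @args) = @_;
    my $view = eval { Vim::get_view(@args) };
    if ($@ || !defined $view) {
        notify($ERRORS{'WARNING'}, 0, "Vim::get_view failed: $@");
        return;
    }
    return $view;
}

# Callers would then check the return value rather than assuming success, e.g.:
#   my $file_query_view = $self->_get_view_safe(mo_ref => $mo_ref) || return;

With the call wrapped this way, a failure to parse the host's response becomes a logged warning and a normal subroutine failure, so the request can be retried or failed cleanly instead of leaving the state stuck at reload/reload while the child process dies repeatedly.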
