There are probably 2 related problems -- (1) the health of the ESXi server and (2) the VCL code not handling all cases when the health of the host causes unexpected results. More below...
On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <[email protected]> wrote: > Hello, > I've recently been hitting a memory leak with the vcl daemon (VCL version > 2.3.2). The problem appears to be happening in the computer_not_being_used > subroutine within new.pm (see attached log) > > The problem appears to start when during a reload there was an issue > communicating with the VMWare server. This caused the VM to be left on the > VMHost in a powered off state along with the deletion of the entries from > the computerloadlog table > > *|6309|19812:19812|reload| ---- CRITICAL ---- * > *|6309|19812:19812|reload| 2014-09-03 > 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : > Start tag expected, '<' not found* > *|6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)* > *|6309|19812:19812|reload| ^* > *|6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)* > *|6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)* > *|6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)* > *|6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)* > *|6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)* > *|6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: > 172)* > *|6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)* > *|6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)* > *|6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: > 2471)* > *|6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)* > *|6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: > 1594)* > *|6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)* > *|6309|19812:19812|reload| (-12) new.pm <http://new.pm>, reload_image > (line: 671)* > *|6309|19812:19812|reload| (-13) new.pm <http://new.pm>, process (line: > 291)* > *|6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)* > *2014-09-03 > 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing > computerloadlog entries matching loadstate = begin* > *2014-09-03 > 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted > rows from computerloadlog for reservation id=19812* > > Yes. We are seeing this more and more as of late on our ESXi 4.1 servers. This particular error only appears if you are using the vSphere SDK to manage the host. I believe the same underlying problem is described in the following issue if SSH and vim-cmd is used to manage the host: https://issues.apache.org/jira/browse/VCL-769 As a test on a server which is exhibiting the problem you described and to determine if the problems are related, please try to SSH in and run the following command: vim-cmd hostsvc/datastore/info If this displays an error then they are related. Running 'services.sh restart' on the host may solve the problem. If not, then it's likely the .pid files in /var/run became inconsistent with the running services. Each should contain the PID of the corresponding service. If they contain the wrong PID then 'services.sh restart' will fail to restart some services and the problems will continue. If you verify that 'services.sh restart' doesn't fix the issue, I can try to write instructions on how to fix the files manually. I have added some code to VIM_SSH.pm to try to correct the .pid files automatically. This isn't possible with the vSphere SDK. Please send the contents of each of these files from an affected host: /var/run/vmware/vmware-hostd.PID /var/run/vmware/vicimprovider.PID /var/run/vmware/vmkdevmgr.pid /var/run/vmware/vmkeventd.pid /var/run/vmware/vmsyslogd.pid /var/run/vmware/vmware-rhttpproxy.PID /var/run/vmware/vmware-vpxa.PID And the output from these commands: ps -ef | grep hostd-worker ps -ef | grep sfcb-vmware_bas ps -ef | grep vmkdevmgr ps -ef | grep vmkeventd ps -ef | grep vmsyslogd ps -ef | grep rhttpproxy-work ps -ef | grep vpxa-worker > Now when a new reservation comes in and the same vm is allocated for the > reservation, the computer_not_being_used subroutine calls the > $self->code_loop_timeout(sub{return !reservation_being_processed(@_)}, > [$competing_reservation_id], $message, $total_wait_seconds, > $attempt_delay_seconds)) section (on line # 815 in new.pm) it receives a > 0 from reservation_being_processed with message > > *"2014-10-07 > 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog > 'begin' entry does NOT exist for reservation 19812" * > > The vcl daemon thinks that the reload has completed. This causes the same > reservation to be processed over an over within computer_not_being_used > causing memory spike's and eventually killing that vcld thread. > > Any ideas how the reservation_being_processed can handle the lack of > "begin" entries when used along with the code_loop_timeout from > computer_not_being_used or the DESTROY handler can make sure such > reservations are purged, so it doesn't cause this issue? > > The VCL code could be improved to better handle this problem. The problem is probably due to when the error occurs in the vcld process sequence -- very early on when the object to manage the VM host is being initialized. From the trace output, the vSphere_SDK.pm::_get_file_info subroutine is calling the vSphere SDK's Vim::get_view subroutine. This fails miserably (probably due to the SDK not catching the error described above) and causes the entire vcld child process to die abruptly. The problem occurs before the vcld changed the request state to 'pending'. The request state/laststate remains to be reload/reload or new/new when the process dies. As a result, vcld keeps trying the same sequence over and over again. It's possible to improve the code to catch this by wrapping all Vim::* calls in an eval block. I'll get this implemented for the next release. Patching 2.3.2 may be possible but could also be ugly. Vim::get_view is called from many places in vSphere_SDK.pm. Every one of these would need to be updated. Regards, Andy > Please let me know if you need any further clarification. > > Thanks. > > -- > Junaid Ali >
