Hi Andy,

Thanks for the information. I was able to SSH into the VM host and run the commands. The vim-cmd commands run without any errors. Attached are the PIDs from the files as well as the ps command output; they are consistent, so it may not be related to a PID mismatch. I also looked back at the week of 9/3, when the original problem occurred, and could not find anything out of place on the VM host (please see the attached PDF report from vCenter) or the management node.
Thanks.
Junaid.

On Wed, Oct 8, 2014 at 9:18 AM, Andy Kurth <[email protected]> wrote:

> There are probably 2 related problems -- (1) the health of the ESXi server
> and (2) the VCL code not handling all cases when the health of the host
> causes unexpected results. More below...
>
> On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <[email protected]> wrote:
>
> > Hello,
> > I've recently been hitting a memory leak with the vcl daemon (VCL
> > version 2.3.2). The problem appears to be happening in the
> > computer_not_being_used subroutine within new.pm (see attached log).
> >
> > The problem appears to start when, during a reload, there was an issue
> > communicating with the VMware server. This caused the VM to be left on
> > the VMHost in a powered off state along with the deletion of the entries
> > from the computerloadlog table:
> >
> > |6309|19812:19812|reload| ---- CRITICAL ----
> > |6309|19812:19812|reload| 2014-09-03 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : Start tag expected, '<' not found
> > |6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)
> > |6309|19812:19812|reload| ^
> > |6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)
> > |6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)
> > |6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)
> > |6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)
> > |6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)
> > |6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: 172)
> > |6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)
> > |6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)
> > |6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: 2471)
> > |6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)
> > |6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: 1594)
> > |6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)
> > |6309|19812:19812|reload| (-12) new.pm, reload_image (line: 671)
> > |6309|19812:19812|reload| (-13) new.pm, process (line: 291)
> > |6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)
> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing computerloadlog entries matching loadstate = begin
> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted rows from computerloadlog for reservation id=19812
>
> Yes. We are seeing this more and more as of late on our ESXi 4.1 servers.
> This particular error only appears if you are using the vSphere SDK to
> manage the host. I believe the same underlying problem is described in the
> following issue if SSH and vim-cmd is used to manage the host:
> https://issues.apache.org/jira/browse/VCL-769
>
> As a test on a server which is exhibiting the problem you described, and to
> determine if the problems are related, please try to SSH in and run the
> following command:
> vim-cmd hostsvc/datastore/info
>
> If this displays an error then they are related. Running 'services.sh
> restart' on the host may solve the problem. If not, then it's likely the
> .pid files in /var/run became inconsistent with the running services. Each
> should contain the PID of the corresponding service.
> If they contain the wrong PID then 'services.sh restart' will fail to
> restart some services and the problems will continue. If you verify that
> 'services.sh restart' doesn't fix the issue, I can try to write
> instructions on how to fix the files manually. I have added some code to
> VIM_SSH.pm to try to correct the .pid files automatically. This isn't
> possible with the vSphere SDK.
>
> Please send the contents of each of these files from an affected host:
> /var/run/vmware/vmware-hostd.PID
> /var/run/vmware/vicimprovider.PID
> /var/run/vmware/vmkdevmgr.pid
> /var/run/vmware/vmkeventd.pid
> /var/run/vmware/vmsyslogd.pid
> /var/run/vmware/vmware-rhttpproxy.PID
> /var/run/vmware/vmware-vpxa.PID
>
> And the output from these commands:
> ps -ef | grep hostd-worker
> ps -ef | grep sfcb-vmware_bas
> ps -ef | grep vmkdevmgr
> ps -ef | grep vmkeventd
> ps -ef | grep vmsyslogd
> ps -ef | grep rhttpproxy-work
> ps -ef | grep vpxa-worker
>
> > Now when a new reservation comes in and the same VM is allocated for the
> > reservation, the computer_not_being_used subroutine calls the
> > $self->code_loop_timeout(sub{return !reservation_being_processed(@_)},
> > [$competing_reservation_id], $message, $total_wait_seconds,
> > $attempt_delay_seconds) section (on line # 815 in new.pm). It receives a
> > 0 from reservation_being_processed with the message:
> >
> > "2014-10-07 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog 'begin' entry does NOT exist for reservation 19812"
> >
> > The vcl daemon thinks that the reload has completed. This causes the same
> > reservation to be processed over and over within computer_not_being_used,
> > causing memory spikes and eventually killing that vcld thread.
> >
> > Any ideas how reservation_being_processed can handle the lack of "begin"
> > entries when used along with the code_loop_timeout from
> > computer_not_being_used, or how the DESTROY handler can make sure such
> > reservations are purged, so this issue doesn't occur?
>
> The VCL code could be improved to better handle this problem.
>
> The problem is probably due to when the error occurs in the vcld process
> sequence -- very early on, when the object to manage the VM host is being
> initialized. From the trace output, the vSphere_SDK.pm::_get_file_info
> subroutine is calling the vSphere SDK's Vim::get_view subroutine. This
> fails miserably (probably due to the SDK not catching the error described
> above) and causes the entire vcld child process to die abruptly. The
> problem occurs before vcld has changed the request state to 'pending', so
> the request state/laststate remains reload/reload or new/new when the
> process dies. As a result, vcld keeps trying the same sequence over and
> over again.
>
> It's possible to improve the code to catch this by wrapping all Vim::*
> calls in an eval block. I'll get this implemented for the next release.
> Patching 2.3.2 may be possible but could also be ugly. Vim::get_view is
> called from many places in vSphere_SDK.pm. Every one of these would need
> to be updated.
>
> Regards,
> Andy
>
> > Please let me know if you need any further clarification.
> >
> > Thanks.
> >
> > --
> > Junaid Ali

--
Junaid Ali
Systems & Virtualization Engineer,
Office of Technology Services/IIT,
10W, 31st Street, Stuart Building Room # 007,
Chicago, IL - 60616
Ph (O): 312-567-5836
Ph (F): 312-567-5968
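A minimal sketch of the eval-wrapping approach described above. The wrapper name (_get_view_safe) and the exact log message are illustrative only and are not the actual patch; it assumes VCL::utils (notify, %ERRORS) is already imported, as it is in vSphere_SDK.pm:

sub _get_view_safe {
    my ($self, %args) = @_;

    # Wrap the vSphere SDK call so a SOAP/connection failure is caught here
    # instead of propagating to vcld's die handler and killing the child process
    my $view = eval { Vim::get_view(%args) };
    if ($@) {
        notify($ERRORS{'WARNING'}, 0, "Vim::get_view failed: $@");
        return;
    }

    return $view;
}

Callers such as _get_file_info would then check for an undefined return value and fail gracefully rather than assuming the view object exists.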
Please send the contents of each of these files from an affected host:

/var/run/vmware/vmware-hostd.PID      = 2953236
/var/run/vmware/vicimprovider.PID     = 2953355
/var/run/vmware/vmkdevmgr.pid         = 8936
/var/run/vmware/vmkeventd.pid         = 8919
/var/run/vmware/vmsyslogd.pid         = 4583375
/var/run/vmware/vmware-rhttpproxy.PID = <Process Not Found>
/var/run/vmware/vmware-vpxa.PID       = 2953240

And the output from these commands:

ps -P | grep hostd-worker
WID       CID       World Name       PCID     Command
2953236   2953236   hostd-worker     9126     hostd
2953248   2953236   hostd-worker     9126     hostd
2953249   2953236   hostd-worker     9126     hostd
2953257   2953236   hostd-worker     9126     hostd
2953259   2953236   hostd-worker     9126     hostd
2953260   2953236   hostd-worker     9126     hostd
2953261   2953236   hostd-worker     9126     hostd
2953418   2953236   hostd-worker     9126     hostd
2953419   2953236   hostd-worker     9126     hostd
2953421   2953236   hostd-worker     9126     hostd
2953422   2953236   hostd-worker     9126     hostd
2953423   2953236   hostd-worker     9126     hostd
2953425   2953236   hostd-worker     9126     hostd
2945235   2953236   hostd-worker     9126     hostd
3158683   2953236   hostd-worker     9126     hostd
3012281   2953236   hostd-worker     9126     hostd
3012282   2953236   hostd-worker     9126     hostd
3030752   2953236   hostd-worker     9126     hostd
2973687   2953236   hostd-worker     9126     hostd

ps -P | grep sfcb-vmware_bas
WID       CID       World Name       PCID     Command
2953355   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2953356   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2953357   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2953358   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2953359   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2953360   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2953362   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2953363   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2953364   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd
2945353   2953355   sfcb-vmware_bas  9624     /sbin/sfcbd

ps -P | grep vmkdevmgr
WID       CID       World Name       PCID     Command
8936      8936      vmkdevmgr        8420     /usr/sbin/vmkdevmgr

ps -P | grep vmkeventd
WID       CID       World Name       PCID     Command
8919      8919      vmkeventd        8420     /usr/lib/vmware/bin/vmkeventd

ps -P | grep vmsyslogd
WID       CID       World Name       PCID     Command
4583375   4583375   vmsyslogd        4624334  /bin/python
4624336   4583375   vmsyslogd        4624334  /bin/python
4583377   4583375   vmsyslogd        4624334  /bin/python

ps -P | grep rhttpproxy-work
WID       CID       World Name       PCID     Command
<Process Not Found>

ps -P | grep vpxa-worker
WID       CID       World Name       PCID     Command
2953240   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953242   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953243   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953246   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953250   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953251   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953252   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953253   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953366   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953369   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953370   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953371   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953372   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953373   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953413   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953414   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
2953424   2953240   vpxa-worker      9344     /usr/lib/vmware/vpxa/bin/vpxa
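For comparison, a rough sketch of the .pid/.PID consistency check discussed above, run from the management node rather than on the host itself (the ESXi shell has no Perl). The host name is a placeholder, key-based root SSH access is assumed, and because the exact ps column layout varies by ESXi version this only does a loose "does the recorded ID appear anywhere in the ps output" match:

#!/usr/bin/perl
use strict;
use warnings;

my $vmhost = 'vmhost1.example.edu';    # placeholder -- substitute the affected VM host

# The seven files Andy asked about
my @pid_files = qw(
    /var/run/vmware/vmware-hostd.PID
    /var/run/vmware/vicimprovider.PID
    /var/run/vmware/vmkdevmgr.pid
    /var/run/vmware/vmkeventd.pid
    /var/run/vmware/vmsyslogd.pid
    /var/run/vmware/vmware-rhttpproxy.PID
    /var/run/vmware/vmware-vpxa.PID
);

# Grab the host's process list once
my $ps_output = `ssh root\@$vmhost ps 2>/dev/null`;

for my $pid_file (@pid_files) {
    chomp(my $pid = `ssh root\@$vmhost cat $pid_file 2>/dev/null`);
    if ($pid !~ /^\d+$/) {
        print "$pid_file: no PID recorded\n";
        next;
    }
    my $status = ($ps_output =~ /\b$pid\b/) ? 'found in ps output' : 'NOT found in ps output';
    print "$pid_file ($pid): $status\n";
}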
