Try this: wget https://svn.apache.org/repos/asf/vcl/sandbox/patches/vSphere_SDK_2.3.2_updated.pm
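
The gist of the change is to keep an uncaught die() inside the SDK from taking down the whole vcld child process. Schematically, each SDK call ends up looking something like the following (a simplified sketch for illustration only, not the literal contents of the updated file -- the real vSphere_SDK.pm adds its own error handling and logging around this):

  # Vim::get_view() can die() deep inside the SDK (for example the LibXML
  # "Start tag expected" parser error when the host refuses the connection),
  # so the call is wrapped in eval and a failure is returned to the caller
  # instead of letting the vcld child process die.
  # $mo_ref is the managed object reference passed in by the surrounding subroutine.
  my $view;
  eval { $view = Vim::get_view(mo_ref => $mo_ref); };
  if ($@ || !defined($view)) {
     notify($ERRORS{'WARNING'}, 0, "failed to retrieve view, error: $@");
     return;
  }
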
Regards,
Andy

On Thu, Oct 9, 2014 at 10:32 AM, Junaid Ali <[email protected]> wrote:

> Hi Andy,
> Thanks for the reply and the quick patch. I didn't receive any attachment.
> Can you please resend the attachment?
>
> Thanks.
>
> On Wed, Oct 8, 2014 at 12:44 PM, Andy Kurth <[email protected]> wrote:
>
> > The pids look correct. Is this host currently generating any of the
> > 'Start tag expected...' error messages? If it is then something else may
> > be wrong. If not, then if you see the error in the future I would again
> > check the pids. It's pretty simple but tedious. In your file, the value
> > in vmware-hostd.PID (2953236) should match the PID of the first
> > hostd-worker process and the PPID of the other hostd-worker processes.
> > The same goes for the other files and processes.
> >
> > I have updated the code in the repository to catch the error which
> > caused the process to die over and over again. I also applied similar
> > changes to the vSphere_SDK.pm file that shipped with 2.3.2; the result
> > is the attached file. You can try to swap your current file:
> > /usr/local/vcl/lib/VCL/Module/Provisioning/vSphere_SDK.pm
> > ...with the attached file and restart the vcld service.
> >
> > I tested this on a simple reload and it worked. The updated file won't
> > prevent the initial error from occurring but it will catch the problem
> > so that the vcld process doesn't abruptly die and repeatedly retry.
> >
> > -Andy
> >
> > On Wed, Oct 8, 2014 at 12:35 PM, Junaid Ali <[email protected]> wrote:
> >
> >> Hi Andy,
> >> Thanks for the information. I was able to ssh into the vmhost and run
> >> the commands. It runs the vim-cmd commands without any errors. Attached
> >> are the PIDs from the files as well as the ps command output, and they
> >> are consistent. So it may not be related to a PID mismatch. I looked
> >> back during the week of 9/3 when the original problem occurred and
> >> could not find anything out of place on the VMHost (please check the
> >> attached PDF report from vCenter) or the management node.
> >>
> >> Thanks.
> >> Junaid.
> >>
> >> On Wed, Oct 8, 2014 at 9:18 AM, Andy Kurth <[email protected]> wrote:
> >>
> >>> There are probably 2 related problems -- (1) the health of the ESXi
> >>> server and (2) the VCL code not handling all cases when the health of
> >>> the host causes unexpected results. More below...
> >>>
> >>> On Tue, Oct 7, 2014 at 6:53 PM, Junaid Ali <[email protected]> wrote:
> >>>
> >>> > Hello,
> >>> > I've recently been hitting a memory leak with the vcl daemon
> >>> > (VCL version 2.3.2). The problem appears to be happening in the
> >>> > computer_not_being_used subroutine within new.pm (see attached log).
> >>> >
> >>> > The problem appears to start when, during a reload, there was an
> >>> > issue communicating with the VMware server.
> >>> > This caused the VM to be left on the VMHost in a powered off state,
> >>> > along with the deletion of the entries from the computerloadlog table:
> >>> >
> >>> > |6309|19812:19812|reload| ---- CRITICAL ----
> >>> > |6309|19812:19812|reload| 2014-09-03 09:45:50|6309|19812:19812|reload|vcld:die_handler(639)|:1: parser error : Start tag expected, '<' not found
> >>> > |6309|19812:19812|reload| Can't connect to vcl2:443 (Connection refused)
> >>> > |6309|19812:19812|reload| ^
> >>> > |6309|19812:19812|reload| ( 0) vcld, die_handler (line: 639)
> >>> > |6309|19812:19812|reload| (-1) LibXML.pm, (eval) (line: 378)
> >>> > |6309|19812:19812|reload| (-2) LibXML.pm, parse_string (line: 378)
> >>> > |6309|19812:19812|reload| (-3) VICommon.pm, (eval) (line: 2194)
> >>> > |6309|19812:19812|reload| (-4) VICommon.pm, request (line: 2194)
> >>> > |6309|19812:19812|reload| (-5) (eval 29660), RetrieveProperties (line: 172)
> >>> > |6309|19812:19812|reload| (-6) VICommon.pm, update_view_data (line: 1663)
> >>> > |6309|19812:19812|reload| (-7) VICommon.pm, get_view (line: 1512)
> >>> > |6309|19812:19812|reload| (-8) vSphere_SDK.pm, _get_file_info (line: 2471)
> >>> > |6309|19812:19812|reload| (-9) vSphere_SDK.pm, find_files (line: 2096)
> >>> > |6309|19812:19812|reload| (-10) VMware.pm, remove_existing_vms (line: 1594)
> >>> > |6309|19812:19812|reload| (-11) VMware.pm, load (line: 469)
> >>> > |6309|19812:19812|reload| (-12) new.pm, reload_image (line: 671)
> >>> > |6309|19812:19812|reload| (-13) new.pm, process (line: 291)
> >>> > |6309|19812:19812|reload| (-14) vcld, make_new_child (line: 571)
> >>> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6396)|removing computerloadlog entries matching loadstate = begin
> >>> > 2014-09-03 09:45:51|6309|19812:19812|reload|utils.pm:delete_computerloadlog_reservation(6443)|deleted rows from computerloadlog for reservation id=19812
> >>>
> >>> Yes. We are seeing this more and more as of late on our ESXi 4.1
> >>> servers. This particular error only appears if you are using the
> >>> vSphere SDK to manage the host. I believe the same underlying problem
> >>> is described in the following issue if SSH and vim-cmd is used to
> >>> manage the host:
> >>> https://issues.apache.org/jira/browse/VCL-769
> >>>
> >>> As a test on a server which is exhibiting the problem you described,
> >>> and to determine if the problems are related, please try to SSH in and
> >>> run the following command:
> >>> vim-cmd hostsvc/datastore/info
> >>>
> >>> If this displays an error then they are related. Running 'services.sh
> >>> restart' on the host may solve the problem. If not, then it's likely
> >>> the .pid files in /var/run became inconsistent with the running
> >>> services. Each should contain the PID of the corresponding service. If
> >>> they contain the wrong PID then 'services.sh restart' will fail to
> >>> restart some services and the problems will continue. If you verify
> >>> that 'services.sh restart' doesn't fix the issue, I can try to write
> >>> instructions on how to fix the files manually. I have added some code
> >>> to VIM_SSH.pm to try to correct the .pid files automatically. This
> >>> isn't possible with the vSphere SDK.
> >>>
> >>> Please send the contents of each of these files from an affected host:
> >>> /var/run/vmware/vmware-hostd.PID
> >>> /var/run/vmware/vicimprovider.PID
> >>> /var/run/vmware/vmkdevmgr.pid
> >>> /var/run/vmware/vmkeventd.pid
> >>> /var/run/vmware/vmsyslogd.pid
> >>> /var/run/vmware/vmware-rhttpproxy.PID
> >>> /var/run/vmware/vmware-vpxa.PID
> >>>
> >>> And the output from these commands:
> >>> ps -ef | grep hostd-worker
> >>> ps -ef | grep sfcb-vmware_bas
> >>> ps -ef | grep vmkdevmgr
> >>> ps -ef | grep vmkeventd
> >>> ps -ef | grep vmsyslogd
> >>> ps -ef | grep rhttpproxy-work
> >>> ps -ef | grep vpxa-worker
> >>>
> >>> > Now when a new reservation comes in and the same VM is allocated for
> >>> > the reservation, the computer_not_being_used subroutine calls
> >>> > $self->code_loop_timeout(sub{return !reservation_being_processed(@_)},
> >>> > [$competing_reservation_id], $message, $total_wait_seconds,
> >>> > $attempt_delay_seconds) (on line 815 in new.pm) and receives a 0 from
> >>> > reservation_being_processed with the message:
> >>> >
> >>> > "2014-10-07 11:45:54|8175|23084:23084|new|utils.pm:reservation_being_processed(8634)|computerloadlog 'begin' entry does NOT exist for reservation 19812"
> >>> >
> >>> > The vcl daemon thinks that the reload has completed. This causes the
> >>> > same reservation to be processed over and over within
> >>> > computer_not_being_used, causing memory spikes and eventually killing
> >>> > that vcld thread.
> >>> >
> >>> > Any ideas how reservation_being_processed can handle the lack of
> >>> > "begin" entries when used along with the code_loop_timeout from
> >>> > computer_not_being_used, or how the DESTROY handler can make sure
> >>> > such reservations are purged, so it doesn't cause this issue?
> >>>
> >>> The VCL code could be improved to better handle this problem.
> >>>
> >>> The problem is probably due to when the error occurs in the vcld
> >>> process sequence -- very early on when the object to manage the VM
> >>> host is being initialized. From the trace output, the
> >>> vSphere_SDK.pm::_get_file_info subroutine is calling the vSphere SDK's
> >>> Vim::get_view subroutine. This fails miserably (probably due to the
> >>> SDK not catching the error described above) and causes the entire vcld
> >>> child process to die abruptly. The problem occurs before vcld changed
> >>> the request state to 'pending'. The request state/laststate remains
> >>> reload/reload or new/new when the process dies. As a result, vcld
> >>> keeps trying the same sequence over and over again.
> >>>
> >>> It's possible to improve the code to catch this by wrapping all Vim::*
> >>> calls in an eval block. I'll get this implemented for the next
> >>> release. Patching 2.3.2 may be possible but could also be ugly.
> >>> Vim::get_view is called from many places in vSphere_SDK.pm. Every one
> >>> of these would need to be updated.
> >>>
> >>> Regards,
> >>> Andy
> >>>
> >>> > Please let me know if you need any further clarification.
> >>> >
> >>> > Thanks.
> >>> >
> >>> > --
> >>> > Junaid Ali
> >>>
> >>
> >> --
> >> Junaid Ali
> >> Systems & Virtualization Engineer,
> >> Office of Technology Services/IIT,
> >> 10W, 31st Street,
> >> Stuart Building Room # 007,
> >> Chicago, IL - 60616
> >> Ph (O): 312-567-5836
> >> Ph (F): 312-567-5968
> >
>
> --
> Junaid Ali
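
A note on the .pid consistency check mentioned above, for anyone hitting this later: the rule is that each file under /var/run/vmware should contain the PID of its corresponding service, and the value in vmware-hostd.PID should be the PID of the first hostd-worker process and the PPID of the others. A rough way to express that check from the management node is sketched below (illustration only -- this is not the code that was added to VIM_SSH.pm, and the hostname is a placeholder):

  use strict;
  use warnings;

  # Hypothetical sketch: compare the PID recorded in vmware-hostd.PID with
  # the running hostd-worker processes on an ESXi host, over ssh.
  my $host = 'esxi-host';    # placeholder hostname

  my ($recorded_pid) = `ssh $host cat /var/run/vmware/vmware-hostd.PID` =~ /(\d+)/;
  die "no PID found in vmware-hostd.PID\n" unless defined $recorded_pid;

  my @worker_lines = grep { /hostd-worker/ && !/grep/ } `ssh $host ps -ef`;
  die "no hostd-worker processes found\n" unless @worker_lines;

  # The recorded PID should appear on every hostd-worker line: as the PID
  # of the first worker and as the PPID of the remaining workers.
  if (my @mismatches = grep { !/\b$recorded_pid\b/ } @worker_lines) {
      print "vmware-hostd.PID ($recorded_pid) does NOT match these hostd-worker processes:\n", @mismatches;
  }
  else {
      print "vmware-hostd.PID ($recorded_pid) is consistent with the running hostd-worker processes\n";
  }

The same idea applies to the other .pid files and their corresponding processes listed above.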

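On the point that Vim::get_view is called from many places in vSphere_SDK.pm: one way a 2.3.2 patch could avoid editing every call site would be to route the calls through a single wrapper so the eval only lives in one place. A hypothetical sketch (neither _get_view_safe nor this exact approach exists in the shipped vSphere_SDK.pm):

  # Hypothetical helper -- not part of the shipped vSphere_SDK.pm.
  # Call sites would invoke $self->_get_view_safe(...) instead of calling
  # Vim::get_view(...) directly, so the eval only has to be written once.
  sub _get_view_safe {
      my ($self, %args) = @_;

      my $view;
      eval { $view = Vim::get_view(%args); };
      if ($@ || !defined($view)) {
          notify($ERRORS{'WARNING'}, 0, "Vim::get_view failed, error: " . ($@ || 'no view object returned'));
          return;
      }
      return $view;
  }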