[ https://issues.apache.org/jira/browse/CLOUDSTACK-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976738#comment-15976738 ]
ASF subversion and git services commented on CLOUDSTACK-9857:
-------------------------------------------------------------

Commit 9cc3ae8a942122ba3384b348376c6a948a2a74cc in cloudstack's branch refs/heads/49-to-master from [~rajanik]
[ https://gitbox.apache.org/repos/asf?p=cloudstack.git;h=9cc3ae8 ]

Merge release branch 4.9 to master

* 4.9:
  CLOUDSTACK-9857: With this change if agent dies the systemd will catch it properly and show process as exited
  CLOUDSTACK-9805: Display VR list in network details
  CLOUDSTACK-9356: FIX Cannot add users in VPC VPN

> CloudStack KVM Agent Self Fencing - improper systemd config
> ------------------------------------------------------------
>
>                 Key: CLOUDSTACK-9857
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-9857
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: KVM
>    Affects Versions: 4.5.2
>            Reporter: Abhinandan Prateek
>            Assignee: Abhinandan Prateek
>            Priority: Critical
>             Fix For: 4.10.0.0
>
>
> We had a database outage a few days ago and noticed that most of the CloudStack KVM agents shut themselves down and never tried to reconnect. Moreover, we had Puppet, which was supposed to restart the cloudstack-agent daemon when it goes into the failed state, but apparently it never reaches the “failed” state.
> 2017-03-30 04:07:50,720 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Request:Seq -1--1: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 111, [{"com.cloud.agent.api.ReadyCommand":{"_details":"com.cloud.utils.exception.CloudRuntimeException: DB Exception on: null","wait":0}}] }
> 2017-03-30 04:07:50,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Processing command: com.cloud.agent.api.ReadyCommand
> 2017-03-30 04:07:50,721 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Not ready to connect to mgt server: com.cloud.utils.exception.CloudRuntimeException: DB Exception on: null
> 2017-03-30 04:07:50,722 INFO [cloud.agent.Agent] (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
> 2017-03-30 04:07:50,723 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) Sending shutdown to management server
>
> While the agent fenced itself for whatever reason it had, the agent's systemd service did not exit properly.
> Here is what the status of cloudstack-agent looks like:
>
> [root@mqa6-kvm02 ~]# service cloudstack-agent status
> ● cloudstack-agent.service - SYSV: Cloud Agent
>    Loaded: loaded (/etc/rc.d/init.d/cloudstack-agent)
>    Active: active (exited) since Fri 2017-03-31 23:50:47 GMT; 12s ago
>      Docs: man:systemd-sysv-generator(8)
>   Process: 632 ExecStop=/etc/rc.d/init.d/cloudstack-agent stop (code=exited, status=0/SUCCESS)
>   Process: 654 ExecStart=/etc/rc.d/init.d/cloudstack-agent start (code=exited, status=0/SUCCESS)
>  Main PID: 441
> Mar 31 23:50:47 mqa6-kvm02 systemd[1]: Starting SYSV: Cloud Agent...
> Mar 31 23:50:47 mqa6-kvm02 cloudstack-agent[654]: Starting Cloud Agent:
> Mar 31 23:50:47 mqa6-kvm02 systemd[1]: Started SYSV: Cloud Agent.
> Mar 31 23:50:49 mqa6-kvm02 sudo[806]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/grep InitiatorName= /etc/iscsi/initiatorname.iscsi
>
> The "Active: active (exited)" should be "Active: failed (Result: exit-code)".
>
> Solution:
> The fix is to add a pidfile entry to /etc/init.d/cloudstack-agent, like so:
>
> # chkconfig: 35 99 10
> # description: Cloud Agent
> + # pidfile: /var/run/cloudstack-agent.pid
>
> After that, if the agent dies, systemd will catch it properly and the status will look as expected:
>
> [root@mqa6-kvm02 ~]# service cloudstack-agent status
> ● cloudstack-agent.service - SYSV: Cloud Agent
>    Loaded: loaded (/etc/rc.d/init.d/cloudstack-agent)
>    Active: failed (Result: exit-code) since Fri 2017-03-31 23:51:40 GMT; 7s ago
>      Docs: man:systemd-sysv-generator(8)
>   Process: 1124 ExecStop=/etc/rc.d/init.d/cloudstack-agent stop (code=exited, status=255)
>   Process: 949 ExecStart=/etc/rc.d/init.d/cloudstack-agent start (code=exited, status=0/SUCCESS)
>  Main PID: 975
>
> With this change, other tools can properly inspect the state of the daemon and take action when it has failed, instead of seeing it in the active (exited) state.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
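
To illustrate the kind of check an external tool could make once the unit reports "failed", here is a minimal shell sketch. It only parses the "Active:" line of status text like the output quoted in the report. The helper names active_state and needs_restart are hypothetical and are not part of CloudStack, Puppet, or systemd.

```shell
#!/bin/sh
# Sketch only: active_state and needs_restart are hypothetical helper names.

# Reads status text on stdin and prints the state from the first "Active:"
# line, e.g. "failed (Result: exit-code)" or "active (exited)".
active_state() {
    sed -n 's/^[[:space:]]*Active:[[:space:]]*\([a-z]*\) (\([^)]*\)).*/\1 (\2)/p' | head -n 1
}

# Returns 0 only when the parsed state starts with "failed".
# "active (exited)" returns 1, which is why the agent in the report
# was never picked up for a restart before the pidfile change.
needs_restart() {
    case "$(active_state)" in
        failed*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage on a host (assumption: run as root, unit named cloudstack-agent):
#   systemctl status cloudstack-agent | needs_restart && systemctl restart cloudstack-agent
```

On a live host, `systemctl is-failed cloudstack-agent` is probably a simpler way to get the same answer without parsing text. The parsing form is shown here only because it can be checked against the two status listings in the report.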