[GitHub] somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while NFS not available

2018-10-30 Thread GitBox
somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while 
NFS not available
URL: https://github.com/apache/cloudstack/issues/2890#issuecomment-434282030
 
 
   If you just shut down the NFS server, the NFS client gets an immediate
   response (TCP reset), so that is not the same as blocking the NFS server IP
   with iptables DROP rules, which is what I do to test a network outage:
   
   *iptables -I INPUT -s nfs_server_ip -j DROP ; iptables -I OUTPUT -d
   nfs_server_ip -j DROP*
   
   It may take more than one attempt to see the thread pool block.
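
To check which of the two failure modes a test setup actually produces, a small probe like the one below can help. It is only a sketch: `probe_tcp` is a hypothetical helper, not part of CloudStack, and the port and timeout values are assumptions (2049 is the usual NFS port). A stopped server typically yields "refused" right away, while DROP rules make the connect attempt hang until the timeout expires.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify how a TCP endpoint answers a connect attempt.

    Returns "open", "refused", "timeout" or "unreachable".
    Hypothetical helper for illustration only.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: the client fails fast.
        return "refused"
    except socket.timeout:
        # Nothing came back at all, which is what DROP rules look like.
        return "timeout"
    except OSError:
        # Other network errors, e.g. no route to host.
        return "unreachable"


if __name__ == "__main__":
    # "nfs_server_ip" is the same placeholder used in the iptables rules above.
    print(probe_tcp("nfs_server_ip", 2049))
```

Only the "timeout" case corresponds to the outage simulated with the iptables rules above.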
   
   
   
   On Tue, Oct 30, 2018 at 4:47 AM Rohit Yadav 
   wrote:
   
   > Based on the triaging exercise, I've moved this to 4.11.3.0 as further
   > discussion is pending. I've taken the least-risk approach of reverting part
   > of the change in behaviour and submitted #2984
   > 
   >
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while NFS not available

2018-10-29 Thread GitBox
somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while 
NFS not available
URL: https://github.com/apache/cloudstack/issues/2890#issuecomment-433911908
 
 
   With NFS not available, and since those are hard mounts, even a "virsh
   destroy" would not work. Libvirtd will block until the NFS mount issue is
   resolved. I think that ideally the cloudstack-agent would do every task
   in a non-blocking way and not be affected by primary storage hiccups. For
   instance, to avoid the thread pool blocking on libvirtd tasks, why not
   implement a configurable timeout on those tasks with sensible defaults? I
   don't see a good reason for a call to libvirtd to take more than a few
   seconds (besides known long-lasting tasks such as live migration).
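
The timeout idea can be sketched as follows. This is an illustration in Python, not CloudStack agent code (the agent is written in Java); `call_with_timeout`, `CallTimedOut` and `fake_blocked_call` are hypothetical names, and the 5 second default is an assumption. The point is that the caller gets control back after a bounded wait instead of blocking along with the stuck call.

```python
import threading
import time


class CallTimedOut(Exception):
    """Raised when a wrapped call does not finish within its deadline."""


def call_with_timeout(func, *args, timeout_s=5.0, **kwargs):
    """Run func in a helper thread and wait at most timeout_s seconds."""
    result = {}

    def worker():
        try:
            result["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the failure back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # The helper thread is still stuck; only the caller is released.
        name = getattr(func, "__name__", repr(func))
        raise CallTimedOut(f"{name} exceeded {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]


def fake_blocked_call():
    """Hypothetical stand-in for a libvirt call stuck on a hung NFS mount."""
    time.sleep(0.5)
    return "done"


if __name__ == "__main__":
    try:
        call_with_timeout(fake_blocked_call, timeout_s=0.1)
    except CallTimedOut as exc:
        print("gave up:", exc)
```

Note the limitation: the timed-out work itself is not cancelled. A call that is really blocked on a hard NFS mount keeps its helper thread until the mount recovers, so a real implementation would also need to cap how many such threads can pile up.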
   
   As for fencing, afaik the host HA framework was created for the purpose of
   reliable fencing... but it will do more harm than good if the end result
   is to reboot all KVM hosts via IPMI (compared to just waiting for NFS to
   come back).
   
   On Mon, Oct 29, 2018 at 9:18 AM Boris Stoyanov - a.k.a Bobby <
   notificati...@github.com> wrote:
   
   > I think the leanest way to fence the resource would be, prior to setting
   > the host down, to iterate over all its VMs and shut them down, and only then
   > proceed to mark the host as 'Down'. Once we're there, there's no issue with
   > VM-HA starting a new instance on a separate host.
   > I guess this needs further investigation and a fix as described.
   >
   




[GitHub] somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while NFS not available

2018-10-29 Thread GitBox
somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while 
NFS not available
URL: https://github.com/apache/cloudstack/issues/2890#issuecomment-433896656
 
 
   Correct. Only VM-HA-enabled VMs would get restarted and create a duplicate
   when the host goes down. Still, I don't think it is good behavior to
   fire VM-HA (because of the host Up --> Down transition) in any scenario
   caused by a transient storage disconnection. If the host goes down after 5
   minutes, VM-HA restarts the VM about one minute later, and if the NFS
   issue then gets resolved you have an almost 100% probability of root disk
   corruption, and you don't know where the 2 VMs are since CloudStack only
   remembers the last copy it started.
   
   
   On Mon, Oct 29, 2018 at 3:21 AM Boris Stoyanov - a.k.a Bobby <
   notificati...@github.com> wrote:
   
   > hi @csquire @somejfn, thanks for this issue!
   >
   > I think it's correct that the host goes into 'Down' state after losing
   > its grip on the storage, since this basically makes it inoperable.
   > Going into 'Disconnected' state would only mean the connection between
   > management and host is compromised.
   >
   > On the other hand, duplicated VMs are definitely something that needs to
   > be addressed prior to marking the host as 'Down' when VM-HA is enabled.
   > Just to be sure, can you please confirm you don't see these duplicated VMs
   > on non-VM-HA-enabled instances? I'd like to narrow down this issue and
   > make sure it's in the VM-HA logic.
   >
   




[GitHub] somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while NFS not available

2018-10-26 Thread GitBox
somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while 
NFS not available
URL: https://github.com/apache/cloudstack/issues/2890#issuecomment-433499907
 
 
   @rhtyd 




[GitHub] somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while NFS not available

2018-10-23 Thread GitBox
somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while 
NFS not available
URL: https://github.com/apache/cloudstack/issues/2890#issuecomment-432374446
 
 
   Confirmed: we see similar behavior on 4.11.2rc3, and the agent went into the
   Down state. Agent logs:
   
   810986-e702-36ea-a87b-fd48064ecb12
   2018-10-23 13:14:40,391 INFO  [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:f8cd7cf7) No existing libvirtd connection found. Opening a new one
   2018-10-23 13:14:40,392 WARN  [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:f8cd7cf7) Can not find a connection for Instance i-4-24-VM. Assuming the default connection.
   2018-10-23 13:14:40,399 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:f8cd7cf7) Trying to fetch storage pool 4e49054a-463f-306f-9678-b0d9b02af9a1 from libvirt
   2018-10-23 13:14:51,496 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:3a0df8e5) Trying to fetch storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c from libvirt
   2018-10-23 13:14:51,498 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:3a0df8e5) Asking libvirt to refresh storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c
   2018-10-23 13:15:25,027 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:581a1d95) Trying to fetch storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c from libvirt
   2018-10-23 13:15:25,029 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:581a1d95) Asking libvirt to refresh storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c
   2018-10-23 13:15:25,590 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:581a1d95) Trying to fetch storage pool 3e810986-e702-36ea-a87b-fd48064ecb12 from libvirt
   2018-10-23 13:15:25,592 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:581a1d95) Asking libvirt to refresh storage pool 3e810986-e702-36ea-a87b-fd48064ecb12
   
   2018-10-23 13:21:28,804 WARN  [kvm.resource.KVMHAChecker] (Script-3:null) (logid:) Interrupting script.
   2018-10-23 13:21:28,806 WARN  [kvm.resource.KVMHAChecker] (pool-15160-thread-1:null) (logid:c3d5dcaf) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 .  Output is:
   2018-10-23 13:21:32,826 WARN  [kvm.resource.KVMHAChecker] (Script-7:null) (logid:) Interrupting script.
   2018-10-23 13:21:32,827 WARN  [kvm.resource.KVMHAChecker] (pool-15161-thread-1:null) (logid:c3d5dcaf) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 .  Output is:
   2018-10-23 13:21:36,846 WARN  [kvm.resource.KVMHAChecker] (Script-4:null) (logid:) Interrupting script.
   2018-10-23 13:21:36,847 WARN  [kvm.resource.KVMHAChecker] (pool-15162-thread-1:null) (logid:4a3cb34f) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 .  Output is:
   2018-10-23 13:24:44,205 INFO  [cloud.agent.Agent] (Agent-Handler-1:null) (logid:5a5a7500) Lost connection to host: 10.73.96.19. Attempting reconnection while we still have 5 commands in progress.
   2018-10-23 13:24:44,206 INFO  [utils.nio.NioClient] (Agent-Handler-1:null) (logid:5a5a7500) NioClient connection closed
   2018-10-23 13:24:44,206 INFO  [cloud.agent.Agent] (Agent-Handler-1:null) (logid:5a5a7500) Reconnecting to host:10.73.96.19
   2018-10-23 13:24:44,207 INFO  [utils.nio.NioClient] (Agent-Handler-1:null) (logid:5a5a7500) Connecting to 10.73.96.19:8250
   2018-10-23 13:24:44,207 INFO  [utils.nio.Link] (Agent-Handler-1:null) (logid:5a5a7500) Conf file found: /etc/cloudstack/agent/agent.properties
   
   Note that sometimes you will see the agent successfully go into the
   Disconnected state, but the host HA framework might still fire after the
   kvm.ha.degraded.max.period timer, and that is not expected. In any case we
   want to avoid massive KVM host resets via IPMI for storage-related problems,
   because this is more damaging than waiting for primary storage to come back.
   




[GitHub] somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while NFS not available

2018-10-17 Thread GitBox
somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while 
NFS not available
URL: https://github.com/apache/cloudstack/issues/2890#issuecomment-430701430
 
 
   This morning I confirmed that the behavior on 4.9 is different from 4.11.
   When there's a long-lasting (say 15 minutes) NFS hang, the agent stays Up,
   and when NFS operations resume everyone's happy. Note that we did disable the
   automatic reboot in the heartbeat script for that to work. This saved us
   from massive reboots and VM outages earlier, when a network maintenance cut
   all KVM hosts off from NFS for 22 minutes.

