SystemVM Type? Routing or System?

2022-06-08 Thread Sean Lair
We have some confusion about which Template Type SystemVM templates should be set 
to.  The documentation seems to be inconsistent; could someone help clarify?

The following URL says to set "Routing" to NO when registering a new SystemVM 
Template:
http://docs.cloudstack.apache.org/en/latest/upgrading/upgrade/upgrade-4.16.html

The following URL says to "select Routing":
http://docs.cloudstack.apache.org/en/latest/adminguide/systemvm.html


The following URL says to manually edit the DB and make it type "System":
https://docs.cloudstack.apache.org/en/latest/adminguide/templates/_bypass-secondary-storage-kvm.html?highlight=bypass%20secondary
UPDATE cloud.vm_template SET type='SYSTEM' WHERE uuid='UUID_OF_NEW_TEMPLATE';
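
For reference, a quick way to sanity-check that row before and after the change (a hedged 
example using the mysql CLI and the usual 'cloud' DB user; 'UUID_OF_NEW_TEMPLATE' is the 
same placeholder as above):

# verify the template row the UPDATE above targets
mysql -u cloud -p cloud -e \
  "SELECT id, name, type, cross_zones FROM vm_template WHERE uuid='UUID_OF_NEW_TEMPLATE';"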


Thanks
Sean


Reverting to VM Snapshots fail if VM is powered off+on

2021-12-13 Thread Sean Lair
We are seeing a strange problem in our ACS environments.  We are running 
CentOS 7 as our hypervisors.  When we take a VM Snapshot and then later revert 
to it, it works as long as we haven't stopped and started the VM.  If we stop 
the VM and start it again - even if it is still on the same host - we cannot 
revert back to a VM Snapshot.  Here is the error and further information.   Any 
ideas?  It is 100% reproducible for us.


2021-12-13 22:50:51,731 DEBUG [c.c.a.t.Request] (AgentManager-Handler-12:null) 
(logid:) Seq 101-5603885311332466879: Processing:  { Ans: , MgmtId: 
345051498372, via: 101, Ver: v1, Flags: 10, 
[{"com.cloud.agent.api.RevertToVMSnapshotAnswer":{"result":false,"details":" 
Revert to VM snapshot failed due to org.libvirt.LibvirtException: revert 
requires force: Target CPU feature count 3 does not match source 0","wait":0}}] 
}
2021-12-13 22:50:51,732 ERROR [o.a.c.s.v.DefaultVMSnapshotStrategy] 
(Work-Job-Executor-64:ctx-1767fb85 job-130106/job-130111 ctx-ab4680c7) 
(logid:87cc475a) Revert VM: i-2-317-VM to snapshot: 
i-2-317-VM_VS_20211213224802 failed due to  Revert to VM snapshot failed due to 
org.libvirt.LibvirtException: revert requires force: Target CPU feature count 3 
does not match source 0
com.cloud.utils.exception.CloudRuntimeException: Revert VM: i-2-317-VM to 
snapshot: i-2-317-VM_VS_20211213224802 failed due to  Revert to VM snapshot 
failed due to org.libvirt.LibvirtException: revert requires force: Target CPU 
feature count 3 does not match source 0
2021-12-13 22:50:51,743 ERROR [c.c.v.VmWorkJobHandlerProxy] 
(Work-Job-Executor-64:ctx-1767fb85 job-130106/job-130111 ctx-ab4680c7) 
(logid:87cc475a) Invocation exception, caused by: 
com.cloud.utils.exception.CloudRuntimeException: Revert VM: i-2-317-VM to 
snapshot: i-2-317-VM_VS_20211213224802 failed due to  Revert to VM snapshot 
failed due to org.libvirt.LibvirtException: revert requires force: Target CPU 
feature count 3 does not match source 0
2021-12-13 22:50:51,743 INFO  [c.c.v.VmWorkJobHandlerProxy] 
(Work-Job-Executor-64:ctx-1767fb85 job-130106/job-130111 ctx-ab4680c7) 
(logid:87cc475a) Rethrow exception 
com.cloud.utils.exception.CloudRuntimeException: Revert VM: i-2-317-VM to 
snapshot: i-2-317-VM_VS_20211213224802 failed due to  Revert to VM snapshot 
failed due to org.libvirt.LibvirtException: revert requires force: Target CPU 
feature count 3 does not match source 0
com.cloud.utils.exception.CloudRuntimeException: Revert VM: i-2-317-VM to 
snapshot: i-2-317-VM_VS_20211213224802 failed due to  Revert to VM snapshot 
failed due to org.libvirt.LibvirtException: revert requires force: Target CPU 
feature count 3 does not match source 0
Caused by: com.cloud.utils.exception.CloudRuntimeException: Revert VM: 
i-2-317-VM to snapshot: i-2-317-VM_VS_20211213224802 failed due to  Revert to 
VM snapshot failed due to org.libvirt.LibvirtException: revert requires force: 
Target CPU feature count 3 does not match source 0


[root@labcloudkvm02 ~]# virsh dumpxml 33
...
  <cpu mode='custom' ...>
    <model ...>IvyBridge</model>
    <feature .../>
    <feature .../>
    <feature .../>
  </cpu>
...


[root@labcloudkvm02 ~]# virsh dumpxml 33 --migratable
...
  <cpu mode='custom' ...>
    <model ...>IvyBridge</model>
  </cpu>
...


[root@labcloudkvm02 ~]# virsh snapshot-dumpxml 33 i-2-317-VM_VS_20211213224802
...
  <cpu mode='custom' ...>
    <model ...>IvyBridge</model>
    <feature .../>
    <feature .../>
    <feature .../>
  </cpu>
...

In agent.properties:
guest.cpu.model=IvyBridge
guest.cpu.mode=custom
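
In case it helps, a hedged workaround sketch outside of CloudStack (the libvirt error itself 
hints the revert can be forced); the domain ID and snapshot name are the ones from the logs 
above, and the disk should be verified afterwards:

# compare the <cpu> section stored in the snapshot with the current domain definition
virsh dumpxml 33 | sed -n '/<cpu/,/<\/cpu>/p'
virsh snapshot-dumpxml 33 i-2-317-VM_VS_20211213224802 | sed -n '/<cpu/,/<\/cpu>/p'
# accept the CPU-definition mismatch and force the revert
virsh snapshot-revert 33 i-2-317-VM_VS_20211213224802 --force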


RE: [DISCUSS] Moving to OpenVPN as the remote access VPN provider

2021-06-16 Thread Sean Lair
I would love to see OpenVPN as the client VPN.  We consider the current Client 
VPN unusable.  We use OpenVPN with OPNsense firewalls and it has been 
rock-solid.


-Original Message-
From: Rohit Yadav  
Sent: Friday, June 11, 2021 12:40 PM
To: us...@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: [DKIM Fail] Re: [DISCUSS] Moving to OpenVPN as the remote access VPN 
provider

Hi PL,

You can check the ikev2 support in 4.15+ here: 
https://github.com/apache/cloudstack/pull/4953

I think a generic VPN framework-provider feature is probably what we need (i.e. 
to let the user or admin decide which VPN provider they want, supporting 
strongswan/ipsec and openvpn), so I'm not trying to defend OpenVPN here, but your 
comments on OpenVPN are incorrect. It is widely used (in many projects, incl. 
pfSense) and both the server and the clients are opensource and not proprietary afaik 
(GPL or AGPL license; I'm not sure about the platform-specific clients (the GUI ones), 
but I checked that the CLI clients are opensource):
https://github.com/OpenVPN/openvpn
https://github.com/OpenVPN/openvpn3

One key requirement for whatever VPN provider we support is that it should be 
free and opensource and available on Debian (for use in the systemvmtemplate) 
and OpenVPN fits that requirement. The package is available on Debian: 
https://packages.debian.org/buster-backports/openvpn
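
For illustration only, pulling it into a Debian buster based systemvmtemplate build could 
look roughly like this (assuming the backports repo is already enabled in the build 
environment):

apt-get update
apt-get install -y -t buster-backports openvpn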

Regards.


From: Pierre-Luc Dion 
Sent: Friday, June 11, 2021 20:10
To: us...@cloudstack.apache.org 
Cc: dev 
Subject: Re: [DISCUSS] Moving to OpenVPN as the remote access VPN provider

Just to be sure, does CloudStack > v4.15 use Strongswan/L2TP or
Strongswan/IKEv2?

I ask because L2TP became complicated to configure with native VPN clients on some 
OSes - it's a kind of deprecated remote-management VPN compared to IKEv2.
I'm a bit concerned about OpenVPN for the clients - what if the binaries become 
subscription-based or proprietary?

For sure we need the option to select what type of VPN solution to offer when 
deploying a cloud.

From my perspective I cannot use/offer OpenVPN as a solution to my customers 
because it involves forcing them to download third-party software onto their 
workstations, and I don't want to be responsible for a security breach on their 
workstation because of a requirement for 3rd-party software that we don't 
control.



On Fri, Jun 11, 2021 at 10:14 AM Rohit Yadav 
wrote:

> Thanks all for the feedback so far, looks like the majority of people 
> on the thread would prefer OpenVPN but for s2s they may continue to 
> prefer strongswan/ipsec for site-to-site VPC feature. If we're unable 
> to reach consensus then a general-purpose provider-framework may be 
> more flexible to the end-user or admin (to select which VPN provider 
> they want for their network, we heard in this thread - openvpn, 
> strongswan/l2tp, wireguard, and maybe other providers in future).
>
> Btw, ikev2 is supported now with strongswan with this -
> https://github.com/apache/cloudstack/pull/4953
>
> My personal opinion: as a user of most of these VPN providers, I 
> personally like OpenVPN, which I found easier to use both on 
> desktop/laptop and on phone. With OpenVPN as the default, I imagine in 
> CloudStack I could enable VPN for a network and CloudStack gives me an 
> option to download a .ovpn file which I can import into my OpenVPN 
> client (desktop, phone, cli...) and click connect to connect to the VPN. 
> For certificate generation/storage, the CA framework could be used so 
> the OpenVPN server certs stay the same across network restarts (with 
> cleanup). I think a process like this could be simpler than what we've 
> got right now, and the .ovpn download+import workflow would be easier than 
> what we'll get from either strongswan/current or wireguard. While I 
> like the simplicity of wireguard, which is more like an SSH setup, I 
> wouldn't mind doing the setup on individual VMs (much like setting up an ssh key) or 
> using something like TailScale.
>
>
> Regards.
>
> 
> From: Gabriel Bräscher 
> Sent: Friday, June 11, 2021 19:28
> To: dev 
> Cc: users 
> Subject: Re: [DISCUSS] Moving to OpenVPN as the remote access VPN 
> provider
>
> I understand that OpenVPN is a great option and widely adopted.
> I am ++1 on allowing Users/Admins to choose which VPN provider suits 
> them best; creating an offering (or global settings) that would allow 
> setting which VPN provider will be used would be awesome.
>
> That said, I would be -1 if this meant removing support for Strongswan 
> -- which, from what I understood, is not the proposal, but I'm saying it anyway 
> to be sure.
>
> Thanks for raising this proposal/discussion, Rohit.
>
> Cheers,
> Gabriel.
>
>
> Em sex., 11 de jun. de 2021 às 08:46, Pierre-Luc Dion 
>  >
> escreveu:
>
> > Hello,
> >
> > Daan, I agree we should provide capability to select the vpn 
> > solution to use, the question 

RE: Set Number of queues for Virtio NIC driver to vCPU count?

2021-03-24 Thread Sean Lair
Thanks for the reply guys.  We'll start looking more into this!

Sean
-Original Message-
From: Rohit Yadav  
Sent: Wednesday, March 24, 2021 7:28 AM
To: dev@cloudstack.apache.org
Cc: Sean Lair 
Subject: [DKIM Fail] Re: Set Number of queues for Virtio NIC driver to vCPU 
count?

Hi Sean,

Agree with your proposal. I'm not sure there are any cons to the change, as 
I see related changes were already made for disks.

Regards.

Regards,
Rohit Yadav


From: n...@li.nux.ro 
Sent: Wednesday, March 24, 2021 5:19:36 PM
To: dev@cloudstack.apache.org 
Cc: Sean Lair 
Subject: Re: Set Number of queues for Virtio NIC driver to vCPU count?

+1

It's a great idea which is implemented already on some of the other platforms.
It can make a big difference when pushing a lot of traffic, such as VoIP etc.
Hope it gets implemented.

Lucian

On 2021-03-23 23:08, Sean Lair wrote:
> Hi all,
>
> We are looking to improve the network performance of our KVM/QEMU VMs 
> running in CloudStack.  One thing we noticed is that the Virtio NICs 
> are not configured to use multiple queues.  A couple of years ago 
> someone created a PR to increase the Virtio SCSI queue count to match 
> the number of vCPUs:
>
> https://github.com/apache/cloudstack/pull/3101
>
> Before we look at this further, any thoughts on doing something 
> similar with Virtio NICs?
>
> Thanks
> Sean




Set Number of queues for Virtio NIC driver to vCPU count?

2021-03-23 Thread Sean Lair
Hi all,

We are looking to improve the network performance of our KVM/QEMU VMs running 
in CloudStack.  One thing we noticed is that the Virtio NICs are not configured 
to use multiple queues.  A couple of years ago someone created a PR to increase 
the Virtio SCSI queue count to match the number of vCPUs:

https://github.com/apache/cloudstack/pull/3101

Before we look at this further, any thoughts on doing something similar with 
Virtio NICs?
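
For context, this is roughly what NIC multiqueue looks like when set by hand in the libvirt 
domain XML today (a hedged illustration, e.g. for a 4-vCPU guest; the proposal would be for 
CloudStack to emit this automatically, similar to what the virtio-scsi PR did for disks):

# virsh edit <vm-name>, then add a queues attribute on the guest interface:
#   <interface type='bridge'>
#     ...
#     <driver name='vhost' queues='4'/>
#   </interface>
# inside the guest, the extra queues are enabled with ethtool:
ethtool -L eth0 combined 4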

Thanks
Sean



RE: Secure Live Migration for KVM

2021-03-15 Thread Sean Lair
Quick update so no one spends any time looking into this.  Found a few things 
that we are working to fix:

1. If ca.plugin.root.auth.strictness is set to false, CloudStack will not try 
to renew any certs it has issued.  I'd say this is an issue; it should still 
renew certs it has issued.
2. If the KVM agent is connecting to the management servers via a 
load-balancer, then the management servers see the load-balancer's IP address 
as the client IP address.  This causes the client certificate trust check to 
fail, as the load-balancer IP address is not in the cert's Subject Alternative 
Names list.
3. Similar to #2, the CA background task also has an issue when KVM agents come 
through a load-balancer.

We'll fix #2 and #3 by having the KVM agents connect directly to the mgmt 
servers.
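
For anyone checking the same thing, a hedged way to eyeball the SANs on the cert the CA 
framework issued to an agent (the file path is assumed from a default agent install):

openssl x509 -in /etc/cloudstack/agent/cloud.crt -noout -text | grep -A1 'Subject Alternative Name'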

Thanks
Sean


-Original Message-
From: Sean Lair  
Sent: Monday, March 15, 2021 12:18 PM
To: dev@cloudstack.apache.org
Subject: RE: Secure Live Migration for KVM

Hi Rohit, from our initial debugging the issue may be a little more involved.  
Maybe you could add some insight.

We added some debug logging to monitor the size of the activeCertMap and have 
noticed it is almost always 0.  When the CABackgroundTask runs, it never does 
anything because the in-memory activeCertMap on each mgmt server is empty.

When a KVM host connects to a mgmt server, we do not see any code that 
populates the activeCertMap with the newly connected host's Cert.  Shouldn't a 
host connection trigger adding the host's cert to the activeCertMap?

Furthermore, when a cert is provisioned from the web-interface/API for a host, 
we do see the activeCertMap initially being populated.  However, as part of 
that process, the agent is restarted.  That restart of the agent triggers the 
following method in AgentManagerImpl.java:

protected boolean handleDisconnectWithoutInvestigation(final AgentAttache 
attache, final Status.Event event, final boolean transitState, final boolean 
removeAgent)

That method ends up calling the following method which removes the host/cert 
from the activeCertMap:
caService.purgeHostCertificate(host);

Now, since at host reconnect there isn't any code to re-populate the 
activeCertMap, it remains at 0 and as mentioned the CABackgroundTask never has 
anything to do, thus certs never get renewed.

We are still looking into this, but let us know what we are missing if you have 
a chance to take a look.

Thanks!!
Sean




-Original Message-
From: Rohit Yadav  
Sent: Friday, March 12, 2021 12:50 AM
To: dev@cloudstack.apache.org
Subject: [DKIM Fail] Re: Secure Live Migration for KVM

Hi Greg, I think you're right - 
https://github.com/apache/cloudstack/pull/4156 should fix the auto-renewal 
issue.
In the meantime, for already-connected KVM hosts/agents, you can run the 
provisionCertificate API.


Regards.


From: Greg Goodrich 
Sent: Friday, March 12, 2021 04:00
To: dev@cloudstack.apache.org 
Subject: Re: Secure Live Migration for KVM

Further investigation finds this PR which may be related - 
https://github.com/apache/cloudstack/pull/4156. We are investigating if this 
could be the cause.

--
Greg Goodrich | IP Pathways
Development Manager
3600 109th Street | Urbandale, IA 50322
p. 515.422.9346 | e. ggoodr...@ippathways.com



On Mar 11, 2021, at 4:09 PM, Greg Goodrich <ggoodr...@ippathways.com> wrote:

We have just discovered in our Lab environment that the certificates for 
libvirtd did not auto-renew. Thus, when we did an update and restart of the 
agent, it failed to start due to libvirtd failing to start with an expired 
certificate. We then checked our production hosts, and their certificates are 
due to expire in 4 days, even though our setting is to auto-renew at 15 days. 
Has anyone else encountered a problem with this? It appears to be related to 
this feature - https://github.com/apache/cloudstack/pull/2505.

We are running 4.11.3 in both environments.

--
Greg Goodrich | IP Pathways
Development Manager
3600 109th Street | Urbandale, IA 50322
p. 515.422.9346 | e. ggoodr...@ippathways.com




RE: Secure Live Migration for KVM

2021-03-15 Thread Sean Lair
Hi Rohit, from our initial debugging the issue may be a little more involved.  
Maybe you could add some insight.

We added some debug logging to monitor the size of the activeCertMap and have 
noticed it is almost always 0.  When the CABackgroundTask runs, it never does 
anything because the in-memory activeCertMap on each mgmt server is empty.

When a KVM host connects to a mgmt server, we do not see any code that 
populates the activeCertMap with the newly connected host's Cert.  Shouldn't a 
host connection trigger adding the host's cert to the activeCertMap?

Furthermore, when a cert is provisioned from the web-interface/API for a host, 
we do see the activeCertMap initially being populated.  However, as part of 
that process, the agent is restarted.  That restart of the agent triggers the 
following method in AgentManagerImpl.java:

protected boolean handleDisconnectWithoutInvestigation(final AgentAttache 
attache, final Status.Event event, final boolean transitState, final boolean 
removeAgent)

That method ends up calling the following method which removes the host/cert 
from the activeCertMap:
caService.purgeHostCertificate(host);

Now, since at host reconnect there isn't any code to re-populate the 
activeCertMap, it remains at 0 and as mentioned the CABackgroundTask never has 
anything to do, thus certs never get renewed.

We are still looking into this, but let us know what we are missing if you have 
a chance to take a look.

Thanks!!
Sean




-Original Message-
From: Rohit Yadav  
Sent: Friday, March 12, 2021 12:50 AM
To: dev@cloudstack.apache.org
Subject: [DKIM Fail] Re: Secure Live Migration for KVM

Hi Greg, I think you're right - 
https://github.com/apache/cloudstack/pull/4156 should fix the auto-renewal 
issue.
In the meantime, for already-connected KVM hosts/agents, you can run the 
provisionCertificate API.


Regards.


From: Greg Goodrich 
Sent: Friday, March 12, 2021 04:00
To: dev@cloudstack.apache.org 
Subject: Re: Secure Live Migration for KVM

Further investigation finds this PR which may be related - 
https://github.com/apache/cloudstack/pull/4156. We are investigating if this 
could be the cause.

--
Greg Goodrich | IP Pathways
Development Manager
3600 109th Street | Urbandale, IA 50322
p. 515.422.9346 | e. ggoodr...@ippathways.com



On Mar 11, 2021, at 4:09 PM, Greg Goodrich <ggoodr...@ippathways.com> wrote:

We have just discovered in our Lab environment that the certificates for 
libvirtd did not auto-renew. Thus, when we did an update and restart of the 
agent, it failed to start due to libvirtd failing to start with an expired 
certificate. We then checked our production hosts, and their certificates are 
due to expire in 4 days, even though our setting is to auto-renew at 15 days. 
Has anyone else encountered a problem with this? It appears to be related to 
this feature - https://github.com/apache/cloudstack/pull/2505.

We are running 4.11.3 in both environments.

--
Greg Goodrich | IP Pathways
Development Manager
3600 109th Street | Urbandale, IA 50322
p. 515.422.9346 | e. 
ggoodr...@ippathways.com




RE: Virtual machines volume lock manager

2020-05-19 Thread Sean Lair
Are you using NFS?

Yes, we implemented locking because of that problem:

https://libvirt.org/locking-lockd.html

echo lock_manager = \"lockd\" >> /etc/libvirt/qemu.conf
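
For completeness, the rest of what we did is roughly this (a minimal sketch for CentOS 7 
with systemd-managed libvirt):

# enable libvirt's lockd lock manager and restart the daemons
echo 'lock_manager = "lockd"' >> /etc/libvirt/qemu.conf
systemctl enable virtlockd && systemctl start virtlockd
systemctl restart libvirtd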

-Original Message-
From: Andrija Panic  
Sent: Wednesday, October 30, 2019 6:55 AM
To: dev 
Cc: users 
Subject: Re: Virtual machines volume lock manager

I would advise trying to reproduce.

start migration, then either:
- configure the timeout so that it's way too low, so that the migration fails due to 
timeouts.
- restart the mgmt server in the middle of the migration. This should cause the migration 
to fail - and you can observe whether you have reproduced the problem.
Keep in mind that there might be some garbage left, due to not properly 
handling the failed migration. But from the QEMU point of view - if the migration fails, 
by all means the new VM should be destroyed...



On Wed, 30 Oct 2019 at 11:31, Rakesh Venkatesh 

wrote:

> Hi Andrija
>
>
> Sorry for the late reply.
>
> I'm using ACS version 4.7. QEMU version 1:2.5+dfsg-5ubuntu10.40.
>
> I'm not sure whether the ACS job or the libvirt job failed, as I didn't look into the logs.
> Yes, the VM will be in a paused state during migration, but after the 
> failed migration the same VM was in "running" state on two different 
> hypervisors.
> We wrote a script to find out which VMs are running duplicated and 
> found out that more than 5 VMs had this issue.
>
>
> On Mon, Oct 28, 2019 at 2:42 PM Andrija Panic 
> 
> wrote:
>
> > I've been running KVM public cloud up to recently and have never 
> > seen
> such
> > behaviour.
> >
> > What versions (ACS, qemu, libvirt) are you running?
> >
> > How does the migration fail - ACS job - or libvirt job?
> > destination VM is by default always in PAUSED state, until the 
> > migration
> is
> > finished - only then the destination VM (on the new host) will get
> RUNNING,
> > while previously pausing the original VM (on the old host).
> >
> > i.e.
> > phase1  source vm RUNNING, destination vm PAUSED (RAM content being
> > copied over... takes time...)
> > phase2  source vm PAUSED, destination vm PAUSED (last bits of RAM
> > content are migrated)
> > phase3  source vm destroyed, destination VM RUNNING.
> >
> > Andrija
> >
> > On Mon, 28 Oct 2019 at 14:26, Rakesh Venkatesh  wrote:
> >
> > > Hello Users
> > >
> > >
> > > Recently we have seen cases where when the Vm migration fails,
> cloudstack
> > > ends up running two instances of the same VM on different hypervisors.
> > The
> > > state will be "running" and not any other transition state. This 
> > > will
> of
> > > course lead to corruption of disk. Does CloudStack has any option 
> > > of
> > volume
> > > locking so that two instances of the same VM wont be running?
> > > Anyone else has faced this issue and found some solution to fix it?
> > >
> > > We are thinking of using "virtlockd" of libvirt or implementing 
> > > custom
> > lock
> > > mechanisms. There are some pros and cons of the both the solutions 
> > > and
> i
> > > want your feedback before proceeding further.
> > >
> > > --
> > > Thanks and regards
> > > Rakesh venkatesh
> > >
> >
> >
> > --
> >
> > Andrija Panić
> >
>
>
> --
> Thanks and regards
> Rakesh venkatesh
>


-- 

Andrija Panić


RE: Issue adding a second zone to Cloudstack

2020-03-29 Thread Sean Lair
Thank you for the reply Vivek!  I was wondering if that was the case but just 
couldn't find any documentation to verify.  We've done that and are back on the 
right path!


-Original Message-
From: Vivek Kumar  
Sent: Saturday, March 28, 2020 4:22 AM
To: us...@cloudstack.apache.org
Cc: dev@cloudstack.apache.org
Subject: Re: Issue adding a second zone to Cloudstack

Hello Sean,

You need to again seed the template to the secondary storage of your new zone 
just like you did for the first zone.

i.e
/usr/share/cloudstack-common/scripts/storage/secondary/cloud-install-sys-tmplt \
  -m /mnt/secondary \
  -u http://download.cloudstack.org/systemvm/4.11/systemvmtemplate-4.11.3-kvm.qcow2.bz2 \
  -h kvm -s  -F

Vivek Kumar
Manager - Cloud & DevOps 
IndiQus Technologies
24*7  O +91 11 4055 1411  |   M +91 7503460090 
www.indiqus.com

This message is intended only for the use of the individual or entity to which 
it is addressed and may contain information that is confidential and/or 
privileged. If you are not the intended recipient please delete the original 
message and any copy of it from your computer system. You are hereby notified 
that any dissemination, distribution or copying of this communication is 
strictly prohibited unless proper authorization has been obtained for such 
action. If you have received this communication in error, please notify the 
sender immediately. Although IndiQus attempts to sweep e-mail and attachments 
for viruses, it does not guarantee that both are virus-free and accepts no 
liability for any damage sustained as a result of viruses.

> On 28-Mar-2020, at 4:08 AM, Sean Lair  wrote:
> 
> Hi all,
> 
> We are running 4.11.3 with a single zone, that zone is working without issue. 
>  We are trying to add a second zone to the installation, and everything seems 
> to go well, except we are confused on how the SystemVM templates should be 
> handled for the new zone.  The new zone has its own secondary storage (NFS).  
> When Cloudstack sees the new Zone, it attempts to provision a Secondary 
> Storage VM.  However, it is unable to because the SystemVM Template doesn't 
> exist on the new secondary storage (NFS).
> 
> Are we supposed to pre-populate another copy of the SystemVM Template in the 
> additional zone and secondary storage?  Or should cloudstack copy the 
> existing SystemVM Template (which is set as cross-zone) to the new zone for 
> us?  Here is some detailed information:
> 
> MariaDB [cloud]> SELECT id,name,type,cross_zones,state FROM cloud.vm_template 
> WHERE name like '%systemvm-kvm%' AND removed IS NULL;
> +-----+---------------------+--------+-------------+--------+
> | id  | name                | type   | cross_zones | state  |
> +-----+---------------------+--------+-------------+--------+
> | 344 | systemvm-kvm-4.11.3 | SYSTEM |           1 | Active |
> +-----+---------------------+--------+-------------+--------+
> 
> MariaDB [cloud]> select id,store_id,template_id,install_path, download_state 
> from template_store_ref;
> +-----+----------+-------------+-----------------------------------------------------------------+----------------+
> | id  | store_id | template_id | install_path                                                    | download_state |
> +-----+----------+-------------+-----------------------------------------------------------------+----------------+
> | 666 |        1 |         344 | template/tmpl/2/344/182f0a79-1e16-3e53-a6e9-fcffe5f11c3e.qcow2 | DOWNLOADED     |
> | 756 |       16 |         344 | template/tmpl/1/344/                                            | DOWNLOADED     |
> +-----+----------+-------------+-----------------------------------------------------------------+----------------+
> 
> Why in the template_store_ref table did cloudstack add a new row with 
> "downloaded" and missing a filename in the "install_path"?
> 
> 
> The KVM host cannot mount the template on the new secondary storage, because 
> it isn't there yet (should cloudstack be copying that template from the 
> existing zone to the new one for us?):
> --
> 2020-03-27 18:51:40,626 ERROR [kvm.storage.LibvirtStorageAdaptor] 
> (agentRequest-Handler-2:null) (logid:6b50f03a) Failed to create netfs mount: 
> 10.102.33.5:/zone2_secondary/template/tmpl/1/344
> org.libvirt.LibvirtException: internal error: Child process (/usr/bin/mount 
> 10.10.33.5:/zone2_secondary/template/tmpl/1/344 
> /mnt/b69caab0-4c1e-34b6-94b8-2617ba561e9a -o nodev,nosuid,noexec) unexpected 
> exit status 32: mount.nfs: mounting 
> 10.10.33.5:/zone2__secondary/template/tmpl/1/344 failed, reason given by 
> server: No such file or directory
> -
> 
> 
> Thanks!
> Sean



Issue adding a second zone to Cloudstack

2020-03-27 Thread Sean Lair
Hi all,

We are running 4.11.3 with a single zone, that zone is working without issue.  
We are trying to add a second zone to the installation, and everything seems to 
go well, except we are confused on how the SystemVM templates should be handled 
for the new zone.  The new zone has its own secondary storage (NFS).  When 
Cloudstack sees the new Zone, it attempts to provision a Secondary Storage VM.  
However, it is unable to because the SystemVM Template doesn't exist on the new 
secondary storage (NFS).

Are we supposed to pre-populate another copy of the SystemVM Template in the 
additional zone and secondary storage?  Or should cloudstack copy the existing 
SystemVM Template (which is set as cross-zone) to the new zone for us?  Here is 
some detailed information:

MariaDB [cloud]> SELECT id,name,type,cross_zones,state FROM cloud.vm_template 
WHERE name like '%systemvm-kvm%' AND removed IS NULL;
+-----+---------------------+--------+-------------+--------+
| id  | name                | type   | cross_zones | state  |
+-----+---------------------+--------+-------------+--------+
| 344 | systemvm-kvm-4.11.3 | SYSTEM |           1 | Active |
+-----+---------------------+--------+-------------+--------+

MariaDB [cloud]> select id,store_id,template_id,install_path, download_state 
from template_store_ref;
+-----+----------+-------------+-----------------------------------------------------------------+----------------+
| id  | store_id | template_id | install_path                                                    | download_state |
+-----+----------+-------------+-----------------------------------------------------------------+----------------+
| 666 |        1 |         344 | template/tmpl/2/344/182f0a79-1e16-3e53-a6e9-fcffe5f11c3e.qcow2 | DOWNLOADED     |
| 756 |       16 |         344 | template/tmpl/1/344/                                            | DOWNLOADED     |
+-----+----------+-------------+-----------------------------------------------------------------+----------------+

Why in the template_store_ref table did cloudstack add a new row with 
"downloaded" and missing a filename in the "install_path"?


The KVM host cannot mount the template on the new secondary storage, because it 
isn't there yet (should cloudstack be copying that template from the existing 
zone to the new one for us?):
--
2020-03-27 18:51:40,626 ERROR [kvm.storage.LibvirtStorageAdaptor] 
(agentRequest-Handler-2:null) (logid:6b50f03a) Failed to create netfs mount: 
10.102.33.5:/zone2_secondary/template/tmpl/1/344
org.libvirt.LibvirtException: internal error: Child process (/usr/bin/mount 
10.10.33.5:/zone2_secondary/template/tmpl/1/344 
/mnt/b69caab0-4c1e-34b6-94b8-2617ba561e9a -o nodev,nosuid,noexec) unexpected 
exit status 32: mount.nfs: mounting 
10.10.33.5:/zone2__secondary/template/tmpl/1/344 failed, reason given by 
server: No such file or directory
-


Thanks!
Sean


SystemVM: Routing Checkbox discrepancy

2020-03-13 Thread Sean Lair
Hi All, there is a discrepancy in our Cloudstack Documentation.  The following 
Upgrade section says to NOT check the Routing checkbox when uploading a new 
SystemVM Template:

http://docs.cloudstack.apache.org/en/latest/upgrading/upgrade/upgrade-4.12.html

This page however says we SHOULD check the Routing checkbox when uploading a 
new SystemVM Template.  Which is it?

http://docs.cloudstack.apache.org/en/4.13.0.0/adminguide/systemvm.html#changing-the-default-system-vm-template

Also, the above link says to go to "Infrastructure > Zone > Settings" to change 
the router.template.kvm setting, but it is in Global Settings, not under Zone 
Settings.
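
For what it's worth, a hedged sketch of flipping that setting via the API instead of the UI 
(CloudMonkey assumed; the value must match the name of the registered SystemVM template):

cmk update configuration name=router.template.kvm value="<name of the new SystemVM template>"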

Also, a related question: when uploading a new SystemVM Template, the template 
in the DB is of type ROUTING (if Routing is checked during upload; otherwise it 
is type USER).  But the Console Proxy VM and Secondary Storage VM need it to be of 
type SYSTEM.  When a SystemVM Template is swapped out without also performing a 
CloudStack upgrade, what is the correct way to change the SystemVM template for 
Console Proxy VMs, Secondary Storage VMs and vRouters?

Thanks all!
Sean


Issue with newest mysql-connector-java

2020-01-21 Thread Sean Lair
Opened Issue:
https://github.com/apache/cloudstack/issues/3826

We noticed that on mysql-connector-java version 8.0.19 (not sure about other 
8.0.x versions) we have errors such as the following:

Caused by: java.lang.IllegalArgumentException: Can not set long field 
com.cloud.upgrade.dao.VersionVO.id to java.math.BigInteger
at 
sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:167)
at 
sun.reflect.UnsafeFieldAccessorImpl.throwSetIllegalArgumentException(UnsafeFieldAccessorImpl.java:171)
at 
sun.reflect.UnsafeLongFieldAccessorImpl.set(UnsafeLongFieldAccessorImpl.java:102)

Looks like in the code we are using Long for auto-increment fields, but the DB 
columns are actually BIGINT.  Downgrading to the EPEL release of 
mysql-connector-java (5.1.25-3) fixed the issue.  However, I expect lots of 
people will hit this, because in the upgrade guides we specify adding the 
mysql-community repo - which pulls in newer mysql-connector versions:

http://docs.cloudstack.apache.org/projects/archived-cloudstack-release-notes/en/4.11/upgrade/upgrade-4.9.html
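
For anyone else who hits it, the workaround we used boils down to something like this on 
CentOS 7 (hedged - exact package versions and repo ids may differ per setup):

# drop the 8.x connector pulled in from the mysql-community repo and go back to the EPEL one
yum remove -y mysql-connector-java
yum install -y --disablerepo='mysql-connectors-community' mysql-connector-java
systemctl restart cloudstack-management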

Thanks
Sean


RE: Do not see KVM Hosts after 4.9.3 -> 4.11.2

2019-05-31 Thread Sean Lair
Update on the issue.  Thanks Richard for the hint about MariaDB needing an 
update (and everyone else that responded).  It's crazy - I did a manual select, 
mimicking the host_view SQL, and also received zero rows.  I modified the select 
statement to remove the LEFT JOIN with last_annotation_view, and the select 
statement returned rows as expected...  No idea (has to be a bug) why a LEFT 
OUTER JOIN would truncate a result set like that...
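
The queries behind that test were essentially (a hedged reconstruction, run against the 
cloud DB):

mysql -u cloud -p cloud -e "SELECT COUNT(*) FROM host_view;"             # returned 0 rows before the MariaDB update
mysql -u cloud -p cloud -e "SELECT COUNT(*) FROM last_annotation_view;"  # the empty view on the LEFT JOIN side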

We were running MariaDB 10.0.33-1.el7.centos, did an upgrade to 
10.0.38-1.el7.centos.  Then the host_view (and the GUI) started working as 
expected...

MariaDB bug??



-Original Message-
From: Dag Sonstebo [mailto:dag.sonst...@shapeblue.com] 
Sent: Friday, May 31, 2019 4:47 AM
To: dev@cloudstack.apache.org
Subject: Re: Do not see KVM Hosts after 4.9.3 -> 4.11.2

There are known issues with using MariaDB version 10 - I recommend you stick to 
version 5.5 for the foreseeable future, and we have had several cases of people 
having to downgrade lately. 

The issues you are seeing are most likely down to this Richard - you should not 
have to make any DB schema changes / view changes to make the GUI work.

Regards,
Dag Sonstebo
Cloud Architect
ShapeBlue
 

On 31/05/2019, 10:34, "Richard Lawley"  wrote:

I don't believe the issue was related to views as such.  When I was
trying to diagnose it earlier in the week I ran the query the view
runs manually, and got the same result.  I then started removing
joined tables (even though they were all left joins so should not
matter), and data appeared once I removed the join to
last_annotation_view (which was empty).

We had been running 4.8 on that server previously.  The issue was
resolved by updating our database server (to MariaDB 10.1.40, from
10.1.25 I think) - the same query started returning data properly.

On Fri, 31 May 2019 at 09:35, Riepl, Gregor (SWISS TXT)
 wrote:
>
>
> > - You did the upgrade on a newly built MySQL / MariaDB server (keep in 
mind you can not at this point run MariaDB version 10.x)
> > - AND you imported database dumps to the new DB servers
> > - AND you didn't give 'cloud@%' permissions before the import:
> > GRANT ALL ON *.* TO 'cloud'@'%' IDENTIFIED BY '' WITH GRANT 
OPTION;
> >
> > If these apply then the import fails after all tables are imported but 
before the views are imported - hence the GUI struggles to display data.
>
> Could this be related to the fact that views are created with the 
creating user's permissions by default?
> When I recently migrated our CS database to a new host, I ran into errors 
because of subtle root user changes (i.e. different host parts) on the new DB 
server.
>
> MySQL/MariaDB sets the SQL SECURITY to DEFINER by default, which means 
that the exact user/hostname combo must exist on the target host when importing 
a database. In my opinion, this makes absolutely no sense. The default should 
be INVOKER, i.e. queries on the view should be executed with the permissions of 
the user sending the query on the view, not those of the user who created the 
view in the first place.
>
> See https://dev.mysql.com/doc/refman/8.0/en/create-view.html for more 
info on the topic.
>
> Is there a particular reason why CloudStack uses the MySQL default? 
Perhaps all views should be changed to use SQL SECURITY INVOKER?
>
> My quick fix to the problem was to comment out the DEFINER = ... lines 
from the database dump during import:
> zcat cloudstack.sql.gz | grep -v "50013 DEFINER" | mysql -p






Do not see KVM Hosts after 4.9.3 -> 4.11.2

2019-05-30 Thread Sean Lair
After upgrading from 4.9.3 to 4.11.2, we no longer see hosts in the CloudStack 
web-interface.  Hitting the listHosts API directly also does not return any 
results.  It's just an empty list.  When looking in the DB we do see the hosts 
and there are rows where the version is 4.11.2.0.

The agent.log on the KVM (CentOS 7) looks good.  We can see a list of running 
VMs in CloudStack and their Running/Stopped statuses look good.  We can see 
Zones/Pods/Clusters, just not Hosts.  The "Infrastructure" page also correctly 
says "2" in the Host Count.  But when we Click Hosts it just says:

No data to show

Any ideas?

Thanks
Sean







RE: Snapshots on KVM corrupting disk images

2019-02-28 Thread Sean Lair
Hi Ivan, I wanted to respond here and see if you have published a PR on this yet.

This is a very scary issue for us, as customers can snapshot their volumes and 
end up causing corruption - and they blame us.  It's already happened - luckily 
we had storage-array-level snapshots in place as a safety net...

Thanks!!
Sean

-Original Message-
From: Ivan Kudryavtsev [mailto:kudryavtsev...@bw-sw.com] 
Sent: Sunday, January 27, 2019 7:29 PM
To: users ; cloudstack-fan 

Cc: dev 
Subject: Re: Snapshots on KVM corrupting disk images

Well, guys. I dived into the CS agent scripts which make volume snapshots and 
found there is no code for suspend/resume and also no code for the qemu-agent 
fsfreeze/fsthaw calls. I don't see any blockers to adding that code and will try 
to add it in the nearest days. If tests go well, I'll publish the PR, which I suppose 
could be integrated into 4.11.3.
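
To illustrate the idea only (this is not the actual agent script), the quiesce/thaw wrapper 
around the existing qemu-img call would look roughly like this; the VM name and paths are 
placeholders, and domfsfreeze needs qemu-guest-agent running in the guest:

virsh domfsfreeze i-2-123-VM       # or: virsh suspend i-2-123-VM
qemu-img snapshot -c snap1 /mnt/<pool-uuid>/<volume-uuid>
virsh domfsthaw i-2-123-VM         # or: virsh resume i-2-123-VM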

Mon, 28 Jan 2019, 2:45 cloudstack-fan
cloudstack-...@protonmail.com.invalid:

> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing 
> during the last 5-6 years of using ACS with KVM hosts (see this 
> thread, if you're interested in additional details:
> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox
> /browser
> ).
>
> I'd like to state that creating snapshots of a running virtual machine 
> is a bit risky. I've implemented some workarounds in my environment, 
> but I'm still not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage 
> do you use, if it's not a secret? Does your storage use XFS as a filesystem?
> Did you see something like this in your log-files?
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> 65552 in kmem_realloc (mode:0x250)
> Did you see any unusual messages in your log-file when the disaster 
> happened?
>
> I hope, things will be well. Wish you good luck and all the best!
>
>
> ‐‐‐ Original Message ‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair  wrote:
>
> > Hi all,
> >
> > We had some instances where VM disks are becoming corrupted when 
> > using
> KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> >
> > The first time was when someone mass-enabled scheduled snapshots on 
> > a large number of VMs and secondary storage filled up. We had to 
> restore all those VM disks... But believed it was just our fault with 
> letting secondary storage fill up.
> >
> > Today we had an instance where a snapshot failed and now the disk 
> > image
> is corrupted and the VM can't boot. here is the output of some commands:
> >
> >
> ----------------------------------------------------------------------
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img 
> > check
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> > Could
> not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img 
> > info
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> > Could
> not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> >
> ----------------------------------------------------------------------

RE: Snapshots on KVM corrupting disk images

2019-02-01 Thread Sean Lair
Hello,

We are using NFS storage.  It is actually native NFS mounts on a NetApp storage 
system.  We haven't seen those log entries, but we also don't always know when 
a VM gets corrupted...  When we finally get a call that a VM is having issues, 
we've found that it was corrupted a while ago.


-Original Message-
From: cloudstack-fan [mailto:cloudstack-...@protonmail.com.INVALID] 
Sent: Sunday, January 27, 2019 1:45 PM
To: us...@cloudstack.apache.org
Cc: dev@cloudstack.apache.org
Subject: Re: Snapshots on KVM corrupting disk images

Hello Sean,

It seems that you've encountered the same issue that I've been facing during 
the last 5-6 years of using ACS with KVM hosts (see this thread, if you're 
interested in additional details: 
https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).

I'd like to state that creating snapshots of a running virtual machine is a bit 
risky. I've implemented some workarounds in my environment, but I'm still not 
sure that they are 100% effective.

I have a couple of questions, if you don't mind. What kind of storage do you 
use, if it's not a secret? Does your storage use XFS as a filesystem? Did you 
see something like this in your log-files?
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
Did you see any unusual messages in your log-file when the disaster happened?

I hope, things will be well. Wish you good luck and all the best!


‐‐‐ Original Message ‐‐‐
On Tuesday, 22 January 2019 18:30, Sean Lair  wrote:

> Hi all,
>
> We had some instances where VM disks are becoming corrupted when using KVM 
> snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>
> The first time was when someone mass-enabled scheduled snapshots on a 
> large number of VMs and secondary storage filled up. We had to restore all those 
> VM disks... But believed it was just our fault with letting secondary storage 
> fill up.
>
> Today we had an instance where a snapshot failed and now the disk image is 
> corrupted and the VM can't boot. here is the output of some commands:
>
> ----------------------------------------------------------------------
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> ----------------------------------------------------------------------
>
> We tried restoring to before the snapshot failure, but still have strange 
> errors:
>
> ----------------------------------------------------------------------
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> image: ./184aa458-9d4b-4c1b

RE: Snapshots on KVM corrupting disk images

2019-02-01 Thread Sean Lair
Sounds good, I think something needs to be done.  Very scary that users can 
corrupt their VMs if they are doing volume snapshots.


-Original Message-
From: Ivan Kudryavtsev [mailto:kudryavtsev...@bw-sw.com] 
Sent: Sunday, January 27, 2019 7:29 PM
To: users ; cloudstack-fan 

Cc: dev 
Subject: Re: Snapshots on KVM corrupting disk images

Well, guys. I dived into CS agent scripts, which make volume snapshots and 
found there are no code for suspend/resume and also no code for qemu-agent call 
fsfreeze/fsthaw. I don't see any blockers adding that code yet and try to add 
it in nearest days. If tests go well, I'll publish the PR, which I suppose 
could be integrated into 4.11.3.

Mon, 28 Jan 2019, 2:45 cloudstack-fan
cloudstack-...@protonmail.com.invalid:

> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing 
> during the last 5-6 years of using ACS with KVM hosts (see this 
> thread, if you're interested in additional details:
> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox
> /browser
> ).
>
> I'd like to state that creating snapshots of a running virtual machine 
> is a bit risky. I've implemented some workarounds in my environment, 
> but I'm still not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage 
> do you use, if it's not a secret? Does your storage use XFS as a filesystem?
> Did you see something like this in your log-files?
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size
> 65552 in kmem_realloc (mode:0x250)
> Did you see any unusual messages in your log-file when the disaster 
> happened?
>
> I hope, things will be well. Wish you good luck and all the best!
>
>
> ‐‐‐ Original Message ‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair  wrote:
>
> > Hi all,
> >
> > We had some instances where VM disks are becoming corrupted when 
> > using
> KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> >
> > The first time was when someone mass-enabled scheduled snapshots on 
> > a large number of VMs and secondary storage filled up. We had to 
> restore all those VM disks... But believed it was just our fault with 
> letting secondary storage fill up.
> >
> > Today we had an instance where a snapshot failed and now the disk 
> > image
> is corrupted and the VM can't boot. here is the output of some commands:
> >
> >
> ----------------------------------------------------------------------
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img 
> > check
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> > Could
> not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img 
> > info
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> > Could
> not read snapshots: File too large
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> >
> ----------------------------------------------------------------------
> >
> > We

RE: Snapshots on KVM corrupting disk images

2019-01-22 Thread Sean Lair
Thanks Wei!  We really appreciate the response and the link.

Shouldn't we be doing something to stop the ability to use snapshots (scheduled 
and other snapshot operations) in CloudStack?  

-Original Message-
From: Wei ZHOU [mailto:ustcweiz...@gmail.com] 
Sent: Tuesday, January 22, 2019 4:06 PM
To: dev@cloudstack.apache.org
Subject: Re: Snapshots on KVM corrupting disk images

Hi Sean,

The (recurring) volume snapshot on running VMs should be disabled in CloudStack.

According to some discussions (for example 
https://bugzilla.redhat.com/show_bug.cgi?id=920020), the image might be 
corrupted due to the concurrent read/write operations in volume snapshot (by 
qemu-img snapshot).

```

qcow2 images must not be used in read-write mode from two processes at the same 
time. You can either have them opened either by one read-write process or by 
many read-only processes. Having one (paused) read-write process (the running
VM) and additional read-only processes (copying out a snapshot with qemu-img) 
may happen to work in practice, but you're on your own and we won't give 
support for such attempts.

```
The safe way to take a volume snapshot of a running VM is:
(1) take a VM snapshot (the VM will be paused)
(2) then create a volume snapshot from the VM snapshot
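
A rough sketch of that flow from the API side (CloudMonkey; the IDs are placeholders, and 
createSnapshotFromVMSnapshot is the API intended for step 2 on KVM):

cmk create vmsnapshot virtualmachineid=<vm-uuid> snapshotmemory=false
cmk create snapshotfromvmsnapshot vmsnapshotid=<vmsnapshot-uuid> volumeid=<root-volume-uuid>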

-Wei



Sean Lair  wrote on Tue, 22 Jan 2019 at 5:30 PM:

> Hi all,
>
> We had some instances where VM disks are becoming corrupted when using 
> KVM snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.
>
> The first time was when someone mass-enabled scheduled snapshots on a 
> large number of VMs and secondary storage filled up.  We had to 
> restore all those VM disks...  But believed it was just our fault with 
> letting secondary storage fill up.
>
> Today we had an instance where a snapshot failed and now the disk 
> image is corrupted and the VM can't boot.  here is the output of some 
> commands:
>
> ---
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': 
> Could not read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> ---
>
> We tried restoring to before the snapshot failure, but still have 
> strange
> errors:
>
> --
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> file format: qcow2
> virtual size: 50G (53687091200 bytes)
> disk size: 73G
> cluster_size: 65536
> Snapshot list:
> IDTAG VM SIZEDATE   VM CLOCK
> 1 a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43
> 3099:35:55.242
> 2 b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16
> 3431:52:23.942
> Format specific information:
> compat: 1.1
> lazy refcounts: false
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3
> 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 
> 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 
> 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d No errors were found on 
> the image.
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img 
> snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> Snapshot list:
> IDTAG VM SIZEDATE   VM CLOCK
> 1 a8fdf99f-8219-4032-a9c8-87a6e09e7f95   3.7G 2018-12-23 11:01:43
> 3099:35:55.242
> 2 b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd   3.8G 2019-01-06 11:03:16
> 3431:52:23.942
> --
>
> Everyone is now extremely hesitant to use snapshots in KVM.  We 
> tried deleting the snapshots in the restored disk image, but it errors out...
>
>
> Does anyone else have issues with KVM snapshots?  We are considering 
> just disabling this functionality now...
>
> Thanks
> Sean
>
>
>
>
>
>
>


RE: CloudStack 4.11.2 Snapshot Revert fail

2019-01-22 Thread Sean Lair
Luckily it was for a VM that is never touched in CloudStack.  The snaps were 
scheduled ones.  No, no changes to VM or template.

We are due to upgrade from 4.9.3 but we have not yet.

-Original Message-
From: Andrija Panic [mailto:andrija.pa...@gmail.com] 
Sent: Tuesday, January 22, 2019 11:05 AM
To: dev 
Cc: us...@cloudstack.apache.org
Subject: Re: CloudStack 4.11.2 Snapshot Revert fail

Hi there,

after the VM was deployed and snapshots were created - were there any changes to the VM or 
the template from which the VM was created - did the ACS version get upgraded?

Best

On Tue, 22 Jan 2019 at 17:52, li jerry  wrote:

> HI ALL
>
> I use CloudStack 4.11.2 to manage Xenserver 7.1.2 (XenServer CU2).
>
> VM snapshot for revert failure (snapshot does not contain memorysnapshot).
>
> 2019-01-23 00:06:54,210 DEBUG [c.c.a.m.ClusteredAgentAttache] 
> (Work-Job-Executor-156:ctx-28f7465a job-2867/job-2869 ctx-a04e0ed9)
> (logid:a9ef7fe7) Seq 5-6201456686889173919: Forwarding Seq
> 5-6201456686889173919:  { Cmd , MgmtId: 240661250348494, via: 
> 5(wxac6001),
> Ver: v1, Flags: 100011,
> [{"com.cloud.agent.api.RevertToVMSnapshotCommand":{"reloadVm":false,"vmUuid":"b2a78e9c-06ab-4200-ad6d-fe095f622502","volumeTOs":[{"uuid":"7a58ffdc-b02c-41bf-963c-be56c2da0e9b","volumeType":"ROOT","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"WXACP01CL01_LUN10","id":19,"poolType":"PreSetup","host":"localhost","path":"/WXACP01CL01_LUN10","port":0,"url":"PreSetup://localhost/WXACP01CL01_LUN10/?ROLE=Primary&STOREUUID=WXACP01CL01_LUN10","isManaged":false}},"name":"ROOT-33","size":21474836480,"path":"dd1cf43d-d5a4-4633-9c3e-8f73d1ccc484","volumeId":93,"vmName":"i-2-33-VM","accountId":2,"format":"VHD","provisioningType":"THIN","id":93,"deviceId":0,"hypervisorType":"XenServer"},{"uuid":"74268aa2-b4e5-4574-a981-027e55b5383f","volumeType":"DATADISK","dataStore":{"org.apache.
> cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"WXACP01CL01_LUN01",
> "id":1,"poolType":"PreSetup","host":"localhost","path":"/WXACP01CL01_L
> UN01","port":0,"url":"PreSetup://localhost/WXACP01CL01_LUN01/?ROLE=Pri
> mary&STOREUUID=WXACP01CL01_LUN01","isManaged":false}},"name":"DATA-33"
> ,"size":1099511627776,"path":"e2ead686-d0bb-49f2-b656-77c2bf497990","v
> olumeId":95,"vmName":"i-2-33-VM","accountId":2,"format":"VHD","provisi
> oningType":"THIN","id":95,"deviceId":1,"hypervisorType":"XenServer"}],
> "target":{"id":27,"snapshotName":"i-2-33-VM_VS_20190122155503","type":
> "Disk","createTime":1548172503000,"current":true,"description":"asdfas
> df","quiescevm":true},"vmName":"i-2-33-VM","guestOSType":"CentOS
> 7","wait":0}}] } to 55935224135780
>
> 2019-01-23 00:06:54,222 DEBUG [c.c.a.t.Request]
> (AgentManager-Handler-14:null) (logid:) Seq 5-6201456686889173919:
> Processing:  { Ans: , MgmtId: 240661250348494, via: 5, Ver: v1, Flags: 
> 10, 
> [{"com.cloud.agent.api.RevertToVMSnapshotAnswer":{"result":false,"details":"
> Hypervisor 
> com.cloud.hypervisor.xenserver.resource.XenServer650Resource
> doesn't support guest OS type CentOS 7. you can choose 'Other install 
> media' to run it as HVM","wait":0}}] }
> 2019-01-23 00:06:54,223 DEBUG [c.c.a.t.Request] 
> (Work-Job-Executor-156:ctx-28f7465a job-2867/job-2869 ctx-a04e0ed9)
> (logid:a9ef7fe7) Seq 5-6201456686889173919: Received:  { Ans: , MgmtId:
> 240661250348494, via: 5(wxac6001), Ver: v1, Flags: 10, { 
> RevertToVMSnapshotAnswer } }
> 2019-01-23 00:06:54,223 ERROR [o.a.c.s.v.DefaultVMSnapshotStrategy]
> (Work-Job-Executor-156:ctx-28f7465a job-2867/job-2869 ctx-a04e0ed9)
> (logid:a9ef7fe7) Revert VM: i-2-33-VM to snapshot:
> i-2-33-VM_VS_20190122155503 failed due to  Hypervisor 
> com.cloud.hypervisor.xenserver.resource.XenServer650Resource doesn't 
> support guest OS type CentOS 7. you can choose 'Other install media' 
> to run it as HVM
> 2019-01-23 00:06:54,226 DEBUG [c.c.v.s.VMSnapshotManagerImpl] 
> (Work-Job-Executor-156:ctx-28f7465a job-2867/job-2869 ctx-a04e0ed9)
> (logid:a9ef7fe7) Failed to revert vmsnapshot: 27
> com.cloud.utils.exception.CloudRuntimeException: Revert VM: i-2-33-VM 
> to
> snapshot: i-2-33-VM_VS_20190122155503 failed due to  Hypervisor 
> com.cloud.hypervisor.xenserver.resource.XenServer650Resource doesn't 
> support guest OS type CentOS 7. you can choose 'Other install media' 
> to run it as HVM
>   at
> org.apache.cloudstack.storage.vmsnapshot.DefaultVMSnapshotStrategy.revertVMSnapshot(DefaultVMSnapshotStrategy.java:393)
>   at
> com.cloud.vm.snapshot.VMSnapshotManagerImpl.orchestrateRevertToVMSnapshot(VMSnapshotManagerImpl.java:846)
>   at
> com.cloud.vm.snapshot.VMSnapshotManagerImpl.orchestrateRevertToVMSnapshot(VMSnapshotManagerImpl.java:1211)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:4

RE: CloudStack 4.11.2 Snapshot Revert fail

2019-01-22 Thread Sean Lair
Sorry, replied to the wrong snapshot thread...


-Original Message-
From: Sean Lair 
Sent: Tuesday, January 22, 2019 11:48 AM
To: dev 
Cc: us...@cloudstack.apache.org
Subject: RE: CloudStack 4.11.2 Snapshot Revert fail

Luckily it was for a VM that is never touched in CloudStack.  The snaps were 
scheduled ones.  No, no changes to VM or template.

We are due to upgrade from 4.9.3 but we have not yet.

-Original Message-
From: Andrija Panic [mailto:andrija.pa...@gmail.com]
Sent: Tuesday, January 22, 2019 11:05 AM
To: dev 
Cc: us...@cloudstack.apache.org
Subject: Re: CloudStack 4.11.2 Snapshot Revert fail

Hi there,

after the VM was deployed and snapshots were created - were there any changes to the VM or 
the template from which the VM was created - did the ACS version get upgraded?

Best

On Tue, 22 Jan 2019 at 17:52, li jerry  wrote:

> HI ALL
>
> I use CloudStack 4.11.2 to manage Xenserver 7.1.2 (XenServer CU2).
>
> VM snapshot for revert failure (snapshot does not contain memorysnapshot).
>
> 2019-01-23 00:06:54,210 DEBUG [c.c.a.m.ClusteredAgentAttache] 
> (Work-Job-Executor-156:ctx-28f7465a job-2867/job-2869 ctx-a04e0ed9)
> (logid:a9ef7fe7) Seq 5-6201456686889173919: Forwarding Seq
> 5-6201456686889173919:  { Cmd , MgmtId: 240661250348494, via: 
> 5(wxac6001),
> Ver: v1, Flags: 100011,
> [{"com.cloud.agent.api.RevertToVMSnapshotCommand":{"reloadVm":false,"vmUuid":"b2a78e9c-06ab-4200-ad6d-fe095f622502","volumeTOs":[{"uuid":"7a58ffdc-b02c-41bf-963c-be56c2da0e9b","volumeType":"ROOT","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"WXACP01CL01_LUN10","id":19,"poolType":"PreSetup","host":"localhost","path":"/WXACP01CL01_LUN10","port":0,"url":"PreSetup://localhost/WXACP01CL01_LUN10/?ROLE=Primary&STOREUUID=WXACP01CL01_LUN10","isManaged":false}},"name":"ROOT-33","size":21474836480,"path":"dd1cf43d-d5a4-4633-9c3e-8f73d1ccc484","volumeId":93,"vmName":"i-2-33-VM","accountId":2,"format":"VHD","provisioningType":"THIN","id":93,"deviceId":0,"hypervisorType":"XenServer"},{"uuid":"74268aa2-b4e5-4574-a981-027e55b5383f","volumeType":"DATADISK","dataStore":{"org.apache.
> cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"WXACP01CL01_LUN01",
> "id":1,"poolType":"PreSetup","host":"localhost","path":"/WXACP01CL01_L
> UN01","port":0,"url":"PreSetup://localhost/WXACP01CL01_LUN01/?ROLE=Pri
> mary&STOREUUID=WXACP01CL01_LUN01","isManaged":false}},"name":"DATA-33"
> ,"size":1099511627776,"path":"e2ead686-d0bb-49f2-b656-77c2bf497990","v
> olumeId":95,"vmName":"i-2-33-VM","accountId":2,"format":"VHD","provisi
> oningType":"THIN","id":95,"deviceId":1,"hypervisorType":"XenServer"}],
> "target":{"id":27,"snapshotName":"i-2-33-VM_VS_20190122155503","type":
> "Disk","createTime":1548172503000,"current":true,"description":"asdfas
> df","quiescevm":true},"vmName":"i-2-33-VM","guestOSType":"CentOS
> 7","wait":0}}] } to 55935224135780
>
> 2019-01-23 00:06:54,222 DEBUG [c.c.a.t.Request]
> (AgentManager-Handler-14:null) (logid:) Seq 5-6201456686889173919:
> Processing:  { Ans: , MgmtId: 240661250348494, via: 5, Ver: v1, Flags: 
> 10, 
> [{"com.cloud.agent.api.RevertToVMSnapshotAnswer":{"result":false,"details":"
> Hypervisor
> com.cloud.hypervisor.xenserver.resource.XenServer650Resource
> doesn't support guest OS type CentOS 7. you can choose 'Other install 
> media' to run it as HVM","wait":0}}] }
> 2019-01-23 00:06:54,223 DEBUG [c.c.a.t.Request] 
> (Work-Job-Executor-156:ctx-28f7465a job-2867/job-2869 ctx-a04e0ed9)
> (logid:a9ef7fe7) Seq 5-6201456686889173919: Received:  { Ans: , MgmtId:
> 240661250348494, via: 5(wxac6001), Ver: v1, Flags: 10, { 
> RevertToVMSnapshotAnswer } }
> 2019-01-23 00:06:54,223 ERROR [o.a.c.s.v.DefaultVMSnapshotStrategy]
> (Work-Job-Executor-156:ctx-28f7465a job-2867/job-2869 ctx-a04e0ed9)
> (logid:a9ef7fe7) Revert VM: i-2-33-VM to snapshot:
> i-2-33-VM_VS_20190122155503 fai

RE: Snapshots on KVM corrupting disk images

2019-01-22 Thread Sean Lair
Hi Simon

It is an NFS mount.  The underlying storage is NetApp, which we run a lot of 
different environments on; it is rock-solid, and the only issues we've had are 
with KVM snapshots.

Thanks
Sean

-Original Message-
From: Simon Weller [mailto:swel...@ena.com.INVALID] 
Sent: Tuesday, January 22, 2019 10:42 AM
To: us...@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: Re: Snapshots on KVM corrupting disk images

Sean,


What underlying primary storage are you using and how is it being utilized by 
ACS (e.g. NFS, shared mount et al)?



- Si



From: Sean Lair 
Sent: Tuesday, January 22, 2019 10:30 AM
To: us...@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: Snapshots on KVM corrupting disk images

Hi all,

We have had some instances where VM disks became corrupted when using KVM 
snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.

The first time was when someone mass-enabled scheduled snapshots on a large 
number of VMs and secondary storage filled up.  We had to restore all of those 
VM disks... but we believed it was just our fault for letting secondary storage 
fill up.

Today we had an instance where a snapshot failed and now the disk image is 
corrupted and the VM can't boot.  Here is the output of some commands:

---
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not 
read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not 
read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
---

We tried restoring to before the snapshot failure, but still have strange 
errors:

--
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 73G
cluster_size: 65536
Snapshot list:
ID        TAG                                     VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G 2018-12-23 11:01:43   3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G 2019-01-06 11:03:16   3431:52:23.942
Format specific information:
compat: 1.1
lazy refcounts: false

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 
0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 
0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 
0x55d16ddd9f7d
No errors were found on the image.

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
Snapshot list:
ID        TAG                                     VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G 2018-12-23 11:01:43   3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G 2019-01-06 11:03:16   3431:52:23.942
--

Everyone is now extremely hesitant to use snapshots in KVM.  We tried 
deleting the snapshots in the restored disk image, but it errors out...


Does anyone else have issues with KVM snapshots?  We are considering just 
disabling this functionality now...

Thanks
Sean








Snapshots on KVM corrupting disk images

2019-01-22 Thread Sean Lair
Hi all,

We have had some instances where VM disks became corrupted when using KVM 
snapshots.  We are running CloudStack 4.9.3 with KVM on CentOS 7.

The first time was when someone mass-enabled scheduled snapshots on a large 
number of VMs and secondary storage filled up.  We had to restore all of those 
VM disks... but we believed it was just our fault for letting secondary storage 
fill up.

Today we had an instance where a snapshot failed and now the disk image is 
corrupted and the VM can't boot.  Here is the output of some commands:

---
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not 
read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not 
read snapshots: File too large

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
---

We tried restoring to before the snapshot failure, but still have strange 
errors:

--
[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
-rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 73G
cluster_size: 65536
Snapshot list:
ID        TAG                                     VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G 2018-12-23 11:01:43   3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G 2019-01-06 11:03:16   3431:52:23.942
Format specific information:
compat: 1.1
lazy refcounts: false

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
tcmalloc: large alloc 1539750010880 bytes == (nil) @  0x7fb9cbbf7bf3 
0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 
0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 
0x55d16ddd9f7d
No errors were found on the image.

[root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l 
./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
Snapshot list:
ID        TAG                                     VM SIZE                DATE       VM CLOCK
1         a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G 2018-12-23 11:01:43   3099:35:55.242
2         b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G 2019-01-06 11:03:16   3431:52:23.942
--

Everyone is now extremely hesitant to use snapshots in KVM.  We tried 
deleting the snapshots in the restored disk image, but it errors out...
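
In case it helps anyone reproduce, the deletion attempts were along the lines of the 
commands below (the snapshot tags are the ones from the qemu-img output above; the 
"recovered" output filename is just an example).  The convert at the end is a possible 
workaround we are considering, since converting should write out a fresh qcow2 without 
the internal snapshots - not yet tested here:

# try to drop the internal snapshots by tag - this is the step that errors out for us
qemu-img snapshot -d a8fdf99f-8219-4032-a9c8-87a6e09e7f95 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
qemu-img snapshot -d b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80

# possible workaround (untested): copy the data into a new image, which drops
# the internal snapshots as a side effect
qemu-img convert -O qcow2 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80 ./184aa458-recovered.qcow2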


Does anyone else have issues with KVM snapshots?  We are considering just 
disabling this functionality now...

Thanks
Sean








RE: [VOTE] Apache CloudStack 4.11.1.0 LTS [RC3]

2018-06-22 Thread Sean Lair
Would someone mind testing a Restart VPC w/ Cleanup on a VPC that has a 
private gateway configured?  The test 
"test_03_vpc_privategw_restart_vpc_cleanup" is failing due to the following 
(according to logs).  My test environment is not available right now so I can't 
check myself.  I don't have this problem in my 4.9.3 prod environment. 


Java.lang.NullPointerException
at 
com.cloud.network.router.NicProfileHelperImpl.createPrivateNicProfileForGateway(NicProfileHelperImpl.java:95)



NicProfileHelperImpl.java (Lines 93 - 95)

final PrivateIpAddress ip =
        new PrivateIpAddress(ipVO, privateNetwork.getBroadcastUri().toString(), privateNetwork.getGateway(), netmask,
                NetUtils.long2Mac(NetUtils.createSequenceBasedMacAddress(ipVO.getMacAddress(), NetworkModel.MACIdentifier.value())));


Thanks
Sean

-Original Message-
From: Paul Angus [mailto:paul.an...@shapeblue.com] 
Sent: Thursday, June 21, 2018 11:00 AM
To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
Subject: [VOTE] Apache CloudStack 4.11.1.0 LTS [RC3]

Hi All,



I've created a 4.11.1.0 release (RC3), with the following artefacts up for 
testing and a vote:
The changes since RC2 are listed at the end of this email.



Git Branch and Commit SH:

https://gitbox.apache.org/repos/asf?p=cloudstack.git;a=shortlog;h=refs/heads/4.11.1.0-RC20180621T1552

Commit: 2cb2dacbe75a23f5068b80f6ea45031c29052c31



Source release (checksums and signatures are available at the same

location):

https://dist.apache.org/repos/dist/dev/cloudstack/4.11.1.0/



PGP release keys (signed using 8B309F7251EE0BC8):

https://dist.apache.org/repos/dist/release/cloudstack/KEYS



The vote will be open for at least 72hrs.



For sanity in tallying the vote, can PMC members please be sure to indicate 
"(binding)" with their vote?



[ ] +1  approve

[ ] +0  no opinion

[ ] -1  disapprove (and reason why)





Additional information:



For users' convenience, I've built packages from 
5f48487dc62fd1decaabc4ab2a10f549d6c82400 and published RC1 repository here:

http://packages.shapeblue.com/testing/4111rc3/



The release notes are still work-in-progress, but the systemvm template upgrade 
section has been updated. You may refer the following for systemvm template 
upgrade testing:

http://docs.cloudstack.apache.org/projects/cloudstack-release-notes/en/latest/index.html



4.11.1 systemvm templates are available from here:

http://packages.shapeblue.com/systemvmtemplate/4.11.1-rc1/




Changes Since RC2:

Merged #2712 reuse ip for non redundant VPC (6 hours ago)
Merged #2714 send unsupported answer only when applicable (10 hours ago)
Merged #2715 smoketest: Fix test_vm_life_cycle secure migration tests (a day ago)
Merged #2493 CLOUDSTACK-10326: Prevent hosts fall into Maintenance when there are running VMs on it (a day ago)
Merged #2716 configdrive: make fewer mountpoints on hosts (a day ago)
Merged #2681 Source NAT option on Private Gateway (2 days ago)
Merged #2710 comply with api key constraint (2 days ago)
Merged #2706 packaging: use libuuid x86_64 package for cloudstack-common (2 days ago)

Kind regards,

Paul Angus


paul.an...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
  
 



RE: Private Gateway SNAT Bug

2018-05-29 Thread Sean Lair
No problem!  

https://github.com/apache/cloudstack/issues/2680

Also, created a possible Pull Request:

https://github.com/apache/cloudstack/pull/2681




-Original Message-
From: Rafael Weingärtner [mailto:rafaelweingart...@gmail.com] 
Sent: Tuesday, May 29, 2018 4:11 PM
To: dev 
Subject: Re: Private Gateway SNAT Bug

Thanks Sean. Can you do something for us?
Can you open an issue at https://github.com/apache/cloudstack/issues/?
We decided not to use Jira anymore. Also, can you close the jira ticket?

On Tue, May 29, 2018 at 6:08 PM, Sean Lair  wrote:

> Opened up Issue with more info:
>
> https://issues.apache.org/jira/browse/CLOUDSTACK-10379
>
>
> -Original Message-
> From: Sean Lair
> Sent: Tuesday, May 29, 2018 12:08 PM
> To: dev@cloudstack.apache.org
> Subject: Private Gateway SNAT Bug
>
> I've found a bug in the Private Gateway functionality, when Source NAT 
> is enabled for the Private Gateway.  When the SNAT is added to 
> iptables, it has the source CIDR of the private gateway subnet.  Since 
> no VMs live in that private gateway subnet, the SNAT doesn't work.  Below is 
> an example:
>
>
> -  VMs have IP addresses in the 10.0.0.0/24 subnet.
>
> -  The Private Gateway address is 10.101.141.2/30
>
> See the outputs below, see how the SOURCE field for the new SNAT 
> (eth3) only matches if the source is 10.101.141.0/30?  Since the VM 
> has an IP address in 10.0.0.0/24, the VMs don't get SNAT'd as they 
> should when talking across the private gateway.  The SOURCE should be set to 
> ANYWHERE.
>
> BEFORE ADDING PRIVATE GATEWAY
> ---
> Chain POSTROUTING (policy ACCEPT 1 packets, 52 bytes)
> pkts bytes target prot opt in out source
>  destination
> 2   736 SNAT   all  --  anyeth210.0.0.0/24
> anywhere to:10.0.0.1
>16  1039 SNAT   all  --  anyeth1anywhere
>  anywhere to:46.99.52.18
>
> AFTER ADDING PRIVATE GATEWAY W/ SNAT
> ---
> Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
> pkts bytes target prot opt in out source
>  destination
> 0 0 SNAT   all  --  anyeth310.101.141.0/30
> anywhere to:10.101.141.2
> 2   736 SNAT   all  --  anyeth210.0.0.0/24
> anywhere to:10.0.0.1
>23  1515 SNAT   all  --  anyeth1anywhere
>  anywhere to:46.99.52.18
>
>
> It looks like CsAddress.py treats the creation of the Private Gateway 
> SNAT as if it is a GUEST network, which works fine, except for the 
> SNAT problem shown above.  Here is the code from MASTER (line 479 is SNAT 
> rule):
>
>
> if self.get_type() in ["guest"]:
> ...
> ...
> self.fw.append(["nat", "front",
> "-A POSTROUTING -s %s -o %s -j SNAT --to-source %s" %
> (guestNetworkCidr, self.dev, self.address['public_ip'])])
>
> I am thinking we just change that to the following.  I can't think of 
> any reason we need the source/guest CIDR specified:
>
> if self.get_type() in ["guest"]:
> ...
> ...
> self.fw.append(["nat", "front",
> "-A POSTROUTING -o %s -j SNAT --to-source %s" %
> (self.dev, self.address['public_ip'])])
>
>
> THE NAT TABLE IF THE ABOVE CODE CHANGE IS MADE
> ---
> Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
> pkts bytes target prot opt in out source
>  destination
> 0 0 SNAT   all  --  anyeth3anywhere
>  anywhere to:10.101.141.2
> 2   736 SNAT   all  --  anyeth2anywhere
>  anywhere to:10.0.0.1
>23  1515 SNAT   all  --  anyeth1anywhere
>
> Thoughts everyone?
>
>


--
Rafael Weingärtner


RE: Private Gateway SNAT Bug

2018-05-29 Thread Sean Lair
Opened up Issue with more info:

https://issues.apache.org/jira/browse/CLOUDSTACK-10379


-Original Message-
From: Sean Lair 
Sent: Tuesday, May 29, 2018 12:08 PM
To: dev@cloudstack.apache.org
Subject: Private Gateway SNAT Bug

I've found a bug in the Private Gateway functionality, when Source NAT is 
enabled for the Private Gateway.  When the SNAT is added to iptables, it has 
the source CIDR of the private gateway subnet.  Since no VMs live in that 
private gateway subnet, the SNAT doesn't work.  Below is an example:


-  VMs have IP addresses in the 10.0.0.0/24 subnet.

-  The Private Gateway address is 10.101.141.2/30

See the outputs below, see how the SOURCE field for the new SNAT (eth3) only 
matches if the source is 10.101.141.0/30?  Since the VM has an IP address in 
10.0.0.0/24, the VMs don't get SNAT'd as they should when talking across the 
private gateway.  The SOURCE should be set to ANYWHERE.

BEFORE ADDING PRIVATE GATEWAY
---
Chain POSTROUTING (policy ACCEPT 1 packets, 52 bytes)
 pkts bytes target     prot opt in     out     source               destination
    2   736 SNAT       all  --  any    eth2    10.0.0.0/24          anywhere             to:10.0.0.1
   16  1039 SNAT       all  --  any    eth1    anywhere             anywhere             to:46.99.52.18

AFTER ADDING PRIVATE GATEWAY W/ SNAT
---
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 SNAT       all  --  any    eth3    10.101.141.0/30      anywhere             to:10.101.141.2
    2   736 SNAT       all  --  any    eth2    10.0.0.0/24          anywhere             to:10.0.0.1
   23  1515 SNAT       all  --  any    eth1    anywhere             anywhere             to:46.99.52.18


It looks like CsAddress.py treats the creation of the Private Gateway SNAT as 
if it is a GUEST network, which works fine, except for the SNAT problem shown 
above.  Here is the code from MASTER (line 479 is SNAT rule):


if self.get_type() in ["guest"]:
...
...
self.fw.append(["nat", "front",
"-A POSTROUTING -s %s -o %s -j SNAT --to-source %s" %
(guestNetworkCidr, self.dev, self.address['public_ip'])])

I am thinking we just change that to the following.  I can't think of any 
reason we need the source/guest CIDR specified:

if self.get_type() in ["guest"]:
...
...
self.fw.append(["nat", "front",
"-A POSTROUTING -o %s -j SNAT --to-source %s" %
(self.dev, self.address['public_ip'])])


THE NAT TABLE IF THE ABOVE CODE CHANGE IS MADE
---
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 SNAT       all  --  any    eth3    anywhere             anywhere             to:10.101.141.2
    2   736 SNAT       all  --  any    eth2    anywhere             anywhere             to:10.0.0.1
   23  1515 SNAT       all  --  any    eth1    anywhere

Thoughts everyone?



Private Gateway SNAT Bug

2018-05-29 Thread Sean Lair
I've found a bug in the Private Gateway functionality, when Source NAT is 
enabled for the Private Gateway.  When the SNAT is added to iptables, it has 
the source CIDR of the private gateway subnet.  Since no VMs live in that 
private gateway subnet, the SNAT doesn't work.  Below is an example:


-  VMs have IP addresses in the 10.0.0.0/24 subnet.

-  The Private Gateway address is 10.101.141.2/30

See the outputs below, see how the SOURCE field for the new SNAT (eth3) only 
matches if the source is 10.101.141.0/30?  Since the VM has an IP address in 
10.0.0.0/24, the VMs don't get SNAT'd as they should when talking across the 
private gateway.  The SOURCE should be set to ANYWHERE.

BEFORE ADDING PRIVATE GATEWAY
---
Chain POSTROUTING (policy ACCEPT 1 packets, 52 bytes)
 pkts bytes target     prot opt in     out     source               destination
    2   736 SNAT       all  --  any    eth2    10.0.0.0/24          anywhere             to:10.0.0.1
   16  1039 SNAT       all  --  any    eth1    anywhere             anywhere             to:46.99.52.18

AFTER ADDING PRIVATE GATEWAY W/ SNAT
---
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 SNAT       all  --  any    eth3    10.101.141.0/30      anywhere             to:10.101.141.2
    2   736 SNAT       all  --  any    eth2    10.0.0.0/24          anywhere             to:10.0.0.1
   23  1515 SNAT       all  --  any    eth1    anywhere             anywhere             to:46.99.52.18


It looks like CsAddress.py treats the creation of the Private Gateway SNAT as 
if it is a GUEST network, which works fine, except for the SNAT problem shown 
above.  Here is the code from MASTER (line 479 is SNAT rule):


if self.get_type() in ["guest"]:
...
...
self.fw.append(["nat", "front",
"-A POSTROUTING -s %s -o %s -j SNAT --to-source %s" %
(guestNetworkCidr, self.dev, self.address['public_ip'])])

I am thinking we just change that to the following.  I can't think of any 
reason we need the source/guest CIDR specified:

if self.get_type() in ["guest"]:
...
...
self.fw.append(["nat", "front",
"-A POSTROUTING -o %s -j SNAT --to-source %s" %
(self.dev, self.address['public_ip'])])
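
To make the difference concrete with the example above (dev = eth3, public_ip = 
10.101.141.2, guestNetworkCidr = 10.101.141.0/30), the two versions of the code render 
roughly these rules on the router:

# current code - only SNATs traffic sourced from the private gateway /30, which no VM uses
iptables -t nat -A POSTROUTING -s 10.101.141.0/30 -o eth3 -j SNAT --to-source 10.101.141.2

# proposed code - SNATs anything leaving eth3, so the 10.0.0.0/24 guest traffic is covered
iptables -t nat -A POSTROUTING -o eth3 -j SNAT --to-source 10.101.141.2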


THE NAT TABLE IF THE ABOVE CODE CHANGE IS MADE
---
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 SNAT       all  --  any    eth3    anywhere             anywhere             to:10.101.141.2
    2   736 SNAT       all  --  any    eth2    anywhere             anywhere             to:10.0.0.1
   23  1515 SNAT       all  --  any    eth1    anywhere

Thoughts everyone?



RE: HA issues

2018-03-01 Thread Sean Lair
FYI Nux, I opened the following PR for the change we made in our environment to 
get VM HA to work.  I referenced your ticket!

https://github.com/apache/cloudstack/pull/2474


-Original Message-
From: Nux! [mailto:n...@li.nux.ro] 
Sent: Monday, January 22, 2018 8:15 AM
To: dev 
Subject: Re: HA issues

Hi,

Installed and reinstalled, VM HA just does not work for me.
In addition, if the HV going AWOL is hosting the systemvms, then they also do 
not get restarted despite available HVs online.
I've opened another ticket with logs:

https://issues.apache.org/jira/browse/CLOUDSTACK-10246

Happy to allow access to my rig if it helps.

I've disabled firewall and whatnot also left out other bits of network hardware 
just to keep it simpler, still no go.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

- Original Message -
> From: "Paul Angus" 
> To: "dev" 
> Sent: Saturday, 20 January, 2018 08:40:01
> Subject: RE: HA issues

> No problem,
> 
> To be honest host-ha was developed *because* vm-ha was not reliable 
> under a number of conditions, including a host failure.
> 
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>  
> 
> 
> 
> -Original Message-
> From: Nux! [mailto:n...@li.nux.ro]
> Sent: 19 January 2018 14:26
> To: dev 
> Subject: Re: HA issues
> 
> Hi Paul,
> 
> Thanks for checking. My compute offering is HA enabled, of course.
> Host HA is disabled as well as OOBM.
> 
> 
> I'll do the tests again on Monday and report back.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> - Original Message -
>> From: "Paul Angus" 
>> To: "dev" 
>> Sent: Friday, 19 January, 2018 14:10:06
>> Subject: RE: HA issues
> 
>> Hey Nux,
>> 
>> I've being testing out the host-ha feature against a couple of physical 
>> hosts.
>> I've found that if the compute offering isn't ha enabled, then the vm isn't
>> restarted on the original host when it is rebooted, or any other host.If
>> the vm is ha-enabled, then the vm was restarted on the original host 
>> when host ha restarted the host.
>> 
>> Can you double check that the instance was an ha-enabled one?
>> 
>> OR
>> maybe the timeouts for the host-ha are too long and the vm-ha 
>> timed-out before hand ...?
>> 
>> 
>> 
>> Kind regards,
>> 
>> Paul Angus
>> 
>> paul.an...@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> 
>> -Original Message-
>> From: Nux! [mailto:n...@li.nux.ro]
>> Sent: 17 January 2018 09:12
>> To: dev 
>> Subject: Re: HA issues
>> 
>> Right, sorry for using the terms interchangeably, I see what you mean.
>> 
>> I'll do further testing then as VM HA was also not working in my setup.
>> 
>> I'll be back.
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> - Original Message -
>>> From: "Rohit Yadav" 
>>> To: "dev" 
>>> Sent: Wednesday, 17 January, 2018 09:09:19
>>> Subject: Re: HA issues
>> 
>>> Hi Lucian,
>>> 
>>> 
>>> The "Host HA" feature is entirely different from VM HA, however, 
>>> they may work in tandem, so please stop using the terms 
>>> interchangeably as it may cause the community to believe a regression has 
>>> been caused.
>>> 
>>> 
>>> The "Host HA" feature currently ships with only "Host HA" provider 
>>> for KVM that is strictly tied to out-of-band management (IPMI for 
>>> fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary 
>>> storage).
>>> (We also have a provider for simulator, but that's for 
>>> coverage/testing purposes).
>>> 
>>> 
>>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is 
>>> enabled.
>>> The frameowkr allows interested parties may write their own HA 
>>> providers for a hypervisor that can use a different 
>>> strategy/mechanism for fencing/recovery of hosts (including write a 
>>> non-IPMI based OOBM
>>> plugin) and host/disk activity checker that is non-NFS based.
>>> 
>>> 
>>> The "Host HA" feature ships disabled by default and does not cause 
>>> any interference with VM HA. However, when enabled and configured 
>>> correctly, it is a known limitation that when it is unable to 
>>> successfully perform recovery or fencing tasks it may not trigger VM 
>>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>>> would try couple of times to recover and failing to do so, it would 
>>> eventually trigger a host fencing task. If it's unable to fence a 
>>> host, it will indefinitely attempt to fence the host (the host state 
>>> will be stuck at fencing state in cloud.ha_config table for example) 
>>> and alerts will be sent to admin who can do some manual intervention 
>>> to handle such situations (if you've email/smtp enabled, you should 
>>> see alert emails).
>>> 
>>> 
>>> We can discuss how to improve and have a workaround for the case 
>>> you've hit, thanks for sharing.
>>> 
>

RE: HA issues

2018-03-01 Thread Sean Lair
Based on your note we made the following change:

https://github.com/apache/cloudstack/pull/2472

It adds a sleep between retries and then stops the cloudstack-agent if it still 
can't write the heartbeat file after the retries...  At least this way an alert 
is raised instead of a hard reboot.  Also, it allows HA to kick in and handle 
things correctly.
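
In rough pseudocode the change looks like the sketch below (simplified; the real diff 
is in the PR above, and write_hb here is just a placeholder for the existing 
heartbeat-write block in kvmheartbeat.sh):

# simplified sketch only - not the actual patch
for attempt in 1 2 3 4 5; do
    if write_hb; then          # write_hb = placeholder for the existing heartbeat write
        exit 0
    fi
    sleep 10                   # give the storage a chance to recover between retries
done

# still failing after all retries: stop the agent so the host goes into Alert and
# HA can take over, instead of force-rebooting the hypervisor
systemctl stop cloudstack-agent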


-Original Message-
From: Andrija Panic [mailto:andrija.pa...@gmail.com] 
Sent: Tuesday, February 20, 2018 5:16 PM
To: dev 
Subject: Re: HA issues

That is good to hear ( no NFS issues causing Agent Disconnect).

I assume you are using "normal" NFS solution with proper HA and no ZFS (kernel 
panic etc), but anyway be aware of this one

https://github.com/apache/cloudstack/blob/e532b574ddb186a117da638fb6059356fe7c266c/scripts/vm/hypervisor/kvm/kvmheartbeat.sh#L161



we used to comment this line, because we did have some issues with 
communication link, and this commented line saved our a$$ few times :)

CHeers

On 20 February 2018 at 20:50, Sean Lair  wrote:

> Hi Andrija
>
> We are currently running XenServer in production.  We are working on 
> moving to KVM and have it deployed in a development environment.
>
> The team is putting CloudStack + KVM through its paces and that is 
> when it was discovered how broken VM HA is in 4.9.3.  Initially our 
> patches fixed VM HA, but just caused VMs to get started on two hosts 
> during failure testing.  The libvirt lockd has solved that issue thus far.
>
> Short answer to you question is :-), we were not having problems with 
> Agent Disconnects in a production environment.  It was our testing/QA 
> that revealed the issues.  Our NFS has been stable so far, no issues 
> with the agent crashing/stopping that wasn't initiated by the team's testing.
>
> Thanks
> Sean
>
>
> -Original Message-
> From: Andrija Panic [mailto:andrija.pa...@gmail.com]
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev 
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd..) - so, did you 
> manage to understand what can cause the Agent Disconnect in most 
> cases, for you specifically? Is there any software (CloudStack) root 
> cause (disregarding i.e. networking issues etc)
>
> Just our examples, which you should probably not have:
>
> We had CEPH cluster running (with ACS), and there any exception in 
> librbd would crash JVM and the agent, but this has been fixed mostly - 
> Now get i.e. agent disconnect when ACS try to delete volume on CEPH 
> (and for some reason not succeed withing 30 minutes, volume deletion 
> fails) - then libvirt get's completety stuck (virsh list even dont 
> work)...so  agent get's disconnect eventually.
>
> It would be good to get rid of agent disconnections in general, 
> obviously
> :) so that is why I'm asking (you are on NFS, so would like to see 
> your experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair  wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced and
> > coming up on two hosts because of VM HA.   However, we updated some of
> the
> > logic for VM HA and turned on libvirtd's locking mechanism.  Now we 
> > are working great w/o IPMI.  The locking stops the VMs from starting 
> > elsewhere, and everything recovers very nicely when the host starts
> responding again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it 
> > may work along-side IPMI just fine - it would just have affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if 
> > the agent stops responding, but the host is still up, the VMs 
> > continue running and no actual downtime is incurred.  Even when VM 
> > HA attempts to power on the VMs on another host, it just fails the 
> > power-up and the VMs continue to run on the "agent disconnected" 
> > host. The host goes into alarm state and our NOC can look into what 
> > is wrong the agent on the host.  If IPMI was enabled, it sounds like 
> > it would power off the host (fence) and force downtime for us even 
> > if the VMs were actually running OK - and just the agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.
> > But I can also send the updated code to anyone that wants to do some 
> > testing before then.
> >
> > -Original Message-
> > From: Marcus [mailto:shadow...@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it 

RE: [IMPENDING SHUTDOWN] Re: Replacing download.cloud.com by download.cloudstack.org

2018-02-28 Thread Sean Lair
Looks like it is still referenced here:

http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/4.11/templates/_password.html



-Original Message-
From: Chiradeep Vittal [mailto:chirade...@gmail.com] 
Sent: Tuesday, February 27, 2018 3:59 PM
To: dev 
Subject: Re: [IMPENDING SHUTDOWN] Re: Replacing download.cloud.com by 
download.cloudstack.org

For the last 6 days, here are the stats (first column is downloads)

   2381 templates/4.3/systemvm64template-2014-01-14-master-kvm.qcow2.bz2

 89 templates/4.5.1/systemvm64template-2015-05-14-4.5.1-xen.vhd.bz2

 77 releases/2.2.0/eec2209b-9875-3c8d-92be-c001bd8a0faf.qcow2.bz2

 51 templates/4.3/systemvm64template-2014-04-10-master-xen.vhd.bz2

 50 templates/acton/acton-systemvm-02062012.vhd.bz2

 50 templates/4.2/64bit/systemvmtemplate64-2013-07-15-master-xen.vhd.bz2

 34 templates/4.5/systemvm64template-2015-02-03-4.5.0-xen.vhd.bz2

 25 templates/4.5.1/systemvm64template-2015-08-20-4.5.1-xen.vhd.bz2

 24 templates/4.7/systemvm64template-2016-03-24-4.7.0-xen.vhd.bz2

 24 templates/4.5/systemvm64template-4.5-kvm.qcow2.bz2

 22 templates/4.2/systemvmtemplate-2013-06-12-master-kvm.qcow2.bz2

 20 templates/4.3/systemvm64template-2014-09-30-4.3-xen.vhd.bz2

 20 templates/4.2/systemvmtemplate-2013-07-12-master-xen.vhd.bz2

 16 templates/4.3/systemvm64template-2014-01-14-master-xen.vhd.bz2

 15 templates/builtin/centos56-x86_64.vhd.bz2

 15 templates/4.5/systemvm64template-2014-12-18-4.5.0.0-xen.vhd.bz2

 13 templates/4.3/systemvm64template-2015-02-04-4.3-xen.vhd.bz2

  5 templates/4.3/systemvm64template-2014-06-23-master-kvm.qcow2.bz2

  4 templates/4.5.1/systemvm64template-2016-04-15-4.5.1-xen.vhd.bz2

  3 templates/4.3/systemvm64template-2014-06-23-master-xen.vhd.bz2

  2 templates/4.3/systemvm64template-2015-02-04-4.3-kvm.qcow2.bz2

  1 templates/acton/acton-systemvm-02062012.qcow2.bz2

  1 releases/4.3/centos6_4_64bit.vhd.bz2


Here's a list of the top IPs downloading. If you recognize yourself here (the 
top one is from one location in Montreal), please fix:

   2370 216.113.73.34

220 175.107.195.22

 60 66.165.176.60

 42 180.222.191.150

 38 82.192.93.187

 16 60.27.95.237

 16 200.124.137.20

 10 216.55.171.4

 10 130.185.128.25

  9 81.142.101.129

  9 211.125.79.4

  8 193.144.82.46

  7 217.192.89.130

  7 199.115.112.53

  6 5.152.164.8

  6 121.15.182.138

  5 91.223.182.11

  5 69.26.35.120

  5 219.163.55.73

  5 194.19.236.162

  5 178.16.163.70

  5 14.139.116.2

  4 84.33.37.2

  4 74.84.196.150

  4 185.53.31.146

  4 12.129.245.254

  4 119.31.171.20

  2 61.50.103.158

  2 58.210.242.134

  2 58.140.89.62

  2 41.77.158.254

  2 222.80.81.132

  2 218.104.96.139

  2 178.170.92.5

  2 177.47.20.58

  2 118.70.146.225

  2 115.249.104.17

  2 112.217.184.196

  2 112.217.184.195

  2 111.198.74.125

  2 103.4.132.9

On Tue, Feb 27, 2018 at 11:22 AM, Wido den Hollander  wrote:

> Yes! Sounds good to me
>
> > Op 27 feb. 2018 om 20:02 heeft Rohit Yadav 
> > 
> het volgende geschreven:
> >
> > Sounds good to me.
> >
> >
> > - Rohit
> >
> > 
> > From: Chiradeep Vittal 
> > Sent: Tuesday, February 27, 2018 18:33
> > To: dev
> > Subject: [IMPENDING SHUTDOWN] Re: Replacing download.cloud.com by
> download.cloudstack.org
> >
> > I would like to propose a shutdown window of 60 days starting March 1.
> > Every few days (3?) we can announce the impending shutdown on users@ 
> > and dev@.
> > At day 30 we can temporarily disable (using S3 permissions) 
> > download.cloud.com for 72 hours to see if this any impact.
> >
> > Thoughts?
> >
> >
> > rohit.ya...@shapeblue.com
> > www.shapeblue.com
> > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> >
> >
> >
> >> On Mon, Feb 26, 2018 at 4:26 AM, Wido den Hollander 
> wrote:
> >>
> >>
> >>> On 02/25/2018 12:11 AM, Chiradeep Vittal wrote:
> >>>
> >>> Which releases still refer to them in the setup SQL?
> >>>
> >>
> >> I think it was 4.9? Does anybody know this exactly?
> >>
> >> Wido
> >>
> >>
> >>> Sent from my iPhone
> >>>
>  On Feb 24, 2018, at 12:31 PM, Wido den Hollander 
> wrote:
> 
> 
> 
> > On 02/24/2018 12:14 AM, Chiradeep Vittal wrote:
> > Citrix is wondering if they can shut down download.cloud.com at this
> > point.
> 
> 
>  Grepping through the source I see no references to download.cloud.com
>  anymore.
> 
>  I *think* it can be shut down. Any other opinions on this?
> 
>  Wido
> 
> > Thanks!
> >>
> >> On Fri, Mar 10, 2017 at 2:39 AM, Wido den Hollander  >
> >> wrote:
> >>
> >>> Op 10 maart 2017 om 11:36 schreef Wido den Hollander <
> w...@widodh.nl>:
> >>>
> >>>
> >>> Hi,
> >>>
> >>> 

RE: HA issues

2018-02-21 Thread Sean Lair
Thanks so much for the info - we'll look at that line also!

I'll let you know when we create a PR for our changes - in case you want to 
review them for your environment

-Original Message-
From: Andrija Panic [mailto:andrija.pa...@gmail.com] 
Sent: Tuesday, February 20, 2018 5:16 PM
To: dev 
Subject: Re: HA issues

That is good to hear ( no NFS issues causing Agent Disconnect).

I assume you are using "normal" NFS solution with proper HA and no ZFS (kernel 
panic etc), but anyway be aware of this one

https://github.com/apache/cloudstack/blob/e532b574ddb186a117da638fb6059356fe7c266c/scripts/vm/hypervisor/kvm/kvmheartbeat.sh#L161



we used to comment this line, because we did have some issues with 
communication link, and this commented line saved our a$$ few times :)

CHeers

On 20 February 2018 at 20:50, Sean Lair  wrote:

> Hi Andrija
>
> We are currently running XenServer in production.  We are working on 
> moving to KVM and have it deployed in a development environment.
>
> The team is putting CloudStack + KVM through its paces and that is 
> when it was discovered how broken VM HA is in 4.9.3.  Initially our 
> patches fixed VM HA, but just caused VMs to get started on two hosts 
> during failure testing.  The libvirt lockd has solved that issue thus far.
>
> Short answer to you question is :-), we were not having problems with 
> Agent Disconnects in a production environment.  It was our testing/QA 
> that revealed the issues.  Our NFS has been stable so far, no issues 
> with the agent crashing/stopping that wasn't initiated by the team's testing.
>
> Thanks
> Sean
>
>
> -Original Message-
> From: Andrija Panic [mailto:andrija.pa...@gmail.com]
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev 
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd..) - so, did you 
> manage to understand what can cause the Agent Disconnect in most 
> cases, for you specifically? Is there any software (CloudStack) root 
> cause (disregarding i.e. networking issues etc)
>
> Just our examples, which you should probably not have:
>
> We had CEPH cluster running (with ACS), and there any exception in 
> librbd would crash JVM and the agent, but this has been fixed mostly - 
> Now get i.e. agent disconnect when ACS try to delete volume on CEPH 
> (and for some reason not succeed withing 30 minutes, volume deletion 
> fails) - then libvirt get's completety stuck (virsh list even dont 
> work)...so  agent get's disconnect eventually.
>
> It would be good to get rid of agent disconnections in general, 
> obviously
> :) so that is why I'm asking (you are on NFS, so would like to see 
> your experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair  wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced and
> > coming up on two hosts because of VM HA.   However, we updated some of
> the
> > logic for VM HA and turned on libvirtd's locking mechanism.  Now we 
> > are working great w/o IPMI.  The locking stops the VMs from starting 
> > elsewhere, and everything recovers very nicely when the host starts
> responding again.
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it 
> > may work along-side IPMI just fine - it would just have affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if 
> > the agent stops responding, but the host is still up, the VMs 
> > continue running and no actual downtime is incurred.  Even when VM 
> > HA attempts to power on the VMs on another host, it just fails the 
> > power-up and the VMs continue to run on the "agent disconnected" 
> > host. The host goes into alarm state and our NOC can look into what 
> > is wrong the agent on the host.  If IPMI was enabled, it sounds like 
> > it would power off the host (fence) and force downtime for us even 
> > if the VMs were actually running OK - and just the agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.
> > But I can also send the updated code to anyone that wants to do some 
> > testing before then.
> >
> > -Original Message-
> > From: Marcus [mailto:shadow...@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI 
> > configured, nor host HA enabled, correct? In this case, the correct 
> > thing to do is nothing. If CloudStack cannot guarantee

VM HA starting VMs that were powered off within Guest

2018-02-20 Thread Sean Lair
We have some Windows VMs we have VM HA enabled for.  When a user does a 
shutdown of the VM from within Windows, VM HA reports the following and powers 
the VM back up.  Is this expected behavior?

Log Snip-it:

2018-02-20 19:51:58,898 INFO  [c.c.v.VirtualMachineManagerImpl] 
(AgentManager-Handler-3:null) (logid:) VM i-26-122-VM is at Running and we 
received a power-off report while there is no pending jobs on it
2018-02-20 19:51:58,898 INFO  [c.c.v.VirtualMachineManagerImpl] 
(AgentManager-Handler-3:null) (logid:) Detected out-of-band stop of a HA 
enabled VM i-26-122-VM, will schedule restart
2018-02-20 19:51:58,919 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(AgentManager-Handler-3:null) (logid:) Schedule vm for HA:  VM[User|i-26-122-VM]

Thanks
Sean


RE: HA issues

2018-02-20 Thread Sean Lair
Hi Andrija

We are currently running XenServer in production.  We are working on moving to 
KVM and have it deployed in a development environment.

The team is putting CloudStack + KVM through its paces, and that is when it was 
discovered how broken VM HA is in 4.9.3.  Initially our patches fixed VM HA, but 
that just caused VMs to get started on two hosts during failure testing.  The 
libvirt lockd has solved that issue thus far.

Short answer to your question is :-), we were not having problems with Agent 
Disconnects in a production environment.  It was our testing/QA that revealed 
the issues.  Our NFS has been stable so far, with no issues with the agent 
crashing/stopping that weren't initiated by the team's testing.

Thanks
Sean


-Original Message-
From: Andrija Panic [mailto:andrija.pa...@gmail.com] 
Sent: Saturday, February 17, 2018 1:49 PM
To: dev 
Subject: Re: HA issues

Hi Sean,

(we have 2 threads interleaving on the libvirt lockd..) - so, did you manage to 
understand what can cause the Agent Disconnect in most cases, for you 
specifically? Is there any software (CloudStack) root cause (disregarding i.e. 
networking issues etc)

Just our examples, which you should probably not have:

We had CEPH cluster running (with ACS), and there any exception in librbd would 
crash JVM and the agent, but this has been fixed mostly - Now get i.e. agent 
disconnect when ACS try to delete volume on CEPH (and for some reason not 
succeed withing 30 minutes, volume deletion fails) - then libvirt get's 
completety stuck (virsh list even dont work)...so  agent get's disconnect 
eventually.

It would be good to get rid of agent disconnections in general, obviously
:) so that is why I'm asking (you are on NFS, so would like to see your 
experience here).

Thanks

On 16 February 2018 at 21:52, Sean Lair  wrote:

> We were in the same situation as Nux.
>
> In our test environment we hit the issue with VMs not getting fenced and
> coming up on two hosts because of VM HA.   However, we updated some of the
> logic for VM HA and turned on libvirtd's locking mechanism.  Now we 
> are working great w/o IPMI.  The locking stops the VMs from starting 
> elsewhere, and everything recovers very nicely when the host starts 
> responding again.
>
> We are on 4.9.3 and haven't started testing with 4.11 yet, but it may 
> work along-side IPMI just fine - it would just have affect the fencing.
> However, we *currently* prefer how we are doing it now, because if the 
> agent stops responding, but the host is still up, the VMs continue 
> running and no actual downtime is incurred.  Even when VM HA attempts 
> to power on the VMs on another host, it just fails the power-up and 
> the VMs continue to run on the "agent disconnected" host. The host 
> goes into alarm state and our NOC can look into what is wrong the 
> agent on the host.  If IPMI was enabled, it sounds like it would power 
> off the host (fence) and force downtime for us even if the VMs were 
> actually running OK - and just the agent is unreachable.
>
> I plan on submitting our updates via a pull request at some point.  
> But I can also send the updated code to anyone that wants to do some 
> testing before then.
>
> -Original Message-
> From: Marcus [mailto:shadow...@gmail.com]
> Sent: Friday, February 16, 2018 11:27 AM
> To: dev@cloudstack.apache.org
> Subject: Re: HA issues
>
> From your other emails it sounds as though you do not have IPMI 
> configured, nor host HA enabled, correct? In this case, the correct 
> thing to do is nothing. If CloudStack cannot guarantee the VM state 
> (as is the case with an unreachable hypervisor), it should do nothing, 
> for fear of causing a split brain and corrupting the VM disk (VM running on 
> two hosts).
>
> Clustering and fencing is a tricky proposition. When CloudStack (or 
> any other cluster manager) is not configured to or cannot guarantee 
> state then things will simply lock up, in this case your HA VM on your 
> broken hypervisor will not run elsewhere. This has been the case for a 
> long time with CloudStack, HA would only start a VM after the original 
> hypervisor agent came back and reported no VM is running.
>
> The new feature, from what I gather, simply adds the possibility of 
> CloudStack being able to reach out and shut down the hypervisor to 
> guarantee state. At that point it can start the VM elsewhere. If 
> something fails in that process (IPMI unreachable, for example, or bad 
> credentials), you're still going to be stuck with a VM not coming back.
>
> It's the nature of the thing. I'd be wary of any HA solution that does 
> not reach out and guarantee state via host or storage fencing before 
> starting a VM elsewhere, as it will be making assumptions. Its 
> entirely po

RE: HA issues

2018-02-16 Thread Sean Lair
We were in the same situation as Nux.

In our test environment we hit the issue with VMs not getting fenced and coming 
up on two hosts because of VM HA.   However, we updated some of the logic for 
VM HA and turned on libvirtd's locking mechanism.  Now we are working great w/o 
IPMI.  The locking stops the VMs from starting elsewhere, and everything 
recovers very nicely when the host starts responding again.  

We are on 4.9.3 and haven't started testing with 4.11 yet, but it may work 
alongside IPMI just fine - it would just affect the fencing.  However, we 
*currently* prefer how we are doing it now, because if the agent stops 
responding, but the host is still up, the VMs continue running and no actual 
downtime is incurred.  Even when VM HA attempts to power on the VMs on another 
host, it just fails the power-up and the VMs continue to run on the "agent 
disconnected" host. The host goes into alarm state and our NOC can look into 
what is wrong the agent on the host.  If IPMI was enabled, it sounds like it 
would power off the host (fence) and force downtime for us even if the VMs were 
actually running OK - and just the agent is unreachable.

I plan on submitting our updates via a pull request at some point.  But I can 
also send the updated code to anyone that wants to do some testing before then.

-Original Message-
From: Marcus [mailto:shadow...@gmail.com] 
Sent: Friday, February 16, 2018 11:27 AM
To: dev@cloudstack.apache.org
Subject: Re: HA issues

From your other emails it sounds as though you do not have IPMI configured, nor 
host HA enabled, correct? In this case, the correct thing to do is nothing. If 
CloudStack cannot guarantee the VM state (as is the case with an unreachable 
hypervisor), it should do nothing, for fear of causing a split brain and 
corrupting the VM disk (VM running on two hosts).

Clustering and fencing is a tricky proposition. When CloudStack (or any other 
cluster manager) is not configured to or cannot guarantee state then things 
will simply lock up, in this case your HA VM on your broken hypervisor will not 
run elsewhere. This has been the case for a long time with CloudStack, HA would 
only start a VM after the original hypervisor agent came back and reported no 
VM is running.

The new feature, from what I gather, simply adds the possibility of CloudStack 
being able to reach out and shut down the hypervisor to guarantee state. At 
that point it can start the VM elsewhere. If something fails in that process 
(IPMI unreachable, for example, or bad credentials), you're still going to be 
stuck with a VM not coming back.

It's the nature of the thing. I'd be wary of any HA solution that does not 
reach out and guarantee state via host or storage fencing before starting a VM 
elsewhere, as it will be making assumptions. Its entirely possible a VM might 
be unreachable or unable to access it storage for a short while, a new instance 
of the VM is started elsewhere, and the original VM comes back.

On Wed, Jan 17, 2018 at 9:02 AM Nux!  wrote:

> Hi Rohit,
>
> I've reinstalled and tested. Still no go with VM HA.
>
> What I did was to kernel panic that particular HV ("echo c > 
> /proc/sysrq-trigger" <- this is a proper way to simulate a crash).
> What happened next is the HV got marked as "Alert", the VM on it was 
> all the time marked as "Running" and it was not migrated to another HV.
> Once the panicked HV has booted back the VM reboots and becomes available.
>
> I'm running on CentOS 7 mgmt + HVs and NFS primary and secondary storage.
> The VM has HA enabled service offering.
> Host HA or OOBM configuration was not touched.
>
> Full log http://tmp.nux.ro/W3s-management-server.log
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> - Original Message -
> > From: "Rohit Yadav" 
> > To: "dev" 
> > Sent: Wednesday, 17 January, 2018 12:13:33
> > Subject: Re: HA issues
>
> > I performed VM HA sanity checks and was not able to reproduce any
> regression
> > against two KVM CentOS7 hosts in a cluster.
> >
> >
> > Without the "Host HA" feature, I deployed few HA-enabled VMs on a 
> > KVM
> host2 and
> > killed it (powered off). After few minutes of CloudStack attempting 
> > to
> find why
> > the host (kvm agent) timed out, CloudStack kicked investigators, 
> > that eventually led KVM fencers to work and VM HA job kicked to 
> > start those
> few VMs
> > on host1 and the KVM host2 was put to "Down" state.
> >
> >
> > - Rohit
> >
> > 
> >
> >
> >
> > 
> >
> > rohit.ya...@shapeblue.com
> > www.shapeblue.com
> > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
> >
> >
> >
> > From: Rohit Yadav
> > Sent: Wednesday, January 17, 2018 2:39:19 PM
> > To: dev
> > Subject: Re: HA issues
> >
> >
> > Hi Lucian,
> >
> >
> > The "Host HA" feature is entirely different from VM HA, however, they
> may work
> > in tandem, so please stop using the terms inte

RE: HA issues

2018-02-16 Thread Sean Lair
We've done a lot of work on VM HA (we are on 4.9.3) and have it working 
reliably.  We've also been able to stop the problem of VMs getting started on 
two hosts during some HA events.  Since this is 4.9.3, we do not use IPMI for 
this functionality.  We have not tested how the addition of IPMI in 4.11 affects 
our patch.

We are running KVM w/ NFS storage.  If you like, I can get you our patch for 
testing.



-Original Message-
From: Nux! [mailto:n...@li.nux.ro] 
Sent: Monday, January 22, 2018 8:15 AM
To: dev 
Subject: Re: HA issues

Hi,

Installed and reinstalled, VM HA just does not work for me.
In addition, if the HV going AWOL is hosting the systemvms, then they also do 
not get restarted despite available HVs online.
I've opened another ticket with logs:

https://issues.apache.org/jira/browse/CLOUDSTACK-10246

Happy to allow access to my rig if it helps.

I've disabled firewall and whatnot also left out other bits of network hardware 
just to keep it simpler, still no go.

--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro

- Original Message -
> From: "Paul Angus" 
> To: "dev" 
> Sent: Saturday, 20 January, 2018 08:40:01
> Subject: RE: HA issues

> No problem,
> 
> To be honest host-ha was developed *because* vm-ha was not reliable 
> under a number of conditions, including a host failure.
> 
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>  
> 
> 
> 
> -Original Message-
> From: Nux! [mailto:n...@li.nux.ro]
> Sent: 19 January 2018 14:26
> To: dev 
> Subject: Re: HA issues
> 
> Hi Paul,
> 
> Thanks for checking. My compute offering is HA enabled, of course.
> Host HA is disabled as well as OOBM.
> 
> 
> I'll do the tests again on Monday and report back.
> 
> --
> Sent from the Delta quadrant using Borg technology!
> 
> Nux!
> www.nux.ro
> 
> - Original Message -
>> From: "Paul Angus" 
>> To: "dev" 
>> Sent: Friday, 19 January, 2018 14:10:06
>> Subject: RE: HA issues
> 
>> Hey Nux,
>> 
>> I've being testing out the host-ha feature against a couple of physical 
>> hosts.
>> I've found that if the compute offering isn't ha enabled, then the vm isn't
>> restarted on the original host when it is rebooted, or any other host.If
>> the vm is ha-enabled, then the vm was restarted on the original host 
>> when host ha restarted the host.
>> 
>> Can you double check that the instance was an ha-enabled one?
>> 
>> OR
>> maybe the timeouts for the host-ha are too long and the vm-ha 
>> timed-out before hand ...?
>> 
>> 
>> 
>> Kind regards,
>> 
>> Paul Angus
>> 
>> paul.an...@shapeblue.com
>> www.shapeblue.com
>> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>>  
>> 
>> 
>> 
>> -Original Message-
>> From: Nux! [mailto:n...@li.nux.ro]
>> Sent: 17 January 2018 09:12
>> To: dev 
>> Subject: Re: HA issues
>> 
>> Right, sorry for using the terms interchangeably, I see what you mean.
>> 
>> I'll do further testing then as VM HA was also not working in my setup.
>> 
>> I'll be back.
>> 
>> --
>> Sent from the Delta quadrant using Borg technology!
>> 
>> Nux!
>> www.nux.ro
>> 
>> - Original Message -
>>> From: "Rohit Yadav" 
>>> To: "dev" 
>>> Sent: Wednesday, 17 January, 2018 09:09:19
>>> Subject: Re: HA issues
>> 
>>> Hi Lucian,
>>> 
>>> 
>>> The "Host HA" feature is entirely different from VM HA, however, 
>>> they may work in tandem, so please stop using the terms 
>>> interchangeably as it may cause the community to believe a regression has 
>>> been caused.
>>> 
>>> 
>>> The "Host HA" feature currently ships with only "Host HA" provider 
>>> for KVM that is strictly tied to out-of-band management (IPMI for 
>>> fencing, i.e power off and recovery, i.e. reboot) and NFS (as primary 
>>> storage).
>>> (We also have a provider for simulator, but that's for 
>>> coverage/testing purposes).
>>> 
>>> 
>>> Therefore, "Host HA" for KVM (+nfs) currently works only when OOBM is 
>>> enabled.
>>> The frameowkr allows interested parties may write their own HA 
>>> providers for a hypervisor that can use a different 
>>> strategy/mechanism for fencing/recovery of hosts (including write a 
>>> non-IPMI based OOBM
>>> plugin) and host/disk activity checker that is non-NFS based.
>>> 
>>> 
>>> The "Host HA" feature ships disabled by default and does not cause 
>>> any interference with VM HA. However, when enabled and configured 
>>> correctly, it is a known limitation that when it is unable to 
>>> successfully perform recovery or fencing tasks it may not trigger VM 
>>> HA. We can discuss how to handle such cases (thoughts?). "Host HA"
>>> would try couple of times to recover and failing to do so, it would 
>>> eventually trigger a host fencing task. If it's unable to fence a 
>>> host, it will indefinitely attempt to fence the host (the host state 
>>> will be stuck at fencing state in cloud.ha_config table for example) 
>>> and alerts will be sent to admin who can do some man

RE: System VMs not migrating when host down

2018-02-15 Thread Sean Lair
Thanks for the replies everyone. 

After further investigating, I am seeing how broken VM HA is right now (at 
least in 4.9.3).

We've started patching the code so it works again, but once we fixed it - we 
hit the dreaded VMs running on 2 different hosts... not good!

We are KVM w/ NFS.  It looks like the standard CloudStack documentation doesn't 
mention using the built-in locking mechanism in libvirtd.  It looks like an easy 
solution: if we are locking the VM's disk files, a VM shouldn't be able to 
come up on another host...

I've seen some of the talk about IPMI being used for Host HA in 4.11... but we 
don't have IPMI setup yet.  The locking mechanisms in libvirtd seem like the 
best idea to us so far - but we are just starting to look into it and implement 
it.

https://libvirt.org/locking-lockd.html

It reminds us of how VMware vSphere does locking, which works great.
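
For anyone else considering this, here is a rough sketch of what enabling lockd 
involves on each KVM host (based on the libvirt docs linked above, not something 
CloudStack configures for you - double-check the exact steps against your distro):

# /etc/libvirt/qemu.conf - tell the QEMU driver to use the lockd plugin
lock_manager = "lockd"

# then enable the lock daemon and restart libvirtd
systemctl enable --now virtlockd
systemctl restart libvirtd

With that in place, libvirt takes a lease on each disk image when a guest starts, 
so a second host sharing the same NFS primary storage should refuse to start the 
same VM a second time.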

 

-Original Message-
From: Andrija Panic [mailto:andrija.pa...@gmail.com] 
Sent: Wednesday, February 14, 2018 3:22 AM
To: dev 
Subject: Re: System VMs not migrating when host down

Humble opinion (until Host HA is ready in 4.11, if I'm not mistaken?): avoid using 
the HA option for VMs - avoid setting the "Offer HA" option on any 
compute/service offerings, since we did end up (was it ACS 4.5 or 4.8, can't 
remember now) having 2 copies of the SAME VM running on 2 different hosts... imagine 
storage/volume corruption... this happened a few times for us.

Host HA looks like a really nice thing, I have not tested it yet... but it should 
completely solve the problem.

On 14 February 2018 at 10:14, Paul Angus  wrote:

> Hi Sean,
>
> The 'problem' with VM HA in KVM is that it relies on the parent host 
> agent to be connected to report that the VM is down.  We cannot assume 
> that just because a host agent is disconnected, the VMs on that 
> host are not running.
>
> This is where Host HA comes in: this feature detects loss of 
> connection to the agent, tries to determine if the VMs on that 
> host are active, and then attempts some corrective action.
>
>
> Kind regards,
>
> Paul Angus
>
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>
>
>
>
> -Original Message-
> From: Sean Lair [mailto:sl...@ippathways.com]
> Sent: 13 February 2018 23:06
> To: dev@cloudstack.apache.org
> Subject: System VMs not migrating when host down
>
> Hi all,
>
> We are testing VM HA and are having a problem with our system VMs
> (secondary storage and console) not being started up on another host when a
> host fails.
>
> Shouldn't the system VMs be VM HA-enabled?  Currently they are just in an
> "Alert" agent state, but never migrate.  We are currently running 4.9.3.
>
>
> Thanks
> Sean
>



-- 

Andrija Panić


System VMs not migrating when host down

2018-02-13 Thread Sean Lair
Hi all,

We are testing VM HA and are having a problem with our system VMs (secondary 
storage and console) not being started up on another host when a host fails.

Shouldn't the system VMs be VM HA-enabled?  Currently they are just in an 
"Alert" agent state, but never migrate.  We are currently running 4.9.3.


Thanks
Sean


RE: Found 2 bugs in ACS 4.10. Possibly exist in 4.11 (master)

2018-01-07 Thread Sean Lair
I use a wildcard cert on 4.9.2 and it's fine. We haven't gone to 4.10 yet to 
test.  We'll probably go straight to 4.11 when it's released.

We have also had the high-CPU issue on the mgmt servers in our 4.9.x deployments.  It 
is very frustrating, and it happens every few days.  Haven't been able to 
track down why yet.  In a different thread a while ago, Simon Weller also 
reported high CPU issues, but I'm not sure if he ever found the culprit either.



-Original Message-
From: Ivan Kudryavtsev [mailto:kudryavtsev...@bw-sw.com] 
Sent: Saturday, January 6, 2018 12:28 AM
To: dev 
Subject: Found 2 bugs in ACS 4.10. Possibly exist in 4.11 (master)

Hello, colleagues.

During the last few days I found 2 bugs which I believe are critical for the 4.11 
release. I would like to share them here and get help if possible:

1. CPVM bug. I use a wildcard certificate issued by the Comodo CA. I uploaded it to 
CS via the UI and destroyed the CPVM to force it to use it. It uses it like a charm, 
but after some amount of time it loses it and a console proxy connection is no longer 
possible. After it's rebooted or recreated, everything works well. 
I'm not familiar with the CPVM at all and cannot even imagine what could be wrong 
here.

1a. CPVM has debug enabled and logs include tons of messages like:

2018-01-06 06:13:57,069 DEBUG
[cloud.consoleproxy.ConsoleProxyAjaxImageHandler] (Thread-4159:null) 
AjaxImageHandler
/ajaximg?token=RcHSrvzegyrjZAlc1Wjifcwv9P8WwK3eH63SuIS8WFFGssxymmjdYkZ4-S4ilY1UHxX612Lt_5Xi1Z5JaoCfDSf_UCi8lTIsPEBlDpUEWQg1IblYu0HxvoDugX9J4XgAdpj74qg_U4pOs74dzdZFB50PB_HxcMhzUqd5plH914PmRDw5k0ONaa183CsGa7DcGVvWaR_eYP_8_CArahGAjHt04Kx227tjyMx4Zaju7iNyxpBWxtBC5YJyj8rjv7IeA_0Pevz91pWn6OE1pkeLwGeFSV8pZw4BWg95SG97A-I&key=2020&ts=1515219237015
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null) decode 
token. host: 10.252.2.10
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null) decode 
token. port: 5903
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null) decode 
token. tag: 375c62b5-74d9-4494-8b79-0d7c76cff10f

Every opened session is dumped to the logs. I suppose it's dangerous and could lead 
to filesystem overuse and CPVM failure.

/dev/vda10  368M   63M  287M   19% /var/log

Might it be that (1) is a consequence of (1a)?

2. High CPU utilization bug. After the management server is launched it uses ~0 CPU 
because I run a development cloud. After two days I see that 2 cores are 50% used 
by the management server processes; several days ago I saw the management server 
processes utilize almost all available CPU. Surprisingly, it continues to function 
(API, UI), with no active API utilization in the logs.

The only two suspicious things I found for the last incident are:

root@cs2-head1:/var/log/cloudstack/management# zgrep ERROR 
management-server.log.2018-01-04.gz
2018-01-04 12:58:23,391 ERROR [c.c.c.ClusterManagerImpl]
(localhost-startStop-1:null) (logid:) Unable to ping management server at
10.252.2.2:9090 due to ConnectException
2018-01-04 12:58:25,743 ERROR [c.c.u.PropertiesUtil]
(localhost-startStop-1:null) (logid:) Unable to find properties file:
commands.properties
2018-01-04 14:36:23,874 ERROR [c.c.u.PropertiesUtil]
(localhost-startStop-1:null) (logid:) Unable to find properties file:
commands.properties
2018-01-04 14:43:23,043 ERROR [c.c.v.VmWorkJobHandlerProxy]
(Work-Job-Executor-5:ctx-e566f561 job-38158/job-38188 ctx-b1887051)
(logid:be4b64e0) Invocation exception, caused by:
com.cloud.exception.InsufficientServerCapacityException: Unable to create a 
deployment for VM[SecondaryStorageVm|s-24-VM]Scope=interface
com.cloud.dc.DataCenter; id=1
2018-01-04 14:43:23,043 ERROR [c.c.v.VmWorkJobHandlerProxy]
(Work-Job-Executor-4:ctx-faf69614 job-38155/job-38185 ctx-83290fa8)
(logid:65010252) Invocation exception, caused by:
com.cloud.exception.InsufficientServerCapacityException: Unable to create a 
deployment for VM[ConsoleProxy|v-10-VM]Scope=interface
com.cloud.dc.DataCenter; id=1
2018-01-04 14:43:23,044 ERROR [c.c.v.VmWorkJobDispatcher]
(Work-Job-Executor-5:ctx-e566f561 job-38158/job-38188) (logid:be4b64e0) Unable 
to complete AsyncJobVO {id:38188, userId: 1, accountId: 1,
instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkStart,
cmdInfo:
rO0ABXNyABhjb20uY2xvdWQudm0uVm1Xb3JrU3RhcnR9cMGsvxz73gIAC0oABGRjSWRMAAZhdm9pZHN0ADBMY29tL2Nsb3VkL2RlcGxveS9EZXBsb3ltZW50UGxhbm5lciRFeGNsdWRlTGlzdDtMAAljbHVzdGVySWR0ABBMamF2YS9sYW5nL0xvbmc7TAAGaG9zdElkcQB-AAJMAAtqb3VybmFsTmFtZXQAEkxqYXZhL2xhbmcvU3RyaW5nO0wAEXBoeXNpY2FsTmV0d29ya0lkcQB-AAJMAAdwbGFubmVycQB-AANMAAVwb2RJZHEAfgACTAAGcG9vbElkcQB-AAJMAAlyYXdQYXJhbXN0AA9MamF2YS91dGlsL01hcDtMAA1yZXNlcnZhdGlvbklkcQB-AAN4cgATY29tLmNsb3VkLnZtLlZtV29ya5-ZtlbwJWdrAgAESgAJYWNjb3VudElkSgAGdXNlcklkSgAEdm1JZEwAC2hhbmRsZXJOYW1lcQB-AAN4cAABAAEAGHQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAHBwcHBwcHBwcHA,
cmdVersion: 0, status: IN_PROGRESS

RE: [DISCUSS] CloudStack 4.9.3.0 (LTS)

2017-07-24 Thread Sean Lair
Hi Rohit

I previously suggested these for 4.9.3.0:

https://github.com/apache/cloudstack/pull/2041 (VR related jobs scheduled and 
run twice on mgmt servers)
https://github.com/apache/cloudstack/pull/2040 (Bug in monitoring of S2S VPNs - 
also exists in 4.10)
https://github.com/apache/cloudstack/pull/1966 (IPSEC VPNs do not work after 
vRouter reboot)

I'd also like to suggest these:
https://github.com/apache/cloudstack/pull/1246 (unable to use reserved IP range 
in a network)
https://github.com/apache/cloudstack/pull/2201 (VPC VR doesn't respond to DNS 
requests from remote access vpn clients)


Thanks
Sean

-Original Message-
From: Rohit Yadav [mailto:rohit.ya...@shapeblue.com] 
Sent: Monday, July 24, 2017 5:56 AM
To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
Subject: Re: [DISCUSS] CloudStack 4.9.3.0 (LTS)

All,


We'll accept bugfixes on 4.9 branch till end of next week, following which I'll 
start release work towards 4.9.3.0 (LTS) release. Please help review 
outstanding PRs, share PRs that we should consider and advise/suggest issues 
that need to be reverted/backported, for example see: 
https://github.com/apache/cloudstack/pull/2052


Thank you for your support and co-operation.


- Rohit


From: Rohit Yadav 
Sent: Sunday, July 23, 2017 1:26:48 PM
To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
Subject: Re: [DISCUSS] CloudStack 4.9.3.0 (LTS)

All,


I've started looking into reviewing/testing/merging of the PRs targeting 4.9+, 
I'll share some plans around 4.9.3.0 soon. Meanwhile, help in reporting any 
major/critical bugs and PRs we should consider reviewing/testing/merging. 
Thanks.


- Rohit

rohit.ya...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue




rohit.ya...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
  
 



RE: [DISCUSS] CloudStack 4.9.3.0 (LTS)

2017-07-10 Thread Sean Lair
Here are three issues we ran into in 4.9.2.0.  We have been running all of 
these fixes for several months without issues.  The code changes are all very 
easy/small, but had a big impact for us.

I'd respectfully suggest they go into 4.9.3.0:

https://github.com/apache/cloudstack/pull/2041 (VR related jobs scheduled and 
run twice on mgmt servers)
https://github.com/apache/cloudstack/pull/2040 (Bug in monitoring of S2S VPNs - 
also exists in 4.10)
https://github.com/apache/cloudstack/pull/1966 (IPSEC VPNs do not work after 
vRouter reboot)

Thanks
Sean

-Original Message-
From: Rohit Yadav [mailto:rohit.ya...@shapeblue.com] 
Sent: Friday, July 7, 2017 1:14 AM
To: dev@cloudstack.apache.org
Cc: us...@cloudstack.apache.org
Subject: [DISCUSS] CloudStack 4.9.3.0 (LTS)

All,


With 4.10.0.0 voted, I would like to start some initial discussion around the 
next minor LTS release 4.9.3.0. At the moment I don't have a timeline, plans or 
dates to share, but I would like to engage with the community to gather a list of 
issues, commits, and PRs that we should consider for the next LTS release 4.9.3.0.


To reduce our test and QA scope, we don't want to consider changes that are new 
features or enhancements, but strictly blocker/critical/major bugfixes and 
security-related fixes, and we can consider reverting any already 
committed/merged PR(s) on the 4.9 branch (committed since 4.9.2.0).


Please go through the list of commits since 4.9.2.0 (you can also run: git log 
4.9.2.0..4.9) and let us know if there is any change we should consider 
reverting:

https://github.com/apache/cloudstack/commits/4.9


I started backporting some 
fixes on the 4.9 branch, please go through the following PR and raise 
objections on changes/commits that we should not backport or revert:

https://github.com/apache/cloudstack/pull/2052


Lastly, please also share any PRs that we should consider reviewing+merging on 
4.9 branch for the 4.9.3.0 release effort.


- Rohit

rohit.ya...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
  
 



RE: How are router checks scheduled?

2017-04-12 Thread Sean Lair
The change to "/opt/cloud/bin/checkbatchs2svpn.sh" fixes the issues where no 
all of the VPN checks are returned.  I'll create and issue and PR

Sean


-Original Message-----
From: Sean Lair 
Sent: Tuesday, April 11, 2017 2:33 PM
To: dev@cloudstack.apache.org
Subject: RE: How are router checks scheduled?

Found and fixed at least one issue (4.9.2.0), had to update this file: 
"/server/src/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java"

Because "VpcVirtualNetworkApplianceManagerImpl" extends 
"VirtualNetworkApplianceManagerImpl"

When VpcVirtualNetworkApplianceManagerImpl was created it re-ran 
"VirtualNetworkApplianceManagerImpl.Start".  That rescheduled all of the 
various health and stats checks so everything was now running twice...  Added 
this to the above file:

    // Prevent the parent class's start() from rescheduling the periodic
    // health/stats tasks a second time when this subclass is initialized.
    @Override
    public boolean start() {
        return true;
    }

    @Override
    public boolean stop() {
        return true;
    }


Now when we double-check our work by running this command:

cat /var/log/cloudstack/management/management-server.log | grep "routers to 
update status"

We only see that job (for example) kicking off once every 30 seconds instead of 
twice every 30 seconds.  Not sure if this solved the CPU issue yet.  The above 
code coincidentally is already in master as part of PR #866.

The issue with all the VPN alerts was exacerbated by this bug, but it looks like 
it wasn't the root cause.  We have another fix in place for 
"/opt/cloud/bin/checkbatchs2svpn.sh".  When a tenant has a lot of S2S VPN 
connections, not all of the statuses are returned when the S2S VPN checks 
occur.  It seems the SSHExecutor doesn't get the entire output of the script if 
there is any delay during execution.  The Check S2S VPN code assumes 
"disconnected" if an S2S status isn't included in the response (or in our case, 
occasionally the response is cut off and missing an S2S VPN).  Here is an 
example:

2017-04-11 17:05:40,444 DEBUG [c.c.h.x.r.CitrixResourceBase] 
(DirectAgent-190:ctx-e894af45) (logid:cbbccfaa) Executing command in VR: 
/opt/cloud/bin/router_proxy.sh checkbatchs2svpn.sh 169.254.2.130 67.41.109.167 
65.100.18.183 67.41.109.165 67.41.109.166

2017-04-11 17:05:41,836 DEBUG [c.c.a.t.Request] (DirectAgent-190:ctx-e894af45) 
(logid:cbbccfaa) Seq 51-772085861117329631: Processing:  { Ans: , MgmtId: 
345050927939, via: 51(cloudxen01.dsm1.ippathways.net), Ver: v1, Flags: 110, 
[{"com.cloud.agent.api.CheckS2SVpnConnectionsAnswer":{"ipToConnected":{"65.100.18.183":true,"67.41.109.167":true,"67.41.109.165":true},"ipToDetail":{"65.100.18.183":"ISAKMP
 SA found;IPsec SA found;Site-to-site VPN have 
connected","67.41.109.167":"ISAKMP SA found;IPsec SA found;Site-to-site VPN 
have connected","67.41.109.165":"ISAKMP SA found;IPsec SA found;Site-to-site 
VPN have connected"},"details":"67.41.109.167:0:ISAKMP SA found;IPsec SA 
found;Site-to-site VPN have connected&65.100.18.183:0:ISAKMP SA found;IPsec SA 
found;Site-to-site VPN have connected&67.41.109.165:0:ISAKMP SA found;IPsec SA 
found;Site-to-site VPN have connected&","result":true,"wait":0}}] }

A check was requested for 4x S2S VPNs, but the result only returned 3x S2S VPN 
statuses!!  To fix this we changed "/opt/cloud/bin/checkbatchs2svpn.sh" on the 
vRouter as follows.  So far so good, but we won't know until we run for a while 
longer if that was definitely the issue...

ORIGINALLY:
---
for i in $*
do
info=`/opt/cloud/bin/checks2svpn.sh $i`
ret=$?
echo -n "$i:$ret:$info&"
done

NEW:
---
# accumulate all results and emit them with a single write, so a slow
# check can't cause the management server to read a partial response
for i in $*
do
info=`/opt/cloud/bin/checks2svpn.sh $i`
ret=$?
batchInfo+="$i:$ret:$info&"
done
echo -n $batchInfo


Hopefully that makes sense and helps someone else.  PR #1966 has also been very 
important in our environment.



-Original Message-
From: Simon Weller [mailto:swel...@ena.com] 
Sent: Monday, April 10, 2017 5:26 PM
To: dev@cloudstack.apache.org
Subject: Re: How are router checks scheduled?

We've seen something very similar. By any chance, are you seeing any strange 
cpu load issues that grow over time as well?

Our team has been chasing down an issue that appears to be related to s2s vpn 
checks, where a race condition seems to occur that threads out the cpu over 
time.




From: Sean Lair 
Sent: Monday, April 10, 2017 5:11 PM
To: dev@cloudstack.apache.org
Subject: RE: How are router checks scheduled?

I do have two mgmt servers, but I have one powered off.  The log excerpt is 
from one management server.  This can be checked in the environment by running:

cat /var/log/cloudstack/management/management-server.log | grep "routers to 
update status"

This is h

RE: How are router checks scheduled?

2017-04-11 Thread Sean Lair
Found and fixed at least one issue (4.9.2.0), had to update this file: 
"/server/src/com/cloud/network/router/VpcVirtualNetworkApplianceManagerImpl.java"

Because "VpcVirtualNetworkApplianceManagerImpl" extends 
"VirtualNetworkApplianceManagerImpl"

When VpcVirtualNetworkApplianceManagerImpl was created it re-ran 
"VirtualNetworkApplianceManagerImpl.Start".  That rescheduled all of the 
various health and stats checks so everything was now running twice...  Added 
this to the above file:

    // Prevent the parent class's start() from rescheduling the periodic
    // health/stats tasks a second time when this subclass is initialized.
    @Override
    public boolean start() {
        return true;
    }

    @Override
    public boolean stop() {
        return true;
    }


Now when we double-check our work by running this command:

cat /var/log/cloudstack/management/management-server.log | grep "routers to 
update status"

We only see that job (for example) kicking off once every 30 seconds instead of 
twice every 30 seconds.  Not sure if this solved the CPU issue yet.  The above 
code coincidentally is already in master as part of PR #866.

The issue with all the VPN alerts was exacerbated by this bug, but it looks like 
it wasn't the root cause.  We have another fix in place for 
"/opt/cloud/bin/checkbatchs2svpn.sh".  When a tenant has a lot of S2S VPN 
connections, not all of the statuses are returned when the S2S VPN checks 
occur.  It seems the SSHExecutor doesn't get the entire output of the script if 
there is any delay during execution.  The Check S2S VPN code assumes 
"disconnected" if an S2S status isn't included in the response (or in our case, 
occasionally the response is cut off and missing an S2S VPN).  Here is an 
example:

2017-04-11 17:05:40,444 DEBUG [c.c.h.x.r.CitrixResourceBase] 
(DirectAgent-190:ctx-e894af45) (logid:cbbccfaa) Executing command in VR: 
/opt/cloud/bin/router_proxy.sh checkbatchs2svpn.sh 169.254.2.130 67.41.109.167 
65.100.18.183 67.41.109.165 67.41.109.166

2017-04-11 17:05:41,836 DEBUG [c.c.a.t.Request] (DirectAgent-190:ctx-e894af45) 
(logid:cbbccfaa) Seq 51-772085861117329631: Processing:  { Ans: , MgmtId: 
345050927939, via: 51(cloudxen01.dsm1.ippathways.net), Ver: v1, Flags: 110, 
[{"com.cloud.agent.api.CheckS2SVpnConnectionsAnswer":{"ipToConnected":{"65.100.18.183":true,"67.41.109.167":true,"67.41.109.165":true},"ipToDetail":{"65.100.18.183":"ISAKMP
 SA found;IPsec SA found;Site-to-site VPN have 
connected","67.41.109.167":"ISAKMP SA found;IPsec SA found;Site-to-site VPN 
have connected","67.41.109.165":"ISAKMP SA found;IPsec SA found;Site-to-site 
VPN have connected"},"details":"67.41.109.167:0:ISAKMP SA found;IPsec SA 
found;Site-to-site VPN have connected&65.100.18.183:0:ISAKMP SA found;IPsec SA 
found;Site-to-site VPN have connected&67.41.109.165:0:ISAKMP SA found;IPsec SA 
found;Site-to-site VPN have connected&","result":true,"wait":0}}] }

A check was requested for 4x S2S VPNs, but the result only returned 3x S2S VPN 
statuses!!  To fix this we changed "/opt/cloud/bin/checkbatchs2svpn.sh" on the 
vRouter as follows.  So far so good, but we won't know until we run for a while 
longer if that was definitely the issue...

ORIGINALLY:
---
for i in $*
do
info=`/opt/cloud/bin/checks2svpn.sh $i`
ret=$?
echo -n "$i:$ret:$info&"
done

NEW:
---
# accumulate all results and emit them with a single write, so a slow
# check can't cause the management server to read a partial response
for i in $*
do
info=`/opt/cloud/bin/checks2svpn.sh $i`
ret=$?
batchInfo+="$i:$ret:$info&"
done
echo -n $batchInfo


Hopefully that makes sense and helps someone else.  PR #1966 has also been very 
important in our environment.



-Original Message-
From: Simon Weller [mailto:swel...@ena.com] 
Sent: Monday, April 10, 2017 5:26 PM
To: dev@cloudstack.apache.org
Subject: Re: How are router checks scheduled?

We've seen something very similar. By any chance, are you seeing any strange 
cpu load issues that grow over time as well?

Our team has been chasing down an issue that appears to be related to s2s vpn 
checks, where a race condition seems to occur that threads out the cpu over 
time.




From: Sean Lair 
Sent: Monday, April 10, 2017 5:11 PM
To: dev@cloudstack.apache.org
Subject: RE: How are router checks scheduled?

I do have two mgmt servers, but I have one powered off.  The log excerpt is 
from one management server.  This can be checked in the environment by running:

cat /var/log/cloudstack/management/management-server.log | grep "routers to 
update status"

This is happening both in prod and our dev environment.  I've been digging 
through the code and have some ideas and will post back later if successful in 
correcting the issue.

The biggest problem is the race condition between the two simultaneous S2S VPN 
checks.  They step on each other and spam the heck out of us with th

RE: How are router checks scheduled?

2017-04-10 Thread Sean Lair
Yep! Exactly, we have that issue too.  I am testing a possible fix right now, 
I'll let you know how it goes!


-Original Message-
From: Simon Weller [mailto:swel...@ena.com] 
Sent: Monday, April 10, 2017 5:26 PM
To: dev@cloudstack.apache.org
Subject: Re: How are router checks scheduled?

We've seen something very similar. By any chance, are you seeing any strange 
cpu load issues that grow over time?

Our team has been chasing down an issue that appears to be related to s2s vpn 
checks, where a race condition seems to occur that threads out the cpu over 
time.



____
From: Sean Lair 
Sent: Monday, April 10, 2017 5:11 PM
To: dev@cloudstack.apache.org
Subject: RE: How are router checks scheduled?

I do have two mgmt servers, but I have one powered off.  The log excerpt is 
from one management server.  This can be checked in the environment by running:

cat /var/log/cloudstack/management/management-server.log | grep "routers to 
update status"

This is happening both in prod and our dev environment.  I've been digging 
through the code and have some ideas and will post back later if successful in 
correcting the issue.

The biggest problem is the race condition between the two simultaneous S2S VPN 
checks.  They step on each other and spam the heck out of us with the email 
alerting.



-Original Message-
From: Simon Weller [mailto:swel...@ena.com]
Sent: Monday, April 10, 2017 5:02 PM
To: dev@cloudstack.apache.org
Subject: RE: How are router checks scheduled?

Do you have 2 management servers?

Simon Weller/615-312-6068

-Original Message-
From: Sean Lair [sl...@ippathways.com]
Received: Monday, 10 Apr 2017, 2:54PM
To: dev@cloudstack.apache.org [dev@cloudstack.apache.org]
Subject: How are router checks scheduled?

According to my management server logs, some of the periodic checks are getting 
kicked off twice at the same time.  The CheckRouterTask is kicked off every 
30 seconds, but each time it runs, it runs twice at the same second...  See 
logs below for example:

2017-04-10 21:48:12,879 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-5f7bc584) (logid:4d5b1031) Found 10 routers to 
update status.
2017-04-10 21:48:12,932 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-d027ab6f) (logid:1bc50629) Found 10 routers to 
update status.
2017-04-10 21:48:42,877 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-2c8f4d18) (logid:e9111785) Found 10 routers to 
update status.
2017-04-10 21:48:42,927 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-1bfd5351) (logid:ad0f95ef) Found 10 routers to 
update status.
2017-04-10 21:49:12,874 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-ede0d2bb) (logid:6f244423) Found 10 routers to 
update status.
2017-04-10 21:49:12,928 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-d58842d5) (logid:8442d73c) Found 10 routers to 
update status.

How is this scheduled/kicked off?  I am debugging some site-to-site VPN alert 
problems, and they seem to be related to a race condition due to the 
"CheckRouterTask" be kicked off two at a time.

Thanks
Sean





RE: How are router checks scheduled?

2017-04-10 Thread Sean Lair
I do have two mgmt servers, but I have one powered off.  The log excerpt is 
from one management server.  This can be checked in the environment by running:

cat /var/log/cloudstack/management/management-server.log | grep "routers to 
update status"

This is happening both in prod and our dev environment.  I've been digging 
through the code and have some ideas and will post back later if successful in 
correcting the issue.  

The biggest problem is the race condition between the two simultaneous S2S VPN 
checks.  They step on each other and spam the heck out of us with the email 
alerting.



-Original Message-
From: Simon Weller [mailto:swel...@ena.com] 
Sent: Monday, April 10, 2017 5:02 PM
To: dev@cloudstack.apache.org
Subject: RE: How are router checks scheduled?

Do you have 2 management servers?

Simon Weller/615-312-6068

-Original Message-----
From: Sean Lair [sl...@ippathways.com]
Received: Monday, 10 Apr 2017, 2:54PM
To: dev@cloudstack.apache.org [dev@cloudstack.apache.org]
Subject: How are router checks scheduled?

According to my management server logs, some of the periodic checks are getting 
kicked off twice at the same time.  The CheckRouterTask is kicked off every 
30 seconds, but each time it runs, it runs twice at the same second...  See 
logs below for example:

2017-04-10 21:48:12,879 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-5f7bc584) (logid:4d5b1031) Found 10 routers to 
update status.
2017-04-10 21:48:12,932 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-d027ab6f) (logid:1bc50629) Found 10 routers to 
update status.
2017-04-10 21:48:42,877 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-2c8f4d18) (logid:e9111785) Found 10 routers to 
update status.
2017-04-10 21:48:42,927 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-1bfd5351) (logid:ad0f95ef) Found 10 routers to 
update status.
2017-04-10 21:49:12,874 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-ede0d2bb) (logid:6f244423) Found 10 routers to 
update status.
2017-04-10 21:49:12,928 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-d58842d5) (logid:8442d73c) Found 10 routers to 
update status.

How is this scheduled/kicked off?  I am debugging some site-to-site VPN alert 
problems, and they seem to be related to a race condition due to the 
"CheckRouterTask" be kicked off two at a time.

Thanks
Sean





How are router checks scheduled?

2017-04-10 Thread Sean Lair
According to my management server logs, some of the periodic checks are getting 
kicked off twice at the same time.  The CheckRouterTask is kicked off every 
30 seconds, but each time it runs, it runs twice at the same second...  See 
logs below for example:

2017-04-10 21:48:12,879 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-5f7bc584) (logid:4d5b1031) Found 10 routers to 
update status.
2017-04-10 21:48:12,932 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-d027ab6f) (logid:1bc50629) Found 10 routers to 
update status.
2017-04-10 21:48:42,877 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-2c8f4d18) (logid:e9111785) Found 10 routers to 
update status.
2017-04-10 21:48:42,927 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-1bfd5351) (logid:ad0f95ef) Found 10 routers to 
update status.
2017-04-10 21:49:12,874 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-ede0d2bb) (logid:6f244423) Found 10 routers to 
update status.
2017-04-10 21:49:12,928 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] 
(RouterStatusMonitor-1:ctx-d58842d5) (logid:8442d73c) Found 10 routers to 
update status.

How is this scheduled/kicked off?  I am debugging some site-to-site VPN alert 
problems, and they seem to be related to a race condition due to the 
"CheckRouterTask" be kicked off two at a time.

Thanks
Sean





Re: [GitHub] cloudstack issue #1966: CLOUDSTACK-9801: IPSec VPN does not work after vRout...

2017-02-23 Thread Sean Lair
It is open against the 4.9 branch.

We are running 4.9.2.0, looks like it affects all 4.9.x.x

We haven't tested against 4.10 (strongswan) yet.  But it could be a problem and 
will be worth testing.  If strongswan starts before CloudStack adds the NICs to 
the VM, it could have the same issue.
 

> On Feb 23, 2017, at 7:51 PM, swill  wrote:
> 
> Github user swill commented on the issue:
> 
>https://github.com/apache/cloudstack/pull/1966
> 
>I can't see what branch this is opened against on my phone. What version 
> of ACS is this opened against, and which version do you have a problem in? The 
> reason I ask is because #1741 was added in master to upgrade from openswan to 
> strongswan. Thx... 
> 
> 
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
> with INFRA.
> ---


RE: VPN/IPSEC problem after upgrading to 4.9.2.0

2017-02-23 Thread Sean Lair
Looks like this bug was introduced by Pull Request #1423

https://github.com/apache/cloudstack/pull/1423

It added code to start ipsec 
(cloudstack/systemvm/patches/debian/config/opt/cloud/bin/configure.py)

if vpnconfig['create']:
    logging.debug("Enabling  remote access vpn  on "+ public_ip)
    CsHelper.start_if_stopped("ipsec")
    self.configure_l2tpIpsec(public_ip, self.dbag[public_ip])


The issue is that if a reboot is issued from the CloudStack UI (as opposed to 
manually by logging into the vRouter), the NICs (except eth0) are not added to 
the VM until the cloud service is running.

Since ipsec is started before the NICs are added to the VM and before the 
public IP address is added to the NIC, ipsec is not listening on the public IP 
address and all VPNs are broken.

This is not a problem with the Site2Site VPN section of configure.py, because 
that section does not start ipsec if the public IP is not on the system yet...  


That is my synopsis at least.
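
A rough sketch of the kind of guard I have in mind (hypothetical, not the actual 
patch; public_ip_is_configured is a made-up helper, and the other names come from 
the configure.py snippet above):

import subprocess

def public_ip_is_configured(public_ip):
    # 'ip -o addr' lists the addresses currently configured on the router
    out = subprocess.check_output(["ip", "-o", "addr"]).decode()
    return public_ip in out

if vpnconfig['create']:
    logging.debug("Enabling remote access vpn on " + public_ip)
    # only start ipsec once the public IP is actually present on an interface
    if public_ip_is_configured(public_ip):
        CsHelper.start_if_stopped("ipsec")
        self.configure_l2tpIpsec(public_ip, self.dbag[public_ip])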

Thanks
Sean


-----Original Message-
From: Sean Lair 
Sent: Thursday, February 23, 2017 2:27 PM
To: dev@cloudstack.apache.org
Subject: VPN/IPSEC problem after upgrading to 4.9.2.0

We just upgraded from 4.8.1.1 to 4.9.2.0.  After upgrading we rebooted the 
virtual routers, and noticed that our site-to-site VPNs and remote-access VPNs 
would no longer connect.  After troubleshooting, we noticed that Openswan 
(ipsec.d) wasn't listening on the vRouter's IPs.  Here is the abbreviated 
output of "ipsec auto -status" while we were having the issue:

root@r-10-VM:~# ipsec auto --status
000 using kernel interface: netkey
000 interface lo/lo 127.0.0.1
000 interface lo/lo 127.0.0.1
000 interface eth0/eth0 169.254.1.45
000 interface eth0/eth0 169.254.1.45
000 %myid = (none)


Openswan only knows about the loopback and 169.254.1.45 addresses.  We 
rebooted the vRouter several times with the same results.  However, if we 
manually stopped and started ipsec, then issued an "ipsec auto --status", the 
abbreviated output would be:

root@r-10-VM:~# ipsec auto --status
000 using kernel interface: netkey
000 interface lo/lo 127.0.0.1
000 interface lo/lo 127.0.0.1
000 interface eth0/eth0 169.254.1.45
000 interface eth0/eth0 169.254.1.45
000 interface eth1/eth1 192.103.11.172
000 interface eth1/eth1 192.103.11.172
000 interface eth2/eth2 192.168.1.1
000 interface eth2/eth2 192.168.1.1
000 %myid = (none)

Openswan now knows about the additional interfaces and VPNs function as 
expected...  It's like ipsec.d is started before all of the interfaces are 
configured?  Is this a known bug, or am I off-base with my analysis somehow?

Thanks
Sean


VPN/IPSEC problem after upgrading to 4.9.2.0

2017-02-23 Thread Sean Lair
We just upgraded from 4.8.1.1 to 4.9.2.0.  After upgrading we rebooted the 
virtual routers, and noticed that our site-to-site VPNs and remote-access VPNs 
would no longer connect.  After troubleshooting, we noticed that Openswan 
(ipsec.d) wasn't listening on the vRouter's IPs.  Here is the abbreviated 
output of "ipsec auto -status" while we were having the issue:

root@r-10-VM:~# ipsec auto --status
000 using kernel interface: netkey
000 interface lo/lo 127.0.0.1
000 interface lo/lo 127.0.0.1
000 interface eth0/eth0 169.254.1.45
000 interface eth0/eth0 169.254.1.45
000 %myid = (none)


Openswan only knows about the loopback and 169.254.1.45 addresses.  We 
rebooted the vRouter several times with the same results.  However, if we 
manually stopped and started ipsec, then issued an "ipsec auto --status", the 
abbreviated output would be:

root@r-10-VM:~# ipsec auto --status
000 using kernel interface: netkey
000 interface lo/lo 127.0.0.1
000 interface lo/lo 127.0.0.1
000 interface eth0/eth0 169.254.1.45
000 interface eth0/eth0 169.254.1.45
000 interface eth1/eth1 192.103.11.172
000 interface eth1/eth1 192.103.11.172
000 interface eth2/eth2 192.168.1.1
000 interface eth2/eth2 192.168.1.1
000 %myid = (none)

Openswan now knows about the additional interfaces and VPNs function as 
expected...  It's like ipsec.d is started before all of the interfaces are 
configured?  Is this a known bug, or am I off-base with my analysis somehow?

Thanks
Sean


RE: [VOTE] Apache Cloudstack 4.9.0 RC1

2016-07-12 Thread Sean Lair
Hi all, I vote -1 and would like to see the jdbc:mysql and site-to-site VPN issues 
fixed in 4.9.

https://github.com/apache/cloudstack/pull/1610
https://github.com/apache/cloudstack/pull/1480

Thanks!
Sean

-Original Message-
From: Wido den Hollander [mailto:w...@widodh.nl] 
Sent: Tuesday, July 12, 2016 1:48 AM
To: Sean Lair ; dev@cloudstack.apache.org
Subject: RE: [VOTE] Apache Cloudstack 4.9.0 RC1


> On 11 July 2016 at 22:40, Sean Lair wrote:
> 
> 
> Hi all,
> 
> One small comment since strongSwan didn't make it into 4.9.  There is still a 
> very simple bug in enabling PFS for site-to-site VPNs.  The code checks the 
> Dead Peer Detection (DPD) variable instead of the PFS variable when 
> determining whether or not to enable PFS for the site-to-site VPN.
> 
> Here is the one line of code that is broken.  You can see how it refers to dpd 
> to set pfs.
> 
> file.addeq(" pfs=%s" % CsHelper.bool_to_yn(obj['dpd']))
> 
> This pull request fixes the issue, but was not merged since we were going to 
> strongSwan.  It would be nice if this bug fix was put into 4.9.0
> 
> https://github.com/apache/cloudstack/pull/1480
> 

Would it make it a -1 for you without this PR? If so, please vote -1 :)

Wido

> 
> Thanks!
> Sean
> 
> -Original Message-
> From: Will Stevens [mailto:williamstev...@gmail.com] 
> Sent: Wednesday, July 6, 2016 3:52 PM
> To: dev@cloudstack.apache.org
> Subject: [VOTE] Apache Cloudstack 4.9.0 RC1
> 
> Hi All,
> 
> I've created a 4.9.0 release, with the following artifacts up for a vote:
> 
> Git Branch and Commit SH:
> https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=shortlog;h=refs/heads/4.9.0-RC20160706T1546
> Commit: 643f75aa9150156b1fb05f339a338614fc7ad3fb
> 
> I will be updating the Release Notes with the changes in this release 
> tomorrow.  If the RC changes, I can adapt the release notes after.
> 
> Source release (checksums and signatures are available at the same
> location):
> https://dist.apache.org/repos/dist/dev/cloudstack/4.9.0/
> 
> PGP release keys (signed using CB818F64):
> https://dist.apache.org/repos/dist/release/cloudstack/KEYS
> 
> Vote will be open for 72 hours.
> 
> For sanity in tallying the vote, can PMC members please be sure to indicate 
> "(binding)" with their vote?
> 
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
> 
> Thanks,
> 
> Will


RE: [VOTE] Apache Cloudstack 4.9.0 RC1

2016-07-11 Thread Sean Lair
Hi all,

One small comment since strongSwan didn't make it into 4.9.  There is still a 
very simple bug in enabling PFS for site-to-site VPNs.  The code checks the 
Dead Peer Detection (DPD) variable instead of the PFS variable when determining 
whether or not to enable PFS for the site-to-site VPN.

Here is the one line of code that is broken.  You can see how it refers to dpd to 
set pfs.

file.addeq(" pfs=%s" % CsHelper.bool_to_yn(obj['dpd']))

This pull request fixes the issue, but was not merged since we were going to 
strongSwan.  It would be nice if this bug fix was put into 4.9.0

https://github.com/apache/cloudstack/pull/1480


Thanks!
Sean

-Original Message-
From: Will Stevens [mailto:williamstev...@gmail.com] 
Sent: Wednesday, July 6, 2016 3:52 PM
To: dev@cloudstack.apache.org
Subject: [VOTE] Apache Cloudstack 4.9.0 RC1

Hi All,

I've created a 4.9.0 release, with the following artifacts up for a vote:

Git Branch and Commit SH:
https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;a=shortlog;h=refs/heads/4.9.0-RC20160706T1546
Commit: 643f75aa9150156b1fb05f339a338614fc7ad3fb

I will be updating the Release Notes with the changes in this release tomorrow. 
 If the RC changes, I can adapt the release notes after.

Source release (checksums and signatures are available at the same
location):
https://dist.apache.org/repos/dist/dev/cloudstack/4.9.0/

PGP release keys (signed using CB818F64):
https://dist.apache.org/repos/dist/release/cloudstack/KEYS

Vote will be open for 72 hours.

For sanity in tallying the vote, can PMC members please be sure to indicate 
"(binding)" with their vote?

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Thanks,

Will