Re: [Proposal] - StorageHA

2017-03-14 Thread Koushik Das
Hi Jeromy,

Thanks for the proposal. It would be good if you could create an FS in cwiki
for this. I saw your comment about force stopping VMs affected by a primary
storage outage. If this can be done without any issues, then it can be
leveraged to improve the current behavior of CloudStack as well. Currently,
in case of a primary storage outage, all hosts attached to it are rebooted
after some timeout (for XS and KVM). Refer to the heartbeat scripts
(scripts/vm/hypervisor/xenserver/xenheartbeat.sh and
scripts/vm/hypervisor/kvm/kvmheartbeat.sh).
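
Very roughly, the pattern in those scripts is the following (a simplified
Python sketch of the general idea only; the real files are shell scripts and
differ in detail, and the path and timeouts here are placeholders):

    import os
    import time

    HB_FILE = "/mnt/primary/hb-host1"  # placeholder heartbeat file on the pool
    TIMEOUT = 180                      # seconds of failed writes before giving up
    INTERVAL = 10                      # seconds between write attempts

    last_ok = time.time()
    while True:
        try:
            with open(HB_FILE, "w") as f:
                f.write(str(int(time.time())))
                f.flush()
                os.fsync(f.fileno())   # push the write through to the storage
            last_ok = time.time()
        except OSError:
            if time.time() - last_ok > TIMEOUT:
                # hard reset the host so HA can safely restart its VMs elsewhere
                os.system("echo b > /proc/sysrq-trigger")
                break
        time.sleep(INTERVAL)

Force stopping only the affected VMs, as you propose, would be much less
drastic than rebooting the whole host like this.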

"3.  You need to be very sure of failures before shutting hosts down.  Also a 
host is likely to be connected to multiple storage pools, so you wouldn't want 
to shut down a host due to one pool becoming unavailable."

JG:  The script wouldn’t shut down any hosts at all.  Just force stop the 
affected VMs on that specific host and then start them on a host that is not 
having the issue with storage.

Thanks,
Koushik



Re: [Proposal] - StorageHA

2017-03-14 Thread Rafael Weingärtner
Jeromy, I have already experienced a similar problem. A host had a
connectivity problem with a storage pool; ACS did not know about it (I am
not sure it checks this right now), and it was kind of odd. Some VMs could
not be started because their last host id pointed to the host with the
connectivity problem.

I believe it would be a great thing to improve these checks and to treat
this type of problem in ACS.

My two cents:
Change the name of what you are proposing. It is not high availability per
se; it is more like a storage health check/heartbeat thing. It may give
people very high expectations (I am not trying to undermine the work that
is required). For instance, the first time I read your email, I thought
about real HA, with some sort of block replication and redundancy in the
storage system. That is, I was expecting that if a whole storage goes down,
ACS would have a way of solving the problem without losing data and without
interrupting VMs. What you want to do is a little different: if a host has a
connectivity problem, we simply add it to the avoid list and then
migrate/move the VMs that were running there to a healthy host.

About the idea of this being for KVM only: even though we might not be able
to run scripts on some hypervisors (Dom0 not being exposed), we could use
their API and check if the storage is responding. For instance, on XenServer
you could call "xe sr-scan uuid=" (or maybe another command that touches
storage); if there is any communication problem, it will report an error.
The same type of feature will probably be available on all other
hypervisors.
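
As a rough sketch of what I mean (Python; the SR uuid is a placeholder you
would have to look up, and any command that exercises the storage would do):

    import subprocess

    def xenserver_sr_healthy(sr_uuid: str) -> bool:
        # "xe sr-scan" makes XenServer touch the SR; a non-zero exit code
        # (or a hang past the timeout) suggests a storage problem
        try:
            result = subprocess.run(["xe", "sr-scan", "uuid=" + sr_uuid],
                                    capture_output=True, timeout=30)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0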

RE: [Proposal] - StorageHA

2017-03-14 Thread Jeromy Grimmett
If all networking is lost, then obviously there is a bigger problem there.
The monitor is designed to do a true read/write test to the storage and
report back a pass/fail.  Through this discussion, there was a ping
suggestion, which I think we will include; a rough sketch of that check is
below.
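
Something minimal like this is what I have in mind for the ping piece
(Python sketch; assumes the storage target answers ICMP, and "storage_ip"
is a placeholder):

    import subprocess

    def storage_pingable(storage_ip: str) -> bool:
        # -c 3: three probes; -W 2: per-reply timeout in seconds (Linux ping)
        result = subprocess.run(["ping", "-c", "3", "-W", "2", storage_ip],
                                capture_output=True)
        return result.returncode == 0

A ping pass alone would not count as healthy; it would just let us separate
"network down" from "storage broken" before the read/write test runs.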

The way this came about was that one of our hosts had a problem with a
single primary storage, but all other hosts were 100% good across that
storage and all others.  That troubled host was having problems with just
the single storage device, but according to CloudStack all was well.  The
way I am looking at this is that, based on our experience, what we are
attempting is a much better and far more accurate test of storage
availability than what CloudStack currently does.

Make more sense?

j

Jeromy Grimmett
P: 603.766.3625
jer...@cloudbrix.com
www.cloudbrix.com



Re: [Proposal] - StorageHA

2017-03-14 Thread Simon Weller
So a few questions come to mind here.


So if all networking is lost, how are you going to be able to reliably
fence the VMs on the hosts?

Are you assuming you still have out of band IPMI connectivity?

If you're running bonded interfaces to different switches, what scenario would 
occur where the host loses network connectivity?


- Si



Re: [Proposal] - StorageHA

2017-03-14 Thread Tutkowski, Mike
Thanks for your clarification. I see now. You were referring to a networking 
problem where one host could not see the storage (but the storage was still up 
and running).


RE: [Proposal] - StorageHA

2017-03-13 Thread Jeromy Grimmett
I apologize for the delay in responding; let me clarify the points requested:

Mike asked:

"What I was curious about is if you plan to exclusively build your feature as a 
set of scripts and/or if you plan to update the CloudStack code base, as well."

JG:  My idea was to do this separately as a plugin, then add it to the code 
base down the road.

"Also, if a primary storage actually goes offline, I'm not clear on how 
starting an impacted VM on a different compute host would help. Could you 
clarify this for me?"

JG:  The VM would be started on another host that still has access to the
storage.  An individual host can have problems and lose its connectivity to
a primary storage device.  The solution we are working on would help to get
the VM back up and running much faster than waiting for CloudStack to decide
to restart the VM on a different host.

Paul asked:

"  1.  We can't/don't run scripts on vSphere hosts (not sure about Hyper-V)"

JG:  I should have been clearer; this is for KVM hosts.
  
"2.  I know of one failure scenario (which happened) where MTU issues in 
intermediate switches meant that small amounts of data could pass, but anything 
that was passed as jumbo frames then failed. So it would be important to 
exercise that."

JG:  I have faced this jumbo frame issue as well. Perhaps we need an option
indicating that jumbo frames are used to access that storage, so the test
would deliberately exercise jumbo-sized frames and the result would reflect
a failure to access the storage using them. A sketch of such a check is
below.
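
Something along these lines (Python sketch; assumes a 9000-byte MTU path to
the storage, so 8972 bytes of ICMP payload after the 20-byte IP and 8-byte
ICMP headers; flags are Linux iputils ping):

    import subprocess

    def jumbo_path_ok(storage_ip: str, payload: int = 8972) -> bool:
        # -M do: set Don't Fragment, so the probe fails at any hop whose
        # MTU cannot carry a jumbo frame end to end
        result = subprocess.run(
            ["ping", "-c", "3", "-M", "do", "-s", str(payload), storage_ip],
            capture_output=True)
        return result.returncode == 0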

"3.  You need to be very sure of failures before shutting hosts down.  Also a 
host is likely to be connected to multiple storage pools, so you wouldn't want 
to shut down a host due to one pool becoming unavailable."

JG:  The script wouldn't shut down any hosts at all.  It would just force
stop the affected VMs on that specific host and then start them on a host
that is not having the storage issue.

"4.  Environments can have hundreds of storage pools, so watch out for spamming 
the logs with updates."

JG:  The polling/testing time increments are configurable, so I am hoping
that can help with that.  The result records are pretty small, so the
logging overhead should be relatively negligible.

"5.  The primary storage pools have a 'state' which should get updated and used 
by the deployment planners"

JG:  I have copied Alex on this email to make sure he sees this suggestion.  We 
will figure out how to incorporate that 'state' field.

"6.  Secondary storage pools don't have a 'state' - but it would be great if 
that were added in the DB and reflected in the UI."

JG:  For now, I think this might be a feature request that we should submit
through the normal CloudStack request process.  Otherwise, we can definitely
include it in our work when we start to add this into the code base.

To take this a step further, we are also working on a KVM host load balancer 
that will be used as a factor when moving the VMs.  We have a number of little 
projects we are working on.

Thank you all for reviewing the information.  All suggestions are welcome.

Jeromy Grimmett
P: 603.766.3625
jer...@cloudbrix.com
www.cloudbrix.com



RE: [Proposal] - StorageHA

2017-03-10 Thread Paul Angus
Hi Jeromy,

I love the idea. I'm not really a developer, so those guys will look at
things a different way, but...

These would be my initial comments:


  1.  We can't/don't run scripts on vSphere hosts (not sure about Hyper-V)
  2.  I know of one failure scenario (which happened) where MTU issues in 
intermediate switches meant that small amounts of data could pass, but anything 
that was passed as jumbo frames then failed. So it would be important to 
exercise that.
  3.  You need to be very sure of failures before shutting hosts down.  Also a 
host is likely to be connected to multiple storage pools, so you wouldn't want 
to shut down a host due to one pool becoming unavailable.
  4.  Environments can have hundreds of storage pools, so watch out for 
spamming the logs with updates.
  5.  The primary storage pools have a 'state' which should get updated and 
used by the deployment planners.
  6.  Secondary storage pools don't have a 'state' - but it would be great if 
that were added in the DB and reflected in the UI.



Kind regards,

Paul Angus


paul.an...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS, UK
@shapeblue

Re: [Proposal] - StorageHA

2017-03-10 Thread Tutkowski, Mike
Hi,

Thanks for sending out this email and welcome to the CloudStack Community. :)

I have a couple quick questions:

First of all, let me start with something I found in our docs:

Primary Storage Outage and Data Loss

When a primary storage outage occurs the hypervisor immediately stops all VMs 
stored on that storage device. Guests that are marked for HA will be restarted 
as soon as practical when the primary storage comes back on line. With NFS, the 
hypervisor may allow the virtual machines to continue running depending on the 
nature of the issue. For example, an NFS hang will cause the guest VMs to be 
suspended until storage connectivity is restored. Primary storage is not 
designed to be backed up. Individual volumes in primary storage can be backed 
up using snapshots.

What I was curious about is if you plan to exclusively build your feature as a 
set of scripts and/or if you plan to update the CloudStack code base, as well.

Also, if a primary storage actually goes offline, I'm not clear on how starting 
an impacted VM on a different compute host would help. Could you clarify this 
for me?

Thanks!
Mike

On Mar 10, 2017, at 8:29 AM, Jeromy Grimmett <jer...@cloudbrix.com> wrote:

Hello,

I am new to the mailing list, and we are glad to be a part of the CloudStack
community.  We are looking to develop plugins and modules that will help grow
and expand the adoption and use of CloudStack.  So as part of my introductory
email, I'd like to introduce a little project we have been working on: a
StorageHA Monitor.  The Monitor would allow CloudStack and the hosts to test,
communicate, and resolve VM availability issues when storage (primary and/or
secondary) availability problems become apparent.  This is a small write-up
of how it would work:

It consists of two scripts/programs:

The host script runs on the host servers and checks whether the primary and
secondary storage is available by doing a read/write test, then reports to
the master script that runs on the CloudStack management server. The host
script will test a read and a write to the storage every 5 seconds
(configurable), and if it fails 3 times (configurable) the failure will be
recorded by the master script. A rough sketch of the host-side probe follows.
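
A minimal sketch of the host-side probe in Python (paths, names, and the
reporting calls are placeholders, not final code):

    import os
    import time
    import uuid

    MOUNT = "/mnt/primary"   # placeholder primary storage mount point
    INTERVAL = 5             # seconds between tests (configurable)
    MAX_FAILS = 3            # consecutive failures before reporting (configurable)

    def probe(mount: str) -> bool:
        # write a unique token, sync it out, read it back, compare
        token = uuid.uuid4().hex
        path = os.path.join(mount, ".storageha-probe")
        try:
            with open(path, "w") as f:
                f.write(token)
                f.flush()
                os.fsync(f.fileno())
            with open(path) as f:
                return f.read() == token
        except OSError:
            return False

    def report(status: str) -> None:
        print(status)  # placeholder; the real script reports to the master

    fails = 0
    while True:
        if probe(MOUNT):
            fails = 0
            report("PASS " + MOUNT)
        else:
            fails += 1
            if fails >= MAX_FAILS:
                report("FAIL " + MOUNT)  # the master decides what to do next
        time.sleep(INTERVAL)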

The master script will monitor the results of the host script. If the test
is good, nothing happens; the results are logged so that we can track the
history of the test results. If the test reports back as failed, then it
will perform the following actions:


  *   Secondary Storage - It will simply generate and send an alert that the
failure has occurred.

  *   Primary Storage - The script will perform the following tasks (see the
sketch after this list):
      *   Generate and send an alert that the failure has occurred.
      *   Force the VMs on that host to shut down.
      *   Determine which host to move the VMs to.
      *   Start the VMs on the healthy host.
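
For the primary storage case, the stop/start could drive the normal
CloudStack API; here is a hedged sketch via the cloudmonkey CLI (it assumes
cloudmonkey is configured against the management server; the VM id and
target host id are placeholders, and picking the target host - the
interesting part - is not shown):

    import subprocess

    def move_vm(vm_id: str, target_host_id: str) -> None:
        # force stop the VM on the host that lost its storage...
        subprocess.run(["cloudmonkey", "stop", "virtualmachine",
                        "id=" + vm_id, "forced=true"], check=True)
        # ...then start it on a host that can still reach the storage
        subprocess.run(["cloudmonkey", "start", "virtualmachine",
                        "id=" + vm_id, "hostid=" + target_host_id], check=True)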

We have already started working on some code, and the solution seems to be
testing well.  Any thoughts/ideas/input are welcome.  Should there be a
solution out there already, please forgive our ignorance and point us in the
right direction. We look forward to further collaboration with you all.

Regards,
j

Jeromy Grimmett
155 Fleet Street
Portsmouth, NH 03801
Direct: 603.766.3625
Office: 603.766.4908
Fax: 603.766.4729
jer...@cloudbrix.com
www.cloudbrix.com