[openstack-dev] [nova][service group]improve host state detection

2014-04-28 Thread Jiangying (Jenny)
Nova can currently detect that a host is unreachable, but it cannot distinguish 
between host isolation, a dead host, and a nova compute service that is down. 
When a host is reported as unreachable, users have to work out the exact state 
themselves and then take the appropriate measures to recover. Therefore we'd 
like to improve the host state detection for nova.

Currently the service group API factors out the host detection and makes it a 
set of abstract internal APIs with a pluggable backend implementation. The 
backend we designed is as follows:

A detection central agent is introduced. When a member joins into the service 
group, the member host starts to send network heartbeat to the central agent 
and writes timestamp in shared storage periodically. When the central agent 
stops receiving the network heartbeats from a member, it pings the member and 
checks the storage heartbeat before declaring the host to have failed.


network heartbeat | network ping | storage heartbeat | state               | reason
------------------|--------------|-------------------|---------------------|-----------------------------------------
OK                | -            | -                 | Running             | -
Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shut down
Not OK            | OK           | Not OK            | Service unreachable | service process crashed
Not OK            | Not OK       | OK                | Isolated            | network unreachable

Based on the state recognition table, nova can discern the exact host state and 
assign the reasons.
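
As a rough illustration, the state recognition table could be expressed as a small lookup. This is only a sketch: the names `HostState`, `Diagnosis` and `classify_host` are hypothetical and not part of nova, and the `UNKNOWN` fallback for combinations the table does not list (for example ping OK and storage heartbeat OK while the network heartbeat is missing) is an assumption.

```python
from enum import Enum
from typing import NamedTuple, Optional


class HostState(Enum):
    RUNNING = "Running"
    DEAD = "Dead"
    SERVICE_UNREACHABLE = "Service unreachable"
    ISOLATED = "Isolated"
    UNKNOWN = "Unknown"  # assumption: used for combinations the table omits


class Diagnosis(NamedTuple):
    state: HostState
    reason: Optional[str]


# Keyed by (network ping OK, storage heartbeat OK); only consulted when
# the network heartbeat is already missing.
_TABLE = {
    (False, False): Diagnosis(HostState.DEAD,
                              "hardware failure/abnormal host shut down"),
    (True, False): Diagnosis(HostState.SERVICE_UNREACHABLE,
                             "service process crashed"),
    (False, True): Diagnosis(HostState.ISOLATED, "network unreachable"),
}


def classify_host(heartbeat_ok: bool, ping_ok: bool,
                  storage_ok: bool) -> Diagnosis:
    """Map the three probe results to a state and reason per the table."""
    if heartbeat_ok:
        # The ping and storage checks are not needed while heartbeats arrive.
        return Diagnosis(HostState.RUNNING, None)
    return _TABLE.get((ping_ok, storage_ok),
                      Diagnosis(HostState.UNKNOWN, None))
```

For example, `classify_host(False, False, True)` yields the Isolated state with the reason "network unreachable".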

Thoughts?

Jenny

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][service group]improve host state detection

2014-04-28 Thread John Garbutt
On 28 April 2014 13:30, Jiangying (Jenny) wrote:
> Nova can currently detect that a host is unreachable, but it cannot
> distinguish between host isolation, a dead host, and a nova compute service
> that is down. When a host is reported as unreachable, users have to work out
> the exact state themselves and then take the appropriate measures to recover.
> Therefore we'd like to improve the host state detection for nova.
>
> Currently the service group API factors out the host detection and makes it
> a set of abstract internal APIs with a pluggable backend implementation. The
> backend we designed is as follows:
>
> A detection central agent is introduced. When a member joins into the
> service group, the member host starts to send network heartbeat to the
> central agent and writes timestamp in shared storage periodically. When the
> central agent stops receiving the network heartbeats from a member, it pings
> the member and checks the storage heartbeat before declaring the host to
> have failed.
>
> 
>
> network heartbeat | network ping | storage heartbeat | state               | reason
> ------------------|--------------|-------------------|---------------------|-----------------------------------------
> OK                | -            | -                 | Running             | -
> Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shut down
> Not OK            | OK           | Not OK            | Service unreachable | service process crashed
> Not OK            | Not OK       | OK                | Isolated            | network unreachable
>
> 
>
> Based on the state recognition table, nova can discern the exact host state
> and assign the reasons.
>
> Thoughts?

I don't think Nova should try to include functionality that
re-implements other good monitoring tools (Nagios, etc)

Having said that, having a new service group API that uses information
from external tools to decide if a host is dead or not, and describes
why, is maybe worth considering.

John



Re: [openstack-dev] [nova][service group]improve host state detection

2014-04-28 Thread Sylvain Bauza
2014-04-28 16:33 GMT+02:00 John Garbutt:

>
> I don't think Nova should try to include functionality that
> re-implements other good monitoring tools (Nagios, etc)
>
> Having said that, having a new service group API that uses information
> from external tools to decide if a host is dead or not, and describes
> why, is maybe worth considering.
>
>

Agree with John, a new backend could potentially help with this use case.
That said, there is already a ZooKeeper driver [1] for servicegroups that
could help.

My 2 cts,
-Sylvain

[1] https://github.com/openstack/nova/blob/master/nova/servicegroup/drivers/zk.py





Re: [openstack-dev] [nova][service group]improve host state detection

2014-04-28 Thread Jay Pipes
cc'ing Intel and Ericsson engineers who are interested in a similar
plan...

On Mon, 2014-04-28 at 15:33 +0100, John Garbutt wrote:
> On 28 April 2014 13:30, Jiangying (Jenny) wrote:
> > Nova can currently detect that a host is unreachable, but it cannot
> > distinguish between host isolation, a dead host, and a nova compute service
> > that is down. When a host is reported as unreachable, users have to work out
> > the exact state themselves and then take the appropriate measures to recover.
> > Therefore we'd like to improve the host state detection for nova.
> >
> > Currently the service group API factors out the host detection and makes it
> > a set of abstract internal APIs with a pluggable backend implementation. The
> > backend we designed is as follows:
> >
> > A detection central agent is introduced. When a member joins into the
> > service group, the member host starts to send network heartbeat to the
> > central agent and writes timestamp in shared storage periodically. When the
> > central agent stops receiving the network heartbeats from a member, it pings
> > the member and checks the storage heartbeat before declaring the host to
> > have failed.
> >
> > 
> >
> > network heartbeat | network ping | storage heartbeat | state               | reason
> > ------------------|--------------|-------------------|---------------------|-----------------------------------------
> > OK                | -            | -                 | Running             | -
> > Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shut down
> > Not OK            | OK           | Not OK            | Service unreachable | service process crashed
> > Not OK            | Not OK       | OK                | Isolated            | network unreachable
> >
> > 
> >
> > Based on the state recognition table, nova can discern the exact host state
> > and assign the reasons.
> >
> > Thoughts?
> 
> I don't think Nova should try to include functionality that
> re-implements other good monitoring tools (Nagios, etc)

Agreed.

> Having said that, having a new service group API that uses information
> from external tools to decide if a host is dead or not, and describes
> why, is maybe worth considering.

Also agreed.

FYI, related blueprint from Ericsson: 

https://review.openstack.org/#/c/87978/

I am -1 on the above blueprint not because I don't see the value in
having nic state play a part in service group management, but because I
don't see a reason to have the resource tracker (which manages resource
usage, not state) or scheduler implement agent state checks.

Best,
-jay




Re: [openstack-dev] [nova][service group]improve host state detection

2014-05-01 Thread Day, Phil
>Nova can currently detect that a host is unreachable, but it cannot 
>distinguish between host isolation, a dead host, and a nova compute service 
>that is down. When a host is reported as unreachable, users have to work out 
>the exact state themselves and then take the appropriate measures to recover. 
>Therefore we'd like to improve the host state detection for nova.

I guess this depends on the service group driver that you use. For example, if 
you use the DB driver, then there is a thread running on the compute manager 
that periodically updates the "alive" status, which includes both a liveness 
check of the compute manager (to the extent that the thread is still running) 
and a check that it can contact the DB. If the compute manager is using 
conductor, then it also implicitly includes a check that the compute manager 
can talk to MQ (a nice side effect of conductor, as before a node could be "Up" 
because it could talk to the DB while not being able to process any messages).

So to me the DB driver kind of already covers "send network heartbeat to the 
central agent and writes timestamp in shared storage periodically", so maybe 
this is more of a specific ServiceGroup driver issue rather than a generic 
ServiceGroup change?
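
To make the idea concrete, a timestamp-based liveness check of the kind described above might look roughly like this. It is a minimal sketch, not nova's actual DB driver code: the function name `is_up` and the 60-second `SERVICE_DOWN_TIME` threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed threshold: how long a service may stay silent before it is
# considered down. The value is illustrative only.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_up(last_heartbeat: Optional[datetime],
          now: Optional[datetime] = None,
          down_time: timedelta = SERVICE_DOWN_TIME) -> bool:
    """Return True if the last recorded heartbeat is recent enough."""
    if last_heartbeat is None:
        # The service has never reported in.
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_heartbeat <= down_time
```

A periodic thread on the compute side would refresh the stored timestamp, and any reader could then call `is_up` on it. A stale timestamp alone does not say why the updates stopped.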

Phil





Re: [openstack-dev] [nova][service group]improve host state detection

2014-06-19 Thread Steve Gordon
- Original Message -
> From: "Phil Day" 
> To: "OpenStack Development Mailing List (not for usage questions)" 
> 
> 
> >Nova can currently detect that a host is unreachable, but it cannot
> >distinguish between host isolation, a dead host, and a nova compute service
> >that is down. When a host is reported as unreachable, users have to work out
> >the exact state themselves and then take the appropriate measures to
> >recover. Therefore we'd like to improve the host state detection for nova.
> 
> I guess this depends on the service group driver that you use. For example,
> if you use the DB driver, then there is a thread running on the compute
> manager that periodically updates the "alive" status, which includes both a
> liveness check of the compute manager (to the extent that the thread is
> still running) and a check that it can contact the DB. If the compute
> manager is using conductor, then it also implicitly includes a check that
> the compute manager can talk to MQ (a nice side effect of conductor, as
> before a node could be "Up" because it could talk to the DB while not being
> able to process any messages).
> 
> So to me the DB driver kind of already covers "send network heartbeat to the
> central agent and writes timestamp in shared storage periodically", so
> maybe this is more of a specific ServiceGroup driver issue rather than a
> generic ServiceGroup change?
> 
> Phil

This refers to the case where the compute host is completely unreachable. 
Another interesting case, when passing through either an entire NIC or an 
SR-IOV VF, is what happens when the link of one of the ports allocated for 
passthrough is down. It may not be desirable to disable the host entirely in 
this case, but the user would certainly not expect instances whose flavors 
request a passed-through networking device to be scheduled onto a device whose 
link is down. I'd be interested in how a Nagios- or Zabbix-based solution to 
this would be architected.

Somewhat tangentially to the above, I believe Balazs has re-architected the 
spec he had proposed in this area:

https://review.openstack.org/#/c/87978/

Thanks,

Steve



-- 
Steve Gordon, RHCE
Product Manager, Red Hat Enterprise Linux OpenStack Platform
Red Hat Canada (Toronto, Ontario)
