On 11/30/2017 6:05 PM, Nematollah Bidokhti wrote:
Hi,
Our [Fault-Genes WG] has been working on defining the fault
classifications for key OpenStack projects in an effort to support
OpenStack fault management & self-healing.
We have been using unsupervised machine learning to examine all bugs
and issues submitted by the community, and it has been very challenging
to have the machine define the classification entirely on its own.
We have decided to switch to a supervised data set. To do this, we
need to put together our training data.
We need your help to generate the training data set. *Basically, we only
need 2 or 3 unique fault classifications with a short description and
the associated mitigations _from each member who is familiar with
OpenStack design & operation_. This way we can build a focused library
of faults & mitigations for each project.*
Once this data is accumulated, we will develop our own specific
algorithms that can be applied to all future OpenStack issues.
Thanks in advance for your support.
| *No.* | *Project* | *Fault Classification* | *Description* | *Root Cause* | *Mitigation* |
| *1* | | | | | |
| *2* | | | | | |
| *3* | | | | | |
Below are examples of what a couple of Neutron developers have
provided. I am sure there are other types of fault classifications in
Neutron that have not been captured in this table.
| *Fault Classification* | *Root Cause* | *Mitigation* |
| Network Connectivity Issues | Virtual interface in the VM admin down | Un-shut the virtual interface |
| | Virtual interface does not have IP address via DHCP | Depends on lower level root cause |
| | Virtual network does not have interface to the router | Add virtual network as one of the router interfaces |
| | vNIC port of VM not active (stuck in build) | Depends on lower level root cause |
| | Security group lock in traffic | Fix the security group to allow relevant traffic |
| Unable to Add Port to Bridge | Libvirtd in AppArmor is blocking | Allow Libvirtd profile in AppArmor |
| No Valid Host Found/insufficient hypervisor resources | Compute nodes do not have sufficient resources | Free up required compute storage and memory resources on compute node |
| No Resource | Configuration issues | Change config setting |
| Authentication/permissions error | Configuration error such as port # or password | Make sure end points are properly configured |
| Gateway access not reachable | | Use custom keep-alive health-check |
| | Design issue of OpenStack Network node | Out of band health checking mechanism |
| Security Group Mis-configuration | The security group | Change security rules/Programming the security group |
| DNS Attack | | Implement CERT alerts updates |
| Network design issue | Network storm | Reduce L2 broadcast domain |
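
To illustrate how entries like the Neutron examples above might be used as
supervised training data, here is a minimal sketch in Python. It is not the
working group's algorithm: the naive Bayes approach, the function names
(tokenize, train, classify) and the choice of root-cause text as the input
feature are all assumptions made for illustration. Only the example rows
themselves come from the table.

```python
import math
import re
from collections import Counter, defaultdict

# Labeled rows copied from the Neutron examples: (root cause text, fault classification).
# Using the root-cause text as the classifier input is an assumption of this sketch.
TRAINING_ROWS = [
    ("Virtual interface in the VM admin down", "Network Connectivity Issues"),
    ("Virtual interface does not have IP address via DHCP", "Network Connectivity Issues"),
    ("Virtual network does not have interface to the router", "Network Connectivity Issues"),
    ("Compute nodes do not have sufficient resources",
     "No Valid Host Found/insufficient hypervisor resources"),
    ("Configuration error such as port # or Password", "Authentication/permissions error"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(rows):
    """Count tokens per label; returns (word_counts, label_counts, vocabulary)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in rows:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the most likely label under multinomial naive Bayes with add-one smoothing."""
    word_counts, label_counts, vocabulary = model
    total_rows = sum(label_counts.values())
    # Tokens never seen in training carry no information, so they are skipped.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label, n_rows in label_counts.items():
        score = math.log(n_rows / total_rows)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_ROWS)
print(classify("compute node has insufficient resources", model))
```

With only a handful of rows per project the result is very rough, which is
why a larger set of classifications contributed by each project would matter.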
Nemat
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
I'm not entirely sure how you classify some of this stuff.
For example, here is a nova/neutron bug in triage:
https://bugs.launchpad.net/nova/+bug/1730637
In this case, the user tries to attach a port to an instance and it
fails with a port binding failure.
From the nova side, we have no idea if this is a user error or a
problem in the networking backend. Therefore I wouldn't know how to
classify this, or describe the root cause or how to mitigate it.
--
Thanks,
Matt