----- Original Message -----
> From: "Nicolas Ecarnot" <nico...@ecarnot.net>
> To: "users@ovirt.org" <Users@ovirt.org>
> Sent: Wednesday, August 5, 2015 5:32:38 PM
> Subject: [ovirt-users] ovirt+gluster+NFS : storage hicups
>
> Hi,
>
> I used the two links below to set up a test DC:
>
> http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
> http://community.redhat.com/blog/2014/11/up-and-running-with-ovirt-3-5-part-two/
>
> The only thing I did differently is that I did not use a hosted engine;
> I dedicated a solid server for that.
> So I have one engine (CentOS 6.6) and 3 hosts (CentOS 7.0).
>
> As in the docs above, my 3 hosts publish 300 GB of replicated gluster
> storage, on top of which ctdb manages a floating virtual IP that NFS
> uses for the master storage domain.
>
> The last point is that the manager also presents an NFS storage that I
> use as an export domain.
>
> It took me some time to wire all this up, as it is a bit more
> complicated than my other DC with a real SAN and no gluster, but it is
> eventually working (I can run VMs, migrate them...).
>
> I have run many harsh tests (from a very dumb user's point of view:
> unplug/replug the power cable of a server - does ctdb float the vIP?
> does gluster self-heal? does the VM restart?...).
> When looking closely at each layer one by one, all seems to be correct:
> ctdb is fast at managing the IP, NFS is OK, gluster seems to
> reconstruct, fencing eventually worked with the lanplus workaround, and
> so on...
>
> But from time to time a severe hiccup appears which I have great
> difficulty diagnosing.
> The messages in the web GUI are not very precise, and not consistent:
> - some tell about some host having network issues, but I can ping it
> from every place it needs to be reached (especially from the SPM and
> the manager)

Ping doesn't say much, as the ssh protocol is the one being used. Please
try ssh and report. Please attach logs (engine+vdsm). Log snippets would
be helpful (but more important are full logs).
In general it smells like an ssh/firewall issue.
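A rough sketch of what I mean, assuming a default oVirt install (vdsm on
port 54321, standard log locations) - adjust hostnames and paths to your
setup:

  # from the engine, towards each host
  ssh root@serv-vm-al01 'echo ok'        # engine uses ssh for host-deploy and soft fencing
  timeout 3 bash -c ': < /dev/tcp/serv-vm-al01/54321' \
      && echo "vdsm port 54321 reachable"   # port the engine uses to talk to vdsm

  # full logs to attach
  #   engine : /var/log/ovirt-engine/engine.log
  #   hosts  : /var/log/vdsm/vdsm.log (and /var/log/vdsm/supervdsm.log)

If ssh or port 54321 is filtered between the engine and a host, that alone
could explain the "Network error during communication with the Host" events
you describe.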
> "On host serv-vm-al01, Error: Network error during communication with
> the Host"
>
> - some tell that some volume is degraded, when it's not (the gluster
> commands show no issue; even the oVirt tabs about the volumes are all
> green)
>
> - "Host serv-vm-al03 cannot access the Storage Domain(s) <UNKNOWN>
> attached to the Data Center"
> Just waiting a couple of seconds leads to a self-heal with no action on
> my part.
>
> - Repeated "Detected change in status of brick
> serv-vm-al03:/gluster/data/brick of volume data from DOWN to UP."
> but absolutely no action is made on this filesystem.
>
> At this time, no VM is running in this test datacenter, and no action
> is being made on the hosts. Still, I see some looping errors coming and
> going, and I find no way to diagnose them.
>
> Amongst the *actions* I have tried to solve some of these issues:
> - I've found that trying to force the self-healing, and playing with
> gluster commands, had no effect
> - I've found that playing with the gluster-advised action "find /gluster
> -exec stat {} \; ..." seems to have no effect either
> - I've found that forcing ctdb to move the vIP ("ctdb stop", "ctdb
> continue") DID SOLVE most of these issues.
> I believe it's not what ctdb is doing that helps, but maybe one of its
> shell hooks is cleaning up some trouble?
>
> As this setup is complex, I'm not asking anyone for a silver bullet, but
> maybe you know which layer is the most fragile, and which one I should
> look at more closely?
>
> --
> Nicolas ECARNOT
> _______________________________________________
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users