On 01/29/2014 02:35 PM, Nicolas Ecarnot wrote:
On 29/01/2014 13:29, Maor Lipchuk wrote:
Hi Nicolas,

Can you please attach the VDSM logs of the problematic and valid
nodes, the engine log, and also the sanlock log.

You wrote that many nodes suddenly began to become
unresponsive.
Do you mean that the hosts switched to non-responsive status in the
engine?
I'm asking because a non-responsive status indicates that the engine
could not communicate with the hosts. It could be related to sanlock,
since if a host encounters a problem writing to the master domain,
sanlock restarts VDSM and makes the host non-responsive.

Non-responsive for the engine reflects whether VDSM is up/responsive.
Run locally:
# vdsClient -s 0 getVdsCaps

to check that VDSM is OK.
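
As a minimal sketch (assuming the stock EL6 service name and the default
sanlock log path), VDSM and sanlock can both be checked directly on a
host with something like:

# service vdsmd status
# vdsClient -s 0 getVdsCaps
# sanlock client status
# tail -n 50 /var/log/sanlock.log

If getVdsCaps answers locally while the engine still shows the host as
non-responsive, the problem is more likely between the engine and the
host than in VDSM itself.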


regards,
Maor

It will be hard work to provide these logs, but I will try as soon as possible.
But to answer your question: the engine saw the failing nodes as
unresponsive, yet I was always fully able to ping them and log in to them over SSH.

Is there somewhere I could read further documentation about sanlock?

Nicolas Ecarnot


On 01/27/2014 09:26 AM, Nicolas Ecarnot wrote:
On 26/01/2014 23:23, Itamar Heim wrote:
On 01/20/2014 12:06 PM, Nicolas Ecarnot wrote:
Hi,

oVirt 3.3, no big issues since the recent snapshot joke, but all in all
it is running fine.

All my VMs are stored in an iSCSI SAN. The VMs usually use only one
or two disks (1: system, 2: data) and it is OK.

On Friday, I created a new LUN. Inside a VM, I connected to it via iscsiadm
and successfully logged in to the LUN (session, automatic attach on boot,
read, write): nice.
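
For reference, the in-guest login followed the usual iscsiadm sequence,
roughly as sketched below; the portal address and IQN here are
placeholders, not the real ones:

# iscsiadm -m discovery -t sendtargets -p 192.168.0.10
# iscsiadm -m node -T iqn.2014-01.example.com:newlun -p 192.168.0.10 --login
# iscsiadm -m node -T iqn.2014-01.example.com:newlun -p 192.168.0.10 --op update -n node.startup -v automatic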

Then, after detaching it and shutting down the VM, and for the first
time, I tried to make use of the "direct attach" feature to attach the disk
directly from oVirt, logging in the session via oVirt.
The connection went fine and I saw the disk appear in my VM as /dev/sda or
whatever. I was able to mount it, read and write.
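
The in-guest check amounts to something like the following (device name
and mount point are examples only; this assumes a filesystem sits
directly on the device, with no partition table):

# lsblk
# mount /dev/sda /mnt/directlun
# touch /mnt/directlun/write-test && rm /mnt/directlun/write-test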

Then disaster struck: many nodes suddenly became
unresponsive, quickly migrating their VMs to the remaining nodes.
Fortunately, the migrations ran fine and I lost no VM and had no
downtime, but I had to reboot every affected node (other actions failed).

On the failing nodes, /var/log/messages showed the log you can read at
the end of this message.
I first get device-mapper warnings, then the host becomes unable to
work with the logical volumes.

The three volumes are the three main storage domains where I store my
oVirt VMs, perfectly up and running.
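
When a node ends up in that state, a few basic checks from its shell
show whether the storage domain VGs and multipath devices are still
visible; this is only a sketch, using the domain UUID that appears in
the log at the end of this message:

# multipath -ll
# pvs
# vgs | grep 1429ffe2-4137-416c-bb38-63fd73f4bcc1
# lvs 1429ffe2-4137-416c-bb38-63fd73f4bcc1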

My thoughts:
- I'm not sure device-mapper is to blame. I frequently see device-mapper
complaining and nothing gets worse (not oVirt specifically).
- I have not changed my network settings for months (bonding,
linking...). The only new factor is the use of the direct attach LUN.
- This morning I was able to reproduce the bug, just by trying this
attachment again and booting the VM. No mounting of the LUN, just VM
booting and waiting, and this is enough to crash oVirt.
- When the disaster happens, usually, amongst the nodes, only three
nodes get struck, the only ones that run VMs. Obviously, after
migration, different nodes are hosting the VMs, and those new nodes
are the ones that then get struck.

This is quite reproducible.

And frightening.


The log:

Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36: multipath: error getting device
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error adding target to table
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: table: 253:36: multipath: error getting device
Jan 20 10:20:45 serv-vm-adm11 kernel: device-mapper: ioctl: error adding target to table
Jan 20 10:20:47 serv-vm-adm11 vdsm TaskManager.Task ERROR Task=`847653e6-8b23-4429-ab25-257538b35293`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize
    volUUID, bs=1))
  File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize
    mysd = sdCache.produce(sdUUID=sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__
    lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
  File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes
    raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)
VolumeGroupDoesNotExist: Volume Group does not exist: ('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
Jan 20 10:20:47 serv-vm-adm11 <11>vdsm vm.Vm ERROR vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume 80bac371-6899-4fbe-a8e1-272037186bfb (domain: 1429ffe2-4137-416c-bb38-63fd73f4bcc1 image: a5995c25-cdc9-4499-b9b4-08394a38165c) for the drive vda
Jan 20 10:20:48 serv-vm-adm11 vdsm TaskManager.Task ERROR Task=`886e07bd-637b-4286-8a44-08dce5c8b207`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize
    volUUID, bs=1))
  File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize
    mysd = sdCache.produce(sdUUID=sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__
    lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
  File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes
    raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)
VolumeGroupDoesNotExist: Volume Group does not exist: ('vg_uuid: 1429ffe2-4137-416c-bb38-63fd73f4bcc1',)
Jan 20 10:20:48 serv-vm-adm11 <11>vdsm vm.Vm ERROR vmId=`2c0bbb51-0f94-4bf1-9579-4e897260f88e`::Unable to update the volume ea9c8f12-4eb6-42de-b6d6-6296555d0ac0 (domain: 1429ffe2-4137-416c-bb38-63fd73f4bcc1 image: f42e0c9d-ad1b-4337-b82c-92914153ff44) for the drive vdb
Jan 20 10:21:03 serv-vm-adm11 vdsm TaskManager.Task ERROR Task=`27bb14f9-0cd1-4316-95b0-736d162d5681`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3053, in getVolumeSize
    volUUID, bs=1))
  File "/usr/share/vdsm/storage/volume.py", line 333, in getVSize
    mysd = sdCache.produce(sdUUID=sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1288, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 414, in __init__
    lvm.checkVGBlockSizes(sdUUID, (self.logBlkSize, self.phyBlkSize))
  File "/usr/share/vdsm/storage/lvm.py", line 976, in checkVGBlockSizes
    raise se.VolumeGroupDoesNotExist("vg_uuid: %s" % vgUUID)
VolumeGroupDoesNotExist: Volume Group does not exist: ('vg_uuid: 83d39199-d4e4-474c-b232-7088c76a2811',)




Was this diagnosed/resolved?

- Diagnosed: I discovered no deeper way to diagnose this issue.
- Resolved: I neither found nor received any further way to solve it.




