Re: [Users] oVirt 3.3 gluster volume active but unable to activate domain

2013-10-04 Thread Itamar Heim

On 10/04/2013 12:54 AM, Gianluca Cecchi wrote:

In the messages log of the vdsm hosts:

Oct  3 23:05:57 f18ovn03 sanlock[1146]: 2013-10-03 23:05:57+0200 16624
[13543]: read_sectors delta_leader offset 512 rv -5
/rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
Oct  3 23:05:58 f18ovn03 sanlock[1146]: 2013-10-03 23:05:58+0200 16625
[1155]: s2142 add_lockspace fail result -5


Oct  3 23:04:24 f18ovn01 sanlock[1166]: 2013-10-03 23:04:24+0200 16154
[1172]: s2688 add_lockspace fail result -5
Oct  3 23:04:24 f18ovn01 vdsm TaskManager.Task ERROR
Task=`bd6b0848-4550-483e-9002-e3051a2e1074`::Unexpected error
Oct  3 23:04:25 f18ovn01 vdsm TaskManager.Task ERROR
Task=`f2ac595d-cc9d-4125-a7be-7b8706cc9ee3`::Unexpected error
Oct  3 23:04:25 f18ovn01 vdsm TaskManager.Task ERROR
Task=`6c03be35-57f1-405e-b127-bd708defad67`::Unexpected error
Oct  3 23:04:26 f18ovn01 sanlock[1166]: 2013-10-03 23:04:26+0200 16157
[21348]: read_sectors delta_leader offset 0 rv -5
/rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
Oct  3 23:04:27 f18ovn01 sanlock[1166]: 2013-10-03 23:04:27+0200 16158
[1172]: s2689 add_lockspace fail result -5

But the origin is possibly a split brain detected at the gluster level.
I see it in rhev-data-center-mnt-glusterSD-f18ovn01.mydomain:gvdata.log
this afternoon, around the time I installed the first guest and ran
a shutdown and a power on.

see:
https://docs.google.com/file/d/0BwoPbcrMv8mvNHNOVlNrOFFabjQ/edit?usp=sharing

Why are the gluster logs two hours behind? UTC? Is there any way to set
them to the current system time?

Gianluca



do you happen to know why it got into a split brain status?


Re: [Users] oVirt 3.3 gluster volume active but unable to activate domain

2013-10-04 Thread Gianluca Cecchi
On Fri, Oct 4, 2013 at 11:30 AM, Itamar Heim wrote:

 do you happen to know why it got into a split brain status?

I suppose it happened with the first migration attempt.

But when I reproduced it last night, I only got a heal on the ids file,
which gluster apparently resolved by itself.
Could the bug about IDE also be influencing the migration problem and
the possible gluster split brain?
What does the dom_md/ids file contain, and when does it get updated?
During the OS install and the yum update everything seemed OK, with no
heal on the disk file itself.
So I presume that a metadata update not being managed correctly could
somehow break gluster coherence on this ids file.
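
My understanding so far (to be confirmed by someone who knows sanlock
better) is that dom_md/ids is the sanlock lockspace of the storage
domain: every host holding a host id keeps a delta lease in it, at an
offset derived from its host id, and sanlock rewrites that lease
periodically, so the file is continuously written by all active hosts.
If I remember the sanlock CLI correctly, its records can be dumped with
something like:

sanlock direct dump /rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids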

Gianluca


[Users] oVirt 3.3 gluster volume active but unable to activate domain

2013-10-03 Thread Gianluca Cecchi
One engine with f19 and two nodes with f19.
All with the oVirt stable repo for f19.
The DC is defined as GlusterFS.
The volume is OK, but I can't activate the domain.
The relevant logs when I click Activate are below.

On engine:
2013-10-03 23:05:10,332 INFO
[org.ovirt.engine.core.vdsbroker.gluster.GlusterServicesListVDSCommand]
(pool-6-thread-50) START, GlusterServicesListVDSCommand(HostName =
f18ovn03, HostId = b67bcfd4-f868-49d5-8704-4936ee922249), log id:
5704c54f
2013-10-03 23:05:12,121 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(DefaultQuartzScheduler_Worker-83) hostFromVds::selectedVds -
f18ovn01, spmStatus Free, storage pool Gluster
2013-10-03 23:05:12,142 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(DefaultQuartzScheduler_Worker-83) SpmStatus on vds
80188ccc-83b2-4bc8-9385-8d07f7458a3c: Free
2013-10-03 23:05:12,144 INFO
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(DefaultQuartzScheduler_Worker-83) starting spm on vds f18ovn01,
storage pool Gluster, prevId 1, LVER 0
2013-10-03 23:05:12,148 INFO
[org.ovirt.engine.core.vdsbroker.gluster.GlusterServicesListVDSCommand]
(pool-6-thread-46) FINISH, GlusterServicesListVDSCommand, return:
[org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@955283ba,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@1ef87397,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@c1b996b6,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@30199726,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@606c4879,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@2b860d38,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@f69fd1f7],
log id: 4a1b4d33
2013-10-03 23:05:12,159 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-83) START, SpmStartVDSCommand(HostName
= f18ovn01, HostId = 80188ccc-83b2-4bc8-9385-8d07f7458a3c,
storagePoolId = eb679feb-4da2-4fd0-a185-abbe459ffa70, prevId=1,
prevLVER=0, storagePoolFormatType=V3, recoveryMode=Manual,
SCSIFencing=false), log id: 62f11f2d
2013-10-03 23:05:12,169 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-83) spmStart polling started: taskId =
ab9f2f84-f89b-44e9-b508-a904420635f4
2013-10-03 23:05:12,232 INFO
[org.ovirt.engine.core.vdsbroker.gluster.GlusterServicesListVDSCommand]
(pool-6-thread-50) FINISH, GlusterServicesListVDSCommand, return:
[org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@b624c19b,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@3fcab178,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@e28bd497,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@50ebd507,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@813e865a,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@4c584b19,
org.ovirt.engine.core.common.businessentities.gluster.GlusterServerService@17720fd8],
log id: 5704c54f
2013-10-03 23:05:12,512 INFO
[org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
(DefaultQuartzScheduler_Worker-6) START,
GlusterVolumesListVDSCommand(HostName = f18ovn01, HostId =
80188ccc-83b2-4bc8-9385-8d07f7458a3c), log id: 39a3f45d
2013-10-03 23:05:12,595 INFO
[org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand]
(DefaultQuartzScheduler_Worker-6) FINISH,
GlusterVolumesListVDSCommand, return:
{97873e57-0cc2-4740-ae38-186a8dd94718=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@a82da199,
d055b38c-2754-4e53-af5c-69cc0b8bf31c=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@ef0c0180},
log id: 39a3f45d
2013-10-03 23:05:14,182 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
(DefaultQuartzScheduler_Worker-83) Failed in HSMGetTaskStatusVDS
method
2013-10-03 23:05:14,184 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand]
(DefaultQuartzScheduler_Worker-83) Error code AcquireHostIdFailure and
error message VDSGenericException: VDSErrorException: Failed to
HSMGetTaskStatusVDS, error = Cannot acquire host id
2013-10-03 23:05:14,186 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-83) spmStart polling ended: taskId =
ab9f2f84-f89b-44e9-b508-a904420635f4 task status = finished
2013-10-03 23:05:14,188 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-83) Start SPM Task failed - result:
cleanSuccess, message: VDSGenericException: VDSErrorException: Failed
to HSMGetTaskStatusVDS, error = Cannot acquire host id
2013-10-03 23:05:14,214 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand]
(DefaultQuartzScheduler_Worker-83) spmStart polling ended, spm 

Re: [Users] oVirt 3.3 gluster volume active but unable to activate domain

2013-10-03 Thread Gianluca Cecchi
In the messages log of the vdsm hosts:

Oct  3 23:05:57 f18ovn03 sanlock[1146]: 2013-10-03 23:05:57+0200 16624
[13543]: read_sectors delta_leader offset 512 rv -5
/rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
Oct  3 23:05:58 f18ovn03 sanlock[1146]: 2013-10-03 23:05:58+0200 16625
[1155]: s2142 add_lockspace fail result -5


Oct  3 23:04:24 f18ovn01 sanlock[1166]: 2013-10-03 23:04:24+0200 16154
[1172]: s2688 add_lockspace fail result -5
Oct  3 23:04:24 f18ovn01 vdsm TaskManager.Task ERROR
Task=`bd6b0848-4550-483e-9002-e3051a2e1074`::Unexpected error
Oct  3 23:04:25 f18ovn01 vdsm TaskManager.Task ERROR
Task=`f2ac595d-cc9d-4125-a7be-7b8706cc9ee3`::Unexpected error
Oct  3 23:04:25 f18ovn01 vdsm TaskManager.Task ERROR
Task=`6c03be35-57f1-405e-b127-bd708defad67`::Unexpected error
Oct  3 23:04:26 f18ovn01 sanlock[1166]: 2013-10-03 23:04:26+0200 16157
[21348]: read_sectors delta_leader offset 0 rv -5
/rhev/data-center/mnt/glusterSD/f18ovn01.mydomain:gvdata/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
Oct  3 23:04:27 f18ovn01 sanlock[1166]: 2013-10-03 23:04:27+0200 16158
[1172]: s2689 add_lockspace fail result -5

But the origin is possibly a split brain detected at the gluster level.
I see it in rhev-data-center-mnt-glusterSD-f18ovn01.mydomain:gvdata.log
this afternoon, around the time I installed the first guest and ran
a shutdown and a power on.

see:
https://docs.google.com/file/d/0BwoPbcrMv8mvNHNOVlNrOFFabjQ/edit?usp=sharing
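
If the gluster heal commands are available in this version (an
assumption on my side), the affected files should also be listed
directly with:

gluster volume heal gvdata info split-brain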

Why are the gluster logs two hours behind? UTC? Is there any way to set
them to the current system time?

Gianluca


Re: [Users] oVirt 3.3 gluster volume active but unable to activate domain

2013-10-03 Thread Gianluca Cecchi
And in fact, after solving the split brain, the gluster domain
automatically activated.

From rhev-data-center-mnt-glusterSD-f18ovn01.mydomain:gvdata.log under
/var/log/glusterfs I found that the ids file was the one not in sync.
As the VM had only been started on f18ovn03 and I was not able to
migrate it to f18ovn01, I decided to delete the file from f18ovn01.

BTW: what does dom_md/ids contain?


[2013-10-03 22:06:33.543730] E
[afr-self-heal-common.c:197:afr_sh_print_split_brain_log]
0-gvdata-replicate-0: Unable to self-heal contents of
'/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids' (possible
split-brain). Please delete the file from all but the preferred
subvolume.- Pending matrix:  [ [ 0 2 ] [ 2 0 ] ]
[2013-10-03 22:06:33.544013] E
[afr-self-heal-common.c:2212:afr_self_heal_completion_cbk]
0-gvdata-replicate-0: background  data self-heal failed on
/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
[2013-10-03 22:06:33.544522] W [afr-open.c:213:afr_open]
0-gvdata-replicate-0: failed to open as split brain seen, returning
EIO
[2013-10-03 22:06:33.544603] W [page.c:991:__ioc_page_error]
0-gvdata-io-cache: page error for page = 0x7f4b80004910  waitq =
0x7f4b8001da60
[2013-10-03 22:06:33.544635] W [fuse-bridge.c:2049:fuse_readv_cbk]
0-glusterfs-fuse: 132995: READ = -1 (Input/output error)
[2013-10-03 22:06:33.545070] W
[client-lk.c:367:delete_granted_locks_owner] 0-gvdata-client-0: fdctx
not valid
[2013-10-03 22:06:33.545118] W
[client-lk.c:367:delete_granted_locks_owner] 0-gvdata-client-1: fdctx
not valid

I found that gluster creates hard links, so you have to delete all
copies of the conflicting file from the brick directory of the node you
choose to delete from.
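
A quick way to double check, before deleting anything, is to compare
inode numbers and link counts with plain ls -li on both paths; the file
under the brick and its .glusterfs entry should show the same inode and
a link count of 2:

ls -li /gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids \
  /gluster/DATA_GLUSTER/brick1/.glusterfs/ae/27/ae27eb8d-c653-4cc0-a054-ea376ce8097d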

Thanks very much to this link:
http://inuits.eu/blog/fixing-glusterfs-split-brain

So these were my steps:

Locate all hard links to the file:
[root@f18ovn01 d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291]# find
/gluster/DATA_GLUSTER/brick1/ -samefile
/gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
-print

/gluster/DATA_GLUSTER/brick1/.glusterfs/ae/27/ae27eb8d-c653-4cc0-a054-ea376ce8097d
/gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids

and then delete both:
[root@f18ovn01 d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291]# find
/gluster/DATA_GLUSTER/brick1/ -samefile
/gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids
-print -delete

/gluster/DATA_GLUSTER/brick1/.glusterfs/ae/27/ae27eb8d-c653-4cc0-a054-ea376ce8097d
/gluster/DATA_GLUSTER/brick1/d0b96d4a-62aa-4e9f-b50e-f7a0cb5be291/dom_md/ids

And after this step there were no more E lines in the gluster log, and
the gluster domain was automatically activated by the engine.
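
As a final check (again assuming standard gluster CLI behaviour here),
one should be able to verify that nothing is left pending with:

gluster volume heal gvdata info

which should report zero entries on both bricks once the ids file has
been recreated from the good copy.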

Gianluca