Hi,

I have two machines running a simple replicate volume to provide highly available storage for KVM virtual machines. As soon as auto-healing starts, GlusterFS blocks the VM's storage access (writes appear to be what triggers it), leaving the whole virtual machine hanging. I can reproduce this bug on both ext3 and ext4 filesystems, on physical machines as well as in VMs.

Any help would be appreciated; we have to run the VMs without GlusterFS at the moment because of this problem. :-(

More on my config:

* Ubuntu 10.04 Server 64bit
* Kernel 2.6.32-21-server
* FUSE 2.8.1
* GlusterFS 3.0.2

How to reproduce:

* 2 nodes running a GlusterFS replicate volume
* Start a KVM virtual machine with its disk image on the GlusterFS mount
* Stop glusterfsd on one node
* Make changes to the disk image (i.e. write inside the guest)
* Bring glusterfsd back online; auto-healing starts (log: "replicate: no missing files - /image.raw. proceeding to metadata check")
* As soon as the VM writes data, it is blocked until auto-healing finishes, making it completely unresponsive (see the command sketch below)
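
A rough command sketch of the sequence (hypothetical: "node2", the init-script path, and the dd workload are placeholders; adjust to however glusterfsd is managed on your nodes):

### repro sketch ###
# on node2: take one replica offline
/etc/init.d/glusterfsd stop

# inside the guest: generate writes so the image diverges between the bricks
dd if=/dev/zero of=/tmp/churn bs=1M count=512

# on node2: bring the replica back; self-heal triggers on the next access
/etc/init.d/glusterfsd start
# client log: "replicate: no missing files - /image.raw. proceeding to metadata check"
# from here on, every write from the guest blocks until self-heal completes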

Message from the kernel (printed several times while healing):

INFO: task kvm:7774 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kvm           D 00000000ffffffff     0  7774      1 0x00000000
ffff8801adcd9e48 0000000000000082 0000000000015bc0 0000000000015bc0
ffff880308d9df80 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9dbc0
0000000000015bc0 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9df80
Call Trace:
[<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170
[<ffffffff8153f75b>] mutex_lock+0x2b/0x50
[<ffffffff8123a1d1>] fuse_file_llseek+0x41/0xe0
[<ffffffff8114238a>] vfs_llseek+0x3a/0x40
[<ffffffff81142fd6>] sys_lseek+0x66/0x80
[<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

Gluster Configuration:

### glusterfsd.vol ###
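# server side: export /data/export through posix locks and 16 io-threads over tcp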
volume posix
  type storage/posix
  option directory /data/export
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 16
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option transport.socket.nodelay on
  option transport.socket.bind-address 192.168.158.141
  option auth.addr.brick.allow 192.168.158.*
  subvolumes brick
end-volume

### glusterfs.vol ###
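# client side: one tcp client per node, mirrored by cluster/replicate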
volume gluster1
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.158.141
  option remote-subvolume brick
end-volume

volume gluster2
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.158.142
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes gluster1 gluster2
end-volume

### fstab ###
/etc/glusterfs/glusterfs.vol /mnt/glusterfs glusterfs log-level=DEBUG,direct-io-mode=disable 0 0
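
For debugging outside of fstab, the equivalent manual mount (a sketch using the same volfile and options as the fstab line above) would be:

mount -t glusterfs -o log-level=DEBUG,direct-io-mode=disable /etc/glusterfs/glusterfs.vol /mnt/glusterfs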


I read that you wanted users to send kill -11 to the glusterfs client process for more debug info, so here it is. I sent the signal roughly like this (assuming there is only one glusterfs client process on the host):
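
kill -11 $(pidof glusterfs)   # SIGSEGV makes the client dump its pending frames and backtrace

The resulting dump from the client log: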

pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)

patchset: v3.0.2
signal received: 11
time of crash: 2010-09-28 11:14:31
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.2
/lib/libc.so.6(+0x33af0)[0x7f0c6bf0eaf0]
/lib/libc.so.6(epoll_wait+0x33)[0x7f0c6bfc1c93]
/usr/lib/libglusterfs.so.0(+0x2e261)[0x7f0c6c6ac261]
glusterfs(main+0x852)[0x4044f2]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f0c6bef9c4d]
glusterfs[0x402ab9]
---------