Hi all!
I'm running into trouble using mdadm on top of LVM2. Here is my config:
I have two servers (xen1 and xen2, their hostnames on my local network)
with the configuration below.
Each server has 4 SATA disks, 1 TB each, attached to the motherboard,
and 4x4 GB of DDR3 RAM.
Debian Squeeze x86_64 is installed:
root@xen2:~# uname -a
Linux xen2 2.6.32-5-xen-amd64 #1 SMP Wed Jan 12 05:46:49 UTC 2011 x86_64
GNU/Linux
Storage configuration:
The first 256 MB and the next 32 GB of two of the four disks are used
for RAID1 devices holding /boot and swap respectively.
The rest of the space, 970 GB on each of the four SATA disks, is used
as a RAID10. LVM2 sits on top of that RAID10; the volume group is named
xenlvm (the servers are meant to run as Xen 4.0.1 hosts, but this story
is not about Xen troubles).
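For completeness, the layout can be sketched roughly like this (a
sketch only, not my exact commands: partition numbers, array names and
LV sizes here are illustrative):

```shell
# RAID1 pairs for /boot and swap on the first two disks
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# RAID10 over the large remaining partitions of all four disks
mdadm --create /dev/md2 --level=10 --raid-devices=4 \
    /dev/sda3 /dev/sdb3 /dev/sdc1 /dev/sdd1

# LVM2 on top of the RAID10
pvcreate /dev/md2
vgcreate xenlvm /dev/md2
lvcreate -L 10G -n root xenlvm
lvcreate -L 10G -n var  xenlvm
lvcreate -L 10G -n home xenlvm
```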
/, /var and /home live on small logical volumes:
root@xen2:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/XENLVM-home
9.2G 6.0G 2.8G 69% /
tmpfs 7.6G 0 7.6G 0% /lib/init/rw
udev 7.1G 316K 7.1G 1% /dev
tmpfs 7.6G 0 7.6G 0% /dev/shm
/dev/md3 223M 31M 180M 15% /boot
/dev/mapper/XENLVM-var
9.2G 150M 8.6G 2% /home
/dev/mapper/XENLVM-root
9.2G 2.5G 6.3G 29% /var
About 900 GB in the "xenlvm" volume group is left free for creating new
logical volumes, which are meant to serve as block devices for RAID1
arrays. One member of such an array is a local logical volume; the
second is an ATA over Ethernet (AoE) device, named e.g. e0.1.
We need these complications to run Xen VMs. Our VMs store their data on
RAID1 devices, so if one of the two hosts (xen1 or xen2) dies in a
catastrophic failure, the surviving Xen host still holds the virtual
machine's block device and we can start the VM there.
The two servers have two Ethernet interfaces each. One (eth1 on each)
talks to our LAN (for connecting to the servers). The other (eth0 on
each) is connected to the other server with a crossover cable at
1 Gbit/s, providing the disk space via ATA over Ethernet.
So here is the problem with this RAID1 device:
I configured one 20 GB RAID1, with a 20 GB AoE device and a 20 GB local
LVM block device as members:
mdadm -C /dev/md3 --level=1 --raid-devices=2 /dev/etherd/e0.1 \
    /dev/xenlvm/raid20gig
Then I installed Windows 2003 on this volume, did some configuration
inside it and installed some software. After that I backed up an image
of the volume using dd:
dd if=/dev/md3 of=/backups/md3_date.dd
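The dd image-and-verify flow itself can be exercised on plain files
instead of /dev/md3; in this sketch the bs= option and the md5sum check
are my additions for illustration, not part of the original command:

```shell
# simulate a small block device with a plain file
SRC=$(mktemp) IMG=$(mktemp)
dd if=/dev/urandom of="$SRC" bs=1M count=4 2>/dev/null

# take the image, exactly as with a real md device
dd if="$SRC" of="$IMG" bs=1M 2>/dev/null

# the two checksums must match before trusting the backup
[ "$(md5sum < "$SRC")" = "$(md5sum < "$IMG")" ] && echo "image verified"
rm -f "$SRC" "$IMG"
```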
Then I decided to run more of these virtual machines from that backup,
so I created another RAID1 device with 20 GB capacity:
mdadm -C /dev/md4 --level=1 --raid-devices=2 /dev/xenlvm/raid20gig2 \
    /dev/etherd/e0.2
and wrote the dd backup onto it:
dd if=/backups/md3_date.dd of=/dev/md4
I started a domU with this md4 device as its hard disk, and it runs
smoothly. But when I look at cat /proc/mdstat, I see that one of the
backing devices is in a faulty state:
md4 : active raid1 dm-15[0](F) etherd/e1.5[1]
20970424 blocks super 1.2 [2/1] [_U]
That dm-15 is the LVM2 device /dev/xenlvm/raid20gig2.
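To confirm which LV hides behind a dm-N name, I look at the symlinks
under /dev/mapper. Here is a self-contained demonstration of that
resolution trick, simulated in a temp directory (the xenlvm-raid20gig2
name and the dm-15 target are taken from my box; a live system would
use /dev/mapper directly):

```shell
# simulate /dev/mapper: mapper names are symlinks to ../dm-N nodes
FAKE=$(mktemp -d)
mkdir "$FAKE/mapper"
touch "$FAKE/dm-15"
ln -s ../dm-15 "$FAKE/mapper/xenlvm-raid20gig2"

# given a dm-N name from /proc/mdstat, find the human-readable LV name
for link in "$FAKE"/mapper/*; do
    [ "$(readlink "$link")" = "../dm-15" ] && basename "$link"
done
rm -rf "$FAKE"
```

On the real host the equivalent is `ls -l /dev/mapper` and matching the
symlink target against the dm-N name.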
If I hot-remove and then re-add the failing device, the RAID volume
begins to resync as if it were in a normal state:
root@xen1:~# mdadm /dev/md4 -r /dev/dm-15
mdadm: hot removed /dev/dm-15 from /dev/md8
root@xen1:~# mdadm /dev/md4 -a /dev/dm-15
mdadm: re-added /dev/dm-15
Only the faulty array's listing is shown below; as I said, there is
also a RAID10 in the system, and there's no problem with it.
root@xen1:~# cat /proc/mdstat
Personalities : [raid1] [raid10]
md4 : active raid1 dm-15[0] etherd/e1.5[1]
20970424 blocks super 1.2 [2/1] [_U]
[>....................] recovery = 1.0% (218752/20970424) finish=17.3min
speed=19886K/sec
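Since this keeps happening, I watch for the [_U] pattern automatically.
A small sketch that flags degraded arrays, run here against a captured
sample of the output above (the awk logic is mine; on a live system the
heredoc would be replaced by /proc/mdstat):

```shell
# remember the current array name from each "mdN : ..." line, then
# flag any array whose status string ([..]) contains '_'
# (i.e. a missing or failed member)
awk '/^md/ { name=$1 }
     /\[[U_]+\]$/ && /_/ { print name " is degraded" }' <<'EOF'
Personalities : [raid1] [raid10]
md4 : active raid1 dm-15[0](F) etherd/e1.5[1]
      20970424 blocks super 1.2 [2/1] [_U]
EOF
```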
So I started watching /var/log/syslog and /var/log/messages for errors
and found the message below:
raid10_make_request bug: can't convert block across chunks or bigger
than 512k 965198847 4
This message appears in the log at the moment the state of the LVM
block device dm-15 changes from normal to faulty in /proc/mdstat.
That's not the end of the story. I first saw this message on the xen1
host, so the failing member was the local LVM device. But at some point
the same problem appeared on the second host, xen2. There the message
lands in /var/log/kern.log and floods it so fast that my /var filled up
in two days. After that the AoE device on xen1 went into the "down"
state and the VM died.
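Until the underlying bug is fixed, I'm considering size-based rotation
so kern.log can't fill /var between the weekly logrotate runs. A sketch
of a drop-in for /etc/logrotate.d/ (the 100M threshold is an arbitrary
choice of mine):

```
# rotate kern.log as soon as it exceeds 100 MB, keep 4 old copies
/var/log/kern.log {
    size 100M
    rotate 4
    compress
    missingok
    notifempty
}
```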
I googled this error and found only posts about a Red Hat and Debian
Etch kernel bug from 2007-2009.