attempt to access beyond end of device

2000-07-15 Thread Darren Nickerson


Folks,

kernel 2.2.13ac1, patched with ide.2.2.13.1999.patch and 
raid0145-19990824-2.2.11. I know this is no longer "state of the art", but it 
was pretty solid in its day. Recently we've had two events which took out the 
entire raid5 array, and both followed the same pattern. Here's the sequence:

Drive loses DMA for some reason. 

Jul 14 09:20:11 osmin kernel: hdi: timeout waiting for DMA 
Jul 14 09:20:11 osmin kernel: hdi: irq timeout: status=0xd0 { Busy } 
Jul 14 09:20:11 osmin kernel: hdi: DMA disabled 
Jul 14 09:20:12 osmin kernel: ide4: reset: success 


Further attempts to access the disk lead to:

Jul 14 09:22:25 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 14 09:22:25 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 14 09:22:25 osmin kernel: ide4: reset: success
Jul 14 09:27:32 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 14 09:27:32 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 14 09:27:32 osmin kernel: ide4: reset: success

This goes on for hours and hours, and the drive is still marked active in 
mdstat. Finally, after many hours:

Jul 15 00:25:45 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 15 00:25:45 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 15 00:25:45 osmin kernel: ide4: reset: success
Jul 15 00:25:47 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 15 00:25:47 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 15 00:25:47 osmin kernel: ide4: reset: success
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 39:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdk1, disabling device. Operation continuing on 3 devices
Jul 15 00:26:06 osmin kernel: raid5: restarting stripe 1270962198
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 16:41: rw=0, want=635481100, limit=36630688
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdd1, disabling device. Operation continuing on 2 devices
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 22:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdg1, disabling device. Operation continuing on 1 devices
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 38:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdi1, disabling device. Operation continuing on 0 devices
Jul 15 00:26:06 osmin kernel: raid5: restarting stripe 1270962198

followed by

Jul 15 00:26:06 osmin kernel: raid5: md1: unrecoverable I/O error for block 4053926987 
Jul 15 00:26:06 osmin kernel: raid5: md1: unrecoverable I/O error for block 4053730379 

on and on forever, and the array is dead to the world.

RAID has failed me here: I lost one disk, and then I lost them all. The very 
failure I installed RAID to protect against has instead led to a larger 
catastrophe. Yes, I can reboot and fsck the array, but files are missing (old 
files not recently accessed) and there is repair work to be done. Not an ideal 
outcome.

My question is this: do the diagnostics above point to a misconfiguration on my 
part, or to a shortcoming in RAID's ability to cope with a drive that has had 
DMA disabled?
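
For reference, here is how I read those "attempt to access beyond end of device" 
lines: "want" is where the request would end, counted in 1 KiB blocks, and 
"limit" is the size of the target partition, also in 1 KiB blocks. A minimal 
sketch of that style of check (illustrative names only, not the actual 2.2 
ll_rw_blk.c source), fed with the numbers from the log above:

#include <stdio.h>

/*
 * Illustrative sketch of a block-layer end-of-device check, assuming the
 * 2.2-era convention that both "want" and "limit" are counted in 1 KiB
 * blocks.  This is not the kernel source, just the shape of the test.
 */
static int beyond_end_of_device(long limit_kb, long blocknr, int size_bytes)
{
	/* last 1 KiB block the request would need, counting from 1 */
	long want = (blocknr + 1) * (size_bytes / 1024);

	if (want > limit_kb) {
		printf("attempt to access beyond end of device\n");
		printf("rw=0, want=%ld, limit=%ld\n", want, limit_kb);
		return 1;
	}
	return 0;
}

int main(void)
{
	/* numbers from the first failure above: a 1 KiB buffer at block
	 * 635481099 against a 33417184-block (~32 GB) partition */
	beyond_end_of_device(33417184, 635481099, 1024);
	return 0;
}

Reading the log that way, the same out-of-range block was simply retried against 
each remaining member partition in turn (39:01, 16:41, 22:01, 38:01), each retry 
failed the same test, and raid5 kicked the members out one after another within 
the same second.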

-Darren





Fatal: Only RAID1 devices are supported for boot images

2000-07-15 Thread Sven Kirmess

How can I get lilo to work? The funny thing is that I have a working lilo boot
sector, but I cannot create a new one. I have no idea what I've
changed...


Sven

-


# lilo -v -t
LILO version 21.4-4 (test mode), Copyright (C) 1992-1998 Werner Almesberger
'lba32' extensions Copyright (C) 1999,2000 John Coffman

boot = /dev/hde, map = /boot/map.2101
Reading boot sector from /dev/hde
Merging with /usr/local/src/lilo-21.4.4/boot.b
Fatal: Only RAID1 devices are supported for boot images  


--- lilo.conf ---
# more /etc/lilo.conf
# LILO configuration file
# Start LILO global Section
install = /usr/local/src/lilo-21.4.4/boot.b
#
# I have tried both (hda and md100), but they didn't work!!!
#
# boot=/dev/hda
boot=/dev/md100
# compact   # faster, but won't work on all systems.
linear  # for RAID
vga = normal    # force sane state
read-only
prompt
# timeout=00
timeout=50

# End LILO global Section
#

image = /boot/vmlinuz
  root = /dev/md100
  label = Linux





# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md100 : active raid1 hdg1[1] hde1[0] 153600 blocks [2/2] [UU]
md101 : active raid1 hdg2[1] hde2[0] 20480 blocks [2/2] [UU]
md150 : active raid5 hdg6[1] hde6[0] 2054912 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md151 : active raid5 hdg7[1] hde7[0] 1027840 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md159 : active raid5 hdg8[1] hde8[0] 2055936 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md155 : active raid5 hdg9[1] hde9[0] 1027840 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md170 : active raid5 hdg10[1] hde10[0] 3084288 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md190 : active raid5 hdg11[1] hde11[0] 2055936 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md191 : active raid5 hdg12[1] hde12[0] 16 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md192 : active raid5 hdg13[1] hde13[0] 320256 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md200 : active raid5 hdg15[1] hde15[0] 1504 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
md230 : active raid5 hdg16[1] hde16[0] 320256 blocks level 5, 128k chunk, algorithm 2 [3/2] [UU_]
unused devices: <none>





Re: Failure autodetecting raid0 partitions

2000-07-15 Thread Kenneth Johansson

Anders Qvist wrote:

> I have a 2.2.11+intl+raid0.90 successfully mounting its ext2 root file
> system off /dev/md0, which is autodetected by the kernel. A 2.4-test2
> kernel compiled with CONFIG_AUTODETECT_RAID fails to autodetect my
> partitions when I write it to a floppy and boot it. It just says
> autodetecting RAID arrays ... autorun DONE.
>
> There is probably something I don't know. I'd be grateful if someone told
> me what it was. NB: I'm not on the list.

The autodetection was not being done for partitions inside an extended partition.

The patch below fixes that.





--- linux-2.4.0-test4/fs/partitions/msdos.c Sat Jul 15 13:22:29 2000
+++ linux/fs/partitions/msdos.c Sat Jul 15 21:03:38 2000
@@ -136,6 +136,12 @@
 		add_gd_partition(hd, current_minor,
 				 this_sector+START_SECT(p)*sector_size,
 				 NR_SECTS(p)*sector_size);
+#if CONFIG_BLK_DEV_MD && CONFIG_AUTODETECT_RAID
+		if (SYS_IND(p) == LINUX_RAID_PARTITION) {
+			md_autodetect_dev(MKDEV(hd->major,current_minor));
+		}
+#endif
+
 		current_minor++;
 		loopct = 0;
 		if ((current_minor & mask) == 0)
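
One note for anyone trying this: the hook only fires for logical partitions whose 
type byte matches the RAID autodetect type, i.e. SYS_IND(p) == LINUX_RAID_PARTITION. 
For reference, that is the usual 0xfd autodetect type (value as I recall it from 
include/linux/genhd.h):

/* partition type the md autodetection keys on (shown here for reference only) */
#define LINUX_RAID_PARTITION	0xfd	/* autodetect RAID partition */

So the partitions inside the extended partition still have to be created as type 
fd in fdisk, exactly as with primary partitions, before autorun will pick them up.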



Re: Failure autodetecting raid0 partitions

2000-07-15 Thread Edward Schernau

Wow, an email CCed to Linus himself!  *faint*



Re: Failure autodetecting raid0 partitions

2000-07-15 Thread Kenneth Johansson

Edward Schernau wrote:

> Wow, an email CCed to Linus himself!  *faint*

Well, do you know of another way to get a patch into the kernel?





Re: Fatal: Only RAID1 devices are supported for boot images

2000-07-15 Thread Sven Kirmess

> # lilo -v -t
> LILO version 21.4-4 (test mode), Copyright (C) 1992-1998 Werner Almesberger
> 'lba32' extensions Copyright (C) 1999,2000 John Coffman
>
> boot = /dev/hde, map = /boot/map.2101
> Reading boot sector from /dev/hde
> Merging with /usr/local/src/lilo-21.4.4/boot.b
> Fatal: Only RAID1 devices are supported for boot images
>
> --- lilo.conf ---
> install = /usr/local/src/lilo-21.4.4/boot.b

I solved the problem: boot.b was on a RAID5 device. I didn't know that was a
problem, and I don't think it should be one, because it worked before...
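
In case someone else trips over the same message: lilo apparently has to be able 
to map every file it touches at boot time (boot.b, the map file, the kernel 
images), and with 21.4 those files can only live on a plain partition or on a 
RAID1 md device, not on RAID5. A sketch of the relevant lilo.conf lines, assuming 
boot.b has been copied onto the RAID1 md100 filesystem (the boot.b and map paths 
below are illustrative, not my real setup):

install = /boot/boot.b    # boot.b now on the RAID1 device
boot    = /dev/md100
map     = /boot/map

image = /boot/vmlinuz
  root  = /dev/md100
  label = Linux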

 Sven





Re: Failure autodetecting raid0 partitions

2000-07-15 Thread Chris Mauritz

> From [EMAIL PROTECTED] Sat Jul 15 19:29:44 2000
>
> Edward Schernau wrote:
>
> > Wow, an email CCed to Linus himself!  *faint*
>
> Well, do you know of another way to get a patch into the kernel?

So if Linus gets hit by a bus (or a fast-moving Hare Krishna), how
are folks to get things into the kernel then?

C
-- 
Christopher Mauritz
[EMAIL PROTECTED]



Re: Failure autodetecting raid0 partitions

2000-07-15 Thread Seth Vidal

> So if Linus gets hit by a bus (or a fast-moving Hare Krishna), how
> are folks to get things into the kernel then?

Probably Alan.

-sv