Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
On Monday, 2 July 2007 18:36, David Greaves wrote:
> Rafael J. Wysocki wrote:
> > On Monday, 2 July 2007 16:32, David Greaves wrote:
> >> Rafael J. Wysocki wrote:
> >>> On Monday, 2 July 2007 12:56, Tejun Heo wrote:
> >>>> David Greaves wrote:
> >>>>> Tejun Heo wrote:
> >>>>>> It's really weird tho. The PHY RDY status changed events are coming
> >>>>>> from the device which is NOT used while resuming
> >>>>> There is an obvious problem there though Tejun (the errors even when sda
> >>>>> isn't involved in the OS boot) - can I start another thread about that
> >>>>> issue/bug later? I need to reshuffle partitions so I'd rather get the
> >>>>> hibernate working first and then go back to it if that's OK?
> >>>> Yeah, sure. The problem is that we don't know whether or how those two
> >>>> are related. It would be great if there's a way to verify that the memory
> >>>> image read from hibernation is intact. Rafael, any ideas?
> >>> Well, s2disk has an option to compute an MD5 checksum of the image during
> >>> the hibernation and verify it while reading the image.
> >> (Assuming you mean the mainline version)
> >>
> >> Sounds like a good thing to try next...
> >> Couldn't see anything on this in ../Documentation/power/*
> >> How do I enable it?
> >
> > Add 'compute checksum = y' to the s2disk's configuration file.
>
> Ah, right - that's uswsusp, isn't it? Which isn't what I'm having problems
> with, AFAIK?
>
> My suspend procedure is:
>
> xfs_freeze -f /scratch
> sync
> echo platform > /sys/power/disk
> echo disk > /sys/power/state
> xfs_freeze -u /scratch
>
> Which should work (actually it should work without the sync/xfs_freeze too).
>
> So to debug the problem I'd like to minimally extend this process rather than
> replace it with another approach.

Well, this is not entirely "another approach". Only the saving of the image is
done differently; the rest is the same.

> I take it there isn't an 'echo y > /sys/power/do_image_checksum'?

No, there isn't anything like that.
Greetings,
Rafael

--
"Premature optimization is the root of all evil." - Donald Knuth
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
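[Editorial note: the freeze/hibernate/thaw sequence David quotes can be wrapped in a small script. This is a sketch, not something from the thread; the `SCRATCH` mount point and the `DRY_RUN` guard are illustrative assumptions, and the real `/sys/power` writes require root.]

```shell
#!/bin/sh
# Sketch of the suspend procedure quoted above: freeze the XFS filesystem,
# sync, hibernate via the platform method, then thaw after resume.
# Defaults to a dry run that only prints the commands; set DRY_RUN=0 to
# actually execute them (requires root and XFS mounted on $SCRATCH).
SCRATCH=${SCRATCH:-/scratch}   # assumed mount point
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$1"
  else
    sh -c "$1"
  fi
}

run "xfs_freeze -f $SCRATCH"
run "sync"
run "echo platform > /sys/power/disk"
run "echo disk > /sys/power/state"   # system hibernates here, resumes later
run "xfs_freeze -u $SCRATCH"
```

Run as-is it just echoes the five commands, which makes the intended ordering (freeze before the state write, thaw after resume) easy to check.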
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
Rafael J. Wysocki wrote:
> On Monday, 2 July 2007 16:32, David Greaves wrote:
>> Rafael J. Wysocki wrote:
>>> On Monday, 2 July 2007 12:56, Tejun Heo wrote:
>>>> David Greaves wrote:
>>>>> Tejun Heo wrote:
>>>>>> It's really weird tho. The PHY RDY status changed events are coming
>>>>>> from the device which is NOT used while resuming
>>>>> There is an obvious problem there though Tejun (the errors even when sda
>>>>> isn't involved in the OS boot) - can I start another thread about that
>>>>> issue/bug later? I need to reshuffle partitions so I'd rather get the
>>>>> hibernate working first and then go back to it if that's OK?
>>>> Yeah, sure. The problem is that we don't know whether or how those two
>>>> are related. It would be great if there's a way to verify that the memory
>>>> image read from hibernation is intact. Rafael, any ideas?
>>> Well, s2disk has an option to compute an MD5 checksum of the image during
>>> the hibernation and verify it while reading the image.
>> (Assuming you mean the mainline version)
>>
>> Sounds like a good thing to try next...
>> Couldn't see anything on this in ../Documentation/power/*
>> How do I enable it?
> Add 'compute checksum = y' to the s2disk's configuration file.

Ah, right - that's uswsusp, isn't it? Which isn't what I'm having problems
with, AFAIK?

My suspend procedure is:

xfs_freeze -f /scratch
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
xfs_freeze -u /scratch

Which should work (actually it should work without the sync/xfs_freeze too).

So to debug the problem I'd like to minimally extend this process rather than
replace it with another approach.

I take it there isn't an 'echo y > /sys/power/do_image_checksum'?

David
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
On Monday, 2 July 2007 16:32, David Greaves wrote:
> Rafael J. Wysocki wrote:
> > On Monday, 2 July 2007 12:56, Tejun Heo wrote:
> >> David Greaves wrote:
> >>> Tejun Heo wrote:
> >>>> It's really weird tho. The PHY RDY status changed events are coming
> >>>> from the device which is NOT used while resuming
> >>> There is an obvious problem there though Tejun (the errors even when sda
> >>> isn't involved in the OS boot) - can I start another thread about that
> >>> issue/bug later? I need to reshuffle partitions so I'd rather get the
> >>> hibernate working first and then go back to it if that's OK?
> >> Yeah, sure. The problem is that we don't know whether or how those two
> >> are related. It would be great if there's a way to verify that the memory
> >> image read from hibernation is intact. Rafael, any ideas?
> >
> > Well, s2disk has an option to compute an MD5 checksum of the image during
> > the hibernation and verify it while reading the image.
> (Assuming you mean the mainline version)
>
> Sounds like a good thing to try next...
> Couldn't see anything on this in ../Documentation/power/*
> How do I enable it?

Add 'compute checksum = y' to the s2disk's configuration file.

Greetings,
Rafael
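[Editorial note: the option Rafael mentions goes into s2disk's configuration file, commonly /etc/suspend.conf in uswsusp installs. The file path and the resume device below are illustrative; only the 'compute checksum' line itself comes from this thread.]

```
# s2disk configuration file (commonly /etc/suspend.conf; path varies by distro)
resume device = /dev/sdb4      # example value - point this at your swap partition
compute checksum = y           # compute an MD5 of the image and verify it on resume
```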
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
Rafael J. Wysocki wrote:
> On Monday, 2 July 2007 12:56, Tejun Heo wrote:
>> David Greaves wrote:
>>> Tejun Heo wrote:
>>>> It's really weird tho. The PHY RDY status changed events are coming
>>>> from the device which is NOT used while resuming
>>> There is an obvious problem there though Tejun (the errors even when sda
>>> isn't involved in the OS boot) - can I start another thread about that
>>> issue/bug later? I need to reshuffle partitions so I'd rather get the
>>> hibernate working first and then go back to it if that's OK?
>> Yeah, sure. The problem is that we don't know whether or how those two
>> are related. It would be great if there's a way to verify that the memory
>> image read from hibernation is intact. Rafael, any ideas?
> Well, s2disk has an option to compute an MD5 checksum of the image during
> the hibernation and verify it while reading the image.

(Assuming you mean the mainline version)

Sounds like a good thing to try next...
Couldn't see anything on this in ../Documentation/power/*
How do I enable it?

> Still, s2disk/resume aren't very easy to install and configure ...

I have it working fine on 2 other machines now, so that doesn't appear to be
a problem.

David
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
On Monday, 2 July 2007 12:56, Tejun Heo wrote:
> David Greaves wrote:
>> Tejun Heo wrote:
>>> It's really weird tho. The PHY RDY status changed events are coming
>>> from the device which is NOT used while resuming
>>
>> There is an obvious problem there though Tejun (the errors even when sda
>> isn't involved in the OS boot) - can I start another thread about that
>> issue/bug later? I need to reshuffle partitions so I'd rather get the
>> hibernate working first and then go back to it if that's OK?
>
> Yeah, sure. The problem is that we don't know whether or how those two
> are related. It would be great if there's a way to verify that the memory
> image read from hibernation is intact. Rafael, any ideas?

Well, s2disk has an option to compute an MD5 checksum of the image during
the hibernation and verify it while reading the image.

Still, s2disk/resume aren't very easy to install and configure ...

Greetings,
Rafael
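[Editorial note: the checksum scheme Rafael describes - store a digest when the image is written, recompute and compare when it is read back - can be sketched in a few lines of shell. This is an illustration of the idea only, not s2disk's actual code; the file paths are hypothetical.]

```shell
#!/bin/sh
# Illustrative sketch (NOT s2disk's implementation): save an MD5 digest
# alongside an "image" file on write, verify it on read. A mismatch means
# the image was corrupted between writing and reading.
set -e

save_checksum() {
  # record the digest next to the image file
  md5sum "$1" | awk '{print $1}' > "$1.md5"
}

verify_checksum() {
  # succeed only if the recomputed digest matches the stored one
  [ "$(md5sum "$1" | awk '{print $1}')" = "$(cat "$1.md5")" ]
}

# Demo on a scratch file (hypothetical path):
img=/tmp/demo-image.bin
printf 'pretend hibernation image' > "$img"
save_checksum "$img"
verify_checksum "$img" && echo "image intact"
```

If anything flips a byte of the file after `save_checksum`, `verify_checksum` fails, which is exactly the property that would distinguish "image corrupted on read-back" from "corruption happened later in the resumed system".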
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
David Greaves wrote:
>> Tejun Heo wrote:
>>> It's really weird tho. The PHY RDY status changed events are coming
>>> from the device which is NOT used while resuming
>
> There is an obvious problem there though Tejun (the errors even when sda
> isn't involved in the OS boot) - can I start another thread about that
> issue/bug later? I need to reshuffle partitions so I'd rather get the
> hibernate working first and then go back to it if that's OK?

Yeah, sure. The problem is that we don't know whether or how those two
are related. It would be great if there's a way to verify that the memory
image read from hibernation is intact. Rafael, any ideas?

Thanks.

--
tejun
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
David Greaves wrote:
> been away, back now...

again...

David Greaves wrote:
> When I move the swap/resume partition to a different controller (ie when I
> broke the / mirror and used the freed space) the problem seems to go away.

No, it's not gone away - but it's taking longer to show up.

I can try and put together a test loop that does work, hibernates, resumes
and repeats, but since I know it crashes at some point there doesn't seem
much point unless I'm looking for something.

There's not much in the logs - is there any other instrumentation that
people could suggest?

DaveC, given this is happening without (obvious) libata errors, do you think
it may be something in the XFS/md/hibernate area?

If there's anything to be tried then I'll also move to 2.6.22-rc6.

> Tejun Heo wrote:
>> It's really weird tho. The PHY RDY status changed events are coming
>> from the device which is NOT used while resuming

There is an obvious problem there though Tejun (the errors even when sda
isn't involved in the OS boot) - can I start another thread about that
issue/bug later? I need to reshuffle partitions so I'd rather get the
hibernate working first and then go back to it if that's OK?

David
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
been away, back now...

Tejun Heo wrote:
> David Greaves wrote:
>>> Tejun Heo wrote:
>>>> How reproducible is the problem? Does the problem go away or occur more
>>>> often if you change the drive you write the memory image to?
>>
>> I don't think there should be activity on the sda drive during resume
>> itself.
>>
>> [I broke my / md mirror and am using some of that for swap/resume for now]
>>
>> I did change the swap/resume device to sdd2 (different controller,
>> onboard sata_via) and there was no EH during resume. The system seemed
>> OK, wrote a few Gb of video and did a kernel compile.
>> I repeated this test, no EH during resume, no problems.
>> I even ran xfs_fsr, the defragment utility, to stress the fs.
>>
>> I'll retain this configuration and try again tonight but it looks like
>> there _may_ be a link between EH during resume and my problems...

Having retained this new configuration for a couple of days now I haven't
had any problems. This is good but not really ideal since / isn't mirrored
anymore :(

>> Of course, I don't understand why it *should* EH during resume, it
>> doesn't during boot or normal operation...
>
> EH occurs during boot, suspend and resume all the time. It just runs in
> quiet mode to avoid disturbing the users too much. In your case, EH is
> kicking in due to actual exception conditions so it's being verbose to
> give a clue about what's going on.

I was trying to say that I don't actually see any errors being handled in
normal operation. I'm not sure if you are saying that these PHY RDY events
are normally handled quietly (which would explain it).

> It's really weird tho. The PHY RDY status changed events are coming
> from the device which is NOT used while resuming

yes - but the erroring device which is not being used is on the same
controller as the device with the in-use resume partition.

> and it's before any actual PM events are triggered. Your kernel just
> boots, swsusp realizes it's resuming and tries to read the memory image
> from the swap device.

yes

> While reading, the disk controller raises consecutive PHY readiness
> changed interrupts. EH recovers them alright but the end result seems
> to indicate that the loaded image is corrupt.

Yes, that's consistent with what I'm seeing. When I move the swap/resume
partition to a different controller (ie when I broke the / mirror and used
the freed space) the problem seems to go away.

I am seeing messages in dmesg though:

ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ata1.00: configured for UDMA/100
ata2.00: revalidation failed (errno=-2)
ata2: failed to recover some devices, retrying in 5 secs
sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
sd 0:0:0:0: resuming
sd 0:0:0:0: [sda] Starting disk
ATA: abnormal status 0x7F on port 0x00019807
ATA: abnormal status 0x7F on port 0x00019007
ATA: abnormal status 0x7F on port 0x00019007
ATA: abnormal status 0x7F on port 0x00019807
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ata1.00: configured for UDMA/100
ata2.00: revalidation failed (errno=-2)
ata2: failed to recover some devices, retrying in 5 secs

> So, there's no device suspend/resume code involved at all. The kernel
> just booted and is trying to read data from the drive. Please try with
> only the first drive attached and see what happens.

That's kinda hard; swap and root are on different drives...
Does it help that although the errors above appear, the system seems OK
when I just use the other controller?

I have to be cautious what I do with this machine as it's the wife's
active desktop box.

David
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
David Greaves wrote:
> Tejun Heo wrote:
>> Your controller is repeatedly reporting PHY readiness changed exception.
>> Are you reading the system image from the device attached to the first
>> SATA port?
>
> Yes if you mean 1st as in the one after the zero-th ...

I meant the first first (0th).

>> How reproducible is the problem? Does the problem go away or occur more
>> often if you change the drive you write the memory image to?
>
> I don't think there should be activity on the sda drive during resume
> itself.
>
> [I broke my / md mirror and am using some of that for swap/resume for now]
>
> I did change the swap/resume device to sdd2 (different controller,
> onboard sata_via) and there was no EH during resume. The system seemed
> OK, wrote a few Gb of video and did a kernel compile.
> I repeated this test, no EH during resume, no problems.
> I even ran xfs_fsr, the defragment utility, to stress the fs.
>
> I'll retain this configuration and try again tonight but it looks like
> there _may_ be a link between EH during resume and my problems...
>
> Of course, I don't understand why it *should* EH during resume, it
> doesn't during boot or normal operation...

EH occurs during boot, suspend and resume all the time. It just runs in
quiet mode to avoid disturbing the users too much. In your case, EH is
kicking in due to actual exception conditions so it's being verbose to give
a clue about what's going on.

It's really weird tho. The PHY RDY status changed events are coming from
the device which is NOT used while resuming, and it's before any actual PM
events are triggered. Your kernel just boots, swsusp realizes it's resuming
and tries to read the memory image from the swap device. While reading, the
disk controller raises consecutive PHY readiness changed interrupts. EH
recovers them alright but the end result seems to indicate that the loaded
image is corrupt.

So, there's no device suspend/resume code involved at all. The kernel just
booted and is trying to read data from the drive. Please try with only the
first drive attached and see what happens.

Thanks.

--
tejun
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
On Tue, Jun 19, 2007 at 10:24:23AM +0100, David Greaves wrote:
> David Greaves wrote:
> so I cd'ed out of /scratch and umounted.
>
> I then tried the xfs_check.
>
> haze:~# xfs_check /dev/video_vg/video_lv
> ERROR: The filesystem has valuable metadata changes in a log which needs to
> be replayed. Mount the filesystem to replay the log, and unmount it before
> re-running xfs_check. If you are unable to mount the filesystem, then use
> the xfs_repair -L option to destroy the log and attempt a repair.
> Note that destroying the log may cause corruption -- please attempt a mount
> of the filesystem before doing this.
> haze:~# mount /scratch/
> haze:~# umount /scratch/
> haze:~# xfs_check /dev/video_vg/video_lv
>
> Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
> haze kernel: Bad page state in process 'xfs_db'

I think we can safely say that your system is hosed at this point ;)

> ugh. Try again
> haze:~# xfs_check /dev/video_vg/video_lv
> haze:~#

Zero output means no on-disk corruption was found. Everything is consistent
on disk, so that seems to indicate something in memory has been crispy fried
by the suspend/resume.

> Dave, I ran xfs_check -v... but I got bored when it reached 122M of bz2
> compressed output with no sign of stopping... still got it if it's any
> use...

No, not useful. It's a log of every operation it does and so is really only
useful for debugging xfs_check problems ;)

> I then rebooted and ran a repair which didn't show any damage.

Not surprising, as your first check showed no damage.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
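[Editorial note: Dave's advice boils down to a three-step procedure - replay the log by mounting and unmounting, run xfs_check on the quiesced device, and only reach for xfs_repair if the check reports damage. A sketch of that sequence, dry-run by default so it can be read without touching a real device; the device and mount point are the ones from this thread but are placeholders here:]

```shell
#!/bin/sh
# Sketch of the check procedure described above. Prints the commands by
# default; set DRY_RUN=0 to execute them (requires root and the real device).
DEV=${DEV:-/dev/video_vg/video_lv}   # LV from the thread, used as a placeholder
MNT=${MNT:-/scratch}
DRY_RUN=${DRY_RUN:-1}

do_cmd() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$1"; else sh -c "$1"; fi
}

# 1. Mount and unmount once so XFS replays its log.
do_cmd "mount $MNT"
do_cmd "umount $MNT"
# 2. Check the now-clean filesystem; no output means no on-disk damage.
do_cmd "xfs_check $DEV"
# 3. Only if xfs_check complains, repair. Per the thread, avoid -L unless
#    the filesystem cannot be mounted at all.
# do_cmd "xfs_repair $DEV"
```

Note the order matters: running xfs_check with a dirty log produces the ERROR message David hit rather than a real verdict.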
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
Rafael J. Wysocki wrote:
>> This is on 2.6.22-rc5
>
> Is Tejun's patch
>
> http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.22-rc5/patches/30-block-always-requeue-nonfs-requests-at-the-front.patch
>
> applied on top of that?

2.6.22-rc5 includes it.
(But when I was testing rc4, I did apply this patch.)

David
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
Tejun Heo wrote:
> Hello,

again...

> David Greaves wrote:
>>> Good :)
>> Now, not so good :)
>
> Oh, crap. :-)
>
>> So I hibernated last night and resumed this morning.
>> Before hibernating I froze and sync'ed. After resume I thawed it.
>> (Sorry Dave)
>>
>> Here are some photos of the screen during resume. This is not 100%
>> reproducible - it seems to occur only if the system is shut down for
>> 30 mins or so.
>>
>> Tejun, I wonder if error handling during resume is problematic? I got
>> the same errors in 2.6.21. I have never seen these (or any other libata)
>> errors other than during resume.
>>
>> http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg
>> (hard to read, here's one from 2.6.21:
>> http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg)
>
> Your controller is repeatedly reporting PHY readiness changed exception.
> Are you reading the system image from the device attached to the first
> SATA port?

Yes, if you mean 1st as in the one after the zero-th ...

resume=/dev/sdb4

haze:~# swapon -s
Filename     Type       Size     Used  Priority
/dev/sdb4    partition  1004020  0     -1

dmesg snippet below... sda is part of the /scratch xfs array though. SMART
doesn't show any problems and of course all is well other than during a
resume.

sda/b are on sata_sil (a cheap plugin pci card)

>> I _think_ I've only seen the xfs problem when a resume shows these errors.
>
> The error handling itself tries very hard to ensure that there is no data
> corruption in case of errors. All commands which experience exceptions
> are retried but if the drive itself is doing something stupid, there's
> only so much the driver can do.
>
> How reproducible is the problem? Does the problem go away or occur more
> often if you change the drive you write the memory image to?

I don't think there should be activity on the sda drive during resume
itself.

[I broke my / md mirror and am using some of that for swap/resume for now]

I did change the swap/resume device to sdd2 (different controller, onboard
sata_via) and there was no EH during resume. The system seemed OK, wrote a
few Gb of video and did a kernel compile.
I repeated this test, no EH during resume, no problems.
I even ran xfs_fsr, the defragment utility, to stress the fs.

I'll retain this configuration and try again tonight but it looks like
there _may_ be a link between EH during resume and my problems...

Of course, I don't understand why it *should* EH during resume, it doesn't
during boot or normal operation...

Any more tests you'd like me to try?

David

dmesg snippet...

sata_sil :00:0a.0: version 2.2
ACPI: PCI Interrupt :00:0a.0[A] -> GSI 16 (level, low) -> IRQ 18
scsi0 : sata_sil
PM: Adding info for No Bus:host0
scsi1 : sata_sil
PM: Adding info for No Bus:host1
ata1: SATA max UDMA/100 cmd 0xf881e080 ctl 0xf881e08a bmdma 0xf881e000 irq 0
ata2: SATA max UDMA/100 cmd 0xf881e0c0 ctl 0xf881e0ca bmdma 0xf881e008 irq 0
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: ATA-7: Maxtor 6B200M0, BANC1980, max UDMA/100
ata1.00: 390721968 sectors, multi 0: LBA48
ata1.00: configured for UDMA/100
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: ata_hpa_resize 1: sectors = 312581808, hpa_sectors = 312581808
ata2.00: ATA-6: ST3160023AS, 3.18, max UDMA/133
ata2.00: 312581808 sectors, multi 0: LBA48
ata2.00: ata_hpa_resize 1: sectors = 312581808, hpa_sectors = 312581808
ata2.00: configured for UDMA/100
PM: Adding info for No Bus:target0:0:0
scsi 0:0:0:0: Direct-Access ATA Maxtor 6B200M0 BANC PQ: 0 ANSI: 5
PM: Adding info for scsi:0:0:0:0
sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
PM: Adding info for No Bus:target1:0:0
scsi 1:0:0:0: Direct-Access ATA ST3160023AS 3.18 PQ: 0 ANSI: 5
PM: Adding info for scsi:1:0:0:0
sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sdb: sdb1 sdb2 sdb3 sdb4
sd 1:0:0:0: [sdb] Attached SCSI disk
sd 1:0:0:0: Attached scsi generic sg1 type 0
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
On Tuesday, 19 June 2007 11:24, David Greaves wrote:
> David Greaves wrote:
> > I'm going to have to do some more testing...
> done
>
> David Chinner wrote:
> > On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote:
> >> David Greaves wrote:
> >> So doing:
> >> xfs_freeze -f /scratch
> >> sync
> >> echo platform > /sys/power/disk
> >> echo disk > /sys/power/state
> >> # resume
> >> xfs_freeze -u /scratch
> >>
> >> Works (for now - more usage testing tonight)
> >
> > Verrry interesting.
>
> Good :)
> Now, not so good :)
>
> > What you were seeing was an XFS shutdown occurring because the free space
> > btree was corrupted. IOWs, the process of suspend/resume has resulted
> > in either bad data being written to disk, the correct data not being
> > written to disk, or the cached block being corrupted in memory.
>
> That's the kind of thing I was suspecting, yes.
>
> > If you run xfs_check on the filesystem after it has shut down after a
> > resume, can you tell us if it reports on-disk corruption? Note: do not
> > run xfs_repair to check this - it does not check the free space btrees;
> > instead it simply rebuilds them from scratch. If xfs_check reports an
> > error, then run xfs_repair to fix it up.
>
> OK, I can try this tonight...
>
> This is on 2.6.22-rc5

Is Tejun's patch

http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.22-rc5/patches/30-block-always-requeue-nonfs-requests-at-the-front.patch

applied on top of that?

Rafael
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
Hello,

David Greaves wrote:
>> Good :)
> Now, not so good :)

Oh, crap. :-)

> So I hibernated last night and resumed this morning.
> Before hibernating I froze and sync'ed. After resume I thawed it.
> (Sorry Dave)
>
> Here are some photos of the screen during resume. This is not 100%
> reproducible - it seems to occur only if the system is shut down for
> 30 mins or so.
>
> Tejun, I wonder if error handling during resume is problematic? I got
> the same errors in 2.6.21. I have never seen these (or any other libata)
> errors other than during resume.
>
> http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg
> (hard to read, here's one from 2.6.21:
> http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg)

Your controller is repeatedly reporting PHY readiness changed exception.
Are you reading the system image from the device attached to the first
SATA port?

> I _think_ I've only seen the xfs problem when a resume shows these errors.

The error handling itself tries very hard to ensure that there is no data
corruption in case of errors. All commands which experience exceptions are
retried but if the drive itself is doing something stupid, there's only so
much the driver can do.

How reproducible is the problem? Does the problem go away or occur more
often if you change the drive you write the memory image to?

--
tejun
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
David Greaves wrote:
> I'm going to have to do some more testing...

done

David Chinner wrote:
> On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote:
>> David Greaves wrote:
>> So doing:
>> xfs_freeze -f /scratch
>> sync
>> echo platform > /sys/power/disk
>> echo disk > /sys/power/state
>> # resume
>> xfs_freeze -u /scratch
>>
>> Works (for now - more usage testing tonight)
>
> Verrry interesting.

Good :)
Now, not so good :)

> What you were seeing was an XFS shutdown occurring because the free space
> btree was corrupted. IOWs, the process of suspend/resume has resulted in
> either bad data being written to disk, the correct data not being written
> to disk, or the cached block being corrupted in memory.

That's the kind of thing I was suspecting, yes.

> If you run xfs_check on the filesystem after it has shut down after a
> resume, can you tell us if it reports on-disk corruption? Note: do not run
> xfs_repair to check this - it does not check the free space btrees;
> instead it simply rebuilds them from scratch. If xfs_check reports an
> error, then run xfs_repair to fix it up.

OK, I can try this tonight...

This is on 2.6.22-rc5

So I hibernated last night and resumed this morning.
Before hibernating I froze and sync'ed. After resume I thawed it. (Sorry
Dave)

Here are some photos of the screen during resume. This is not 100%
reproducible - it seems to occur only if the system is shut down for 30 mins
or so.

Tejun, I wonder if error handling during resume is problematic? I got the
same errors in 2.6.21. I have never seen these (or any other libata) errors
other than during resume.

http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg
(hard to read, here's one from 2.6.21:
http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg)

I _think_ I've only seen the xfs problem when a resume shows these errors.

Ok, to try and cause a problem I ran a make and got this back at once:

make: stat: Makefile: Input/output error
make: stat: clean: Input/output error
make: *** No rule to make target `clean'. Stop.
make: stat: GNUmakefile: Input/output error
make: stat: makefile: Input/output error

I caught the first dmesg this time:

Filesystem "dm-0": XFS internal error xfs_btree_check_sblock at line 334 of
file fs/xfs/xfs_btree.c. Caller 0xc01b58e1
 [] show_trace_log_lvl+0x1a/0x30
 [] show_trace+0x12/0x20
 [] dump_stack+0x15/0x20
 [] xfs_error_report+0x4f/0x60
 [] xfs_btree_check_sblock+0x56/0xd0
 [] xfs_alloc_lookup+0x181/0x390
 [] xfs_alloc_lookup_le+0x16/0x20
 [] xfs_free_ag_extent+0x51/0x690
 [] xfs_free_extent+0xa4/0xc0
 [] xfs_bmap_finish+0x119/0x170
 [] xfs_itruncate_finish+0x23a/0x3a0
 [] xfs_inactive+0x482/0x500
 [] xfs_fs_clear_inode+0x34/0xa0
 [] clear_inode+0x57/0xe0
 [] generic_delete_inode+0xe5/0x110
 [] generic_drop_inode+0x167/0x1b0
 [] iput+0x5f/0x70
 [] do_unlinkat+0xdf/0x140
 [] sys_unlink+0x10/0x20
 [] syscall_call+0x7/0xb
 ===
xfs_force_shutdown(dm-0,0x8) called from line 4258 of file fs/xfs/xfs_bmap.c.
Return address = 0xc021101e
Filesystem "dm-0": Corruption of in-memory data detected. Shutting down
filesystem: dm-0
Please umount the filesystem, and rectify the problem(s)

so I cd'ed out of /scratch and umounted.

I then tried the xfs_check.

haze:~# xfs_check /dev/video_vg/video_lv
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_check. If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
haze:~# mount /scratch/
haze:~# umount /scratch/
haze:~# xfs_check /dev/video_vg/video_lv

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Bad page state in process 'xfs_db'
Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: page:c1767bc0 flags:0x80010008 mapping: mapcount:-64 count:0
Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Trying to fix it up, but a reboot is needed
Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Backtrace:
Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Bad page state in process 'syslogd'
Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: page:c1767cc0 flags:0x80010008 mapping: mapcount:-64 count:0
Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Trying to fix it up, but a reboot is needed
Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Backtrace:

ugh. Try again

haze:~# xfs_check /dev/video_vg/video_lv
haze:~#

whilst running, top reported this as roughly the peak memory usage:

8759 root 18 0 479m 474m 876 R 2.0 46.9 0:02.49 xfs_db

so it looks like it didn't run out of memory (machine has 1Gb).

Dave, I ran xfs_check -v... but