Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-07-02 Thread Rafael J. Wysocki
On Monday, 2 July 2007 18:36, David Greaves wrote:
> Rafael J. Wysocki wrote:
> > On Monday, 2 July 2007 16:32, David Greaves wrote:
> >> Rafael J. Wysocki wrote:
> >>> On Monday, 2 July 2007 12:56, Tejun Heo wrote:
> >>>> David Greaves wrote:
> >>>>>> Tejun Heo wrote:
> >>>>>>> It's really weird tho.  The PHY RDY status changed events are coming
> >>>>>>> from the device which is NOT used while resuming
> >>>>> There is an obvious problem there though Tejun (the errors even when sda
> >>>>> isn't involved in the OS boot) - can I start another thread about that
> >>>>> issue/bug later? I need to reshuffle partitions so I'd rather get the
> >>>>> hibernate working first and then go back to it if that's OK?
> >>>> Yeah, sure.  The problem is that we don't know whether or how those two
> >>>> are related.  It would be great if there's a way to verify memory image
> >>>> read from hibernation is intact.  Rafael, any ideas?
> >>> Well, s2disk has an option to compute an MD5 checksum of the image during
> >>> the hibernation and verify it while reading the image.
> >> (Assuming you mean the mainline version)
> >>
> >> Sounds like a good thing to try next...
> >> Couldn't see anything on this in ../Documentation/power/*
> >> How do I enable it?
> > 
> > Add 'compute checksum = y' to the s2disk's configuration file.
> 
> Ah, right - that's uswsusp isn't it? Which isn't what I'm having problems 
> with 
> AFAIK?
> 
> My suspend procedure is:
> 
> xfs_freeze -f /scratch
> sync
> echo platform > /sys/power/disk
> echo disk > /sys/power/state
> xfs_freeze -u /scratch
> 
> Which should work (actually it should work without the sync/xfs_freeze too).
> 
> So to debug the problem I'd like to minimally extend this process rather than 
> replace it with another approach.

Well, this is not entirely "another approach".  Only the saving of the image is
done differently, the rest is the same.

> I take it there isn't an 'echo y > /sys/power/do_image_checksum'?

No, there is not anything like that.

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-07-02 Thread David Greaves

Rafael J. Wysocki wrote:
> On Monday, 2 July 2007 16:32, David Greaves wrote:
>> Rafael J. Wysocki wrote:
>>> On Monday, 2 July 2007 12:56, Tejun Heo wrote:
>>>> David Greaves wrote:
>>>>>> Tejun Heo wrote:
>>>>>>> It's really weird tho.  The PHY RDY status changed events are coming
>>>>>>> from the device which is NOT used while resuming
>>>>> There is an obvious problem there though Tejun (the errors even when sda
>>>>> isn't involved in the OS boot) - can I start another thread about that
>>>>> issue/bug later? I need to reshuffle partitions so I'd rather get the
>>>>> hibernate working first and then go back to it if that's OK?
>>>> Yeah, sure.  The problem is that we don't know whether or how those two
>>>> are related.  It would be great if there's a way to verify memory image
>>>> read from hibernation is intact.  Rafael, any ideas?
>>> Well, s2disk has an option to compute an MD5 checksum of the image during
>>> the hibernation and verify it while reading the image.
>> (Assuming you mean the mainline version)
>>
>> Sounds like a good thing to try next...
>> Couldn't see anything on this in ../Documentation/power/*
>> How do I enable it?
>
> Add 'compute checksum = y' to the s2disk's configuration file.

Ah, right - that's uswsusp isn't it? Which isn't what I'm having problems with
AFAIK?


My suspend procedure is:

xfs_freeze -f /scratch
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
xfs_freeze -u /scratch

Which should work (actually it should work without the sync/xfs_freeze too).

So to debug the problem I'd like to minimally extend this process rather than 
replace it with another approach.
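
For anyone wanting to repeat the procedure above, here is a minimal sketch of it as a script. The five commands are the ones listed; the names `MOUNTPOINT`, `DRY_RUN`, `run` and `hibernate_cycle` are made up for illustration, and the dry-run default is an assumption added so the script can be read and tried without hibernating anything.

```shell
#!/bin/sh
# Sketch of the freeze / hibernate / thaw sequence listed above.
# MOUNTPOINT, DRY_RUN, run and hibernate_cycle are names invented for this
# example; only the five commands themselves come from the thread.
MOUNTPOINT="${MOUNTPOINT:-/scratch}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands (the default)

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

hibernate_cycle() {
    # Do not hibernate if the freeze itself failed.
    run xfs_freeze -f "$MOUNTPOINT" || return 1
    run sync
    run sh -c 'echo platform > /sys/power/disk'
    # The write below returns only after the machine has resumed.
    run sh -c 'echo disk > /sys/power/state'
    # Thaw regardless of how the hibernate step went.
    run xfs_freeze -u "$MOUNTPOINT"
}

hibernate_cycle
```

Run as is, it only prints the commands; running it as root with `DRY_RUN=0` would execute them. Keeping the thaw unconditional means a failed hibernate attempt should not leave /scratch frozen.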


I take it there isn't an 'echo y > /sys/power/do_image_checksum'?

David




Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-07-02 Thread Rafael J. Wysocki
On Monday, 2 July 2007 16:32, David Greaves wrote:
> Rafael J. Wysocki wrote:
> > On Monday, 2 July 2007 12:56, Tejun Heo wrote:
> >> David Greaves wrote:
> >>>> Tejun Heo wrote:
> >>>>> It's really weird tho.  The PHY RDY status changed events are coming
> >>>>> from the device which is NOT used while resuming
> >>> There is an obvious problem there though Tejun (the errors even when sda
> >>> isn't involved in the OS boot) - can I start another thread about that
> >>> issue/bug later? I need to reshuffle partitions so I'd rather get the
> >>> hibernate working first and then go back to it if that's OK?
> >> Yeah, sure.  The problem is that we don't know whether or how those two
> >> are related.  It would be great if there's a way to verify memory image
> >> read from hibernation is intact.  Rafael, any ideas?
> > 
> > Well, s2disk has an option to compute an MD5 checksum of the image during
> > the hibernation and verify it while reading the image.
> (Assuming you mean the mainline version)
> 
> Sounds like a good thing to try next...
> Couldn't see anything on this in ../Documentation/power/*
> How do I enable it?

Add 'compute checksum = y' to the s2disk's configuration file.
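
A minimal sketch of making that change from a script, assuming the s2disk configuration lives in a plain-text file of `name = value` lines. The helper name `enable_checksum` is invented for this example, and the file path is whatever your installation uses (/etc/suspend.conf is only an assumption here).

```shell
#!/bin/sh
# enable_checksum is a helper name invented for this example.
# Usage: enable_checksum /path/to/suspend.conf
enable_checksum() {
    conf=$1
    tmp=$(mktemp) || return 1
    # Keep every existing line except an earlier "compute checksum" setting.
    grep -v '^[[:space:]]*compute checksum' "$conf" > "$tmp" 2>/dev/null
    echo 'compute checksum = y' >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Calling it twice leaves a single `compute checksum = y` line, and an existing `compute checksum = n` is replaced rather than duplicated.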

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-07-02 Thread David Greaves

Rafael J. Wysocki wrote:
> On Monday, 2 July 2007 12:56, Tejun Heo wrote:
>> David Greaves wrote:
>>>> Tejun Heo wrote:
>>>>> It's really weird tho.  The PHY RDY status changed events are coming
>>>>> from the device which is NOT used while resuming
>>> There is an obvious problem there though Tejun (the errors even when sda
>>> isn't involved in the OS boot) - can I start another thread about that
>>> issue/bug later? I need to reshuffle partitions so I'd rather get the
>>> hibernate working first and then go back to it if that's OK?
>> Yeah, sure.  The problem is that we don't know whether or how those two
>> are related.  It would be great if there's a way to verify memory image
>> read from hibernation is intact.  Rafael, any ideas?
>
> Well, s2disk has an option to compute an MD5 checksum of the image during
> the hibernation and verify it while reading the image.

(Assuming you mean the mainline version)

Sounds like a good thing to try next...
Couldn't see anything on this in ../Documentation/power/*
How do I enable it?

> Still, s2disk/resume
> aren't very easy to install and configure ...


I have it working fine on 2 other machines now so that doesn't appear to be a 
problem.


David


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-07-02 Thread Rafael J. Wysocki
On Monday, 2 July 2007 12:56, Tejun Heo wrote:
> David Greaves wrote:
> >> Tejun Heo wrote:
> >>> It's really weird tho.  The PHY RDY status changed events are coming
> >>> from the device which is NOT used while resuming
> > 
> > There is an obvious problem there though Tejun (the errors even when sda
> > isn't involved in the OS boot) - can I start another thread about that
> > issue/bug later? I need to reshuffle partitions so I'd rather get the
> > hibernate working first and then go back to it if that's OK?
> 
> Yeah, sure.  The problem is that we don't know whether or how those two
> are related.  It would be great if there's a way to verify memory image
> read from hibernation is intact.  Rafael, any ideas?

Well, s2disk has an option to compute an MD5 checksum of the image during
the hibernation and verify it while reading the image.  Still, s2disk/resume
aren't very easy to install  and configure ...
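
To illustrate the principle only (record a digest when the image is written, compare it when the image is read back), here is a small sketch using `md5sum` on ordinary files. It is not how s2disk stores or checks its checksum; `save_with_md5` and `verify_md5` are names invented for this example.

```shell
#!/bin/sh
# Illustration only: this is not how s2disk stores its checksum.
# save_with_md5 and verify_md5 are names invented for this example.
save_with_md5() {
    # $1 = source data, $2 = destination "image"
    cp "$1" "$2" || return 1
    md5sum < "$1" | cut -d ' ' -f 1 > "$2.md5"
}

verify_md5() {
    # Succeeds only if the image still matches the digest recorded at save time.
    [ "$(md5sum < "$1" | cut -d ' ' -f 1)" = "$(cat "$1.md5")" ]
}
```

A mismatch on read would point at the image having changed between write and read, which is the question raised above.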

Greetings,
Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-07-02 Thread Tejun Heo
David Greaves wrote:
>> Tejun Heo wrote:
>>> It's really weird tho.  The PHY RDY status changed events are coming
>>> from the device which is NOT used while resuming
> 
> There is an obvious problem there though Tejun (the errors even when sda
> isn't involved in the OS boot) - can I start another thread about that
> issue/bug later? I need to reshuffle partitions so I'd rather get the
> hibernate working first and then go back to it if that's OK?

Yeah, sure.  The problem is that we don't know whether or how those two
are related.  It would be great if there's a way to verify memory image
read from hibernation is intact.  Rafael, any ideas?

Thanks.

-- 
tejun


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-29 Thread David Greaves

David Greaves wrote:
> been away, back now...

again...

David Greaves wrote:
> When I move the swap/resume partition to a different controller (ie when
> I broke the / mirror and used the freed space) the problem seems to go
> away.

No, it's not gone away - but it's taking longer to show up.
I can try and put together a test loop that does work, hibernates, resumes and 
repeats but since I know it crashes at some point there doesn't seem much point 
unless I'm looking for something.
There's not much in the logs - is there any other instrumentation that people 
could suggest?
DaveC, given this is happening without (obvious) libata errors do you think it 
may be something in the XFS/md/hibernate area?


If there's anything to be tried then I'll also move to 2.6.22-rc6.


> Tejun Heo wrote:
>> It's really weird tho.  The PHY RDY status changed events are coming
>> from the device which is NOT used while resuming

There is an obvious problem there though Tejun (the errors even when sda isn't 
involved in the OS boot) - can I start another thread about that issue/bug 
later? I need to reshuffle partitions so I'd rather get the hibernate working 
first and then go back to it if that's OK?


David



Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-21 Thread David Greaves

been away, back now...

Tejun Heo wrote:
> David Greaves wrote:
>> Tejun Heo wrote:
>>> How reproducible is the problem?  Does the problem go away or occur more
>>> often if you change the drive you write the memory image to?
>>
>> I don't think there should be activity on the sda drive during resume
>> itself.
>>
>> [I broke my / md mirror and am using some of that for swap/resume for now]
>>
>> I did change the swap/resume device to sdd2 (different controller,
>> onboard sata_via) and there was no EH during resume. The system seemed
>> OK, wrote a few Gb of video and did a kernel compile.
>> I repeated this test, no EH during resume, no problems.
>> I even ran xfs_fsr, the defragment utility, to stress the fs.
>>
>> I retain this configuration and try again tonight but it looks like
>> there _may_ be a link between EH during resume and my problems...

Having retained this new configuration for a couple of days now I haven't had
any problems.

This is good but not really ideal since / isn't mirrored anymore :(

>> Of course, I don't understand why it *should* EH during resume, it
>> doesn't during boot or normal operation...
>
> EH occurs during boot, suspend and resume all the time.  It just runs in
> quiet mode to avoid disturbing the users too much.  In your case, EH is
> kicking in due to actual exception conditions so it's being verbose to
> give clue about what's going on.

I was trying to say that I don't actually see any errors being handled in normal
operation.
I'm not sure if you are saying that these PHY RDY events are normally handled
quietly (which would explain it).

> It's really weird tho.  The PHY RDY status changed events are coming
> from the device which is NOT used while resuming

yes - but the erroring device which is not being used is on the same controller
as the device with the in-use resume partition.

> and it's before any
> actual PM events are triggered.  Your kernel just boots, swsusp realizes
> it's resuming and tries to read memory image from the swap device.

yes

> While reading, the disk controller raises consecutive PHY readiness
> changed interrupts.  EH recovers them alright but the end result seems
> to indicate that the loaded image is corrupt.

Yes, that's consistent with what I'm seeing.

When I move the swap/resume partition to a different controller (ie when I broke
the / mirror and used the freed space) the problem seems to go away.


I am seeing messages in dmesg though:
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ata1.00: configured for UDMA/100
ata2.00: revalidation failed (errno=-2)
ata2: failed to recover some devices, retrying in 5 secs
sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)

sd 0:0:0:0: resuming
sd 0:0:0:0: [sda] Starting disk
ATA: abnormal status 0x7F on port 0x00019807
ATA: abnormal status 0x7F on port 0x00019007
ATA: abnormal status 0x7F on port 0x00019007
ATA: abnormal status 0x7F on port 0x00019807

ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ATA: abnormal status 0xD0 on port 0xf881e0c7
ata1.00: configured for UDMA/100
ata2.00: revalidation failed (errno=-2)
ata2: failed to recover some devices, retrying in 5 secs
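
One cheap bit of instrumentation for a hibernate/resume test loop would be to count lines like these after each resume. The sketch below does only that; `count_ata_warnings` is a name invented for this example and the patterns are copied from the messages quoted above, so other libata messages would need adding by hand.

```shell
#!/bin/sh
# count_ata_warnings is a name invented for this example; the patterns are
# copied from the dmesg lines quoted above.
count_ata_warnings() {
    # Reads kernel log text on stdin, prints the number of matching lines.
    grep -c -E 'abnormal status|revalidation failed|failed to recover some devices'
}

# Usage: dmesg | count_ata_warnings
```

Note that `grep -c` exits non-zero when the count is 0, so a caller running under `set -e` would want to allow for that.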



> So, there's no device suspend/resume code involved at all.  The kernel
> just booted and is trying to read data from the drive.  Please try with
> only the first drive attached and see what happens.

That's kinda hard; swap and root are on different drives...

Does it help that although the errors above appear, the system seems OK when I 
just use the other controller?


I have to be cautious what I do with this machine as it's the wife's active
desktop box.


David


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-20 Thread Tejun Heo
David Greaves wrote:
> Tejun Heo wrote:
>> Your controller is repeatedly reporting PHY readiness changed exception.
>>  Are you reading the system image from the device attached to the first
>> SATA port?
> 
> Yes if you mean 1st as in the one after the zero-th ...

I meant the first first (0th).

>> How reproducible is the problem?  Does the problem go away or occur more
>> often if you change the drive you write the memory image to?
> 
> I don't think there should be activity on the sda drive during resume
> itself.
> 
> [I broke my / md mirror and am using some of that for swap/resume for now]
> 
> I did change the swap/resume device to sdd2 (different controller,
> onboard sata_via) and there was no EH during resume. The system seemed
> OK, wrote a few Gb of video and did a kernel compile.
> I repeated this test, no EH during resume, no problems.
> I even ran xfs_fsr, the defragment utility, to stress the fs.
> 
> I retain this configuration and try again tonight but it looks like
> there _may_ be a link between EH during resume and my problems...
> 
> Of course, I don't understand why it *should* EH during resume, it
> doesn't during boot or normal operation...

EH occurs during boot, suspend and resume all the time.  It just runs in
quiet mode to avoid disturbing the users too much.  In your case, EH is
kicking in due to actual exception conditions so it's being verbose to
give clue about what's going on.

It's really weird tho.  The PHY RDY status changed events are coming
from the device which is NOT used while resuming and it's before any
actual PM events are triggered.  Your kernel just boots, swsusp realizes
it's resuming and tries to read memory image from the swap device.
While reading, the disk controller raises consecutive PHY readiness
changed interrupts.  EH recovers them alright but the end result seems
to indicate that the loaded image is corrupt.

So, there's no device suspend/resume code involved at all.  The kernel
just booted and is trying to read data from the drive.  Please try with
only the first drive attached and see what happens.

Thanks.

-- 
tejun


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread David Chinner
On Tue, Jun 19, 2007 at 10:24:23AM +0100, David Greaves wrote:
> David Greaves wrote:
> so I cd'ed out of /scratch and umounted.
> 
> I then tried the xfs_check.
> 
> haze:~# xfs_check /dev/video_vg/video_lv
> ERROR: The filesystem has valuable metadata changes in a log which needs to
> be replayed.  Mount the filesystem to replay the log, and unmount it before
> re-running xfs_check.  If you are unable to mount the filesystem, then use
> the xfs_repair -L option to destroy the log and attempt a repair.
> Note that destroying the log may cause corruption -- please attempt a mount
> of the filesystem before doing this.
> haze:~# mount /scratch/
> haze:~# umount /scratch/
> haze:~# xfs_check /dev/video_vg/video_lv
> 
> Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
> haze kernel: Bad page state in process 'xfs_db'

I think we can safely say that your system is hosed at this point ;)

> ugh. Try again
> haze:~# xfs_check /dev/video_vg/video_lv
> haze:~#

zero output means no on-disk corruption was found. Everything is
consistent on disk, so that seems to indicate something in memory has been
crispy fried by the suspend/resume
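
As a rough sketch of that rule of thumb (no output at all means nothing was found on disk), the helper below classifies whatever xfs_check printed. `interpret_xfs_check` is a name invented for this example; the device and mount point in the usage comment are the ones from the session quoted above.

```shell
#!/bin/sh
# interpret_xfs_check is a name invented for this example.
# It reads xfs_check output on stdin and applies the rule described above:
# empty output means no on-disk corruption was reported.
interpret_xfs_check() {
    report=$(cat)
    if [ -z "$report" ]; then
        echo "clean"
    else
        echo "problems-reported"
        # Pass the original report through on stderr for inspection.
        printf '%s\n' "$report" >&2
    fi
}

# Usage:
#   umount /scratch
#   mount /scratch && umount /scratch    # replay the log first
#   xfs_check /dev/video_vg/video_lv 2>&1 | interpret_xfs_check
```

The log-replay step matters because the "valuable metadata changes in a log" error is itself output and would otherwise be counted as a problem.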

> Dave, I ran xfs_check -v... but I got bored when it reached 122M of bz2 
> compressed output with no sign of stopping... still got it if it's any 
> use...

No, not useful. It's a log of every operation it does and so is really
only useful for debugging xfs-check problems ;)

> I then rebooted and ran a repair which didn't show any damage.

Not surprising as your first check showed no damage.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread David Greaves

Rafael J. Wysocki wrote:
>> This is on 2.6.22-rc5
>
> Is Tejun's patch
>
> http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.22-rc5/patches/30-block-always-requeue-nonfs-requests-at-the-front.patch
>
> applied on top of that?


2.6.22-rc5 includes it.

(but, when I was testing rc4, I did apply this patch)

David


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread David Greaves

Tejun Heo wrote:
> Hello,

again...

> David Greaves wrote:
>>> Good :)
>> Now, not so good :)
>
> Oh, crap.  :-)
>
>> So I hibernated last night and resumed this morning.
>> Before hibernating I froze and sync'ed. After resume I thawed it. (Sorry
>> Dave)
>>
>> Here are some photos of the screen during resume. This is not 100%
>> reproducible - it seems to occur only if the system is shutdown for
>> 30mins or so.
>>
>> Tejun, I wonder if error handling during resume is problematic? I got
>> the same errors in 2.6.21. I have never seen these (or any other libata)
>> errors other than during resume.
>>
>> http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg
>> (hard to read, here's one from 2.6.21
>> http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg
>
> Your controller is repeatedly reporting PHY readiness changed exception.
> Are you reading the system image from the device attached to the first
> SATA port?


Yes if you mean 1st as in the one after the zero-th ...

resume=/dev/sdb4
haze:~# swapon -s
Filename                                Type            Size    Used    Priority
/dev/sdb4                               partition       1004020 0       -1

dmesg snippet below...

sda is part of the /scratch xfs array though. SMART doesn't show any problems 
and of course all is well other than during a resume.
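
The resume= setting and the swap listing above can be cross-checked mechanically. This is only a sketch: it assumes resume= is given on the kernel command line and that the swap table has the /proc/swaps layout (one header line, device in the first column). `resume_device` and `is_active_swap` are names invented for this example.

```shell
#!/bin/sh
# resume_device and is_active_swap are names invented for this example.
resume_device() {
    # $1 = kernel command line text, e.g. the contents of /proc/cmdline
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^resume=//p'
}

is_active_swap() {
    # $1 = device, stdin = swap table in the /proc/swaps layout
    awk -v dev="$1" 'NR > 1 && $1 == dev { found = 1 } END { exit !found }'
}

# Usage:
#   dev=$(resume_device "$(cat /proc/cmdline)")
#   is_active_swap "$dev" < /proc/swaps && echo "$dev is active swap"
```

This only confirms which device the image is read from; it says nothing about which controller that device sits on.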


sda/b are on sata_sil (a cheap plugin pci card)




>> I _think_ I've only seen the xfs problem when a resume shows these errors.
>
> The error handling itself tries very hard to ensure that there is no
> data corruption in case of errors.  All commands which experience
> exceptions are retried but if the drive itself is doing something
> stupid, there's only so much the driver can do.
>
> How reproducible is the problem?  Does the problem go away or occur more
> often if you change the drive you write the memory image to?


I don't think there should be activity on the sda drive during resume itself.

[I broke my / md mirror and am using some of that for swap/resume for now]

I did change the swap/resume device to sdd2 (different controller, onboard 
sata_via) and there was no EH during resume. The system seemed OK, wrote a few 
Gb of video and did a kernel compile.

I repeated this test, no EH during resume, no problems.
I even ran xfs_fsr, the defragment utility, to stress the fs.

I retain this configuration and try again tonight but it looks like there _may_ 
be a link between EH during resume and my problems...


Of course, I don't understand why it *should* EH during resume, it doesn't 
during boot or normal operation...


Any more tests you'd like me to try?

David


dmesg snippet...

sata_sil :00:0a.0: version 2.2
ACPI: PCI Interrupt :00:0a.0[A] -> GSI 16 (level, low) -> IRQ 18
scsi0 : sata_sil
PM: Adding info for No Bus:host0
scsi1 : sata_sil
PM: Adding info for No Bus:host1
ata1: SATA max UDMA/100 cmd 0xf881e080 ctl 0xf881e08a bmdma 0xf881e000 irq 0
ata2: SATA max UDMA/100 cmd 0xf881e0c0 ctl 0xf881e0ca bmdma 0xf881e008 irq 0
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: ATA-7: Maxtor 6B200M0, BANC1980, max UDMA/100
ata1.00: 390721968 sectors, multi 0: LBA48
ata1.00: configured for UDMA/100
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: ata_hpa_resize 1: sectors = 312581808, hpa_sectors = 312581808
ata2.00: ATA-6: ST3160023AS, 3.18, max UDMA/133
ata2.00: 312581808 sectors, multi 0: LBA48
ata2.00: ata_hpa_resize 1: sectors = 312581808, hpa_sectors = 312581808
ata2.00: configured for UDMA/100
PM: Adding info for No Bus:target0:0:0
scsi 0:0:0:0: Direct-Access ATA  Maxtor 6B200M0   BANC PQ: 0 ANSI: 5
PM: Adding info for scsi:0:0:0:0
sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

 sda: sda1
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:0:0: Attached scsi generic sg0 type 0
PM: Adding info for No Bus:target1:0:0
scsi 1:0:0:0: Direct-Access ATA  ST3160023AS  3.18 PQ: 0 ANSI: 5
PM: Adding info for scsi:1:0:0:0
sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 1:0:0:0: [sdb] 312581808 512-byte hardware sectors (160042 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

 sdb: sdb1 sdb2 sdb3 sdb4
sd 1:0:0:0: [sdb] Attached SCSI disk
sd 1:0:0:0: Attached scsi generic sg1 type 0

Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread Rafael J. Wysocki
On Tuesday, 19 June 2007 11:24, David Greaves wrote:
> David Greaves wrote:
> > I'm going to have to do some more testing...
> done
> 
> 
> > David Chinner wrote:
> >> On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote:
> >>> David Greaves wrote:
> >>> So doing:
> >>> xfs_freeze -f /scratch
> >>> sync
> >>> echo platform > /sys/power/disk
> >>> echo disk > /sys/power/state
> >>> # resume
> >>> xfs_freeze -u /scratch
> >>>
> >>> Works (for now - more usage testing tonight)
> >>
> >> Verrry interesting.
> > Good :)
> Now, not so good :)
> 
> 
> >> What you were seeing was an XFS shutdown occurring because the free space
> >> btree was corrupted. IOWs, the process of suspend/resume has resulted
> >> in either bad data being written to disk, the correct data not being
> >> written to disk or the cached block being corrupted in memory.
> > That's the kind of thing I was suspecting, yes.
> > 
> >> If you run xfs_check on the filesystem after it has shut down after a
> >> resume, can you tell us if it reports on-disk corruption? Note: do not
> >> run xfs_repair to check this - it does not check the free space btrees;
> >> instead it simply rebuilds them from scratch. If xfs_check reports an
> >> error, then run xfs_repair to fix it up.
> > OK, I can try this tonight...
> 
> 
> This is on 2.6.22-rc5

Is Tejun's patch

http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.22-rc5/patches/30-block-always-requeue-nonfs-requests-at-the-front.patch

applied on top of that?

Rafael


-- 
"Premature optimization is the root of all evil." - Donald Knuth


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread Tejun Heo
Hello,

David Greaves wrote:
>> Good :)
> Now, not so good :)

Oh, crap.  :-)

> So I hibernated last night and resumed this morning.
> Before hibernating I froze and sync'ed. After resume I thawed it. (Sorry
> Dave)
> 
> Here are some photos of the screen during resume. This is not 100%
> reproducible - it seems to occur only if the system is shutdown for
> 30mins or so.
> 
> Tejun, I wonder if error handling during resume is problematic? I got
> the same errors in 2.6.21. I have never seen these (or any other libata)
> errors other than during resume.
> 
> http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg
> (hard to read, here's one from 2.6.21
> http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg

Your controller is repeatedly reporting PHY readiness changed exception.
 Are you reading the system image from the device attached to the first
SATA port?

> I _think_ I've only seen the xfs problem when a resume shows these errors.

The error handling itself tries very hard to ensure that there is no
data corruption in case of errors.  All commands which experience
exceptions are retried but if the drive itself is doing something
stupid, there's only so much the driver can do.

How reproducible is the problem?  Does the problem go away or occur more
often if you change the drive you write the memory image to?

-- 
tejun


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread David Greaves

David Greaves wrote:
> I'm going to have to do some more testing...

done

> David Chinner wrote:
>> On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote:
>>> David Greaves wrote:
>>> So doing:
>>> xfs_freeze -f /scratch
>>> sync
>>> echo platform > /sys/power/disk
>>> echo disk > /sys/power/state
>>> # resume
>>> xfs_freeze -u /scratch
>>>
>>> Works (for now - more usage testing tonight)
>>
>> Verrry interesting.
> Good :)

Now, not so good :)

>> What you were seeing was an XFS shutdown occurring because the free space
>> btree was corrupted. IOWs, the process of suspend/resume has resulted
>> in either bad data being written to disk, the correct data not being
>> written to disk or the cached block being corrupted in memory.
> That's the kind of thing I was suspecting, yes.
>
>> If you run xfs_check on the filesystem after it has shut down after a
>> resume, can you tell us if it reports on-disk corruption? Note: do not
>> run xfs_repair to check this - it does not check the free space btrees;
>> instead it simply rebuilds them from scratch. If xfs_check reports an
>> error, then run xfs_repair to fix it up.
> OK, I can try this tonight...



This is on 2.6.22-rc5

So I hibernated last night and resumed this morning.
Before hibernating I froze and sync'ed. After resume I thawed it. (Sorry Dave)

Here are some photos of the screen during resume. This is not 100% reproducible
- it seems to occur only if the system is shutdown for 30mins or so.


Tejun, I wonder if error handling during resume is problematic? I got the same 
errors in 2.6.21. I have never seen these (or any other libata) errors other 
than during resume.


http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg
(hard to read, here's one from 2.6.21
http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg

I _think_ I've only seen the xfs problem when a resume shows these errors.


Ok, to try and cause a problem I ran a make and got this back at once:
make: stat: Makefile: Input/output error
make: stat: clean: Input/output error
make: *** No rule to make target `clean'.  Stop.
make: stat: GNUmakefile: Input/output error
make: stat: makefile: Input/output error


I caught the first dmesg this time:

Filesystem "dm-0": XFS internal error xfs_btree_check_sblock at line 334 of file fs/xfs/xfs_btree.c.  Caller 0xc01b58e1

 [] show_trace_log_lvl+0x1a/0x30
 [] show_trace+0x12/0x20
 [] dump_stack+0x15/0x20
 [] xfs_error_report+0x4f/0x60
 [] xfs_btree_check_sblock+0x56/0xd0
 [] xfs_alloc_lookup+0x181/0x390
 [] xfs_alloc_lookup_le+0x16/0x20
 [] xfs_free_ag_extent+0x51/0x690
 [] xfs_free_extent+0xa4/0xc0
 [] xfs_bmap_finish+0x119/0x170
 [] xfs_itruncate_finish+0x23a/0x3a0
 [] xfs_inactive+0x482/0x500
 [] xfs_fs_clear_inode+0x34/0xa0
 [] clear_inode+0x57/0xe0
 [] generic_delete_inode+0xe5/0x110
 [] generic_drop_inode+0x167/0x1b0
 [] iput+0x5f/0x70
 [] do_unlinkat+0xdf/0x140
 [] sys_unlink+0x10/0x20
 [] syscall_call+0x7/0xb
 ===
xfs_force_shutdown(dm-0,0x8) called from line 4258 of file fs/xfs/xfs_bmap.c.  Return address = 0xc021101e
Filesystem "dm-0": Corruption of in-memory data detected.  Shutting down filesystem: dm-0

Please umount the filesystem, and rectify the problem(s)

so I cd'ed out of /scratch and umounted.

I then tried the xfs_check.

haze:~# xfs_check /dev/video_vg/video_lv
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_check.  If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
haze:~# mount /scratch/
haze:~# umount /scratch/
haze:~# xfs_check /dev/video_vg/video_lv

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Bad page state in process 'xfs_db'

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: page:c1767bc0 flags:0x80010008 mapping: mapcount:-64 
count:0

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Trying to fix it up, but a reboot is needed

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Backtrace:

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Bad page state in process 'syslogd'

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: page:c1767cc0 flags:0x80010008 mapping: mapcount:-64 
count:0

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Trying to fix it up, but a reboot is needed

Message from [EMAIL PROTECTED] at Tue Jun 19 08:47:30 2007 ...
haze kernel: Backtrace:

ugh. Try again
haze:~# xfs_check /dev/video_vg/video_lv
haze:~#

whilst running a top reported this as roughly the peak memory usage:
 8759 root  18   0  479m 474m  876 R  2.0 46.9   0:02.49 xfs_db
so it looks like it didn't run out of memory (machine has 1Gb).

Dave, I ran xfs_check -v... but