On Thu, 24 July 2008, Andy Green wrote:
> Somebody in the thread at some point said:
> | Hi all,
> |
> | I can now reliably reproduce the issue, as dd'ing the mbr back to the
> | card so far restores sane behaviour :
> |
> | If sd_drive is set to "0", then after a resume from "sync && apm -s" the
> | MBR of my 4GB SanDisk is wiped - so far I haven't noticed any other
> | errors, but have not looked very closely.
> ...
>
> | PS: Can somebody please tell me how to re-initialize the card without
> | going through another suspend/resume cycle ?
>
> sd_drive setting isn't actually used until next time we access the card,
> so provoking an access will do it, eg, touch /something ; sync.
>
> But the two explanations for what goes on seem mixed still here, we
> affect sd_drive and we do a suspend. My guess / hope is that this
> problem is coming from the suspend action alone and the change of
> sd_drive is bogus here. Maybe you can bang on it a little more trying
> to disprove that hypothesis?
>
> -Andy

I think this is quite a good clue to what's really happening here, so
here is a quote from OLPC ticket #6532 (HTH):

>>>>>
cc dilinger added

I've spent some time digging deep into the bowels of the VFS and block
layer and gathering some debug output, and have an explanation for the
partition table corruption:

Upon coming out of resume, the SD code, with CONFIG_MMC_UNSAFE_SUSPEND
enabled, checks to see if there is a card plugged into the system and
whether that card is the same as the one that was plugged into the
system at suspend time. This is accomplished by reading the card ID of
the device and for some reason, very possibly #1339, we fail this
detection. In this case, the kernel removes the old device from the
system and in this execution path, the partition information for this
device is zeroed. Even though the device is removed, the device is
still mounted and upon unmount, ext2 syncs the superblock, even if the
file system is sync'd beforehand. The superblock is block 0 of the
partition and the block layer adds to this the partition start offset
before submitting the write to the lower layers. As the partition
information has already been zeroed out, we end up writing to block 0
of the disk itself, overwriting the partition table and the geometry
information. I've verified this by both gathering debug output and
'dd' + 'hexdump' of corrupted and uncorrupted media.

Some interesting points:

 * We are able to delete a block device even though it is still
   mounted.
 * Even though the device has been deleted, the write submitted to it
   does not fail.

Note that this is still not 100% reproducible, and in certain cases the
superblock write during unmount does fail with block I/O errors,
meaning that the queue is properly deleted. As per dilinger's comments
on IRC, the VFS has lots of refcounts and there is a timing issue/race
condition that we're hitting. As per #1339, we may be able to add an
OLPC-specific hack to wait 500ms or so upon resume to get around this.
I will try this, but I don't think this is acceptable given our
suspend/resume requirements.

Something I don't quite understand at the moment is how/when our
userland env (journal specifically, I think?) unmounts the device, as
I've been testing via command-line suspend, mount, and unmount while
running in console mode.

Next steps:

 * Get an understanding of what is happening with our userland and
   brainstorm with cjb about the possibility of simply unmounting the
   SD device upon suspend. There are issues around this, as we may have
   files open and that will keep us from suspending.
 * Test adding a timeout to the resume path to see if it solves our
   problem, to validate that it is indeed something related to our HW.
 * Dig into the unmount/write to non-existing bdev some more and
   discuss this upstream if needed.

(Adding dilinger to cc:)
<<<<<

/j
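
To make the mechanism in the quoted ticket easier to follow, here is a
minimal toy sketch in Python. It is not kernel code: the names Disk,
Partition and submit_write, and the start sector 63, are made up for
illustration. It only assumes what the ticket describes, namely that a
write to a partition is remapped by adding the partition's start offset,
and that the superblock is treated as block 0 of the partition.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    # Hypothetical stand-in for the kernel's per-partition bookkeeping.
    start_sector: int
    nr_sectors: int


@dataclass
class Disk:
    # Toy disk: maps absolute sector number -> contents.
    sectors: dict = field(default_factory=dict)

    def write(self, sector: int, data: bytes) -> None:
        self.sectors[sector] = data

    def read(self, sector: int) -> bytes:
        return self.sectors.get(sector, b"")


def submit_write(disk: Disk, part: Partition, rel_sector: int, data: bytes) -> int:
    """Remap a partition-relative write to an absolute disk sector."""
    abs_sector = part.start_sector + rel_sector
    disk.write(abs_sector, data)
    return abs_sector


disk = Disk()
disk.write(0, b"MBR")                      # partition table lives in sector 0
part = Partition(start_sector=63, nr_sectors=1000)

# Normal case: the superblock write lands inside the partition.
normal_target = submit_write(disk, part, 0, b"SUPERBLOCK")

# Failed card re-detection on resume: partition info is zeroed,
# but the file system is still mounted.
part.start_sector = 0
part.nr_sectors = 0

# Unmount syncs the superblock again; the offset added is now 0,
# so the write lands on sector 0 of the whole disk.
corrupt_target = submit_write(disk, part, 0, b"SUPERBLOCK")
mbr_after = disk.read(0)
```

In this toy model the second write replaces the "MBR" contents of
sector 0, which matches the symptom that dd'ing the MBR back restores
sane behaviour. The real block layer does more than this (bounds checks,
refcounts, request queues), so treat it as a picture of the arithmetic
only.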
_______________________________________________ Openmoko community mailing list community@lists.openmoko.org http://lists.openmoko.org/mailman/listinfo/community