I am the author of the page in question. To establish my credentials,
I wrote my first filesystem forensic tool in 1980, to diagnose and
repair a Unix filesystem that had been damaged by a misconfigured
kernel that swapped on top of the filesystem. That was when 10 MB disk
packs the size of garbage can lids cost $5000.
Since then I have written filesystem readers, writers, and forensic
tools for UFS, ext2, FAT12/16/32, ISO-9660, NFS, Mac HFS, romfs, and
JFFS2. I have studied, with an eye toward implementation, the data
structures for NTFS, UBIFS, cramfs, and squashfs.

da...@lang.hm wrote:
>
> so if the device is performing wear leveling, then the fact that your
> FAT is on the same eraseblock as your partition table should not
> matter in the least, since the wear leveling will avoid stressing any
> particlar part of the flash.

That would be true in a perfect world, but wear leveling is hard to do
perfectly. Relocating an erase block requires maintaining two copies of
it, as well as hidden metadata that tells you which copy is which, plus
a hidden allocation map. Updating all of these things in a way that
avoids catastrophic loss of the entire device (due to inconsistent
metadata) is tricky. Some FTLs get it (at least mostly) right, many
don't. FTL software is, after all, software, so obscure bugs are always
possible. Making hardware behave stably during power loss is triply
difficult.

I suspect, based on cryptic hints in various specs and standards that
I've read, that some FTLs have special optimizations for FAT
filesystems with the factory-supplied layout. If the FAT is in a known
"nice" location, you can apply different caching and wear leveling
policies to that known hot-spot, and perhaps even reduce the overall
metadata by using the FAT as part of the block-substitution metadata
for the data area. Many manufacturers couldn't care less about what
Linux hackers want to do - their market being ordinary users who stick
the device in a camera - so such "cheat" optimizations are fair game
from a business standpoint.

>
> as such I see no point in worrying about the partition table being on
> the same eraseblock as a frequently written item.

Many filesystem layouts can recover from damage to the allocation maps,
either automatically or with an offline tool.
It's possible to rebuild ext2 allocation bitmaps from inode and
directory information. For FAT filesystems, there's a backup FAT copy
that will at least let you roll back to a semi-consistent recent state.
But there's no redundant copy of the partition map or the BPB. If you
should lose one of those during a botched write, it's bye-bye to all
your data, barring mad forensic skills.

In stress testing of some "LBA NAND" devices, we saw several cases
where, after a fairly long period, the devices completely locked up and
lost the ability to read or rewrite the first block. I had done a bad
job of partitioning the test device, because I wasn't paying enough
attention when I created the test image. It's unclear what the results
would have been had the layout been better - the stress test takes
several weeks and the failures are statistical in nature - but I can't
help believing that, for a device with a known wear-out mechanism and
elaborate workarounds to hide that fact, working it harder than
necessary will reduce its lifetime and possibly trigger microcode bugs
that might otherwise cause no trouble.

>
> as for the block boundry not being an eraseblock boundry if the
> partition starts at block 1
>
> if you use 1k blocks and have 256k eraseblocks, then 1 out of every
> 256 writes will generate two erases instead of one
>
> worst case is you use 4k blocks and have 128k eraseblocks, at which
> point 1 out of every 32 writes will generate two erases instead of one.
>
> to use the intel terminology, these result in write amplification
> factors of approximatly 1.005 and 1.03 respectivly.
>
> neither of these qualify as a 'flash killer' in my mind.

The main amplification comes not from the erases, but from the writes.
If the "cluster/block space" begins in the middle of a FLASH page, then
a 1-block write will involve a read-modify-write of two adjacent pages.
That is four internal accesses instead of one.
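
The access counting above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not a model of
any real FTL: the 2048-byte page size is assumed for the example, the
block is taken to be the same size as a page, and the function names
are made up for this sketch. A write that covers a whole page is
counted as one access (a program); a write that covers only part of a
page is counted as two (a read plus a program).

```python
def pages_touched(offset, length, page_size):
    """Number of flash pages overlapped by a write of `length` bytes at `offset`."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


def internal_accesses(offset, length, page_size):
    """Count page-level accesses for one write.

    Assumption for this sketch: a fully covered page costs 1 access
    (program only); a partially covered page costs 2 (read, then program).
    """
    end = offset + length
    first = offset // page_size
    last = (end - 1) // page_size
    accesses = 0
    for page in range(first, last + 1):
        page_start = page * page_size
        page_end = page_start + page_size
        covers_whole_page = offset <= page_start and end >= page_end
        accesses += 1 if covers_whole_page else 2
    return accesses


PAGE = 2048  # assumed page size, for illustration only

aligned = internal_accesses(0, PAGE, PAGE)      # block starts on a page boundary
unaligned = internal_accesses(512, PAGE, PAGE)  # block starts 512 bytes into a page
print(aligned, unaligned)  # 1 4
```

With the block space shifted by even one sector, every block-sized
write in this model straddles two pages and pays for two
read-modify-write cycles.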
Each such access takes between 100 and 200 microseconds, depending on
the degree to which you can pipeline the accesses - and
read-modify-write is hard to pipeline. So the back-end throughput can
easily be reduced by a factor of 4 or even more.

The write-amplification factor is 2 by a trivial analysis, and it can
get worse if you factor in the requirement for writing the pages within
an erase block sequentially. The implied coupling between the two
spanned pages increases the difficulty of replacement-page allocation,
increasing the probability of garbage collection.

The erase amplification factor tracks the write amplification factor.
You must do at least one erase for every 64 writes, assuming perfect
efficiency of your page-reassignment algorithm and its metadata. Double
the writes, at least double the erases.

>
> now, if a FAT or superblock happens to span an eraseblock, then you
> will have a much more significant issue, but nothing that is said in
> this document refers to this problem (and in fact, it indicates that
> things like this follow the start of the partition very closely, which
> implies that unless the partition starts very close to the end of an
> eraseblock it's highly unlikely that these will span eraseblocks)
>
> so I still see this as crying wolf.

It has been my experience that USB sticks and SD cards with intact
factory formatting tend to last longer and run faster than ones that
have been reformatted with random layouts. I don't have hard numbers,
but I do have enough accumulated experience - I have over two dozen
FLASH devices within easy reach - to convince me that something
interesting is happening. And I know enough about how these things work
internally to convince me that aligned accesses inherently result in
less internal data traffic than unaligned accesses.
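
For anyone who wants to check a layout of their own, here is a rough
Python sketch of the alignment arithmetic. It is not taken from any
particular tool. It assumes 512-byte sectors, and the function names
are invented for this example. The 128 KiB erase block size is one of
the figures quoted above; the start sector of 63 is just an
illustrative unaligned value.

```python
SECTOR_SIZE = 512  # assumed sector size in bytes


def misalignment(start_sector, unit_bytes, sector_size=SECTOR_SIZE):
    """Bytes past the previous unit boundary at which the region starts (0 = aligned)."""
    return (start_sector * sector_size) % unit_bytes


def next_aligned_sector(start_sector, unit_bytes, sector_size=SECTOR_SIZE):
    """Round start_sector up to the next sector lying on a unit boundary.

    Assumes unit_bytes is a whole multiple of sector_size.
    """
    sectors_per_unit = unit_bytes // sector_size
    # Ceiling division using integer arithmetic only.
    return -(-start_sector // sectors_per_unit) * sectors_per_unit


ERASE_BLOCK = 128 * 1024  # 128 KiB, one of the sizes quoted above

start = 63  # illustrative unaligned start sector
print(misalignment(start, ERASE_BLOCK))         # 32256
print(next_aligned_sector(start, ERASE_BLOCK))  # 256
```

The same two functions work for page alignment by passing the page
size instead of the erase block size. The real page and erase block
sizes of a given device are usually not published, so any values
plugged in here are guesses.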
>
> as for ubifs, that is designed for when you have access to the raw
> flash, which is not the case for any device where you have a flash
> translation layer in place, so it is really only useful on embedded
> system, not on commercially available flash drives of any type.

Indeed. The page in question has nothing whatsoever to do with UBIFS.

_______________________________________________
Devel mailing list
Devel@lists.laptop.org
http://lists.laptop.org/listinfo/devel