Re: Status of ReiserFS + Journalling
On Thu, 5 Oct 2000, Neil Brown wrote:

> 2/ Arrange your filesystem so that you write new data to an otherwise
>    unused stripe a whole stripe at a time, and store some sort of
>    checksum in the stripe so that corruption can be detected. This
>    implies a log structured filesystem (though possibly you could come
>    close enough with a journalling or similar filesystem, I'm not
>    sure).

This will hose your performance if you're doing random reads/writes of
small chunks of data. It's better in that case to have the size that your
app/fs writes match the block size on a single disk, so that you don't
have to seek all the drives to the same cylinder every time you do a read
or write.
Re: Status of ReiserFS + Journalling
Jeremy Fitzhardinge wrote:
>
> On Thu, Oct 05, 2000 at 11:33:30AM +0200, Helge Hafting wrote:
> > A power failure might leave you with a corrupt disk block. That is
> > detectable (read failure) and you may then reconstruct it using the
> > rest of the stripe. This will get you data from either before
> > or after the update was supposed to happen.
>
> How would you be able to tell which disk contains the bad stripe?
> RAID reconstruction relies on knowing which disk to reconstruct because
> it's obviously bad - there's out-of-band information in the form of I/O
> errors. If you only have an incompletely updated stripe on a disk, you
> don't know which data to reconstruct from parity.

Correct. RAID won't help you if one disk is updated flawlessly but not
the others. It is a guard against disk breakdown only.

> I think the only way of doing this properly is either to have a
> battery-backed cache or to do journalling at the RAID level.

Isn't this something a journalling _fs_ is supposed to fix? You don't
really need journalling at the RAID level; the RAID should (on a dirty
startup) notice the dirtiness and check every stripe for correct parity,
either in a full RAID check or by using a degraded mode where every
stripe is read completely and checked before access. The RAID can then
inform the fs that the entire stripe is corrupt when parity is bad, and
the fs can fix this by replaying its journal (or by running fsck).

Helge Hafting
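As a concrete illustration of the check Helge describes - not the md
driver's actual resync code, just a toy user-space sketch - one can xor
the data blocks of a stripe and compare the result with the stored parity
block; a mismatch marks the stripe as suspect:

/* Toy illustration of the parity check described above: on an unclean
 * start, xor the data blocks of each stripe and compare the result with
 * the stored parity block.  This is only a sketch; the real md driver
 * works on buffer heads inside the kernel, not flat arrays. */
#include <stdio.h>
#include <string.h>

#define NDISKS   4          /* one of these holds parity for this stripe */
#define BLKSIZE  16         /* tiny blocks so the example stays readable */

/* Returns 1 if the stripe's parity is consistent, 0 if it is not. */
static int stripe_parity_ok(unsigned char blk[NDISKS][BLKSIZE], int parity_disk)
{
    unsigned char acc[BLKSIZE];
    memset(acc, 0, sizeof(acc));

    for (int d = 0; d < NDISKS; d++) {
        if (d == parity_disk)
            continue;
        for (int i = 0; i < BLKSIZE; i++)
            acc[i] ^= blk[d][i];
    }
    return memcmp(acc, blk[parity_disk], BLKSIZE) == 0;
}

int main(void)
{
    unsigned char stripe[NDISKS][BLKSIZE] = { { 0 } };

    /* Fill in some data and compute correct parity on disk 3. */
    memset(stripe[0], 0xaa, BLKSIZE);
    memset(stripe[1], 0x55, BLKSIZE);
    memset(stripe[2], 0x0f, BLKSIZE);
    for (int i = 0; i < BLKSIZE; i++)
        stripe[3][i] = stripe[0][i] ^ stripe[1][i] ^ stripe[2][i];

    printf("clean stripe: parity %s\n",
           stripe_parity_ok(stripe, 3) ? "ok" : "BAD");

    /* Simulate an interrupted update: the data on disk 0 changed, but
     * the matching parity update never made it to disk 3. */
    memset(stripe[0], 0xff, BLKSIZE);
    printf("torn stripe:  parity %s\n",
           stripe_parity_ok(stripe, 3) ? "ok" : "BAD");
    return 0;
}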
Re: Status of ReiserFS + Journalling
On Friday October 6, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > Suppose, for stripe X the parity device is device 1 and we were
> > updating the block on device 0 at the time of system failure.
> > What had happened was that the new parity block was written out, but
> > the new data block wasn't.
> > Suppose further that when the system comes back, device 2 has failed.
> > We now cannot recover the data that was on stripe X, device 2. If we
> > tried, we would xor all the blocks from working devices together and I
> > hope that you can see that this would be the wrong answer. This poor,
> > innocent, block, which hasn't been modified for years, has just been
> > corrupted. Not good for PR.
>
> Now that I'm getting better at thinking about this I can see that a very
> simple journal will protect from this particular problem. A phase-tree
> style approach would likely do the job more efficiently, once again.
> Here's the ultimate simple approach: why not treat an entire stripe as
> one block? That way you never get 'innocent blocks' on your stripe.

Yes... but...

There was a detail (one of many, probably) that I skipped in my brief
description of RAID5. Every block on the RAID5 array has a two-dimensional
address (discnumber, blocknumber). This needs to be mapped into the linear
address space expected by a filesystem (unless you have a clever filesystem
that understands two-dimensional addressing and copes with holes where the
parity blocks are).

Two extremes of ways to do this are:

   abc-     afk-
   de-f     bg-o
   g-hi     c-lp
   -jkl     -hmq
   mno-     din-
   pq-r     ej-r

where letters are logical block addresses, hyphens are parity blocks,
columns are drives, and rows are physical block numbers.

What is typically done is to define a cluster size, then address down the
drive for a cluster, and then step across to the next drive for the next
cluster, so with a cluster size of 3, the above array would be

   adg-
   beh-
   cfi-
   jm-p
   kn-q
   lo-r

(notice that the parity blocks come in clusters just like the data blocks).

There is a trade-off when choosing the cluster size. A cluster size of 1
(as in the very first picture above) means that any sequential access will
probably use all drives, so you should see an appropriate speed-up for
reads, and you might be able to avoid reading old data for writes (when you
write a whole stripe you don't need to read old data to calculate parity).
This is good if you have just a single thread accessing the array.

A large cluster size (e.g. 64k) means that most accesses will use only one
drive (for reads) or two drives (for writes - data + parity). This means
that multiple threads that access the array concurrently will not always be
tripping over each other (sometimes, but not always); this is called 'head
contention'.

There is a formula that I have seen, but cannot remember, which links
typical IO size and typical number of concurrent threads to the ideal
cluster size. Issues of drive geometry come into this too: if you are going
to read any of a track, you may as well read all of it, so having a cluster
size that is a multiple of the track size would be good (if only drives had
constant-sized tracks!).

Back to your idea. Having each stripe be one filesystem block means either
having large filesystem blocks (meaning lots of wastage) or having a
cluster size of 1. Unfortunately, with Linux software RAID, the minimum
cluster size is one page and the maximum filesystem block size is one page,
so we cannot try this out on Linux to see how it actually works.

My understanding of the way WAFL works is that it uses RAID4, so there are
no parity holes to worry about (RAID4 has all the parity on the one drive),
and WAFL knows about the 2-D structure. It tries to lay out whole files (or
large clusters of each file) onto one disk each, but hopes to have enough
files that need writing at any one time that it can write them all, one
onto each disc, and thus keep all the discs busy while writing, yet still
have reduced head contention when reading.

NeilBrown
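To make the addressing above concrete, here is a rough user-space sketch of
the cluster-based mapping from a filesystem's linear block number to a
(drive, physical block) pair. The parity rotation is chosen to match the
diagrams above; the real md driver supports several parity-placement
algorithms, so the exact layout here is an assumption, not a description of
the driver:

/* Sketch of mapping a filesystem's linear block number onto the
 * (drive, physical block) pairs of a RAID5 array, with a configurable
 * cluster ("chunk") size and rotating parity.  Illustration only. */
#include <stdio.h>

struct raid5_addr {
    int drive;      /* which member disk */
    long block;     /* physical block number on that disk */
};

static struct raid5_addr map_block(long logical, int ndisks, long cluster)
{
    long per_stripe = cluster * (ndisks - 1);  /* data blocks per stripe of clusters */
    long stripe     = logical / per_stripe;
    long within     = logical % per_stripe;
    int  data_col   = (int)(within / cluster); /* which data cluster in the stripe */
    long offset     = within % cluster;        /* offset inside that cluster */
    int  parity     = ndisks - 1 - (int)(stripe % ndisks); /* rotated parity disk */

    struct raid5_addr a;
    a.drive = data_col < parity ? data_col : data_col + 1;  /* skip the parity disk */
    a.block = stripe * cluster + offset;
    return a;
}

int main(void)
{
    /* Reproduce the "cluster size of 3" picture: 4 disks, blocks a..r. */
    for (long b = 0; b < 18; b++) {
        struct raid5_addr a = map_block(b, 4, 3);
        printf("logical %c -> disk %d, block %ld\n",
               (char)('a' + b), a.drive, a.block);
    }
    return 0;
}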
Re: Status of ReiserFS + Journalling
Neil Brown wrote:
> Suppose, for stripe X the parity device is device 1 and we were
> updating the block on device 0 at the time of system failure.
> What had happened was that the new parity block was written out, but
> the new data block wasn't.
> Suppose further that when the system comes back, device 2 has failed.
> We now cannot recover the data that was on stripe X, device 2. If we
> tried, we would xor all the blocks from working devices together and I
> hope that you can see that this would be the wrong answer. This poor,
> innocent, block, which hasn't been modified for years, has just been
> corrupted. Not good for PR.

Now that I'm getting better at thinking about this I can see that a very
simple journal will protect from this particular problem. A phase-tree
style approach would likely do the job more efficiently, once again.

Here's the ultimate simple approach: why not treat an entire stripe as one
block? That way you never get 'innocent blocks' on your stripe.

--
Daniel
Re: Status of ReiserFS + Journalling
On Thu, Oct 05, 2000 at 11:33:30AM +0200, Helge Hafting wrote:
> A power failure might leave you with a corrupt disk block. That is
> detectable (read failure) and you may then reconstruct it using the
> rest of the stripe. This will get you data from either before
> or after the update was supposed to happen.

How would you be able to tell which disk contains the bad stripe?
RAID reconstruction relies on knowing which disk to reconstruct because
it's obviously bad - there's out-of-band information in the form of I/O
errors. If you only have an incompletely updated stripe on a disk, you
don't know which data to reconstruct from parity.

I think the only way of doing this properly is either to have a
battery-backed cache or to do journalling at the RAID level.

	J
Re: Status of ReiserFS + Journalling
Neil Brown wrote:
>
> For RAID5 a 'stripe' is a set of blocks, one from each underlying
> device, which are all at the same offset within their device.
> For each stripe, one of the blocks is a "parity" block - though it is
> a different block for each stripe (parity is rotated).
>
> Content of the parity block is computed from the xor of the content of
> all the other (data) blocks.
>
> To update a data block, you must also update the parity block to keep
> it consistent. For example, you can read the old parity block, read the
> old data block, compute
>     newparity = oldparity xor olddata xor newdata
> and then write out newparity and newdata.
>
> It is not possible (on current hardware:-) to write both newparity and
> newdata to the different devices atomically. If the system fails
> (e.g. power failure) between writing one and writing the other, then
> you have an inconsistent stripe.

OK, and not only newdata is corrupted, but n-2 of its unrelated neighbors
on the same stripe. I see the problem. I'm also... beginning... to see...
a solution. Maybe.

[stuff I can't answer intelligently yet snipped]

> > Given a clear statement of the problem, I think I can show how to update
> > the stripes atomically. At the very least, I'll know what interface
> > Tux2 needs from RAID in order to guarantee an atomic update.
>
> From my understanding, there are two ways to approach this problem.
>
> 1/ store updates to a separate device, either NV ram or a separate
>    disc drive. Providing you write address/oldvalue/newvalue to the
>    separate device before updating the main array, you could be safe
>    against single device failures combined with system failures.

A journalling filesystem. Fine. I'm sure Stephen has put plenty of thought
into this one. Advantage: it's obvious how it helps the RAID problem.
Disadvantage: you have the normal finicky journalling boundary conditions
to worry about. Miscellaneous fact: you will be writing everything twice
(roughly).

> 2/ Arrange your filesystem so that you write new data to an otherwise
>    unused stripe a whole stripe at a time, and store some sort of
>    checksum in the stripe so that corruption can be detected. This
>    implies a log structured filesystem (though possibly you could come
>    close enough with a journalling or similar filesystem, I'm not
>    sure).

I think it's true that Tux2's approach can do many of the things an LFS
can do. And you can't tell by looking at a block which inode it belongs
to - I think we need to know this. The obvious fix is to extend the group
metafile with a section that reverse-maps each block using a two-word
inode:index pair. (0.2% extra space with 4K blocks.)

A nice fact about Tux2 is that the changes in a filesystem from phase to
phase can be completely arbitrary. (LFS shares this property - it falls
out from doing copy-on-write towards the root of a tree.) So you can
operate a write-twice algorithm like this: first clean out a number of
partially-populated stripes by branching the in-use blocks off to empty
stripes. The reverse lookup is used to know which inode blocks to branch.
You don't have to worry about writing full stripes because Tux2 will
automatically revert to a consistent state on interruption. When you have
enough clear space you cause a phase transition, and now you have a
consistent filesystem with lots of clear stripes into which you can branch
update blocks.

  Even-numbered phases: clear out freespace by branching orphan blocks
  Odd-numbered phases:  branch updates into the new freespace

Notice that what I'm doing up there very closely resembles an incremental
defrag, and it can be tuned to really be a defrag. This might be useful.
What is accomplished is that we never kill innocent blocks in the nasty
way you described earlier.

I'm not claiming this is at all efficient - I smell a better algorithm
somewhere in there. On the other hand, it's about the same efficiency as
journalling, doesn't have so many tricky boundary conditions, and you get
the defrag, something that is a lot harder to do with an update-in-place
scheme. Correct me if I'm wrong, but don't people running RAID care more
about safety than speed?

This is just the first cut. I think I have some sort of understanding of
the problem now, however imperfect. I'll let it sit and percolate for a
while, and now I *must* stop doing this, get some sleep, then try to
prepare some slides for next week in Atlanta :-)

--
Daniel
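The "innocent block" scenario quoted above is easy to reproduce with a few
bytes of toy arithmetic (this models only the xor algebra, not any real
driver): update one data block using newparity = oldparity xor olddata xor
newdata, let the parity write complete but lose the data write, then fail
a third disk and rebuild it from the survivors:

/* Toy reproduction of the "innocent block" scenario: parity for a stripe
 * is rewritten but the matching data write never happens, then a third
 * disk fails and is reconstructed by xoring the survivors.  The rebuilt
 * block differs from what was really stored there. */
#include <stdio.h>

int main(void)
{
    /* One stripe of single-byte "blocks" on a 4-disk array; disk 1 is parity. */
    unsigned char d0 = 0x11, d2 = 0x33, d3 = 0x44;
    unsigned char parity = d0 ^ d2 ^ d3;

    /* Read-modify-write update of d0, as in Neil's description:
     *     newparity = oldparity xor olddata xor newdata */
    unsigned char newd0 = 0x99;
    unsigned char newparity = parity ^ d0 ^ newd0;

    parity = newparity;   /* the parity write completes ...              */
    /* d0 = newd0;           ... but the data write is lost in the crash */

    /* After reboot, disk 2 dies.  Reconstruct it from the survivors. */
    unsigned char rebuilt_d2 = d0 ^ d3 ^ parity;

    printf("disk 2 really held  0x%02x\n", d2);
    printf("disk 2 rebuilt as   0x%02x %s\n", rebuilt_d2,
           rebuilt_d2 == d2 ? "(ok)" : "(corrupted - the innocent block)");
    return 0;
}

Run as-is this prints a rebuilt value that no longer matches the block
that was actually stored on the failed disk.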
Re: Status of ReiserFS + Journalling
Vojtech Pavlik wrote:
> Hmm, now that I think about it, this can be brought to data corruption
> even easier ... Imagine a case where a stripe isn't written completely.
> One of the drives (whether it's the xor one or one of the others) thus
> has invalid data.
>
> Now how do you decide, after boot, which drive of the set (including the
> xor drive) is the one that contains the invalid data? I think this is
> not possible.

A power failure might leave you with a corrupt disk block. That is
detectable (read failure) and you may then reconstruct it using the rest
of the stripe. This will get you data from either before or after the
update was supposed to happen.

There is a requirement for this to work: never ever write to more than one
disk in the same stripe simultaneously. (You can write to all drives
simultaneously, but to a different stripe on each.) I believe this is hard
to achieve with the current implementation, as raid-5 would have to
override the elevator algorithms as well as any caching internal to the
drives. And performance would probably not be fantastic.

A simple RAID protects against disk breakdown, not power loss (or a kernel
crash). There are UPSes for power loss, and battery-backed caches for
further improvement.

Helge Hafting
Re: Status of ReiserFS + Journalling
On Thu, Oct 05, 2000 at 09:49:29AM +0200, Andi Kleen wrote:
> On Thu, Oct 05, 2000 at 09:39:34AM +0200, Vojtech Pavlik wrote:
> > Hmm, now that I think about it, this can be brought to data corruption
> > even easier ... Imagine a case where a stripe isn't written completely.
> > One of the drives (whether it's the xor one or one of the others) thus
> > has invalid data.
> >
> > Now how do you decide, after boot, which drive of the set (including
> > the xor drive) is the one that contains the invalid data? I think this
> > is not possible.
>
> Normally only the parity block and the block actually being changed in
> the stripe are updated, not all blocks in a stripe set.
>
> When no disk fails, the to-be-changed block may still contain the old
> value after a crash (not worse than the no-RAID case). Parity will be
> fixed up to make the RAID consistent again. The other blocks are not
> touched.

True, the result is no worse than the normal single-disk case.

--
Vojtech Pavlik
SuSE Labs
Re: Status of ReiserFS + Journalling
On Thu, Oct 05, 2000 at 01:54:59PM +1100, Neil Brown wrote:
> 2/ Arrange your filesystem so that you write new data to an otherwise
>    unused stripe a whole stripe at a time, and store some sort of
>    checksum in the stripe so that corruption can be detected. This
>    implies a log structured filesystem (though possibly you could come
>    close enough with a journalling or similar filesystem, I'm not
>    sure).

You don't need a checksum, I think; an atomically updated fs block ->
actual stripe map would be enough. It can only be updated after you have
written the new, independent stripe completely.

Simply using ordered writes for it (only write the map after you have
written the stripe) may be tricky though, because you could get cyclic
dependencies in a single HW map block when the filesystem allocates many
new stripes in parallel [so you would probably need something like soft
updates and handling of multiple versions of the map in core].

Another method, when you have a logging fs, is to simply log the map block
change into your normal log. At least for ext3 and reiserfs that would be
expensive though, because they can only log complete changed hardware
blocks of the map. JFS or XFS with item logging could do it relatively
cheaply.

-Andi
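The ordering constraint above (make the new stripe durable, then update the
map entry that points at it) can be sketched in user space with ordinary
writes and fsync(). The file names and sizes below are made up for
illustration; a real implementation would live inside the fs and RAID
layers and would still have to deal with the cyclic-dependency problem
mentioned above:

/* User-space sketch of the "write stripe first, then map" ordering.
 * Hypothetical files stand in for the array and the block map. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE 4096

static void write_all(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len) {
        perror("pwrite");
        exit(1);
    }
}

int main(void)
{
    int data_fd = open("stripes.img", O_RDWR | O_CREAT, 0644);
    int map_fd  = open("blockmap.img", O_RDWR | O_CREAT, 0644);
    if (data_fd < 0 || map_fd < 0) { perror("open"); return 1; }

    unsigned char stripe[STRIPE_SIZE];
    memset(stripe, 0xab, sizeof(stripe));

    long new_stripe_nr = 7;    /* previously unused stripe */
    long fs_block_nr   = 42;   /* logical block being rewritten */

    /* 1. Write the whole new stripe and wait until it is on stable storage. */
    write_all(data_fd, stripe, sizeof(stripe), new_stripe_nr * STRIPE_SIZE);
    if (fsync(data_fd) != 0) { perror("fsync"); return 1; }

    /* 2. Only now update the map entry; a crash before this point leaves
     *    the old mapping (and the old data) intact. */
    write_all(map_fd, &new_stripe_nr, sizeof(new_stripe_nr),
              fs_block_nr * sizeof(new_stripe_nr));
    if (fsync(map_fd) != 0) { perror("fsync"); return 1; }

    close(data_fd);
    close(map_fd);
    return 0;
}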
Re: Status of ReiserFS + Journalling
Andi Kleen wrote:
> On Wed, Oct 04, 2000 at 01:42:46AM -0600, Andreas Dilger wrote:
> > You should ask the reiserfs mailing list for outstanding problems. As
> > far as LVM is concerned, I don't think there is a problem, but watch out
> > for software RAID 5 and journalling filesystems (reiser or ext3, at least
> > under 2.2) - it can have problems if there is a disk crash.
>
> It is not inherent to journaling file systems; linux software raid 5
> can always corrupt your data when you have a system crash combined with
> a disk crash (there is no way to write stripe sets atomically, and half
> written stripe sets usually give random data for any crashed block in
> them when xored against parity).

'Atomic' - a word that makes my ears perk up. Tux2 is all about atomic
updating. Could you please give a simple statement of the problem for a
person who doesn't know much more about RAID than that it stands for
Redundant Array of Inexpensive Disks (Drives?)

Given a clear statement of the problem, I think I can show how to update
the stripes atomically. At the very least, I'll know what interface Tux2
needs from RAID in order to guarantee an atomic update.

> In this case "safe" just means that you don't need a fsck to be sure that
> the metadata is consistent -- data is never guaranteed to be consistent
> unless you have applications that use fsync/O_SYNC properly (= basically
> do their own journaling).

I truly believe that's a temporary situation.

> So overall I would not be worried too much, it isn't much worse with a
> journaled fs than it is without it.

But if it could be better...

--
Daniel
Re: Status of ReiserFS + Journalling
On Wed, Oct 04, 2000 at 01:42:46AM -0600, Andreas Dilger wrote:
> Magnus Naeslund writes:
> > The storage will be exported via ftp, samba, nfs & cvs.
> > I will patch the selected kernel to support LFS and LVM, and the
> > filesystem will run on that.
> >
> > I am very interested in ReiserFS, and success/failure stories about it.
>
> You really need to watch out when using ReiserFS as an NFS server. There
> are patches to NFS in order to run on ReiserFS. There were also problems
> (they may be fixed now) that force the NFS clients to be Linux only.

I never knew of any problems that didn't affect linux clients too (and I
did most of the NFS patches). If you know of any, please let me know.

> You should ask the reiserfs mailing list for outstanding problems. As
> far as LVM is concerned, I don't think there is a problem, but watch out
> for software RAID 5 and journalling filesystems (reiser or ext3, at least
> under 2.2) - it can have problems if there is a disk crash.

It is not inherent to journaling file systems; linux software raid 5 can
always corrupt your data when you have a system crash combined with a disk
crash (there is no way to write stripe sets atomically, and half written
stripe sets usually give random data for any crashed block in them when
xored against parity). Given that, 2.4 should be safe.

In this case "safe" just means that you don't need a fsck to be sure that
the metadata is consistent -- data is never guaranteed to be consistent
unless you have applications that use fsync/O_SYNC properly (= basically
do their own journaling).

So overall I would not be worried too much; it isn't much worse with a
journaled fs than it is without it.

-Andi
Re: Status of ReiserFS + Journalling
Magnus Naeslund writes:
> The storage will be exported via ftp, samba, nfs & cvs.
> I will patch the selected kernel to support LFS and LVM, and the
> filesystem will run on that.
>
> I am very interested in ReiserFS, and success/failure stories about it.

You really need to watch out when using ReiserFS as an NFS server. There
are patches to NFS in order to run on ReiserFS. There were also problems
(they may be fixed now) that force the NFS clients to be Linux only.

You should ask the reiserfs mailing list for outstanding problems. As far
as LVM is concerned, I don't think there is a problem, but watch out for
software RAID 5 and journalling filesystems (reiser or ext3, at least
under 2.2) - it can have problems if there is a disk crash.

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/        -- Dogbert