Re: Journaling FS and RAID
Hi,

On Wed, Jun 28, 2000 at 06:35:51PM +0200, Benno Senoner wrote:

> As far as I know the issue has been fixed in the 2.4.* kernel series.
> ReiserFS and software RAID5 is NOT safe in 2.2.*, but Stephen Tweedie
> (some time ago) pointed out that the only way to make a software raid
> system that survives a power failure while in degraded mode without
> data corruption (this case is rare, but it COULD happen) is to make a
> big RAID5 partition where you store the data and a small RAID1
> partition where you keep the journal of the RAID5 partition.

The real situation is a little more complex than that.

In degraded mode, or if you lose a disk during a crash, ALL raid5 systems --- hardware and software --- risk data loss unless they have some transactional mechanism to allow them to write entire stripes atomically with respect to power failure. In practice, this is usually achieved (for hardware raid) by logging the stripe updates to non-volatile memory. (This is usually the same memory that is used for the write-back cache, so it gives a natural performance boost as well.)

Using a separate raid1 journal is possible, but would be an odd way to deal with the problem given that we're talking at the level of individual raid devices here. For journaling *filesystems*, having the journal on an external raid1 disk is a great way to boost performance, but that doesn't fix the raid5 problem above.

> He said ext3fs can be adapted for this, what is the current status?

No I didn't! I said that ext3 can in principle use off-disk journals, but that is an entirely separate problem from the raid5 consistency issue. Making raid5 totally safe while in degraded mode *must* require the cooperation of the raid layer itself --- it simply cannot be done in the filesystem unless the filesystem guarantees 100% that it only ever writes complete stripes at a time.
There are a number of ways this could be done --- in particular, there have been a few projects recently (SWARM, Lustre) which would lend themselves to this sort of operation, by layering the filesystem on top of a log-based storage abstraction which could have the above protection built in.

> Last question: are the current ext3 and reiserfs raid-reconstruction
> safe?

On 2.4, they should be --- the new raid code performs reconstruction in a way which is invisible to the buffer cache layers. Testers welcome. :-)

Cheers,
 Stephen
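To make the degraded-mode hazard above concrete, here is a toy model of RAID5 parity (illustrative only; the names and block values are my own, not the md driver's): the parity block is the byte-wise XOR of the data blocks in a stripe, so any one lost block can be rebuilt by XOR-ing the survivors. It also shows why a non-atomic stripe update can destroy data that was not even being written.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-sized blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def reconstruct(surviving_data, parity):
    """Rebuild the one missing data block from parity plus the survivors."""
    return xor_blocks(parity, *surviving_data)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Single failure, consistent stripe: lose d1, rebuild it correctly.
assert reconstruct([d0, d2], parity) == d1

# The hazard: d0's new contents reach its disk, but the crash happens
# before the matching parity update.  If d1's disk is the one missing
# after reboot, the "reconstructed" d1 is garbage -- data that was not
# being written at the time of the crash is lost.
d0_new = b"XXXX"
assert reconstruct([d0_new, d2], parity) != d1
```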
Re: fs-devel URL
Hi,

On Thu, Mar 30, 2000 at 11:13:13PM +0200, Thomas Kotzian wrote:

> There was a discussion about LVM, reiserfs, ..., and I need the URL
> or the address of the mailing list for fs-devel, the filesystem
> development group.

[EMAIL PROTECTED]

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?
Hi,

Chris Wedgwood writes:

>> This may affect data which was not being written at the time of the
>> crash. Only raid 5 is affected.
>
> Long term -- if you journal to something outside the RAID5 array
> (ie. to raid-1 protected log disks) then you should be safe against
> this type of failure?

Indeed. The jfs journaling layer in ext3 is a completely generic block device journaling layer which could be used for such a purpose (and raid/LVM journaling is one of the reasons it was designed this way).

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?
Hi,

Benno Senoner writes:

> Wow, really good idea to journal to a RAID1 array! Do you think it is
> possible to do the following:
>
> - N disks holding a soft RAID5 array.
> - reserve a small partition on at least 2 disks of the array to hold
>   a RAID1 array.
> - keep the journal on this partition.

Yes. My jfs code will eventually support this. The main thing it is missing right now is the ability to journal multiple devices to a single journal: the on-disk structure is already designed with that in mind, but the code does not yet support it.

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power fai
Hi,

On Wed, 12 Jan 2000 11:28:28 MET-1, "Petr Vandrovec" [EMAIL PROTECTED] said:

> I did not follow this thread (on -fsdevel) too closely (and I never
> looked into the RAID code, so I should shut up), but... can you
> confirm that after the buffer with data is finally marked dirty,
> parity is recomputed anyway? So that window is really small, and the
> same problem occurs every moment when you have written data but not
> yet written parity?

Yes, that's what I said.

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?
Hi,

On Wed, 12 Jan 2000 22:09:35 +0100, Benno Senoner [EMAIL PROTECTED] said:

> Sorry for my ignorance, but I got a little confused by this post:
> Ingo said we are 100% journal-safe; you said the contrary.

Raid resync is safe in the presence of journaling. Journaling is not safe in the presence of raid resync.

> Can you or Ingo please explain in which situation (power-loss),
> running linux-raid + a journaled FS, we risk a corrupted filesystem?

Please read my previous reply on the subject (the one that started off with "I'm tired of answering the same question a million times so here's a definitive answer"). Basically, there will always be a small risk of data loss if power-down is accompanied by loss of a disk (it's a double failure); and the current implementation of raid resync means that journaling will be broken by the raid1 or raid5 resync code after a reboot on a journaled filesystem (ext3 is likely to panic; reiserfs will not, but will still get its IO ordering requirements messed up by the resync).

> After the reboot, if all disks remain physically intact, will we only
> lose the data that was being written, or is there a possibility of
> ending up with a corrupted filesystem which could cause more damage
> in the future?

In the power+disk failure case, there is a very narrow window in which parity may be incorrect, so loss of the disk may result in inability to correctly restore the lost data. This may affect data which was not being written at the time of the crash. Only raid 5 is affected.

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Wed, 12 Jan 2000 00:12:55 +0200 (IST), Gadi Oxman [EMAIL PROTECTED] said:

> Stephen, I'm afraid that there are some misconceptions about the
> RAID-5 code.

I don't think so --- I've been through this with Ingo --- but I appreciate your feedback, since I'm getting inconsistent advice here! Please let me explain...

> In an early pre-release version of the RAID code (more than two years
> ago?), which didn't protect against that race, we indeed saw locked
> buffers changing under us from the point at which we computed the
> parity till the point at which they were actually written to the
> disk, leading to corrupted parity.

That is not the race. The race has nothing at all to do with buffers changing while they are being used for parity: that's a different problem, long ago fixed by copying the buffers.

The race I'm concerned about could occur when the raid driver wants to compute parity for a stripe and finds some of the blocks are present, and clean, in the buffer cache. Raid assumes that those buffers represent what is on disk, naturally enough. So, it uses them to calculate parity without rereading all of the disk blocks in the stripe.

The trouble is that the standard practice in the kernel, when modifying a buffer, is to make the change and _then_ mark the buffer dirty. If you hit that window, then the raid driver will find a buffer which doesn't match what is on disk, and will compute parity from that buffer rather than from the on-disk contents.

> 1. n dirty blocks are scheduled for a stripe write.

That's not the race. The problem occurs when only one single dirty block is scheduled for a write, and we need to find the contents of the rest of the stripe to compute parity.

> Point (2) is also incorrect; we have taken care *not* to peek into
> the buffer cache to find clean buffers and use them for parity
> calculations. We make no such assumptions.

Not according to Ingo --- can we get a definitive answer on this, please?

Many thanks,
 Stephen
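The race described above can be sketched in a few lines. This is an illustrative toy model only (the variable names are mine, not kernel structures): a buffer is modified *before* its dirty bit is set, and in that window the raid driver trusts the "clean" cached copy when computing parity for a one-block write elsewhere in the stripe.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

disk_block = b"\x10\x10"           # what is really on disk
cache_block = bytearray(disk_block)  # buffer-cache copy of that block
dirty = False

# Step 1: the filesystem modifies the cached buffer...
cache_block[0] = 0x99
# ...and would only *later* set dirty = True.  We are now in the window.

# Step 2: raid needs this block's contents to compute parity for a
# write to a *different* block in the same stripe.  Seeing dirty ==
# False, it assumes the cache matches the disk and skips the re-read.
other_block = b"\x0f\x0f"
parity_written = xor(bytes(cache_block), other_block)

# Correct parity would have used the on-disk contents:
parity_correct = xor(disk_block, other_block)
assert parity_written != parity_correct  # stale parity ends up on disk
```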
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Tue, 11 Jan 2000 16:41:55 -0600, "Mark Ferrell" [EMAIL PROTECTED] said:

> Perhaps I am confused. How is it that a power outage while attached
> to the UPS becomes "unpredictable"?

One of the most common ways to get an outage while on a UPS is somebody tripping over, or otherwise removing, the cable between the UPS and the computer. How exactly is that predictable? Just because you reduce the risk of unexpected power outage doesn't mean we can ignore the possibility.

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure =problems ?
Hi,

On Wed, 12 Jan 2000 07:21:17 -0500 (EST), Ingo Molnar [EMAIL PROTECTED] said:

> On Wed, 12 Jan 2000, Gadi Oxman wrote:
>
>> As far as I know, we took care not to poke into the buffer cache to
>> find clean buffers -- in raid5.c, the only code which does a
>> find_buffer() is:
>
> yep, this is still the case.

OK, that's good to know.

> Especially the reconstruction code is a rathole. Unfortunately,
> blocking reconstruction if b_count == 0 is not acceptable because
> several filesystems (such as ext2fs) keep metadata caches around
> (eg. the block group descriptors in the ext2fs case) which have
> b_count == 1 for a longer time.

That's not a problem: we don't need reconstruction to interact with the buffer cache at all. Ideally, what I'd like to see the reconstruction code do is to:

* lock a stripe
* read a new copy of that stripe locally
* recalc parity and write back whatever disks are necessary for the stripe
* unlock the stripe

so that the data never goes through the buffer cache at all, but the stripe is locked with respect to other IOs going on below the level of ll_rw_block (remember there may be IOs coming in to ll_rw_block which are not from the buffer cache, eg. swap or journal IOs).

> We are '100% journal-safe' if power fails during resync.

Except for the fact that resync isn't remotely journal-safe in the first place, yes. :-)

--Stephen
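The four steps above can be sketched as a minimal in-memory model. The `Stripe` class and function names are inventions for illustration, not raid5.c code; the point is only the shape of the loop: all reads and writes happen under a per-stripe lock and never touch any shared cache.

```python
from functools import reduce
from threading import Lock

class Stripe:
    """Toy stand-in for one raid5 stripe (invented structure)."""
    def __init__(self, data_disks, parity):
        self.data = data_disks   # list of bytes: on-disk data blocks
        self.parity = parity     # on-disk parity block (possibly stale)
        self.lock = Lock()       # excludes other IO to this stripe

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def resync_stripe(stripe: Stripe) -> None:
    with stripe.lock:                          # 1. lock the stripe
        data = [bytes(d) for d in stripe.data]  # 2. read a fresh local copy
        stripe.parity = xor_blocks(data)        # 3. recalc + write parity
        # 4. the stripe unlocks when the 'with' block exits

s = Stripe([b"\x01\x02", b"\x04\x08"], parity=b"\xff\xff")  # stale parity
resync_stripe(s)
assert s.parity == b"\x05\x0a"   # 0x01^0x04, 0x02^0x08
```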
[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

This is a FAQ: I've answered it several times, but in different places, so here's a definitive answer which will be my last one: future questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner [EMAIL PROTECTED] said:

>> then raid can miscalculate parity by assuming that the buffer
>> matches what is on disk, and that can actually cause damage to other
>> data than the data being written if a disk dies and we have to start
>> using parity for that stripe.
>
> Do you know if using soft RAID5 + regular ext2 causes the same sort
> of damage, or if the corruption chances are lower when using a
> non-journaled FS?

Sort of. See below.

> Is the potential corruption caused by the RAID layer or by the FS
> layer? (Does the FS code or the RAID code need to be fixed?)

It is caused by neither: it is an interaction effect.

> If it's caused by the FS layer, how do XFS (not here yet ;-) ) or
> ReiserFS behave in this case?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with respect to write ordering. There is no policy to guide what gets written and when: the writeback caching can trickle to disk at any time, and other system components such as filesystems and the VM can force a write-back of data to disk at any time.

Journaling imposes write ordering constraints which insist that data in the buffer cache *MUST NOT* be written to disk unless the filesystem explicitly says so. RAID-5 needs to interact directly with the buffer cache in order to be able to improve performance. There are three nasty interactions which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data in a stripe gets written to disk at once. For RAID-5, this is very much faster than dribbling the stripe back one disk at a time.
Unfortunately, this can result in dirty buffers being written to disk earlier than the filesystem expected, with the result that on a crash, the filesystem journal may not be entirely consistent. This interaction hits ext3, which stores its pending transaction buffer updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in order to calculate parity without reading all of the disks in a stripe. If a journaling system tries to prevent modified data from being flushed to disk by deferring the setting of the buffer dirty flag, then RAID-5 will think that the buffer, being clean, matches the state of the disk, and so it will calculate parity which doesn't actually match what is on disk. If we crash and one disk fails on reboot, wrong parity may prevent recovery of the lost data. This interaction hits reiserfs, which stores its pending transaction buffer updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely avoid buffers which have an incremented b_count reference count, and making sure that the filesystems all hold that count raised when the buffers are in an inconsistent or pinned state.

3) The soft-raid background rebuild code reads and writes through the buffer cache with no synchronisation at all with other fs activity. After a crash, this background rebuild code will kill the write-ordering attempts of any journalling filesystem. This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's still not that hard to do.

So, can any of these problems affect other, non-journaled filesystems too? Yes, 1) can: throughout the kernel there are places where buffers are modified before the dirty bits are set.
In such places we will always mark the buffers dirty soon, so the window in which an incorrect parity can be calculated is _very_ narrow (almost non-existent on non-SMP machines), and the window in which it will persist on disk is also very small. This is not a problem. It is just another example of a race window which exists already with _all_ non-battery-backed RAID-5 systems (both software and hardware): even with perfect parity calculations, it is simply impossible to guarantee that an entire stripe update on RAID-5 completes in a single, atomic operation.

If you write a single data block and its parity block to the RAID array, then on an unexpected reboot you will always have some risk that the parity will have been written, but not the data. On a reboot, if you lose a disk then you can reconstruct it incorrectly due to the bogus parity.

THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the only way you can get bitten by this failure mode is to have a system failure and a disk failure at the same time.

--Stephen
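The "parity written but not the data" case above can be worked through with toy numbers (the block values are assumptions for illustration). It uses the standard read-modify-write parity update for a small write, new_parity = old_parity XOR old_data XOR new_data, and shows that a crash between the two writes followed by a disk loss reconstructs bogus data.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Stripe on disk: d0, d1, and parity = d0 ^ d1.
d0, d1 = b"\x11", b"\x22"
parity = xor(d0, d1)                       # 0x33

# Small write to d0: read-modify-write parity update.
d0_new = b"\x44"
parity_new = xor(xor(parity, d0), d0_new)  # 0x66

# Crash: parity_new reached its disk, d0_new did not.
on_disk_d0, on_disk_parity = d0, parity_new

# On reboot, disk 1 is dead; reconstruct d1 from d0 and parity:
d1_rebuilt = xor(on_disk_parity, on_disk_d0)   # 0x77, not 0x22
assert d1_rebuilt != d1   # d1 was never being written, yet it is lost
```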
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha [EMAIL PROTECTED] said:

>> THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and
>> the only way you can get bitten by this failure mode is to have a
>> system failure and a disk failure at the same time.
>
> To try to avoid this kind of problem, some brands do have additional
> logging in place (to disk, which is slow for sure, or to NVRAM),
> which enables them to at least recognize the fault to avoid the
> reconstruction of invalid data, or even enables them to recover the
> data by keeping redundant copies of it in NVRAM + logging information
> about what could be written to the disks and what not.

Absolutely: the only way to avoid it is to make the data+parity updates atomic, either in NVRAM or via transactions. I'm not aware of any software RAID solutions which do such logging at the moment: do you know of any?

--Stephen
Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Fri, 07 Jan 2000 13:26:21 +0100, Benno Senoner [EMAIL PROTECTED] said:

> What happens when I run RAID5 + a journaled FS and the box is just
> writing data to the disk and then a power outage occurs? Will this
> lead to a corrupted filesystem, or will only the data which was just
> written be lost?

It's more complex than that. Right now, without any other changes, the main danger is that the raid code can sometimes lead to the filesystem's updates being sent to disk in the wrong order, so that on reboot, the journaling corrupts things unpredictably and silently.

There is a second effect, which is that if the journaling code tries to prevent a buffer being written early by keeping its dirty bit clear, then raid can miscalculate parity by assuming that the buffer matches what is on disk, and that can actually cause damage to other data than the data being written, if a disk dies and we have to start using parity for that stripe.

Both are fixable, but for now, be careful...

--Stephen
Re: Best way to set up swap for high availability?
Hi,

On Fri, 26 Nov 1999 18:04:27 +0100, Martin Bene [EMAIL PROTECTED] said:

> At 11:35 25.11.99 +0100, Thomas Waldmann wrote:
>
>> What's more interesting for me: how about swap on RAID-5?
>
> Personally, I've only used raid1, but I can give you a quote from
> Ingo - and he should know:
>
> At 14:49 14.04.99 +0200, Ingo Molnar wrote:
>
>>> Hmm? Since when does swapping work on raid-1? How about raid-5?
>>
>> i've tested it on RAID5, swapping madly to a RAID5 array while
>> parity is being reconstructed works just fine.

Sorry, but since then we did find a fault. Raid resync goes through the buffer cache. Swap bypasses the buffer cache. There is no coherency between the two activities. It is possible for raid1 and raid5 background resync to corrupt swap writes to the partition during reconstruction.

We need to fix this anyway, since the same problem bites journaling.

--Stephen
Re: Best way to set up swap for high availability?
Hi,

On Mon, 6 Dec 1999 16:11:14 -0500 (EST), Andy Poling [EMAIL PROTECTED] said:

> On Mon, Dec 06, 1999 at 02:53:22PM, Stephen C. Tweedie wrote:
>
>> Sorry, but since then we did find a fault. Raid resync goes through
>> the buffer cache. Swap bypasses the buffer cache. There is no
>> coherency between the two activities. It is possible for raid1 and
>> raid5 background resync to corrupt swap writes to the partition
>> during reconstruction.
>
> Stephen, does this also hold true if one is swapping to files that
> happen to be located on a software raid partition?

Yes, 'fraid so.

--Stephen
Re: Best way to set up swap for high availability?
Hi,

On Mon, 6 Dec 1999 20:17:12 +0100, Luca Berra [EMAIL PROTECTED] said:

> Do you mean that the problem arises ONLY when a disk fails and has to
> be reconstructed?

No, it can happen any time the kernel does a resync after an unclean shutdown.

--Stephen
RE: Bad rawio/raid performance
Hi,

On Tue, 19 Oct 1999 20:12:20 -0700, "Tom Livingston" [EMAIL PROTECTED] said:

> Has anyone else tried raw-io with md devices? It works for me, but
> the performance is quite bad. This is a recently reported issue on
> the linux-kernel mailing list. The gist of it is that rawio is using
> a 512 byte blocksize, where raid assumes a 1024. This was only first
> reported a couple of days ago (10/16).

Yep. It's not clear just yet exactly how best to fix this --- the hacked patch which forces the raw IO blocksize to 1024 will break applications which (legitimately) expect to be able to perform 512-byte IOs on the raw device. I'll let people know once we've figured out how to get 512-byte IOs working on raid decently.

--Stephen
RE: Bad rawio/raid performance
Hi,

On Tue, 26 Oct 1999 11:42:41 -0400 (EDT), David Holl [EMAIL PROTECTED] said:

> Would specifying different input and output block sizes with dd help?

Unfortunately not, no. The underlying device blocksize is set when the device is first opened.

--Stephen
Re: (reiserfs) Re: 71% full raid - no space left on device
Hi,

On Wed, 20 Oct 1999 13:12:23 +0400, Hans Reiser [EMAIL PROTECTED] said:

> We don't have inodes in our FS, but we do have stat data, and that is
> dynamically allocated (dynamic per FS, not per file yet; soon, but
> not yet, each field will be optional and inheritable per file). Does
> XFS dynamically allocate? It might.

I believe so, yes. They do have traditional-looking block groups, but within each group they can allocate blocks arbitrarily to hold inode data. (This is just from memory of their description at the Darmstadt workshop.)

--Stephen
Re: networked RAID-1
Hi,

On Mon, 11 Oct 1999 17:02:27 -0500, Stephen Waters [EMAIL PROTECTED] said:

> This blurb in the latest Kernel Traffic has some status information
> on ext3 and ACLs that might be relevant. 12-18mo for a really stable
> version, but version 0.02 is supposed (maybe already) to be out very
> soon.

If by ext3 you mean journaling, I'm expecting 6 months for a really stable version, and I expect to see people deploying it in anger within 3. I already have it on all my laptop filesystems, for example (after a couple of bugfixes while I was on the road last week). New release today.

--Stephen
Re: networked RAID-1
Hi,

On Thu, 7 Oct 1999 01:59:31 -0500, [EMAIL PROTECTED] (G.W. Wettstein) said:

>> If this works, you can also add a third machine and make a threefold
>> raid1 for added HA. Curious myself if this would work. Unfortunately
>> cannot test this myself.
>
> This strategy for doing HA has interested us as well.

Just a few comments: first of all, the current NBD implementation, at least the pieces of it that we have been able to find, is not sufficiently robust to implement this strategy in a production environment. There are at least two teams working on beefing up NBD, including the addition of proper connection resync and a kernel-based server.

A much more thorny problem is the management of network breaks between the two disks --- you have to deal with all the clustering issues surrounding quorum management, or you end up with both disks failing the other side over to themselves and you can't resolve the conflict afterwards.

--Stephen
Re: networked RAID-1
Hi,

On Mon, 11 Oct 1999 13:55:23 -0400, Tom Kunz [EMAIL PROTECTED] said:

> Stephen (and others who might know), are there homepages and/or
> mailing lists for these teams? I would be highly interested in
> participating...

One is the GFS team at http://gfs.lcse.umn.edu/. The other hasn't announced publicly yet.

--Stephen
Re: networked RAID-1
Hi,

On Mon, 11 Oct 1999 16:58:46 -0400, Tom Kunz [EMAIL PROTECTED] said:

> Hmm, well GFS isn't exactly an improvement on NBD, it's more like an
> entirely different filesystem type.

GFS is a shared disk filesystem. It doesn't care how the disk is shared, and one of the side projects they have taken on is to extend nbd to provide a level of functionality at which they could run GFS over nbd. The resulting gnbd code is in the GFS cvs repository afaik: I can look out and post the gnbd announcement if you like.

> I was talking with Simon Horman of VA-Research at Internet World in
> NYC this past week, and he feels that it'll be 12 to 18 months until
> we have ext3

ext3 should be usable by Christmas/new year.

> and/or some other kind of nicely-working, network-distributed
> filesystem (such as GFS).

InterMezzo will be there _much_ sooner by all accounts. It has already been demonstrated under serious load, and Peter is spending a lot of time on it right now. InterMezzo is a more loosely coupled filesystem than GFS, but should be perfect for jobs which do not require shared write access to single files. See http://www.inter-mezzo.org/. It's exciting stuff. :)

Cheers,
 Stephen
Re: raid0 and raw io
Hi,

On Thu, 29 Jul 1999 09:38:20 -0700, Carlos Hwa [EMAIL PROTECTED] said:

> I have a 2 disk raid0 with 32k chunk size using raidtools 0.90
> beta10 right now, and have applied Stephen Tweedie's raw i/o patch.
> The raw io patch works fine with a single disk, but if I try to use
> raw io on /dev/md0, for some reason transfer sizes are only 512
> bytes according to the scsi analyzer, no matter what I specify (I am
> using lmdd from lmbench to test: lmdd if=/dev/zero of=/dev/raw1
> bs=65536 count=2048; /dev/raw1 is the raw device for /dev/md0). Mr.
> Tweedie says it should work correctly, so could this be a limitation
> of the linux raid software? Thanks.

I'm back from holiday, so... Ingo, any thoughts on this? The raw IO code is basically just stringing together temporary buffer_heads and then submitting them all, as a single call, to ll_rw_block (up to a limit of 128 sectors per call). The IOs are ordered, so attempt_merge() should be happy enough about merging. The only thing I can think of which is somewhat unusual about the IOs is that the device's blocksize is unconditionally set to 512 bytes beforehand: will that confuse md's block merging?

--Stephen
Re: A couple of... pearls?
Hi,

On Sat, 24 Apr 1999 21:09:05 +0200 (MEST), Francisco Jose Montilla [EMAIL PROTECTED] said:

> Hi, I happened to come across a couple of statements that somewhat
> involve the use of RAID, statements that I believe are not absolutely
> correct, if not false, or half-truths.
>
> --- [...] Keep in mind that 99 percent of PC hardware is garbage. A
> friend of mine was a small-time Internet service provider. He was
> running BSDI, a not-quite-free Unix, on a bunch of PC clones. A hard
> disk was generating errors. He reloaded from backup tape. He still
> got errors. It turned out that his SCSI controller had gone bad some
> weeks before. It had corrupted both the hard disk and the backup
> tapes. He lost all of his data. He lost all of his clients' data.
> Lesson 1: You are less likely to lose with a SCSI controller designed
> by a real engineer in the Hewlett-Packard Unix workstation division
> than you are with one thrown in on a $49 sound card. Lesson 2:
> Mirrored disks on separate SCSI chains. Period. ---
>
> No. Lesson number zero: check the consistency of your backups.
> Regularly. I know the HP part is gonna make Dietmar's delights :).
> Apart from that, I wonder: don't SCSI controllers use parity?
> (Although you have to enable it, of course.)

Yes, if the controller supports it, and all modern controllers do. I don't even think any of our drivers let you disable it any more. However, most of the cheapo sound-card-based scsi controllers (which were first designed as a cheap way of interfacing to a cdrom) don't do parity. Run raid on that? Yeah, right...

> I agree that using two *controllers* (not two channels on the same
> controller) gives appropriate redundancy if one of them goes mad,
> but nonetheless, although we use only one, shouldn't data corruption
> be detected by the controller parity?

No. Errors generated on the cable will be detected. Bus/memory errors will not; soft errors in the controller will not; and errors in the disk itself will not.

> One step further, how will the soft RAID code handle this?
> Does it have some heuristics to detect that, or is it completely the
> task of the controller and impossible for soft RAID to detect?

If the IO completes with the status "OK, all IO finished fine", the RAID code believes it.

> -- Why would I want a two channel RAID card for RAID one? By putting
> each harddrive on a separate channel, you can ensure that even if a
> cable or terminator on one channel were to go bad, the system would
> continue to function. When hot-swapping a harddrive, the RAID card
> must temporarily stop the SCSI channel the drive is attached to. If
> the other drive in a RAID one array is connected to a different
> channel, the computer can operate completely normally during the
> hot-swap. --
>
> I agree completely with the first statement, but the second sounds
> somewhat odd to me. I can hotadd or hotremove a disk on linux with
> sw RAID and a non-hot-swap-capable controller; maybe this is another
> feature of sw RAID over hw RAID?

You can try, but if the bus is active while you do it, chances are you'll corrupt data. There _are_ specially designed raid cabinets which electrically isolate the bus so that you can do this safely, but that's not the case for your typical scsi bus.

--Stephen
Re: Benchmarks/Performance.
Hi,

On Thu, 22 Apr 1999 20:45:52 +0100 (IST), Paul Jakma [EMAIL PROTECTED] said:

> I tried this with raid0, and if bonnie is any guide, the optimal
> configuration is 64k chunk size, 4k e2fs block size.

Going much above 64k will mean that readahead has to work very much harder to keep all the pipelines full when doing large sequential IOs. That's why bonnie results can fall off. However, if you have independent IOs going on (web/news/mail service or multiuser machines), then that concurrent activity may still be faster with larger chunk sizes, as you minimise the chance of any one file access having to cross multiple disks.

In other words, all benchmarks lie. :)

--Stephen
Re: Benchmarks/Performance.
Hi,

On Mon, 26 Apr 1999 21:28:20 +0100 (IST), Paul Jakma [EMAIL PROTECTED] said:

> It was close between 32k and 64k. 128k was noticeably slower (for
> bonnie), so I didn't bother with 256k.

Fine, but 128k will be noticeably faster for some other tasks. Like I said, it depends on whether you prioritise large-file bandwidth over the ability to serve many IOs at once.

> viz pipelining: would I be right in thinking that a decent scsi
> controller and drives can "pipeline" /far/ better than, eg, a udma
> setup?

Yes, although you eventually run into a different bottleneck: the filesystem has to serialise every so often while reading its indirection metadata blocks. Using a 4k fs blocksize helps there (again, for squeezing the last few %age points out of sequential readahead).

> ie the optimal chunk size would be higher for a scsi system than for
> an eide/udma setup?

udma can do readahead and multi-sector IOs. scsi can have limited tagged queue depths. Command setup is more expensive on scsi than on ide. Which costs dominate really depends on the workload.

--Stephen
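The chunk-size trade-off discussed in these two posts comes down to simple address arithmetic: which member disk a given byte offset of a striped (raid0-style) array lands on. A small sketch, with the chunk size and disk count as assumed example values:

```python
def raid0_disk(offset: int, chunk_size: int, ndisks: int) -> int:
    """Member disk serving the given byte offset of a striped array."""
    return (offset // chunk_size) % ndisks

CHUNK = 64 * 1024   # example chunk size
DISKS = 4           # example array width

# A 256k sequential read touches every disk once: good bandwidth, but
# readahead must keep all four spindles busy at once.
touched = {raid0_disk(off, CHUNK, DISKS) for off in range(0, 256 * 1024, 4096)}
assert touched == {0, 1, 2, 3}

# A small 8k access usually stays on one disk with 64k chunks, which
# is what helps concurrent multiuser workloads with bigger chunks.
start = 100 * 1024
assert raid0_disk(start, CHUNK, DISKS) == raid0_disk(start + 8192 - 1, CHUNK, DISKS)
```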
Re: So, it's up -- and I'm beating it, now about that boot..
Hi,

On Sat, 17 Apr 1999 16:22:59 -0400 (EDT), "m. allan noah" [EMAIL PROTECTED] said:

> have you ACTUALLY used grub to boot off of raid1? i dont see how grub
> is capable. it would have to be able to read the md device. prove me
> wrong please.

raid-1 has the property that the raid superblock is at the end of the partition, so that the filesystem contained inside the raid starts at the start of each of the component raid partitions. In other words, each partition in the raid set looks like a perfectly formed ext2fs filesystem which just happens to be 64k smaller than the total partition size. Grub should be able to read this just fine.

Cheers,
 Stephen
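The layout arithmetic behind this can be sketched as follows. This is based on my reading of the md 0.90 superblock convention (superblock in the last 64k-aligned 64k of the partition); treat the exact rounding as an assumption to verify against md's headers rather than a definitive statement.

```python
MD_RESERVED = 64 * 1024  # bytes reserved at the end for the raid superblock

def md_superblock_offset(part_bytes: int) -> int:
    """Byte offset of the raid superblock within a component partition:
    round the partition size down to a 64k boundary, then back off 64k."""
    return (part_bytes & ~(MD_RESERVED - 1)) - MD_RESERVED

def md_usable_bytes(part_bytes: int) -> int:
    """Size of the filesystem area visible through the raid1 device --
    everything up to the superblock."""
    return md_superblock_offset(part_bytes)

# A 100 MB component partition holds a filesystem exactly 64k smaller:
part = 100 * 1024 * 1024
assert md_usable_bytes(part) == part - MD_RESERVED
# An oddly-sized partition also loses its unaligned tail:
assert md_usable_bytes(part + 1000) == part - MD_RESERVED
```

This is why grub can treat each raid1 component as an ordinary ext2 filesystem: the fs starts at offset 0 of the partition and simply never reaches the reserved tail.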
Re: Swap on raid
Hi,

On 15 Apr 1999 00:13:48 -, [EMAIL PROTECTED] said:

> AFAIK, the swap code uses raw file blocks on disk, rather than
> passing through the vfs, cause you dont want to cache swap accesses,
> think about it :)

Sort of correct. It does bypass most of the VFS, but it does use the standard block device IO routines.

> this is how swap can work on a partition or a file, cause at swapon
> time, the blocks are mapped for direct access.

No, for files, we do the mapping on demand, not all at once at swapon.

> swap running on raid then, if it works at all, is not actually
> protecting you.

Yes it is. Swapping is not done inside the VFS, but neither is RAID. RAID works under the hood of the block device IO routines (drivers/block/ll_rw_block.c), so both VFS and swap will take full advantage of any RAID devices being used.

--Stephen
RE: Swap on raid
Hi,

On Wed, 14 Apr 1999 15:32:40 -0400, "Joe Garcia" [EMAIL PROTECTED] said:

> Swapping to a file should work, but if I remember correctly you get
> horrible performance.

Swap-file performance on 2.2 kernels is _much_ better.

--Stephen
Re: Swap on raid
Hi,

On Wed, 14 Apr 1999 21:59:49 +0100 (BST), A James Lewis [EMAIL PROTECTED] said:

> Wasn't it only a month ago that this was not possible, because it
> needed to allocate memory for the raid and couldn't, because it
> needed to swap to do it? Was I imagining this, or have you guys been
> working too hard?!

There may well have been a few possible deadlocks, but the current kswapd code is pretty careful to avoid them. Things should be OK.

--Stephen
Re: partition type to autodetect raid
Hi,

> The only place I would even imagine this would be possible would be
> in the mode pages, but my recollection of the SCSI standard says
> that all of these mode pages are read only. :(

IIRC there are some writable fields in some drives to allow you to set caching/writeback behaviour, for example, but even then they are definitely not persistent. There's nowhere to store a permanent data-type marker.

--Stephen
Re: partition type to autodetect raid
Hi, On Sun, 28 Mar 1999 15:27:26 -0500 (EST), Laszlo Vecsey [EMAIL PROTECTED] said: Isn't there room in the raid header for an additional flag to mark the 'partition' type? I realize this might require a 'mkraid --upgrade' to be run, but at least the 'partitions' could then be detected and then I could root automount more cleanly for example... That's not the point: we don't even _look_ for a raid superblock unless the partition is marked for autostart. There are related problems we need to deal with regularly when building filesystems: what happens if you reformat a raid disk as a single ext2fs filesystem? The raid superblock remains intact, but we do _not_ want to autostart it. That's why it's best to leave things as they are: if a partition is not recognisable as having a raid superblock, we don't autostart it. --Stephen
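The gating Stephen describes hinges on the partition *type* in the partition table, not on the raid superblock itself: the kernel only probes partitions whose type byte is 0xfd ("Linux raid autodetect"). A minimal sketch of that check against a DOS/MBR partition table follows; the offsets are the standard MBR layout, and the sample table is fabricated for illustration.

```python
RAID_AUTODETECT = 0xFD  # "Linux raid autodetect" partition type

def autostart_candidates(mbr):
    """Return 1-based indices of primary partitions the kernel would even
    consider probing for a raid superblock, per the 0xfd type convention."""
    assert mbr[510:512] == b"\x55\xaa", "not a valid MBR"
    found = []
    for i in range(4):                          # four primary entries
        entry = mbr[446 + 16 * i: 446 + 16 * (i + 1)]
        if entry[4] == RAID_AUTODETECT:         # type byte is at offset 4
            found.append(i + 1)
    return found

# Fabricated example: an otherwise-empty MBR with partition 2 marked 0xfd.
mbr = bytearray(512)
mbr[510:512] = b"\x55\xaa"
mbr[446 + 16 + 4] = RAID_AUTODETECT
print(autostart_candidates(bytes(mbr)))        # → [2]
```

A partition reformatted as plain ext2 would keep its stale raid superblock, but changing its type away from 0xfd is enough to keep it out of autostart, which is the behaviour being defended above.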
Re: Filesystem corruption (was: Re: Linux 2.2.4 RAID - success report)
Hi, On Mon, 29 Mar 1999 11:28:25 +0100, Richard Jones [EMAIL PROTECTED] said: Not so fast there :-) In the stress tests, I've encountered almost silent filesystem corruption. The filesystem reports errors as attached below, but the file operations continue without error, corrupting files in the process. At no time did the RAID software report any problem, nor did any reconstruction kick in. Anyone have any ideas what might be going on? It doesn't seem to be exclusively a 2.2.4 thing. I've seen similar problems with 2.0.36-19990128. This is a pretty good indication of a hardware fault. Looking at the messages: Mar 26 20:52:35 fred kernel: EXT2-fs error (device md(9,0)): ext2_free_blocks: Freeing blocks not in datazone - block = 550046767, count = 1 Mar 26 20:52:36 fred kernel: EXT2-fs error (device md(9,0)): ext2_free_blocks: Freeing blocks not in datazone - block = 536870912, count = 1 Mar 27 10:47:59 fred kernel: EXT2-fs error (device md(9,0)): ext2_free_blocks: Freeing blocks not in datazone - block = 538609421, count = 1 these are block numbers (in hex): 20C90C2F, 20000000, 201A870D. Something is randomly flipping bit 29 in the block addresses (the block numbers are entirely valid apart from this). This may be a disk or controller fault, but I'd replace the cabling first. --Stephen
Re: RAID1 experiences
Hi, On Sat, 13 Feb 1999 18:14:14 -0500, Michael Stone [EMAIL PROTECTED] said: On Wed, Feb 10, 1999 at 09:43:12AM -0600, Chris Price wrote: Instead of pointing fingers at Redhat, I would ask if there is someone within the Linux-raid community that actively corresponds with redhat to let them know of the current status of linux-raid? Ingo et al. seem to be doing a superb job in adding functionality and fixing bugs quickly, but that does result in a myriad of patches being issued fairly regularly - is it Redhat's responsibility to keep track of linux-raid, or is it our responsibility to inform them of stable releases? Is anyone in the "linux-raid community" being paid to do research work for redhat? If so, they should probably keep redhat informed. If not, I think it's fair to expect redhat to do their own work. Umm, Ingo Molnar == [EMAIL PROTECTED] I think we can assume that there is somebody working for Red Hat who knows a bit about the current state of Raid. :) However, speaking from the point of view of a kernel developer rather than a Red Hat employee, there are real obstacles to including the new Raid stuff in Red Hat Linux, the main one being compatibility with existing installations using older Raid code. I wouldn't like to be the one trying to make Red Hat upgrades work with the new Raid drivers but without breaking old-style Raid volumes... --Stephen
Re: benefits of journaling for soft RAID ?
Hi, On Thu, 11 Feb 1999 09:00:20 +0100, Benno Senoner [EMAIL PROTECTED] said: can someone please explain what journaling precisely does? (is this a sort of mechanism which leaves the filesystem in a consistent state, even in case of an interrupted disk write, due to power loss or other causes?) Exactly. It keeps a record of in-progress filesystem operations so that entire complex operations, such as renames, always complete atomically even if you reboot half-way through. It eliminates the need for an fsck at reboot. and the advantages / disadvantages (makes the filesystem slower?) It _should_ make it faster, for most access patterns. It will make "mount -o sync" operation (for things like NFS servers) *MUCH* faster, especially if you use a separate disk for the journal. --Stephen
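The core idea - record what you are about to do, force that record to disk, then do it - can be sketched in a few lines. This is a toy write-ahead journal illustrating the principle only, not ext3's on-disk format; all names here are illustrative.

```python
import json
import os

class ToyJournal:
    """Toy write-ahead journal. An operation is described in the journal
    (and fsync'd) before it is performed; after a crash, replay() finishes
    any operation whose commit never happened, so the step is atomic."""

    def __init__(self, path):
        self.path = path

    def begin(self, ops):
        with open(self.path, "w") as f:
            json.dump(ops, f)
            f.flush()
            os.fsync(f.fileno())        # the intent record must hit disk first

    def commit(self):
        os.remove(self.path)            # journal entry no longer needed

    def replay(self):
        """Run at 'mount' time: redo anything left uncommitted."""
        if os.path.exists(self.path):
            with open(self.path) as f:
                for op in json.load(f):
                    if op["kind"] == "rename" and os.path.exists(op["src"]):
                        os.rename(op["src"], op["dst"])
            self.commit()

def journaled_rename(journal, src, dst):
    journal.begin([{"kind": "rename", "src": src, "dst": dst}])
    os.rename(src, dst)                 # the operation being protected
    journal.commit()
```

If the machine dies between begin() and commit(), the next replay() completes the rename instead of leaving the filesystem half-done; that replay of a short journal is what replaces the full-disk fsck at reboot.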
Re: fsck performance on large RAID arrays ?
Hi, On Tue, 9 Feb 1999 13:31:14 +0100 (CET), MOLNAR Ingo [EMAIL PROTECTED] said: Stephen Tweedie is working on the journalling extensions. [not sure what the current status is, he had a working prototype end of last year.] I had journaling and buffer commit code, but not any filesystem personality stuff. Current status is that 2.2 is out (yay!) and journaling is once again my top priority, and I've just started doing real testing of basic filesystem transactions (currently only for the simplest case --- chmod --- but most of the others are relatively easy to add once that works properly). AFAIK, these extensions will not destroy anything we have with ext2fs, they are (as usual) optional. I'd call it ext3fs too because the changes themselves are bigger than ext2fs itself, and together with all the other upcoming 2.3 features (ACLs, trees, compression, etc.) it will be significantly different from 'classic' ext2fs, but it's up to Stephen ... The development codebase is certainly being done as a separate ext3fs, but that is simply to allow me to test things without trashing the root filesystem on all of my test boxes! The intention is that eventually these features should be merged into ext2 proper, but only if we can absolutely guarantee that there will be no reliability penalty for users not using the new code. During the transition/testing period I'll certainly be maintaining a test tree for ext3 as a separate filesystem so that people don't have to put their existing data at risk. --Stephen
Re: Sun disklabels (Was: Re: RELEASE: RAID-0,1,4,5 patch...)
Hi, On Thu, 28 Jan 1999 18:56:48 -0800, "David S. Miller" [EMAIL PROTECTED] said: You need to start using data at cylinder 1 on all disks or it will get nuked. It doesn't happen on the first disk because ext2 skips some space at the beginning of the volume. Swap space has the same problem, you cannot start it at cylinder 0. The new-style SWAPSPACE2 avoids the first 1024 bytes in the partition precisely because of requests such as these from Sparc people. And yes, I agree, having the same facility for raid component partitions would be most useful. Adding a "start data offset" to the raid superblock, defaulting to 0, would allow backwards compatibility too. --Stephen
Re: Is this possible/feasible
Hi, On Sun, 18 Oct 1998 23:42:39 +, "Adam Williams" [EMAIL PROTECTED] said: Any pointers on where to gets doc's for this setup? linux/Documentation/nbd.txt (surprise!) documents network block devices. The fact that raid may be running on nbd doesn't affect the upper raid stuff at all. Ingo, this is actually a problem: on nbd raid1, we *really* want read balancing to prefer the local disk if possible! Does anyone know if the CODA filesystem has redundancy features? Yes, it does, along with automatic reconciliation. Very nice. --Stephen
Re: Is this possible/feasible
Hi, On Sun, 18 Oct 1998 15:55:35 +0200 (CEST), MOLNAR Ingo [EMAIL PROTECTED] said: On Sun, 18 Oct 1998, Tod Detre wrote: in 2.1 kernels you can make nfs a block device. raid can work with block devices, so if you raid5 several nfs computers one can go down, but you still can go on. you probably want to use Stephen Tweedie's NBD (Network Block Device), Heh, thanks, but the credit is Pavel Machek's. I've just been testing and bug-fixing it. which works over TCP and is thus more reliable, works over larger distances, and tolerates dropped packets. You can even have 5 disks on 5 continents put together into a big RAID5 array. (meant to survive a meteorite up to a few tens of miles across ;) and you can loopback it through a crypt^H^H^H^H^Hcompression module too before sending it out to the net. Of course, you'll need to manually reconstruct the raid array as appropriate, and you don't get raid autostart on a networked block device either. However, it ought to be fun to watch, and I'm hoping we can integrate this method of operation into some of the clustering technology now appearing on Linux to do failover of NFS services if one of the networked raid hosts dies. Just remount the raid on another machine using the surviving networked disks, remount ext2fs and migrate the NFS server's IP address: voila! --Stephen
Re: Linear/ext2-volume question
Hi, On Sun, 18 Oct 1998 12:05:11 +0100, "Johan Gronvall" [EMAIL PROTECTED] said: I'm new to this list so please bear with me if I ask stupid questions. I'm looking for a kind of linear solution. I have however got the impression that you can only 'concatenate' 2 disks or partitions to make a single md device. Correct? No, you can have as many as you want. And both disks need to be reformatted. Right? Yes. If that's the case, then who's actually using linear mode? Me, for a start! I found it very useful to be able to combine together a few scraps of spare space on a number of mounted disks to create a scratch partition of useful size. Anyway, I found something that was called ext2-volume, a kind of extension to the ext2 filesystem, that made it possible to extend a mounted partition on the fly! Cool, but I don't know how to build it. It seems that I lack a file called ext2fs.h. Anyone tried this? Yes, it is due to be integrated into ext2 in the 2.3 kernels, but for now I wouldn't advise using it as it lacks some fairly important things like e2fsck. :) --Stephen
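For the record, a raidtools-era /etc/raidtab for linear mode is not limited to two components; a sketch like the following (device names and the exact partitions are illustrative, not from the message) concatenates three scraps of space into one md device:

```
raiddev /dev/md0
    raid-level              linear
    nr-raid-disks           3
    persistent-superblock   1
    device                  /dev/sda5
    raid-disk               0
    device                  /dev/sdb3
    raid-disk               1
    device                  /dev/sdc2
    raid-disk               2
```

After mkraid /dev/md0, running mke2fs on the combined device gives one large scratch filesystem; the component partitions' previous contents are lost, which is the reformatting referred to in the exchange above.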