Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

On Wed, 3 Nov 1999 17:43:18 +0100 (MET), Ingo Molnar <[EMAIL PROTECTED]> said:

> .. which is exactly what the RAID5 code was doing ever since. It _has_ to
> do it to get 100% recovery anyway. This is one reason why access to caches
> is so important to the RAID code. (we do not snapshot clean buffers)

And that's exactly why it will break --- the dirty bit on a buffer does not tell you if it is clean or not, because the standard behaviour all through the entire VFS is to modify the buffer before calling mark_buffer_dirty(). There have always been windows in which you can see a buffer as clean when it is in fact dirty, and the writable mmap case in 2.3 makes this worse.

> no, RAID marks buffers dirty which it found in a clean and not locked
> state in the buffer-cache. It's perfectly legal to do so - or at least it
> was perfectly legal until now,

No it wasn't, because swap has always been able to bypass the buffer cache. I'll not say it was illegal either --- let's just say that the situation was undefined --- but we have had writes outside the buffer cache for _years_, and you simply can't say that it has always been legal to assume that the buffer cache was the sole synchronisation mechanism for IO.

I think we can make a pretty good compromise here, however. We can mandate that any kernel component which bypasses the buffer cache is responsible for ensuring that the buffer cache is invalidated beforehand. That lets raid do the right thing regarding parity calculations. For this to work, however, the raid resync must not be allowed to repopulate the buffer cache and create a new cache incoherency. It would not be desperately hard to lock the resync code against other IOs in progress, so that resync is entirely atomic with respect to things like swap.
I can live with this for jfs: I can certainly make sure that I bforget() journal descriptor blocks after IO to make sure that there is no cache incoherency if the next pass over the log writes to the same block using a temporary buffer_head. Similarly, raw IO can do buffer cache coherency if necessary (but that will be a big performance drag if the device is in fact shared).

The one thing that I really don't want to have to deal with is the raid resync code doing its read/wait/write thing while I'm writing new data via temporary buffer_heads, as that _will_ corrupt the device in a way that I can't avoid. There is no way for me to do metadata journal writes through the buffer cache without copying data (because the block cannot be in the buffer cache twice), so I _have_ to use cache-bypass here to avoid an extra copy.

> ok, i see your point. I guess i'll have to change the RAID code to do the
> following:
>
> #define raid_dirty(bh) (buffer_dirty(bh) && (buffer_count(bh) > 1))
>
> because nothing is allowed to change a clean buffer without having a
> reference to it. And nothing is allowed to release a physical index before
> dirtying it. Does this cover all cases?

It should do --- will you then do parity calculation and a writeback on a snapshot of such buffers? If so, we should be safe.

>> No. How else can I send one copy of data to two locations in
>> journaling, or bypass the cache for direct IO? This is exactly what I
>> don't want to see happen.
>
> for direct IO (with the example i have given to you) you are completely
> bypassing the cache, you are not bypassing the index! You are doing
> zero-copy, and the buffer does not stay cached.

Registering and deregistering every 512-byte block for raw IO is a CPU nightmare, but I can do it for now. However, what happens when we start wanting to pass kiobufs directly into ll_rw_block()? For performance, we really want to be able to send chunks larger than a single disk block to the driver in one go.

--Stephen
Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!
On Wed, 3 Nov 1999, Stephen C. Tweedie wrote:

> You can get around this problem by snapshotting the buffer cache and
> writing it to the disk, of course, [...]

.. which is exactly what the RAID5 code was doing ever since. It _has_ to do it to get 100% recovery anyway. This is one reason why access to caches is so important to the RAID code. (we do not snapshot clean buffers)

> > sure it can. In 2.2 the defined thing that prevents dirty blocks from
> > being written out arbitrarily (by bdflush) is the buffer lock.
>
> Wrong semantics --- the buffer lock is supposed to synchronise actual
> physical IO (ie. ll_rw_block) and temporary states of the buffer. It
> is not intended to have any relevance to bdflush.

because it synchronizes IO it _obviously_ also synchronizes bdflush access, because bdflush does nothing but keep balance and start IO! You/we might want to make this mechanism more explicit, i dont mind.

> > this is a misunderstanding! RAID will not and does not flush anything to
> > disk that is illegal to flush.
>
> raid resync currently does so by writing back buffers which are not
> marked dirty.

no, RAID marks buffers dirty which it found in a clean and not locked state in the buffer-cache. It's perfectly legal to do so - or at least it was perfectly legal until now, but we can make the rule more explicit. I always just _suggested_ to use the buffer lock.

> > I'd like to have ways to access mapped & valid cached data from the
> > physical index side.
>
> You can't. You have never been able to assume that safely.
>
> Think about ext2 writing to a file in any kernel up to 2.3.xx. We do
> the following:
>
> 	getblk()
> 	ll_rw_block(READ) if it is a partial write
> 	copy_from_user()
> 	mark_buffer_dirty()
> 	update_vm_cache()
>
> copy_from_user has always been able to block. Do you see the problem?
> We have wide-open windows in which the contents of the buffer cache have
> been modified but the buffer is not marked dirty. [...]

thanks, i now see the problem.
Thinking about it, i do not see any conceptual problem. Right now the pagecache is careless about keeping the physical index state correct, because it can assume exclusive access to that state through higher level locks.

> There are also multiple places in ext2 where we call mark_buffer_dirty()
> on more than one buffer_head after an update. mark_buffer_dirty() can
> block, so there again you have a window where you risk viewing a
> modified but not dirty buffer.
>
> So, what semantics, precisely, do you need in order to calculate parity?
> I don't see how you can do it reliably if you don't know if the
> in-memory buffer_head matches what is on disk.

ok, i see your point. I guess i'll have to change the RAID code to do the following:

	#define raid_dirty(bh) (buffer_dirty(bh) && (buffer_count(bh) > 1))

because nothing is allowed to change a clean buffer without having a reference to it. And nothing is allowed to release a physical index before dirtying it. Does this cover all cases?

> That's fine, but we're disagreeing about what the rules are. Everything
> else in the system assumes that the rule for device drivers is that
> ll_rw_block defines what they are allowed to do, nothing else. If you
> want to change that, then we really need to agree exactly what the
> required semantics are.

agreed.

> > O_DIRECT is not a problem either i believe.
>
> Indeed, the cache coherency can be worked out. The memory mapped
> writable file seems a much bigger problem.

yep.

> > the physical index and IO layer should i think be tightly integrated. This
> > has other advantages as well, not just RAID or LVM:
>
> No. How else can I send one copy of data to two locations in
> journaling, or bypass the cache for direct IO? This is exactly what I
> don't want to see happen.

for direct IO (with the example i have given to you) you are completely bypassing the cache, you are not bypassing the index! You are doing zero-copy, and the buffer does not stay cached.

-- mingo
Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

On Wed, 3 Nov 1999 10:30:36 +0100 (MET), Ingo Molnar <[EMAIL PROTECTED]> said:

>> OK... but raid resync _will_ block forever as it currently stands.
>
> {not forever, but until the transaction is committed. (it's not even
> necessary for the RAID resync to wait for locked buffers, it could as well
> skip over locked & dirty buffers - those will be written out anyway.) }

No --- it may block forever. If I have a piece of busy metadata (eg. a superblock) which is constantly being updated, then it is quite possible that the filesystem will modify and re-pin the buffer in a new transaction before the previous transaction has finished committing. (The commit happens asynchronously after the transaction has closed, for obvious performance reasons, and a new transaction can legitimately touch the buffer while the old one is committing.)

> i'd like to repeat that i do not mind what the mechanism is called,
> buffer-cache or physical-cache or pagecache-II or whatever. The fact that
> the number of 'out of the blue sky' caches and IO is growing is a _design
> bug_, i believe.

Not a helpful attitude --- I could just as legitimately say that the current raid design is a massive design bug because it violates the device driver layering so badly. There are perfectly good reasons for doing cache-bypass. You want to forbid zero-copy block device IO in the IO stack just because it is inconvenient for software raid? That's not a good answer.

Journaling has a good example: I create temporary buffer_heads in order to copy journaled metadata buffers to the log without copying the data. It is simply not possible to do zero-copy journaling if I go through the cache, because you'd be wanting me to put the same buffer_head on two different hash lists for that. Ugh.

> But whenever a given page is known to be 'bound' to a given physical block
> on a disk/device and is representative for the content,

*BUT IT ISN'T* representative of the physical content.
It only becomes representative of that content once the buffer has been written to disk. Ingo, you simply cannot assume this on 2.3. Think about memory mapped shared writable files. Even if you hash those pages' buffer_heads into the buffer cache, those writable buffers are going to be (a) volatile, and (b) *completely* out of sync with what is on disk --- and the buffer_heads in question are not going to be marked dirty.

You can get around this problem by snapshotting the buffer cache and writing it to the disk, of course, but if you're going to write the whole stripe that way then you are forcing extra IOs anyway (you are saving read IOs for parity calcs but you are having to perform extra writes), and you are also going to violate any write ordering constraints being imposed by higher levels.

> it's a partial cache i agree, but in 2.2 it is a good and valid way to
> ensure data coherency. (except in the swapping-to-a-swap-device case,
> which is a bug in the RAID code)

... and --- now --- except for raw IO, and except for journaling.

> sure it can. In 2.2 the defined thing that prevents dirty blocks from
> being written out arbitrarily (by bdflush) is the buffer lock.

Wrong semantics --- the buffer lock is supposed to synchronise actual physical IO (ie. ll_rw_block) and temporary states of the buffer. It is not intended to have any relevance to bdflush.

>> I'll not pretend that this doesn't pose difficulties for raid, but
>> neither do I believe that raid should have the right to be a cache
>> manager, deciding on its own when to flush stuff to disk.
>
> this is a misunderstanding! RAID will not and does not flush anything to
> disk that is illegal to flush.

raid resync currently does so by writing back buffers which are not marked dirty.

>> There is a huge off-list discussion in progress with Linus about this
>> right now, looking at the layering required to add IO ordering. We have
>> a proposal for per-device IO barriers, for example.
>> If raid ever thinks
>> it can write to disk without a specific ll_rw_block() request, we are
>> lost. Sorry. You _must_ observe write ordering.
>
> it _WILL_ listen to any defined rule. It will however not be able to go
> telepathic and guess any future interfaces.

There is already a defined rule, and there always has been. It is called ll_rw_block(). Anything else is a problem.

> I'd like to have ways to access mapped & valid cached data from the
> physical index side.

You can't. You have never been able to assume that safely.

Think about ext2 writing to a file in any kernel up to 2.3.xx. We do the following:

	getblk()
	ll_rw_block(READ) if it is a partial write
	copy_from_user()
	mark_buffer_dirty()
	update_vm_cache()

copy_from_user has always been able to block. Do you see the problem? We have wide-open windows in which the contents of the buffer cache have been modified but the buffer is not marked dirty. With 2.3, things just get worse for you, with writable shared mappings.
Re: Raid resync changes buffer cache semantics --- not good for journaling!
On Tue, 2 Nov 1999, Stephen C. Tweedie wrote:

> OK... but raid resync _will_ block forever as it currently stands.

{not forever, but until the transaction is committed. (it's not even necessary for the RAID resync to wait for locked buffers, it could as well skip over locked & dirty buffers - those will be written out anyway.) }

> > no, paging (named mappings) writes do not bypass the buffer-cache, and
> > thats the issue.
>
> Mappings bypass it for read in 2.2, and for read _and_ write on 2.3. I
> wasn't talking about writes: I'm talking about IO in general. IO is not
> limited to the buffer cache, and the places where the buffer cache is
> bypassed are growing, not shrinking.

i said writes; reads are irrelevant in this context. in our case we were mainly worried about cache coherency issues, and obviously only dirty data (ie. writes) is interesting in this sense. OTOH i do agree that for better generic performance the RAID code would like to see cached reads as well, not only dirty data, and this is a problem on 2.2 as well.

i'd like to repeat that i do not mind what the mechanism is called, buffer-cache or physical-cache or pagecache-II or whatever. The fact that the number of 'out of the blue sky' caches and IO is growing is a _design bug_, i believe. Not too hard to fix at this point, and i'm aware of the possible speed impact, which i'd like to minimize as much as possible. I do not mind invisible but coherent dirty-caches, we always had those: eg. the inode cache, or not yet mapped dirty pages (not really present right now, but possible with lazy allocation). But whenever a given page is known to be 'bound' to a given physical block on a disk/device and is representative of the content, i'd like to have ways to access/manage these cache elements. ('manage' obviously does not include 'changing contents', it means changing the state of the cache element in a defined way, eg. marking it dirty, marking it clean, unmapping it, remapping it, etc.)
> > I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
> > it out. (It's not really hard to fix because the swap cache is more or
> > less physically indexed.)
>
> Yep. Journaling will have the same problem. The block device interface
> has never required that the writes come from the buffer cache.

yes, i do not require it either. But i'd like to see a way to access cached contents 'from the physical side' as well - in cases where this is possible. (and all other cases should -and currently do- stay coherent explicitly)

> > i dont really mind how it's called. It's a physical index of all dirty &
> > cached physical device contents which might get written out directly to
> > the device at any time. In 2.2 this is the buffer-cache.
>
> No it isn't. The buffer cache is a partial cache at best. It does not
> record all writes, and certainly doesn't record all reads, even on 2.2.

it's a partial cache i agree, but in 2.2 it is a good and valid way to ensure data coherency. (except in the swapping-to-a-swap-device case, which is a bug in the RAID code)

> Most importantly, data in the buffer cache cannot be written arbitrarily
> to disk at any time by the raid code: you'll totally wreck any write
> ordering attempts by higher level code.

sure it can. In 2.2 the defined thing that prevents dirty blocks from being written out arbitrarily (by bdflush) is the buffer lock. bdflush is a cache manager similar to the RAID code! 'Write ordering attempts by higher levels' first have to be defined, Stephen, and sure, if/when these write ordering requirements become part of the buffer-cache then the RAID code will listen to them. But you cannot blame a 3 year old concept for not working with 6 month old new code.

i'd like to re-ask the question why locking buffers is not good for keeping transactions pending. This solves the bdflush and RAID issue without _any_ change to the buffer-cache.
There might be practical/theoretical reasons for this not being possible/desired, please enlighten me if this is the case. Again, i do not mind having another 'write ordering' mechanism either, but these should first be defined and agreed on.

> It can't access the page cache in 2.2.

(yes, i'd like to fix this in the 2.3 RAID code, additionally to being nice to the journalling code as well.)

> > The RAID code is not just a device driver, it's also a cache
> > manager. Why do you think it's inferior to access cached data along a
> > physical index?
>
> Ask Linus, he's pushing this point much more strongly than I am! The
> buffer cache will become less and less of a cache as time goes on in his
> grand plan: it is to become little more than an IO buffer layer.

we should not be too focused on the buffer-cache. The buffer-cache has many 'legacy' features (eg.

> I'll not pretend that this doesn't pose difficulties for raid, but
> neither do I believe that raid should have the right to be a cache
> manager, deciding on its own when to flush stuff to disk.

this
Re: Raid resync changes buffer cache semantics --- not good for journaling!
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Date: Tue, 2 Nov 1999 17:44:55 +0000 (GMT)

> Ask Linus, he's pushing this point much more strongly than I am! The
> buffer cache will become less and less of a cache as time goes on in his
> grand plan: it is to become little more than an IO buffer layer.

Ultimately, I think we may be better off if we remove any hint of caching from the I/O buffer layer. The cache coherency issues between the page and buffer cache make me nervous, and I'm not completely 100% convinced we got it all right. (I'm wondering if some of the ext2 corruption reports in the 2.2 kernels are coming from a buffer cache/page cache corruption.)

This means putting filesystem meta-data into the page cache. Yes, I know Stephen has some concerns about doing this, because the big memory patches mean pages in the page cache might not be directly accessible by the kernel. I see two solutions to this, both with drawbacks. One is to use a VM remap facility to map directories, superblocks, inode tables etc. into the kernel address space. The other is to have flags which ask the kernel to map filesystem metadata into the part of the page cache that's addressable by the kernel. The first adds a VM delay to accessing the filesystem metadata, and the other means we need to manage the part of the page cache that's below 2GB differently from the page cache in high memory, at least as far as freeing pages in response to memory pressure is concerned.

> Basically, for the raid code to poke around in higher layers is a huge
> layering violation. We are heading towards doing things like adding
> kiobuf interfaces to ll_rw_block (in which the IO descriptor that the
> driver receives will have no reference to the buffer cache), and raw,
> unbuffered access to the drivers for raw devices and O_DIRECT. Raw IO
> is already there and bypasses the buffer cache. So does swap. So does
> journaling. So does page-in (in 2.2) and page-out (in 2.3).
It'll be interesting to see how this affects using dump(8) on a mounted filesystem. This was never particularly guaranteed to give a coherent filesystem image, but what with increasing bypass of the buffer cache, it may make the results of using dump(8) on a live filesystem even worse.

One way of solving this is to add some kernel support for dump(8); for example, the infamous iopen() call which Linus hates so much. (Yes, it violates the Unix permission model, which is why it needs to be restricted to root, and yes, it won't work on all filesystems; just those that have inodes.) The other is to simply tell people to give up on dump completely, and just use a file-level tool such as tar or bru.

						- Ted
Re: Raid resync changes buffer cache semantics --- not good for journaling!
On Tue, Nov 02, 1999 at 11:17:23AM -0700, Matt Zinkevicius wrote:

> Is just software RAID affected? or hardware RAID as well?

Just the software one. With hardware RAID the intricacies of the operations are hidden, and the hardware presents a unified, standardized interface (like SCSI ...). The kernel is not concerned by this.

Daniel

--
[EMAIL PROTECTED] | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux, WWW, rpmfind,
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | rpm2html, XML,
http://www.w3.org/People/W3Cpeople.html#Veillard | badminton, and Kaffe.
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Is just software RAID affected? or hardware RAID as well?

--Matt

______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

On Tue, 2 Nov 1999 14:12:00 +0100 (MET), Ingo Molnar <[EMAIL PROTECTED]> said:

> yes but this means that the block was not cached.

OK... but raid resync _will_ block forever as it currently stands.

>> > 2.3 removes physical indexing of cached blocks,
>>
>> 2.2 never guaranteed that IO was from cached blocks in the first place.
>> Swap and paging both bypass the buffer cache entirely. [..]
>
> no, paging (named mappings) writes do not bypass the buffer-cache, and
> thats the issue.

Mappings bypass it for read in 2.2, and for read _and_ write on 2.3. I wasn't talking about writes: I'm talking about IO in general. IO is not limited to the buffer cache, and the places where the buffer cache is bypassed are growing, not shrinking.

> I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
> it out. (It's not really hard to fix because the swap cache is more or
> less physically indexed.)

Yep. Journaling will have the same problem. The block device interface has never required that the writes come from the buffer cache.

>> But you cannot rely on the buffer cache. If I "dd" to a swapfile and do
>> a swapon, then the swapper will start to write to that swapfile using
>> temporary buffer_heads. If you do IO or checksum optimisation based on
>> the buffer cache you'll risk plastering obsolete data over the disks.
>
> i dont really mind how it's called. It's a physical index of all dirty &
> cached physical device contents which might get written out directly to
> the device at any time. In 2.2 this is the buffer-cache.

No it isn't. The buffer cache is a partial cache at best. It does not record all writes, and certainly doesn't record all reads, even on 2.2. Most importantly, data in the buffer cache cannot be written arbitrarily to disk at any time by the raid code: you'll totally wreck any write ordering attempts by higher level code.

> Think about it, it's not a hack, it's a solid concept.
> The RAID code
> cannot even create its own physical index if the cache is completely
> private. Should the RAID code re-read blocks from disk when it
> calculates parity, just because it cannot access already cached data
> in the pagecache?

It can't access the page cache in 2.2.

> The RAID code is not just a device driver, it's also a cache
> manager. Why do you think it's inferior to access cached data along a
> physical index?

Ask Linus, he's pushing this point much more strongly than I am! The buffer cache will become less and less of a cache as time goes on in his grand plan: it is to become little more than an IO buffer layer.

Basically, for the raid code to poke around in higher layers is a huge layering violation. We are heading towards doing things like adding kiobuf interfaces to ll_rw_block (in which the IO descriptor that the driver receives will have no reference to the buffer cache), and raw, unbuffered access to the drivers for raw devices and O_DIRECT. Raw IO is already there and bypasses the buffer cache. So does swap. So does journaling. So does page-in (in 2.2) and page-out (in 2.3).

I'll not pretend that this doesn't pose difficulties for raid, but neither do I believe that raid should have the right to be a cache manager, deciding on its own when to flush stuff to disk. There is a huge off-list discussion in progress with Linus about this right now, looking at the layering required to add IO ordering. We have a proposal for per-device IO barriers, for example. If raid ever thinks it can write to disk without a specific ll_rw_block() request, we are lost. Sorry. You _must_ observe write ordering.

Peeking into the buffer cache for reads is a much more benign behaviour. It is still going to be a big problem for things like journaling and raw IO, and is a potential swap corrupter, but we can fix these things by being religious about removing or updating any buffer-cache copies of disk blocks we're about to write while bypassing the buffer cache.
Right now the raid resync can clearly write buffers without being asked to do so, and that needs to be fixed. It should be possible to do so without redesigning the whole of software raid. Can we assume that, apart from that, raid*.c never writes data without being asked, even if it does use the buffer cache to compute parity?

> well, we are not talking about non-cached IO here. We are talking about a
> new kind of (improved) page cache that is not physically indexed. _This_
> is the problem. If the page-cache was physically indexed then i could look
> it up from the RAID code just fine. If the page-cache was physically
> indexed (or more accurately, the part of the pagecache that is already
> mapped to a device in one way or another, which is 90+% of it.) then the
> RAID code could obey all the locking (and additional delaying) rules
> present there.

It is insane to think that a device driver (which raid *is*, from the point of view of the page cache) should have a right to poke about in a virtually indexed cache.
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

On Tue, 02 Nov 1999 08:43:01 -0700, [EMAIL PROTECTED] said:

>> Fixing this in raid seems far, far preferable to fixing it in the
>> filesystems. The filesystem should be allowed to use the buffer cache
>> for metadata and should be able to assume that there is a way to prevent
>> those buffers from being written to disk until it is ready.
>
> What about doing it in the page cache: i.e. reserve pages for journaling
> and let them hit the buffer cache only when the transaction allows it?
> This may be a naive suggestion, but it looks logical.

The issue is one of IO. Journaling builds a list of disk block updates which need to be applied in a given order. IO is block-based, not page-based. I can cache a directory in the page cache, but when I start doing modifications, it's on a per-block basis because journaling has got to record modified blocks. I could quite easily end up with two different blocks in the same page belonging to two different transactions if the blocksize is less than the pagesize. Doing filesystem caching in the page cache is fine, but it does not really make sense as a data structure in which to build a transaction's pending-write lists.

There's a second issue: I talked with Ingo about the journaling API early on, and it was designed specifically to support buffer journaling. I want to be able to allow the raid or LVM driver code to use the same jfs API to apply transactional updates across multiple devices at the block device level, for doing things like reconfiguring an array to merge a new disk or to mark errors.

--Stephen
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

Stephen wrote:

> Fixing this in raid seems far, far preferable to fixing it in the
> filesystems. The filesystem should be allowed to use the buffer cache
> for metadata and should be able to assume that there is a way to prevent
> those buffers from being written to disk until it is ready.

What about doing it in the page cache: i.e. reserve pages for journaling and let them hit the buffer cache only when the transaction allows it? This may be a naive suggestion, but it looks logical.

- Peter -
Re: Raid resync changes buffer cache semantics --- not good for journaling!
On Tue, 2 Nov 1999, Stephen C. Tweedie wrote:

> > i dont think dump should block. dump(8) is using the raw block device to
> > read fs data, which in turn uses the buffer-cache to get to the cached
> > state of device blocks. Nothing blocks there, i've just re-checked
> > fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
> > anything.
>
> fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
> by a wait_on_buffer(). It blocks.

yes, but this means that the block was not cached. Remember the original point: my suggestion was to 'keep in-transaction buffers locked'. You said this doesnt work because it blocks dump(). But dump() CANNOT block on those buffers because they are cached, => dump does not block but just uses getblk() and skips over those buffers. dump() _of course_ blocks if the buffer is not cached. Or have i misunderstood you and we are talking about different issues?

> > You suggested a new mechanism to mark buffers as 'pinned',
>
> That is only to synchronise with bdflush: I'd like to be able to
> distinguish between buffers which contain dirty data but which are not
> yet ready for disk IO, and buffers which I want to send to the disk.
> The device drivers themselves should never ever have to worry about
> those buffers: ll_rw_block() is the defined interface for device
> drivers, NOT the buffer cache.

(see later)

> > 2.3 removes physical indexing of cached blocks,
>
> 2.2 never guaranteed that IO was from cached blocks in the first place.
> Swap and paging both bypass the buffer cache entirely. [..]

no, paging (named mappings) writes do not bypass the buffer-cache, and thats the issue. RAID would pretty quickly corrupt filesystems if this was the case. In 2.2 all filesystem (data and metadata) writes go through the buffer-cache. I agree that swapping is a problem (bug) even in 2.2, thanks for pointing it out. (It's not really hard to fix because the swap cache is more or less physically indexed.)
> > and this destroys a fair amount of physical-level optimizations that
> > were possible. (eg. RAID5 has to detect cached data within the same
> > row, to speed up things and avoid double-buffering. If data is in the
> > page cache and not hashed then there is no way RAID5 could detect such
> > data.)
>
> But you cannot rely on the buffer cache. If I "dd" to a swapfile and do
> a swapon, then the swapper will start to write to that swapfile using
> temporary buffer_heads. If you do IO or checksum optimisation based on
> the buffer cache you'll risk plastering obsolete data over the disks.

i dont really mind how it's called. It's a physical index of all dirty & cached physical device contents which might get written out directly to the device at any time. In 2.2 this is the buffer-cache. Think about it, it's not a hack, it's a solid concept. The RAID code cannot even create its own physical index if the cache is completely private. Should the RAID code re-read blocks from disk when it calculates parity, just because it cannot access already cached data in the pagecache? The RAID code is not just a device driver, it's also a cache manager. Why do you think it's inferior to access cached data along a physical index?

> > i'll probably try to put pagecache blocks on the physical index again
> > (the buffer-cache), which solution i expect will face some resistance
> > :)
>
> Yes. Device drivers should stay below ll_rw_block() and not make any
> assumptions about the buffer cache. Linus is _really_ determined not to
> let any new assumptions about the buffer cache into the kernel (I'm
> having to deal with this in the journaling filesystem too).

well, as a matter of fact, for a couple of pre-kernels we had all pagecache pages aliased into the buffer-cache as well, so it's not a technical problem at all.
At that time it clearly appeared to be beneficial (simpler) to unhash
pagecache pages from the buffer-cache, as those two entities are
orthogonal, but we might want to rethink that issue.

>> in 2.2 RAID is a user of the buffer-cache, uses it and obeys its
>> rules. The buffer-cache represents all cached (dirty and clean)
>> blocks within the system.
>
> It does not, however, represent any non-cached IO.

well, we are not talking about non-cached IO here. We are talking about a
new kind of (improved) page cache that is not physically indexed. _This_
is the problem. If the page-cache were physically indexed (or, more
accurately, the part of the pagecache that is already mapped to a device
in one way or another, which is 90+% of it), then i could look it up from
the RAID code just fine, and the RAID code could obey all the locking
(and additional delaying) rules present there. This is not just about
resync! If it were only for resync, then we could surely hack in some
sort of device-level lock to protect the reconstruction window.

i think your problem is that you do not accept the fact that t
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

On Mon, 1 Nov 1999 13:04:23 -0500 (EST), Ingo Molnar <[EMAIL PROTECTED]>
said:

> On Mon, 1 Nov 1999, Stephen C. Tweedie wrote:
>
>> No, that's completely inappropriate: locking the buffer indefinitely
>> will simply cause jobs like dump() to block forever, for example.
>
> i don't think dump should block. dump(8) is using the raw block device
> to read fs data, which in turn uses the buffer-cache to get to the
> cached state of device blocks. Nothing blocks there, i've just
> re-checked fs/block_dev.c, it's using getblk(), and getblk() is not
> blocking on anything.

fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
by a wait_on_buffer(). It blocks.

> (the IO layer should and does synchronize on the bh lock)

Exactly, and the lock flag should be used to synchronise IO, _not_ to
play games with bdflush/writeback. If we keep buffers locked, then raid
resync is going to stall there too for the same reason ---
wait_on_buffer() will block.

>> However, you're missing a much more important issue: not all writes go
>> through the buffer cache.
>>
>> Currently, swapping bypasses the buffer cache entirely: writes from
>> swap go via temporary buffer_heads to ll_rw_block. The buffer_heads
>> are
>
> we were not talking about swapping but journalled transactions, and you
> were asking about a mechanism to keep the RAID resync from writing back
> to disk.

It's the same issue. If you arbitrarily write back through the buffer
cache while a swap write IO is in progress, you can wipe out that swap
data and corrupt the swap file. If you arbitrarily write back journaled
buffers before journaling asks you to, you destroy recovery. The swap
case is, if anything, even worse: it kills you even if you don't take a
reboot, because you have just overwritten the swapped-out data with the
previous contents of the buffer cache, so you've lost a write to disk.
Journaling does the same thing: it uses temporary buffer heads to write
metadata to the log without copying the buffer contents. Again, this is
IO which is not in the buffer cache.

There are thus two problems: (a) the raid code is writing back data from
the buffer cache oblivious to the fact that other users of the device
may be writing back data which is not in the buffer cache at all, and
(b) it is writing back data when it was not asked to do so, destroying
write ordering. Both of these violate the definition of a device driver.

> The RAID layer resync thread explicitly synchronizes on locked
> buffers. (it doesn't have to but it does)

And that is illegal, because it assumes that everybody else is using the
buffer cache. That is not the case, and it is even less the case in 2.3.

> You suggested a new mechanism to mark buffers as 'pinned',

That is only to synchronise with bdflush: I'd like to be able to
distinguish between buffers which contain dirty data but which are not
yet ready for disk IO, and buffers which I want to send to the disk. The
device drivers themselves should never ever have to worry about those
buffers: ll_rw_block() is the defined interface for device drivers, NOT
the buffer cache.

>> In 2.3 the situation is much worse, as _all_ ext2 file writes bypass
>> the buffer cache. [...]
>
> the RAID code has major problems with 2.3's pagecache changes.

It will have major problems with ext3 too, then, but I really do think
that is raid's fault, because:

> 2.3 removes physical indexing of cached blocks,

2.2 never guaranteed that IO was from cached blocks in the first place.
Swap and paging both bypass the buffer cache entirely. To assume that
you can synchronise IO by doing a getblk() and syncing on the
buffer_head is wrong, even if it used to work most of the time.

> and this destroys a fair amount of physical-level optimizations that
> were possible. (eg. RAID5 has to detect cached data within the same
> row, to speed up things and avoid double-buffering. If data is in the
> page cache and not hashed then there is no way RAID5 could detect such
> data.)

But you cannot rely on the buffer cache. If I "dd" to a swapfile and do
a swapon, then the swapper will start to write to that swapfile using
temporary buffer_heads. If you do IO or checksum optimisation based on
the buffer cache you'll risk plastering obsolete data over the disks.

> i'll probably try to put pagecache blocks on the physical index again
> (the buffer-cache), which solution i expect will face some resistance
> :)

Yes. Device drivers should stay below ll_rw_block() and not make any
assumptions about the buffer cache. Linus is _really_ determined not to
let any new assumptions about the buffer cache into the kernel (I'm
having to deal with this in the journaling filesystem too).

> in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
> The buffer-cache represents all cached (dirty and clean) blocks within
> the system.

It does not, however, represent any non-cached IO.

> If there are other block caches in the system (the page-cache
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

On Fri, 29 Oct 1999 14:06:24 -0400 (EDT), Ingo Molnar <[EMAIL PROTECTED]>
said:

> On Fri, 29 Oct 1999, Stephen C. Tweedie wrote:
>
>> Fixing this in raid seems far, far preferable to fixing it in the
>> filesystems. The filesystem should be allowed to use the buffer cache
>> for metadata and should be able to assume that there is a way to
>> prevent those buffers from being written to disk until it is ready.
>
> why don't you lock the buffer during the transaction? that's the right
> way to 'pin' down a buffer and prevent it from being written out. You
> can keep a buffer locked && dirty indefinitely, and it should be easy
> to unlock them when committing a transaction. Am i missing something?

No, that's completely inappropriate: locking the buffer indefinitely
will simply cause jobs like dump() to block forever, for example.

However, you're missing a much more important issue: not all writes go
through the buffer cache.

Currently, swapping bypasses the buffer cache entirely: writes from swap
go via temporary buffer_heads to ll_rw_block. The buffer_heads are never
part of the buffer cache and are discarded as soon as IO is complete.
The same mechanism is used when reading into the page cache, but that's
probably safe enough, as writes do use the buffer cache in 2.2.

In 2.3 the situation is much worse, as _all_ ext2 file writes bypass the
buffer cache. The buffer_heads do persist, but they overlay the page
cache, not the buffer cache --- they do not appear on the buffer cache
hash lists. You _cannot_ synchronise with these writes at the buffer
cache level. If your raid resync collides with such a write, it is
entirely possible that the filesystem write will occur between the raid
read and the raid write --- you will corrupt ext2 files.

In my own case right now, ext3 on 2.2 behaves much like ext2 does on 2.3
--- it uses temporary buffer_heads to write directly to the journal, and
so is going to be bitten by the raid resync behaviour.
Basically, device drivers cannot assume that all IOs come from the
buffer cache --- raid has to work at the level of ll_rw_block, not at
the level of the buffer cache.

--Stephen
Re: Raid resync changes buffer cache semantics --- not good for journaling!
On Fri, 29 Oct 1999, Stephen C. Tweedie wrote:

> Ingo, can we work together to address this? One solution would be the
> ability to mark a buffer_head as "pinned" against being written to
> disk,

> Fixing this in raid seems far, far preferable to fixing it in the
> filesystems. The filesystem should be allowed to use the buffer cache
> for metadata and should be able to assume that there is a way to
> prevent those buffers from being written to disk until it is ready.

why don't you lock the buffer during the transaction? That's the right
way to 'pin' down a buffer and prevent it from being written out. You
can keep a buffer locked && dirty indefinitely, and it should be easy to
unlock them when committing a transaction. Am i missing something?

	Ingo
Raid resync changes buffer cache semantics --- not good for journaling!
Hi all,

There seems to be a conflict between journaling filesystem requirements
(both ext3 and reiserfs) and the current raid code when it comes to
write ordering in the buffer cache.

The current ext3 code adds debugging checks to ll_rw_block designed to
detect any cases where blocks are being written to disk in an order
which breaks the filesystem's transaction ordering guarantees. A couple
of hours ago it was triggered during a test run here by the raid
background resync daemon.

Raid resync basically works by reading, and rewriting, the entire raid
device stripe by stripe. The write pass is unconditional. Even if the
block is marked as reserved for journaling, and so is bypassed by
bdflush, even if the block is clean: it gets written to disk.

ext3 uses a separate buffer list for journaled buffers to avoid bdflush
writing them back early. As I understand it (correct me if I'm wrong,
Chris), reiserfs journaling simply avoids setting the dirty bit on the
buffer_head until the log record has been written. Neither case stops
raid resync from flushing the buffer to disk.

As far as I can see, the current raid resync simply cannot observe any
write ordering requirements being placed on the buffer cache. This is
something which will have to be addressed in the raid code --- the only
alternative appears to be to avoid placing any uncommitted transactional
data in the buffer cache at all, which would require massive rewrites of
ext3 (and probably no less trauma in reiserfs).

This isn't a bug in either the raid code or the journaling --- it's just
that the raid code changes semantics which non-journaling filesystems
don't care about. Journaling adds extra requirements to the buffer
cache, and raid changes the semantics in an incompatible way. Put the
two together and you have serious problems during a background raid
sync.

Ingo, can we work together to address this?
One solution would be the ability to mark a buffer_head as "pinned"
against being written to disk, and to have raid resync use a temporary
buffer head when updating such a block, using the on-disk copy, not the
in-memory one, to update the disk (guaranteeing that the in-memory copy
doesn't hit disk). You will have a much better understanding of the
locking requirements necessary to ensure that the two copies don't cause
mayhem, but I'm willing to help on the implementation.

Fixing this in raid seems far, far preferable to fixing it in the
filesystems. The filesystem should be allowed to use the buffer cache
for metadata and should be able to assume that there is a way to prevent
those buffers from being written to disk until it is ready.

--Stephen
Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

On Thu, 4 Nov 1999 19:03:18 +0100 (CET), Rik van Riel
<[EMAIL PROTECTED]> said:

> The obvious solution would be to allow multiple versions of the
> same disk block to be in memory.

That is already possible. You can make as many buffer_heads as you want
for a given disk block, as long as only one is in the buffer cache
itself.

> We need that anyway for journalling, so why not extend it with a flag
> (or use another flag) to show that _this_ particular disk block is the
> one on disk or the one supposed to be on disk (go on, flush this).

That's just not the point --- even for the copy which _is_ supposed to
be authoritative on disk, you can't be sure that it is in fact uptodate,
because there is a race between the filesystem modifying the buffer
contents and the dirty bit being set on the buffer_head.

--Stephen
Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!
On Wed, 3 Nov 1999, Ingo Molnar wrote:

> On Wed, 3 Nov 1999, Stephen C. Tweedie wrote:
>
> > So, what semantics, precisely, do you need in order to calculate
> > parity? I don't see how you can do it reliably if you don't know if
> > the in-memory buffer_head matches what is on disk.
>
> ok, i see your point. I guess i'll have to change the RAID code to
> do the following: [snip]

The obvious solution would be to allow multiple versions of the same
disk block to be in memory. We need that anyway for journalling, so why
not extend it with a flag (or use another flag) to show that _this_
particular disk block is the one on disk or the one supposed to be on
disk (go on, flush this).

regards,

Rik
--
The Internet is not a network of computers. It is a network of people.
That is its real strength.