Hi,

On Mon, 1 Nov 1999 13:04:23 -0500 (EST), Ingo Molnar <[EMAIL PROTECTED]>
said:

> On Mon, 1 Nov 1999, Stephen C. Tweedie wrote:
>> No, that's completely inappropriate: locking the buffer indefinitely
>> will simply cause jobs like dump() to block forever, for example.

> i dont think dump should block. dump(8) is using the raw block device to
> read fs data, which in turn uses the buffer-cache to get to the cached
> state of device blocks. Nothing blocks there, i've just re-checked
> fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
> anything.

fs/block_dev.c:block_read() naturally does an ll_rw_block(READ) followed
by a wait_on_buffer().  It blocks.
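To make the point concrete, here is a toy userspace model of that read
path (all names here are hypothetical stand-ins, not the real kernel
structures): the submit step locks the buffer, and the caller cannot get
past the wait step until the IO completion handler clears the lock.

```c
#include <assert.h>

/* Hypothetical stand-ins for buffer_head, ll_rw_block() and
 * wait_on_buffer(), modelling only the locking behaviour. */
struct toy_bh {
    int locked;   /* set while IO is in flight */
    int uptodate; /* set once the read has completed */
};

/* stand-in for ll_rw_block(READ, 1, &bh): queue the IO, lock the buffer */
static void toy_submit_read(struct toy_bh *bh)
{
    bh->locked = 1;
}

/* stand-in for the IO completion run by the driver: it is the only
 * thing that unlocks the buffer and wakes the sleepers */
static void toy_end_io(struct toy_bh *bh)
{
    bh->uptodate = 1;
    bh->locked = 0;
}

/* stand-in for wait_on_buffer(): the real one sleeps; here we just
 * report whether the caller would have to block */
static int toy_would_block(const struct toy_bh *bh)
{
    return bh->locked;
}
```

The reader blocks precisely because nothing between submit and
completion clears the lock bit.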

> (the IO layer should and does synchronize on the bh lock) 

Exactly, and the lock flag should be used to synchronise IO, _not_ to
play games with bdflush/writeback.  If we keep buffers locked, then raid
resync is going to stall there too for the same reason ---
wait_on_buffer() will block.

>> However, you're missing a much more important issue: not all writes go
>> through the buffer cache.

>> Currently, swapping bypasses the buffer cache entirely: writes from swap
>> go via temporary buffer_heads to ll_rw_block.  The buffer_heads are

> we were not talking about swapping but journalled transactions, and you
> were asking about a mechanizm to keep the RAID resync from writing back to
> disk.

It's the same issue.  If you arbitrarily write back through the buffer
cache while a swap write IO is in progress, you can wipe out that swap
data and corrupt the swap file.  If you arbitrarily write back journaled
buffers before journaling asks you to, you destroy recovery.  The swap
case is, if anything, even worse: it kills you even if you don't take a
reboot, because you have just overwritten the swapped-out data with the
previous contents of the buffer cache, so you've lost a write to disk.
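The lost-write scenario can be simulated in a few lines (a toy sketch,
with made-up names; the "stale_writeback" step plays the role of an
unsolicited buffer-cache writeback such as a raid resync):

```c
#include <assert.h>
#include <string.h>

/* One disk block, plus the buffer cache's (possibly stale) copy of it.
 * All names are hypothetical; this only models the ordering bug. */
static char disk_block[16];
static char buffer_cache_copy[16];

/* the block gets read into the buffer cache at some point */
static void cache_block(void)
{
    memcpy(buffer_cache_copy, disk_block, sizeof disk_block);
}

/* a swap write goes straight to disk, bypassing the buffer cache */
static void swap_write(const char *data)
{
    strncpy(disk_block, data, sizeof disk_block);
}

/* an unsolicited writeback of the cached copy (e.g. raid resync) */
static void stale_writeback(void)
{
    memcpy(disk_block, buffer_cache_copy, sizeof disk_block);
}
```

Run cache_block(), then swap_write(), then stale_writeback(), and the
disk ends up holding the old cached contents: the swap write is gone
without any reboot or crash being involved.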

Journaling does the same thing by using temporary buffer heads to write
metadata to the log without copying the buffer contents.  Again it is IO
which is not in the buffer cache.
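The temporary-buffer_head trick amounts to pointing a throwaway head at
data that lives elsewhere, so the IO layer can write it without a copy
and without the block ever entering the buffer cache.  A minimal sketch,
with hypothetical toy types (the real buffer_head has many more fields):

```c
#include <assert.h>

/* hypothetical page of data owned by the page cache or swap code */
struct toy_page {
    char data[4096];
};

/* hypothetical cut-down buffer_head: just enough to describe one IO */
struct toy_temp_bh {
    char *b_data;     /* aliases someone else's memory, no copy made */
    long  b_blocknr;  /* where on disk this IO should go */
};

/* build a temporary head for submitting page data to ll_rw_block() */
static struct toy_temp_bh make_temp_bh(struct toy_page *page, long blocknr)
{
    struct toy_temp_bh bh;
    bh.b_data = page->data;   /* alias, not a copy */
    bh.b_blocknr = blocknr;
    return bh;
}
```

Nothing in this path ever hashes the block into the buffer cache, which
is exactly why a resync thread that only looks at the buffer cache
cannot see this IO.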

There are thus two problems: (a) the raid code is writing back data
from the buffer cache oblivious to the fact that other users of the
device may be writing back data which is not in the buffer cache at all,
and (b) it is writing back data when it was not asked to do so,
destroying write ordering.  Both of these violate the definition of a
device driver.

> The RAID layer resync thread explicitly synchronizes on locked
> buffers. (it doesnt have to but it does) 

And that is illegal, because it assumes that everybody else is using the
buffer cache.  That is not the case, and it is even less the case in
2.3.

> You suggested a new mechanizm to mark buffers as 'pinned', 

That is only to synchronise with bdflush: I'd like to be able to
distinguish between buffers which contain dirty data but which are not
yet ready for disk IO, and buffers which I want to send to the disk.
The device drivers themselves should never ever have to worry about
those buffers: ll_rw_block() is the defined interface for device
drivers, NOT the buffer cache.
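The distinction I want could be as small as one extra state bit.  A
sketch, with invented flag names (this is the proposal, not existing
kernel code): writeback may submit a buffer only if it is dirty and not
pinned.

```c
#include <assert.h>

/* hypothetical buffer state bits for the proposed "pinned" mechanism */
enum {
    TOY_DIRTY  = 1,  /* contains data not yet on disk */
    TOY_PINNED = 2,  /* dirty, but not yet ready for disk IO */
};

/* bdflush-style test: may writeback submit this buffer right now? */
static int toy_may_flush(int flags)
{
    return (flags & TOY_DIRTY) && !(flags & TOY_PINNED);
}
```

bdflush honours the pin; the device drivers never see these buffers at
all until ll_rw_block() is called on them.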

>> In 2.3 the situation is much worse, as _all_ ext2 file writes bypass the
>> buffer cache. [...]

> the RAID code has major problems with 2.3's pagecache changes. 

It will have major problems with ext3 too, then, but I really do think
that is RAID's fault, because:

> 2.3 removes physical indexing of cached blocks, 

2.2 never guaranteed that IO was from cached blocks in the first place.
Swap and paging both bypass the buffer cache entirely.  To assume that
you can synchronise IO by doing a getblk() and syncing on the
buffer_head is wrong, even if it used to work most of the time.

> and this destroys a fair amount of physical-level optimizations that
> were possible. (eg. RAID5 has to detect cached data within the same
> row, to speed up things and avoid double-buffering. If data is in the
> page cache and not hashed then there is no way RAID5 could detect such
> data.)

But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
a swapon, then the swapper will start to write to that swapfile using
temporary buffer_heads.  If you do IO or checksum optimisation based on
the buffer cache you'll risk plastering obsolete data over the disks.  

> i'll probably try to put pagecache blocks on the physical index again
> (the buffer-cache), which solution i expect will face some resistance
> :)

Yes.  Device drivers should stay below ll_rw_block() and not make any
assumptions about the buffer cache.  Linus is _really_ determined not to
let any new assumptions about the buffer cache into the kernel (I'm
having to deal with this in the journaling filesystem too).

> in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
> The buffer-cache represents all cached (dirty and clean) blocks within the
> system. 

It does not, however, represent any non-cached IO.

> If there are other block caches in the system (the page-cache in 2.2
> was readonly, thus not an issue), then the RAID code has to and will
> synchronize with them.

Does it do so in 2.2?  RAID resync, for one, does not.

--Stephen