On Tue, 2 Nov 1999, Stephen C. Tweedie wrote:

> > i dont think dump should block. dump(8) is using the raw block device to
> > read fs data, which in turn uses the buffer-cache to get to the cached
> > state of device blocks. Nothing blocks there, i've just re-checked
> > fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
> > anything.
> 
> fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
> by a wait_on_buffer().  It blocks.

yes, but this means that the block was not cached. Remember the original
point: my suggestion was to 'keep in-transaction buffers locked'. You said
this doesn't work because it blocks dump(). But dump() CANNOT block on
those buffers, because they are cached => dump() does not block, it just
uses getblk() and skips over them. dump() _of course_ blocks if the buffer
is not cached. Or have I misunderstood you and we are talking about
different issues?
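
For reference, the read path in 2.2 looks roughly like this (a simplified
sketch of the bread()/block_read() pattern, not the literal
fs/block_dev.c code):

	/* getblk() never blocks on a locked buffer, it just looks up
	 * (or creates) the buffer_head: */
	struct buffer_head *bh = getblk(dev, block, size);

	if (!buffer_uptodate(bh)) {
		/* only reached if the block is NOT cached */
		ll_rw_block(READ, 1, &bh);
		wait_on_buffer(bh);	/* <--- the only place we can sleep */
	}
	/* cached && uptodate => no blocking at all */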

> > You suggested a new mechanizm to mark buffers as 'pinned', 
> 
> That is only to synchronise with bdflush: I'd like to be able to
> distinguish between buffers which contain dirty data but which are not
> yet ready for disk IO, and buffers which I want to send to the disk.
> The device drivers themselves should never ever have to worry about
> those buffers: ll_rw_block() is the defined interface for device
> drivers, NOT the buffer cache.

(see later)

> > 2.3 removes physical indexing of cached blocks, 
> 
> 2.2 never guaranteed that IO was from cached blocks in the first place.
> Swap and paging both bypass the buffer cache entirely. [..]

no, paging writes (named mappings) do not bypass the buffer-cache, and
that's the issue. RAID would corrupt filesystems pretty quickly if that
were the case. In 2.2, all filesystem writes (data and metadata) go
through the buffer-cache.

I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
it out. (It's not really hard to fix because the swap cache is more or
less physically indexed.) 

> > and this destroys a fair amount of physical-level optimizations that
> > were possible. (eg. RAID5 has to detect cached data within the same
> > row, to speed up things and avoid double-buffering. If data is in the
> > page cache and not hashed then there is no way RAID5 could detect such
> > data.)
> 
> But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
> a swapon, then the swapper will start to write to that swapfile using
> temporary buffer_heads.  If you do IO or checksum optimisation based on
> the buffer cache you'll risk plastering obsolete data over the disks.  

I don't really mind what it's called. It's a physical index of all dirty
and cached physical device contents which might get written out directly
to the device at any time; in 2.2 this is the buffer-cache. Think about
it: it's not a hack, it's a solid concept. The RAID code cannot even
create its own physical index if the cache is completely private. Should
the RAID code re-read blocks from disk when it calculates parity, just
because it cannot access already cached data in the pagecache? The RAID
code is not just a device driver, it's also a cache manager. Why do you
think it's inferior to access cached data along a physical index?
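
To make it concrete, this is roughly what I mean by using the physical
index for parity (an illustrative sketch only, not the actual raid5.c
code: get_hash_table() is the 2.2 buffer-cache hash lookup, while
xor_block() and read_and_xor() are made-up helpers):

	for (i = 0; i < disks; i++) {
		struct buffer_head *bh;

		bh = get_hash_table(dev[i], block[i], blocksize);
		if (bh) {
			/* cached copy found via the physical index:
			 * XOR it into the parity block, no disk read,
			 * no double-buffering needed */
			xor_block(parity, bh->b_data, blocksize);
			brelse(bh);
		} else {
			/* not visible in any physical index: forced to
			 * read (and double-buffer) the block from disk */
			read_and_xor(parity, dev[i], block[i], blocksize);
		}
	}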

> > i'll probably try to put pagecache blocks on the physical index again
> > (the buffer-cache), which solution i expect will face some resistance
> > :)
> 
> Yes.  Device drivers should stay below ll_rw_block() and not make any
> assumptions about the buffer cache.  Linus is _really_ determined not to
> let any new assumptions about the buffer cache into the kernel (I'm
> having to deal with this in the journaling filesystem too).

well, as a matter of fact, for a couple of pre-kernels we had all
pagecache pages aliased into the buffer-cache as well, so it's not a
technical problem at all. At the time it appeared beneficial (simpler) to
unhash pagecache pages from the buffer-cache, as those two entities are
orthogonal, so they got unhashed; but we might want to rethink that
issue.

> > in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
> > The buffer-cache represents all cached (dirty and clean) blocks within the
> > system. 
> 
> It does not, however, represent any non-cached IO.

well, we are not talking about non-cached IO here. We are talking about a
new kind of (improved) page cache that is not physically indexed. _This_
is the problem. If the page-cache were physically indexed (or more
accurately, the part of the pagecache that is already mapped to a device
in one way or another, which is 90+% of it), then I could look it up from
the RAID code just fine, and the RAID code could obey all the locking (and
additional delaying) rules present there. This is not just about resync!
If it were only for resync, then we could surely hack in some sort of
device-level lock to protect the reconstruction window.
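
Roughly what I mean by obeying the locking/delaying rules (again just a
sketch: physical_lookup() stands for whatever lookup a physically indexed
pagecache would provide, and resync_block_from()/resync_block_from_disk()
are made-up helpers):

	/* resyncing one block inside the reconstruction window */
	struct buffer_head *bh = physical_lookup(dev, block, blocksize);

	if (bh) {
		/* honour the cache's locking rules: don't touch data
		 * that a higher layer (eg. a transaction) still owns */
		if (buffer_locked(bh))
			wait_on_buffer(bh);
		resync_block_from(bh->b_data);	/* use the cached copy */
		brelse(bh);
	} else {
		resync_block_from_disk(dev, block, blocksize);
	}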

I think your problem is that you do not accept the fact that the RAID
code is a cache manager/cache user. There are real-life benefits to this
approach; would you like to see benchmark numbers with caching turned on
in the RAID driver vs. caching turned off? What can I do to convince you
that the RAID code uses/is an integrated cache, and that this is nothing
evil?

-- mingo
