Hi,

On Wed, 3 Nov 1999 10:30:36 +0100 (MET), Ingo Molnar
<[EMAIL PROTECTED]> said:

>> OK... but raid resync _will_ block forever as it currently stands.

> {not forever, but until the transaction is committed. (it's not even
> necessary for the RAID resync to wait for locked buffers, it could as well
> skip over locked & dirty buffers - those will be written out anyway.) }

No --- it may block forever.  If I have a piece of busy metadata (eg. a
superblock) which is constantly being updated, then it is quite possible
that the filesystem will modify and re-pin the buffer in a new
transaction before the previous transaction has finished committing.
(The commit happens asynchronously after the transaction has closed, for
obvious performance reasons, and a new transaction can legitimately
touch the buffer while the old one is committing.)

> i'd like to repeat that i do not mind what the mechanizm is called,
> buffer-cache or physical-cache or pagecache-II or whatever. The fact that
> the number of 'out of the blue sky' caches and IO is growing is a _design
> bug_, i believe.

Not a helpful attitude --- I could just as legitimately say that the
current raid design is a massive design bug because it violates the
device driver layering so badly.

There are perfectly good reasons for doing cache-bypass.  You want to
forbid zero-copy block device IO in the IO stack just because it is
inconvenient for software raid?  That's not a good answer.

Journaling provides a good example: I create temporary buffer_heads
in order to copy journaled metadata buffers to the log without copying
the data.  It is simply not possible to do zero-copy journaling if I
go through the cache, because that would mean hashing the same
buffer_head onto two different hash lists at once.  Ugh.
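
To make the shape of that concrete, the trick is roughly the following
(a sketch only, not the real journaling code --- the helper name and
the journal_dev/log_block parameters are invented for illustration):

/*
 * Sketch: build a second, temporary buffer_head which shares b_data
 * with the cached metadata buffer but is aimed at a block in the
 * on-disk log, so the same page of memory can be queued to the
 * journal without a memcpy and without hashing one buffer_head twice.
 */
#include <linux/fs.h>
#include <linux/string.h>

static void journal_alias_buffer(struct buffer_head *tmp,
                                 struct buffer_head *bh,
                                 kdev_t journal_dev,
                                 unsigned long log_block)
{
        memset(tmp, 0, sizeof(*tmp));
        tmp->b_data    = bh->b_data;      /* zero-copy: share the payload */
        tmp->b_size    = bh->b_size;
        tmp->b_dev     = journal_dev;     /* ... but aim it at the log    */
        tmp->b_blocknr = log_block;       /* ... at a different block nr  */
        tmp->b_count   = 1;
        tmp->b_state   = (1 << BH_Uptodate) | (1 << BH_Dirty);
        /* tmp is never hashed into the buffer cache; it goes out via
         * ll_rw_block(WRITE, 1, &tmp) and is torn down on completion. */
}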

> But whenever a given page is known to be 'bound' to a given physical block
> on a disk/device and is representative for the content, 

*BUT IT ISN'T* representative of the physical content.  It only
becomes representative of that content once the buffer has been
written to disk.

Ingo, you simply cannot assume this on 2.3.  Think about memory mapped
shared writable files.  Even if you hash those pages' buffer_heads
into the buffer cache, those writable buffers are going to be (a)
volatile, and (b) *completely* out of sync with what is on disk ---
and the buffer_heads in question are not going to be marked dirty.
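
If it helps to see it from user space: the program below (the file
name is illustrative) dirties a shared writable mapping and then just
sits there.  The cached page no longer matches the disk, yet no
write() was issued, no ll_rw_block() ever happened, and nothing is
marked dirty at the moment of the store:

/* Illustration only: modify file-backed memory through a MAP_SHARED
 * mapping.  The kernel notices the change only via the pte dirty bit;
 * no buffer_head is marked dirty at the time of the store. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/tmp/mapped-file", O_RDWR | O_CREAT, 0644);
        char *p;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ftruncate(fd, 4096) < 0) {  /* make sure a page exists */
                perror("ftruncate");
                return 1;
        }
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        memcpy(p, "modified in place", 17);   /* page now != disk */

        pause();        /* keep the unsynced mapping alive indefinitely */
        return 0;
}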

You can get around this problem by snapshotting the buffer cache and
writing it to the disk, of course, but if you're going to write the
whole stripe that way then you are forcing extra IOs anyway (you are
saving read IOs for parity calcs but you are having to perform extra
writes), and you are also going to violate any write ordering
constraints being imposed by higher levels.

> it's a partial cache i agree, but in 2.2 it is a good and valid way to
> ensure data coherency. (except in the swapping-to-a-swap-device case,
> which is a bug in the RAID code)

... and --- now --- except for raw IO, and except for journaling.

> sure it can. In 2.2 the defined thing that prevents dirty blocks from
> being written out arbitrarily (by bdflush) is the buffer lock. 

Wrong semantics --- the buffer lock is supposed to synchronise actual
physical IO (ie. ll_rw_block) and temporary states of the buffer.  It
is not intended to have any relevance to bdflush.
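
To spell out the semantics I mean (a 2.2-style sketch, essentially the
bread() pattern): the lock covers a buffer while physical IO is in
flight, nothing more.

#include <linux/fs.h>
#include <linux/locks.h>

/* Read one block synchronously: ll_rw_block() locks the buffer and
 * submits it; the IO completion unlocks it again, which is what
 * wait_on_buffer() sleeps on. */
static struct buffer_head *read_block(kdev_t dev, int block, int size)
{
        struct buffer_head *bh = getblk(dev, block, size);

        if (!bh)
                return NULL;
        if (buffer_uptodate(bh))
                return bh;

        ll_rw_block(READ, 1, &bh);
        wait_on_buffer(bh);

        if (buffer_uptodate(bh))
                return bh;
        brelse(bh);                     /* IO error */
        return NULL;
}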

>> I'll not pretend that this doesn't pose difficulties for raid, but
>> neither do I believe that raid should have the right to be a cache
>> manager, deciding on its own when to flush stuff to disk.

> this is a misunderstanding! RAID will not and does not flush anything to
> disk that is illegal to flush.

raid resync currently does so by writing back buffers which are not
marked dirty.

>> There is a huge off-list discussion in progress with Linus about this
>> right now, looking at the layering required to add IO ordering.  We have
>> a proposal for per-device IO barriers, for example.  If raid ever thinks
>> it can write to disk without a specific ll_rw_block() request, we are
>> lost.  Sorry.  You _must_ observe write ordering.

> it _WILL_ listen to any defined rule. It will however not be able to go
> telephatic and guess any future interfaces.

There is already a defined rule, and there always has been.  It is
called ll_rw_block().  Anything else is a problem.

> I'd like to have ways to access mapped & valid cached data from the
> physical index side.

You can't.  You have never been able to assume that safely.

Think about ext2 writing to a file in any kernel up to 2.3.xx.  We do
the following:

        getblk()
        ll_rw_block(READ) if it is a partial write
        copy_from_user()
        mark_buffer_dirty()
        update_vm_cache()

copy_from_user has always been able to block.  Do you see the problem?
We have wide-open windows in which the contents of the buffer cache have
been modified but the buffer is not marked dirty.  With 2.3, things just
get worse for you, with writable shared mappings dirtying things
spontaneously all over the place without ever setting the dirty bit.
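
To make the window explicit, here is a stripped-down sketch of that
path (not the literal ext2 code; error handling is trimmed and the
exact call signatures vary between kernel versions):

#include <linux/fs.h>
#include <linux/locks.h>
#include <linux/errno.h>
#include <asm/uaccess.h>

static int write_partial_block(kdev_t dev, int block, int size,
                               const char *ubuf, int offset, int count)
{
        struct buffer_head *bh = getblk(dev, block, size);

        if (!bh)
                return -EIO;
        if (!buffer_uptodate(bh)) {
                ll_rw_block(READ, 1, &bh);      /* partial write: read first */
                wait_on_buffer(bh);
        }

        /* Window opens here: the buffer stops matching the disk, the
         * copy may sleep, and the buffer is still not marked dirty. */
        if (copy_from_user(bh->b_data + offset, ubuf, count)) {
                brelse(bh);
                return -EFAULT;
        }
        mark_buffer_dirty(bh, 0);       /* window closes only here */

        brelse(bh);
        return count;
}

Anything walking the physical index inside that window sees modified
data behind a clean, unlocked buffer_head.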

There are also multiple places in ext2 where we call mark_buffer_dirty()
on more than one buffer_head after an update.  mark_buffer_dirty() can
block, so there again you have a window where you risk viewing a
modified but not dirty buffer.

So, what semantics, precisely, do you need in order to calculate parity?
I don't see how you can do it reliably if you don't know whether the
in-memory buffer_head matches what is on disk.

> think of RAID as a normal user process, and as such it can do and wants to
> do anything that is within the rules. Why shouldnt we give access to
> certain caches if that is beneficial and doesnt impact the generic case
> too much?

That's fine, but we're disagreeing about what the rules are.  Everything
else in the system assumes that the rule for device drivers is that
ll_rw_block defines what they are allowed to do, nothing else.  If you
want to change that, then we really need to agree exactly what the
required semantics are.

> O_DIRECT is not a problem either i believe. 

Indeed, the cache coherency can be worked out.  The memory-mapped
writable file case seems a much bigger problem.

> the physical index and IO layer should i think be tightly integrated. This
> has other advantages as well, not just RAID or LVM: 

No.  How else can I send one copy of data to two locations in
journaling, or bypass the cache for direct IO?  This is exactly what I
don't want to see happen.

> - 'physical-index driven write clustering': the IO layer could 'discover'
> that a neighboring physical block is dirty as well, and might merge with
> it. (strictly only if defined rules permit it, so it's _not_
> unconditional)

Only the filesystem ultimately knows that information.  We are talking
about barriers within a device dirty queue right now, but nobody is
proposing global barriers.  A global barrier for a multi-disk filesystem
like XFS or ext3 can only be performed in the filesystem, and will be
done by submitting ll_rw_block() requests in a given order, waiting for
completion when necessary.
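
For concreteness, with nothing but ll_rw_block() that ordering looks
roughly like this (the helper and its arguments are invented for the
example):

#include <linux/fs.h>
#include <linux/locks.h>

/* Write a group of buffers, wait for every one of those writes to
 * complete, and only then submit the block that must strictly follow
 * them (a commit record, say). */
static void ordered_write(struct buffer_head **bhs, int nr,
                          struct buffer_head *commit_bh)
{
        int i;

        ll_rw_block(WRITE, nr, bhs);
        for (i = 0; i < nr; i++)
                wait_on_buffer(bhs[i]);         /* the barrier */

        ll_rw_block(WRITE, 1, &commit_bh);
        wait_on_buffer(commit_bh);
}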

--Stephen
