Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-03 Thread Stephen C. Tweedie

Hi,

On Wed, 3 Nov 1999 17:43:18 +0100 (MET), Ingo Molnar
<[EMAIL PROTECTED]> said:

> .. which is exactly what the RAID5 code was doing ever since. It _has_ to
> do it to get 100% recovery anyway. This is one reason why access to caches
> is so important to the RAID code. (we do not snapshot clean buffers) 

And that's exactly why it will break --- the dirty bit on a buffer does
not tell you if it is clean or not, because the standard behaviour all
through the entire VFS is to modify the buffer before calling
mark_buffer_dirty().  There have always been windows in which you can
see a buffer as clean when it is in fact dirty, and the writable mmap
case in 2.3 makes this worse.

> no, RAID marks buffers dirty which it found in a clean and not locked
> state in the buffer-cache. It's perfectly legal to do so - or at least it
> was perfectly legal until now,

No it wasn't, because swap has always been able to bypass the buffer
cache.  I'll not say it was illegal either --- let's just say that the
situation was undefined --- but we have had writes outside the buffer
cache for _years_, and you simply can't say that it has always been
legal to assume that the buffer cache was the sole synchronisation
mechanism for IO.

I think we can make a pretty good compromise here, however.  We can
mandate that any kernel component which bypasses the buffer cache is
responsible for ensuring that the buffer cache is invalidated
beforehand.  That lets raid do the right thing regarding parity
calculations.  
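
To be concrete, here is a sketch of the rule --- invented helper name,
no error handling, and not existing kernel code:

    /* Invented helper, not existing kernel code: anyone about to do
     * IO on (dev, block) outside the buffer cache must first evict
     * any cached alias, so that raid never computes parity from a
     * stale cached copy. */
    static void bypass_prepare(kdev_t dev, int block, int size)
    {
            struct buffer_head *bh;

            bh = get_hash_table(dev, block, size);  /* takes a reference */
            if (bh) {
                    wait_on_buffer(bh);     /* let in-flight IO finish */
                    bforget(bh);            /* drop contents and our ref */
            }
    }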

For this to work, however, the raid resync must not be allowed to
repopulate the buffer cache and create a new cache incoherency.  It
would not be desperately hard to lock the resync code against other IOs
in progress, so that resync is entirely atomic with respect to things
like swap.

I can live with this for jfs: I can certainly make sure that I bforget()
journal descriptor blocks after IO to make sure that there is no cache
incoherency if the next pass over the log writes to the same block using
a temporary buffer_head.  Similarly, raw IO can maintain buffer-cache
coherency if necessary (but that will be a big performance drag if the
device is in fact shared).

The one thing that I really don't want to have to deal with is the raid
resync code doing its read/wait/write thing while I'm writing new data
via temporary buffer_heads, as that _will_ corrupt the device in a way
that I can't avoid.  There is no way for me to do metadata journal
writes through the buffer cache without copying data (because the block
cannot be in the buffer cache twice), so I _have_ to use cache-bypass
here to avoid an extra copy.

> ok, i see your point. I guess i'll have to change the RAID code to do the
> following:

> #define raid_dirty(bh)  (buffer_dirty(bh) && (buffer_count(bh) > 1))

> because nothing is allowed to change a clean buffer without having a
> reference to it. And nothing is allowed to release a physical index before
> dirtying it. Does this cover all cases?

It should do --- will you then do parity calculation and a writeback on
a snapshot of such buffers?  If so, we should be safe.
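
Something like the following is what I have in mind --- a sketch only,
where parity_xor() is a stand-in for the real raid5 XOR routine, not an
existing function:

    /* Parity must be computed from the same bytes that reach the
     * disk, so take a private copy first.  raid_dirty() is the test
     * proposed above; parity_xor() is a stand-in. */
    static void resync_one(struct buffer_head *bh, char *snap, char *parity)
    {
            if (!raid_dirty(bh))
                    return;
            memcpy(snap, bh->b_data, bh->b_size);   /* freeze a snapshot */
            parity_xor(parity, snap, bh->b_size);   /* parity from the copy */
            /* the snapshot, not bh->b_data, is what then goes to disk */
    }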

>> No.  How else can I send one copy of data to two locations in
>> journaling, or bypass the cache for direct IO?  This is exactly what I
>> don't want to see happen.

> for direct IO (with the example i have given to you) you are completely
> bypassing the cache, you are not bypassing the index! You are doing
> zero-copy, and the buffer does not stay cached.

Registering and deregistering every 512-byte block for raw IO is a CPU
nightmare, but I can do it for now.

However, what happens when we start wanting to pass kiobufs directly
into ll_rw_block()?  For performance, we really want to be able to send
chunks larger than a single disk block to the driver in one go.
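
Even a minimal prototype along these lines --- purely hypothetical, no
such call exists today --- would let the driver see one large request:

    /* Hypothetical interface sketch, not existing kernel code: hand
     * the driver one multi-block request instead of a buffer_head per
     * 512-byte sector. */
    extern int ll_rw_kiobuf(int rw, kdev_t dev, struct kiobuf *iobuf,
                            unsigned long first_block, int blocksize);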

--Stephen



Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-03 Thread Ingo Molnar


On Wed, 3 Nov 1999, Stephen C. Tweedie wrote:

> You can get around this problem by snapshotting the buffer cache and
> writing it to the disk, of course, [...]

.. which is exactly what the RAID5 code was doing ever since. It _has_ to
do it to get 100% recovery anyway. This is one reason why access to caches
is so important to the RAID code. (we do not snapshot clean buffers) 

> > sure it can. In 2.2 the defined thing that prevents dirty blocks from
> > being written out arbitrarily (by bdflush) is the buffer lock. 
> 
> Wrong semantics --- the buffer lock is supposed to synchronise actual
> physical IO (ie. ll_rw_block) and temporary states of the buffer.  It
> is not intended to have any relevance to bdflush.

because it synchronizes IO it _obviously_ also synchronizes bdflush access,
because bdflush does nothing but keep balance and start IO! You/we might
want to make this mechanism more explicit, i don't mind.

> > this is a misunderstanding! RAID will not and does not flush anything to
> > disk that is illegal to flush.
> 
> raid resync currently does so by writing back buffers which are not
> marked dirty.

no, RAID marks buffers dirty which it found in a clean and not locked
state in the buffer-cache. It's perfectly legal to do so - or at least it
was perfectly legal until now, but we can make the rule more explicit. I
always just _suggested_ using the buffer lock. 

> > I'd like to have ways to access mapped & valid cached data from the
> > physical index side.
> 
> You can't.  You have never been able to assume that safely.
> 
> Think about ext2 writing to a file in any kernel up to 2.3.xx.  We do
> the following:
> 
>   getblk()
>   ll_rw_block(READ) if it is a partial write
>   copy_from_user()
>   mark_buffer_dirty()
>   update_vm_cache()
> 
> copy_from_user has always been able to block.  Do you see the problem?
> We have wide-open windows in which the contents of the buffer cache have
> been modified but the buffer is not marked dirty. [...]

thanks, i now see the problem. Thinking about it, though, i do not see any
conceptual barrier to fixing it. Right now the pagecache is careless about keeping the
physical index state correct, because it can assume exclusive access to
that state through higher level locks.

> There are also multiple places in ext2 where we call mark_buffer_dirty()
> on more than one buffer_head after an update.  mark_buffer_dirty() can
> block, so there again you have a window where you risk viewing a
> modified but not dirty buffer.
>
> So, what semantics, precisely, do you need in order to calculate parity?
> I don't see how you can do it reliably if you don't know if the
> in-memory buffer_head matches what is on disk.

ok, i see your point. I guess i'll have to change the RAID code to do the
following:

#define raid_dirty(bh)  (buffer_dirty(bh) && (buffer_count(bh) > 1))

because nothing is allowed to change a clean buffer without having a
reference to it. And nothing is allowed to release a physical index before
dirtying it. Does this cover all cases?

> That's fine, but we're disagreeing about what the rules are.  Everything
> else in the system assumes that the rule for device drivers is that
> ll_rw_block defines what they are allowed to do, nothing else.  If you
> want to change that, then we really need to agree exactly what the
> required semantics are.

agreed.

> > O_DIRECT is not a problem either i believe. 
> 
> Indeed, the cache coherency can be worked out.  The memory mapped
> writable file seems a much bigger problem.

yep.

> > the physical index and IO layer should i think be tightly integrated. This
> > has other advantages as well, not just RAID or LVM: 
> 
> No.  How else can I send one copy of data to two locations in
> journaling, or bypass the cache for direct IO?  This is exactly what I
> don't want to see happen.

for direct IO (with the example i have given to you) you are completely
bypassing the cache, you are not bypassing the index! You are doing
zero-copy, and the buffer does not stay cached.

-- mingo




Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-03 Thread Stephen C. Tweedie

Hi,

On Wed, 3 Nov 1999 10:30:36 +0100 (MET), Ingo Molnar
<[EMAIL PROTECTED]> said:

>> OK... but raid resync _will_ block forever as it currently stands.

> {not forever, but until the transaction is committed. (it's not even
> necessary for the RAID resync to wait for locked buffers, it could as well
> skip over locked & dirty buffers - those will be written out anyway.) }

No --- it may block forever.  If I have a piece of busy metadata (eg. a
superblock) which is constantly being updated, then it is quite possible
that the filesystem will modify and re-pin the buffer in a new
transaction before the previous transaction has finished committing.
(The commit happens asynchronously after the transaction has closed, for
obvious performance reasons, and a new transaction can legitimately
touch the buffer while the old one is committing.)

> i'd like to repeat that i do not mind what the mechanism is called,
> buffer-cache or physical-cache or pagecache-II or whatever. The growing
> number of 'out of the blue sky' caches and IO paths is a _design bug_, i
> believe.

Not a helpful attitude --- I could just as legitimately say that the
current raid design is a massive design bug because it violates the
device driver layering so badly.

There are perfectly good reasons for doing cache-bypass.  You want to
forbid zero-copy block device IO in the IO stack just because it is
inconvenient for software raid?  That's not a good answer.

Journaling has a good example: I create temporary buffer_heads in
order to copy journaled metadata buffers to the log without copying
the data.  It is simply not possible to do zero-copy journaling if I
go through the cache, because you'd be wanting me to put the same
buffer_head on two different hash lists for that.  Ugh.  
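
The trick looks roughly like this --- a sketch, where journal_alloc_bh()
is an invented stand-in for however the journal obtains a spare,
referenced buffer_head:

    /* Alias the cached metadata through a private, unhashed
     * buffer_head aimed at the log block: zero-copy, and the buffer
     * cache never sees the log alias.  journal_alloc_bh() is
     * invented. */
    static void journal_write_metadata(struct buffer_head *cache_bh,
                                       kdev_t log_dev,
                                       unsigned long log_block)
    {
            struct buffer_head *tmp = journal_alloc_bh();

            tmp->b_data    = cache_bh->b_data;   /* share the data page  */
            tmp->b_size    = cache_bh->b_size;
            tmp->b_dev     = log_dev;
            tmp->b_blocknr = log_block;          /* aimed at the journal */
            set_bit(BH_Uptodate, &tmp->b_state);
            set_bit(BH_Dirty, &tmp->b_state);    /* WRITE expects dirty  */

            ll_rw_block(WRITE, 1, &tmp);         /* the one defined entry point */
            wait_on_buffer(tmp);                 /* simplified: synchronous */
    }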

> But whenever a given page is known to be 'bound' to a given physical block
> on a disk/device and is representative for the content, 

*BUT IT ISN'T* representative of the physical content.  It only
becomes representative of that content once the buffer has been
written to disk.

Ingo, you simply cannot assume this on 2.3.  Think about memory mapped
shared writable files.  Even if you hash those pages' buffer_heads
into the buffer cache, those writable buffers are going to be (a)
volatile, and (b) *completely* out of sync with what is on disk ---
and the buffer_heads in question are not going to be marked dirty.

You can get around this problem by snapshotting the buffer cache and
writing it to the disk, of course, but if you're going to write the
whole stripe that way then you are forcing extra IOs anyway (you are
saving read IOs for parity calcs but you are having to perform extra
writes), and you are also going to violate any write ordering
constraints being imposed by higher levels.

> it's a partial cache i agree, but in 2.2 it is a good and valid way to
> ensure data coherency. (except in the swapping-to-a-swap-device case,
> which is a bug in the RAID code)

... and --- now --- except for raw IO, and except for journaling.

> sure it can. In 2.2 the defined thing that prevents dirty blocks from
> being written out arbitrarily (by bdflush) is the buffer lock. 

Wrong semantics --- the buffer lock is supposed to synchronise actual
physical IO (ie. ll_rw_block) and temporary states of the buffer.  It
is not intended to have any relevance to bdflush.

>> I'll not pretend that this doesn't pose difficulties for raid, but
>> neither do I believe that raid should have the right to be a cache
>> manager, deciding on its own when to flush stuff to disk.

> this is a misunderstanding! RAID will not and does not flush anything to
> disk that is illegal to flush.

raid resync currently does so by writing back buffers which are not
marked dirty.

>> There is a huge off-list discussion in progress with Linus about this
>> right now, looking at the layering required to add IO ordering.  We have
>> a proposal for per-device IO barriers, for example.  If raid ever thinks
>> it can write to disk without a specific ll_rw_block() request, we are
>> lost.  Sorry.  You _must_ observe write ordering.

> it _WILL_ listen to any defined rule. It will however not be able to go
> telepathic and guess any future interfaces.

There is already a defined rule, and there always has been.  It is
called ll_rw_block().  Anything else is a problem.
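
For reference, the contract is just this one prototype:

    /* The whole driver interface: a driver acts only in response to
     * requests queued through this call, nothing else. */
    void ll_rw_block(int rw, int nr, struct buffer_head *bh[]);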

> I'd like to have ways to access mapped & valid cached data from the
> physical index side.

You can't.  You have never been able to assume that safely.

Think about ext2 writing to a file in any kernel up to 2.3.xx.  We do
the following:

getblk()
ll_rw_block(READ) if it is a partial write
copy_from_user()
mark_buffer_dirty()
update_vm_cache()

copy_from_user has always been able to block.  Do you see the problem?
We have wide-open windows in which the contents of the buffer cache have
been modified but the buffer is not marked dirty.  With 2.3, things just
get worse for you, with writable shared mappings.
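
Spelled out as code, the window looks like this --- a condensed sketch
of the 2.2 ext2 write path, not the literal source, with variables and
argument lists abbreviated:

    /* sketch of the sequence described above */
    bh = getblk(dev, block, blocksize);
    if (partial_write && !buffer_uptodate(bh)) {
            ll_rw_block(READ, 1, &bh);      /* read in the old contents */
            wait_on_buffer(bh);
    }
    copy_from_user(bh->b_data + offset, buf, count);
    /* <-- copy_from_user() may block here: the buffer is already
     *     modified but still looks clean to anyone scanning the
     *     buffer cache, raid resync included */
    mark_buffer_dirty(bh);                  /* only now is it dirty */
    update_vm_cache(inode, pos, bh->b_data + offset, count);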

Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-04 Thread Stephen C. Tweedie

Hi,

On Thu, 4 Nov 1999 19:03:18 +0100 (CET), Rik van Riel
<[EMAIL PROTECTED]> said:

> The obvious solution would be to allow multiple versions of the
> same disk block to be in memory. 

That is already possible.  You can make as many buffer_heads as you want
for a given disk block, as long as only one is in the buffer cache itself.

> We need that anyway for journalling, so why not extend it with a flag
> (or use another flag) to show that _this_ particular disk block is the
> one on disk or the one supposed to be on disk (go on, flush this).

That's just not the point --- even for the copy which _is_ supposed to
be authoritative on disk, you can't be sure that it is in fact uptodate
because there is a race between the filesystem modifying the buffer
contents and the dirty bit being marked in the buffer_head.

--Stephen



Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-04 Thread Rik van Riel

On Wed, 3 Nov 1999, Ingo Molnar wrote:
> On Wed, 3 Nov 1999, Stephen C. Tweedie wrote:
> 
> > So, what semantics, precisely, do you need in order to calculate parity?
> > I don't see how you can do it reliably if you don't know if the
> > in-memory buffer_head matches what is on disk.
> 
> ok, i see your point. I guess i'll have to change the RAID code to
> do the following:

[snip]

The obvious solution would be to allow multiple versions of the
same disk block to be in memory. We need that anyway for
journalling, so why not extend it with a flag (or use another
flag) to show that _this_ particular disk block is the one on
disk or the one supposed to be on disk (go on, flush this).

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.