Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-03 Thread Stephen C. Tweedie

Hi,

On Wed, 3 Nov 1999 17:43:18 +0100 (MET), Ingo Molnar
<[EMAIL PROTECTED]> said:

> .. which is exactly what the RAID5 code was doing ever since. It _has_ to
> do it to get 100% recovery anyway. This is one reason why access to caches
> is so important to the RAID code. (we do not snapshot clean buffers) 

And that's exactly why it will break --- the dirty bit on a buffer does
not tell you if it is clean or not, because the standard behaviour all
through the entire VFS is to modify the buffer before calling
mark_buffer_dirty().  There have always been windows in which you can
see a buffer as clean but it in fact is dirty, and the writable mmap
case in 2.3 makes this worse.
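
In rough code terms (a sketch of the generic 2.2-era write path, not
literal ext2 source --- the same sequence is spelled out later in this
thread), the window looks like this:

    /* Sketch of the 2.2-era buffer-cache write path, simplified. */
    #include <linux/fs.h>
    #include <linux/locks.h>
    #include <asm/uaccess.h>

    static void write_partial_block(kdev_t dev, int block, int size,
                                    const char *ubuf, int offset, int count)
    {
        struct buffer_head *bh = getblk(dev, block, size);

        if (!buffer_uptodate(bh)) {         /* partial write: read first */
            ll_rw_block(READ, 1, &bh);
            wait_on_buffer(bh);
        }
        copy_from_user(bh->b_data + offset, ubuf, count);   /* can sleep */
        /*
         * <-- the window: the buffer contents have changed, but the dirty
         *     bit is still clear, so anything sampling the buffer cache
         *     here sees a "clean" buffer that no longer matches the disk.
         */
        mark_buffer_dirty(bh);
        brelse(bh);
    }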

> no, RAID marks buffers dirty which it found in a clean and not locked
> state in the buffer-cache. It's perfectly legal to do so - or at least it
> was perfectly legal until now,

No it wasn't, because swap has always been able to bypass the buffer
cache.  I'll not say it was illegal either --- let's just say that the
situation was undefined --- but we have had writes outside the buffer
cache for _years_, and you simply can't say that it has always been
legal to assume that the buffer cache was the sole synchronisation
mechanism for IO.

I think we can make a pretty good compromise here, however.  We can
mandate that any kernel component which bypasses the buffer cache is
responsible for ensuring that the buffer cache is invalidated
beforehand.  That lets raid do the right thing regarding parity
calculations.  
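
As a sketch (the helper name is invented here; get_hash_table() and
bforget() are the existing 2.2-era primitives), the rule would amount to
something like:

    /* Sketch only: before issuing a write through a private,
     * cache-bypassing buffer_head, drop any alias of the block from the
     * buffer cache so the stale cached copy can never be used for parity
     * or written back later. */
    #include <linux/fs.h>

    static void invalidate_cached_alias(kdev_t dev, int block, int size)
    {
        struct buffer_head *bh = get_hash_table(dev, block, size);

        if (bh)
            bforget(bh);    /* drops the reference taken above and throws
                             * the cached copy away */
    }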

For this to work, however, the raid resync must not be allowed to
repopulate the buffer cache and create a new cache incoherency.  It
would not be desperately hard to lock the resync code against other IOs
in progress, so that resync is entirely atomic with respect to things
like swap.

I can live with this for jfs: I can certainly make sure that I bforget()
journal descriptor blocks after IO to make sure that there is no cache
incoherency if the next pass over the log writes to the same block using
a temporary buffer_head.  Similarly, raw IO can do buffer cache
coherency if necessary (but that will be a big performance drag if the
device is in fact shared).

The one thing that I really don't want to have to deal with is the raid
resync code doing its read/wait/write thing while I'm writing new data
via temporary buffer_heads, as that _will_ corrupt the device in a way
that I can't avoid.  There is no way for me to do metadata journal
writes through the buffer cache without copying data (because the block
cannot be in the buffer cache twice), so I _have_ to use cache-bypass
here to avoid an extra copy.

> ok, i see your point. I guess i'll have to change the RAID code to do the
> following:

> #define raid_dirty(bh)  (buffer_dirty(bh) && (buffer_count(bh) > 1))

> because nothing is allowed to change a clean buffer without having a
> reference to it. And nothing is allowed to release a physical index before
> dirtying it. Does this cover all cases?

It should do --- will you then do parity calculation and a writeback on
a snapshot of such buffers?  If so, we should be safe.
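
The snapshot step itself is simple in outline (illustrative only; the
real RAID5 code keeps per-stripe buffers and does the XOR there):

    /* Sketch: copy the cached contents into a private per-stripe buffer
     * before computing parity, so a later in-place modification of the
     * cached buffer cannot change the data the parity was derived from. */
    #include <linux/fs.h>
    #include <linux/string.h>

    static void snapshot_for_parity(struct buffer_head *bh, char *stripe_buf)
    {
        memcpy(stripe_buf, bh->b_data, bh->b_size);
        /* parity is XORed from stripe_buf, and the write to the member
         * disk also goes out from stripe_buf, never from bh->b_data */
    }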

>> No.  How else can I send one copy of data to two locations in
>> journaling, or bypass the cache for direct IO?  This is exactly what I
>> don't want to see happen.

> for direct IO (with the example i have given to you) you are completely
> bypassing the cache, you are not bypassing the index! You are doing
> zero-copy, and the buffer does not stay cached.

Registering and deregistering every 512-byte block for raw IO is a CPU
nightmare, but I can do it for now.

However, what happens when we start wanting to pass kiobufs directly
into ll_rw_block()?  For performance, we really want to be able to send
chunks larger than a single disk block to the driver in one go.

--Stephen



Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-03 Thread Ingo Molnar


On Wed, 3 Nov 1999, Stephen C. Tweedie wrote:

> You can get around this problem by snapshotting the buffer cache and
> writing it to the disk, of course, [...]

.. which is exactly what the RAID5 code was doing ever since. It _has_ to
do it to get 100% recovery anyway. This is one reason why access to caches
is so important to the RAID code. (we do not snapshot clean buffers) 

> > sure it can. In 2.2 the defined thing that prevents dirty blocks from
> > being written out arbitrarily (by bdflush) is the buffer lock. 
> 
> Wrong semantics --- the buffer lock is supposed to synchronise actual
> physical IO (ie. ll_rw_block) and temporary states of the buffer.  It
> is not intended to have any relevance to bdflush.

because it synchronizes IO it _obviously_ also synchronizes bdflush access,
because bdflush does nothing but keep things balanced and start IO! You/we
might want to make this mechanism more explicit, i don't mind.

> > this is a misunderstanding! RAID will not and does not flush anything to
> > disk that is illegal to flush.
> 
> raid resync currently does so by writing back buffers which are not
> marked dirty.

no, RAID marks buffers dirty which it found in a clean and not locked
state in the buffer-cache. It's perfectly legal to do so - or at least it
was perfectly legal until now, but we can make the rule more explicit. I
always just _suggested_ to use the buffer lock. 

> > I'd like to have ways to access mapped & valid cached data from the
> > physical index side.
> 
> You can't.  You have never been able to assume that safely.
> 
> Think about ext2 writing to a file in any kernel up to 2.3.xx.  We do
> the following:
> 
>   getblk()
>   ll_rw_block(READ) if it is a partial write
>   copy_from_user()
>   mark_buffer_dirty()
>   update_vm_cache()
> 
> copy_from_user has always been able to block.  Do you see the problem?
> We have wide-open windows in which the contents of the buffer cache have
> been modified but the buffer is not marked dirty. [...]

thanks, i now see the problem. Thinking about it, i do not see any
conceptual problem. Right now the pagecache is careless about keeping the
physical index state correct, because it can assume exclusive access to
that state through higher level locks.

> There are also multiple places in ext2 where we call mark_buffer_dirty()
> on more than one buffer_head after an update.  mark_buffer_dirty() can
> block, so there again you have a window where you risk viewing a
> modified but not dirty buffer.
>
> So, what semantics, precisely, do you need in order to calculate parity?
> I don't see how you can do it reliably if you don't know if the
> in-memory buffer_head matches what is on disk.

ok, i see your point. I guess i'll have to change the RAID code to do the
following:

#define raid_dirty(bh)  (buffer_dirty(bh) && (buffer_count(bh) > 1))

because nothing is allowed to change a clean buffer without having a
reference to it. And nothing is allowed to release a physical index before
dirtying it. Does this cover all cases?
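
Spelled out as a sketch (illustrative writer code, not from any
filesystem), the protocol that test relies on is:

    /* The invariant behind raid_dirty(): a writer takes a reference with
     * getblk(), keeps it across the whole modify-then-dirty window, and
     * only releases the buffer after marking it dirty. */
    #include <linux/fs.h>
    #include <linux/string.h>

    static void update_block(kdev_t dev, int block, int size,
                             const char *new_data)
    {
        struct buffer_head *bh = getblk(dev, block, size);  /* ref held */

        memcpy(bh->b_data, new_data, size);  /* contents change here ...    */
        mark_buffer_dirty(bh);               /* ... and only now is it dirty */
        brelse(bh);                          /* reference dropped last      */
    }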

> That's fine, but we're disagreeing about what the rules are.  Everything
> else in the system assumes that the rule for device drivers is that
> ll_rw_block defines what they are allowed to do, nothing else.  If you
> want to change that, then we really need to agree exactly what the
> required semantics are.

agreed.

> > O_DIRECT is not a problem either i believe. 
> 
> Indeed, the cache coherency can be worked out.  The memory mapped
> writable file seems a much bigger problem.

yep.

> > the physical index and IO layer should i think be tightly integrated. This
> > has other advantages as well, not just RAID or LVM: 
> 
> No.  How else can I send one copy of data to two locations in
> journaling, or bypass the cache for direct IO?  This is exactly what I
> don't want to see happen.

for direct IO (with the example i have given to you) you are completely
bypassing the cache, you are not bypassing the index! You are doing
zero-copy, and the buffer does not stay cached.

-- mingo




Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-03 Thread Stephen C. Tweedie

Hi,

On Wed, 3 Nov 1999 10:30:36 +0100 (MET), Ingo Molnar
<[EMAIL PROTECTED]> said:

>> OK... but raid resync _will_ block forever as it currently stands.

> {not forever, but until the transaction is committed. (it's not even
> necessary for the RAID resync to wait for locked buffers, it could as well
> skip over locked & dirty buffers - those will be written out anyway.) }

No --- it may block forever.  If I have a piece of busy metadata (eg. a
superblock) which is constantly being updated, then it is quite possible
that the filesystem will modify and re-pin the buffer in a new
transaction before the previous transaction has finished committing.
(The commit happens asynchronously after the transaction has closed, for
obvious performance reasons, and a new transaction can legitimately
touch the buffer while the old one is committing.)

> i'd like to repeat that i do not mind what the mechanism is called,
> buffer-cache or physical-cache or pagecache-II or whatever. The fact that
> the number of 'out of the blue sky' caches and IO is growing is a _design
> bug_, i believe.

Not a helpful attitude --- I could just as legitimately say that the
current raid design is a massive design bug because it violates the
device driver layering so badly.

There are perfectly good reasons for doing cache-bypass.  You want to
forbid zero-copy block device IO in the IO stack just because it is
inconvenient for software raid?  That's not a good answer.

Journaling has a good example: I create temporary buffer_heads in
order to copy journaled metadata buffers to the log without copying
the data.  It is simply not possible to do zero-copy journaling if I
go through the cache, because you'd be wanting me to put the same
buffer_head on two different hash lists for that.  Ugh.  
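
Roughly, the temporary buffer_head trick looks like the following (a
sketch against the 2.2-era struct buffer_head; IO completion handling and
error paths are omitted):

    /* Sketch: write the contents of a cached metadata buffer to a
     * different on-disk location (the log block) without copying the
     * data, via a private buffer_head that shares the same memory and
     * never enters the buffer-cache hash lists. */
    #include <linux/fs.h>
    #include <linux/locks.h>
    #include <linux/slab.h>
    #include <linux/string.h>

    static void journal_write_block(struct buffer_head *meta,
                                    kdev_t log_dev, int log_block)
    {
        struct buffer_head *tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);

        memset(tmp, 0, sizeof(*tmp));
        tmp->b_data    = meta->b_data;     /* share the data: no copy      */
        tmp->b_size    = meta->b_size;
        tmp->b_dev     = log_dev;
        tmp->b_blocknr = log_block;        /* but a different disk block   */
        tmp->b_count   = 1;
        set_bit(BH_Uptodate, &tmp->b_state);
        set_bit(BH_Dirty, &tmp->b_state);  /* so the write really goes out */
        /* (the real code also installs a b_end_io completion handler) */

        ll_rw_block(WRITE, 1, &tmp);
        wait_on_buffer(tmp);
        kfree(tmp);
    }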

> But whenever a given page is known to be 'bound' to a given physical block
> on a disk/device and is representative for the content, 

*BUT IT ISN'T* representative of the physical content.  It only
becomes representative of that content once the buffer has been
written to disk.

Ingo, you simply cannot assume this on 2.3.  Think about memory mapped
shared writable files.  Even if you hash those pages' buffer_heads
into the buffer cache, those writable buffers are going to be (a)
volatile, and (b) *completely* out of sync with what is on disk ---
and the buffer_heads in question are not going to be marked dirty.

You can get around this problem by snapshotting the buffer cache and
writing it to the disk, of course, but if you're going to write the
whole stripe that way then you are forcing extra IOs anyway (you are
saving read IOs for parity calcs but you are having to perform extra
writes), and you are also going to violate any write ordering
constraints being imposed by higher levels.

> it's a partial cache i agree, but in 2.2 it is a good and valid way to
> ensure data coherency. (except in the swapping-to-a-swap-device case,
> which is a bug in the RAID code)

... and --- now --- except for raw IO, and except for journaling.

> sure it can. In 2.2 the defined thing that prevents dirty blocks from
> being written out arbitrarily (by bdflush) is the buffer lock. 

Wrong semantics --- the buffer lock is supposed to synchronise actual
physical IO (ie. ll_rw_block) and temporary states of the buffer.  It
is not intended to have any relevance to bdflush.

>> I'll not pretend that this doesn't pose difficulties for raid, but
> neither do I believe that raid should have the right to be a cache
>> manager, deciding on its own when to flush stuff to disk.

> this is a misunderstanding! RAID will not and does not flush anything to
> disk that is illegal to flush.

raid resync currently does so by writing back buffers which are not
marked dirty.

>> There is a huge off-list discussion in progress with Linus about this
>> right now, looking at the layering required to add IO ordering.  We have
>> a proposal for per-device IO barriers, for example.  If raid ever thinks
>> it can write to disk without a specific ll_rw_block() request, we are
>> lost.  Sorry.  You _must_ observe write ordering.

> it _WILL_ listen to any defined rule. It will however not be able to go
> telepathic and guess any future interfaces.

There is already a defined rule, and there always has been.  It is
called ll_rw_block().  Anything else is a problem.

> I'd like to have ways to access mapped & valid cached data from the
> physical index side.

You can't.  You have never been able to assume that safely.

Think about ext2 writing to a file in any kernel up to 2.3.xx.  We do
the following:

getblk()
ll_rw_block(READ) if it is a partial write
copy_from_user()
mark_buffer_dirty()
update_vm_cache()

copy_from_user has always been able to block.  Do you see the problem?
We have wide-open windows in which the contents of the buffer cache have
been modified but the buffer is not marked dirty.  With 2.3, things just
get worse for you, with writable shared mappings. [...]

Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-03 Thread Ingo Molnar


On Tue, 2 Nov 1999, Stephen C. Tweedie wrote:

> OK... but raid resync _will_ block forever as it currently stands.

{not forever, but until the transaction is committed. (it's not even
necessary for the RAID resync to wait for locked buffers, it could as well
skip over locked & dirty buffers - those will be written out anyway.) }

> > no, paging (named mappings) writes do not bypass the buffer-cache, and
> > thats the issue. 
> 
> Mappings bypass it for read in 2.2, and for read _and_ write on 2.3.  I
> wasn't talking about writes: I'm talking about IO in general.  IO is not
> limited to the buffer cache, and the places where the buffer cache is
> bypassed are growing, not shrinking.

i said writes; reads are irrelevant in this context. in our case we were
mainly worried about cache coherency issues, obviously only dirty data
(ie. writes) are interesting in this sense. OTOH i do agree that for
better generic performance the RAID code would like to see cached reads as
well, not only dirty data, and this is a problem on 2.2 as well.

i'd like to repeat that i do not mind what the mechanism is called,
buffer-cache or physical-cache or pagecache-II or whatever. The fact that
the number of 'out of the blue sky' caches and IO is growing is a _design
bug_, i believe. Not too hard to fix at this point, and i'm aware of the
possible speed impact, which i'd like to minimize as much as possible. I
do not mind invisible but coherent dirty-caches, we always had those: eg.
the inode cache, or not yet mapped dirty pages (not really present right
now, but possible with lazy allocation).

But whenever a given page is known to be 'bound' to a given physical block
on a disk/device and is representative for the content, i'd like to have
ways to access/manage these cache elements. ('manage' obviously does not
include 'changing contents', it means changing the state of the cache
element in a defined way, eg. marking it dirty, or marking it clean,
unmapping it, remapping it, etc.)

> > I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
> > it out. (It's not really hard to fix because the swap cache is more or
> > less physically indexed.) 
> 
> Yep.  Journaling will have the same problem.  The block device interface
> has never required that the writes come from the buffer cache.

yes, i do not require it either. But i'd like to see a way to access
cached contents 'from the physical side' as well - in cases where this is
possible. (and all other cases should -and currently do- stay coherent
explicitly)

> > i dont really mind how it's called. It's a physical index of all dirty &
> > cached physical device contents which might get written out directly to
> > the device at any time. In 2.2 this is the buffer-cache.
> 
> No it isn't.  The buffer cache is a partial cache at best.  It does not
> record all writes, and certainly doesn't record all reads, even on 2.2.

it's a partial cache i agree, but in 2.2 it is a good and valid way to
ensure data coherency. (except in the swapping-to-a-swap-device case,
which is a bug in the RAID code)

> Most importantly, data in the buffer cache cannot be written arbitrarily
> to disk at any time by the raid code: you'll totally wreck any write
> ordering attempts by higher level code.

sure it can. In 2.2 the defined thing that prevents dirty blocks from
being written out arbitrarily (by bdflush) is the buffer lock. bdflush is
a cache manager similar to the RAID code! 'Write ordering attempts by
higher level' first have to be defined, Stephen, and sure if/when these
write ordering requirements become part of the buffer-cache then the RAID
code will listen to it. But you cannot blame a 3-year-old concept for not
working with 6-month-old new code.

i'd like to re-ask the question of why locking buffers is not a good way
to keep transactions pending. This solves the bdflush and RAID issue
without _any_ change to the buffer-cache. There might be
practical/theoretical reasons for this not being possible/desired; please
enlighten me if this is the case. Again, i do not mind having another
'write ordering' mechanism either, but these should first be defined and
agreed on.

> It can't access the page cache in 2.2.

(yes, i'd like to fix this in the 2.3 RAID code, additionally to being
nice to the journalling code as well.)

> > The RAID code is not just a device driver, it's also a cache
> > manager. Why do you think it's inferior to access cached data along a
> > physical index?
> 
> Ask Linus, he's pushing this point much more strongly than I am!  The
> buffer cache will become less and less of a cache as time goes on in his
> grand plan: it is to become little more than an IO buffer layer.

we should not be too focused on the buffer-cache. The buffer-cache has
many 'legacy' features (eg. [...])

> I'll not pretend that this doesn't pose difficulties for raid, but
> neither do I believe that raid should have the right to be a cache
> manager, deciding on its own when to flush stuff to disk.

this is a misunderstanding! RAID will not and does not flush anything to
disk that is illegal to flush.

Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Theodore Y. Ts'o

   From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
   Date:   Tue, 2 Nov 1999 17:44:55 +0000 (GMT)

   Ask Linus, he's pushing this point much more strongly than I am!  The
   buffer cache will become less and less of a cache as time goes on in his
   grand plan: it is to become little more than an IO buffer layer.

Ultimately, I think we may be better off if we remove any hint of caching
from the I/O buffer layer.  The cache coherency issues between the page
and buffer cache make me nervous, and I'm not completely 100% convinced
we got it all right.  (I'm wondering if some of the ext2 corruption
reports in the 2.2 kernels are coming from a buffer cache/page cache
corruption.)

This means putting filesystem meta-data into the page cache.  Yes, I
know Stephen has some concerns about doing this because the big memory
patches mean pages in the page cache might not be directly accessible by
the kernel.  I see two solutions to this, both with drawbacks.  One is
to use a VM remap facility to map directories, superblocks, inode tables
etc. into the kernel address space.  The other is to have flags which
ask the kernel to map filesystem metadata into part of the page cache
that's addressable by the kernel.  The first adds a VM delay to
accessing the filesystem metadata, and the other means we need to manage
the part of the page cache that's below 2GB differently from the page
cache in high memory at least as far as freeing pages in response to
memory pressure is concerned.
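
The first option is essentially the kmap()/kunmap() pair that the highmem
work provides; as a sketch only (nothing in this mail commits to that
interface):

    /* Sketch of option one: map a (possibly high-memory) page-cache page
     * into the kernel address space only while the filesystem is actually
     * touching the metadata in it. */
    #include <linux/mm.h>
    #include <linux/highmem.h>

    static unsigned char peek_metadata(struct page *page, unsigned int offset)
    {
        unsigned char *addr = kmap(page);    /* temporary kernel mapping */
        unsigned char byte = addr[offset];   /* metadata now addressable */

        kunmap(page);                        /* mapping dropped again    */
        return byte;
    }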

   Basically, for the raid code to poke around in higher layers is a huge
   layering violation.  We are heading towards doing things like adding
   kiobuf interfaces to ll_rw_block (in which the IO descriptor that the
   driver receives will have no reference to the buffer cache), and raw,
   unbuffered access to the drivers for raw devices and O_DIRECT.
   Raw IO is already there and bypasses the buffer cache.  So does swap.
   So does journaling.  So does page-in (in 2.2) and page-out (in 2.3).

It'll be interesting to see how this affects using dump(8) on a mounted
filesystem.  This was never particularly guaranteed to give a coherent
filesystem image, but what with increasing bypass of the buffer cache,
it may make the results of using dump(8) on a live filesystem even
worse.

One way of solving this is to add some kernel support for dump(8); for
example, the infamous iopen() call which Linus hates so much.  (Yes, it
violates the Unix permission model, which is why it needs to be
restricted to root, and yes, it won't work on all filesystems; just
those that have inodes.)  The other is to simply tell people to give up
on dump completely, and just use a file-level tool such as tar or bru.

- Ted



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Daniel Veillard

On Tue, Nov 02, 1999 at 11:17:23AM -0700, Matt Zinkevicius wrote:
> Is just software RAID affected? or hardware RAID as well?

  Just the software one; with hardware RAID the intricacies
of the operations are hidden and the hardware *precisely* presents
a unified, standardized interface (like SCSI ...). The kernel is
not concerned with this.

Daniel

-- 
[EMAIL PROTECTED] | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux, WWW, rpmfind,
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | rpm2html, XML,
http://www.w3.org/People/W3Cpeople.html#Veillard | badminton, and Kaffe.



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Matt Zinkevicius

Is just software RAID affected? or hardware RAID as well?

--Matt




Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Tue, 2 Nov 1999 14:12:00 +0100 (MET), Ingo Molnar
<[EMAIL PROTECTED]> said:

> yes but this means that the block was not cached. 

OK... but raid resync _will_ block forever as it currently stands.

>> > 2.3 removes physical indexing of cached blocks, 
>> 
>> 2.2 never guaranteed that IO was from cached blocks in the first place.
>> Swap and paging both bypass the buffer cache entirely. [..]

> no, paging (named mappings) writes do not bypass the buffer-cache, and
> thats the issue. 

Mappings bypass it for read in 2.2, and for read _and_ write on 2.3.  I
wasn't talking about writes: I'm talking about IO in general.  IO is not
limited to the buffer cache, and the places where the buffer cache is
bypassed are growing, not shrinking.

> I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
> it out. (It's not really hard to fix because the swap cache is more or
> less physically indexed.) 

Yep.  Journaling will have the same problem.  The block device interface
has never required that the writes come from the buffer cache.

>> But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
>> a swapon, then the swapper will start to write to that swapfile using
>> temporary buffer_heads.  If you do IO or checksum optimisation based on
>> the buffer cache you'll risk plastering obsolete data over the disks.  

> i dont really mind how it's called. It's a physical index of all dirty &
> cached physical device contents which might get written out directly to
> the device at any time. In 2.2 this is the buffer-cache.

No it isn't.  The buffer cache is a partial cache at best.  It does not
record all writes, and certainly doesn't record all reads, even on 2.2.
Most importantly, data in the buffer cache cannot be written arbitrarily
to disk at any time by the raid code: you'll totally wreck any write
ordering attempts by higher level code.

> Think about it, it's not a hack, it's a solid concept. The RAID code
> cannot even create its own physical index if the cache is completely
> private. Should the RAID code re-read blocks from disk when it
> calculates parity, just because it cannot access already cached data
> in the pagecache?

It can't access the page cache in 2.2.

> The RAID code is not just a device driver, it's also a cache
> manager. Why do you think it's inferior to access cached data along a
> physical index?

Ask Linus, he's pushing this point much more strongly than I am!  The
buffer cache will become less and less of a cache as time goes on in his
grand plan: it is to become little more than an IO buffer layer.

Basically, for the raid code to poke around in higher layers is a huge
layering violation.  We are heading towards doing things like adding
kiobuf interfaces to ll_rw_block (in which the IO descriptor that the
driver receives will have no reference to the buffer cache), and raw,
unbuffered access to the drivers for raw devices and O_DIRECT.
Raw IO is already there and bypasses the buffer cache.  So does swap.
So does journaling.  So does page-in (in 2.2) and page-out (in 2.3).

I'll not pretend that this doesn't pose difficulties for raid, but
neither do I believe that raid should have the right to be a cache
manager, deciding on its own when to flush stuff to disk.

There is a huge off-list discussion in progress with Linus about this
right now, looking at the layering required to add IO ordering.  We have
a proposal for per-device IO barriers, for example.  If raid ever thinks
it can write to disk without a specific ll_rw_block() request, we are
lost.  Sorry.  You _must_ observe write ordering.

Peeking into the buffer cache for reads is a much more benign behaviour.
It is still going to be a big problem for things like journaling and raw
IO, and is a potential swap corrupter, but we can fix these things by
being religious about removing or updating any buffer-cache copies of
disk blocks we're about to write while bypassing the buffer cache.
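
The "update" half of that rule, as a sketch (the helper name is invented;
the alternative is simply to bforget() the alias instead):

    /* Sketch: before a cache-bypassing write of new_data to (dev, block),
     * bring any cached copy up to date so readers of the buffer cache do
     * not see stale contents afterwards. */
    #include <linux/fs.h>
    #include <linux/string.h>

    static void update_cached_alias(kdev_t dev, int block, int size,
                                    const char *new_data)
    {
        struct buffer_head *bh = get_hash_table(dev, block, size);

        if (bh) {
            memcpy(bh->b_data, new_data, size);
            set_bit(BH_Uptodate, &bh->b_state);
            brelse(bh);
        }
    }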

Right now the raid resync can clearly write buffers without being asked
to do so, and that needs to be fixed.  It should be possible to do so
without redesigning the whole of software raid.  Can we assume that
apart from that, raid*.c never writes data without being asked, even if
it does use the buffer cache to compute parity?  

> well, we are not talking about non-cached IO here. We are talking about a
> new kind of (improved) page cache that is not physically indexed. _This_
> is the problem. If the page-cache was physically indexed then i could look
> it up from the RAID code just fine. If the page-cache was physically
> indexed (or more accurately, the part of the pagecache that is already
> mapped to a device in one way or another, which is 90+% of it.) then the
> RAID code could obey all the locking (and additional delaying) rules
> present there. 

It is insane to think that a device driver (which raid *is*, from the
point of view of the page cache) should have a right to poke about in a
virtually indexed cache. [...]

Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Tue, 02 Nov 1999 08:43:01 -0700, [EMAIL PROTECTED] said:

>> Fixing this in raid seems far, far preferable to fixing it in the
>> filesystems.  The filesystem should be allowed to use the buffer cache
>> for metadata and should be able to assume that there is a way to prevent
>> those buffers from being written to disk until it is ready.

> What about doing it in the page cache: i.e. reserve pages for journaling
> and let them hit the buffer cache only when the transaction allows it?

> This may be a naive suggestion, but it looks logical.

The issue is one of IO.  Journaling builds a list of disk block updates
which need to be applied in a given order.  IO is block-based, not
page-based.  I can cache a directory in the page cache, but when I start
doing modifications, it's on a per-block basis because journaling has
got to record modified blocks.  I could quite easily end up with two
different blocks in the same page belonging to two different
transactions if the blocksize is less than the pagesize.
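
Concretely, with 1K blocks in 4K pages one page-cache page carries four
buffer_heads, chained through b_this_page, and nothing ties them to the
same transaction. A sketch of walking that ring:

    /* Sketch: count the buffer_heads overlaid on one page-cache page.
     * With blocksize < PAGE_SIZE (say 1024 vs 4096) this returns 4, and
     * each of those buffers may belong to a different transaction. */
    #include <linux/fs.h>
    #include <linux/mm.h>

    static int blocks_per_page(struct page *page)
    {
        struct buffer_head *head = page->buffers;   /* 2.3-era field */
        struct buffer_head *bh = head;
        int n = 0;

        if (!head)
            return 0;
        do {
            n++;
            bh = bh->b_this_page;   /* circular list round the page */
        } while (bh != head);

        return n;
    }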

Doing filesystem caching in the page cache is fine, but it does not
really make sense as a data structure in which to build a transaction's
pending-write lists.

There's a second issue: I talked with Ingo about the journaling API
early on, and it was designed specifically to support buffer journaling.
I want to be able to allow the raid or LVM driver code to use the same
jfs API to apply transactional updates across multiple devices at the
block device level, for doing things like reconfiguring an array to
merge a new disk or to mark errors.

--Stephen



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread braam


Hi,

Stephen wrote:

> Fixing this in raid seems far, far preferable to fixing it in the
> filesystems.  The filesystem should be allowed to use the buffer cache
> for metadata and should be able to assume that there is a way to prevent
> those buffers from being written to disk until it is ready.

What about doing it in the page cache: i.e. reserve pages for journaling
and let them hit the buffer cache only when the transaction allows it?

This may be a naive suggestion, but it looks logical.

- Peter -



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Ingo Molnar


On Tue, 2 Nov 1999, Stephen C. Tweedie wrote:

> > i dont think dump should block. dump(8) is using the raw block device to
> > read fs data, which in turn uses the buffer-cache to get to the cached
> > state of device blocks. Nothing blocks there, i've just re-checked
> > fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
> > anything.
> 
> fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
> by a wait_on_buffer().  It blocks.

yes, but this means that the block was not cached. Remember the original
point: my suggestion was to 'keep in-transaction buffers locked'. You said
this doesn't work because it blocks dump(). But dump() CANNOT block because
those buffers are cached, => dump does not block but just uses getblk()
and skips over those buffers. dump() _of course_ blocks if the buffer is
not cached. Or have i misunderstood you and are we talking about different
issues?

> > You suggested a new mechanism to mark buffers as 'pinned', 
> 
> That is only to synchronise with bdflush: I'd like to be able to
> distinguish between buffers which contain dirty data but which are not
> yet ready for disk IO, and buffers which I want to send to the disk.
> The device drivers themselves should never ever have to worry about
> those buffers: ll_rw_block() is the defined interface for device
> drivers, NOT the buffer cache.

(see later)

> > 2.3 removes physical indexing of cached blocks, 
> 
> 2.2 never guaranteed that IO was from cached blocks in the first place.
> Swap and paging both bypass the buffer cache entirely. [..]

no, paging (named mappings) writes do not bypass the buffer-cache, and
thats the issue. RAID would pretty quickly corrupt filesystems if this was
the case. In 2.2 all filesystem (data and metadata) writes go through the
buffer-cache.

I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
it out. (It's not really hard to fix because the swap cache is more or
less physically indexed.) 

> > and this destroys a fair amount of physical-level optimizations that
> > were possible. (eg. RAID5 has to detect cached data within the same
> > row, to speed up things and avoid double-buffering. If data is in the
> > page cache and not hashed then there is no way RAID5 could detect such
> > data.)
> 
> But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
> a swapon, then the swapper will start to write to that swapfile using
> temporary buffer_heads.  If you do IO or checksum optimisation based on
> the buffer cache you'll risk plastering obsolete data over the disks.  

i dont really mind how it's called. It's a physical index of all dirty &
cached physical device contents which might get written out directly to
the device at any time. In 2.2 this is the buffer-cache. Think about it,
it's not a hack, it's a solid concept. The RAID code cannot even create
its own physical index if the cache is completely private. Should the RAID
code re-read blocks from disk when it calculates parity, just because it
cannot access already cached data in the pagecache? The RAID code is not
just a device driver, it's also a cache manager. Why do you think it's
inferior to access cached data along a physical index?

> > i'll probably try to put pagecache blocks on the physical index again
> > (the buffer-cache), which solution i expect will face some resistance
> > :)
> 
> Yes.  Device drivers should stay below ll_rw_block() and not make any
> assumptions about the buffer cache.  Linus is _really_ determined not to
> let any new assumptions about the buffer cache into the kernel (I'm
> having to deal with this in the journaling filesystem too).

well, as a matter of fact, for a couple of pre-kernels we had all
pagecache pages aliased into the buffer-cache as well, so it's not a
technical problem at all. At that time it clearly appeared to be
beneficial (simpler) to unhash pagecache pages from the buffer-cache so
they got unhashed (as those two entities are orthogonal), but we might
want to rethink that issue.

> > in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
> > The buffer-cache represents all cached (dirty and clean) blocks within the
> > system. 
> 
> It does not, however, represent any non-cached IO.

well, we are not talking about non-cached IO here. We are talking about a
new kind of (improved) page cache that is not physically indexed. _This_
is the problem. If the page-cache was physically indexed then i could look
it up from the RAID code just fine. If the page-cache was physically
indexed (or more accurately, the part of the pagecache that is already
mapped to a device in one way or another, which is 90+% of it.) then the
RAID code could obey all the locking (and additional delaying) rules
present there. This is not just about resync! If it was only for resync,
then we could surely hack in some sort of device-level lock to protect the
reconstruction window.

i think your problem is that you do not accept the fact that [...]

Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Mon, 1 Nov 1999 13:04:23 -0500 (EST), Ingo Molnar <[EMAIL PROTECTED]>
said:

> On Mon, 1 Nov 1999, Stephen C. Tweedie wrote:
>> No, that's completely inappropriate: locking the buffer indefinitely
>> will simply cause jobs like dump() to block forever, for example.

> i dont think dump should block. dump(8) is using the raw block device to
> read fs data, which in turn uses the buffer-cache to get to the cached
> state of device blocks. Nothing blocks there, i've just re-checked
> fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
> anything.

fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
by a wait_on_buffer().  It blocks.

> (the IO layer should and does synchronize on the bh lock) 

Exactly, and the lock flag should be used to synchronise IO, _not_ to
play games with bdflush/writeback.  If we keep buffers locked, then raid
resync is going to stall there too for the same reason ---
wait_on_buffer() will block.

>> However, you're missing a much more important issue: not all writes go
>> through the buffer cache.

>> Currently, swapping bypasses the buffer cache entirely: writes from swap
>> go via temporary buffer_heads to ll_rw_block.  The buffer_heads are

> we were not talking about swapping but journalled transactions, and you
> were asking about a mechanizm to keep the RAID resync from writing back to
> disk.

It's the same issue.  If you arbitrarily write back through the buffer
cache while a swap write IO is in progress, you can wipe out that swap
data and corrupt the swap file.  If you arbitrarily write back journaled
buffers before journaling asks you to, you destroy recovery.  The swap
case is, if anything, even worse: it kills you even if you don't take a
reboot, because you have just overwritten the swapped-out data with the
previous contents of the buffer cache, so you've lost a write to disk.

Journaling does the same thing by using temporary buffer heads to write
metadata to the log without copying the buffer contents.  Again it is IO
which is not in the buffer cache.

There are thus two problems: (a) the raid code is writing back data
from the buffer cache oblivious to the fact that other users of the
device may be writing back data which is not in the buffer cache at all,
and (b) it is writing back data when it was not asked to do so,
destroying write ordering.  Both of these violate the definition of a
device driver.

> The RAID layer resync thread explicitly synchronizes on locked
> buffers. (it doesnt have to but it does) 

And that is illegal, because it assumes that everybody else is using the
buffer cache.  That is not the case, and it is even less the case in
2.3.

> You suggested a new mechanism to mark buffers as 'pinned', 

That is only to synchronise with bdflush: I'd like to be able to
distinguish between buffers which contain dirty data but which are not
yet ready for disk IO, and buffers which I want to send to the disk.
The device drivers themselves should never ever have to worry about
those buffers: ll_rw_block() is the defined interface for device
drivers, NOT the buffer cache.

>> In 2.3 the situation is much worse, as _all_ ext2 file writes bypass the
>> buffer cache. [...]

> the RAID code has major problems with 2.3's pagecache changes. 

It will have major problems with ext3 too, then, but I really do think
that is raid's fault, because:

> 2.3 removes physical indexing of cached blocks, 

2.2 never guaranteed that IO was from cached blocks in the first place.
Swap and paging both bypass the buffer cache entirely.  To assume that
you can synchronise IO by doing a getblk() and syncing on the
buffer_head is wrong, even if it used to work most of the time.

> and this destroys a fair amount of physical-level optimizations that
> were possible. (eg. RAID5 has to detect cached data within the same
> row, to speed up things and avoid double-buffering. If data is in the
> page cache and not hashed then there is no way RAID5 could detect such
> data.)

But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
a swapon, then the swapper will start to write to that swapfile using
temporary buffer_heads.  If you do IO or checksum optimisation based on
the buffer cache you'll risk plastering obsolete data over the disks.  

> i'll probably try to put pagecache blocks on the physical index again
> (the buffer-cache), which solution i expect will face some resistance
> :)

Yes.  Device drivers should stay below ll_rw_block() and not make any
assumptions about the buffer cache.  Linus is _really_ determined not to
let any new assumptions about the buffer cache into the kernel (I'm
having to deal with this in the journaling filesystem too).

> in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
> The buffer-cache represents all cached (dirty and clean) blocks within the
> system. 

It does not, however, represent any non-cached IO.

> If there are other block caches in the system (the page-cache [...]

Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-01 Thread Stephen C. Tweedie

Hi,

On Fri, 29 Oct 1999 14:06:24 -0400 (EDT), Ingo Molnar <[EMAIL PROTECTED]>
said:

> On Fri, 29 Oct 1999, Stephen C. Tweedie wrote:

>> Fixing this in raid seems far, far preferable to fixing it in the
>> filesystems.  The filesystem should be allowed to use the buffer cache
>> for metadata and should be able to assume that there is a way to prevent
>> those buffers from being written to disk until it is ready.

> why don't you lock the buffer during the transaction, that's the right way
> to 'pin' down a buffer and prevent it from being written out. You can keep
> a buffer locked && dirty indefinitely, and it should be easy to unlock them
> when committing a transaction. Am i missing something? 

No, that's completely inappropriate: locking the buffer indefinitely
will simply cause jobs like dump() to block forever, for example.
However, you're missing a much more important issue: not all writes go
through the buffer cache.

Currently, swapping bypasses the buffer cache entirely: writes from swap
go via temporary buffer_heads to ll_rw_block.  The buffer_heads are
never part of the buffer cache and are discarded as soon as IO is
complete.  The same mechanism is used when reading to the page cache,
but that's probably safe enough as writes do use the buffer cache in
2.2.

In 2.3 the situation is much worse, as _all_ ext2 file writes bypass the
buffer cache.  The buffer_heads do persist, but they overlay the page
cache, not the buffer cache --- they do not appear on the buffer cache
hash lists.  You _cannot_ synchronise with these writes at the buffer
cache level.  If your raid resync collides with such a write, it is
entirely possible that the filesystem write will occur between the raid
read and the raid write --- you will corrupt ext2 files.

In my own case right now, ext3 on 2.2 behaves much like ext2 does on 2.3
--- it uses temporary buffer_heads to write directly to the journal, and
so is going to be bitten by the raid resync behaviour.

Basically, device drivers cannot assume that all IOs come from the
buffer cache --- raid has to work at the level of ll_rw_block, not at
the level of the buffer cache.

--Stephen




Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-10-29 Thread Ingo Molnar


On Fri, 29 Oct 1999, Stephen C. Tweedie wrote:

> Ingo, can we work together to address this?  One solution would be the
> ability to mark a buffer_head as "pinned" against being written to disk,

> Fixing this in raid seems far, far preferable to fixing it in the
> filesystems.  The filesystem should be allowed to use the buffer cache
> for metadata and should be able to assume that there is a way to prevent
> those buffers from being written to disk until it is ready.

why don't you lock the buffer during the transaction, that's the right way
to 'pin' down a buffer and prevent it from being written out. You can keep
a buffer locked && dirty indefinitely, and it should be easy to unlock them
when committing a transaction. Am i missing something? 

Ingo



Raid resync changes buffer cache semantics --- not good for journaling!

1999-10-29 Thread Stephen C. Tweedie

Hi all,

There seems to be a conflict between journaling filesystem requirements
(both ext3 and reiserfs), and the current raid code when it comes to
write ordering in the buffer cache.

The current ext3 code adds debugging checks to ll_rw_block designed to
detect any cases where blocks are being written to disk in an order
which breaks the filesystem's transaction ordering guarantees.  

A couple of hours ago it was triggered during a test run here by the
raid background resync daemon.

Raid resync basically works by reading, and rewriting, the entire raid
device stripe by stripe.  The write pass is unconditional.  Even if the
block is marked as reserved for journaling, and so is bypassed by
bdflush, even if the block is clean: it gets written to disk.
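
In outline, the resync pass does something like this (a sketch, not the
md code itself):

    /* Sketch of the resync behaviour described above: read each block of
     * the stripe through the buffer cache, then write it straight back
     * out, regardless of what higher levels wanted. */
    #include <linux/fs.h>
    #include <linux/locks.h>

    static void resync_stripe(kdev_t dev, int first_block, int nr, int size)
    {
        int i;

        for (i = 0; i < nr; i++) {
            struct buffer_head *bh = getblk(dev, first_block + i, size);

            if (!buffer_uptodate(bh)) {
                ll_rw_block(READ, 1, &bh);
                wait_on_buffer(bh);
            }
            /* even a clean buffer is marked dirty and written: this is
             * the unconditional write pass at issue */
            mark_buffer_dirty(bh);
            ll_rw_block(WRITE, 1, &bh);
            wait_on_buffer(bh);
            brelse(bh);
        }
    }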

ext3 uses a separate buffer list for journaled buffers to avoid bdflush
writing them back early.  As I understand it (correct me if I'm wrong,
Chris), reiserfs journaling simply avoids setting the dirty bit on the
buffer_head until the log record has been written.  Neither case stops
raid resync from flushing the buffer to disk.

As far as I can see, the current raid resync simply cannot observe any
write ordering requirements being placed on the buffer cache.  This is
something which will have to be addressed in the raid code --- the only
alternative appears to be to avoid placing any uncommitted transactional
data in the buffer cache at all, which would require massive rewrites of
ext3 (and probably no less trauma in reiserfs).

This isn't a bug in either the raid code or the journaling --- it's just
that the raid code changes semantics which non-journaling filesystems
don't care about.  Journaling adds extra requirements to the buffer
cache, and raid changes the semantics in an incompatible way.  Put the
two together and you have serious problems during a background raid
sync.

Ingo, can we work together to address this?  One solution would be the
ability to mark a buffer_head as "pinned" against being written to disk,
and to have raid resync use a temporary buffer head when updating that
block and use the on-disk copy, not the in-memory one, to update the
disk (guaranteeing that the in-memory copy doesn't hit disk).  You will
have a much better understanding of the locking requirements necessary
to ensure that the two copies don't cause mayhem, but I'm willing to
help on the implementation.  
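
As a sketch of that proposal (BH_Pinned is a made-up flag name and the
bit number is arbitrary; nothing like it exists in the kernel today):

    /* Sketch of the proposed rule: if the cached copy of a block is
     * pinned by a transaction, resync must not write it back from memory
     * -- it must fall back to a temporary buffer_head and rewrite the
     * on-disk contents instead. */
    #include <linux/fs.h>
    #include <linux/locks.h>

    #define BH_Pinned 7     /* hypothetical state bit, not in any kernel */

    static void resync_one_block(kdev_t dev, int block, int size)
    {
        struct buffer_head *bh = get_hash_table(dev, block, size);

        if (bh && !test_bit(BH_Pinned, &bh->b_state)) {
            /* safe to rewrite the in-memory copy */
            mark_buffer_dirty(bh);
            ll_rw_block(WRITE, 1, &bh);
            wait_on_buffer(bh);
        } else {
            /* pinned (or not cached): read the on-disk contents into a
             * temporary buffer_head and rewrite from that copy, so the
             * uncommitted in-memory version never hits the disk */
        }
        if (bh)
            brelse(bh);
    }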

Fixing this in raid seems far, far preferable to fixing it in the
filesystems.  The filesystem should be allowed to use the buffer cache
for metadata and should be able to assume that there is a way to prevent
those buffers from being written to disk until it is ready.

--Stephen



Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-04 Thread Stephen C. Tweedie

Hi,

On Thu, 4 Nov 1999 19:03:18 +0100 (CET), Rik van Riel
<[EMAIL PROTECTED]> said:

> The obvious solution would be to allow multiple versions of the
> same disk block to be in memory. 

That is already possible.  You can make as many buffer_heads as you want
for a given disk block, as long as only one is in the buffer cache itself.

> We need that anyway for journalling, so why not extend it with a flag
> (or use another flag) to show that _this_ particular disk block is the
> one on disk or the one supposed to be on disk (go on, flush this).

That's just not the point --- even for the copy which _is_ supposed to
be authoritative on disk, you can't be sure that it is in fact uptodate
because there is a race between the filesystem modifying the buffer
contents and the dirty bit being marked in the buffer_head.

--Stephen



Re: (reiserfs) Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-04 Thread Rik van Riel

On Wed, 3 Nov 1999, Ingo Molnar wrote:
> On Wed, 3 Nov 1999, Stephen C. Tweedie wrote:
> 
> > So, what semantics, precisely, do you need in order to calculate parity?
> > I don't see how you can do it reliably if you don't know if the
> > in-memory buffer_head matches what is on disk.
> 
> ok, i see your point. I guess i'll have to change the RAID code to
> do the following:

[snip]

The obvious solution would be to allow multiple versions of the
same disk block to be in memory. We need that anyway for
journalling, so why not extend it with a flag (or use another
flag) to show that _this_ particular disk block is the one on
disk or the one supposed to be on disk (go on, flush this).

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.