Re: wierdisms w/ ext3.

1999-11-02 Thread Timothy Ball

Here's the info from /var/log/dmesg. Could it be that my journal file
has a large inode number? And if you have more than one ext3 partition
can you have more than one journal file? How would you specify it...
must read code... 

--tim "ooh kdb is neat" ball

--snip--snip--snip--
Partition check:
 hda: hda1 hda2 < hda5 hda6 >
 hdb: hdb1 hdb2 hdb3
RAMDISK: Compressed image found at block 0
autodetecting RAID arrays
autorun ...
... autorun DONE.
ext3: No journal on filesystem on 01:00
EXT3-fs: get root inode failed
VFS: Mounted root (ext2 filesystem).
autodetecting RAID arrays
autorun ...
... autorun DONE.
ext3: No journal on filesystem on 03:42
EXT3-fs: get root inode failed
VFS: Mounted root (ext2 filesystem) readonly.
change_root: old root has d_count=1
Trying to unmount old root ... okay
--snip--snip--snip--

-- 
Send mail with subject "send pgp key" for public key.
pub  1024R/CFF85605 1999-06-10 Timothy L. Ball [EMAIL PROTECTED]
 Key fingerprint = 8A 8E 64 D6 21 C0 90 29  9F D6 1E DC F8 18 CB CD



Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-02 Thread Ingo Molnar


On Mon, 1 Nov 1999 [EMAIL PROTECTED] wrote:

 XFS on Irix caches file data in buffers, but not in the regular buffer cache;
 they are cached off the vnode and organized by logical file offset rather
 than by disk block number. The memory in these buffers comes from the page
 subsystem, the page tag being the vnode and file offset. These buffers do
 not have to have a physical disk block associated with them: XFS allows you
 to reserve blocks on the disk for a file without picking which blocks. At
 some point when the data needs to be written (memory pressure, sync
 activity, etc.), the filesystem is asked to allocate physical blocks for the
 data; these are associated with the buffers and they get written out.
 Delaying the allocation allows us to collect together multiple small writes
 into one big allocation request. It also means that we can bypass allocation
 altogether if the file is truncated before it is flushed to disk.

the new 2.3 pagecache should enable this almost out of the box. Apart from
memory pressure issues, the missing bit is to split up fs->get_block()
into a 'soft' and a 'real' allocation branch. This means that whenever the
pagecache creates a new dirty page, it calls the 'soft' get_block()
variant, which is very fast and just bumps up some counters within XFS (so
we do not get asynchronous out-of-space conditions). Then whenever
ll_rw_block() (or bdflush) sees a !buffer_mapped() but buffer_allocated()
block, it will call the 'real' lowlevel handler to do the allocation for
real. 

i kept this in mind all along when doing the pagecache changes, and i
intend to do this for ext2fs. Splitting up get_block() is easy without
breaking filesystems: the last 'create' parameter can be made '2' to mean
'lazy create'. 
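
As a rough illustration of that split (purely a sketch; the example_*
helpers and the BH_Allocated bit are assumptions, not existing kernel code):

/* Sketch of a get_block() honouring a lazy-create mode.
 * create == 0: lookup only; create == 1: allocate now;
 * create == 2: 'soft' path, reserve space and defer the real allocation. */
static int example_get_block(struct inode *inode, long iblock,
                             struct buffer_head *bh, int create)
{
        if (!create)
                return example_lookup_block(inode, iblock, bh);

        if (create == 2) {
                /* fast path: just account for the space so we never see
                 * an asynchronous out-of-space error at writeback time */
                if (example_reserve_blocks(inode, 1))
                        return -ENOSPC;
                bh->b_state |= (1UL << BH_Allocated);   /* proposed flag */
                return 0;
        }

        /* 'real' path: pick physical blocks now and map the buffer */
        return example_allocate_block(inode, iblock, bh);
}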

note that not all filesystems can know in advance how much space a new
inode block will take, but this is not a problem: the lazy-allocator can
safely 'overestimate' space needs. 

is this the kind of interface you need for XFS? i can make a prototype
patch for ext2fs (and the pagecache & bdflush), which should be easy to
adopt for XFS. 

-- mingo



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Mon, 1 Nov 1999 13:04:23 -0500 (EST), Ingo Molnar [EMAIL PROTECTED]
said:

 On Mon, 1 Nov 1999, Stephen C. Tweedie wrote:
 No, that's completely inappropriate: locking the buffer indefinitely
 will simply cause jobs like dump() to block forever, for example.

 i don't think dump should block. dump(8) is using the raw block device to
 read fs data, which in turn uses the buffer-cache to get to the cached
 state of device blocks. Nothing blocks there, i've just re-checked
 fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
 anything.

fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
by a wait_on_buffer().  It blocks.
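
(Schematically, the path is roughly the following; this is a simplified
sketch, not the literal fs/block_dev.c code:)

        bh = getblk(dev, block, blocksize);   /* find or create the buffer    */
        if (!buffer_uptodate(bh)) {
                ll_rw_block(READ, 1, &bh);    /* queue the read to the driver */
                wait_on_buffer(bh);           /* sleeps until the IO completes */
        }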

 (the IO layer should and does synchronize on the bh lock) 

Exactly, and the lock flag should be used to synchronise IO, _not_ to
play games with bdflush/writeback.  If we keep buffers locked, then raid
resync is going to stall there too for the same reason ---
wait_on_buffer() will block.

 However, you're missing a much more important issue: not all writes go
 through the buffer cache.

 Currently, swapping bypasses the buffer cache entirely: writes from swap
 go via temporary buffer_heads to ll_rw_block.  The buffer_heads are

 we were not talking about swapping but journalled transactions, and you
 were asking about a mechanism to keep the RAID resync from writing back to
 disk.

It's the same issue.  If you arbitrarily write back through the buffer
cache while a swap write IO is in progress, you can wipe out that swap
data and corrupt the swap file.  If you arbitrarily write back journaled
buffers before journaling asks you to, you destroy recovery.  The swap
case is, if anything, even worse: it kills you even if you don't take a
reboot, because you have just overwritten the swapped-out data with the
previous contents of the buffer cache, so you've lost a write to disk.

Journaling does the same thing by using temporary buffer heads to write
metadata to the log without copying the buffer contents.  Again it is IO
which is not in the buffer cache.
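
(For illustration only, a hedged sketch of the temporary-buffer_head
technique; the field handling is simplified and journal_block is a
placeholder, not the actual ext3 code:)

        /* Point a private buffer_head at the same data, but at the log
         * location, so the write never touches the cached buffer itself. */
        struct buffer_head *tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);

        memset(tmp, 0, sizeof(*tmp));
        tmp->b_data    = bh->b_data;          /* share the data, no copy     */
        tmp->b_size    = bh->b_size;
        tmp->b_dev     = bh->b_dev;
        tmp->b_blocknr = journal_block;       /* aim at the log area instead */
        tmp->b_state   = (1UL << BH_Uptodate) | (1UL << BH_Dirty);
        ll_rw_block(WRITE, 1, &tmp);
        wait_on_buffer(tmp);                  /* or use an async completion  */
        kfree(tmp);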

There are thus two problems: (a) the raid code is writing back data
from the buffer cache oblivious to the fact that other users of the
device may be writing back data which is not in the buffer cache at all,
and (b) it is writing back data when it was not asked to do so,
destroying write ordering.  Both of these violate the definition of a
device driver.

 The RAID layer resync thread explicitly synchronizes on locked
 buffers. (it doesn't have to, but it does) 

And that is illegal, because it assumes that everybody else is using the
buffer cache.  That is not the case, and it is even less the case in
2.3.

 You suggested a new mechanism to mark buffers as 'pinned', 

That is only to synchronise with bdflush: I'd like to be able to
distinguish between buffers which contain dirty data but which are not
yet ready for disk IO, and buffers which I want to send to the disk.
The device drivers themselves should never ever have to worry about
those buffers: ll_rw_block() is the defined interface for device
drivers, NOT the buffer cache.
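
Something like the following is presumably what's meant (a sketch only;
the BH_Pinned bit and this bdflush hook do not exist in the kernel today):

        /* hypothetical 'pinned' bit: dirty data the fs is not yet ready
         * to let bdflush write back */
        #define buffer_pinned(bh)   test_bit(BH_Pinned, &(bh)->b_state)

        /* ... inside bdflush's scan of the dirty buffer list ... */
        if (buffer_pinned(bh))
                continue;           /* leave it alone: the fs flushes it itself */
        if (buffer_dirty(bh) && !buffer_locked(bh))
                ll_rw_block(WRITE, 1, &bh);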

 In 2.3 the situation is much worse, as _all_ ext2 file writes bypass the
 buffer cache. [...]

 the RAID code has major problems with 2.3's pagecache changes. 

It will have major problems with ext3 too, then, but I really do think
that is raid's fault, because:

 2.3 removes physical indexing of cached blocks, 

2.2 never guaranteed that IO was from cached blocks in the first place.
Swap and paging both bypass the buffer cache entirely.  To assume that
you can synchronise IO by doing a getblk() and syncing on the
buffer_head is wrong, even if it used to work most of the time.

 and this destroys a fair amount of physical-level optimizations that
 were possible. (eg. RAID5 has to detect cached data within the same
 row, to speed up things and avoid double-buffering. If data is in the
 page cache and not hashed then there is no way RAID5 could detect such
 data.)

But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
a swapon, then the swapper will start to write to that swapfile using
temporary buffer_heads.  If you do IO or checksum optimisation based on
the buffer cache you'll risk plastering obsolete data over the disks.  

 i'll probably try to put pagecache blocks on the physical index again
 (the buffer-cache), which solution i expect will face some resistance
 :)

Yes.  Device drivers should stay below ll_rw_block() and not make any
assumptions about the buffer cache.  Linus is _really_ determined not to
let any new assumptions about the buffer cache into the kernel (I'm
having to deal with this in the journaling filesystem too).

 in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
 The buffer-cache represents all cached (dirty and clean) blocks within the
 system. 

It does not, however, represent any non-cached IO.

 If there are other block caches in the system (the page-cache in 2.2
 was readonly, thus not an issue), 

Re: wierdisms w/ ext3.

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Mon, 1 Nov 1999 15:03:54 -0600, Timothy Ball
[EMAIL PROTECTED] said:

 I did my best to try to follow what the README for ext3 said. I made a
 journal file in /var/local/journal/journal.dat. It has an inode # of
 183669.

 Then I did /sbin/lilo -R linux rw rootflags=journal=183669. 

Silly question, but is /var/local/journal on the same filesystem as the
root?  Those rootflags look fine otherwise.

--Stephen



Re: wierdisms w/ ext3.

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Tue, 2 Nov 1999 03:10:10 -0600, Timothy Ball
[EMAIL PROTECTED] said:

 Here's the info from /var/log/dmesg. Could it be that my journal file
 has a large inode number? And if you have more than one ext3 partition
 can you have more than one journal file? How would you specify it...
 must read code... 

You need one per filesystem, and you register it when you mount the
filesystem.  For non-root filesystems, just umount and remount with the
"-o journal=xxx" flag.  For the root filesystem you need the rootflags=
trick.
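
In other words, something along these lines (the device and mount point
are placeholders; the rootflags line is Tim's own from the earlier message):

  # non-root filesystem: unmount, then remount registering the journal inode
  umount /mnt/data
  mount -o journal=<journal-inode-number> /dev/XXX /mnt/data

  # root filesystem: pass the same option at boot via rootflags
  /sbin/lilo -R linux rw rootflags=journal=183669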

--Stephen



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Ingo Molnar


On Tue, 2 Nov 1999, Stephen C. Tweedie wrote:

  i don't think dump should block. dump(8) is using the raw block device to
  read fs data, which in turn uses the buffer-cache to get to the cached
  state of device blocks. Nothing blocks there, i've just re-checked
  fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
  anything.
 
 fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
 by a wait_on_buffer().  It blocks.

yes, but this means that the block was not cached. Remember the original
point: my suggestion was to 'keep in-transaction buffers locked'. You said
this doesn't work because it blocks dump(). But dump() CANNOT block because
those buffers are cached, => dump does not block but just uses getblk()
and skips over those buffers. dump() _of course_ blocks if the buffer is
not cached. Or have i misunderstood you, and are we talking about different
issues?

  You suggested a new mechanism to mark buffers as 'pinned', 
 
 That is only to synchronise with bdflush: I'd like to be able to
 distinguish between buffers which contain dirty data but which are not
 yet ready for disk IO, and buffers which I want to send to the disk.
 The device drivers themselves should never ever have to worry about
 those buffers: ll_rw_block() is the defined interface for device
 drivers, NOT the buffer cache.

(see later)

  2.3 removes physical indexing of cached blocks, 
 
 2.2 never guaranteed that IO was from cached blocks in the first place.
 Swap and paging both bypass the buffer cache entirely. [..]

no, paging (named mappings) writes do not bypass the buffer-cache, and
that's the issue. RAID would pretty quickly corrupt filesystems if this were
the case. In 2.2 all filesystem (data and metadata) writes go through the
buffer-cache.

I agree that swapping is a problem (bug) even in 2.2, thanks for pointing
it out. (It's not really hard to fix because the swap cache is more or
less physically indexed.) 

  and this destroys a fair amount of physical-level optimizations that
  were possible. (eg. RAID5 has to detect cached data within the same
  row, to speed up things and avoid double-buffering. If data is in the
  page cache and not hashed then there is no way RAID5 could detect such
  data.)
 
 But you cannot rely on the buffer cache.  If I "dd" to a swapfile and do
 a swapon, then the swapper will start to write to that swapfile using
 temporary buffer_heads.  If you do IO or checksum optimisation based on
 the buffer cache you'll risk plastering obsolete data over the disks.  

i don't really mind what it's called. It's a physical index of all dirty 
cached physical device contents which might get written out directly to
the device at any time. In 2.2 this is the buffer-cache. Think about it,
it's not a hack, it's a solid concept. The RAID code cannot even create
its own physical index if the cache is completely private. Should the RAID
code re-read blocks from disk when it calculates parity, just because it
cannot access already cached data in the pagecache? The RAID code is not
just a device driver, it's also a cache manager. Why do you think it's
inferior to access cached data along a physical index?

  i'll probably try to put pagecache blocks on the physical index again
  (the buffer-cache), which solution i expect will face some resistance
  :)
 
 Yes.  Device drivers should stay below ll_rw_block() and not make any
 assumptions about the buffer cache.  Linus is _really_ determined not to
 let any new assumptions about the buffer cache into the kernel (I'm
 having to deal with this in the journaling filesystem too).

well, as a matter of fact, for a couple of pre-kernels we had all
pagecache pages aliased into the buffer-cache as well, so it's not a
technical problem at all. At that time it clearly appeared to be
beneficial (simpler) to unhash pagecache pages from the buffer-cache, so
they got unhashed (as those two entities are orthogonal), but we might
want to rethink that issue.

  in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
  The buffer-cache represents all cached (dirty and clean) blocks within the
  system. 
 
 It does not, however, represent any non-cached IO.

well, we are not talking about non-cached IO here. We are talking about a
new kind of (improved) page cache that is not physically indexed. _This_
is the problem. If the page-cache was physically indexed then i could look
it up from the RAID code just fine. If the page-cache was physically
indexed (or more accurately, the part of the pagecache that is already
mapped to a device in one way or another, which is 90+% of it.) then the
RAID code could obey all the locking (and additional delaying) rules
present there. This is not just about resync! If it was only for resync,
then we could surely hack in some sort of device-level lock to protect the
reconstruction window.

i think your problem is that you do not accept the fact that the RAID code
is a cache manager/cache user. There are RL 

Re: Linux Buffer Cache Does Not Support Mirroring

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Mon, 01 Nov 1999 15:53:29 -0500, Jeff Garzik
[EMAIL PROTECTED] said:

 XFS delays allocation of user data blocks when possible to
 make blocks more contiguous; holding them in the buffer cache.
 This allows XFS to make extents large without requiring the user
 to specify extent size, and without requiring a filesystem
 reorganizer to fix the extent sizes after the fact. This also
 reduces the number of writes to disk and extents used for a file. 

 Is this sort of manipulation possible with the existing buffer cache?

Absolutely not, but it is not hard with the page cache.  The main thing
missing is a VM callback to allow memory pressure to force unallocated,
pinned pages to disk.

--Stephen



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread braam


Hi,

Stephen wrote:

 Fixing this in raid seems far, far preferable to fixing it in the
 filesystems.  The filesystem should be allowed to use the buffer cache
 for metadata and should be able to assume that there is a way to prevent
 those buffers from being written to disk until it is ready.

What about doing it in the page cache: i.e. reserve pages for journaling
and let them hit the buffer cache only when the transaction allows it?

This may be a naive suggestion, but it looks logical.

- Peter -



Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Matt Zinkevicius

Is just software RAID affected, or hardware RAID as well?

--Matt




Bug in FAT in 2.3.24 and DMSDOS

1999-11-02 Thread Pavel Pisa

Hello all,
   I have two things on my desk now.

---
1) There is a bug in the FAT FS which is triggered by an lseek past the
   end of the file followed by a write. The old code allocated and zeroed
   all the clusters necessary to write at the wanted position. The new
   code cannot do that. I use this to allocate a new STACKER CVF file
   with the MKSTACFS program.

   To trigger the bug, call something like "mkstacfs stacvol.000 1 4"
   on any mounted DOS partition. It should create a 5MB file
   "stacvol.000", but instead it leads to:

-
Nov  1 22:51:11 thor kernel: kernel BUG at file.c:94!
Nov  1 22:51:11 thor kernel: invalid operand: 
Nov  1 22:51:11 thor kernel: CPU:0
Nov  1 22:51:11 thor kernel: EIP:0010:[c480af7d]
Nov  1 22:51:11 thor kernel: EFLAGS: 00010286
Nov  1 22:51:11 thor kernel: eax: 0019   ebx:    ecx:    edx: 003b
Nov  1 22:51:11 thor kernel: esi: 2717   edi: c1d522a0   ebp: c1801380   esp: c1e63e5c
Nov  1 22:51:11 thor kernel: ds: 0018   es: 0018   ss: 0018
Nov  1 22:51:11 thor kernel: Process mkstacfs (pid: 1456, stackpage=c1e63000)
Nov  1 22:51:11 thor kernel: Stack: 005e 2717 1000 0008  c0128e0b c1d522a0 2717
Nov  1 22:51:11 thor kernel:c1801380 0001 c105a000 004e2000 0fff   c1800fff
Nov  1 22:51:11 thor kernel:c1800fff c1801620 0007  0001 01ff 0007 0007
[c0128e0b] block_write_cont_page+653
[c480b12e] fat_write_partial_page+302
[c011f527] generic_file_write+577
[c480b19f] 
[c480b000] 
[c480b172] 
[c01261ea] sys_write+184
[c0108c84] 


int fat_get_block(struct inode *inode, long iblock,
                  struct buffer_head *bh_result, int create)
{

...

        if (iblock << 9 != MSDOS_I(inode)->i_realsize) {
                BUG();
                return -EIO;
        }
-
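
A minimal userspace sketch of the lseek-past-EOF write pattern that trips
this (illustrative only, with an assumed mount point; mkstacfs itself
does more):

/* Extend a file on a mounted FAT volume by seeking well past EOF and
 * writing; the old FAT code allocated and zeroed the skipped clusters,
 * while 2.3.24 hits the BUG() in fat_get_block() instead. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[512] = { 0 };
        int fd = open("/mnt/dos/stacvol.000", O_CREAT | O_WRONLY, 0644);

        if (fd < 0)
                return 1;
        /* seek almost 5MB past the start of the empty file ... */
        if (lseek(fd, 5 * 1024 * 1024 - sizeof(buf), SEEK_SET) < 0)
                return 1;
        /* ... and write, forcing allocation of the skipped space */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
                return 1;
        return close(fd) != 0;
}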

   I did not try to fix this. I can write simple code to allocate the
   full cluster chain that is needed, but I do not know the original
   writer's reasons. There is no comment for "i_realsize"; it seems to be
   the file size rounded up to a multiple of SECTOR_SIZE.
   I think a multiple of the cluster size would make more sense,
   but I really do not know the original writer's intent.

-
2) I have spent a little time updating DMSDOS to the 2.3.x
   kernels. I have a patched FAT FS and a version of DMSDOS
   which can read, write and map read-only by use of readpage.
   It does not use the new page cache for reads and writes;
   I want to solve those problems after the kernel stabilizes.
   I need a stable VFS and FAT which will not change for some time.
   But there are some real bugs in FAT which prevent use
   of the CVF layer for anything other than big blocks.

   I have put these changes and a more commented cvf.c into my
   first patch. It contains what DMSDOS really needs in the kernel,
   and the patch should not break anything.

   I have more changes on my hard drive, but they are only experimental.
 

Best wishes,
  Pavel Pisa  
  


PS: please CC directly to me

 FAT BUG() trigger
 My updates for future DMSDOS versions


Re: Raid resync changes buffer cache semantics --- not good for journaling!

1999-11-02 Thread Theodore Y. Ts'o

   From: "Stephen C. Tweedie" [EMAIL PROTECTED]
   Date:   Tue, 2 Nov 1999 17:44:55 + (GMT)

   Ask Linus, he's pushing this point much more strongly than I am!  The
   buffer cache will become less and less of a cache as time goes on in his
   grand plan: it is to become little more than an IO buffer layer.

Ultimately, I think we may be better off if we remove any hint of caching
from the I/O buffer layer.  The cache coherency issues between the page
and buffer cache make me nervous, and I'm not completely 100% convinced
we got it all right.  (I'm wondering if some of the ext2 corruption
reports in the 2.2 kernels are coming from a buffer cache/page cache
corruption.)

This means putting filesystem meta-data into the page cache.  Yes, I
know Stephen has some concerns about doing this because the big memory
patches mean pages in the page cache might not be directly accessible by
the kernel.  I see two solutions to this, both with drawbacks.  One is
to use a VM remap facility to map directories, superblocks, inode tables
etc. into the kernel address space.  The other is to have flags which
ask the kernel to map filesystem metadata into the part of the page cache
that's addressable by the kernel.  The first adds a VM delay to
accessing the filesystem metadata, and the other means we need to manage
the part of the page cache that's below 2GB differently from the page
cache in high memory, at least as far as freeing pages in response to
memory pressure is concerned.
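
For the first option, the metadata access would presumably end up looking
something like this (a hedged sketch assuming a kmap()-style remapping
interface; find_metadata_page() and the local variables are hypothetical):

        /* read a directory entry out of metadata held in a (possibly
         * high-memory) page-cache page */
        struct page *page = find_metadata_page(inode, block);  /* hypothetical */
        char *vaddr = kmap(page);       /* temporary kernel mapping */

        memcpy(dirent_buf, vaddr + offset, len);
        kunmap(page);                   /* drop the mapping again   */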

   Basically, for the raid code to poke around in higher layers is a huge
   layering violation.  We are heading towards doing things like adding
   kiobuf interfaces to ll_rw_block (in which the IO descriptor that the
   driver receives will have no reference to the buffer cache), and
   raw, unbuffered access to the drivers for raw devices and O_DIRECT.
   Raw IO is already there and bypasses the buffer cache.  So does swap.
   So does journaling.  So does page-in (in 2.2) and page-out (in 2.3).

It'll be interesting to see how this affects using dump(8) on a mounted
filesystem.  This was never particularly guaranteed to give a coherent
filesystem image, but what with increasing bypass of the buffer cache,
it may make the results of using dump(8) on a live filesystem even
worse.

One way of solving this is to add some kernel support for dump(8); for
example, the infamous iopen() call which Linus hates so much.  (Yes, it
violates the Unix permission model, which is why it needs to be
restricted to root, and yes, it won't work on all filesystems; just
those that have inodes.)  The other is to simply tell people to give up
on dump completely, and just use a file-level tool such as tar or bru.

- Ted



Re: Buffer and page cache

1999-11-02 Thread Stephen C. Tweedie

Hi,

On Tue, 02 Nov 1999 08:15:36 -0700, [EMAIL PROTECTED] said:

 I'd like these pages to age a little before handing them over to the
 "inode disk", because the "write_one_page" function called by
 generic_file_write would incur significant latency if the inode disk is
 "real", ie. not simulated in the same system.

The write-page method is only required to queue the data for writing to
the media.  It is not required to complete the physical IO, so the
filesystem can use any mechanism it likes to keep those pages queued for
eventual physical IO (just as 2.3 uses the buffer lists to queue that
data for eventual writeback via bdflush).

 So we have a page cache for the inodes in the file system where the
 pages become dirty - but no buffers are attached.  It reminds of a
 shared mapping, but there is no vma for the pages.

Fine.

 What appears to be needed is the following - probably it's mostly
 lacking in my understanding, but I'd appreciate to be advised how to
 attack the following points:

 - a bit to keep shrink_mmap away from the page.  

Yes, bumping the page count is the perfect way to do this.

 - a bit for a struct page that indicates the page needs to be written. 
 From block_write_full_page one could think that the PageUptoDate bit is
 maybe the one to use.  But does that really describe that this page is
 "dirty" - as it is done for buffers.

PageUpToDate can't be used: it is needed to flag whether the contents of
the page are valid for a read.  A written page must always be uptodate:
!uptodate implies that we have created the page but are still reading it
in from disk (or that the readin failed for some reason).

 - some indication of aging: we would like a pgflush daemon to walk the
 dirty pages of the file system and write them back _after_ a little
 while 

The fs should be able to manage that on its own.  If you queue all of
the pages which have been sent to the writepage() method, then you can
flush to the physical disk whenever you want.  A trivial bdflush
lookalike in the fs itself can deal with that.
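
A sketch of that shape, with all names hypothetical and locking omitted
(this is just the arrangement described above, not existing code):

/* Per-fs queue of pages accepted by our writepage(); a private
 * bdflush-lookalike thread writes them to the inode disk later. */
struct queued_page {
        struct queued_page *next;
        struct page        *page;       /* page count held while queued */
        unsigned long       queued_at;  /* jiffies, for simple aging    */
};

static struct queued_page *dirty_pages;

static int example_writepage(struct page *page)
{
        struct queued_page *qp = kmalloc(sizeof(*qp), GFP_KERNEL);

        if (!qp)
                return -ENOMEM;
        atomic_inc(&page->count);       /* keep shrink_mmap away        */
        qp->page = page;
        qp->queued_at = jiffies;
        qp->next = dirty_pages;
        dirty_pages = qp;
        return 0;                       /* queued, not yet on the media */
}

/* called periodically by the fs's own flush thread */
static void example_flush_old_pages(unsigned long max_age)
{
        struct queued_page **pp = &dirty_pages;

        while (*pp) {
                struct queued_page *qp = *pp;

                if (jiffies - qp->queued_at < max_age) {
                        pp = &qp->next;
                        continue;
                }
                *pp = qp->next;
                example_send_to_inode_disk(qp->page);   /* hypothetical */
                page_cache_release(qp->page);           /* drop the ref */
                kfree(qp);
        }
}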

You might well want a filesystem-private pointer in the page struct off
which to hook any fs-specific data (such as your dirty page linked list
pointers and the dirty flag).  You will also need a way for the VM to
exert memory pressure on those pages if it needs to reclaim memory.

These are both things which ext3 will want anyway, so we should make
sure that any infrastructure that gets put in place for this gets
reviewed by all the different fs groups first.

--Stephen



Buffer and page cache

1999-11-02 Thread braam

Hi,

I'm working on a file system which talks to an "inode disk"; the storage
industry calls these object based disks.  A simulated object based disk
can be constructed from the lower half of ext2 (or any other file system
for that matter). 

The file system has no knowledge of disk blocks, and solely uses the
page cache.  

I'd like these pages to age a little before handing them over to the
"inode disk", because the "write_one_page" function called by
generic_file_write would incur significant latency if the inode disk is
"real", ie. not simulated in the same system.

So we have a page cache for the inodes in the file system where the
pages become dirty - but no buffers are attached.  It reminds of a
shared mapping, but there is no vma for the pages.

What appears to be needed is the following - probably it's mostly
lacking in my understanding, but I'd appreciate to be advised how to
attack the following points:

- a bit to keep shrink_mmap away from the page.  When the file system
writes in this page, we need to change its state so that it doesn't get
thrown out afterwards.  We could "get" the page for this purpose. 
Locking is not good, since we may need to write to the page again.

- a bit for a struct page that indicates the page needs to be written. 
From block_write_full_page one could think that the PageUptoDate bit is
maybe the one to use.  But does that really describe that this page is
"dirty" - as it is done for buffers.

- some indication of aging: we would like a pgflush daemon to walk the
dirty pages of the file system and write them back _after_ a little
while 

The construction should hopefully be capable of supporting Stephen's
journaling extensions too, but I can't oversee everything in one go
(he probably can).

Any advice would be appreciated!

Now, why are we doing this?

Effectively we have split Ext2 into an upper half (the file system) and
a lower half (the object based device driver).  

For cluster file systems it does seem an attractive division of labor to
let the drive do the allocation and have the clustered file system only
share inode metadata and data blocks.  So the block and inode allocation
metadata is not spread around the cluster.  This saves locks and traffic
and, perhaps most importantly, complexity.

You can find some preliminary code at:
ftp://carissimi.coda.cs.cmu.edu/pub/obd, but currently it writes through
to the disk and doesn't cluster yet.  Hence this message. 

- Peter -