Re: weirdisms w/ ext3.
Here's the info from /var/log/dmesg. Could it be that my journal file has a large inode number? And if you have more than one ext3 partition, can you have more than one journal file? How would you specify it... must read code...

--tim "ooh kdb is neat" ball

--snip--snip--snip--
Partition check:
 hda: hda1 hda2 hda5 hda6
 hdb: hdb1 hdb2 hdb3
RAMDISK: Compressed image found at block 0
autodetecting RAID arrays
autorun ...
... autorun DONE.
ext3: No journal on filesystem on 01:00
EXT3-fs: get root inode failed
VFS: Mounted root (ext2 filesystem).
autodetecting RAID arrays
autorun ...
... autorun DONE.
ext3: No journal on filesystem on 03:42
EXT3-fs: get root inode failed
VFS: Mounted root (ext2 filesystem) readonly.
change_root: old root has d_count=1
Trying to unmount old root ... okay
--snip--snip--snip--
Re: Linux Buffer Cache Does Not Support Mirroring
On Mon, 1 Nov 1999 [EMAIL PROTECTED] wrote:

> XFS on Irix caches file data in buffers, but not in the regular buffer
> cache; they are cached off the vnode and organized by logical file offset
> rather than by disk block number. The memory in these buffers comes from
> the page subsystem, the page tag being the vnode and file offset. These
> buffers do not have to have a physical disk block associated with them:
> XFS allows you to reserve blocks on the disk for a file without picking
> which blocks. At some point when the data needs to be written (memory
> pressure, or sync activity etc), the filesystem is asked to allocate
> physical blocks for the data; these are associated with the buffers and
> they get written out. Delaying the allocation allows us to collect
> together multiple small writes into one big allocation request. It also
> means that we can bypass allocation altogether if the file is truncated
> before it is flushed to disk.

the new 2.3 pagecache should enable this almost out-of-box. Apart from memory pressure issues, the missing bit is to split up fs->get_block() into a 'soft' and 'real' allocation branch. This means that whenever the pagecache creates a new dirty page, it calls the 'soft' get_block() variant, which is very fast and just bumps up some counters within XFS (so we do not get asynchronous out-of-space conditions). Then whenever ll_rw_block() (or bdflush) sees a !buffer_mapped() but buffer_allocated() block it will call the 'real' lowlevel handler to do the allocation for real.

i kept this in mind all along when doing the pagecache changes, and i intend to do this for ext2fs. Splitting up get_block() is easy without breaking filesystems: the last 'create' parameter can be made '2' to mean 'lazy create'. note that not all filesystems can know in advance how much space a new inode block will take, but this is not a problem, the lazy-allocator can safely 'overestimate' space needs.

is this the kind of interface you need for XFS? i can make a prototype patch for ext2fs (and the pagecache bdflush), which should be easy to adopt for XFS.

-- mingo
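[Editorial sketch, not code from the thread: roughly what a get_block() with a create == 2 "lazy create" branch could look like. The helpers fs_reserve_blocks()/fs_allocate_block() and the exact state handling are assumptions for illustration; only the create-parameter convention comes from the mail above.]

    #include <linux/fs.h>

    /* Assumed helpers -- stand-ins for whatever the filesystem provides. */
    extern int fs_reserve_blocks(struct super_block *sb, int nr);
    extern int fs_allocate_block(struct inode *inode, long iblock);

    int example_get_block(struct inode *inode, long iblock,
                          struct buffer_head *bh_result, int create)
    {
        if (create == 2) {
            /* "Soft" branch: account for the space now so the flush path
             * can never hit an asynchronous out-of-space condition; it is
             * safe to overestimate.  No physical block is chosen yet, so
             * the buffer stays unmapped (this is where the proposed
             * buffer_allocated() state would be set). */
            if (fs_reserve_blocks(inode->i_sb, 1))
                return -ENOSPC;
            return 0;
        }

        if (create) {
            /* "Real" branch: pick a physical block and map the buffer.
             * Called later, e.g. from bdflush/ll_rw_block(), when it sees
             * an unmapped but space-reserved buffer. */
            int block = fs_allocate_block(inode, iblock);
            if (block < 0)
                return block;
            bh_result->b_dev = inode->i_dev;
            bh_result->b_blocknr = block;
            set_bit(BH_Mapped, &bh_result->b_state);
            return 0;
        }

        /* create == 0: plain lookup of an existing mapping (omitted). */
        return -EIO;
    }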
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

On Mon, 1 Nov 1999 13:04:23 -0500 (EST), Ingo Molnar [EMAIL PROTECTED] said:

> On Mon, 1 Nov 1999, Stephen C. Tweedie wrote:
>> No, that's completely inappropriate: locking the buffer indefinitely
>> will simply cause jobs like dump() to block forever, for example.
> i dont think dump should block. dump(8) is using the raw block device to
> read fs data, which in turn uses the buffer-cache to get to the cached
> state of device blocks. Nothing blocks there, i've just re-checked
> fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
> anything.

fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed by a wait_on_buffer(). It blocks.

> (the IO layer should and does synchronize on the bh lock)

Exactly, and the lock flag should be used to synchronise IO, _not_ to play games with bdflush/writeback. If we keep buffers locked, then raid resync is going to stall there too for the same reason --- wait_on_buffer() will block.

However, you're missing a much more important issue: not all writes go through the buffer cache.

>> Currently, swapping bypasses the buffer cache entirely: writes from swap
>> go via temporary buffer_heads to ll_rw_block. The buffer_heads are
> we were not talking about swapping but journalled transactions, and you
> were asking about a mechanism to keep the RAID resync from writing back
> to disk.

It's the same issue. If you arbitrarily write back through the buffer cache while a swap write IO is in progress, you can wipe out that swap data and corrupt the swap file. If you arbitrarily write back journaled buffers before journaling asks you to, you destroy recovery. The swap case is, if anything, even worse: it kills you even if you don't take a reboot, because you have just overwritten the swapped-out data with the previous contents of the buffer cache, so you've lost a write to disk.

Journaling does the same thing by using temporary buffer heads to write metadata to the log without copying the buffer contents. Again it is IO which is not in the buffer cache.

There are thus two problems: (a) the raid code is writing back data from the buffer cache oblivious to the fact that other users of the device may be writing back data which is not in the buffer cache at all, and (b) it is writing back data when it was not asked to do so, destroying write ordering. Both of these violate the definition of a device driver.

> The RAID layer resync thread explicitly synchronizes on locked buffers.
> (it doesnt have to but it does)

And that is illegal, because it assumes that everybody else is using the buffer cache. That is not the case, and it is even less the case in 2.3.

> You suggested a new mechanism to mark buffers as 'pinned',

That is only to synchronise with bdflush: I'd like to be able to distinguish between buffers which contain dirty data but which are not yet ready for disk IO, and buffers which I want to send to the disk. The device drivers themselves should never ever have to worry about those buffers: ll_rw_block() is the defined interface for device drivers, NOT the buffer cache.

> In 2.3 the situation is much worse, as _all_ ext2 file writes bypass the
> buffer cache. [...] the RAID code has major problems with 2.3's
> pagecache changes.

It will have major problems with ext3 too, then, but I really do think that is raid's fault, because: 2.3 removes physical indexing of cached blocks, 2.2 never guaranteed that IO was from cached blocks in the first place. Swap and paging both bypass the buffer cache entirely. To assume that you can synchronise IO by doing a getblk() and syncing on the buffer_head is wrong, even if it used to work most of the time.

> and this destroys a fair amount of physical-level optimizations that were
> possible. (eg. RAID5 has to detect cached data within the same row, to
> speed up things and avoid double-buffering. If data is in the page cache
> and not hashed then there is no way RAID5 could detect such data.)

But you cannot rely on the buffer cache. If I "dd" to a swapfile and do a swapon, then the swapper will start to write to that swapfile using temporary buffer_heads. If you do IO or checksum optimisation based on the buffer cache you'll risk plastering obsolete data over the disks.

> i'll probably try to put pagecache blocks on the physical index again
> (the buffer-cache), which solution i expect will face some resistance :)

Yes. Device drivers should stay below ll_rw_block() and not make any assumptions about the buffer cache. Linus is _really_ determined not to let any new assumptions about the buffer cache into the kernel (I'm having to deal with this in the journaling filesystem too).

> in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.

The buffer-cache represents all cached (dirty and clean) blocks within the system. It does not, however, represent any non-cached IO. If there are other block caches in the system (the page-cache in 2.2 was readonly, thus not an issue),
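[Editorial sketch of the kind of I/O Stephen describes: a private, temporary buffer_head that is never hashed into the buffer cache, handed straight to ll_rw_block(), in rough 2.2-era style. A driver that only looks at the buffer-cache hash never sees this write in flight. The function name is made up and field/completion setup is abbreviated; exact details vary by kernel version.]

    #include <linux/fs.h>
    #include <linux/locks.h>
    #include <linux/string.h>

    /* Write one block to dev, using the caller's data buffer in place;
     * nothing is copied into, or hashed into, the buffer cache. */
    static void write_block_bypassing_cache(kdev_t dev, int block,
                                            char *data, int size)
    {
        struct buffer_head bh;              /* temporary, not from getblk() */
        struct buffer_head *p = &bh;

        memset(&bh, 0, sizeof(bh));
        bh.b_dev = dev;
        bh.b_blocknr = block;
        bh.b_size = size;
        bh.b_data = data;
        set_bit(BH_Uptodate, &bh.b_state);
        set_bit(BH_Dirty, &bh.b_state);
        /* NOTE: a real caller must also set up IO-completion handling
         * (b_end_io and friends); that detail is omitted here. */

        ll_rw_block(WRITE, 1, &p);          /* the driver only ever sees this bh */
        wait_on_buffer(&bh);                /* synchronise on the bh lock */
    }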
Re: weirdisms w/ ext3.
Hi,

On Mon, 1 Nov 1999 15:03:54 -0600, Timothy Ball [EMAIL PROTECTED] said:

> I did my best to try to follow what the README for ext3 said. I made a
> journal file in /var/local/journal/journal.dat. It has an inode # of
> 183669. Then I did /sbin/lilo -R linux rw rootflags=journal=183669.

Silly question, but is /var/local/journal on the same filesystem as the root? Those rootflags look fine otherwise.

--Stephen
Re: weirdisms w/ ext3.
Hi,

On Tue, 2 Nov 1999 03:10:10 -0600, Timothy Ball [EMAIL PROTECTED] said:

> Here's the info from /var/log/dmesg. Could it be that my journal file has
> a large inode number? And if you have more than one ext3 partition can
> you have more than one journal file? How would you specify it... must
> read code...

You need one per filesystem, and you register it when you mount the filesystem. For non-root filesystems, just umount and remount with the "-o journal=xxx" flag. For the root filesystem you need the rootflags= trick.

--Stephen
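[Editorial illustration, not from the thread: the mount(2) call below is roughly what "mount -o journal=183669 ..." does for a non-root filesystem. The device and mount point are made up; the journal inode number is the one from Tim's earlier message.]

    #include <sys/mount.h>
    #include <stdio.h>

    int main(void)
    {
        /* Mount an ext3 filesystem and register its journal inode via the
         * "journal=" mount option, as described above. */
        if (mount("/dev/hdb2", "/mnt/data", "ext3", MS_MGC_VAL,
                  "journal=183669") < 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }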
Re: Raid resync changes buffer cache semantics --- not good for journaling!
On Tue, 2 Nov 1999, Stephen C. Tweedie wrote:

>> i dont think dump should block. dump(8) is using the raw block device to
>> read fs data, which in turn uses the buffer-cache to get to the cached
>> state of device blocks. Nothing blocks there, i've just re-checked
>> fs/block_dev.c, it's using getblk(), and getblk() is not blocking on
>> anything.
> fs/block_dev.c:block_read() naturally does a ll_rw_block(READ) followed
> by a wait_on_buffer(). It blocks.

yes but this means that the block was not cached. Remember the original point, my suggestion was to 'keep in-transaction buffers locked'. You said this doesnt work because it blocks dump(). But dump() CANNOT block because those buffers are cached => dump does not block but just uses getblk() and skips over those buffers. dump() _of course_ blocks if the buffer is not cached. Or have i misunderstood you and are we talking about different issues?

>> You suggested a new mechanism to mark buffers as 'pinned',
> That is only to synchronise with bdflush: I'd like to be able to
> distinguish between buffers which contain dirty data but which are not
> yet ready for disk IO, and buffers which I want to send to the disk. The
> device drivers themselves should never ever have to worry about those
> buffers: ll_rw_block() is the defined interface for device drivers, NOT
> the buffer cache.

(see later)

> 2.3 removes physical indexing of cached blocks, 2.2 never guaranteed that
> IO was from cached blocks in the first place. Swap and paging both bypass
> the buffer cache entirely. [..]

no, paging (named mappings) writes do not bypass the buffer-cache, and thats the issue. RAID would pretty quickly corrupt filesystems if this was the case. In 2.2 all filesystem (data and metadata) writes go through the buffer-cache. I agree that swapping is a problem (bug) even in 2.2, thanks for pointing it out. (It's not really hard to fix because the swap cache is more or less physically indexed.)

>> and this destroys a fair amount of physical-level optimizations that
>> were possible. (eg. RAID5 has to detect cached data within the same row,
>> to speed up things and avoid double-buffering. If data is in the page
>> cache and not hashed then there is no way RAID5 could detect such data.)
> But you cannot rely on the buffer cache. If I "dd" to a swapfile and do a
> swapon, then the swapper will start to write to that swapfile using
> temporary buffer_heads. If you do IO or checksum optimisation based on
> the buffer cache you'll risk plastering obsolete data over the disks.

i dont really mind how it's called. It's a physical index of all dirty cached physical device contents which might get written out directly to the device at any time. In 2.2 this is the buffer-cache. Think about it, it's not a hack, it's a solid concept. The RAID code cannot even create its own physical index if the cache is completely private. Should the RAID code re-read blocks from disk when it calculates parity, just because it cannot access already cached data in the pagecache? The RAID code is not just a device driver, it's also a cache manager. Why do you think it's inferior to access cached data along a physical index?

>> i'll probably try to put pagecache blocks on the physical index again
>> (the buffer-cache), which solution i expect will face some resistance :)
> Yes. Device drivers should stay below ll_rw_block() and not make any
> assumptions about the buffer cache. Linus is _really_ determined not to
> let any new assumptions about the buffer cache into the kernel (I'm
> having to deal with this in the journaling filesystem too).

well, as a matter of fact, for a couple of pre-kernels we had all pagecache pages aliased into the buffer-cache as well, so it's not a technical problem at all. At that time it clearly appeared to be beneficial (simpler) to unhash pagecache pages from the buffer-cache, so they got unhashed (as those two entities are orthogonal), but we might want to rethink that issue.

>> in 2.2 RAID is a user of the buffer-cache, uses it and obeys its rules.
> The buffer-cache represents all cached (dirty and clean) blocks within
> the system. It does not, however, represent any non-cached IO.

well, we are not talking about non-cached IO here. We are talking about a new kind of (improved) page cache that is not physically indexed. _This_ is the problem. If the page-cache was physically indexed then i could look it up from the RAID code just fine. If the page-cache was physically indexed (or more accurately, the part of the pagecache that is already mapped to a device in one way or another, which is 90+% of it) then the RAID code could obey all the locking (and additional delaying) rules present there. This is not just about resync! If it was only for resync, then we could surely hack in some sort of device-level lock to protect the reconstruction window. i think your problem is that you do not accept the fact that the RAID code is a cache manager/cache user. There are RL
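[Editorial sketch of the kind of physical-index lookup Ingo is defending: before reading a block off the disk to compute RAID5 parity, check whether an up-to-date copy is already cached under (device, block number). get_hash_table() is the 2.2 buffer-cache lookup; the wrapper function name and the fallback are illustrative.]

    #include <linux/fs.h>

    /* Find the data for a parity calculation: prefer the cached copy,
     * fall back to a real disk read only if nothing usable is cached. */
    static struct buffer_head *parity_source(kdev_t dev, int block, int size)
    {
        struct buffer_head *bh;

        bh = get_hash_table(dev, block, size);   /* physical-index lookup */
        if (bh && buffer_uptodate(bh))
            return bh;                           /* reuse cached data, no extra IO */
        if (bh)
            brelse(bh);

        /* Not cached (or stale): read it from disk for the parity calc. */
        return bread(dev, block, size);
    }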
Re: Linux Buffer Cache Does Not Support Mirroring
Hi,

On Mon, 01 Nov 1999 15:53:29 -0500, Jeff Garzik [EMAIL PROTECTED] said:

> XFS delays allocation of user data blocks when possible to make blocks
> more contiguous, holding them in the buffer cache. This allows XFS to
> make extents large without requiring the user to specify extent size,
> and without requiring a filesystem reorganizer to fix the extent sizes
> after the fact. This also reduces the number of writes to disk and
> extents used for a file. Is this sort of manipulation possible with the
> existing buffer cache?

Absolutely not, but it is not hard with the page cache. The main thing missing is a VM callback to allow memory pressure to force unallocated, pinned pages to disk.

--Stephen
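[Editorial sketch of the "reserve early, allocate at writeout" idea in page-cache terms: the page was dirtied earlier with only a space reservation, and real blocks are picked here, when the VM or a sync finally pushes it out. Both helper names are made up; only the concept comes from the discussion above.]

    #include <linux/fs.h>
    #include <linux/mm.h>

    /* Assumed filesystem-provided helpers. */
    extern int example_allocate_blocks_for_page(struct inode *inode,
                                                struct page *page);
    extern int example_submit_page_io(struct inode *inode, struct page *page);

    /* Called from the filesystem's writepage() path for a delayed-allocation
     * page that is finally being forced to disk. */
    static int flush_delalloc_page(struct inode *inode, struct page *page)
    {
        int err;

        /* Turn the earlier space reservation into real, preferably
         * contiguous block numbers, now that all the neighbouring delayed
         * pages for this inode are visible. */
        err = example_allocate_blocks_for_page(inode, page);
        if (err)
            return err;   /* should not happen: space was reserved up front */

        /* Queue the now-mapped page for the actual disk write. */
        return example_submit_page_io(inode, page);
    }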
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Hi,

Stephen wrote:

> Fixing this in raid seems far, far preferable to fixing it in the
> filesystems. The filesystem should be allowed to use the buffer cache
> for metadata and should be able to assume that there is a way to prevent
> those buffers from being written to disk until it is ready.

What about doing it in the page cache, i.e. reserve pages for journaling and let them hit the buffer cache only when the transaction allows it? This may be a naive suggestion, but it looks logical.

- Peter -
Re: Raid resync changes buffer cache semantics --- not good for journaling!
Is just software RAID affected, or hardware RAID as well?

--Matt
Bug in FAT in 2.3.24 and DMSDOS
Hello all,

I have two things on my desk now.

1) There is a bug in the FAT FS which is triggered by an lseek past the end of file followed by a write. The old code allocated and zeroed all the clusters necessary to write at the wanted position. The new code cannot do that. I use this for allocating a new STACKER CVF file with the program MKSTACFS. To trigger the bug, call something like "mkstacfs stacvol.000 1 4" on any mounted DOS partition. It should create a 5MB long file "stacvol.000", but instead it leads to:

Nov 1 22:51:11 thor kernel: kernel BUG at file.c:94!
Nov 1 22:51:11 thor kernel: invalid operand:
Nov 1 22:51:11 thor kernel: CPU: 0
Nov 1 22:51:11 thor kernel: EIP: 0010:[c480af7d]
Nov 1 22:51:11 thor kernel: EFLAGS: 00010286
Nov 1 22:51:11 thor kernel: eax: 0019  ebx:  ecx:  edx: 003b
Nov 1 22:51:11 thor kernel: esi: 2717  edi: c1d522a0  ebp: c1801380  esp: c1e63e5c
Nov 1 22:51:11 thor kernel: ds: 0018  es: 0018  ss: 0018
Nov 1 22:51:11 thor kernel: Process mkstacfs (pid: 1456, stackpage=c1e63000)
Nov 1 22:51:11 thor kernel: Stack: 005e 2717 1000 0008 c0128e0b c1d522a0 2717
Nov 1 22:51:11 thor kernel:        c1801380 0001 c105a000 004e2000 0fff c1800fff
Nov 1 22:51:11 thor kernel:        c1800fff c1801620 0007 0001 01ff 0007 0007
Call trace: [c0128e0b] block_write_cont_page+653, [c480b12e] fat_write_partial_page+302, [c011f527] generic_file_write+577, [c480b19f], [c480b000], [c480b172], [c01261ea] sys_write+184, [c0108c84]

The BUG() is hit in fat_get_block():

int fat_get_block(struct inode *inode, long iblock,
                  struct buffer_head *bh_result, int create)
{
        ...
        if (iblock << 9 != MSDOS_I(inode)->i_realsize) {
                BUG();
                return -EIO;
        }
        ...
}

I did not try to fix this. I can write simple code to allocate the whole needed cluster chain, but I do not know the reasons of the original author. There is no comment for "i_realsize". It seems that it is the file size rounded up to a multiple of SECTOR_SIZE. I think that a multiple of the cluster size would make more sense, but I really do not know the reasons of the original author.

2) I have spent a little time on updating DMSDOS to the 2.3.x kernels. I have patched the FAT FS and a version of DMSDOS which can read, write, and map read-only by use of readpage. It does not use the new page cache for reads and writes. These problems I want to solve after the kernel stabilizes; I need a stable VFS and FAT which will not change for some time. But there are some real bugs in FAT which prevent use of the CVF layer for anything other than big blocks. I have put these changes and a more commented cvf.c into my first patch. It contains what DMSDOS really needs in the kernel, and the patch should not break anything. I have more changes on my hard drive, but they are only experimental.

Best wishes,
Pavel Pisa

PS: please CC directly to me

FAT BUG() trigger
My updates for future DMSDOS versions
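[Editorial reproducer, not from the report: a tiny userspace program along the lines of what mkstacfs does. The path and sizes are made up; only the lseek-past-EOF-then-write pattern comes from the description above.]

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[512];
        int fd;

        memset(buf, 0, sizeof(buf));

        fd = open("/mnt/dos/testfile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Seek roughly 5 MB past the start of the empty file... */
        if (lseek(fd, 5 * 1024 * 1024 - sizeof(buf), SEEK_SET) < 0) {
            perror("lseek");
            return 1;
        }

        /* ...and write one sector there.  FAT has no sparse files, so the
         * filesystem must allocate and zero every cluster up to this point;
         * per the report, 2.3.24 hits the BUG() in fat_get_block() instead. */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            perror("write");

        close(fd);
        return 0;
    }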
Re: Raid resync changes buffer cache semantics --- not good for journaling!
From: "Stephen C. Tweedie" [EMAIL PROTECTED] Date: Tue, 2 Nov 1999 17:44:55 + (GMT) Ask Linus, he's pushing this point much more strongly than I am! The buffer cache will become less and less of a cache as time goes on in his grand plan: it is to become little more than an IO buffer layer. Ultimately, I think may be better off if we remove any hint of caching from the I/O buffer layer. The cache coherency issues between the page and buffer cache make me nervous, and I'm not completely 100% convinced we got it all right. (I'm wondering if some of the ext2 corruption reports in the 2.2 kernels are coming from a buffer cache/page cache corruption.) This means putting filesystem meta-data into the page cache. Yes, I know Stephen has some concerns about doing this because the big memory patches mean pages in the page cache might not be directly accessible by the kernel. I see two solutions to this, both with drawbacks. One is to use a VM remap facility to map directories, superblocks, inode tables etc. into the kernel address space. The other is to have flags which ask the kernel to map filesystem metadtata into part of the page cache that's addressable by the kernel. The first adds a VM delay to accessing the filesystem metadata, and the other means we need to manage the part of the page cache that's below 2GB differently from the page cache in high memory at least as far as freeing pages in response to memory pressure is concerned. Basically, for the raid code to poke around in higher layers is a huge layering violation. We are heading towards doing things like adding kiobuf interfaces to ll_rw_block (in which the IO descriptor that the driver receives will have no reference to the buffer cache), and and raw, unbuffered access to the drivers for raw devices and O_DIRECT. Raw IO is already there and bypasses the buffer cache. So does swap. So does journaling. So does page-in (in 2.2) and page-out (in 2.3). It'll be interesting to see how this affects using dump(8) on a mounted filesystem. This was never particularly guaranteed to give a coherent filesystem image, but what with increasing bypass of the buffer cache, it may make the results of using dump(8) on a live filesystem even worse. One way of solving this is to add some kernel support for dump(8); for example, the infamous iopen() call which Linus hates so much. (Yes, it violates the Unix permission model, which is why it needs to be restricted to root, and yes, it won't work on all filesystems; just those that have inodes.) The other is to simply tell people to give up on dump completely, and just use a file-level tool such as tar or bru. - Ted
Re: Buffer and page cache
Hi,

On Tue, 02 Nov 1999 08:15:36 -0700, [EMAIL PROTECTED] said:

> I'd like these pages to age a little before handing them over to the
> "inode disk", because the "write_one_page" function called by
> generic_file_write would incur significant latency if the inode disk is
> "real", ie. not simulated in the same system.

The write-page method is only required to queue the data for writing to the media. It is not required to complete the physical IO, so the filesystem can use any mechanism it likes to keep those pages queued for eventual physical IO (just as 2.3 uses the buffer lists to queue that data for eventual writeback via bdflush).

> So we have a page cache for the inodes in the file system where the pages
> become dirty - but no buffers are attached. It reminds of a shared
> mapping, but there is no vma for the pages.

Fine.

> What appears to be needed is the following - probably it's mostly lacking
> in my understanding, but I'd appreciate to be advised how to attack the
> following points:
> - a bit to keep shrink_mmap away from the page.

Yes, bumping the page count is the perfect way to do this.

> - a bit for a struct page that indicates the page needs to be written.
>   From block_write_full_page one could think that the PageUptoDate bit is
>   maybe the one to use. But does that really describe that this page is
>   "dirty" - as it is done for buffers.

PageUpToDate can't be used: it is needed to flag whether the contents of the page are valid for a read. A written page must always be uptodate: !uptodate implies that we have created the page but are still reading it in from disk (or that the readin failed for some reason).

> - some indication of aging: we would like a pgflush daemon to walk the
>   dirty pages of the file system and write them back _after_ a little
>   while

The fs should be able to manage that on its own. If you queue all of the pages which have been sent to the writepage() method, then you can flush to the physical disk whenever you want. A trivial bdflush lookalike in the fs itself can deal with that. You might well want a filesystem-private pointer in the page struct off which to hook any fs-specific data (such as your dirty page linked list pointers and the dirty flag). You will also need a way for the VM to exert memory pressure on those pages if it needs to reclaim memory.

These are both things which ext3 will want anyway, so we should make sure that any infrastructure that gets put in place for this gets reviewed by all the different fs groups first.

--Stephen
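[Editorial sketch of the bookkeeping Stephen suggests: pin a dirty page by bumping its reference count, link it onto a filesystem-private dirty list, and let a small "bdflush lookalike" push aged pages to the object disk. Written with later-kernel list/page helpers for brevity; on 2.3 the same thing would be done with atomic_inc(&page->count) and open-coded list walking. All names here are illustrative, and locking is omitted.]

    #include <linux/list.h>
    #include <linux/mm.h>
    #include <linux/slab.h>

    /* Assumed hook that hands a page to the "inode disk". */
    extern void fs_write_page_to_object_disk(struct page *page);

    struct dirty_page_entry {
        struct list_head list;
        struct page *page;
        unsigned long dirtied;          /* jiffies when the page was dirtied */
    };

    static LIST_HEAD(fs_dirty_pages);   /* locking omitted for brevity */

    /* Pin a freshly dirtied page and remember it for later writeback. */
    static void fs_note_dirty_page(struct page *page)
    {
        struct dirty_page_entry *e = kmalloc(sizeof(*e), GFP_KERNEL);

        if (!e)
            return;                     /* error handling elided */
        get_page(page);                 /* extra reference keeps shrink_mmap away */
        e->page = page;
        e->dirtied = jiffies;
        list_add_tail(&e->list, &fs_dirty_pages);
    }

    /* The "trivial bdflush lookalike": push anything old enough to the disk. */
    static void fs_flush_old_pages(void)
    {
        struct list_head *pos, *n;

        list_for_each_safe(pos, n, &fs_dirty_pages) {
            struct dirty_page_entry *e =
                list_entry(pos, struct dirty_page_entry, list);

            if (time_before(jiffies, e->dirtied + HZ))
                continue;               /* let the page age a little longer */

            fs_write_page_to_object_disk(e->page);
            list_del(&e->list);
            put_page(e->page);          /* drop the pin once it is on its way */
            kfree(e);
        }
    }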
Buffer and page cache
Hi,

I'm working on a file system which talks to an "inode disk"; the storage industry calls these object based disks. A simulated object based disk can be constructed from the lower half of ext2 (or any other file system for that matter). The file system has no knowledge of disk blocks, and solely uses the page cache.

I'd like these pages to age a little before handing them over to the "inode disk", because the "write_one_page" function called by generic_file_write would incur significant latency if the inode disk is "real", ie. not simulated in the same system.

So we have a page cache for the inodes in the file system where the pages become dirty - but no buffers are attached. It reminds one of a shared mapping, but there is no vma for the pages.

What appears to be needed is the following - probably it's mostly lacking in my understanding, but I'd appreciate advice on how to attack the following points:

- a bit to keep shrink_mmap away from the page. When the file system writes in this page, we need to change its state so that it doesn't get thrown out afterwards. We could "get" the page for this purpose. Locking is not good, since we may need to write to the page again.

- a bit for a struct page that indicates the page needs to be written. From block_write_full_page one could think that the PageUptoDate bit is maybe the one to use. But does that really describe that this page is "dirty", as it is done for buffers?

- some indication of aging: we would like a pgflush daemon to walk the dirty pages of the file system and write them back _after_ a little while.

The construction should hopefully be capable of supporting Stephen's journaling extensions too, but I can't oversee everything in one blow (he probably can). Any advice would be appreciated!

Now why are we doing this? Effectively we have split Ext2 into an upper half (the file system) and a lower half (the object based device driver). For cluster file systems it does seem an attractive division of labor to let the drive do the allocation and have the clustered file system only share inode metadata and data blocks. So the block and inode allocation metadata is not spread around the cluster. This saves locks and traffic and, perhaps most importantly, complexity.

You can find some preliminary code at ftp://carissimi.coda.cs.cmu.edu/pub/obd, but currently it writes through to the disk and doesn't cluster yet. Hence this message.

- Peter -