Re: [RFC] Per-inode metadata cache.
Hi,

On 19 Oct 1999 00:44:38 -0500, [EMAIL PROTECTED] (Eric W. Biederman) said:

> Meanwhile having the metadata in the page cache (where they would
> have predictable offsets by file size)

Doesn't help --- you still need to look up the physical block numbers in
order to clear the allocation bitmaps for indirect blocks, so we're going
to need those lookups anyway.  Once you have that, looking for a given
fixed offset in the buffer cache is no harder than doing so in the page
cache.

> would also speed up handling of partial truncates . . . [ Either case
> with a little gentle handling will speed up fsync ] (At least for
> ext2 type filesystems).

No argument there, but a per-inode dirty buffer list would do the same.
Neither of these is an overwhelming argument for a move.

--Stephen
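To make the per-inode dirty buffer list concrete, here is a minimal
user-space sketch, not kernel code; struct toy_inode, toy_mark_dirty()
and toy_fsync() are invented for the illustration.  The point is simply
that fsync() then walks only the one inode's dirty metadata rather than
every dirty buffer in the system.

/*
 * Toy model of a per-inode dirty metadata list (invented names).
 */
#include <stdio.h>
#include <stdlib.h>

struct toy_buffer {
	long               block;    /* physical block number */
	struct toy_buffer *next;     /* link in the owning inode's dirty list */
};

struct toy_inode {
	long               ino;
	struct toy_buffer *dirty;    /* per-inode dirty metadata list */
};

static void toy_mark_dirty(struct toy_inode *inode, long block)
{
	struct toy_buffer *bh = malloc(sizeof(*bh));

	bh->block = block;
	bh->next = inode->dirty;     /* hang the buffer off its inode */
	inode->dirty = bh;
}

static void toy_fsync(struct toy_inode *inode)
{
	/* Only this inode's metadata needs to be written and unlinked. */
	while (inode->dirty) {
		struct toy_buffer *bh = inode->dirty;

		printf("inode %ld: write block %ld\n", inode->ino, bh->block);
		inode->dirty = bh->next;
		free(bh);
	}
}

int main(void)
{
	struct toy_inode a = { .ino = 12 }, b = { .ino = 13 };

	toy_mark_dirty(&a, 2001);    /* e.g. an indirect block of inode 12 */
	toy_mark_dirty(&b, 9999);    /* dirty metadata of another inode    */
	toy_fsync(&a);               /* touches inode 12's buffers only    */
	return 0;
}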
Re: [RFC] Per-inode metadata cache.
On Mon, 18 Oct 1999, Stephen C. Tweedie wrote:

> Data and metadata are completely different.  On, say, a large and busy
> web or ftp server, you really don't care about a 1G metadata limit, but
> a 1G page cache limit is much more painful.

Sure, I completely agree.  It looked like you were talking about something
magic related to bigmem that was not possible to do.  The reasonable
thought about bigmem (which I mentioned in my previous email in this
thread) is that it isn't worth bloating the filesystem with kmap, since
the VM pressure generated by metadata is reasonably small.  I never said
the opposite.  Just leave the metadata living in regular pages, as now.
I can't see any problem (you were the one talking about metadata-related
bigmem troubles, and I can't see them).

Andrea
Re: [RFC] Per-inode metadata cache.
Hi,

On 18 Oct 1999 08:20:51 -0500, [EMAIL PROTECTED] (Eric W. Biederman) said:

>> And I still can't see how you can find the stale buffer in a
>> per-object queue as the object can be destroyed as well after the
>> lowlevel truncate.

> Yes but you can prevent the buffer from becoming a stale buffer with the
> per-object queue.

That still doesn't let you get rid of the bforget(): remember that a
partial truncate() still needs to be dealt with, and you can't work out
which buffers to discard in that case without doing the full metadata
walk.  Just having a per-inode buffer queue doesn't help there (although
it _would_ help solve the case which we currently get wrong, which is
directory data blocks, since we never partially truncate a directory).

--Stephen
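As a minimal user-space model of that point, not ext2 code and with all
names (toy_inode, forget_block, partial_truncate) made up: only by
walking the indirect block can you learn which physical blocks lie past
the new size, and therefore which cached buffers need the bforget()
treatment after a partial truncate.

/*
 * Why a partial truncate needs the metadata walk (toy model).
 */
#include <stdio.h>

#define BLOCK_SIZE   1024
#define NR_DIRECT    12
#define PTRS_PER_BLK (BLOCK_SIZE / 4)

struct toy_inode {
	unsigned long size;
	unsigned long direct[NR_DIRECT];              /* physical blocks, 0 = hole */
	unsigned long indirect_blocks[PTRS_PER_BLK];  /* contents of the indirect block */
};

/* Pretend to drop any cached buffer for this physical block. */
static void forget_block(unsigned long phys)
{
	printf("discard cached buffer for physical block %lu\n", phys);
}

static void partial_truncate(struct toy_inode *inode, unsigned long new_size)
{
	unsigned long first_freed = (new_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
	unsigned long i;

	for (i = first_freed; i < NR_DIRECT; i++)
		if (inode->direct[i])
			forget_block(inode->direct[i]);

	/* The logical->physical mapping for these lives in the indirect
	 * block, so it has to be walked before anything can be forgotten. */
	for (i = 0; i < PTRS_PER_BLK; i++) {
		unsigned long logical = NR_DIRECT + i;
		if (logical >= first_freed && inode->indirect_blocks[i])
			forget_block(inode->indirect_blocks[i]);
	}
	inode->size = new_size;
}

int main(void)
{
	struct toy_inode ino = { .size = 20 * BLOCK_SIZE,
				 .direct = { 100, 101, 102, 103, 104, 105,
					     106, 107, 108, 109, 110, 111 },
				 .indirect_blocks = { 200, 201, 202, 203,
						      204, 205, 206, 207 } };
	partial_truncate(&ino, 5 * BLOCK_SIZE);
	return 0;
}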
Re: [RFC] Per-inode metadata cache.
Hi,

On Mon, 18 Oct 1999 13:26:45 -0400 (EDT), Alexander Viro
<[EMAIL PROTECTED]> said:

>> You can't even know which is the inode Y that is using a block X without
>> reading all the inode metadata while the block X still belongs to the
>> inode Y (before the truncate).

> WTF would we _need_ to know?  Think about it as about memory caching.
> You can cache by virtual address and you can cache by physical address.
> And since we have no aliasing here...

We still have the underlying problem.  The hash lists used for buffer
cache lookup are only part of the buffer cache data structures.  The
other part --- the buffer lists --- are independent of the namespace you
are using.  If you have a dirty buffer in an inode's metadata cache, then
yes, you still need to deal with the aliasing left when that data is
deallocated unless you explicitly revoke (bforget) that buffer as soon as
it is deallocated.

>> Right now you know only that the stale buffer can't come from inode
>> metadata as we have the race-prone slowww trick to loop in polling mode
>> inside ext2_truncate until we'll be sure bforget will run in hard mode.

> And?  Don't use bread() on metadata.  It should never enter the buffer
> hash, just as buffer_heads of _data_ never enter it.  Leave bread() for
> statically allocated stuff.

It's write, not read, which is the problem: the danger is that you have a
dirty metadata buffer which is deleted and reallocated as data, but the
buffer is still dirty in the buffer cache and can stomp on your freshly
allocated data buffer later on.  The cache coherency isn't the problem as
much as _disk_ coherency.  Cache coherency in this case doesn't hurt as
long as the wrong version never hits disk (because the first thing that
will happen if a stale metadata buffer gets reallocated as new metadata
will be that the buffer in memory gets cleared anyway).

--Stephen
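Here is a toy user-space model of that disk-coherency hazard.  All names
are invented, the direct write to the "disk" array stands in for I/O done
through another path, and revoke_buffer() merely plays the role that
bforget() plays in the real code: drop the stale dirty buffer before
writeback can push it to disk.

/*
 * Stale dirty metadata stomping freshly reallocated data (toy model).
 */
#include <stdio.h>
#include <string.h>

#define NR_BLOCKS 8

static char disk[NR_BLOCKS][16];            /* pretend block device */

struct toy_buffer {
	int  block;                         /* physical block number */
	int  dirty;
	char data[16];
};

static struct toy_buffer cache[NR_BLOCKS];  /* "buffer cache", indexed by block */

static void mark_dirty(int block, const char *contents)
{
	cache[block].block = block;
	cache[block].dirty = 1;
	strncpy(cache[block].data, contents, sizeof(cache[block].data) - 1);
}

static void revoke_buffer(int block)        /* bforget()-like: drop without writing */
{
	cache[block].dirty = 0;
	memset(cache[block].data, 0, sizeof(cache[block].data));
}

static void writeback(void)                 /* bdflush-like pass */
{
	for (int i = 0; i < NR_BLOCKS; i++)
		if (cache[i].dirty) {
			memcpy(disk[i], cache[i].data, sizeof(disk[i]));
			cache[i].dirty = 0;
		}
}

int main(void)
{
	mark_dirty(5, "old indirect");      /* dirty metadata buffer for block 5   */
	/* block 5 is freed by truncate and reallocated as file data ...          */
	strcpy(disk[5], "new file data");   /* data reaches disk through another path */

	/* Uncomment the revoke to avoid the stomp: */
	/* revoke_buffer(5); */
	(void)revoke_buffer;                /* silence unused-function warnings */
	writeback();
	printf("block 5 on disk: \"%s\"\n", disk[5]);   /* prints "old indirect" */
	return 0;
}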
Re: [RFC] Per-inode metadata cache.
Hi,

On Mon, 18 Oct 1999 14:30:10 +0200 (CEST), Andrea Arcangeli
<[EMAIL PROTECTED]> said:

> I can't see these bigmem issues.  The buffer and page-cache memory is
> not in bigmem anyway.  And you can use bigmem _wherever_ you want as
> long as you remember to fix all the involved code to kmap before
> read/write to potential bigmem memory.  The bigmem issue looks like a
> red herring to me.

Data and metadata are completely different.  On, say, a large and busy
web or ftp server, you really don't care about a 1G metadata limit, but a
1G page cache limit is much more painful.

Secondly, data is accessed by user code: mmap of highmem page cache pages
can work automatically via normal ptes, and read()/write() of such pages
requires kmap (or rather, something a little more complex which can
accept a page fault mid-copy) in a very few, well-defined places in
filemap.c.  Metadata, on the other hand, is extremely hot data, being
accessed randomly at high frequency _everywhere_ in the filesystems.
It's a much, much messier job to teach the filesystems about high-memory
buffer cache than to teach the filemap about high-memory page cache, and
the page cache is the one under the most memory pressure in the first
place.

--Stephen
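An illustrative user-space model of that asymmetry, not filemap.c and
with toy_page, toy_kmap() and toy_kunmap() invented for the sketch: a
highmem data page needs a kernel mapping only for the short, well-defined
window in which read()/write() copies it, whereas metadata is
dereferenced directly all over the filesystem code, so every such
dereference would need the same treatment.

/*
 * One well-defined kmap point around the data copy (toy model).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct toy_page {
	int   highmem;      /* no permanent kernel mapping if set */
	void *backing;      /* stands in for the physical page */
};

static void *toy_kmap(struct toy_page *page)
{
	/* In the model, mapping is just handing out the pointer; the real
	 * kernel has to set up a temporary mapping for a highmem page. */
	return page->backing;
}

static void toy_kunmap(struct toy_page *page)
{
	(void)page;         /* tear the temporary mapping down */
}

/* The single place where file data is copied out of the page cache. */
static void file_read(struct toy_page *page, char *user_buf, size_t len)
{
	void *kaddr = toy_kmap(page);
	memcpy(user_buf, kaddr, len);
	toy_kunmap(page);
}

int main(void)
{
	struct toy_page page = { .highmem = 1, .backing = calloc(1, PAGE_SIZE) };
	char buf[32];

	strcpy(page.backing, "page cache contents");
	file_read(&page, buf, sizeof(buf));
	printf("%s\n", buf);
	free(page.backing);
	return 0;
}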
Re: [RFC] Per-inode metadata cache.
Hi,

On Sat, 16 Oct 1999 01:59:38 -0400 (EDT), Alexander Viro
<[EMAIL PROTECTED]> said:

a) to d), fine.

> e) We might get away with just dirty block lists, but I think that we
> can do better than that: keep a per-inode cache for metadata.  It is
> going to be separate from the data pagecache.  It is never exported to
> user space and lives completely in kernel context.  Ability to search
> there doesn't mean much for normal filesystems, but weirdies like AFFS
> will _really_ benefit from it - I've just realized that I was crufting
> up an equivalent of such a cache there anyway.

Why?  Whenever we are doing a lookup on such data, we are _always_
indexing by physical block number: what's wrong with the normal buffer
cache in such a case?

--Stephen
Re: [RFC] Per-inode metadata cache.
On Mon, 18 Oct 1999, Alexander Viro wrote:

> ? The same way you are doing it with pagecache.

I think if I had understood that, I wouldn't be asking here ;).

> WTF would we _need_ to know?  Think about it as about memory caching.
> You can cache by virtual address and you can cache by physical address.
> And since we have no aliasing here...  Andrea, consider the thing as VM.
> You have virtual addresses (offsets in file for data, offsets in
> directory for directory contents, beginning of covered range for
> indirect blocks, etc.) and you have physical address (offset on a disk).
> You have a mapping of VA to PA (->bmap(), ->get_block()) (context being
> the file).  The old scheme was equivalent to caching by PA and
> recalculating the mapping on each use.  The new scheme gives caching by
> context+VA and provides a working TLB.  We already have it for data.
> For metadata we are still caching by PA.  And getting the nasty pile of
> it with cache coherency.  Would you like to work with an MMU of such
> architecture?

So far so good; I know all of the above.  What I can't see is how you
want to efficiently enforce coherency by moving the dirty metadata
elsewhere.

>> Right now you know only that the stale buffer can't come from inode
>> metadata as we have the race-prone slowww trick to loop in polling mode
>> inside ext2_truncate until we'll be sure bforget will run in hard mode.
>
> And?  Don't use bread() on metadata.  It should never enter the buffer
> hash, just as buffer_heads of _data_ never enter it.  Leave bread() for
> statically allocated stuff.

Hmm, I think the rewrite is so intensive that I can't see it following
the current code.  I have to think about a partial rewrite of ext2 to
make it work in my mind without bread(), using the buffer cache only to
allow bdflush/kupdate/sync to do their work.

>> And even if you know such information you should do a linear search in
>> the list, which is slower than a hash lookup anyway (or you can start
>> slowly unmapping all the other stale buffers even if you are not
>> interested in them, so you may block more than necessary).  Doing a
>> per-object hash is not an option IMHO.
>
> You are kinda late with it - it's already done for data.  The question

The page cache is _not_ a per-object hash.  You don't have a separate
hash for each inode.  You have a _global_ hash for all inodes.  At the
same time you _just_ (in 2.3.22) have a global hash for all and only the
metadata, and it's the so-called "buffer cache".  So if you can reach
your object in O(1) with an additional structure, that may be reasonable,
but if you'll have to search again (a linear search or a hash search)
then you had better use the buffer hash to get rid of the stale buffer.

I think you are missing the reason you are suggesting to move the
metadata writes elsewhere.  The reason for using the pagecache is that we
reach buffer->b_data without having to enter the filesystem code (so we
can write and read without doing the virtual->physical translation; the
pagecache is our filesystem-TLB, as you said).  We are instead _not_
using the page cache to enforce coherency (in fact it _breaks_ coherency,
and we have to query the buffer hash to know if we are going to collide
with the buffer cache).  So I can't see how an additional metadata-
virtual layer can help to enforce coherency between the page cache, the
newly implemented metadata cache, and the old buffer cache (now used only
for taking care of writes).

Note also that the pagecache is just allowing us to avoid entering the
bmap path (so we remain at the virtual layer whenever we can), and we
don't need to optimize the metadata.  We only avoid querying the metadata
by using the pagecache, so I can't see a performance downside of keeping
the metadata at the physical layer.  And as just said, I can't see how
moving the metadata to the virtual layer can help to enforce coherency
(we'd have to handle coherency by hand with yet another layer instead).

> Right.  And you may want different policies for data and metadata - the
> latter should always be in kvm.  AFAICS that's what SCT referred to.

That sounds reasonable of course, otherwise you'd have to impact the
filesystem with lots of kmap.  Anyway, 64g is a red herring too, as we
could _just_ use the bigmem memory (between 1/2g and 4g) as pagecache by
using bounce buffers to do the I/O (slowly, and in a deadlock-prone way:
remember ll_rw_block can't fail right now and it's supposed not to
allocate memory as it's called from swapout).  Solving the deadlock issue
has nothing to do with 64g (and you'll always go dog-slow then during
I/O).  For raw-io everything has been easy with the bounce buffers, as
raw-io can fail instead.

Andrea
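A small user-space model of the two namespaces in this exchange, with
invented names and linear arrays standing in for the global hashes: the
page cache is keyed by (inode, logical index) -- the "filesystem TLB"
that lets reads and writes skip the virtual-to-physical translation --
while the buffer cache is keyed by (device, physical block), and keeping
the two coherent still means querying the physically-keyed side when a
block changes role.

/*
 * Virtual-keyed page cache vs physical-keyed buffer cache (toy model).
 */
#include <stdio.h>

struct cached_page   { long ino;  long index; char data[8]; };
struct cached_buffer { int  dev;  long block; int dirty;    };

static struct cached_page   page_cache[4]   = { { 42, 3, "dataA" }, { 7, 0, "dataB" } };
static struct cached_buffer buffer_cache[4] = { { 1, 2001, 1 } };

/* Lookup by virtual key: no ->bmap() involved. */
static struct cached_page *find_page(long ino, long index)
{
	for (int i = 0; i < 4; i++)
		if (page_cache[i].ino == ino && page_cache[i].index == index)
			return &page_cache[i];
	return NULL;
}

/* Lookup by physical key: the buffer-hash query used to spot a stale
 * alias when a block is freed and handed out again. */
static struct cached_buffer *find_buffer(int dev, long block)
{
	for (int i = 0; i < 4; i++)
		if (buffer_cache[i].dev == dev && buffer_cache[i].block == block)
			return &buffer_cache[i];
	return NULL;
}

int main(void)
{
	struct cached_page   *p  = find_page(42, 3);
	struct cached_buffer *bh = find_buffer(1, 2001);

	printf("page (ino 42, index 3): %s\n", p ? p->data : "miss");
	printf("buffer (dev 1, block 2001): %s\n",
	       bh && bh->dirty ? "stale dirty alias!" : "clean");
	return 0;
}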
Re: [RFC] Per-inode metadata cache.
On Mon, 18 Oct 1999, Andrea Arcangeli wrote:

> On Sat, 16 Oct 1999, Alexander Viro wrote:
>
> > c) Currently we keep the stuff for the first class around the page
> > cache and the rest in buffer cache.  A large part of our problems
> > comes from the fact that we need to detect migration of a block from
> > one class to another and scanning the whole buffer cache is way too
> > slow.
>
> We don't scan the whole buffer cache.  We only do a fast query on the
> buffer hash.
>
> And I still can't see how you can find the stale buffer in a per-object
> queue as the object can be destroyed as well after the lowlevel
> truncate.

? The same way you are doing it with pagecache.

> You can't even know which is the inode Y that is using a block X without
> reading all the inode metadata while the block X still belongs to the
> inode Y (before the truncate).

WTF would we _need_ to know?  Think about it as about memory caching.
You can cache by virtual address and you can cache by physical address.
And since we have no aliasing here...  Andrea, consider the thing as VM.
You have virtual addresses (offsets in file for data, offsets in
directory for directory contents, beginning of covered range for indirect
blocks, etc.) and you have physical address (offset on a disk).  You have
a mapping of VA to PA (->bmap(), ->get_block()) (context being the file).
The old scheme was equivalent to caching by PA and recalculating the
mapping on each use.  The new scheme gives caching by context+VA and
provides a working TLB.  We already have it for data.  For metadata we
are still caching by PA.  And getting the nasty pile of it with cache
coherency.  Would you like to work with an MMU of such architecture?

> Right now you know only that the stale buffer can't come from inode
> metadata as we have the race-prone slowww trick to loop in polling mode
> inside ext2_truncate until we'll be sure bforget will run in hard mode.

And?  Don't use bread() on metadata.  It should never enter the buffer
hash, just as buffer_heads of _data_ never enter it.  Leave bread() for
statically allocated stuff.

> And even if you know such information you should do a linear search in
> the list, which is slower than a hash lookup anyway (or you can start
> slowly unmapping all the other stale buffers even if you are not
> interested in them, so you may block more than necessary).  Doing a
> per-object hash is not an option IMHO.

You are kinda late with it - it's already done for data.  The question
being: do we ever need lookups by physical address (disk location)?
AFAICS it is _not_ needed.

> > d) Moving the second class into the page cache will cause problems
> > with the bigmem stuff.  Besides, I have reasons of my own for keeping
> > those
>
> I can't see these bigmem issues.  The buffer and page-cache memory is
> not in bigmem anyway.  And you can use bigmem _wherever_ you want as
> long as you remember to fix all the involved code to kmap before
> read/write to potential bigmem memory.  The bigmem issue looks like a
> red herring to me.

Right.  And you may want different policies for data and metadata - the
latter should always be in kvm.  AFAICS that's what SCT referred to.
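A toy sketch of the "working TLB" analogy, with made-up names and
toy_get_block() standing in for the real ->get_block() mapping walk:
cache an indirect block per inode under a virtual key -- here, the first
logical block of the range it covers -- and fall back to the mapping walk
only on a miss.

/*
 * Per-inode metadata cached by context+VA, falling back to the mapping
 * walk on a miss (toy model, invented names).
 */
#include <stdio.h>

#define META_SLOTS 4

struct meta_entry {
	long vkey;        /* virtual key: start of the covered logical range */
	long phys;        /* where the indirect block actually lives on disk */
	int  valid;
};

struct toy_inode {
	long              ino;
	struct meta_entry meta[META_SLOTS];   /* per-inode metadata "TLB" */
};

/* Stand-in for the slow path that walks on-disk structures. */
static long toy_get_block(struct toy_inode *inode, long vkey)
{
	printf("inode %ld: slow mapping walk for range starting at %ld\n",
	       inode->ino, vkey);
	return 5000 + vkey;                   /* pretend physical location */
}

static long lookup_indirect(struct toy_inode *inode, long vkey)
{
	struct meta_entry *e = &inode->meta[(vkey / 12) % META_SLOTS];

	if (e->valid && e->vkey == vkey)      /* TLB hit: no walk needed */
		return e->phys;

	e->vkey  = vkey;                      /* TLB miss: do the walk once */
	e->phys  = toy_get_block(inode, vkey);
	e->valid = 1;
	return e->phys;
}

int main(void)
{
	struct toy_inode ino = { .ino = 42 };

	lookup_indirect(&ino, 12);            /* miss: walks the mapping    */
	lookup_indirect(&ino, 12);            /* hit: served from the cache */
	return 0;
}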
Re: [RFC] Per-inode metadata cache.
On Sat, 16 Oct 1999, Alexander Viro wrote:

> c) Currently we keep the stuff for the first class around the page
> cache and the rest in buffer cache.  A large part of our problems comes
> from the fact that we need to detect migration of a block from one
> class to another and scanning the whole buffer cache is way too slow.

We don't scan the whole buffer cache.  We only do a fast query on the
buffer hash.

And I still can't see how you can find the stale buffer in a per-object
queue, as the object can be destroyed as well after the lowlevel
truncate.  You can't even know which is the inode Y that is using a block
X without reading all the inode metadata while the block X still belongs
to the inode Y (before the truncate).

Right now you know only that the stale buffer can't come from inode
metadata, as we have the race-prone slowww trick of looping in polling
mode inside ext2_truncate until we're sure bforget will run in hard mode.

And even if you know such information, you would have to do a linear
search in the list, which is slower than a hash lookup anyway (or you can
start slowly unmapping all the other stale buffers even if you are not
interested in them, so you may block more than necessary).  Doing a
per-object hash is not an option IMHO.

> d) Moving the second class into the page cache will cause problems
> with the bigmem stuff.  Besides, I have reasons of my own for keeping
> those

I can't see these bigmem issues.  The buffer and page-cache memory is not
in bigmem anyway.  And you can use bigmem _wherever_ you want as long as
you remember to fix all the involved code to kmap before read/write to
potential bigmem memory.  The bigmem issue looks like a red herring to
me.

Andrea
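A tiny illustration of the complexity argument, with invented names and a
degenerate one-entry-per-slot hash: finding a possibly stale buffer for
physical block X is a single probe in the global buffer hash, while a
per-object list can only answer by a linear walk -- assuming you even
know which object owned block X before the truncate.

/*
 * O(1) hash probe vs O(n) per-object list walk (toy model).
 */
#include <stdio.h>

#define HASH_SIZE 16
#define LIST_LEN  8

static long buffer_hash[HASH_SIZE] = { [2001 % HASH_SIZE] = 2001 };  /* block -> slot */
static long per_inode_list[LIST_LEN] = { 900, 901, 2001, 903, 0 };   /* unordered */

static int find_by_hash(long block)
{
	return buffer_hash[block % HASH_SIZE] == block;    /* O(1) probe */
}

static int find_by_list(long block)
{
	for (int i = 0; i < LIST_LEN; i++)                 /* O(n) walk  */
		if (per_inode_list[i] == block)
			return 1;
	return 0;
}

int main(void)
{
	printf("hash probe for block 2001: %s\n", find_by_hash(2001) ? "found" : "miss");
	printf("list walk for block 2001:  %s\n", find_by_list(2001) ? "found" : "miss");
	return 0;
}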
[RFC] Per-inode metadata cache.
Stephen,

I've looked through the current stuff with truncate() (BTW, minixfs is
broken too - rmdir() hangs solid) and I think that I have a more-or-less
tolerable solution.  You definitely know the VFS/VM interaction better
(I'm mostly dealing with the namespace side of things), so I'd really
like to hear your comments on the stuff below.

	a) On the normal filesystems we have 4 kinds of blocks - data,
per-inode metadata, per-fs metadata and free ones.

	b) We can't allow several instances of the buffer_head for the
same block, at least not the stale dirty ones.

	c) Currently we keep the stuff for the first class around the
page cache and the rest in buffer cache.  A large part of our problems
comes from the fact that we need to detect migration of a block from one
class to another, and scanning the whole buffer cache is way too slow.

	d) Moving the second class into the page cache will cause
problems with the bigmem stuff.  Besides, I have reasons of my own for
keeping those beasts separate - the softupdates code will get simpler
that way.  I suspect that you have similar reasons wrt journalling.  The
bottom line: we don't want it there.

	e) We might get away with just dirty block lists, but I think
that we can do better than that: keep a per-inode cache for metadata.  It
is going to be separate from the data pagecache.  It is never exported to
user space and lives completely in kernel context.  Ability to search
there doesn't mean much for normal filesystems, but weirdies like AFFS
will _really_ benefit from it - I've just realized that I was crufting up
an equivalent of such a cache there anyway.

	f) Essentially we have several address spaces _and_ mappings
between them - some of them are done via the hardware MMU (process ones),
some are done by hand (per-inode pagecache).  I think that we might take
it further and add per-inode metadata address spaces.  Moreover, a large
part of the code will be shared anyway, so we might want the ability to
make address spaces full-fledged objects - e.g. the Steve Dodd^W^WNTFS
driver may want to keep separate caches for forks, etc.[1]

Sorry for the less than coherent text - I'm wading through the current
code trying to figure out what it is supposed to do, and I'd feel much
better if I had a better understanding of the ideology behind the thing.
Could you comment on/ACK/NAK the stuff above?  I'm Cc'ing it to fsdevel -
this stuff affects all filesystems and IMO everybody will benefit from
clearly formulated rules regarding work with the buffer cache.

Cheers,
	Al

[1] MULTICS segments with a human face, anyone? ;-)
--
Two of my imaginary friends reproduced once ... with negative results.
Ben <[EMAIL PROTECTED]> in ASR
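A rough user-space sketch of the shape points (e) and (f) suggest.  All
names are invented and this is not a proposed kernel API: each inode
carries its own metadata "address space", keyed by a kernel-internal
index rather than by disk block, never exported to user space, with its
own dirty walk so truncate/fsync can find every piece of that inode's
metadata without touching a global hash.

/*
 * Per-inode metadata address space with lookup and dirty sync (toy model).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct meta_block {
	long               index;     /* "virtual" key inside this inode    */
	int                dirty;
	char               data[16];
	struct meta_block *next;      /* simple list standing in for a hash */
};

struct meta_space {                   /* per-inode metadata address space   */
	struct meta_block *blocks;
};

static struct meta_block *meta_lookup(struct meta_space *m, long index)
{
	for (struct meta_block *b = m->blocks; b; b = b->next)
		if (b->index == index)
			return b;
	return NULL;
}

static struct meta_block *meta_get(struct meta_space *m, long index)
{
	struct meta_block *b = meta_lookup(m, index);

	if (!b) {                     /* not cached yet: add an entry */
		b = calloc(1, sizeof(*b));
		b->index = index;
		b->next = m->blocks;
		m->blocks = b;
	}
	return b;
}

static void meta_sync(struct meta_space *m)
{
	for (struct meta_block *b = m->blocks; b; b = b->next)
		if (b->dirty) {
			printf("write metadata index %ld: %s\n", b->index, b->data);
			b->dirty = 0;
		}
}

int main(void)
{
	struct meta_space inode_meta = { 0 };
	struct meta_block *b = meta_get(&inode_meta, 1);   /* e.g. first indirect block */

	strcpy(b->data, "indirect");
	b->dirty = 1;
	meta_sync(&inode_meta);       /* only this inode's metadata is visited */
	return 0;
}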