Re: [RFC] Per-inode metadata cache.

1999-10-19 Thread Stephen C. Tweedie

Hi,

On 19 Oct 1999 00:44:38 -0500, [EMAIL PROTECTED] (Eric
W. Biederman) said:

 Meanwhile having the metadata in the page cache (where they would
 have predictable offsets by file size) 

Doesn't help --- you still need to look up the physical block numbers
in order to clear the allocation bitmaps for indirect blocks, so we're
going to need those lookups anyway.  Once you have that, looking for a
given fixed offset in the buffer cache is no harder than doing so in
the page cache.
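
A minimal sketch of why the physical numbers are needed, with every name
invented for illustration: even if an indirect block were cached by its
logical position in the file, freeing it still requires its physical block
number to clear the right bit in the allocation bitmap.

struct indirect_block {
    unsigned long phys_blocknr;     /* where it lives on disk            */
    unsigned long entries[256];     /* physical numbers of data blocks   */
};

void free_indirect(struct indirect_block *ind,
                   void (*clear_bitmap_bit)(unsigned long phys))
{
    unsigned int i;

    /* free the data blocks the indirect block points to */
    for (i = 0; i < 256; i++)
        if (ind->entries[i])
            clear_bitmap_bit(ind->entries[i]);

    /* and the indirect block itself: this needs its physical number */
    clear_bitmap_bit(ind->phys_blocknr);
}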

 would also speed up handling of partial truncates . . .  [Either case,
 with a little gentle handling, will speed up fsync] (at least for
 ext2-type filesystems).


No argument there, but a per-inode dirty buffer list would do the
same.  Neither of these is an overwhelming argument for a move.
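
For illustration, a rough sketch of what a per-inode dirty buffer list
would buy fsync(): it only has to walk the buffers hanging off the inode
instead of a global list.  The structures and helper names below are made
up, not real kernel code.

struct meta_buffer {
    struct meta_buffer *next_dirty;
    unsigned long       phys_blocknr;
    int                 dirty;
    char               *data;
};

struct inode_meta {
    struct meta_buffer *dirty_list;     /* this inode's dirty metadata */
};

void sync_inode_metadata(struct inode_meta *im,
                         void (*write_block)(unsigned long phys, char *data))
{
    struct meta_buffer *mb;

    for (mb = im->dirty_list; mb; mb = mb->next_dirty) {
        if (mb->dirty) {
            write_block(mb->phys_blocknr, mb->data);
            mb->dirty = 0;
        }
    }
    im->dirty_list = 0;     /* now clean; real code would relink the buffers */
}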

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Stephen C. Tweedie

Hi,

On Sat, 16 Oct 1999 01:59:38 -0400 (EDT), Alexander Viro
[EMAIL PROTECTED] said:

a) to d), fine.

   e) we might get away with just dirty block lists, but I think
 that we can do better than that: keep a per-inode cache for metadata. It
 is going to be separate from the data pagecache. It is never exported to
 user space and lives completely in kernel context. The ability to search
 there doesn't mean much for normal filesystems, but weirdies like AFFS
 will _really_ benefit from it - I've just realized that I was crufting up
 an equivalent of such a cache there anyway.

Why?  Whenever we are doing a lookup on such data, we are _always_
indexing by physical block number: what's wrong with the normal buffer
cache in such a case?
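
To make the comparison concrete, an illustrative sketch of the two lookup
keys (the hash functions are invented): the classical buffer cache hashes
on (device, physical block number), while a per-inode metadata cache would
hash on (inode, logical index).  Once the physical number is already in
hand, one lookup is no cheaper than the other.

struct buffer_key   { int dev;            unsigned long phys_block; };
struct metadata_key { unsigned long inum; unsigned long logical_index; };

static unsigned int hash_buffer_key(struct buffer_key k, unsigned int size)
{
    return (k.dev ^ (k.phys_block * 2654435761UL)) % size;
}

static unsigned int hash_metadata_key(struct metadata_key k, unsigned int size)
{
    return (k.inum ^ (k.logical_index * 2654435761UL)) % size;
}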

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Alexander Viro



On Mon, 18 Oct 1999, Andrea Arcangeli wrote:

 On Sat, 16 Oct 1999, Alexander Viro wrote:
 
  c) Currently we keep the stuff for the first class around the page
 cache and the rest in the buffer cache. A large part of our problems comes
 from the fact that we need to detect the migration of a block from one
 class to another, and scanning the whole buffer cache is way too slow.
 
 We don't scan the whole buffer cache. We only do a fast query on the
 buffer hash.
 
 And I still can't see how you can find the stale buffer in a per-object
 queue as the object can be destroyed as well after the lowlevel truncate.

? The same way you are doing it with pagecache.

 You can't even know which inode Y is using a block X without
 reading all the inode metadata while block X still belongs to
 inode Y (i.e. before the truncate).

WTF would we _need_ to know? Think of it as memory caching. You
can cache by virtual address and you can cache by physical address. And
since we have no aliasing here... Andrea, consider the thing as VM. You
have virtual addresses (offsets in the file for data, offsets in the
directory for directory contents, the beginning of the covered range for
indirect blocks, etc.) and you have physical addresses (offsets on disk).
You have a mapping of VA to PA (->bmap(), ->get_block()), the context
being the file. The old scheme was equivalent to caching by PA and
recalculating the mapping on each use. The new scheme gives caching by
context+VA and provides a working TLB. We already have it for data. For
metadata we are still caching by PA, and getting a nasty pile of
cache-coherency problems from it. Would you like to work with the MMU of
such an architecture?
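
A toy sketch of the analogy, purely illustrative and not kernel code: the
->get_block() computation plays the role of the page-table walk, and a
cache keyed by (file, logical block) plays the role of the TLB.

struct tlb_entry {
    unsigned long inum;         /* "context": which file        */
    unsigned long logical;      /* "virtual address"            */
    unsigned long physical;     /* "physical address" on disk   */
    int           valid;
};

#define TLB_SIZE 64
static struct tlb_entry fs_tlb[TLB_SIZE];

unsigned long map_block(unsigned long inum, unsigned long logical,
                        unsigned long (*get_block)(unsigned long inum,
                                                   unsigned long logical))
{
    struct tlb_entry *e = &fs_tlb[(inum ^ logical) % TLB_SIZE];

    if (e->valid && e->inum == inum && e->logical == logical)
        return e->physical;                 /* TLB hit                   */

    e->physical = get_block(inum, logical); /* the slow "page-table walk" */
    e->inum     = inum;
    e->logical  = logical;
    e->valid    = 1;
    return e->physical;
}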

 Right now you only know that the stale buffer can't come from inode
 metadata because we have the race-prone, slowww trick of looping in polling
 mode inside ext2_truncate until we're sure bforget will run in hard mode.

And? Don't use bread() on metadata. It should never enter the buffer hash,
just as buffer_heads of _data_ never enter it. Leave bread() for
statically allocated stuff.
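
An illustrative sketch of that rule, with hypothetical helpers: the
metadata block is read into a buffer that lives only on the owning inode's
private list and is never inserted into the global buffer hash, so no
stale alias of it can later be found (or written back) through the hash.

#include <stdlib.h>

struct private_buffer {
    struct private_buffer *next;        /* inode-private list, not hashed */
    unsigned long          phys_blocknr;
    char                   data[1024];
};

struct private_buffer *read_meta_block(struct private_buffer **inode_list,
                                       unsigned long phys_blocknr,
                                       void (*read_block)(unsigned long phys,
                                                          char *data))
{
    struct private_buffer *pb = malloc(sizeof(*pb));

    if (!pb)
        return NULL;

    pb->phys_blocknr = phys_blocknr;
    read_block(phys_blocknr, pb->data);

    pb->next    = *inode_list;          /* visible to this inode only */
    *inode_list = pb;
    return pb;
}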

 And even if you had that information, you would have to do a linear search
 in the list, which is slower than a hash lookup anyway (or you could start
 slowly unmapping all the other stale buffers, even the ones you are not
 interested in, so you may block more than necessary). Doing a per-object
 hash is not an option IMHO.

You are kinda late with it - it's already done for data. The question
being: do we ever need lookups by physical address (disk location)?
AFAICS it is _not_ needed.

  d) Moving the second class into the page cache will cause problems
 with bigmem stuff. Besides, I have the reasons of my own for keeping those
 
 I can't see these bigmem issues. The buffer and page-cache memory is not
 in bigmem anyway. And you can use bigmem _wherever_ you want, as long as you
 remember to fix all the involved code to kmap before reading/writing
 potentially bigmem memory. The bigmem issue looks like a red herring to me.

Right. And you may want different policies for data and metadata - the
latter should always be in kvm. AFAICS that's what SCT referred to.



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Andrea Arcangeli

On Mon, 18 Oct 1999, Alexander Viro wrote:

? The same way you are doing it with pagecache.

If I had already understood, I wouldn't be asking here ;).

WTF would we _need_ to know? Think of it as memory caching. You
can cache by virtual address and you can cache by physical address. And
since we have no aliasing here... Andrea, consider the thing as VM. You
have virtual addresses (offsets in the file for data, offsets in the
directory for directory contents, the beginning of the covered range for
indirect blocks, etc.) and you have physical addresses (offsets on disk).
You have a mapping of VA to PA (->bmap(), ->get_block()), the context
being the file. The old scheme was equivalent to caching by PA and
recalculating the mapping on each use. The new scheme gives caching by
context+VA and provides a working TLB. We already have it for data. For
metadata we are still caching by PA, and getting a nasty pile of
cache-coherency problems from it. Would you like to work with the MMU of
such an architecture?

So far so good; I know that much. What I can't see is how you intend
to efficiently enforce coherency by moving the dirty metadata elsewhere.

 Right now you only know that the stale buffer can't come from inode
 metadata because we have the race-prone, slowww trick of looping in polling
 mode inside ext2_truncate until we're sure bforget will run in hard mode.

And? Don't use bread() on metadata. It should never enter the buffer hash,
just as buffer_heads of _data_ never enter it. Leave bread() for
statically allocated stuff.

Hmm, I think the rewrite is so extensive that I can't see how it follows
from the current code. I have to think in terms of a partial rewrite of
ext2 to make it work in my mind without bread(), using the buffer cache
only to let bdflush/kupdate/sync do their work.

 And even if you had that information, you would have to do a linear search
 in the list, which is slower than a hash lookup anyway (or you could start
 slowly unmapping all the other stale buffers, even the ones you are not
 interested in, so you may block more than necessary). Doing a per-object
 hash is not an option IMHO.

You are kinda late with it - it's already done for data. The question

The page cache is _not_ a per-object hash. You don't have a separate hash
for each inode; you have a _global_ hash for all inodes.

At the same time you already (in 2.3.22) have a global hash for all and
only the metadata: the so-called "buffer cache". So if you can reach your
object in O(1) with an additional structure, that may be reasonable, but
if you have to search again (a linear search or a hash search), then you
are better off using the buffer hash to get rid of the stale buffer.
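
For illustration, the difference in lookup cost (structures and hash
function invented): a single global hash keyed by (object, offset) gives
an O(1) expected lookup, page-cache style, whereas a bare per-object list
forces a linear scan.

#include <stddef.h>

#define TABLE_SIZE 1024

struct cached_block {
    struct cached_block *next;
    unsigned long        inum;      /* owning object  */
    unsigned long        offset;    /* logical index  */
};

static struct cached_block *global_hash[TABLE_SIZE];

/* one global hash over (object, offset): O(1) expected */
struct cached_block *lookup_global(unsigned long inum, unsigned long offset)
{
    struct cached_block *c = global_hash[(inum ^ offset) % TABLE_SIZE];

    while (c && (c->inum != inum || c->offset != offset))
        c = c->next;
    return c;
}

/* a bare per-object list: O(n) in the number of the object's blocks */
struct cached_block *lookup_list(struct cached_block *obj_list,
                                 unsigned long offset)
{
    while (obj_list && obj_list->offset != offset)
        obj_list = obj_list->next;
    return obj_list;
}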

I think you are forgetting the reason why you are suggesting to move the
metadata writes elsewhere in the first place.

The reason for using the pagecache is that we can reach the
buffer->b_data without having to enter the filesystem code (so we can
read and write without doing the virtual-to-physical translation; the
pagecache is our filesystem TLB, as you said).

We are instead _not_ using the page cache to enforce coherency (it in fact
_breaks_ coherency, and we have to query the buffer hash to know whether
we are going to collide with the buffer cache).

So I can't see how an additional metadata virtual layer can help enforce
coherency between the page cache, the newly implemented metadata cache,
and the old buffer cache (now used only for taking care of writes).

Note also that the pagecache just allows us to avoid entering the
bmap path (so we stay at the virtual layer whenever we can); we don't
additionally need to optimize the metadata. We merely avoid querying the
metadata by using the pagecache, so I can't see a performance downside in
handling the metadata at the physical layer. And, as just said, I can't
see how moving the metadata to the virtual layer helps enforce coherency
(we'd have to handle coherency by hand with yet another layer instead).

Right. And you may want different policies for data and metadata - the
latter should always be in kvm. AFAICS that's what SCT referred to.

That sounds reasonable of course, otherwise you'd have to burden the
filesystems with lots of kmap calls. Anyway, 64g is a red herring too, as
we could just use the bigmem memory (between 1/2g and 4g) as pagecache by
using bounce buffers, doing the I/O slowly and in a deadlock-prone way
(remember that ll_rw_block can't fail right now and is not supposed to
allocate memory, since it's called from swapout). Solving the deadlock
issue has nothing to do with 64g (and you'd always be dog slow during I/O
anyway). For raw-io everything was easy with the bounce buffers, since
raw-io is allowed to fail.
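
A toy sketch of the bounce-buffer scheme being referred to, with invented
names: a page the device cannot reach directly is copied into a low-memory
buffer and the I/O is issued against the copy. The essential point is that
the bounce buffer comes from a preallocated pool, since the I/O path must
not allocate memory.

#include <string.h>

#define BLOCK_SIZE 1024

struct bounce_buffer {
    char low_copy[BLOCK_SIZE];          /* device-reachable memory */
};

void write_high_block(const char *high_data,
                      unsigned long phys_blocknr,
                      struct bounce_buffer *bb,      /* from a fixed pool */
                      void (*write_block)(unsigned long phys, char *data))
{
    /* copy out of "high" memory into the low bounce buffer ... */
    memcpy(bb->low_copy, high_data, BLOCK_SIZE);

    /* ... and do the actual I/O against the low copy */
    write_block(phys_blocknr, bb->low_copy);
}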

Andrea



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Stephen C. Tweedie

Hi,

On Mon, 18 Oct 1999 14:30:10 +0200 (CEST), Andrea Arcangeli
[EMAIL PROTECTED] said:

 I can't see these bigmem issues. The buffer and page-cache memory is not
 in bigmem anyway. And you can use bigmem _wherever_ you want, as long as you
 remember to fix all the involved code to kmap before reading/writing
 potentially bigmem memory. The bigmem issue looks like a red herring to me.

Data and metadata are completely different.  On, say, a large and busy
web or ftp server, you really don't care about a 1G metadata limit, but
a 1G page cache limit is much more painful.  

Secondly, data is accessed by user code: mmap of highmem page cache
pages can work automatically via normal ptes, and read()/write() of such
pages requires kmap (or rather, something a little more complex which
can accept a page fault mid-copy) in a very few, well-defined places in
filemap.c.  Metadata, on the other hand, is extremely hot data, being
accessed randomly at high frequency _everywhere_ in the filesystems.

It's a much, much messier job to teach the filesystems about high-memory
buffer cache than to teach the filemap about high-memory page cache, and
the page cache is the one under the most memory pressure in the first
place.
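
For illustration only (hypothetical stand-ins, not the real filemap.c
code), the kind of narrow wrapper that suffices for data: the possibly
high-memory page is mapped just for the copy and unmapped right after,
in one well-defined place.

#include <string.h>

struct page_ref { int id; };            /* opaque handle to a cache page */

void read_from_page(struct page_ref *page, unsigned long offset,
                    char *user_buf, unsigned long count,
                    void *(*map_page)(struct page_ref *),
                    void  (*unmap_page)(struct page_ref *))
{
    /* map the (possibly high-memory) page just for the copy ... */
    char *kaddr = map_page(page);

    memcpy(user_buf, kaddr + offset, count);

    /* ... and drop the temporary mapping immediately afterwards */
    unmap_page(page);
}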

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Stephen C. Tweedie

Hi,

On Mon, 18 Oct 1999 13:26:45 -0400 (EDT), Alexander Viro
[EMAIL PROTECTED] said:

 You can't even know which inode Y is using a block X without
 reading all the inode metadata while block X still belongs to
 inode Y (i.e. before the truncate).

 WTF would we _need_ to know? Think of it as memory caching. You
 can cache by virtual address and you can cache by physical address. And
 since we have no aliasing here...

We still have the underlying problem.  The hash lists used for buffer
cache lookup are only part of the buffer cache data structures.  The
other part --- the buffer lists --- is independent of the namespace you
are using.  If you have a dirty buffer in an inode's metadata cache,
then yes, you still need to deal with the aliasing left when that data
is deallocated unless you explicitly revoke (bforget) that buffer as
soon as it is deallocated.
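
A minimal sketch of that revoke-on-free step, with invented names; the
point is that discarding the cached copy has to happen in the same breath
as clearing the allocation bit, the role bforget() plays today.

struct cached_meta {
    unsigned long phys_blocknr;
    int           dirty;
    int           discarded;
};

void free_metadata_block(unsigned long phys_blocknr,
                         struct cached_meta *(*find_cached)(unsigned long phys),
                         void (*clear_bitmap_bit)(unsigned long phys))
{
    struct cached_meta *cm = find_cached(phys_blocknr);

    if (cm) {
        cm->dirty     = 0;      /* never write the stale contents back */
        cm->discarded = 1;      /* analogous to bforget() on the buffer */
    }
    clear_bitmap_bit(phys_blocknr);     /* now the block may be reused */
}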

 Right now you only know that the stale buffer can't come from inode
 metadata because we have the race-prone, slowww trick of looping in polling
 mode inside ext2_truncate until we're sure bforget will run in hard mode.

 And? Don't use bread() on metadata. It should never enter the buffer hash,
 just as buffer_heads of _data_ never enter it. Leave bread() for
 statically allocated stuff.

It's write, not read, which is the problem: the danger is that you have
a dirty metadata buffer which is deleted and reallocated as data, but
the buffer is still dirty in the buffer cache and can stomp on your
freshly allocated data buffer later on.

The cache coherency isn't the problem as much as _disk_ coherency.
Cache coherency in this case doesn't hurt as long as the wrong version
never hits disk (because the first thing that will happen if a stale
metadata buffer gets reallocated as new metadata will be that the buffer
in memory gets cleared anyway).

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Stephen C. Tweedie

Hi,

On 18 Oct 1999 08:20:51 -0500, [EMAIL PROTECTED] (Eric W. Biederman)
said:

 And I still can't see how you can find the stale buffer in a
 per-object queue as the object can be destroyed as well after the
 lowlevel truncate.

 Yes, but you can prevent the buffer from becoming a stale buffer with the
 per-object queue.

That still doesn't let you get rid of the bforget(): remember that a
partial truncate() still needs to be dealt with, and you can't work out
which buffers to discard in that case without doing the full metadata
walk.  Just having a per-inode buffer queue doesn't help there (although
it _would_ help solve the case which we currently get wrong, which is
directory data blocks, since we never partially truncate a directory).
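
An illustrative sketch (names invented) of the directory case: since a
directory is only ever truncated to zero, dropping its whole per-inode
queue is enough, which is exactly what a partial truncate of a file cannot
do without the indirect-tree walk to find which metadata buffers cover the
freed range.

struct queued_buffer {
    struct queued_buffer *next;
    int                   discarded;
};

/* full truncate / delete: everything on the inode's queue goes */
void discard_inode_queue(struct queued_buffer *q)
{
    for (; q; q = q->next)
        q->discarded = 1;       /* equivalent of bforget() on each one */
}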

--Stephen



Re: [RFC] Per-inode metadata cache.

1999-10-18 Thread Andrea Arcangeli

On Mon, 18 Oct 1999, Stephen C. Tweedie wrote:

Data and metadata are completely different.  On, say, a large and busy
web or ftp server, you really don't care about a 1G metadata limit, but
a 1G page cache limit is much more painful.  

Sure, I completely agree. It sounded as though you were talking about
something magic related to bigmem that was impossible to do.

The reasonable position on bigmem (which I mentioned in my previous email
in this thread) is that it isn't worth bloating the filesystems with kmap,
as the VM pressure generated by the metadata is reasonably small.

I never said the opposite: just leave the metadata living in regular pages
as it does now. I can't see any problem (you were the one talking about
metadata-related bigmem troubles, and I can't see them).

Andrea



[RFC] Per-inode metadata cache.

1999-10-15 Thread Alexander Viro

Stephen, I've looked through the current stuff with truncate()
(BTW, minixfs is broken too - rmdir() hangs solid) and I think that I have
a more-or-less tolerable solution. You definitely know the VFS/VM
interaction better (I'm mostly dealing with the namespace side of things),
so I'd really like to hear your comments on the stuff below.
a) On normal filesystems we have four kinds of blocks - data,
per-inode metadata, per-fs metadata and free ones.
b) We can't allow several instances of the buffer_head for the
same block, at least not stale dirty ones.
c) Currently we keep the stuff for the first class around the page
cache and the rest in the buffer cache. A large part of our problems comes
from the fact that we need to detect the migration of a block from one
class to another, and scanning the whole buffer cache is way too slow.
d) Moving the second class into the page cache will cause problems
with the bigmem stuff. Besides, I have reasons of my own for keeping those
beasts separate - the softupdates code will get simpler that way. I suspect
that you have similar reasons wrt journalling. The bottom line: we don't
want it there.
e) we might get away with just dirty block lists, but I think
that we can do better than that: keep a per-inode cache for metadata. It
is going to be separate from the data pagecache. It is never exported to
user space and lives completely in kernel context. The ability to search
there doesn't mean much for normal filesystems, but weirdies like AFFS
will _really_ benefit from it - I've just realized that I was crufting up
an equivalent of such a cache there anyway.
f) Essentially we have several address spaces _and_ mappings
between them - some of them are done via the hardware MMU (process ones),
some are done by hand (per-inode pagecache). I think that we might take
it further and add per-inode metadata address spaces. Moreover, a large
part of the code will be shared anyway, so we might want the ability to
make address spaces full-fledged objects - e.g. the Steve Dodd^W^WNTFS
driver may want to keep separate caches for forks, etc.[1] (A rough sketch
of such a per-inode metadata address space follows below.)
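
A rough, purely illustrative sketch of such a per-inode metadata address
space, with all types and names invented: each inode carries a small cache
indexed by a logical key (e.g. the start of the range an indirect block
covers), entirely separate from the data pagecache and never visible to
user space.

#include <stdlib.h>

#define META_HASH 32

struct meta_page {
    struct meta_page *next;
    unsigned long     key;          /* logical index within the inode   */
    unsigned long     phys_blocknr; /* where it currently lives on disk */
    int               dirty;
    char             *data;
};

struct meta_address_space {
    struct meta_page *hash[META_HASH];   /* one such object per inode */
};

struct meta_page *meta_lookup(struct meta_address_space *as, unsigned long key)
{
    struct meta_page *mp;

    for (mp = as->hash[key % META_HASH]; mp; mp = mp->next)
        if (mp->key == key)
            return mp;
    return NULL;
}

struct meta_page *meta_add(struct meta_address_space *as, unsigned long key,
                           unsigned long phys_blocknr, char *data)
{
    struct meta_page *mp = malloc(sizeof(*mp));

    if (!mp)
        return NULL;

    mp->key          = key;
    mp->phys_blocknr = phys_blocknr;
    mp->dirty        = 0;
    mp->data         = data;
    mp->next         = as->hash[key % META_HASH];
    as->hash[key % META_HASH] = mp;
    return mp;
}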

Sorry for the less than coherent text - I'm wading through the current
code trying to figure out what it is supposed to do, and I'd feel much
better if I had a better understanding of the ideology behind the thing.
Could you comment on/ACK/NAK the stuff above? I'm Cc'ing it to fsdevel -
this stuff affects all filesystems, and IMO everybody will benefit from
clearly formulated rules regarding working with the buffer cache.
Cheers,
Al

[1] MULTICS segments with human face, anyone? ;-)

-- 
Two of my imaginary friends reproduced once ... with negative results.
Ben [EMAIL PROTECTED] in ASR