Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Matthew Wilcox

On Thu, Jul 27, 2000 at 07:49:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions.  Just jumping in at a place I felt I
> knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> it's changed somewhat.  Finding and removing a block from the free list is now
> bracketed by a spinlock pair.  First question: why do we use atomic_set to set
> the initial buffer use count if this is already protected by a spinlock?

the usual answer to this type of question is that there is another
place which accesses the buffer use count without holding the spinlock.
I haven't audited this particular piece of code and I'm not familiar
with this area, but this may help you find the solution.

(the other answer is that this is a bug left over from a time when there
was no spinlock :-)

-- 
Revolutions do not require corporate support.



Bug in fs/super.c

2000-07-27 Thread Brian Poole

Hello everyone,

Basically multiple mounts don't check the requested options from the
system call against what the superblock already has stored in sb->s_flags
and since it just uses the existing superblock it is possible to mount
partitions with options different than what you requested, without 
userland ever being the wiser. I've tried this on several machines with
up-to-date mount + kernels (2.10m, 2.4.0test5-prex).

Here is a quick example, say sda1 is already mounted as /, rw. next step
is 'mount -t ext2 -o ro /dev/sda1 /mnt'. this goes fine, running 'mount'
or 'cat /etc/mtab' looks fine (mounted ro). however if you 'touch
/mnt/file' it lets you, so it is obviously not ro. looking a little
further, 'cat /proc/mounts' and you see the kernel has mounted it rw. this
inconsistency is no good. it needs to do some sort of check around line
778 maybe.. not really sure if there are masks involved or what not so
I'll leave the real coding to you guys.

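pseudocode:
if (sb->s_flags == flags) {
  /* same options as the existing mount: reuse the superblock */
  if (fs_type == sb->s_type) {
    path_release(&nd);
    return sb;
  }
} else {
  /* options differ from what is already mounted: refuse */
  error = -ESOMETHING;
  goto out;
}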


This bug is also present in the bind mounts, although I'm not positive on
how it works there because I haven't looked at the code (I am not much of
a kernel person anyways, this is all giving me large headaches ;). The
same example but with 'mount -t bind -o ro / /mnt' should do the trick to
show the bug.

This has already been sent to sct, tytso, Al Viro, and lkml, and I've
finally been advised I should send it here. hopefully someone can now
patch it so I can stop fwding this everywhere ;)

enjoy,

-b





[announce] forced umount patch (2.4.0-test5-pre5)

2000-07-27 Thread Tigran Aivazian

Hi,

Someone asked me to announce this on linux-fsdevel (I did on linux-kernel
only). So here goes:

http://www.moses.uklinux.net/patches/badfs-2.4.0-test5-pre5.patch

This patch adds ability to forcibly umount (as in umount -f) any
filesystem. This is work in progress and by no means ready for submission
to Linus. Namely:

a) it works, but:

b) it has serious (though unlikely to occur) race conditions wrt
filesystem's module unload and possibly with rename(2).

So, please try it, read it, fix it and send your comments to me :)

Regards,
Tigran

PS. We do need to have forced umount under Linux because:

a) Solaris has it.

b) it is useful




Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Andi Kleen

On Thu, Jul 27, 2000 at 08:02:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions.  Just jumping in at a place I felt I
> knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> it's changed somewhat.  Finding and removing a block from the free list is now
> bracketed by a spinlock pair.  First question: why do we use atomic_set to set
> the initial buffer use count if this is already protected by a spinlock?

The buffer use count needs to be atomic in other places because interrupts
may change it on UP. atomic_t can only be modified by atomic_* functions 
and atomic.h is lacking an "atomic_set_nonatomic". So even when you only 
need the atomic property once you have to change all uses of the field.


-Andi

-- 
This is like TV. I don't like TV.



Re: another ext3 question

2000-07-27 Thread Jeremy Hansen


I actually heard this quite some time ago and I always use it in any
example where people are already using ext3 in a production
environment.  Are there more cases of ext3 in production?

Someone asked this question:

ok ... to clarify ... ext3 _guarantees_ consistent file system metadata
or empirically, it tends to be robust about maintaining consistent
file system metadata across abrupt reboots?

reiserfs, jfs, etc. all _guarantee_ this condition.

So I'm really not sure what this means because I thought the *point*
of a journalling filesystem was the above question.

Thanks
-jeremy

> On Thu, Jul 27, 2000 at 01:41:54PM -0400, Jeremy Hansen wrote:
> > We're really itching to use ext3 in a production environment.  Can you
> > give any clues on how things are going?  I'm not asking for time frames by
> > any means but rather like, yes, things are coming along nicely or whatever
> > clues you can give.
> 
>   the rpmfind.net and fr.rpmfind.net servers have been running completely
> on top of ext3 for more than one year without troubles. I use it for a 
> couple of W3C servers too.
> 
> Daniel
> 
> 

-- 

http://www.xxedgexx.com | [EMAIL PROTECTED]
-




Re: another ext3 question

2000-07-27 Thread Daniel Veillard

On Thu, Jul 27, 2000 at 01:41:54PM -0400, Jeremy Hansen wrote:
> We're really itching to use ext3 in a production environment.  Can you
> give any clues on how things are going?  I'm not asking for time frames by
> any means but rather like, yes, things are coming along nicely or whatever
> clues you can give.

  the rpmfind.net and fr.rpmfind.net servers have been running completely
on top of ext3 for more than one year without troubles. I use it for a 
couple of W3C servers too.

Daniel

-- 
[EMAIL PROTECTED] | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe



Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Tigran Aivazian

On Thu, 27 Jul 2000, Tigran Aivazian wrote:

> On Thu, 27 Jul 2000, Daniel Phillips wrote:
> > So now it's time to start asking questions.  Just jumping in at a place I felt I
> > knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> > it's changed somewhat.  Finding and removing a block from the free list is now
> > bracketed by a spinlock pair.  First question: why do we use atomic_set to set
> > the initial buffer use count if this is already protected by a spinlock?
> 
> have a look at other users of bh->b_count. For example __brelse() does
> atomic_dec() and it is called directly from brelse() which can be called
> by filesystem without any other protection.

actually, that was a naive answer and it was wrong. The correct answer is -
in that piece of code you do _not_ actually care about atomicity of the
b_count operation as you are the only user of it. However, atomicity of
b_count is needed in general, so, to keep the same API to access it one
just sets it to 1 in the getblk() using atomic_set() rather than just
b_count.counter = 1.

Regards,
Tigran




Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Tigran Aivazian

On Thu, 27 Jul 2000, Daniel Phillips wrote:
> So now it's time to start asking questions.  Just jumping in at a place I felt I
> knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> it's changed somewhat.  Finding and removing a block from the free list is now
> bracketed by a spinlock pair.  First question: why do we use atomic_set to set
> the initial buffer use count if this is already protected by a spinlock?

have a look at other users of bh->b_count. For example __brelse() does
atomic_dec() and it is called directly from brelse() which can be called
by filesystem without any other protection.

regards,
Tigran




Bug in fs/super.c (cont.)

2000-07-27 Thread Brian Poole

hey all (again),

forgot to add a bit on to that last post..

i have been told that it should be possible to change the code so that it
does support changing the options (at least some of them) between
different mounts, however it is not implemented now. so the check that I 
present is only meant as a patch until the support is written. if you
would rather just write the code so that it can change options I'd be even
happier ;) 

i really need to have a partition RO in one place and RW in another which
is why i hit into this inconsistency. only other way i know of doing it is
via a loopback nfs or similarly silly. 

ah well,

-b







Re: another ext3 question

2000-07-27 Thread Stephen C. Tweedie

Hi,

On Thu, Jul 27, 2000 at 01:41:54PM -0400, Jeremy Hansen wrote:
> 
> We're really itching to use ext3 in a production environment.  Can you
> give any clues on how things are going?

The ext3-0.0.2f appears to be rock solid.  Andreas has got prototyped
code for e2fsck log replay, and I've got pending code for
out-of-memory and IO failures sitting here, along with most of the
code for metadata-only journaling.  I'm off on holiday in a few hours
for 2 weeks, but expect a release shortly after I get back with some
more goodies in it.

Cheers,
 Stephen



Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Daniel Phillips

After spending *far* too long getting lxr installed and working, I finally have
my favorite source browser working in the same place I email from :-)  Meaning
that the next time I get hit with a tough question I'll be able to respond
without having to go home and browse the source first.

BTW, lxr is a wonderful thing and everybody who has the patience should think
about installing and running it locally.  It can do far more for you than just
index the Linux kernel.  I see that by going to the effort of installing it I've
now jumped two development versions ahead of the official site
(http://lxr.linux.no/source/) so it was worth it.

So now it's time to start asking questions.  Just jumping in at a place I felt I
knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
it's changed somewhat.  Finding and removing a block from the free list is now
bracketed by a spinlock pair.  First question: why do we use atomic_set to set
the initial buffer use count if this is already protected by a spinlock?

I thought I'd start with an easy question just to do a reality check.  That's
all for today, my excuse being that lxr bit such a big chunk out of my day I
only had time for one question.  Tomorrow hopefully I'll start getting at the
things I need to know in order to do the right thing with tail blocks.

-- 
Daniel



Re: another ext3 question

2000-07-27 Thread Jeremy Hansen


Thanks!  I've been using ext3 on about 5 machines for the past week or so
and it's been great so far.  Haven't had any problems except for that
weird thing creating that journal on the root filesystem and it was
getting confused with the ramdisk or whatever, but making a boot disk and
mounting fixed that.

We're really itching to use ext3 in a production environment.  Can you
give any clues on how things are going?  I'm not asking for time frames by
any means but rather like, yes, things are coming along nicely or whatever
clues you can give.

Also, do you know of anyone creating an anaconda install that will take
advantage of ext3?  Assuming I would know anaconda which I don't, it would
seem easy enough to create journal files just based on the size of
partition.  That would be swell.

Thanks
-jeremy

> Hi,
> 
> On Fri, Jul 21, 2000 at 11:54:20PM -0600, Andreas Dilger wrote:
> 
> > Note that you should not make the journals so large that they are a
> > major fraction of your RAM, as you will not gain anything by this.
> > A few megabytes is fine, 1024 disk blocks is the minimum.
> 
> Yep.  The main drawbacks to a large journal are that (a) they can pin a
> lot of buffers in memory at once, and (b) they take longer to recover.
> The only advantage of a large journal is that it gives the filesystem
> more flexibility in writing things back to the main disk, but it's not
> a large effect unless you have a very heavy write load.
> 
> Cheers,
>  Stephen
> 

-- 

http://www.xxedgexx.com | [EMAIL PROTECTED]
-




Re: another ext3 question

2000-07-27 Thread Stephen C. Tweedie

Hi,

On Fri, Jul 21, 2000 at 11:54:20PM -0600, Andreas Dilger wrote:

> Note that you should not make the journals so large that they are a
> major fraction of your RAM, as you will not gain anything by this.
> A few megabytes is fine, 1024 disk blocks is the minimum.

Yep.  The main drawbacks to a large journal are that (a) they can pin a
lot of buffers in memory at once, and (b) they take longer to recover.
The only advantage of a large journal is that it gives the filesystem
more flexibility in writing things back to the main disk, but it's not
a large effect unless you have a very heavy write load.

Cheers,
 Stephen



Re: Tailmerging for Ext2

2000-07-27 Thread Daniel Phillips

Alexander Viro wrote:
> On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:
> > On Wed, Jul 26, 2000 at 03:19:46PM -0400, Alexander Viro wrote:
> >
> > > Erm? Consider that: huge lseek() + write past the end of file. Woops - got
> > > to unmerge the tail (it's an internal block now) and we've got no
> > > knowledge of IO going on the page. Again, IO may be asynchronous - no
> > > protection from i_sem for us. After that page becomes a regular one,
> > > right? Looks like a change of state to me...
> >
> > Naturally, and that change of state must be made atomically by the
> > filesystem.
> 
> Yep. Which is the point - there _are_ dragons. I believe that it's doable,
> but I really want to repeat: Daniel, watch out for races at the moments
> when page state changes, it needs more accurate approach than usual
> pagecache-using fs. It can be done, but it will take some reading (and
> yes, Stephen, I know that _you_ know it ;-)

That's apparent, and I feel that Stephen could probably implement the entire
tail merge as described so far in a few days.  But that wouldn't be as useful as
having me and perhaps some other interested observers go all the way through
the exercise of figuring out the so-far unwritten rules of the
buffercache/pagecache duo.

The exact same accurate work is required for Tux2, which makes massive use of
copy-on-write.  Right now, buffer issues are the main thing standing in the way
of making a development code release for Tux2.  So there is no question in my
mind about whether such issues have to be dealt with: they do.

I dove into the 2.4.0 cache code for the first time last night (using lxr - try
it, you'll like it) and I'm almost at the point where I have some relevant
questions to ask.  I notice that buffer.c has increased in size by almost 50%
and is far and away the largest module in the VFS.  Worse, buffer.c is massively
cross-coupled to the mm subsystem and the page cache, as we know too well. 
Buffer.c is right at the core of the issues we're talking about.

Bearing that in mind, instead of just jumping in and starting to code I'll try
the methodical approach :-)  My immediate objective is to try to clarify a few
things that aren't immediately obvious from the source, in the following areas:

  - States and transitions for the main objects:
      - Buffer heads
      - Buffer data
      - Page heads
      - Page data
      - Other?

  - Existing concurrency controls:
      - Semaphores/Spinlocks
      - Big kernel lock
      - Filesystem locks
      - Posix locks?
      - Other?

  - Planned additions/deletions of concurrency controls

I will also try to make a list of the main internal functions in the VFS (and
some related ones from the mm and drivers modules) and examine
function-by-function what the intended usage is, what the issues/caveats are,
and maybe even how we can expect them to evolve in the future.

I think we need even more than this in terms of documentation in order to work
effectively, but this at least will be a good start.  It will be more than what
we have now.  If it gets to the point where we can actually answer questions
about race conditions by consulting the docs then we really will have
accomplished something.  Yes, I know that the code is going to keep evolving and
sometimes will break the docs, but I also have confidence that the docs can keep
up with such evolution given some interested volunteer doc maintainers willing
to hang out on the devel list and keep asking questions.

Even in 2.2.x I felt that there is a lot of understated elegance in Linux's
buffer cache design.  In 2.4.0 it seems to be getting more elegant, although
it's hard to say exactly, because of the sparse (read: nonexistent)
documentation.  This is a problem that can be easily fixed.

To get through this I will have to ask a lot of naive-sounding questions. 
Hopefully I'll have the first batch ready this afternoon (morning, your time).

-- 
Daniel