Re: Questions about the buffer+page cache in 2.4.0
On Thu, Jul 27, 2000 at 07:49:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions. Just jumping in at a place I felt I
> knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> it's changed somewhat. Finding and removing a block from the free list is now
> bracketed by a spinlock pair. First question: why do we use atomic_set to set
> the initial buffer use count if this is already protected by a spinlock?

the usual answer to this type of question is that there is another place which accesses the buffer use count without holding the spinlock. I haven't audited this particular piece of code and I'm not familiar with this area, but this may help you find the solution.

(the other answer is that this is a bug left over from a time when there was no spinlock :-)

--
Revolutions do not require corporate support.
Bug in fs/super.c
Hello everyone,

Basically, multiple mounts don't check the options requested in the system call against what the superblock already has stored in sb->s_flags. Since the existing superblock is simply reused, it is possible to mount partitions with options different from what you requested, without userland ever being the wiser. I've tried this on several machines with up-to-date mount + kernels (2.10m, 2.4.0test5-prex).

Here is a quick example. Say sda1 is already mounted as /, rw. The next step is 'mount -t ext2 -o ro /dev/sda1 /mnt'. This goes fine, and running 'mount' or 'cat /etc/mtab' looks fine (mounted ro). However, if you 'touch /mnt/file' it lets you, so it is obviously not ro. Looking a little further, 'cat /proc/mounts' shows that the kernel has mounted it rw. This inconsistency is no good.

It needs to do some sort of check around line 778 maybe.. not really sure if there are masks involved or what not so I'll leave the real coding to you guys. pseudocode:

    if (sb->s_flags == flags) {
        if (fs_type == sb->s_type) {
            path_release(&nd);
            return sb;
        }
    } else {
        error = -ESOMETHING;
        goto out;
    }

This bug is also present in the bind mounts, although I'm not positive on how it works there because I haven't looked at the code (I am not much of a kernel person anyways, this is all giving me large headaches ;). The same example but with 'mount -t bind -o ro / /mnt' should do the trick to show the bug.

This has already been sent to sct, tytso, Al Viro, and lkml, and I've finally been advised I should send it here. Hopefully someone can now patch it so I can stop fwding this everywhere ;)

enjoy,
-b
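The pseudocode above compares the flag words for exact equality, and the poster was unsure whether masks are involved. As a rough userspace sketch only (not the kernel's code), a masked comparison could look like the following. Every name here (the `SK_*` flag bits, `struct sk_super`, `sk_check_reuse`) is hypothetical, the choice of which flags must match is an assumption, and `-EBUSY` is just one plausible stand-in for the `-ESOMETHING` in the post.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; the kernel's real MS_* values
 * and superblock layout are not reproduced here. */
#define SK_RDONLY 0x01u
#define SK_NOSUID 0x02u
#define SK_NODEV  0x04u

/* Flags that must agree between the existing mount and a new request.
 * Which flags belong in this mask is an assumption of the sketch. */
#define SK_MUST_MATCH (SK_RDONLY | SK_NOSUID | SK_NODEV)

struct sk_super {
    unsigned int s_flags; /* options the filesystem is already mounted with */
};

/* Return 0 if the existing superblock may be reused for a request made
 * with `requested` flags, or -EBUSY if a must-match option differs. */
int sk_check_reuse(const struct sk_super *sb, unsigned int requested)
{
    assert(sb != NULL);
    /* XOR leaves a bit set wherever the two flag words disagree. */
    if ((sb->s_flags ^ requested) & SK_MUST_MATCH)
        return -EBUSY;
    return 0;
}
```

With a mask, flags outside `SK_MUST_MATCH` may differ freely, so a second mount is refused only when an option such as read-only actually conflicts with the first mount.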
[announce] forced umount patch (2.4.0-test5-pre5)
Hi,

Someone asked me to announce this on linux-fsdevel (I did on linux-kernel only). So here goes:

http://www.moses.uklinux.net/patches/badfs-2.4.0-test5-pre5.patch

This patch adds the ability to forcibly umount (as in umount -f) any filesystem. This is work in progress and by no means ready for submission to Linus. Namely:

a) it works, but:
b) it has serious (though unlikely to occur) race conditions wrt filesystem's module unload and possibly with rename(2).

So, please try it, read it, fix it and send your comments to me :)

Regards,
Tigran

PS. We do need to have forced umount under Linux because:

a) Solaris has it.
b) it is useful
Re: Questions about the buffer+page cache in 2.4.0
On Thu, Jul 27, 2000 at 08:02:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions. Just jumping in at a place I felt I
> knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> it's changed somewhat. Finding and removing a block from the free list is now
> bracketed by a spinlock pair. First question: why do we use atomic_set to set
> the initial buffer use count if this is already protected by a spinlock?

The buffer use count needs to be atomic in other places because interrupts may change it on UP. atomic_t can only be modified by atomic_* functions and atomic.h is lacking an "atomic_set_nonatomic". So even when you only need the atomic property once you have to change all uses of the field.

-Andi

--
This is like TV. I don't like TV.
Re: another ext3 question
I actually heard this quite some time ago and I always use it in any example where people are already using ext3 in a production environment. Are there more cases of ext3 in production?

Someone asked this question:

ok ... to clarify ... ext3 _guarantees_ consistent file system metadata or empirically, it tends to be robust about maintaining consistent file system metadata across abrupt reboots? reiserfs, jfs, etc. all _guarantee_ this condition.

So I'm really not sure what this means because I thought the *point* of a journalling filesystem was the above question.

Thanks
-jeremy

> On Thu, Jul 27, 2000 at 01:41:54PM -0400, Jeremy Hansen wrote:
> > We're really itching to use ext3 in a production environment. Can you
> > give any clues on how things are going? I'm not asking for time frames by
> > any means but rather like, yes, things are coming along nicely or whatever
> > clues you can give.
>
> the rpmfind.net and fr.rpmfind.net servers have been running completely
> on top of ext3 for more than one year without troubles. I use it for a
> couple of W3C servers too.
>
> Daniel
>

-- http://www.xxedgexx.com | [EMAIL PROTECTED] -
Re: another ext3 question
On Thu, Jul 27, 2000 at 01:41:54PM -0400, Jeremy Hansen wrote:
> We're really itching to use ext3 in a production environment. Can you
> give any clues on how things are going? I'm not asking for time frames by
> any means but rather like, yes, things are coming along nicely or whatever
> clues you can give.

the rpmfind.net and fr.rpmfind.net servers have been running completely on top of ext3 for more than one year without troubles. I use it for a couple of W3C servers too.

Daniel

--
[EMAIL PROTECTED] | W3C, INRIA Rhone-Alpes | Today's Bookmarks :
Tel : +33 476 615 257 | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207 | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
http://www.w3.org/People/all#veillard%40w3.org | RPM badminton Kaffe
Re: Questions about the buffer+page cache in 2.4.0
On Thu, 27 Jul 2000, Tigran Aivazian wrote:
> On Thu, 27 Jul 2000, Daniel Phillips wrote:
> > So now it's time to start asking questions. Just jumping in at a place I felt I
> > knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> > it's changed somewhat. Finding and removing a block from the free list is now
> > bracketed by a spinlock pair. First question: why do we use atomic_set to set
> > the initial buffer use count if this is already protected by a spinlock?
>
> have a look at other users of bh->b_count. For example __brelse() does
> atomic_dec() and it is called directly from brelse() which can be called
> by filesystem without any other protection.

actually, that was a naive answer and it was wrong. The correct answer is - in that piece of code you do _not_ actually care about atomicity of the b_count operation as you are the only user of it. However, atomicity of b_count is needed in general, so, to keep the same API to access it one just sets it to 1 in getblk() using atomic_set() rather than just b_count.counter = 1.

Regards,
Tigran
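The point made in this reply (initialise through the atomic API even where no other user can see the counter, because every later access has to be atomic) can be illustrated with a small userspace analogue. This is only a sketch using C11 `<stdatomic.h>`, not the kernel's `atomic_t`; `struct sketch_buf` and the `sketch_buf_*` functions are hypothetical names standing in for the buffer head, `getblk()` and `brelse()`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a buffer head; only the use count is modelled. */
struct sketch_buf {
    atomic_int count; /* plays the role of bh->b_count */
};

/* Called while the buffer is still private to the caller (for example,
 * just taken off a free list under a lock). No other thread can see it
 * yet, so no atomicity is needed for this store, but the field has an
 * atomic type, so it is still set through the atomic API. */
void sketch_buf_init(struct sketch_buf *b)
{
    atomic_init(&b->count, 1);
}

/* Take an extra reference; returns the new count. */
int sketch_buf_get(struct sketch_buf *b)
{
    return atomic_fetch_add(&b->count, 1) + 1;
}

/* Drop a reference without holding any lock (the brelse()-like path);
 * returns the new count. This is the access that needs real atomicity. */
int sketch_buf_put(struct sketch_buf *b)
{
    int old = atomic_fetch_sub(&b->count, 1);
    assert(old > 0); /* dropping a reference nobody holds is a bug */
    return old - 1;
}
```

Because the counter's type forces every access through the same family of functions, the initialising store looks "atomic" even though it only needs to be an ordinary assignment at that point.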
Re: Questions about the buffer+page cache in 2.4.0
On Thu, 27 Jul 2000, Daniel Phillips wrote:
> So now it's time to start asking questions. Just jumping in at a place I felt I
> knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
> it's changed somewhat. Finding and removing a block from the free list is now
> bracketed by a spinlock pair. First question: why do we use atomic_set to set
> the initial buffer use count if this is already protected by a spinlock?

have a look at other users of bh->b_count. For example __brelse() does atomic_dec() and it is called directly from brelse() which can be called by filesystem without any other protection.

regards,
Tigran
Bug in fs/super.c (cont.)
hey all (again), forgot to add a bit on to that last post.. i have been told that it should be possible to change the code so that it does support changing the options (at least some of them) between different mounts, however it is not implemented now. so the check that I present is only meant as a patch until the support is written. if you would rather just write the code so that it can change options I'd be even happier ;) i really need to have a partition RO in one place and RW in another which is why i hit into this inconsistency. only other way i know of doing it is via a loopback nfs or similarly silly. ah well, -b
Re: another ext3 question
Hi,

On Thu, Jul 27, 2000 at 01:41:54PM -0400, Jeremy Hansen wrote:
>
> We're really itching to use ext3 in a production environment. Can you
> give any clues on how things are going?

The ext3-0.0.2f appears to be rock solid. Andreas has got prototyped code for e2fsck log replay, and I've got pending code for out-of-memory and IO failures sitting here, along with most of the code for metadata-only journaling.

I'm off on holiday in a few hours for 2 weeks, but expect a release shortly after I get back with some more goodies in it.

Cheers,
Stephen
Questions about the buffer+page cache in 2.4.0
After spending *far* too long getting lxr installed and working, I finally have my favorite source browser working in the same place I email from :-) Meaning that the next time I get hit with a tough question I'll be able to respond without having to go home and browse the source first.

BTW, lxr is a wonderful thing and everybody who has the patience should think about installing and running it locally. It can do far more for you than just index the Linux kernel. I see that by going to the effort of installing it I've now jumped two development versions ahead of the official site (http://lxr.linux.no/source/) so it was worth it.

So now it's time to start asking questions. Just jumping in at a place I felt I knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see it's changed somewhat. Finding and removing a block from the free list is now bracketed by a spinlock pair. First question: why do we use atomic_set to set the initial buffer use count if this is already protected by a spinlock?

I thought I'd start with an easy question just to do a reality check. That's all for today, my excuse being that lxr bit such a big chunk out of my day I only had time for one question. Tomorrow hopefully I'll start getting at the things I need to know in order to do the right thing with tail blocks.

--
Daniel
Re: another ext3 question
Thanks! I've been using ext3 on about 5 machines for the past week or so and it's been great so far. Haven't had any problems except for that weird thing when creating the journal on the root filesystem, where it was getting confused with the ramdisk or whatever, but making a boot disk and mounting fixed that.

We're really itching to use ext3 in a production environment. Can you give any clues on how things are going? I'm not asking for time frames by any means but rather like, yes, things are coming along nicely or whatever clues you can give.

Also, do you know of anyone creating an anaconda install that will take advantage of ext3? Assuming I would know anaconda which I don't, it would seem easy enough to create journal files just based on the size of partition. That would be swell.

Thanks
-jeremy

> Hi,
>
> On Fri, Jul 21, 2000 at 11:54:20PM -0600, Andreas Dilger wrote:
>
> > Note that you should not make the journals so large that they are a
> > major fraction of your RAM, as you will not gain anything by this.
> > A few megabytes is fine, 1024 disk blocks is the minimum.
>
> Yep. The main drawbacks of a large journal are that (a) it can pin a
> lot of buffers in memory at once, and (b) it takes longer to recover.
> The only advantage of a large journal is that it gives the filesystem
> more flexibility in writing things back to the main disk, but it's not
> a large effect unless you have a very heavy write load.
>
> Cheers,
> Stephen
>

-- http://www.xxedgexx.com | [EMAIL PROTECTED] -
Re: another ext3 question
Hi,

On Fri, Jul 21, 2000 at 11:54:20PM -0600, Andreas Dilger wrote:

> Note that you should not make the journals so large that they are a
> major fraction of your RAM, as you will not gain anything by this.
> A few megabytes is fine, 1024 disk blocks is the minimum.

Yep. The main drawbacks of a large journal are that (a) it can pin a lot of buffers in memory at once, and (b) it takes longer to recover. The only advantage of a large journal is that it gives the filesystem more flexibility in writing things back to the main disk, but it's not a large effect unless you have a very heavy write load.

Cheers,
Stephen
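The sizing advice in this message (at least 1024 blocks, a few megabytes is fine, never a major fraction of RAM) can be written down as a simple clamp. The sketch below is illustrative only and is not taken from any ext3 tool: `sk_journal_blocks` is a hypothetical helper, and the 1/8-of-RAM ceiling is an arbitrary assumption chosen to stand for "not a major fraction"; only the 1024-block minimum comes from the thread.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MIN_JOURNAL_BLOCKS 1024u /* minimum quoted in the thread */
#define SK_RAM_FRACTION       8u    /* hypothetical: cap at 1/8 of RAM */

/* Clamp a requested journal size, given in filesystem blocks, so that it
 * is never smaller than the minimum and, where possible, never larger
 * than a fixed fraction of RAM. */
uint64_t sk_journal_blocks(uint64_t wanted_blocks, uint64_t ram_bytes,
                           uint32_t block_size)
{
    assert(block_size > 0);

    /* Largest journal, in blocks, that stays under the RAM fraction. */
    uint64_t cap = ram_bytes / SK_RAM_FRACTION / block_size;
    uint64_t n = wanted_blocks;

    if (n > cap)
        n = cap;
    /* The minimum wins if the two limits conflict on a tiny machine. */
    if (n < SK_MIN_JOURNAL_BLOCKS)
        n = SK_MIN_JOURNAL_BLOCKS;
    return n;
}
```

An installer that sized journals from the partition size, as suggested elsewhere in the thread, could pass its own guess as `wanted_blocks` and let a clamp like this enforce the limits.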
Re: Tailmerging for Ext2
Alexander Viro wrote:
> On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:
> > On Wed, Jul 26, 2000 at 03:19:46PM -0400, Alexander Viro wrote:
> >
> > > Erm? Consider that: huge lseek() + write past the end of file. Woops - got
> > > to unmerge the tail (it's an internal block now) and we've got no
> > > knowledge of IO going on the page. Again, IO may be asynchronous - no
> > > protection from i_sem for us. After that page becomes a regular one,
> > > right? Looks like a change of state to me...
> >
> > Naturally, and that change of state must be made atomically by the
> > filesystem.
>
> Yep. Which is the point - there _are_ dragons. I believe that it's doable,
> but I really want to repeat: Daniel, watch out for races at the moments
> when page state changes, it needs more accurate approach than usual
> pagecache-using fs. It can be done, but it will take some reading (and
> yes, Stephen, I know that _you_ know it ;-)

That's apparent, and I feel that Stephen could probably implement the entire tail merge as described so far in a few days. But that wouldn't be as useful as having me and perhaps some other interested observers go all the way through the exercise of figuring out the so-far unwritten rules of the buffercache/pagecache duo.

The exact same accurate work is required for Tux2, which makes massive use of copy-on-write. Right now, buffer issues are the main thing standing in the way of making a development code release for Tux2. So there is no question in my mind about whether such issues have to be dealt with: they do.

I dove into the 2.4.0 cache code for the first time last night (using lxr - try it, you'll like it) and I'm almost at the point where I have some relevant questions to ask. I notice that buffer.c has increased in size by almost 50% and is far and away the largest module in the VFS. Worse, buffer.c is massively cross-coupled to the mm subsystem and the page cache, as we know too well. Buffer.c is right at the core of the issues we're talking about.

Bearing that in mind, instead of just jumping in and starting to code I'll try the methodical approach :-) My immediate objective is to try to clarify a few things that aren't immediately obvious from the source, in the following areas:

  - States and transitions for the main objects:
      - Buffer heads
      - Buffer data
      - Page heads
      - Page data
      - Other?

  - Existing concurrency controls:
      - Semaphores/Spinlocks
      - Big kernel lock
      - Filesystem locks
      - Posix locks?
      - Other?

  - Planned additions/deletions of concurrency controls

I will also try to make a list of the main internal functions in the VFS (and some related ones from the mm and drivers modules) and examine function-by-function what the intended usage is, what the issues/caveats are, and maybe even how we can expect them to evolve in the future.

I think we need even more than this in terms of documentation in order to work effectively, but this at least will be a good start. It will be more than what we have now. If it gets to the point where we can actually answer questions about race conditions by consulting the docs then we really will have accomplished something. Yes, I know that the code is going to keep evolving and sometimes will break the docs, but I also have confidence that the docs can keep up with such evolution given some interested volunteer doc maintainers willing to hang out on the devel list and keep asking questions.

Even in 2.2.x I felt that there is a lot of understated elegance in Linux's buffer cache design. In 2.4.0 it seems to be getting more elegant, although it's hard to say exactly, because of the sparse (read: nonexistent) documentation. This is a problem that can be easily fixed.

To get through this I will have to ask a lot of naive-sounding questions. Hopefully I'll have the first batch ready this afternoon (morning, your time).

--
Daniel
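The concern quoted in this message is that a page changing from "tail-merged" to "regular" is a state transition that must happen atomically, so that two paths racing to unmerge the same tail cannot both proceed. A very rough userspace model of that idea, using a C11 compare-and-swap, is sketched below. It is not how the 2.4 page cache works; the state names, `struct sk_page`, and both functions are hypothetical and exist only to make "atomic change of state" concrete.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical states for a page that may hold a merged file tail. */
enum sk_page_state {
    SK_TAIL_MERGED, /* page shares a block with other file tails */
    SK_UNMERGING,   /* one thread is copying the tail out */
    SK_REGULAR      /* page is an ordinary page-cache page */
};

struct sk_page {
    _Atomic int state;
};

/* Try to claim the right to unmerge. The compare-and-swap succeeds for
 * exactly one caller; everyone else sees the state has already changed
 * and gets false. */
bool sk_begin_unmerge(struct sk_page *p)
{
    int expected = SK_TAIL_MERGED;
    return atomic_compare_exchange_strong(&p->state, &expected,
                                          SK_UNMERGING);
}

/* Publish the finished transition. Only the thread that won
 * sk_begin_unmerge() may call this. */
void sk_finish_unmerge(struct sk_page *p)
{
    int old = atomic_exchange(&p->state, SK_REGULAR);
    assert(old == SK_UNMERGING);
    (void)old; /* keeps the build quiet when assertions are disabled */
}
```

The intermediate `SK_UNMERGING` state is the design choice worth noting: it gives readers that find the page mid-transition something definite to wait on or retry against, which is the kind of rule the proposed documentation would need to spell out.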