vm/fs meetup in september?
I'd just like to take the chance also to ask about a VM/FS meetup some time around kernel summit (maybe take a big of time during UKUUG or so). I was thinking about trying to arrange a proper mini summit thing, but it's a bit difficult and we could talk this year about doing it for subsequent years. If there is a bit of interest, we could probably find a small room somewhere this year on pretty short notice or do it as a BOF or something. I don't want to do it in the VM summit, because that kind of alienates the filesystem guys. What I want to talk about is anything and everything that the VM can do better to help the fs and vice versa. I'd like to stay away from memory management where not too applicable to the fs. A few things I'd like to talk about are: - the address space operations APIs, and their page based nature. I think it would be nice to generally move toward offset,length based ones as much as possible because it should give more efficiency and flexibility in the filesystem. - write_begin API if it is still an issue by that date. Hope not :) - truncate races - fsblock if it hasn't been shot down by then - how to make complex API changes without having to fix most things yourself. Anyway, if you will be in the area and are interested, let me know (off list) and we can work out time and place. Thanks, Nick - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote: > fsblock is a rewrite of the "buffer layer" (ding dong the witch is > dead), which I have been working on, on and off and is now at the stage > where some of the basics are working-ish. This email is going to be > long... Long overdue. Thank you. -- wli - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
On Sat, Jun 23, 2007 at 11:07:54PM -0400, Jeff Garzik wrote: > Nick Piggin wrote: > >- No deadlocks (hopefully). The buffer layer is technically deadlocky by > > design, because it can require memory allocations at page writeout-time. > > It also has one path that cannot tolerate memory allocation failures. > > No such problems for fsblock, which keeps fsblock metadata around for as > > long as a page is dirty (this still has problems vs get_user_pages, but > > that's going to require an audit of all get_user_pages sites. Phew). > > > >- In line with the above item, filesystem block allocation is performed > > before a page is dirtied. In the buffer layer, mmap writes can dirty a > > page with no backing blocks which is a problem if the filesystem is > > ENOSPC (patches exist for buffer.c for this). > > This raises an eyebrow... The handling of ENOSPC prior to mmap write is > more an ABI behavior, so I don't see how this can be fixed with internal > changes, yet without changing behavior currently exported to userland > (and thus affecting code based on such assumptions). I believe people are happy to have it SIGBUS (which is how the VM is already set up with page_mkwrite, and what fsblock does). > >- An inode's metadata must be tracked per-inode in order for fsync to > > work correctly. buffer contains helpers to do this for basic > > filesystems, but any block can be only the metadata for a single inode. > > This is not really correct for things like inode descriptor blocks. > > fsblock can track multiple inodes per block. (This is non trivial, > > and it may be overkill so it could be reverted to a simpler scheme > > like buffer). > > hrm; no specific comment but this seems like an idea/area that needs to > be fleshed out more, by converting some of the more advanced filesystems. Yep. It's conceptually fairly simple though, and it might be easier than having filesystems implement their own complex syncing that finds and syncs everything themselves. > >- Large block support. I can mount and run an 8K block size minix3 fs on > > my 4K page system and it didn't require anything special in the fs. We > > can go up to about 32MB blocks now, and gigabyte+ blocks would only > > require one more bit in the fsblock flags. fsblock_superpage blocks > > are > PAGE_CACHE_SIZE, midpage ==, and subpage <. > > definitely useful, especially if I rewrite my ibu filesystem for 2.6.x, > like I've been planning. Yeah, it wasn't the primary motivation for the rewrite, but it would be negligent to not even consider large blocks in such a rewrite, I think. > >So. Comments? Is this something we want? If yes, then how would we > >transition from buffer.c to fsblock.c? > > Your work is definitely interesting, but I think it will be even more > interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are > converted. Well minix has dir in pagecache ;) But you're completely right: ext2 will be the next step and then ext3 and things like XFS and NTFS will be the real test. I think I could eventually get ext2 done (one of the biggest headaches is simply just converting ->b_data accesses), however unlikely a journalling one. > My gut feeling is that there are several problem areas you haven't hit > yet, with the new code. I would agree with your gut :) > Also, once things are converted, the question of transitioning from > buffer.c will undoubtedly answer itself. That's the way several of us > handle transitions: finish all the work, then look with fresh eyes and > conceive a path from the current code to your enhanced code. Yeah that would be nice. It's very difficult because of so much filesystem code. I'd say it would be feasible to step buffer.c into fsblock.c, however if we were to track all (or even the common) filesystems along with that it would introduce a huge number of kind-of-redundant changes that I don't think all fs maintainers would have time to write (and as I said, I can't do it myself). Anyway, let's cross that bridge if and when we come to it. For now, the big thing that needs to be done is convert a "big" fs and see if the results tell us that it's workable. Thanks for the comments Jeff. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
Nick Piggin wrote: - No deadlocks (hopefully). The buffer layer is technically deadlocky by design, because it can require memory allocations at page writeout-time. It also has one path that cannot tolerate memory allocation failures. No such problems for fsblock, which keeps fsblock metadata around for as long as a page is dirty (this still has problems vs get_user_pages, but that's going to require an audit of all get_user_pages sites. Phew). - In line with the above item, filesystem block allocation is performed before a page is dirtied. In the buffer layer, mmap writes can dirty a page with no backing blocks which is a problem if the filesystem is ENOSPC (patches exist for buffer.c for this). This raises an eyebrow... The handling of ENOSPC prior to mmap write is more an ABI behavior, so I don't see how this can be fixed with internal changes, yet without changing behavior currently exported to userland (and thus affecting code based on such assumptions). - An inode's metadata must be tracked per-inode in order for fsync to work correctly. buffer contains helpers to do this for basic filesystems, but any block can be only the metadata for a single inode. This is not really correct for things like inode descriptor blocks. fsblock can track multiple inodes per block. (This is non trivial, and it may be overkill so it could be reverted to a simpler scheme like buffer). hrm; no specific comment but this seems like an idea/area that needs to be fleshed out more, by converting some of the more advanced filesystems. - Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <. definitely useful, especially if I rewrite my ibu filesystem for 2.6.x, like I've been planning. So. Comments? Is this something we want? If yes, then how would we transition from buffer.c to fsblock.c? Your work is definitely interesting, but I think it will be even more interesting once ext2 (w/ dir in pagecache) and ext3 (journalling) are converted. My gut feeling is that there are several problem areas you haven't hit yet, with the new code. Also, once things are converted, the question of transitioning from buffer.c will undoubtedly answer itself. That's the way several of us handle transitions: finish all the work, then look with fresh eyes and conceive a path from the current code to your enhanced code. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
Just clarify a few things. Don't you hate rereading a long work you wrote? (oh, you're supposed to do that *before* you press send?). On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote: > > I'm announcing "fsblock" now because it is quite intrusive and so I'd > like to get some thoughts about significantly changing this core part > of the kernel. > > fsblock is a rewrite of the "buffer layer" (ding dong the witch is > dead), which I have been working on, on and off and is now at the stage > where some of the basics are working-ish. This email is going to be > long... > > Firstly, what is the buffer layer? The buffer layer isn't really a > buffer layer as in the buffer cache of unix: the block device cache > is unified with the pagecache (in terms of the pagecache, a blkdev > file is just like any other, but with a 1:1 mapping between offset > and block). I mean, in Linux, the block device cache is unified. UNIX I believe did all its caching in a buffer cache, below the filesystem. > - Large block support. I can mount and run an 8K block size minix3 fs on > my 4K page system and it didn't require anything special in the fs. We Oh, and I don't have a Linux mkfs that makes minixv3 filesystems. I had an image kindly made for me because I don't use minix. If you want to test large block support, I won't email it to you though: you can just convert ext2 or ext3 to fsblock ;) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 3/3] minix: convert to fsblock
Convert minix from buffer head to fsblock. --- fs/minix/bitmap.c | 148 +-- fs/minix/file.c |6 - fs/minix/inode.c| 172 ++-- fs/minix/itree_common.c | 227 fs/minix/itree_v1.c |7 - fs/minix/itree_v2.c |7 - fs/minix/minix.h| 17 ++- 7 files changed, 382 insertions(+), 202 deletions(-) Index: linux-2.6/fs/minix/minix.h === --- linux-2.6.orig/fs/minix/minix.h +++ linux-2.6/fs/minix/minix.h @@ -1,4 +1,5 @@ #include +#include #include #include @@ -37,16 +38,18 @@ struct minix_sb_info { int s_dirsize; int s_namelen; int s_link_max; - struct buffer_head ** s_imap; - struct buffer_head ** s_zmap; - struct buffer_head * s_sbh; + struct fsblock_meta ** s_imap; + struct fsblock_meta ** s_zmap; + struct fsblock_meta * s_smblock; struct minix_super_block * s_ms; unsigned short s_mount_state; unsigned short s_version; }; -extern struct minix_inode * minix_V1_raw_inode(struct super_block *, ino_t, struct buffer_head **); -extern struct minix2_inode * minix_V2_raw_inode(struct super_block *, ino_t, struct buffer_head **); +extern struct minix_inode * minix_V1_raw_inode(struct super_block *, ino_t, struct fsblock_meta **); +extern void minix_put_raw_inode(struct super_block *sb, ino_t ino, struct fsblock_meta *mblock, struct minix_inode *p); +extern struct minix2_inode * minix_V2_raw_inode(struct super_block *, ino_t, struct fsblock_meta **); +extern void minix2_put_raw_inode(struct super_block *sb, ino_t ino, struct fsblock_meta *mblock, struct minix2_inode *p); extern struct inode * minix_new_inode(const struct inode * dir, int * error); extern void minix_free_inode(struct inode * inode); extern unsigned long minix_count_free_inodes(struct minix_sb_info *sbi); @@ -60,8 +63,8 @@ extern void V2_minix_truncate(struct ino extern void minix_truncate(struct inode *); extern int minix_sync_inode(struct inode *); extern void minix_set_inode(struct inode *, dev_t); -extern int V1_minix_get_block(struct inode *, long, struct buffer_head *, int); -extern int V2_minix_get_block(struct inode *, long, struct buffer_head *, int); +extern int V1_minix_insert_mapping(struct address_space *, loff_t, size_t, int); +extern int V2_minix_insert_mapping(struct address_space *, loff_t, size_t, int); extern unsigned V1_minix_blocks(loff_t, struct super_block *); extern unsigned V2_minix_blocks(loff_t, struct super_block *); Index: linux-2.6/fs/minix/itree_common.c === --- linux-2.6.orig/fs/minix/itree_common.c +++ linux-2.6/fs/minix/itree_common.c @@ -1,31 +1,29 @@ /* Generic part */ typedef struct { - block_t *p; + block_t *mem; + int offset; block_t key; - struct buffer_head *bh; + struct fsblock_meta *mblock; } Indirect; static DEFINE_RWLOCK(pointers_lock); -static inline void add_chain(Indirect *p, struct buffer_head *bh, block_t *v) +static inline void add_chain(Indirect *p, struct fsblock_meta *mblock, block_t *mem, int offset) { - p->key = *(p->p = v); - p->bh = bh; + p->mem = mem; + p->offset = offset; + p->key = mem[offset]; + p->mblock = mblock; } static inline int verify_chain(Indirect *from, Indirect *to) { - while (from <= to && from->key == *from->p) + while (from <= to && from->key == from->mem[from->offset]) from++; return (from > to); } -static inline block_t *block_end(struct buffer_head *bh) -{ - return (block_t *)((char*)bh->b_data + bh->b_size); -} - static inline Indirect *get_branch(struct inode *inode, int depth, int *offsets, @@ -34,35 +32,43 @@ static inline Indirect *get_branch(struc { struct super_block *sb = inode->i_sb; Indirect *p = chain; - struct buffer_head *bh; + struct fsblock_meta *mblock; *err = 0; /* i_data is not going away, no lock needed */ - add_chain (chain, NULL, i_data(inode) + *offsets); + add_chain (chain, NULL, i_data(inode), *offsets); if (!p->key) - goto no_block; + goto out; while (--depth) { - bh = sb_bread(sb, block_to_cpu(p->key)); - if (!bh) - goto failure; + void *data; + + mblock = sb_mbread(sb, block_to_cpu(p->key)); + if (!mblock) { + *err = -EIO; + goto out; + } read_lock(&pointers_lock); - if (!verify_chain(chain, p)) - goto changed; - add_chain(++p, bh,
[patch 2/3] block_dev: convert to fsblock
Convert block_dev mostly to fsblocks. --- fs/block_dev.c | 204 +++- fs/buffer.c | 113 ++-- fs/super.c |2 include/linux/buffer_head.h |9 - include/linux/fs.h | 29 ++ 5 files changed, 225 insertions(+), 132 deletions(-) Index: linux-2.6/fs/block_dev.c === --- linux-2.6.orig/fs/block_dev.c +++ linux-2.6/fs/block_dev.c @@ -16,7 +16,9 @@ #include #include #include +#include #include +#include #include #include #include @@ -61,14 +63,14 @@ static void kill_bdev(struct block_devic { if (bdev->bd_inode->i_mapping->nrpages == 0) return; - invalidate_bh_lrus(); + invalidate_bh_lrus(); /* XXX: this can go when buffers goes */ truncate_inode_pages(bdev->bd_inode->i_mapping, 0); } int set_blocksize(struct block_device *bdev, int size) { /* Size must be a power of two, and between 512 and PAGE_SIZE */ - if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size)) + if (size < 512 || !is_power_of_2(size)) return -EINVAL; /* Size cannot be smaller than the size supported by the device */ @@ -92,7 +94,7 @@ int sb_set_blocksize(struct super_block if (set_blocksize(sb->s_bdev, size)) return 0; /* If we get here, we know size is power of two -* and it's value is between 512 and PAGE_SIZE */ +* and it's value is >= 512 */ sb->s_blocksize = size; sb->s_blocksize_bits = blksize_bits(size); return sb->s_blocksize; @@ -112,19 +114,12 @@ EXPORT_SYMBOL(sb_min_blocksize); static int blkdev_get_block(struct inode *inode, sector_t iblock, - struct buffer_head *bh, int create) +struct buffer_head *bh, int create) { if (iblock >= max_block(I_BDEV(inode))) { if (create) return -EIO; - - /* -* for reads, we're just trying to fill a partial page. -* return a hole, they will have to call get_block again -* before they can fill it, and they will get -EIO at that -* time -*/ - return 0; +return 0; } bh->b_bdev = I_BDEV(inode); bh->b_blocknr = iblock; @@ -132,6 +127,66 @@ blkdev_get_block(struct inode *inode, se return 0; } +static int blkdev_insert_mapping(struct address_space *mapping, loff_t off, + size_t len, int create) +{ + sector_t blocknr; + struct inode *inode = mapping->host; + pgoff_t next, end; + struct pagevec pvec; + int ret = 0; + + pagevec_init(&pvec, 0); + next = off >> PAGE_CACHE_SHIFT; + end = (off + len) >> PAGE_CACHE_SHIFT; + blocknr = off >> inode->i_blkbits; + while (next <= end && pagevec_lookup(&pvec, mapping, next, + min(end - next, (pgoff_t)PAGEVEC_SIZE))) { + unsigned int i; + + for (i = 0; i < pagevec_count(&pvec); i++) { + struct fsblock *block; + struct page *page = pvec.pages[i]; + + BUG_ON(page->index != next); + BUG_ON(blocknr != pgoff_sector(next, inode->i_blkbits)); + BUG_ON(!PageLocked(page)); + + if (blocknr >= max_block(I_BDEV(inode))) { + if (create) + ret = -ENOMEM; + + /* +* for reads, we're just trying to fill a +* partial page. return a hole, they will +* have to call in again before they can fill +* it, and they will get -EIO at that time +*/ + continue; /* xxx: could be smarter, stop now */ + } + + block = page_blocks(page); + if (fsblock_subpage(block)) { + struct fsblock *b; + for_each_block(block, b) { + if (!test_bit(BL_mapped, &b->flags)) + map_fsblock(b, blocknr); + blocknr++; + } + } else { + if (!test_bit(BL_mapped, &block->flags)) + map_fsblock(block, blocknr); + blocknr++; + } + next++; + } + pagevec_release(&pvec); + } + + re
[RFC] fsblock
I'm announcing "fsblock" now because it is quite intrusive and so I'd like to get some thoughts about significantly changing this core part of the kernel. fsblock is a rewrite of the "buffer layer" (ding dong the witch is dead), which I have been working on, on and off and is now at the stage where some of the basics are working-ish. This email is going to be long... Firstly, what is the buffer layer? The buffer layer isn't really a buffer layer as in the buffer cache of unix: the block device cache is unified with the pagecache (in terms of the pagecache, a blkdev file is just like any other, but with a 1:1 mapping between offset and block). There are filesystem APIs to access the block device, but these go through the block device pagecache as well. These don't exactly define the buffer layer either. The buffer layer is a layer between the pagecache and the block device for block based filesystems. It keeps a translation between logical offset and physical block number, as well as meta information such as locks, dirtyness, and IO status of each block. This information is tracked via the buffer_head structure. Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that[*] because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems. [*] About the furthest we could go is use the struct page for the information otherwise stored in the buffer_head, but this would be tricky and suboptimal for filesystems with non page sized blocks and would probably bloat the struct page as well. So why rewrite rather than incremental improvements? Incremental improvements are logically the correct way to do this, and we probably could go from buffer.c to fsblock.c in steps. But I didn't do this because: a) the blinding pace at which things move in this area would make me an old man before it would be complete; b) I didn't actually know exactly what it was going to look like before starting on it; c) I wanted stable root filesystems and such when testing it; and d) I found it reasonably easy to have both layers coexist (it uses an extra page flag, but even that wouldn't be needed if the old buffer layer was better decoupled from the page cache). I started this as an exercise to see how the buffer layer could be improved, and I think it is working out OK so far. The name is fsblock because it basically ties the fs layer to the block layer. I think Andrew has wanted to rename buffer_head to block before, but block is too clashy, and it isn't a great deal more descriptive than buffer_head. I believe fsblock is. I'll go through a list of things where I have hopefully improved on the buffer layer, off the top of my head. The big caveat here is that minix is the only real filesystem I have converted so far, and complex journalled filesystems might pose some problems that water down its goodness (I don't know). - data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on 64-bit (could easily be 32 if we can have int bitops). Compare this to around 50 and 100ish for struct buffer_head. With a 4K page and 1K blocks, IO requires 10% RAM overhead in buffer heads alone. With fsblocks you're down to around 3%. - Structure packing. A page gets a number of buffer heads that are allocated in a linked list. fsblocks are allocated contiguously, so cacheline footprint is smaller in the above situation. - Data / metadata separation. I have a struct fsblock and a struct fsblock_meta, so we could put more stuff into the usually less used fsblock_meta without bloating it up too much. After a few tricks, these are no longer any different in my code, and dirty up the typing quite a lot (and I'm aware it still has some warnings, thanks). So if not useful this could be taken out. - Locking. fsblocks completely use the pagecache for locking and lookups. The page lock is used, but there is no extra per-inode lock that buffer has. Would go very nicely with lockless pagecache. RCU is used for one non-blocking fsblock lookup (find_get_block), but I'd really rather hope filesystems can tolerate that blocking, and get rid of RCU completely. (actually this is not quite true because mapping->private_lock is still used for mark_buffer_dirty_inode equivalent, but that's a relatively rare operation). - Coupling with pagecache metadata. Pagecache pages contain some metadata that is logically redundant because it is tracked in buffers as well (eg. a page is dirty if one or more buffers are dirty, or uptodate if all buffers are uptodate). This is great because means we can avoid that layer in some situations, but they can get out of sync. eg. if a filesystem writes a buffer out by hand, its pagecache page will stay dirty
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
This thread is amazing. With so many smart people's precious time, What are the results? What are the issues anyway? Is anyone happy? (I'm not and I assume Chris is not) Yes, "waste of time" is taking place here, but it's not for "pathname-based MAC" but for "wrongly posted messages", I believe. I'm a relatively new to this ml, let me ask. Is this ml a place of judge or battle? (not to help or support?) Nothing is perfect, so we can work to make things to better, right? I have suggestions: Let's clarify issues first. - problems (or limitations) of pathname-based MAC - advantages of pathname-based MAC - how can pathname-based MAC supplement label based (Stephen, James and Kyle, please help) Let's start the arguments again if we get the issues. Threads should be definitely separated per issue and a assigning a chair may help. Above issues are independent of SELinux. We should not *compare* SELinux and AA, that can cause a problem. Every software has shortages that's why we need to work and we can make progress. For some issues we may need to compare them, in that case moderators would help. BTW I have posted a RFC of TOMOYO Linux that is another pathname-based MAC. http://lkml.org/lkml/2007/6/13/58 AA and TOMOYO Linux have BoF sessions at OLS2007, so it would be a great opportunity to *talk* over the issues. What I want to say is "let's make progress and help each other to make Linux better". Thank you, Toshiharu Harada Chris Wright wrote: > * Chris Mason ([EMAIL PROTECTED]) wrote: >> I'm sure people there will have a different versions of events. The >> one part that was discussed was if pathname based security was >> useful, and a number of the people in the room (outside of >> novell) said it was. Now, it could be that nobody wanted to argue >> anymore, since most opinions had come out on one list or another by >> then. > > Indeed. The trouble is that's too high level compared with the actual > implementation details. AA is stalled because it has failed to get > VFS support for it's model. I don't see a nice way out unless it > changes it's notion of policy language (globbing is the tough one) > or gets traction to pass dentry/vfsmount all the way down. Paths are > completely relevant for security, esp. when considering the parent dir > and the leaf (as in forward lookup case). Retroactively creating the > full path is at the minimum ugly, and in the worst case can be insecure > (yes AA has taken many measures to mitigate that insecurity). > >> But as someone who doesn't use either SElinux or AA, I really hope >> we can get past the part of the debate where: >> >> while(1) >> AA) we think we're making users happy with pathname security >> SELINUX) pathname security sucks > > Yes. Please. Both parties are miserably failing the sanity test. > Doing the same thing over and over and expecting different results. > > AA folks: deal with the VFS issues that your patchset have in a palatable > way (which does not include passing NULL when it's inconvenient to > do otherwise). You've already missed an opportunity with Christoph's > suggestions for changes in NFS. I know you've considered many alternative > approaches and consistently hit dead ends. But please note, if you > have coded yourself into a corner because of your policy language, > that's your issue to solve, not ours. > > SELinux folks: do something useful rather than quibbling over the TCSEC > definition of MAC and AA's poor taste in marketing literature. Here's > some suggestions: > > 1) Make SELinux usable (it's *still* the number one complaint). While > this is a bit of a cheap shot, it really is one of the core reasons AA > advocates exist. > 2) Work on a variant of Kyle's suggestion to squash the relevancy of AA. > 3) Write an effective exploit against AA that demonstrates the fundamental > weakness of the model (better make sure it's not also an issue for > targetted policy). -- Toshiharu Harada [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
This thread is amazing. With so many smart people's precious time, What are the results? What are the issues anyway? Is anyone happy? (I'm not and I assume Chris is not) Yes, "waste of time" is taking place here, but it's not for "pathname-based MAC" but for "wrongly posted messages", I believe. I'm a relatively new to this ml, let me ask. Is this ml a place of judge or battle? (not to help or support?) Nothing is perfect, so we can work to make things to better, right? I have suggestions: Let's clarify issues first. - problems (or limitations) of pathname-based MAC - advantages of pathname-based MAC - how can pathname-based MAC supplement label based (Stephen, James and Kyle, please help) Let's start the arguments again if we get the issues. Threads should be definitely separated per issue and a assigning a chair may help. Well, I crated a Wiki page. If it helps, please feel free to use it. I mean I would like people to add your issues here. It's wiki, so you are welcome to modify everything. http://tomoyo.sourceforge.jp/wiki-e/?MAC-ISSUES If ml is better, I have no objections. I just wanted to help discussion. Above issues are independent of SELinux. We should not *compare* SELinux and AA, that can cause a problem. Every software has shortages that's why we need to work and we can make progress. For some issues we may need to compare them, in that case moderators would help. BTW I have posted a RFC of TOMOYO Linux that is another pathname-based MAC. http://lkml.org/lkml/2007/6/13/58 AA and TOMOYO Linux have BoF sessions at OLS2007, so it would be a great opportunity to *talk* over the issues. What I want to say is "let's make progress and help each other to make Linux better". Cheers, Toshiharu Harada - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [-mm PATCH] ocfs2: ->fallocate() support
On Sat, Jun 23, 2007 at 09:54:15AM -0700, Andrew Morton wrote: > > On Thu, 21 Jun 2007 12:01:43 -0700 Mark Fasheh <[EMAIL PROTECTED]> wrote: > > Plug ocfs2 into the ->fallocate() callback. We only support FA_ALLOCATE for > > now - FA_DEALLOCATE will come later. > > I get a healthy reject when applying this against your own tree. Perhaps > because my copy of it is a few days old, dunno. Eek, yeah I based it on a pretty recent version of my stuff. > I think I'll duck it, allow you to merge this in the normal fashion once > fallocate hits mainline, OK? Sounds good to me. Could I trouble you to add me to the CC for one of the patches that add ->fallocate() so I know when it gets sent upstream? Thanks, --Mark -- Mark Fasheh Senior Software Developer, Oracle [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [-mm PATCH] ocfs2: ->fallocate() support
> On Thu, 21 Jun 2007 12:01:43 -0700 Mark Fasheh <[EMAIL PROTECTED]> wrote: > Plug ocfs2 into the ->fallocate() callback. We only support FA_ALLOCATE for > now - FA_DEALLOCATE will come later. I get a healthy reject when applying this against your own tree. Perhaps because my copy of it is a few days old, dunno. I think I'll duck it, allow you to merge this in the normal fashion once fallocate hits mainline, OK? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/26] Mount writer count and read-only bind mounts
> On Fri, 22 Jun 2007 13:03:03 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote: > Why do we need r/o bind mounts? > > This feature allows a read-only view into a read-write filesystem. > In the process of doing that, it also provides infrastructure for > keeping track of the number of writers to any given mount. > > This has a number of uses. It allows chroots to have parts of > filesystems writable. It will be useful for containers in the future > because users may have root inside a container, but should not > be allowed to write to somefilesystems. This also replaces > patches that vserver has had out of the tree for several years. > > It allows security enhancement by making sure that parts of > your filesystem read-only (such as when you don't trust your > FTP server), when you don't want to have entire new filesystems > mounted, or when you want atime selectively updated. > I've been using the following script to test that the feature is > working as desired. It takes a directory and makes a regular > bind and a r/o bind mount of it. It then performs some normal > filesystem operations on the three directories, including ones > that are expected to fail, like creating a file on the r/o > mount. Doesn't selinux do some of this? My overall reaction: owch. There's a ton of tricksy code here and great potential for us to accidentally break it in the future by forgetting a mnt_may_write() as the kernel evolves. And then there's the added complexity and the added runtime overhead. Balance that against some pretty obscure-looking benefits and I'm struggling to see how a merge is justifiable? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 25/26] r/o bind mounts: scalable writer count
> On Fri, 22 Jun 2007 13:03:36 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote: > +DEFINE_PER_CPU(struct mnt_writer, mnt_writers); this can have static scope. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/26] r/o bind mounts: stub functions
> On Fri, 22 Jun 2007 13:03:09 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote: > This patch adds two function mnt_want_write() and > mnt_drop_write() ITYM "global, exported-to-modules yet 100% undocumented" functions. tsk. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
> On Fri, 22 Jun 2007 13:03:08 -0700 Dave Hansen <[EMAIL PROTECTED]> wrote: > Christoph H. says this stands on its own and can go in before the > rest of the r/o bind mount set. > > --- > > Some filesystems forego the vfs and may_open() and create their > own 'struct file's. > > This patch creates a couple of helper functions which can be > used by these filesystems, and will provide a unified place > which the r/o bind mount code may patch. > > Also, rename two existing, static-scope init_file() to less > generic names. the sysfs changes are no longer applicable here due to pending changes in Greg's tree. The net/socket.c change is a mystery. My version of sock_attach_fd() doesn't look like yours. So I snarfed the first three patches, passed on this one and shall await a quality review-and-ack from someone who is appropriately familiar with this code, thanks. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 25/26] r/o bind mounts: scalable writer count
> I'd suggest something along these lines in final mntput: > > lock_and_coalesce_cpu_mnt_writer_counts(); > mnt_unlock_cpus(); > BUG_ON(atomic_read(&mnt->__mnt_writers)); Or rather a WARN_ON(), since it's not a fatal condition. Miklos - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 25/26] r/o bind mounts: scalable writer count
> This uses a statically-allocated percpu variable. All > operations are local to a cpu as long that cpu operates on > the same mount, and there are no writer count imbalances. > Writer count imbalances happen when a write is taken on one > cpu, and released on another, like when an open/close pair > is performed on two different cpus because the task moved. > I'd suggest something along these lines in final mntput: lock_and_coalesce_cpu_mnt_writer_counts(); mnt_unlock_cpus(); BUG_ON(atomic_read(&mnt->__mnt_writers)); since there's basically no other we'll notice if there _is_ an imbalance. Miklos - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] make iunique use a do/while loop rather than its obscure goto loop
On Wed, Apr 11, 2007 at 05:58:56PM -0400, Jeffrey Layton wrote: > A while back, Christoph mentioned that he thought that iunique ought to be > cleaned up to use a more conventional loop construct. This patch does that, > turning the strange goto loop into a do/while. > > Signed-off-by: Jeff Layton <[EMAIL PROTECTED]> > > diff --git a/fs/inode.c b/fs/inode.c > index 23fc1fd..90e7587 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -689,21 +689,18 @@ ino_t iunique(struct super_block *sb, ino_t > max_reserved) > struct inode *inode; > struct hlist_head * head; > ino_t res; > + > spin_lock(&inode_lock); > -retry: > - if (counter > max_reserved) { > - head = inode_hashtable + hash(sb,counter); > + do { > + if (counter <= max_reserved) > + counter = max_reserved + 1; > res = counter++; > + head = inode_hashtable + hash(sb, res); > inode = find_inode_fast(sb, head, res); > - if (!inode) { > - spin_unlock(&inode_lock); > - return res; > - } > - } else { > - counter = max_reserved + 1; > - } > - goto retry; > - > + } while (inode != NULL); > + spin_unlock(&inode_lock); > + > + return res; > } > > EXPORT_SYMBOL(iunique); ---end quoted text--- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 26/26] honor r/w changes at do_remount() time
On Fri, Jun 22, 2007 at 01:03:37PM -0700, Dave Hansen wrote: > > > Originally from: Herbert Poetzl <[EMAIL PROTECTED]> > > This is the core of the read-only bind mount patch set. > > Note that this does _not_ add a "ro" option directly to > the bind mount operation. If you require such a mount, > you must first do the bind, then follow it up with a > 'mount -o remount,ro' operation. > > Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> Ok. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 24/26] do_rmdir(): elevate write count
Ok. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 23/26] elevate mnt writers for vfs_unlink() callers
Ok. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 22/26] sys_mknodat(): elevate write count for vfs_mknod/create()
On Fri, Jun 22, 2007 at 01:03:32PM -0700, Dave Hansen wrote: > > > This takes care of all of the direct callers of vfs_mknod(). > Since a few of these cases also handle normal file creation > as well, this also covers some calls to vfs_create(). Ok. > diff -puN > fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create > fs/namei.c > --- lxc/fs/namei.c~18-24-sys-mknodat-elevate-write-count-for-vfs-mknod-create > 2007-06-21 23:23:25.0 -0700 > +++ lxc-dave/fs/namei.c 2007-06-21 23:23:25.0 -0700 > @@ -1897,14 +1897,26 @@ asmlinkage long sys_mknodat(int dfd, con > if (!IS_ERR(dentry)) { > switch (mode & S_IFMT) { > case 0: case S_IFREG: > + error = mnt_want_write(nd.mnt); > + if (error) > + break; > error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd); > + mnt_drop_write(nd.mnt); > break; > case S_IFCHR: case S_IFBLK: > + error = mnt_want_write(nd.mnt); > + if (error) > + break; > error = vfs_mknod(nd.dentry->d_inode,dentry,mode, > new_decode_dev(dev)); > + mnt_drop_write(nd.mnt); > break; > case S_IFIFO: case S_IFSOCK: > + error = mnt_want_write(nd.mnt); > + if (error) > + break; > error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0); > + mnt_drop_write(nd.mnt); > break; > case S_IFDIR: > error = -EPERM; Should we just take the calls outside the switch statement? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 21/26] elevate write count for do_sys_utime() and touch_atime()
Ok. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 20/26] elevate write count for do_utimes()
Ok. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 19/26] elevate writer count for do_sys_truncate()
Ok. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 18/26] nfs: check mnt instead of superblock directly
Ok. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 17/26] elevate write count over calls to vfs_rename()
On Fri, Jun 22, 2007 at 01:03:25PM -0700, Dave Hansen wrote: > > This also creates a little helper in the NFS code to > make an if() a little bit less ugly. That should probably be a separate patch. Or better one to just rip out the MSNFS ifdefs completely, they've always been true for about the last ten years. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 15/26] mount_is_safe(): add comment
On Fri, Jun 22, 2007 at 01:03:22PM -0700, Dave Hansen wrote: > > > This area of code is currently #ifdef'd out, so add a comment > for the time when it is actually used. Ok. Does this clash with the user mount patches? Even if it does I think we want this patch first in the series and fix the user mounts up ontop of it. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 16/26] unix_find_other() elevate write count for touch_atime()
Ok. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 14/26] elevate write count for file_update_time()
On Fri, Jun 22, 2007 at 01:03:21PM -0700, Dave Hansen wrote: > > > Signed-off-by: Dave Hansen <[EMAIL PROTECTED]> We really want a guaranteed non-NULL file here, but I don't want to put this on your plate also. Please add a comment about bloody NFS exports for now. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/26] elevate mnt writers for callers of vfs_mkdir()
On Fri, Jun 22, 2007 at 01:03:16PM -0700, Dave Hansen wrote: > > Pretty self-explanatory. Fits in with the rest of the series. Ok for this and similar patch 11-13. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/26] make access() use mnt check
On Fri, Jun 22, 2007 at 01:03:14PM -0700, Dave Hansen wrote: > > It is OK to let access() go without using a mnt_want/drop_write() > pair because it doesn't actually do writes to the filesystem, > and it is inherently racy anyway. This is a rare case when it is > OK to use __mnt_is_readonly() directly. You probably want to add a big comment explaining why it's fine here. That reminds me of something else I had in mind to debug that the writer counts are okay: we should probably add a check in permission that we have an elevated writercount on the vfsmount/sb. Of course we'll need some way to overrid it for access(), which means passing down a flag to it or something. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/26] elevate writer count for chown and friends
On Fri, Jun 22, 2007 at 01:03:13PM -0700, Dave Hansen wrote: > > > chown/chmod,etc... don't call permission in the same way > that the normal "open for write" calls do. They still > write to the filesystem, so bump the write count during > these operations. Looks good. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/26] r/o bind mounts: elevate write count for some ioctls
On Fri, Jun 22, 2007 at 01:03:12PM -0700, Dave Hansen wrote: > > Some ioctl()s can cause writes to the filesystem. Take > these, and make them use mnt_want/drop_write() instead. > > We need to pass the filp one layer deeper in XFS, but > somebody _just_ pulled it out in February because nobody > was using it, so I don't feel guilty for adding it back. Could have been me :) Would be nicer to just pass down the vfsmount instead of the file in XFS, but it shouldn't really matter in the end. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/26] elevate write count open()'d files
On Fri, Jun 22, 2007 at 01:03:10PM -0700, Dave Hansen wrote: > > This is the first really tricky patch in the series. It > elevates the writer count on a mount each time a > non-special file is opened for write. > > This is not completely apparent in the patch because the > two if() conditions in may_open() above the > mnt_want_write() call are, combined, equivalent to > special_file(). > > There is also an elevated count around the vfs_create() > call in open_namei(). The count needs to be kept elevated > all the way into the may_open() call. Otherwise, when the > write is dropped, a ro->rw transisition could occur. This > would lead to having rw access on the newly created file, > while the vfsmount is ro. That is bad. Looks good. > > Some filesystems forego the use of normal vfs calls to create > struct files. Make sure that these users elevate the mnt writer > count because they will get __fput(), and we need to make > sure they're balanced. With this you mean ipc/mqueue.c? You should probably mention that single bastard child of a wannabe-filesystem explicitly :) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/26] r/o bind mounts: stub functions
On Fri, Jun 22, 2007 at 01:03:09PM -0700, Dave Hansen wrote: > > This patch adds two function mnt_want_write() and > mnt_drop_write(). These are used like a lock pair around > and fs operations that might cause a write to the filesystem. > > Before these can become useful, we must first cover each > place in the VFS where writes are performed with a > want/drop pair. When that is complete, we can actually > introduce code that will safely check the counts before > allowing r/w<->r/o transitions to occur. Ok, > > Note that we put the linux/fs.h #include as far down in > mount.h as possible. This is to keep the mount.h->fs.h-> > sched.h->mount.h include dependency from biting us. except for this. Please just move __mnt_is_readonly out of line or make it a macro, but we surely don't want to include fs.h in mount.h - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/26] filesystem helpers for custom 'struct file's
On Fri, Jun 22, 2007 at 01:03:08PM -0700, Dave Hansen wrote: > > Christoph H. says this stands on its own and can go in before the > rest of the r/o bind mount set. > > --- > > Some filesystems forego the vfs and may_open() and create their > own 'struct file's. > > This patch creates a couple of helper functions which can be > used by these filesystems, and will provide a unified place > which the r/o bind mount code may patch. > > Also, rename two existing, static-scope init_file() to less > generic names. Looks good. Again this should go in ASAP after 2.6.23 independent of the more complicated bits because it's a nice cleanup. In the end get_empty_filp should go away as an exported symbol. Note that we've grown more instances of the crap you're fixing here, e.g. fs/anon_inode.c - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/26] ext4: remove extra IS_RDONLY() check
On Fri, Jun 22, 2007 at 01:03:06PM -0700, Dave Hansen wrote: > > ext4_change_inode_journal_flag() is only called from one > location: ext4_ioctl(EXT3_IOC_SETFLAGS). That ioctl > case already has a IS_RDONLY() call in it so this one > is superfluous. Ditto. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/26] ext3: remove extra IS_RDONLY() check
On Fri, Jun 22, 2007 at 01:03:05PM -0700, Dave Hansen wrote: > > ext3_change_inode_journal_flag() is only called from one > location: ext3_ioctl(EXT3_IOC_SETFLAGS). That ioctl > case already has a IS_RDONLY() call in it so this one > is superfluous. Ok, and stands nicely on it's own. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/26] document nlink function
On Fri, Jun 22, 2007 at 01:03:04PM -0700, Dave Hansen wrote: > > These should have been documented from the beginning. Fix it. Ok. This is a trivial doc fix and should go into the 2.6.22 queue IMHO. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html