[patch 2/6][RFC] Allow FIBMAP to return EFBIG on large filesystems.
Propagate an error (EFBIG) to userspace if the physical block is too large to return in a 32bit int instead of truncating it. Signed-off-by: Mike Waychison <[EMAIL PROTECTED]> fs/ioctl.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) Index: linux-2.6.23/fs/ioctl.c === --- linux-2.6.23.orig/fs/ioctl.c2007-10-26 15:26:10.0 -0700 +++ linux-2.6.23/fs/ioctl.c 2007-10-26 16:16:28.0 -0700 @@ -52,6 +52,7 @@ static int file_ioctl(struct file *filp, case FIBMAP: { struct address_space *mapping = filp->f_mapping; + sector_t phys_block; int res; /* do we support this mess? */ if (!mapping->a_ops->bmap) @@ -64,8 +65,15 @@ static int file_ioctl(struct file *filp, return -EINVAL; lock_kernel(); - res = mapping->a_ops->bmap(mapping, block); + phys_block = mapping->a_ops->bmap(mapping, block); unlock_kernel(); + + /* Make sure that the return value fits in the +* user's buffer. */ + if ((u32)phys_block < phys_block) + return -EFBIG; + + res = phys_block; return put_user(res, p); } case FIGETBSZ: -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 4/6][RFC] Attempt to plug race with truncate
Attempt to deal with races with truncate paths. I'm not really sure on the locking here, but these seem to be taken by the truncate path. BKL is left as some filesystem may(?) still require it. Signed-off-by: Mike Waychison <[EMAIL PROTECTED]> fs/ioctl.c |8 1 file changed, 8 insertions(+) Index: linux-2.6.23/fs/ioctl.c === --- linux-2.6.23.orig/fs/ioctl.c2007-10-26 15:27:29.0 -0700 +++ linux-2.6.23/fs/ioctl.c 2007-10-26 16:16:28.0 -0700 @@ -43,13 +43,21 @@ static long do_ioctl(struct file *filp, static int do_fibmap(struct address_space *mapping, sector_t block, sector_t *phys_block) { + struct inode *inode = mapping->host; + if (!capable(CAP_SYS_RAWIO)) return -EPERM; if (!mapping->a_ops->bmap) return -EINVAL; lock_kernel(); + /* Avoid races with truncate */ + mutex_lock(&inode->i_mutex); + /* FIXME: Do we really need i_alloc_sem? */ + down_read(&inode->i_alloc_sem); *phys_block = mapping->a_ops->bmap(mapping, block); + up_read(&inode->i_alloc_sem); + mutex_unlock(&inode->i_mutex); unlock_kernel(); return 0; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 5/6][RFC] Introduce FIBMAP64
Introduce FIBMAP64. This is the same as FIBMAP, but takes a u64. This new routine requires filesystems to implement bmap64 if they want to support > 2^31 blocks in a logical file. We do this to give time for filesystems to be properly audited. Signed-off-by: Mike Waychison <[EMAIL PROTECTED]> fs/compat_ioctl.c |2 ++ fs/ioctl.c | 34 +- include/linux/fs.h |7 +++ 3 files changed, 42 insertions(+), 1 deletion(-) Index: linux-2.6.23/fs/ioctl.c === --- linux-2.6.23.orig/fs/ioctl.c2007-10-26 15:28:28.0 -0700 +++ linux-2.6.23/fs/ioctl.c 2007-10-26 16:16:28.0 -0700 @@ -44,11 +44,24 @@ static int do_fibmap(struct address_spac sector_t *phys_block) { struct inode *inode = mapping->host; + sector_t (*bmap)(struct address_space *, sector_t); if (!capable(CAP_SYS_RAWIO)) return -EPERM; - if (!mapping->a_ops->bmap) + + if (mapping->a_ops->bmap64) { + /* Filesystem has bmap path audited for 64bit. */ + bmap = mapping->a_ops->bmap64; + } else if (mapping->a_ops->bmap) { + bmap = mapping->a_ops->bmap; + /* Need to verify that the input looks sane when sector_t is +* 64bit and the filesystem has only been audited for 32bit. */ + if ((s32)block < 0) + return -EFBIG; + } else { + /* no FIBMAP support */ return -EINVAL; + } lock_kernel(); /* Avoid races with truncate */ @@ -95,6 +108,25 @@ static int file_ioctl(struct file *filp, res = phys_block; return put_user(res, p); } + case FIBMAP64: + { + struct address_space *mapping = filp->f_mapping; + sector_t phys_block; + u64 block; + + if ((error = get_user(block, p)) != 0) + return error; + + /* Make sure that the logical block fits in sector_t */ + if ((sector_t)-1 < block) + return -EFBIG; + + error = do_fibmap(mapping, block, &phys_block); + if (error) + return error; + + return put_user(block, p); + } case FIGETBSZ: return put_user(inode->i_sb->s_blocksize, p); case FIONREAD: Index: linux-2.6.23/include/linux/fs.h === --- linux-2.6.23.orig/include/linux/fs.h2007-10-26 15:25:47.0 -0700 +++ linux-2.6.23/include/linux/fs.h 2007-10-26 15:29:00.0 -0700 @@ -220,6 +220,7 @@ extern int dir_notify_enable; #define BMAP_IOCTL 1 /* obsolete - kept for compatibility */ #define FIBMAP_IO(0x00,1) /* bmap access */ #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ +#define FIBMAP64 _IO(0x00,3) /* bmap access - 64 bit sector numbers */ #defineFS_IOC_GETFLAGS _IOR('f', 1, long) #defineFS_IOC_SETFLAGS _IOW('f', 2, long) @@ -423,6 +424,12 @@ struct address_space_operations { int (*commit_write)(struct file *, struct page *, unsigned, unsigned); /* Unfortunately this kludge is needed for FIBMAP. Don't use it */ sector_t (*bmap)(struct address_space *, sector_t); + /* +* Version of bmap for filesystems that can safely handle 64bit +* sector_t. Once all filesystems are converted to this, we can drop +* bmap(). +*/ + sector_t (*bmap64)(struct address_space *, sector_t); void (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, gfp_t); ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, Index: linux-2.6.23/fs/compat_ioctl.c === --- linux-2.6.23.orig/fs/compat_ioctl.c 2007-10-26 15:25:47.0 -0700 +++ linux-2.6.23/fs/compat_ioctl.c 2007-10-26 15:29:00.0 -0700 @@ -2506,6 +2506,7 @@ COMPATIBLE_IOCTL(FIONREAD) /* This is a /* 0x00 */ COMPATIBLE_IOCTL(FIBMAP) COMPATIBLE_IOCTL(FIGETBSZ) +COMPATIBLE_IOCTL(FIBMAP64) /* 0x03 -- HD/IDE ioctl's used by hdparm and friends. * Some need translations, these do not. */ @@ -3597,6 +3598,7 @@ asmlinkage long compat_sys_ioctl(unsigne case FIBMAP: case FIGETBSZ: + case FIBMAP64: case FIONREAD: if (S_ISREG(filp->f_path.dentry->d_inode->i_mode)) break; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body o
[patch 3/6][RFC] Move FIBMAP logic
Move FIBMAP logic out of file_ioctl() in preparation for introducing FIBMAP64. Signed-off-by: Mike Waychison <[EMAIL PROTECTED]> fs/ioctl.c | 25 ++--- 1 file changed, 18 insertions(+), 7 deletions(-) Index: linux-2.6.23/fs/ioctl.c === --- linux-2.6.23.orig/fs/ioctl.c2007-10-26 15:26:11.0 -0700 +++ linux-2.6.23/fs/ioctl.c 2007-10-26 16:16:28.0 -0700 @@ -40,6 +40,21 @@ static long do_ioctl(struct file *filp, return error; } +static int do_fibmap(struct address_space *mapping, sector_t block, +sector_t *phys_block) +{ + if (!capable(CAP_SYS_RAWIO)) + return -EPERM; + if (!mapping->a_ops->bmap) + return -EINVAL; + + lock_kernel(); + *phys_block = mapping->a_ops->bmap(mapping, block); + unlock_kernel(); + + return 0; +} + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { @@ -55,18 +70,14 @@ static int file_ioctl(struct file *filp, sector_t phys_block; int res; /* do we support this mess? */ - if (!mapping->a_ops->bmap) - return -EINVAL; - if (!capable(CAP_SYS_RAWIO)) - return -EPERM; if ((error = get_user(block, p)) != 0) return error; if (block < 0) return -EINVAL; - lock_kernel(); - phys_block = mapping->a_ops->bmap(mapping, block); - unlock_kernel(); + error = do_fibmap(mapping, block, &phys_block); + if (error) + return error; /* Make sure that the return value fits in the * user's buffer. */ -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 6/6][RFC] Drop CAP_SYS_RAWIO requirement on FIBMAP
Remove the need for having CAP_SYS_RAWIO when doing a FIBMAP call on an open file descriptor. It would be nice to allow users to have permission to see where their data is landing on disk, and there really isn't a good reason to keep them from getting at this information. Signed-off-by: Mike Waychison <[EMAIL PROTECTED]> fs/ioctl.c |3 --- 1 file changed, 3 deletions(-) Index: linux-2.6.23/fs/ioctl.c === --- linux-2.6.23.orig/fs/ioctl.c2007-10-26 15:30:05.0 -0700 +++ linux-2.6.23/fs/ioctl.c 2007-10-26 15:31:43.0 -0700 @@ -46,9 +46,6 @@ static int do_fibmap(struct address_spac struct inode *inode = mapping->host; sector_t (*bmap)(struct address_space *, sector_t); - if (!capable(CAP_SYS_RAWIO)) - return -EPERM; - if (mapping->a_ops->bmap64) { /* Filesystem has bmap path audited for 64bit. */ bmap = mapping->a_ops->bmap64; -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 0/6][RFC] Cleanup FIBMAP
The following series is meant to clean up FIBMAP paths with the eventual goal of allowing users to be able to FIBMAP their data. I'm sending this as an RFC as I've only tested this on a x86_64 kernel with a 32bit binary on ext2 and I've noticed a couple ext2_warnings already. I'm unsure of the locking in [4/6] fix_race_with_truncate.patch. Any help here would greatly be appreciated. The last patch, [6/6] drop_cap_sys_rawio_for_fibmap.patch, is of course, not to be applied until any remaining issues are fixed :) Thanks, Mike Waychison -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 1/6][RFC] Keep FIBMAP from looking at negative block nrs
Return an error if the user requests a negative logical block in a file. Signed-off-by: Mike Waychison <[EMAIL PROTECTED]> fs/ioctl.c |2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.23/fs/ioctl.c === --- linux-2.6.23.orig/fs/ioctl.c2007-10-26 15:25:48.0 -0700 +++ linux-2.6.23/fs/ioctl.c 2007-10-26 16:16:29.0 -0700 @@ -60,6 +60,8 @@ static int file_ioctl(struct file *filp, return -EPERM; if ((error = get_user(block, p)) != 0) return error; + if (block < 0) + return -EINVAL; lock_kernel(); res = mapping->a_ops->bmap(mapping, block); -- - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] 9p: add missing end-of-options record for trans_fd
The list of options that the fd transport accepts is missing end-of-options marker. This patch adds it. Signed-off-by: Latchesar Ionkov <[EMAIL PROTECTED]> --- commit 70ec0c7936c2516d128fdb1a32d749071b8846ec tree 8de5495f83b94096825f8bfe0767e4fcbaf36ef3 parent 25ed88ed319e40fb6426e2aaefcc5f136aadd4ee author Latchesar Ionkov <[EMAIL PROTECTED]> 1193438452 -0600 committer Latchesar Ionkov <[EMAIL PROTECTED]> 1193438452 -0600 net/9p/trans_fd.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c index 30269a4..62332ed 100644 --- a/net/9p/trans_fd.c +++ b/net/9p/trans_fd.c @@ -62,13 +62,14 @@ struct p9_trans_fd { enum { /* Options that take integer arguments */ - Opt_port, Opt_rfdno, Opt_wfdno, + Opt_port, Opt_rfdno, Opt_wfdno, Opt_err, }; static match_table_t tokens = { {Opt_port, "port=%u"}, {Opt_rfdno, "rfdno=%u"}, {Opt_wfdno, "wfdno=%u"}, + {Opt_err, NULL}, }; /** - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] 9p: return NULL when trans not found
v9fs_match_trans function returns arbitrary transport module instead of NULL when the requested transport is not registered. This patch modifies the function to return NULL in that case. Signed-off-by: Latchesar Ionkov <[EMAIL PROTECTED]> --- commit 25ed88ed319e40fb6426e2aaefcc5f136aadd4ee tree 281feda9127a85178d44faca59a13645d08e314d parent 0c0b7fa3d4e80ab3e4e69b8a55661f649d1b41ff author Latchesar Ionkov <[EMAIL PROTECTED]> 1193438318 -0600 committer Latchesar Ionkov <[EMAIL PROTECTED]> 1193438318 -0600 net/9p/mod.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/9p/mod.c b/net/9p/mod.c index 41d70f4..8f9763a 100644 --- a/net/9p/mod.c +++ b/net/9p/mod.c @@ -76,9 +76,9 @@ struct p9_trans_module *v9fs_match_trans(const substring_t *name) list_for_each(p, &v9fs_trans_list) { t = list_entry(p, struct p9_trans_module, list); if (strncmp(t->name, name->from, name->to-name->from) == 0) - break; + return t; } - return t; + return NULL; } EXPORT_SYMBOL(v9fs_match_trans); - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] 9p: use copy of the options value instead of original
v9fs_parse_options function uses strsep which modifies the value of the v9ses->options field. That modified value is later passed to the function that creates the transport potentially making the transport creation function to fail. This patch creates a copy of v9ses->option field that v9fs_parse_options function uses instead of the original value. Signed-off-by: Latchesar Ionkov <[EMAIL PROTECTED]> --- commit 0c0b7fa3d4e80ab3e4e69b8a55661f649d1b41ff tree eeb4ac6ea85387c153e3f7f6cad370e1c5dc8534 parent c9927c2bf4f45bb85e8b502ab3fb79ad6483c244 author Latchesar Ionkov <[EMAIL PROTECTED]> 1193438132 -0600 committer Latchesar Ionkov <[EMAIL PROTECTED]> 1193438132 -0600 fs/9p/v9fs.c |6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/9p/v9fs.c b/fs/9p/v9fs.c index 756f7e9..fbb12da 100644 --- a/fs/9p/v9fs.c +++ b/fs/9p/v9fs.c @@ -82,7 +82,7 @@ static match_table_t tokens = { static void v9fs_parse_options(struct v9fs_session_info *v9ses) { - char *options = v9ses->options; + char *options; substring_t args[MAX_OPT_ARGS]; char *p; int option; @@ -96,9 +96,10 @@ static void v9fs_parse_options(struct v9fs_session_info *v9ses) v9ses->cache = 0; v9ses->trans = v9fs_default_trans(); - if (!options) + if (!v9ses->options) return; + options = kstrdup(v9ses->options, GFP_KERNEL); while ((p = strsep(&options, ",")) != NULL) { int token; if (!*p) @@ -169,6 +170,7 @@ static void v9fs_parse_options(struct v9fs_session_info *v9ses) continue; } } + kfree(options); } /** - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/1] Drop CAP_SYS_RAWIO requirement for FIBMAP
Jason Uhlenkott wrote: On Fri, Oct 26, 2007 at 14:59:57 -0700, Mike Waychison wrote: Jason Uhlenkott wrote: Additionally, ext3_bmap() has this to say about it: if (EXT3_I(inode)->i_state & EXT3_STATE_JDATA) { /* * This is a REALLY heavyweight approach, but the use of * bmap on dirty files is expected to be extremely rare: * only if we run lilo or swapon on a freshly made file * do we expect this to happen. * * (bmap requires CAP_SYS_RAWIO so this does not * represent an unprivileged user DOS attack --- we'd be * in trouble if mortal users could trigger this path at * will.) Hmm. I don't know what the right approach to this is. This seems to be the same situation as the delayed allocation problem, no? Yup. What if we just returned 0? Tools like lilo are already doing sync(), would that cause the journal to get flushed explicitly anyway? Not sure, but I'd be pretty nervous about breaking any existing users which aren't explicitly syncing. True. We can probably get away with an implicit flush when CAP_SYS_RAWIO is set, but that's pretty gross :( Are you envisioning users who want to see where their data is landing for performance reasons? It seems like such users are going to have sufficiently different desires from existing FIBMAP users (who need to know where everything is because they intend to fiddle with the raw device) that a different interface might be warranted. A little of both ;) We could introduce a new API, though either way, the same fundamental problems apply wrt auditing. I see three reasons that new APIs are warranted: a) to deal with block numbers > 2^31 --> FIBMAP64 b) to have a path where no syncing is required due to worries about user DoS (delayed allocation / data in journal). c) possibly some way to FIBMAP a range so that userspace doesn't need to syscall for each block, something like how mincore() does it? I have a patchset ready that I'll send out shortly that introduces FIBMAP64. The last patch in that set drops the CAP_SYS_RAWIO, but it's probably not what we want given DoS case. I'd like to send it out anyway to get some comments on some of the sanity checks and locking I'm adding. Handling (c) above is just extra sugar and isn't something I'm too worried about implementing. Mike Waychison - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/1] Drop CAP_SYS_RAWIO requirement for FIBMAP
On Fri, Oct 26, 2007 at 14:59:57 -0700, Mike Waychison wrote: > Jason Uhlenkott wrote: > >Additionally, ext3_bmap() has this to say about it: > > > >if (EXT3_I(inode)->i_state & EXT3_STATE_JDATA) { > >/* > > * This is a REALLY heavyweight approach, but the use of > > * bmap on dirty files is expected to be extremely rare: > > * only if we run lilo or swapon on a freshly made file > > * do we expect this to happen. > > * > > * (bmap requires CAP_SYS_RAWIO so this does not > > * represent an unprivileged user DOS attack --- we'd be > > * in trouble if mortal users could trigger this path at > > * will.) > > Hmm. I don't know what the right approach to this is. This seems to be > the same situation as the delayed allocation problem, no? Yup. > What if we just returned 0? Tools like lilo are already doing sync(), > would that cause the journal to get flushed explicitly anyway? Not sure, but I'd be pretty nervous about breaking any existing users which aren't explicitly syncing. Are you envisioning users who want to see where their data is landing for performance reasons? It seems like such users are going to have sufficiently different desires from existing FIBMAP users (who need to know where everything is because they intend to fiddle with the raw device) that a different interface might be warranted. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
On Fri, 26 Oct 2007, Serge E. Hallyn wrote: > > It wouldn't be much effort to rebase this patch against Linus's latest > > tree. I am assuming that the static lsm patch is in there based on the > > recent discussion on LKML? > > Oh, sorry for the two emails. > > Yeah it's in 2.6.24. So a rebase will be necessary anyway. That code may still change -- Arjan's update, for example. -- James Morris <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/1] Drop CAP_SYS_RAWIO requirement for FIBMAP
On Fri, Oct 26, 2007 at 01:22:17 +0100, Alan Cox wrote: > On Thu, 25 Oct 2007 16:06:40 -0700 > Mike Waychison <[EMAIL PROTECTED]> wrote: > > > Remove the need for having CAP_SYS_RAWIO when doing a FIBMAP call on an > > open file descriptor. > > > > It would be nice to allow users to have permission to see where their data > > is landing on disk, and there really isn't a good reason to keep them from > > getting at this information. > > Historically this was done because people felt it was more secure. It > also allows you to make some deductions about other activities on the > disk but thats probably only a concern for very very security crazed > compartmentalised boxes > > Also historically at least FIBMAP could be abused to crash the system. > Now if you can verify that has been fixed I have no problem, but given > that I can find no record of that being fixed it would be wise to audit > it first and review Chris Evans and other reports about what occurs when > FIBMAP is passed random block numbers. > > FIBMAP has another problem for this general use as well - it takes an int > but the block number can now be bigger for very large files on 32bit. Additionally, ext3_bmap() has this to say about it: if (EXT3_I(inode)->i_state & EXT3_STATE_JDATA) { /* * This is a REALLY heavyweight approach, but the use of * bmap on dirty files is expected to be extremely rare: * only if we run lilo or swapon on a freshly made file * do we expect this to happen. * * (bmap requires CAP_SYS_RAWIO so this does not * represent an unprivileged user DOS attack --- we'd be * in trouble if mortal users could trigger this path at * will.) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/1] Drop CAP_SYS_RAWIO requirement for FIBMAP
Jason Uhlenkott wrote: On Fri, Oct 26, 2007 at 01:22:17 +0100, Alan Cox wrote: On Thu, 25 Oct 2007 16:06:40 -0700 Mike Waychison <[EMAIL PROTECTED]> wrote: Remove the need for having CAP_SYS_RAWIO when doing a FIBMAP call on an open file descriptor. It would be nice to allow users to have permission to see where their data is landing on disk, and there really isn't a good reason to keep them from getting at this information. Historically this was done because people felt it was more secure. It also allows you to make some deductions about other activities on the disk but thats probably only a concern for very very security crazed compartmentalised boxes Also historically at least FIBMAP could be abused to crash the system. Now if you can verify that has been fixed I have no problem, but given that I can find no record of that being fixed it would be wise to audit it first and review Chris Evans and other reports about what occurs when FIBMAP is passed random block numbers. FIBMAP has another problem for this general use as well - it takes an int but the block number can now be bigger for very large files on 32bit. Additionally, ext3_bmap() has this to say about it: if (EXT3_I(inode)->i_state & EXT3_STATE_JDATA) { /* * This is a REALLY heavyweight approach, but the use of * bmap on dirty files is expected to be extremely rare: * only if we run lilo or swapon on a freshly made file * do we expect this to happen. * * (bmap requires CAP_SYS_RAWIO so this does not * represent an unprivileged user DOS attack --- we'd be * in trouble if mortal users could trigger this path at * will.) Hmm. I don't know what the right approach to this is. This seems to be the same situation as the delayed allocation problem, no? What if we just returned 0? Tools like lilo are already doing sync(), would that cause the journal to get flushed explicitly anyway? Mike Waychison - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/2] Fix linux/swap.h build wart
On Fri, Oct 26, 2007 at 01:38:12PM -0700, Luck, Tony wrote: > Not fundamental reasons, but these patches break the ia64 build for most > configs with: > > In file included from kernel/futex.c:59: > include/asm/futex.h: In function `futex_atomic_op_inuser': > include/asm/futex.h:62: error: implicit declaration of function > `pagefault_disable' > include/asm/futex.h:86: error: implicit declaration of function > `pagefault_enable' > make[1]: *** [kernel/futex.o] Error 1 > > Adding an include of to kernel/futex.c is sufficient (but > not necessarily right) to unbreak this. Yeah, I've been changing to for these. Looks like I'll need to make a pass over all the futex.h's. > A few configs also fail with: > mm/migrate.c: In function `migrate_page_copy': > mm/migrate.c:359: error: implicit declaration of function `copy_highpage' > make[1]: *** [mm/migrate.o] Error 1 > > Making mm/migrate.c include fixes this one. Excellent, I hadn't yet figured out who used migrate.c. Jeff -- Work email - jdike at linux dot intel dot com - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [0/3] Distributed storage. Mirror algo extension for automatic recovery.
On Thu, 18 Oct 2007 23:17:41 +0400 Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: > I'm pleased to announce sixth release of the distributed storage > subsystem, which allows to form a storage on top of remote and local > nodes, which in turn can be exported to another storage as a node to > form tree-like storages. I went back and re-read last month's discussion and I'm not seeing any reason why we shouldn't start thinking about merging this. How close is it to that stage? A peek at your development blog indicates that things are still changing at a moderate rate? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [RFC PATCH 0/2] Fix linux/swap.h build wart
> These patches propose to break the recursion instead by removing the > linux/pagemap.h -> linux/highmem.h inclusion. I'd like to know > whether there are any fundamental reasons that this include should be > preserved. Not fundamental reasons, but these patches break the ia64 build for most configs with: In file included from kernel/futex.c:59: include/asm/futex.h: In function `futex_atomic_op_inuser': include/asm/futex.h:62: error: implicit declaration of function `pagefault_disable' include/asm/futex.h:86: error: implicit declaration of function `pagefault_enable' make[1]: *** [kernel/futex.o] Error 1 Adding an include of to kernel/futex.c is sufficient (but not necessarily right) to unbreak this. A few configs also fail with: mm/migrate.c: In function `migrate_page_copy': mm/migrate.c:359: error: implicit declaration of function `copy_highpage' make[1]: *** [mm/migrate.o] Error 1 Making mm/migrate.c include fixes this one. -Tony - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 1/2] Remove highmem.h include from pagemap.h
There has been a long-standing wart in linux/swap.h where it uses page_cache_release and release_pages without declaring them by including linux/pagemap.h. There is this scary comment to warn off anyone foolish enough to try to fix this: /* only sparc can not include linux/pagemap.h in this file * so leave page_cache_release and release_pages undeclared... */ This has caused problems for a number of people including: akpm - http://www.ussg.iu.edu/hypermail/linux/kernel/0708.3/0446.html "Oh that damn thing again. It's a regular source of include breakage." ppc - http://patchwork.ozlabs.org/linuxppc/patch?id=12288 UML - http://www.mail-archive.com/[EMAIL PROTECTED]/msg05095.html What happens with sparc, when this include is added, is a linux/mm.h -> linux/mm.h recursion: linux/mm.h -> asm/pgtable.h -> linux/swap.h -> linux/pagemap.h -> linux/highmem.h -> linux/mm.h leaving various things that live in mm.h undefined in highmem.h. sparc is unique in the asm/pgtable.h -> linux/swap.h inclusion. This looks correct, as pgtable.h uses swp_entry_t, which is defined in swap.h, so maybe the other arches should do the same. Adding this to the UML pgtable.h causes the same build breakage. All of the links in the recursion above look reasonable, except for linux/pagemap.h -> linux/highmem.h, which is the point of attack for this fix. linux/pagemap.h uses nothing from linux/highmem.h, so doesn't need to include it. However, there are a number of files which use stuff from highmem.h, don't include it, but do include pagemap.h. This patch removes the linux/pagemap.h -> linux/highmem.h inclusion and adds a linux/highmem.h include to every file in the tree which now needs one. --- arch/parisc/kernel/cache.c |1 + drivers/block/rd.c |1 + drivers/dma/iovlock.c|1 + drivers/media/video/ivtv/ivtv-udma.c |1 + drivers/net/e1000/e1000_main.c |1 + drivers/net/e1000e/netdev.c |1 + fs/9p/vfs_addr.c |1 + fs/affs/file.c |1 + fs/affs/symlink.c|1 + fs/afs/dir.c |1 + fs/afs/fsclient.c|1 + fs/afs/mntpt.c |1 + fs/afs/rxrpc.c |1 + fs/afs/write.c |1 + fs/cifs/file.c |1 + fs/cifs/inode.c |1 + fs/coda/symlink.c|1 + fs/cramfs/inode.c|1 + fs/dlm/lowcomms.c|1 + fs/ecryptfs/mmap.c |1 + fs/ecryptfs/read_write.c |1 + fs/efs/symlink.c |1 + fs/ext2/dir.c|1 + fs/ext2/namei.c |1 + fs/ext3/inode.c |1 + fs/ext4/extents.c|1 + fs/ext4/inode.c |1 + fs/freevxfs/vxfs_immed.c |1 + fs/freevxfs/vxfs_subr.c |1 + fs/fuse/dev.c|1 + fs/gfs2/bmap.c |1 + fs/gfs2/lops.c |1 + fs/gfs2/ops_address.c|1 + fs/gfs2/ops_file.c |1 + fs/hfs/bnode.c |1 + fs/hfs/btree.c |1 + fs/hfsplus/bitmap.c |1 + fs/hfsplus/bnode.c |1 + fs/hfsplus/btree.c |1 + fs/hostfs/hostfs_kern.c |1 + fs/hpfs/namei.c |1 + fs/isofs/compress.c |1 + fs/isofs/rock.c |1 + fs/jbd/journal.c |1 + fs/jbd2/journal.c|1 + fs/jffs2/fs.c|1 + fs/jfs/super.c |1 + fs/libfs.c |1 + fs/minix/namei.c |1 + fs/namei.c |1 + fs/ncpfs/dir.c |1 + fs/ncpfs/mmap.c |1 + fs/ncpfs/symlink.c |1 + fs/nfs/dir.c |1 + fs/nfs/nfs2xdr.c |1 + fs/nfs/nfs3xdr.c |1 + fs/nfs/nfs4proc.c|1 + fs/nfs/nfs4xdr.c |1 + fs/nfs/read.c|1 + fs/nfs/symlink.c |1 + fs/nfs/write.c |1 + fs/ntfs/aops.c |1 + fs/ntfs/file.c |1 + fs/ocfs2/symlink.c |1 + fs/reiserfs/ioctl.c |1 + fs/reiserfs/stree.c |1 + fs/reiserfs/tail_conversion.c|1 + fs/reiserfs/xattr.c |1 + fs/romfs/inode.c |1 +
[RFC PATCH 0/2] Fix linux/swap.h build wart
For some time, there has been a problem of linux/swap.h using page_cache_release and release_pages without declaring them by including linux/pagemap.h. pagemap.h isn't included because that breaks the sparc build. The full details are in the next post, but the short story is that sparc's pgtable.h includes linux/swap.h, creating a mm.h -> mm.h recursion. The current tree breaks that recursion by removing the swap.h -> pagemap.h inclusion. This breaks builds on a regular basis (references also in the next post). These patches propose to break the recursion instead by removing the linux/pagemap.h -> linux/highmem.h inclusion. I'd like to know whether there are any fundamental reasons that this include should be preserved. The reason for choosing this include is that pagemap.h doesn't use anything exported by highmem.h. The downside is that a bunch of things now need to include highmem.h. These are mostly filesystems, plus a good amount of arch code and a few drivers. I would argue that if you use something from a header, you should include it, regardless of whether you already get the header indirectly, so this isn't too much of a downside as far as I'm concerned. The one thing I'm uncertain about is whether pagemap.h and highmem.h have some basic relationship that implies that you can assume you get highmem when you include pagemap. The two patches that follow do the following: remove highmem.h from pagemap.h and add includes of highmem.h to a number of files that now need them include pagemap.h in swap.h These are for comment only, not for mainline, so there's no S-o-b right yet. At this point, I haven't confirmed that the following files, which use stuff from highmem.h but don't include it, still build: ./arch/x86/kernel/crash_dump_64.c ./arch/frv/mm/fault.c ./arch/frv/mm/kmap.c ./arch/h8300/mm/kmap.c ./arch/ia64/kernel/crash_dump.c ./arch/m68k/mm/kmap.c ./arch/m68knommu/mm/kmap.c ./arch/powerpc/kernel/crash_dump.c ./arch/sh/kernel/crash_dump.c ./arch/xtensa/mm/pgtable.c ./drivers/block/ps3disk.c ./drivers/infiniband/hw/ipath/ipath_verbs.h ./drivers/mmc/host/at91_mci.c ./drivers/mmc/host/mmci.h ./drivers/mmc/host/mmc_spi.c ./drivers/s390/block/xpram.c ./drivers/s390/char/tape_34xx.c ./drivers/kvm/paging_tmpl.h ./fs/hfsplus/hfsplus_fs.h ./include/asm-arm/cacheflush.h ./include/asm-frv/pgtable.h ./include/asm-x86/kexec_32.h ./include/asm-x86/kexec_64.h ./include/asm-x86/pgtable_32.h ./include/asm-m68k/motorola_pgalloc.h ./include/asm-m68k/sun3_pgalloc.h ./include/asm-m68k/sun3_pgtable.h ./include/asm-mips/cacheflush.h ./include/asm-mips/page.h ./include/asm-parisc/cacheflush.h ./include/asm-powerpc/pgtable-ppc32.h ./include/asm-ppc/pgtable.h ./include/asm-s390/kexec.h ./include/asm-sh/kexec.h ./mm/migrate.c Jeff -- Work email - jdike at linux dot intel dot com - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/2] Add pagemap.h include to swap.h
Include linux/pagemap.h in linux/swap.h and remove the comment saying that it's not possible. --- include/linux/swap.h |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.17/include/linux/swap.h === --- linux-2.6.17.orig/include/linux/swap.h 2007-10-24 10:05:09.0 -0400 +++ linux-2.6.17/include/linux/swap.h 2007-10-25 22:59:29.0 -0400 @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -282,8 +283,7 @@ static inline void disable_swap_token(vo #define si_swapinfo(val) \ do { (val)->freeswap = (val)->totalswap = 0; } while (0) -/* only sparc can not include linux/pagemap.h in this file - * so leave page_cache_release and release_pages undeclared... */ + #define free_page_and_swap_cache(page) \ page_cache_release(page) #define free_pages_and_swap_cache(pages, nr) \ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
On Fri, 2007-10-26 at 11:36 -0500, Serge E. Hallyn wrote: > Quoting David P. Quigley ([EMAIL PROTECTED]): > > On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote: > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > > > static int task_alloc_security(struct task_struct *task) > > > > > > @@ -2423,14 +2397,22 @@ static const char > > > > > > *selinux_inode_xattr_getsuffix(void) > > > > > > * > > > > > > * Permission check is handled by selinux_inode_getxattr hook. > > > > > > */ > > > > > > -static int selinux_inode_getsecurity(const struct inode *inode, > > > > > > const char *name, void *buffer, size_t size, int err) > > > > > > +static int selinux_inode_getsecurity(const struct inode *inode, > > > > > > + const char *name, > > > > > > + void **buffer) > > > > > > { > > > > > > + u32 size; > > > > > > + int error; > > > > > > struct inode_security_struct *isec = inode->i_security; > > > > > > > > > > > > if (strcmp(name, XATTR_SELINUX_SUFFIX)) > > > > > > return -EOPNOTSUPP; > > > > > > > > > > > > - return selinux_getsecurity(isec->sid, buffer, size); > > > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, > > > > > > &size); > > > > > > > > > > The only other downside I see here is that when the user just passes > > > > > in > > > > > NULL for a buffer, security_sid_to_context() will still > > > > > kmalloc the buffer only to have it immediately freed by > > > > > xattr_getsecurity() through release_secctx(). I trust that isn't seen > > > > > as any major performance impact? > > > > > > > > There is no way to avoid this in the SELinux case. SELinux doesn't store > > > > the sid to string mapping directly. Rather it takes the sid and then > > > > builds the string from fields in the related structure. So regardless > > > > this data is being allocated internally. The only issue I potentially > > > > see is that if someone passes in null expecting just to get the length > > > > we are actually returning a value. However we are changing the semantics > > > > of the function so the old semantics are no longer valid. > > > > > > Hmm? Which semantics are no longer valid? > > > > > > You're changing the semantincs of the in-kernel API, but userspace can > > > still send in NULL to query the length of the buffer needed. So if > > > userspace does two getxattrs, one to get the length, then another to get > > > the value, selinux will be kmallocing twice. > > > > > > For a file manager doing a listing on a huge directory and wanting to > > > list the selinux type, i could see that being a performance issue. Of > > > course they could get around that by sending in a 'reasonably large' > > > buffer for a first try. > > > > > > > Ok lets start this line of thought over again since it has been a while > > since I wrote the patches and got almost no sleep last night. > > > > Your concerns are that we are double allocating buffers one of which we > > are just going to immediately free after a copy. So inside the SELinux > > helper function there was what I saw as generic code for handling > > xattrs. This can be seen in the new function xattr_getsecurity which use > > to be internal to SELinux (selinux_getsecurity). What we are doing is > > grabbing the string which internally is being allocated anyway and if > > our buffer passed in for the copy is null we just goto out returning the > > length and freeing the buffer. So here is our standard null handling > > that we had before. In LSMs where there is no internal allocation to > > handle the getsecurity call this should introduce almost no overhead. > > Ah, thanks, you reminded me of what I was trying to point out. > > SMACK won't do allocations so it's ok. SELinux will do allocations > in any case so it's ok. So in terms of current users it's fine, so I > don't want to complaint too loudly. > > But the now-generic xattr_getsecurity() call passes in 'buffer' from its > stack, with no indication to the LSM of whether userspace passed in NULL > or a buffer. So if there *were* an lsm which had to allocate space to > return data, but didn't want to do so when the user just asked for the > length of the data, then that LSM would be out of luck. > > So would you object to passing in a boolean telling the LSM whether the > user had passed in a buffer in which to return data or not? I don't see why we couldn't add a bool. I am just wondering are there really use-cases where the developer is going to need this though. Unfortunately I'm not all knowing so I can't foresee what us "Insane" security people will do so I think it's a reasonable addition. I'll rebase the patch and make these changes and hopefully have a new revision to the list before the end of the day. Dave > > thanks, > -serge > >
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
Quoting David P. Quigley ([EMAIL PROTECTED]): > On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote: > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > > static int task_alloc_security(struct task_struct *task) > > > > > @@ -2423,14 +2397,22 @@ static const char > > > > > *selinux_inode_xattr_getsuffix(void) > > > > > * > > > > > * Permission check is handled by selinux_inode_getxattr hook. > > > > > */ > > > > > -static int selinux_inode_getsecurity(const struct inode *inode, > > > > > const char *name, void *buffer, size_t size, int err) > > > > > +static int selinux_inode_getsecurity(const struct inode *inode, > > > > > + const char *name, > > > > > + void **buffer) > > > > > { > > > > > + u32 size; > > > > > + int error; > > > > > struct inode_security_struct *isec = inode->i_security; > > > > > > > > > > if (strcmp(name, XATTR_SELINUX_SUFFIX)) > > > > > return -EOPNOTSUPP; > > > > > > > > > > - return selinux_getsecurity(isec->sid, buffer, size); > > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, > > > > > &size); > > > > > > > > The only other downside I see here is that when the user just passes in > > > > NULL for a buffer, security_sid_to_context() will still > > > > kmalloc the buffer only to have it immediately freed by > > > > xattr_getsecurity() through release_secctx(). I trust that isn't seen > > > > as any major performance impact? > > > > > > There is no way to avoid this in the SELinux case. SELinux doesn't store > > > the sid to string mapping directly. Rather it takes the sid and then > > > builds the string from fields in the related structure. So regardless > > > this data is being allocated internally. The only issue I potentially > > > see is that if someone passes in null expecting just to get the length > > > we are actually returning a value. However we are changing the semantics > > > of the function so the old semantics are no longer valid. > > > > Hmm? Which semantics are no longer valid? > > > > You're changing the semantincs of the in-kernel API, but userspace can > > still send in NULL to query the length of the buffer needed. So if > > userspace does two getxattrs, one to get the length, then another to get > > the value, selinux will be kmallocing twice. > > > > For a file manager doing a listing on a huge directory and wanting to > > list the selinux type, i could see that being a performance issue. Of > > course they could get around that by sending in a 'reasonably large' > > buffer for a first try. > > > > Ok lets start this line of thought over again since it has been a while > since I wrote the patches and got almost no sleep last night. > > Your concerns are that we are double allocating buffers one of which we > are just going to immediately free after a copy. So inside the SELinux > helper function there was what I saw as generic code for handling > xattrs. This can be seen in the new function xattr_getsecurity which use > to be internal to SELinux (selinux_getsecurity). What we are doing is > grabbing the string which internally is being allocated anyway and if > our buffer passed in for the copy is null we just goto out returning the > length and freeing the buffer. So here is our standard null handling > that we had before. In LSMs where there is no internal allocation to > handle the getsecurity call this should introduce almost no overhead. Ah, thanks, you reminded me of what I was trying to point out. SMACK won't do allocations so it's ok. SELinux will do allocations in any case so it's ok. So in terms of current users it's fine, so I don't want to complaint too loudly. But the now-generic xattr_getsecurity() call passes in 'buffer' from its stack, with no indication to the LSM of whether userspace passed in NULL or a buffer. So if there *were* an lsm which had to allocate space to return data, but didn't want to do so when the user just asked for the length of the data, then that LSM would be out of luck. So would you object to passing in a boolean telling the LSM whether the user had passed in a buffer in which to return data or not? thanks, -serge > For example in Casey's latest SMACK patch he has a table of the label > strings and he can pass a pointer directly into that table back via the > security_inode_getsecurity hook. > In addition to this completes the lifecycle management that > security_release_secctx attempts to perform. It doesn't seem right that > if we require security_release_secctx to free the data we obtained from > security_inode_getsecurity that the data buffer should be allocated > outside of security_inode_getsecurity. > > I hope this clears up any questions/concerns. > > > -serge > > - > > To unsubscribe from this list: send th
Re: [patch] fs: restore nobh
On Fri 26-10-07 00:49:41, Nick Piggin wrote: > On Thu, Oct 25, 2007 at 09:07:36PM +0200, Jan Kara wrote: > > Hi, > > > > > This is overdue, sorry. Got a little complicated, and I've been away from > > > my filesystem test setup so I didn't want ot send it (lucky, coz I found > > > a bug after more substantial testing). > > > > > > Anyway, RFC? > > Hmm, maybe one comment/question: > > > > > Index: linux-2.6/fs/buffer.c > > > === > > > --- linux-2.6.orig/fs/buffer.c2007-10-08 14:09:35.0 +1000 > > > +++ linux-2.6/fs/buffer.c 2007-10-08 16:45:28.0 +1000 > > ... > > > > > -/* > > > - * Make sure any changes to nobh_commit_write() are reflected in > > > - * nobh_truncate_page(), since it doesn't call commit_write(). > > > - */ > > > -int nobh_commit_write(struct file *file, struct page *page, > > > - unsigned from, unsigned to) > > > +int nobh_write_end(struct file *file, struct address_space *mapping, > > > + loff_t pos, unsigned len, unsigned copied, > > > + struct page *page, void *fsdata) > > > { > > > struct inode *inode = page->mapping->host; > > > - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; > > > + struct buffer_head *head = NULL; > > > + struct buffer_head *bh; > > > > > > - if (page_has_buffers(page)) > > > - return generic_commit_write(file, page, from, to); > > > + if (!PageMappedToDisk(page)) { > > > + if (unlikely(copied < len) && !page_has_buffers(page)) > > > + attach_nobh_buffers(page, head); > > > + if (page_has_buffers(page)) > > > + return generic_write_end(file, mapping, pos, len, > > > copied, page, fsdata); > > > + } > > So we can have a page with attached buffers but we decide to not call > > generic_write_end()... > > > > > SetPageUptodate(page); > > > set_page_dirty(page); > > And we just set the page dirty but we leave buffers clean. So cannot > > subsequent try_to_free_buffers() come, mark page as clean (as buffers > > are clean) and discard the data? > > Hmm, we might just be OK here because set_page_dirty should dirty the > buffers. However, I think you have spotted a mistake here and it would be > better if the generic_write_end test was outside the PageMapedToDisk > check. But set_page_dirty() does not dirty buffers. That is only done by set_page_dirty_buffers(). Or am I missing something? Honza -- Jan Kara <[EMAIL PROTECTED]> SUSE Labs, CR - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote: > Quoting David P. Quigley ([EMAIL PROTECTED]): > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > static int task_alloc_security(struct task_struct *task) > > > > @@ -2423,14 +2397,22 @@ static const char > > > > *selinux_inode_xattr_getsuffix(void) > > > > * > > > > * Permission check is handled by selinux_inode_getxattr hook. > > > > */ > > > > -static int selinux_inode_getsecurity(const struct inode *inode, const > > > > char *name, void *buffer, size_t size, int err) > > > > +static int selinux_inode_getsecurity(const struct inode *inode, > > > > + const char *name, > > > > + void **buffer) > > > > { > > > > + u32 size; > > > > + int error; > > > > struct inode_security_struct *isec = inode->i_security; > > > > > > > > if (strcmp(name, XATTR_SELINUX_SUFFIX)) > > > > return -EOPNOTSUPP; > > > > > > > > - return selinux_getsecurity(isec->sid, buffer, size); > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, > > > > &size); > > > > > > The only other downside I see here is that when the user just passes in > > > NULL for a buffer, security_sid_to_context() will still > > > kmalloc the buffer only to have it immediately freed by > > > xattr_getsecurity() through release_secctx(). I trust that isn't seen > > > as any major performance impact? > > > > There is no way to avoid this in the SELinux case. SELinux doesn't store > > the sid to string mapping directly. Rather it takes the sid and then > > builds the string from fields in the related structure. So regardless > > this data is being allocated internally. The only issue I potentially > > see is that if someone passes in null expecting just to get the length > > we are actually returning a value. However we are changing the semantics > > of the function so the old semantics are no longer valid. > > Hmm? Which semantics are no longer valid? > > You're changing the semantincs of the in-kernel API, but userspace can > still send in NULL to query the length of the buffer needed. So if > userspace does two getxattrs, one to get the length, then another to get > the value, selinux will be kmallocing twice. > > For a file manager doing a listing on a huge directory and wanting to > list the selinux type, i could see that being a performance issue. Of > course they could get around that by sending in a 'reasonably large' > buffer for a first try. > Ok lets start this line of thought over again since it has been a while since I wrote the patches and got almost no sleep last night. Your concerns are that we are double allocating buffers one of which we are just going to immediately free after a copy. So inside the SELinux helper function there was what I saw as generic code for handling xattrs. This can be seen in the new function xattr_getsecurity which use to be internal to SELinux (selinux_getsecurity). What we are doing is grabbing the string which internally is being allocated anyway and if our buffer passed in for the copy is null we just goto out returning the length and freeing the buffer. So here is our standard null handling that we had before. In LSMs where there is no internal allocation to handle the getsecurity call this should introduce almost no overhead. For example in Casey's latest SMACK patch he has a table of the label strings and he can pass a pointer directly into that table back via the security_inode_getsecurity hook. In addition to this completes the lifecycle management that security_release_secctx attempts to perform. It doesn't seem right that if we require security_release_secctx to free the data we obtained from security_inode_getsecurity that the data buffer should be allocated outside of security_inode_getsecurity. I hope this clears up any questions/concerns. > -serge > - > To unsubscribe from this list: send the line "unsubscribe > linux-security-module" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
Quoting Stephen Smalley ([EMAIL PROTECTED]): > On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote: > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > > static int task_alloc_security(struct task_struct *task) > > > > > @@ -2423,14 +2397,22 @@ static const char > > > > > *selinux_inode_xattr_getsuffix(void) > > > > > * > > > > > * Permission check is handled by selinux_inode_getxattr hook. > > > > > */ > > > > > -static int selinux_inode_getsecurity(const struct inode *inode, > > > > > const char *name, void *buffer, size_t size, int err) > > > > > +static int selinux_inode_getsecurity(const struct inode *inode, > > > > > + const char *name, > > > > > + void **buffer) > > > > > { > > > > > + u32 size; > > > > > + int error; > > > > > struct inode_security_struct *isec = inode->i_security; > > > > > > > > > > if (strcmp(name, XATTR_SELINUX_SUFFIX)) > > > > > return -EOPNOTSUPP; > > > > > > > > > > - return selinux_getsecurity(isec->sid, buffer, size); > > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, > > > > > &size); > > > > > > > > The only other downside I see here is that when the user just passes in > > > > NULL for a buffer, security_sid_to_context() will still > > > > kmalloc the buffer only to have it immediately freed by > > > > xattr_getsecurity() through release_secctx(). I trust that isn't seen > > > > as any major performance impact? > > > > > > There is no way to avoid this in the SELinux case. SELinux doesn't store > > > the sid to string mapping directly. Rather it takes the sid and then > > > builds the string from fields in the related structure. So regardless > > > this data is being allocated internally. The only issue I potentially > > > see is that if someone passes in null expecting just to get the length > > > we are actually returning a value. However we are changing the semantics > > > of the function so the old semantics are no longer valid. > > > > Hmm? Which semantics are no longer valid? > > > > You're changing the semantincs of the in-kernel API, but userspace can > > still send in NULL to query the length of the buffer needed. So if > > userspace does two getxattrs, one to get the length, then another to get > > the value, selinux will be kmallocing twice. > > > > For a file manager doing a listing on a huge directory and wanting to > > list the selinux type, i could see that being a performance issue. Of > > course they could get around that by sending in a 'reasonably large' > > buffer for a first try. > > That's what current userland does. libselinux always tries with an > initial buffer first (and usually succeeds), thereby avoiding the second > call to getxattr in the common case. Ok - I figured for doing thousands of these in one directory listing that could waste quite a bit of memory, but since (as i check) selinux has always done a kmalloc for every getsecurity call, I guess it's a fair tradeoff thanks, -serge - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote: > Quoting David P. Quigley ([EMAIL PROTECTED]): > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > static int task_alloc_security(struct task_struct *task) > > > > @@ -2423,14 +2397,22 @@ static const char > > > > *selinux_inode_xattr_getsuffix(void) > > > > * > > > > * Permission check is handled by selinux_inode_getxattr hook. > > > > */ > > > > -static int selinux_inode_getsecurity(const struct inode *inode, const > > > > char *name, void *buffer, size_t size, int err) > > > > +static int selinux_inode_getsecurity(const struct inode *inode, > > > > + const char *name, > > > > + void **buffer) > > > > { > > > > + u32 size; > > > > + int error; > > > > struct inode_security_struct *isec = inode->i_security; > > > > > > > > if (strcmp(name, XATTR_SELINUX_SUFFIX)) > > > > return -EOPNOTSUPP; > > > > > > > > - return selinux_getsecurity(isec->sid, buffer, size); > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, > > > > &size); > > > > > > The only other downside I see here is that when the user just passes in > > > NULL for a buffer, security_sid_to_context() will still > > > kmalloc the buffer only to have it immediately freed by > > > xattr_getsecurity() through release_secctx(). I trust that isn't seen > > > as any major performance impact? > > > > There is no way to avoid this in the SELinux case. SELinux doesn't store > > the sid to string mapping directly. Rather it takes the sid and then > > builds the string from fields in the related structure. So regardless > > this data is being allocated internally. The only issue I potentially > > see is that if someone passes in null expecting just to get the length > > we are actually returning a value. However we are changing the semantics > > of the function so the old semantics are no longer valid. > > Hmm? Which semantics are no longer valid? > > You're changing the semantincs of the in-kernel API, but userspace can > still send in NULL to query the length of the buffer needed. So if > userspace does two getxattrs, one to get the length, then another to get > the value, selinux will be kmallocing twice. > > For a file manager doing a listing on a huge directory and wanting to > list the selinux type, i could see that being a performance issue. Of > course they could get around that by sending in a 'reasonably large' > buffer for a first try. That's what current userland does. libselinux always tries with an initial buffer first (and usually succeeds), thereby avoiding the second call to getxattr in the common case. -- Stephen Smalley National Security Agency - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
On Fri, 2007-10-26 at 11:13 -0400, David P. Quigley wrote: > On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote: > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > > static int task_alloc_security(struct task_struct *task) > > > > > @@ -2423,14 +2397,22 @@ static const char > > > > > *selinux_inode_xattr_getsuffix(void) > > > > > * > > > > > * Permission check is handled by selinux_inode_getxattr hook. > > > > > */ > > > > > -static int selinux_inode_getsecurity(const struct inode *inode, > > > > > const char *name, void *buffer, size_t size, int err) > > > > > +static int selinux_inode_getsecurity(const struct inode *inode, > > > > > + const char *name, > > > > > + void **buffer) > > > > > { > > > > > + u32 size; > > > > > + int error; > > > > > struct inode_security_struct *isec = inode->i_security; > > > > > > > > > > if (strcmp(name, XATTR_SELINUX_SUFFIX)) > > > > > return -EOPNOTSUPP; > > > > > > > > > > - return selinux_getsecurity(isec->sid, buffer, size); > > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, > > > > > &size); > > > > > > > > The only other downside I see here is that when the user just passes in > > > > NULL for a buffer, security_sid_to_context() will still > > > > kmalloc the buffer only to have it immediately freed by > > > > xattr_getsecurity() through release_secctx(). I trust that isn't seen > > > > as any major performance impact? > > > > > > There is no way to avoid this in the SELinux case. SELinux doesn't store > > > the sid to string mapping directly. Rather it takes the sid and then > > > builds the string from fields in the related structure. So regardless > > > this data is being allocated internally. The only issue I potentially > > > see is that if someone passes in null expecting just to get the length > > > we are actually returning a value. However we are changing the semantics > > > of the function so the old semantics are no longer valid. > > > > Hmm? Which semantics are no longer valid? > > > > You're changing the semantincs of the in-kernel API, but userspace can > > still send in NULL to query the length of the buffer needed. So if > > userspace does two getxattrs, one to get the length, then another to get > > the value, selinux will be kmallocing twice. > > > We are changing the semantics of the LSM hook. From a memory allocation > stand point the only difference is who allocates the buffer. Before this > patch we would allocate one outside of the LSM hook and then fill it > (even if null were passed). Since the LSM hook is in a better position > to accurately create this buffer we pushed the allocation to it. > > > For a file manager doing a listing on a huge directory and wanting to > > list the selinux type, i could see that being a performance issue. Of > > course they could get around that by sending in a 'reasonably large' > > buffer for a first try. > > I'd agree but these allocations have always been happening and no one > has seen a problem with it. There should be the same number of > allocations as before the question is where are they done. This statement applies to SELinux. In actuality it is possible for an LSM to have less allocation than before with this model. > > > > > -serge > > - > To unsubscribe from this list: send the line "unsubscribe > linux-security-module" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
On Fri, 2007-10-26 at 10:07 -0500, Serge E. Hallyn wrote: > Quoting David P. Quigley ([EMAIL PROTECTED]): > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > This patch modifies the interface to inode_getsecurity to have > > > > the > > > > function return a buffer containing the security blob and its length via > > > > parameters instead of relying on the calling function to give it an > > > > appropriately sized buffer. Security blobs obtained with this function > > > > should be freed using the release_secctx LSM hook. This alleviates the > > > > problem of the caller having to guess a length and preallocate a buffer > > > > for this function allowing it to be used elsewhere for Labeled NFS. The > > > > patch also removed the unused err parameter. The conversion is similar > > > > to the one performed by Al Viro for the security_getprocattr hook. > > > > > > > > Signed-off-by: David P. Quigley <[EMAIL PROTECTED]> > > > > --- > > > > fs/xattr.c | 26 -- > > > > include/linux/security.h | 27 ++- > > > > include/linux/xattr.h|1 + > > > > mm/shmem.c |3 +-- > > > > security/dummy.c |4 +++- > > > > security/selinux/hooks.c | 38 ++ > > > > > > (Hmm, I was about to ask if this diffstat could be complete, as it > > > doesn't have for instance security/security.c, but I guess this predates > > > the staticlsm patch...) > > > > It wouldn't be much effort to rebase this patch against Linus's latest > > tree. I am assuming that the static lsm patch is in there based on the > > recent discussion on LKML? > > Oh, sorry for the two emails. > > Yeah it's in 2.6.24. So a rebase will be necessary anyway. I was just > saying I was too lazy to find another tree against which to check that > you didn't miss any getsecurity calls (hidden under some exotic .config) > to change their arguments :) I used the LXR to get all uses of getsecurity so I am pretty sure I have them all. > > -serge > - > To unsubscribe from this list: send the line "unsubscribe > linux-security-module" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
On Fri, 2007-10-26 at 10:02 -0500, Serge E. Hallyn wrote: > Quoting David P. Quigley ([EMAIL PROTECTED]): > > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > > static int task_alloc_security(struct task_struct *task) > > > > @@ -2423,14 +2397,22 @@ static const char > > > > *selinux_inode_xattr_getsuffix(void) > > > > * > > > > * Permission check is handled by selinux_inode_getxattr hook. > > > > */ > > > > -static int selinux_inode_getsecurity(const struct inode *inode, const > > > > char *name, void *buffer, size_t size, int err) > > > > +static int selinux_inode_getsecurity(const struct inode *inode, > > > > + const char *name, > > > > + void **buffer) > > > > { > > > > + u32 size; > > > > + int error; > > > > struct inode_security_struct *isec = inode->i_security; > > > > > > > > if (strcmp(name, XATTR_SELINUX_SUFFIX)) > > > > return -EOPNOTSUPP; > > > > > > > > - return selinux_getsecurity(isec->sid, buffer, size); > > > > + error = security_sid_to_context(isec->sid, (char **)buffer, > > > > &size); > > > > > > The only other downside I see here is that when the user just passes in > > > NULL for a buffer, security_sid_to_context() will still > > > kmalloc the buffer only to have it immediately freed by > > > xattr_getsecurity() through release_secctx(). I trust that isn't seen > > > as any major performance impact? > > > > There is no way to avoid this in the SELinux case. SELinux doesn't store > > the sid to string mapping directly. Rather it takes the sid and then > > builds the string from fields in the related structure. So regardless > > this data is being allocated internally. The only issue I potentially > > see is that if someone passes in null expecting just to get the length > > we are actually returning a value. However we are changing the semantics > > of the function so the old semantics are no longer valid. > > Hmm? Which semantics are no longer valid? > > You're changing the semantincs of the in-kernel API, but userspace can > still send in NULL to query the length of the buffer needed. So if > userspace does two getxattrs, one to get the length, then another to get > the value, selinux will be kmallocing twice. We are changing the semantics of the LSM hook. From a memory allocation stand point the only difference is who allocates the buffer. Before this patch we would allocate one outside of the LSM hook and then fill it (even if null were passed). Since the LSM hook is in a better position to accurately create this buffer we pushed the allocation to it. > For a file manager doing a listing on a huge directory and wanting to > list the selinux type, i could see that being a performance issue. Of > course they could get around that by sending in a 'reasonably large' > buffer for a first try. I'd agree but these allocations have always been happening and no one has seen a problem with it. There should be the same number of allocations as before the question is where are they done. > > -serge - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
Quoting David P. Quigley ([EMAIL PROTECTED]): > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > This patch modifies the interface to inode_getsecurity to have the > > > function return a buffer containing the security blob and its length via > > > parameters instead of relying on the calling function to give it an > > > appropriately sized buffer. Security blobs obtained with this function > > > should be freed using the release_secctx LSM hook. This alleviates the > > > problem of the caller having to guess a length and preallocate a buffer > > > for this function allowing it to be used elsewhere for Labeled NFS. The > > > patch also removed the unused err parameter. The conversion is similar > > > to the one performed by Al Viro for the security_getprocattr hook. > > > > > > Signed-off-by: David P. Quigley <[EMAIL PROTECTED]> > > > --- > > > fs/xattr.c | 26 -- > > > include/linux/security.h | 27 ++- > > > include/linux/xattr.h|1 + > > > mm/shmem.c |3 +-- > > > security/dummy.c |4 +++- > > > security/selinux/hooks.c | 38 ++ > > > > (Hmm, I was about to ask if this diffstat could be complete, as it > > doesn't have for instance security/security.c, but I guess this predates > > the staticlsm patch...) > > It wouldn't be much effort to rebase this patch against Linus's latest > tree. I am assuming that the static lsm patch is in there based on the > recent discussion on LKML? Oh, sorry for the two emails. Yeah it's in 2.6.24. So a rebase will be necessary anyway. I was just saying I was too lazy to find another tree against which to check that you didn't miss any getsecurity calls (hidden under some exotic .config) to change their arguments :) -serge - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
Quoting David P. Quigley ([EMAIL PROTECTED]): > On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > > Quoting David P. Quigley ([EMAIL PROTECTED]): > > > static int task_alloc_security(struct task_struct *task) > > > @@ -2423,14 +2397,22 @@ static const char > > > *selinux_inode_xattr_getsuffix(void) > > > * > > > * Permission check is handled by selinux_inode_getxattr hook. > > > */ > > > -static int selinux_inode_getsecurity(const struct inode *inode, const > > > char *name, void *buffer, size_t size, int err) > > > +static int selinux_inode_getsecurity(const struct inode *inode, > > > + const char *name, > > > + void **buffer) > > > { > > > + u32 size; > > > + int error; > > > struct inode_security_struct *isec = inode->i_security; > > > > > > if (strcmp(name, XATTR_SELINUX_SUFFIX)) > > > return -EOPNOTSUPP; > > > > > > - return selinux_getsecurity(isec->sid, buffer, size); > > > + error = security_sid_to_context(isec->sid, (char **)buffer, &size); > > > > The only other downside I see here is that when the user just passes in > > NULL for a buffer, security_sid_to_context() will still > > kmalloc the buffer only to have it immediately freed by > > xattr_getsecurity() through release_secctx(). I trust that isn't seen > > as any major performance impact? > > There is no way to avoid this in the SELinux case. SELinux doesn't store > the sid to string mapping directly. Rather it takes the sid and then > builds the string from fields in the related structure. So regardless > this data is being allocated internally. The only issue I potentially > see is that if someone passes in null expecting just to get the length > we are actually returning a value. However we are changing the semantics > of the function so the old semantics are no longer valid. Hmm? Which semantics are no longer valid? You're changing the semantincs of the in-kernel API, but userspace can still send in NULL to query the length of the buffer needed. So if userspace does two getxattrs, one to get the length, then another to get the value, selinux will be kmallocing twice. For a file manager doing a listing on a huge directory and wanting to list the selinux type, i could see that being a performance issue. Of course they could get around that by sending in a 'reasonably large' buffer for a first try. -serge - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] VFS/Security: Rework inode_getsecurity and callers to return resulting buffer
On Thu, 2007-10-25 at 19:02 -0500, Serge E. Hallyn wrote: > Quoting David P. Quigley ([EMAIL PROTECTED]): > > This patch modifies the interface to inode_getsecurity to have the > > function return a buffer containing the security blob and its length via > > parameters instead of relying on the calling function to give it an > > appropriately sized buffer. Security blobs obtained with this function > > should be freed using the release_secctx LSM hook. This alleviates the > > problem of the caller having to guess a length and preallocate a buffer > > for this function allowing it to be used elsewhere for Labeled NFS. The > > patch also removed the unused err parameter. The conversion is similar > > to the one performed by Al Viro for the security_getprocattr hook. > > > > Signed-off-by: David P. Quigley <[EMAIL PROTECTED]> > > --- > > fs/xattr.c | 26 -- > > include/linux/security.h | 27 ++- > > include/linux/xattr.h|1 + > > mm/shmem.c |3 +-- > > security/dummy.c |4 +++- > > security/selinux/hooks.c | 38 ++ > > (Hmm, I was about to ask if this diffstat could be complete, as it > doesn't have for instance security/security.c, but I guess this predates > the staticlsm patch...) It wouldn't be much effort to rebase this patch against Linus's latest tree. I am assuming that the static lsm patch is in there based on the recent discussion on LKML? > > > 6 files changed, 53 insertions(+), 46 deletions(-) > > > > diff --git a/fs/xattr.c b/fs/xattr.c > > index a44fd92..d45c7ef 100644 > > --- a/fs/xattr.c > > +++ b/fs/xattr.c > > @@ -105,6 +105,29 @@ out: > > EXPORT_SYMBOL_GPL(vfs_setxattr); > > > > ssize_t > > +xattr_getsecurity(struct inode *inode, const char *name, void *value, > > + size_t size) > > +{ > > + void *buffer = NULL; > > + ssize_t len; > > + > > + len = security_inode_getsecurity(inode, name, &buffer); > > + if (len < 0) > > + return len; > > + if (!value || !size) > > + goto out; > > + if (size < len) { > > + len = -ERANGE; > > + goto out; > > + } > > + memcpy(value, buffer, len); > > +out: > > + security_release_secctx(buffer, len); > > This is mighty misleading in -ERANGE case :) I realize that > selinux_release_secctx() ignores len anyway. But given the description > in security.h, I'd say either you need to keep the actual length > allocated and pass that in here, or (probably better) have another patch > remove the second argument from security_release_secctx(). I would consider this a bug on my part. I don't think we can get rid of the len argument to security_release_secctx because it is supposed to be a generic destructor for security blobs. If the module always returned a fixed length blob we could remove it but there is no telling what one of the many new LSMs to come will do. > > > + return len; > > +} > > +EXPORT_SYMBOL_GPL(xattr_getsecurity); > > + > > +ssize_t > > vfs_getxattr(struct dentry *dentry, char *name, void *value, size_t size) > > { > > struct inode *inode = dentry->d_inode; > > @@ -126,8 +149,7 @@ vfs_getxattr(struct dentry *dentry, char *name, void > > *value, size_t size) > > if (!strncmp(name, XATTR_SECURITY_PREFIX, > > XATTR_SECURITY_PREFIX_LEN)) { > > const char *suffix = name + XATTR_SECURITY_PREFIX_LEN; > > - int ret = security_inode_getsecurity(inode, suffix, value, > > -size, error); > > + int ret = xattr_getsecurity(inode, suffix, value, size); > > /* > > * Only overwrite the return value if a security module > > * is actually active. > > diff --git a/include/linux/security.h b/include/linux/security.h > > index 1a15526..8658929 100644 > > --- a/include/linux/security.h > > +++ b/include/linux/security.h > > @@ -391,15 +391,11 @@ struct request_sock; > > * identified by @name for @dentry. > > * Return 0 if permission is granted. > > * @inode_getsecurity: > > - * Copy the extended attribute representation of the security label > > - * associated with @name for @inode into @buffer. @buffer may be > > - * NULL to request the size of the buffer required. @size indicates > > - * the size of @buffer in bytes. Note that @name is the remainder > > - * of the attribute name after the security. prefix has been removed. > > - * @err is the return value from the preceding fs getxattr call, > > - * and can be used by the security module to determine whether it > > - * should try and canonicalize the attribute value. > > - * Return number of bytes used/required on success. > > + * Retrieve a copy of the extended attribute representation of the > > + * security label associated with @name for @inode via @buffer. Note that > > + * @name is the remainder of th
[PATCH] Add an ERR_CAST() function to complement ERR_PTR and co. [try #5.2]
Add an ERR_CAST() function to complement ERR_PTR and co. for the purposes of casting an error entyped as one pointer type to an error of another pointer type whilst making it explicit as to what is going on. This provides a replacement for the ERR_PTR(PTR_ERR(p)) construct. Signed-off-by: David Howells <[EMAIL PROTECTED]> --- include/linux/err.h | 13 + 1 files changed, 13 insertions(+), 0 deletions(-) diff --git a/include/linux/err.h b/include/linux/err.h index 1ab1d44..ec87f31 100644 --- a/include/linux/err.h +++ b/include/linux/err.h @@ -34,6 +34,19 @@ static inline long IS_ERR(const void *ptr) return IS_ERR_VALUE((unsigned long)ptr); } +/** + * ERR_CAST - Explicitly cast an error-valued pointer to another pointer type + * @ptr: The pointer to cast. + * + * Explicitly cast an error-valued pointer to another pointer type in such a + * way as to make it clear that's what's going on. + */ +static inline void *ERR_CAST(const void *ptr) +{ + /* cast away the const */ + return (void *) ptr; +} + #endif #endif /* _LINUX_ERR_H */ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 26 Oct 2007, Neil Brown wrote: > On Thursday October 25, [EMAIL PROTECTED] wrote: > > The patch looks like it makes perfect sense to me. Great, thanks a lot for looking at it, Neil and Pekka. > Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE > without unlocking the page, and this has precisely the effect of: >ClearPageReclaim(); (if the call path was through pageout) >SetPageActive(); (if the call was through shrink_page_list) >unlock_page(); > > With the patch, the ->writepage method does the SetPageActive and the > unlock_page, which on the whole seems cleaner. > > We seem to have lost a call to ClearPageReclaim - I don't know if that > is significant. It doesn't show up in the diff at all, but pageout() already has if (!PageWriteback(page)) { /* synchronous write or broken a_ops? */ ClearPageReclaim(page); } which will clear it since we've never set PageWriteback. I think no harm would come from leaving it set there, since it only takes effect in end_page_writeback (its effect being to let the just written page be moved to the end of the LRU, so that it will then be soon reclaimed), and clear_page_dirty_for_io clears it before coming down this way. But I'd never argue for that: I hate having leftover flags hanging around outside the scope of their relevance. > > Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew > > stole his own secret and used it when concocting ramdisk_writepage. > > Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil > > revealed the secret to the uninitiated in 2.6.17: now, what's the > > appropriate punishment for that? > > Surely the punishment should be for writing hidden undocumented hacks > in the first place! I vote we just make him maintainer for the whole > kernel - that will keep him so busy that he will never have a chance > to do it again :-) That is a splendid retort, which has won you absolution. But it makes me a little sad: that smiley should be a weepy. > > --- 2.6.24-rc1/Documentation/filesystems/vfs.txt2007-10-24 > > 07:15:11.0 +0100 > > +++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.0 > > +0100 > > @@ -567,9 +567,7 @@ struct address_space_operations { > >If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to > >try too hard if there are problems, and may choose to write out > >other pages from the mapping if that is easier (e.g. due to > > - internal dependencies). If it chooses not to start writeout, it > > - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep > > - calling ->writepage on that page. > > + internal dependencies). > > > > It seems that the new requirement is that if the address_space > chooses not to write out the page, it should now call SetPageActive(). > If that is the case, I think it should be explicit in the > documentation - please? No, it's not the case; but you're right that I should add something there, to put an end to the idea. It'll be something along the lines of "You may notice shmem setting PageActive there, but please don't do that; or if you insist, be sure never to do so in the !wbc->for_reclaim case". The PageActive thing is for when a filesystem regrets that it even had a ->writepage (it replicates the behaviour of the writepage == NULL case or the VM_LOCKED SWAP_FAIL case or the !add_to_swap case, delaying the return of this page to writepage for as long as it can). It's done in shmem_writepage because shm_lock (equivalent to VM_LOCKED) is only discovered within that writepage, and no-swap is discovered there too. ramdisk does it too: I've not tried to understand ramdisk as Nick and Eric have, but it used to have no writepage, and would prefer to have no writepage, but appears to need one for some PageUptodate reasons. It's fairly normal for a filesystem to find that for some reason it cannot carry out a writepage on this page right now (in the reclaim case: the sync case demands action, IIRC); so it then simply does set_page_dirty and unlock_page and returns "success". I'll try to condense this down for the Doc when finalizing the patch; which I've still not yet tested properly - thanks for the eyes, but I can't submit it until I've checked in detail that it really gets to do what we think it does. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Distributed storage. Move away from char device ioctls.
Returning back to this, since block based storage, which can act as a shared storage/transport layer, is ready with 5'th release of the DST. My couple of notes on proposed data distribution algorithm in FS. On Sun, Sep 16, 2007 at 03:07:11AM -0400, Kyle Moffett ([EMAIL PROTECTED]) wrote: > >I actually think there is a place for this - and improvements are > >definitely welcome. Even Lustre needs block-device level > >redundancy currently, though we will be working to make Lustre- > >level redundancy available in the future (the problem is WAY harder > >than it seems at first glance, if you allow writeback caches at the > >clients and servers). > > I really think that to get proper non-block-device-level filesystem > redundancy you need to base it on something similar to the GIT > model. Data replication is done in specific-sized chunks indexed by > SHA-1 sum and you actually have a sort of "merge algorithm" for when > local and remote changes differ. The OS would only implement a very > limited list of merge algorithms, IE one of: > > (A) Don't merge, each client gets its own branch and merges are manual > (B) Most recent changed version is made the master every X-seconds/ > open/close/write/other-event. > (C) The tree at X (usually a particular client/server) is always > used as the master when there are conflicts. This looks like a good way to work with offline clients (where I recall Coda), after offline node modified data, it should be merged back to the cluster with different algorithms. Data (supposed to be) written to the failed node during its offline time will be resynced from other nodes when failed one is online, there are no problems and/or special algorithms to be used here. Filesystem replication is not a 100% 'git way' - git tree contains already combined objects - i.e. the last blob for given path does not contain its history, only ready to be used data, while filesystem, especially that one which requires simultaneous write from different threads/nodes, should implement copy-on-write semantics, essentially putting all new data (git commit) to the new location and then collect it from different extents to present a ready file. At least that is how I see the filesystem I'm working on. ... > There's a lot of other technical details which would need resolution > in an actual implementation, but this is enough of a summary to give > you the gist of the concept. Most likely there will be some major > flaw which makes it impossible to produce reliably, but the concept > contains the things I would be interested in for a real "networked > filesystem". Git semantics and copy-on-write has actually quite a lot in common (on high enough level of abstraction), but SHA1 index is not a possible issue in filesystem - even besides amount of data to be hashed before key is ready. Key should also contain enough information about what underlying data is - git does not store that information (tree, blob or whatever) in its keys, since it does not require it. At least that is how I see it to be implemented. Overall I see this new project as a true copy-on-write FS. Thanks. > Cheers, > Kyle Moffett -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] VFS: new fgetattr() file operation
> On Fri, Oct 26, 2007 at 01:10:14AM +0200, Miklos Szeredi wrote: > > > On Wed, Oct 24, 2007 at 05:27:04PM +0200, Miklos Szeredi wrote: > > > > > >> Wouldn't you be better off by attempting to implement an "open > > > > > >> by ino" operation and an operation to get the generation count > > > > > >> for the file and then modifying the network protocol of interest > > > > > >> to use these as identifiers for the file to be manipulated? > > > > > >> > > > > > > > > > > > > You mean an "open by inode" on the userspace API? My guess, it > > > > > > wouldn't get very far. > > > > > > > > > > This isn't a new idea and has been implemented on a variety of > > > > > different systems. > > > > > > > > Like? > > > > > > XFS. > > > > > > 'man open_by_handle' > > > > Doesn't seem widely used, with 600 something google hits. > > from the man page: > > "They are intended for use by a limited set of system utilities such > as backup programs." > > It also gets used by HSMs and so it is current, tested and is not > going away Sure. > > And in this > > old thread Linus is not entirely enthusiastic about the concept: > > > > http://lkml.org/lkml/1999/1/11/244 > > That was "open by inode number", AFAICT. A handle is an opaque > blob that can be an arbitrary length defined by the filesystem. > You have to convert a fd or path to a handle first before you can > use it later, so any filesystem can implement it... > > i.e. it is exactly what this (unanswered) post suggested: > > http://lkml.org/lkml/1999/1/13/186 Well, yeah, if a filesystem doesn't have a global index, the whole path could just be stuffed into the handle. But there's not much point in that, is there? And because the interface bypasses normal access checking on parent directories, it has security implications, making it not generally useful. Specifically, it is not useful for the "unprivileged file server" case that Peter was suggesting. Miklos - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi, On 10/26/07, Neil Brown <[EMAIL PROTECTED]> wrote: > It seems that the new requirement is that if the address_space > chooses not to write out the page, it should now call SetPageActive(). > If that is the case, I think it should be explicit in the > documentation - please? Agreed. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/25/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > @@ -349,10 +349,6 @@ static pageout_t pageout(struct page *pa > res = mapping->a_ops->writepage(page, &wbc); > if (res < 0) > handle_write_error(mapping, page, res); > - if (res == AOP_WRITEPAGE_ACTIVATE) { > - ClearPageReclaim(page); > - return PAGE_ACTIVATE; > - } I don't see ClearPageReclaim added anywhere. Is that done on purpose? Other than that, the patch looks good to me and I think we should stick it into -mm to punish Andrew for his secret hack ;-). Pekka - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html