[Devel] [PATCH 9/9] Document usage of multiple-instances of devpts
From c4596977ca34b9664d97efa8681e6711145a22cf Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 9/9] Document usage of multiple-instances of devpts

Changelog [v2]:
	- Add note indicating strict isolation is not possible unless all
	  mounts of devpts use the 'newinstance' mount option.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 Documentation/filesystems/devpts.txt |  132 ++
 1 files changed, 132 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/devpts.txt

diff --git a/Documentation/filesystems/devpts.txt b/Documentation/filesystems/devpts.txt
new file mode 100644
index 0000000..68dffd8
--- /dev/null
+++ b/Documentation/filesystems/devpts.txt
@@ -0,0 +1,132 @@
+
+To support containers, we now allow multiple instances of devpts filesystem,
+such that indices of ptys allocated in one instance are independent of indices
+allocated in other instances of devpts.
+
+To preserve backward compatibility, this support for multiple instances is
+enabled only if:
+
+	- CONFIG_DEVPTS_MULTIPLE_INSTANCES=y, and
+	- '-o newinstance' mount option is specified while mounting devpts
+
+IOW, devpts now supports both single-instance and multi-instance semantics.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there is no change in behavior and
+this is referred to as the "legacy" mode. In this mode, the new mount options
+(-o newinstance and -o ptmxmode) will be ignored with a 'bogus option' message
+on console.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and devpts is mounted without the
+'newinstance' option (as in current start-up scripts) the new mount binds
+to the initial kernel mount of devpts. This mode is referred to as the
+'single-instance' mode and the current, single-instance semantics are
+preserved, i.e. PTYs are common across the system.
+
+The only difference between this single-instance mode and the legacy mode
+is the presence of a new '/dev/pts/ptmx' node with permissions 0000, which
+can safely be ignored.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and 'newinstance' option is specified,
+the mount is considered to be in the multi-instance mode and a new instance
+of the devpts fs is created. Any ptys created in this instance are independent
+of ptys in other instances of devpts. Like in the single-instance mode, the
+/dev/pts/ptmx node is present. To effectively use the multi-instance mode,
+open of /dev/ptmx must be redirected to '/dev/pts/ptmx' using a symlink or
+bind-mount.
+
+Eg: A container startup script could do the following:
+
+	$ chmod 0666 /dev/pts/ptmx
+	$ rm /dev/ptmx
+	$ ln -s pts/ptmx /dev/ptmx
+	$ ns_exec -cm /bin/bash
+
+	# We are now in new container
+
+	$ umount /dev/pts
+	$ mount -t devpts -o newinstance lxcpts /dev/pts
+	$ sshd -p 1234
+
+where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs
+/bin/bash in the child process. A pty created by the sshd is not visible in
+the original mount of /dev/pts.
+
+User-space changes
+------------------
+
+In multi-instance mode (i.e. '-o newinstance' mount option is specified at least
+once), the following user-space issues should be noted.
+
+1. If -o newinstance mount option is never used, /dev/pts/ptmx can be ignored
+   and no change is needed to system-startup scripts.
+
+2. To effectively use multi-instance mode (i.e. -o newinstance is specified)
+   administrators or startup scripts should "redirect" open of /dev/ptmx to
+   /dev/pts/ptmx using either a bind mount or symlink.
+
+	$ mount -t devpts -o newinstance devpts /dev/pts
+
+   followed by either
+
+	$ rm /dev/ptmx
+	$ ln -s pts/ptmx /dev/ptmx
+	$ chmod 666 /dev/pts/ptmx
+   or
+	$ mount -o bind /dev/pts/ptmx /dev/ptmx
+
+3. The '/dev/ptmx -> pts/ptmx' symlink is the preferred method since it
+   enables better error-reporting and treats both single-instance and
+   multi-instance mounts similarly.
+
+   But this method requires that system-startup scripts set the mode of
+   /dev/pts/ptmx correctly (default mode is 0000). The scripts can set the
+   mode by, either
+
+	- adding ptmxmode mount option to devpts entry in /etc/fstab, or
+	- using 'chmod 0666 /dev/pts/ptmx'
+
+4. If multi-instance mode mount is needed for containers, but the system
+   startup scripts have not yet been updated, container-startup scripts
+   should bind mount /dev/ptmx to /dev/pts/ptmx to avoid breaking
+   single-instance mounts.
+
+   Or, in general, container-startup scripts should use:
+
+	mount -t devpts -o newinstance -o ptmxmode=0666 devpts /dev/pts
+	if [ ! -L /dev/ptmx ]; then
+		mount -o bind /dev/pts/ptmx /dev/ptmx
+	fi
+
+   When all devpts mounts are multi-instance, /dev
[Devel] [PATCH 6/9] Define mknod_ptmx()
From ff0c06fb1878a3c20a6a91d9666d44f56eb94ff7 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 6/9] Define mknod_ptmx()

/dev/ptmx is closely tied to the devpts filesystem. An open of /dev/ptmx
allocates the next pty index and the associated device shows up in the
devpts fs as /dev/pts/n.

With multiple instances of devpts filesystem, during an open of /dev/ptmx
we would be unable to determine which instance of the devpts is being
accessed. So we move the 'ptmx' node into /dev/pts and use the inode of
the 'ptmx' node to identify the superblock and hence the devpts instance.

This patch adds ability for the kernel to internally create the
[ptmx, c, 5:2] device when mounting devpts filesystem. Since the ptmx node
in devpts is new and may surprise some userspace scripts, the default
permissions for the new node are 0000. These permissions can be changed
either using chmod or by remounting with the new '-o ptmxmode=0666' mount
option.

Changelog[v5]:
	- [Serge Hallyn bugfix]: Letting new_inode() assign inode number to
	  ptmx can collide with hand-assigning inode numbers to ptys. So,
	  hand-assign specific inode number to ptmx node also.
	- [Serge Hallyn]: Maybe safer to grab root dentry mutex while creating
	  ptmx node
	- [Bugfix with Serge Hallyn] Replace lookup_one_len() in mknod_ptmx()
	  with d_alloc_name() (lookup during ->get_sb() locks up system). To
	  simplify patchset, fold the ptmx_dentry patch into this.

Changelog[v4]:
	- Change default permissions of pts/ptmx node to 0000.
	- Move code for ptmxmode under #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES.

Changelog[v3]:
	- Rename ptmx_mode to ptmxmode (for consistency with 'newinstance')

Changelog[v2]:
	- [H. Peter Anvin] Remove mknod() system call support and create the
	  ptmx node internally.

Changelog[v1]:
	- Earlier version of this patch enabled creating /dev/pts/tty as
	  well. As pointed out by Al Viro and H. Peter Anvin, that is not
	  really necessary.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |  113 +++--
 1 files changed, 109 insertions(+), 4 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 7ae60aa..bbdd7df 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -27,6 +27,13 @@
 #define DEVPTS_SUPER_MAGIC 0x1cd1
 
 #define DEVPTS_DEFAULT_MODE 0600
+/*
+ * ptmx is a new node in /dev/pts and will be unused in legacy
+ * (single-instance) mode. To prevent surprises in user space, set permissions
+ * of ptmx to 0. Use 'chmod' or remount with '-o ptmxmode' to set meaningful
+ * permissions.
+ */
+#define DEVPTS_DEFAULT_PTMX_MODE 0000
 #define PTMX_MINOR 2
 
 extern int pty_limit; /* Config limit on Unix98 ptys */
@@ -40,10 +47,11 @@ struct pts_mount_opts {
 	uid_t uid;
 	gid_t gid;
 	umode_t mode;
+	umode_t ptmxmode;
 };
 
 enum {
-	Opt_uid, Opt_gid, Opt_mode,
+	Opt_uid, Opt_gid, Opt_mode, Opt_ptmxmode,
 	Opt_err
 };
 
@@ -51,12 +59,16 @@ static match_table_t tokens = {
 	{Opt_uid, "uid=%u"},
 	{Opt_gid, "gid=%u"},
 	{Opt_mode, "mode=%o"},
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+	{Opt_ptmxmode, "ptmxmode=%o"},
+#endif
 	{Opt_err, NULL}
 };
 
 struct pts_fs_info {
 	struct ida allocated_ptys;
 	struct pts_mount_opts mount_opts;
+	struct dentry *ptmx_dentry;
 };
 
 static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
@@ -81,6 +93,7 @@ static int parse_mount_options(char *data, struct pts_mount_opts *opts)
 	opts->uid = 0;
 	opts->gid = 0;
 	opts->mode = DEVPTS_DEFAULT_MODE;
+	opts->ptmxmode = DEVPTS_DEFAULT_PTMX_MODE;
 
 	while ((p = strsep(&data, ",")) != NULL) {
 		substring_t args[MAX_OPT_ARGS];
@@ -109,6 +122,13 @@ static int parse_mount_options(char *data, struct pts_mount_opts *opts)
 				return -EINVAL;
 			opts->mode = option & S_IALLUGO;
 			break;
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+		case Opt_ptmxmode:
+			if (match_octal(&args[0], &option))
+				return -EINVAL;
+			opts->ptmxmode = option & S_IALLUGO;
+			break;
+#endif
 		default:
 			printk(KERN_ERR "devpts: called with bogus options\n");
 			return -EINVAL;
@@ -118,12 +138,93 @@ static int parse_mount_options(char *data, struct pts_mount_opts *opts)
 	return 0;
 }
 
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+static int mknod_
[Devel] [PATCH 8/9] Enable multiple instances of devpts
From 80380a560dfe89dede7df33e9e4360653f9bda14 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 8/9] Enable multiple instances of devpts

To support containers, allow multiple instances of devpts filesystem, such
that indices of ptys allocated in one instance are independent of ptys
allocated in other instances of devpts.

But to preserve backward compatibility, enable this support for multiple
instances only if:

	- CONFIG_DEVPTS_MULTIPLE_INSTANCES is set to Y, and
	- '-o newinstance' mount option is specified while mounting devpts

To use multi-instance mount, a container startup script could:

	$ ns_exec -cm /bin/bash
	$ umount /dev/pts
	$ mount -t devpts -o newinstance lxcpts /dev/pts
	$ mount -o bind /dev/pts/ptmx /dev/ptmx
	$ /usr/sbin/sshd -p 1234

where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs
/bin/bash in the child process. A pty created by the sshd is not visible in
the original mount of /dev/pts.

USER-SPACE-IMPACT:

	- See Documentation/filesystems/devpts.txt (included in next patch)
	  for user-space impact in multi-instance and mixed-mode operation.

TODO:
	- Update mount(8), pts(4) man pages. Highlight impact of not
	  redirecting /dev/ptmx to /dev/pts/ptmx after a multi-instance mount.

Changelog[v6]:
	- [Dave Hansen] Use new get_init_pts_sb() interface
	- [Serge Hallyn] Don't bother displaying 'newinstance' in show_options
	- [Serge Hallyn] Use macros (PARSE_REMOUNT/PARSE_MOUNT) instead of 0/1.
	- [Serge Hallyn] Check error return from get_sb_single() (now
	  get_init_pts_sb())
	- devpts_pty_kill(): don't dput error dentries

Changelog[v5]:
	- Move get_sb_ref() definition to earlier patch
	- Move usage info to Documentation/filesystems/devpts.txt (next patch)
	- Make ptmx node even in init_pts_ns, now that default mode is 0000
	  (defined in earlier patch, enabled here).
	- Cache ptmx dentry and use to update mode during remount
	  (defined in earlier patch, enabled here).
	- Bugfix: explicitly ignore newinstance on remount (if newinstance was
	  specified on remount of initial mount, it would be ignored but
	  /proc/mounts would imply that the option was set)

Changelog[v4]:
	- Update patch description to address H. Peter Anvin's comments
	- Consolidate multi-instance mode code under new config token,
	  CONFIG_DEVPTS_MULTIPLE_INSTANCES.
	- Move usage-details from patch description to
	  Documentation/filesystems/devpts.txt

Changelog[v3]:
	- Rename new mount option to 'newinstance'
	- Create ptmx nodes only in 'newinstance' mounts
	- Bugfix: parse_mount_options() modifies @data but since we need to
	  parse the @data twice (once in devpts_get_sb() and once during
	  do_remount_sb()), parse a local copy of @data in devpts_get_sb().
	  (restructured code in devpts_get_sb() to fix this)

Changelog[v2]:
	- Support both single-mount and multiple-mount semantics and
	  provide '-onewmnt' option to select the semantics.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |  170 ++--
 1 files changed, 163 insertions(+), 7 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 1dfdbf0..f9a9346 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -48,10 +48,11 @@ struct pts_mount_opts {
 	gid_t gid;
 	umode_t mode;
 	umode_t ptmxmode;
+	int newinstance;
 };
 
 enum {
-	Opt_uid, Opt_gid, Opt_mode, Opt_ptmxmode,
+	Opt_uid, Opt_gid, Opt_mode, Opt_ptmxmode, Opt_newinstance,
 	Opt_err
 };
 
@@ -61,6 +62,7 @@ static match_table_t tokens = {
 	{Opt_mode, "mode=%o"},
 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
 	{Opt_ptmxmode, "ptmxmode=%o"},
+	{Opt_newinstance, "newinstance"},
 #endif
 	{Opt_err, NULL}
 };
@@ -78,13 +80,17 @@ static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
 
 static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 {
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
 	if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
 		return inode->i_sb;
-
+#endif
 	return devpts_mnt->mnt_sb;
 }
 
-static int parse_mount_options(char *data, struct pts_mount_opts *opts)
+#define PARSE_MOUNT	0
+#define PARSE_REMOUNT	1
+
+static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 {
 	char *p;
 
@@ -95,6 +101,10 @@ static int parse_mount_options(char *data, struct pts_mount_opts *opts)
 	opts->mode = DEVPTS_DEFAULT_MODE;
 	opts->ptmxmode = DEVPTS_DEFAULT_PTMX_MODE;
 
+	/* newinstance mak
[Devel] [PATCH 7/9] Define get_init_pts_sb()
From 3a2b7147d5aa345ab96d321ffefd326cbc43e24d Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 7/9] Define get_init_pts_sb()

See comments in the function header for details. The new interface will
be used in a follow-on patch.

Changelog [v2]:
	[Dave Hansen] Replace get_sb_ref() in fs/super.c with
	get_init_pts_sb() and make the new interface private to devpts

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   55 -
 1 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index bbdd7df..1dfdbf0 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -305,10 +305,63 @@ fail:
 	return -ENOMEM;
 }
 
+static int compare_init_pts_sb(struct super_block *s, void *p)
+{
+	if (devpts_mnt)
+		return devpts_mnt->mnt_sb == s;
+
+	return 0;
+}
+
+/*
+ * get_init_pts_sb()
+ *
+ * This interface is needed to support multiple namespace semantics in
+ * devpts while preserving backward compatibility of the current 'single-
+ * namespace' semantics. i.e all mounts of devpts without the 'newinstance'
+ * mount option should bind to the initial kernel mount, like
+ * get_sb_single().
+ *
+ * Mounts with 'newinstance' option create a new private namespace.
+ *
+ * But for single-mount semantics, devpts cannot use get_sb_single(),
+ * because get_sb_single()/sget() find and use the super-block from
+ * the most recent mount of devpts. But that recent mount may be a
+ * 'newinstance' mount and get_sb_single() would pick the newinstance
+ * super-block instead of the initial super-block.
+ *
+ * This interface is identical to get_sb_single() except that it
+ * consistently selects the 'single-namespace' superblock even in the
+ * presence of the private namespace (i.e 'newinstance') super-blocks.
+ */
+static int get_init_pts_sb(struct file_system_type *fs_type, int flags,
+		void *data, struct vfsmount *mnt)
+{
+	struct super_block *s;
+	int error;
+
+	s = sget(fs_type, compare_init_pts_sb, set_anon_super, NULL);
+	if (IS_ERR(s))
+		return PTR_ERR(s);
+
+	if (!s->s_root) {
+		s->s_flags = flags;
+		error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+		if (error) {
+			up_write(&s->s_umount);
+			deactivate_super(s);
+			return error;
+		}
+		s->s_flags |= MS_ACTIVE;
+	}
+	do_remount_sb(s, flags, data, 0);
+	return simple_set_mnt(mnt, s);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-	return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+	return get_init_pts_sb(fs_type, flags, data, mnt);
 }
 
 static void devpts_kill_sb(struct super_block *sb)
-- 
1.5.2.5

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
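The selection problem described in the get_init_pts_sb() comment can be sketched in user space. This is only a model under stated assumptions: `struct mock_sb`, `mock_find_sb()` and both compare callbacks are hypothetical names, and the "most recent mount wins" scan order simply follows what the comment says about get_sb_single()/sget(), not the real VFS code.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct super_block. */
struct mock_sb {
	int id;
	int newinstance;	/* non-zero for a private ('newinstance') mount */
};

typedef int (*mock_test_fn)(const struct mock_sb *sb, const void *arg);

/*
 * Model of an sget()-style lookup: walk the known super-blocks, newest
 * first, and return the first one the test callback accepts.
 */
static const struct mock_sb *mock_find_sb(const struct mock_sb *sbs, size_t n,
					  mock_test_fn test, const void *arg)
{
	for (size_t i = n; i-- > 0; )
		if (test(&sbs[i], arg))
			return &sbs[i];
	return NULL;
}

/* Accepts any super-block, so the most recent mount is picked. */
static int mock_compare_single(const struct mock_sb *sb, const void *arg)
{
	(void)sb;
	(void)arg;
	return 1;
}

/* Accepts only the initial super-block, like compare_init_pts_sb(). */
static int mock_compare_init(const struct mock_sb *sb, const void *init)
{
	return sb == init;
}
```

With one initial mount followed by two 'newinstance' mounts, the first callback returns the newest private instance while the second always returns the initial one, which is the behavior the patch relies on.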
[Devel] [PATCH 3/9] Per-mount 'config' object
From fa07e30bf77063b127129d317e91d6dc454ea739 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 3/9] Per-mount 'config' object

With support for multiple mounts of devpts, the 'config' structure really
represents per-mount options rather than config parameters.

Rename 'config' structure to 'pts_mount_opts' and store it in the
super-block.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   49 +
 1 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 6e63db7..e91c15c 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -34,13 +34,13 @@ static DEFINE_MUTEX(allocated_ptys_lock);
 
 static struct vfsmount *devpts_mnt;
 
-static struct {
+struct pts_mount_opts {
 	int setuid;
 	int setgid;
 	uid_t uid;
 	gid_t gid;
 	umode_t mode;
-} config = {.mode = DEVPTS_DEFAULT_MODE};
+};
 
 enum {
 	Opt_uid, Opt_gid, Opt_mode,
@@ -56,6 +56,7 @@ static match_table_t tokens = {
 
 struct pts_fs_info {
 	struct ida allocated_ptys;
+	struct pts_mount_opts mount_opts;
 };
 
 static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
@@ -74,12 +75,14 @@ static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
 	char *p;
+	struct pts_fs_info *fsi = DEVPTS_SB(sb);
+	struct pts_mount_opts *opts = &fsi->mount_opts;
 
-	config.setuid = 0;
-	config.setgid = 0;
-	config.uid = 0;
-	config.gid = 0;
-	config.mode = DEVPTS_DEFAULT_MODE;
+	opts->setuid = 0;
+	opts->setgid = 0;
+	opts->uid = 0;
+	opts->gid = 0;
+	opts->mode = DEVPTS_DEFAULT_MODE;
 
 	while ((p = strsep(&data, ",")) != NULL) {
 		substring_t args[MAX_OPT_ARGS];
@@ -94,19 +97,19 @@ static int devpts_remount(struct super_block *sb, int *flags, char *data)
 		case Opt_uid:
 			if (match_int(&args[0], &option))
 				return -EINVAL;
-			config.uid = option;
-			config.setuid = 1;
+			opts->uid = option;
+			opts->setuid = 1;
 			break;
 		case Opt_gid:
 			if (match_int(&args[0], &option))
 				return -EINVAL;
-			config.gid = option;
-			config.setgid = 1;
+			opts->gid = option;
+			opts->setgid = 1;
 			break;
 		case Opt_mode:
 			if (match_octal(&args[0], &option))
 				return -EINVAL;
-			config.mode = option & S_IALLUGO;
+			opts->mode = option & S_IALLUGO;
 			break;
 		default:
 			printk(KERN_ERR "devpts: called with bogus options\n");
@@ -119,11 +122,14 @@ static int devpts_remount(struct super_block *sb, int *flags, char *data)
 
 static int devpts_show_options(struct seq_file *seq, struct vfsmount *vfs)
 {
-	if (config.setuid)
-		seq_printf(seq, ",uid=%u", config.uid);
-	if (config.setgid)
-		seq_printf(seq, ",gid=%u", config.gid);
-	seq_printf(seq, ",mode=%03o", config.mode);
+	struct pts_fs_info *fsi = DEVPTS_SB(vfs->mnt_sb);
+	struct pts_mount_opts *opts = &fsi->mount_opts;
+
+	if (opts->setuid)
+		seq_printf(seq, ",uid=%u", opts->uid);
+	if (opts->setgid)
+		seq_printf(seq, ",gid=%u", opts->gid);
+	seq_printf(seq, ",mode=%03o", opts->mode);
 
 	return 0;
 }
@@ -143,6 +149,7 @@ static void *new_pts_fs_info(void)
 		return NULL;
 
 	ida_init(&fsi->allocated_ptys);
+	fsi->mount_opts.mode = DEVPTS_DEFAULT_MODE;
 
 	return fsi;
 }
@@ -262,6 +269,8 @@ int devpts_pty_new(struct inode *ptmx_inode, struct tty_struct *tty)
 	struct super_block *sb = pts_sb_from_inode(ptmx_inode);
 	struct inode *inode = new_inode(sb);
 	struct dentry *root = sb->s_root;
+	struct pts_fs_info *fsi = DEVPTS_SB(sb);
+	struct pts_mount_opts *opts = &fsi->mount_opts;
 	char s[12];
 
 	/* We're supposed to be given the slave end of a pty */
@@ -272,10 +281,10 @@ int devpts_pty_new(struct inode *ptmx_inode, struct tty_struct *tty)
 		return -ENOMEM;
 
 	inode->i_ino = number+2;
[Devel] [PATCH 4/9] Extract option parsing to new function
From c4e1a348c2424ce503c24c8a56fa91015d9ee194 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 4/9] Extract option parsing to new function

Move code to parse mount options into a separate function so it can
(later) be shared between mount and remount operations.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   12 +---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index e91c15c..7ae60aa 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -72,11 +72,9 @@ static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 	return devpts_mnt->mnt_sb;
 }
 
-static int devpts_remount(struct super_block *sb, int *flags, char *data)
+static int parse_mount_options(char *data, struct pts_mount_opts *opts)
 {
 	char *p;
-	struct pts_fs_info *fsi = DEVPTS_SB(sb);
-	struct pts_mount_opts *opts = &fsi->mount_opts;
 
 	opts->setuid = 0;
 	opts->setgid = 0;
@@ -120,6 +118,14 @@ static int devpts_remount(struct super_block *sb, int *flags, char *data)
 	return 0;
 }
 
+static int devpts_remount(struct super_block *sb, int *flags, char *data)
+{
+	struct pts_fs_info *fsi = DEVPTS_SB(sb);
+	struct pts_mount_opts *opts = &fsi->mount_opts;
+
+	return parse_mount_options(data, opts);
+}
+
 static int devpts_show_options(struct seq_file *seq, struct vfsmount *vfs)
 {
 	struct pts_fs_info *fsi = DEVPTS_SB(vfs->mnt_sb);
-- 
1.5.2.5
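For readers who want to experiment with the parsing logic outside the kernel, here is a rough user-space sketch of what a shared parse_mount_options() does: reset to defaults, split the option string on commas, and reject anything unknown. The names `mock_mount_opts` and `mock_parse_mount_options()` are hypothetical, and sscanf() stands in for the kernel's match_token()/match_int()/match_octal() helpers, so treat this as an approximation rather than the patch's code.

```c
#include <stdio.h>
#include <string.h>

#define MOCK_DEFAULT_MODE 0600
#define MOCK_ALLUGO       07777	/* same bits as the kernel's S_IALLUGO */

/* Hypothetical user-space stand-in for struct pts_mount_opts. */
struct mock_mount_opts {
	int setuid;
	int setgid;
	unsigned int uid;
	unsigned int gid;
	unsigned int mode;
};

/* Returns 0 on success, -1 on a malformed or unknown ("bogus") option. */
static int mock_parse_mount_options(char *data, struct mock_mount_opts *opts)
{
	/* Every parse starts from the defaults, as in the patch. */
	opts->setuid = 0;
	opts->setgid = 0;
	opts->uid = 0;
	opts->gid = 0;
	opts->mode = MOCK_DEFAULT_MODE;

	while (data != NULL) {
		char *p = data;
		char *comma = strchr(data, ',');
		unsigned int val;
		char tail;

		/* Split in place, the way strsep() would. */
		if (comma != NULL) {
			*comma = '\0';
			data = comma + 1;
		} else {
			data = NULL;
		}

		if (*p == '\0')
			continue;

		/* "== 1" means a value was read and nothing trails it. */
		if (sscanf(p, "uid=%u%c", &val, &tail) == 1) {
			opts->uid = val;
			opts->setuid = 1;
		} else if (sscanf(p, "gid=%u%c", &val, &tail) == 1) {
			opts->gid = val;
			opts->setgid = 1;
		} else if (sscanf(p, "mode=%o%c", &val, &tail) == 1) {
			opts->mode = val & MOCK_ALLUGO;
		} else {
			return -1;
		}
	}
	return 0;
}
```

Like the kernel function, the sketch modifies the string it is given, so a caller that needs to parse the same options twice has to pass a writable copy each time.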
[Devel] [PATCH 1/9] Remove devpts_root global
From 1479f9e238d607403abf5af4296bd5c84a1dc18f Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 1/9] Remove devpts_root global

Remove the 'devpts_root' global variable and find the root dentry using
the super_block. The super-block can be found from the device inode, using
the new wrapper, pts_sb_from_inode().

Changelog: This patch is based on an earlier patchset from Serge Hallyn
	   and Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   29 -
 1 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index a70d5d0..ec33833 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -34,7 +34,6 @@ static DEFINE_IDA(allocated_ptys);
 static DEFINE_MUTEX(allocated_ptys_lock);
 
 static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
 
 static struct {
 	int setuid;
@@ -56,6 +55,14 @@ static match_table_t tokens = {
 	{Opt_err, NULL}
 };
 
+static inline struct super_block *pts_sb_from_inode(struct inode *inode)
+{
+	if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
+		return inode->i_sb;
+
+	return devpts_mnt->mnt_sb;
+}
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
 	char *p;
@@ -142,7 +149,7 @@ devpts_fill_super(struct super_block *s, void *data, int silent)
 	inode->i_fop = &simple_dir_operations;
 	inode->i_nlink = 2;
 
-	devpts_root = s->s_root = d_alloc_root(inode);
+	s->s_root = d_alloc_root(inode);
 	if (s->s_root)
 		return 0;
 
@@ -211,7 +218,9 @@ int devpts_pty_new(struct inode *ptmx_inode, struct tty_struct *tty)
 	struct tty_driver *driver = tty->driver;
 	dev_t device = MKDEV(driver->major, driver->minor_start+number);
 	struct dentry *dentry;
-	struct inode *inode = new_inode(devpts_mnt->mnt_sb);
+	struct super_block *sb = pts_sb_from_inode(ptmx_inode);
+	struct inode *inode = new_inode(sb);
+	struct dentry *root = sb->s_root;
 	char s[12];
 
 	/* We're supposed to be given the slave end of a pty */
@@ -231,15 +240,15 @@ int devpts_pty_new(struct inode *ptmx_inode, struct tty_struct *tty)
 
 	sprintf(s, "%d", number);
 
-	mutex_lock(&devpts_root->d_inode->i_mutex);
+	mutex_lock(&root->d_inode->i_mutex);
 
-	dentry = d_alloc_name(devpts_root, s);
+	dentry = d_alloc_name(root, s);
 	if (!IS_ERR(dentry)) {
 		d_add(dentry, inode);
-		fsnotify_create(devpts_root->d_inode, dentry);
+		fsnotify_create(root->d_inode, dentry);
 	}
 
-	mutex_unlock(&devpts_root->d_inode->i_mutex);
+	mutex_unlock(&root->d_inode->i_mutex);
 
 	return 0;
 }
@@ -256,11 +265,13 @@ struct tty_struct *devpts_get_tty(struct inode *pts_inode, int number)
 void devpts_pty_kill(struct tty_struct *tty)
 {
 	struct inode *inode = tty->driver_data;
+	struct super_block *sb = pts_sb_from_inode(inode);
+	struct dentry *root = sb->s_root;
 	struct dentry *dentry;
 
 	BUG_ON(inode->i_rdev == MKDEV(TTYAUX_MAJOR, PTMX_MINOR));
 
-	mutex_lock(&devpts_root->d_inode->i_mutex);
+	mutex_lock(&root->d_inode->i_mutex);
 
 	dentry = d_find_alias(inode);
 	if (dentry && !IS_ERR(dentry)) {
@@ -269,7 +280,7 @@ void devpts_pty_kill(struct tty_struct *tty)
 		dput(dentry);
 	}
 
-	mutex_unlock(&devpts_root->d_inode->i_mutex);
+	mutex_unlock(&root->d_inode->i_mutex);
 }
 
 static int __init init_devpts_fs(void)
-- 
1.5.2.5
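The rule that pts_sb_from_inode() applies can be tried out in user space with mock types. In this sketch `struct mock_sb`, `struct mock_inode` and `mock_pts_sb_from_inode()` are hypothetical stand-ins, and the initial super-block is passed in explicitly instead of being read from the `devpts_mnt` global; only the magic-number comparison is taken from the patch.

```c
#include <stddef.h>

#define DEVPTS_SUPER_MAGIC 0x1cd1

/* Hypothetical stand-ins for struct super_block and struct inode. */
struct mock_sb {
	unsigned long s_magic;
	const char *name;
};

struct mock_inode {
	struct mock_sb *i_sb;
};

/*
 * Same decision as pts_sb_from_inode(): an inode that lives in a devpts
 * filesystem identifies its own instance; any other inode (for example a
 * /dev/ptmx node on another filesystem) falls back to the initial mount.
 */
static struct mock_sb *mock_pts_sb_from_inode(struct mock_inode *inode,
					      struct mock_sb *init_sb)
{
	if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
		return inode->i_sb;

	return init_sb;
}
```

This is why later patches in the series move the ptmx node into /dev/pts: only an inode that sits inside a devpts mount can name the instance it belongs to.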
[Devel] [PATCH 2/9] Per-mount allocated_ptys
From d12a714cbd541b808a80f9f556fda4a1f3bf4198 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 2/9] Per-mount allocated_ptys

To enable multiple mounts of devpts, 'allocated_ptys' must be a per-mount
variable rather than a global variable. Move 'allocated_ptys' into the
super_block's s_fs_info.

Changelog[v2]:
	Define and use DEVPTS_SB() wrapper.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   55 ++--
 1 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index ec33833..6e63db7 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -30,7 +30,6 @@
 #define PTMX_MINOR 2
 
 extern int pty_limit; /* Config limit on Unix98 ptys */
-static DEFINE_IDA(allocated_ptys);
 static DEFINE_MUTEX(allocated_ptys_lock);
 
 static struct vfsmount *devpts_mnt;
@@ -55,6 +54,15 @@ static match_table_t tokens = {
 	{Opt_err, NULL}
 };
 
+struct pts_fs_info {
+	struct ida allocated_ptys;
+};
+
+static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 {
 	if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
@@ -126,6 +134,19 @@ static const struct super_operations devpts_sops = {
 	.show_options = devpts_show_options,
 };
 
+static void *new_pts_fs_info(void)
+{
+	struct pts_fs_info *fsi;
+
+	fsi = kzalloc(sizeof(struct pts_fs_info), GFP_KERNEL);
+	if (!fsi)
+		return NULL;
+
+	ida_init(&fsi->allocated_ptys);
+
+	return fsi;
+}
+
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -137,9 +158,13 @@ devpts_fill_super(struct super_block *s, void *data, int silent)
 	s->s_op = &devpts_sops;
 	s->s_time_gran = 1;
 
+	s->s_fs_info = new_pts_fs_info();
+	if (!s->s_fs_info)
+		goto fail;
+
 	inode = new_inode(s);
 	if (!inode)
-		goto fail;
+		goto free_fsi;
 	inode->i_ino = 1;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
 	inode->i_blocks = 0;
@@ -155,6 +180,9 @@ devpts_fill_super(struct super_block *s, void *data, int silent)
 
 	printk("devpts: get root dentry failed\n");
 	iput(inode);
+
+free_fsi:
+	kfree(s->s_fs_info);
 fail:
 	return -ENOMEM;
 }
@@ -165,11 +193,19 @@ static int devpts_get_sb(struct file_system_type *fs_type,
 	return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
 }
 
+static void devpts_kill_sb(struct super_block *sb)
+{
+	struct pts_fs_info *fsi = DEVPTS_SB(sb);
+
+	kfree(fsi);
+	kill_anon_super(sb);
+}
+
 static struct file_system_type devpts_fs_type = {
 	.owner = THIS_MODULE,
 	.name = "devpts",
 	.get_sb = devpts_get_sb,
-	.kill_sb = kill_anon_super,
+	.kill_sb = devpts_kill_sb,
 };
 
 /*
@@ -179,16 +215,18 @@ static struct file_system_type devpts_fs_type = {
 
 int devpts_new_index(struct inode *ptmx_inode)
 {
+	struct super_block *sb = pts_sb_from_inode(ptmx_inode);
+	struct pts_fs_info *fsi = DEVPTS_SB(sb);
 	int index;
 	int ida_ret;
 
 retry:
-	if (!ida_pre_get(&allocated_ptys, GFP_KERNEL)) {
+	if (!ida_pre_get(&fsi->allocated_ptys, GFP_KERNEL)) {
 		return -ENOMEM;
 	}
 
 	mutex_lock(&allocated_ptys_lock);
-	ida_ret = ida_get_new(&allocated_ptys, &index);
+	ida_ret = ida_get_new(&fsi->allocated_ptys, &index);
 	if (ida_ret < 0) {
 		mutex_unlock(&allocated_ptys_lock);
 		if (ida_ret == -EAGAIN)
@@ -197,7 +235,7 @@ retry:
 	}
 
 	if (index >= pty_limit) {
-		ida_remove(&allocated_ptys, index);
+		ida_remove(&fsi->allocated_ptys, index);
 		mutex_unlock(&allocated_ptys_lock);
 		return -EIO;
 	}
@@ -207,8 +245,11 @@ retry:
 
 void devpts_kill_index(struct inode *ptmx_inode, int idx)
 {
+	struct super_block *sb = pts_sb_from_inode(ptmx_inode);
+	struct pts_fs_info *fsi = DEVPTS_SB(sb);
+
 	mutex_lock(&allocated_ptys_lock);
-	ida_remove(&allocated_ptys, idx);
+	ida_remove(&fsi->allocated_ptys, idx);
 	mutex_unlock(&allocated_ptys_lock);
 }
 
-- 
1.5.2.5
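The effect of making the allocator per-mount can be illustrated with a small user-space model. A fixed 64-bit bitmap is swapped in here for the kernel's IDA, there is no locking, and `mock_pts_fs_info`, `mock_new_index()` and `mock_kill_index()` are hypothetical names, so this only shows the idea that each instance hands out its own indices.

```c
#include <stdint.h>

#define MOCK_PTY_LIMIT 64	/* arbitrary limit for this sketch */

/* Hypothetical user-space stand-in for struct pts_fs_info. */
struct mock_pts_fs_info {
	uint64_t allocated;	/* bit n set => pty index n is in use */
};

/* Return the lowest free index in this instance, or -1 if it is full. */
static int mock_new_index(struct mock_pts_fs_info *fsi)
{
	for (int i = 0; i < MOCK_PTY_LIMIT; i++) {
		if (!(fsi->allocated & (UINT64_C(1) << i))) {
			fsi->allocated |= UINT64_C(1) << i;
			return i;
		}
	}
	return -1;
}

/* Release an index so that this instance can hand it out again. */
static void mock_kill_index(struct mock_pts_fs_info *fsi, int idx)
{
	if (idx >= 0 && idx < MOCK_PTY_LIMIT)
		fsi->allocated &= ~(UINT64_C(1) << idx);
}
```

Two separately initialized `mock_pts_fs_info` objects both start numbering at 0, which mirrors two devpts instances each exposing their own /dev/pts/0.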
[Devel] [PATCH 0/9] Multiple devpts instances
Enable multiple instances of devpts filesystem so each container can allocate
ptys independently.

User interface:

	Since supporting multiple mounts of devpts can break user-space, this
	feature is enabled only if:

		- CONFIG_DEVPTS_MULTIPLE_INSTANCES=y (new config token), and
		- new mount option, -o newinstance is specified

	If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there should be no change in
	behavior.

	See [PATCH 9/9] - Documentation/filesystems/devpts.txt for detailed
	usage/compatibility information.

	[PATCH 1/9] Remove devpts_root global
	[PATCH 2/9] Per-mount allocated_ptys
	[PATCH 3/9] Per-mount 'config' object
	[PATCH 4/9] Extract option parsing to new function
	[PATCH 5/9] Add DEVPTS_MULTIPLE_INSTANCES config token
	[PATCH 6/9] Define mknod_ptmx()
	[PATCH 7/9] Define get_init_pts_sb()
	[PATCH 8/9] Enable multiple instances of devpts
	[PATCH 9/9] Document usage of multiple-instances of devpts

Implementation notes:

	1. To enable multiple mounts of /dev/pts, (most) devpts interfaces
	   need to know which instance of devpts is being accessed. This
	   patchset uses the 'struct inode' or 'struct tty_struct' of the
	   device being accessed to identify the appropriate devpts instance.
	   Hence the need for the /dev/pts/ptmx bind-mount.

	2. Mount options must be parsed twice during mount (once to determine
	   the mode of mount (single/multi-instance) and once to actually
	   save the options). There does not seem to be an easy way to parse
	   once and reuse (See 'safe_process_mount_opts()' in [PATCH 9/10])

Changelog [v5]:
	- Merge all option-parsing (previously in patches 6,7) into patch 6.
	- Bugfixes (see changelog in patches 6 and 8)
	- Replace get_sb_ref() with devpts-specific get_init_pts_sb() (patch 7)
	- Minor update to devpts.txt documentation (patch 9)

Changelog [v4]:
	- Port to 2008-09-04 ttydev tree (and drop patches that were merged in)
	- Add DEVPTS_MULTIPLE_INSTANCES config token (patch 5) and move new
	  behavior under #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES.
	- Create ptmx node even in initial mount
	- Change default permissions of pts/ptmx node to 0000 (patch 6)
	- Cache ptmx dentry and use to update permissions of ptmx node on
	  remount so legacy mode can use the node with a simpler change to
	  /etc/fstab (patch 7)
	- Move get_sb_ref() helper code to separate patch (patch 8)
	- Add Documentation/filesystems/devpts.txt and describe usage info
	  in that file.

Changelog [v3]:
	- Port to 2008-08-28 ttydev tree
	- Rename new mount options to 'ptmxmode' and 'newinstance'.
	- [Alan Cox] Use tty driver data to identify devpts (this is used to
	  cleanup get_node() in devpts_pty_kill()).
	- [H. Peter Anvin] get_node() cleanup in devpts (which was enabled by
	  the inode/tty parameter to devpts interfaces)
	- Bugfix in multi-mount mode (see Patch 11/11).
	- Executed pty tests in LTP (in both single-instance and
	  multi-instance mode)
	- Should be bisect-safe :-)

Changelog [v2]:
	- New mount option '-o newmnt' added (patch 8/8)
	- Support both single-mount and multi-mount semantics (patch 8/8)
	- Automatically create ptmx node when devpts is mounted (patch 7/8)
	- Extract option parsing code to new function (patch 6/8)
	- Make 'config' params per-mount variables (patch 5/8)
	- Slightly re-ordered existing patches in set (patches 1/8, 2/8)

TODO:
	- Do we need '-o ptmxuid' and '-o ptmxgid' options as well?
	- Update mount(8) man page
	- (Sometime in future) Remove even initial kernel mount of devpts
	- Any other good test suites to test this (besides LTP, sshd).
[Devel] [PATCH 5/9] Add DEVPTS_MULTIPLE_INSTANCES config token
>From 18a468a2f2db8f055bf62882d44d40764e924f3b Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH 5/9] Add DEVPTS_MULTIPLE_INSTANCES config token Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Acked-by: Serge Hallyn <[EMAIL PROTECTED]> --- drivers/char/Kconfig | 11 +++ 1 files changed, 11 insertions(+), 0 deletions(-) diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 700ff96..0d3ea89 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -443,6 +443,17 @@ config UNIX98_PTYS All modern Linux systems use the Unix98 ptys. Say Y unless you're on an embedded system and want to conserve memory. +config DEVPTS_MULTIPLE_INSTANCES + bool "Support multiple instances of devpts" + depends on UNIX98_PTYS + default n + ---help--- + Enable support for multiple instances of devpts filesystem. + If you want to have isolated PTY namespaces (eg: in containers), + say Y here. Otherwise, say N. If enabled, each mount of devpts + filesystem with the '-o newinstance' option will create an + independent PTY namespace. + config LEGACY_PTYS bool "Legacy (BSD) PTY support" default y -- 1.5.2.5 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC][PATCH] 'kill sig -1' must only apply to callers namespace
>From d92b4befe07c6a1e852e4462126a5443342448cd Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Date: Tue, 21 Oct 2008 18:00:01 -0700 Subject: [PATCH] kill sig -1 must only apply to callers namespace Currently "kill -1" kills processes in all namespaces and breaks the isolation of namespaces. Earlier attempt to fix this is discussed at: http://lkml.org/lkml/2008/7/23/148 but nothing seems to have happened since then. This patch uses the simple fix suggested by Oleg Nesterov. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- kernel/signal.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 105217d..4530fc6 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1144,7 +1144,8 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid) struct task_struct * p; for_each_process(p) { - if (p->pid > 1 && !same_thread_group(p, current)) { + if (task_pid_vnr(p) > 1 && + !same_thread_group(p, current)) { int err = group_send_sig_info(sig, info, p); ++count; if (err != -EPERM) -- 1.5.2.5 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH] 'kill sig -1' must only apply to caller's namespace
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH] 'kill sig -1' must only apply to caller's namespace Currently "kill -1" kills processes in all namespaces and breaks the isolation of namespaces. Earlier attempt to fix this was discussed at: http://lkml.org/lkml/2008/7/23/148 As suggested by Oleg Nesterov in that thread, use "task_pid_vnr() > 1" check since task_pid_vnr() returns 0 if process is outside the caller's namespace. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Acked-by: Eric W. Biederman <[EMAIL PROTECTED]> Tested-by: Daniel Hokka Zakrisson <[EMAIL PROTECTED]> --- kernel/signal.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 105217d..4530fc6 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1144,7 +1144,8 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid) struct task_struct * p; for_each_process(p) { - if (p->pid > 1 && !same_thread_group(p, current)) { + if (task_pid_vnr(p) > 1 && + !same_thread_group(p, current)) { int err = group_send_sig_info(sig, info, p); ++count; if (err != -EPERM) -- 1.5.2.5 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
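The effect of the one-line change in the patch above can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions: struct task_model, model_pid_vnr() and count_kill_all_targets() are hypothetical names, each task lives in exactly one pid namespace (nesting, where an ancestor namespace also sees its descendants, is not modelled), and the same_thread_group() check is reduced to a pointer comparison. The one property taken from the changelog is that task_pid_vnr() yields 0 for a task outside the caller's namespace.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a task: one global pid plus a pid in one namespace. */
struct task_model {
	int global_pid;	/* pid as seen from the initial namespace */
	int ns_id;	/* namespace the task lives in (0 = initial) */
	int ns_pid;	/* pid inside that namespace */
};

/*
 * Models the property the patch relies on: the pid of 'p' as seen from
 * the caller's namespace, or 0 if 'p' is not visible there.
 */
int model_pid_vnr(const struct task_model *p, const struct task_model *caller)
{
	return p->ns_id == caller->ns_id ? p->ns_pid : 0;
}

/*
 * Count the tasks a "kill -1" issued by 'caller' would signal, using
 * either the old check (global pid) or the new one (pid in caller's ns).
 */
int count_kill_all_targets(const struct task_model *tasks, size_t n,
			   const struct task_model *caller, int use_vnr)
{
	size_t i;
	int count = 0;

	assert(tasks != NULL && caller != NULL);

	for (i = 0; i < n; i++) {
		const struct task_model *p = &tasks[i];
		int pid = use_vnr ? model_pid_vnr(p, caller) : p->global_pid;

		/* skip init (pid 1), invisible tasks (pid 0) and the caller */
		if (pid > 1 && p != caller)
			count++;
	}
	return count;
}
```

In this model, a caller inside a container reaches host tasks and even its own container-init with the old check, because their global pids are greater than 1. With the namespace-relative pid only its siblings qualify.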
[Devel] Signals to cinit
Signals to container-init

It's been almost a year since we tried to address the signals to
container-init issue. We have two patchsets, both of which address the main
issues and should work for the current 'sysV init' (since init explicitly
ignores signals it does not handle). But both patchsets fail in some corner
cases, and so the approaches have stalled.

Can we choose one of these approaches and clearly define limitations (with
maybe noisy warnings for known corner-cases) ? That way, cinits have an
option of working-around in user-space till a better solution is in place.

Or should we explore other, potentially more expensive/intrusive solutions ?

I have below a brief summary of the two patchsets we tried before and a
very high-level suggestion for a more expensive/intrusive fix.

Eric's patchset:

	https://lists.linux-foundation.org/pipermail/containers/2007-December/009152.html

	- Defines a semantic that drops signals to cinit if handler for
	  signal is SIG_DFL.

	- SIG_DFL signal is ignored even when blocked.

	- Fails with blocked signal if handler is SIG_DFL when blocked.

		cinit: signal set to SIG_DFL
		cinit: block signal
		ancestor sends a fatal signal to cinit
		signal is ignored (since handler is SIG_DFL)

	- cinit uses sigtimedwait() for a fatal signal set to SIG_DFL.
	  The patchset ignores the signal (due to the SIG_DFL).

	- /sbin/init blocks SIGCHLD, execs a new program which then installs
	  a handler for SIGCHLD. But since SIGCHLD == SIG_DFL just after
	  exec(), the SIGCHLD could be missed.

Oleg's patchset:

	Originally posted here:
	http://marc.info/?l=linux-kernel&m=118753610515859

	An updated patch is included in this mail:
	https://lists.linux-foundation.org/pipermail/containers/2007-December/009308.html

	- Fails with blocked signals

		cinit: block fatal signal
		descendant posts signal to cinit
		signal is queued since it was blocked
		cinit sets handler for signal to SIG_DFL
		cinit unblocks signal and terminates even though it's
		from a descendant.

	- Fails with ptraced cinit ?
	- Drops a signal in sigaction() or get_signal_to_deliver() after
	  enqueuing it ("started processing it")

	  (side-note: But isn't there a precedent for it anyway ?
	  get_signal_to_deliver() currently does ignore signals it does not
	  want. sigaction() drops signals that were set to SIG_IGN)

	- To quote Eric, "does not start with a 'solid' definition and can
	  become unmaintainable", but I am not too clear on this comment.

Track ancestor-signals separately: (not implemented)

	Add a second 'sigset_t' to sigpending:

		struct sigpending {
			struct list_head list;
			sigset_t signal;
			sigset_t ancestor_signal;
		};

	'ancestor_signal' is only used for container-inits. The global init
	has no ancestors, so it is not affected.

	If a cinit receives a signal from a descendant, the signal gets added
	to the 'sigpending->signal' set (as is done today). If a cinit
	receives a signal from an ancestor, the signal is added to the
	'ancestor_signal' set.

	When delivering the signal (maybe in get_signal_to_deliver()), the
	cinit can determine if the signal was from an ancestor or a
	descendant and act accordingly.

	If the same signal is received from both ancestor and descendant, it
	would be set in both sets, and we could adopt a policy that the
	ancestor has priority (i.e. the signal from the descendant is
	ignored/dropped).

	Other observations:

	- when queuing the signal, we use the same 'sigpending->list'
	  regardless of the sender's namespace, so the order of processing
	  of signals is unchanged.

	- when dequeuing a signal, dequeue from both sets

	- when checking for a pending signal (eg: sigkill_pending()), check
	  the OR of both sets

	This would be intrusive since we need to replace reads/writes of
	'current->pending.signal' and 'current->signal.shared_pending' with
	wrappers. It may be a bit more expensive at runtime, and it adds a
	new 'sigset_t' to task_struct/signal_struct.

	I can send a small prototype if this makes sense at all.

Other approaches to try ?
Thanks,
Sukadev
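The "track ancestor-signals separately" idea from the mail above can be sketched as a small user-space model. Nothing here is kernel code: struct pending_model and the model_*() helpers are hypothetical, a plain 64-bit mask stands in for sigset_t, and the "ancestor has priority" rule is just the policy floated in the mail, not a settled design.

```c
#include <assert.h>
#include <stdint.h>

/* Toy pending-signal state: one bit per signal number (1..63). */
struct pending_model {
	uint64_t signal;		/* sent from the same or a descendant ns */
	uint64_t ancestor_signal;	/* sent from an ancestor ns */
};

static uint64_t sigbit(int sig)
{
	assert(sig > 0 && sig < 64);
	return (uint64_t)1 << sig;
}

/* Record 'sig' in the set that matches the sender's namespace. */
void model_queue(struct pending_model *p, int sig, int from_ancestor)
{
	if (from_ancestor)
		p->ancestor_signal |= sigbit(sig);
	else
		p->signal |= sigbit(sig);
}

/* "Is it pending?" checks look at the OR of both sets. */
int model_is_pending(const struct pending_model *p, int sig)
{
	return ((p->signal | p->ancestor_signal) & sigbit(sig)) != 0;
}

/*
 * Dequeue 'sig' from both sets. Returns 1 if an ancestor sent it
 * (ancestor has priority), 0 if only a descendant did, -1 if the
 * signal was not pending at all.
 */
int model_dequeue(struct pending_model *p, int sig)
{
	int from_ancestor;

	if (!model_is_pending(p, sig))
		return -1;

	from_ancestor = (p->ancestor_signal & sigbit(sig)) != 0;
	p->signal &= ~sigbit(sig);
	p->ancestor_signal &= ~sigbit(sig);
	return from_ancestor;
}
```

The sketch makes the cost visible: every read of the pending mask turns into an OR of two masks, which is why the mail describes the approach as intrusive.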
[Devel] Re: ptem01 LTP failure in ttydev-0909
Sorry, this was buried in my inbox... Subrata Modak [EMAIL PROTECTED] wrote: | Hi Sukadev, | | On Thu, 2008-09-18 at 21:14 -0700, [EMAIL PROTECTED] wrote: | > Alan Cox [EMAIL PROTECTED] wrote: | > | > The test changes the window size using the slave-fd and expects that | > | > it won't affect the window-size on master-fd. With this change, we | > | > return the slave's window size and test fails. | > | | > | I've no idea why anyone would have thought the existing behaviour was | > | correct. The pty/tty pair code tries to share the size and other | > | information at all times and the old test was I think verifying a bug | > | existed. | > | | > | Unless anyone can cite anything to show otherwise anyway ? | > | > Subrata | > | > We are referring to the last window size check in test2() of | > testcases/kernel/pty/ptem01.c. This check will cause the test | > to fail when some of the planned ttydev changes are merged. | > | > Would you happen to know if the check is really required or if | > it should be dropped ? | | I would want the test to remain there, but introduce some checkings | before running the test. As test2() is valid under present | circumstances, we should retain it as people will keep using LTP on | lower kernels. Just to be clear, the entire test2() is not broken. Only the last part (see patch below) Other parts of test2() should be fine even with new changes. | | Having said that, i would like to come with a solution where test2() of | testcases/kernel/pty/ptem01.c is not run after the planned ttydev | changes are merged. Something compile/run time checking to either not to | build that part of code and run it. Can we do something like that by | checking some glibc/kernel exported definitions ? Other than the kernel version when the changes are merged, I am not sure there is a way. Besides, it is not clear which assertion that part of test2() is testing and if it is even needed for older kernels. 
Here is the part of test2() I am referring to: --- testcases/kernel/pty/ptem01.c | 12 1 file changed, 12 deletions(-) Index: ltp-full-20071031/testcases/kernel/pty/ptem01.c === --- ltp-full-20071031.orig/testcases/kernel/pty/ptem01.c2008-11-01 13:30:42.977954127 -0700 +++ ltp-full-20071031/testcases/kernel/pty/ptem01.c 2008-11-01 13:31:41.439427078 -0700 @@ -238,18 +238,6 @@ test2(void) tst_exit(); } - if (ioctl(masterfd, TIOCGWINSZ, &wsz) != 0) { - tst_resm(TFAIL,"TIOCGWINSZ"); - tst_exit(); - } - - if (wsz.ws_row == wsz2.ws_row || wsz.ws_col == wsz2.ws_col || - wsz.ws_xpixel == wsz2.ws_xpixel || - wsz.ws_ypixel == wsz2.ws_ypixel) { - tst_resm(TFAIL, "unexpected window size returned"); - tst_exit(); - } - if (close(slavefd) != 0) { tst_resm(TBROK,"close"); tst_exit(); ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
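For anyone who wants to see which behaviour a given kernel has, the check removed above can be turned into a small stand-alone probe. This is a sketch rather than a replacement LTP test: winsize_shared() is a hypothetical name, the row/column values are arbitrary, and the pty is opened with the Linux-specific TIOCSPTLCK and TIOCGPTN ioctls on /dev/ptmx instead of the helpers ptem01.c uses.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Set a window size through the slave and read it back through the master.
 * Returns 1 if the master reports the slave's size, 0 if it does not,
 * and -1 if no pty pair could be set up.
 */
int winsize_shared(void)
{
	struct winsize set, got;
	char name[32];
	unsigned int ptn;
	int unlock = 0;
	int masterfd, slavefd, n, ret = -1;

	masterfd = open("/dev/ptmx", O_RDWR | O_NOCTTY);
	if (masterfd < 0)
		return -1;

	/* Linux-specific: unlock the slave and ask for its index. */
	if (ioctl(masterfd, TIOCSPTLCK, &unlock) != 0 ||
	    ioctl(masterfd, TIOCGPTN, &ptn) != 0)
		goto out_master;

	n = snprintf(name, sizeof(name), "/dev/pts/%u", ptn);
	assert(n > 0 && n < (int)sizeof(name));

	slavefd = open(name, O_RDWR | O_NOCTTY);
	if (slavefd < 0)
		goto out_master;

	memset(&set, 0, sizeof(set));
	memset(&got, 0, sizeof(got));
	set.ws_row = 37;
	set.ws_col = 101;
	if (ioctl(slavefd, TIOCSWINSZ, &set) != 0 ||
	    ioctl(masterfd, TIOCGWINSZ, &got) != 0)
		goto out_slave;

	ret = (got.ws_row == set.ws_row && got.ws_col == set.ws_col);

out_slave:
	close(slavefd);
out_master:
	close(masterfd);
	return ret;
}
```

A return value of 1 corresponds to the shared-size behaviour Alan describes, and 0 to what the removed part of test2() expected.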
[Devel] Re: Signals to cinit
Oleg Nesterov [EMAIL PROTECTED] wrote:
| (lkml cced because containers list's archive is not useable)

Hmm. What do you mean by not usable ? I see your email here:

	https://lists.linux-foundation.org/pipermail/containers/2008-November/014152.html

|
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, [EMAIL PROTECTED] wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) anyway.
| >
| > Now, let's forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".

Yes, afaics, we just need to pass one extra bit of information per signal
(whether or not sender is in ancestor-ns) from sender to receiver.

| > Or something. yes, sys_rt_sigqueueinfo() is problematic...

Yes, if user-space sets si_pid to 0. Can we change sys_rt_sigqueueinfo() to:

	if (!info->si_pid)
		info->si_pid = getpid();

or would that change semantics adversely ? How about putting this under
CONFIG_PID_NS or your CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)

| >
| > Now, copy_process(CLONE_NEWPID) sets child->signal |= SIGNAL_UNKILLABLE, this
| > protects cinit from unwanted signals. Then we change get_signal_to_deliver()
| >
| > -	if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
| > +	if (unlikely(signal->flags & SIGNAL_UNKILLABLE) && !siginfo_from_parent_ns(info)
| >
| > and now we can kill cinit from parent ns. This needs more checks if we want
| > to stop/strace it, but perhaps this is enough for the start. Note that we
| > do not need to change complete_signal(), at least for now, the code under
| > "if (sig_fatal(p, sig)" is just optimization.
| >
| >
| > So, afaics, the only real problem is how we can handle the case when
| > __sigqueue_alloc() fails. I think for the start we can just return
| > -ENOMEM in this case (when from_parent_ns == T). Then we can improve
| > this behaviour. We can change complete_signal() to ensure that the
| > fatal signal from the upper ns always kills cinit, and in this case
| > we ignore the failed __sigqueue_alloc(). This way at least SIGKILL
| > always works.
| >
| > Yes, this is not perfect, and it is very possible I missed something
| > else. But simple.

I agree

|
| But how can send_signal() know that the signal comes from the upper ns?
| This is not trivial, we can't blindly use current to check. The signal
| can be sent from irq/workqueue/etc.

You mean the in_interrupt() check we had in earlier patchset would not
be enough ?

|
| Perhaps we can start with something like the patch below. Not that I like
| it very much though. We should really place this code under
| CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)

CONFIG_PID_NS ?
[Devel] Re: Signals to cinit
Oleg Nesterov [EMAIL PROTECTED] wrote:
| (lkml cced because containers list's archive is not useable)
|
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, [EMAIL PROTECTED] wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) anyway.

Yes, this was in both the patchsets we reviewed last year :-)
I can send this fix out independently.

| >
| > Now, let's forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".
| > Or something. yes, sys_rt_sigqueueinfo() is problematic...

Also, what happens if a fatal signal is first received from a descendant
and, while that is still pending, the same signal is received from an
ancestor ns ? Won't the second one be ignored by legacy_queue() for the
non-rt case ?

Of course, this is a new scenario, specific to containers, and we may be
able to define the policy without changing semantics.
[Devel] [RFC][PATCH 1/7] Factor out code to allocate pidmap page
From: Sukadev Bhattiprolu Signed-off-by: Sukadev Bhattiprolu --- kernel/pid.c | 43 --- 1 files changed, 28 insertions(+), 15 deletions(-) diff --git a/kernel/pid.c b/kernel/pid.c index b2e5f78..c0aaebe 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid) atomic_inc(&map->nr_free); } +static int alloc_pidmap_page(struct pidmap *map) +{ + void *page; + + if (likely(map->page)) + return 0; + + page = kzalloc(PAGE_SIZE, GFP_KERNEL); + + /* +* Free the page if someone raced with us installing it: +*/ + spin_lock_irq(&pidmap_lock); + if (map->page) + kfree(page); + else + map->page = page; + spin_unlock_irq(&pidmap_lock); + + if (unlikely(!map->page)) + return -1; + + return 0; +} + static int alloc_pidmap(struct pid_namespace *pid_ns) { int i, offset, max_scan, pid, last = pid_ns->last_pid; @@ -134,21 +159,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; for (i = 0; i <= max_scan; ++i) { - if (unlikely(!map->page)) { - void *page = kzalloc(PAGE_SIZE, GFP_KERNEL); - /* -* Free the page if someone raced with us -* installing it: -*/ - spin_lock_irq(&pidmap_lock); - if (map->page) - kfree(page); - else - map->page = page; - spin_unlock_irq(&pidmap_lock); - if (unlikely(!map->page)) - break; - } + if (alloc_pidmap_page(map)) + break; + if (likely(atomic_read(&map->nr_free))) { do { if (!test_and_set_bit(offset, map->page)) { -- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
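The pattern factored out by this patch (allocate outside the lock, install under the lock, free the copy if someone else won the race) can be tried in user space as well. The following is a sketch with assumed names: struct pidmap_model, pidmap_model_lock and model_alloc_pidmap_page() are hypothetical, a pthread mutex stands in for pidmap_lock, and calloc() stands in for kzalloc(). Unlike the kernel fast path, the first check here also takes the lock, since an unlocked read would be a data race in portable C.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096

struct pidmap_model {
	void *page;
};

static pthread_mutex_t pidmap_model_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Lazily install a zeroed page in 'map'. The allocation itself happens
 * outside the lock; whoever loses the race frees its own copy.
 * Returns 0 on success, -1 if no page could be installed.
 */
int model_alloc_pidmap_page(struct pidmap_model *map)
{
	void *page;
	int installed;

	assert(map != NULL);

	pthread_mutex_lock(&pidmap_model_lock);
	installed = (map->page != NULL);
	pthread_mutex_unlock(&pidmap_model_lock);
	if (installed)
		return 0;

	page = calloc(1, MODEL_PAGE_SIZE);

	pthread_mutex_lock(&pidmap_model_lock);
	if (map->page)
		free(page);		/* someone raced with us installing it */
	else
		map->page = page;	/* stays NULL if calloc() failed */
	installed = (map->page != NULL);
	pthread_mutex_unlock(&pidmap_model_lock);

	return installed ? 0 : -1;
}
```

As in the patch, a failed allocation is not reported at the point of failure. It only surfaces if the page is still missing after the locked section, so a concurrent successful installer rescues the caller.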
[Devel] [RFC][PATCH 5/7] Add target_pids parameter to copy_process()
From: Sukadev Bhattiprolu The new parameter will be used in a follow-on patch when clone_with_pids() is implemented. Signed-off-by: Sukadev Bhattiprolu --- kernel/fork.c |7 --- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index d2d69d3..373411e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, unsigned long stack_size, int __user *child_tidptr, struct pid *pid, + pid_t *target_pids, int trace) { int retval; struct task_struct *p; int cgroup_callbacks_done = 0; - pid_t *target_pids = NULL; if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu) struct pt_regs regs; task = copy_process(CLONE_VM, 0, idle_regs(®s), 0, NULL, - &init_struct_pid, 0); + &init_struct_pid, NULL, 0); if (!IS_ERR(task)) init_idle(task, cpu); @@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; + pid_t *target_pids = NULL; /* * Do some preliminary argument and permissions checking before we @@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags, trace = tracehook_prepare_clone(clone_flags); p = copy_process(clone_flags, stack_start, regs, stack_size, -child_tidptr, NULL, trace); +child_tidptr, NULL, target_pids, trace); /* * Do this prior waking up the new thread - the thread pointer * might get invalid after that point, if the thread exits quickly. -- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC][PATCH 3/7] Add target_pid parameter to alloc_pidmap()
From: Sukadev Bhattiprolu Signed-off-by: Sukadev Bhattiprolu --- kernel/pid.c | 28 ++-- 1 files changed, 26 insertions(+), 2 deletions(-) diff --git a/kernel/pid.c b/kernel/pid.c index fd72ad9..93406c6 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -147,12 +147,36 @@ static int alloc_pidmap_page(struct pidmap *map) return 0; } -static int alloc_pidmap(struct pid_namespace *pid_ns) +static int set_pidmap(struct pid_namespace *pid_ns, int pid) +{ + int offset; + struct pidmap *map; + + if (pid >= pid_max) + return -EINVAL; + + offset = pid & BITS_PER_PAGE_MASK; + map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; + + if (alloc_pidmap_page(map)) + return -ENOMEM; + + if (test_and_set_bit(offset, map->page)) + return -EBUSY; + + atomic_dec(&map->nr_free); + return pid; +} + +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid) { int i, offset, max_scan, pid, last = pid_ns->last_pid; struct pidmap *map; int rc = -EAGAIN; + if (target_pid) + return set_pidmap(pid_ns, target_pid); + pid = last + 1; if (pid >= pid_max) pid = RESERVED_PIDS; @@ -269,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns) tmp = ns; for (i = ns->level; i >= 0; i--) { - nr = alloc_pidmap(tmp); + nr = alloc_pidmap(tmp, 0); if (nr < 0) goto out_free; -- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC][PATCH 4/7] Add target_pids parameter to alloc_pid()
From: Sukadev Bhattiprolu This parameter is currently NULL, but will be used in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu --- include/linux/pid.h |2 +- kernel/fork.c |3 ++- kernel/pid.c|9 +++-- 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/include/linux/pid.h b/include/linux/pid.h index 49f1c2f..914185d 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr); extern struct pid *find_ge_pid(int nr, struct pid_namespace *); int next_pidmap(struct pid_namespace *pid_ns, int last); -extern struct pid *alloc_pid(struct pid_namespace *ns); +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids); extern void free_pid(struct pid *pid); /* diff --git a/kernel/fork.c b/kernel/fork.c index f8411a8..d2d69d3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, int retval; struct task_struct *p; int cgroup_callbacks_done = 0; + pid_t *target_pids = NULL; if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_io; if (pid != &init_struct_pid) { - pid = alloc_pid(p->nsproxy->pid_ns); + pid = alloc_pid(p->nsproxy->pid_ns, target_pids); if (IS_ERR(pid)) { retval = PTR_ERR(pid); goto bad_fork_cleanup_io; diff --git a/kernel/pid.c b/kernel/pid.c index 93406c6..4b2373a 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -279,13 +279,14 @@ void free_pid(struct pid *pid) call_rcu(&pid->rcu, delayed_put_pid); } -struct pid *alloc_pid(struct pid_namespace *ns) +struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids) { struct pid *pid; enum pid_type type; int i, nr; struct pid_namespace *tmp; struct upid *upid; + int tpid; pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); if (!pid) @@ -293,7 +294,11 @@ struct pid *alloc_pid(struct pid_namespace 
*ns) tmp = ns; for (i = ns->level; i >= 0; i--) { - nr = alloc_pidmap(tmp, 0); + tpid = 0; + if (target_pids) + tpid = target_pids[i]; + + nr = alloc_pidmap(tmp, tpid); if (nr < 0) goto out_free; -- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
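The per-level loop added to alloc_pid() can be pictured with the following toy model. It is a sketch, not the kernel's data structures: struct ns_model reduces each pid namespace to a "next pid" counter, model_alloc_pid() is a hypothetical name, and collisions (the -EBUSY case) are not modelled. As in the patch, target_pids is indexed by namespace level, with index 0 being the outermost namespace.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 4

/* Per-namespace allocator state, reduced to "next pid to hand out". */
struct ns_model {
	int next_pid;
};

/*
 * Fill nrs[level..0] for a new task whose pid namespace is at depth
 * 'level'. target_pids may be NULL (every pid is chosen automatically);
 * an entry of 0 means "choose automatically" for that level.
 */
void model_alloc_pid(struct ns_model *ns, int level,
		     const int *target_pids, int *nrs)
{
	int i;

	assert(ns != NULL && nrs != NULL);
	assert(level >= 0 && level < MODEL_MAX_LEVEL);

	for (i = level; i >= 0; i--) {
		int tpid = target_pids ? target_pids[i] : 0;

		nrs[i] = tpid ? tpid : ns[i].next_pid++;
	}
}
```

Passing NULL reproduces today's behaviour at every level, which is why the existing callers in this patch can stay unchanged.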
[Devel] [RFC][PATCH 6/7] Define do_fork_with_pids()
From: Sukadev Bhattiprolu do_fork_with_pids() is same as do_fork(), except that it takes an additional, target_pids, parameter. This parameter, currently unused, specifies the target_pids of the process in each of its pid namespaces. Signed-off-by: Sukadev Bhattiprolu --- include/linux/sched.h |1 + kernel/fork.c | 17 ++--- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index b4c38bc..2173df1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1995,6 +1995,7 @@ extern int disallow_signal(int); extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t *target_pids); struct task_struct *fork_idle(int); extern void set_task_comm(struct task_struct *tsk, char *from); diff --git a/kernel/fork.c b/kernel/fork.c index 373411e..912d008 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1340,17 +1340,17 @@ struct task_struct * __cpuinit fork_idle(int cpu) * It copies the process, and if successful kick-starts * it and waits for it to finish using the VM if required. 
*/ -long do_fork(unsigned long clone_flags, +long do_fork_with_pids(unsigned long clone_flags, unsigned long stack_start, struct pt_regs *regs, unsigned long stack_size, int __user *parent_tidptr, - int __user *child_tidptr) + int __user *child_tidptr, + pid_t *target_pids) { struct task_struct *p; int trace = 0; long nr; - pid_t *target_pids = NULL; /* * Do some preliminary argument and permissions checking before we @@ -1448,6 +1448,17 @@ long do_fork(unsigned long clone_flags, return nr; } +long do_fork(unsigned long clone_flags, + unsigned long stack_start, + struct pt_regs *regs, + unsigned long stack_size, + int __user *parent_tidptr, + int __user *child_tidptr) +{ + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size, + parent_tidptr, child_tidptr, NULL); +} + #ifndef ARCH_MIN_MMSTRUCT_ALIGN #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif -- 1.5.2.5 ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC][PATCH 7/7] Define clone_with_pids syscall
From: Sukadev Bhattiprolu

clone_with_pids() is the same as clone(), except that it takes a
'target_pid_set' parameter which lets the caller choose a specific pid number
for the child process in each of the child process's pid namespaces. This
system call would be needed to implement Checkpoint/Restart (i.e. after a
checkpoint, restart a process with its original pids).

Call clone_with_pids as follows:

	pid_t pids[] = { 0, 77, 99 };
	struct target_pid_set pid_set;

	pid_set.num_pids = sizeof(pids) / sizeof(pids[0]);
	pid_set.target_pids = pids;

	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign the next available pid to the process in init_pid_ns. But the kernel
will assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2.
If either 77 or 99 is taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.

It's mostly an exploratory patch seeking feedback on the interface.

NOTE:
	Compared to clone(), clone_with_pids() needs to pass in two more
	pieces of information:

		- number of pids in the set
		- user buffer containing the list of pids.

	But since clone() already takes 5 parameters, use a 'struct
	target_pid_set'.

TODO:
	- Gently tested.
	- May need additional sanity checks in check_target_pids()
	- Allow CLONE_NEWPID with clone_with_pids() (ensure target-pid in
	  the namespace is either 1 or 0).
Signed-off-by: Sukadev Bhattiprolu --- arch/x86/include/asm/syscalls.h|1 + arch/x86/include/asm/unistd_32.h |1 + arch/x86/kernel/entry_32.S |1 + arch/x86/kernel/process_32.c | 91 arch/x86/kernel/syscall_table_32.S |1 + include/linux/types.h |5 ++ 6 files changed, 100 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 7043408..1fdc149 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *); /* kernel/process_32.c */ int sys_fork(struct pt_regs *); int sys_clone(struct pt_regs *); +int sys_clone_with_pids(struct pt_regs *); int sys_vfork(struct pt_regs *); int sys_execve(struct pt_regs *); diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index 6e72d74..90f906f 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -340,6 +340,7 @@ #define __NR_inotify_init1 332 #define __NR_preadv333 #define __NR_pwritev 334 +#define __NR_clone_with_pids 335 #ifdef __KERNEL__ diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index c929add..ee92b0d 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -707,6 +707,7 @@ ptregs_##name: \ PTREGSCALL(iopl) PTREGSCALL(fork) PTREGSCALL(clone) +PTREGSCALL(clone_with_pids) PTREGSCALL(vfork) PTREGSCALL(execve) PTREGSCALL(sigaltstack) diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 76f8f84..66ac6f7 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -445,6 +445,97 @@ int sys_clone(struct pt_regs *regs) return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr); } +static int check_target_pids(unsigned long clone_flags, + struct target_pid_set *pid_setp) +{ + /* +* CLONE_NEWPID implies pid == 1 +* +* TODO: Maybe this should be more fine-grained (i.e would we want +* to have a container-init have a 
specific pid in ancestor +* namespaces ?) +*/ + if (clone_flags & CLONE_NEWPID) + return -EINVAL; + + /* number of pids must match current nesting level of pid ns */ + if (pid_setp->num_pids > task_pid(current)->level + 1) + return -EINVAL; + + /* TODO: More sanity checks ? */ + + return 0; +} + +static pid_t *copy_target_pids(unsigned long clone_flags, void __user *upid_setp) +{ + int rc; + int size; + unsigned long clone_flags; + pid_t __user *utarget_pids; + pid_t *target_pids; + struct target_pid_set pid_set; + + if (copy_from_user(pid_setp, upid_setp, sizeof(*pid_setp))) + return ERR_PTR(-EFAULT); + + size = pid_setp->num_pids * sizeof(pid_t); + utarget_pids = pid_setp->target_pids; + + target_pids = kzalloc(size, GFP_KERNEL); + if (!target_pids) + return ERR_PTR(-ENOMEM); + + rc = -EFAU
[Devel] [RFC][PATCH 2/7] Have alloc_pidmap() return actual error code
From: Sukadev Bhattiprolu

alloc_pidmap() can fail either because all pid numbers are in use or
we can't allocate memory. With support for setting a specific pid number,
alloc_pidmap() would also fail if either the given pid number is invalid
or in use.

Rather than have caller assume -ENOMEM, have alloc_pidmap() return the
actual error.

Signed-off-by: Sukadev Bhattiprolu
---
 kernel/fork.c |5 +++--
 kernel/pid.c |9 ++---
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..f8411a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;

 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}

 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index c0aaebe..fd72ad9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -151,6 +151,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
+	int rc = -EAGAIN;

 	pid = last + 1;
 	if (pid >= pid_max)
@@ -159,8 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (alloc_pidmap_page(map))
+		if (alloc_pidmap_page(map)) {
+			rc = -ENOMEM;
 			break;
+		}

 		if (likely(atomic_read(&map->nr_free))) {
 			do {
@@ -192,7 +195,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }

 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -297,7 +300,7 @@ out_free:
 		free_pidmap(pid->numbers + i);

 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }

--
1.5.2.5
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
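For readers unfamiliar with the convention the patch switches to, here is a
minimal userspace sketch of returning the real error through the pointer
instead of a bare NULL. ERR_PTR(), PTR_ERR() and IS_ERR() below are simplified
stand-ins for the kernel helpers of the same name, and toy_alloc_nr() and
toy_alloc_pid() are hypothetical names that only mimic the shape of
alloc_pidmap() and alloc_pid(); none of this is the kernel code itself.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy object, loosely playing the role of struct pid. */
struct toy_pid {
	int nr;
};

/* Hand out numbers below 'limit'; -EAGAIN means all numbers are in use. */
static int toy_alloc_nr(int *next_nr, int limit)
{
	if (*next_nr >= limit)
		return -EAGAIN;
	return (*next_nr)++;
}

/* Return a valid object, or an ERR_PTR() carrying the actual reason. */
static struct toy_pid *toy_alloc_pid(int *next_nr, int limit)
{
	struct toy_pid *pid = malloc(sizeof(*pid));
	int nr;

	if (!pid)
		return ERR_PTR(-ENOMEM);

	nr = toy_alloc_nr(next_nr, limit);
	if (nr < 0) {
		free(pid);
		return ERR_PTR(nr);	/* propagate, do not assume -ENOMEM */
	}

	pid->nr = nr;
	return pid;
}
```

A caller checks IS_ERR() on the result and uses PTR_ERR() to recover the
errno value, the same way copy_process() does after this patch.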
[Devel] [RFC][PATCH 0/4] Devpts namespace
Serge, Matt, please sign off on these patches as you see fit.
---

Devpts namespace patchset

In continuation of the implementation of containers in mainline, we need to
support multiple PTY namespaces so that the PTY index (i.e. the tty names)
in one container is independent of the PTY indices of other containers. For
instance, this would allow each container to have a '/dev/pts/0' PTY that
refers to a different terminal.

	[PATCH 1/4]: Factor out PTY index allocation
	[PATCH 2/4]: Use interface to access allocated_ptys
	[PATCH 3/4]: Enable multiple mounts of /dev/pts
	[PATCH 4/4]: Enable cloning PTY namespaces

Todo:
	- This patchset depends on availability of additional clone flags !!!
	- Needs more testing.

Changelog:
	This patchset is based on earlier versions developed by Serge Hallyn
	and Matt Helsley.
[Devel] [RFC][PATCH 1/4]: Factor out PTY index allocation
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 1/4]: Factor out PTY index allocation

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index(). This localizes the external data
structures used in managing the pts indices.

Changelog:
	- Version 0: Based on earlier versions from Serge Hallyn and
	  Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c | 40 ++--
 fs/devpts/inode.c | 42 +-
 include/linux/devpts_fs.h |4
 3 files changed, 51 insertions(+), 35 deletions(-)

Index: linux-2.6.24/drivers/char/tty_io.c
===
--- linux-2.6.24.orig/drivers/char/tty_io.c	2008-01-24 14:58:37.0 -0800
+++ linux-2.6.24/drivers/char/tty_io.c	2008-02-05 17:17:11.0 -0800
@@ -90,7 +90,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -136,9 +135,6 @@ EXPORT_SYMBOL(tty_mutex);

 #ifdef CONFIG_UNIX98_PTYS
 extern struct tty_driver *ptm_driver;	/* Unix98 pty masters; for /dev/ptmx */
-extern int pty_limit;		/* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
-static DECLARE_MUTEX(allocated_ptys_lock);
 static int ptmx_open(struct inode *, struct file *);
 #endif

@@ -2568,15 +2564,9 @@ static void release_dev(struct file * fi
 	 */
 	release_tty(tty, idx);

-#ifdef CONFIG_UNIX98_PTYS
 	/* Make this pty number available for reallocation */
-	if (devpts) {
-		down(&allocated_ptys_lock);
-		idr_remove(&allocated_ptys, idx);
-		up(&allocated_ptys_lock);
-	}
-#endif
-
+	if (devpts)
+		devpts_kill_index(idx);
 }

 /**
@@ -2732,29 +2722,13 @@ static int ptmx_open(struct inode * inod
 	struct tty_struct *tty;
 	int retval;
 	int index;
-	int idr_ret;

 	nonseekable_open(inode, filp);

 	/* find a device that is not in use. */
-	down(&allocated_ptys_lock);
-	if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
-		up(&allocated_ptys_lock);
-		return -ENOMEM;
-	}
-	idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
-	if (idr_ret < 0) {
-		up(&allocated_ptys_lock);
-		if (idr_ret == -EAGAIN)
-			return -ENOMEM;
-		return -EIO;
-	}
-	if (index >= pty_limit) {
-		idr_remove(&allocated_ptys, index);
-		up(&allocated_ptys_lock);
-		return -EIO;
-	}
-	up(&allocated_ptys_lock);
+	index = devpts_new_index();
+	if (index < 0)
+		return index;

 	mutex_lock(&tty_mutex);
 	retval = init_dev(ptm_driver, index, &tty);
@@ -2781,9 +2755,7 @@ out1:
 	release_dev(filp);
 	return retval;
 out:
-	down(&allocated_ptys_lock);
-	idr_remove(&allocated_ptys, index);
-	up(&allocated_ptys_lock);
+	devpts_kill_index(index);
 	return retval;
 }
 #endif
Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c	2008-01-24 14:58:37.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c	2008-02-05 17:17:11.0 -0800
@@ -17,12 +17,17 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include

 #define DEVPTS_SUPER_MAGIC 0x1cd1

+extern int pty_limit;		/* Config limit on Unix98 ptys */
+static DEFINE_IDR(allocated_ptys);
+static DECLARE_MUTEX(allocated_ptys_lock);
+
 static struct vfsmount *devpts_mnt;
 static struct dentry *devpts_root;

@@ -156,9 +161,44 @@ static struct dentry *get_node(int num)
 	return lookup_one_len(s, root, sprintf(s, "%d", num));
 }

+int devpts_new_index(void)
+{
+	int index;
+	int idr_ret;
+
+retry:
+	if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
+		return -ENOMEM;
+	}
+
+	down(&allocated_ptys_lock);
+	idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
+	if (idr_ret < 0) {
+		up(&allocated_ptys_lock);
+		if (idr_ret == -EAGAIN)
+			goto retry;
+		return -EIO;
+	}
+
+	if (index >= pty_limit) {
+		idr_remove(&allocated_ptys, index);
+		up(&allocated_ptys_lock);
+		return -EIO;
+	}
+	up(&allocated_ptys_lock);
+	return index;
+}
+
+void devpts_kill_index(int idx)
+{
+	down(&allocated_ptys_lock);
+	idr_remove(&allocated_ptys, idx);
+	up(&allocated_ptys_lock);
+}
+
 int devpts_pty_new(struct tty_struct *tty)
 {
-
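As a rough illustration of the contract that devpts_new_index() and
devpts_kill_index() give their callers (lowest free index or a negative
errno, and release by index), here is a small userspace sketch. It assumes a
fixed-size table in place of the idr, leaves out the locking that the patch
does with allocated_ptys_lock, and every toy_* name and TOY_PTY_LIMIT is
hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for pty_limit; the real limit is configurable. */
#define TOY_PTY_LIMIT 4

/* Stand-in for the 'allocated_ptys' idr: one flag per index. */
static bool toy_in_use[TOY_PTY_LIMIT];

/* Return the lowest free index, or -EIO once the limit is reached. */
int toy_new_index(void)
{
	for (int i = 0; i < TOY_PTY_LIMIT; i++) {
		if (!toy_in_use[i]) {
			toy_in_use[i] = true;
			return i;
		}
	}
	return -EIO;
}

/* Make an index available for reallocation; bad indices are ignored. */
void toy_kill_index(int idx)
{
	if (idx >= 0 && idx < TOY_PTY_LIMIT)
		toy_in_use[idx] = false;
}
```

The point of the factoring is visible even in the toy: callers such as
ptmx_open() only see an index or an error, so the backing structure can
later become per-namespace without touching them.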
[Devel] [RFC][PATCH 2/4]: Use interface to access allocated_ptys
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 2/4]: Use interface to access allocated_ptys

In preparation for supporting multiple PTY namespaces, use an inline
function to access the 'allocated_ptys' idr.

Changelog:
	- Version 0: Based on earlier versions from Serge Hallyn and
	  Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c	2008-02-05 17:17:11.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c	2008-02-05 17:30:52.0 -0800
@@ -28,6 +28,11 @@ extern int pty_limit;/* Config limit o
 static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);

+static inline struct idr *current_pts_ns_allocated_ptys(void)
+{
+	return &allocated_ptys;
+}
+
 static struct vfsmount *devpts_mnt;
 static struct dentry *devpts_root;

@@ -167,12 +172,12 @@ int devpts_new_index(void)
 	int idr_ret;

 retry:
-	if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
+	if (!idr_pre_get(current_pts_ns_allocated_ptys(), GFP_KERNEL)) {
 		return -ENOMEM;
 	}

 	down(&allocated_ptys_lock);
-	idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
+	idr_ret = idr_get_new(current_pts_ns_allocated_ptys(), NULL, &index);
 	if (idr_ret < 0) {
 		up(&allocated_ptys_lock);
 		if (idr_ret == -EAGAIN)
@@ -181,7 +186,7 @@ retry:
 	}

 	if (index >= pty_limit) {
-		idr_remove(&allocated_ptys, index);
+		idr_remove(current_pts_ns_allocated_ptys(), index);
 		up(&allocated_ptys_lock);
 		return -EIO;
 	}
@@ -192,7 +197,7 @@ retry:
 void devpts_kill_index(int idx)
 {
 	down(&allocated_ptys_lock);
-	idr_remove(&allocated_ptys, idx);
+	idr_remove(current_pts_ns_allocated_ptys(), idx);
 	up(&allocated_ptys_lock);
 }
[Devel] [RFC][PATCH 4/4]: Enable cloning PTY namespaces
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 4/4]: Enable cloning PTY namespaces

Enable cloning PTY namespaces.

TODO:
	This version temporarily uses the clone flag '0x8000' which
	is unused in mainline atm, but used for CLONE_IO in -mm.
	While we must extend clone() (urgently) to solve this, it hopefully
	does not affect review of the rest of this patchset.

Changelog:
	- Version 0: Based on earlier versions from Serge Hallyn and
	  Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c | 84 +++---
 include/linux/devpts_fs.h | 52
 include/linux/init_task.h |1
 include/linux/nsproxy.h |2 +
 include/linux/sched.h |2 +
 kernel/fork.c |2 -
 kernel/nsproxy.c | 17 -
 7 files changed, 146 insertions(+), 14 deletions(-)

Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c	2008-02-05 19:16:39.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c	2008-02-05 20:27:41.0 -0800
@@ -25,18 +25,25 @@
 #define DEVPTS_SUPER_MAGIC 0x1cd1

 extern int pty_limit;		/* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
+static struct file_system_type devpts_fs_type;
+
+struct pts_namespace init_pts_ns = {
+	.kref = {
+		.refcount = ATOMIC_INIT(2),
+	},
+	.allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys),
+	.mnt = NULL,
+};

 static inline struct idr *current_pts_ns_allocated_ptys(void)
 {
-	return &allocated_ptys;
+	return &current->nsproxy->pts_ns->allocated_ptys;
 }

-static struct vfsmount *devpts_mnt;
 static inline struct vfsmount *current_pts_ns_mnt(void)
 {
-	return devpts_mnt;
+	return current->nsproxy->pts_ns->mnt;
 }

 static struct {
@@ -59,6 +66,42 @@ static match_table_t tokens = {
 	{Opt_err, NULL}
 };

+struct pts_namespace *new_pts_ns(void)
+{
+	struct pts_namespace *ns;
+
+	ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+	if (!ns)
+		return ERR_PTR(-ENOMEM);
+
+	ns->mnt = kern_mount_data(&devpts_fs_type, ns);
+	if (IS_ERR(ns->mnt)) {
+		kfree(ns);
+		return ERR_PTR(PTR_ERR(ns->mnt));
+	}
+
+	idr_init(&ns->allocated_ptys);
+	kref_init(&ns->kref);
+
+	return ns;
+}
+
+void free_pts_ns(struct kref *ns_kref)
+{
+	struct pts_namespace *ns;
+
+	ns = container_of(ns_kref, struct pts_namespace, kref);
+	BUG_ON(ns == &init_pts_ns);
+
+	mntput(ns->mnt);
+	/*
+	 * TODO:
+	 *	idr_remove_all(&ns->allocated_ptys); introduced in 2.6.23
+	 */
+	idr_destroy(&ns->allocated_ptys);
+	kfree(ns);
+}
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
 	char *p;
@@ -160,18 +203,27 @@ static int devpts_test_sb(struct super_b

 static int devpts_set_sb(struct super_block *sb, void *data)
 {
-	sb->s_fs_info = data;
+	struct pts_namespace *ns = data;
+
+	sb->s_fs_info = get_pts_ns(ns);
 	return set_anon_super(sb, NULL);
 }

 static int devpts_get_sb(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
+	struct pts_namespace *ns;
 	struct super_block *sb;
 	int err;

+	/* hereafter we're very similar to proc_get_sb */
+	if (flags & MS_KERNMOUNT)
+		ns = data;
+	else
+		ns = current->nsproxy->pts_ns;
+
 	/* hereafter we're very simlar to get_sb_nodev */
-	sb = sget(fs_type, devpts_test_sb, devpts_set_sb, data);
+	sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns);
 	if (IS_ERR(sb))
 		return PTR_ERR(sb);

@@ -187,16 +239,25 @@ static int devpts_get_sb(struct file_sys
 	}

 	sb->s_flags |= MS_ACTIVE;
-	devpts_mnt = mnt;
+	ns->mnt = mnt;

 	return simple_set_mnt(mnt, sb);
 }

+static void devpts_kill_sb(struct super_block *sb)
+{
+	struct pts_namespace *ns;
+
+	ns = sb->s_fs_info;
+	kill_anon_super(sb);
+	put_pts_ns(ns);
+}
+
 static struct file_system_type devpts_fs_type = {
 	.owner = THIS_MODULE,
 	.name = "devpts",
 	.get_sb = devpts_get_sb,
-	.kill_sb= kill_anon_super,
+	.kill_sb= devpts_kill_sb,
 };

 /*
@@ -352,18 +413,19 @@ static int __init init_devpts_fs(void)
 	if (err)
 		return err;

-	mnt = kern_mount_data(&devpts_fs_type, NULL);
+	mnt = kern_mount_data(&devpts_fs_type, &init
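To make the effect of routing every access through the current task's
namespace concrete, here is a small userspace sketch. It assumes a plain
counter in place of the idr and a global pointer in place of
current->nsproxy->pts_ns; all toy_* names are hypothetical and the reference
counting is a bare integer rather than a kref.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct pts_namespace. */
struct toy_pts_ns {
	int refcount;
	int next_index;		/* stand-in for the per-namespace idr */
};

static struct toy_pts_ns toy_init_pts_ns = { .refcount = 1, .next_index = 0 };

/* Stand-in for current->nsproxy->pts_ns. */
static struct toy_pts_ns *toy_current_pts_ns = &toy_init_pts_ns;

/* Create a namespace holding one reference; NULL on allocation failure. */
struct toy_pts_ns *toy_new_pts_ns(void)
{
	struct toy_pts_ns *ns = calloc(1, sizeof(*ns));

	if (ns)
		ns->refcount = 1;
	return ns;
}

struct toy_pts_ns *toy_get_pts_ns(struct toy_pts_ns *ns)
{
	ns->refcount++;
	return ns;
}

/* Drop a reference; returns 1 if it was the last one. */
int toy_put_pts_ns(struct toy_pts_ns *ns)
{
	if (--ns->refcount > 0)
		return 0;
	if (ns != &toy_init_pts_ns)
		free(ns);
	return 1;
}

/* Roughly what cloning with a new PTY namespace does for the new task. */
void toy_switch_pts_ns(struct toy_pts_ns *ns)
{
	toy_current_pts_ns = ns;
}

/* Indices come from whichever namespace is current. */
int toy_ns_new_index(void)
{
	return toy_current_pts_ns->next_index++;
}
```

With this shape, two namespaces can both hand out index 0, which is what lets
each container have its own '/dev/pts/0'.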
[Devel] [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, we should allow multiple mounts of
/dev/pts, once within each PTY namespace.

This patch removes the get_sb_single() in devpts_get_sb() and uses test and
set sb interfaces to allow remounting /dev/pts. The patch also removes the
global 'devpts_root' and uses current_pts_ns_mnt() to access 'devpts_mnt'.

Changelog:
	- Version 0: Based on earlier versions from Serge Hallyn and
	  Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c | 120 +-
 1 file changed, 101 insertions(+), 19 deletions(-)

Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c	2008-02-05 17:30:52.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c	2008-02-05 19:16:39.0 -0800
@@ -34,7 +34,10 @@ static inline struct idr *current_pts_ns
 }

 static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
+static inline struct vfsmount *current_pts_ns_mnt(void)
+{
+	return devpts_mnt;
+}

 static struct {
 	int setuid;
@@ -130,7 +133,7 @@ devpts_fill_super(struct super_block *s,
 	inode->i_fop = &simple_dir_operations;
 	inode->i_nlink = 2;

-	devpts_root = s->s_root = d_alloc_root(inode);
+	s->s_root = d_alloc_root(inode);
 	if (s->s_root)
 		return 0;

@@ -140,10 +143,53 @@ fail:
 	return -ENOMEM;
 }

+/*
+ * We use test and set super-block operations to help determine whether we
+ * need a new super-block for this namespace. get_sb() walks the list of
+ * existing devpts supers, comparing them with the @data ptr. Since we
+ * passed 'current's namespace as the @data pointer we can compare the
+ * namespace pointer in the super-block's 's_fs_info'. If the test is
+ * TRUE then get_sb() returns a new active reference to the super block.
+ * Otherwise, it helps us build an active reference to a new one.
+ */
+
+static int devpts_test_sb(struct super_block *sb, void *data)
+{
+	return sb->s_fs_info == data;
+}
+
+static int devpts_set_sb(struct super_block *sb, void *data)
+{
+	sb->s_fs_info = data;
+	return set_anon_super(sb, NULL);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-	return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+	struct super_block *sb;
+	int err;
+
+	/* hereafter we're very simlar to get_sb_nodev */
+	sb = sget(fs_type, devpts_test_sb, devpts_set_sb, data);
+	if (IS_ERR(sb))
+		return PTR_ERR(sb);
+
+	if (sb->s_root)
+		return simple_set_mnt(mnt, sb);
+
+	sb->s_flags = flags;
+	err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+	if (err) {
+		up_write(&sb->s_umount);
+		deactivate_super(sb);
+		return err;
+	}
+
+	sb->s_flags |= MS_ACTIVE;
+	devpts_mnt = mnt;
+
+	return simple_set_mnt(mnt, sb);
 }

 static struct file_system_type devpts_fs_type = {
@@ -158,10 +204,9 @@ static struct file_system_type devpts_fs
  * to the System V naming convention
  */

-static struct dentry *get_node(int num)
+static struct dentry *get_node(struct dentry *root, int num)
 {
 	char s[12];
-	struct dentry *root = devpts_root;
 	mutex_lock(&root->d_inode->i_mutex);
 	return lookup_one_len(s, root, sprintf(s, "%d", num));
 }
@@ -207,12 +252,28 @@ int devpts_pty_new(struct tty_struct *tt
 	struct tty_driver *driver = tty->driver;
 	dev_t device = MKDEV(driver->major, driver->minor_start+number);
 	struct dentry *dentry;
-	struct inode *inode = new_inode(devpts_mnt->mnt_sb);
+	struct dentry *root;
+	struct vfsmount *mnt;
+	struct inode *inode;
+

 	/* We're supposed to be given the slave end of a pty */
 	BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY);
 	BUG_ON(driver->subtype != PTY_TYPE_SLAVE);

+	mnt = current_pts_ns_mnt();
+	if (!mnt)
+		return -ENOSYS;
+	root = mnt->mnt_root;
+
+	mutex_lock(&root->d_inode->i_mutex);
+	inode = idr_find(current_pts_ns_allocated_ptys(), number);
+	mutex_unlock(&root->d_inode->i_mutex);
+
+	if (inode && !IS_ERR(inode))
+		return -EEXIST;
+
+	inode = new_inode(mnt->mnt_sb);
 	if (!inode)
 		return -ENOMEM;

@@ -222,23 +283,31 @@ int devpts_pty_new(struct tty_struct *tt
 	inode->i_mtime = inode->i_atime = inode->i_ctime = CU
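The test/set idea described in the comment above devpts_test_sb() can be
sketched in userspace as a find-or-create lookup keyed on an opaque pointer.
This is only an illustration under simplifying assumptions: a fixed array
replaces the kernel's list of super blocks, there is no locking or reference
counting, and toy_sget() and the other toy_* names are hypothetical, not the
real sget() interface.

```c
#include <stddef.h>

#define TOY_MAX_SB 8	/* hypothetical capacity of the toy table */

/* Hypothetical, simplified stand-in for struct super_block. */
struct toy_sb {
	void *fs_info;	/* the key: which namespace owns this super block */
	int in_use;
};

static struct toy_sb toy_supers[TOY_MAX_SB];

/* "test": does this existing super block belong to 'data'? */
static int toy_test_sb(struct toy_sb *sb, void *data)
{
	return sb->fs_info == data;
}

/* "set": bind a freshly allocated super block to 'data'. */
static void toy_set_sb(struct toy_sb *sb, void *data)
{
	sb->fs_info = data;
}

/*
 * Return the existing super block whose key matches 'data', or set up
 * a new one. Returns NULL when the toy table is full.
 */
struct toy_sb *toy_sget(void *data)
{
	struct toy_sb *free_slot = NULL;

	for (size_t i = 0; i < TOY_MAX_SB; i++) {
		struct toy_sb *sb = &toy_supers[i];

		if (sb->in_use) {
			if (toy_test_sb(sb, data))
				return sb;
		} else if (!free_slot) {
			free_slot = sb;
		}
	}

	if (!free_slot)
		return NULL;

	free_slot->in_use = 1;
	toy_set_sb(free_slot, data);
	return free_slot;
}
```

Mounting twice with the same key yields the same super block, while a
different key gets its own, which is the behaviour the patch wants per PTY
namespace.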
[Devel] Re: [PATCH 4/4] The control group itself
This patchset does fix the problem I was having before with null and zero devices. Overall, it looks like pretty good. I am still reviewing the patches. Just some nits I came across: Pavel Emelianov [EMAIL PROTECTED] wrote: | Each new group will have its own maps for char and block | layers. The devices access list is tuned via the | devices.permissions file. One may read from the file to get | the configured state. | | The top container isn't initialized, so that the char | and block layers will use the global maps to lookup | their devices. I did that not to export the static maps | to the outer world. | | Good news is that this patch now contains more comments | and Documentation file :) | | Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]> | | --- | | diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt | new file mode 100644 | index 000..dbd0c7a | --- /dev/null | +++ b/Documentation/controllers/devices.txt | @@ -0,0 +1,61 @@ | + | + Devices visibility controller | + | +This controller allows to tune the devices accessibility by tasks, | +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only | +access to IDE devices and completely hide SCSI disks. | + | +Tasks still can call mknod to create device files, regardless of | +whether the particular device is visible or accessible, but they | +may not be able to open it later. | + | +This one hides under CONFIG_CGROUP_DEVS option. | + | + | +Configuring | + | +The controller provides a single file to configure itself -- the | +devices.permissions one. To change the accessibility level for some | +device write the following string into it: | + | +[cb] :(|*) [r-][w-] | + ^ ^ ^ | + | | | | + | | +--- access rights (1) | + | | | + | +-- device major and minor numbers (2) | + | | + +-- device type (character / block) | + | +1) The access rights set to '--' remove the device from the group's | +access list, so that it will not even be shown in this file later. 
| + | +2) Setting the minor to '*' grants access to all the minors for | +particular major. | + | +When reading from it, one may see something like | + | + c 1:5 rw | + b 8:* r- | + | +Security issues, concerning who may grant access to what are governed | +at the cgroup infrastructure level. | + | + | +Examples: | + | +1. Grand full access to /dev/null Grant. | + # echo c 1:3 rw > /cgroups//devices.permissions | + | +2. Grant the read-only access to /dev/sda and partitions | + # echo b 8:* r- > ... This grants access to all scsi disks, sda..sdp and not just 'sda' right ? | + | +3. Change the /dev/null access to write-only | + # echo c 1:3 -w > ... | + | +4. Revoke access to /dev/sda | + # echo b 8:* -- > ... | + | + | + Written by Pavel Emelyanov <[EMAIL PROTECTED]> | + | diff --git a/fs/Makefile b/fs/Makefile | index 7996220..5ad03be 100644 | --- a/fs/Makefile | +++ b/fs/Makefile | @@ -64,6 +64,8 @@ obj-y += devpts/ | | obj-$(CONFIG_PROFILING) += dcookies.o | obj-$(CONFIG_DLM)+= dlm/ | + | +obj-$(CONFIG_CGROUP_DEVS)+= devscontrol.o | | # Do not add any filesystems before this line | obj-$(CONFIG_REISERFS_FS)+= reiserfs/ | diff --git a/fs/devscontrol.c b/fs/devscontrol.c | new file mode 100644 | index 000..48c5f69 | --- /dev/null | +++ b/fs/devscontrol.c | @@ -0,0 +1,314 @@ | +/* | + * devscontrol.c - Device Controller | + * | + * Copyright 2007 OpenVZ SWsoft Inc | + * Author: Pavel Emelyanov | + * | + * This program is free software; you can redistribute it and/or modify | + * it under the terms of the GNU General Public License as published by | + * the Free Software Foundation; either version 2 of the License, or | + * (at your option) any later version. | + * | + * This program is distributed in the hope that it will be useful, | + * but WITHOUT ANY WARRANTY; without even the implied warranty of | + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | + * GNU General Public License for more details. 
| + */ | + | +#include | +#include | +#include | +#include | +#include | +#include | +#include | + | +struct devs_cgroup { | + /* | + * The subsys state to build into cgrous infrastructure | + */ ... into cgroups | + struct cgroup_subsys_state css; | + | + /* | + * The maps of character and block devices. They provide a | + * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c | + * and block/genhd.c to find out how the ->open() callbacks | + * work when opening a device. | + * | + * Each group will have its onw maps, and at the open() own maps | + * time code will lookup in this map to get the device | + * and permissions by its dev_t. | + */ | + struct kobj_map *cdev_map; | + struct kobj_map *bdev_map; | +}; | + | +static inlin
Re: [Devel] [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts
Serge E. Hallyn [EMAIL PROTECTED] wrote: | | > exploited in OpenVZ, so if we can somehow avoid forcing the NEWNS flag | > that would be very very good :) See my next comment about this issue. | > | > > Pavel, not long ago you said you were starting to look at tty and pty | > > stuff - did you have any different ideas on devpts virtualization, or | > > are you ok with this minus your comments thus far? | > | > I have a similar idea of how to implement this, but I didn't thought | > about the details. As far as this issue is concerned, I see no reasons | > why we need a kern_mount-ed devtpsfs instance. If we don't make such, | > we may safely hold the ptsns from the superblock and be happy. The | > same seems applicable to the mqns, no? | | But the current->nsproxy->devpts->mnt is used in several functions in | patch 3. Hmm, current_pts_ns_mnt() is used in: devpts_pty_new() devpts_get_tty() devpts_pty_kill() All of these return error if current_pts_ns_mnt() returns NULL. So, can we require user-space mount and unmount /dev/pts and return error if any operation is attempted before the mount ? | | > The reason I have the kern_mount-ed instance of proc for pid namespaces | > is that I need a vfsmount to flush task entries from, but allowing | > it to be NULL (i.e. no kern_mount, but optional user mounts) means | > handing all the possible races, which is too heavy. But do we actually | > need the vfsmount for devpts and mqns if no user-space mounts exist? | > | > Besides, I planned to include legacy ptys virtualization and console | > virtualizatin in this namespace, but it seems, that it is not present | > in this particular one. | | I had been thinking the consoles would have their own ns, since there's | really nothing linking them, but there really is no good reason why | userspace should ever want them separate. So I'm fine with combining | them. 
| | > >>> + sb->s_flags |= MS_ACTIVE; | > >>> + devpts_mnt = mnt; | > >>> + | > >>> + return simple_set_mnt(mnt, sb); | > >>> } | > >>> | > >>> static struct file_system_type devpts_fs_type = { | > >>> @@ -158,10 +204,9 @@ static struct file_system_type devpts_fs | > >>> * to the System V naming convention | > >>> */ | > >>> | > >>> -static struct dentry *get_node(int num) | > >>> +static struct dentry *get_node(struct dentry *root, int num) | > >>> { | > >>> char s[12]; | > >>> - struct dentry *root = devpts_root; | > >>> mutex_lock(&root->d_inode->i_mutex); | > >>> return lookup_one_len(s, root, sprintf(s, "%d", num)); | > >>> } | > >>> @@ -207,12 +252,28 @@ int devpts_pty_new(struct tty_struct *tt | > >>> struct tty_driver *driver = tty->driver; | > >>> dev_t device = MKDEV(driver->major, driver->minor_start+number); | > >>> struct dentry *dentry; | > >>> - struct inode *inode = new_inode(devpts_mnt->mnt_sb); | > >>> + struct dentry *root; | > >>> + struct vfsmount *mnt; | > >>> + struct inode *inode; | > >>> + | > >>> | > >>> /* We're supposed to be given the slave end of a pty */ | > >>> BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY); | > >>> BUG_ON(driver->subtype != PTY_TYPE_SLAVE); | > >>> | > >>> + mnt = current_pts_ns_mnt(); | > >>> + if (!mnt) | > >>> + return -ENOSYS; | > >>> + root = mnt->mnt_root; | > >>> + | > >>> + mutex_lock(&root->d_inode->i_mutex); | > >>> + inode = idr_find(current_pts_ns_allocated_ptys(), number); | > >>> + mutex_unlock(&root->d_inode->i_mutex); | > >>> + | > >>> + if (inode && !IS_ERR(inode)) | > >>> + return -EEXIST; | > >>> + | > >>> + inode = new_inode(mnt->mnt_sb); | > >>> if (!inode) | > >>> return -ENOMEM; | > >>> | > >>> @@ -222,23 +283,31 @@ int devpts_pty_new(struct tty_struct *tt | > >>> inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; | > >>> init_special_inode(inode, S_IFCHR|config.mode, device); | > >>> inode->i_private = tty; | > >>> + idr_replace(current_pts_ns_allocated_ptys(), inode, number); | > 
>>> | > >>> - dentry = get_node(number); | > >>> + dentry = get_node(root, number); | > >>> if (!IS_ERR(dentry) && !dentry->d_inode) { | > >>> d_instantiate(dentry, inode); | > >>> - fsnotify_create(devpts_root->d_inode, dentry); | > >>> + fsnotify_create(root->d_inode, dentry); | > >>> } | > >>> | > >>> - mutex_unlock(&devpts_root->d_inode->i_mutex); | > >>> + mutex_unlock(&root->d_inode->i_mutex); | > >>> | > >>> return 0; | > >>> } | > >>> | > >>> struct tty_struct *devpts_get_tty(int number) | > >>> { | > >>> - struct dentry *dentry = get_node(number); | > >>> + struct vfsmount *mnt; | > >>> + struct dentry *dentry; | > >>> struct tty_struct *tty; | > >>> | > >>> +
Re: [Devel] [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts
Pavel Emelianov [EMAIL PROTECTED] wrote: | Serge E. Hallyn wrote: | > Quoting Pavel Emelyanov ([EMAIL PROTECTED]): | >> [snip] | >> | Mmm. I wanted to send one small objection to Cedric's patches with mqns, | but the thread was abandoned by the time I decided to do-it-right-now. | | So I can put it here: forcing the CLONE_NEWNS is not very good, since | this makes impossible to push a bind mount inside a new namespace, which | may operate in some chroot environment. But this ability is heavily | >>> Which direction do you want to go? I'm wondering whether mounts | >>> propagation can address it. | >> Hardly. AFAIS there's no way to let the chroot-ed tasks see parts of | >> vfs tree, that left behind them after chroot, unless they are in the | >> same mntns as you, and you bind mount this parts to their tree. No? | > | > Well no, but I suspect I'm just not understanding what you want to do. | > But if the chroot is under /jail1, and you've done, say, | > | > mkdir -p /share/pts | > mkdir -p /jail1/share | > mount --bind /share /share | > mount --make-shared /share | > mount --bind /share /jail1/share | > mount --make-slave /jail1/share | > | > before the chroot-ed tasks were cloned with CLONE_NEWNS, then when you | > do | > | > mount --bind /dev/pts /share/pts | > | > from the parent mntns (not that I know why you'd want to do *that* :) | > then the chroot'ed tasks will see the original mntns's /dev/pts under | > /jail1/share. | | I haven't yet tried that, but :( this function | | static inline int check_mnt(struct vfsmount *mnt) | { | return mnt->mnt_ns == current->nsproxy->mnt_ns; | } | | and this code in do_loopback | | if (!check_mnt(nd->mnt) || !check_mnt(old_nd.mnt)) | goto out; | | makes me think that trying to bind a mount from another mntns | ot _to_ another is prohibited... Do I miss something? 
| | >>> Though really, I think you're right - we shouldn't break the kernel | >>> doing CLONE_NEWMQ or CLONE_NEWPTS without CLONE_NEWNS, so we shouldn't | >>> force the combination. | >>> | exploited in OpenVZ, so if we can somehow avoid forcing the NEWNS flag | that would be very very good :) See my next comment about this issue. | | > Pavel, not long ago you said you were starting to look at tty and pty | > stuff - did you have any different ideas on devpts virtualization, or | > are you ok with this minus your comments thus far? | I have a similar idea of how to implement this, but I didn't thought | about the details. As far as this issue is concerned, I see no reasons | why we need a kern_mount-ed devtpsfs instance. If we don't make such, | we may safely hold the ptsns from the superblock and be happy. The | same seems applicable to the mqns, no? | >>> But the current->nsproxy->devpts->mnt is used in several functions in | >>> patch 3. | >> Indeed. I overlooked this. Then we're in a deep ... problem here. | >> | >> Breaking this circle was not that easy with pid namespaces, so | >> I put the strut in proc_flush_task - when the last task from the | >> namespace exits the kern-mount-ed vfsmnt is dropped, but we can't | >> do the same stuff with devpts. | > | > But I still don't see what the problem is with my proposal? So long as | > you agree that if there are no tasks remaining in the devptsns, | > then any task which has its devpts mounted should see an empty directory | > (due to sb->s_info being NULL), I think it works. | | Well, if we _do_ can handle the races with ns->devpts_mnt switch | from not NULL to NULL, then I'm fine with this approach. | | I just remember, that with pid namespaces this caused a complicated | locking and performance degradation. This is the problem I couldn't | remember yesterday. 
Well, iirc, one problem with pid namespaces was that we need to keep
the task and pid_namespace association until the task was waited on
(for instance the wait() call from parent needs the pid_t of the
child which is tied to the pid ns in struct upid).

For this reason, we don't drop the mnt reference in free_pid_ns() but
hold the reference till proc_flush_task().

With devpts, can't we simply drop the reference in free_pts_ns() so
that when the last task using the pts_ns exits, we can unmount and
release the mnt ?

IOW, do you suspect that the circular reference leads to leaking vfsmnts ?
Re: [Devel] [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts
| locking and performance degradation. This is the problem I couldn't
| > | remember yesterday.
| >
| > Well, iirc, one problem with pid namespaces was that we need to keep
| > the task and pid_namespace association until the task was waited on
| > (for instance the wait() call from parent needs the pid_t of the
| > child which is tied to the pid ns in struct upid).
| >
| > For this reason, we don't drop the mnt reference in free_pid_ns() but
| > hold the reference till proc_flush_task().
| >
| > With devpts, can't we simply drop the reference in free_pts_ns() so
| > that when the last task using the pts_ns exits, we can unmount and
| > release the mnt ?
|
| I hope we can. The thing I'm worried about is whether we can correctly
| handle race with this pointer switch from NULL to not-NULL.
|
| > IOW, do you suspect that the circular reference leads to leaking vfsmnts ?
| >
|
| Of course! If the namespace holds the vfsmnt, vfsmnt holds the superblock
| and the superblock holds the namespace we won't drop this chain ever,
| unless some other object breaks this chain.

Of course :-) I had a bug in new_pts_ns() that was masking the problem.
I had

	ns->mnt = kern_mount_data()...
	...
	kref_init(&ns->kref);

So the kref_init() would overwrite the reference got by devpts_set_sb()
and was preventing the leaking vfsmnt in my test.

Thanks Pavel,

Sukadev
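The masking effect described above can be reproduced with a tiny userspace
sketch. It assumes a plain integer counter in place of struct kref, and the
toy_* names are hypothetical. The only thing it shows is the ordering issue:
initialising the counter after another party has already taken a reference
silently discards that reference.

```c
/* Hypothetical, simplified stand-in for struct kref. */
struct toy_kref {
	int refcount;
};

/* Like kref_init(): unconditionally set the count to 1. */
static void toy_kref_init(struct toy_kref *k)
{
	k->refcount = 1;
}

/* Like kref_get(): take one more reference. */
static void toy_kref_get(struct toy_kref *k)
{
	k->refcount++;
}

/*
 * Ordering from the buggy new_pts_ns(): the mount path takes its
 * reference first, then the init call resets the count to 1.
 */
int toy_refs_init_after_mount(void)
{
	struct toy_kref k = { 0 };

	toy_kref_get(&k);	/* reference taken on behalf of the super block */
	toy_kref_init(&k);	/* overwrites it */
	return k.refcount;
}

/* Corrected ordering: initialise first, then let the mount take its ref. */
int toy_refs_init_before_mount(void)
{
	struct toy_kref k = { 0 };

	toy_kref_init(&k);
	toy_kref_get(&k);
	return k.refcount;
}
```

In the first ordering the count ends at 1, so the creator's final put frees
the namespace even though the super block still points at it, and the cycle
looks broken. In the second the count is 2, and the namespace, vfsmnt and
super block keep each other alive exactly as Pavel describes.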
[Devel] [RFC] Remove kern_mount() in init_devpts_fs()
Is the kern_mount() of devpts really needed or can we simply register
the filesystem type and wait for a user-space mount before being able
to create PTYs ?

This is just an RFC patch that removes the kern_mount() and the
'devpts_mnt' and 'devpts_root' global variables and uses a 'devpts_sb'
to store the single super block associated with devpts.

Removing the kern_mount() and relying on user-space mount could simplify
cloning of PTS namespaces.
---
 fs/devpts/inode.c | 49 +
 1 file changed, 29 insertions(+), 20 deletions(-)

Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c	2008-02-22 14:23:53.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c	2008-02-25 16:00:17.0 -0800
@@ -23,9 +23,6 @@

 #define DEVPTS_SUPER_MAGIC 0x1cd1

-static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
-
 static struct {
 	int setuid;
 	int setgid;
@@ -97,6 +94,7 @@ static const struct super_operations dev
 	.remount_fs = devpts_remount,
 };

+static struct super_block *devpts_sb;
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -120,9 +118,11 @@ devpts_fill_super(struct super_block *s,
 	inode->i_fop = &simple_dir_operations;
 	inode->i_nlink = 2;

-	devpts_root = s->s_root = d_alloc_root(inode);
-	if (s->s_root)
+	s->s_root = d_alloc_root(inode);
+	if (s->s_root) {
+		devpts_sb = s;
 		return 0;
+	}

 	printk("devpts: get root dentry failed\n");
 	iput(inode);
@@ -136,11 +136,17 @@ static int devpts_get_sb(struct file_sys
 	return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
 }

+static void devpts_kill_sb(struct super_block *sb)
+{
+	devpts_sb = NULL;
+	kill_anon_super(sb);
+}
+
 static struct file_system_type devpts_fs_type = {
 	.owner = THIS_MODULE,
 	.name = "devpts",
 	.get_sb = devpts_get_sb,
-	.kill_sb= kill_anon_super,
+	.kill_sb= devpts_kill_sb,
 };

 /*
@@ -151,7 +157,12 @@ static struct file_system_type devpts_fs
 static struct dentry *get_node(int num)
 {
 	char s[12];
-	struct dentry *root = devpts_root;
+	struct dentry *root;
+
+	if (!devpts_sb)
+		return NULL;
+
+	root = devpts_sb->s_root;
 	mutex_lock(&root->d_inode->i_mutex);
 	return lookup_one_len(s, root, sprintf(s, "%d", num));
 }
@@ -162,7 +173,12 @@ int devpts_pty_new(struct tty_struct *tt
 	struct tty_driver *driver = tty->driver;
 	dev_t device = MKDEV(driver->major, driver->minor_start+number);
 	struct dentry *dentry;
-	struct inode *inode = new_inode(devpts_mnt->mnt_sb);
+	struct inode *inode;
+
+	if (!devpts_sb)
+		return -ENOSYS;
+
+	inode = new_inode(devpts_sb);

 	/* We're supposed to be given the slave end of a pty */
 	BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY);
@@ -181,10 +197,10 @@ int devpts_pty_new(struct tty_struct *tt
 	dentry = get_node(number);
 	if (!IS_ERR(dentry) && !dentry->d_inode) {
 		d_instantiate(dentry, inode);
-		fsnotify_create(devpts_root->d_inode, dentry);
+		fsnotify_create(devpts_sb->s_root->d_inode, dentry);
 	}

-	mutex_unlock(&devpts_root->d_inode->i_mutex);
+	mutex_unlock(&devpts_sb->s_root->d_inode->i_mutex);

 	return 0;
 }
@@ -201,7 +217,7 @@ struct tty_struct *devpts_get_tty(int nu
 		dput(dentry);
 	}

-	mutex_unlock(&devpts_root->d_inode->i_mutex);
+	mutex_unlock(&devpts_sb->s_root->d_inode->i_mutex);

 	return tty;
 }
@@ -219,24 +235,17 @@ void devpts_pty_kill(int number)
 		}
 		dput(dentry);
 	}
-	mutex_unlock(&devpts_root->d_inode->i_mutex);
+	mutex_unlock(&devpts_sb->s_root->d_inode->i_mutex);
 }

 static int __init init_devpts_fs(void)
 {
-	int err = register_filesystem(&devpts_fs_type);
-	if (!err) {
-		devpts_mnt = kern_mount(&devpts_fs_type);
-		if (IS_ERR(devpts_mnt))
-			err = PTR_ERR(devpts_mnt);
-	}
-	return err;
+	return register_filesystem(&devpts_fs_type);
 }

 static void __exit exit_devpts_fs(void)
 {
 	unregister_filesystem(&devpts_fs_type);
-	mntput(devpts_mnt);
 }

 module_init(init_devpts_fs)
[Devel] [PATCH] Fix warning in kernel/pid.c
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH]: Fix compile warning in kernel/pid.c

We get a warning in kernel/pid.c due to the deprecated find_task_by_pid().
Make the function inline in sched.h to avoid the warning.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 include/linux/sched.h |    5 ++++-
 kernel/pid.c          |    6 ------
 2 files changed, 4 insertions(+), 7 deletions(-)

Index: linux-2.6-25-rc2-mm1/include/linux/sched.h
===
--- linux-2.6-25-rc2-mm1.orig/include/linux/sched.h	2008-02-27 16:07:52.0 -0800
+++ linux-2.6-25-rc2-mm1/include/linux/sched.h	2008-02-27 16:29:31.0 -0800
@@ -1632,7 +1632,10 @@ extern struct pid_namespace init_pid_ns;
 extern struct task_struct *find_task_by_pid_type_ns(int type, int pid,
 		struct pid_namespace *ns);
 
-extern struct task_struct *find_task_by_pid(pid_t nr) __deprecated;
+static inline struct task_struct *__deprecated find_task_by_pid(pid_t nr)
+{
+	return find_task_by_pid_type_ns(PIDTYPE_PID, nr, &init_pid_ns);
+}
 extern struct task_struct *find_task_by_vpid(pid_t nr);
 extern struct task_struct *find_task_by_pid_ns(pid_t nr,
 		struct pid_namespace *ns);
Index: linux-2.6-25-rc2-mm1/kernel/pid.c
===
--- linux-2.6-25-rc2-mm1.orig/kernel/pid.c	2008-02-27 15:18:22.0 -0800
+++ linux-2.6-25-rc2-mm1/kernel/pid.c	2008-02-27 16:29:31.0 -0800
@@ -380,12 +380,6 @@ struct task_struct *find_task_by_pid_typ
 
 EXPORT_SYMBOL(find_task_by_pid_type_ns);
 
-struct task_struct *find_task_by_pid(pid_t nr)
-{
-	return find_task_by_pid_type_ns(PIDTYPE_PID, nr, &init_pid_ns);
-}
-EXPORT_SYMBOL(find_task_by_pid);
-
 struct task_struct *find_task_by_vpid(pid_t vnr)
 {
 	return find_task_by_pid_type_ns(PIDTYPE_PID, vnr,
[Devel] [PATCH 0/7][v2] Cloning PTS namespace
Devpts namespace patchset

In continuation of the implementation of containers in mainline, we need to
support multiple PTY namespaces so that the PTY index (i.e. the tty names) in
one container is independent of the PTY indices of other containers. For
instance, this would allow each container to have a '/dev/pts/0' PTY and
refer to different terminals.

	[PATCH 1/7]: Propagate error code from devpts_pty_new
	[PATCH 2/7]: Factor out PTY index allocation
	[PATCH 3/7]: Enable multiple mounts of /dev/pts
	[PATCH 4/7]: Implement get_pts_ns() and put_pts_ns()
	[PATCH 5/7]: Determine pts_ns from a pty's inode.
	[PATCH 6/7]: Check for user-space mount of /dev/pts
	[PATCH 7/7]: Enable cloning PTY namespaces

Todo:

	- This patchset depends on the availability of additional clone
	  flags and relies on Cedric's clone64 patchset.

	- Needs more testing.

Changelog[v1]:

	- Fixed circular reference by storing the pts_ns in sb->s_fs_info
	  without taking a reference on it, and clearing sb->s_fs_info
	  when destroying the pts_ns (see patch 3/7 for details).

	- To allow access to a child container's ptys from the parent
	  container, determine the 'pts_ns' of a 'pty' from its inode
	  (see patch 5/7 for details).

	- Added a check (hack) to ensure user-space mount of /dev/pts is
	  done before creating PTYs in a new pts-ns (see patch 6/7 for
	  details).

	- Reorganized the patchset and removed redundant changes.

	- Ported to work with Cedric Le Goater's clone64() system call now
	  that we are out of clone_flags.

Changelog[v0]:

	This patchset is based on earlier versions developed by Serge Hallyn
	and Matt Helsley.
[Devel] [PATCH 1/7] Propagate error code from devpts_pty_new
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 1/7]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new().

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Cc: Cedric Le Goater <[EMAIL PROTECTED]>
Cc: Dave Hansen <[EMAIL PROTECTED]>
Cc: Serge Hallyn <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
---
 drivers/char/tty_io.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c	2008-03-21 20:13:38.0 -0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c	2008-03-24 20:04:07.0 -0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
 	filp->private_data = tty;
 	file_move(filp, &tty->tty_files);
 
-	retval = -ENOMEM;
-	if (devpts_pty_new(tty->link))
+	retval = devpts_pty_new(tty->link);
+	if (retval)
 		goto out1;
 
 	check_tty_count(tty, "tty_open");
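For readers outside the kernel tree, the effect of this change can be sketched in plain user-space C. This is only an illustration: fake_pty_new(), open_old() and open_new() are hypothetical names, not kernel functions, and the stand-in simply returns whatever negative errno it is told to fail with.

```c
#include <errno.h>

/* Hypothetical stand-in for devpts_pty_new(): 0 on success, -errno on failure. */
static int fake_pty_new(int fail_with)
{
	return fail_with ? -fail_with : 0;
}

/* Old pattern: every failure of the callee is reported as -ENOMEM. */
static int open_old(int fail_with)
{
	int retval = -ENOMEM;

	if (fake_pty_new(fail_with))
		return retval;
	return 0;
}

/* New pattern: the callee's own error code reaches the caller unchanged. */
static int open_new(int fail_with)
{
	int retval = fake_pty_new(fail_with);

	if (retval)
		return retval;
	return 0;
}
```

With the old pattern a caller cannot tell an allocation failure from any other error; with the new one, a later patch in the series can return a distinct code such as -ENOSYS and have it reach user space.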
[Devel] [PATCH 2/7]: Factor out PTY index allocation
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH 2/7]: Factor out PTY index allocation Factor out the code used to allocate/free a pts index into new interfaces, devpts_new_index() and devpts_kill_index(). This localizes the external data structures used in managing the pts indices. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Signed-off-by: Serge Hallyn<[EMAIL PROTECTED]> Signed-off-by: Matt Helsley<[EMAIL PROTECTED]> --- drivers/char/tty_io.c | 40 ++-- fs/devpts/inode.c | 42 +- include/linux/devpts_fs.h |4 3 files changed, 51 insertions(+), 35 deletions(-) Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h === --- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:04:26.0 -0700 @@ -17,6 +17,8 @@ #ifdef CONFIG_UNIX98_PTYS +int devpts_new_index(void); +void devpts_kill_index(int idx); int devpts_pty_new(struct tty_struct *tty); /* mknod in devpts */ struct tty_struct *devpts_get_tty(int number); /* get tty structure */ void devpts_pty_kill(int number); /* unlink */ @@ -24,6 +26,8 @@ void devpts_pty_kill(int number); /* u #else /* Dummy stubs in the no-pty case */ +static inline int devpts_new_index(void) { return -EINVAL; } +static inline void devpts_kill_index(int idx) { } static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; } static inline struct tty_struct *devpts_get_tty(int number) { return NULL; } static inline void devpts_pty_kill(int number) { } Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c === --- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:04:26.0 -0700 @@ -91,7 +91,6 @@ #include #include #include -#include #include #include #include @@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex); #ifdef CONFIG_UNIX98_PTYS extern struct tty_driver *ptm_driver; /* Unix98 pty masters; for /dev/ptmx */ -extern int pty_limit; /* Config limit on Unix98 ptys */ -static 
DEFINE_IDR(allocated_ptys); -static DEFINE_MUTEX(allocated_ptys_lock); static int ptmx_open(struct inode *, struct file *); #endif @@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil */ release_tty(tty, idx); -#ifdef CONFIG_UNIX98_PTYS /* Make this pty number available for reallocation */ - if (devpts) { - mutex_lock(&allocated_ptys_lock); - idr_remove(&allocated_ptys, idx); - mutex_unlock(&allocated_ptys_lock); - } -#endif - + if (devpts) + devpts_kill_index(idx); } /** @@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode struct tty_struct *tty; int retval; int index; - int idr_ret; nonseekable_open(inode, filp); /* find a device that is not in use. */ - mutex_lock(&allocated_ptys_lock); - if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) { - mutex_unlock(&allocated_ptys_lock); - return -ENOMEM; - } - idr_ret = idr_get_new(&allocated_ptys, NULL, &index); - if (idr_ret < 0) { - mutex_unlock(&allocated_ptys_lock); - if (idr_ret == -EAGAIN) - return -ENOMEM; - return -EIO; - } - if (index >= pty_limit) { - idr_remove(&allocated_ptys, index); - mutex_unlock(&allocated_ptys_lock); - return -EIO; - } - mutex_unlock(&allocated_ptys_lock); + index = devpts_new_index(); + if (index < 0) + return index; mutex_lock(&tty_mutex); retval = init_dev(ptm_driver, index, &tty); @@ -2847,9 +2821,7 @@ out1: release_dev(filp); return retval; out: - mutex_lock(&allocated_ptys_lock); - idr_remove(&allocated_ptys, index); - mutex_unlock(&allocated_ptys_lock); + devpts_kill_index(index); return retval; } #endif Index: 2.6.25-rc5-mm1/fs/devpts/inode.c === --- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:04:26.0 -0700 @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,10 @@ #define DEVPTS_DEFAULT_MODE 0600 +extern int pty_limit; /* Config limit on Unix98 ptys */ +static DEFINE_IDR(allocated_ptys); +static DECLARE_
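The shape of the factored-out interface can be sketched in user-space C. This is a simplified model, not the patch's implementation: it uses a small fixed array where the kernel uses an IDR, it has no locking, and PTY_LIMIT, pts_new_index() and pts_kill_index() are hypothetical names chosen for the sketch. The -EIO return on exhaustion mirrors the limit check in the code removed from ptmx_open().

```c
#include <errno.h>
#include <stdbool.h>

#define PTY_LIMIT 4	/* hypothetical; stands in for the kernel's pty_limit */

/* All allocator state is private to this file, as the patch intends. */
static bool index_in_use[PTY_LIMIT];

/* Return the lowest free index, or -EIO when the limit is reached. */
static int pts_new_index(void)
{
	for (int i = 0; i < PTY_LIMIT; i++) {
		if (!index_in_use[i]) {
			index_in_use[i] = true;
			return i;
		}
	}
	return -EIO;
}

/* Make an index available for reallocation; out-of-range values are ignored. */
static void pts_kill_index(int idx)
{
	if (idx >= 0 && idx < PTY_LIMIT)
		index_in_use[idx] = false;
}
```

Callers then only see "give me an index" and "release this index", which is what lets later patches move the index set into a per-namespace structure without touching tty_io.c again.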
[Devel] [PATCH 3/7]: Enable multiple mounts of /dev/pts
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH 3/7]: Enable multiple mounts of /dev/pts To support multiple PTY namespaces, we should be allow multiple mounts of /dev/pts, once within each PTY namespace. This patch removes the get_sb_single() in devpts_get_sb() and uses test and set sb interfaces to allow remounting /dev/pts. The patch also removes the globals, 'devpts_mnt', 'devpts_root' and uses a skeletal 'init_pts_ns' to store the vfsmount. Changelog [v2]: - (Pavel Emelianov/Serge Hallyn) Remove reference to pts_ns from sb->s_fs_info to fix the circular reference (/dev/pts is not unmounted unless the pts_ns is destroyed, so we don't need a reference to the pts_ns). Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]> Signed-off-by: Matt Helsley <[EMAIL PROTECTED]> --- fs/devpts/inode.c | 160 +- include/linux/devpts_fs.h | 11 +++ 2 files changed, 143 insertions(+), 28 deletions(-) Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h === --- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h 2008-03-24 20:04:26.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:04:31.0 -0700 @@ -14,6 +14,17 @@ #define _LINUX_DEVPTS_FS_H #include +#include +#include +#include + +struct pts_namespace { + struct kref kref; + struct idr allocated_ptys; + struct vfsmount *mnt; +}; + +extern struct pts_namespace init_pts_ns; #ifdef CONFIG_UNIX98_PTYS Index: 2.6.25-rc5-mm1/fs/devpts/inode.c === --- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c 2008-03-24 20:04:26.0 -0700 +++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:04:31.0 -0700 @@ -28,12 +28,8 @@ #define DEVPTS_DEFAULT_MODE 0600 extern int pty_limit; /* Config limit on Unix98 ptys */ -static DEFINE_IDR(allocated_ptys); static DECLARE_MUTEX(allocated_ptys_lock); -static struct vfsmount *devpts_mnt; -static struct dentry *devpts_root; - static struct { int setuid; int setgid; @@ -54,6 +50,15 @@ static match_table_t tokens = { {Opt_err, NULL} }; +struct pts_namespace 
init_pts_ns = { + .kref = { + .refcount = ATOMIC_INIT(2), + }, + .allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys), + .mnt = NULL, +}; + + static int devpts_remount(struct super_block *sb, int *flags, char *data) { char *p; @@ -140,7 +145,7 @@ devpts_fill_super(struct super_block *s, inode->i_fop = &simple_dir_operations; inode->i_nlink = 2; - devpts_root = s->s_root = d_alloc_root(inode); + s->s_root = d_alloc_root(inode); if (s->s_root) return 0; @@ -150,17 +155,82 @@ fail: return -ENOMEM; } +/* + * We use test and set super-block operations to help determine whether we + * need a new super-block for this namespace. get_sb() walks the list of + * existing devpts supers, comparing them with the @data ptr. Since we + * passed 'current's namespace as the @data pointer we can compare the + * namespace pointer in the super-block's 's_fs_info'. If the test is + * TRUE then get_sb() returns a new active reference to the super block. + * Otherwise, it helps us build an active reference to a new one. + */ + +static int devpts_test_sb(struct super_block *sb, void *data) +{ + return sb->s_fs_info == data; +} + +static int devpts_set_sb(struct super_block *sb, void *data) +{ + /* +* new_pts_ns() mounts the pts namespace and free_pts_ns() +* drops the reference to the mount. i.e the s_fs_inf is +* cleared and vfsmnt is releasand _before_ pts_namespace +* is freed. +* +* So we don't need a reference to the pts_namespace here +* (Getting a reference here will also cause circular reference). 
+*/ + sb->s_fs_info = data; + return set_anon_super(sb, NULL); +} + static int devpts_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, struct vfsmount *mnt) { - return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt); + struct super_block *sb; + struct pts_namespace *ns; + int err; + + /* hereafter we're very similar to proc_get_sb */ + if (flags & MS_KERNMOUNT) + ns = data; + else + ns = &init_pts_ns; + + /* hereafter we're very simlar to get_sb_nodev */ + sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns); + if (IS_ERR(sb)) + return PTR_ERR(sb); + + if (sb->s_root) + return sim
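The test/set scheme described in the comment above can be modelled in user-space C. The sketch below is an assumption-laden simplification of what sget() does: struct fake_sb, fake_sget(), test_sb() and set_sb() are hypothetical names, there is no locking or reference counting, and super blocks are kept on a simple linked list.

```c
#include <stddef.h>
#include <stdlib.h>

struct fake_sb {
	void *fs_info;		/* namespace this super block belongs to */
	struct fake_sb *next;
};

static struct fake_sb *sb_list;

/* "test": does this super block already serve the given namespace? */
static int test_sb(struct fake_sb *sb, void *data)
{
	return sb->fs_info == data;
}

/* "set": remember the namespace; a borrowed pointer, no reference taken. */
static int set_sb(struct fake_sb *sb, void *data)
{
	sb->fs_info = data;
	return 0;
}

/* Return the existing super block for @data, or create one. NULL on failure. */
static struct fake_sb *fake_sget(int (*test)(struct fake_sb *, void *),
				 int (*set)(struct fake_sb *, void *),
				 void *data)
{
	struct fake_sb *sb;

	for (sb = sb_list; sb; sb = sb->next)
		if (test(sb, data))
			return sb;

	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;
	if (set(sb, data)) {
		free(sb);
		return NULL;
	}
	sb->next = sb_list;
	sb_list = sb;
	return sb;
}
```

Mounting twice with the same namespace pointer yields the same super block, while a different namespace pointer yields a fresh one, which is the property the patch relies on to get one devpts instance per pts namespace.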
[Devel] [PATCH 4/7] Implement get_pts_ns() and put_pts_ns()
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 4/7]: Implement get_pts_ns() and put_pts_ns()

Implement get_pts_ns() and put_pts_ns() interfaces.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 include/linux/devpts_fs.h |   21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h	2008-03-24 20:04:31.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h	2008-03-24 20:05:05.0 -0700
@@ -27,13 +27,26 @@ struct pts_namespace {
 extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
-
 int devpts_new_index(void);
 void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);	/* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);	/* get tty structure */
 void devpts_pty_kill(int number);		/* unlink */
 
+static inline void free_pts_ns(struct kref *ns_kref) { }
+
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+	if (ns && (ns != &init_pts_ns))
+		kref_get(&ns->kref);
+	return ns;
+}
+static inline void put_pts_ns(struct pts_namespace *ns)
+{
+	if (ns && (ns != &init_pts_ns))
+		kref_put(&ns->kref, free_pts_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -43,6 +56,12 @@ static inline int devpts_pty_new(struct
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
 
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+	return &init_pts_ns;
+}
+
+static inline void put_pts_ns(struct pts_namespace *ns) { }
 #endif
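The get/put pattern in this patch, with the statically allocated initial namespace exempt from counting, can be sketched in user-space C. Everything here is illustrative: struct fake_ns, fake_new_ns(), fake_get_ns() and fake_put_ns() are hypothetical names, a plain int stands in for the kernel's atomic kref, and the sketch frees the object on the last put, whereas free_pts_ns() in this patch is still an empty stub.

```c
#include <assert.h>
#include <stdlib.h>

struct fake_ns {
	int refcount;	/* plain int: this sketch is single-threaded */
};

static struct fake_ns init_ns = { .refcount = 1 };	/* static, never freed */
static int freed_count;				/* for observation only */

/* Allocate a namespace holding one reference for the caller. */
static struct fake_ns *fake_new_ns(void)
{
	struct fake_ns *ns = malloc(sizeof(*ns));

	if (ns)
		ns->refcount = 1;
	return ns;
}

/* Take a reference, except on NULL or the static initial namespace. */
static struct fake_ns *fake_get_ns(struct fake_ns *ns)
{
	if (ns && ns != &init_ns)
		ns->refcount++;
	return ns;
}

/* Drop a reference; the last put frees the namespace. */
static void fake_put_ns(struct fake_ns *ns)
{
	if (!ns || ns == &init_ns)
		return;
	assert(ns->refcount > 0);
	if (--ns->refcount == 0) {
		free(ns);
		freed_count++;
	}
}
```

Skipping the initial namespace keeps its count untouched no matter how many tasks share it, at the cost of a pointer comparison on every get and put.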
[Devel] [PATCH 5/7]: Determine pts_ns from a pty's inode.
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 5/7]: Determine pts_ns from a pty's inode.

The devpts interfaces currently operate on a specific pts namespace
which they get from the 'current' task.

With implementation of containers and cloning of PTS namespaces, we want
to be able to access PTYs in a child-pts-ns from a parent-pts-ns. For
instance we could bind-mount and pivot-root the child container on
'/vserver/vserver1' and then access the "pts/0" of 'vserver1' using

	$ echo foo > /vserver/vserver1/dev/pts/0

The task doing the above 'echo' could be in parent-pts-ns. So we find
the 'pts-ns' of the above file from the inode representing the above
file rather than from the 'current' task.

Note that we need to find and hold a reference to the pts_ns to prevent
the pts_ns from being freed while it is being accessed from 'outside'.

This patch implements 'pts_ns_from_inode()', which returns the pts_ns
using 'inode->i_sb->s_fs_info'.

Since the 'inode' information is not visible inside devpts code itself,
this patch modifies the tty driver code to determine the pts_ns and passes
it into devpts.

TODO:
	What is the expected behavior when '/dev/tty' or '/dev/ptmx' are
	accessed from parent-pts-ns? For example:

		$ echo "foobar" > /vserver/vserver1/dev/tty

	This patch currently ignores the '/vserver/vserver1' part (that
	seemed to be the easiest to do :-). So opening /dev/ptmx from
	even the child pts-ns will create a pty in the _PARENT_ pts-ns.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- drivers/char/pty.c|2 - drivers/char/tty_io.c | 86 +++--- fs/devpts/inode.c | 19 +++--- include/linux/devpts_fs.h | 42 +++--- 4 files changed, 119 insertions(+), 30 deletions(-) Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h === --- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h 2008-03-24 20:05:05.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:08:33.0 -0700 @@ -17,6 +17,7 @@ #include #include #include +#include struct pts_namespace { struct kref kref; @@ -26,12 +27,43 @@ struct pts_namespace { extern struct pts_namespace init_pts_ns; +#define DEVPTS_SUPER_MAGIC 0x1cd1 +static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode) +{ + /* +* Need this bug-on for now to catch any cases in tty_open() +* or release_dev() I may have missed. +*/ + BUG_ON(inode->i_sb->s_magic != DEVPTS_SUPER_MAGIC); + + /* +* If we have a valid inode, we already have a reference to +* mount-point. Since there is a single super-block for the +* devpts mount, i_sb->s_fs_info cannot go to NULL. So we +* should not need a lock here. 
+*/ + + return (struct pts_namespace *)inode->i_sb->s_fs_info; +} + +static inline struct pts_namespace *current_pts_ns(void) +{ + return &init_pts_ns; +} + + #ifdef CONFIG_UNIX98_PTYS -int devpts_new_index(void); -void devpts_kill_index(int idx); -int devpts_pty_new(struct tty_struct *tty); /* mknod in devpts */ -struct tty_struct *devpts_get_tty(int number); /* get tty structure */ -void devpts_pty_kill(int number); /* unlink */ +int devpts_new_index(struct pts_namespace *pts_ns); +void devpts_kill_index(struct pts_namespace *pts_ns, int idx); + +/* mknod in devpts */ +int devpts_pty_new(struct pts_namespace *pts_ns, struct tty_struct *tty); + +/* get tty structure */ +struct tty_struct *devpts_get_tty(struct pts_namespace *pts_ns, int number); + +/* unlink */ +void devpts_pty_kill(struct pts_namespace *pts_ns, int number); static inline void free_pts_ns(struct kref *ns_kref) { } Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c === --- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c 2008-03-24 20:04:26.0 -0700 +++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:08:15.0 -0700 @@ -2064,8 +2064,8 @@ static void tty_line_name(struct tty_dri * relaxed for the (most common) case of reopening a tty. */ -static int init_dev(struct tty_driver *driver, int idx, - struct tty_struct **ret_tty) +static int init_dev(struct tty_driver *driver, struct pts_namespace *pts_ns, + int idx, struct tty_struct **ret_tty) { struct tty_struct *tty, *o_tty; struct ktermios *tp, **tp_loc, *o_tp, **o_tp_loc; @@ -2074,7 +2074,7 @@ static int init_dev(struct tty_driver *d /* check whether we're reopening an existing tty */ if (driver->fl
[Devel] [PATCH 6/7]: Check for user-space mount of /dev/pts
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH 6/7]: Check for user-space mount of /dev/pts When the pts namespace is cloned, the /dev/pts is not useful unless it is remounted from the user space. If user-space clones pts namespace but does not remount /dev/pts, it would end up using the /dev/pts mount from parent-pts-ns but allocate the pts indices from current pts ns. This patch (hack ?) prevents creation of PTYs in user space unless user-space mounts /dev/pts. (While this patch can be folded into others, keeping this separate for now for easier review (and to highlight the hack :-) Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- fs/devpts/inode.c | 25 +++-- include/linux/devpts_fs.h | 20 +++- 2 files changed, 42 insertions(+), 3 deletions(-) Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h === --- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h 2008-03-24 20:08:33.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:08:57.0 -0700 @@ -23,6 +23,7 @@ struct pts_namespace { struct kref kref; struct idr allocated_ptys; struct vfsmount *mnt; + int user_mounted; }; extern struct pts_namespace init_pts_ns; @@ -30,6 +31,8 @@ extern struct pts_namespace init_pts_ns; #define DEVPTS_SUPER_MAGIC 0x1cd1 static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode) { + struct pts_namespace *ns; + /* * Need this bug-on for now to catch any cases in tty_open() * or release_dev() I may have missed. @@ -43,7 +46,22 @@ static inline struct pts_namespace *pts_ * should not need a lock here. */ - return (struct pts_namespace *)inode->i_sb->s_fs_info; + ns = (struct pts_namespace *)inode->i_sb->s_fs_info; + + /* +* If user-space did not mount pts ns after cloning pts namespace, +* the child process would end up accessing devpts mount of the +* parent but use allocated_ptys from the cloned pts ns. +* +* This check prevents creating ptys unless user-space mounts +* devpts in the new pts namespace. 
+* +* Is there a cleaner way to prevent this ? +*/ + if (!ns->user_mounted) + return NULL; + + return ns; } static inline struct pts_namespace *current_pts_ns(void) Index: 2.6.25-rc5-mm1/fs/devpts/inode.c === --- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c 2008-03-24 20:08:33.0 -0700 +++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:08:57.0 -0700 @@ -201,8 +201,11 @@ static int devpts_get_sb(struct file_sys if (IS_ERR(sb)) return PTR_ERR(sb); - if (sb->s_root) + if (sb->s_root) { + if (!(flags & MS_KERNMOUNT)) + ns->user_mounted = 1; return simple_set_mnt(mnt, sb); + } sb->s_flags = flags; err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0); @@ -248,6 +251,10 @@ int devpts_new_index(struct pts_namespac int index; int idr_ret; + if (!pts_ns || !pts_ns->user_mounted) { + printk(KERN_ERR "devpts_new_index() without user_mount\n"); + return -ENOSYS; + } retry: if (!idr_pre_get(&pts_ns->allocated_ptys, GFP_KERNEL)) { return -ENOMEM; @@ -273,7 +280,7 @@ retry: void devpts_kill_index(struct pts_namespace *pts_ns, int idx) { - + BUG_ON(!pts_ns->user_mounted); down(&allocated_ptys_lock); idr_remove(&pts_ns->allocated_ptys, idx); up(&allocated_ptys_lock); @@ -293,6 +300,11 @@ int devpts_pty_new( struct pts_namespace BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY); BUG_ON(driver->subtype != PTY_TYPE_SLAVE); + if (!pts_ns || !pts_ns->user_mounted) { + printk(KERN_ERR "devpts_pty_new() without user_mount\n"); + return -ENOSYS; + } + mnt = pts_ns->mnt; root = mnt->mnt_root; @@ -332,6 +344,11 @@ struct tty_struct *devpts_get_tty(struct struct dentry *dentry; struct tty_struct *tty; + if (!pts_ns || !pts_ns->user_mounted) { + printk(KERN_ERR "devpts_get_tty() without user_mount\n"); + return ERR_PTR(-ENOSYS); + } + mnt = pts_ns->mnt; dentry = get_node(mnt->mnt_root, number); @@ -353,6 +370,10 @@ void devpts_pty_kill(struct pts_namespac struct dentry *dentry; struct dentry *root; + if (!pts_ns || !pts_ns->user_mounted) { + printk(KERN_ERR "devpts_pty_kill() wi
[Devel] [PATCH 7/7]: Enable cloning PTY namespaces
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH 7/7]: Enable cloning PTY namespaces Enable cloning PTY namespaces. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]> Signed-off-by: Matt Helsley <[EMAIL PROTECTED]> --- fs/devpts/inode.c | 40 +++- include/linux/devpts_fs.h | 22 -- include/linux/init_task.h |1 + include/linux/nsproxy.h |2 ++ include/linux/sched.h |1 + kernel/fork.c |2 +- kernel/nsproxy.c | 17 +++-- 7 files changed, 79 insertions(+), 6 deletions(-) Index: 2.6.25-rc5-mm1/include/linux/sched.h === --- 2.6.25-rc5-mm1.orig/include/linux/sched.h 2008-03-24 20:02:57.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/sched.h2008-03-24 20:12:56.0 -0700 @@ -28,6 +28,7 @@ #define CLONE_NEWPID 0x2000 /* New pid namespace */ #define CLONE_NEWNET 0x4000 /* New network namespace */ #define CLONE_IO 0x8000 /* Clone io context */ +#define CLONE_NEWPTS 0x0002ULL /* Clone pts ns */ /* * Scheduling policies Index: 2.6.25-rc5-mm1/include/linux/nsproxy.h === --- 2.6.25-rc5-mm1.orig/include/linux/nsproxy.h 2008-03-24 20:02:57.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/nsproxy.h 2008-03-24 20:12:56.0 -0700 @@ -8,6 +8,7 @@ struct mnt_namespace; struct uts_namespace; struct ipc_namespace; struct pid_namespace; +struct pts_namespace; /* * A structure to contain pointers to all per-process @@ -29,6 +30,7 @@ struct nsproxy { struct pid_namespace *pid_ns; struct user_namespace *user_ns; struct net *net_ns; + struct pts_namespace *pts_ns; }; extern struct nsproxy init_nsproxy; Index: 2.6.25-rc5-mm1/include/linux/init_task.h === --- 2.6.25-rc5-mm1.orig/include/linux/init_task.h 2008-03-24 20:02:57.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/init_task.h2008-03-24 20:12:56.0 -0700 @@ -78,6 +78,7 @@ extern struct nsproxy init_nsproxy; .mnt_ns = NULL, \ INIT_NET_NS(net_ns) \ INIT_IPC_NS(ipc_ns) \ + .pts_ns = &init_pts_ns, \ .user_ns= &init_user_ns,\ } Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h === --- 
2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h 2008-03-24 20:08:57.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:12:56.0 -0700 @@ -66,7 +66,7 @@ static inline struct pts_namespace *pts_ static inline struct pts_namespace *current_pts_ns(void) { - return &init_pts_ns; + return current->nsproxy->pts_ns; } @@ -83,7 +83,8 @@ struct tty_struct *devpts_get_tty(struct /* unlink */ void devpts_pty_kill(struct pts_namespace *pts_ns, int number); -static inline void free_pts_ns(struct kref *ns_kref) { } +extern struct pts_namespace *new_pts_ns(void); +extern void free_pts_ns(struct kref *kref); static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns) { @@ -97,6 +98,15 @@ static inline void put_pts_ns(struct pts kref_put(&ns->kref, free_pts_ns); } +static inline struct pts_namespace *copy_pts_ns(u64 flags, + struct pts_namespace *old_ns) +{ + if (flags & CLONE_NEWPTS) + return new_pts_ns(); + else + return get_pts_ns(old_ns); +} + #else /* Dummy stubs in the no-pty case */ @@ -112,6 +122,14 @@ static inline struct pts_namespace *get_ } static inline void put_pts_ns(struct pts_namespace *ns) { } + +static inline struct pts_namespace *copy_pts_ns(u64 flags, + struct pts_namespace *old_ns) +{ + if (flags & CLONE_NEWPTS) + return ERR_PTR(-EINVAL); + return old_ns; +} #endif Index: 2.6.25-rc5-mm1/fs/devpts/inode.c === --- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c 2008-03-24 20:08:57.0 -0700 +++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:14:20.0 -0700 @@ -27,6 +27,7 @@ extern int pty_limit; /* Config limit on Unix98 ptys */ static DECLARE_MUTEX(allocated_ptys_lock); +static struct file_system_type devpts_fs_type; static struct { int setuid; @@ -56,6 +57,43 @@ struct pts_namespace init_pts_ns = { .mnt = NULL, }; +struct pts_namesp
[Devel] Re: [PATCH 6/7]: Check for user-space mount of /dev/pts
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| >
| > From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > Subject: [PATCH 6/7]: Check for user-space mount of /dev/pts
| >
| > When the pts namespace is cloned, the /dev/pts is not useful unless it
| > is remounted from the user space.
| >
| > If user-space clones pts namespace but does not remount /dev/pts, it
| > would end up using the /dev/pts mount from parent-pts-ns but allocate
| > the pts indices from current pts ns.
|
| So why not use the allocated_ptys from the parent ptsns?  It's what
| userspace asked for and it's safe to do.

The problem is that when opening /dev/ptmx, we use current_pts_ns(), and
when opening the slave pty, we use the pts_ns from the inode.

If child-pts-ns opens /dev/ptmx, we use 'allocated-ptys' from child-pts-ns
and we allocate index 0. But when the process opens the slave pty
"/dev/pts/0", we would get the pts_ns from the inode, which would come
from parent-pts-ns (and could refer to an existing pty).

Agree with Alexey and Pavel, it's bad. Will think some more, but
appreciate any ideas.

Sukadev
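The mismatch described in this reply can be reproduced with a small user-space model. This is a sketch under stated assumptions, not kernel code: struct fake_pts_ns, fake_ptmx_open() and fake_slave_exists() are hypothetical names, and each namespace is reduced to a small array recording which indices are in use.

```c
#include <stdbool.h>

#define MAX_PTYS 4	/* hypothetical limit for the sketch */

struct fake_pts_ns {
	bool allocated[MAX_PTYS];	/* stands in for allocated_ptys */
};

/* Master side: the index comes from the opener's own namespace. */
static int fake_ptmx_open(struct fake_pts_ns *current_ns)
{
	for (int i = 0; i < MAX_PTYS; i++) {
		if (!current_ns->allocated[i]) {
			current_ns->allocated[i] = true;
			return i;
		}
	}
	return -1;
}

/*
 * Slave side: the namespace comes from the mount that the path resolves
 * through. Returns true when @idx is already in use in that namespace.
 */
static bool fake_slave_exists(const struct fake_pts_ns *mount_ns, int idx)
{
	return idx >= 0 && idx < MAX_PTYS && mount_ns->allocated[idx];
}
```

If the parent already holds index 0 and a child that never remounted /dev/pts opens the master, the child is also handed index 0 from its own set, yet fake_slave_exists(&parent, 0) is true: the slave path resolves to the parent's existing pty, which is the collision the reply describes.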
[Devel] Re: [PATCH 4/7] Implement get_pts_ns() and put_pts_ns()
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| >
| > From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > Subject: [PATCH 4/7]: Implement get_pts_ns() and put_pts_ns()
| >
| > Implement get_pts_ns() and put_pts_ns() interfaces.
| >
| > Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > ---
| >  include/linux/devpts_fs.h |   21 ++++++++++++++++++++-
| >  1 file changed, 20 insertions(+), 1 deletion(-)
| >
| > Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
| > ===
| > --- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h	2008-03-24 20:04:31.0 -0700
| > +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h	2008-03-24 20:05:05.0 -0700
| > @@ -27,13 +27,26 @@ struct pts_namespace {
| >  extern struct pts_namespace init_pts_ns;
| >
| >  #ifdef CONFIG_UNIX98_PTYS
| > -
| >  int devpts_new_index(void);
| >  void devpts_kill_index(int idx);
| >  int devpts_pty_new(struct tty_struct *tty);	/* mknod in devpts */
| >  struct tty_struct *devpts_get_tty(int number);	/* get tty structure */
| >  void devpts_pty_kill(int number);		/* unlink */
| >
| > +static inline void free_pts_ns(struct kref *ns_kref) { }
| > +
| > +static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
| > +{
| > +	if (ns && (ns != &init_pts_ns))
| > +		kref_get(&ns->kref);
| > +	return ns;
| > +}
| > +static inline void put_pts_ns(struct pts_namespace *ns)
| > +{
| > +	if (ns && (ns != &init_pts_ns))
| > +		kref_put(&ns->kref, free_pts_ns);
|
| This isn't right, or I'm not thinking right.  Don't you somewhere
| need to
|
|	1. rcu_assign ns->mnt->mnt_sb->s_fs_info to NULL
|	2. wait a grace period
|	3. call free_pts_ns and check the refcount on the ns again?
|
| and then do pts_ns_from_inode() under an rcu_read_lock and grab
| a ref to the ns?

Yes, we need the rcu to grab the reference to pts_ns.
[Devel] Re: [PATCH 5/7]: Determine pts_ns from a pty's inode.
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| Quoting Serge E. Hallyn ([EMAIL PROTECTED]):
| > Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| > >
| > > From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > > Subject: [PATCH 5/7]: Determine pts_ns from a pty's inode.
| > >
| > > The devpts interfaces currently operate on a specific pts namespace
| > > which they get from the 'current' task.
| > >
| > > With implementation of containers and cloning of PTS namespaces, we want
| > > to be able to access PTYs in a child-pts-ns from a parent-pts-ns. For
| > > instance we could bind-mount and pivot-root the child container on
| > > '/vserver/vserver1' and then access the "pts/0" of 'vserver1' using
| > >
| > > 	$ echo foo > /vserver/vserver1/dev/pts/0
| > >
| > > The task doing the above 'echo' could be in parent-pts-ns. So we find
| > > the 'pts-ns' of the above file from the inode representing the above
| > > file rather than from the 'current' task.
| > >
| > > Note that we need to find and hold a reference to the pts_ns to prevent
| > > the pts_ns from being freed while it is being accessed from 'outside'.
| > >
| > > This patch implements, 'pts_ns_from_inode()' which returns the pts_ns
| > > using 'inode->i_sb->s_fs_info'.
| > >
| > > Since, the 'inode' information is not visible inside devpts code itself,
| > > this patch modifies the tty driver code to determine the pts_ns and passes
| > > it into devpts.
| > >
| > > TODO:
| > > 	What is the expected behavior when '/dev/tty' or '/dev/ptmx' are
| > > 	accessed from parent-pts-ns. i.e:
| > >
| > > 		$ echo "foobar" > /vserver/vserver1/dev/tty)
| > >
| > > 	This patch currently ignores the '/vserver/vserver1' part (that
| >
| > The way this is phrased it almost sounds like you're considering using
| > the pathnames to figure out the ptsns to use :).
| >
| > It's not clear to me what is the sane thing to do.
| >
| > what you're doing here - have /dev/ptmx and /dev/tty always use
| > current->'s ptsns - isn't ideal.
| >
| > It would be nicer to not have a 'devpts ns', and instead have a
| > full device namespace.  However, then it still isn't clear how to tie
| > /vs/vs1/dev/ptmx to vs1's device namespace, since there is no device
| > fs to which to tie the devns.
| >
| > We could tie the devns to a device inode on mknod, using the devns of
| > the creating task.  Then when starting up vs1, you just have to always
| > let vs1 create /dev/ptmx and /dev/tty.  I can't think of anything
| > better offhand.
| >
| > Other ideas?
|
| I suppose you could just create /dev/pts/ptmx and /dev/pts/tty.
| Recommend that in containers /dev/ptmx and /dev/tty be symlinks
| into /dev/pts.  Applications don't need to change.  If
| ptmx_open() sees that inode->i_sb is a devptsfs, it gets the
| namespace from the sb.  If not, then it was a device in /dev
| and it gets the namespace from current.

But we would still depend on user-space remounting /dev/pts after
the clone right ?  Until they do that we would access the parent
container's /dev/pts/ptmx ?
[Devel] Re: [PATCH 5/7]: Determine pts_ns from a pty's inode.
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| > | I suppose you could just create /dev/pts/ptmx and /dev/pts/tty.
| > | Recommend that in containers /dev/ptmx and /dev/tty be symlinks
| > | into /dev/pts.  Applications don't need to change.  If
| > | ptmx_open() sees that inode->i_sb is a devptsfs, it gets the
| > | namespace from the sb.  If not, then it was a device in /dev
| > | and it gets the namespace from current.
| >
| > But we would still depend on user-space remounting /dev/pts after
| > the clone right ?  Until they do that we would access the parent
| > container's /dev/pts/ptmx ?
|
| Yes.  Which is the right thing to do imo.

Hmm, that sounds reasonable, although slightly inconsistent with pid-ns,
where pid starts at 1 regardless of whether /proc is remounted.

But even so, if user fails to establish the symlink, clones the pts ns
and tries to create a pty, we would end up with different pts nses again ?

i.e
	/dev/ptmx is still a char dev in root fs
	clone(pts_ns)
	( In child, (before remount /dev/pts))
	open("/dev/ptmx")
	open("/dev/pts/0")

Since ptmx is not in devpts, we use current_pts_ns() or child-pts-ns
Since /dev/pts is not remounted in child, we get the parent pts-ns from
inode.

If we can somehow detect the incorrect configuration and fail either
open, we should be ok :-)

Suka
[Devel] Re: [PATCH 5/7]: Determine pts_ns from a pty's inode.
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| > Serge E. Hallyn [EMAIL PROTECTED] wrote:
| > | > | I suppose you could just create /dev/pts/ptmx and /dev/pts/tty.
| > | > | Recommend that in containers /dev/ptmx and /dev/tty be symlinks
| > | > | into /dev/pts.  Applications don't need to change.  If
| > | > | ptmx_open() sees that inode->i_sb is a devptsfs, it gets the
| > | > | namespace from the sb.  If not, then it was a device in /dev
| > | > | and it gets the namespace from current.
| > | >
| > | > But we would still depend on user-space remounting /dev/pts after
| > | > the clone right ?  Until they do that we would access the parent
| > | > container's /dev/pts/ptmx ?
| > |
| > | Yes.  Which is the right thing to do imo.
| >
| > Hmm, that sounds reasonable, although slightly inconsistent with pid-ns,
| > where pid starts at 1 regardless of whether /proc is remounted.
|
| Very different cases.  The pid is the task's pid in the new pidns.
| The task ALSO has a different pid in the parent pidns.
|
| The pts only has an identity in one ptsns.
|
| > But even so, if user fails to establish the symlink, clones the pts ns
| > and tries to create a pty, we would end up with different pts nses again ?
|
| Yes.  So what?

We would end up allocating a pts index from child-pts-ns (i.e index 0)
and attempt to open /dev/pts/0 which could be an existing pty in the
parent pts ns ?

|
| > i.e
| > 	/dev/ptmx is still a char dev in root fs
| > 	clone(pts_ns)
| > 	( In child, (before remount /dev/pts))
| > 	open("/dev/ptmx")
| > 	open("/dev/pts/0")
| >
| > Since ptmx is not in devpts, we use current_pts_ns() or child-pts-ns
| > Since /dev/pts is not remounted in child, we get the parent pts-ns from
| > inode.
| >
| > If we can somehow detect the incorrect configuration and fail either
| > open, we should be ok :-)
|
| I completely disagree with this sentiment.  The kernel doesn't need
| to detect an "incorrect configuration" if it isn't dangerous.  One
| man's "incorrect configuration" is another man's useful trick.

Maybe configuration is the wrong word, but unless I am missing something
above, spanning two pts-nses is an error condition ?
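
The scenario being argued about can be made concrete with a small userspace
model.  It is a sketch under assumptions, not anything from the patchset:
'ptsns_model' and both functions are hypothetical, each namespace is reduced
to a table of owner ids, and "which devpts is mounted" is simply the table
passed to the slave lookup.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_PTYS 4

/* Hypothetical model of one pts namespace: owner id per pty index. */
struct ptsns_model {
	int owner[MODEL_MAX_PTYS];	/* 0 means the index is free */
};

/* Allocate the lowest free index, as an open of ptmx would. */
static int model_ptmx_open(struct ptsns_model *ns, int who)
{
	assert(who != 0);
	for (int i = 0; i < MODEL_MAX_PTYS; i++) {
		if (ns->owner[i] == 0) {
			ns->owner[i] = who;
			return i;
		}
	}
	return -1;
}

/* Resolve /dev/pts/<idx> in the namespace whose devpts is mounted there. */
static int model_slave_open(const struct ptsns_model *mounted, int idx)
{
	assert(idx >= 0 && idx < MODEL_MAX_PTYS);
	return mounted->owner[idx];
}
```

If the child allocates index 0 from its own namespace while /dev/pts still
refers to the parent's mount, model_slave_open() on the parent table returns
the parent's owner for index 0, which is the mismatch Suka describes.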
[Devel] [PATCH 3/7][v2]: Enable multiple mounts of /dev/pts
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 3/7][v2]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, we should allow multiple mounts of
/dev/pts, once within each PTY namespace.

This patch removes the get_sb_single() in devpts_get_sb() and uses test and
set sb interfaces to allow remounting /dev/pts.  The patch also removes the
globals, 'devpts_mnt', 'devpts_root' and uses a skeletal 'init_pts_ns' to
store the vfsmount.

Changelog [v3]:
	- Removed some unnecessary comments from devpts_set_sb()

Changelog [v2]:
	- (Pavel Emelianov/Serge Hallyn) Remove reference to pts_ns from
	  sb->s_fs_info to fix the circular reference (/dev/pts is not
	  unmounted unless the pts_ns is destroyed, so we don't need a
	  reference to the pts_ns).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c         |  151 +-
 include/linux/devpts_fs.h |   11 +++
 2 files changed, 134 insertions(+), 28 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===================================================================
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h	2008-03-24 20:04:26.000000000 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h	2008-04-01 18:08:42.000000000 -0700
@@ -14,6 +14,17 @@
 #define _LINUX_DEVPTS_FS_H
 
 #include
+#include
+#include
+#include
+
+struct pts_namespace {
+	struct kref kref;
+	struct idr allocated_ptys;
+	struct vfsmount *mnt;
+};
+
+extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
 
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===================================================================
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c	2008-03-24 20:04:26.000000000 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c	2008-04-01 18:08:41.000000000 -0700
@@ -28,12 +28,8 @@
 #define DEVPTS_DEFAULT_MODE 0600
 
 extern int pty_limit;		/* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
 
-static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
-
 static struct {
 	int setuid;
 	int setgid;
@@ -54,6 +50,15 @@ static match_table_t tokens = {
 	{Opt_err, NULL}
 };
 
+struct pts_namespace init_pts_ns = {
+	.kref = {
+		.refcount = ATOMIC_INIT(2),
+	},
+	.allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys),
+	.mnt = NULL,
+};
+
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
 	char *p;
@@ -140,7 +145,7 @@ devpts_fill_super(struct super_block *s,
 	inode->i_fop = &simple_dir_operations;
 	inode->i_nlink = 2;
 
-	devpts_root = s->s_root = d_alloc_root(inode);
+	s->s_root = d_alloc_root(inode);
 	if (s->s_root)
 		return 0;
 
@@ -150,17 +155,73 @@ fail:
 	return -ENOMEM;
 }
 
+/*
+ * We use test and set super-block operations to help determine whether we
+ * need a new super-block for this namespace. get_sb() walks the list of
+ * existing devpts supers, comparing them with the @data ptr. Since we
+ * passed 'current's namespace as the @data pointer we can compare the
+ * namespace pointer in the super-block's 's_fs_info'.  If the test is
+ * TRUE then get_sb() returns a new active reference to the super block.
+ * Otherwise, it helps us build an active reference to a new one.
+ */
+
+static int devpts_test_sb(struct super_block *sb, void *data)
+{
+	return sb->s_fs_info == data;
+}
+
+static int devpts_set_sb(struct super_block *sb, void *data)
+{
+	sb->s_fs_info = data;
+	return set_anon_super(sb, NULL);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-	return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+	struct super_block *sb;
+	struct pts_namespace *ns;
+	int err;
+
+	/* hereafter we're very similar to proc_get_sb */
+	if (flags & MS_KERNMOUNT)
+		ns = data;
+	else
+		ns = &init_pts_ns;
+
+	/* hereafter we're very similar to get_sb_nodev */
+	sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns);
+	if (IS_ERR(sb))
+		return PTR_ERR(sb);
+
+	if (sb->s_root)
+		return simple_set_mnt(mnt, sb);
+
+	sb->s_flags = flags;
+	err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+	if (err) {
+		up_write(&sb->s_umount);
+		deactivate_super(sb);
+		return err;
+	}
+
+	sb->s_flags |= MS_ACTIVE;
+
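
As an aside for readers, the test/set lookup described in the comment above
can be modelled in userspace roughly as follows.  This is a simplified
sketch, assuming a fixed-size table in place of the kernel's super-block
list; 'super_model', 'model_sget' and the helpers are hypothetical names and
none of this is the VFS API.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SUPERS 8

/* Hypothetical stand-in for struct super_block. */
struct super_model {
	void *fs_info;	/* namespace this super block is bound to */
	int active;	/* active-reference count */
};

static struct super_model model_supers[MODEL_MAX_SUPERS];
static size_t model_nr_supers;

/* "test": does this super block already belong to the namespace? */
static int model_test_sb(const struct super_model *sb, void *data)
{
	return sb->fs_info == data;
}

/* "set": bind a freshly created super block to the namespace. */
static void model_set_sb(struct super_model *sb, void *data)
{
	sb->fs_info = data;
}

/* Return the super block already bound to 'data', or bind a new one. */
static struct super_model *model_sget(void *data)
{
	assert(data != NULL);	/* a namespace pointer is required */

	for (size_t i = 0; i < model_nr_supers; i++) {
		if (model_test_sb(&model_supers[i], data)) {
			model_supers[i].active++;
			return &model_supers[i];
		}
	}
	if (model_nr_supers == MODEL_MAX_SUPERS)
		return NULL;

	struct super_model *sb = &model_supers[model_nr_supers++];
	model_set_sb(sb, data);
	sb->active = 1;
	return sb;
}
```

Mounting twice with the same namespace pointer yields the same super block
with one more active reference; a different namespace pointer yields a
separate super block, which is the behaviour the patch wants from /dev/pts.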
[Devel] [PATCH 6/7][v2]: Determine pts_ns from a pty's inode
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 6/7][v2]: Determine pts_ns from a pty's inode.

The devpts interfaces currently operate on a specific pts namespace
which they get from the 'current' task.

With implementation of containers and cloning of PTS namespaces, we want
to be able to access PTYs in a child-pts-ns from a parent-pts-ns. For
instance we could bind-mount and pivot-root the child container on
'/vserver/vserver1' and then access the "pts/0" of 'vserver1' using

	$ echo foo > /vserver/vserver1/dev/pts/0

The task doing the above 'echo' could be in parent-pts-ns. So we find
the 'pts-ns' of the above file from the inode representing the above
file rather than from the 'current' task.

Note that we need to find and hold a reference to the pts_ns to prevent
the pts_ns from being freed while it is being accessed from 'outside'.

This patch implements 'pts_ns_from_inode()', which returns the pts_ns
using 'inode->i_sb->s_fs_info'.

Since the 'inode' information is not visible inside devpts code itself,
this patch modifies the tty driver code to determine the pts_ns and passes
it into devpts.

Changelog [v2]:
	[Serge Hallyn] Use rcu to access sb->s_fs_info.

	[Serge Hallyn] Simplify handling of ptmx and tty devices by expecting
	user to create them in /dev/pts (see also devpts-mknod patch)

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/pty.c        |   13 +-
 drivers/char/tty_io.c     |   93 +++---
 fs/devpts/inode.c         |   19 +++--
 include/linux/devpts_fs.h |   38 --
 4 files changed, 131 insertions(+), 32 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===================================================================
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h	2008-04-02 22:42:08.000000000 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h	2008-04-02 22:42:14.000000000 -0700
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 
 struct pts_namespace {
 	struct kref kref;
@@ -26,12 +27,39 @@ struct pts_namespace {
 
 extern struct pts_namespace init_pts_ns;
 
+#define DEVPTS_SUPER_MAGIC 0x1cd1
+
+static inline struct pts_namespace *current_pts_ns(void)
+{
+	return &init_pts_ns;
+}
+
+static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode)
+{
+	/*
+	 * If this file exists on devpts, return the pts_ns from the
+	 * devpts super-block. Otherwise just use the pts-ns of the
+	 * calling task.
+	 */
+	if(inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
+		return rcu_dereference(inode->i_sb->s_fs_info);
+
+	return current_pts_ns();
+}
+
+
 #ifdef CONFIG_UNIX98_PTYS
-int devpts_new_index(void);
-void devpts_kill_index(int idx);
-int devpts_pty_new(struct tty_struct *tty);      /* mknod in devpts */
-struct tty_struct *devpts_get_tty(int number);   /* get tty structure */
-void devpts_pty_kill(int number);                /* unlink */
+int devpts_new_index(struct pts_namespace *pts_ns);
+void devpts_kill_index(struct pts_namespace *pts_ns, int idx);
+
+/* mknod in devpts */
+int devpts_pty_new(struct pts_namespace *pts_ns, struct tty_struct *tty);
+
+/* get tty structure */
+struct tty_struct *devpts_get_tty(struct pts_namespace *pts_ns, int number);
+
+/* unlink */
+void devpts_pty_kill(struct pts_namespace *pts_ns, int number);
 
 static inline void free_pts_ns(struct kref *ns_kref) { }
 
Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===================================================================
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c	2008-04-02 22:35:29.000000000 -0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c	2008-04-02 22:42:14.000000000 -0700
@@ -2064,8 +2064,8 @@ static void tty_line_name(struct tty_dri
  * relaxed for the (most common) case of reopening a tty.
  */
 
-static int init_dev(struct tty_driver *driver, int idx,
-	struct tty_struct **ret_tty)
+static int init_dev(struct tty_driver *driver, struct pts_namespace *pts_ns,
+		int idx, struct tty_struct **ret_tty)
 {
 	struct tty_struct *tty, *o_tty;
 	struct ktermios *tp, **tp_loc, *o_tp, **o_tp_loc;
@@ -2074,7 +2074,11 @@ static int init_dev(struct tty_driver *d
 
 	/* check whether we're reopening an existing tty */
 	if (driver->flags & TTY_DRIVER_DEVPTS_MEM) {
-		tty = devpts_get_tty(idx);
+		tty = devpts_get_tty(pts_ns, idx);
+		if (IS_ERR(tty)) {
+			retval = PTR_ERR(tty);
+			goto end_init;
+		}
 		/*
 		 * If we don't have a tty here on a slave open, it's because
 		 * the master alre
[Devel] [PATCH 7/7][v2]: Enable cloning PTY namespaces
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 7/7][v2]: Enable cloning PTY namespaces

Enable cloning PTY namespaces.

Changelog[v2]:
	[Serge Hallyn]: Use rcu to access sb->s_fs_info.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c         |   84 --
 include/linux/devpts_fs.h |   22 ++--
 include/linux/init_task.h |    1
 include/linux/nsproxy.h   |    2 +
 include/linux/sched.h     |    1
 kernel/fork.c             |    2 -
 kernel/nsproxy.c          |   17 -
 7 files changed, 122 insertions(+), 7 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/sched.h
===================================================================
--- 2.6.25-rc5-mm1.orig/include/linux/sched.h	2008-04-02 22:50:22.000000000 -0700
+++ 2.6.25-rc5-mm1/include/linux/sched.h	2008-04-02 22:51:59.000000000 -0700
@@ -28,6 +28,7 @@
 #define CLONE_NEWPID		0x20000000	/* New pid namespace */
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
+#define CLONE_NEWPTS		0x0000000200000000ULL	/* Clone pts ns */
 
 /*
  * Scheduling policies
Index: 2.6.25-rc5-mm1/include/linux/nsproxy.h
===================================================================
--- 2.6.25-rc5-mm1.orig/include/linux/nsproxy.h	2008-04-02 22:50:22.000000000 -0700
+++ 2.6.25-rc5-mm1/include/linux/nsproxy.h	2008-04-02 22:51:59.000000000 -0700
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct pts_namespace;
 
 /*
  * A structure to contain pointers to all per-process
@@ -29,6 +30,7 @@ struct nsproxy {
 	struct pid_namespace *pid_ns;
 	struct user_namespace *user_ns;
 	struct net *net_ns;
+	struct pts_namespace *pts_ns;
 };
 extern struct nsproxy init_nsproxy;
 
Index: 2.6.25-rc5-mm1/include/linux/init_task.h
===================================================================
--- 2.6.25-rc5-mm1.orig/include/linux/init_task.h	2008-04-02 22:50:22.000000000 -0700
+++ 2.6.25-rc5-mm1/include/linux/init_task.h	2008-04-02 22:51:59.000000000 -0700
@@ -78,6 +78,7 @@ extern struct nsproxy init_nsproxy;
 	.mnt_ns		= NULL,						\
 	INIT_NET_NS(net_ns)						\
 	INIT_IPC_NS(ipc_ns)						\
+	.pts_ns		= &init_pts_ns,					\
 	.user_ns	= &init_user_ns,				\
 }
 
Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===================================================================
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h	2008-04-02 22:51:59.000000000 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h	2008-04-02 22:51:59.000000000 -0700
@@ -31,7 +31,7 @@ extern struct pts_namespace init_pts_ns;
 
 static inline struct pts_namespace *current_pts_ns(void)
 {
-	return &init_pts_ns;
+	return current->nsproxy->pts_ns;
 }
 
 static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode)
@@ -61,7 +61,8 @@ struct tty_struct *devpts_get_tty(struct
 /* unlink */
 void devpts_pty_kill(struct pts_namespace *pts_ns, int number);
 
-static inline void free_pts_ns(struct kref *ns_kref) { }
+extern struct pts_namespace *new_pts_ns(void);
+extern void free_pts_ns(struct kref *kref);
 
 static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
 {
@@ -75,6 +76,15 @@ static inline void put_pts_ns(struct pts
 		kref_put(&ns->kref, free_pts_ns);
 }
 
+static inline struct pts_namespace *copy_pts_ns(u64 flags,
+		struct pts_namespace *old_ns)
+{
+	if (flags & CLONE_NEWPTS)
+		return new_pts_ns();
+	else
+		return get_pts_ns(old_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -90,6 +100,14 @@ static inline struct pts_namespace *get_
 }
 
 static inline void put_pts_ns(struct pts_namespace *ns) { }
+
+static inline struct pts_namespace *copy_pts_ns(u64 flags,
+		struct pts_namespace *old_ns)
+{
+	if (flags & CLONE_NEWPTS)
+		return ERR_PTR(-EINVAL);
+	return old_ns;
+}
 #endif
 
 
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===================================================================
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c	2008-04-02 22:51:59.000000000 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c	2008-04-02 22:51:59.000000000 -0700
@@ -27,6 +27,7 @@
 extern int pty_limit;		/* Config limit on Unix98 ptys */
 static DECLARE_MUTEX(allocated_ptys_lock);
+static struct file_system_type devpts_fs_type;
 
 static st
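
A userspace illustration of the copy-on-clone decision made by copy_pts_ns()
may help here.  It is a sketch under assumptions: 'ns_copy_model' and the
functions are hypothetical, the reference count is a plain int, and
MODEL_CLONE_NEWPTS is an illustrative bit chosen for this example only.  The
one property it relies on is that the flag lies above bit 31, which is why
the flags argument has to be 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative flag for this sketch only; it lies above bit 31. */
#define MODEL_CLONE_NEWPTS (UINT64_C(1) << 33)

/* Hypothetical stand-in for struct pts_namespace. */
struct ns_copy_model {
	int refcount;
};

static struct ns_copy_model *model_new_ns(void)
{
	struct ns_copy_model *ns = malloc(sizeof(*ns));

	if (ns)
		ns->refcount = 1;
	return ns;
}

/* New namespace if the flag is set, otherwise share the old one. */
static struct ns_copy_model *model_copy_ns(uint64_t flags,
					   struct ns_copy_model *old_ns)
{
	assert(old_ns != NULL);
	if (flags & MODEL_CLONE_NEWPTS)
		return model_new_ns();	/* child gets a private namespace */
	old_ns->refcount++;		/* child shares the parent's namespace */
	return old_ns;
}
```

Truncating the flags to 32 bits would silently drop the request for a new
namespace, so a caller must carry them in a 64-bit type end to end.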
[Devel] [PATCH 4/7][v2]: Allow mknod of ptmx and tty in devpts
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 4/7][v2]: Allow mknod of ptmx and tty in devpts

We want to allow administrators to access PTYs in descendant
pts-namespaces, for instance "echo foo > /vserver/vserver1/dev/pts/0".
To enable such access we must hold a reference to the pts-ns in which
the device (ptmx or the slave pty) exists.  Note that we cannot use the
pts-ns of the 'current' process since that pts-ns could be different from
the pts-ns in which the PTY device was created. So we find the pts-ns from
the inode of the PTY (inode->i_sb->s_fs_info).

While this would work for the slave PTY devices like /dev/pts/0, it would
not work for either the master PTY device (/dev/ptmx) or controlling
terminal (/dev/tty).

To uniformly handle the master, slave and controlling ttys, we allow creation
of 'ptmx' and 'tty' devices in /dev/pts. When creating containers, the
administrator can then:

	$ umount /dev/pts
	$ mount -t devpts lxcpts /dev/pts
	$ mknod /dev/pts/ptmx c 5 2
	$ mknod /dev/pts/tty c 5 0
	$ rm /dev/ptmx /dev/tty
	$ ln -s /dev/pts/ptmx /dev/ptmx
	$ ln -s /dev/pts/tty /dev/tty

With this, even if the 'ptmx' is accessed from parent pts-ns we still find
and hold the pts-ns in which 'ptmx' actually belongs.

This patch merely allows creation of /dev/pts/ptmx and /dev/pts/tty. We hold
a reference to the dentries for these nodes to pin them in memory and use
'kill_litter_super()' while unmounting to ensure we drop these dentries.

TODO:
	Ability to unlink the /dev/pts/ptmx and /dev/pts/tty nodes.

	Note: if /dev/ptmx is a symlink to /vserver/vserver1/dev/pts/ptmx
	an open of /dev/ptmx in init-pts-ns will create a PTY in 'vserver1' !

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   55 ++
 1 file changed, 51 insertions(+), 4 deletions(-)

Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===================================================================
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c	2008-04-02 10:18:42.000000000 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c	2008-04-02 22:51:02.000000000 -0700
@@ -58,7 +58,6 @@ struct pts_namespace init_pts_ns = {
 	.mnt = NULL,
 };
 
-
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
 	char *p;
@@ -122,6 +121,54 @@ static const struct super_operations dev
 	.show_options	= devpts_show_options,
 };
 
+
+static int devpts_mknod(struct inode *dir, struct dentry *dentry,
+		int mode, dev_t rdev)
+{
+	int inum;
+	struct inode *inode;
+	struct super_block *sb = dir->i_sb;
+
+	if (dentry->d_inode)
+		return -EEXIST;
+
+	if (!S_ISCHR(mode))
+		return -EPERM;
+
+	if (rdev == MKDEV(TTYAUX_MAJOR, 0))
+		inum = 2;
+	else if (rdev == MKDEV(TTYAUX_MAJOR, 2))
+		inum = 3;
+	else
+		return -EPERM;
+
+	inode = new_inode(sb);
+	if (!inode)
+		return -ENOMEM;
+
+	inode->i_ino = inum;
+	inode->i_uid = inode->i_gid = 0;
+	inode->i_blocks = 0;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+
+	init_special_inode(inode, mode, rdev);
+
+	d_instantiate(dentry, inode);
+	/*
+	 * Get a reference to the dentry so the device-nodes persist
+	 * even when there are no active references to them. We use
+	 * kill_litter_super() to remove this entry when unmounting
+	 * devpts.
+	 */
+	dget(dentry);
+	return 0;
+}
+
+const struct inode_operations devpts_dir_inode_operations = {
+	.lookup		= simple_lookup,
+	.mknod		= devpts_mknod,
+};
+
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -141,7 +188,7 @@ devpts_fill_super(struct super_block *s,
 	inode->i_blocks = 0;
 	inode->i_uid = inode->i_gid = 0;
 	inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO | S_IWUSR;
-	inode->i_op = &simple_dir_inode_operations;
+	inode->i_op = &devpts_dir_inode_operations;
 	inode->i_fop = &simple_dir_operations;
 	inode->i_nlink = 2;
 
@@ -214,7 +261,7 @@ static int devpts_get_sb(struct file_sys
 static void devpts_kill_sb(struct super_block *sb)
 {
 	sb->s_fs_info = NULL;
-	kill_anon_super(sb);
+	kill_litter_super(sb);
 }
 
 static struct file_system_type devpts_fs_type = {
@@ -303,7 +350,7 @@ int devpts_pty_new(struct tty_struct *tt
 	if (!inode)
 		return -ENOMEM;
 
-	inode->i_ino = number+2;
+	inode->i_ino = number+4;
 	inode->i_uid = config.setuid ? config.uid : current->fsuid;
 	inode->i_gid = config.setgid
[Devel] [PATCH 2/7][v2]: Factor out PTY index allocation
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 2/7][v2]: Factor out PTY index allocation

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index(). This localizes the external
data structures used in managing the pts indices.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c     |   40 ++--
 fs/devpts/inode.c         |   42 +-
 include/linux/devpts_fs.h |    4
 3 files changed, 51 insertions(+), 35 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===================================================================
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h	2008-03-24 20:04:07.000000000 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h	2008-03-24 20:04:26.000000000 -0700
@@ -17,6 +17,8 @@
 
 #ifdef CONFIG_UNIX98_PTYS
 
+int devpts_new_index(void);
+void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);      /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);   /* get tty structure */
 void devpts_pty_kill(int number);                /* unlink */
@@ -24,6 +26,8 @@ void devpts_pty_kill(int number); /* u
 #else
 
 /* Dummy stubs in the no-pty case */
+static inline int devpts_new_index(void) { return -EINVAL; }
+static inline void devpts_kill_index(int idx) { }
 static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; }
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===================================================================
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c	2008-03-24 20:04:07.000000000 -0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c	2008-03-24 20:04:26.000000000 -0700
@@ -91,7 +91,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex);
 
 #ifdef CONFIG_UNIX98_PTYS
 extern struct tty_driver *ptm_driver;	/* Unix98 pty masters; for /dev/ptmx */
-extern int pty_limit;			/* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
-static DEFINE_MUTEX(allocated_ptys_lock);
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
@@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil
 	 */
 	release_tty(tty, idx);
 
-#ifdef CONFIG_UNIX98_PTYS
 	/* Make this pty number available for reallocation */
-	if (devpts) {
-		mutex_lock(&allocated_ptys_lock);
-		idr_remove(&allocated_ptys, idx);
-		mutex_unlock(&allocated_ptys_lock);
-	}
-#endif
-
+	if (devpts)
+		devpts_kill_index(idx);
 }
 
 /**
@@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode
 	struct tty_struct *tty;
 	int retval;
 	int index;
-	int idr_ret;
 
 	nonseekable_open(inode, filp);
 
 	/* find a device that is not in use. */
-	mutex_lock(&allocated_ptys_lock);
-	if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
-		mutex_unlock(&allocated_ptys_lock);
-		return -ENOMEM;
-	}
-	idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
-	if (idr_ret < 0) {
-		mutex_unlock(&allocated_ptys_lock);
-		if (idr_ret == -EAGAIN)
-			return -ENOMEM;
-		return -EIO;
-	}
-	if (index >= pty_limit) {
-		idr_remove(&allocated_ptys, index);
-		mutex_unlock(&allocated_ptys_lock);
-		return -EIO;
-	}
-	mutex_unlock(&allocated_ptys_lock);
+	index = devpts_new_index();
+	if (index < 0)
+		return index;
 
 	mutex_lock(&tty_mutex);
 	retval = init_dev(ptm_driver, index, &tty);
@@ -2847,9 +2821,7 @@ out1:
 	release_dev(filp);
 	return retval;
 out:
-	mutex_lock(&allocated_ptys_lock);
-	idr_remove(&allocated_ptys, index);
-	mutex_unlock(&allocated_ptys_lock);
+	devpts_kill_index(index);
 	return retval;
 }
 #endif
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===================================================================
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c	2008-03-24 20:04:07.000000000 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c	2008-03-24 20:04:26.000000000 -0700
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -26,6 +27,10 @@
 
 #define DEVPTS_DEFAULT_MODE 0600
 
+extern int pty_limit;		/* Config limit on Unix98 ptys */
+static DEFINE_IDR(allocated_ptys);
+static DECLARE_
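
The allocate/free pair factored out by this patch behaves like a lowest-free
index allocator with an upper limit.  A minimal userspace sketch follows,
assuming a fixed-size boolean table in place of the kernel's idr and no
locking (the kernel code holds allocated_ptys_lock); 'model_new_index',
'model_kill_index' and MODEL_PTY_LIMIT are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_PTY_LIMIT 4	/* hypothetical; the kernel reads pty_limit */

static bool model_in_use[MODEL_PTY_LIMIT];

/* Return the lowest free index, or -EIO once the limit is reached. */
static int model_new_index(void)
{
	for (int i = 0; i < MODEL_PTY_LIMIT; i++) {
		if (!model_in_use[i]) {
			model_in_use[i] = true;
			return i;
		}
	}
	return -EIO;
}

/* Make an index available for reallocation. */
static void model_kill_index(int idx)
{
	assert(idx >= 0 && idx < MODEL_PTY_LIMIT);
	model_in_use[idx] = false;
}
```

Because a freed index is handed out again before any higher one, a container
with its own allocator always starts at pty 0, which is the property the
rest of the series builds on.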
[Devel] [PATCH 5/7][v2]: Implement get_pts_ns() and put_pts_ns()
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 5/7][v2]: Implement get_pts_ns() and put_pts_ns()

Implement get_pts_ns() and put_pts_ns() interfaces.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 include/linux/devpts_fs.h |   21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===================================================================
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h	2008-04-02 22:35:35.000000000 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h	2008-04-02 22:42:08.000000000 -0700
@@ -27,13 +27,26 @@ struct pts_namespace {
 extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
-
 int devpts_new_index(void);
 void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);      /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);   /* get tty structure */
 void devpts_pty_kill(int number);                /* unlink */
 
+static inline void free_pts_ns(struct kref *ns_kref) { }
+
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+	if (ns && (ns != &init_pts_ns))
+		kref_get(&ns->kref);
+	return ns;
+}
+static inline void put_pts_ns(struct pts_namespace *ns)
+{
+	if (ns && (ns != &init_pts_ns))
+		kref_put(&ns->kref, free_pts_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -43,6 +56,12 @@ static inline int devpts_pty_new(struct
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
 
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+	return &init_pts_ns;
+}
+
+static inline void put_pts_ns(struct pts_namespace *ns) { }
 #endif
 
[Devel] [PATCH 1/7][v2]: Propagate error code from devpts_pty_new
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 1/7][v2]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new().

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Cc: Cedric Le Goater <[EMAIL PROTECTED]>
Cc: Dave Hansen <[EMAIL PROTECTED]>
Cc: Serge Hallyn <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
---
 drivers/char/tty_io.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===================================================================
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c	2008-03-21 20:13:38.000000000 -0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c	2008-03-24 20:04:07.000000000 -0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
 	filp->private_data = tty;
 	file_move(filp, &tty->tty_files);
 
-	retval = -ENOMEM;
-	if (devpts_pty_new(tty->link))
+	retval = devpts_pty_new(tty->link);
+	if (retval)
 		goto out1;
 
 	check_tty_count(tty, "tty_open");
[Devel] [PATCH 0/7][v2] Clone PTY namespaces
Devpts namespace patchset

In continuation of the implementation of containers in mainline, we need to
support multiple PTY namespaces so that the PTY index (ie the tty names) in
one container is independent of the PTY indices of other containers.  For
instance this would allow each container to have a '/dev/pts/0' PTY and
refer to different terminals.

	[PATCH 1/7]: Propagate error code from devpts_pty_new
	[PATCH 2/7]: Factor out PTY index allocation
	[PATCH 3/7]: Enable multiple mounts of /dev/pts
	[PATCH 4/7]: Allow mknod of ptmx and tty in devpts
	[PATCH 5/7]: Implement get_pts_ns() and put_pts_ns()
	[PATCH 6/7]: Determine pts_ns from a pty's inode
	[PATCH 7/7]: Enable cloning PTY namespaces

Todo:
	- This patchset depends on availability of additional clone flags
	  and relies on Cedric's clone64 patchset.

	- Needs some cleanup and more testing.

	- Ensure patchset is bisect-safe

Changelog[v2]:
	(Patches 4 and 6 differ significantly from [v1]. Others are mostly
	the same)

	- [Alexey Dobriyan, Pavel Emelyanov] Removed the hack to check for
	  user-space mount.

	- [Serge Hallyn] Added rcu locking around access to sb->s_fs_info.

	- [Serge Hallyn] Allow creation of /dev/pts/ptmx and /dev/pts/tty
	  devices to simplify the process of finding the 'owning' pts-ns
	  of the device (specially when accessed from parent-pts-ns).
	  See patches 4 and 6 for details.

Changelog[v1]:
	- Fixed circular reference by not caching the pts_ns in sb->s_fs_info
	  (without incrementing reference count) and clearing the sb->s_fs_info
	  when destroying the pts_ns

	- To allow access to a child container's ptys from parent container,
	  determine the 'pts_ns' of a 'pty' from its inode.

	- Added a check (hack) to ensure user-space mount of /dev/pts is
	  done before creating PTYs in a new pts-ns.

	- Reorganized the patchset and removed redundant changes.

	- Ported to work with Cedric Le Goater's clone64() system call now
	  that we are out of clone_flags.

Changelog[v0]:
	This patchset is based on earlier versions developed by Serge Hallyn
	and Matt Helsley.
[Devel] Re: [PATCH 6/7][v2]: Determine pts_ns from a pty's inode
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| > +
| > +	/*
| > +	 * What pts-ns do we want to use when opening "/dev/tty" ?
| > +	 * Sounds like current_pts_ns(), but what should happen
| > +	 * if parent pts ns does:
| > +	 *
| > +	 *	echo foo > /vs/vs1/dev/tty
|
| You'll want to remove this comment, right? Your patch 4 solved
| this problem?

Yes, I will remove it while porting to rc8-mm1.

Should I go ahead and post as RFC to lkml?

Sukadev
[Devel] [RFC][PATCH 0/7] Clone PTS namespace
Devpts namespace patchset

In continuation of the implementation of containers in mainline, we need to
support multiple PTY namespaces so that the PTY index (i.e. the tty names) in
one container is independent of the PTY indices of other containers. For
instance, this would allow each container to have a '/dev/pts/0' PTY that
refers to a different terminal.

	[PATCH 1/7]: Propagate error code from devpts_pty_new
	[PATCH 2/7]: Factor out PTY index allocation
	[PATCH 3/7]: Enable multiple mounts of /dev/pts
	[PATCH 4/7]: Allow mknod of ptmx and tty in devpts
	[PATCH 5/7]: Implement get_pts_ns() and put_pts_ns()
	[PATCH 6/7]: Determine pts_ns from a pty's inode
	[PATCH 7/7]: Enable cloning PTY namespaces

Todo:
	- This patchset depends on the availability of additional clone flags
	  and relies on Cedric's clone64 patchset. See
	  http://marc.info/?l=linux-kernel&m=120272411925609&w=2
	- Needs some cleanup and more testing.
	- Ensure patchset is bisect-safe.

---
Changelogs from earlier posts to [EMAIL PROTECTED]

Changelog[v2]:
	(Patches 4 and 6 differ significantly from [v1]. Others are mostly
	the same.)

	- [Alexey Dobriyan, Pavel Emelyanov] Removed the hack to check for
	  user-space mount.
	- [Serge Hallyn] Added rcu locking around access to sb->s_fs_info.
	- [Serge Hallyn] Allow creation of /dev/pts/ptmx and /dev/pts/tty
	  devices to simplify the process of finding the 'owning' pts-ns of
	  the device (especially when accessed from parent-pts-ns).
	  See patches 4 and 6 for details.

Changelog[v1]:
	- Fixed circular reference by not caching the pts_ns in sb->s_fs_info
	  (without incrementing reference count) and clearing the
	  sb->s_fs_info when destroying the pts_ns.
	- To allow access to a child container's ptys from the parent
	  container, determine the 'pts_ns' of a 'pty' from its inode.
	- Added a check (hack) to ensure user-space mount of /dev/pts is done
	  before creating PTYs in a new pts-ns.
	- Reorganized the patchset and removed redundant changes.
	- Ported to work with Cedric Le Goater's clone64() system call now
	  that we are out of clone_flags.

Changelog[v0]:
	This patchset is based on earlier versions developed by Serge Hallyn
	and Matt Helsley.
[Devel] [RFC][PATCH 1/7]: Propagate error code from devpts_pty_new
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 1/7]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new()
(which returns either 0 or -ENOMEM anyway).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc8-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc8-mm1.orig/drivers/char/tty_io.c	2008-04-07 14:49:56.0 -0700
+++ 2.6.25-rc8-mm1/drivers/char/tty_io.c	2008-04-08 09:12:55.0 -0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
 	filp->private_data = tty;
 	file_move(filp, &tty->tty_files);
 
-	retval = -ENOMEM;
-	if (devpts_pty_new(tty->link))
+	retval = devpts_pty_new(tty->link);
+	if (retval)
 		goto out1;
 
 	check_tty_count(tty, "tty_open");
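The effect of this change can be illustrated outside the kernel. The following is only a userspace sketch, not code from the patch: sketch_pty_new(), sketch_open_old() and sketch_open_new() are hypothetical stand-ins for devpts_pty_new() and the two versions of the ptmx_open() error path.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for devpts_pty_new(): returns 0 on success or a
 * negative errno value on failure (fail_with == 0 means success).
 */
static int sketch_pty_new(int fail_with)
{
	return fail_with ? -fail_with : 0;
}

/* Old behaviour: every failure is reported as -ENOMEM. */
static int sketch_open_old(int fail_with)
{
	int retval = -ENOMEM;

	if (sketch_pty_new(fail_with))
		return retval;
	return 0;
}

/* New behaviour: the callee's error code reaches the caller unchanged. */
static int sketch_open_new(int fail_with)
{
	int retval = sketch_pty_new(fail_with);

	if (retval)
		return retval;
	return 0;
}
```

With the old form, a callee that later starts returning something other than -ENOMEM would have that code silently rewritten; with the new form the caller sees it as-is.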
[Devel] [RFC][PATCH 3/7]: Enable multiple mounts of /dev/pts
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 3/7]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, we should allow multiple mounts of
/dev/pts, once within each PTY namespace.

This patch removes the get_sb_single() in devpts_get_sb() and uses test and
set sb interfaces to allow remounting /dev/pts. The patch also removes the
globals, 'devpts_mnt', 'devpts_root' and uses a skeletal 'init_pts_ns' to
store the vfsmount.

Changelog [v3]:
	- Removed some unnecessary comments from devpts_set_sb()

Changelog [v2]:
	- (Pavel Emelianov/Serge Hallyn) Remove reference to pts_ns from
	  sb->s_fs_info to fix the circular reference (/dev/pts is not
	  unmounted unless the pts_ns is destroyed, so we don't need a
	  reference to the pts_ns).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c         |  151 +-
 include/linux/devpts_fs.h |   11 +++
 2 files changed, 134 insertions(+), 28 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h	2008-03-24 20:04:26.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h	2008-04-01 18:08:42.0 -0700
@@ -14,6 +14,17 @@
 #define _LINUX_DEVPTS_FS_H
 
 #include
+#include
+#include
+#include
+
+struct pts_namespace {
+	struct kref kref;
+	struct idr allocated_ptys;
+	struct vfsmount *mnt;
+};
+
+extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
 
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c	2008-03-24 20:04:26.0 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c	2008-04-01 18:08:41.0 -0700
@@ -28,12 +28,8 @@
 #define DEVPTS_DEFAULT_MODE 0600
 
 extern int pty_limit;		/* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
 
-static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
-
 static struct {
 	int setuid;
 	int setgid;
@@ -54,6 +50,15 @@
static match_table_t tokens = { {Opt_err, NULL} }; +struct pts_namespace init_pts_ns = { + .kref = { + .refcount = ATOMIC_INIT(2), + }, + .allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys), + .mnt = NULL, +}; + + static int devpts_remount(struct super_block *sb, int *flags, char *data) { char *p; @@ -140,7 +145,7 @@ devpts_fill_super(struct super_block *s, inode->i_fop = &simple_dir_operations; inode->i_nlink = 2; - devpts_root = s->s_root = d_alloc_root(inode); + s->s_root = d_alloc_root(inode); if (s->s_root) return 0; @@ -150,17 +155,73 @@ fail: return -ENOMEM; } +/* + * We use test and set super-block operations to help determine whether we + * need a new super-block for this namespace. get_sb() walks the list of + * existing devpts supers, comparing them with the @data ptr. Since we + * passed 'current's namespace as the @data pointer we can compare the + * namespace pointer in the super-block's 's_fs_info'. If the test is + * TRUE then get_sb() returns a new active reference to the super block. + * Otherwise, it helps us build an active reference to a new one. + */ + +static int devpts_test_sb(struct super_block *sb, void *data) +{ + return sb->s_fs_info == data; +} + +static int devpts_set_sb(struct super_block *sb, void *data) +{ + sb->s_fs_info = data; + return set_anon_super(sb, NULL); +} + static int devpts_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, struct vfsmount *mnt) { - return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt); + struct super_block *sb; + struct pts_namespace *ns; + int err; + + /* hereafter we're very similar to proc_get_sb */ + if (flags & MS_KERNMOUNT) + ns = data; + else + ns = &init_pts_ns; + + /* hereafter we're very simlar to get_sb_nodev */ + sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns); + if (IS_ERR(sb)) + return PTR_ERR(sb); + + if (sb->s_root) + return simple_set_mnt(mnt, sb); + + sb->s_flags = flags; + err = devpts_fill_super(sb, data, flags & MS_SILENT ? 
1 : 0); + if (err) { + up_write(&sb->s_umount); + deactivate_super(sb); + return err; + } + + sb->s_flags |= MS_ACTIVE; +
[Devel] [RFC][PATCH 2/7]: Factor out PTY index allocation
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [RFC][PATCH 2/7]: Factor out PTY index allocation Factor out the code used to allocate/free a pts index into new interfaces, devpts_new_index() and devpts_kill_index(). This localizes the external data structures used in managing the pts indices. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Signed-off-by: Serge Hallyn<[EMAIL PROTECTED]> Signed-off-by: Matt Helsley<[EMAIL PROTECTED]> --- drivers/char/tty_io.c | 40 ++-- fs/devpts/inode.c | 42 +- include/linux/devpts_fs.h |4 3 files changed, 51 insertions(+), 35 deletions(-) Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h === --- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:04:26.0 -0700 @@ -17,6 +17,8 @@ #ifdef CONFIG_UNIX98_PTYS +int devpts_new_index(void); +void devpts_kill_index(int idx); int devpts_pty_new(struct tty_struct *tty); /* mknod in devpts */ struct tty_struct *devpts_get_tty(int number); /* get tty structure */ void devpts_pty_kill(int number); /* unlink */ @@ -24,6 +26,8 @@ void devpts_pty_kill(int number); /* u #else /* Dummy stubs in the no-pty case */ +static inline int devpts_new_index(void) { return -EINVAL; } +static inline void devpts_kill_index(int idx) { } static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; } static inline struct tty_struct *devpts_get_tty(int number) { return NULL; } static inline void devpts_pty_kill(int number) { } Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c === --- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:04:26.0 -0700 @@ -91,7 +91,6 @@ #include #include #include -#include #include #include #include @@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex); #ifdef CONFIG_UNIX98_PTYS extern struct tty_driver *ptm_driver; /* Unix98 pty masters; for /dev/ptmx */ -extern int pty_limit; /* Config limit on Unix98 ptys */ -static 
DEFINE_IDR(allocated_ptys); -static DEFINE_MUTEX(allocated_ptys_lock); static int ptmx_open(struct inode *, struct file *); #endif @@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil */ release_tty(tty, idx); -#ifdef CONFIG_UNIX98_PTYS /* Make this pty number available for reallocation */ - if (devpts) { - mutex_lock(&allocated_ptys_lock); - idr_remove(&allocated_ptys, idx); - mutex_unlock(&allocated_ptys_lock); - } -#endif - + if (devpts) + devpts_kill_index(idx); } /** @@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode struct tty_struct *tty; int retval; int index; - int idr_ret; nonseekable_open(inode, filp); /* find a device that is not in use. */ - mutex_lock(&allocated_ptys_lock); - if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) { - mutex_unlock(&allocated_ptys_lock); - return -ENOMEM; - } - idr_ret = idr_get_new(&allocated_ptys, NULL, &index); - if (idr_ret < 0) { - mutex_unlock(&allocated_ptys_lock); - if (idr_ret == -EAGAIN) - return -ENOMEM; - return -EIO; - } - if (index >= pty_limit) { - idr_remove(&allocated_ptys, index); - mutex_unlock(&allocated_ptys_lock); - return -EIO; - } - mutex_unlock(&allocated_ptys_lock); + index = devpts_new_index(); + if (index < 0) + return index; mutex_lock(&tty_mutex); retval = init_dev(ptm_driver, index, &tty); @@ -2847,9 +2821,7 @@ out1: release_dev(filp); return retval; out: - mutex_lock(&allocated_ptys_lock); - idr_remove(&allocated_ptys, index); - mutex_unlock(&allocated_ptys_lock); + devpts_kill_index(index); return retval; } #endif Index: 2.6.25-rc5-mm1/fs/devpts/inode.c === --- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:04:26.0 -0700 @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,10 @@ #define DEVPTS_DEFAULT_MODE 0600 +extern int pty_limit; /* Config limit on Unix98 ptys */ +static DEFINE_IDR(allocated_ptys); +static DECLARE_
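The allocate/free pair that this patch hides behind devpts_new_index() and devpts_kill_index() can be pictured with a small userspace model. This is a minimal sketch under stated assumptions, not the kernel code: a fixed boolean table replaces the IDR and its mutex, and sketch_new_index(), sketch_kill_index() and SKETCH_PTY_LIMIT are hypothetical names chosen for illustration.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_PTY_LIMIT 4	/* hypothetical stand-in for pty_limit */

/* One slot per index; true means the index is currently allocated. */
static bool sketch_in_use[SKETCH_PTY_LIMIT];

/* Return the lowest free index, or -EIO once the limit is reached. */
static int sketch_new_index(void)
{
	for (int i = 0; i < SKETCH_PTY_LIMIT; i++) {
		if (!sketch_in_use[i]) {
			sketch_in_use[i] = true;
			return i;
		}
	}
	return -EIO;
}

/* Make an index available for reallocation; out-of-range values are ignored. */
static void sketch_kill_index(int idx)
{
	if (idx >= 0 && idx < SKETCH_PTY_LIMIT)
		sketch_in_use[idx] = false;
}
```

A caller then only needs the pattern the patch introduces in ptmx_open(): take the index, return it directly if it is negative, and call the kill function on the error path.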
[Devel] [RFC][PATCH 4/7]: Allow mknod of ptmx and tty in devpts
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 4/7]: Allow mknod of ptmx and tty in devpts

We want to allow administrators to access PTYs in descendant pts-namespaces,
for instance "echo foo > /vserver/vserver1/dev/pts/0". To enable such access
we must hold a reference to the pts-ns in which the device (ptmx or slave pty)
exists. Note that we cannot use the pts-ns of the 'current' process since that
pts-ns could be different from the pts-ns in which the PTY device was created.

So we find the pts-ns from the inode of the PTY (inode->i_sb->s_fs_info).
While this would work for the slave PTY devices like /dev/pts/0, it would not
work for either the master PTY device (/dev/ptmx) or the controlling terminal
(/dev/tty).

To uniformly handle the master, slave and controlling ttys, we allow creation
of 'ptmx' and 'tty' devices in /dev/pts. When creating containers, the
administrator can then:

	In init-pts-ns:

		$ mknod /dev/pts/ptmx c 5 2
		$ mknod /dev/pts/tty c 5 0
		$ rm /dev/ptmx /dev/tty
		$ ln -s /dev/pts/ptmx /dev/ptmx
		$ ln -s /dev/pts/tty /dev/tty

	In child-pts-ns:

		$ umount /dev/pts
		$ mount -t devpts lxcpts /dev/pts
		$ mknod /dev/pts/ptmx c 5 2
		$ mknod /dev/pts/tty c 5 0

With this, even if the 'ptmx' is accessed from the parent pts-ns we still find
and hold the pts-ns to which 'ptmx' actually belongs.

This patch merely allows creation of /dev/pts/ptmx and /dev/pts/tty. Follow-on
patches will enable cloning the pts namespace and using the pts-ns from the
inode.

TODO:
	- Ability to unlink the /dev/pts/ptmx and /dev/pts/tty nodes.

Note:
	- If /dev/ptmx is a symlink to /vserver/vserver1/dev/pts/ptmx,
	  open("/dev/ptmx") in init-pts-ns will create a PTY in 'vserver1'!
Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- fs/devpts/inode.c | 55 ++ 1 file changed, 51 insertions(+), 4 deletions(-) Index: 2.6.25-rc8-mm1/fs/devpts/inode.c === --- 2.6.25-rc8-mm1.orig/fs/devpts/inode.c 2008-04-08 09:18:23.0 -0700 +++ 2.6.25-rc8-mm1/fs/devpts/inode.c2008-04-08 13:35:43.0 -0700 @@ -58,7 +58,6 @@ struct pts_namespace init_pts_ns = { .mnt = NULL, }; - static int devpts_remount(struct super_block *sb, int *flags, char *data) { char *p; @@ -122,6 +121,54 @@ static const struct super_operations dev .show_options = devpts_show_options, }; + +static int devpts_mknod(struct inode *dir, struct dentry *dentry, + int mode, dev_t rdev) +{ + int inum; + struct inode *inode; + struct super_block *sb = dir->i_sb; + + if (dentry->d_inode) + return -EEXIST; + + if (!S_ISCHR(mode)) + return -EPERM; + + if (rdev == MKDEV(TTYAUX_MAJOR, 0)) + inum = 2; + else if (rdev == MKDEV(TTYAUX_MAJOR, 2)) + inum = 3; + else + return -EPERM; + + inode = new_inode(sb); + if (!inode) + return -ENOMEM; + + inode->i_ino = inum; + inode->i_uid = inode->i_gid = 0; + inode->i_blocks = 0; + inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; + + init_special_inode(inode, mode, rdev); + + d_instantiate(dentry, inode); + /* +* Get a reference to the dentry so the device-nodes persist +* even when there are no active references to them. We use +* kill_litter_super() to remove this entry when unmounting +* devpts. 
+*/ + dget(dentry); + return 0; +} + +const struct inode_operations devpts_dir_inode_operations = { + .lookup = simple_lookup, + .mknod = devpts_mknod, +}; + static int devpts_fill_super(struct super_block *s, void *data, int silent) { @@ -141,7 +188,7 @@ devpts_fill_super(struct super_block *s, inode->i_blocks = 0; inode->i_uid = inode->i_gid = 0; inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO | S_IWUSR; - inode->i_op = &simple_dir_inode_operations; + inode->i_op = &devpts_dir_inode_operations; inode->i_fop = &simple_dir_operations; inode->i_nlink = 2; @@ -214,7 +261,7 @@ static int devpts_get_sb(struct file_sys static void devpts_kill_sb(struct super_block *sb) { sb->s_fs_info = NULL; - kill_anon_super(sb); + kill_litter_super(sb); } static struct file_system_type devpts_fs_type = { @@ -303,7 +350,7 @@ int devpts_pty_new(struct tty_struct *tt if (!inode) return -ENOMEM; - inode->i_ino = number+2; + inode->i_ino = number+4; inode->i_uid = config.s
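The inode-number scheme in this patch (a fixed number for each of the two permitted device nodes, with slave PTYs shifted to start after them) can be summarized in a few lines. This is a userspace sketch only; sketch_reserved_inum() and sketch_pty_inum() are hypothetical helpers, and SKETCH_TTYAUX_MAJOR is assumed to be 5, matching the 'c 5 0' and 'c 5 2' mknod commands in the patch description.

```c
#include <errno.h>

#define SKETCH_TTYAUX_MAJOR 5	/* assumed value, as in "mknod ... c 5 N" */

/*
 * Map a (major, minor) pair to the reserved inode number devpts_mknod()
 * assigns to it, or return -EPERM for any device the patch does not allow.
 */
static int sketch_reserved_inum(unsigned int major, unsigned int minor)
{
	if (major != SKETCH_TTYAUX_MAJOR)
		return -EPERM;
	if (minor == 0)		/* tty:  c 5 0 */
		return 2;
	if (minor == 2)		/* ptmx: c 5 2 */
		return 3;
	return -EPERM;
}

/* Slave PTY inode numbers follow the reserved ones (number + 4 in the patch). */
static int sketch_pty_inum(int number)
{
	return number + 4;
}
```

Shifting the slave PTY offset from 2 to 4 is what keeps the two reserved numbers from colliding with the inode of /dev/pts/0.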
[Devel] [RFC][PATCH 5/7]: Implement get_pts_ns() and put_pts_ns()
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 5/7]: Implement get_pts_ns() and put_pts_ns()

Implement get_pts_ns() and put_pts_ns() interfaces.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 include/linux/devpts_fs.h |   21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

Index: 2.6.25-rc8-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc8-mm1.orig/include/linux/devpts_fs.h	2008-04-08 09:18:23.0 -0700
+++ 2.6.25-rc8-mm1/include/linux/devpts_fs.h	2008-04-08 13:36:31.0 -0700
@@ -27,13 +27,26 @@ struct pts_namespace {
 extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
-
 int devpts_new_index(void);
 void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);      /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);	 /* get tty structure */
 void devpts_pty_kill(int number);		 /* unlink */
 
+static inline void free_pts_ns(struct kref *ns_kref) { }
+
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+	if (ns && (ns != &init_pts_ns))
+		kref_get(&ns->kref);
+	return ns;
+}
+static inline void put_pts_ns(struct pts_namespace *ns)
+{
+	if (ns && (ns != &init_pts_ns))
+		kref_put(&ns->kref, free_pts_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -43,6 +56,12 @@ static inline int devpts_pty_new(struct
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
 
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+	return &init_pts_ns;
+}
+
+static inline void put_pts_ns(struct pts_namespace *ns) { }
 #endif
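The get/put convention above (count references on every namespace except the statically allocated initial one) can be modeled in userspace. This is a sketch under stated assumptions, not the kernel implementation: a plain int replaces struct kref, a 'freed' flag stands in for the release callback, and all sketch_* names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical userspace model of a reference-counted namespace. */
struct sketch_pts_ns {
	int refcount;
	int freed;	/* set when the last reference is dropped */
};

/* The initial namespace is static and is never counted or freed. */
static struct sketch_pts_ns sketch_init_ns = { .refcount = 2, .freed = 0 };

/* Take a reference, skipping NULL and the initial namespace. */
static struct sketch_pts_ns *sketch_get_ns(struct sketch_pts_ns *ns)
{
	if (ns && ns != &sketch_init_ns)
		ns->refcount++;
	return ns;
}

/* Drop a reference; mark the namespace freed when the count reaches zero. */
static void sketch_put_ns(struct sketch_pts_ns *ns)
{
	if (ns && ns != &sketch_init_ns && --ns->refcount == 0)
		ns->freed = 1;
}
```

Note that in the patch itself free_pts_ns() is still an empty stub at this point in the series, so dropping the last reference does nothing yet; the sketch marks the object freed only to make the lifetime visible.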
[Devel] [RFC][PATCH 7/7]: Enable cloning PTY namespaces
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [RFC][PATCH 7/7]: Enable cloning PTY namespaces Enable cloning PTY namespaces. Note: We are out of clone_flags! This patch depends on Cedric Le Goater's clone64() patchset. Changelog[v2]: [Serge Hallyn]: Use rcu to access sb->s_fs_info. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]> Signed-off-by: Matt Helsley <[EMAIL PROTECTED]> --- fs/devpts/inode.c | 84 -- include/linux/devpts_fs.h | 22 ++-- include/linux/init_task.h |1 include/linux/nsproxy.h |2 + include/linux/sched.h |1 kernel/fork.c |2 - kernel/nsproxy.c | 17 - 7 files changed, 122 insertions(+), 7 deletions(-) Index: 2.6.25-rc8-mm1/include/linux/sched.h === --- 2.6.25-rc8-mm1.orig/include/linux/sched.h 2008-04-08 13:38:08.0 -0700 +++ 2.6.25-rc8-mm1/include/linux/sched.h2008-04-08 14:27:41.0 -0700 @@ -28,6 +28,7 @@ #define CLONE_NEWPID 0x2000 /* New pid namespace */ #define CLONE_NEWNET 0x4000 /* New network namespace */ #define CLONE_IO 0x8000 /* Clone io context */ +#define CLONE_NEWPTS 0x0002ULL /* Clone pts ns */ /* * Scheduling policies Index: 2.6.25-rc8-mm1/include/linux/nsproxy.h === --- 2.6.25-rc8-mm1.orig/include/linux/nsproxy.h 2008-04-08 13:38:08.0 -0700 +++ 2.6.25-rc8-mm1/include/linux/nsproxy.h 2008-04-08 14:27:41.0 -0700 @@ -8,6 +8,7 @@ struct mnt_namespace; struct uts_namespace; struct ipc_namespace; struct pid_namespace; +struct pts_namespace; /* * A structure to contain pointers to all per-process @@ -29,6 +30,7 @@ struct nsproxy { struct pid_namespace *pid_ns; struct user_namespace *user_ns; struct net *net_ns; + struct pts_namespace *pts_ns; }; extern struct nsproxy init_nsproxy; Index: 2.6.25-rc8-mm1/include/linux/init_task.h === --- 2.6.25-rc8-mm1.orig/include/linux/init_task.h 2008-04-08 13:38:08.0 -0700 +++ 2.6.25-rc8-mm1/include/linux/init_task.h2008-04-08 14:27:41.0 -0700 @@ -78,6 +78,7 @@ extern struct nsproxy init_nsproxy; .mnt_ns = NULL, \ INIT_NET_NS(net_ns) \ 
INIT_IPC_NS(ipc_ns) \ + .pts_ns = &init_pts_ns, \ .user_ns= &init_user_ns,\ } Index: 2.6.25-rc8-mm1/include/linux/devpts_fs.h === --- 2.6.25-rc8-mm1.orig/include/linux/devpts_fs.h 2008-04-08 13:38:08.0 -0700 +++ 2.6.25-rc8-mm1/include/linux/devpts_fs.h2008-04-08 14:27:41.0 -0700 @@ -31,7 +31,7 @@ extern struct pts_namespace init_pts_ns; static inline struct pts_namespace *current_pts_ns(void) { - return &init_pts_ns; + return current->nsproxy->pts_ns; } static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode) @@ -61,7 +61,8 @@ struct tty_struct *devpts_get_tty(struct /* unlink */ void devpts_pty_kill(struct pts_namespace *pts_ns, int number); -static inline void free_pts_ns(struct kref *ns_kref) { } +extern struct pts_namespace *new_pts_ns(void); +extern void free_pts_ns(struct kref *kref); static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns) { @@ -75,6 +76,15 @@ static inline void put_pts_ns(struct pts kref_put(&ns->kref, free_pts_ns); } +static inline struct pts_namespace *copy_pts_ns(u64 flags, + struct pts_namespace *old_ns) +{ + if (flags & CLONE_NEWPTS) + return new_pts_ns(); + else + return get_pts_ns(old_ns); +} + #else /* Dummy stubs in the no-pty case */ @@ -90,6 +100,14 @@ static inline struct pts_namespace *get_ } static inline void put_pts_ns(struct pts_namespace *ns) { } + +static inline struct pts_namespace *copy_pts_ns(u64 flags, + struct pts_namespace *old_ns) +{ + if (flags & CLONE_NEWPTS) + return ERR_PTR(-EINVAL); + return old_ns; +} #endif Index: 2.6.25-rc8-mm1/fs/devpts/inode.c === --- 2.6.25-rc8-mm1.orig/fs/devpts/inode.c 2008-04-08 13:38:08.0 -0700 +++ 2.6.25-rc8-mm1/fs/devpts/inode.c2008-04-08 14:33:04.0 -0700 @@ -27,6 +27,7 @@ extern int pty_limit; /* Config limit on U
[Devel] [RFC][PATCH 6/7]: Determine pts_ns from a pty's inode
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 6/7]: Determine pts_ns from a pty's inode.

The devpts interfaces currently operate on a specific pts namespace which
they get from the 'current' task.

With the implementation of containers and cloning of PTS namespaces, we want
to be able to access PTYs in a child-pts-ns from a parent-pts-ns. For instance
we could bind-mount and pivot-root the child container on '/vserver/vserver1'
and then access the "pts/0" of 'vserver1' using

	$ echo foo > /vserver/vserver1/dev/pts/0

The task doing the above 'echo' could be in the parent-pts-ns. So we find the
'pts-ns' of the above file from the inode representing the device rather than
from the 'current' task.

Note that we need to find and hold a reference to the pts_ns to prevent the
pts_ns from being freed while it is being accessed from 'outside'.

This patch implements 'pts_ns_from_inode()', which returns the pts_ns using
'inode->i_sb->s_fs_info'.

Since the 'inode' information is not visible inside the devpts code itself,
this patch modifies the tty driver code to determine the pts_ns and pass it
into devpts.

Changelog [v2]:
	[Serge Hallyn] Use rcu to access sb->s_fs_info.
[Serge Hallyn] Simplify handling of ptmx and tty devices by expecting user to create them in /dev/pts (see also devpts-mknod patch) Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- drivers/char/pty.c| 13 +- drivers/char/tty_io.c | 96 +++--- fs/devpts/inode.c | 19 +++-- include/linux/devpts_fs.h | 38 +++--- 4 files changed, 134 insertions(+), 32 deletions(-) Index: 2.6.25-rc8-mm1/include/linux/devpts_fs.h === --- 2.6.25-rc8-mm1.orig/include/linux/devpts_fs.h 2008-04-08 13:36:31.0 -0700 +++ 2.6.25-rc8-mm1/include/linux/devpts_fs.h2008-04-08 13:38:08.0 -0700 @@ -17,6 +17,7 @@ #include #include #include +#include struct pts_namespace { struct kref kref; @@ -26,12 +27,39 @@ struct pts_namespace { extern struct pts_namespace init_pts_ns; +#define DEVPTS_SUPER_MAGIC 0x1cd1 + +static inline struct pts_namespace *current_pts_ns(void) +{ + return &init_pts_ns; +} + +static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode) +{ + /* +* If this file exists on devpts, return the pts_ns from the +* devpts super-block. Otherwise just use the pts-ns of the +* calling task. 
+*/ + if(inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC) + return rcu_dereference(inode->i_sb->s_fs_info); + + return current_pts_ns(); +} + + #ifdef CONFIG_UNIX98_PTYS -int devpts_new_index(void); -void devpts_kill_index(int idx); -int devpts_pty_new(struct tty_struct *tty); /* mknod in devpts */ -struct tty_struct *devpts_get_tty(int number); /* get tty structure */ -void devpts_pty_kill(int number); /* unlink */ +int devpts_new_index(struct pts_namespace *pts_ns); +void devpts_kill_index(struct pts_namespace *pts_ns, int idx); + +/* mknod in devpts */ +int devpts_pty_new(struct pts_namespace *pts_ns, struct tty_struct *tty); + +/* get tty structure */ +struct tty_struct *devpts_get_tty(struct pts_namespace *pts_ns, int number); + +/* unlink */ +void devpts_pty_kill(struct pts_namespace *pts_ns, int number); static inline void free_pts_ns(struct kref *ns_kref) { } Index: 2.6.25-rc8-mm1/drivers/char/tty_io.c === --- 2.6.25-rc8-mm1.orig/drivers/char/tty_io.c 2008-04-08 09:15:56.0 -0700 +++ 2.6.25-rc8-mm1/drivers/char/tty_io.c2008-04-08 14:25:11.0 -0700 @@ -2064,8 +2064,8 @@ static void tty_line_name(struct tty_dri * relaxed for the (most common) case of reopening a tty. */ -static int init_dev(struct tty_driver *driver, int idx, - struct tty_struct **ret_tty) +static int init_dev(struct tty_driver *driver, struct pts_namespace *pts_ns, + int idx, struct tty_struct **ret_tty) { struct tty_struct *tty, *o_tty; struct ktermios *tp, **tp_loc, *o_tp, **o_tp_loc; @@ -2074,7 +2074,11 @@ static int init_dev(struct tty_driver *d /* check whether we're reopening an existing tty */ if (driver->flags & TTY_DRIVER_DEVPTS_MEM) { - tty = devpts_get_tty(idx); + tty = devpts_get_tty(pts_ns, idx); + if (IS_ERR(tty)) { + retval = PTR_ERR(tty); + goto end_init; + } /* * If we don't have a tty here on a slave open, it's because * the master already
[Devel] Re: [RFC][PATCH 0/7] Clone PTS namespace
H. Peter Anvin [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> Devpts namespace patchset
>> In continuation of the implementation of containers in mainline, we need
>> to support multiple PTY namespaces so that the PTY index (ie the tty
>> names) in one container is independent of the PTY indices of other
>> containers. For instance this would allow each container to have a
>> '/dev/pts/0' PTY and refer to different terminals.
>
> Why do we "need" this? There isn't a fundamental need for this to be a
> dense numberspace (in fact, there are substantial reasons why it's a bad
> idea; the only reason the namespace is dense at the moment is because of
> the hideously bad handling of utmp in glibc.) Other than indices, this
> seems to be a more special case of device isolation across namespaces,
> would that be a more useful problem to solve across the board?

We want to provide isolation between containers, meaning PTYs in container
C1 should not be accessible to processes in C2 (unless C2 is an ancestor).

The other reason for this in the longer term is checkpoint/restart. When
restarting an application we want to make sure that the PTY indices it was
using are available and isolated.

We started out with isolating just the indices but added the special-case
handling for granting the host visibility into a child container.

A complete device namespace could solve this, but IIUC it is being planned
for the longer term. We are hoping this would provide the isolation in the
near term without being too intrusive or impeding the implementation of the
device namespace.

Sukadev
[Devel] [PATCH 0/3] clone64() and unshare64() system calls
This is a resend of the patch set Cedric had sent earlier. I ported
the patch set to 2.6.25-rc8-mm1 and tested on x86 and x86_64.
---

We have run out of the 32 bits in clone_flags!

This patchset introduces 2 new system calls which support 64-bit clone flags.

	long sys_clone64(unsigned long flags_high, unsigned long flags_low,
			 unsigned long newsp);

	long sys_unshare64(unsigned long flags_high, unsigned long flags_low);

The current version of clone64() does not support CLONE_PARENT_SETTID and
CLONE_CHILD_CLEARTID because we would exceed the 6-register limit of some
arches. It's possible to get around this limitation, but we might not need
to, as we already have clone().

This is work in progress but already includes support for x86, x86_64,
x86_64(32), ppc64, ppc64(32), s390x, s390x(31).

ia64 already supports 64-bit clone flags through the clone2() syscall.
Should we harmonize the name to clone2?
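Passing the flags as two register-sized halves implies that the kernel side reassembles them into one 64-bit value. The following is a minimal sketch of that arithmetic, not the code from the patch set: sketch_combine_flags(), sketch_split_flags() and SKETCH_HIGH_FLAG are hypothetical names, and the sketch assumes only the low 32 bits of each argument are significant.

```c
#include <stdint.h>

/* Hypothetical flag that lives above the 32 bits of the classic clone_flags. */
#define SKETCH_HIGH_FLAG (1ULL << 32)

/* Reassemble a 64-bit flag word from the two halves passed to the syscall. */
static uint64_t sketch_combine_flags(unsigned long flags_high,
				     unsigned long flags_low)
{
	return ((uint64_t)(uint32_t)flags_high << 32) | (uint32_t)flags_low;
}

/* What a userspace wrapper would do before issuing the syscall. */
static void sketch_split_flags(uint64_t flags, unsigned long *flags_high,
			       unsigned long *flags_low)
{
	*flags_high = (unsigned long)(flags >> 32);
	*flags_low = (unsigned long)(flags & 0xffffffffULL);
}
```

Truncating each half to 32 bits before combining keeps the result the same on 32-bit and 64-bit builds, where unsigned long differs in width.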
[Devel] [PATCH 1/3] change clone_flags type to u64
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [lxc-dev] [patch -lxc 1/3] change clone_flags type to u64 This is a preliminary patch changing the clone_flags type to 64bits for all the routines called by do_fork(). It prepares ground for the next patch which introduces an enhanced version of clone() supporting 64bits flags. This is work in progress. All conversions might not be done yet. Signed-off-by: Cedric Le Goater <[EMAIL PROTECTED]> Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- arch/alpha/kernel/process.c |2 +- arch/arm/kernel/process.c |2 +- arch/avr32/kernel/process.c |2 +- arch/blackfin/kernel/process.c |2 +- arch/cris/arch-v10/kernel/process.c |2 +- arch/cris/arch-v32/kernel/process.c |2 +- arch/frv/kernel/process.c |2 +- arch/h8300/kernel/process.c |2 +- arch/ia64/ia32/sys_ia32.c |2 +- arch/ia64/kernel/process.c |2 +- arch/m32r/kernel/process.c |2 +- arch/m68k/kernel/process.c |2 +- arch/m68knommu/kernel/process.c |2 +- arch/mips/kernel/process.c |2 +- arch/mn10300/kernel/process.c |2 +- arch/parisc/kernel/process.c|2 +- arch/powerpc/kernel/process.c |2 +- arch/s390/kernel/process.c |2 +- arch/sh/kernel/process_32.c |2 +- arch/sh/kernel/process_64.c |2 +- arch/sparc/kernel/process.c |2 +- arch/sparc64/kernel/process.c |2 +- arch/um/kernel/process.c|2 +- arch/v850/kernel/process.c |2 +- arch/x86/kernel/process_32.c|2 +- arch/x86/kernel/process_64.c|2 +- arch/xtensa/kernel/process.c|2 +- fs/namespace.c |2 +- include/linux/ipc_namespace.h |4 ++-- include/linux/key.h |2 +- include/linux/mnt_namespace.h |2 +- include/linux/nsproxy.h |4 ++-- include/linux/pid_namespace.h |4 ++-- include/linux/sched.h |6 -- include/linux/security.h|6 +++--- include/linux/sem.h |4 ++-- include/linux/user_namespace.h |4 ++-- include/linux/utsname.h |4 ++-- include/net/net_namespace.h |4 ++-- ipc/namespace.c |2 +- ipc/sem.c |2 +- kernel/fork.c | 36 ++-- kernel/nsproxy.c|6 +++--- kernel/pid_namespace.c |2 +- kernel/user_namespace.c |2 +- kernel/utsname.c|2 +- 
net/core/net_namespace.c|4 ++-- security/dummy.c|2 +- security/keys/process_keys.c|2 +- security/security.c |2 +- security/selinux/hooks.c|2 +- 51 files changed, 83 insertions(+), 81 deletions(-) Index: 2.6.25-rc2-mm1/arch/alpha/kernel/process.c === --- 2.6.25-rc2-mm1.orig/arch/alpha/kernel/process.c +++ 2.6.25-rc2-mm1/arch/alpha/kernel/process.c @@ -270,7 +270,7 @@ alpha_vfork(struct pt_regs *regs) */ int -copy_thread(int nr, unsigned long clone_flags, unsigned long usp, +copy_thread(int nr, u64 clone_flags, unsigned long usp, unsigned long unused, struct task_struct * p, struct pt_regs * regs) { Index: 2.6.25-rc2-mm1/arch/arm/kernel/process.c === --- 2.6.25-rc2-mm1.orig/arch/arm/kernel/process.c +++ 2.6.25-rc2-mm1/arch/arm/kernel/process.c @@ -331,7 +331,7 @@ void release_thread(struct task_struct * asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); int -copy_thread(int nr, unsigned long clone_flags, unsigned long stack_start, +copy_thread(int nr, u64 clone_flags, unsigned long stack_start, unsigned long stk_sz, struct task_struct *p, struct pt_regs *regs) { struct thread_info *thread = task_thread_info(p); Index: 2.6.25-rc2-mm1/arch/avr32/kernel/process.c === --- 2.6.25-rc2-mm1.orig/arch/avr32/kernel/process.c +++ 2.6.25-rc2-mm1/arch/avr32/kernel/process.c @@ -325,7 +325,7 @@ int dump_fpu(struct pt_regs *regs, elf_f asmlinkage void ret_from_fork(void); -int copy_thread(int nr, unsigned long clone_flags, unsigned long usp, +int copy_thread(int nr, u64 clone_flags, unsigned long usp, unsigned long unused, struct task_struct *p, struct pt_regs *regs) { Index: 2.6.25-rc2-mm1/arch/blackfin/kernel/process.c === --- 2.6.25-rc2-mm1.orig/arch/blackfin/kernel/process.c +++ 2.6.25-rc2-mm1/arch/blackfin/kernel/process.c @@ -168,7 +168,7 @@ asmlinkage i
[Devel] [PATCH 2/3] add do_unshare()
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH 2/3] add do_unshare() This patch adds a do_unshare() routine which will be common to the unshare() and unshare64() syscall. Signed-off-by: Cedric Le Goater <[EMAIL PROTECTED]> Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- kernel/fork.c |7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) Index: 2.6.25-rc2-mm1/kernel/fork.c === --- 2.6.25-rc2-mm1.orig/kernel/fork.c +++ 2.6.25-rc2-mm1/kernel/fork.c @@ -1696,7 +1696,7 @@ static int unshare_semundo(u64 unshare_f * constructed. Here we are modifying the current, active, * task_struct. */ -asmlinkage long sys_unshare(unsigned long unshare_flags) +static long do_unshare(u64 unshare_flags) { int err = 0; struct fs_struct *fs, *new_fs = NULL; @@ -1790,3 +1790,8 @@ bad_unshare_cleanup_thread: bad_unshare_out: return err; } + +asmlinkage long sys_unshare(unsigned long unshare_flags) +{ + return do_unshare(unshare_flags); +} -- --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "lxc-dev" group. To post to this group, send email to [EMAIL PROTECTED] To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/lxc-dev?hl=en -~--~~~~--~~--~--~--- ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 3/3] add the clone64() and unshare64() syscalls
From: Cedric Le Goater <[EMAIL PROTECTED]> Subject: [PATCH 3/3] add the clone64() and unshare64() syscalls This patch adds 2 new syscalls : long sys_clone64(unsigned long flags_high, unsigned long flags_low, unsigned long newsp); long sys_unshare64(unsigned long flags_high, unsigned long flags_low); The current version of clone64() does not support CLONE_PARENT_SETTID and CLONE_CHILD_CLEARTID because we would exceed the 6 registers limit of some arches. It's possible to get around this limitation but we might not need it as we already have clone() Signed-off-by: Cedric Le Goater <[EMAIL PROTECTED]> Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- arch/powerpc/kernel/entry_32.S |8 arch/powerpc/kernel/entry_64.S |5 + arch/powerpc/kernel/process.c | 15 +++ arch/s390/kernel/compat_linux.c| 16 arch/s390/kernel/compat_wrapper.S |6 ++ arch/s390/kernel/process.c | 15 +++ arch/s390/kernel/syscalls.S|2 ++ arch/x86/ia32/ia32entry.S |4 arch/x86/ia32/sys_ia32.c | 12 arch/x86/kernel/entry_64.S |1 + arch/x86/kernel/process_32.c | 14 ++ arch/x86/kernel/process_64.c | 15 +++ arch/x86/kernel/syscall_table_32.S |2 ++ include/asm-powerpc/systbl.h |2 ++ include/asm-powerpc/unistd.h |4 +++- include/asm-s390/unistd.h |4 +++- include/asm-x86/unistd_32.h|2 ++ include/asm-x86/unistd_64.h|4 include/linux/syscalls.h |3 +++ kernel/fork.c |7 +++ kernel/sys_ni.c|3 +++ 21 files changed, 142 insertions(+), 2 deletions(-) Index: 2.6.25-rc2-mm1/arch/s390/kernel/syscalls.S === --- 2.6.25-rc2-mm1.orig/arch/s390/kernel/syscalls.S 2008-02-27 15:17:34.0 -0800 +++ 2.6.25-rc2-mm1/arch/s390/kernel/syscalls.S 2008-03-06 22:08:49.0 -0800 @@ -330,3 +330,5 @@ SYSCALL(sys_eventfd,sys_eventfd,sys_even SYSCALL(sys_timerfd_create,sys_timerfd_create,sys_timerfd_create_wrapper) SYSCALL(sys_timerfd_settime,sys_timerfd_settime,compat_sys_timerfd_settime_wrapper) /* 320 */ SYSCALL(sys_timerfd_gettime,sys_timerfd_gettime,compat_sys_timerfd_gettime_wrapper) +SYSCALL(sys_clone64,sys_clone64,sys32_clone64) 
+SYSCALL(sys_unshare64,sys_unshare64,sys_unshare64_wrapper) Index: 2.6.25-rc2-mm1/arch/x86/kernel/syscall_table_32.S === --- 2.6.25-rc2-mm1.orig/arch/x86/kernel/syscall_table_32.S 2008-02-27 15:17:35.0 -0800 +++ 2.6.25-rc2-mm1/arch/x86/kernel/syscall_table_32.S 2008-03-06 22:08:49.0 -0800 @@ -326,3 +326,5 @@ ENTRY(sys_call_table) .long sys_fallocate .long sys_timerfd_settime /* 325 */ .long sys_timerfd_gettime + .long sys_clone64 + .long sys_unshare64 Index: 2.6.25-rc2-mm1/include/asm-powerpc/systbl.h === --- 2.6.25-rc2-mm1.orig/include/asm-powerpc/systbl.h2008-02-27 15:18:12.0 -0800 +++ 2.6.25-rc2-mm1/include/asm-powerpc/systbl.h 2008-03-06 22:08:49.0 -0800 @@ -316,3 +316,5 @@ COMPAT_SYS(fallocate) SYSCALL(subpage_prot) COMPAT_SYS_SPU(timerfd_settime) COMPAT_SYS_SPU(timerfd_gettime) +PPC_SYS(clone64) +SYSCALL_SPU(unshare64) Index: 2.6.25-rc2-mm1/include/asm-powerpc/unistd.h === --- 2.6.25-rc2-mm1.orig/include/asm-powerpc/unistd.h2008-02-27 15:18:12.0 -0800 +++ 2.6.25-rc2-mm1/include/asm-powerpc/unistd.h 2008-03-06 22:08:49.0 -0800 @@ -335,10 +335,12 @@ #define __NR_subpage_prot 310 #define __NR_timerfd_settime 311 #define __NR_timerfd_gettime 312 +#define __NR_clone64 313 +#define __NR_unshare64 314 #ifdef __KERNEL__ -#define __NR_syscalls 313 +#define __NR_syscalls 315 #define __NR__exit __NR_exit #define NR_syscalls__NR_syscalls Index: 2.6.25-rc2-mm1/include/asm-s390/unistd.h === --- 2.6.25-rc2-mm1.orig/include/asm-s390/unistd.h 2008-02-27 15:18:13.0 -0800 +++ 2.6.25-rc2-mm1/include/asm-s390/unistd.h2008-03-06 22:08:49.0 -0800 @@ -259,7 +259,9 @@ #define __NR_timerfd_create319 #define __NR_timerfd_settime 320 #define __NR_timerfd_gettime 321 -#define NR_syscalls 322 +#define __NR_clone64 322 +#define __NR_unshare64 323 +#define NR_syscalls 324 /* * There are some system calls that are not present on 64 bit, some Index: 2.6.25-rc2-mm1/include/asm-x86/unistd_32.h === --- 2.6.25-rc2-mm1.orig/incl
[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls
H. Peter Anvin [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> This is a resend of the patch set Cedric had sent earlier. I ported
>> the patch set to 2.6.25-rc8-mm1 and tested on x86 and x86_64.
>> ---
>> We have run out of the 32 bits in clone_flags !
>> This patchset introduces 2 new system calls which support 64bit
>> clone-flags.
>>	long sys_clone64(unsigned long flags_high, unsigned long flags_low,
>>			unsigned long newsp);
>>	long sys_unshare64(unsigned long flags_high, unsigned long
>>			flags_low);
>> The current version of clone64() does not support CLONE_PARENT_SETTID and
>> CLONE_CHILD_CLEARTID because we would exceed the 6 registers limit of some
>> arches. It's possible to get around this limitation but we might not
>> need it as we already have clone()
>
> I really dislike this interface.
>
> If you're going to make it a 64-bit pass it in as a 64-bit number, instead
> of breaking it into two numbers.

Maybe I am missing your point. The glibc interface could take a 64-bit
parameter, but don't we need to pass 32-bit values into the system call
on 32-bit systems?

> Better yet, IMO, would be to pass a pointer to a structure like:
>
> struct shared {
>	unsigned long nwords;
>	unsigned long flags[];
> };
>
> ... which can be expanded indefinitely.

Yes, this was discussed before in the context of Pavel Emelyanov's patch

	http://lkml.org/lkml/2008/1/16/109

along with sys_indirect(). While there was no consensus, it looked like
adding a new system call was better than open-ended interfaces.
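For readers following the flags_high/flags_low question, here is a minimal
userspace sketch of how a 64-bit flag word could be split into two 32-bit
halves for a 32-bit syscall ABI and joined again on the other side. The
helper names are hypothetical and do not come from the patch set; this only
illustrates the arithmetic being discussed.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only (not from the patches). */

/* Upper 32 bits of the flag word: what would be passed as flags_high. */
static inline uint32_t flags_high(uint64_t flags)
{
	return (uint32_t)(flags >> 32);
}

/* Lower 32 bits of the flag word: what would be passed as flags_low. */
static inline uint32_t flags_low(uint64_t flags)
{
	return (uint32_t)(flags & 0xffffffffu);
}

/* Reassemble the 64-bit flag word from its two halves. */
static inline uint64_t flags_join(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}
```

On a 64-bit system a single unsigned long already holds the whole word,
which is the point raised about not needing the split there.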
[Devel] Re: [PATCH 3/3] add the clone64() and unshare64() syscalls
Jakub Jelinek [EMAIL PROTECTED] wrote:
| On Wed, Apr 09, 2008 at 03:34:59PM -0700, [EMAIL PROTECTED] wrote:
| > From: Cedric Le Goater <[EMAIL PROTECTED]>
| > Subject: [PATCH 3/3] add the clone64() and unshare64() syscalls
| >
| > This patch adds 2 new syscalls :
| >
| > long sys_clone64(unsigned long flags_high, unsigned long flags_low,
| >		unsigned long newsp);
| >
| > long sys_unshare64(unsigned long flags_high, unsigned long flags_low);
|
| Can you explain why are you adding it for 64-bit arches too? unsigned long
| is there already 64-bit, and both sys_clone and sys_unshare have unsigned
| long flags, rather than unsigned int.

Hmm. If we simply reuse clone() on 64-bit and add a new call only for
32-bit, won't the semantics of clone() differ between the two? I.e.
clone() on 64-bit would support, say, CLONE_NEWPTS, while clone() on
32-bit would not.

Wouldn't it be simpler/cleaner if clone() and clone64() behaved the same
on both 32-bit and 64-bit systems?

Sukadev
[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls
H. Peter Anvin [EMAIL PROTECTED] wrote:
>> Yes, this was discussed before in the context of Pavel Emelyanov's patch
>> http://lkml.org/lkml/2008/1/16/109
>> along with sys_indirect(). While there was no consensus, it looked like
>> adding a new system call was better than open ended interfaces.
>
> That's not really an open-ended interface, it's just an expandable bitmap.

Yes, we liked such an approach earlier too, and it's conceivable that we
will run out of the 64 bits too :-)

But as Jon Corbet pointed out in the thread above, adding a new system
call has been the "traditional" way of solving this in Linux so far, and
there has been no consensus on a newer approach.

Sukadev
[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls
Paul Menage [EMAIL PROTECTED] wrote:
| On Wed, Apr 9, 2008 at 7:38 PM, <[EMAIL PROTECTED]> wrote:
| >
| > But as Jon Corbet pointed out in the thread above, it looked like
| > adding a new system call has been the "traditional" way of solving this
| > in Linux so far and there has been no consensus on a newer approach.
| >
|
| I thought that the consensus was that adding a new system call was
| better than trying to force extensibility on to the existing
| non-extensible system call.

There were a couple of objections to extensible system calls like
sys_indirect() and to Pavel's approach.

|
| But if we are adding a new system call, why not make the new one
| extensible to reduce the need for yet another new call in the future?

Hypothetically, can we make a variant of clone() extensible to the point
of requiring a copy_from_user()?

|
| Paul
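To make the "expandable bitmap" alternative concrete, here is a small
userspace sketch built around the struct suggested earlier in this thread
(nwords plus a flexible array of flag words). The helper functions are
hypothetical. In the kernel the structure would have to be fetched with
copy_from_user(), which is the cost being asked about above; this sketch
only shows the bitmap lookup, where bits beyond nwords read as "not set".

```c
#include <limits.h>
#include <stdlib.h>

/* Layout as suggested in the thread; the helpers below are hypothetical. */
struct shared {
	unsigned long nwords;
	unsigned long flags[];
};

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Allocate a zeroed bitmap with room for 'nwords' flag words. */
static struct shared *shared_alloc(unsigned long nwords)
{
	struct shared *s = calloc(1, sizeof(*s) + nwords * sizeof(unsigned long));

	if (s)
		s->nwords = nwords;
	return s;
}

/* Set flag 'bit'; returns -1 if the bitmap is too small to hold it. */
static int shared_set_flag(struct shared *s, unsigned long bit)
{
	unsigned long word = bit / BITS_PER_WORD;

	if (word >= s->nwords)
		return -1;
	s->flags[word] |= 1UL << (bit % BITS_PER_WORD);
	return 0;
}

/* Returns 1 if flag 'bit' is set, 0 if it is clear or beyond nwords. */
static int shared_test_flag(const struct shared *s, unsigned long bit)
{
	unsigned long word = bit / BITS_PER_WORD;

	if (word >= s->nwords)
		return 0;
	return (int)((s->flags[word] >> (bit % BITS_PER_WORD)) & 1UL);
}
```

An older caller passing a short bitmap and a newer caller passing a longer
one can then share the same entry point, at the price of the extra copy.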
[Devel] Re: [RFC][PATCH 0/7] Clone PTS namespace
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| >
| > Further what I did for the network namespace should easily handle the
| > uid/gid namespace and should be a good starting place for a general
| > device namespace.
|
| Agreed. What's the git url and which branch do i use for your proof
| of concept tree? I'll do the userns patch on top of that. I assume
| Suka will do the same for ptys?
|

Sure.

BTW, can we push the following 3 helper patches in the set? I believe they
will be required to support multiple pts namespaces, even if the actual
way we do it is not final yet.

	[PATCH 1/7]: Propagate error code from devpts_pty_new
	[PATCH 2/7]: Factor out PTY index allocation
	[PATCH 3/7]: Enable multiple mounts of /dev/pts

Sukadev
[Devel] [PATCH 0/4] Helper patches for PTY namespaces
Some simple helper patches to enable implementation of multiple PTY
(or device) namespaces.

	[PATCH 1/4]: Propagate error code from devpts_pty_new
	[PATCH 2/4]: Factor out PTY index allocation
	[PATCH 3/4]: Move devpts globals into init_pts_ns
	[PATCH 4/4]: Enable multiple mounts of /dev/pts

This patchset is based on earlier versions developed by Serge Hallyn
and Matt Helsley.
[Devel] [PATCH 1/4]: Propagate error code from devpts_pty_new
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 1/4]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new()
(which returns either 0 or -ENOMEM anyway).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc8-mm1/drivers/char/tty_io.c
===================================================================
--- 2.6.25-rc8-mm1.orig/drivers/char/tty_io.c	2008-04-07 14:49:56.0 -0700
+++ 2.6.25-rc8-mm1/drivers/char/tty_io.c	2008-04-09 13:54:00.0 -0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
 	filp->private_data = tty;
 	file_move(filp, &tty->tty_files);
 
-	retval = -ENOMEM;
-	if (devpts_pty_new(tty->link))
+	retval = devpts_pty_new(tty->link);
+	if (retval)
 		goto out1;
 
 	check_tty_count(tty, "tty_open");
[Devel] [PATCH 2/4]: Factor out PTY index allocation
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH 2/4]: Factor out PTY index allocation Factor out the code used to allocate/free a pts index into new interfaces, devpts_new_index() and devpts_kill_index(). This localizes the external data structures used in managing the pts indices. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Signed-off-by: Serge Hallyn<[EMAIL PROTECTED]> Signed-off-by: Matt Helsley<[EMAIL PROTECTED]> --- drivers/char/tty_io.c | 40 ++-- fs/devpts/inode.c | 42 +- include/linux/devpts_fs.h |4 3 files changed, 51 insertions(+), 35 deletions(-) Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h === --- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:04:26.0 -0700 @@ -17,6 +17,8 @@ #ifdef CONFIG_UNIX98_PTYS +int devpts_new_index(void); +void devpts_kill_index(int idx); int devpts_pty_new(struct tty_struct *tty); /* mknod in devpts */ struct tty_struct *devpts_get_tty(int number); /* get tty structure */ void devpts_pty_kill(int number); /* unlink */ @@ -24,6 +26,8 @@ void devpts_pty_kill(int number); /* u #else /* Dummy stubs in the no-pty case */ +static inline int devpts_new_index(void) { return -EINVAL; } +static inline void devpts_kill_index(int idx) { } static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; } static inline struct tty_struct *devpts_get_tty(int number) { return NULL; } static inline void devpts_pty_kill(int number) { } Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c === --- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:04:26.0 -0700 @@ -91,7 +91,6 @@ #include #include #include -#include #include #include #include @@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex); #ifdef CONFIG_UNIX98_PTYS extern struct tty_driver *ptm_driver; /* Unix98 pty masters; for /dev/ptmx */ -extern int pty_limit; /* Config limit on Unix98 ptys */ -static 
DEFINE_IDR(allocated_ptys); -static DEFINE_MUTEX(allocated_ptys_lock); static int ptmx_open(struct inode *, struct file *); #endif @@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil */ release_tty(tty, idx); -#ifdef CONFIG_UNIX98_PTYS /* Make this pty number available for reallocation */ - if (devpts) { - mutex_lock(&allocated_ptys_lock); - idr_remove(&allocated_ptys, idx); - mutex_unlock(&allocated_ptys_lock); - } -#endif - + if (devpts) + devpts_kill_index(idx); } /** @@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode struct tty_struct *tty; int retval; int index; - int idr_ret; nonseekable_open(inode, filp); /* find a device that is not in use. */ - mutex_lock(&allocated_ptys_lock); - if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) { - mutex_unlock(&allocated_ptys_lock); - return -ENOMEM; - } - idr_ret = idr_get_new(&allocated_ptys, NULL, &index); - if (idr_ret < 0) { - mutex_unlock(&allocated_ptys_lock); - if (idr_ret == -EAGAIN) - return -ENOMEM; - return -EIO; - } - if (index >= pty_limit) { - idr_remove(&allocated_ptys, index); - mutex_unlock(&allocated_ptys_lock); - return -EIO; - } - mutex_unlock(&allocated_ptys_lock); + index = devpts_new_index(); + if (index < 0) + return index; mutex_lock(&tty_mutex); retval = init_dev(ptm_driver, index, &tty); @@ -2847,9 +2821,7 @@ out1: release_dev(filp); return retval; out: - mutex_lock(&allocated_ptys_lock); - idr_remove(&allocated_ptys, index); - mutex_unlock(&allocated_ptys_lock); + devpts_kill_index(index); return retval; } #endif Index: 2.6.25-rc5-mm1/fs/devpts/inode.c === --- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c 2008-03-24 20:04:07.0 -0700 +++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:04:26.0 -0700 @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,10 @@ #define DEVPTS_DEFAULT_MODE 0600 +extern int pty_limit; /* Config limit on Unix98 ptys */ +static DEFINE_IDR(allocated_ptys); +static DECLARE_
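The shape of the devpts_new_index()/devpts_kill_index() pair can be sketched
in userspace as follows. This is only an illustration under stated
assumptions: a small fixed table and a pthread mutex stand in for the
kernel's IDR and lock, and the names pts_new_index(), pts_kill_index() and
PTY_LIMIT are hypothetical. The point is that callers such as ptmx_open()
see two calls and an error code, not the data structure behind them.

```c
#include <errno.h>
#include <pthread.h>

#define PTY_LIMIT 8	/* hypothetical stand-in for the kernel's pty_limit */

/* Allocation state is private to this file, as in the refactored patch. */
static unsigned char allocated[PTY_LIMIT];
static pthread_mutex_t allocated_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return the lowest free index, or -EIO when every index is in use. */
static int pts_new_index(void)
{
	int i, index = -EIO;

	pthread_mutex_lock(&allocated_lock);
	for (i = 0; i < PTY_LIMIT; i++) {
		if (!allocated[i]) {
			allocated[i] = 1;
			index = i;
			break;
		}
	}
	pthread_mutex_unlock(&allocated_lock);
	return index;
}

/* Make 'idx' available for reallocation; out-of-range values are ignored. */
static void pts_kill_index(int idx)
{
	if (idx < 0 || idx >= PTY_LIMIT)
		return;
	pthread_mutex_lock(&allocated_lock);
	allocated[idx] = 0;
	pthread_mutex_unlock(&allocated_lock);
}
```

A caller then reduces to "index = pts_new_index(); if (index < 0) return
index;", which mirrors the simplified ptmx_open() in the patch.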
[Devel] [PATCH 3/4]: Move devpts globals into init_pts_ns
Matt, Serge, please sign-off on this version. --- From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH 3/4]: Move devpts globals into init_pts_ns Move devpts global variables 'allocated_ptys' and 'devpts_mnt' into a new 'pts_namespace' and remove the 'devpts_root'. Changelog: - Split these relatively simpler changes off from the patch that supports remounting devpts. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- fs/devpts/inode.c | 84 -- include/linux/devpts_fs.h | 10 + 2 files changed, 70 insertions(+), 24 deletions(-) Index: 2.6.25-rc8-mm1/fs/devpts/inode.c === --- 2.6.25-rc8-mm1.orig/fs/devpts/inode.c 2008-04-11 10:12:09.0 -0700 +++ 2.6.25-rc8-mm1/fs/devpts/inode.c2008-04-12 10:10:33.0 -0700 @@ -28,12 +28,8 @@ #define DEVPTS_DEFAULT_MODE 0600 extern int pty_limit; /* Config limit on Unix98 ptys */ -static DEFINE_IDR(allocated_ptys); static DECLARE_MUTEX(allocated_ptys_lock); -static struct vfsmount *devpts_mnt; -static struct dentry *devpts_root; - static struct { int setuid; int setgid; @@ -54,6 +50,14 @@ static match_table_t tokens = { {Opt_err, NULL} }; +struct pts_namespace init_pts_ns = { + .kref = { + .refcount = ATOMIC_INIT(2), + }, + .allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys), + .mnt = NULL, +}; + static int devpts_remount(struct super_block *sb, int *flags, char *data) { char *p; @@ -140,7 +144,7 @@ devpts_fill_super(struct super_block *s, inode->i_fop = &simple_dir_operations; inode->i_nlink = 2; - devpts_root = s->s_root = d_alloc_root(inode); + s->s_root = d_alloc_root(inode); if (s->s_root) return 0; @@ -168,10 +172,9 @@ static struct file_system_type devpts_fs * to the System V naming convention */ -static struct dentry *get_node(int num) +static struct dentry *get_node(struct dentry *root, int num) { char s[12]; - struct dentry *root = devpts_root; mutex_lock(&root->d_inode->i_mutex); return lookup_one_len(s, root, sprintf(s, "%d", num)); } @@ -180,14 +183,17 @@ int devpts_new_index(void) { int index; int idr_ret; 
+ struct pts_namespace *pts_ns; + + pts_ns = &init_pts_ns; retry: - if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) { + if (!idr_pre_get(&pts_ns->allocated_ptys, GFP_KERNEL)) { return -ENOMEM; } down(&allocated_ptys_lock); - idr_ret = idr_get_new(&allocated_ptys, NULL, &index); + idr_ret = idr_get_new(&pts_ns->allocated_ptys, NULL, &index); if (idr_ret < 0) { up(&allocated_ptys_lock); if (idr_ret == -EAGAIN) @@ -196,7 +202,7 @@ retry: } if (index >= pty_limit) { - idr_remove(&allocated_ptys, index); + idr_remove(&pts_ns->allocated_ptys, index); up(&allocated_ptys_lock); return -EIO; } @@ -206,8 +212,10 @@ retry: void devpts_kill_index(int idx) { + struct pts_namespace *pts_ns = &init_pts_ns; + down(&allocated_ptys_lock); - idr_remove(&allocated_ptys, idx); + idr_remove(&pts_ns->allocated_ptys, idx); up(&allocated_ptys_lock); } @@ -217,12 +225,26 @@ int devpts_pty_new(struct tty_struct *tt struct tty_driver *driver = tty->driver; dev_t device = MKDEV(driver->major, driver->minor_start+number); struct dentry *dentry; - struct inode *inode = new_inode(devpts_mnt->mnt_sb); + struct dentry *root; + struct inode *inode; + struct pts_namespace *pts_ns; /* We're supposed to be given the slave end of a pty */ BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY); BUG_ON(driver->subtype != PTY_TYPE_SLAVE); + pts_ns = &init_pts_ns; + root = pts_ns->mnt->mnt_root; + + mutex_lock(&root->d_inode->i_mutex); + inode = idr_find(&pts_ns->allocated_ptys, number); + mutex_unlock(&root->d_inode->i_mutex); + + if (inode && !IS_ERR(inode)) + return -EEXIST; + + inode = new_inode(pts_ns->mnt->mnt_sb); + if (!inode) return -ENOMEM; @@ -232,23 +254,28 @@ int devpts_pty_new(struct tty_struct *tt inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME; init_special_inode(inode, S_IFCHR|config.mode, device); inode->i_private = tty; + idr_replace(&pts_ns->allocated_ptys, inode, number); - den
[Devel] [PATCH 4/4]: Enable multiple mounts of /dev/pts
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 4/4]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, allow multiple mounts of /dev/pts, once
within each PTY namespace.

This patch removes the get_sb_single() in devpts_get_sb() and uses test and
set sb interfaces to allow remounting /dev/pts.

Changelog [v4]:
	- Split off the simpler changes of moving global variables into
	  'pts_namespace' to the previous patch.

Changelog [v3]:
	- Removed some unnecessary comments from devpts_set_sb()

Changelog [v2]:
	- (Pavel Emelianov/Serge Hallyn) Remove reference to pts_ns from
	  sb->s_fs_info to fix the circular reference (/dev/pts is not
	  unmounted unless the pts_ns is destroyed, so we don't need a
	  reference to the pts_ns).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 59 insertions(+), 3 deletions(-)

Index: 2.6.25-rc8-mm1/fs/devpts/inode.c
===================================================================
--- 2.6.25-rc8-mm1.orig/fs/devpts/inode.c	2008-04-12 10:10:33.0 -0700
+++ 2.6.25-rc8-mm1/fs/devpts/inode.c	2008-04-12 10:10:38.0 -0700
@@ -154,17 +154,73 @@ fail:
 	return -ENOMEM;
 }
 
+/*
+ * We use test and set super-block operations to help determine whether we
+ * need a new super-block for this namespace. get_sb() walks the list of
+ * existing devpts supers, comparing them with the @data ptr. Since we
+ * passed 'current's namespace as the @data pointer we can compare the
+ * namespace pointer in the super-block's 's_fs_info'. If the test is
+ * TRUE then get_sb() returns a new active reference to the super block.
+ * Otherwise, it helps us build an active reference to a new one.
+ */
+
+static int devpts_test_sb(struct super_block *sb, void *data)
+{
+	return sb->s_fs_info == data;
+}
+
+static int devpts_set_sb(struct super_block *sb, void *data)
+{
+	sb->s_fs_info = data;
+	return set_anon_super(sb, NULL);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-	return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+	struct super_block *sb;
+	struct pts_namespace *ns;
+	int err;
+
+	/* hereafter we're very similar to proc_get_sb */
+	if (flags & MS_KERNMOUNT)
+		ns = data;
+	else
+		ns = &init_pts_ns;
+
+	/* hereafter we're very similar to get_sb_nodev */
+	sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns);
+	if (IS_ERR(sb))
+		return PTR_ERR(sb);
+
+	if (sb->s_root)
+		return simple_set_mnt(mnt, sb);
+
+	sb->s_flags = flags;
+	err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+	if (err) {
+		up_write(&sb->s_umount);
+		deactivate_super(sb);
+		return err;
+	}
+
+	sb->s_flags |= MS_ACTIVE;
+	ns->mnt = mnt;
+
+	return simple_set_mnt(mnt, sb);
+}
+
+static void devpts_kill_sb(struct super_block *sb)
+{
+	sb->s_fs_info = NULL;
+	kill_anon_super(sb);
 }
 
 static struct file_system_type devpts_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "devpts",
 	.get_sb		= devpts_get_sb,
-	.kill_sb	= kill_anon_super,
+	.kill_sb	= devpts_kill_sb,
 };
 
 /*
@@ -315,7 +371,7 @@ static int __init init_devpts_fs(void)
 
 	err = register_filesystem(&devpts_fs_type);
 	if (!err) {
-		mnt = kern_mount(&devpts_fs_type);
+		mnt = kern_mount_data(&devpts_fs_type, &init_pts_ns);
 		if (IS_ERR(mnt))
 			err = PTR_ERR(mnt);
 		else
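The test/set callback pattern used with sget() in the patch above can be
illustrated with a simplified userspace sketch. Everything here is a
hypothetical stand-in: struct fake_sb, the list, and fake_sget() are not
kernel interfaces, and the real sget() also handles locking and reference
counts that are left out. The sketch only shows the lookup rule: reuse the
entry whose key matches, otherwise create one and let the set callback
record the key.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct super_block and the list of supers. */
struct fake_sb {
	void *s_fs_info;	/* key: the namespace that owns this entry */
	struct fake_sb *next;
};

static struct fake_sb *sb_list;

/* Same comparison as devpts_test_sb(): match on the namespace pointer. */
static int test_sb(struct fake_sb *sb, void *data)
{
	return sb->s_fs_info == data;
}

/* Same idea as devpts_set_sb(): remember which namespace owns the entry. */
static int set_sb(struct fake_sb *sb, void *data)
{
	sb->s_fs_info = data;
	return 0;
}

/* Simplified lookup-or-create: no locking, no reference counting. */
static struct fake_sb *fake_sget(int (*test)(struct fake_sb *, void *),
				 int (*set)(struct fake_sb *, void *),
				 void *data)
{
	struct fake_sb *sb;

	for (sb = sb_list; sb; sb = sb->next)
		if (test(sb, data))
			return sb;	/* existing entry for this key */

	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;
	if (set(sb, data)) {
		free(sb);
		return NULL;
	}
	sb->next = sb_list;
	sb_list = sb;
	return sb;
}
```

Two lookups with the same namespace pointer return the same entry, while a
different namespace pointer yields a fresh one, which is what lets each
namespace have its own mount of /dev/pts.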
[Devel] Re: [PATCH 0/4] Helper patches for PTY namespaces
Subrata Modak [EMAIL PROTECTED] wrote:
| Sukadev,
|
| Any corresponding test cases for LTP. We just have UTS, PID & SYSVIPC
| Namespace till now.

I had some unit tests that I used with the clone-pts-ns patchset, but the
patches in this set are just helpers and should not change existing
functionality.

I can send the tests I used when I tested the clone-pts-ns patchset
(they are not in LTP format).

Thanks,

Sukadev
[Devel] [PATCH]: Propagate error code from devpts_pty_new
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new()
(which returns either 0 or -ENOMEM anyway).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc8-mm2/drivers/char/tty_io.c
===================================================================
--- 2.6.25-rc8-mm2.orig/drivers/char/tty_io.c	2008-04-16 09:38:23.0 -0700
+++ 2.6.25-rc8-mm2/drivers/char/tty_io.c	2008-04-16 09:51:11.0 -0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
 	filp->private_data = tty;
 	file_move(filp, &tty->tty_files);
 
-	retval = -ENOMEM;
-	if (devpts_pty_new(tty->link))
+	retval = devpts_pty_new(tty->link);
+	if (retval)
 		goto out1;
 
 	check_tty_count(tty, "tty_open");
[Devel] [PATCH]: Factor out PTY index allocation
We noticed this while working on pts namespaces and believe this might be an useful change even as we rework our pts/device namespace approach. --- From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Subject: [PATCH]: Factor out PTY index allocation Factor out the code used to allocate/free a pts index into new interfaces, devpts_new_index() and devpts_kill_index(). This localizes the external data structures used in managing the pts indices. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Signed-off-by: Serge Hallyn<[EMAIL PROTECTED]> Signed-off-by: Matt Helsley<[EMAIL PROTECTED]> --- drivers/char/tty_io.c | 40 ++-- fs/devpts/inode.c | 42 +- include/linux/devpts_fs.h |4 3 files changed, 51 insertions(+), 35 deletions(-) Index: 2.6.25-rc8-mm2/include/linux/devpts_fs.h === --- 2.6.25-rc8-mm2.orig/include/linux/devpts_fs.h 2008-01-26 09:49:16.0 -0800 +++ 2.6.25-rc8-mm2/include/linux/devpts_fs.h2008-04-16 09:51:15.0 -0700 @@ -17,6 +17,8 @@ #ifdef CONFIG_UNIX98_PTYS +int devpts_new_index(void); +void devpts_kill_index(int idx); int devpts_pty_new(struct tty_struct *tty); /* mknod in devpts */ struct tty_struct *devpts_get_tty(int number); /* get tty structure */ void devpts_pty_kill(int number); /* unlink */ @@ -24,6 +26,8 @@ void devpts_pty_kill(int number); /* u #else /* Dummy stubs in the no-pty case */ +static inline int devpts_new_index(void) { return -EINVAL; } +static inline void devpts_kill_index(int idx) { } static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; } static inline struct tty_struct *devpts_get_tty(int number) { return NULL; } static inline void devpts_pty_kill(int number) { } Index: 2.6.25-rc8-mm2/drivers/char/tty_io.c === --- 2.6.25-rc8-mm2.orig/drivers/char/tty_io.c 2008-04-16 09:51:11.0 -0700 +++ 2.6.25-rc8-mm2/drivers/char/tty_io.c2008-04-16 09:51:15.0 -0700 @@ -91,7 +91,6 @@ #include #include #include -#include #include #include #include @@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex); #ifdef CONFIG_UNIX98_PTYS extern 
struct tty_driver *ptm_driver; /* Unix98 pty masters; for /dev/ptmx */ -extern int pty_limit; /* Config limit on Unix98 ptys */ -static DEFINE_IDR(allocated_ptys); -static DEFINE_MUTEX(allocated_ptys_lock); static int ptmx_open(struct inode *, struct file *); #endif @@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil */ release_tty(tty, idx); -#ifdef CONFIG_UNIX98_PTYS /* Make this pty number available for reallocation */ - if (devpts) { - mutex_lock(&allocated_ptys_lock); - idr_remove(&allocated_ptys, idx); - mutex_unlock(&allocated_ptys_lock); - } -#endif - + if (devpts) + devpts_kill_index(idx); } /** @@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode struct tty_struct *tty; int retval; int index; - int idr_ret; nonseekable_open(inode, filp); /* find a device that is not in use. */ - mutex_lock(&allocated_ptys_lock); - if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) { - mutex_unlock(&allocated_ptys_lock); - return -ENOMEM; - } - idr_ret = idr_get_new(&allocated_ptys, NULL, &index); - if (idr_ret < 0) { - mutex_unlock(&allocated_ptys_lock); - if (idr_ret == -EAGAIN) - return -ENOMEM; - return -EIO; - } - if (index >= pty_limit) { - idr_remove(&allocated_ptys, index); - mutex_unlock(&allocated_ptys_lock); - return -EIO; - } - mutex_unlock(&allocated_ptys_lock); + index = devpts_new_index(); + if (index < 0) + return index; mutex_lock(&tty_mutex); retval = init_dev(ptm_driver, index, &tty); @@ -2847,9 +2821,7 @@ out1: release_dev(filp); return retval; out: - mutex_lock(&allocated_ptys_lock); - idr_remove(&allocated_ptys, index); - mutex_unlock(&allocated_ptys_lock); + devpts_kill_index(index); return retval; } #endif Index: 2.6.25-rc8-mm2/fs/devpts/inode.c === --- 2.6.25-rc8-mm2.orig/fs/devpts/inode.c 2008-02-27 15:17:59.0 -0800 +++ 2.6.25-rc8-mm2/fs/devpts/inode.c2008-04-16 09:51:15.0 -0700 @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,10 @@ #define DEVPTS_DEFAULT_MOD
[Devel] Re: [GIT PATCH] actually check va randomization
Dave Hansen [EMAIL PROTECTED] wrote:
| Rather than just documenting this in the readme, actually spit
| out a warning on it.

Why not just bail out? It's mostly unreliable at that point anyway.
Besides, the warning can get buried in a lot of other output.
---
From 84d005031a8a17bdca62dc541c296a3bea74658c Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Wed, 11 Jun 2008 11:22:17 -0700
Subject: [PATCH] cryo currently requires randomize_va_space to be 0.

Fail if it is not.

---
 cr.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/cr.c b/cr.c
index db3ada0..217c5e7 100644
--- a/cr.c
+++ b/cr.c
@@ -1464,6 +1464,7 @@ void check_for_va_randomize(void)
 		return;
 	fprintf(stderr, "WARNING: %s is set to: %d\n", VA_RANDOM_FILE, enabled);
 	fprintf(stderr, " Please set to 0 to make cryo more reliable\n");
+	exit(1);
 }
 
 void usage(void)
-- 
1.5.2.5
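For context, a check of this kind can be sketched as below. This is not
cryo's check_for_va_randomize(); the function names are hypothetical, and
the path is passed in by the caller. On Linux the setting lives in
/proc/sys/kernel/randomize_va_space, and it is an assumption here that
cryo's VA_RANDOM_FILE refers to that file.

```c
#include <stdio.h>

/* Read a single integer from 'path'; returns -1 if it cannot be read. */
static int read_int_setting(const char *path)
{
	FILE *f = fopen(path, "r");
	int value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/*
 * Returns 0 only when the setting is readable and equal to 0.
 * Any other value, or an unreadable file, is treated as "randomization
 * may be on", so a caller that wants to bail out can do so.
 */
static int va_randomize_enabled(const char *path)
{
	return read_int_setting(path) != 0;
}
```

A caller following the patch above would print its warning and then
exit(1) when va_randomize_enabled() returns nonzero.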
[Devel] [RFC][PATCH] Restore fd flags in restarted process
From e33c0c11cc612896cb12ddad1925037e52e76eb3 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Tue, 17 Jun 2008 12:32:30 -0700
Subject: [PATCH] Restore fd flags in restarted process.

We currently get these flags using fcntl(F_GETFL) and save them while
checkpointing but we do not restore them when restarting the process.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 cr.c  |   10 +++++++++-
 sci.h |    7 ++++---
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/cr.c b/cr.c
index c52dd70..5163a3d 100644
--- a/cr.c
+++ b/cr.c
@@ -251,7 +251,7 @@ int getfdinfo(pinfo_t *pi)
 		if (len >= 0) pi->fi[n].name[len] = 0;
 		stat(dname, &st);
 		pi->fi[n].mode = st.st_mode;
-		pi->fi[n].flag = PT_FCNTL(syscallpid, pi->fi[n].fdnum, F_GETFL);
+		pi->fi[n].flag = PT_FCNTL(syscallpid, pi->fi[n].fdnum, F_GETFL, 0);
 		if (S_ISREG(st.st_mode))
 			pi->fi[n].offset = (off_t)PT_LSEEK(syscallpid, pi->fi[n].fdnum, 0, SEEK_CUR);
 		else if (S_ISFIFO(st.st_mode))
@@ -841,6 +841,14 @@ int restore_fd(int fd, pid_t pid)
 			}
 		}
 
+		/*
+		 * Restore any special flags this fd had
+		 */
+		ret = PT_FCNTL(pid, fdinfo->fdnum, F_SETFL, fdinfo->flag);
+		DEBUG(" restore_fd() fd %d setfl flag 0x%x, ret %d\n",
+				fdinfo->fdnum, fdinfo->flag, ret);
+
+
 		free(fdinfo);
 	}
 	if (1) {
diff --git a/sci.h b/sci.h
index b0cac3c..0b32ae4 100644
--- a/sci.h
+++ b/sci.h
@@ -138,10 +138,11 @@ int call_func(pid_t pid, int scratch, int flag, int funcaddr, int argc, ...);
 			0, 0, off,	\
 			0, 0, w)
 
-#define PT_FCNTL(p, fd, cmd)	\
-	ptrace_syscall(p, 0, 0, SYS_fcntl, 2,	\
+#define PT_FCNTL(p, fd, cmd, arg)	\
+	ptrace_syscall(p, 0, 0, SYS_fcntl, 3,	\
 			0, 0, fd,	\
-			0, 0, cmd)
+			0, 0, cmd,	\
+			0, 0, arg)
 
 #define PT_CLOSE(p, fd)	\
 	ptrace_syscall(p, 0, 0, SYS_close, 1,	\
-- 
1.5.2.5
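Outside of the ptrace wrappers, the save/restore step amounts to a pair of
plain fcntl() calls. The sketch below is a local-process illustration with
hypothetical helper names; cryo itself issues the calls inside the traced
process through PT_FCNTL().

```c
#include <fcntl.h>

/* Read the file status flags of 'fd'; returns -1 on error. */
static int save_fd_flags(int fd)
{
	return fcntl(fd, F_GETFL);
}

/*
 * Re-apply previously saved flags to 'fd'. On Linux, F_SETFL can change
 * only O_APPEND, O_ASYNC, O_DIRECT, O_NOATIME and O_NONBLOCK; access-mode
 * and file-creation bits in 'flags' are ignored, so passing the full
 * F_GETFL value back is harmless.
 */
static int restore_fd_flags(int fd, int flags)
{
	return fcntl(fd, F_SETFL, flags);
}
```

This is why saving the flags at checkpoint time is not enough on its own:
a freshly created descriptor starts without O_NONBLOCK and similar flags
until F_SETFL is issued for it.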
[Devel] [RFC][PATCH][cryo] Save/restore state of unnamed pipes
>From fd13986de32af31621b1badbcf7bfb5626648e0e Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Date: Mon, 16 Jun 2008 18:41:05 -0700 Subject: [PATCH] Save/restore state of unnamed pipes Design: Current Linux kernels provide ability to read/write contents of FIFOs using /proc. i.e 'cat /proc/pid/fd/read-side-fd' prints the unread data in the FIFO. Similarly, 'cat foo > /proc/pid/fd/read-sid-fd' appends the contents of 'foo' to the unread contents of the FIFO. So to save/restore the state of the pipe, a simple implementation is to read the from the unnamed pipe's fd and save to the checkpoint-file. When restoring, create a pipe (using PT_PIPE()) in the child process, read the contents of the pipe from the checkpoint file and write it to the newly created pipe. Its fairly straightforward, except for couple of notes: - when we read contents of '/proc/pid/fd/read-side-fd' we drain the pipe such that when the checkpointed application resumes, it will not find any data. To fix this, we read from the 'read-side-fd' and write it back to the 'read-side-fd' in addition to writing to the checkpoint file. - there does not seem to be a mechanism to determine the count of unread bytes in the file. Current implmentation assumes a maximum of 64K bytes (PIPE_BUFS * PAGE_SIZE on i386) and fails if the pipe is not fully drained. Basic unit-testing done at this point (using tests/pipe.c). TODO: - Additional testing (with multiple-processes and multiple-pipes) - Named-pipes Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- cr.c | 215 ++ 1 files changed, 203 insertions(+), 12 deletions(-) diff --git a/cr.c b/cr.c index 5163a3d..0cb9774 100644 --- a/cr.c +++ b/cr.c @@ -84,6 +84,11 @@ typedef struct fdinfo_t { char name[128]; /* file name. 
NULL if anonymous (pipe, socketpair) */ } fdinfo_t; +typedef struct fifoinfo_t { + int fi_fd; /* fifo's read-side fd */ + int fi_length; /* number of bytes in the fifo */ +} fifofdinfo_t; + typedef struct memseg_t { unsigned long start;/* memory segment start address */ unsigned long end; /* memory segment end address */ @@ -468,6 +473,128 @@ out: return rc; } +static int estimate_fifo_unread_bytes(pinfo_t *pi, int fd) +{ + /* +* Is there a way to find the number of bytes remaining to be +* read in a fifo ? If not, can we print it in fdinfo ? +* +* Return 64K (PIPE_BUFS * PAGE_SIZE) for now. +*/ + return 65536; +} + +static void ensure_fifo_has_drained(char *fname, int fifo_fd) +{ + int rc, c; + + rc = read(fifo_fd, &c, 1); + if (rc != -1 && errno != EAGAIN) { + ERROR("FIFO '%s' not drained fully. rc %d, c %d " + "errno %d\n", fname, rc, c, errno); + } + +} + +static int save_process_fifo_info(pinfo_t *pi, int fd) +{ + int i; + int rc; + int nbytes; + int fifo_fd; + int pbuf_size; + pid_t pid = pi->pid; + char fname[256]; + fdinfo_t *fi = pi->fi; + char *pbuf; + fifofdinfo_t fifofdinfo; + + write_item(fd, "FIFO", NULL, 0); + + for (i = 0; i < pi->nf; i++) { + if (! S_ISFIFO(fi[i].mode)) + continue; + + DEBUG("FIFO fd %d (%s), flag 0x%x\n", fi[i].fdnum, fi[i].name, + fi[i].flag); + + if (!(fi[i].flag & O_WRONLY)) + continue; + + pbuf_size = estimate_fifo_unread_bytes(pi, fd); + + pbuf = (char *)malloc(pbuf_size); + if (!pbuf) { + ERROR("Unable to allocate FIFO buffer of size %d\n", + pbuf_size); + } + memset(pbuf, 0, pbuf_size); + + sprintf(fname, "/proc/%u/fd/%u", pid, fi[i].fdnum); + + /* +* Open O_NONBLOCK so read does not block if fifo has fewer +* bytes than our estimate. +*/ + fifo_fd = open(fname, O_RDWR|O_NONBLOCK); + if (fifo_fd < 0) + ERROR("Error %d opening FIFO '%s'\n", errno, fname); + + nbytes = read(fifo_fd, pbuf, pbuf_size); + if (nbytes < 0) { + if (errno != EAGAIN) { + ERROR("Error %d reading FIFO '%s'\n", errno, + fname); + } +
[Devel] [RFC][PATCH][cryo] Read/print contents of fifo
>From 0f5b3ea20238e0704a71252a3d495ca0db61e1dc Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Date: Sat, 14 Jun 2008 11:45:00 -0700 Subject: [RFC][PATCH] Read/print contents of fifo. To test that checkpoint/restart of pipes is working, read one byte at a time from the pipe and write to stdout. After checkpoint, both the checkpointed application and the restarted application should continue reading from the checkpoint. The '-e' option to the program, tests with an empty pipe. Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]> --- tests/pipe.c | 32 1 files changed, 28 insertions(+), 4 deletions(-) diff --git a/tests/pipe.c b/tests/pipe.c index cc3cdfd..0812cb3 100644 --- a/tests/pipe.c +++ b/tests/pipe.c @@ -3,25 +3,49 @@ #include #include #include +#include +#include -int main() +int main(int argc, char *argv[]) { int i = 0; + int rc; int fds[2]; + int c; + int empty; char *buf = "abcdefghijklmnopqrstuvwxyz"; + /* +* -e: test with an empty pipe +*/ + empty = 0; + if (argc > 1 && strcmp(argv[1], "-e") == 0) + empty = 1; + if (pipe(fds) < 0) { perror("pipe()"); exit(1); } - write(fds[1], buf, strlen(buf)); + if (!empty) + write(fds[1], buf, strlen(buf)); + if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) { + perror("fcntl()"); + exit(1); + } printf("Running as %d\n", getpid()); while (i<100) { sleep(1); - if (i%5 == 0) - printf("i is %d (pid %d)\n", i, getpid()); + if (i%5 == 0) { + c = errno = 0; + rc = read(fds[0], &c, 1); + if (rc != 1) { + perror("read() failed"); + } + printf("i is %d (pid %d), c is %c\n", i, getpid(), c); + + } i++; } } -- 1.5.2.5 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| > +
| > +	rc = read(fifo_fd, &c, 1);
| > +	if (rc != -1 && errno != EAGAIN) {
|
| Won't errno only be set if rc == -1? Did you mean || here?

Yes, I meant ||. I also had 'errno = 0' before the read, but seem to have
deleted it when I moved code around.

| > +	} else if (S_ISFIFO(fdinfo->mode)) {
| > +		int pipefds[2] = { 0, 0 };
| > +
| > +		/*
| > +		 * We create the pipe when we see the pipe's read-fd.
| > +		 * Just ignore the pipe's write-fd.
| > +		 */
| > +		if (fdinfo->flag == O_WRONLY)
| > +			continue;
| > +
| > +		DEBUG("Creating pipe for fd %d\n", fdinfo->fdnum);
| > +
| > +		t_d(PT_PIPE(pid, pipefds));
| > +		t_d(pipefds[0]);
| > +		t_d(pipefds[1]);
| > +
| > +		if (pipefds[0] != fdinfo->fdnum) {
| > +			DEBUG("Hmm, new pipe has fds %d, %d "
| > +				"Old pipe had fd %d\n", pipefds[0],
| > +				pipefds[1], fdinfo->fdnum); getchar();
|
| Can you explain what you're doing here? I would have expected you to
| dup2() to get back the correct fd, so maybe I'm missing something...

You are right, I should use dup2() here. Will send an updated patch.

Thanks,

Suka
[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes
Matt Helsley [EMAIL PROTECTED] wrote:
| > |
| > |	pipe(pipefds); /* returns 5 and 4 in elements 0 and 1 */
| > |	/* use fds after last_fd as trampolines for fds we want to create */
| > |	dup2(pipefds[0], last_fd + 1);
| > |	dup2(pipefds[1], last_fd + 2);
| > |	close(pipefds[0]);
| > |	close(pipefds[1]);
| > |	dup2(last_fd + 1, );
| > |	dup2(last_fd + 2, );
| > |	close(last_fd + 1);
| > |	close(last_fd + 2);
| > |
| > | Which is a lot more code but should work no matter which fds we get back
| > | from pipe(). Of course this assumes the checkpointed application hasn't
| > | used all of its fds. :(
| > |
| >
| > This sounds like a good idea too, but we could use any fd that has not
| > yet been used in the restart-process right ? It would break if all fds
|
| Yes, but we don't know which fd is available unless we allocate it
| without dup2(). Here's how it could be done without last_fd (again,
| dropping PT_FUNC notation):
|
| /*
|  * Move fds from src to dest. Useful for correctly "moving" pipe fds and
|  * other cases where we have a small number of fds to move to their
|  * original fd.
|  *
|  * Assumes dest_fds and src_fds are of the same, small length since
|  * this is O(num_fds^2).
|  *
|  * If num_fds == 1 then use plain dup2().
|  *
|  * Use this in place of multiple dup2() calls (num_fds > 1) unless you are
|  * absolutely certain the set of dest fds do not intersect the set of src fds.
|  * Does NOT magically prevent you from accidentally clobbering fds outside the
|  * src_fds array.
|  */
| void move_fds(int *dest_fds, int *src_fds, const unsigned int num_fds)
| {
|	int i;
|	unsigned int num_moved = 0;
|
|	for (i = 0; i < num_fds; i++) {
|		int j;
|
|		if (src_fds[i] == dest_fds[i])
|			continue; /* nothing to be done */
|
|		/* src fd != dest fd so we need to perform:
|			dup2(src fd, dest fd);
|		   but dup2() closes dest fd if it already exists.
|		   This means we could accidentally close one of
|		   the src fds. Avoid this by searching for any
|		   src fd == dest fd and dup()'ing src fd to
|		   a different fd so we can use the dest fd.
|		 */
|		for (j = i + 1; j < num_fds; j++) /* This makes us O(N^2) */
|			if (dest_fds[i] == src_fds[j])
|				/*
|				 * we're using an fd for something
|				 * else already -- we need a trampoline
|				 */

So let me rephrase the problem.

Suppose the checkpointed application was using fds in the following
"orig-fd-set"

	{ [0..10], 18, 27 }

where 18 and 27 are part of a pipe. For simplicity let's assume that
18 is the read-side-fd.

We checkpointed this application and are now trying to restart it.

In the restarted application, we would call

	dup2(fd1, fd2),

where 'fd1' is some new, random fd and 'fd2' is an fd in 'orig-fd-set'
(say fd2 = 18).

IIUC, there is a risk here of 'fd2' being closed accidentally while
it is in use.

But the only way I can see 'fd2' being in use in the restarted process
is if _cryo_ opened some file _during_ restart and did not close it. I ran
into this early on with the randomize_va_space file (which was easily
fixed).

Would cryo need to keep one or more temporary/debug files open in the
restarted process (i.e. files that are not in the 'orig-fd-set')?

If cryo does, then maybe it could open such files:

	- after clone() (so files are not open in restarted process), or

	- find the last_fd used and dup2() to that fd, leaving the
	  'orig-fd-set' all open/available for restarted process

For debug, before each 'dup2(fd1, fd2)' we could 'fstat(fd2, &buf)'
to ensure 'fd2' is not in use and error out if it is.

Thanks for your comments. I will look at your code in more detail.

Suka
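The debug check suggested at the end of this message (fstat() the destination before each dup2() and error out if it is already open) might be sketched as follows. Both function names are hypothetical, and the sketch operates on the calling process's own fds; in cryo the equivalent calls would have to be issued in the restarted process.

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/* 1 if fd refers to an open file, 0 if it is free (fstat() fails with EBADF). */
int fd_in_use(int fd)
{
	struct stat sb;

	if (fstat(fd, &sb) == 0)
		return 1;
	return errno == EBADF ? 0 : 1;
}

/*
 * dup2() that refuses to clobber an fd that is already open.
 * Note that this also rejects src == dst, since dst is then in use.
 */
int checked_dup2(int src, int dst)
{
	if (fd_in_use(dst)) {
		errno = EBUSY;
		return -1;
	}
	return dup2(src, dst);
}
```

This only detects collisions at the destination; it does not by itself tell us whether some other fd was leaked.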
[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes
Matt Helsley [EMAIL PROTECTED] wrote:
|
| > So let me rephrase the problem.
| >
| > Suppose the checkpointed application was using fds in following
| > "orig-fd-set"
| >
| >	{ [0..10], 18, 27 }
| >
| > where 18 and 27 are part of a pipe. For simplicity lets assume that
| > 18 is the read-side-fd.
|
| so orig_pipefd[0] == 18
| and orig_pipefd[1] == 27
|
| > We checkpointed this application and are now trying to restart it.
| >
| > In the restarted application, we would call
| >
| >	dup2(fd1, fd2),
| >
| > where 'fd1' is some new, random fd and 'fd2' is an fd in 'orig-fd-set'
|                          ^^^^^^
| Even if they were truly random, this does not preclude fd1 from having
| the same value as an fd in the remaining orig-fd-set -- such as one of
| the two we're about to try and restart with pipe().

I agree. fd1 could be a hitherto-unseen fd from the 'orig-fd-set'.

|
| > (say fd2 = 18).
|
| fd1 = restarted_pipefd[0]
| fd2 = restarted_pipefd[1]
|
| In my example fd1 == 27 and fd2 == 18
|
| > IIUC, there is a risk here of 'fd2' being closed accidentally while
| > it is in use.
|
| Yes, that's the risk.
|
| > But, the only way I can see 'fd2' being in use in the restarted process
| > is if _cryo_ opened some file _during_ restart and did not close. I ran
|
| Both file descriptors returned from pipe() are in use during restart
| and closing one of them would not be proper. Cryo hasn't "forgotten" to
| close one of them -- cryo needs to dup2() both of them to their
| "destination" fds. But if they have been swapped or if just one is the
| "destination" of the other then you could end up with a broken pipe.

Ok, I see what you are saying.

The assumption I have is that we would process the fds from 'orig-fd-set'
in ascending order. It's good to confirm that assumption now :-)

proc_readfd_common() seems to return the fds in ascending order (so
readdir() of "/proc/pid/fd/" would get them in ascending order - no ?)

Suppose we process 'orig-fd-set' in order and create the pipe when we
reach the smaller of the two fds (which could be the write side). Then
the other side of the pipe would either not collide with an existing fd,
or that fd would not be in the 'orig-fd-set' (in the latter case it would
be safe for dup2() to close it).

|
| > into this early on with the randomize_va_space file (which was easily
| > fixed).
|
| This logic only works if cryo only has one new fd at a time. However
| that's not possible with pipe(). Or socketpair(). In those cases one of
| the two new fds could be the "destination" fd for the other. In that
| case dup2() will kindly close it for you and break your new
| pipe/socketpair! :)
|
| That's why I asked if POSIX guarantees the read side file descriptor is
| always less than the write side. If it does then the numbers can't be
| swapped and maybe using your assumption that we don't have any other fds
| accidentally left open ensures dup2() will be safe.

I don't think POSIX guarantees that, but will double check.

|
| > Would cryo need to keep one or more temporary/debug files open in the
| > restarted process (i.e files that are not in the 'orig-fd-set').
|
| There's no need to keep temporary/debug files open that I can see. Just
| a need to be careful when more than one new file descriptor has been
| created before doing a dup2().
|
| > If cryo does, then maybe it could open such files:
| >
| >	- after clone() (so files are not open in restarted process), or
| >
| >	- find the last_fd used and dup2() to that fd, leaving the
| >	  'orig-fd-set' all open/available for restarted process
| >
| > For debug, before each 'dup2(fd1, fd2)' we could 'fstat(fd2, &buf)'
| > to ensure 'fd2' is not in use and error out if it is.
|
| fstat() could certainly be useful for debugging dup2(). However it still
| doesn't nicely show us whether there are any fds we've leaked that we
| forgot about unless we fstat() all possible fds and then compare the set
| of existing fds to the orig-fd-set.

Yes, I was suggesting fstat() only to detect collisions; to detect
leaks, we would have to do more.
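The leak check described in the quoted text (fstat() every possible fd and compare against the orig-fd-set) could be sketched along these lines. count_unexpected_fds() and its parameters are invented for illustration; the upper bound max_fd is an assumption the caller supplies, for example from getrlimit(RLIMIT_NOFILE).

```c
#include <stddef.h>
#include <sys/stat.h>

/*
 * Count open fds in [0, max_fd) that are not listed in expected[].
 * Any fstat() failure is treated as "not open".
 * O(max_fd * n), which should be acceptable for a debug-only check.
 */
int count_unexpected_fds(const int *expected, size_t n, int max_fd)
{
	int fd, leaked = 0;

	for (fd = 0; fd < max_fd; fd++) {
		struct stat sb;
		size_t i;

		if (fstat(fd, &sb) != 0)
			continue;	/* not open */
		for (i = 0; i < n; i++)
			if (expected[i] == fd)
				break;
		if (i == n)
			leaked++;	/* open, but not in the expected set */
	}
	return leaked;
}
```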
[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes
|
| Now restart does :
|
|	int pipefds[2];
|
|	pipe(pipefds);	/*
|			 * kernel is allowed to return pipefds[0] == 12 and
|			 * pipefds[1] == 11
|			 */
|
|	dup2(pipefds[0], 11); /* closes pipefds[1]! */
|	dup2(pipefds[1], 12);

Aah. I see it now (finally).

Thanks,

Suka
[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes
Matt Helsley [EMAIL PROTECTED] wrote:
|
| > | I don't see anything in the pipe man page, at least, that suggests we
| > | can safely assume pipefds[0] < pipefds[1].
| > |
| > | The solution could be to use "trampoline" fds. Suppose last_fd is the
| > | largest fd that exists in the final checkpointed/restarting application.
| > | We could do (Skipping the PT_FUNC "notation" for clarity):
| > |
| > |	pipe(pipefds); /* returns 5 and 4 in elements 0 and 1 */
| > |	/* use fds after last_fd as trampolines for fds we want to create */
| > |	dup2(pipefds[0], last_fd + 1);
| > |	dup2(pipefds[1], last_fd + 2);
| > |	close(pipefds[0]);
| > |	close(pipefds[1]);
| > |	dup2(last_fd + 1, );
| > |	dup2(last_fd + 2, );
| > |	close(last_fd + 1);
| > |	close(last_fd + 2);
| > |
| > | Which is a lot more code but should work no matter which fds we get back
| > | from pipe(). Of course this assumes the checkpointed application hasn't
| > | used all of its fds. :(
| > |

It appears that this last_fd approach will fit more easily into the current
design of cryo (where we process one or two fds at a time and don't have
the src_fds and dest_fds handy).

BTW, we should be able to accomplish the above with a single unused fd,
right (i.e. no need for last_fd+2) ?

| >
| > This sounds like a good idea too, but we could use any fd that has not
| > yet been used in the restart-process right ? It would break if all fds
|
| Yes, but we don't know which fd is available unless we allocate it
| without dup2().

Right. I was thinking we could find that out at the time of checkpoint
(a brute-force fstat(i, &statbuf) for i = 0..n or something more
efficient).

Well, I just thought of another approach. Basically, we have a temporary
need for an unused fd for use as a trampoline. So why not 'set aside' an
fd for that purpose and, after all other fds have been created, go back
and create this fd ?

i.e. let's say the first regular file we open is associated with 'fd = 3'.
We save away the 'fdinfo' for 3, say in a global variable, and close(3).
Now use 'fd = 3' in place of last_fd+1 above.

Once all fds have been set up correctly, go back and set up fd = 3 using
the saved fdinfo (this would be a simple open of the file followed by a
seek and maybe an fcntl).

This would work even if the application was using all its fds ?

If we do need both last_fd+1 and last_fd+2, we would have to set aside
two regular files.

Hmm, would it work even if an application uses all (1024) of its fds for
pipes :-)? Just a thought at this point.

Suka
[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes
Matt Helsley [EMAIL PROTECTED] wrote:
|
| Yes, I think that's sufficient:
|
|	int pipefds[2];
|
|	...
|
|	restarted_read_fd = 11;
|	restarted_write_fd = 12;
|
|	...
|
|	pipe(pipefds);
|
|	/*
|	 * pipe() may have returned one (or both) of the restarted fds
|	 * at the wrong end of the pipe. This could cause dup2() to
|	 * accidentally close the pipe. Avoid that with an extra dup().
|	 */
|	if (pipefds[1] == restarted_read_fd) {
|		dup2(pipefds[1], last_fd + 1);
|		pipefds[1] = last_fd + 1;
|	}
|
|	if (pipefds[0] != restarted_read_fd) {
|		dup2(pipefds[0], restarted_read_fd);
|		close(pipefds[0]);
|	}
|
|	if (pipefds[0] != restarted_read_fd) {
|		dup2(pipefds[1], restarted_write_fd);
|		close(pipefds[1]);
|	}

Shouldn't the last if be

	if (pipefds[1] != restarted_write_fd) ?

(otherwise it would break if pipefds[0] = 11 and pipefds[1] = 200)

I came up with something similar, but with an extra close(). And in my
code, I had restarted_* names referring to pipefds[], making it a bit
confusing initially. How about using actual_fds[] (instead of pipefds)
and expected_fds[] (instead of restarted_*) ?

Thanks,

Suka

|
| I think this code does the minimal number of operations needed in the
| restarted application too -- it counts on the second dup2() closing one
| of the fds if pipefds[1] == restarted_read_fd.
|
| Cheers,
|	-Matt
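Putting the two points from this exchange together (the corrected last condition and the actual/expected naming), a standalone version might look like the sketch below. It is an assumption-laden illustration, not the code that went into cryo: move_pipe_fds() is a made-up name, it uses plain dup() to pick the temporary fd instead of last_fd + 1, and it runs on the calling process rather than through ptrace. As noted earlier in the thread, it does not protect unrelated fds that happen to occupy the expected numbers.

```c
#include <unistd.h>

/*
 * Move a freshly created pipe (actual[0] = read end, actual[1] = write
 * end) onto the fd numbers in expected[]. Returns 0 on success, -1 on
 * error. Any unrelated file open at an expected number is clobbered.
 */
int move_pipe_fds(const int actual[2], const int expected[2])
{
	int rfd = actual[0];
	int wfd = actual[1];

	/*
	 * The write end sits on the number the read end needs. Park it on
	 * a temporary fd first; dup() cannot return expected[0] because
	 * that number is still open. The old slot is released by the
	 * dup2() below.
	 */
	if (wfd == expected[0]) {
		int tmp = dup(wfd);

		if (tmp < 0)
			return -1;
		wfd = tmp;
	}

	if (rfd != expected[0]) {
		if (dup2(rfd, expected[0]) < 0)
			return -1;
		close(rfd);
	}

	if (wfd != expected[1]) {
		if (dup2(wfd, expected[1]) < 0)
			return -1;
		close(wfd);
	}
	return 0;
}
```

With expected[] set to the swapped numbers of a new pipe, the pipe still carries data afterwards, which is exactly the case the naive pair of dup2() calls breaks.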
[Devel] [PATCH 0/2][cryo] Save/restore pipe state
[PATCH 1/2] Save/restore state of unnamed pipes

	Basic infrastructure to save/restore pipe state with assumptions
	about order of fds.

[PATCH 2/2] Support Non-consecutive and dup pipe fds

	Remove above assumptions about order of fds and support dups
	of pipe fds.
[Devel] [PATCH 1/2][cryo] Save/restore state of unnamed pipes
>From e513f8bc0fe808425264ad01210ac610f6453047 Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu <[EMAIL PROTECTED]> Date: Mon, 16 Jun 2008 18:41:05 -0700 Subject: [PATCH] Save/restore state of unnamed pipes Design: Current Linux kernels provide ability to read/write contents of FIFOs using /proc. i.e 'cat /proc/pid/fd/read-side-fd' prints the unread data in the FIFO. Similarly, 'cat foo > /proc/pid/fd/read-sid-fd' appends the contents of 'foo' to the unread contents of the FIFO. So to save/restore the state of the pipe, a simple implementation is to read the from the unnamed pipe's fd and save to the checkpoint-file. When restoring, create a pipe (using PT_PIPE()) in the child process, read the contents of the pipe from the checkpoint file and write it to the newly created pipe. Its fairly straightforward, except for couple of notes: - when we read contents of '/proc/pid/fd/read-side-fd' we drain the pipe such that when the checkpointed application resumes, it will not find any data. To fix this, we read from the 'read-side-fd' and write it back to the 'read-side-fd' in addition to writing to the checkpoint file. - there does not seem to be a mechanism to determine the count of unread bytes in the file. Current implmentation assumes a maximum of 64K bytes (PIPE_BUFS * PAGE_SIZE on i386) and fails if the pipe is not fully drained. Changelog:[v1]: - [Serge Hallyn]: use || instead of && in ensure_fifo_has_drained - [Serge Hallyn, Matt Helsley]: Use dup2() to restore fds and remove assumptions about order of read and write fds (addressed in PATCH 2/2). Some unit-testing done at this point (using tests/pipe.c). TODO: - Additional testing (with multiple-processes and multiple-pipes) - Named-pipes --- cr.c | 217 ++ 1 files changed, 205 insertions(+), 12 deletions(-) diff --git a/cr.c b/cr.c index c7e3332..716cc86 100644 --- a/cr.c +++ b/cr.c @@ -88,6 +88,11 @@ typedef struct fdinfo_t { char name[128]; /* file name. 
NULL if anonymous (pipe, socketpair) */ } fdinfo_t; +typedef struct fifoinfo_t { + int fi_fd; /* fifo's read-side fd */ + int fi_length; /* number of bytes in the fifo */ +} fifofdinfo_t; + typedef struct memseg_t { unsigned long start;/* memory segment start address */ unsigned long end; /* memory segment end address */ @@ -499,6 +504,129 @@ out: return rc; } +static int estimate_fifo_unread_bytes(pinfo_t *pi, int fd) +{ + /* +* Is there a way to find the number of bytes remaining to be +* read in a fifo ? If not, can we print it in fdinfo ? +* +* Return 64K (PIPE_BUFS * PAGE_SIZE) for now. +*/ + return 65536; +} + +static void ensure_fifo_has_drained(char *fname, int fifo_fd) +{ + int rc, c; + + errno = 0; + rc = read(fifo_fd, &c, 1); + if (rc != -1 || errno != EAGAIN) { + ERROR("FIFO '%s' not drained fully. rc %d, c %d " + "errno %d\n", fname, rc, c, errno); + } + +} + +static int save_process_fifo_info(pinfo_t *pi, int fd) +{ + int i; + int rc; + int nbytes; + int fifo_fd; + int pbuf_size; + pid_t pid = pi->pid; + char fname[256]; + fdinfo_t *fi = pi->fi; + char *pbuf; + fifofdinfo_t fifofdinfo; + + write_item(fd, "FIFO", NULL, 0); + + for (i = 0; i < pi->nf; i++) { + if (! S_ISFIFO(fi[i].mode)) + continue; + + DEBUG("FIFO fd %d (%s), flag 0x%x\n", fi[i].fdnum, fi[i].name, + fi[i].flag); + + if (!(fi[i].flag & O_WRONLY)) + continue; + + pbuf_size = estimate_fifo_unread_bytes(pi, fd); + + pbuf = (char *)malloc(pbuf_size); + if (!pbuf) { + ERROR("Unable to allocate FIFO buffer of size %d\n", + pbuf_size); + } + memset(pbuf, 0, pbuf_size); + + sprintf(fname, "/proc/%u/fd/%u", pid, fi[i].fdnum); + + /* +* Open O_NONBLOCK so read does not block if fifo has fewer +* bytes than our estimate. +*/ + fifo_fd = open(fname, O_RDWR|O_NONBLOCK); + if (fifo_fd < 0) + ERROR("Error %d opening FIFO '%s'\n", errno, fname); + + nbytes = read(fifo_fd, pbuf, pbuf_size); + if (nbytes < 0) { + if (er
[Devel] [PATCH 2/2] Support Non-consecutive and dup pipe fds
>From a80c5215763f757840214465277e911e46e01219 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Mon, 23 Jun 2008 20:13:57 -0700
Subject: [PATCH] Support Non-consecutive and dup pipe fds

PATCH 1/1 provides basic infrastructure to save/restore state of pipes.
This patch removes assumptions about order of the pipe-fds and also
supports existence of 'dups' of pipe-fds.

This logic has been separated from PATCH 1/1 for easier review and the
two patches could be combined into a single one.

Thanks to Matt Helsley for the optimized logic/code in match_pipe_ends().

TODO:
	There are a few TODOs marked out in the patch. Hopefully these can
	be addressed without significant impact to the central logic of
	saving/restoring pipes.

	- Temporarily using a regular-file's fd as 'trampoline-fd' when
	  all fds are in use

	- Maybe read all fdinfo into memory during restart, so we can
	  reduce the information we save into the checkpoint-file
	  (see comments near 'struct fdinfo').

	- Check logic of detecting 'dup's of pipe fds (any hidden gotchas ?)
	  See pair_pipe_fds()

	- Alloc ppi_list[] dynamically (see getfdinfo()).

	- Use getrlimit() to compute max-open-fds (see near caller of
	  pair_pipe_fds()).

	- [Oleg Nesterov]: SIGIO/inotify() issues associated with
	  writing-back to pipes (fixing this would require some
	  assistance from kernel ?)

Ran several unit-test cases (see test-patches). Additional cases to be
developed/executed.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 cr.c | 262 --
 1 files changed, 240 insertions(+), 22 deletions(-)

diff --git a/cr.c b/cr.c
index 716cc86..f40a4fb 100644
--- a/cr.c
+++ b/cr.c
@@ -79,8 +79,25 @@ typedef struct isockinfo_t {
 	char tcpstate[TPI_LEN];
 } isockinfo_t;
 
+/*
+ * TODO: restore_fd() processes each fd as it reads it of the checkpoint
+ * file. To avoid making a second-pass at the file, we store following
+ * fields during checkpoint (for now).
+ *
+ * peer_fdnum, dup_fdnum, create_pipe, tramp_fd' fields can be
+ *
+ * We could eliminate this fields by reading all fdinfo into memory
+ * and then 'computing' the above fields before processing the fds.
+ * But this would require a non-trivial rewrite of the restore_fd()
+ * logic. Hopefully that can be done without significant impact to
+ * rest of the logic associated with saving/restoring pipes.
+ */
 typedef struct fdinfo_t {
 	int fdnum; /* file descriptor number */
+	int peer_fdnum; /* peer fd for pipes */
+	int dup_fdnum; /* fd, if fd is dup of another pipe fd */
+	int create_pipe;/* TRUE if this is the create-end of the pipe */
+	int tramp_fd; /* trampoline-fd for use in restoring pipes */
 	mode_t mode;/* mode as per stat(2) */
 	off_t offset; /* read/write pointer position for regular files */
 	int flag; /* open(2) flag */
@@ -117,6 +134,7 @@ typedef struct pinfo_t {
 	int nt; /* number of thread child (0 if no thread lib) */
 	pid_t *tpid;/* array of thread info */
 	struct pinfo_t *pmt;/* multithread: pointer to main thread info */
+	int tramp_fd; /* trampoline-fd for use in restoring pipes */
 } pinfo_t;
 
 /*
@@ -263,6 +281,89 @@ int getsockinfo(pid_t pid, pinfo_t *pi, int num)
 	return ret;
 }
 
+typedef struct pipe_peer_info {
+	fdinfo_t *pipe_fdi;
+	//fdinfo_t *peer_fdi;
+	__ino_t pipe_ino;
+} pipe_peer_info_t;
+
+__ino_t get_fd_ino(char *fname)
+{
+	struct stat sbuf;
+
+	if (stat(fname, &sbuf) < 0)
+		ERROR("stat() on fd %s failed, errno %d\n", fname, errno);
+
+	return sbuf.st_ino;
+}
+
+static void pair_pipe_fds(pipe_peer_info_t *ppi_list, int npipe_fds)
+{
+	int i, j;
+	pipe_peer_info_t *xppi, *yppi;
+	fdinfo_t *xfdi, *yfdi;
+
+	/*
+	 * TODO: This currently assumes pipefds have not been dup'd.
+	 * Of course, need to kill this assumption soon.
+	 */
+	for (i = 0; i < npipe_fds; i++) {
+		xppi = &ppi_list[i];
+		xfdi = xppi->pipe_fdi;
+
+		j = i + 1;
+		for (j = i+1; j < npipe_fds; j++) {
+			yppi = &ppi_list[j];
+			yfdi = yppi->pipe_fdi;
+
+			if (yppi->pipe_ino != xppi->pipe_ino)
+				continue;
+
+			DEBUG("Checking flag i %d, j %d\n", i, j);
+			/*
+			 * i and j refer to same pipe. Check if they are
[Devel] [PATCH 0/4][cryo] Test pipes
PATCH[1/4]: Support-multiple-pipe-test-cases.patch
PATCH[2/4]: Testcase-3-continous-read-write-to-pipe.patch
PATCH[3/4]: Test4-Non-consecutive-pipe-fds.patch
PATCH[4/4]: Test-5-Read-write-using-dup-of-pipe-fds.patch
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 2/4]: Test3: continuous read/write to pipe
>From ade1b719f7d9968e0f934daf736ca1746cb6747d Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Sun, 22 Jun 2008 22:26:18 -0700
Subject: [PATCH] Testcase 3: continuous read/write to pipe

---
 tests/pipe.c | 82 +-
 1 files changed, 81 insertions(+), 1 deletions(-)

diff --git a/tests/pipe.c b/tests/pipe.c
index 76be6cc..c81ac2a 100644
--- a/tests/pipe.c
+++ b/tests/pipe.c
@@ -6,6 +6,8 @@
 #include
 #include
 
+#define min(a, b) ((a) < (b) ? (a) : (b))
+static char *temp_file;
 char *test_descriptions[] = {
 	NULL,
 	"Test with an empty pipe",
@@ -18,7 +20,7 @@ char *test_descriptions[] = {
 	"Test with all-fds in use for pipes",
 };
 
-static int last_num = 2;
+static int last_num = 3;
 usage(char *argv[])
 {
 	int i;
@@ -82,12 +84,89 @@ int test2()
 	}
 }
 
+int read_write_pipe()
+{
+	int i = 0;
+	int rc;
+	int fds[2];
+	int fd1, fd2;
+	int c;
+	char *wbufp;
+	char wbuf[256];
+	char rbuf[256];
+	char *rbufp;
+
+	rbufp = &rbuf[0];
+	wbufp = &wbuf[0];
+
+	strcpy(wbufp, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
+	memset(rbufp, '\0', sizeof(rbuf));
+
+	if (pipe(fds) < 0) {
+		perror("pipe()");
+		exit(1);
+	}
+	printf("fds[0] %d, fds[1] %d\n", fds[0], fds[1]);
+
+	if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) {
+		perror("fcntl()");
+		exit(1);
+	}
+
+	printf("Running as %d\n", getpid());
+	for (i = 0; i < 50; i++) {
+		sleep(1);
+		if (i%2 == 0) {
+			c = errno = 0;
+			rc = read(fds[0], rbufp, 3);
+
+			if (rc < 0)
+				perror("read() failed");
+			else {
+				printf("i is %d (pid %d), rbufp %p read %s\n",
+						i, getpid(), rbufp, rbufp);
+				rbufp += rc;
+			}
+
+			if (*wbufp == '\0')
+				continue;
+
+			errno = 0;
+			rc = write(fds[1], wbufp, min(3, strlen(wbufp)));
+			if (rc < 0) {
+				perror("write() to pipe");
+			} else {
+				if (rc != 3) {
+					printf("Wrote %d of 3 bytes, "
+						"error %d\n", rc, errno);
+				}
+				wbufp += rc;
+			}
+		}
+	}
+
+	if (strncmp(wbuf, rbuf, strlen(wbufp))) {
+		printf("Wrote: %s\n", wbuf);
+		printf("Read : %s\n", rbuf);
+		printf("Test FAILED\n");
+	} else {
+		printf("Test passed\n");
+	}
+}
+
+static void test3()
+{
+	read_write_pipe();
+}
+
 int
 main(int argc, char *argv[])
 {
 	int c;
 	int tc_num;
 
+	temp_file = argv[0];
+
 	while((c = getopt(argc, argv, "t:")) != EOF) {
 		switch(c) {
 		case 't':
@@ -102,6 +181,7 @@ main(int argc, char *argv[])
 	switch(tc_num) {
 	case 1: test1(); break;
 	case 2: test2(); break;
+	case 3: test3(); break;
 	default:
 		printf("Unsupported test case %d\n", tc_num);
 		usage(argv);
-- 
1.5.2.5
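A note on the final check in read_write_pipe(): by the time strncmp() runs, wbufp has been advanced past everything that was written, so strlen(wbufp) is 0 once the whole string has gone out and the comparison then covers zero bytes. The sketch below shows one way to compare the full payload instead. It is not part of the patch: pipe_round_trip() and its parameters are hypothetical names, and it assumes `chunk` is far smaller than the pipe capacity so the blocking write() cannot stall.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in tests/pipe.c): push msg through a fresh pipe
 * in pieces of at most `chunk` bytes, reading each piece back through a
 * non-blocking read end into `out`.
 * Returns 0 if every byte written was read back unchanged, -1 otherwise.
 */
static int pipe_round_trip(const char *msg, size_t chunk,
			   char *out, size_t outsz)
{
	int fds[2];
	size_t len = strlen(msg);
	size_t wr = 0, rd = 0;
	int ok;

	if (chunk == 0 || len + 1 > outsz)
		return -1;
	if (pipe(fds) < 0)
		return -1;
	if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	while (rd < len) {
		ssize_t rc;

		if (wr < len) {
			size_t n = len - wr < chunk ? len - wr : chunk;

			rc = write(fds[1], msg + wr, n);
			if (rc < 0)
				break;
			wr += (size_t)rc;
		}

		/* Never ask for more than the space left in out. */
		rc = read(fds[0], out + rd,
			  len - rd < chunk ? len - rd : chunk);
		if (rc > 0)
			rd += (size_t)rc;
		else if (rc == 0 || errno != EAGAIN)
			break;	/* EOF or a real error */
	}
	out[rd] = '\0';
	close(fds[0]);
	close(fds[1]);

	assert(rd <= wr);	/* cannot read more than was written */

	/* Compare over the full written length, not the unwritten tail. */
	ok = (rd == len && memcmp(msg, out, len) == 0);
	return ok ? 0 : -1;
}
```

Calling pipe_round_trip(msg, 3, out, sizeof(out)) with the 52-character alphabet string mirrors the 3-byte reads and writes of test 3, minus the sleep() calls that leave time for a checkpoint.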
[Devel] [PATCH 1/4]: Support multiple pipe test cases
>From a99deb9bcdd611c52589fa733dd90057f1f134bf Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Sun, 22 Jun 2008 21:05:48 -0700
Subject: [PATCH] Support multiple pipe test cases

Modify pipe.c to support multiple test cases and to select a test case
using the -t option.

Implement two tests:
	- empty pipe
	- single write to pipe followed by continuous read
---
 tests/pipe.c | 93 ++---
 1 files changed, 75 insertions(+), 18 deletions(-)

diff --git a/tests/pipe.c b/tests/pipe.c
index 0812cb3..76be6cc 100644
--- a/tests/pipe.c
+++ b/tests/pipe.c
@@ -1,52 +1,109 @@
 #include
 #include
+#include
 #include
 #include
 #include
 #include
-#include
 
-int main(int argc, char *argv[])
+char *test_descriptions[] = {
+	NULL,
+	"Test with an empty pipe",
+	"Test with single-write to and continous read from a pipe",
+	"Test continous reads/writes from pipe",
+	"Test non-consecutive pipe-fds",
+	"Test with read-fd > write-fd",
+	"Test with read-fd/write-fd swapped",
+	"Test with all-fds in use",
+	"Test with all-fds in use for pipes",
+};
+
+static int last_num = 2;
+usage(char *argv[])
+{
+	int i;
+	printf("Usage: %s -t \n", argv[0]);
+	printf("\t where 1 && strcmp(argv[1], "-e") == 0)
-		empty = 1;
-
 	if (pipe(fds) < 0) {
 		perror("pipe()");
 		exit(1);
 	}
 
-	if (!empty)
-		write(fds[1], buf, strlen(buf));
+	write(fds[1], buf, strlen(buf));
 
 	if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) {
 		perror("fcntl()");
 		exit(1);
 	}
 
+	printf("Running as %d\n", getpid());
 	while (i<100) {
 		sleep(1);
 		if (i%5 == 0) {
 			c = errno = 0;
 			rc = read(fds[0], &c, 1);
-			if (rc != 1) {
-				perror("read() failed");
-			}
-			printf("i is %d (pid %d), c is %c\n", i, getpid(), c);
-
+			if (rc != 1)
+				perror("read() pipe failed\n");
+			printf("i is %d (pid %d), next byte is %d\n", i,
+					getpid(), c);
 		}
 		i++;
 	}
 }
 
+int
+main(int argc, char *argv[])
+{
+	int c;
+	int tc_num;
+
+	while((c = getopt(argc, argv, "t:")) != EOF) {
+		switch(c) {
+		case 't':
+			tc_num = atoi(optarg);
+			break;
+		default:
+			printf("Unknown option %c\n", c);
+			usage(argv);
+		}
+	}
+
+	switch(tc_num) {
+	case 1: test1(); break;
+	case 2: test2(); break;
+	default:
+		printf("Unsupported test case %d\n", tc_num);
+		usage(argv);
+	}
+}
-- 
1.5.2.5
[Devel] [PATCH 3/4][cryo]: Test 4: Non-consecutive pipe fds
>From c7276c8cb59247faa13d42a1b871c853a80d80d1 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Sun, 22 Jun 2008 22:43:01 -0700
Subject: [PATCH] Test4: Non-consecutive pipe fds

---
 tests/pipe.c | 76 ++---
 1 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/tests/pipe.c b/tests/pipe.c
index c81ac2a..5b04f46 100644
--- a/tests/pipe.c
+++ b/tests/pipe.c
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 
 #define min(a, b) ((a) < (b) ? (a) : (b))
 static char *temp_file;
@@ -20,7 +21,7 @@ char *test_descriptions[] = {
 	"Test with all-fds in use for pipes",
 };
 
-static int last_num = 3;
+static int last_num = 4;
 usage(char *argv[])
 {
 	int i;
@@ -84,13 +85,50 @@ int test2()
 	}
 }
 
-int read_write_pipe()
+
+reset_pipe_fds(int *tmpfds, int *testfds, int close_unused)
+{
+	struct stat statbuf;
+	int rc;
+
+	if (fstat(testfds[0], &statbuf) == 0) {
+		printf("fd %d is in use...\n", testfds[0]);
+		exit(1);
+	}
+	if (fstat(testfds[1], &statbuf) == 0) {
+		printf("fd %d is in use...\n", testfds[1]);
+		exit(1);
+	}
+
+	rc = dup2(tmpfds[0], testfds[0]);
+	if (rc < 0) {
+		printf("dup2(%d, %d) failed, error %d\n",
+				tmpfds[0], testfds[0], rc, errno);
+		exit(1);
+	}
+
+	rc = dup2(tmpfds[1], testfds[1]);
+	if (rc < 0) {
+		printf("dup2(%d, %d) failed, error %d\n",
+				tmpfds[1], testfds[1], rc, errno);
+		exit(1);
+	}
+
+	if (close_unused) {
+		close(tmpfds[0]);
+		close(tmpfds[1]);
+	}
+}
+
+#define TEST_NON_CONSECUTIVE_FD 0x1
+
+int read_write_pipe(int *testfdsp, int close_unused)
 {
 	int i = 0;
 	int rc;
-	int fds[2];
-	int fd1, fd2;
+	int tmpfds[2];
 	int c;
+	int read_fd, write_fd;
 	char *wbufp;
 	char wbuf[256];
 	char rbuf[256];
@@ -102,13 +140,23 @@ int read_write_pipe()
 	strcpy(wbufp, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
 	memset(rbufp, '\0', sizeof(rbuf));
 
-	if (pipe(fds) < 0) {
+	if (pipe(tmpfds) < 0) {
 		perror("pipe()");
 		exit(1);
 	}
-	printf("fds[0] %d, fds[1] %d\n", fds[0], fds[1]);
 
-	if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) {
+	if (testfdsp) {
+		reset_pipe_fds(tmpfds, testfdsp, close_unused);
+		read_fd = testfdsp[0];
+		write_fd = testfdsp[1];
+	} else {
+		read_fd = tmpfds[0];
+		write_fd = tmpfds[1];
+	}
+
+	printf("read_fd %d, write_fd %d\n", read_fd, write_fd);
+
+	if (fcntl(read_fd, F_SETFL, O_NONBLOCK) < 0) {
 		perror("fcntl()");
 		exit(1);
 	}
@@ -118,7 +166,7 @@ int read_write_pipe()
 		sleep(1);
 		if (i%2 == 0) {
 			c = errno = 0;
-			rc = read(fds[0], rbufp, 3);
+			rc = read(read_fd, rbufp, 3);
 
 			if (rc < 0)
 				perror("read() failed");
@@ -132,7 +180,7 @@ int read_write_pipe()
 				continue;
 
 			errno = 0;
-			rc = write(fds[1], wbufp, min(3, strlen(wbufp)));
+			rc = write(write_fd, wbufp, min(3, strlen(wbufp)));
 			if (rc < 0) {
 				perror("write() to pipe");
 			} else {
@@ -156,7 +204,14 @@ int read_write_pipe()
 
 static void test3()
 {
-	read_write_pipe();
+	read_write_pipe(NULL, 1);
+}
+
+static void test4()
+{
+	int tmpfds[2] = { 172, 101 };
+
+	read_write_pipe(tmpfds, 1);
 }
 
 int
@@ -182,6 +237,7 @@ main(int argc, char *argv[])
 	case 1: test1(); break;
 	case 2: test2(); break;
 	case 3: test3(); break;
+	case 4: test4(); break;
 	default:
 		printf("Unsupported test case %d\n", tc_num);
 		usage(argv);
-- 
1.5.2.5
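reset_pipe_fds() relies on pipe() handing back descriptors that differ from the requested ones. If a temporary descriptor happened to equal a requested number, the first dup2() could close the other pipe end, or the final close() could drop a descriptor that was just installed. The sketch below is one way to sidestep that, by first parking both ends above the requested numbers. It is an illustration rather than the patch's method: fd_in_use() and pipe_at() are hypothetical names, and it assumes the process fd limit is larger than both requested numbers.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd refers to an open descriptor, else 0. */
static int fd_in_use(int fd)
{
	return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/*
 * Hypothetical helper: create a pipe whose read end is exactly want_rd
 * and whose write end is exactly want_wr.
 * Returns 0 on success, -1 if either number is taken or a call fails.
 */
static int pipe_at(int want_rd, int want_wr)
{
	int tmp[2];
	int rd, wr, hi;
	int rc = -1;

	if (want_rd == want_wr || fd_in_use(want_rd) || fd_in_use(want_wr))
		return -1;
	if (pipe(tmp) < 0)
		return -1;

	/* Park both ends above the requested numbers to avoid collisions. */
	hi = (want_rd > want_wr ? want_rd : want_wr) + 1;
	rd = fcntl(tmp[0], F_DUPFD, hi);
	wr = fcntl(tmp[1], F_DUPFD, hi);
	close(tmp[0]);
	close(tmp[1]);

	if (rd >= 0 && wr >= 0) {
		assert(rd != want_rd && wr != want_wr);
		if (dup2(rd, want_rd) < 0) {
			/* nothing installed yet */
		} else if (dup2(wr, want_wr) < 0) {
			close(want_rd);	/* undo the half-built pipe */
		} else {
			rc = 0;
		}
	}

	/* The parked copies are no longer needed either way. */
	if (rd >= 0)
		close(rd);
	if (wr >= 0)
		close(wr);
	return rc;
}
```

With this helper, the { 172, 101 } pair used by test4() becomes pipe_at(172, 101), after which fd 172 is the read end and fd 101 the write end.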