[Devel] [PATCH 9/9] Document usage of multiple-instances of devpts

2008-10-14 Thread sukadev

From c4596977ca34b9664d97efa8681e6711145a22cf Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 9/9] Document usage of multiple-instances of devpts

Changelog [v2]:
- Add note indicating strict isolation is not possible unless all
  mounts of devpts use the 'newinstance' mount option.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 Documentation/filesystems/devpts.txt |  132 ++
 1 files changed, 132 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/devpts.txt

diff --git a/Documentation/filesystems/devpts.txt 
b/Documentation/filesystems/devpts.txt
new file mode 100644
index 0000000..68dffd8
--- /dev/null
+++ b/Documentation/filesystems/devpts.txt
@@ -0,0 +1,132 @@
+
+To support containers, we now allow multiple instances of devpts filesystem,
+such that indices of ptys allocated in one instance are independent of indices
+allocated in other instances of devpts.
+
+To preserve backward compatibility, this support for multiple instances is
+enabled only if:
+
+   - CONFIG_DEVPTS_MULTIPLE_INSTANCES=y, and
+   - '-o newinstance' mount option is specified while mounting devpts
+
+IOW, devpts now supports both single-instance and multi-instance semantics.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there is no change in behavior and
+this is referred to as the "legacy" mode. In this mode, the new mount options
+(-o newinstance and -o ptmxmode) will be ignored with a 'bogus option' message
+on console.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and devpts is mounted without the
+'newinstance' option (as in current start-up scripts) the new mount binds
+to the initial kernel mount of devpts. This mode is referred to as the
+'single-instance' mode and the current, single-instance semantics are
+preserved, i.e PTYs are common across the system.
+
+The only difference between this single-instance mode and the legacy mode
+is the presence of a new '/dev/pts/ptmx' node with permissions 0000, which
+can safely be ignored.
+
+If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and 'newinstance' option is specified,
+the mount is considered to be in the multi-instance mode and a new instance
+of the devpts fs is created. Any ptys created in this instance are independent
+of ptys in other instances of devpts. Like in the single-instance mode, the
+/dev/pts/ptmx node is present. To effectively use the multi-instance mode,
+open of /dev/ptmx must be redirected to '/dev/pts/ptmx' using a symlink or
+bind-mount.
+
+Eg: A container startup script could do the following:
+
+   $ chmod 0666 /dev/pts/ptmx
+   $ rm /dev/ptmx
+   $ ln -s pts/ptmx /dev/ptmx
+   $ ns_exec -cm /bin/bash
+
+   # We are now in new container
+
+   $ umount /dev/pts
+   $ mount -t devpts -o newinstance lxcpts /dev/pts
+   $ sshd -p 1234
+
+where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs
+/bin/bash in the child process.  A pty created by the sshd is not visible in
+the original mount of /dev/pts.
+
+User-space changes
+------------------
+
+In multi-instance mode (i.e '-o newinstance' mount option is specified at least
+once), the following user-space issues should be noted.
+
+1. If -o newinstance mount option is never used, /dev/pts/ptmx can be ignored
+   and no change is needed to system-startup scripts.
+
+2. To effectively use multi-instance mode (i.e -o newinstance is specified)
+   administrators or startup scripts should "redirect" open of /dev/ptmx to
+   /dev/pts/ptmx using either a bind mount or symlink.
+
+   $ mount -t devpts -o newinstance devpts /dev/pts
+
+   followed by either
+
+   $ rm /dev/ptmx
+   $ ln -s pts/ptmx /dev/ptmx
+   $ chmod 666 /dev/pts/ptmx
+   or
+   $ mount -o bind /dev/pts/ptmx /dev/ptmx
+
+3. The '/dev/ptmx -> pts/ptmx' symlink is the preferred method since it
+   enables better error-reporting and treats both single-instance and
+   multi-instance mounts similarly.
+
+   But this method requires that system-startup scripts set the mode of
+   /dev/pts/ptmx correctly (default mode is 0000). The scripts can set the
+   mode by, either
+
+   - adding ptmxmode mount option to devpts entry in /etc/fstab, or
+   - using 'chmod 0666 /dev/pts/ptmx'
+
+4. If multi-instance mode mount is needed for containers, but the system
+   startup scripts have not yet been updated, container-startup scripts
+   should bind mount /dev/ptmx to /dev/pts/ptmx to avoid breaking single-
+   instance mounts.
+
+   Or, in general, container-startup scripts should use:
+
+   mount -t devpts -o newinstance -o ptmxmode=0666 devpts /dev/pts
+   if [ ! -L /dev/ptmx ]; then
+   mount -o bind /dev/pts/ptmx /dev/ptmx
+   fi
+
+   When all devpts mounts are multi-instance, /dev

[Devel] [PATCH 6/9] Define mknod_ptmx()

2008-10-14 Thread sukadev

From ff0c06fb1878a3c20a6a91d9666d44f56eb94ff7 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 6/9] Define mknod_ptmx()

/dev/ptmx is closely tied to the devpts filesystem. An open of /dev/ptmx,
allocates the next pty index and the associated device shows up in the
devpts fs as /dev/pts/n.
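
The allocation described above can be observed from user space. The
following is a minimal sketch, assuming a Linux system where TIOCGPTN is
available through <sys/ioctl.h>; the helper name open_ptmx() is
hypothetical and is not part of this patchset. It opens a ptmx node and
asks the kernel which index was allocated, i.e. the 'n' in /dev/pts/n.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the pty multiplexer at `ptmx_path` and report
 * the index the kernel allocated for the new pty.
 * Returns the master fd (the caller closes it), or -1 on failure.
 */
int open_ptmx(const char *ptmx_path, unsigned int *index)
{
    int fd;

    assert(ptmx_path && index);

    fd = open(ptmx_path, O_RDWR | O_NOCTTY);
    if (fd < 0)
        return -1;

    /* TIOCGPTN returns the pty number, i.e. the 'n' in /dev/pts/n. */
    if (ioctl(fd, TIOCGPTN, index) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Passing "/dev/pts/ptmx" instead of "/dev/ptmx" exercises the node this
patch adds, provided its mode has been changed to allow the open.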

With multiple instances of devpts filesystem, during an open of /dev/ptmx
we would be unable to determine which instance of the devpts is being
accessed.

So we move the 'ptmx' node into /dev/pts and use the inode of the 'ptmx'
node to identify the superblock and hence the devpts instance.  This patch
adds ability for the kernel to internally create the [ptmx, c, 5:2] device
when mounting devpts filesystem.  Since the ptmx node in devpts is new and
may surprise some userspace scripts, the default permissions for the new
node is 0000.  These permissions can be changed either using chmod or by
remounting with the new '-o ptmxmode=0666' mount option.

Changelog[v5]:
- [Serge Hallyn bugfix]: Letting new_inode() assign inode number to
  ptmx can collide with hand-assigning inode numbers to ptys. So,
  hand-assign specific inode number to ptmx node also.
- [Serge Hallyn]: Maybe safer to grab root dentry mutex while creating
  ptmx node
- [Bugfix with Serge Hallyn] Replace lookup_one_len() in mknod_ptmx()
  with d_alloc_name() (lookup during ->get_sb() locks up system). To
  simplify patchset, fold the ptmx_dentry patch into this.

Changelog[v4]:
- Change default permissions of pts/ptmx node to 0000.
- Move code for ptmxmode under #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES.

Changelog[v3]:
- Rename ptmx_mode to ptmxmode (for consistency with 'newinstance')

Changelog[v2]:
- [H. Peter Anvin] Remove mknod() system call support and create the
  ptmx node internally.

Changelog[v1]:
- Earlier version of this patch enabled creating /dev/pts/tty as
  well. As pointed out by Al Viro and H. Peter Anvin, that is not
  really necessary.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |  113 +++--
 1 files changed, 109 insertions(+), 4 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 7ae60aa..bbdd7df 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -27,6 +27,13 @@
 #define DEVPTS_SUPER_MAGIC 0x1cd1
 
 #define DEVPTS_DEFAULT_MODE 0600
+/*
+ * ptmx is a new node in /dev/pts and will be unused in legacy (single-
+ * instance) mode. To prevent surprises in user space, set permissions of
+ * ptmx to 0. Use 'chmod' or remount with '-o ptmxmode' to set meaningful
+ * permissions.
+ */
+#define DEVPTS_DEFAULT_PTMX_MODE 0000
 #define PTMX_MINOR 2
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
@@ -40,10 +47,11 @@ struct pts_mount_opts {
uid_t   uid;
gid_t   gid;
umode_t mode;
+   umode_t ptmxmode;
 };
 
 enum {
-   Opt_uid, Opt_gid, Opt_mode,
+   Opt_uid, Opt_gid, Opt_mode, Opt_ptmxmode,
Opt_err
 };
 
@@ -51,12 +59,16 @@ static match_table_t tokens = {
{Opt_uid, "uid=%u"},
{Opt_gid, "gid=%u"},
{Opt_mode, "mode=%o"},
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+   {Opt_ptmxmode, "ptmxmode=%o"},
+#endif
{Opt_err, NULL}
 };
 
 struct pts_fs_info {
struct ida allocated_ptys;
struct pts_mount_opts mount_opts;
+   struct dentry *ptmx_dentry;
 };
 
 static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
@@ -81,6 +93,7 @@ static int parse_mount_options(char *data, struct 
pts_mount_opts *opts)
opts->uid = 0;
opts->gid = 0;
opts->mode= DEVPTS_DEFAULT_MODE;
+   opts->ptmxmode = DEVPTS_DEFAULT_PTMX_MODE;
 
while ((p = strsep(&data, ",")) != NULL) {
substring_t args[MAX_OPT_ARGS];
@@ -109,6 +122,13 @@ static int parse_mount_options(char *data, struct 
pts_mount_opts *opts)
return -EINVAL;
opts->mode = option & S_IALLUGO;
break;
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+   case Opt_ptmxmode:
+   if (match_octal(&args[0], &option))
+   return -EINVAL;
+   opts->ptmxmode = option & S_IALLUGO;
+   break;
+#endif
default:
printk(KERN_ERR "devpts: called with bogus options\n");
return -EINVAL;
@@ -118,12 +138,93 @@ static int parse_mount_options(char *data, struct 
pts_mount_opts *opts)
return 0;
 }
 
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+static int mknod_

[Devel] [PATCH 8/9] Enable multiple instances of devpts

2008-10-14 Thread sukadev

From 80380a560dfe89dede7df33e9e4360653f9bda14 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 8/9] Enable multiple instances of devpts

To support containers, allow multiple instances of devpts filesystem, such
that indices of ptys allocated in one instance are independent of ptys
allocated in other instances of devpts.

But to preserve backward compatibility, enable this support for multiple
instances only if:

- CONFIG_DEVPTS_MULTIPLE_INSTANCES is set to Y, and
- '-o newinstance' mount option is specified while mounting devpts

To use multi-instance mount, a container startup script could:

$ ns_exec -cm /bin/bash
$ umount /dev/pts
$ mount -t devpts -o newinstance lxcpts /dev/pts
$ mount -o bind /dev/pts/ptmx /dev/ptmx
$ /usr/sbin/sshd -p 1234

where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs
/bin/bash in the child process. A pty created by the sshd is not visible in
the original mount of /dev/pts.
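
A container launcher written in C could perform the 'mount -t devpts
-o newinstance' step through mount(2) directly. The sketch below is
only an illustration: the helper names devpts_mount_data() and
mount_private_devpts() are hypothetical, and the choice to always pass
ptmxmode is an assumption, not something this patch requires. The mount
call needs CAP_SYS_ADMIN and is expected to run after the process has
entered its own mount namespace.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: build the devpts mount data string, for example
 * "newinstance,ptmxmode=0666".
 * Returns 0 on success, -1 if buf is too small.
 */
int devpts_mount_data(char *buf, size_t len, int newinstance,
                      unsigned int ptmxmode)
{
    int n;

    assert(buf && len > 0);

    n = snprintf(buf, len, "%sptmxmode=%04o",
                 newinstance ? "newinstance," : "", ptmxmode & 07777);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Mount a private devpts instance on /dev/pts; needs CAP_SYS_ADMIN. */
int mount_private_devpts(void)
{
    char data[64];

    if (devpts_mount_data(data, sizeof(data), 1, 0666) < 0)
        return -1;
    return mount("devpts", "/dev/pts", "devpts", 0, data);
}
```

After a successful mount, /dev/ptmx still has to be redirected to
/dev/pts/ptmx (bind mount or symlink) as in the shell example above.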

USER-SPACE-IMPACT:
- See Documentation/filesystems/devpts.txt (included in next patch) for user-
  space impact in multi-instance and mixed-mode operation.
TODO:
- Update mount(8), pts(4) man pages. Highlight impact of not
  redirecting /dev/ptmx to /dev/pts/ptmx after a multi-instance mount.

Changelog[v6]:
- [Dave Hansen] Use new get_init_pts_sb() interface
- [Serge Hallyn] Don't bother displaying 'newinstance' in show_options
- [Serge Hallyn] Use macros (PARSE_REMOUNT/PARSE_MOUNT) instead of 0/1.
- [Serge Hallyn] Check error return from get_sb_single() (now
  get_init_pts_sb())
- devpts_pty_kill(): don't dput error dentries

Changelog[v5]:
- Move get_sb_ref() definition to earlier patch
- Move usage info to Documentation/filesystems/devpts.txt (next patch)
- Make ptmx node even in init_pts_ns, now that default mode is 0000
  (defined in earlier patch, enabled here).
- Cache ptmx dentry and use to update mode during remount
  (defined in earlier patch, enabled here).
- Bugfix: explicitly ignore newinstance on remount (if newinstance was
  specified on remount of initial mount, it would be ignored but
  /proc/mounts would imply that the option was set)

Changelog[v4]:

- Update patch description to address H. Peter Anvin's comments
- Consolidate multi-instance mode code under new config token,
  CONFIG_DEVPTS_MULTIPLE_INSTANCES.
- Move usage-details from patch description to
  Documentation/fs/devpts.txt

Changelog[v3]:
- Rename new mount option to 'newinstance'
- Create ptmx nodes only in 'newinstance' mounts
- Bugfix: parse_mount_options() modifies @data but since we need to
  parse the @data twice (once in devpts_get_sb() and once during
  do_remount_sb()), parse a local copy of @data in devpts_get_sb().
  (restructured code in devpts_get_sb() to fix this)

Changelog[v2]:
- Support both single-mount and multiple-mount semantics and
      provide '-onewmnt' option to select the semantics.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |  170 ++--
 1 files changed, 163 insertions(+), 7 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 1dfdbf0..f9a9346 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -48,10 +48,11 @@ struct pts_mount_opts {
gid_t   gid;
umode_t mode;
umode_t ptmxmode;
+   int newinstance;
 };
 
 enum {
-   Opt_uid, Opt_gid, Opt_mode, Opt_ptmxmode,
+   Opt_uid, Opt_gid, Opt_mode, Opt_ptmxmode, Opt_newinstance,
Opt_err
 };
 
@@ -61,6 +62,7 @@ static match_table_t tokens = {
{Opt_mode, "mode=%o"},
 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
{Opt_ptmxmode, "ptmxmode=%o"},
+   {Opt_newinstance, "newinstance"},
 #endif
{Opt_err, NULL}
 };
@@ -78,13 +80,17 @@ static inline struct pts_fs_info *DEVPTS_SB(struct 
super_block *sb)
 
 static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 {
+#ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
return inode->i_sb;
-
+#endif
return devpts_mnt->mnt_sb;
 }
 
-static int parse_mount_options(char *data, struct pts_mount_opts *opts)
+#define PARSE_MOUNT    0
+#define PARSE_REMOUNT  1
+
+static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 {
char *p;
 
@@ -95,6 +101,10 @@ static int parse_mount_options(char *data, struct 
pts_mount_opts *opts)
opts->mode= DEVPTS_DEFAULT_MODE;
opts->ptmxmode = DEVPTS_DEFAULT_PTMX_MODE;
 
+   /* newinstance mak

[Devel] [PATCH 7/9] Define get_init_pts_sb()

2008-10-14 Thread sukadev

From 3a2b7147d5aa345ab96d321ffefd326cbc43e24d Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 7/9] Define get_init_pts_sb()

See comments in the function header for details. The new interface will
be used in a follow-on patch.

Changelog [v2]:
[Dave Hansen] Replace get_sb_ref() in fs/super.c with get_init_pts_sb()
and make the new interface private to devpts

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   55 -
 1 files changed, 54 insertions(+), 1 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index bbdd7df..1dfdbf0 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -305,10 +305,63 @@ fail:
return -ENOMEM;
 }
 
+static int compare_init_pts_sb(struct super_block *s, void *p)
+{
+   if (devpts_mnt)
+   return devpts_mnt->mnt_sb == s;
+
+   return 0;
+}
+
+/*
+ * get_init_pts_sb()
+ *
+ * This interface is needed to support multiple namespace semantics in
+ * devpts while preserving backward compatibility of the current 'single-
+ * namespace' semantics. i.e all mounts of devpts without the 'newinstance'
+ * mount option should bind to the initial kernel mount, like
+ * get_sb_single().
+ *
+ * Mounts with 'newinstance' option create a new private namespace.
+ *
+ * But for single-mount semantics, devpts cannot use get_sb_single(),
+ * because get_sb_single()/sget() find and use the super-block from
+ * the most recent mount of devpts. But that recent mount may be a
+ * 'newinstance' mount and get_sb_single() would pick the newinstance
+ * super-block instead of the initial super-block.
+ *
+ * This interface is identical to get_sb_single() except that it
+ * consistently selects the 'single-namespace' superblock even in the
+ * presence of the private namespace (i.e 'newinstance') super-blocks.
+ */
+static int get_init_pts_sb(struct file_system_type *fs_type, int flags,
+   void *data, struct vfsmount *mnt)
+{
+struct super_block *s;
+int error;
+
+s = sget(fs_type, compare_init_pts_sb, set_anon_super, NULL);
+if (IS_ERR(s))
+return PTR_ERR(s);
+
+if (!s->s_root) {
+s->s_flags = flags;
+error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+if (error) {
+up_write(&s->s_umount);
+deactivate_super(s);
+return error;
+}
+s->s_flags |= MS_ACTIVE;
+}
+do_remount_sb(s, flags, data, 0);
+return simple_set_mnt(mnt, s);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-   return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+   return get_init_pts_sb(fs_type, flags, data, mnt);
 }
 
 static void devpts_kill_sb(struct super_block *sb)
-- 
1.5.2.5

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/9] Per-mount 'config' object

2008-10-14 Thread sukadev

From fa07e30bf77063b127129d317e91d6dc454ea739 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 3/9] Per-mount 'config' object

With support for multiple mounts of devpts, the 'config' structure really
represents per-mount options rather than config parameters. Rename 'config'
structure to 'pts_mount_opts' and store it in the super-block.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   49 +
 1 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 6e63db7..e91c15c 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -34,13 +34,13 @@ static DEFINE_MUTEX(allocated_ptys_lock);
 
 static struct vfsmount *devpts_mnt;
 
-static struct {
+struct pts_mount_opts {
int setuid;
int setgid;
uid_t   uid;
gid_t   gid;
umode_t mode;
-} config = {.mode = DEVPTS_DEFAULT_MODE};
+};
 
 enum {
Opt_uid, Opt_gid, Opt_mode,
@@ -56,6 +56,7 @@ static match_table_t tokens = {
 
 struct pts_fs_info {
struct ida allocated_ptys;
+   struct pts_mount_opts mount_opts;
 };
 
 static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
@@ -74,12 +75,14 @@ static inline struct super_block *pts_sb_from_inode(struct 
inode *inode)
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
+   struct pts_fs_info *fsi = DEVPTS_SB(sb);
+   struct pts_mount_opts *opts = &fsi->mount_opts;
 
-   config.setuid  = 0;
-   config.setgid  = 0;
-   config.uid = 0;
-   config.gid = 0;
-   config.mode= DEVPTS_DEFAULT_MODE;
+   opts->setuid  = 0;
+   opts->setgid  = 0;
+   opts->uid = 0;
+   opts->gid = 0;
+   opts->mode= DEVPTS_DEFAULT_MODE;
 
while ((p = strsep(&data, ",")) != NULL) {
substring_t args[MAX_OPT_ARGS];
@@ -94,19 +97,19 @@ static int devpts_remount(struct super_block *sb, int 
*flags, char *data)
case Opt_uid:
if (match_int(&args[0], &option))
return -EINVAL;
-   config.uid = option;
-   config.setuid = 1;
+   opts->uid = option;
+   opts->setuid = 1;
break;
case Opt_gid:
if (match_int(&args[0], &option))
return -EINVAL;
-   config.gid = option;
-   config.setgid = 1;
+   opts->gid = option;
+   opts->setgid = 1;
break;
case Opt_mode:
if (match_octal(&args[0], &option))
return -EINVAL;
-   config.mode = option & S_IALLUGO;
+   opts->mode = option & S_IALLUGO;
break;
default:
printk(KERN_ERR "devpts: called with bogus options\n");
@@ -119,11 +122,14 @@ static int devpts_remount(struct super_block *sb, int 
*flags, char *data)
 
 static int devpts_show_options(struct seq_file *seq, struct vfsmount *vfs)
 {
-   if (config.setuid)
-   seq_printf(seq, ",uid=%u", config.uid);
-   if (config.setgid)
-   seq_printf(seq, ",gid=%u", config.gid);
-   seq_printf(seq, ",mode=%03o", config.mode);
+   struct pts_fs_info *fsi = DEVPTS_SB(vfs->mnt_sb);
+   struct pts_mount_opts *opts = &fsi->mount_opts;
+
+   if (opts->setuid)
+   seq_printf(seq, ",uid=%u", opts->uid);
+   if (opts->setgid)
+   seq_printf(seq, ",gid=%u", opts->gid);
+   seq_printf(seq, ",mode=%03o", opts->mode);
 
return 0;
 }
@@ -143,6 +149,7 @@ static void *new_pts_fs_info(void)
return NULL;
 
ida_init(&fsi->allocated_ptys);
+   fsi->mount_opts.mode = DEVPTS_DEFAULT_MODE;
 
return fsi;
 }
@@ -262,6 +269,8 @@ int devpts_pty_new(struct inode *ptmx_inode, struct 
tty_struct *tty)
struct super_block *sb = pts_sb_from_inode(ptmx_inode);
struct inode *inode = new_inode(sb);
struct dentry *root = sb->s_root;
+   struct pts_fs_info *fsi = DEVPTS_SB(sb);
+   struct pts_mount_opts *opts = &fsi->mount_opts;
char s[12];
 
/* We're supposed to be given the slave end of a pty */
@@ -272,10 +281,10 @@ int devpts_pty_new(struct inode *ptmx_inode, struct 
tty_struct *tty)
return -ENOMEM;
 
inode->i_ino = number+2;

[Devel] [PATCH 4/9] Extract option parsing to new function

2008-10-14 Thread sukadev

From c4e1a348c2424ce503c24c8a56fa91015d9ee194 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 4/9] Extract option parsing to new function

Move code to parse mount options into a separate function so it can
(later) be shared between mount and remount operations.
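
To make the shape of such a shared parser concrete, here is a rough
user-space sketch. It is not the kernel code: it scans the option string
with strcspn()/strtoul() instead of match_token(), and the names
struct pts_opts, match_num() and parse_pts_options() are hypothetical.
It handles uid=, gid=, mode= and the ptmxmode= option introduced later
in the series, resets all fields to defaults before parsing, and rejects
unknown options.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space mirror of the per-mount options. */
struct pts_opts {
    int setuid, setgid;
    unsigned int uid, gid;
    unsigned int mode;
    unsigned int ptmxmode;
};

/* Match "key<number>" in tok[0..toklen) and parse the number in `base`. */
static int match_num(const char *tok, size_t toklen, const char *key,
                     int base, unsigned int *out)
{
    size_t klen = strlen(key);
    char *end;
    unsigned long v;

    if (toklen <= klen || strncmp(tok, key, klen) != 0)
        return 0;
    v = strtoul(tok + klen, &end, base);
    if (end != tok + toklen)    /* trailing junk inside the token */
        return 0;
    *out = (unsigned int)v;
    return 1;
}

/* Returns 0 on success, -1 on an unknown or malformed option. */
int parse_pts_options(const char *data, struct pts_opts *opts)
{
    assert(opts);

    /* Reset to defaults first, so a remount starts from a clean slate. */
    memset(opts, 0, sizeof(*opts));
    opts->mode = 0600;

    while (data && *data) {
        size_t len = strcspn(data, ",");
        unsigned int v;

        if (len == 0) {
            /* empty token, nothing to do */
        } else if (match_num(data, len, "uid=", 10, &v)) {
            opts->uid = v;
            opts->setuid = 1;
        } else if (match_num(data, len, "gid=", 10, &v)) {
            opts->gid = v;
            opts->setgid = 1;
        } else if (match_num(data, len, "mode=", 8, &v)) {
            opts->mode = v & 07777;
        } else if (match_num(data, len, "ptmxmode=", 8, &v)) {
            opts->ptmxmode = v & 07777;
        } else {
            return -1;
        }
        data += len;
        if (*data == ',')
            data++;
    }
    return 0;
}
```

Because the function only takes the option string and a destination
structure, both a mount path and a remount path can call it unchanged.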

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   12 +---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index e91c15c..7ae60aa 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -72,11 +72,9 @@ static inline struct super_block *pts_sb_from_inode(struct 
inode *inode)
return devpts_mnt->mnt_sb;
 }
 
-static int devpts_remount(struct super_block *sb, int *flags, char *data)
+static int parse_mount_options(char *data, struct pts_mount_opts *opts)
 {
char *p;
-   struct pts_fs_info *fsi = DEVPTS_SB(sb);
-   struct pts_mount_opts *opts = &fsi->mount_opts;
 
opts->setuid  = 0;
opts->setgid  = 0;
@@ -120,6 +118,14 @@ static int devpts_remount(struct super_block *sb, int 
*flags, char *data)
return 0;
 }
 
+static int devpts_remount(struct super_block *sb, int *flags, char *data)
+{
+   struct pts_fs_info *fsi = DEVPTS_SB(sb);
+   struct pts_mount_opts *opts = &fsi->mount_opts;
+
+   return parse_mount_options(data, opts);
+}
+
 static int devpts_show_options(struct seq_file *seq, struct vfsmount *vfs)
 {
struct pts_fs_info *fsi = DEVPTS_SB(vfs->mnt_sb);
-- 
1.5.2.5



[Devel] [PATCH 1/9] Remove devpts_root global

2008-10-14 Thread sukadev

From 1479f9e238d607403abf5af4296bd5c84a1dc18f Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 1/9] Remove devpts_root global

Remove the 'devpts_root' global variable and find the root dentry using
the super_block. The super-block can be found from the device inode, using
the new wrapper, pts_sb_from_inode().
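
The lookup rule can be modelled outside the kernel. The sketch below
uses tiny stand-in structures (they only borrow the kernel field names
and are not the real definitions), and the function name
sb_from_inode() and its explicit init_sb parameter are hypothetical. It
shows the decision only: an inode that lives on a devpts filesystem
selects its own super-block, anything else falls back to the initial
devpts mount.

```c
#include <assert.h>
#include <stddef.h>

#define DEVPTS_SUPER_MAGIC 0x1cd1

/* User-space stand-ins for the kernel structures (illustration only). */
struct super_block {
    unsigned long s_magic;
};

struct inode {
    struct super_block *i_sb;
};

/*
 * Pick the devpts super-block for a device inode. `init_sb` plays the
 * role of devpts_mnt->mnt_sb, the initial kernel mount.
 */
struct super_block *sb_from_inode(struct inode *inode,
                                  struct super_block *init_sb)
{
    assert(inode && inode->i_sb && init_sb);

    if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
        return inode->i_sb;

    return init_sb;
}
```

An inode for /dev/ptmx on tmpfs would take the fallback branch, while an
inode for a pts/ptmx node inside a devpts mount would select that mount.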

Changelog: This patch is based on an earlier patchset from Serge Hallyn
   and Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   29 -
 1 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index a70d5d0..ec33833 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -34,7 +34,6 @@ static DEFINE_IDA(allocated_ptys);
 static DEFINE_MUTEX(allocated_ptys_lock);
 
 static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
 
 static struct {
int setuid;
@@ -56,6 +55,14 @@ static match_table_t tokens = {
{Opt_err, NULL}
 };
 
+static inline struct super_block *pts_sb_from_inode(struct inode *inode)
+{
+   if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
+   return inode->i_sb;
+
+   return devpts_mnt->mnt_sb;
+}
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
@@ -142,7 +149,7 @@ devpts_fill_super(struct super_block *s, void *data, int 
silent)
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
-   devpts_root = s->s_root = d_alloc_root(inode);
+   s->s_root = d_alloc_root(inode);
if (s->s_root)
return 0;

@@ -211,7 +218,9 @@ int devpts_pty_new(struct inode *ptmx_inode, struct 
tty_struct *tty)
struct tty_driver *driver = tty->driver;
dev_t device = MKDEV(driver->major, driver->minor_start+number);
struct dentry *dentry;
-   struct inode *inode = new_inode(devpts_mnt->mnt_sb);
+   struct super_block *sb = pts_sb_from_inode(ptmx_inode);
+   struct inode *inode = new_inode(sb);
+   struct dentry *root = sb->s_root;
char s[12];
 
/* We're supposed to be given the slave end of a pty */
@@ -231,15 +240,15 @@ int devpts_pty_new(struct inode *ptmx_inode, struct 
tty_struct *tty)
 
sprintf(s, "%d", number);
 
-   mutex_lock(&devpts_root->d_inode->i_mutex);
+   mutex_lock(&root->d_inode->i_mutex);
 
-   dentry = d_alloc_name(devpts_root, s);
+   dentry = d_alloc_name(root, s);
if (!IS_ERR(dentry)) {
d_add(dentry, inode);
-   fsnotify_create(devpts_root->d_inode, dentry);
+   fsnotify_create(root->d_inode, dentry);
}
 
-   mutex_unlock(&devpts_root->d_inode->i_mutex);
+   mutex_unlock(&root->d_inode->i_mutex);
 
return 0;
 }
@@ -256,11 +265,13 @@ struct tty_struct *devpts_get_tty(struct inode 
*pts_inode, int number)
 void devpts_pty_kill(struct tty_struct *tty)
 {
struct inode *inode = tty->driver_data;
+   struct super_block *sb = pts_sb_from_inode(inode);
+   struct dentry *root = sb->s_root;
struct dentry *dentry;
 
BUG_ON(inode->i_rdev == MKDEV(TTYAUX_MAJOR, PTMX_MINOR));
 
-   mutex_lock(&devpts_root->d_inode->i_mutex);
+   mutex_lock(&root->d_inode->i_mutex);
 
dentry = d_find_alias(inode);
if (dentry && !IS_ERR(dentry)) {
@@ -269,7 +280,7 @@ void devpts_pty_kill(struct tty_struct *tty)
dput(dentry);
}
 
-   mutex_unlock(&devpts_root->d_inode->i_mutex);
+   mutex_unlock(&root->d_inode->i_mutex);
 }
 
 static int __init init_devpts_fs(void)
-- 
1.5.2.5



[Devel] [PATCH 2/9] Per-mount allocated_ptys

2008-10-14 Thread sukadev

From d12a714cbd541b808a80f9f556fda4a1f3bf4198 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 2/9] Per-mount allocated_ptys

To enable multiple mounts of devpts, 'allocated_ptys' must be a per-mount
variable rather than a global variable.  Move 'allocated_ptys' into the
super_block's s_fs_info.
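
The effect of making the allocator per-mount can be shown with a small
user-space model. This sketch deliberately swaps the kernel IDA for a
plain byte array, and PTY_LIMIT, struct pts_instance and the three
functions are hypothetical names. Each instance hands out the lowest
free index on its own, so two instances can both own index 0 at the
same time.

```c
#include <assert.h>
#include <string.h>

#define PTY_LIMIT 64    /* hypothetical cap, stands in for pty_limit */

/* One allocator per devpts instance (a byte array, not a kernel IDA). */
struct pts_instance {
    unsigned char used[PTY_LIMIT];
};

void pts_instance_init(struct pts_instance *inst)
{
    assert(inst);
    memset(inst->used, 0, sizeof(inst->used));
}

/* Allocate the lowest free index in this instance, or return -1. */
int pts_new_index(struct pts_instance *inst)
{
    int i;

    assert(inst);
    for (i = 0; i < PTY_LIMIT; i++) {
        if (!inst->used[i]) {
            inst->used[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Release an index so that it can be handed out again. */
void pts_kill_index(struct pts_instance *inst, int idx)
{
    assert(inst && idx >= 0 && idx < PTY_LIMIT);
    inst->used[idx] = 0;
}
```

With a single global allocator the second instance would have received
index 2 rather than 0, which is the behavior this patch moves away from.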

Changelog[v2]:
Define and use DEVPTS_SB() wrapper.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   55 ++--
 1 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index ec33833..6e63db7 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -30,7 +30,6 @@
 #define PTMX_MINOR 2
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDA(allocated_ptys);
 static DEFINE_MUTEX(allocated_ptys_lock);
 
 static struct vfsmount *devpts_mnt;
@@ -55,6 +54,15 @@ static match_table_t tokens = {
{Opt_err, NULL}
 };
 
+struct pts_fs_info {
+   struct ida allocated_ptys;
+};
+
+static inline struct pts_fs_info *DEVPTS_SB(struct super_block *sb)
+{
+   return sb->s_fs_info;
+}
+
 static inline struct super_block *pts_sb_from_inode(struct inode *inode)
 {
if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
@@ -126,6 +134,19 @@ static const struct super_operations devpts_sops = {
.show_options   = devpts_show_options,
 };
 
+static void *new_pts_fs_info(void)
+{
+   struct pts_fs_info *fsi;
+
+   fsi = kzalloc(sizeof(struct pts_fs_info), GFP_KERNEL);
+   if (!fsi)
+   return NULL;
+
+   ida_init(&fsi->allocated_ptys);
+
+   return fsi;
+}
+
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -137,9 +158,13 @@ devpts_fill_super(struct super_block *s, void *data, int 
silent)
s->s_op = &devpts_sops;
s->s_time_gran = 1;
 
+   s->s_fs_info = new_pts_fs_info();
+   if (!s->s_fs_info)
+   goto fail;
+
inode = new_inode(s);
if (!inode)
-   goto fail;
+   goto free_fsi;
inode->i_ino = 1;
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
inode->i_blocks = 0;
@@ -155,6 +180,9 @@ devpts_fill_super(struct super_block *s, void *data, int 
silent)

printk("devpts: get root dentry failed\n");
iput(inode);
+
+free_fsi:
+   kfree(s->s_fs_info);
 fail:
return -ENOMEM;
 }
@@ -165,11 +193,19 @@ static int devpts_get_sb(struct file_system_type *fs_type,
return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
 }
 
+static void devpts_kill_sb(struct super_block *sb)
+{
+   struct pts_fs_info *fsi = DEVPTS_SB(sb);
+
+   kfree(fsi);
+   kill_anon_super(sb);
+}
+
 static struct file_system_type devpts_fs_type = {
.owner  = THIS_MODULE,
.name   = "devpts",
.get_sb = devpts_get_sb,
-   .kill_sb= kill_anon_super,
+   .kill_sb= devpts_kill_sb,
 };
 
 /*
@@ -179,16 +215,18 @@ static struct file_system_type devpts_fs_type = {
 
 int devpts_new_index(struct inode *ptmx_inode)
 {
+   struct super_block *sb = pts_sb_from_inode(ptmx_inode);
+   struct pts_fs_info *fsi = DEVPTS_SB(sb);
int index;
int ida_ret;
 
 retry:
-   if (!ida_pre_get(&allocated_ptys, GFP_KERNEL)) {
+   if (!ida_pre_get(&fsi->allocated_ptys, GFP_KERNEL)) {
return -ENOMEM;
}
 
mutex_lock(&allocated_ptys_lock);
-   ida_ret = ida_get_new(&allocated_ptys, &index);
+   ida_ret = ida_get_new(&fsi->allocated_ptys, &index);
if (ida_ret < 0) {
mutex_unlock(&allocated_ptys_lock);
if (ida_ret == -EAGAIN)
@@ -197,7 +235,7 @@ retry:
}
 
if (index >= pty_limit) {
-   ida_remove(&allocated_ptys, index);
+   ida_remove(&fsi->allocated_ptys, index);
mutex_unlock(&allocated_ptys_lock);
return -EIO;
}
@@ -207,8 +245,11 @@ retry:
 
 void devpts_kill_index(struct inode *ptmx_inode, int idx)
 {
+   struct super_block *sb = pts_sb_from_inode(ptmx_inode);
+   struct pts_fs_info *fsi = DEVPTS_SB(sb);
+
mutex_lock(&allocated_ptys_lock);
-   ida_remove(&allocated_ptys, idx);
+   ida_remove(&fsi->allocated_ptys, idx);
mutex_unlock(&allocated_ptys_lock);
 }
 
-- 
1.5.2.5



[Devel] [PATCH 0/9] Multiple devpts instances

2008-10-14 Thread sukadev

Enable multiple instances of devpts filesystem so each container can allocate
ptys independently.

User interface:

Since supporting multiple mounts of devpts can break user-space, this
feature is enabled only if:

- CONFIG_DEVPTS_MULTIPLE_INSTANCES=y (new config token), and
- new mount option, -o newinstance is specified

If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there should be no change in
behavior.

See [PATCH 9/9] - Documentation/filesystems/devpts.txt for detailed
usage/compatibility information.

[PATCH 1/9] Remove devpts_root global
[PATCH 2/9] Per-mount allocated_ptys
[PATCH 3/9] Per-mount 'config' object
[PATCH 4/9] Extract option parsing to new function
[PATCH 5/9] Add DEVPTS_MULTIPLE_INSTANCES config token
[PATCH 6/9] Define mknod_ptmx()
[PATCH 7/9] Define get_init_pts_sb()
[PATCH 8/9] Enable multiple instances of devpts
[PATCH 9/9] Document usage of multiple-instances of devpts

Implementation notes:

1. To enable multiple mounts of /dev/pts, (most) devpts interfaces
   need to know which instance of devpts is being accessed. This
   patchset uses the 'struct inode' or 'struct tty_struct' of the
   device being accessed to identify the appropriate devpts instance.
   Hence the need for the /dev/pts/ptmx bind-mount.

2. Mount options must be parsed twice during mount: once to determine
   the mode of mount (single/multi-instance) and once to actually
   save the options. There does not seem to be an easy way to parse
   once and reuse (See 'safe_process_mount_opts()' in [PATCH 9/10])
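
The reason a second pass needs care is that strsep() overwrites each
separator with a NUL byte, so the option string is consumed by the first
pass. A user-space sketch of the "parse a local copy" approach follows;
has_mount_option() is a hypothetical helper and not a function from this
patchset. It checks for a flag such as 'newinstance' on a private copy
and leaves the caller's string intact for the real parse.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE    /* for strsep() on glibc */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return 1 if `name` appears as a token in the comma-separated `data`,
 * 0 if it does not, -1 on allocation failure. `data` is left unmodified
 * because strsep() runs on a private copy.
 */
int has_mount_option(const char *data, const char *name)
{
    char *copy, *cursor, *tok;
    int found = 0;

    assert(name);
    if (!data)
        return 0;

    copy = malloc(strlen(data) + 1);
    if (!copy)
        return -1;
    strcpy(copy, data);

    cursor = copy;
    while ((tok = strsep(&cursor, ",")) != NULL) {
        if (strcmp(tok, name) == 0) {
            found = 1;
            break;
        }
    }

    free(copy);
    return found;
}
```

The cost is one allocation per mount, which seems acceptable given how
rarely devpts is mounted.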


Changelog [v5]:
- Merge all option-parsing (previously in patches 6,7) into patch 6.
- Bugfixes (see changelog in patches 6 and 8)
- Replace get_sb_ref() with devpts-specific get_init_pts_sb() (patch 7)
- Minor update to devpts.txt documentation (patch 9)


Changelog [v4]:

- Port to 2008-09-04 ttydev tree (and drop patches that were merged in)
- Add DEVPTS_MULTIPLE_INSTANCES config token (patch 5) and move new
  behavior under #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES. 
- Create ptmx node even in initial mount 
- Change default permissions of pts/ptmx node to 0000 (patch 6)
- Cache ptmx dentry and use to update permissions of ptmx node on
  remount so legacy mode can use the node with a simpler change to
  /etc/fstab (patch 7)
- Move get_sb_ref() helper code to separate patch (patch 8)
- Add Documentation/filesystems/devpts.txt and describe usage info
  in that file.
  
Changelog [v3]:

- Port to 2008-08-28 ttydev tree
- Rename new mount options to 'ptmxmode' and 'newinstance'.
- [Alan Cox] Use tty driver data to identify devpts (this is used to
  cleanup get_node() in devpts_pty_kill()).
- [H. Peter Anvin] get_node() cleanup in devpts (which was enabled by
  the inode/tty parameter to devpts interfaces)
- Bugfix in multi-mount mode (see Patch 11/11).
- Executed pty tests in LTP (in both single-instance and multi-instance
  mode)
- Should be bisect-safe :-)

Changelog [v2]:

- New mount option '-o newmnt' added (patch 8/8)
- Support both single-mount and multi-mount semantics (patch 8/8)
- Automatically create ptmx node when devpts is mounted (patch 7/8)
- Extract option parsing code to new function (patch 6/8)
- Make 'config' params per-mount variables (patch 5/8)
- Slightly re-ordered existing patches in set (patches 1/8, 2/8)

TODO:
- Do we need '-o ptmxuid' and '-o ptmxgid' options as well ?
- Update mount(8) man page
- (Sometime in future) Remove even initial kernel mount of devpts 
- Any other good test suites to test this (besides LTP, sshd).
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 5/9] Add DEVPTS_MULTIPLE_INSTANCES config token

2008-10-14 Thread sukadev

>From 18a468a2f2db8f055bf62882d44d40764e924f3b Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 5/9] Add DEVPTS_MULTIPLE_INSTANCES config token

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 drivers/char/Kconfig |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 700ff96..0d3ea89 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -443,6 +443,17 @@ config UNIX98_PTYS
  All modern Linux systems use the Unix98 ptys.  Say Y unless
  you're on an embedded system and want to conserve memory.
 
+config DEVPTS_MULTIPLE_INSTANCES
+   bool "Support multiple instances of devpts"
+   depends on UNIX98_PTYS
+   default n
+   ---help---
+ Enable support for multiple instances of devpts filesystem.
+ If you want to have isolated PTY namespaces (eg: in containers),
+ say Y here.  Otherwise, say N. If enabled, each mount of devpts
+ filesystem with the '-o newinstance' option will create an
+ independent PTY namespace.
+
 config LEGACY_PTYS
bool "Legacy (BSD) PTY support"
default y
-- 
1.5.2.5



[Devel] [RFC][PATCH] 'kill sig -1' must only apply to callers namespace

2008-10-21 Thread sukadev

>From d92b4befe07c6a1e852e4462126a5443342448cd Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Tue, 21 Oct 2008 18:00:01 -0700
Subject: [PATCH] kill sig -1 must only apply to callers namespace

Currently "kill  -1" kills processes in all namespaces and breaks the
isolation of namespaces. An earlier attempt to fix this is discussed at:

http://lkml.org/lkml/2008/7/23/148

but nothing seems to have happened since then.

This patch uses the simple fix suggested by Oleg Nesterov.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 kernel/signal.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 105217d..4530fc6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1144,7 +1144,8 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
struct task_struct * p;
 
for_each_process(p) {
-   if (p->pid > 1 && !same_thread_group(p, current)) {
+   if (task_pid_vnr(p) > 1 &&
+   !same_thread_group(p, current)) {
int err = group_send_sig_info(sig, info, p);
++count;
if (err != -EPERM)
-- 
1.5.2.5



[Devel] [PATCH] 'kill sig -1' must only apply to caller's namespace

2008-10-23 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH] 'kill sig -1' must only apply to caller's namespace

Currently "kill  -1" kills processes in all namespaces and breaks the
isolation of namespaces. An earlier attempt to fix this was discussed at:

http://lkml.org/lkml/2008/7/23/148

As suggested by Oleg Nesterov in that thread, use a "task_pid_vnr() > 1"
check, since task_pid_vnr() returns 0 if the process is outside the caller's
namespace.
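
The effect of the new check can be modelled in user space. The sketch
below is not kernel code: struct ns_model, struct pid_model and both
function names are invented for illustration, and same_thread_group() is
left out. It only mimics how a pid number is looked up relative to a
namespace and why a task outside the caller's namespace is skipped.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NS_LEVEL 4

/* Hypothetical stand-in for a pid namespace; level 0 is the initial one. */
struct ns_model {
	int level;
};

/* Hypothetical stand-in for struct pid: one number per namespace level. */
struct pid_model {
	int level;                               /* deepest level the task lives in */
	int nr[MAX_NS_LEVEL];                    /* pid number as seen from each level */
	const struct ns_model *ns[MAX_NS_LEVEL]; /* namespace at each level */
};

/* Pid number of 'pid' as seen from 'ns', or 0 if it is not visible there. */
int pid_nr_in_ns(const struct pid_model *pid, const struct ns_model *ns)
{
	assert(ns->level >= 0 && ns->level < MAX_NS_LEVEL);

	if (ns->level <= pid->level && pid->ns[ns->level] == ns)
		return pid->nr[ns->level];
	return 0;
}

/* Models the patched condition: skip init and anything not visible. */
int kill_all_targets(const struct pid_model *pid, const struct ns_model *caller_ns)
{
	return pid_nr_in_ns(pid, caller_ns) > 1;
}
```

With a caller inside a child namespace, a host task maps to 0 and the
container's init maps to 1, so neither passes the "> 1" test, while
ordinary tasks in the container still do.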

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Eric W. Biederman <[EMAIL PROTECTED]>
Tested-by: Daniel Hokka Zakrisson <[EMAIL PROTECTED]>
---
 kernel/signal.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 105217d..4530fc6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1144,7 +1144,8 @@ static int kill_something_info(int sig, struct siginfo *info, pid_t pid)
struct task_struct * p;
 
for_each_process(p) {
-   if (p->pid > 1 && !same_thread_group(p, current)) {
+   if (task_pid_vnr(p) > 1 &&
+   !same_thread_group(p, current)) {
int err = group_send_sig_info(sig, info, p);
++count;
if (err != -EPERM)
-- 
1.5.2.5



[Devel] Signals to cinit

2008-11-01 Thread sukadev

Signals to container-init

It's been almost a year since we tried to address the signals to
container-init issue. We have two patchsets both of which address
the main issues and should work for current 'sysV init' (since init
explicitly ignores signals it does not handle).

But both patchsets fail in some corner cases and so the approaches
have stalled.

Can we choose one of these approaches and clearly define limitations
(with maybe noisy warnings for known corner-cases) ?  That way, cinits
have an option of working-around in user-space till a better solution
is in place.

Or should we explore other, potentially more expensive/intrusive
solutions ?  

I have below a brief summary of the two patchsets we tried before
and a very high-level suggestion for a more expensive/intrusive
fix.

Eric's patchset:


https://lists.linux-foundation.org/pipermail/containers/2007-December/009152.html

- Defines a semantic that drops signals to cinit if handler for signal
  is SIG_DFL.

- SIG_DFL signal is ignored even when blocked.

- Fails with blocked signal if handler is SIG_DFL when blocked.

  cinit: signal set to SIG_DFL
  cinit: block signal
  ancestor sends a fatal signal to cinit
  signal is ignored (since handler is SIG_DFL)

- cinit uses sigtimedwait() for a fatal signal set to SIG_DFL.
  The patchset ignores the signal (due to the SIG_DFL).

- /sbin/init blocks SIGCHLD, execs a new program which then installs
  a handler for SIGCHLD. But since the SIGCHLD handler is SIG_DFL just
  after exec(), the SIGCHLD could be missed.


Oleg's patchset:

Originally posted here:
http://marc.info/?l=linux-kernel&m=118753610515859

An updated patch is included in this mail:


https://lists.linux-foundation.org/pipermail/containers/2007-December/009308.html

- Fails with blocked signals

cinit: block fatal signal
descendant posts signal to cinit
signal is queued since it was blocked
cinit sets handler for signal to SIG_DFL
cinit unblocks signal and terminates even though it's from a
descendant.

- Fails with ptraced cinit ?

- Drops a signal in sigaction() or get_signal_to_deliver() after
  enqueuing it ("started processing it")

  (side-note:  But isn't there a precedent for it anyway ?
  get_signal_to_deliver() currently does ignore signals it does not
  want. sigaction() drops signals that were set to SIG_IGN)

- To quote Eric, "does not start with a 'solid' definition and can
  become unmaintainable", but I am not too clear on this comment.


Track ancestor-signals separately: (not implemented)

Add a second 'sigset_t' to sigpending:

struct sigpending {
struct list_head list;
sigset_t signal;
sigset_t ancestor_signal;
};

'ancestor_signal' is only used for container-inits. The global init
has no ancestors, so it is not affected.

If a cinit receives a signal from a descendant, signal gets added
to 'sigpending->signal' set (as is done today).

If cinit receives a signal from an ancestor, signal is added to
'ancestor_signal' set.

When delivering the signal (maybe in get_signal_to_deliver()), the
cinit can determine if the signal was from an ancestor or a descendant
and act accordingly.

If the same signal is received from both ancestor and descendant, it
would be set in both sets, and we could make it a policy that the ancestor
has priority (i.e., the signal from the descendant is ignored/dropped).

Other observations:
- when queuing the signal, we use the same 'sigpending->list' 
  regardless of the sender's namespace so the order of
  processing of signals is unchanged.

- when dequeuing a signal, dequeue from both sets

- when checking for a pending signal (eg: sigkill_pending()),
  check the OR of both sets 
  
This would be intrusive since we need to replace reads/writes of 
'current->pending.signal' and 'current->signal.shared_pending' with
wrappers.
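
To make the bookkeeping concrete, here is a small user-space model of
the two-set idea. It is a sketch only, not the prototype mentioned
below: the sigset is reduced to a 64-bit mask, the list of queued
siginfo entries is omitted, and all names (sigpending_model,
queue_signal_model() and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sigset_model;          /* bit (sig - 1) set => sig is pending */

struct sigpending_model {
	sigset_model signal;            /* sent from the same or a descendant ns */
	sigset_model ancestor_signal;   /* sent from an ancestor ns */
};

enum sig_origin { ORIGIN_NONE, ORIGIN_DESCENDANT, ORIGIN_ANCESTOR };

static sigset_model sigbit(int sig)
{
	assert(sig >= 1 && sig <= 64);
	return (sigset_model)1 << (sig - 1);
}

/* Record the signal in the set that matches the sender's namespace. */
void queue_signal_model(struct sigpending_model *p, int sig, int from_ancestor)
{
	if (from_ancestor)
		p->ancestor_signal |= sigbit(sig);
	else
		p->signal |= sigbit(sig);
}

/* Pending checks look at the OR of both sets. */
int signal_pending_model(const struct sigpending_model *p, int sig)
{
	return ((p->signal | p->ancestor_signal) & sigbit(sig)) != 0;
}

/* Dequeue from both sets; if both were set, the ancestor has priority. */
enum sig_origin dequeue_signal_model(struct sigpending_model *p, int sig)
{
	sigset_model bit = sigbit(sig);
	enum sig_origin origin = ORIGIN_NONE;

	if (p->ancestor_signal & bit)
		origin = ORIGIN_ANCESTOR;
	else if (p->signal & bit)
		origin = ORIGIN_DESCENDANT;

	p->signal &= ~bit;
	p->ancestor_signal &= ~bit;
	return origin;
}
```

The cinit-specific decision (drop a SIG_DFL signal from a descendant,
honour it from an ancestor) would then hang off the returned origin.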

It may be a bit more expensive at runtime and adds a new 'sigset_t' to
task_struct/signal_struct.

I can send a small prototype if this makes sense at all.

Other approaches to try ?


Thanks,

Sukadev


[Devel] Re: ptem01 LTP failure in ttydev-0909

2008-11-01 Thread sukadev

Sorry, this was buried in my inbox...

Subrata Modak [EMAIL PROTECTED] wrote:
| Hi Sukadev,
| 
| On Thu, 2008-09-18 at 21:14 -0700, [EMAIL PROTECTED] wrote:
| > Alan Cox [EMAIL PROTECTED] wrote:
| > | > The test changes the window size using the slave-fd and expects that
| > | > it won't affect the window-size on master-fd. With this change, we
| > | > return the slave's window size and test fails.
| > | 
| > | I've no idea why anyone would have thought the existing behaviour was
| > | correct. The pty/tty pair code tries to share the size and other
| > | information at all times and the old test was I think verifying a bug
| > | existed.
| > | 
| > | Unless anyone can cite anything to show otherwise anyway ?
| > 
| > Subrata 
| > 
| > We are referring to the last window size check in test2() of
| > testcases/kernel/pty/ptem01.c. This check will cause the test
| > to fail when some of the planned ttydev changes are merged.
| > 
| > Would you happen to know if the check is really required or if
| > it should be dropped ?
| 
|  I would want the test to remain there, but introduce some checkings
| before running the test. As test2() is valid under present
| circumstances, we should retain it as people will keep using LTP on
| lower kernels.

Just to be clear, the entire test2() is not broken, only the last part
(see patch below). Other parts of test2() should be fine even with
the new changes.

| 
| Having said that, i would like to come with a solution where test2() of
| testcases/kernel/pty/ptem01.c is not run after the planned ttydev
| changes are merged. Something compile/run time checking to either not to
| build that part of code and run it. Can we do something like that by
| checking some glibc/kernel exported definitions ?

Other than the kernel version when the changes are merged, I am not sure
there is a way. Besides, it is not clear which assertion that part of
test2() is testing and if it is even needed for older kernels.

Here is the part of test2() I am referring to:

---
 testcases/kernel/pty/ptem01.c |   12 
 1 file changed, 12 deletions(-)

Index: ltp-full-20071031/testcases/kernel/pty/ptem01.c
===
--- ltp-full-20071031.orig/testcases/kernel/pty/ptem01.c 2008-11-01 13:30:42.977954127 -0700
+++ ltp-full-20071031/testcases/kernel/pty/ptem01.c 2008-11-01 13:31:41.439427078 -0700
@@ -238,18 +238,6 @@ test2(void)
tst_exit();
}

-   if (ioctl(masterfd, TIOCGWINSZ, &wsz) != 0) {
-   tst_resm(TFAIL,"TIOCGWINSZ");
-   tst_exit();
-   }
-
-   if (wsz.ws_row == wsz2.ws_row || wsz.ws_col == wsz2.ws_col ||
-   wsz.ws_xpixel == wsz2.ws_xpixel ||
-   wsz.ws_ypixel == wsz2.ws_ypixel) {
-   tst_resm(TFAIL, "unexpected window size returned");
-   tst_exit();
-   }
-
if (close(slavefd) != 0) {
tst_resm(TBROK,"close");
tst_exit();


[Devel] Re: Signals to cinit

2008-11-10 Thread sukadev
Oleg Nesterov [EMAIL PROTECTED] wrote:
| (lkml cced because containers list's archive is not useable)

Hmm. what do you mean by not usable ? I see your email here:
https://lists.linux-foundation.org/pipermail/containers/2008-November/014152.html

| 
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, [EMAIL PROTECTED] wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) anyway.
| >
| > Now, let's forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use
| > "SI_FROMUSER(info) && info->si_pid == 0".

Yes, afaics, we just need to pass one extra bit of information per signal
(whether or not sender is in ancestor-ns) from sender to receiver.

| > Or something. yes, sys_rt_sigqueueinfo() is problematic...

Yes, if user-space sets si_pid to 0.

Can we change sys_rt_sigqueueinfo() to:

if (!info->si_pid)
info->si_pid = getpid();

or would that change semantics adversely ? How about putting this
under CONFIG_PID_NS or your CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)

| >
| > Now, copy_process(CLONE_NEWPID) sets child->signal |= SIGNAL_UNKILLABLE,
| > this protects cinit from unwanted signals. Then we change get_signal_to_deliver()
| >
| > -   if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
| > +   if (unlikely(signal->flags & SIGNAL_UNKILLABLE) && !siginfo_from_parent_ns(info)
| >
| > and now we can kill cinit from parent ns. This needs more checks if we want
| > to stop/strace it, but perhaps this is enough for the start. Note that we
| > do not need to change complete_signal(), at least for now, the code under
| > "if (sig_fatal(p, sig)" is just optimization.
| >
| >
| > So, afaics, the only real problem is how we can handle the case when
| > __sigqueue_alloc() fails. I think for the start we can just return
| > -ENOMEM in this case (when from_parent_ns == T). Then we can improve
| > this behaviour. We can change complete_signal() to ensure that the
| > fatal signal from the upper ns always kills cinit, and in this case
| > we ignore the failed __sigqueue_alloc(). This way at least SIGKILL
| > always works.
| >
| > Yes, this is not perfect, and it is very possible I missed something
| > else. But simple.

I agree 
| 
| But how can send_signal() know that the signal comes from the upper ns?
| This is not trivial, we can't blindly use current to check. The signal
| can be sent from irq/workqueue/etc.

You mean the in_interrupt() check we had in earlier patchset would
not be enough ?

| 
| Perhaps we can start with something like the patch below. Not that I like
| it very much though. We should really place this code under
| CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)

CONFIG_PID_NS ?


[Devel] Re: Signals to cinit

2008-11-10 Thread sukadev
Oleg Nesterov [EMAIL PROTECTED] wrote:
| (lkml cced because containers list's archive is not useable)
| 
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, [EMAIL PROTECTED] wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) anyway.

Yes, this was in both the patchsets we reviewed last year :-) I can send
this fix out independently.

| >
| > Now, let's forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use
| > "SI_FROMUSER(info) && info->si_pid == 0".
| > Or something. yes, sys_rt_sigqueueinfo() is problematic...

Also, what happens if a fatal signal is first received from a descendant 
and while that is still pending, the same signal is received from ancestor
ns ?  Won't the second one be ignored by legacy_queue() for the non-rt case ?

Of course, this is a new scenario, specific to containers, and we may be
able to define the policy without changing semantics.


[Devel] [RFC][PATCH 1/7] Factor out code to allocate pidmap page

2009-05-04 Thread sukadev
From: Sukadev Bhattiprolu 


Signed-off-by: Sukadev Bhattiprolu 
---
 kernel/pid.c |   43 ---
 1 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b2e5f78..c0aaebe 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,31 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
 }

+static int alloc_pidmap_page(struct pidmap *map)
+{
+   void *page;
+
+   if (likely(map->page))
+   return 0;
+
+   page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+   /*
+* Free the page if someone raced with us installing it:
+*/
+   spin_lock_irq(&pidmap_lock);
+   if (map->page)
+   kfree(page);
+   else
+   map->page = page;
+   spin_unlock_irq(&pidmap_lock);
+
+   if (unlikely(!map->page))
+   return -1;
+
+   return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,21 +159,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
-   if (unlikely(!map->page)) {
-   void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-   /*
-* Free the page if someone raced with us
-* installing it:
-*/
-   spin_lock_irq(&pidmap_lock);
-   if (map->page)
-   kfree(page);
-   else
-   map->page = page;
-   spin_unlock_irq(&pidmap_lock);
-   if (unlikely(!map->page))
-   break;
-   }
+   if (alloc_pidmap_page(map))
+   break;
+
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
-- 
1.5.2.5


[Devel] [RFC][PATCH 5/7] Add target_pids parameter to copy_process()

2009-05-04 Thread sukadev
From: Sukadev Bhattiprolu 

The new parameter will be used in a follow-on patch when clone_with_pids()
is implemented.

Signed-off-by: Sukadev Bhattiprolu 
---
 kernel/fork.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d2d69d3..373411e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
+   pid_t *target_pids,
int trace)
 {
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
-   pid_t *target_pids = NULL;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
struct pt_regs regs;

task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-   &init_struct_pid, 0);
+   &init_struct_pid, NULL, 0);
if (!IS_ERR(task))
init_idle(task, cpu);

@@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
+   pid_t *target_pids = NULL;

/*
 * Do some preliminary argument and permissions checking before we
@@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags,
trace = tracehook_prepare_clone(clone_flags);

p = copy_process(clone_flags, stack_start, regs, stack_size,
-child_tidptr, NULL, trace);
+child_tidptr, NULL, target_pids, trace);
/*
 * Do this prior waking up the new thread - the thread pointer
 * might get invalid after that point, if the thread exits quickly.
-- 
1.5.2.5


[Devel] [RFC][PATCH 3/7] Add target_pid parameter to alloc_pidmap()

2009-05-04 Thread sukadev
From: Sukadev Bhattiprolu 


Signed-off-by: Sukadev Bhattiprolu 
---
 kernel/pid.c |   28 ++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index fd72ad9..93406c6 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,12 +147,36 @@ static int alloc_pidmap_page(struct pidmap *map)
return 0;
 }

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+   int offset;
+   struct pidmap *map;
+
+   if (pid >= pid_max)
+   return -EINVAL;
+
+   offset = pid & BITS_PER_PAGE_MASK;
+   map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+   if (alloc_pidmap_page(map))
+   return -ENOMEM;
+
+   if (test_and_set_bit(offset, map->page))
+   return -EBUSY;
+
+   atomic_dec(&map->nr_free);
+   return pid;
+}
+
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
int i, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;
int rc = -EAGAIN;

+   if (target_pid)
+   return set_pidmap(pid_ns, target_pid);
+
pid = last + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
@@ -269,7 +293,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)

tmp = ns;
for (i = ns->level; i >= 0; i--) {
-   nr = alloc_pidmap(tmp);
+   nr = alloc_pidmap(tmp, 0);
if (nr < 0)
goto out_free;

-- 
1.5.2.5


[Devel] [RFC][PATCH 4/7] Add target_pids parameter to alloc_pid()

2009-05-04 Thread sukadev
From: Sukadev Bhattiprolu 

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu 
---
 include/linux/pid.h |2 +-
 kernel/fork.c   |3 ++-
 kernel/pid.c|9 +++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);

-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);

 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index f8411a8..d2d69d3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
+   pid_t *target_pids = NULL;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;

if (pid != &init_struct_pid) {
-   pid = alloc_pid(p->nsproxy->pid_ns);
+   pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 93406c6..4b2373a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -279,13 +279,14 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
 }

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+   int tpid;

pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
@@ -293,7 +294,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)

tmp = ns;
for (i = ns->level; i >= 0; i--) {
-   nr = alloc_pidmap(tmp, 0);
+   tpid = 0;
+   if (target_pids)
+   tpid = target_pids[i];
+
+   nr = alloc_pidmap(tmp, tpid);
if (nr < 0)
goto out_free;

-- 
1.5.2.5


[Devel] [RFC][PATCH 6/7] Define do_fork_with_pids()

2009-05-04 Thread sukadev
From: Sukadev Bhattiprolu 

do_fork_with_pids() is the same as do_fork(), except that it takes an
additional parameter, target_pids. This parameter, currently unused,
specifies the target pids of the process in each of its pid namespaces.

Signed-off-by: Sukadev Bhattiprolu 
---
 include/linux/sched.h |1 +
 kernel/fork.c |   17 ++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..2173df1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1995,6 +1995,7 @@ extern int disallow_signal(int);

 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t *target_pids);
 struct task_struct *fork_idle(int);

 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index 373411e..912d008 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1340,17 +1340,17 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
  unsigned long stack_start,
  struct pt_regs *regs,
  unsigned long stack_size,
  int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr,
+ pid_t *target_pids)
 {
struct task_struct *p;
int trace = 0;
long nr;
-   pid_t *target_pids = NULL;

/*
 * Do some preliminary argument and permissions checking before we
@@ -1448,6 +1448,17 @@ long do_fork(unsigned long clone_flags,
return nr;
 }

+long do_fork(unsigned long clone_flags,
+ unsigned long stack_start,
+ struct pt_regs *regs,
+ unsigned long stack_size,
+ int __user *parent_tidptr,
+ int __user *child_tidptr)
+{
+   return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+   parent_tidptr, child_tidptr, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.5.2.5


[Devel] [RFC][PATCH 7/7] Define clone_with_pids syscall

2009-05-04 Thread sukadev
From: Sukadev Bhattiprolu 

clone_with_pids() is the same as clone(), except that it takes a 'target_pid_set'
parameter which lets the caller choose a specific pid number for the child process
in each of the child process's pid namespaces. This system call would be needed
to implement Checkpoint/Restart (i.e., after a checkpoint, restart a process with
its original pids).

Call clone_with_pids as follows:

pid_t pids[] = { 0, 77, 99 };
struct target_pid_set pid_set;

pid_set.num_pids = sizeof(pids) / sizeof(pids[0]);
pid_set.target_pids = pids;

syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign the next available pid to the process in init_pid_ns. But the kernel will
assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If
either 77 or 99 is taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.
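
The semantics described above can be tried out with a small user-space
model. This is a sketch under stated assumptions, not the kernel
implementation: the bitmap is replaced by a byte array, pid_max is tiny,
and the names (ns_pidmap, model_alloc_pidmap(), model_alloc_pid()) are
invented. In particular, the model assumes that target_pids[i] applies to
namespace level i and that levels beyond num_pids get a kernel-chosen pid.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PID_MAX    128
#define MODEL_MAX_LEVELS 4

/* Hypothetical per-namespace pid map: one byte per pid instead of a bit. */
struct ns_pidmap {
	unsigned char used[MODEL_PID_MAX];
	int last_pid;
};

/* target_pid != 0: claim exactly that pid; otherwise take the next free one. */
int model_alloc_pidmap(struct ns_pidmap *map, int target_pid)
{
	int pid;

	if (target_pid) {
		/* The negative check exists only to keep the model's array access safe. */
		if (target_pid < 0 || target_pid >= MODEL_PID_MAX)
			return -EINVAL;
		if (map->used[target_pid])
			return -EBUSY;
		map->used[target_pid] = 1;
		return target_pid;
	}

	for (pid = map->last_pid + 1; pid < MODEL_PID_MAX; pid++) {
		if (!map->used[pid]) {
			map->used[pid] = 1;
			map->last_pid = pid;
			return pid;
		}
	}
	return -EAGAIN;
}

/* Allocate one pid per level, from the deepest namespace up to level 0. */
int model_alloc_pid(struct ns_pidmap *maps, int level,
		    const int *target_pids, int num_pids, int *nrs)
{
	int i, nr;

	assert(level >= 0 && level < MODEL_MAX_LEVELS);

	/* More targets than namespace levels is rejected up front. */
	if (num_pids > level + 1)
		return -EINVAL;

	for (i = level; i >= 0; i--) {
		int tpid = (target_pids && i < num_pids) ? target_pids[i] : 0;

		nr = model_alloc_pidmap(&maps[i], tpid);
		if (nr < 0)
			goto out_free;
		nrs[i] = nr;
	}
	return 0;

out_free:
	/* Release the pids already taken in the deeper levels. */
	while (++i <= level)
		maps[i].used[nrs[i]] = 0;
	return nr;
}
```

With pids = { 0, 77, 99 } and two nested namespaces, the first call
succeeds and a second identical call returns -EBUSY, matching the
behaviour described for the system call.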

It's mostly an exploratory patch seeking feedback on the interface.

NOTE:
Compared to clone(), clone_with_pids() needs to pass in two more
pieces of information:

- number of pids in the set
- user buffer containing the list of pids.

But since clone() already takes 5 parameters, use a 'struct
target_pid_set'.

TODO:
- Gently tested.
- May need additional sanity checks in check_target_pids()
- Allow CLONE_NEWPID with clone_with_pids() (ensure target-pid in
  the namespace is either 1 or 0).

Signed-off-by: Sukadev Bhattiprolu 
---
 arch/x86/include/asm/syscalls.h|1 +
 arch/x86/include/asm/unistd_32.h   |1 +
 arch/x86/kernel/entry_32.S |1 +
 arch/x86/kernel/process_32.c   |   91 
 arch/x86/kernel/syscall_table_32.S |1 +
 include/linux/types.h  |5 ++
 6 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 7043408..1fdc149 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *);
 /* kernel/process_32.c */
 int sys_fork(struct pt_regs *);
 int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
 int sys_vfork(struct pt_regs *);
 int sys_execve(struct pt_regs *);

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6e72d74..90f906f 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1 332
 #define __NR_preadv333
 #define __NR_pwritev   334
+#define __NR_clone_with_pids   335

 #ifdef __KERNEL__

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c929add..ee92b0d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -707,6 +707,7 @@ ptregs_##name: \
 PTREGSCALL(iopl)
 PTREGSCALL(fork)
 PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
 PTREGSCALL(vfork)
 PTREGSCALL(execve)
 PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 76f8f84..66ac6f7 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,97 @@ int sys_clone(struct pt_regs *regs)
return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }

+static int check_target_pids(unsigned long clone_flags,
+   struct target_pid_set *pid_setp)
+{
+   /*
+* CLONE_NEWPID implies pid == 1
+*
+* TODO: Maybe this should be more fine-grained (i.e would we want
+*   to have a container-init have a specific pid in ancestor
+*   namespaces ?)
+*/
+   if (clone_flags & CLONE_NEWPID)
+   return -EINVAL;
+
+   /* number of pids must not exceed current nesting level of pid ns */
+   if (pid_setp->num_pids > task_pid(current)->level + 1)
+   return -EINVAL;
+
+   /* TODO: More sanity checks ?  */
+
+   return 0;
+}
+
+static pid_t *copy_target_pids(unsigned long clone_flags, void __user *upid_setp)
+{
+   int rc;
+   int size;
+   unsigned long clone_flags;
+   pid_t __user *utarget_pids;
+   pid_t *target_pids;
+   struct target_pid_set pid_set;
+
+   if (copy_from_user(pid_setp, upid_setp, sizeof(*pid_setp)))
+   return ERR_PTR(-EFAULT);
+
+   size = pid_setp->num_pids * sizeof(pid_t);
+   utarget_pids = pid_setp->target_pids;
+
+   target_pids = kzalloc(size, GFP_KERNEL);
+   if (!target_pids)
+   return ERR_PTR(-ENOMEM);
+
+   rc = -EFAU

[Devel] [RFC][PATCH 2/7] Have alloc_pidmap() return actual error code

2009-05-04 Thread sukadev
From: Sukadev Bhattiprolu 

alloc_pidmap() can fail either because all pid numbers are in use or
we can't allocate memory. With support for setting a specific pid
number, alloc_pidmap() would also fail if the given pid number is either
invalid or in use.

Rather than have the caller assume -ENOMEM, have alloc_pidmap() return
the actual error.
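
The error-pointer convention this change relies on can be sketched in user
space as follows. ERR_PTR(), PTR_ERR() and IS_ERR() here are stand-ins
written for this example that mimic the kernel helpers of the same names;
struct pid_model, alloc_nr_model() and alloc_pid_model() are hypothetical
and only illustrate how distinct error codes reach the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* User-space stand-ins for the kernel helpers of the same names. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errnos. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct pid_model {
	int nr;
};

/* Hypothetical number allocator: a pid number or a negative errno. */
int alloc_nr_model(int next_free, int pid_max, int simulate_oom)
{
	if (simulate_oom)
		return -ENOMEM;  /* could not allocate a bitmap page */
	if (next_free >= pid_max)
		return -EAGAIN;  /* every pid number is in use */
	return next_free;
}

/* Propagates the allocator's error instead of collapsing it to NULL. */
struct pid_model *alloc_pid_model(int next_free, int pid_max, int simulate_oom)
{
	struct pid_model *pid;
	int nr = alloc_nr_model(next_free, pid_max, simulate_oom);

	if (nr < 0)
		return ERR_PTR(nr);

	pid = malloc(sizeof(*pid));
	if (!pid)
		return ERR_PTR(-ENOMEM);
	pid->nr = nr;
	return pid;
}
```

A caller then tests IS_ERR() and forwards PTR_ERR() as its return value,
so "out of pids" and "out of memory" stay distinguishable.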

Signed-off-by: Sukadev Bhattiprolu 
---
 kernel/fork.c |5 +++--
 kernel/pid.c  |9 ++---
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index b9e2edd..f8411a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;

if (pid != &init_struct_pid) {
-   retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns);
-   if (!pid)
+   if (IS_ERR(pid)) {
+   retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+   }

if (clone_flags & CLONE_NEWPID) {
retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index c0aaebe..fd72ad9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -151,6 +151,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
int i, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;
+   int rc = -EAGAIN;

pid = last + 1;
if (pid >= pid_max)
@@ -159,8 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
-   if (alloc_pidmap_page(map))
+   if (alloc_pidmap_page(map)) {
+   rc = -ENOMEM;
break;
+   }

if (likely(atomic_read(&map->nr_free))) {
do {
@@ -192,7 +195,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
}
pid = mk_pid(pid_ns, map, offset);
}
-   return -1;
+   return rc;
 }

 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -297,7 +300,7 @@ out_free:
free_pidmap(pid->numbers + i);

kmem_cache_free(ns->pid_cachep, pid);
-   pid = NULL;
+   pid = ERR_PTR(nr);
goto out;
 }

-- 
1.5.2.5
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH 0/4] Devpts namespace

2008-02-05 Thread sukadev

Serge, Matt, please sign-off on these patches as you see fit.
---

Devpts namespace patchset


In continuation of the implementation of containers in mainline, we need to
support multiple PTY namespaces so that the PTY index (ie the tty names) in
one container is independent of the PTY indices of other containers.  For
instance this would allow each container to have a '/dev/pts/0' PTY and
refer to different terminals.

[PATCH 1/4]: Factor out PTY index allocation
[PATCH 2/4]: Use interface to access allocated_ptys
[PATCH 3/4]: Enable multiple mounts of /dev/pts
[PATCH 4/4]: Enable cloning PTY namespaces

Todo:

- This patchset depends on availability of additional clone flags !!!
- Needs more testing.

Changelog:

This patchset is based on earlier versions developed by Serge Hallyn
and Matt Helsley.



[Devel] [RFC][PATCH 1/4]: Factor out PTY index allocation

2008-02-05 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 1/4]: Factor out PTY index allocation

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index().  This localizes the external
data structures used in managing the pts indices.
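
The shape of the resulting interface can be sketched in user space. This is
a simplified, single-threaded model and not the patch itself: a fixed
bitmap replaces the idr, there is no lock where the kernel code takes
allocated_ptys_lock, and PTY_LIMIT, pts_new_index_model() and
pts_kill_index_model() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define PTY_LIMIT 8  /* hypothetical small limit for the model */

/* Private to this file: callers only see the two functions below. */
static bool allocated[PTY_LIMIT];

/* Return the lowest free index, or -EIO when the limit is reached. */
int pts_new_index_model(void)
{
	int i;

	for (i = 0; i < PTY_LIMIT; i++) {
		if (!allocated[i]) {
			allocated[i] = true;
			return i;
		}
	}
	return -EIO;
}

/* Make an index available for reallocation. */
void pts_kill_index_model(int idx)
{
	if (idx >= 0 && idx < PTY_LIMIT)
		allocated[idx] = false;
}
```

Because the bookkeeping is private to one file, the storage behind the two
calls can later change without touching the tty code that uses them.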

Changelog:
- Version 0: Based on earlier versions from Serge Hallyn and
  Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c |   40 ++--
 fs/devpts/inode.c |   42 +-
 include/linux/devpts_fs.h |4 
 3 files changed, 51 insertions(+), 35 deletions(-)

Index: linux-2.6.24/drivers/char/tty_io.c
===
--- linux-2.6.24.orig/drivers/char/tty_io.c 2008-01-24 14:58:37.0 -0800
+++ linux-2.6.24/drivers/char/tty_io.c  2008-02-05 17:17:11.0 -0800
@@ -90,7 +90,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -136,9 +135,6 @@ EXPORT_SYMBOL(tty_mutex);
 
 #ifdef CONFIG_UNIX98_PTYS
 extern struct tty_driver *ptm_driver;  /* Unix98 pty masters; for /dev/ptmx */
-extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
-static DECLARE_MUTEX(allocated_ptys_lock);
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
@@ -2568,15 +2564,9 @@ static void release_dev(struct file * fi
 */
release_tty(tty, idx);
 
-#ifdef CONFIG_UNIX98_PTYS
/* Make this pty number available for reallocation */
-   if (devpts) {
-   down(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
-   up(&allocated_ptys_lock);
-   }
-#endif
-
+   if (devpts)
+   devpts_kill_index(idx);
 }
 
 /**
@@ -2732,29 +2722,13 @@ static int ptmx_open(struct inode * inod
struct tty_struct *tty;
int retval;
int index;
-   int idr_ret;
 
nonseekable_open(inode, filp);
 
/* find a device that is not in use. */
-   down(&allocated_ptys_lock);
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
-   up(&allocated_ptys_lock);
-   return -ENOMEM;
-   }
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
-   if (idr_ret < 0) {
-   up(&allocated_ptys_lock);
-   if (idr_ret == -EAGAIN)
-   return -ENOMEM;
-   return -EIO;
-   }
-   if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
-   up(&allocated_ptys_lock);
-   return -EIO;
-   }
-   up(&allocated_ptys_lock);
+   index = devpts_new_index();
+   if (index < 0)
+   return index;
 
mutex_lock(&tty_mutex);
retval = init_dev(ptm_driver, index, &tty);
@@ -2781,9 +2755,7 @@ out1:
release_dev(filp);
return retval;
 out:
-   down(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, index);
-   up(&allocated_ptys_lock);
+   devpts_kill_index(index);
return retval;
 }
 #endif
Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c 2008-01-24 14:58:37.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c  2008-02-05 17:17:11.0 -0800
@@ -17,12 +17,17 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 
 #define DEVPTS_SUPER_MAGIC 0x1cd1
 
+extern int pty_limit;  /* Config limit on Unix98 ptys */
+static DEFINE_IDR(allocated_ptys);
+static DECLARE_MUTEX(allocated_ptys_lock);
+
 static struct vfsmount *devpts_mnt;
 static struct dentry *devpts_root;
 
@@ -156,9 +161,44 @@ static struct dentry *get_node(int num)
return lookup_one_len(s, root, sprintf(s, "%d", num));
 }
 
+int devpts_new_index(void)
+{
+   int index;
+   int idr_ret;
+
+retry:
+   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
+   return -ENOMEM;
+   }
+
+   down(&allocated_ptys_lock);
+   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
+   if (idr_ret < 0) {
+   up(&allocated_ptys_lock);
+   if (idr_ret == -EAGAIN)
+   goto retry;
+   return -EIO;
+   }
+
+   if (index >= pty_limit) {
+   idr_remove(&allocated_ptys, index);
+   up(&allocated_ptys_lock);
+   return -EIO;
+   }
+   up(&allocated_ptys_lock);
+   return index;
+}
+
+void devpts_kill_index(int idx)
+{
+   down(&allocated_ptys_lock);
+   idr_remove(&allocated_ptys, idx);
+   up(&allocated_ptys_lock);
+}
+
 int devpts_pty_new(struct tty_struct *tty)
 {
- 

[Devel] [RFC][PATCH 2/4]: Use interface to access allocated_ptys

2008-02-05 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 2/4]: Use interface to access allocated_ptys

In preparation for supporting multiple PTY namespaces, use an inline
function to access the 'allocated_ptys' idr.
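
Why the extra accessor helps can be shown with a small user-space model. It
is a sketch under assumptions: struct pts_ns_model, current_pts_ns_model
(standing in for the calling task's namespace) and model_new_index() are
hypothetical, and a bitmap replaces the idr. Only the accessor knows where
the index table lives, so pointing it at per-namespace storage makes the
indices independent without changing the allocation code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define PTY_LIMIT 8  /* hypothetical small limit for the model */

struct pts_ns_model {
	bool allocated[PTY_LIMIT];
};

struct pts_ns_model init_pts_ns_model;

/* Stands in for the namespace of the calling task. */
struct pts_ns_model *current_pts_ns_model = &init_pts_ns_model;

/* The only place that knows where the index table lives. */
static bool *current_allocated(void)
{
	return current_pts_ns_model->allocated;
}

/* Allocation code is written against the accessor, not a global. */
int model_new_index(void)
{
	bool *map = current_allocated();
	int i;

	for (i = 0; i < PTY_LIMIT; i++) {
		if (!map[i]) {
			map[i] = true;
			return i;
		}
	}
	return -EIO;
}
```

In this model, two namespaces each hand out index 0 first, which matches
the goal of letting every container have its own '/dev/pts/0'.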

Changelog:
- Version 0: Based on earlier versions from Serge Hallyn and
  Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c 2008-02-05 17:17:11.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c  2008-02-05 17:30:52.0 -0800
@@ -28,6 +28,11 @@ extern int pty_limit;/* Config limit o
 static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
 
+static inline struct idr *current_pts_ns_allocated_ptys(void)
+{
+   return &allocated_ptys;
+}
+
 static struct vfsmount *devpts_mnt;
 static struct dentry *devpts_root;
 
@@ -167,12 +172,12 @@ int devpts_new_index(void)
int idr_ret;
 
 retry:
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
+   if (!idr_pre_get(current_pts_ns_allocated_ptys(), GFP_KERNEL)) {
return -ENOMEM;
}
 
down(&allocated_ptys_lock);
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
+   idr_ret = idr_get_new(current_pts_ns_allocated_ptys(), NULL, &index);
if (idr_ret < 0) {
up(&allocated_ptys_lock);
if (idr_ret == -EAGAIN)
@@ -181,7 +186,7 @@ retry:
}
 
if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
+   idr_remove(current_pts_ns_allocated_ptys(), index);
up(&allocated_ptys_lock);
return -EIO;
}
@@ -192,7 +197,7 @@ retry:
 void devpts_kill_index(int idx)
 {
down(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
+   idr_remove(current_pts_ns_allocated_ptys(), idx);
up(&allocated_ptys_lock);
 }
 


[Devel] [RFC][PATCH 4/4]: Enable cloning PTY namespaces

2008-02-05 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 4/4]: Enable cloning PTY namespaces

Enable cloning PTY namespaces.

TODO:
This version temporarily uses the clone flag '0x80000000' which
is unused in mainline atm, but used for CLONE_IO in -mm. 
While we must extend clone() (urgently) to solve this, it hopefully
does not affect review of the rest of this patchset.

Changelog:
- Version 0: Based on earlier versions from Serge Hallyn and
  Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   84 +++---
 include/linux/devpts_fs.h |   52 
 include/linux/init_task.h |1 
 include/linux/nsproxy.h   |2 +
 include/linux/sched.h |2 +
 kernel/fork.c |2 -
 kernel/nsproxy.c  |   17 -
 7 files changed, 146 insertions(+), 14 deletions(-)

Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c 2008-02-05 19:16:39.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c  2008-02-05 20:27:41.0 -0800
@@ -25,18 +25,25 @@
 #define DEVPTS_SUPER_MAGIC 0x1cd1
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
+static struct file_system_type devpts_fs_type;
+
+struct pts_namespace init_pts_ns = {
+   .kref = {
+   .refcount = ATOMIC_INIT(2),
+   },
+   .allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys),
+   .mnt = NULL,
+};
 
 static inline struct idr *current_pts_ns_allocated_ptys(void)
 {
-   return &allocated_ptys;
+   return &current->nsproxy->pts_ns->allocated_ptys;
 }
 
-static struct vfsmount *devpts_mnt;
 static inline struct vfsmount *current_pts_ns_mnt(void)
 {
-   return devpts_mnt;
+   return current->nsproxy->pts_ns->mnt;
 }
 
 static struct {
@@ -59,6 +66,42 @@ static match_table_t tokens = {
{Opt_err, NULL}
 };
 
+struct pts_namespace *new_pts_ns(void)
+{
+   struct pts_namespace *ns;
+
+   ns = kmalloc(sizeof(*ns), GFP_KERNEL);
+   if (!ns)
+   return ERR_PTR(-ENOMEM);
+
+   ns->mnt = kern_mount_data(&devpts_fs_type, ns);
+   if (IS_ERR(ns->mnt)) {
+   kfree(ns);
+   return ERR_PTR(PTR_ERR(ns->mnt));
+   }
+
+   idr_init(&ns->allocated_ptys);
+   kref_init(&ns->kref);
+
+   return ns;
+}
+
+void free_pts_ns(struct kref *ns_kref)
+{
+   struct pts_namespace *ns;
+
+   ns = container_of(ns_kref, struct pts_namespace, kref);
+   BUG_ON(ns == &init_pts_ns);
+
+   mntput(ns->mnt);
+   /*
+* TODO:
+*  idr_remove_all(&ns->allocated_ptys); introduced in 2.6.23
+*/
+   idr_destroy(&ns->allocated_ptys);
+   kfree(ns);
+}
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
@@ -160,18 +203,27 @@ static int devpts_test_sb(struct super_b
 
 static int devpts_set_sb(struct super_block *sb, void *data)
 {
-   sb->s_fs_info = data;
+   struct pts_namespace *ns = data;
+
+   sb->s_fs_info = get_pts_ns(ns);
return set_anon_super(sb, NULL);
 }
 
 static int devpts_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
+   struct pts_namespace *ns;
struct super_block *sb;
int err;
 
+   /* hereafter we're very similar to proc_get_sb */
+   if (flags & MS_KERNMOUNT)
+   ns = data;
+   else
+   ns = current->nsproxy->pts_ns;
+
/* hereafter we're very simlar to get_sb_nodev */
-   sb = sget(fs_type, devpts_test_sb, devpts_set_sb, data);
+   sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns);
if (IS_ERR(sb))
return PTR_ERR(sb);
 
@@ -187,16 +239,25 @@ static int devpts_get_sb(struct file_sys
}
 
sb->s_flags |= MS_ACTIVE;
-   devpts_mnt = mnt;
+   ns->mnt = mnt;
 
return simple_set_mnt(mnt, sb);
 }
 
+static void devpts_kill_sb(struct super_block *sb)
+{
+   struct pts_namespace *ns;
+
+   ns = sb->s_fs_info;
+   kill_anon_super(sb);
+   put_pts_ns(ns);
+}
+
 static struct file_system_type devpts_fs_type = {
.owner  = THIS_MODULE,
.name   = "devpts",
.get_sb = devpts_get_sb,
-   .kill_sb= kill_anon_super,
+   .kill_sb= devpts_kill_sb,
 };
 
 /*
@@ -352,18 +413,19 @@ static int __init init_devpts_fs(void)
if (err)
return err;
 
-   mnt = kern_mount_data(&devpts_fs_type, NULL);
+   mnt = kern_mount_data(&devpts_fs_type, &init

[Devel] [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts

2008-02-05 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, we should allow multiple mounts of
/dev/pts, one within each PTY namespace.

This patch removes the get_sb_single() call in devpts_get_sb() and uses the
test and set sb interfaces to allow remounting /dev/pts.  The patch also
removes the global 'devpts_root' and uses current_pts_ns_mnt() to access
'devpts_mnt'.
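
The find-or-create behaviour that the test and set callbacks provide can be
sketched in user space. This is a toy model, not the VFS code: struct
sb_model, the 'supers' list and sget_model() are hypothetical, and locking
and reference counting are left out. The callbacks mirror devpts_test_sb()
and devpts_set_sb() in that they compare and store an opaque per-namespace
pointer.

```c
#include <assert.h>
#include <stdlib.h>

struct sb_model {
	void *s_fs_info;        /* opaque key: the owning namespace */
	struct sb_model *next;
};

static struct sb_model *supers; /* list of existing superblocks */

int test_sb_model(struct sb_model *sb, void *data)
{
	return sb->s_fs_info == data;
}

int set_sb_model(struct sb_model *sb, void *data)
{
	sb->s_fs_info = data;
	return 0;
}

/* Return the superblock matching 'data', creating one if none exists. */
struct sb_model *sget_model(int (*test)(struct sb_model *, void *),
			    int (*set)(struct sb_model *, void *),
			    void *data)
{
	struct sb_model *sb;

	for (sb = supers; sb; sb = sb->next)
		if (test(sb, data))
			return sb;

	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;
	if (set(sb, data)) {
		free(sb);
		return NULL;
	}
	sb->next = supers;
	supers = sb;
	return sb;
}
```

Mounting twice with the same key yields the same superblock, while a
different key yields a fresh one.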

Changelog:
- Version 0: Based on earlier versions from Serge Hallyn and
  Matt Helsley.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |  120 +-
 1 file changed, 101 insertions(+), 19 deletions(-)

Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c 2008-02-05 17:30:52.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c  2008-02-05 19:16:39.0 -0800
@@ -34,7 +34,10 @@ static inline struct idr *current_pts_ns
 }
 
 static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
+static inline struct vfsmount *current_pts_ns_mnt(void)
+{
+   return devpts_mnt;
+}
 
 static struct {
int setuid;
@@ -130,7 +133,7 @@ devpts_fill_super(struct super_block *s,
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
-   devpts_root = s->s_root = d_alloc_root(inode);
+   s->s_root = d_alloc_root(inode);
if (s->s_root)
return 0;

@@ -140,10 +143,53 @@ fail:
return -ENOMEM;
 }
 
+/*
+ * We use test and set super-block operations to help determine whether we
+ * need a new super-block for this namespace. get_sb() walks the list of
+ * existing devpts supers, comparing them with the @data ptr. Since we
+ * passed 'current's namespace as the @data pointer we can compare the
+ * namespace pointer in the super-block's 's_fs_info'.  If the test is
+ * TRUE then get_sb() returns a new active reference to the super block.
+ * Otherwise, it helps us build an active reference to a new one.
+ */
+
+static int devpts_test_sb(struct super_block *sb, void *data)
+{
+   return sb->s_fs_info == data;
+}
+
+static int devpts_set_sb(struct super_block *sb, void *data)
+{
+   sb->s_fs_info = data;
+   return set_anon_super(sb, NULL);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-   return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+   struct super_block *sb;
+   int err;
+
+   /* hereafter we're very simlar to get_sb_nodev */
+   sb = sget(fs_type, devpts_test_sb, devpts_set_sb, data);
+   if (IS_ERR(sb))
+   return PTR_ERR(sb);
+
+   if (sb->s_root)
+   return simple_set_mnt(mnt, sb);
+
+   sb->s_flags = flags;
+   err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+   if (err) {
+   up_write(&sb->s_umount);
+   deactivate_super(sb);
+   return err;
+   }
+
+   sb->s_flags |= MS_ACTIVE;
+   devpts_mnt = mnt;
+
+   return simple_set_mnt(mnt, sb);
 }
 
 static struct file_system_type devpts_fs_type = {
@@ -158,10 +204,9 @@ static struct file_system_type devpts_fs
  * to the System V naming convention
  */
 
-static struct dentry *get_node(int num)
+static struct dentry *get_node(struct dentry *root, int num)
 {
char s[12];
-   struct dentry *root = devpts_root;
mutex_lock(&root->d_inode->i_mutex);
return lookup_one_len(s, root, sprintf(s, "%d", num));
 }
@@ -207,12 +252,28 @@ int devpts_pty_new(struct tty_struct *tt
struct tty_driver *driver = tty->driver;
dev_t device = MKDEV(driver->major, driver->minor_start+number);
struct dentry *dentry;
-   struct inode *inode = new_inode(devpts_mnt->mnt_sb);
+   struct dentry *root;
+   struct vfsmount *mnt;
+   struct inode *inode;
+
 
/* We're supposed to be given the slave end of a pty */
BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY);
BUG_ON(driver->subtype != PTY_TYPE_SLAVE);
 
+   mnt = current_pts_ns_mnt();
+   if (!mnt)
+   return -ENOSYS;
+   root = mnt->mnt_root;
+
+   mutex_lock(&root->d_inode->i_mutex);
+   inode = idr_find(current_pts_ns_allocated_ptys(), number);
+   mutex_unlock(&root->d_inode->i_mutex);
+
+   if (inode && !IS_ERR(inode))
+   return -EEXIST;
+
+   inode = new_inode(mnt->mnt_sb);
if (!inode)
return -ENOMEM;
 
@@ -222,23 +283,31 @@ int devpts_pty_new(struct tty_struct *tt
inode->i_mtime = inode->i_atime = inode->i_ctime = CU

[Devel] Re: [PATCH 4/4] The control group itself

2008-02-11 Thread sukadev
This patchset does fix the problem I was having before with null and
zero devices. Overall, it looks pretty good.

I am still reviewing the patches.  Just some nits I came across:


Pavel Emelianov [EMAIL PROTECTED] wrote:
| Each new group will have its own maps for char and block
| layers. The devices access list is tuned via the 
| devices.permissions file. One may read from the file to get 
| the configured state.
| 
| The top container isn't initialized, so that the char 
| and block layers will use the global maps to lookup 
| their devices. I did that not to export the static maps
| to the outer world.
| 
| Good news is that this patch now contains more comments 
| and Documentation file :)
| 
| Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>
| 
| ---
| 
| diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
| new file mode 100644
| index 000..dbd0c7a
| --- /dev/null
| +++ b/Documentation/controllers/devices.txt
| @@ -0,0 +1,61 @@
| +
| + Devices visibility controller
| +
| +This controller allows to tune the devices accessibility by tasks,
| +i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
| +access to IDE devices and completely hide SCSI disks.
| +
| +Tasks still can call mknod to create device files, regardless of
| +whether the particular device is visible or accessible, but they
| +may not be able to open it later.
| +
| +This one hides under CONFIG_CGROUP_DEVS option.
| +
| +
| +Configuring
| +
| +The controller provides a single file to configure itself -- the
| +devices.permissions one. To change the accessibility level for some
| +device write the following string into it:
| +
| +[cb] :(|*) [r-][w-]
| + ^  ^   ^
| + |  |   |
| + |  |   +--- access rights (1)
| + |  |
| + |  +-- device major and minor numbers (2)
| + |
| + +-- device type (character / block)
| +
| +1) The access rights set to '--' remove the device from the group's
| +access list, so that it will not even be shown in this file later.
| +
| +2) Setting the minor to '*' grants access to all the minors for
| +particular major.
| +
| +When reading from it, one may see something like
| +
| + c 1:5 rw
| + b 8:* r-
| +
| +Security issues, concerning who may grant access to what are governed
| +at the cgroup infrastructure level.
| +
| +
| +Examples:
| +
| +1. Grand full access to /dev/null

Grant.

| + # echo c 1:3 rw > /cgroups//devices.permissions
| +
| +2. Grant the read-only access to /dev/sda and partitions
| + # echo b 8:* r- > ...

This grants access to all SCSI disks, sda..sdp, and not just 'sda', right?
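
To make the question concrete, here is a small user-space sketch of how a
wildcard-minor rule would match. It assumes a rule is keyed on (type,
major, minor); struct dev_rule, MINOR_ANY and rule_matches() are
hypothetical names and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MINOR_ANY (-1)  /* hypothetical encoding of the '*' minor */

struct dev_rule {
	char type;   /* 'c' or 'b' */
	int major;
	int minor;   /* a minor number, or MINOR_ANY */
	bool read;
	bool write;
};

/* True when the rule applies to the given device. */
bool rule_matches(const struct dev_rule *r, char type, int major, int minor)
{
	return r->type == type &&
	       r->major == major &&
	       (r->minor == MINOR_ANY || r->minor == minor);
}
```

With the rule "b 8:* r-", block devices 8:0 (sda), 8:1 (sda1), 8:16 (sdb)
and 8:240 (sdp) all match, while 65:0 (sdq) does not.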

| +
| +3. Change the /dev/null access to write-only
| + # echo c 1:3 -w > ...
| +
| +4. Revoke access to /dev/sda
| + # echo b 8:* -- > ...
| +
| +
| + Written by Pavel Emelyanov <[EMAIL PROTECTED]>
| +
| diff --git a/fs/Makefile b/fs/Makefile
| index 7996220..5ad03be 100644
| --- a/fs/Makefile
| +++ b/fs/Makefile
| @@ -64,6 +64,8 @@ obj-y   += devpts/
| 
|  obj-$(CONFIG_PROFILING)  += dcookies.o
|  obj-$(CONFIG_DLM)+= dlm/
| +
| +obj-$(CONFIG_CGROUP_DEVS)+= devscontrol.o
|   
|  # Do not add any filesystems before this line
|  obj-$(CONFIG_REISERFS_FS)+= reiserfs/
| diff --git a/fs/devscontrol.c b/fs/devscontrol.c
| new file mode 100644
| index 000..48c5f69
| --- /dev/null
| +++ b/fs/devscontrol.c
| @@ -0,0 +1,314 @@
| +/*
| + * devscontrol.c - Device Controller
| + *
| + * Copyright 2007 OpenVZ SWsoft Inc
| + * Author: Pavel Emelyanov 
| + *
| + * This program is free software; you can redistribute it and/or modify
| + * it under the terms of the GNU General Public License as published by
| + * the Free Software Foundation; either version 2 of the License, or
| + * (at your option) any later version.
| + *
| + * This program is distributed in the hope that it will be useful,
| + * but WITHOUT ANY WARRANTY; without even the implied warranty of
| + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
| + * GNU General Public License for more details.
| + */
| +
| +#include 
| +#include 
| +#include 
| +#include 
| +#include 
| +#include 
| +#include 
| +
| +struct devs_cgroup {
| + /*
| +  * The subsys state to build into cgrous infrastructure
| +  */

... into cgroups

| + struct cgroup_subsys_state css;
| +
| + /*
| +  * The maps of character and block devices. They provide a
| +  * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
| +  * and block/genhd.c to find out how the ->open() callbacks
| +  * work when opening a device.
| +  *
| +  * Each group will have its onw maps, and at the open()

own maps

| +  * time code will lookup in this map to get the device
| +  * and permissions by its dev_t.
| +  */
| + struct kobj_map *cdev_map;
| + struct kobj_map *bdev_map;
| +};
| +
| +static inlin

Re: [Devel] [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts

2008-02-14 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| 
| > exploited in OpenVZ, so if we can somehow avoid forcing the NEWNS flag
| > that would be very very good :) See my next comment about this issue.
| > 
| > > Pavel, not long ago you said you were starting to look at tty and pty
| > > stuff - did you have any different ideas on devpts virtualization, or
| > > are you ok with this minus your comments thus far?
| > 
| > I have a similar idea of how to implement this, but I didn't thought
| > about the details. As far as this issue is concerned, I see no reasons
| > why we need a kern_mount-ed devtpsfs instance. If we don't make such,
| > we may safely hold the ptsns from the superblock and be happy. The
| > same seems applicable to the mqns, no?
| 
| But the current->nsproxy->devpts->mnt is used in several functions in
| patch 3.

Hmm, current_pts_ns_mnt() is used in:

devpts_pty_new()
devpts_get_tty()
devpts_pty_kill()

All of these return an error if current_pts_ns_mnt() returns NULL.
So, can we require user space to mount and unmount /dev/pts, and return
an error if any operation is attempted before the mount?

| 
| > The reason I have the kern_mount-ed instance of proc for pid namespaces
| > is that I need a vfsmount to flush task entries from, but allowing
| > it to be NULL (i.e. no kern_mount, but optional user mounts) means
| > handing all the possible races, which is too heavy. But do we actually
| > need the vfsmount for devpts and mqns if no user-space mounts exist?
| > 
| > Besides, I planned to include legacy ptys virtualization and console
| > virtualizatin in this namespace, but it seems, that it is not present
| > in this particular one.
| 
| I had been thinking the consoles would have their own ns, since there's
| really nothing linking them,  but there really is no good reason why
| userspace should ever want them separate.  So I'm fine with combining
| them.
| 
| > >>> +   sb->s_flags |= MS_ACTIVE;
| > >>> +   devpts_mnt = mnt;
| > >>> +
| > >>> +   return simple_set_mnt(mnt, sb);
| > >>>  }
| > >>>  
| > >>>  static struct file_system_type devpts_fs_type = {
| > >>> @@ -158,10 +204,9 @@ static struct file_system_type devpts_fs
| > >>>   * to the System V naming convention
| > >>>   */
| > >>>  
| > >>> -static struct dentry *get_node(int num)
| > >>> +static struct dentry *get_node(struct dentry *root, int num)
| > >>>  {
| > >>> char s[12];
| > >>> -   struct dentry *root = devpts_root;
| > >>> mutex_lock(&root->d_inode->i_mutex);
| > >>> return lookup_one_len(s, root, sprintf(s, "%d", num));
| > >>>  }
| > >>> @@ -207,12 +252,28 @@ int devpts_pty_new(struct tty_struct *tt
| > >>> struct tty_driver *driver = tty->driver;
| > >>> dev_t device = MKDEV(driver->major, driver->minor_start+number);
| > >>> struct dentry *dentry;
| > >>> -   struct inode *inode = new_inode(devpts_mnt->mnt_sb);
| > >>> +   struct dentry *root;
| > >>> +   struct vfsmount *mnt;
| > >>> +   struct inode *inode;
| > >>> +
| > >>>  
| > >>> /* We're supposed to be given the slave end of a pty */
| > >>> BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY);
| > >>> BUG_ON(driver->subtype != PTY_TYPE_SLAVE);
| > >>>  
| > >>> +   mnt = current_pts_ns_mnt();
| > >>> +   if (!mnt)
| > >>> +   return -ENOSYS;
| > >>> +   root = mnt->mnt_root;
| > >>> +
| > >>> +   mutex_lock(&root->d_inode->i_mutex);
| > >>> +   inode = idr_find(current_pts_ns_allocated_ptys(), number);
| > >>> +   mutex_unlock(&root->d_inode->i_mutex);
| > >>> +
| > >>> +   if (inode && !IS_ERR(inode))
| > >>> +   return -EEXIST;
| > >>> +
| > >>> +   inode = new_inode(mnt->mnt_sb);
| > >>> if (!inode)
| > >>> return -ENOMEM;
| > >>>  
| > >>> @@ -222,23 +283,31 @@ int devpts_pty_new(struct tty_struct *tt
| > >>> inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
| > >>> init_special_inode(inode, S_IFCHR|config.mode, device);
| > >>> inode->i_private = tty;
| > >>> +   idr_replace(current_pts_ns_allocated_ptys(), inode, number);
| > >>>  
| > >>> -   dentry = get_node(number);
| > >>> +   dentry = get_node(root, number);
| > >>> if (!IS_ERR(dentry) && !dentry->d_inode) {
| > >>> d_instantiate(dentry, inode);
| > >>> -   fsnotify_create(devpts_root->d_inode, dentry);
| > >>> +   fsnotify_create(root->d_inode, dentry);
| > >>> }
| > >>>  
| > >>> -   mutex_unlock(&devpts_root->d_inode->i_mutex);
| > >>> +   mutex_unlock(&root->d_inode->i_mutex);
| > >>>  
| > >>> return 0;
| > >>>  }
| > >>>  
| > >>>  struct tty_struct *devpts_get_tty(int number)
| > >>>  {
| > >>> -   struct dentry *dentry = get_node(number);
| > >>> +   struct vfsmount *mnt;
| > >>> +   struct dentry *dentry;
| > >>> struct tty_struct *tty;
| > >>>  
| > >>> +   

Re: [Devel] [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts

2008-02-14 Thread sukadev
Pavel Emelianov [EMAIL PROTECTED] wrote:
| Serge E. Hallyn wrote:
| > Quoting Pavel Emelyanov ([EMAIL PROTECTED]):
| >> [snip]
| >>
|  Mmm. I wanted to send one small objection to Cedric's patches with mqns,
|  but the thread was abandoned by the time I decided to do-it-right-now.
| 
|  So I can put it here: forcing the CLONE_NEWNS is not very good, since
|  this makes impossible to push a bind mount inside a new namespace, which
|  may operate in some chroot environment. But this ability is heavily
| >>> Which direction do you want to go?  I'm wondering whether mounts
| >>> propagation can address it.
| >> Hardly. AFAIS there's no way to let the chroot-ed tasks see parts of
| >> vfs tree, that left behind them after chroot, unless they are in the 
| >> same mntns as you, and you bind mount this parts to their tree. No?
| > 
| > Well no, but I suspect I'm just not understanding what you want to do.
| > But if the chroot is under /jail1, and you've done, say,
| > 
| > mkdir -p /share/pts
| > mkdir -p /jail1/share
| > mount --bind /share /share
| > mount --make-shared /share
| > mount --bind /share /jail1/share
| > mount --make-slave /jail1/share
| > 
| > before the chroot-ed tasks were cloned with CLONE_NEWNS, then when you
| > do
| > 
| > mount --bind /dev/pts /share/pts
| > 
| > from the parent mntns (not that I know why you'd want to do *that* :)
| > then the chroot'ed tasks will see the original mntns's /dev/pts under
| > /jail1/share.
| 
| I haven't yet tried that, but :( this function
| 
|   static inline int check_mnt(struct vfsmount *mnt)
|   {
|   return mnt->mnt_ns == current->nsproxy->mnt_ns;
|   }
| 
| and this code in do_loopback
| 
| if (!check_mnt(nd->mnt) || !check_mnt(old_nd.mnt))
| goto out;
| 
| makes me think that trying to bind a mount from another mntns
| ot _to_ another is prohibited... Do I miss something?
| 
| >>> Though really, I think you're right - we shouldn't break the kernel
| >>> doing CLONE_NEWMQ or CLONE_NEWPTS without CLONE_NEWNS, so we shouldn't
| >>> force the combination.
| >>>
|  exploited in OpenVZ, so if we can somehow avoid forcing the NEWNS flag
|  that would be very very good :) See my next comment about this issue.
| 
| > Pavel, not long ago you said you were starting to look at tty and pty
| > stuff - did you have any different ideas on devpts virtualization, or
| > are you ok with this minus your comments thus far?
|  I have a similar idea of how to implement this, but I didn't think
|  about the details. As far as this issue is concerned, I see no reason
|  why we need a kern_mount-ed devptsfs instance. If we don't make such,
|  we may safely hold the ptsns from the superblock and be happy. The
|  same seems applicable to the mqns, no?
| >>> But the current->nsproxy->devpts->mnt is used in several functions in
| >>> patch 3.
| >> Indeed. I overlooked this. Then we're in a deep ... problem here.
| >>
| >> Breaking this circle was not that easy with pid namespaces, so
| >> I put the strut in proc_flush_task - when the last task from the
| >> namespace exits the kern-mount-ed vfsmnt is dropped, but we can't
| >> do the same stuff with devpts.
| > 
| > But I still don't see what the problem is with my proposal?  So long as
| > you agree that if there are no tasks remaining in the devptsns,
| > then any task which has its devpts mounted should see an empty directory
| > (due to sb->s_info being NULL), I think it works.
| 
| Well, if we _can_ handle the races with the ns->devpts_mnt switch
| from not NULL to NULL, then I'm fine with this approach.
| 
| I just remembered that with pid namespaces this caused complicated
| locking and performance degradation. This is the problem I couldn't
| remember yesterday.

Well, iirc, one problem with pid namespaces was that we need to keep
the task and pid_namespace association until the task was waited on
(for instance the wait() call from parent needs the pid_t of the
child which is tied to the pid ns in struct upid).

For this reason, we don't drop the mnt reference in free_pid_ns() but
hold the reference till proc_flush_task().

With devpts, can't we simply drop the reference in free_pts_ns() so
that when the last task using the pts_ns exits, we can unmount and
release the mnt ?

IOW, do you suspect that the circular reference leads to leaking vfsmnts ?
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


Re: [Devel] [RFC][PATCH 3/4]: Enable multiple mounts of /dev/pts

2008-02-15 Thread sukadev
| > | locking and performance degradation. This is the problem I couldn't
| > | remember yesterday.
| > 
| > Well, iirc, one problem with pid namespaces was that we need to keep
| > the task and pid_namespace association until the task was waited on
| > (for instance the wait() call from parent needs the pid_t of the
| > child which is tied to the pid ns in struct upid).
| > 
| > For this reason, we don't drop the mnt reference in free_pid_ns() but
| > hold the reference till proc_flush_task().
| > 
| > With devpts, can't we simply drop the reference in free_pts_ns() so
| > that when the last task using the pts_ns exits, we can unmount and
| > release the mnt ?
| 
| I hope we can. The thing I'm worried about is whether we can correctly
| handle race with this pointer switch from NULL to not-NULL.
| 
| > IOW, do you suspect that the circular reference leads to leaking vfsmnts ?
| > 
| 
| Of course! If the namespace holds the vfsmnt, vfsmnt holds the superblock
| and the superblock holds the namespace we won't drop this chain ever,
| unless some other object breaks this chain.

Of course :-) I had a bug in new_pts_ns() that was masking the problem.
I had

ns->mnt  = kern_mount_data()...

...
kref_init(&ns->kref);

So the kref_init() would overwrite the reference taken by devpts_set_sb(),
which kept the vfsmnt leak from showing up in my test.
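
The ordering problem described above can be modelled in user space. This is
only a sketch: toy_kref, toy_ns and both constructors are hypothetical
stand-ins, not the kernel's kref API or the new_pts_ns() from the patch. It
assumes that mounting takes one reference on the namespace, as the message
describes.

```c
/* Hypothetical user-space stand-in for the kernel's struct kref. */
struct toy_kref {
    int refcount;
};

static void toy_kref_init(struct toy_kref *k) { k->refcount = 1; }
static void toy_kref_get(struct toy_kref *k)  { k->refcount++; }

struct toy_ns {
    struct toy_kref kref;
};

/*
 * Buggy order: the "mount" step takes a reference first, then the
 * initialization resets the counter and silently discards it.
 */
static int buggy_new_ns(struct toy_ns *ns)
{
    ns->kref.refcount = 0;
    toy_kref_get(&ns->kref);   /* reference taken while mounting */
    toy_kref_init(&ns->kref);  /* overwrites it: count is 1 again */
    return ns->kref.refcount;
}

/* Fixed order: initialize first, then let the mount take its reference. */
static int fixed_new_ns(struct toy_ns *ns)
{
    toy_kref_init(&ns->kref);  /* creator's reference */
    toy_kref_get(&ns->kref);   /* mount's reference */
    return ns->kref.refcount;
}
```

With the buggy order the count ends at 1 instead of 2, so the extra reference
that would have kept the namespace (and with it the vfsmnt) alive never shows
up, which is how such a bug can hide a circular reference.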

Thanks Pavel,

Sukadev


[Devel] [RFC] Remove kern_mount() in init_devpts_fs()

2008-02-25 Thread sukadev

Is the kern_mount() of devpts really needed or can we simply register
the filesystem type and wait for a user-space mount before being able
to create PTYs ?

This is just an RFC patch that removes the kern_mount() and the 'devpts_mnt'
and 'devpts_root' global variables and uses a 'devpts_sb' to store the single
super block associated with devpts.

Removing the kern_mount() and relying on user-space mount could simplify
cloning of PTS namespaces.
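
A minimal user-space sketch of the behaviour this RFC aims for: no super
block exists until user space mounts devpts, and PTY creation fails with
-ENOSYS until then. All toy_* names are hypothetical; only the NULL check
and the -ENOSYS return mirror the patch below.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct super_block. */
struct toy_super_block {
    int dummy;
};

/* Models the 'devpts_sb' global: NULL until user space mounts devpts. */
static struct toy_super_block *toy_devpts_sb;

static void toy_mount(struct toy_super_block *sb) { toy_devpts_sb = sb; }
static void toy_umount(void)                      { toy_devpts_sb = NULL; }

/* Models devpts_pty_new(): refuse to create a PTY node without a mount. */
static int toy_pty_new(int number)
{
    if (!toy_devpts_sb)
        return -ENOSYS;
    (void)number;  /* a real implementation would create the node here */
    return 0;
}
```

The sketch ignores locking: in the kernel, a concurrent unmount could clear
the pointer between the check and its use.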


---
 fs/devpts/inode.c |   49 +
 1 file changed, 29 insertions(+), 20 deletions(-)

Index: linux-2.6.24/fs/devpts/inode.c
===
--- linux-2.6.24.orig/fs/devpts/inode.c 2008-02-22 14:23:53.0 -0800
+++ linux-2.6.24/fs/devpts/inode.c  2008-02-25 16:00:17.0 -0800
@@ -23,9 +23,6 @@
 
 #define DEVPTS_SUPER_MAGIC 0x1cd1
 
-static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
-
 static struct {
int setuid;
int setgid;
@@ -97,6 +94,7 @@ static const struct super_operations dev
.remount_fs = devpts_remount,
 };
 
+static struct super_block *devpts_sb;
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -120,9 +118,11 @@ devpts_fill_super(struct super_block *s,
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
-   devpts_root = s->s_root = d_alloc_root(inode);
-   if (s->s_root)
+   s->s_root = d_alloc_root(inode);
+   if (s->s_root) {
+   devpts_sb = s;
return 0;
+   }

printk("devpts: get root dentry failed\n");
iput(inode);
@@ -136,11 +136,17 @@ static int devpts_get_sb(struct file_sys
return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
 }
 
+static void devpts_kill_sb(struct super_block *sb)
+{
+   devpts_sb = NULL;
+   kill_anon_super(sb);
+}
+
 static struct file_system_type devpts_fs_type = {
.owner  = THIS_MODULE,
.name   = "devpts",
.get_sb = devpts_get_sb,
-   .kill_sb= kill_anon_super,
+   .kill_sb= devpts_kill_sb,
 };
 
 /*
@@ -151,7 +157,12 @@ static struct file_system_type devpts_fs
 static struct dentry *get_node(int num)
 {
char s[12];
-   struct dentry *root = devpts_root;
+   struct dentry *root;
+
+   if (!devpts_sb)
+   return NULL;
+
+   root = devpts_sb->s_root;
mutex_lock(&root->d_inode->i_mutex);
return lookup_one_len(s, root, sprintf(s, "%d", num));
 }
@@ -162,7 +173,12 @@ int devpts_pty_new(struct tty_struct *tt
struct tty_driver *driver = tty->driver;
dev_t device = MKDEV(driver->major, driver->minor_start+number);
struct dentry *dentry;
-   struct inode *inode = new_inode(devpts_mnt->mnt_sb);
+   struct inode *inode;
+
+   if (!devpts_sb)
+   return -ENOSYS;
+
+   inode = new_inode(devpts_sb);
 
/* We're supposed to be given the slave end of a pty */
BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY);
@@ -181,10 +197,10 @@ int devpts_pty_new(struct tty_struct *tt
dentry = get_node(number);
if (!IS_ERR(dentry) && !dentry->d_inode) {
d_instantiate(dentry, inode);
-   fsnotify_create(devpts_root->d_inode, dentry);
+   fsnotify_create(devpts_sb->s_root->d_inode, dentry);
}
 
-   mutex_unlock(&devpts_root->d_inode->i_mutex);
+   mutex_unlock(&devpts_sb->s_root->d_inode->i_mutex);
 
return 0;
 }
@@ -201,7 +217,7 @@ struct tty_struct *devpts_get_tty(int nu
dput(dentry);
}
 
-   mutex_unlock(&devpts_root->d_inode->i_mutex);
+   mutex_unlock(&devpts_sb->s_root->d_inode->i_mutex);
 
return tty;
 }
@@ -219,24 +235,17 @@ void devpts_pty_kill(int number)
}
dput(dentry);
}
-   mutex_unlock(&devpts_root->d_inode->i_mutex);
+   mutex_unlock(&devpts_sb->s_root->d_inode->i_mutex);
 }
 
 static int __init init_devpts_fs(void)
 {
-   int err = register_filesystem(&devpts_fs_type);
-   if (!err) {
-   devpts_mnt = kern_mount(&devpts_fs_type);
-   if (IS_ERR(devpts_mnt))
-   err = PTR_ERR(devpts_mnt);
-   }
-   return err;
+   return register_filesystem(&devpts_fs_type);
 }
 
 static void __exit exit_devpts_fs(void)
 {
unregister_filesystem(&devpts_fs_type);
-   mntput(devpts_mnt);
 }
 
 module_init(init_devpts_fs)


[Devel] [PATCH] Fix warning in kernel/pid.c

2008-02-28 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH]: Fix compile warning in kernel/pid.c

We get a warning in kernel/pid.c due to the deprecated find_task_by_pid().
Make the function inline in sched.h to avoid the warning.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 include/linux/sched.h |5 -
 kernel/pid.c  |6 --
 2 files changed, 4 insertions(+), 7 deletions(-)

Index: linux-2.6-25-rc2-mm1/include/linux/sched.h
===
--- linux-2.6-25-rc2-mm1.orig/include/linux/sched.h 2008-02-27 
16:07:52.0 -0800
+++ linux-2.6-25-rc2-mm1/include/linux/sched.h  2008-02-27 16:29:31.0 
-0800
@@ -1632,7 +1632,10 @@ extern struct pid_namespace init_pid_ns;
 extern struct task_struct *find_task_by_pid_type_ns(int type, int pid,
struct pid_namespace *ns);
 
-extern struct task_struct *find_task_by_pid(pid_t nr) __deprecated;
+static inline struct task_struct *__deprecated find_task_by_pid(pid_t nr)
+{
+   return find_task_by_pid_type_ns(PIDTYPE_PID, nr, &init_pid_ns);
+}
 extern struct task_struct *find_task_by_vpid(pid_t nr);
 extern struct task_struct *find_task_by_pid_ns(pid_t nr,
struct pid_namespace *ns);
Index: linux-2.6-25-rc2-mm1/kernel/pid.c
===
--- linux-2.6-25-rc2-mm1.orig/kernel/pid.c  2008-02-27 15:18:22.0 
-0800
+++ linux-2.6-25-rc2-mm1/kernel/pid.c   2008-02-27 16:29:31.0 -0800
@@ -380,12 +380,6 @@ struct task_struct *find_task_by_pid_typ
 
 EXPORT_SYMBOL(find_task_by_pid_type_ns);
 
-struct task_struct *find_task_by_pid(pid_t nr)
-{
-   return find_task_by_pid_type_ns(PIDTYPE_PID, nr, &init_pid_ns);
-}
-EXPORT_SYMBOL(find_task_by_pid);
-
 struct task_struct *find_task_by_vpid(pid_t vnr)
 {
return find_task_by_pid_type_ns(PIDTYPE_PID, vnr,


[Devel] [PATCH 0/7][v2] Cloning PTS namespace

2008-03-24 Thread sukadev

Devpts namespace patchset

In continuation of the implementation of containers in mainline, we need to
support multiple PTY namespaces so that the PTY index (ie the tty names) in
one container is independent of the PTY indices of other containers.  For
instance this would allow each container to have a '/dev/pts/0' PTY and
refer to different terminals.
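
To illustrate the goal, here is a user-space sketch in which every namespace
owns its index table, so two containers can both hand out index 0. The
toy_* names and TOY_PTY_LIMIT are assumptions for illustration; the kernel
code uses an IDR and pty_limit instead of a fixed array.

```c
#include <errno.h>
#include <string.h>

#define TOY_PTY_LIMIT 8  /* hypothetical; the kernel uses pty_limit */

/* Hypothetical per-namespace state: one "allocated" flag per PTY index. */
struct toy_pts_ns {
    unsigned char used[TOY_PTY_LIMIT];
};

static void toy_ns_init(struct toy_pts_ns *ns)
{
    memset(ns->used, 0, sizeof(ns->used));
}

/* Return the lowest free index in this namespace, or -EIO when full. */
static int toy_new_index(struct toy_pts_ns *ns)
{
    for (int i = 0; i < TOY_PTY_LIMIT; i++) {
        if (!ns->used[i]) {
            ns->used[i] = 1;
            return i;
        }
    }
    return -EIO;
}

/* Make an index available for reallocation in this namespace only. */
static void toy_kill_index(struct toy_pts_ns *ns, int idx)
{
    if (idx >= 0 && idx < TOY_PTY_LIMIT)
        ns->used[idx] = 0;
}
```

Allocating in one namespace never consumes an index in another, which is
the property the patchset is after.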

[PATCH 1/7]: Propagate error code from devpts_pty_new
[PATCH 2/7]: Factor out PTY index allocation
[PATCH 3/7]: Enable multiple mounts of /dev/pts
[PATCH 4/7]: Implement get_pts_ns() and put_pts_ns()
[PATCH 5/7]: Determine pts_ns from a pty's inode.
[PATCH 6/7]: Check for user-space mount of /dev/pts
[PATCH 7/7]: Enable cloning PTY namespaces

Todo:
- This patchset depends on the availability of additional clone flags
  and relies on Cedric's clone64 patchset.

- Needs more testing.

Changelog[v1]:
- Fixed circular reference by caching the pts_ns in sb->s_fs_info
  without incrementing the reference count, and clearing sb->s_fs_info
  when destroying the pts_ns (See patch 3/7 for details).

- To allow access to a child container's ptys from parent container,
  determine the 'pts_ns' of a 'pty' from its inode (See patch 5/7
  for details).

- Added a check (hack) to ensure user-space mount of /dev/pts is
  done before creating PTYs in a new pts-ns (see patch 6/7 for
  details).

- Reorganized the patchset and removed redundant changes.

- Ported to work with Cedric Le Goater's clone64() system call now
  that we are out of clone_flags.

Changelog[v0]:

This patchset is based on earlier versions developed by Serge Hallyn
and Matt Helsley.


[Devel] [PATCH 1/7] Propagate error code from devpts_pty_new

2008-03-24 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 1/7]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new().

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Cc: Cedric Le Goater <[EMAIL PROTECTED]>
Cc: Dave Hansen <[EMAIL PROTECTED]>
Cc: Serge Hallyn <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
---
 drivers/char/tty_io.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c   2008-03-21 20:13:38.0 
-0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:04:07.0 
-0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
filp->private_data = tty;
file_move(filp, &tty->tty_files);
 
-   retval = -ENOMEM;
-   if (devpts_pty_new(tty->link))
+   retval = devpts_pty_new(tty->link);
+   if (retval)
goto out1;
 
check_tty_count(tty, "tty_open");


[Devel] [PATCH 2/7]: Factor out PTY index allocation

2008-03-24 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 2/7]: Factor out PTY index allocation

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index().  This localizes the external
data structures used in managing the pts indices.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn<[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley<[EMAIL PROTECTED]>

---
 drivers/char/tty_io.c |   40 ++--
 fs/devpts/inode.c |   42 +-
 include/linux/devpts_fs.h |4 
 3 files changed, 51 insertions(+), 35 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 
20:04:07.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:04:26.0 
-0700
@@ -17,6 +17,8 @@
 
 #ifdef CONFIG_UNIX98_PTYS
 
+int devpts_new_index(void);
+void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
 void devpts_pty_kill(int number);   /* unlink */
@@ -24,6 +26,8 @@ void devpts_pty_kill(int number);  /* u
 #else
 
 /* Dummy stubs in the no-pty case */
+static inline int devpts_new_index(void) { return -EINVAL; }
+static inline void devpts_kill_index(int idx) { }
 static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; }
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c   2008-03-24 20:04:07.0 
-0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:04:26.0 
-0700
@@ -91,7 +91,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex);
 
 #ifdef CONFIG_UNIX98_PTYS
 extern struct tty_driver *ptm_driver;  /* Unix98 pty masters; for /dev/ptmx */
-extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
-static DEFINE_MUTEX(allocated_ptys_lock);
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
@@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil
 */
release_tty(tty, idx);
 
-#ifdef CONFIG_UNIX98_PTYS
/* Make this pty number available for reallocation */
-   if (devpts) {
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
-   mutex_unlock(&allocated_ptys_lock);
-   }
-#endif
-
+   if (devpts)
+   devpts_kill_index(idx);
 }
 
 /**
@@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode
struct tty_struct *tty;
int retval;
int index;
-   int idr_ret;
 
nonseekable_open(inode, filp);
 
/* find a device that is not in use. */
-   mutex_lock(&allocated_ptys_lock);
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
-   mutex_unlock(&allocated_ptys_lock);
-   return -ENOMEM;
-   }
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
-   if (idr_ret < 0) {
-   mutex_unlock(&allocated_ptys_lock);
-   if (idr_ret == -EAGAIN)
-   return -ENOMEM;
-   return -EIO;
-   }
-   if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
-   return -EIO;
-   }
-   mutex_unlock(&allocated_ptys_lock);
+   index = devpts_new_index();
+   if (index < 0)
+   return index;
 
mutex_lock(&tty_mutex);
retval = init_dev(ptm_driver, index, &tty);
@@ -2847,9 +2821,7 @@ out1:
release_dev(filp);
return retval;
 out:
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
+   devpts_kill_index(index);
return retval;
 }
 #endif
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:04:07.0 
-0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:04:26.0 -0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,10 @@
 
 #define DEVPTS_DEFAULT_MODE 0600
 
+extern int pty_limit;  /* Config limit on Unix98 ptys */
+static DEFINE_IDR(allocated_ptys);
+static DECLARE_

[Devel] [PATCH 3/7]: Enable multiple mounts of /dev/pts

2008-03-24 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 3/7]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, we should allow multiple
mounts of /dev/pts, once within each PTY namespace.

This patch removes the get_sb_single() in devpts_get_sb() and
uses test and set sb interfaces to allow remounting /dev/pts.
The patch also removes the globals, 'devpts_mnt', 'devpts_root'
and uses a skeletal 'init_pts_ns' to store the vfsmount.
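
The test-and-set lookup can be sketched in user space as follows. This is a
simplified model, not the kernel's sget(): the fixed-size table, the toy_*
names and the missing locking and reference counting are all assumptions
made to keep the example short.

```c
#include <stddef.h>

#define TOY_MAX_SB 4  /* hypothetical table size */

/* Hypothetical stand-in for struct super_block. */
struct toy_sb {
    const void *s_fs_info;  /* namespace this super block belongs to */
    int in_use;
};

static struct toy_sb toy_supers[TOY_MAX_SB];

/* "test" callback: does this super block belong to the namespace? */
static int toy_test_sb(const struct toy_sb *sb, const void *data)
{
    return sb->s_fs_info == data;
}

/* "set" callback: tag a fresh super block with the namespace pointer. */
static void toy_set_sb(struct toy_sb *sb, const void *data)
{
    sb->s_fs_info = data;
}

/*
 * Return the existing super block for 'ns', or set up a new one if
 * none matches. Returns NULL when the table is full.
 */
static struct toy_sb *toy_sget(const void *ns)
{
    struct toy_sb *free_slot = NULL;

    for (int i = 0; i < TOY_MAX_SB; i++) {
        struct toy_sb *sb = &toy_supers[i];

        if (sb->in_use && toy_test_sb(sb, ns))
            return sb;
        if (!sb->in_use && !free_slot)
            free_slot = sb;
    }
    if (!free_slot)
        return NULL;
    free_slot->in_use = 1;
    toy_set_sb(free_slot, ns);
    return free_slot;
}
```

Mounting twice from the same namespace yields the same super block, while a
different namespace gets its own.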

Changelog [v2]:

- (Pavel Emelianov/Serge Hallyn) Remove reference to pts_ns from
  sb->s_fs_info to fix the circular reference (/dev/pts is not
  unmounted unless the pts_ns is destroyed, so we don't need a
  reference to the pts_ns).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |  160 +-
 include/linux/devpts_fs.h |   11 +++
 2 files changed, 143 insertions(+), 28 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 
20:04:26.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:04:31.0 
-0700
@@ -14,6 +14,17 @@
 #define _LINUX_DEVPTS_FS_H
 
 #include 
+#include 
+#include 
+#include 
+
+struct pts_namespace {
+   struct kref kref;
+   struct idr allocated_ptys;
+   struct vfsmount *mnt;
+};
+
+extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
 
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:04:26.0 
-0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:04:31.0 -0700
@@ -28,12 +28,8 @@
 #define DEVPTS_DEFAULT_MODE 0600
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
 
-static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
-
 static struct {
int setuid;
int setgid;
@@ -54,6 +50,15 @@ static match_table_t tokens = {
{Opt_err, NULL}
 };
 
+struct pts_namespace init_pts_ns = {
+   .kref = {
+   .refcount = ATOMIC_INIT(2),
+   },
+   .allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys),
+   .mnt = NULL,
+};
+
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
@@ -140,7 +145,7 @@ devpts_fill_super(struct super_block *s,
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
-   devpts_root = s->s_root = d_alloc_root(inode);
+   s->s_root = d_alloc_root(inode);
if (s->s_root)
return 0;

@@ -150,17 +155,82 @@ fail:
return -ENOMEM;
 }
 
+/*
+ * We use test and set super-block operations to help determine whether we
+ * need a new super-block for this namespace. get_sb() walks the list of
+ * existing devpts supers, comparing them with the @data ptr. Since we
+ * passed 'current's namespace as the @data pointer we can compare the
+ * namespace pointer in the super-block's 's_fs_info'.  If the test is
+ * TRUE then get_sb() returns a new active reference to the super block.
+ * Otherwise, it helps us build an active reference to a new one.
+ */
+
+static int devpts_test_sb(struct super_block *sb, void *data)
+{
+   return sb->s_fs_info == data;
+}
+
+static int devpts_set_sb(struct super_block *sb, void *data)
+{
+   /*
+* new_pts_ns() mounts the pts namespace and free_pts_ns()
+* drops the reference to the mount. i.e the s_fs_info is
+* cleared and vfsmnt is released _before_ pts_namespace
+* is freed.
+*
+* So we don't need a reference to the pts_namespace here
+* (Getting a reference here will also cause circular reference).
+*/
+   sb->s_fs_info = data;
+   return set_anon_super(sb, NULL);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-   return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+   struct super_block *sb;
+   struct pts_namespace *ns;
+   int err;
+
+   /* hereafter we're very similar to proc_get_sb */
+   if (flags & MS_KERNMOUNT)
+   ns = data;
+   else
+   ns = &init_pts_ns;
+
+   /* hereafter we're very similar to get_sb_nodev */
+   sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns);
+   if (IS_ERR(sb))
+   return PTR_ERR(sb);
+
+   if (sb->s_root)
+   return sim

[Devel] [PATCH 4/7] Implement get_pts_ns() and put_pts_ns()

2008-03-24 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 4/7]: Implement get_pts_ns() and put_pts_ns()

Implement get_pts_ns() and put_pts_ns() interfaces.
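
A user-space sketch of the get/put pattern, in which the statically
allocated initial namespace is never reference counted. The toy_* names and
the 'freed' flag are hypothetical; in this patch free_pts_ns() is still an
empty stub, so the flag only marks the point where a real release would run.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct pts_namespace. */
struct toy_pts_ns {
    int refcount;
    int freed;  /* set where a real release function would run */
};

/* Models init_pts_ns: static, never freed, never counted. */
static struct toy_pts_ns toy_init_pts_ns = { 2, 0 };

/* Take a reference, skipping NULL and the initial namespace. */
static struct toy_pts_ns *toy_get_pts_ns(struct toy_pts_ns *ns)
{
    if (ns && ns != &toy_init_pts_ns)
        ns->refcount++;
    return ns;
}

/* Drop a reference; the last put triggers the release. */
static void toy_put_pts_ns(struct toy_pts_ns *ns)
{
    if (ns && ns != &toy_init_pts_ns && --ns->refcount == 0)
        ns->freed = 1;
}
```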

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 include/linux/devpts_fs.h |   21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 
20:04:31.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:05:05.0 
-0700
@@ -27,13 +27,26 @@ struct pts_namespace {
 extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
-
 int devpts_new_index(void);
 void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
 void devpts_pty_kill(int number);   /* unlink */
 
+static inline void free_pts_ns(struct kref *ns_kref) { }
+
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+   if (ns && (ns != &init_pts_ns))
+   kref_get(&ns->kref);
+   return ns;
+}
+static inline void put_pts_ns(struct pts_namespace *ns)
+{
+   if (ns && (ns != &init_pts_ns))
+   kref_put(&ns->kref, free_pts_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -43,6 +56,12 @@ static inline int devpts_pty_new(struct 
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
 
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+   return &init_pts_ns;
+}
+
+static inline void put_pts_ns(struct pts_namespace *ns) { }
 #endif
 
 


[Devel] [PATCH 5/7]: Determine pts_ns from a pty's inode.

2008-03-24 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 5/7]: Determine pts_ns from a pty's inode.

The devpts interfaces currently operate on a specific pts namespace
which they get from the 'current' task.

With implementation of containers and cloning of PTS namespaces, we want
to be able to access PTYs in a child-pts-ns from a parent-pts-ns. For
instance we could bind-mount and pivot-root the child container on
'/vserver/vserver1' and then access the "pts/0" of 'vserver1' using 

$ echo foo > /vserver/vserver1/dev/pts/0

The task doing the above 'echo' could be in parent-pts-ns. So we find
the 'pts-ns' of the above file from the inode representing the above
file rather than from the 'current' task.

Note that we need to find and hold a reference to the pts_ns to prevent
the pts_ns from being freed while it is being accessed from 'outside'.

This patch implements 'pts_ns_from_inode()', which returns the pts_ns
using 'inode->i_sb->s_fs_info'.

Since the 'inode' information is not visible inside devpts code itself,
this patch modifies the tty driver code to determine the pts_ns and pass
it into devpts.
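
The lookup can be sketched in user space as below. The toy_* structures are
hypothetical stand-ins for the kernel's inode and super block, and the
sketch returns NULL for a foreign filesystem where the patch uses BUG_ON().

```c
#include <stddef.h>

#define TOY_DEVPTS_MAGIC 0x1cd1  /* same value as DEVPTS_SUPER_MAGIC */

/* Hypothetical stand-ins for the kernel structures involved. */
struct toy_pts_ns {
    int id;
};

struct toy_sb {
    unsigned long s_magic;
    void *s_fs_info;  /* points at the owning toy_pts_ns */
};

struct toy_inode {
    struct toy_sb *i_sb;
};

/*
 * Find the namespace from the file's inode rather than from the
 * calling task, so a task in the parent namespace still reaches the
 * child's ptys.
 */
static struct toy_pts_ns *toy_ns_from_inode(const struct toy_inode *inode)
{
    if (inode->i_sb->s_magic != TOY_DEVPTS_MAGIC)
        return NULL;  /* the patch uses BUG_ON() here instead */
    return inode->i_sb->s_fs_info;
}
```

The result depends only on which devpts mount the file lives on, not on who
opens it.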

TODO:
What is the expected behavior when '/dev/tty' or '/dev/ptmx' is
accessed from the parent-pts-ns? For example:

$ echo "foobar" > /vserver/vserver1/dev/tty

This patch currently ignores the '/vserver/vserver1' part (that
seemed to be the easiest to do :-). So opening /dev/ptmx from
even the child pts-ns will create a pty in the _PARENT_ pts-ns.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/pty.c|2 -
 drivers/char/tty_io.c |   86 +++---
 fs/devpts/inode.c |   19 +++---
 include/linux/devpts_fs.h |   42 +++---
 4 files changed, 119 insertions(+), 30 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 
20:05:05.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:08:33.0 
-0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct pts_namespace {
struct kref kref;
@@ -26,12 +27,43 @@ struct pts_namespace {
 
 extern struct pts_namespace init_pts_ns;
 
+#define DEVPTS_SUPER_MAGIC 0x1cd1
+static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode)
+{
+   /*
+* Need this bug-on for now to catch any cases in tty_open()
+* or release_dev() I may have missed.
+*/
+   BUG_ON(inode->i_sb->s_magic != DEVPTS_SUPER_MAGIC);
+
+   /*
+* If we have a valid inode, we already have a reference to
+* mount-point. Since there is a single super-block for the
+* devpts mount, i_sb->s_fs_info cannot go to NULL. So we
+* should not need a lock here.
+*/
+
+   return (struct pts_namespace *)inode->i_sb->s_fs_info;
+}
+
+static inline struct pts_namespace *current_pts_ns(void)
+{
+   return &init_pts_ns;
+}
+
+
 #ifdef CONFIG_UNIX98_PTYS
-int devpts_new_index(void);
-void devpts_kill_index(int idx);
-int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
-struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
-void devpts_pty_kill(int number);   /* unlink */
+int devpts_new_index(struct pts_namespace *pts_ns);
+void devpts_kill_index(struct pts_namespace *pts_ns, int idx);
+
+/* mknod in devpts */
+int devpts_pty_new(struct pts_namespace *pts_ns, struct tty_struct *tty);
+
+/* get tty structure */
+struct tty_struct *devpts_get_tty(struct pts_namespace *pts_ns, int number);
+
+/* unlink */
+void devpts_pty_kill(struct pts_namespace *pts_ns, int number);
 
 static inline void free_pts_ns(struct kref *ns_kref) { }
 
Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c   2008-03-24 20:04:26.0 
-0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:08:15.0 
-0700
@@ -2064,8 +2064,8 @@ static void tty_line_name(struct tty_dri
  * relaxed for the (most common) case of reopening a tty.
  */
 
-static int init_dev(struct tty_driver *driver, int idx,
-   struct tty_struct **ret_tty)
+static int init_dev(struct tty_driver *driver, struct pts_namespace *pts_ns,
+   int idx, struct tty_struct **ret_tty)
 {
struct tty_struct *tty, *o_tty;
struct ktermios *tp, **tp_loc, *o_tp, **o_tp_loc;
@@ -2074,7 +2074,7 @@ static int init_dev(struct tty_driver *d
 
/* check whether we're reopening an existing tty */
if (driver->fl

[Devel] [PATCH 6/7]: Check for user-space mount of /dev/pts

2008-03-24 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 6/7]: Check for user-space mount of /dev/pts

When the pts namespace is cloned, the /dev/pts is not useful unless it
is remounted from the user space.

If user-space clones pts namespace but does not remount /dev/pts, it
would end up using the /dev/pts mount from parent-pts-ns but allocate
the pts indices from current pts ns.

This patch (hack ?) prevents creation of PTYs in user space unless
user-space mounts /dev/pts.
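
A user-space sketch of the guard, using hypothetical toy_* names and a
simple counter in place of the IDR: index allocation is refused with
-ENOSYS until the namespace has been marked as mounted from user space.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pts_namespace with the new flag. */
struct toy_pts_ns {
    int user_mounted;  /* set once user space mounts devpts here */
    int next_index;    /* simplistic allocator, for illustration only */
};

/* Models a user-space (non-kernel) mount of devpts in this namespace. */
static void toy_user_mount(struct toy_pts_ns *ns)
{
    ns->user_mounted = 1;
}

/* Models devpts_new_index(): no indices before the user-space mount. */
static int toy_new_index(struct toy_pts_ns *ns)
{
    if (!ns || !ns->user_mounted)
        return -ENOSYS;
    return ns->next_index++;
}
```

A freshly cloned namespace therefore cannot hand out indices that would end
up as nodes in the parent's /dev/pts mount.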

(While this patch can be folded into others, keeping this separate
for now for easier review (and to highlight the hack :-))

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   25 +++--
 include/linux/devpts_fs.h |   20 +++-
 2 files changed, 42 insertions(+), 3 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 
20:08:33.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:08:57.0 
-0700
@@ -23,6 +23,7 @@ struct pts_namespace {
struct kref kref;
struct idr allocated_ptys;
struct vfsmount *mnt;
+   int user_mounted;
 };
 
 extern struct pts_namespace init_pts_ns;
@@ -30,6 +31,8 @@ extern struct pts_namespace init_pts_ns;
 #define DEVPTS_SUPER_MAGIC 0x1cd1
 static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode)
 {
+   struct pts_namespace *ns;
+
/*
 * Need this bug-on for now to catch any cases in tty_open()
 * or release_dev() I may have missed.
@@ -43,7 +46,22 @@ static inline struct pts_namespace *pts_
 * should not need a lock here.
 */
 
-   return (struct pts_namespace *)inode->i_sb->s_fs_info;
+   ns = (struct pts_namespace *)inode->i_sb->s_fs_info;
+
+   /*
+* If user-space did not mount pts ns after cloning pts namespace,
+* the child process would end up accessing devpts mount of the
+* parent but use allocated_ptys from the cloned pts ns.
+*
+* This check prevents creating ptys unless user-space mounts
+* devpts  in the new pts namespace.
+*
+* Is there a cleaner way to prevent this ?
+*/
+   if (!ns->user_mounted)
+   return NULL;
+
+   return ns;
 }
 
 static inline struct pts_namespace *current_pts_ns(void)
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:08:33.0 
-0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:08:57.0 -0700
@@ -201,8 +201,11 @@ static int devpts_get_sb(struct file_sys
if (IS_ERR(sb))
return PTR_ERR(sb);
 
-   if (sb->s_root)
+   if (sb->s_root) {
+   if (!(flags & MS_KERNMOUNT))
+   ns->user_mounted = 1;
return simple_set_mnt(mnt, sb);
+   }
 
sb->s_flags = flags;
err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
@@ -248,6 +251,10 @@ int devpts_new_index(struct pts_namespac
int index;
int idr_ret;
 
+   if (!pts_ns || !pts_ns->user_mounted) {
+   printk(KERN_ERR "devpts_new_index() without user_mount\n");
+   return -ENOSYS;
+   }
 retry:
if (!idr_pre_get(&pts_ns->allocated_ptys, GFP_KERNEL)) {
return -ENOMEM;
@@ -273,7 +280,7 @@ retry:
 
 void devpts_kill_index(struct pts_namespace *pts_ns, int idx)
 {
-
+   BUG_ON(!pts_ns->user_mounted);
down(&allocated_ptys_lock);
idr_remove(&pts_ns->allocated_ptys, idx);
up(&allocated_ptys_lock);
@@ -293,6 +300,11 @@ int devpts_pty_new( struct pts_namespace
BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY);
BUG_ON(driver->subtype != PTY_TYPE_SLAVE);
 
+   if (!pts_ns || !pts_ns->user_mounted) {
+   printk(KERN_ERR "devpts_pty_new() without user_mount\n");
+   return -ENOSYS;
+   }
+
mnt = pts_ns->mnt;
root = mnt->mnt_root;
 
@@ -332,6 +344,11 @@ struct tty_struct *devpts_get_tty(struct
struct dentry *dentry;
struct tty_struct *tty;
 
+   if (!pts_ns || !pts_ns->user_mounted) {
+   printk(KERN_ERR "devpts_get_tty() without user_mount\n");
+   return ERR_PTR(-ENOSYS);
+   }
+
mnt = pts_ns->mnt;
 
dentry = get_node(mnt->mnt_root, number);
@@ -353,6 +370,10 @@ void devpts_pty_kill(struct pts_namespac
struct dentry *dentry;
struct dentry *root;
 
+   if (!pts_ns || !pts_ns->user_mounted) {
+   printk(KERN_ERR "devpts_pty_kill() wi

[Devel] [PATCH 7/7]: Enable cloning PTY namespaces

2008-03-24 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 7/7]: Enable cloning PTY namespaces

Enable cloning PTY namespaces.
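
For illustration only, the clone-time decision made by copy_pts_ns() in
this patch can be modelled in userspace. This is a sketch, not the kernel
code: struct ns_model, the model_*() helpers and the MODEL_CLONE_NEWPTS bit
are hypothetical, and a plain integer stands in for the kref.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag bit for this sketch; not taken from the patch. */
#define MODEL_CLONE_NEWPTS (UINT64_C(1) << 33)

/* Userspace stand-in for a pts namespace; refcount replaces the kref. */
struct ns_model {
    int refcount;
};

/* Create a fresh namespace holding one reference (NULL on allocation failure). */
static struct ns_model *model_new_ns(void)
{
    struct ns_model *ns = malloc(sizeof(*ns));

    if (ns)
        ns->refcount = 1;
    return ns;
}

/* Share an existing namespace by taking one more reference. */
static struct ns_model *model_get_ns(struct ns_model *ns)
{
    ns->refcount++;
    return ns;
}

/* A clone either gets a new namespace or shares the parent's. */
static struct ns_model *model_copy_ns(uint64_t flags, struct ns_model *old_ns)
{
    assert(old_ns != NULL);
    if (flags & MODEL_CLONE_NEWPTS)
        return model_new_ns();
    return model_get_ns(old_ns);
}
```

Without the flag the child shares the parent's namespace and the reference
count goes up by one; with the flag it starts with a namespace of its own.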

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>

---
 fs/devpts/inode.c |   40 +++-
 include/linux/devpts_fs.h |   22 --
 include/linux/init_task.h |1 +
 include/linux/nsproxy.h   |2 ++
 include/linux/sched.h |1 +
 kernel/fork.c |2 +-
 kernel/nsproxy.c  |   17 +++--
 7 files changed, 79 insertions(+), 6 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/sched.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/sched.h   2008-03-24 20:02:57.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/sched.h    2008-03-24 20:12:56.0 -0700
@@ -28,6 +28,7 @@
 #define CLONE_NEWPID   0x2000  /* New pid namespace */
 #define CLONE_NEWNET   0x4000  /* New network namespace */
 #define CLONE_IO   0x8000  /* Clone io context */
+#define CLONE_NEWPTS   0x0002ULL   /* Clone pts ns */
 
 /*
  * Scheduling policies
Index: 2.6.25-rc5-mm1/include/linux/nsproxy.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/nsproxy.h 2008-03-24 20:02:57.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/nsproxy.h  2008-03-24 20:12:56.0 -0700
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct pts_namespace;
 
 /*
  * A structure to contain pointers to all per-process
@@ -29,6 +30,7 @@ struct nsproxy {
struct pid_namespace *pid_ns;
struct user_namespace *user_ns;
struct net   *net_ns;
+   struct pts_namespace *pts_ns;
 };
 extern struct nsproxy init_nsproxy;
 
Index: 2.6.25-rc5-mm1/include/linux/init_task.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/init_task.h   2008-03-24 20:02:57.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/init_task.h    2008-03-24 20:12:56.0 -0700
@@ -78,6 +78,7 @@ extern struct nsproxy init_nsproxy;
.mnt_ns = NULL, \
INIT_NET_NS(net_ns) \
INIT_IPC_NS(ipc_ns) \
+   .pts_ns = &init_pts_ns, \
.user_ns= &init_user_ns,\
 }
 
Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 20:08:57.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h    2008-03-24 20:12:56.0 -0700
@@ -66,7 +66,7 @@ static inline struct pts_namespace *pts_
 
 static inline struct pts_namespace *current_pts_ns(void)
 {
-   return &init_pts_ns;
+   return current->nsproxy->pts_ns;
 }
 
 
@@ -83,7 +83,8 @@ struct tty_struct *devpts_get_tty(struct
 /* unlink */
 void devpts_pty_kill(struct pts_namespace *pts_ns, int number);
 
-static inline void free_pts_ns(struct kref *ns_kref) { }
+extern struct pts_namespace *new_pts_ns(void);
+extern void free_pts_ns(struct kref *kref);
 
 static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
 {
@@ -97,6 +98,15 @@ static inline void put_pts_ns(struct pts
kref_put(&ns->kref, free_pts_ns);
 }
 
+static inline struct pts_namespace *copy_pts_ns(u64 flags,
+  struct pts_namespace *old_ns)
+{
+  if (flags & CLONE_NEWPTS)
+  return new_pts_ns();
+  else
+  return get_pts_ns(old_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -112,6 +122,14 @@ static inline struct pts_namespace *get_
 }
 
 static inline void put_pts_ns(struct pts_namespace *ns) { }
+
+static inline struct pts_namespace *copy_pts_ns(u64 flags,
+  struct pts_namespace *old_ns)
+{
+  if (flags & CLONE_NEWPTS)
+  return ERR_PTR(-EINVAL);
+  return old_ns;
+}
 #endif
 
 
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:08:57.0 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c    2008-03-24 20:14:20.0 -0700
@@ -27,6 +27,7 @@
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
 static DECLARE_MUTEX(allocated_ptys_lock);
+static struct file_system_type devpts_fs_type;
 
 static struct {
int setuid;
@@ -56,6 +57,43 @@ struct pts_namespace init_pts_ns = {
.mnt = NULL,
 };
 
+struct pts_namesp

[Devel] Re: [PATCH 6/7]: Check for user-space mount of /dev/pts

2008-03-25 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| > 
| > From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > Subject: [PATCH 6/7]: Check for user-space mount of /dev/pts
| > 
| > When the pts namespace is cloned, the /dev/pts is not useful unless it
| > is remounted from the user space.
| > 
| > If user-space clones pts namespace but does not remount /dev/pts, it
| > would end up using the /dev/pts mount from parent-pts-ns but allocate
| > the pts indices from current pts ns.
| 
| So why not use the allocated_ptys from the parent ptsns?  It's what
| userspace asked for and it's safe to do.

The problem is when opening /dev/ptmx, we use current_pts_ns() and
when opening slave-pty, we use pts_ns from the inode.

If child-pts-ns opens /dev/ptmx, we use 'allocated-ptys' from
child-pts-ns and we allocate index 0. But when the process opens
the slave pty "/dev/pts/0", we would get the pts_ns from the
inode which would come from parent-pts-ns (and could refer to
an existing pty).
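
The mismatch can be reproduced in a small userspace model. This is only a
sketch, not kernel code: struct pts_ns_model, model_new_index() and
model_open_pty() are hypothetical names, and each namespace is reduced to a
table of in-use indices. The master open allocates from one namespace while
the slave lookup consults another, which is what happens when /dev/pts is
not remounted after the clone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PTYS 8
#define MODEL_CROSS_NS (-1)  /* slave lookup hit another namespace's pty */
#define MODEL_NO_INDEX (-2)  /* no free index in the allocating namespace */

/* Hypothetical stand-in for a pts namespace: which indices are taken. */
struct pts_ns_model {
    bool in_use[MODEL_MAX_PTYS];
};

/* Allocate the lowest free index, as the open of /dev/ptmx would. */
static int model_new_index(struct pts_ns_model *ns)
{
    assert(ns != NULL);
    for (int i = 0; i < MODEL_MAX_PTYS; i++) {
        if (!ns->in_use[i]) {
            ns->in_use[i] = true;
            return i;
        }
    }
    return MODEL_NO_INDEX;
}

/*
 * ptmx_ns is the namespace used for the master (current_pts_ns()),
 * slave_ns the one found through the inode of /dev/pts/<idx>.
 * Returns the index, or MODEL_CROSS_NS when the index just allocated
 * already names a pty in a different namespace.
 */
static int model_open_pty(struct pts_ns_model *ptmx_ns,
                          const struct pts_ns_model *slave_ns)
{
    int idx = model_new_index(ptmx_ns);

    if (idx < 0)
        return idx;
    if (ptmx_ns != slave_ns && slave_ns->in_use[idx])
        return MODEL_CROSS_NS;
    return idx;
}
```

If the parent already holds index 0, model_open_pty(&child, &parent)
allocates index 0 from the child and reports MODEL_CROSS_NS, because
/dev/pts/0 in the parent names an unrelated pty.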

Agree with Alexey and Pavel, it's bad. Will think some more, but
appreciate any ideas.

Sukadev
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 4/7] Implement get_pts_ns() and put_pts_ns()

2008-03-25 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| > 
| > From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > Subject: [PATCH 4/7]: Implement get_pts_ns() and put_pts_ns()
| > 
| > Implement get_pts_ns() and put_pts_ns() interfaces.
| > 
| > Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > ---
| >  include/linux/devpts_fs.h |   21 -
| >  1 file changed, 20 insertions(+), 1 deletion(-)
| > 
| > Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
| > ===
| > --- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 20:04:31.0 -0700
| > +++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h    2008-03-24 20:05:05.0 -0700
| > @@ -27,13 +27,26 @@ struct pts_namespace {
| >  extern struct pts_namespace init_pts_ns;
| > 
| >  #ifdef CONFIG_UNIX98_PTYS
| > -
| >  int devpts_new_index(void);
| >  void devpts_kill_index(int idx);
| >  int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
| >  struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
| >  void devpts_pty_kill(int number);   /* unlink */
| > 
| > +static inline void free_pts_ns(struct kref *ns_kref) { }
| > +
| > +static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
| > +{
| > +   if (ns && (ns != &init_pts_ns))
| > +   kref_get(&ns->kref);
| > +   return ns;
| > +}
| > +static inline void put_pts_ns(struct pts_namespace *ns)
| > +{
| > +   if (ns && (ns != &init_pts_ns))
| > +   kref_put(&ns->kref, free_pts_ns);
| 
| This isn't right, or I'm not thinking right.  Don't you somewhere
| need to
| 
|   1. rcu_assign ns->mnt->mnt_sb->s_fs_info to NULL
|   2. wait a grace period
|   3. call free_pts_ns and check the refcount on the ns again?
| 
| and then do pts_ns_from_inode() under an rcu_read_lock and grab
| a ref to the ns?

Yes, we need the rcu to grab the reference to pts_ns.
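
A userspace sketch of the reference-taking half of that scheme, using C11
atomics. RCU itself is not modelled here; the point is only that a lookup
racing with the final put must refuse to resurrect an object whose count
has already reached zero (similar in spirit to the kernel's
atomic_inc_not_zero()). struct ns_model and both helpers are hypothetical
names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted pts namespace. */
struct ns_model {
    atomic_int refcount;
};

/* Take a reference only if the object is still live (count > 0). */
static bool ns_get_unless_zero(struct ns_model *ns)
{
    int old = atomic_load(&ns->refcount);

    while (old > 0) {
        /* On failure the CAS reloads 'old' with the current value. */
        if (atomic_compare_exchange_weak(&ns->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Drop a reference. Returns true when the caller dropped the last one
 * and is therefore responsible for freeing the object (in the kernel
 * case, only after a grace period has elapsed).
 */
static bool ns_put(struct ns_model *ns)
{
    int old = atomic_fetch_sub(&ns->refcount, 1);

    assert(old > 0);
    return old == 1;
}
```

A reader that finds the namespace through the super block would call
ns_get_unless_zero() inside its read-side critical section and treat a
false return as "namespace is going away".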


[Devel] Re: [PATCH 5/7]: Determine pts_ns from a pty's inode.

2008-03-25 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| Quoting Serge E. Hallyn ([EMAIL PROTECTED]):
| > Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| > > 
| > > From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > > Subject: [PATCH 5/7]: Determine pts_ns from a pty's inode.
| > > 
| > > The devpts interfaces currently operate on a specific pts namespace
| > > which they get from the 'current' task.
| > > 
| > > With implementation of containers and cloning of PTS namespaces, we want
| > > to be able to access PTYs in a child-pts-ns from a parent-pts-ns. For
| > > instance we could bind-mount and pivot-root the child container on
| > > '/vserver/vserver1' and then access the "pts/0" of 'vserver1' using 
| > > 
| > >   $ echo foo > /vserver/vserver1/dev/pts/0
| > >   
| > > The task doing the above 'echo' could be in parent-pts-ns. So we find
| > > the 'pts-ns' of the above file from the inode representing the above
| > > file rather than from the 'current' task.
| > > 
| > > Note that we need to find and hold a reference to the pts_ns to prevent
| > > the pts_ns from being freed while it is being accessed from 'outside'.
| > > 
| > > This patch implements, 'pts_ns_from_inode()' which returns the pts_ns
| > > using 'inode->i_sb->s_fs_info'.
| > > 
| > > Since, the 'inode' information is not visible inside devpts code itself,
| > > this patch modifies the tty driver code to determine the pts_ns and passes
| > > it into devpts.
| > > 
| > > TODO:
| > >   What is the expected behavior when '/dev/tty' or '/dev/ptmx' are
| > >   accessed from parent-pts-ns. i.e:
| > > 
| > >   $ echo "foobar" > /vserver/vserver1/dev/tty)
| > >   
| > >   This patch currently ignores the '/vserver/vserver1' part (that
| > 
| > The way this is phrased it almost sounds like you're considering using
| > the pathnames to figure out the ptsns to use :).
| > 
| > It's not clear to me what is the sane thing to do.
| > 
| > what you're doing here - have /dev/ptmx and /dev/tty always use
| > current->'s ptsns - isn't ideal.
| > 
| > It would be nicer to not have a 'devpts ns', and instead have a
| > full device namespace.  However, then it still isn't clear how to tie
| > /vs/vs1/dev/ptmx to vs1's device namespace, since there is no device
| > fs to which to tie the devns.
| > 
| > We could tie the devns to a device inode on mknod, using the devns of
| > the creating task.  Then when starting up vs1, you just have to always
| > let vs1 create /dev/ptmx and /dev/tty.  I can't think of anything
| > better offhand.
| > 
| > Other ideas?
| 
| I suppose you could just create /dev/pts/ptmx and /dev/pts/tty.
| Recommend that in containers /dev/ptmx and /dev/tty be symlinks
| into /dev/pts.  Applications don't need to change.  If
| ptmx_open() sees that inode->i_sb is a devptsfs, it gets the
| namespace from the sb.  If not, then it was a device in /dev
| and it gets the namespace from current.

But we would still depend on user-space remounting /dev/pts after
the clone, right? Until they do that, we would access the parent
container's /dev/pts/ptmx?


[Devel] Re: [PATCH 5/7]: Determine pts_ns from a pty's inode.

2008-03-26 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| > | I suppose you could just create /dev/pts/ptmx and /dev/pts/tty.
| > | Recommend that in containers /dev/ptmx and /dev/tty be symlinks
| > | into /dev/pts.  Applications don't need to change.  If
| > | ptmx_open() sees that inode->i_sb is a devptsfs, it gets the
| > | namespace from the sb.  If not, then it was a device in /dev
| > | and it gets the namespace from current.
| > 
| > But we would still depend on user-space remounting /dev/pts after
| > the clone right ? Until they do that we would access the parent
| > container's /dev/pts/ptmx ?
| 
| Yes.  Which is the right thing to do imo.

Hmm, that sounds reasonable, although slightly inconsistent with pid-ns,
where pid starts at 1 regardless of whether /proc is remounted.

But even so, if user fails to establish the symlink, clones the pts ns
and tries to create a pty, we would end up with different pts nses again ?

i.e
/dev/ptmx is still a char dev in root fs
clone(pts_ns)
( In child, (before remount /dev/pts))
open("/dev/ptmx")
open("/dev/pts/0")

Since ptmx is not in devpts, we use current_pts_ns(), i.e. child-pts-ns.
Since /dev/pts is not remounted in child, we get the parent pts-ns from
the inode.

If we can somehow detect the incorrect configuration and fail either
open, we should be ok :-)

Suka


[Devel] Re: [PATCH 5/7]: Determine pts_ns from a pty's inode.

2008-03-26 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| > Serge E. Hallyn [EMAIL PROTECTED] wrote:
| > | > | I suppose you could just create /dev/pts/ptmx and /dev/pts/tty.
| > | > | Recommend that in containers /dev/ptmx and /dev/tty be symlinks
| > | > | into /dev/pts.  Applications don't need to change.  If
| > | > | ptmx_open() sees that inode->i_sb is a devptsfs, it gets the
| > | > | namespace from the sb.  If not, then it was a device in /dev
| > | > | and it gets the namespace from current.
| > | > 
| > | > But we would still depend on user-space remounting /dev/pts after
| > | > the clone right ? Until they do that we would access the parent
| > | > container's /dev/pts/ptmx ?
| > | 
| > | Yes.  Which is the right thing to do imo.
| > 
| > Hmm, that sounds reasonable, although slightly inconsistent with pid-ns,
| > where pid starts at 1 regardless of whether /proc is remounted.
| 
| Very different cases.  The pid is the task's pid in the new pidns.
| The task ALSO has a different pid in the parent pidns.
| 
| The pts only has an identity in one ptsns.
| 
| > But even so, if user fails to establish the symlink, clones the pts ns
| > and tries to create a pty, we would end up with different pts nses again ?
| 
| Yes.  So what?

We would end up allocating a pts index from child-pts-ns (i.e index 0)
and attempt to open /dev/pts/0 which could be an existing pty in the
parent pts ns ?
| 
| > i.e
| > /dev/ptmx is still a char dev in root fs
| > clone(pts_ns)
| > ( In child, (before remount /dev/pts))
| > open("/dev/ptmx")
| > open("/dev/pts/0")
| > 
| > Since ptmx is not in devpts, we use current_pts_ns() or child-pts-ns
| > Since /dev/pts is not remounted in child, we get the parent pts-ns from
| > inode.
| > 
| > If we can somehow detect the incorrect configuration and fail either
| > open, we should be ok :-)
| 
| I completely disagree with this sentiment.  The kernel doesn't need
| to detect an "incorrect configuration" if it isn't dangerous.  One
| man's "incorrect configuration" is another man's useful trick.

Maybe configuration is the wrong word, but unless I am missing something
above, spanning two pts-nses is an error condition?


[Devel] [PATCH 3/7][v2]: Enable multiple mounts of /dev/pts

2008-04-03 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 3/7][v2]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, we should allow multiple
mounts of /dev/pts, one within each PTY namespace.

This patch removes the get_sb_single() in devpts_get_sb() and
uses test and set sb interfaces to allow remounting /dev/pts.
The patch also removes the globals, 'devpts_mnt', 'devpts_root'
and uses a skeletal 'init_pts_ns' to store the vfsmount.
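
The test-and-set lookup can be pictured with a small userspace model. This
is a sketch under simplifying assumptions, not the VFS code: struct
sb_table, model_sget() and the two callbacks are hypothetical, a fixed
array replaces the list of super blocks, and there is no locking or
reference counting.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SUPERS 4

/* Hypothetical stand-in for a super block. */
struct sb_model {
    void *s_fs_info;  /* namespace this super block belongs to */
    int in_use;
};

struct sb_table {
    struct sb_model sb[MODEL_MAX_SUPERS];
};

typedef int (*sb_test_fn)(const struct sb_model *sb, void *data);
typedef void (*sb_set_fn)(struct sb_model *sb, void *data);

/* "test": does this existing super block belong to the namespace in data? */
static int model_test_sb(const struct sb_model *sb, void *data)
{
    return sb->s_fs_info == data;
}

/* "set": bind a newly allocated super block to the namespace in data. */
static void model_set_sb(struct sb_model *sb, void *data)
{
    sb->s_fs_info = data;
}

/*
 * Return the existing super block for which test() succeeds; otherwise
 * claim a free slot, initialise it with set() and return that.
 * Returns NULL when the table is full.
 */
static struct sb_model *model_sget(struct sb_table *t, sb_test_fn test,
                                   sb_set_fn set, void *data)
{
    struct sb_model *free_slot = NULL;

    assert(t != NULL && test != NULL && set != NULL);
    for (size_t i = 0; i < MODEL_MAX_SUPERS; i++) {
        struct sb_model *sb = &t->sb[i];

        if (sb->in_use) {
            if (test(sb, data))
                return sb;
        } else if (free_slot == NULL) {
            free_slot = sb;
        }
    }
    if (free_slot == NULL)
        return NULL;
    free_slot->in_use = 1;
    set(free_slot, data);
    return free_slot;
}
```

Two mounts made with the same namespace pointer share one super block,
while a mount made with a different namespace gets its own.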

Changelog [v3]:
- Removed some unnecessary comments from devpts_set_sb()

Changelog [v2]:

- (Pavel Emelianov/Serge Hallyn) Remove reference to pts_ns from
  sb->s_fs_info to fix the circular reference (/dev/pts is not
  unmounted unless the pts_ns is destroyed, so we don't need a
  reference to the pts_ns).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |  151 +-
 include/linux/devpts_fs.h |   11 +++
 2 files changed, 134 insertions(+), 28 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 20:04:26.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h    2008-04-01 18:08:42.0 -0700
@@ -14,6 +14,17 @@
 #define _LINUX_DEVPTS_FS_H
 
 #include 
+#include 
+#include 
+#include 
+
+struct pts_namespace {
+   struct kref kref;
+   struct idr allocated_ptys;
+   struct vfsmount *mnt;
+};
+
+extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
 
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:04:26.0 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c    2008-04-01 18:08:41.0 -0700
@@ -28,12 +28,8 @@
 #define DEVPTS_DEFAULT_MODE 0600
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
 
-static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
-
 static struct {
int setuid;
int setgid;
@@ -54,6 +50,15 @@ static match_table_t tokens = {
{Opt_err, NULL}
 };
 
+struct pts_namespace init_pts_ns = {
+   .kref = {
+   .refcount = ATOMIC_INIT(2),
+   },
+   .allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys),
+   .mnt = NULL,
+};
+
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
@@ -140,7 +145,7 @@ devpts_fill_super(struct super_block *s,
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
-   devpts_root = s->s_root = d_alloc_root(inode);
+   s->s_root = d_alloc_root(inode);
if (s->s_root)
return 0;

@@ -150,17 +155,73 @@ fail:
return -ENOMEM;
 }
 
+/*
+ * We use test and set super-block operations to help determine whether we
+ * need a new super-block for this namespace. get_sb() walks the list of
+ * existing devpts supers, comparing them with the @data ptr. Since we
+ * passed 'current's namespace as the @data pointer we can compare the
+ * namespace pointer in the super-block's 's_fs_info'.  If the test is
+ * TRUE then get_sb() returns a new active reference to the super block.
+ * Otherwise, it helps us build an active reference to a new one.
+ */
+
+static int devpts_test_sb(struct super_block *sb, void *data)
+{
+   return sb->s_fs_info == data;
+}
+
+static int devpts_set_sb(struct super_block *sb, void *data)
+{
+   sb->s_fs_info = data;
+   return set_anon_super(sb, NULL);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-   return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+   struct super_block *sb;
+   struct pts_namespace *ns;
+   int err;
+
+   /* hereafter we're very similar to proc_get_sb */
+   if (flags & MS_KERNMOUNT)
+   ns = data;
+   else
+   ns = &init_pts_ns;
+
+   /* hereafter we're very similar to get_sb_nodev */
+   sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns);
+   if (IS_ERR(sb))
+   return PTR_ERR(sb);
+
+   if (sb->s_root)
+   return simple_set_mnt(mnt, sb);
+
+   sb->s_flags = flags;
+   err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+   if (err) {
+   up_write(&sb->s_umount);
+   deactivate_super(sb);
+   return err;
+   }
+
+   sb->s_flags |= MS_ACTIVE;
+  

[Devel] [PATCH 6/7][v2]: Determine pts_ns from a pty's inode

2008-04-03 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 6/7][v2]: Determine pts_ns from a pty's inode.

The devpts interfaces currently operate on a specific pts namespace
which they get from the 'current' task.

With implementation of containers and cloning of PTS namespaces, we want
to be able to access PTYs in a child-pts-ns from a parent-pts-ns. For
instance we could bind-mount and pivot-root the child container on
'/vserver/vserver1' and then access the "pts/0" of 'vserver1' using 

$ echo foo > /vserver/vserver1/dev/pts/0

The task doing the above 'echo' could be in parent-pts-ns. So we find
the 'pts-ns' of the above file from the inode representing the above
file rather than from the 'current' task.

Note that we need to find and hold a reference to the pts_ns to prevent
the pts_ns from being freed while it is being accessed from 'outside'.

This patch implements 'pts_ns_from_inode()', which returns the pts_ns
using 'inode->i_sb->s_fs_info'.

Since the 'inode' information is not visible inside the devpts code itself,
this patch modifies the tty driver code to determine the pts_ns and pass
it into devpts.
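
As an illustration, the selection rule can be written as a userspace model.
This is a sketch, not the kernel implementation: the *_model structures and
model_ns_from_inode() are hypothetical, the caller's namespace is passed in
explicitly instead of being read from 'current', and no RCU protection is
shown. The magic value matches the DEVPTS_SUPER_MAGIC used in this patch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_DEVPTS_MAGIC 0x1cd1

/* Hypothetical stand-ins for the kernel objects involved. */
struct ns_model {
    int id;
};

struct sb_model {
    unsigned long s_magic;       /* filesystem type marker */
    struct ns_model *s_fs_info;  /* owning pts namespace, for devpts */
};

struct inode_model {
    struct sb_model *i_sb;
};

/*
 * A file that lives on devpts belongs to the namespace recorded in its
 * super block, whoever opens it. Anything else falls back to the
 * namespace of the calling task.
 */
static struct ns_model *model_ns_from_inode(const struct inode_model *inode,
                                            struct ns_model *current_ns)
{
    assert(inode != NULL && inode->i_sb != NULL);
    if (inode->i_sb->s_magic == MODEL_DEVPTS_MAGIC)
        return inode->i_sb->s_fs_info;
    return current_ns;
}
```

A task in the parent namespace that opens a slave pty on the child's devpts
mount therefore resolves to the child's namespace, whereas a /dev/ptmx node
on an ordinary filesystem resolves to the opener's namespace.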

Changelog [v2]:
[Serge Hallyn] Use rcu to access sb->s_fs_info.

[Serge Hallyn] Simplify handling of ptmx and tty devices by expecting
user to create them in /dev/pts (see also devpts-mknod patch)


Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/pty.c|   13 +-
 drivers/char/tty_io.c |   93 +++---
 fs/devpts/inode.c |   19 +++--
 include/linux/devpts_fs.h |   38 --
 4 files changed, 131 insertions(+), 32 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-04-02 22:42:08.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h    2008-04-02 22:42:14.0 -0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct pts_namespace {
struct kref kref;
@@ -26,12 +27,39 @@ struct pts_namespace {
 
 extern struct pts_namespace init_pts_ns;
 
+#define DEVPTS_SUPER_MAGIC 0x1cd1
+
+static inline struct pts_namespace *current_pts_ns(void)
+{
+   return &init_pts_ns;
+}
+
+static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode)
+{
+   /*
+* If this file exists on devpts, return the pts_ns from the
+* devpts super-block. Otherwise just use the pts-ns of the
+* calling task.
+*/
+   if(inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
+   return rcu_dereference(inode->i_sb->s_fs_info);
+
+   return current_pts_ns();
+}
+
+
 #ifdef CONFIG_UNIX98_PTYS
-int devpts_new_index(void);
-void devpts_kill_index(int idx);
-int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
-struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
-void devpts_pty_kill(int number);   /* unlink */
+int devpts_new_index(struct pts_namespace *pts_ns);
+void devpts_kill_index(struct pts_namespace *pts_ns, int idx);
+
+/* mknod in devpts */
+int devpts_pty_new(struct pts_namespace *pts_ns, struct tty_struct *tty);
+
+/* get tty structure */
+struct tty_struct *devpts_get_tty(struct pts_namespace *pts_ns, int number);
+
+/* unlink */
+void devpts_pty_kill(struct pts_namespace *pts_ns, int number);
 
 static inline void free_pts_ns(struct kref *ns_kref) { }
 
Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c   2008-04-02 22:35:29.0 -0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c    2008-04-02 22:42:14.0 -0700
@@ -2064,8 +2064,8 @@ static void tty_line_name(struct tty_dri
  * relaxed for the (most common) case of reopening a tty.
  */
 
-static int init_dev(struct tty_driver *driver, int idx,
-   struct tty_struct **ret_tty)
+static int init_dev(struct tty_driver *driver, struct pts_namespace *pts_ns,
+   int idx, struct tty_struct **ret_tty)
 {
struct tty_struct *tty, *o_tty;
struct ktermios *tp, **tp_loc, *o_tp, **o_tp_loc;
@@ -2074,7 +2074,11 @@ static int init_dev(struct tty_driver *d
 
/* check whether we're reopening an existing tty */
if (driver->flags & TTY_DRIVER_DEVPTS_MEM) {
-   tty = devpts_get_tty(idx);
+   tty = devpts_get_tty(pts_ns, idx);
+   if (IS_ERR(tty)) {
+   retval = PTR_ERR(tty);
+   goto end_init;
+   }
/*
 * If we don't have a tty here on a slave open, it's because
 * the master alre

[Devel] [PATCH 7/7][v2]: Enable cloning PTY namespaces

2008-04-03 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 7/7][v2]: Enable cloning PTY namespaces

Enable cloning PTY namespaces.

Changelog[v2]:
[Serge Hallyn]: Use rcu to access sb->s_fs_info.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>

---
 fs/devpts/inode.c |   84 --
 include/linux/devpts_fs.h |   22 ++--
 include/linux/init_task.h |1 
 include/linux/nsproxy.h   |2 +
 include/linux/sched.h |1 
 kernel/fork.c |2 -
 kernel/nsproxy.c  |   17 -
 7 files changed, 122 insertions(+), 7 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/sched.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/sched.h   2008-04-02 22:50:22.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/sched.h    2008-04-02 22:51:59.0 -0700
@@ -28,6 +28,7 @@
 #define CLONE_NEWPID   0x2000  /* New pid namespace */
 #define CLONE_NEWNET   0x4000  /* New network namespace */
 #define CLONE_IO   0x8000  /* Clone io context */
+#define CLONE_NEWPTS   0x0002ULL   /* Clone pts ns */
 
 /*
  * Scheduling policies
Index: 2.6.25-rc5-mm1/include/linux/nsproxy.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/nsproxy.h 2008-04-02 22:50:22.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/nsproxy.h  2008-04-02 22:51:59.0 -0700
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct pts_namespace;
 
 /*
  * A structure to contain pointers to all per-process
@@ -29,6 +30,7 @@ struct nsproxy {
struct pid_namespace *pid_ns;
struct user_namespace *user_ns;
struct net   *net_ns;
+   struct pts_namespace *pts_ns;
 };
 extern struct nsproxy init_nsproxy;
 
Index: 2.6.25-rc5-mm1/include/linux/init_task.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/init_task.h   2008-04-02 22:50:22.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/init_task.h    2008-04-02 22:51:59.0 -0700
@@ -78,6 +78,7 @@ extern struct nsproxy init_nsproxy;
.mnt_ns = NULL, \
INIT_NET_NS(net_ns) \
INIT_IPC_NS(ipc_ns) \
+   .pts_ns = &init_pts_ns, \
.user_ns= &init_user_ns,\
 }
 
Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-04-02 22:51:59.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h    2008-04-02 22:51:59.0 -0700
@@ -31,7 +31,7 @@ extern struct pts_namespace init_pts_ns;
 
 static inline struct pts_namespace *current_pts_ns(void)
 {
-   return &init_pts_ns;
+   return current->nsproxy->pts_ns;
 }
 
 static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode)
@@ -61,7 +61,8 @@ struct tty_struct *devpts_get_tty(struct
 /* unlink */
 void devpts_pty_kill(struct pts_namespace *pts_ns, int number);
 
-static inline void free_pts_ns(struct kref *ns_kref) { }
+extern struct pts_namespace *new_pts_ns(void);
+extern void free_pts_ns(struct kref *kref);
 
 static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
 {
@@ -75,6 +76,15 @@ static inline void put_pts_ns(struct pts
kref_put(&ns->kref, free_pts_ns);
 }
 
+static inline struct pts_namespace *copy_pts_ns(u64 flags,
+  struct pts_namespace *old_ns)
+{
+  if (flags & CLONE_NEWPTS)
+  return new_pts_ns();
+  else
+  return get_pts_ns(old_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -90,6 +100,14 @@ static inline struct pts_namespace *get_
 }
 
 static inline void put_pts_ns(struct pts_namespace *ns) { }
+
+static inline struct pts_namespace *copy_pts_ns(u64 flags,
+  struct pts_namespace *old_ns)
+{
+  if (flags & CLONE_NEWPTS)
+  return ERR_PTR(-EINVAL);
+  return old_ns;
+}
 #endif
 
 
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-04-02 22:51:59.0 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c    2008-04-02 22:51:59.0 -0700
@@ -27,6 +27,7 @@
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
 static DECLARE_MUTEX(allocated_ptys_lock);
+static struct file_system_type devpts_fs_type;
 
 static st

[Devel] [PATCH 4/7][v2]: Allow mknod of ptmx and tty in devpts

2008-04-03 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 4/7][v2]: Allow mknod of ptmx and tty in devpts

We want to allow administrators to access PTYs in descendant pts-namespaces,
for instance "echo foo > /vserver/vserver1/dev/pts/0". To enable such access
we must hold a reference to the pts-ns in which the device (ptmx or the
slave pty) exists. 

Note that we cannot use the pts-ns of the 'current' process since that pts-ns
could be different from the pts-ns in which the PTY device was created. So
we find the pts-ns from the inode of the PTY (inode->i_sb->s_fs_info).
While this would work for the slave PTY devices like /dev/pts/0, it would
not work for either the master PTY device (/dev/ptmx) or controlling terminal
(/dev/tty).

To uniformly handle the master, slave and controlling ttys, we allow creation
of 'ptmx' and 'tty' devices in /dev/pts. When creating containers, the
administrator can then:

$ umount /dev/pts
$ mount -t devpts lxcpts /dev/pts
$ mknod /dev/pts/ptmx c 5 2
$ mknod /dev/pts/tty c 5 0
$ rm /dev/ptmx /dev/tty
$ ln -s /dev/pts/ptmx /dev/ptmx
$ ln -s /dev/pts/tty /dev/tty

With this, even if the 'ptmx' is accessed from parent pts-ns we still find
and hold the pts-ns in which 'ptmx' actually belongs.
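
The device numbers used in the mknod commands above are the only ones the
devpts_mknod() in this patch accepts, and each gets a fixed inode number.
That mapping can be sketched in userspace as follows. The MODEL_* macros and
model_*() functions are hypothetical names; MODEL_MKDEV mirrors the kernel's
MKDEV() encoding (20 minor bits), and 5 is the value of TTYAUX_MAJOR.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_MINORBITS 20
#define MODEL_MKDEV(ma, mi) \
    (((unsigned int)(ma) << MODEL_MINORBITS) | (unsigned int)(mi))
#define MODEL_TTYAUX_MAJOR 5

/*
 * Inode number for a device node created in devpts: c 5 0 (tty) maps
 * to 2, c 5 2 (ptmx) maps to 3, and every other device is refused.
 */
static int model_devpts_inum(unsigned int rdev)
{
    if (rdev == MODEL_MKDEV(MODEL_TTYAUX_MAJOR, 0))
        return 2;
    if (rdev == MODEL_MKDEV(MODEL_TTYAUX_MAJOR, 2))
        return 3;
    return -EPERM;
}

/*
 * Inode number for slave pty <number>. The patch moves the offset from
 * +2 to +4 so pty inodes cannot collide with the two nodes above.
 */
static int model_pty_inum(int number)
{
    assert(number >= 0);
    return number + 4;
}
```

With this layout /dev/pts/0 gets inode 4, and an attempt to create any
other character device inside devpts fails with -EPERM.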

This patch merely allows creation of /dev/pts/ptmx and /dev/pts/tty. We
hold a reference to the dentries for these nodes to pin them in memory and
use 'kill_litter_super()' while unmounting to ensure we drop these dentries.

TODO: Ability to unlink the /dev/pts/ptmx and /dev/pts/tty nodes.

Note: if /dev/ptmx is a symlink to /vserver/vserver1/dev/pts/ptmx, an open
      of /dev/ptmx in init-pts-ns will create a PTY in 'vserver1'!

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   55 ++
 1 file changed, 51 insertions(+), 4 deletions(-)

Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-04-02 10:18:42.0 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c    2008-04-02 22:51:02.0 -0700
@@ -58,7 +58,6 @@ struct pts_namespace init_pts_ns = {
.mnt = NULL,
 };
 
-
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
@@ -122,6 +121,54 @@ static const struct super_operations dev
.show_options   = devpts_show_options,
 };
 
+
+static int devpts_mknod(struct inode *dir, struct dentry *dentry,
+   int mode, dev_t rdev)
+{
+   int inum;
+   struct inode *inode;
+   struct super_block *sb = dir->i_sb;
+
+   if (dentry->d_inode)
+   return -EEXIST;
+
+   if (!S_ISCHR(mode))
+   return -EPERM;
+
+   if (rdev == MKDEV(TTYAUX_MAJOR, 0))
+   inum = 2;
+   else if (rdev == MKDEV(TTYAUX_MAJOR, 2))
+   inum = 3;
+   else
+   return -EPERM;
+
+   inode = new_inode(sb);
+   if (!inode)
+   return -ENOMEM;
+
+   inode->i_ino = inum;
+   inode->i_uid = inode->i_gid = 0;
+   inode->i_blocks = 0;
+   inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+
+   init_special_inode(inode, mode, rdev);
+
+   d_instantiate(dentry, inode);
+   /*
+* Get a reference to the dentry so the device-nodes persist
+* even when there are no active references to them. We use
+* kill_litter_super() to remove this entry when unmounting
+* devpts.
+*/
+   dget(dentry);
+   return 0;
+}
+
+const struct inode_operations devpts_dir_inode_operations = {
+   .lookup = simple_lookup,
+   .mknod  = devpts_mknod,
+};
+
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -141,7 +188,7 @@ devpts_fill_super(struct super_block *s,
inode->i_blocks = 0;
inode->i_uid = inode->i_gid = 0;
inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO | S_IWUSR;
-   inode->i_op = &simple_dir_inode_operations;
+   inode->i_op = &devpts_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
@@ -214,7 +261,7 @@ static int devpts_get_sb(struct file_sys
 static void devpts_kill_sb(struct super_block *sb)
 {
sb->s_fs_info = NULL;
-   kill_anon_super(sb);
+   kill_litter_super(sb);
 }
 
 static struct file_system_type devpts_fs_type = {
@@ -303,7 +350,7 @@ int devpts_pty_new(struct tty_struct *tt
if (!inode)
return -ENOMEM;
 
-   inode->i_ino = number+2;
+   inode->i_ino = number+4;
inode->i_uid = config.setuid ? config.uid : current->fsuid;
inode->i_gid = config.setgid 

[Devel] [PATCH 2/7][v2]: Factor out PTY index allocation

2008-04-03 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 2/7][v2]: Factor out PTY index allocation

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index().  This localizes the external
data structures used in managing the pts indices.
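As a rough userspace illustration of what "localizing" the index data means, the sketch below hides the table and its lock behind two functions, so callers only ever see an index or an error. All names here (sketch_new_index, sketch_kill_index, PTY_LIMIT_SKETCH) are hypothetical, a plain array stands in for the kernel IDR, and a pthread mutex stands in for the kernel lock; it is not the code from this patch.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

#define PTY_LIMIT_SKETCH 4  /* hypothetical stand-in for pty_limit */

/* Private to this file: callers never touch the table or its lock. */
static bool allocated[PTY_LIMIT_SKETCH];
static pthread_mutex_t allocated_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate the lowest free index, or return a negative errno value. */
int sketch_new_index(void)
{
    /* The code removed from tty_io.c returned -EIO at the limit. */
    int idx = -EIO;

    pthread_mutex_lock(&allocated_lock);
    for (int i = 0; i < PTY_LIMIT_SKETCH; i++) {
        if (!allocated[i]) {
            allocated[i] = true;
            idx = i;
            break;
        }
    }
    pthread_mutex_unlock(&allocated_lock);
    return idx;
}

/* Make an index available for reallocation; out-of-range values are ignored. */
void sketch_kill_index(int idx)
{
    if (idx < 0 || idx >= PTY_LIMIT_SKETCH)
        return;
    pthread_mutex_lock(&allocated_lock);
    allocated[idx] = false;
    pthread_mutex_unlock(&allocated_lock);
}
```

With this shape, a caller such as ptmx_open() needs only "index = new_index(); if (index < 0) return index;" and no knowledge of the locking.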

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>

---
 drivers/char/tty_io.c |   40 ++--
 fs/devpts/inode.c |   42 +-
 include/linux/devpts_fs.h |4 
 3 files changed, 51 insertions(+), 35 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 
20:04:07.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:04:26.0 
-0700
@@ -17,6 +17,8 @@
 
 #ifdef CONFIG_UNIX98_PTYS
 
+int devpts_new_index(void);
+void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
 void devpts_pty_kill(int number);   /* unlink */
@@ -24,6 +26,8 @@ void devpts_pty_kill(int number);  /* u
 #else
 
 /* Dummy stubs in the no-pty case */
+static inline int devpts_new_index(void) { return -EINVAL; }
+static inline void devpts_kill_index(int idx) { }
 static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; }
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c   2008-03-24 20:04:07.0 
-0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:04:26.0 
-0700
@@ -91,7 +91,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex);
 
 #ifdef CONFIG_UNIX98_PTYS
 extern struct tty_driver *ptm_driver;  /* Unix98 pty masters; for /dev/ptmx */
-extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
-static DEFINE_MUTEX(allocated_ptys_lock);
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
@@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil
 */
release_tty(tty, idx);
 
-#ifdef CONFIG_UNIX98_PTYS
/* Make this pty number available for reallocation */
-   if (devpts) {
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
-   mutex_unlock(&allocated_ptys_lock);
-   }
-#endif
-
+   if (devpts)
+   devpts_kill_index(idx);
 }
 
 /**
@@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode
struct tty_struct *tty;
int retval;
int index;
-   int idr_ret;
 
nonseekable_open(inode, filp);
 
/* find a device that is not in use. */
-   mutex_lock(&allocated_ptys_lock);
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
-   mutex_unlock(&allocated_ptys_lock);
-   return -ENOMEM;
-   }
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
-   if (idr_ret < 0) {
-   mutex_unlock(&allocated_ptys_lock);
-   if (idr_ret == -EAGAIN)
-   return -ENOMEM;
-   return -EIO;
-   }
-   if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
-   return -EIO;
-   }
-   mutex_unlock(&allocated_ptys_lock);
+   index = devpts_new_index();
+   if (index < 0)
+   return index;
 
mutex_lock(&tty_mutex);
retval = init_dev(ptm_driver, index, &tty);
@@ -2847,9 +2821,7 @@ out1:
release_dev(filp);
return retval;
 out:
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
+   devpts_kill_index(index);
return retval;
 }
 #endif
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:04:07.0 
-0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:04:26.0 -0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,10 @@
 
 #define DEVPTS_DEFAULT_MODE 0600
 
+extern int pty_limit;  /* Config limit on Unix98 ptys */
+static DEFINE_IDR(allocated_ptys);
+static DECLARE_

[Devel] [PATCH 5/7][v2]: Implement get_pts_ns() and put_pts_ns()

2008-04-03 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 5/7][v2]: Implement get_pts_ns() and put_pts_ns()

Implement get_pts_ns() and put_pts_ns() interfaces.
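The reference-counting pattern these interfaces follow can be sketched in userspace as shown here. This is only an illustration under stated assumptions: the names (ns_sketch, ns_get, ns_put, ns_release) are hypothetical, a plain int replaces the kernel kref, and the release hook just sets a flag so its effect can be observed. As in the patch, the initial namespace is never counted or released.

```c
#include <stddef.h>

struct ns_sketch {
    int refcount;
    int freed;   /* set by the release hook so the effect is observable */
};

/* Stand-in for the statically allocated initial namespace. */
static struct ns_sketch init_ns_sketch = { .refcount = 2 };

static void ns_release(struct ns_sketch *ns)
{
    ns->freed = 1;
}

/* Take a reference, except on NULL or the initial namespace. */
struct ns_sketch *ns_get(struct ns_sketch *ns)
{
    if (ns && ns != &init_ns_sketch)
        ns->refcount++;
    return ns;
}

/* Drop a reference; release the namespace when the last one goes away. */
void ns_put(struct ns_sketch *ns)
{
    if (ns && ns != &init_ns_sketch && --ns->refcount == 0)
        ns_release(ns);
}
```

Skipping the initial namespace keeps the common, non-container case free of refcount traffic, at the cost of a pointer comparison on every get and put.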

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 include/linux/devpts_fs.h |   21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-04-02 
22:35:35.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-04-02 22:42:08.0 
-0700
@@ -27,13 +27,26 @@ struct pts_namespace {
 extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
-
 int devpts_new_index(void);
 void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
 void devpts_pty_kill(int number);   /* unlink */
 
+static inline void free_pts_ns(struct kref *ns_kref) { }
+
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+   if (ns && (ns != &init_pts_ns))
+   kref_get(&ns->kref);
+   return ns;
+}
+static inline void put_pts_ns(struct pts_namespace *ns)
+{
+   if (ns && (ns != &init_pts_ns))
+   kref_put(&ns->kref, free_pts_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -43,6 +56,12 @@ static inline int devpts_pty_new(struct 
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
 
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+   return &init_pts_ns;
+}
+
+static inline void put_pts_ns(struct pts_namespace *ns) { }
 #endif
 
 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/7][v2]: Propagate error code from devpts_pty_new

2008-04-03 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 1/7][v2]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new().

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Cc: Cedric Le Goater <[EMAIL PROTECTED]>
Cc: Dave Hansen <[EMAIL PROTECTED]>
Cc: Serge Hallyn <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
---
 drivers/char/tty_io.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c   2008-03-21 20:13:38.0 
-0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:04:07.0 
-0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
filp->private_data = tty;
file_move(filp, &tty->tty_files);
 
-   retval = -ENOMEM;
-   if (devpts_pty_new(tty->link))
+   retval = devpts_pty_new(tty->link);
+   if (retval)
goto out1;
 
check_tty_count(tty, "tty_open");


[Devel] [PATCH 0/7][v2] Clone PTY namespaces

2008-04-03 Thread sukadev

Devpts namespace patchset

In continuation of the implementation of containers in mainline, we need to
support multiple PTY namespaces so that the PTY index (i.e. the tty names) in
one container is independent of the PTY indices of other containers.  For
instance this would allow each container to have a '/dev/pts/0' PTY and
refer to different terminals.
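To make the goal concrete, here is a minimal userspace sketch in which each namespace owns its own index table, so two namespaces can both hand out index 0. The structure and function names (pts_ns_sketch, pts_ns_new_index, pts_ns_kill_index) and the limit PTS_MAX are assumptions made for illustration, not part of the patchset.

```c
#include <stdbool.h>

#define PTS_MAX 8   /* hypothetical per-namespace limit */

/* Hypothetical stand-in for a PTY namespace: it owns its own index table. */
struct pts_ns_sketch {
    bool used[PTS_MAX];
};

/* Return the lowest free index in this namespace, or -1 if it is full. */
int pts_ns_new_index(struct pts_ns_sketch *ns)
{
    for (int i = 0; i < PTS_MAX; i++) {
        if (!ns->used[i]) {
            ns->used[i] = true;
            return i;
        }
    }
    return -1;
}

/* Release an index so the same namespace can hand it out again. */
void pts_ns_kill_index(struct pts_ns_sketch *ns, int idx)
{
    if (idx >= 0 && idx < PTS_MAX)
        ns->used[idx] = false;
}
```

Allocating from two separate pts_ns_sketch objects yields index 0 in each, which corresponds to each container seeing its own '/dev/pts/0'.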

[PATCH 1/7]: Propagate error code from devpts_pty_new
[PATCH 2/7]: Factor out PTY index allocation
[PATCH 3/7]: Enable multiple mounts of /dev/pts
[PATCH 4/7]: Allow mknod of ptmx and tty in devpts
[PATCH 5/7]: Implement get_pts_ns() and put_pts_ns()
[PATCH 6/7]: Determine pts_ns from a pty's inode
[PATCH 7/7]: Enable cloning PTY namespaces

Todo:
- This patchset depends on availability of additional clone flags
  and relies on Cedric's clone64 patchset.

- Needs some cleanup and more testing.

- Ensure patchset is bisect-safe

Changelog[v2]:

 (Patches 4 and 6 differ significantly from [v1]. Others are mostly
 the same)

- [Alexey Dobriyan, Pavel Emelyanov] Removed the hack to check for
  user-space mount.

- [Serge Hallyn] Added rcu locking around access to sb->s_fs_info.

- [Serge Hallyn] Allow creation of /dev/pts/ptmx and /dev/pts/tty
  devices to simplify the process of finding the 'owning' pts-ns
  of the device (especially when accessed from parent-pts-ns)
  See patches 4 and 6 for details.

Changelog[v1]:
- Fixed circular reference by not caching the pts_ns in sb->s_fs_info
  (without incrementing reference count) and clearing the sb->s_fs_info
  when destroying the pts_ns

- To allow access to a child container's ptys from parent container,
  determine the 'pts_ns' of a 'pty' from its inode.

- Added a check (hack) to ensure user-space mount of /dev/pts is
  done before creating PTYs in a new pts-ns.

- Reorganized the patchset and removed redundant changes.

- Ported to work with Cedric Le Goater's clone64() system call now
  that we are out of clone_flags.

Changelog[v0]:

This patchset is based on earlier versions developed by Serge Hallyn
and Matt Helsley.


[Devel] Re: [PATCH 6/7][v2]: Determine pts_ns from a pty's inode

2008-04-04 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| > +
| > +   /*
| > +* What pts-ns do we want to use when opening "/dev/tty" ?
| > +* Sounds like current_pts_ns(), but what should happen
| > +* if parent pts ns does:
| > +*
| > +*  echo foo > /vs/vs1/dev/tty
| 
| You'll want to remove this comment, right?  Your patch 4 solved
| this problem?

Yes, will remove while porting to rc8-mm1.

Should I go ahead and post as RFC to lkml ?

Sukadev


[Devel] [RFC][PATCH 0/7] Clone PTS namespace

2008-04-08 Thread sukadev

Devpts namespace patchset

In continuation of the implementation of containers in mainline, we need to
support multiple PTY namespaces so that the PTY index (i.e. the tty names) in
one container is independent of the PTY indices of other containers.  For
instance this would allow each container to have a '/dev/pts/0' PTY and
refer to different terminals.

[PATCH 1/7]: Propagate error code from devpts_pty_new
[PATCH 2/7]: Factor out PTY index allocation
[PATCH 3/7]: Enable multiple mounts of /dev/pts
[PATCH 4/7]: Allow mknod of ptmx and tty in devpts
[PATCH 5/7]: Implement get_pts_ns() and put_pts_ns()
[PATCH 6/7]: Determine pts_ns from a pty's inode
[PATCH 7/7]: Enable cloning PTY namespaces

Todo:
- This patchset depends on availability of additional clone flags
  and relies on Cedric's clone64 patchset. See

  http://marc.info/?l=linux-kernel&m=120272411925609&w=2

- Needs some cleanup and more testing

- Ensure patchset is bisect-safe

---
Changelogs from earlier posts to [EMAIL PROTECTED]

Changelog[v2]:

 (Patches 4 and 6 differ significantly from [v1]. Others are mostly
 the same)

- [Alexey Dobriyan, Pavel Emelyanov] Removed the hack to check for
  user-space mount.

- [Serge Hallyn] Added rcu locking around access to sb->s_fs_info.

- [Serge Hallyn] Allow creation of /dev/pts/ptmx and /dev/pts/tty
  devices to simplify the process of finding the 'owning' pts-ns
  of the device (especially when accessed from parent-pts-ns)
  See patches 4 and 6 for details.

Changelog[v1]:
- Fixed circular reference by not caching the pts_ns in sb->s_fs_info
  (without incrementing reference count) and clearing the sb->s_fs_info
  when destroying the pts_ns

- To allow access to a child container's ptys from parent container,
  determine the 'pts_ns' of a 'pty' from its inode.

- Added a check (hack) to ensure user-space mount of /dev/pts is
  done before creating PTYs in a new pts-ns.

- Reorganized the patchset and removed redundant changes.

- Ported to work with Cedric Le Goater's clone64() system call now
  that we are out of clone_flags.

Changelog[v0]:

This patchset is based on earlier versions developed by Serge Hallyn
and Matt Helsley.


[Devel] [RFC][PATCH 1/7]: Propagate error code from devpts_pty_new

2008-04-08 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 1/7]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new()
(which returns either 0 or -ENOMEM anyway).
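The difference between the old and new control flow can be sketched with a stand-in callee. The helper make_node() and the two wrappers are hypothetical and exist only to contrast "always report -ENOMEM" with "pass the callee's code through"; they are not kernel functions.

```c
#include <errno.h>

/* Hypothetical callee: 0 on success, negative errno on failure. */
static int make_node(int fail_with)
{
    return fail_with ? -fail_with : 0;
}

/* Before: any failure is reported as -ENOMEM, whatever the callee said. */
int open_old(int fail_with)
{
    int retval = -ENOMEM;

    if (make_node(fail_with))
        goto out;
    retval = 0;
out:
    return retval;
}

/* After: the callee's own error code is passed through unchanged. */
int open_new(int fail_with)
{
    int retval = make_node(fail_with);

    if (retval)
        goto out;
    /* ... further setup would go here ... */
out:
    return retval;
}
```

While the callee can only fail with -ENOMEM the two behave identically; the new form simply stops hiding any other code the callee may return later in the series.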

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc8-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc8-mm1.orig/drivers/char/tty_io.c   2008-04-07 14:49:56.0 
-0700
+++ 2.6.25-rc8-mm1/drivers/char/tty_io.c2008-04-08 09:12:55.0 
-0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
filp->private_data = tty;
file_move(filp, &tty->tty_files);
 
-   retval = -ENOMEM;
-   if (devpts_pty_new(tty->link))
+   retval = devpts_pty_new(tty->link);
+   if (retval)
goto out1;
 
check_tty_count(tty, "tty_open");


[Devel] [RFC][PATCH 3/7]: Enable multiple mounts of /dev/pts

2008-04-08 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 3/7]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, we should allow multiple mounts of
/dev/pts, once within each PTY namespace.

This patch removes the get_sb_single() in devpts_get_sb() and uses test and
set sb interfaces to allow remounting /dev/pts.  The patch also removes the
globals, 'devpts_mnt', 'devpts_root' and uses a skeletal 'init_pts_ns' to
store the vfsmount.
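The test-and-set lookup described above can be illustrated with a small userspace find-or-create routine. Everything here is a hypothetical stand-in: sget_sketch() is not the kernel's sget(), the fixed table replaces the list of superblocks, and locking is omitted. The point is only that a "test" callback finds an existing superblock for a namespace and a "set" callback initializes a new one.

```c
#include <stddef.h>

#define MAX_SB 4  /* hypothetical table size */

struct sb_sketch {
    int in_use;
    void *fs_info;   /* the namespace this superblock belongs to */
};

static struct sb_sketch sb_table[MAX_SB];

/* "test": does this superblock already belong to the namespace in data? */
static int test_sb(struct sb_sketch *sb, void *data)
{
    return sb->fs_info == data;
}

/* "set": bind a freshly claimed superblock to the namespace in data. */
static int set_sb(struct sb_sketch *sb, void *data)
{
    sb->fs_info = data;
    return 0;
}

/* Return the superblock for which test() holds, else claim a slot and set(). */
struct sb_sketch *sget_sketch(int (*test)(struct sb_sketch *, void *),
                              int (*set)(struct sb_sketch *, void *),
                              void *data)
{
    struct sb_sketch *free_slot = NULL;

    for (int i = 0; i < MAX_SB; i++) {
        if (sb_table[i].in_use) {
            if (test(&sb_table[i], data))
                return &sb_table[i];
        } else if (!free_slot) {
            free_slot = &sb_table[i];
        }
    }
    if (!free_slot || set(free_slot, data))
        return NULL;
    free_slot->in_use = 1;
    return free_slot;
}

/* Mounting twice for one namespace yields the same superblock. */
struct sb_sketch *mount_for_ns(void *ns)
{
    return sget_sketch(test_sb, set_sb, ns);
}
```

This is why a second mount of /dev/pts in the same namespace behaves like the old single mount, while a mount from a different namespace gets its own superblock.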

Changelog [v3]:
- Removed some unnecessary comments from devpts_set_sb()

Changelog [v2]:

- (Pavel Emelianov/Serge Hallyn) Remove reference to pts_ns from
  sb->s_fs_info to fix the circular reference (/dev/pts is not
  unmounted unless the pts_ns is destroyed, so we don't need a
  reference to the pts_ns).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |  151 +-
 include/linux/devpts_fs.h |   11 +++
 2 files changed, 134 insertions(+), 28 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 
20:04:26.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-04-01 18:08:42.0 
-0700
@@ -14,6 +14,17 @@
 #define _LINUX_DEVPTS_FS_H
 
 #include 
+#include 
+#include 
+#include 
+
+struct pts_namespace {
+   struct kref kref;
+   struct idr allocated_ptys;
+   struct vfsmount *mnt;
+};
+
+extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
 
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:04:26.0 
-0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-04-01 18:08:41.0 -0700
@@ -28,12 +28,8 @@
 #define DEVPTS_DEFAULT_MODE 0600
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
 
-static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
-
 static struct {
int setuid;
int setgid;
@@ -54,6 +50,15 @@ static match_table_t tokens = {
{Opt_err, NULL}
 };
 
+struct pts_namespace init_pts_ns = {
+   .kref = {
+   .refcount = ATOMIC_INIT(2),
+   },
+   .allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys),
+   .mnt = NULL,
+};
+
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
@@ -140,7 +145,7 @@ devpts_fill_super(struct super_block *s,
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
-   devpts_root = s->s_root = d_alloc_root(inode);
+   s->s_root = d_alloc_root(inode);
if (s->s_root)
return 0;

@@ -150,17 +155,73 @@ fail:
return -ENOMEM;
 }
 
+/*
+ * We use test and set super-block operations to help determine whether we
+ * need a new super-block for this namespace. get_sb() walks the list of
+ * existing devpts supers, comparing them with the @data ptr. Since we
+ * passed 'current's namespace as the @data pointer we can compare the
+ * namespace pointer in the super-block's 's_fs_info'.  If the test is
+ * TRUE then get_sb() returns a new active reference to the super block.
+ * Otherwise, it helps us build an active reference to a new one.
+ */
+
+static int devpts_test_sb(struct super_block *sb, void *data)
+{
+   return sb->s_fs_info == data;
+}
+
+static int devpts_set_sb(struct super_block *sb, void *data)
+{
+   sb->s_fs_info = data;
+   return set_anon_super(sb, NULL);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-   return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+   struct super_block *sb;
+   struct pts_namespace *ns;
+   int err;
+
+   /* hereafter we're very similar to proc_get_sb */
+   if (flags & MS_KERNMOUNT)
+   ns = data;
+   else
+   ns = &init_pts_ns;
+
+   /* hereafter we're very similar to get_sb_nodev */
+   sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns);
+   if (IS_ERR(sb))
+   return PTR_ERR(sb);
+
+   if (sb->s_root)
+   return simple_set_mnt(mnt, sb);
+
+   sb->s_flags = flags;
+   err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+   if (err) {
+   up_write(&sb->s_umount);
+   deactivate_super(sb);
+   return err;
+   }
+
+   sb->s_flags |= MS_ACTIVE;
+  

[Devel] [RFC][PATCH 2/7]: Factor out PTY index allocation

2008-04-08 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 2/7]: Factor out PTY index allocation

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index().  This localizes the external
data structures used in managing the pts indices.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>

---
 drivers/char/tty_io.c |   40 ++--
 fs/devpts/inode.c |   42 +-
 include/linux/devpts_fs.h |4 
 3 files changed, 51 insertions(+), 35 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 
20:04:07.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h2008-03-24 20:04:26.0 
-0700
@@ -17,6 +17,8 @@
 
 #ifdef CONFIG_UNIX98_PTYS
 
+int devpts_new_index(void);
+void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
 void devpts_pty_kill(int number);   /* unlink */
@@ -24,6 +26,8 @@ void devpts_pty_kill(int number);  /* u
 #else
 
 /* Dummy stubs in the no-pty case */
+static inline int devpts_new_index(void) { return -EINVAL; }
+static inline void devpts_kill_index(int idx) { }
 static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; }
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c   2008-03-24 20:04:07.0 
-0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c2008-03-24 20:04:26.0 
-0700
@@ -91,7 +91,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex);
 
 #ifdef CONFIG_UNIX98_PTYS
 extern struct tty_driver *ptm_driver;  /* Unix98 pty masters; for /dev/ptmx */
-extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
-static DEFINE_MUTEX(allocated_ptys_lock);
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
@@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil
 */
release_tty(tty, idx);
 
-#ifdef CONFIG_UNIX98_PTYS
/* Make this pty number available for reallocation */
-   if (devpts) {
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
-   mutex_unlock(&allocated_ptys_lock);
-   }
-#endif
-
+   if (devpts)
+   devpts_kill_index(idx);
 }
 
 /**
@@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode
struct tty_struct *tty;
int retval;
int index;
-   int idr_ret;
 
nonseekable_open(inode, filp);
 
/* find a device that is not in use. */
-   mutex_lock(&allocated_ptys_lock);
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
-   mutex_unlock(&allocated_ptys_lock);
-   return -ENOMEM;
-   }
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
-   if (idr_ret < 0) {
-   mutex_unlock(&allocated_ptys_lock);
-   if (idr_ret == -EAGAIN)
-   return -ENOMEM;
-   return -EIO;
-   }
-   if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
-   return -EIO;
-   }
-   mutex_unlock(&allocated_ptys_lock);
+   index = devpts_new_index();
+   if (index < 0)
+   return index;
 
mutex_lock(&tty_mutex);
retval = init_dev(ptm_driver, index, &tty);
@@ -2847,9 +2821,7 @@ out1:
release_dev(filp);
return retval;
 out:
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
+   devpts_kill_index(index);
return retval;
 }
 #endif
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:04:07.0 
-0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c2008-03-24 20:04:26.0 -0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,10 @@
 
 #define DEVPTS_DEFAULT_MODE 0600
 
+extern int pty_limit;  /* Config limit on Unix98 ptys */
+static DEFINE_IDR(allocated_ptys);
+static DECLARE_

[Devel] [RFC][PATCH 4/7]: Allow mknod of ptmx and tty in devpts

2008-04-08 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 4/7]: Allow mknod of ptmx and tty in devpts

We want to allow administrators to access PTYs in descendant pts-namespaces,
for instance "echo foo > /vserver/vserver1/dev/pts/0". To enable such access
we must hold a reference to the pts-ns in which the device (ptmx or slave pty)
exists. 

Note that we cannot use the pts-ns of the 'current' process since that pts-ns
could be different from the pts-ns in which the PTY device was created. So
we find the pts-ns from the inode of the PTY (inode->i_sb->s_fs_info).

While this would work for the slave PTY devices like /dev/pts/0, it would
not work for either the master PTY device (/dev/ptmx) or controlling terminal
(/dev/tty).

To uniformly handle the master, slave and controlling ttys, we allow creation
of 'ptmx' and 'tty' devices in /dev/pts. When creating containers, the
administrator can then:

In init-pts-ns:

$ mknod /dev/pts/ptmx c 5 2
$ mknod /dev/pts/tty c 5 0
$ rm /dev/ptmx /dev/tty
$ ln -s /dev/pts/ptmx /dev/ptmx
$ ln -s /dev/pts/tty /dev/tty

In child-pts-ns:

$ umount /dev/pts
$ mount -t devpts lxcpts /dev/pts
$ mknod /dev/pts/ptmx c 5 2
$ mknod /dev/pts/tty c 5 0

With this, even if 'ptmx' is accessed from the parent pts-ns, we still find
and hold the pts-ns to which 'ptmx' actually belongs.

This patch merely allows creation of /dev/pts/ptmx and /dev/pts/tty. Follow-on
patches will enable cloning the pts namespace and using the pts-ns from
the inode.

TODO:
- Ability to unlink the /dev/pts/ptmx and /dev/pts/tty nodes.

Note:
- If /dev/ptmx is a symlink to /vserver/vserver1/dev/pts/ptmx,
  open("/dev/ptmx") in init-pts-ns will create a PTY in 'vserver1' ! 

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   55 ++
 1 file changed, 51 insertions(+), 4 deletions(-)

Index: 2.6.25-rc8-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc8-mm1.orig/fs/devpts/inode.c   2008-04-08 09:18:23.0 
-0700
+++ 2.6.25-rc8-mm1/fs/devpts/inode.c2008-04-08 13:35:43.0 -0700
@@ -58,7 +58,6 @@ struct pts_namespace init_pts_ns = {
.mnt = NULL,
 };
 
-
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
@@ -122,6 +121,54 @@ static const struct super_operations dev
.show_options   = devpts_show_options,
 };
 
+
+static int devpts_mknod(struct inode *dir, struct dentry *dentry,
+   int mode, dev_t rdev)
+{
+   int inum;
+   struct inode *inode;
+   struct super_block *sb = dir->i_sb;
+
+   if (dentry->d_inode)
+   return -EEXIST;
+
+   if (!S_ISCHR(mode))
+   return -EPERM;
+
+   if (rdev == MKDEV(TTYAUX_MAJOR, 0))
+   inum = 2;
+   else if (rdev == MKDEV(TTYAUX_MAJOR, 2))
+   inum = 3;
+   else
+   return -EPERM;
+
+   inode = new_inode(sb);
+   if (!inode)
+   return -ENOMEM;
+
+   inode->i_ino = inum;
+   inode->i_uid = inode->i_gid = 0;
+   inode->i_blocks = 0;
+   inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+
+   init_special_inode(inode, mode, rdev);
+
+   d_instantiate(dentry, inode);
+   /*
+* Get a reference to the dentry so the device-nodes persist
+* even when there are no active references to them. We use
+* kill_litter_super() to remove this entry when unmounting
+* devpts.
+*/
+   dget(dentry);
+   return 0;
+}
+
+const struct inode_operations devpts_dir_inode_operations = {
+   .lookup = simple_lookup,
+   .mknod  = devpts_mknod,
+};
+
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -141,7 +188,7 @@ devpts_fill_super(struct super_block *s,
inode->i_blocks = 0;
inode->i_uid = inode->i_gid = 0;
inode->i_mode = S_IFDIR | S_IRUGO | S_IXUGO | S_IWUSR;
-   inode->i_op = &simple_dir_inode_operations;
+   inode->i_op = &devpts_dir_inode_operations;
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
@@ -214,7 +261,7 @@ static int devpts_get_sb(struct file_sys
 static void devpts_kill_sb(struct super_block *sb)
 {
sb->s_fs_info = NULL;
-   kill_anon_super(sb);
+   kill_litter_super(sb);
 }
 
 static struct file_system_type devpts_fs_type = {
@@ -303,7 +350,7 @@ int devpts_pty_new(struct tty_struct *tt
if (!inode)
return -ENOMEM;
 
-   inode->i_ino = number+2;
+   inode->i_ino = number+4;
inode->i_uid = config.s

[Devel] [RFC][PATCH 5/7]: Implement get_pts_ns() and put_pts_ns()

2008-04-08 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 5/7]: Implement get_pts_ns() and put_pts_ns()

Implement get_pts_ns() and put_pts_ns() interfaces.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 include/linux/devpts_fs.h |   21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

Index: 2.6.25-rc8-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc8-mm1.orig/include/linux/devpts_fs.h   2008-04-08 
09:18:23.0 -0700
+++ 2.6.25-rc8-mm1/include/linux/devpts_fs.h2008-04-08 13:36:31.0 
-0700
@@ -27,13 +27,26 @@ struct pts_namespace {
 extern struct pts_namespace init_pts_ns;
 
 #ifdef CONFIG_UNIX98_PTYS
-
 int devpts_new_index(void);
 void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
 void devpts_pty_kill(int number);   /* unlink */
 
+static inline void free_pts_ns(struct kref *ns_kref) { }
+
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+   if (ns && (ns != &init_pts_ns))
+   kref_get(&ns->kref);
+   return ns;
+}
+static inline void put_pts_ns(struct pts_namespace *ns)
+{
+   if (ns && (ns != &init_pts_ns))
+   kref_put(&ns->kref, free_pts_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -43,6 +56,12 @@ static inline int devpts_pty_new(struct 
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
 
+static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
+{
+   return &init_pts_ns;
+}
+
+static inline void put_pts_ns(struct pts_namespace *ns) { }
 #endif
 
 


[Devel] [RFC][PATCH 7/7]: Enable cloning PTY namespaces

2008-04-08 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 7/7]: Enable cloning PTY namespaces

Enable cloning PTY namespaces.

Note:
We are out of clone_flags! This patch depends on Cedric Le Goater's
clone64() patchset.
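A minimal userspace sketch of the copy-on-clone decision, and of why a 64-bit flags word is needed once the 32-bit flag space is exhausted. The flag value CLONE_NEWPTS_SKETCH is a made-up bit above bit 31 chosen for illustration (the real value is defined by the patch), and the names pts_ns_sketch and copy_ns_sketch are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag above bit 31; it cannot be represented in 32 bits. */
#define CLONE_NEWPTS_SKETCH (UINT64_C(1) << 33)

struct pts_ns_sketch {
    int refcount;
};

/*
 * With the flag set, give the child a fresh namespace (NULL if allocation
 * fails). Without it, the child shares the parent's namespace and takes
 * one more reference on it.
 */
struct pts_ns_sketch *copy_ns_sketch(uint64_t flags,
                                     struct pts_ns_sketch *old_ns)
{
    if (flags & CLONE_NEWPTS_SKETCH) {
        struct pts_ns_sketch *ns = calloc(1, sizeof(*ns));

        if (ns)
            ns->refcount = 1;
        return ns;
    }
    old_ns->refcount++;
    return old_ns;
}
```

Truncating such a flag to 32 bits yields zero, which is why the flags argument has to be widened before a new namespace flag can be added.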

Changelog[v2]:
[Serge Hallyn]: Use rcu to access sb->s_fs_info.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>

---
 fs/devpts/inode.c |   84 --
 include/linux/devpts_fs.h |   22 ++--
 include/linux/init_task.h |1 
 include/linux/nsproxy.h   |2 +
 include/linux/sched.h |1 
 kernel/fork.c |2 -
 kernel/nsproxy.c  |   17 -
 7 files changed, 122 insertions(+), 7 deletions(-)

Index: 2.6.25-rc8-mm1/include/linux/sched.h
===
--- 2.6.25-rc8-mm1.orig/include/linux/sched.h   2008-04-08 13:38:08.0 
-0700
+++ 2.6.25-rc8-mm1/include/linux/sched.h2008-04-08 14:27:41.0 
-0700
@@ -28,6 +28,7 @@
 #define CLONE_NEWPID   0x2000  /* New pid namespace */
 #define CLONE_NEWNET   0x4000  /* New network namespace */
 #define CLONE_IO   0x8000  /* Clone io context */
+#define CLONE_NEWPTS   0x0002ULL   /* Clone pts ns */
 
 /*
  * Scheduling policies
Index: 2.6.25-rc8-mm1/include/linux/nsproxy.h
===
--- 2.6.25-rc8-mm1.orig/include/linux/nsproxy.h 2008-04-08 13:38:08.0 
-0700
+++ 2.6.25-rc8-mm1/include/linux/nsproxy.h  2008-04-08 14:27:41.0 
-0700
@@ -8,6 +8,7 @@ struct mnt_namespace;
 struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
+struct pts_namespace;
 
 /*
  * A structure to contain pointers to all per-process
@@ -29,6 +30,7 @@ struct nsproxy {
struct pid_namespace *pid_ns;
struct user_namespace *user_ns;
struct net   *net_ns;
+   struct pts_namespace *pts_ns;
 };
 extern struct nsproxy init_nsproxy;
 
Index: 2.6.25-rc8-mm1/include/linux/init_task.h
===
--- 2.6.25-rc8-mm1.orig/include/linux/init_task.h   2008-04-08 
13:38:08.0 -0700
+++ 2.6.25-rc8-mm1/include/linux/init_task.h2008-04-08 14:27:41.0 
-0700
@@ -78,6 +78,7 @@ extern struct nsproxy init_nsproxy;
.mnt_ns = NULL, \
INIT_NET_NS(net_ns) \
INIT_IPC_NS(ipc_ns) \
+   .pts_ns = &init_pts_ns, \
.user_ns= &init_user_ns,\
 }
 
Index: 2.6.25-rc8-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc8-mm1.orig/include/linux/devpts_fs.h   2008-04-08 
13:38:08.0 -0700
+++ 2.6.25-rc8-mm1/include/linux/devpts_fs.h2008-04-08 14:27:41.0 
-0700
@@ -31,7 +31,7 @@ extern struct pts_namespace init_pts_ns;
 
 static inline struct pts_namespace *current_pts_ns(void)
 {
-   return &init_pts_ns;
+   return current->nsproxy->pts_ns;
 }
 
 static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode)
@@ -61,7 +61,8 @@ struct tty_struct *devpts_get_tty(struct
 /* unlink */
 void devpts_pty_kill(struct pts_namespace *pts_ns, int number);
 
-static inline void free_pts_ns(struct kref *ns_kref) { }
+extern struct pts_namespace *new_pts_ns(void);
+extern void free_pts_ns(struct kref *kref);
 
 static inline struct pts_namespace *get_pts_ns(struct pts_namespace *ns)
 {
@@ -75,6 +76,15 @@ static inline void put_pts_ns(struct pts
kref_put(&ns->kref, free_pts_ns);
 }
 
+static inline struct pts_namespace *copy_pts_ns(u64 flags,
+  struct pts_namespace *old_ns)
+{
+  if (flags & CLONE_NEWPTS)
+  return new_pts_ns();
+  else
+  return get_pts_ns(old_ns);
+}
+
 #else
 
 /* Dummy stubs in the no-pty case */
@@ -90,6 +100,14 @@ static inline struct pts_namespace *get_
 }
 
 static inline void put_pts_ns(struct pts_namespace *ns) { }
+
+static inline struct pts_namespace *copy_pts_ns(u64 flags,
+  struct pts_namespace *old_ns)
+{
+  if (flags & CLONE_NEWPTS)
+  return ERR_PTR(-EINVAL);
+  return old_ns;
+}
 #endif
 
 
Index: 2.6.25-rc8-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc8-mm1.orig/fs/devpts/inode.c   2008-04-08 13:38:08.0 -0700
+++ 2.6.25-rc8-mm1/fs/devpts/inode.c   2008-04-08 14:33:04.0 -0700
@@ -27,6 +27,7 @@
 
 extern int pty_limit;  /* Config limit on U

[Devel] [RFC][PATCH 6/7]: Determine pts_ns from a pty's inode

2008-04-08 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [RFC][PATCH 6/7]: Determine pts_ns from a pty's inode.

The devpts interfaces currently operate on a specific pts namespace
which they get from the 'current' task.

With implementation of containers and cloning of PTS namespaces, we want
to be able to access PTYs in a child-pts-ns from a parent-pts-ns. For
instance we could bind-mount and pivot-root the child container on
'/vserver/vserver1' and then access the "pts/0" of 'vserver1' using 

$ echo foo > /vserver/vserver1/dev/pts/0

The task doing the above 'echo' could be in parent-pts-ns. So we find
the 'pts-ns' of the above file from the inode representing the device
rather than from the 'current' task.

Note that we need to find and hold a reference to the pts_ns to prevent
the pts_ns from being freed while it is being accessed from 'outside'.

This patch implements 'pts_ns_from_inode()', which returns the pts_ns
using 'inode->i_sb->s_fs_info'.

Since the 'inode' information is not visible inside the devpts code itself,
this patch modifies the tty driver code to determine the pts_ns and passes
it into devpts.
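
The superblock-magic test that pts_ns_from_inode() relies on has a
userspace counterpart, which may help when checking whether a given path
(for example '/vserver/vserver1/dev/pts/0') really sits on a devpts mount.
The following is only a minimal sketch: 'path_is_on_devpts()' and
'DEVPTS_MAGIC' are illustrative names, and the only assumption is that
statfs(2) reports the same magic value this patch defines as
DEVPTS_SUPER_MAGIC.

```c
#include <sys/vfs.h>   /* statfs(2) */

/* Same value the patch defines as DEVPTS_SUPER_MAGIC. */
#define DEVPTS_MAGIC 0x1cd1

/*
 * Hypothetical helper: returns 1 if 'path' lives on a devpts
 * filesystem, 0 if it lives on some other filesystem, and -1 if
 * statfs() fails (for instance because the path does not exist).
 */
int path_is_on_devpts(const char *path)
{
	struct statfs sfs;

	if (statfs(path, &sfs) != 0)
		return -1;

	return sfs.f_type == DEVPTS_MAGIC ? 1 : 0;
}
```

This only mirrors the magic-number comparison; it does not tell the
caller which pts namespace the mount belongs to.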

Changelog [v2]:
[Serge Hallyn] Use rcu to access sb->s_fs_info.

[Serge Hallyn] Simplify handling of ptmx and tty devices by expecting
user to create them in /dev/pts (see also devpts-mknod patch)


Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/pty.c|   13 +-
 drivers/char/tty_io.c |   96 +++---
 fs/devpts/inode.c |   19 +++--
 include/linux/devpts_fs.h |   38 +++---
 4 files changed, 134 insertions(+), 32 deletions(-)

Index: 2.6.25-rc8-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc8-mm1.orig/include/linux/devpts_fs.h   2008-04-08 13:36:31.0 -0700
+++ 2.6.25-rc8-mm1/include/linux/devpts_fs.h   2008-04-08 13:38:08.0 -0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct pts_namespace {
struct kref kref;
@@ -26,12 +27,39 @@ struct pts_namespace {
 
 extern struct pts_namespace init_pts_ns;
 
+#define DEVPTS_SUPER_MAGIC 0x1cd1
+
+static inline struct pts_namespace *current_pts_ns(void)
+{
+   return &init_pts_ns;
+}
+
+static inline struct pts_namespace *pts_ns_from_inode(struct inode *inode)
+{
+   /*
+* If this file exists on devpts, return the pts_ns from the
+* devpts super-block. Otherwise just use the pts-ns of the
+* calling task.
+*/
+   if(inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
+   return rcu_dereference(inode->i_sb->s_fs_info);
+
+   return current_pts_ns();
+}
+
+
 #ifdef CONFIG_UNIX98_PTYS
-int devpts_new_index(void);
-void devpts_kill_index(int idx);
-int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
-struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
-void devpts_pty_kill(int number);   /* unlink */
+int devpts_new_index(struct pts_namespace *pts_ns);
+void devpts_kill_index(struct pts_namespace *pts_ns, int idx);
+
+/* mknod in devpts */
+int devpts_pty_new(struct pts_namespace *pts_ns, struct tty_struct *tty);
+
+/* get tty structure */
+struct tty_struct *devpts_get_tty(struct pts_namespace *pts_ns, int number);
+
+/* unlink */
+void devpts_pty_kill(struct pts_namespace *pts_ns, int number);
 
 static inline void free_pts_ns(struct kref *ns_kref) { }
 
Index: 2.6.25-rc8-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc8-mm1.orig/drivers/char/tty_io.c   2008-04-08 09:15:56.0 -0700
+++ 2.6.25-rc8-mm1/drivers/char/tty_io.c   2008-04-08 14:25:11.0 -0700
@@ -2064,8 +2064,8 @@ static void tty_line_name(struct tty_dri
  * relaxed for the (most common) case of reopening a tty.
  */
 
-static int init_dev(struct tty_driver *driver, int idx,
-   struct tty_struct **ret_tty)
+static int init_dev(struct tty_driver *driver, struct pts_namespace *pts_ns,
+   int idx, struct tty_struct **ret_tty)
 {
struct tty_struct *tty, *o_tty;
struct ktermios *tp, **tp_loc, *o_tp, **o_tp_loc;
@@ -2074,7 +2074,11 @@ static int init_dev(struct tty_driver *d
 
/* check whether we're reopening an existing tty */
if (driver->flags & TTY_DRIVER_DEVPTS_MEM) {
-   tty = devpts_get_tty(idx);
+   tty = devpts_get_tty(pts_ns, idx);
+   if (IS_ERR(tty)) {
+   retval = PTR_ERR(tty);
+   goto end_init;
+   }
/*
 * If we don't have a tty here on a slave open, it's because
 * the master already

[Devel] Re: [RFC][PATCH 0/7] Clone PTS namespace

2008-04-09 Thread sukadev
H. Peter Anvin [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> Devpts namespace patchset
>> In continuation of the implementation of containers in mainline, we need
>> to support multiple PTY namespaces so that the PTY index (ie the tty
>> names) in one container is independent of the PTY indices of other
>> containers.  For instance this would allow each container to have a
>> '/dev/pts/0' PTY and refer to different terminals.
>
> Why do we "need" this?  There isn't a fundamental need for this to be a 
> dense numberspace (in fact, there are substantial reasons why it's a bad 
> idea; the only reason the namespace is dense at the moment is because of 
> the hideously bad handling of utmp in glibc.)  Other than indices, this 
> seems to be a more special case of device isolation across namespaces, 
> would that be a more useful problem to solve across the board?

We want to provide isolation between containers, meaning PTYs in container
C1 should not be accessible to processes in C2 (unless C2 is an ancestor).

The other reason for this in the longer term is for checkpoint/restart.
When restarting an application we want to make sure that the PTY indices
it was using are available and isolated.

We started out with isolating just the indices but added the special-case
handling for granting the host visibility into a child-container.

A complete device-namespace could solve this, but IIUC, is being planned
in the longer term. We are hoping this would provide the isolation in the
near-term without being too intrusive or impeding the implementation of
the device namespace.

Sukadev
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 0/3] clone64() and unshare64() system calls

2008-04-09 Thread sukadev

This is a resend of the patch set Cedric had sent earlier. I ported
the patch set to 2.6.25-rc8-mm1 and tested on x86 and x86_64.
---

We have run out of the 32 bits in clone_flags!

This patchset introduces 2 new system calls which support 64bit clone-flags.

 long sys_clone64(unsigned long flags_high, unsigned long flags_low,
unsigned long newsp);

 long sys_unshare64(unsigned long flags_high, unsigned long flags_low);

The current version of clone64() does not support CLONE_PARENT_SETTID and 
CLONE_CHILD_CLEARTID because we would exceed the 6 registers limit of some 
arches. It's possible to get around this limitation but we might not
need it as we already have clone()
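
To illustrate the split interface: the two halves would presumably be
recombined into a single 64-bit flag word before reaching the common fork
code. The sketch below is not taken from the patch; 'join_clone_flags()'
is a hypothetical helper name, and the masking is an assumption made so
that the same code behaves identically when 'unsigned long' is 64 bits
wide.

```c
#include <stdint.h>

/*
 * Hypothetical helper: rebuild one 64-bit flag word from the two
 * halves passed to sys_clone64()/sys_unshare64(). Each half is
 * masked to 32 bits so that a 64-bit 'unsigned long' carrying
 * stray upper bits cannot corrupt the result.
 */
uint64_t join_clone_flags(unsigned long flags_high, unsigned long flags_low)
{
	return ((uint64_t)(flags_high & 0xffffffffUL) << 32) |
	       (uint64_t)(flags_low & 0xffffffffUL);
}
```

With this layout the existing 32-bit clone flags stay in flags_low and any
new flags go into flags_high.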

This is work in progress but already includes support for x86, x86_64, 
x86_64(32), ppc64, ppc64(32), s390x, s390x(31). 

ia64 already supports 64-bit clone flags through the clone2() syscall.
Should we harmonize the name to clone2?


[Devel] [PATCH 1/3] change clone_flags type to u64

2008-04-09 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [lxc-dev] [patch -lxc 1/3] change clone_flags type to u64

This is a preliminary patch changing the clone_flags type to 64bits
for all the routines called by do_fork(). 

It prepares ground for the next patch which introduces an enhanced 
version of clone() supporting 64bits flags.

This is work in progress. All conversions might not be done yet.

Signed-off-by: Cedric Le Goater <[EMAIL PROTECTED]>
Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 arch/alpha/kernel/process.c |2 +-
 arch/arm/kernel/process.c   |2 +-
 arch/avr32/kernel/process.c |2 +-
 arch/blackfin/kernel/process.c  |2 +-
 arch/cris/arch-v10/kernel/process.c |2 +-
 arch/cris/arch-v32/kernel/process.c |2 +-
 arch/frv/kernel/process.c   |2 +-
 arch/h8300/kernel/process.c |2 +-
 arch/ia64/ia32/sys_ia32.c   |2 +-
 arch/ia64/kernel/process.c  |2 +-
 arch/m32r/kernel/process.c  |2 +-
 arch/m68k/kernel/process.c  |2 +-
 arch/m68knommu/kernel/process.c |2 +-
 arch/mips/kernel/process.c  |2 +-
 arch/mn10300/kernel/process.c   |2 +-
 arch/parisc/kernel/process.c|2 +-
 arch/powerpc/kernel/process.c   |2 +-
 arch/s390/kernel/process.c  |2 +-
 arch/sh/kernel/process_32.c |2 +-
 arch/sh/kernel/process_64.c |2 +-
 arch/sparc/kernel/process.c |2 +-
 arch/sparc64/kernel/process.c   |2 +-
 arch/um/kernel/process.c|2 +-
 arch/v850/kernel/process.c  |2 +-
 arch/x86/kernel/process_32.c|2 +-
 arch/x86/kernel/process_64.c|2 +-
 arch/xtensa/kernel/process.c|2 +-
 fs/namespace.c  |2 +-
 include/linux/ipc_namespace.h   |4 ++--
 include/linux/key.h |2 +-
 include/linux/mnt_namespace.h   |2 +-
 include/linux/nsproxy.h |4 ++--
 include/linux/pid_namespace.h   |4 ++--
 include/linux/sched.h   |6 --
 include/linux/security.h|6 +++---
 include/linux/sem.h |4 ++--
 include/linux/user_namespace.h  |4 ++--
 include/linux/utsname.h |4 ++--
 include/net/net_namespace.h |4 ++--
 ipc/namespace.c |2 +-
 ipc/sem.c   |2 +-
 kernel/fork.c   |   36 ++--
 kernel/nsproxy.c|6 +++---
 kernel/pid_namespace.c  |2 +-
 kernel/user_namespace.c |2 +-
 kernel/utsname.c|2 +-
 net/core/net_namespace.c|4 ++--
 security/dummy.c|2 +-
 security/keys/process_keys.c|2 +-
 security/security.c |2 +-
 security/selinux/hooks.c|2 +-
 51 files changed, 83 insertions(+), 81 deletions(-)

Index: 2.6.25-rc2-mm1/arch/alpha/kernel/process.c
===
--- 2.6.25-rc2-mm1.orig/arch/alpha/kernel/process.c
+++ 2.6.25-rc2-mm1/arch/alpha/kernel/process.c
@@ -270,7 +270,7 @@ alpha_vfork(struct pt_regs *regs)
  */

 int
-copy_thread(int nr, unsigned long clone_flags, unsigned long usp,
+copy_thread(int nr, u64 clone_flags, unsigned long usp,
unsigned long unused,
struct task_struct * p, struct pt_regs * regs)
 {
Index: 2.6.25-rc2-mm1/arch/arm/kernel/process.c
===
--- 2.6.25-rc2-mm1.orig/arch/arm/kernel/process.c
+++ 2.6.25-rc2-mm1/arch/arm/kernel/process.c
@@ -331,7 +331,7 @@ void release_thread(struct task_struct *
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");

 int
-copy_thread(int nr, unsigned long clone_flags, unsigned long stack_start,
+copy_thread(int nr, u64 clone_flags, unsigned long stack_start,
unsigned long stk_sz, struct task_struct *p, struct pt_regs *regs)
 {
struct thread_info *thread = task_thread_info(p);
Index: 2.6.25-rc2-mm1/arch/avr32/kernel/process.c
===
--- 2.6.25-rc2-mm1.orig/arch/avr32/kernel/process.c
+++ 2.6.25-rc2-mm1/arch/avr32/kernel/process.c
@@ -325,7 +325,7 @@ int dump_fpu(struct pt_regs *regs, elf_f

 asmlinkage void ret_from_fork(void);

-int copy_thread(int nr, unsigned long clone_flags, unsigned long usp,
+int copy_thread(int nr, u64 clone_flags, unsigned long usp,
unsigned long unused,
struct task_struct *p, struct pt_regs *regs)
 {
Index: 2.6.25-rc2-mm1/arch/blackfin/kernel/process.c
===
--- 2.6.25-rc2-mm1.orig/arch/blackfin/kernel/process.c
+++ 2.6.25-rc2-mm1/arch/blackfin/kernel/process.c
@@ -168,7 +168,7 @@ asmlinkage i

[Devel] [PATCH 2/3] add do_unshare()

2008-04-09 Thread sukadev

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 2/3] add do_unshare()

This patch adds a do_unshare() routine which will be common
to the unshare() and unshare64() syscall.

Signed-off-by: Cedric Le Goater <[EMAIL PROTECTED]>
Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 kernel/fork.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: 2.6.25-rc2-mm1/kernel/fork.c
===
--- 2.6.25-rc2-mm1.orig/kernel/fork.c
+++ 2.6.25-rc2-mm1/kernel/fork.c
@@ -1696,7 +1696,7 @@ static int unshare_semundo(u64 unshare_f
  * constructed. Here we are modifying the current, active,
  * task_struct.
  */
-asmlinkage long sys_unshare(unsigned long unshare_flags)
+static long do_unshare(u64 unshare_flags)
 {
int err = 0;
struct fs_struct *fs, *new_fs = NULL;
@@ -1790,3 +1790,8 @@ bad_unshare_cleanup_thread:
 bad_unshare_out:
return err;
 }
+
+asmlinkage long sys_unshare(unsigned long unshare_flags)
+{
+   return do_unshare(unshare_flags);
+}



[Devel] [PATCH 3/3] add the clone64() and unshare64() syscalls

2008-04-09 Thread sukadev
From: Cedric Le Goater <[EMAIL PROTECTED]>
Subject: [PATCH 3/3] add the clone64() and unshare64() syscalls

This patch adds 2 new syscalls :

 long sys_clone64(unsigned long flags_high, unsigned long flags_low,
unsigned long newsp);

 long sys_unshare64(unsigned long flags_high, unsigned long flags_low);

The current version of clone64() does not support CLONE_PARENT_SETTID and 
CLONE_CHILD_CLEARTID because we would exceed the 6 registers limit of some 
arches. It's possible to get around this limitation but we might not
need it as we already have clone()

Signed-off-by: Cedric Le Goater <[EMAIL PROTECTED]>
Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>

---
 arch/powerpc/kernel/entry_32.S |8 
 arch/powerpc/kernel/entry_64.S |5 +
 arch/powerpc/kernel/process.c  |   15 +++
 arch/s390/kernel/compat_linux.c|   16 
 arch/s390/kernel/compat_wrapper.S  |6 ++
 arch/s390/kernel/process.c |   15 +++
 arch/s390/kernel/syscalls.S|2 ++
 arch/x86/ia32/ia32entry.S  |4 
 arch/x86/ia32/sys_ia32.c   |   12 
 arch/x86/kernel/entry_64.S |1 +
 arch/x86/kernel/process_32.c   |   14 ++
 arch/x86/kernel/process_64.c   |   15 +++
 arch/x86/kernel/syscall_table_32.S |2 ++
 include/asm-powerpc/systbl.h   |2 ++
 include/asm-powerpc/unistd.h   |4 +++-
 include/asm-s390/unistd.h  |4 +++-
 include/asm-x86/unistd_32.h|2 ++
 include/asm-x86/unistd_64.h|4 
 include/linux/syscalls.h   |3 +++
 kernel/fork.c  |7 +++
 kernel/sys_ni.c|3 +++
 21 files changed, 142 insertions(+), 2 deletions(-)

Index: 2.6.25-rc2-mm1/arch/s390/kernel/syscalls.S
===
--- 2.6.25-rc2-mm1.orig/arch/s390/kernel/syscalls.S   2008-02-27 15:17:34.0 -0800
+++ 2.6.25-rc2-mm1/arch/s390/kernel/syscalls.S   2008-03-06 22:08:49.0 -0800
@@ -330,3 +330,5 @@ SYSCALL(sys_eventfd,sys_eventfd,sys_even
 SYSCALL(sys_timerfd_create,sys_timerfd_create,sys_timerfd_create_wrapper)
 SYSCALL(sys_timerfd_settime,sys_timerfd_settime,compat_sys_timerfd_settime_wrapper)  /* 320 */
 SYSCALL(sys_timerfd_gettime,sys_timerfd_gettime,compat_sys_timerfd_gettime_wrapper)
+SYSCALL(sys_clone64,sys_clone64,sys32_clone64)
+SYSCALL(sys_unshare64,sys_unshare64,sys_unshare64_wrapper)
Index: 2.6.25-rc2-mm1/arch/x86/kernel/syscall_table_32.S
===
--- 2.6.25-rc2-mm1.orig/arch/x86/kernel/syscall_table_32.S   2008-02-27 15:17:35.0 -0800
+++ 2.6.25-rc2-mm1/arch/x86/kernel/syscall_table_32.S   2008-03-06 22:08:49.0 -0800
@@ -326,3 +326,5 @@ ENTRY(sys_call_table)
.long sys_fallocate
.long sys_timerfd_settime   /* 325 */
.long sys_timerfd_gettime
+   .long sys_clone64
+   .long sys_unshare64
Index: 2.6.25-rc2-mm1/include/asm-powerpc/systbl.h
===
--- 2.6.25-rc2-mm1.orig/include/asm-powerpc/systbl.h   2008-02-27 15:18:12.0 -0800
+++ 2.6.25-rc2-mm1/include/asm-powerpc/systbl.h   2008-03-06 22:08:49.0 -0800
@@ -316,3 +316,5 @@ COMPAT_SYS(fallocate)
 SYSCALL(subpage_prot)
 COMPAT_SYS_SPU(timerfd_settime)
 COMPAT_SYS_SPU(timerfd_gettime)
+PPC_SYS(clone64)
+SYSCALL_SPU(unshare64)
Index: 2.6.25-rc2-mm1/include/asm-powerpc/unistd.h
===
--- 2.6.25-rc2-mm1.orig/include/asm-powerpc/unistd.h   2008-02-27 15:18:12.0 -0800
+++ 2.6.25-rc2-mm1/include/asm-powerpc/unistd.h   2008-03-06 22:08:49.0 -0800
@@ -335,10 +335,12 @@
 #define __NR_subpage_prot  310
 #define __NR_timerfd_settime   311
 #define __NR_timerfd_gettime   312
+#define __NR_clone64   313
+#define __NR_unshare64 314
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls  313
+#define __NR_syscalls  315
 
 #define __NR__exit __NR_exit
 #define NR_syscalls__NR_syscalls
Index: 2.6.25-rc2-mm1/include/asm-s390/unistd.h
===
--- 2.6.25-rc2-mm1.orig/include/asm-s390/unistd.h   2008-02-27 15:18:13.0 -0800
+++ 2.6.25-rc2-mm1/include/asm-s390/unistd.h   2008-03-06 22:08:49.0 -0800
@@ -259,7 +259,9 @@
 #define __NR_timerfd_create319
 #define __NR_timerfd_settime   320
 #define __NR_timerfd_gettime   321
-#define NR_syscalls 322
+#define __NR_clone64   322
+#define __NR_unshare64 323
+#define NR_syscalls 324
 
 /* 
  * There are some system calls that are not present on 64 bit, some
Index: 2.6.25-rc2-mm1/include/asm-x86/unistd_32.h
===
--- 2.6.25-rc2-mm1.orig/incl

[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls

2008-04-09 Thread sukadev
H. Peter Anvin [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> This is a resend of the patch set Cedric had sent earlier. I ported
>> the patch set to 2.6.25-rc8-mm1 and tested on x86 and x86_64.
>> ---
>> We have run out of the 32 bits in clone_flags !
>> This patchset introduces 2 new system calls which support 64bit
>> clone-flags.
>>  long sys_clone64(unsigned long flags_high, unsigned long flags_low,
>>  unsigned long newsp);
>>  long sys_unshare64(unsigned long flags_high, unsigned long flags_low);
>> The current version of clone64() does not support CLONE_PARENT_SETTID and 
>> CLONE_CHILD_CLEARTID because we would exceed the 6 registers limit of some 
>> arches. It's possible to get around this limitation but we might not
>> need it as we already have clone()
>
> I really dislike this interface.
>
> If you're going to make it a 64-bit pass it in as a 64-bit number, instead 
> of breaking it into two numbers.

Maybe I am missing your point. The glibc interface could take a 64-bit
parameter, but don't we need to pass 32-bit values into the system call
on 32-bit systems?

> Better yet, IMO, would be to pass a pointer to a structure like:
>
> struct shared {
>   unsigned long nwords;
>   unsigned long flags[];
> };
>
> ... which can be expanded indefinitely.

Yes, this was discussed before in the context of Pavel Emelyanov's patch

http://lkml.org/lkml/2008/1/16/109

along with sys_indirect().  While there was no consensus, it looked like
adding a new system call was better than open ended interfaces.
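
For reference, a lookup against the expandable bitmap suggested above
could look roughly like the sketch below. Only 'struct shared' comes from
the quoted mail; 'shared_flag_is_set()' and 'BITS_PER_WORD' are
hypothetical names, and treating bits beyond 'nwords' as clear is an
assumption about how an older caller passing a shorter array would be
handled.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */

/* Structure as proposed in the quoted mail. */
struct shared {
	unsigned long nwords;
	unsigned long flags[];
};

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: returns 1 if bit number 'bit' is set in the
 * bitmap, 0 otherwise. Bits that lie beyond the 'nwords' words the
 * caller supplied are treated as clear (an assumption).
 */
int shared_flag_is_set(const struct shared *s, size_t bit)
{
	size_t word = bit / BITS_PER_WORD;

	if (word >= s->nwords)
		return 0;

	return (int)((s->flags[word] >> (bit % BITS_PER_WORD)) & 1UL);
}
```

In the kernel the structure would first have to be copied in from user
space, which is the extra cost this kind of interface carries compared
with passing the flags in registers.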


[Devel] Re: [PATCH 3/3] add the clone64() and unshare64() syscalls

2008-04-09 Thread sukadev
Jakub Jelinek [EMAIL PROTECTED] wrote:
| On Wed, Apr 09, 2008 at 03:34:59PM -0700, [EMAIL PROTECTED] wrote:
| > From: Cedric Le Goater <[EMAIL PROTECTED]>
| > Subject: [PATCH 3/3] add the clone64() and unshare64() syscalls
| > 
| > This patch adds 2 new syscalls :
| > 
| >  long sys_clone64(unsigned long flags_high, unsigned long flags_low,
| > unsigned long newsp);
| > 
| >  long sys_unshare64(unsigned long flags_high, unsigned long flags_low);
| 
| Can you explain why are you adding it for 64-bit arches too?  unsigned long
| is there already 64-bit, and both sys_clone and sys_unshare have unsigned
| long flags, rather than unsigned int.

Hmm,

By simply reusing clone() on 64-bit and adding a new call for 32-bit, won't
the semantics of clone() differ between the two?

I.e., clone() on 64-bit would support, say, CLONE_NEWPTS, while clone() on
32-bit would not?

Wouldn't it be simpler/cleaner if clone() and clone64() behaved the same
on both 32 and 64 bit systems ?
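
A small sketch of the concern: a flag above bit 31 fits in the
'unsigned long' that clone() takes on a 64-bit kernel, but is lost when
the argument is only 32 bits wide. The flag value used here is purely
hypothetical (the real CLONE_NEWPTS value is not shown in this thread),
and 'fits_in_32bit_flags()' is an illustrative name.

```c
#include <stdint.h>

/*
 * Hypothetical flag value for illustration only; the real
 * CLONE_NEWPTS value is not shown in this thread.
 */
#define EXAMPLE_HIGH_FLAG (UINT64_C(1) << 32)

/*
 * Returns 1 if 'flags' survives being passed through a 32-bit
 * argument unchanged, 0 if upper bits would be truncated away.
 */
int fits_in_32bit_flags(uint64_t flags)
{
	return (uint64_t)(uint32_t)flags == flags;
}
```

So a flag like EXAMPLE_HIGH_FLAG could be requested through clone() only
on 64-bit systems, which is the difference in semantics asked about above.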

Sukadev


[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls

2008-04-09 Thread sukadev
H. Peter Anvin [EMAIL PROTECTED] wrote:
>> Yes, this was discussed before in the context of Pavel Emelyanov's patch
>>  http://lkml.org/lkml/2008/1/16/109
>> along with sys_indirect().  While there was no consensus, it looked like
>> adding a new system call was better than open ended interfaces.
>
> That's not really an open-ended interface, it's just an expandable bitmap.

Yes, we liked such an approach earlier too, and it's conceivable that we
will run out of the 64 bits too :-)

But as Jon Corbet pointed out in the thread above, it looked like
adding a new system call has been the "traditional" way of solving this
in Linux so far and there has been no consensus on a newer approach.

Sukadev


[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls

2008-04-10 Thread sukadev
Paul Menage [EMAIL PROTECTED] wrote:
| On Wed, Apr 9, 2008 at 7:38 PM,  <[EMAIL PROTECTED]> wrote:
| >
| >  But as Jon Corbet pointed out in the thread above, it looked like
| >  adding a new system call has been the "traditional" way of solving this
| >  in Linux so far and there has been no consensus on a newer approach.
| >
| 
| I thought that the consensus was that adding a new system call was
| better than trying to force extensibility on to the existing
| non-extensible system call.

There were a couple of objections to extensible system calls like
sys_indirect() and to Pavel's approach.

| 
| But if we are adding a new system call, why not make the new one
| extensible to reduce the need for yet another new call in the future?

Hypothetically, can we make a variant of clone() extensible to the point
of requiring a copy_from_user()?

| 
| Paul


[Devel] Re: [RFC][PATCH 0/7] Clone PTS namespace

2008-04-10 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:
| > 
| > Further what I did for the network namespace should easily handle the
| > uid/gid namespace and should be a good starting place for a general
| > device namespace.
| 
| Agreed.  What's the git url and which branch do i use for your proof
| of concept tree?  I'll do the userns patch on top of that.  I assume
| Suka will do the same for ptys?
| 

Sure.

BTW, can we push the following 3 helper patches in the set? I believe
they will be required to support multiple pts namespaces, even if
the actual way we do it is not final yet.

[PATCH 1/7]: Propagate error code from devpts_pty_new
[PATCH 2/7]: Factor out PTY index allocation
[PATCH 3/7]: Enable multiple mounts of /dev/pts

Sukadev


[Devel] [PATCH 0/4] Helper patches for PTY namespaces

2008-04-12 Thread sukadev

Some simple helper patches to enable implementation of multiple PTY
(or device) namespaces.

[PATCH 1/4]: Propagate error code from devpts_pty_new
[PATCH 2/4]: Factor out PTY index allocation
[PATCH 3/4]: Move devpts globals into init_pts_ns
[PATCH 4/4]: Enable multiple mounts of /dev/pts

This patchset is based on earlier versions developed by Serge Hallyn
and Matt Helsley.


[Devel] [PATCH 1/4]: Propagate error code from devpts_pty_new

2008-04-12 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 1/4]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new()
(which returns either 0 or -ENOMEM anyway).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc8-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc8-mm1.orig/drivers/char/tty_io.c   2008-04-07 14:49:56.0 -0700
+++ 2.6.25-rc8-mm1/drivers/char/tty_io.c   2008-04-09 13:54:00.0 -0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
filp->private_data = tty;
file_move(filp, &tty->tty_files);
 
-   retval = -ENOMEM;
-   if (devpts_pty_new(tty->link))
+   retval = devpts_pty_new(tty->link);
+   if (retval)
goto out1;
 
check_tty_count(tty, "tty_open");


[Devel] [PATCH 2/4]: Factor out PTY index allocation

2008-04-12 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 2/4]: Factor out PTY index allocation

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index().  This localizes the external
data structures used in managing the pts indices.
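
As a rough userspace analogue of the factored-out interface, the sketch
below keeps the index table and its lock private to one file and exposes
only an allocate and a free routine. It is not the kernel code: the IDR is
replaced by a plain array, and 'example_new_index()',
'example_kill_index()' and 'EXAMPLE_PTY_LIMIT' are hypothetical names. The
-EIO return on exhaustion follows what the removed ptmx_open() code did
when the index reached pty_limit.

```c
#include <errno.h>
#include <pthread.h>

#define EXAMPLE_PTY_LIMIT 8   /* stand-in for the kernel's pty_limit */

/* Both the table and its lock are private to this file. */
static unsigned char index_in_use[EXAMPLE_PTY_LIMIT];
static pthread_mutex_t index_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the lowest free index, or -EIO when the limit is reached. */
int example_new_index(void)
{
	int i, ret = -EIO;

	pthread_mutex_lock(&index_lock);
	for (i = 0; i < EXAMPLE_PTY_LIMIT; i++) {
		if (!index_in_use[i]) {
			index_in_use[i] = 1;
			ret = i;
			break;
		}
	}
	pthread_mutex_unlock(&index_lock);
	return ret;
}

/* Makes 'idx' available for reallocation; bad indices are ignored. */
void example_kill_index(int idx)
{
	if (idx < 0 || idx >= EXAMPLE_PTY_LIMIT)
		return;

	pthread_mutex_lock(&index_lock);
	index_in_use[idx] = 0;
	pthread_mutex_unlock(&index_lock);
}
```

Callers then never touch the table or the lock directly, which is what
lets a later patch move that state into a per-namespace structure.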

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>

---
 drivers/char/tty_io.c |   40 ++--
 fs/devpts/inode.c |   42 +-
 include/linux/devpts_fs.h |4 
 3 files changed, 51 insertions(+), 35 deletions(-)

Index: 2.6.25-rc5-mm1/include/linux/devpts_fs.h
===
--- 2.6.25-rc5-mm1.orig/include/linux/devpts_fs.h   2008-03-24 20:04:07.0 -0700
+++ 2.6.25-rc5-mm1/include/linux/devpts_fs.h   2008-03-24 20:04:26.0 -0700
@@ -17,6 +17,8 @@
 
 #ifdef CONFIG_UNIX98_PTYS
 
+int devpts_new_index(void);
+void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
 void devpts_pty_kill(int number);   /* unlink */
@@ -24,6 +26,8 @@ void devpts_pty_kill(int number);  /* u
 #else
 
 /* Dummy stubs in the no-pty case */
+static inline int devpts_new_index(void) { return -EINVAL; }
+static inline void devpts_kill_index(int idx) { }
 static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; }
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
Index: 2.6.25-rc5-mm1/drivers/char/tty_io.c
===
--- 2.6.25-rc5-mm1.orig/drivers/char/tty_io.c   2008-03-24 20:04:07.0 -0700
+++ 2.6.25-rc5-mm1/drivers/char/tty_io.c   2008-03-24 20:04:26.0 -0700
@@ -91,7 +91,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex);
 
 #ifdef CONFIG_UNIX98_PTYS
 extern struct tty_driver *ptm_driver;  /* Unix98 pty masters; for /dev/ptmx */
-extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
-static DEFINE_MUTEX(allocated_ptys_lock);
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
@@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil
 */
release_tty(tty, idx);
 
-#ifdef CONFIG_UNIX98_PTYS
/* Make this pty number available for reallocation */
-   if (devpts) {
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
-   mutex_unlock(&allocated_ptys_lock);
-   }
-#endif
-
+   if (devpts)
+   devpts_kill_index(idx);
 }
 
 /**
@@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode
struct tty_struct *tty;
int retval;
int index;
-   int idr_ret;
 
nonseekable_open(inode, filp);
 
/* find a device that is not in use. */
-   mutex_lock(&allocated_ptys_lock);
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
-   mutex_unlock(&allocated_ptys_lock);
-   return -ENOMEM;
-   }
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
-   if (idr_ret < 0) {
-   mutex_unlock(&allocated_ptys_lock);
-   if (idr_ret == -EAGAIN)
-   return -ENOMEM;
-   return -EIO;
-   }
-   if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
-   return -EIO;
-   }
-   mutex_unlock(&allocated_ptys_lock);
+   index = devpts_new_index();
+   if (index < 0)
+   return index;
 
mutex_lock(&tty_mutex);
retval = init_dev(ptm_driver, index, &tty);
@@ -2847,9 +2821,7 @@ out1:
release_dev(filp);
return retval;
 out:
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
+   devpts_kill_index(index);
return retval;
 }
 #endif
Index: 2.6.25-rc5-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc5-mm1.orig/fs/devpts/inode.c   2008-03-24 20:04:07.0 -0700
+++ 2.6.25-rc5-mm1/fs/devpts/inode.c   2008-03-24 20:04:26.0 -0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,10 @@
 
 #define DEVPTS_DEFAULT_MODE 0600
 
+extern int pty_limit;  /* Config limit on Unix98 ptys */
+static DEFINE_IDR(allocated_ptys);
+static DECLARE_

[Devel] [PATCH 3/4]: Move devpts globals into init_pts_ns

2008-04-12 Thread sukadev
Matt, Serge, please sign-off on this version.
---
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 3/4]: Move devpts globals into init_pts_ns

Move devpts global variables 'allocated_ptys' and 'devpts_mnt' into a new
'pts_namespace' and remove the 'devpts_root'.

Changelog: 
- Split these relatively simpler changes off from the patch that
  supports remounting devpts.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   84 --
 include/linux/devpts_fs.h |   10 +
 2 files changed, 70 insertions(+), 24 deletions(-)

Index: 2.6.25-rc8-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc8-mm1.orig/fs/devpts/inode.c   2008-04-11 10:12:09.0 -0700
+++ 2.6.25-rc8-mm1/fs/devpts/inode.c   2008-04-12 10:10:33.0 -0700
@@ -28,12 +28,8 @@
 #define DEVPTS_DEFAULT_MODE 0600
 
 extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DECLARE_MUTEX(allocated_ptys_lock);
 
-static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
-
 static struct {
int setuid;
int setgid;
@@ -54,6 +50,14 @@ static match_table_t tokens = {
{Opt_err, NULL}
 };
 
+struct pts_namespace init_pts_ns = {
+   .kref = {
+   .refcount = ATOMIC_INIT(2),
+   },
+   .allocated_ptys = IDR_INIT(init_pts_ns.allocated_ptys),
+   .mnt = NULL,
+};
+
 static int devpts_remount(struct super_block *sb, int *flags, char *data)
 {
char *p;
@@ -140,7 +144,7 @@ devpts_fill_super(struct super_block *s,
inode->i_fop = &simple_dir_operations;
inode->i_nlink = 2;
 
-   devpts_root = s->s_root = d_alloc_root(inode);
+   s->s_root = d_alloc_root(inode);
if (s->s_root)
return 0;

@@ -168,10 +172,9 @@ static struct file_system_type devpts_fs
  * to the System V naming convention
  */
 
-static struct dentry *get_node(int num)
+static struct dentry *get_node(struct dentry *root, int num)
 {
char s[12];
-   struct dentry *root = devpts_root;
mutex_lock(&root->d_inode->i_mutex);
return lookup_one_len(s, root, sprintf(s, "%d", num));
 }
@@ -180,14 +183,17 @@ int devpts_new_index(void)
 {
int index;
int idr_ret;
+   struct pts_namespace *pts_ns;
+
+   pts_ns = &init_pts_ns;
 
 retry:
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
+   if (!idr_pre_get(&pts_ns->allocated_ptys, GFP_KERNEL)) {
return -ENOMEM;
}
 
down(&allocated_ptys_lock);
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
+   idr_ret = idr_get_new(&pts_ns->allocated_ptys, NULL, &index);
if (idr_ret < 0) {
up(&allocated_ptys_lock);
if (idr_ret == -EAGAIN)
@@ -196,7 +202,7 @@ retry:
}
 
if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
+   idr_remove(&pts_ns->allocated_ptys, index);
up(&allocated_ptys_lock);
return -EIO;
}
@@ -206,8 +212,10 @@ retry:
 
 void devpts_kill_index(int idx)
 {
+   struct pts_namespace *pts_ns = &init_pts_ns;
+
down(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
+   idr_remove(&pts_ns->allocated_ptys, idx);
up(&allocated_ptys_lock);
 }
 
@@ -217,12 +225,26 @@ int devpts_pty_new(struct tty_struct *tt
struct tty_driver *driver = tty->driver;
dev_t device = MKDEV(driver->major, driver->minor_start+number);
struct dentry *dentry;
-   struct inode *inode = new_inode(devpts_mnt->mnt_sb);
+   struct dentry *root;
+   struct inode *inode;
+   struct pts_namespace *pts_ns;
 
/* We're supposed to be given the slave end of a pty */
BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY);
BUG_ON(driver->subtype != PTY_TYPE_SLAVE);
 
+   pts_ns = &init_pts_ns;
+   root = pts_ns->mnt->mnt_root;
+
+   mutex_lock(&root->d_inode->i_mutex);
+   inode = idr_find(&pts_ns->allocated_ptys, number);
+   mutex_unlock(&root->d_inode->i_mutex);
+
+   if (inode && !IS_ERR(inode))
+   return -EEXIST;
+
+   inode = new_inode(pts_ns->mnt->mnt_sb);
+
if (!inode)
return -ENOMEM;
 
@@ -232,23 +254,28 @@ int devpts_pty_new(struct tty_struct *tt
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
init_special_inode(inode, S_IFCHR|config.mode, device);
inode->i_private = tty;
+   idr_replace(&pts_ns->allocated_ptys, inode, number);
 
-   den

[Devel] [PATCH 4/4]: Enable multiple mounts of /dev/pts

2008-04-12 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH 4/4]: Enable multiple mounts of /dev/pts

To support multiple PTY namespaces, allow multiple mounts of /dev/pts, once
within each PTY namespace.

This patch removes the get_sb_single() in devpts_get_sb() and uses test and
set sb interfaces to allow remounting /dev/pts.

Changelog [v4]:
- Split off the simpler changes of moving global variables into
  'pts_namespace' to the previous patch.

Changelog [v3]:
- Removed some unnecessary comments from devpts_set_sb()

Changelog [v2]:

- (Pavel Emelianov/Serge Hallyn) Remove reference to pts_ns from
  sb->s_fs_info to fix the circular reference (/dev/pts is not
  unmounted unless the pts_ns is destroyed, so we don't need a
  reference to the pts_ns).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 fs/devpts/inode.c |   62 +++---
 1 file changed, 59 insertions(+), 3 deletions(-)

Index: 2.6.25-rc8-mm1/fs/devpts/inode.c
===
--- 2.6.25-rc8-mm1.orig/fs/devpts/inode.c   2008-04-12 10:10:33.0 -0700
+++ 2.6.25-rc8-mm1/fs/devpts/inode.c        2008-04-12 10:10:38.0 -0700
@@ -154,17 +154,73 @@ fail:
return -ENOMEM;
 }
 
+/*
+ * We use test and set super-block operations to help determine whether we
+ * need a new super-block for this namespace. get_sb() walks the list of
+ * existing devpts supers, comparing them with the @data ptr. Since we
+ * passed 'current's namespace as the @data pointer we can compare the
+ * namespace pointer in the super-block's 's_fs_info'.  If the test is
+ * TRUE then get_sb() returns a new active reference to the super block.
+ * Otherwise, it helps us build an active reference to a new one.
+ */
+
+static int devpts_test_sb(struct super_block *sb, void *data)
+{
+   return sb->s_fs_info == data;
+}
+
+static int devpts_set_sb(struct super_block *sb, void *data)
+{
+   sb->s_fs_info = data;
+   return set_anon_super(sb, NULL);
+}
+
 static int devpts_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-   return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+   struct super_block *sb;
+   struct pts_namespace *ns;
+   int err;
+
+   /* hereafter we're very similar to proc_get_sb */
+   if (flags & MS_KERNMOUNT)
+   ns = data;
+   else
+   ns = &init_pts_ns;
+
+   /* hereafter we're very similar to get_sb_nodev */
+   sb = sget(fs_type, devpts_test_sb, devpts_set_sb, ns);
+   if (IS_ERR(sb))
+   return PTR_ERR(sb);
+
+   if (sb->s_root)
+   return simple_set_mnt(mnt, sb);
+
+   sb->s_flags = flags;
+   err = devpts_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
+   if (err) {
+   up_write(&sb->s_umount);
+   deactivate_super(sb);
+   return err;
+   }
+
+   sb->s_flags |= MS_ACTIVE;
+   ns->mnt = mnt;
+
+   return simple_set_mnt(mnt, sb);
+}
+
+static void devpts_kill_sb(struct super_block *sb)
+{
+   sb->s_fs_info = NULL;
+   kill_anon_super(sb);
 }
 
 static struct file_system_type devpts_fs_type = {
.owner  = THIS_MODULE,
.name   = "devpts",
.get_sb = devpts_get_sb,
-   .kill_sb= kill_anon_super,
+   .kill_sb= devpts_kill_sb,
 };
 
 /*
@@ -315,7 +371,7 @@ static int __init init_devpts_fs(void)
 
err = register_filesystem(&devpts_fs_type);
if (!err) {
-   mnt = kern_mount(&devpts_fs_type);
+   mnt = kern_mount_data(&devpts_fs_type, &init_pts_ns);
if (IS_ERR(mnt))
err = PTR_ERR(mnt);
else
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 0/4] Helper patches for PTY namespaces

2008-04-12 Thread sukadev
Subrata Modak [EMAIL PROTECTED] wrote:
| Sukadev,
| 
| Any corresponding test cases for LTP. We just have UTS, PID & SYSVIPC
| Namespace till now.

I had some unit-tests that I used with the clone-pts-ns patchset.
But the patches in this set are just helpers and should not change
existing functionality.  

I can send the tests I used (they are not in LTP format) when I tested
the clone-pts-ns patchset.

Thanks,

Sukadev


[Devel] [PATCH]: Propagate error code from devpts_pty_new

2008-04-16 Thread sukadev
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH]: Propagate error code from devpts_pty_new

Have ptmx_open() propagate any error code returned by devpts_pty_new()
(which returns either 0 or -ENOMEM anyway).

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 drivers/char/tty_io.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: 2.6.25-rc8-mm2/drivers/char/tty_io.c
===
--- 2.6.25-rc8-mm2.orig/drivers/char/tty_io.c   2008-04-16 09:38:23.0 -0700
+++ 2.6.25-rc8-mm2/drivers/char/tty_io.c        2008-04-16 09:51:11.0 -0700
@@ -2835,8 +2835,8 @@ static int ptmx_open(struct inode *inode
filp->private_data = tty;
file_move(filp, &tty->tty_files);
 
-   retval = -ENOMEM;
-   if (devpts_pty_new(tty->link))
+   retval = devpts_pty_new(tty->link);
+   if (retval)
goto out1;
 
check_tty_count(tty, "tty_open");


[Devel] [PATCH]: Factor out PTY index allocation

2008-04-16 Thread sukadev
We noticed this while working on pts namespaces and believe this might
be a useful change even as we rework our pts/device namespace approach.

---

From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Subject: [PATCH]: Factor out PTY index allocation

Factor out the code used to allocate/free a pts index into new interfaces,
devpts_new_index() and devpts_kill_index().  This localizes the external
data structures used in managing the pts indices.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Signed-off-by: Serge Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>

---
 drivers/char/tty_io.c |   40 ++--
 fs/devpts/inode.c |   42 +-
 include/linux/devpts_fs.h |4 
 3 files changed, 51 insertions(+), 35 deletions(-)

Index: 2.6.25-rc8-mm2/include/linux/devpts_fs.h
===
--- 2.6.25-rc8-mm2.orig/include/linux/devpts_fs.h   2008-01-26 09:49:16.0 -0800
+++ 2.6.25-rc8-mm2/include/linux/devpts_fs.h        2008-04-16 09:51:15.0 -0700
@@ -17,6 +17,8 @@
 
 #ifdef CONFIG_UNIX98_PTYS
 
+int devpts_new_index(void);
+void devpts_kill_index(int idx);
 int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
 struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
 void devpts_pty_kill(int number);   /* unlink */
@@ -24,6 +26,8 @@ void devpts_pty_kill(int number);  /* u
 #else
 
 /* Dummy stubs in the no-pty case */
+static inline int devpts_new_index(void) { return -EINVAL; }
+static inline void devpts_kill_index(int idx) { }
 static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; }
 static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
 static inline void devpts_pty_kill(int number) { }
Index: 2.6.25-rc8-mm2/drivers/char/tty_io.c
===
--- 2.6.25-rc8-mm2.orig/drivers/char/tty_io.c   2008-04-16 09:51:11.0 -0700
+++ 2.6.25-rc8-mm2/drivers/char/tty_io.c        2008-04-16 09:51:15.0 -0700
@@ -91,7 +91,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -137,9 +136,6 @@ EXPORT_SYMBOL(tty_mutex);
 
 #ifdef CONFIG_UNIX98_PTYS
 extern struct tty_driver *ptm_driver;  /* Unix98 pty masters; for /dev/ptmx */
-extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
-static DEFINE_MUTEX(allocated_ptys_lock);
 static int ptmx_open(struct inode *, struct file *);
 #endif
 
@@ -2636,15 +2632,9 @@ static void release_dev(struct file *fil
 */
release_tty(tty, idx);
 
-#ifdef CONFIG_UNIX98_PTYS
/* Make this pty number available for reallocation */
-   if (devpts) {
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
-   mutex_unlock(&allocated_ptys_lock);
-   }
-#endif
-
+   if (devpts)
+   devpts_kill_index(idx);
 }
 
 /**
@@ -2800,29 +2790,13 @@ static int ptmx_open(struct inode *inode
struct tty_struct *tty;
int retval;
int index;
-   int idr_ret;
 
nonseekable_open(inode, filp);
 
/* find a device that is not in use. */
-   mutex_lock(&allocated_ptys_lock);
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
-   mutex_unlock(&allocated_ptys_lock);
-   return -ENOMEM;
-   }
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
-   if (idr_ret < 0) {
-   mutex_unlock(&allocated_ptys_lock);
-   if (idr_ret == -EAGAIN)
-   return -ENOMEM;
-   return -EIO;
-   }
-   if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
-   return -EIO;
-   }
-   mutex_unlock(&allocated_ptys_lock);
+   index = devpts_new_index();
+   if (index < 0)
+   return index;
 
mutex_lock(&tty_mutex);
retval = init_dev(ptm_driver, index, &tty);
@@ -2847,9 +2821,7 @@ out1:
release_dev(filp);
return retval;
 out:
-   mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, index);
-   mutex_unlock(&allocated_ptys_lock);
+   devpts_kill_index(index);
return retval;
 }
 #endif
Index: 2.6.25-rc8-mm2/fs/devpts/inode.c
===
--- 2.6.25-rc8-mm2.orig/fs/devpts/inode.c   2008-02-27 15:17:59.0 -0800
+++ 2.6.25-rc8-mm2/fs/devpts/inode.c        2008-04-16 09:51:15.0 -0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,10 @@
 
 #define DEVPTS_DEFAULT_MOD

[Devel] Re: [GIT PATCH] actually check va randomization

2008-06-11 Thread sukadev
Dave Hansen [EMAIL PROTECTED] wrote:
| Rather than just documenting this in the readme, actually spit
| out a warning on it.

Why not just bail out? It's mostly unreliable at that point anyway.
Besides, the warning can get buried in a lot of other output.
---
>From 84d005031a8a17bdca62dc541c296a3bea74658c Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Wed, 11 Jun 2008 11:22:17 -0700
Subject: [PATCH] cryo currently requires randomize_va_space to be 0. Fail if it is not.

---
 cr.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/cr.c b/cr.c
index db3ada0..217c5e7 100644
--- a/cr.c
+++ b/cr.c
@@ -1464,6 +1464,7 @@ void check_for_va_randomize(void)
return;
fprintf(stderr, "WARNING: %s is set to: %d\n", VA_RANDOM_FILE, enabled);
fprintf(stderr, " Please set to 0 to make cryo more reliable\n");
+   exit(1);
 }
 
 void usage(void)
-- 
1.5.2.5



[Devel] [RFC][PATCH] Restore fd flags in restarted process

2008-06-17 Thread sukadev

>From e33c0c11cc612896cb12ddad1925037e52e76eb3 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Tue, 17 Jun 2008 12:32:30 -0700
Subject: [PATCH] Restore fd flags in restarted process.

We currently get these flags using fcntl(F_GETFL) and save them while
checkpointing but we do not restore them when restarting the process.
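
A minimal in-process sketch of the save/restore idea, assuming plain
fcntl() on the caller's own descriptors rather than the ptrace-driven
PT_FCNTL() that cryo uses on the target process. The helper names
save_fd_flags() and restore_fd_flags() are hypothetical and are not part
of this patch:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: fetch the file status flags of fd, or -1 on error. */
static int save_fd_flags(int fd)
{
	assert(fd >= 0);
	return fcntl(fd, F_GETFL);
}

/*
 * Hypothetical helper: reapply flags previously returned by
 * save_fd_flags() to fd. Returns 0 on success, -1 on error.
 */
static int restore_fd_flags(int fd, int saved_flags)
{
	assert(fd >= 0);
	return fcntl(fd, F_SETFL, saved_flags);
}
```

Note that on Linux F_SETFL only changes O_APPEND, O_ASYNC, O_DIRECT,
O_NOATIME and O_NONBLOCK; the access mode and creation flags in the
saved value are ignored.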

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 cr.c  |   10 +-
 sci.h |7 ---
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/cr.c b/cr.c
index c52dd70..5163a3d 100644
--- a/cr.c
+++ b/cr.c
@@ -251,7 +251,7 @@ int getfdinfo(pinfo_t *pi)
if (len >= 0) pi->fi[n].name[len] = 0;
stat(dname, &st);
pi->fi[n].mode = st.st_mode;
-   pi->fi[n].flag = PT_FCNTL(syscallpid, pi->fi[n].fdnum, F_GETFL);
+   pi->fi[n].flag = PT_FCNTL(syscallpid, pi->fi[n].fdnum, F_GETFL, 0);
if (S_ISREG(st.st_mode))
pi->fi[n].offset = (off_t)PT_LSEEK(syscallpid, pi->fi[n].fdnum, 0, SEEK_CUR);
else if (S_ISFIFO(st.st_mode))
@@ -841,6 +841,14 @@ int restore_fd(int fd, pid_t pid)
}
}
 
+   /*
+* Restore any special flags this fd had
+*/
+   ret = PT_FCNTL(pid, fdinfo->fdnum, F_SETFL, fdinfo->flag);
+   DEBUG(" restore_fd() fd %d setfl flag 0x%x, ret %d\n",
+   fdinfo->fdnum, fdinfo->flag, ret);
+
+
free(fdinfo);
}
if (1) {
diff --git a/sci.h b/sci.h
index b0cac3c..0b32ae4 100644
--- a/sci.h
+++ b/sci.h
@@ -138,10 +138,11 @@ int call_func(pid_t pid, int scratch, int flag, int funcaddr, int argc, ...);
0, 0, off,  \
0, 0, w)
 
-#define PT_FCNTL(p, fd, cmd) \
-   ptrace_syscall(p, 0, 0, SYS_fcntl, 2,   \
+#define PT_FCNTL(p, fd, cmd, arg) \
+   ptrace_syscall(p, 0, 0, SYS_fcntl, 3,   \
0, 0, fd,   \
-   0, 0, cmd)
+   0, 0, cmd,  \
+   0, 0, arg)
 
 #define PT_CLOSE(p, fd)\
ptrace_syscall(p, 0, 0, SYS_close, 1,   \
-- 
1.5.2.5



[Devel] [RFC][PATCH][cryo] Save/restore state of unnamed pipes

2008-06-17 Thread sukadev

>From fd13986de32af31621b1badbcf7bfb5626648e0e Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Mon, 16 Jun 2008 18:41:05 -0700
Subject: [PATCH] Save/restore state of unnamed pipes

Design:

Current Linux kernels provide the ability to read/write the contents of FIFOs
using /proc, i.e. 'cat /proc/pid/fd/read-side-fd' prints the unread data
in the FIFO.  Similarly, 'cat foo > /proc/pid/fd/read-side-fd' appends
the contents of 'foo' to the unread contents of the FIFO.

So to save/restore the state of the pipe, a simple implementation is
to read from the unnamed pipe's fd and save to the checkpoint-file.
When restoring, create a pipe (using PT_PIPE()) in the child process,
read the contents of the pipe from the checkpoint file and write it to
the newly created pipe.

It's fairly straightforward, except for a couple of notes:

- when we read contents of '/proc/pid/fd/read-side-fd' we drain
  the pipe such that when the checkpointed application resumes,
  it will not find any data. To fix this, we read from the
  'read-side-fd' and write it back to the 'read-side-fd' in
  addition to writing to the checkpoint file.

- there does not seem to be a mechanism to determine the count
  of unread bytes in the file. The current implementation assumes a
  maximum of 64K bytes (PIPE_BUFS * PAGE_SIZE on i386) and fails
  if the pipe is not fully drained.
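
The read-and-write-back step from the first note can be sketched roughly
as below. This is a simplified illustration, not the code in this patch:
snapshot_pipe() is a hypothetical name, and it takes separate read and
write descriptors so it can be tried on a plain pipe() pair (in cryo both
would be the same fd, opened O_RDWR|O_NONBLOCK through /proc). It assumes
no other process touches the pipe in between.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define FIFO_MAX_BYTES 65536	/* assumed upper bound, as in the note above */

/*
 * Hypothetical helper: copy the unread bytes of a pipe into buf (at most
 * len bytes) and write them back so the pipe still holds them afterwards.
 * rfd must have O_NONBLOCK set so an empty pipe does not block the read.
 * Returns the number of bytes saved (0 for an empty pipe), or -1 on error.
 */
static ssize_t snapshot_pipe(int rfd, int wfd, char *buf, size_t len)
{
	ssize_t nread, nwritten, done = 0;

	assert(buf != NULL && len <= FIFO_MAX_BYTES);

	nread = read(rfd, buf, len);
	if (nread < 0)
		return errno == EAGAIN ? 0 : -1;	/* EAGAIN: pipe was empty */

	/* Refill the pipe with exactly what was drained. */
	while (done < nread) {
		nwritten = write(wfd, buf + done, nread - done);
		if (nwritten < 0)
			return -1;
		done += nwritten;
	}
	return nread;
}
```

The buffer returned here is what would go into the checkpoint file; the
checkpointed application still finds its data in the pipe when it resumes.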

Basic unit-testing done at this point (using tests/pipe.c).

TODO:
- Additional testing (with multiple-processes and multiple-pipes)
- Named-pipes

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 cr.c |  215 ++
 1 files changed, 203 insertions(+), 12 deletions(-)

diff --git a/cr.c b/cr.c
index 5163a3d..0cb9774 100644
--- a/cr.c
+++ b/cr.c
@@ -84,6 +84,11 @@ typedef struct fdinfo_t {
char name[128]; /* file name. NULL if anonymous (pipe, socketpair) */
 } fdinfo_t;
 
+typedef struct fifoinfo_t {
+   int fi_fd;  /* fifo's read-side fd */
+   int fi_length;  /* number of bytes in the fifo */
+} fifofdinfo_t;
+
 typedef struct memseg_t {
unsigned long start;/* memory segment start address */
unsigned long end;  /* memory segment end address */
@@ -468,6 +473,128 @@ out:
return rc;
 }
 
+static int estimate_fifo_unread_bytes(pinfo_t *pi, int fd)
+{
+   /*
+* Is there a way to find the number of bytes remaining to be
+* read in a fifo ? If not, can we print it in fdinfo ?
+*
+* Return 64K (PIPE_BUFS * PAGE_SIZE) for now.
+*/
+   return 65536;
+}
+
+static void ensure_fifo_has_drained(char *fname, int fifo_fd)
+{
+   int rc, c;
+
+   rc = read(fifo_fd, &c, 1);
+   if (rc != -1 && errno != EAGAIN) {
+   ERROR("FIFO '%s' not drained fully. rc %d, c %d "
+   "errno %d\n", fname, rc, c, errno);
+   }
+
+}
+
+static int save_process_fifo_info(pinfo_t *pi, int fd)
+{
+   int i;
+   int rc;
+   int nbytes;
+   int fifo_fd;
+   int pbuf_size;
+   pid_t pid = pi->pid;
+   char fname[256];
+   fdinfo_t *fi = pi->fi;
+   char  *pbuf;
+   fifofdinfo_t fifofdinfo;
+
+   write_item(fd, "FIFO", NULL, 0);
+
+   for (i = 0; i < pi->nf; i++) {
+   if (! S_ISFIFO(fi[i].mode))
+   continue;
+
+   DEBUG("FIFO fd %d (%s), flag 0x%x\n", fi[i].fdnum, fi[i].name,
+   fi[i].flag);
+
+   if (!(fi[i].flag & O_WRONLY))
+   continue;
+
+   pbuf_size = estimate_fifo_unread_bytes(pi, fd);
+
+   pbuf = (char *)malloc(pbuf_size);
+   if (!pbuf) {
+   ERROR("Unable to allocate FIFO buffer of size %d\n",
+   pbuf_size);
+   }
+   memset(pbuf, 0, pbuf_size);
+
+   sprintf(fname, "/proc/%u/fd/%u", pid, fi[i].fdnum);
+
+   /*
+* Open O_NONBLOCK so read does not block if fifo has fewer
+* bytes than our estimate.
+*/
+   fifo_fd = open(fname, O_RDWR|O_NONBLOCK);
+   if (fifo_fd < 0)
+   ERROR("Error %d opening FIFO '%s'\n", errno, fname);
+
+   nbytes = read(fifo_fd, pbuf, pbuf_size);
+   if (nbytes < 0) {
+   if (errno != EAGAIN) {
+   ERROR("Error %d reading FIFO '%s'\n", errno,
+   fname);
+   }
+   

[Devel] [RFC][PATCH][cryo] Read/print contents of fifo

2008-06-17 Thread sukadev

>From 0f5b3ea20238e0704a71252a3d495ca0db61e1dc Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Sat, 14 Jun 2008 11:45:00 -0700
Subject: [RFC][PATCH] Read/print contents of fifo.

To test that checkpoint/restart of pipes is working, read
one byte at a time from the pipe and write to stdout.

After checkpoint, both the checkpointed application and the
restarted application should continue reading from the checkpoint.

The '-e' option to the program tests with an empty pipe.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 tests/pipe.c |   32 
 1 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/tests/pipe.c b/tests/pipe.c
index cc3cdfd..0812cb3 100644
--- a/tests/pipe.c
+++ b/tests/pipe.c
@@ -3,25 +3,49 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
-int main()
+int main(int argc, char *argv[])
 {
int i = 0;
+   int rc;
int fds[2];
+   int c;
+   int empty;
char *buf = "abcdefghijklmnopqrstuvwxyz";
 
+   /*
+* -e: test with an empty pipe
+*/
+   empty = 0;
+   if (argc > 1 && strcmp(argv[1], "-e") == 0)
+   empty = 1;
+
if (pipe(fds) < 0) {
perror("pipe()");
exit(1);
}
 
-   write(fds[1], buf, strlen(buf));
+   if (!empty)
+   write(fds[1], buf, strlen(buf));
 
+   if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) {
+   perror("fcntl()");
+   exit(1);
+   }
printf("Running as %d\n", getpid());
while (i<100) {
sleep(1);
-   if (i%5 == 0)
-   printf("i is %d (pid %d)\n", i, getpid());
+   if (i%5 == 0) {
+   c = errno = 0;
+   rc = read(fds[0], &c, 1);
+   if (rc != 1) {
+   perror("read() failed");
+   }
+   printf("i is %d (pid %d), c is %c\n", i, getpid(), c);
+
+   }
i++;
}
 }
-- 
1.5.2.5



[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes

2008-06-17 Thread sukadev
Serge E. Hallyn [EMAIL PROTECTED] wrote:


| > +
| > +   rc = read(fifo_fd, &c, 1);
| > +   if (rc != -1 && errno != EAGAIN) {
| 
|   Won't errno only be set if rc == -1?  Did you mean || here?

Yes I meant ||. I also had 'errno = 0' before the read, but seem
to have deleted it when I moved code around.
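
For reference, a small standalone sketch of the check with both fixes
applied (errno cleared before the read, and the pipe treated as drained
only when read() fails with EAGAIN). The names fifo_has_drained() and
set_nonblocking() are hypothetical; the updated patch may look different:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to fd's status flags. */
static int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Returns 1 if a non-blocking one-byte read finds the FIFO empty
 * (rc == -1 and errno == EAGAIN), 0 otherwise. The error condition in
 * the patch is the negation: rc != -1 || errno != EAGAIN.
 * Note that a successful read consumes one byte from the FIFO.
 */
static int fifo_has_drained(int fifo_fd)
{
	char c;
	ssize_t rc;

	assert(fifo_fd >= 0);

	errno = 0;
	rc = read(fifo_fd, &c, 1);
	return rc == -1 && errno == EAGAIN;
}
```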



| > +   } else if (S_ISFIFO(fdinfo->mode)) {
| > +   int pipefds[2] = { 0, 0 };
| > +
| > +   /*
| > +* We create the pipe when we see the pipe's read-fd.
| > +* Just ignore the pipe's write-fd.
| > +*/
| > +   if (fdinfo->flag == O_WRONLY)
| > +   continue;
| > +
| > +   DEBUG("Creating pipe for fd %d\n", fdinfo->fdnum);
| > +
| > +   t_d(PT_PIPE(pid, pipefds));
| > +   t_d(pipefds[0]);
| > +   t_d(pipefds[1]);
| > +
| > +   if (pipefds[0] != fdinfo->fdnum) {
| > +   DEBUG("Hmm, new pipe has fds %d, %d "
| > +   "Old pipe had fd %d\n", pipefds[0],
| > +   pipefds[1], fdinfo->fdnum); getchar();
| 
| Can you explain what you're doing here?  I would have expected you to
| dup2() to get back the correct fd, so maybe I'm missing something...

You are right, I should use dup2() here.

Will send an updated patch.

Thanks,

Suka


[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes

2008-06-17 Thread sukadev
Matt Helsley [EMAIL PROTECTED] wrote:
| 
| On Tue, 2008-06-17 at 17:30 -0500, Serge E. Hallyn wrote:
| > Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
| > > 
| > > >From fd13986de32af31621b1badbcf7bfb5626648e0e Mon Sep 17 00:00:00 2001
| > > From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > > Date: Mon, 16 Jun 2008 18:41:05 -0700
| > > Subject: [PATCH] Save/restore state of unnamed pipes
| > > 
| > > Design:
| > > 
| > > Current Linux kernels provide ability to read/write contents of FIFOs
| > > using /proc. i.e 'cat /proc/pid/fd/read-side-fd' prints the unread data
| > > in the FIFO.  Similarly, 'cat foo > /proc/pid/fd/read-sid-fd' appends
| > > the contents of 'foo' to the unread contents of the FIFO.
| > > 
| > > So to save/restore the state of the pipe, a simple implementation is
| > > to read the from the unnamed pipe's fd and save to the checkpoint-file.
| > > When restoring, create a pipe (using PT_PIPE()) in the child process,
| > > read the contents of the pipe from the checkpoint file and write it to
| > > the newly created pipe.
| > > 
| > > Its fairly straightforward, except for couple of notes:
| > > 
| > >   - when we read contents of '/proc/pid/fd/read-side-fd' we drain
| > > the pipe such that when the checkpointed application resumes,
| > > it will not find any data. To fix this, we read from the
| > > 'read-side-fd' and write it back to the 'read-side-fd' in
| > > addition to writing to the checkpoint file.
| > > 
| > >   - there does not seem to be a mechanism to determine the count
| > > of unread bytes in the file. Current implmentation assumes a
| > > maximum of 64K bytes (PIPE_BUFS * PAGE_SIZE on i386) and fails
| > > if the pipe is not fully drained.
| > > 
| > > Basic unit-testing done at this point (using tests/pipe.c).
| > > 
| > > TODO:
| > >   - Additional testing (with multiple-processes and multiple-pipes)
| > >   - Named-pipes
| > > 
| > > Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
| > > ---
| > >  cr.c |  215 
++
| > >  1 files changed, 203 insertions(+), 12 deletions(-)
| > > 
| > > diff --git a/cr.c b/cr.c
| > > index 5163a3d..0cb9774 100644
| > > --- a/cr.c
| > > +++ b/cr.c
| > > @@ -84,6 +84,11 @@ typedef struct fdinfo_t {
| > >   char name[128]; /* file name. NULL if anonymous (pipe, 
socketpair) */
| > >  } fdinfo_t;
| > > 
| > > +typedef struct fifoinfo_t {
| > > + int fi_fd;  /* fifo's read-side fd */
| > > + int fi_length;  /* number of bytes in the fifo */
| > > +} fifofdinfo_t;
| > > +
| > >  typedef struct memseg_t {
| > >   unsigned long start;/* memory segment start address */
| > >   unsigned long end;  /* memory segment end address */
| > > @@ -468,6 +473,128 @@ out:
| > >   return rc;
| > >  }
| > > 
| > > +static int estimate_fifo_unread_bytes(pinfo_t *pi, int fd)
| > > +{
| > > + /*
| > > +  * Is there a way to find the number of bytes remaining to be
| > > +  * read in a fifo ? If not, can we print it in fdinfo ?
| > > +  *
| > > +  * Return 64K (PIPE_BUFS * PAGE_SIZE) for now.
| > > +  */
| > > + return 65536;
| > > +}
| > > +
| > > +static void ensure_fifo_has_drained(char *fname, int fifo_fd)
| > > +{
| > > + int rc, c;
| > > +
| > > + rc = read(fifo_fd, &c, 1);
| > > + if (rc != -1 && errno != EAGAIN) {
| > 
| > Won't errno only be set if rc == -1?  Did you mean || here?
| > 
| > > + ERROR("FIFO '%s' not drained fully. rc %d, c %d "
| > > + "errno %d\n", fname, rc, c, errno);
| > > + }
| > > +
| > > +}
| > > +
| > > +static int save_process_fifo_info(pinfo_t *pi, int fd)
| > > +{
| > > + int i;
| > > + int rc;
| > > + int nbytes;
| > > + int fifo_fd;
| > > + int pbuf_size;
| > > + pid_t pid = pi->pid;
| > > + char fname[256];
| > > + fdinfo_t *fi = pi->fi;
| > > + char  *pbuf;
| > > + fifofdinfo_t fifofdinfo;
| > > +
| > > + write_item(fd, "FIFO", NULL, 0);
| > > +
| > > + for (i = 0; i < pi->nf; i++) {
| > > + if (! S_ISFIFO(fi[i].mode))
| > > + continue;
| > > +
| > > + DEBUG("FIFO fd %d (%s), flag 0x%x\n", fi[i].fdnum, fi[i].name,

[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes

2008-06-18 Thread sukadev
Matt Helsley [EMAIL PROTECTED] wrote:

| > | 
| > | 
| > | pipe(pipefds); /* returns 5 and 4 in elements 0 and 1 */
| > | /* use fds after last_fd as trampolines for fds we want to create */
| > | dup2(pipefds[0], last_fd + 1);
| > | dup2(pipefds[1], last_fd + 2);
| > | close(pipefds[0]);
| > | close(pipefds[1]);
| > | dup2(last_fd + 1, );
| > | dup2(last_fd + 2, );
| > | close(last_fd + 1);
| > | close(last_fd + 2);
| > | 
| > | 
| > | Which is alot more code but should work no matter which fds we get back
| > | from pipe(). Of course this assumes the checkpointed application hasn't
| > | used all of its fds. :(
| > | 
| > 
| > This sounds like a good idea too, but we could use any fd that has not
| > yet been used in the restart-process right ? It would break if all fds
| 
| Yes, but we don't know which fd is available unless we allocate it
| without dup2(). Here's how it could be done without last_fd (again,
| dropping PT_FUNC notation):
| 
| /*
|  * Move fds from src to dest. Useful for correctly "moving" pipe fds and 
|  * other cases where we have a small number of fds to move to their
|  * original fd.
|  *
|  * Assumes dest_fds and src_fds are of the same, small length since
|  * this is O(num_fds^2).
|  *
|  * If num_fds == 1 then use plain dup2().
|  *
|  * Use this in place of multiple dup2() calls (num_fds > 1) unless you are
|  * absolutely certain the set of dest fds do not intersect the set of src fds.
|  * Does NOT magically prevent you from accidentally clobbering fds outside the
|  * src_fds array.
|  */
| void move_fds(int *dest_fds, int *src_fds, const unsigned int num_fds)
| {
|   int i;
|   unsigned int num_moved = 0;
| 
|   for (i = 0; i < num_fds; i++) {
|   int j;
| 
|   if (src_fds[i] == dest_fds[i])
|   continue; /* nothing to be done */
| 
|   /* src fd != dest fd so we need to perform:
|   dup2(src fd, dest fd);
|  but dup2() closes dest fd if it already exists.
|  This means we could accidentally close one of
|  the src fds. Avoid this by searching for any
|  src fd == dest fd and dup()'ing src fd to
|  a different fd so we can use the dest fd.
|*/
|   for (j = i + 1; j < num_fds; j++) /* This makes us O(N^2) */
|   if (dest_fds[i] == src_fds[j])
|   /*
|* we're using an fd for something 
|* else already -- we need a trampoline
|*/

So let me rephrase the problem.

Suppose the checkpointed application was using fds in the following
"orig-fd-set"

{ [0..10], 18, 27 }

where 18 and 27 are part of a pipe. For simplicity let's assume that
18 is the read-side-fd.

We checkpointed this application and are now trying to restart it.

In the restarted application, we would call

dup2(fd1, fd2),

where 'fd1' is some new, random fd and 'fd2' is an fd in 'orig-fd-set'
(say fd2 = 18).

IIUC, there is a risk here of 'fd2' being closed accidentally while
it is in use.

But, the only way I can see 'fd2' being in use in the restarted process
is if _cryo_ opened some file _during_ restart and did not close it. I ran
into this early on with the randomize_va_space file (which was easily
fixed).

Would cryo need to keep one or more temporary/debug files open in the
restarted process (i.e. files that are not in the 'orig-fd-set')?

If cryo does, then maybe it could open such files:

- after clone() (so files are not open in restarted process), or

- find the last_fd used and dup2() to that fd, leaving the
  'orig-fd-set' all open/available for restarted process 

For debug, before each 'dup2(fd1, fd2)' we could 'fstat(fd2, &buf)'
to ensure 'fd2' is not in use and error out if it is.
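
Something along these lines is what I have in mind for the debug check;
it is only a sketch, run in-process rather than through ptrace, and
dup2_checked() is a made-up name:

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical debug wrapper: refuse to dup2() onto a descriptor that is
 * already open. Returns newfd on success, or -1 with errno set
 * (EBUSY if newfd was already in use).
 */
static int dup2_checked(int oldfd, int newfd)
{
	struct stat st;

	assert(oldfd >= 0 && newfd >= 0);

	if (fstat(newfd, &st) == 0) {
		errno = EBUSY;		/* newfd is open: do not clobber it */
		return -1;
	}
	if (errno != EBADF)
		return -1;		/* unexpected fstat() failure */

	return dup2(oldfd, newfd);
}
```

This treats oldfd == newfd as "in use" too, so callers would need to
skip the call when the fd is already in the right place.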

Thanks for your comments. I will look at your code in more detail.

Suka


[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes

2008-06-18 Thread sukadev
Matt Helsley [EMAIL PROTECTED] wrote:
| 
| > So let me rephrase the problem.
| > 
| > Suppose the checkpointed application was using fds in following
| > "orig-fd-set"
| > 
| > { [0..10], 18, 27 }
| > 
| > where 18 and 27 are part of a pipe. For simplicity lets assume that
| > 18 is the read-side-fd.
| 
| so orig_pipefd[0] == 18
| and orig_pipefd[1] == 27
| 
| > We checkpointed this application and are now trying to restart it.
| > 
| > In the restarted application, we would call
| > 
| > dup2(fd1, fd2),
| > 
| > where 'fd1' is some new, random fd and 'fd2' is an fd in 'orig-fd-set'
|^^  Even if they were truly random, this
| does not preclude fd1 from having the same value as an fd in the
| remaining orig-fd-set -- such as one of the two we're about to try and
| restart with pipe().

I agree. fd1 could be a hitherto-unseen fd from the 'orig-fd-set'.

| 
| > (say fd2 = 18).
| 
| fd1 = restarted_pipefd[0]
| fd2 = restarted_pipefd[1]
| 
| In my example fd1 == 27 and fd2 == 18
| 
| > IIUC, there is a risk here of 'fd2' being closed accidentally while
| > it is in use.
| 
|   Yes, that's the risk.
| 
| > But, the only way I can see 'fd2' being in use in the restarted process
| > is if _cryo_ opened some file _during_ restart and did not close. I ran
| 
|   Both file descriptors returned from pipe() are in use during restart
| and closing one of them would not be proper. Cryo hasn't "forgotten" to
| close one of them -- cryo needs to dup2() both of them to their
| "destination" fds. But if they have been swapped or if just one is the
| "destination" of the other then you could end up with a broken pipe.

Ok I see what you are saying. 

The assumption I have is that we would process the fds from 'orig-fd-set'
in ascending order. It's good to confirm that assumption now :-)

proc_readfd_common() seems to return the fds in ascending order (so
readdir() of "/proc/pid/fd/" would get them in ascending order - no ?)

Suppose we process 'orig-fd-set' in order and create the pipe when we reach
the smaller of the two fds (which could be the write side). Then the other
side of the pipe would either not collide with an existing fd, or that
fd would not be in the 'orig-fd-set' (in which case it would
be safe for dup2() to close it).

| 
| > into this early on with the randomize_va_space file (which was easily
| > fixed).
| 
|   This logic only works if cryo only has one new fd at a time. However
| that's not possible with pipe(). Or socketpair(). In those cases one of
| the two new fds could be the "destination" fd for the other. In that
| case dup2() will kindly close it for you and break your new
| pipe/socketpair! :)
| 
|   That's why I asked if POSIX guarantees the read side file descriptor is
| always less than the write side. If it does then the numbers can't be
| swapped and maybe using your assumption that we don't have any other fds
| accidentally left open ensures dup2() will be safe.

I don't think POSIX guarantees that, but I will double-check.

| 
| > Would cryo need to keep one or more temporary/debug files open in the
| > restarted process (i.e files that are not in the 'orig-fd-set').
| 
|   There's no need to keep temporary/debug files open that I can see. Just
| a need to be careful when more than one new file descriptor has been
| created before doing a dup2().
| 
| > If cryo does, then maybe it could open such files:
| > 
| > - after clone() (so files are not open in restarted process), or
| > 
| > - find the last_fd used and dup2() to that fd, leaving the
| >   'orig-fd-set' all open/available for restarted process 
| > 
| > For debug, before each 'dup2(fd1, fd2)' we could 'fstat(fd2, &buf)'
| > to ensure 'fd2' is not in use and error out if it is.
| 
| fstat() could certainly be useful for debugging dup2(). However it still
| doesn't nicely show us whether there are any fds we've leaked that we
| forgot about unless we fstat() all possible fds and then compare the set
| of existing fds to the orig-fd-set.

Yes, I was suggesting fstat() only to detect collisions; to
detect leaks, we would have to do more.
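
The two checks discussed above could be sketched roughly as follows. This is
only an illustration, not cryo code: fd_in_use() and count_leaked_fds() are
hypothetical helper names, and the orig-fd-set is assumed to be available as a
plain array of ints.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if 'fd' refers to an open file in this process, 0 otherwise. */
int fd_in_use(int fd)
{
    struct stat sb;

    return fstat(fd, &sb) == 0;
}

/* Return 1 if 'fd' appears in orig_fds[0..n-1], 0 otherwise. */
static int in_orig_set(int fd, const int *orig_fds, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (orig_fds[i] == fd)
            return 1;
    return 0;
}

/*
 * Count the open fds in [0, max_fd) that are not part of the orig-fd-set,
 * i.e. fds that were leaked during restart. If max_fd <= 0, fall back to
 * the process limit reported by sysconf().
 */
int count_leaked_fds(const int *orig_fds, size_t n, int max_fd)
{
    int fd, leaked = 0;

    assert(orig_fds != NULL || n == 0);
    if (max_fd <= 0)
        max_fd = (int)sysconf(_SC_OPEN_MAX);

    for (fd = 0; fd < max_fd; fd++)
        if (fd_in_use(fd) && !in_orig_set(fd, orig_fds, n))
            leaked++;
    return leaked;
}
```

fd_in_use(fd2) before each dup2(fd1, fd2) catches collisions; calling
count_leaked_fds() once, after all fds are restored, catches the leaks.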


[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes

2008-06-18 Thread sukadev
| 
| Now restart does :
| 
| int pipefds[2];
| 
| pipe(pipefds); /* 
|   * kernel is allowed to return pipefds[0] == 12 and 
|   * pipefds[1] == 11
|   */
| 
| dup2(pipefds[0], 11); /* closes pipefds[1]! */
| dup2(pipefds[1], 12);

Aah. I see it now (finally). Thanks,

Suka
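
For reference, the failure mode in the quoted snippet can be reproduced with a
small sketch like the one below. The function name is made up for this
illustration. It performs the two naive dup2() calls and then reports what
happens when writing to the fd that is supposed to be the write side.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Naively move a freshly created pipe onto (want_rd, want_wr) with two
 * dup2() calls, then try to write one byte to want_wr.
 *
 * Returns 0 if the write succeeds, otherwise the errno of the failing call.
 * When want_rd == pipefds[1], the first dup2() silently closes the only
 * write end, so want_wr ends up as a second read end and write() fails
 * with EBADF.
 */
int naive_restore_write_errno(int pipefds[2], int want_rd, int want_wr)
{
    char c = 'x';

    assert(want_rd != want_wr);

    if (dup2(pipefds[0], want_rd) < 0)  /* may close pipefds[1]! */
        return errno;
    if (dup2(pipefds[1], want_wr) < 0)
        return errno;
    if (write(want_wr, &c, 1) < 0)
        return errno;
    return 0;
}
```

Calling it with the two fds returned by pipe() swapped as the targets yields
EBADF; calling it with the targets equal to the fds pipe() returned is
harmless, because dup2(fd, fd) is a no-op.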


[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes

2008-06-19 Thread sukadev
Matt Helsley [EMAIL PROTECTED] wrote:


| 
| > | I don't see anything in the pipe man page, at least, that suggests we
| > | can safely assume pipefds[0] < pipefds[1].
| > | 
| > |   The solution could be to use "trampoline" fds. Suppose last_fd is the
| > | largest fd that exists in the final checkpointed/restarting application.
| > | We could do (Skipping the PT_FUNC "notation" for clarity):
| > 
| > | 
| > | 
| > | pipe(pipefds); /* returns 5 and 4 in elements 0 and 1 */
| > | /* use fds after last_fd as trampolines for fds we want to create */
| > | dup2(pipefds[0], last_fd + 1);
| > | dup2(pipefds[1], last_fd + 2);
| > | close(pipefds[0]);
| > | close(pipefds[1]);
| > | dup2(last_fd + 1, );
| > | dup2(last_fd + 2, );
| > | close(last_fd + 1);
| > | close(last_fd + 2);
| > | 
| > | 
| > | Which is a lot more code but should work no matter which fds we get back
| > | from pipe(). Of course this assumes the checkpointed application hasn't
| > | used all of its fds. :(
| > | 

It appears that this last_fd approach will fit in more easily with the
current design of cryo (where we process one or two fds at a time and don't
have the src_fds and dest_fds handy).

BTW, we should be able to accomplish the above with a single unused fd,
right (i.e. no need for last_fd+2)?

| > 
| > This sounds like a good idea too, but we could use any fd that has not
| > yet been used in the restart-process right ? It would break if all fds
| 
| Yes, but we don't know which fd is available unless we allocate it
| without dup2().

Right. I was thinking we could find that out at the time of checkpoint
(a brute-force fstat(i, &statbuf) for i = 0..n or something more
efficient).

Well just thought of another approach.

Basically, we have a temporary need for an unused fd for use as a
trampoline.  So, why not 'set-aside' an fd for that purpose and after
all other fds have been created, go back and create this fd ?

i.e. let's say the first regular file we open is associated with 'fd = 3'.
We save away the 'fdinfo' for 3, say in a global variable, and close(3).
Now use 'fd = 3' in place of last_fd+1 above.

Once all fds have been setup correctly, go back and set up fd = 3
using the saved fdinfo (this would be a simple open of the file
followed by seek and maybe an fcntl).

This would work even if the application was using all its fds ?

If we do need both last_fd+1 and last_fd+2, we would have to set
aside two regular files.

Hmm, would it work even if an application uses all (1024) of its fds
for pipes? :-) Just a thought at this point.

Suka
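
The 'set-aside' idea above could look roughly like this. It is a sketch only:
struct saved_fd and both function names are hypothetical, and it assumes the
regular file is still reachable by the path recorded at checkpoint time.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record of what is needed to re-create a regular-file fd. */
struct saved_fd {
    int fdnum;          /* fd number the application expects */
    int flags;          /* open(2) status flags from F_GETFL */
    off_t offset;       /* read/write position */
    char path[256];     /* file name */
};

/*
 * Remember how to re-create 'fd', then close it so that its number can be
 * borrowed as a trampoline. Returns 0 on success, -1 on error.
 */
int set_aside_fd(int fd, const char *path, struct saved_fd *s)
{
    assert(strlen(path) < sizeof(s->path));

    s->fdnum = fd;
    s->flags = fcntl(fd, F_GETFL);
    s->offset = lseek(fd, 0, SEEK_CUR);
    if (s->flags < 0 || s->offset == (off_t)-1)
        return -1;
    strcpy(s->path, path);
    return close(fd);
}

/*
 * Once all other fds are in place, re-open the file on its original fd
 * number and restore the file position. Returns 0 on success, -1 on error.
 */
int reinstate_fd(const struct saved_fd *s)
{
    int fd = open(s->path, s->flags);

    if (fd < 0)
        return -1;
    if (fd != s->fdnum) {
        if (dup2(fd, s->fdnum) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
    }
    return lseek(s->fdnum, s->offset, SEEK_SET) == (off_t)-1 ? -1 : 0;
}
```

Between set_aside_fd() and reinstate_fd(), the fd number in s.fdnum is free
and can stand in for last_fd+1.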


[Devel] Re: [RFC][PATCH][cryo] Save/restore state of unnamed pipes

2008-06-19 Thread sukadev
Matt Helsley [EMAIL PROTECTED] wrote:
| 
| Yes, I think that's sufficient:
| 
| int pipefds[2];
| 
|   ...
| 
|   restarted_read_fd = 11;
|   restarted_write_fd = 12;
| 
|   ...
| 
| pipe(pipefds);
| 
|   /* 
|* pipe() may have returned one (or both) of the restarted fds
|* at the wrong end of the pipe. This could cause dup2() to
|* accidentally close the pipe. Avoid that with an extra dup().
|*/
| if (pipefds[1] == restarted_read_fd) {
|   dup2(pipefds[1], last_fd + 1);
|   pipefds[1] = last_fd + 1;
|   }
| 
|   if (pipefds[0] != restarted_read_fd) {
|   dup2(pipefds[0], restarted_read_fd);
|   close(pipefds[0]);
|   }
| 
|   if (pipefds[0] != restarted_read_fd) {
|   dup2(pipefds[1], restarted_write_fd);
|   close(pipefds[1]);
|   }

Shouldn't the last if be

if (pipefds[1] != restarted_write_fd) ?

(otherwise it would break if pipefds[0] = 11 and pipefds[1] = 200)

I came up with something similar, but with an extra close(). And
in my code, I had restarted_* names referring to pipefds[] making
it a bit confusing initially.

How about using actual_fds[] (instead of pipefds) and expected_fds[]
(instead of restarted_*)?

Thanks,

Suka
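
Putting the correction and the proposed names together, the logic might be
sketched as below. This is an illustration under stated assumptions, not the
code that went into cryo: restore_pipe_fds() is a hypothetical name, the
caller is assumed to supply a trampoline fd that is not open, and the expected
fds are assumed to be either unused or safe for dup2() to close.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Move a new pipe from actual_fds[] (as returned by pipe()) to
 * expected_fds[] (read side in [0], write side in [1]).
 * Returns 0 on success, -1 on error; actual_fds[] is updated in place.
 */
int restore_pipe_fds(int actual_fds[2], const int expected_fds[2],
                     int tramp_fd)
{
    assert(expected_fds[0] != expected_fds[1]);

    /*
     * If the write side sits on the fd wanted for the read side, park a
     * copy on the trampoline first. The old copy is closed implicitly by
     * the dup2() for the read side below.
     */
    if (actual_fds[1] == expected_fds[0]) {
        if (fcntl(tramp_fd, F_GETFD) != -1)
            return -1;  /* trampoline is already in use */
        if (dup2(actual_fds[1], tramp_fd) < 0)
            return -1;
        actual_fds[1] = tramp_fd;
    }

    if (actual_fds[0] != expected_fds[0]) {
        if (dup2(actual_fds[0], expected_fds[0]) < 0)
            return -1;
        close(actual_fds[0]);
        actual_fds[0] = expected_fds[0];
    }

    /* Note: this test is on the write side, not the read side. */
    if (actual_fds[1] != expected_fds[1]) {
        if (dup2(actual_fds[1], expected_fds[1]) < 0)
            return -1;
        close(actual_fds[1]);
        actual_fds[1] = expected_fds[1];
    }
    return 0;
}
```

With actual_fds = {11, 200} and expected_fds = {11, 12}, only the last branch
runs, which is the case the original third test missed.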

| 
| I think this code does the minimal number of operations needed in the
| restarted application too -- it counts on the second dup2() closing one
| of the fds if pipefds[1] == restarted_read_fd.
| 
| Cheers,
|   -Matt


[Devel] [PATCH 0/2][cryo] Save/restore pipe state

2008-06-23 Thread sukadev

[PATCH 1/2] Save/restore state of unnamed pipes

Basic infrastructure to save/restore pipe state with assumptions
about order of fds.

[PATCH 2/2] Support Non-consecutive and dup pipe fds

Remove above assumptions about order of fds and support dups of
pipe fds.


[Devel] [PATCH 1/2][cryo] Save/restore state of unnamed pipes

2008-06-23 Thread sukadev
>From e513f8bc0fe808425264ad01210ac610f6453047 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Mon, 16 Jun 2008 18:41:05 -0700
Subject: [PATCH] Save/restore state of unnamed pipes

Design:

Current Linux kernels provide the ability to read/write the contents of FIFOs
using /proc, i.e. 'cat /proc/pid/fd/read-side-fd' prints the unread data
in the FIFO.  Similarly, 'cat foo > /proc/pid/fd/read-side-fd' appends
the contents of 'foo' to the unread contents of the FIFO.

So to save/restore the state of the pipe, a simple implementation is
to read from the unnamed pipe's fd and save the data to the checkpoint file.
When restoring, create a pipe (using PT_PIPE()) in the child process,
read the contents of the pipe from the checkpoint file and write them to
the newly created pipe.

It's fairly straightforward, except for a couple of notes:

- when we read contents of '/proc/pid/fd/read-side-fd' we drain
  the pipe such that when the checkpointed application resumes,
  it will not find any data. To fix this, we read from the
  'read-side-fd' and write it back to the 'read-side-fd' in
  addition to writing to the checkpoint file.

- there does not seem to be a mechanism to determine the count
  of unread bytes in the FIFO. The current implementation assumes a
  maximum of 64K bytes (PIPE_BUFS * PAGE_SIZE on i386) and fails
  if the pipe is not fully drained.
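
The read-and-write-back step from the first note can be sketched as below.
Two things here are assumptions and not what this patch does. First, the
sketch takes both ends of the pipe as fds in the current process, whereas the
patch re-opens /proc/pid/fd/N with O_RDWR. Second, it sizes the read with the
FIONREAD ioctl, which Linux documents in pipe(7) as returning the number of
unread bytes in a pipe, instead of the fixed 64K estimate. snapshot_pipe() is
a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the unread bytes of a pipe into a newly allocated buffer (*out,
 * to be freed by the caller) and write them back so that the application
 * still finds them when it resumes.
 *
 * Returns the number of bytes saved, or -1 on error. Not atomic: this
 * assumes nobody else reads or writes the pipe meanwhile, e.g. because
 * the checkpointed process is stopped.
 */
ssize_t snapshot_pipe(int rd_fd, int wr_fd, char **out)
{
    int pending = 0;
    ssize_t nread = 0;
    char *buf;

    assert(out != NULL);
    *out = NULL;

    /* Ask the kernel how many unread bytes the pipe holds. */
    if (ioctl(rd_fd, FIONREAD, &pending) < 0)
        return -1;

    buf = malloc(pending > 0 ? (size_t)pending : 1);
    if (!buf)
        return -1;

    if (pending > 0) {
        /* Drain the pipe, then put the same bytes straight back. */
        nread = read(rd_fd, buf, (size_t)pending);
        if (nread < 0 || write(wr_fd, buf, (size_t)nread) != nread) {
            free(buf);
            return -1;
        }
    }

    *out = buf;
    return nread;
}
```

The caller would then write the returned buffer to the checkpoint file.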

Changelog:[v1]:

- [Serge Hallyn]: use || instead of && in ensure_fifo_has_drained

- [Serge Hallyn, Matt Helsley]: Use dup2() to restore fds and
remove assumptions about order of read and write fds
(addressed in PATCH 2/2).

Some unit-testing done at this point (using tests/pipe.c).

TODO:
- Additional testing (with multiple-processes and multiple-pipes)
- Named-pipes
---
 cr.c |  217 ++
 1 files changed, 205 insertions(+), 12 deletions(-)

diff --git a/cr.c b/cr.c
index c7e3332..716cc86 100644
--- a/cr.c
+++ b/cr.c
@@ -88,6 +88,11 @@ typedef struct fdinfo_t {
char name[128]; /* file name. NULL if anonymous (pipe, socketpair) */
 } fdinfo_t;
 
+typedef struct fifoinfo_t {
+   int fi_fd;  /* fifo's read-side fd */
+   int fi_length;  /* number of bytes in the fifo */
+} fifofdinfo_t;
+
 typedef struct memseg_t {
unsigned long start;/* memory segment start address */
unsigned long end;  /* memory segment end address */
@@ -499,6 +504,129 @@ out:
return rc;
 }
 
+static int estimate_fifo_unread_bytes(pinfo_t *pi, int fd)
+{
+   /*
+* Is there a way to find the number of bytes remaining to be
+* read in a fifo ? If not, can we print it in fdinfo ?
+*
+* Return 64K (PIPE_BUFS * PAGE_SIZE) for now.
+*/
+   return 65536;
+}
+
+static void ensure_fifo_has_drained(char *fname, int fifo_fd)
+{
+   int rc, c;
+
+   errno = 0;
+   rc = read(fifo_fd, &c, 1);
+   if (rc != -1 || errno != EAGAIN) {
+   ERROR("FIFO '%s' not drained fully. rc %d, c %d "
+   "errno %d\n", fname, rc, c, errno);
+   }
+
+}
+
+static int save_process_fifo_info(pinfo_t *pi, int fd)
+{
+   int i;
+   int rc;
+   int nbytes;
+   int fifo_fd;
+   int pbuf_size;
+   pid_t pid = pi->pid;
+   char fname[256];
+   fdinfo_t *fi = pi->fi;
+   char  *pbuf;
+   fifofdinfo_t fifofdinfo;
+
+   write_item(fd, "FIFO", NULL, 0);
+
+   for (i = 0; i < pi->nf; i++) {
+   if (! S_ISFIFO(fi[i].mode))
+   continue;
+
+   DEBUG("FIFO fd %d (%s), flag 0x%x\n", fi[i].fdnum, fi[i].name,
+   fi[i].flag);
+
+   if (!(fi[i].flag & O_WRONLY))
+   continue;
+
+   pbuf_size = estimate_fifo_unread_bytes(pi, fd);
+
+   pbuf = (char *)malloc(pbuf_size);
+   if (!pbuf) {
+   ERROR("Unable to allocate FIFO buffer of size %d\n",
+   pbuf_size);
+   }
+   memset(pbuf, 0, pbuf_size);
+
+   sprintf(fname, "/proc/%u/fd/%u", pid, fi[i].fdnum);
+
+   /*
+* Open O_NONBLOCK so read does not block if fifo has fewer
+* bytes than our estimate.
+*/
+   fifo_fd = open(fname, O_RDWR|O_NONBLOCK);
+   if (fifo_fd < 0)
+   ERROR("Error %d opening FIFO '%s'\n", errno, fname);
+
+   nbytes = read(fifo_fd, pbuf, pbuf_size);
+   if (nbytes < 0) {
+   if (er

[Devel] [PATCH 2/2] Support Non-consecutive and dup pipe fds

2008-06-23 Thread sukadev

>From a80c5215763f757840214465277e911e46e01219 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Mon, 23 Jun 2008 20:13:57 -0700
Subject: [PATCH] Support Non-consecutive and dup pipe fds

PATCH 1/2 provides basic infrastructure to save/restore state of
pipes. This patch removes assumptions about order of the pipe-fds
and also supports existence of 'dups' of pipe-fds.

This logic has been separated from PATCH 1/2 for easier review and
the two patches could be combined into a single one.
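
As a rough illustration of telling peers from dups, the sketch below compares
two fds of the current process by pipe inode and access mode. This is an
assumption-laden simplification: the patch itself works on another process via
stat() on /proc paths, classify_pipe_fds() and the enum are hypothetical
names, and the check relies on the Linux behaviour that both ends of a pipe
report the same st_ino.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

enum fd_relation { FD_UNRELATED, FD_PIPE_PEER, FD_PIPE_DUP };

/*
 * Classify two fds: same pipe and different access mode means they are the
 * two ends of one pipe; same pipe and same access mode is treated as a dup.
 * Returns an enum fd_relation value, or -1 on error.
 */
int classify_pipe_fds(int a, int b)
{
    struct stat sa, sb;
    int fa, fb;

    assert(a >= 0 && b >= 0);

    if (fstat(a, &sa) < 0 || fstat(b, &sb) < 0)
        return -1;
    if (!S_ISFIFO(sa.st_mode) || !S_ISFIFO(sb.st_mode))
        return FD_UNRELATED;
    if (sa.st_dev != sb.st_dev || sa.st_ino != sb.st_ino)
        return FD_UNRELATED;

    fa = fcntl(a, F_GETFL);
    fb = fcntl(b, F_GETFL);
    if (fa < 0 || fb < 0)
        return -1;

    return (fa & O_ACCMODE) == (fb & O_ACCMODE) ? FD_PIPE_DUP
                                                : FD_PIPE_PEER;
}
```

For a pipe p[] and d = dup(p[0]), this reports p[0]/p[1] as peers and p[0]/d
as dups.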

Thanks to Matt Helsley for the optimized logic/code in match_pipe_ends().

TODO:
There are a few TODOs marked out in the patch. Hopefully these
can be addressed without significant impact to the central-logic
of saving/restoring pipes.

- Temporarily using a regular-file's fd as 'trampoline-fd' when
  all fds are in use

- Maybe read all fdinfo into memory during restart, so we can
  reduce the information we save into the checkpoint-file
  (see comments near 'struct fdinfo').

- Check logic of detecting 'dup's of pipe fds (any hidden
  gotchas ?) See pair_pipe_fds()

- Alloc ppi_list[] dynamically (see getfdinfo()).

- Use getrlimit() to compute max-open-fds (see near caller of
  pair_pipe_fds()).

- [Oleg Nesterov]: SIGIO/inotify() issues associated with writing-back
  to pipes (fixing this would require some assistance from kernel ?)
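
For the getrlimit() item in the list above, a minimal sketch could be the
following. max_open_fds() is a hypothetical helper, not something in cr.c, and
the sysconf() fallback for an unlimited soft limit is an assumption.

```c
#include <assert.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Return the soft limit on open fds for this process. If the limit is
 * unlimited or cannot be read, fall back to sysconf(_SC_OPEN_MAX).
 */
long max_open_fds(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
        return (long)rl.rlim_cur;
    return sysconf(_SC_OPEN_MAX);
}
```

The result would replace a hard-coded bound such as 1024 when sizing per-fd
arrays or scanning for a free trampoline fd.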

Ran several unit-test cases (see test-patches). Additional cases to be
developed/executed.

Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
---
 cr.c |  262 --
 1 files changed, 240 insertions(+), 22 deletions(-)

diff --git a/cr.c b/cr.c
index 716cc86..f40a4fb 100644
--- a/cr.c
+++ b/cr.c
@@ -79,8 +79,25 @@ typedef struct isockinfo_t {
char tcpstate[TPI_LEN];
 } isockinfo_t;
 
+/*
+ * TODO: restore_fd() processes each fd as it reads it off the checkpoint
+ * file. To avoid making a second pass at the file, we store the following
+ * fields during checkpoint (for now).
+ *
+ * 'peer_fdnum, dup_fdnum, create_pipe, tramp_fd' fields can be
+ *
+ * We could eliminate these fields by reading all fdinfo into memory
+ * and then 'computing' the above fields before processing the fds.
+ * But this would require a non-trivial rewrite of the restore_fd()
+ * logic. Hopefully that can be done without significant impact to
+ * rest of the logic associated with saving/restoring pipes.
+ */
 typedef struct fdinfo_t {
int fdnum;  /* file descriptor number */
+   int peer_fdnum; /* peer fd for pipes */
+   int dup_fdnum;  /* fd, if fd is dup of another pipe fd */
+   int create_pipe;/* TRUE if this is the create-end of the pipe */
+   int tramp_fd;   /* trampoline-fd for use in restoring pipes */
mode_t mode;/* mode as per stat(2) */
off_t offset;   /* read/write pointer position for regular files */
int flag;   /* open(2) flag */
@@ -117,6 +134,7 @@ typedef struct pinfo_t {
int nt; /* number of thread child (0 if no thread lib) */
pid_t *tpid;/* array of thread info */
struct pinfo_t *pmt;/* multithread: pointer to main thread info */
+   int tramp_fd;   /* trampoline-fd for use in restoring pipes */
 } pinfo_t;
 
 /*
@@ -263,6 +281,89 @@ int getsockinfo(pid_t pid, pinfo_t *pi, int num)
return ret;
 }
 
+typedef struct pipe_peer_info {
+   fdinfo_t *pipe_fdi;
+   //fdinfo_t *peer_fdi;
+   __ino_t pipe_ino;
+} pipe_peer_info_t;
+
+__ino_t get_fd_ino(char *fname)
+{
+   struct stat sbuf;
+
+   if (stat(fname, &sbuf) < 0)
+   ERROR("stat() on fd %s failed, errno %d\n", fname, errno);
+
+   return sbuf.st_ino;
+}
+
+static void pair_pipe_fds(pipe_peer_info_t *ppi_list, int npipe_fds)
+{
+   int i, j;
+   pipe_peer_info_t *xppi, *yppi;
+   fdinfo_t *xfdi, *yfdi;
+
+   /*
+* TODO: This currently assumes pipefds have not been dup'd.
+*   Of course, need to kill this assumption soon.
+*/
+   for (i = 0; i < npipe_fds; i++) {
+   xppi = &ppi_list[i];
+   xfdi = xppi->pipe_fdi;
+
+   j = i + 1;
+   for (j = i+1; j < npipe_fds; j++) {
+   yppi = &ppi_list[j];
+   yfdi = yppi->pipe_fdi;
+
+   if (yppi->pipe_ino != xppi->pipe_ino)
+   continue;
+
+   DEBUG("Checking flag i %d, j %d\n", i, j);
+   /*
+* i and j refer to same pipe. Check if they are

[Devel] [PATCH 0/4][cryo] Test pipes

2008-06-23 Thread sukadev

PATCH[1/4]: Support-multiple-pipe-test-cases.patch
PATCH[2/4]: Testcase-3-continous-read-write-to-pipe.patch
PATCH[3/4]: Test4-Non-consecutive-pipe-fds.patch
PATCH[4/4]: Test-5-Read-write-using-dup-of-pipe-fds.patch


[Devel] [PATCH 2/4]: Test3: continuous read/write to pipe

2008-06-23 Thread sukadev

>From ade1b719f7d9968e0f934daf736ca1746cb6747d Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Sun, 22 Jun 2008 22:26:18 -0700
Subject: [PATCH] Testcase 3: continuous read/write to pipe

---
 tests/pipe.c |   82 +-
 1 files changed, 81 insertions(+), 1 deletions(-)

diff --git a/tests/pipe.c b/tests/pipe.c
index 76be6cc..c81ac2a 100644
--- a/tests/pipe.c
+++ b/tests/pipe.c
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#define min(a, b)  ((a) < (b) ? (a) : (b))
+static char *temp_file;
 char *test_descriptions[] = {
NULL,
"Test with an empty pipe",
@@ -18,7 +20,7 @@ char *test_descriptions[] = {
"Test with all-fds in use for pipes",
 };
 
-static int last_num = 2;
+static int last_num = 3;
 usage(char *argv[])
 {
int i;
@@ -82,12 +84,89 @@ int test2()
}
 }
 
+int read_write_pipe()
+{
+   int i = 0;
+   int rc;
+   int fds[2];
+   int fd1, fd2;
+   int c;
+   char *wbufp;
+   char wbuf[256];
+   char rbuf[256];
+   char *rbufp;
+
+   rbufp = &rbuf[0];
+   wbufp = &wbuf[0];
+
+   strcpy(wbufp, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
+   memset(rbufp, '\0', sizeof(rbuf));
+
+   if (pipe(fds) < 0) {
+   perror("pipe()");
+   exit(1);
+   }
+   printf("fds[0] %d, fds[1] %d\n", fds[0], fds[1]);
+
+   if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) {
+   perror("fcntl()");
+   exit(1);
+   }
+
+   printf("Running as %d\n", getpid());
+   for (i = 0; i < 50; i++) {
+   sleep(1);
+   if (i%2 == 0) {
+   c = errno = 0;
+   rc = read(fds[0], rbufp, 3);
+
+   if (rc < 0)
+   perror("read() failed");
+   else {
+   printf("i is %d (pid %d), rbufp %p read %s\n",
+   i, getpid(), rbufp, rbufp);
+   rbufp += rc;
+   }
+
+   if (*wbufp == '\0')
+   continue;
+
+   errno = 0;
+   rc = write(fds[1], wbufp, min(3, strlen(wbufp)));
+   if (rc < 0) {
+   perror("write() to pipe");
+   } else {
+   if (rc != 3) {
+   printf("Wrote %d of 3 bytes, "
+   "error %d\n", rc, errno);
+   }
+   wbufp += rc;
+   }
+   }
+   }
+
+   if (strncmp(wbuf, rbuf, strlen(wbufp))) {
+   printf("Wrote: %s\n", wbuf);
+   printf("Read : %s\n", rbuf);
+   printf("Test FAILED\n");
+   } else {
+   printf("Test passed\n");
+   }
+}
+
+static void test3()
+{
+   read_write_pipe();
+}
+
 int
 main(int argc, char *argv[])
 {
int c;
int tc_num;
 
+   temp_file = argv[0];
+
while((c = getopt(argc, argv, "t:")) != EOF) {
switch(c) {
case 't':
@@ -102,6 +181,7 @@ main(int argc, char *argv[])
switch(tc_num) {
case 1: test1(); break;
case 2: test2(); break;
+   case 3: test3(); break;
default:
printf("Unsupported test case %d\n", tc_num);
usage(argv);
-- 
1.5.2.5



[Devel] [PATCH 1/4]: Support multiple pipe test cases

2008-06-23 Thread sukadev
>From a99deb9bcdd611c52589fa733dd90057f1f134bf Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Sun, 22 Jun 2008 21:05:48 -0700
Subject: [PATCH] Support multiple pipe test cases

Modify pipe.c to support multiple test cases and to select a test case
using the -t option.

Implement two tests:
- empty pipe
- single write to pipe followed by continuous read
---
 tests/pipe.c |   93 ++---
 1 files changed, 75 insertions(+), 18 deletions(-)

diff --git a/tests/pipe.c b/tests/pipe.c
index 0812cb3..76be6cc 100644
--- a/tests/pipe.c
+++ b/tests/pipe.c
@@ -1,52 +1,109 @@
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
-#include 
 
-int main(int argc, char *argv[])
+char *test_descriptions[] = {
+   NULL,
+   "Test with an empty pipe",
+   "Test with single-write to and continous read from a pipe",
+   "Test continous reads/writes from pipe",
+   "Test non-consecutive pipe-fds",
+   "Test with read-fd > write-fd",
+   "Test with read-fd/write-fd swapped",
+   "Test with all-fds in use",
+   "Test with all-fds in use for pipes",
+};
+
+static int last_num = 2;
+usage(char *argv[])
+{
+   int i;
+   printf("Usage: %s -t \n", argv[0]);
+   printf("\t where  1 && strcmp(argv[1], "-e") == 0)
-   empty = 1;
-
if (pipe(fds) < 0) {
perror("pipe()");
exit(1);
}
 
-   if (!empty)
-   write(fds[1], buf, strlen(buf));
+   write(fds[1], buf, strlen(buf));
 
if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) {
perror("fcntl()");
exit(1);
}
+
printf("Running as %d\n", getpid());
while (i<100) {
sleep(1);
if (i%5 == 0) {
c = errno = 0;
rc = read(fds[0], &c, 1);
-   if (rc != 1) {
-   perror("read() failed");
-   }
-   printf("i is %d (pid %d), c is %c\n", i, getpid(), c);
-
+   if (rc != 1)
+   perror("read() pipe failed\n");
+   printf("i is %d (pid %d), next byte is %d\n", i,
+   getpid(), c);
}
i++;
}
 }
 
+int
+main(int argc, char *argv[])
+{
+   int c;
+   int tc_num;
+
+   while((c = getopt(argc, argv, "t:")) != EOF) {
+   switch(c) {
+   case 't':
+   tc_num = atoi(optarg);
+   break;
+   default:
+   printf("Unknown option %c\n", c);
+   usage(argv);
+   }
+   }
+
+   switch(tc_num) {
+   case 1: test1(); break;
+   case 2: test2(); break;
+   default:
+   printf("Unsupported test case %d\n", tc_num);
+   usage(argv);
+   }
+}
-- 
1.5.2.5



[Devel] [PATCH 3/4][cryo]: Test 4: Non-consecutive pipe fds

2008-06-23 Thread sukadev

>From c7276c8cb59247faa13d42a1b871c853a80d80d1 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
Date: Sun, 22 Jun 2008 22:43:01 -0700
Subject: [PATCH] Test4: Non-consecutive pipe fds

---
 tests/pipe.c |   76 ++---
 1 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/tests/pipe.c b/tests/pipe.c
index c81ac2a..5b04f46 100644
--- a/tests/pipe.c
+++ b/tests/pipe.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define min(a, b)  ((a) < (b) ? (a) : (b))
 static char *temp_file;
@@ -20,7 +21,7 @@ char *test_descriptions[] = {
"Test with all-fds in use for pipes",
 };
 
-static int last_num = 3;
+static int last_num = 4;
 usage(char *argv[])
 {
int i;
@@ -84,13 +85,50 @@ int test2()
}
 }
 
-int read_write_pipe()
+
+reset_pipe_fds(int *tmpfds, int *testfds, int close_unused)
+{
+   struct stat statbuf;
+   int rc;
+
+   if (fstat(testfds[0], &statbuf) == 0) {
+   printf("fd %d is in use...\n", testfds[0]);
+   exit(1);
+   }
+   if (fstat(testfds[1], &statbuf) == 0) {
+   printf("fd %d is in use...\n", testfds[1]);
+   exit(1);
+   }
+
+   rc = dup2(tmpfds[0], testfds[0]);
+   if (rc < 0) {
+   printf("dup2(%d, %d) failed, error %d\n",
+   tmpfds[0], testfds[0], errno);
+   exit(1);
+   }
+
+   rc = dup2(tmpfds[1], testfds[1]);
+   if (rc < 0) {
+   printf("dup2(%d, %d) failed, error %d\n",
+   tmpfds[1], testfds[1], errno);
+   exit(1);
+   }
+
+   if (close_unused) {
+   close(tmpfds[0]);
+   close(tmpfds[1]);
+   }
+}
+
+#define TEST_NON_CONSECUTIVE_FD 0x1
+
+int read_write_pipe(int *testfdsp, int close_unused)
 {
int i = 0;
int rc;
-   int fds[2];
-   int fd1, fd2;
+   int tmpfds[2];
int c;
+   int read_fd, write_fd;
char *wbufp;
char wbuf[256];
char rbuf[256];
@@ -102,13 +140,23 @@ int read_write_pipe()
strcpy(wbufp, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
memset(rbufp, '\0', sizeof(rbuf));
 
-   if (pipe(fds) < 0) {
+   if (pipe(tmpfds) < 0) {
perror("pipe()");
exit(1);
}
-   printf("fds[0] %d, fds[1] %d\n", fds[0], fds[1]);
 
-   if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0) {
+   if (testfdsp) {
+   reset_pipe_fds(tmpfds, testfdsp, close_unused);
+   read_fd = testfdsp[0];
+   write_fd = testfdsp[1];
+   } else {
+   read_fd = tmpfds[0];
+   write_fd = tmpfds[1];
+   }
+
+   printf("read_fd %d, write_fd %d\n", read_fd, write_fd);
+
+   if (fcntl(read_fd, F_SETFL, O_NONBLOCK) < 0) {
perror("fcntl()");
exit(1);
}
@@ -118,7 +166,7 @@ int read_write_pipe()
sleep(1);
if (i%2 == 0) {
c = errno = 0;
-   rc = read(fds[0], rbufp, 3);
+   rc = read(read_fd, rbufp, 3);
 
if (rc < 0)
perror("read() failed");
@@ -132,7 +180,7 @@ int read_write_pipe()
continue;
 
errno = 0;
-   rc = write(fds[1], wbufp, min(3, strlen(wbufp)));
+   rc = write(write_fd, wbufp, min(3, strlen(wbufp)));
if (rc < 0) {
perror("write() to pipe");
} else {
@@ -156,7 +204,14 @@ int read_write_pipe()
 
 static void test3()
 {
-   read_write_pipe();
+   read_write_pipe(NULL, 1);
+}
+
+static void test4()
+{
+   int tmpfds[2] = { 172, 101 };
+
+   read_write_pipe(tmpfds, 1);
 }
 
 int
@@ -182,6 +237,7 @@ main(int argc, char *argv[])
case 1: test1(); break;
case 2: test2(); break;
case 3: test3(); break;
+   case 4: test4(); break;
default:
printf("Unsupported test case %d\n", tc_num);
usage(argv);
-- 
1.5.2.5


