[Devel] Re: [PATCH 4/8] af_unix: Allow SO_PEERCRED to work across namespaces.

2010-06-14 Thread Daniel Lezcano
On 06/13/2010 03:30 PM, Eric W. Biederman wrote:
 Use struct pid and struct cred to store the peer credentials on struct
 sock.  This gives enough information to convert the peer credential
 information to a value relative to whatever namespace the socket is in
 at the time.

 This removes nasty surprises when using SO_PEERCRED on socket
 connetions where the processes on either side are in different pid and
 user namespaces.

 Signed-off-by: Eric W. Biedermanebied...@xmission.com


Acked-by: Daniel Lezcano daniel.lezc...@free.fr

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 5/8] af_netlink: Add needed scm_destroy after scm_send.

2010-06-14 Thread Daniel Lezcano
On 06/13/2010 03:31 PM, Eric W. Biederman wrote:
 scm_send occasionally allocates state in the scm_cookie, so I have
 modified netlink_sendmsg to guarantee that when scm_send succeeds
 scm_destory will be called to free that state.

 Signed-off-by: Eric W. Biedermanebied...@xmission.com
 ---


Reviewed-by: Daniel Lezcano daniel.lezc...@free.fr
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 8/8] af_unix: Allow connecting to sockets in other network namespaces.

2010-06-14 Thread Daniel Lezcano
On 06/13/2010 03:35 PM, Eric W. Biederman wrote:
 Remove the restriction that only allows connecting to a unix domain
 socket identified by unix path that is in the same network namespace.

 Crossing network namespaces is always tricky and we did not support
 this at first, because of a strict policy of don't mix the namespaces.
 Later after Pavel proposed this we did not support this because no one
 had performed the audit to make certain using unix domain sockets
 across namespaces is safe.

 What fundamentally makes connecting to af_unix sockets in other
 namespaces is safe is that you have to have the proper permissions on
 the unix domain socket inode that lives in the filesystem.  If you
 want strict isolation you just don't create inodes where unfriendlys
 can get at them, or with permissions that allow unfriendlys to open
 them.  All nicely handled for us by the mount namespace and other
 standard file system facilities.

 I looked through unix domain sockets and they are a very controlled
 environment so none of the work that goes on in dev_forward_skb to
 make crossing namespaces safe appears needed, we are not loosing
 controll of the skb and so do not need to set up the skb to look like
 it is comming in fresh from the outside world.  Further the fields in
 struct unix_skb_parms should not have any problems crossing network
 namespaces.

 Now that we handle SCM_CREDENTIALS in a way that gives useable values
 across namespaces.  There does not appear to be any operational
 problems with encouraging the use of unix domain sockets across
 containers either.

 Signed-off-by: Eric W. Biedermanebied...@xmission.com


Acked-by: Daniel Lezcano daniel.lezc...@free.fr

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH RFC user-cr] add -p option to ckptinfo to dump pids tree

2010-06-14 Thread Oren Laadan

I added a '-p' flag to ckptinfo based on this patch (will push
to repository later).

Thanks,

Oren.


On 03/09/2010 04:05 PM, Serge E. Hallyn wrote:
 This will be used by a restart wrapper to analyze /proc/$$/mountinfo.
 
 Also add the new rpids field to ckpt_pids to reflect kernel.
 
 It depends on the patch I just sent for linux-cr adding rpid to
 struct ckpt_pids.
 
 Signed-off-by: Serge E. Hallyn se...@us.ibm.com
 ---
  ckptinfo.c |   36 +++-
  include/linux/checkpoint_hdr.h |1 +
  2 files changed, 36 insertions(+), 1 deletions(-)
 
 diff --git a/ckptinfo.c b/ckptinfo.c
 index 6190301..b5922da 100644
 --- a/ckptinfo.c
 +++ b/ckptinfo.c
 @@ -28,6 +28,7 @@ static char usage_str[] =
  \tOptions:\n
  \t -h,--help print this help message\n
  \t -e,--errorshow error messages\n
 +\t -p,--pids  just list process tree\n
  \t -v,--verbose  verbose output\n
  \t--show-arch-regs   show registers contents\n
  ;
 @@ -36,6 +37,7 @@ struct args {
   int error;
   int verbose;
   int show_arch_regs;
 + int show_process_tree;
  };
  
  int __verbose;
 @@ -55,6 +57,7 @@ static int image_parse_vma(struct ckpt_hdr *h, int fd, 
 struct args *args);
  static int image_parse_file(struct ckpt_hdr *h, int fd, struct args *args);
  static int image_parse_objref(struct ckpt_hdr *h, int fd, struct args *args);
  static int image_parse_error(struct ckpt_hdr *h, int fd, struct args *args);
 +static void show_tasks(struct ckpt_hdr *h, int numpids);
  
  #ifdef __i386__
  #define __HAVE_image_parse_cpu
 @@ -85,10 +88,11 @@ static void parse_args(struct args *args, int argc, char 
 *argv[])
   { help,   no_argument,NULL, 'h' },
   { error,  no_argument,NULL, 'e' },
   { verbose,no_argument,NULL, 'v' },
 + { pids,   no_argument,NULL, 'p' },
   { show-arch-regs, no_argument,NULL, 1 },
   { NULL, 0,  NULL, 0 }
   };
 - static char optc[] = hve;
 + static char optc[] = hvep;
  
   while (1) {
   int c = getopt_long(argc, argv, optc, opts, NULL);
 @@ -99,6 +103,9 @@ static void parse_args(struct args *args, int argc, char 
 *argv[])
   exit(1);
   case 'h':
   usage(usage_str);
 + case 'p':
 + args-show_process_tree = 1;
 + break;
   case 'e':
   args-error = 1;
   break;
 @@ -208,6 +215,7 @@ static int image_parse(int fd, struct args *args)
  {
   struct ckpt_hdr *h;
   int ret;
 + int numpids, next_read_pids = 0;
  
   do {
   ret = image_read_obj(fd, h);
 @@ -215,6 +223,11 @@ static int image_parse(int fd, struct args *args)
   break;
   if (!h)
   continue;
 + if (next_read_pids) {
 + next_read_pids = 0;
 + if (args-show_process_tree)
 + show_tasks(h, numpids);
 + }
   switch (h-type) {
   case CKPT_HDR_OBJREF:
   ret = image_parse_objref(h, fd, args);
 @@ -232,12 +245,33 @@ static int image_parse(int fd, struct args *args)
   ret = image_parse_error(h, fd, args);
   break;
   }
 + if (h-type == CKPT_HDR_TREE) {
 + struct ckpt_hdr_tree *t = (struct ckpt_hdr_tree *) h;
 + numpids = t-nr_tasks;
 + next_read_pids = 1;
 + }
   free(h);
   } while (ret  0);
  
   return ret;
  }
  
 +struct ckpt_hdr_pids {
 + struct ckpt_hdr h;
 + struct ckpt_pids p;
 +};
 +
 +static void show_tasks(struct ckpt_hdr *h, int numpids)
 +{
 + struct ckpt_hdr_pids *pp = (struct ckpt_hdr_pids *)h;
 + struct ckpt_pids *p = pp-p;
 + int i;
 +
 + for (i=0; inumpids; i++)
 + printf(Task %d: rpid %d pid %d vppid %d\n, i, p[i].rpid,
 + p[i].vpid, p[i].vppid);
 +}
 +
  static int image_parse_objref(struct ckpt_hdr *h, int fd, struct args *args)
  {
   struct ckpt_hdr_objref *hh = (struct ckpt_hdr_objref *) h;
 diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
 index e8eaf23..9c6bc6d 100644
 --- a/include/linux/checkpoint_hdr.h
 +++ b/include/linux/checkpoint_hdr.h
 @@ -321,6 +321,7 @@ struct ckpt_hdr_tree {
  } __attribute__((aligned(8)));
  
  struct ckpt_pids {
 + __s32 rpid;
   __s32 vpid;
   __s32 vppid;
   __s32 vtgid;
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers


[Devel] Re: [RFC][PATCH 0/6][usercr]: Rename/reorg usercr code

2010-06-14 Thread Oren Laadan

Patches 1-3 applied (and will be pushed shortly).

Oren.

On 04/25/2010 04:24 PM, Oren Laadan wrote:
 Hi Suka,
 
 Thanks for the patchset. A couple of comments:
 
 * I prefer the header exported to users to be checkpoint.h - this
 is consistent with kernel headers, and with future name of a c/r
 library if we opt libcheckpoint.a
 
 * I also prefer to leave checkpoint.c as a separate file, as is.
 It may gain more functionality in the future. If the goal was to
 only export a single .o file, then the solution IMHO is to instead
 export a single library: libcheckpoint.a
 
 Unless you have a strong opinion against the above, I'll go ahead
 and pull patches 1-3, leave out 4-6, and update the Makefile to
 create a libcheckpoint.a library.
 
 Oren.
 
 
 Sukadev Bhattiprolu wrote:
 Sukadev Bhattiprolu [suka...@linux.vnet.ibm.com] wrote:
 | 
 | Change the prefix of the USERCR apis to 'cr_' and reorg the code to avoid
 | duplication and reduce code size.
 | 
 |[PATCH 1/6] Change API prefix to cr_
 |[PATCH 2/6] Remove flags parameter to cr_checkpoint()
 |[PATCH 3/6] Minor reorg of restart.c
 |[PATCH 4/6] Move checkpoint() into restart.c
 |[PATCH 5/6] Rename restart.c to cr_checkpoint.c
 |[PATCH 6/6] Rename common.h to cr_log.h
 | 
 | With this change, USERCR for now exports just the following two files:
 | 
 |cr_checkpoint.o
 |cr_checkpoint.h
 | 
 | But this patchset does cause some churn, let me know if you think any of it
 | is unnecessary/noise.  I have tested for now along with the patch that 
 removes
 | most exits() from cr_restart(). If the reorg makes sense, will run all 
 tests
 | once more and resubmit.

 All tests in cr-tests pass with these two patchsets (on usercr ckpt-v20-dev
 and linux ckpt-v21-rc2).

 Pls let me know if there are comments about the patchsets.

 Thanks,

 Sukadev

 ___
 Containers mailing list
 contain...@lists.linux-foundation.org
 https://lists.linux-foundation.org/mailman/listinfo/containers
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] cr: pts (v2): detect use of multiple devpts mounts in container

2010-06-14 Thread Oren Laadan
Hi,

I have three questions:

1) I read the comment (further below):
   ...
   + * is.  Note that means that when we start checkpointing mounts
   + * and filesystems we'll need to change this.
   ...
and I wonder if this is only temporary until checkpoint mounts
work, and then there will be something completely different ?

2) If so, then would the following work instead: we could keep a
pointer (to SB) on the ckpt_ctx. which will start NULL and will be
set to the SB when we first see one, and compared against additional
ones.

It's safe because the SB is pinned down by the pty, and it's much
less logic/code considering that this is purely temporary.

3) Does this alert us when a (single) pty namespace in the container
is shared with the parent container ?

Oren.



On 04/29/2010 06:43 PM, Serge E. Hallyn wrote:
 We don't support multiple devpts mounts in a container.  So
 bail on checkpoint if a container has them.  This will cause
 checkpoint to fail of a container which still has a pty from
 parent container in use.
 
 Since I don't actually need any methods in the ckpt_ops for
 ptsns, I just register the ckpt_ops in tty_io.c which is never
 a module.  We will presumably have to deal with modular ckpt_ops
 after the next submission, and we've talked about how to do it,
 but it's kind of late in this cycle to do that now.
 
 checkpointing a task with open fd to pty from parent ns results
 in a message like:
 
 [err -16][pos 3382][E @ tty_file_checkpoint:2760]Open pty from alien ptsns
 
 Signed-off-by: Serge E. Hallyn se...@us.ibm.com
 ---
  drivers/char/tty_io.c|   32 
  include/linux/checkpoint_hdr.h   |   18 ++
  include/linux/checkpoint_types.h |2 ++
  3 files changed, 52 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
 index d264000..999317c 100644
 --- a/drivers/char/tty_io.c
 +++ b/drivers/char/tty_io.c
 @@ -96,6 +96,7 @@
  #include linux/bitops.h
  #include linux/delay.h
  #include linux/seq_file.h
 +#include linux/mount.h
  
  #include linux/uaccess.h
  #include asm/system.h
 @@ -2687,6 +2688,27 @@ static int tty_can_checkpoint(struct ckpt_ctx *ctx, 
 struct tty_struct *tty)
   return 1;
  }
  
 +/*
 + * If tty_file_checkpoint() ever supports more than unix98 ptys we'll have
 + * to check for that
 + */
 +static int detect_multiple_ptsns(struct ckpt_ctx *ctx, struct file *file)
 +{
 + int objref;
 + int new;
 +
 + objref = ckpt_obj_lookup_add(ctx, file-f_path.mnt-mnt_sb,
 +   CKPT_OBJ_PTS_NS, new);
 + if (objref  0)
 + return objref;
 + if (new  ctx-num_devpts_ns)
 + return -EBUSY;
 + if (file-f_path.mnt-mnt_ns != ctx-root_nsproxy-mnt_ns)
 + return -EBUSY;
 + ctx-num_devpts_ns++;
 + return 0;
 +}
 +
  static int tty_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
  {
   struct ckpt_hdr_file_tty *h;
 @@ -2732,6 +2754,11 @@ static int tty_file_checkpoint(struct ckpt_ctx *ctx, 
 struct file *file)
   ret = ckpt_write_obj(ctx, h-common.h);
  
   ckpt_hdr_put(ctx, h);
 +
 + ret = detect_multiple_ptsns(ctx, file);
 + if (ret)
 + ckpt_err(ctx, ret, Open pty from alien ptsns\n);
 +
   return ret;
  }
  
 @@ -3680,6 +3707,8 @@ static struct cdev tty_cdev, console_cdev;
   */
  static int __init tty_init(void)
  {
 + int err;
 +
   cdev_init(tty_cdev, tty_fops);
   if (cdev_add(tty_cdev, MKDEV(TTYAUX_MAJOR, 0), 1) ||
   register_chrdev_region(MKDEV(TTYAUX_MAJOR, 0), 1, /dev/tty)  0)
 @@ -3698,6 +3727,9 @@ static int __init tty_init(void)
   vty_init(console_fops);
  #endif
  #ifdef CONFIG_CHECKPOINT
 + err = register_checkpoint_obj(ckpt_obj_ptsns_ops);
 + if (err)
 + return err;
   return checkpoint_register_tty();
  #else
   return 0;
 diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
 index eb5e1b4..27113ca 100644
 --- a/include/linux/checkpoint_hdr.h
 +++ b/include/linux/checkpoint_hdr.h
 @@ -271,6 +271,8 @@ enum obj_type {
  #define CKPT_OBJ_NET_NS CKPT_OBJ_NET_NS
   CKPT_OBJ_NETDEV,
  #define CKPT_OBJ_NETDEV CKPT_OBJ_NETDEV
 + CKPT_OBJ_PTS_NS,
 +#define CKPT_OBJ_PTS_NS CKPT_OBJ_PTS_NS
   CKPT_OBJ_MAX
  #define CKPT_OBJ_MAX CKPT_OBJ_MAX
  };
 @@ -1142,5 +1144,21 @@ struct ckpt_hdr_ldisc_n_tty {
  #define CKPT_TST_OVERFLOW_64(a, b) \
   ((sizeof(a)  sizeof(b))  ((a)  LONG_MAX))
  
 +#if defined(CONFIG_UNIX98_PTYS) || defined(CONFIG_UNIX98_PTYS_MODULE)
 +/*
 + * We don't yet support multiple pts mounts in a container, so
 + * all we're doing here is detecting pty's in 1 ptsns.  We'll
 + * actually stick the sb on the objhash because that's all there
 + * is.  Note that means that when we start checkpointing mounts
 + * and filesystems we'll need to change this.
 + *
 + * We don't need to pin these bc they are pinned by the pty,
 + * which we do have 

[Devel] Re: [PATCH][cr][nit]: eclone() - Remove extra __user and add a comment

2010-06-14 Thread Oren Laadan

Applied in ckpt-v22-dev (will be pushed soon)

Thanks,

Oren.

On 05/03/2010 10:04 PM, Sukadev Bhattiprolu wrote:
 
 Sorry, I thought I had sent this minor patch out earlier, but I just
 realized I had not.
 ---
 
 From: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 Date: Mon, 3 May 2010 11:38:16 -0700
 Subject: [PATCH 1/1] eclone(): Remove extra __user and add a comment
 
 As pointed out by Russel King, '__user' does not make sense on an integer
 type. Remove it. Also add a comment on why we want 'stack_size' to be 0
 on x86 and s390.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---
  arch/s390/kernel/process.c |5 -
  arch/x86/kernel/process.c  |5 -
  2 files changed, 8 insertions(+), 2 deletions(-)
 
 diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
 index efc7e8a..b00ffad 100644
 --- a/arch/s390/kernel/process.c
 +++ b/arch/s390/kernel/process.c
 @@ -276,7 +276,7 @@ SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct 
 clone_args __user *,
   int __user *parent_tid_ptr;
   int __user *child_tid_ptr;
   unsigned long flags;
 - unsigned long __user child_stack;
 + unsigned long child_stack;
   unsigned long stack_size;
  
   rc = fetch_clone_args_from_user(uca, args_size, kca);
 @@ -288,6 +288,9 @@ SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct 
 clone_args __user *,
   child_tid_ptr =  (int __user *) kca.child_tid_ptr;
  
   stack_size = (unsigned long) kca.child_stack_size;
 + /*
 +  * s390 does not need/use the stack_size. Ensure it is unused.
 +  */
   if (stack_size)
   return -EINVAL;
  
 diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
 index 5abad20..8b1699a 100644
 --- a/arch/x86/kernel/process.c
 +++ b/arch/x86/kernel/process.c
 @@ -268,7 +268,7 @@ sys_eclone(unsigned flags_low, struct clone_args __user 
 *uca,
   unsigned long flags;
   int __user *parent_tidp;
   int __user *child_tidp;
 - unsigned long __user stack;
 + unsigned long stack;
   unsigned long stack_size;
  
   rc = fetch_clone_args_from_user(uca, args_size, kca);
 @@ -286,6 +286,9 @@ sys_eclone(unsigned flags_low, struct clone_args __user 
 *uca,
   parent_tidp = (int *)(unsigned long)kca.parent_tid_ptr;
   child_tidp = (int *)(unsigned long)kca.child_tid_ptr;
  
 + /*
 +  * x86 does not need/use the stack_size. Ensure it is unused.
 +  */
   stack_size = (unsigned long)kca.child_stack_size;
   if (stack_size)
   return -EINVAL;
 
 
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH RFC] nsproxy: strengthen the comment above copy_namespaces

2010-06-14 Thread Oren Laadan
Serge -

Is this intended for linux-cr ?
Sounds like it should go (or already?) upstream.

Oren.

On 04/30/2010 02:36 PM, Serge E. Hallyn wrote:
 copy_namespaces() cannot be used on a live task.  There are two
 reasons - it can be used to assign another task's nsproxy, which
 is not legal (for a live task).  And, it does not use
 rcu_assign_pointer, so even when we're being extra-clever it is
 unsafe.
 
 Signed-off-by: Serge Hallyn se...@us.ibm.com
 ---
  kernel/nsproxy.c |3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)
 
 diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
 index f74e6c0..67d8f2b 100644
 --- a/kernel/nsproxy.c
 +++ b/kernel/nsproxy.c
 @@ -115,6 +115,9 @@ out_ns:
  /*
   * called from clone.  This now handles copy for nsproxy and all
   * namespaces therein.
 + * Note this should only be used in fork.  Specifically, tsk should
 + * not yet be a live task.  See the the unshare code for how to handle
 + * that.
   */
  int copy_namespaces(unsigned long flags, struct task_struct *tsk)
  {
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 1/1] Fix typos and minor errors in eclone documentation

2010-06-14 Thread Oren Laadan

Pulled for ckpt-v22-dev (will push soon).

Oren.

On 05/05/2010 11:22 PM, Sukadev Bhattiprolu wrote:
 The typos/errors were identified by Randy Dunlap. Also add a pointer to the
 user-cr git tree and a note about building eclone() library interface for 
 other
 architectures. This pointer can probably be removed when eclone() makes it
 into libc.
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com
 ---
  Documentation/eclone |   37 ++---
  1 files changed, 22 insertions(+), 15 deletions(-)
 
 diff --git a/Documentation/eclone b/Documentation/eclone
 index c2f1b4b..c1032d5 100644
 --- a/Documentation/eclone
 +++ b/Documentation/eclone
 @@ -13,7 +13,7 @@ struct clone_args {
  sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
   pid_t * __user pids)
  
 - In addition to doing everything that clone() system call does, the
 + In addition to doing everything that the clone() system call does, the
   eclone() system call:
  
   - allows additional clone flags (31 of 32 bits in the flags
 @@ -30,7 +30,7 @@ sys_eclone(u32 flags_low, struct clone_args * __user cargs, 
 int cargs_size,
   pids to specify during restart.
  
   The @flags_low parameter is identical to the 'clone_flags' parameter
 - in existing clone() system call.
 + in the existing clone() system call.
  
   The fields in 'struct clone_args' are meant to be used as follows:
  
 @@ -47,35 +47,34 @@ sys_eclone(u32 flags_low, struct clone_args * __user 
 cargs, int cargs_size,
   clone() and clone2() (on IA64) system calls. The usage of
   these two fields depends on the processor architecture.
  
 - Most architectures use -child_stack to pass-in a stack-pointer
 + Most architectures use -child_stack to pass in a stack-pointer
   itself and don't need the -child_stack_size field. On these
   architectures the -child_stack_size field must be 0.
  
 - Some architectures, eg IA64, use -child_stack to pass-in the
 + Some architectures, e.g. IA64, use -child_stack to pass in the
   base of the region allocated for stack. These architectures
 - must pass in the size of the stack-region in -child_stack_size.
 + must pass in the size of the stack region in -child_stack_size.
  
   u64 parent_tid_ptr;
   u64 child_tid_ptr;
  
   These two fields correspond to the 'parent_tid_ptr' and
 - 'child_tid_ptr' fields in the clone() system call
 + 'child_tid_ptr' fields in the clone() system call.
  
   u32 nr_pids;
  
   nr_pids specifies the number of pids in the @pids array
   parameter to eclone() (see below). nr_pids should not exceed
 - the current nesting level of the calling process (i.e if the
 + the current nesting level of the calling process (i.e. if the
   process is in init_pid_ns, nr_pids must be 1, if process is
   in a pid namespace that is a child of init-pid-ns, nr_pids
   cannot exceed 2, and so on).
  
   u32 reserved0;
 - u64 reserved1;
  
 - These fields are intended to extend the functionality of the
 + This field is intended to extend the functionality of the
   eclone() in the future, while preserving backward compatibility.
 - They must be set to 0 for now.
 + It must be set to 0 for now.
  
   The @cargs_size parameter specifes the sizeof(struct clone_args) and
   is intended to enable extending this structure in the future, while
 @@ -90,7 +89,7 @@ sys_eclone(u32 flags_low, struct clone_args * __user cargs, 
 int cargs_size,
   namespace in which case the process is a container-init (and must have
   the pid 1 in that namespace).
  
 - See CLONE_NEWPID section of clone(2) man page for details about pid
 + See CLONE_NEWPID section of the clone(2) man page for details about pid
   namespaces.
  
   If a pid in the @pids list is 0, the kernel will assign the next
 @@ -103,8 +102,8 @@ sys_eclone(u32 flags_low, struct clone_args * __user 
 cargs, int cargs_size,
   The order of pids in @pids is oldest in pids[0] to youngest pid
   namespace in pids[nr_pids-1]. If the number of pids specified in the
   @pids list is fewer than the nesting level of the process, the pids
 - are applied from youngest namespace. i.e if the process is nested in
 - a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
 + are applied from the youngest namespace. I.e. if the process is nested
 + in a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
   are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
   have a pid of '0' (the kernel will assign a pid in those namespaces).
  
 @@ -119,17 +118,25 @@ 

[Devel] Re: [PATCH 6/9][cr][v2]: Checkpoint file-locks

2010-06-14 Thread Oren Laadan


On 05/18/2010 11:07 PM, Sukadev Bhattiprolu wrote:
 While checkpointing each file-descriptor, find all the locks on the
 file and save information about the lock in the checkpoint-image.
 A follow-on patch will use this informaiton to restore the file-locks.
 
 Changelog[v2]:
   [Matt Helsley]: Use fixed sizes (__s64) instead of 'loff_t' in
   'struct ckpt_hdr_file_lock'.
   [Matt Helsley, Serge Hallyn]: Highlight new use of BKL (using
   lock_flocks() macros as suggested by Serge).
   [Matt Helsley]: Reorg code a bit to simplify error handling.
   [Matt Helsley]: Reorg code to initialize marker-lock (Pass a
   NULL lock to checkpoint_one_lock() to indicate marker).
 
 Signed-off-by: Sukadev Bhattiprolu suka...@linux.vnet.ibm.com

[...]

  
 +static int checkpoint_one_file_lock(struct ckpt_ctx *ctx, struct file *file,
 + struct file_lock *lock)
 +{
 + int rc;
 + struct ckpt_hdr_file_lock *h;
 +
 + h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_LOCK);
 + if (!h)
 + return -ENOMEM;
 +
 + if (lock) {
 + h-fl_start = lock-fl_start;
 + h-fl_end = lock-fl_end;
 + h-fl_type = lock-fl_type;
 + h-fl_flags = lock-fl_flags;
 + } else {
 + /* Checkpoint a dummy lock as a marker */
 + h-fl_start = -1;

Maybe designate some constant for this ?  e.g. CKPT_FLOCK_NONE ?
In any case, you need a (loff_t) -1  (like in the restore code).

[...]

Oren.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 9/9][cr][v2]: Restore file-locks

2010-06-14 Thread Oren Laadan


On 05/26/2010 07:57 PM, Sukadev Bhattiprolu wrote:
 st...@chygwyn.com [st...@chygwyn.com] wrote:
 | Hi,
 | 
 | On Tue, May 18, 2010 at 08:07:32PM -0700, Sukadev Bhattiprolu wrote:
 |  Restore POSIX file-locks of an application from its checkpoint image.
 |  
 |  Read the saved file-locks from the checkpoint image and for each POSIX
 |  lock, call flock_set() to set the lock on the file.
 |  
 |  As pointed out by Matt Helsley, no special handling is necessary for a
 |  process P2 in the checkpointed container that is blocked on a lock, L1
 |  held by another process P1.  Since processes in the restarted container
 |  begin execution only after all processes have restored. If the blocked
 |  process P2 is restored first, first, it will prepare to return an
 |  -ERESTARTSYS from the fcntl() system call, but wait for P1 to be
 |  restored. When P1 is restored, it will re-acquire the lock L1 before P1
 |  and P2 begin actual execution. This ensures that even if P2 is scheduled
 |  to run before P1, P2 will go back to waiting for the lock L1.
 | 
 | Does that imply certain conditions wrt checkpointed processes and
 | NFS exports? I'm not sure I exactly undertstand the use case which
 | this is intended to address.
 
 Well, yes this assumes some pre-requisites are met.
 
 First lets look at a single system.  We expect that the application
 process tree is run inside a container. This means that the file
 system(s) (and other resources like pipes, IPC) that the application
 is working with are not modified by a process outside the container.

To be precise, we require that (a) resources won't change during
the checkpoint, and (b) the filesystem view at restart would be
the same as at checkpoint.

Running applications inside an isolated container is one way to
achieve that (and more so, to provide guarantees on that). Doing
that provides certain assurance on the resulting checkpoint image.

However, the requirements may be satisfied even outside a container
by, for example, a well behaved applications; except that then we
can't say it's safe -  it depends on the application.

 We also require that the application process tree be frozen before
 checkpointing the application. So even if the checkpoint process takes
 a few minutes, the state of the resources (files, pipes, signals etc)
 does not change since a) application is containerized b) container is
 frozen.
 
 We already have the ability to run applications inside containers, using
 the clone() system call (see lxc.sf.net for example) and the ability to
 freeze the application using the freezer cgroup in the linux kenrnel.
 
 | 
 | I was hoping to figure out whether it would also still be safe on
 | a cluster filesystem as well,
 
 For clusters and NFS, an external protocol has to be established so that
 the distrubuted application can be started/frozen/checkpointed/restarated
 in a coordinated way.
 
 I think that is something that would have to be built on top of the
 checkpoint/restart functionality that we are working on. Or maybe there
 are existing implementations that we would need to plug into.
 
 Hope that helps, but its possible I missed your question :-). If so
 please let me know.

What you refer to is checkpoint in a cluster, or distributed checkpoint
of multiple cooperating processes/applications that run on multiple
hosts. Indeed, one simple way to do it is coordinated distributed
checkpoint and restart.

However, I think the question was about a single application (or
container) that is accessing remote and clustered (and possibly
distributed) *file systems*.

Ideally, we would like to have a method to snapshot a filesystem
at checkpoint to guarantee that at restart we have a consistent
view of that file system. This is regardless of whether the file
system is local or remote.  In the absence of such a mechanism,
we will have to rely on the file system not being changed (at least
those parts that the checkpoint application expects to remain as
they had been) until the restart.

Oren.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases

2010-06-14 Thread Oren Laadan


On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
 Build upon the C/R of file-locks to C/R file-leases. C/R of a lease that is
 not being broken is almost identical to C/R of file-locks.  i.e save the type
 of lease for the file in the checkpoint image and when restarting, restore
 the lease by calling do_setlease().
 
 C/R of file-lease gets complicated (I think), if a process is checkpointed
 when its lease was being revoked. i.e if P1 has a F_WRLCK lease on file F1
 and P2 opens F1 for write, P2's open is blocked for lease_break_time (45 
 secs).
 P1's lease is revoked (i.e set to F_UNLCK) and P1 is notified via a SIGIO to
 flush any dirty data.
 
 This brings up two issues:
 
 First, if P1 is checkpointed during this lease_break_time, we need to remember
 to that P1 originally had a F_WRLCK lease which is now revoked to F_UNLCK.
 Checkpointing the current lease type would wrongly save the lease-type as
 F_UNLCK.
 
 Secondly, if P1 was checkpointed 40 seconds into the lease_break_time,(i.e.
 it had 5 seconds remaining in the lease), we want to ensure that after 
 restart,
 P1 gets at least 5 more seconds in the lease (no ?). (i.e P1 could be in the
 its SIGIO handler when it was checkpointed and may be about to start a new
 write(). If P1 does not gets its 5 seconds and P2's open and a read()
 completes, we would have a data corruption).
 
 This patch addresses the first issue above by adding file_lock-fl_type_prev
 field. When a lease is downgraded/revoked, the original lease type is saved
 in -fl_type_prev and is also checkpointed. When the process P1 is restarted,
 the kernel temporarily restores the original (F_WRLCK) lease.  When process
 P2 is restarted, the open() would fail with -ERESTARTSYS and the open() would
 be repeated. This open() would initiate the lease-break protocol again on P1.
 
 To address the second issue above, this patch saves the remaining-lease in
 the checkpoint image, but does not (yet) use this value. The plan is to use
 this remaining-lease period when P1/P2 are restarted so that P2 is blocked
 only for the remaining-lease rather than entire lease_break_time. I want to
 check if there are better ways to address this.

[...]

 @@ -280,8 +280,14 @@ static int checkpoint_one_file_lock(struct ckpt_ctx 
 *ctx, struct file *file,
   if (lock) {
   h-fl_start = lock-fl_start;
   h-fl_end = lock-fl_end;
 + /* checkpoint F_INPROGRESS if set for now */

Did I miss anything -- what is F_INPROGRESS ?

   h-fl_type = lock-fl_type;
 + h-fl_type_prev = lock-fl_type_prev;
   h-fl_flags = lock-fl_flags;
 + if (h-fl_type  F_INPROGRESS  
 + (lock-fl_break_time  jiffies))
 + h-fl_rem_lease = (lock-fl_break_time - jiffies) / HZ;

Hmmm -- we have a ctx-ktime_begin marking the start of the checkpoint.
It is used for relative-time calculations, for example, the expiry of
restart-blocks and timeouts.  I suggest that we use it here too to be
consistent.

 +
   } else {
   /* Checkpoint a dummy lock as a marker */
   h-fl_start = -1;
 @@ -315,7 +321,7 @@ checkpoint_file_locks(struct ckpt_ctx *ctx, struct 
 files_struct *files,
   continue;
  
   rc = -EBADF;
 - if (IS_POSIX(lockp))
 + if (IS_POSIX(lockp) || IS_LEASE(lockp))
   rc = checkpoint_one_file_lock(ctx, file, lockp);
  
   if (rc  0) {
 @@ -1055,8 +1061,10 @@ static int restore_file_locks(struct ckpt_ctx *ctx, 
 struct file *file, int fd)
   if (IS_ERR(h))
   return PTR_ERR(h);
  
 - ckpt_debug(Lock [%lld, %lld, %d, 0x%x]\n, h-fl_start,
 - h-fl_end, (int)h-fl_type, h-fl_flags);
 + ckpt_debug(Lock [%lld, %lld, %d, 0x%x], rem-lease %lus, 
 + fl-type-prev %d\n, h-fl_start, h-fl_end,
 + (int)h-fl_type, h-fl_flags, h-fl_rem_lease,
 + h-fl_type_prev);
  
   /*
* If it is a dummy-lock, we are done with this fd.
 @@ -1070,6 +1078,17 @@ static int restore_file_locks(struct ckpt_ctx *ctx, 
 struct file *file, int fd)
   if (h-fl_flags  FL_POSIX)
   ret = restore_one_file_lock(ctx, file, fd, h); 
  
 + else if (h-fl_flags  FL_LEASE) {
 + int type;
 +
 + type = h-fl_type;
 + if (h-fl_type  F_INPROGRESS)
 + type = h-fl_type_prev;
 + ret = do_setlease(fd, file, type, h-fl_rem_lease);

Do we need to sanitize the type argument ?
or the h-fl_rem_lease ?
or h-fl_type_prev ?  (which is taken blindly here)

 + if (ret)
 + ckpt_err(ctx, ret, do_setlease(): %d\n, type);
 + }
 +
   if (ret  0)

[Devel] Re: [RFC][PATCH 3/3][cr][v2]: fileleases: C/R of an in-progress lease.

2010-06-14 Thread Oren Laadan


On 05/25/2010 09:07 PM, Sukadev Bhattiprolu wrote:
 If process P1 has a F_WRLCK lease on file F1 and process P2 opens the
 file, P2's open() blocks for lease_break_time (45 seconds) and P1 gets
 a SIGIO to cleanup it lease in preparation for P2's open.  If the two
 processes are checkpointed/restarted in this window, we should address
 following two issues:
 
   - P1 should get a SIGIO only once for the lease (i.e if P1 got the
 SIGIO before checkpoint, it should not get the SIGIO after restart).
 
   - If R seconds remain in the lease, P2's open should be blocked for
 at least the R seconds, so P1 has the time to clean up its lease.
 The previous patch gives P1 the entire lease_break_time but that
 can leave P2 stalled for 2*lease_break_time.
 
 To address first, we add a field -fl_break_notified to remember if we
 notified the lease-holder already. We save this field in the checkpoint
 image and when restarting, we notify the lease-holder only if this field
 is not set.
 
 To address the second issue, we also checkpoint the -fl_break_time for
 an in-progress lease. When restarting the process, we ensure that the
 lease-holder sleeps only for the remaining-lease rather than the entire
 lease.
 
 These two fixes sound like an approximation (see comments in do_setlease()
 and __break_lease() below) and are also a bit kludgy (hence a separate patch
 for now).
 
 Appreciate comments on how we can do this better. Specifically:
 
   - do we even need to try and address the second issue above or
 just let P1 have the entire lease_break_time again ?
 
   - theoretically, the R seconds should start counting after *all*
 processes in the application-process tree have been restarted,
 since P1 waits inside the kernel for a portion of the remaining
 lease - should we then add a delta to R ?

[...]

 @@ -1084,7 +1088,8 @@ static int restore_file_locks(struct ckpt_ctx *ctx, 
 struct file *file, int fd)
   type = h-fl_type;
   if (h-fl_type  F_INPROGRESS)
   type = h-fl_type_prev;
 - ret = do_setlease(fd, file, type, h-fl_rem_lease);
 + ret = do_setlease(fd, file, type, h-fl_rem_lease,
 + h-fl_break_notified);

Is h-fl_break_notified sanitized ?

Oren.
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel