Louis Rilling wrote:
> On Wed, Jul 30, 2008 at 02:27:52PM -0400, Oren Laadan wrote:
>>
>> Louis Rilling wrote:
>>>> +/**
>>>> + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
>>>> + * @ctx - checkpoint context
>>>> + * @pgarr - page-array to fill
>>>> + * @vma - vma to scan
>>>> + * @start - start address (updated)
>>>> + */
>>>> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
>>>> +                       struct vm_area_struct *vma, unsigned long *start)
>>>> +{
>>>> +  unsigned long end = vma->vm_end;
>>>> +  unsigned long addr = *start;
>>>> +  struct page **pagep;
>>>> +  unsigned long *addrp;
>>>> +  int cow, nr, ret = 0;
>>>> +
>>>> +  nr = pgarr->nleft;
>>>> +  pagep = &pgarr->pages[pgarr->nused];
>>>> +  addrp = &pgarr->addrs[pgarr->nused];
>>>> +  cow = !!vma->vm_file;
>>>> +
>>>> +  while (addr < end) {
>>>> +          struct page *page;
>>>> +
>>>> +          /* simplified version of get_user_pages(): already have vma,
>>>> +          * only need FOLL_TOUCH, and (for now) ignore fault stats */
>>>> +
>>>> +          cond_resched();
>>>> +          while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
>>>> +                  ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
>>>> +                  if (ret & VM_FAULT_ERROR) {
>>>> +                          if (ret & VM_FAULT_OOM)
>>>> +                                  ret = -ENOMEM;
>>>> +                          else if (ret & VM_FAULT_SIGBUS)
>>>> +                                  ret = -EFAULT;
>>>> +                          else
>>>> +                                  BUG();
>>>> +                          break;
>>>> +                  }
>>>> +                  cond_resched();
>>>> +          }
>>> I guess that 'ret' should be checked somewhere after this loop.
>> yes; this is where a "break(2)" construct in C would come handy :)
> 
> Alternatively, putting the inner loop in a separate function often helps to
> handle errors in a cleaner way.

Also true. I opted to keep it that way to keep the code as similar as
possible to get_user_pages().

Note that the logic can be optimized by, instead of traversing the page
table once for each page, we could aggregate a few pages in each round.
I wanted to keep the code simple.

> 
>>>> +
>>>> +          if (IS_ERR(page)) {
>>>> +                  ret = PTR_ERR(page);
>>>> +                  break;
>>>> +          }
>>>> +
>>>> +          if (page == ZERO_PAGE(0))
>>>> +                  page = NULL;    /* zero page: ignore */
>>>> +          else if (cow && page_mapping(page) != NULL)
>>>> +                  page = NULL;    /* clean cow: ignore */
>>>> +          else {
>>>> +                  get_page(page);
>>>> +                  *(addrp++) = addr;
>>>> +                  *(pagep++) = page;
>>>> +                  if (--nr == 0) {
>>>> +                          addr += PAGE_SIZE;
>>>> +                          break;
>>>> +                  }
>>>> +          }
>>>> +
>>>> +          addr += PAGE_SIZE;
>>>> +  }
>>>> +
>>>> +  if (unlikely(ret < 0)) {
>>>> +          nr = pgarr->nleft - nr;
>>>> +          while (nr--)
>>>> +                  page_cache_release(*(--pagep));
>>>> +          return ret;
>>>> +  }
>>>> +
>>>> +  *start = addr;
>>>> +  return (pgarr->nleft - nr);
>>>> +}
>>>> +
> 
> 
>>>> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
>>>> +{
>>>> +  struct cr_hdr h;
>>>> +  struct cr_hdr_mm *hh = ctx->tbuf;
>>>> +  struct mm_struct *mm;
>>>> +  struct vm_area_struct *vma;
>>>> +  int ret;
>>>> +
>>>> +  h.type = CR_HDR_MM;
>>>> +  h.len = sizeof(*hh);
>>>> +  h.id = ctx->pid;
>>>> +
>>>> +  mm = get_task_mm(t);
>>>> +
>>>> +  hh->tag = 1;    /* non-zero will mean first time encounter */
>>>> +
>>>> +  hh->start_code = mm->start_code;
>>>> +  hh->end_code = mm->end_code;
>>>> +  hh->start_data = mm->start_data;
>>>> +  hh->end_data = mm->end_data;
>>>> +  hh->start_brk = mm->start_brk;
>>>> +  hh->brk = mm->brk;
>>>> +  hh->start_stack = mm->start_stack;
>>>> +  hh->arg_start = mm->arg_start;
>>>> +  hh->arg_end = mm->arg_end;
>>>> +  hh->env_start = mm->env_start;
>>>> +  hh->env_end = mm->env_end;
>>>> +
>>>> +  hh->map_count = mm->map_count;
>>> Some fields above should also be protected with mmap_sem, like ->brk,
>>> ->map_count, and possibly others (I'm not a memory expert though).
>> true; keep in mind, though, that the container will be frozen during
>> this time, so nothing should change at all. The only exception would
>> be if, for instance, someone is killing the container while we save
>> its state.
> 
> Sure. So you think that taking mm->mmap_sem below is useless? I tend to 
> believe
> so, since no other task should share this mm_struct at this time, and we could
> state that ptrace should not interfere during restart. However, I'm never
> confident when ptrace considerations come in...

Not quite.

Probing the value of mm->brk is always safe, although it may turn out to
yield incorrect value. Traversing the vma's isn't safe, because - if for
instance the target task dies in the middle, it may alter the vma list.
So the mmap_sem protects against the latter.

Anyway, it won't hurt to be extra safe and take the semaphore earlier.

Ptrace, btw, cannot come in because the container is (supposedly) frozen.

Oren.

>>>> +
>>>> +  /* FIX: need also mm->flags */
>>>> +
>>>> +  ret = cr_write_obj(ctx, &h, hh);
>>>> +  if (ret < 0)
>>>> +          goto out;
>>>> +
>>>> +  /* write the vma's */
>>>> +  down_read(&mm->mmap_sem);
>>>> +  for (vma = mm->mmap; vma; vma = vma->vm_next) {
>>>> +          if ((ret = cr_write_vma(ctx, vma)) < 0)
>>>> +                  break;
>>>> +  }
>>>> +  up_read(&mm->mmap_sem);
>>>> +
>>>> +  if (ret < 0)
>>>> +          goto out;
>>>> +
>>>> +  ret = cr_write_mm_context(ctx, mm);
>>>> +
>>>> + out:
>>>> +  mmput(mm);
>>>> +  return ret;
>>>> +}
> 
> Thanks,
> 
> Louis
> 
_______________________________________________
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

_______________________________________________
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

Reply via email to