Re: UVM coredump API change

2017-02-15 Thread Philip Guenther
On Wed, 15 Feb 2017, Philip Guenther wrote:
...
> I have a followup diff that then alters uvm_unix.c to treat unshared amaps 
> (anonymous maps) as multiple ranges such that pages that were never 
> faulted don't get written to the coredump: it's useless to write them out 
> (they're just zeros) and it'll fail if faulting in those pages would hit 
> the process's memory limit.  That diff takes advantage of the API change 
> in the present diff to hold extra references on some amaps between the two 
> walks of the VA.

Here's that followup diff.  If you examine a core file generated by a 
kernel with both these diffs, you'll find segments with filesz < memsz in 
the output of "readelf -lW whatever.core".  Those are ranges in amaps 
where this diff suppressed faulting in of never-touched pages.
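
For illustration only (the numbers below are invented, not actual output),
such a segment shows up in the readelf program header listing along these
lines, with FileSiz smaller than MemSiz:

  Type  Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD  0x004000 0x00000f5a32e1d000 0x0000000000000000 0x002000 0x008000 RW  0x1000

The difference between the two sizes is the run of never-faulted pages the
kernel left out of the file; per the ELF rules, readers of the core file
treat the missing tail as zeros.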


Philip Guenther

--- uvm/uvm_unix.c  Wed Feb 15 22:31:50 2017
+++ uvm/uvm_unix.c  Wed Feb 15 23:37:26 2017
@@ -134,7 +134,79 @@
 
 #ifndef SMALL_KERNEL
 
+#define WALK_CHUNK 32
 /*
+ * Not all the pages in an amap may be present.  When dumping core,
+ * we don't want to force all the pages to be present: it's a waste
+ * of time and memory when we already know what they contain (zeros)
+ * and the ELF format at least can adequately represent them as a
+ * segment with memory size larger than its file size.
+ *
+ * So, we walk the amap with calls to amap_lookups() and scan the
+ * resulting pointers to find ranges of zero or more present pages
+ * followed by at least one absent page or the end of the amap.
+ * We then pass that range to the walk callback with 'start'
+ * pointing to the start of the present range, 'realend' pointing
+ * to the first absent page (or the end of the entry), and 'end'
+ * pointing to the page past the last absent page (or the end of
+ * the entry).
+ *
+ * Note that if the first page of the amap is empty then the callback
+ * must be invoked with 'start' == 'realend' so it can present that
+ * first range of absent pages.
+ */
+int
+uvm_coredump_walk_amap(struct vm_map_entry *entry, int *nsegmentp,
+uvm_coredump_walk_cb *walk, void *cookie)
+{
+   struct vm_anon *anons[WALK_CHUNK];
+   vaddr_t pos, start, realend, end, entry_end;
+   vm_prot_t prot;
+   int nsegment, absent, npages, i, error;
+
+   prot = entry->protection;
+   nsegment = *nsegmentp;
+   start = entry->start;
+   entry_end = MIN(entry->end, VM_MAXUSER_ADDRESS);
+
+   absent = 0;
+   for (pos = start; pos < entry_end; pos += npages << PAGE_SHIFT) {
+   npages = (entry_end - pos) >> PAGE_SHIFT;
+   if (npages > WALK_CHUNK)
+   npages = WALK_CHUNK;
+   amap_lookups(&entry->aref, pos - entry->start, anons, npages);
+   for (i = 0; i < npages; i++) {
+   if ((anons[i] == NULL) == absent)
+   continue;
+   if (!absent) {
+   /* going from present to absent: set realend */
+   realend = pos + (i << PAGE_SHIFT);
+   absent = 1;
+   continue;
+   }
+
+   /* going from absent to present: invoke callback */
+   end = pos + (i << PAGE_SHIFT);
+   if (start != end) {
+   error = (*walk)(start, realend, end, prot,
+   nsegment, cookie);
+   if (error)
+   return error;
+   nsegment++;
+   }
+   start = realend = end;
+   absent = 0;
+   }
+   }
+
+   if (!absent)
+   realend = entry_end;
+   error = (*walk)(start, realend, entry_end, prot, nsegment, cookie);
+   *nsegmentp = nsegment + 1;
+   return error;
+}
+
+/*
  * Common logic for whether a map entry should be included in a coredump
  */
 static inline int
@@ -166,6 +238,15 @@
return 1;
 }
 
+
+/* do nothing callback for uvm_coredump_walk_amap() */
+static int
+noop(vaddr_t start, vaddr_t realend, vaddr_t end, vm_prot_t prot,
+int nsegment, void *cookie)
+{
+   return 0;
+}
+
 /*
  * Walk the VA space for a process to identify what to write to
  * a coredump.  First the number of contiguous ranges is counted,
@@ -183,10 +264,16 @@
struct vm_map *map = &vm->vm_map;
struct vm_map_entry *entry;
vaddr_t end;
+   int refed_amaps = 0;
int nsegment, error;
 
/*
-* Walk the map once to count the segments.
+* Walk the map once to count the segments.  If an amap is
+* referenced more than once then take *another* reference
+* and treat the amap as exactly one segment instead of
+* checking page presence inside it.  On the second pass
+* we'll recognize 

UVM coredump API change

2017-02-15 Thread Philip Guenther
uvm_coredump_walkmap() is the API used by the ELF code to get information 
about the process's memory image so that it can write it out to the core 
file.  Currently, uvm_coredump_walkmap() just does a pass across the 
process's VA, invoking a callback on each range.  The ELF code calls it 
twice: once to count the ranges, then it allocates memory to hold the 
segment information, and then it calls it again to fill in that array.
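
As a rough sketch of that two-pass pattern (the callback name and the cookie
plumbing below are hypothetical, not the actual ELF coredump code; in
particular I'm assuming the final opaque argument is what comes back as
us->cookie):

static int
count_range_cb(struct proc *p, void *iocookie, struct uvm_coredump_state *us)
{
        int *nsegmentp = us->cookie;    /* opaque cookie from the caller */

        (*nsegmentp)++;                 /* one program header per range */
        return (0);
}

        /* pass 1: count the ranges so the ELF headers can be sized */
        nsegment = 0;
        error = uvm_coredump_walkmap(p, iocookie, count_range_cb, &nsegment);
        if (error)
                return (error);

        /* ... allocate nsegment phdrs, then pass 2: the same call with a
         * callback that fills them in, which must see the same count ... */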

The catch is that the two calls must return the same number of ranges so 
that a consistent ELF header can be generated.  If that fails, a kernel 
assert will trigger.  That's easy right now because every vm_map_entry in 
the process's VA results in a single range, but it would be nice if the 
coredump could be more precise about what is included and, for example, 
not try to write out pages which were never faulted into existence.  In 
that case it may be necessary to hold locks or references between the 
first and second walk of the VA, which means the logic of multiple walks 
should be in the UVM code and not in the ELF code.

So, the diff below changes uvm_coredump_walkmap() to instead take two 
callbacks.  To quote a comment in the diff:
+/*
+ * Walk the VA space for a process to identify what to write to
+ * a coredump.  First the number of contiguous ranges is counted,
+ * then the 'setup' callback is invoked to prepare for actually
+ * recording the ranges, then the VA is walked again, invoking
+ * the 'walk' callback for each range.  The number of ranges walked
+ * is guaranteed to match the count seen by the 'setup' callback.
+ */

The ELF code of course changes to match that, such that what was 
previously done between the two calls is now done in the 
coredump_setup_elf() callback.
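
To illustrate the shape of the new interface (hypothetical names again, not
the real coredump_setup_elf() code), a consumer hands both callbacks to
uvm_coredump_walkmap() and lets it manage the two passes itself:

struct dump_state {
        struct dump_range {
                vaddr_t start, realend, end;
                vm_prot_t prot;
        } *ranges;
        int nranges;
};

static int
dump_setup_cb(int nsegment, void *cookie)
{
        struct dump_state *ds = cookie;

        /* invoked once, between the two passes, with the final range count */
        ds->ranges = mallocarray(nsegment, sizeof(*ds->ranges),
            M_TEMP, M_WAITOK | M_ZERO);
        ds->nranges = nsegment;
        return (0);
}

static int
dump_walk_cb(vaddr_t start, vaddr_t realend, vaddr_t end, vm_prot_t prot,
    int nsegment, void *cookie)
{
        struct dump_state *ds = cookie;

        /* range #nsegment covers [start, end); only [start, realend)
         * has contents that need to hit the file (filesz vs memsz) */
        ds->ranges[nsegment].start = start;
        ds->ranges[nsegment].realend = realend;
        ds->ranges[nsegment].end = end;
        ds->ranges[nsegment].prot = prot;
        return (0);
}

        error = uvm_coredump_walkmap(p, dump_setup_cb, dump_walk_cb, &ds);

The point is that any state that has to survive from the counting pass to
the recording pass (the extra amap references in the followup diff, for
instance) can now be kept inside UVM rather than leaking into the ELF code.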


I have a followup diff that then alters uvm_unix.c to treat unshared amaps 
(anonymous maps) as multiple ranges such that pages that were never 
faulted don't get written to the coredump: it's useless to write them out 
(they're just zeros) and it'll fail if faulting in those pages would hit 
the process's memory limit.  That diff takes advantage of the API change 
in the present diff to hold extra references on some amaps between the two 
walks of the VA.

(This all originally arose from a report to bugs@ from semarie@ about 
kernel log messages generated when firefox coredumps.)

Comments, concerns, oks?


Philip Guenther

Index: uvm/uvm_extern.h
===================================================================
RCS file: /data/src/openbsd/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.140
diff -u -p -r1.140 uvm_extern.h
--- uvm/uvm_extern.h    12 Feb 2017 04:55:08 -0000      1.140
+++ uvm/uvm_extern.h    16 Feb 2017 04:40:44 -0000
@@ -240,20 +240,6 @@ extern struct uvm_constraint_range *uvm_
 extern struct pool *uvm_aiobuf_pool;
 
 /*
- * used to keep state while iterating over the map for a core dump.
- */
-struct uvm_coredump_state {
-   void *cookie;   /* opaque for the caller */
-   vaddr_t start;  /* start of region */
-   vaddr_t realend;/* real end of region */
-   vaddr_t end;/* virtual end of region */
-   vm_prot_t prot; /* protection of region */
-   int flags;  /* flags; see below */
-};
-
-#define UVM_COREDUMP_STACK  0x01    /* region is user stack */
-
-/*
  * the various kernel maps, owned by MD code
  */
 extern struct vm_map *exec_map;
@@ -456,9 +442,13 @@ int    uvm_pglistalloc(psize_t, paddr_t, 
 void   uvm_pglistfree(struct pglist *);
 void   uvm_pmr_use_inc(paddr_t, paddr_t);
 void   uvm_swap_init(void);
-int        uvm_coredump_walkmap(struct proc *,
-               void *, int (*)(struct proc *, void *,
-               struct uvm_coredump_state *), void *);
+typedef int    uvm_coredump_setup_cb(int _nsegment, void *_cookie);
+typedef int    uvm_coredump_walk_cb(vaddr_t _start, vaddr_t _realend,
+               vaddr_t _end, vm_prot_t _prot, int _nsegment,
+               void *_cookie);
+int        uvm_coredump_walkmap(struct proc *_p,
+               uvm_coredump_setup_cb *_setup,
+               uvm_coredump_walk_cb *_walk, void *_cookie);
 void   uvm_grow(struct proc *, vaddr_t);
 void   uvm_deallocate(vm_map_t, vaddr_t, vsize_t);
 struct uvm_object  *uvn_attach(struct vnode *, vm_prot_t);
Index: uvm/uvm_unix.c
===================================================================
RCS file: /data/src/openbsd/src/sys/uvm/uvm_unix.c,v
retrieving revision 1.61
diff -u -p -r1.61 uvm_unix.c
--- uvm/uvm_unix.c      2 Feb 2017 06:23:58 -0000       1.61
+++ uvm/uvm_unix.c      16 Feb 2017 06:31:50 -0000
@@ -135,90 +135,101 @@ uvm_grow(struct proc *p, vaddr_t sp)
 #ifndef SMALL_KERNEL
 
 /*
- * Walk the VA space for a process, invoking 'func' on each present