Sun, 29 Jan 2017, Philip Guenther wrote:
...
> Indeed, the exec_elf.c side of that is already there, though currently 
> it's only in play for the stack on hppa: check out the 
> MACHINE_STACK_GROWS_UP section of uvm_coredump_walkmap() in 
> uvm/uvm_unix.c.  <pause>  Actually, that code looks broken, as it's moving 
> down state.end to a value less than state.realend, but exec_elf.c expects 
> state.realend to always be >= state.end!  Dang it, now I have to fire up 
> the hppa again...

[[TYPO: exec_elf.c expects state.realend to always be *less than* or equal 
to state.end, as realend is the source for p_filesiz while end is the 
source for p_memsz, and ELF requires that p_filesz <= p_memsz for PT_LOAD 
segments.]]
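For reference, the invariant being violated is simple to state in code.  This is an illustrative userland sketch using a hand-rolled minimal Phdr struct (not the system ELF headers or the kernel's types):

```c
#include <stdint.h>

/* Minimal stand-in for Elf_Phdr; only the fields the check needs. */
struct phdr {
	uint32_t p_type;
	uint64_t p_filesz;	/* bytes present in the file */
	uint64_t p_memsz;	/* bytes occupied in memory */
};

#define PT_LOAD	1

/*
 * gABI rule: for PT_LOAD segments the file image may be smaller than
 * the memory image (the tail is zero-filled at load time), but it may
 * never be larger.
 */
static int
phdr_sizes_valid(const struct phdr *ph)
{
	if (ph->p_type != PT_LOAD)
		return 1;
	return ph->p_filesz <= ph->p_memsz;
}
```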

Yes, this code is currently broken on hppa: if the stack rlimit extends 
beyond the last page actually used by the program at the time of the 
coredump, then an invalid ELF core file is generated that has a segment 
with p_filesz > p_memsz:

: jackal; uname -a
OpenBSD jackal.my.domain 6.0 LOCAL#0 hppa
: jackal; ulimit -s 32768
: jackal; vis
^\Quit (core dumped)

: jackal; readelf -l vis.core  | awk '!/LOAD/ || $5 > $6'

Elf file type is CORE (Core file)
Entry point 0x0
There are 42 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x047840 0x7801b000 0x00000000 0x05000 0x03000 RW  0x1000
  NOTE           0x000574 0x00000000 0x00000000 0x002cc 0x00000 R   0x4
: jackal;


The stack handling code in uvm_coredump_walkmap() is, AFAICT, just trying 
to skip pages that aren't actually backed.  If we instead do that for all 
amaps then the jockeying (and #ifdef) there can go away.  The diff below 
appears to work for me, tested on macppc so far.

It delegates the dumping of uvm_map_entries for amaps to a new function 
uvm_coredump_walk_amap() that uses amap_lookups() calls to see which pages 
are actually backed and generates calls to the callback for ranges of N 
present pages followed by M absent pages, where N + M > 0 but either N or 
M may be zero.
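The run-splitting can be modeled in plain userland C over a per-page "backed" flag array (a hypothetical stand-in for what amap_lookups() reports; this is an illustration of the walk, not the kernel code in the diff below):

```c
#include <stddef.h>

/*
 * Model of the present/absent run-splitting: given a per-page "backed"
 * flag array, emit runs as (start, realend, end) page-index triples,
 * where [start, realend) is present and [realend, end) is absent, so
 * realend <= end always holds.  Returns the number of runs written.
 */
struct run { size_t start, realend, end; };

static size_t
walk_runs(const int *present, size_t npages, struct run *out)
{
	size_t start = 0, realend = 0, nruns = 0;
	int absent = 0;
	size_t i;

	for (i = 0; i < npages; i++) {
		if ((present[i] == 0) == absent)
			continue;
		if (!absent) {
			/* present -> absent: remember where the data ends */
			realend = i;
			absent = 1;
			continue;
		}
		/* absent -> present: flush the run ending here */
		if (start != i)
			out[nruns++] = (struct run){ start, realend, i };
		start = realend = i;
		absent = 0;
	}
	if (!absent)
		realend = npages;
	out[nruns++] = (struct run){ start, realend, npages };
	return nruns;
}
```

A run of two backed pages, two holes, then one backed page thus produces two (start, realend, end) triples, the first carrying a trailing hole.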

To confirm the results, I started a perl process and twice had it fork 
with the child killing itself with SIGQUIT to generate a core, with this 
new functionality enabled for one child+core but not the other.  Diffing 
the output of "readelf -l" on the two core files (with the 
offset-in-the-core-file column removed) shows these differences:


< There are 820 program headers, starting at offset 52
---
> There are 826 program headers, starting at offset 52
37c37
<   LOAD           0xbb7fd000 0x00000000 0x04000 0x04000 RW  0x1000
---
>   LOAD           0xbb7fd000 0x00000000 0x01000 0x04000 RW  0x1000
182c182
<   LOAD           0xbebb6000 0x00000000 0x04000 0x04000 RW  0x1000
---
>   LOAD           0xbebb6000 0x00000000 0x01000 0x04000 RW  0x1000
184c184,185
<   LOAD           0xbebbb000 0x00000000 0x09000 0x09000 RW  0x1000
---
>   LOAD           0xbebbb000 0x00000000 0x01000 0x05000 RW  0x1000
>   LOAD           0xbebc0000 0x00000000 0x02000 0x04000 RW  0x1000
334c335,336
<   LOAD           0xc1c86000 0x00000000 0x04000 0x04000 RW  0x1000
---
>   LOAD           0xc1c86000 0x00000000 0x01000 0x03000 RW  0x1000
>   LOAD           0xc1c89000 0x00000000 0x01000 0x01000 RW  0x1000
344c346,347
<   LOAD           0xc1e26000 0x00000000 0x04000 0x04000 RW  0x1000
---
>   LOAD           0xc1e26000 0x00000000 0x01000 0x03000 RW  0x1000
>   LOAD           0xc1e29000 0x00000000 0x01000 0x01000 RW  0x1000
397c400,401
<   LOAD           0xc2f39000 0x00000000 0x03000 0x03000 RW  0x1000
---
>   LOAD           0xc2f39000 0x00000000 0x01000 0x02000 RW  0x1000
>   LOAD           0xc2f3b000 0x00000000 0x01000 0x01000 RW  0x1000
571c575,576
<   LOAD           0xc61b2000 0x00000000 0x03000 0x03000 RW  0x1000
---
>   LOAD           0xc61b2000 0x00000000 0x00000 0x01000 RW  0x1000
>   LOAD           0xc61b3000 0x00000000 0x02000 0x02000 RW  0x1000
826c831,832
<   LOAD           0xfffe0000 0x00000000 0x0f000 0x0f000 RW  0x1000
---
>   LOAD           0xfffe0000 0x00000000 0x00000 0x09000 RW  0x1000
>   LOAD           0xfffe9000 0x00000000 0x06000 0x06000 RW  0x1000

That last chunk, for example, shows that where it previously would have 
dumped 15 pages [0xfffe0000, 0xfffef000), it instead shows 9 unbacked 
pages [0xfffe0000, 0xfffe9000) and then actually dumps only 6 pages 
[0xfffe9000, 0xfffef000).  This means the kernel no longer had to fault in 
and create empty backing for those 9 pages, just to write them to the core 
file.
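The page counts can be double-checked from the addresses with a quick arithmetic sketch (assuming that machine's 4KB pages, i.e. PAGE_SHIFT == 12):

```c
#include <stdint.h>

/* 4KB pages, as on the macppc test machine */
#define PAGE_SHIFT	12

/* number of pages in the half-open range [start, end) */
static uint64_t
npages(uint64_t start, uint64_t end)
{
	return (end - start) >> PAGE_SHIFT;
}
```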

I see one thing that needs to be checked and fixed in this and one thing 
that might need pondering.  The important question is: if I create a 
*shared* anonymous mapping, can I have process A force another page to be 
faulted in and backed while process B is trying to dump the mapping?  If 
so, I can trigger a kernel assert by forcing a range to be split, which 
will increase the number of segments needed to dump that range in the 
core.  Lock all the map entries in the uvmspace across the two walks?  
Yuck...

The minor thought is about how the kernel's coredump code allocates memory 
to hold all the segment headers with mallocarray().  An amd64 process can 
have MAXDSIZ / PAGE_SIZE == 8 million pages.  If you allocate every other 
page, it would require 4 million segments in the core file, which would 
take 224MB.  Is that malloc w/M_WAITOK possibly excessive?  Will the 
increase in segment count from this diff make whatever limit there is for 
this easier to hit?
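Back-of-envelope version of that estimate, assuming amd64's 32GB MAXDSIZ, 4KB pages, and the 56-byte Elf64_Phdr (my numbers, not anything from the kernel):

```c
#include <stdint.h>

/*
 * Worst-case bytes of program headers for a maximally sparse amd64
 * data segment: every other page backed, one Elf64_Phdr per segment.
 */
static uint64_t
phdr_bytes_worst_case(void)
{
	uint64_t maxdsiz = 32ULL * 1024 * 1024 * 1024;	/* assumed MAXDSIZ */
	uint64_t pages = maxdsiz / 4096;		/* 8 million pages */
	uint64_t segs = pages / 2;			/* every other page backed */

	return segs * 56;				/* sizeof(Elf64_Phdr) */
}
```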


semarie@, since you can consistently trigger that kernel log message, 
perhaps you can try reproducing it with this patch and see if it goes away?


Thoughts/comments on the diff?

Philip


Index: uvm/uvm_extern.h
===================================================================
RCS file: /data/src/openbsd/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.139
diff -u -p -r1.139 uvm_extern.h
--- uvm/uvm_extern.h    5 Jun 2016 08:35:57 -0000       1.139
+++ uvm/uvm_extern.h    31 Jan 2017 07:26:48 -0000
@@ -245,14 +245,11 @@ extern struct pool *uvm_aiobuf_pool;
 struct uvm_coredump_state {
        void *cookie;           /* opaque for the caller */
        vaddr_t start;          /* start of region */
-       vaddr_t realend;        /* real end of region */
+       vaddr_t realend;        /* real end of region (<= end) */
        vaddr_t end;            /* virtual end of region */
        vm_prot_t prot;         /* protection of region */
-       int flags;              /* flags; see below */
 };
 
-#define        UVM_COREDUMP_STACK      0x01    /* region is user stack */
-
 /*
  * the various kernel maps, owned by MD code
  */
@@ -460,7 +457,7 @@ void                        uvm_pmr_use_inc(paddr_t, paddr_t)
 void                   uvm_swap_init(void);
 int                    uvm_coredump_walkmap(struct proc *,
                            void *, int (*)(struct proc *, void *,
-                           struct uvm_coredump_state *), void *);
+                           const struct uvm_coredump_state *), void *);
 void                   uvm_grow(struct proc *, vaddr_t);
 void                   uvm_deallocate(vm_map_t, vaddr_t, vsize_t);
 struct uvm_object      *uvn_attach(struct vnode *, vm_prot_t);
Index: uvm/uvm_unix.c
===================================================================
RCS file: /data/src/openbsd/src/sys/uvm/uvm_unix.c,v
retrieving revision 1.60
diff -u -p -r1.60 uvm_unix.c
--- uvm/uvm_unix.c      16 Sep 2016 01:09:53 -0000      1.60
+++ uvm/uvm_unix.c      31 Jan 2017 07:27:22 -0000
@@ -134,26 +134,77 @@ uvm_grow(struct proc *p, vaddr_t sp)
 
 #ifndef SMALL_KERNEL
 
+#define WALK_CHUNK     32
+int
+uvm_coredump_walk_amap(struct proc *p, struct vm_map_entry *entry,
+    void *iocookie,
+    int (*func)(struct proc *, void *, const struct uvm_coredump_state *),
+    struct uvm_coredump_state *state)
+{
+       struct vm_anon *anons[WALK_CHUNK];
+       vaddr_t start, end;
+       int absent = 0;
+       int npages, i, error;
+
+       state->start = start = entry->start;
+       end = MIN(entry->end, VM_MAXUSER_ADDRESS);
+
+       for (; start < end; start += npages << PAGE_SHIFT) {
+               npages = (end - start) >> PAGE_SHIFT;
+               if (npages > WALK_CHUNK)
+                       npages = WALK_CHUNK;
+               amap_lookups(&entry->aref, start - entry->start, anons, npages);
+               for (i = 0; i < npages; i++) {
+                       if ((anons[i] == NULL) == absent)
+                               continue;
+                       if (!absent) {
+                               /* going from present -> absent: set realend */
+                               state->realend = start + (i << PAGE_SHIFT);
+                               absent = 1;
+                               continue;
+                       }
+
+                       /* going from absent to present: invoke callback */
+                       state->end = start + (i << PAGE_SHIFT);
+                       if (state->start != state->end) {
+                               error = (*func)(p, iocookie, state);
+                               if (error)
+                                       return error;
+                       }
+                       state->start = state->realend = state->end;
+                       absent = 0;
+               }
+       }
+
+       if (!absent)
+               state->realend = end;
+       state->end = end;
+       return (*func)(p, iocookie, state);
+}
+
+
 /*
  * Walk the VA space for a process, invoking 'func' on each present range
  * that should be included in a coredump.
  */
+
+int sparse = 1;
+
 int
 uvm_coredump_walkmap(struct proc *p, void *iocookie,
-    int (*func)(struct proc *, void *, struct uvm_coredump_state *),
+    int (*func)(struct proc *, void *, const struct uvm_coredump_state *),
     void *cookie)
 {
        struct uvm_coredump_state state;
        struct vmspace *vm = p->p_vmspace;
        struct vm_map *map = &vm->vm_map;
        struct vm_map_entry *entry;
-       vaddr_t top;
        int error;
 
+       state.cookie = cookie;
+
        RBT_FOREACH(entry, uvm_map_addr, &map->addr) {
-               state.cookie = cookie;
                state.prot = entry->protection;
-               state.flags = 0;
 
                /* should never happen for a user process */
                if (UVM_ET_ISSUBMAP(entry)) {
@@ -170,40 +221,23 @@ uvm_coredump_walkmap(struct proc *p, voi
                    UVM_OBJ_IS_DEVICE(entry->object.uvm_obj))
                        continue;
 
-               state.start = entry->start;
-               state.realend = entry->end;
-               state.end = entry->end;
-
-               if (state.start >= VM_MAXUSER_ADDRESS)
+               if (entry->start >= VM_MAXUSER_ADDRESS)
                        continue;
 
-               if (state.end > VM_MAXUSER_ADDRESS)
-                       state.end = VM_MAXUSER_ADDRESS;
-
-#ifdef MACHINE_STACK_GROWS_UP
-               if ((vaddr_t)vm->vm_maxsaddr <= state.start &&
-                   state.start < ((vaddr_t)vm->vm_maxsaddr + MAXSSIZ)) {
-                       top = round_page((vaddr_t)vm->vm_maxsaddr +
-                           ptoa(vm->vm_ssize));
-                       if (state.end > top)
-                               state.end = top;
+               if (sparse && entry->aref.ar_amap != NULL) {
+                       error = uvm_coredump_walk_amap(p, entry, iocookie,
+                           func, &state);
+               } else {
+                       state.start = entry->start;
+                       state.realend = entry->end;
+                       state.end = entry->end;
 
-                       if (state.start >= state.end)
-                               continue;
-#else
-               if (state.start >= (vaddr_t)vm->vm_maxsaddr) {
-                       top = trunc_page((vaddr_t)vm->vm_minsaddr -
-                           ptoa(vm->vm_ssize));
-                       if (state.start < top)
-                               state.start = top;
+                       if (state.end > VM_MAXUSER_ADDRESS)
+                               state.end = VM_MAXUSER_ADDRESS;
 
-                       if (state.start >= state.end)
-                               continue;
-#endif
-                       state.flags |= UVM_COREDUMP_STACK;
+                       error = (*func)(p, iocookie, &state);
                }
 
-               error = (*func)(p, iocookie, &state);
                if (error)
                        return (error);
        }
Index: kern/exec_elf.c
===================================================================
RCS file: /data/src/openbsd/src/sys/kern/exec_elf.c,v
retrieving revision 1.130
diff -u -p -r1.130 exec_elf.c
--- kern/exec_elf.c     21 Jan 2017 05:42:03 -0000      1.130
+++ kern/exec_elf.c     31 Jan 2017 04:08:17 -0000
@@ -955,7 +955,7 @@ struct countsegs_state {
 };
 
 int    ELFNAMEEND(coredump_countsegs)(struct proc *, void *,
-           struct uvm_coredump_state *);
+           const struct uvm_coredump_state *);
 
 struct writesegs_state {
        Elf_Phdr *psections;
@@ -963,7 +963,7 @@ struct writesegs_state {
 };
 
 int    ELFNAMEEND(coredump_writeseghdrs)(struct proc *, void *,
-           struct uvm_coredump_state *);
+           const struct uvm_coredump_state *);
 
 int    ELFNAMEEND(coredump_notes)(struct proc *, void *, size_t *);
 int    ELFNAMEEND(coredump_note)(struct proc *, void *, size_t *);
@@ -1124,7 +1124,7 @@ out:
 
 int
 ELFNAMEEND(coredump_countsegs)(struct proc *p, void *iocookie,
-    struct uvm_coredump_state *us)
+    const struct uvm_coredump_state *us)
 {
 #ifndef SMALL_KERNEL
        struct countsegs_state *cs = us->cookie;
@@ -1136,7 +1136,7 @@ ELFNAMEEND(coredump_countsegs)(struct pr
 
 int
 ELFNAMEEND(coredump_writeseghdrs)(struct proc *p, void *iocookie,
-    struct uvm_coredump_state *us)
+    const struct uvm_coredump_state *us)
 {
 #ifndef SMALL_KERNEL
        struct writesegs_state *ws = us->cookie;
