Re: UVM coredump API change
On Wed, 15 Feb 2017, Philip Guenther wrote: ... > I have a followup diff that then alters uvm_unix.c to treat unshared amaps > (anonymous maps) as multiple ranges such that pages that were never > faulted don't get written to the coredump: it's useless to write them out > (they're just zeros) and it'll fail if faulting in those pages would hit > the process's memory limit. That diff takes advantage of the API change > in the present diff to hold extra references on some amaps between the two > walks of the VA. Here's that followup diff. If you examine a core file generated by a kernel with both these diffs, you'll find segments with filesz < memsz in the output of "readelf -lW whatever.core". Those are ranges in amaps where this diff suppressed faulting in of never-touched pages. Philip Guenther --- uvm/uvm_unix.c Wed Feb 15 22:31:50 2017 +++ uvm/uvm_unix.c Wed Feb 15 23:37:26 2017 @@ -134,7 +134,79 @@ #ifndef SMALL_KERNEL +#define WALK_CHUNK 32 /* + * Not all the pages in an amap may be present. When dumping core, + * we don't want to force all the pages to be present: it's a waste + * of time and memory when we already know what they contain (zeros) + * and the ELF format at least can adequately represent them as a + * segment with memory size larger than its file size. + * + * So, we walk the amap with calls to amap_lookups() and scan the + * resulting pointers to find ranges of zero or more present pages + * followed by at least one absent page or the end of the amap. + * When then pass that range to the walk callback with 'start' + * pointing to the start of the present range, 'realend' pointing + * to the first absent page (or the end of the entry), and 'end' + * pointing to the page page the last absent page (or the end of + * the entry). + * + * Note that if the first page of the amap is empty then the callback + * must be invoked with 'start' == 'realend' so it can present that + * first range of absent pages. + */ +int +uvm_coredump_walk_amap(struct vm_map_entry *entry, int *nsegmentp, +uvm_coredump_walk_cb *walk, void *cookie) +{ + struct vm_anon *anons[WALK_CHUNK]; + vaddr_t pos, start, realend, end, entry_end; + vm_prot_t prot; + int nsegment, absent, npages, i, error; + + prot = entry->protection; + nsegment = *nsegmentp; + start = entry->start; + entry_end = MIN(entry->end, VM_MAXUSER_ADDRESS); + + absent = 0; + for (pos = start; pos < entry_end; pos += npages << PAGE_SHIFT) { + npages = (entry_end - pos) >> PAGE_SHIFT; + if (npages > WALK_CHUNK) + npages = WALK_CHUNK; + amap_lookups(&entry->aref, pos - entry->start, anons, npages); + for (i = 0; i < npages; i++) { + if ((anons[i] == NULL) == absent) + continue; + if (!absent) { + /* going from present to absent: set realend */ + realend = pos + (i << PAGE_SHIFT); + absent = 1; + continue; + } + + /* going from absent to present: invoke callback */ + end = pos + (i << PAGE_SHIFT); + if (start != end) { + error = (*walk)(start, realend, end, prot, + nsegment, cookie); + if (error) + return error; + nsegment++; + } + start = realend = end; + absent = 0; + } + } + + if (!absent) + realend = entry_end; + error = (*walk)(start, realend, entry_end, prot, nsegment, cookie); + *nsegmentp = nsegment + 1; + return error; +} + +/* * Common logic for whether a map entry should be included in a coredump */ static inline int @@ -166,6 +238,15 @@ return 1; } + +/* do nothing callback for uvm_coredump_walk_amap() */ +static int +noop(vaddr_t start, vaddr_t realend, vaddr_t end, vm_prot_t prot, +int nsegment, void *cookie) +{ + return 0; +} + /* * Walk the VA space for a process to identify what to write to * a coredump. First the number of contiguous ranges is counted, @@ -183,10 +264,16 @@ struct vm_map *map = &vm->vm_map; struct vm_map_entry *entry; vaddr_t end; + int refed_amaps = 0; int nsegment, error; /* -* Walk the map once to count the segments. +* Walk the map once to count the segments. If an amap is +* referenced more than once than take *another* reference +* and treat the amap as exactly one segment instead of +* checking page presence inside it. On the second pass +* we'll recognize
UVM coredump API change
uvm_coredump_walkmap() is the API used by the ELF code to get information about the process's memory image so that it can write it out to the core file. Currently, uvm_coredump_walkmap() just does a pass across the process's VA, invoking a callback on each range. The ELF code executes it twice: once to get a count of the ranges, then memory is allocated to hold the segment information, and then it's called again to fill in that array. The catch is that the two calls must return the same number of ranges so that a consistent ELF header can be generated. If that fails, a kernel assert will trigger. That's easy right now where every vm_map_entry in the process's VA results in a single range, but it would be nice if the coredump could be more precise about what is included and, for example, not try to write out pages which were never faulted into existence. In that case it may be necessary to hold locks or references between the first and second walk of the VA, which means the logic of multiple walks should be in the UVM code and not in the ELF code. So, the diff below changes uvm_coredump_walkmap() to instead take two callbacks. To quote a comment in the diff: +/* + * Walk the VA space for a process to identify what to write to + * a coredump. First the number of contiguous ranges is counted, + * then the 'setup' callback is invoked to prepare for actually + * recording the ranges, then the VA is walked again, invoking + * the 'walk' callback for each range. The number of ranges walked + * is guaranteed to match the count seen by the 'setup' callback. + */ The ELF code of course changes to match that, such that what was previously done between the two calls is now done in the coredump_setup_elf() callback. I have a followup diff that then alters uvm_unix.c to treat unshared amaps (anonymous maps) as multiple ranges such that pages that were never faulted don't get written to the coredump: it's useless to write them out (they're just zeros) and it'll fail if faulting in those pages would hit the process's memory limit. That diff takes advantage of the API change in the present diff to hold extra references on some amaps between the two walks of the VA. (This all originally arose from a report to bugs@ from semarie@ about kernel log messages generated when firefox coredumps) Comments, concerns, oks? Philip Guenther Index: uvm/uvm_extern.h === RCS file: /data/src/openbsd/src/sys/uvm/uvm_extern.h,v retrieving revision 1.140 diff -u -p -r1.140 uvm_extern.h --- uvm/uvm_extern.h12 Feb 2017 04:55:08 - 1.140 +++ uvm/uvm_extern.h16 Feb 2017 04:40:44 - @@ -240,20 +240,6 @@ extern struct uvm_constraint_range *uvm_ extern struct pool *uvm_aiobuf_pool; /* - * used to keep state while iterating over the map for a core dump. - */ -struct uvm_coredump_state { - void *cookie; /* opaque for the caller */ - vaddr_t start; /* start of region */ - vaddr_t realend;/* real end of region */ - vaddr_t end;/* virtual end of region */ - vm_prot_t prot; /* protection of region */ - int flags; /* flags; see below */ -}; - -#defineUVM_COREDUMP_STACK 0x01/* region is user stack */ - -/* * the various kernel maps, owned by MD code */ extern struct vm_map *exec_map; @@ -456,9 +442,13 @@ intuvm_pglistalloc(psize_t, paddr_t, void uvm_pglistfree(struct pglist *); void uvm_pmr_use_inc(paddr_t, paddr_t); void uvm_swap_init(void); -intuvm_coredump_walkmap(struct proc *, - void *, int (*)(struct proc *, void *, - struct uvm_coredump_state *), void *); +typedef intuvm_coredump_setup_cb(int _nsegment, void *_cookie); +typedef intuvm_coredump_walk_cb(vaddr_t _start, vaddr_t _realend, + vaddr_t _end, vm_prot_t _prot, int _nsegment, + void *_cookie); +intuvm_coredump_walkmap(struct proc *_p, + uvm_coredump_setup_cb *_setup, + uvm_coredump_walk_cb *_walk, void *_cookie); void uvm_grow(struct proc *, vaddr_t); void uvm_deallocate(vm_map_t, vaddr_t, vsize_t); struct uvm_object *uvn_attach(struct vnode *, vm_prot_t); Index: uvm/uvm_unix.c === RCS file: /data/src/openbsd/src/sys/uvm/uvm_unix.c,v retrieving revision 1.61 diff -u -p -r1.61 uvm_unix.c --- uvm/uvm_unix.c 2 Feb 2017 06:23:58 - 1.61 +++ uvm/uvm_unix.c 16 Feb 2017 06:31:50 - @@ -135,90 +135,101 @@ uvm_grow(struct proc *p, vaddr_t sp) #ifndef SMALL_KERNEL /* - * Walk the VA space for a process, invoking 'func' on each present
Update to www/faq/faq14.html for FDE on UEFI
Hi, I have added the fdisk(8) flags required for full disk encryption setups on UEFI hosts to www/faq/faq14.html Thanks! Index: faq14.html === RCS file: /cvs/www/faq/faq14.html,v retrieving revision 1.340 diff -u -p -r1.340 faq14.html --- faq14.html 9 Feb 2017 17:22:19 - 1.340 +++ faq14.html 15 Feb 2017 23:42:53 - @@ -817,6 +817,9 @@ adversary to deduce how much space is ac Next, we'll initialize the disk with http://man.openbsd.org/fdisk";>fdisk(8) and create the softraid partition with http://man.openbsd.org/disklabel";>disklabel(8). +If you use GPT for UEFI booting, the http://man.openbsd.org/fdisk ">fdisk(8) +flags below must be changed to -b 960 -giy (this will also change the first slice +offset to 1024 in http://man.openbsd.org/disklabel ">disklabel(8)). # fdisk -iy sd0
Re: Is it worth considering changing the installer default offset from 64 to a higher value?
Thanks Stuart, you went further than I Did :) On Wed, Feb 15, 2017 at 11:39 PM, Stuart Henderson wrote: > On 2017/02/15 23:22, Tom Smyth wrote: >> Hello, >> >> I have been installing OpenBSD quite a bit in a virtualization >> environment and the >> underlying storage has been formatted in 1MB blocks, and most other >> operating >> systems have recommended for best performance to align the partition on a 1MB >> Offset rather than a smaller value. >> >> the reasoning behind this is that if you have files that are around >> than 1MB in size >> and they are stored in the file system it is likely that they will >> cross a block boundary and while OpenBSD only wants to read the file, >> the underlying storage system >> will have to retrieve 2 blocks and this is wasteful on resources. >> >> If the offset was set to 2048 sectors (at 1MB in my case) the disk >> performance of OpenBSD on my underlying system would be improved. >> I have been setting the value of the initial offset to 2048 and I have >> used exact sizes in-terms of sector count to keep my subsequent >> partitions aligned with the 1M offset, >> I have not noticed any ill affect on my installs (for the past 2 >> years) and I think it >> would be worth considering this as a suggested default when installing >> OpenBSD >> >> Any comments suggestions insights welcome, as my experience on >> virtualization >> has been Vmware Based. But I would imagine most modern enterprise >> storage systems are using larger block sizes... and by setting it to a >> 1MB or perhaps a larger value (as long as the value ends on a 1MB >> Boundary eg 2MB or 4MB) depending on other users experience >> > > I've been doing similar for installs on SSDs with their large erase blocks. > And this is common on other OS these days. > > As you say, it's not just the starting point of the OpenBSD part of the disk, > care must be taken with the disklabel partitions too. > > I have an old tree that has this diff in, I don't remember if it was quite > enough for disklabel as-is, but it might be enough of a starting point to > save time tracking things down. > > Index: fdisk/mbr.c > === > RCS file: /cvs/src/sbin/fdisk/mbr.c,v > retrieving revision 1.67 > diff -u -p -r1.67 mbr.c > --- fdisk/mbr.c 1 Sep 2016 16:17:46 - 1.67 > +++ fdisk/mbr.c 15 Feb 2017 23:36:29 - > @@ -144,8 +144,11 @@ MBR_init(struct mbr *mbr) > } > #endif > > - /* Start OpenBSD MBR partition on a power of 2 block number. */ > - i = 1; > + /* > +* Start OpenBSD MBR partition on a power of 2 block number > +* with a minimum 2048-block alignment. > +*/ > + i = 2048; > while (i < DL_SECTOBLK(&dl, mbr->part[3].bs)) > i *= 2; > adj = DL_BLKTOSEC(&dl, i) - mbr->part[3].bs; > Index: disklabel/editor.c > === > RCS file: /cvs/src/sbin/disklabel/editor.c,v > retrieving revision 1.304 > diff -u -p -r1.304 editor.c > --- disklabel/editor.c 6 Oct 2016 13:02:31 - 1.304 > +++ disklabel/editor.c 15 Feb 2017 23:36:29 - > @@ -2092,8 +2092,7 @@ align: > orig_size = DL_GETPSIZE(pp); > orig_offset = DL_GETPOFFSET(pp); > > - bsize = (DISKLABELV1_FFS_FRAG(pp->p_fragblock) * > - DISKLABELV1_FFS_FSIZE(pp->p_fragblock)) / lp->d_secsize; > + bsize = 2048; > if (DL_GETPOFFSET(pp) != starting_sector) { > /* Can't change offset of first partition. */ > adj = bsize - (DL_GETPOFFSET(pp) % bsize); >
Re: Is it worth considering changing the installer default offset from 64 to a higher value?
On 2017/02/15 23:22, Tom Smyth wrote: > Hello, > > I have been installing OpenBSD quite a bit in a virtualization > environment and the > underlying storage has been formatted in 1MB blocks, and most other operating > systems have recommended for best performance to align the partition on a 1MB > Offset rather than a smaller value. > > the reasoning behind this is that if you have files that are around > than 1MB in size > and they are stored in the file system it is likely that they will > cross a block boundary and while OpenBSD only wants to read the file, > the underlying storage system > will have to retrieve 2 blocks and this is wasteful on resources. > > If the offset was set to 2048 sectors (at 1MB in my case) the disk > performance of OpenBSD on my underlying system would be improved. > I have been setting the value of the initial offset to 2048 and I have > used exact sizes in-terms of sector count to keep my subsequent > partitions aligned with the 1M offset, > I have not noticed any ill affect on my installs (for the past 2 > years) and I think it > would be worth considering this as a suggested default when installing OpenBSD > > Any comments suggestions insights welcome, as my experience on virtualization > has been Vmware Based. But I would imagine most modern enterprise > storage systems are using larger block sizes... and by setting it to a > 1MB or perhaps a larger value (as long as the value ends on a 1MB > Boundary eg 2MB or 4MB) depending on other users experience > I've been doing similar for installs on SSDs with their large erase blocks. And this is common on other OS these days. As you say, it's not just the starting point of the OpenBSD part of the disk, care must be taken with the disklabel partitions too. I have an old tree that has this diff in, I don't remember if it was quite enough for disklabel as-is, but it might be enough of a starting point to save time tracking things down. Index: fdisk/mbr.c === RCS file: /cvs/src/sbin/fdisk/mbr.c,v retrieving revision 1.67 diff -u -p -r1.67 mbr.c --- fdisk/mbr.c 1 Sep 2016 16:17:46 - 1.67 +++ fdisk/mbr.c 15 Feb 2017 23:36:29 - @@ -144,8 +144,11 @@ MBR_init(struct mbr *mbr) } #endif - /* Start OpenBSD MBR partition on a power of 2 block number. */ - i = 1; + /* +* Start OpenBSD MBR partition on a power of 2 block number +* with a minimum 2048-block alignment. +*/ + i = 2048; while (i < DL_SECTOBLK(&dl, mbr->part[3].bs)) i *= 2; adj = DL_BLKTOSEC(&dl, i) - mbr->part[3].bs; Index: disklabel/editor.c === RCS file: /cvs/src/sbin/disklabel/editor.c,v retrieving revision 1.304 diff -u -p -r1.304 editor.c --- disklabel/editor.c 6 Oct 2016 13:02:31 - 1.304 +++ disklabel/editor.c 15 Feb 2017 23:36:29 - @@ -2092,8 +2092,7 @@ align: orig_size = DL_GETPSIZE(pp); orig_offset = DL_GETPOFFSET(pp); - bsize = (DISKLABELV1_FFS_FRAG(pp->p_fragblock) * - DISKLABELV1_FFS_FSIZE(pp->p_fragblock)) / lp->d_secsize; + bsize = 2048; if (DL_GETPOFFSET(pp) != starting_sector) { /* Can't change offset of first partition. */ adj = bsize - (DL_GETPOFFSET(pp) % bsize);
Is it worth considering changing the installer default offset from 64 to a higher value?
Hello, I have been installing OpenBSD quite a bit in a virtualization environment and the underlying storage has been formatted in 1MB blocks, and most other operating systems have recommended for best performance to align the partition on a 1MB Offset rather than a smaller value. the reasoning behind this is that if you have files that are around than 1MB in size and they are stored in the file system it is likely that they will cross a block boundary and while OpenBSD only wants to read the file, the underlying storage system will have to retrieve 2 blocks and this is wasteful on resources. If the offset was set to 2048 sectors (at 1MB in my case) the disk performance of OpenBSD on my underlying system would be improved. I have been setting the value of the initial offset to 2048 and I have used exact sizes in-terms of sector count to keep my subsequent partitions aligned with the 1M offset, I have not noticed any ill affect on my installs (for the past 2 years) and I think it would be worth considering this as a suggested default when installing OpenBSD Any comments suggestions insights welcome, as my experience on virtualization has been Vmware Based. But I would imagine most modern enterprise storage systems are using larger block sizes... and by setting it to a 1MB or perhaps a larger value (as long as the value ends on a 1MB Boundary eg 2MB or 4MB) depending on other users experience Thanks for your Time, Tom Smyth
Re: route foreach
On 15 February 2017 at 19:08, Alexander Bluhm wrote: > Hi, > > Replace some manual loops with FOREACH macro. > > ok? > > bluhm > Looks good to me, OK mikeb
route foreach
Hi, Replace some manual loops with FOREACH macro. ok? bluhm Index: net/route.c === RCS file: /data/mirror/openbsd/cvs/src/sys/net/route.c,v retrieving revision 1.350 diff -u -p -r1.350 route.c --- net/route.c 5 Feb 2017 16:23:38 - 1.350 +++ net/route.c 15 Feb 2017 18:00:05 - @@ -1556,8 +1556,7 @@ rt_timer_add(struct rtentry *rt, void (* * If there's already a timer with this action, destroy it before * we add a new one. */ - for (r = LIST_FIRST(&rt->rt_timer); r != NULL; -r = LIST_NEXT(r, rtt_link)) { + LIST_FOREACH(r, &rt->rt_timer, rtt_link) { if (r->rtt_func == func) { LIST_REMOVE(r, rtt_link); TAILQ_REMOVE(&r->rtt_queue->rtq_head, r, rtt_next); @@ -1598,8 +1597,7 @@ rt_timer_timer(void *arg) current_time = time_uptime; NET_LOCK(s); - for (rtq = LIST_FIRST(&rttimer_queue_head); rtq != NULL; -rtq = LIST_NEXT(rtq, rtq_link)) { + LIST_FOREACH(rtq, &rttimer_queue_head, rtq_link) { while ((r = TAILQ_FIRST(&rtq->rtq_head)) != NULL && (r->rtt_time + rtq->rtq_timeout) < current_time) { LIST_REMOVE(r, rtt_link); @@ -1620,7 +1618,7 @@ rt_timer_timer(void *arg) u_int16_t rtlabel_name2id(char *name) { - struct rt_label *label, *p = NULL; + struct rt_label *label, *p; u_int16_tnew_id = 1; if (!name[0]) @@ -1637,12 +1635,11 @@ rtlabel_name2id(char *name) * and take the first free slot we find. if there is none or the list * is empty, append a new entry at the end. */ - - if (!TAILQ_EMPTY(&rt_labels)) - for (p = TAILQ_FIRST(&rt_labels); p != NULL && - p->rtl_id == new_id; p = TAILQ_NEXT(p, rtl_entry)) - new_id = p->rtl_id + 1; - + TAILQ_FOREACH(p, &rt_labels, rtl_entry) { + if (p->rtl_id != new_id) + break; + new_id = p->rtl_id + 1; + } if (new_id > LABELID_MAX) return (0); @@ -1697,8 +1694,7 @@ rtlabel_unref(u_int16_t id) if (id == 0) return; - for (p = TAILQ_FIRST(&rt_labels); p != NULL; p = next) { - next = TAILQ_NEXT(p, rtl_entry); + TAILQ_FOREACH_SAFE(p, &rt_labels, rtl_entry, next) { if (id == p->rtl_id) { if (--p->rtl_ref == 0) { TAILQ_REMOVE(&rt_labels, p, rtl_entry);
Re: rcctl(8) does not set flags
> > Having it documented or not, the diff below removes an unneeded step, > > since "-tun 4" is already the default for nfsd. > > I have no opinion on that one. Sure it's not necessary but maybe it makes it > more obvious as a documentation. Dunno... We can't just delete the line since it serves as context for the following paragraph. I guess we could modify all occurrences of "4" and "four" to something different, but I don't think there's anything wrong with the way it is right now. > > Index: faq6.html > > === > > RCS file: /cvs/www/faq/faq6.html,v > > retrieving revision 1.427 > > diff -u -p -r1.427 faq6.html > > --- faq6.html 9 Feb 2017 17:22:19 - 1.427 > > +++ faq6.html 15 Feb 2017 16:05:30 - > > @@ -607,7 +607,6 @@ services must be enabled on the server: > > > > > > # rcctl enable portmap mountd nfsd > > -# rcctl set nfsd flags -tun 4 > > > > > > The -t and -u flags for nfsd(8) enable TCP and UDP, > > > > -- > > db > > > > -- > Antoine >
Re: rcctl(8) does not set flags
On Wed, Feb 15, 2017 at 02:22:31PM -0200, Daniel Bolgheroni wrote: > Hi tech@, > > Setting, for example > > # rcctl enable nfsd > # rcctl set nfsd flags -tun 4 > > has no effect on /etc/rc.conf.local. This is also true for other cases > where the default flags for the daemon are equal to the flags you're > trying to set. That is correct (and expected). > It seemed a problem at first, since there is no reference to this > behaviour on the man page. But looking at the source code > (usr.sbin/rcctl/rcctl.sh, r1.105, line 452), this is actually expected. > > # unset flags if they match the default enabled ones > [ "${_args}" = "$(svc_getdef ${_svc} ${_var})" ] && \ > unset _args > > Should this behaviour be on the man page? I don't think it's too obvious > to assume the flags wasn't set on /etc/rc.conf.local because it is > already the default for the daemon. Well, if you enable sshd it won't add 'sshd=' to rc.conf.local either (since it's the default). It's kind of expected I think. That said, I don't mind to add it in the man page if people think it's worth it. But if you use rcctl to *set* flags, then use it to *get* them as well and you will see it's what you expect :-) > Having it documented or not, the diff below removes an unneeded step, > since "-tun 4" is already the default for nfsd. I have no opinion on that one. Sure it's not necessary but maybe it makes it more obvious as a documentation. Dunno... > > Index: faq6.html > === > RCS file: /cvs/www/faq/faq6.html,v > retrieving revision 1.427 > diff -u -p -r1.427 faq6.html > --- faq6.html 9 Feb 2017 17:22:19 - 1.427 > +++ faq6.html 15 Feb 2017 16:05:30 - > @@ -607,7 +607,6 @@ services must be enabled on the server: > > > # rcctl enable portmap mountd nfsd > -# rcctl set nfsd flags -tun 4 > > > The -t and -u flags for nfsd(8) enable TCP and UDP, > > -- > db > -- Antoine
rcctl(8) does not set flags
Hi tech@, Setting, for example # rcctl enable nfsd # rcctl set nfsd flags -tun 4 has no effect on /etc/rc.conf.local. This is also true for other cases where the default flags for the daemon are equal to the flags you're trying to set. It seemed a problem at first, since there is no reference to this behaviour on the man page. But looking at the source code (usr.sbin/rcctl/rcctl.sh, r1.105, line 452), this is actually expected. # unset flags if they match the default enabled ones [ "${_args}" = "$(svc_getdef ${_svc} ${_var})" ] && \ unset _args Should this behaviour be on the man page? I don't think it's too obvious to assume the flags wasn't set on /etc/rc.conf.local because it is already the default for the daemon. Having it documented or not, the diff below removes an unneeded step, since "-tun 4" is already the default for nfsd. Index: faq6.html === RCS file: /cvs/www/faq/faq6.html,v retrieving revision 1.427 diff -u -p -r1.427 faq6.html --- faq6.html 9 Feb 2017 17:22:19 - 1.427 +++ faq6.html 15 Feb 2017 16:05:30 - @@ -607,7 +607,6 @@ services must be enabled on the server: # rcctl enable portmap mountd nfsd -# rcctl set nfsd flags -tun 4 The -t and -u flags for nfsd(8) enable TCP and UDP, -- db