Re: dup3 syscall - atomic set O_CLOEXEC with dup2
On Fri, Jan 20, 2012 at 07:09:43PM -0500, Eitan Adler wrote: > 2012/1/20 Kostik Belousov : > > On Fri, Jan 20, 2012 at 06:25:42PM -0500, Eitan Adler wrote: > >> I figure this isn't wanted? > > You silently ignored part of the notes that were provided, > > I fixed the style violations you pointed out and removed > _SYS_SYSPROTO_H_ and friends. Which else was there? If I missed > something I'm sorry. Manual definition of the struct dup3_args is not needed at all. I very much doubt that patch can compile since there is now duplicate definition of the structure. There is still stray return (0);. Another reason for the patch to not compile is compat32 syscalls.master changes. The suggestion of 'it is better to test the presence of O_CLOEXEC with & instead of comparing with ==, to allow for ease introduction of future dup3 flags, if any.' is silently ignored. Probably new: style requires putting opening '{' for the function body at new line. There shall be no space between '*' and arg name. New: symbols in the version map shall be ordered alphabetically. > > > and keep > > complete silence on the primary question about non-standard > > I thought I answered that already but I'll try again: I believe the > functionality is useful to developers even if it is non-standard. In > addition if it is standardized at some point in the future it is > unlikely to have different semantics than the ones implemented. It very well may have different details that would force us to draw two versions forever. > > > and fractional nature of the patch. > > How so? I am not including the generated parts in the patch. Should I? The implementation is partial because dup3() is only part of the Linux and glibc efforts to allow app authors to handle the race between obtaining the file descriptor and setting cloexec flag. I already pointed you to SOCK_CLOEXEC for example. This is ignored as well. dup3() alone is almost useless. > > > I see no reason to retype my previous response. > > -- > Eitan Adler pgpL6B1rXwfLx.pgp Description: PGP signature
Re: dup3 syscall - atomic set O_CLOEXEC with dup2
On Fri, Jan 20, 2012 at 06:25:42PM -0500, Eitan Adler wrote: > I figure this isn't wanted? You silently ignored part of the notes that were provided, and keep complete silence on the primary question about non-standard and fractional nature of the patch. I see no reason to retype my previous response. > > On Thu, Jan 12, 2012 at 10:07 PM, Eitan Adler wrote: > > Okay - here is version 2 (compile and run tested) > > > > Index: sys/kern/kern_descrip.c > > === > > --- sys/kern/kern_descrip.c (revision 229830) > > +++ sys/kern/kern_descrip.c (working copy) > > @@ -110,6 +110,7 @@ > > /* Flags for do_dup() */ > > #define DUP_FIXED 0x1 /* Force fixed allocation */ > > #define DUP_FCNTL 0x2 /* fcntl()-style errors */ > > +#define DUP_CLOEXEC 0x4 /* Enable O_CLOEXEC on the new fd */ > > > > static int do_dup(struct thread *td, int flags, int old, int new, > > register_t *retval); > > @@ -307,7 +308,36 @@ > > return (0); > > } > > > > +struct dup3_args { > > + u_int from; > > + u_int to; > > + int flags; > > +}; > > + > > /* > > + * Duplicate a file descriptor and allow for O_CLOEXEC > > + */ > > + > > +int > > +sys_dup3(struct thread * td, struct dup3_args * uap) { > > + int dupflags; > > + > > + if (uap->from == uap->to) > > + return (EINVAL); > > + > > + if (uap->flags & ~O_CLOEXEC) > > + return (EINVAL); > > + > > + dupflags = DUP_FIXED; > > + if (uap->flags & O_CLOEXEC) > > + dupflags |= DUP_CLOEXEC; > > + > > + return (do_dup(td, dupflags, (int)uap->from, (int)uap->to, > > + td->td_retval)); > > + return (0); > > +} > > + > > +/* > > * Duplicate a file descriptor to a particular value. > > * > > * Note: keep in mind that a potential race condition exists when closing > > @@ -912,6 +942,9 @@ > > fdp->fd_lastfile = new; > > *retval = new; > > > > + if (flags & DUP_CLOEXEC) > > + fdp->fd_ofileflags[new] |= UF_EXCLOSE; > > + > > /* > > * If we dup'd over a valid file, we now own the reference to it > > * and must dispose of it using closef() semantics (as if a > > Index: sys/kern/syscalls.master > > === > > --- sys/kern/syscalls.master (revision 229830) > > +++ sys/kern/syscalls.master (working copy) > > @@ -951,5 +951,6 @@ > > off_t offset, off_t len); } > > 531 AUE_NULL STD { int posix_fadvise(int fd, off_t offset, \ > > off_t len, int advice); } > > +532 AUE_NULL STD { int dup3(u_int from, u_int to, int > > flags); } > > ; Please copy any additions and changes to the following compatability > > tables: > > ; sys/compat/freebsd32/syscalls.master > > Index: sys/compat/freebsd32/syscalls.master > > === > > --- sys/compat/freebsd32/syscalls.master (revision 229830) > > +++ sys/compat/freebsd32/syscalls.master (working copy) > > @@ -997,3 +997,4 @@ > > uint32_t offset1, uint32_t offset2,\ > > uint32_t len1, uint32_t len2, \ > > int advice); } > > +532 AUE_NULL STD { int dup3(u_int from, u_int to, int > > flags); } > > > > Index: lib/libc/sys/Symbol.map > > === > > --- lib/libc/sys/Symbol.map (revision 229830) > > +++ lib/libc/sys/Symbol.map (working copy) > > @@ -383,6 +383,7 @@ > > > > FBSD_1.3 { > > posix_fadvise; > > + dup3; > > }; > > > > FBSDprivate_1.0 { > > > > > > -- > > Eitan Adler > > > > -- > Eitan Adler pgpIGBa27Avyh.pgp Description: PGP signature
Re: Processes' FIBs
On Thu, Jan 12, 2012 at 10:44:51PM -0800, Julian Elischer wrote: > On 1/12/12 6:04 AM, Oliver Fromme wrote: > >Bjoern A. Zeeb wrote: > > > On 11. Jan 2012, at 15:06 , Oliver Fromme wrote: > > > > I'm currently looking at the source code of ps, but adding > > > > a field for the FIB isn't as trivial as I thought because > > > > ps only sees struct kinfo_proc (via sysctl kern.proc.*) > > > > which doesn't contain the FIB. procstat does the same. > > > > > > > > I'm currently trying to write a patch that copies p_fibnum > > > > from struct proc to struct kinfo_proc (just like p_nice, > > > > for example). Does that make sense? If so, does the patch > > > > below look reasonable? (I've made it on a stable/8 system, > > > > but it should apply to 9 and 10, too.) > > > > > > I am not sure it makes too much sense in ps. It might make sense in > > > sockstat maybe? > > > >Well, every process has a default FIB number (p_fibnum in > >struct proc). It is a property of the process, just like > >the nice value for example. So I think it makes sense for > >ps to be able to display it if the user asks for it. This > >is the piece of information that I need. > > > >On the other hand, sockstat displays open sockets only. > >Of course, an internet socket has a FIB number associated > >with it, too, so sockstat could display it. But that > >would be a completely different piece of information, > >and it wouldn't solve the actual problem that I'm currently > >facing. > > > > I hadn't considered that a process would want to see the fib of another. > but of course it makes sense. > I see no problem the the attached patch except that it doesn't fix the > man page for ps and setfib(2) or setfib(8) (or is it 1?) > > etc. > if there are no objections, I can add this when it has been polished.. The patch misses compat32 bits and breaks compat32 ps/top. > > >Best regards > > Oliver > > > >--- ./sys/sys/user.h.orig2011-07-12 14:23:54.0 +0200 > >+++ ./sys/sys/user.h 2012-01-11 15:35:50.0 +0100 > >@@ -83,7 +83,7 @@ > > * it in two places: function fill_kinfo_proc in sys/kern/kern_proc.c and > > * function kvm_proclist in lib/libkvm/kvm_proc.c . > > */ > >-#define KI_NSPARE_INT 9 > >+#define KI_NSPARE_INT 8 > > #defineKI_NSPARE_LONG 12 > > #defineKI_NSPARE_PTR 6 > > > >@@ -177,6 +177,7 @@ > > */ > > charki_sparestrings[68];/* spare string space */ > > int ki_spareints[KI_NSPARE_INT];/* spare room for growth */ > >+int ki_fibnum; /* Default FIB number */ > > u_int ki_cr_flags;/* Credential flags */ > > int ki_jid; /* Process jail ID */ > > int ki_numthreads; /* XXXKSE number of threads in total > > */ > >--- ./sys/kern/kern_proc.c.orig 2011-07-12 14:19:26.0 +0200 > >+++ ./sys/kern/kern_proc.c 2012-01-11 15:36:22.0 +0100 > >@@ -775,6 +775,7 @@ > > kp->ki_swtime = (ticks - p->p_swtick) / hz; > > kp->ki_pid = p->p_pid; > > kp->ki_nice = p->p_nice; > >+kp->ki_fibnum = p->p_fibnum; > > PROC_SLOCK(p); > > rufetch(p,&kp->ki_rusage); > > kp->ki_runtime = cputick2usec(p->p_rux.rux_runtime); > >--- ./bin/ps/keyword.c.orig 2011-07-12 13:42:48.0 +0200 > >+++ ./bin/ps/keyword.c 2012-01-11 15:44:27.0 +0100 > >@@ -90,6 +90,7 @@ > > NULL, 0}, > > {"etime", "ELAPSED", NULL, USER, elapsed, NULL, 12, 0, CHAR, NULL, > > 0}, > > {"f", "F", NULL, 0, kvar, NULL, 8, KOFF(ki_flag), INT, "x", 0}, > >+{"fib", "FIB", NULL, 0, kvar, NULL, 2, KOFF(ki_fibnum), INT, "d", 0}, > > {"flags", "", "f", 0, NULL, NULL, 0, 0, CHAR, NULL, 0}, > > {"ignored", "", "sigignore", 0, NULL, NULL, 0, 0, CHAR, NULL, 0}, > > {"inblk", "INBLK", NULL, USER, rvar, NULL, 4, ROFF(ru_inblock), LONG, > >___ > >freebsd-hackers@freebsd.org mailing list > >http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" > > > > ___ > freebsd-...@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" pgpx0c7BUa047.pgp Description: PGP signature
Re: dup3 syscall - atomic set O_CLOEXEC with dup2
On Thu, Jan 12, 2012 at 12:08:40PM +0200, Kostik Belousov wrote: > On Thu, Jan 12, 2012 at 12:01:29AM -0500, Eitan Adler wrote: > > This is an implementation of dup3 for FreeBSD: > > man page here (with a FreeBSD patch coming soon): > > https://www.kernel.org/doc/man-pages/online/pages/man2/dup.2.html > > > > Is this implementation correct? If so any objection to adding this as > > a supported syscall? > > > > > > Index: sys/kern/kern_descrip.c > > === > > --- sys/kern/kern_descrip.c (revision 229830) > > +++ sys/kern/kern_descrip.c (working copy) > > @@ -110,6 +110,7 @@ > > /* Flags for do_dup() */ > > #define DUP_FIXED 0x1 /* Force fixed allocation */ > > #define DUP_FCNTL 0x2 /* fcntl()-style errors */ > > +#define DUP_CLOEXEC0x4 /* Enable O_CLOEXEC on the new fs */ > > > > static int do_dup(struct thread *td, int flags, int old, int new, > > register_t *retval); > > @@ -307,7 +308,39 @@ > > return (0); > > } > > > > +#ifndef _SYS_SYSPROTO_H_ > > +struct dup3_args { > > + u_int from; > > + u_int to; > > + int flags; > > +}; > > +#endif > > + > The _SYS_SYSPROTO_H_ should have been obsoleted and removed long time ago. > I see no reasons to continue its agony for new syscalls. > > > /* > > + * Duplicate a file descriptor and allow for O_CLOEXEC > > + */ > > + > > +/* ARGSUSED */ > ARGUSED is from the same category as SYS_SYSPROTO_H. > > > +int > > +sys_dup3(struct thread * const td, struct dup3_args * const uap) { > The td and uap are not constant in the prototypes generated by > makesyscalls.sh. This makes me seriosly wonder was the patch compiled at > all. > > > + > > + KASSERT(td != NULL, ("%s: td == NULL", __func__)); > > + KASSERT(uap != NULL, ("%s: uap == NULL", __func__)); > The asserts are useless and must be removed. > > > + > > + if (uap->from == uap->to) > > + return EINVAL; > Style(9) recommends to brace the values of return operator. > > > + > > + if (uap->flags & ~O_CLOEXEC) > > + return EINVAL; > > + > > + const int dupflags = (uap->flags == O_CLOEXEC) ? > > DUP_FIXED|DUP_CLOEXEC : DUP_FIXED; > There are too many style violations to enumerate them all, please do not > consider the list below exhaustive: > - the variable is declared in the executive part of the function body; > - line too long; > - no spaces around binary '|'. > > Not stule comments: there is no use in declaring dupflags const; > it is better to test the presence of O_CLOEXEC with & instead of comparing > with ==, to allow for ease introduction of future dup3 flags, if any. > > > + > > + return (do_dup(td, dupflags, (int)uap->from, (int)uap->to, > > + td->td_retval)); > Why casting arguments to int ? > > + return (0); > Isn't this line never executed ? > > +} > > + > > +/* > > * Duplicate a file descriptor to a particular value. > > * > > * Note: keep in mind that a potential race condition exists when closing > > @@ -912,6 +945,9 @@ > > fdp->fd_lastfile = new; > > *retval = new; > > > > + if (flags & DUP_CLOEXEC) > > + fdp->fd_ofileflags[new] |= UF_EXCLOSE; > > + > It is better to handle UF_EXCLOSE at the time of assignment to > fdp->fd_ofileflags[new] just above, instead of clearing UF_EXCLOSE > and then setting it. > > > /* > > * If we dup'd over a valid file, we now own the reference to it > > * and must dispose of it using closef() semantics (as if a > > Index: sys/kern/syscalls.master > > === > > --- sys/kern/syscalls.master(revision 229830) > > +++ sys/kern/syscalls.master(working copy) > > @@ -951,5 +951,6 @@ > > off_t offset, off_t len); } > > 531AUE_NULLSTD { int posix_fadvise(int fd, off_t > > offset, \ > > off_t len, int advice); } > > +532AUE_NULLSTD { int dup3(u_int from, u_int to, int > > flags); } > > ; Please copy any additions and changes to the following compatability > > tables: > > ; sys/compat/freebsd32/syscalls.master > > Index: sys/compat/freebsd32/syscalls.master > > ==
Re: dup3 syscall - atomic set O_CLOEXEC with dup2
On Thu, Jan 12, 2012 at 12:01:29AM -0500, Eitan Adler wrote: > This is an implementation of dup3 for FreeBSD: > man page here (with a FreeBSD patch coming soon): > https://www.kernel.org/doc/man-pages/online/pages/man2/dup.2.html > > Is this implementation correct? If so any objection to adding this as > a supported syscall? > > > Index: sys/kern/kern_descrip.c > === > --- sys/kern/kern_descrip.c (revision 229830) > +++ sys/kern/kern_descrip.c (working copy) > @@ -110,6 +110,7 @@ > /* Flags for do_dup() */ > #define DUP_FIXED0x1 /* Force fixed allocation */ > #define DUP_FCNTL0x2 /* fcntl()-style errors */ > +#define DUP_CLOEXEC 0x4 /* Enable O_CLOEXEC on the new fs */ > > static int do_dup(struct thread *td, int flags, int old, int new, > register_t *retval); > @@ -307,7 +308,39 @@ > return (0); > } > > +#ifndef _SYS_SYSPROTO_H_ > +struct dup3_args { > + u_int from; > + u_int to; > + int flags; > +}; > +#endif > + The _SYS_SYSPROTO_H_ should have been obsoleted and removed long time ago. I see no reasons to continue its agony for new syscalls. > /* > + * Duplicate a file descriptor and allow for O_CLOEXEC > + */ > + > +/* ARGSUSED */ ARGUSED is from the same category as SYS_SYSPROTO_H. > +int > +sys_dup3(struct thread * const td, struct dup3_args * const uap) { The td and uap are not constant in the prototypes generated by makesyscalls.sh. This makes me seriosly wonder was the patch compiled at all. > + > + KASSERT(td != NULL, ("%s: td == NULL", __func__)); > + KASSERT(uap != NULL, ("%s: uap == NULL", __func__)); The asserts are useless and must be removed. > + > + if (uap->from == uap->to) > + return EINVAL; Style(9) recommends to brace the values of return operator. > + > + if (uap->flags & ~O_CLOEXEC) > + return EINVAL; > + > + const int dupflags = (uap->flags == O_CLOEXEC) ? > DUP_FIXED|DUP_CLOEXEC : DUP_FIXED; There are too many style violations to enumerate them all, please do not consider the list below exhaustive: - the variable is declared in the executive part of the function body; - line too long; - no spaces around binary '|'. Not stule comments: there is no use in declaring dupflags const; it is better to test the presence of O_CLOEXEC with & instead of comparing with ==, to allow for ease introduction of future dup3 flags, if any. > + > + return (do_dup(td, dupflags, (int)uap->from, (int)uap->to, > + td->td_retval)); Why casting arguments to int ? > + return (0); Isn't this line never executed ? > +} > + > +/* > * Duplicate a file descriptor to a particular value. > * > * Note: keep in mind that a potential race condition exists when closing > @@ -912,6 +945,9 @@ > fdp->fd_lastfile = new; > *retval = new; > > + if (flags & DUP_CLOEXEC) > + fdp->fd_ofileflags[new] |= UF_EXCLOSE; > + It is better to handle UF_EXCLOSE at the time of assignment to fdp->fd_ofileflags[new] just above, instead of clearing UF_EXCLOSE and then setting it. > /* >* If we dup'd over a valid file, we now own the reference to it >* and must dispose of it using closef() semantics (as if a > Index: sys/kern/syscalls.master > === > --- sys/kern/syscalls.master (revision 229830) > +++ sys/kern/syscalls.master (working copy) > @@ -951,5 +951,6 @@ > off_t offset, off_t len); } > 531 AUE_NULLSTD { int posix_fadvise(int fd, off_t offset, \ > off_t len, int advice); } > +532 AUE_NULLSTD { int dup3(u_int from, u_int to, int flags); } > ; Please copy any additions and changes to the following compatability > tables: > ; sys/compat/freebsd32/syscalls.master > Index: sys/compat/freebsd32/syscalls.master > === > --- sys/compat/freebsd32/syscalls.master (revision 229830) > +++ sys/compat/freebsd32/syscalls.master (working copy) > @@ -997,3 +997,4 @@ > uint32_t offset1, uint32_t offset2,\ > uint32_t len1, uint32_t len2, \ > int advice); } > +532 AUE_NULLSTD { int dup3(u_int from, u_int to, int flags); } And again, this part of patch makes me think that you never compiled it. That said, I am not sure that we shall copy the O_CLOEXEC crusade from Linux until it is standartized. And, if copying, I think we shall copy all bits of the new API, like SOCK_CLOEXEC etc, and not just dup3. pgpnIpOkAjXZm.pgp Description: PGP signature
Re: Checking for other kernel modules on load
On Thu, Dec 29, 2011 at 12:53:19PM +, Chris Rees wrote: > 2011/12/29 Kostik Belousov : > > On Thu, Dec 29, 2011 at 11:46:57AM +, Chris Rees wrote: > >> 2011/12/28 Kostik Belousov : > >> > On Wed, Dec 28, 2011 at 02:53:42PM +, Chris Rees wrote: > >> >> 2011/12/28 Kostik Belousov : > >> >> > On Wed, Dec 28, 2011 at 12:23:58PM +, Chris Rees wrote: > >> >> >> On 28 December 2011 12:21, Daniel O'Connor > >> >> >> wrote: > >> >> >> > > >> >> >> > On 28/12/2011, at 22:07, Chris Rees wrote: > >> >> >> >> Is there a simple way to check for existence of a driver? I could > >> >> >> >> even check for /dev/sndstat, though that doesn't seem elegant to > >> >> >> >> me... > >> >> >> > > >> >> >> > kldstat -v, but really /dev/sndstat seems simpler and just as > >> >> >> > effective. > >> >> >> > > >> >> >> > >> >> >> Cheers-- I was thinking of a kernel-level function though. > >> >> >> > >> >> >> cognet@ suggested using modfind("sound"), I'll go with that. > >> >> > Obvious question is what the panic is. Checking for modules loaded is > >> >> > papering over some issue. > >> >> > >> >> True, although I figured that it was a simple conflict, possibly to do > >> >> with sndstat. > >> >> > >> >> Also, I'm getting panics with the following patch, whether sound is > >> >> loaded or not :) > >> >> > >> >> + if (modfind("sound") >= 0) > >> >> + { > >> >> + cmn_err (CE_WARN, "A conflicting sound driver is already > >> >> loaded"); > >> >> + return EBUSY; > >> >> + } > >> >> + > >> >> > >> >> Is there a better way to handle such conflicts? > >> > > >> > You have missed the point. There is some bug in oss driver that causing > >> > the panic. Presumed 'conflict' cannot cause the harm itself, besides not > >> > allowing second driver to attach to the same device, and should not > >> > result > >> > in panic. Trying to implement a half-measure that only covers the problem > >> > you do a mis-service. > >> > > >> > And you still did not provided the panic message. > >> > >> I'm sorry, you're right. However, your guess was in fact correct; > >> make_dev was being called, which returned a null pointer because it > >> failed. > >> > >> The patch at [1] stops the panic, however I was hoping that returning > >> EBUSY would abort loading the module... At the moment it loads the > >> module, and doesn't create the sndstat dev, which causes weird errors > >> with the oss binary commands. > >> > >> Since this solves the panic and anyone should be able to work out from > >> the warning message what the problem is, AND this is a port that > >> apparently no-one else uses, should this be sufficient? > >> > >> BTW, it only affects FreeBSD 9+, couldn't reproduce on my 8.2 dev > >> machine, but could once I upgraded it. > > On 8.2, there is no check in the devfs for duplicated cdev names, AFAIR. > > So you get absolutely undeterministic behaviour which driver is referenced > > by devfs node. > > > >> Chris > >> > >> [1] > >> http://www.bayofrum.net/~crees/patches/oss-patch-kernel-OS-FreeBSD-os_freebsd.c > > > > I highly recommend to return error in case of any make_dev_p(9) failure, and > > not only EEXIST. > > That'd be great-- but I can't work out how to do it :( > > Do I need to return a different value? error = make_dev_p(); if (error != 0) { printf("Error creating device node /dev/%s: %d\n", name, error); return (error); } Or I do not understand your question. pgpv6Kv0D7MoL.pgp Description: PGP signature
Re: Checking for other kernel modules on load
On Thu, Dec 29, 2011 at 11:46:57AM +, Chris Rees wrote: > 2011/12/28 Kostik Belousov : > > On Wed, Dec 28, 2011 at 02:53:42PM +, Chris Rees wrote: > >> 2011/12/28 Kostik Belousov : > >> > On Wed, Dec 28, 2011 at 12:23:58PM +, Chris Rees wrote: > >> >> On 28 December 2011 12:21, Daniel O'Connor > >> >> wrote: > >> >> > > >> >> > On 28/12/2011, at 22:07, Chris Rees wrote: > >> >> >> Is there a simple way to check for existence of a driver? I could > >> >> >> even check for /dev/sndstat, though that doesn't seem elegant to > >> >> >> me... > >> >> > > >> >> > kldstat -v, but really /dev/sndstat seems simpler and just as > >> >> > effective. > >> >> > > >> >> > >> >> Cheers-- I was thinking of a kernel-level function though. > >> >> > >> >> cognet@ suggested using modfind("sound"), I'll go with that. > >> > Obvious question is what the panic is. Checking for modules loaded is > >> > papering over some issue. > >> > >> True, although I figured that it was a simple conflict, possibly to do > >> with sndstat. > >> > >> Also, I'm getting panics with the following patch, whether sound is > >> loaded or not :) > >> > >> + if (modfind("sound") >= 0) > >> + { > >> + cmn_err (CE_WARN, "A conflicting sound driver is already loaded"); > >> + return EBUSY; > >> + } > >> + > >> > >> Is there a better way to handle such conflicts? > > > > You have missed the point. There is some bug in oss driver that causing > > the panic. Presumed 'conflict' cannot cause the harm itself, besides not > > allowing second driver to attach to the same device, and should not result > > in panic. Trying to implement a half-measure that only covers the problem > > you do a mis-service. > > > > And you still did not provided the panic message. > > I'm sorry, you're right. However, your guess was in fact correct; > make_dev was being called, which returned a null pointer because it > failed. > > The patch at [1] stops the panic, however I was hoping that returning > EBUSY would abort loading the module... At the moment it loads the > module, and doesn't create the sndstat dev, which causes weird errors > with the oss binary commands. > > Since this solves the panic and anyone should be able to work out from > the warning message what the problem is, AND this is a port that > apparently no-one else uses, should this be sufficient? > > BTW, it only affects FreeBSD 9+, couldn't reproduce on my 8.2 dev > machine, but could once I upgraded it. On 8.2, there is no check in the devfs for duplicated cdev names, AFAIR. So you get absolutely undeterministic behaviour which driver is referenced by devfs node. > Chris > > [1] > http://www.bayofrum.net/~crees/patches/oss-patch-kernel-OS-FreeBSD-os_freebsd.c I highly recommend to return error in case of any make_dev_p(9) failure, and not only EEXIST. pgpwW0Pqu6gLw.pgp Description: PGP signature
Re: Checking for other kernel modules on load
On Wed, Dec 28, 2011 at 02:53:42PM +, Chris Rees wrote: > 2011/12/28 Kostik Belousov : > > On Wed, Dec 28, 2011 at 12:23:58PM +, Chris Rees wrote: > >> On 28 December 2011 12:21, Daniel O'Connor wrote: > >> > > >> > On 28/12/2011, at 22:07, Chris Rees wrote: > >> >> Is there a simple way to check for existence of a driver? I could > >> >> even check for /dev/sndstat, though that doesn't seem elegant to me... > >> > > >> > kldstat -v, but really /dev/sndstat seems simpler and just as effective. > >> > > >> > >> Cheers-- I was thinking of a kernel-level function though. > >> > >> cognet@ suggested using modfind("sound"), I'll go with that. > > Obvious question is what the panic is. Checking for modules loaded is > > papering over some issue. > > True, although I figured that it was a simple conflict, possibly to do > with sndstat. > > Also, I'm getting panics with the following patch, whether sound is > loaded or not :) > > + if (modfind("sound") >= 0) > +{ > + cmn_err (CE_WARN, "A conflicting sound driver is already loaded"); > + return EBUSY; > +} > + > > Is there a better way to handle such conflicts? You have missed the point. There is some bug in oss driver that causing the panic. Presumed 'conflict' cannot cause the harm itself, besides not allowing second driver to attach to the same device, and should not result in panic. Trying to implement a half-measure that only covers the problem you do a mis-service. And you still did not provided the panic message. pgpUT1vbYGa31.pgp Description: PGP signature
Re: Checking for other kernel modules on load
On Wed, Dec 28, 2011 at 12:23:58PM +, Chris Rees wrote: > On 28 December 2011 12:21, Daniel O'Connor wrote: > > > > On 28/12/2011, at 22:07, Chris Rees wrote: > >> Is there a simple way to check for existence of a driver? I could > >> even check for /dev/sndstat, though that doesn't seem elegant to me... > > > > kldstat -v, but really /dev/sndstat seems simpler and just as effective. > > > > Cheers-- I was thinking of a kernel-level function though. > > cognet@ suggested using modfind("sound"), I'll go with that. Obvious question is what the panic is. Checking for modules loaded is papering over some issue. pgpnYaFqKXxNx.pgp Description: PGP signature
Re: Per-mount syncer threads and fanout for pagedaemon cleaning
On Tue, Dec 27, 2011 at 05:05:04PM +0100, Attilio Rao wrote: > 2011/12/27 Giovanni Trematerra : > > On Mon, Dec 26, 2011 at 9:24 PM, Venkatesh Srinivas > > wrote: > >> Hi! > >> > >> I've been playing with two things in DragonFly that might be of interest > >> here. > >> > >> Thing #1 := > >> > >> First, per-mountpoint syncer threads. Currently there is a single thread, > >> 'syncer', which periodically calls fsync() on dirty vnodes from every > >> mount, > >> along with calling vfs_sync() on each filesystem itself (via syncer > >> vnodes). > >> > >> My patch modifies this to create syncer threads for mounts that request it. > >> For these mounts, vnodes are synced from their mount-specific thread rather > >> than the global syncer. > >> > >> The idea is that periodic fsync/sync operations from one filesystem should > >> not > >> stall or delay synchronization for other ones. > >> The patch was fairly simple: > >> http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/50e4012a4b55e1efc595db0db397b4365f08b640 > >> > > > > There's something WIP by attilio@ on that area. > > you might want to take a look at > > http://people.freebsd.org/~attilio/syncer_alpha_15.diff > > > > I don't know what hammerfs needs but UFS/FFS and buffer cache make a good > > job performance-wise and so the authors are skeptical about the boost that > > such > > a change can give. We believe that brain cycles need to be spent on > > other pieces of the system such as ARC and ZFS. > > More specifically, it is likely that focusing on UFS and buffer cache > for performance is not really useful, we should drive our efforts over > ARC and ZFS. > Also, the real bottlenecks in our I/O paths are in GEOM > single-threaded design, lack of unmapped I/O functionality, possibly > lack of proritized I/O, etc. Even if not useful for performance (this is possible), the change itself is useful to provide better system behaviour in the case of failure. E.g., slowly-responding or wedged NFS server, dying disk etc would more limited impact with the patch then without it. It will not completely solve the issue, since e.g. dirty buffers amount is not limited per-mount point, only globally. But at least it covers significant part of the problem. Also, it should help with interactivity and load pikes at 30sec interval. I remember that I had no major objections when I read the patch. I personally would prefer to have it committed. pgpecs4kYoJpu.pgp Description: PGP signature
Re: Getting swapped-out memory per process
On Mon, Dec 26, 2011 at 02:02:09PM -0500, Boris Kochergin wrote: > Hi. > > Is there a way, from userspace, to get the amount of memory a given > process currently has swapped out? The VM does not track memory 'per process'. Simplifying to the point where the statement becomes false, it assigns the memory pages to the vm objects, and allows to map objects into process address space. Pageout works on the page by page basis, regardless of the page ownership (for the normal pages, using some definition of normal). Since one page can and often is mapped into several address spaces, accounting on the per-process is meaningless or causes the stress of the imagination of somebody who defines the accounting policy. pgpLN6Ks3Pr7N.pgp Description: PGP signature
Re: rtld and noexec
On Sun, Dec 04, 2011 at 02:17:43PM +0100, joris dedieu wrote: > 2011/12/2 Alexander Kabaev : > > On Fri, 2 Dec 2011 18:22:57 +0100 > > joris dedieu wrote: > > > >> Hi, > >> > >> Here is a patch I use to prevent loading a shared object from a noexec > >> mountpoint. It's an easy way, I found, after the last root exploit > >> ((http://seclists.org/fulldisclosure/2011/Nov/452), to enhance the > >> security of my web servers (with /home, /tmp and /var/tmp mounted with > >> noexec). > >> > >> - the last ftpd/porftpd (libc ?) exploit does not work (indirect use > >> of rtld via nsswitch) > >> - the previous rtld security issue should have been more difficult to > >> use in a noexec context. > >> - It may help to prevent some miscellaneous usage of common softwares > >> using dlopen like apache or php. > >> > >> I think it also makes sens because loading a shared object sounds like > >> a kind of "execution". > >> > >> What do you think about this patch and the opportunity to open a PR on > >> this subject? > >> > >> Cheers > >> Joris > >> > >> > >> --- libexec/rtld-elf/rtld.c.orig 2011-12-02 12:09:40.0 > >> +0100 +++ libexec/rtld-elf/rtld.c 2011-12-02 13:45:18.0 > >> +0100 @@ -1123,32 +1123,50 @@ > >> { > >> char *pathname; > >> char *name; > >> + struct statfs mnt; > >> > >> if (strchr(xname, '/') != NULL) { /* Hard coded pathname */ > >> + name = NULL; > >> if (xname[0] != '/' && !trust) { > >> _rtld_error("Absolute pathname required for shared object > >> \"%s\"", xname); > >> return NULL; > >> } > >> if (refobj != NULL && refobj->z_origin) > >> - return origin_subst(xname, refobj->origin_path); > >> + pathname = origin_subst(xname, refobj->origin_path); > >> else > >> - return xstrdup(xname); > >> + pathname = xstrdup(xname); > >> + } > >> + else { /* xname is not a path */ > >> + if (libmap_disable || (refobj == NULL) || > >> + (name = lm_find(refobj->path, xname)) == NULL) > >> + name = (char *)xname; > >> + > >> + dbg(" Searching for \"%s\"", name); > >> + > >> + pathname = search_library_path(name, ld_library_path); > >> + if (pathname == NULL && refobj != NULL) > >> + pathname = search_library_path(name, refobj->rpath); > >> + if (pathname == NULL) > >> + pathname = search_library_path(name, gethints()); > >> + if (pathname == NULL) > >> + pathname = search_library_path(name, > >> STANDARD_LIBRARY_PATH); > >> + } > >> + > >> + if (pathname != NULL) { /* noexec mountpoint in pathname */ > >> + if (statfs(pathname, &mnt) != 0) > >> + free(pathname); > >> + else { > >> + if (mnt.f_flags & MNT_NOEXEC) { > >> + _rtld_error("noexec violation for shared object > >> \"%s\"", pathname); > >> + free(pathname); > >> + return NULL; > >> + } > >> + else > >> + return pathname; > >> + } > >> } > >> > >> - if (libmap_disable || (refobj == NULL) || > >> - (name = lm_find(refobj->path, xname)) == NULL) > >> - name = (char *)xname; > >> - > >> - dbg(" Searching for \"%s\"", name); > >> - > >> - if ((pathname = search_library_path(name, ld_library_path)) != > >> NULL || > >> - (refobj != NULL && > >> - (pathname = search_library_path(name, refobj->rpath)) != NULL) > >> || > >> - (pathname = search_library_path(name, gethints())) != NULL || > >> - (pathname = search_library_path(name, > >> STANDARD_LIBRARY_PATH)) != NULL) > >> - return pathname; > >> - > >> if(refobj != NULL && refobj->path != NULL) { > >> _rtld_error("Shared object \"%s\" not found, required by > >> \"%s\"", name, basename(refobj->path)); > >> ___ > > > > > > 1. There is a race using statfs and then loading the file. > I will look at this point. Maybe statfs on the dirname ? > > > 2. We already have the check in do_load_object > It doesn't work with dlopen. > > mount |grep tank/t > tank/t on /tank/t (zfs, local, noexec, nfsv4acls) > > so /tank/t is noexec > > Here the powerful libmoo source code : > > void say_moo() { >printf("mo\n"); > } > > it's in /tank/t so noexec > > ls -l /tank/t/ > total 6 > -rwxr-xr-x 1 joris joris 4632 Dec 4 13:52 libmoo.so > > 1) First test with : > > main() { >say_moo(); > } > > LD_LIBRARY_PATH=/tank/t ./test_moo > /libexec/ld-elf.so.1: Cannot execute objects on /tank/t > > Ok cool work has expected. > > Second test with : > > main() { >void * handle = dlopen("/tank/t/libmoo.so", RTLD_LAZY); >if (! handle) { >fprintf(stderr, "%s\n", dlerror()); >exit(1); >} >void (* moo) (void) = dlsym (handle, "say_moo"); >(* moo)(); >dlclose (han
Re: "ps -e" without procfs(5)
On Wed, Nov 09, 2011 at 02:44:55PM +0200, Kostik Belousov wrote: > On Tue, Nov 08, 2011 at 11:47:54PM +0200, Mikolaj Golub wrote: > > > > On Sun, 6 Nov 2011 20:10:41 +0200 Kostik Belousov wrote: > > > > KB> On Sat, Nov 05, 2011 at 10:37:46PM +0200, Mikolaj Golub wrote: > > >> > > >> http://people.freebsd.org/~trociny/env.sys.3.patch > > > > KB> Oops, I missed this in the previous review. You cannot use fubyte in > > KB> proc_read_mem(). fubyte reads a byte from the address space of the > > current > > KB> process. The fix is easy, use proc_rwmem for 1 byte. > > > > KB> I do not think that fall back to single byte read is warranted for > > KB> proc_read_mem calls e.g. for ps_strings. Add a flag to indicate whether > > KB> the proc_read_mem should fall back to byte read ? > > > > KB> I would prefer using sizeof(uint64_t) and sizeof(uint32_t) instead of 8 > > KB> and 4 constants in the align checks. > > > > KB> Might be, add PROC_ASSERT_HELD() to get_ps_string() ? > > > > KB> procfs patch looks good. > > > > Thanks. The updated version: > > > > http://people.freebsd.org/~trociny/env.sys.4.patch > > > > Investigating cases when EFAULT was returned and if the fallback was > > successful I noticed that most of the cases were when p->p_comm changed > > during > > the read, so the process was in exec in that time. In order to avoid this > > error I added a check for P_INEXEC flag. > And now you return success and nothing gets copied out for the process > in P_INEXEC state. Either you should return an error like EAGAIN, or > consider the P_INEXEC state as transitional and wait till process > leaves it. Or, ignore the state as it was before, and return whatever > error proc_rwmem generated (my preference). Forgot to say that the check does not change much because you drop process lock immediately after the check, so the process may enter the INEXEC state right after the check. I believe you already tried to do this with P_WEXIT. > > > > > After this I observed EFAULT (very rarely) only when reading arg or env > > strings and fallback was successful for those cases. So I modified the patch > > to do fallback only when reading strings (as it was in one of my earlier > > versions but with wrong fubyte), and returned your comment which explains > > why > > it may happen :-) > > > > Also in the procfs patch I have added the check for process state. > > > > The userland part has not been changed since my first report: > > > > http://people.freebsd.org/~trociny/env.user.patch > > > > -- > > Mikolaj Golub pgpArht5hTAqy.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Tue, Nov 08, 2011 at 11:47:54PM +0200, Mikolaj Golub wrote: > > On Sun, 6 Nov 2011 20:10:41 +0200 Kostik Belousov wrote: > > KB> On Sat, Nov 05, 2011 at 10:37:46PM +0200, Mikolaj Golub wrote: > >> > >> http://people.freebsd.org/~trociny/env.sys.3.patch > > KB> Oops, I missed this in the previous review. You cannot use fubyte in > KB> proc_read_mem(). fubyte reads a byte from the address space of the > current > KB> process. The fix is easy, use proc_rwmem for 1 byte. > > KB> I do not think that fall back to single byte read is warranted for > KB> proc_read_mem calls e.g. for ps_strings. Add a flag to indicate whether > KB> the proc_read_mem should fall back to byte read ? > > KB> I would prefer using sizeof(uint64_t) and sizeof(uint32_t) instead of 8 > KB> and 4 constants in the align checks. > > KB> Might be, add PROC_ASSERT_HELD() to get_ps_string() ? > > KB> procfs patch looks good. > > Thanks. The updated version: > > http://people.freebsd.org/~trociny/env.sys.4.patch > > Investigating cases when EFAULT was returned and if the fallback was > successful I noticed that most of the cases were when p->p_comm changed during > the read, so the process was in exec in that time. In order to avoid this > error I added a check for P_INEXEC flag. And now you return success and nothing gets copied out for the process in P_INEXEC state. Either you should return an error like EAGAIN, or consider the P_INEXEC state as transitional and wait till process leaves it. Or, ignore the state as it was before, and return whatever error proc_rwmem generated (my preference). > > After this I observed EFAULT (very rarely) only when reading arg or env > strings and fallback was successful for those cases. So I modified the patch > to do fallback only when reading strings (as it was in one of my earlier > versions but with wrong fubyte), and returned your comment which explains why > it may happen :-) > > Also in the procfs patch I have added the check for process state. > > The userland part has not been changed since my first report: > > http://people.freebsd.org/~trociny/env.user.patch > > -- > Mikolaj Golub pgpLo6P15pt0f.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Sat, Nov 05, 2011 at 10:37:46PM +0200, Mikolaj Golub wrote: > > On Sat, 5 Nov 2011 21:45:53 +0200 Kostik Belousov wrote: > > KB> On Sat, Nov 05, 2011 at 08:59:21PM +0200, Mikolaj Golub wrote: > >> > >> On Sat, 5 Nov 2011 17:44:43 +0200 Kostik Belousov wrote: > >> > >> >> KB> I think that the aux vector must be naturally aligned. You can > return > >> >> KB> ENOEXEC early if vptr is not aligned. > >> >> > >> >> Not sure I see what you mean. vptr for auxv is calculated just couple > lines > >> >> above, and I check the result here, in the part common for all vector > types. > >> KB> You do not check for the alignment. Am I wrong ? > >> > >> I see now. If natural alignment means "addr % sizeof(aux) == 0" then the > aux > >> vectors are not naturally aligned. After adding this check: > >> > >> if (vptr % sizeof(aux) != 0) > >> return (ENOEXEC); > KB> No, the natural alignment of the structure is the alignment of the most > KB> demanding member. So it is 4 bytes on 32bit, and 8 bytes on 64. > > >> > >> I started to observe many ENOEXEC errors. Adding printf showed that the > >> vectors are half size aligned. > >> > >> On i386: > >> > >> get_proc_vector(pid = getty[3442], type = 2): vptr (2143284876) % > sizeof(aux) (8) = 4) > >> > >> On amd64: > >> > >> get_proc_vector(pid = getty[2425], type = 2): vptr (140737488346568) % > sizeof(aux) (16) = 8) > >> > >> Looking at exec_copyout_strings() from kern_exec.c, how destp is > calculated, I > >> think they are sizeof(char *) aligned. > >> > >> Do you think it is worth adding the check for sizeof(char *) alignment? > >> > >> if (vptr % (sizeof(char *) != 0) > >> return (ENOEXEC); > KB> I suggest to use #if __ELF_WORD_SIZE == 32 or 64. > > Thanks. The updated patch: > > http://people.freebsd.org/~trociny/env.sys.3.patch Oops, I missed this in the previous review. You cannot use fubyte in proc_read_mem(). fubyte reads a byte from the address space of the current process. The fix is easy, use proc_rwmem for 1 byte. I do not think that fall back to single byte read is warranted for proc_read_mem calls e.g. for ps_strings. Add a flag to indicate whether the proc_read_mem should fall back to byte read ? I would prefer using sizeof(uint64_t) and sizeof(uint32_t) instead of 8 and 4 constants in the align checks. Might be, add PROC_ASSERT_HELD() to get_ps_string() ? procfs patch looks good. pgp9ZRbUSX5B9.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Sat, Nov 05, 2011 at 08:59:21PM +0200, Mikolaj Golub wrote: > > On Sat, 5 Nov 2011 17:44:43 +0200 Kostik Belousov wrote: > > >> KB> I think that the aux vector must be naturally aligned. You can return > >> KB> ENOEXEC early if vptr is not aligned. > >> > >> Not sure I see what you mean. vptr for auxv is calculated just couple > lines > >> above, and I check the result here, in the part common for all vector > types. > KB> You do not check for the alignment. Am I wrong ? > > I see now. If natural alignment means "addr % sizeof(aux) == 0" then the aux > vectors are not naturally aligned. After adding this check: > > if (vptr % sizeof(aux) != 0) > return (ENOEXEC); No, the natural alignment of the structure is the alignment of the most demanding member. So it is 4 bytes on 32bit, and 8 bytes on 64. > > I started to observe many ENOEXEC errors. Adding printf showed that the > vectors are half size aligned. > > On i386: > > get_proc_vector(pid = getty[3442], type = 2): vptr (2143284876) % sizeof(aux) > (8) = 4) > > On amd64: > > get_proc_vector(pid = getty[2425], type = 2): vptr (140737488346568) % > sizeof(aux) (16) = 8) > > Looking at exec_copyout_strings() from kern_exec.c, how destp is calculated, I > think they are sizeof(char *) aligned. > > Do you think it is worth adding the check for sizeof(char *) alignment? > > if (vptr % (sizeof(char *) != 0) > return (ENOEXEC); I suggest to use #if __ELF_WORD_SIZE == 32 or 64. pgp0LGCE3rUZi.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Sat, Nov 05, 2011 at 05:40:22PM +0200, Mikolaj Golub wrote: > > On Sat, 5 Nov 2011 15:58:01 +0200 Kostik Belousov wrote: > > KB> +if (error == EFAULT) { > KB> +for (i = 0; i < len; i++) { > KB> +c = fubyte(sptr + i); > KB> +if (c < 0) > KB> As a purely stylistical issue, compare with -1. > > KB> +return (EFAULT); > KB> +buf[i] = (char)c; > KB> +if (c == '\0') > KB> +break; > KB> +} > KB> +error = 0; > KB> +} > KB> +return error; > KB> Put () around error. > > Thanks. > > KB> +/* > KB> + * Check that that the address is in user space. > KB> + */ > KB> +if (vptr + 1 < VM_MIN_ADDRESS + 1 || vptr >= VM_MAXUSER_ADDRESS) > KB> +return (ENOEXEC); > KB> Why is this needed ? > > I saw this check in libkvm for ps_argvstr and ps_envstr and decided to add it > too. Just some additional check that ps_string fields, which can be > overwritten by the user, look reasonable. If you think this is not very useful > I remove it. The proc_rwmem() must handle access outside the user VA (and it does). > > KB> I think that the aux vector must be naturally aligned. You can return > KB> ENOEXEC early if vptr is not aligned. > > Not sure I see what you mean. vptr for auxv is calculated just couple lines > above, and I check the result here, in the part common for all vector types. You do not check for the alignment. Am I wrong ? > > BTW, investigating the cases when I got > > procstat: sysctl: kern.proc.args: 58002: 8: Exec format error > > they were because the PROC_VECTOR_MAX limit (512 entries, as it is in > linprocfs and libkvm) is small for real world cases: > > get_proc_vector(pid = rm[47883], type = 0): vsize (3009) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[47883], type = 0): vsize (3009) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[47890], type = 0): vsize (3008) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[47890], type = 0): vsize (3008) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[47897], type = 0): vsize (4511) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[47897], type = 0): vsize (4511) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[47897], type = 0): vsize (4511) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[48044], type = 0): vsize (611) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[52189], type = 0): vsize (772) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[52192], type = 0): vsize (1157) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[55685], type = 0): vsize (1041) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[55687], type = 0): vsize (1040) > PROC_VECTOR_MAX > (512)) > get_proc_vector(pid = rm[55690], type = 0): vsize (1559) > PROC_VECTOR_MAX > (512)) > > So I am going to change it to ARG_MAX and use independent limit (256 entries) > for auxv. Ok. > > KB> Why the blank after the loop statement in get_ps_strings() ? > > Sorry, what blank you mean? I don't see it in get_ps_strings(). May be you > mean the blank line in get_proc_vector() before return? + for (sptr = proc_vector[i]; ; sptr += GET_PS_STRINGS_CHUNK_SZ) { + + error = proc_read_mem(td, p, (vm_offset_t)sptr, The line between for() and error = . > > KB> There shall be blank lines after the '{' in proc_getargv() and > proc_getenvv(). > > Ah, sure. > > KB> Note that using cached pargs is somewhat inconsistent with the digging > KB> into ps_strings. > > KB> procfs_doproccmdline() can benefit from your work. > > Thanks, I will look at it :-). > > -- > Mikolaj Golub pgpD3ZL2Z2fRc.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Wed, Nov 02, 2011 at 11:27:37PM +0200, Mikolaj Golub wrote: > > On Mon, 31 Oct 2011 11:49:48 +0200 Kostik Belousov wrote: > > KB> I think it is better to use sys/elf.h over the machine/elf.h. > > KB> Please change the comment for PROC_AUXV_MAX to "Safety limit on > KB> auxv size". Also, it worth adding a comment saying that we are > KB> reading aux vectors twice, first to get a size, second time to > KB> fetch a content, for simplicity. > > KB> When reading aux vector, if the PROC_AUXV_MAX entries are > KB> iterated over, and we still not reached AT_NULL, the return error > KB> is 0. Was it intended ? > > KB> For PROC_ARG and PROC_ENV, you blindly trust the read values of > KB> the arg and env vector sizes. This can easily cause kernel panics > KB> due to unability to malloc the requested memory. I recommend to > KB> put some clump, and twice of (PATH_MAX + ARG_MAX) is probably > KB> enough (see kern_exec.c, in particular, exec_alloc_args). Also, > KB> you might use the swappable memory for the strings as well, in > KB> the style of exec_alloc_args(). > > KB> I suspect this is my bug: Reading the GET_PS_STRINGS_CHUNK_SZ may > KB> validly return EFAULT if the string is shorter than the chunk > KB> and aligned at the end of the page, assuming the next page is > KB> not mapped. There should be a fallback to fubyte() read loop. I > KB> remember that copyinstr() was unsuitable. > > KB> The checks for P_WEXIT in the linprocfs routines look strange. > KB> Since you are unlocking the process right after the check, it > KB> does not make sense. In fact, the checks are not needed, I > KB> believe, since pseudofs already did the hold (see e.g. pfs_read > KB> and pfs_visible). > > Here is an updated version of the patch. Also available at > > http://people.freebsd.org/~trociny/env.sys.1.patch > > I decided to use the same constant (PROC_VECTOR_MAX) for limiting both > the number of arg or env strings and the numbex of aux vectors. > > Also I decided not to play with exec_alloc_args :-). + if (error == EFAULT) { + for (i = 0; i < len; i++) { + c = fubyte(sptr + i); + if (c < 0) As a purely stylistical issue, compare with -1. + return (EFAULT); + buf[i] = (char)c; + if (c == '\0') + break; + } + error = 0; + } + return error; Put () around error. + /* +* Check that that the address is in user space. +*/ + if (vptr + 1 < VM_MIN_ADDRESS + 1 || vptr >= VM_MAXUSER_ADDRESS) + return (ENOEXEC); Why is this needed ? I think that the aux vector must be naturally aligned. You can return ENOEXEC early if vptr is not aligned. Why the blank after the loop statement in get_ps_strings() ? There shall be blank lines after the '{' in proc_getargv() and proc_getenvv(). Note that using cached pargs is somewhat inconsistent with the digging into ps_strings. procfs_doproccmdline() can benefit from your work. pgpBc5VnV0yxI.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Tue, Nov 01, 2011 at 09:07:11AM +0200, Mikolaj Golub wrote: > > On Mon, 31 Oct 2011 11:49:48 +0200 Kostik Belousov wrote: > > KB> For PROC_ARG and PROC_ENV, you blindly trust the read values of the arg > and > KB> env vector sizes. This can easily cause kernel panics due to unability to > KB> malloc the requested memory. I recommend to put some clump, and twice > KB> of (PATH_MAX + ARG_MAX) is probably enough (see kern_exec.c, in > particular, > KB> exec_alloc_args). Also, you might use the swappable memory for the > strings > KB> as well, in the style of exec_alloc_args(). > > After looking at it more closely, I am not sure if I need to use > exec_alloc_args. I malloc explicitly only for array vector (proc_vector). And > actually it should be much smaller than 2 * (PATH_MAX + ARG_MAX). Currently in > linprocfs the limit is 512 entries: > > #define MAX_ARGV_STR512 /* Max number of argv-like strings */ > > The same limit is in libkvm: > > /* > * Check that there aren't an unreasonable number of arguments, > * and that the address is in user space. Special test for > * VM_MIN_ADDRESS as it evaluates to zero, but is not a simple zero > * constant for some archs. We cannot use the pre-processor here and > * for some archs the compiler would trigger a signedness warning. > */ > if (narg > 512 || addr + 1 < VM_MIN_ADDRESS + 1 || addr >= > VM_MAXUSER_ADDRESS) > return (0); > > (BTW, may be the VM_MIN_ADDRESS - VM_MAXUSER_ADDRESS is worth adding in my > code too?) > > So it looks like I should use the same limit (512 * sizeof(char *)) for the > allocated array. I could use exec_alloc_args() for the allocation but it would > reqire some changes: I would have to free using kmem_free_wakeup(), which > requires size of the region, while I return the number of entries. So I'd > rather not use exec_alloc_args() for vector allocation because the benefit is > not significant here. > > For strings I use sbuf and set it up using sbuf_new_for_sysctl. I could set it > up manually as SBUF_FIXEDLEN allocating buf (up to 2 * (PATH_MAX + ARG_MAX)) > with exec_alloc_args() but this would complicate things a little. Do you think > it is worth doing? I mean using the swappable memory for strings, i.e. for the data you are currently store in sbuf. It indeed may be tricky, it was only an idea. pgp6emnWdCNQf.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Mon, Oct 31, 2011 at 12:54:42PM +0200, Mikolaj Golub wrote: > > On Mon, 31 Oct 2011 11:49:48 +0200 Kostik Belousov wrote: > > KB> On Sat, Oct 29, 2011 at 01:32:39PM +0300, Mikolaj Golub wrote: > >> > >> What do you think about the attached patch? This is a kernel > >> part. COMPAT_FREEBSD32 has not been tested after the last update (just > checked > >> that it compiles): it looks I will not have access to amd64 box for > testing > >> during the weekend. I will test it after the weekend. > >> > >> Both kernel and userland parts are available here: > >> > >> http://people.freebsd.org/~trociny/env.sys.patch > >> http://people.freebsd.org/~trociny/env.user.patch > >> > >> Currently there is an issue with procstat -x: if one tried to run it on > 64 bit > >> for a 32 bit process it would not detect this so would output a garbage. > Could > >> somebody recommend a way how to get this info about a process from > userlend? > > KB> I think it is better to use sys/elf.h over the machine/elf.h. > > KB> Please change the comment for PROC_AUXV_MAX to "Safety limit on auxv > size". > KB> Also, it worth adding a comment saying that we are reading aux vectors > twice, > KB> first to get a size, second time to fetch a content, for simplicity. > > KB> When reading aux vector, if the PROC_AUXV_MAX entries are iterated over, > KB> and we still not reached AT_NULL, the return error is 0. Was it intended > ? > > According to kern_exec.c it is possible that a process doesn't have auxv at > all. I don't know a way how to detect this. So because PROC_AUXV_MAX is much > larger than expected amount of aux entries and we have not reached AT_NULL it > is most likely the process doesn't have auxv and 0 length array (without > error) is returned. > > If you think I should return a error in this situation, I can add this. Please > tell me the error code I should return :-). > > Also, may be there is a sane way to check on auxv existence? Hm. Even if the process ABI mandates the existence of auxv (i.e. the executing binary is ELF), the process could still cleared or mangled the corresponding stack page. I propose to designate some specific error code for the case of unparseable auxv region, ENOEXEC is not worse then any other. Also, please add the explanation you provided as a comment. pgpkaWejd97dy.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Sat, Oct 29, 2011 at 01:32:39PM +0300, Mikolaj Golub wrote: > > On Tue, 25 Oct 2011 11:24:51 +0300 Kostik Belousov wrote: > > KB> On Tue, Oct 25, 2011 at 12:13:10AM +0300, Mikolaj Golub wrote: > >> > >> On Sun, 16 Oct 2011 20:10:05 +0300 Kostik Belousov wrote: > >> > >> KB> In my opinion, the way to implement the feature is to (re)use > >> KB> linprocfs_doargv() and provide another kern.proc sysctl to retrieve > the > >> KB> argv and env vectors. Then, ps(1) and procstat(1) can use it, as > well as > >> KB> procfs and linprocfs inside the kernel. > >> > >> Thanks! I am testing a patch (without auxv vector so far) and have some > >> questions. > >> > >> Original ps -e returns environment only for user owned processes (the > access is > >> restricted by the permissions of /proc/pid/mem file). My kern.proc.env > sysctl > >> does not have such a restriction. I suppose I should add it? What > function I > >> could use for this? > >> > >> BTW, linprocfs allows to read other user's environment. > KB> linprocfs uses p_cansee() to check the permissions. There are sysctls > KB> security.bsd.see_other_{ug}ids that control the behaviour. > > KB> I believe that the new sysctl shall use the same check. > > >> > >> KB> While you are at the code, it would be useful to also export the > auxv vector, > >> KB> which is immediately before env. > >> > >> It looks I can find the location of auxv but what about the size? Or do > you > >> propose to extend struct ps_strings to store location and size of auxv? I > >> could do this way... > > KB> No, extending ps_strings is not needed and it is too radical change. > KB> The auxv vector must end by the AT_NULL aux entry. You can also > artificially > KB> limit the amount of read aux vectors to, say, 256, which is much more > then > KB> it is currently defined. > > What do you think about the attached patch? This is a kernel > part. COMPAT_FREEBSD32 has not been tested after the last update (just checked > that it compiles): it looks I will not have access to amd64 box for testing > during the weekend. I will test it after the weekend. > > Both kernel and userland parts are available here: > > http://people.freebsd.org/~trociny/env.sys.patch > http://people.freebsd.org/~trociny/env.user.patch > > Currently there is an issue with procstat -x: if one tried to run it on 64 bit > for a 32 bit process it would not detect this so would output a garbage. Could > somebody recommend a way how to get this info about a process from userlend? I think it is better to use sys/elf.h over the machine/elf.h. Please change the comment for PROC_AUXV_MAX to "Safety limit on auxv size". Also, it worth adding a comment saying that we are reading aux vectors twice, first to get a size, second time to fetch a content, for simplicity. When reading aux vector, if the PROC_AUXV_MAX entries are iterated over, and we still not reached AT_NULL, the return error is 0. Was it intended ? For PROC_ARG and PROC_ENV, you blindly trust the read values of the arg and env vector sizes. This can easily cause kernel panics due to unability to malloc the requested memory. I recommend to put some clump, and twice of (PATH_MAX + ARG_MAX) is probably enough (see kern_exec.c, in particular, exec_alloc_args). Also, you might use the swappable memory for the strings as well, in the style of exec_alloc_args(). I suspect this is my bug: Reading the GET_PS_STRINGS_CHUNK_SZ may validly return EFAULT if the string is shorter than the chunk and aligned at the end of the page, assuming the next page is not mapped. There should be a fallback to fubyte() read loop. I remember that copyinstr() was unsuitable. The checks for P_WEXIT in the linprocfs routines look strange. Since you are unlocking the process right after the check, it does not make sense. In fact, the checks are not needed, I believe, since pseudofs already did the hold (see e.g. pfs_read and pfs_visible). pgpyO8CZBR4it.pgp Description: PGP signature
Re: In-kernel API for tasks, which could wait?
On Sun, Oct 30, 2011 at 06:54:51PM +0400, Lev Serebryakov wrote: > So, I have question: what should I do if I need to perofrm ONE > action, which could block for some time (for example, open file or > create ALQ)? > > I could create thread for this. But it looks strange and too heavy: create > thread > to call one function once. > > Maybe, kernel has some API to postpone such task to one, > always-running idle thread? See taskqueue(9). On the other hand, waiting for the enqueued task to finish is itself the sleepable action. pgpyJsLtkHgNN.pgp Description: PGP signature
Re: Kernel Space Memory Allocation
On Thu, Oct 27, 2011 at 10:31:43AM +0200, Daniel Grech wrote: > Hi, > > I am allocating memory from a device driver in the kernel and passing it on > to another driver. In the other driver it is neccessary for me to determine > whether the address passed is from user space or kernel space as this driver > can also be called from user space. I am currently using the following check > to determine if the address is in kernel space or userspace : > > if ( ( (char *) VM_MIN_KERNEL_ADDRESS <= address) && (address <= (char *) > VM_MAX_KERNEL_ADDRESS) ) > { > KERNEL SPACE > } > else > { >USER SPACE > } > > However, this does not always work. Sometimes although I allocate memory in > the kernel using the malloc macro, this returns an address which is not in > the above range. Is there any workaround for this problem? > > Thanks in advance for your help. Generally, you cannot determine the ownership of the address by its value. There are architectures that use separate kernel/user address spaces, e.g. Even for i386, there were a patch that switched page tables on the kernel entry, allowing the kernel to use whole 4GB of the address space. The proper way to do what you want is to supply explicit indicator of the address space referenced by the address. Read uio(9), in particular, look at the uio_seg. pgpjGGbiVIyGu.pgp Description: PGP signature
Re: sigprocmask and fork
On Wed, Oct 26, 2011 at 05:02:00PM +0400, Alexandr Matveev wrote: > Hi, > > We are using FreeBSD 8.2 on our servers for high load projects. > When I was preparing system for production I saw strange (as I think) > behavior, > that leads to increased load on servers. > > If I made truss on httpd (apache22) process, I saw too much sigprocmask > syscalls: > > 24822: sigprocmask(SIG_BLOCK,0x0,0x0)= 0 (0x0) > 24822: sigprocmask(SIG_BLOCK,0x0,0x0)= 0 (0x0) > 24822: sigprocmask(SIG_BLOCK,0x0,0x0)= 0 (0x0) > 24822: sigprocmask(SIG_BLOCK,0x0,0x0)= 0 (0x0) > 24822: sigprocmask(SIG_BLOCK,0x0,0x0)= 0 (0x0) > ... too many lines ... > > and > > 24822: > sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|S > > > IGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0) > 24822: sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0) > 24822: > sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|S > > > IGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0) > 24822: sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0) > 24822: > sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|S > > > IGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) = 0 (0x0) > 24822: sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0) > ... too many lines ... > > > but apache, and modules loaded from it do not call this directly. > I was trying to use DTRACE for getting information about syscalls, and I > got same result. > > I wrote a tiny sample > > code: > $ cat sigproc_test.c > #include > > main() > { > fork(); > } > I ran it on FreeBSD with different compilers: > $ cc sigproc_test.c -o sigproc_test > $ truss ./sigproc_test 2>&1 | grep sigprocmask | wc -l >8 > $ g++ sigproc_test.c -o sigproc_test > $ truss ./sigproc_test 2>&1 | grep sigprocmask | wc -l > 20 > > Is it normal to make so many sigprocmask syscalls for such simple program? > For example, there is no sigprocmask syscalls when I run it on Debian > Linux. Yes, it is normal. I do not quite understand the relation between "simple program" and "so many sigprocmask syscalls" claim, but alas. The calls to sigprocmask originate from the rtld, which needs to guarantee async-signal safety of the lazy binding process. The rtld locks, besides other things, block signals. Also, due to an additional requirement that rtld is functional after the fork, it has to acquire the internal locks around fork in multithreaded programs. All lock acquisions in the program which does only one fork(2) call in main(), as well as in the program that does nothing in main at all, come from the rtld initialization before main, and shared library finalizations after. For empty main(), there are 4 locks acquisions after return from main, so you get total 12 for forked version. In essence, the trivial program makes the same amount of locking in the startup/exit path, as non-trivial (the cost is proportional to the number of shared libraries loaded). The typical cause for the rtld locks acquisitions during the program execution, besides the dlopen(3) activity, is the lazy resolution of the PLT entries. LD_BIND_NOW=1 moves the load to the program startup. > > Another sample. Here we have sigprocmask syscalls on Linux too, but > FreeBSD makes this syscall significantly more often: > $ cat test.c > #include > > main() > { > int i; > sleep(2); > for (i = 0; i<3; i++) { > int pid = fork(); > if (!pid) { > sleep(0.5); > return 0; > } > sleep(2); > } > } > > FreeBSD 8.2: > # truss -f ./test > ... SKIPPED ... > 48666: > sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) > > = 0 (0x0) > 48666: sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0) > 48666: __sysctl(0xbfbfe6a4,0x2,0x28192700,0xbfbfe6ac,0x0,0x0) = 0 (0x0) > 48666: > sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) > > = 0 (0x0) > 48666: sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0) > 48666: nanosleep({2.0 }) = 0 (0x0) > 48666: fork() = 48667 (0xbe1b) > 48667: sigprocmask(SIG_SETMASK,0x0,0x0) = 0 (0x0) > 48667: > sigprocmask(SIG_BLOCK,SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2,0x0) > > = 0 (0x0) > 48667: sigprocmask(SIG_
Re: "ps -e" without procfs(5)
On Tue, Oct 25, 2011 at 12:13:10AM +0300, Mikolaj Golub wrote: > > On Sun, 16 Oct 2011 20:10:05 +0300 Kostik Belousov wrote: > > KB> In my opinion, the way to implement the feature is to (re)use > KB> linprocfs_doargv() and provide another kern.proc sysctl to retrieve the > KB> argv and env vectors. Then, ps(1) and procstat(1) can use it, as well as > KB> procfs and linprocfs inside the kernel. > > Thanks! I am testing a patch (without auxv vector so far) and have some > questions. > > Original ps -e returns environment only for user owned processes (the access > is > restricted by the permissions of /proc/pid/mem file). My kern.proc.env sysctl > does not have such a restriction. I suppose I should add it? What function I > could use for this? > > BTW, linprocfs allows to read other user's environment. linprocfs uses p_cansee() to check the permissions. There are sysctls security.bsd.see_other_{ug}ids that control the behaviour. I believe that the new sysctl shall use the same check. > > KB> While you are at the code, it would be useful to also export the auxv > vector, > KB> which is immediately before env. > > It looks I can find the location of auxv but what about the size? Or do you > propose to extend struct ps_strings to store location and size of auxv? I > could do this way... No, extending ps_strings is not needed and it is too radical change. The auxv vector must end by the AT_NULL aux entry. You can also artificially limit the amount of read aux vectors to, say, 256, which is much more then it is currently defined. pgpHLn9etVmXO.pgp Description: PGP signature
Re: "ps -e" without procfs(5)
On Sun, Oct 16, 2011 at 07:57:57PM +0300, Mikolaj Golub wrote: > Hi, > > I have a patch that makes kvm_uread() read from user space using ptrace(2). > > http://people.freebsd.org/~trociny/kvm_uread.ptrace.patch > > With this change 'ps -e' does not requires procfs(5). > > Do you like it or there might be some reasons why it is a bad idea? > > Grepping sources it looks like currently only ps uses kvm_getenvv(3) (and thus > kvm_uread()). > > Note, when reading from its own user space it just does bcopy(3), so if a > wrong address range is passed to kvm_uread() the program will segfault. Do we > need some protection here and what? Masking SIGSEGV? Ptracing a random process may have a disastrous consequences for the traced process, caller or system. For the process, the ptrace(2) can cause spurious signal delivery or EINTR returns. If the process you attached to is critical for the system operation, you can get hung or failed system. In my opinion, the way to implement the feature is to (re)use linprocfs_doargv() and provide another kern.proc sysctl to retrieve the argv and env vectors. Then, ps(1) and procstat(1) can use it, as well as procfs and linprocfs inside the kernel. While you are at the code, it would be useful to also export the auxv vector, which is immediately before env. pgpn6KWHsiqIL.pgp Description: PGP signature
Re: Power usage on FreeBSD
On Sat, Oct 08, 2011 at 01:44:37AM +0200, Harald Servat wrote: > Dear hackers, > > does anyone know if I can measure power consumption of the processor via > software by using FreeBSD on a Intel Core i5? Any pointer on this, or other > architectures, will be welcome. > > Thank you very much in advance! Measuring the power consumption by CPU would probably require you to get some NDA documentation from Intel. On the Core-class CPUs there are definitely the facilities related to the Turbo Mode, that show inner working of the power management microcontroller. See the Linux IPS driver, which controls both CPU and Northbridge/GPU. I was unable to find a public documentation about it. Some information is present in the CPU/PCH datasheets. On the other hand, if you are only concerned about whole system power consumption, use a laptop or a managed UPS. pgpPjcGY66oPy.pgp Description: PGP signature
Re: mmap performance and memory use
On Thu, Oct 06, 2011 at 04:41:45PM +0200, Wojciech Puchar wrote: > i have few questions. > > 1) suppose i map 1TB of address space as anonymous and touch just one > page. how much memory is used to manage this? I am not sure how deep the enumeration you want to know, but the first approximation will be: one struct vm_map_entry one struct vm_object one pv_entry Page table structures need four pages for directories and page table proper. > > 2) suppose we have 1TB file on disk without holes and 10 processes > mmaps this file to it's address space. are just pages shared or can > pagetables be shared too? how much memory is used to manage such > situation? Only pages are shared. Pagetables are not. For one thing, this indeed causes more memory use for the OS. This is somewhat mitigated by automatic use of superpages. Superpage promotion still keeps the 4KB page table around, so most savings from the superpages are due to more efficient use of TLB. On the other hand, having non-shared page tables allows for much more accurate tracking of the accesses and writes, which can result in better pageout performance. For the situation 1TB/10 processes, you will probably need to tune the amount of pv entries, see sysctl vm.pmap.pv*. > > yes this is a real question - assume most of these processes are mostly > sleeping but every now and then do something and work of some set of > pages from this file and there is enough memory in computer to keep this > working set, but only if managing it by OS will not overuse memory. > ___ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" pgpokrQUSBSl6.pgp Description: PGP signature
Re: Memory allocation in kernel -- what to use in which situation? What is the best for page-sized allocations?
On Sun, Oct 02, 2011 at 04:21:09PM +0400, Lev Serebryakov wrote: > Hello, Freebsd-hackers. > > Here are several memory-allocation mechanisms in the kernel. The two > I'm aware of is MALLOC_DEFINE()/malloc()/free() and uma_* (zone(9)). > > As far as I understand, malloc() is general-purpose, but it has > fixed "transaction cost" (in term of memory consumption) for each > block allocated, and is not very suitable for allocation of many small > blocks, as lots of memory will be wasted for bookkeeping. > > zone(9) allocator, on other hand, have very low cost of each > allocated block, but could allocate only pre-configured fixed-size > blocks, and ideal for allocation tons of small objects (and provide > API for reusing them, too!). > > Am I right? > >But what if I need to allocate a lot (say, 16K-32K) of page-sized > blocks? Not in one chunk, for sure, but in lifetime of my kernel > module. Which allocator should I use? It seems, the best one will be > very low-level only-page-sized allocator. Is here any in kernel? Very short answer is that you could use kmem_alloc() or kmem_malloc(). kmem_alloc() sleeps, while kmem_malloc() may be directed to not sleep. pgpVLS4HXqGcq.pgp Description: PGP signature
Re: Installation of kernel symbols file in a separate directory [Was: Re: Experiences with FreeBSD 9.0-BETA2]
On Wed, Sep 28, 2011 at 10:30:06AM -0400, Ryan Stone wrote: > You might be interested in this PR: > > http://www.freebsd.org/cgi/query-pr.cgi?pr=153157&cat= > > This does the same thing for userland .symbols files, and teaches gdb > how to find them. The patch is not committable as is. In particular, the patch hardcodes the /usr/src as the source location. I think the right thing to do is to install the stripbin.sh into usr/bin, might be under the name like freebsd_stripbin. Also, I think the script should use the full pathes to the utilities, but I am not sure about this part, esp. in regard the cross-build. pgpMs4Yvgh09R.pgp Description: PGP signature
Re: Thread-local storage issue
On Thu, Sep 15, 2011 at 10:50:47AM +0800, Thinker K.F. Li wrote: > Hi Guys, > > I was in trouble for an issue of TLS implementation of FreeBSD. It is > an issue of ld-elf.so actually. If I have a thread-local variable in > program, the value of the variable is not consistent after an > dlopen(). For example, > > __thread int var = 50; > > void modify() { > var = 100; > } > > void show() { > printf("%d\n", var); > } > > int main(int argc, char * const *argv) { > dlopen(...); > modify(); > show(); > } > > If it is compiled with -fpic, it would print "50" while "100" is > expected. (-fpic is required for shared objects) > > I have send-pr a patch as > > http://www.freebsd.org/cgi/query-pr.cgi?pr=160721 > > I need someone to review it. I already sent the request to re@ to commit the change. Thanks for the submission. pgpfPBkkLAdt0.pgp Description: PGP signature
Re: Re: Kernel timers infrastructure
On Mon, Sep 12, 2011 at 11:48:42AM +0200, "Marc L?rner" wrote: > Hello, > what about changing order of callout_reset and uprintf? > And your timeout isn't 1minute, it's one second! > > Regards, > Marc > > >I already did that to ensure timer_event_handler would be called correctly. > > > >The result follows: > > > >freebsd# kldload ./timer.ko > >timer_event_handler() with MOD_LOAD > > > >freebsd# kldunload ./timer.ko > >timer_event_handler() with MOD_UNLOAD > > > >and I maintained the module load for about 1 minute so the timer printing > >>"Hello, World!" should have been run. > > > >Filippo > > > >On 12/set/2011, at 11:24, Adrian Chadd wrote: > > > >> How about adding some printfs() to the functions to ensure they're being > >> called? > >> The callouts are executed in the context that does not have the controlling terminal. uprintf(9) tries to use the ctty for output. Use printf(9) to get something on console. pgpMi02zI4Bux.pgp Description: PGP signature
Re: Where to ask about a 7.2 bug, and debugging sys/queue.h errors
On Thu, Aug 25, 2011 at 05:12:09PM -0500, Brandon Gooch wrote: > On Thu, Aug 25, 2011 at 4:53 PM, Kostik Belousov wrote: > > On Thu, Aug 25, 2011 at 03:16:09PM -0600, Charlie Martin wrote: > >> We're having a crash in some internal code running on FreeBSD 7.2 > >> (specifically 7.2-PRERELEASE FreeBSD 7.2-PRERELEASE and yeah, I know > >> it's quite a bit behind) in which after 18-30 hours of running load > >> tests, the code panics with: > >> > >> panic: Bad link elm 0xff0044c09600 next->prev != elm > >> cpuid = 0 > >> KDB: stack backtrace: > >> db_trace_self_wrapper() at 0x8019119a = db_trace_self_wrapper+0x2a > >> panic() at 0x80307c72 = panic+0x182 > >> devfs_populate_loop() at 0x802a43a8 = devfs_populate_loop+0x548 > >> > >> > >> First question: where's the most appropriate place to ask about this > >> kind of bug on a back version. > > It is fine to ask there. > > > >> > >> Second: does this remind anyone of any bugs? Googling came up with a > >> few somewhat similar things but hasn't provided much insight so far. > > In 99% of the cases, it means that you forgot to dev_ref() some cdev. > > So dev_ref increments the reference count for a cdev. Even though the > work "loop" seems to indicate that we will iterate over a list of > objects (one of which we may be missing a reference to via a missing > dev_ref()), I'm not seeing how this can cause a panic from inside > devfs_populate_loop(). > > Can you help me understand this? > Missing dev_ref() means that the memory for the cdev (and cdev_priv) is freed prematurely. If this happens before destroy_dev() is called, then the list which is iterated over by populate_loop(), is corrupted. See e.g. MAKEDEV_REF flag for make_dev(9) and its use in the (old) clone handlers. pgpWJlf9huRNl.pgp Description: PGP signature
Re: Where to ask about a 7.2 bug, and debugging sys/queue.h errors
On Thu, Aug 25, 2011 at 03:16:09PM -0600, Charlie Martin wrote: > We're having a crash in some internal code running on FreeBSD 7.2 > (specifically 7.2-PRERELEASE FreeBSD 7.2-PRERELEASE and yeah, I know > it's quite a bit behind) in which after 18-30 hours of running load > tests, the code panics with: > > panic: Bad link elm 0xff0044c09600 next->prev != elm > cpuid = 0 > KDB: stack backtrace: > db_trace_self_wrapper() at 0x8019119a = db_trace_self_wrapper+0x2a > panic() at 0x80307c72 = panic+0x182 > devfs_populate_loop() at 0x802a43a8 = devfs_populate_loop+0x548 > > > First question: where's the most appropriate place to ask about this > kind of bug on a back version. It is fine to ask there. > > Second: does this remind anyone of any bugs? Googling came up with a > few somewhat similar things but hasn't provided much insight so far. In 99% of the cases, it means that you forgot to dev_ref() some cdev. pgpG6DLjTvGZh.pgp Description: PGP signature
Re: debugging frequent kernel panics on 8.2-RELEASE
On Wed, Aug 17, 2011 at 11:21:42PM +0300, Andriy Gapon wrote: [skip] > But I also would like to use this opportunity to discuss how we can > make it easier to debug such issue as this. I think that this problem > demonstrates that when we treat certain junk in kernel address value > as a userland address value, we throw additional heaps of irrelevant > stuff on top of an actual problem. One solution could be to use a > special flag that would mark all actual attempts to access userland > address (e.g. setting the flag on entrance to copyin and clearing it > upon return), so that in the page fault handler we could distinguish > actual faults on userland addresses from faults on garbage kernel > addresses. I am sure that there could be other clever techniques to > catch such garbage addresses early. We already have such mechanism, the kernel code aware of the usermode page access sets pcb_onfault. See the end of trap_pfault() handler. In fact, we can catch it earlier, before even calling vm_fault(). BTW, I think this is esp. useful in the combination with the support for the SMEP in recent Intel CPUs. commit 2e1b36fa93f9499e37acf04a66ff0646d4f13536 Author: Konstantin Belousov Date: Thu Aug 18 00:08:50 2011 +0300 Assert that the exiting process does not return to usermode. On x86, do not call vm_fault() when the kernel is not prepared to handle unsuccessful page fault. diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c index 4e5f8b8..55e1e5a 100644 --- a/sys/amd64/amd64/trap.c +++ b/sys/amd64/amd64/trap.c @@ -674,6 +674,19 @@ trap_pfault(frame, usermode) goto nogo; map = &vm->vm_map; + + /* +* When accessing a usermode address, kernel must be +* ready to accept the page fault, and provide a +* handling routine. Since accessing the address +* without the handler is a bug, do not try to handle +* it normally, and panic immediately. +*/ + if (!usermode && (td->td_intr_nesting_level != 0 || + PCPU_GET(curpcb)->pcb_onfault == NULL)) { + trap_fatal(frame, eva); + return (-1); + } } /* diff --git a/sys/i386/i386/trap.c b/sys/i386/i386/trap.c index 5a8016c..e6d2b5a 100644 --- a/sys/i386/i386/trap.c +++ b/sys/i386/i386/trap.c @@ -831,6 +831,11 @@ trap_pfault(frame, usermode, eva) goto nogo; map = &vm->vm_map; + if (!usermode && (td->td_intr_nesting_level != 0 || + PCPU_GET(curpcb)->pcb_onfault == NULL)) { + trap_fatal(frame, eva); + return (-1); + } } /* diff --git a/sys/kern/subr_trap.c b/sys/kern/subr_trap.c index 3527ed1..a69b7b8 100644 --- a/sys/kern/subr_trap.c +++ b/sys/kern/subr_trap.c @@ -99,6 +99,8 @@ userret(struct thread *td, struct trapframe *frame) CTR3(KTR_SYSC, "userret: thread %p (pid %d, %s)", td, p->p_pid, td->td_name); + KASSERT((p->p_flag & P_WEXIT) == 0, + ("Exiting process returns to usermode")); #if 0 #ifdef DIAGNOSTIC /* Check that we called signotify() enough. */ pgp0chWxIkaWy.pgp Description: PGP signature
Re: cam / ata timeout limited to 2147 due to overflow bug?
On Fri, Aug 05, 2011 at 12:02:19AM +0100, Steven Hartland wrote: > I'm working on adding security methods to camcontrol and have > come up against a strange issue. It seems that the timeout > value for cam, at least on ata (ahci), is limited to less than > 2148 seconds. > > This can be seen by running:- > camcontrol identify ada0 -t 2148 -v > (pass0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 > 00 > (pass0:ahcich0:0:0:0): CAM status: Command timeout > > Also seen in /var/log/messages at this time is:- > Aug 4 23:29:51 cfdev kernel: ahcich0: Timeout on slot 24 > Aug 4 23:29:51 cfdev kernel: ahcich0: is cs 0100 ss > rs 0100 tfd d0 serr > > Dropping the timeout down to 2147 and the command runs fine. > > I've done some digging and it seems like this is implemented via:- > sys/dev/ahci/ahci.c > ahci_execute_transaction(struct ahci_slot *slot) > { > ... >/* Start command execution timeout */ >callout_reset(&slot->timeout, (int)ccb->ccb_h.timeout * hz / 2000, >(timeout_t*)ahci_timeout, slot); > > Now its documented that:- > "Non-positive values of ticks are silently converted to the value 1" > > So I suspect that this is what's happening resulting in an extremely > small timeout instead of a large one. Now I know that passed in value > to the timeout is seconds * 1000 so we should be seeing 2148000 > for ccb->ccb_h.timeout now multiply that by 1000 (hz) and your over > the int wrap point 2147483647. > > So instead of the wrap point being 2147483 seconds (24 days), I suspect > because of the way this is structured its actually 2147 seconds (26mins). > > If this is the case the fix is likely to be something like:- > callout_reset(&slot->timeout, (int)(ccb->ccb_h.timeout * (hz / 2000)), For hz == 1000, hz / 2000 == 0 according to the C rules, so the result will be 0 always. > > Does this sound reasonable? What I don't understand is why the /2000? > > For reference the reason for wanting a large timeout is that a > secure erase of large media could take many hours so I'm using > the erase time reported by the drive for this, in my case here is > 400 minutes. > > Currently this instantly fails with a Command timeout which is > clearly not right. > >Regards >Steve > > > This e.mail is private and confidential between Multiplay (UK) Ltd. and the > person or entity to whom it is addressed. In the event of misdirection, the > recipient is prohibited from using, copying, printing or otherwise > disseminating it or any information contained in it. > In the event of misdirection, illegible or incomplete transmission please > telephone +44 845 868 1337 > or return the E.mail to postmas...@multiplay.co.uk. > > ___ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" pgpMO4v8aH0Re.pgp Description: PGP signature
Re: kern/159281: [PATCH] Linux-like /proc/swaps for linprocfs
On Sun, Jul 31, 2011 at 06:56:51PM +0200, Robert Millan wrote: > 2011/7/30 Kostik Belousov : > > Second, my proposal contains a flaw. Namely, if some swap device was removed > > between calls to swap_info and swap_devname calls, we get mangled list. > > Ok, I see that you fixed this by unifying those functions. Rather, by gathering both pieces of information without dropping the protection of sw_dev_mtx. > > > The third problem, which is not fixed, and which I do not want to handle, > > is that the swap devices list can be changed between calls to > > swap_devname(), > > changing device indexes and thus making the output mangled. > > You say you don't want to handle this, but AFAICT from the patch you > already did. Or did I miss something? Since sw_dev_mtx is dropped and reacquired between each iteration of the loop, some swap device may be removed meantime. At least, the output will show the same device twice. Hm, the swapoff/swapon routines are half-protected by the Giant, taking the Giant around the loop prevents the swap device list changes. > > > Should the swap device name be separated from 'unknown' word by > > space or by tab ? > > Linux' /proc/swaps prints spaces after the first column and tabs for > the rest. I think it's better to do the same just in case, but I > don't think it's relevant either way. Right, I added a comment to not have this broken by somebody in future. > > > I updated your patch, hopefully fixing the issues. Do you have comments > > or objections ? > > No. Thanks for fixing those problems. Below is the hopefully final patch after Bruce Evans' comments incorporated. If nobody speaks, I will send this to re tomorrow. diff --git a/sys/compat/linprocfs/linprocfs.c b/sys/compat/linprocfs/linprocfs.c index 692c5a3..8832d3d 100644 --- a/sys/compat/linprocfs/linprocfs.c +++ b/sys/compat/linprocfs/linprocfs.c @@ -502,6 +502,33 @@ linprocfs_dostat(PFS_FILL_ARGS) return (0); } +static int +linprocfs_doswaps(PFS_FILL_ARGS) +{ + struct xswdev xsw; + uintmax_t total, used; + int n; + char devname[SPECNAMELEN + 1]; + + sbuf_printf(sb, "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"); + mtx_lock(&Giant); + for (n = 0; ; n++) { + if (swap_dev_info(n, &xsw, devname, sizeof(devname)) != 0) + break; + total = (uintmax_t)xsw.xsw_nblks * PAGE_SIZE / 1024; + used = (uintmax_t)xsw.xsw_used * PAGE_SIZE / 1024; + + /* +* The space and not tab after the device name is on +* purpose. Linux does so. +*/ + sbuf_printf(sb, "/dev/%-34s unknown\t\t%jd\t%jd\t-1\n", + devname, total, used); + } + mtx_unlock(&Giant); + return (0); +} + /* * Filler function for proc/uptime */ @@ -1490,6 +1517,8 @@ linprocfs_init(PFS_INIT_ARGS) NULL, NULL, NULL, 0); pfs_create_file(root, "stat", &linprocfs_dostat, NULL, NULL, NULL, PFS_RD); + pfs_create_file(root, "swaps", &linprocfs_doswaps, + NULL, NULL, NULL, PFS_RD); pfs_create_file(root, "uptime", &linprocfs_douptime, NULL, NULL, NULL, PFS_RD); pfs_create_file(root, "version", &linprocfs_doversion, diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c index f421e4f..9a818e7 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -2365,35 +2365,53 @@ swap_pager_status(int *total, int *used) mtx_unlock(&sw_dev_mtx); } -static int -sysctl_vm_swap_info(SYSCTL_HANDLER_ARGS) +int +swap_dev_info(int name, struct xswdev *xs, char *devname, size_t len) { - int *name = (int *)arg1; - int error, n; - struct xswdev xs; struct swdevt *sp; - - if (arg2 != 1) /* name length */ - return (EINVAL); + char *tmp_devname; + int error, n; n = 0; + error = ENOENT; mtx_lock(&sw_dev_mtx); TAILQ_FOREACH(sp, &swtailq, sw_list) { - if (n == *name) { - mtx_unlock(&sw_dev_mtx); - xs.xsw_version = XSWDEV_VERSION; - xs.xsw_dev = sp->sw_dev; - xs.xsw_flags = sp->sw_flags; - xs.xsw_nblks = sp->sw_nblks; - xs.xsw_used = sp->sw_used; - - error = SYSCTL_OUT(req, &xs, sizeof(xs)); - return (error); + if (n != name) { + n++; + continue; + } + xs->xsw_version = XSWDEV_VERSION; + xs->xsw_dev = sp->sw_dev; +
Re: kern/159281: [PATCH] Linux-like /proc/swaps for linprocfs
On Sat, Jul 30, 2011 at 01:06:48AM +0200, Robert Millan wrote: > Hi Kostik, > > 2011/7/29 Kostik Belousov : > > The patch is too hackish, IMHO. > > I would prefer to have an exported kernel function that fills xswdev > > by index, used both by vm_swap_info and linprocfs. > > > > For the device name, you would use sw_vp->v_rdev->si_name, see, for > > instance, the following fragment in the swapoff_all(): > > if (vn_isdisk(sp->sw_vp, NULL)) > > devname = sp->sw_vp->v_rdev->si_name; > > else > > devname = "[file]"; > > This could be another function that returns swap information by index. > > Here's a patch with the changes you requested. > There are several issues in the patch that I see, not counting minor stylistic bugs. First, the sw_dev_mtx should be kept until all the data is read from the xs. The locking there is somewhat vague, the original code was probably correct because sysctl handler is not marked mpsafe, and all functions changing the swtailq hold Giant. Second, my proposal contains a flaw. Namely, if some swap device was removed between calls to swap_info and swap_devname calls, we get mangled list. The third problem, which is not fixed, and which I do not want to handle, is that the swap devices list can be changed between calls to swap_devname(), changing device indexes and thus making the output mangled. Should the swap device name be separated from 'unknown' word by space or by tab ? I updated your patch, hopefully fixing the issues. Do you have comments or objections ? diff --git a/sys/compat/linprocfs/linprocfs.c b/sys/compat/linprocfs/linprocfs.c index 692c5a3..0733913 100644 --- a/sys/compat/linprocfs/linprocfs.c +++ b/sys/compat/linprocfs/linprocfs.c @@ -502,6 +502,30 @@ linprocfs_dostat(PFS_FILL_ARGS) return (0); } +static int +linprocfs_doswaps(PFS_FILL_ARGS) +{ + struct xswdev xsw; + int n; + long long total, used; + char devname[SPECNAMELEN + 1]; + + sbuf_printf(sb, "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"); + + for (n = 0; ; n++) { + if (swap_dev_info(n, &xsw, devname, sizeof(devname)) != 0) + break; + + total = (long long)xsw.xsw_nblks * PAGE_SIZE / 1024; + used = (long long)xsw.xsw_used * PAGE_SIZE / 1024; + + sbuf_printf(sb, "/dev/%-34s unknown\t\t%lld\t%lld\t-1\n", + devname, total, used); + } + + return (0); +} + /* * Filler function for proc/uptime */ @@ -1490,6 +1514,8 @@ linprocfs_init(PFS_INIT_ARGS) NULL, NULL, NULL, 0); pfs_create_file(root, "stat", &linprocfs_dostat, NULL, NULL, NULL, PFS_RD); + pfs_create_file(root, "swaps", &linprocfs_doswaps, + NULL, NULL, NULL, PFS_RD); pfs_create_file(root, "uptime", &linprocfs_douptime, NULL, NULL, NULL, PFS_RD); pfs_create_file(root, "version", &linprocfs_doversion, diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c index f421e4f..5929871 100644 --- a/sys/vm/swap_pager.c +++ b/sys/vm/swap_pager.c @@ -2365,35 +2365,58 @@ swap_pager_status(int *total, int *used) mtx_unlock(&sw_dev_mtx); } -static int -sysctl_vm_swap_info(SYSCTL_HANDLER_ARGS) +int +swap_dev_info(int name, struct xswdev *xs, char *devname, size_t len) { - int *name = (int *)arg1; - int error, n; - struct xswdev xs; struct swdevt *sp; - - if (arg2 != 1) /* name length */ - return (EINVAL); + char *tmp_devname; + int error, n; n = 0; + error = ENOENT; mtx_lock(&sw_dev_mtx); TAILQ_FOREACH(sp, &swtailq, sw_list) { - if (n == *name) { - mtx_unlock(&sw_dev_mtx); - xs.xsw_version = XSWDEV_VERSION; - xs.xsw_dev = sp->sw_dev; - xs.xsw_flags = sp->sw_flags; - xs.xsw_nblks = sp->sw_nblks; - xs.xsw_used = sp->sw_used; - - error = SYSCTL_OUT(req, &xs, sizeof(xs)); - return (error); + if (n != name) { + n++; + continue; } - n++; + + xs->xsw_version = XSWDEV_VERSION; + xs->xsw_dev = sp->sw_dev; + xs->xsw_flags = sp->sw_flags; + xs->xsw_nblks = sp->sw_nblks; + xs->xsw_used = sp->sw_used; + if (devname != NULL) { + if (vn_isdisk(sp->sw_vp, NULL)) { + tmp_devname = +
Re: [PATCH] Linux-like /proc/swaps for linprocfs
On Fri, Jul 29, 2011 at 01:15:59AM +0200, Robert Millan wrote: > Please consider this patch, it implements Linux-like /proc/swaps for > linprocfs. > > E.g. > > $ cat /proc/swaps > FilenameTypeSizeUsed > Priority > /dev/zvol/dimoni/swap unknown 2097152 0 -1 > > -- > Robert Millan The patch is too hackish, IMHO. I would prefer to have an exported kernel function that fills xswdev by index, used both by vm_swap_info and linprocfs. For the device name, you would use sw_vp->v_rdev->si_name, see, for instance, the following fragment in the swapoff_all(): if (vn_isdisk(sp->sw_vp, NULL)) devname = sp->sw_vp->v_rdev->si_name; else devname = "[file]"; This could be another function that returns swap information by index. pgpWfXDOOYNOR.pgp Description: PGP signature
Re: wdrain vs. fuse/ggate (was: Re: mount vdi)
On Tue, Jul 26, 2011 at 11:35:44PM +0200, Juergen Lock wrote: > On Wed, Jul 27, 2011 at 12:22:34AM +0300, Kostik Belousov wrote: > > On Tue, Jul 26, 2011 at 08:03:35PM +0200, Juergen Lock wrote: > > > On Mon, Jul 25, 2011 at 06:31:36PM -0500, Brandon Gooch wrote: > > > > 2011/7/25 Andrey V. Elsukov : > > > > > On 25.07.2011 10:18, Joe Sciulli wrote: > > > > >> Is it possible to mount virtualbox vdi file on the FreeBSD host? > > > > >> This appears to be doable on > > > > >> windows and linux hosts, which basically is done in two steps: 1. > > > > >> find offset in the image. 2. > > > > >> mount the image with that offset. > > > > >> > > > > >> I'm trying to do the same thing on FreeBSD, and found the > > > > >> undocumented and deprecated command > > > > >> still works: > > > > >> > > > > >> VBoxManage internalcommands dumphdinfo freebsd_home.vdi > > > > >> > > > > >> I got the following for the virtual disk image holding the /home (no > > > > >> root hence no MBR) disk for > > > > >> a FreeBSD guest: > > > > >> > > > > >> Header: offBlocks=4096 offData=28672 > > > > >> > > > > >> Then attempt to mount it: > > > > >> > > > > >> mdconfig -a -t vnode -f /tmp/freebsd_home_56.vdi -u 0 mount /dev/md0 > > > > >> /tmp/aaa/ mount -t cd9660 > > > > >> /dev/md0 /tmp/aaa/ > > > > >> > > > > >> unfortunately both the above two mount commands failed with "Invalid > > > > >> argument". I tried > > > > >> skip=28672 to no avail as well. Anything did I do wrong? > > > > > > > > > > I have not any Vbox images with fixed size, but i tried this: > > > > > # mdconfig -f 10G_GPT_UFS.vdi > > > > > # gnop create -v -o 41472 /dev/md0 > > > > > > > > > > where 41472 is offData value. After that md0.nop was tasted and > > > > > reports about invalid GPT. > > > > > So, i think if your image is fixed size disk yout can try this method > > > > > and mount UFS (not cd9660). > > > > > > > > > > -- > > > > > WBR, Andrey V. Elsukov > > > > > > > > There was a CFT sent out a while back about a fuse module for mounting > > > > vdi images: > > > > > > > > http://lists.freebsd.org/pipermail/freebsd-emulation/2010-September/007964.html > > > > > > > > Not sure about the state of this now though... > > > > > > Yeah sorry I never got back to this after getting reports that > > > writes are stuck in wdrain... [1] I guess what happens is the > > > vdfuse process wants to write many(?) more blocks than the number > > > that got queued by the original fuse request and/or too many writes > > > get queued up at once before the vdfuse process gets to handle them > > > (trying to increase runningbufspace even more), and thus there is > > > deadlock. :( So my question for -hackers is, does anyone have a > > > clever idea how to fix this? Should the fuse requests for this be > > > exempted from the wdrain bookkeeping so that only the actual writes > > > by the vdfuse process get counted, and if yes, how would I best > > > accomplish this? > > > > > > Thanx! :) > > > Juergen > > > > Excluding any buffers from runningbufspace is absolutely wrong. > > You will get another deadlock due to excessive pressure on the buffer > > space. > > Alright, so how to fix the deadlocks then? :) In general, you cannot, without rearchitecturing the buffer deadlock avoidance code. md(4) is in similar situation, but there the i/o does not leave kernel. You can look for VV_MD and similar hacks spreaded in sys/. pgpL3ruSXle9e.pgp Description: PGP signature
Re: wdrain vs. fuse/ggate (was: Re: mount vdi)
On Tue, Jul 26, 2011 at 08:03:35PM +0200, Juergen Lock wrote: > On Mon, Jul 25, 2011 at 06:31:36PM -0500, Brandon Gooch wrote: > > 2011/7/25 Andrey V. Elsukov : > > > On 25.07.2011 10:18, Joe Sciulli wrote: > > >> Is it possible to mount virtualbox vdi file on the FreeBSD host? This > > >> appears to be doable on > > >> windows and linux hosts, which basically is done in two steps: 1. find > > >> offset in the image. 2. > > >> mount the image with that offset. > > >> > > >> I'm trying to do the same thing on FreeBSD, and found the undocumented > > >> and deprecated command > > >> still works: > > >> > > >> VBoxManage internalcommands dumphdinfo freebsd_home.vdi > > >> > > >> I got the following for the virtual disk image holding the /home (no > > >> root hence no MBR) disk for > > >> a FreeBSD guest: > > >> > > >> Header: offBlocks=4096 offData=28672 > > >> > > >> Then attempt to mount it: > > >> > > >> mdconfig -a -t vnode -f /tmp/freebsd_home_56.vdi -u 0 mount /dev/md0 > > >> /tmp/aaa/ mount -t cd9660 > > >> /dev/md0 /tmp/aaa/ > > >> > > >> unfortunately both the above two mount commands failed with "Invalid > > >> argument". I tried > > >> skip=28672 to no avail as well. Anything did I do wrong? > > > > > > I have not any Vbox images with fixed size, but i tried this: > > > # mdconfig -f 10G_GPT_UFS.vdi > > > # gnop create -v -o 41472 /dev/md0 > > > > > > where 41472 is offData value. After that md0.nop was tasted and reports > > > about invalid GPT. > > > So, i think if your image is fixed size disk yout can try this method and > > > mount UFS (not cd9660). > > > > > > -- > > > WBR, Andrey V. Elsukov > > > > There was a CFT sent out a while back about a fuse module for mounting > > vdi images: > > > > http://lists.freebsd.org/pipermail/freebsd-emulation/2010-September/007964.html > > > > Not sure about the state of this now though... > > Yeah sorry I never got back to this after getting reports that > writes are stuck in wdrain... [1] I guess what happens is the > vdfuse process wants to write many(?) more blocks than the number > that got queued by the original fuse request and/or too many writes > get queued up at once before the vdfuse process gets to handle them > (trying to increase runningbufspace even more), and thus there is > deadlock. :( So my question for -hackers is, does anyone have a > clever idea how to fix this? Should the fuse requests for this be > exempted from the wdrain bookkeeping so that only the actual writes > by the vdfuse process get counted, and if yes, how would I best > accomplish this? > > Thanx! :) > Juergen Excluding any buffers from runningbufspace is absolutely wrong. You will get another deadlock due to excessive pressure on the buffer space. pgp5nwm9HUjrN.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Tue, Jul 12, 2011 at 11:53:05PM +0200, Petr Salinger wrote: > >>I will bump revision for stable/8 when merging, but I do not see much > >>reason to bump on HEAD right now. > > Many thanks. > > >Uhm I think we can survive without a bump in HEAD. For now we will > >need to keep our backward-compatibility patch anyway, and when the > >bump happens on stable/8 our sysctl comparison will identify 900039 as > >new ABI. > > We cannot survive without a bump in HEAD before 9.0 release, > but we do not need bump immediately. Right, the point is that you can just say 'do not use 9.0-current earlier then r223966', and check only for >= 8025XX, where XX is the bump at MFC. > > So far the condition for new API will be > > if ((osreldate >= 900040) > || ((osreldate >= 803000) && (osreldate < 90))) > > Of course, when MFC will bump revision, the 803000 constant > will change (probably to 802509). > > Petr pgpRQ01iW3WLL.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Tue, Jul 12, 2011 at 04:36:24PM +0200, Petr Salinger wrote: > >Below is the patch I intend to commit after you retest it. > > I applied it against our 8.2 based package, > altered our "clone" to use this new > interface and run eglibc testsuite. No regression. > > Our runtime detection of this new interface will be based > on "sysctl kern.osreldate". I will bump revision for stable/8 when merging, but I do not see much reason to bump on HEAD right now. > > > >some rewording to the manual page. > > You might want to reword also RFLINUXTHPN part > "kernel will return" -> "kernel will deliver" > > Can we expect MFC before 8.3 release ? I just committed this to HEAD. Will MFC in a week, unless some problems arise. pgpSnAdVuVEYb.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Tue, Jul 12, 2011 at 11:16:28AM +0200, Petr Salinger wrote: > >>Seems this interface be acceptable ? > > > >Looks good to me. > > The proposed code changes are in the attached patch. > > Proposed wording of addition into RFORK(2): Below is the patch I intend to commit after you retest it. I added the checks for validity of the flags, and some rewording to the manual page. diff --git a/lib/libc/sys/rfork.2 b/lib/libc/sys/rfork.2 index f1ae14b..993fd1b 100644 --- a/lib/libc/sys/rfork.2 +++ b/lib/libc/sys/rfork.2 @@ -5,7 +5,7 @@ .\" .\" $FreeBSD$ .\" -.Dd March 15, 2011 +.Dd July 12, 2011 .Dt RFORK 2 .Os .Sh NAME @@ -84,6 +84,16 @@ Note that a lot of code will not run correctly in such an environment. .It Dv RFSIGSHARE If set, the kernel will force sharing the sigacts structure between the child and the parent. +.It Dv RFTSIGZMB +If set, the kernel will deliver a specified signal to the parent +upon the child exit, instead of default SIGCHILD. +The signal number +.Dv signum +is specified by oring the +.Dv RFTSIGFLAGS(signum) +expression into +.Fa flags . +Specifying signal number 0 disables signal delivery upon the child exit. .It Dv RFLINUXTHPN If set, the kernel will return SIGUSR1 instead of SIGCHILD upon thread exit for the child. @@ -164,6 +174,8 @@ would be exceeded (see Both the RFFDG and the RFCFDG flags were specified. .It Bq Er EINVAL Any flags not listed above were specified. +.It Bq Er EINVAL +An invalid signal number was specified. .It Bq Er ENOMEM There is insufficient swap space for the new process. .El diff --git a/sys/kern/kern_fork.c b/sys/kern/kern_fork.c index a8abd8e..9d3e22d 100644 --- a/sys/kern/kern_fork.c +++ b/sys/kern/kern_fork.c @@ -476,7 +476,10 @@ do_fork(struct thread *td, int flags, struct proc *p2, struct thread *td2, sigacts_copy(newsigacts, p1->p_sigacts); p2->p_sigacts = newsigacts; } - if (flags & RFLINUXTHPN) + + if (flags & RFTSIGZMB) + p2->p_sigparent = RFTSIGNUM(flags); + else if (flags & RFLINUXTHPN) p2->p_sigparent = SIGUSR1; else p2->p_sigparent = SIGCHLD; @@ -719,10 +722,22 @@ fork1(struct thread *td, int flags, int pages, struct proc **procp) static int curfail; static struct timeval lastfail; + /* Check for the undefined or unimplemented flags. */ + if ((flags & ~(RFFLAGS | RFTSIGFLAGS(RFTSIGMASK))) != 0) + return (EINVAL); + + /* Signal value requires RFTSIGZMB. */ + if ((flags & RFTSIGFLAGS(RFTSIGMASK)) != 0 && (flags & RFTSIGZMB) == 0) + return (EINVAL); + /* Can't copy and clear. */ if ((flags & (RFFDG|RFCFDG)) == (RFFDG|RFCFDG)) return (EINVAL); + /* Check the validity of the signal number. */ + if ((flags & RFTSIGZMB) != 0 && (u_int)RFTSIGNUM(flags) > _SIG_MAXSIG) + return (EINVAL); + p1 = td->td_proc; /* diff --git a/sys/sys/unistd.h b/sys/sys/unistd.h index 378308d..9d56a3a 100644 --- a/sys/sys/unistd.h +++ b/sys/sys/unistd.h @@ -180,8 +180,16 @@ #defineRFLINUXTHPN (1<<16) /* do linux clone exit parent notification */ #defineRFSTOPPED (1<<17) /* leave child in a stopped state */ #defineRFHIGHPID (1<<18) /* use a pid higher than 10 (idleproc) */ +#defineRFTSIGZMB (1<<19) /* select signal for exit parent notification */ +#defineRFTSIGSHIFT 20 /* selected signal number is in bits 20-27 */ +#defineRFTSIGMASK 0xFF +#defineRFTSIGNUM(flags)(((flags) >> RFTSIGSHIFT) & RFTSIGMASK) +#defineRFTSIGFLAGS(signum) ((signum) << RFTSIGSHIFT) #defineRFPPWAIT(1<<31) /* parent sleeps until child exits (vfork) */ #defineRFKERNELONLY(RFSTOPPED | RFHIGHPID | RFPPWAIT) +#defineRFFLAGS (RFFDG | RFPROC | RFMEM | RFNOWAIT | RFCFDG | \ +RFTHREAD | RFSIGSHARE | RFLINUXTHPN | RFSTOPPED | RFHIGHPID | RFTSIGZMB | \ +RFPPWAIT) #endif /* __BSD_VISIBLE */ pgpOAMxbI5Frs.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Mon, Jul 11, 2011 at 08:05:56PM +0200, Petr Salinger wrote: > >>Should the bit slice be 7 or 8 bits ? > > >I propose to go 8 bits, and add the check to be future-proof. > > >It seems that we already parse GNU/kFreeBSD brandnote. I think this > >could be used to distinguish between old behaviour, that is currently > >used by your libc, and proposed new interface, if __FreeBSD_version > >is bumped and honored by glibc. You might need to store the brandinfo > >somewhere in struct proc or use the separate struct sysentvec. > > No, the version in brandnote is compile-time minimal supported version, > we will detect at runtime (by "sysctl kern.osreldate") which interface we > should use. > > > So far defined rfork() options: > > /* > * XXX currently, some operations without RFPROC set are not supported. > */ > > #define RFNAMEG (1<<0) /* UNIMPL new plan9 `name space' */ > #define RFENVG (1<<1) /* UNIMPL copy plan9 `env space' */ > #define RFFDG (1<<2) /* copy fd table */ > #define RFNOTEG (1<<3) /* UNIMPL create new plan9 `note > group' */ > #define RFPROC (1<<4) /* change child (else changes > curproc) */ > #define RFMEM (1<<5) /* share `address space' */ > #define RFNOWAIT(1<<6) /* give child to init */ > #define RFCNAMEG(1<<10) /* UNIMPL zero plan9 `name space' */ > #define RFCENVG (1<<11) /* UNIMPL zero plan9 `env space' */ > #define RFCFDG (1<<12) /* close all fds, zero fd table */ > #define RFTHREAD(1<<13) /* enable kernel thread support */ > #define RFSIGSHARE (1<<14) /* share signal handlers */ > #define RFLINUXTHPN (1<<16) /* do linux clone exit parent > notification */ > #define RFSTOPPED (1<<17) /* leave child in a stopped state */ > #define RFHIGHPID (1<<18) /* use a pid higher than 10 > (idleproc) */ > #define RFPPWAIT(1<<31) /* parent sleeps until child exits > (vfork) */ > #define RFKERNELONLY(RFSTOPPED | RFHIGHPID | RFPPWAIT) > > > The new interface will add: > > #define RFTSIGZMB (1<<19) > #define RFTSIGSHIFT 20/* reserve bits 20-27 */ > #define RFTSIGMASK 0xFF > #define RFTSIGNUM(flags) (((flags) >> RFTSIGSHIFT) & RFTSIGMASK) > #define RFTSIGFLAGS(signum) ((signum) << RFTSIGSHIFT) > > Seems this interface be acceptable ? Looks good to me. pgpyjHNfXlour.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Mon, Jul 11, 2011 at 06:12:15PM +0200, Petr Salinger wrote: > >I would instead use a new flag to specify a signal sent on the child > >death. Like RFTSIGZMB. If flag is not set, SIGCHLD is used. If it is > >set, the bit slice is used as signal number, 0 means do not send any > >signal. > > > >Please note that the signal should be checked for validity, it must be > ><= _SIG_MAXSIG). > > We used this: > > #define RFTHPNSHIFT24 /* reserve bits 24-30 */ > #define RFTHPNMASK 0x7F/* for compatibility with > linuxthreads/clone() */ >/* allow to specify "clone exit parent >notification" signal */ > #define RFTHPNSIGNUM(flags)(((flags) >> RFTHPNSHIFT) & RFTHPNMASK) > > Therefore signal #128 (_SIG_MAXSIG) cannot be selected. > > Should the bit slice be 7 or 8 bits ? I propose to go 8 bits, and add the check to be future-proof. It seems that we already parse GNU/kFreeBSD brandnote. I think this could be used to distinguish between old behaviour, that is currently used by your libc, and proposed new interface, if __FreeBSD_version is bumped and honored by glibc. You might need to store the brandinfo somewhere in struct proc or use the separate struct sysentvec. pgpd5GgsEGZCV.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Mon, Jul 11, 2011 at 05:43:23PM +0200, Petr Salinger wrote: > >>The 1st patch satisfies this. I agree that SIGCHLD part > >>is not easily readable. > >The SIGCHLD part is ugly. This is why I am asking about possible ways > >to overcome this. > > We need a way to specify "no signal". > It can be "new flag" or "ugly SIGCHLD". > > new flag: > pros: cleaner design > cons: one bit of flags eaten > cons: GNU/kFreeBSD have to detect at runtime which "no signal" have to use > cons: GNU/kFreeBSD have to add "ugly SIGCHLD" for some time > (up-to and including next Debian release) anyway > > ugly SIGCHLD: > pros: immediate GNU/kFreeBSD compatibility > cons: ugly design > > But definitely, it would be much, much better to have "new flag" compared > to diverge indefinitely ;-) > > What should be name of the "new flag" ? > > #define RFTHPNONE (1<<19) /* do not send exit notification signal to the > parent */ > I would instead use a new flag to specify a signal sent on the child death. Like RFTSIGZMB. If flag is not set, SIGCHLD is used. If it is set, the bit slice is used as signal number, 0 means do not send any signal. Please note that the signal should be checked for validity, it must be <= _SIG_MAXSIG). pgpc5NSisjy2Z.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Mon, Jul 11, 2011 at 04:50:44PM +0200, Petr Salinger wrote: > >RFLINUXPTH was used by the linuxthreads port, that was popular in the > >time of FreeBSD 4.x and may be 5.x to run mysql. I will object against > >this breakage. > > Do I understand correctly that API/ABI backward compatibility with > previous FreeBSD releases w.r.t RFLINUXPTH is needed ? Yes, I think so. > > The 1st patch satisfies this. I agree that SIGCHLD part > is not easily readable. The SIGCHLD part is ugly. This is why I am asking about possible ways to overcome this. pgpdjqZaWkWLM.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Mon, Jul 11, 2011 at 04:23:36PM +0200, Petr Salinger wrote: > >>>Can you, please, describe the reasoning behind the > + if (sig == SIGCHLD) sig = 0; > >>>line ? > >> > >>The main reason is backward compatibility. > >>The original FreeBSD code allows only to select between > >>SIGUSR1 or SIGCHLD signals. > >> > >>The our extension changes meaning of RFLINUXTHPN to select signal based on > >>bits 24-30 of passed flags instead of using SIGUSR1 every time. > >> > >>When the passed "signal" number is zero, it should behave identically > >>on plain FreeBSD and in our environment, therefore SIGUSR1 is selected. > >>The assumption is (have been) that (yet) undefined bits are zero. > >>That way we are backward compatible with original FreeBSD. > >> > >>We still need an alternative way to select "none signal is sent" > >>after child exit (under linux #0 is used). > >> > >>The SIGCHLD can be "selected" (also on original FreeBSD) by not specifying > >>RFLINUXTHPN, therefore combination of RFLINUXTHPN and passed "signal" > >>number SIGCHLD is (have been) used for "none signal". > >> > >>BTW, the opposite side is in > >> > >>http://anonscm.debian.org/viewvc/glibc-bsd/trunk/glibc-ports/kfreebsd/clone.c?view=markup > > > >I shall state that the sig == SIGCHLD case is ugly. Having the separate > >flag "do not send signal to the parent" would be much less clumsy. > >What are the requirements for the ABI stability for Debian/kFreeBSD ? > >Can this be fixed now, or is it too late ? > > It should be backward compatible with one previous version. > > What about in long term this: > > RFLINUXTHPN bit will be renamed and will have meaning > "select signal based on bits 24-30 of passed flags" > > - zero would mean "no signal" > - SIGCHLD would mean undefined > - SIGUSR1 would mean SIGUSR1 > > It is ABI/API breakage under original FreeBSD. > The question is how frequently RFLINUXTHPN is used under native FreeBSD > and its port collection. RFLINUXPTH was used by the linuxthreads port, that was popular in the time of FreeBSD 4.x and may be 5.x to run mysql. I will object against this breakage. > > And under "Debian GNU/kFreeBSD COMPAT" or 8-COMPAT > - SIGCHLD would mean "no signal" We never user COMPAT to _change_ the meaning of something, only to exclude some functionality, like syscall. > > We do not use SIGUSR1 currently, the eglibc side can detect whether > it runs under new-enough kernel and decide whether use 0 or SIGCHLD > for "no signal". > > The kernel side would be something like: > > if (flags & RFLINUXTHPN) > { > p2->p_sigparent = RFTHPNSIGNUM(flags); > #if COMPAT8 > if (p2->p_sigparent == SIGCHLD) >p2->p_sigparent = 0; > #endif > } No, this is even uglier, and see the note about compat. BTW, it seems that our rfork(2) ignores unknown flags. This should be fixed, might be in the same patch. pgprLlzHnAqWr.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Mon, Jul 11, 2011 at 03:27:56PM +0200, Petr Salinger wrote: > >>This patch made by Petr Salinger improves compatibility with > >>LinuxThreads in rfork() syscall. The Linux clone() implementation > >>allows specifying the signal sent to parent when child terminates > >>(instead of SIGCHLD). > >> > >>As the threading implementation in Debian GNU/kFreeBSD is > >>LinuxThreads-based, we had to diverge from upstream kFreeBSD ABI and > >>implement this extension. > >> > >>I hope it is acceptable for you to use the same encoding, this way we > >>would archieve ABI compatibility to run Debian GNU/kFreeBSD inside a > >>chroot/jail on top of a FreeBSD system. > >> > >>Thanks for considering > >> > >>-- > >>Robert Millan > > > >>--- a/sys/kern/kern_fork.c > >>+++ b/sys/kern/kern_fork.c > >>@@ -477,7 +477,13 @@ > >>p2->p_sigacts = newsigacts; > >>} > >>if (flags & RFLINUXTHPN) > >>- p2->p_sigparent = SIGUSR1; > >>+ { > >>+ int sig; > >>+ sig = RFTHPNSIGNUM(flags); > >>+ if (sig == 0) sig = SIGUSR1; > >>+ if (sig == SIGCHLD) sig = 0; > >>+ p2->p_sigparent = sig; > >>+ } > >>else > >>p2->p_sigparent = SIGCHLD; > >> > >>--- a/sys/sys/unistd.h > >>+++ b/sys/sys/unistd.h > >>@@ -182,6 +182,10 @@ > >> #defineRFHIGHPID (1<<18) /* use a pid higher than 10 > >> (idleproc) */ > >> #defineRFPPWAIT(1<<31) /* parent sleeps until child exits > >> (vfork) */ > >> #defineRFKERNELONLY(RFSTOPPED | RFHIGHPID | RFPPWAIT) > >>+#define RFTHPNSHIFT24 /* reserve bits 24-30 */ > >>+#define RFTHPNMASK 0x7F/* for compatibility with > >>linuxthreads/clone() */ > >>+ /* allow to specify "clone exit parent > >>notification" signal */ > >>+#define RFTHPNSIGNUM(flags)(((flags) >> RFTHPNSHIFT) & RFTHPNMASK) > >> > >> #endif /* __BSD_VISIBLE */ > >> > >I looked at this patch some time ago already. > > > >Can you, please, describe the reasoning behind the > >>+ if (sig == SIGCHLD) sig = 0; > >line ? > > The main reason is backward compatibility. > The original FreeBSD code allows only to select between > SIGUSR1 or SIGCHLD signals. > > The our extension changes meaning of RFLINUXTHPN to select signal based on > bits 24-30 of passed flags instead of using SIGUSR1 every time. > > When the passed "signal" number is zero, it should behave identically > on plain FreeBSD and in our environment, therefore SIGUSR1 is selected. > The assumption is (have been) that (yet) undefined bits are zero. > That way we are backward compatible with original FreeBSD. > > We still need an alternative way to select "none signal is sent" > after child exit (under linux #0 is used). > > The SIGCHLD can be "selected" (also on original FreeBSD) by not specifying > RFLINUXTHPN, therefore combination of RFLINUXTHPN and passed "signal" > number SIGCHLD is (have been) used for "none signal". > > BTW, the opposite side is in > > http://anonscm.debian.org/viewvc/glibc-bsd/trunk/glibc-ports/kfreebsd/clone.c?view=markup I shall state that the sig == SIGCHLD case is ugly. Having the separate flag "do not send signal to the parent" would be much less clumsy. What are the requirements for the ABI stability for Debian/kFreeBSD ? Can this be fixed now, or is it too late ? Would you care to update the rfork(2) man page ? Also, it would be ideal to reformat the kern_fork.c change according to our style(9). pgp61aiQrmkp5.pgp Description: PGP signature
Re: [PATCH] Improve LinuxThreads compatibility in rfork()
On Mon, Jul 11, 2011 at 12:54:06PM +0200, Robert Millan wrote: > This patch made by Petr Salinger improves compatibility with > LinuxThreads in rfork() syscall. The Linux clone() implementation > allows specifying the signal sent to parent when child terminates > (instead of SIGCHLD). > > As the threading implementation in Debian GNU/kFreeBSD is > LinuxThreads-based, we had to diverge from upstream kFreeBSD ABI and > implement this extension. > > I hope it is acceptable for you to use the same encoding, this way we > would archieve ABI compatibility to run Debian GNU/kFreeBSD inside a > chroot/jail on top of a FreeBSD system. > > Thanks for considering > > -- > Robert Millan > --- a/sys/kern/kern_fork.c > +++ b/sys/kern/kern_fork.c > @@ -477,7 +477,13 @@ > p2->p_sigacts = newsigacts; > } > if (flags & RFLINUXTHPN) > - p2->p_sigparent = SIGUSR1; > + { > + int sig; > + sig = RFTHPNSIGNUM(flags); > + if (sig == 0) sig = SIGUSR1; > + if (sig == SIGCHLD) sig = 0; > + p2->p_sigparent = sig; > + } > else > p2->p_sigparent = SIGCHLD; > > --- a/sys/sys/unistd.h > +++ b/sys/sys/unistd.h > @@ -182,6 +182,10 @@ > #define RFHIGHPID (1<<18) /* use a pid higher than 10 (idleproc) > */ > #define RFPPWAIT(1<<31) /* parent sleeps until child exits > (vfork) */ > #define RFKERNELONLY(RFSTOPPED | RFHIGHPID | RFPPWAIT) > +#define RFTHPNSHIFT 24 /* reserve bits 24-30 */ > +#define RFTHPNMASK 0x7F/* for compatibility with linuxthreads/clone() > */ > + /* allow to specify "clone exit parent > notification" signal */ > +#define RFTHPNSIGNUM(flags) (((flags) >> RFTHPNSHIFT) & RFTHPNMASK) > > #endif /* __BSD_VISIBLE */ > I looked at this patch some time ago already. Can you, please, describe the reasoning behind the > + if (sig == SIGCHLD) sig = 0; line ? pgp6Jx6IZj6tc.pgp Description: PGP signature
Re: [PATCH] PAGE_SIZE in libsbuf
On Sun, Jul 03, 2011 at 02:07:59PM +0200, Robert Millan wrote: > 2011/7/3 Kostik Belousov : > > I think the different workaround is already included in the latest > > sbuf source. Please see > > http://svnweb.freebsd.org/base/head/sys/kern/subr_sbuf.c?revision=222015&view=markup > > How does libsbuf react if the guess fails? Also note this codepath > isn't tested in FreeBSD (except for pc98). > > I'm afraid this situation could lead to runtime errors somehow. I'd > still propose replacing this guess with a sysconf() check. Which runtime errors ? The constant is used to size the malloc allocations. Use of the PAGE_SIZE is an attempt to do some optimization. I think it would work if you use e.g. 42 instead of PAGE_SIZE. pgpaKJ1P9cEbq.pgp Description: PGP signature
Re: [PATCH] PAGE_SIZE in libsbuf
On Sun, Jul 03, 2011 at 01:47:38PM +0200, Robert Millan wrote: > On arm, ia64, powerpc and sparc, Linux doesn't define a static > PAGE_SIZE. It can only be obtained via sysconf(). In addition, > GNU/Hurd doesn't define PAGE_SIZE at all. > > This patch improves portability of libsbuf to be built on those platforms. > > In case you'd like to know, Debian is using libsbuf not just for its > GNU/kFreeBSD port, but also provides it to developers in Debian > GNU/Linux: > > http://packages.debian.org/search?keywords=libsbuf-dev > https://buildd.debian.org/status/package.php?p=freebsd-libs&suite=sid I think the different workaround is already included in the latest sbuf source. Please see http://svnweb.freebsd.org/base/head/sys/kern/subr_sbuf.c?revision=222015&view=markup pgpPQfiPS3PGr.pgp Description: PGP signature
Re: FreeBSD I/OAT (QuickData now?) driver
On Mon, Jun 06, 2011 at 12:47:38PM -0700, Artem Belevich wrote: > Jack, > > Quite a while back you've posted I/OAT driver. Unfortunately the link > you posted does not seem to have the drivers any more. Do you still > have them available somewhere? > > Do you know if Intel has technical info on the DMA engine available? I > believe long time back when it was still I/OAT it was part of chipset > spec. Now X58 Express chipset datasheet refers to a broken link > http://www.intel.com/cs/network/connectivity/emea/eng/226276.htm Referencing DMA engine, do you mean Vanderpool ? It is http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf According to Intel marketing-talk, I/OAT is a couple of things, including MSI-X, VT-d (i.e. IOMMU/Vanderpool), something called Intel(R) QuickData Technology (might be, you mean this one ?) and DCA. pgpQXgdkVSWfx.pgp Description: PGP signature
Re: compiling ports with SSP (was: [PATCH] Add -lssp_nonshared to GCC's LIB_SPEC unconditionally)=
On Wed, Jun 01, 2011 at 11:47:23PM +0200, Jeremie Le Hen wrote: > Hi Kostik, > > Thanks to b.f., I've been reminded that this patch has yet to be > committed :). > > As a reminder, here are the archive pointers to the discussion: > http://lists.freebsd.org/pipermail/freebsd-hackers/2010-August/032549.html > continued... > http://lists.freebsd.org/pipermail/freebsd-hackers/2010-September/033028.html > continued... > http://lists.freebsd.org/pipermail/freebsd-hackers/2010-November/033478.html > > On Sat, Nov 06, 2010 at 09:47:02PM +0200, Kostik Belousov wrote: > > On Fri, Nov 05, 2010 at 07:00:23PM -0400, Alexander Kabaev wrote: > > > On Fri, 5 Nov 2010 22:39:06 +0100 > > > Jeremie Le Hen wrote: > > > > > > > Hi Kib, > > > > > > > > On Tue, Oct 05, 2010 at 08:18:04PM +0200, Jeremie Le Hen wrote: > > > > > > > > > > On Mon, Sep 27, 2010 at 06:44:57PM +0300, Kostik Belousov wrote: > > > > > > Hardcoding /usr/lib as the path to the library in the script looks > > > > > > problematic. For the buidlworld, you are linking resulting > > > > > > binaries with the host library, instead of the > > > > > > buildworld-produced one. For lib32, it makes non-working > > > > > > combination of 32/64 bit. > > > > > > > > > > Sorry for the late reply, but I had to collect various evidences > > > > > for my sayings and my development machine is reaaally slow. > > > > > > > > > > In fact it seems the toolchain built for buildworld contains a ld(1) > > > > > binary which invariably bases lookups for libraries in ${WORLDTMP}, > > > > > even in case of an absolute path. I have two evidences of this: > > > > > - Putting /usr/obj/usr/src/tmp/usr/lib/libssp_nonshared.a in > > > > > /usr/obj/usr/src/tmp/usr/lib/libc.ld leads toolchain's ld(1) to > > > > > use > > > > > /usr/obj/usr/src/tmp/usr/obj/usr/src/tmp/usr/lib/libssp_nonshared.a; > > > > > - I also verified this with a hand-wrought opensnoop-like DTrace > > > > > script. > > > > > > > > I dare to remind you about my patch. Do you have any other concerns? > > > > > > > > Thanks. > > > > Regards, > > > > -- > > > > Jeremie Le Hen > > > > > > > > Humans are born free and equal. But some are more equal than others. > > > > Coluche > > > > > > Hmm, I thought I did approve this patch already a long time agi, but > > > since you asked: > > > > > > +.if defined(SHLIB_LDSCRIPT) && exists(${.CURDIR}/${SHLIB_LDSCRIPT}) > > > > > > this should be: > > > > > > +.if defined(SHLIB_LDSCRIPT) > > > > > > ditto for all other similar places. Otherwise I do not think we should > > > hold the patch in queue ans should unleash it on unsuspecting public. > > > > Also, I think the "DEBUG" lines should be removed. > > Sure, I'll do it in my next update. > > > You install the libxxx.ld and then symlink libxxx.so to libxxx.ld. > > Why ? Would it be enough to install just the libxxx.so ? > > I just thought it would be less puzzling for users than noticing that > libc.so is only a few hundred of ascii. I don't have a strong opinion > about this though. I prefer to not have a symlink. > > > Otherwise, I think you need the similar > > .if ${SHLIBDIR} == ${LIBDIR} > > magic, that is better to be avoided. > > Can you explain a little bit more about this one please? I'm willing to > post an updated patch for further review. I think this comment was somehow related to the fact that make install does not work from the buildenv, or something similar. It was too long time ago for me to remember details. Also, I remember there was a concern about linker script not being installed in the cross-build environment during buildworld, or something close to what I stated. Was it resolved ? [I pinged you some time ago, you did not responded]. pgpx1ZSWxBnkz.pgp Description: PGP signature
Re: NFS mount inside jail fails
On Wed, May 18, 2011 at 04:03:26PM +0200, Pawel Jakub Dawidek wrote: > On Tue, May 17, 2011 at 10:17:12PM +0200, Alexander Leidinger wrote: > > On Tue, 17 May 2011 12:56:40 -0700 Sean Bruno > > wrote: > > > > > Silly thing I ran into today. User wanted to NFS mount a dir inside a > > > jail. After I groaned about the security implication of this, I noted > > > that there is a sysctl that looks like it should allow this. Namely, > > > security.jail.mount_allowed. I noted that setting this follows a path > > > that *should* have allowed this silly thing to happen, except that the > > > credentials in the nfsclient were not setup correctly. > > > > As you noticed, this is supposed to allow to mount inside a jail, IF > > the FS you want to mount is marked as secure/safe to do so. Nearly no > > FS is marked as such, as nobody wants to guarantee that it is safe > > (root in a jail should not be able to panic a system by trying to > > mount a corrupt/malicious FS-image) and secure (not possible to get > > elevated access/privileges). > > > > For NFS there is theoretically the problem that the outgoing address on > > requests could be the one of the physical host instead of the IP of the > > jail. If this is true in practice, I do not know. This could be > > the reason why NFS is not marked with VFCF_JAIL. > > It is not marked with VFCF_JAIL, because I just had no time to audit > that it is safe. It might be safe in theory. > > There are some file systems types that can't be securely mounted within > a jail no matter what, like UFS, MSDOFS, EXTFS, XFS, REISERFS, NTFS, > etc. because the user mounting it has access to raw storage and can > corrupt it in a way that it will panic entire system. > > There are other file systems that don't require access to raw storage > for the user doing the mount and chances are they are safe to mount from > within a jail, like ZFS (user can have access to ZFS datasets, but don't > need access to ZFS pool), NFS, SMBFS, NULLFS, UNIONFS, PROCFS, FDESCFS, > etc. I added VFCF_JAIL flag, so there is general mechanism to mark file > systems as jail-friendly, but back then I only needed it for ZFS. If user does nfs or smbfs mount to the local (or any other controllable) server, then he definitely can plant the DoS or resource starvation attacks. pgpZlPZ4SSgcE.pgp Description: PGP signature
Re: Fwd: [PATCH v2 3/4] x86, head_32/64.S: Enable SMEP
On Wed, May 18, 2011 at 10:50:30AM -0400, John Baldwin wrote: > On Wednesday, May 18, 2011 8:31:15 am Oliver Pinter wrote: > > On 5/18/11, Kostik Belousov wrote: > > > On Wed, May 18, 2011 at 02:03:07AM +0200, Oliver Pinter wrote: > > >> -- Forwarded message -- > > >> From: Fenghua Yu > > >> Date: Mon, 16 May 2011 14:34:44 -0700 > > >> Subject: [PATCH v2 3/4] x86, head_32/64.S: Enable SMEP > > >> To: Ingo Molnar , Thomas Gleixner , > > >> H Peter Anvin , Asit K Mallick > > >> , Linus Torvalds > > >> , Avi Kivity , Arjan > > >> van de Ven , Andrew Morton > > >> , Andi Kleen > > >> Cc: linux-kernel , Fenghua Yu > > >> > > >> > > >> From: Fenghua Yu > > >> > > >> Enable newly documented SMEP (Supervisor Mode Execution Protection) CPU > > >> feature in kernel. > > >> > > >> SMEP prevents the CPU in kernel-mode to jump to an executable page that > > >> does > > >> not have the kernel/system flag set in the pte. This prevents the kernel > > >> from executing user-space code accidentally or maliciously, so it for > > >> example > > >> prevents kernel exploits from jumping to specially prepared user-mode > > >> shell > > >> code. The violation will cause page fault #PF and will have error code > > >> identical to XD violation. > > >> > > >> CR4.SMEP (bit 20) is 0 at power-on. If the feature is supported by CPU > > >> (X86_FEATURE_SMEP), enable SMEP by setting CR4.SMEP. New kernel > > >> option nosmep disables the feature even if the feature is supported by > > >> CPU. > > >> > > >> Signed-off-by: Fenghua Yu > > > > > > So, where is the mentioned documentation for SMEP ? Rev. 38 of the > > > Intel(R) 64 and IA-32 Architectures Software Developer's Manual does > > > not contain the description, at least at the places where I looked and > > > expected to find it. > > > > http://www.intel.com/Assets/PDF/manual/325384.pdf > > > > Intel? 64 and IA-32 Architectures Software Developer?s Manual > >Volume 3 (3A & 3B): > > System Programming Guide > > Which revision? It is not documented in revision 38 from April 2011. > > I just downloaded that link, and it is still revision 38 and has no mention > 'SMEP'. Also, bit 20 of CR4 is still marked as Reserved in that manual > (section 2.5). This is exactly what I said about rev. 38 in my original reply. pgpVkiYgn2TpL.pgp Description: PGP signature
Re: Fwd: [PATCH v2 3/4] x86, head_32/64.S: Enable SMEP
On Wed, May 18, 2011 at 02:03:07AM +0200, Oliver Pinter wrote: > -- Forwarded message -- > From: Fenghua Yu > Date: Mon, 16 May 2011 14:34:44 -0700 > Subject: [PATCH v2 3/4] x86, head_32/64.S: Enable SMEP > To: Ingo Molnar , Thomas Gleixner , > H Peter Anvin , Asit K Mallick > , Linus Torvalds > , Avi Kivity , Arjan > van de Ven , Andrew Morton > , Andi Kleen > Cc: linux-kernel , Fenghua Yu > > > From: Fenghua Yu > > Enable newly documented SMEP (Supervisor Mode Execution Protection) CPU > feature in kernel. > > SMEP prevents the CPU in kernel-mode to jump to an executable page that does > not have the kernel/system flag set in the pte. This prevents the kernel > from executing user-space code accidentally or maliciously, so it for example > prevents kernel exploits from jumping to specially prepared user-mode shell > code. The violation will cause page fault #PF and will have error code > identical to XD violation. > > CR4.SMEP (bit 20) is 0 at power-on. If the feature is supported by CPU > (X86_FEATURE_SMEP), enable SMEP by setting CR4.SMEP. New kernel > option nosmep disables the feature even if the feature is supported by CPU. > > Signed-off-by: Fenghua Yu So, where is the mentioned documentation for SMEP ? Rev. 38 of the Intel(R) 64 and IA-32 Architectures Software Developer's Manual does not contain the description, at least at the places where I looked and expected to find it. Looking forward to hear from you. pgpYcKGUj9rIq.pgp Description: PGP signature
Re: Does FreeBSD have replacement for posix_fadvice() or fcntl(F_RDADVISE)?
On Thu, May 12, 2011 at 11:38:12AM +0400, Lev Serebryakov wrote: > Hello, Freebsd-hackers. > >Does FreeBSD have some custom call, which can be used where Linux > programs uses posix_fadvice() and DARWIN ones fcntl(F_RDADVISE)? > >It is like madvise(2) but for file descriptors. No, it does not (and I think the function is spelled posix_fadvise()). mdf reserved the syscall slot for posix_fadvise in his recent work on posix_fallocate(). Might be, he could comment more. pgpeURQwJNk0Q.pgp Description: PGP signature
Re: Runtime check for PAE option on BSD 6+ i386
On Tue, May 03, 2011 at 11:44:32AM -0400, John Baldwin wrote: > On Tuesday, May 03, 2011 9:10:26 am Philip Soeberg wrote: > > Hi fellow FreeBSD hackers, > > > > I've been using the following poor-man's approach in my driver init for > > ages in an attempt at detecting PAE option on BSD 6 (or greater) i386 > > kernels, as I depend on dmabus(9) but provide a loadable kernel module only. > > > > >>> > >if (sizeof(void*) == 4) { > > if (((uint64_t)(cnt.v_page_count * cnt.v_page_size) / 1073741824) > > >= 4) { > >printf("FreeBSD i386 detected with PAE option enabled. FreeBSD > > PAE type\n"); > >printf("kernels does not support loadable modules which use DMA. > > Please\n"); > >printf("reconfigure your kernel for non-PAE or switch to amd64 > > kernel.\n"); > >return EFAULT; > > } > >} > > <<< > > Hmmm, even this isn't really accurate as some folks may choose to enable PAE > even with < 4GB to get PG_NX functionality. > > > afaik there's a sysctl method of checking this per BSD7 (or is it 8?), > > but what about BSD6? Any hints on how I can runtime detect the above? > > Definitely a kern.features.pae sysctl in 7. I don't see anything similar in > 6. Read %cr4 and test the bit there. pgpnUcshIb6U6.pgp Description: PGP signature
Re: Is there some implicit locking of device methods?
On Thu, Apr 28, 2011 at 12:14:49AM +0200, Bartosz Fabianowski wrote: > Indeed, I may have mixed up terminology. Sorry about that. What I am > doing (or trying to do) is very simple: > > There is a single physical USB device. I have a single device node > representing it. This device can be opened for reading, concurrently, > any number of times. Everyone who open()s the device can read() it at > their own pace. I implemented this by maintaining an individual queue of > incoming data for each open() call. This queue resides in cdevpriv. > > So open() instantiates a queue and adds it to the driver's global list > of queues. Whenever a packet arrives from the device, it is placed in > all the queues (I have a linked list of all queues for that purpose). > When the open() is eventually followed by a close(), the cdevpriv > destructor removes the queue from the global list and frees its memory. > > In addition to this, I need to start the USB transfer when the first > open() occurs and stop it again when the last close() occurs. I am doing > this by checking the length of the global list. When the list is > zero-length on open(), I start the transfer. When the list i zero-length > in the cdevpriv destructor, I stop the transfer. > > I cannot see how else to achieve this behavior (other than device > cloning which I was using before but which is more complicated and > probably more error-prone). If I am doing something wrong and there is a > more correct way to do it, I would love to hear about it. This is a strange architecture, esp. amusing is the kernel-mode traffic multiplier. Without knowing the details, and really not wanting to know it, I could make two suggestions out of thin air: - use usermode daemon that multiplies traffic for all connected clients; - or, implement a ring buffer that cyclically stores the received data, and keep only the current read pointer in the cdevpriv. You need to handle the overflow case (eq. to the stuck reader) somehow in the current scheme anyway. Reader may now read from its current read position in the buffer up to the fill point. If the buffer wrapped for the reader, it should get some error. pgpsmwKlIWhWE.pgp Description: PGP signature
Re: Is there some implicit locking of device methods?
On Wed, Apr 27, 2011 at 11:14:36PM +0200, Bartosz Fabianowski wrote: > >If you have some sort of state that needs to get created on first > >open and then removed on last close [...] I would still depend on the > >cdevpriv destructor and use a reference count between open() and the > >destructor to know when to cleanup shared state. > > Yes, this is what I am doing. I am maintaining a list of all file > descriptors open on the device. Once the length of that list reaches > zero, I do global clean-up in the cdevpriv destructor. You are mixing things, and do repeat the work (probably buggy) that is already done by devfs. You should understand the relationship between basic concepts first. File descriptor != opened file != cdev node. Driver indeed may get notifications on all open(2) syscalls performed on the node, but is has absolutely no way to enumerate filedescriptors referencing that files. cdevpriv is per file, not per node. cdevpriv is cleaned automatically when last filedescriptor referencing the file is gone, not when the last file referencing the node is closed. The later is approximated by d_close(). pgpNUYm0Ai9Bh.pgp Description: PGP signature
Re: Is there some implicit locking of device methods?
On Wed, Apr 27, 2011 at 03:22:43PM +0200, Bartosz Fabianowski wrote: > >Err, if you use cdevpriv you shouldn't even have a d_close method. All > >your > >d_close logic should be in the cdevpriv destructor > > I see. There is no documentation for any of this, so I just implemented > it in the way I *thought* it should work: > > .d_close = drv_close, > > int drv_close(...) { > devfs_clear_cdevpriv(); > } > > static void cdevpriv_dtr(void *data) { > free(data, M_USBDEV); > } > > If I understand you correctly, I can leave out the drv_close() method. > When close() is called, devfs_clear_cdevpriv() will be executed > implcitly for me and my dstructor will run - right? You are mixing the global 'last close', that is performed when last file opened over the device node is closed, and the last filedescriptor close which causes the file to be decomissioned. The global kind of last close is communicated to cdev by calling cdevsw close method. It is known to be not quite reliable, and esp. hard in relation to the forced unmounts of devfs mount points. The close of file (when no other file descriptors referencing the file are left) ends in cdevpriv destructor call. pgp5X6a1bnieI.pgp Description: PGP signature
Re: Is there some implicit locking of device methods?
On Tue, Apr 26, 2011 at 02:38:30PM +0200, Bartosz Fabianowski wrote: > >You need to handle all cases in your driver. Fortunately there exists a > >solution for this already, called USB cdev. See > > I went through all the USB drivers with a fine comb (the driver I am > porting was based on the old USB stack and so I needed to adjust it for > the new stack). Drivers like ulpt seem to be based around usb_fifo_* > structures. If I understand usb_fifo_* right, it gives you a single > device with FIFO semantics. This is not sufficient in my case. My device > is opened for reading by several processes in parallel and needs to keep > a separate FIFO per process. I implemented this via device cloning - and > I could not see how to integrate that with usb_fifo_*. Thus, I based my > driver on the raw cdev framework. Am I missing something obvious and > making my life unnecessarily hard? If you needs per-file private data for cdev, you would be better served by cdevpriv(9) KPI. Cloning is too hard to use correctly for such task. pgpmd1BnzRcQA.pgp Description: PGP signature
Re: SMP question w.r.t. reading kernel variables
On Sun, Apr 17, 2011 at 03:49:48PM -0400, Rick Macklem wrote: > Hi, > > I should know the answer to this, but... When reading a global kernel > variable, where its modifications are protected by a mutex, is it > necessary to get the mutex lock to just read its value? > > For example: > Aif ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) > return (EPERM); > versus > BMNT_ILOCK(mp); > if ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0) { > MNT_IUNLOCK(mp); > return (EPERM); > } > MNT_IUNLOCK(mp); > > My hunch is that B is necessary if you need an up-to-date value > for the variable (mp->mnt_kern_flag in this case). > > Is that correct? mnt_kern_flag read is atomic on all architectures. If, as I suspect, the fragment is for the VFS_UNMOUNT() fs method, then VFS guarantees the stability of mnt_kern_flag, by blocking other attempts to unmount until current one is finished. If not, then either you do not need the lock, or provided snipped which takes a lock is unsufficient, since you are dropping the lock but continue the action that depends on the flag not being set. pgp3IYsjmCOsB.pgp Description: PGP signature
Re: State of FreeBSD/xbox
On Sun, Apr 03, 2011 at 07:40:14PM +0100, Chris Rees wrote: > Hi all, > > I've got an xbox running at my parents' house as a backup MX, among > other things. > > Ages ago I updated it from 7.2 -> 8.1, and I ended up with no end of > trouble, and finished by restoring a backup. I emailed rink@, although > he replied it appears a case of ENOTIME. > > Am I the last person on the planet running FreeBSD on an Xbox? I'm > planning on hacking around a bit to see if I can fix it next week, but > I was wondering if anyone else had similar ideas or a solution! > > In short-- FreeBSD/xbox is bitrotted (or so it appears!) I wanted to remove XBOX for long time. With the introduction of Xen PV, i386/machdep.c became too convoluted. On the other hand, if you can bring it back to life, I definitely would not have an argument for removal. pgp8ug0Jq7lS9.pgp Description: PGP signature
Re: Prebind from OpenBSD
On Sun, Mar 27, 2011 at 08:54:18PM +0100, Robert Watson wrote: > On Sat, 26 Mar 2011, Jesse Smith wrote: > > >I'm interested in working on the "Port prebind from OpenBSD" project > >mentioned on the FreeBSD Ideas page. ( > >http://wiki.freebsd.org/IdeasPage#head-d28cdd95ca1755d5afe63d653cb4926d4bdc99de > > > >) > > > >There isn't much to go on from the project description and I'm curious > >what FreeBSD devs are looking for specifically. For example, should the > >entire ldconfig program be ported from OpenBSD (it looks like it's close > >enough to FreeBSD's to make that suitable), or should just the prebind > >code be merged into FreeBSD's ldcnfig? > > > >Once the project is complete, who should the work be submitted to? Has > >anyone else worked on this and made any progress? > > Hi Jesse: > > I think the intent of the ideas list entry is more a research project than > a direct-to-commit project: the question is whether prebinding of some sort > would observably help performance for important FreeBSD applications or, > for example, the boot process. If so, then certainly the OpenBSD > prebinding code is a possible model -- Mac OS X also has prebinding, of > course, and it's done quite differently (and probably less reusably from > our perspective as they use Mach-O rather than ELF); however, there might > be interesting ideas as well. > > I think therefore I'd structure a project along the following lines: first, > you want to establish to what extent synchronous waiting on linkage at > run-time is a significant problem. It could be that some combination of > hwpmc and DTrace would be the right tools for this. I'd especially pay > attention to boot time, since we know that quite a lot of executing takes > place then as part of rc.d. I'd also investigate large applications like > Firefox, Chrome, KDE, Gnome, etc. KDE already integrates prebinding tricks > in its design, but I don't think the others do. > > Next, I'd dig a bit more into the areas where it's hurting performance -- > can you add up all the time spent waiting and cut 10 seconds from boot, or > 5 seconds from Firefox startup? Or is the best win going to be .2 seconds > in Firefox? Does the OpenBSD optimisation actually address the problem > we're experiencing? Perhaps perform some experiments with prebinding-like > behaviour, working up to an implementation. > > It's worth remembering that prebinding comes with some baggage as well, of > course. Perhaps less relevant in the world of 64-bit address spaces, but > there are some design trade-offs in this department... The most serious issue with prebind is a consistency. It is very easy to get prebind data out of date, and this is esp. easy in the FreeBSD where buildworld and source port upgrades are everyday activity. Before this goes much further, yes, we need a benchmarks that demonstrate that system spends significant time in the symbol resolution for often started images [*], and second, we need to have a workable model for source upgrades. * - so that Firefox, OpenOffice etc do not qualify, IMHO. pgpamzbTnhOCD.pgp Description: PGP signature
Re: [GSoc] Timeconter Performance Improvements
On Sat, Mar 26, 2011 at 10:12:32AM -0400, John Baldwin wrote: > On Saturday, March 26, 2011 08:16:46 am Peter Jeremy wrote: > > On 2011-Mar-25 08:18:38 -0400, John Baldwin wrote: > > >For modern Intel CPUs you can just assume that the TSCs are in sync across > > >packages. They also have invariant TSC's meaning that the frequency > > >doesn't change. > > > > Synchronised P-state invariant TSCs vastly simplify the problem but > > not everyone has them. Should the fallback be more complexity to > > support per-CPU TSC counts and varying frequencies or a fallback to > > reading the time via a syscall? > > I think we should just fallback to a syscall in that case. We will also need > to do that if the TSC is not used as the timecounter (or always duplicate the > ntp_adjtime() work we do for the current timecounter for the TSC timecounter). > > Doing this easy case may give us the most bang for the buck, and it is also a > good first milestone. Once that is in place we can decide what the value is > in extending it to support harder variations. > > One thing we do need to think about is if the shared page should just export a > fixed set of global data, or if it should export routines. The latter > approach is more complex, but it makes the ABI boundary between userland and > the kernel more friendly to future changes. I believe Linux does the latter > approach? Linux uses a so-called vdso, which is linked into the process. I think that the efforts to implement a vdso approximately equal to the efforts required to implement timecounters in the user mode. On the other hand, with vdso we could properly annotate signal trampolines with the unwind info, that is also a big win. pgpbOEkvvqnQ4.pgp Description: PGP signature
Re: [GSoc] Timeconter Performance Improvements
On Sat, Mar 26, 2011 at 11:16:46PM +1100, Peter Jeremy wrote: > On 2011-Mar-25 08:18:38 -0400, John Baldwin wrote: > >For modern Intel CPUs you can just assume that the TSCs are in sync across > >packages. They also have invariant TSC's meaning that the frequency doesn't > >change. > > Synchronised P-state invariant TSCs vastly simplify the problem but > not everyone has them. Should the fallback be more complexity to > support per-CPU TSC counts and varying frequencies or a fallback to > reading the time via a syscall? > > >I believe we already have a shared page (it holds the signal trampoline now) > >for at least the x86 platform (probably some others as well). > > r217151 for amd64 and r217400 for ppc. It doesn't appear to be > supported on other platforms. My reading of the code is that there is > a single shared page used by all processes/CPUs. In order to support > non-synchronised TSCs, this would need to be changed to per-CPU. Not neccessary. If you have a reliable way to access proper private per-CPU page from the array, then you could use the same method to access the array in the single page. IMO, per-cpu page in process address space at the same address for all pages is too costly. I think we can target a modern hardware for user-mode tsc, this is the kind of machines that are used for benchmarks anyway. pgpnxlUPO1v61.pgp Description: PGP signature
Re: DMA controller on Northbridge?
On Tue, Mar 22, 2011 at 11:55:59AM -0700, Matthew Fleming wrote: > On Tue, Mar 22, 2011 at 11:12 AM, Kostik Belousov wrote: > > On Tue, Mar 22, 2011 at 10:11:04AM -0700, Matthew Fleming wrote: > >> How can I tell if the Northbridge on a machine has a built-in DMA > >> controller? And if it does, what device would I use to control it? > >> > >> I ask because I'm working with a PCI card that has a 36-bit physical > >> address limit, and that means bounce buffers when using more than 64GB > >> of memory. I'd prefer not to use bounce buffers, and since the card's > >> memory that I'm using is mapped into the physical space of the FreeBSD > >> host, the entire address space of the card that I care about is > >> available to FreeBSD. So while pio to the card's memory is too slow > >> to be useful, if there was a way to use a DMA controller on the > >> motherboard to get data into and out of the card, that may be > >> preferable to using the card's DMAC with the limited address space. > >> > >> But all that's just theory -- I have no idea how to tell whether the > >> mobo has a DMAC, and if it does, how to control it. Help? :-) > >> > >> Attached is the boot dmesg; I can also run pciconf commands, etc., to > >> help out with figuring out what I have. > > > > I believe what are you looking for is > > ftp://download.intel.com/technology/.../Intel(r)_VT_for_Direct_IO.pdf Oops. ftp://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf > > This link doesn't work for me. > > > On my X58 machine it is shown like this: > > none6@pci0:0:22:2: class=0x088000 card=0xf38015d9 chip=0x34328086 > > rev=0x12 hdr=0x00 > > vendor = 'Intel Corporation' > > device = 'DMA Engine' > > class = base peripheral > > cap 11[80] = MSI-X supports 1 message in map 0x10 > > cap 10[90] = PCI-Express 2 root endpoint max data 128(128) link x0(x0) > > cap 01[e0] = powerspec 3 supports D0 D3 current D0 > > I do seem to have several DMA Engine entries in pciconf on this > hardware. Hopefully the above doc will explain what to do. :-) > > Thanks, > matthew pgpkII57XvuLo.pgp Description: PGP signature
Re: GSoC'11: DWARF2 call frame information
On Tue, Mar 22, 2011 at 11:39:58PM +0800, Xingxing Pan wrote: > 2011/3/22 Kostik Belousov : > > On Mon, Mar 21, 2011 at 08:32:04PM +0300, Chagin Dmitry wrote: > >> On Mon, Mar 21, 2011 at 05:36:13PM +0800, Xingxing Pan wrote: > >> > 2011/3/21 Chagin Dmitry : > >> > >> powerfull script. > >> > >> > >> > >> Xingxing Pan > >> > > > >> > > hmm, which script? I think enough amd64, i386 and amd64/ia32. > >> > > > >> > > I suggest to write a example before continuing the conversation > >> > > about the GSoC. For example (bcopy || bzero) && cpu_switch. > >> > > Is it ok for you? > >> > > > >> > > -- > >> > > Have fun! > >> > > chd > >> > > > >> > > >> > Hi Chargin, > >> > > >> > Thank you for your reply. > >> > The followings shows how I try to add DWARF for bcopy. > >> > > >> > --- ../8.2.0/sys/i386/include/asm.h 2011-03-21 14:35:56.111973722 > >> > +0800 > >> > +++ asm.h 2011-03-21 15:25:31.564636162 +0800 > >> > @@ -71,7 +71,7 @@ > >> > > >> > #define _ENTRY(x) _START_ENTRY; \ > >> > .globl CNAME(x); .type CNAME(x),@function; > >> > CNAME(x): > >> > -#define END(x) .size x, . - x > >> > +#define END(x) .cfi_endproc; .size x, . - x > >> > > >> > #ifdef PROF > >> > #define ALTENTRY(x) _ENTRY(x); \ > >> > @@ -80,9 +80,10 @@ > >> > popl %ebp; \ > >> > jmp 9f > >> > #define ENTRY(x) _ENTRY(x); \ > >> > - pushl %ebp; movl %esp,%ebp; \ > >> > + .cfi_startproc; \ > >> > + pushl %ebp; .cfi_adjust_cfa_offset 4; movl > >> > %esp,%ebp; .cfi_def_cfa_register %ebp; \ > >> > call PIC_PLT(HIDENAME(mcount)); \ > >> > - popl %ebp; \ > >> > + popl %ebp; .cfi_def_cfa %esp, 4; \ > >> > > >> > --- bcopy.S 2011-03-21 15:51:26.804203809 +0800 > >> > +++ ../8.2.0/lib/libc/i386/string/bcopy.S 2011-03-21 > >> > 14:28:15.023069890 +0800 > >> > @@ -51,9 +51,7 @@ ENTRY(bcopy) > >> > #endif > >> > #endif > >> > pushl %esi > >> > - .cfi_adjust_cfa_offset 4; > >> > pushl %edi > >> > - .cfi_adjust_cfa_offset 4; > >> > #if defined(MEMCOPY) || defined(MEMMOVE) > >> > movl 12(%esp),%edi > >> > movl 16(%esp),%esi > >> > @@ -77,9 +75,7 @@ ENTRY(bcopy) > >> > rep > >> > movsb > >> > popl %edi > >> > - .cfi_adjust_cfa_offset -4; > >> > popl %esi > >> > - .cfi_adjust_cfa_offset -4; > >> > ret > >> > 1: > >> > addl %ecx,%edi /* copy backwards. */ > >> > @@ -98,9 +94,7 @@ ENTRY(bcopy) > >> > rep > >> > movsl > >> > popl %edi > >> > - .cfi_adjust_cfa_offset -4; > >> > popl %esi > >> > - .cfi_adjust_cfa_offset -4; > >> > cld > >> > ret > >> > #ifdef MEMCOPY > >> > > >> > But I don't know how to add DWARF for cpu_switch, because I have no > >> > idea about the circumstance when we need to backtrace through this > >> > function. Suppose there's a cpu switch like this, > >> > threadA->kernel->threadB. Then should the expected backtrace has the > >> > following result? > >> > > >> > threadB's stack > >> > kernel's stack > >> > threadA's stack > >> > >> > >> hmm, ok. good, avoid cpu_switch. > >> First of all, please, read style(9) man page. > >> In the second, evaluate the proposed plan (discussed with kib@): > >> > >> 1) Annotate libc, msun, rtld, libthr (you) > > 1a) Develop and implement a testing plan to verify the implementation. > > 1b) consider doing full register tracking for assembler code. > > > >> 2) vdso (I'm) > >> 3) Annotate signal trampolines (you, after vdso) > >> > >> And i'm going to understand what I need to do to start GSoC for you. > >> Thanks! > >> > >> > >> -- > >> Have fun! > >> chd > > > > > > > > Hi Kostik, > > I think the basic testing method can be using GDB to set breakpoint in > functions and observing the backtrace result. GDB uses Expect. I can > learn something from GDB's testsuite. Sounds good. > > AFAIK, CFA and return address are enough for unwinding. Dose full > register tracking > means to emit DWARF for all the registers's saving and restoring in > the life time of the function? Not only save and restore, but also for move around. I am mostly about the syscall entry sequence on amd64, see the description of the `syscall' instruction and handling of %rcx in libc sources. Rarely used routines could be left aside. pgpyDdqJ4T4Oz.pgp Description: PGP signature
Re: DMA controller on Northbridge?
On Tue, Mar 22, 2011 at 10:11:04AM -0700, Matthew Fleming wrote: > How can I tell if the Northbridge on a machine has a built-in DMA > controller? And if it does, what device would I use to control it? > > I ask because I'm working with a PCI card that has a 36-bit physical > address limit, and that means bounce buffers when using more than 64GB > of memory. I'd prefer not to use bounce buffers, and since the card's > memory that I'm using is mapped into the physical space of the FreeBSD > host, the entire address space of the card that I care about is > available to FreeBSD. So while pio to the card's memory is too slow > to be useful, if there was a way to use a DMA controller on the > motherboard to get data into and out of the card, that may be > preferable to using the card's DMAC with the limited address space. > > But all that's just theory -- I have no idea how to tell whether the > mobo has a DMAC, and if it does, how to control it. Help? :-) > > Attached is the boot dmesg; I can also run pciconf commands, etc., to > help out with figuring out what I have. I believe what are you looking for is ftp://download.intel.com/technology/.../Intel(r)_VT_for_Direct_IO.pdf On my X58 machine it is shown like this: none6@pci0:0:22:2: class=0x088000 card=0xf38015d9 chip=0x34328086 rev=0x12 hdr=0x00 vendor = 'Intel Corporation' device = 'DMA Engine' class = base peripheral cap 11[80] = MSI-X supports 1 message in map 0x10 cap 10[90] = PCI-Express 2 root endpoint max data 128(128) link x0(x0) cap 01[e0] = powerspec 3 supports D0 D3 current D0 pgpN2wNvBDDeJ.pgp Description: PGP signature
Re: GSoC'11: DWARF2 call frame information
On Mon, Mar 21, 2011 at 08:32:04PM +0300, Chagin Dmitry wrote: > On Mon, Mar 21, 2011 at 05:36:13PM +0800, Xingxing Pan wrote: > > 2011/3/21 Chagin Dmitry : > > >> powerfull script. > > >> > > >> Xingxing Pan > > > > > > hmm, which script? I think enough amd64, i386 and amd64/ia32. > > > > > > I suggest to write a example before continuing the conversation > > > about the GSoC. For example (bcopy || bzero) && cpu_switch. > > > Is it ok for you? > > > > > > -- > > > Have fun! > > > chd > > > > > > > Hi Chargin, > > > > Thank you for your reply. > > The followings shows how I try to add DWARF for bcopy. > > > > --- ../8.2.0/sys/i386/include/asm.h 2011-03-21 14:35:56.111973722 +0800 > > +++ asm.h 2011-03-21 15:25:31.564636162 +0800 > > @@ -71,7 +71,7 @@ > > > > #define _ENTRY(x) _START_ENTRY; \ > > .globl CNAME(x); .type CNAME(x),@function; CNAME(x): > > -#defineEND(x) .size x, . - x > > +#defineEND(x) .cfi_endproc; .size x, . - x > > > > #ifdef PROF > > #defineALTENTRY(x) _ENTRY(x); \ > > @@ -80,9 +80,10 @@ > > popl %ebp; \ > > jmp 9f > > #defineENTRY(x)_ENTRY(x); \ > > - pushl %ebp; movl %esp,%ebp; \ > > + .cfi_startproc; \ > > + pushl %ebp; .cfi_adjust_cfa_offset 4; movl > > %esp,%ebp; .cfi_def_cfa_register %ebp; \ > > call PIC_PLT(HIDENAME(mcount)); \ > > - popl %ebp; \ > > + popl %ebp; .cfi_def_cfa %esp, 4; \ > > > > --- bcopy.S 2011-03-21 15:51:26.804203809 +0800 > > +++ ../8.2.0/lib/libc/i386/string/bcopy.S 2011-03-21 > > 14:28:15.023069890 +0800 > > @@ -51,9 +51,7 @@ ENTRY(bcopy) > > #endif > > #endif > > pushl %esi > > - .cfi_adjust_cfa_offset 4; > > pushl %edi > > - .cfi_adjust_cfa_offset 4; > > #if defined(MEMCOPY) || defined(MEMMOVE) > > movl12(%esp),%edi > > movl16(%esp),%esi > > @@ -77,9 +75,7 @@ ENTRY(bcopy) > > rep > > movsb > > popl%edi > > - .cfi_adjust_cfa_offset -4; > > popl%esi > > - .cfi_adjust_cfa_offset -4; > > ret > > 1: > > addl%ecx,%edi /* copy backwards. */ > > @@ -98,9 +94,7 @@ ENTRY(bcopy) > > rep > > movsl > > popl%edi > > - .cfi_adjust_cfa_offset -4; > > popl%esi > > - .cfi_adjust_cfa_offset -4; > > cld > > ret > > #ifdef MEMCOPY > > > > But I don't know how to add DWARF for cpu_switch, because I have no > > idea about the circumstance when we need to backtrace through this > > function. Suppose there's a cpu switch like this, > > threadA->kernel->threadB. Then should the expected backtrace has the > > following result? > > > > threadB's stack > > kernel's stack > > threadA's stack > > > hmm, ok. good, avoid cpu_switch. > First of all, please, read style(9) man page. > In the second, evaluate the proposed plan (discussed with kib@): > > 1) Annotate libc, msun, rtld, libthr (you) 1a) Develop and implement a testing plan to verify the implementation. 1b) consider doing full register tracking for assembler code. > 2) vdso (I'm) > 3) Annotate signal trampolines (you, after vdso) > > And i'm going to understand what I need to do to start GSoC for you. > Thanks! > > > -- > Have fun! > chd pgpJOxpYhyLs1.pgp Description: PGP signature
Re: get_cyclecount(9) deprecation
On Fri, Mar 18, 2011 at 02:09:58PM -0400, Jung-uk Kim wrote: > On Friday 18 March 2011 01:05 pm, Bruce Evans wrote: > > On Sat, 19 Mar 2011, Bruce Evans wrote: > > > On Fri, 18 Mar 2011, Kostik Belousov wrote: > > >> We definitely do not support configurations with different > > >> models of CPUs in SMP, this is what Simmetric is about. > > >> Different as in frequency or stepping. > > > > > > ... > > > Now there is even more asymmetry > > > in core frequencies, with the hardware transiently slowing down > > > or stopping cores independently, at least for cores in different > > > packages. > > > > Also, with virtualization, the virtualizer cannot reasonably even > > provide an invariant TSC that runs at the same rate on all cores. > > It should provide an invariant TSC that claims to run at the same > > rate on all cores, but then the cores cannot run at the same rate > > except on average, since some of the cores will have to run the > > virtualizer some of the time, and it is unreasonable to distribute > > the overhead for this evenly except on average. > > Ah, virtualization... It is really painful to make it reasonably > correct. For example, VMware claims that all timers (including TSC > and excluding RTC) runs at "apparent time", which is concept of > constant ticks in virtualized guest. It also means TSCs are always > "virtually" constant and synchronized across all virtual cores in > guest environment. If it loses periodic timer ticks, lost ticks are > "compensated" later. This also means timecounter does not exactly > scale well based on realtime and its frequency fluctuates so much, if > my understanding is correct: > > http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf > > I am not sure how others handle this but VirtualBox gives me really > wacky TSC frequency sometimes. When it happens, it becomes unusably > slow. So, I know something is really wrong there, too. However, Xen > seems to do much smarter than that because it has its own concept of > virtual TSC, thanks to its para-virtualization architecture. Most likely, all 'full-hardware' hypervisors have to disable rdtsc for guests, intercepting the instruction by trap and emulating it. This means that most basic assumptions about rdtsc are not held in virtualized environment, like relative cost or accuracy. Anyway, I was unable to make any reasonable use of virtualization except kernel debugging, which is more then satisfactory performed by QEMU. Ah, QEMU is not hypervisor. pgpmgvUjdGJOL.pgp Description: PGP signature
Re: get_cyclecount(9) deprecation
On Fri, Mar 18, 2011 at 05:51:27PM +0200, Andriy Gapon wrote: > on 18/03/2011 15:56 Kostik Belousov said the following: > > On Fri, Mar 18, 2011 at 05:26:53PM +1100, Bruce Evans wrote: > > ... > >> - set cputicker() has some design bugs. It assumes that the tick frequency > >> is the same across all CPUs, but the TSC is per-CPU. I have an old SMP > >> system with CPUs of different frequency that can demonstrate bugs from > >> this. > > We definitely do not support configurations with different models of > > CPUs in SMP, this is what Simmetric is about. Different as in frequency > > or stepping. > > Are there any fundamental reasons for us to not support that configuration in > situations where hardware and BIOS (in x86 case) happen to support it? > > I am personally more interested in non-uniform topologies like one package > having > two cores and another having four. We do not handle CPU errata/quirks individually per-core. I think that we assume that all cores have the same stepping and thus require the same workarounds, if any, as BSP. Also, I think tsc calibration is done only on BSP, but I may be wrong there. pgpB2UVUIyuIg.pgp Description: PGP signature
Re: get_cyclecount(9) deprecation
On Fri, Mar 18, 2011 at 05:26:53PM +1100, Bruce Evans wrote: ... > - set cputicker() has some design bugs. It assumes that the tick frequency > is the same across all CPUs, but the TSC is per-CPU. I have an old SMP > system with CPUs of different frequency that can demonstrate bugs from > this. We definitely do not support configurations with different models of CPUs in SMP, this is what Simmetric is about. Different as in frequency or stepping. pgp01PNFgiwx3.pgp Description: PGP signature
Re: usertime and systime
On Wed, Mar 16, 2011 at 12:56:14PM -0500, Dan Nelson wrote: > In the last episode (Mar 16), Thiago Damas said: > > Hi, > > without procfs, there is a way to get usertime and systime from a > > running process? > > Try applying the attached patch to ps. I've had it for a while but never > submitted a PR. > > Heh. I've had it for a very long time. > http://lists.freebsd.org/pipermail/freebsd-hackers/2009-March/027918.html Yes, apparently, this is often requested feature. I dislike the copying of the existing code, sincere up to the comment that is not quite relevant (about interrupts in systime). I slightly reorganized the patch to reduce the copy/paste part of it. Do you have comments ? diff --git a/bin/ps/extern.h b/bin/ps/extern.h index faeeb19..eb1ede6 100644 --- a/bin/ps/extern.h +++ b/bin/ps/extern.h @@ -81,12 +81,14 @@ int s_uname(KINFO *); voidshowkey(void); voidstarted(KINFO *, VARENT *); voidstate(KINFO *, VARENT *); +voidsystime(KINFO *, VARENT *); voidtdev(KINFO *, VARENT *); voidtdnam(KINFO *, VARENT *); voidtname(KINFO *, VARENT *); voiducomm(KINFO *, VARENT *); voiduname(KINFO *, VARENT *); voidupr(KINFO *, VARENT *); +voidusertime(KINFO *, VARENT *); voidvsize(KINFO *, VARENT *); voidwchan(KINFO *, VARENT *); __END_DECLS diff --git a/bin/ps/keyword.c b/bin/ps/keyword.c index 3bcc23b..09eb756 100644 --- a/bin/ps/keyword.c +++ b/bin/ps/keyword.c @@ -189,6 +189,7 @@ static VAR var[] = { UINT, UIDFMT, 0}, {"svuid", "SVUID", NULL, 0, kvar, NULL, UIDLEN, KOFF(ki_svuid), UINT, UIDFMT, 0}, + {"systime", "SYSTIME", NULL, USER, systime, NULL, 9, 0, CHAR, NULL, 0}, {"tdaddr", "TDADDR", NULL, 0, kvar, NULL, sizeof(void *) * 2, KOFF(ki_tdaddr), KPTR, "lx", 0}, {"tdev", "TDEV", NULL, 0, tdev, NULL, 5, 0, CHAR, NULL, 0}, @@ -210,6 +211,8 @@ static VAR var[] = { KOFF(ki_paddr), KPTR, "lx", 0}, {"user", "USER", NULL, LJUST|DSIZ, uname, s_uname, USERLEN, 0, CHAR, NULL, 0}, + {"usertime", "USERTIME", NULL, USER, usertime, NULL, 9, 0, CHAR, NULL, + 0}, {"usrpri", "", "upr", 0, NULL, NULL, 0, 0, CHAR, NULL, 0}, {"vsize", "", "vsz", 0, NULL, NULL, 0, 0, CHAR, NULL, 0}, {"vsz", "VSZ", NULL, 0, vsize, NULL, 6, 0, CHAR, NULL, 0}, diff --git a/bin/ps/print.c b/bin/ps/print.c index 46b979b..253793a 100644 --- a/bin/ps/print.c +++ b/bin/ps/print.c @@ -550,12 +550,11 @@ vsize(KINFO *k, VARENT *ve) (void)printf("%*lu", v->width, (u_long)(k->ki_p->ki_size / 1024)); } -void -cputime(KINFO *k, VARENT *ve) +static void +printtime(KINFO *k, VARENT *ve, long secs, long psecs) +/* psecs is "parts" of a second. first micro, then centi */ { VAR *v; - long secs; - long psecs; /* "parts" of a second. first micro, then centi */ char obuff[128]; static char decimal_point; @@ -566,20 +565,7 @@ cputime(KINFO *k, VARENT *ve) secs = 0; psecs = 0; } else { - /* -* This counts time spent handling interrupts. We could -* fix this, but it is not 100% trivial (and interrupt -* time fractions only work on the sparc anyway). XXX -*/ - secs = k->ki_p->ki_runtime / 100; - psecs = k->ki_p->ki_runtime % 100; - if (sumrusage) { - secs += k->ki_p->ki_childtime.tv_sec; - psecs += k->ki_p->ki_childtime.tv_usec; - } - /* -* round and scale to 100's -*/ + /* round and scale to 100's */ psecs = (psecs + 5000) / 1; secs += psecs / 100; psecs = psecs % 100; @@ -590,6 +576,53 @@ cputime(KINFO *k, VARENT *ve) } void +cputime(KINFO *k, VARENT *ve) +{ + long secs, psecs; + + /* +* This counts time spent handling interrupts. We could +* fix this, but it is not 100% trivial (and interrupt +* time fractions only work on the sparc anyway). XXX +*/ + secs = k->ki_p->ki_runtime / 100; + psecs = k->ki_p->ki_runtime % 100; + if (sumrusage) { + secs += k->ki_p->ki_childtime.tv_sec; + psecs += k->ki_p->ki_childtime.tv_usec; + } + printtime(k, ve, secs, psecs); +} + +void +systime(KINFO *k, VARENT *ve) +{ + long secs, psecs; + + secs = k->ki_p->ki_rusage.ru_stime.tv_sec; + psecs = k->ki_p->ki_rusage.ru_stime.tv_usec; + if (sumrusage) { + secs += k->ki_p->ki_childstime.tv_sec; + psecs += k->ki_p->ki_childstime.tv_usec; + } + printtime(k, ve, secs, psecs); +} + +void +usertime(KINFO *k, VARENT *ve) +{ + long secs, psecs; + + secs = k->ki_p->ki_rusage.ru_uti
Re: SIGSTOP and SIGKILL
On Sun, Mar 13, 2011 at 11:30:00PM -0700, Ravi Murty wrote: > I haven't, are there specific improvements in this area of the kernel? First, the 8.2, compared to 8.0, changed the mechanism of delivering the process-global signal to a thread. Now, the thread to deliver is selected at the moment of delivery, instead of the moment of post. This was done to close a race where either thread changed signal mask after post, or exited after post, causing signal be undefinitely delayed or ignored, respectively. Second, 8.2 less often stops the threads in the sleep state, more often trying to move the thread to safer user<->kernel boundary, to take a stop. Sleeping thread might indicate the unsafe sleep (from the POV of stopping) by using PBDRY flag to msleep. Both the changes are not directly address the behaviour you described, but they are explicitely in the area of your interest. BTW, it is not clear to me, does the process in the state as you described, able to unwind from the stop state using SIGCONT ? I.e., is it 'weird but recoverable' state, or is it 'weird and unrecoverable' state ? > > On Mar 13, 2011, at 10:30 PM, Erich Dollansky > wrote: > > > Hi, > > > > On Monday 14 March 2011 11:41:21 Ravi Murty wrote: > > > >> > >> Any good solutions to this problem? > > > > did you try your program on 8.2? > > > > Erich > ___ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" pgpQJkg72lkQG.pgp Description: PGP signature
Re: libdispatch don't build on 8.2-RELEASE amd64
On Wed, Mar 02, 2011 at 12:19:01PM -0300, Danilo Egea wrote: > The problem is the binutils version, the port try to use the > binutils-2.21. With the binutils-2.15 (native of the system) works fine. Rather, it is binutils 2.15 silently creating broken library. 2.21 refuses to do it. Some object used to create the final dso was not built with -fPIC. > > On 2/27/11 2:37 PM, Danilo Egea wrote: > >Worked with GCC, but support blocks is disabled... :( > > > >On 2/27/11 1:42 AM, Danilo Egea wrote: > >>Hi guys, > >> > >>Anyone know what's going on? > >> > >>PS: with clang-devel > >> > >>libtool: compile: clang -DHAVE_CONFIG_H -I. -I../config -I.. -I.. > >>-fPIC -MT time.lo -MD -MP -MF .deps/time.Tpo -c shims/time.c -fPIC > >>-DPIC -o .libs/time.o > >>libtool: compile: clang -DHAVE_CONFIG_H -I. -I../config -I.. -I.. > >>-fPIC -MT time.lo -MD -MP -MF .deps/time.Tpo -c shims/time.c -o > >>time.o >/dev/null 2>&1 > >>mv -f .deps/time.Tpo .deps/time.Plo > >>/bin/sh ../libtool --tag=CC--mode=link clang -fPIC-o > >>libshims.la mach.lo time.lo -lpthread -L/usr/local/lib > >>-lBlocksRuntime > >>libtool: link: ar cru .libs/libshims.a .libs/mach.o .libs/time.o > >>libtool: link: ranlib .libs/libshims.a > >>libtool: link: ( cd ".libs" && rm -f "libshims.la" && ln -s > >>"../libshims.la" "libshims.la" ) > >>/bin/sh ../libtool --tag=CC--mode=link clang -Wall -fblocks > >>-fPIC -o libdispatch.la -rpath /usr/local/lib > >>libdispatch_la-apply.lo libdispatch_la-benchmark.lo > >>libdispatch_la-object.lo libdispatch_la-once.lo > >>libdispatch_la-queue.lo libdispatch_la-queue_kevent.lo > >>libdispatch_la-semaphore.lo libdispatch_la-source.lo > >>libdispatch_la-source_kevent.lo libdispatch_la-time.lo > >>libshims.la -lpthread -L/usr/local/lib -lBlocksRuntime > >>libtool: link: clang -shared .libs/libdispatch_la-apply.o > >>.libs/libdispatch_la-benchmark.o .libs/libdispatch_la-object.o > >>.libs/libdispatch_la-once.o .libs/libdispatch_la-queue.o > >>.libs/libdispatch_la-queue_kevent.o .libs/libdispatch_la-semaphore.o > >>.libs/libdispatch_la-source.o .libs/libdispatch_la-source_kevent.o > >>.libs/libdispatch_la-time.o -Wl,--whole-archive ./.libs/libshims.a > >>-Wl,--no-whole-archive -L/usr/local/lib -lpthread -lBlocksRuntime > >>-Wl,-soname -Wl,libdispatch.so.0 -o .libs/libdispatch.so.0 > >>/usr/local/bin/ld: .libs/libdispatch_la-apply.o: relocation > >>R_X86_64_PC32 against symbol `_dispatch_hw_config' can not be used > >>when making a shared object; recompile with -fPIC > >>/usr/local/bin/ld: final link failed: Bad value > >>clang: error: linker command failed with exit code 1 (use -v to see > >>invocation) > >>*** Error code 1 > >> > >>Stop in /usr/ports/devel/libdispatch/work/libdispatch-r174/src. > >>*** Error code 1 > >> > >>Stop in /usr/ports/devel/libdispatch/work/libdispatch-r174/src. > >>*** Error code 1 > >> > >>Stop in /usr/ports/devel/libdispatch/work/libdispatch-r174. > >>*** Error code 1 > >> > >>Stop in /usr/ports/devel/libdispatch. > >>*** Error code 1 > >> > >>Stop in /usr/ports/devel/libdispatch. > >> > > > > > > > -- > Danilo Eg?a Gondolfo > http://daniloegea.wordpress.com > > __ > Fale com seus amigos de gra?a com o novo Yahoo! Messenger > http://br.messenger.yahoo.com/ > ___ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" pgpLtps2kgy7Z.pgp Description: PGP signature
Re: buildkernel error
On Wed, Feb 23, 2011 at 02:24:05PM +0900, mmats...@cybernet.co.jp wrote: > From: Garrett Cooper > Date: Tue, 22 Feb 2011 21:04:24 -0800 > ::On Tue, Feb 22, 2011 at 7:54 PM, wrote: > ::> From: Dimitry Andric > ::> Date: Tue, 22 Feb 2011 13:39:31 +0100 > ::> ::On 2011-02-22 08:30, gnehzuil wrote: > ::> ::> I updated my kernel source code and try to make a new kernel using > make > ::> ::> buildkernel command. But I got an error as follow: > ::> ::... > ::> ::> ld:/usr/src/sys/conf/ldscript.i386:66: syntax error > ::> :: > ::> ::Your /usr/bin/ld is still at version 2.15, which is too old to parse the > ::> ::kernel linker script. In this case, first run "make buildworld", or at > ::> ::least "make kernel-toolchain" before you attempt to build any kernels. > ::> > ::> Hello, > ::> > ::> Does it mean we have no-way of source upgrading from 8.X? > ::> We need newest world to build 9.x kernel, and need 9.x kernel to run > ::> newest world, and... > :: > ::No; you have to do something like the following: > :: > ::[kernel-]toolchain buildworld buildkernel. > :: > ::A plus side of doing this is that I do kernel-toolchain at -j12 > ::and then do buildworld buildkernel at -j12 as well. It's much quicker > ::than buildworld buildkernel at -j1, and less error prone than doing it > ::at any other -j value in parallel. > ::The handbook just says buildworld buildkernel ( > ::http://www.freebsd.org/doc/handbook/makeworld.html ), and UPDATING > ::just says kernel-toolchain buildkernel installkernel (look for "To > ::build a kernel"), but there aren't any official directions in the > ::quick to find spots that mention those above steps. > > Ahh, thanks for pointers. > I was just reading the "To upgrade in-place from 8.x-stable to current" > from UPDATING, which does not say anything about [kernel-]toolchain. > May by that part, and some others, also needs updating. You do not need to do anything with kernel-toolchain. THe proper procedure of buildworld/buildkernel just works. pgpnzJHiiKAHt.pgp Description: PGP signature
Re: svn commit: r218953 - stable/8/usr.sbin/sysinstall
On Tue, Feb 22, 2011 at 01:06:38PM -0500, Ben Kaduk wrote: > [replying to the MFC that triggered the connection] > > On Tue, Feb 22, 2011 at 12:38 PM, Bruce Cran wrote: > > Author: brucec > > Date: Tue Feb 22 17:38:43 2011 > > New Revision: 218953 > > URL: http://svn.freebsd.org/changeset/base/218953 > > > > Log: > > MFC r218840: > > > > Remove the quotas option from the Startup Services menu. > > GENERIC has no support for quotas so this option has no effect. > > Do you know why GENERIC does not have quota support enabled? I note > that the Debian/kFreeBSD folk have enabled quota support for the > kernel they ship: > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=608995 > > In that report, emaste@ is quoted as saying: > > If you mean "kFreeBSD should ship with a quota-enabled kernel" then I'd > > say go for it. The reason we don't have it on in upstream FreeBSD is > > largely historical; enabling quotas used to require additional locking > > that caused performance and other issues. The additional locking is now > > not required, and it's just that nobody has stepped in to turn them on. > > If you believe everything you read on the internet, "In order to > achieve a modern operating system, the quota support should be active > by default, not achieved only after compilation of a custom kernel." > > Is there more to "stepping in and turning them on" than just the > one-line change? I promise to enable UFS quotas in GENERIC in one week unless anybody objects now. pgpq299AMLKXG.pgp Description: PGP signature
Re: FreeBSD ABI?
On Mon, Feb 21, 2011 at 03:58:32PM -0800, Yuri wrote: > On 02/21/2011 15:38, Joerg Sonnenberger wrote: > >That's a major difference. The Linux people decided a while ago that > >stack alignment should be 16 Byte. GCC effectively forces that down > >everyone's throat because until at least GCC 4.2 or 4.3, it can't > >correctly realign the stack and just fails miserable. I would be > >surprised if it was a conscious decision for the Solaris either. > > > > I filed gcc PR asking gcc to revert their behavior back to prescribed by > documentation: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47842 The i386 ABI was de-facto changed to require 16-byte stack alignment. The change is reasonable, both due to modern CPU cache behaviour, and to not put a block on the usage of the new CPU features (SSE instructions sets). There was a combination of bugs in gcc/glibc/bsd libc etc that made this not a reliable feature. I think there is no point of moving back to 4-byte alignment. pgptQV11S5x0S.pgp Description: PGP signature
Re: Can vm_mmap()/vm_map_remove() be called with giant held? (linuxolator dvb patches)
On Fri, Feb 18, 2011 at 09:55:42PM +0100, Juergen Lock wrote: > I have finally got back to this and did the style and vm_map_remove() > return value handling fixes, updated the patches in-place: > > http://people.freebsd.org/~nox/dvb/linux-dvb.patch > > (for head) > > http://people.freebsd.org/~nox/dvb/linux-dvb-8.patch > > (for 8.) > > On Sun, Jan 30, 2011 at 12:54:48AM +0100, Juergen Lock wrote: > > On Sat, Jan 29, 2011 at 10:51:05PM +0200, Kostik Belousov wrote: > > > On Sat, Jan 29, 2011 at 09:10:00PM +0100, Juergen Lock wrote: > > > > Hi! > > > > > > > > I was kinda hoping to be able to post a correct patch in public but > > > > getting an answer to ${Subject} seems to be more difficult than I > > > > thought... :) So, does anyone here know? copyout_map() and > > > You do not need Giant locked for vm_map* functions. > > > > > The question was more do I need to drop it first before calling them... > > > > > > copyout_unmap() are copied from ksyms_map() from sys/dev/ksyms/ksyms.c > > > > - should there maybe be global versions instead of two static copies > > > > each, and what would be good names? And giant is taken by linux_ioctl() > > > Would you make a patch for this ? > > > > > Heh if you want me to... Where should they go and are my name choices ok? > > > I haven't done this yet so people can keep patching linux.ko in-place > without having to build a new kernel too... Separate build of linux.ko is not quite supported action. I would greatly prefer to have the move of these two functions before the rest of the patch comes in. Together with conversion of other users. I propose to put it into vm/vm_glue.c. > > > > > in the same source file before calling the parts I added. So here > > > > comes the patch, it is to add support for dvb ioctls to the linuxolator > > > > as discussed on -emulation earlier in this thread: > > > > > > > > > > > > http://lists.freebsd.org/pipermail/freebsd-multimedia/2011-January/011575.html > > > > > > > > (patch also at: > > > > > > > > http://people.freebsd.org/~nox/dvb/linux-dvb.patch > > > > > > > > and a version for 8, which is what I tested with w_scan on dvb-s2 > > > > and dvb-t, and Andrew Gallatin also tested it with SageTV: > > > > > > > > http://people.freebsd.org/~nox/dvb/linux-dvb-8.patch > > > > > > > > ) > > > > > > > > > + /* > > > > +* Map somewhere after heap in process memory. > > > > +*/ > > > > + PROC_LOCK(td->td_proc); > > > > + *addr = round_page((vm_offset_t)vms->vm_daddr + > > > > + lim_max(td->td_proc, RLIMIT_DATA)); > > > > + PROC_UNLOCK(td->td_proc); > > > Are you sure that this is needed ? Why not leave the address selection > > > to the VM ? > > > > > I don't know, maybe sys/dev/ksyms/ksyms.c has a reason? > > How would I leave the address selection to the VM? Just trying > to initialize *addr to (vm_offset_t)NULL there caused the patch to > stop working. I believe you should do *addr = 0; vm_mmap(map, addr); pgp2xDmVJigQs.pgp Description: PGP signature
Re: spontaneous reboot - ptrace
On Fri, Feb 18, 2011 at 03:58:05AM -0800, Dr. Baud wrote: > > > > First, do you have a console output during the run ? Is it possible > > that machine paniced ? If not, were there any kernel messages before> > > reboot ? > > > > Second, what is the process you are dumping ? Can you show at least > > the procstat -v output for the process ? > > Sorry for this late response but my earlier response appears to have been > consumed by the email reflector. > > I'm getting an NMI. And the trouble appears to be when trying to > read via ptrace memory segments of type KVME_TYPE_DEVICE. The app in > quesion allocates a large number of large memory buffers and maps them > into virtual address space. Not all segments cause the NMI however. > Small sampling of the virtual memory map. You did not provided exact console output on the panic, despite requested. What is the device that was mmaped ? Obviously, this is your problem. Can you gather all required information ? > > PID STARTEND PRT RES PRES REF SHD FL TP PATH > 931 0x40 0x1773000 r-x 4239 5053 2 1 CN vn > /mnt/dia > g/problemchild > 931 0x1872000 0x2ef2000 rw- 4160 1 0 C- vn > /mnt/dia > g/problemchld > 931 0x2ef2000 0x670 rw- 117650 1 0 C- df > 9310x8018720000x8018a2000 r-x 480 57 28 CN vn > /libexec > /ld-elf.so.1 > 9310x8018a20000x8018b3000 rw- 150 1 0 C- df > 9310x8018b30000x801924000 rw- 1130 1698 0 -- dv > 9310x8019240000x801925000 rw-10 1698 0 -- dv > 9310x8019250000x801926000 rw-10 1698 0 -- dv > 9310x8019260000x801927000 rw-10 1698 0 -- dv > 9310x8019270000x801928000 rw-10 1698 0 -- dv > 9310x8019280000x80192a000 rw-20 1698 0 -- dv > 9310x80192a0000x80192b000 rw-10 1698 0 -- dv > 9310x80192b0000x80192c000 rw-10 1698 0 -- dv > 9310x80192c0000x80192d000 rw-10 1698 0 -- dv > 9310x80192d0000x80192e000 rw-10 1698 0 -- dv > 9310x80192e0000x80193 rw-20 1698 0 -- dv > 9310x801930x801932000 rw-20 1698 0 -- dv > 9310x8019320000x801993000 rw- 970 1698 0 -- dv > 9310x8019930000x8019a rw- 130 1698 0 -- dv > 9310x8019a0x8019a1000 rw-10 1698 0 -- dv > > > > 9310x802ab70000x802bb7000 ---00 1 0 CN df > 9310x0x802bb700x0x802bd6000 rw- 170 1 0 C- vn > /lib/lib > c.so.7 > 9310x802bd60000x802bf1000 rw- 180 1 0 C- df > 9310x802bf10000x802bfd000 r-x9 14 2 1 CN vn > /lib/lib > gcc_s.so.1 > 9310x802bfd0000x802cfc000 ---00 1 0 CN df > 9310x802cfc0000x802cfe000 rw-20 1 0 CN vn > /lib/lib > gcc_s.so.1 > 9310x802cfe0000x802cff000 rw-10 1698 0 -- dv > 9310x802cff0000x802d0 rw-10 1698 0 -- dv > 9310x802d00x802e0 rw- 1320 1 0 C- df > 9310x802e00x802f0 rw- 1090 1 0 C- df > 9310x802f00x802f91000 rw- 1450 1698 0 -- dv > 9310x802f910000x802fb2000 rw- 330 1698 0 -- dv > 9310x802fb20000x802fd3000 rw- 330 1698 0 -- dv > 9310x802fd30000x802ff5000 rw- 340 1698 0 -- dv > 9310x802ff50000x802ff6000 rw-10 1698 0 -- dv > > > > ___ > freebsd-hackers@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" pgp5NxCNGfhpX.pgp Description: PGP signature
Re: sched_setscheduler() behaviour changed??
On Thu, Feb 17, 2011 at 10:50:06AM +0100, Mats Lindberg wrote: > All, > I have been using a small program /rt) that utilize the sched_setscheduler() > syscall to set the scheduling policy of a process to SCHED_RR. Been running > it FBSD 5.x and 6.x. Now when migrating to FBSD 8.1 I get EPERM back at me. > used to be able to run it like e.g. > > ./rt -sr -p2 -- prog > > which started in SCHED_RR policy with priority 2. > > now in FBSD 8.1 I get EPERM > > But If I do > > rtprio 10 ./rt -sr -p2 -- prog > > it I dont get EPERM. > > I'm always root when doing this. > > My problem is that I have customers that need to run their old 5.x 6.x > applications 'as is' in 8.1 whithout changing anything. > > Does anyone know if there is a workaround? > sysctl? kernel hint? kernel config? reverting to 4BSD scheduler? If you want help from the list, you should provide some data to diagnose the issue. Obviously, we cannot read the sources of your rt utility. At least, you can provide ktrace/kdump output for start. pgppBPgaCVFOS.pgp Description: PGP signature
Re: spontaneous reboot - ptrace
On Tue, Feb 15, 2011 at 09:46:05AM -0800, Dr. Baud wrote: > I've an interesting anomaly I'd appreciate some help with. > > As a result of looking at a threading problem I modified gcore > to dump all memory segments associated with a process; > basically comment out the "Ignore" conditional in readmap(). > The result is that the box spontaneously reboots sometime > Between 40 and 50 seconds into the dump; this is an application > that is indeed a memory hog. I commented out the actual > write to disk and the result is the same. I???ve even disabled the > watchdog to no avail. > > This is a vanilla 8.1 install: > > FreeBSD pollyanna 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49 UTC > 2010 > r...@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64 > > running on fairly vanilla hardware: > > CPU: Intel(R) Xeon(TM) CPU 3.20GHz (3212.93-MHz K8-class CPU) > Origin = "GenuineIntel" Id = 0xf49 Family = f Model = 4 Stepping = 9 > > Features=0xbfebfbff MOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> > Features2=0x641d > AMD Features=0x2800 > AMD Features2=0x1 > TSC: P-state invariant > > > Any thoughts on how to proceed? First, do you have a console output during the run ? Is it possible that machine paniced ? If not, were there any kernel messages before reboot ? Second, what is the process you are dumping ? Can you show at least the procstat -v output for the process ? pgpQv92Xs1EIb.pgp Description: PGP signature
Re: ptrace weirdness with 9.0-CURRENT
On Wed, Feb 09, 2011 at 01:49:52AM +0200, Kostik Belousov wrote: > On Wed, Feb 09, 2011 at 12:42:15AM +0200, Ali Polatel wrote: > > Hello everyone, > > > > I'm the developer of pinktrace - http://dev.exherbo.org/~alip/pinktrace/ > > - a simple ptrace() wrapper library for FreeBSD and Linux. I have set up > > a FreeBSD-9.0-CURRENT VM today to test various new features recently > > added to ptrace(). This is about a behaviour difference between > > 8.1-RELEASE and 9.0-CURRENT which I've noticed through a unit test of > > pinktrace. I don't want to bother you with the internals of this library > > so I'll briefly explain the problem. > > > > I've inserted the testcase I've used below. The aim is to trace a > > open(NULL, 0) call which should fail with EFAULT. Running this on two > > different VMs I get: > > > > % uname -a > > FreeBSD 9.0-CURRENT FreeBSD 9.0-CURRENT #0: Wed Feb 9 05:02:31 EET 2011 > > root@:/usr/obj/usr/src/sys/GENERIC amd64 > > % sudo cat /root/world.txt > > -- > > >>> World build completed on Wed Feb 9 00:23:30 EET 2011 > > -- > > % gcc -Wall ptrace-amd64-fbsd-return.c > > % ./a.out > > retval:0 error:0 > > > > $ uname -a > > FreeBSD 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49 UTC 2010 > > r...@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64 > > $ gcc -Wall ptrace-amd64-fbsd-return.c > > $ ./a.out > > retval:14 error:1 > > $ > > > > Important note: I couldn't notice a problem with truss tracing a > > open(NULL, 0) call so I think this is a problem with my testcase. > > I'll be happy if you can shed some light on what I'm doing wrong here: > There is no issue with ptrace(2). Your test fails because, apparently, > rtld in HEAD calls setjmp(3) when resolving symbols, and setjmp(3) > calls sigprocmask(2). The end result is that you get SCX event for > sigprocmask, and not for your open(2). > > The issue with sigprocmask call from setjmp shall be fixed, but this > is not an issue with ptrace(2). The following should fix the problem. diff --git a/libexec/rtld-elf/rtld.c b/libexec/rtld-elf/rtld.c index 50ab393..948cf49 100644 --- a/libexec/rtld-elf/rtld.c +++ b/libexec/rtld-elf/rtld.c @@ -560,7 +560,7 @@ _rtld_bind(Obj_Entry *obj, Elf_Size reloff) RtldLockState lockstate; rlock_acquire(rtld_bind_lock, &lockstate); -if (setjmp(lockstate.env) != 0) +if (sigsetjmp(lockstate.env, 0) != 0) lock_upgrade(rtld_bind_lock, &lockstate); if (obj->pltrel) rel = (const Elf_Rel *) ((caddr_t) obj->pltrel + reloff); @@ -2142,7 +2142,7 @@ dlopen(const char *name, int mode) ld_tracing = (mode & RTLD_TRACE) == 0 ? NULL : "1"; if (ld_tracing != NULL) { rlock_acquire(rtld_bind_lock, &lockstate); - if (setjmp(lockstate.env) != 0) + if (sigsetjmp(lockstate.env, 0) != 0) lock_upgrade(rtld_bind_lock, &lockstate); environ = (char **)*get_program_var_addr("environ", &lockstate); lock_release(rtld_bind_lock, &lockstate); @@ -2264,7 +2264,7 @@ do_dlsym(void *handle, const char *name, void *retaddr, const Ver_Entry *ve, req.lockstate = &lockstate; rlock_acquire(rtld_bind_lock, &lockstate); -if (setjmp(lockstate.env) != 0) +if (sigsetjmp(lockstate.env, 0) != 0) lock_upgrade(rtld_bind_lock, &lockstate); if (handle == NULL || handle == RTLD_NEXT || handle == RTLD_DEFAULT || handle == RTLD_SELF) { diff --git a/libexec/rtld-elf/rtld.h b/libexec/rtld-elf/rtld.h index 8941d29..bb365a7 100644 --- a/libexec/rtld-elf/rtld.h +++ b/libexec/rtld-elf/rtld.h @@ -276,7 +276,7 @@ typedef struct Struct_DoneList { struct Struct_RtldLockState { int lockstate; - jmp_buf env; + sigjmp_buf env; }; /* diff --git a/libexec/rtld-elf/rtld_lock.c b/libexec/rtld-elf/rtld_lock.c index e76a4da..024e1e2 100644 --- a/libexec/rtld-elf/rtld_lock.c +++ b/libexec/rtld-elf/rtld_lock.c @@ -259,7 +259,7 @@ lock_restart_for_upgrade(RtldLockState *lockstate) case RTLD_LOCK_WLOCKED: break; case RTLD_LOCK_RLOCKED: - longjmp(lockstate->env, 1); + siglongjmp(lockstate->env, 1); break; default: assert(0); pgpr3uownAfno.pgp Description: PGP signature
Re: ptrace weirdness with 9.0-CURRENT
On Wed, Feb 09, 2011 at 12:42:15AM +0200, Ali Polatel wrote: > Hello everyone, > > I'm the developer of pinktrace - http://dev.exherbo.org/~alip/pinktrace/ > - a simple ptrace() wrapper library for FreeBSD and Linux. I have set up > a FreeBSD-9.0-CURRENT VM today to test various new features recently > added to ptrace(). This is about a behaviour difference between > 8.1-RELEASE and 9.0-CURRENT which I've noticed through a unit test of > pinktrace. I don't want to bother you with the internals of this library > so I'll briefly explain the problem. > > I've inserted the testcase I've used below. The aim is to trace a > open(NULL, 0) call which should fail with EFAULT. Running this on two > different VMs I get: > > % uname -a > FreeBSD 9.0-CURRENT FreeBSD 9.0-CURRENT #0: Wed Feb 9 05:02:31 EET 2011 > root@:/usr/obj/usr/src/sys/GENERIC amd64 > % sudo cat /root/world.txt > -- > >>> World build completed on Wed Feb 9 00:23:30 EET 2011 > -- > % gcc -Wall ptrace-amd64-fbsd-return.c > % ./a.out > retval:0 error:0 > > $ uname -a > FreeBSD 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49 UTC 2010 > r...@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64 > $ gcc -Wall ptrace-amd64-fbsd-return.c > $ ./a.out > retval:14 error:1 > $ > > Important note: I couldn't notice a problem with truss tracing a > open(NULL, 0) call so I think this is a problem with my testcase. > I'll be happy if you can shed some light on what I'm doing wrong here: There is no issue with ptrace(2). Your test fails because, apparently, rtld in HEAD calls setjmp(3) when resolving symbols, and setjmp(3) calls sigprocmask(2). The end result is that you get SCX event for sigprocmask, and not for your open(2). The issue with sigprocmask call from setjmp shall be fixed, but this is not an issue with ptrace(2). > > #include > #include > #include > > #include > #include > > #include > #include > #include > #include > #include > #include > #include > > #undef NDEBUG > #include > > int > main(void) > { > int status; > pid_t pid; > > if ((pid = fork()) < 0) { > perror("fork"); > abort(); > } > else if (!pid) { /* child */ > assert(!(ptrace(PT_TRACE_ME, 0, NULL, 0) < 0)); > kill(getpid(), SIGSTOP); > open(NULL, 0); > fprintf(stderr, "open: (errno:%d %s)\n", errno, > strerror(errno)); > _exit(0); > } > else { > assert(!(waitpid(pid, &status, 0) < 0)); > assert(WIFSTOPPED(status)); > assert(WSTOPSIG(status) == SIGSTOP); > > assert(!(ptrace(PT_TO_SCX, pid, (caddr_t)1, 0) < 0)); > assert(!(waitpid(pid, &status, 0) < 0)); > assert(WIFSTOPPED(status)); > assert(WSTOPSIG(status) == SIGTRAP); > > #if defined(PT_LWPINFO) && defined(PL_FLAG_SCX) > struct ptrace_lwpinfo info; > assert(!(ptrace(PT_LWPINFO, pid, (caddr_t)&info, sizeof(struct > ptrace_lwpinfo)) < 0)); > assert(info.pl_flags & PL_FLAG_SCX); > #endif > > struct reg r; > assert(!(ptrace(PT_GETREGS, pid, (caddr_t)&r, 0) < 0)); > > printf("retval:%ld error:%d\n", r.r_rax, !!(r.r_rflags & > PSL_C)); > > ptrace(PT_CONTINUE, pid, (caddr_t)1, 0); > waitpid(pid, &status, 0); > > return 0; > } > } > > -- > Regards, > Ali Polatel pgpF6KEnGyLRP.pgp Description: PGP signature
Re: Can vm_mmap()/vm_map_remove() be called with giant held? (linuxolator dvb patches)
On Sun, Jan 30, 2011 at 12:54:48AM +0100, Juergen Lock wrote: > On Sat, Jan 29, 2011 at 10:51:05PM +0200, Kostik Belousov wrote: > > On Sat, Jan 29, 2011 at 09:10:00PM +0100, Juergen Lock wrote: > > > Hi! > > > > > > I was kinda hoping to be able to post a correct patch in public but > > > getting an answer to ${Subject} seems to be more difficult than I > > > thought... :) So, does anyone here know? copyout_map() and > > You do not need Giant locked for vm_map* functions. > > > The question was more do I need to drop it first before calling them... No, you do not need to drop Giant. [snip] > > > + error = ENOMEM; > > > + goto out2; > > > + } > > > + error = copyin((void *) vps.props, l_vp, l_propsiz); > > > + if (error) > > > + goto out2; > > > + for (i = vps.num, l_p = l_vp, p = vp; i--; ++l_p, ++p) > > > + linux_to_bsd_dtv_property(l_p, p); > > > + > > > + error = copyout_map(td, &uvp, propsiz); > > > + if (error) > > > + goto out2; > > > + copyout(vp, (void *) uvp, propsiz); > > > + > > > + if ((error = fget(td, args->fd, &fp)) != 0) { > > > + (void) copyout_unmap(td, uvp, propsiz); > > > + goto out2; > > > + } > > > + vps.props = (void *) uvp; > > > + if ((args->cmd & 0x) == LINUX_FE_SET_PROPERTY) > > > + error = fo_ioctl(fp, FE_SET_PROPERTY, &vps, > > > td->td_ucred, td); > > > + else > > > + error = fo_ioctl(fp, FE_GET_PROPERTY, &vps, > > > td->td_ucred, td); > > > + if (error) { > > > + (void) copyout_unmap(td, uvp, propsiz); > > > + goto out; > > > + } > > > + error = copyin((void *) uvp, vp, propsiz); > > > + (void) copyout_unmap(td, uvp, propsiz); > > No need in space between cast and expression. Bigger issue is that I > > do not understand this fragment. You do copyout_map(), and then > > immediately call copyout_unmap() for the address range returned > > by copyout_map(), or am I missing something ? > > > The vm allocated by copyout_map() is only needed for the fo_ioctl() > call because the struct passed to FE_[GS]ET_PROPERTY describes an > array that the drivers then copyin() and (possibly) copyout() > themselves. So that array needs to be translated from/to the 32bit > Linux version to (possibly) 64bit on the host (linux_to_bsd_dtv_property), > and the 64bit version is larger so it doesn't fit in the original > place in the userland vm. And am I right that the drivers can only take this array from the usermode ? How is the compatibility for 32/64 bit mode is handled by native FE_SET_PROPERTY handlers ? I could only say that the hack is atrocious. Might be, you indeed have no choice there. > > (One optimization here would be to only do this when the sizes > differ i.e. in the 32bit-on-64 case...) > > Thanx! > Juergen pgpMEErYWv5o5.pgp Description: PGP signature
Re: Can vm_mmap()/vm_map_remove() be called with giant held? (linuxolator dvb patches)
On Sat, Jan 29, 2011 at 09:10:00PM +0100, Juergen Lock wrote: > Hi! > > I was kinda hoping to be able to post a correct patch in public but > getting an answer to ${Subject} seems to be more difficult than I > thought... :) So, does anyone here know? copyout_map() and You do not need Giant locked for vm_map* functions. > copyout_unmap() are copied from ksyms_map() from sys/dev/ksyms/ksyms.c > - should there maybe be global versions instead of two static copies > each, and what would be good names? And giant is taken by linux_ioctl() Would you make a patch for this ? > in the same source file before calling the parts I added. So here > comes the patch, it is to add support for dvb ioctls to the linuxolator > as discussed on -emulation earlier in this thread: > > > http://lists.freebsd.org/pipermail/freebsd-multimedia/2011-January/011575.html > > (patch also at: > > http://people.freebsd.org/~nox/dvb/linux-dvb.patch > > and a version for 8, which is what I tested with w_scan on dvb-s2 > and dvb-t, and Andrew Gallatin also tested it with SageTV: > > http://people.freebsd.org/~nox/dvb/linux-dvb-8.patch > > ) > > Thanx! > Juergen > > Index: src/sys/compat/linux/linux_ioctl.c > === > RCS file: /home/scvs/src/sys/compat/linux/linux_ioctl.c,v > retrieving revision 1.167 > diff -u -p -r1.167 linux_ioctl.c > --- src/sys/compat/linux/linux_ioctl.c30 Dec 2010 02:18:04 - > 1.167 > +++ src/sys/compat/linux/linux_ioctl.c18 Jan 2011 17:10:21 - > @@ -59,6 +59,14 @@ __FBSDID("$FreeBSD: src/sys/compat/linux > #include > #include > #include > +#include > +#include > +#include > + > +#include > +#include > +#include > +#include > > #include > #include > @@ -83,6 +91,9 @@ __FBSDID("$FreeBSD: src/sys/compat/linux > #include > #include > > +#include > +#include > + > CTASSERT(LINUX_IFNAMSIZ == IFNAMSIZ); > > static linux_ioctl_function_t linux_ioctl_cdrom; > @@ -97,6 +108,7 @@ static linux_ioctl_function_t linux_ioct > static linux_ioctl_function_t linux_ioctl_drm; > static linux_ioctl_function_t linux_ioctl_sg; > static linux_ioctl_function_t linux_ioctl_v4l; > +static linux_ioctl_function_t linux_ioctl_dvb; > static linux_ioctl_function_t linux_ioctl_special; > static linux_ioctl_function_t linux_ioctl_fbsd_usb; > > @@ -124,6 +136,8 @@ static struct linux_ioctl_handler sg_han > { linux_ioctl_sg, LINUX_IOCTL_SG_MIN, LINUX_IOCTL_SG_MAX }; > static struct linux_ioctl_handler video_handler = > { linux_ioctl_v4l, LINUX_IOCTL_VIDEO_MIN, LINUX_IOCTL_VIDEO_MAX }; > +static struct linux_ioctl_handler dvb_handler = > +{ linux_ioctl_dvb, LINUX_IOCTL_DVB_MIN, LINUX_IOCTL_DVB_MAX }; > static struct linux_ioctl_handler fbsd_usb = > { linux_ioctl_fbsd_usb, FBSD_LUSB_MIN, FBSD_LUSB_MAX }; > > @@ -139,6 +153,7 @@ DATA_SET(linux_ioctl_handler_set, privat > DATA_SET(linux_ioctl_handler_set, drm_handler); > DATA_SET(linux_ioctl_handler_set, sg_handler); > DATA_SET(linux_ioctl_handler_set, video_handler); > +DATA_SET(linux_ioctl_handler_set, dvb_handler); > DATA_SET(linux_ioctl_handler_set, fbsd_usb); > > struct handler_element > @@ -2989,6 +3004,251 @@ linux_ioctl_special(struct thread *td, s > } > > /* > + * Map some anonymous memory in user space of size sz, rounded up to the page > + * boundary. > + */ > +static int > +copyout_map(struct thread *td, vm_offset_t *addr, size_t sz) > +{ > + struct vmspace *vms = td->td_proc->p_vmspace; Style recommends not to initialize in the declaration section. [I do not repeat systematic style violations notes below]. > + int error; > + vm_size_t size; > + > + One empty line is enough. > + /* > + * Map somewhere after heap in process memory. > + */ > + PROC_LOCK(td->td_proc); > + *addr = round_page((vm_offset_t)vms->vm_daddr + > + lim_max(td->td_proc, RLIMIT_DATA)); > + PROC_UNLOCK(td->td_proc); Are you sure that this is needed ? Why not leave the address selection to the VM ? > + > + /* round size up to page boundry */ > + size = (vm_size_t) round_page(sz); > + > + error = vm_mmap(&vms->vm_map, addr, size, PROT_READ | PROT_WRITE, > + VM_PROT_ALL, MAP_PRIVATE | MAP_ANON, OBJT_DEFAULT, NULL, 0); > + > + return (error); > +} > + > +/* > + * Unmap memory in user space. > + */ > +static int > +copyout_unmap(struct thread *td, vm_offset_t addr, size_t sz) > +{ > + vm_map_t map; > + vm_size_t size; > + > + map = &td->td_proc->p_vmspace->vm_map; > + size = (vm_size_t) round_page(sz); > + > + if (!vm_map_remove(map, addr, addr + size)) Better do if (vm_map_remove(...) != KERN_SUCCESS) > + return (EINVAL); > + > + return (0); > +} > + > +static int > +linux_to_bsd_dtv_properties(struct l_dtv_properties *lvps, struct > dtv_properties *vps) > +{ Empty line between (empty) declaration section and executable statements. > +
Re: rtld optimizations
On Thu, Jan 27, 2011 at 09:12:34PM +, Devin Teske wrote: > On Thu, 2011-01-27 at 22:59 +0200, Kostik Belousov wrote: > > > On Thu, Jan 27, 2011 at 08:50:48PM +, Devin Teske wrote: > > > Probably did something like this: > > > > > > time sh -c '( firefox & ); sleep 1000' > > > > > > and then pressed Ctrl-C when he felt that firefox was finished loading. > > > The moment Ctrl-C is pressed, time(1) shows how long it ran up until you > > > pressed Ctrl-C. > > > NOTE: Pressing Ctrl-C will not terminate the firefox instance. > > > > You cannot have 1/100 of seconds precision with this method. > > This is why I am asking, seeing < 0.1 seconds difference. > > Not to mention some methodical questions, like whether the caches were > > warmed before the measurement by several runs before the actual > > test. > > > Really? > > $ time sh -c '( firefox & ); sleep 1000' > ^C > > real0m5.270s > user0m0.000s > sys 0m0.005s > > > I'd call that 1/100th of a second precision, wouldn't you? > > HINT: Try using bash instead of csh. (I supposed that) obvious point of my mail is that you cannot reliably measure 1/100 second intervals when human interaction is involved. To make it completely obvious: human has to press CTRL-C, I did not mean reading the numbers from display. pgp3mBNWKIVxX.pgp Description: PGP signature
Re: rtld optimizations
On Thu, Jan 27, 2011 at 08:50:48PM +, Devin Teske wrote: > On Thu, 2011-01-27 at 22:31 +0200, Kostik Belousov wrote: > > > On Thu, Jan 27, 2011 at 12:37:54PM -0500, Mark Saad wrote: > > > On Thu, Jan 27, 2011 at 6:05 AM, David Naylor > > > wrote: > > > > On Wednesday 26 January 2011 06:49:11 Alexander Kabaev wrote: > > > >> On Tue, 25 Jan 2011 21:40:42 -0500 > > > >> > > > >> Mark Saad wrote: > > > >> > Hello Hackers > > > >> > > > > >> > The NetBSD folks have a nice improvement with the rtld-elf subsystem, > > > >> > known as "Negative Symbol Cache" . > > > >> > > > > >> > http://blog.netbsd.org/tnf/entry/netbsd_runtime_linker_gains_negative > > > >> > > > > >> > Roy Marples roy@ has a simple write up of the change. > > > >> > > > > >> > I took the basic idea from FreeBSD, but improved the performance > > > >> > drastically. Basically, the huge win is by caching both breadth and > > > >> > depth of the needed/weak symbol lookup. > > > >> > Easiest to think of a,b,c,d as a matrix and FreeBSD just cache a row > > > >> > where we cache both rows and columns. > > > >> > > > > >> > Has anyone looked into porting the changes back to FreeBSD ? The > > > >> > improvement on load time for things like firefox, openoffice, and > > > >> > java > > > >> > is huge on NetBSD. It looks like this change could improve load times > > > >> > on FreeBSD in the same ways. > > > >> > > > >> This is a second time someone posts this to public mailing list and > > > >> curiously enough is a second time it suggested that someone else is to > > > >> do the investigation. From the quick look, the commit in question is > > > >> more or less a direct rip-off of Donelists we had for ages and as > > > >> such is completely over-hyped. The only extra quirk that said commit > > > >> does is an optimization of a dlsym() call, which is hardly ever in > > > >> critical performance path. Said optimization is trivial and easy to > > > >> try. Here you have it: > > > >> http://people.freebsd.org/~kan/rtld-symlook-depth.diff > > > >> > > > >> Since it only applies to dlsym, it only affects programs that are heavy > > > >> plugin users, which I suppose is the category OpenOffice and firefox > > > >> both fall into. Care to do some benchmarks with and without the > > > >> patch and report the results? I frankly doubt that you'll see any > > > >> noticeable difference compared to our stock rtld's performance. > > > > > > > > I benchmarked the impact said patch has on the boot-time of my system. > > > > I > > > > timed the boot-time to when KDE launches autostart programs and once all > > > > programs have loaded (I run a few extra programs, such as amarok). The > > > > latter > > > > measure requires human action thus it has extra, human, variance in its > > > > measure. > > > > > > > > I tried an older version of rtld (about 2 months old), current version > > > > of rtld > > > > and the new (patched) rtld. I ran each test three times. There was > > > > little > > > > variance in the tests and I am confident that there is no difference > > > > between > > > > the different rtld versions and my boot-time. > > > > > > > > Here is a summary of my boot times (in seconds). First measure is when > > > > KDE > > > > autostarts programs, the latter is when I determined when all programs > > > > had > > > > launched. > > > > rtld-old: 69 96 > > > > rtld: 69 94 > > > > rtld-new: 69 94 > > > > > > > > Please note that kernel boot time is approximately 10 seconds and kdm is > > > > delayed by about 10 seconds thus 20 seconds can be removed from above > > > > numbers > > > > to determine non-kernel boot wall-time. > > > > > > > > I would like to add that the blog entry claims a substantial > > > > improvement for > > > > some use cases. Is it not worth to optimism these fringe cases as one > > > > mans > > > >
Re: rtld optimizations
On Thu, Jan 27, 2011 at 12:37:54PM -0500, Mark Saad wrote: > On Thu, Jan 27, 2011 at 6:05 AM, David Naylor > wrote: > > On Wednesday 26 January 2011 06:49:11 Alexander Kabaev wrote: > >> On Tue, 25 Jan 2011 21:40:42 -0500 > >> > >> Mark Saad wrote: > >> > Hello Hackers > >> > > >> > The NetBSD folks have a nice improvement with the rtld-elf subsystem, > >> > known as "Negative Symbol Cache" . > >> > > >> > http://blog.netbsd.org/tnf/entry/netbsd_runtime_linker_gains_negative > >> > > >> > Roy Marples roy@ has a simple write up of the change. > >> > > >> > I took the basic idea from FreeBSD, but improved the performance > >> > drastically. Basically, the huge win is by caching both breadth and > >> > depth of the needed/weak symbol lookup. > >> > Easiest to think of a,b,c,d as a matrix and FreeBSD just cache a row > >> > where we cache both rows and columns. > >> > > >> > Has anyone looked into porting the changes back to FreeBSD ? The > >> > improvement on load time for things like firefox, openoffice, and java > >> > is huge on NetBSD. It looks like this change could improve load times > >> > on FreeBSD in the same ways. > >> > >> This is a second time someone posts this to public mailing list and > >> curiously enough is a second time it suggested that someone else is to > >> do the investigation. From the quick look, the commit in question is > >> more or less a direct rip-off of Donelists we had for ages and as > >> such is completely over-hyped. The only extra quirk that said commit > >> does is an optimization of a dlsym() call, which is hardly ever in > >> critical performance path. Said optimization is trivial and easy to > >> try. Here you have it: > >> http://people.freebsd.org/~kan/rtld-symlook-depth.diff > >> > >> Since it only applies to dlsym, it only affects programs that are heavy > >> plugin users, which I suppose is the category OpenOffice and firefox > >> both fall into. Care to do some benchmarks with and without the > >> patch and report the results? I frankly doubt that you'll see any > >> noticeable difference compared to our stock rtld's performance. > > > > I benchmarked the impact said patch has on the boot-time of my system. I > > timed the boot-time to when KDE launches autostart programs and once all > > programs have loaded (I run a few extra programs, such as amarok). The > > latter > > measure requires human action thus it has extra, human, variance in its > > measure. > > > > I tried an older version of rtld (about 2 months old), current version of > > rtld > > and the new (patched) rtld. I ran each test three times. There was little > > variance in the tests and I am confident that there is no difference between > > the different rtld versions and my boot-time. > > > > Here is a summary of my boot times (in seconds). First measure is when KDE > > autostarts programs, the latter is when I determined when all programs had > > launched. > > rtld-old: 69 96 > > rtld: 69 94 > > rtld-new: 69 94 > > > > Please note that kernel boot time is approximately 10 seconds and kdm is > > delayed by about 10 seconds thus 20 seconds can be removed from above > > numbers > > to determine non-kernel boot wall-time. > > > > I would like to add that the blog entry claims a substantial improvement for > > some use cases. Is it not worth to optimism these fringe cases as one mans > > fringe case is another mans normal case (or woman as one prefers)? > > > > > So I figured out how to properly fit my foot in my mouth and set out > to retesting this on netbsd. > Turns out that in most cases the speed up is not as dramatic. > > Firefox 3.6.16 on amd64 > > old ld.elf_so: 4.07 seconds > new ld.elf_so: 3.89 seconds > > Openoffice 3.1 on amd64 > > old ld.elf_so: 2.67 seconds > new ld.elf_so: 2.60 seconds > > I am slightly perturbed that I can start openoffice faster then I can > start firefox, oh well. Can you, please, satisfy my curiousity ? How did you fixated the moment of finishing the startup of interactive applications like ff or oo ? pgpncuGe8N4rT.pgp Description: PGP signature
Re: rc.d/jail issues
On Thu, Jan 27, 2011 at 10:37:22AM -0700, Jamie Gritton wrote: > It's in the the subversion tree, under projects/jailconf. > > I've got the dependency stuff there (actually turns out to be a large > chunk of the code). I've got it doing almost everything that rc.d/jail > does now, though it doesn't yet handle errors like it should. After I > get that fixed up, I plan on putting it in HEAD. After that, I still > have a todo list mostly of suggestions from others. > > Feel free to give me any "todo" suggestions, or any other feedback :-). Are per-jail init and console in the list ? pgpoGvYicvJAB.pgp Description: PGP signature
Re: [rfc] allow to boot with >= 256GB physmem
On Fri, Jan 21, 2011 at 12:44:13PM -0500, John Baldwin wrote: > On Friday, January 21, 2011 11:09:10 am Sergey Kandaurov wrote: > > Hello. > > > > Some time ago I faced with a problem booting with 400GB physmem. > > The problem is that vm.max_proc_mmap type overflows with > > such high value, and that results in a broken mmap() syscall. > > The max_proc_mmap value is a signed int and roughly calculated > > at vmmapentry_rsrc_init() as u_long vm_kmem_size quotient: > > vm_kmem_size / sizeof(struct vm_map_entry) / 100. > > > > Although at the time it was introduced at svn r57263 the value > > was quite low (f.e. the related commit log stands: > > "The value defaults to around 9000 for a 128MB machine."), > > the problem is observed on amd64 where KVA space after > > r212784 is factually bound to the only physical memory size. > > > > With INT_MAX here is 0x7fff, and sizeof(struct vm_map_entry) > > is 120, it's enough to have sligthly less than 256GB to be able > > to reproduce the problem. > > > > I rewrote vmmapentry_rsrc_init() to set large enough limit for > > max_proc_mmap just to protect from integer type overflow. > > As it's also possible to live tune this value, I also added a > > simple anti-shoot constraint to its sysctl handler. > > I'm not sure though if it's worth to commit the second part. > > > > As this patch may cause some bikeshedding, > > I'd like to hear your comments before I will commit it. > > > > http://plukky.net/~pluknet/patches/max_proc_mmap.diff > > Is there any reason we can't just make this variable and sysctl a long? I do not think we ever need 2G vm map entries in the single address space. pgpyotL9bkpe2.pgp Description: PGP signature
Re: Android development (was Re: best way to run -RELEASE and -CURRENT on the same machine)
On Sun, Jan 16, 2011 at 09:19:22AM -0500, Aryeh Friedman wrote: > Are you talking about BSDoid or FreeDroid? I got SDK from http://bsdroid.org. No idea what FreeDroid is. > > On Sun, Jan 16, 2011 at 8:17 AM, Kostik Belousov wrote: > > On Sun, Jan 16, 2011 at 01:58:26PM +0100, Hans Petter Selasky wrote: > >> On Sunday 16 January 2011 13:30:33 Aryeh Friedman wrote: > >> > On Sun, Jan 16, 2011 at 7:27 AM, Hans Petter Selasky > >> wrote: > >> > > On Sunday 16 January 2011 13:20:39 Aryeh Friedman wrote: > >> > >> On Sun, Jan 16, 2011 at 7:07 AM, Hans Petter Selasky > >> > >> > >> > > > >> > > wrote: > >> > >> > On Sunday 16 January 2011 12:59:17 Aryeh Friedman wrote: > >> > >> >> On Sun, Jan 16, 2011 at 6:48 AM, Hans Petter Selasky > >> > >> >> > >> > >> > > >> > >> > wrote: > >> > >> >> > On Sunday 16 January 2011 11:49:28 Hans Petter Selasky wrote: > >> > >> >> >> if_cdce kernel, > >> > >> >> > > >> > >> >> > if_cdce kernel module > >> > >> >> > > >> > >> >> > --HPS > >> > >> >> > >> > >> >> flosoft-stable# kldload if_cdce > >> > >> >> kldload: can't load if_cdce: File exists > >> > >> > > >> > >> > Any ueX network interfaces? > >> > >> > >> > >> None > >> > >> > >> > >> > Also: > >> > > And what about: > >> > > > >> > > usbconfig -d X.Y dump_curr_config_desc > >> > > >> > flosoft-stable# usbconfig -d 5.2 dump_curr_config_desc > >> > ugen5.2: at usbus5, cfg=0 md=HOST spd=HIGH (480Mbps) > >> > pwr=ON > >> > > >> > > >> > Configuration index 0 > >> > > >> > bLength = 0x0009 > >> > bDescriptorType = 0x0002 > >> > wTotalLength = 0x0037 > >> > bNumInterfaces = 0x0002 > >> > bConfigurationValue = 0x0001 > >> > iConfiguration = 0x > >> > bmAttributes = 0x0080 > >> > bMaxPower = 0x0080 > >> > > >> > Interface 0 > >> > bLength = 0x0009 > >> > bDescriptorType = 0x0004 > >> > bInterfaceNumber = 0x > >> > bAlternateSetting = 0x > >> > bNumEndpoints = 0x0002 > >> > bInterfaceClass = 0x0008 > >> > bInterfaceSubClass = 0x0006 > >> > bInterfaceProtocol = 0x0050 > >> > iInterface = 0x > >> > > >> > Endpoint 0 > >> > bLength = 0x0007 > >> > bDescriptorType = 0x0005 > >> > bEndpointAddress = 0x0001 > >> > bmAttributes = 0x0002 > >> > wMaxPacketSize = 0x0200 > >> > bInterval = 0x > >> > bRefresh = 0x > >> > bSynchAddress = 0x > >> > > >> > Endpoint 1 > >> > bLength = 0x0007 > >> > bDescriptorType = 0x0005 > >> > bEndpointAddress = 0x0081 > >> > bmAttributes = 0x0002 > >> > wMaxPacketSize = 0x0200 > >> > bInterval = 0x > >> > bRefresh = 0x > >> > bSynchAddress = 0x > >> > > >> > > >> > Interface 1 > >> > bLength = 0x0009 > >> > bDescriptorType = 0x0004 > >> > bInterfaceNumber = 0x0001 > >> > bAlternateSetting = 0x > >> > bNumEndpoints = 0x0002 > >> > bInterfaceClass = 0x00ff > >> > bInterfaceSubClass = 0x0042 > >> > bInterfaceProtocol = 0x0001 > >> > iInterface = 0x > >> > > >> > Endpoint 0 > >> > bLength = 0x0007 > >> > bDescriptorType = 0x0005 > >> > bEndpointAddress = 0x0002 > >> > bmAttributes = 0x0002 > >> > wMaxPacketSize = 0x0200 > >> > bInterval = 0x > >> > bRefresh = 0x > >> > bSynchAddress = 0x > >> > > >> > Endpoint 1 > >> > bLength = 0x0007 > >> > bDescriptorType = 0x0005 > >> > bEndpointAddress = 0x0082 > >> > bmAttributes = 0x0002 > >> > wMaxPacketSize = 0x0200 > >> > bInterval = 0x > >> > bRefresh = 0x > >> > bSynchAddress = 0x > >> > >> Looks like interface 1 is the ADB one (vendor specific). If the ADB client > >> is > >> not using LibUSB then you might get it working by adding the vendor ID and > >> product ID to sys/dev/usb/serial/u3g.c as listed by dump_device_desc . A > >> /dev/cuaU0 will then be created which you can use to transfer data. > > > > FWIW, I successfully used adb from freebsd port of android SDK to > > connect to Samsung Galaxy S. For sure, it only worked on i386, but > > this is a known issue with libusb. > > pgpnXDfWsAlxt.pgp Description: PGP signature
Re: Android development (was Re: best way to run -RELEASE and -CURRENT on the same machine)
On Sun, Jan 16, 2011 at 01:58:26PM +0100, Hans Petter Selasky wrote: > On Sunday 16 January 2011 13:30:33 Aryeh Friedman wrote: > > On Sun, Jan 16, 2011 at 7:27 AM, Hans Petter Selasky > wrote: > > > On Sunday 16 January 2011 13:20:39 Aryeh Friedman wrote: > > >> On Sun, Jan 16, 2011 at 7:07 AM, Hans Petter Selasky > > > > > > wrote: > > >> > On Sunday 16 January 2011 12:59:17 Aryeh Friedman wrote: > > >> >> On Sun, Jan 16, 2011 at 6:48 AM, Hans Petter Selasky > > >> >> > > >> > > > >> > wrote: > > >> >> > On Sunday 16 January 2011 11:49:28 Hans Petter Selasky wrote: > > >> >> >> if_cdce kernel, > > >> >> > > > >> >> > if_cdce kernel module > > >> >> > > > >> >> > --HPS > > >> >> > > >> >> flosoft-stable# kldload if_cdce > > >> >> kldload: can't load if_cdce: File exists > > >> > > > >> > Any ueX network interfaces? > > >> > > >> None > > >> > > >> > Also: > > > And what about: > > > > > > usbconfig -d X.Y dump_curr_config_desc > > > > flosoft-stable# usbconfig -d 5.2 dump_curr_config_desc > > ugen5.2: at usbus5, cfg=0 md=HOST spd=HIGH (480Mbps) > > pwr=ON > > > > > > Configuration index 0 > > > > bLength = 0x0009 > > bDescriptorType = 0x0002 > > wTotalLength = 0x0037 > > bNumInterfaces = 0x0002 > > bConfigurationValue = 0x0001 > > iConfiguration = 0x > > bmAttributes = 0x0080 > > bMaxPower = 0x0080 > > > > Interface 0 > > bLength = 0x0009 > > bDescriptorType = 0x0004 > > bInterfaceNumber = 0x > > bAlternateSetting = 0x > > bNumEndpoints = 0x0002 > > bInterfaceClass = 0x0008 > > bInterfaceSubClass = 0x0006 > > bInterfaceProtocol = 0x0050 > > iInterface = 0x > > > > Endpoint 0 > > bLength = 0x0007 > > bDescriptorType = 0x0005 > > bEndpointAddress = 0x0001 > > bmAttributes = 0x0002 > > wMaxPacketSize = 0x0200 > > bInterval = 0x > > bRefresh = 0x > > bSynchAddress = 0x > > > > Endpoint 1 > > bLength = 0x0007 > > bDescriptorType = 0x0005 > > bEndpointAddress = 0x0081 > > bmAttributes = 0x0002 > > wMaxPacketSize = 0x0200 > > bInterval = 0x > > bRefresh = 0x > > bSynchAddress = 0x > > > > > > Interface 1 > > bLength = 0x0009 > > bDescriptorType = 0x0004 > > bInterfaceNumber = 0x0001 > > bAlternateSetting = 0x > > bNumEndpoints = 0x0002 > > bInterfaceClass = 0x00ff > > bInterfaceSubClass = 0x0042 > > bInterfaceProtocol = 0x0001 > > iInterface = 0x > > > > Endpoint 0 > > bLength = 0x0007 > > bDescriptorType = 0x0005 > > bEndpointAddress = 0x0002 > > bmAttributes = 0x0002 > > wMaxPacketSize = 0x0200 > > bInterval = 0x > > bRefresh = 0x > > bSynchAddress = 0x > > > > Endpoint 1 > > bLength = 0x0007 > > bDescriptorType = 0x0005 > > bEndpointAddress = 0x0082 > > bmAttributes = 0x0002 > > wMaxPacketSize = 0x0200 > > bInterval = 0x > > bRefresh = 0x > > bSynchAddress = 0x > > Looks like interface 1 is the ADB one (vendor specific). If the ADB client is > not using LibUSB then you might get it working by adding the vendor ID and > product ID to sys/dev/usb/serial/u3g.c as listed by dump_device_desc . A > /dev/cuaU0 will then be created which you can use to transfer data. FWIW, I successfully used adb from freebsd port of android SDK to connect to Samsung Galaxy S. For sure, it only worked on i386, but this is a known issue with libusb. pgpDtRWY19XcM.pgp Description: PGP signature
Re: What does the FreeBSD/i386 ABI say about stack alignment?
On Thu, Jan 13, 2011 at 04:34:22PM -0700, Warner Losh wrote: > On 01/13/2011 13:28, Kostik Belousov wrote: > >On Thu, Jan 13, 2011 at 12:19:00PM -0500, Ryan Stone wrote: > >>I've been trying to get an application compiled with gcc 4.5.1 running > >>on FreeBSD 8.1, but it's been crashing during startup with a SIGBUS. > >>It turns out that the problem is that gcc is issuing SSE > >>instructions(in my case, a movdqa) that assume that the stack will be > >>aligned to a 16-byte boundary. It seems that Linux/i386 guarantees > >>this, and I worry that gcc has extended this assumption to all i386 > >>architectures. I'm assuming that FreeBSD doesn't make any such > >>promises based on the fact that I'm getting crashes. > >> > >>There does seem to be a flag (-mstackrealign) that you can set to > >>force gcc to align the stack to what it wants, but that pessimizes the > >>generated code a bit. Some googling would seem to indicate that > >>-mpreferred-stack-boundary won't always handle this problem correctly. > >> > >>Any ideas? My inclination, at least for our local source tree here at > >>$WORK, would be to accommodate gcc and guarantee the stack alignment > >>that it wants rather than pessimize our application. It seems we have > >>an old local patch/hack in our FreeBSD 6.1 tree(apparently based on > >>this: > >>http://www.freebsd.org/cgi/getmsg.cgi?fetch=438552+0+/usr/local/www/db/text/2000/freebsd-current/2507.freebsd-current). > >> I believe that this patch is the reason why we haven't seen the > >>problem when running on 6.1, but the patch doesn't seem to work > >>anymore on 8.1. > >Look at lib/csu/i386-elf/crt1_s.S, we align stack on startup. > >My understanding is that the requirement is (%esp& 0xf) == 0 just before > >the call to the function. And we are off by 4 (this is my fault). > > > >Please give this a try. > > > >diff --git a/lib/csu/i386-elf/crt1_s.S b/lib/csu/i386-elf/crt1_s.S > >index d7ed0a2..17ac0e3 100644 > >--- a/lib/csu/i386-elf/crt1_s.S > >+++ b/lib/csu/i386-elf/crt1_s.S > >@@ -42,6 +42,7 @@ _start: > > .cfi_def_cfa_register %ebp > > andl$0xfff0,%esp # align stack > > leal8(%ebp),%eax > >+subl$4,%esp > > pushl %eax# argv > > pushl 4(%ebp) # argc > > pushl %edx# rtld cleanup > > I'm seeing weird core dumps for ssh and friends on i386 on stable/8 from > a few days ago. Could that be related? Few days ago ? It was in the tree for probably one year. I very much doubt it, but cannot say anything until you show the backtrace. Our in-tree gcc masks this by typically doing stack realignment on the entry into the main(). pgpqMBi6aEVzy.pgp Description: PGP signature
Re: What does the FreeBSD/i386 ABI say about stack alignment?
On Thu, Jan 13, 2011 at 10:57:15PM +0100, Joerg Sonnenberger wrote: > On Thu, Jan 13, 2011 at 12:19:00PM -0500, Ryan Stone wrote: > > I've been trying to get an application compiled with gcc 4.5.1 running > > on FreeBSD 8.1, but it's been crashing during startup with a SIGBUS. > > It turns out that the problem is that gcc is issuing SSE > > instructions(in my case, a movdqa) that assume that the stack will be > > aligned to a 16-byte boundary. It seems that Linux/i386 guarantees > > this, and I worry that gcc has extended this assumption to all i386 > > architectures. I'm assuming that FreeBSD doesn't make any such > > promises based on the fact that I'm getting crashes. > > FreeBSD follows the original SYSV ABI. Linux at some point silently > decided to redefine the ABI to fit their mindset. I think you want to > use a combination of -mpreferred-stack-boundary=4 and > -mincoming-stack-boundary=2. I think gcc [*] requires 16-byte alignment. Also, it follows the policy of not changing the stack alignment through the calls. What you see is a plain bug in FreeBSD, when a developer (me) tried to adopt to newer ABI but failed. * Might be not gcc in our tree, but definitely newer gcc releases. pgpFmrQkAdFNW.pgp Description: PGP signature
Re: What does the FreeBSD/i386 ABI say about stack alignment?
On Thu, Jan 13, 2011 at 12:19:00PM -0500, Ryan Stone wrote: > I've been trying to get an application compiled with gcc 4.5.1 running > on FreeBSD 8.1, but it's been crashing during startup with a SIGBUS. > It turns out that the problem is that gcc is issuing SSE > instructions(in my case, a movdqa) that assume that the stack will be > aligned to a 16-byte boundary. It seems that Linux/i386 guarantees > this, and I worry that gcc has extended this assumption to all i386 > architectures. I'm assuming that FreeBSD doesn't make any such > promises based on the fact that I'm getting crashes. > > There does seem to be a flag (-mstackrealign) that you can set to > force gcc to align the stack to what it wants, but that pessimizes the > generated code a bit. Some googling would seem to indicate that > -mpreferred-stack-boundary won't always handle this problem correctly. > > Any ideas? My inclination, at least for our local source tree here at > $WORK, would be to accommodate gcc and guarantee the stack alignment > that it wants rather than pessimize our application. It seems we have > an old local patch/hack in our FreeBSD 6.1 tree(apparently based on > this: > http://www.freebsd.org/cgi/getmsg.cgi?fetch=438552+0+/usr/local/www/db/text/2000/freebsd-current/2507.freebsd-current). > I believe that this patch is the reason why we haven't seen the > problem when running on 6.1, but the patch doesn't seem to work > anymore on 8.1. Look at lib/csu/i386-elf/crt1_s.S, we align stack on startup. My understanding is that the requirement is (%esp & 0xf) == 0 just before the call to the function. And we are off by 4 (this is my fault). Please give this a try. diff --git a/lib/csu/i386-elf/crt1_s.S b/lib/csu/i386-elf/crt1_s.S index d7ed0a2..17ac0e3 100644 --- a/lib/csu/i386-elf/crt1_s.S +++ b/lib/csu/i386-elf/crt1_s.S @@ -42,6 +42,7 @@ _start: .cfi_def_cfa_register %ebp andl$0xfff0,%esp # align stack leal8(%ebp),%eax + subl$4,%esp pushl %eax# argv pushl 4(%ebp) # argc pushl %edx# rtld cleanup pgp7aDUftFr6a.pgp Description: PGP signature