Re: another question
On Mon, Jul 1, 2013 at 5:42 PM, David Sanford david.lee...@programmer.netwrote: Hi, Thanks for your responses to my first question. They were very helpful. In looking at the code, I ran across the functions setprogname and getprogname. According to the man page: In FreeBSD, the name of the program is set by the start-up code that is run before *main*(); thus, running *setprogname*() is not necessary. I'm confused by how this is done. Where is this start-up code defined? Is this included in all executables compiled on FreeBSD? Even the programs released under the GNU GPL? I believe the code that does this is in lib/csu/common/ignore_init.c; see handle_argv() and the use of __progname[]. This will run for anything that links against csu, which is anything compiled on FreeBSD. The same csu library sets the ABI note.tag, which tells the kernel which syscall table to use when the binary is executed. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: b_freelist TAILQ/SLIST
On Fri, Jun 28, 2013 at 8:56 AM, Adrian Chadd adr...@freebsd.org wrote: On 28 June 2013 08:37, Alexander Motin m...@freebsd.org wrote: Otherwise it may just creep up again after someone does another change in an unrelated part of the kernel. Big win or small, TAILQ is still heavier then STAILQ, while it is not needed there at all. You can't make that assumption. I bet that if both pointers are in the _same_ cache line, the overhead of maintaining a double linked list is trivial. No, it's not. A singly-linked SLIST only needs to modify the head of the list and the current element. A doubly-linked LIST needs to modify both the head as well as the old first element, which may not be in cache (and may not be in the same TLB, either). I don't recall exactly what [S]TAILQ touches, but the doubly-linked version still has to modify more entries that are not contiguous. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Fix MNAMELEN or reimplement struct statfs
On Sat, Jun 8, 2013 at 3:52 PM, Dirk Engling erdge...@erdgeist.org wrote: The arbitrary value #define MNAMELEN88 /* size of on/from name bufs */ struct statfs { [...] charf_mntfromname[MNAMELEN];/* mounted filesystem */ charf_mntonname[MNAMELEN]; /* directory on which mounted */ }; currently bites us when trying to use poudriere with errors like 'mount: tmpfs: File name too long' /poudriere/data/build/91_RELEASE_amd64-REALLY-REALLY-LONG-JAILNAME/ref/wrkdirs The topic has been discussed several times since 2004 and has been postponed each time, the last time when it hit zfs users: http://lists.freebsd.org/pipermail/freebsd-fs/2010-March/007974.html So I'd like to point to the calendar, it's 2013 already and there's still a static arbitrary (and way too low) limit in one of the core areas of the vfs code. So I'd like to bump the issue and propose either making f_mntfromname a dynamic allocation or just increase MNAMELEN, using 10.0 as water shed. Gleb Kurtsou did this along with the ino64 GSoC project. Unfortunately, both he and I hit ENOTIME due to the job that pays the bills and it's never made it back to the main repository. IIRC, though, the only reason for doing it with 64-bit ino_t is that he'd already finished changing the stat/dirent ABI so what was one more. I think he went with 1024 bytes, which also necessitated not allocating statfs on the stack for the kernel. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: extattr_set_* return type
On Mon, Apr 1, 2013 at 11:24 AM, John Baldwin j...@freebsd.org wrote: On Saturday, March 30, 2013 5:30:21 pm m...@freebsd.org wrote: Despite the man page correctly describing the return value for extattr_set_*, I thought recently that they returned 0/-1 for success/failure, not the number of bytes written, like write(2). This is because extattr_set_* is declared as returning an int, not an ssize_t. Both extattr_get and extattr_list return ssize_t, so this is inconsistent. The patch at http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix- rmextattr-8-.patchfixes this. It compiles but it's untested. I don't think any compat shims are needed, since an old application will still sign extend and this will work (it's very unlikely anyone does extattr_set for 2GB or more). If anyone actually uses extattr on 64-bit, please test a new kernel but old userspace to be sure nothing is broken. I plan to commit this next week if I don't hear otherwise. Hmm, the patch URL doesn't work, but please fix this. There is an old thread we are both on from Dec 2011 where I ran into the same thing. I also think we don't need compat shims. The version in my outbox looked right; I don't know how it got mangled. http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix- http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes rmextattr-8-.patchhttp://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: extattr_set_* return type
On Mon, Apr 1, 2013 at 12:51 PM, m...@freebsd.org wrote: On Mon, Apr 1, 2013 at 11:24 AM, John Baldwin j...@freebsd.org wrote: On Saturday, March 30, 2013 5:30:21 pm m...@freebsd.org wrote: Despite the man page correctly describing the return value for extattr_set_*, I thought recently that they returned 0/-1 for success/failure, not the number of bytes written, like write(2). This is because extattr_set_* is declared as returning an int, not an ssize_t. Both extattr_get and extattr_list return ssize_t, so this is inconsistent. The patch at http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix- rmextattr-8-.patchfixeshttp://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes this. It compiles but it's untested. I don't think any compat shims are needed, since an old application will still sign extend and this will work (it's very unlikely anyone does extattr_set for 2GB or more). If anyone actually uses extattr on 64-bit, please test a new kernel but old userspace to be sure nothing is broken. I plan to commit this next week if I don't hear otherwise. Hmm, the patch URL doesn't work, but please fix this. There is an old thread we are both on from Dec 2011 where I ran into the same thing. I also think we don't need compat shims. The version in my outbox looked right; I don't know how it got mangled. http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix- http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes rmextattr-8-.patchhttp://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes And I found the old thread: http://lists.freebsd.org/pipermail/freebsd-current/2011-December/030567.html I guess my memory really does just suck. I now remember what I posted, but only after reading it again. :-) Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
extattr_set_* return type
Despite the man page correctly describing the return value for extattr_set_*, I thought recently that they returned 0/-1 for success/failure, not the number of bytes written, like write(2). This is because extattr_set_* is declared as returning an int, not an ssize_t. Both extattr_get and extattr_list return ssize_t, so this is inconsistent. The patch at http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes this. It compiles but it's untested. I don't think any compat shims are needed, since an old application will still sign extend and this will work (it's very unlikely anyone does extattr_set for 2GB or more). If anyone actually uses extattr on 64-bit, please test a new kernel but old userspace to be sure nothing is broken. I plan to commit this next week if I don't hear otherwise. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Stupid question about integer sizes
On Tue, Feb 19, 2013 at 5:11 AM, Borja Marcos bor...@sarenet.es wrote: Hello, I'm really sorry if this is a stupid question, but as far as I know, u_int64_t defined in /usr/include/sys/types.h should *always* be a 64 bit unsigned integer, right? Seems there's a bug (or I need more and stronger coffee). Compiling a program on a 64 bit system with -m32 gets the 64 bit integer types wrong. % cat tachin.c #include sys/types.h #include stdio.h main() { printf(sizeof uint64_t = %d\n, sizeof(uint64_t)); printf(sizeof u_int64_t = %d\n, sizeof(u_int64_t)); } uname -a FreeBSD splunk 9.1-RELEASE FreeBSD 9.1-RELEASE #14: Wed Jan 23 17:24:05 CET 2013 root@splunk:/usr/obj/usr/src/sys/SPLUNK amd64 % gcc -o tachin tachin.c % ./tachin sizeof uint64_t = 8 sizeof u_int64_t = 8 % ./tachin sizeof uint64_t = 4 === WRONG!! sizeof u_int64_t = 4=== WRONG!! The same happens with clang. % clang -m32 -o tachin tachin.c tachin.c:5:1: warning: type specifier missing, defaults to 'int' [-Wimplicit-int] main() ^~~~ 1 warning generated. % ./tachin sizeof uint64_t = 4 === WRONG!! sizeof u_int64_t = 4=== WRONG!! if I do the same on a i386 system (8.2-RELEASE, but it should be irrelevant) the u_int64 types have the correct size. %gcc -o tachin tachin.c %./tachin sizeof uint64_t = 8 sizeof u_int64_t = 8 Am I missing anything? Seems like a too stupid problem to be a real bug. Sorry if I am wrong. Last I knew -m32 still wasn't quite supported on 9.1. This is fixed in CURRENT. The problem in in the machine-dependent headers. sys/types.h includes sys/_types.h which includes machine/_types.h. But the machine alias can only point to one place; it's the amd64 headers since you're running the amd64 kernel. sys/amd64/include/_types.h defines __uint64_t as unsigned long, which is correct in 64-bit mode but wrong in 32-bit mode. On CURRENT there is a merge x86/_types.h which uses #ifdef guards to define __uint64_t as unsigned long or unsigned long long, depending on __LP64__, so that the size is correct on a 32-bit compiler invocation. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: dynamically calculating NKPT [was: Re: huge ktr buffer]
On Tue, Feb 5, 2013 at 7:14 AM, Konstantin Belousov kostik...@gmail.com wrote: On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote: Hi, I have a patch to dynamically calculate NKPT for amd64 kernels. This should fix the various issues that people pointed out in the email thread. Please review and let me know if there are any objections to committing this. Also, thanks to Alan (alc@) for reviewing and providing feedback on the initial version of the patch. Patch (also available at http://people.freebsd.org/~neel/patches/nkpt_diff.txt): Index: sys/amd64/include/pmap.h === --- sys/amd64/include/pmap.h (revision 246277) +++ sys/amd64/include/pmap.h (working copy) @@ -113,13 +113,7 @@ ((unsigned long)(l2) PDRSHIFT) | \ ((unsigned long)(l1) PAGE_SHIFT)) -/* Initial number of kernel page tables. */ -#ifndef NKPT -#define NKPT32 -#endif - #define NKPML4E 1 /* number of kernel PML4 slots */ -#define NKPDPE howmany(NKPT, NPDEPG)/* number of kernel PDP slots */ #define NUPML4E (NPML4EPG/2)/* number of userland PML4 pages */ #define NUPDPE (NUPML4E*NPDPEPG)/* number of userland PDP pages */ @@ -181,6 +175,7 @@ #define PML4map ((pd_entry_t *)(addr_PML4map)) #define PML4pml4e ((pd_entry_t *)(addr_PML4pml4e)) +extern int nkpt; /* Initial number of kernel page tables */ extern u_int64_t KPDPphys; /* physical address of kernel level 3 */ extern u_int64_t KPML4phys; /* physical address of kernel level 4 */ Index: sys/amd64/amd64/minidump_machdep.c === --- sys/amd64/amd64/minidump_machdep.c(revision 246277) +++ sys/amd64/amd64/minidump_machdep.c(working copy) @@ -232,7 +232,7 @@ /* Walk page table pages, set bits in vm_page_dump */ pmapsize = 0; pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys); - for (va = VM_MIN_KERNEL_ADDRESS; va MAX(KERNBASE + NKPT * NBPDR, + for (va = VM_MIN_KERNEL_ADDRESS; va MAX(KERNBASE + nkpt * NBPDR, kernel_vm_end); ) { /* * We always write a page, even if it is zero. Each @@ -364,7 +364,7 @@ /* Dump kernel page directory pages */ bzero(fakepd, sizeof(fakepd)); pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys); - for (va = VM_MIN_KERNEL_ADDRESS; va MAX(KERNBASE + NKPT * NBPDR, + for (va = VM_MIN_KERNEL_ADDRESS; va MAX(KERNBASE + nkpt * NBPDR, kernel_vm_end); va += NBPDP) { i = (va PDPSHIFT) ((1ul NPDPEPGSHIFT) - 1); Index: sys/amd64/amd64/pmap.c === --- sys/amd64/amd64/pmap.c(revision 246277) +++ sys/amd64/amd64/pmap.c(working copy) @@ -202,6 +202,10 @@ vm_offset_t virtual_avail; /* VA of first avail page (after kernel bss) */ vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) */ +int nkpt; +SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, nkpt, 0, +Number of kernel page table pages allocated on bootup); + static int ndmpdp; static vm_paddr_t dmaplimit; vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS; @@ -495,17 +499,42 @@ CTASSERT(powerof2(NDMPML4E)); +/* number of kernel PDP slots */ +#define NKPDPE(ptpgs) howmany((ptpgs), NPDEPG) + static void +nkpt_init(vm_paddr_t addr) +{ + int pt_pages; + +#ifdef NKPT + pt_pages = NKPT; +#else + pt_pages = howmany(addr, 1 PDRSHIFT); + pt_pages += NKPDPE(pt_pages); + + /* + * Add some slop beyond the bare minimum required for bootstrapping + * the kernel. + * + * This is quite important when allocating KVA for kernel modules. + * The modules are required to be linked in the negative 2GB of + * the address space. If we run out of KVA in this region then + * pmap_growkernel() will need to allocate page table pages to map + * the entire 512GB of KVA space which is an unnecessary tax on + * physical memory. + */ + pt_pages += 4; /* 8MB additional slop for kernel modules */ 8MB might be to low. I just checked one of my machines with fully modularized kernel, it takes slightly more than 6 MB to load 50 modules. I think that 16MB would be safer, but it probably needs to be scaled down based on the available phys memory. amd64 kernel could be booted on 128MB machine still. Is there no way to not map the entire 512GB? Otherwise this patch could really hose some vendors. E.g. the kernel module for the OneFS file system is around 8MB all by itself. I found when we moved from FreeBSD 6 to 7 that the NKPT of 32 was insufficient for our system to even boot so I put it back to 240 (I didn't want to spend a lot of time playing). At that time our
Re: How to validate the variable size memory block in ioctl handler?
On Sun, Jan 20, 2013 at 3:01 PM, Yuri y...@rawbw.com wrote: I am implementing an ioctl that reads/writes variable size structure. Allocated size is supplied by the caller in the structure itself. struct my_struct { int len; // allocated size other_struct s[1]; }; ioctl request id is defined as _IOWR('X', number, my_struct) How to validate from the ioctl function handler (for some device) that the whole (variable size) block of bytes is RW accessible in the process memory space? Should I call copyout/copyin for this, or there is some shorter way? EFAULT should be returned in case of validation failure. As I understand, macros like _IOR, _IOWR do validation based on the size of structure supplied to them. So that the handler procedures don't have to do that. I was expecting to find among them some macro that would work for such variable size structure, but it isn't there. (Not sure if this is possible language-wise). You'll need to pass in more than the above, probably, as the kernel's ioctl() function has copied in the specified number of bytes already. I.e. the value passed to your ioctl handler is already in the kernel space, and unless it's 4 bytes, was malloc(9)'d and copyin'd (if it's an IN parameter). The size used is the size passed to the _IOC() macro. To do what you want it sounds like you want your handler to take something like: struct var_ioctl { int len; void *data; }; Then then handler itself would have to use copyin/copyout to access the data. There's no simpler way. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: How to validate the variable size memory block in ioctl handler?
On Sun, Jan 20, 2013 at 6:50 PM, Yuri y...@rawbw.com wrote: On 01/20/2013 16:59, m...@freebsd.org wrote: To do what you want it sounds like you want your handler to take something like: struct var_ioctl { int len; void *data; }; Then then handler itself would have to use copyin/copyout to access the data. There's no simpler way. I think I found the simpler way, see the draft patch below. Generic macro _IOWRE will handle the case when the first integer in ioctl parameter holds the actual size of the structure. This way of passing the variable array sizes is quite common in various APIs. Other potential uses would also benefit from this. Yuri Index: sys/kern/sys_generic.c === --- sys/kern/sys_generic.c(revision 245654) +++ sys/kern/sys_generic.c(working copy) @@ -640,6 +640,7 @@ int arg, error; u_int size; caddr_t data; + int vsize; if (uap-com 0x) { printf( @@ -654,6 +655,14 @@ * copied to/from the user's address space. */ size = IOCPARM_LEN(com); + if (size == IOC_VARIABLE) { + /* first integer has the length of the memory */ + error = copyin(uap-data, (caddr_t)vsize, sizeof(vsize)); + if (error) + return (error); + size = (u_int)vsize; + } if ((size IOCPARM_MAX) || ((com (IOC_VOID | IOC_IN | IOC_OUT)) == 0) || #if defined(COMPAT_FREEBSD5) || defined(COMPAT_FREEBSD4) || defined(COMPAT_43) Index: sys/sys/ioccom.h === --- sys/sys/ioccom.h (revision 245654) +++ sys/sys/ioccom.h (working copy) @@ -50,6 +50,7 @@ #define IOC_IN 0x8000 /* copy in parameters */ #define IOC_INOUT (IOC_IN|IOC_OUT) #define IOC_DIRMASK (IOC_VOID|IOC_OUT|IOC_IN) +#define IOC_VARIABLEIOCPARM_MASK/* parameters size in parameters */ #define _IOC(inout,group,num,len) ((unsigned long) \ ((inout) | (((len) IOCPARM_MASK) 16) | ((group) 8) | (num))) @@ -59,6 +60,9 @@ #define _IOW(g,n,t) _IOC(IOC_IN,(g), (n), sizeof(t)) /* this should be _IORW, but stdio got there first */ #define _IOWR(g,n,t)_IOC(IOC_INOUT, (g), (n), sizeof(t)) +#define _IORE(g,n) _IOC(IOC_OUT, (g), (n), IOC_VARIABLE) +#define _IOWE(g,n) _IOC(IOC_IN,(g), (n), IOC_VARIABLE) +#define _IOWRE(g,n) _IOC(IOC_INOUT, (g), (n), IOC_VARIABLE) #ifdef _KERNEL This would be fine for a local patch but it breaks existing (valid) uses that have exactly 8191 bytes of data, so it wouldn't be suitable for the main FreeBSD repository. Also, in general one wants to have limits on syscalls that can force a kernel malloc of any size, as it leads to denial of service attacks or crashes by requesting the kernel over-allocate memory. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: kmem_map auto-sizing and size dependencies
On Fri, Jan 18, 2013 at 7:29 AM, Andre Oppermann an...@freebsd.org wrote: The (inital?) size of the kmem_map is determined by some voodoo magic, a sprinkle of nmbclusters * PAGE_SIZE incrementor and lots of tunables. However it seems to work out to an effective kmem_map_size of about 58MB on my 16GB AMD64 dev machine: vm.kvm_size: 549755809792 vm.kvm_free: 530233421824 vm.kmem_size: 16,594,300,928 vm.kmem_size_min: 0 vm.kmem_size_max: 329,853,485,875 vm.kmem_size_scale: 1 vm.kmem_map_size: 59,518,976 vm.kmem_map_free: 16,534,777,856 The kmem_map serves kernel malloc (via UMA), contigmalloc and everthing else that uses UMA for memory allocation. Mbuf memory too is managed by UMA which obtains the backing kernel memory from the kmem_map. The limits of the various mbuf memory types have been considerably raised recently and may make use of 50-75% of all physically present memory, or available KVM space, whichever is smaller. Now my questions/comments are: Does the kmem_map automatically extend itself if more memory is requested? Not that I recall. Should it be set to a larger initial value based on min(physical,KVM) space available? It needs to be smaller than the physical space, because the only limit on the kernel's use of (pinned) memory is the size of the map. So if it is too large there is nothing to stop the kernel from consuming all available memory. The lowmem handler is called when running out of virtual space only (i.e. a failure to allocate a range in the map). The naming and output of the various vm.kmem_* and vm.kvm_* sysctls is confusing and not easy to reconcile. Either we need some more detailing more aspects or less. Plus perhaps sysctl subtrees to better describe the hierarchy of the maps. Why are separate kmem submaps being used? Is it to limit memory usage of certain subsystems? Are those limits actually enforced? I mostly know about memguard, since I added memguard_fudge(). IIRC some of the submaps are used. The memguard_map specifically is used to know whether an allocation is guarded or not, so at free(9) it can be handled as normal malloc() or as memguard. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: _mtx_lock_spin: obsolete historic handling of kdb_active and panicstr?
On Wed, Oct 17, 2012 at 4:20 AM, Andriy Gapon a...@freebsd.org wrote: _mtx_lock_spin has the following check in its retry loop: if (i 6000 || kdb_active || panicstr != NULL) DELAY(1); else _mtx_lock_spin_failed(m); [snip analysis] So I'd like to propose to remove those checks altogether. Or perhaps to reverse them and immediately induce a (possibly secondary) panic if we ever get to that wait loop and kdb_active || panicstr != NULL. The panicstr can clearly be removed. I think there can be race conditions with entering kdb and taking a spinlock, because the spinlock acquire will block interrupts. I don't remember if we always NMI for kdb enter or if that's configurable. The old code was clearer (or maybe I'm just remembering an Isilon hack); looking at stop_cpus_hard() I don't see that it uses an NMI. So a CPU can block interrupts, then if it sees kdb_active it will spin until we leave kdb, rather than panic. Of course this would only be relevant if the CPU it's trying to acquire is already held; otherwise it should find the lock unowned and this isn't relevant. And if the lock is owned by the thread entering kdb, that would be a real panic, not a recoverable kdb entry. So I think maybe the kdb_active check is also not helpful after all. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Change vfork() to posix_spawn()?
On Fri, Sep 14, 2012 at 4:45 AM, Erik Cederstrand e...@cederstrand.dk wrote: Den 14/09/2012 kl. 13.03 skrev Ivan Voras ivo...@freebsd.org: On 14/09/2012 09:49, Erik Cederstrand wrote: Hello hackers, I'm looking through the Clang Analyzer scans on http://scan.freebsd.your.org/freebsd-head looking for false positives to report back to LLVM. There are quite a list of reports suggesting to change vfork() calls to posix_spawn(). Example from /bin/rpc: http://scan.freebsd.your.org/freebsd-head/bin.rcp/2012-09-12-amd64/report-nsOV80.html#EndPath I know nothing about this but I can see fork and posix_spawn have been discussed on this list previously. Is this a legitimate warning (in this case and in general in FreeBSD base)? Currently (on 9-stable at least), posix_spawn() is implemented as a wrapper around vfork(), so I doubt replacing one with the other would do much. The analyzer added this warning in January. The release notes link to this explanation: https://www.securecoding.cert.org/confluence/display/seccode/POS33-C.+Do+not+use+vfork() I guess this is the important part: Because of the implementation of the vfork() function, the parent process is suspended while the child process executes. If a user sends a signal to the child process, delaying its execution, the parent process (which is privileged) is also blocked. This means that an unprivileged process can cause a privileged process to halt, which is a privilege inversion resulting in a denial of service. Isn't the important part the previous paragraph, which said that some older versions of Linux had the problem? The entire thing reads that the issue comes from an idiom of vfork(), setuid(), then exec, which is both undefined and would be specific to only some *uses* of vfork(), not the implementation. The whole thing isn't worded terribly usefully; e.g. it doesn't explain if it was only Linux that had an issue, which version of Linux is fixed, whether normal code that went straight from vfork() to exec() was fine, etc. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: freebsd.org git repositories and svn ids
On Mon, Sep 3, 2012 at 7:29 AM, Eitan Adler li...@eitanadler.com wrote: On 3 September 2012 10:19, Ryan Stone ryst...@gmail.com wrote: On Sun, Sep 2, 2012 at 5:45 PM, Eitan Adler li...@eitanadler.com wrote: Why isn't git.freebsd.org a straight git svn clone ? AFAIK that isn't broken. Well, let's put it this way: I started a git svn clone of the src repository on Aug 28. It's still going. And I'm only creating a very small subset of the branches in the repo. I've done the same and it only took me a few hours although the ports repo took a few days. Did you try to preserve the full history in the git version? I'm not familiar with git clone, but when I've made a git copy of an svn repo for uses at work (both of freebsd.org any $WORK's repo) it took two to three days just to fetch the portion of the revisions that were for head/current. A few hours sounds too short. The three-four days since Aug 28 sounds normal to me. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: How to know __FreeBSD_version for a symbol
On Mon, Aug 13, 2012 at 1:24 AM, Hooman Fazaeli hoomanfaza...@gmail.com wrote: Hi hackers In the process of back porting drivers to older freebsd versions, We sometimes need to add suitable '#if __FreeBSD_version = x ... else ... ' directives to the source to use an alternate function or exclude certain statements, defines, etc. What is the best (quick/reliable) way to know in which __FreeBSD_version a symbol (function, struct member, macro, ...) has been first introduced? As far as I know it's a slightly painful look over SVN logs. First, find the SVN revision that introduced or changed the relevant symbol. Then, look at the SVN history of sys/sys/param.h for changes to __FreeBSD_version symbol. The one that's from the same or later SVN revision as the symbol change is the value you'll need. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: sysctl filesystem ?
On Tue, Jun 26, 2012 at 1:59 AM, Robert Watson rwat...@freebsd.org wrote: On Tue, 26 Jun 2012, Chris Rees wrote: as well as we don't depend of /proc for normal operation we shouldn't for say /proc/sysctl improvements are welcome, better documentation is welcome, changes to what is OK - isn't. /proc/sysctl might be useful. Just because Linux uses it doesn't make it a bad idea. One of the problems we've encounted with synthetic file systems is that off-the-shelf file system tools (e.g., cp, dd, cat) make simplistic (but not unreasonable) assumptions about the statistic content of files. This comes up frequently with procfs-like systems where the size of, say, memory map data can be considerably larger than the perhaps 128-byte, 256-byte, or even 8k buffers that might exist in a stock file access tool. Unless we change all of those tools to use buffers much bigger than they currently do, which even suggets changing the C library buffer to defaults for FILE *, that places an onus on the file system to provide persisting snapshots of data until it's sure that a user process is done -- e.g., over many system calls. sysctl is not immune to the requirement of atomicity, but it has explicit control over it: sysctl is a single system call, rather than an unbounded open-read-seek-repeat-etc cycle, and has been carefully crafted to provide this and other MIB-like properties, such as a basic data type model so that command line tools know how to render content rather than having to guess and/or get it wrong. sysctl has some file-system like properties, but on the whole, it's not a file system -- it's much more like an SNMP MIB. While you can map anything into anything (including Turing machines), I think the sysctl command line tool and API, despite its limitations, is a better match for accessing this sort of monitoring and control data than the POSIX file API, and would recommend against trying to move to a sysctl file system. While I understand the problems you allude to, the sysctl(8) binary can protect itself from them. IMO the biggest problem with sysctls not being files is that it makes no sense from the core UNIX philosophy that everything is a file. Sockets and pipes and character devices and even unseekable things like stdout are files; why aren't these other objects that allow read, write, and have their own namespace? Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: OS support for fault tolerance
On Tue, Feb 14, 2012 at 8:57 AM, Julian Elischer jul...@freebsd.org wrote: On 2/14/12 6:23 AM, Maninya M wrote: For multicore desktop computers, suppose one of the cores fails, the FreeBSD OS crashes. My question is about how I can make the OS tolerate this hardware fault. The strategy is to checkpoint the state of each core at specific intervals of time in main memory. Once a core fails, its previous state is retrieved from the main memory, and the processes that were running on it are rescheduled on the remaining cores. I read that the OS tolerates faults in large servers. I need to make it do this for a Desktop OS. I assume I would have to change the scheduler program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine. How do I go about doing this? What exactly do I need to save for the state of the core? What else do I need to know? I have absolutely no experience with kernel programming or with FreeBSD. Any pointers to good sources about modifying the source-code of FreeBSD would be greatly appreciated. This question has always intrigued me, because I'm always amazed that people actually try. From my viewpoint, There's really not much you can do if the core that is currently holding the scheduler lock fails. We did this at IBM after we'd done the dynamic logical partitioning. Basically, there was a way to probe the CPU for the number of correctable errors it was encountering. At too high a threshhold, it was considered faulty and we offlined the CPU before it encountered an uncorrectable error. We did the same thing for memory, too (that one I was directly involved in). The basic trouble, though, is that at least for memory, there didn't seem to be a correlation between the rate of correctable ECC and an uncorrectable error occurring. And what do you mean by 'fails? do you run constant diagnostics? how do you tell when it is failed? It'd be hard to detect that 'multiply' has suddenly started giving bad results now and then. I'd assume this is predicated by the ability of the hardware to have some redundancy and some way to query the error rate. I've done a little work with memory ECC on the device driver end, and at least there hardware definitely reports correctable and uncorrectable ECC via some registers. But I don't know if there's any way to query this for a CPU (and of course each CPU would be different). However, all that said, it's a moderately large project to get an OS ready to handle things like holes appearing in its logical CPU ID space (how do you serialize this when you want the common case to not take a lock?), and to do all the wizardry of unscheduling (what do you do with a bound thread?) and then actually shutting the CPU down via firmware so it doesn't continue running. I started working on this for Linux when I worked at IBM, somewhere around 2004, and then IBM got sued by SCO so they pulled me off the project. It was finished up by a colleague and friend. You can probably come to a first approximation by forcing e.g. the idle thread to not get switched out, when the CPU appears unstable. Then at least it's running fewer instructions, and less likely to generate a machine check. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Moving on ... (was: Re: FreeBSD has serious problems with...)
On Wed, Feb 8, 2012 at 10:33 PM, John Kozubik j...@kozubik.com wrote: If it were up to me, I think I would stake out a very loud and clear mission statement for FreeBSD that explicitly sacrifices new features for longer lifecycles and deeper investment in particular releases. I've thought seriously about the points that were made in this thread supporting faster releases, and I do appreciate what we would be giving up, but I continue to advocate for a new major release every 5 years. To summarize 7 years of working on AIX, where we upgrade the major version number every 5 years or so, in the interim we do deliver hundreds of other features. However, if we made a poor choice for interface, we're stuck with it. This leads to two things: (1) flexible interfaces, like leaving an unused flags field in a syscall so it can become variadic later, and (2) flexible structures, like leaving a few bytes in a struct that's part of the KBI so they can be filled in, and when that space is nearly out, malloc(9) more storage to hang off the struct. Sometimes this means a lot of extra testing effort to ensure the legacy interface works as well as any extensions that arise later. It's not ideal, it's more work when designing and testing, and when a major release does come around where we could break binary compatibility, we've often forgotten about a lot of these little bits and just leave things the way they are. With a more frequent release cycle FreeBSD can be more free about breaking KBI and ABI, and even the [AK]PI, which keeps the current code clean but makes upgrading more difficult. It's a hassle. It's doable, but it adds work, and that's a tough call to make for a volunteer effort. Increasing the barrier to entry, or reducing the fun part of working on FreeBSD doesn't serve the project either. Just my 2 cents, a month late. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Per-mount syncer threads and fanout for pagedaemon cleaning
On Tue, Dec 27, 2011 at 8:05 AM, Attilio Rao atti...@freebsd.org wrote: 2011/12/27 Giovanni Trematerra giovanni.tremate...@gmail.com: On Mon, Dec 26, 2011 at 9:24 PM, Venkatesh Srinivas vsrini...@dragonflybsd.org wrote: Hi! I've been playing with two things in DragonFly that might be of interest here. Thing #1 := First, per-mountpoint syncer threads. Currently there is a single thread, 'syncer', which periodically calls fsync() on dirty vnodes from every mount, along with calling vfs_sync() on each filesystem itself (via syncer vnodes). My patch modifies this to create syncer threads for mounts that request it. For these mounts, vnodes are synced from their mount-specific thread rather than the global syncer. The idea is that periodic fsync/sync operations from one filesystem should not stall or delay synchronization for other ones. The patch was fairly simple: http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/50e4012a4b55e1efc595db0db397b4365f08b640 There's something WIP by attilio@ on that area. you might want to take a look at http://people.freebsd.org/~attilio/syncer_alpha_15.diff I don't know what hammerfs needs but UFS/FFS and buffer cache make a good job performance-wise and so the authors are skeptical about the boost that such a change can give. We believe that brain cycles need to be spent on other pieces of the system such as ARC and ZFS. More specifically, it is likely that focusing on UFS and buffer cache for performance is not really useful, we should drive our efforts over ARC and ZFS. Also, the real bottlenecks in our I/O paths are in GEOM single-threaded design, lack of unmapped I/O functionality, possibly lack of proritized I/O, etc. Indeed, Isilon (and probably other vendors as well) entirely skip VFS_SYNC when the WAIT argument is MNT_LAZY. Since we're a distributed journalled filesystem, syncing via a system thread is not a relevant operation; i.e. all writes that have exited a VOP_WRITE or similar operation are already in reasonably stable storage in a journal on the relevant nodes. However, we do then have our own threads running on each node to flush the journal regularly (in addition to when it fills up), and I don't know enough about this to know if it could be fit into the syncer thread idea or if it's too tied in somehow to our architecture. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Per-mount syncer threads and fanout for pagedaemon cleaning
On Tue, Dec 27, 2011 at 9:05 AM, Attilio Rao atti...@freebsd.org wrote: 2011/12/27 m...@freebsd.org: On Tue, Dec 27, 2011 at 8:05 AM, Attilio Rao atti...@freebsd.org wrote: 2011/12/27 Giovanni Trematerra giovanni.tremate...@gmail.com: On Mon, Dec 26, 2011 at 9:24 PM, Venkatesh Srinivas vsrini...@dragonflybsd.org wrote: Hi! I've been playing with two things in DragonFly that might be of interest here. Thing #1 := First, per-mountpoint syncer threads. Currently there is a single thread, 'syncer', which periodically calls fsync() on dirty vnodes from every mount, along with calling vfs_sync() on each filesystem itself (via syncer vnodes). My patch modifies this to create syncer threads for mounts that request it. For these mounts, vnodes are synced from their mount-specific thread rather than the global syncer. The idea is that periodic fsync/sync operations from one filesystem should not stall or delay synchronization for other ones. The patch was fairly simple: http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/50e4012a4b55e1efc595db0db397b4365f08b640 There's something WIP by attilio@ on that area. you might want to take a look at http://people.freebsd.org/~attilio/syncer_alpha_15.diff I don't know what hammerfs needs but UFS/FFS and buffer cache make a good job performance-wise and so the authors are skeptical about the boost that such a change can give. We believe that brain cycles need to be spent on other pieces of the system such as ARC and ZFS. More specifically, it is likely that focusing on UFS and buffer cache for performance is not really useful, we should drive our efforts over ARC and ZFS. Also, the real bottlenecks in our I/O paths are in GEOM single-threaded design, lack of unmapped I/O functionality, possibly lack of proritized I/O, etc. Indeed, Isilon (and probably other vendors as well) entirely skip VFS_SYNC when the WAIT argument is MNT_LAZY. Since we're a distributed journalled filesystem, syncing via a system thread is not a relevant operation; i.e. all writes that have exited a VOP_WRITE or similar operation are already in reasonably stable storage in a journal on the relevant nodes. However, we do then have our own threads running on each node to flush the journal regularly (in addition to when it fills up), and I don't know enough about this to know if it could be fit into the syncer thread idea or if it's too tied in somehow to our architecture. I'm not really sure how does journaling is implemented on OneFS, but when I made this patch SU+J wasn't yet there. Also, this patch just adds the infrastructure for a multithreaded and configurable syncer, which means it still requires the UFS bits for skipping the double-syncing (alias the MNT_LAZY skippage you mentioned). Right, I don't object to any changes relating to multiple sync threads, etc., just trying to offer a vendor viewpoint. Though having one thread per mount would allow for a different sync interval for each filesystem which can be of advantage. Right after I did Isilon's last FreeBSD merge (it seems like a long time ago now), I wanted to look into what it would take to eliminate our specialed journal flush thread (i.e. tie it into VFS_SYNC), but one objection was that then the flush interval would not be configurable separately from the one for our UFS partition. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: sysctl variable question
2011/12/18 Fernando Apesteguía fernando.apesteg...@gmail.com: El 18/12/2011 22:12, Julian Elischer jul...@freebsd.org escribió: On 12/18/11 12:18 PM, Fernando Apesteguía wrote: Hi all, I'm writing a small module just for fun. I would like to have two variables: - pid of type unsigned int and RW so the user can set a pid - process_name as a string RD that will display the process name associated to that pid (or a message if the pid doesn't exist anymore) this is dangerous as there are some places in the kernel where processes are referenced by pid, so changing it may break kernel assumptions. Sorry, i think i didn't explain it clearly. The pid variable is static in my module and it is used just to tell the module which information it should show in the other variable. Many sysctl handlers look for a req-newptr and leave early if it's NULL. If it's non-NULL then you can use SYSCTL_IN to fetch the data. Note that a caller of sysctl(3) API can specify both old pointer and new pointer, which is why most handlers always do a SYSCTL_OUT on the current value. Cheers, matthew My problem is with the handler functions. For process_name, as it is read only, I wrote a simple handler that works fine. However, I want to write another one for pid so I can sanitize the input (avoiding pids 0 and so). As I understand, the handler I specify with SYSCTL_OID will be called for both reads and writes. But, how can I tell what kind of operation is it, so I know if I have to use SYSCTL_OUT or SYSCTL_IN? I tried to have a look at sysctl_handle_int but I don't fully understand what is going on with the arg1 parameter. What is it for? Thanks in advance. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: kern_yield vs ukbd_yield
On Mon, Dec 12, 2011 at 12:03 AM, Andriy Gapon a...@freebsd.org wrote: on 11/12/2011 23:48 m...@freebsd.org said the following: On Sun, Dec 11, 2011 at 1:12 PM, Andriy Gapon a...@freebsd.org wrote: Does the following change do what I think that it does? Thank you! Author: Andriy Gapon a...@icyb.net.ua Date: Thu Sep 1 16:50:13 2011 +0300 ukbd: drop local duplicate of kern_yield and use that instead diff --git a/sys/dev/usb/input/ukbd.c b/sys/dev/usb/input/ukbd.c index 086c178..8078cbb 100644 --- a/sys/dev/usb/input/ukbd.c +++ b/sys/dev/usb/input/ukbd.c @@ -399,33 +399,6 @@ ukbd_put_key(struct ukbd_softc *sc, uint32_t key) } static void -ukbd_yield(void) -{ - struct thread *td = curthread; - uint32_t old_prio; - - DROP_GIANT(); - - thread_lock(td); - - /* get current priority */ - old_prio = td-td_base_pri; - - /* set new priority */ - sched_prio(td, td-td_user_pri); - - /* cause a task switch */ - mi_switch(SW_INVOL | SWT_RELINQUISH, NULL); - - /* restore priority */ - sched_prio(td, old_prio); - - thread_unlock(td); - - PICKUP_GIANT(); -} - -static void ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait) { @@ -439,7 +412,7 @@ ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait) while (sc-sc_inputs == 0) { /* give USB threads a chance to run */ - ukbd_yield(); + kern_yield(-1); Not quite. 1) -1 should be spelled PRI_UNCHANGED, except ukbd_yield() uses td_user_pri, but then puts it back again, so I think UNCHANGED is what is meant. 2) kern_yield() calls it a SW_VOL rather than SW_INVOL, which seems the desired behaviour here anyways, since this is an explicit (i.e. voluntary) yield. Thank you for the explanation. So would you say that the patch is OK? As far as I know, yes. There may be a difference in behaviour, though, while yielding, if the priority of the thread remains high (as this change would make it) -- I'm not completely sure how the scheduler chooses threads, because I'm pretty sure I've seen it take threads with lower (higher numbered) priorities even when there's runnable threads with a higher (lower numbered) priority available. It has always seemed weird to me that the priorities in the kernel are strictly higher than user-space -- but only after a prio change like that done implicitly by many of the calls to sleep(9). So it may be that the better patch is to use PRI_USER, not PRI_UNCHANGED, which would revert any potentially changed thread prio (e.g. due to a sleep(9)) back to its user-space level, so that it contended as expected with other threads. hselasky@ or someone else familiar with the various usb threads would have to answer that. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: kern_yield vs ukbd_yield
On Mon, Dec 12, 2011 at 11:05 AM, John Baldwin j...@freebsd.org wrote: On Monday, December 12, 2011 10:58:22 am Hans Petter Selasky wrote: On Monday 12 December 2011 16:55:38 m...@freebsd.org wrote: On Mon, Dec 12, 2011 at 12:03 AM, Andriy Gapon a...@freebsd.org wrote: on 11/12/2011 23:48 m...@freebsd.org said the following: On Sun, Dec 11, 2011 at 1:12 PM, Andriy Gapon a...@freebsd.org wrote: Does the following change do what I think that it does? Thank you! Author: Andriy Gapon a...@icyb.net.ua Date: Thu Sep 1 16:50:13 2011 +0300 ukbd: drop local duplicate of kern_yield and use that instead diff --git a/sys/dev/usb/input/ukbd.c b/sys/dev/usb/input/ukbd.c index 086c178..8078cbb 100644 --- a/sys/dev/usb/input/ukbd.c +++ b/sys/dev/usb/input/ukbd.c @@ -399,33 +399,6 @@ ukbd_put_key(struct ukbd_softc *sc, uint32_t key) } static void -ukbd_yield(void) -{ - struct thread *td = curthread; - uint32_t old_prio; - - DROP_GIANT(); - - thread_lock(td); - - /* get current priority */ - old_prio = td-td_base_pri; - - /* set new priority */ - sched_prio(td, td-td_user_pri); - - /* cause a task switch */ - mi_switch(SW_INVOL | SWT_RELINQUISH, NULL); - - /* restore priority */ - sched_prio(td, old_prio); - - thread_unlock(td); - - PICKUP_GIANT(); -} - -static void ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait) { @@ -439,7 +412,7 @@ ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait) while (sc-sc_inputs == 0) { /* give USB threads a chance to run */ - ukbd_yield(); + kern_yield(-1); Not quite. 1) -1 should be spelled PRI_UNCHANGED, except ukbd_yield() uses td_user_pri, but then puts it back again, so I think UNCHANGED is what is meant. 2) kern_yield() calls it a SW_VOL rather than SW_INVOL, which seems the desired behaviour here anyways, since this is an explicit (i.e. voluntary) yield. Thank you for the explanation. So would you say that the patch is OK? As far as I know, yes. There may be a difference in behaviour, though, while yielding, if the priority of the thread remains high (as this change would make it) -- I'm not completely sure how the scheduler chooses threads, because I'm pretty sure I've seen it take threads with lower (higher numbered) priorities even when there's runnable threads with a higher (lower numbered) priority available. The scheduler generally should not do this with the following exceptions: 1) for timesharing threads, priorities are in bands, so effectively the pri / 4 is what is used for comparison and if two threads have the same pri / 4 the scheduler will round-robin among htem. 2) ULE might be a bit different because of the way it assigns threads to CPUs, so if a CPU has two high priority threads and another CPU only has a low priority thread, the second CPU will not run the second high priority thread. 4BSD handles this case more correctly. It has always seemed weird to me that the priorities in the kernel are strictly higher than user-space -- but only after a prio change like that done implicitly by many of the calls to sleep(9). So it may be that the better patch is to use PRI_USER, not PRI_UNCHANGED, which would revert any potentially changed thread prio (e.g. due to a sleep(9)) back to its user-space level, so that it contended as expected with other threads. Realtime priorities (for rtprio user threads) are higher than the kernel sleep priorities. Also, keep in mind that a thread does not get an automatic priority boost when it enters the kernel. It only gets a boost either temporarily from priority propagation, or a slightly longer (but still temporary) boost from a sleep(9) call. I still don't understand why threads are boosting their priority while sleeping; if it's so they are woken preferentially, that makes some sense, but then they shouldn't be keeping the boosted priority after the thread is woken. If a thread really needs higher priority to do its work, that should be an explicit call to sched_prio(9), rather than implicitly only when/if it first sleeps. Thanks, matthew hselasky@ or someone else familiar with the various usb threads would have to answer that. The problem is only during init() where the init thread has highest priority and that doesn't allow other threads to run even if the scheduler is running! Hmm, that should be fixed by lowering the relevant thread's priority. Do you mean thread0 (the one doing all the SYSINIT's or thread we create for init (pid 1) before it executes init? -- John Baldwin ___ freebsd-hackers@freebsd.org mailing list
Re: kern_yield vs ukbd_yield
On Sun, Dec 11, 2011 at 1:12 PM, Andriy Gapon a...@freebsd.org wrote: Does the following change do what I think that it does? Thank you! Author: Andriy Gapon a...@icyb.net.ua Date: Thu Sep 1 16:50:13 2011 +0300 ukbd: drop local duplicate of kern_yield and use that instead diff --git a/sys/dev/usb/input/ukbd.c b/sys/dev/usb/input/ukbd.c index 086c178..8078cbb 100644 --- a/sys/dev/usb/input/ukbd.c +++ b/sys/dev/usb/input/ukbd.c @@ -399,33 +399,6 @@ ukbd_put_key(struct ukbd_softc *sc, uint32_t key) } static void -ukbd_yield(void) -{ - struct thread *td = curthread; - uint32_t old_prio; - - DROP_GIANT(); - - thread_lock(td); - - /* get current priority */ - old_prio = td-td_base_pri; - - /* set new priority */ - sched_prio(td, td-td_user_pri); - - /* cause a task switch */ - mi_switch(SW_INVOL | SWT_RELINQUISH, NULL); - - /* restore priority */ - sched_prio(td, old_prio); - - thread_unlock(td); - - PICKUP_GIANT(); -} - -static void ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait) { @@ -439,7 +412,7 @@ ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait) while (sc-sc_inputs == 0) { /* give USB threads a chance to run */ - ukbd_yield(); + kern_yield(-1); Not quite. 1) -1 should be spelled PRI_UNCHANGED, except ukbd_yield() uses td_user_pri, but then puts it back again, so I think UNCHANGED is what is meant. 2) kern_yield() calls it a SW_VOL rather than SW_INVOL, which seems the desired behaviour here anyways, since this is an explicit (i.e. voluntary) yield. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: gcc 4.2 miscompilation with -O2 -fno-omit-frame-pointer on amd64
On Sun, Nov 20, 2011 at 10:24 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote: On (19/11/2011 09:11), m...@freebsd.org wrote: On Sat, Nov 19, 2011 at 8:19 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote: On (19/11/2011 07:26), m...@freebsd.org wrote: On Sat, Nov 19, 2011 at 2:01 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote: Hi, I was lucky to write a bit of code which gcc 4.2 fails to compile correctly with -O2. Too keep long story short the code fails for gcc from base system and last gcc 4.2 snapshot from ports. It works with gcc 4.3, gcc 4.4 on FreeBSD and Linux. Clang from base is also good. -O and -Os optimization levels are fine (I've tried with all -f* flags mentioned in documentation) -O2 -fno-omit-frame-pointer combination is troublesome on amd64. I presume i386 should be fine. These options are also used for compilation of kernel (with debugging enabled) and modules. I'm not able to share the code, but have a test case reproducing the bug. I've encountered the issue over a week ago and tried narrowing it down to a simple test I could share but without much success. The code itself is very common: initialize two structs on stack, call a function with pointers to those stucts as arguments. A number of inlined assertion functions. gcc fails to correctly optimize struct assignments with -fno-omit-frame-pointer, I have a number of small structs assigned, gcc decides not to use data coping but to assign fields directly. I've tried disabling sra, tweaking sra parameters -- no luck in forcing it to copy data. Replacing one particular assignment with memcpy produces correct code, but that's not a solution. How small are the structs? gcc has an optimization for structs that are no larger than a register, but it's buggy in 4.2 and we disabled it at $WORK. I can dig up the patch if this is the problem. struct sockaddr_in in this particular test. 16 bytes. Register size structs are rather common, e.g. struct in_addr. I could test the patch. Adding -finline-functions seems to fix the issue for me. I can't find the thing I'm thinking of. The only potentially relevant patch I see in our gcc sources is this: It could be related but doesn't fix bug I observe. I've installed fresh 9.0-RC2 virtual machine and reran tests in clean environment. Do you plan committing it? We probably should, but I haven't heard that anyone else has had a problem with this. I'm sure when I do the next FreeBSD merge at $WORK I'll re-remember it and probably commit it then. Thanks, matthew Index: opts.c === --- opts.c (.../vendor.branches/freebsd/stable/7/src/contrib/gcc/opts.c) (revision 211574) +++ opts.c (.../head/src/contrib/gcc/opts.c) (revision 211574) @@ -457,11 +457,7 @@ flag_tree_dse = 1; flag_tree_ter = 1; flag_tree_live_range_split = 1; + /** + * 7dot1MERGE: tree-sra in gcc 4.2.x is buggy and + * breaks bitfield structs. + */ + flag_tree_sra = 0; - flag_tree_sra = 1; flag_tree_copyrename = 1; flag_tree_fre = 1; flag_tree_copy_prop = 1; Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: gcc 4.2 miscompilation with -O2 -fno-omit-frame-pointer on amd64
On Sat, Nov 19, 2011 at 2:01 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote: Hi, I was lucky to write a bit of code which gcc 4.2 fails to compile correctly with -O2. Too keep long story short the code fails for gcc from base system and last gcc 4.2 snapshot from ports. It works with gcc 4.3, gcc 4.4 on FreeBSD and Linux. Clang from base is also good. -O and -Os optimization levels are fine (I've tried with all -f* flags mentioned in documentation) -O2 -fno-omit-frame-pointer combination is troublesome on amd64. I presume i386 should be fine. These options are also used for compilation of kernel (with debugging enabled) and modules. I'm not able to share the code, but have a test case reproducing the bug. I've encountered the issue over a week ago and tried narrowing it down to a simple test I could share but without much success. The code itself is very common: initialize two structs on stack, call a function with pointers to those stucts as arguments. A number of inlined assertion functions. gcc fails to correctly optimize struct assignments with -fno-omit-frame-pointer, I have a number of small structs assigned, gcc decides not to use data coping but to assign fields directly. I've tried disabling sra, tweaking sra parameters -- no luck in forcing it to copy data. Replacing one particular assignment with memcpy produces correct code, but that's not a solution. How small are the structs? gcc has an optimization for structs that are no larger than a register, but it's buggy in 4.2 and we disabled it at $WORK. I can dig up the patch if this is the problem. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: gcc 4.2 miscompilation with -O2 -fno-omit-frame-pointer on amd64
On Sat, Nov 19, 2011 at 8:19 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote: On (19/11/2011 07:26), m...@freebsd.org wrote: On Sat, Nov 19, 2011 at 2:01 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote: Hi, I was lucky to write a bit of code which gcc 4.2 fails to compile correctly with -O2. Too keep long story short the code fails for gcc from base system and last gcc 4.2 snapshot from ports. It works with gcc 4.3, gcc 4.4 on FreeBSD and Linux. Clang from base is also good. -O and -Os optimization levels are fine (I've tried with all -f* flags mentioned in documentation) -O2 -fno-omit-frame-pointer combination is troublesome on amd64. I presume i386 should be fine. These options are also used for compilation of kernel (with debugging enabled) and modules. I'm not able to share the code, but have a test case reproducing the bug. I've encountered the issue over a week ago and tried narrowing it down to a simple test I could share but without much success. The code itself is very common: initialize two structs on stack, call a function with pointers to those stucts as arguments. A number of inlined assertion functions. gcc fails to correctly optimize struct assignments with -fno-omit-frame-pointer, I have a number of small structs assigned, gcc decides not to use data coping but to assign fields directly. I've tried disabling sra, tweaking sra parameters -- no luck in forcing it to copy data. Replacing one particular assignment with memcpy produces correct code, but that's not a solution. How small are the structs? gcc has an optimization for structs that are no larger than a register, but it's buggy in 4.2 and we disabled it at $WORK. I can dig up the patch if this is the problem. struct sockaddr_in in this particular test. 16 bytes. Register size structs are rather common, e.g. struct in_addr. I could test the patch. Adding -finline-functions seems to fix the issue for me. I can't find the thing I'm thinking of. The only potentially relevant patch I see in our gcc sources is this: Index: opts.c === --- opts.c (.../vendor.branches/freebsd/stable/7/src/contrib/gcc/opts.c) (revision 211574) +++ opts.c (.../head/src/contrib/gcc/opts.c) (revision 211574) @@ -457,11 +457,7 @@ flag_tree_dse = 1; flag_tree_ter = 1; flag_tree_live_range_split = 1; + /** + * 7dot1MERGE: tree-sra in gcc 4.2.x is buggy and + * breaks bitfield structs. + */ + flag_tree_sra = 0; - flag_tree_sra = 1; flag_tree_copyrename = 1; flag_tree_fre = 1; flag_tree_copy_prop = 1; Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Communication between kernel and userspace via local socket
On Tue, Nov 15, 2011 at 12:18 PM, Maxim Ignatenko gelraen...@gmail.com wrote: frHi, I'm currently inventing the wheel^W^W^Wwriting a firewall from scratch and looking for most convenient way to establish communication between userspace processes and kernel part. Communication pattern best fits to listening PF_LOCAL socket opened from kernel and userspace processes connecting to it. Clients should be able to send requests and receive responses from kernel (to retrieve list of loaded modules, active ruleset, add or remove rules, ...) and vice versa: kernel should be able to send request to userspace process and receive response (I'm planning to add interactive features like in most firewalls for windows(r)). First part can be implemented via ioctl, but it should be called not only by processes with euid == 0, so supplied pointer to receive buffer cannot be trusted (is there any mechanism to check memory allocation?) and any unprivileged user can instruct kernel to write some trash at arbitrary address (for example, VM just rebooted ungracefully when I supplied (void*)123 as pointer to destination buffer). Were you using copyout(9)? I think FreeBSD's memory isolation between processes is pretty decent. I would be very surprised if copyout to an invalid address did something other than return EFAULT. At least the amd64 implementation of copyout(9) will also explicitly check that the address is a user address, so that you can't corrupt kernel memory with a rogue pointer from user-space. Thanks, matthew So, requirements is: 1) message exchange can initiated from userspace and from kernel 2) safe to communicate with unprivileged processes (not like in above case with ioctl) 3) kernel part should be able to determine process uid 4) messages size can be large (from 1KB to 10KB and more) Now I'm thinking about few variants: 1) emulation of local socket via character device. This way requires to manually handle per-process IO buffers, which almost certainly will have many bugs 2) opening local socket from kernel. This, as I think, require to spawn new process in kernel (but I don't know how to do this) to listen for incoming connections and messages 3) userspace mux/demux daemon (like devd): one and only one process opens character device and uses local socket to communicate with other processes. This requires to design 2 ABIs - kernel-daemon and daemon-client. 2nd variant looks most appropriate but know I don't know how to implement it. Can anyone point me to some documentation about spawning processes in kernel an working with sockets from kernelspace, or suggest better way of communication between processes and kernel? ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: question regarding style(9) and field initialisers in structs
On Wed, Nov 2, 2011 at 4:56 PM, Alexander Best arun...@freebsd.org wrote: i sent the following message to freebsd-quaestions@ and got no answer. mybe it is better suited for freebsd-hackers@. hi there, i found hundreds of the following cases in the FreeBSD src: [...] struct periph_driver { periph_init_func_t init; char *driver_name; TAILQ_HEAD(,cam_periph) units; u_int generation; u_int flags; #define CAM_PERIPH_DRV_EARLY 0x01 }; [...] static struct periph_driver dadriver = { dainit, da, TAILQ_HEAD_INITIALIZER(dadriver.units), /* generation */ 0 }; ...is it proper programming practice to forget about the last field, if it would have been initialised to 0? It's more likely that, in this instance, the field was added after the initializer. Without named initializers, all fields until the last non-zero one need to be explicitly present. style(9) doesn't address it, having been written too long ago, but IMO initializations in a C translation unit [1] should use C99's named initializer feature, to protect against future member re-ordering. And it would be style(9) compliant to leave out any field whose initial value is zero (though sometimes it's good to list it anyways to make it clear that 0 was the desired value). [1] Note that C++ doesn't support C99's named initializer syntax (even in C++0x, which is a bit silly), so any struct initializers in a header file like sys/malloc.h must use old-style, since FreeBSD should continue to support writing kernel modules in C++. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: How to express inter-directory dependencies in bsd.*.mk infrastructure?
2011/10/30 Lev Serebryakov l...@freebsd.org: Hello, Hackers. (SORRY FOR SENDING INCOMPLETE MESSAGE) How to express inter-directory dependencies in bsd.*.mk infrastructure? I have project, which has two subdirectories: lib and bin. Top-level Makefile is simple one, looks like this: === SUBDIR= lib \ bin .include bsd.subdir.mk === lib subdirectory has Makefile with bsd.lib.mk included, and bin -- with bsd.prog.mk included. But how could I express, that PROG in bin depends on LIB from lib, to cause rebuilding of PROG when LIB is changed (when I call make on top level)? Normally I'd expect the dependency to be handled by make depend; that is, the source for the binary should be #including header files from the library. If the interfaces change, make depend will handle this. I guess the problem comes if you are using static linking and just implementation details internal to the library change. There is a DPADD makefile variable that seems to be used; I'm not sure though if this handles the type of dependency you're wanting. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: How to remopve volatile qualifier from pointer without warning from gcc?
2011/10/30 Lev Serebryakov l...@freebsd.org: Hello, Hackers. I need to pass volatile void * to API function, which takes void*. gcc (on FreeBSD 8.2) emits warning, and as in FreeBSD-styed code warnings are treated as errors, program could not be built. Manual casting gives warning, too... See the __DEVOLATILE() hack that will do sufficient casting to silence the compiler. As with any compiler warning, silencing it incorrectly comes with a risk... Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Using Valgrind on FreeBSD
2011/10/7 Mustafa Reşit Şahin resitsa...@gmail.com: I am trying to run Valgrind on FreeBSD. I am getting the error about ksem_open which i stated below. I have searched for a solution to be able to solve this problem and found the calgrind patch. I could not found the instructions to apply this patch. How can i use valgrind on FreeBSD with ksem_open enabled? It looks like that syscall is only available when the P1003_1B_SEMAPHORES option is in your kernel config. You will need to edit a kernel configuration file and rebuild / reinstall the kernel. Thanks, matthew I use valgrind version: valgrind-3.6.0 FreeBSD Version :FreeBSD 8.1-RELEASE-p2 FreeBSD 8.1-RELEASE-p2 #6: amd64 The error i get : (The valgrind output) --84521-- WARNING: unhandled syscall: 404 ==84521== at 0x144E2BC: __sys_ksem_init (in /lib/libc.so.7) ==84521== by 0x14422FE: sem_init (in /lib/libc.so.7) ==84521== by 0x1639BBB: snf__sem_ring_open (in /usr/local/opt/snf/lib/libsnf.so.0.3) ==84521== by 0x163A1AC: snf__open_endpoint_ring (in /usr/local/opt/snf/lib/libsnf.so.0.3) ==84521== by 0x163A5A3: snf__board_open (in /usr/local/opt/snf/lib/libsnf.so.0.3) ==84521== by 0x1637A7A: snf_open (in /usr/local/opt/snf/lib/libsnf.so.0.3) ==84521== by 0xF04DBA: snf_activate (in /usr/local/opt/snf/lib/libpcap.so.1.1.1) ==84521== by 0xF05C26: pcap_activate (in /usr/local/opt/snf/lib/libpcap.so.1.1.1) ==84521== by 0xF0624A: pcap_open_live (in /usr/local/opt/snf/lib/libpcap.so.1.1.1) ==84521== by 0x452B4D: ??? (in /usr/sbin/tcpdump) ==84521== by 0x402E4D: ??? (in /usr/sbin/tcpdump) ==84521== by 0x41FFF: ??? --84521-- You may be able to write your own handler. --84521-- Read the file README_MISSING_SYSCALL_OR_IOCTL. --84521-- Nevertheless we consider this a bug. Please report --84521-- it at http://valgrind.org/support/bug_reports.html. ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Atomic increment and get?
2011/10/3 Lev Serebryakov l...@freebsd.org: Is here atomic increment and get (or add and get) operation in kernel? I cannot find one. Here is atomic_add_32(), but it doesn't return result. And here is no atomic_add_64() on 32 bit system :( See atomic_fetchadd_int. Not all hardware has 64-bit atomic instructions available in 32-bit mode so it's not machine-independent. Generally, for this case (and potentially for sub-32-bit atomic operations [1]) a lock is used. [1] with masking and shifting 32-bit atomics can be used to set data in an 8 or 16 bit variable; see a recent commit to sys/vm. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: NUMA Support is there in FreeBSD.
On Mon, Oct 3, 2011 at 7:55 AM, satish kondapalli nitw.sat...@gmail.com wrote: I am new to FreeBSD, I just want know whether FreeBSD supports NUMA. If FreeBSD supports NUMA what are the kernel API to allocate memory? is there any example driver or any driver which is using the NUMA API? please provide some inputs... The kernel is NUMA-aware (at least for x86), and memory is allocated round-robin amongst the memory domains. There are not yet any KPIs for allocating memory in a specific NUMA domain, nor for binding specific threads / processes to get their memory local to a bound cpu instead of round robin. There have been several discussions but no one has taken the lead and proposed some KPIs yet. At $WORK the round-robin is sufficient to get consistent performance numbers and we have not yet started any experimentation with binding specific threads to either CPU or memory. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: NUMA Support is there in FreeBSD.
On Mon, Oct 3, 2011 at 10:24 AM, Arnaud Lacombe lacom...@gmail.com wrote: Hi, On Mon, Oct 3, 2011 at 12:31 PM, m...@freebsd.org wrote: On Mon, Oct 3, 2011 at 7:55 AM, satish kondapalli nitw.sat...@gmail.com wrote: I am new to FreeBSD, I just want know whether FreeBSD supports NUMA. If FreeBSD supports NUMA what are the kernel API to allocate memory? is there any example driver or any driver which is using the NUMA API? please provide some inputs... The kernel is NUMA-aware (at least for x86), What x86 ? i386 ? amd64 ? both ? Both; see sys/x86/acpica/srat.c which parses the SRAT table. and memory is allocated round-robin amongst the memory domains. There are not yet any KPIs for allocating memory in a specific NUMA domain, nor for binding specific threads / processes to get their memory local to a bound cpu instead of round robin. I'm not sure to follow you. Say you have 2 memory domain attached to 2 different CPU package, each providing a memory domain, 4 physical core and eventually 8 virtual. Say you have a network adapter supporting 8 RX/TX queue, dispatching RX packet to 8 netisr. Ideally, you'd want those 8 queue/netisr to each have an affinity for a given CPU/memory domain, have the network adapter route flow evenly on those those 8 CPU. Now, if you allocated an mbuf from memory domain 1, and end up being processed by a CPU in domain 0, that likely to introduce performance penalty. Your statement isn't incorrect. What I'm saying is that there's no KPI for requesting bound memory because, while the netstat example is a fine one for where local memory is desired, the majority [1] of processing is not bound to a CPU and so round-robin allocations will produce uniform performance results -- that is, not the best possible, but not wildly fluctuating as scheduling decisions over different runs give different remote memory penalties. [1] for some definition of 'majority'. Now, what about userland ? This is certainly an horribly big picture :/ Yes, and it's why I said just that there's no KPI. One reason there is no KPI is that there's a lot of fiddly bits to take into account. My experience at IBM on AIX was that NUMA is very easy to get wrong; specifically what one usually wants is for the OS to get the answer right (especially for userspace) without a lot of manual tuning; except for some specific applications like netstat queues or a machine doing HPC or mostly running e.g. an Oracle db server, there's too much happening for any one program to configure itself right for all the uses of that code. I remember a lot of customer reports of problems from overly aggressive local memory use. Most of the time no one complained when things had consistent performance, even if that wasn't quite as fast as possible. In fact, I may be wrong about the round-robin; I sent jhb@ a patch and I have no recollection anymore whether it's actually in CURRENT. It's been over a year since I thought about this much (BSDCan 2010 was the last time I remember). Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Memory allocation in kernel -- what to use in which situation? What is the best for page-sized allocations?
2011/10/2 Lev Serebryakov l...@freebsd.org: Hello, Freebsd-hackers. Here are several memory-allocation mechanisms in the kernel. The two I'm aware of is MALLOC_DEFINE()/malloc()/free() and uma_* (zone(9)). As far as I understand, malloc() is general-purpose, but it has fixed transaction cost (in term of memory consumption) for each block allocated, and is not very suitable for allocation of many small blocks, as lots of memory will be wasted for bookkeeping. zone(9) allocator, on other hand, have very low cost of each allocated block, but could allocate only pre-configured fixed-size blocks, and ideal for allocation tons of small objects (and provide API for reusing them, too!). Am I right? No one has quite answered this question, IMO, so here's my 2 cents. malloc(9) on smaller sizes (= PAGE_SIZE) uses uma(9) under the covers. There are a set of uma zones for 16, 32, 64, 128, ... PAGE_SIZE bytes and malloc(9) looks up the malloc size in a small array to determine which uma zone to allocate from. So malloc(9) on small sizes doesn't have overhead of bookkeeping, but it does have overhead of rounding to the next highest malloc uma bucket. At $WORK we found, for example, that 48 bytes and 96 bytes were very common sizes and so I added uma zones there (and few other odd sies determined by using the malloc statistics option). But what if I need to allocate a lot (say, 16K-32K) of page-sized blocks? Not in one chunk, for sure, but in lifetime of my kernel module. Which allocator should I use? It seems, the best one will be very low-level only-page-sized allocator. Is here any in kernel? 4k allocations, as has been pointed out, get a single kernel page in both the virtual space and physical space. They (like all the large allocations) use a field in the vm_page for the physical page backing the virtual address to record info about the allocation. Any allocation PAGE_SIZE and larger will round up to the next multiple of pages and allocate whole pages. IMO the problems here are (1) as was pointed out, TLB shootdown on free(9), and (2) the current algorithm for finding space in a kmem_map is a linear search and doesn't track where there are fragmented chunks, so it's not terribly efficient when finding larger sies, and the PAGE_SIZE allocations will not fill in fragmented areas. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Way to get current tick number in kernel?
2011/10/2 Lev Serebryakov l...@freebsd.org: Hello, Freebsd-hackers. What should I use to measure short intervals of time between events in kernel? I don't need any time in means of, for example, time(3) API, but some monotonically and uniformly increasing counter with known frequency. As cheap as possible, without complex calculations :) There are several global variables that may suffice. 'ticks' is the current tick value. 'time_second' is the current time in seconds, but this is adjusted when the system time is changed. 'time_uptime' is the current uptime in seconds and is the base upon which time_second is computed, depending on the system clock. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Understanding panic and exit in the kernel
On Wed, Sep 7, 2011 at 11:08 AM, Charlie Martin crmar...@sgi.com wrote: I'm still pursuing a FreeBSD bug in 7.2-PRERELEASE FreeBSD -- and yes, we know this is wildly out of date, but it's not feasible to upgrade right now -- and while trying to backport a fix suggested here http://permalink.gmane.org/gmane.os.freebsd.current/134266 I got a situation where the panic calls in one of these two macros from sys/queue.h #define QMD_LIST_CHECK_NEXT(elm, field) do { \ if (LIST_NEXT((elm), field) != NULL \ LIST_NEXT((elm), field)-field.le_prev != \ ((elm)-field.le_next)) \ panic(Bad link elm %p next-prev != elm, (elm)); \ } while (0) #define QMD_LIST_CHECK_PREV(elm, field) do { \ if (*(elm)-field.le_prev != (elm)) \ panic(Bad link elm %p prev-next != elm, (elm)); \ } while (0) print the message, but *don't* then proceed to drop to the debugger -- instead the system hangs, with the CPU running but I had no luck getting its attention to force it to the debugger. I'm not clear just what could be causing the hangup. For my immediate purposes, I'd be happy with any way in which I could brutally kill the kernel and force it to the debugger, say by replacing the panic call with a printf followed by 1/0;. But I'm a little confused by the panic.c code -- it prints the arguments using a var_args, and then calls exit(1);' What file are you looking in? The kernel panic() is in sys/kern/kern_shutdown.c, not sys/boot/common/panic.c. It will optionally call kdb_enter_why() and then boot(). Do you have the debug.debugger_on_panic sysctl set to 1? Thanks, matthew So my questions: (1) will my brutal method actually force what I want or am I misunderstanding something (2) *which* of the several implementations of int exit(int) or similar is the one called in the FreeBSD kernel? (3) and how then does exit work? ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Understanding panic and exit in the kernel
On Wed, Sep 7, 2011 at 12:24 PM, Charlie Martin crmar...@sgi.com wrote: On 2011-09-07 12:53, m...@freebsd.org wrote: For my immediate purposes, I'd be happy with any way in which I could brutally kill the kernel and force it to the debugger, say by replacing the panic call with a printf followed by 1/0;. But I'm a little confused by the panic.c code -- it prints the arguments using a var_args, and then calls exit(1);' What file are you looking in? The kernel panic() is in sys/kern/kern_shutdown.c, not sys/boot/common/panic.c. It will optionally call kdb_enter_why() and then boot(). Bingo, that's got to help. This makes a lot more sense. Do you have the debug.debugger_on_panic sysctl set to 1? Yes -- and panic does so *except* in the version with those changes to queue.h. If all you need to get started is a backtrace then debug.trace_on_panic=1 may be sufficient. I'm not sure of the timeline when 7.2-prerelease was issued, but there have been some reliability improvements to panic handling including marking that a CPU is in panic, etc., that may have come after 7.2-prerelease. You may want to look at the svn history for kern_shutdown.c and locally apply those changes to see if it changes your result. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Large machine test ideas
On Mon, Aug 29, 2011 at 7:46 AM, Ivan Voras ivo...@freebsd.org wrote: On 26/08/2011 19:44, Garrett Cooper wrote: On Fri, Aug 26, 2011 at 10:36 AM, Ivan Voras ivo...@freebsd.org wrote: ... I think that I'll need a 9-CURRENT snapshot on it to run all 128 CPUs, right? A 9.0-BETA1 snapshot, yes. Well, I'll leave it another half an hour but the 9.9-beta1 shapshot froze on boot after showing a SRAT: No CPU found for memory domain 4. This message implies the memory affinity information coming from ACPI is either non-sensical, or you have an unexpected physical setup where there really are CPUs with no memory in the local sockets. You should be able to boot with something like hint.srat.0=disabled at the boot loader prompt. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: [PATCH] sysctl cleanup - unifying sysctls: vm.kvm_size and vm.kvm_free
On Mon, Jul 18, 2011 at 5:32 AM, Uffe Jakobsen u...@uffe.org wrote: Please consider this patch - it unifies sysctls: vm.kvm_size and vm.kvm_free. Currently these sysctls are only found under i386 and amd64: sys/i386/i386/pmap.c sys/i386/xen/pmap.c sys/amd64/amd64/pmap.c It seems logical (to me) to move them into a generic location suce as sys/vm/vm_kern.c Patch against HEAD (revision 224180) attached. I don't believe that VM_MAX_KERNEL_ADDRESS is relevant to all architectures, which is why these sysctls are in i386/amd64 specific files. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: [PATCH] sysctl cleanup - unifying sysctls: vm.kvm_size and vm.kvm_free
On Mon, Jul 18, 2011 at 7:47 AM, Uffe Jakobsen u...@uffe.org wrote: On 2011-07-18 15:54, m...@freebsd.org wrote: On Mon, Jul 18, 2011 at 5:32 AM, Uffe Jakobsenu...@uffe.org wrote: Please consider this patch - it unifies sysctls: vm.kvm_size and vm.kvm_free. Currently these sysctls are only found under i386 and amd64: sys/i386/i386/pmap.c sys/i386/xen/pmap.c sys/amd64/amd64/pmap.c It seems logical (to me) to move them into a generic location suce as sys/vm/vm_kern.c Patch against HEAD (revision 224180) attached. I don't believe that VM_MAX_KERNEL_ADDRESS is relevant to all architectures, which is why these sysctls are in i386/amd64 specific files. Well I'm not an expert - but I did some checking before submitting this patch. As far as I can see VM_MAX_KERNEL_ADDRESS is addressed in several generic kernel files: sys/kern/subr_param.c sys/vm/vm_map.c sys/vm/vm_object.c and as far as I can see these locations are not ifdef'ed with any platform specific condition. Further more the following platform-specific files defines VM_MAX_KERNEL_ADDRESS: sys/amd64/include/vmparam.h sys/arm/include/vmparam.h sys/i386/include/vmparam.h sys/ia64/include/vmparam.h sys/mips/include/vmparam.h sys/powerpc/include/vmparam.h sys/sparc64/include/vmparam.h That leaves us only with pc98 and x86 platform-specifics subdirs that does not directly define VM_MAX_KERNEL_ADDRESS sys/pc98/include/vmparam.h includes i386/vmparam.h which defines VM_MAX_KERNEL_ADDRESS While x86 subdir is a little more questionable - though it has several includes for machine/vmparam.h which defines VM_MAX_KERNEL_ADDRESS Could soneone please give me an indication of what would be the correct way for me to verify this in a more scientific way. Alan Cox is usually a definitive answer for FreeBSD vm questions. My recollection was in the context of a bug with memguard(9) on sparc64, however now that I refresh my memory the issue there was VM_KMEM_SIZE_MAX, not VM_MAX_KERNEL_ADDRESS. So, my apologies as my memory has led me astray. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: [PATCH] __FreeBSD_cc_version in sys/cdefs.h
On Tue, Jul 5, 2011 at 6:36 AM, Robert Millan r...@debian.org wrote: 2011/7/5 Dimitry Andric d...@freebsd.org: As far as I can see, this code only gives warnings when compiled with gcc 4.5 or higher, and when using the -Wundef flag. Isn't it easier to just remove the -Wundef flag here? Here's a patch to remove -Wundef. I think it's a bad idea however, IMHO it's better to fix the cause of the warning instead. The problem is that, IIRC, C guarantees that an undefined symbol when checked in a #if context will evaluate to 0. So -Wundef checks for warnings on compliant code. Personally I have no objection to the original patch. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Unit Tests for FreeBSD Kernel APIs?
On Fri, Jun 24, 2011 at 6:14 AM, John Baldwin j...@freebsd.org wrote: On Friday, June 24, 2011 3:23:11 am Sebastian Huber wrote: Hello, exists there some unit tests for FreeBSD kernel APIs, e.g. mutex(9), condvar(9), etc.? Have a nice day! Hmm, I have a kernel module that does some tests, but it is not in the tree. One of the issues is that many of the tests you want to do for some of these APIs involve timing. For rwlocks, for example, I used KTR traces and used a kernel module that forked 4 threads to all compete over a single lock. I then verified via KTR traces that every branch was taken (and made liberal use of KASSERT()s which caught a few edge cases I had missed initially). At $WORK we have a generic TEST_THREAD interface which allows for calling essentially random kernel code (whatever a dev codes up) with arguments. At the shell one types test_thread foobar(12345, \string\) or whatever args. The test_thread syscall looks in the set of registered functions for foobar, and passes off a string with (12345, \string\) to the function. Each function is responsible for parsing its own arguments. The args for the test_thread function pointer are the $WORK equivalent of and input sbuf and and output sbuf, which is copyout'd to the calling binary. If there's interest I can clean up the patch a little to show proof-of-concept that builds without all the $WORK gorp. We use this for a few things: unit-testing kernel code, hand-editing of filesystem data to recover customer sites from kernel bugs (sometimes this is possible), etc. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Check for 0 ino_t in readdir(3)
There is a check in the function implementing readdir(3) for a zero inode number: struct dirent * _readdir_unlocked(dirp, skip) DIR *dirp; int skip; { /* ... */ if (dp-d_ino == 0 skip) continue; /* ... */ } skip is 1 except for when coming from _seekdir(3). I don't recall any requirement that a filesystem not use an inode numbered 0, though for obvious reasons it's a poor choice for a file's inode. So... is this code in libc incorrect? Or is there documentation that 0 cannot be a valid inode number for a filesystem? If 0 cannot be a valid inode number, my GSoC mentee Gleb suggested d_fileno == 0 be skipped in dirent_exists() as used by vop_stdvptocnp(9), and also skipped by unionfs. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: int64_t and printf
On Sun, Jun 5, 2011 at 11:13 AM, Ben Laurie b...@links.org wrote: So, for example int64_t has no printf modifier I am aware of. Likewise its many friends. The worse option is to use the C99 defines, like PRI64 and PRI64U. The better option is to use %jd / %ju and cast the value to [u]intmax_t. Cheers, matthew In the past I've handled this by having a define somewhere along the lines of... #if something # define INT_64_T_FMT %ld #else # define INT_64_T_FMT %lld #endif but I have no idea where to put such a thing in FreeBSD. Opinions? Also, I guess I'd really need to do a modifier rather than a format, for full generality. -- http://www.apache-ssl.org/ben.html http://www.links.org/ There is no limit to what a man can do or how far he can go if he doesn't mind who gets the credit. - Robert Woodruff ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
sizeof(function pointer)
I am looking into potentially MFC'ing r212367 and related, that adds drains to sbufs. The reason for MFC is that several pieces of new code in CURRENT are using the drain functionality and it would make MFCing those changes much easier. The problem is that r212367 added a pointer to a drain function in the sbuf (it replaced a pointer to void). The C standard doesn't guarantee that a void * and a function pointer have the same size, though its true on amd64, i386 and I believe PPC. What I'm wondering is, though not guaranteed by the standard, is it *practically* true that sizeof(void *) == sizeof(int(*)(void)), such that an MFC won't break binary compatibility for any supported architecture? (The standard does guarantee, though not in words, that all function pointers have the same size, since it guarantees that pointers to functions can be cast to other pointers to functions and back without changing the value). Another possibility is to malloc a blob that is sizeof(int(*)(void)) and store that in a renamed s_unused; this is a bit messier but guaranteed to work. I'd just rather the code be an MCF instead of a partial re-write. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Does FreeBSD have replacement for posix_fadvice() or fcntl(F_RDADVISE)?
On Thu, May 12, 2011 at 4:17 AM, Kostik Belousov kostik...@gmail.com wrote: On Thu, May 12, 2011 at 11:38:12AM +0400, Lev Serebryakov wrote: Hello, Freebsd-hackers. Does FreeBSD have some custom call, which can be used where Linux programs uses posix_fadvice() and DARWIN ones fcntl(F_RDADVISE)? It is like madvise(2) but for file descriptors. No, it does not (and I think the function is spelled posix_fadvise()). mdf reserved the syscall slot for posix_fadvise in his recent work on posix_fallocate(). Might be, he could comment more. Whoops, I replied bot forgot to reply-all. Adding a stub for posix_fadvise() is on my list of things to do, since we use it at $WORK in our custom filesystem, but the list is long and spare time is short. Nominally a technically correct implementation can do error checking and then nothing. Even after a stub is added to the OS, each filesystem must implement a VOP to do the work. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Does FreeBSD have replacement for posix_fadvice() or fcntl(F_RDADVISE)?
2011/5/12 Lev Serebryakov l...@serebryakov.spb.ru: Hello, Mdf. You wrote 12 мая 2011 г., 15:40:25: Does FreeBSD have some custom call, which can be used where Linux programs uses posix_fadvice() and DARWIN ones fcntl(F_RDADVISE)? It is like madvise(2) but for file descriptors. No, it does not (and I think the function is spelled posix_fadvise()). mdf reserved the syscall slot for posix_fadvise in his recent work on posix_fallocate(). Might be, he could comment more. Whoops, I replied bot forgot to reply-all. Adding a stub for posix_fadvise() is on my list of things to do, since we use it at $WORK in our custom filesystem, but the list is long and spare time is short. Nominally a technically correct implementation can do error checking and then nothing. Even after a stub is added to the OS, each filesystem must implement a VOP to do the work. transmission Torrents client rely on this syscall. It could be beuilt without it, of course (it is in our ports!), but it seems, that it produce much more sane disk load on Linux, because uses pre-fetch via posix_fadvise(). My profiling shows that pread() could take about 40% of it CPU time (8 seconds out of 20 with turned on ktrace, for example), even if throughput is very modest (700KiB/s, for example). Authors say, that they didn't observe such behavior on Linux (which has posix_fadvise()) or MacOS X (with fcntl(F_RDADVISE))... May be, we should go to fs@ with this thread now? Probably; the interface and a few unit tests are a start but the common filesystems (presumably ufs and zfs and perhaps others) will need support and I do not have any knowledge of the internals of those filesystems. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Include file search path
I thought I knew something about how the compiler looks for include files, but now I think maybe I don't know much. :-) So here's what I'm pondering. When I build a library, like e.g. libc, where do the include files get pulled from? They can't (shouldn't) be the ones in /usr/include, but I don't see a -nostdinc like for the kernel. There are -I directives in the Makefile for -I${.CURDIR}/include -I${.CURDIR}/../../include, etc., but that won't remove /usr/include from the search path. I see in the gcc documentation that -I paths are searched before the standards paths. But isn't the lack of -nostdinc a bug (not just for libc, but for any library in /usr/src/lib)? It somewhat feels to me that all of the libraries and binaries in the source distribution should use -nostdinc and include only from the source distribution itself. This isn't always an issue, but for source upgrades it seems crucial, and for a hacker it saves difficulties with having to install headers before re-building. Is that the intent, and it's not fully implemented? How badly would things break if -nostdinc was included in e.g. bsd.lib.mk? (This would break non-base libraries, yes? But as a thought experiment for the base, how far off are we?) Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
coretemp(4)/amdtemp(4) and sysctl nodes
There are very few uses in FreeBSD mainline code of sysctl_remove_oid(), and I was looking at potentially removing them. However, the use in coretemp/amdtemp has me slightly stumped. Each device provides a device_get_sysctl_ctx sysctl_ctx that is automatically cleaned up when the device goes away. Yet the sysctl nodes for both amdtemp and coretemp use the context of other devices, rather than their own. I can't quite figure out why, though the two are slightly different enough that they may have different reasons. For coretmp(4) I don't see how the parent device can be removed first, since we are a child device. So from my understanding it makes no sense to have an explicit sysctl_remove_oid() and attach in the parent's sysctl_ctx. For amdtemp, the temperature sysctl is added to the nexus.acpi.cpu device, which I suppose it's possible could be removed while the amdtemp sysctl was still present, causing errors. This case could be dealt with by orphaning sysctl nodes at sysctl_ctx_free if there were any not removed, presuming that whoever created them will clean them up (or leak the memory). Thoughts and explanations of the use of the parent's sysctl context? Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
FreeBSD development on Mac OS
(Taking to -hackers by request) My old (Ubuntu) laptop died and I have replaced it with a Macbook Air. I know from visual inspection at BSDCan that 50% of the attending developers own a Mac laptop of some kind. However, I've been a bit stumped on how to share the source code from my Mac with the FreeBSD virtual machine. When I was using Linux it was relatively easy; I exported a directory and mounted it with NFS. Using my Mac as a NFS server hasn't yet worked. The google is mostly coming up empty for me in terms of help. My /etc/exports file on the Mac looks like: /data -maproot=mdf:admin -network 10.211.55.0-mask 255.255.255.0 as that's the IP range that parallels is using for its virtual machines. But when I try to mount from FreeBSD (as root) I get this error: RPCPROG_MNT: RPC: Authentication error; why = Client credential too weak Again, the google hits for this error message haven't given me anything helpful. I tried taking out the maproot option but I get the same error. Immediately after a nfsd restart I get a RPCPROG_NFS: RPC: Program not registered error, but that clears up and then I'm back to Client credential too weak. I don't know if this is a client or server issue, but I don't recall that I'm doing anything differently on the FreeBSD VM side except that I'm trying it through Parallels instead of VMware or VirtualBox at the moment. So... any thoughts as to what else I need to do on the server side? Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: FreeBSD development on Mac OS
On Sun, Nov 21, 2010 at 8:02 PM, Garrett Cooper gcoo...@freebsd.org wrote: On Sun, Nov 21, 2010 at 7:20 PM, Daniel O'Connor docon...@gsoft.com.au wrote: On 22/11/2010, at 13:32, m...@freebsd.org wrote: My /etc/exports file on the Mac looks like: /data -maproot=mdf:admin -network 10.211.55.0 -mask 255.255.255.0 as that's the IP range that parallels is using for its virtual machines. But when I try to mount from FreeBSD (as root) I get this error: RPCPROG_MNT: RPC: Authentication error; why = Client credential too weak I just tried this on my MBP with parallels and it worked fine - I had maproot=0:0 though. Also, I mounted /Users - don't know if it makes a difference or if there is some other thing that needs tweaking first. In short, look at /var/log/messages on the Macbook to see what RPC isn't happy with [1]. HTH, -Garrett 1. http://www.freebsddiary.org/nfs.php maproot=0:0 gives me the same error. There is no /var/log/messages on the mac, or at least not on mine. A grep for things like credential shows no hits anywhere in /var/log; a grep for mountd in /var/log/*.log only has hits in launchd-shutdown.log. A grep for nfsd in /var/log/*.log doesn't *seem* to have anything useful, just a few in appfirewall.log about nfsd listening on various ports. Even with the firewall off I get the same Client credential too weak error. I get the same error when I change /etc/exports on the Mac and /etc/fstab on the FreeBSD VM to /Users/mdf. So... I'm pretty stumped. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: FreeBSD development on Mac OS
Solved! The default setup for Parallels was a Shared network. I didn't see any way to change this in the Parallels machine configuration, but there was another menu I had missed where the network was configurable. Changing the network to bridged, ifconfig em0 down; dhclient em0 and now mounting works! Thanks, matthew On Sun, Nov 21, 2010 at 8:44 PM, m...@freebsd.org wrote: On Sun, Nov 21, 2010 at 8:02 PM, Garrett Cooper gcoo...@freebsd.org wrote: On Sun, Nov 21, 2010 at 7:20 PM, Daniel O'Connor docon...@gsoft.com.au wrote: On 22/11/2010, at 13:32, m...@freebsd.org wrote: My /etc/exports file on the Mac looks like: /data -maproot=mdf:admin -network 10.211.55.0 -mask 255.255.255.0 as that's the IP range that parallels is using for its virtual machines. But when I try to mount from FreeBSD (as root) I get this error: RPCPROG_MNT: RPC: Authentication error; why = Client credential too weak I just tried this on my MBP with parallels and it worked fine - I had maproot=0:0 though. Also, I mounted /Users - don't know if it makes a difference or if there is some other thing that needs tweaking first. In short, look at /var/log/messages on the Macbook to see what RPC isn't happy with [1]. HTH, -Garrett 1. http://www.freebsddiary.org/nfs.php maproot=0:0 gives me the same error. There is no /var/log/messages on the mac, or at least not on mine. A grep for things like credential shows no hits anywhere in /var/log; a grep for mountd in /var/log/*.log only has hits in launchd-shutdown.log. A grep for nfsd in /var/log/*.log doesn't *seem* to have anything useful, just a few in appfirewall.log about nfsd listening on various ports. Even with the firewall off I get the same Client credential too weak error. I get the same error when I change /etc/exports on the Mac and /etc/fstab on the FreeBSD VM to /Users/mdf. So... I'm pretty stumped. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Phantom sysctl
On Mon, Nov 15, 2010 at 9:53 AM, Garrett Cooper yaneg...@gmail.com wrote: According to SYSCTL_INT(9): The SYSCTL kernel interfaces allow code to statically declare sysctl(8) MIB entries, which will be initialized when the kernel module containing the declaration is initialized. When the module is unloaded, the sysctl will be automatically destroyed. The sysctl should be reaped when the module is unloaded. My dumb test kernel module [1] doesn't seem to do that though (please note that the OID test_int_sysctl is created, and not reaped... FWIW it's kind of bizarre that test_int_sysctl is created in the first place, given what I've seen when SYSCTL_* gets executed): toaster# kldload ./test_int_sysctl.ko toaster# sysctl -a | grep test test_int_sysctl: 0 vfs.nfs_common.realign_test: 0 debug.test_int_sysctl: 0 toaster# sysctl test_int_sysctl sysctl: unknown oid 'test_int_sysctl' toaster# kldunload ./test_int_sysctl.ko toaster# sysctl -a | grep test test_int_sysctl: 0 vfs.nfs_common.realign_test: 0 debug.test_int_sysctl did disappear, and that was the sysctl in the module: SYSCTL_INT(_debug, OID_AUTO, test_int_sysctl, CTLFLAG_RW, _sysctl, 0, Test sysctl OID); I am not sure where the other test_int_sysctl appeared from, but the results of sysctl test_int_sysctl didn't change from before kldunload to after. Thanks, matthew toaster# sysctl test_int_sysctl sysctl: unknown oid 'test_int_sysctl' I've seen this behavior on 8.1-RELEASE (custom kernel, vanilla sources), and CURRENT r215254. I'm compiling the kernel with SYSCTL_DEBUG (and added some missing error checking in the kern_sysctl.c code) to see if I can track down the resource issue, but in the meantime if someone more knowledgeable has some suggestions for what to do / where I should look, I'm all ears. Thanks! -Garrett 1. http://pastebin.com/n7d9bH8U ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: UMA allocations from a specific physical range
On Mon, Sep 6, 2010 at 1:38 AM, Nathan Whitehorn nwhiteh...@freebsd.org wrote: PowerPC hypervisors typically provided a restricted range on memory when the MMU is disabled, as it is when initially handling exceptions. In order to restore virtual memory, the powerpc64 code needs to read a data structure called the SLB cache, which is currently allocated out of a UMA zone, and must be mapped into wired memory, ideally 1:1 physical-virtual address. Since this must be accessible in real mode, it must have a physical address in a certain range. I am trying to figure out the best way to do this. My first run at this code uses a custom UMA allocator that calls vm_phys_alloc_contig() to get a memory page. The trouble I have run into is that I cannot figure out a way to free the page. Marking the zone NOFREE is a bad solution, vm_page_free() panics the kernel due to inconsistent tracking of page wiring, and vm_phys_free_pages() causes panics in vm_page_alloc() later on (page is not free). What is the correct way to deallocate these pages? Or is there a different approach I should adopt? I assume this is for the SLB flih? What AIX did was to have a 1-1 simple esid to vsid translation for kernel addresses, reserve the first 16 SLB entries for various uses, including one for the current process's process private segment, and if the slb miss was on a process address we'd turn on translation and look up the answer, the tables holding the answer being in the process private segment effective address space so we wouldn't take another slb miss. This required one level deep recursion in the slb slih, in case there was a miss on kernel data with xlate on in the SLB slih. For historical reasons due to the per-process segment table for POWER3, we also had a one-page hashed lookup table per process that we stored the real address of in the process private segment, so the assembly code in the flih looked here before turning on MSR_DR IIRC. I was trying to find ways to kill this code when I left IBM, since we'd ended support for POWER3 a few years earlier. I haven't had the time to look at FreeBSD ppc64 sources; how large are the uma-allocated slb entries and what is stored in them? The struct and filename is sufficient, though I don't have convenient access to sources until Tuesday. V=R space is rather limited (well, depending on a lot of factors; for AIX on Power5 and later the hypervisor only gave us 128M, though for ppc64 on a Mac G4 I assume all of memory can be mapped V=R if desired) so it was best to find a non V=R solution if possible. Turning on translation in the flih after some setup and recursion stopping is one of the easier ways, and also has the advantage of not needing to either have separate code or macro access to data structures used in both V and R modes. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: UMA allocations from a specific physical range
On Mon, Sep 6, 2010 at 4:28 AM, Nathan Whitehorn nwhiteh...@freebsd.org wrote: On 09/05/10 22:51, m...@freebsd.org wrote: On Mon, Sep 6, 2010 at 1:38 AM, Nathan Whitehorn nwhiteh...@freebsd.org wrote: PowerPC hypervisors typically provided a restricted range on memory when the MMU is disabled, as it is when initially handling exceptions. In order to restore virtual memory, the powerpc64 code needs to read a data structure called the SLB cache, which is currently allocated out of a UMA zone, and must be mapped into wired memory, ideally 1:1 physical-virtual address. Since this must be accessible in real mode, it must have a physical address in a certain range. I am trying to figure out the best way to do this. My first run at this code uses a custom UMA allocator that calls vm_phys_alloc_contig() to get a memory page. The trouble I have run into is that I cannot figure out a way to free the page. Marking the zone NOFREE is a bad solution, vm_page_free() panics the kernel due to inconsistent tracking of page wiring, and vm_phys_free_pages() causes panics in vm_page_alloc() later on (page is not free). What is the correct way to deallocate these pages? Or is there a different approach I should adopt? I assume this is for the SLB flih? What AIX did was to have a 1-1 simple esid to vsid translation for kernel addresses, reserve the first 16 SLB entries for various uses, including one for the current process's process private segment, and if the slb miss was on a process address we'd turn on translation and look up the answer, the tables holding the answer being in the process private segment effective address space so we wouldn't take another slb miss. This required one level deep recursion in the slb slih, in case there was a miss on kernel data with xlate on in the SLB slih. Yes, that's correct. FreeBSD has the same 1-to-1 translation for the kernel, but the entire address space is switched out for user processes (no part of the kernel is mapped into user processes), so the code to load the user SLB entries has to be able to execute with the MMU off, lest it disappear underneath itself. Okay. For AIX the kernel text/data in esid 0 was always in slb entry 0 (so it wasn't affected by slbia) and also was mapped into the process address space. So we had to be careful with KsKp bits to prevent access to anything the user couldn't see. The code for memcpy and friends was at fixed addresses in the kernel segment so the compiler knew to jump there, and there was also a user-readable _system_configuration struct. Even with no address sharing, the SLB flih could load entries for the kernel and turn on translation, but it would be trickier. For historical reasons due to the per-process segment table for POWER3, we also had a one-page hashed lookup table per process that we stored the real address of in the process private segment, so the assembly code in the flih looked here before turning on MSR_DR IIRC. I was trying to find ways to kill this code when I left IBM, since we'd ended support for POWER3 a few years earlier. I haven't had the time to look at FreeBSD ppc64 sources; how large are the uma-allocated slb entries and what is stored in them? The struct and filename is sufficient, though I don't have convenient access to sources until Tuesday. The entries are each 1 KB, and there is one for each pmap. Each consists of 64 16-byte SLBE/SLBV pairs. These buffers are just a carbon copy of what should be in the SLB after a context switch to that map. But if this is for the flih, the esid that was faulted on won't be in that struct, right? Aren't you trying to look up in some table to load an slb entry? V=R space is rather limited (well, depending on a lot of factors; for AIX on Power5 and later the hypervisor only gave us 128M, though for ppc64 on a Mac G4 I assume all of memory can be mapped V=R if desired) so it was best to find a non V=R solution if possible. Turning on translation in the flih after some setup and recursion stopping is one of the easier ways, and also has the advantage of not needing to either have separate code or macro access to data structures used in both V and R modes. On the PS3 (the target in this case), the hypervisor also limits us to 128 MB. The one and only kernel data structure that needs to be used in this mode is this SLB cache object, so I was hoping for a simple solution to just put them all in the real-mode accessible region. Well, I assume if you're willing to use 4k then it should't be hard to allocate a whole page in the V=R region. Perhaps other useful data for the process could be added to this page? Admittedly, this is a bit of a digression. The internals of UMA always leave me confused, so I try to avoid thinking about it. :-) IIRC the memory from vm_phys_alloc_contig() can be released like any other page; the interface should just be fetching a specific page. How far off is the page wire
Re: sched_pin() versus PCPU_GET
On Sun, Aug 8, 2010 at 2:43 PM, Attilio Rao atti...@freebsd.org wrote: 2010/8/4 m...@freebsd.org: On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote: On Friday, July 30, 2010 10:08:22 am John Baldwin wrote: On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: We've seen a few instances at work where witness_warn() in ast() indicates the sched lock is still held, but the place it claims it was held by is in fact sometimes not possible to keep the lock, like: thread_lock(td); td-td_flags = ~TDF_SELECT; thread_unlock(td); What I was wondering is, even though the assembly I see in objdump -S for witness_warn has the increment of td_pinned before the PCPU_GET: 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx 802db217: 00 00 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax 802db226: 00 00 if (lock_list != NULL lock_list-ll_count != 0) { 802db228: 48 85 c0 test %rax,%rax * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) if (lock_list != NULL lock_list-ll_count != 0) { 802db239: 0f 84 ff 00 00 00 je 802db33e witness_warn+0x30e 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d is it possible for the hardware to do any re-ordering here? The reason I'm suspicious is not just that the code doesn't have a lock leak at the indicated point, but in one instance I can see in the dump that the lock_list local from witness_warn is from the pcpu structure for CPU 0 (and I was warned about sched lock 0), but the thread id in panic_cpu is 2. So clearly the thread was being migrated right around panic time. This is the amd64 kernel on stable/7. I'm not sure exactly what kind of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. So... do we need some kind of barrier in the code for sched_pin() for it to really do what it claims? Could the hardware have re-ordered the mov %gs:0x48,%rax PCPU_GET to before the sched_pin() increment? Hmmm, I think it might be able to because they refer to different locations. Note this rule in section 8.2.2 of Volume 3A: • Reads may be reordered with older writes to different locations but not with older writes to the same location. It is certainly true that sparc64 could reorder with RMO. I believe ia64 could reorder as well. Since sched_pin/unpin are frequently used to provide this sort of synchronization, we could use memory barriers in pin/unpin like so: sched_pin() { td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1; } sched_unpin() { atomic_store_rel_int(td-td_pinned, td-td_pinned - 1); } We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but they are slightly more heavyweight, though it would be more clear what is happening I think. However, to actually get a race you'd have to have an interrupt fire and migrate you so that the speculative read was from the other CPU. However, I don't think the speculative read would be preserved in that case. The CPU has to return to a specific PC when it returns from the interrupt and it has no way of storing the state for what speculative reordering it might be doing, so presumably it is thrown away? I suppose it is possible that it actually retires both instructions (but reordered) and then returns to the PC value after the read of listlocks after the interrupt. However, in that case the scheduler would not migrate as it would see td_pinned != 0. To get the race you have to have the interrupt take effect prior to modifying td_pinned, so I think the processor would have to discard the reordered read of listlocks so it could safely resume execution at the 'incl' instruction. The other nit there on x86 at least is that the incl instruction is doing both a read and a write and another rule in the section 8.2.2 is this: • Reads are not reordered with other reads. That would seem to prevent the read of listlocks from passing the read of td_pinned in the incl instruction on x86. I wonder how that's interpreted in the microcode, though? I.e. if the incr instruction decodes to load, add,
Re: sched_pin() versus PCPU_GET
On Wed, Aug 4, 2010 at 11:55 AM, John Baldwin j...@freebsd.org wrote: On Wednesday, August 04, 2010 12:20:31 pm m...@freebsd.org wrote: On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote: On Tuesday, August 03, 2010 9:46:16 pm m...@freebsd.org wrote: On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote: On Friday, July 30, 2010 10:08:22 am John Baldwin wrote: On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: We've seen a few instances at work where witness_warn() in ast() indicates the sched lock is still held, but the place it claims it was held by is in fact sometimes not possible to keep the lock, like: thread_lock(td); td-td_flags = ~TDF_SELECT; thread_unlock(td); What I was wondering is, even though the assembly I see in objdump -S for witness_warn has the increment of td_pinned before the PCPU_GET: 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx 802db217: 00 00 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax 802db226: 00 00 if (lock_list != NULL lock_list-ll_count != 0) { 802db228: 48 85 c0 test %rax,%rax * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) if (lock_list != NULL lock_list-ll_count != 0) { 802db239: 0f 84 ff 00 00 00 je 802db33e witness_warn+0x30e 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d is it possible for the hardware to do any re-ordering here? The reason I'm suspicious is not just that the code doesn't have a lock leak at the indicated point, but in one instance I can see in the dump that the lock_list local from witness_warn is from the pcpu structure for CPU 0 (and I was warned about sched lock 0), but the thread id in panic_cpu is 2. So clearly the thread was being migrated right around panic time. This is the amd64 kernel on stable/7. I'm not sure exactly what kind of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. So... do we need some kind of barrier in the code for sched_pin() for it to really do what it claims? Could the hardware have re-ordered the mov %gs:0x48,%rax PCPU_GET to before the sched_pin() increment? Hmmm, I think it might be able to because they refer to different locations. Note this rule in section 8.2.2 of Volume 3A: • Reads may be reordered with older writes to different locations but not with older writes to the same location. It is certainly true that sparc64 could reorder with RMO. I believe ia64 could reorder as well. Since sched_pin/unpin are frequently used to provide this sort of synchronization, we could use memory barriers in pin/unpin like so: sched_pin() { td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1; } sched_unpin() { atomic_store_rel_int(td-td_pinned, td-td_pinned - 1); } We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but they are slightly more heavyweight, though it would be more clear what is happening I think. However, to actually get a race you'd have to have an interrupt fire and migrate you so that the speculative read was from the other CPU. However, I don't think the speculative read would be preserved in that case. The CPU has to return to a specific PC when it returns from the interrupt and it has no way of storing the state for what speculative reordering it might be doing, so presumably it is thrown away? I suppose it is possible that it actually retires both instructions (but reordered) and then returns to the PC value after the read of listlocks after the interrupt. However, in that case the scheduler would not migrate as it would see td_pinned != 0. To get the race you have to have the interrupt take effect prior to modifying td_pinned, so I think the processor would have to discard the reordered read of listlocks so it could safely resume execution at the 'incl'
Re: sched_pin() versus PCPU_GET
On Wed, Aug 4, 2010 at 9:20 AM, m...@freebsd.org wrote: On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote: Actually, I would beg to differ in that case. If PCPU_GET(spinlocks) returns non-NULL, then it means that you hold a spin lock, ll_count is 0 for the correct pc_spinlocks and non-zero for the wrong one, though. So I think it can be non-NULL but the current thread/CPU doesn't hold a spinlock. I don't believe we have any code in the NMI handler. I'm on vacation today so I'll check tomorrow. I checked and ipi_nmi_handler() doesn't appear to have any local changes. I assume that's where I should look? Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: sched_pin() versus PCPU_GET
(gdb) p panic_cpu $9 = 2 (gdb) p dumptid $12 = 100751 (gdb) p cpuhead.slh_first-pc_allcpu.sle_next-pc_curthread-td_tid $14 = 100751 (gdb) p *cpuhead.slh_first-pc_allcpu.sle_next $6 = { pc_curthread = 0xff00716d6960, pc_cpuid = 2, pc_spinlocks = 0x80803198, (gdb) p lock_list $2 = (struct lock_list_entry *) 0x80803fb0 (gdb) p *cpuhead.slh_first-pc_allcpu.sle_next-pc_allcpu.sle_next- pc_allcpu.sle_next $8 = { pc_curthread = 0xff0005479960, pc_cpuid = 0, pc_spinlocks = 0x80803fb0, I.e. we're dumping on CPU 2, but the lock_list pointer that was saved in the dump matches that of CPU 0. Can you print out the tid's for the two curthreads? It's not impossible that the thread migrated after calling panic. In fact we force threads to CPU 0 during shutdown. dumptid matches the pc_curthread for CPU 2 and is printed above. The lock_list local variable matches the PCPU for CPU 0, which has tid: (gdb) p cpuhead.slh_first-pc_allcpu.sle_next-pc_allcpu.sle_next-pc_allcpu.sle_next-pc_curthread-td_tid $2 = 15 (gdb) p cpuhead.slh_first-pc_allcpu.sle_next-pc_allcpu.sle_next-pc_allcpu.sle_next-pc_curthread-td_proc-p_comm $3 = idle: cpu0\000\000\000\000\000\000\000\000\000 Note that lock_list-ll_count is now 0, but of course wasn't when we panic'd. Also, the panic message showed exclusive spin mutex sched lock 0 (sched lock) r = 0 (0x806cf640) locked @ /build/mnt/src/sys/kern/sys_generic.c:826; i.e. the lock was for CPU 0 as well. If we truly were returning to user space with that lock held it would still be held and we'd still be on CPU 0. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: sched_pin() versus PCPU_GET
On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote: On Tuesday, August 03, 2010 9:46:16 pm m...@freebsd.org wrote: On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote: On Friday, July 30, 2010 10:08:22 am John Baldwin wrote: On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: We've seen a few instances at work where witness_warn() in ast() indicates the sched lock is still held, but the place it claims it was held by is in fact sometimes not possible to keep the lock, like: thread_lock(td); td-td_flags = ~TDF_SELECT; thread_unlock(td); What I was wondering is, even though the assembly I see in objdump -S for witness_warn has the increment of td_pinned before the PCPU_GET: 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx 802db217: 00 00 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax 802db226: 00 00 if (lock_list != NULL lock_list-ll_count != 0) { 802db228: 48 85 c0 test %rax,%rax * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) if (lock_list != NULL lock_list-ll_count != 0) { 802db239: 0f 84 ff 00 00 00 je 802db33e witness_warn+0x30e 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d is it possible for the hardware to do any re-ordering here? The reason I'm suspicious is not just that the code doesn't have a lock leak at the indicated point, but in one instance I can see in the dump that the lock_list local from witness_warn is from the pcpu structure for CPU 0 (and I was warned about sched lock 0), but the thread id in panic_cpu is 2. So clearly the thread was being migrated right around panic time. This is the amd64 kernel on stable/7. I'm not sure exactly what kind of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. So... do we need some kind of barrier in the code for sched_pin() for it to really do what it claims? Could the hardware have re-ordered the mov %gs:0x48,%rax PCPU_GET to before the sched_pin() increment? Hmmm, I think it might be able to because they refer to different locations. Note this rule in section 8.2.2 of Volume 3A: • Reads may be reordered with older writes to different locations but not with older writes to the same location. It is certainly true that sparc64 could reorder with RMO. I believe ia64 could reorder as well. Since sched_pin/unpin are frequently used to provide this sort of synchronization, we could use memory barriers in pin/unpin like so: sched_pin() { td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1; } sched_unpin() { atomic_store_rel_int(td-td_pinned, td-td_pinned - 1); } We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but they are slightly more heavyweight, though it would be more clear what is happening I think. However, to actually get a race you'd have to have an interrupt fire and migrate you so that the speculative read was from the other CPU. However, I don't think the speculative read would be preserved in that case. The CPU has to return to a specific PC when it returns from the interrupt and it has no way of storing the state for what speculative reordering it might be doing, so presumably it is thrown away? I suppose it is possible that it actually retires both instructions (but reordered) and then returns to the PC value after the read of listlocks after the interrupt. However, in that case the scheduler would not migrate as it would see td_pinned != 0. To get the race you have to have the interrupt take effect prior to modifying td_pinned, so I think the processor would have to discard the reordered read of listlocks so it could safely resume execution at the 'incl' instruction. The other nit there on x86 at least is that the incl instruction is doing both a read and a write and another rule in the section 8.2.2 is this: • Reads are not reordered with other reads. That would seem to prevent the read of listlocks from passing the read of
Re: sched_pin() versus PCPU_GET
On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote: On Friday, July 30, 2010 10:08:22 am John Baldwin wrote: On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote: We've seen a few instances at work where witness_warn() in ast() indicates the sched lock is still held, but the place it claims it was held by is in fact sometimes not possible to keep the lock, like: thread_lock(td); td-td_flags = ~TDF_SELECT; thread_unlock(td); What I was wondering is, even though the assembly I see in objdump -S for witness_warn has the increment of td_pinned before the PCPU_GET: 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx 802db217: 00 00 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax 802db226: 00 00 if (lock_list != NULL lock_list-ll_count != 0) { 802db228: 48 85 c0 test %rax,%rax * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) if (lock_list != NULL lock_list-ll_count != 0) { 802db239: 0f 84 ff 00 00 00 je 802db33e witness_warn+0x30e 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d is it possible for the hardware to do any re-ordering here? The reason I'm suspicious is not just that the code doesn't have a lock leak at the indicated point, but in one instance I can see in the dump that the lock_list local from witness_warn is from the pcpu structure for CPU 0 (and I was warned about sched lock 0), but the thread id in panic_cpu is 2. So clearly the thread was being migrated right around panic time. This is the amd64 kernel on stable/7. I'm not sure exactly what kind of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. So... do we need some kind of barrier in the code for sched_pin() for it to really do what it claims? Could the hardware have re-ordered the mov %gs:0x48,%rax PCPU_GET to before the sched_pin() increment? Hmmm, I think it might be able to because they refer to different locations. Note this rule in section 8.2.2 of Volume 3A: • Reads may be reordered with older writes to different locations but not with older writes to the same location. It is certainly true that sparc64 could reorder with RMO. I believe ia64 could reorder as well. Since sched_pin/unpin are frequently used to provide this sort of synchronization, we could use memory barriers in pin/unpin like so: sched_pin() { td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1; } sched_unpin() { atomic_store_rel_int(td-td_pinned, td-td_pinned - 1); } We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but they are slightly more heavyweight, though it would be more clear what is happening I think. However, to actually get a race you'd have to have an interrupt fire and migrate you so that the speculative read was from the other CPU. However, I don't think the speculative read would be preserved in that case. The CPU has to return to a specific PC when it returns from the interrupt and it has no way of storing the state for what speculative reordering it might be doing, so presumably it is thrown away? I suppose it is possible that it actually retires both instructions (but reordered) and then returns to the PC value after the read of listlocks after the interrupt. However, in that case the scheduler would not migrate as it would see td_pinned != 0. To get the race you have to have the interrupt take effect prior to modifying td_pinned, so I think the processor would have to discard the reordered read of listlocks so it could safely resume execution at the 'incl' instruction. The other nit there on x86 at least is that the incl instruction is doing both a read and a write and another rule in the section 8.2.2 is this: • Reads are not reordered with other reads. That would seem to prevent the read of listlocks from passing the read of td_pinned in the incl instruction on x86. I wonder how that's interpreted in the microcode, though? I.e. if the incr instruction decodes to load, add, store, does the h/w allow the later reads to pass the final store? I added the following: sched_pin();
Re: sched_pin() versus PCPU_GET
2010/7/30 Kostik Belousov kostik...@gmail.com: On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote: On Thu, Jul 29, 2010 at 4:39 PM, m...@freebsd.org wrote: We've seen a few instances at work where witness_warn() in ast() indicates the sched lock is still held, but the place it claims it was held by is in fact sometimes not possible to keep the lock, like: thread_lock(td); td-td_flags = ~TDF_SELECT; thread_unlock(td); What I was wondering is, even though the assembly I see in objdump -S for witness_warn has the increment of td_pinned before the PCPU_GET: 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx 802db217: 00 00 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax 802db226: 00 00 if (lock_list != NULL lock_list-ll_count != 0) { 802db228: 48 85 c0 test %rax,%rax * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) if (lock_list != NULL lock_list-ll_count != 0) { 802db239: 0f 84 ff 00 00 00 je 802db33e witness_warn+0x30e 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d is it possible for the hardware to do any re-ordering here? The reason I'm suspicious is not just that the code doesn't have a lock leak at the indicated point, but in one instance I can see in the dump that the lock_list local from witness_warn is from the pcpu structure for CPU 0 (and I was warned about sched lock 0), but the thread id in panic_cpu is 2. So clearly the thread was being migrated right around panic time. This is the amd64 kernel on stable/7. I'm not sure exactly what kind of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. So... do we need some kind of barrier in the code for sched_pin() for it to really do what it claims? Could the hardware have re-ordered the mov %gs:0x48,%rax PCPU_GET to before the sched_pin() increment? So after some research, the answer I'm getting is maybe. What I'm concerned about is whether the h/w reordered the read of PCPU_GET in front of the previous store to increment td_pinned. While not an ultimate authority, http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems implies that stores can be reordered after loads for both Intel and amd64 chips, which would I believe account for the behavior seen here. Am I right that you suggest that in the sequence mov %gs:0x0,%rbx [1] incl 0x104(%rbx) [2] mov %gs:0x48,%rax [3] interrupt and preemption happen between points [2] and [3] ? And the %rax value after the thread was put back onto the (different) new CPU and executed [3] was still from the old cpu' pcpu area ? Right, but I'm also asking if it's possible the hardware executed the instructions as: mov %gs:0x0,%rbx [1] mov %gs:0x48,%rax [3] incl 0x104(%rbx) [2] On PowerPC this is definitely possible and I'd use an isync to prevent the re-ordering. I haven't been able to confirm that Intel/AMD present such a strict ordering that no barrier is needed. It's admittedly a very tight window, and we've only seen it twice, but I have no other way to explain the symptom. Unfortunately in the dump gdb shows both %rax and %gs as 0, so I can't confirm that they had a value I'd expect from another CPU. The only thing I do have is panic_cpu being different than the CPU at the time of PCPU_GET(spinlock), but of course there's definitely a window there. I do not believe this is possible. CPU is always self-consistent. Context switch from the thread can only occur on the return from interrupt handler, in critical_exit() or such. This code is executing on the same processor, and thus should already see the effect of [2], that would prevent context switch. Right, but if the hardware allowed reads to pass writes, then %rax would have an incorrect value which would be saved at interrupt time, and restored on another processor. I can add a few sanity asserts to try to prove this one way or another and hope they don't mess with the timing;
sched_pin() versus PCPU_GET
We've seen a few instances at work where witness_warn() in ast() indicates the sched lock is still held, but the place it claims it was held by is in fact sometimes not possible to keep the lock, like: thread_lock(td); td-td_flags = ~TDF_SELECT; thread_unlock(td); What I was wondering is, even though the assembly I see in objdump -S for witness_warn has the increment of td_pinned before the PCPU_GET: 802db210: 65 48 8b 1c 25 00 00mov%gs:0x0,%rbx 802db217: 00 00 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db21f: 65 48 8b 04 25 48 00mov%gs:0x48,%rax 802db226: 00 00 if (lock_list != NULL lock_list-ll_count != 0) { 802db228: 48 85 c0test %rax,%rax * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db22b: 48 89 85 f0 fe ff ffmov%rax,-0x110(%rbp) 802db232: 48 89 85 f8 fe ff ffmov%rax,-0x108(%rbp) if (lock_list != NULL lock_list-ll_count != 0) { 802db239: 0f 84 ff 00 00 00 je 802db33e witness_warn+0x30e 802db23f: 44 8b 60 50 mov0x50(%rax),%r12d is it possible for the hardware to do any re-ordering here? The reason I'm suspicious is not just that the code doesn't have a lock leak at the indicated point, but in one instance I can see in the dump that the lock_list local from witness_warn is from the pcpu structure for CPU 0 (and I was warned about sched lock 0), but the thread id in panic_cpu is 2. So clearly the thread was being migrated right around panic time. This is the amd64 kernel on stable/7. I'm not sure exactly what kind of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. So... do we need some kind of barrier in the code for sched_pin() for it to really do what it claims? Could the hardware have re-ordered the mov%gs:0x48,%rax PCPU_GET to before the sched_pin() increment? Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: sched_pin() versus PCPU_GET
On Thu, Jul 29, 2010 at 4:39 PM, m...@freebsd.org wrote: We've seen a few instances at work where witness_warn() in ast() indicates the sched lock is still held, but the place it claims it was held by is in fact sometimes not possible to keep the lock, like: thread_lock(td); td-td_flags = ~TDF_SELECT; thread_unlock(td); What I was wondering is, even though the assembly I see in objdump -S for witness_warn has the increment of td_pinned before the PCPU_GET: 802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx 802db217: 00 00 802db219: ff 83 04 01 00 00 incl 0x104(%rbx) * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax 802db226: 00 00 if (lock_list != NULL lock_list-ll_count != 0) { 802db228: 48 85 c0 test %rax,%rax * Pin the thread in order to avoid problems with thread migration. * Once that all verifies are passed about spinlocks ownership, * the thread is in a safe path and it can be unpinned. */ sched_pin(); lock_list = PCPU_GET(spinlocks); 802db22b: 48 89 85 f0 fe ff ff mov %rax,-0x110(%rbp) 802db232: 48 89 85 f8 fe ff ff mov %rax,-0x108(%rbp) if (lock_list != NULL lock_list-ll_count != 0) { 802db239: 0f 84 ff 00 00 00 je 802db33e witness_warn+0x30e 802db23f: 44 8b 60 50 mov 0x50(%rax),%r12d is it possible for the hardware to do any re-ordering here? The reason I'm suspicious is not just that the code doesn't have a lock leak at the indicated point, but in one instance I can see in the dump that the lock_list local from witness_warn is from the pcpu structure for CPU 0 (and I was warned about sched lock 0), but the thread id in panic_cpu is 2. So clearly the thread was being migrated right around panic time. This is the amd64 kernel on stable/7. I'm not sure exactly what kind of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC. So... do we need some kind of barrier in the code for sched_pin() for it to really do what it claims? Could the hardware have re-ordered the mov %gs:0x48,%rax PCPU_GET to before the sched_pin() increment? So after some research, the answer I'm getting is maybe. What I'm concerned about is whether the h/w reordered the read of PCPU_GET in front of the previous store to increment td_pinned. While not an ultimate authority, http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems implies that stores can be reordered after loads for both Intel and amd64 chips, which would I believe account for the behavior seen here. Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: strange problem with int64_t variables
On Sun, Jul 11, 2010 at 7:58 AM, Gabor Kovesdan ga...@freebsd.org wrote: Em 2010.07.11. 16:54, Dimitry Andric escreveu: On 2010-07-11 16:46, Gabor Kovesdan wrote: I have two int64_t variables in kernel code, first is stored internally and the second one is passed from a syscall argument. When I print them with printf %lld modifier, the internal one behaves correctly but the other one I pass from a syscall has a corrupted value. If I pass 1, it prints out 3735348794091372545. I'm not doing anything special with it just reading it out from the struct that was generated with make sysent. Since 3735348794091372545 is 0x33d69ff1, it looks like the upper word got corrupted somehow. Maybe some part of it got non-atomically assigned? Maybe the wrong word was read? It is hard to tell without code... :) Userland syscall calling: killjob(getjid(), SIGINT); //getjid() returns 1 this case, whose type is jid_t Kernel code: int killjob(struct thread *td, struct killjob_args *uap) { struct jobentry *jp, *jtmp; struct procentry *pp, *ptmp; JOBLIST_bWLOCK; LIST_FOREACH_SAFE(jp,irix_joblist, entries, jtmp) { if (jp-jid == uap-jid) { /* never reached code, comparison always fail because of corrupted value */ } } JOBLIST_WUNLOCK; /* not such job */ td-td_retval[0] = -1; return (ENOJOB); } What does struct killjob_args look like? Is this syscall defined in a module, or an addition to the kernel's syscalls.master? Is the user-space 32-bit or 64-bit? What about the kernel? Thanks, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: loader prompt: list / on other device
On Wed, Jun 23, 2010 at 9:03 AM, rank1see...@gmail.com wrote: I've escaped to loader prompt: Current device is disk0s3a, from which this loader is running. My USB stick is device1 and device1s2a is UFS /, on which I would like to reach some file or simply list directory. IIRC, there is no way to do this (but my memory is based on stable/7). I extended some Isilon stuff to add a 'bootdev' command to the loader that would allow for selecting alternate boot disks. I can see about dusting that off and making a patch against CURRENT, if there is no current way to do what you ask. Cheers, matthew ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org