Re: another question

2013-07-01 Thread mdf
On Mon, Jul 1, 2013 at 5:42 PM, David Sanford
david.lee...@programmer.netwrote:

 Hi,

 Thanks for your responses to my first question. They were very helpful.

 In looking at the code, I ran across the functions setprogname and
 getprogname. According to the man page:
 In FreeBSD, the name of the program is set by the start-up code that is
 run before  *main*(); thus, running  *setprogname*() is not necessary.
 I'm confused by how this is done. Where is this start-up code defined?
 Is this included in all executables compiled on FreeBSD? Even the programs
 released under the GNU GPL?


I believe the code that does this is in lib/csu/common/ignore_init.c; see
handle_argv() and the use of __progname[].

This will run for anything that links against csu, which is anything
compiled on FreeBSD.  The same csu library sets the ABI note.tag, which
tells the kernel which syscall table to use when the binary is executed.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: b_freelist TAILQ/SLIST

2013-06-28 Thread mdf
On Fri, Jun 28, 2013 at 8:56 AM, Adrian Chadd adr...@freebsd.org wrote:

 On 28 June 2013 08:37, Alexander Motin m...@freebsd.org wrote:
  Otherwise it may just creep up again after someone does another change
  in an unrelated part of the kernel.
 
  Big win or small, TAILQ is still heavier then STAILQ, while it is not
 needed
  there at all.

 You can't make that assumption. I bet that if both pointers are in the
 _same_ cache line, the overhead of maintaining a double linked list is
 trivial.


No, it's not.  A singly-linked SLIST only needs to modify the head of the
list and the current element.  A doubly-linked LIST needs to modify both
the head as well as the old first element, which may not be in cache (and
may not be in the same TLB, either).  I don't recall exactly what [S]TAILQ
touches, but the doubly-linked version still has to modify more entries
that are not contiguous.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Fix MNAMELEN or reimplement struct statfs

2013-06-08 Thread mdf
On Sat, Jun 8, 2013 at 3:52 PM, Dirk Engling erdge...@erdgeist.org wrote:

 The arbitrary value

 #define MNAMELEN88  /* size of on/from name bufs */

 struct statfs {
 [...]
 charf_mntfromname[MNAMELEN];/* mounted filesystem */
 charf_mntonname[MNAMELEN];  /* directory on which mounted */
 };

 currently bites us when trying to use poudriere with errors like

 'mount: tmpfs: File name too long'


 /poudriere/data/build/91_RELEASE_amd64-REALLY-REALLY-LONG-JAILNAME/ref/wrkdirs

 The topic has been discussed several times since 2004 and has been
 postponed each time, the last time when it hit zfs users:

 http://lists.freebsd.org/pipermail/freebsd-fs/2010-March/007974.html

 So I'd like to point to the calendar, it's 2013 already and there's
 still a static arbitrary (and way too low) limit in one of the core
 areas of the vfs code.

 So I'd like to bump the issue and propose either making f_mntfromname a
 dynamic allocation or just increase MNAMELEN, using 10.0 as water shed.


Gleb Kurtsou did this along with the ino64 GSoC project.  Unfortunately,
 both he and I hit ENOTIME due to the job that pays the bills and it's
never made it back to the main repository.

IIRC, though, the only reason for doing it with 64-bit ino_t is that he'd
already finished changing the stat/dirent ABI so what was one more.  I
think he went with 1024 bytes, which also necessitated not allocating
statfs on the stack for the kernel.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: extattr_set_* return type

2013-04-01 Thread mdf
On Mon, Apr 1, 2013 at 11:24 AM, John Baldwin j...@freebsd.org wrote:

 On Saturday, March 30, 2013 5:30:21 pm m...@freebsd.org wrote:
  Despite the man page correctly describing the return value for
  extattr_set_*, I thought recently that they returned 0/-1 for
  success/failure, not the number of bytes written, like write(2).  This is
  because extattr_set_* is declared as returning an int, not an ssize_t.
   Both extattr_get and extattr_list return ssize_t, so this is
 inconsistent.
 
  The patch at
 
 http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-
 rmextattr-8-.patchfixes
  this.  It compiles but it's untested.
 
  I don't think any compat shims are needed, since an old application will
  still sign extend and this will work (it's very unlikely anyone does
  extattr_set for 2GB or more).
 
  If anyone actually uses extattr on 64-bit, please test a new kernel but
 old
  userspace to be sure nothing is broken.  I plan to commit this next week
 if
  I don't hear otherwise.

 Hmm, the patch URL doesn't work, but please fix this.  There is an old
 thread
 we are both on from Dec 2011 where I ran into the same thing.  I also
 think we
 don't need compat shims.


The version in my outbox looked right; I don't know how it got mangled.

http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-
http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes
rmextattr-8-.patchhttp://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: extattr_set_* return type

2013-04-01 Thread mdf
On Mon, Apr 1, 2013 at 12:51 PM, m...@freebsd.org wrote:

 On Mon, Apr 1, 2013 at 11:24 AM, John Baldwin j...@freebsd.org wrote:

 On Saturday, March 30, 2013 5:30:21 pm m...@freebsd.org wrote:
  Despite the man page correctly describing the return value for
  extattr_set_*, I thought recently that they returned 0/-1 for
  success/failure, not the number of bytes written, like write(2).  This
 is
  because extattr_set_* is declared as returning an int, not an ssize_t.
   Both extattr_get and extattr_list return ssize_t, so this is
 inconsistent.
 
  The patch at
 
 http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-
 rmextattr-8-.patchfixeshttp://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes
  this.  It compiles but it's untested.
 
  I don't think any compat shims are needed, since an old application will
  still sign extend and this will work (it's very unlikely anyone does
  extattr_set for 2GB or more).
 
  If anyone actually uses extattr on 64-bit, please test a new kernel but
 old
  userspace to be sure nothing is broken.  I plan to commit this next
 week if
  I don't hear otherwise.

 Hmm, the patch URL doesn't work, but please fix this.  There is an old
 thread
 we are both on from Dec 2011 where I ran into the same thing.  I also
 think we
 don't need compat shims.


 The version in my outbox looked right; I don't know how it got mangled.


 http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-
 http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes
 rmextattr-8-.patchhttp://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes


And I found the old thread:

http://lists.freebsd.org/pipermail/freebsd-current/2011-December/030567.html

I guess my memory really does just suck.   I now remember what I posted,
but only after reading it again. :-)

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


extattr_set_* return type

2013-03-30 Thread mdf
Despite the man page correctly describing the return value for
extattr_set_*, I thought recently that they returned 0/-1 for
success/failure, not the number of bytes written, like write(2).  This is
because extattr_set_* is declared as returning an int, not an ssize_t.
 Both extattr_get and extattr_list return ssize_t, so this is inconsistent.

The patch at
http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes
this.  It compiles but it's untested.

I don't think any compat shims are needed, since an old application will
still sign extend and this will work (it's very unlikely anyone does
extattr_set for 2GB or more).

If anyone actually uses extattr on 64-bit, please test a new kernel but old
userspace to be sure nothing is broken.  I plan to commit this next week if
I don't hear otherwise.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Stupid question about integer sizes

2013-02-19 Thread mdf
On Tue, Feb 19, 2013 at 5:11 AM, Borja Marcos bor...@sarenet.es wrote:

 Hello,

 I'm really sorry if this is a stupid question, but as far as I know, 
 u_int64_t defined in /usr/include/sys/types.h should *always* be
 a 64 bit unsigned integer, right?

 Seems there's a bug (or I need more and stronger coffee). Compiling a program 
 on a 64 bit system with -m32 gets the 64 bit integer types wrong.

 % cat tachin.c
 #include sys/types.h
 #include stdio.h


 main()
 {
 printf(sizeof uint64_t = %d\n, sizeof(uint64_t));
 printf(sizeof u_int64_t = %d\n, sizeof(u_int64_t));
 }



 uname -a
 FreeBSD splunk 9.1-RELEASE FreeBSD 9.1-RELEASE #14: Wed Jan 23 17:24:05 CET 
 2013 root@splunk:/usr/obj/usr/src/sys/SPLUNK  amd64

 % gcc -o tachin tachin.c
 % ./tachin
 sizeof uint64_t = 8
 sizeof u_int64_t = 8

 % ./tachin
 sizeof uint64_t = 4   === WRONG!!
 sizeof u_int64_t = 4=== WRONG!!

 The same happens with clang.

 % clang -m32 -o tachin tachin.c
 tachin.c:5:1: warning: type specifier missing, defaults to 'int' 
 [-Wimplicit-int]
 main()
 ^~~~
 1 warning generated.
 % ./tachin
 sizeof uint64_t = 4 === WRONG!!
 sizeof u_int64_t = 4=== WRONG!!


 if I do the same on a i386 system (8.2-RELEASE, but it should be irrelevant) 
 the u_int64 types have the correct size.

 %gcc -o tachin tachin.c
 %./tachin
 sizeof uint64_t = 8
 sizeof u_int64_t = 8





 Am I missing anything? Seems like a too stupid problem to be a real bug. 
 Sorry if I am wrong.


Last I knew -m32 still wasn't quite supported on 9.1.  This is fixed
in CURRENT.  The problem in in the machine-dependent headers.
sys/types.h includes sys/_types.h which includes machine/_types.h.
But the machine alias can only point to one place; it's the amd64
headers since you're running the amd64 kernel.
sys/amd64/include/_types.h defines __uint64_t as unsigned long, which
is correct in 64-bit mode but wrong in 32-bit mode.

On CURRENT there is a merge x86/_types.h which uses #ifdef guards to
define __uint64_t as unsigned long or unsigned long long, depending on
__LP64__, so that the size is correct on a 32-bit compiler invocation.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: dynamically calculating NKPT [was: Re: huge ktr buffer]

2013-02-05 Thread mdf
On Tue, Feb 5, 2013 at 7:14 AM, Konstantin Belousov kostik...@gmail.com wrote:
 On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote:
 Hi,

 I have a patch to dynamically calculate NKPT for amd64 kernels. This
 should fix the various issues that people pointed out in the email
 thread.

 Please review and let me know if there are any objections to committing this.

 Also, thanks to Alan (alc@) for reviewing and providing feedback on
 the initial version of the patch.

 Patch (also available at 
 http://people.freebsd.org/~neel/patches/nkpt_diff.txt):

 Index: sys/amd64/include/pmap.h
 ===
 --- sys/amd64/include/pmap.h  (revision 246277)
 +++ sys/amd64/include/pmap.h  (working copy)
 @@ -113,13 +113,7 @@
   ((unsigned long)(l2)  PDRSHIFT) | \
   ((unsigned long)(l1)  PAGE_SHIFT))

 -/* Initial number of kernel page tables. */
 -#ifndef NKPT
 -#define  NKPT32
 -#endif
 -
  #define NKPML4E  1   /* number of kernel PML4 slots 
 */
 -#define NKPDPE   howmany(NKPT, NPDEPG)/* number of kernel PDP 
 slots */

  #define  NUPML4E (NPML4EPG/2)/* number of userland PML4 
 pages */
  #define  NUPDPE  (NUPML4E*NPDPEPG)/* number of userland PDP 
 pages */
 @@ -181,6 +175,7 @@
  #define  PML4map ((pd_entry_t *)(addr_PML4map))
  #define  PML4pml4e   ((pd_entry_t *)(addr_PML4pml4e))

 +extern int nkpt; /* Initial number of kernel page tables */
  extern u_int64_t KPDPphys;   /* physical address of kernel level 3 */
  extern u_int64_t KPML4phys;  /* physical address of kernel level 4 */

 Index: sys/amd64/amd64/minidump_machdep.c
 ===
 --- sys/amd64/amd64/minidump_machdep.c(revision 246277)
 +++ sys/amd64/amd64/minidump_machdep.c(working copy)
 @@ -232,7 +232,7 @@
   /* Walk page table pages, set bits in vm_page_dump */
   pmapsize = 0;
   pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys);
 - for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + NKPT * NBPDR,
 + for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + nkpt * NBPDR,
   kernel_vm_end); ) {
   /*
* We always write a page, even if it is zero. Each
 @@ -364,7 +364,7 @@
   /* Dump kernel page directory pages */
   bzero(fakepd, sizeof(fakepd));
   pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys);
 - for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + NKPT * NBPDR,
 + for (va = VM_MIN_KERNEL_ADDRESS; va  MAX(KERNBASE + nkpt * NBPDR,
   kernel_vm_end); va += NBPDP) {
   i = (va  PDPSHIFT)  ((1ul  NPDPEPGSHIFT) - 1);

 Index: sys/amd64/amd64/pmap.c
 ===
 --- sys/amd64/amd64/pmap.c(revision 246277)
 +++ sys/amd64/amd64/pmap.c(working copy)
 @@ -202,6 +202,10 @@
  vm_offset_t virtual_avail;   /* VA of first avail page (after kernel bss) */
  vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) */

 +int nkpt;
 +SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, nkpt, 0,
 +Number of kernel page table pages allocated on bootup);
 +
  static int ndmpdp;
  static vm_paddr_t dmaplimit;
  vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS;
 @@ -495,17 +499,42 @@

  CTASSERT(powerof2(NDMPML4E));

 +/* number of kernel PDP slots */
 +#define  NKPDPE(ptpgs)   howmany((ptpgs), NPDEPG)
 +
  static void
 +nkpt_init(vm_paddr_t addr)
 +{
 + int pt_pages;
 +
 +#ifdef NKPT
 + pt_pages = NKPT;
 +#else
 + pt_pages = howmany(addr, 1  PDRSHIFT);
 + pt_pages += NKPDPE(pt_pages);
 +
 + /*
 +  * Add some slop beyond the bare minimum required for bootstrapping
 +  * the kernel.
 +  *
 +  * This is quite important when allocating KVA for kernel modules.
 +  * The modules are required to be linked in the negative 2GB of
 +  * the address space.  If we run out of KVA in this region then
 +  * pmap_growkernel() will need to allocate page table pages to map
 +  * the entire 512GB of KVA space which is an unnecessary tax on
 +  * physical memory.
 +  */
 + pt_pages += 4;  /* 8MB additional slop for kernel modules */
 8MB might be to low. I just checked one of my machines with fully
 modularized kernel, it takes slightly more than 6 MB to load 50 modules.
 I think that 16MB would be safer, but it probably needs to be scaled
 down based on the available phys memory. amd64 kernel could be booted
 on 128MB machine still.

Is there no way to not map the entire 512GB?  Otherwise this patch
could really hose some vendors.  E.g. the kernel module for the OneFS
file system is around 8MB all by itself.

I found when we moved from FreeBSD 6 to 7 that the NKPT of 32 was
insufficient for our system to even boot so I put it back to 240 (I
didn't want to spend a lot of time playing).  At that time our 

Re: How to validate the variable size memory block in ioctl handler?

2013-01-20 Thread mdf
On Sun, Jan 20, 2013 at 3:01 PM, Yuri y...@rawbw.com wrote:
 I am implementing an ioctl that reads/writes variable size structure.
 Allocated size is supplied by the caller in the structure itself.
 struct my_struct {
   int len; // allocated size
   other_struct s[1];
 };
 ioctl request id is defined as _IOWR('X', number, my_struct)

 How to validate from the ioctl function handler (for some device) that the
 whole (variable size) block of bytes is RW accessible in the process memory
 space?
 Should I call copyout/copyin for this, or there is some shorter way?
 EFAULT should be returned in case of validation failure.

 As I understand, macros like _IOR, _IOWR do validation based on the size of
 structure supplied to them. So that the handler procedures don't have to do
 that.
 I was expecting to find among them some macro that would work for such
 variable size structure, but it isn't there. (Not sure if this is possible
 language-wise).

You'll need to pass in more than the above, probably, as the kernel's
ioctl() function has copied in the specified number of bytes already.
I.e. the value passed to your ioctl handler is already in the kernel
space, and unless it's 4 bytes, was malloc(9)'d and copyin'd (if it's
an IN parameter).  The size used is the size passed to the _IOC()
macro.

To do what you want it sounds like you want your handler to take something like:

struct var_ioctl {
int len;
void *data;
};

Then then handler itself would have to use copyin/copyout to access
the data.  There's no simpler way.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: How to validate the variable size memory block in ioctl handler?

2013-01-20 Thread mdf
On Sun, Jan 20, 2013 at 6:50 PM, Yuri y...@rawbw.com wrote:
 On 01/20/2013 16:59, m...@freebsd.org wrote:

 To do what you want it sounds like you want your handler to take something
 like:

 struct var_ioctl {
 int len;
 void *data;
 };

 Then then handler itself would have to use copyin/copyout to access
 the data.  There's no simpler way.


 I think I found the simpler way, see the draft patch below.
 Generic macro _IOWRE will handle the case when the first integer in ioctl
 parameter holds the actual size of the structure.
 This way of passing the variable array sizes is quite common in various
 APIs.
 Other potential uses would also benefit from this.

 Yuri


 Index: sys/kern/sys_generic.c
 ===
 --- sys/kern/sys_generic.c(revision 245654)
 +++ sys/kern/sys_generic.c(working copy)
 @@ -640,6 +640,7 @@
   int arg, error;
   u_int size;
   caddr_t data;
 + int vsize;

   if (uap-com  0x) {
   printf(
 @@ -654,6 +655,14 @@
* copied to/from the user's address space.
*/
   size = IOCPARM_LEN(com);
 + if (size == IOC_VARIABLE) {
 + /* first integer has the length of the memory */
 + error = copyin(uap-data, (caddr_t)vsize, sizeof(vsize));
 + if (error)
 + return (error);
 + size = (u_int)vsize;
 + }
   if ((size  IOCPARM_MAX) ||
   ((com  (IOC_VOID  | IOC_IN | IOC_OUT)) == 0) ||
  #if defined(COMPAT_FREEBSD5) || defined(COMPAT_FREEBSD4) ||
 defined(COMPAT_43)
 Index: sys/sys/ioccom.h
 ===
 --- sys/sys/ioccom.h  (revision 245654)
 +++ sys/sys/ioccom.h  (working copy)
 @@ -50,6 +50,7 @@
  #define  IOC_IN  0x8000  /* copy in parameters */
  #define  IOC_INOUT   (IOC_IN|IOC_OUT)
  #define  IOC_DIRMASK (IOC_VOID|IOC_OUT|IOC_IN)
 +#define  IOC_VARIABLEIOCPARM_MASK/* parameters size in 
 parameters */

  #define  _IOC(inout,group,num,len)   ((unsigned long) \
   ((inout) | (((len)  IOCPARM_MASK)  16) | ((group)  8) | (num)))
 @@ -59,6 +60,9 @@
  #define  _IOW(g,n,t) _IOC(IOC_IN,(g), (n), sizeof(t))
  /* this should be _IORW, but stdio got there first */
  #define  _IOWR(g,n,t)_IOC(IOC_INOUT, (g), (n), sizeof(t))
 +#define  _IORE(g,n)  _IOC(IOC_OUT,   (g), (n), IOC_VARIABLE)
 +#define  _IOWE(g,n)  _IOC(IOC_IN,(g), (n), IOC_VARIABLE)
 +#define  _IOWRE(g,n) _IOC(IOC_INOUT, (g), (n), IOC_VARIABLE)

  #ifdef _KERNEL


This would be fine for a local patch but it breaks existing (valid)
uses that have exactly 8191 bytes of data, so it wouldn't be suitable
for the main FreeBSD repository.  Also, in general one wants to have
limits on syscalls that can force a kernel malloc of any size, as it
leads to denial of service attacks or crashes by requesting the kernel
over-allocate memory.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: kmem_map auto-sizing and size dependencies

2013-01-18 Thread mdf
On Fri, Jan 18, 2013 at 7:29 AM, Andre Oppermann an...@freebsd.org wrote:
 The (inital?) size of the kmem_map is determined by some voodoo magic,
 a sprinkle of nmbclusters * PAGE_SIZE incrementor and lots of tunables.
 However it seems to work out to an effective kmem_map_size of about 58MB
 on my 16GB AMD64 dev machine:

 vm.kvm_size: 549755809792
 vm.kvm_free: 530233421824
 vm.kmem_size: 16,594,300,928
 vm.kmem_size_min: 0
 vm.kmem_size_max: 329,853,485,875
 vm.kmem_size_scale: 1
 vm.kmem_map_size: 59,518,976
 vm.kmem_map_free: 16,534,777,856

 The kmem_map serves kernel malloc (via UMA), contigmalloc and everthing
 else that uses UMA for memory allocation.

 Mbuf memory too is managed by UMA which obtains the backing kernel memory
 from the kmem_map.  The limits of the various mbuf memory types have
 been considerably raised recently and may make use of 50-75% of all
 physically
 present memory, or available KVM space, whichever is smaller.

 Now my questions/comments are:

  Does the kmem_map automatically extend itself if more memory is requested?

Not that I recall.

  Should it be set to a larger initial value based on min(physical,KVM) space
  available?

It needs to be smaller than the physical space, because the only limit
on the kernel's use of (pinned) memory is the size of the map.  So if
it is too large there is nothing to stop the kernel from consuming all
available memory.  The lowmem handler is called when running out of
virtual space only (i.e. a failure to allocate a range in the map).

  The naming and output of the various vm.kmem_* and vm.kvm_* sysctls is
  confusing and not easy to reconcile.  Either we need some more detailing
  more aspects or less.  Plus perhaps sysctl subtrees to better describe the
  hierarchy of the maps.

  Why are separate kmem submaps being used?  Is it to limit memory usage of
  certain subsystems?  Are those limits actually enforced?

I mostly know about memguard, since I added memguard_fudge().  IIRC
some of the submaps are used.  The memguard_map specifically is used
to know whether an allocation is guarded or not, so at free(9) it can
be handled as normal malloc() or as memguard.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: _mtx_lock_spin: obsolete historic handling of kdb_active and panicstr?

2012-10-17 Thread mdf
On Wed, Oct 17, 2012 at 4:20 AM, Andriy Gapon a...@freebsd.org wrote:

 _mtx_lock_spin has the following check in its retry loop:
 if (i  6000 || kdb_active || panicstr != NULL)
 DELAY(1);
 else
 _mtx_lock_spin_failed(m);

[snip analysis]

 So I'd like to propose to remove those checks altogether.  Or perhaps to
 reverse them and immediately induce a (possibly secondary) panic if we ever
 get to that wait loop and kdb_active || panicstr != NULL.

The panicstr can clearly be removed.  I think there can be race
conditions with entering kdb and taking a spinlock, because the
spinlock acquire will block interrupts.  I don't remember if we always
NMI for kdb enter or if that's configurable.  The old code was clearer
(or maybe I'm just remembering an Isilon hack); looking at
stop_cpus_hard() I don't see that it uses an NMI.  So a CPU can block
interrupts, then if it sees kdb_active it will spin until we leave
kdb, rather than panic.  Of course this would only be relevant if the
CPU it's trying to acquire is already held; otherwise it should find
the lock unowned and this isn't relevant.  And if the lock is owned by
the thread entering kdb, that would be a real panic, not a recoverable
kdb entry.

So I think maybe the kdb_active check is also not helpful after all.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Change vfork() to posix_spawn()?

2012-09-14 Thread mdf
On Fri, Sep 14, 2012 at 4:45 AM, Erik Cederstrand e...@cederstrand.dk wrote:
 Den 14/09/2012 kl. 13.03 skrev Ivan Voras ivo...@freebsd.org:

 On 14/09/2012 09:49, Erik Cederstrand wrote:
 Hello hackers,

 I'm looking through the Clang Analyzer scans on 
 http://scan.freebsd.your.org/freebsd-head looking for false positives to 
 report back to LLVM. There are quite a list of reports suggesting to change 
 vfork() calls to posix_spawn(). Example from /bin/rpc: 
 http://scan.freebsd.your.org/freebsd-head/bin.rcp/2012-09-12-amd64/report-nsOV80.html#EndPath

 I know nothing about this but I can see fork and posix_spawn have been 
 discussed on this list previously. Is this a legitimate warning (in this 
 case and in general in FreeBSD base)?

 Currently (on 9-stable at least), posix_spawn() is implemented as a
 wrapper around vfork(), so I doubt replacing one with the other would do
 much.

 The analyzer added this warning in January. The release notes link to this 
 explanation: 
 https://www.securecoding.cert.org/confluence/display/seccode/POS33-C.+Do+not+use+vfork()

 I guess this is the important part:

 Because of the implementation of the vfork() function, the parent process is 
 suspended while the child process executes. If a user sends a signal to the 
 child process, delaying its execution, the parent process (which is 
 privileged) is also blocked. This means that an unprivileged process can 
 cause a privileged process to halt, which is a privilege inversion resulting 
 in a denial of service.


Isn't the important part the previous paragraph, which said that some
older versions of Linux had the problem?  The entire thing reads that
the issue comes from an idiom of vfork(), setuid(), then exec, which
is both undefined and would be specific to only some *uses* of
vfork(), not the implementation.

The whole thing isn't worded terribly usefully; e.g. it doesn't
explain if it was only Linux that had an issue, which version of Linux
is fixed, whether normal code that went straight from vfork() to
exec() was fine, etc.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: freebsd.org git repositories and svn ids

2012-09-03 Thread mdf
On Mon, Sep 3, 2012 at 7:29 AM, Eitan Adler li...@eitanadler.com wrote:
 On 3 September 2012 10:19, Ryan Stone ryst...@gmail.com wrote:
 On Sun, Sep 2, 2012 at 5:45 PM, Eitan Adler li...@eitanadler.com wrote:
 Why isn't git.freebsd.org a straight git svn clone ? AFAIK that isn't 
 broken.

 Well, let's put it this way: I started a git svn clone of the src
 repository on Aug 28.  It's still going.  And I'm only creating a very
 small subset of the branches in the repo.

 I've done the same and it only took me a few hours although the ports
 repo took a few days.

Did you try to preserve the full history in the git version?  I'm not
familiar with git clone, but when I've made a git copy of an svn repo
for uses at work (both of freebsd.org any $WORK's repo) it took two to
three days just to fetch the portion of the revisions that were for
head/current.

A few hours sounds too short.  The three-four days since Aug 28 sounds
normal to me.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: How to know __FreeBSD_version for a symbol

2012-08-13 Thread mdf
On Mon, Aug 13, 2012 at 1:24 AM, Hooman Fazaeli hoomanfaza...@gmail.com wrote:
 Hi hackers

 In the process of back porting drivers to older freebsd versions,
 We sometimes need to add suitable '#if __FreeBSD_version = x ... else
 ... '
 directives to the source to use an alternate function or exclude certain
 statements, defines, etc.

 What is the best (quick/reliable) way to know in which __FreeBSD_version
 a symbol (function, struct member, macro, ...) has been first introduced?

As far as I know it's a slightly painful look over SVN logs.  First,
find the SVN revision that introduced or changed the relevant symbol.
Then, look at the SVN history of sys/sys/param.h for changes to
__FreeBSD_version symbol.  The one that's from the same or later SVN
revision as the symbol change is the value you'll need.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sysctl filesystem ?

2012-06-26 Thread mdf
On Tue, Jun 26, 2012 at 1:59 AM, Robert Watson rwat...@freebsd.org wrote:
 On Tue, 26 Jun 2012, Chris Rees wrote:

 as well as we don't depend of /proc for normal operation we shouldn't for

 say /proc/sysctl


 improvements are welcome, better documentation is welcome, changes to

 what is OK - isn't.

 /proc/sysctl might be useful.  Just because Linux uses it doesn't make it
 a bad idea.


 One of the problems we've encounted with synthetic file systems is that
 off-the-shelf file system tools (e.g., cp, dd, cat) make simplistic (but not
 unreasonable) assumptions about the statistic content of files.  This comes
 up frequently with procfs-like systems where the size of, say, memory map
 data can be considerably larger than the perhaps 128-byte, 256-byte, or even
 8k buffers that might exist in a stock file access tool.  Unless we change
 all of those tools to use buffers much bigger than they currently do, which
 even suggets changing the C library buffer to defaults for FILE *, that
 places an onus on the file system to provide persisting snapshots of data
 until it's sure that a user process is done -- e.g., over many system calls.

 sysctl is not immune to the requirement of atomicity, but it has explicit
 control over it: sysctl is a single system call, rather than an unbounded
 open-read-seek-repeat-etc cycle, and has been carefully crafted to provide
 this and other MIB-like properties, such as a basic data type model so that
 command line tools know how to render content rather than having to guess
 and/or get it wrong.  sysctl has some file-system like properties, but on
 the whole, it's not a file system -- it's much more like an SNMP MIB.

 While you can map anything into anything (including Turing machines), I
 think the sysctl command line tool and API, despite its limitations, is a
 better match for accessing this sort of monitoring and control data than the
 POSIX file API, and would recommend against trying to move to a sysctl file
 system.

While I understand the problems you allude to, the sysctl(8) binary
can protect itself from them.  IMO the biggest problem with sysctls
not being files is that it makes no sense from the core UNIX
philosophy that everything is a file.  Sockets and pipes and character
devices and even unseekable things like stdout are files; why aren't
these other objects that allow read, write, and have their own
namespace?

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: OS support for fault tolerance

2012-02-14 Thread mdf
On Tue, Feb 14, 2012 at 8:57 AM, Julian Elischer jul...@freebsd.org wrote:
 On 2/14/12 6:23 AM, Maninya M wrote:

 For multicore desktop computers, suppose one of the cores fails, the
 FreeBSD OS crashes. My question is about how I can make the OS tolerate
 this hardware fault.
 The strategy is to checkpoint the state of each core at specific intervals
 of time in main memory. Once a core fails, its previous state is retrieved
 from the main memory, and the processes that were running on it are
 rescheduled on the remaining cores.

 I read that the OS tolerates faults in large servers. I need to make it do
 this for a Desktop OS. I assume I would have to change the scheduler
 program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
 How do I go about doing this? What exactly do I need to save for the
 state of the core? What else do I need to know?
 I have absolutely no experience with kernel programming or with FreeBSD.
 Any pointers to good sources about modifying the source-code of FreeBSD
 would be greatly appreciated.

 This question has always intrigued me, because I'm always amazed
 that people actually try.
 From my viewpoint, There's really not much you can do if the core
 that is currently holding the scheduler lock fails.

We did this at IBM after we'd done the dynamic logical partitioning.
Basically, there was a way to probe the CPU for the number of
correctable errors it was encountering.  At too high a threshhold, it
was considered faulty and we offlined the CPU before it encountered
an uncorrectable error.

We did the same thing for memory, too (that one I was directly involved in).

The basic trouble, though, is that at least for memory, there didn't
seem to be a correlation between the rate of correctable ECC and an
uncorrectable error occurring.

 And what do you mean by 'fails?  do you run constant diagnostics?
 how do you tell when it is failed? It'd be hard to detect that 'multiply'
 has suddenly started giving bad results now and then.

I'd assume this is predicated by the ability of the hardware to have
some redundancy and some way to query the error rate.  I've done a
little work with memory ECC on the device driver end, and at least
there hardware definitely reports correctable and uncorrectable ECC
via some registers.  But I don't know if there's any way to query this
for a CPU (and of course each CPU would be different).

However, all that said, it's a moderately large project to get an OS
ready to handle things like holes appearing in its logical CPU ID
space (how do you serialize this when you want the common case to not
take a lock?), and to do all the wizardry of unscheduling (what do you
do with a bound thread?) and then actually shutting the CPU down via
firmware so it doesn't continue running.  I started working on this
for Linux when I worked at IBM, somewhere around 2004, and then IBM
got sued by SCO so they pulled me off the project.  It was finished up
by a colleague and friend.

You can probably come to a first approximation by forcing e.g. the
idle thread to not get switched out, when the CPU appears unstable.
Then at least it's running fewer instructions, and less likely to
generate a machine check.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Moving on ... (was: Re: FreeBSD has serious problems with...)

2012-02-08 Thread mdf
On Wed, Feb 8, 2012 at 10:33 PM, John Kozubik j...@kozubik.com wrote:
 If it were up to me, I think I would stake out a very loud and clear mission
 statement for FreeBSD that explicitly sacrifices new features for longer
 lifecycles and deeper investment in particular releases.  I've thought
 seriously about the points that were made in this thread supporting faster
 releases, and I do appreciate what we would be giving up, but I continue to
 advocate for a new major release every 5 years.

To summarize 7 years of working on AIX, where we upgrade the major
version number every 5 years or so, in the interim we do deliver
hundreds of other features.  However, if we made a poor choice for
interface, we're stuck with it.  This leads to two things: (1)
flexible interfaces, like leaving an unused flags field in a syscall
so it can become variadic later, and (2) flexible structures, like
leaving a few bytes in a struct that's part of the KBI so they can be
filled in, and when that space is nearly out, malloc(9) more storage
to hang off the struct.  Sometimes this means a lot of extra testing
effort to ensure the legacy interface works as well as any extensions
that arise later.

It's not ideal, it's more work when designing and testing, and when a
major release does come around where we could break binary
compatibility, we've often forgotten about a lot of these little bits
and just leave things the way they  are.  With a more frequent release
cycle FreeBSD can be more free about breaking KBI and ABI, and even
the [AK]PI, which keeps the current code clean but makes upgrading
more difficult.

It's a hassle.  It's doable, but it adds work, and that's a tough call
to make for a volunteer effort.  Increasing the barrier to entry, or
reducing the fun part of working on FreeBSD doesn't serve the
project either.

Just my 2 cents, a month late.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Per-mount syncer threads and fanout for pagedaemon cleaning

2011-12-27 Thread mdf
On Tue, Dec 27, 2011 at 8:05 AM, Attilio Rao atti...@freebsd.org wrote:
 2011/12/27 Giovanni Trematerra giovanni.tremate...@gmail.com:
 On Mon, Dec 26, 2011 at 9:24 PM, Venkatesh Srinivas
 vsrini...@dragonflybsd.org wrote:
 Hi!

 I've been playing with two things in DragonFly that might be of interest
 here.

 Thing #1 :=

 First, per-mountpoint syncer threads. Currently there is a single thread,
 'syncer', which periodically calls fsync() on dirty vnodes from every mount,
 along with calling vfs_sync() on each filesystem itself (via syncer vnodes).

 My patch modifies this to create syncer threads for mounts that request it.
 For these mounts, vnodes are synced from their mount-specific thread rather
 than the global syncer.

 The idea is that periodic fsync/sync operations from one filesystem should
 not
 stall or delay synchronization for other ones.
 The patch was fairly simple:
 http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/50e4012a4b55e1efc595db0db397b4365f08b640


 There's something WIP by attilio@ on that area.
 you might want to take a look at
 http://people.freebsd.org/~attilio/syncer_alpha_15.diff

 I don't know what hammerfs needs but UFS/FFS and buffer cache make a good
 job performance-wise and so the authors are skeptical about the boost that 
 such
 a change can give. We believe that brain cycles need to be spent on
 other pieces of the system such as ARC and ZFS.

 More specifically, it is likely that focusing on UFS and buffer cache
 for performance is not really useful, we should drive our efforts over
 ARC and ZFS.
 Also, the real bottlenecks in our I/O paths are in GEOM
 single-threaded design, lack of unmapped I/O functionality, possibly
 lack of proritized I/O, etc.

Indeed, Isilon (and probably other vendors as well) entirely skip
VFS_SYNC when the WAIT argument is MNT_LAZY.  Since we're a
distributed journalled filesystem, syncing via a system thread is not
a relevant operation; i.e. all writes that have exited a VOP_WRITE or
similar operation are already in reasonably stable storage in a
journal on the relevant nodes.

However, we do then have our own threads running on each node to flush
the journal regularly (in addition to when it fills up), and I don't
know enough about this to know if it could be fit into the syncer
thread idea or if it's too tied in somehow to our architecture.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Per-mount syncer threads and fanout for pagedaemon cleaning

2011-12-27 Thread mdf
On Tue, Dec 27, 2011 at 9:05 AM, Attilio Rao atti...@freebsd.org wrote:
 2011/12/27  m...@freebsd.org:
 On Tue, Dec 27, 2011 at 8:05 AM, Attilio Rao atti...@freebsd.org wrote:
 2011/12/27 Giovanni Trematerra giovanni.tremate...@gmail.com:
 On Mon, Dec 26, 2011 at 9:24 PM, Venkatesh Srinivas
 vsrini...@dragonflybsd.org wrote:
 Hi!

 I've been playing with two things in DragonFly that might be of interest
 here.

 Thing #1 :=

 First, per-mountpoint syncer threads. Currently there is a single thread,
 'syncer', which periodically calls fsync() on dirty vnodes from every 
 mount,
 along with calling vfs_sync() on each filesystem itself (via syncer 
 vnodes).

 My patch modifies this to create syncer threads for mounts that request 
 it.
 For these mounts, vnodes are synced from their mount-specific thread 
 rather
 than the global syncer.

 The idea is that periodic fsync/sync operations from one filesystem should
 not
 stall or delay synchronization for other ones.
 The patch was fairly simple:
 http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/50e4012a4b55e1efc595db0db397b4365f08b640


 There's something WIP by attilio@ on that area.
 you might want to take a look at
 http://people.freebsd.org/~attilio/syncer_alpha_15.diff

 I don't know what hammerfs needs but UFS/FFS and buffer cache make a good
 job performance-wise and so the authors are skeptical about the boost that 
 such
 a change can give. We believe that brain cycles need to be spent on
 other pieces of the system such as ARC and ZFS.

 More specifically, it is likely that focusing on UFS and buffer cache
 for performance is not really useful, we should drive our efforts over
 ARC and ZFS.
 Also, the real bottlenecks in our I/O paths are in GEOM
 single-threaded design, lack of unmapped I/O functionality, possibly
 lack of proritized I/O, etc.

 Indeed, Isilon (and probably other vendors as well) entirely skip
 VFS_SYNC when the WAIT argument is MNT_LAZY.  Since we're a
 distributed journalled filesystem, syncing via a system thread is not
 a relevant operation; i.e. all writes that have exited a VOP_WRITE or
 similar operation are already in reasonably stable storage in a
 journal on the relevant nodes.

 However, we do then have our own threads running on each node to flush
 the journal regularly (in addition to when it fills up), and I don't
 know enough about this to know if it could be fit into the syncer
 thread idea or if it's too tied in somehow to our architecture.

 I'm not really sure how does journaling is implemented on OneFS, but
 when I made this patch SU+J wasn't yet there.
 Also, this patch just adds the infrastructure for a multithreaded and
 configurable syncer, which means it still requires the UFS bits for
 skipping the double-syncing (alias the MNT_LAZY skippage you
 mentioned).

Right, I don't object to any changes relating to multiple sync
threads, etc., just trying to offer a vendor viewpoint.  Though having
one thread per mount would allow for a different sync interval for
each filesystem which can be of advantage.

Right after I did Isilon's last FreeBSD merge (it seems like a long
time ago now), I wanted to look into what it would take to eliminate
our specialed journal flush thread (i.e. tie it into VFS_SYNC), but
one objection was that then the flush interval would not be
configurable separately from the one for our UFS partition.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sysctl variable question

2011-12-18 Thread mdf
2011/12/18 Fernando Apesteguía fernando.apesteg...@gmail.com:
 El 18/12/2011 22:12, Julian Elischer jul...@freebsd.org escribió:

 On 12/18/11 12:18 PM, Fernando Apesteguía wrote:

 Hi all,

 I'm writing a small module just for fun. I would like to have two
 variables:

 - pid of type unsigned int and RW so the user can set a pid
 - process_name as a string RD that will display the process name
 associated to that pid (or a message if the pid doesn't exist anymore)


 this is dangerous as there are some places in the kernel where processes
 are referenced by pid, so changing it may break kernel assumptions.

 Sorry, i think i didn't explain it clearly. The pid variable is static in
 my module and it is used just to tell the module which information it
 should show in the other variable.

Many sysctl handlers look for a req-newptr and leave early if it's
NULL.  If it's non-NULL then you can use SYSCTL_IN to fetch the data.
Note that a caller of sysctl(3) API can specify both old pointer and
new pointer, which is why most handlers always do a SYSCTL_OUT on the
current value.

Cheers,
matthew

 My problem is with the handler functions. For process_name, as it is
 read only, I wrote a simple handler that works fine. However, I want
 to write another one for pid so I can sanitize the input (avoiding
 pids  0 and so). As I understand, the handler I specify with
 SYSCTL_OID will be called for both reads and writes. But, how can I
 tell what kind of operation is it, so I know if I have to use
 SYSCTL_OUT or SYSCTL_IN? I tried to have a look at sysctl_handle_int
 but I don't fully understand what is going on with the arg1 parameter.
 What is it for?

 Thanks in advance.
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
 


 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: kern_yield vs ukbd_yield

2011-12-12 Thread mdf
On Mon, Dec 12, 2011 at 12:03 AM, Andriy Gapon a...@freebsd.org wrote:
 on 11/12/2011 23:48 m...@freebsd.org said the following:
 On Sun, Dec 11, 2011 at 1:12 PM, Andriy Gapon a...@freebsd.org wrote:

 Does the following change do what I think that it does?
 Thank you!

 Author: Andriy Gapon a...@icyb.net.ua
 Date:   Thu Sep 1 16:50:13 2011 +0300

    ukbd: drop local duplicate of kern_yield and use that instead

 diff --git a/sys/dev/usb/input/ukbd.c b/sys/dev/usb/input/ukbd.c
 index 086c178..8078cbb 100644
 --- a/sys/dev/usb/input/ukbd.c
 +++ b/sys/dev/usb/input/ukbd.c
 @@ -399,33 +399,6 @@ ukbd_put_key(struct ukbd_softc *sc, uint32_t key)
  }

  static void
 -ukbd_yield(void)
 -{
 -       struct thread *td = curthread;
 -       uint32_t old_prio;
 -
 -       DROP_GIANT();
 -
 -       thread_lock(td);
 -
 -       /* get current priority */
 -       old_prio = td-td_base_pri;
 -
 -       /* set new priority */
 -       sched_prio(td, td-td_user_pri);
 -
 -       /* cause a task switch */
 -       mi_switch(SW_INVOL | SWT_RELINQUISH, NULL);
 -
 -       /* restore priority */
 -       sched_prio(td, old_prio);
 -
 -       thread_unlock(td);
 -
 -       PICKUP_GIANT();
 -}
 -
 -static void
  ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait)
  {

 @@ -439,7 +412,7 @@ ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait)
                while (sc-sc_inputs == 0) {

                        /* give USB threads a chance to run */
 -                       ukbd_yield();
 +                       kern_yield(-1);

 Not quite.

 1) -1 should be spelled PRI_UNCHANGED, except ukbd_yield() uses
 td_user_pri, but then puts it back again, so I think UNCHANGED is what
 is meant.
 2) kern_yield() calls it a SW_VOL rather than SW_INVOL, which seems
 the desired behaviour here anyways, since this is an explicit (i.e.
 voluntary) yield.

 Thank you for the explanation.  So would you say that the patch is OK?

As far as I know, yes.  There may be a difference in behaviour,
though, while yielding, if the priority of the thread remains high (as
this change would make it) -- I'm not completely sure how the
scheduler chooses threads, because I'm pretty sure I've seen it take
threads with lower (higher numbered) priorities even when there's
runnable threads with a higher (lower numbered) priority available.

It has always seemed weird to me that the priorities in the kernel are
strictly higher than user-space -- but only after a prio change like
that done implicitly by many of the calls to sleep(9).  So it may be
that the better patch is to use PRI_USER, not PRI_UNCHANGED, which
would revert any potentially changed thread prio (e.g. due to a
sleep(9)) back to its user-space level, so that it contended as
expected with other threads.

hselasky@ or someone else familiar with the various usb threads would
have to answer that.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: kern_yield vs ukbd_yield

2011-12-12 Thread mdf
On Mon, Dec 12, 2011 at 11:05 AM, John Baldwin j...@freebsd.org wrote:
 On Monday, December 12, 2011 10:58:22 am Hans Petter Selasky wrote:
 On Monday 12 December 2011 16:55:38 m...@freebsd.org wrote:
  On Mon, Dec 12, 2011 at 12:03 AM, Andriy Gapon a...@freebsd.org wrote:
   on 11/12/2011 23:48 m...@freebsd.org said the following:
   On Sun, Dec 11, 2011 at 1:12 PM, Andriy Gapon a...@freebsd.org wrote:
   Does the following change do what I think that it does?
   Thank you!
  
   Author: Andriy Gapon a...@icyb.net.ua
   Date:   Thu Sep 1 16:50:13 2011 +0300
  
      ukbd: drop local duplicate of kern_yield and use that instead
  
   diff --git a/sys/dev/usb/input/ukbd.c b/sys/dev/usb/input/ukbd.c
   index 086c178..8078cbb 100644
   --- a/sys/dev/usb/input/ukbd.c
   +++ b/sys/dev/usb/input/ukbd.c
   @@ -399,33 +399,6 @@ ukbd_put_key(struct ukbd_softc *sc, uint32_t key)
    }
  
    static void
   -ukbd_yield(void)
   -{
   -       struct thread *td = curthread;
   -       uint32_t old_prio;
   -
   -       DROP_GIANT();
   -
   -       thread_lock(td);
   -
   -       /* get current priority */
   -       old_prio = td-td_base_pri;
   -
   -       /* set new priority */
   -       sched_prio(td, td-td_user_pri);
   -
   -       /* cause a task switch */
   -       mi_switch(SW_INVOL | SWT_RELINQUISH, NULL);
   -
   -       /* restore priority */
   -       sched_prio(td, old_prio);
   -
   -       thread_unlock(td);
   -
   -       PICKUP_GIANT();
   -}
   -
   -static void
    ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait)
    {
  
   @@ -439,7 +412,7 @@ ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait)
                  while (sc-sc_inputs == 0) {
  
                          /* give USB threads a chance to run */
   -                       ukbd_yield();
   +                       kern_yield(-1);
  
   Not quite.
  
   1) -1 should be spelled PRI_UNCHANGED, except ukbd_yield() uses
   td_user_pri, but then puts it back again, so I think UNCHANGED is what
   is meant.
   2) kern_yield() calls it a SW_VOL rather than SW_INVOL, which seems
   the desired behaviour here anyways, since this is an explicit (i.e.
   voluntary) yield.
  
   Thank you for the explanation.  So would you say that the patch is OK?
 
  As far as I know, yes.  There may be a difference in behaviour,
  though, while yielding, if the priority of the thread remains high (as
  this change would make it) -- I'm not completely sure how the
  scheduler chooses threads, because I'm pretty sure I've seen it take
  threads with lower (higher numbered) priorities even when there's
  runnable threads with a higher (lower numbered) priority available.

 The scheduler generally should not do this with the following exceptions:

  1) for timesharing threads, priorities are in bands, so effectively the
    pri / 4 is what is used for comparison and if two threads have the same
    pri / 4 the scheduler will round-robin among htem.

  2) ULE might be a bit different because of the way it assigns threads to
    CPUs, so if a CPU has two high priority threads and another CPU only
    has a low priority thread, the second CPU will not run the second high
    priority thread.  4BSD handles this case more correctly.

  It has always seemed weird to me that the priorities in the kernel are
  strictly higher than user-space -- but only after a prio change like
  that done implicitly by many of the calls to sleep(9).  So it may be
  that the better patch is to use PRI_USER, not PRI_UNCHANGED, which
  would revert any potentially changed thread prio (e.g. due to a
  sleep(9)) back to its user-space level, so that it contended as
  expected with other threads.

 Realtime priorities (for rtprio user threads) are higher than the kernel
 sleep priorities.  Also, keep in mind that a thread does not get an
 automatic priority boost when it enters the kernel.  It only gets a boost
 either temporarily from priority propagation, or a slightly longer (but
 still temporary) boost from a sleep(9) call.

I still don't understand why threads are boosting their priority while
sleeping; if it's so they are woken preferentially, that makes some
sense, but then they shouldn't be keeping the boosted priority after
the thread is woken.  If a thread really needs higher priority to do
its work, that should be an explicit call to sched_prio(9), rather
than implicitly only when/if it first sleeps.

Thanks,
matthew


  hselasky@ or someone else familiar with the various usb threads would
  have to answer that.

 The problem is only during init() where the init thread has highest priority
 and that doesn't allow other threads to run even if the scheduler is
 running!

 Hmm, that should be fixed by lowering the relevant thread's priority.
 Do you mean thread0 (the one doing all the SYSINIT's or thread we create for
 init (pid 1) before it executes init?

 --
 John Baldwin
___
freebsd-hackers@freebsd.org mailing list

Re: kern_yield vs ukbd_yield

2011-12-11 Thread mdf
On Sun, Dec 11, 2011 at 1:12 PM, Andriy Gapon a...@freebsd.org wrote:

 Does the following change do what I think that it does?
 Thank you!

 Author: Andriy Gapon a...@icyb.net.ua
 Date:   Thu Sep 1 16:50:13 2011 +0300

    ukbd: drop local duplicate of kern_yield and use that instead

 diff --git a/sys/dev/usb/input/ukbd.c b/sys/dev/usb/input/ukbd.c
 index 086c178..8078cbb 100644
 --- a/sys/dev/usb/input/ukbd.c
 +++ b/sys/dev/usb/input/ukbd.c
 @@ -399,33 +399,6 @@ ukbd_put_key(struct ukbd_softc *sc, uint32_t key)
  }

  static void
 -ukbd_yield(void)
 -{
 -       struct thread *td = curthread;
 -       uint32_t old_prio;
 -
 -       DROP_GIANT();
 -
 -       thread_lock(td);
 -
 -       /* get current priority */
 -       old_prio = td-td_base_pri;
 -
 -       /* set new priority */
 -       sched_prio(td, td-td_user_pri);
 -
 -       /* cause a task switch */
 -       mi_switch(SW_INVOL | SWT_RELINQUISH, NULL);
 -
 -       /* restore priority */
 -       sched_prio(td, old_prio);
 -
 -       thread_unlock(td);
 -
 -       PICKUP_GIANT();
 -}
 -
 -static void
  ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait)
  {

 @@ -439,7 +412,7 @@ ukbd_do_poll(struct ukbd_softc *sc, uint8_t wait)
                while (sc-sc_inputs == 0) {

                        /* give USB threads a chance to run */
 -                       ukbd_yield();
 +                       kern_yield(-1);

Not quite.

1) -1 should be spelled PRI_UNCHANGED, except ukbd_yield() uses
td_user_pri, but then puts it back again, so I think UNCHANGED is what
is meant.
2) kern_yield() calls it a SW_VOL rather than SW_INVOL, which seems
the desired behaviour here anyways, since this is an explicit (i.e.
voluntary) yield.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: gcc 4.2 miscompilation with -O2 -fno-omit-frame-pointer on amd64

2011-11-20 Thread mdf
On Sun, Nov 20, 2011 at 10:24 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote:
 On (19/11/2011 09:11), m...@freebsd.org wrote:
 On Sat, Nov 19, 2011 at 8:19 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote:
  On (19/11/2011 07:26), m...@freebsd.org wrote:
  On Sat, Nov 19, 2011 at 2:01 AM, Gleb Kurtsou gleb.kurt...@gmail.com 
  wrote:
   Hi,
  
   I was lucky to write a bit of code which gcc 4.2 fails to compile
   correctly with -O2. Too keep long story short the code fails for gcc
   from base system and last gcc 4.2 snapshot from ports. It works with gcc
   4.3, gcc 4.4 on FreeBSD and Linux. Clang from base is also good. -O and
   -Os optimization levels are fine (I've tried with all -f* flags
   mentioned in documentation)
  
   -O2 -fno-omit-frame-pointer combination is troublesome on amd64. I
   presume i386 should be fine. These options are also used for
   compilation of kernel (with debugging enabled) and modules.
  
   I'm not able to share the code, but have a test case reproducing the
   bug. I've encountered the issue over a week ago and tried narrowing it 
   down
   to a simple test I could share but without much success.
  
   The code itself is very common: initialize two structs on stack, call a
   function with pointers to those stucts as arguments. A number of inlined
   assertion functions. gcc fails to correctly optimize struct assignments
   with -fno-omit-frame-pointer, I have a number of small structs assigned,
   gcc decides not to use data coping but to assign fields directly. I've
   tried disabling sra, tweaking sra parameters -- no luck in forcing it
   to copy data. Replacing one particular assignment with memcpy produces
   correct code, but that's not a solution.
 
  How small are the structs?  gcc has an optimization for structs that
  are no larger than a register, but it's buggy in 4.2 and we disabled
  it at $WORK.  I can dig up the patch if this is the problem.
  struct sockaddr_in in this particular test. 16 bytes.
 
  Register size structs are rather common, e.g. struct in_addr.
 
  I could test the patch. Adding -finline-functions seems to fix the issue
  for me.

 I can't find the thing I'm thinking of.  The only potentially relevant
 patch I see in our gcc sources is this:

 It could be related but doesn't fix bug I observe. I've installed fresh
 9.0-RC2 virtual machine and reran tests in clean environment.

 Do you plan committing it?

We probably should, but I haven't heard that anyone else has had a
problem with this.  I'm sure when I do the next FreeBSD merge at $WORK
I'll re-remember it and probably commit it then.

Thanks,
matthew

 Index: opts.c
 ===
 --- opts.c    (.../vendor.branches/freebsd/stable/7/src/contrib/gcc/opts.c)  
  (revision
 211574)
 +++ opts.c    (.../head/src/contrib/gcc/opts.c)       (revision 211574)
 @@ -457,11 +457,7 @@
        flag_tree_dse = 1;
        flag_tree_ter = 1;
        flag_tree_live_range_split = 1;
 +      /**
 +       * 7dot1MERGE: tree-sra in gcc 4.2.x is buggy and
 +       * breaks bitfield structs.
 +       */
 +      flag_tree_sra = 0;
 -      flag_tree_sra = 1;
        flag_tree_copyrename = 1;
        flag_tree_fre = 1;
        flag_tree_copy_prop = 1;

 Thanks,
 matthew

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: gcc 4.2 miscompilation with -O2 -fno-omit-frame-pointer on amd64

2011-11-19 Thread mdf
On Sat, Nov 19, 2011 at 2:01 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote:
 Hi,

 I was lucky to write a bit of code which gcc 4.2 fails to compile
 correctly with -O2. Too keep long story short the code fails for gcc
 from base system and last gcc 4.2 snapshot from ports. It works with gcc
 4.3, gcc 4.4 on FreeBSD and Linux. Clang from base is also good. -O and
 -Os optimization levels are fine (I've tried with all -f* flags
 mentioned in documentation)

 -O2 -fno-omit-frame-pointer combination is troublesome on amd64. I
 presume i386 should be fine. These options are also used for
 compilation of kernel (with debugging enabled) and modules.

 I'm not able to share the code, but have a test case reproducing the
 bug. I've encountered the issue over a week ago and tried narrowing it down
 to a simple test I could share but without much success.

 The code itself is very common: initialize two structs on stack, call a
 function with pointers to those stucts as arguments. A number of inlined
 assertion functions. gcc fails to correctly optimize struct assignments
 with -fno-omit-frame-pointer, I have a number of small structs assigned,
 gcc decides not to use data coping but to assign fields directly. I've
 tried disabling sra, tweaking sra parameters -- no luck in forcing it
 to copy data. Replacing one particular assignment with memcpy produces
 correct code, but that's not a solution.

How small are the structs?  gcc has an optimization for structs that
are no larger than a register, but it's buggy in 4.2 and we disabled
it at $WORK.  I can dig up the patch if this is the problem.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: gcc 4.2 miscompilation with -O2 -fno-omit-frame-pointer on amd64

2011-11-19 Thread mdf
On Sat, Nov 19, 2011 at 8:19 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote:
 On (19/11/2011 07:26), m...@freebsd.org wrote:
 On Sat, Nov 19, 2011 at 2:01 AM, Gleb Kurtsou gleb.kurt...@gmail.com wrote:
  Hi,
 
  I was lucky to write a bit of code which gcc 4.2 fails to compile
  correctly with -O2. Too keep long story short the code fails for gcc
  from base system and last gcc 4.2 snapshot from ports. It works with gcc
  4.3, gcc 4.4 on FreeBSD and Linux. Clang from base is also good. -O and
  -Os optimization levels are fine (I've tried with all -f* flags
  mentioned in documentation)
 
  -O2 -fno-omit-frame-pointer combination is troublesome on amd64. I
  presume i386 should be fine. These options are also used for
  compilation of kernel (with debugging enabled) and modules.
 
  I'm not able to share the code, but have a test case reproducing the
  bug. I've encountered the issue over a week ago and tried narrowing it down
  to a simple test I could share but without much success.
 
  The code itself is very common: initialize two structs on stack, call a
  function with pointers to those stucts as arguments. A number of inlined
  assertion functions. gcc fails to correctly optimize struct assignments
  with -fno-omit-frame-pointer, I have a number of small structs assigned,
  gcc decides not to use data coping but to assign fields directly. I've
  tried disabling sra, tweaking sra parameters -- no luck in forcing it
  to copy data. Replacing one particular assignment with memcpy produces
  correct code, but that's not a solution.

 How small are the structs?  gcc has an optimization for structs that
 are no larger than a register, but it's buggy in 4.2 and we disabled
 it at $WORK.  I can dig up the patch if this is the problem.
 struct sockaddr_in in this particular test. 16 bytes.

 Register size structs are rather common, e.g. struct in_addr.

 I could test the patch. Adding -finline-functions seems to fix the issue
 for me.

I can't find the thing I'm thinking of.  The only potentially relevant
patch I see in our gcc sources is this:


Index: opts.c
===
--- opts.c  (.../vendor.branches/freebsd/stable/7/src/contrib/gcc/opts.c)   
(revision
211574)
+++ opts.c  (.../head/src/contrib/gcc/opts.c)   (revision 211574)
@@ -457,11 +457,7 @@
   flag_tree_dse = 1;
   flag_tree_ter = 1;
   flag_tree_live_range_split = 1;
+  /**
+   * 7dot1MERGE: tree-sra in gcc 4.2.x is buggy and
+   * breaks bitfield structs.
+   */
+  flag_tree_sra = 0;
-  flag_tree_sra = 1;
   flag_tree_copyrename = 1;
   flag_tree_fre = 1;
   flag_tree_copy_prop = 1;

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Communication between kernel and userspace via local socket

2011-11-15 Thread mdf
On Tue, Nov 15, 2011 at 12:18 PM, Maxim Ignatenko gelraen...@gmail.com wrote:
 frHi,

 I'm currently inventing the wheel^W^W^Wwriting a firewall from scratch and
 looking for most convenient way to establish communication between userspace
 processes and kernel part. Communication pattern best fits to listening
 PF_LOCAL socket opened from kernel and userspace processes connecting to it.
 Clients should be able to send requests and receive responses from kernel (to
 retrieve list of loaded modules, active ruleset, add or remove rules, ...) and
 vice versa: kernel should be able to send request to userspace process and
 receive response (I'm planning to add interactive features like in most
 firewalls for windows(r)).

 First part can be implemented via ioctl, but it should be called not only by
 processes with euid == 0, so supplied pointer to receive buffer cannot be
 trusted (is there any mechanism to check memory allocation?) and any
 unprivileged user can instruct kernel to write some trash at arbitrary address
 (for example, VM just rebooted ungracefully when I supplied (void*)123 as
 pointer to destination buffer).

Were you using copyout(9)?  I think FreeBSD's memory isolation between
processes is pretty decent. I would be very surprised if copyout to an
invalid address did something other than return EFAULT.  At least the
amd64 implementation of copyout(9) will also explicitly check that the
address is a user address, so that you can't corrupt kernel memory
with a rogue pointer from user-space.

Thanks,
matthew

 So, requirements is:
 1) message exchange can initiated from userspace and from kernel
 2) safe to communicate with unprivileged processes (not like in above case
 with ioctl)
 3) kernel part should be able to determine process uid
 4) messages size can be large (from 1KB to 10KB and more)

 Now I'm thinking about few variants:
 1) emulation of local socket via character device. This way requires to
 manually handle per-process IO buffers, which almost certainly will have many
 bugs
 2) opening local socket from kernel. This, as I think, require to spawn new
 process in kernel (but I don't know how to do this) to listen for incoming
 connections and messages
 3) userspace mux/demux daemon (like devd): one and only one process opens
 character device and uses local socket to communicate with other processes.
 This requires to design 2 ABIs - kernel-daemon and daemon-client.

 2nd variant looks most appropriate but know I don't know how to implement it.
 Can anyone point me to some documentation about spawning processes in kernel
 an working with sockets from kernelspace, or suggest better way of
 communication between processes and kernel?
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: question regarding style(9) and field initialisers in structs

2011-11-02 Thread mdf
On Wed, Nov 2, 2011 at 4:56 PM, Alexander Best arun...@freebsd.org wrote:
 i sent the following message to freebsd-quaestions@ and got no answer. mybe it
 is better suited for freebsd-hackers@.

 hi there,

 i found hundreds of the following cases in the FreeBSD src:

 [...]
 struct periph_driver {
        periph_init_func_t      init;
        char                    *driver_name;
        TAILQ_HEAD(,cam_periph) units;
        u_int                   generation;
        u_int                   flags;
        #define CAM_PERIPH_DRV_EARLY    0x01
 };
 [...]
 static struct periph_driver dadriver =
 {
        dainit, da,
        TAILQ_HEAD_INITIALIZER(dadriver.units), /* generation */ 0
 };

 ...is it proper programming practice to forget about the last field, if it
 would have been initialised to 0?

It's more likely that, in this instance, the field was added after the
initializer.  Without named initializers, all fields until the last
non-zero one need to be explicitly present.

style(9) doesn't address it, having been written too long ago, but IMO
initializations in a C translation unit [1] should use C99's named
initializer feature, to protect against future member re-ordering.
And it would be style(9) compliant to leave out any field whose
initial value is zero (though sometimes it's good to list it anyways
to make it clear that 0 was the desired value).

[1] Note that C++ doesn't support C99's named initializer syntax (even
in C++0x, which is a bit silly), so any struct initializers in a
header file like sys/malloc.h must use old-style, since FreeBSD
should continue to support writing kernel modules in C++.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: How to express inter-directory dependencies in bsd.*.mk infrastructure?

2011-10-30 Thread mdf
2011/10/30 Lev Serebryakov l...@freebsd.org:
 Hello, Hackers.

  (SORRY FOR SENDING INCOMPLETE MESSAGE)

  How to express inter-directory dependencies in bsd.*.mk infrastructure?

  I have project, which has two subdirectories: lib and bin.
  Top-level Makefile is simple one, looks like this:

 ===
 SUBDIR= lib \
        bin

 .include bsd.subdir.mk
 ===

  lib subdirectory has Makefile with bsd.lib.mk included, and
  bin -- with bsd.prog.mk included.

  But how could I express, that PROG in bin depends on LIB from lib, to
 cause rebuilding of PROG when LIB is changed (when I call make on
 top level)?

Normally I'd expect the dependency to be handled by make depend; that
is, the source for the binary should be #including header files from
the library.  If the interfaces change, make depend will handle this.

I guess the problem comes if you are using static linking and just
implementation details internal to the library change.

There is a DPADD makefile variable that seems to be used; I'm not sure
though if this handles the type of dependency you're wanting.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: How to remopve volatile qualifier from pointer without warning from gcc?

2011-10-30 Thread mdf
2011/10/30 Lev Serebryakov l...@freebsd.org:
 Hello, Hackers.

  I need to pass volatile void * to API function, which takes
 void*. gcc (on FreeBSD 8.2) emits warning, and as in FreeBSD-styed
 code warnings are treated as errors, program could not be built.

  Manual casting gives warning, too...

See the __DEVOLATILE() hack that will do sufficient casting to silence
the compiler.  As with any compiler warning, silencing it incorrectly
comes with a risk...

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Using Valgrind on FreeBSD

2011-10-07 Thread mdf
2011/10/7 Mustafa Reşit Şahin resitsa...@gmail.com:
 I am trying to run Valgrind on FreeBSD. I am getting the error about
 ksem_open which i stated below. I have searched for  a solution to be able
 to solve this problem and found the calgrind patch. I could not found the
 instructions to apply this patch. How can i use valgrind on FreeBSD with
 ksem_open enabled?

It looks like that syscall is only available when the
P1003_1B_SEMAPHORES option is in your kernel config.  You will need to
edit a kernel configuration file and rebuild / reinstall the kernel.

Thanks,
matthew


 I use valgrind version:  valgrind-3.6.0
 FreeBSD Version :FreeBSD 8.1-RELEASE-p2 FreeBSD 8.1-RELEASE-p2 #6:  amd64

 The error i get : (The valgrind output)


 --84521-- WARNING: unhandled syscall: 404
 ==84521==    at 0x144E2BC: __sys_ksem_init (in /lib/libc.so.7)
 ==84521==    by 0x14422FE: sem_init (in /lib/libc.so.7)
 ==84521==    by 0x1639BBB: snf__sem_ring_open (in
 /usr/local/opt/snf/lib/libsnf.so.0.3)
 ==84521==    by 0x163A1AC: snf__open_endpoint_ring (in
 /usr/local/opt/snf/lib/libsnf.so.0.3)
 ==84521==    by 0x163A5A3: snf__board_open (in
 /usr/local/opt/snf/lib/libsnf.so.0.3)
 ==84521==    by 0x1637A7A: snf_open (in
 /usr/local/opt/snf/lib/libsnf.so.0.3)
 ==84521==    by 0xF04DBA: snf_activate (in
 /usr/local/opt/snf/lib/libpcap.so.1.1.1)
 ==84521==    by 0xF05C26: pcap_activate (in
 /usr/local/opt/snf/lib/libpcap.so.1.1.1)
 ==84521==    by 0xF0624A: pcap_open_live (in
 /usr/local/opt/snf/lib/libpcap.so.1.1.1)
 ==84521==    by 0x452B4D: ??? (in /usr/sbin/tcpdump)
 ==84521==    by 0x402E4D: ??? (in /usr/sbin/tcpdump)
 ==84521==    by 0x41FFF: ???
 --84521-- You may be able to write your own handler.
 --84521-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
 --84521-- Nevertheless we consider this a bug.  Please report
 --84521-- it at http://valgrind.org/support/bug_reports.html.
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Atomic increment and get?

2011-10-03 Thread mdf
2011/10/3 Lev Serebryakov l...@freebsd.org:
  Is here atomic increment and get (or add and get) operation in
  kernel? I cannot find one. Here is atomic_add_32(), but it doesn't
  return result. And here is no atomic_add_64() on 32 bit system :(

See atomic_fetchadd_int.

Not all hardware has 64-bit atomic instructions available in 32-bit
mode so it's not machine-independent.  Generally, for this case (and
potentially for sub-32-bit atomic operations [1]) a lock is used.

[1] with masking and shifting 32-bit atomics can be used to set data
in an 8 or 16 bit variable; see a recent commit to sys/vm.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: NUMA Support is there in FreeBSD.

2011-10-03 Thread mdf
On Mon, Oct 3, 2011 at 7:55 AM, satish kondapalli nitw.sat...@gmail.com wrote:
 I am new to FreeBSD, I just want know whether FreeBSD supports NUMA.
 If FreeBSD supports NUMA what are the kernel API to allocate memory?
 is there any example driver or any driver which is using the NUMA API?

 please provide some inputs...

The kernel is NUMA-aware (at least for x86), and memory is allocated
round-robin amongst the memory domains.  There are not yet any KPIs
for allocating memory in a specific NUMA domain, nor for binding
specific threads / processes to get their memory local to a bound cpu
instead of round robin.

There have been several discussions but no one has taken the lead and
proposed some KPIs yet.  At $WORK the round-robin is sufficient to get
consistent performance numbers and we have not yet started any
experimentation with binding specific threads to either CPU or memory.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: NUMA Support is there in FreeBSD.

2011-10-03 Thread mdf
On Mon, Oct 3, 2011 at 10:24 AM, Arnaud Lacombe lacom...@gmail.com wrote:
 Hi,

 On Mon, Oct 3, 2011 at 12:31 PM,  m...@freebsd.org wrote:
 On Mon, Oct 3, 2011 at 7:55 AM, satish kondapalli nitw.sat...@gmail.com 
 wrote:
 I am new to FreeBSD, I just want know whether FreeBSD supports NUMA.
 If FreeBSD supports NUMA what are the kernel API to allocate memory?
 is there any example driver or any driver which is using the NUMA API?

 please provide some inputs...

 The kernel is NUMA-aware (at least for x86),

 What x86 ? i386 ? amd64 ? both ?

Both; see sys/x86/acpica/srat.c which parses the SRAT table.

 and memory is allocated
 round-robin amongst the memory domains.  There are not yet any KPIs
 for allocating memory in a specific NUMA domain, nor for binding
 specific threads / processes to get their memory local to a bound cpu
 instead of round robin.

 I'm not sure to follow you. Say you have 2 memory domain attached to 2
 different CPU package, each providing a memory domain, 4 physical core
 and eventually 8 virtual. Say you have a network adapter supporting 8
 RX/TX queue, dispatching RX packet to 8 netisr. Ideally, you'd want
 those 8 queue/netisr to each have an affinity for a given CPU/memory
 domain, have the network adapter route flow evenly on those those 8
 CPU. Now, if you allocated an mbuf from memory domain 1, and end up
 being processed by a CPU in domain 0, that likely to introduce
 performance penalty.

Your statement isn't incorrect.  What I'm saying is that there's no
KPI for requesting bound memory because, while the netstat example is
a fine one for where local memory is desired, the majority [1] of
processing is not bound to a CPU and so round-robin allocations will
produce uniform performance results -- that is, not the best possible,
but not wildly fluctuating as scheduling decisions over different runs
give different remote memory penalties.

[1] for some definition of 'majority'.

 Now, what about userland ?

 This is certainly an horribly big picture :/

Yes, and it's why I said just that there's no KPI.  One reason there
is no KPI is that there's a lot of fiddly bits to take into account.

My experience at IBM on AIX was that NUMA is very easy to get wrong;
specifically what one usually wants is for the OS to get the answer
right (especially for userspace) without a lot of manual tuning;
except for some specific applications like netstat queues or a machine
doing HPC or mostly running e.g. an Oracle db server, there's too much
happening for any one program to configure itself right for all the
uses of that code.  I remember a lot of customer reports of problems
from overly aggressive local memory use.  Most of the time no one
complained when things had consistent performance, even if that wasn't
quite as fast as possible.

In fact, I may be wrong about the round-robin; I sent jhb@ a patch and
I have no recollection anymore whether it's actually in CURRENT.  It's
been over a year since I thought about this much (BSDCan 2010 was the
last time I remember).

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Memory allocation in kernel -- what to use in which situation? What is the best for page-sized allocations?

2011-10-02 Thread mdf
2011/10/2 Lev Serebryakov l...@freebsd.org:
 Hello, Freebsd-hackers.

  Here are several memory-allocation mechanisms in the kernel. The two
 I'm aware of is MALLOC_DEFINE()/malloc()/free() and uma_* (zone(9)).

  As far as I understand, malloc() is general-purpose, but it has
 fixed transaction cost (in term of memory consumption) for each
 block allocated, and is not very suitable for allocation of many small
 blocks, as lots of memory will be wasted for bookkeeping.

  zone(9) allocator, on other hand, have very low cost of each
 allocated block, but could allocate only pre-configured fixed-size
 blocks, and ideal for allocation tons of small objects (and provide
 API for reusing them, too!).

  Am I right?

No one has quite answered this question, IMO, so here's my 2 cents.

malloc(9) on smaller sizes (= PAGE_SIZE) uses uma(9) under the
covers.  There are a set of uma zones for 16, 32, 64, 128, ...
PAGE_SIZE bytes and malloc(9) looks up the malloc size in a small
array to determine which uma zone to allocate from.

So malloc(9) on small sizes doesn't have overhead of bookkeeping, but
it does have overhead of rounding to the next highest malloc uma
bucket.  At $WORK we found, for example, that 48 bytes and 96 bytes
were very common sizes and so I added uma zones there (and few other
odd sies determined by using the malloc statistics option).

   But what if I need to allocate a lot (say, 16K-32K) of page-sized
 blocks? Not in one chunk, for sure, but in lifetime of my kernel
 module. Which allocator should I use? It seems, the best one will be
 very low-level only-page-sized allocator. Is here any in kernel?

4k allocations, as has been pointed out, get a single kernel page in
both the virtual space and physical space.  They (like all the large
allocations) use a field in the vm_page for the physical page backing
the virtual address to record info about the allocation.

Any allocation PAGE_SIZE and larger will round up to the next multiple
of pages and allocate whole pages.  IMO the problems here are (1) as
was pointed out, TLB shootdown on free(9), and (2) the current
algorithm for finding space in a kmem_map is a linear search and
doesn't track where there are fragmented chunks, so it's not terribly
efficient when finding larger sies, and the PAGE_SIZE allocations will
not fill in fragmented areas.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Way to get current tick number in kernel?

2011-10-02 Thread mdf
2011/10/2 Lev Serebryakov l...@freebsd.org:
 Hello, Freebsd-hackers.

  What should I use to measure short intervals of time between events
 in kernel? I don't need any time in means of, for example, time(3)
 API, but some monotonically and uniformly increasing counter with
 known frequency. As cheap as possible, without complex calculations :)

There are several global variables that may suffice.

'ticks' is the current tick value.
'time_second' is the current time in seconds, but this is adjusted
when the system time is changed.
'time_uptime' is the current uptime in seconds and is the base upon
which time_second is computed, depending on the system clock.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Understanding panic and exit in the kernel

2011-09-07 Thread mdf
On Wed, Sep 7, 2011 at 11:08 AM, Charlie Martin crmar...@sgi.com wrote:
 I'm still pursuing a FreeBSD bug in 7.2-PRERELEASE FreeBSD -- and yes, we
 know this is wildly out of date, but it's not feasible to upgrade right now
 -- and while trying to backport a fix suggested here
 http://permalink.gmane.org/gmane.os.freebsd.current/134266 I got a situation
 where the panic calls in one of  these two macros from sys/queue.h

 #define    QMD_LIST_CHECK_NEXT(elm, field) do {                \
    if (LIST_NEXT((elm), field) != NULL                 \
        LIST_NEXT((elm), field)-field.le_prev !=            \
 ((elm)-field.le_next))                    \
             panic(Bad link elm %p next-prev != elm, (elm));    \
 } while (0)

 #define    QMD_LIST_CHECK_PREV(elm, field) do {                \
    if (*(elm)-field.le_prev != (elm))                \
        panic(Bad link elm %p prev-next != elm, (elm));    \
 } while (0)

 print the message, but *don't* then proceed to drop to the debugger --
 instead the system hangs, with the CPU running but I had no luck getting its
 attention to force it to the debugger.

 I'm not clear just what could be causing the hangup.

 For my immediate purposes, I'd be happy with any way in which I could
 brutally kill the kernel and force it to the debugger, say by replacing the
 panic call with a printf followed by 1/0;.  But I'm a little confused by
 the panic.c code -- it prints the arguments using a var_args, and then calls
 exit(1);'

What file are you looking in?  The kernel panic() is in
sys/kern/kern_shutdown.c, not sys/boot/common/panic.c.  It will
optionally call kdb_enter_why() and then boot().

Do you have the debug.debugger_on_panic sysctl set to 1?

Thanks,
matthew

 So my questions:

 (1) will my brutal method actually force what I want or am I
 misunderstanding something
 (2) *which* of the several implementations of int exit(int) or similar is
 the one called in the FreeBSD kernel?
 (3) and how then does exit work?
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Understanding panic and exit in the kernel

2011-09-07 Thread mdf
On Wed, Sep 7, 2011 at 12:24 PM, Charlie Martin crmar...@sgi.com wrote:


 On 2011-09-07 12:53, m...@freebsd.org wrote:

 For my immediate purposes, I'd be happy with any way in which I could
   brutally kill the kernel and force it to the debugger, say by
  replacing the
   panic call with a printf followed by 1/0;.  But I'm a little
  confused by
   the panic.c code -- it prints the arguments using a var_args, and then
  calls
   exit(1);'

 What file are you looking in?  The kernel panic() is in
 sys/kern/kern_shutdown.c, not sys/boot/common/panic.c.  It will
 optionally call kdb_enter_why() and then boot().

 Bingo, that's got to help.  This makes a lot more sense.

 Do you have the debug.debugger_on_panic sysctl set to 1?

 Yes -- and panic does so *except* in the version with those changes to
 queue.h.

If all you need to get started is a backtrace then
debug.trace_on_panic=1 may be sufficient.

I'm not sure of the timeline when 7.2-prerelease was issued, but there
have been some reliability improvements to panic handling including
marking that a CPU is in panic, etc., that may have come after
7.2-prerelease.  You may want to look at the svn history for
kern_shutdown.c and locally apply those changes to see if it changes
your result.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Large machine test ideas

2011-08-29 Thread mdf
On Mon, Aug 29, 2011 at 7:46 AM, Ivan Voras ivo...@freebsd.org wrote:
 On 26/08/2011 19:44, Garrett Cooper wrote:
 On Fri, Aug 26, 2011 at 10:36 AM, Ivan Voras ivo...@freebsd.org wrote:

 ...

 I think that I'll need a 9-CURRENT snapshot on it to run all 128 CPUs,
 right?

 A 9.0-BETA1 snapshot, yes.

 Well, I'll leave it another half an hour but the 9.9-beta1 shapshot
 froze on boot after showing a SRAT: No CPU found for memory domain 4.

This message implies the memory affinity information coming from ACPI
is either non-sensical, or you have an unexpected physical setup where
there really are CPUs with no memory in the local sockets.

You should be able to boot with something like hint.srat.0=disabled
at the boot loader prompt.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: [PATCH] sysctl cleanup - unifying sysctls: vm.kvm_size and vm.kvm_free

2011-07-18 Thread mdf
On Mon, Jul 18, 2011 at 5:32 AM, Uffe Jakobsen u...@uffe.org wrote:
 Please consider this patch - it unifies sysctls: vm.kvm_size and
 vm.kvm_free.

 Currently these sysctls are only found under i386 and amd64:

 sys/i386/i386/pmap.c
 sys/i386/xen/pmap.c
 sys/amd64/amd64/pmap.c

 It seems logical (to me) to move them into a generic location suce as
 sys/vm/vm_kern.c

 Patch against HEAD (revision 224180) attached.

I don't believe that VM_MAX_KERNEL_ADDRESS is relevant to all
architectures, which is why these sysctls are in i386/amd64 specific
files.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: [PATCH] sysctl cleanup - unifying sysctls: vm.kvm_size and vm.kvm_free

2011-07-18 Thread mdf
On Mon, Jul 18, 2011 at 7:47 AM, Uffe Jakobsen u...@uffe.org wrote:
 On 2011-07-18 15:54, m...@freebsd.org wrote:
 On Mon, Jul 18, 2011 at 5:32 AM, Uffe Jakobsenu...@uffe.org  wrote:

 Please consider this patch - it unifies sysctls: vm.kvm_size and
 vm.kvm_free.

 Currently these sysctls are only found under i386 and amd64:

 sys/i386/i386/pmap.c
 sys/i386/xen/pmap.c
 sys/amd64/amd64/pmap.c

 It seems logical (to me) to move them into a generic location suce as
 sys/vm/vm_kern.c

 Patch against HEAD (revision 224180) attached.

 I don't believe that VM_MAX_KERNEL_ADDRESS is relevant to all
 architectures, which is why these sysctls are in i386/amd64 specific
 files.


 Well I'm not an expert - but I did some checking before submitting this
 patch.

 As far as I can see VM_MAX_KERNEL_ADDRESS is addressed in several generic
 kernel files:

 sys/kern/subr_param.c
 sys/vm/vm_map.c
 sys/vm/vm_object.c

 and as far as I can see these locations are not ifdef'ed with any platform
 specific condition.



 Further more the following platform-specific files defines
 VM_MAX_KERNEL_ADDRESS:

 sys/amd64/include/vmparam.h
 sys/arm/include/vmparam.h
 sys/i386/include/vmparam.h
 sys/ia64/include/vmparam.h
 sys/mips/include/vmparam.h
 sys/powerpc/include/vmparam.h
 sys/sparc64/include/vmparam.h

 That leaves us only with pc98 and x86 platform-specifics subdirs that
 does not directly define VM_MAX_KERNEL_ADDRESS

 sys/pc98/include/vmparam.h includes i386/vmparam.h which defines
 VM_MAX_KERNEL_ADDRESS

 While x86 subdir is a little more questionable - though it has several
 includes for machine/vmparam.h which defines VM_MAX_KERNEL_ADDRESS



 Could soneone please give me an indication of what would be the correct way
 for me to verify this in a more scientific way.

Alan Cox is usually a definitive answer for FreeBSD vm questions.

My recollection was in the context of a bug with memguard(9) on
sparc64, however now that I refresh my memory the issue there was
VM_KMEM_SIZE_MAX, not VM_MAX_KERNEL_ADDRESS.

So, my apologies as my memory has led me astray.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: [PATCH] __FreeBSD_cc_version in sys/cdefs.h

2011-07-05 Thread mdf
On Tue, Jul 5, 2011 at 6:36 AM, Robert Millan r...@debian.org wrote:
 2011/7/5 Dimitry Andric d...@freebsd.org:
 As far as I can see, this code only gives warnings when compiled with
 gcc 4.5 or higher, and when using the -Wundef flag.  Isn't it easier to
 just remove the -Wundef flag here?

 Here's a patch to remove -Wundef.  I think it's a bad idea however,
 IMHO it's better to fix the cause of the warning instead.

The problem is that, IIRC, C guarantees that an undefined symbol when
checked in a #if context will evaluate to 0.  So -Wundef checks for
warnings on compliant code.

Personally I have no objection to the original patch.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Unit Tests for FreeBSD Kernel APIs?

2011-06-24 Thread mdf
On Fri, Jun 24, 2011 at 6:14 AM, John Baldwin j...@freebsd.org wrote:
 On Friday, June 24, 2011 3:23:11 am Sebastian Huber wrote:
 Hello,

 exists there some unit tests for FreeBSD kernel APIs, e.g. mutex(9),
 condvar(9), etc.?

 Have a nice day!

 Hmm, I have a kernel module that does some tests, but it is not in the tree.
 One of the issues is that many of the tests you want to do for some of these
 APIs involve timing.  For rwlocks, for example, I used KTR traces and used
 a kernel module that forked 4 threads to all compete over a single lock.  I
 then verified via KTR traces that every branch was taken (and made liberal
 use of KASSERT()s which caught a few edge cases I had missed initially).

At $WORK we have a generic TEST_THREAD interface which allows for
calling essentially random kernel code (whatever a dev codes up) with
arguments.  At the shell one types test_thread foobar(12345,
\string\) or whatever args.  The test_thread syscall looks in the
set of registered functions for foobar, and passes off a string with
(12345, \string\) to the function.  Each function is responsible for
parsing its own arguments.

The args for the test_thread function pointer are the $WORK equivalent
of and input sbuf and and output sbuf, which is copyout'd to the
calling binary.

If there's interest I can clean up the patch a little to show
proof-of-concept that builds without all the $WORK gorp.

We use this for a few things: unit-testing kernel code, hand-editing
of filesystem data to recover customer sites from kernel bugs
(sometimes this is possible), etc.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Check for 0 ino_t in readdir(3)

2011-06-07 Thread mdf
There is a check in the function implementing readdir(3) for a zero
inode number:

struct dirent *
_readdir_unlocked(dirp, skip)
DIR *dirp;
int skip;
{
/* ... */
if (dp-d_ino == 0  skip)
continue;
/* ... */
}

skip is 1 except for when coming from _seekdir(3).

I don't recall any requirement that a filesystem not use an inode
numbered 0, though for obvious reasons it's a poor choice for a file's
inode.  So... is this code in libc incorrect?  Or is there
documentation that 0 cannot be a valid inode number for a filesystem?

If 0 cannot be a valid inode number, my GSoC mentee Gleb suggested
d_fileno == 0 be skipped in dirent_exists() as used by
vop_stdvptocnp(9), and also skipped by unionfs.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: int64_t and printf

2011-06-05 Thread mdf
On Sun, Jun 5, 2011 at 11:13 AM, Ben Laurie b...@links.org wrote:
 So, for example int64_t has no printf modifier I am aware of. Likewise
 its many friends.

The worse option is to use the C99 defines, like PRI64 and PRI64U.
The better option is to use %jd / %ju and cast the value to
[u]intmax_t.

Cheers,
matthew

 In the past I've handled this by having a define somewhere along the
 lines of...

 #if something
 # define INT_64_T_FMT %ld
 #else
 # define INT_64_T_FMT %lld
 #endif

 but I have no idea where to put such a thing in FreeBSD. Opinions? Also,
 I guess I'd really need to do a modifier rather than a format, for full
 generality.

 --
 http://www.apache-ssl.org/ben.html           http://www.links.org/

 There is no limit to what a man can do or how far he can go if he
 doesn't mind who gets the credit. - Robert Woodruff
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


sizeof(function pointer)

2011-05-31 Thread mdf
I am looking into potentially MFC'ing r212367 and related, that adds
drains to sbufs.  The reason for MFC is that several pieces of new
code in CURRENT are using the drain functionality and it would make
MFCing those changes much easier.

The problem is that r212367 added a pointer to a drain function in the
sbuf (it replaced a pointer to void).  The C standard doesn't
guarantee that a void * and a function pointer have the same size,
though its true on amd64, i386 and I believe PPC.  What I'm wondering
is, though not guaranteed by the standard, is it *practically* true
that sizeof(void *) == sizeof(int(*)(void)), such that an MFC won't
break binary compatibility for any supported architecture?  (The
standard does guarantee, though not in words, that all function
pointers have the same size, since it guarantees that pointers to
functions can be cast to other pointers to functions and back without
changing the value).

Another possibility is to malloc a blob that is sizeof(int(*)(void))
and store that in a renamed s_unused; this is a bit messier but
guaranteed to work.  I'd just rather the code be an MCF instead of a
partial re-write.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Does FreeBSD have replacement for posix_fadvice() or fcntl(F_RDADVISE)?

2011-05-12 Thread mdf
On Thu, May 12, 2011 at 4:17 AM, Kostik Belousov kostik...@gmail.com wrote:
 On Thu, May 12, 2011 at 11:38:12AM +0400, Lev Serebryakov wrote:
 Hello, Freebsd-hackers.

    Does FreeBSD have some custom call, which can be used where Linux
  programs uses posix_fadvice() and DARWIN ones fcntl(F_RDADVISE)?

    It is like madvise(2) but for file descriptors.
 No, it does not (and I think the function is spelled posix_fadvise()).

 mdf reserved the syscall slot for posix_fadvise in his recent work
 on posix_fallocate(). Might be, he could comment more.

Whoops, I replied bot forgot to reply-all.

Adding a stub for posix_fadvise() is on my list of things to do, since
we use it at $WORK in our custom filesystem, but the list is long and
spare time is short.

Nominally a technically correct implementation can do error checking
and then nothing.  Even after a stub is added to the OS, each
filesystem must implement a VOP to do the work.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Does FreeBSD have replacement for posix_fadvice() or fcntl(F_RDADVISE)?

2011-05-12 Thread mdf
2011/5/12 Lev Serebryakov l...@serebryakov.spb.ru:
 Hello, Mdf.
 You wrote 12 мая 2011 г., 15:40:25:

    Does FreeBSD have some custom call, which can be used where Linux
  programs uses posix_fadvice() and DARWIN ones fcntl(F_RDADVISE)?

    It is like madvise(2) but for file descriptors.
 No, it does not (and I think the function is spelled posix_fadvise()).

 mdf reserved the syscall slot for posix_fadvise in his recent work
 on posix_fallocate(). Might be, he could comment more.
 Whoops, I replied bot forgot to reply-all.

 Adding a stub for posix_fadvise() is on my list of things to do, since
 we use it at $WORK in our custom filesystem, but the list is long and
 spare time is short.

 Nominally a technically correct implementation can do error checking
 and then nothing.  Even after a stub is added to the OS, each
 filesystem must implement a VOP to do the work.
  transmission Torrents client rely on this syscall. It could be beuilt
 without it, of course (it is in our ports!), but it seems, that it
 produce much more sane disk load on Linux, because uses pre-fetch via
 posix_fadvise(). My profiling shows that pread() could take about 40%
 of it CPU time (8 seconds out of 20 with turned on ktrace, for
 example), even if throughput is very modest (700KiB/s, for example).
 Authors say, that they didn't observe such behavior on Linux (which
 has posix_fadvise()) or MacOS X (with fcntl(F_RDADVISE))...

  May be, we should go to fs@ with this thread now?

Probably; the interface and a few unit tests are a start but the
common filesystems (presumably ufs and zfs and perhaps others) will
need support and I do not have any knowledge of the internals of those
filesystems.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Include file search path

2011-03-29 Thread mdf
I thought I knew something about how the compiler looks for include
files, but now I think maybe I don't know much. :-)

So here's what I'm pondering.  When I build a library, like e.g. libc,
where do the include files get pulled from?  They can't (shouldn't) be
the ones in /usr/include, but I don't see a -nostdinc like for the
kernel.  There are -I directives in the Makefile for
-I${.CURDIR}/include -I${.CURDIR}/../../include, etc., but that won't
remove /usr/include from the search path.

I see in the gcc documentation that -I paths are searched before the
standards paths.  But isn't the lack of -nostdinc a bug (not just for
libc, but for any library in /usr/src/lib)?  It somewhat feels to me
that all of the libraries and binaries in the source distribution
should use -nostdinc and include only from the source distribution
itself.  This isn't always an issue, but for source upgrades it seems
crucial, and for a hacker it saves difficulties with having to install
headers before re-building.

Is that the intent, and it's not fully implemented?  How badly would
things break if -nostdinc was included in e.g. bsd.lib.mk? (This would
break non-base libraries, yes?  But as a thought experiment for the
base, how far off are we?)

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


coretemp(4)/amdtemp(4) and sysctl nodes

2010-12-03 Thread mdf
There are very few uses in FreeBSD mainline code of
sysctl_remove_oid(), and I was looking at potentially removing them.
However, the use in coretemp/amdtemp has me slightly stumped.

Each device provides a device_get_sysctl_ctx sysctl_ctx that is
automatically cleaned up when the device goes away.  Yet the sysctl
nodes for both amdtemp and coretemp use the context of other devices,
rather than their own.  I can't quite figure out why, though the two
are slightly different enough that they may have different reasons.

For coretmp(4) I don't see how the parent device can be removed first,
since we are a child device.  So from my understanding it makes no
sense to have an explicit sysctl_remove_oid() and attach in the
parent's sysctl_ctx.

For amdtemp, the temperature sysctl is added to the nexus.acpi.cpu
device, which I suppose it's possible could be removed while the
amdtemp sysctl was still present, causing errors.  This case could be
dealt with by orphaning sysctl nodes at sysctl_ctx_free if there were
any not removed, presuming that whoever created them will clean them
up (or leak the memory).

Thoughts and explanations of the use of the parent's sysctl context?

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


FreeBSD development on Mac OS

2010-11-21 Thread mdf
(Taking to -hackers by request)

My old (Ubuntu) laptop died and I have replaced it with a Macbook Air.
 I know from visual inspection at BSDCan that 50% of the attending
developers own a Mac laptop of some kind.

However, I've been a bit stumped on how to share the source code from
my Mac with the FreeBSD virtual machine.  When I was using Linux it
was relatively easy; I exported a directory and mounted it with NFS.
Using my Mac as a NFS server hasn't yet worked.  The google is mostly
coming up empty for me in terms of help.

My /etc/exports file on the Mac looks like:

/data   -maproot=mdf:admin  -network 10.211.55.0-mask 255.255.255.0

as that's the IP range that parallels is using for its virtual machines.

But when I try to mount from FreeBSD (as root) I get this error:

RPCPROG_MNT: RPC: Authentication error; why = Client credential too weak

Again, the google hits for this error message haven't given me
anything helpful.  I tried taking out the maproot option but I get the
same error.  Immediately after a nfsd restart I get a RPCPROG_NFS:
RPC: Program not registered error, but that clears up and then I'm
back to Client credential too weak.

I don't know if this is a client or server issue, but I don't recall
that I'm doing anything differently on the FreeBSD VM side except that
I'm trying it through Parallels instead of VMware or VirtualBox at the
moment.  So... any thoughts as to what else I need to do on the server
side?

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: FreeBSD development on Mac OS

2010-11-21 Thread mdf
On Sun, Nov 21, 2010 at 8:02 PM, Garrett Cooper gcoo...@freebsd.org wrote:
 On Sun, Nov 21, 2010 at 7:20 PM, Daniel O'Connor docon...@gsoft.com.au 
 wrote:

 On 22/11/2010, at 13:32, m...@freebsd.org wrote:
 My /etc/exports file on the Mac looks like:

 /data   -maproot=mdf:admin      -network 10.211.55.0    -mask 255.255.255.0

 as that's the IP range that parallels is using for its virtual machines.

 But when I try to mount from FreeBSD (as root) I get this error:

 RPCPROG_MNT: RPC: Authentication error; why = Client credential too weak

 I just tried this on my MBP with parallels and it worked fine - I had 
 maproot=0:0 though.

 Also, I mounted /Users - don't know if it makes a difference or if there is 
 some other thing that needs tweaking first.

    In short, look at /var/log/messages on the Macbook to see what RPC
 isn't happy with [1].
 HTH,
 -Garrett

 1. http://www.freebsddiary.org/nfs.php

maproot=0:0 gives me the same error.

There is no /var/log/messages on the mac, or at least not on mine.  A
grep for things like credential shows no hits anywhere in /var/log;
a grep for mountd in /var/log/*.log only has hits in
launchd-shutdown.log.  A grep for nfsd in /var/log/*.log doesn't
*seem* to have anything useful, just a few in appfirewall.log about
nfsd listening on various ports.

Even with the firewall off I get the same Client credential too weak
error.  I get the same error when I change /etc/exports on the Mac and
/etc/fstab on the FreeBSD VM to /Users/mdf.

So... I'm pretty stumped.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: FreeBSD development on Mac OS

2010-11-21 Thread mdf
Solved!

The default setup for Parallels was a Shared network.  I didn't see
any way to change this in the Parallels machine configuration, but
there was another menu I had missed where the network was
configurable.  Changing the network to bridged, ifconfig em0 down;
dhclient em0 and now mounting works!

Thanks,
matthew

On Sun, Nov 21, 2010 at 8:44 PM,  m...@freebsd.org wrote:
 On Sun, Nov 21, 2010 at 8:02 PM, Garrett Cooper gcoo...@freebsd.org wrote:
 On Sun, Nov 21, 2010 at 7:20 PM, Daniel O'Connor docon...@gsoft.com.au 
 wrote:

 On 22/11/2010, at 13:32, m...@freebsd.org wrote:
 My /etc/exports file on the Mac looks like:

 /data   -maproot=mdf:admin      -network 10.211.55.0    -mask 255.255.255.0

 as that's the IP range that parallels is using for its virtual machines.

 But when I try to mount from FreeBSD (as root) I get this error:

 RPCPROG_MNT: RPC: Authentication error; why = Client credential too weak

 I just tried this on my MBP with parallels and it worked fine - I had 
 maproot=0:0 though.

 Also, I mounted /Users - don't know if it makes a difference or if there is 
 some other thing that needs tweaking first.

    In short, look at /var/log/messages on the Macbook to see what RPC
 isn't happy with [1].
 HTH,
 -Garrett

 1. http://www.freebsddiary.org/nfs.php

 maproot=0:0 gives me the same error.

 There is no /var/log/messages on the mac, or at least not on mine.  A
 grep for things like credential shows no hits anywhere in /var/log;
 a grep for mountd in /var/log/*.log only has hits in
 launchd-shutdown.log.  A grep for nfsd in /var/log/*.log doesn't
 *seem* to have anything useful, just a few in appfirewall.log about
 nfsd listening on various ports.

 Even with the firewall off I get the same Client credential too weak
 error.  I get the same error when I change /etc/exports on the Mac and
 /etc/fstab on the FreeBSD VM to /Users/mdf.

 So... I'm pretty stumped.

 Thanks,
 matthew

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Phantom sysctl

2010-11-15 Thread mdf
On Mon, Nov 15, 2010 at 9:53 AM, Garrett Cooper yaneg...@gmail.com wrote:
    According to SYSCTL_INT(9):

     The SYSCTL kernel interfaces allow code to statically declare sysctl(8)
     MIB entries, which will be initialized when the kernel module containing
     the declaration is initialized.  When the module is unloaded, the sysctl
     will be automatically destroyed.

    The sysctl should be reaped when the module is unloaded. My dumb
 test kernel module [1] doesn't seem to do that though (please note
 that the OID test_int_sysctl is created, and not reaped... FWIW it's
 kind of bizarre that test_int_sysctl is created in the first place,
 given what I've seen when SYSCTL_* gets executed):

 toaster# kldload ./test_int_sysctl.ko
 toaster# sysctl -a | grep test
 test_int_sysctl: 0
 vfs.nfs_common.realign_test: 0
 debug.test_int_sysctl: 0
 toaster# sysctl test_int_sysctl
 sysctl: unknown oid 'test_int_sysctl'
 toaster# kldunload ./test_int_sysctl.ko
 toaster# sysctl -a | grep test
 test_int_sysctl: 0
 vfs.nfs_common.realign_test: 0

debug.test_int_sysctl did disappear, and that was the sysctl in the module:

SYSCTL_INT(_debug, OID_AUTO, test_int_sysctl, CTLFLAG_RW, _sysctl, 0,
Test sysctl OID);

I am not sure where the other test_int_sysctl appeared from, but the
results of sysctl test_int_sysctl didn't change from before
kldunload to after.

Thanks,
matthew

 toaster# sysctl test_int_sysctl
 sysctl: unknown oid 'test_int_sysctl'

    I've seen this behavior on 8.1-RELEASE (custom kernel, vanilla
 sources), and CURRENT r215254.
    I'm compiling the kernel with SYSCTL_DEBUG (and added some missing
 error checking in the kern_sysctl.c code) to see if I can track down
 the resource issue, but in the meantime if someone more knowledgeable
 has some suggestions for what to do / where I should look, I'm all
 ears.
 Thanks!
 -Garrett

 1. http://pastebin.com/n7d9bH8U
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: UMA allocations from a specific physical range

2010-09-05 Thread mdf
On Mon, Sep 6, 2010 at 1:38 AM, Nathan Whitehorn nwhiteh...@freebsd.org wrote:
 PowerPC hypervisors typically provided a restricted range on memory when
 the MMU is disabled, as it is when initially handling exceptions. In
 order to restore virtual memory, the powerpc64 code needs to read a data
 structure called the SLB cache, which is currently allocated out of a
 UMA zone, and must be mapped into wired memory, ideally 1:1
 physical-virtual address. Since this must be accessible in real mode,
 it must have a physical address in a certain range. I am trying to
 figure out the best way to do this.

 My first run at this code uses a custom UMA allocator that calls
 vm_phys_alloc_contig() to get a memory page. The trouble I have run into
 is that I cannot figure out a way to free the page. Marking the zone
 NOFREE is a bad solution, vm_page_free() panics the kernel due to
 inconsistent tracking of page wiring, and vm_phys_free_pages() causes
 panics in vm_page_alloc() later on (page is not free). What is the
 correct way to deallocate these pages? Or is there a different approach
 I should adopt?

I assume this is for the SLB flih?

What AIX did was to have a 1-1 simple esid to vsid translation for
kernel addresses, reserve the first 16 SLB entries for various uses,
including one for the current process's process private segment, and
if the slb miss was on a process address we'd turn on translation and
look up the answer, the tables holding the answer being in the process
private segment effective address space so we wouldn't take another
slb miss.  This required one level deep recursion in the slb slih, in
case there was a miss on kernel data with xlate on in the SLB slih.

For historical reasons due to the per-process segment table for
POWER3, we also had a one-page hashed lookup table per process that we
stored the real address of in the process private segment, so the
assembly code in the flih looked here before turning on MSR_DR IIRC.
I was trying to find ways to kill this code when I left IBM, since
we'd ended support for POWER3 a few years earlier.

I haven't had the time to look at FreeBSD ppc64 sources; how large are
the uma-allocated slb entries and what is stored in them?  The struct
and filename is sufficient, though I don't have convenient access to
sources until Tuesday.

V=R space is rather limited (well, depending on a lot of factors; for
AIX on Power5 and later the hypervisor only gave us 128M, though for
ppc64 on a Mac G4 I assume all of memory can be mapped V=R if desired)
so it was best to find a non V=R solution if possible.  Turning on
translation in the flih after some setup and recursion stopping is one
of the easier ways, and also has the advantage of not needing to
either have separate code or macro access to data structures used in
both V and R modes.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: UMA allocations from a specific physical range

2010-09-05 Thread mdf
On Mon, Sep 6, 2010 at 4:28 AM, Nathan Whitehorn nwhiteh...@freebsd.org wrote:
 On 09/05/10 22:51, m...@freebsd.org wrote:
 On Mon, Sep 6, 2010 at 1:38 AM, Nathan Whitehorn nwhiteh...@freebsd.org 
 wrote:

 PowerPC hypervisors typically provided a restricted range on memory when
 the MMU is disabled, as it is when initially handling exceptions. In
 order to restore virtual memory, the powerpc64 code needs to read a data
 structure called the SLB cache, which is currently allocated out of a
 UMA zone, and must be mapped into wired memory, ideally 1:1
 physical-virtual address. Since this must be accessible in real mode,
 it must have a physical address in a certain range. I am trying to
 figure out the best way to do this.

 My first run at this code uses a custom UMA allocator that calls
 vm_phys_alloc_contig() to get a memory page. The trouble I have run into
 is that I cannot figure out a way to free the page. Marking the zone
 NOFREE is a bad solution, vm_page_free() panics the kernel due to
 inconsistent tracking of page wiring, and vm_phys_free_pages() causes
 panics in vm_page_alloc() later on (page is not free). What is the
 correct way to deallocate these pages? Or is there a different approach
 I should adopt?

 I assume this is for the SLB flih?

 What AIX did was to have a 1-1 simple esid to vsid translation for
 kernel addresses, reserve the first 16 SLB entries for various uses,
 including one for the current process's process private segment, and
 if the slb miss was on a process address we'd turn on translation and
 look up the answer, the tables holding the answer being in the process
 private segment effective address space so we wouldn't take another
 slb miss.  This required one level deep recursion in the slb slih, in
 case there was a miss on kernel data with xlate on in the SLB slih.

 Yes, that's correct. FreeBSD has the same 1-to-1 translation for the
 kernel, but the entire address space is switched out for user processes
 (no part of the kernel is mapped into user processes), so the code to
 load the user SLB entries has to be able to execute with the MMU off,
 lest it disappear underneath itself.

Okay.  For AIX the kernel text/data in esid 0 was always in slb entry
0 (so it wasn't affected by slbia) and also was mapped into the
process address space.  So we had to be careful with KsKp bits to
prevent access to anything the user couldn't see.  The code for memcpy
and friends was at fixed addresses in the kernel segment so the
compiler knew to jump there, and there was also a user-readable
_system_configuration struct.

Even with no address sharing, the SLB flih could load entries for the
kernel and turn on translation, but it would be trickier.

 For historical reasons due to the per-process segment table for
 POWER3, we also had a one-page hashed lookup table per process that we
 stored the real address of in the process private segment, so the
 assembly code in the flih looked here before turning on MSR_DR IIRC.
 I was trying to find ways to kill this code when I left IBM, since
 we'd ended support for POWER3 a few years earlier.

 I haven't had the time to look at FreeBSD ppc64 sources; how large are
 the uma-allocated slb entries and what is stored in them?  The struct
 and filename is sufficient, though I don't have convenient access to
 sources until Tuesday.

 The entries are each 1 KB, and there is one for each pmap. Each consists
 of 64 16-byte SLBE/SLBV pairs. These buffers are just a carbon copy of
 what should be in the SLB after a context switch to that map.

But if this is for the flih, the esid that was faulted on won't be in
that struct, right?  Aren't you trying to look up in some table to
load an slb entry?

 V=R space is rather limited (well, depending on a lot of factors; for
 AIX on Power5 and later the hypervisor only gave us 128M, though for
 ppc64 on a Mac G4 I assume all of memory can be mapped V=R if desired)
 so it was best to find a non V=R solution if possible.  Turning on
 translation in the flih after some setup and recursion stopping is one
 of the easier ways, and also has the advantage of not needing to
 either have separate code or macro access to data structures used in
 both V and R modes.

 On the PS3 (the target in this case), the hypervisor also limits us to
 128 MB. The one and only kernel data structure that needs to be used in
 this mode is this SLB cache object, so I was hoping for a simple
 solution to just put them all in the real-mode accessible region.

Well, I assume if you're willing to use 4k then it should't be hard to
allocate a whole page in the V=R region.  Perhaps other useful data
for the process could be added to this page?

Admittedly, this is a bit of a digression.  The internals of UMA
always leave me confused, so I try to avoid thinking about it. :-)

IIRC the memory from vm_phys_alloc_contig() can be released like any
other page; the interface should just be fetching a specific page.
How far off is the page wire 

Re: sched_pin() versus PCPU_GET

2010-08-08 Thread mdf
On Sun, Aug 8, 2010 at 2:43 PM, Attilio Rao atti...@freebsd.org wrote:
 2010/8/4  m...@freebsd.org:
 On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
 On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
 On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
  We've seen a few instances at work where witness_warn() in ast()
  indicates the sched lock is still held, but the place it claims it was
  held by is in fact sometimes not possible to keep the lock, like:
 
      thread_lock(td);
      td-td_flags = ~TDF_SELECT;
      thread_unlock(td);
 
  What I was wondering is, even though the assembly I see in objdump -S
  for witness_warn has the increment of td_pinned before the PCPU_GET:
 
  802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
  802db217:   00 00
  802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
  802db226:   00 00
      if (lock_list != NULL  lock_list-ll_count != 0) {
  802db228:   48 85 c0                test   %rax,%rax
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
  802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
      if (lock_list != NULL  lock_list-ll_count != 0) {
  802db239:   0f 84 ff 00 00 00       je     802db33e
  witness_warn+0x30e
  802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d
 
  is it possible for the hardware to do any re-ordering here?
 
  The reason I'm suspicious is not just that the code doesn't have a
  lock leak at the indicated point, but in one instance I can see in the
  dump that the lock_list local from witness_warn is from the pcpu
  structure for CPU 0 (and I was warned about sched lock 0), but the
  thread id in panic_cpu is 2.  So clearly the thread was being migrated
  right around panic time.
 
  This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
  of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
 
  So... do we need some kind of barrier in the code for sched_pin() for
  it to really do what it claims?  Could the hardware have re-ordered
  the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
  increment?

 Hmmm, I think it might be able to because they refer to different 
 locations.

 Note this rule in section 8.2.2 of Volume 3A:

   • Reads may be reordered with older writes to different locations but not
     with older writes to the same location.

 It is certainly true that sparc64 could reorder with RMO.  I believe ia64
 could reorder as well.  Since sched_pin/unpin are frequently used to 
 provide
 this sort of synchronization, we could use memory barriers in pin/unpin
 like so:

 sched_pin()
 {
       td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1;
 }

 sched_unpin()
 {
       atomic_store_rel_int(td-td_pinned, td-td_pinned - 1);
 }

 We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but 
 they
 are slightly more heavyweight, though it would be more clear what is 
 happening
 I think.

 However, to actually get a race you'd have to have an interrupt fire and
 migrate you so that the speculative read was from the other CPU.  However, I
 don't think the speculative read would be preserved in that case.  The CPU
 has to return to a specific PC when it returns from the interrupt and it has
 no way of storing the state for what speculative reordering it might be
 doing, so presumably it is thrown away?  I suppose it is possible that it
 actually retires both instructions (but reordered) and then returns to the 
 PC
 value after the read of listlocks after the interrupt.  However, in that 
 case
 the scheduler would not migrate as it would see td_pinned != 0.  To get the
 race you have to have the interrupt take effect prior to modifying 
 td_pinned,
 so I think the processor would have to discard the reordered read of
 listlocks so it could safely resume execution at the 'incl' instruction.

 The other nit there on x86 at least is that the incl instruction is doing
 both a read and a write and another rule in the section 8.2.2 is this:

  • Reads are not reordered with other reads.

 That would seem to prevent the read of listlocks from passing the read of
 td_pinned in the incl instruction on x86.

 I wonder how that's interpreted in the microcode, though?  I.e. if the
 incr instruction decodes to load, add, 

Re: sched_pin() versus PCPU_GET

2010-08-05 Thread mdf
On Wed, Aug 4, 2010 at 11:55 AM, John Baldwin j...@freebsd.org wrote:
 On Wednesday, August 04, 2010 12:20:31 pm m...@freebsd.org wrote:
 On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote:
  On Tuesday, August 03, 2010 9:46:16 pm m...@freebsd.org wrote:
  On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
   On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
   On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
We've seen a few instances at work where witness_warn() in ast()
indicates the sched lock is still held, but the place it claims it 
was
held by is in fact sometimes not possible to keep the lock, like:
   
    thread_lock(td);
    td-td_flags = ~TDF_SELECT;
    thread_unlock(td);
   
What I was wondering is, even though the assembly I see in objdump -S
for witness_warn has the increment of td_pinned before the PCPU_GET:
   
802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
802db217:   00 00
802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
     * Pin the thread in order to avoid problems with thread 
migration.
     * Once that all verifies are passed about spinlocks ownership,
     * the thread is in a safe path and it can be unpinned.
     */
    sched_pin();
    lock_list = PCPU_GET(spinlocks);
802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
802db226:   00 00
    if (lock_list != NULL  lock_list-ll_count != 0) {
802db228:   48 85 c0                test   %rax,%rax
     * Pin the thread in order to avoid problems with thread 
migration.
     * Once that all verifies are passed about spinlocks ownership,
     * the thread is in a safe path and it can be unpinned.
     */
    sched_pin();
    lock_list = PCPU_GET(spinlocks);
802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
    if (lock_list != NULL  lock_list-ll_count != 0) {
802db239:   0f 84 ff 00 00 00       je     802db33e
witness_warn+0x30e
802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d
   
is it possible for the hardware to do any re-ordering here?
   
The reason I'm suspicious is not just that the code doesn't have a
lock leak at the indicated point, but in one instance I can see in 
the
dump that the lock_list local from witness_warn is from the pcpu
structure for CPU 0 (and I was warned about sched lock 0), but the
thread id in panic_cpu is 2.  So clearly the thread was being 
migrated
right around panic time.
   
This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
of hardware; it's a 4-way Intel chip from about 3 or 4 years ago 
IIRC.
   
So... do we need some kind of barrier in the code for sched_pin() for
it to really do what it claims?  Could the hardware have re-ordered
the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
increment?
  
   Hmmm, I think it might be able to because they refer to different 
   locations.
  
   Note this rule in section 8.2.2 of Volume 3A:
  
     • Reads may be reordered with older writes to different locations 
   but not
       with older writes to the same location.
  
   It is certainly true that sparc64 could reorder with RMO.  I believe 
   ia64
   could reorder as well.  Since sched_pin/unpin are frequently used to 
   provide
   this sort of synchronization, we could use memory barriers in pin/unpin
   like so:
  
   sched_pin()
   {
         td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1;
   }
  
   sched_unpin()
   {
         atomic_store_rel_int(td-td_pinned, td-td_pinned - 1);
   }
  
   We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), 
   but they
   are slightly more heavyweight, though it would be more clear what is 
   happening
   I think.
  
   However, to actually get a race you'd have to have an interrupt fire and
   migrate you so that the speculative read was from the other CPU.  
   However, I
   don't think the speculative read would be preserved in that case.  The 
   CPU
   has to return to a specific PC when it returns from the interrupt and 
   it has
   no way of storing the state for what speculative reordering it might be
   doing, so presumably it is thrown away?  I suppose it is possible that 
   it
   actually retires both instructions (but reordered) and then returns to 
   the PC
   value after the read of listlocks after the interrupt.  However, in 
   that case
   the scheduler would not migrate as it would see td_pinned != 0.  To get 
   the
   race you have to have the interrupt take effect prior to modifying 
   td_pinned,
   so I think the processor would have to discard the reordered read of
   listlocks so it could safely resume execution at the 'incl' 

Re: sched_pin() versus PCPU_GET

2010-08-05 Thread mdf
On Wed, Aug 4, 2010 at 9:20 AM,  m...@freebsd.org wrote:
 On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote:
 Actually, I would beg to differ in that case.  If PCPU_GET(spinlocks)
 returns non-NULL, then it means that you hold a spin lock,

 ll_count is 0 for the correct pc_spinlocks and non-zero for the
 wrong one, though.  So I think it can be non-NULL but the current
 thread/CPU doesn't hold a spinlock.

 I don't believe we have any code in the NMI handler.  I'm on vacation
 today so I'll check tomorrow.

I checked and ipi_nmi_handler() doesn't appear to have any local
changes.  I assume that's where I should look?

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sched_pin() versus PCPU_GET

2010-08-05 Thread mdf
 (gdb) p panic_cpu
 $9 = 2
 (gdb) p dumptid
 $12 = 100751
 (gdb) p cpuhead.slh_first-pc_allcpu.sle_next-pc_curthread-td_tid
 $14 = 100751

 (gdb) p *cpuhead.slh_first-pc_allcpu.sle_next
 $6 = {
   pc_curthread = 0xff00716d6960,
   pc_cpuid = 2,
   pc_spinlocks = 0x80803198,

 (gdb) p lock_list
 $2 = (struct lock_list_entry *) 0x80803fb0

 (gdb) p *cpuhead.slh_first-pc_allcpu.sle_next-pc_allcpu.sle_next-
pc_allcpu.sle_next
 $8 = {
   pc_curthread = 0xff0005479960,
   pc_cpuid = 0,
   pc_spinlocks = 0x80803fb0,

 I.e. we're dumping on CPU 2, but the lock_list pointer that was saved
 in the dump matches that of CPU 0.

 Can you print out the tid's for the two curthreads?  It's not impossible that
 the thread migrated after calling panic.  In fact we force threads to CPU 0
 during shutdown.

dumptid matches the pc_curthread for CPU 2 and is printed above.

The lock_list local variable matches the PCPU for CPU 0, which has tid:

(gdb) p 
cpuhead.slh_first-pc_allcpu.sle_next-pc_allcpu.sle_next-pc_allcpu.sle_next-pc_curthread-td_tid
$2 = 15

(gdb) p 
cpuhead.slh_first-pc_allcpu.sle_next-pc_allcpu.sle_next-pc_allcpu.sle_next-pc_curthread-td_proc-p_comm
$3 = idle: cpu0\000\000\000\000\000\000\000\000\000

Note that lock_list-ll_count is now 0, but of course wasn't when we
panic'd.  Also, the panic message showed exclusive spin mutex sched
lock 0 (sched lock) r = 0 (0x806cf640) locked @
/build/mnt/src/sys/kern/sys_generic.c:826; i.e. the lock was for CPU
0 as well.  If we truly were returning to user space with that lock
held it would still be held and we'd still be on CPU 0.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sched_pin() versus PCPU_GET

2010-08-04 Thread mdf
On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin j...@freebsd.org wrote:
 On Tuesday, August 03, 2010 9:46:16 pm m...@freebsd.org wrote:
 On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
  On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
  On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
   We've seen a few instances at work where witness_warn() in ast()
   indicates the sched lock is still held, but the place it claims it was
   held by is in fact sometimes not possible to keep the lock, like:
  
       thread_lock(td);
       td-td_flags = ~TDF_SELECT;
       thread_unlock(td);
  
   What I was wondering is, even though the assembly I see in objdump -S
   for witness_warn has the increment of td_pinned before the PCPU_GET:
  
   802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
   802db217:   00 00
   802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
        * Pin the thread in order to avoid problems with thread migration.
        * Once that all verifies are passed about spinlocks ownership,
        * the thread is in a safe path and it can be unpinned.
        */
       sched_pin();
       lock_list = PCPU_GET(spinlocks);
   802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
   802db226:   00 00
       if (lock_list != NULL  lock_list-ll_count != 0) {
   802db228:   48 85 c0                test   %rax,%rax
        * Pin the thread in order to avoid problems with thread migration.
        * Once that all verifies are passed about spinlocks ownership,
        * the thread is in a safe path and it can be unpinned.
        */
       sched_pin();
       lock_list = PCPU_GET(spinlocks);
   802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
   802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
       if (lock_list != NULL  lock_list-ll_count != 0) {
   802db239:   0f 84 ff 00 00 00       je     802db33e
   witness_warn+0x30e
   802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d
  
   is it possible for the hardware to do any re-ordering here?
  
   The reason I'm suspicious is not just that the code doesn't have a
   lock leak at the indicated point, but in one instance I can see in the
   dump that the lock_list local from witness_warn is from the pcpu
   structure for CPU 0 (and I was warned about sched lock 0), but the
   thread id in panic_cpu is 2.  So clearly the thread was being migrated
   right around panic time.
  
   This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
   of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
  
   So... do we need some kind of barrier in the code for sched_pin() for
   it to really do what it claims?  Could the hardware have re-ordered
   the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
   increment?
 
  Hmmm, I think it might be able to because they refer to different 
  locations.
 
  Note this rule in section 8.2.2 of Volume 3A:
 
    • Reads may be reordered with older writes to different locations but 
  not
      with older writes to the same location.
 
  It is certainly true that sparc64 could reorder with RMO.  I believe ia64
  could reorder as well.  Since sched_pin/unpin are frequently used to 
  provide
  this sort of synchronization, we could use memory barriers in pin/unpin
  like so:
 
  sched_pin()
  {
        td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1;
  }
 
  sched_unpin()
  {
        atomic_store_rel_int(td-td_pinned, td-td_pinned - 1);
  }
 
  We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but 
  they
  are slightly more heavyweight, though it would be more clear what is 
  happening
  I think.
 
  However, to actually get a race you'd have to have an interrupt fire and
  migrate you so that the speculative read was from the other CPU.  However, 
  I
  don't think the speculative read would be preserved in that case.  The CPU
  has to return to a specific PC when it returns from the interrupt and it 
  has
  no way of storing the state for what speculative reordering it might be
  doing, so presumably it is thrown away?  I suppose it is possible that it
  actually retires both instructions (but reordered) and then returns to the 
  PC
  value after the read of listlocks after the interrupt.  However, in that 
  case
  the scheduler would not migrate as it would see td_pinned != 0.  To get the
  race you have to have the interrupt take effect prior to modifying 
  td_pinned,
  so I think the processor would have to discard the reordered read of
  listlocks so it could safely resume execution at the 'incl' instruction.
 
  The other nit there on x86 at least is that the incl instruction is doing
  both a read and a write and another rule in the section 8.2.2 is this:
 
   • Reads are not reordered with other reads.
 
  That would seem to prevent the read of listlocks from passing the read of
  

Re: sched_pin() versus PCPU_GET

2010-08-03 Thread mdf
On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin j...@freebsd.org wrote:
 On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
 On Thursday, July 29, 2010 7:39:02 pm m...@freebsd.org wrote:
  We've seen a few instances at work where witness_warn() in ast()
  indicates the sched lock is still held, but the place it claims it was
  held by is in fact sometimes not possible to keep the lock, like:
 
      thread_lock(td);
      td-td_flags = ~TDF_SELECT;
      thread_unlock(td);
 
  What I was wondering is, even though the assembly I see in objdump -S
  for witness_warn has the increment of td_pinned before the PCPU_GET:
 
  802db210:   65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
  802db217:   00 00
  802db219:   ff 83 04 01 00 00       incl   0x104(%rbx)
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db21f:   65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
  802db226:   00 00
      if (lock_list != NULL  lock_list-ll_count != 0) {
  802db228:   48 85 c0                test   %rax,%rax
       * Pin the thread in order to avoid problems with thread migration.
       * Once that all verifies are passed about spinlocks ownership,
       * the thread is in a safe path and it can be unpinned.
       */
      sched_pin();
      lock_list = PCPU_GET(spinlocks);
  802db22b:   48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
  802db232:   48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
      if (lock_list != NULL  lock_list-ll_count != 0) {
  802db239:   0f 84 ff 00 00 00       je     802db33e
  witness_warn+0x30e
  802db23f:   44 8b 60 50             mov    0x50(%rax),%r12d
 
  is it possible for the hardware to do any re-ordering here?
 
  The reason I'm suspicious is not just that the code doesn't have a
  lock leak at the indicated point, but in one instance I can see in the
  dump that the lock_list local from witness_warn is from the pcpu
  structure for CPU 0 (and I was warned about sched lock 0), but the
  thread id in panic_cpu is 2.  So clearly the thread was being migrated
  right around panic time.
 
  This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
  of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
 
  So... do we need some kind of barrier in the code for sched_pin() for
  it to really do what it claims?  Could the hardware have re-ordered
  the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
  increment?

 Hmmm, I think it might be able to because they refer to different locations.

 Note this rule in section 8.2.2 of Volume 3A:

   • Reads may be reordered with older writes to different locations but not
     with older writes to the same location.

 It is certainly true that sparc64 could reorder with RMO.  I believe ia64
 could reorder as well.  Since sched_pin/unpin are frequently used to provide
 this sort of synchronization, we could use memory barriers in pin/unpin
 like so:

 sched_pin()
 {
       td-td_pinned = atomic_load_acq_int(td-td_pinned) + 1;
 }

 sched_unpin()
 {
       atomic_store_rel_int(td-td_pinned, td-td_pinned - 1);
 }

 We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but 
 they
 are slightly more heavyweight, though it would be more clear what is 
 happening
 I think.

 However, to actually get a race you'd have to have an interrupt fire and
 migrate you so that the speculative read was from the other CPU.  However, I
 don't think the speculative read would be preserved in that case.  The CPU
 has to return to a specific PC when it returns from the interrupt and it has
 no way of storing the state for what speculative reordering it might be
 doing, so presumably it is thrown away?  I suppose it is possible that it
 actually retires both instructions (but reordered) and then returns to the PC
 value after the read of listlocks after the interrupt.  However, in that case
 the scheduler would not migrate as it would see td_pinned != 0.  To get the
 race you have to have the interrupt take effect prior to modifying td_pinned,
 so I think the processor would have to discard the reordered read of
 listlocks so it could safely resume execution at the 'incl' instruction.

 The other nit there on x86 at least is that the incl instruction is doing
 both a read and a write and another rule in the section 8.2.2 is this:

  • Reads are not reordered with other reads.

 That would seem to prevent the read of listlocks from passing the read of
 td_pinned in the incl instruction on x86.

I wonder how that's interpreted in the microcode, though?  I.e. if the
incr instruction decodes to load, add, store, does the h/w allow the
later reads to pass the final store?

I added the following:

sched_pin();
   

Re: sched_pin() versus PCPU_GET

2010-07-30 Thread mdf
2010/7/30 Kostik Belousov kostik...@gmail.com:
 On Thu, Jul 29, 2010 at 04:57:25PM -0700, m...@freebsd.org wrote:
 On Thu, Jul 29, 2010 at 4:39 PM,  m...@freebsd.org wrote:
  We've seen a few instances at work where witness_warn() in ast()
  indicates the sched lock is still held, but the place it claims it was
  held by is in fact sometimes not possible to keep the lock, like:
 
         thread_lock(td);
         td-td_flags = ~TDF_SELECT;
         thread_unlock(td);
 
  What I was wondering is, even though the assembly I see in objdump -S
  for witness_warn has the increment of td_pinned before the PCPU_GET:
 
  802db210:       65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
  802db217:       00 00
  802db219:       ff 83 04 01 00 00       incl   0x104(%rbx)
          * Pin the thread in order to avoid problems with thread migration.
          * Once that all verifies are passed about spinlocks ownership,
          * the thread is in a safe path and it can be unpinned.
          */
         sched_pin();
         lock_list = PCPU_GET(spinlocks);
  802db21f:       65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
  802db226:       00 00
         if (lock_list != NULL  lock_list-ll_count != 0) {
  802db228:       48 85 c0                test   %rax,%rax
          * Pin the thread in order to avoid problems with thread migration.
          * Once that all verifies are passed about spinlocks ownership,
          * the thread is in a safe path and it can be unpinned.
          */
         sched_pin();
         lock_list = PCPU_GET(spinlocks);
  802db22b:       48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
  802db232:       48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
         if (lock_list != NULL  lock_list-ll_count != 0) {
  802db239:       0f 84 ff 00 00 00       je     802db33e
  witness_warn+0x30e
  802db23f:       44 8b 60 50             mov    0x50(%rax),%r12d
 
  is it possible for the hardware to do any re-ordering here?
 
  The reason I'm suspicious is not just that the code doesn't have a
  lock leak at the indicated point, but in one instance I can see in the
  dump that the lock_list local from witness_warn is from the pcpu
  structure for CPU 0 (and I was warned about sched lock 0), but the
  thread id in panic_cpu is 2.  So clearly the thread was being migrated
  right around panic time.
 
  This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
  of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
 
  So... do we need some kind of barrier in the code for sched_pin() for
  it to really do what it claims?  Could the hardware have re-ordered
  the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
  increment?

 So after some research, the answer I'm getting is maybe.  What I'm
 concerned about is whether the h/w reordered the read of PCPU_GET in
 front of the previous store to increment td_pinned.  While not an
 ultimate authority,
 http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
 implies that stores can be reordered after loads for both Intel and
 amd64 chips, which would I believe account for the behavior seen here.

 Am I right that you suggest that in the sequence
        mov     %gs:0x0,%rbx      [1]
        incl    0x104(%rbx)       [2]
        mov     %gs:0x48,%rax     [3]
 interrupt and preemption happen between points [2] and [3] ?
 And the %rax value after the thread was put back onto the (different) new
 CPU and executed [3] was still from the old cpu' pcpu area ?

Right, but I'm also asking if it's possible the hardware executed the
instructions as:

        mov     %gs:0x0,%rbx      [1]
        mov     %gs:0x48,%rax     [3]
        incl    0x104(%rbx)       [2]

On PowerPC this is definitely possible and I'd use an isync to prevent
the re-ordering.  I haven't been able to confirm that Intel/AMD
present such a strict ordering that no barrier is needed.

It's admittedly a very tight window, and we've only seen it twice, but
I have no other way to explain the symptom.  Unfortunately in the dump
gdb shows both %rax and %gs as 0, so I can't confirm that they had a
value I'd expect from another CPU.  The only thing I do have is
panic_cpu being different than the CPU at the time of
PCPU_GET(spinlock), but of course there's definitely a window there.

 I do not believe this is possible. CPU is always self-consistent. Context
 switch from the thread can only occur on the return from interrupt
 handler, in critical_exit() or such. This code is executing on the
 same processor, and thus should already see the effect of [2], that
 would prevent context switch.

Right, but if the hardware allowed reads to pass writes, then %rax
would have an incorrect value which would be saved at interrupt time,
and restored on another processor.

I can add a few sanity asserts to try to prove this one way or another
and hope they don't mess with the timing; 

sched_pin() versus PCPU_GET

2010-07-29 Thread mdf
We've seen a few instances at work where witness_warn() in ast()
indicates the sched lock is still held, but the place it claims it was
held by is in fact sometimes not possible to keep the lock, like:

thread_lock(td);
td-td_flags = ~TDF_SELECT;
thread_unlock(td);

What I was wondering is, even though the assembly I see in objdump -S
for witness_warn has the increment of td_pinned before the PCPU_GET:

802db210:   65 48 8b 1c 25 00 00mov%gs:0x0,%rbx
802db217:   00 00
802db219:   ff 83 04 01 00 00   incl   0x104(%rbx)
 * Pin the thread in order to avoid problems with thread migration.
 * Once that all verifies are passed about spinlocks ownership,
 * the thread is in a safe path and it can be unpinned.
 */
sched_pin();
lock_list = PCPU_GET(spinlocks);
802db21f:   65 48 8b 04 25 48 00mov%gs:0x48,%rax
802db226:   00 00
if (lock_list != NULL  lock_list-ll_count != 0) {
802db228:   48 85 c0test   %rax,%rax
 * Pin the thread in order to avoid problems with thread migration.
 * Once that all verifies are passed about spinlocks ownership,
 * the thread is in a safe path and it can be unpinned.
 */
sched_pin();
lock_list = PCPU_GET(spinlocks);
802db22b:   48 89 85 f0 fe ff ffmov%rax,-0x110(%rbp)
802db232:   48 89 85 f8 fe ff ffmov%rax,-0x108(%rbp)
if (lock_list != NULL  lock_list-ll_count != 0) {
802db239:   0f 84 ff 00 00 00   je 802db33e
witness_warn+0x30e
802db23f:   44 8b 60 50 mov0x50(%rax),%r12d

is it possible for the hardware to do any re-ordering here?

The reason I'm suspicious is not just that the code doesn't have a
lock leak at the indicated point, but in one instance I can see in the
dump that the lock_list local from witness_warn is from the pcpu
structure for CPU 0 (and I was warned about sched lock 0), but the
thread id in panic_cpu is 2.  So clearly the thread was being migrated
right around panic time.

This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.

So... do we need some kind of barrier in the code for sched_pin() for
it to really do what it claims?  Could the hardware have re-ordered
the mov%gs:0x48,%rax PCPU_GET to before the sched_pin()
increment?

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sched_pin() versus PCPU_GET

2010-07-29 Thread mdf
On Thu, Jul 29, 2010 at 4:39 PM,  m...@freebsd.org wrote:
 We've seen a few instances at work where witness_warn() in ast()
 indicates the sched lock is still held, but the place it claims it was
 held by is in fact sometimes not possible to keep the lock, like:

        thread_lock(td);
        td-td_flags = ~TDF_SELECT;
        thread_unlock(td);

 What I was wondering is, even though the assembly I see in objdump -S
 for witness_warn has the increment of td_pinned before the PCPU_GET:

 802db210:       65 48 8b 1c 25 00 00    mov    %gs:0x0,%rbx
 802db217:       00 00
 802db219:       ff 83 04 01 00 00       incl   0x104(%rbx)
         * Pin the thread in order to avoid problems with thread migration.
         * Once that all verifies are passed about spinlocks ownership,
         * the thread is in a safe path and it can be unpinned.
         */
        sched_pin();
        lock_list = PCPU_GET(spinlocks);
 802db21f:       65 48 8b 04 25 48 00    mov    %gs:0x48,%rax
 802db226:       00 00
        if (lock_list != NULL  lock_list-ll_count != 0) {
 802db228:       48 85 c0                test   %rax,%rax
         * Pin the thread in order to avoid problems with thread migration.
         * Once that all verifies are passed about spinlocks ownership,
         * the thread is in a safe path and it can be unpinned.
         */
        sched_pin();
        lock_list = PCPU_GET(spinlocks);
 802db22b:       48 89 85 f0 fe ff ff    mov    %rax,-0x110(%rbp)
 802db232:       48 89 85 f8 fe ff ff    mov    %rax,-0x108(%rbp)
        if (lock_list != NULL  lock_list-ll_count != 0) {
 802db239:       0f 84 ff 00 00 00       je     802db33e
 witness_warn+0x30e
 802db23f:       44 8b 60 50             mov    0x50(%rax),%r12d

 is it possible for the hardware to do any re-ordering here?

 The reason I'm suspicious is not just that the code doesn't have a
 lock leak at the indicated point, but in one instance I can see in the
 dump that the lock_list local from witness_warn is from the pcpu
 structure for CPU 0 (and I was warned about sched lock 0), but the
 thread id in panic_cpu is 2.  So clearly the thread was being migrated
 right around panic time.

 This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
 of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.

 So... do we need some kind of barrier in the code for sched_pin() for
 it to really do what it claims?  Could the hardware have re-ordered
 the mov    %gs:0x48,%rax PCPU_GET to before the sched_pin()
 increment?

So after some research, the answer I'm getting is maybe.  What I'm
concerned about is whether the h/w reordered the read of PCPU_GET in
front of the previous store to increment td_pinned.  While not an
ultimate authority,
http://en.wikipedia.org/wiki/Memory_ordering#In_SMP_microprocessor_systems
implies that stores can be reordered after loads for both Intel and
amd64 chips, which would I believe account for the behavior seen here.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: strange problem with int64_t variables

2010-07-11 Thread mdf
On Sun, Jul 11, 2010 at 7:58 AM, Gabor Kovesdan ga...@freebsd.org wrote:
 Em 2010.07.11. 16:54, Dimitry Andric escreveu:

 On 2010-07-11 16:46, Gabor Kovesdan wrote:


 I have two int64_t variables in kernel code, first is stored internally
 and the second one is passed from a syscall argument. When I print them
 with printf %lld modifier, the internal one behaves correctly but the
 other one I pass from a syscall has a corrupted value. If I pass 1, it
 prints out 3735348794091372545. I'm not doing anything special with it
 just reading it out from the struct that was generated with make sysent.


 Since 3735348794091372545 is 0x33d69ff1, it looks like the upper
 word got corrupted somehow.  Maybe some part of it got non-atomically
 assigned?  Maybe the wrong word was read?  It is hard to tell without
 code...  :)


 Userland syscall calling:

 killjob(getjid(), SIGINT);  //getjid() returns 1 this case, whose type is
 jid_t

 Kernel code:

 int
 killjob(struct thread *td, struct killjob_args *uap)
 {
        struct jobentry *jp, *jtmp;
        struct procentry *pp, *ptmp;

        JOBLIST_bWLOCK;
        LIST_FOREACH_SAFE(jp,irix_joblist, entries, jtmp) {
                if (jp-jid == uap-jid) {
                        /* never reached code, comparison always fail because
 of corrupted value */
                }
        }
        JOBLIST_WUNLOCK;

        /* not such job */
        td-td_retval[0] = -1;
        return (ENOJOB);
 }

What does struct killjob_args look like?

Is this syscall defined in a module, or an addition to the kernel's
syscalls.master?

Is the user-space 32-bit or 64-bit?  What about the kernel?

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: loader prompt: list / on other device

2010-06-23 Thread mdf
On Wed, Jun 23, 2010 at 9:03 AM,  rank1see...@gmail.com wrote:
 I've escaped to loader prompt:
 Current device is disk0s3a, from which this loader is running.

 My USB stick is device1 and device1s2a is UFS /, on which I would like to
 reach some file or simply list directory.

IIRC, there is no way to do this (but my memory is based on stable/7).
 I extended some Isilon stuff to add a 'bootdev' command to the loader
that would allow for selecting alternate boot disks.  I can see about
dusting that off and making a patch against CURRENT, if there is no
current way to do what you ask.

Cheers,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org