Re: Call fo comments - raising vfs.ufs.dirhash_reclaimage?

2013-10-10 Thread John Baldwin

On Monday, October 07, 2013 1:34:24 pm Davide Italiano wrote:
> > What would perhaps be better than a hardcoded reclaim age would be to use
> > an LRU-type approach and perhaps set a target percent to reclaim.  That is,
> > suppose you were to reclaim the oldest 10% of hashes on each lowmem call
> > (and make the '10%' the tunable value).  Then you will always make some
> > amount of progress in a low memory situation (and if the situation remains
> > dire you will eventually empty the entire cache), but the effective maximum
> > age will be more dynamic.  Right now if you haven't touched UFS in 5 seconds
> > it throws the entire thing out on the first lowmem event.  The LRU-approach
> > would only throw the oldest 10% out on the first call, but eventually throw
> > it all out if the situation remains dire.
> >
> > --
> > John Baldwin
> > ___
> > freebsd...@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> > To unsubscribe, send any mail to "freebsd-fs-unsubscr...@freebsd.org"
> 
> I liked your idea more than what's available in HEAD right now and I
> implemented it.
> http://people.freebsd.org/~davide/review/ufs_direclaimage.diff
> I was unsure what kind of heuristic I should choose to select which
> (10% of) entries should be evicted so I just removed the first 10%
> ones from the head of the ufs_dirhash list (which should be the
> oldest).
> The code keeps rescanning the cache until 10% (or, the percentage set
> via SYSCTL) of the entries are freed, but probably we can discuss if
> this limit could be relaxed and just do a single scan over the list.
> Unfortunately I haven't a testcase to prove the effectiveness (or
> non-effectiveness) of the approach but I think either Ivan or Peter
> could be able to give it a spin, maybe.

I think this looks good.  One cosmetic nit is that I think this:

	if (!try_lock())
		continue;
	else
		memfreed += ufsdirhash_destroy();

looks a bit odd.  I would either drop the else (which the old code did in its
failsafe case) or just do this:

	if (try_lock())
		memfreed += ufsdirhash_destroy();

-- 
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"

Re: What's the state of AF-4Kn support?

2013-10-10 Thread John Baldwin

On Monday, September 23, 2013 10:58:19 am Ravi Pokala wrote:
> -Original Message-
> From: Jia-Shiun Li 
> Date: Sunday, September 22, 2013 11:22 PM
> To: Ravi Pokala 
> Cc: "freebsd-hardw...@freebsd.org" ,
> 
> Subject: Re: What's the state of AF-4Kn support?
> 
> >On Wed, Sep 18, 2013 at 10:49 PM, Ravi Pokala  wrote:
> >>
> >>...
> >
> >CC -hackers.
> >
> >Thanks for the clarification. Are there any 4Kn HDDs shipping now? I am
> >not aware of any.
> 
> Good question. I had the impression that some currently shipping drives
> were AF-4Kn, but spot-checking some of the drives listed in
> 
> src/cam/ata/ata_da.c::ada_quirk_table[]
> 
> against their datasheets, suggests that they're AF-512e. So, their being
> flagged w/ ADA_Q_4K is "just" a performance optimization.
> 
> >BTW I believe UFS and ZFS are properly designed for 4K sectors, but FreeBSD
> >needs some ecosystem connections to get samples early to test,
> >incorporate support and validate it. Otherwise we will need to wait until
> >such drives appear on the market and someone gets caught by some kind of bug.
> 
> Yeah, based on my reading of the code, it looks like the ATACAM layer and
> higher (GEOM, filesystems) take the physical block size into account. That
> just leaves the bootstrap code. Now that I've taken a second look, it
> seems as though at least 'pmbr' only works in terms of 512 bytes. :-(

Yes, the BIOS calls have always only used 512 byte sectors.  There would
have to be an updated spec for those, and it would be a bit of a PITA to
use.  I suspect the "right" answer for this on x86 is UEFI.
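
To make the 512-byte assumption concrete, here is a minimal userland sketch
(not the pmbr code; both helper names are made up for illustration).  The
byte offset of a block depends on the logical sector size, so boot code
that hardcodes 512 would look for the GPT header, which lives at LBA 1, at
byte 512 instead of byte 4096 on a 4Kn drive.  The second helper shows the
alignment check that matters on a 512e drive.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of a logical block on the media. */
static uint64_t
lba_to_offset(uint64_t lba, uint32_t sector_size)
{
	assert(sector_size >= 512);
	return (lba * (uint64_t)sector_size);
}

/*
 * Hypothetical helper for a 512e drive: an LBA counted in 512-byte units
 * starts on a 4096-byte physical sector only if it is a multiple of 8.
 */
static int
is_4k_aligned(uint64_t lba512)
{
	return (lba512 % 8 == 0);
}
```

A partition starting at the traditional sector 63 fails the second check,
which is the case the ADA_Q_4K quirk is meant to help avoid.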

-- 
John Baldwin

Re: [RFC][CFT] GEOM direct dispatch and fine-grained CAM locking

2013-10-07 Thread John Baldwin

On Sunday, October 06, 2013 3:30:42 am Alexander Motin wrote:
> On 02.10.2013 20:30, John Baldwin wrote:
> > On Saturday, September 07, 2013 2:32:45 am Alexander Motin wrote:
> >> On 07.09.2013 02:02, Jeremie Le Hen wrote:
> >>> On Fri, Sep 06, 2013 at 11:29:11AM +0300, Alexander Motin wrote:
> >>>> On 06.09.2013 11:06, Jeremie Le Hen wrote:
> >>>>> On Fri, Sep 06, 2013 at 12:46:27AM +0200, Olivier Cochard-Labbé wrote:
> >>>>>> On Thu, Sep 5, 2013 at 11:38 PM, Alexander Motin wrote:
> >>>>>>> I've found and fixed possible double request completion, that could cause
> >>>>>>> such symptoms if happened. Updated patch located as usual:
> >>>>>>> http://people.freebsd.org/~mav/camlock_patches/camlock_20130905.patch
> >>>>>>>
> >>>>> With this new one I cannot boot any more (I also updated the source
> >>>>> tree).  This is a hand transcripted version:
> >>>>>
> >>>>> Trying to mount root from zfs:zroot/root []...
> >>>>> panic: Batch flag already set
> >>>>> cpuid = 1
> >>>>> KDB: stack backtrace:
> >>>>> db_trace_self_wrapper()
> >>>>> kdb_backtrace()
> >>>>> vpanic()
> >>>>> kassert_panic()
> >>>>> xpt_batch_start()
> >>>>> ata_interrupt()
> >>>>> softclock_call_cc()
> >>>>> softclock()
> >>>>> ithread_loop()
> >>>>> fork_exit()
> >>>>> fork_trampoline()
> >>>>
> >>>> Thank you for the report. I see my fault. It is probably specific to
> >>>> ata(4) driver only. I've workarounded that in new patch version, but
> >>>> probably that area needs some rethinking.
> >>>>
> >>>> http://people.freebsd.org/~mav/camlock_patches/camlock_20130906.patch
> >>>
> >>> I'm not sure you needed a confirmation, but it boots.  Thanks :).
> >>>
> >>> I didn't quite understand the thread; is direct dispatch enabled for
> >>> amd64?  ISTR you said only i386 but someone else posted the macro for
> >>> amd64.
> >>
> >> Yes, it is enabled for amd64. I've said x86, meaning both i386 and amd64.
> >
> > FYI, I tested mfi with this patch set and mfid worked fine for handling g_up
> > directly:
> >
> > Index: dev/mfi/mfi_disk.c
> > ===
> > --- dev/mfi/mfi_disk.c  (revision 257407)
> > +++ dev/mfi/mfi_disk.c  (working copy)
> > @@ -162,6 +162,7 @@
> >  sc->ld_disk->d_unit = sc->ld_unit;
> >  sc->ld_disk->d_sectorsize = secsize;
> >  sc->ld_disk->d_mediasize = sectors * secsize;
> > +   sc->ld_disk->d_flags = DISKFLAG_DIRECT_COMPLETION;
> >  if (sc->ld_disk->d_mediasize >= (1 * 1024 * 1024)) {
> >  sc->ld_disk->d_fwheads = 255;
> >  sc->ld_disk->d_fwsectors = 63;
> >
> 
> Thank you for the feedback. But looking on mfi driver sources I would 
> say that it calls biodone() from mfi_disk_complete() from cm_complete() 
> method, which is called while holding mfi_io_lock mutex. I guess that if 
> on top of mfi device would be some GEOM class, supporting direct 
> dispatch and sending new requests down on previous request completion 
> (or retrying requests), that could cause recursive mfi_io_lock 
> acquisition. That is exactly the reason why I've added this flag. Maybe
> it is a bit paranoid, but it is better to be safe than sorry.
> 
> Another good reason to drop the lock before calling biodone() would be 
> reducing the lock hold time. Otherwise it may just increase lock 
> congestion there and destroy all benefits of the direct dispatch.

Ah, interesting.  What is your policy for such drivers?  Should they be
left using g_up, or should they drop the lock around biodone() when
completing multiple requests in an interrupt?  Or should they try to batch
them by waiting and doing biodone() at the end after dropping the lock?
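
To illustrate the batching option, here is a rough userland sketch.  It is
not mfi(4) code: struct cmd, io_lock()/io_unlock() and complete() are
hypothetical stand-ins for the driver's command structure, mfi_io_lock and
biodone().  The handler detaches the finished commands while holding the
lock and only completes them once the lock has been dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a driver's command and its I/O lock. */
struct cmd {
	struct cmd *next;
	int done;		/* set by the completion callback */
};

static int io_lock_held;	/* models the I/O lock being owned */
static int completed_with_lock;	/* completions that ran under the lock */

static void io_lock(void)   { io_lock_held = 1; }
static void io_unlock(void) { io_lock_held = 0; }

/* Stands in for biodone(); records whether the lock was still held. */
static void
complete(struct cmd *c)
{
	if (io_lock_held)
		completed_with_lock++;
	c->done = 1;
}

/*
 * Interrupt handler sketch: take the whole list of finished commands
 * under the lock, drop the lock, then complete them one by one.
 * Returns the number of commands completed.
 */
static int
intr_batch(struct cmd **finished)
{
	struct cmd *batch, *c;
	int n = 0;

	io_lock();
	batch = *finished;
	*finished = NULL;
	io_unlock();

	assert(!io_lock_held);	/* completions must run unlocked */
	while ((c = batch) != NULL) {
		batch = c->next;
		c->next = NULL;
		complete(c);
		n++;
	}
	return (n);
}
```

The lock is taken once per interrupt rather than once per request, and a
completion that sends new I/O back down cannot recurse on it.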

-- 
John Baldwin

Re: UFS related panic (daily <-> find)

2013-10-07 Thread John Baldwin

On Wednesday, October 02, 2013 5:40:02 pm rank1see...@gmail.com wrote:
> > > > > Ok, here is another one, same case, just this time under 9.1-RELEASE-p7
> > > > > 
> > > > > ==
> > > > > Fatal trap 12: page fault while in kernel mode
> > > > > fault virtual address   = 0x25
> > > > > fault code  = supervisor read, page not present
> > > > > instruction pointer = 0x20:0xc082c552
> > > > > stack pointer   = 0x28:0xe7eed7a8
> > > > > frame pointer   = 0x28:0xe7eed7ac
> > > > > code segment= base 0x0, limit 0xf, type 0x1b
> > > > > = DPL 0, pres 1, def32 1, gran 1
> > > > > processor eflags= interrupt enabled, resume, IOPL = 0
> > > > > current process = 63645 (find)
> > > > > trap number = 12
> > > > > panic: page fault
> > > > > Uptime: 11h16m47s
> > > > > Physical memory: 1014 MB
> > > > > Dumping 143 MB: 128 112 96 80 64 48 32 16
> > > > > 
> > > > > #6  0xc0898d4c in calltrap () at /usr/src/sys/i386/i386/exception.s:169
> > > > > #7  0xc082c552 in inodedep_find (inodedephd=Variable "inodedephd" is not available.
> > > > > )
> > > > > at /usr/src/sys/ufs/ffs/ffs_softdep.c:2073
> > > > 
> > > > Please go to frame 7 and do 'x/i $rip'.
> > > > 
> > > 
> > > (kgdb) up 7
> > > #7  0xc082c552 in inodedep_find (inodedephd=Variable "inodedephd" is not available.
> > > ) at /usr/src/sys/ufs/ffs/ffs_softdep.c:2073
> > > 2073    /usr/src/sys/ufs/ffs/ffs_softdep.c: No such file or directory.
> > > in /usr/src/sys/ufs/ffs/ffs_softdep.c
> > > (kgdb) x/i $rip
> > > Value can't be converted to integer.
> > 
> > Oh, this is i386, use "$eip" instead of "$rip", so 'x/i $eip' at frame 7.
> 
> 
> (kgdb) x/i $eip
> 0xc082c552:  cmp    %ecx,0x24(%eax)

Ok, so %eax must be 1 (the fault address 0x25 is 0x24 + %eax).  I think you
probably have failing RAM with a stuck bit or some such.
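
The arithmetic behind that conclusion, as a tiny sketch (the helper name is
made up for illustration): for an operand of the form disp(%reg) the CPU
accesses reg + disp, so the register value is the faulting address minus
the displacement.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Recover the base register value from a faulting address and the
 * displacement encoded in the instruction.
 */
static uint32_t
base_from_fault(uint32_t fault_addr, uint32_t disp)
{
	assert(fault_addr >= disp);
	return (fault_addr - disp);
}
```

With the values from this panic, base_from_fault(0x25, 0x24) gives 1,
which is not a plausible kernel pointer.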

-- 
John Baldwin

Re: UFS related panic (daily <-> find)

2013-10-02 Thread John Baldwin

On Wednesday, October 02, 2013 4:32:56 pm rank1see...@gmail.com wrote:
> > > Ok, here is another one, same case, just this time under 9.1-RELEASE-p7
> > > 
> > > ==
> > > Fatal trap 12: page fault while in kernel mode
> > > fault virtual address   = 0x25
> > > fault code  = supervisor read, page not present
> > > instruction pointer = 0x20:0xc082c552
> > > stack pointer   = 0x28:0xe7eed7a8
> > > frame pointer   = 0x28:0xe7eed7ac
> > > code segment= base 0x0, limit 0xf, type 0x1b
> > > = DPL 0, pres 1, def32 1, gran 1
> > > processor eflags= interrupt enabled, resume, IOPL = 0
> > > current process = 63645 (find)
> > > trap number = 12
> > > panic: page fault
> > > Uptime: 11h16m47s
> > > Physical memory: 1014 MB
> > > Dumping 143 MB: 128 112 96 80 64 48 32 16
> > > 
> > > #6  0xc0898d4c in calltrap () at /usr/src/sys/i386/i386/exception.s:169
> > > #7  0xc082c552 in inodedep_find (inodedephd=Variable "inodedephd" is not available.
> > > )
> > > at /usr/src/sys/ufs/ffs/ffs_softdep.c:2073
> > 
> > Please go to frame 7 and do 'x/i $rip'.
> > 
> 
> (kgdb) up 7
> #7  0xc082c552 in inodedep_find (inodedephd=Variable "inodedephd" is not available.
> ) at /usr/src/sys/ufs/ffs/ffs_softdep.c:2073
> 2073    /usr/src/sys/ufs/ffs/ffs_softdep.c: No such file or directory.
> in /usr/src/sys/ufs/ffs/ffs_softdep.c
> (kgdb) x/i $rip
> Value can't be converted to integer.

Oh, this is i386, use "$eip" instead of "$rip", so 'x/i $eip' at frame 7.

-- 
John Baldwin

Re: [RFC][CFT] GEOM direct dispatch and fine-grained CAM locking

2013-10-02 Thread John Baldwin

On Saturday, September 07, 2013 2:32:45 am Alexander Motin wrote:
> On 07.09.2013 02:02, Jeremie Le Hen wrote:
> > On Fri, Sep 06, 2013 at 11:29:11AM +0300, Alexander Motin wrote:
> >> On 06.09.2013 11:06, Jeremie Le Hen wrote:
> >>> On Fri, Sep 06, 2013 at 12:46:27AM +0200, Olivier Cochard-Labbé wrote:
> >>>> On Thu, Sep 5, 2013 at 11:38 PM, Alexander Motin wrote:
> >>>>> I've found and fixed possible double request completion, that could cause
> >>>>> such symptoms if happened. Updated patch located as usual:
> >>>>> http://people.freebsd.org/~mav/camlock_patches/camlock_20130905.patch
> >>>>>
> >>> With this new one I cannot boot any more (I also updated the source
> >>> tree).  This is a hand transcripted version:
> >>>
> >>> Trying to mount root from zfs:zroot/root []...
> >>> panic: Batch flag already set
> >>> cpuid = 1
> >>> KDB: stack backtrace:
> >>> db_trace_self_wrapper()
> >>> kdb_backtrace()
> >>> vpanic()
> >>> kassert_panic()
> >>> xpt_batch_start()
> >>> ata_interrupt()
> >>> softclock_call_cc()
> >>> softclock()
> >>> ithread_loop()
> >>> fork_exit()
> >>> fork_trampoline()
> >>
> >> Thank you for the report. I see my fault. It is probably specific to
> >> ata(4) driver only. I've workarounded that in new patch version, but
> >> probably that area needs some rethinking.
> >>
> >> http://people.freebsd.org/~mav/camlock_patches/camlock_20130906.patch
> >
> > I'm not sure you needed a confirmation, but it boots.  Thanks :).
> >
> > I didn't quite understand the thread; is direct dispatch enabled for
> > amd64?  ISTR you said only i386 but someone else posted the macro for
> > amd64.
> 
> Yes, it is enabled for amd64. I've said x86, meaning both i386 and amd64.

FYI, I tested mfi with this patch set and mfid worked fine for handling g_up
directly:

Index: dev/mfi/mfi_disk.c
===
--- dev/mfi/mfi_disk.c  (revision 257407)
+++ dev/mfi/mfi_disk.c  (working copy)
@@ -162,6 +162,7 @@
 	sc->ld_disk->d_unit = sc->ld_unit;
 	sc->ld_disk->d_sectorsize = secsize;
 	sc->ld_disk->d_mediasize = sectors * secsize;
+	sc->ld_disk->d_flags = DISKFLAG_DIRECT_COMPLETION;
 	if (sc->ld_disk->d_mediasize >= (1 * 1024 * 1024)) {
 		sc->ld_disk->d_fwheads = 255;
 		sc->ld_disk->d_fwsectors = 63;


-- 
John Baldwin

Re: UFS related panic (daily <-> find)

2013-10-02 Thread John Baldwin

On Tuesday, October 01, 2013 9:15:33 pm rank1see...@gmail.com wrote:
> Ok, here is another one, same case, just this time under 9.1-RELEASE-p7
> 
> ==
> Fatal trap 12: page fault while in kernel mode
> fault virtual address   = 0x25
> fault code  = supervisor read, page not present
> instruction pointer = 0x20:0xc082c552
> stack pointer   = 0x28:0xe7eed7a8
> frame pointer   = 0x28:0xe7eed7ac
> code segment= base 0x0, limit 0xf, type 0x1b
> = DPL 0, pres 1, def32 1, gran 1
> processor eflags= interrupt enabled, resume, IOPL = 0
> current process = 63645 (find)
> trap number = 12
> panic: page fault
> Uptime: 11h16m47s
> Physical memory: 1014 MB
> Dumping 143 MB: 128 112 96 80 64 48 32 16
> 
> #6  0xc0898d4c in calltrap () at /usr/src/sys/i386/i386/exception.s:169
> #7  0xc082c552 in inodedep_find (inodedephd=Variable "inodedephd" is not available.
> )
> at /usr/src/sys/ufs/ffs/ffs_softdep.c:2073

Please go to frame 7 and do 'x/i $rip'.

-- 
John Baldwin

Re: [RFC][CFT] GEOM direct dispatch and fine-grained CAM locking

2013-09-04 Thread John Baldwin

On Wednesday, September 04, 2013 10:11:28 am Nathan Whitehorn wrote:
> On 09/04/13 08:20, Ryan Stone wrote:
> > On Wed, Sep 4, 2013 at 8:45 AM, Nathan Whitehorn wrote:
> >> Could you describe what this macro is supposed to do so that we can do the
> >> porting work?
> >> -Nathan
> > #define GET_STACK_USAGE(total, used)
> >
> > GET_STACK_USAGE sets the variable passed in total to the total amount
> > of stack space available to the current thread.  used is set to the
> > amount of stack space currently used (this does not have to have
> > byte-precision).  Netgraph uses this to decide when to stop recursing
> > and instead defer to a work queue (to prevent stack overflow).  I
> > presume that Alexander is using it in a similar way.  It looks like
> > the amd64 version could be ported to other architectures quite easily
> > if you were to account for stacks that grow up and stacks that grow
> > down:
> >
> > http://svnweb.freebsd.org/base/head/sys/amd64/include/proc.h?revision=233291&view=markup
> >
> > /* Get the current kernel thread stack usage. */
> > #define GET_STACK_USAGE(total, used) do {                       \
> >         struct thread *td = curthread;                          \
> >         (total) = td->td_kstack_pages * PAGE_SIZE;              \
> >         (used) = (char *)td->td_kstack +                        \
> >             td->td_kstack_pages * PAGE_SIZE -                   \
> >             (char *)&td;                                        \
> > } while (0)
> > ___
> > freebsd-hackers@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> > To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
> 
> I think that should be MI for us anyway. I'm not aware of any 
> architectures FreeBSD supports with stacks that grow up. I'll give it a 
> test on PPC.

ia64 has the double stack thingie where the register stack spills into a stack
that grows up rather than down.  Not sure how sparc64 window spills are 
handled either.
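
For what it's worth, a direction-aware version of the calculation could
look something like the sketch below.  It is plain userland C, not a
proposed kernel macro, and the function name and its parameters are
invented for illustration: 'base' is the lowest address of the stack
region, 'size' its length, and 'cur' an address near the current stack
pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes of stack in use.  A stack that grows down started at the top of
 * the region (base + size); one that grows up started at 'base'.
 */
static size_t
stack_used(uintptr_t base, size_t size, uintptr_t cur, int grows_down)
{
	assert(cur >= base && cur <= base + size);
	if (grows_down)
		return ((size_t)(base + size - cur));
	return ((size_t)(cur - base));
}
```

An architecture with two stacks, like the ia64 case above, would need one
such calculation per stack.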

-- 
John Baldwin

Re: About CPU cores numbering an processor affinity

2013-09-03 Thread John Baldwin

On Friday, August 23, 2013 9:23:51 am Dmitry Sivachenko wrote:
> Hello!
> 
> I am using FreeBSD-9-STABLE on the following hardware:
> 
> FreeBSD/SMP: Multiprocessor System Detected: 24 CPUs
> FreeBSD/SMP: 2 package(s) x 6 core(s) x 2 SMT threads
> 
> So I have 2 physical CPUs with 6 core each.
> 
> # cpuset -g
> pid -1 mask: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
> 
> 
> So each of 24 cores are numbered 0..23.
> 
> 1) In what particular order are these cores numbered?  Can I assume that
> 0..11 correspond to 1st physical CPU and 12..23 to second?  How SMT threads
> are numbered within each core?

Yes, the numbering is "grouped" so that you have each package as a contiguous
block.  Each core is a contiguous block as well, so SMT threads are adjacent
to each other.
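
As a sketch of what that grouping implies (purely illustrative; it assumes
the contiguous layout described above, and the struct and function names
are made up), a CPU ID can be split into package, core and SMT thread with
simple integer arithmetic:

```c
#include <assert.h>

struct cpu_loc {
	int package;
	int core;	/* core index within the package */
	int thread;	/* SMT thread index within the core */
};

/* Decompose a CPU ID, assuming packages and cores are contiguous blocks. */
static struct cpu_loc
cpu_locate(int cpuid, int cores_per_pkg, int threads_per_core)
{
	struct cpu_loc loc;

	assert(cpuid >= 0 && cores_per_pkg > 0 && threads_per_core > 0);
	loc.thread = cpuid % threads_per_core;
	loc.core = (cpuid / threads_per_core) % cores_per_pkg;
	loc.package = cpuid / (threads_per_core * cores_per_pkg);
	return (loc);
}
```

On the 2 x 6 x 2 machine above, CPU 13 would come out as package 1, core 0,
thread 1, and CPUs 0 and 1 as the two threads of the first core.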

> Should I use "-x" option of cpuset for that purpose (to bind irq 260 and 261
> in my example)?

Yes, cpuset -x.

-- 
John Baldwin

Re: Call fo comments - raising vfs.ufs.dirhash_reclaimage?

2013-09-03 Thread John Baldwin

On Wednesday, August 28, 2013 12:40:15 pm Ivan Voras wrote:
> On 28 August 2013 18:12, Gary Jennejohn  wrote:
> 
> > So, if I understand this correctly, a normal desktop user won't
> > notice any real change, except that buildworld might get faster,
> > and big servers will benefit?
> 
> Basically, yes, but read on...
> 
> > But could this negatively impact small, embedded systems, which
> > usually have only small memory footprints?  Although I suppose
> > one could argue that they usually don't have large numbers of
> > files cached in memory at any given time.
> 
> Unless I'm wrong, the only pathological case coming from this change
> would be the following sequence of events:
> 
> 1) Memory is scarce [*]
> 2) There's a sudden surge of requests for a huge number of different
> directories
> 3) There's an urgent lowmem event which is observed by dirhash, which
> attempts to free memory but is prevented in doing so for the next 60
> seconds because all entries are young (the idea behind dirhash being
> that if a directory is accessed, it will probably soon be accessed
> again - think "ls" then "fopen", so we won't evict him until
> reclaimage seconds)
> 4) the kernel runs out of memory, game over.

Just to play devil's advocate, the only way your change can benefit is
if:

1) Memory is scarce thus triggering a lowmem event
2) There are requests for a huge number of directories that haven't been
   accessed in over 5 seconds.

That is to say, what your change does is increase the importance of dirhash
memory relative to other memory in the machine when the machine is under
memory pressure.  If the machine is not under memory pressure then the
lowmem handler will not be triggered and your change will never matter.

Keep in mind that if pagedaemon is able to keep up, the lowmem event handler
will not be called.  This handler only triggers when you are really low on
memory and trying to allocate it faster than pagedaemon can reclaim free
pages.  In that sort of environment you generally want caches to return
pages sooner rather than later.

What would perhaps be better than a hardcoded reclaim age would be to use
an LRU-type approach and perhaps set a target percent to reclaim.  That is,
suppose you were to reclaim the oldest 10% of hashes on each lowmem call
(and make the '10%' the tunable value).  Then you will always make some amount
of progress in a low memory situation (and if the situation remains dire you
will eventually empty the entire cache), but the effective maximum age will
be more dynamic.  Right now if you haven't touched UFS in 5 seconds it
throws the entire thing out on the first lowmem event.  The LRU-approach would
only throw the oldest 10% out on the first call, but eventually throw it all out
if the situation remains dire.
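
A rough userland sketch of the idea, using the TAILQ macros from
<sys/queue.h>.  This is not the ufs_dirhash code: struct hash_entry,
reclaim_oldest() and the 'pct' argument are hypothetical, and the list is
assumed to be kept in LRU order with the oldest entry at the head.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct hash_entry {
	TAILQ_ENTRY(hash_entry) link;
	size_t mem;		/* bytes this entry would free */
};
TAILQ_HEAD(hash_list, hash_entry);

/*
 * Drop the oldest 'pct' percent of the '*count' entries.  The target is
 * rounded up, so for pct > 0 a non-empty list always makes progress.
 * Returns the number of bytes freed and updates '*count'.
 */
static size_t
reclaim_oldest(struct hash_list *list, size_t *count, unsigned pct)
{
	struct hash_entry *e;
	size_t target, freed = 0;

	assert(pct <= 100);
	target = (*count * pct + 99) / 100;
	while (target-- > 0 && (e = TAILQ_FIRST(list)) != NULL) {
		TAILQ_REMOVE(list, e, link);
		freed += e->mem;
		(*count)--;
	}
	return (freed);
}
```

With 20 entries and pct set to 10, the first call drops the 2 oldest
entries; repeated calls keep shrinking the list until it is empty.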

-- 
John Baldwin

Re: UFS related panic (daily <-> find)

2013-07-26 Thread John Baldwin

On Friday, July 26, 2013 3:00:33 pm rank1see...@gmail.com wrote:
> > > > > I had 2 panics: (Both occurred at 3 AM, so it had to be the daily task)
> > > > > 
> > > > > First (Jul  2 03:06:50 2013):
> > > > > --
> > > > > Fatal trap 12: page fault while in kernel mode
> > > > > fault virtual address   = 0x19
> > > > > fault code  = supervisor read, page not present
> > > > > instruction pointer = 0x20:0xc06caf34
> > > > > stack pointer   = 0x28:0xe76248fc
> > > > > frame pointer   = 0x28:0xe7624930
> > > > > code segment= base 0x0, limit 0xf, type 0x1b
> > > > > = DPL 0, pres 1, def32 1, gran 1
> > > > > processor eflags= interrupt enabled, resume, IOPL = 0
> > > > > current process = 76562 (find)
> > > > > trap number = 12
> > > > > panic: page fault
> > > > > Uptime: 23h0m41s
> > > > > Physical memory: 1014 MB
> > > > > Dumping 186 MB: 171 155 139 123 107 91 75 59 43 27 11
> > > > > 
> > > > > #7  0xc06caf34 in cache_lookup_times (dvp=0xc784a990, vpp=0xe7624ae8,
> > > > > cnp=0xe7624afc, tsp=0x0, ticksp=0x0) at /usr/src/sys/kern/vfs_cache.c:547
> > > > 
> > > > Can you go up to this frame and do 'l'?
> > > > 
> > > > -- 
> > > > John Baldwin
> > > 
> > > 
> > > Sure,
> > > 
> > > -
> > > (kgdb) up 7
> > > #7  0xc06caf34 in cache_lookup_times (dvp=0xc784a990, vpp=0xe7624ae8,
> > > cnp=0xe7624afc, tsp=0x0, ticksp=0x0) at /usr/src/sys/kern/vfs_cache.c:547
> > > 547 numchecks++;
> > > -
> > > (kgdb) l
> > > 542         }
> > > 543
> > > 544         hash = fnv_32_buf(cnp->cn_nameptr, cnp->cn_namelen, FNV1_32_INIT);
> > > 545         hash = fnv_32_buf(&dvp, sizeof(dvp), hash);
> > > 546         LIST_FOREACH(ncp, (NCHHASH(hash)), nc_hash) {
> > > 547                 numchecks++;
> > > 548                 if (ncp->nc_dvp == dvp && ncp->nc_nlen == cnp->cn_namelen &&
> > > 549                     !bcmp(nc_get_name(ncp), cnp->cn_nameptr, ncp->nc_nlen))
> > > 550                         break;
> > > 551         }
> > > -
> > 
> > Hmm, 'p ncp' and 'p *ncp' at that frame perhaps?
> > 
> 
> (kgdb) p ncp
> $1 = (struct namecache *) 0x1
> (kgdb) p *ncp
> Cannot access memory at address 0x1

Interesting.  Maybe look at NCHHASH(hash) (you'll have to expand the macro
manually) and see if the head node is corrupted, or walk the list to find
the corrupted node.  Given that it is a single-bit error, there is a chance
this is a RAM problem.  If it is in the hash table head entry, then I think
that would always be at the same physical address for the same kernel.
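
For reference, "single-bit error" here means the bad value differs from
the value you would expect in exactly one bit.  A small sketch of that
test (the helper name is made up; treating NULL, the end-of-list marker,
as the expected value is an assumption):

```c
#include <stdint.h>

/* Returns 1 if 'seen' differs from 'expected' in exactly one bit. */
static int
is_single_bit_flip(uintptr_t seen, uintptr_t expected)
{
	uintptr_t diff = seen ^ expected;

	/* A power of two has exactly one bit set. */
	return (diff != 0 && (diff & (diff - 1)) == 0);
}
```

The ncp value of 0x1 passes this test against NULL, whereas a value such
as 0x3 would not.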

-- 
John Baldwin

Re: UFS related panic (daily <-> find)

2013-07-26 Thread John Baldwin

On Friday, July 26, 2013 9:04:42 am rank1see...@gmail.com wrote:
> > > I had 2 panics: (Both occurred at 3 AM, so it had to be the daily task)
> > > 
> > > First (Jul  2 03:06:50 2013):
> > > --
> > > Fatal trap 12: page fault while in kernel mode
> > > fault virtual address   = 0x19
> > > fault code  = supervisor read, page not present
> > > instruction pointer = 0x20:0xc06caf34
> > > stack pointer   = 0x28:0xe76248fc
> > > frame pointer   = 0x28:0xe7624930
> > > code segment= base 0x0, limit 0xf, type 0x1b
> > > = DPL 0, pres 1, def32 1, gran 1
> > > processor eflags= interrupt enabled, resume, IOPL = 0
> > > current process = 76562 (find)
> > > trap number = 12
> > > panic: page fault
> > > Uptime: 23h0m41s
> > > Physical memory: 1014 MB
> > > Dumping 186 MB: 171 155 139 123 107 91 75 59 43 27 11
> > > 
> > > #7  0xc06caf34 in cache_lookup_times (dvp=0xc784a990, vpp=0xe7624ae8,
> > > cnp=0xe7624afc, tsp=0x0, ticksp=0x0) at /usr/src/sys/kern/vfs_cache.c:547
> > 
> > Can you go up to this frame and do 'l'?
> > 
> > -- 
> > John Baldwin
> 
> 
> Sure,
> 
> -
> (kgdb) up 7
> #7  0xc06caf34 in cache_lookup_times (dvp=0xc784a990, vpp=0xe7624ae8,
> cnp=0xe7624afc, tsp=0x0, ticksp=0x0) at /usr/src/sys/kern/vfs_cache.c:547
> 547 numchecks++;
> -
> (kgdb) l
> 542         }
> 543
> 544         hash = fnv_32_buf(cnp->cn_nameptr, cnp->cn_namelen, FNV1_32_INIT);
> 545         hash = fnv_32_buf(&dvp, sizeof(dvp), hash);
> 546         LIST_FOREACH(ncp, (NCHHASH(hash)), nc_hash) {
> 547                 numchecks++;
> 548                 if (ncp->nc_dvp == dvp && ncp->nc_nlen == cnp->cn_namelen &&
> 549                     !bcmp(nc_get_name(ncp), cnp->cn_nameptr, ncp->nc_nlen))
> 550                         break;
> 551         }
> -

Hmm, 'p ncp' and 'p *ncp' at that frame perhaps?

-- 
John Baldwin

Re: UFS related panic (daily <-> find)

2013-07-22 Thread John Baldwin

On Friday, July 19, 2013 1:45:11 pm rank1see...@gmail.com wrote:
> I had 2 panics: (Both occurred at 3 AM, so it had to be the daily task)
> 
> First (Jul  2 03:06:50 2013):
> --
> Fatal trap 12: page fault while in kernel mode
> fault virtual address   = 0x19
> fault code  = supervisor read, page not present
> instruction pointer = 0x20:0xc06caf34
> stack pointer   = 0x28:0xe76248fc
> frame pointer   = 0x28:0xe7624930
> code segment= base 0x0, limit 0xf, type 0x1b
> = DPL 0, pres 1, def32 1, gran 1
> processor eflags= interrupt enabled, resume, IOPL = 0
> current process = 76562 (find)
> trap number = 12
> panic: page fault
> Uptime: 23h0m41s
> Physical memory: 1014 MB
> Dumping 186 MB: 171 155 139 123 107 91 75 59 43 27 11
> 
> #7  0xc06caf34 in cache_lookup_times (dvp=0xc784a990, vpp=0xe7624ae8,
> cnp=0xe7624afc, tsp=0x0, ticksp=0x0) at /usr/src/sys/kern/vfs_cache.c:547

Can you go up to this frame and do 'l'?

-- 
John Baldwin

Re: Kernel crashes after sleep: how to debug?

2013-07-22 Thread John Baldwin

On Friday, July 19, 2013 10:16:15 pm Yuri wrote:
> On 07/19/2013 14:04, John Baldwin wrote:
> > Hmm, that definitely looks like garbage.  How are you with gdb scripting?
> > You could write a script that walks the PQ_ACTIVE queue and see if this
> > pointer ends up in there.  It would then be interesting to see if the
> > previous page's next pointer is corrupted, or if the pageq.tqe_prev
> > references that page then it could be that this vm_page structure has
> > been stomped on instead.
> 
> As you suggested, I printed the list of pages. Actually, iteration in 
> frame 8 goes through PQ_INACTIVE pages. So I printed those.
> <...skipped...>
> ### page#2245 ###
> $4492 = (struct vm_page *) 0xfe00b5a27658
> $4493 = {pageq = {tqe_next = 0xfe00b5a124d8, tqe_prev = 0xfe00b5b79038},
>   listq = {tqe_next = 0x0, tqe_prev = 0xfe00b5a276e0},
>   left = 0x0, right = 0x0, object = 0xfe005e3f7658, pindex = 5,
>   phys_addr = 1884901376, md = {pv_list = {tqh_first = 0xfe005e439ce8,
>   tqh_last = 0xfe00795eacc0}, pat_mode = 6}, queue = 0 '\0',
>   segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
>   cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 0,
>   act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2246 ###
> $4494 = (struct vm_page *) 0xfe00b5a124d8
> $4495 = {pageq = {tqe_next = 0xfe00b460abf8, tqe_prev = 0xfe00b5a27658},
>   listq = {tqe_next = 0x0, tqe_prev = 0xfe005e3f7cf8},
>   left = 0x0, right = 0x0, object = 0xfe005e3f7cb0, pindex = 1,
>   phys_addr = 1881952256, md = {pv_list = {tqh_first = 0xfe005e42dd48,
>   tqh_last = 0xfe007adb03a8}, pat_mode = 6}, queue = 0 '\0',
>   segind = 2 '\002', hold_count = 0, order = 13 '\r', pool = 0 '\0',
>   cow = 0, wire_count = 0, aflags = 1 '\001', flags = 64 '@', oflags = 0,
>   act_count = 9 '\t', busy = 0 '\0', valid = 255 '\377', dirty = 255 '\377'}
> ### page#2247 ###
> $4496 = (struct vm_page *) 0xfe00b460abf8
> $4497 = {pageq = {tqe_next = 0xfe26, tqe_prev = 0xfe00b5a124d8},
>   listq = {tqe_next = 0xfe0081ad8f70, tqe_prev = 0xfe0081ad8f78},
>   left = 0x6, right = 0xd0201, object = 0x1, pindex = 4294901765,
>   phys_addr = 18446741877712530608, md = {pv_list = {
>   tqh_first = 0xfe00b460abc0, tqh_last = 0xfe00b5579020},
>   pat_mode = -1268733096}, queue = 72 'H', segind = -85 '\253',
>   hold_count = -19360, order = 0 '\0', pool = 254 '\376', cow = 65535,
>   wire_count = 0, aflags = 0 '\0', flags = 0 '\0', oflags = 0,
>   act_count = 0 '\0', busy = 176 '\260', valid = 208 '\320', dirty = 126 '~'}
> ### page#2248 ###
> $4498 = (struct vm_page *) 0xfe26
> 
> The page #2247 is the same that caused the problem in frame 8. tqe_next 
> is apparently invalid, so iteration stopped here.
> It appears that this structure has been stomped on. This page is 
> probably supposed to be a valid inactive page.

Yes, its phys_addr is also way off. I think you might even be able to
figure out which phys_addr it is supposed to have based on the virtual
address (see PHYS_TO_VM_PAGE() in vm/vm_page.c) by using the vm_page
address and phys_addr of the prior entries to establish the relative
offset.  It is certainly a page "earlier" in the array.
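
A sketch of that calculation, under two assumptions that should be checked
against the real kernel: that both pages live in the same contiguous
vm_page array, and that 'stride' is sizeof(struct vm_page).  The function
name and the SKETCH_PAGE_SIZE constant are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096	/* assumption: 4K pages */

/*
 * Estimate the phys_addr a vm_page should have from a known-good
 * neighbour: each step of 'stride' bytes in the vm_page array
 * corresponds to one page frame of physical memory.
 */
static uint64_t
expected_phys(uint64_t ref_page, uint64_t ref_phys, uint64_t page,
    uint64_t stride)
{
	int64_t delta = (int64_t)(page - ref_page);

	assert(stride != 0 && delta % (int64_t)stride == 0);
	return (ref_phys + (delta / (int64_t)stride) * SKETCH_PAGE_SIZE);
}
```

As a sanity check on the two good entries: pages #2245 and #2246 are 86400
bytes apart in the array and 720 page frames apart physically, which
suggests a stride of 120 bytes.  With that stride, starting from page #2245
(0xfe00b5a27658, phys_addr 1884901376) the function gives 1881952256 for
page #2246, matching the dump.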

> > Ultimately I think you will need to look at any malloc/VM/page operations
> > done in the suspend and resume paths to see where this happens.  It might
> > be slightly easier if the same page gets trashed every time as you could
> > print out the relevant field periodically during suspend and resume to
> > narrow down where the breakage occurs.
> 
> I am thinking to put code walking through all page queues and verifying 
> that they are not damaged in this way into the code when each device is 
> waking up from sleep.
> dev/acpica/acpi.c has acpi_EnterSleepState, which, as I understand, 
> contains top-level code for S3 sleep. Before sleep it invokes event 
> 'power_suspend' on all devices, and after sleep it calls 'power_resume' 
> on devices. So maybe I will call the page check procedure after 
> 'power_suspend' and 'power_resume'.
> 
> But it is possible that memory gets damaged somewhere else after 
> power_resume happens.
> Do you have any thought/suggestions?

Well, I think you should try what you've suggested above first.  If that
doesn't narrow it down then we can brainstorm some other places to inspect.
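
For what it's worth, the check itself could look something like the sketch below.  It is a userland model, not kernel code: struct sketch_page stands in for struct vm_page and only carries the queue linkage, and all the names are hypothetical.  It only catches links that disagree with each other; a wild pointer like the one in page #2247 would still fault when dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Stand-in for struct vm_page; only the queue linkage matters here. */
struct sketch_page {
	TAILQ_ENTRY(sketch_page) pageq;
	int id;
};
TAILQ_HEAD(sketch_pglist, sketch_page);

/*
 * Walk the queue and check that every element's back-pointer refers to
 * the forward pointer that led to it.  Returns the number of elements
 * visited when the queue is consistent, or -1 at the first mismatch.
 * max_pages bounds the walk so a corrupted cycle cannot loop forever.
 */
static int
check_page_queue(struct sketch_pglist *q, int max_pages)
{
	struct sketch_page **prevp = &q->tqh_first;
	struct sketch_page *m;
	int n = 0;

	assert(max_pages >= 0);
	for (m = q->tqh_first; m != NULL; m = m->pageq.tqe_next) {
		if (n >= max_pages || m->pageq.tqe_prev != prevp)
			return (-1);
		prevp = &m->pageq.tqe_next;
		n++;
	}
	/* The head's tail pointer must point at the last next-field. */
	return (q->tqh_last == prevp ? n : -1);
}
```

Calling something like this after each 'power_suspend' and 'power_resume' step, and logging the first step at which it returns -1, would bracket where the damage happens.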

-- 
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"

Re: Kernel crashes after sleep: how to debug?

2013-07-19 Thread John Baldwin

On Friday, July 19, 2013 3:32:43 pm Yuri wrote:
> On 07/19/2013 08:00, John Baldwin wrote:
> > Well, you can probably find the value of 'm' in a register if you look at the
> > disassembly around the fault.  You can then cast that pointer to the right
> > type and print its contents.
> 
> Here is the value of *m in frame 8:
> (kgdb) p *(struct vm_page*)0xfe00b460abf8
> $3 = {pageq = {tqe_next = 0xfe26, tqe_prev = 0xfe00b5a124d8}, listq 
> = {tqe_next = 0xfe0081ad8f70, tqe_prev = 0xfe0081ad8f78},
> left = 0x6, right = 0xd0201, object = 0x1, pindex = 
> 4294901765, phys_addr = 18446741877712530608, md = {pv_list = {
> tqh_first = 0xfe00b460abc0, tqh_last = 0xfe00b5579020}, pat_mode 
> = -1268733096}, queue = 72 'H', segind = -85 '�',
> hold_count = -19360, order = 0 '\0', pool = 254 '�', cow = 65535, 
> wire_count = 0, aflags = 0 '\0', flags = 0 '\0', oflags = 0,
> act_count = 0 '\0', busy = 176 '�', valid = 208 '�', dirty = 126 '~'}

Hmm, that definitely looks like garbage.  How are you with gdb scripting?
You could write a script that walks the PQ_ACTIVE queue and see if this
pointer ends up in there.  It would then be interesting to see if the
previous page's next pointer is corrupted, or if the pageq.tqe_prev references 
that page then it could be that this vm_page structure has been stomped on 
instead.

Ultimately I think you will need to look at any malloc/VM/page operations
done in the suspend and resume paths to see where this happens.  It might
be slightly easier if the same page gets trashed every time as you could
print out the relevant field periodically during suspend and resume to
narrow down where the breakage occurs.

-- 
John Baldwin

Re: Kernel crashes after sleep: how to debug?

2013-07-19 Thread John Baldwin

On Thursday, July 18, 2013 8:56:58 pm Yuri wrote:
> On 07/18/2013 13:52, John Baldwin wrote:
> > Are you in frame 8?
> 
> For some reason the debug info is missing in frame 8, but is present in 
> surrounding frames 7 and 9.
> There might be a bug in the makefiles so that the debug flag isn't passed into
> the sys/vm/ directory.

Well, you can probably find the value of 'm' in a register if you look at the 
disassembly around the fault.  You can then cast that pointer to the right
type and print its contents.

-- 
John Baldwin

Re: Kernel crashes after sleep: how to debug?

2013-07-18 Thread John Baldwin

On Thursday, July 18, 2013 3:56:46 pm Yuri wrote:
> On 07/18/2013 11:42, John Baldwin wrote:
> > Hmm, so this seems to indicate you have a page on the active queue that
> > doesn't have an associated VM object.  Can you maybe 'p *m'?  Maybe some
> > temporary page is allocated during suspend but isn't freed appropriately?
> 
> Unfortunately, I get this:
> (kgdb) p *m
> No symbol "m" in current context.
> 
> even though kernel was built with "makeoptions DEBUG=-g", same for 
> other symbols there.
> 
> Is there a way to identify when and by whom the page has been allocated?

Are you in frame 8?

-- 
John Baldwin

Re: Kernel crashes after sleep: how to debug?

2013-07-18 Thread John Baldwin

On Thursday, July 18, 2013 2:15:48 pm Yuri wrote:
> On 07/16/2013 08:07, John Baldwin wrote:
> > Can you go to frame 8 and do 'l' in kgdb?
> 
> (kgdb) up 8
> #8  0x80baea78 in vm_pageout () at /usr/src/sys/vm/vm_pageout.c:829
> 829 if (!VM_OBJECT_TRYLOCK(object) &&
> (kgdb) l
> 824 if (!vm_pageout_page_lock(m, &next)) {
> 825 vm_page_unlock(m);
> 826 continue;
> 827 }
> 828 object = m->object;
> 829 if (!VM_OBJECT_TRYLOCK(object) &&
> 830 !vm_pageout_fallback_object_lock(m, &next)) {
> 831 vm_page_unlock(m);
> 832 VM_OBJECT_UNLOCK(object);
> 833 continue;

Hmm, so this seems to indicate you have a page on the active queue that 
doesn't have an associated VM object.  Can you maybe 'p *m'?  Maybe some
temporary page is allocated during suspend but isn't freed appropriately?

-- 
John Baldwin

Re: Kernel crashes after sleep: how to debug?

2013-07-16 Thread John Baldwin

On Monday, July 15, 2013 3:22:28 am Yuri wrote:
> 
> After sleep/wakeup cycle my 9.1-STABLE r253105 amd64 system has a 
> tendency to sometimes randomly crash after a while. It doesn't happen 
> every time.
> See kgdb log below. I am not sure there is enough information to lead to 
> the cause of the issue.
> 
> It looks like it crashes near the line:
> #7  0x8091a181 in _mtx_trylock (m=0x1, opts=0, 
> file=, line=0) at /usr/src/sys/kern/kern_mutex.c:295
> 295 if (SCHEDULER_STOPPED())
> Current language:  auto; currently c
> (kgdb) l
> 290 uint64_t waittime = 0;
> 291 int contested = 0;
> 292 #endif
> 293 int rval;
> 294
> 295 if (SCHEDULER_STOPPED())
> 296 return (1);
> 297
> 298 KASSERT(m->mtx_lock != MTX_DESTROYED,
> 299 ("mtx_trylock() of destroyed mutex @ %s:%d", file, 
> line));
> 
> Current thread was:
> * 67 Thread 100064 (PID=5: pagedaemon)  doadump (textdump=<value optimized out>) at pcpu.h:234
> 
> How to find the cause of the crash?
> 
> Yuri
> 
> 
> --- kgdb log ---
> # kgdb /boot/kernel/kernel vmcore.0
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain 
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "amd64-marcel-freebsd"...
> 
> Unread portion of the kernel message buffer:
> 
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x10018
> fault code  = supervisor read data, page not present
> instruction pointer = 0x20:0x8091a181
> stack pointer   = 0x28:0xff80d51c6ab0
> frame pointer   = 0x28:0xff80d51c6ad0
> code segment= base 0x0, limit 0xf, type 0x1b
>  = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags= interrupt enabled, resume, IOPL = 0
> current process = 5 (pagedaemon)
> trap number = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> #0 0x80968416 at kdb_backtrace+0x66
> #1 0x8092e43e at panic+0x1ce
> #2 0x80d12940 at trap_fatal+0x290
> #3 0x80d12ca1 at trap_pfault+0x211
> #4 0x80d13254 at trap+0x344
> #5 0x80cfc583 at calltrap+0x8
> #6 0x80baea78 at vm_pageout+0x998
> #7 0x808fc10f at fork_exit+0x11f
> #8 0x80cfcaae at fork_trampoline+0xe
> Uptime: 2h21m27s
> Dumping 407 out of 2919 MB:..4%..12%..24%..32%..44%..52%..63%..71%..83%..91%
> 
> Reading symbols from /boot/modules/cuse4bsd.ko...done.
> Loaded symbols for /boot/modules/cuse4bsd.ko
> Reading symbols from /boot/kernel/linux.ko...Reading symbols from 
> /boot/kernel/linux.ko.symbols...done.
> done.
> Loaded symbols for /boot/kernel/linux.ko
> Reading symbols from /usr/local/libexec/linux_adobe/linux_adobe.ko...done.
> Loaded symbols for /usr/local/libexec/linux_adobe/linux_adobe.ko
> Reading symbols from /boot/kernel/radeon.ko...Reading symbols from 
> /boot/kernel/radeon.ko.symbols...done.
> done.
> Loaded symbols for /boot/kernel/radeon.ko
> Reading symbols from /boot/kernel/drm.ko...Reading symbols from 
> /boot/kernel/drm.ko.symbols...done.
> done.
> Loaded symbols for /boot/kernel/drm.ko
> #0  doadump (textdump=<value optimized out>) at pcpu.h:234
> 234 pcpu.h: No such file or directory.
>  in pcpu.h
> (kgdb) bt
> #0  doadump (textdump=<value optimized out>) at pcpu.h:234
> #1  0x8092df16 in kern_reboot (howto=260) at 
> /usr/src/sys/kern/kern_shutdown.c:449
> #2  0x8092e417 in panic (fmt=0x1 ) at 
> /usr/src/sys/kern/kern_shutdown.c:637
> #3  0x80d12940 in trap_fatal (frame=0xc, eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:879
> #4  0x80d12ca1 in trap_pfault (frame=0xff80d51c6a00, 
> usermode=0) at /usr/src/sys/amd64/amd64/trap.c:795
> #5  0x80d13254 in trap (frame=0xff80d51c6a00) at 
> /usr/src/sys/amd64/amd64/trap.c:463
> #6  0x80cfc583 in calltrap () at 
> /usr/src/sys/amd64/amd64/exception.S:232
> #7  0x8091a181 in _mtx_trylock (m=0x1, opts=0, 
> file=, line=0) at /usr/src/sys/kern/kern_mutex.c:295
> #8  0x80baea78 in vm_pageout () at /usr/src/sys/vm/vm_pageout.c:829
> #9  0x808fc10f in fork_exit (callout=0x80bae0e0 
> , arg=0x0, frame=0xff80d51c6c40)
>  at /usr/src/sys/kern/kern_fork.c:988
>

Re: Kernel dumps [was Re: possible changes from Panzura]

2013-07-11 Thread John Baldwin

> Speaking of Apple solutions, I've recently used Apple's kgdb with the
> kernel debug kit & kdp remote debugging, to debug a panic'd OS X host.
>  It's really quite nice, because the debug kit comes with a ton of
> macros, similar to kdb, and you also get the benefit of source
> debugging.  I think FreeBSD would benefit massively from finding some
> way to share macros between kdb and kgdb, in addition to having an
> "emergency network stack" like you suggest.

I have a set of macros I maintain that implement many ddb commands in
kgdb including 'sleepchain' and 'lockchain'.

http://www.freebsd.org/~jhb/gdb/

-- 
John Baldwin

Re: memmap in FreeBSD

2013-07-11 Thread John Baldwin

On Sunday, July 07, 2013 7:41:43 am mangesh chitnis wrote:
> Hi,
> 
> What is the memmap equivalent of Linux in FreeBSD?
> 
> In Linux memmap is used to reserve a portion of physical memory. This is
> used as a kernel boot argument. E.g.: memmap=2G$1G will reserve 1GB memory
> above 2GB, in case I have 3GB RAM. This 1GB reserved memory is not visible
> to the OS, however this 1GB can be used using ioremap.
> How can I reserve memory in FreeBSD and later use
> it i.e memmap and ioremap equivalent?
> 
> I have tried using hw.physmem loader parameter.
> I have 3 GB system memory and I have set hw.physmem=2G. 
> 
> 
> sysctl -a shows:
> hw.physmem: 2.12G

Note that 'hw.physmem=2G' is using power of 2 units (so 2 * 2^30),
not power of 10.
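
To make the units concrete, here is a small sketch of a power-of-2 suffix conversion.  The helper name is invented and this is not the loader's actual parser; it only shows that "2G" works out to 2 * 2^30 = 2147483648 bytes:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Convert a size such as "2G" to bytes using power-of-2 multipliers.
 * Returns 0 for an unrecognised suffix.
 */
static uint64_t
size_to_bytes(const char *s)
{
	char *end;
	uint64_t n;

	assert(s != NULL);
	n = strtoull(s, &end, 10);
	switch (*end) {
	case 'K': case 'k': return (n << 10);
	case 'M': case 'm': return (n << 20);
	case 'G': case 'g': return (n << 30);
	case '\0': return (n);
	default: return (0);
	}
}
```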

> hw.usermem: 1.9G
> hw.realmem: 2.15G
> 
> devinfo -rv shows:
> ram0: 
> 
> 0x00-0x9f3ff 
> 0x1000-0xbfed 
> 0xbff0-0xbfff
> 
> Here, looks like it is showing the full 3 GB mapping.

ram0 is reserving address space, so it always claims all of the memory 
installed.

> Now, how do I know which is that 1 GB available memory (In Linux, this
> memory is shown as reserved in /proc/iomem under System RAM)? Also, which
> function (similar to ioremap) should I call to map the physical address to
> virtual address?

There is currently no way to see the memory above the cap you set.  In the 
kernel you could perhaps fetch the SMAP metadata and walk the list to see if
there is memory above Maxmem (and if so it is presumably available for use).

However, to map it you would need to use pmap_*() routines directly.

Alternatively, you could abuse OBJT_SG by creating an sglist that describes
the unused memory range and then creating an OBJT_SG VM object backed by
that sglist.  You could then insert that VM object into the kernel's address
space to map it into the kernel, or even make it available to userland via
d_mmap_single(), or direct manipulation of a process' address space via an
ioctl, etc.
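
As an illustration of the SMAP walk mentioned above, the following userland sketch sums the usable RAM that lies above a cap.  struct sketch_smap is a hypothetical mirror of a BIOS memory map entry (base, length, type), and 'cap' stands in for wherever the kernel stopped using memory; fetching the real metadata inside the kernel is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SMAP_TYPE_MEMORY 1   /* usable RAM in the BIOS E820 map */

/* Hypothetical mirror of one BIOS memory map entry. */
struct sketch_smap {
	uint64_t base;
	uint64_t length;
	uint32_t type;
};

/*
 * Sum the bytes of usable RAM that lie at or above 'cap' (the address
 * where the kernel stopped using memory).  Entries straddling the cap
 * contribute only their upper part.
 */
static uint64_t
ram_above_cap(const struct sketch_smap *map, size_t n, uint64_t cap)
{
	uint64_t total = 0;
	size_t i;

	assert(map != NULL || n == 0);
	for (i = 0; i < n; i++) {
		uint64_t start = map[i].base;
		uint64_t end = map[i].base + map[i].length;

		if (map[i].type != SKETCH_SMAP_TYPE_MEMORY || end <= cap)
			continue;
		if (start < cap)
			start = cap;
		total += end - start;
	}
	return (total);
}
```

The same loop could just as well collect the (start, end) ranges instead of a total, which is what you would need in order to build an sglist for the OBJT_SG approach.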

-- 
John Baldwin

Re: Intel D2500CC serial ports

2013-07-11 Thread John Baldwin

On Sunday, June 30, 2013 1:24:27 pm Robert Ames wrote:
> I just picked up an Intel D2500CCE motherboard and was disappointed
> to find the serial ports didn't work.  There has been discussion
> about this problem here:
> 
> http://lists.freebsd.org/pipermail/freebsd-current/2013-April/040897.html
> http://lists.freebsd.org/pipermail/freebsd-current/2013-May/042088.html
> 
> As seen in the second link, Juergen Weiss was able to work around
> the problem.  This patch (for 8.4-RELEASE amd64) makes all 4 serial
> ports functional.
> 
> --- /usr/src/sys/amd64/amd64/io_apic.c.orig 2013-06-02 13:23:05.0 -0500
> +++ /usr/src/sys/amd64/amd64/io_apic.c  2013-06-28 18:52:03.0 -0500
> @@ -452,6 +452,10 @@
> KASSERT(!(trig == INTR_TRIGGER_CONFORM || pol == 
> INTR_POLARITY_CONFORM),
> ("%s: Conforming trigger or polarity\n", __func__));
>  
> +   if (trig == INTR_TRIGGER_EDGE && pol == INTR_POLARITY_LOW) {
> +   pol = INTR_POLARITY_HIGH;
> +   }
> +

Hmm, so this is your BIOS doing the wrong thing in its ASL.

Maybe try this:

--- //depot/user/jhb/acpipci/dev/acpica/acpi_resource.c 2011-07-22 17:59:31.0
+++ /home/jhb/work/p4/acpipci/dev/acpica/acpi_resource.c 2011-07-22 17:59:31.0
@@ -141,6 +141,10 @@
 default:
panic("%s: bad resource type %u", __func__, res->Type);
 }
+#if defined(__amd64__) || defined(__i386__)
+if (irq < 16 && trig == ACPI_EDGE_SENSITIVE && pol == ACPI_ACTIVE_LOW)
+   pol = ACPI_ACTIVE_HIGH;
+#endif
 BUS_CONFIG_INTR(dev, irq, (trig == ACPI_EDGE_SENSITIVE) ?
    INTR_TRIGGER_EDGE : INTR_TRIGGER_LEVEL, (pol == ACPI_ACTIVE_HIGH) ?
INTR_POLARITY_HIGH : INTR_POLARITY_LOW);
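
Pulled out of the driver, the rule in that patch amounts to the following.  This is only a model for reasoning about which interrupts are affected; the enum names are stand-ins, not the ACPI-CA constants:

```c
#include <assert.h>

/* Stand-ins for the ACPI trigger and polarity constants. */
enum sketch_trigger { SKETCH_LEVEL, SKETCH_EDGE };
enum sketch_polarity { SKETCH_ACTIVE_HIGH, SKETCH_ACTIVE_LOW };

/*
 * Apply the proposed quirk: an ISA-range IRQ (0-15) that the BIOS
 * describes as edge-triggered and active-low is treated as active-high.
 * Everything else is passed through unchanged.
 */
static enum sketch_polarity
fixup_polarity(int irq, enum sketch_trigger trig, enum sketch_polarity pol)
{
	assert(irq >= 0);
	if (irq < 16 && trig == SKETCH_EDGE && pol == SKETCH_ACTIVE_LOW)
		return (SKETCH_ACTIVE_HIGH);
	return (pol);
}
```

Unlike the io_apic.c change, this leaves IRQs 16 and above alone, so it should only touch the legacy ISA interrupts that the serial ports use.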

-- 
John Baldwin

Re: BUS_PROBE_NOWILDCARD behaviour doesn't seem to match DEVICE_PROBE(9)

2013-06-27 Thread John Baldwin

On Thursday, June 20, 2013 10:54:39 am Ryan Stone wrote:
> http://www.freebsd.org/cgi/man.cgi?query=DEVICE_PROBE&apropos=0&sektion=0&manpath=FreeBSD%208.2-RELEASE&format=html
> 
> DEVICE_PROBE(9) has this to say about BUS_PROBE_NOWILDCARD:
> 
> The driver expects its parent to tell it which children to manage and no
> probing is really done. The device only matches if its parent bus
> specifically said to use this driver.

Perhaps run this by Warner to make sure?  (There is also a new-bus@ list FYI).

-- 
John Baldwin

Re: Exposing driver's GPIOs through gpiobus

2013-06-27 Thread John Baldwin

On Wednesday, June 12, 2013 11:36:04 pm Ryan Stone wrote:
> At $WORK we have some custom boards with multi-port uarts managed by puc.
> The uart devices happen to provide some GPIOs, and our hardware designers
> have appropriated those GPIOs for various purposes entirely unrelated to
> the uart.
> 
> I'm looking for a clean way to provide access to the GPIOs.  It occurred to
> me that this was a problem that should be solved through newbus, and lo and
> behold I have discovered that FreeBSD provides a gpiobus driver that seems
> suitable.  I've been playing around with this for a couple of days and I
> have a solution that is working, but there are aspects that I am unhappy
> with.  I am also quite unfamiliar with newbus, so there easily could be better
> ways to approach the problem that I haven't thought of.
> 
> What I ended up doing was making gpiobus and gpioc children of the puc
> bus.  In puc_bfe_attach() I create two new child devices of the puc device
> with device_add_child(), one with the gpioc devclass and one with the
> gpiobus devclass.  I then attach both children with
> device_probe_and_attach().  I make the puc_pci driver itself provide
> implementations of the various gpio methods (like gpio_pin_get) so they can
> be inherited by the child devices.
> 
> Things start to get somewhat messy in the gpio client code.  I have the
> same image running on many different hardware types, so I can't use device
> hints to create a child device of the gpiobus.  Instead my kernel module
> tracks down the device_t for the puc, finds the gpiobus child, and uses
> BUS_ADD_CHILD to create a child of the gpiobus.  I had to add a new gpiobus
> method to allocate GPIO pins to my driver instance.  Once that's done, I
> can toggle GPIOs and whatnot using methods on my driver instance.
> 
> The things that I'm most unhappy with (newbus-wise, anyway) are:
> 
> 1) By default the gpioc and gpiobus drivers were claiming the uart children
> of the puc.  I had to decrease their priority in bus_probe to
> BUS_PROBE_LOW_PRIORITY to work around the problem.  I really don't think
> that was the right solution.  I guess I could introduce a new device that
> is a child of the puc, make sure that it will not claim the uarts, and then
> make the gpioc and gpiobus children of this device.
> 
> 2) I'm not sure how to clean up my child device when my module is
> unloaded.  Currently I am checking if it already exists on load and reusing
> it if so.  I may be missing something obvious here.

Just leave the device around and reuse it.  In your identify routine do
something like this:

static void
agp_i810_identify(driver_t *driver, device_t parent)
{

	if (device_find_child(parent, "agp", -1) == NULL &&
	    agp_i810_match(parent))
		device_add_child(parent, "agp", -1);
}

> 
> 3) I really don't like the way that I am adding my child to gpiobus.  Upon
> writing this it occurs to me that device_identify would be the canonical
> way to do this.  Previously it wasn't clear to me how to fit
> device_identify into the current architecture of the gpio client but I see
> how it can be done now.

Yes, device_identify is what you want.  I think it will also solve problem 1
for you as well.

-- 
John Baldwin

Re: Fix MNAMELEN or reimplement struct statfs

2013-06-10 Thread John Baldwin

On Saturday, June 08, 2013 9:36:27 pm m...@freebsd.org wrote:
> On Sat, Jun 8, 2013 at 3:52 PM, Dirk Engling  wrote:
> 
> > The arbitrary value
> >
> > #define MNAMELEN        88      /* size of on/from name bufs */
> >
> > struct statfs {
> > [...]
> >         char f_mntfromname[MNAMELEN];   /* mounted filesystem */
> >         char f_mntonname[MNAMELEN];     /* directory on which mounted */
> > };
> >
> > currently bites us when trying to use poudriere with errors like
> >
> > 'mount: tmpfs: File name too long'
> >
> >
> > /poudriere/data/build/91_RELEASE_amd64-REALLY-REALLY-LONG-JAILNAME/ref/wrkdirs
> >
> > The topic has been discussed several times since 2004 and has been
> > postponed each time, the last time when it hit zfs users:
> >
> > http://lists.freebsd.org/pipermail/freebsd-fs/2010-March/007974.html
> >
> > So I'd like to point to the calendar, it's 2013 already and there's
> > still a static arbitrary (and way too low) limit in one of the core
> > areas of the vfs code.
> >
> > So I'd like to bump the issue and propose either making f_mntfromname a
> > dynamic allocation or just increase MNAMELEN, using 10.0 as water shed.
> >
> 
> Gleb Kurtsou did this along with the ino64 GSoC project.  Unfortunately,
>  both he and I hit ENOTIME due to the job that pays the bills and it's
> never made it back to the main repository.
> 
> IIRC, though, the only reason for doing it with 64-bit ino_t is that he'd
> already finished changing the stat/dirent ABI so what was one more.  I
> think he went with 1024 bytes, which also necessitated not allocating
> statfs on the stack for the kernel.

He also fixed a few other things since changing this ABI is so invasive
IIRC.  This really is the right fix for this.  Is it in an svn branch 
that can be updated and a new patch generated?
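
For anyone following along, the limit being discussed boils down to a length check against a fixed buffer.  The sketch below is a userland model with made-up names, not the kernel's mount code; it only shows why a name of 88 or more characters fails with ENAMETOOLONG ("File name too long") today and would pass with a 1024-byte buffer:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SKETCH_MNAMELEN 88   /* current size of the on/from name buffers */

/*
 * Return 0 if 'path' fits in a buffer of 'buflen' bytes including the
 * terminating NUL, or ENAMETOOLONG otherwise.
 */
static int
check_mount_name(const char *path, size_t buflen)
{
	assert(path != NULL);
	return (strlen(path) >= buflen ? ENAMETOOLONG : 0);
}
```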

-- 
John Baldwin

Re: find -delete broken, or just used improperly?

2013-05-24 Thread John Baldwin

On Friday, May 24, 2013 6:24:11 am Jilles Tjoelker wrote:
> On Tue, May 21, 2013 at 11:06:39AM -0400, John Baldwin wrote:
> > On Monday, May 20, 2013 5:47:31 pm Jilles Tjoelker wrote:
> > > The below patch allows deleting the pathname given to find itself:
> 
> > > Index: usr.bin/find/function.c
> > > ===
> > > --- usr.bin/find/function.c   (revision 250661)
> > > +++ usr.bin/find/function.c   (working copy)
> > > @@ -442,7 +442,8 @@
> > >   errx(1, "-delete: forbidden when symlinks are followed");
> > >  
> > >   /* Potentially unsafe - do not accept relative paths whatsoever */
> > > - if (strchr(entry->fts_accpath, '/') != NULL)
> > > + if (entry->fts_level > FTS_ROOTLEVEL &&
> > > + strchr(entry->fts_accpath, '/') != NULL)
> > >   errx(1, "-delete: %s: relative path potentially not safe",
> > >   entry->fts_accpath);
> 
> > I'm curious, how would you instruct a patched find to avoid deleting
> > the /tmp/foo directory (e.g. if you wanted this to be a job that
> > pruned empty dirs from /tmp/foo but never pruned the directory
> > itself).  Would -mindepth 1 do it?  (Just asking.  I have also found
> > this message annoying but most of the jobs I have seen it on probably
> > don't want to delete the root path, just descendants.)
> 
> -mindepth 1 works, as does cd /tmp/foo && find . -... (-delete silently
> ignores . and ..).

Right, my only concern is that this fix will introduce a change in behavior
that I think might be significant, so we should make sure to advertise it
well in UPDATING, etc.
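
To spell out the change in behavior, here is a small model of the check before and after the patch.  The function and the SKETCH_ROOTLEVEL constant (a stand-in for FTS_ROOTLEVEL) are illustrative only and are not taken from find's source:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_ROOTLEVEL 0   /* stand-in for FTS_ROOTLEVEL */

/*
 * Model of the -delete safety check.  'level' is the entry's depth in
 * the traversal and 'accpath' the path used to reach it.  With
 * 'patched' false every path containing '/' is refused; with it true
 * the starting points given on the command line are exempt.
 */
static bool
delete_refused(int level, const char *accpath, bool patched)
{
	assert(accpath != NULL);
	if (patched && level <= SKETCH_ROOTLEVEL)
		return (false);
	return (strchr(accpath, '/') != NULL);
}
```

So "find /tmp/foo -empty -delete" goes from an error to removing /tmp/foo itself when it is empty, which is the case existing cron jobs may not expect.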

-- 
John Baldwin

Re: zfsloader triggering reset when interacting with v86int()

2013-05-22 Thread John Baldwin

On Tuesday, May 21, 2013 10:49:58 pm Lawrence Stewart wrote:
> Hi all,
> 
> I have an ageing Toshiba Portege R600 laptop (Intel ULV SU9400 1.4GHz
> Core2 CPU, 3GB RAM) running the latest v3.2 BIOS. I recently stuck a new
> Samsung 840 Pro 256GB SSD into it and decided to do my usual trick of
> dual booting Windows 7 and FreeBSD-on-ZFS-root. Windows 7 won't boot
> from a GPT scheme without a (U)EFI BIOS, so I had to use MBR + BSD
> labels for FreeBSD which I've never done before with ZFS. I followed the
> guide at [1] using the FreeBSD head r250260 snapshot from [2].
> 
> Rebooting into my newly configured FreeBSD slice from the boot0 F3
> option, I'd see zfsloader start running and then the machine would
> reset. The last line flashed up on screen before the reset was something
> like "BIOS drive C:".
> 
> With Andriy's (avg@) help, I went digging and traced the problem to
> zfsloader attempting to probe "disk0:" (floppy drive, though the machine
> doesn't have one physically). The code call path looks roughly like this
> with some functions omitted in the middle related to bios disk strategy.
> 
> zfs_probe_dev -> open -> disk_open -> ptable_open -> ptblread ->
>  -> bd_read -> bd_io -> bd_chs_io -> v86int() -> BOOM
> 
> Turns out putting a simple printf above the call to v86int() resolved
> the problem and the system would progress to a successful boot. On a
> whim, I also tested adding an mfence above the call which also allowed
> the boot to complete successfully.
> 
> Andriy also suggested I test his patch at [3] (applied to
> libi386/biosmem.c) which I did, but it did not fix the reset.
> 
> After discussing further on IRC with Andriy (avg@) and Dimitry (dim@),
> it looked more like a case of BIOS-behaving-badly, so I went hunting and
> discovered that if I disabled "USB Floppy Emulation" (the only BIOS
> option related to floppies; defaults to enabled), the standard
> unmolested zfsloader would work fine. Re-enabling "USB Floppy Emulation"
> reliably triggers the zfsloader reset unless running my tweaked code.
> 
> Two things fall out from my trip down this rabbit hole:
> 
> 1. My BIOS is doing something nasty and/or stupid and I know how to work
> around it. Yay for me (although I should note that I was successfully
> running the regular non-ZFS loader previously without encountering this
> problem).
> 
> 2. It would probably be sensible to avoid probing drives which have no
> possibility of being a ZFS candidate to minimise the chances of tickling
> BIOS bugs like the one I found. Andriy suggested only probing drives of
> a minimum size (128MB is mentioned at [4]). The problem is that to know
> the size of a disk you have to open it, and the open code path is what
> triggers the reset.
> 
> 
> I'm in well over my head here so I'm interested to hear thoughts on
> point 2 and/or any other theories about possible FreeBSD causes of the
> reset I'm seeing on the off chance my BIOS isn't actually at fault.
> Willing to test ideas/patches.

Can you try an older loader such as from 8.3 release before all the recent 
changes to rototill the disk partition code?

Also, the BIOS "knows" which devices are floppies (0x00 - 0x7f) vs hard drives 
(0x80 - 0xff) and we should probably just not probe for ZFS on floppies.

Of course, I think USB floppies might present themselves as hard drives.
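
The filter itself would be trivial, something along these lines (a sketch with an invented function name, not the loader's code).  As noted, it would not help if the BIOS presents the emulated USB floppy with a hard drive number:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * BIOS INT 13h drive numbers: 0x00-0x7f are floppies, 0x80-0xff are
 * hard drives.  Returns true when a unit is worth probing for ZFS
 * under the rule suggested above.
 */
static bool
bios_unit_worth_probing(int unit)
{
	assert(unit >= 0 && unit <= 0xff);
	return (unit >= 0x80);
}
```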

-- 
John Baldwin

Re: find -delete broken, or just used improperly?

2013-05-21 Thread John Baldwin

On Monday, May 20, 2013 5:47:31 pm Jilles Tjoelker wrote:
> On Mon, May 20, 2013 at 03:23:16PM -0400, Kurt Lidl wrote:
> > OK, maybe I'm missing something obvious, but...
> 
> > find(1) says:
> 
> >  -delete
> >  Delete found files and/or directories.  Always returns true.
> >  This executes from the current working directory as find recurses
> >  down the tree.  It will not attempt to delete a filename with a
> >  ``/'' character in its pathname relative to ``.'' for security
> >  reasons.  Depth-first traversal processing is implied by this
> >  option.  Following symlinks is incompatible with this option.
> 
> > However, it fails even when the path is absolute:
> 
> > bhyve9# mkdir /tmp/foo
> > bhyve9# find /tmp/foo -empty -delete
> > find: -delete: /tmp/foo: relative path potentially not safe
> 
> > Shouldn't this work?
> 
> The "relative path" refers to a pathname that contains a slash other
> than at the beginning or end and may therefore refer to somewhere else
> if a directory is concurrently replaced by a symlink.
> 
> When -L is not specified and "." can be opened, the fts(3) code
> underlying find(1) is careful to avoid following symlinks or being
> dropped in different locations by moving the directory fts is currently
> traversing. If a problematic concurrent modification is detected, fts
> will not enter the directory or abort. Files found in the search are
> returned via the current working directory and a pathname not containing
> a slash.
> 
> For paranoia, find(1) verifies this when -delete is used. However, it is
> too paranoid about the root of the traversal. It is already assumed that
> the initial pathname does not refer to directories or symlinks that
> might be replaced by untrusted users; otherwise, the whole traversal
> would be unsafe. Therefore, it is not necessary to do the check for
> fts_level == FTS_ROOTLEVEL.
> 
> The below patch allows deleting the pathname given to find itself:
> 
> Index: usr.bin/find/function.c
> ===
> --- usr.bin/find/function.c   (revision 250661)
> +++ usr.bin/find/function.c   (working copy)
> @@ -442,7 +442,8 @@
>   errx(1, "-delete: forbidden when symlinks are followed");
>  
>   /* Potentially unsafe - do not accept relative paths whatsoever */
> - if (strchr(entry->fts_accpath, '/') != NULL)
> + if (entry->fts_level > FTS_ROOTLEVEL &&
> + strchr(entry->fts_accpath, '/') != NULL)
>   errx(1, "-delete: %s: relative path potentially not safe",
>   entry->fts_accpath);

I'm curious, how would you instruct a patched find to avoid deleting the
/tmp/foo directory (e.g. if you wanted this to be a job that pruned empty
dirs from /tmp/foo but never pruned the directory itself).  Would -mindepth 1
do it?  (Just asking.  I have also found this message annoying but most of
the jobs I have seen it on probably don't want to delete the root path,
just descendants.)

-- 
John Baldwin

Re: Adding a FOREACH_CONTINUE() variant to queue(3)

2013-05-20 Thread John Baldwin

On Wednesday, May 01, 2013 3:41:52 am Lawrence Stewart wrote:
> On 05/01/13 15:59, Lawrence Stewart wrote:
> > On 05/01/13 15:29, Poul-Henning Kamp wrote:
> >> In message <518092bf.9070...@freebsd.org>, Lawrence Stewart writes:
> >>> [reposting from freebsd-arch@ - was probably the wrong list]
> >>
> >>> #define TAILQ_FOREACH_CONTINUE(var, head, field)  \
> >>
> >> Obligatory bikeshedding:
> >>
> >> I find the suffix "_CONTINUE" non-obvious, as there may not have
> >> been any previous FOREACH involved.
> >>
> >> TAILQ_FOREACH_FROM(...) ?
> > 
> > Agreed. Thanks for the input.
> 
> Here's an untested patch for consideration:
> 
> http://people.freebsd.org/~lstewart/patches/misc/queue_foreach_from_10.x.r250136.patch
> 
> I didn't do _SAFE variants as I don't have an immediate use for them.

Looks ok to me.  I agree with phk@ and prefer the _FROM name.
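
For readers who have not opened the patch, a _FROM variant could plausibly be written as below.  This is a sketch of the idea and is not guaranteed to match the linked patch line for line; struct sketch_item and sum_from() are invented for the demonstration.  The loop starts at 'var' if it is non-NULL and at the head of the list otherwise:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Defined here only if the system header does not already provide it. */
#ifndef TAILQ_FOREACH_FROM
#define TAILQ_FOREACH_FROM(var, head, field)			\
	for ((var) = ((var) ? (var) : TAILQ_FIRST((head)));	\
	    (var);						\
	    (var) = TAILQ_NEXT((var), field))
#endif

struct sketch_item {
	TAILQ_ENTRY(sketch_item) link;
	int value;
};
TAILQ_HEAD(sketch_list, sketch_item);

/* Sum the values from 'start' to the end of the list (whole list if NULL). */
static int
sum_from(struct sketch_list *head, struct sketch_item *start)
{
	struct sketch_item *it = start;
	int sum = 0;

	assert(head != NULL);
	TAILQ_FOREACH_FROM(it, head, link)
		sum += it->value;
	return (sum);
}
```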

-- 
John Baldwin

Re: Rebooting from loader causes a "fault" in VMware Workstation

2013-04-23 Thread John Baldwin

On Tuesday, April 23, 2013 1:32:34 pm Andriy Gapon wrote:
> on 23/04/2013 19:09 Andriy Gapon said the following:
> > 
> > IN:
> > 0x90d2:  cli
> > 0x90d3:  mov$0x1800,%esp
> > 0x90d8:  mov%cr0,%eax
> > 0x90db:  and$0x7fff,%eax
> > 0x90e0:  mov%eax,%cr0
> > 
> > 
> > IN:
> > 0x90e3:  xor%ecx,%ecx
> > 0x90e5:  mov%ecx,%cr3
> > 
> > 
> > IN:
> > 0x90e8:  lgdtl  0x95d0
> > 0x90ef:  ljmpw  $0x18,$0x90f5
> 
> Perhaps the problem is that lgdt is called after disabling paging?

That should be fine.  Generally speaking paging shouldn't be enabled
anyway (it only is if the i386 kernel panics before it has set up its
own IDT).  With paging disabled that should load the gdt from that
physical address which looks correct (the GDT descriptor is stored
just after the static gdt in btx.S itself).

> > Triple fault
> > CPU Reset (CPU 0)
> > ESI=0004503c EDI=3fe50968 EBP=00094a80 ESP=1800
> > EIP=90ef EFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> > ES =0033 a000  00cff300 DPL=3 DS   [-WA]
> > CS =0008   00cf9a00 DPL=0 CS32 [-R-]
> > SS =0010   00cf9300 DPL=0 DS   [-WA]
> > DS =0033 a000  00cff300 DPL=3 DS   [-WA]
> > FS =0033 a000  00cff300 DPL=3 DS   [-WA]
> > GS =0033 a000  00cff300 DPL=3 DS   [-WA]
> > LDT=   8200 DPL=0 LDT
> > TR =0038 5f98 2067 8900 DPL=0 TSS32-avl
> > GDT= ff85c789 
> > IDT= 5e00 0197
> > CR0=0011 CR2= CR3= CR4=
> > DR0= DR1=0000 DR2= DR3=
> > DR6=0ff0 DR7=0400
> > CCS=0001 CCD= CCO=LOGICL
> > EFER=
> > 
> 
> 
> -- 
> Andriy Gapon
> 

-- 
John Baldwin

Re: Rebooting from loader causes a "fault" in VMware Workstation

2013-04-23 Thread John Baldwin

On Tuesday, April 23, 2013 12:09:28 pm Andriy Gapon wrote:
> on 23/04/2013 17:36 Dimitry Andric said the following:
> > I have tried to ascertain it actually arrives at this code when
> > rebooting from the loader, but it does not seem to ever make it there,
> > at least not to the jump to f000:fff0.  Maybe VMware intercepts the
> > switching back to real mode in the previous part, and dies on that, I am
> > not sure.  It is of course rather tricky to print off any debug messages
> > at that point. :-)
> 
> For the inquisitive minds, here is how the last instructions (and CPU state)
> look according to the qemu log:
> 
> IN:
> 0xa030:  xor    %eax,%eax
> 0xa032:  int    $0x30
> 
> 
> IN:
> 0x93e0:  cmp    $0x1,%eax
> 0x93e3:  jne    0x93ff
> 
> 
> IN:
> 0x93ff:  orb    $0x1,%ss:0x9007
> 0x9407:  jmp    0x90d2
> 
> 
> IN:
> 0x90d2:  cli
> 0x90d3:  mov    $0x1800,%esp
> 0x90d8:  mov    %cr0,%eax
> 0x90db:  and    $0x7fff,%eax
> 0x90e0:  mov    %eax,%cr0
> 
> 
> IN:
> 0x90e3:  xor    %ecx,%ecx
> 0x90e5:  mov    %ecx,%cr3
> 
> 
> IN:
> 0x90e8:  lgdtl  0x95d0
> 0x90ef:  ljmpw  $0x18,$0x90f5
> 
> Triple fault
> CPU Reset (CPU 0)
> ESI=0004503c EDI=3fe50968 EBP=00094a80 ESP=1800
> EIP=90ef EFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0033 a000  00cff300 DPL=3 DS   [-WA]
> CS =0008   00cf9a00 DPL=0 CS32 [-R-]
> SS =0010   00cf9300 DPL=0 DS   [-WA]
> DS =0033 a000  00cff300 DPL=3 DS   [-WA]
> FS =0033 a000  00cff300 DPL=3 DS   [-WA]
> GS =0033 a000  00cff300 DPL=3 DS   [-WA]
> LDT=   8200 DPL=0 LDT
> TR =0038 5f98 2067 8900 DPL=0 TSS32-avl
> GDT= ff85c789 0000

This seems wrong (address is way too high).  I wonder if the gdtdesc was 
trashed by something?  Can you dump memory before the lgdtl instruction at the 
0x95d0 address?
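
When looking at such a memory dump, the six bytes at the lgdt operand address are a 16-bit limit followed by a 32-bit base.  A minimal sketch of decoding them follows.  The names decode_gdt_desc() and gdt_desc_plausible() are hypothetical helpers, not anything in btx.S, and the plausibility check assumes only that the descriptor sits directly after the GDT it describes:

```c
#include <stdint.h>

/* Operand of lgdt in 32-bit mode: 16-bit limit followed by 32-bit base. */
struct gdt_desc {
	uint16_t limit;		/* size of the GDT in bytes, minus one */
	uint32_t base;		/* address of the GDT (physical, paging off) */
};

/* Decode the six little-endian bytes found at the lgdt operand address. */
static struct gdt_desc
decode_gdt_desc(const uint8_t raw[6])
{
	struct gdt_desc d;

	d.limit = (uint16_t)(raw[0] | (raw[1] << 8));
	d.base = (uint32_t)raw[2] | ((uint32_t)raw[3] << 8) |
	    ((uint32_t)raw[4] << 16) | ((uint32_t)raw[5] << 24);
	return (d);
}

/*
 * Hypothetical sanity check: if the descriptor is stored right after the
 * table, the table must end exactly where the descriptor begins.
 */
static int
gdt_desc_plausible(struct gdt_desc d, uint32_t desc_addr)
{
	return (d.base < desc_addr && d.base + d.limit + 1u == desc_addr);
}
```

A base far above the loader's own addresses fails this check, which would point at the descriptor having been overwritten rather than at the lgdt instruction itself.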

-- 
John Baldwin

Re: NUMA, cpuset and malloc

2013-04-22 Thread John Baldwin

On Monday, April 22, 2013 12:29:29 pm Freddie Cash wrote:
> On Mon, Apr 22, 2013 at 8:32 AM, John Baldwin  wrote:
> 
> > On Saturday, April 20, 2013 6:43:26 pm Robert Waksmundzki wrote:
> > > On NUMA systems allocated memory is striped across local and non-local
> > > banks in order to have consistent performance in case the task is
> > > rescheduled to a different CPU socket.
> > > When a process is pinned to a single CPU socket with cpuset having the
> > > memory allocator prefer local banks would probably improve performance.
> > > Default system behavior would stay the same and the optimization would
> > > only be triggered on big multi socket systems when administrator used
> > > cpuset (command mostly used for performance optimization anyway).
> > >
> > > Is this something currently implemented in FreeBSD? Is this even a good
> > > idea?
> >
> > You can get something sort of like this by enabling NUMA in your kernel
> > (9.0 and later) and always pinning your processes with cpuset.  (The
> > simple NUMA bits always allocate memory in the memory domain the current
> > thread is running in at the time of the fault.)
> >
> 
> How does one enable NUMA?
> 
> A "grep -i numa /usr/src/sys/conf/NOTES /usr/src/sys/amd64/conf/NOTES"
> turns up 0 hits for both 9-STABLE r248547 and 10-CURRENT (April 11, used
> svnup so no way to get the exact revision number, that I know of).
> 
> Or, is it enabled automatically?

You have to change the VM_NDOMAIN setting.  In recent HEAD and 9-stable
you can do it in the kernel config (options VM_NDOMAIN=4 for example).
In older HEAD and 9 you have to edit sys/amd64/include/vmparam.h or
sys/i386/include/vmparam.h and change VM_NDOMAIN before building your
kernel.

-- 
John Baldwin

Re: Rebooting from loader causes a "fault" in VMware Workstation

2013-04-22 Thread John Baldwin

On Saturday, April 20, 2013 7:51:06 am Joshua Isom wrote:
> On 4/19/2013 8:48 PM, Jeremy Chadwick wrote:
> > I'm happy to open up a ticket with VMware about the issue as I'm a
> > customer, but I find it a little odd that other operating systems do not
> > exhibit this problem, including another BSD.  Ones which reboot just
> > fine from their bootloaders:
> >
> > - Linux -- so many that I don't know where to begin: ArchLinux
> >2012.10.06, CentOS 6.3, Debian 6.0.7, Finnix 1.0.5, Knoppix 7.0.4,
> >Slackware 14.0, and Ubuntu 11.10
> > - OpenBSD 5.2
> > - OpenIndiana -- build 151a7 (server version)
> >
> > So when you say "Blame VMware", I'd be happy to, except there must be
> > something FreeBSD's bootstraps are doing differently than everyone else
> > that causes this oddity.  Would you not agree?
> 
> A triple fault is standard practice as a fail safe guarantee of reboot. 
>   It's used either as a reboot, switch to real mode(IBM OS/2), or 
> catastrophic unrecoverable failure.
> 
> By the looks of grub(Linux and Solaris), it either jumps to its own 
> instruction, hoping the bios catches it("tell the BIOS a boot failure, 
> which may result in no effect"), or jumps to a location that I can't yet 
> determine what code exists there.  I can't seem to find OpenBSD's reboot 
> method from OpenBSD's cvsweb, only an exit but not where that exit leads 
> to.  The native operating system is irrelevant, only the boot loader so 
> all the Linux distributions and Solaris forks all count as "grub."  Many 
> other bootloaders don't even have the reboot option, just "fail." 
> Here's barebox, a Das U-Boot fork:
> 
>   /** How to reset the machine? */
>   while(1)
> 
> In any case, it's a bag of tricks, finding something that works and is 
> "nice."  We're talking 30 years of legacy.  A triple fault, assuming the 
> mbr and loader ignores or zeroes previous memory, is guaranteed and 
> doesn't hang.

Actually, the traditional reboot method in real-mode (e.g. in DOS) is
to jump to 0x:0.  The BIOS is supposed to have a restart routine
at that location.  I've also seen jumps to 0xf000:fff0.

For example, BTX (the mini-kernel that "hosts" the loader and boot2)
uses the latter:

/*
 * Reboot or await reset.
 */
                sti                             # Enable interrupts
                testb $0x1,btx_hdr+0x7          # Reboot?
exit.3:         jz exit.3                       # No
                movw $0x1234, BDA_BOOT          # Do a warm boot
                ljmp $0xf000,$0xfff0            # reboot the machine

And in fact, when the loader calls __exit() that is precisely where it
ends up.  The int 0x30 ends up here in btx.S:

/*
 * System calls.
 */
                .set SYS_EXIT,0x0               # Exit
                .set SYS_EXEC,0x1               # Exec
...
/*
 * System Call.
 */
intx30:         cmpl $SYS_EXEC,%eax             # Exec system call?
                jne intx30.1                    # No
...
intx30.1:       orb $0x1,%ss:btx_hdr+0x7        # Flag reboot
                jmp exit                        # Exit

And the 'exit' label eventually ends up at the 'exit.3' code I quoted
above.  If the BIOS that VMware exports has a reboot routine VMware doesn't
like, then VMware needs to fix its BIOS. :)
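
As a side note on the far jump target: in real mode a segment:offset pair maps to a physical address of segment * 16 + offset, so f000:fff0 is physical 0xffff0, sixteen bytes below the 1MB mark.  The same address can also be written as ffff:0000.  A small sketch of that arithmetic, where real_mode_phys() is just an illustrative name:

```c
#include <stdint.h>

/* Real-mode address translation: physical = segment * 16 + offset. */
static uint32_t
real_mode_phys(uint16_t seg, uint16_t off)
{
	return (((uint32_t)seg << 4) + off);
}
```

Both real_mode_phys(0xf000, 0xfff0) and real_mode_phys(0xffff, 0x0000) evaluate to 0xffff0, so the two spellings of the jump land on the same BIOS entry point.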

The list of operations we try on x86 to shut down from protected mode is
quite a bit longer (not including the ACPI bits).  You can look at
cpu_reset_real() in sys/i386/i386/vm_machdep.c.

-- 
John Baldwin

Re: NUMA, cpuset and malloc

2013-04-22 Thread John Baldwin

On Saturday, April 20, 2013 6:43:26 pm Robert Waksmundzki wrote:
> On NUMA systems allocated memory is striped across local and non-local banks
> in order to have consistent performance in case the task is rescheduled to a
> different CPU socket.
> When a process is pinned to a single CPU socket with cpuset having the
> memory allocator prefer local banks would probably improve performance.
> Default system behavior would stay the same and the optimization would only be
> triggered on big multi socket systems when administrator used cpuset (command
> mostly used for performance optimization anyway).
> 
> Is this something currently implemented in FreeBSD? Is this even a good
> idea?

You can get something sort of like this by enabling NUMA in your kernel (9.0 
and later) and always pinning your processes with cpuset.  (The simple NUMA
bits always allocate memory in the memory domain the current thread is
running in at the time of the fault.)
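
To illustrate why pinning matters with this kind of first-touch policy, here is a toy userland model, not kernel code.  NCPU, NDOMAIN, and the even CPU-to-domain split are assumptions made up for the example; a real machine gets its topology from firmware tables:

```c
#include <stddef.h>

#define	NCPU	8	/* hypothetical machine: 8 CPUs ... */
#define	NDOMAIN	2	/* ... split evenly across 2 memory domains */

/* Map a CPU to its memory domain (assumes a contiguous, even split). */
static int
cpu_to_domain(int cpu)
{
	return (cpu / (NCPU / NDOMAIN));
}

/*
 * Simulate first-touch placement: page i is faulted in while the thread
 * runs on cpus[i] and is allocated from that CPU's domain.  Returns how
 * many of the npages pages landed in the domain 'want'.
 */
static size_t
pages_in_domain(const int *cpus, size_t npages, int want)
{
	size_t i, n = 0;

	for (i = 0; i < npages; i++)
		if (cpu_to_domain(cpus[i]) == want)
			n++;
	return (n);
}
```

A thread pinned to CPUs 0-3 ends up with every page in domain 0, while a thread that migrates between sockets scatters its pages across both domains.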

-- 
John Baldwin

Re: Multiple page size support on FreeBSD?

2013-04-10 Thread John Baldwin

On Monday, April 08, 2013 6:39:31 am Wojciech Puchar wrote:
> > Superpage promotion happens automatically when consecutive data are
> > accessed according to the proper heuristic.
> 
> and in practice - unless there are only few processes, never really works.
> 
> this is a result of my own tests.

How do your tests work?  Do you examine PTEs directly to check for superpages
or are you relying on the vm.pmap.pde sysctls?

-- 
John Baldwin

Re: close(2) while accept(2) is blocked

2013-04-06 Thread John-Mark Gurney

Bakul Shah wrote this message on Sat, Mar 30, 2013 at 13:22 -0700:
> On Sat, 30 Mar 2013 09:14:34 PDT John-Mark Gurney  wrote:
> > 
> > As someone else pointed out in this thread, if a userland program
> > depends upon this behavior, it has a race condition in it...
> > 
> > Thread 1        Thread 2        Thread 3
> > enters routine to read
> > enters routine to close
> > calls close(3)
> > open() returns 3
> > does read(3) for original fd
> > 
> > How can the original threaded program ensure that thread 2 doesn't
> > create a new fd in between?  So even if you use a lock, this won't
> > help, because as far as I know, there is no enter read and unlock
> > mutex call yet...
> 
> It is worse. Consider:
> 
>   fd = open(file,...);
>   read(fd, ...);
> 
> No guarantee read() gets data from the same opened file!
> Another thread could've come along, closed fd and pointed it
> to another file. So nothing is safe. Might as well stop using
> threads, right?!

Nope, you need your threads to cooperate w/ either locks or some
ownership mechanism like token passing...  As long as only one
thread owns the fd, you won't have troubles...

> We are talking about cooperative threads where you don't have
> to assume the worst case.  Here not being notified on a close

Multiple people have said when threads cooperate without locks, but
no one has given a good example where they do...  The cases you list
below are IMO special (and extreme) cases...

We were talking about "cooperating" threads in a general purpose
language like Java where you cannot depend upon your other threads
doing a limited amount of work...

> event can complicate things. As an example, I have done
> something like this in the past: A frontend process validating
> TCP connections and then passing on valid TCP connections to
> another process for actual service (via sendmsg() over a unix
> domain). All the worker threads in service process can do a
> recvmsg() on the same fd. They process whatever tcp connection
> they get. Now what happens when the frontend process is
> restarted for some reason?  All the worker threads need to
> eventually reconnect to a new unix domain posted by the new
> frontend process. You can handle this multiple ways but
> terminating all the blocking syscalls on the now invalid fd is
> the simplest solution from a user perspective.

This'd make an interesting race... Not sure if it could happen, but
close() closes the receiving fd, but another thread is in the process
of receiving an fd and puts the new fd in the place of the now closed
unix domain socket...  I can draw a more detailed diagram if you want..

> > I decided long ago that this is only solvable by proper use of locking
> > and ensuring that if you call close (the syscall), that you do not have
> > any other thread that may use the fd.  It's the close routine's (not
> > syscall) function to make sure it locks out other threads and all other
> > are out of the code path that will use the fd before it calls close..
> 
> If you lock before close(), you have to lock before every
> other syscall on that fd. That complicates userland coding and
> slows down things when this can be handled more simply in the
> kernel.

There is "ownership" that can be passed, such as via kqueue _ONESHOT
w/ multiple threads which allows you to avoid locking and only one
thread ever owns the fd...

> Another usecase is where N worker threads all accept() on the
> same fd. Single threading using a lock defeats any performance
> gain.

In this case you still need to do something special since what happens
if one of the worker threads opens a file and/or listen/accept socket
as part of its work?

Thread 1        Thread 2        Thread 3
about to call accept but
  after flag check
sets flag that close
  is going to be called
accept connect
process connection
close listen'd socket
open socket as part of
  processing gets same fd
calls listen on socket
calls accept on socket

And there we have it, a race condition...  You can't always guarantee
what your worker threads do, if you can, it's good, but we need to make
sure that beginner programs don't get into traps like these...  They
can easily make the mistake of saying, well, since close kicks all
my threads out, I'll just do that instead of making sure that they
don't have a race

Re: extattr_set_* return type

2013-04-01 Thread John Baldwin

On Monday, April 01, 2013 3:56:46 pm m...@freebsd.org wrote:
> On Mon, Apr 1, 2013 at 12:51 PM,  wrote:
> 
> > On Mon, Apr 1, 2013 at 11:24 AM, John Baldwin  wrote:
> >
> >> On Saturday, March 30, 2013 5:30:21 pm m...@freebsd.org wrote:
> >> > Despite the man page correctly describing the return value for
> >> > extattr_set_*, I thought recently that they returned 0/-1 for
> >> > success/failure, not the number of bytes written, like write(2).  This
> >> > is because extattr_set_* is declared as returning an int, not an
> >> > ssize_t.  Both extattr_get and extattr_list return ssize_t, so this is
> >> > inconsistent.
> >> >
> >> > The patch at
> >> >
> >> http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-
fix-
> >> rmextattr-8-.patchfixes<http://people.freebsd.org/~mdf/0001-Fix-return-
type-of-extattr_set_-and-fix-rmextattr-8-.patchfixes>
> >> > this.  It compiles but it's untested.
> >> >
> >> > I don't think any compat shims are needed, since an old application
> >> > will still sign extend and this will work (it's very unlikely anyone
> >> > does extattr_set for 2GB or more).
> >> >
> >> > If anyone actually uses extattr on 64-bit, please test a new kernel but
> >> > old userspace to be sure nothing is broken.  I plan to commit this next
> >> > week if I don't hear otherwise.
> >>
> >> Hmm, the patch URL doesn't work, but please fix this.  There is an old
> >> thread we are both on from Dec 2011 where I ran into the same thing.  I
> >> also think we don't need compat shims.
> >>
> >
> > The version in my outbox looked right; I don't know how it got mangled.
> >
> >
> > http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-
fix-
> > <http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-
fix-rmextattr-8-.patchfixes>
> > rmextattr-8-.patch<http://people.freebsd.org/~mdf/0001-Fix-return-type-of-
extattr_set_-and-fix-rmextattr-8-.patchfixes>
> >

Somehow the space between 'patch' at the end of the URL and 'fixes' kept 
getting eaten.  I worked it out though and your patch looks fine to me.  I 
think we can just punt on doing symver and commit it as is.  Thanks for fixing 
this!

-- 
John Baldwin

Re: preemptive kernel

2013-04-01 Thread John Baldwin

On Friday, March 22, 2013 4:10:16 pm vasanth rao naik sabavat wrote:
> Hi Adrian,
> 
> Just to clarify, is the kernel pre-emption involuntary?
> 
> Let's say I have a kernel thread processing a huge list of entries, would
> this thread get involuntarily context switched out because of kernel
> preemption?
> 
> What is the time slice after which a kernel thread can be involuntarily
> context switched out?
> 
> Could you please point to the file in the source code which handles the
> kernel pre-emption.

In-kernel preemption is driven by interrupts, not time slices.  If an 
interrupt arrives that awakens a higher priority thread (e.g. an interrupt 
thread), or if your thread awakens a thread that has higher priority (e.g. due 
to wakeup() or cv_signal()), then your thread will be preempted.

In general time-based preemptions are only done for user threads.

-- 
John Baldwin

Re: helgrind (valgrind plugin) errors coming from nsdispatch(3)

2013-04-01 Thread John Baldwin

On Monday, March 25, 2013 12:13:39 am Yuri wrote:
> While running helgrind on my program, I observed several errors that 
> stem from the nsdispatch calls, see helgrind log below.
> "lock order" error in helgrind is generated when some data is protected 
> by two mutexes, and they were locked in different order on different 
> occasions.
> I think, mutexes in question are nss_lock and conf_lock in 
> lib/libc/net/nsdispatch.c
> 
> It seems like authors of helgrind took an approach that such situation 
> is error prone in general, thus they point it out.
> 
> So what would be the prevalent judgement here, is this something worth 
> fixing in libc, or such errors should be ignored?

Hmm, try locks don't block, so if the only use of the mutex is try locks
and the caller unwinds and releases the rwlock if the mutex try lock fails, no 
deadlock is possible.  The WITNESS checker in the kernel ignores try locks 
when checking for lock order reversals for this reason.
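
The unwind-on-failure pattern can be sketched in userland as below.  This is only an illustration with two plain pthread mutexes (nsdispatch.c pairs an rwlock with a mutex), and lock_pair_backoff() is a made-up name:

```c
#include <errno.h>
#include <pthread.h>

/*
 * Acquire 'outer', then attempt 'inner' without blocking.  On failure the
 * caller never waits while holding 'outer': it is released and the error
 * is returned so the caller can unwind and retry later.
 */
static int
lock_pair_backoff(pthread_mutex_t *outer, pthread_mutex_t *inner)
{
	int error;

	error = pthread_mutex_lock(outer);
	if (error != 0)
		return (error);
	error = pthread_mutex_trylock(inner);
	if (error != 0) {
		pthread_mutex_unlock(outer);
		return (error);	/* EBUSY if 'inner' is already held */
	}
	return (0);		/* both held; unlock inner, then outer */
}
```

Because the second acquisition cannot sleep, a thread that takes the locks in the opposite order can always make progress, which is why a checker that only looks at acquisition order reports a false positive here.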

-- 
John Baldwin

Re: extattr_set_* return type

2013-04-01 Thread John Baldwin

On Saturday, March 30, 2013 5:30:21 pm m...@freebsd.org wrote:
> Despite the man page correctly describing the return value for
> extattr_set_*, I thought recently that they returned 0/-1 for
> success/failure, not the number of bytes written, like write(2).  This is
> because extattr_set_* is declared as returning an int, not an ssize_t.
>  Both extattr_get and extattr_list return ssize_t, so this is inconsistent.
> 
> The patch at
> http://people.freebsd.org/~mdf/0001-Fix-return-type-of-extattr_set_-and-fix-
rmextattr-8-.patchfixes
> this.  It compiles but it's untested.
> 
> I don't think any compat shims are needed, since an old application will
> still sign extend and this will work (it's very unlikely anyone does
> extattr_set for 2GB or more).
> 
> If anyone actually uses extattr on 64-bit, please test a new kernel but old
> userspace to be sure nothing is broken.  I plan to commit this next week if
> I don't hear otherwise.

Hmm, the patch URL doesn't work, but please fix this.  There is an old thread 
we are both on from Dec 2011 where I ran into the same thing.  I also think we 
don't need compat shims.
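
For anyone wondering what the int vs ssize_t difference means for callers, here is a small model.  It does not call the real extattr functions; seen_by_old_caller() and write_style_ok() are hypothetical names, and the narrowing conversion is implementation-defined in C (common compilers keep the low 32 bits):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of an old binary that reads a byte-count return value through a
 * 32-bit int.  Small counts and -1 survive; counts of 2GB or more do not.
 */
static int32_t
seen_by_old_caller(int64_t bytes_written)
{
	return ((int32_t)bytes_written);
}

/*
 * write(2)-style success test: any non-negative value is success, so a
 * caller comparing against 0 would misread a successful non-empty write.
 */
static bool
write_style_ok(int64_t ret)
{
	return (ret >= 0);
}
```

This matches the reasoning in the quoted message: an old application only misbehaves if a single attribute write reaches 2GB, which is very unlikely in practice.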

-- 
John Baldwin

Re: close(2) while accept(2) is blocked

2013-04-01 Thread John Baldwin

On Thursday, March 28, 2013 12:54:31 pm Andriy Gapon wrote:
> 
> So, this started as a simple question, but the answer was quite unexpected
> to me.
> 
> Let's say we have an opened and listen-ed socket and let's assume that we know
> that one thread is blocked in accept(2) and another thread is calling
> close(2).
> What is going to happen?
> 
> Turns out that practically nothing.  For kernel the close call would be
> almost a nop.
> My understanding is this:
> - when socket is created, its reference count is 1
> - when accept(2) is called, fget in kernel increments the reference count
>   (kept in an associated struct file)
> - when close(2) is called, the reference count is decremented
> 
> The reference count is still greater than zero, so fdrop does not call
> fo_close.
> That means that in the case of a socket soclose is not called.
> 
> I am sure that the reference counting in this case is absolutely correct with
> respect to managing kernel side structures.  But I am not sure that it is
> correct with respect to hiding the explicit close(2) call from other threads
> that may be waiting on the socket.
> In other words, I am not sure if fo_close is supposed to signify that there
> are no uses of a file, or that userland close-d the file.  Or perhaps these
> should be two different methods.
> 
> Additional note is that shutdown(2) doesn't wake up the thread in accept(2)
> either.  At least that's true for unix domain sockets.
> Not sure if this is a bug.
> 
> But the summary seems to be that currently it is not possible to break a
> thread out of accept(2) (at least without resorting to signals).

I think you need to split the 'struct file' reference count into two different
counts similar to how we have vref/vrele vs vhold/vdrop for vnodes.  The
fget for accept and probably most other system calls should probably be
equivalent to vhold, whereas things like open/dup (and storing an fd in a
cmsg) should be more like vref.  close() should then be a vrele().
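
A rough sketch of the two-count idea, as a single-threaded userland model.  struct xfile and the xfile_*() functions are invented for illustration and ignore all locking and atomics a real implementation would need:

```c
#include <stdbool.h>

/* Hypothetical stand-in for 'struct file' with the count split in two. */
struct xfile {
	int	use_count;	/* descriptor-style references (open/dup) */
	int	hold_count;	/* in-flight system calls such as accept(2) */
	bool	closed;		/* the fo_close-like step has run */
	bool	freed;		/* the structure has been released */
};

static void
xfile_maybe_free(struct xfile *fp)
{
	if (fp->use_count == 0 && fp->hold_count == 0)
		fp->freed = true;
}

/* vref-like: a new descriptor refers to the file. */
static void
xfile_ref(struct xfile *fp)
{
	fp->use_count++;
}

/* vrele-like: the last close(2) runs the close step even if holds remain. */
static void
xfile_rele(struct xfile *fp)
{
	if (--fp->use_count == 0)
		fp->closed = true;	/* a real version would wake sleepers */
	xfile_maybe_free(fp);
}

/* vhold-like: a system call pins the structure while it works. */
static void
xfile_hold(struct xfile *fp)
{
	fp->hold_count++;
}

/* vdrop-like: the system call is done; free once nothing refers to it. */
static void
xfile_drop(struct xfile *fp)
{
	fp->hold_count--;
	xfile_maybe_free(fp);
}
```

With this split, a close() that drops the last use reference marks the object closed while a blocked accept() still holds it, and the memory only goes away once the blocked call has dropped its hold.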

-- 
John Baldwin

Re: call suspend_cpus() under smp_ipi_mtx

2013-04-01 Thread John Baldwin

On Saturday, March 23, 2013 5:48:50 am Andriy Gapon wrote:
> 
> Looks like this issue needs more thinking and discussing.
> 
> The basic idea is that suspend_cpus() must be called with smp_ipi_mtx held (on
> SMP systems).
> This is for exactly the same reasons as to why we first take smp_ipi_mtx
> before calling stop_cpus() in the shutdown path.  Essentially one CPU could
> be holding smp_ipi_mtx (and thus with interrupts disabled[*]) and waiting for
> an acknowledgement from other CPUs (e.g. in smp_rendezvous or in a TLB
> shootdown), while another CPU could be with interrupts disabled (explicitly -
> like in the shutdown or ACPI suspend paths) and trying to deliver an IPI to
> other CPUs.
> 
> In my opinion, we must consistently use the same lock, smp_ipi_mtx, for all
> regular (non-NMI) synchronous IPI-based communication between CPUs.
> Otherwise a deadlock is quite possible.
> 
> Some obstacles for just going ahead and making the suggested change:
> 
> - acpi_sleep_machdep() calls intr_suspend() with interrupts disabled;
>   currently witness(9) is not aware of that, but if smp_ipi_mtx spin-lock is
>   used, then we would have to make intr_table_lock and msi_lock the
>   spin-locks as well;
> - AcpiLeaveSleepStatePrep() (from ACPICA) is called with interrupts disabled
>   and currently it performs an action that requires memory allocation; again,
>   with interrupts disabled via intr_disable() this fact is not visible to
>   witness, etc, but with smp_ipi_mtx it needs to be somehow handled.
> 
> I talked to ACPICA guys about the last issue and they told me that what is
> currently done in AcpiLeaveSleepStatePrep does not need to be with interrupts
> disabled and can be moved to AcpiLeaveSleepState.  This is after the _BFS and
> _GTS support was removed.
> 
> What do you think?
> Thank you.

Hmm, I think intr_table_lock used to be a spin lock at some point.  I don't
remember why we changed it to a regular mutex.  It may be that there was a
lock order reason for that. :(

-- 
John Baldwin

Re: close(2) while accept(2) is blocked

2013-03-30 Thread John-Mark Gurney

Bakul Shah wrote this message on Fri, Mar 29, 2013 at 16:54 -0700:
> On Fri, 29 Mar 2013 14:30:59 PDT Carl Shapiro  wrote:
> > 
> > In other operating systems, such as Solaris and MacOS X, closing the
> > descriptor causes blocked system calls to return with an error.
> 
> What happens if you select() on a socket and another thread
> closes this socket?  Ideally select() should return (with
> EINTR?) so that the blocking thread can do some cleanup action.
> And if you do that, the blocking accept() case is not really
> different.
> 
> There is no point in *not* telling blocking threads that the
> descriptor they're waiting on is now EBADF and nothing is
> going to happen.
> 
> > It is not obvious whether there is any benefit to having the current
> > blocking behaviour. 
> 
> This may need some new kernel code but IMHO this is worth fixing.

As someone else pointed out in this thread, if a userland program
depends upon this behavior, it has a race condition in it...

Thread 1        Thread 2        Thread 3
enters routine to read
enters routine to close
calls close(3)
open() returns 3
does read(3) for original fd

How can the original threaded program ensure that thread 2 doesn't
create a new fd in between?  So even if you use a lock, this won't
help, because as far as I know, there is no enter read and unlock
mutex call yet...

I decided long ago that this is only solvable by proper use of locking
and ensuring that if you call close (the syscall), that you do not have
any other thread that may use the fd.  It's the close routine's (not
syscall) function to make sure it locks out other threads and all other
are out of the code path that will use the fd before it calls close..
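
A minimal sketch of that approach, assuming all users of the descriptor go through the same wrapper.  struct guarded_fd, guarded_read() and guarded_close() are hypothetical names, and nothing here stops code that calls close(2) on the raw descriptor directly:

```c
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

struct guarded_fd {
	pthread_mutex_t	lock;
	int		fd;	/* -1 once closed */
};

static ssize_t
guarded_read(struct guarded_fd *g, void *buf, size_t len)
{
	ssize_t n;

	pthread_mutex_lock(&g->lock);
	if (g->fd < 0) {
		pthread_mutex_unlock(&g->lock);
		errno = EBADF;
		return (-1);
	}
	/* guarded_close() cannot run while the lock is held. */
	n = read(g->fd, buf, len);
	pthread_mutex_unlock(&g->lock);
	return (n);
}

static int
guarded_close(struct guarded_fd *g)
{
	int error = 0;

	pthread_mutex_lock(&g->lock);
	if (g->fd >= 0) {
		error = close(g->fd);
		g->fd = -1;	/* later readers see EBADF, not a reused fd */
	}
	pthread_mutex_unlock(&g->lock);
	return (error);
}
```

The obvious cost is that a closer waits behind a reader that is blocked in read(), so in practice this would likely be combined with non-blocking descriptors or poll-style waiting.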

If someone could describe how this new eject a person from read could
be done in a race safe way, then I'd say go ahead w/ it...  Otherwise
we're just moving the race around, and letting people think that they
have solved the problem when they haven't...

I think I remember another thread about this from a year or two ago,
but I couldn't find it...  If someone finds it, posting a link would
be nice..

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."

Re: mountroot event

2013-03-20 Thread John Baldwin

On Wednesday, March 20, 2013 4:30:27 am Andriy Gapon wrote:
> 
> I would like to propose the following change.
> 
> My understanding is that it was never a true intention to post 'mountroot'
> event on every set_rootvnode() call, but rather an accident in the original
> commit.  But I could be wrong here.  In either case I think that it is more
> appropriate to post the event only once.  I do not expect that there could
> be any consumers interested in all the details of root fs manipulations.

The firmware code only needs the final call for the "real" root.  I think your
change is fine.

-- 
John Baldwin

Re: rtprio_thread trouble

2013-03-06 Thread John Baldwin

On Thursday, February 28, 2013 2:59:16 pm Ian Lepore wrote:
> On Tue, 2013-02-26 at 15:29 -0500, John Baldwin wrote:
> > On Friday, February 22, 2013 2:06:00 pm Ian Lepore wrote:
> > > I ran into some trouble with rtprio_thread() today.  I have a worker
> > > thread that I want to run at idle priority most of the time, but if it
> > > falls too far behind I'd like to bump it back up to regular timeshare
> > > priority until it catches up.  In a worst case, the system is
> > > continuously busy and something scheduled at idle priority is never
> > > going to run (for some definition of 'never').  
> > > 
> > > What I found is that in this worst case, even after my main thread has
> > > used rtprio_thread() to change the worker thread back to
> > > RTP_PRIO_NORMAL, the worker thread never gets scheduled.  This is with
> > > the 4BSD scheduler but it appears that the same would be the case with
> > > ULE, based on code inspection.  I find that this fixes it for 4BSD, and
> > > I think the same would be true for ULE...
> > > 
> > > --- a/sys/kern/sched_4bsd.c   Wed Feb 13 12:54:36 2013 -0700
> > > +++ b/sys/kern/sched_4bsd.c   Fri Feb 22 11:55:35 2013 -0700
> > > @@ -881,6 +881,9 @@ sched_user_prio(struct thread *td, u_cha
> > >   return;
> > >   oldprio = td->td_user_pri;
> > >   td->td_user_pri = prio;
> > > + if (td->td_flags & TDF_BORROWING && td->td_priority <= prio)
> > > + return;
> > > + sched_priority(td, prio);
> > >  }
> > >  
> > >  void
> > > 
> > > But I'm not sure if this would have any negative side effects,
> > > especially since in the ULE case there's a comment on this function that
> > > specifically notes that it changes the user priority without
> > > changing the current priority (but it doesn't say why that matters).
> > > 
> > > Is this a reasonable way to fix this problem, or is there a better way?
> > 
> > This will lose the "priority boost" afforded to interactive threads when
> > they sleep in the kernel in the 4BSD scheduler.  You aren't supposed to
> > drop the user priority to lose this boost until userret().  You could
> > perhaps try only altering the priority if the new user pri is lower than
> > your current priority (and then you don't have to check TDF_BORROWING I
> > believe):
> > 
> >         if (prio < td->td_priority)
> >                 sched_priority(td, prio);
> > 
> 
> That's just the sort of insight I was looking for, thanks.  That made me
> look at the code more and think harder about the problem I'm trying to
> solve, and I concluded that doing it within the scheduler is all wrong.
> 
> That led me to look elsewhere, and I discovered the change you made in
> r228207, which does almost what I want, but your change does it only for
> realtime priorities, and I need a similar effect for idle priorities.
> What I came up with is a bit different than yours (attached below) and
> I'd like your thoughts on it.
> 
> I start with the same test as yours: if sched_user_prio() didn't
> actually change the user priority (due to borrowing), do nothing.  Then
> mine differs:  call sched_prio() to effect the change only if either the
> old or the new priority class is not timeshare.
> 
> My reasoning for the second half of the test is that if it's a change in
> timeshare priority then the scheduler is going to adjust that priority
> in a way that completely wipes out the requested change anyway, so
> what's the point?  (If that's not true, then allowing a thread to change
> its own timeshare priority would subvert the scheduler's adjustments and
> let a cpu-bound thread monopolize the cpu; if allowed at all, that
> should require privileges.)
> 
> On the other hand, if either the old or new priority class is not
> timeshare, then the scheduler doesn't make automatic adjustments, so we
> should honor the request and make the priority change right away.  The
> reason the old class gets caught up in this is the very reason I'm
> wanting to make a change:  when thread A changes the priority of its
> child thread B from idle back to timeshare, thread B never actually gets
> moved to a timeshare-range run queue unless there are some idle cycles
> available to allow it to first get scheduled again as an idle thread.
> 
> Finally, my change doesn't consider the td == curthread situation at
> all, because I don't see how that's germane.  This is the thing I'm
> least sure of -- I don't at all understand why the old code (even before
> your changes) had that test.  The old code had that flagged as "XXX
> dubious" (a comment a bit too cryptic to be useful).

I think your change is correct.  One style nit: please sort the order of
variables (oldclass comes before oldpri).

-- 
John Baldwin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"

Re: Need advice on sys5 shm and zero copy sockets

2013-03-05 Thread John Baldwin

On Monday, March 04, 2013 1:24:08 pm gary mazzaferro wrote:
> Hi,
> 
> Thanks for all the help..  Looks like I'll move forward with
> recommending linux as a base for our new cloud execution containers.
> 
> Personally, I thought freebsd would be a technically superior and
> longer term solution for scientific grid and cloud, but if I can't get
> support on best architectural practices, I'll need to move to
> something that will be supported during the eval, design and prototyping
> processes.

There is not anything in stock FreeBSD that currently does zero-copy sockets
for TCP.  You could add such a thing, but you would have to implement your
own. :(  Some of the building blocks are in place.  For example, you can
create POSIX shared memory objects via shm_open() (and FreeBSD has an extension
where a path of SHM_ANON creates anonymous, unnamed objects) and pass that fd
into the kernel where an ioctl handler can map it into KVA using shm_map()
and shm_unmap().  You'd have to change TCP to do something useful with this
buffer however.

-- 
John Baldwin

Re: looking for someone to fix humanize_number (test cases included)

2013-03-01 Thread John-Mark Gurney

Clifton Royston wrote this message on Tue, Dec 25, 2012 at 09:46 -1000:
> On Tue, Dec 25, 2012 at 08:23:55AM -1000, Clifton Royston wrote:
> > On Tue, Dec 25, 2012 at 07:20:37AM -1000, Clifton Royston wrote:
> > > On Mon, Dec 24, 2012 at 12:00:01PM +, freebsd-hackers-requ...@freebsd.org wrote:
> > > > From: John-Mark Gurney 
> > > > To: hack...@freebsd.org
> > > > Subject: looking for someone to fix humanize_number (test cases
> > > > included)
> > > > 
> > > > I'm looking for a person who is interested in fixing up humanize_number.
> > ...
> > > > So I decided to write a test program to test the output, and now I'm
> > > > even more surprised by the output...  Neither 7.2-R nor 10-current give
> > > > what I expect are the correct results...
> > ...
> > 
> >   I am bemused.
> 
>   I correct myself: the function works fine, and there are no bugs I
> could find, though it's clear the man page could emphasize the correct
> usage a bit more.
> 
>   I had to read the source several times and start on debugging it
> before I understood the correct usage of the flag values with the scale
> and flags parameters, despite the man page stating:
> 
>      The following flags may be passed in scale:
>
>            HN_AUTOSCALE     Format the buffer using the lowest multiplier
>                             possible.
>            HN_GETSCALE      Return the prefix index number (the number of
>                             times number must be divided to fit) instead of
>                             formatting it to the buffer.
>
>      The following flags may be passed in flags:
>
>            HN_DECIMAL       If the final result is less than 10, display it
>                             using one digit.
>            ...
>            HN_DIVISOR_1000  Divide number with 1000 instead of 1024.
> 
>   That is, certain flags must be passed in flags and others must only
> be passed in scale - a bit counter-intuitive.  Also, scale == 0 is
> clearly not interpreted as AUTOSCALE, but I am not yet clear how it is
> being handled - it seems somewhat like AUTOSCALE but not identical.
> 
>   When the test program constant table is updated to pass the scale
> flags as specified, as well as fixing the bugs mentioned in the
> previous emails, it all passes except for the one (intentional?)
> inconsistency that "k" is used in place of "K" if HN_DECIMAL is
> enabled.
> 
>   The bug in the transfer speed results which prompted this inquiry
> suggests that perhaps some clients of humanize_number in the codebase
> are also passing the scale parameters incorrectly.  I would propose
> accepting HN_AUTOSCALE and HN_GETSCALE in the flags field (they don't
> overlap with other values) while continuing to accept them in the scale
> field for backwards compatibility.  Trivial diff below.

Sorry I didn't get back to this, but now I have a few minutes...

> + getscale  = (flags | scale) & HN_GETSCALE;

This isn't good:
#define HN_IEC_PREFIXES 0x10
#define HN_GETSCALE 0x10

If someone sets HN_IEC_PREFIXES, they'll accidentally enable _GETSCALE..

We could do something annoying by changing the value of _GETSCALE, and
then leaving some legacy code to accept the old _GETSCALE on the scale
input...  This would let new code work, but would break new code on
old libraries...  So, I don't see an easy way to fix this...
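
To make the collision concrete, here is a minimal sketch that uses only the
two values quoted above.  The helper names are hypothetical and exist only
for illustration; the first mirrors the proposed
"(flags | scale) & HN_GETSCALE" test, the second models a check that looks
at the scale argument alone, as the man page excerpt describes.

```c
/* Values as quoted in the message above. */
#define HN_IEC_PREFIXES	0x10	/* documented as a 'flags' bit */
#define HN_GETSCALE	0x10	/* documented as a 'scale' bit */

/*
 * Hypothetical helper mirroring the proposed test:
 *     getscale = (flags | scale) & HN_GETSCALE;
 * Because both macros share the value 0x10, a caller that only asked
 * for IEC prefixes is indistinguishable from one asking for GETSCALE.
 */
static int
getscale_requested_merged(int flags, int scale)
{

	return (((flags | scale) & HN_GETSCALE) != 0);
}

/* Hypothetical model of a check that only consults 'scale'. */
static int
getscale_requested_scale_only(int scale)
{

	return ((scale & HN_GETSCALE) != 0);
}
```

With flags set to HN_IEC_PREFIXES and scale set to 0, the merged test
reports GETSCALE as requested while the scale-only test does not, which is
the accidental enable described above.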

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."

Re: why no per-thread scheduling niceness?

2013-02-26 Thread John Baldwin

On Friday, February 22, 2013 2:12:54 pm Ian Lepore wrote:
> I'm curious why the concept of scheduling niceness applies only to an
> entire process, and it's not possible to have nice threads within a
> process.  Is there any fundamental reason why it couldn't be supported
> with some extra bookkeeping to track niceness per thread?

Only that the existing 'nice' command only works on processes and nice is 
traditionally a process concept.  Also see things like renice.  Individual 
threads can already alter their priority somewhat (e.g. to set an individual 
thread to an idle or real-time priority).

-- 
John Baldwin

Re: rtprio_thread trouble

2013-02-26 Thread John Baldwin

On Friday, February 22, 2013 2:06:00 pm Ian Lepore wrote:
> I ran into some trouble with rtprio_thread() today.  I have a worker
> thread that I want to run at idle priority most of the time, but if it
> falls too far behind I'd like to bump it back up to regular timeshare
> priority until it catches up.  In a worst case, the system is
> continuously busy and something scheduled at idle priority is never
> going to run (for some definition of 'never').  
> 
> What I found is that in this worst case, even after my main thread has
> used rtprio_thread() to change the worker thread back to
> RTP_PRIO_NORMAL, the worker thread never gets scheduled.  This is with
> the 4BSD scheduler but it appears that the same would be the case with
> ULE, based on code inspection.  I find that this fixes it for 4BSD, and
> I think the same would be true for ULE...
> 
> --- a/sys/kern/sched_4bsd.c   Wed Feb 13 12:54:36 2013 -0700
> +++ b/sys/kern/sched_4bsd.c   Fri Feb 22 11:55:35 2013 -0700
> @@ -881,6 +881,9 @@ sched_user_prio(struct thread *td, u_cha
>   return;
>   oldprio = td->td_user_pri;
>   td->td_user_pri = prio;
> + if (td->td_flags & TDF_BORROWING && td->td_priority <= prio)
> + return;
> + sched_priority(td, prio);
>  }
>  
>  void
> 
> But I'm not sure if this would have any negative side effects,
> especially since in the ULE case there's a comment on this function that
> specifically notes that it changes the user priority without
> changing the current priority (but it doesn't say why that matters).
> 
> Is this a reasonable way to fix this problem, or is there a better way?

This will lose the "priority boost" afforded to interactive threads when they 
sleep in the kernel in the 4BSD scheduler.  You aren't supposed to drop the
user priority to lose this boost until userret().  You could perhaps try
only altering the priority if the new user pri is lower than your current
priority (and then you don't have to check TDF_BORROWING I believe):

if (prio < td->td_priority)
	sched_priority(td, prio);
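
A small userland model of that guard may help; it is only a sketch, not
scheduler code.  struct thread_model and the model_* functions are
hypothetical stand-ins for the kernel's struct thread, sched_priority() and
sched_user_prio(), and it assumes the usual FreeBSD convention that a
numerically smaller priority value is more urgent.

```c
/* Hypothetical stand-in for the two struct thread fields discussed. */
struct thread_model {
	unsigned char	td_priority;	/* current, possibly boosted, priority */
	unsigned char	td_user_pri;	/* user priority */
};

/* Hypothetical stand-in for sched_priority(): just records the value. */
static void
model_sched_priority(struct thread_model *td, unsigned char prio)
{

	td->td_priority = prio;
}

/*
 * Model of the suggested sched_user_prio() tail: always record the new
 * user priority, but only touch the current priority when the new value
 * is numerically lower (more urgent), so an existing boost is kept.
 */
static void
model_sched_user_prio(struct thread_model *td, unsigned char prio)
{

	td->td_user_pri = prio;
	if (prio < td->td_priority)
		model_sched_priority(td, prio);
}
```

In this model a thread currently running at 100 that is given user priority
140 keeps running at 100, while a request for 80 takes effect immediately.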

-- 
John Baldwin

Re: intr_event_handle never sends EOI if we fail to schedule ithread

2013-02-26 Thread John Baldwin

On Thursday, February 21, 2013 5:10:44 pm Ryan Stone wrote:
> I recently saw an issue where all interrupts from a particular interrupt
> vector were never raised.  After investigating it appears that I ran into
> the bug fixed in r239095[1], where an interrupt handler that uses an
> ithread was added to the list of interrupt handlers for a particular event
> before the ithread was allocated.  If an interrupt for this event
> (presumably for a different device sharing the same interrupt line) comes
> in after the handler is added to the list but before the ithread has been
> allocated we hit the following code:
> 
> 	if (thread) {
> 		if (ie->ie_pre_ithread != NULL)
> 			ie->ie_pre_ithread(ie->ie_source);
> 	} else {
> 		if (ie->ie_post_filter != NULL)
> 			ie->ie_post_filter(ie->ie_source);
> 	}
> 
> 	/* Schedule the ithread if needed. */
> 	if (thread) {
> 		error = intr_event_schedule_thread(ie);
> #ifndef XEN
> 		KASSERT(error == 0, ("bad stray interrupt"));
> #else
> 		if (error != 0)
> 			log(LOG_WARNING, "bad stray interrupt");
> #endif
> 	}
> 
> thread is true, so we will not run ie_post_filter (which would send the
> EOI).  However because the ithread has not been allocated
> intr_event_schedule_thread will return an error.  If INVARIANTS is not
> defined we skip the KASSERT and return.
> 
> Now, r239095 fixes this scenario, but I think that we should call
> ie_post_filter whenever intr_event_schedule_thread fails to ensure that we
> don't block an interrupt vector indefinitely.  Any comments?

Actually, I think you want to call post_ithread as you've already called
pre_ithread?  Also, pre_ithread should already EOI the interrupt, the problem
is that it leaves it masked, and you need to invoke post_ithread to unmask it.
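
As a rough illustration of that suggestion, the following userland model
runs the post_ithread hook when scheduling fails.  Everything here is a
hypothetical stand-in (struct source_model, struct intr_event_model, the
model_* functions and the has_ithread flag); it assumes only what is stated
above, namely that pre_ithread leaves the source masked and post_ithread
unmasks it.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical interrupt source: only tracks whether it is masked. */
struct source_model {
	int	masked;
};

/* Cut-down stand-in for struct intr_event; only the fields used here. */
struct intr_event_model {
	void	(*ie_pre_ithread)(void *);
	void	(*ie_post_ithread)(void *);
	void	(*ie_post_filter)(void *);
	void	*ie_source;
	int	 has_ithread;	/* 0 models "ithread not allocated yet" */
};

static void
model_mask(void *arg)
{

	((struct source_model *)arg)->masked = 1;
}

static void
model_unmask(void *arg)
{

	((struct source_model *)arg)->masked = 0;
}

/* Stand-in for intr_event_schedule_thread(): fails without an ithread. */
static int
model_schedule_thread(struct intr_event_model *ie)
{

	return (ie->has_ithread ? 0 : EINVAL);
}

static int
model_intr_event_handle(struct intr_event_model *ie, int thread)
{
	int error;

	error = 0;
	if (thread) {
		if (ie->ie_pre_ithread != NULL)
			ie->ie_pre_ithread(ie->ie_source);
	} else {
		if (ie->ie_post_filter != NULL)
			ie->ie_post_filter(ie->ie_source);
	}

	if (thread) {
		error = model_schedule_thread(ie);
		/* No handler will run, so undo what pre_ithread masked. */
		if (error != 0 && ie->ie_post_ithread != NULL)
			ie->ie_post_ithread(ie->ie_source);
	}
	return (error);
}
```

In the failure case the source ends up unmasked again; in the success case
it stays masked, on the assumption that the ithread itself would invoke
post_ithread once it has run the handlers.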

-- 
John Baldwin

Re: libprocstat(3): retrieve process command line args and environment

2013-02-20 Thread John Baldwin

On Tuesday, February 12, 2013 4:50:54 pm Mikolaj Golub wrote:
> On Fri, Jan 25, 2013 at 03:31:43PM -0500, John Baldwin wrote:
> 
> > BTW, one off-ball thought I have is that I would like to have a mode where
> > libprocstat operates on a core file (of a process, not a kernel crash
> > dump), so it could list the threads from a core dump, and possibly file
> > descriptor info (if PR kern/173723 is implemented).
> > 
> > We certainly could have a 'raw' mode where it spat out name: value or XML
> > of the entire kinfo_proc perhaps.
> > 
> 
> It looks very interesting! Do you mean something like this?

Yes, exactly like this!  Very nice indeed.

> root@lisa:/root # sh -c 'kill -5 $$'
> Trace/BPT trap (core dumped)
> root@lisa:/root # ls -l sh.core 
> -rw---  1 root  wheel  8790016 Feb 12 21:17 sh.core
> root@lisa:/root # procstat sh.core 
>   PID  PPID  PGID   SID  TSID THR LOGIN    WCHAN     EMUL          COMM
>   674   657   674   657   657   1 root     -         FreeBSD ELF32 sh
> root@lisa:/root # procstat -f sh.core 
>   PID COMM   FD T V FLAGS REF  OFFSET PRO NAME
>   674 sh   text v r r   -   - -   /bin/sh   
>   674 sh   ctty v c rw---   -   - -   /dev/pts/0
>   674 shcwd v d r   -   - -   /root 
>   674 sh   root v d r   -   - -   / 
>   674 sh  0 v c rw---   82537 -   /dev/pts/0
>   674 sh  1 v c rw---   82537 -   /dev/pts/0
>   674 sh  2 v c rw---   82537 -   /dev/pts/0
> root@lisa:/root # procstat -v sh.core
>   PID  STARTEND PRT  RES PRES REF SHD   FL TP PATH
>   674  0x8048000  0x8064000 r-x   280   1   0 CN-- vn /bin/sh
>   674  0x8064000  0x8066000 rw-20   1   0  df 
>   674 0x28064000 0x2807a000 r-x   220  17   0 CN-- vn /libexec/ld-elf.so.1
>   674 0x2807a000 0x28083000 rw-90   1   0  df 
>   674 0x28084000 0x280a3000 r-x   31   32   2   1 CN-- vn /lib/libedit.so.7
>   674 0x280a3000 0x280a5000 rw-20   1   0 C--- vn /lib/libedit.so.7
>   674 0x280a5000 0x280a7000 rw-00   0   0  -- 
>   674 0x280a7000 0x280e r-x   570   4   2 CN-- vn /lib/libncurses.so.8
>   674 0x280e 0x280e3000 rw-30   1   0 C--- vn /lib/libncurses.so.8
>   674 0x280e3000 0x28213000 r-x  3040  34  17 CN-- vn /lib/libc.so.7
>   674 0x28213000 0x2821a000 rw-70   1   0 C--- vn /lib/libc.so.7
>   674 0x2821a000 0x28243000 rw-   160   2   0  df 
>   674 0x2840 0x28c0 rw-   240   2   0  df 
>   674 0xbfbdf000 0xbfbff000 rwx30   1   0 ---D df 
>   674 0xbfbff000 0xbfc0 r-x00  20   0 CN-- ph 
> 
> Here is my attempt to implement it:
> 
> http://people.freebsd.org/~trociny/procstat_core.1.patch
> 
> The patch needs much work yet, especially the userland part, but looks
> like it is good enough for demonstration purposes to discuss how this
> should be done properly.
> 
> So, procstat data is stored in a core as note sections with name
> "FreeBSD" and types NT_PROCSTAT_PROC, NT_PROCSTAT_FILES, ...
> 
> The current format of notes is a header of sizeof(int) and data in the
> format as it is returned by a related sysctl call. I think the
> header should provide some versioning and for the cases I implemented
> (kinfo_proc, kinfo_file, kinfo_vmentry) it contains a value of the
> corresponding kinfo struct size (e.g. KINFO_VMENTRY_SIZE). It might be
> not the best solution and I would be glad for suggestions.

I think including the size is good and is probably sufficient for
versioning.
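
One way a reader could use the size header as a version check is sketched
below.  This is an assumption-laden illustration, not the format from the
patch: struct kinfo_model is a hypothetical stand-in for a kinfo_*
structure, the payload is assumed to be a single int header followed by
fixed-size records, and note_record_count() is an invented helper.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record standing in for a kinfo_* structure. */
struct kinfo_model {
	int	km_pid;
	int	km_flags;
};

/*
 * Parse a note payload laid out as: int structsize, then records.
 * Returns the number of records, or -1 if the header does not match
 * the structure size this reader was compiled with (treated here as
 * a version mismatch) or the payload length is inconsistent.
 */
static int
note_record_count(const void *buf, size_t len)
{
	int structsize;

	if (len < sizeof(structsize))
		return (-1);
	memcpy(&structsize, buf, sizeof(structsize));
	if (structsize != (int)sizeof(struct kinfo_model))
		return (-1);
	len -= sizeof(structsize);
	if (len % sizeof(struct kinfo_model) != 0)
		return (-1);
	return ((int)(len / sizeof(struct kinfo_model)));
}
```

A reader built against a different structure size rejects the note instead
of misinterpreting it, which is the property the size header is meant to
provide.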

> (BTW, why don't we have constants like KINFO_VMENTRY_SIZE defined for
> all archs?)

I cannot speak to that.

> To avoid code duplication I changed the code of kinfo sysctl handlers
> to output kinfo data to sbuf instead of calling SYSCTL_OUT directly so
> these functions could be used by both sysctl handlers and the coredump
> routine.

Ok.
 
> Another thing I am not sure about is writing procstat data in the
> coredump routine. The coredump routine on the first pass calculates
> core sizes and on the second pass does actual writing. I added
> procstat in that way that procstat data is collected on the first run
> to internal buffers and on the second pass is copied from the buffers
> to the core. I could do this another way, e.g running kern_proc_*out()
> twice, on the first pass with a tiny buffer and a drain routine that
> would calculate data length, but this looks less efficient,
> complicates things and currently I

Re: Looking for reviewers for patch that adds foreign disk support mfiutil

2013-02-20 Thread John Baldwin

On Tuesday, February 19, 2013 6:49:52 pm Steven Hartland wrote:
> - Original Message - 
> From: "John Baldwin"
> 
> Thanks for the feedback John appreciated, a couple of questions inline
> below if you would be so kind.

Certainly.

> > - Is dump_config() really the right choice for 'foreign config'?  It doesn't
> >  attempt to output things very pretty, and I think mfiutil's non-debug
> >  commands should aim to be human readable.
> 
> Will check this, just didn't want to re-invent the wheel ;-)

Heh, can you reuse the show_config code instead perhaps?  It might be useful if
you could provide an example of the current 'foreign config' output?

> > - This (human readable) is also why it doesn't include the opcode in the
> >  error message by default.  Sysadmins don't really care which opcode fails.
> >  Maybe put that under '#ifdef DEBUG'?
> 
> Previously there was no information about what command failed, which made
> the failure message kinda useless, so while debugging I added the opcode
> to help me trace things.

In general my goal had been to make the caller provide that level of detail if
it is useful.  While developing a command it can indeed be useful which is why
I suggested moving it under #ifdef DEBUG.  This provides the extra detail while
working on a command while keeping the UI for users clean.  If it is under DEBUG
you can just print the raw opcode in hex as you are doing now.

> > - mfireg.h should be kept in sync with the driver's version of that header,
> >  so don't reorder the enums unless you are changing it to match what is in
> >  the device driver's mfireg.h.  In fact, mfiutil should probably be using
> >  the mfireg.h from sys/dev/mfi directly now that it is in the tree.
> >  (mfiutil was originally developed outside of the tree as a standalone app)
> 
> There is only one mfireg.h and that is already in sys/dev/mfi :)

Oh, bah.  I misread the diff.  Reordering the commands looks good to me in that
case.

> > - Please don't do assignments in declarations and leave a blank line between
> >  declarations and the body of code.  Thus:
> > 
> > mfi_op_desc(...)
> > {
> > int i, num_ops;
> > 
> > num_ops = nitems(mfi_op_codes);
> > ...
> > 
> >  (nitems() is nice to use when it is available as well)
> 
> Changed. Is this the case for constant initialisers too? e.g. is the
> following incorrect or acceptable?
> myfunc(...)
> {
> int i = 0, j = 1;
> ...

style(9) forbids those as well (and I generally avoid them myself), but you
will find code in the tree that does use initializers for simple expressions.
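
For reference, a complete function written in the layout discussed above
might look like the following sketch.  The table contents and the
example_* names are hypothetical (the real mfi_op_codes table is not shown
in this thread), and nitems() is defined locally only in case the system
headers do not already provide it.

```c
#include <stddef.h>

#ifndef nitems
#define	nitems(x)	(sizeof((x)) / sizeof((x)[0]))
#endif

/* Hypothetical opcode table used only for this illustration. */
static const struct {
	unsigned int	 opcode;
	const char	*desc;
} example_op_codes[] = {
	{ 0x01, "EXAMPLE_ONE" },
	{ 0x02, "EXAMPLE_TWO" },
};

static const char *
example_op_desc(unsigned int opcode)
{
	size_t i, num_ops;	/* declarations only, no initializers */

	/* Blank line above, then assignments in the body of the code. */
	num_ops = nitems(example_op_codes);
	for (i = 0; i < num_ops; i++) {
		if (example_op_codes[i].opcode == opcode)
			return (example_op_codes[i].desc);
	}
	return (NULL);
}
```

The same function with "size_t i = 0, num_ops = nitems(...);" would behave
identically; the difference is purely one of house style.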

-- 
John Baldwin

Re: Looking for reviewers for patch that adds foreign disk support mfiutil

2013-02-19 Thread John Baldwin

On Sunday, February 17, 2013 1:06:40 pm Steven Hartland wrote:
> Hi all I'm looking for someone to review the attached patch
> to mfiutil which adds foreign disk support to mfiutil as
> per:
> http://www.freebsd.org/cgi/query-pr.cgi?pr=172091
> 
> Any and all feedback welcome :)

Some suggestions:

- Please stick with FreeBSD style, e.g. please use:

if (foo == NULL)

  rather than:

if (NULL == foo)

  I understand the reasons for the latter style (turn accidental assignments
  into compile errors) but I don't buy them because 1) modern compilers can
  already catch such things, but most importantly 2) it doesn't read
  correctly.  Above all else code should be readable, and one doesn't say
  "if NULL the pointer is" (unless one is Yoda), but "if the pointer is NULL".
- Don't make dump_config() use a default prefix, just fix the existing call
  to dump_config() to pass in a prefix.
- Is dump_config() really the right choice for 'foreign config'?  It doesn't
  attempt to output things very pretty, and I think mfiutil's non-debug
  commands should aim to be human readable.
- This (human readable) is also why it doesn't include the opcode in the error
  message by default.  Sysadmins don't really care which opcode fails.  Maybe
  put that under '#ifdef DEBUG'?
- mfireg.h should be kept in sync with the driver's version of that header, so
  don't reorder the enums unless you are changing it to match what is in the
  device driver's mfireg.h.  In fact, mfiutil should probably be using the
  mfireg.h from sys/dev/mfi directly now that it is in the tree.  (mfiutil
  was originally developed outside of the tree as a standalone app)
- Leaving out the 'MFI_DCMD_' prefix from the opcode description was
  intentional.  If you are ever fortunate enough to examine the manuals from
  LSI, they refer to the firmware commands as 'LD_CONFIG', etc.  (Maybe it's
  'MR_LD_CONFIG'?)  The MFI_DCMD_ prefix is specific to the FreeBSD driver.
- Please don't do assignments in declarations and leave a blank line between
  declarations and the body of code.  Thus:

	mfi_op_desc(...)
	{
		int i, num_ops;

		num_ops = nitems(mfi_op_codes);
		...

  (nitems() is nice to use when it is available as well)
- Reindent the call to mfi_ldprobe() if CFG_ADD or CFG_FOREIGN_IMPORT
  succeeds.

-- 
John Baldwin

Re: Mbuf memory handling

2013-02-06 Thread John Baldwin

On Wednesday, February 06, 2013 2:41:32 pm Lino Sanfilippo wrote:
> John, Jacques,
> 
> thank you very much for your help. An mbuf cluster seems to be the right
> direction.
> So I would have to do something like
> 
> mbuf = m_getjcl(how, MT_DATA, M_PKTHDR, MJUMPAGESIZE);
> left_for_next_rcv = m_split(mbuf, chunksize);
> if_input(ifp, mbuf);
> 
> right?
> 
> >I agree that read-only buffers may be ok in this case but would like to
> >point out that the M_WRITABLE() macro will evaluate to 0 if the refcount on
> >the cluster is >1
> 
> The fact that the resulting mbufs returned by m_split() are not writeable
> any more is indeed a problem:
> What I would like to do is keep the 'left_for_next_rcv' mbuf until the next
> packet arrives and then fill it with the next packet's data only up to
> 'chunksize', split it again to pass the new mbuf to the protocol stack and
> so on until 'left_for_next_rcv' becomes too small to be split further.
> Only then I would want to allocate a new "fresh" jumbo sized mbuf. Is it
> possible to realize this with cluster mbufs?

They are only read-only in the sense that you can't call routines like
m_pullup() or m_prepend(), etc.  Your device should still be able to DMA
into the buffer, but once the buffer is passed up to the stack the stack
can't mess with it.  This is probably what you want anyway as you wouldn't
want the stack appending to a buffer and spilling over into the cluster
where your device is going to DMA the next packet.

-- 
John Baldwin

Re: Mbuf memory handling

2013-02-06 Thread John Baldwin

On Wednesday, February 06, 2013 10:20:50 am Jacques Fourie wrote:
> On Wed, Feb 6, 2013 at 3:36 PM, John Baldwin  wrote:
> 
> > On Wednesday, February 06, 2013 4:50:39 am Lino Sanfilippo wrote:
> > >
> > > Hi all,
> > >
> > > I want to implement a device driver for a NIC which stores received data
> > > into chunks within a page (>=4k) in host memory. One page shall be used
> > > for multiple packets and freed after all mbufs linked to that page have
> > > been processed. So I would like to know what is the recommended way to
> > > handle this in FreeBSD? Any hints are very appreciated.
> >
> > I think you can get what you want by allocating M_JUMBOP mbuf clusters for
> > your receive buffers.  When you want to split out a packet, allocate a new
> > packet header mbuf and use m_split() to let it take over the rest of the 4k
> > buffer and pass the original mbuf up to if_input() as the new packet.  The
> > new mbufs you attach to the cluster via m_split() will all hold a reference
> > on the backing cluster and it won't be freed until all the mbufs are freed.
> >
> The resulting mbufs will not be writeable (M_WRITABLE() will evaluate to
> 0), right? I don't know if this will be an issue in this particular
> application.

No, they only propagate an existing M_RDONLY flag:

n->m_flags |= m->m_flags & M_RDONLY;

If the first mbuf is writable the splits remain writable from my reading
of the code.  OTOH, I think in this case read-only buffers passed up to
the stack are probably fine since they are already contiguous so any
pullup should be a NOP, etc.

-- 
John Baldwin

Re: Mbuf memory handling

2013-02-06 Thread John Baldwin

On Wednesday, February 06, 2013 4:50:39 am Lino Sanfilippo wrote:
> 
> Hi all,
> 
> I want to implement a device driver for a NIC which stores received data into
> chunks within a page (>=4k) in host memory. One page shall be used for
> multiple packets and freed after all mbufs linked to that page have been
> processed. So I would like to know what is the recommended way to handle
> this in FreeBSD? Any hints are very appreciated.

I think you can get what you want by allocating M_JUMBOP mbuf clusters for
your receive buffers.  When you want to split out a packet, allocate a new
packet header mbuf and use m_split() to let it take over the rest of the 4k
buffer and pass the original mbuf up to if_input() as the new packet.  The
new mbufs you attach to the cluster via m_split() will all hold a reference
on the backing cluster and it won't be freed until all the mbufs are freed.
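
The lifetime rule in that last sentence can be modelled in userland as
follows.  This is not mbuf code: struct cluster_model, struct chunk_model
and the three functions are hypothetical, and malloc()/free() stand in for
the kernel allocators.  It only illustrates that each piece split off the
buffer takes a reference and that the backing store goes away with the
last piece.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical backing store shared by all chunks cut from it. */
struct cluster_model {
	unsigned char	*cl_buf;
	int		 cl_refcnt;
	int		*cl_freed;	/* bumped when cl_buf is released */
};

/* Hypothetical descriptor for one packet-sized window into a cluster. */
struct chunk_model {
	struct cluster_model	*ch_cluster;
	size_t			 ch_off;
	size_t			 ch_len;
};

/* Allocate a cluster and return one chunk covering all of it. */
static struct chunk_model *
cluster_alloc(size_t size, int *freed)
{
	struct cluster_model *cl;
	struct chunk_model *ch;

	cl = malloc(sizeof(*cl));
	ch = malloc(sizeof(*ch));
	if (cl == NULL || ch == NULL) {
		free(cl);
		free(ch);
		return (NULL);
	}
	cl->cl_buf = malloc(size);
	if (cl->cl_buf == NULL) {
		free(cl);
		free(ch);
		return (NULL);
	}
	cl->cl_refcnt = 1;
	cl->cl_freed = freed;
	ch->ch_cluster = cl;
	ch->ch_off = 0;
	ch->ch_len = size;
	return (ch);
}

/* Keep the first 'len' bytes in 'ch'; return a new chunk for the rest. */
static struct chunk_model *
chunk_split(struct chunk_model *ch, size_t len)
{
	struct chunk_model *rest;

	if (len > ch->ch_len)
		return (NULL);
	rest = malloc(sizeof(*rest));
	if (rest == NULL)
		return (NULL);
	rest->ch_cluster = ch->ch_cluster;
	rest->ch_off = ch->ch_off + len;
	rest->ch_len = ch->ch_len - len;
	ch->ch_len = len;
	ch->ch_cluster->cl_refcnt++;	/* the new chunk holds a reference */
	return (rest);
}

/* Drop one chunk; the cluster is released with its last reference. */
static void
chunk_free(struct chunk_model *ch)
{
	struct cluster_model *cl;

	cl = ch->ch_cluster;
	if (--cl->cl_refcnt == 0) {
		(*cl->cl_freed)++;
		free(cl->cl_buf);
		free(cl);
	}
	free(ch);
}
```

Splitting a 4096-byte cluster at 1500 bytes leaves one chunk of 1500 bytes
to hand up the stack and one of 2596 bytes for the next receive; freeing
only the first does not release the buffer.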

-- 
John Baldwin

Re: kgdb modules

2013-02-04 Thread John Baldwin

On Monday, February 04, 2013 6:43:43 am Andriy Gapon wrote:
> on 04/02/2013 12:36 Matt Burke said the following:
> > How do I get kgdb to load kernel modules from somewhere other than
> > /boot/kernel?
> > 
> > Googling tells me I need to use asf to create a file, but I haven't managed
> > to figure out how to get kgdb use the output.
> 
> Research in the direction of set sysroot, solib-absolute-prefix,
> solib-search-path.  I would not be surprised if the ancient gdb version on
> which kgdb is based does not support some of these settings.

It supports at least some of those.  You can also load modules manually by
using the add-kld command (give it a full path to an individual module).  You
may need to use 'nosharedlibrary' to unload symbols from the "wrong" module
before add-kld will be useful however.

-- 
John Baldwin

Re: Testing SIOCADDMULTI?

2013-02-01 Thread John Baldwin

On Friday, February 01, 2013 1:23:26 am Tim Kientzle wrote:
> >> Would still appreciate any suggestions for how to test these.
> > 
> > You can write a simple app to listen for UDP packets and have it join a
> > multicast group and have another machine on the same network write a packet
> > to the multicast group.
> 
> I tried this first, but  the test program worked fine even
> without ADDMULTI/DELMULTI support.   Watching
> tcpdump -e, it appears that IP4 multicast UDP uses
> broadcast at the Ethernet layer.

Were you running tcpdump?  You have to use tcpdump -p to avoid putting
the chip into promiscuous mode if so (promiscuous mode causes the NIC to
receive all multicast regardless of the filters assuming that your
driver supports it correctly).

-- 
John Baldwin

Re: removing plip from GENERIC

2013-01-31 Thread John Baldwin

On Wednesday, January 30, 2013 5:09:56 pm Alfred Perlstein wrote:
> Does plip no longer work?

I had one user ("bf") who verified that plip(4) still worked in March of 2009
after my locking changes to ppc/ppbus.  OTOH, I doubt it is very widely used
at all.  I would probably only leave it in GENERIC for i386 and pc98 if 
anywhere.  Of course, there are many far, far older drivers (like just about
everything that is ISA but doesn't include built-in LPC stuff like psm, 
atkbdc, uart, and ppc) that should be removed from GENERIC before plip.
There are machines that may have a ppc port that do not have any ISA slots.

-- 
John Baldwin

Re: Testing SIOCADDMULTI?

2013-01-28 Thread John Baldwin

On Sunday, January 27, 2013 1:51:12 am Tim Kientzle wrote:
> 
> On Jan 26, 2013, at 3:56 PM, Tim Kientzle wrote:
> 
> > My next TODO items for this network driver is to implement
> > the SIOCADDMULTI and SIOCDELMULTI ioctls.
> 
> Looking through other drivers (and net/if.c), I've
> managed to implement ADDMULTI by adding
> the multicast ethernet address to the list maintained
> by the controller.
> 
> DELMULTI seems trickier.   Since if.c does not pass
> the specific address being removed down to the
> driver, it looks like I have no choice but to remove
> every multicast address from the controller and then
> re-insert all of the ones that are still valid.
> 
> (This controller doesn't use a hash filter; it uses
> a list of valid multicast addresses.)
> 
> Is there a better approach?

You should always reprogram the full table while holding if_maddr_rlock().
All the ioctl's tell you is that an entry was added or removed from that
list.  There is currently no race-free way for the stack to tell you which
specific address to add or remove.
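
A rough userland model of "reprogram the full table" for a controller that
keeps a list of exact addresses is shown below.  The slot count, the
struct and the function are all hypothetical; in a real driver the list
would be walked while holding if_maddr_rlock() as described above, and the
overflow fallback mentioned in the comment is an assumption about what a
driver might do, not something stated in this thread.

```c
#include <string.h>

#define	MODEL_ETHER_ADDR_LEN	6
#define	MODEL_HW_SLOTS		8	/* hypothetical filter capacity */

/* Hypothetical image of the controller's exact-match filter. */
struct hw_filter_model {
	unsigned char	slot[MODEL_HW_SLOTS][MODEL_ETHER_ADDR_LEN];
	int		used;
};

/*
 * Rebuild the whole filter from the current address list.  The same
 * routine serves both "an address was added" and "an address was
 * removed", since neither notification says which address changed.
 * Returns nonzero if the list did not fit; a driver might then fall
 * back to accepting all multicast (an assumption, see above).
 */
static int
model_setmulti(struct hw_filter_model *hw,
    unsigned char (*list)[MODEL_ETHER_ADDR_LEN], int n)
{
	int i;

	memset(hw, 0, sizeof(*hw));	/* drop every old entry first */
	for (i = 0; i < n && i < MODEL_HW_SLOTS; i++)
		memcpy(hw->slot[i], list[i], MODEL_ETHER_ADDR_LEN);
	hw->used = i;
	return (n > MODEL_HW_SLOTS);
}
```

Calling the routine again with a shorter list is all that a removal needs;
stale entries cannot survive because the table is cleared on every call.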

> > I'm not quite sure what they do, though, and have
> > no idea how to test them to see if they are working
> > correctly.
> 
> Would still appreciate any suggestions for how to test these.

You can write a simple app to listen for UDP packets and have it join a 
multicast group and have another machine on the same network write a packet to 
the multicast group.

However, a simpler test is to toggle the sysctl to enable multicast ping 
replies and to ping a multicast address from another machine after joining it 
on the test machine using mtest.

-- 
John Baldwin

Re: some questions on kern_linker and pre-loaded modules

2013-01-28 Thread John Baldwin

On Saturday, January 26, 2013 6:52:01 am Andriy Gapon wrote:
> 
> I.
> It seems that linker_preload checks a module for being a duplicate module
> only if the module has MDT_VERSION.
> 
> This is probably designed to allow different versions of the same module to
> co-exist (for some definition of co-exist)?

Yes, that is likely true, but it is a bit dubious.

> But, OTOH, this doesn't work well if the module is version-less (no
> MODULE_VERSION in the code) and is pre-loaded twice (e.g. once in kernel and
> once in a preloaded file).

Yes.

> At present a good example of this is zfsctrl module, which could be present
> both in kernel (options ZFS) and in zfs.ko.
> 
> I haven't thought about any linker-level resolution for this issue.
> I've just tried to plug the ZFS hole for now.

I think we should require all modules declared by DECLARE_MODULE() to have a
version.  You might be able to enforce that by failing to register a linker
file if it contains any modules that do not include at least one version
metadata note in the same linker file.  You could check this before running
the MOD_LOAD handlers (though after running SYSINITs).  Truly fixing this would
mean making module data true metadata that is parsed by the linker rather than
having it all provided to the kernel via SYSINITs so that you could evaluate
this before running SYSINITs.  That is a larger project however.  I think your
fix for zfsctrl is correct.

> II.
> It seems that linker_file_register_modules() for the kernel is called after
> linker_file_register_modules() is called for all the pre-loaded files.
> linker_file_register_modules() for the kernel is called from
> linker_init_kernel_modules via SYSINIT(SI_SUB_KLD, SI_ORDER_ANY) and that
> happens after linker_preload() which is executed via SYSINIT(SI_SUB_KLD,
> SI_ORDER_MIDDLE).
> 
> Perhaps this is designed to allow modules in the preloaded files to override
> modules compiled into the kernel?

Yes, likely so.

> But this doesn't seem to work well.
> Because modules from the kernel are not registered yet,
> linker_file_register_modules() would be successful for the duplicate modules
> in a preloaded file and thus any sysinits present in the file will also be
> registered.
> So, if the module is present both in the kernel and in the preloaded file and
> the module has a module event handler (modeventhand_t), then the handler will
> be registered and called twice.

Yes, I think it is too hard at present to safely allow a linker file to
override the same module in a kernel, so the duplicate linker file should
just be rejected entirely.  I'm not sure if your change is completely
correct for that.
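
One way to picture "reject the duplicate linker file entirely" is an
all-or-nothing registration: check every module name in the file first, and
commit only if none collides.  The following is a userland sketch, not the
kern_linker code; `struct registry`, `registry_has()` and `register_file()`
are hypothetical names, and it assumes the kernel's own modules are already
registered when the preloaded file is examined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MODULES 32

/* Hypothetical flat registry of module names that are already known. */
struct registry {
	const char *names[MAX_MODULES];
	size_t count;
};

static int
registry_has(const struct registry *r, const char *name)
{
	for (size_t i = 0; i < r->count; i++)
		if (strcmp(r->names[i], name) == 0)
			return (1);
	return (0);
}

/*
 * Register all modules of one linker file, or none of them.
 * Returns 0 on success, -1 if any module is a duplicate or the
 * registry would overflow; on failure the registry is unchanged.
 */
static int
register_file(struct registry *r, const char *const *mods, size_t nmods)
{
	assert(r != NULL && mods != NULL);

	/* Pass 1: refuse the whole file on the first collision. */
	for (size_t i = 0; i < nmods; i++)
		if (registry_has(r, mods[i]))
			return (-1);
	if (r->count + nmods > MAX_MODULES)
		return (-1);

	/* Pass 2: nothing collided, so commit every module. */
	for (size_t i = 0; i < nmods; i++)
		r->names[r->count++] = mods[i];
	return (0);
}
```

Because nothing is committed until the whole file has been checked, a
rejected file leaves no half-registered modules (or, by analogy, sysinits)
behind.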

-- 
John Baldwin

Re: libprocstat(3): retrieve process command line args and environment

2013-01-25 Thread John Baldwin

On Friday, January 25, 2013 1:37:45 pm Robert N. M. Watson wrote:
> 
> On 24 Jan 2013, at 16:20, John Baldwin wrote:
> 
> >>> Hmm, are you going to rewrite ps(1) to use libprocstat?  Or rather, is
> >>> that a goal someday?  That is one current consumer of kvm_getargv/envv.
> >>> That might be fine if we want to make more tools use libprocstat instead
> >>> of using libkvm directly.
> >> 
> >> I didn't have any plans for ps(1) :-) That is why I wrote about "new
> >> code". But if you think it is good to do I might look at it one day...
> > 
> > I'm mostly hoping Robert chimes in to see if that was his intention for
> > libprocstat. :)  If we can ultimately replace all uses of kvm_get*v() with
> > calls to procstat_get*v*() then I'm fine with some code duplication in the
> > interim.
> 
> 
> Originally there was just procstat(1), but it made sense to begin
> re-encapsulating it in a libprocstat(3) because the code there is
> potentially extremely reusable. This conflicts a bit with libkvm(3), which
> mysteriously knows about sysctlbyname(3) despite a name suggesting
> otherwise. You can imagine various approaches to fixing this, but indeed,
> making libprocstat(3) the first-class citizen and preferring it for both
> kvm and sysctl methods sounds like the way to go. I actually want to make
> libprocstat also support snmp, but I've never actually found the time to
> investigate doing that. One of my main unmet goals for procstat(1) was to
> introduce an extremely machine-readable output format for it -- e.g.,
> something XML-based or similar. I'd still love to see that happen.

BTW, one off-ball thought I have is that I would like to have a mode where 
libprocstat operates on a core file (of a process, not a kernel crash dump), 
so it could list the threads from a core dump, and possibly file descriptor 
info (if PR kern/173723 is implemented).

We certainly could have a 'raw' mode where it spat out name: value or XML
of the entire kinfo_proc perhaps.

-- 
John Baldwin

Re: libprocstat(3): retrieve process command line args and environment

2013-01-24 Thread John Baldwin

On Wednesday, January 23, 2013 4:49:50 pm Mikolaj Golub wrote:
> On Wed, Jan 23, 2013 at 11:31:43AM -0500, John Baldwin wrote:
> > On Wednesday, January 23, 2013 2:25:00 am Mikolaj Golub wrote:
> > > IMHO, after adding procstat_getargv and procstat_getenvv, the usage of
> > > kvm_getargv() and kvm_getenvv() (at least in the new code) may be
> > > deprecated. As this is stated in the man page, BUGS section, "these
> > > routines do not belong in the kvm interface". I suppose they are part
> > > of libkvm because there was no better place for them. procstat(1)
> > > prefers direct sysctl to them (so, again, code duplication, which I am
> > > going to remove adding procstat_getargv/envv).
> > 
> > Hmm, are you going to rewrite ps(1) to use libprocstat?  Or rather, is
> > that a goal someday?  That is one current consumer of kvm_getargv/envv.
> > That might be fine if we want to make more tools use libprocstat instead
> > of using libkvm directly.
> 
> I didn't have any plans for ps(1) :-) That is why I wrote about "new
> code". But if you think it is good to do I might look at it one day...

I'm mostly hoping Robert chimes in to see if that was his intention for
libprocstat. :)  If we can ultimately replace all uses of kvm_get*v() with
calls to procstat_get*v*() then I'm fine with some code duplication in the
interim.

-- 
John Baldwin

Re: NMI watchdog functionality on Freebsd

2013-01-24 Thread John Baldwin

On Wednesday, January 23, 2013 11:57:33 am Ian Lepore wrote:
> On Wed, 2013-01-23 at 08:47 -0800, Matthew Jacob wrote:
> > On 1/23/2013 7:25 AM, John Baldwin wrote:
> > > On Tuesday, January 22, 2013 5:40:55 pm Sushanth Rai wrote:
> > >> Hi,
> > >>
> > > >> Does freebsd have some functionality similar to Linux's NMI watchdog?
> > > >> I'm aware of ichwd driver, but that depends on WDT to be available in
> > > >> the hardware. Even when it is available, BIOS needs to support a
> > > >> mechanism to trigger an OS level recovery to get any useful
> > > >> information when system is really wedged (with interrupt disabled)
> > > The principal purpose of a watchdog is to keep the system from hanging.
> > Information is secondary. The ichwd driver can use the LPC part of ICH 
> > hardware that's been there since ICH version 4. I implemented this more 
> > fully at Panasas. The first importance is to keep the system from being 
> > hung. The next piece of information is to detect, on reboot, that a 
> > watchdog event occurred. Finally, trying to isolate why is good.
> > 
> > This is equivalent to the tco_WDT stuff on Linux. It's not interrupt 
> > driven (it drives the reset line on the processor).
> > 
> 
> I think there's value in the NMI watchdog idea, but unless you back it
> up with a real hardware watchdog you don't really have full watchdog
> functionality.  If the NMI can get the OS to produce some extra info,
> that's great, and using an NMI gives you a good chance of doing that
> even if it is normal interrupt processing that has wedged the machine.
> But calling panic() invokes plenty of processing that can get wedged in
> other ways, so even an NMI-based watchdog isn't g'teed to get the
> machine running again.
> 
> But adding a real hardware watchdog that fires on a slightly longer
> timeout than the NMI watchdog gives you the best of everything: you get
> information if it's possible to produce it, and you get a real hardware
> reset shortly thereafter if producing the info fails.

The IPMI watchdog facility has support for a pre-interrupt that fires before 
the real watchdog.  I have coded up support for it in a branch but haven't 
found any hardware that supports it that I could use to test them.  However, 
you could use an NMI pre-timer via the local APIC timer as a generic pre-timer 
for other hardware watchdogs.

-- 
John Baldwin

Re: libprocstat(3): retrieve process command line args and environment

2013-01-23 Thread John Baldwin

On Wednesday, January 23, 2013 2:25:00 am Mikolaj Golub wrote:
> On Tue, Jan 22, 2013 at 02:17:39PM -0800, Stanislav Sedov wrote:
> > 
> > On Jan 22, 2013, at 1:48 PM, John Baldwin  wrote:
> > > 
> > > Well, you could make procstat open a kvm handle in both cases (open a
> > > "live" handle in the procstat_open_sysctl() case).  It just seems rather
> > > silly to be duplicating code in the two interfaces.
> 
> In this particular case I prefer code duplication to opening a kvm
> handle in procstat_open_sysctl(), as it looks a bit confusing. But I
> can do this way if the agreement is reached.
> 
> > > More a question for Robert: does
> > > libprocstat intentionally duplicate the code in libkvm for other things
> > > as well in the live case?  (Like fetching the list of processes?)
> > > 
> > It does not actually have duplicate code; the code for fetching the list
> > of processes via sysctl is different from the KVM case.  The open file
> > descriptors processing is different as well.  Because libprocstat
> > implements almost the same functionality both for sysctl and kvm backends,
> > it can be used to analyze both the live system and the kernel crash dumps.
> > The code Mikolaj proposed only implements the sysctl backend currently, so
> > it does not seem to have any relation to KVM, so it will be a bit weird to
> > make it open a KVM handle though it does not use it.
> 
> IMHO, after adding procstat_getargv and procstat_getenvv, the usage of
> kvm_getargv() and kvm_getenvv() (at least in the new code) may be
> deprecated. As this is stated in the man page, BUGS section, "these
> routines do not belong in the kvm interface". I suppose they are part
> of libkvm because there was no better place for them. procstat(1)
> prefers direct sysctl to them (so, again, code duplication, which I am
> going to remove adding procstat_getargv/envv).

Hmm, are you going to rewrite ps(1) to use libprocstat?  Or rather, is that a
goal someday?  That is one current consumer of kvm_getargv/envv.  That might
be fine if we want to make more tools use libprocstat instead of using libkvm
directly.

-- 
John Baldwin

Re: NMI watchdog functionality on Freebsd

2013-01-23 Thread John Baldwin

On Tuesday, January 22, 2013 5:40:55 pm Sushanth Rai wrote:
> Hi,
> 
> Does freebsd have some functionality similar to Linux's NMI watchdog? I'm
> aware of ichwd driver, but that depends on WDT to be available in the
> hardware. Even when it is available, BIOS needs to support a mechanism to
> trigger an OS level recovery to get any useful information when system is
> really wedged (with interrupt disabled).
> 
> With Linux's NMI, APIC is programmed to periodically generate NMI and the OS
> NMI handler can check for some counters and invoke panic if the counters are
> not updated for a while.

We currently use the local APIC timer as a timer with a normal interrupt.  
There's no reason you couldn't add a mode to make the local APIC timer operate 
in this fashion however.

-- 
John Baldwin

Re: libprocstat(3): retrieve process command line args and environment

2013-01-22 Thread John Baldwin

On Tuesday, January 22, 2013 4:17:43 pm Mikolaj Golub wrote:
> On Tue, Jan 22, 2013 at 12:01:06PM -0500, John Baldwin wrote:
> 
> > How is this different from kvm_getargv()?  It seems to be a direct copy.
> 
> libprocstat(3) is a frontend for sysctl(3) and kvm(3) interfaces, so
> it is good to extend it to cover "getarg/env" functionality.
> 
> Yes the functions look similar to kvm_getargv() but I couldn't
> implement them just as wrappers around kvm_getargv(): I would like to
> have libprocstat functions thread safe, while kvm_getargv() uses
> static variables for its internal buffers.
> 
> It looks like I could fix kvm_getargv() to use fields of kvm structure
> instead of static variables to store pointers to the buffers, and then
> use it in libprocstat(3). Do you think it is worth doing?
> 
> BTW, struct __kvm already contains some pointers, which looks like are
> unused currently:
> 
>   char **argv;   /* (dynamic) storage for argv pointers */
>   int argc;      /* length of above (not actual # present) */
>   char *argbuf;  /* (dynamic) temporary storage */
> 
> But if I even had kvm_getargv() to behave as I wanted, there is still
> an issue with using it in libprocstat(): to get kvm structure you need
> to initialize procstat using procstat_open_kvm(). It is supposed to
> call procstat_open_kvm() when you want to read from kernel memory,
> while kvm_getargv() uses sysctl. So from a user point of view it would
> be a little confusing if she had to call procstat_open_kvm() to get
> runtime args and env. If she wanted e.g. to get both runtime args and
> file info (via sysctl) she would have to do procstat_open_kvm() for
> args and procstat_open_sysctl() for files.

Well, you could make procstat open a kvm handle in both cases (open a "live" 
handle in the procstat_open_sysctl() case).  It just seems rather silly to be 
duplicating code in the two interfaces.  More a question for Robert: does 
libprocstat intentionally duplicate the code in libkvm for other things as 
well in the live case?  (Like fetching the list of processes?)

-- 
John Baldwin

Re: solved: pmbr: Boot loader too large

2013-01-22 Thread John Baldwin

On Tuesday, January 22, 2013 6:42:22 am Daniel Braniss wrote:
> > hi,
> > this is the output from gpart show:
> > =>       34  976773101  ada0  GPT  (465G)
> >          34       2048     1  freebsd-boot  (1.0M)
> >        2082    4194304     2  freebsd-ufs  [bootme]  (2.0G)
> >     4196386   12582912     3  freebsd-swap  (6.0G)
> >    16779298  959993837     4  freebsd-zfs  (457G)
> > 
> > =>       34  976773101  ada1  GPT  (465G)
> >          34       2048     1  freebsd-boot  (1.0M)
> >        2082    4194304     2  freebsd-ufs  (2.0G)
> >     4196386   12582912     3  freebsd-swap  (6.0G)
> >    16779298  959993837     4  freebsd-zfs  (457G)
> > 
> > I also did:
> > gpart bootcode -b /boot/pmbr ada0
> > 
> > I'm trying to boot and get
> >   Boot loader too large
> > 
> > no matter if I boot from disk or pxe.
> > The pmbr is 512 bytes, so what causes it to overshoot? 
> > I don't know x86 assembler (nor want to :-), but the comment says: 
> > 545k should be enough
> > so what's going on?
> 
> never underestimate the human stupidity (mine in this case) nor of the boot.
> pmbr will load the whole partition, which was 1M, instead of the size of
> gptboot :-(
> 
> reducing the size of the slice/partition fixed the issue.

pmbr doesn't have room to be very smart.  It can't parse a filesystem, so it 
just loads a raw partition assuming that the partition is the boot loader.  
The 545k bit has to do with where it is loaded.  The boot loader has to live 
in the lower 640k, but it starts at 0x7c00 (the address where the BIOS always 
loads boot loaders).  The 545k limit comes from 640k - 0x7c00.  This is a 
fundamental limit of the x86 BIOS architecture.  Compared to the 15.5k that 
UFS leaves for boot2 it is worlds of space.
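
The arithmetic can be checked directly.  A small sketch (the helper
`room_below()` is a hypothetical name) shows that 640k minus 0x7c00 works out
to 609 KiB, while exactly 545 KiB fits between 0x7c00 and a ceiling of
0x90000 (576 KiB).  Why the ceiling would sit 64k below the 640k mark is not
stated in this thread; that part is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define KIB		1024u
#define LOAD_ADDR	0x7c00u		/* where the BIOS loads boot code */

/* Bytes available between the load address and a given ceiling. */
static uint32_t
room_below(uint32_t ceiling)
{
	assert(ceiling > LOAD_ADDR);
	return (ceiling - LOAD_ADDR);
}
```

For example, `room_below(640 * KIB)` is 623616 bytes (609 KiB) and
`room_below(0x90000)` is 558080 bytes (545 KiB).  Either way, a 1M
freebsd-boot partition loaded whole does not fit.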

-- 
John Baldwin

Re: libprocstat(3): retrieve process command line args and environment

2013-01-22 Thread John Baldwin

On Saturday, January 19, 2013 10:12:54 am Mikolaj Golub wrote:
> Hi,
> 
> Some time ago Stanislav Sedov suggested to me extending libprocstat(3)
> with functions to retrieve process command line arguments and
> environment variables.
> 
> In the first approach I tried, the newly added functions
> procstat_getargv/getenvv allocated a buffer of necessary size, stored
> the values and returned to the caller:
> 
> http://people.freebsd.org/~trociny/libprocstat.1.patch
> 
> The problem with this approach was that when I updated procstat(1) to
> use this interface, I observed noticeable performance degradation
> (about 30% on systems with MALLOC_PRODUCTION off), due to memory
> allocation overhead: the original procstat(1) reuses the buffer for
> all its retrievals.
> 
> So my second approach was to add internal buffers to struct procstat,
> which are used by procstat_getargv/getenvv to store values and reused
> on the subsequent call:
> 
> http://people.freebsd.org/~trociny/libprocstat.2.patch
> 
> The drawback of this approach is that a user has to take care and
> remember that a subsequent call rewrites argument vector obtained from
> the previous call. On the other hand this is ok for typical use cases
> while does not add allocation overhead, so I like this approach more.
> 
> I would like to commit this second patch, if there are no objections
> or suggestions how to improve the things.

How is this different from kvm_getargv()?  It seems to be a direct copy.
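
For readers following the buffer discussion quoted above, here is a minimal
sketch of the "buffer owned by the handle and reused" idea.  It is not taken
from either patch or from libkvm; `struct handle`, `handle_fetch()` and
`handle_close()` are hypothetical names.  Keeping the buffer in the handle
rather than in a static variable means two handles never share storage, but
the result of one call is overwritten by the next call on the same handle.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle that owns one growable scratch buffer. */
struct handle {
	char *buf;
	size_t cap;
};

/*
 * Copy `src` into the handle's buffer, growing it only when needed.
 * The returned pointer is valid until the next call on the same handle.
 * Returns NULL if memory cannot be allocated.
 */
static const char *
handle_fetch(struct handle *h, const char *src)
{
	size_t need;

	assert(h != NULL && src != NULL);
	need = strlen(src) + 1;
	if (need > h->cap) {
		char *p = realloc(h->buf, need);

		if (p == NULL)
			return (NULL);
		h->buf = p;
		h->cap = need;
	}
	memcpy(h->buf, src, need);
	return (h->buf);
}

/* Release the buffer; the handle can be reused afterwards. */
static void
handle_close(struct handle *h)
{
	free(h->buf);
	h->buf = NULL;
	h->cap = 0;
}
```

A caller that needs to keep an earlier result has to copy it before calling
`handle_fetch()` again, which is the drawback mentioned for the second patch.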

-- 
John Baldwin

Re: Fixing grep -D skip

2013-01-18 Thread John Baldwin

On Thursday, January 17, 2013 9:33:53 pm David Xu wrote:
> I am trying to fix a bug in GNU grep, the bug is if you
> want to skip FIFO file, it will not work, for example:
> 
> grep -D skip aaa .
> 
> it will get stuck on a FIFO file.
> 
> Here is the patch:
> http://people.freebsd.org/~davidxu/patch/grep.c.diff2
> 
> Is it fine to be committed ?

I think the first part definitely looks fine.  My guess is the non-blocking 
change is also probably fine, but that should be run by the bsdgrep person at 
least.
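
As background on why a non-blocking open helps here: a plain open(2) of a
FIFO for reading blocks until a writer appears, so the file type has to be
learned without blocking.  The sketch below is not the patch under review;
`open_skipping_special()` and its return convention are assumptions made for
illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Open `path` for reading unless it is a FIFO or a device node.
 * O_NONBLOCK keeps open(2) from hanging on a FIFO with no writer;
 * it is cleared again before the descriptor is handed back.
 * Returns a descriptor, -1 on error, or -2 if the file was skipped.
 */
static int
open_skipping_special(const char *path)
{
	struct stat sb;
	int fd, flags;

	assert(path != NULL);
	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return (-1);
	if (fstat(fd, &sb) == -1) {
		close(fd);
		return (-1);
	}
	if (S_ISFIFO(sb.st_mode) || S_ISCHR(sb.st_mode) ||
	    S_ISBLK(sb.st_mode)) {
		close(fd);
		return (-2);
	}
	/* Ordinary file or directory: restore blocking reads. */
	flags = fcntl(fd, F_GETFL);
	if (flags == -1 || fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) == -1) {
		close(fd);
		return (-1);
	}
	return (fd);
}
```

Checking with fstat(2) on the already open descriptor, rather than stat(2)
on the path beforehand, avoids a window in which the file could be replaced
between the check and the open.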

-- 
John Baldwin

Re: Failsafe on kernel panic

2013-01-17 Thread John Baldwin

On Wednesday, January 16, 2013 4:27:53 pm Sami Halabi wrote:
> Thank you for your response, very helpful.
> one question - how do i configure auto-reboot once kernel panic occurs?

Unless you've added DDB and KDB to your kernel it will reboot by default
on a panic.  Stable kernel configs also include the unattended option so
that even with the debugger present they reboot by default on a panic.

-- 
John Baldwin

Re: disadvantages of running 8.3 kernel on freebsd 8.2 system

2013-01-17 Thread John Baldwin

On Thursday, January 17, 2013 10:57:01 am Devin Teske wrote:
> On Jan 17, 2013, at 4:06 AM, Ali Okan YÜKSEL wrote:
> 
> > I know for UPDATING, it's not correct way, but i tried and 8.2 system
> > works with 8.3 kernel (copied 8.3 /boot/kernel directory to freebsd 8.2
> > /boot/kernel) and it's not good solution but i want to know;
> > 
> > 
> >   - What are specific disadvantages that i can see clearly, running 8.3
> >   kernel on freebsd 8.2?
> 
> A couple user land tools might barf on you (listed below).
> 
> Other than that, it's generally considered very safe.
> 
> The quintessential test-case is running an 8.2 jail under an 8.3 host.
> 
> We do this all the time with various releases (again, most-problematic
> utilities listed below).
> 
> 
> 
> >   - What are user land tools those not match with 8.3 kernel on freebsd
> >   8.2 system…?
> > 
> 
> top and ps might complain about procsize mismatch.
> 
> netstat has been known to have problems if the gap is too wide.

These generally do not have problems in recent release branches.  top and ps 
haven't complained about procsize since the 4.x days as 5.0 introduced a new 
kinfo_proc structure that the kernel exports and it hasn't changed in size 
since 5.0.

The mfiutil issue dhw@ mentioned is real and is due to an mfi(4) driver 
change.  I merged a fix for the panics to 8-stable, but it just makes
old mfiutil binaries not work at all.

-- 
John Baldwin

Re: Failsafe on kernel panic

2013-01-16 Thread John Baldwin

On Wednesday, January 16, 2013 2:25:33 pm Sami Halabi wrote:
> Hi everyone,
> I have a production box, in which I want to install new kernel without any
> remote KVM.
> my problem is it's 2 hours away, and if a kernel panic occurs I got a
> problem.
> I wonder if I can set a failsafe script to load the old kernel in case of
> panic.

man nextboot (if you are using UFS)

-- 
John Baldwin

Re: [RFC] support -b when starting gdb

2013-01-16 Thread John Baldwin

On Wednesday, January 16, 2013 1:30:37 am Adrian Chadd wrote:
> Also, I found 'set remotebaud' and 'set debug remote 1' to do this.
> 
> I'd like to add the code just to support the same -b flag as gdb (so
> -r can also be used with a non-standard serial port.)

I think adding -b is fine.

-- 
John Baldwin

Re: kldstat / kernel linker deadlock

2013-01-12 Thread John Baldwin

On Thursday, November 22, 2012 08:26:17 PM Bryan Drewery wrote:
> On 8.3-RELEASE I've hit a deadlock with kldstat.
> 
> I can't provide much information as procstat(1) locks up and I have
> already rebooted the servers due to it breaking quite a bit in my setup.
> 
> > # kldstat
> > Id Refs Address Size Name
> > load: 0.91  cmd: kldstat 9936 [kernel linker] 51.21r 0.00u 0.00s 0% 768k
> > ^C
> > load: 0.72  cmd: kldstat 9936 [kernel linker] 225.23r 0.00u 0.00s 0% 704k
> > load: 0.72  cmd: kldstat 9936 [kernel linker] 225.39r 0.00u 0.00s 0% 704k
> > load: 0.42  cmd: kldstat 9936 [kernel linker] 1837.24r 0.00u 0.00s 0% 692k
> 
> Short list of affected processes (74 in all):
> > root    3685  0.0  0.0  3264   700  ??  D     7:27PM   0:00.00 kldstat
> > root   67061  0.0  0.0  3380   892  ??  D     7:27PM   0:00.00 /usr/bin/netstat -nrf inet
> > root    5579  0.0  0.0  3380   892  ??  D     7:37PM   0:00.00 /usr/bin/netstat -nrf inet
> > root    6393  0.0  0.0  3264   704  ??  D     7:32PM   0:00.00 /sbin/kldstat -v
> > root   99635  0.0  0.1  3324  1244  13  D+    7:52PM   0:00.01 procstat -ka
> 
> [... 69 more removed ...]
> 
> I had 2 minutely cron entries that were running kldstat(1)/netstat(1).
> 
> Guessing the kldstat(1) and netstat(1) deadlocked initially.

Next time get a dump if at all possible.

-- 
John Baldwin

looking for someone to fix humanize_number (test cases included)

2012-12-23 Thread John-Mark Gurney

I'm looking for a person who is interested in fixing up humanize_number.

The other day I copied some data from a 7.2-R box to a 9.1-stable box
and did a du -shc and a du -skc to check the results...  I noticed the -h
run dropped from 11M to 10M, which I thought was weird...  Then I looked
at the results from the -k run, but the new machine had a larger result
(I copied from UFS to ZFS)...  It turns out that humanize_number was
broken when doing rounding...  No longer does humanize_number round up
at .5 or more of the prefix..

So I decided to write a test program to test the output, and now I'm even
more surprised by the output...  Neither 7.2-R nor 10-current give what
I expect are the correct results...

Feel free to take a look at the test program posted to:
http://people.freebsd.org/~jmg/humanize_numbers/

The .c contains what I think the output should be.

So far the bugs I know of:
1) rounding is incorrect (started this whole search)
2) buffer calculation is incorrect in some cases, index 11 should fit
   but doesn't
3) in some cases zero is returned though it isn't zero, more like 0T for 512 G
   (indexes 16, 17, 22, 23)
4) man page is missing required sys/types.h include
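
To make "round up at .5 or more of the prefix" concrete, here is a small
sketch of that rounding rule on its own.  It is not the libutil
humanize_number() code and says nothing about the buffer-size or zero-result
cases in the list above; `scale_round()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale `bytes` down by 1024 `steps` times and round half up once,
 * in the final unit: 10.5M becomes 11M, just under 10.5M stays 10M.
 */
static uint64_t
scale_round(uint64_t bytes, unsigned steps)
{
	uint64_t unit = 1;

	assert(steps <= 6);	/* 1024^6 still fits in 64 bits */
	for (unsigned i = 0; i < steps; i++)
		unit *= 1024;
	/*
	 * Work from quotient and remainder so nothing is added to
	 * `bytes` itself (that could overflow near UINT64_MAX).
	 */
	return (bytes / unit +
	    ((unit > 1 && bytes % unit >= unit / 2) ? 1 : 0));
}
```

With this rule `scale_round(11010048, 2)` (exactly 10.5M) gives 11, and
`scale_round(11010047, 2)` gives 10.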

I'll work to get the code into the tree once we get it in a good state.

Please cc me as I'm not subscribed to -hackers.

Thanks.

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."

Re: kernel module parallel build?

2012-12-06 Thread John Baldwin

On Wednesday, December 05, 2012 6:51:17 pm Damien Fleuriot wrote:
> 
> On 5 Dec 2012, at 18:39, Warner Losh  wrote:
> 
> > 
> > On Dec 5, 2012, at 9:42 AM, John Baldwin wrote:
> > 
> >> On Tuesday, December 04, 2012 2:41:32 pm Ryan Stone wrote:
> >>> On Tue, Dec 4, 2012 at 10:52 AM, John Baldwin  wrote:
> >>> 
> >>>> Hmm, I certainly see the module directories being built in parallel.
> >>>> Some of the make jobs may not be as obvious since links are silent (no
> >>>> output unless there is an error).
> >>>> 
> >>>> 
> >>> This is definitely not the behaviour that I see trying to build any
> >>> version of FreeBSD.  I see the same behaviour as Andre: the depend and
> >>> all targets both iterate through the module directories sequentially.
> >>> It never builds two module subdirectories concurrently.
> >> 
> >> Hmm, I think I was confused by seeing kernel builds intermingle with the
> >> associated modules.  sys/modules/Makefile uses bsd.subdir.mk.  I think I
> >> see similar things in world builds where I will see parallel builds of
> >> bin vs sbin vs usr.bin vs usr.sbin, but within each of those directories
> >> the builds go sequentially.  I think you would need to change
> >> bsd.subdir.mk if you want to fix this.
> > 
> > The builds are in parallel, just that the parallelism is low because it
> > is only parallel within the module being built. Would love to see a fix.
> > 
> > Warner
> > 
> 
> All trolling aside, I believe an awesome fix to be setting module override
> in /etc/make.conf to only build the 4-5 specific modules one needs.
> 
> To be honest I think this configuration tweak should be advertised a bit
> more as it definitely speeds up kernel builds.
> 
> I would be happy to check if this is advertised in the handbook in the
> "rebuilding kernel" section and enhance its visibility if required.
> 
> I can provide en_US and fr_FR.

Better than doing it in /etc/make.conf (or /etc/src.conf) is doing it directly
in the kernel config file itself via

makeoptions MODULES_OVERRIDE="foo"

You can use multiple of these (with +=) in a config file as well.

-- 
John Baldwin

Re: kernel module parallel build?

2012-12-05 Thread John Baldwin

On Tuesday, December 04, 2012 2:41:32 pm Ryan Stone wrote:
> On Tue, Dec 4, 2012 at 10:52 AM, John Baldwin  wrote:
> 
> > Hmm, I certainly see the module directories being built in parallel.
> > Some of the make jobs may not be as obvious since links are silent (no
> > output unless there is an error).
> >
> >
> This is definitely not the behaviour that I see trying to build any version
> of FreeBSD.  I see the same behaviour as Andre: the depend and all targets
> both iterate through the module directories sequentially.  It never builds
> two module subdirectories concurrently.

Hmm, I think I was confused by seeing kernel builds intermingle with the 
associated modules.  sys/modules/Makefile uses bsd.subdir.mk.  I think I see 
similar things in world builds where I will see parallel builds of bin vs sbin 
vs usr.bin vs usr.sbin, but within each of those directories the builds go 
sequentially.  I think you would need to change bsd.subdir.mk if you want to 
fix this.

-- 
John Baldwin

Re: kernel module parallel build?

2012-12-04 Thread John Baldwin

On Sunday, November 04, 2012 2:53:02 pm Andre Oppermann wrote:
> On 22.10.2012 15:28, John Baldwin wrote:
> > On Sunday, October 21, 2012 7:11:10 am Andre Oppermann wrote:
> >> What's keeping kernel modules from building in parallel with
> >> "make -j8"?
> >
> > They don't for you?  They do for me either via 'make buildkernel'
> > or the old method.
> 
> They do, but only partially.  Within a module the files are built
> in parallel.  However the module directories seem to be serialized.
> I'm a Makefile noob though.

Hmm, I certainly see the module directories being built in parallel.  Some of 
the make jobs may not be as obvious since links are silent (no output unless 
there is an error).

-- 
John Baldwin

Re: Loader-kernel interaction

2012-10-22 Thread John Baldwin

On Friday, October 19, 2012 7:05:05 pm Richard Yao wrote:
> Dear Everyone,
> 
> I know that the kernel is a BTX client, but I do not understand the
> protocol used by loader to pass sysctl settings and loadable modules to
> the kernel. Is there documentation on this?

The loader passes its variables as a set of environment variables.
They are stored in a contiguous block of memory after the last kernel
module.  Look at sys/boot/i386/libi386/bootinfo{32,64}.c.  Specifically
look at the bi_load*() routines.
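
As a rough picture of what "a contiguous block of memory" can look like, the
sketch below walks a block of NUL-terminated "name=value" strings laid out
back to back and ended by an empty string.  That layout is an assumption
made for illustration, and `envblock_get()` is a hypothetical helper; the
bootinfo sources named above are the authority on the real format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up `name` in a block of NUL-terminated "name=value" strings
 * laid out back to back and ended by an empty string.  Returns a
 * pointer to the value inside the block, or NULL if not found.
 */
static const char *
envblock_get(const char *block, const char *name)
{
	size_t nlen;

	assert(block != NULL && name != NULL);
	nlen = strlen(name);
	for (const char *cp = block; *cp != '\0'; cp += strlen(cp) + 1) {
		/* Match the full name, then require the '=' separator. */
		if (strncmp(cp, name, nlen) == 0 && cp[nlen] == '=')
			return (cp + nlen + 1);
	}
	return (NULL);
}
```

Given a block holding "kern.hz=100" followed by
"vfs.root.mountfrom=ufs:/dev/ada0p2", looking up "kern.hz" yields "100" and
looking up the bare prefix "kern" yields NULL.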

-- 
John Baldwin

Re: kernel module parallel build?

2012-10-22 Thread John Baldwin

On Sunday, October 21, 2012 7:11:10 am Andre Oppermann wrote:
> What's keeping kernel modules from building in parallel with
> "make -j8"?

They don't for you?  They do for me either via 'make buildkernel'
or the old method.

-- 
John Baldwin

Re: syncing large mmaped files

2012-10-18 Thread John Baldwin

On Thursday, October 18, 2012 12:42:18 pm Konstantin Belousov wrote:
> On Thu, Oct 18, 2012 at 09:39:34AM -0400, John Baldwin wrote:
> > On Thursday, October 18, 2012 4:35:37 am Konstantin Belousov wrote:
> > > On Thu, Oct 18, 2012 at 10:08:22AM +1000, Tristan Verniquet wrote:
> > > > 
> > > > I want to work with large (1-10G) files in memory but eventually sync
> > > > them back out to disk. The problem is that the sync process appears to
> > > > lock the file in kernel for the duration of the sync, which can run
> > > > into minutes. This prevents other processes from reading from the file
> > > > (unless they already have it mapped) for this whole time. Is there
> > > > any way to prevent this? I think I read in a post somewhere about
> > > > openbsd implementing partial-writes when it hits a file with lots of
> > > > dirty pages in order to prevent this. Is there anything available for
> > > > FreeBSD or is there another way around it?
> > > >
> > > No, currently the vnode lock is held exclusive for the whole duration
> > > of the msync(2) syscall or its analog from the syncer.
> > > 
> > > Making a change to periodically drop the vnode lock in
> > > vm_object_page_clean() might be possible, but requires the benchmarking
> > > to make sure that we do not pessimize the common case. Also, this opens
> > > a possibility for the vnode reclamation meantime.
> > 
> > You can simulate this in userland by breaking up your msync() into multiple
> > msync() calls where each call just syncs a portion of the file.
> Be aware that this is much-much slower than msyncing the whole file, even
> if file is very large. The reason is that pager initiates asynchronous
> _immediate_ clustered write for such situations. Async writes (AKA
> bdwrite()) are only specified for full range msyncing.

Ugh.  It would seem to me that msync(MS_ASYNC) should be doing delayed
writes.

> > > Anyway, note that you cannot 'work with large files in memory', even if
> > > you have enough RAM and no pressure to hold all the file pages resident.
> > > The syncer will do a writeback periodically regardless of the application
> > > calling msync(2) or not, with the interval of approximately 30 seconds.
> > 
> > You can mmap with MAP_NOSYNC to prevent the syncer from writing the file out
> > every 30 seconds.
> 
> This also prevents msync(2) from syncing the region. The flag is fine
> for throw-away data, but not for the scenario that was described, I
> think.

Oof.  I could see that in certain situations you might want to control this
behavior from an application (similar to how I now make use of fadvise() at
work).  Having a way to disable syncer but having msync(MS_ASYNC) do
something useful would be good.

-- 
John Baldwin

Re: syncing large mmaped files

2012-10-18 Thread John Baldwin

On Thursday, October 18, 2012 4:35:37 am Konstantin Belousov wrote:
> On Thu, Oct 18, 2012 at 10:08:22AM +1000, Tristan Verniquet wrote:
> > 
> > I want to work with large (1-10G) files in memory but eventually sync
> > them back out to disk. The problem is that the sync process appears to
> > lock the file in kernel for the duration of the sync, which can run
> > into minutes. This prevents other processes from reading from the file
> > (unless they already have it mapped) for this whole time. Is there
> > any way to prevent this? I think I read in a post somewhere about
> > openbsd implementing partial-writes when it hits a file with lots of
> > dirty pages in order to prevent this. Is there anything available for
> > FreeBSD or is there another way around it?
> >
> No, currently the vnode lock is held exclusive for the whole duration
> of the msync(2) syscall or its analog from the syncer.
> 
> Making a change to periodically drop the vnode lock in
> vm_object_page_clean() might be possible, but requires the benchmarking
> to make sure that we do not pessimize the common case. Also, this opens
> a possibility for the vnode reclamation meantime.

You can simulate this in userland by breaking up your msync() into multiple
msync() calls where each call just syncs a portion of the file.

> Anyway, note that you cannot 'work with large files in memory', even if
> you have enough RAM and no pressure to hold all the file pages resident.
> The syncer will do a writeback periodically regardless of the application
> calling msync(2) or not, with the interval of approximately 30 seconds.

You can mmap with MAP_NOSYNC to prevent the syncer from writing the file out
every 30 seconds.

-- 
John Baldwin

Re: _mtx_lock_spin: obsolete historic handling of kdb_active and panicstr?

2012-10-17 Thread John Baldwin

On Wednesday, October 17, 2012 7:20:56 am Andriy Gapon wrote:
> 
> _mtx_lock_spin has the following check in its retry loop:
> 	if (i < 6000 || kdb_active || panicstr != NULL)
> 		DELAY(1);
> 	else
> 		_mtx_lock_spin_failed(m);
> 
> Which means that in the (!kdb_active && panicstr == NULL) case we will make at
> most 6000 iterations and then call _mtx_lock_spin_failed (which proceeds to
> panic).  When either kdb_active or panicstr is set, then we are going to loop
> forever.
> 
> I've done some digging through the lengthy history and many evolutions of the
> code (unfortunately I haven't kept records during the research), and my
> conclusion is that the kdb_active and panicstr checks were added at the quite
> early era of FreeBSD SMP, where we didn't have a mechanism to stop/block other
> CPUs when kdb or panic were entered.  We didn't even prevent parallel execution
> of panic.
> So the above code was a sort of defense where we hoped that "other" CPUs would
> eventually stumble upon some held spinlock and would be captured there.  Maybe
> there was a specific set of spinlocks, which were supposed to help.

It wasn't so much a way of hoping CPUs would stop as a way to prevent other
CPUs from panic'ing while another CPU had already panic'd or was already in
DDB, which would make debugging harder.

> Nowadays, we do try to stop other CPUs during panic and kdb activation and there
> are good chances that they are indeed stopped.  In these circumstances, should
> the main CPU be so unlucky as to run into the held spinlock, the above check
> would do more harm than good - the main CPU would just spin there forever,
> because a lock owner is also spinning in the stop loop and so can't release the
> lock.
> Actually, this is only true for the kdb case.  In the panic case we make a check
> earlier and simply ignore/skip/bust all the locks.  That makes the panicstr
> check in the code in question both harmless and useless.
> 
> So I'd like to propose to remove those checks altogether.  Or perhaps to
> "reverse" them and immediately induce a (possibly secondary) panic if we ever
> get to that wait loop and kdb_active || panicstr != NULL.
> 
> What do you think?

I think this sounds fine.  I do think though that there are two behaviors.  If
for some reason you are not able to stop the other CPUs, you would rather have
them spin than trigger another panic while you are in DDB or writing out a
crashdump.  However, the CPU that is currently in the debugger or writing out a
crashdump should probably bust all locks.  (Code executed in debugger backends
should generally avoid locking altogether, and if it must use locking, depend on
things like try locks that fail gracefully.  That would make the kdb_active case
here irrelevant, and the panic case is already handled as you noted.)
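
To illustrate the "try lock that fails gracefully" idea, here is a userland
sketch using C11 atomics.  It is not the kernel's mtx(9) code: struct spin,
spin_trylock(), spin_unlock() and spin_lock_bounded() are hypothetical names
made up for this example, and the retry bound is just a parameter.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland spin lock; not the kernel's struct mtx. */
struct spin {
	atomic_flag held;
};

/* One attempt that never blocks: the shape debugger code would prefer. */
static bool
spin_trylock(struct spin *s)
{
	return (!atomic_flag_test_and_set_explicit(&s->held,
	    memory_order_acquire));
}

static void
spin_unlock(struct spin *s)
{
	atomic_flag_clear_explicit(&s->held, memory_order_release);
}

/*
 * Bounded acquire: retry up to 'max_tries' times, then report failure
 * so the caller decides what to do (panic, skip the work, or bust the
 * lock) instead of spinning forever.
 */
static bool
spin_lock_bounded(struct spin *s, unsigned max_tries)
{
	for (unsigned i = 0; i < max_tries; i++) {
		if (spin_trylock(s))
			return (true);
	}
	return (false);
}
```

A caller running in a debugger-like context would use spin_trylock() and
simply skip the operation on failure, which is why an unbounded wait loop
keyed on a "debugger active" flag would no longer be needed there.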

-- 
John Baldwin

Re: NFS server bottlenecks

2012-10-15 Thread John Baldwin

On Saturday, October 13, 2012 9:03:22 am Rick Macklem wrote:
> rick
> ps: I hope John doesn't mind being added to the cc list yet again. It's
> just that I suspect he knows a fair bit about mutex implementation
> and possible hardware cache line effects.

Currently mtx_pool just uses a simple array (I have patches to force the
array members to be cache-aligned, but they haven't been shown to help in
any benchmarks to date).  I do think though that I would prefer embedding
the mutexes in the hash table entries directly.  This is what we do for the
turnstile and sleep queue hash tables.
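
As a rough userland sketch of what "embedding the mutexes in the hash table
entries" could look like, the following uses pthread mutexes in place of
kernel mutexes.  All names (struct bucket, table_insert(), table_lookup(),
NBUCKETS) are assumptions for illustration and do not come from the NFS
server or from mtx_pool.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define	NBUCKETS	64	/* illustrative size, must be a power of two */

struct entry {
	struct entry	*next;
	uint32_t	 key;
	int		 value;
};

/* Each chain carries its own lock, so no separate lock pool is needed. */
struct bucket {
	pthread_mutex_t	 lock;
	struct entry	*head;
};

static struct bucket table[NBUCKETS];

static void
table_init(void)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		pthread_mutex_init(&table[i].lock, NULL);
		table[i].head = NULL;
	}
}

static struct bucket *
bucket_for(uint32_t key)
{
	return (&table[key & (NBUCKETS - 1)]);
}

/* Link a caller-allocated entry into its chain under the chain's lock. */
static void
table_insert(struct entry *e)
{
	struct bucket *b = bucket_for(e->key);

	pthread_mutex_lock(&b->lock);
	e->next = b->head;
	b->head = e;
	pthread_mutex_unlock(&b->lock);
}

/* Returns 1 and fills *value if the key is found, 0 otherwise. */
static int
table_lookup(uint32_t key, int *value)
{
	struct bucket *b = bucket_for(key);
	int found = 0;

	pthread_mutex_lock(&b->lock);
	for (struct entry *e = b->head; e != NULL; e = e->next) {
		if (e->key == key) {
			*value = e->value;
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&b->lock);
	return (found);
}
```

The lock and the chain head it protects sit next to each other, so the lock
for a given key is found with the same hash computation as the chain itself.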

-- 
John Baldwin

Re: No bus_space_read_8 on x86 ?

2012-10-12 Thread John Baldwin

On Wednesday, October 10, 2012 5:44:09 pm Carl Delsey wrote:
> Sorry for the slow response. I was dealing with a bit of a family 
> emergency. Responses inline below.
> 
> On 10/09/12 08:54, John Baldwin wrote:
> > On Monday, October 08, 2012 4:59:24 pm Warner Losh wrote:
> >> On Oct 5, 2012, at 10:08 AM, John Baldwin wrote:
> 
> >>> I think cxgb* already have an implementation.  For amd64 we should certainly
> >>> have bus_space_*_8(), at least for SYS_RES_MEMORY.  I think they should fail
> >>> for SYS_RES_IOPORT.  I don't think we can force a compile-time error though,
> >>> would just have to return -1 on reads or some such?
> 
> Yes. Exactly what I was thinking.
> 
> >> I believe it was because bus reads weren't guaranteed to be atomic on i386.
> >> don't know if that's still the case or a concern, but it was an 
> >> intentional omission.
> > True.  If you are on a 32-bit system you can read the two 4 byte values and
> > then build a 64-bit value.  For 64-bit platforms we should offer bus_read_8()
> > however.
> 
> I believe there is still no way to perform a 64-bit read on a i386 (or 
> at least without messing with SSE instructions), but if you have to read 
> a 64-bit register, you are stuck with doing two 32-bit reads and 
> concatenating them. I figure we may as well provide an implementation 
> for those who have to do that as well as the implementation for 64-bit.

I think the problem though is that the way you should glue those two 32-bit
reads together is device dependent.  I don't think you can provide a completely
device-neutral bus_read_8() on i386.  We should certainly have it on 64-bit
platforms, but I think drivers that want to work on 32-bit platforms need to
explicitly merge the two words themselves.
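
A small sketch of why the merge is device dependent.  The function
reg_read_4() below is a stand-in for a real 32-bit register read such as
bus_read_4() and just indexes an array so the example runs in userland.
The two dev_read_8_*() helpers are hypothetical driver-side functions, and
the layout (low word at the lower offset) is an assumption.

```c
#include <stdint.h>

/*
 * Stand-in for a 32-bit register read; a driver would use its bus
 * accessor here instead of indexing an array.
 */
static uint32_t
reg_read_4(const uint32_t *regs, unsigned off)
{
	return (regs[off / 4]);
}

/* For a device that latches the high word when the low word is read. */
static uint64_t
dev_read_8_low_first(const uint32_t *regs, unsigned off)
{
	uint32_t lo = reg_read_4(regs, off);
	uint32_t hi = reg_read_4(regs, off + 4);

	return ((uint64_t)hi << 32 | lo);
}

/* For a device that latches the low word when the high word is read. */
static uint64_t
dev_read_8_high_first(const uint32_t *regs, unsigned off)
{
	uint32_t hi = reg_read_4(regs, off + 4);
	uint32_t lo = reg_read_4(regs, off);

	return ((uint64_t)hi << 32 | lo);
}
```

Both helpers return the same value for a static register, but on hardware
where the first access has a side effect only one of the two orders is
correct, and only the driver knows which.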
 
-- 
John Baldwin

Re: No bus_space_read_8 on x86 ?

2012-10-09 Thread John Baldwin

On Monday, October 08, 2012 4:59:24 pm Warner Losh wrote:
> 
> On Oct 5, 2012, at 10:08 AM, John Baldwin wrote:
> 
> > On Thursday, October 04, 2012 1:20:52 pm Carl Delsey wrote:
> >> I noticed that the bus_space_*_8 functions are unimplemented for x86. 
> >> Looking at the code, it seems this is intentional.
> >> 
> >> Is this done because on 32-bit systems we don't know, in the general 
> >> case, whether to read the upper or lower 32-bits first?
> >> 
> >> If that's the reason, I was thinking we could provide two 
> >> implementations for i386: bus_space_read_8_upper_first and 
> >> bus_space_read_8_lower_first. For amd64 we would just have bus_space_read_8
> >> 
> >> Anybody who wants to use bus_space_read_8 in their file would do 
> >> something like:
> >> #define BUS_SPACE_8_BYTES LOWER_FIRST
> >> or
> >> #define BUS_SPACE_8_BYTES UPPER_FIRST
> >> whichever is appropriate for their hardware.
> >> 
> >> This would go in their source file before including bus.h and we would 
> >> take care of mapping to the correct implementation.
> >> 
> >> With the prevalence of 64-bit registers these days, if we don't provide 
> >> an implementation, I expect many drivers will end up rolling their own.
> >> 
> >> If this seems like a good idea, I'll happily whip up a patch and submit it.
> > 
> > I think cxgb* already have an implementation.  For amd64 we should certainly
> > have bus_space_*_8(), at least for SYS_RES_MEMORY.  I think they should fail
> > for SYS_RES_IOPORT.  I don't think we can force a compile-time error though,
> > would just have to return -1 on reads or some such?
> 
> I believe it was because bus reads weren't guaranteed to be atomic on i386.
> don't know if that's still the case or a concern, but it was an intentional 
> omission.

True.  If you are on a 32-bit system you can read the two 4 byte values and
then build a 64-bit value.  For 64-bit platforms we should offer bus_read_8()
however.

-- 
John Baldwin

Re: No bus_space_read_8 on x86 ?

2012-10-05 Thread John Baldwin

On Thursday, October 04, 2012 1:20:52 pm Carl Delsey wrote:
> I noticed that the bus_space_*_8 functions are unimplemented for x86. 
> Looking at the code, it seems this is intentional.
> 
> Is this done because on 32-bit systems we don't know, in the general 
> case, whether to read the upper or lower 32-bits first?
> 
> If that's the reason, I was thinking we could provide two 
> implementations for i386: bus_space_read_8_upper_first and 
> bus_space_read_8_lower_first. For amd64 we would just have bus_space_read_8
> 
> Anybody who wants to use bus_space_read_8 in their file would do 
> something like:
> #define BUS_SPACE_8_BYTES LOWER_FIRST
> or
> #define BUS_SPACE_8_BYTES UPPER_FIRST
> whichever is appropriate for their hardware.
> 
> This would go in their source file before including bus.h and we would 
> take care of mapping to the correct implementation.
> 
> With the prevalence of 64-bit registers these days, if we don't provide 
> an implementation, I expect many drivers will end up rolling their own.
> 
> If this seems like a good idea, I'll happily whip up a patch and submit it.

I think cxgb* already have an implementation.  For amd64 we should certainly 
have bus_space_*_8(), at least for SYS_RES_MEMORY.  I think they should fail 
for SYS_RES_IOPORT.  I don't think we can force a compile-time error though, 
would just have to return -1 on reads or some such?

-- 
John Baldwin

Re: vendor import questions

2012-10-05 Thread John Baldwin

On Thursday, October 04, 2012 8:31:36 pm Brooks Davis wrote:
> On Tue, Sep 25, 2012 at 08:41:34AM -0400, John Baldwin wrote:
> > On Monday, September 24, 2012 5:31:37 pm Brooks Davis wrote:
> > > As part of switching to NetBSD's mtree I plan to import their versions
> > > of a few files that are part of libc (for example all the bits of
> > > vis/unvis).  I would like to do that via a vendor import, but I'm unsure
> > > where to put the files and how to tag them.  For mtree itself the right
> > > place is clearly base/vendor/NetBSD/mtree/dist, but we don't seem to
> > > have a good example for libc bits.
> > > 
> > > There is currently a base/vendor/NetBSD/dist directory containing a
> > > (very) partial source tree, but it seems to be unused in recent times.
> > > If I did import into that tree, the next question would be how to tag
> > > the import.  The base/vendor/NetBSD/fparseln_19990920/ directory shows
> > > one seemingly sensible example, but I don't like the resulting explosion
> > > of top level directories.  I also worry that having mixed versions in the
> > > libc directory would make any attempt at sensible merging difficult
> > > since we'd have to put mergeinfo on files.
> > > 
> > > An additional issue is where to put the files in the source tree.
> > > Precedent seems to favor direct copies to src/lib/libc/gen etc.  In some
> > > ways I think the optimal solution would be to put the bits in contrib
> > > in feature specific directories like contrib/libc/vis, but that might
> > > be annoying for some consumers.  That being said, the existence of
> > > src/include means you can't simply check out libc so it's probably ok to
> > > add more locations in the source tree for a good cause.
> > > 
> > > What's the right way to go here?
> > 
> > libc already has contrib bits (contrib/gdtoa).  I think something like
> > contrib/NetBSD/libc/ might be fine.  The problem I have with just
> > 'contrib/libc' is that it is ambiguous.  OTOH, the contrib/NetBSD/libc
> > path isn't too pretty either.  One option would be to merge directly from
> > the vendor area into src/lib/libc.  One other option might be to just
> > do src/contrib/vis if it is only for 'vis' files.
> 
> I'm leaning towards src/contrib/libc-vis.  That would also work well in
> vendor/NetBSD since I could do vendor/NetBSD/libc-vis/dist.

I think that is fine.

-- 
John Baldwin

Re: kvm_proclist: ignore processes in PRS_NEW

2012-10-03 Thread John Baldwin

On Wednesday, October 03, 2012 12:37:07 pm Andriy Gapon wrote:
> 
> I believe that the following patch does the right thing that is repeated in a few
> other places.
> I would like to ask for a review just in case.
> 
> commit cf0f573a1dcbc09cb8fce612530afeeb7f1b1c62
> Author: Andriy Gapon 
> Date:   Sun Sep 23 22:49:26 2012 +0300
> 
> kvm_proc: ignore processes in larvae state

I think this is fine.

-- 
John Baldwin

Re: SMP Version of tar

2012-10-03 Thread John Nielsen

On Oct 2, 2012, at 12:36 AM, Yamagi Burmeister  wrote:

> On Mon, 1 Oct 2012 22:16:53 -0700
> Tim Kientzle  wrote:
> 
>> There are a few different parallel command-line compressors and 
>> decompressors in ports; experiment a lot (with large files being read from 
>> and/or written to disk) and see what the real effect is.  In particular, 
>> some decompression algorithms are actually faster than memcpy() when run on 
>> a single processor.  Parallelizing such algorithms is not likely to help 
>> much in the real world.
>> 
>> The two popular algorithms I would expect to benefit most are bzip2 
>> compression and lzma compression (targeting xz or lzip format).  For 
>> decompression, bzip2 is block-oriented so fits SMP pretty naturally.  Other 
>> popular algorithms are stream-oriented and less amenable to parallelization.
>> 
>> Take a careful look at pbzip2, which is a parallelized bzip2/bunzip2 
>> implementation that's already under a BSD license.  You should be able to 
>> get a lot of ideas about how to implement a parallel compression algorithm.  
>> Better yet, you might be able to reuse a lot of the existing pbzip2 code.
>> 
>> Mark Adler's pigz is also worth studying.  It's also license-friendly, and 
>> is built on top of regular zlib, which is a nice technique when it's 
>> feasible.
> 
> Just a small note: There's a parallel implementation of xz called
> "pixz". It's built atop liblzma and libarchive and is under a
> BSD style license. See: https://github.com/vasi/pixz Maybe it's
> possible to reuse most of the code.


See also below, which has some bugfixes/improvements that AFAIK were never 
committed in the original project (though they were submitted).
https://github.com/jlrobins/pixz

JN


Re: Fwd: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program

2012-10-02 Thread John Baldwin

On Tuesday, October 02, 2012 10:29:49 am Garrett Cooper wrote:
> On Tue, Oct 2, 2012 at 4:50 AM, John Baldwin  wrote:
> 
> ...
> 
> > This sounds like a superior approach.  It doesn't break any current use
> > cases while giving the ability to build multiple programs in the few
> > places that need it.  It sounds like there are a few places under gnu/
> > from Garrett's reply that might be able to make use of this as well.
> 
> For the record, gnu/cc/cc_tools/Makefile is where I first spotted a
> potential "bsd.progs.mk" candidate. Most of the other code doesn't
> care given how things are organized in our source tree.
> 
> > BTW, one general comment.  There seem to be two completely independent
> > groups of folks working on ATF (e.g. there have been two different
> > imports of ATF into the tree in two different locations IIRC, and now
> > we have two different sets of patches to our system makefiles).
> >
> > Are these two groups talking to each other at all?  I know in May that
> > many folks (certainly multiple vendors) are interested in ATF, and it
> > seems that both Juniper and Isilon have ported ATF internally.  It seems
> > that it might be good for the two groups to work together to avoid
> > stomping on each other's toes.  It seems there are some differences in
> > the two approaches that merit working out to avoid a lot of wasted
> > effort on both sides.
> 
> Both parties (Isilon/Juniper) are converging on the ATF porting work
> that Giorgos/myself have done after talking at the FreeBSD Foundation
> meet-n-greet. I have contributed all of the patches that I have over
> to marcel for feedback.

This is very non-obvious to the public at large (e.g. there was no public
response to one group's inquiry about the second ATF import).  Also, given
that you had no idea that sgf@ and obrien@ were working on importing NetBSD's
bmake as a prerequisite for ATF, it seems that whatever discussions were held
were not very detailed at best.  I think it would be good for the various
folks working on ATF to at least summarize the current state of things and
sketch out some sort of plan or roadmap for future work in a public forum
(such as atf@, though a summary mail would be quite appropriate for arch@).

-- 
John Baldwin

Re: Fwd: [CFT/RFC]: refactor bsd.prog.mk to understand multiple programs instead of a singular program

2012-10-02 Thread John Baldwin

On Monday, October 01, 2012 6:31:00 pm Simon J. Gerraty wrote:
> Hi Garrett,
> 
> >> From: Garrett Cooper 
> >> Subject: [CFT/RFC]: refactor bsd.prog.mk to understand multiple
> >> programs instead of a singular program
> >> Date: September 2, 2012 11:01:09 PM PDT
> >> To: freebsd-hackers@freebsd.org
> >> Cc: "freebsd-a...@freebsd.org Arch" 
> >>
> >> Hello,
> >>I've been a bit busy working on porting over ATF from NetBSD, and
> 
> Thanks, we've been using ATF in Junos for a while and glad to see it 
> being imported to FreeBSD.
> 
> >> one of the pieces that's currently not available in FreeBSD that's
> >> available in NetBSD is the ability to understand and compile multiple
> >> programs. In order to do this I had to refactor bsd.prog.mk (a lot).
> 
> A change like this to bsd.prog.mk can have considerable fallout.
> Eg. any makefile that tweaks OBJS is suddenly out of luck.
> 
> Not to mention the fact that bsd.prog.mk goes from being relatively
> simple, to unspeakably hard to read, and all for rather limited return. 
> 
> Apart from ATF, is there any huge demand for building multiple progs in
> the same directory?
> 
> FWIW we use progs.mk (as bsd.progs.mk) from
> ftp://ftp.netbsd.org/pub/NetBSD/misc/sjg/mk-*.tar.gz 
> It isn't ideal, but it certainly avoids a lot of churn and complexity
> for what is essentially a corner case.

This sounds like a superior approach.  It doesn't break any current use
cases while giving the ability to build multiple programs in the few
places that need it.  It sounds like there are a few places under gnu/
from Garrett's reply that might be able to make use of this as well.

BTW, one general comment.  There seem to be two completely independent
groups of folks working on ATF (e.g. there have been two different
imports of ATF into the tree in two different locations IIRC, and now
we have two different sets of patches to our system makefiles).

Are these two groups talking to each other at all?  I know in May that
many folks (certainly multiple vendors) are interested in ATF, and it
seems that both Juniper and Isilon have ported ATF internally.  It seems
that it might be good for the two groups to work together to avoid
stomping on each other's toes.  It seems there are some differences in
the two approaches that merit working out to avoid a lot of wasted
effort on both sides.

Do we already have a freebsd-atf@ mailing list?  If not, perhaps we
should create one and start these discussions there?

-- 
John Baldwin

Re: vendor import questions

2012-09-25 Thread John Baldwin

On Monday, September 24, 2012 5:31:37 pm Brooks Davis wrote:
> As part of switching to NetBSD's mtree I plan to import their versions
> of a few files that are part of libc (for example all the bits of
> vis/unvis).  I would like to do that via a vendor import, but I'm unsure
> where to put the files and how to tag them.  For mtree itself the right
> place is clearly base/vendor/NetBSD/mtree/dist, but we don't seem to
> have a good example for libc bits.
> 
> There is currently a base/vendor/NetBSD/dist directory containing a
> (very) partial source tree, but it seems to be unused in recent times.
> If I did import into that tree, the next question would be how to tag
> the import.  The base/vendor/NetBSD/fparseln_19990920/ directory shows
> one seemingly sensible example, but I don't like the resulting explosion
> of top level directories.  I also worry that having mixed versions in the
> libc directory would make any attempt at sensible merging difficult
> since we'd have to put mergeinfo on files.
> 
> An additional issue is where to put the files in the source tree.
> Precedent seems to favor direct copies to src/lib/libc/gen etc.  In some
> ways I think the optimal solution would be to put the bits in contrib
> in feature specific directories like contrib/libc/vis, but that might
> be annoying for some consumers.  That being said, the existence of
> src/include means you can't simply check out libc so it's probably ok to
> add more locations in the source tree for a good cause.
> 
> What's the right way to go here?

libc already has contrib bits (contrib/gdtoa).  I think something like
contrib/NetBSD/libc/ might be fine.  The problem I have with just
'contrib/libc' is that it is ambiguous.  OTOH, the contrib/NetBSD/libc
path isn't too pretty either.  One option would be to merge directly from
the vendor area into src/lib/libc.  One other option might be to just
do src/contrib/vis if it is only for 'vis' files.

-- 
John Baldwin

Re: serial console "detection" during boot

2012-09-25 Thread John Baldwin

On Saturday, September 22, 2012 11:35:21 am Andriy Gapon wrote:
> 
> Here is an update on the changes:
> http://people.freebsd.org/~avg/boot-comconsole.diff
> 
> Please note that the file is actually a patchset that consists of three
> individual changes:
> - a generic change in common boot code
> - libi386 comconsole change
> - BTX and boot2-ish comconsole change
> 
> All the code is lightly tested.
> As I am not an expert in the assembly code and also because boot2 is quite
> size-sensitive I would like to ask for a special attention to the last change.

This looks ok to me.  Did it pass a 'make universe'?

-- 
John Baldwin

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

2012-09-19 Thread John Baldwin

On Thursday, September 13, 2012 12:14:49 pm Mark Felder wrote:
> On Thu, 13 Sep 2012 10:11:28 -0500, Andriy Gapon  wrote:
> 
> > Just curious - does VMWare provide a remote debugger support (gdb stub)?
> 
> I'm not aware of one. What I have been able to successfully do is break  
> into the debugger during the hang but the info I've posted so far has not  
> been relevant to anyone. I'm hoping someone on the core team will  
> eventually be able to follow my guide and figure out what went wrong.

So the last e-mail I sent before this week I asked if you could get a 
crashdump from DDB?  The "flswai" case you pointed to was a lock deadlock, and 
having a crashdump would be really helpful for figuring out which threads were 
deadlocked.

Barring a crashdump, capturing 'ps' output from DDB would be a good first step 
(you can sanitize process names if you need to).  However, a crashdump that 
you can use kgdb on will make debugging this significantly easier.

-- 
John Baldwin

Re: vm info from a hung system

2012-09-17 Thread John Baldwin

On Friday, September 14, 2012 1:32:43 am Vijay Singh wrote:
> Need some expert help. I have a system that is hung hard, and I was
> able to get it into gdb. From show_vmstat I see:
> 
> (kgdb-amd64-7.4-95) show_vmstat
> SYSTEM MEMORY INFORMATION:
> mem_wire: 285970432 (272MB) Wired: disabled for paging out
> mem_active:  +400105472 (381MB) Active: recently referenced
> mem_inactive:+ 56840192 ( 54MB) Inactive: recently not referenced
> mem_cache:   +0 (  0MB) Cached: almost avail. for allocation
> mem_free:+0 (  0MB) Free: fully available for allocation
> mem_gap_vm:  +   753664 (  0MB) Memory gap: vm
> --  --- --
> mem_all: =743669760 (709MB) Total real memory managed
> mem_gap_sys: + 22765568 ( 21MB) Memory gap: system
> --  ---
> mem_phys:=766435328 (730MB) Total phys memory
> --  ---
> 
> SYSTEM MEMORY SUMMARY:
> mem_used: 709595136 (676MB)  Used memory
> mem_avail:   + 56840192 ( 54MB) Available memory
> --  --- --
> mem_total:   =766435328 (730MB) Total memory
> 
> What is this telling me?

Oof.  I think we generally don't cope with not having any
free memory at all (mem_cache + mem_free).  That is, I imagine
the system was unable to make forward progress, possibly it
had to malloc() something (GEOM is terrible for doing this)
while trying to page out something to free up space.  I would
look at the state of the pagedaemon kthread to see why it isn't
able to run.

-- 
John Baldwin

Re: Order in which a driver attaches to devices

2012-09-13 Thread John Baldwin

On Friday, September 07, 2012 10:48:39 am John Baldwin wrote:
> On Thursday, September 06, 2012 5:08:27 pm Navdeep Parhar wrote:
> > I have a system with multiple cards supported by cxgbe(4).  When I build 
> > a kernel with the driver compiled in, it attaches to the cards in a 
> > different order from when it's loaded as a module.  Why?  The network 
> > interfaces get re-ordered and this is quite annoying.
> 
> Hmm.  The boot time probe does a depth first walk of the PCI bus.  This is
> what is suggested by PCI-SIG for enumerating PCI buses (and is normally how
> BIOSs walk the bus assigning bus numbers).  The walk that is done at kldload
> time walks the 'pciX' bus devices in numerical order (rather than walking the
> tree).  I suspect your BIOS is doing something weird and assigning bus numbers
> in a non-depth first ordering so that the two orderings are not consistent as
> they are on other machines.

BTW, another fix is to stop trying to force unit numbers to match PCI bus
numbers (e.g. change pcib_attach() in pci_pci.c to use -1 instead of
sc->secbus).  A few other places would need to be changed as well:
acpi_pcib_attach(), legacy_pcib_attach(), qpi_pcib_attach(),
mptable_hostb_attach().  If we went this route we should probably do it on
other platforms as well.  (Some, such as sparc64 already do this.)
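
The difference between the two walks can be shown with a toy model.  This
is only a sketch: struct bus, walk_depth_first() and sort_numeric() are
hypothetical names, and the tree is a made-up example rather than the real
newbus device tree.

```c
#include <stddef.h>

#define	MAXBUS	8	/* capacity of the output array in this toy model */

/* Hypothetical bus tree node: a bus number and up to two child buses. */
struct bus {
	int		 num;
	struct bus	*child[2];
};

/* Boot-time style: visit a bus, then recurse into its children in order. */
static size_t
walk_depth_first(const struct bus *b, int *out, size_t n)
{
	if (b == NULL || n >= MAXBUS)
		return (n);
	out[n++] = b->num;
	n = walk_depth_first(b->child[0], out, n);
	n = walk_depth_first(b->child[1], out, n);
	return (n);
}

/* Module-load style: the same buses, visited in ascending bus number. */
static void
sort_numeric(int *v, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		int x = v[i];
		size_t j = i;

		/* Plain insertion sort; the lists here are tiny. */
		while (j > 0 && v[j - 1] > x) {
			v[j] = v[j - 1];
			j--;
		}
		v[j] = x;
	}
}
```

If firmware numbers the buses in depth-first order, the two sequences are
identical.  If the first bridge under the root is given a higher bus number
than the second, the depth-first walk reaches devices in a different order
than the numeric walk, which matches the reordering described above.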

-- 
John Baldwin

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

2012-09-12 Thread John Baldwin

On Wednesday, June 06, 2012 9:34:02 pm Mark Felder wrote:
> Hi guys I'm excitedly posting this from my phone. Good news for you guys,
> bad news for us -- we were building HA storage on vmware for a client and can
> now replicate the crash on demand. I'll be posting details when I get home to
> my PC tonight, but this hopefully is enough to replicate the crash for any
> curious followers:
> 
> ESXi 5
> 9 or 9-STABLE
> HAST 
> 1 cpu is fine
> 1GB of ram
> UFS SUJ on HAST device
> No special loader.conf, sysctl, etc
> No need for VMWare tools
> Run Bonnie++ on the HAST device
> 
> We can get the crash to happen on the first run of bonnie++ right now. I'll
> post the exact specs and precise command run in the PR. We found an old post
> from 2004 when we looked up the process state obtained from CTRL+T -- flswai
> -- which describes the symptoms nearly perfectly.
> 
>  http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2004-02/0250.html 
> 
> Hopefully this gets us closer to a fix...

Sorry, I just now saw this.  :(

Are you still seeing this, and if so can you get a crashdump?  Also, I'm 
curious if you only see this with SUJ or if plain UFS+SU works fine?

-- 
John Baldwin

Re: What happened to my /proc/curproc/file?

2012-09-07 Thread John Baldwin

On Friday, September 07, 2012 12:39:50 pm Konstantin Belousov wrote:
> On Fri, Sep 07, 2012 at 12:23:54PM -0400, John Baldwin wrote:
> > On Friday, September 07, 2012 11:59:36 am Konstantin Belousov wrote:
> > > On Fri, Sep 07, 2012 at 10:33:52AM -0400, John Baldwin wrote:
> > > > On Tuesday, September 04, 2012 7:46:23 pm Sam Varshavchik wrote:
> > > > > Is the dev+ino of what was exec()ed known, for another process? I
> > > > > might be able to get the client voluntarily submit its argv[0], then
> > > > > independently have the server validate it by stat()ing that, and
> > > > > comparing the result against what the kernel says the process's
> > > > > inode is.
> > > > 
> > > > It's known in the kernel certainly.  I don't think we currently have
> > > > any way of exporting that info to userland however.
> > > 
> > > It is, as  KF_FD_TYPE_TEXT by sysctl kern.proc.filedesc.
> > 
> > That doesn't include stat info though IIRC.  You can get a pathname that is
> > the same you would get from /proc/curproc/file (so it may fail and be
> > empty), but you don't get st_dev or st_ino.
> > 
> > I have thought that it might be useful for kinfo_file to include a full
> > 'struct stat' and use the fo_stat() method of each file to fill it in, but
> > that is not present currently.
> 
> ino is in kf_file_fileid, and rdev in kf_file_rdev. Also there is
> fsid in kf_file_fsid.

Oh, foo.  I was looking at the 'o' variants.
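
For the check Sam described, a minimal sketch of the server side might look
like the following.  It only covers the comparison: stat() the path the
client claims as argv[0] and compare st_dev and st_ino against the values the
kernel reported for the process's text file.  The function name is made up,
and how the caller obtains the kernel-side values (presumably kf_file_fsid
and kf_file_fileid from the KF_FD_TYPE_TEXT entry, assuming fsid is the field
that lines up with st_dev) is left out.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if 'path' currently resolves to the
 * file identified by (dev, ino), 0 if it resolves to a different file,
 * and -1 if stat() fails (e.g. the path does not exist).
 */
int
path_matches_identity(const char *path, dev_t dev, ino_t ino)
{
	struct stat sb;

	if (stat(path, &sb) != 0)
		return (-1);
	return (sb.st_dev == dev && sb.st_ino == ino);
}
```

Note that a match only says the path names the same file right now; the file
could have been replaced between the exec and the check.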

-- 
John Baldwin
