Re: PCI: disable I/O or mem before probing BAR size

2020-05-04 Thread David Young
On Mon, May 04, 2020 at 05:56:13PM -0400, Mouse wrote:
> On the other, if anything else could possibly be poking at the device
> while you're probing its mapping register to see how big it is, you've
> got much worse problems already.

If CPU 2 tries to read/write registers on device B while CPU 1 probes
device A's BAR for type/size, device A is enabled, and device A's base
address is momentarily undefined, then I can see devices A and B both
responding to the same transaction and causing a fault to occur.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: PCI: disable I/O or mem before probing BAR size

2020-05-04 Thread David Young
On Mon, May 04, 2020 at 09:17:28PM +0200, Manuel Bouyer wrote:
> Does anyone see a problem with this ?

On the contrary, sounds like something we should have always done!

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Symbol debugging support for kernel modules in crash dumps

2020-05-01 Thread David Young
Fantastic! Thanks.

Dave

Spilling kerrectud by iPhone

> On May 1, 2020, at 6:34 PM, Christos Zoulas  wrote:
> 
> 
> Hi,
> 
> I just added symbol debugging support for modules in kernel dumps.
> Things are not perfect because of what I call "current thread
> confusion" in the kvm target, but as you see in the following
> session it works just fine if you follow the right steps. First of
> all you need a build from HEAD that has the capability to build
> .debug files for kernel modules.  Once that's done, you are all
> set; see how it works (comments prefixed by )
> 
> Enjoy,
> 
> christos
> 
> $ gdb /usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb
> GNU gdb (GDB) 8.3
> Copyright (C) 2019 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64--netbsd".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
>.
> 
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb...
> (gdb) target kvm netbsd.22.core
> 0x80224375 in cpu_reboot (howto=howto@entry=260, 
>bootstr=bootstr@entry=0x0) at ../../../../arch/amd64/amd64/machdep.c:718
> warning: Source file is more recent than executable.
> 718 if (s != IPL_NONE)
> 
>  Ok we got a stacktrace here, but we don't have a current thread...
>  So we set it...
> 
> (gdb) info thread
>  Id   Target Id Frame 
> * 2.1   0x80224375 in cpu_reboot (
>howto=howto@entry=260, bootstr=bootstr@entry=0x0)
>at ../../../../arch/amd64/amd64/machdep.c:718
> 
> No selected thread.  See `help thread'.
> (gdb) thread 2.1
> 
> [Switching to thread 2.1 ()]
> #0  0x80224375 in ?? ()
> 
>  Note that here we lost all symbol table access when we switched threads
>  let's load it again..
> 
> (gdb) add-symbol-file /usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb
> add symbol table from file "/usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb"
> (y or n) y
> Reading symbols from /usr/src/sys/arch/amd64/compile/QUASAR/netbsd.gdb...
> 
>  OK, lets load our modules
> 
> (gdb) source /usr/src/sys/gdbscripts/modload 
> (gdb) modload
> add symbol table from file "/stand/amd64/9.99.59/modules/ping/ping.kmod" at
>.text_addr = 0x8266e000
>.data_addr = 0x8266b000
>.rodata_addr = 0x8266c000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/nfsserver/nfsserver.kmod" at
>.text_addr = 0x82a64000
>.data_addr = 0x82669000
>.rodata_addr = 0x8298e000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/npf_ext_log/npf_ext_log.kmod" at
>.text_addr = 0x82668000
>.data_addr = 0x82667000
>.rodata_addr = 0x82969000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/npf_alg_icmp/npf_alg_icmp.kmod" at
>.text_addr = 0x82666000
>.data_addr = 0x82665000
>.rodata_addr = 0x82952000
> add symbol table from file "/stand/amd64/9.99.59/modules/bpfjit/bpfjit.kmod" 
> at
>.text_addr = 0x82661000
>.data_addr = 0x0
>.rodata_addr = 0x828dd000
> add symbol table from file "/stand/amd64/9.99.59/modules/sljit/sljit.kmod" at
>.text_addr = 0x82945000
>.data_addr = 0x82664000
>.rodata_addr = 0x828f9000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/if_npflog/if_npflog.kmod" at
>.text_addr = 0x8266
>.data_addr = 0x8265f000
>.rodata_addr = 0x828ca000
> add symbol table from file "/stand/amd64/9.99.59/modules/npf/npf.kmod" at
>.text_addr = 0x82648000
>.data_addr = 0x82647000
>.rodata_addr = 0x826d6000
> add symbol table from file "/stand/amd64/9.99.59/modules/bpf/bpf.kmod" at
>.text_addr = 0x82622000
>.data_addr = 0x82621000
>.rodata_addr = 0x826a3000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/bpf_filter/bpf_filter.kmod" at
>.text_addr = 0x8263c000
>.data_addr = 0x0
>.rodata_addr = 0x82627000
> add symbol table from file 
> "/stand/amd64/9.99.59/modules/scsiverbose/scsiverbose.kmod" at
>.text_addr = 0x826a2000
>.data_addr = 0x82686000
>.rodata_addr = 0x82687000
> add symbol table from file 
> "/st

Re: racy access in kern_runq.c

2019-12-06 Thread David Young
On Fri, Dec 06, 2019 at 06:33:32PM +0100, Maxime Villard wrote:
> Le 06/12/2019 à 17:53, Andrew Doran a écrit :
> >Why atomic_swap_ptr() not atomic_store_relaxed()?  I don't see any bug that
> >it fixes.  Other than that it look OK to me.
> 
> Because I suggested it; my concern was that if not explicitly atomic, the
> cpu could make two writes internally (even though the compiler produces
> only one instruction), and in that case a page fault would have been possible
> because of garbage dereference.

atomic_store_relaxed() is not explicitly atomic?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: __{read,write}_once

2019-11-21 Thread David Young
On Thu, Nov 21, 2019 at 07:19:51PM +0100, Maxime Villard wrote:
> Le 18/11/2019 à 19:49, David Holland a écrit :
> >On Sun, Nov 17, 2019 at 02:35:43PM +, Mindaugas Rasiukevicius wrote:
> >  > David Holland  wrote:
> >  > >  > I see the potential source of confusion, but just think about: what
> >  > >  > could "atomic" possibly mean for loads or stores?  A load is just a
> >  > >  > load, there is no RMW cycle, conceptually; inter-locking the 
> > operation
> >  > >  > does not make sense.  For anybody who attempts to reason about this,
> >  > >  > it should not be too confusing, plus there are man pages.
> >  > >
> >  > > ...that it's not torn.
> >  > >
> >  > > As far as names... there are increasingly many slightly different
> >  > > types of atomic and semiatomic operations.
> >  > >
> >  > > I think it would be helpful if someone came up with a comprehensive
> >  > > naming scheme for all of them (including ones we don't currently have
> >  > > that we're moderately likely to end up with later...)
> >  >
> >  > Yes, the meaning of "atomic" has different flavours and describes 
> > different
> >  > set of properties in different fields (operating systems vs databases vs
> >  > distributed systems vs ...) and, as we can see, even within the fields.
> >  >
> >  > Perhaps not ideal, but "atomic loads/stores" and "relaxed" are already 
> > the
> >  > dominant terms.
> >
> >Yes, but "relaxed" means something else... let me be clearer since I
> >wasn't before: I would expect e.g. atomic_inc_32_relaxed() to be
> >distinguished from atomic_inc_32() or maybe atomic_inc_32_ordered() by
> >whether or not multiple instances of it are globally ordered, not by
> >whether or not it's actually atomic relative to other cpus.
> >
> >Checking the prior messages indicates we aren't currently talking
> >about atomic_inc_32_relaxed() but only about atomic_load_32_relaxed()
> >and atomic_store_32_relaxed(), which would be used together to
> >generate a local counter. This is less misleading, but I'm still not
> >convinced it's a good choice of names given that we might reasonably
> >later on want to have atomic_inc_32_relaxed() and
> >atomic_inc_32_ordered() that differ as above.
> >
> >  > I think it is pointless to attempt to reinvent the wheel
> >  > here.  It is terminology used by C11 (and C++11) and accepted by various
> >  > technical literature and, at this point, by academia (if you look at the
> >  > recent papers on memory models -- it's pretty much settled).  These terms
> >  > are not too bad; it could be worse; and this bit is certainly not the 
> > worst
> >  > part of C11.  So, I would just move on.
> >
> >Is it settled? My experience with the academic literature has been
> >that everyone uses their own terminology and the same words are
> >routinely found to have subtly different meanings from one paper to
> >the next and occasionally even within the same paper. :-/  But I
> >haven't been paying much attention lately due to being preoccupied
> >with other things.
> 
> So in the end which name do we use? Are people really unhappy with _racy()?
> At least it has a general meaning, and does not imply atomicity or ordering.

_racy() doesn't really get at the intended meaning, and everything in
C11 is racy unless you arrange otherwise by using mutexes, atomics, etc.
The suffix has very little content.

Names such as load_/store_fully() or load_/store_entire() or
load_/store_completely() get to the actual semantics: at the program
step implied by the load_/store_entire() expression, the memory
constituting the named variable is loaded/stored in its entirety.  In
other words, the load cannot be drawn out over more than one local
program step.  Whether or not the load/store is performed in one step
with respect to interrupts or other threads is not defined.

I feel like load_entire() and store_entire() get to the heart of the
matter while being easy to speak, but _fully() or _completely() seem
fine.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: __{read,write}_once

2019-11-06 Thread David Young
On Wed, Nov 06, 2019 at 06:57:07AM -0800, Jason Thorpe wrote:
> 
> 
> > On Nov 6, 2019, at 5:41 AM, Kamil Rytarowski  wrote:
> > 
> > On 06.11.2019 14:37, Jason Thorpe wrote:
> >> 
> >> 
> >>> On Nov 6, 2019, at 4:45 AM, Kamil Rytarowski  wrote:
> >>> 
> >>> I propose __write_relaxed() / __read_relaxed().
> >> 
> >> ...except that seems to imply the opposite of what these do.
> >> 
> >> -- thorpej
> >> 
> > 
> > Rationale?
> > 
> > This matches atomic_load_relaxed() / atomic_write_relaxed(), but we do
> > not deal with atomics here.
> 
> Fair enough.  To me, the names suggest "compiler is allowed to apply relaxed 
> constraints and tear the access if it wants" But apparently the common 
> meaning is "relax, bro, I know what I'm doing".  If that's the case, I can 
> roll with it.

After reading this conversation, I'm not sure of the semantics.

I *think* the intention is for __read_once()/__write_once() to
load/store the entire variable from/to memory precisely once.  They
provide no guarantees about atomicity of the load/store.  Should
something be said about ordering and visibility of stores?

If x is initialized to 0xf00dd00f, two threads start, and thread
1 performs __read_once(x) concurrently with thread 2 performing
__write_once(x, 0xfeedbeef), then what values can thread 1 read?

Do __read_once()/__write_once() have any semantics with respect to
interrupts?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Portability fix in kern_ntptime.c

2019-06-05 Thread David Young
On Thu, Jun 06, 2019 at 04:42:50AM +0700, Robert Elz wrote:
> Further, I'd never do it without a thorough review of the code,
> if you looked, you'd also see
> 
> freq = (ntv->freq * 1000LL) >> 16;
> 
> and
> 
> ntv->ppsfreq = L_GINT((pps_freq / 1000LL) << 16);
> 
> (and perhaps more) - if one of those is a shift of a negative number,
> the others potentially are as well (not that shifts of negative numbers
> bother me at all if the code author understood what is happening, which
> I suspect that they did here.)

Unfortunately, if "undefined behavior" (UB) is invoked, you simply
cannot claim to understand what is happening, because C11-compliant
compilers have a lot of leeway in what code they generate when behavior
is undefined.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Strange PCI bug

2018-07-31 Thread David Young
On Tue, Jul 31, 2018 at 06:11:32PM +0200, Maxime Villard wrote:
> It is really strange. The fact that we receive exception 0x10 seems to
> indicate that there is a problem somewhere related to the pin/irq
> initialization, but I don't really see how the delay could fix that. Or
> maybe the printfs do something more than just adding delay, I don't
> really know.
> 
> Does that ring a bell to someone? Running out of ideas... Phew, I liked
> these machines...

Supposing it is actually a PCI exception:

Maybe the delay gives the target of the `outl` enough time to finish
some initialization.  Maybe the target doesn't respond to PCI
I/O-space writes until the initialization has finished, and the write
is retried until some retry limit is exceeded.

It's also possible that the PCI I/O-space access generated by the `outl`
is flushing a posted PCI memory-space access, and it's the memory-space
access that causes a PCI exception.

Can you look at the PCI error state on the bridges and devices in the
system?  I don't remember if `ddb` will show that, but it may not be
hard to add.  The error state may lead you quickly to the precise device
that's generating the exception.
 
Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: urtwn driver is spammy

2018-06-28 Thread David Young
On Thu, Jun 28, 2018 at 11:48:56AM -0500, David Young wrote:
> On Thu, Jun 28, 2018 at 12:47:06PM +0200, Radoslaw Kujawa wrote:
> > 
> > 
> > > On 28 Jun 2018, at 12:35, Benny Siegert  wrote:
> > > 
> > > urtwn0: could not send firmware command 5
> > 
> > This means it’s unable to set radio signal strength.
> 
> It's not clear *how* the firmware uses the RSSI for its rate adaptation,
> so I cannot say for sure, but it looks to me like the RSSI may be
> averaged over many stations when there is a base station.

Sorry, meant to say "too many stations."

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: urtwn driver is spammy

2018-06-28 Thread David Young
On Thu, Jun 28, 2018 at 12:47:06PM +0200, Radoslaw Kujawa wrote:
> 
> 
> > On 28 Jun 2018, at 12:35, Benny Siegert  wrote:
> > 
> > urtwn0: could not send firmware command 5
> 
> This means it’s unable to set radio signal strength.

It's not clear *how* the firmware uses the RSSI for its rate adaptation,
so I cannot say for sure, but it looks to me like the RSSI may be
averaged over many stations when there is a base station.

For the firmware to use the average RSSI, as computed, for its rate
adaptation is probably unhelpful/counterproductive in an adhoc network
(IBSS mode).

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: CVS commit: src/sys/dev/pci

2018-05-17 Thread David Young
On Thu, May 17, 2018 at 12:01:30PM -0500, Jonathan A. Kollasch wrote:
> On Wed, May 16, 2018 at 01:45:36PM -0700, Jason Thorpe wrote:
> > 
> > 
> > > On May 16, 2018, at 1:07 PM, Jonathan A. Kollasch  
> > > wrote:
> > > 
> > > I'm a bit uneasy about it myself for that same reason.  However, we
> > > do not to my knowledge have the infrastructure available to do a
> > > complete validation of the resource assignment.  If we did, we'd be
> > > able to do hot attach of PCIe ExpressCard with just a little more work.
> > 
> > 
> > We used to have something like this to support CardBus way back in the day, 
> > but I will admit I wasn’t entirely happy with it at the time.
> 
> rbus?  That's still around, and it's still ugly and doesn't always work.

I have some patches where I had begun to unify CardBus and PCI I/O- &
memory-space management using vmem(9).  The patches have probably rotted
since I stopped working on them 5+ years ago, but it might be easier to
fix them than to start from scratch.  If you want to take this on, I
will try to scare up the patches.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: compat code function pointers

2018-03-18 Thread David Young
On Sun, Mar 18, 2018 at 09:00:06PM -0400, Christos Zoulas wrote:
> Paul suggested:
> 
>   src/sys/kern/kern_junk.c
>   src/sys/sys/kern_junk.h
> 
> I suggested:
> 
>   src/sys/kern/kern_compat.c
>   src/sys/sys/compat.h

I think I have used src/sys/kern/kern_stub.c for a similar purpose in
the past.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: setting DDB_COMMANDONENTER="bt" by default

2018-02-15 Thread David Young
On Thu, Feb 15, 2018 at 06:27:25PM +, David Brownlee wrote:
> Is there some useful variant where the panic message is shown again at the
> end of the stack trace, or the stack trace defaults to a very small number
> of entries by default?

I figure it is not a simple matter to program, but you could probably
print the stack trace "upside down" followed by the panic message?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Restricting rdtsc [was: kernel aslr]

2017-03-28 Thread David Young
On Tue, Mar 28, 2017 at 06:47:11PM +0200, Manuel Bouyer wrote:
> On Tue, Mar 28, 2017 at 11:30:52AM -0500, David Young wrote:
> > [...]
> > What do you mean by "legitimately" use rdtsc?  It seems to me that it
> > is legitimate for a user to use a high-resolution timer to profile some
> > code that's under development.  They may want to avoid running that code
> > with root privileges under most circumstances.
> > 
> 
> Sure.
> At the very last a sysctl to remove the restriction is needed.

Just to expand on that, an interface to set the restriction on a
per-process (per-thread?) level would be handy.

Capabilities beckon! :-)

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Restricting rdtsc [was: kernel aslr]

2017-03-28 Thread David Young
On Tue, Mar 28, 2017 at 04:58:58PM +0200, Maxime Villard wrote:
> Having read several papers on the exploitation of cache latency to defeat
> aslr (kernel or not), it appears that disabling the rdtsc instruction is a
> good mitigation on x86. However, some applications can legitimately use it,
> so I would rather suggest restricting it to root instead.

I may not understand some of your premises.

Why do you single out the rdtsc instruction instead of other time
sources?

What do you mean by "legitimately" use rdtsc?  It seems to me that it
is legitimate for a user to use a high-resolution timer to profile some
code that's under development.  They may want to avoid running that code
with root privileges under most circumstances.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: USB device [was "USB slave"]

2017-03-09 Thread David Young
On Thu, Mar 09, 2017 at 03:14:14PM -0500, Terry Moore wrote:
> However, please, please, please: let us follow the USB standard
> nomenclature.

I guess this means that my proposal to call the "device" role the
"index," "cache," or "staging area," according to our current mood, is
dead on arrival? :-)

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Spinning down sata disk before poweroff

2016-06-18 Thread David Young
On Sat, Jun 18, 2016 at 10:51:25PM +0200, Manuel Bouyer wrote:
> On Sat, Jun 18, 2016 at 03:31:20PM -0500, David Young wrote:
> > BTW, what should we do during "manual" action, such as 'drvctl -d wd0'?
> > Seems like we should power off the drive then, too?  Otherwise, it is
> > set up for an abrupt power-loss, later.
> 
> But if you do this in order to do a rescan, you will power down/up
> the drive, which is not so good for enterprise-class drives (for drives
> designed for 24x7 SMART counts the number of stop/start cycles)

I don't understand.  Why detach in order to do a rescan?

BTW, I think that when I wrote "power off the drive," I should have
written "put it into standby."  I'm not sure of the formal meaning of
standby, but I reckon it is up to the drive (controller?) whether or not
to stop the spindle as well as park the heads.

Seen in a certain perspective, it's asymmetric that frequently the
BIOS powers a drive *up* before the bootloader runs, but the OS is
responsible to power a drive *down* during power off, or else the drive
may abruptly lose power.  Is there some way to hand responsibility for
the drive's power state back to the BIOS?  I guess that on x86, that
would be an ACPI BIOS method.  Don't we use an ACPI method to power down
the machine, after all?  ISTR we put the machine into "sleep state 5".

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL   http://trestle.tech/


Re: Spinning down sata disk before poweroff

2016-06-18 Thread David Young
On Fri, Jun 17, 2016 at 09:58:07PM +0200, Manuel Bouyer wrote:
> On Fri, Jun 17, 2016 at 11:59:09AM -0500, David Young wrote:
> > A less intrusive change that's likely to work pretty well, I think, is
> > to introduce a new flag, DETACH_REBOOT or DETACH_STAY_POWERED, that's
> > passed to config_detach_all() by cpu_reboot() when the RB_* flags
> > indicate a reboot is happening.  Then, in the wd(4) detach routine, put
> > the device into standby mode if the flag is not set.
> 
> I'd prefer to have it the other way round then: a DETACH_POWEROFF
> which is set only for halt -p.

Ok.

BTW, what should we do during "manual" action, such as 'drvctl -d wd0'?
Seems like we should power off the drive then, too?  Otherwise, it is
set up for an abrupt power-loss, later.

I may have something to say about the unusual asymmetry of BIOS and OS
responsibilities here, later.

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL   http://trestle.tech/


Re: Spinning down sata disk before poweroff

2016-06-17 Thread David Young
On Fri, Jun 17, 2016 at 01:23:36PM +0200, Manuel Bouyer wrote:
> On Fri, Jun 17, 2016 at 01:49:43AM +, Anindya Mukherjee wrote:
> > Hi,
> > 
> > I'm running NetBSD 7.0.1_PATCH (GENERIC) amd64 on a Dell laptop. Almost 
> > everything is working perfectly, except the fact that every time I shutdown 
> > using the -p switch, the hard drive makes a loud click sound as the system 
> > powers off. I checked the SMART status (atactl and smartctl) and after 
> > every poweroff the Power_Off-Retract-Count parameter increases by 1.
> > 
> > I did some searching on the web and came across PR #21531 where this issue 
> > was discussed from 2003-2008 and finally a patch was committed which 
> > resolved the issue by sending the STANDBY_IMMEDIATE command to the disk 
> > before powering off. Since then the code has been refactored, but it is 
> > present in src/sys/dev/ata/wd.c line 1970 (wd_shutdown) which calls line 
> > 1848 (wd_standby). This seemed strange since the disk was definitely not 
> > being spun down.
> > 
> > I attached a remote gdb instance and stepped through the code during 
> > shutdown, breaking on wd_flushcache() which is always called. The code path 
> > taken is wdclose()->wdlastclose() (lines 1029, 1014). I can see that the 
> > cache is flushed but then the device is deleted in line 1023. Subsequently, 
> > power is cut off during shutdown, causing an emergency retract. So, it 
> > seems at least for newer sata disks the spindown code is not being called. 
> > I'm fairly new to NetBSD code so there is a chance I read this wrong, so 
> > feel free to correct me.
> > 
> > Ideally I'd like the disk to spin down during poweroff (-p) and halt (-h), 
> > perhaps settable using a sysctl, but not during a reboot (-r). I am 
> > planning to patch wdlastclose() as an experiment to run the spindown code 
> > to see if it stops the click. Is this a known issue, worthy of a PR? I can 
> > file one. I can also volunteer a patch once I have tested it on my laptop. 
> > Comments welcome!
> 
> 
> So the disk is not powered off because it's detached before the pmf framework
> has a chance to power it off (see amd64/amd64/machdep.c:cpu_reboot()).
> that's bad.
> Doing the poweroff in wdlastclose() is bad because then you'll have a
> poweroff/powerup cycle for a reboot, or even on unmount/mount events if this
> is not your root device. This can be harmfull for some disks (this has already
> been discussed).
> 
> The (untested) attached patch should fix this by calling pmf before detach;
> can you give it a try ?

Careful!  The alternation of detaching devices and unmounting
filesystems is purposeful.  You can have devices such as vnd(4) backed
by filesystems backed by further devices.

It's possible that unmounting a filesystem will counteract the PMF
shutdown.

A less intrusive change that's likely to work pretty well, I think, is
to introduce a new flag, DETACH_REBOOT or DETACH_STAY_POWERED, that's
passed to config_detach_all() by cpu_reboot() when the RB_* flags
indicate a reboot is happening.  Then, in the wd(4) detach routine, put
the device into standby mode if the flag is not set.

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL   http://trestle.tech/


Re: Introduce curlwp_bind and curlwp_unbind for psref(9)

2016-06-14 Thread David Young
On Wed, Jun 15, 2016 at 10:56:57AM +0900, Ryota Ozaki wrote:
> On Tue, Jun 14, 2016 at 7:58 PM, Joerg Sonnenberger  wrote:
> > On Tue, Jun 14, 2016 at 09:53:33AM +0900, Ryota Ozaki wrote:
> >> - curlwp_bind and curlwp_unbind
> >> - curlwp_bound_set and curlwp_bound_restore
> >> - curlwp_bound and curlwp_boundx
> >>
> >> Any other ideas? :)
> >
> > curlwp_bind_push / curlwp_bind_pop
> 
> Hmm, I think the naming fits in if Linux, but not in NetBSD.
> And
>   bound = curlwp_bind_push();
>   ...
>   curlwp_bind_pop(bound);
> looks odd to me.

bound = curlwp_bind_get(); curlwp_bind_put(bound)?

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL    http://trestle.tech


Re: CPU-local reference counts

2016-06-13 Thread David Young
On Mon, Jun 13, 2016 at 04:54:32PM +, Taylor R Campbell wrote:
> Prompted by the discussion last week about making sure autoconf
> devices and bdevsw/cdevsw instances don't go away at inopportune
> moments during module unload, I threw together a generic sketch of the
> CPU-local reference counts I noted, which I'm calling localcount(9).

Hi Taylor,

Here are a couple of ideas that I probably picked up from other
NetBSDers and from contributors to an old thread,
<https://mail-index.netbsd.org/tech-kern/2013/01/17/msg014872.html>.

One way both to save some memory and to reduce the cache footprint of a
reference-counting scheme like this is to use "narrow" per-CPU counters
(uint16_t, say) and a wide, shared counter that's increased, using an
interlocked atomic operation, whenever a per-CPU counter rolls over.

To save some more memory, you can make struct localcount,

struct localcount {
int64_t *lc_totalp;
struct percpu   *lc_percpu; /* int64_t */
};

into a handle,

struct localcount {
unsigned intlc_slot;
};

where the lc_slot indicates an index into a few arrays: a per-CPU array
of local counters, a global array of shared counters, and an array of
whichever flags ("draining") that you may require.

Depending how many localcount you expect to be extant in the system, you
can make lc_slot a uint16_t or a uint32_t.

Dave

-- 
David Young //\ Trestle Technology Consulting
(217) 721-9981  Urbana, IL    http://trestle.tech


Re: Locking strategy for device deletion (also see PR kern/48536)

2016-06-07 Thread David Young
On Tue, Jun 07, 2016 at 06:28:11PM +0800, Paul Goyette wrote:
> Can anyone suggest a reliable way to ensure that a device-driver
> module can be _really_ safely detached?
> 
> The module could theoretically maintain an open/ref counter, but
> making this MP-safe is "difficult"!  Even if the module were to
> provide a mutex to control increment/decrement of it's counter,
> there's still a problem:
> 
> Thread 1 initiates a module-unload, which takes the mutex
> 
> Thread 2 attempts to open the device (or one of its units), attempts to
> grab the mutex, and waits
> 
> Back in thread 1, the driver's module unload code determines that it
> is safe to unload (no current activites queued, no current opens),
> so it
> goes forward and unmaps the module - including the mutex!

I think that what's missing is a flag on the module that says it is
unloading, and module entrance/exit counters.  I think it could work
sort of like this---the devil is in the details:

Thread 1 initiates a module unload:
1) Acquires mutex
2) Sets the module's unloading flag
3) Unlinks module entry points---that is, they're still mapped,
   but there are no more globally-visible pointers to them
4) While module entrances > exits, sleeps on module condition
   variable C, thus temporarily releasing mutex
5) Releases mutex
6) Unmaps module

Thread 2 attempts to open the device
1) Increases module-entrance count
2) Acquires mutex
3) Examines unloading flag
a) Finding it set, signals condition variable C,
b) OR, finding it NOT set, performs open
4) increases module-exit count
5) releases mutex

The module entrance/exit counts can be per-CPU variables that you
increment using non-interlocked atomic instructions, which are not very
expensive.

Now, I am trying to remember if/why counting entrances and exits
separately is necessary.  ISTM that to avoid races, you want to add up
exits across all CPUs, first, then add up entrances, and compare.

This is not necessarily the best or only way to handle this, and I feel
sure that I've overlooked a fatal flaw in this first draft.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: nexthop cache separation

2016-04-04 Thread David Young
On Fri, Mar 25, 2016 at 07:41:18PM +0900, Ryota Ozaki wrote:
> [after]
> Routing tables
> 
> Internet:
> DestinationGatewayFlagsRefs  Use
> Mtu Interface
> default10.0.1.1   UGS --
>  -  shmif0
> 10.0.1/24  link#2 UC  --
>  -  shmif0
> 10.0.1.2   link#2 UHl --  -  lo0
> 127.0.0.1  lo0UHl --  33648  lo0

Previous to the change you've proposed, 'route show' provided a more
comprehensive view of the routing state.  Now, it is missing several
useful items.  Where has the 10.0.1.1 route gone?  Where are the
MAC addresses?  Previously, you could issue one command and within
an eyespan have all of the information that you needed to diagnose
connectivity problems on routers.  Now, to first appearances, every
routing state looks suspect, and it's necessary to dig in with arp/ndp
to see if things are ok.

Please, if your changes materially change the user interface, provide
mockups.  Mockups are powerful communication tools that help to build
consensus and provide strong implementation guidance.  Design oversights
that are obvious in mockups may be invisible in patches.  It's easy
to mockup command-line displays like route(8)'s using $EDITOR.  I
cannot recommend strongly enough that developers add mockups to their
engineering-communications repertoire.

Dave

[*] It was bad enough that networking in NetBSD contains many potential
switchbacks and blackholes once a firewall is active.  I don't think
we're worse than any other system in that regard, but ISTM we should
strive to be *better*.

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: RFC: nexthop cache separation

2016-03-22 Thread David Young
On Tue, Mar 22, 2016 at 01:14:39PM +0900, Ryota Ozaki wrote:
> Hi,
> 
> Here are new patches:
>   http://www.netbsd.org/~ozaki-r/separate-nexthop-caches-v2.diff
>   http://www.netbsd.org/~ozaki-r/separate-nexthop-caches-v2-diff.diff
> 
> Changes since v1:
> - Comment out osbolete RTF_* and RTM_* definitions
>   - Tweak some userland codes for the change
> - Restore checks of connected (cloning) routes in nd6_is_addr_neighbor
> - Restore the original behavior on removing ARP/NDP entries for
>   IP addresses of interface itself
> - Remove remaining use of RTF_LLINFO in the kernel
>   - I think we can remove it safely

It sounds as if these changes could affect the appearance of netstat -r,
route show, route get, arp -a, arp .  Can you provide some
before/after examples?

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: softint-based if_input

2016-01-25 Thread David Young
ing thread:
    loop forever:
        enable interrupts
        wait for wakeup
        for each Rx packet on ring:
            process packet

That stopped the user-tickle watchdog from firing.  It was handy having
a full-fledged thread context to process packets in.  But there were
trade-offs.  As Matt Thomas pointed out to me, if it takes longer for
the NIC to read the next packet off of the network than it takes your
thread to process the current packet, then your Rx thread is going to go
back to sleep again after every single packet.  So there's potentially
a lot of context-switch overhead and latency when you're receiving
back-to-back large packets.

ISTR Matt had some ideas how context switches could be made faster, or
h/w interrupt handlers could have an "ordinary" thread context, or the
scheduler could control the rate of softints, or all of the above.  I
don't know if there's been any progress along those lines in the mean
time.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: In-kernel process exit hooks?

2016-01-08 Thread David Young
On Fri, Jan 08, 2016 at 10:47:08AM -0600, David Young wrote:
> On Fri, Jan 08, 2016 at 12:52:16PM +0700, Robert Elz wrote:
> > Date:Fri, 8 Jan 2016 11:22:28 +0800 (PHT)
> > From:Paul Goyette 
> > Message-ID:  
> > 
> >   | Is there a "supported" interface for detaching the file (or descriptor) 
> >   | from the process without closing it?
> > 
> > Actually, thinking through this more, why not just "fix" filemon to make
> > a proper reference to the file, instead of the half baked thing it is
> > currently doing.
> 
> Yes, please! :-)
> 
> Furthermore, stick the file into LWP 0's descriptor table so that you

Oops, meant PID 0's.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: In-kernel process exit hooks?

2016-01-08 Thread David Young
On Fri, Jan 08, 2016 at 12:52:16PM +0700, Robert Elz wrote:
> Date:Fri, 8 Jan 2016 11:22:28 +0800 (PHT)
> From:Paul Goyette 
> Message-ID:  
> 
>   | Is there a "supported" interface for detaching the file (or descriptor) 
>   | from the process without closing it?
> 
> Actually, thinking through this more, why not just "fix" filemon to make
> a proper reference to the file, instead of the half baked thing it is
> currently doing.

Yes, please! :-)

Furthermore, stick the file into LWP 0's descriptor table so that you
can see it with fstat.  It's a little more code to write---I wrote it
for gre(4)---but it's well worth the visibility.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: In-kernel process exit hooks?

2016-01-08 Thread David Young
On Fri, Jan 08, 2016 at 12:26:14PM +0700, Robert Elz wrote:
> Date:Fri, 8 Jan 2016 11:22:28 +0800 (PHT)
> From:Paul Goyette 
> Message-ID:  
> 
>   | Is there a "supported" interface for detaching the file (or descriptor) 
>   | from the process without closing it?
> 
> Inside the kernel you want to follow the exact same procedure as would
> be done by
> 
>   newfd = dup(oldfd);
>   close(oldfd);
> 
> except instead of dup (and assigning to a newfd in the process) we
> take the file reference and stick it in filemon.   There's nothing
> magic about this step.  What magic there is (though barely worthy of
> the title) would be in ensuring that filemon properly releases the file
> when it is closing.

Years ago I added to gre(4) an ioctl that a user thread can use to
delegate to the kernel a UDP socket that carries tunnel traffic.  I
think that that code should cover at least the dup(2) part of Robert's
suggestion.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Lightweight support for instruction RNGs

2015-12-21 Thread David Young
On Mon, Dec 21, 2015 at 07:38:57PM -0500, Thor Lancelot Simon wrote:
> On Mon, Dec 21, 2015 at 09:28:40AM -0800, Alistair Crooks wrote:
> > I think there's some disconnect here, since we're obviously talking
> > past each other.
> > 
> > My concern is the output from the random devices into userland. I
> 
> Yes, then we're clearly talking past each other.  The "output from the
> random devices into userland" is generated using the NIST SP800-90
> CTR_DRBG.  You could key it with all-zeroes and the statistical properties
> of the output would differ in no detectable way* from what you got if
> you keyed it with pure quantum noise.
> 
> If you want to run statistical tests that mean anything, you need to
> feed them input from somewhere else.  Feeding them the output of the
> CTR_DRBG can be nothing but -- at best -- security theater.
> 
>  [*maybe some day we will have a cryptanalysis of AES that allows us to
>detect such a difference, but we sure don't now]

Thor,

I think Alistair is concerned that the implementation of "NIST SP800-90
CTR_DRBG" could be incorrect, or else that it could be embedded in
a system in which the correct behavior is not, for whatever reason,
manifest in the userland output.  Thus the statistical properties of the
output could be different from specifications.  Maybe one of the problem
systems will be, for unforeseen reasons, one in which there is an RNG
instruction.  Stranger things have happened.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: I'm interested in implementing lockless fifo/lifo in the kernel

2015-12-02 Thread David Young
On Tue, Nov 24, 2015 at 12:04:09AM -0800, Randy White wrote:
> I love NetBSD, and I would like to contribute. I see the open job for 
> lockless queues, and stacks. I want to learn, and I want to help. I have 
> literature on UNIX kernel development. I have many systems and I think I 
> could fund myself for the most part. I am familiar with lockless programming. 
> 
> I am looking forward to working on netbsd and help maintaining its 
> awesomeness. 

Randy,

That's great.  Let me know how I can help you to get started.  It sounds
like you're already familiar with NetBSD, how and where we communicate,
etc.

BTW, we have a lockless queue in NetBSD called pcq(9).  People will
disagree whether it is the best/only lockless queue for the purpose of,
say, SMP networking. pcq(9) uses a fixed-size ring buffer, which may be
advantageous in some scenarios and a liability in others.

We are notably lacking a fast *linked* FIFO queue---i.e., one that can
take the place of struct ifqueue/IF_ENQUEUE()/IF_DEQUEUE() for mbuf
queues.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: spiflash.c process_write()

2015-09-08 Thread David Young
On Tue, Sep 08, 2015 at 06:12:11AM +, David Holland wrote:
> As noted in passing elsewhere, it seems that process_write() in
> spiflash.c allocates a scratch buffer on every call... and leaks it on
> every call too. This clearly isn't a good thing.

My recollection is fuzzy, now, but I think that that was written to
support the Meraki Mini.  ISTR NOR-flash writing was never tested on
the Mini (maybe not anywhere?).  A misplaced/corrupted write to the
NOR Flash could have bricked the Mini, and there was a limited supply
available for testing!

I think I have a Mini or two around here, somewhere.  I can make one
available with serial console and outlet control, if somebody has an
urge to test it out.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Restructuring ARP cache

2015-08-25 Thread David Young
On Mon, Aug 17, 2015 at 06:23:14PM +0900, Ryota Ozaki wrote:
> Hi,
> 
> Here is a new patch that restructures ARP caches, which
> aims for MP-safe network stack:
> http://www.netbsd.org/~ozaki-r/lltable-arpcache.diff
> (https://github.com/ozaki-r/netbsd-src/tree/lltable-arpcache)

I don't think that through this piecemeal approach, NetBSD can achieve a
maintainable, MP-safe network stack that is competitive in performance
with the other stacks out there.

To put this into the old formula, you may pick three: piecemeal,
maintainable, MP-safe, performance.

I think it's important to take several steps toward simplicity before
anything else.  Neither simplicity nor MP-safety is compatible with a
network stack shot through with caches, as NetBSD's stack now is.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: argument of pci_msi[x]_count()

2015-08-12 Thread David Young
On Thu, Aug 13, 2015 at 06:56:34AM +1000, matthew green wrote:
> > I don't have a problem with it, I was just questioning the rationale
> > about passing pci_attach_args to functions...
> 
> the original pci(9) interfaces didn't do this, but a 3rd member of
> pci_attach_args{} was needed for a new change, so someone (i forget
> now, but CVS will tell you) changed it to pass the structure itself,
> since this was only called during autoconfig when this structure was
> actually available.
> 
> doing it outside of autoconfig is not a good idea, though, so any
> function that is usefully callable outside of attach probably should
> take specific arguments instead of pci_attach_args{}.

ISTR a hairy wi(4) bug came about because *_attach_args was passed
outside attach!  [I may have introduced that bug, too. :-)]

It sounds to me like the emerging consensus is that it's best to pass
only the chipset+memory tag, if that's all you need, to each MSI/MSI-X
function.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Introducing CloudABI: a pure capability-based runtime for NetBSD (and other systems)

2015-07-23 Thread David Young
On Thu, Jun 25, 2015 at 03:11:51PM +0200, Ed Schouten wrote:
> Hello NetBSD hackers,
> 
> Two weeks ago I gave a talk at BSDCan about something I've been
> working on for the last half a year called CloudABI[1]. In short,
> CloudABI is an alternative UNIX-like runtime environment that purely
> uses capability-based security, strongly influenced by Capsicum[2].

Ed,

It has always seemed to me that it will be easier for a user to form and
to operate a mental model for a capability system, especially if the
system makes the capabilities visible, than to model any rules-based
system.  So capabilities have always looked like a good foundation for
building *usable* security.

Initially, I was very excited about Capsicum, "practical capabilities
for UNIX".  But it seems like Capsicum isn't for users, it is for
developers: in the examples I have read, you have to modify a program's
source to make good use of Capsicum.  That seems like an unnecessarily
high barrier to use.

That brings me to my question about CloudABI.  It sounds like CloudABI
is aimed at developers, who would adapt programs to work with the new
run-time?  Or is there an upside to CloudABI for users, too?

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Improving use of rt_refcnt

2015-07-06 Thread David Young
On Sun, Jul 05, 2015 at 11:50:12AM +0200, Joerg Sonnenberger wrote:
> I think the main point that David wanted to raise is that the normal
> path for packets should *not* do any ref count changes at all.

I wasn't trying to make a point.  I wanted to make sure that I properly
understood Ryota's plans.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Improving use of rt_refcnt

2015-07-04 Thread David Young
On Sat, Jul 04, 2015 at 09:52:56PM +0900, Ryota Ozaki wrote:
> I'm trying to improve use of rt_refcnt: reducing
> abuse of it, e.g., rt_refcnt++/rt_refcnt-- outside
> route.c and extending it to treat referencing
> during packet processing (IOW, references from
> local variables) -- currently it handles only
> references between routes. The latter is needed for
> MP-safe networking.

Do you propose to increase/decrease rt_refcnt in the packet processing
path, using atomic instructions?

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Interrupt flow in the NetBSD kernel

2015-06-22 Thread David Young
On Sun, Jun 21, 2015 at 08:01:47AM -0700, Matt Thomas wrote:
> 
> > On Jun 21, 2015, at 7:30 AM, Kamil Rytarowski  wrote:
> > 
> > I have got few questions regarding the interrupt flow in the kernel.
> > Please tell whether my understanding is correct.
> 
> You are confusing interrupts with exceptions.  Interrupts are 
> asynchronous events.  Exceptions are (usually) synchronous and
> are the result of an instruction.

I took Kamil's question to be, "When interrupts at the highest priority
level are blocked, can control flow still be interrupted?  How?"  The
answer to the question is yes.  Both synchronous events (exceptions,
such as "data abort" on ARM) and asynchronous events (non-maskable
interrupts, such as NMI on x86) can interrupt control flow.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: MSI/MSI-X implementation

2014-11-13 Thread David Young
On Thu, Nov 13, 2014 at 01:59:09PM -0600, David Young wrote:
> On Thu, Nov 13, 2014 at 12:41:38PM +0900, Kengo NAKAHARA wrote:
> > (2014/11/13 11:54), David Young wrote:
> > >On Fri, Nov 07, 2014 at 04:41:55PM +0900, Kengo NAKAHARA wrote:
> > >>Could you comment the specification and implementation?
> > >
> > >The user should not be on the hook to set processor affinity for the
> > >interrupts.  That is more properly the responsibility of the designer
> > >and OS.
> > 
> > I wrote unclear explanation..., so please let me redescribe.
> > 
> > This MSI/MSI-X API *design* is independent from processor affinity.
> > The device dirvers can use MSI/MSI-X and processor affinity
> > independently of each other. In other words, legacy interrupts and
> > INTx interrupts can use processor affinity still. Furthermore,
> > MSI/MSI-X may or may not use processor affinity.
> 
> MSI/MSI-X is not half as useful as it ought to be if a driver's author
> cannot spread interrupt workload across the available CPUs.  If you
> don't mind, please share your processor affinity proposal and show how
> it works with interrupts.

Here are some cases that interest me:

1) What interrupts does a driver establish if the NIC has separate
   MSI/MSI-X interrupts for each of 4 Tx DMA rings and each of 4 Rx DMA
   rings, and there are 2 logical CPUs?  Can/does the driver provide
   any hints about the processor that is the target of each interrupt?
   What CPUs receive the interrupts?

2) Same as above, but what if there are 4 logical CPUs?

3) Same as previous, but what if there are 16 logical CPUs?

There's more than one way to crack this nut, I'm just wondering how you
propose to crack it. :-)

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: struct ifnet and ifaddr handling [was: Re: Making global variables of if.c MPSAFE]

2014-11-13 Thread David Young
On Thu, Nov 13, 2014 at 04:26:52AM +, Taylor R Campbell wrote:
>Date: Thu, 13 Nov 2014 12:43:26 +0900
>From: Ryota Ozaki 
> 
>Here is a new patch: http://www.netbsd.org/~ozaki-r/psz-ifnet.diff
> 
>I think the patch reflects rmind's suggestions:
>- Use pserialize for IFNET_FOREACH
>  - but use a lock for blockable/sleepable critical sections
>- cpu_intr_p workaround for HW interrupt
> 
>Any comments?
> 
> Hmm...some quick notes from a non-expert in sys/net:
> 
> - You call malloc(M_WAITOK) while the ifnet lock is held, in
>   if_alloc_sadl_locked, which is not allowed.
> 
> - You call copyout in a pserialize read section, in ifconf, which is
>   not allowed because copyout may block.
> 
> - I don't know what cpu_intr_p is working around but it's probably not
>   a good idea!
> 
> Generally, all that you are allowed to do in a pserialize read section
> is read a small piece of information, or grab a reference to a data
> structure which you are then going to use outside the read section.
> 
> I don't think it's going to be easy to scalably parallelize this code
> without restructuring it, unless as a stop-gap you use a heaver-weight
> reader-writer lock like the prwlock at
> <https://www.NetBSD.org/~riastradh/tmp/20140517/rwlock/prwlock.c>.
> (No idea how much overhead this might add.)

Parallelizing the network code without restructuring it?  That
sounds like something I have tried before.  Avoid it, if you can!

When I was confronted at a previous job with the problem of rapidly
MP-ifying the network stack, I introduced a lightweight reader/writer
lock for the network configuration.  I called that lock the "corral."
A thread had to enter the corral before it read or modified a route,
ifnet, ifaddr, etc.  There could be multiple readers in the corral,
or one writer in the corral, never a reader and writer at once---i.e.,
usual reader/writer semantics.  The corral was designed so that
readers could enter and exit very quickly without using any locked
instructions or modifying any shared cachelines in the common case.
It was fairly expensive for a writer to enter the corral, or for
a reader to wait for a writer to exit before it entered.

The corral was introduced gradually, the kernel_lock and softnet_lock
gradually phased out.  In retrospect, I probably introduced the
corral_enter() calls in a different order with KERNEL_LOCK() and
mutex_enter(softnet_lock) calls than I should have, and that caused
me a lot of grief.  It did not help that the kernel did not make
KERNEL_LOCK() and mutex_enter(softnet_lock) calls in a consistent
order.

IIRC, some of the waits in the corral entailed cv_wait() and other
blocking calls that could not be made from an interrupt context.
Anyway, I ended up in the end running virtually all packet processing
in a LWP context so that I could use whichever synchronization
objects I liked.  NIC interrupts were just responsible for waking
per-CPU threads that processed the Rx rings.  This approach can
work ok if your starting point is the inefficient legacy NetBSD
packet processing and there are a preponderance of small packets.
Matt Thomas once explained to me some serious overheads that crop
up when the inter-packet interval is long and your packet processing
is tuned up, and I might make different trade-offs between
interrupt/LWP processing with 20/20 hindsight.

Anyway, while I worked on the rapid MP-ification project, I had on
my desk an eminent engineer's thoroughgoing and ambitious plan for
MP-ifying a *BSD network stack: a rational MP-ification project.
I don't think that my team could have implemented that plan on a
timescale that met the business need, but if you take a global
perspective---how many times will persons and projects around the
world will be hobbled by NetBSD's legacy network stack before it
is finally restructured?---then it's pretty clear that now is the
time to begin a rational project.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: MSI/MSI-X implementation

2014-11-13 Thread David Young
On Thu, Nov 13, 2014 at 12:41:38PM +0900, Kengo NAKAHARA wrote:
> (2014/11/13 11:54), David Young wrote:
> >On Fri, Nov 07, 2014 at 04:41:55PM +0900, Kengo NAKAHARA wrote:
> >>Could you comment the specification and implementation?
> >
> >The user should not be on the hook to set processor affinity for the
> >interrupts.  That is more properly the responsibility of the designer
> >and OS.
> 
> I wrote unclear explanation..., so please let me redescribe.
> 
> This MSI/MSI-X API *design* is independent from processor affinity.
> The device dirvers can use MSI/MSI-X and processor affinity
> independently of each other. In other words, legacy interrupts and
> INTx interrupts can use processor affinity still. Furthermore,
> MSI/MSI-X may or may not use processor affinity.

MSI/MSI-X is not half as useful as it ought to be if a driver's author
cannot spread interrupt workload across the available CPUs.  If you
don't mind, please share your processor affinity proposal and show how
it works with interrupts.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: MSI/MSI-X implementation

2014-11-12 Thread David Young
On Fri, Nov 07, 2014 at 04:41:55PM +0900, Kengo NAKAHARA wrote:
> Could you comment the specification and implementation?

The user should not be on the hook to set processor affinity for the
interrupts.  That is more properly the responsibility of the designer
and OS.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Brainy: Set of 33 potential bugs

2014-09-20 Thread David Young
On Sat, Sep 20, 2014 at 08:48:08PM +0200, Maxime Villard wrote:
> Hi,
> here is another set of 33 potential bugs found by my code scanner.
> 
>   http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html#Report-3
> 
> Not all bugs are listed here; I've put only those which looked like proper
> bugs. I guess they will all need to be fixed in NetBSD-7.

Is the source for this scanner available somewhere?  Does it build on
some existing project, such as LLVM?

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: add MSI/MSI-X support to NetBSD

2014-08-29 Thread David Young
On Sat, Jun 07, 2014 at 08:36:47AM +1000, matthew green wrote:
> 
> let's not forget my favourite mis-feature of MSI/MSI-X:
> 
> if you misconfigure the address, interrupts might cause main memory to
> be corrupted.  i've seen this happen, and it was rather difficult to
> diagnose the real culprit..

Picking up this discussion again, rather late.

If there is an IOMMU available, shouldn't it be used to protect against
this kind of memory corruption?  Even some x86 machines have IOMMUs
these days.

> i'm a little confused about bus_msi(9) -- pci_intr(9) is already an MD
> interface, so if it was extended or if we copied the pci_intr_map_msi()
> functions from elsewhere, it's still MD code we have to write.
> what does bus_msi(9) add?  who would use it?

bus_msi(9) gives MI code access to doorbells: MI code uses it to
establish a doorbell -> interrupt handler mapping and find out the
doorbell's physical address.

All the code to map the doorbell's physaddr into a PCI busaddr, to
program the IOMMU if there is one, to establish the MSI address/data in
the PCI device, and to enable MSI is MI code using bus_dma(9), pci(9),
and bus_space(9).  Even if it's 100 lines or fewer, why duplicate it
across platforms?

Also, doorbells look to me like a potentially useful facility to make
generally available, even apart from their use with PCI MSI.  Anyway,
I'm curious what uses people would come up with.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: ixg(4) performances

2014-08-26 Thread David Young
On Tue, Aug 26, 2014 at 10:25:52AM -0400, Christos Zoulas wrote:
> On Aug 26,  2:23pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
> -- Subject: Re: ixg(4) performances
> 
> | On Tue, Aug 26, 2014 at 12:57:37PM +, Christos Zoulas wrote:
> | > 
> ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC
> | 
> | Right, but NetBSD has no tool like Linux's setpci to tweak MMRBC, and if
> | the BIOS has no setting for it, NetBSD is screwed.
> | 
> | I see   has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
> | does that means Linux's setpci can be easily reproduced?
> 
> I would probably extend pcictl with cfgread and cfgwrite commands.

Emmanuel,

Most (all?) configuration registers are read/write.  Have you read the
MMRBC and found that it's improperly configured?

Are you sure that you don't have to program the MMRBC at every bus
bridge between the NIC and RAM?  I'm not too familiar with PCI Express,
so I really don't know.

Have you verified the information at
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe with the 82599
manual?  I have tried to corroborate the information both with my PCI
Express book and with the 82599 manual, but I cannot make a match.
PCI-X != PCI Express; maybe ixgb != ixgbe?  (It sure looks like they're
writing about an 82599, but maybe they don't know what they're writing
about!)


Finally, adding cfgread/cfgwrite commands to pcictl seems like a step in
the wrong direction.  I know that this is UNIX and we're duty-bound to
give everyone enough rope, but may we reconsider our assisted-suicide
policy just this one time? :-)

How well has blindly poking configuration registers worked for us in
the past?  I can think of a couple of instances where a knowledgeable
developer thought that they were writing a helpful value to a useful
register and getting a desirable result, but in the end it turned out to
be a no-op.  In one case, it was an Atheros WLAN adapter where somebody
added to Linux some code that wrote to a mysterious PCI configuration
register, and then some of the *BSDs copied it.  In the other case, I
think that somebody used pci_conf_write() to write a magic value to a
USB host controller register that wasn't on a 32-bit boundary.  ISTR
that some incorrect value was written, instead.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Dead code: double return

2014-08-18 Thread David Young
On Mon, Aug 18, 2014 at 11:28:13AM +0200, Maxime Villard wrote:
> Hi,
> my code scanner reports in several places lines like these:
> 
>   return ERROR_CODE/func(XXX);
>   return VALUE;

In some of your examples, it looks like code may have been copied and
pasted.  Is some refactoring of the code called for?

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: IRQ affinity (aka interrupt routing)

2014-07-25 Thread David Young
On Fri, Jul 25, 2014 at 07:15:02PM +0900, Kengo NAKAHARA wrote:
> Hi martin,
> 
> Thank you very much for your comment.
> 
> (2014/07/25 18:15), Martin Husemann wrote:
> >A few general comments:
> >
> >  - best UI is no UI - the kernel should distribute interrupts automatically
> >(if possible) as fair as possible over time doiing some statistics
> 
> I agree the computer should distribute interrupts automatically, but
> I think balancing interrupts is too complex for the kernel. So I think
> the balancing should be done by the userland daemon which use the UI.
> Implementing and tuning the userland daemon are future works.

What's the goal of balancing interrupts?  Controlling latency?  That's
important, but it seems like other considerations might apply.  For
example, funneling all interrupts to one core might allow the other
cores to idle in a power-saving state.  Also, it might help to avoid
cacheline motion for two interrupts involved in the same network flow to
fire on the same CPU.

Can you explain why you think that balancing interrupts is too complex
for the kernel?  I would not necessarily disagree, but it seems like
the kernel has the most immediate access to the relevant interrupt
statistics *and* the responsibility for CPU scheduling, so it's in a
pretty good position to react to imbalances.

Dave

> >  - a UI to wire some device interrupts to a special CPU would be ok,
> >I prefer a new intrctrl for that
> >
> >  - vmstat -i could gain an additional column with the current target cpu
> >of the interrupt
> 
> I am afraid of breaking backward compatibility, so I avoid to change
> existing commands.
> 
> >  - the device name is nice, but what about shared interrupts?
> 
> I forgot shared interrupts... In the current implement, the device name
> is overwritten by the device established later. Of course this is ugly,
> so I must fix it.
> 
> 
> Thanks,
> 
> -- 
> //
> Internet Initiative Japan Inc.
> 
> Device Engineering Section,
> Core Product Development Department,
> Product Division,
> Technology Unit
> 
> Kengo NAKAHARA 

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: Obtaining list of created sockets

2014-07-01 Thread David Young
On Mon, Jun 30, 2014 at 09:39:37AM -0700, Will Dignazio wrote:
> That would be an excellent start; I had considered it before, however I
> thought that netstat only listed listening connections. With lsof, you
> would only get sockets created with fsocket (those having file descriptors).
> 
> I suppose combining the way the two get their information would yield a
> majority of the sockets created, however I would like to get all internal
> sockets that may not be listening yet, or never get a file descriptor.

Some years back, I modified gre(4) to use an actual socket instead of
rolling its own GRE or UDP packets.  IIRC, I made gre(4) *always* create
a file descriptor so that fstat(1) provided a comprehensive view of the
sockets in the system.

Having sockets in the system that appear neither in fstat(1) nor
netstat(8) output seems like an unnecessary mystery/complication to me.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: add MSI/MSI-X support to NetBSD

2014-06-06 Thread David Young
On Fri, Jun 06, 2014 at 07:06:00PM +, Taylor R Campbell wrote:
>Date: Fri, 6 Jun 2014 12:56:53 -0500
>From: David Young 
> 
>Here is the proposal that I came up with many months (a few years?) ago
>with input from Matt Thomas.  I have tried to account for Matt's
>requirements, but I'm not sure that I have done so.
> 
> For those ignoramuses among us who remain perplexed by the apparent
> difficulty of using a new interrupt delivery mechanism, could you add
> some notes to your proposal about what driver authors would need to
> know about it and when & how one would use it in a driver?

Driver authors do not need to know anything about bus_msi(9) unless
they're doing something fancy.  bus_msi(9) will be invisible to the
author of a PCI driver because pci_intr(9) will establish the mailbox
and all of that.

bus_msi(9) hides differences between hardware platforms.

> Would all architectures with PCI support bus_msi(9), or would PCI
> device drivers need to conditionally use it?  Why isn't it just a
> matter of modifying pci_intr_map, or calling pci_intr_map_msi like in
> OpenBSD?

For a PCI driver, it *is* a matter of calling pci_intr_map(9), or
whatever NetBSD comes up with for MSI/MSI-X.

> Would there be other non-PCI buses with message-signalled
> interrupts too?

It's conceivable that there are existing non-PCI buses that use
message-signalled interrupts.  Any future bus probably will.

> (Still not having done my homework to study what this MSI business is
> all about, I'll note parenthetically that it seems FreeBSD and OpenBSD
> have supported MSI for a while, and I understand neither why it was so
> easy for them nor what advantage they lack by not having bus_msi(9).)

NetBSD has supported MSI (but not MSI-X) for a while, too, at least on
i386 or x86.

Here are the nice things about MSI/MSI-X in a nutshell: you can have
many interrupt sources per device (IIRC, there are just 4 interrupt
lines on a legacy PCI bus), each interrupt can signal a different
condition (so your interrupt service routine doesn't have to read
an interrupt condition register), you can route each condition on a
device to a different CPU, and the interrupt is a bus-master write
that flushes all of the previous bus-master writes by the same device
(according to the PCI ordering rules), so a driver doesn't have to poll
a device register to "land" buffered bus-master writes before examining
descriptor rings and other DMA-able regions.

Dave

-- 
David Young
dyo...@pobox.comUrbana, IL(217) 721-9981


Re: RFC: add MSI/MSI-X support to NetBSD

2014-06-06 Thread David Young
On Fri, Jun 06, 2014 at 12:40:54PM -0500, David Young wrote:
> On Fri, May 30, 2014 at 05:55:25PM +0900, Kengo NAKAHARA wrote:
> > Hello,
> > 
> > I'm going to add MSI/MSI-X support to NetBSD. I list tasks about this.
> > Would you comment following task list?
> 
> I think that MSI/MSI-X logically separates into a few pieces, what do
> you think about these pieces?
> 
> 1 An MI API for establishing "mailboxes" (or "doorbells" or whatever
>   we may call them).  A mailbox is a special physical address (PA) or
>   PA/data-pair in correspondence with a callback (function, argument).
> 
>   An MI API for mapping the mailbox into various address spaces,
>   but especially the message-signalling devices.  In this way, the
>   mailbox API is a use or an extension of bus_dma(9).
> 
>   Somewhere I have a draft proposal for this MI API, I will try to
>   dig it up.

Here is the proposal that I came up with many months (a few years?) ago
with input from Matt Thomas.  I have tried to account for Matt's
requirements, but I'm not sure that I have done so.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981
BUS_MSI(9)           NetBSD Kernel Developer's Manual           BUS_MSI(9)

bus_msi(9) is a machine-independent interface for establishing in the
machine's physical address space a "doorbell" that when written with
a particular word, sends an interrupt vector to a set of CPUs.  Using
bus_msi(9), the interrupt vector can be tied to interrupt handlers.

bus_msi(9) is the basis for a machine-independent implementation
of PCI Message-Signaled Interrupts (MSI) and MSI-X, however, the
bus_msi(9) implementation itself is highly machine-dependent.  Any
NetBSD architecture that wants to support PCI MSI should provide a
bus_msi(9) implementation.

bus_msi(9) uses facilities provided by bus_dma(9).

typedef struct _bus_msi_t {
        bus_addr_t mi_addr;
        uint32_t   mi_data;
        uint32_t   mi_count;
} bus_msi_interval_t;

int
bus_msi_alloc(bus_dma_tag_t tag, bus_msi_reservation_t *msirp, size_t n,
uint32_t data_min, uint32_t data_max,
uint32_t data_alignment, uint32_t data_boundary, int flags);

Reserve `number' interrupt vectors on up to `ncpumax' CPUs
in the set `cpusetin' and reserve corresponding message
address/message data pairs.  Record the message address/data-pair
reservations in up to `nintervals' consecutive bus_msi_interval_ts
beginning with `interval[0]'; overwrite `rintervals' with
the number of intervals used.  Overwrite `cpusetout' with
the set of CPUs where interrupt vectors were established.

Each bus_msi_interval_t tells a message address, mi_addr,
and the mi_count different 32-bit message data words,
[mi_data, mi_data + mi_count - 1], to write to trigger
mi_count different interrupt vectors.

Each message data interval, [mi_data, mi_data + mi_count - 1],
will satisfy the constraints passed to bus_msi_alloc():
[data_min, data_max] must enclose each interval, each
interval must start at a multiple of data_alignment, and
no interval may cross a data_boundary boundary.  A legal
value of data_alignment (or data_boundary) is either zero
or a power of 2.  When zero, data_alignment (or data_boundary)
has no effect.

`tag' is the bus_dma_tag_t passed by the parent driver via
the bus _attach_args.

`flags' may be one of BUS_DMA_WAITOK or BUS_DMA_NOWAIT.

bus_msi_handle_t
bus_msi_establish(bus_dma_tag_t tag, bus_msi_reservation_t msir, int idx,
const kcpuset_t *cpusetin, int ncpumax, kcpuset_t *cpusetout,
int ipl, int (*func)(void *), void *arg);

Establish a callback (func, arg) to run at interrupt priority
level `ipl' whenever the `idx'th message in `intervals' is
delivered.  Return an opaque handle for use with
bus_msi_disestablish().

You can establish more than one handler at each `idx'.

The correspondence between `idx's and message-address/data
pairs is like this:

idx 0 -> (intervals[0].mi_addr, intervals[0].mi_data)
idx 1 -> (intervals[0].mi_addr, intervals[0].mi_data + 1)
. . .
idx N - 1 -> (intervals[0].mi_addr, intervals[0].mi_data +
intervals[0].mi_count - 1)
idx N -> (intervals[1].mi_addr, intervals[1].mi_data)
idx N + 1 -> (intervals[1].mi_addr, intervals[1].mi_data + 1)
. . .
idx N + K - 1 -> (intervals[1].mi_addr, intervals[1].mi_data +
intervals[1].mi_count - 1)

void
bus_msi_disestablish(bus_dma_tag_t tag, bus_msi_handle_t);

   

Re: RFC: add MSI/MSI-X support to NetBSD

2014-06-06 Thread David Young
On Fri, May 30, 2014 at 05:55:25PM +0900, Kengo NAKAHARA wrote:
> Hello,
> 
> I'm going to add MSI/MSI-X support to NetBSD. I list tasks about this.
> Would you comment following task list?

I think that MSI/MSI-X logically separates into a few pieces, what do
you think about these pieces?

1 An MI API for establishing "mailboxes" (or "doorbells" or whatever
  we may call them).  A mailbox is a special physical address (PA) or
  PA/data-pair in correspondence with a callback (function, argument).

  An MI API for mapping the mailbox into various address spaces,
  but especially the message-signalling devices.  In this way, the
  mailbox API is a use or an extension of bus_dma(9).

  Somewhere I have a draft proposal for this MI API, I will try to
  dig it up.

2 For each platform, an MD implementation of the MI mailbox API.

3 Extensions to pci(9) for establishing message-signalled interrupts
  using either a (function, argument) pair, a PA, or a (PA, data) pair.
  I am pretty sure that the implementation of these extensions can be
  MI.

> + [amd64 MD]  refactor INTRSTUB
>   - currently, it walks the interrupt handler list in assembly code
> - I want to use NetBSD's list library, so I want to convert this assembly
>   code to C code.

I support converting much of the interrupt dispatch code to C from
assembly.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Lockless IP input queue, the pktqueue interface

2014-05-29 Thread David Young
re - it is a good indicator that you
> >might potentially be doing something wrong; even more so if you fiddle with
> >a lockless data structure as in pktqueue's case.  "It might be useful some
> >day" is a very poor and dangerous reasoning in this case.
> >
> 
> All of your arguments boil down to "can't trust someone else."
> 
> Why do you need to be so insulting of other developers in your arguments?
> 
> Do you think you're the only person capable of making good design decisions?
> 
> Sorry, but I won't be a party to that kind of attitude and want nothing more
> to do with this.

I think that Mindaugas is being pragmatic here.  Developers are not
equally brilliant[*], observant of the rules, or perceptive of the
patterns, layers, or abstractions in the code.  He is writing the code
in a way that discourages us from casually misusing or breaking it by
getting under the abstraction.  I don't find that offensive.

Dave

[*] However, we are all above average at NetBSD.

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: packet timestamps (was Re: Changes to make /dev/*random better sooner)

2014-04-15 Thread David Young
On Wed, Apr 09, 2014 at 04:36:26PM -0700, Dennis Ferguson wrote:
> What I would like to do is to make per-packet timestamps (which you
> are doing already) more widely useful by moving where the timestamp is
> taken closer to the input interrupt which signals a packet delivery
> and carrying it in each packet's mbuf metadata.  This has a nice effect
> on the rare occasion that the packet is delivered to a socket with a
> consumer with cares about packet arrival times (like an NTP or IEEE 1588
> implementation), but it can also provide a new measure of the performance
> of the network code when making changes to it (how long did it take packets
> to get from the input interface to the socket, or from the input interface
> to the output interface?) which doesn't exist now.  In fact it would be
> nice to have it right now so the effect of the changes you are making
> could be measured instead of just speculating.  I was also imagining that
> the random number machinery would harvest timestamps from the mbufs,
> but maybe only after it is determined if the timestamp is one a user
> is interested in so it didn't use those.

FWIW, based on a suggestion by Dennis, in August I added a timestamp
capability to MBUFTRACE for my use at $DAYJOB.  MBUFTRACE thus enhanced
shows me how long it takes (cycles maximum, cycles average) for packets
to make ownership transitions: e.g., from IP queue to wm0 transmission
queue.  This has been useful for finding out where to concentrate my
optimization efforts, and it has helped to rule-in or -out hypotheses
about networking bugs.  Thanks, Dennis.

Here and there I have also fixed an MBUFTRACE bug, and I have made some
changes designed to reduce counters' cache footprint.  I call my variant
of MBUFTRACE, MBUFTRACE3.  I hope to feed MBUFTRACE3 back one of these
days.

Here is a sample of 'netstat -ssm' output when MBUFTRACE3 is operating
on a box with "background" levels of traffic---there are two tables,
the first of which you will recognize, and the second of which is new:

                          small    ext    cluster
unix   inuse 3  1  1
 arp hold  inuse 8  0  0
 wm8 rx ring   inuse  1024   1024   1024
 wm7 rx ring   inuse  1024   1024   1024
 wm6 rx ring   inuse  1024   1024   1024
 wm5 rx ring   inuse  1024   1024   1024
 wm4 rx ring   inuse  1024   1024   1024
 wm3 rx ring   inuse  1024   1024   1024
 wm2 rx ring   inuse  1024   1024   1024
 wm1 rx ring   inuse  1024   1024   1024
 wm0 rx ring   inuse  1024   1024   1024
 unknown data  inuse 34802  34802  0
 revoked   inuse   1273302  63033  22922
 microsecs/tr   max # transitions   previous owner -> new owner
617 2,068 wm8 cpu5 rxq -> wm8 rx  
827   199   udp rx -> revoked 
954,71910,772 defer arp_deferral -> revoked 
   132682   route  -> revoked 
   70 1,627,735   835,683   unix  -> revoked 
  241   487,685 3,977  ixg1 rx -> arp 
  260   260 1 arp hold -> ixg0 tx 
1,410   456,389   772  ixg0 rx -> arp 
   22,296 6,712,491 2,082  ixg1 tx -> revoked 
  315,846 6,293,761   136  ixg0 tx -> revoked 
   516,585   709,193   13  wm4 tx ring -> revoked 

There are microseconds in that table, but netstat reads the CPU
frequency, average and maximum cycles/transition from the kernel and
does the arithmetic.  I'm using the CPU cycle counter to make all of the
timestamps.  I'm not compensating for clock drift or anything.

A more suitable display for this information than a table is *probably*
a directed graph with a vertex corresponding to each mbuf owner, and
an edge corresponding to each owner->owner transition.  Set the area
of each vertex in proportion to the mbufs in-use by the corresponding
owner, and set the width of each edge in proportion to the rate of
transitions.  Label each vertex with an mbuf-owner name.  Graphs for
normal/high-performance and abnormal/low-performance machines will have
distinct patterns, and the graph will help to illuminate bottlenecks.
If anyone is interested in programming th

rethinking resource limits (was Re: Shared resource limit implementation question)

2014-02-20 Thread David Young
On Wed, Feb 19, 2014 at 09:08:59PM +0700, Robert Elz wrote:
> The kernel code for handling resource limits attempts to share the
> limits structure between processes (wholly reasonable, as limits are
> inherited, and rarely changed).  A shared limits struct (which is all
> of them when a new one is created) is marked as !pl_writeable.
> (Then when a process modifies one of its limits, it is given a copy
> of the limits struct, marked pl_writeable that it can modify as needed).

I do not see anything wrong with your analysis, but I only skimmed it.

I skimmed your email expecting for you to mention a problem with process
resource limits that came up several years ago: after a process fork()s,
the child's resource use does not count against the parent's limits, but
it counts against the child's own copy of the parent's resource limits.

Also, we may configure a system-wide limit on the number of processes,
and we may individually limit the number of processes simultaneous
belonging to each user, but there is not a limit to the number of
processes created by a process and its descendants.

All of this means that a user has very little protection against a
program that constantly forks and allocates memory: where N is the
user's process limit, and M the bytes memory limit, the program and its
descendants can use N * M bytes of memory and all N of the processes
available to the user.  In this way a "fork bomb" can run away with all
of the user's resources, and it might cripple the system, too.

It seems to me that the whole area of resource limits is ripe for
reconsideration, if somebody had the time and level of interest.  These
days it makes more sense to arbitrate access to system resources using
power budgets, noise budgets/limits, and latency goals, than to enforce
some of the traditional limits.  Limits should be enforceable by users
on the processes that run on their behalf.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: ptrdiff_t in the kernel

2013-12-04 Thread David Young
On Wed, Dec 04, 2013 at 12:07:42PM -0200, Lourival Vieira Neto wrote:
> >>> No, <stddef.h> is not allowed in the kernel.  Symbols from it are
> >>> provided via other means.
> >>
> >> I know. In fact, I'm asking if it would be alright to allow that.
> >> AFAIK, it would be inoffensive if available in the kernel.
> >
> > Actually, it would be offensive.
> 
> Why?

I would also like to know why that would be offensive!

I'm always disappointed when I have to write something like this in order
to share code between the userland and kernel,

#if defined(_KERNEL) || defined(_STANDALONE)
#include <sys/types.h>  /* for bool, size_t---XXX not right? */
#else
#include <stdbool.h>    /* for bool */
#include <sys/types.h>  /* for size_t */
#endif

Apparently, <stddef.h> is the correct header for size_t, so that is more
properly written like this,

#if defined(_KERNEL) || defined(_STANDALONE)
#include <sys/types.h>  /* for bool, size_t---XXX not right? */
#else
#include <stdbool.h>    /* for bool */
#include <stddef.h>     /* for size_t */
#endif

I would prefer for this to suffice both for the kernel and userland:

#include <stdbool.h>    /* for bool */
#include <stddef.h>     /* for size_t */

ISTM that the reasons things are not that simple are merely historical
reasons, but I am open to other explanations.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


kernel profiling artifact: Xspllower the leaf function

2013-07-11 Thread David Young
I am profiling a 6-ish kernel with kgmon/gprof.  I find that Xspllower
is quite prominent in my profiles *and* that it looks like a
"leaf"---that is, like it doesn't call anything.

I suspect that the reason for Xspllower's prominence and "leafiness"
is that it enters handlers for deferred interrupts in some "funny"
way that does not create the type of stack frame that the profiler
recognizes.  So gprof ascribes to Xspllower time that is in fact spent
running handlers for the deferred interrupts.  Could that be?  More to
the point, is there an easy way to produce a more reliable profile?

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: proposal: some pointer and arithmetic utilities

2013-03-21 Thread David Young
On Thu, Mar 21, 2013 at 08:29:46PM +, Taylor R Campbell wrote:
>Date: Thu, 21 Mar 2013 21:11:56 +0100
>From: Rhialto 
> 
>I would argue that all such mumble() should really be changed to
> 
>int
>mumble(struct foo *foo)
>{
>struct bar *bar = &foo->f_bar;
>...
>}
> 
>since they are making that assumption already anyway.
> 
> This transformation doesn't make sense when the caller of mumble works
> with bars, not foos, and it's the caller of the caller of mumble that
> needs to pass in a foo to mumble by way of passing a bar to the caller
> of mumble.
> 
> For example, consider workqueues and struct work.  If you want to put
> an object on a workqueue (i.e., pass an argument in a request for
> deferred work), you embed a struct work in the object and then the
> workqueue's worker takes the struct work and finds the containing
> object from of it.
> 
> kern_physio.c relies on the fact that the struct work is at the
> beginning of struct buf when it does
> 
> static void
> physio_done(struct work *wk, void *dummy)
> {
>   struct buf *bp = (void *)wk;
>   ...
> }
> 
> to recover bp after workqueue_enqueue(physio_workqueue, &bp->b_work,
> NULL).  But instead it could use
> 
>   struct buf *bp = container_of(wk, struct work, b_work);

I think you meant container_of(wk, struct buf, b_work) ?

Sometimes I wish that the C language had a notation for that
container_of() expression looking like this,

&bp->b_work = wk;

In the LHS expression, `bp' is the only free variable, so it gets set in
order for `wk' to equal `&bp->b_work'.

An analogous expression for an array is more ambiguous than the previous
example,

/*
 * Return the index of the `array' element pointed to by `elt'.
 */
int
indexof(const int *array, const int *elt)
{
int i;

&array[i] = elt;

return i;
}

because there are two free variables, `array' and `i'.  The compiler
should flag that as an error.  I guess we can eliminate the ambiguity in
a couple of ways.  One way is to const-ify array,

int
indexof(const int * const array, const int *elt)
{
int i;

&array[i] = elt;

return i;
}

Another way is to cast to const,

    int
indexof(const int *array, const int *elt)
{
int i;

&((const int * const)array)[i] = elt;

return i;
}

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: netbsd-6: pagedaemon freeze when low on memory

2013-03-06 Thread David Young
On Wed, Mar 06, 2013 at 06:54:58PM -0500, Richard Hansen wrote:
> On 2013-03-06 00:43, David Young wrote:
> > On Mon, Mar 04, 2013 at 02:43:47PM -0500, Richard Hansen wrote:
> >> With the patch applied, the pagedaemon no longer freezes.  However, LWPs
> >> start piling up in vmem_alloc() waiting for memory to become available.
> >> So it seems like this change is necessary but not sufficient.
> > 
> > vmem_alloc(..., VM_SLEEP) will sleep forever while resources are not
> > available to fulfill the request, and I think that's good.
> 
> OK.
> 
> Here's another thought:  What about changing some of the VM_SLEEP calls
> to VM_NOSLEEP, at least for the userspace-initiated syscalls?  The
> syscalls would then fail, moving the responsibility of dealing with low
> memory onto the userspace apps (they may be unhappy, but at least the
> kernel will stay functional).  This change would be in addition to the
> vmem_xalloc() wake changes you proposed (because those wake-ups may
> never come if the system is truly running on fumes).

You could do that.  You just have to take care to handle the errors
properly.  There are more error paths to test.  Not trying to discourage
you, just point out the trade-offs. :-)

> > It looks to me
> > like vmem_alloc() needs to wake from its sleep and retry in either of
> > two conditions.
> > 
> > Condition 1: the backend has new memory available for vmem to add to
> > the arena
> > 
> > I don't think vmem_alloc() will ever wake in this condition,
> > because I don't see any way for a backend to signal to the vmem
> > that memory is available.
> > 
> > To fix this problem, add a new vmem(9) method,
> > vmem_backend_ready(vmem_t *, ...) and add a couple of vm_flag_t's,
> > VM_SLEEPING and VM_WAKING.  Before vmem_alloc() starts to wait
> > on its condition variable, let it call the backend's import
> > callback with the VM_SLEEPING flag.  Now the backend knows the
> > arena is sleeping and the parameters of the region it waits
> > for.  If a region that may satisfy the VM_SLEEPING call is
> > freed, let it call vmem_backend_ready() on the arena.  Let
> > vmem_alloc() call the import function again with VM_WAKING when
> > it has satisfied the request that it was VM_SLEEPING for.
> 
> I'm not familiar enough with vmem(9) (or really the kernel in general)
> to fully understand your proposal, but would this scheme work if there
> are multiple vmem_xalloc()s waiting for memory?

It would.

> Wouldn't it be better to wait on a backend condvar and let the backend
> broadcast when anything becomes available?  Or would that incur too much
> locking overhead?

That won't work because you need to wait on two things at once:
availability in the backend *or* availability in the arena (vmem_free()
/ vmem_xfree()).

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


vmem(9) (was Re: netbsd-6: pagedaemon freeze when low on memory)

2013-03-06 Thread David Young
On Wed, Mar 06, 2013 at 08:05:35AM +, David Laight wrote:
> On Tue, Mar 05, 2013 at 11:43:35PM -0600, David Young wrote:
> > Maybe we can avoid unnecessary locking or redundancy using a
> > generation number?  Add a generation number to the vmem_t,
> > 
> > volatile uint64_t vm_gen;
> > 
> > Increase a vmem_t's generation number every
> > time that vmem_free(), vmem_xfree(), or vmem_backend_ready() is
> > called:
> 
> Won't that generate a very hot cache line on a large smp system?
> Maybe the associated structures are actually worse here!
> But per-cpu virtual address free lists might make sense.

I think you mean that the line containing the generation number will
bounce rapidly between caches, which isn't efficient.  I agree with
that.  Perhaps we can reduce the cacheline bounciness induced by
vmem_{x,}free() if those routines acquire the vmem_t lock and increase
vm_gen only if they're freeing to an empty arena.  Or whatever it takes
to make the code correct as well as fast.

General comment: ISTM that vmem(9) is, and was always intended to be,
a general-purpose allocator of number intervals that may or may not
correspond to memory addresses.  I have actually used it as such.  It is
actually badly named: extent(9) is a better name, but it's taken.  Let
us keep that in mind.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: netbsd-6: pagedaemon freeze when low on memory

2013-03-05 Thread David Young
On Mon, Mar 04, 2013 at 02:43:47PM -0500, Richard Hansen wrote:
> With the patch applied, the pagedaemon no longer freezes.  However, LWPs
> start piling up in vmem_alloc() waiting for memory to become available.
>  So it seems like this change is necessary but not sufficient.

vmem_alloc(..., VM_SLEEP) will sleep forever while resources are not
available to fulfill the request, and I think that's good.  It looks to me
like vmem_alloc() needs to wake from its sleep and retry in either of
two conditions.

Condition 1: the backend has new memory available for vmem to add to
the arena

I don't think vmem_alloc() will ever wake in this condition,
because I don't see any way for a backend to signal to the vmem
that memory is available.

To fix this problem, add a new vmem(9) method,
vmem_backend_ready(vmem_t *, ...) and add a couple of vm_flag_t's,
VM_SLEEPING and VM_WAKING.  Before vmem_alloc() starts to wait
on its condition variable, let it call the backend's import
callback with the VM_SLEEPING flag.  Now the backend knows the
arena is sleeping and the parameters of the region it waits
for.  If a region that may satisfy the VM_SLEEPING call is
freed, let it call vmem_backend_ready() on the arena.  Let
vmem_alloc() call the import function again with VM_WAKING when
it has satisfied the request that it was VM_SLEEPING for.

Condition 2: the vmem arena was replenished with vmem_free() or vmem_xfree()

vmem_alloc() sometimes will wake in this condition, however,
I'm not sure that it will wake reliably, because it does not
continuously hold a lock both while it checks the out-of-memory
condition and while it waits on a condition variable for the
out-of-memory condition to change.  There's a race condition:
it's possible for a thread to alleviate the out-of-memory
condition after vmem_alloc() tests for it but before vmem_alloc()
does VMEM_LOCK(vm); VMEM_CONDVAR_WAIT(vm).  If that happens,
then vmem_alloc() could wait forever for the out-of-memory
condition to end.

It's undesirable for vmem_alloc() to VMEM_LOCK(vm) unless it's
strictly necessary, and testing the out-of-memory condition
a second time after VMEM_LOCK() but before VMEM_CONDVAR_WAIT()
seems redundant.

Maybe we can avoid unnecessary locking or redundancy using a
generation number?  Add a generation number to the vmem_t,

volatile uint64_t vm_gen;

Increase a vmem_t's generation number every
time that vmem_free(), vmem_xfree(), or vmem_backend_ready() is
called:

VMEM_LOCK(vm);  /* have to hold lock to modify vm_gen */
vm->vm_gen++;
VMEM_CONDVAR_BROADCAST(vm);
VMEM_UNLOCK(vm);

Before testing the out-of-memory condition in vmem_alloc(),
read the generation number:

again:
gen = vm->vm_gen;
membar_consumer();

... memory available? if so, return.  otherwise ...

VMEM_LOCK(vm);
while (gen == vm->vm_gen)
VMEM_CONDVAR_WAIT(vm);
VMEM_UNLOCK(vm);
goto again;

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: drivers customizing mmap(2)

2013-03-05 Thread David Young
On Wed, Mar 06, 2013 at 12:12:10PM +0900, Masao Uebayashi wrote:
> vm_object containing ... what?  DMA memory?  Device's memory?

Containing whatever the driver requires.  DMA-able memory, device
registers, wired kernel memory.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: drivers customizing mmap(2)

2013-03-04 Thread David Young
On Tue, Mar 05, 2013 at 12:31:37AM +, Taylor R Campbell wrote:
> I'd like to add a member fo_mmap to struct fileops so that cloning
> devices can support mmap and use other kinds of uvm objects than just
> uvm_deviceops and uvm_vnodeops.
> 
> Currently the only ways to customize mmap(2) are
> 
> . with a device node, by setting d_mmap in the struct cdevsw, but the
> customization is very limited -- you have to use uvm_deviceops, and
> you can't have per-open state for the device (e.g., with a cloning
> device -- file descriptors for a cloning device have type DTYPE_MISC
> which mmap(2) rejects); or
> 
> . with a file system, by providing VOP_MMAP, but you're limited to
> uvm_vnodeops, and a file system is a complex beast.
> 
> Our drm code currently uses a non-cloning device for the drm(4) device
> with some horrible bookkeeping kludges to associate per-open (really,
> per-pid) state with the device.  I'd like to get rid of these horrible
> bookkeeping kludges by using a cloning device, and the newer
> incarnation of drm looks like it will require more of mmap(2) than
> uvm_deviceops and uvm_vnodeops can straightforwardly provide,
> including custom fault handlers.
> 
> I haven't fleshed out a detailed proposal or a patch to implement this
> yet, but it should be fairly straightforward, and may help to clean up
> some of the complex nests of conditionals in uvm_mmap.c.  Thoughts?

At $DAYJOB we wanted an mmap-able character device both to manage the
VA->PA mapping protection (read-only, read-write), to handle faults,
and to remove mappings under certain conditions.  It looked to me like
the uvm_pagerops structure encapsulated all of the functions that our
device needed to supply, but I could find no way for a character device
to supply a custom uvm_pagerops for an mmap'd region.  I thought that
perhaps the device's cdevsw could provide an alternate to d_mmap,
int (*d_get_pagerops)(dev_t, struct uvm_pagerops **) or something,
for uvm_mmap to call.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: What's an "MPSAFE" driver need to do?

2013-02-28 Thread David Young
On Thu, Feb 28, 2013 at 09:43:49AM +0100, Manuel Bouyer wrote:
> On Thu, Feb 28, 2013 at 02:29:11AM -0500, Mouse wrote:
> > [...]
> > Well, assuming rwlock(9) is considered a subset of mutex(9) for the
> > purposes of that sentence, I then have to ask, what else is there?
> > spl(9), the traditional way, specifically calls out that those routines
> > work on only the CPU they're executed on (which is what I'd expect,
> > given what they have traditionally done - but, I gather from the
> > manpage, no longer do).
> > 
> > This then leads me to wonder how a driver can _not_ be MPSAFE, since
> > the old way doesn't/can't actually work and the new way is MPSAFE.
> 
> A driver not marked MPSAFE will be entered (especially
> its interrupt routine, but also from upper layers) with
> kernel_lock held. This is what makes spl(9) still work.
> In order to convert a driver using spl(9)-style calls, you have to replace
> spl(9) calls with a mutex of the equivalent IPL level (a rwlock won't work
> for this as it can't be used in interrupt routines, only thread context).

I want to complicate this idea of spl->mutex conversion a bit.  I used
to think that replacing spl calls by mutex calls would block the same
interrupts that traditionally spl blocks.  Then I realized that I'd
been misled both by the "lore" surrounding spl->mutex conversion, and
by reading (and re-reading) the manual: a mutex initialized with level
`ipl' does NOT necessarily block interrupts.  It will block them if
it is a spin mutex (initialized with one of the hardware interrupt
levels: IPL_VM, IPL_SCHED, IPL_HIGH), but it will not if it is an
adaptive mutex (initialized with one of the software interrupt levels,
IPL_SOFT*).  So things are not 100% symmetrical in mutex land.

Generally you're safe if both your interrupt handlers and your code
running in a "normal" thread context acquire & release the same
mutex in critical sections.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


low-priority xcall(9)s and preemption

2013-02-08 Thread David Young
The xcall(9) manual page says,

 xcall provides a mechanism for making ``low priority'' cross calls.  The
 function to be executed runs on the remote CPU within a thread context,
 and not from a software interrupt, so it can ensure that it is not
 interrupting other code running on the CPU, and so has exclusive access
 to the CPU.  Keep in mind that unless disabled, it may cause a kernel
 preemption.

I take that last sentence to mean that a low-priority cross call *may*
preempt a thread on the remote CPU.  Is that correct?

In other words, can we rephrase that, "A low-priority cross call may
preempt a thread running on the remote CPU unless preemption is disabled
on that CPU."

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: kcpuset(9) questions

2013-02-04 Thread David Young
On Mon, Feb 04, 2013 at 11:45:33PM +, Mindaugas Rasiukevicius wrote:
> Matt Thomas  wrote:
> > 
> > On Feb 3, 2013, at 3:33 PM, Mindaugas Rasiukevicius wrote:
> > 
> > > Any reason why do you need bitfield based iteration, as opposed to list
> > > or array based?
> > 
> > Be nice to have a MI method instead a hodgepodge of MD methods.
> > 
> > The CPU_FOREACH method is ugly.
> 
> I totally agree.  That is why couple years ago I wanted to add and convert
> everything to MI replacement of struct cpu_info.  Since this work requires
> intervention to all ports, it did not materialise since..  After cleaning
> up the dust from ancient patches, I could put MI interface into a branch.

I want to manipulate sets of CPUs in the kernel, and a set of CPUs is
what I understand a kcpuset_t to be.

Sometimes I want to iterate over the members of a set.

> However, I do not think that adding ad-hoc bitfield based interface in
> addition to the "ugly" one is an improvement.  Quite the opposite as then
> we would need to deal with two "not great" ones.

I don't care whether the implementation of CPU sets is based on
bitfields or lists or arrays.  And it's fine with me if kcpuset
iteration is not in addition to CPU_INFO_FOREACH, but instead of it.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: kcpuset(9) questions

2013-02-03 Thread David Young
On Sun, Feb 03, 2013 at 11:33:10PM +, Mindaugas Rasiukevicius wrote:
> David Young  wrote:
> > > There are kcpuset_attached and kcpuset_running, which are MI.  All ports
> > > ought to switch to them replacing MD cpu_attached/cpu_running.  They can
> > > be wrapped into a routine, but globals seem harmless in this case too.
> > 
> > It seems that if they are not wrapped in routines, they should be
> > declared differently, e.g.,
> > 
> > extern const kcpuset_t * const kcpuset_attached;
> 
> Although we are far from this, but in the long term we would like to
> support run time attaching/detaching of CPUs, so it would not be const.

It would be nice to have the compiler's help to avoid adding/deleting
CPUs to/from kcpuset_attached or kcpuset_running by accident.  Only the
kcpuset_{attached,running} implementation code should be writing those
sets.  Users of kcpuset_{attached,running} should only be reading them.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: kcpuset(9) questions

2013-02-03 Thread David Young
On Sun, Feb 03, 2013 at 04:22:37PM -0800, Matt Thomas wrote:
> 
> On Feb 3, 2013, at 3:33 PM, Mindaugas Rasiukevicius wrote:
> 
> > Any reason why do you need bitfield based iteration, as opposed to list
> > or array based?
> 
> Be nice to have a MI method instead a hodgepodge of MD methods.
> 
> The CPU_FOREACH method is ugly.

What Matt said. :-)

Dave



Re: kcpuset(9) questions

2013-02-01 Thread David Young
On Mon, Jan 28, 2013 at 02:34:32AM +, Mindaugas Rasiukevicius wrote:
> David Young  wrote:
> > I was using kcpuset(9) a little bit today and I was surprised that
> > there was not a routine or a variable representing all of the attached
> > CPUs.  I see that there is such a MI variable declared in ,
> > kcpuset_attached.  Should it be part of the API?
> 
> There are kcpuset_attached and kcpuset_running, which are MI.  All ports
> ought to switch to them replacing MD cpu_attached/cpu_running.  They can
> be wrapped into a routine, but globals seem harmless in this case too.

It seems that if they are not wrapped in routines, they should be
declared differently, e.g.,

extern const kcpuset_t * const kcpuset_attached;

> > 
> > Also, a kcpuset iterator would have been useful.  Perhaps there should be
> > one?
> > 
> 
> There was no use case, when I added it.  Can you describe your use case?
> Usually we iterate all CPUs with CPU_INFO_FOREACH() anyway (which should
> also be replaced with a MI interface, but that requires non-trivial
> invasion into all ports).

Well, iterating all CPUs would be one use case.  Another case would be
to, say, iterate the CPUs where a message-signalled interrupt (MSI)
handler should be established.

I was trying to decide the other night whether iterating a kcpuset_t w/
a for-loop was unwieldy under my _first/_next proposal:

bool more;
cpuid_t cpuid;

for (more = kcpuset_first(kcpu, &cpuid);
 more;
 more = kcpuset_next(kcpu, &cpuid)) {
// do sumthin
}

Not great, but ok?

Dave



kcpuset(9) questions

2013-01-26 Thread David Young
I was using kcpuset(9) a little bit today and I was surprised that
there was not a routine or a variable representing all of the attached
CPUs.  I see that there is such a MI variable declared in ,
kcpuset_attached.  Should it be part of the API?

Also, a kcpuset iterator would have been useful.  Perhaps there should be
one?

/* If `kcp' is not empty, return the first CPU in `kcp' in `*cpuid' and
 * return true.  If `kcp' is empty, do not modify `*cpuid', and return false.
 */
bool kcpuset_first(const kcpuset_t *kcp, cpuid_t *cpuid);

/* Return the next CPU ID in `kcp' after `*cpuid' in `*cpuid' and
 * return true or, if there is no such ID, do not modify `*cpuid', and
 * return false.
 */
bool kcpuset_next(const kcpuset_t *kcp, cpuid_t *cpuid);
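For illustration, a minimal sketch of how _first/_next could work over a bitset (a toy 32-CPU kcpuset_t; the real kcpuset_t is opaque and may span several words, and __builtin_ctz is a GCC/Clang builtin):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t cpuid_t;
typedef struct { uint32_t bits; } kcpuset_t;	/* toy: up to 32 CPUs */

/* If `kcp' is not empty, return the first CPU in `kcp' in `*cpuid' and
 * return true.  If `kcp' is empty, do not modify `*cpuid', and return false. */
bool
kcpuset_first(const kcpuset_t *kcp, cpuid_t *cpuid)
{
	if (kcp->bits == 0)
		return false;
	*cpuid = (cpuid_t)__builtin_ctz(kcp->bits);
	return true;
}

/* Return the next CPU ID in `kcp' after `*cpuid' in `*cpuid' and return
 * true or, if there is no such ID, do not modify `*cpuid', and return false. */
bool
kcpuset_next(const kcpuset_t *kcp, cpuid_t *cpuid)
{
	/* mask off bit `*cpuid' and everything below it */
	uint32_t rest = kcp->bits & ~((2u << *cpuid) - 1u);

	if (rest == 0)
		return false;
	*cpuid = (cpuid_t)__builtin_ctz(rest);
	return true;
}
```

With these, the for-loop iteration visits the set's members in ascending CPU ID order.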

Dave



Re: event counting vs. the cache

2013-01-17 Thread David Young
On Thu, Jan 17, 2013 at 11:10:24PM +, David Laight wrote:
> On Thu, Jan 17, 2013 at 03:43:13PM -0600, David Young wrote:
> > 
> > 2) Split all counters into two parts: high-order 32 bits, low-order 32
> >bits.  It's only necessary to touch the high-order part when the
> >low-order part rolls over, so in effect you split the counters into
> >write-often (hot) and write-rarely (cold) parts.  Cram together the
> >cold parts in cachelines.  Cram together the hot parts in cachelines.
> >Only the hot parts change that often, so the ordinary footprint of
> >counters in the cache is cut almost in half.
> 
> That means have to have special code to read them in order to avoid
> having 'silly' values.

We can end up with silly values with the status quo, too, can't we?  On
32-bit architectures like i386, x++ for uint64_t x compiles to

addl $0x1, x
adcl $0x0, x+4

If the addl carries, then reading x between the addl and adcl will show
a silly value.

I think that you can avoid the silly values.  Say you're using per-CPU
counters.  If counter x belongs to CPU p, then avoid silly values by
reading x in a low-priority thread, t, that's bound to p and reads hi(x)
then lo(x) then hi(x) again.  If hi(x) changed, then t was preempted by
a thread or an interrupt handler that wrapped lo(x), so t has to restart
the sequence.
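A sketch of that reader, assuming the counter is split into volatile 32-bit halves (names hypothetical); the loop re-reads hi(x) to detect that lo(x) wrapped in between:

```c
#include <stdint.h>

struct split_counter {
	volatile uint32_t hi;	/* cold: bumped only when lo wraps */
	volatile uint32_t lo;	/* hot: bumped on every event */
};

/* Read a consistent 64-bit value.  Assumes the reader is a
 * low-priority thread bound to the counter's owning CPU, so that
 * anything that increments the counter preempts it; if hi changed
 * between the two loads, lo wrapped mid-read and we retry. */
static uint64_t
split_counter_read(const struct split_counter *c)
{
	uint32_t hi, lo;

	do {
		hi = c->hi;
		lo = c->lo;
	} while (hi != c->hi);
	return ((uint64_t)hi << 32) | lo;
}
```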

Dave



event counting vs. the cache

2013-01-17 Thread David Young
It's customary to use a 64-bit integer to count events in NetBSD because
we don't expect for the count to roll over in the lifetime of a box
running NetBSD.

I've been thinking about what these wide integers do to the cache
footprint of a system and wondering if we shouldn't make a couple of
changes:

1) Cram just as many counters into each cacheline as possible.
   Extend/replace evcnt(9) to allow the caller to provide the storage
   for the integer.

   On a multiprocessor box, you don't want CPUs sharing counter
   cachelines if you can help it, but do cram together each individual
   CPU's counters.

2) Split all counters into two parts: high-order 32 bits, low-order 32
   bits.  It's only necessary to touch the high-order part when the
   low-order part rolls over, so in effect you split the counters into
   write-often (hot) and write-rarely (cold) parts.  Cram together the
   cold parts in cachelines.  Cram together the hot parts in cachelines.
   Only the hot parts change that often, so the ordinary footprint of
   counters in the cache is cut almost in half.

I suppose you could split counters into four or more parts of 16 or
fewer bits each, and in that way shrink the footprint even further, but it
seems that you would reach a point of diminishing returns very quickly.
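The increment path for the split in (2) could look like this sketch, with the hot and cold halves packed into separate arrays so the frequently-written words share cachelines (layout and names are hypothetical):

```c
#include <stdint.h>

/* Hot (lo) and cold (hi) halves live in separate arrays: the lo
 * words, which change on every event, pack together in cachelines,
 * while the hi words are touched only on wrap. */
struct split_counters {
	uint32_t lo[64];	/* hot */
	uint32_t hi[64];	/* cold */
};

static inline void
split_counter_inc(struct split_counters *c, unsigned i)
{
	if (++c->lo[i] == 0)	/* wrapped 2^32 -> 0 */
		c->hi[i]++;
}
```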

Perhaps this has been tried before and found to (not) work reasonably
well?

Dave



kernel linker wish

2013-01-02 Thread David Young
weak_alias(bus_space_read_4, xh_bus_space_read_4)

and likewise reserves a private symbol for calling the implementation it
overrode, also called super_bus_space_read_4.

If we load the modules bus_space_debug.kmod and bus_space_xh.kmod in
that order, then a call to bus_space_read_4 gets the xh_bus_space_read_4
implementation, which does its work and calls (through its symbol
super_bus_space_read_4) debug_bus_space_read_4, which does its work and
calls (through its super_bus_space_read_4) the default implementation,
_bus_space_read_4.

I think that to implement loading/unloading modules that refine each
other in this way, you could also use the aliases[symbol] stacks, but
they would grow taller than 0 or 1 items.

It is strange to use a weak alias to override a weak alias (why should a
loadable module's weak alias override the kernel's weak alias?); it may
be necessary to have a new kind of alias or else some meta-information
about each alias so that there is no ambiguity about what the kernel
linker should do.
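As a toy model of the chaining described above (using GCC's weak/alias attributes directly, rather than NetBSD's __weak_alias() macro or the kernel linker; the symbol names follow the example in this message):

```c
#include <stdint.h>
#include <stdio.h>

/* Default implementation. */
uint32_t
_bus_space_read_4(uintptr_t addr)
{
	(void)addr;
	return 42;	/* stand-in for a real register read */
}

/* The override's private "super" symbol: the implementation it
 * overrode, here the default. */
uint32_t super_bus_space_read_4(uintptr_t)
	__attribute__((alias("_bus_space_read_4")));

/* Overriding implementation: does its own work, then chains to
 * super_bus_space_read_4. */
uint32_t
xh_bus_space_read_4(uintptr_t addr)
{
	printf("xh: reading %p\n", (void *)addr);
	return super_bus_space_read_4(addr);
}

/* Weak alias: bus_space_read_4 resolves to the override here;
 * with no override it would alias the default instead. */
uint32_t bus_space_read_4(uintptr_t)
	__attribute__((weak, alias("xh_bus_space_read_4")));
```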

Dave



Re: re(4) MAC address

2012-12-28 Thread David Young
On Fri, Dec 28, 2012 at 04:45:14PM +0100, Frank Wille wrote:
> On Fri, 28 Dec 2012 23:33:01 +0900
> Izumi Tsutsui  wrote:
> 
> > The attached patch make re(4) always use IDR register values
> > for its MAC address.
> > 
> > We no longer have to link rtl81x9.c for eeprom read functions
> > and I'm not sure if we should make the old behavior optional
> > or remove completely.
> 
> I cannot imagine any case where it is needed. When an EEPROM is present,
> the IDR registers should be initialized with its MAC.
> 
> Maybe somebody who owns an re(4) NIC with an EEPROM should confirm that.
> 
> 
> > But for now I think it's almost harmless so please commit
> > if it works on re(4) on your NAS boxes.
> 
> Unfortunately, there is still a dependency with rtl81x9.c:
> 
> rtl8169.o: In function `re_ioctl':
> rtl8169.c:(.text+0x680): undefined reference to `rtk_setmulti'
> rtl8169.o: In function `re_init':
> rtl8169.c:(.text+0x1bc4): undefined reference to `rtk_setmulti'
> 
> As this is the only function needed from rtl81x9.c it probably makes
> sense to add rtk_setmulti() and the rtk_calchash macro to rtl8169.c.

Please, don't copy them.  Put them into a module the drivers can share.

Dave



Re: lua(4), non-invasive and invasive parts

2012-12-27 Thread David Young
On Mon, Dec 24, 2012 at 10:43:03AM +0100, Marc Balmer wrote:
> For such more invasive changes, I foresee to use a kernel option, 'options 
> LUA' which will compile such code only when the option is enabled.  It will 
> be commented out by default, besides maybe the ALL kernels.

Why not use a kernel module?

Dave



Re: KNF and the C preprocessor

2012-12-10 Thread David Young
On Mon, Dec 10, 2012 at 03:50:00PM -0500, Thor Lancelot Simon wrote:
> On Mon, Dec 10, 2012 at 02:28:28PM -0600, David Young wrote:
> > On Mon, Dec 10, 2012 at 07:37:14PM +, David Laight wrote:
> > 
> > > a) #define macros tend to get optimised better.
> > 
> > Better even than an __attribute__((always_inline)) function?
> 
> I'd like to submit that neither are a good thing, because human
> beings are demonstrably quite bad at deciding when things should
> be inlined, particularly in terms of the cache effects of excessive
> inline use.

I agree with that.  However, occasionally I have found when I'm
optimizing the code based on actual evidence rather than hunches, and
the compiler is letting me down, always_inline was necessary.

Dave



Re: KNF and the C preprocessor

2012-12-10 Thread David Young
On Mon, Dec 10, 2012 at 10:27:39PM +0200, Alan Barrett wrote:
> On Mon, 10 Dec 2012, David Young wrote:
> >What do people think about setting stricter guidelines for using
> >the C preprocessor than the guidelines from the past?
> 
> Maybe.
> 
> >The C preprocessor MUST NOT be used for
> >
> >1 In-line code: 'static inline' subroutines are virtually always better
> > than macros.
> 
> I disagree with this one.  If you tone it down to "SHOULD NOT" or

Sure, let's make it SHOULD NOT.

Dave



Re: KNF and the C preprocessor

2012-12-10 Thread David Young
On Mon, Dec 10, 2012 at 07:37:14PM +, David Laight wrote:
> On Mon, Dec 10, 2012 at 09:36:35AM -0600, David Young wrote:
> > What do people think about setting stricter guidelines for using the
> > C preprocessor than the guidelines from the past?  Example guidelines:
> ...
> > 4 Computed constants.  The result of a function call may not be used
> >   in a case-statement, even if the function evaluates to a constant at
> >   compile time.  You have to use a macro, instead.
> 
> The alternative to constants would be C enums.
> However C enums are such 2nd class citizens that they have problems
> of their own.

I'm not sure you mean quite the same thing.  An example of what I mean
by "computed constant" would be something like f(Y) where Y is some
other constant and f(X) can always be evaluated to a constant at compile
time: f() may not be a function, not even a static/inline function, if
f(Y) appears in a case statement.

(Actually, if f() is an inline function and the compiler optimization
level is turned up, GCC will let you put f(Y) in a case statement.  Turn
the optimization level down, though, and you get a compile error.)
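A concrete instance, with a hypothetical f():

```c
/* Macro form: f(Y) folds to an integer constant expression, so it
 * is legal as a case label. */
#define f(x)	((x) * 8 + 1)

/* Function form: even if the compiler always folds the call at -O2,
 * C does not consider it a constant expression, so this would be
 * rejected as a case label by a conforming compiler:
 *
 *	static inline int f(int x) { return x * 8 + 1; }
 */

int
classify(int v)
{
	switch (v) {
	case f(0):	/* case 1: */
		return 0;
	case f(1):	/* case 9: */
		return 1;
	default:
		return -1;
	}
}
```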

> > The C preprocessor MUST NOT be used for
> > 
> > 1 In-line code: 'static inline' subroutines are virtually always better
> >   than macros.
> 
> That rather depends on your definition of better.

It comes down to the ease of reading/understanding/writing a macro like

#define M(x, y) \
do {\
... \
... \
... \
} while (0)

when something like

static inline void
M(int x, int y)
{
...
...
...
}

will do.  The guideline can be re-phrased, "reach for a function
before a hairy macro; use a hairy macro only when nothing else will
do."  When I say "hairy macro" I mean one like WM_INIT_RXDESC() in
sys/dev/pci/if_wm.c: the extra underscores, parens, and backslashes
badly clutter the code.  Was the same code written as a static or static
inline function, first, found wanting, and converted to a macro?  Or was
the author in the habit of using a macro, first?  I'm pretty sure that
the code is a macro for the latter reason.

> a) #define macros tend to get optimised better.

Better even than an __attribute__((always_inline)) function?

> b) __LINE__ (etc) have the value of the use, not the definition.

I certainly don't want to rule out the careful use of __LINE__ or
__func__.

> > 2 Configuration management: use the compiler & linker to a greater
> >   extent than the C preprocessor to configure your program for your
> >   execution environment, your chosen compilation options, et cetera.
> 
> Avoiding #ifdef inside code tends to be benefitial.
> But, IMHO, there isn't much wrong with using #defines in header files
> to remove function calls.

Example?

> Using the compiler gets to be a PITA because of the warning/errors
> about unreachable code.

I wrote the guidelines in 2010 and they sat in a draft form ever since.
I no longer remember what I had in mind when I wrote "compiler" above.

> > 3 Virtually anything else. :-)
> 
> There are some very useful techniques that allow a single piece of
> source to be expanded in multiple ways.

I don't disagree.  I don't want to discourage the use of the C
preprocessor altogether, just to make sure its use is measured against
the potential headaches.

Dave



KNF and the C preprocessor

2012-12-10 Thread David Young
What do people think about setting stricter guidelines for using the
C preprocessor than the guidelines from the past?  Example guidelines:

The C preprocessor MAY be used for

1 Lazy evaluation: unlike a function call's arguments, a macro's
  arguments are not evaluated before the macro is "called."  E.g., in
  the code below, the second and following arguments of M(a, b, c, d, e)
  are not evaluated unless p(a) is true:

#define M(__x, ...) \
do {\
if (p(__x)) \
f(__x, __VA_ARGS__);\
} while (false)

M(a, b, c, d, e);

2 Lexical manipulation: e.g., concatenating symbols, converting symbols
  to strings:

#define PAIR(__x)   { #__x, __x }

struct {
const char *name;
unsigned int value;
} errno_name_to_value[] = {PAIR(EINVAL), PAIR(EEXIST), PAIR(ENOENT)};

3 Generic programming:


#define __arraycount(__x)   (sizeof(__x) / sizeof(__x[0]))

4 Computed constants.  The result of a function call may not be used
  in a case-statement, even if the function evaluates to a constant at
  compile time.  You have to use a macro, instead.

The C preprocessor MUST NOT be used for

1 In-line code: 'static inline' subroutines are virtually always better
  than macros.

2 Configuration management: use the compiler & linker to a greater
  extent than the C preprocessor to configure your program for your
  execution environment, your chosen compilation options, et cetera.

3 Virtually anything else. :-)

Dave



Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1

2012-11-28 Thread David Young
On Wed, Nov 28, 2012 at 07:27:56PM -0500, Greg Troxel wrote:
> 
> dhcpd, last I checked, used bpf and not sockets.
> 
> If dhcpd is bpf, I would suggest reading the bpf_tap calls in the
> driver.  It could be that if_wm.c has a spurious on.
> 
> If it's not, I don't know what's going on.

I'll bet this has something to do with the hardware VLAN tagging.  I
don't think BPF groks the VLAN mbuf tags.

FWIW, I think that hardware VLAN tagging is a lot of pain for no gain
the way that NetBSD is doing it.

Dave 



Re: Making forced unmounts work

2012-11-27 Thread David Young
On Mon, Nov 26, 2012 at 03:06:34PM +0100, J. Hannken-Illjes wrote:
> Comments or objections?

I'm wondering if this will fix the bugs in 'mount -u -r /xyz' where a
FFS is mounted read-write at /xyz?  Sorry, I don't remember any longer
what the bugs were.

Dave



Re: fexecve, round 3

2012-11-26 Thread David Young
On Mon, Nov 26, 2012 at 10:18:42AM +0100, Martin Husemann wrote:
> Does anyone know of a setup that uses a process outside of a chroot doing
> descriptor passing to a chrooted process?

Yes.  I can point to the same example as Thor has described, but I think
that it is easy to cook up numerous useful examples.

> I wonder if we should disallow that completely (i.e. fail the ancillary
> data send if sender and recipient have different p_cwdi->cwdi_rdir)?

This idea of failing the ancillary data transmission seems unnecessarily
inflexible to me.  I think that if process A has a "send descriptors"
privilege, and process B has a "receive descriptors" privilege, and
there is some communications channel from A to B, then A should be
able to send a descriptor to B regardless of the origin or properties
of that descriptor.  B's privileges may not be sufficient to use
certain "methods" of the descriptor---for example, to fexecve() the
descriptor---but I think that is ok, because B's entire purpose may be
to send the descriptor to a third process that can use the descriptor.

Dave



Re: [PATCH] fexecve

2012-11-17 Thread David Young
On Sat, Nov 17, 2012 at 12:16:49AM +0100, Rhialto wrote:
> On Thu 15 Nov 2012 at 20:18:56 -0600, David Young wrote:
> > Also, enforcing access along "effective roots" lines may be inflexible
> > or unwieldy, maybe a more abstract notion of "process coalition" is
> > better.  Let each new root have a corresponding new coalition, but
> > perhaps we should be able to create a new coalition without changing
> > root, and change root without changing coalition.
> 
> That would make yet another process grouping, confusingly (dis)similar
> to process groups, controlling-terminal groups, sessions, (and am I
> forgetting more perhaps?)

Process groups, controlling-terminal groups, and sessions are not
already confusingly dissimilar from each other?  Perhaps coalitions
could subsume them all: process group, controlling-terminal groups, and
sessions could become coalitions of different privileges & properties.

Dave



Re: Questions about pci configuration and the mpt(4) driver

2012-11-17 Thread David Young
On Sat, Nov 17, 2012 at 02:18:18AM -0800, Brian Buhrow wrote:
>   Hello.  I've been working on an issue with the mpt(4) driver, the
> driver for the LSI Fusion SCSI controller cards and raid cards.  In the
> process of working through the issue, I've discovered that the mpt(4)
> driver is very fragile if the need to reset the hardware arises.  In
> particular, if a hardware reset is done, all of the pci configuration
> registers get zorched, causing interrupt handling to fail and requests to
> get stuck in the driver and hardware's queue.

When does the need for a reset arise?  Is the cause a driver bug or a
hardware/firmware bug?

It sounds to me like you should detach the device (perhaps resetting
in the final stages of the detachment---i.e., before unmapping the
registers) and re-attach it.

Since detachment ordinarily loses all of the software state, you may
need to stash the outstanding requests somewhere that the re-attached
device can find them.

Dave

> I've been looking for
> examples of how to reset the PCI registers after such a reset, but neither
> the OpenBSD or FreeBSD drivers offer a clue.  All BSD drivers I've looked
> at lament the problem, but none provide a solution.  I've considered
> extracting the PCI initialization process from the mpt_pci_attach() routine
> into a separate function that can be called at any time while things are
> running, but there must be a reason this hasn't been done already and why I
> don't see any examples that look obvious to me of any drivers that do this.
> Is it safe to call pci_intr_disestablish() and pci_intr_establish() during
> the course of normal multi-user operation for a particular driver as a
> means of re-attaching interrupts to a device that's forgotten how to
> generate them?  Are there any examples of drivers that do a complete reset
> of the hardware, including pci and pci interrupt settings while continuing
> to operate in multi-user mode?
> -thanks
> -Brian
> 



Re: [PATCH] fexecve

2012-11-15 Thread David Young
On Thu, Nov 15, 2012 at 04:57:24PM -0500, Thor Lancelot Simon wrote:
> On Thu, Nov 15, 2012 at 09:46:13PM +, David Holland wrote:
> > On Thu, Nov 15, 2012 at 11:03:15AM -0500, Thor Lancelot Simon wrote:
> >  > > Here is a patch that implements fexecve(2) for review:
> >  > > http://ftp.espci.fr/shadow/manu/fexecve.patch
> >  > 
> >  > This strikes me as profoundly dangerous.  Among other things, it
> >  > means you can't allow any program running in a chroot to receive
> >  > unix-domain messages any more since they might get passed a file
> >  > descriptor to code they should not be able to execute.
> > 
> > I have two immediate reactions to this: (1) being able to pass
> > executables to something untrusted in a controlled manner sounds
> > useful, not dangerous
> 
> Sorry to cherry-pick one more point for the moment:  Considered in a vacuum,
> I agree with your reaction #1 above.  The problem is that there is a great
> deal of existing code in the world which receives file descriptors and which
> is not designed with the possibility that they might then be used to exec.
> 
> With that history, I don't see a clear way to make this safe (for example
> by restricting which descriptors can be passed to chrooted processes) 
> without breaking code that assumes it can pass file descriptors without such
> restrictions.

Why restrict what descriptors can be passed?  It seems that you could
restrict what you can do with the descriptors after they are passed.

It seems like something like the following can be made to work:

Label a file descriptor with the root that was in effect when it was created
by, say, open(2).  The effective root will never change over the
lifetime of that descriptor.

Call the root of a descriptor z, root(z).

Let fexecve(zfd, ...) compare the root of the kernel file descriptor
corresponding to zfd with the effective root and return EPERM if they're
unequal.

Say that process 1 with effective root pqr opens an executable,
fd = open("./setuidprog", ...). Call fd's corresponding kernel
descriptor z.  Now process 1 passes z to process 2, whose effective
root is stu != pqr.  Process 2 tries to fexecve() the descriptor, but
root(z) != stu so fexecve() returns EPERM.

Maybe we can weaken fexecve()'s requirement on the effective root of z
to "root(z) must be reachable from the effective root," but I think that
that might be much more complicated.
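A toy model of that check (every type and field here is hypothetical, standing in for the kernel's real file and process structures):

```c
#include <errno.h>

struct root_dir;			/* opaque: a root vnode */

struct kfile {
	const struct root_dir *f_root;	/* root in effect at open(2) time */
	/* ... */
};

struct kproc {
	const struct root_dir *p_root;	/* process's effective root */
	/* ... */
};

/* fexecve() permission sketch: EPERM unless the descriptor was
 * created under the caller's current effective root. */
static int
fexecve_root_check(const struct kfile *fp, const struct kproc *p)
{
	return (fp->f_root == p->p_root) ? 0 : EPERM;
}
```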

fexecve() isn't the only call on which you may want to enforce a
"root(descriptor) == effective root" restriction.  You may want to
enforce it on read(2) and write(2), too.

Also, enforcing access along "effective roots" lines may be inflexible
or unwieldy, maybe a more abstract notion of "process coalition" is
better.  Let each new root have a corresponding new coalition, but
perhaps we should be able to create a new coalition without changing
root, and change root without changing coalition.

Dave



Re: ETHERCAP_* & ioctl()

2012-11-01 Thread David Young
On Wed, Oct 31, 2012 at 07:24:51PM +0900, Masanobu SAITOH wrote:
>  Hi, all.
> 
>  I sent the followin mail more than two years ago.
> 
> > http://mail-index.netbsd.org/tech-kern/2010/07/28/msg008613.html
> 
>  As the starting point to solve this problem, I committed the change to
> add SIOCGETHERCAP stuff.
> 
>  Example:
> > msk0: flags=8802 mtu 1500
> > ec_capabilities=5
> > ec_enabled=0
> > address: 00:50:43:00:4b:c5
> > media: Ethernet autoselect
> > status: no carrier
> > wm0: flags=8843 mtu 1500
> > 
> > capabilities=7ff80
> > 
> > enabled=7ff80
> > ec_capabilities=7
> > ec_enabled=0
> > address: 00:1b:21:58:68:34
> > media: Ethernet autoselect (1000baseT 
> > full-duplex,flowcontrol,rxpause,txpause)
> > status: active
> > inet 192.168.1.5 netmask 0xff00 broadcast 192.168.1.255
> > inet6 fe80::21b:21ff:fe58:6834%wm0 prefixlen 64 scopeid 0x2
> > inet6 2001:240:694:1:21b:21ff:fe58:6834 prefixlen 64
> 
> 
>  What do you think about this output?

I think that these flags belong within a "service hatch" rather than "on
the dashboard."  That is, shown via sysctl or ifconfig -v instead of in
the normal output of ifconfig.

What are the use-cases for reading/changing these flags?  I don't see
what an operator is supposed to do with this new information and with
these new controls.

I am curious whether these flags are good for anything except diagnosing and
working around driver bugs?  I ask because I don't think the operator
can ordinarily make a better selection of hardware-capability flags than
the OS can, except insofar as the OS has bugs and forces the user to
work around them.  BTW, I think that it is the same for the checksum
offload / TSO flags as for the ethernet capability flags, but I guess
that we're kind of stuck with the checksum/TSO flags by now.

Dave



Re: [patch] MI root filesystem detection

2012-10-04 Thread David Young
On Thu, Oct 04, 2012 at 06:50:18PM +, Michael van Elst wrote:
> dyo...@pobox.com (David Young) writes:
> 
> >At least on i386, the bootloader passes both BTINFO_BOOTWEDGE-
> >and BTINFO_BOOTDISK-type information to the kernel, but the
> >_BOOTWEDGE information supercedes the _BOOTDISK information: thus the
> >booted_partition is left at 0, and if the kernel fails to match the
> >_BOOTWEDGE info to a wedge it blithely selects the 0th partition on the
> >booted_device for its root filesystem.
> 
> The MD code should pass the proper information, there is no need to fix up the
> error in the MI part.

You have a patch for that, right?  Will you commit it?

Dave
 


[patch] MI root filesystem detection

2012-10-04 Thread David Young
This change to sys/kern/init_main.c:rootconf_handle_wedges()
lets the kernel select a partition in a BSD disklabel using the
BTINFO_BOOTWEDGE-type information from the bootloader if the _BOOTWEDGE
information matches no dk(4) instance.

At least on i386, the bootloader passes both BTINFO_BOOTWEDGE-
and BTINFO_BOOTDISK-type information to the kernel, but the
_BOOTWEDGE information supersedes the _BOOTDISK information: thus the
booted_partition is left at 0, and if the kernel fails to match the
_BOOTWEDGE info to a wedge it blithely selects the 0th partition on the
booted_device for its root filesystem.

Dave

--- sys/kern/init_main.c2012/09/05 20:48:18
+++ sys/kern/init_main.c2012/10/02 19:59:05
@@ -835,7 +853,7 @@
daddr_t startblk;
uint64_t nblks;
device_t dev; 
-   int error;
+   int error, partition;
 
if (booted_nblks) {
/*
@@ -882,6 +900,36 @@
if (dev != NULL) {
booted_device = dev;
booted_partition = 0;
+   return;
+   }
+   if (booted_nblks == 0)
+   return;
+
+   /*
+* Use the geometry to try to locate a partition.
+*/
+   vp = opendisk(booted_device);
+
+   if (vp == NULL)
+   return;
+
+   error = VOP_IOCTL(vp, DIOCGPART, &dpart, FREAD, NOCRED);
+   VOP_CLOSE(vp, FREAD, NOCRED);
+   vput(vp);
+   if (error)
+   return;
+
+   for (partition = 0; partition < MAXPARTITIONS; partition++) {
+
+   p = &dpart.disklab->d_partitions[partition];
+
+   startblk = p->p_offset;
+   nblks    = p->p_size;
+
+   if (startblk == booted_startblk && nblks == booted_nblks) {
+   booted_partition = partition;
+   break;
+   }
}
 }
 


Re: WAPBL/cache flush and mfi(4)

2012-08-24 Thread David Young
On Fri, Aug 24, 2012 at 10:38:50PM +0200, Manuel Bouyer wrote:
> On Fri, Aug 24, 2012 at 04:26:07PM -0400, Thor Lancelot Simon wrote:
> > > I think in this case you have to flush both: if you flush only the
> > > disks, the data you want to be on stable storage may still be in the
> > > controller's cache.
> > 
> > That doesn't make sense to me.  If you consider the controller cache
> > to be stable storage, then you clearly need to flush only the disks'
> > caches for all the data expected to be in stable storage to actually
> > be in stable storage.
> 
> Immagine the following scenario:
> - wapbl writes to its journal.
> - mfi(4) sends the write to controller, which keeps it in its
>   (battery-backed) cache and return completion of the command
> - wapbl requests a cache flush
> - mfi(4) translate this to a disk cache flush (but not controller cache
>   flush).
> - the controller sends a cache flush to disk. at this time, the data wapbl
>   cares about is still in the controller's cache
> - some time later, the controller flushes its data to disks. Now the
>   data from wapbl is in the unsafe disks' caches, and not in the controller
>   cache any more.
> 
> So you still need to flush the controller's cache before disks caches,
> otherwise data can migrate from safe storage to unsafe one.

Will a controller really empty its cache into the attached disks'
caches, or will it issue the disk writes, wait for the disks to
acknowledge that the data is on the platter, and then empty the cache?

I have the following vague idea in mind for how an operating system
should treat disk writes: it seems to me that our disks subsystem(s)
should treat streams of disk writes kind of like TCP sessions in
that the "receiver", which is either an instance of some disk driver
(e.g., sd(4)) or a non-volatile cache, tells the "sender" (some user
process that write(2)s, a filesystem, or the pager) that it is open to
receive up to X megabytes.  The sender sends the receiver X-megabytes'
worth of bufs, but holds onto a copy of the bufs itself until each is
acknowledged.  Ordinarily an acknowledgement will come back saying "you
may go ahead and send me Y more kilobytes, sender".  A sender may also
get a NACK ("sorry, the backup disk was unplugged before it acknowledged
that buffers P, Q, and R hit the media"); then it has to indicate the
exception or else retransmit the buffers.

Here and there in the system you will have software (a filesystem) or
hardware (a battery-backed cache) that "proxies" disk-write streams.  A
filesystem will "proxy" because it's probably going to either serialize
writes (say to write them to a journal) or to augment them (say to
update corresponding metadata).  Typically a filesystem will proxy, too,
because we don't expect for a user process to block in write(2) until
all the bytes written have landed on the platter.  A battery-backed
cache will proxy because it's going to guarantee disk-write completion
to the sender.

I have the following doubt about a battery-backed cache: what if I
yank the disk?  I have never met a controller with battery-backed
cache where I could not pull some of the disks right out of the front
of the chassis.  I guess that usually those disks were redundant,
too.  So, what if I yank two disks? :-) It seems like receivers and
proxy receivers ought to advertise the guarantees that they do and do
not make (e.g., "I guarantee that barring disk-yankage, I will put
your bytes on the platter" OR "barring power failure or disk-yankage
and non-replacement, I will put your bytes on the platter"), and
senders requirements ought to be matched to receivers guarantees when a
disk-write session is established.
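The credit accounting ("you may go ahead and send me Y more kilobytes, sender") could be modeled like this toy sketch, with all names hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sender-side state for one disk-write "session". */
struct write_session {
	size_t credit;		/* bytes the receiver has offered */
	size_t unacked;		/* bytes sent but not yet on the platter */
};

/* Send a buf only if credit allows; the sender holds its copy
 * (counted in unacked) until the receiver acknowledges it. */
static bool
session_send(struct write_session *s, size_t len)
{
	if (len > s->credit)
		return false;	/* must wait for more credit */
	s->credit -= len;
	s->unacked += len;
	return true;
}

/* ACK: len bytes hit stable storage, and the receiver grants the
 * sender more_credit further bytes. */
static void
session_ack(struct write_session *s, size_t len, size_t more_credit)
{
	s->unacked -= len;
	s->credit += more_credit;
}
```

A NACK would instead retransmit or report the still-held bufs rather than dropping them from unacked.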

Dave



PCI MSI musings

2012-07-15 Thread David Young
I'm writing to share some thoughts I've had about PCI Message-Signaled
Interrupts (MSI), MSI-X, and their application:

Establishment of an MSI/MSI-X handler routine customarily happens
in stages like this:

a) Query MSI/MSI-X device capabilities: find out any
   limitations to MSI/MSI-X message data & message address.
   E.g., 16-bit message width with 0..n^2-1 lower bits
   reserved (MSI), either 32-bit or 64-bit address width.

b) Establish a mapping,
   (message address, message data) -> (CPU(s),
   handler routine), in the interrupt controller (e.g.,
   IOAPIC) and in the CPU (interrupt vector table).

c) Program the MSI/MSI-X registers with the message data &
   address.

d) Enable MSI.

MSI/MSI-X are really useful when we use them in a customary mode, but
I think that there are useful ways that we can modify stage (b), above: 

1) Device chaining: one PCI bus-master processes a memory buffer
   and, when it has finished processing, triggers processing by a
   second device.  For example, a cryptographic coprocessor and a
   network interface (NIC) share a network buffer.  The cryptographic
   coprocessor encrypts the buffer and signals completion by sending
   a message.  The target of the message is a memory location
   corresponding to the NIC register that either triggers DMA
   descriptor-ring polling or advances the descriptor-ring tail pointer.

2) Device polling 1: low-cost polling for coprocessor completions:
   say that you have a userland driver for a PCI 3D graphics
   coprocessor whose pattern of operation is to write a list of
   triangles to render into memory that is shared with the device,
   to issue a Draw Polygons command, to do other work until the
   command completes, and to repeat.  The driver tests for completion
   of commands by polling a memory-mapped device register.  Usually
   polling a register is a costly operation at best.  At worst,
   polling may introduce variable latency: the host CPU may have
   to retry its transaction once or more while a PCI bus bridge
   forwards pending PCI transactions upstream.

   In a much more efficient arrangement, the userland driver polls
   a memory word that is the target for the coprocessor's
   message-signaled completion interrupts.  At least on DMA-coherent
   systems like x86, the memory word can be cached, so polling it
   is quite cheap.

3) Device polling 2: like above, but let us say that you have drivers
   polling a bunch of NICs.  Instead of polling with register reads, let
   them check a shared word for changes.

4) Timer invalidation: sometimes reading hardware time sources involves
   register reads that are costly.  If I have an application that uses
   the current time often but that doesn't need the time with equal
   accuracy as the time source provides, then the app may spend an
   inordinate amount of time reading and re-reading the registers of the
   time source.

   If the time source can be programmed to interrupt at intervals
   corresponding to the accuracy of time that your application
   wants, and if the source supports MSI, then we can direct its
   interrupt messages to a memory word that the app can treat as
   a "cache invalidated" flag:  when the app needs the current
   time, it refers to the flag.  If the flag is 0, then it reads
   the current time from the time-source registers and caches it.
   If the flag is 1, then it reads the current time from its cache.
   Let the interrupt's message data be 0, so that signalling the
   interrupt invalidates the app's cache.

5) I have been turning over and over in my head the idea that if there
   are no processes eligible to run on a CPU except for a userland
   device driver, if we want that device driver to wake and process an
   interrupt with very low latency, if we are allergic for some reason
   to spinning while waiting for the interrupt, and if MSI is available,
   then maybe on x86 we can MONITOR/MWAIT the cacheline containing an
   MSI target in the last few instructions of a return to userland.  The
   CPU will just hang there until either there is some other interrupt
   (the hardclock ticks, say) or the message signalling the interrupt
   lands.

   Granted, I may have described such a rare alignment of conditions
   that this is never worth it.  The latency of "waking" a CPU from
   its MWAIT may be very long, too: I think that typically MWAIT
   is used to put the CPU into a power-saving state.  I think that
   the amount of power-saving is adjustable, though.

   I think on most x86 CPUs, MONITOR/MWAIT are only available in
   the privileged context, so another problem is that you may have
   to MWAIT right on the brink of a kernel->user return.
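Idea (4) above can be sketched in userland C.  Here read_time_registers() stands in for the costly MMIO read, and the timer's message-signaled write of 0 to the flag word is simulated by a plain store; all of the names are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

static volatile uint32_t time_valid;	/* MSI target: message data 0 lands here */
static uint64_t cached_time;		/* the app's cached reading */
static uint64_t fake_clock;		/* stands in for the hardware counter */

/* Stand-in for the expensive register read of the real time source. */
static uint64_t
read_time_registers(void)
{
	return ++fake_clock;
}

static uint64_t
current_time(void)
{
	if (time_valid == 0) {		/* cache invalidated by the MSI */
		cached_time = read_time_registers();
		time_valid = 1;
	}
	return cached_time;		/* cheap cached path */
}
```

Between timer interrupts every current_time() call is a cached load and a compare; only the first call after each interrupt touches the device.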

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: avoiding bus_dmamap_sync() costs

2012-07-13 Thread David Young
On Fri, Jul 13, 2012 at 07:55:11AM +0100, David Laight wrote:
> On Thu, Jul 12, 2012 at 09:05:10PM -0500, David Young wrote:
> > 
> > In general, it is necessary on x86 to flush the store buffer on a
> > _PREREAD operation so that if we write a word to a DMA-able address and
> > subsequently read the same address again, the CPU will not satisfy the
> > read with store-buffer content (i.e., the word that we just wrote), but
> > with the last word written at that address by any agent.
> 
> I thought that was only true for cached addresses.
> On x86 uncached accesses bypass the store buffer (don't they also
> flush it?)
> (Ignoring the obscure non-temporal instructions etc.)

Right, uncached accesses bypass the store buffer.

> When do you care whether the read is serviced from the store buffer
> or from the cache line?

One cares when the store buffer or the cache line is stale.  I.e.,
there's newer information either in the cache or in RAM.

> You can only be interested in the value after the dma entity has read
> the value being written, and updated it with a new value.
> Reads serviced from the store buffer are only problems for device
> registers - and no one uses cached accesses for those.

Ok, I see what you're saying.  I will have to think about that.

This all raises the question of whether the bus_dmamap_sync() operations are
generally the right ones.

> Writing to un-snooped cached addresses when dma might write to the
> same cache line before the write completes will be impossible to get
> right on any architecture.

It will be impossible, and that reminds me that I was going to ask in my
previous email in this thread: is there any possibility whatsoever that
wm(4) can work on those architectures that lack DMA coherency, even if
all of the right bus_dmamap_sync()s are in place?  ISTR that when you
and I discussed this in another venue, we agreed that the answer was
"no" if the DMA descriptors were cached.

It's always bothered me that the bus_dmamem_map() flag BUS_DMA_COHERENT
is only *advice* to the backend.

> > if ((status & WTX_ST_DD) == 0) {
> > WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
> > BUS_DMASYNC_PREREAD | BUS_DMASYNC_CLEAN);
> > break;
> > }
> > 
> > And the x86 implementation of bus_dmamap_sync() would just skip the
> > locked instruction if BUS_DMASYNC_CLEAN was in the flags.
> 
> My worry is that using flags to a function tends to lead to run-time
> checks - whereas the requirement for these sysc/barrier is mostly
> compile time. All the conditionals might take longer than the lock.
> (which is what happened with the old LOCKMGR code.)

On this point, somebody's advice to me (may have been yours!) was to
introduce, on those archs that will benefit, an always-inlined shim that
makes some of the _sync()s collapse at compile-time to nothing:

static inline __attribute__((__always_inline__)) void
bus_dmamap_sync(..., int ops)
{
/* XXX not sure if the constant-ness propagates through the
 * always-inline call.  That is, may need to use a macro, *sigh*.
 */
if (!__builtin_constant_p(ops)) {
_bus_dmamap_sync(..., ops);
return;
}
switch (ops) {
case BUS_DMASYNC_PREREAD|BUS_DMASYNC_CLEAN:
case BUS_DMASYNC_PREWRITE|BUS_DMASYNC_CLEAN:
case BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE|BUS_DMASYNC_CLEAN:
return;
default:
_bus_dmamap_sync(..., ops);
return;
}
}

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: avoiding bus_dmamap_sync() costs

2012-07-12 Thread David Young
On Thu, Jul 12, 2012 at 09:05:10PM -0500, David Young wrote:
> At $DAYJOB I am working on a team that is optimizing wm(4).
> 
> In an initial pass over the driver, we found that on x86,
> bus_dmamap_sync(9) calls issued some unnecessary LOCK-prefix
> instructions, and those instructions were expensive.  Some of the
> locked instructions were redundant---that is, there were effectively
> two in a row---and others were just unnecessary.  What we found by
> reading the AMD & Intel processor manuals is that bus_dmamap_sync() can
> be a no-op unless you're doing a _PREREAD or _PREWRITE operation[1].
> _PREREAD and _PREWRITE operations need to flush the store buffer[2].
> The cache-coherency mechanism will take care of the rest.  We will
> feed back a patch with these changes and others just as soon as local
> NetBSD-current tree is compilable[3].
> 
> In a second pass over the driver, a teammember noted that even with
> the bus_dmamap_sync(9) optimizations already in place, some of the
> LOCK-prefix instructions were still unnecessary.  Just for example, take
> this sequence in wm_txintr():
> 
> status =
> sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
> if ((status & WTX_ST_DD) == 0) {
> WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
> BUS_DMASYNC_PREREAD);
> break;
> }
> 
> Here we are examining the status field of a Tx descriptor and, if we
> find that the descriptor still belongs to the NIC, we synchronize
> the descriptor.  It's correct and persuasive code, however, the x86
> implementation will issue a locked instruction that is unnecessary under
> these particular circumstances.
> 
> In general, it is necessary on x86 to flush the store buffer on a
> _PREREAD operation so that if we write a word to a DMA-able address and
> subsequently read the same address again, the CPU will not satisfy the
> read with store-buffer content (i.e., the word that we just wrote), but
> with the last word written at that address by any agent.
> 
> In these particular circumstances, however, we do not modify the
> DMA-able region, so flushing the store buffer is not necessary.
> 
> Let us consider another processor architecture.  On some ARM variants,
> the _PREREAD operation is necessary to invalidate the cacheline
> containing the descriptor whose status we just read so that if we come
> back and read it again after a DMA updates the descriptor, content from
> a stale cacheline does not satisfy our read, but actual descriptor
> content does.
> 
> One idea that I have for avoiding the unnecessary instruction on x86
> is to add a MI hint to the bus_dmamap_sync(9) API, BUS_DMASYNC_CLEAN.
> The hint tells the MD bus_dma(9) implementation that it may treat the
> DMA region like it has not been written (dirtied) by the CPU.  The code
> above would change to this code:
> 
> status =
> sc->sc_txdescs[txs->txs_lastdesc].wtx_fields.wtxu_status;
> if ((status & WTX_ST_DD) == 0) {
>     WM_CDTXSYNC(sc, txs->txs_lastdesc, 1,
> BUS_DMASYNC_PREREAD);

Oops, line should be:

> BUS_DMASYNC_PREREAD|BUS_DMASYNC_CLEAN);

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


avoiding bus_dmamap_sync() costs

2012-07-12 Thread David Young
is any valid combination of bus_dmamap_sync() operations.

If the register access indicated by the `ops' argument won't
perform the DMA synchronization (`dmaops') as a side effect,
then bus_dmamap_barrier() just has to do the equivalent of a
bus_dmamap_sync(..., dmaops).

So that's my current thinking about bus_dma(9).  Please let me know
your thoughts.

Dave

[1] Or bounce buffers are involved.

[2] Store buffer is Intel terminology.  Write buffer is AMD terminology
for the same thing.

[3] RUMP is unpopular at $DAYJOB for various reasons.  One reason
is that there is not a MKRUMP option for disabling it, so it is
necessary to wait for it to build and install even if it isn't
wanted.  Another reason is that sometimes changes made to the kernel
have to be replicated in RUMP, and having to double any effort
is both expensive and demoralizing.  Please don't read this as a
    criticism of RUMP overall, just a wish for some improvements in
modularity and code sharing.

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: software interrupts & scheduling oddities

2012-07-05 Thread David Young
On Thu, Jul 05, 2012 at 07:40:11PM -0400, Mouse wrote:
> > Maybe this has happened to you: you tune your NetBSD router for
> > fastest packet-forwarding speed.  Presented with a peak packet load,
> > [...] the user interface doesn't get any CPU cycles.  [...]  [I]f
> > there is any software interrupt pending, then it will run before any
> > user process gets a timeslice.  So if the softint rate is really
> > high, then userland will scarcely ever run.  Or that is my current
> > understanding.  Is it incorrect?
> 
> "No", I think.  At least, that's how I'd expect it to work, and I've
> occasionally seen behaviour close enough to that to make me think it's
> reasonably accurate.
> 
> I find your discovery about changing a user process's priority making a
> difference surprising.

Me, too.  Before I made that discovery, I had intended to defer packet
processing to a kernel thread that ran at a middling user priority.  The
kernel would shift packet processing from softints to the kernel thread
if it became apparent that the system wasn't switching to userland.
User programs that needed to stay interactive would run at a higher
priority than packet processing; programs that could afford to be
delayed (cron, syslogd, ...) would run at a lower priority than packet
processing.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


software interrupts & scheduling oddities

2012-07-05 Thread David Young
Maybe this has happened to you: you tune your NetBSD router for fastest
packet-forwarding speed.  Presented with a peak packet load, your
router does really well for 30 seconds.  Then it reboots because the
user-tickle watchdog timer expires.  OR, your router doesn't reboot but
you cannot change any parameters because the user interface doesn't get
any CPU cycles.  This is the problem that I am dealing with today: while
the system is doing a lot of network processing, the userland doesn't
make any progress at all.  Userland seems to be starved of CPU cycles
because there is a non-stop software-interrupt load caused by the high
level of network traffic.  At least on i386, if there is any software
interrupt pending, then it will run before any user process gets a
timeslice.  So if the softint rate is really high, then userland will
scarcely ever run.  Or that is my current understanding.  Is it incorrect?

Ordinarily, under high packet load, processes that I need to stay
interactive, such as my shell, freeze up after the network load reaches
a certain level.  If I change the scheduling (class, priority) for my
shell to (SCHED_RR, 31), then the shell stays responsive, even though it
still runs at a lower priority than softints.  Ok, so maybe that makes
sense: of all the userland processes, my shell is the only one running
with a real-time priority, so if there are any cycles leftover after the
softints run, my shell is likely to get them.

I thought that maybe, if I run every process at (SCHED_RR, 31) so that
my shell has to share the leftover cycles with every other user process,
then my shell will freeze up again under high packet load.  I set
every user process to (SCHED_RR, 31), though, and the shell remained
responsive.

I'm using the SCHED_M2 scheduler, btw, on a uniprocessor.  SCHED_M2 is
kind of an arbitrary choice.  I haven't tried SCHED_4BSD, yet, but I
will.

I don't really expect for changing any process class/priority to
SCHED_RR/31 to make any difference in the situation I describe, so there
must be something that I am missing about the workings of the scheduler.

One more thing: userland processes get a priority bump when they enter
the kernel.  No problem.  But it seems like a bug that the kernel will
also raise the priority of a low-priority *kernel* thread if it, say,
waits on a condition variable.  I think that happens because cv_wait()
calls cv_enter(..., l) that sets l->l_kpriority, which is only reset by
mi_userret().  Kernel threads never go through mi_userret() so at some
point the kernel will call lwp_eprio() to compute an effective priority:

static inline pri_t
lwp_eprio(lwp_t *l)
{
pri_t pri;

pri = l->l_priority;
if (l->l_kpriority && pri < PRI_KERNEL)
pri = (pri >> 1) + l->l_kpribase;
return MAX(l->l_inheritedprio, pri);
}

Since my low-priority kernel thread has lower priority than PRI_KERNEL,
and l_kpriority is set, it gets bumped up.  Perhaps lwp_eprio() should
test for kernel threads (LW_SYSTEM) before elevating the priority?

static inline pri_t
lwp_eprio(lwp_t *l)
{
pri_t pri;

pri = l->l_priority;
if (l->l_kpriority && pri < PRI_KERNEL && (l->l_flag & LW_SYSTEM) == 0)
    pri = (pri >> 1) + l->l_kpribase;
return MAX(l->l_inheritedprio, pri);
}

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: {send,recv}mmsg patch.

2012-06-08 Thread David Young
On Thu, Jun 07, 2012 at 10:34:52PM -0400, Christos Zoulas wrote:
> 
> Hi,
> 
> Linux has grown those two, and claim 20% performance improvement on some
> workloads. Some programs already use them, so we are going to need them
> for emulation anyway...
> 
> http://www.netbsd.org/~christos/mmsg.diff

Can you provide some documentation for these calls?

ISTM that {send,recv}mmsg(), {read,write}v(), and AIO could be subsumed
by a general-purpose scatter-gather system call, and that might be a
good direction to go.
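For reference, the Linux call being emulated batches datagram receives into one system call; the sketch below is my own illustration of its use (the helper name and the fixed 64-byte buffers are assumptions).  MSG_WAITFORONE makes recvmmsg() return as soon as at least one message has arrived:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

/* Receive up to n datagrams (n <= 8) in a single system call; returns
 * the number received, or -1.  One iovec per message for simplicity. */
static int
recv_batch(int fd, char bufs[][64], unsigned n)
{
	struct mmsghdr msgs[8];
	struct iovec iov[8];
	unsigned i;

	if (n > 8)
		n = 8;
	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < n; i++) {
		iov[i].iov_base = bufs[i];
		iov[i].iov_len = sizeof(bufs[i]);
		msgs[i].msg_hdr.msg_iov = &iov[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}
	return recvmmsg(fd, msgs, n, MSG_WAITFORONE, NULL);
}
```

The claimed 20% presumably comes from amortizing the syscall and socket-locking costs over the whole vector.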

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Fixing pool_cache_invalidate(9), final step

2012-06-04 Thread David Young
On Mon, Jun 04, 2012 at 10:34:06AM +0100, Jean-Yves Migeon wrote:
> [General]
> - Dumped pool_cache_invalidate_cpu() in favor of pool_cache_xcall()
> which transfers CPU-bound objects back to the global pool.
> - Adapt comments accordingly.

pool_cache_xcall() seems to describe how the function works rather than
what it does.  Is it a new part of the pool_cache(9) API?  If so, I
think that the name should say what the function does.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: lwp resource limit

2012-06-03 Thread David Young
On Wed, May 23, 2012 at 07:37:19PM -0400, Christos Zoulas wrote:
> Hello,
> 
> This is a new resource limit to prevent users from exhausting kernel
> resources that lwps use.
> 
> - The limit is per uid
> - The default is 1024 per user unless the architecture overrides it
> - The kernel is never prohibited from creating threads
> - Exceeding the thread limit does not prevent process creation, but
>   it will prevent processes from creating additional threads. So the
>   effective thread limit is nlwp + nproc
> - The name NTHR was chosen to follow prior art
> - There could be atomicity issues for setuid and lwp exits
> - This diff also adds a sysctl kern.uidinfo.* to show the user the uid
>   limits
> 
> comments?
> 
> christos
> Index: kern/init_main.c
> ===
> RCS file: /cvsroot/src/sys/kern/init_main.c,v
> retrieving revision 1.442
> diff -u -p -u -r1.442 init_main.c
> --- kern/init_main.c  19 Feb 2012 21:06:47 -  1.442
> +++ kern/init_main.c  23 May 2012 23:19:31 -
> @@ -256,6 +256,7 @@ int   cold = 1;   /* still working on star
>  struct timespec boottime;/* time at system startup - will only follow settime deltas */
>  
>  int  start_init_exec;/* semaphore for start_init() */
> +int  maxlwp;
>  
>  cprng_strong_t   *kern_cprng;
>  
> @@ -291,6 +292,12 @@ main(void)
>  #endif
>   l->l_pflag |= LP_RUNNING;
>  
> +#ifdef __HAVE_CPU_MAXLWP
> + maxlwp = cpu_maxlwp();
> +#else
> + maxlwp = 1024;
> +#endif
> +

Configuring the kernel with the preprocessor is just so ... wordy. :-)
Maybe use the linker instead?  E.g., provide a weak alias to a default
implementation of cpu_maxlwp() and a strong alias to the MD override on
architectures that have one.
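A minimal sketch of the linker approach, using the GCC/Clang attributes that NetBSD's __weak_alias() macro ultimately expands to.  An architecture overrides the default simply by defining its own strong cpu_maxlwp(); no #ifdef is needed:

```c
#include <assert.h>

/* MI default implementation; the visible cpu_maxlwp symbol is a weak
 * alias for it.  An MD file that defines a strong cpu_maxlwp() wins
 * at link time. */
static int
cpu_maxlwp_default(void)
{
	return 1024;
}

int cpu_maxlwp(void)
    __attribute__((__weak__, __alias__("cpu_maxlwp_default")));
```

main() would then just do `maxlwp = cpu_maxlwp();` unconditionally.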

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: Module name - recommendations/opinions sought

2012-04-25 Thread David Young
On Wed, Apr 25, 2012 at 05:52:31PM -0700, Paul Goyette wrote:
> I'm in the process of modularizing the ieee80211 (Wireless LAN)
> code, and would like some feedback on what the module's name should
> be.  I can think of at least three or four likely candidates:
> 
>   net80211
>   ieee80211

I'd vote for one of these, myself, since there's a correspondence with a
directory name in the first case and with a prefix on a lot of the API
names in the second case.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


loadable verbose message modules (was Re: Kernel panic codes)

2012-04-16 Thread David Young
On Sun, Apr 15, 2012 at 10:57:54PM +1000, Nat Sloss wrote:
> Hi.
> 
> I have been working on a program that uses bluetooth sco sockets and I am 
> having frequent kernel panics relating to usb.
> 
> I am receiving trap type 6 code 0 and trap type 6 code 2 errors.

I've been thinking that it would be nice if there were more kernel
modules that replaced or supplemented anonymous numbers with their name
or description.  Thus

trap type 6 code 0

and
trap type 6 code 2

would become something like

trap type 6(T_PAGEFLT) code 0

and

trap type 6(T_PAGEFLT) code 2

if and only if the module was loaded.  The existing printf() in
trap_print()

printf("trap type %d code %x ...\n", type, frame->tf_err, ...);

would change (just for example) to

printf("trap type %d%s code %x%s ...\n", type, trap_type_string(type),
frame->tf_err, trap_code_string(type, frame->tf_err), ...);

By default, the number -> string conversion functions,

const char *trap_type_string(int type);
const char *trap_code_string(int type, int code);

would be weak aliases for a single function that returns the empty
string.  The kernel module would override the defaults by providing
strong aliases to actual implementations.
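A sketch of the decorated output: here trap_type_string() plays the role of a loaded module's implementation (the unloaded default would return "" and leave the message unchanged).  Only T_PAGEFLT for type 6 comes from the mail above; the rest is illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a loaded module's conversion function; the default
 * (module not loaded) would return the empty string. */
static const char *
trap_type_string(int type)
{
	return type == 6 ? "(T_PAGEFLT)" : "";
}

/* Simplified trap_print(): the %d%s pairing appends the name only
 * when the conversion function supplies one. */
static void
trap_print(char *buf, size_t len, int type, int code)
{
	snprintf(buf, len, "trap type %d%s code %x",
	    type, trap_type_string(type), code);
}
```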

For that weak/strong alias thing to work on a loadable module, I
think that Someone(TM) will need to make the kernel linker DTRT when
a modules with overriding strong aliases is added.  If the module is
not unloadable, Someone(TM)'s work is through.  There are some gotchas
making the kernel *un*loadable.  BTW, I also desire this function in the
kernel linker for Other Reasons.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: introduce device_is_attached()

2012-04-16 Thread David Young
On Mon, Apr 16, 2012 at 06:52:28PM +0200, Christoph Egger wrote:
> 
> Hi,
> 
> I want to introduce a new function to sys/devices.h:
> 
> bool device_is_attached(device_t parent, cfdata_t cf);
> 
> The purpose is for bus drivers who wants to attach children
> and ensure that only one instance of it will attach.
> 
> 'parent' is the bus driver and 'cf' is the child device
> as passed to the submatch callback via config_search_loc().
> 
> The return value is true if the child is already attached.
> 
> I implemented a reference usage of it in amdnb_misc.c to ensure
> that amdtemp only attaches once on rescan.

Don't add that function.  Just use a small amdnb_misc softc to track
whether or not amdtemp is attached:

struct amdnb_misc_softc {
device_t sc_amdtemp;
};

Don't pass a pci_attach_args to amdtemp.  Just pass it the chipset tag
and PCI tag if that is all that it needs.

I'm not sure I fully understand the purpose of amdnb_miscbus.
Are all of the functions that do/will attach at amdnb_miscbus
configuration-space only functions, or are they something else?  Please
explain what amdnb_miscbus is for.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981


Re: rbus support one-cardbus only?

2012-04-09 Thread David Young
On Tue, Apr 10, 2012 at 12:01:20AM +0900, KIYOHARA Takashi wrote:
> Hi! all,
> 
> 
> I have a question.
> 
> OPENBLOCKS266 supports two cardbus slots optionally.
> But I think MI rbus does not support multiple cardbus slots.
> [Y/N]?

It supports >1 slot on i386.

I'm trying to get rid of rbus, btw.  I'm already able to run i386
without it.  If you make bus_space_tag_create(9), bus_space_reserve(9),
bus_space_release(9), bus_space_reservation_map(9), and
bus_space_reservation_unmap(9) work on OPENBLOCKS266, then I think that
you can use my patches to avoid rbus.

Dave

-- 
David Young
dyo...@pobox.com    Urbana, IL    (217) 721-9981

