Re: netbsd-5.1_RC3 crash at Dell M710

2010-08-15 Thread 6bone

On Sat, 14 Aug 2010, Jean-Yves Migeon wrote:


Date: Sat, 14 Aug 2010 00:29:09 +0200
From: Jean-Yves Migeon jeanyves.mig...@free.fr
To: 6b...@6bone.informatik.uni-leipzig.de
Cc: tech-kern@netbsd.org
Subject: Re: netbsd-5.1_RC3 crash at Dell M710

On 14.08.2010 00:05, 6b...@6bone.informatik.uni-leipzig.de wrote:

On Fri, 13 Aug 2010, Jean-Yves Migeon wrote:


Date: Fri, 13 Aug 2010 17:03:14 +0200
From: Jean-Yves Migeon jeanyves.mig...@free.fr
To: 6b...@6bone.informatik.uni-leipzig.de
Cc: tech-kern@netbsd.org
Subject: Re: netbsd-5.1_RC3 crash at Dell M710

On 13.08.2010 08:52, 6b...@6bone.informatik.uni-leipzig.de wrote:

hello,

NetBSD crashes on a Dell M710. You can have a look at the screenshot at
http://6bone.informatik.uni-leipzig.de/Dell-M710.bmp

Any ideas what could be the problem?


Most probably, an attempt to read an MSR that is not allowed/present
for that CPU.

At ddb prompt, type bt and show reg, so we can see where and how it
happens.



http://6bone.informatik.uni-leipzig.de/Dell-M710-bt.bmp
http://6bone.informatik.uni-leipzig.de/Dell-M710-show-reg-1.bmp
http://6bone.informatik.uni-leipzig.de/Dell-M710-show-reg-2.bmp


MSR 0xcd, which is MSR_FSB_FREQ.

One hacky fix is needed. You seem to be in the same situation as I was;
a quick glance at your CPU ID makes me think it reports model 0xc too:

http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/sys/arch/x86/x86/intel_busclock.c?rev=1.11&content-type=text/x-cvsweb-markup

I asked for a pull-up about a week ago, so should come in eventually.
Try patching around as I did.
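
For readers unsure what "reports model 0xc" refers to, here is a minimal
sketch, assuming the standard CPUID leaf 1 signature layout, of how the
base and extended model fields combine. The function names are
illustrative and the signature value in the usage note is a hypothetical
example, not a value read from the screenshots above.

```c
#include <stdint.h>

/* Fields of the CPUID leaf 1 EAX signature. */
struct cpu_sig {
	unsigned stepping;	/* bits 3:0   */
	unsigned base_model;	/* bits 7:4   */
	unsigned base_family;	/* bits 11:8  */
	unsigned ext_model;	/* bits 19:16 */
	unsigned ext_family;	/* bits 27:20 */
};

static struct cpu_sig
cpu_sig_decode(uint32_t eax)
{
	struct cpu_sig s;

	s.stepping    = eax & 0xf;
	s.base_model  = (eax >> 4) & 0xf;
	s.base_family = (eax >> 8) & 0xf;
	s.ext_model   = (eax >> 16) & 0xf;
	s.ext_family  = (eax >> 20) & 0xff;
	return s;
}

/* The extended family is added only when the base family is 0xf. */
static unsigned
cpu_display_family(struct cpu_sig s)
{
	if (s.base_family == 0xf)
		return s.base_family + s.ext_family;
	return s.base_family;
}

/* The extended model applies to base families 6 and 0xf. */
static unsigned
cpu_display_model(struct cpu_sig s)
{
	if (s.base_family == 0x6 || s.base_family == 0xf)
		return (s.ext_model << 4) | s.base_model;
	return s.base_model;
}
```

With a hypothetical signature of 0x000206c2, the base model field is 0xc
while the combined model is 0x2c. Code that looks only at the 4-bit base
model cannot tell apart two CPUs that differ only in the extended model
field.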



The patch solves the problem. Unfortunately, the kernel now has a problem
with the Broadcom NIC.


http://6bone.informatik.uni-leipzig.de/Dell-M710-bnx-no-PHY-found.bmp

Is there also a patch for this problem?


Thank you for your efforts


Regards
Uwe


Re: Module and device configuration locking [was Re: Modules loading modules?]

2010-08-15 Thread Paul Goyette

Updating the status on these changes:

One comment questioned whether or not a version bump was required, and 
I've more-or-less convinced myself it is at least desired.  While 
properly-working modules from the pre-update will continue to work on a 
post-update kernel, the reverse is not necessarily true.  A module 
written for a post-update kernel and which takes advantage of the 
changed locking protocol will fail on a pre-update kernel.


Another comment suggested that the name of newly-created file 
kern/kern_cfg.c should be changed to more closely match its contents, 
while an earlier comment had suggested generalizing the filename!  I've 
taken the more-specific path and the file is now called kern_cfglock.c.
If other stuff gets added to this file later, we can come up with a more 
generic name at that time.


Some additional changes have been included in this latest set of diffs. 
Mostly, there were some KASSERT()s sprinkled throughout that tried to 
ensure that the code had not been called recursively.  Since recursion 
is now explicitly supported, all of the module_active stuff has been 
reworked.
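
What "recursion is now explicitly supported" can mean for a lock is
sketched below in user-space C with pthreads. This is not the code from
the attached diffs: the cfg_lock_* names and the structure layout are
hypothetical, and the sketch only shows the general owner-plus-depth
technique for letting the same thread re-enter.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical recursive lock: one owner plus a nesting depth. */
struct cfg_lock {
	pthread_mutex_t	mtx;	/* protects the fields below */
	pthread_cond_t	cv;	/* other threads wait here */
	pthread_t	owner;	/* valid only while owned is true */
	bool		owned;
	unsigned	depth;
};

static void
cfg_lock_init(struct cfg_lock *l)
{
	pthread_mutex_init(&l->mtx, NULL);
	pthread_cond_init(&l->cv, NULL);
	l->owned = false;
	l->depth = 0;
}

static void
cfg_lock_enter(struct cfg_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	if (l->owned && pthread_equal(l->owner, pthread_self())) {
		l->depth++;		/* recursive entry by the owner */
	} else {
		while (l->owned)
			pthread_cond_wait(&l->cv, &l->mtx);
		l->owner = pthread_self();
		l->owned = true;
		l->depth = 1;
	}
	pthread_mutex_unlock(&l->mtx);
}

static void
cfg_lock_exit(struct cfg_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	if (--l->depth == 0) {		/* outermost exit releases the lock */
		l->owned = false;
		pthread_cond_signal(&l->cv);
	}
	pthread_mutex_unlock(&l->mtx);
}

static bool
cfg_lock_is_held(struct cfg_lock *l)
{
	bool held;

	pthread_mutex_lock(&l->mtx);
	held = l->owned && pthread_equal(l->owner, pthread_self());
	pthread_mutex_unlock(&l->mtx);
	return held;
}
```

With a lock of this shape, an assertion that the code is not being
re-entered becomes an assertion that the lock is held, which is one
plausible reason the old recursion checks had to be reworked.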


Additionally, there was a statically-allocated list of pending modules 
that needed to have their initialization completed.  This has been 
changed to keep multiple stack-allocated lists, one for each depth.
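
The idea of one stack-allocated pending list per recursion depth can be
sketched as follows. This is a toy model under stated assumptions, not
the code in the diffs: struct mod, cur_pending, and load() are
hypothetical names, and the "nested" argument merely stands in for a
module whose initialization loads another module.

```c
#include <stddef.h>

struct mod {
	const char *name;
	struct mod *next;	/* link on a pending list */
	int	    ready;	/* set once initialization completes */
};

struct pending {
	struct mod *head;
};

/* Points at the list owned by the innermost active load. */
static struct pending *cur_pending;

static void
pending_push(struct mod *m)
{
	m->next = cur_pending->head;
	cur_pending->head = m;
}

/* Complete initialization of everything on one list; return the count. */
static int
pending_finish(struct pending *p)
{
	int n = 0;

	while (p->head != NULL) {
		struct mod *m = p->head;

		p->head = m->next;
		m->ready = 1;
		n++;
	}
	return n;
}

static int
load(struct mod *m, struct mod *nested)
{
	struct pending mine = { NULL };		/* lives on this call's stack */
	struct pending *saved = cur_pending;
	int finished;

	cur_pending = &mine;
	pending_push(m);
	if (nested != NULL)
		load(nested, NULL);		/* the recursion gets its own list */
	finished = pending_finish(&mine);
	cur_pending = saved;			/* back to the outer depth's list */
	return finished;
}
```

Each depth completes only the modules it queued itself, so an inner load
cannot disturb the list that an outer load is still filling.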



I'm planning to commit these changes next weekend, unless there are any 
significant objections.  If anyone else needs to coordinate changes in 
order to ride-the-version-bump, let me know.




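--------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
--------------------------------------------------------------------------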

ML.diff.tgz
Description: Binary data


deprecating #define'd sysctl OIDs

2010-08-15 Thread Alan Barrett
On Sun, 15 Aug 2010, Jean-Yves Migeon wrote:
> > It might make sense to add comments near all existing lists of
> > hard-wired sysctl OID values asking people not to add more of them.
>
> Shall it be added for all other archs then? I assume that they can all
> benefit from the dynamic sysctl(9) interface?

If we do this at all, then we should do it for all lists of sysctl
OID values.  Several of them are in sys/sysctl.h, and I am sure
there are more scattered around.  I don't see the point of doing
it only for CPU_* definitions.

All three of the sysctl(3), sysctl(7), and sysctl(9) man pages
could also be improved, to make it more clear that new code can
(should?) use dynamic allocation instead of #define'd OID values.
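
The difference between a #define'd OID and a dynamically allocated one
can be illustrated with a toy registry. This is deliberately not the
sysctl(9) API: every name below (TOY_FIXED_OID, oid_register_fixed(),
oid_register_dynamic(), oid_lookup()) is hypothetical, and the sketch
only models "the caller hard-wires the number" versus "the registry
hands out the next free number and callers look it up by name".

```c
#include <stddef.h>
#include <string.h>

#define OID_FIRST_DYNAMIC 8	/* hypothetical: lower numbers are hard-wired */
#define OID_MAX		  32

/* Old style: a compile-time constant that ends up baked into binaries. */
#define TOY_FIXED_OID	  3

static const char *oid_name[OID_MAX];
static int oid_next = OID_FIRST_DYNAMIC;

/* Register a legacy, hard-wired number. Returns it, or -1 on error. */
static int
oid_register_fixed(int num, const char *name)
{
	if (num < 0 || num >= OID_FIRST_DYNAMIC || oid_name[num] != NULL)
		return -1;
	oid_name[num] = name;
	return num;
}

/* Register a new node; the registry picks the next free number. */
static int
oid_register_dynamic(const char *name)
{
	if (oid_next >= OID_MAX)
		return -1;
	oid_name[oid_next] = name;
	return oid_next++;
}

/* Resolve a name to its number at run time. Returns -1 if unknown. */
static int
oid_lookup(const char *name)
{
	for (int i = 0; i < OID_MAX; i++)
		if (oid_name[i] != NULL && strcmp(oid_name[i], name) == 0)
			return i;
	return -1;
}
```

In this model, existing fixed numbers keep working unchanged, while new
nodes never need a new #define because consumers resolve them by name.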

--apb (Alan Barrett)


Re: Adding 'i386_use_pae' variable, and expose it through sysctl

2010-08-15 Thread Thor Lancelot Simon
On Sun, Aug 15, 2010 at 02:30:46PM +0200, Jean-Yves Migeon wrote:
>
> Shall it be added for all other archs then? I assume that they can all
> benefit from the dynamic sysctl(9) interface?

You can't do it for existing OIDs, that breaks binary compatibility.

Thor


Re: Adding 'i386_use_pae' variable, and expose it through sysctl

2010-08-15 Thread Jean-Yves Migeon
On 15.08.2010 17:23, Thor Lancelot Simon wrote:
> On Sun, Aug 15, 2010 at 02:30:46PM +0200, Jean-Yves Migeon wrote:
>>
>> Shall it be added for all other archs then? I assume that they can all
>> benefit from the dynamic sysctl(9) interface?
>
> You can't do it for existing OIDs, that breaks binary compatibility.

Sure; I was just talking about the comment part, where we add a note
that new sysctl(7) nodes shall get added using the dynamic sysctl interface.

-- 
Jean-Yves Migeon
jeanyves.mig...@free.fr


Re: Adding 'i386_use_pae' variable, and expose it through sysctl

2010-08-15 Thread Alan Barrett
On Sun, 15 Aug 2010, Thor Lancelot Simon wrote:
> You can't do it for existing OIDs, that breaks binary compatibility.

Yes, obviously.  My suggestion was about adding comments and
documentation to discourage new OIDs from being added in the old way.

--apb (Alan Barrett)


kicking everybody out of the softc

2010-08-15 Thread David Young
Currently, device detachment is racy: a kernel thread races to
read/write a softc, after looking it up, before a second thread detaches
the corresponding device_t and reclaims the softc's storage.  I've been
working in spare moments on lockless code to prevent storage for a softc
from going away while a driver uses it.  I submit the idea and the code
here for review.

To stop the races against detachment is tricky because many drivers
are in and out of their softc over and over and over again, today,
without any synchronization with detachment whatsoever.  To use
atomic operations or locking to synchronize use of the softc with its
reclamation can add costly locked memory transactions to many fast paths
where there were no such transactions before, for a performance loss.

I took an approach that avoids locked memory transactions in the
common case.  I keep count of threads entering each softc on each
CPU using per-CPU counters in the corresponding device_t: an LWP
calls device_acquire(dev) as it enters a softc.  I also keep count
of threads leaving each softc using per-CPU counters: an LWP calls
device_release(dev) as it leaves a softc.  I call the counters
turnstiles.  Turnstile-counts only increase.  Because turnstiles
are per-CPU, incrementing one only has to be atomic with respect to
other threads on that same CPU, so no locked memory transactions are
necessary.
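
The turnstile bookkeeping described above can be modelled in a few lines.
This is a sketch under stated assumptions, not the attached code: NCPU,
struct turnstiles, and the function names are hypothetical, the CPU
index is passed in explicitly, and plain increments stand in for
whatever per-CPU primitive a kernel would use.

```c
#include <stdbool.h>
#include <stdint.h>

#define NCPU 4		/* hypothetical fixed CPU count for the sketch */

struct turnstiles {
	uint64_t entered[NCPU];	/* bumped on the CPU where a thread enters */
	uint64_t exited[NCPU];	/* bumped on the CPU where a thread leaves */
};

/* Counts only ever increase; each is touched from its own CPU only. */
static void
turnstile_enter(struct turnstiles *t, unsigned cpu)
{
	t->entered[cpu]++;
}

static void
turnstile_exit(struct turnstiles *t, unsigned cpu)
{
	t->exited[cpu]++;
}

/* Nobody is inside when total entries equal total exits. */
static bool
turnstiles_balanced(const struct turnstiles *t)
{
	uint64_t in = 0, out = 0;

	for (unsigned i = 0; i < NCPU; i++) {
		in += t->entered[i];
		out += t->exited[i];
	}
	return in == out;
}
```

Only the totals are compared here. A thread that enters on one CPU and
leaves on another leaves the individual per-CPU pairs unequal, yet the
sums still balance.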

After an LWP enters a softc through its turnstile, it passes through a
gate implemented by a pointer from the device_t to an object of type
device_gate_t that contains a kernel mutex, among other things.  This
happens in device_acquire().  Normally, the gate is open: the device_t
points to the default device_gate_t, gate_open.  It is not necessary for
device_acquire() to acquire gate_open's mutex, but device_acquire() must
acquire every other gate's mutex.

In the rare event that our thread wants to reclaim the softc (say that
it is completing config_detach(9)), first it closes the gate on the
softc's corresponding device_t.  To close the gate, our thread creates
a closed device_gate_t, acquires its mutex, and points the device_t at
it.  With the gate closed, our thread unlinks the device_t and softc.
Finally, it seals the gate by pointing the device_t at a special
device_gate_t, gate_sealed, and it wakes all of the threads that wait
to acquire the closed gate's mutex.  Since the device_t and softc are
unreachable, no new thread can enter them.  All of the threads that wake
holding the closed gate see that the device_t now points to gate_sealed
and leave the softc through a turnstile---device_acquire() returns in
those threads with ENXIO.  Our thread safely reclaims the softc when the
number of threads who have entered through its turnstiles balances the
number who have left.
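
The open, closed, and sealed gate states can be sketched in user-space C
as below. This is a simplified model, not the attached code: the dev_*
and gate_close()/gate_seal() names are hypothetical, C11 atomics stand
in for the per-CPU turnstiles (so it does use the locked operations the
real design is meant to avoid), and a thread that finds the gate sealed
returns without taking the sealed gate's mutex.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct gate {
	pthread_mutex_t mtx;
};

static struct gate gate_open   = { PTHREAD_MUTEX_INITIALIZER };
static struct gate gate_sealed = { PTHREAD_MUTEX_INITIALIZER };

struct dev {
	_Atomic(struct gate *) gate;	/* gate_open, a closed gate, or gate_sealed */
	atomic_ulong entered;		/* stand-in for the entry turnstiles */
	atomic_ulong exited;		/* stand-in for the exit turnstiles */
};

static void
dev_init(struct dev *d)
{
	atomic_init(&d->gate, &gate_open);
	atomic_init(&d->entered, 0);
	atomic_init(&d->exited, 0);
}

static int
dev_acquire(struct dev *d)
{
	struct gate *g;

	atomic_fetch_add(&d->entered, 1);	/* through the entry turnstile */
	while ((g = atomic_load(&d->gate)) != &gate_open) {
		if (g == &gate_sealed) {
			atomic_fetch_add(&d->exited, 1);	/* back out */
			return ENXIO;
		}
		/* Closed: block until the detaching thread lets go, then re-check. */
		pthread_mutex_lock(&g->mtx);
		pthread_mutex_unlock(&g->mtx);
	}
	return 0;
}

static void
dev_release(struct dev *d)
{
	atomic_fetch_add(&d->exited, 1);
}

/* Detach side: install a closed gate whose mutex the caller holds. */
static void
gate_close(struct dev *d, struct gate *closed)
{
	pthread_mutex_init(&closed->mtx, NULL);
	pthread_mutex_lock(&closed->mtx);
	atomic_store(&d->gate, closed);
}

/* Detach side: seal the device, then wake everyone blocked on the closed gate. */
static void
gate_seal(struct dev *d, struct gate *closed)
{
	atomic_store(&d->gate, &gate_sealed);
	pthread_mutex_unlock(&closed->mtx);
}

/* Safe to reclaim once sealed and every entry is matched by an exit. */
static bool
dev_can_reclaim(struct dev *d)
{
	unsigned long out;

	if (atomic_load(&d->gate) != &gate_sealed)
		return false;
	out = atomic_load(&d->exited);	/* read exits first: entries only grow */
	return out == atomic_load(&d->entered);
}
```

The closed gate must stay allocated until dev_can_reclaim() returns
true, because late waiters may still be locking and unlocking its mutex.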

Anyway, that's the gist of the idea.  I've attached the untested (and
uncompiled) code for the details.  Comments?

Dave

[1] I'm using the term thread loosely to mean any thread of execution,
be it an LWP, software or hardware interrupt.

-- 
David Young OJC Technologies
dyo...@ojctech.com  Urbana, IL * (217) 278-3933
/* I. DEVICE ATTACHMENT
 *
 * Steps for config_attach*() to follow:
 *
 * Call device_attach_gate(dv) and check for error before making dv
 * visible by linking it to alldevs or cfdriver_t.
 *
 * II. DEVICE DETACHMENT
 *
 * Steps for config_detach(9) to detach a device_t, dv:
 *
 * 0) Call the driver detach routine, dv->dv_cfattach->ca_detach.  It is
 * important for the detach routine to disestablish interrupt handlers
 * and to make sure interrupt handling on all CPUs is complete (by
 * making a low-priority cross-call to all CPUs, for example).
 *
 * 1) Install gate with device_gate_close(dv, config_detach, dgp).
 *
 * 2) Unlink the device_t from alldevs and from cfdriver_t.
 *
 * 3) Call device_gate_seal(dv, dgp).
 *
 * III. REFERENCING/READING/WRITING THE SOFTWARE STATE (device_t & softc)
 *
 * Steps for kernel threads, hard- and soft-interrupt handlers to
 * protect a device_t, dv, and its softc against reclamation:
 *
 * In sleepable LWP contexts, rc = device_acquire(dv, interruptible),
 * where interruptible is true or false, and check rc for errors.
 * Every successful device_acquire() call must be matched with a
 * device_release().  Between device_acquire() and device_release(),
 * neither dv nor its softc can be reclaimed.
 *
 * In hardware/software interrupts, device_enter(dv).  Every device_enter(dv)
 * must be matched with a device_exit(dv).  Between device_enter(dv) and
 * device_exit(dv), neither dv nor its softc can be reclaimed.
 *
 * IV. LOOKING UP THE SOFTWARE STATE (device_t & softc)
 *
 * Routines that look up a device_t or softc by unit name or number,
 * such as device_lookup() and device_lookup_private(), should return
 * a device_t or softc that is protected against reclamation.  Callers
 * of lookup routines should match each successful call with a
 * device_release() or device_exit(), according to guidance given in

Re: kicking everybody out of the softc

2010-08-15 Thread David Young
On Sun, Aug 15, 2010 at 02:19:56PM -0500, David Young wrote:
> Anyway, that's the gist of the idea.

BTW, many thanks to ad@ and rmind@ and others for valuable discussions
that helped me get this far.  Of course, any broken ideas or coding
mistakes are mine.

Dave

-- 
David Young OJC Technologies
dyo...@ojctech.com  Urbana, IL * (217) 278-3933


re: kicking everybody out of the softc

2010-08-15 Thread matthew green


thanks for looking into this problem.  we need a solution.

would device_lookup() and device_lookup_private() take a reference
on this count automatically?  or maybe some new API that does it,
to avoid the need to audit every driver at once.


Re: kicking everybody out of the softc

2010-08-15 Thread David Young
On Mon, Aug 16, 2010 at 06:43:20AM +1000, matthew green wrote:
>
>
> thanks for looking into this problem.  we need a solution.

No problem.

> would device_lookup() and device_lookup_private() take a reference
> on this count automatically?  or maybe some new API that does it,
> to avoid the need to audit every driver at once.

Thanks for asking the question.  device_lookup() needs to
device_acquire(dv) before dropping its lock on the cfdriver_t array[1]
and returning dv.

I had planned to audit all of the drivers, but that could get out of
hand.  I like your idea of using a new API.  I'll give it some thought.

Dave

[1] Actually, it takes the alldevs lock, today.  Finer locking
granularity, such as a cfdriver_t lock, is desirable!

-- 
David Young OJC Technologies
dyo...@ojctech.com  Urbana, IL * (217) 278-3933


re: kicking everybody out of the softc

2010-08-15 Thread John Nemeth
On Jan 6,  1:19am, matthew green wrote:
} 
} would device_lookup() and device_lookup_private() take a reference
} on this count automatically?  or maybe some new API that does it,
} to avoid the need to audit every driver at once.

 What would release the reference in that case?  Or, would the
count just keep incrementing thus preventing the driver from detaching
until it is audited?

}-- End of excerpt from matthew green


re: kicking everybody out of the softc

2010-08-15 Thread matthew green

> On Jan 6,  1:19am, matthew green wrote:
> }
> } would device_lookup() and device_lookup_private() take a reference
> } on this count automatically?  or maybe some new API that does it,
> } to avoid the need to audit every driver at once.
>
>      What would release the reference in that case?  Or, would the
> count just keep incrementing thus preventing the driver from detaching
> until it is audited?

right.  that's why the new API to switch to on a per-device basis would
help avoid an all-drivers-at-once audit.
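
A rough sketch of the "new API" idea, assuming a lookup routine that
takes the reference itself so that only converted drivers pay the cost
and owe a matching release. The names dev_lookup_acquire() and
dev_release() are hypothetical, and a plain counter under a table lock
is used here in place of the turnstile scheme from earlier in the thread.

```c
#include <pthread.h>
#include <stddef.h>

#define NUNITS 4	/* hypothetical table size */

struct dev {
	unsigned refs;	/* references handed out by dev_lookup_acquire() */
	void	*softc;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct dev *table[NUNITS];

/* Old-style lookup: returns a bare pointer that nothing pins. */
static struct dev *
dev_lookup(int unit)
{
	struct dev *d = NULL;

	pthread_mutex_lock(&table_lock);
	if (unit >= 0 && unit < NUNITS)
		d = table[unit];
	pthread_mutex_unlock(&table_lock);
	return d;
}

/* New-style lookup: the reference is taken before the lock is dropped. */
static struct dev *
dev_lookup_acquire(int unit)
{
	struct dev *d = NULL;

	pthread_mutex_lock(&table_lock);
	if (unit >= 0 && unit < NUNITS && (d = table[unit]) != NULL)
		d->refs++;
	pthread_mutex_unlock(&table_lock);
	return d;
}

/* Every successful dev_lookup_acquire() must be paired with this. */
static void
dev_release(struct dev *d)
{
	pthread_mutex_lock(&table_lock);
	d->refs--;
	pthread_mutex_unlock(&table_lock);
}
```

Unconverted drivers keep calling the old lookup and never touch the
count, so they neither gain protection nor block detachment. Drivers can
then be switched to the acquiring lookup one at a time.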


.mrg.