Re: netbsd-5.1_RC3 crash at Dell M710
On Sat, 14 Aug 2010, Jean-Yves Migeon wrote:

> Date: Sat, 14 Aug 2010 00:29:09 +0200
> From: Jean-Yves Migeon jeanyves.mig...@free.fr
> To: 6b...@6bone.informatik.uni-leipzig.de
> Cc: tech-kern@netbsd.org
> Subject: Re: netbsd-5.1_RC3 crash at Dell M710
>
> On 14.08.2010 00:05, 6b...@6bone.informatik.uni-leipzig.de wrote:
>> On Fri, 13 Aug 2010, Jean-Yves Migeon wrote:
>>
>>> Date: Fri, 13 Aug 2010 17:03:14 +0200
>>> From: Jean-Yves Migeon jeanyves.mig...@free.fr
>>> To: 6b...@6bone.informatik.uni-leipzig.de
>>> Cc: tech-kern@netbsd.org
>>> Subject: Re: netbsd-5.1_RC3 crash at Dell M710
>>>
>>> On 13.08.2010 08:52, 6b...@6bone.informatik.uni-leipzig.de wrote:
>>>> Hello,
>>>>
>>>> NetBSD crashes on a Dell M710. You can have a look at the
>>>> screenshot at
>>>> http://6bone.informatik.uni-leipzig.de/Dell-M710.bmp
>>>>
>>>> Any ideas what could be the problem?
>>>
>>> Most probably, an attempt to read an MSR which is not
>>> allowed/present for that CPU. At the ddb prompt, type "bt" and
>>> "show reg", so we can see where and how it happens.
>>
>> http://6bone.informatik.uni-leipzig.de/Dell-M710-bt.bmp
>> http://6bone.informatik.uni-leipzig.de/Dell-M710-show-reg-1.bmp
>> http://6bone.informatik.uni-leipzig.de/Dell-M710-show-reg-2.bmp
>
> MSR 0xcd, which is MSR_FSB_FREQ. One hacky fix is needed. You seem to
> be in the same situation as mine there; a quick glance at your CPU ID
> makes me think it reports model 0xc too:
>
> http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/sys/arch/x86/x86/intel_busclock.c?rev=1.11&content-type=text/x-cvsweb-markup
>
> I asked for a pull-up about a week ago, so it should come in
> eventually. Try patching around it as I did.

The patch solves the problem. Unfortunately, the kernel now has a problem with the Broadcom NIC:

http://6bone.informatik.uni-leipzig.de/Dell-M710-bnx-no-PHY-found.bmp

Does a patch also exist for this problem?

Thank you for your efforts.

Regards
Uwe
Re: Module and device configuration locking [was Re: Modules loading modules?]
Updating the status on these changes:

One comment questioned whether or not a version bump was required, and I've more-or-less convinced myself it is at least desired. While properly working modules built for a pre-update kernel will continue to work on a post-update kernel, the reverse is not necessarily true. A module that is written for a post-update kernel and takes advantage of the changed locking protocol will fail on a pre-update kernel.

Another comment suggested that the name of the newly-created file kern/kern_cfg.c should be changed to more closely match its contents, while an earlier comment had suggested generalizing the filename! I've taken the more-specific path and the file is now called kern_cfglock.c. If other stuff gets added to this file later, we can come up with a more generic name at that time.

Some additional changes have been included in this latest set of diffs. Mostly, there were some KASSERT()s sprinkled throughout that tried to ensure that the code had not been called recursively. Since recursion is now explicitly supported, all of the module_active stuff has been reworked. Additionally, there was a statically-allocated list of pending modules that needed to have their initialization completed. This has been changed to keep multiple stack-allocated lists, one for each depth of recursion.

I'm planning to commit these changes next weekend, unless there are any significant objections. If anyone else needs to coordinate changes in order to ride the version bump, let me know.

-------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-------------------------------------------------------------------------

[Attachment: ML.diff.tgz (binary data)]
deprecating #define'd sysctl OIDs
On Sun, 15 Aug 2010, Jean-Yves Migeon wrote:
>> It might make sense to add comments near all existing lists of
>> hard-wired sysctl OID values asking people not to add more of them.
>
> Shall it be added for all other archs then? I assume that they can
> all benefit from the dynamic sysctl(9) interface?

If we do this at all, then we should do it for all lists of sysctl OID values. Several of them are in sys/sysctl.h, and I am sure there are more scattered around. I don't see the point of doing it only for CPU_* definitions.

All three of the sysctl(3), sysctl(7), and sysctl(9) man pages could also be improved, to make it clearer that new code can (should?) use dynamic allocation instead of #define'd OID values.

--apb (Alan Barrett)
Re: Adding 'i386_use_pae' variable, and expose it through sysctl
On Sun, Aug 15, 2010 at 02:30:46PM +0200, Jean-Yves Migeon wrote:
> Shall it be added for all other archs then? I assume that they can
> all benefit from the dynamic sysctl(9) interface?

You can't do it for existing OIDs, that breaks binary compatibility.

Thor
Re: Adding 'i386_use_pae' variable, and expose it through sysctl
On 15.08.2010 17:23, Thor Lancelot Simon wrote:
> On Sun, Aug 15, 2010 at 02:30:46PM +0200, Jean-Yves Migeon wrote:
>> Shall it be added for all other archs then? I assume that they can
>> all benefit from the dynamic sysctl(9) interface?
>
> You can't do it for existing OIDs, that breaks binary compatibility.

Sure; I was just talking about the comment part, where we add a note that new sysctl(7) variables should be added using the dynamic sysctl interface.

--
Jean-Yves Migeon
jeanyves.mig...@free.fr
Re: Adding 'i386_use_pae' variable, and expose it through sysctl
On Sun, 15 Aug 2010, Thor Lancelot Simon wrote:
> You can't do it for existing OIDs, that breaks binary compatibility.

Yes, obviously. My suggestion was about adding comments and documentation to discourage new OIDs from being added in the old way.

--apb (Alan Barrett)
kicking everybody out of the softc
Currently, device detachment is racy: a kernel thread races to read/write a softc, after looking it up, before a second thread detaches the corresponding device_t and reclaims the softc's storage. I've been working in spare moments on lockless code to prevent storage for a softc from going away while a driver uses it. I submit the idea and the code here for review.

Stopping the races against detachment is tricky because many drivers are in and out of their softc over and over again, today, without any synchronization with detachment whatsoever. Using atomic operations or locking to synchronize use of the softc with its reclamation can add costly locked memory transactions to many fast paths where there were no such transactions before, at a cost in performance.

I took an approach that avoids locked memory transactions in the common case. I keep count of threads entering each softc on each CPU using per-CPU counters in the corresponding device_t: an LWP calls device_acquire(dev) as it enters a softc. I also keep count of threads leaving each softc using per-CPU counters: an LWP calls device_release(dev) as it leaves a softc. I call the counters turnstiles. Turnstile counts only increase. Because turnstiles are per-CPU, incrementing one only has to be atomic with respect to other threads on that same CPU, so no locked memory transactions are necessary.

After an LWP enters a softc through its turnstile, it passes through a gate implemented by a pointer from the device_t to an object of type device_gate_t that contains a kernel mutex, among other things. This happens in device_acquire(). Normally, the gate is open: the device_t points to the default device_gate_t, gate_open. It is not necessary for device_acquire() to acquire gate_open's mutex, but device_acquire() must acquire every other gate's mutex.
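[Editor's sketch] The turnstile counting described above can be modelled in a few lines of userland C. This is only an illustration under stated assumptions, not the attached kernel code: struct turnstiles, turnstile_enter(), turnstile_leave(), turnstiles_balanced() and NCPU are hypothetical names, the model is single-threaded, and it ignores the preemption control and memory ordering a real kernel would need when one CPU reads another CPU's counters. It assumes that only the totals have to balance, since an LWP may leave on a different CPU than the one it entered on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPU 4	/* assumed CPU count for this userland model */

/* Hypothetical stand-in for the per-CPU counters kept in a device_t. */
struct turnstiles {
	uint64_t entered[NCPU];	/* only ever incremented, one slot per CPU */
	uint64_t left[NCPU];	/* only ever incremented, one slot per CPU */
};

/* Counting half of an acquire: a thread running on `cpu` enters the softc. */
static void
turnstile_enter(struct turnstiles *t, unsigned cpu)
{
	assert(cpu < NCPU);
	t->entered[cpu]++;	/* plain increment: no cross-CPU contention */
}

/* Counting half of a release: a thread running on `cpu` leaves the softc. */
static void
turnstile_leave(struct turnstiles *t, unsigned cpu)
{
	assert(cpu < NCPU);
	t->left[cpu]++;
}

/*
 * The detaching thread's check: nobody is inside the softc when the
 * total number of entries equals the total number of exits.
 */
static bool
turnstiles_balanced(const struct turnstiles *t)
{
	uint64_t in = 0, out = 0;

	for (unsigned i = 0; i < NCPU; i++) {
		in += t->entered[i];
		out += t->left[i];
	}
	return in == out;
}
```

Summing across CPUs rather than comparing slot by slot is the design point of the model: a thread that enters on CPU 0 and leaves on CPU 2 still balances the totals.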
In the rare event that our thread wants to reclaim the softc (say that it is completing config_detach(9)), first it closes the gate on the softc's corresponding device_t. To close the gate, our thread creates a closed device_gate_t, acquires its mutex, and points the device_t at it. With the gate closed, our thread unlinks the device_t and softc. Finally, it seals the gate by pointing the device_t at a special device_gate_t, gate_sealed, and it wakes all of the threads that wait to acquire the closed gate's mutex. Since the device_t and softc are unreachable, no new thread can enter them. All of the threads that wake holding the closed gate see that the device_t now points to gate_sealed and leave the softc through a turnstile: device_acquire() returns in those threads with ENXIO. Our thread safely reclaims the softc when the number of threads who have entered through its turnstiles balances the number who have left.

Anyway, that's the gist of the idea. I've attached the untested (and uncompiled) code for the details. Comments?

Dave

[1] I'm using the term thread loosely to mean any thread of execution, be it an LWP, software or hardware interrupt.

--
David Young             OJC Technologies
dyo...@ojctech.com      Urbana, IL * (217) 278-3933

/* I. DEVICE ATTACHMENT
 *
 * Steps for config_attach*() to follow:
 *
 * Call device_attach_gate(dv) and check for error before making dv
 * visible by linking it to alldevs or cfdriver_t.
 *
 * II. DEVICE DETACHMENT
 *
 * Steps for config_detach(9) to detach a device_t, dv:
 *
 * 0) Call the driver detach routine, dv->dv_cfattach.ca_detach. It is
 *    important for the detach routine to disestablish interrupt handlers
 *    and to make sure interrupt handling on all CPUs is complete (by
 *    making a low-priority cross-call to all CPUs, for example).
 *
 * 1) Install gate with device_gate_close(dv, config_detach, dgp).
 *
 * 2) Unlink the device_t from alldevs and from cfdriver_t.
 *
 * 3) Call device_gate_seal(dv, dgp).
 *
 * III. REFERENCING/READING/WRITING THE SOFTWARE STATE (device_t & softc)
 *
 * Steps for kernel threads, hard- and soft-interrupt handlers to
 * protect a device_t, dv, and its softc against reclamation:
 *
 * In sleepable LWP contexts, rc = device_acquire(dv, interruptible),
 * where interruptible is true or false, and check rc for errors.
 * Every successful device_acquire() call must be matched with a
 * device_release(). Between device_acquire() and device_release(),
 * neither dv nor its softc can be reclaimed.
 *
 * In hardware/software interrupts, device_enter(dv). Every device_enter(dv)
 * must be matched with a device_exit(dv). Between device_enter(dv) and
 * device_exit(dv), neither dv nor its softc can be reclaimed.
 *
 * LOOKING UP THE SOFTWARE STATE (device_t & softc)
 *
 * Routines that look up a device_t or softc by unit name or number,
 * such as device_lookup() and device_lookup_private(), should return
 * a device_t or softc that is protected against reclamation. Callers
 * of lookup routines should match each successful call with a
 * device_release() or device_exit(), according to guidance given in
Re: kicking everybody out of the softc
On Sun, Aug 15, 2010 at 02:19:56PM -0500, David Young wrote:
> Anyway, that's the gist of the idea.

BTW, many thanks to ad@ and rmind@ and others for valuable discussions that helped me get this far. Of course, any broken ideas or coding mistakes are mine.

Dave

--
David Young             OJC Technologies
dyo...@ojctech.com      Urbana, IL * (217) 278-3933
re: kicking everybody out of the softc
thanks for looking into this problem. we need a solution. would device_lookup() and device_lookup_private() take a reference on this count automatically? or maybe some new API that does it, to avoid the need to audit every driver at once.
Re: kicking everybody out of the softc
On Mon, Aug 16, 2010 at 06:43:20AM +1000, matthew green wrote:
> thanks for looking into this problem. we need a solution.

No problem.

> would device_lookup() and device_lookup_private() take a reference
> on this count automatically? or maybe some new API that does it,
> to avoid the need to audit every driver at once.

Thanks for asking the question. device_lookup() needs to call device_acquire(dv) before dropping its lock on the cfdriver_t array[1] and returning dv.

I had planned to audit all of the drivers, but that could get out of hand. I like your idea of using a new API. I'll give it some thought.

Dave

[1] Actually, it takes the alldevs lock, today. Finer locking granularity, such as a cfdriver_t lock, is desirable!

--
David Young             OJC Technologies
dyo...@ojctech.com      Urbana, IL * (217) 278-3933
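[Editor's sketch] The ordering requirement in this reply (take the reference before the lookup lock is dropped) can be illustrated with a small userland model. Everything here is hypothetical and simplified: toy_device, toy_cfdriver, toy_lookup_acquire(), toy_release() and toy_detach() are invented names, a pthread mutex stands in for the alldevs/cfdriver_t lock, a plain counter stands in for the turnstiles, and detach simply refuses while references are outstanding instead of closing a gate and waiting as the proposal does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NUNITS 4	/* assumed table size for this model */

struct toy_device {
	int unit;
	unsigned refs;	/* stand-in for the turnstile counts */
};

struct toy_cfdriver {
	pthread_mutex_t lock;			/* stand-in for the lookup lock */
	struct toy_device *devs[NUNITS];	/* unit number -> device */
};

/*
 * Look up a unit and take a reference while the lock is still held,
 * so a detaching thread can never see refs == 0 for a device that a
 * caller is about to use.
 */
static struct toy_device *
toy_lookup_acquire(struct toy_cfdriver *cd, int unit)
{
	struct toy_device *dv = NULL;

	pthread_mutex_lock(&cd->lock);
	if (unit >= 0 && unit < NUNITS)
		dv = cd->devs[unit];
	if (dv != NULL)
		dv->refs++;
	pthread_mutex_unlock(&cd->lock);
	return dv;
}

/* Drop a reference obtained from toy_lookup_acquire(). */
static void
toy_release(struct toy_cfdriver *cd, struct toy_device *dv)
{
	pthread_mutex_lock(&cd->lock);
	assert(dv->refs > 0);
	dv->refs--;
	pthread_mutex_unlock(&cd->lock);
}

/* Unlink the unit if nobody holds it; returns false if it is busy. */
static bool
toy_detach(struct toy_cfdriver *cd, int unit)
{
	bool ok = false;

	pthread_mutex_lock(&cd->lock);
	if (unit >= 0 && unit < NUNITS && cd->devs[unit] != NULL &&
	    cd->devs[unit]->refs == 0) {
		cd->devs[unit] = NULL;	/* no new lookups can find it */
		ok = true;
	}
	pthread_mutex_unlock(&cd->lock);
	return ok;
}
```

A separate acquiring lookup like this also matches the "new API" suggestion in spirit: drivers converted to it would pair every successful lookup with a release, while unconverted drivers would keep calling the old lookup until they are audited.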
re: kicking everybody out of the softc
On Jan 6, 1:19am, matthew green wrote:
}
} would device_lookup() and device_lookup_private() take a reference
} on this count automatically? or maybe some new API that does it,
} to avoid the need to audit every driver at once.

What would release the reference in that case? Or, would the count just keep incrementing thus preventing the driver from detaching until it is audited?

}-- End of excerpt from matthew green
re: kicking everybody out of the softc
On Jan 6, 1:19am, matthew green wrote:
}
} would device_lookup() and device_lookup_private() take a reference
} on this count automatically? or maybe some new API that does it,
} to avoid the need to audit every driver at once.

What would release the reference in that case? Or, would the count just keep incrementing thus preventing the driver from detaching until it is audited?

right. that's why the new API to switch to on a per-device basis would help avoid an all-drivers-at-once audit.

.mrg.