drop volatile from __cpu_simplelock_t typedef
__cpu_simplelock_t was born 15+ years ago with the following commit message:

=== snip ===
Let each platform typedef the new __cpu_simple_lock_t, which should be
the most efficient type used for the atomic operations in the simplelock
structure, and should also be __volatile.
=== snip ===

So, thinking about fixing lib/49989, I started wondering why volatile is
necessary in the simplelock typedefs.  "should also be" doesn't explain
much, and may just be there because that's what the pre-simplelock_t
definitions used.

Shouldn't simplelocks always be operated on with atomic instructions and
instruction barriers, or some non-SMP equivalent thereof?  Assuming so,
volatile in the typedef doesn't do anything except probably throw
compilers off, and therefore we should drop volatile from the typedefs.
RAS might need volatile (not sure yet), but that can probably be pushed
inside the RAS sequence instead of exposing it everywhere.

Thoughts?  Seems like the right thing to do irrespective of lib/49989.
Re: drop volatile from __cpu_simplelock_t typedef
On 26/06/15 14:51, Matt Thomas wrote:
> __cpu_simple_lock_unlock concerns me without volatile.

Why?  Something to do with barriers?

> Also, many have loops that count on the variable changing.  Without
> volatile those will become infinite loops.

Such as?  I can only think of the C debugging version of simple_lock ;)
Can't those be fixed by making them call __SIMPLELOCK_LOCKED_P()?  They
arguably should have been doing that in the first place anyway.  Or are
you worried that we won't be able to catch all of them?

> For RISC-V, I used the builtin C11-ish gcc atomics to implement the
> __cpu_simple_lock_t operations.  I just moved it to sys/common_lock.h
> so other ports could use it.

More MI code is always nice, but what's the relevance to this
discussion?  sys/common_lock.h should work the same with or without
volatile because the volatileness comes from atomic_store/exchange, no?

Anyway, I interpreted your reply as "sounds like a good idea, but there
may be some problems".  If that's completely wrong, please be more
explicit.
Re: drop volatile from __cpu_simplelock_t typedef
On 26/06/15 15:20, Matt Thomas wrote:
> On Jun 26, 2015, at 8:17 AM, Antti Kantee <po...@iki.fi> wrote:
>> Such as?  I can only think of the C debugging version of simple_lock ;)
>> Can't those be fixed by making them call __SIMPLELOCK_LOCKED_P()?
>> They arguably should have been doing that in the first place anyway.
>> Or are you worried that we won't be able to catch all of them?
>
> Atomic instructions typically have a lot of overhead, so you loop until
> the variable changes and then retry the atomic instruction.
> __SIMPLELOCK_LOCKED_P() doesn't use an atomic instruction, and I can't
> see how it even could in any way that would make sense.

We'd just add a volatile cast into that method.
Re: drop volatile from __cpu_simplelock_t typedef
On 26/06/15 13:55, Antti Kantee wrote:
> __cpu_simplelock_t was born 15+ years ago with the following commit
> message:
>
> === snip ===
> Let each platform typedef the new __cpu_simple_lock_t, which should be
> the most efficient type used for the atomic operations in the
> simplelock structure, and should also be __volatile.
> === snip ===
>
> "should also be" doesn't explain much, and may just be there because
> that's what the pre-simplelock_t definitions used.

I asked thorpej about "wouldn't the special interfaces used to
manipulate simplelock_t negate the need for volatile", and he said:
"Originally, not everything was special interfaces, and there were C
versions for debugging."

So, it seems like the original reason to use volatile no longer exists.
Re: 82801G_HDA in hdaudiodevs
On 14/06/15 12:07, Robert Millan wrote:
> Hi!
>
> Am I missing something, or is my device (82801G_HDA) missing from
> hdaudiodevs?
>
> I don't use hdaudiodevs directly, only parse it to generate a list that
> is later fed into Linux UIO to make it available to Rump, so I'm not
> completely sure if all HDA devices are supposed to be listed there or
> not.  Does someone know?

I don't, but NetBSD tech-kern might (cc'd).

> While on the subject, I've also noticed that the hdaudio driver doesn't
> work under qemu or virtualbox; not sure if it's as simple as adding
> something to hdaudiodevs or if something more profound is required.
> (It's not really a problem with emulators, though, since I just use the
> ac97 device.)
>
> Index: rumpkernel-0~20150607/buildrump.sh/src/sys/dev/hdaudio/hdaudiodevs
> ===================================================================
> --- rumpkernel-0~20150607.orig/buildrump.sh/src/sys/dev/hdaudio/hdaudiodevs	2015-06-07 17:04:54.000000000 +0200
> +++ rumpkernel-0~20150607/buildrump.sh/src/sys/dev/hdaudio/hdaudiodevs	2015-06-14 13:53:20.666957221 +0200
> @@ -170,6 +170,7 @@
>  product	INTEL	G45_HDMI_3	0x2803	G45 HDMI/3
>  product	INTEL	G45_HDMI_4	0x2804	G45 HDMI/4
>  product	INTEL	G45_HDMI_FB	0x29fb	G45 HDMI/FB
> +product	INTEL	82801G_HDA	0x27d8	82801GB/GR
>  
>  /* Sigmatel */
>  product	SIGMATEL	STAC9230X	0x7612	STAC9230X
Re: 82801G_HDA in hdaudiodevs
On 14/06/15 13:48, Robert Millan wrote:
> El 14/06/15 a les 15:10, Antti Kantee ha escrit:
>> 2) hdaudio doesn't work on a regular NetBSD installation under qemu
>> (and probably virtualbox too, though I didn't test with virtualbox)
>
> Just FTR, I had trouble with auich(4) + Rump + Linux-UIO not
> propagating interrupts on Virtualbox due to shared IRQs.  I suspect it
> might be a Virtualbox bug, as I couldn't reproduce the problem anywhere
> other than Vbox, and I didn't investigate further as I found a simple
> workaround (attached, maybe someone will find it useful).

uio_pci_generic+interrupts is not a happy place.  For example, the
maintainers have refused patches to enable MSI because it would make the
in-kernel uio driver useful.  Well, maybe the rejection wasn't quite
phrased like that, but that was the essence.

Ok, interrupts aren't a happy place, but now we're getting sidetracked ;)
Re: Removing ARCNET stuffs
On 31/05/15 06:05, matthew green wrote:
> hi Andrew! :)
>
>> Who is appalled to discover that pc532 support has been removed!

In addition to toolchain support, the hardware was near-extinct at the
time of removal.  Now, the hardware is no longer near-extinct:
http://cpu-ns32k.net/

I used the FPGA pc532 running NetBSD 1.5.x(?) a few weeks back.
Unbelievable experience, especially since I spent quite some time and
effort trying to get a pc532 I had on loan 10+ years ago to function.

> get your GCC and binutils and GDB pals to put the support back in the
> toolchain and we'll have something to talk about :-)

Didn't know that things to *talk* about were in short supply...
Re: Inter-driver #if dependencies
On 18/05/15 02:33, Paul Goyette wrote:
>> If you want to solve the problem just for one driver cluster, that's
>> more than fine.  In other words, if you don't want to spend effort on
>> a general solution, implement what you need privately in pcppi land.
>> Everyone still wins and will be thankful for your efforts, unlike in
>> the event of a haphazardly researched autoconf interface.  There's no
>> need to start making a list of things that you're not willing to do.
>
> No-one benefits if I implement what [I] need privately...

Read the entire sentence/paragraph instead of stopping at a halfway
point where it most suits your apparent agenda of having an excuse to
lash out.

> Please consider this exchange (between you and me) to be finished.

Gladly.
Re: Inter-driver #if dependencies
On 17/05/15 22:40, Paul Goyette wrote:
> My crusade for modularity has arrived at the pcppi(4) driver, and I've
> discovered that there are a number of places in the code where a #if is
> used to determine whether or not some _other_ driver is available to
> provide certain routines.  For pcppi(4), these dependencies are for the
> attimer(4) and pckbd(4) drivers.  (While I haven't yet gone searching,
> I'd be willing to wager that there are other similar examples in other
> drivers.)

As you say, you're proposing a solution based on looking at one example
and a wager that you'll find more use cases.  Furthermore, your message
is unclear on whether you've implemented your proposal to test that it
works even for your single case.

You won the wager in the sense that the problem exists.  However, I'm
not at all convinced that an abstraction hell via autoconf is the best
possible solution (not that I'm convinced that it isn't, either).  I
suggest analyzing and fixing at least half a dozen cases in the tree
before proposing a general solution.  For example, scsi/ata/wd/sd/usb is
a good place to look at.  If you want to solve the problem just for one
driver cluster, that's more than fine, but you don't need a halfway
general solution for that.
Re: Inter-driver #if dependencies
On 18/05/15 01:02, Paul Goyette wrote:
> On Mon, 18 May 2015, Antti Kantee wrote:
>> On 17/05/15 22:40, Paul Goyette wrote:
>>> My crusade for modularity has arrived at the pcppi(4) driver, and
>>> I've discovered that there are a number of places in the code where
>>> a #if is used to determine whether or not some _other_ driver is
>>> available to provide certain routines.  For pcppi(4), these
>>> dependencies are for the attimer(4) and pckbd(4) drivers.  (While I
>>> haven't yet gone searching, I'd be willing to wager that there are
>>> other similar examples in other drivers.)
>>
>> As you say, you're proposing a solution based on looking at one
>> example and a wager that you'll find more use cases.  Furthermore,
>> your message is unclear on whether you've implemented your proposal
>> to test that it works even for your single case.
>>
>> You won the wager in the sense that the problem exists.  However, I'm
>> not at all convinced that an abstraction hell via autoconf is the
>> best possible solution (not that I'm convinced that it isn't,
>> either).  I suggest analyzing and fixing at least half a dozen cases
>> in the tree before proposing a general solution.  For example,
>> scsi/ata/wd/sd/usb is a good place to look at.  If you want to solve
>> the problem just for one driver cluster, that's more than fine, but
>> you don't need a halfway general solution for that.
>
> I'm certainly willing to implement the mechanism, as a
> proof-of-concept, for the pcppi/attimer/pckbd cluster, if there's a
> reasonable chance of the effort being useful.  I'm certainly not
> willing to spend the next 6 months (or more) of my life analyzing and
> fixing at least half a dozen cases without some encouragement that the
> effort won't be wasted.  If you've actually got some constructive
> feedback on my suggestion, please provide it.

Is giving pointers to related use cases that you're in your own words
not aware of not constructive?

"If you want to solve the problem just for one driver cluster, that's
more than fine."  In other words, if you don't want to spend effort on a
general solution, implement what you need privately in pcppi land.
Everyone still wins and will be thankful for your efforts, unlike in the
event of a haphazardly researched autoconf interface.  There's no need
to start making a list of things that you're not willing to do.
Re: Missing rump_kthread_destroy() ?
On 19/04/15 07:40, Paul Goyette wrote:
> In my on-going efforts to further modularize the NetBSD kernel, I'm
> currently prying apart the pieces of sysmon...  One of those pieces
> would be sysmon_taskq, which provides a lwp environment to execute
> callouts.
>
> In the sysmon_taskq_init() routine there is a call to kthread_create(),
> so it would seem reasonable that sysmon_taskq_fini() would call
> kthread_destroy().  Unfortunately, when building rump_allserver I
> discover that there is no rump emulation for kthread_destroy().  There
> is _create(), _join(), _exit(), and _init(), but no _destroy().
>
> Is there a reason for not providing kthread_destroy()?  How difficult
> would it be to add it?

The right thing to use is kthread_exit().  I don't see why
kthread_destroy() needs to be in the public API at all.  I'd just
remove it.
Re: kernel constructor
There are two separate issues here:

1: link sets vs. ctors

They are exactly the same thing in slightly different clothing.  Mental
exercise: define link_set_ctor and run those in kernel bootstrap when
you'd run __attribute__((constructor)).

As David cautions, I don't think ctors should do anything apart from
note that X is present in the image so that initializing X can be done
later.  With link sets you don't need the extra step of noting since you
can just iterate when you want to.

2: init_main ordering

I think that code reading is an absolute requirement there, i.e. we
should be able to know offline what will happen at runtime.  Maybe that
problem is better addressed with an offline preprocessor which figures
out the correct order?
Re: [PATCH] PUFFS backend allocation (round 3)
On 29/10/14 00:11, Emmanuel Dreyfus wrote:
> On Tue, Oct 28, 2014 at 06:07:29PM +0100, J. Hannken-Illjes wrote:
>> Confused.  If write and/or fsync are synchronous (VOP_PUTPAGES with
>> flag PGO_SYNCIO) no write error will be forgotten.
>
> puffs_vnop_strategy() contains this:
>
> 	/*
> 	 * XXX: wrong, but kernel can't survive strategy
> 	 * failure currently.  Here, have one more X: X.
> 	 */
> 	if (error != ENOMEM)
> 		error = 0;
>
> This is where we want to store the error so that it can be recovered
> by the upper layer.

That comment is close to 10 years old.  If you haven't, start by
checking that it still applies, and perhaps do a quick check to see what
goes wrong (I don't remember exactly, some sort of kernel panic I think)
and if it can be fixed.

And I still think that the best approach is to make the cache
write-through, at least when a write causes a page fault, and then just
deal with whatever distributed systemness happens behind the kernel
driver's back.
Re: [PATCH] PUFFS backend allocation (round 3)
On 29/10/14 23:33, Emmanuel Dreyfus wrote:
> Antti Kantee <po...@iki.fi> wrote:
>> That comment is close to 10 years old.  If you haven't, start by
>> checking that it still applies and perhaps do a quick check to see
>> what goes wrong (I don't remember exactly, some sort of kernel panic
>> I think) and if it can be fixed.
>
> I just tried removing this "if (error != ENOMEM) error = 0" and it
> seems to work fine on netbsd-7.  The error is reported to the calling
> layers without a hitch.  Are there some corner cases where it could
> cause problems?

Don't recall it being a corner case.

> And why does nfs have to save the error in np->n_error to recover it
> in the upper layer?  Obsolete code that was never touched?

Not sure, but per a quick examination it looks like nfs wants to save
the error for the next caller.  As long as puffs is synchronous, it
won't be an issue.  Notably, though, a puffs file server might like to
be asynchronous in handling a write and report an error later, but
that's getting complicated.  "Optimization is not a substitute for
correctness" ...
Re: [PATCH] GOP_ALLOC and fallocate for PUFFS
On 30/09/14 13:44, Emmanuel Dreyfus wrote:
> Hello
>
> When a PUFFS filesystem uses the page cache, data enters the cache with
> no guarantee it will be flushed.  If it cannot be flushed (because
> PUFFS write requests get EDQUOT or ENOSPC), then the kernel will loop
> forever trying to flush data from the cache, and the filesystem cannot
> be unmounted without -f (and data loss).
>
> In the attached patch, I add in PUFFS:
> - support for the fallocate operation
> - a puffs_gop_alloc() function that uses fallocate
> - when writing through the page cache, we first call GOP_ALLOC to make
>   sure backend storage is allocated for the data we cache
>
> debug printfs show a sane behavior, GOP_ALLOC calling puffs_gop_alloc
> only when required.  If the filesystem does not implement fallocate, we
> keep the current behavior of filling the page cache with data we are
> not sure we can flush.
>
> Perhaps we can improve further: missing fallocate can be emulated by
> writing zeroed chunks.  I have implemented that in libperfuse, but we
> may want to have this in libpuffs, enabled by a mount option.  Input
> welcome.

Is it really better to sync fallocate, put stuff in the page cache and
flush the page cache some day instead of just having a write-through (or
write-first) page cache on the write() path?  You also get rid of the
fallocate-not-implemented problem that way.

That still leaves the mmap path ... but mmap always causes annoying
problems and should just die ;)

Writing zeroes might be a bad emulation for distributed file systems,
though I guess you're the expert in that field and can evaluate the
risks better than me.
Re: How PUFFS should deal with EDQUOT?
On 22/09/14 04:28, Emmanuel Dreyfus wrote:
> When a PUFFS filesystem enforces quota, a process doing a write over
> quota will end up frozen in DE+ state.
>
> The problem is that we have written data in the page cache that is
> supposed to go to disk.  The code path is a bit complicated, but
> basically we go in genfs VOP_PUTPAGES, which leads to genfs_do_io()
> where we have a VOP_STRATEGY, which causes a PUFFS write.  The PUFFS
> write will get EDQUOT, but genfs_do_io() ignores VOP_STRATEGY's return
> value and retries forever.
>
> In other words, when flushing the cache, the kernel ignores errors from
> the filesystem and runs an endless loop attempting to flush data,
> during which the process that did the over-quota write is not allowed
> to complete exit().
>
> What is the proper way to deal with that?  Is it reasonable to wipe the
> page cache using puffs_inval_pagecache_node() when write gets a
> failure?  Any failure?  Or just EDQUOT and ENOSPC?  Should that happen
> in libpuffs or in the filesystem (libperfuse here)?

I'd guess the key to success would be to support genfs_ops in puffs so
that the file server is consulted about block allocations.

See also tests/vfs/t_full.c
Re: virtualized nfsd (Re: virtual kernels, syscall routing, etc.)
I almost forgot my annual contribution to this thread (actually missed
it last year, sorry 'bout that).

On Fri Oct 16 2009 at 05:36:40 +0300, Antti Kantee wrote:
> On Thu Nov 27 2008 at 20:32:15 +0200, Antti Kantee wrote:
>> Good news everyone!  I've made the kernel nfs service (nfsd) run in
>> userspace.
>
> Ok, I've worked on this a little more.  Now it's possible to run a
> fully selfcontained nfsd with a virtualized TCP/IP stack and hence a
> dedicated IP address (the previous solution used host IP and rpcbind
> in a very unholy cocktail).
>
> [...]
>
> The bad news is that this currently requires a hacked version of the
> libc rpc client.  Without syscall routing mentioned in my first email
> on the subject, we cannot route the syscalls libc makes to the right
> kernel.  The good news is that the modifications are selfcontained and
> I've put up a tarball.

So now I worked on it even more, and it's possible to use the stock
kernel nfs server code and stock userland binaries to run the kernel nfs
server in userspace (and, as usual, the stock kernel module binaries on
x86).  The instructions are part of the tutorial I published last week:

http://www.netbsd.org/docs/rump/sptut.html#masterclass

If you want to use kernel module bins instead of rump libs, just remove
-lrumpfs_nfsserver and even -lrumpfs_nfs and -lrumpfs_ffs from the
rump_server command line.  The functionality will be autoloaded from
/stand/i386/5.99.48/modules on the host.

On amd64 it'll require some tinkering, though, since with standard
kernel module binaries you need to load all rump kernel code into the
bottom 2GB due to -mcmodel=kernel, and that tinkering is left as an
exercise for the reader (you either need static linking or to teach
ld.so to do this).  Still, even without tinkering the rump libs work
just fine on every arch.

Since I can't figure out how to develop things any further than running
unmodified source and binary of every relevant component, I guess here
endeth this thread.
-- älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
inheriting the lwp private area
Hi,

On Julio's request I was looking at the now-failing tests and resulting
hanging processes.  Basically, at least on i386, the failures are a
result of fork() + setcontext() (plus some voodoo), after which calling
pthread_mutex_lock() with signals masked causes a busyloop due to
pthread__self() causing a segv but the signal never getting delivered
(that in itself seems like stinky business, but not the motivating
factor of this mail).

Per 4am intuition it seems pretty obvious that a child should inherit
the forking lwp's private data.  Does that make sense to everyone else?
At least patching fork1() to do so fixes the hanging processes and
failing tests, and a quick roll around qemu hasn't caused any problems.

If it doesn't make sense, I'll disable the pthread bits (per commit
guideline clause 5) until support is fully fixed so that others don't
have to suffer from untested half-baked commits causing juju hangs and
crashes.
rump is complete
Hi,

I have accomplished everything I want to with rump and plan to declare
it stable in NetBSD 6.  This implies that adding new interfaces will
slow down, and changing old ones will require backward compat.

If you are interested in the unique possibilities offered by rump, now
is a good time to check that your use cases work as expected.

  - antti

For what rump is, does, and how to use it, see the usual place:
http://www.NetBSD.org/docs/rump/
Re: the bouyer-quota2 branch
On Thu Mar 10 2011 at 09:36:20 +0100, Manuel Bouyer wrote:
> On Thu, Mar 10, 2011 at 11:42:08AM +1100, matthew green wrote:
>>> On Sat Feb 19 2011 at 23:21:35 +0100, Manuel Bouyer wrote:
>>>> This branch is for the development of a modernized disk quota
>>>> system.  The 2 main changes are: a new quotactl(2) interface and a
>>>> new on-disk format, compatible with journaled ffs.
>>>
>>> Hmm, I'm wondering if the new quotactl syscall should have a new
>>> name instead of keeping the old one.  It doesn't make much sense to
>>> play __RENAME() games with it since any old code will not compile
>>> against the new quotactl signature.
>>
>> that seems reasonable to me.
>
> What do you propose then?  quotactl is the best name I can find for
> this syscall ...

quotactl2?  quotapctl?  quota_pctl?  quotactl_the_next_generation?  ...
quota_king?

Considering that quotactl is not used by programmers (unless they're
hacking on the quota utils ;) I don't think we need to spend a lot of
energy on picking the name.  If we want to follow a common naming scheme
for all syscalls which will take a plist (such as a future mount?), we
might want to spend a few minutes on it, though.

(Just to explain the rationale for this nomenclatural crisis: yesterday
I discovered that the changed signature broke some assumptions about
syscall compat I'd made in makesyscalls.sh, and that caused the script
to fail in a very head-scratching way.  I could just change
makesyscalls.sh, but since I'd made that assumption, it's possible
others have too.)
Re: the bouyer-quota2 branch
On Thu Mar 10 2011 at 20:29:58 +1100, matthew green wrote:
>> (Just to explain the rationale for this nomenclatural crisis:
>> yesterday I discovered that the changed signature broke some
>> assumptions about syscall compat I'd made in makesyscalls.sh, and
>> that caused the script to fail in a very head-scratching way.  I
>> could just change makesyscalls.sh, but since I'd made that
>> assumption, it's possible others have too.)
>
> BTW, when i changed reboot(2) i added a char * to the signature.
> (this was in 1996?)  how does this affect your compat assumptions?

It doesn't affect them because i'm not interested in compiling new code
against oreboot.  So theoretically yes, in reality no.  I care about the
latter ;)
Re: the bouyer-quota2 branch
On Thu Mar 10 2011 at 11:28:14 +0100, Manuel Bouyer wrote:
>> Considering that quotactl is not used by programmers (unless they're
>> hacking on the quota utils ;) I don't think we need to spend a lot
>
> someone who looks at quotactl(8) will also look at quotactl(2) ...

You can still MLINKS the quotactl.2 name or add a note.

>> of energy on picking the name.  If we want to follow a common naming
>
> Agreed.  So let's keep quotactl(2) ... it's fine and is working.

I don't agree about "fine", but I won't push the issue any further.
Re: bouyer-quota2: fsck_ffs crash
On Thu Mar 10 2011 at 19:45:20 +0100, Manuel Bouyer wrote:
> On Thu, Mar 10, 2011 at 06:59:41PM +0100, Ignatios Souvatzis wrote:
>> Hi,
>>
>> % unmount /export/home/1
>> % tunefs -q user /export/home/1
>> % fsck -fy /export/home/1
>> ...
>> USER QUOTA MISMATCH FOR ID 0: 0/0 SHOULD BE 1791988/1794 ALLOC? yes
>> USER QUOTA MISMATCH FOR ID 0: 0/0 SHOULD BE 0/0 ALLOC? yes
>> fsck: /dev/home/rtheory1: Segmentation fault
>>
>> This is on Sparc64.  I'll provide more data tomorrow, assuming I'll
>> find time to point gdb at a -g binary and the core dump.
>
> it should not try to allocate/fix entries for the same uid twice.
> Also, "0/0 SHOULD BE 0/0" looks wrong.  It would be interesting to see
> if an entry really got added for id 0 twice, or if the second id is
> the result of some corruption.  Can you see if tests/sbin/fsck_ffs
> completes fine on sparc64 (atf-run|atf-report in this directory)?

They complete fine on a sparc64 (but of course that doesn't guarantee
they complete fine on Ignatios's sparc64, so he should run the tests).

http://www.netbsd.org/~martin/sparc64-atf/22_atf.html
Re: the bouyer-quota2 branch
On Sat Feb 19 2011 at 23:21:35 +0100, Manuel Bouyer wrote:
> This branch is for the development of a modernized disk quota system.
> The 2 main changes are: a new quotactl(2) interface and a new on-disk
> format, compatible with journaled ffs.

Hmm, I'm wondering if the new quotactl syscall should have a new name
instead of keeping the old one.  It doesn't make much sense to play
__RENAME() games with it since any old code will not compile against the
new quotactl signature.
Re: Fwd: Status and future of 3rd party ABI compatibility layer
On Tue Mar 01 2011 at 09:55:38 +0000, Andrew Doran wrote:
> On Mon, Feb 28, 2011 at 11:25:07AM -0500, Thor Lancelot Simon wrote:
>> On Mon, Feb 28, 2011 at 11:13:36AM +0200, haad wrote:
>>> With solaris.kmod we are compatible with solaris kernel, (we should
>>> be able to load solaris kernel modules).
>>
>> Have you actually tried this?  I am pretty sure it would not work.
>> It appears to me that solaris.kmod includes shims that provide some
>> Solaris kernel interfaces at the *source* level in NetBSD, which
>> certainly makes it easier to port kernel code from Solaris but does
>> not (as far as I can tell) give us binary compatibility.
>
> Adam may have meant source level compat, it definitely does provide
> some level of that.  Of course no binary compat as you say.

If Solaris has a module-compatible kernel ABI, it's most likely possible
to be binary compatible considering we're source-compatible already
(cf. rump ABI compatibility with the kernel).  Of course it doesn't
happen accidentally and there's some amount of work involved.  But if
someone finds a use case for it, why not?
Re: modules and weak aliases
On Tue Feb 22 2011 at 13:24:38 -0600, David Young wrote:
> If there are weak aliases in my kernel and strong aliases in my kernel
> module, will the in-kernel linker override the weak aliases when I
> load my module, and put back the weak alias when it unloads my module?
>
> Supposing that the answer to my first question is yes, can I make the
> modules subsystem pause, before releasing the module's memory, while
> all threads vacate the module's functions?

From what I recall from having some things accidentally as __weak_alias
in rump, this happens:

	case STB_WEAK:
		kobj_error("weak symbols not supported\n");
		return 0;
Re: next vforkup chapter: lwpctl
On Tue Feb 15 2011 at 13:05:11 +0000, Alexander Nasonov wrote:
> Antti Kantee wrote:
>> This is not about rumphijack.  Look at e.g. sh and make.  Even if you
>> do fix them, it's not just limited to malloc either.  Anything that
>> uses LWPCTL will be screwed up after vfork.
>
> Hi Antti,
> Sorry if I suggest something stupid, but would it be possible to make
> librumphijack pthread-neutral?  E.g. use atomic_ops and/or rumpfd as
> synchronization primitives?

In that case you'd have to implement poll/select (and kevent) with the
help of fork().  It would be a much more heavyweight operation,
especially since it causes another rump kernel handshake to happen.
Furthermore, you cannot cache the workers.  Well, maybe with
__clone(CLONE_FILES), but ...

So, yes, it would be possible, but not a good move since it doesn't
solve any problems (apart from working around this kernel bug) and
causes extra penalties.
next vforkup chapter: lwpctl
Hi,

Alexander pointed me at a problem where, under suitable conditions, a
process with rump syscall hijacking would crash after vfork(), and he
sent the attached test program.

Under further examination, it turned out that the problem is due to
libpthread and lwpctl.  Having pthread linked causes malloc to use
pthread routines instead of the libc stubs.  Now, the vfork() child will
use a pointer to the parent's lwpctl area and thinks it is running on
LWPCTL_CPU_NONE (-1).  When malloc uses this to index the arena map, it
unsurprisingly gets total garbage back.

The following patch makes a vfork child update the parent's lwpctl area
while the child is running.  Comments or better ideas?

#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{

	malloc(1);
	switch (vfork()) {
	case -1:
		return EXIT_FAILURE;
	case 0:
		malloc(1);
		_exit(EXIT_FAILURE);
	default:
		;
	}

	return EXIT_SUCCESS;
}

Index: kern/kern_exec.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exec.c,v
retrieving revision 1.305
diff -p -u -r1.305 kern_exec.c
--- kern/kern_exec.c	18 Jan 2011 08:21:03 -0000	1.305
+++ kern/kern_exec.c	14 Feb 2011 12:25:28 -0000
@@ -979,6 +979,7 @@ execve1(struct lwp *l, const char *path,
 		mutex_enter(proc_lock);
 		p->p_lflag &= ~PL_PPWAIT;
 		cv_broadcast(&p->p_pptr->p_waitcv);
+		l->l_lwpctl = NULL;	/* borrowed from parent */
 		mutex_exit(proc_lock);
 	}
Index: kern/kern_exit.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.231
diff -p -u -r1.231 kern_exit.c
--- kern/kern_exit.c	18 Dec 2010 01:36:19 -0000	1.231
+++ kern/kern_exit.c	14 Feb 2011 12:25:28 -0000
@@ -343,6 +343,7 @@ exit1(struct lwp *l, int rv)
 	if (p->p_lflag & PL_PPWAIT) {
 		p->p_lflag &= ~PL_PPWAIT;
 		cv_broadcast(&p->p_pptr->p_waitcv);
+		l->l_lwpctl = NULL;	/* borrowed from parent */
 	}
 	if (SESS_LEADER(p)) {
Index: kern/kern_lwp.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_lwp.c,v
retrieving revision 1.154
diff -p -u -r1.154 kern_lwp.c
--- kern/kern_lwp.c	17 Jan 2011 08:26:58 -0000	1.154
+++ kern/kern_lwp.c	14 Feb 2011 12:25:29 -0000
@@ -696,6 +696,12 @@ lwp_create(lwp_t *l1, proc_t *p2, vaddr_
 		l2->l_pflag = LP_MPSAFE;
 	TAILQ_INIT(&l2->l_ld_locks);
 
+	/* For vfork, borrow parent's lwpctl context */
+	if (flags & LWP_VFORK && l1->l_lwpctl) {
+		l2->l_lwpctl = l1->l_lwpctl;
+		l2->l_flag |= LW_LWPCTL;
+	}
+
 	/*
 	 * If not the first LWP in the process, grab a reference to the
 	 * descriptor table.
@@ -1376,6 +1382,17 @@ lwp_userret(struct lwp *l)
 			KASSERT(0);
 			/* NOTREACHED */
 		}
+
+		/* update lwpctl process (for vfork child_return) */
+		if (l->l_flag & LW_LWPCTL) {
+			lwp_lock(l);
+			l->l_flag &= ~LW_LWPCTL;
+			lwp_unlock(l);
+			KPREEMPT_DISABLE(l);
+			l->l_lwpctl->lc_curcpu = (int)cpu_index(l->l_cpu);
+			l->l_lwpctl->lc_pctr++;
+			KPREEMPT_ENABLE(l);
+		}
 	}
 
 #ifdef KERN_SA
@@ -1529,6 +1546,10 @@ lwp_ctl_alloc(vaddr_t *uaddr)
 	l = curlwp;
 	p = l->l_proc;
 
+	/* don't allow a vforked process to create lwp ctls */
+	if (p->p_lflag & PL_PPWAIT)
+		return EBUSY;
+
 	if (l->l_lcpage != NULL) {
 		lcp = l->l_lcpage;
 		*uaddr = lcp->lcp_uaddr + (vaddr_t)l->l_lwpctl - lcp->lcp_kaddr;
@@ -1653,11 +1674,16 @@ lwp_ctl_alloc(vaddr_t *uaddr)
 void
 lwp_ctl_free(lwp_t *l)
 {
+	struct proc *p = l->l_proc;
 	lcproc_t *lp;
 	lcpage_t *lcp;
 	u_int map, offset;
 
-	lp = l->l_proc->p_lwpctl;
+	/* don't free a lwp context we borrowed for vfork */
+	if (p->p_lflag & PL_PPWAIT)
+		return;
+
+	lp = p->p_lwpctl;
 	KASSERT(lp != NULL);
 	lcp = l->l_lcpage;
Index: sys/lwp.h
===================================================================
RCS file: /cvsroot/src/sys/sys/lwp.h,v
retrieving revision 1.142
diff -p -u -r1.142 lwp.h
--- sys/lwp.h	28 Jan 2011 16:58:27 -0000	1.142
+++ sys/lwp.h	14 Feb 2011 12:25:29 -0000
@@ -214,6 +214,7 @@ extern lwp_t	lwp0;		/* LWP for proc0. *
 
 /* These flags are kept in l_flag. */
 #define	LW_IDLE		0x00000001 /* Idle lwp. */
+#define	LW_LWPCTL	0x00000002 /* Adjust lwpctl in userret */
 #define	LW_SINTR	0x00000080 /* Sleep is interruptible. */
Re: remove sparse check in vnd
On Sun Feb 06 2011 at 00:08:33 +0900, Izumi Tsutsui wrote: yamt@ wrote: i'd like to remove the sparseness check in vnd because there's no problem to use sparse files on nfs. We really want vnd on sparse files for emulator images...

I have this in my /etc/fstab:

	/home/pooka/temp/anita/wd0.img%DISKLABEL:a% /anita ffs rw,noauto,rump

It works perfectly for editing the image. fsck is a slightly gray area, but with wapbl it's not really a concern. I use the following to mount the image so that I don't need to unnecessarily sudo all the access:

	alias anitamnt env P2K_WIZARDUID=0 mount -o log /anita

But on the original subject, maybe we can use either gop_alloc or vop_bmap to decide whether the underlying file system supports vnd on sparse files.

--
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: kernel memory allocators
On Sat Jan 22 2011 at 17:55:11 +0100, Lars Heidieker wrote: yes, that makes sense for trying my changes with different pool_page sized pool_allocators... I think the initialization order has to be tested on bare metal, or?

Yes, anything which depends on a real uvm/pmap layer being present needs to be tested, if not on bare metal, at least in an environment where the whole OS stack is present. If we had better platform support for anita (*), things like this would be a lot easier to test ... I guess in theory it would be possible to use static analysis to check initialization order, but that might be more of a hobby ;)

*) http://www.netbsd.org/developers/features/

On Fri, Jan 21, 2011 at 12:29 PM, Antti Kantee po...@cs.hut.fi wrote: btw, just in case you're interested, you can easily use rump for userspace development/testing of the kmem/vmem/pool layers. src/tests/rump/rumpkern has examples on how to call kernelspace routines directly from user namespace.

On Fri Jan 21 2011 at 11:51:08 +0100, Lars Heidieker wrote: Do you have your changes available for review?

The kmem patch -- it includes:
- enhanced vmk caching in the uvm_km module, not only for page sized allocations but for low integer multiples (changed for rump as well).
- a changed kmem(9) implementation (using these new caches) (it's not using vmem, see note below).
- removed the malloc(9) bucket system and made malloc(9) a thin wrapper around kmem, just like in the yamt-kmem branch (changed vmstat to deal with the no longer existing symbol for the malloc buckets).
- pool_subsystem_init is split into pool_subsystem_bootstrap and pool_subsystem_init; after bootstrap statically allocated pools can be initialized and after init allocation is allowed. The only instances (as far as I found them) that do static pool initialization earlier are some pmaps; those are changed accordingly.
(Tested i386 and amd64 so far)

vmem status quo: The kmem(9) implementation used vmem for its backing, with a pool_allocator for each size; this is unusual for caches. The vmem(9) backing kmem(9) uses a quantum size of the machine alignment, so 4 or 8 bytes; therefore the quantum caches of the vmem are very small and kmem extends these to larger ones. The import functions for vmem do this on a page sized basis, and the uvm_map subsystem is in charge of controlling the virtual address layout; vmem is just an extra layer.

Questions: Shouldn't vmem provide the pool caches with pages for import into the pools, and shouldn't the quantum caches of vmem provide these pages for the low integer multiple sizes? That's the way I understand the idea of vmem and its implementation in Solaris. But this only makes sense if vmem(9) is in charge of controlling, let's say, the kmem map and not the uvm_map system; slices of this submap would be described by vmem entries and not by map entries. With the extended vmk caching for the kernel_map and kmem_map I implemented the quantum caching idea.

Results on an amd64 four-core 8gb machine. Sizes after building a kernel with make -j200, du /, ./build.sh -j8 distribution:

			current		changed
	kmem pool size	915mb / 950mb	942mb / 956mb
	pmap -R0 | wc	2700		1915

Sizes after pushing the memory system with several instances of the Sieve of Eratosthenes, each one consuming about 540mb, to shrink the pools:

			current		changed
	kmem pool size	657mb / 760mb	620mb / 740mb
	pmap -R0 | wc	4280		3327

Those numbers are not precise (especially the latter ones) at all, but they do hint in a direction. Keep in mind that allocations that go to malloc in the current implementation go to the pool in the changed one. Runtime of the build process was the same within a few seconds' difference.
kind regards, Lars

--
Mystische Erklärungen: Die mystischen Erklärungen gelten für tief; die Wahrheit ist, dass sie noch nicht einmal oberflächlich sind. -- Friedrich Nietzsche [ Die Fröhliche Wissenschaft, Buch 3, 126 ]
(Mystical explanations: mystical explanations are held to be deep; the truth is that they are not even superficial.)

--
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: kernel memory allocators
On Fri Jan 21 2011 at 11:51:08 +0100, Lars Heidieker wrote: Do you have your changes available for review? The kmem patch -- it includes: - enhanced vmk caching in the uvm_km module, not only for page sized allocations but low integer multiples (changed for rump as well). - a changed kmem(9) implementation (using these new caches) (it's not using vmem, see note below). - removed the malloc(9) bucket system and made malloc(9) a thin wrapper around kmem, just like in the yamt-kmem branch (changed vmstat to deal with the no longer existing symbol for the malloc buckets).

With your changes you can probably also include kern_malloc.c in librump instead of the host-relegated allocator in memalloc.c. There were two reasons why it wasn't done before: 1) i didn't want to guess an arbitrary size for kmem_map 2) too many subsystems relied on link sets for malloc types and i didn't want to add special handling for that. At least per cursory examination your patch seems to take care of both issues.

--
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: kernel messages and rump
On Wed Jan 12 2011 at 19:23:44 +0100, Manuel Bouyer wrote: I can live with it for now; having the uprintf output somewhere could help for atf tests though. I have filed kern/44378 about this.

Thanks, i'll look at it some day hopefully soon. Curiously enough, during all the time i've been working with rump (3.5 years now) I've never missed uprintf ;)

--
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: kernel messages and rump
On Wed Jan 12 2011 at 15:36:02 +0100, Manuel Bouyer wrote: Hello, I'm playing with rump, and more specifically rump_ffs. The mount is rejected (as expected) because the fs image has a feature which is not yet in the kernel. It's rejected by this code:

	if (fs->fs_flags & ~(FS_KNOWN_FLAGS | FS_INTERNAL)) {
		uprintf("%s: unknown ufs flags: 0x%08" PRIx32 "%s\n",
		    mp->mnt_stat.f_mntonname, fs->fs_flags,
		    (mp->mnt_flag & MNT_FORCE) ? "" : ", not mounting");
		if ((mp->mnt_flag & MNT_FORCE) == 0) {
			mutex_exit(&ump->um_lock);
			return (EINVAL);
		}
	}

but even with RUMP_VERBOSE I never see the uprintf(). Where does it go, and is there a way to make rump print it (I guess it should just go to stderr)?

It goes to the same place as for any process without a tty: the bitbucket. To properly support uprintf, there are at least two things to consider: 1) is the calling process local or remote 2) does the kernel include rumpkern_tty support. If you want a quick solution, file a PR and add ifdefs to the uprintf routines in subr_prf.c to make them behave like kprintf(TOCONS).

--
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: WAPBL kernel assertion
On Sat Jan 08 2011 at 14:32:23 +0100, Manuel Bouyer wrote: Hello, on a NetBSD 5.1 Xen domU, I got:

	panic: kernel diagnostic assertion "wl->wl_dealloccnt < wl->wl_dealloclim" failed: file "/home/builds/ab/netbsd-5/src/sys/kern/vfs_wapbl.c", line 1673

The file system is clean (I forced a fsck). For now I'm running without wapbl. Does this ring a bell for someone?

Try including revs 1.27 and 1.28 of vfs_wapbl.c. I can't recall the details anymore, but I remember the overflow was quite easy to trigger on a rump kernel. Maybe the problem triggers more easily in a virtual environment?

--
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: bad merge in uvm_fault_lower?
On Mon Dec 13 2010 at 00:24:49 +, Alexander Nasonov wrote: Hi, In sys/uvm/uvm_fault.c I see three KASSERT's twice: Removed one set. Thanks. -- älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: Sane support for SMP kernel profiling
On Fri Dec 10 2010 at 23:13:54 -0500, Thor Simon wrote: We've fixed SMP kernel profiling, which worked poorly at best (particularly on systems with high HZ) since a lock was taken and released around every single entry to mcount. Thanks to Andy for the suggestion as to how. Nice. Since you're on a roll, do you have plans to investigate userland multithreaded profiling? The only way I've gotten it to work reliably is to artificially leave libpthread out of the mix, and it's not multithreaded after that ... -- älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: what is this KASSERT() testing?
On Mon Dec 06 2010 at 11:55:05 +1100, matthew green wrote: hi. my devbox just crashed with this:

	panic: kernel diagnostic assertion "pg == NULL || pg == PGO_DONTCARE" failed: file "/usr/src/sys/miscfs/genfs/genfs_io.c", line 243

but i don't understand the KASSERT(). it seems that this sequence of events will always trigger:

	nfound = uvn_findpages(uobj, origoffset, &npages, ap->a_m,
	    UFP_NOWAIT|UFP_NOALLOC|(memwrite ? UFP_NORDONLY : 0));
	...
	if (!genfs_node_rdtrylock(vp)) {
		...
		for (i = 0; i < npages; i++) {
			pg = ap->a_m[i];
			if (pg != NULL && pg != PGO_DONTCARE) {
				ap->a_m[i] = NULL;
			}
			KASSERT(pg == NULL || pg == PGO_DONTCARE);

won't all pages filled in by the uvn_findpages() be non-NULL? so if the uvn_findpages() succeeds but the genfs_node_rdtrylock() fails, we will trigger this assert always. i think it should just be removed.

I guess it wants to test ap->a_m[i], cf. the change to the assignment clause in the same revision.

--
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
Re: mutexes, locks and so on...
Thanks, I'll use your list as a starting point. One question though:

On Wed Nov 24 2010 at 00:16:37 +0000, Andrew Doran wrote: - build.sh on a static, unchanging source tree.

From the SSP discussion I have a recollection that build.sh can be very jittery, up to the order of 1% per build. I've never confirmed it myself, though. Did you notice anything like that? (I guess the tools would have to be static too, so that they are not affected by the host compiler)
Re: misuse of pathnames in rump (and portalfs?)
Hi, On Tue Nov 23 2010 at 23:13:02 +, David Holland wrote: Furthermore, it is just plain gross for the behavior of VOP_LOOKUP in some directory to depend on how one got to that directory. As a matter of design, the working path should not be available to VOP_LOOKUP and VOP_LOOKUP should not attempt to make use of it. When I asked pooka for clarification, I got back an assertion that portalfs depends on this behavior so I should rethink the namei design to support it. However, as far as I can tell, this is not true: there is only one unexpected/problematic use of the pathname buffer in question anywhere in the system, in rumpfs.c. Furthermore, even if it were true, I think it would be highly undesirable. You wrote that the whole path will no longer be available. As you say yourself, it doesn't make sense for a file system to care about the previous components, so don't be shocked that I took this to mean the whole remaining path. If the whole remaining path is available, portalfs should be fine. As for etfs, as you might be able to see from the code, it's only used for root vnode lookups. I cannot think of a reason why we cannot define the key to start with exactly one leading '/'. Some in-tree users may not follow that rule now, but they should be quite trivial to locate with grep. That should make it work properly with your finally-nonbroken namei and also take care of all symlink/.. concerns you might have. thanks, antti
Re: mutexes, locks and so on...
On Wed Nov 24 2010 at 12:42:44 -0500, Thor Lancelot Simon wrote: On Wed, Nov 24, 2010 at 04:52:38PM +0200, Antti Kantee wrote: Thanks, I'll use your list as a starting point. One question though: On Wed Nov 24 2010 at 00:16:37 +0000, Andrew Doran wrote: - build.sh on a static, unchanging source tree. From the SSP discussion I have a recollection that build.sh can be very jittery, up to the order of 1% per build. I've never confirmed it myself, though. Did you notice anything like that?

There are other issues associated with build.sh as a benchmark.

* What are you trying to test? If you're trying to test the efficiency of cache algorithms or the I/O subsystem (including disk sort), for example, you need to test pairs of runs with a cold boot of *ALL INVOLVED HARDWARE* (this includes disk arrays etc) between each.

* If SSDs, hybrid disks, or other potentially self-reorganizing media are involved, forget it, you just basically lose.

* If you're trying to test everything *but* the cache and I/O subsystem, then you need to use a warm-up procedure you can have reasonable confidence works, for example always measuring the Nth of N consecutive builds.

Indeed. Let's start with the low-hanging fruit first -- having some figures which at least make some sense (e.g. measure the second of two builds in a row) is better than no figures.

* It can be hard to construct a system configuration where NetBSD kernel performance is actually the bottleneck and some other hardware limitation is not. Or where there's only a single bottleneck.

Dunno about NetBSD specifically, but this suggests great differences: http://www.netbsd.org/~ad/50/img15.html At least I doubt we got dramatically better drivers between 4 and 5. No idea about other OS performance there.
Re: misuse of pathnames in rump (and portalfs?)
On Wed Nov 24 2010 at 18:12:00 +0000, David Holland wrote: As for etfs, as you might be able to see from the code, it's only used for root vnode lookups. I cannot think of a reason why we cannot define the key to start with exactly one leading '/'. Some in-tree users may not follow that rule now, but they should be quite trivial to locate with grep. That should make it work properly with your finally-nonbroken namei and also take care of all symlink/.. concerns you might have.

I think it makes more sense for doregister to check for at least one leading '/' and remove the leading slashes before storing the key. Then the key will match the name passed by lookup; otherwise the leading slash won't be there and it won't match. (What I suggested last night is broken because it doesn't do this.)

Ah, yea, the leading slashes will be stripped for lookup, so we can't get an exact match for those anyway. So, let's define it as: string beginning with /, leading /'s collapsed to 1. All users I can find pass an absolute path.

ok, good
Re: mutexes, locks and so on...
On Fri Nov 19 2010 at 00:11:12 +0000, Andrew Doran wrote: You can release it with either call, mutex_spin_ is just a way to avoid additional atomic operations. The usual case is adaptive mutex, but stuff like the dispatcher/scheduler makes use of spin mutexes exclusively and the fast path versions were invented for that. (Because you can measure the effect with benchmarks :-).

Speaking of which, something I (and a few others) have been thinking about is to have constantly running benchmarks (akin to constantly running tests). That way we can have a rough idea which way performance and resource consumption is going and if there are any sharp regressions. Are your old benchmarking programs still available somewhere?
Re: mutexes, locks and so on...
On Fri Nov 12 2010 at 14:30:58 +0100, Johnny Billquist wrote: By reasoning that we should design for tomorrow's hardware, we might as well design explicitly for x86_64, and let all others emulate that. But in the past, I believe NetBSD has tried to rise above such simple and naïve implementation designs and actually tried to grab the meaning of the operation instead of an explicit implementation. That has belonged more in the field of Linux. I hope we don't go down that path...

Freeway design is not driven by the requirements of the horse. If a horse occasionally wants to gallop down a freeway, we're happy to let it as long as it doesn't cause any impediment to the actual users of the freeway. Over 15 years ago NetBSD had a possibility to take everyone into account since everyone was more or less on the same line. This is no longer true. If old architectures can continue to be supported, awesome, but they may in no way dictate MI design decisions which hold back the capabilities of modern day architectures.
Re: mutexes, locks and so on...
On Fri Nov 12 2010 at 15:25:04 +0100, Johnny Billquist wrote: Freeway design is not driven by the requirements of the horse. If a horse occasionally wants to gallop down a freeway, we're happy to let it as long as it doesn't cause any impediment to the actual users of the freeway. Over 15 years ago NetBSD had a possibility to take everyone into account since everyone was more or less on the same line. This is no longer true. If old architectures can continue to be supported, awesome, but they may in no way dictate MI design decisions which hold back the capabilities of modern day architectures. So what you are arguing is that MI needn't be so much MI anymore, and that supporting anything more than mainstream today is more to be considered a lucky accident than a desired goal? You can try to twist my words in any way that pleases you. However, the fact is that people who put forward a heroic effort in modernizing NetBSD will not be held accountable for making sure prehistoric architectures keep up (*). Some of our older ports have active supporters who keep the port up to speed with MI changes, set up emulator support, publish test run results etc. These ports will continue to be supported by NetBSD indefinitely. *) just to be explicit: prehistoric != non-x86
Re: mutexes, locks and so on...
On Fri Nov 12 2010 at 16:58:18 +0000, Mindaugas Rasiukevicius wrote: What Johnny apparently suggests is to revisit the mutex(9) interface, which is known to work very well, and optimise it for VAX. Well, I hope we do not design MI code to be focused on VAX. If we do, then perhaps I picked the wrong project to join.. :)

He is suggesting to revisit the implementation. It doesn't take much thinking to figure out you don't have to use kern_rwlock.c on vax, just provide the interface. It's really really unlikely the *interface* will change, so it's not much code updating to worry about either.

(incidentally, rump kernels have taken this approach for, what, 3 years now, because the kernel implementation of mutex/rwlock uses primitives which are not in line with the goals of rump, namely to virtualize without stacking multiple unnecessary implementations of the same abstraction)
Re: XIP (Rev. 2)
A big problem with the XIP thread is that it is simply not palatable. It takes a lot of commitment just to read the thread, not to mention putting out sensible review comments like e.g. Chuq and Matt have done. The issue is complex and the code involved is even more so. However, that is no excuse for a confusing presentation. It seems like hardly anyone can follow what is going on, and usually that signals that the audience is not the root of the problem.

A while back chuq promised to send a mail classifying his points into clear showstoppers and issues which can be handled post-merge. Let's start with that list (hopefully we'll get it soon), see what exactly are the relevant issues remaining, and solve *only* those issues. What needs to stop is threading off to other areas because "$subsystem is broken beyond repair". We know, but let's just handle the problems relevant to XIP for now.
Re: XIP (Rev. 2)
On Tue Nov 09 2010 at 12:47:11 -0600, David Young wrote: On Tue, Nov 09, 2010 at 04:31:22PM +0200, Antti Kantee wrote: A big problem with the XIP thread is that it is simply not palatable. It takes a lot of commitment just to read the thread, not to mention putting out sensible review comments like e.g. Chuq and Matt have done. The issue is complex and the code involved is even more so. However, that is no excuse for a confusing presentation. It seems like hardly anyone can follow what is going on, and usually that signals that the audience is not the root of the problem.

If the conversation's leading participants adopt the rule that they may not introduce a new term ("pager ops") or symbol ("pgo_fault") to the discussion until a manual page describes it, then we will gain some useful kernel-internals documentation, and the conversation will be more accessible. :-)

Those concepts are carefully documented, if nowhere else, at least in the uvm dissertation. Basically a pager is involved in moving things between memory and whatever the va is backed with (swap, a file system, ubc, ...). There's pgo_get, which pages data from the backing storage to memory (*), and pgo_put, which does the opposite. Additionally there's pgo_fault, which is like pgo_get except the interface allows the method a little more freedom in how it handles the operation. ... but i don't know if that's a helpful explanation unless you are familiar with pagers, which is why it is very difficult to produce succinct documentation on the subject -- everyone learns to understand it a little differently.

*) obviously in the case of XIP "to" is a matter of mapping instead of transferring

But, the problem was not so much the use of terminology as it was the lack of any clear focus on the direction. I can't form a clear mental image of the project, although admittedly I didn't even finish reading the earlier thread yet.
Like gimpy said, the diff is a big piece to swallow since it's so full of unrelated parts: 1) man pages 2) new drivers 3) vm 4) vnode pager 5) MD collateral. Then again, it's missing pieces (what's pmap_common.c? and isn't that a slight oxymoron?). The diff would be much more browsable if it was separated into pieces and the man pages attached as rendered versions. Although reading the diff is quicker than reading the previous thread ;)

A radically different implementation at this stage seems feasible only if there is a strong reason for it based on another actually existing implementation (in another OS, of course). Beauty issues aside, can we have a summary of the current implementation of XIP from a functional perspective, i.e. what works and what doesn't? That's what users care about ...
Re: Capsicum: practical capabilities for UNIX
On Tue Oct 26 2010 at 13:04:30 +0200, Jean-Yves Migeon wrote: On Mon, 25 Oct 2010 20:13:16 -0500, David Young dyo...@pobox.com wrote: I've been wondering if the dynamic linker could simulate access to the global namespace by supplying alternate system-call stubs. Say rtld-elf-cap supplies its own open(2) stub, for example, that searches Capsicum's fdlist for a suitable file descriptor on which to call openat(2):

	int
	open(const char *path, int flags, mode_t mode)
	{
		const char *name;
		int fd;

		for (name, fd in fdlist) {
			if (path is-under-directory name)
				return openat(fd, path, flags, mode);
		}
		errno = ENOENT;
		return -1;
	}

That would only work with dynamic executables. Sandboxing static executables that way will not work.

Less obviously and more dangerously, it will not work for syscalls done from libc (cf. rpc code in rump nfsd). Maybe it's possible to link libc.so so that the linker doesn't resolve unresolved symbols at that stage, but I haven't investigated that path. [i didn't read this thread, at least not yet, so apologies if that was mentioned earlier]
Re: [ANN] Lunatik -- NetBSD kernel scripting with Lua (GSoC project results)
On Tue Oct 05 2010 at 18:24:48 -0300, Lourival Vieira Neto wrote: Hi folks, I'm glad to announce the results of my GSoC project this year [1]. We've created support for scripting the NetBSD kernel with Lua, which we call Lunatik. It is composed of a port of the Lua interpreter to the kernel, a kernel programming interface for extending subsystems, and a user-space interface for loading user scripts into the kernel. You can see more details in [2]. I am currently working on the improvement of its implementation, on the documentation, and on the integration between Lunatik and other subsystems, such as npf(9), to provide a real usage scenario.

Cool. I'm looking forward to seeing your evaluation of real usage scenarios. If you can find some existing policy code written in C and convert it to lua, it would make a strong case. The main metric I'm interested in is convenience, and performance to some degree depending on what kind of places you plan to put lua scripts in. At least in the packet filter use case the performance is quite critical. I don't know how well the fibonacci example performs (and the performance is not very critical there), but I'm sure you'll agree that from the convenience pov it is a very strong case _against_ lua ;) (yes, I realize it's not provided for demonstrating convenience)

- antti
Re: [ANN] Lunatik -- NetBSD kernel scripting with Lua (GSoC project
On Tue Oct 12 2010 at 02:17:35 -0300, Lourival Vieira Neto wrote: On Tue, Oct 12, 2010 at 1:50 AM, David Holland dholland-t...@netbsd.org wrote: On Tue, Oct 12, 2010 at 12:53:10AM -0300, Lourival Vieira Neto wrote: A signature only tells you whose neck to wring when the script misbehaves. :-) Since a Lua script running in the kernel won't be able to forge a pointer (right?), or conjure references to methods or data that weren't in its environment at the outset, you can run it in a highly restricted environment so that many kinds of misbehavior are difficult or impossible. Or I would *think* you can restrict the environment in that way; I wonder what Lourival thinks about that.

I wouldn't say better =). That's exactly how I'm thinking about addressing this issue: restricting access to each Lua environment. For example, a script running in packet filtering should have access to a different set of kernel functions than a script running in process scheduling.

...so what do you do if the script calls a bunch of kernel functions and then crashes? if a script crashes, it raises an exception that can be caught by the kernel (as an error code).. Right... so how do you restore the kernel to a valid state? Why wouldn't it be a valid state after a script crash? I didn't get that. Can you exemplify it?

I *guess* what David means is that to perform decisions you need a certain level of atomicity. For example, just drawing something out of a hat: if you want to decide which thread to schedule next, you need to make sure the selected thread object exists over fetching the candidate list and the actual scheduling. For this you use a lock or a reference counter or whatever. So if your lua script crashes between fetching the candidates and doing the actual scheduling, you need some way of releasing the lock or decrementing the refcounter.
While you can of course push an error branch stack into lua or write the interfaces to follow a strict model where you commit state changes only at the last possible moment, it is additional work and probably quite error-prone. Although, on the non-academic side of things, if your thread scheduler crashes, you're kinda screwed anyway.
Re: something really screwed up with mmap+ffs on 5.0_STABLE
On Wed Sep 01 2010 at 15:23:42 +0200, Thomas Klausner wrote: On Tue, Aug 17, 2010 at 11:52:31PM +0300, Antti Kantee wrote: It would be great if someone could confirm or debunk this on -current and for archs beyond i386. Just get the latest sources, go to sys/rump/net/lib/libshmif, comment out line 61 (the one with PREFAULT_RW) from if_shmif.c, make && make install, and run tests/net/icmp/t_ping floodping in a loop. You should see a coredump within a few thousand iterations (a few minutes) if the problem is there.

I think you mean if_shmem.c.

Something like that, with a fuzzy match.

I just tested this on 5.99.39/amd64.

	# ./t_ping floodping
	got 0/1 passed
	# while true; do ./t_ping floodping; done
	panic: kernel diagnostic assertion "busmem->shm_magic == SHMIF_MAGIC" failed: file "if_shmem.c", line 287
	panic: kernel diagnostic assertion "busmem->shm_magic == SHMIF_MAGIC" failed: file "if_shmem.c", line 287

Thanks. This has been analyzed and fixed by chuq already. He said he just needs more time to verify the fix is correct. It's essentially kern/40389, but it turns out the hack I wrote to get 5.0 out wasn't quite complete. What goes around comes around ...
Re: [RFC] perfuse permission checks
[resending with tech-kern included]

On Sat Aug 28 2010 at 05:52:54 +0200, Emmanuel Dreyfus wrote: Hello, I just committed code in libperfuse to check permissions on various file operations. On each operation come weird questions such as "do I need r-x on the parent directory, or just --x?". While I attempted to experiment around it, I am pretty sure there are bugs left behind. Can anyone review the code? It is in src/lib/libperfuse/ops.c (search for calls to no_access).

Usually it's enough and easier to perform access checks from lookup (plus setattr). I don't know what glusterfs does, though, and what kind of races are present due to the distributed nature if you just check access in lookup.
Re: 16 year old bug
On Mon Aug 23 2010 at 13:53:40 +0200, Christoph Egger wrote: ... has been found by OpenBSD. Their commit message:

Fix a 16 year old bug in the sorting routine for non-contiguous netmasks. For masks of identical length rn_lexobetter() did not stop on the first non-equal byte. This leads rn_addroute() to not detecting duplicate entries and thus we might create a very long list of masks to check for each node. This can have a huge impact on IPsec performance, where non-contiguous masks are used for the flow lookup. In a setup with 1300 flows we saw 400 duplicate masks and only a third of the expected throughput.

The patch is attached. Any comments?

The test for this is missing.
Re: something really screwed up with mmap+ffs on 5.0_STABLE
[whoops, resending with tech-kern cc'd]

On Thu Aug 19 2010 at 11:17:55 +0100, Patrick Welche wrote: On Tue, Aug 17, 2010 at 11:52:31PM +0300, Antti Kantee wrote: On Tue Aug 17 2010 at 19:06:38 +0300, Antti Kantee wrote: It would be great if someone could confirm or debunk this on -current and for archs beyond i386. Just get the latest sources, go to sys/rump/net/lib/libshmif, comment out line 61 (the one with PREFAULT_RW) from if_shmif.c, make && make install, and run tests/net/icmp/t_ping floodping in a loop. You should see a coredump within a few thousand iterations (a few minutes) if the problem is there.

Sure enough:

Cool, thanks for confirming.

	826:arp info overwritten for 1.1.1.10 by b2:a0:61:b4:bc:6f

I forgot to mention to remove the busfile in between runs. Otherwise tests will pick up on traffic from an old test run. But this doesn't affect the result we're after, just creates noise.

	panic: kernel diagnostic assertion "busmem->shm_magic == SHMIF_MAGIC" failed: file "/sys/rump/net/lib/libshmif/if_shmem.c", line 285
	Abort trap (core dumped)
	while ( 1 )
	827:panic: kernel diagnostic assertion "sp.sp_len <= BUSMEM_DATASIZE" failed: file "/usr/src/sys/rump/net/lib/libshmif/shmif_busops.c", line 135
	Abort trap (core dumped)
	while ( 1 )
	828:panic: kernel diagnostic assertion "busmem->shm_magic == SHMIF_MAGIC" failed: file "/sys/rump/net/lib/libshmif/if_shmem.c", line 388

though again on i386.

5.99.latest?
something really screwed up with mmap+ffs on 5.0_STABLE
I've been looking at some quite weird behaviour with mmapped files on ffs. I want to concentrate on something else for a while, so here's a brain dump of what I've been struggling with recently, in case it rings a bell for someone or they even know the solution.

Background: The shmif rump driver provides a networking backend using the old mmap-a-file-to-get-a-handle trick.

Observations: Most of the time the problem is that the first 16k of the bus file gets corrupted. The underlying fs blocksize is 32k. I have verified that: a) it does not get written to by the involved processes per ktrace -i b) processes do not overwrite random memory by having a PROT_NONE red zone in front. This problem does not happen on tmpfs. I don't believe there is a timing issue because I've run the test tens of thousands of times with varying background load. Zero-filling the bus file with write() instead of creating a sparse file with truncate doesn't make much of a difference either. I was almost sure it was a problem with the genfs sawhole code, but nope. Usually after the bus has seen one generation (i.e. the pages have been faulted in to all processes) there are no further problems. However, causing (read) faults from a 3rd party process not involved with the test may trigger the problem.

The really spooky stuff: Seems like it's possible to get two views into the same file depending on read/write or mmap access (whatever happened to mr. ubc???). Can someone explain this:

	./dumpbus-mmio -h thank-you-driver-for-getting-me-here
	bus version 2, lock: 0, generation: 431, firstoff: 0x5a95a, lastoff: 0x5a8ea
	./dumpbus-read -h thank-you-driver-for-getting-me-here
	dumpbus-read: thank-you-driver-for-getting-me-here not a shmif bus

i.e. same file, but the magic number doesn't match when not using mmap. hexdump uses read() (per ktrace), so I get the garbage version of the file with it and can confirm it indeed has garbage in it.
The only difference between the two programs is this:

#if 1
	read(fd, buf, BUFSIZE);
	bmem = (void *)buf;
#else
	busmem = mmap(NULL, sb.st_size, PROT_READ, MAP_FILE|MAP_SHARED, fd, 0);
	if (busmem == MAP_FAILED)
		err(1, "mmap");
	bmem = busmem;
#endif

However, I can restore the old version using cp (since it uses mmio):

./dumpbus-read -h thank-you-driver-for-getting-me-here
dumpbus-read: thank-you-driver-for-getting-me-here not a shmif bus
cp thank-you-driver-for-getting-me-here backup
./dumpbus-read -h backup
bus version 2, lock: 0, generation: 431, firstoff: 0x5a95a, lastoff: 0x5a8ea

How-to-repeat: Get tests/net/icmp from -current and run ./t_ping floodping in a loop from ffs. You should see the problem within a few thousand iterations. Most likely the shmif code will encounter an invariant failure, such as:

panic: kernel diagnostic assertion "busmem->shm_magic == SHMIF_MAGIC" failed: file if_shmem.c, line 391

I plan to update to latest -STABLE soon and see if the problem is still present there. Guess I'll reboot now...
Re: Using coccinelle for (quick?) syntax fixing
On Mon Aug 09 2010 at 11:20:29 +0200, Jean-Yves Migeon wrote: It is "error-prone" in the sense that it can raise false positives. But when you get more familiar with it, you can either fix the cocci patch (easy for __arraycount, I missed one of the cases... less obvious for aprint stuff), and proofread the generated patch. I really dislike untested wide-angle churn, especially if there is 0 measurable gain. Converting code to __arraycount is a prime example. The only benefit of __arraycount is avoiding typing and therefore typos. Neither of those apply when doing a churn. (there are subjective beauty values, but every C programmer knows the sizeof/sizeof idiom, which is more than what can be said about __arraycount) Examples of measurable benefit are good. Encouraging churn is less good, even if spatch-churn is a million times better than sed-churn. I used these examples to get familiar with it; it starts getting useful when you try to find out buggy code, like double free() in the same function, mutex_exit() missing in a branch before returning, etc. Static analysis is good. However, it might take quite a bit of effort to get the rules general enough so that they trigger in more than one file and specific enough so that you don't get too many false positives. Just to give an example, the ffs allocator routines don't release the lock in an error branch. I remember coccinelle had problems with cpp. Any code which uses macros to skip C syntax will fail silently. procfs comes to mind here. Also, I remember it using so much memory when given our kernel source that I could not finish a rototill and had to use it in combo with find/grep. That said, if $someone can produce a set of rules which showably find bugs in NetBSD code and do not produce a lot of false positives, I'm very interested in seeing nightly runs. ... especially if there are no TAILQ false positives ;)
Re: fd code multithreaded race?
On Wed Aug 04 2010 at 13:21:07 +, Andrew Doran wrote: On Sat, Jul 31, 2010 at 08:31:19PM +0300, Antti Kantee wrote: Hi, I'm looking at a KASSERT which is triggering quite rarely for me (in terms of iterations): panic: kernel diagnostic assertion "dt->dt_ff[i]->ff_refcnt == 0" failed: file /usr/allsrc/src/sys/rump/librump/rumpkern/../../../kern/kern_descrip.c, line 856 Upon closer examination, it seems that this can trigger while another thread is in fd_getfile() between upping the refcount, testing for ff_file, and fd_putfile(). Removing the KASSERT seems to restore correct You're right there, the KASSERT() is wrong, it should be removed. Thanks, I'll do that. operation, but I didn't read the code far enough to see where the race is actually handled and what stops the code from using the wrong file. FYI the fdfile_t (per-descriptor records) are stable for the lifetime of the process; what each record describes can and does of course change, and how those records are pointed to does change (fdtab_t). There isn't really a concept of "wrong file", as in, the app gets what it asked for. It is free to ask for the wrong thing, and it's free to ask for the right thing at the wrong time, etc - that's its problem. Unless you're alluding to another bug? Not really. I just started thinking about how applications can make sure they use the right file descriptor. It seems using close() to notify other threads of a file descriptor being closed is racy. So something naive like this:

t1: lock
t1: get fd1
t1: unlock
	/* t1 wants to do a syscall with fd1 but is preempted */
t2: lock
t2: close fd1
t2: unlock
t3: lock
t3: open, result fd1
t3: unlock
t1: syscall fd1

... will give you the wrong result. Essentially there is no interlock from the application lookup to the kernel backing object lookup. So I guess if you want things to work correctly, instead of close() you need to dup2() to a zombie/deadfs fd and wait for all threads to check in before you can close it.
(i assume dup2 is atomic) Never realized file descriptors and threads were so tricky ;)
Re: Modules loading modules?
On Mon Aug 02 2010 at 16:30:03 +1000, matthew green wrote: this is an incomplete reading of the manual page, and you can not use mutex_owned() the way you are trying to (regardless of what pooka posted.) you can't even use it in various forms of assertions safely. from the man page: "It should not be used to make locking decisions at run time, or to verify that a lock is not held." That's the mantra, yes. ie, you can not even KASSERT(!mutex_owned()). Strictly speaking you can in a case where you have two locks that are always taken as l1,l2 and released as l2,l1, provided you're not dealing with a spin mutex. Does it make any sense? no (l2 is useless). Is it possible? yes. Now, on to sensible stuff. I'm quite certain that warning was written to make people avoid writing bad code without understanding locking -- if you need to use mutex_owned() to decide if you should lock a normal mutex, your code is broken. The other less likely possibility is that someone plans to change mutex_owned in the future. Further data point: the same warning is in rw_held, yet it was used to implement recursive locking in vlockmgr until its very recent demise. Ignoring man page mantras and focusing on how the code works, I do not see anything wrong with Paul's use of mutex_owned(). but i'm still not sure why we're going to such lengths to hold a lock across such a heavy weight operation like loading a module. that may involve disk speeds, and so you're looking at waiting millions of cycles for the lock. aka, what eeh posted. It's held for millions of cycles already now and nobody has pointed out measurable problems. But if it is deemed necessary, you can certainly hide a cv underneath. The efficiency requirements for the module lock are probably anyway in the "who cares wins ... not" spectrum. At least I'm not aware of any fastpath wanting to use it. Anyway, no real opinion there. A cv most likely is the safe, no-brainer choice.
Re: fd code multithreaded race?
On Sat Jul 31 2010 at 20:31:19 +0300, Antti Kantee wrote: Hi, I'm looking at a KASSERT which is triggering quite rarely for me (in terms of iterations): panic: kernel diagnostic assertion "dt->dt_ff[i]->ff_refcnt == 0" failed: file /usr/allsrc/src/sys/rump/librump/rumpkern/../../../kern/kern_descrip.c, line 856 Upon closer examination, it seems that this can trigger while another thread is in fd_getfile() between upping the refcount, testing for ff_file, and fd_putfile(). Removing the KASSERT seems to restore correct operation, but I didn't read the code far enough to see where the race is actually handled and what stops the code from using the wrong file. How-to-repeat: Run tests/fs/puffs/t_fuzz mountfuzz7 in a loop. A multiprocessor kernel might produce a more reliable result, so set RUMP_NCPU unless you have a multiprocessor host. Depending on timings and how the get/put thread runs, you might even see the refcount as 0 in the core. Does anyone see something wrong with the analysis? If not, I'll create a dedicated test and file a PR. kern/43694, tests/kernel/t_filedesc
Re: Modules loading modules?
On Tue Aug 03 2010 at 02:17:43 +1000, matthew green wrote: Now, on to sensible stuff. I'm quite certain that warning was written to make people avoid writing bad code without understanding locking -- if you need to use mutex_owned() to decide if you should lock a normal mutex, your code is broken. The other less likely possibility is that someone plans to change mutex_owned in the future. Further data point: the same warning is in rw_held, yet it was used to implement recursive locking in vlockmgr until its very recent demise. Ignoring man page mantras and focusing on how the code works, I do not see anything wrong with Paul's use of mutex_owned(). this just does not match my actual experience in the kernel. i had weird pmap-style problems and asserts firing wrongly while i did not obey the rules in the manual directly. Not knowing more details it's difficult to comment. But since you are talking about the pmap, maybe your experiences are with spin mutexes instead of adaptive ones?
Re: Modules loading modules?
On Sat Jul 31 2010 at 15:48:26 -0700, Paul Goyette wrote: If modload-from-modcmd is found necessary, sounds more like a case for the infamous recursive lock. Recursive lock is the way to go. I think the same lock should also cover all device configuration activities (i.e. autoconf) and any other heavy lifting where we have chunks of the system coming and going. Well, folks, here is a first pass at recursive locks! The attached diffs are against -current as of a few minutes ago. Oh, heh, I thought we have recursive lock support. But with that gone from the vfs locks, I guess not (apart from the kernel lock ;). I'm not sure if it's a good idea to change the size of kmutex_t. I guess plenty of data structures have carefully been adjusted by hand to its size and I don't know of any automatic way to recalculate that stuff. Even if not, since this is the only user and we probably won't have that many of them even in the future, why not just define a new type ``rmutex'' which contains a kmutex, an owner and the counter? It feels wrong to punish all the normal kmutex users for just one use. It'll also make the implementation a lot simpler to test, since it's purely MI. separate normal case and worst case
Re: Modules loading modules?
On Sun Aug 01 2010 at 06:10:07 -0700, Paul Goyette wrote: One question: Since an adaptive kmutex_t already includes an owner field, would we really need to have another copy of it in the rmutex_t structure? Good point. I think it's ok to do: if (mutex_owned(mtx)) cnt++ else lock
fd code multithreaded race?
Hi, I'm looking at a KASSERT which is triggering quite rarely for me (in terms of iterations): panic: kernel diagnostic assertion "dt->dt_ff[i]->ff_refcnt == 0" failed: file /usr/allsrc/src/sys/rump/librump/rumpkern/../../../kern/kern_descrip.c, line 856 Upon closer examination, it seems that this can trigger while another thread is in fd_getfile() between upping the refcount, testing for ff_file, and fd_putfile(). Removing the KASSERT seems to restore correct operation, but I didn't read the code far enough to see where the race is actually handled and what stops the code from using the wrong file. How-to-repeat: Run tests/fs/puffs/t_fuzz mountfuzz7 in a loop. A multiprocessor kernel might produce a more reliable result, so set RUMP_NCPU unless you have a multiprocessor host. Depending on timings and how the get/put thread runs, you might even see the refcount as 0 in the core. Does anyone see something wrong with the analysis? If not, I'll create a dedicated test and file a PR.
Re: Modules loading modules?
On Sun Jul 25 2010 at 15:17:29 -0700, Paul Goyette wrote: On Mon, 26 Jul 2010, matthew green wrote: it seems to me the root problem is that module_mutex is held while calling into the module startup routines. thus, the right solution is to remove this requirement. Yes, that's what is needed. I'm far from convinced that's a good idea. First, it will probably make the module code a nightmare -- what happens when you have multiple interleaved loads, some of which fail at some point in their dependency stack, and let's just throw in a manual modunload to mix up things further. Second, and pretty much related to number one, it goes against one of the most fundamental principles of robust code: atomic actions. If modload-from-modcmd is found necessary, sounds more like a case for the infamous recursive lock. (no comment on the actual problem)
Re: power management and pseudo-devices
On Mon Jul 19 2010 at 01:28:42 +, Quentin Garnier wrote: On Sun, Jul 18, 2010 at 05:31:42PM -0700, Paul Goyette wrote: On Mon, 19 Jul 2010, Quentin Garnier wrote: Include ioconf.h I think. Tried that. It works for compiling the kernel. Unfortunately, swwdog is included in rump, and there doesn't seem to be an ioconf.h available for the librump build. Well, whatever. I don't think I want to look at that. :) Actually, I still think this is wrong. It might make librump compile and even link, that doesn't mean it will be usable if nothing ever creates that extern symbol. You'll have to check with pooka or explore the code, but there have to be some components of rump that have partial configuration files. After all, he created the ioconf directive for that purpose. First of all, let's take a step up from the trenches and try to understand the problem we're dealing with instead of trying to arbitrarily guess how to fix the build. We have two ways of building kernel code: monolithic (./build.sh kernel=CONF) and modular (kernel modules, rump components). The latter ones currently always do config stuff on-demand, so changes cause breakage. Should a swwdog kernel module exist (why isn't there one?), it would run into the same problem. Now, the ioconf keyword for config(1) is meant to help build modular kernel code by allowing to specify a partial config file. Currently, as the name implies, it takes care of creating ioconf.[hc], namely in this case struct cfdriver swwdog_cd. Adding a SYSMON.ioconf will solve the problem:

=== snip ===
ioconf sysmon
include "conf/files"
pseudo-device swwdog
=== snip ===

The good news is that if some day a sysmon kernel module is added, the exact same ioconf can be used without having to once again run into trouble. Then let's view the broader scale. I think acpibat is currently the only kernel module using ioconf, and I haven't bothered converting others since I have realized that the scope of ioconf was a little too narrow.
I'm planning to change the keyword to "module" and add support for source files. This way the default build for every modular kernel component goes through config, and we can avoid issues due to config changes. IIRC I have the config(1) part for this done already, but being able to use an autogenerated SRCS list requires some bsd.subdir.mk style advanced make hackery which I haven't been in the mood for. Plus, there's of course the mess with "file some.c foo | bar (baz ^ xyzzy) !!!frobnitz". Finally, on the eternal "someone should" astral plane, someone should fix the kernel build to consist of building a set of modules and linking them together, so we don't have more than one way to skin a kernel.
Re: CVS commit: src/tests/net/icmp
On Mon Jul 12 2010 at 01:51:54 +0200, Jean-Yves Migeon wrote: Anyway, the solution as usual is to work the problem from both ends (improve the server methods and the kernel drivers) and perform a meet-in-the-middle attack at the sweet spot where nothing is lost and everything is gained. The cool thing about working on NetBSD is that we can actually do these things properly instead of bolting some hacks on top of a black-magic-box we're not allowed to touch. Although I'm not familiar with the Xen hypercall interface, I assume it to be infinitely more well-defined than unix process-kernel interaction with no funny bits like fiddling about here and there just because the kernel can get away with it. Yes; however, note that the Xen hypercalls are not expected to be as feature-rich as a POSIX process kernel interface. It is vastly simpler, but it is also poorer (the complexity is left as an exercise to the tasks above it). Anyway, you will face the exact same issue as yours with puffs and pud. The Xen hypercalls are close to x86 semantics; at this layer, you have lost most of the higher level semantics. It's not the same thing. Correct me if I misunderstood, but I thought you want to port/adjust/whatever rumpuser to the xen hypercall interface. Since the xen hypervisor interface, as opposed to a posix process environment, is designed for hosting an OS, you can do lowlevel ops as you'd expect to do them instead of having to think about high-level semantic meaning. ... well, at least in theory, since we can't measure it yet. Plus I'm not ultrafamiliar with Xen (read: not familiar at all), so there might be issues I don't foresee. And there always are. But business as usual: only one way to find out ;)
Re: CVS commit: src/tests/net/icmp
On Sat Jul 10 2010 at 12:30:07 +0200, Adam Hamsik wrote: 8) Is it possible to run rump_exec in rump ? e.g. to boot rump kernel and start init by it ? What are you trying to accomplish? Generally no, in a special case yes. I've been doing a little work in that area and I have a syscall server which can support a process's basic syscall requests in a rump kernel. But that's a very boring approach, since it requires host kernel support. I think the process should know where it wants its requests to be serviced. I thought about something like userspace virtualization (zones, jails) based on top of rump. Given that the idea of jails/zones is to limit a userspace process, doing this in a userspace process is not the obvious route. It probably could be done with a software-isolated process, but we are desperately not there with our toolchain. Another choice would be to port rumpuser on top of the Xen hypervisor interface, like jym recently envisioned. Even so, rump is about virtualizing the kernel, not the user interface layer. Given that jails/zones is a well-understood technology with at least some sort of NetBSD implementation already done, why not go the obvious route and finish that off?
Re: CVS commit: src/tests/net/icmp
On Sun Jul 11 2010 at 16:49:59 +0200, Jean-Yves Migeon wrote: On 11.07.2010 15:00, Antti Kantee wrote: On Sat Jul 10 2010 at 12:30:07 +0200, Adam Hamsik wrote: Given that the idea of jails/zones is to limit a userspace process, doing this in a userspace process is not the obvious route. It probably could be done with a software-isolated process, but we are desperately not there with our toolchain. Another choice would be to port rumpuser on top of the Xen hypervisor interface, like jym recently envisioned. Let me get a bit more precise here :) the purpose is not to offer container-like virtualization, but rather to have a finer grained approach, close to microkernels, with small processes/tasks that perform a specific functionality. What I would like to do is to get rid of the big dom0 uber-privileged domain that you encounter in hypervisor-based virtualization, by having smaller, isolated domains that perform specific tasks (one for block device access, another for network, device drivers, and so on). Without requiring to integrate yet_another_monolithic_yet_modular_linux_kernel in. I didn't mean to say you suggested to offer virtualization containers. Sorry. I merely intended to say you had the desire of using the Xen hypercall interface. Although I must say now I understand more clearly why you wanted to do that. Frankly, I have no idea how this would perform; basically, dom0 can be considered as one big uber-privileged domain, which is as critical as the hypervisor itself; if it crashes, or gets compromised, the system is entirely crippled. Purpose is to avoid a contamination of the whole dom0 context if only one of its part is buggy, and one requirement is to get it as small as possible. "perform"? Are you using that term for execution speed, or was it accidentally bundled with the rest of the paragraph? Even so, rump is about virtualizing the kernel, not the user interface layer.
Given that jails/zones is a well-understood technology with at least some sort of NetBSD implementation already done, why not go the obvious route and finish that off? I think he was referring to using a rump kernel as a syscall proxy server rather than having in-kernel virtualization like jails/zones. That would make sense, you already have proxy-like feature with rump. I'm not so sure. That would require a lot of kernel help to make everything work correctly. The first example is mmap: you run into it pretty fast when you start work on a syscall server ;) That's not to say there is not synergy. For example, a jail networking stack virtualized this way would avoid having to go over all the code, and reboot would be as simple as kill $serverpid. Plus, more obviously, it would not require every jail to share the same code, i.e. you can have text optimized in various ways for various applications.
Re: CVS commit: src/tests/net/icmp
On Fri Jul 09 2010 at 18:00:05 +0200, Adam Hamsik wrote: Let me add some of my questions about rump :) 6) How are device nodes managed inside rump when e.g. /dev/mapper/control created by libdevmapper rump lib. just as expected ... (?) Can you elaborate the question? 7) Does RUMP support multiprocessor setup ? e.g. Can I boot rump kernel in SMP mode and do I need SMP machine for that ? Yes, on i386 and amd64. Others would be ~trivially possible too (even ones where the host does not support SMP), but I haven't bothered to go into battle with some arch-specific headers and macros. Probably would be a few hours of tweaking to get all archs working. By default the number of virtual CPUs configured into a rump kernel is the same as the number of CPUs present on the host. However, you are free to pick anything from 1 to MAXCPUS. As I've noted before, unicpu on an SMP host is cool because you can optimize bus locking away from kernel work which can be isolated. This can provide a performance boost of tens of percent. The other way (i.e. SMP rump kernel on a unicpu host) is used by e.g. tests/fs/tmpfs/t_renamerace:renamerace2. The default qemu setup used by anita is unicpu, and the race it is trying to trigger did not happen with only one virtual CPU, so upping the rump configuration to have more CPUs was the ticket. Yes, you can specify an arbitrary number of CPUs to qemu, but that tends to slow down execution quite dramatically (as in several times slower). In contrast, with rump there is no slowdown (apart from all virtual CPUs having to take clock interrupts, which is negligible unless you run at an insane HZ). 8) Is it possible to run rump_exec in rump ? e.g. to boot rump kernel and start init by it ? What are you trying to accomplish? Generally no, in a special case yes. I've been doing a little work in that area and I have a syscall server which can support a process's basic syscall requests in a rump kernel.
But that's a very boring approach, since it requires host kernel support. I think the process should know where it wants its requests to be serviced.
Re: CVS commit: src/tests/net/icmp
On Thu Jul 08 2010 at 23:22:44 +0200, Thomas Klausner wrote: [redirected from source-changes-d to a hopefully more suitable mailing list] On Mon, Jul 05, 2010 at 12:26:17AM +0300, Antti Kantee wrote: I'm happy to give a more detailed explanation on how it works, but I need one or two questions to determine the place where I should start from. I'm planning a short article on the unique advantages of rump in kernel testing (four advantages by my count so far), and some questions now might even help me write that one about what people want to read instead of what I guess they'd want to read. I looked at the tests some more (tmpfs race, and the interface one from above). I think I can read them, but am unclear on some of the basic properties of a rump kernel. Hi, good questions. For example: 1. Where is '/'? Does it have any relation to the host system's '/'? Is it completely virtual in the memory of the rump kernel? From a practical perspective, it's in the same place as '/' on e.g. a qemu instance or xen domu: somewhere. By default it's in memory, but you can mount any file system as '/' over rumpfs (default rootfs). Of course this is partially a trick question, since a rump kernel does not necessarily have a '/' at all. Running a configuration without file systems at all can save quite a bit of memory, and can be the difference between 50k and 100k nodes in a virtual network (I've only tested up to a few hundred nodes on my scrawny laptop, but I've done calculations ... I'm sure you can appreciate calculations ;). In that case any rump system calls attempting to use VFS will fail with ENOSYS. 2. Do I understand correctly that for e.g. copying a file from the host file system into a rump kernel file system, I would use read and rump_sys_write? Well, yes and no. It depends on which namespace you are making the calls from. If you are in the host namespace (i.e. not inside the rump kernel), you can do that.
The paths given to rump_sys_open() are ones relative to the rump kernel '/' (or whatever you've chrooted to inside the rump kernel), and then you just use the file descriptor as usual. If you are inside the rump kernel, you can access the host file system namespace with etfs, extra terrestrial file system, with which you can establish mappings from the rump kernel namespace to the host namespace. For example, the rump_foofs utils use this to configure a virtual block device pointing to the host, so when I type rump_ffs /home/pooka/ffs.img /mount even though VFS_MOUNT() operates inside the rump kernel, the device file for the mount is still used from the host (and, etfs can also just report it as a block device, so you don't need any of the vnconfig nonsense). 3. Similarly for network interfaces -- open a socket with socket(2) or rump_socket (or so) and copy bytes with read/rump_sys_write? I'm not quite sure what you want to copy from and where. If you connect() to a network service inside the rump kernel, you access it from the host with read/write (or send/recv) just like any other peer. If you rump_sys_connect(), you use rump_sys_read/rump_sys_write(). I probably should point out that rump has two different networking configurations: a full networking stack and what I call sockin. The prior is exactly what you'd expect: interface, tcp/ip, sockets and a unique IP (or other) address. This can be a hassle sometimes when you want to use networking from the rump kernel and do not have a separate IP address or simply just don't have root privileges to configure a tap interface. sockin registers at the protocol layer in the kernel and pretends to be an inet domain. What it does is just map requests to the host sockets. So e.g. PRU_CONNECT does connect() _on the host_. This is helpful for cases where you need networking (e.g. rump_nfs and rump_smbfs), but do not want the hassle and administrative boundary of configuring a separate address. 4.
Could you NFS export the rump kernel file system to the host? (Probably better to a second rump kernel...) Yes. When I make changes which affect nfs, I test them by running one rump kernel with the nfs server and one instance of rump_nfs (the latter using sockin, i.e. effectively the rump kernel nfs exports to the host). This way I get a two-machine illusion -- naturally, since the nfs client is quite finicky, I don't want to use mount_nfs for testing on my desktop. nfs itself presents one of the unsolved issues with rump: the division between rump kernel and host kernel is done at the syscall level: foo() or rump_sys_foo(). However, for libraries foo() is already hardcoded. This is especially problematic for libc, since even LD_PRELOAD will not help. There are a few different things I've been playing around with here, but I won't detour into verbose explanations of them in this email. The whole issue is explained here (and generally in the thread): http://mail-index.netbsd.org/tech-kern/2009/10/16/msg006276.html
Re: Enabling built-in modules earlier in init
On Wed Jun 16 2010 at 15:36:30 -0700, Paul Goyette wrote: The attached diffs add one more routine, module_init3(), which gets called from init_main() right after module_class_init(MODULE_CLASS_ANY). module_init3() walks the list of builtin modules that have not already been init'd and marks them disabled. Tested briefly on my home systems and appears to work. Any objections to committing this? I'd still hook it to the end of module_class_init(MODULE_CLASS_ANY) instead of adding more randomly numbered module_initn() calls. The other benefit from doing so is that you get it done atomically, which is always worthwhile, and doubly so when it's a low hanging fruit like here.

@@ -416,6 +434,7 @@ module_init_class(modclass_t class)
 		 * init.
 		 */
 		if (module_do_builtin(mi->mi_name, NULL) != 0) {
+			mod->mod_disabled = true;
 			TAILQ_REMOVE(&module_builtins, mod, mod_chain);
 			TAILQ_INSERT_TAIL(&bi_fail, mod, mod_chain);
 		}

Why do you mark it as disabled? Doesn't this conflict with the "it might succeed in a later module_init_class()" idea you presented earlier? module_disabled = true/false in multiple places looks a little error-prone. Now that struct module is growing more and more members, maybe we can just have an object allocator which initializes the value and afterwards the only acceptable mutation for module_disabled is setting it to true (might make sense to rename the variable to something like module_virgin and flip the polarity, though).
Re: Enabling built-in modules earlier in init
On Wed Jun 16 2010 at 04:13:54 -0700, Paul Goyette wrote: With the current ways of secmodel register, I'd be damn careful to not push it around. The effect is that if it's called 0 times, you have a system which allows everything. So if your suggestion is implemented and you're testing a new secmodel which buggily omits register alongside another correctly registering secmodel, things will appear to work fine. But if in some scenario the buggy one is loaded alone, well ... welcome to the wishing well. I had some concern about this as well, wondering if I would be able to be sure I'd found all the secmodel modules that might exist. Especially ones which aren't in src! Perhaps it would be best to retain MODULE_CLASS_SECMODEL and also add the suggested MODULE_CLASS_EARLY? That would be my vote. But "early" is a little vague. What if in the future we want modules which are initialized even earlier. Will those be called MODULE_CLASS_EARLIER_THAN_EARLY? If the class means "initialized before autoconf", why not use that in the name? Also, the modclass id is exported to userland and used as an index to a table in modstat. I think I filed a PR about this being suboptimal. Yeah, I was planning to update modstat(8) as well. The better choice is to update modctl(2) to pass down the information as a proplist. That way even module classes are pluggable and other information is easy to add if necessary. I'm secretly hoping someone will do this before 6.0 ... ;)
Re: Enabling built-in modules earlier in init
On Tue Jun 15 2010 at 17:10:55 -0700, Paul Goyette wrote: Currently, built-in kernel modules are not enabled until very late in the system initialization process, right after we create process #1 for init(8). (As an exception to this, secmodel modules are enabled much earlier.) Unfortunately, this means that built-in modules are not available for use during much of the initialization process, and in particular they are not available during auto-configuration. This means that my recent changes to convert PCIVERBOSE, etc. into kernel modules do not work when the modules are built-in to the kernel! I would like to enable the built-in modules much earlier, at least early enough to have them available during auto-configuration. The attached patch accomplishes this. I have briefly tested the patch, and it seems not to have any unwanted side-effects, but I would appreciate feedback from others who may be more familiar with the init sequence. An alternative, but less desirable approach, would be to create a new class of modules for PCIVERBOSE and friends, and call module_class_init() early on to enable only these few modules. Actually reading the first email in the thread also ... I have to admit I haven't been following your work too closely, but builtin modules are initialized either when all of them are initialized per class or when their initialization is explicitly requested. So if whatever uses PCIVERBOSE requests the load of the PCIVERBOSE module, it should be initialized and you should be fine (see module_do_load()). The only "but" is that explicit loads must be accompanied by MODCTL_LOAD_FORCE. I wrote it that way because of the security use case: if you disable a builtin module due to a security hole, you don't want it to get autoloaded later. For file system modules you can always use rm, but for builtins you don't have that luxury.
So if that is actually what you're choking on, I suggest adding some flag to determine if the module has ever been loaded and ignoring the need for -F if it hasn't.
Re: Enabling built-in modules earlier in init
On Wed Jun 16 2010 at 06:31:59 -0700, Paul Goyette wrote: The attached diffs add a new mod_disabled member to the module_t structure, and set the value to false in each place that a new entry is created. (Since all of the allocations of module_t structures are done with kmem_zalloc() I could probably avoid the explicit setting of the value to false.) The value is set to true whenever a module is removed from active duty and returned to the module_builtin list. (I specifically did NOT mark a module disabled if its modcmd(INIT) failed, under the assumption that it might succeed in a later retry.) Keeping the same security use case in mind, it would be better if, after full module bootstrap (i.e. MODULE_CLASS_ANY), all builtin modules were either initialized or disabled. Otherwise, if we assume that init may later succeed for whatever reason, an operator who checks that a module with a security problem is not activated may be surprised to later find out that the same module has been autoenabled.
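To make the suggested semantics concrete, here is a minimal userspace sketch of the mod_disabled idea discussed above: once a builtin module has been unloaded, only a forced load (the -F / MODCTL_LOAD_FORCE case) may bring it back. All names here (module_sim, EPERM_SIM, the _sim functions) are illustrative stand-ins, not the actual kernel interfaces.

```c
#include <stdbool.h>

#define EPERM_SIM 1	/* stand-in error code */

struct module_sim {
	const char *mod_name;
	bool mod_active;	/* currently initialized */
	bool mod_disabled;	/* unloaded by the operator */
};

/* Loading a disabled builtin requires force (cf. MODCTL_LOAD_FORCE). */
static int
module_do_load_sim(struct module_sim *mod, bool force)
{
	if (mod->mod_disabled && !force)
		return EPERM_SIM;
	mod->mod_active = true;
	mod->mod_disabled = false;
	return 0;
}

/* Unloading marks the builtin disabled so it won't be autoloaded later. */
static void
module_do_unload_sim(struct module_sim *mod)
{
	mod->mod_active = false;
	mod->mod_disabled = true;
}
```

With the extra "has it ever been loaded" flag suggested earlier in the thread, one could additionally skip the force requirement for a builtin that was never unloaded.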
uvm percpu
While reading the uvm page allocator code, I noticed it tries to allocate from percpu storage before falling back to global storage. However, even if allocation from local storage was possible, a global stats counter is incremented (e.g. uvmexp.cpuhit++). In my measurements I've observed that this type of "cheap" stat counting has a huge impact on percpu algorithms, since you still need to load and store a globally contended memory address. Furthermore, uvmexp cache lines are probably more contended than the page queue, so theoretically you get less than half of the possible benefit. I don't expect anyone to remember what the benchmark used to justify the original percpu commit was, but if someone is going to work on it further, I'm curious as to how much gain the percpu allocator produced and how much more it would squeeze out if the global counter was left out. The above example of course applies more generally. When you're going all out with the bag of tricks, i++ can be very expensive ...
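To illustrate the cost being described, here is a small sketch of how such a counter can be kept fully per-CPU: each CPU increments a counter sitting in its own cache line, and the global value is computed only when somebody actually reads the statistic. This is not the NetBSD percpu(9) or uvmexp code; NCPU_SIM and the names are made up for illustration.

```c
#include <stdint.h>

#define NCPU_SIM	4
#define CACHE_LINE	64

struct percpu_counter {
	struct {
		uint64_t count;
		/* pad each slot to a full cache line to avoid false sharing */
		char pad[CACHE_LINE - sizeof(uint64_t)];
	} cpu[NCPU_SIM];
};

/* Cheap per-CPU increment: touches only this CPU's own cache line. */
static inline void
pcc_incr(struct percpu_counter *pcc, int ci)
{
	pcc->cpu[ci].count++;
}

/* Expensive aggregate read, done rarely (e.g. for vmstat/sysctl). */
static uint64_t
pcc_read(const struct percpu_counter *pcc)
{
	uint64_t sum = 0;

	for (int i = 0; i < NCPU_SIM; i++)
		sum += pcc->cpu[i].count;
	return sum;
}
```

The i++ in the text corresponds to an increment of a shared line; the point above is that moving it into the per-CPU slot makes the hot path touch only CPU-local memory.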
Re: Red-black tree optimisation
Hi, On Tue May 04 2010 at 18:20:30 +0200, Adam Ciarciński wrote: Hello, Because at one point I studied red-black trees (not as in dendrology, but as data structures), I looked into the implementation that is being used in NetBSD. I have made some drastic optimisations on sys/sys/tree.h and would like to have the changes imported into the NetBSD repository. I would like someone to take a look at the patch, which is attached to this message, and verify the code. I have also attached a short PDF document, in which I comment on changes made to the implementation of the red-black tree algorithm. If it's okay, I can commit the changes myself. I think we all will benefit from faster and smaller code. :) Can you present numbers to support your claims of drastic optimizations? I've used tree.h in out-of-NetBSD projects and don't mind this being committed. However, I did not review your changes, so I hope you have made 100% sure there are no regressions. Remember that usually the only way to win is not to play at all ;)
Re: Lightweight virtualization - the rump approach
On Tue May 18 2010 at 14:00:59 +0200, Jean-Yves Migeon wrote: Many thanks for answering my questions, Antti. Now that I have sane (and safe) pointers, I have some readings to do. Oh if you just wanted reading, you could have started with the publications linked from http://www.NetBSD.org/docs/rump/. Can't comment on the safety or sanity, though ;) (the web page itself is not quite up-to-date anymore and updating it is, proverbially, on David Holland's todo list) Apparently this year's AsiaBSDCon papers aren't online on 2010.asiabsdcon.org yet. I've put my paper *temporarily* in ftp://ftp.NetBSD.org/pub/NetBSD/misc/pooka/tmp/rumpdev.pdf (anyone reading this from the archives, if that link is dead, check http://2010.asiabsdcon.org/)
Re: Lightweight virtualization - the rump approach
On Thu May 13 2010 at 18:51:16 +0200, Jean-Yves Migeon wrote: I am not posting this to reinstate the decades-old monolithic vs microkernel troll, so please avoid that field; thanks. I hope people on these lists are adult enough to realize that argument is pointless. The correct answer to "which is better" is of course "both" (or "neither", as code I hope gets committed later today or during the weekend will quite measurably demonstrate). Lights on the work of Antti, with rump. Most systems I have seen lately provide the characteristics enumerated above by pulling in a general purpose OS, like Linux, with its environment, just to get a specific need, like an up-to-date network stack (strong push for IPv6, anyone?), drivers (filesystems, usb, pci stacks, devices), etc. There is no real componentization in mind. Everything being virtual these days (see cloud computing buzzwords, or hardware systems delivered with some kind of hypervisor inside - PS3, or sun4v, for example -), I see Antti's work (well, all TNF work too ;) ) as being a real asset to make NetBSD's code base more widely known, appreciated and used. I have yet to see a solution where you could port then debug kernel code directly to userland (at least, for a general purpose OS), or offer an alternative when you need to port specific components, like network functionality or filesystem code. My guess of what is going to happen in the future is that the historic kernel/user boundary and even the OS will mostly go away and you'll just be left with semi-independent virtualized application stacks running on minimal hosts, perhaps on ASICs. The OS is pure unnecessary overhead. To take a food analogy: flavour rules, not whether the food was baked, broiled, sauteed or cooked sous vide (sous vided? ;). For this reason, I have a few questions for the ones familiar with rump technology here, especially: - the basic architecture; how did/do you achieve such functionality? 
Adding an extra layer of indirection within the kernel? There's no artificial extra layer of indirection in the code (and, compared to other virtualization technologies, there's one less at runtime). It's mostly about plugging into key places and understanding what you can use directly from the host (e.g. locking, which is of course very intertwined with scheduling) and what you really don't want to be emulating in userspace at all (e.g. virtual memory). Due to some magical mystery reason, even code which was written 20 years ago tends to allow for easy separation. The other part is more or less completing the work on kernel module support in NetBSD, mostly minor issues with config left (devsw, SRCS) and I've got those somewhat done in a corner of my source tree. Rump components more or less follow the same functional units as kernel modules apart from the rump{dev,net,vfs} factions, which cannot be dynamically loaded and without which a regular kernel would not function. Yes, I decided to use the word "faction" to describe the three midlayers between rumpkern and the drivers. The only real problem is the loosey-goosey use of inlines/macros in unnecessary places. But luckily for me, a lot of the work to clean that up was done by Andy when he made the x86 ports modular. Suppose we have improvements in one part of it, like TCP, IP stacks, could it directly benefit the rumpnet component, or any service sitting above it? Could you elaborate on this question? - What kind of effort would it require to port it to other OS architectures, especially when the API they offer could be a subset of POSIX, or specific low level API (like Xen's hypercalls)? (closely related to the work of Arnaud Ysmall in misc/rump [4]) As you probably know, rump uses the rumpuser hypercall interface to access the hypervisor (which is currently just a fancy name for the userland namespace). It pretty much evolved with the "oh I need this? ok, I'll add it" technique. 
But, if someone wants to experiment with minimal hosts, I think we can work on rumpuser a bit and see what qualities the hosts have in common and what's different. I don't expect supporting rump to be any more difficult than for example Wombat/Iguana L4+UML. In fact, it's probably simpler since with rump there is no notion of things like address space -- that comes entirely from the host. Running directly on top of Xen is an interesting idea, but the first question I have with that is "why?", i.e. what is the gain as opposed to running directly in a process on dom0? The only reason I can think of is that you don't trust your dom0 OS enough, but then again you can't really trust Xen guests either? Besides, rump kernels do not execute any privileged instructions, so Xen doesn't sound like the right hammer. - If rump could be used both for lightweight virtualization (like rump fs servers), or more heavyweight one (netbsd-usermode...)? Usermode = rump+more, although paradoxically rump = usermode+more also
Re: bin/30756: gdb not usable for live debugging of threaded programs
On Thu Apr 22 2010 at 11:18:14 -0400, Paul Koning wrote: Antti pointed out a problem in the patch I originally submitted which causes gdb to go into a loop if the child process exits. The attached updated patch corrects that problem. Yup, your new patch seems to fix that. Thanks again. Just one cosmetic issue now. After finishing, gdb always says: Couldn't get registers: Operation not permitted.
Re: rump and usb, only one ugen getting attached?
On Thu Apr 08 2010 at 02:35:28 +0100, Jasper Wallace wrote: Hi, I'm trying to debug a problem with netbsd and a usb cdc acm device using rump and in the process I can only get rump to attach to ugen0. I can work around this by nailing down ugen0 to a particular usb port in my kernel config, but does rump/ugenhc always only attach to ugen0? UGENHC.ioconf has 4 ugenhc entries so i assume not. Hmm. Make sure the other /dev/ugen device nodes exist. Based on the timestamps I have on my /dev and from reading /dev/MAKEDEV, only ugen0 nodes are created by default. good luck ;)
Re: config(5) break down
On Fri Mar 26 2010 at 13:25:43 +0900, Masao Uebayashi wrote: syntax. I spent a whole weekend to read sys/conf/files, ioconf.c, and module stubs in sys/dev/usb/uaudio.c. I wasted a whole weekend. I've This patch should work and make it easier. No, it doesn't solve dependencies, but gets developers at least halfway there without having to waste weekends (with code). Unfortunately I can't test, since I forgot to buy a usb audio device from Akihabara ;)

Index: dev/usb/uaudio.c
===================================================================
RCS file: /cvsroot/src/sys/dev/usb/uaudio.c,v
retrieving revision 1.117
diff -p -u -r1.117 uaudio.c
--- dev/usb/uaudio.c	12 Nov 2009 19:50:01 -0000	1.117
+++ dev/usb/uaudio.c	26 Mar 2010 06:11:39 -0000
@@ -3065,67 +3065,21 @@ uaudio_set_speed(struct uaudio_softc *sc
 
 MODULE(MODULE_CLASS_DRIVER, uaudio, NULL);
 
-static const struct cfiattrdata audiobuscf_iattrdata = {
-	"audiobus", 0, { { NULL, NULL, 0 }, }
-};
-static const struct cfiattrdata * const uaudio_attrs[] = {
-	&audiobuscf_iattrdata, NULL
-};
-CFDRIVER_DECL(uaudio, DV_DULL, uaudio_attrs);
-extern struct cfattach uaudio_ca;
-static int uaudioloc[6/*USBIFIFCF_NLOCS*/] = {
-	-1/*USBIFIFCF_PORT_DEFAULT*/,
-	-1/*USBIFIFCF_CONFIGURATION_DEFAULT*/,
-	-1/*USBIFIFCF_INTERFACE_DEFAULT*/,
-	-1/*USBIFIFCF_VENDOR_DEFAULT*/,
-	-1/*USBIFIFCF_PRODUCT_DEFAULT*/,
-	-1/*USBIFIFCF_RELEASE_DEFAULT*/};
-static struct cfparent uhubparent = {
-	"usbifif", NULL, DVUNIT_ANY
-};
-static struct cfdata uaudio_cfdata[] = {
-	{
-		.cf_name = "uaudio",
-		.cf_atname = "uaudio",
-		.cf_unit = 0,
-		.cf_fstate = FSTATE_STAR,
-		.cf_loc = uaudioloc,
-		.cf_flags = 0,
-		.cf_pspec = &uhubparent,
-	},
-	{ NULL }
-};
+#include "ioconf.c"
 
 static int
 uaudio_modcmd(modcmd_t cmd, void *arg)
 {
-	int err;
 
 	switch (cmd) {
 	case MODULE_CMD_INIT:
-		err = config_cfdriver_attach(&uaudio_cd);
-		if (err) {
-			return err;
-		}
-		err = config_cfattach_attach("uaudio", &uaudio_ca);
-		if (err) {
-			config_cfdriver_detach(&uaudio_cd);
-			return err;
-		}
-		err = config_cfdata_attach(uaudio_cfdata, 1);
-		if (err) {
-			config_cfattach_detach("uaudio", &uaudio_ca);
-			config_cfdriver_detach(&uaudio_cd);
-			return err;
-		}
-		return 0;
+		return config_init_component(cfdriver_comp_uaudio,
+		    cfattach_comp_uaudio, cfdata_uaudio);
+
 	case MODULE_CMD_FINI:
-		err = config_cfdata_detach(uaudio_cfdata);
-		if (err)
-			return err;
-		config_cfattach_detach("uaudio", &uaudio_ca);
-		config_cfdriver_detach(&uaudio_cd);
-		return 0;
+		return config_fini_component(cfdriver_comp_uaudio,
+		    cfattach_comp_uaudio, cfdata_uaudio);
+
 	default:
 		return ENOTTY;
 	}
Index: modules/uaudio/Makefile
===================================================================
RCS file: /cvsroot/src/sys/modules/uaudio/Makefile,v
retrieving revision 1.1
diff -p -u -r1.1 Makefile
--- modules/uaudio/Makefile	28 Jun 2008 09:14:56 -0000	1.1
+++ modules/uaudio/Makefile	26 Mar 2010 06:11:39 -0000
@@ -5,6 +5,7 @@
 .PATH:	${S}/dev/usb
 
 KMOD=	uaudio
+IOCONF=	UAUDIO.ioconf
 SRCS=	uaudio.c
 
 .include <bsd.kmodule.mk>
Index: modules/uaudio/UAUDIO.ioconf
===================================================================
RCS file: modules/uaudio/UAUDIO.ioconf
diff -N modules/uaudio/UAUDIO.ioconf
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ modules/uaudio/UAUDIO.ioconf	26 Mar 2010 06:11:39 -0000
@@ -0,0 +1,12 @@
+#	$NetBSD$
+#
+
+ioconf uaudio
+
+include "conf/files"
+include "dev/usb/files.usb"
+
+pseudo-root uhub*
+
+# USB audio
+uaudio*	at uhub? port ? configuration ?
Re: test wanted: module plists
On Mon Mar 08 2010 at 02:37:05 +0000, David Holland wrote: The code for loading a module plist from a file system is messed up in that it calls namei() and then it calls vn_open() on the same nameidata without reinitializing it or cleaning up the previous results. I'm surprised this didn't result in fireworks, but apparently it didn't. The following patch fixes that, and compiles, but I'm not set up to be able to test this -- is there anyone who can do so easily/quickly? When I was playing with that code, I used atf on tests/modules. I can't remember if it tests loading from .prop, but a .prop file isn't exactly hard to create. Dunno what the canonical in-tree feature using this is, though.

Index: kern_module_vfs.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_module_vfs.c,v
retrieving revision 1.3
diff -u -p -r1.3 kern_module_vfs.c
--- kern_module_vfs.c	16 Feb 2010 05:47:52 -0000	1.3
+++ kern_module_vfs.c	8 Mar 2010 02:33:36 -0000
@@ -147,23 +147,18 @@ module_load_plist_vfs(const char *modpat
 	NDINIT(&nd, LOOKUP, FOLLOW | (nochroot ? NOCHROOT : 0), UIO_SYSSPACE,
 	    proppath);
-	error = namei(&nd);
-	if (error != 0) {
-		goto out1;
+	error = vn_open(&nd, FREAD, 0);
+	if (error != 0) {
+		goto out1;
 	}
 	error = vn_stat(nd.ni_vp, &sb);
 	if (error != 0) {
-		goto out1;
+		goto out;
 	}
 	if (sb.st_size >= (plistsize - 1)) {	/* leave space for term \0 */
 		error = EFBIG;
-		goto out1;
-	}
-
-	error = vn_open(&nd, FREAD, 0);
-	if (error != 0) {
-		goto out1;
+		goto out;
 	}
 	base = kmem_alloc(plistsize, KM_SLEEP);

--
David A. Holland
dholl...@netbsd.org
Re: config(5) break down
On Mon Mar 08 2010 at 07:09:07 +0000, David Holland wrote: Meanwhile, I think trying to wipe out all the boolean dependency logic in favor of a big graph of modules and submodules is also likely to make a mess. What happens to e.g. "file ufs/ffs/ffs_bswap.c (ffs | mfs) & ffs_ei", especially given that the ffs code is littered with FFS_EI conditional compilation? You can make ffs_bswap its own module, but that doesn't really serve any purpose. You could try making an FFS_EI module that works by patching the ffs module on the fly or something, and then include ffs_bswap.o into that, but that would be both very difficult and highly gross. You could compile two copies each of ffs and mfs, with and without FFS_EI support, but that wastes space. Or you could make FFS_EI no longer optional, which would be a regression. (FFS_EI isn't the only such option either, it's just one I happen to have already banged heads with.) This one is easy, no need to make it difficult. The NetBSD-supplied module is always compiled with FFS_EI (if you don't like it, you can always compile your own just like you can compile your own kernel now). We don't care about mfs here, since it's not reasonable to want to mount a memory file system in the opposite byte order (technically I guess you could mmap an image instead of malloc+newfs and then mount(MOUNT_MFS), but you might just as well use ffs). Things like wapbl are currently an actual problem, since it is multiply owned (conf/files *and* ufs/files.ufs). The easy solution (and my vote) would be to make vfs_wapbl.c always included in the base kernel. If someone feels it's worth their salt to make it into two modules with all the dependency hum-haa, that would be a good place to start practicing instead of ffs_ei.
Re: Zero page
On Wed Feb 03 2010 at 03:06:00 +0900, Masao Uebayashi wrote: I need to add a zero-page to support XIP. Unallocated blocks are redirected to this. Basically it's a static single page filled with zero.

	void *pmap_zeropage;
	paddr_t pmap_zeropage_phys_addr;

and initialized by pmap.c like:

	pmap_zeropage = (void *)uvm_pageboot_alloc(PAGE_SIZE);
	pmap_zeropage_phys_addr = MIPS_KSEG0_TO_PHYS(pmap_zeropage);

Because it's used publicly (from the coming custom genfs_getpages()), it's defined somewhere like uvm_page.h. Why does it need to be in pmap?
Re: Zero page
On Wed Feb 03 2010 at 03:26:33 +0900, Masao Uebayashi wrote: Why does it need to be in pmap? Actually it doesn't. Probably uvm_page.c is better? Maybe. And it'll be #ifdef XIP'ed. Can't the first XIP device to attach simply allocate it?
Re: Zero page
On Wed Feb 03 2010 at 03:55:29 +0900, Masao Uebayashi wrote: Can't the first XIP device to attach simply allocate it? It's getpages()'s iteration loop which redirects unallocated pages to zero-pages. If we allocate zero-page in device drivers, we have to have an interface which can be retrieved from vnode or mount. Having a well-known global name is simple, but I'm fine with both. I assumed the reason you mentioned #ifdef XIP was because you didn't want to waste a whole page of memory on systems which don't use XIP. So in my suggestion you'd have a global: struct uvm_page *page_of_blues; and then in xip_attach: RUN_ONCE(zeroes, allocate_nothingness); Or something like that (you can even refcount it if you want to be extra-fancy). Then you can just always use the global zeropage in xip getpages() and don't need to recompile your kernel (and reboot!) to support XIP device modules.
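The RUN_ONCE() suggestion above can be sketched in userspace roughly as follows. This is a single-threaded stand-in: the real RUN_ONCE(9) handles concurrent callers, and a kernel would allocate the page with uvm_pageboot_alloc()/uvm_pagealloc() rather than calloc(). All names are illustrative.

```c
#include <stdbool.h>
#include <stdlib.h>

#define PAGE_SIZE_SIM	4096

static void *xip_zeropage;	/* the shared well-known page of zeroes */
static bool xip_zeropage_done;

/*
 * Called from each xip_attach(); only the first call allocates, so
 * systems without XIP devices never pay for the page.
 */
static void *
xip_get_zeropage(void)
{
	/* single-threaded stand-in for RUN_ONCE(&ctl, allocate_nothingness) */
	if (!xip_zeropage_done) {
		xip_zeropage = calloc(1, PAGE_SIZE_SIM);
		xip_zeropage_done = true;
	}
	return xip_zeropage;
}
```

Every later caller, including the getpages() redirection loop, gets the same page back without recompiling or rebooting.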
Re: blocksizes
On Sun Jan 31 2010 at 22:21:52 +0900, Izumi Tsutsui wrote: Can you please test with your 2K MO? It's not easy to test it without working newfs(8) command. (if you need hardware I can send the drive and media..) N.B. newfs doesn't yet know how to deduce sector sizes, you need to use the -S option. newfs(8) doesn't work even with -S 2048 option. (probably it tries to write data at offset not sectorsize aligned) Apparently makefs -S 2048 works, and the resulting image also works only when accessed with 2048 byte simulated sector size (fs-utils with ffs from rump): pain-rustique:29:~ env RUMP_BLKSECTSHIFT=11 fsu_du -o ro testi2.ffs -sck rumpblk: using 11 for sector shift (size 2048) 120830 . 120830 total pain-rustique:30:~ env RUMP_BLKSECTSHIFT=10 fsu_du -o ro testi2.ffs -sck rumpblk: using 10 for sector shift (size 1024) fsu_du: Not a directory But of course makefs uses a file backend, which doesn't care that much about unaligned writes, so the problem you mention still might exist.
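For reference, the RUMP_BLKSECTSHIFT values in the transcript are simply log2 of the sector size (2048 → 11, 1024 → 10, 512 → 9). A sketch of the conversion (the function name is made up, not the rumpblk interface):

```c
/*
 * Convert a sector size to the corresponding shift, so that
 * (1 << shift) == sectsize.  Assumes sectsize is a power of two,
 * as disk sector sizes are.
 */
static unsigned int
sectsize_to_shift(unsigned int sectsize)
{
	unsigned int shift = 0;

	while ((1u << shift) < sectsize)
		shift++;
	return shift;
}
```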
Re: uvm_object::vmobjlock
On Fri Jan 29 2010 at 02:03:23 +, Mindaugas Rasiukevicius wrote: If you are talking about memory not within the object, well, then all bets are off applies. I might argue equally handwavily that you'll cause false sharing with other locks from the mutex obj pool, and even for many many more locks, since you don't even get the protection of the data after the lock being safe. ... Heh? The mutex object pool has a necessary alignment and padding, which guarantees that the lock has its own cache line. That was one of the reasons, besides reference counting, why lock object pool was invented. Ooops. I meant to handwave about how you're now wasting multiple cache lines where previously only one pretty much always uncontended line was required. I'm not convinced at all this is improving performance. Anyway, you get the point.
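For the record, the alignment and padding being referred to look roughly like this (illustrative names, not the actual mutex obj pool code): each pooled lock is padded and aligned so it owns a full cache line. That prevents false sharing between locks, but, as argued above, it also spends a whole line per lock where an embedded lock shared one line with the data it protects.

```c
#include <stdint.h>

#define CACHE_LINE	64	/* assumed line size for this sketch */

struct lock_obj {
	volatile uint32_t lo_locked;	/* the lock word itself */
	uint32_t lo_refcnt;		/* pool reference count */
	/* pad the object out to exactly one cache line */
	char lo_pad[CACHE_LINE - 2 * sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE)));
```

An array or pool of these never has two locks on the same line; the debate above is whether the extra lines are worth it when the embedded lock's line was mostly uncontended anyway.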