Re: Interrupt handlers and mutex
On Thu, Dec 31, 2009 at 01:52:48PM -0600, Frank Zerangue wrote: Help request -- Mutex(9) indicates that mutex replaces the spl(9) system. Here are some general (non-NetBSD-specific) answers based on underlying principles that will hopefully explain the situation better. (1) When writing an interrupt handler, should the handler acquire a spin mutex before modifying some IO that may be accessed also by a LWP? Yes. In general, in a multiprocessor kernel, the interrupt handler will not necessarily run on the same CPU that has been posting requests to the device... and in general more than one CPU may be doing that. It is prohibitively expensive (as well as generally undesirable) to disable interrupts for all CPUs at once. Therefore, disabling interrupts only affects the current CPU, so in order to keep everything from being tied in knots, you need one or more locks. And because you can't sleep in an interrupt handler, these must be spinlocks. Spinlocks that are used from interrupt handlers must themselves also disable interrupts. Otherwise, if a thread holding the spinlock is interrupted by an interrupt handler that tries to acquire the same spinlock, you get a deadlock. For this reason, in essentially all multiprocessor systems, the spinlocks that you use for mutual exclusion in interrupt code disable and re-enable interrupts for you. In NetBSD each mutex can have an interrupt level associated with it; if that interrupt level is not IPL_NONE, acquiring the mutex raises the current interrupt level and releasing the mutex lowers the current interrupt level. And in turn, in such systems one generally never sees or uses the splfoo() functions except in code that hasn't yet been multiprocessorized. Currently in NetBSD the interrupt level is only lowered to zero and only when all spinlocks have been released, instead of any time the necessary interrupt level drops. 
This is to avoid complexity when spinlocks are not released in order, and it mostly makes no practical difference. (It isn't a good idea to embed e.g. a small block using an IPL_HIGH mutex inside a large block using a lower-interrupt-level mutex. But then again, it also isn't a good idea to have a large block disabling interrupts or using a spinlock anyway.) NetBSD also has soft interrupts that have more process context than ordinary interrupt handlers; instead of borrowing the context of whatever's running when the interrupt arrives, they run on dedicated kernel threads; this means they can sleep to acquire mutexes. AIUI, the intended design is that most hard interrupts will do as little work as possible and trigger a soft interrupt to do the rest; this reduces the number of mutexes used from real interrupt handlers and reduces the overall amount of spinning. I'm not entirely up on the exact details at the moment and hopefully someone else will clarify if there are questions. (2) What happens when the interrupt handler cannot acquire the mutex? Will the LWP that holds it ever be able to run again? Define cannot acquire. If the mutex is held by some thread on another processor, the interrupt handler will spin until it's released. If it's never released, the interrupt handler will spin forever, but that's not supposed to happen. If the mutex is held by some thread on the same processor, then you're deadlocked. This is why you have to disable interrupts before attempting to get a spinlock that's used from an interrupt handler. (Note: the exact details depend somewhat on the implementation. But you don't want to be writing code that pushes the boundaries.) (3) Will a LWP that holds a spin mutex be pre-empted by the scheduler? In general, that depends on the interrupt level associated with the spin mutex. -- David A. Holland dholl...@netbsd.org
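The bookkeeping described above (acquiring a spin mutex raises the interrupt level, and the level is only dropped once every spin mutex has been released) can be sketched as a small user-space model. This is only an illustration: the `toy_` names, the `TOY_IPL_*` values and the per-CPU structure are hypothetical and are not the kernel's actual data structures, and restoring the level that was in effect before the first acquisition is an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical interrupt priority levels, lowest to highest. */
enum { TOY_IPL_NONE = 0, TOY_IPL_VM = 1, TOY_IPL_SCHED = 2, TOY_IPL_HIGH = 3 };

/* Toy per-CPU state: current level plus spin-mutex bookkeeping. */
struct toy_cpu {
    int cur_ipl;     /* current interrupt priority level */
    int spin_count;  /* number of spin mutexes currently held */
    int saved_ipl;   /* level in effect before the first acquisition */
};

struct toy_spin_mutex {
    int ipl;         /* interrupt level associated with this mutex */
    bool held;
};

/* Raise the CPU's level to at least m->ipl, then take the lock. */
static void toy_spin_enter(struct toy_cpu *ci, struct toy_spin_mutex *m)
{
    if (ci->spin_count++ == 0)
        ci->saved_ipl = ci->cur_ipl;
    if (m->ipl > ci->cur_ipl)
        ci->cur_ipl = m->ipl;
    /* A real implementation would spin here while another CPU holds it. */
    m->held = true;
}

/* Release the lock; drop the level only when no spin mutex remains held. */
static void toy_spin_exit(struct toy_cpu *ci, struct toy_spin_mutex *m)
{
    m->held = false;
    if (--ci->spin_count == 0)
        ci->cur_ipl = ci->saved_ipl;
}
```

In this model, releasing an inner high-level mutex while an outer lower-level one is still held leaves the CPU at the high level, which is the "mostly makes no practical difference" case mentioned in the message.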
Re: The imperfect beauty of NetBSD
On Wed, Jan 06, 2010 at 10:51:58PM -0600, Peter Seebach wrote: You might like to know about apropos(1): I am told that officially apropos(1) is deprecated, and substituted with man -k. Which does, in fact, say that it's the same thing. I think deprecated may mean little more than some standard somewhere didn't include it. If that. Is there any suitably licensed text search engine that's worth importing to get a real man page search capability? (Er. Off-topic, followups to tech-userlevel I guess...) -- David A. Holland dholl...@netbsd.org
Re: The imperfect beauty of NetBSD
On Thu, Jan 07, 2010 at 12:44:07AM -0500, Alex Goncharov wrote: But: man -S N STRING to work, and man -S N -k STRING not?... I think you're looking for man -s, which works fine. I didn't even know -S existed. It seems that the problem is that -S is defined somewhat differently for man and apropos; I have no idea which one of them (if either) is wrong, but it certainly violates the principle of least surprise the way it is. Please file a PR. -- David A. Holland dholl...@netbsd.org
Re: blocksizes
On Thu, Jan 21, 2010 at 10:30:20PM +, Michael van Elst wrote: IMHO there need to be three different ways to specify block offsets and block counts: 1. in units of blocks of the physical device 2. in units of blocks of DEV_BSIZE bytes 3. in bytes Don't forget: 4. in units of the filesystem block size... and we need to establish what units are used where. IM (fairly strong) O everything should be kept in byte counts, and never block counts because if you have more than one unit in use it is far too easy to accidentally mix them or provide the wrong one, and because they're all the same language-level type there's little hope of detecting such problems automatically. Furthermore, Murphy's Law dictates that in any particular place the count you are given is frequently not in the units you need to give something else, and then you end up converting back and forth all over everywhere. This serves no purpose and tends to obfuscate the code base. Since in practice nothing can be larger than the maximum value of off_t anyway, and all counts should be getting carried around as 64-bit values, using byte counts instead of block counts does not change the maximum addressable size of anything and therefore has no particular downside. Things should only be converted to block counts right when talking to hardware, in which case the correct size to use is immediately available, or right when reporting to userlevel using interfaces standardized that way, in which case ditto. The necessary changes are rather small. In particular, dkwedge_info needs to be extended to keep track of the physical sector size so that the dk driver can do the transformations. The physical sector size should be available to callers (just not part of the API/ABI) so this ought to be done regardless. -- David A. Holland dholl...@netbsd.org
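As a rough illustration of "convert to block counts only right when talking to hardware", the following sketch carries a transfer around as a byte offset and byte length and converts it to sector units at the last moment, when the physical sector size is known. The struct and function names are hypothetical, not an existing kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transfer request, kept in bytes everywhere above the driver. */
struct byte_range {
    uint64_t offset;   /* byte offset on the device */
    uint64_t length;   /* byte count */
};

/* The same request in units of the device's physical sectors. */
struct sector_range {
    uint64_t first;    /* first sector number */
    uint64_t count;    /* number of sectors */
};

/*
 * Convert at the hardware boundary. Fails (returns false) if the request
 * is not aligned to the physical sector size, rather than rounding silently.
 */
static bool bytes_to_sectors(struct byte_range br, uint32_t secsize,
    struct sector_range *out)
{
    if (secsize == 0 || br.offset % secsize != 0 || br.length % secsize != 0)
        return false;
    out->first = br.offset / secsize;
    out->count = br.length / secsize;
    return true;
}
```

The same byte request maps to different sector numbers on a 512-byte and a 2048-byte device, but nothing above this function has to know which one it is talking to.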
Re: blocksizes
On Fri, Jan 22, 2010 at 07:38:14AM +, Michael van Elst wrote: Like most things, there is no universal correct answer here, simply deciding always use bytes because it seems simpler is unlikely to be the overall best answer. I think the suggestion is to use block numbers (or some other form of addressing larger units) only internally to some subsystem where these have a meaning and to use byte counts between those subsystems. Yes, that. Quotas should use the units the filesystem uses. For FFS that's probably fragments. The quota structures that have been brought up are an on-disk format that's not subject to change. Any FS-independent quota code should internally use byte counts, but currently quotas are such a mess that we don't have any such thing anyway. :-/ -- David A. Holland dholl...@netbsd.org
Re: blocksizes
On Fri, Jan 22, 2010 at 08:07:03AM +0100, Michael van Elst wrote: On Fri, Jan 22, 2010 at 05:46:31AM +, David Holland wrote: On Thu, Jan 21, 2010 at 10:30:20PM +, Michael van Elst wrote: IMHO there need to be three different ways to specify block offsets and block counts: 1. in units of blocks of the physical device 2. in units of blocks of DEV_BSIZE bytes 3. in bytes Don't forget: 4. in units of the filesystem block size... I omitted this from the list because only the filesystem itself has the notion of 'filesystem block size', but when talking to the device it goes back to use DEV_BSIZE. It becomes clear that 'filesystem block size' is a very private measure of a filesystem when you think about FFS fragments where the filesystem already uses a second size and about aggregated IO where multiple blocks are accessed as one unit. Indeed. But it's still floating around in the system and still a possible complication. It's not *quite* invisible outside of each filesystem; e.g. it affects caching. and we need to establish what units are used where. IM (fairly strong) O everything should be kept in byte counts, and never block counts because if you have more than one unit in use it is far too easy to accidentally mix them or provide the wrong one, and because they're all the same language-level type there's little hope of detecting such problems automatically. I would like a system where all I/O is measured in bytes, but this requires a complete redesign for all disk devices and all filesystems. Right, but I think we should make this the end goal. Nobody says we need to expect to get there promptly. :-/ And you won't get rid of the physical blocks, at some point you have to translate. Only when interfacing, as previously noted. (And, as noted elsewhere, the places where this is required also include on-disk formats.) 
Furthermore, Murphy's Law dictates that in any particular place the count you are given is frequently not in the units you need to give something else, and then you end up converting back and forth all over everywhere. This serves no purpose and tends to obfuscate the code base. This is how it works now. We do translate blocks back and forth all over the place, except that there are a lot of assumptions that physical block size is the same as DEV_BSIZE. Right. Wading through such logic is one of the things that convinced me (a long time ago) that it shouldn't exist. Implementing such stuff in research kernels was the other driving factor - it is too easy to get wrong and you can't afford to spend time dealing with it. Also, filesystems organize data in larger chunks. There is always some translation going on between block or extent numbers and now DEV_BSIZE offsets or byte offset in your ideal system. On the filesystem side it won't get simpler. It will, some.
% grep fsbtodb sys/ufs/ffs/*.[ch] | wc -l
57
That's quite a few more than ought to be there, IMO. Meanwhile, other things will get quite a bit simpler. The physical sector size should be available to callers (just not part of the API/ABI) so this ought to be done regardless. I haven't thought about compatibility issues yet, where is dkwedge_info exposed to userland? I dunno, I'm not all that up on wedges. -- David A. Holland dholl...@netbsd.org
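To make the filesystem-side translation being discussed concrete, here is a toy comparison of the two styles: an fsbtodb-like conversion from a fragment number to a 512-byte device block number, and the byte-offset conversion that would replace it. The `toy_` names and the single `frag_shift` field are hypothetical simplifications, not the real FFS superblock or macros.

```c
#include <stdint.h>

#define TOY_DEV_BSHIFT 9   /* 512-byte units, as with DEV_BSIZE */

/* Hypothetical, heavily simplified filesystem description. */
struct toy_fs {
    unsigned frag_shift;   /* log2 of the fragment size in bytes (>= 9) */
};

/* Block-count style: fragment number -> 512-byte device block number. */
static uint64_t toy_fsbtodb(const struct toy_fs *fs, uint64_t frag)
{
    return frag << (fs->frag_shift - TOY_DEV_BSHIFT);
}

/* Byte-count style: fragment number -> byte offset on the device. */
static uint64_t toy_fsbtobytes(const struct toy_fs *fs, uint64_t frag)
{
    return frag << fs->frag_shift;
}
```

Both functions describe the same location; the difference is that the first bakes the 512-byte unit into every caller, while the second leaves the choice of unit to whoever finally talks to the device.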
Re: quota housekeeping unit
On Sun, Jan 24, 2010 at 09:59:09PM +0100, Wolfgang Solfrank wrote: As an extreme example [on ISO 9660], you could have a file with 3 bytes, where every byte is in a separate block. Raising the question of which kind of resource limitation exactly you want to impose on the user. Wouldn't it make sense to count that kind of usage as three blocks? After all, it's disc blocks you run out of, not bytes of user data. I'm not sure why you bring up quota in this discussion. Someone else mentioned the on-disk FFS quota format. However, this is not relevant to cd9660. The problem I tried to describe is that the current buffer cache with its DEV_BSIZE centric implementation more or less forbids any implementation of this part of the 9660 specification (albeit I have to admit that last I looked into this was before UBC integration.) A buffer cache that works with byte offsets and byte sizes of buffers would simplify this tremendously. It's both not quite that bad and worse. Buffer cache buffers can be whatever size; they are normally the FS block size and they certainly aren't limited to DEV_BSIZE. However, the API is such that if any one file system's buffers aren't all the same size it's treading on very thin ice. This should be rectified sometime. The address to read from is also a block number, but the buffer code does not (as far as I know) itself interpret this number but just passes it along. -- David A. Holland dholl...@netbsd.org
Re: blocksizes
On Sun, Jan 24, 2010 at 08:48:32PM +, David Laight wrote: The btodb/dbtob macros will need another argument to indicate where the block size is obtained. That will just cause massive errors... For disks I would go for transfer requests (eg from fs) that are either in fixed units (BDEVSIZE, 512) or in bytes. Much like the current block device transfers - where transfers must be a multiple of 512 - transfers would need to be aligned to the physical sector size. bytes! :-) -- David A. Holland dholl...@netbsd.org
Re: blocksizes
On Sun, Jan 24, 2010 at 11:21:52PM +0100, Michael van Elst wrote: Not using DEV_BSIZE requires changing how things work now. He is right in the long run, though. You may think that the way NetBSD works is a hack as Izumi Tsutsui put it. But the argument that keeping things the way they are suddenly makes them too simple is just nonsense. Enh. Design decisions should be made while looking ahead, not with the nose to the grindstone. What's the right way to do all this? That has to be established first, then we can debate the merits of how to get there or of compromises that need to be made with the way things currently work. Choosing code architectures because they're easy or apparently simple (== require less immediate work) is a good way to get into a hole. I had a coworker once who did a lot of this. His project eventually needed the equivalent of a federal bank bailout. :-) In this case, the problem with the way things are is that the way things are does not work. I would suggest, based on prior experience with large rototills, that the symbol DEV_BSIZE be removed entirely and all uses examined one by one and changed to something else based on the usage. If we want a symbol that fills the role DEV_BSIZE is maybe supposed to (an arbitrarily-sized block whose size happens to be convenient for various legacy reasons) we should call it something else, and make sure it's only deployed in places where it's correct. grep shows 718 references in the kernel, so this can't be done all in one pass, but it's a much smaller problem than e.g. the device/softc thing. -- David A. Holland dholl...@netbsd.org
Re: Proposal for adding fsx(8) to base system
On Mon, Jan 25, 2010 at 12:19:30AM +0100, Hubert Feyrer wrote: On Sun, 24 Jan 2010, o...@linbsd.org wrote: Fsx is a filesystem exerciser that is used to stress filesystem code. I would like to propose importing fsx into the base systems, or perhaps pkgsrc. The intent is to import ftp://ftp.netbsd.org/pub/NetBSD/misc/ober/fsx/ to src/usr.sbin. Sounds like a case for pkgsrc/benchmark for me. AIUI the reason to have it in base is so the regression tests can use it. However, if so it should really go in src/external. Also, I notice the man page has no license. -- David A. Holland dholl...@netbsd.org
Re: FS corruption because of bufio_cache pool depletion?
On Tue, Jan 26, 2010 at 07:39:00PM +0100, Manuel Bouyer wrote: I have a netbsd-3/Xen 2 based server that runs on the same hardware and we have seen FS corruption in a particular domU on that system that seems to be related to the file system running out of space. That's what the co-admin running that domU tells me anyway. But I haven't seen the damage or the error messages in the domU personally. I've seen this too: kern/41834. There is also kern/35704 on 3.0 about this topic. And there's kern/27802, which I've also seen at times. Conversely, the experience that led to filing misc/33753 didn't hit any of these problems... for whatever any of this is worth. -- David A. Holland dholl...@netbsd.org
Re: blocksizes
On Mon, Jan 25, 2010 at 08:06:11AM +, Michael van Elst wrote: Choosing code architectures 'Redesigning' things to fix bugs seems to be common sense nowadays, as if everything existing is always too bad to be used. Of course the same is valid for the redesigned code base in the future. Yes, and after a dozen or so such ground-up rebuilds (provided each one is informed by the lessons learned in the previous ones) things start to reach a decent state. One interesting point is that dropping DEV_BSIZE doesn't really mean something new but a jump backwards. That's where we came from, that's what was 'redesigned' then. But as you may have noticed, I'm not advocating going back to using random mixtures of block sizes. In this case, the problem with the way things are is that the way things are does not work. It has worked fine for 16 years now. The problems only come from legacy code and code from other sources that wasn't adapted to the then valid design, mainly because the problems didn't show up immediately due to lack of hardware. ...that is, it works except when it doesn't. :-P The software that needs to be fixed is pretty obvious, it's not a large rototill but if you are into 'redesigning' you may see a lot of places (unrelated to DEV_BSIZE) that could be structured better and cleaned up, e.g. partition handling. And all this should be done, whether you intend to drop or keep DEV_BSIZE. Of course... -- David A. Holland dholl...@netbsd.org
Re: mutexes, IPL, tty locking
On Tue, Jan 26, 2010 at 10:39:23AM +, Andrew Doran wrote: I'm not sure it's as rare as all that; it just mostly doesn't overtly fail. Instead you end up silently running at a higher IPL than necessary, and that buys you longer interrupt latencies and more dropped packets and all that. I have done extensive testing on SPL behaviour and can confidently say that with our current setup it simply does not matter unless you have a very poorly written bit of code - in which case that's your problem, not the interrupt masking system. Sure, but the real question is how many such poorly written bits of code currently exist and how hard they are to find and fix. If what you mean to say is that you've specifically gone and looked for such cases and not found any, then great. But so far nobody's been willing to stick their neck out to make this assertion -- only the weaker one: if such code exists, it's wrong. -- David A. Holland dholl...@netbsd.org
Re: ddb write and io memory
On Sat, Jan 30, 2010 at 08:45:48PM +0100, Frank Wille wrote: Therefore I would like to change ddb/db_write_cmd.c as in the following patch: [...] Any objections? Do we absolutely need to print the old value here? I think it's somewhat desirable to. Wouldn't it be better anyway to create a second command (iowrite or regwrite or something) that both does this and also does anything else that might be necessary to write to I/O registers safely, like flushing caches? -- David A. Holland dholl...@netbsd.org
Re: buffer cache can return buffer of another inode ?
On Sun, Jan 31, 2010 at 12:35:16AM +0100, Manuel Bouyer wrote: Hi, while investigating directory corruption on my NFS server I found a possible issue with the buffer cache. [...] I think vclean() should also take care of removing the vnode from the buffer cache's hash. Comments ? Yes. Except, what if someone else is holding the buffer? Invalidating buffers is delicate (both in general and in the code we currently have) and doing this could easily just move the problem around. I'm not familiar enough with the guts to suggest a probably-safe way of doing this in the short term; in the long term I think the buffer cache code needs a big rototill and design review. It is next on my list when/if I ever finish with namei... -- David A. Holland dholl...@netbsd.org
Re: uvm_object::vmobjlock
On Thu, Jan 28, 2010 at 09:55:53PM +, Mindaugas Rasiukevicius wrote: Unless anyone objects, I would like to change struct uvm_object::vmobjlock to be dynamically allocated with mutex_obj_alloc(). It allows us to: 1) share the lock among objects by holding a reference 2) avoid false-sharing on locks. Note that struct vnode::v_interlock becomes a pointer, which means a chunk of mechanical changes. You could in theory do
-#define v_interlock v_uobj.vmobjlock
+#define v_interlock (*v_uobj.vmobjlock)
but it is probably not a good idea :-) Anyhow, if you do this, can we please come up with a better name for v_interlock? Calling a lock interlock is about as descriptive as writing int i or bool flag, i.e., fine sometimes when the scope is limited, but generally not so great for public data structures. If it were me I'd probably call it v_memlock, but it looks as if it ought to be something beginning with 'i' to avoid renaming a pile of other stuff. (Renaming it will also make sure that all code that needs to be visited and adjusted actually does get visited and adjusted, which is a good thing.) -- David A. Holland dholl...@netbsd.org
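The idea of sharing one dynamically allocated lock among several objects by holding references can be sketched in user space roughly as follows. This is a toy model only: the `toy_` names are hypothetical, the lock is a trivial C11 spin flag, and nothing here is the actual mutex_obj_alloc() implementation.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical reference-counted lock object. */
struct toy_lockobj {
    atomic_flag locked;   /* trivial spin lock */
    atomic_uint refcnt;   /* number of objects pointing at this lock */
};

/* Objects hold a pointer to the lock instead of embedding it. */
struct toy_object {
    struct toy_lockobj *vmobjlock;
};

/* Allocate a lock object with one reference; NULL on failure. */
static struct toy_lockobj *toy_lockobj_alloc(void)
{
    struct toy_lockobj *lo = malloc(sizeof(*lo));
    if (lo == NULL)
        return NULL;
    atomic_flag_clear(&lo->locked);
    atomic_init(&lo->refcnt, 1);
    return lo;
}

/* Take an extra reference so another object can share the lock. */
static void toy_lockobj_hold(struct toy_lockobj *lo)
{
    atomic_fetch_add(&lo->refcnt, 1);
}

/* Drop a reference; returns 1 if this was the last one and it was freed. */
static int toy_lockobj_free(struct toy_lockobj *lo)
{
    if (atomic_fetch_sub(&lo->refcnt, 1) != 1)
        return 0;
    free(lo);
    return 1;
}

static void toy_lock(struct toy_lockobj *lo)
{
    while (atomic_flag_test_and_set(&lo->locked))
        ;   /* spin until the holder clears the flag */
}

static void toy_unlock(struct toy_lockobj *lo)
{
    atomic_flag_clear(&lo->locked);
}
```

Because every user now reaches the lock through a pointer, any code that previously took the address of an embedded lock has to be visited, which is the "chunk of mechanical changes" the proposal mentions.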
unhooking lfs from ufs
On several occasions it's been suggested that lfs should be unhooked from ufs, on the grounds that sharing ufs between both ffs and lfs has made all three entities (but particularly lfs) gross. ffs and lfs are not similar enough structurally for this sharing to really be a good design. Nobody I've discussed this with (on the lists or in chat) has been opposed, and I think there's a general consensus that this is the right direction. Getting there, however, is going to perhaps be a bit more controversial. Since ufs does provide a lot of functionality to lfs, it seems to me that the only practical way to do this is to cut and paste a copy of ufs into lfs under a different name, hack it up so it works again, and then begin consolidating. Anything else would involve either cutting off far too much work at once or leaving lfs entirely inoperable (as opposed to merely unstable) for a lengthy period; both of these propositions seem like a worse idea than attending to and merging changes into a chunk of copied code. (In fact, I've been maintaining and syncing the copy since July, and it's not been a big deal so far.) So I think this is the best approach. The copy involves 18 files from sys/ufs/ufs (out of 21; the ones excluded are quota.h and unsurprisingly ufs_wapbl.[ch]) which contain 9067 lines of code. That gives the following statistics:
   14988   size of lfs currently
 +  9067   size of copypasted ufs
   24055   size of resulting uncompilable lfs
 -   401   result of making it compilable
   23654   size of new lfs
This is the size of the code in sys/ufs/lfs; the userlevel tools need patching but don't change size significantly. My guess/estimate is that after several rounds of consolidation the total size will drop to around 18000-19000 lines. Maybe less, even, but I wouldn't count on that. I'll be keeping an eye on the total size going forward. Anyway, I have done this much and it's ready to go. I will be committing it tonight, I think, unless there are sudden howls of protest. 
The diff (from HEAD of a couple hours ago to the new compilable lfs) is posted here: http://www.eecs.harvard.edu/~dholland/netbsd/lfs-ufs-20100207.diff I will probably commit the pasted-only uncompilable form first, and maybe some of the intermediate steps as well, for the historical record and to make future merges easier. This may make the tree temporarily unbuildable, but hopefully not for very long. -- David A. Holland dholl...@netbsd.org
Re: unhooking lfs from ufs
On Sun, Feb 07, 2010 at 10:10:31AM +, Mindaugas Rasiukevicius wrote: The copy involves 18 files from sys/ufs/ufs (out of 21; the ones excluded are quota.h and unsurprisingly ufs_wapbl.[ch]) which contain 9067 lines of code. That gives the following statistics: 14988 size of lfs currently + 9067 size of copypasted ufs 24055 size of resulting uncompilable lfs - 401 result of making it compilable 23654 size of new lfs How would this affect UFS side? For example, any potential code reduction and/or simplification? Yes. ufs_readwrite.c will become much less gross, for example. There used to be assorted LFS-only code in the ufs sources; ad@ removed the ifdefs some time ago but they could be resurrected and then used to purge the relevant code. I don't know how much code that is. As for deeper simplifications, I don't know without digging around a lot more than I have (particularly in the ext2fs code), but there should be some. Anyway, I have done this much and it's ready to go. I will be committing it tonight, I think, unless there are sudden howls of protest. This involves significant changes, therefore enough time should be left for mailing list readers (~1 week at least, before committing anything). It was discussed months ago. This is a reminder/heads-up. -- David A. Holland dholl...@netbsd.org
Re: unhooking lfs from ufs
On Sun, Feb 07, 2010 at 11:07:55AM +, Mindaugas Rasiukevicius wrote: It was discussed months ago. This is a reminder/heads-up. Where? This mailing list is a right place where such discussions (and decisions) should happen. Right here... -- David A. Holland dholl...@netbsd.org
Re: solved ? [Re: need help with kern/35704 (UBC-related)]
On Tue, Feb 02, 2010 at 10:53:58PM +0100, Manuel Bouyer wrote: I found the cause for this one: [...] To fix this I propose to have ffs_truncate() (and derivatives) always set v_writesize, even if the real size of the inode didn't change. The attached patch completely fixes the test case from kern/35704 for me. I suspect it could also fix other related file system full related PRs. Anyone have comments about this? I'd still like to hear some from UVM/UBC experts ... Should this be pulled up to -4? It applies cleanly and I can probably test it (some...) -- David A. Holland dholl...@netbsd.org
Re: solved ? [Re: need help with kern/35704 (UBC-related)]
On Wed, Mar 03, 2010 at 09:27:43PM +0100, Manuel Bouyer wrote: Anyone have comments about this? I'd still like to hear some from UVM/UBC experts ... Should this be pulled up to -4? It applies cleanly and I can probably test it (some...) Yes, it's also needed for -4 (AFAIK it's older than -3). But I've not had a chance to test it yet ... Ok, well, a -4 kernel with it boots normally at least. I'll try filling /tmp in the morning. -- David A. Holland dholl...@netbsd.org
Re: config(5) break down
On Wed, Mar 03, 2010 at 03:26:20AM +0900, Masao Uebayashi wrote: I want to slowly start breaking down config(5) files (sys/conf/files, sys/**/files.*) into small pieces. The goal is to clarify ownership of files; lines like file aaa.c bbb | ccc are to be changed into file aaa.c (ownership) and ccc: bbb (dependency). I'm not entirely convinced that makes sense in all cases, but I suppose it probably does in most. Because in the modular world one file belongs to one module. Perhaps a first step would be using config(1) and files.* to generate the module makefiles instead of maintaining them by hand... Broken config(5) files will be named like module.conf, because files.* namespace is insufficient. For example pci.kmod can't use files.pci. Huh? I don't understand. -- David A. Holland dholl...@netbsd.org
Re: msync(2)
On Mon, Feb 22, 2010 at 04:40:58PM -0500, Matthew Mondor wrote: After reading the manual page of msync(2), I have the impression that if invoked with the MS_SYNC flag, it should be safe enough not to need a further fdatasync(2)/fsync_range(2) call afterwards? That is the theory. And how about the metadata? Would sync(2) be the only true way to ensure it's synchronized (considering fsync(2) seems fd-specific)? What metadata? You can't get to things like the time stamps via mmap. Granted, in FFS-land someone might have thought it made sense to write out all the data blocks and not the FS-level metadata that describes them on disk... but since doing this does not guarantee that the data can be read back again later, it is not a correct implementation of msync(2). (Or fdatasync(2) either.) Also, I am auditing an application which seems to modify mmaped files but which does not use msync(2) at all (and I can see that an older fsync(2) call was used, but is now commented out). Should this be considered a bug? Why would it be? -- David A. Holland dholl...@netbsd.org
Re: config(5) break down
On Fri, Mar 05, 2010 at 01:14:50AM +0900, Masao Uebayashi wrote: Perhaps a first step would be using config(1) and files.* to generate the module makefiles instead of maintaining them by hand... cube@ said he did this part a long time ago. The thing is that only fixing these tools doesn't solve all problems magically. We have to fix wrong instances around the tree. Maybe it should be merged then? Broken config(5) files will be named like module.conf, because files.* namespace is insufficient. For example pci.kmod can't use files.pci. Huh? I don't understand. Let's see the real examples. sys/conf/files has this: file net/zlib.c (ppp ppp_deflate) | ipsec | opencrypto | vnd_compression This means [...] We should normalize this as [...] Now we define a module ppp_deflate which depends on ppp and zlib. To make dependency really work, the depended modules must be already defined. To make sure, we have to split files into pieces and include dependencies. net/zlib.conf See, this is the part that I don't understand. You're talking about normalizing logic, which is fine, and making shared files first-class entities, which is fine too though could get messy. But then suddenly you jump into splitting up files.* into lots of little tiny files and I don't see why or how that's connected to what you're trying to do. -- David A. Holland dholl...@netbsd.org
Re: config(5) break down
On Mon, Mar 08, 2010 at 10:53:16AM +0200, Antti Kantee wrote: (FFS_EI isn't the only such option either, it's just one I happen to have already banged heads with.) This one is easy, no need to make it difficult. Sure, but as I said it was just an example; what about the next one? Things like wapbl are currently an actual problem, since it is multiply owned (conf/files *and* ufs/files.ufs). I don't see why this is a problem either. The way things are right now, vfs_wapbl.c is conditional on wapbl the same as the rest; enforcing a hierarchical decomposition by source directory would break this but that's part of why such hierarchical decompositions are a bad idea. (I've also never fully understood why wapbl has to have so many tentacles hanging out of ffs, either.) -- David A. Holland dholl...@netbsd.org
Re: (Semi-random) thoughts on device tree structure and devfs
On Wed, Mar 10, 2010 at 11:47:49AM +0900, Masao Uebayashi wrote: I wonder what is the best design / implementation of devfs. None. When you go and do it right, it turns into some automount logic and a tmpfs. -- David A. Holland dholl...@netbsd.org
Re: (Semi-random) thoughts on device tree structure and devfs
On Thu, Mar 11, 2010 at 01:36:41AM +0900, Masao Uebayashi wrote: Well, yes. But research efforts are like that. Robustness is pretty much necessary for production use but not for the stage this appears to be at. I'm not a researcher. I'm an engineer. I like steady move feasible project. I am a researcher, and my core area of interest is exactly this kind of problem. If you are looking for a feasible project that can be relied on to move forward, my honest best recommendation is to pick something else. :-| -- David A. Holland dholl...@netbsd.org
Re: (Semi-random) thoughts on device tree structure and devfs
On Sun, Mar 14, 2010 at 03:33:19PM +0900, Masao Uebayashi wrote: I did; bus attachments. If you pay a little more respect to engineers, you'll find this is almost same as Iain's saying and what I wrote in the first mail. huh? he asked me what I meant, I said what I meant... -- David A. Holland dholl...@netbsd.org
Re: (Semi-random) thoughts on device tree structure and devfs
On Sat, Mar 13, 2010 at 08:02:51AM -0500, der Mouse wrote: [st_dev] does not have to correspond, though, to anything else in the system. Not really, no, but it may as well be the same as what's in st_rdev. If there still is an st_rdev. I see no particular reason that needs to be preserved. No, except that it is somewhat useful to be able to identify a device node (or at least distinguish it from others) and plenty of existing code expects the st_rdev field to exist. Patching all that is only worthwhile if it accomplishes some purpose, which it wouldn't really. The files in procfs and kernfs are for the most part semantically equivalent to real files even when they're virtual or dynamically generated. Devices frequently have other properties. Disagree. Writing to real files does not, for example, change the system hostname or alter a process's registers. In fact, that sounds a lot like the kind of dangers that inhere in writing to devices indiscriminately, doesn't it? Yes... and no. There's another sense in which /kern/hostname is the same as /etc/passwd: both are text files that affect the system configuration. Changes to both also have immediate operational effects on the running system. The fact that one is not preserved across reboots is a negligible difference from the perspective of some program that might randomly open either. Unexpectedly opening a tty without being prepared to hang indefinitely waiting for carrier-detect is a different class of problem. Many devices also are not like regular files in that you cannot read back what you write to them; /kern/hostname is again a regular file by that standard. I'm not saying that it might not be useful to tag /kern/hostname somehow (and /etc/passwd too) so that certain classes of programs, like say mail delivery tools, can categorically refuse to write to them. But that's kind of a different issue from marking devices... [...] devfs might even involve creating [...] 
(S_IFDEV, say) I don't see any point at all in renaming S_IFBLK/S_IFCHR. In terms of the end state achieved, neither do I. But there can be value in that programs that haven't been ported are more likely to misbehave if they see a name (by which I mean S_IFCHR and S_IFBLK) they think they know the semantics of but with different semantics than if they encounter something they don't recognize. True, but the semantics that can be expected in practice of S_IFCHR and S_IFBLK are very limited - most (but not all) S_IFCHR objects won't seek, for example, and S_IFBLK objects generally require aligned I/O and have a fixed size, but there are few other expectations. Which is, after all, why we mark devices as devices; they don't necessarily behave as regular files can reasonably be expected to. [...], and any new device type would have pretty much the same semantics anyway. In some respects. But lurking under all this has been doing away with st_rdev, which for some programs is a radical enough departure that a new name is deserved. (Others won't care, but I suspect most of them don't go looking at st_mode.) Well, no, we're doing away with a specific interpretation of the contents of st_rdev. Getting rid of st_rdev itself doesn't serve much further purpose. One can identify (most) programs that are going to try to interpret st_rdev the old way by getting rid of the major() and minor() macros. (IMO, any program that slices up dev_t on its own without using those deserves the consequences.) (1) Attaching a device into devfs and attaching a fs into the fs namespace are fundamentally the same operation. Only at a very general level, [...] at that level open(,O_CREAT,) also qualifies. So [does mknod()] Those are different in a fairly basic way: they create an object within an existing filesystem namespace, as opposed to binding a foreign object into the namespace. I'm not sure I'd call a filesystem a foreign object.
If that's fair, then the filesystem namespace is _all_ foreign objects, and the foreign adjective no longer really means anything there. They are foreign to the filesystem they're being attached into? Maybe the choice of words isn't so great, but there's a real difference involved. A traditional device node is also a binding of a foreign object, but it does it by creating a proxy object in an existing filesystem. I'm not sure how fair it is to call it a proxy object, any more than an S_IFREG inode is a proxy for the big array of bytes (stored elsewhere on the disk) that make up the file's contents. But that big array is part of the conceptual entity that the inode represents. The driver pointed to by a device special file is not part of anything in the filesystem. Devfs schemes that don't abolish the proxy tend to get in trouble because it's too many layers of indirection. (This is not the only problem, but it's *a* problem.) Devfs schemes that
Re: config(5) break down
On Tue, Mar 16, 2010 at 06:50:29PM +0100, Zafer Aydoğan wrote: I'm wholeheartedly behind Julio's statement. Users should never have to rebuild anything. Er, why? Users should never have to perform complex unautomated procedures, because such procedures can easily be screwed up and recovery becomes difficult or impossible. But recompiling things isn't a complex unautomated procedure, it's a complex automated procedure, and not really that much different from other complex automated procedures like binary updates. Nor is it necessarily slow; building a kernel doesn't take any longer than booting Vista... -- David A. Holland dholl...@netbsd.org
Re: build time (was: config(5) break down)
On Wed, Mar 17, 2010 at 07:48:32PM +0100, Edgar Fuß wrote: DH Nor is it necessarily slow; building a kernel doesn't take any longer DH than booting Vista... EH Maybe on your machine. On mine it's still quite a bit slower than just EH editing a config file. Which gives you a totally new boot from source option. That's not new, it's called /vmunix.el and was invented a long time ago, just never perfected :-) -- David A. Holland dholl...@netbsd.org
Re: config(5) break down
On Wed, Mar 17, 2010 at 11:10:59AM -0500, Eric Haszlakiewicz wrote: On Tue, Mar 16, 2010 at 08:01:31PM +, David Holland wrote: But recompiling things isn't a complex unautomated procedure, it's a complex automated procedure, and not really that much different from other complex automated procedures like binary updates. The difference here is that a binary update is changing one particular machine and updating some other machine obviously won't have the intended effect, but recompiling things does the exact same thing regardless of where you do it, so having multiple people do it seems like a waste of time. That's a red herring; applying a binary patch does the same thing everywhere, and recompiling updates one particular machine in exactly the same sense too. The difference is in what material is distributed and how and where it's processed. This is not something end users are going to care about much - at most they'll care about how long it takes. Which is a valid concern, of course, especially in extreme examples like building firefox on a sun3, but it's *not* a usability issue in the same sense that e.g. incomprehensible error messages are. Admittedly, neither CVS nor our build system is quite robust enough to make this really work, but in practice tools like apt-get and yum aren't quite, either. Anyhow, it seems to me that a blanket statement nobody should ever have to recompile anything requires some justification; however, people have been taking it as an axiom lately and that concerns me a bit. Nor is it necessarily slow; building a kernel doesn't take any longer than booting Vista... Maybe on your machine. On mine it's still quite a bit slower than just editing a config file. Well sure, but that just means we're way ahead of the competition, since in Windows editing that config file generally requires a reboot. :-) -- David A. Holland dholl...@netbsd.org
Re: [gsoc] syscall/libc fuzzer proposal
On Sat, Mar 20, 2010 at 01:54:49PM -0400, Elad Efrat wrote: Thor Lancelot Simon wrote: If not, I don't think this adds any benefit to your proposal and is likely to simply be a distraction; I'd urge you in that case to drop it. Strongly seconded. There are so many great ways to improve NetBSD and wasting time and money on fuzzing is about as suboptimal as it gets. Um. First of all, that's not what Thor said; second of all, you really should not be telling potential gsoc students that their project ideas are flatly worthless, even if your judgment were correct; and third, I'm rather surprised that anyone who claims to work on security would call testing and analysis tools worthless. Let's try not to scare everyone off, ok? -- David A. Holland dholl...@netbsd.org
Re: [gsoc] syscall/libc fuzzer proposal
On Sat, Mar 20, 2010 at 12:40:12PM -0400, Thor Lancelot Simon wrote: As a part of my work I would like to write a translator for C language and a small library. Their goal would be to detect integer overflows, stack overflows, problems with static array indexing, etc (when such occur during the program execution). It will enable me to uncover more bugs in the software. What is the benefit of this when compared to existing static-analysis tools such as Coverity Scan, splint, or the Clang static analyzer? Will this cover any cases they don't? If so, which ones? AIUI from chat, the idea is to increase the probability that if the testing causes something bogus to happen, the bogus behavior will result in an easily identifiable abort. This seems like a valid line of reasoning; all the same, implementing such a tool is a fairly big (and annoying) pile of grunt work. Plus various variations on it have been done before. (Some of which might be worth looking into, actually.) -- David A. Holland dholl...@netbsd.org
Re: panic: ffs_valloc: dup alloc
On Sat, Mar 20, 2010 at 10:29:44PM +1030, Brett Lymn wrote: I have given up on suspending because my filesystems would be corrupted with monotonous regularity. The chances of a corruption seems to increase with the amount of disk activity happening on suspend. It seems like something is not being flushed (or not being marked as flushed) when the suspend happens. We don't support suspend-to-disk, right? So the contents of kernel memory are supposed to be preserved in this suspend? Because if so, unflushed buffers shouldn't matter. One would think. That suggests that something is flushing buffers to a device that's suspended and it's throwing them away instead of rejecting them or panicking. Does stuffing a couple sync calls somewhere before it starts suspending devices (wherever that is, I don't know) make the problems go away? -- David A. Holland dholl...@netbsd.org
Re: [gsoc] syscall/libc fuzzer proposal
On Sat, Mar 20, 2010 at 03:40:33PM -0400, Elad Efrat wrote: If not, I don't think this adds any benefit to your proposal and is likely to simply be a distraction; I'd urge you in that case to drop it. Strongly seconded. There are so many great ways to improve NetBSD and wasting time and money on fuzzing is about as suboptimal as it gets. Um. First of all, that's not what Thor said; Sorry? Are you saying that me agreeing with Thor that unless this proposal shows some clear advantage over what we already have -- specifically Coverity Scan -- it should probably be dropped is not what Thor said? He was talking about the bounds-checking translation tool part. You were attacking the entire thing. second of all, you really should not be telling potential gsoc students that their project ideas are flatly worthless, even if your judgment were correct; I said exactly what I think Which was tactless and rude. If someone comes along with an idea that's basically a waste of time, they should be gently steered towards something else. Students don't always have good ideas; that's why they need mentoring and advising, but you don't mentor and advise very effectively by being hostile and dismissive. Also, outside of the specific gsoc context, we have a long-standing custom in this project to not tell other people what to spend their time on or what is and isn't valuable. and third, I'm rather surprised that anyone who claims to work on security would call testing and analysis tools worthless. I don't *claim* anything, David; I *work*, at least as opposed to, say, assigning bugs to me, claiming for years I'll do something about them (together with many other grand ideas) and instead fix, I dunno, whitespace and grammar issues. Take your preaching elsewhere; I couldn't care less. Is that what you think I do? (And if so, do you really want to get into ad hominems? You're on fairly shaky ground.) 
As for the issue at hand, well, I suggest you look at what the proposal is, what we already have for years, and draw your own conclusions. Yes, I have; it needs to be fleshed out into a real project proposal (as is to be expected at this stage, after all) but I don't see anything inherently wrong with it so far. -- David A. Holland dholl...@netbsd.org
Re: panic: ffs_valloc: dup alloc
On Sat, Mar 20, 2010 at 04:06:32PM -0400, Steven Bellovin wrote: That suggests that something is flushing buffers to a device that's suspended and it's throwing them away instead of rejecting them or panicking. Possibly Although it doesn't quite make sense, because in most cases this could only corrupt the fs if the same block was left untouched afterwards for long enough for the (allegedly) clean buffer to be discarded, and that shouldn't cause a panic right after resume. Unless the fs was already broken from a previous suspend, I guess. Maybe there's suspend code somewhere that writes out and also discards buffers in the hopes of cleaning up for some future suspend-to-disk work? Could be, I guess, but I'd tend to think not. I ought to go look at the code but I don't think I have time for that this weekend. :-| Does stuffing a couple sync calls somewhere before it starts suspending devices (wherever that is, I don't know) make the problems go away? No -- I've had a sync call in my suspend script for years. More precisely, at the moment it's sync; sleep 1 to let things flush. No joy. That might not be late enough; I was thinking of inside the kernel. Of course, rejecting them wouldn't seem to do any good; what's needed, I suspect, is for the device to queue them (as usual) but not fire up the disk when in suspending mode. Or for the writes to not be issued at all until after resume. ISTM it must be either the syncer firing at the wrong time or something's gotten out of order in the suspend sequencing. -- David A. Holland dholl...@netbsd.org
Re: panic: ffs_valloc: dup alloc
On Sat, Mar 20, 2010 at 05:03:16PM -0400, Steven Bellovin wrote: Let me see if I can find my first note on the subject -- it might give a clue about the date of any changes. Turns out that I sendpr-ed it in September: kern/42104. I even responded to the PR, not that I had any useful ideas at the time. That sounds like maybe the problem is not on the suspend side but on the resume side, that is, that stuff is being written out before (some layer of) the disk subsystem is ready to go again. With vanilla FFS such writes should be synchronous so it should be (relatively) easy to figure out what's going on. Do you feel like trying out dtrace? :-) On the other hand, if fsck thinks the inode for a named pipe is unallocated (or particularly, has duplicate blocks, since pipes shouldn't have blocks at all)... that means that whatever went wrong went wrong when the pipe was created, not when something exited and removed it. And with vanilla ffs, those are synchronous writes and they should happen in quick succession; if the inode didn't get written but the directory did, something's more badly wrong than just the disk not being ready yet. And I strongly suspect that the pipe creation isn't tied to suspending, that is, the pipe should have been created long before you suspended and should not in general be removed and recreated by suspending. And that means either something is severely wrong in general and you're only seeing it after crashing due to suspend (which is possible, but seems not too likely) or the suspend cycle is actively writing garbage and corrupting the fs. Meanwhile, getting traps while dumping is Very Strange (TM). Do we have any kind of debug code that can checksum memory before and after the suspend? I wonder if something ACPI-related is garbaging memory. -- David A. Holland dholl...@netbsd.org
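For what it's worth, the before-and-after checksum idea could look something like the following in userland terms. This is only a sketch under assumptions: region_sum() and region_unchanged() are made-up names, the Fletcher-style sum is an arbitrary choice, and real suspend code would have to walk physical memory and skip pages that legitimately change across the cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an existing kernel interface):
 * Fletcher-style running checksum over a byte range.
 */
uint32_t
region_sum(const void *base, size_t len)
{
	const unsigned char *p = base;
	uint32_t a = 0, b = 0;
	size_t i;

	assert(base != NULL || len == 0);
	for (i = 0; i < len; i++) {
		a = (a + p[i]) % 65535;
		b = (b + a) % 65535;
	}
	return (b << 16) | a;
}

/*
 * Hypothetical helper: returns 1 if the region still matches the
 * checksum recorded earlier (e.g. before suspend), 0 otherwise.
 */
int
region_unchanged(const void *base, size_t len, uint32_t before)
{
	return region_sum(base, len) == before;
}
```

The intended use would be to record region_sum() for each region of interest before suspending and call region_unchanged() after resume; a mismatch narrows down which memory got scribbled on, though not by what.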
Re: config(5) break down
On Thu, Mar 25, 2010 at 01:14:51AM +0900, Masao Uebayashi wrote: (Besides, it's not necessarily as flat as all that, either.) It's necessary to be flat to be modular. Mm... not strictly. That's only true when there are diamonds in the dependency graph; otherwise, declaring B inside A just indicates that B depends on A. Consider the following hackup of files.ufs: There're diamonds (for example, ppp-deflate depends on ppp and zlib). Sure. But mostly there aren't. [...] module UFS [...] module FFS [...] module MFS [...] module EXT2FS [...] module LFS [...] In this plan, what *.kmod will be generated? The ones declared? Or one big one, or one per source file, or whatever the blazes you want, actually... I'm perfectly happy to rework the parser to support syntax like the above if we can all agree on what it should be. So you're proposing a syntax change without understanding the existing syntax? (You don't know what braces are for, you didn't know define behavior, ...) I have to say that your proposal is not convincing to me... Um. I know perfectly well that config currently uses braces for something else. That's irrelevant. There's no need to use braces for grouping; it just happens to be readily comprehensible to passersby. There's an infinite number of possible other grouping symbols that can be used, ranging from to (! !) or even things like *( )*. Furthermore, the existing use of braces can just as easily be changed to something else if that seems desirable. There's a reason I said syntax like the above and if we can all agree on what it should be. That wasn't a concrete proposal, it wasn't meant to be a concrete proposal, no concrete proposal is complete without an analysis of whether the grammar remains unambiguous, and nitpicking it on those grounds is futile. You seem to be completely missing the point. -- David A. Holland dholl...@netbsd.org
Re: config(5) break down
On Fri, Mar 19, 2010 at 02:49:37PM +, Andrew Doran wrote: I *do* think it's a useful datapoint to note that sun2, pmax, algor, etc. are never, ever downloaded any more. Right, and these dead ports must be euthanized. The mountain of unused device drivers and core kernel code is a significant hindrance to people working in the kernel. Speaking from the point of view of repeatedly touching every namei call anywhere in the kernel... I'd have to disagree. Sure, it'd go faster if there weren't a pile of legacy binary compat implementations or if we removed all the mostly-unused fses, but if I wanted a toy kernel I already have a pile of those in the office. Most of the issues that the dead ports or fses trigger are real design or structural problems that would be only masked, not resolved, by removing that code. Supporting all the random bells and whistles that e.g. compat_svr4 wants from namei is part of doing it correctly, and having the correct infrastructure in place that can support these things is important because the need/desire/demand will come along again; it always does. For example, the $ORIGIN thing in ld.elf_so is actually the same as one of the annoying cases in (IIRC) compat_svr4... I know we don't exactly see eye to eye on these issues but perhaps we can reach some kind of middle ground? -- David A. Holland dholl...@netbsd.org
Re: config(5) break down
On Thu, Mar 25, 2010 at 06:22:17PM +0900, Masao Uebayashi wrote: % grep ':.*,' sys/conf/files | wc -l 86 And? I don't understand your point. There are a lot more than 86 entities in sys/conf/files. There are many instances where modules have multiple dependencies. And? I still don't understand your point. -- David A. Holland dholl...@netbsd.org
Re: config(5) break down
On Fri, Mar 26, 2010 at 10:24:02AM +0900, Masao Uebayashi wrote: (Honestly, I see benefit to not convincing you; objection only from dholland@ sounds more convincing to me than no objections.) Um. I'm sorry you think that. I guess there is no point continuing this discussion, then. Or much of any other. -- David A. Holland dholl...@netbsd.org
Re: config(5) break down
On Fri, Mar 26, 2010 at 01:45:51PM +, Andrew Doran wrote: I'm speaking of low level kernel code and driver drivers, areas that to date you have had relatively little involvement in. That's not entirely true, but fair enough. I will however consider discussing the points you raise if/when I launch a jihad against emulations and file system code. I thought you already had! :-p :-) -- David A. Holland dholl...@netbsd.org
$ORIGIN (was: Re: make: ensure ${.MAKE} works)
On Thu, Apr 15, 2010 at 08:40:19AM +, David Holland wrote: Wish we had working $ORIGIN... We will fairly soon, I think... :-) To wit: as far as I can tell, having been wading around in that code recently, the only problem with what we have is that if the path sent back by namei isn't absolute it needs a getcwd() stuck on the front of it. Is it reasonable to just do that? I don't think calling getcwd() from exec is going to cause locking problems, but it might be more overhead than we want to swallow. -- David A. Holland dholl...@netbsd.org
Re: $ORIGIN (was: Re: make: ensure ${.MAKE} works)
On Wed, Apr 21, 2010 at 08:58:31AM -0400, Christos Zoulas wrote: | Is it reasonable to just do that? I don't think calling getcwd() from | exec is going to cause locking problems, but it might be more overhead | than we want to swallow. The code that we have there works fine now, yamt objected about it because strictly speaking the path could have been evicted from the namei() cache, but this never happens since the call is immediately after. Calling getcwd will add overhead and it is not really necessary because we already did resolve the path. If you exec ../bin/foo, that's all namei will resolve or touch, and that's the string that'll come back from namei. If we want an absolute path out, it needs getcwd, either in exec or in namei... and in exec is probably preferable. If we really want to support the feature we need to either buy into that overhead, or inspect the binary in some fashion to only do it in cases where it's going to be used. AFAICT getcwd should be no more expensive than vnode_to_path if the parentage of the current directory is in the name cache, which should be the common case. -- David A. Holland dholl...@netbsd.org
Re: $ORIGIN (was: Re: make: ensure ${.MAKE} works)
On Wed, Apr 21, 2010 at 01:22:12PM -0400, Christos Zoulas wrote: | If you exec ../bin/foo, that's all namei will resolve or touch, and | that's the string that'll come back from namei. If we want an absolute | path out, it needs getcwd, either in exec or in namei... and in exec | is probably preferable. That's right, and it affects in my opinion 5% of the invocations, since the majority of the execs are done via the shell and the shells pass absolute paths to exec for commands that don't contain '/'. Right. | If we really want to support the feature we need to either buy into | that overhead, or inspect the binary in some fashion to only do it in | cases where it's going to be used. | | AFAICT getcwd should be no more expensive than vnode_to_path if the | parentage of the current directory is in the name cache, which should | be the common case. That's what vnode_to_path() does (it calls getcwd), so the cost is the same. I had convinced myself it was supposed to fail if it had to look outside the cache, but that's only true for the first step. I think what you propose is to call something like a kernel realpath(path) and use this to set $ORIGIN which is fine with me. I did not do it because I did not want to deal with path canonicalization (eliminating ../.././// from the path, but I guess that getcwd() does this for you if you call it with the full path?). namei can already do enough of this to get by on (see for example svr4_sys_resolvepath() in sys/compat/svr4/svr4_misc.c) and exec is already using this (mis?)feature. For the time being what we can do is take the path sent back from namei, and if it's not absolute call getcwd and graft that onto the front. This will in general yield a partially realpath'd path but I don't think anyone will care. 
In the long run I think a fully realpath'd path can be arranged, either by calling getcwd first and handing the results to namei to grind on, or by explicitly compacting any ..'s that appear in the front of the namei result. I sort of favor the first because it makes it possible to handle the emulation root properly, I think, but this can be discussed later on. -- David A. Holland dholl...@netbsd.org
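To make the interim approach concrete, here is a rough userland analogue of "if the path isn't absolute, graft getcwd onto the front". It is a sketch under assumptions, not the exec code: graft_cwd() and CWD_BUFSZ are invented for illustration, and as noted above no attempt is made to compact any ".." components, so the result is only partially canonical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CWD_BUFSZ 4096	/* hypothetical buffer size for this sketch */

/*
 * Hypothetical helper: copy 'path' to 'out' if it is already
 * absolute; otherwise prefix it with the current directory.
 * Returns 0 on success, -1 if getcwd fails or 'out' is too small.
 */
int
graft_cwd(const char *path, char *out, size_t outlen)
{
	char cwd[CWD_BUFSZ];
	int n;

	assert(path != NULL && out != NULL);
	if (path[0] == '/') {
		n = snprintf(out, outlen, "%s", path);
	} else {
		if (getcwd(cwd, sizeof(cwd)) == NULL)
			return -1;
		/* Avoid a doubled slash when the cwd is the root. */
		n = snprintf(out, outlen, "%s/%s",
		    strcmp(cwd, "/") == 0 ? "" : cwd, path);
	}
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Called with "../bin/foo" from /home/user this would produce /home/user/../bin/foo, which is the "partially realpath'd" form; the two long-run options differ only in whether namei or a separate pass removes the "..".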
Re: sysctl node names (Re: CVS commit: src/sys/uvm)
On Fri, Feb 19, 2010 at 06:42:03AM -0500, der Mouse wrote: I'd say it's a question of whether you think of them as input to the kernel, commands (enable this), or as output from the kernel, reporting state (this is enabled). Of course, in most cases, they're actually both, so that doesn't help much. But, as a native anglophone, it's what the difference between enable and enabled here feels like to me. Well, the sysctl tree is a presentation of the state. Changing it is a command. So I think it should be enabled. We could also avoid this problem by using on instead, but that's probably not a great idea. -- David A. Holland dholl...@netbsd.org
Re: $ORIGIN (was: Re: make: ensure ${.MAKE} works)
On Wed, Apr 28, 2010 at 02:57:47PM -0400, der Mouse wrote: To wit: as far as I can tell, having been wading around in that code recently, the only problem with what we have is that if the path sent back by namei isn't absolute it needs a getcwd() stuck on the front of it. Is it reasonable to just do that? I don't think so. It would be a regression in that it would break things in no-path-up-to-/ situations; it also would either fail or expose paths that shouldn't be accessible in path-to-/-isn't-readable situations. Does anyone know how other implementations of $ORIGIN deal with these cases? For the time being, even if we just provide a relative path when getcwd fails it'd still be more functional than the current situation. Why not get the kernel to keep a reference to the vnode of the directory that contained the process image? Then use some flag to open() (or similar) to open a file relative to that vnode? Someone already invented $ORIGIN with absolute paths and we ought to support that, especially since we already have a half-baked implementation. I've often thought something like this (keeping the directory) would be a decent approach but it's not clear how to set it up, or e.g. what ought to be done if the executable is moved or other such cases. Hacking on namei() just to get an absolute path in $(.MAKE)? It's not just make; see for example PR 42420. -- David A. Holland dholl...@netbsd.org
Re: WAPBL and IDE mac68k
On Tue, Jun 01, 2010 at 08:31:56AM -0400, der Mouse wrote: It happens even when I try to boot to single user mode because I see the message saying /: replaying log to memory right before it panics. Not sure why the journaling stuff happens when booting in single user mode without mounting any filesystems, but that's what it is. You can't boot without mounting any filesystems at all; / must be mounted, at a minimum. You can boot as far as panic: cannot mount root, though, by e.g. disabling wd. Anyway, what's the panic message? -- David A. Holland dholl...@netbsd.org
Re: Layered fs, vnode locking and v_vnlock removal
On Wed, Jun 02, 2010 at 05:58:40PM +0100, David Laight wrote: In the long term VOP_xxxLOCK() should become part of the file systems. AFAIK there is a consensus between yamt@, ad@ and thorpej@ that locking should be moved down to the filesystems. There was some discussion about it here some time before. Yes, this keeps coming up and I keep trying to explain why it's misguided. There is a lurking problem making read/write atomically update the file offset. I suspect that is currently covered by the vnode lock. Might only affect O_APPEND - but I've seen systems get that wrong! Not to mention the problem of correctly setting the file position when read/write fault on a userspace address part way through a transfer. Other important cases include atomicity of O_CREAT and permission checks done in VFS-level code. These cases can all be handled by cutting and pasting the code into every file system, but we really don't want to do that. -- David A. Holland dholl...@netbsd.org
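The O_APPEND contract mentioned above is easy to state from userland: each write goes to the current end of file regardless of where the descriptor's offset was left. The following sketch checks just that. append_twice() is an invented name, and a single-threaded test like this cannot show that the seek-and-write step is atomic against other writers, only that the offset is ignored.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 'first', rewind, then write 'second'
 * through an O_APPEND descriptor. Returns the resulting file size,
 * or -1 on error. With correct O_APPEND handling the rewind does
 * not affect where 'second' lands. Short writes are not retried;
 * this is only meant for tiny test strings.
 */
off_t
append_twice(const char *path, const char *first, const char *second)
{
	off_t end;
	int fd;

	assert(path != NULL && first != NULL && second != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
	if (fd == -1)
		return -1;
	if (write(fd, first, strlen(first)) == -1 ||
	    lseek(fd, 0, SEEK_SET) == -1 ||
	    write(fd, second, strlen(second)) == -1) {
		close(fd);
		return -1;
	}
	end = lseek(fd, 0, SEEK_END);
	close(fd);
	return end;
}
```

If the second write overwrote the first instead of landing after it, the returned size would be the longer of the two strings rather than their sum.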
Re: Layered fs, vnode locking and v_vnlock removal
On Tue, Jun 01, 2010 at 11:44:03AM +0200, Juergen Hannken-Illjes wrote: It's not immediately clear how either of these ought to work, so I'm concerned that making the infrastructure less general will lead to problems. 1) One upper to many lower vnodes This is a file system like unionfs. It has to lock either one or many lower vnodes and does/will not earn anything of shared locks. 2) Many upper to one lower vnode Such a layered file system could use a lock shared between ALL upper and the lower vnode. Always taking the lower vnode's lock will do the same. I see no need for shared locks here. That seems plausible. In the long run I intend to make all the vnode ops symmetric with respect to locking, which should make a lot of this less toxic, but at the rate I've been able to work on this stuff we won't be there anytime soon. The asymmetry comes from functions like null_mount() where a vnode gets locked by the lower layer and unlocked by the upper layer. A lower layer expecting its VOP_LOCK() to be matched by a VOP_UNLOCK() will fail badly. ...that is just broken, yes. If you can beat sanity into that, please do. -- David A. Holland dholl...@netbsd.org
Re: wedges on vnd(4) patch
On Mon, Jun 21, 2010 at 06:23:02PM -0400, Christos Zoulas wrote: Well, I find the different indentation styles typically used for the braces clumsy and not following the standard. Or even when they do, they cause the code to move too much to the right: FWIW, I prefer this:

   switch (c) {
       case 'a': {
           decl;
           stmt;
       }
       break;
   }

because I've never liked not indenting the cases and half an indent seems a good way to do that. But that is also at variance with current rules. Whether variables should be at the top of the function or not depends heavily on how big and complicated the blocks are. (And when they are big and complicated, splitting each case into its own function is also not always desirable.) -- David A. Holland dholl...@netbsd.org
Re: Move the vnode lock into file systems
On Sat, Jun 26, 2010 at 10:39:27AM +0200, Juergen Hannken-Illjes wrote: The vnode lock operations currently work on a rw lock located inside the vnode. I propose to move this lock into the file system node. This place is more logical as we lock a file system node and not a vnode. This becomes clear if we think of a file system where one file system node is attached to more than one vnode. Ptyfs allowing multiple mounts is such a candidate. I'm not convinced that sharing locks at the VFS level is a good idea for such cases. While ptyfs specifically is pretty limited and unlikely to get into trouble, something more complex easily could. I don't think we ought to encourage this until we have a clear plan for rebind mounts and the resulting namespace-handling issues. (Since in a sense that's the general case of multiply-mounted ptyfs.) Since I'm pretty sure that a reasonable architecture for that will lead to sharing vnodes between the multiple mountpoints, and manifesting some kind of virtual name objects akin to Linux's dentries to keep track of which mountpoint is which, I don't see that it's necessary or desirable to specialize the locking... Do you have another use case in mind? (In the absence of some clear benefits I don't think it's a particularly good idea to paste a dozen or two copies of genfs_lock everywhere. But folding vcrackmgr() into genfs_lock and genfs_unlock seems like a fine idea.) -- David A. Holland dholl...@netbsd.org
Re: Move the vnode lock into file systems
On Sun, Jun 27, 2010 at 06:18:19PM +0200, Juergen Hannken-Illjes wrote: (In the absence of some clear benefits I don't think it's a particularly good idea to paste a dozen or two copies of genfs_lock everywhere. But folding vcrackmgr() into genfs_lock and genfs_unlock seems like a fine idea.) Primary goal is to abstract vnode locking into the vnode operations only and therefore completely removing vlockmgr(). That I can agree with :-) For now I can live with genfs_lock()/v_lock becoming the generic locking interface where v_lock becomes genfs_lock()-private. and that I have no objection to. Vaguely related to which, does anyone object to wrapping VOP_UNLOCK in a vn_unlock() function (doing nothing extra), so as to be symmetric with vn_lock()? I think I've mentioned this before, but I'm not sure, and if so it was a while back... -- David A. Holland dholl...@netbsd.org
Re: Preserving early console output (pre-Copyright stuff)
On Thu, Jul 01, 2010 at 05:18:36AM -0700, Paul Goyette wrote: b) a way to pause long enough to manually transcribe the output? (A simple timed delay would work, although a "Press any key to continue" would be easier!) It may work to do

   printf("Press a key...\n");
   cnpollc(1);
   (void)cngetc();
   cnpollc(0);

... it used to, but that was ~15 years ago. -- David A. Holland dholl...@netbsd.org
Re: Using coccinelle for (quick?) syntax fixing
On Sat, Aug 14, 2010 at 01:36:15PM +0200, Jean-Yves Migeon wrote: I would say don't do __func__ for messages like this; it doesn't really serve much purpose vs. typing in a name, causes the observable behavior to change silently if the code gets reorganized, and makes it much harder to grep the source for an error message you're seeing. Understood not a big deal though. Changing it to use aprint_whatnot_foo(), though, might be worthwhile... Hmm, I thought aprint_* functions were essentially for autoconf messages? Erm... um, yeah, never mind, I was thinking of the way they automatically rope in the device name and unit number. -- David A. Holland dholl...@netbsd.org
Re: [ANN] Lunatik -- NetBSD kernel scripting with Lua (GSoC project
On Tue, Oct 12, 2010 at 12:53:10AM -0300, Lourival Vieira Neto wrote: A signature only tells you whose neck to wring when the script misbehaves. :-) Since a Lua script running in the kernel won't be able to forge a pointer (right?), or conjure references to methods or data that weren't in its environment at the outset, you can run it in a highly restricted environment so that many kinds of misbehavior are difficult or impossible. Or I would *think* you can restrict the environment in that way; I wonder what Lourival thinks about that. I wouldn't say better =). That's exactly how I'm thinking about addressing this issue: restricting access to each Lua environment. For example, a script running in packet filtering should have access to a different set of kernel functions than a script running in process scheduling. ...so what do you do if the script calls a bunch of kernel functions and then crashes? If a script crashes, it raises an exception that can be caught by the kernel (as an error code). Right... so how do you restore the kernel to a valid state? -- David A. Holland dholl...@netbsd.org
Re: kernel module loading vs securelevel
On Sat, Oct 16, 2010 at 11:23:29AM +0900, Izumi Tsutsui wrote: It would seem to be intentional. After all, kernel modules can do all sorts of nasty things if they want to. In that case, module autoload/autounload is not functional at all and we have to specify all possible necessary modules explicitly during boot time?? Yes. Otherwise it's quite easy to defeat securelevel by causing the loading of a module that resets it to -1. -- David A. Holland dholl...@netbsd.org
Re: kernel module loading vs securelevel
On Sun, Oct 17, 2010 at 03:38:42AM +0900, Izumi Tsutsui wrote: Heh, then why have we had it on i386 for years? Because of the X server. You are just saying: We introduced a significant security regression just for our own convenience. Perhaps... I see no proper reason to avoid INSECURE for MODULAR if it's okay for X. ...and I'm not convinced of this, primarily because (from a practical point of view) X is unavoidable and unfixable, whereas modules are neither. This gets back to the underlying question of what purpose modules are supposed to serve, and as I know everyone knows what I think and is sick and tired of hearing about it, I'll pipe down. -- David A. Holland dholl...@netbsd.org
Re: kernel module loading vs securelevel
On Sat, Oct 16, 2010 at 12:03:52PM -0700, Paul Goyette wrote: autoload/autounload does NOT perform any authorization checks - please look at the code! No checking of securelevel occurs, as far as I can see. For autoload, the module name must not contain a '/', so if the module is being loaded from the file system it must be loaded from the blessed /stand/${ARCH}/${VERSION}/modules directory. Including the INSECURE option will have no effect on autoloading of modules. If this is true it makes securelevel useless; all you need to do is put a hostile module in the right place and cause it to be autoloaded. (Remember the point of securelevel is that even root can't lower it.) John Nemeth has already pointed out that my reading of the code was flawed. Module autoloading _does_ call kauth for authorization. The kauth listener provided by the module subsystem returns ALLOW for all autoload calls, but this gets overridden by another kauth listener, so autoload still gets denied. Good that it's not true then :-) It should be sufficient, I think, to check at boot time that any module that can be autoloaded is marked immutable. And also make the blessed directory itself immutable? :) As I recall the semantics of immutable are such that this isn't necessary to protect modules that are present at boot time (that is, they can't be unlinked/renamed/etc.), and if there are autoloadable modules whose names aren't present at boot time, they'll fail the check. -- David A. Holland dholl...@netbsd.org
Re: kernel module loading vs securelevel
On Sun, Oct 17, 2010 at 06:13:11AM +1100, matthew green wrote: ...and I'm not convinced of this, primarily because (from a practical point of view) X is unavoidable and unfixable, whereas modules are neither. actually, with DRM (and KMS) i believe we will be able to run the X server as non-root. Yes, after what, some fifteen years? :-/ -- David A. Holland dholl...@netbsd.org
Re: CVS commit: src/bin/cp
On Mon, Oct 25, 2010 at 05:49:11PM +0100, David Laight wrote: No, since in general the file is also being extended (certainly in this case it is) it also has to lock the file size, and that's going to deny stat() until it's done. A stat request during a write can safely return the old size. Yes it can, if it has it. Hence multiversion... -- David A. Holland dholl...@netbsd.org
Re: RFC: ppath(3): property list paths library
On Mon, Nov 01, 2010 at 08:00:09PM -0500, David Young wrote: I'm working on a library called ppath(3) for making property lists more convenient to use in the kernel. With ppath(3), you refer to a property to read/write/delete in a property list by the path from the list's outermost container. Comments welcome. Speaking from the POV of someone who's been working on querying semistructured data for several years now... I have a pile of high-level questions: (1) can you articulate the expressive power you intend for your path expressions, and why that's a logical stopping point vs. more expressive things; (2) what if any facilities do you envision for checking paths against proplist schemas when/if we ever manage to sort out a system for that; (3) what model do you have for dealing with cases when the values found at the paths provided are not what the user is expecting; and (4) what model do you have for dealing with cases where the path does not name a single unique value or position, if that's possible? (I'm not trying to give you a hard time, I've just spent a long time dealing with these problems and I don't want to see familiar mistakes reinvented.) -- David A. Holland dholl...@netbsd.org
Re: RFC: ppath(3): property list paths library
On Wed, Nov 03, 2010 at 09:28:11AM +0100, Martin Husemann wrote: This is one of the occasions where I would love to use C++ and templates in the kernel ;-} I think what you mean is that you'd like to have a language that has some kind of sane parameterized types... :-/ -- David A. Holland dholl...@netbsd.org
Re: mutexes, locks and so on...
On Fri, Nov 12, 2010 at 02:21:34PM +0100, Johnny Billquist wrote: then I realized that this solution would break if people actually wrote code like lock(a); lock(b); release(a); release(b) ...which is very common. It is? I would have thought (and hoped) that people normally did: lock(a); lock(b); unlock(b); unlock(a) Nope. You might get away with this if we always did strict two-phase locking in the kernel, but we don't (no kernel does) to avoid excessive contention on e.g. the vnode for / and other such locks. Meanwhile, lock coupling tends to appear anytime one is transitioning through a data structure and wants to maintain consistency. Thus the typical usage is something like lock(a); b = a->b; lock(b); unlock(a); c = b->c; lock(c); unlock(b); do_work(c); unlock(c); It can be shown that this preserves conflict serializability as long as nothing ever follows the structure in the opposite order (c -> b -> a). The traditional place you find code like this is in pathname translation, but in a MP kernel it pops up in lots of other places too. I agree that it's not wrong, but untidy. Keeping track of ipl levels could have been kept within the mutex instead, thus simplifying both the lock and unlock code, at the expense that people actually had to unlock mutexes in the reverse order they acquired them. Just as with the splraise/splx before. That however isn't workable. (And it still wouldn't be workable even in a kernel that had separate spinlocks and sleep-locks.) -- David A. Holland dholl...@netbsd.org
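For readers who have not met lock coupling before, here is a minimal userland sketch of the lock(a); lock(b); unlock(a) pattern described above. It uses POSIX mutexes rather than kernel mutexes, and struct node and last_value() are hypothetical names invented for the illustration; nothing here is taken from the NetBSD tree.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical node type: every node carries its own lock. */
struct node {
	pthread_mutex_t lock;
	struct node *next;	/* protected by lock */
	int value;		/* protected by lock */
};

/*
 * Walk from head to the last node using lock coupling: take the
 * next node's lock before dropping the current one, so no other
 * thread can slip in between the two. Note that the locks are
 * released in acquisition order, not in reverse order.
 */
static int
last_value(struct node *head)
{
	struct node *cur = head, *next;
	int v;

	assert(head != NULL);
	pthread_mutex_lock(&cur->lock);
	while ((next = cur->next) != NULL) {
		pthread_mutex_lock(&next->lock);	/* lock(b) */
		pthread_mutex_unlock(&cur->lock);	/* unlock(a) */
		cur = next;
	}
	v = cur->value;					/* do_work(c) */
	pthread_mutex_unlock(&cur->lock);
	return v;
}
```

As the message says, this is only safe if every walker follows the structure in the same direction; a walker going from c back to a could deadlock against this one.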
Re: Please do not yell at people for trying to help you.
On Fri, Nov 12, 2010 at 08:31:39PM +, Eduardo Horvath wrote: No it doesn't because all those macros assume the value is being transferred from one register to another rather than register to memory. The assignment: foo.size = htole64(size); Cannot be replaced with: __inline __asm(stxa %1, [%0] ASI_LITLE : foo.size : size); The right way to fix this is in the compiler; teach the compiler about opposite-endian variables and let it pick the right instructions for accessing them, and the problem goes away. -- David A. Holland dholl...@netbsd.org
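Absent such compiler support, one common fallback (not something proposed in this thread) is to express the store to memory byte by byte and leave instruction selection to the optimizer. A minimal sketch; store_le64() and load_le64() are hypothetical names, and whether a given compiler turns them into a single little-endian store or load is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Store a 64-bit value into memory in little-endian byte order,
 * independent of the host's own endianness.
 */
static void
store_le64(void *dst, uint64_t v)
{
	unsigned char *p = dst;

	for (int i = 0; i < 8; i++)
		p[i] = (unsigned char)(v >> (8 * i));
}

/* Read the value back, again independent of host endianness. */
static uint64_t
load_le64(const void *src)
{
	const unsigned char *p = src;
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}
```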
Re: mutexes, locks and so on...
On Sat, Nov 13, 2010 at 01:45:40AM +0900, Izumi Tsutsui wrote: Wow. I guess you can add me to the list of people leaving. There is no perfect world and we don't have enough resources. Any help to keep support for ancient machines is appreciated, but complaints like "we should support it" which prevents improvements of mainstream will just make NetBSD rotten. What "prevents improvements of mainstream" are we talking about here? We have someone who wants to provide tuned vax-specific locking primitives. The absolute worst possible cost to the mainstream that this incurs is a bit of extra cpp and config hackery. (Can we all please get a grip?) -- David A. Holland dholl...@netbsd.org
Re: CVS commit: src/sys/arch/powerpc/oea
(moving this to tech-kern because it's the right place and per request) On Mon, Nov 15, 2010 at 11:24:21AM +0900, Masao Uebayashi wrote: Every header file should include the things it requires to compile. Therefore, there should in principle be no cases where a header file (or source file) needs to #include something it doesn't itself use. This clarifies my long-unanswered question, thanks! *bow* I've (re)built about 300 kernels in the last days. I've found: - sys/sysctl.h's struct kinfo_proc should be moved into sys/proc.h (I've done this locally). Otherwise all sysctl node providers includes sys/proc.h and uvm/uvm_extern.h. (This is where I started...) I'm not sure this is a good plan in the long run. Shouldn't it at some point be unhooked fully from the real proc structure? - sys/syscallargs.h should be split into pieces, otherwise all its users have to know unrelated types (sys/mount.h, sys/cpu.h). Since system calls don't in general pass structures by value, it shouldn't need most of those header files, just forward declarations of the structs. (this is, btw, one of the reasons to avoid silly typedefs) - sys/proc.h's tsleep(), wakeup(), and friends should be moved into some common header, because it's widely used API. sys/proc.h will be used only for struct proc related things. Given that this is a deprecated API in the long term I'm not sure it's worthwhile. -- David A. Holland dholl...@netbsd.org
Re: CVS commit: src/sys/arch/powerpc/oea
On Mon, Nov 15, 2010 at 10:41:55PM +, David Laight wrote: Indeed. Properly speaking though, headers that are exported to userland should define only the precise symbols that userland needs; kernel-only material should be kept elsewhere. One start would be to add a sys/proc_internal.h so that sys/proc.h can be reduced to only stuff that userspace and some kernel parts are really expected to use. The right way (TM) is to create src/sys/include and put kernel-only headers in there, to be included as e.g. <proc.h>. In the long term the user-visible parts would go in src/sys/include/kern/proc.h, which would be included as <kern/proc.h>. (It has to be kern/ and not sys/ because a couple decades of standards creep and poor API maintenance has led to half of sys/*.h properly belonging to libc. In order to avoid repeating this problem in the future, all APIs should be defined without direct reference to any kern/*.h files; those should only be included from other libc or kernel headers. So libc would grow its own sys/proc.h because that's part of the libkvm API.) When done completely the entire kern/ subtree is the same for both userland and the kernel, including MD headers, no other random kernel headers need to be installed, and there's no longer any need for #ifdef _KERNEL. As much as this probably sounds obvious, the first couple of times I set out to do it myself I got it wrong. (And it's wrong in Linux too.) -- David A. Holland dholl...@netbsd.org
Re: CVS commit: src/sys/arch/powerpc/oea
On Mon, Nov 15, 2010 at 03:47:32PM -0500, der Mouse wrote: [...] just forward declarations of the structs. (this is, btw, one of the reasons to avoid silly typedefs) I'm not sure what typedefs have to do with it. typedeffing a name to an incomplete (forward) struct type works just fine: struct foo; typedef struct foo FOO; (You can't do anything with a FOO without completing the struct type, but you can work with pointers to them) But now there's no protection against divergence; that is, if I have typedef struct foo FOO; in one header and a typo'd typedef struct tfoo FOO; in another, assuming suitable ifdef guards as already mentioned, now FOO can be two different things, and the inconsistency in the cut-and-pasted-material might not be detected for some time. However, if I just have struct foo; in multiple headers there aren't very many ways this can be wrong that will compile at all. The only common way for this to go bad is if you've removed struct foo from your program completely; then you have to hunt down all the forward declarations by hand and kill them off. But that's more or less unavoidable. The difference between these two cases is inherent in the fact that the typedef form is declaring two things and the plain struct declaration is declaring only one... there's no particular reason C couldn't provide a way to create a forward declaration (without definition) of a typedef name, but it doesn't. -- David A. Holland dholl...@netbsd.org
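To make the plain-struct case concrete, here is a small sketch of the pattern being defended: the forward declaration is repeated wherever it is needed, pointers to the incomplete type are passed around freely, and only one place completes the type. struct foo and foo_get() are placeholder names, and the "headers" are collapsed into one file so the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* What one (hypothetical) header would contain. */
struct foo;				/* forward declaration only */
int foo_get(const struct foo *);	/* pointers to it are fine */

/* A second header can repeat the declaration harmlessly. */
struct foo;

/* Only the implementation file completes the type. */
struct foo {
	int value;
};

int
foo_get(const struct foo *f)
{
	assert(f != NULL);
	return f->value;
}
```

A misspelled tag in one of the repeated declarations (say struct tfoo) would simply declare a different, unrelated type, and any attempt to pass one where the other is expected draws a diagnostic; that is the protection the typedef form gives up.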
Re: module.prop rename
On Sat, Nov 20, 2010 at 07:50:03PM -0800, John Nemeth wrote: } embed the property info in the module file itself? That may or may not make more sense, but it would require a lot more work (i.e. inventing a tool to extract them, edit them, and insert them; and modifying the module loading code to extract them). I have very little interest in doing that work at this time. Fair enough. -- David A. Holland dholl...@netbsd.org
Re: misuse of pathnames in rump (and portalfs?)
On Tue, Nov 23, 2010 at 11:13:02PM +, David Holland wrote: However, I discovered today that rumpfs's VOP_LOOKUP implementation relies on being able to access not just the name to be looked up, but also the rest of the pathname namei is working on, specifically including the parts that have already been translated. Ok, on further inspection it appears that this is overly pessimistic. It looks, rather, as if rumpfs (specifically the etfs logic) is using the full namei work buffer and hoping that no such parts actually appear in it, because if they do it'll fail. So I think the following change will resolve the problem; can someone who knows how this is supposed to work check it? (If it's ok, there's no need to tamper with VOP_LOOKUP.)
Index: rumpfs.c
===================================================================
RCS file: /cvsroot/src/sys/rump/librump/rumpvfs/rumpfs.c,v
retrieving revision 1.74
diff -u -p -r1.74 rumpfs.c
--- rumpfs.c	22 Nov 2010 15:15:35 -	1.74
+++ rumpfs.c	24 Nov 2010 04:31:07 -
@@ -291,10 +291,9 @@ hft_to_vtype(int hft)
 }
 
 static bool
-etfs_find(const char *key, struct etfs **etp, bool forceprefix)
+etfs_find(const char *key, size_t keylen, struct etfs **etp, bool forceprefix)
 {
 	struct etfs *et;
-	size_t keylen = strlen(key);
 
 	KASSERT(mutex_owned(&etfs_lock));
 
@@ -381,7 +380,7 @@ doregister(const char *key, const char *
 		rn->rn_flags |= RUMPNODE_DIR_ETSUBS;
 
 	mutex_enter(&etfs_lock);
-	if (etfs_find(key, NULL, REGDIR(ftype))) {
+	if (etfs_find(key, strlen(key), NULL, REGDIR(ftype))) {
 		mutex_exit(&etfs_lock);
 		if (et->et_blkmin != -1)
 			rumpblk_deregister(hostpath);
@@ -641,13 +640,15 @@ rump_vop_lookup(void *v)
 	if (dvp == rootvnode && cnp->cn_nameiop == LOOKUP) {
 		bool found;
 		mutex_enter(&etfs_lock);
-		found = etfs_find(cnp->cn_pnbuf, &et, false);
+		found = etfs_find(cnp->cn_nameptr, cnp->cn_namelen, &et, false);
 		mutex_exit(&etfs_lock);
 
 		if (found) {
-			char *offset;
+			const char *offset;
 
-			offset = strstr(cnp->cn_pnbuf, et->et_key);
+			/* pointless as et_key is always the whole string */
+			/*offset = strstr(cnp->cn_nameptr, et->et_key);*/
+			offset = cnp->cn_nameptr;
 			KASSERT(offset);
 
 			rn = et->et_rn;
-- David A. Holland dholl...@netbsd.org
Re: misuse of pathnames in rump (and portalfs?)
On Wed, Nov 24, 2010 at 01:26:04PM -0500, der Mouse wrote: Right. But if you want a guaranteed absolute path you should be able to do it by calling getcwd first. Only if you accept breakage if the current directory no longer has any name. Well, if you can't call getcwd, then it won't work... (I was not suggesting that the getcwd call be wedged inside namei) Of course, if you consider that acceptable, then fine. I don't, not for something as central as namei (though this looks as though you may be talking about only certain filesystems, in which case it may be acceptable). We'll probably end up doing it on every exec, since there's currently no way to tell from the ELF headers whether $ORIGIN needs to be set (this should be considered a bug) but failure is ok for that. In compat_svr4 I doubt anyone cares much if it fails in corner cases. -- David A. Holland dholl...@netbsd.org
Re: misuse of pathnames in rump (and portalfs?)
On Wed, Nov 24, 2010 at 08:30:18PM +0200, Antti Kantee wrote: I think it makes more sense for doregister to check for at least one leading '/' and remove the leading slashes before storing the key. Then the key will match the name passed by lookup; otherwise the leading slash won't be there and it won't match. (What I suggested last night is broken because it doesn't do this.) Ah, yea, the leading slashes will be stripped for lookup, so we can't get an exact match for those anyway. So, let's define it as string beginning with /, leading /'s collapsed to 1. Ok. See below (replaces the patch upthread): All users I can find pass an absolute path. ok, good
diff -r 66985053a079 sys/rump/librump/rumpvfs/rumpfs.c
--- a/sys/rump/librump/rumpvfs/rumpfs.c	Wed Nov 24 01:34:10 2010 -0500
+++ b/sys/rump/librump/rumpvfs/rumpfs.c	Wed Nov 24 15:41:26 2010 -0500
@@ -324,6 +324,13 @@ doregister(const char *key, const char *
 	devminor_t dmin = -1;
 	int hft, error;
 
+	if (key[0] != '/') {
+		return EINVAL;
+	}
+	while (key[0] == '/') {
+		key++;
+	}
+
 	if (rumpuser_getfileinfo(hostpath, &fsize, &hft, &error))
 		return error;
 
@@ -396,7 +403,7 @@ doregister(const char *key, const char *
 
 	if (ftype == RUMP_ETFS_BLK) {
 		format_bytes(buf, sizeof(buf), size);
-		aprint_verbose("%s: hostpath %s (%s)\n", key, hostpath, buf);
+		aprint_verbose("/%s: hostpath %s (%s)\n", key, hostpath, buf);
 	}
 
 	return 0;
@@ -641,13 +648,15 @@ rump_vop_lookup(void *v)
 	if (dvp == rootvnode && cnp->cn_nameiop == LOOKUP) {
 		bool found;
 		mutex_enter(&etfs_lock);
-		found = etfs_find(cnp->cn_pnbuf, &et, false);
+		found = etfs_find(cnp->cn_nameptr, &et, false);
 		mutex_exit(&etfs_lock);
 
 		if (found) {
-			char *offset;
+			const char *offset;
 
-			offset = strstr(cnp->cn_pnbuf, et->et_key);
+			/* pointless as et_key is always the whole string */
+			/*offset = strstr(cnp->cn_nameptr, et->et_key);*/
+			offset = cnp->cn_nameptr;
 			KASSERT(offset);
 
 			rn = et->et_rn;
-- David A. Holland dholl...@netbsd.org
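The key normalization agreed on in this message (require a leading '/', then drop all leading slashes so the stored key matches what lookup passes in) can be tried in isolation. The sketch below is a standalone userland rendering of that check; normalize_key() is a hypothetical helper name and is not part of rumpfs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper mirroring the check added to doregister():
 * reject keys that are not absolute, then skip every leading '/'.
 * On success *out points into key at the first non-slash character.
 */
static int
normalize_key(const char *key, const char **out)
{
	if (key[0] != '/')
		return EINVAL;
	while (key[0] == '/')
		key++;
	*out = key;
	return 0;
}
```

With this, "/dev/disk" and "///dev/disk" both normalize to "dev/disk", while a relative key such as "dev" is refused with EINVAL.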
Re: radix tree implementation for quota ?
On Sun, Nov 28, 2010 at 09:47:02PM +, David Holland wrote: (Also, why a radix tree? Radix trees are generally not very efficient. If you're going to, though, you might want to reuse the direct, indirect, double indirect, etc. method FFS uses for block mapping.) ...and the easiest way to do this is to put the quota in a sparse file... -- David A. Holland dholl...@netbsd.org
Re: radix tree implementation for quota ?
On Sun, Nov 28, 2010 at 11:43:48PM +0100, Joerg Sonnenberger wrote: A radix tree is kind of a bad choice for this purpose. The easiest approach is most likely to have [a btree] I would go with an expanding hash table of some kind, e.g. size is 2^n pages, hash & (2^n - 1) tells you the page to look at; when you fill up, double the size and take an extra bit from the hash value. If you use a good hash function you should get decent occupancy rates; expanding requires rewriting every page, but for systems where the number of uids is more or less constant over time (which is most systems) this won't happen very much... and remember that for 30,000 users the total size of quota data is only about 1M anyhow so chugging it around once in a while isn't a big deal. However, I'm still not convinced the sparse file really is a serious problem in practice. -- David A. Holland dholl...@netbsd.org
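A small sketch of the page-selection rule for such an expanding table, assuming the table holds 2^nbits pages and the hash is 32 bits wide. page_for() is a hypothetical name, and the hash function itself is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the page for a hash value in a table of 2^nbits pages by
 * masking off the low nbits bits. nbits must be less than 32.
 */
static uint32_t
page_for(uint32_t hash, unsigned nbits)
{
	assert(nbits < 32);
	return hash & ((UINT32_C(1) << nbits) - 1);
}
```

Doubling the table means going from nbits to nbits + 1. Because only one more bit of the hash is consulted, an entry that lived on page p either stays on page p or moves to page p + 2^nbits, which is why an expansion can be done as a single pass over the existing pages.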
Re: radix tree implementation for quota ?
On Mon, Nov 29, 2010 at 11:12:21AM -0500, der Mouse wrote: Without any real data on what UID distribution looks like in practice, we're all speculating in a vacuum here. Just for shits and giggles I ran this on a real password file with about 350 users that's had lots of churn since it was first established. Using Joerg's hash with a table of size 512 gave 90 collisions; using x % 509 (the nearest available prime) gave 91. For comparison I tried some of the (more expensive) hashes from db4 and got 111, 94, and 84. None of the hash functions generated significant hotspots. It seems like a nonissue. (Of course, hashing 350 things isn't exactly a big challenge...) -- David A. Holland dholl...@netbsd.org
Re: Heads up: moving some uvmexp stat to being per-cpu
On Tue, Dec 14, 2010 at 08:49:14PM -0800, Matt Thomas wrote: I have a fairly large but mostly simple patch which changes the stats collected in uvmexp for faults, intrs, softs, syscalls, and traps from 32 bit to 64 bits and puts them in cpu_data (in cpu_info). This makes more accurate and a little cheaper to update on 64bit systems. Would this be a good opportunity to retire the sysctl that returns the non-ABI-stable uvmexp, or perhaps change its name to indicate that it's not stable/supported? -- David A. Holland dholl...@netbsd.org
parsepath op
Because we have at least one FS that may not want paths being looked up to be split on '/', namely rump etfs, and arguably the most important simplification to VOP_LOOKUP is to make it handle one path component at a time, we need a way for a FS to decide how much of a path it wants to digest at once. After thinking about this for a while I think the best approach is to add a parsepath op, which given a pathname returns the length of the string to consume. There are two major questions about how this should work: one is whether it should be a vnode or fs operation, and the other is how onionfs should handle it. I'm currently leaning towards a vnode op and applying the restriction that it must select either the first component (up to the first slash) or the whole remaining path string. This makes it reasonably possible for onionfs to deal with cases where its layers don't agree on the length to consume. (Although I think for the time being I'll just let such cases fail.) Making it a fs op instead would require that onionfs always call the operation again on both layers inside lookup; this would add a certain amount of overhead. Unfortunately, because etfs requires that the choice depend on the particular pathname given, I don't think this can be done just by setting flags somewhere. However, I can't think of a credible use case that requires more flexibility than selecting either one component or the whole path. (This includes a moderately crazy research project I proposed years ago.) So I don't think a more general operation is needed. I don't have a candidate patch yet (or even a draft patch); adding new vops requires touching an unreasonably large number of places. But this seems like the kind of thing where posting early is a good idea... -- David A. Holland dholl...@netbsd.org
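Since there is no patch yet, the following is only a guess at the shape such an operation might take, reduced to a plain function so it can be compiled in userland. The name parsepath(), its signature, and the wants_whole flag (standing in for whatever per-pathname decision etfs would make) are all assumptions of this sketch, not a proposed interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical parsepath: return how many bytes of path the file
 * system wants to consume in one lookup step. Following the
 * restriction described above, the answer is either the first
 * component (up to, but not including, the first '/') or the
 * whole remaining string.
 */
static size_t
parsepath(const char *path, bool wants_whole)
{
	if (wants_whole)
		return strlen(path);
	return strcspn(path, "/");
}
```

For "usr/bin/cc" an ordinary file system would consume 3 bytes ("usr"), while a file system that wants the whole remaining path would consume all 10.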
Re: semantics of TRYEMULROOT
On Sun, Jan 02, 2011 at 09:19:30AM -0500, matthew sporleder wrote: [TRYEMULROOT] Since it's on http://www.netbsd.org/~dholland/buglists/file.html , I'm sure you're aware of it, but would 41678 be solved? http://gnats.NetBSD.org/cgi-bin/query-pr-single.pl?number=41678 Doubtful, as that behavior's inside onionfs. I don't think applying the same simplification to onionfs would result in the behavior you want there, either. (Anyhow, like I wrote in that PR, the only real way forward for onionfs is to try to figure out a self-consistent model for the semantics. If as I suspect that turns out to be impossible, the right way forward is to kill onionfs and replace it with several similar fses each with a clear set of semantics specialized for a particular set of purposes.) -- David A. Holland dholl...@netbsd.org
Re: semantics of TRYEMULROOT
On Sun, Jan 02, 2011 at 06:14:57PM +, David Laight wrote: On Sun, Jan 02, 2011 at 09:52:31AM +, David Holland wrote: Has anyone ever sat down and clearly worked out the desired semantics for TRYEMULROOT? I've noted inconsistencies in the past, and because in a number of ways it's a special case of onionfs I'm somewhat concerned that there may be cases where the proper or desired behavior is unclear or ambiguous. When I added TRYEMULROOT I did so in order to maintain the same actions as the old code - which did all sorts of horrid checks before copying a changed path out into the stackgap. At that time I didn't want to be worried about which code paths should, or should not, look in the emulated root - since many of the emulations were probably broken. That's more or less what I thought :-) You are probably the only one who has actually looked into this deeply! It does seem likely that only ENOENT (and similar) should cause the main fs to be checked. Yeah, but which are similar? Clearly e.g. ENOTDIR, maybe ELOOP; one could make a case either way for EROFS and EACCES though. And some things that might turn up, like ESTALE or ECONNTIMEDOUT, aren't clear at all... The question of where to commit to an object to work with is more important than the precise list of errors to retry on, because the former is structural and the latter is (relatively) cosmetic. 
Currently operations like mkdir do (roughly) NDINIT(&nd, CREATE, LOCKPARENT, path); namei(&nd); if (nd.ni_vp != NULL) { return EEXIST; } VOP_MKDIR(nd.ni_dvp, &nd.ni_cnd); vput(nd.ni_dvp); but the plan is to change this to (roughly) char last[NAME_MAX]; namei_parent(path, &dvp, last, sizeof(last)); vn_lock(dvp); VOP_MKDIR(dvp, last); vn_unlock(dvp); vrele(dvp); which is simpler and a lot tidier, both on the surface and inside the fs; however, if TRYEMULROOT is wanted it's obviously not desirable to need a retry loop here (and in each of the several other similar functions) -- therefore I'd like namei_parent to be able to commit to a (parent) directory to operate in, handling TRYEMULROOT inside itself only. In addition to fitting my design better I think this provides a more consistent semantics; I just want to make sure we think it'll work before committing to it. -- David A. Holland dholl...@netbsd.org
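On the narrower question of which errors count as "similar" to ENOENT, the decision could be isolated in one predicate so that the list is easy to audit and change. The sketch below is purely illustrative: emul_should_fall_back() is a hypothetical name, and the list contains only the two errors the message treats as clear. ELOOP, EROFS, EACCES, ESTALE and the rest are left out because the thread leaves them open, not because excluding them is known to be right.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical predicate: after a lookup under the emulation root
 * fails with this error, should the lookup be retried in the real
 * root? Only the uncontroversial cases are listed.
 */
static bool
emul_should_fall_back(int error)
{
	switch (error) {
	case ENOENT:
	case ENOTDIR:
		return true;
	default:
		/* Undecided errors are treated as final in this sketch. */
		return false;
	}
}
```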
Re: parsepath op
On Sun, Jan 02, 2011 at 10:48:03AM +, David Laight wrote: On Sun, Jan 02, 2011 at 09:17:11AM +, David Holland wrote: Because we have at least one FS that may not want paths being looked up to be split on '/', namely rump etfs, and arguably the most important simplification to VOP_LOOKUP is to make it handle one path component at a time, we need a way for a FS to decide how much of a path it wants to digest at once. Slightly related is something I did for an embedded os, where I appended /param1/param2 to a device path. [...] Right, I think the original art for this came from AmigaDOS. It's always seemed like a useful idiom to me (much better than creating a dozen variant device nodes for every physical device) and I don't intend to do anything that will rule it out. Basically to make it work we'd need to patch namei to allow calling VOP_LOOKUP on device nodes instead of giving ENOTDIR, and add a spec_lookup implementation that does some kind of string-to-ioctl mapping to issue state changes on the device vnode. Currently writing a VOP_LOOKUP implementation is a black art, but that's supposed to change, and I don't think this would require any other special hacks. -- David A. Holland dholl...@netbsd.org
Re: semantics of TRYEMULROOT
On Sun, Jan 02, 2011 at 09:19:51PM +, Eduardo Horvath wrote: TRYEMULROOT should only open existing objects on the emul path, it should never create anything new, so you would never want to use it for mkdir. I don't know if that means you need to pass an extra flag to namei_parent() or what. That's what I thought at one point, but it's set on almost everything, including mount, open with or without O_CREAT, mknod, mkfifo, link, symlink, mkdir, and rename, and also unlink and rmdir. If that's not how it should be, things should be tidied up. (And, if neither mkdir-type nor rmdir-type operations should have TRYEMULROOT, my original question becomes largely or entirely moot.) -- David A. Holland dholl...@netbsd.org
Re: prop_*_internalize and copyin/out for syscall ?
On Mon, Jan 17, 2011 at 04:33:25PM +0100, Manuel Bouyer wrote: so I'm evaluating how to use proplib for the new quotactl(2) I'm working on. er, why? When I was looking at quota stuff in the context of lfs and other fs types, the existing quotactl interface seemed fine -- it just needs to have a clear separation between the syscall-level structures and the ffs-specific ones. At worst one might want to split struct dqblk in half, so the block and inode limits are addressed separately, something like this: struct quotaentry { uint64_t qe_hardlimit; uint64_t qe_softlimit; uint64_t qe_current; int32_t qe_time; int32_t __qe_spare; }; with additional suitable constants for addressing block vs. inode limits. I really don't see where proplib figures into this. -- David A. Holland dholl...@netbsd.org
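Laid out as a compilable fragment, the suggested structure looks like this. The struct itself is copied from the message; the enum naming the two kinds of limit and the quota_over_soft() helper are hypothetical additions, since the message only says that "additional suitable constants" would be needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* As suggested in the message: one entry per kind of limit. */
struct quotaentry {
	uint64_t qe_hardlimit;
	uint64_t qe_softlimit;
	uint64_t qe_current;
	int32_t qe_time;
	int32_t __qe_spare;
};

/* Hypothetical constants for addressing block vs. inode limits. */
enum quota_limit_type {
	QL_BLOCK,
	QL_FILE,
	QL_NTYPES
};

/* Hypothetical helper: is current usage above the soft limit? */
static bool
quota_over_soft(const struct quotaentry *qe)
{
	return qe->qe_current > qe->qe_softlimit;
}
```

A caller would then keep an array of QL_NTYPES entries per user or group and index it with QL_BLOCK or QL_FILE, instead of carrying both sets of limits in one struct dqblk.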
Re: Dates in boot loaders on !x86
On Tue, Jan 18, 2011 at 04:24:37PM +0100, Joerg Sonnenberger wrote: Well, we derive the version to include from the version file. This is controlled by a central script. What about adding support to expand $DATE$ or some other magic version string, if it is the last in the version file? If you are actively developing, you can add that to the version file and hopefully remember to replace it with a proper version entry before commit. That's unnecessarily complicated. There's prior art for this: NetBSD tanaqui 5.99.41 NetBSD 5.99.41 (TANAQUI) #32: Wed Dec 1 01:20:02 EST 2010 dholland@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386 ^^^ Wouldn't be very hard to do the same for bootloaders. -- David A. Holland dholl...@netbsd.org
Re: Dates in boot loaders on !x86
On Tue, Jan 18, 2011 at 09:39:58PM +0100, Joerg Sonnenberger wrote: That's unnecessarily complicated. There's prior art for this: [...] Please look at the mail that started this thread. newvers provides multiple independent variables, so conditionally providing one of them needs both an option and output mangling in the users. It doesn't need an option, because on a clean build it would always be 0 (or 1) -- if you start hacking, then it would increment itself. Assuming you don't cleandir. (And I didn't say to reuse the kernel's newvers script itself. All this needs is about five lines of sh...) The consensus seems to be that during normal usage, the build date is irrelevant and doesn't provide any value. Based on Martin's suggestion, I will add a MKVERBOSEBOOT variable or so (haven't made my mind up about the name). If it is set, bootprog_kernrev will include the build date as well as user and host name (like the current bootprog_maker). The current bootprog_maker and bootprog_kernrev go away. But anyway, that seems fine. Is it going to be extended to the x86 bootloader? -- David A. Holland dholl...@netbsd.org
Re: turning off COMPAT_386BSD_MBRPART in disklabel
On Mon, Jan 31, 2011 at 05:40:20PM +0100, Matthias Drochner wrote: PR 44496 notes that COMPAT_386BSD_MBRPART is still enabled in disklabel(8), even though it was turned off by default in the kernel early in 4.99.x. The PR also notes that it's not harmless to leave it on. The PR rather leads to the conclusion that the support for old Partition IDs in disklabel(8) is suboptimal. Originally, the code did only consider a partition with the old ID if no new one was found. This apparently got broken when extended partition support was added years later. Yeah, that's a valid point. I guess the question then is whether fixing that will prevent any problematic cases from arising... and whether at this point it's worth worrying about. I suspect very few commodity drives old enough to have been fdisk'd with the old partition ID are still operable, and I suspect that anyone who's got one that hasn't been updated already is qualified to run fdisk... and there are very few cases where anyone would need to run disklabel but not be able to run fdisk first. So I'd really be inclined at this point to just disable the feature. ...also, it's not entirely clear to me what the code is supposed to be doing if there are multiple NetBSD partitions; it looks as if what it *will* do is use the label from the one it sees last and write the same label to all of them. blah, using both fdisk partitions and traditional labels on the same disk has always been a pile of fail. -- David A. Holland dholl...@netbsd.org
Re: remove sparse check in vnd
On Sat, Feb 05, 2011 at 10:07:13PM -0500, der Mouse wrote: Of course, still better would be to fix vnd, though I'm not sure what the right fix would be. What's the problem? My vague understanding was that you could get into deadlocks allocating blocks, but maybe I'm confusing it with something else. -- David A. Holland dholl...@netbsd.org
Re: turning off COMPAT_386BSD_MBRPART in disklabel
On Thu, Feb 03, 2011 at 08:04:26AM +, David Laight wrote: The PR rather leads to the conclusion that the support for old Partition IDs in disklabel(8) is suboptimal. Originally, the code did only consider a partition with the old ID if no new one was found. This apparently got broken when extended partition support was added years later. Yeah, that's a valid point. I guess the question then is whether fixing that will prevent any problematic cases from arising... and whether at this point it's worth worrying about. Possibly the code should be willing to locate and process such a label. Possibly even write it back. But it probably shouldn't 'corrupt' it - ie leave it as a valid label (doesn't it contain sector number relative to the ptn itself? so can't describe any other parts of the disk?) Are *our* ancient disklabels partition-relative? It's so long ago that I'm not sure... but the code currently in disklabel(8) doesn't appear to know anything at all about partition-relative labels. Given the rest of the discussion here, the fact that fixing disklabel(8) properly isn't completely trivial, and tls's recent experience, I think the feature should just be turned off in disklabel... but, just in case, not removed entirely until we branch netbsd-6. Does anyone object to this course of action? -- David A. Holland dholl...@netbsd.org
Re: turning off COMPAT_386BSD_MBRPART in disklabel
On Mon, Feb 07, 2011 at 01:48:57AM -0500, Thor Lancelot Simon wrote: For the record, I am pretty sure it was sysinst, not disklabel, which hosed my disk. Sysinst compiles equivalent code in directly, no? There are only two uses of MBR_PTYPE_386BSD in src/distrib. One is a perfectly innocuous list of partition type IDs. The other is in src/distrib/utils/sysinst/arch/i386/md.c, which changes the partition ID of a MBR_PTYPE_386BSD partition to MBR_PTYPE_NETBSD if no MBR_PTYPE_NETBSD partitions are seen. This is, however, only reached if someone's explicitly attempting to upgrade an existing installation, so it's probably harmless -- I think you got hosed by disklabel. This code should probably be removed from sysinst too, but maybe after -6 is branched. -- David A. Holland dholl...@netbsd.org
Re: turning off COMPAT_386BSD_MBRPART in disklabel
On Sun, Feb 13, 2011 at 01:06:36PM -0500, Thor Lancelot Simon wrote: Not in the failure case I observed (I can now reproduce this, but since it looks like the code in disklabel is going to Go, It has Gone :-) (The remaining question is whether to request pullup to -5; I think I will unless someone is strongly opposed.) If the kernel write-out-label code can do something similar, that ought to get the axe, too. It was disabled by default four years ago. -- David A. Holland dholl...@netbsd.org
Re: Fwd: Status and future of 3rd party ABI compatibility layer
On Wed, Mar 02, 2011 at 12:40:44AM +, Andrew Doran wrote: With modules now basically working we should either retire or move some of these items to pkgsrc so that the interested parties maintain them. An awful lot of the compat stuff is now very compartmentalised, with not much more work to do. There's at least one thing on the long-term wishlist that ideally should be done first: migrating to code generation for the syscall copyin/copyout logic. Given such infrastructure, much of the compat_* code can be replaced with code generator rules. Also, we really need a better story for compiling modules outside the source tree. Installing every random kernel header in /usr/include isn't the right way to go, but we don't currently have enough internal API organization to do much better. (Alternatively, we could come up with a better story for providing a system source tree in pkgsrc, but that also has issues.) Darwin (no GUI, doesn't seem to have been updated in the last 5 years) IRIX These two are strange and very broken, i.e. internally they are in very bad shape. I vote to delete. The version control history will still be there. Can't see strong use cases for either. I don't know that much about compat_darwin, but compat_irix is a pile of ooze. Someone please delete it :-) -- David A. Holland dholl...@netbsd.org
Re: the bouyer-quota2 branch
On Sat, Feb 19, 2011 at 11:21:35PM +0100, Manuel Bouyer wrote: I think the code in the bouyer-quota2 branch is stable now, and ready to be merged to HEAD. Unless objections, I'll merge it in about 2 weeks. [...] So, I thought one of the points of this was to make the quota interface fs-independent, but as it seems to have come out all the pieces and definitions are still in sys/ufs/ufs, and so far at least I really do not see where to slice to have quota support in a non-ufs filesystem. Can you explain how this is supposed to be done? And can we move the fs-independent and vfs-level declarations to sys/quota.h and add kern/vfs_quota.c for the fs-independent code? -- David A. Holland dholl...@netbsd.org
Re: the bouyer-quota2 branch
On Wed, Mar 09, 2011 at 08:20:00PM +0100, Manuel Bouyer wrote: On Wed, Mar 09, 2011 at 06:28:11PM +, David Holland wrote: struct quota2_entry (and so struct quota2_val) is used for both on-disk storage, and in-memory representation in tools and kernel. I agree this should be split; with an extra level of conversion (between in-memory and on-disk representation). The issue is struct quota2_val: I don't see any reason to have a different structure for on-disk and in-memory representation at this level. Well... for one thing N_QL isn't necessarily 2; The tools rely on N_QL being defined and constant for all FS types at this time. The string for each QL is also defined here. That should be fixed in favor of something more flexible before the API gets cast in stone; as I was saying there's at least one piece of prior art out there with three types of quotas. It isn't clear to me that we'll ever care, but on the other hand the cost of not compiling in the quota types isn't very high. Especially since if the interface is really going to be proplib-based they can just be arbitrary names. also the in-memory representation shouldn't have disk addresses in it (e.g. q2e_next), and in the syscall interface the types should be logical types, like uid_t, not sized types. Sure (although in the syscall interface, all of these are strings now :) That doesn't exactly make any difference... struct quota2_entry isn't the real problem (it's not used much outside of filesystem), struct quota2_val is. Sure. Also the structures in quota2.h are too hierarchical; a single entry should be an id type (user or group, maybe others), an id number, a quota type (block, inode, or other things; SGI xfs has three types), the hard and soft limits and configured grace period, and the current usage and current expiration time. The hierarchy in quota2.h reflects the proplib structure. A proplib quota entry has a type, and an associated array of entries. 
Each entry has an id (uid or gid depending on the type above) and array of values for this id. Each value has a type, current usage, limits and grace times. Well, yes, one of the problems with proplib is that it encourages hierarchical structuring of things that should be relations. ISTM that a bundle of quota records should be a single array of tuples of the form I described above... at least in the canonical format for communicating among system components. (Maybe the configured limits and current state should be separate structures, though I'm not sure. I tend to think so because it allows separating the policy (which might be independent of a given ID) from the operational statistics (which can't be). And maybe the id information should be structured to allow ranges.) I'm not sure we should allow too much at this level. The tools can allow ranges if we want, but they should convert them to a list of discrete entries. Don't do too much in the kernel. If I have 80,000 accounts of which 40,000 are undergrads with the same quota policy, it certainly makes sense to pass this to the kernel (and, where possible, store it) as one record rather than replicating it 40,000 times. I agree we shouldn't go off half-cocked and if we're going to set up a policy language for quotas it should be designed with some care. However, with the default stuff we're already moving in that direction so it seems like a logical step. But at a minimum it's like struct dirent; it's not particularly different from the FFS on-disk structure, but it needs to be its own thing because it plays a different role. I agree. What I don't get is how to split it to avoid too much extra code which would just be a non-optimised memcpy. We're talking about like a dozen lines of code, far less than the proplist decoding that has to be replicated in every FS. Currently it looks like any fs that wants to implement quotas has to cut and paste quota2_prop.c. Surely the proplib gunk can be decoded fs-independently? 
Parts of quota2_prop.c can, I guess. For the part in ufs_quota.c, I'm not sure. If quota2_val and friends can really be an fs-independent interface, then I don't see that there's any value to passing the proplib bundle to the FS. If they can't... then the data structures should be strengthened. Since we are not going to replicate the userland quota tools for every different FS type that has quotas, the interface *they* talk has to be FS-independent. Even if there's some reason it needs to be proplib-based it's still a proplib encoding of some physical data structure, and (especially in the absence of any kind of proplib schemas) it would be helpful to have that structure clearly defined somewhere. -- David A. Holland dholl...@netbsd.org
Re: libquota proposal
On Mon, Mar 21, 2011 at 02:21:26PM +0100, Manuel Bouyer wrote: (also, edquota and repquota seem fs-independent to me...) no, they're not: they can directly access the quota1 file specified in the fstab if quotactl fails or the filesystem is not mounted. That's a bug, or more accurately legacy behavior that doesn't need to be supported. Once upon a time (IIRC) df used to fall back to opening the block device and examining ffs structures directly; that was removed because it violated desirable abstractions. -- David A. Holland dholl...@netbsd.org
Re: libquota proposal
(more context restored) On Wed, Mar 23, 2011 at 09:51:48AM +0100, Manuel Bouyer wrote: (also, edquota and repquota seem fs-independent to me...) no, they're not: they can directly access the quota1 file specified in the fstab if quotactl fails or the filesystem is not mounted. That's a bug, or more accurately legacy behavior that doesn't need to be supported. of course it's not nice. But we're talking about existing code calling the legacy quotactl. If we're going to change it to not check the fstab options any more, we may as well change it to use libquota. I don't understand - surely edquota and repquota go through your proplib interface now? We were talking about code like netatalk, which is why I propose a public library for this. Uh, now I really don't understand. -- David A. Holland dholl...@netbsd.org
Re: libquota proposal
On Wed, Mar 23, 2011 at 09:50:16AM +0100, Manuel Bouyer wrote: On Wed, Mar 23, 2011 at 03:44:53AM +, David Holland wrote: On Tue, Mar 22, 2011 at 05:41:52PM +0100, Manuel Bouyer wrote: | (also, edquota and repquota seem fs-independent to me...) | | no, they're not: they can directly access the quota1 file specified in the | fstab if quotactl fails or the filesystem is not mounted. | | That's a bug, or more accurately legacy behavior that doesn't need to | be supported. Once upon a time (IIRC) df used to fall back to opening | the block device and examining ffs structures directly; that was | removed because it violated desirable abstractions. Totally agree, please remove this complex and hard to maintain stuff. Once again: this needs to be supported for transition, up to 6.0 (inclusive). No, it doesn't. Even before you touched anything, they were only scribbling directly as a fallback if the kernel operations failed. The kernel operations should not fail in any case where scribbling directly makes sense; furthermore there's no need at all to deal with the case where the fs isn't mounted. repquota at least needs them: it doesn't have any way to get a list of quotas otherwise That sounds like a bug. (and it's also part of the migration to quota2, with repquota -x). ...wait, we're exposing the plists directly to the user? Shouldn't the migration be a single transparent tunefs operation? In the new world order all userland quota operations go through the kernel interface so they can interact successfully with filesystems using either the old or new quota layouts, or with new filesystems that may have their own different quota layouts, like zfs or whatever else. Right? right. Except that the getall command is not supported for quota1, repquota does the job itself. uh, why not? that *is* a bug. -- David A. Holland dholl...@netbsd.org
Re: Decomposing vfs_subr.c
On Wed, Mar 23, 2011 at 02:18:55PM +, Mindaugas Rasiukevicius wrote: I would like to split-off parts of vfs_subr.c into vfs_node.c and vfs_mount.c modules. Decomposing should hopefully bring some better abstraction, as well as make it easier to work with VFS subsystem. Any objections? Sounds good to me. Some comments: - I think it should be vfs_vnode.c? OK, unless somebody will come up with a better name. Since AIUI from chat this is going to contain the vnode lifecycle code and not e.g. stuff like vn_lock, I think I'd prefer vfs_vncache.c. But, vfs_vnode.c is definitely better than vfs_node.c. - Random thought: some day it would be nice to dump all the syscall code into its own directory. Speaking of structural clean ups - I am thinking about moving vfs_*.c into a separate src/sys/vfs directory. Given that clean code history of vfs_subr.c is already damaged (*cough*pooka*cough*) and decomposing will do more - it might be worth going all the way. Well, forcibly moving vfs_lookup.c right now (or anytime in the near future) would be a bad idea, so let's not. After that stuff stabilizes, perhaps we can. Though I'd kind of prefer having real rename support before launching on major reorgs. -- David A. Holland dholl...@netbsd.org
Re: reading non-standard floppy formats
On Thu, Apr 28, 2011 at 12:02:51PM +0200, Edgar Fuß wrote: Is there a saner way of reading non-standard (e.g., 10 sectors per track) floppies than either a) building a custom kernel with modified fd_types in sys/dev/isa/fd.c b) writing a user-space program that sets the appropriate parameters with FDIOCSETFORMAT and then, holding the device open, writes the raw floppy data to a file? No-one? Does this mean there is no saner way or am I missing something so obvious that no-one wants to answer? More likely than either: nobody knows. -- David A. Holland dholl...@netbsd.org
Re: NFS server problems (lockup) on netbsd-5
On Mon, May 02, 2011 at 03:23:48PM +0200, Manuel Bouyer wrote: unfortunately I don't have a core dump (I couldn't get one). And unfortunately it's not reproducible with a simple testbed (I've been trying for 3 days). I wonder if it could be related to the INRENAME change that has been pulled up ... Unlikely - that exists only to help work around a protocol bug in puffs. -- David A. Holland dholl...@netbsd.org