RFC: add MSI/MSI-X support to NetBSD
Hello, I'm going to add MSI/MSI-X support to NetBSD. I list tasks about this. Would you comment following task list? TODO basic idea: keep current implementation as we can. + [x86 common MD] separate status of interrupt handlers to registered and assigned - currently, registering interrupt handler and assigning CPU are done at a time in pci_intr_establish(). - If there are more than one CPUs and they handle multiple interrupts include MSI/MSI-X, it is difficult to manage them. + [x86 common MD] increase the max number of registered interrupt handlers - currently, the number of supported interrupts is (32 (MAX_INTR_SOURCES - 7 (for softint)) * number of CPUs - increase this number, for example, up to 768 (like FreeBSD) - include 512 for MSI/MSI-X. + [x86 common MD] implement intr_establish() for MSI - enhance intr_establish() fpr MSI or add a new function. - register interrupt handlers. - separate it from assigning CPUs. + [MI] implement MI API for MSI - not yet determined - Where should MSI's address and data be set? - in struct device? (like FreeBSD) - One of the way is to use API like FreeBSD to make device driver's portability well. + [x86 common MD] implement MD intr_establish() for MSI-X - enhance intr_establish() for MSI or add a new function. + [MI] implement MI API for MSI-X - different API from MSI - because device drivers treat MSI and MSI-X in a different way - generally MSI uses only one interrupt vector. - MSI-X uses multi interrupt vecotrs, futhermore multi interrupt handlers. FUTURE WORK + [amd64 MD] modify callers of INTRSTUB - unifying ioapic_edge and ioapic_level should be done with spl rearrangement - unifying legacy may be needed too? + [amd64 MD] refactor INTRSTUB - currently, it walks the interrupt handler list in assembly code - I want to use NetBSD's list library, so I want to convert this assembly code to C code. + [amd64 MD] increase the max number of assigned interrupt handlers - currently, the number of supported interrupt is (32 (MAX_INTR_SOURCES - 7 (for softint)) - increase this number to 64 (max bit length which amd64 can manipulate atomically) like OpenBSD. - hopefully, increase this number to 0xef (IDT_INTR_HIGH) - 0x20 (IDT entries for exception) + number of softints - this modification needs expanding struct cpu_info.ipending to a length more than 64bit (which cannot manipulate atomically), so it's required to protect ipending with mutex(9). but muntex_spin_exit() uses ipending... - furthermore, increase related arrays and flags. + [x86 common MD] separate Interrupt Descriptor Table (IDT) for each CPU - this is needed to manipulate interrupts over the number of one CPU's IDT entries. - DragonflyBSD maybe implements this. + [x86 common MD] implement a new struct to manipulate local APIC - this struct may be needed when one interrupt is assigned to multi CPUs aka logical destination mode. -- // Internet Initiative Japan Inc. Device Engineering Section, Core Product Development Department, Product Division Kengo NAKAHARA k-nakah...@iij.ad.jp
Re: Lockless IP input queue, the pktqueue interface
On 30/05/2014 6:30 AM, David Holland wrote: On Fri, May 30, 2014 at 12:01:23AM +1000, Darren Reed wrote: [code cleanup] All of your arguments boil down to can't trust someone else. Why do you need to be so insulting of other developers in your arguments? Do you think you're the only person capable of making good design decisions? Sorry, but I won't be a party to that kind of attitude and want nothing more to do with this. I am surprised... no, more like shocked really... that someone as experienced as you are could think this way. Yes, experienced. That means I've seen all manner of code written. And I've never before seen anyone justify a code design decision based on who they want or don't want to change code. Especially when it is open source code. Similarly I've never seen putting structs in .c files lead anywhere useful in the long term. I think that Mindaugas is being pragmatic here. Developers are not equally brilliant[*], observant of the rules, or perceptive of the patterns, layers, or abstractions in the code. He is writing the code in a way that discourages us from casually misusing or breaking it by getting under the abstraction. I don't find that offensive. There are other ways to achieve that - as with all programming, there is always more than one solution to a problem. I wouldn't care if he justified it as his preference or how he wants to present the interfaces or he was doing it to reduce bugs (i.e. mrg) or whatever, but justifying it as a means of trying to control who changes code is wrong. Open source code should be written such that it supports others making changes to it and doing so safely and properly. Not to prevent or discourage people. Kind Regards, Darren
Re: Lockless IP input queue, the pktqueue interface
On Thu, May 29, 2014 at 09:46:22AM -0500, David Young wrote: I think that Mindaugas is being pragmatic here. Developers are not equally brilliant[*], observant of the rules, or perceptive of the patterns, layers, or abstractions in the code. He is writing the code in a way that discourages us from casually misusing or breaking it by getting under the abstraction. I don't find that offensive. Indeed, I note that over in tech-kern there is a long running thread in which a user, trying to debug a problem with NetBSD, complains that internals of the cd9660 implementation are *not* properly hidden and are inappropriately exposed in public header files. He seems shocked that we do not go (did not go) much farther to hide implementation details than we do. I tend to agree. We ought to make deliberate decisions about what interfaces we will commit to publishing and expose only those, to the greatest degree possible. Exposing guts datastructures to ease grovelling with kmem seems to me to be exactly the _wrong_ direction to go -- I believe that kmem grovelling is a pernicious practice and should die (from my point of view it represents one of the longest-standing major security issues with NetBSD). Note that implementation hiding does *not* impede source-level debugging as if you've got the source, you've got the guts datastructures. Inappropriately exposing guts also leads to compatibility nightmares and impedes both necessary repair of poorly designed subsystems, and innovation in new ones. If you expose the guts, something _will_ use them, and then you will be forced to either replace (at the least, recompile) a ton of random code when you upgrade the kernel, or maintain complex, nasty, error-prone backwards compatibility code forever. I think we should design the system so we do _not_ have to replace a dozen things in /sbin when we replace the kernel. I certainly do not think we should go in the other direction. Thor
Making tmpfs reserved memory configurable
I have been on a quest to make the stock vax install CD (-image) usable on VAX machines with 8 MB recently. (8 MB is the lowest I could persuade simh to emulate, for 4 MB we will need a custom kernel anyway and for smaller even a custom /boot - I will cover installing on those machines in an upcoming improvement of the install docs). After a few easy steps, I hit a hard wall: using MFS filesystem on the install CD does not work well (we take too long to configure swap, the userland part of MFS gets often killed way before). However, tmpfs does not work at all under such low memory situations. See mount_tmpfs(8), in the paragraph about the -s option: Note that four megabytes are always reserved for the system and cannot be assigned to the file system. Now, with a 3.2 MB text GENERIC kernel and 8 MB RAM, we certainly don't have 4 MB available at all - so tmpfs is not usable. But, on the other hand, the CD is not usable without tmpfs - so we are in a deadlock. Simple way out: since we know the tmpfs is needed more than any RAM, we could tell tmpfs to lower the amount of ram reserved for the system via sysctl(8). This is what the first patch attached does. The second patch updates the man page and adds a kauth decision to allow this changes only for root. Finally, the trivial third patch uses the new feature and makes VAX install CDs work on an emulated VAX 11/780 with 8 MB total ram. Any objections to the general aproach? Reviews, especially of the kauth and sysctl code welcome (everything else is mostly mechanical). One open issue is the name - I somehow thought reservee would be a proper noune for reserved memory, but apparently this does not work out well for native speakers, so it probably should be named something else. What about min_ram_free? Other suggestions? Martin Index: sys/fs/tmpfs/tmpfs.h === RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs.h,v retrieving revision 1.49 diff -u -p -r1.49 tmpfs.h --- sys/fs/tmpfs/tmpfs.h30 Apr 2014 01:33:51 - 1.49 +++ sys/fs/tmpfs/tmpfs.h30 May 2014 13:42:10 - @@ -311,7 +311,7 @@ booltmpfs_strname_neqlen(struct compon */ /* Amount of memory pages to reserve for the system. */ -#defineTMPFS_PAGES_RESERVED(4 * 1024 * 1024 / PAGE_SIZE) +extern size_t tmpfs_pages_reserved; /* * Routines to convert VFS structures to tmpfs internal ones. Index: sys/fs/tmpfs/tmpfs_mem.c === RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs_mem.c,v retrieving revision 1.5 diff -u -p -r1.5 tmpfs_mem.c --- sys/fs/tmpfs/tmpfs_mem.c30 Apr 2014 01:33:51 - 1.5 +++ sys/fs/tmpfs/tmpfs_mem.c30 May 2014 13:42:10 - @@ -48,6 +48,8 @@ __KERNEL_RCSID(0, $NetBSD: tmpfs_mem.c, extern struct pool tmpfs_dirent_pool; extern struct pool tmpfs_node_pool; +size_t tmpfs_pages_reserved = 4 * 1024 * 1024 / PAGE_SIZE; + void tmpfs_mntmem_init(struct tmpfs_mount *mp, uint64_t memlimit) { @@ -89,7 +91,7 @@ tmpfs_mntmem_set(struct tmpfs_mount *mp, * = If 'total' is true, then return _total_ amount of pages. * = If false, then return the amount of _free_ memory pages. * - * Remember to remove TMPFS_PAGES_RESERVED from the returned value to avoid + * Remember to remove tmpfs_pages_reserved from the returned value to avoid * excessive memory usage. */ size_t @@ -118,10 +120,10 @@ tmpfs_bytes_max(struct tmpfs_mount *mp) size_t freepages = tmpfs_mem_info(false); uint64_t avail_mem; - if (freepages TMPFS_PAGES_RESERVED) { + if (freepages tmpfs_pages_reserved) { freepages = 0; } else { - freepages -= TMPFS_PAGES_RESERVED; + freepages -= tmpfs_pages_reserved; } avail_mem = round_page(mp-tm_bytes_used) + (freepages PAGE_SHIFT); return MIN(mp-tm_mem_limit, avail_mem); Index: sys/fs/tmpfs/tmpfs_vfsops.c === RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs_vfsops.c,v retrieving revision 1.61 diff -u -p -r1.61 tmpfs_vfsops.c --- sys/fs/tmpfs/tmpfs_vfsops.c 30 Apr 2014 01:59:30 - 1.61 +++ sys/fs/tmpfs/tmpfs_vfsops.c 30 May 2014 13:42:10 - @@ -49,9 +49,11 @@ __KERNEL_RCSID(0, $NetBSD: tmpfs_vfsops #include sys/kmem.h #include sys/mount.h #include sys/stat.h +#include sys/sysctl.h #include sys/systm.h #include sys/vnode.h #include sys/module.h +#include sys/kauth.h #include miscfs/genfs/genfs.h #include fs/tmpfs/tmpfs.h @@ -59,6 +61,9 @@ __KERNEL_RCSID(0, $NetBSD: tmpfs_vfsops MODULE(MODULE_CLASS_VFS, tmpfs, NULL); +static struct sysctllog *tmpfs_sysctl_log; +static int tmpfs_pages_reservee(SYSCTLFN_PROTO); + struct pooltmpfs_dirent_pool; struct pooltmpfs_node_pool; @@ -132,7 +137,7 @@ tmpfs_mount(struct mount *mp, const char /* Prohibit mounts if there is not enough memory. */ -
Re: Lockless IP input queue, the pktqueue interface
Hi, Thor Lancelot Simon: Indeed, I note that over in tech-kern there is a long running thread in which a user, trying to debug a problem with NetBSD, complains that internals of the cd9660 implementation are *not* properly hidden Urm, i did not complain but asked about the API/ABI rank of those .h files in /usr/include. And i am not really a user but rather an upstream programmer of an ISO 9660 production program, who wants a decent mount environment for the results of that program. Results so far: 1 user aquired (actually for Xfburn). 3 own packages promoted from wip to pkgsrc. 2 kernel kernel bug fixes and 2 userland bug fixes are committed. 1 bug pending: kern/48815, 1 enhancement pending: kern/48808. 1 obtrusive change proposal to come for supporting large data files. (It is being tested meanwhile.) He seems shocked that we do not go (did not go) much farther to hide implementation details Well, i was surprised to see obvious kernel entrails published. So i asked for instructions about how much care to take with changes. Meanwhile i learned that /usr/src/usr.sbin/makefs is compiling /usr/src/sys/fs/udf/udf_osta.c and /usr/src/usr.sbin/mtree/spec.c So strict encapsulation seems to have never been enforced. On the other hand, as system-neutral userland programmer, i would not dig in the isofs/cd9660/*.h files in order to find some kernel interface on which i could daringly rely. I rather have to take care to keep any system dependency in special adapter modules, so that my code stays portable. Still not decided is whether i shall keep .../cd9660_node.h API-compatible in my change to come. It will cost sizeof(void *) in each ISO 9660 vnode, but on the other hand it saves the work to find any includer or copier of the files which i propose to change. There are such friends of cd9660 and i can hardly propose to break them without providing fixes. But i already see problems with getting reviews for 48808 and for my change-to-come. And most of the known affected programs cannot easily be linked here because of obvious libc incompatibilities between CVS and NetBSD-6.1.3. So i could test only somewhat hacked versions. Note well: I don't complain. It's a do-ocracy. Have a nice day :) Thomas
Re: Lockless IP input queue, the pktqueue interface
I would like to point out that exposing the guts of structures has bitten us many times in the past (FILE, etc.). Once you expose a struct, you are making the size of it known; even if your API does not need it, people might use that fact to keep local copies or declare objects of that type. Maintaining binary compatibility becomes really difficult. On the other hand, hiding a structure can lead to less efficient code, because you need to have getters/setters, allocate memory instead of using the callers stack etc. All these are well-known trade-offs, and given our experience with the structure in discussion in the networking stack (the pollution in device drivers by direct access for example), the path chosen seems the correct one for me. christos
Re: Making tmpfs reserved memory configurable
On Fri 30 May 2014 at 16:56:01 +0200, Martin Husemann wrote: One open issue is the name - I somehow thought reservee would be a proper noune for reserved memory, but apparently this does not work out well for native speakers, so it probably should be named something else. What about min_ram_free? Other suggestions? I'm not a native speaker either, but it seems reserve (one fewer e) would be good. --- sbin/mount_tmpfs/mount_tmpfs.84 Dec 2013 18:05:21 - 1.17 +++ sbin/mount_tmpfs/mount_tmpfs.830 May 2014 13:42:46 - @@ -83,8 +83,12 @@ Specifies the total file system size in bytes. If zero is given (the default), the available amount of memory (including main memory and swap space) will be used. -Note that four megabytes are always reserved for the system and cannot +Note that some memory (by default: +four megabytes) are always reserved for the system and cannot be assigned to the file system. some memory (...) are - some memory is. I'd split the lines as | Note that some memory | (by default: four megabytes) | is always reserved for the system and cannot -Olaf. -- ___ Olaf 'Rhialto' Seibert -- The Doctor: No, 'eureka' is Greek for \X/ rhialto/at/xs4all.nl-- 'this bath is too hot.' pgprxAEQkY7q7.pgp Description: PGP signature
Re: Making tmpfs reserved memory configurable
On Fri, 30 May 2014, Martin Husemann wrote: See mount_tmpfs(8), in the paragraph about the -s option: Note that four megabytes are always reserved for the system and cannot be assigned to the file system. Now, with a 3.2 MB text GENERIC kernel and 8 MB RAM, we certainly don't have 4 MB available at all - so tmpfs is not usable. This just doesn't sound right. Why is tmpfs reserving a fixed amount of RAM? Shouldn't it be using uvmexp.freemin? That's basically what we're reserving for emergencies. RAM scaling is always a pain in the posterior. The choices made for a system with 16MB RAM don't make sense for a system with 16GB RAM, and visa versa. Eduardo
Re: Making tmpfs reserved memory configurable
Martin Husemann mar...@duskware.de wrote: ... See mount_tmpfs(8), in the paragraph about the -s option: Note that four megabytes are always reserved for the system and cannot be assigned to the file system. This wording may be confusing. tmpfs does not reserve the memory in a sense of allocation. It imposes a hard limit and, in addition, it checks that there is at least 4MB left for the system (if not, it considers that as hitting the limit and fails to allocate more memory). Note that the current mechanism of using the floating number of free memory is wrong. It is not reliable and should not really be done that way. The UVM itself has some reserved memory so that it could function if the memory is exhausted (see reserve_pagedaemon and reserve_kernel in uvmexp). I think that the assumptions about minimum memory requirements should be made by UVM and if we want to keep a helper mechanism in tmpfs, then I agree with Eduardo that it should rather be using uvmexp.freemin target. -- Mindaugas