RFC: add MSI/MSI-X support to NetBSD

2014-05-30 Thread Kengo NAKAHARA
Hello,

I'm going to add MSI/MSI-X support to NetBSD. I list tasks about this.
Would you comment following task list?


TODO
basic idea: keep current implementation as we can.

+ [x86 common MD] separate status of interrupt handlers to registered and
  assigned
  - currently, registering interrupt handler and assigning CPU are done at
a time in pci_intr_establish().
  - If there are more than one CPUs and they handle multiple interrupts include
MSI/MSI-X, it is difficult to manage them.
 
+ [x86 common MD] increase the max number of registered interrupt handlers
  - currently, the number of supported interrupts is
(32 (MAX_INTR_SOURCES - 7 (for softint)) * number of CPUs
  - increase this number, for example, up to 768 (like FreeBSD)
- include 512 for MSI/MSI-X.
 
+ [x86 common MD] implement intr_establish() for MSI
  - enhance intr_establish() fpr MSI or add a new function.
  - register interrupt handlers.
  - separate it from assigning CPUs.
 
+ [MI] implement MI API for MSI
  - not yet determined
- Where should MSI's address and data be set?
  - in struct device? (like FreeBSD)
  - One of the way is to use API like FreeBSD to make device driver's
portability well.
 
+ [x86 common MD] implement MD intr_establish() for MSI-X
  - enhance intr_establish() for MSI or add a new function.
 
+ [MI] implement MI API for MSI-X
  - different API from MSI
- because device drivers treat MSI and MSI-X in a different way
  - generally MSI uses only one interrupt vector.
  - MSI-X uses multi interrupt vecotrs, futhermore multi interrupt handlers.


FUTURE WORK

+ [amd64 MD] modify callers of INTRSTUB
  - unifying ioapic_edge and ioapic_level should be done with spl rearrangement
- unifying legacy may be needed too?
 
+ [amd64 MD]  refactor INTRSTUB
  - currently, it walks the interrupt handler list in assembly code
- I want to use NetBSD's list library, so I want to convert this assembly
  code to C code.
 
+ [amd64 MD] increase the max number of assigned interrupt handlers
  - currently, the number of supported interrupt is
(32 (MAX_INTR_SOURCES - 7 (for softint))
  - increase this number to 64 (max bit length which amd64 can manipulate
atomically) like OpenBSD.
- hopefully, increase this number to
  0xef (IDT_INTR_HIGH) - 0x20 (IDT entries for exception) + number of 
softints
  - this modification needs expanding struct cpu_info.ipending to
a length more than 64bit (which cannot manipulate atomically),
so it's required to protect ipending with mutex(9).
but muntex_spin_exit() uses ipending...
  - furthermore, increase related arrays and flags.

+ [x86 common MD] separate Interrupt Descriptor Table (IDT) for each CPU
  - this is needed to manipulate interrupts over the number of one CPU's
IDT entries.
  - DragonflyBSD maybe implements this.
 
+ [x86 common MD] implement a new struct to manipulate local APIC
  - this struct may be needed when one interrupt is assigned to multi CPUs
aka logical destination mode.

-- 
//
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division

Kengo NAKAHARA k-nakah...@iij.ad.jp


Re: Lockless IP input queue, the pktqueue interface

2014-05-30 Thread Darren Reed

On 30/05/2014 6:30 AM, David Holland wrote:

On Fri, May 30, 2014 at 12:01:23AM +1000, Darren Reed wrote:
[code cleanup]
  
   All of your arguments boil down to can't trust someone else.
  
   Why do you need to be so insulting of other developers in your arguments?
  
   Do you think you're the only person capable of making good design decisions?
  
   Sorry, but I won't be a party to that kind of attitude and want nothing more
   to do with this.

I am surprised... no, more like shocked really... that someone as
experienced as you are could think this way.


Yes, experienced. That means I've seen all manner of code written.

And I've never before seen anyone justify a code design decision
based on who they want or don't want to change code. Especially
when it is open source code.

Similarly I've never seen putting structs in .c files lead anywhere
useful in the long term.


I think that Mindaugas is being pragmatic here.  Developers are not
equally brilliant[*], observant of the rules, or perceptive of the
patterns, layers, or abstractions in the code.  He is writing the code
in a way that discourages us from casually misusing or breaking it by
getting under the abstraction.  I don't find that offensive.



There are other ways to achieve that - as with all programming,
there is always more than one solution to a problem.

I wouldn't care if he justified it as his preference or how he
wants to present the interfaces or he was doing it to reduce bugs
(i.e. mrg) or whatever, but justifying it as a means of trying to
control who changes code is wrong.

Open source code should be written such that it supports others
making changes to it and doing so safely and properly. Not to
prevent or discourage people.

Kind Regards,
Darren



Re: Lockless IP input queue, the pktqueue interface

2014-05-30 Thread Thor Lancelot Simon
On Thu, May 29, 2014 at 09:46:22AM -0500, David Young wrote:
 
 I think that Mindaugas is being pragmatic here.  Developers are not
 equally brilliant[*], observant of the rules, or perceptive of the
 patterns, layers, or abstractions in the code.  He is writing the code
 in a way that discourages us from casually misusing or breaking it by
 getting under the abstraction.  I don't find that offensive.

Indeed, I note that over in tech-kern there is a long running thread
in which a user, trying to debug a problem with NetBSD, complains that
internals of the cd9660 implementation are *not* properly hidden and
are inappropriately exposed in public header files.  He seems shocked
that we do not go (did not go) much farther to hide implementation details
than we do.  I tend to agree.

We ought to make deliberate decisions about what interfaces we will
commit to publishing and expose only those, to the greatest degree
possible.  Exposing guts datastructures to ease grovelling with kmem
seems to me to be exactly the _wrong_ direction to go -- I believe
that kmem grovelling is a pernicious practice and should die (from my
point of view it represents one of the longest-standing major security
issues with NetBSD).

Note that implementation hiding does *not* impede source-level debugging
as if you've got the source, you've got the guts datastructures.

Inappropriately exposing guts also leads to compatibility nightmares and
impedes both necessary repair of poorly designed subsystems, and innovation
in new ones.  If you expose the guts, something _will_ use them, and then
you will be forced to either replace (at the least, recompile) a ton of
random code when you upgrade the kernel, or maintain complex, nasty,
error-prone backwards compatibility code forever.

I think we should design the system so we do _not_ have to replace a dozen
things in /sbin when we replace the kernel.  I certainly do not think we
should go in the other direction.

Thor


Making tmpfs reserved memory configurable

2014-05-30 Thread Martin Husemann
I have been on a quest to make the stock vax install CD (-image) usable on
VAX machines with 8 MB recently. (8 MB is the lowest I could persuade simh
to emulate, for 4 MB we will need a custom kernel anyway and for smaller
even a custom /boot - I will cover installing on those machines in an
upcoming improvement of the install docs).

After a few easy steps, I hit a hard wall: using MFS filesystem on the install
CD does not work well (we take too long to configure swap, the userland part
of MFS gets often killed way before). However, tmpfs does not work at all
under such low memory situations.

See mount_tmpfs(8), in the paragraph about the -s option:

   Note that four megabytes are always reserved for the system and cannot
   be assigned to the file system.

Now, with a 3.2 MB text GENERIC kernel and 8 MB RAM, we certainly don't have
4 MB available at all - so tmpfs is not usable.

But, on the other hand, the CD is not usable without tmpfs - so we are in a
deadlock. Simple way out: since we know the tmpfs is needed more than any
RAM, we could tell tmpfs to lower the amount of ram reserved for the system
via sysctl(8). This is what the first patch attached does.

The second patch updates the man page and adds a kauth decision to allow
this changes only for root.

Finally, the trivial third patch uses the new feature and makes VAX install
CDs work on an emulated VAX 11/780 with 8 MB total ram.

Any objections to the general aproach?
Reviews, especially of the kauth and sysctl code welcome (everything else
is mostly mechanical).

One open issue is the name - I somehow thought reservee would be a proper
noune for reserved memory, but apparently this does not work out well
for native speakers, so it probably should be named something else.
What about min_ram_free? Other suggestions?

Martin
Index: sys/fs/tmpfs/tmpfs.h
===
RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs.h,v
retrieving revision 1.49
diff -u -p -r1.49 tmpfs.h
--- sys/fs/tmpfs/tmpfs.h30 Apr 2014 01:33:51 -  1.49
+++ sys/fs/tmpfs/tmpfs.h30 May 2014 13:42:10 -
@@ -311,7 +311,7 @@ booltmpfs_strname_neqlen(struct compon
  */
 
 /* Amount of memory pages to reserve for the system. */
-#defineTMPFS_PAGES_RESERVED(4 * 1024 * 1024 / PAGE_SIZE)
+extern size_t tmpfs_pages_reserved;
 
 /*
  * Routines to convert VFS structures to tmpfs internal ones.
Index: sys/fs/tmpfs/tmpfs_mem.c
===
RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs_mem.c,v
retrieving revision 1.5
diff -u -p -r1.5 tmpfs_mem.c
--- sys/fs/tmpfs/tmpfs_mem.c30 Apr 2014 01:33:51 -  1.5
+++ sys/fs/tmpfs/tmpfs_mem.c30 May 2014 13:42:10 -
@@ -48,6 +48,8 @@ __KERNEL_RCSID(0, $NetBSD: tmpfs_mem.c,
 extern struct pool tmpfs_dirent_pool;
 extern struct pool tmpfs_node_pool;
 
+size_t tmpfs_pages_reserved = 4 * 1024 * 1024 / PAGE_SIZE;
+
 void
 tmpfs_mntmem_init(struct tmpfs_mount *mp, uint64_t memlimit)
 {
@@ -89,7 +91,7 @@ tmpfs_mntmem_set(struct tmpfs_mount *mp,
  * = If 'total' is true, then return _total_ amount of pages.
  * = If false, then return the amount of _free_ memory pages.
  *
- * Remember to remove TMPFS_PAGES_RESERVED from the returned value to avoid
+ * Remember to remove tmpfs_pages_reserved from the returned value to avoid
  * excessive memory usage.
  */
 size_t
@@ -118,10 +120,10 @@ tmpfs_bytes_max(struct tmpfs_mount *mp)
size_t freepages = tmpfs_mem_info(false);
uint64_t avail_mem;
 
-   if (freepages  TMPFS_PAGES_RESERVED) {
+   if (freepages  tmpfs_pages_reserved) {
freepages = 0;
} else {
-   freepages -= TMPFS_PAGES_RESERVED;
+   freepages -= tmpfs_pages_reserved;
}
avail_mem = round_page(mp-tm_bytes_used) + (freepages  PAGE_SHIFT);
return MIN(mp-tm_mem_limit, avail_mem);
Index: sys/fs/tmpfs/tmpfs_vfsops.c
===
RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs_vfsops.c,v
retrieving revision 1.61
diff -u -p -r1.61 tmpfs_vfsops.c
--- sys/fs/tmpfs/tmpfs_vfsops.c 30 Apr 2014 01:59:30 -  1.61
+++ sys/fs/tmpfs/tmpfs_vfsops.c 30 May 2014 13:42:10 -
@@ -49,9 +49,11 @@ __KERNEL_RCSID(0, $NetBSD: tmpfs_vfsops
 #include sys/kmem.h
 #include sys/mount.h
 #include sys/stat.h
+#include sys/sysctl.h
 #include sys/systm.h
 #include sys/vnode.h
 #include sys/module.h
+#include sys/kauth.h
 
 #include miscfs/genfs/genfs.h
 #include fs/tmpfs/tmpfs.h
@@ -59,6 +61,9 @@ __KERNEL_RCSID(0, $NetBSD: tmpfs_vfsops
 
 MODULE(MODULE_CLASS_VFS, tmpfs, NULL);
 
+static struct sysctllog *tmpfs_sysctl_log;
+static int tmpfs_pages_reservee(SYSCTLFN_PROTO);
+
 struct pooltmpfs_dirent_pool;
 struct pooltmpfs_node_pool;
 
@@ -132,7 +137,7 @@ tmpfs_mount(struct mount *mp, const char
 
 
/* Prohibit mounts if there is not enough memory. */
- 

Re: Lockless IP input queue, the pktqueue interface

2014-05-30 Thread Thomas Schmitt
Hi,

Thor Lancelot Simon:
 Indeed, I note that over in tech-kern there is a long running thread
 in which a user, trying to debug a problem with NetBSD, complains that
 internals of the cd9660 implementation are *not* properly hidden

Urm, i did not complain but asked about the API/ABI rank of
those .h files in /usr/include.

And i am not really a user but rather an upstream programmer of an
ISO 9660 production program, who wants a decent mount environment
for the results of that program. Results so far:
1 user aquired (actually for Xfburn).
3 own packages promoted from wip to pkgsrc.
2 kernel kernel bug fixes and 2 userland bug fixes are committed.
1 bug pending: kern/48815, 1 enhancement pending: kern/48808.
1 obtrusive change proposal to come for supporting large data files.
  (It is being tested meanwhile.)


 He seems shocked
 that we do not go (did not go) much farther to hide implementation
 details

Well, i was surprised to see obvious kernel entrails published.
So i asked for instructions about how much care to take with
changes.

Meanwhile i learned that
  /usr/src/usr.sbin/makefs
is compiling
  /usr/src/sys/fs/udf/udf_osta.c
and
  /usr/src/usr.sbin/mtree/spec.c

So strict encapsulation seems to have never been enforced.

On the other hand, as system-neutral userland programmer, i would
not dig in the isofs/cd9660/*.h files in order to find some
kernel interface on which i could daringly rely.
I rather have to take care to keep any system dependency in
special adapter modules, so that my code stays portable.

Still not decided is whether i shall keep .../cd9660_node.h
API-compatible in my change to come.
It will cost sizeof(void *) in each ISO 9660 vnode, but on the
other hand it saves the work to find any includer or copier of 
the files which i propose to change.

There are such friends of cd9660 and i can hardly propose
to break them without providing fixes.
But i already see problems with getting reviews for 48808 and
for my change-to-come. And most of the known affected programs
cannot easily be linked here because of obvious libc
incompatibilities between CVS and NetBSD-6.1.3.
So i could test only somewhat hacked versions.


Note well: I don't complain. It's a do-ocracy.


Have a nice day :)

Thomas



Re: Lockless IP input queue, the pktqueue interface

2014-05-30 Thread Christos Zoulas
I would like to point out that exposing the guts of structures has
bitten us many times in the past (FILE, etc.). Once you expose a
struct, you are making the size of it known; even if your API does
not need it, people might use that fact to keep local copies or
declare objects of that type. Maintaining binary compatibility
becomes really difficult.

On the other hand, hiding a structure can lead to less efficient
code, because you need to have getters/setters, allocate memory
instead of using the callers stack etc.

All these are well-known trade-offs, and given our experience with
the structure in discussion in the networking stack (the pollution
in device drivers by direct access for example), the path chosen
seems the correct one for me.

christos



Re: Making tmpfs reserved memory configurable

2014-05-30 Thread Rhialto
On Fri 30 May 2014 at 16:56:01 +0200, Martin Husemann wrote:
 One open issue is the name - I somehow thought reservee would be a proper
 noune for reserved memory, but apparently this does not work out well
 for native speakers, so it probably should be named something else.
 What about min_ram_free? Other suggestions?

I'm not a native speaker either, but it seems reserve (one fewer e)
would be good.

 --- sbin/mount_tmpfs/mount_tmpfs.84 Dec 2013 18:05:21 -   1.17
 +++ sbin/mount_tmpfs/mount_tmpfs.830 May 2014 13:42:46 -
 @@ -83,8 +83,12 @@
  Specifies the total file system size in bytes.
  If zero is given (the default), the available amount of memory (including
  main memory and swap space) will be used.
 -Note that four megabytes are always reserved for the system and cannot
 +Note that some memory (by default:
 +four megabytes) are always reserved for the system and cannot
  be assigned to the file system.

some memory (...) are - some memory is.

I'd split the lines as

| Note that some memory
| (by default: four megabytes)
| is always reserved for the system and cannot

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'


pgprxAEQkY7q7.pgp
Description: PGP signature


Re: Making tmpfs reserved memory configurable

2014-05-30 Thread Eduardo Horvath
On Fri, 30 May 2014, Martin Husemann wrote:

 See mount_tmpfs(8), in the paragraph about the -s option:
 
Note that four megabytes are always reserved for the system and cannot
be assigned to the file system.
 
 Now, with a 3.2 MB text GENERIC kernel and 8 MB RAM, we certainly don't have
 4 MB available at all - so tmpfs is not usable.

This just doesn't sound right.  Why is tmpfs reserving a fixed amount of 
RAM?  Shouldn't it be using uvmexp.freemin?  That's basically what we're 
reserving for emergencies.

RAM scaling is always a pain in the posterior.  The choices made for a 
system with 16MB RAM don't make sense for a system with 16GB RAM, and visa 
versa.

Eduardo


Re: Making tmpfs reserved memory configurable

2014-05-30 Thread Mindaugas Rasiukevicius
Martin Husemann mar...@duskware.de wrote:
 ...
 
 See mount_tmpfs(8), in the paragraph about the -s option:
 
Note that four megabytes are always reserved for the system and cannot
be assigned to the file system.
 

This wording may be confusing.  tmpfs does not reserve the memory in a
sense of allocation.  It imposes a hard limit and, in addition, it checks
that there is at least 4MB left for the system (if not, it considers that
as hitting the limit and fails to allocate more memory).  Note that the
current mechanism of using the floating number of free memory is wrong.
It is not reliable and should not really be done that way.

The UVM itself has some reserved memory so that it could function if the
memory is exhausted (see reserve_pagedaemon and reserve_kernel in uvmexp).
I think that the assumptions about minimum memory requirements should be
made by UVM and if we want to keep a helper mechanism in tmpfs, then I
agree with Eduardo that it should rather be using uvmexp.freemin target.

-- 
Mindaugas