drop volatile from __cpu_simplelock_t typedef

2015-06-26 Thread Antti Kantee

__cpu_simplelock_t was born 15+ years ago with the following commit message:

=== snip ===
Let each platform typedef the new __cpu_simple_lock_t, which should
be the most efficient type used for the atomic operations in the
simplelock structure, and should also be __volatile.
=== snip ===

So, thinking about fixing lib/49989, I started wondering why volatile 
is necessary in the simplelock typedefs.  "should also be" doesn't 
explain much, and may just be there because that's what the 
pre-simplelock_t definitions used.  Shouldn't simplelocks always be 
operated on with atomic instructions and instruction barriers or some 
non-SMP equivalent thereof?  Assuming so, volatile in the typedef 
doesn't do anything except probably throw compilers off and therefore we 
should drop volatile from the typedefs.


RAS might need volatile (not sure yet), but that can probably be pushed 
inside the RAS sequence instead of exposing it everywhere.


Thoughts?  Seems like the right thing to do irrespective of lib/49989.


Re: drop volatile from __cpu_simplelock_t typedef

2015-06-26 Thread Antti Kantee

On 26/06/15 14:51, Matt Thomas wrote:

__cpu_simple_lock_unlock concerns me without volatile.


Why?  Something to do with barriers?


Also, many have loops that count on the variable changing.
Without volatile those will become infinite loops.


Such as?  I can only think of the C debugging version of simple_lock ;)

Can't those be fixed by making them call __SIMPLELOCK_LOCKED_P()?  They 
arguably should have been doing that in the first place anyway.  Or are 
you worried that we won't be able to catch all of them?



For RISC-V, I used the builtin C11-ish gcc atomics to implement
the __cpu_simple_lock_t operations.  I just moved it to
sys/common_lock.h so other ports could use it.


More MI code is always nice, but what's the relevance to this 
discussion?  sys/common_lock.h should work the same with or without 
volatile because the volatileness comes from atomic_store/exchange, no?



Anyway, I interpreted your reply as "sounds like a good idea, but there 
may be some problems".  If that's completely wrong, please be more explicit.


Re: drop volatile from __cpu_simplelock_t typedef

2015-06-26 Thread Antti Kantee

On 26/06/15 15:20, Matt Thomas wrote:



On Jun 26, 2015, at 8:17 AM, Antti Kantee po...@iki.fi wrote:

Such as?  I can only think of the C debugging version of simple_lock ;)

Can't those be fixed by making them call __SIMPLELOCK_LOCKED_P()?  They 
arguably should have been doing that in the first place anyway.  Or are you 
worried that we won't be able to catch all of them?


Atomic instructions typically have a lot of overhead, so you loop until the 
variable changes and then retry the atomic instruction.


__SIMPLELOCK_LOCKED_P() doesn't use an atomic instruction, and I can't 
see how it even could in any way that would make sense.  We'd just add a 
volatile cast into that method.
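Concretely, a minimal sketch of what I mean, assuming an inline-function
spelling (the real per-port definitions live in <machine/lock.h> and may
well be macros):

static __inline int
__SIMPLELOCK_LOCKED_P(const __cpu_simple_lock_t *__ptr)
{
	/* the cast, not the typedef, forces a fresh load on every call */
	return *(const volatile __cpu_simple_lock_t *)__ptr !=
	    __SIMPLELOCK_UNLOCKED;
}

/* the "spin on plain loads, then retry the atomic op" loop then becomes: */
static __inline void
__cpu_simple_lock(__cpu_simple_lock_t *__ptr)
{
	while (!__cpu_simple_lock_try(__ptr))
		while (__SIMPLELOCK_LOCKED_P(__ptr))
			/* cheap read-only spin */;
}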


Re: drop volatile from __cpu_simplelock_t typedef

2015-06-26 Thread Antti Kantee

On 26/06/15 13:55, Antti Kantee wrote:

__cpu_simplelock_t was born 15+ years ago with the following commit
message:

=== snip ===
Let each platform typedef the new __cpu_simple_lock_t, which should
be the most efficient type used for the atomic operations in the
simplelock structure, and should also be __volatile.
=== snip ===

"should also be" doesn't
explain much, and may just be there because that's what the
pre-simplelock_t definitions used.


I asked thorpej whether the special interfaces used to manipulate 
simplelock_t wouldn't negate the need for volatile, and he said:


Originally, not everything was special interfaces, and there
were C versions for debugging.

So, it seems like the original reason to use volatile no longer exists.


Re: 82801G_HDA in hdaudiodevs

2015-06-14 Thread Antti Kantee

On 14/06/15 12:07, Robert Millan wrote:


Hi!

Am I missing something, or is my device (82801G_HDA) missing from
hdaudiodevs?

I don't use hdaudiodevs directly, only parse it to generate a list that is
later fed into Linux UIO to make it available to Rump, so I'm not
completely
sure if all HDA devices are supposed to be listed there or not.

Does someone know?


I don't, but NetBSD tech-kern might (cc'd).

While on the subject, I've also noticed that the hdaudio driver doesn't 
work under qemu or virtualbox; I'm not sure if it's as simple as adding 
something to hdaudiodevs or if something more profound is required.  (It's 
not really a problem with emulators, though, since I just use the ac97 
device).



Index: rumpkernel-0~20150607/buildrump.sh/src/sys/dev/hdaudio/hdaudiodevs
===================================================================
--- rumpkernel-0~20150607.orig/buildrump.sh/src/sys/dev/hdaudio/hdaudiodevs	2015-06-07 17:04:54.0 +0200
+++ rumpkernel-0~20150607/buildrump.sh/src/sys/dev/hdaudio/hdaudiodevs	2015-06-14 13:53:20.666957221 +0200
@@ -170,6 +170,7 @@
 product	INTEL	G45_HDMI_3	0x2803	G45 HDMI/3
 product	INTEL	G45_HDMI_4	0x2804	G45 HDMI/4
 product	INTEL	G45_HDMI_FB	0x29fb	G45 HDMI/FB
+product	INTEL	82801G_HDA	0x27d8	82801GB/GR

 /* Sigmatel */
 product	SIGMATEL	STAC9230X	0x7612	STAC9230X




Re: 82801G_HDA in hdaudiodevs

2015-06-14 Thread Antti Kantee

On 14/06/15 13:48, Robert Millan wrote:

El 14/06/15 a les 15:10, Antti Kantee ha escrit:

2) hdaudio doesn't work on a regular NetBSD installation under qemu
(and probably virtualbox too, though I didn't test with virtualbox)


Just FTR, I had trouble with auich(8) + Rump + Linux-UIO not propagating
interrupts on Virtualbox due to shared IRQs.

I suspect it might be a Virtualbox bug, as I couldn't reproduce the
problem anywhere other than Vbox, and I didn't
investigate further as I found a simple workaround (attached, maybe
someone will find it useful).


uio_pci_generic+interrupts is not a happy place.  For example, the 
maintainers have refused patches to enable MSI because it would make 
the in-kernel uio driver useful.  Well, maybe the rejection wasn't quite 
phrased like that, but that was the essence.


Ok, interrupts aren't a happy place, but now we're getting sidetracked ;)


Re: Removing ARCNET stuffs

2015-05-31 Thread Antti Kantee

On 31/05/15 06:05, matthew green wrote:

hi Andrew! :)


Who is appalled to discover that pc532 support has been removed!


In addition to toolchain support, the hardware was near-extinct at the 
time of removal.


Now, the hardware is no longer near-extinct:
http://cpu-ns32k.net/

I used the FPGA pc532 running NetBSD 1.5.x(?) a few weeks back. 
Unbelievable experience, especially since I spent quite some time and 
effort trying to get a pc532 I had on loan 10+ years ago to function.



get your GCC and binutils and GDB pals to put the support back
in the toolchain and we'll have something to talk about :-)


Didn't know that things to *talk* about were in short supply...


Re: Inter-driver #if dependencies

2015-05-18 Thread Antti Kantee

On 18/05/15 02:33, Paul Goyette wrote:

If you want to solve the problem just for one driver cluster, that's
more than fine.  In other words, if you don't want to spend effort on
a general solution, implement what you need privately in pcppi land.
Everyone still wins and will be thankful for your efforts, unlike in the
event of a haphazardly researched autoconf interface.  There's no need
to start making a list of things that you're not willing to do.


No-one benefits if I implement what [I] need privately...


Read the entire sentence/paragraph instead of stopping at a halfway 
point where it most suits your apparent agenda of having an excuse to 
lash out.



Please consider this exchange (between you and me) to be finished.


Gladly.


Re: Inter-driver #if dependencies

2015-05-17 Thread Antti Kantee

On 17/05/15 22:40, Paul Goyette wrote:

My crusade for modularity has arrived at the pcppi(4) driver, and I've
discovered that there are a number of places in the code where a #if is
used to determine whether or not some _other_ driver is available to
provide certain routines.  For pcppi(4), these dependencies are for the
attimer(4) and pckbd(4) drivers.  (While I haven't yet gone searching,
I'd be willing to wager that there are other similar examples in other
drivers.)


As you say, you're proposing a solution based on looking at one example 
and a wager that you'll find more use cases.  Furthermore, your message 
is unclear on whether you've implemented your proposal to test that it works 
even for your single case.


You won the wager in the sense that the problem exists.  However, I'm 
not at all convinced that an abstraction hell via autoconf is the best 
possible solution (not that I'm convinced that it isn't, either).  I 
suggest analyzing and fixing at least half a dozen cases in the tree 
before proposing a general solution.  For example, scsi/ata/wd/sd/usb is 
a good place to look.
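(To make the pattern under discussion concrete, here is a hypothetical,
self-contained illustration of the kind of inter-driver #if that pcppi(4)
contains -- the symbols and values below are made up; in the real tree
config(1) generates the NATTIMER definition:)

#include <stdio.h>

#ifndef NATTIMER
#define	NATTIMER 0		/* pretend attimer(4) is not configured in */
#endif

#if NATTIMER > 0
static int
pcppi_get_timebase(void)
{
	return 1193182;		/* would really query attimer(4) */
}
#else
static int
pcppi_get_timebase(void)
{
	return 0;		/* attimer(4) absent: fall back or disable the feature */
}
#endif

int
main(void)
{
	printf("timebase: %d\n", pcppi_get_timebase());
	return 0;
}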


If you want to solve the problem just for one driver cluster, that's 
more than fine, but you don't need a halfway general solution for that.


Re: Inter-driver #if dependencies

2015-05-17 Thread Antti Kantee

On 18/05/15 01:02, Paul Goyette wrote:

On Mon, 18 May 2015, Antti Kantee wrote:


On 17/05/15 22:40, Paul Goyette wrote:

My crusade for modularity has arrived at the pcppi(4) driver, and I've
discovered that there are a number of places in the code where a #if is
used to determine whether or not some _other_ driver is available to
provide certain routines.  For pcppi(4), these dependencies are for the
attimer(4) and pckbd(4) drivers.  (While I haven't yet gone searching,
I'd be willing to wager that there are other similar examples in other
drivers.)


As you say, you're proposing a solution based on looking at one
example and a wager that you'll find more use cases.  Furthermore,
your message is unclear on whether you've implemented your proposal to test
that it works even for your single case.

You won the wager in the sense that the problem exists.  However, I'm
not at all convinced that an abstraction hell via autoconf is the best
possible solution (not that I'm convinced that it isn't, either).  I
suggest analyzing and fixing at least half a dozen cases in the
tree before proposing a general solution.  For example,
scsi/ata/wd/sd/usb is a good place to look.

If you want to solve the problem just for one driver cluster, that's
more than fine, but you don't need a halfway general solution for that.


I'm certainly willing to implement the mechanism, as a proof-of-concept,
for the pcppi/attimer/pckbd cluster, if there's a reasonable chance of
the effort being useful.

I'm certainly not willing to spend the next 6 months (or more) of my
life analyzing and fixing at least half a dozen cases without some
encouragement that the effort won't be wasted.

If you've actually got some constructive feedback on my suggestion,
please provide it.


Is giving pointers to related use cases that, in your own words, you're 
not aware of not constructive?


If you want to solve the problem just for one driver cluster, that's 
more than fine.  In other words, if you don't want to spend effort on a 
general solution, implement what you need privately in pcppi land. 
Everyone still wins and will be thankful for your efforts, unlike in the 
event of a haphazardly researched autoconf interface.  There's no need 
to start making a list of things that you're not willing to do.


Re: Missing rump_kthread_destroy() ?

2015-04-19 Thread Antti Kantee

On 19/04/15 07:40, Paul Goyette wrote:

In my on-going efforts to further modularize the NetBSD kernel, I'm
currently prying apart the pieces of sysmon...

One of those pieces would be sysmon_taskq, which provides a lwp
environment to execute callouts.  In the sysmon_taskq_init() routine
there is a call to kthread_create(), so it would seem reasonable that
sysmon_taskq_fini() would call kthread_destroy().

Unfortunately, when building rump_allserver I discover that there is no
rump emulation for kthread_destroy().  There is _create(), _join(),
_exit(), and _init(), but no _destroy().

Is there a reason for not providing kthread_destroy()?  How difficult
would it be to add it?


The right thing to use is kthread_exit().

I don't see why kthread_destroy() needs to be in the public API at all. 
 I'd just remove it.
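To illustrate (a hedged sketch, not the actual sysmon_taskq code; the
names below are made up): the thread created with kthread_create()
terminates itself with kthread_exit() when the _fini() path asks it to,
so no external destroy operation is needed.

static bool		taskq_exiting;
static kmutex_t		taskq_lock;
static kcondvar_t	taskq_cv;

static void
taskq_thread(void *arg)
{

	mutex_enter(&taskq_lock);
	while (!taskq_exiting) {
		/* ... run queued work ... */
		cv_wait(&taskq_cv, &taskq_lock);
	}
	mutex_exit(&taskq_lock);
	kthread_exit(0);	/* the thread reaps itself; never returns */
}

/* the _fini() side sets taskq_exiting, cv_broadcast()s and kthread_join()s */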


Re: kernel constructor

2014-11-11 Thread Antti Kantee

There are two separate issues here:

1: link sets vs. ctors

They are exactly the same thing in slightly different clothing.  Mental 
exercise: define link_set_ctor and run those in kernel bootstrap when 
you'd run __attribute__((constructor)).  As David cautions, I don't 
think ctors should do anything apart from noting that X is present in the 
image so that initializing X can be done later.  With link sets you 
don't need the extra step of noting since you can just iterate when you 
want to.
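As a rough sketch of the difference, using the link set macros from
<sys/cdefs.h> (types simplified, names made up):

struct initinfo {
	const char	*name;
	void		(*init)(void);
};

static void mycomp_init(void) { /* ... */ }

/* each component drops an entry into the set at link time */
static const struct initinfo mycomp_info = { "mycomp", mycomp_init };
__link_set_add_rodata(init_funcs, mycomp_info);

/* bootstrap iterates the set whenever it decides the time is right */
void
run_init_funcs(void)
{
	__link_set_decl(init_funcs, const struct initinfo);
	const struct initinfo * const *ii;

	__link_set_foreach(ii, init_funcs)
		(*ii)->init();
}

/* a ctor, by contrast, runs at a point chosen by the toolchain, so it
 * should only note presence for later processing */
static bool mycomp_present;
static void __attribute__((constructor))
mycomp_note(void)
{
	mycomp_present = true;
}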



2: init_main ordering

I think that code reading is an absolute requirement there, i.e. we 
should be able to know offline what will happen at runtime.  Maybe that 
problem is better addressed with an offline preprocessor which figures 
out the correct order?


Re: [PATCH] PUFFS backend allocation (round 3)

2014-10-29 Thread Antti Kantee

On 29/10/14 00:11, Emmanuel Dreyfus wrote:

On Tue, Oct 28, 2014 at 06:07:29PM +0100, J. Hannken-Illjes wrote:

Confused.  If write and/or fsync are synchronous (VOP_PUTPAGES with
flag PGO_SYNCIO) no write error will be forgotten.


puffs_vnop_strategy() contains this:

	/*
	 * XXX: wrong, but kernel can't survive strategy
	 * failure currently.  Here, have one more X: X.
	 */
	if (error != ENOMEM)
		error = 0;

This is where we want to store the error so that it can be
recovered by upper layer.


That comment is close to 10 years old.  If you haven't, start by 
checking that it still applies and perhaps do a quick check to see what 
goes wrong (I don't remember exactly, some sort of kernel panic I think) 
and if it can be fixed.


And I still think that the best approach is to make the cache 
write-through, at least when a write causes a page fault, and then just 
deal with whatever distributed systemness happens behind the kernel 
driver's back.


Re: [PATCH] PUFFS backend allocation (round 3)

2014-10-29 Thread Antti Kantee

On 29/10/14 23:33, Emmanuel Dreyfus wrote:

Antti Kantee po...@iki.fi wrote:


That comment is close to 10 years old.  If you haven't, start by
checking that it still applies and perhaps do a quick check to see what
goes wrong (I don't remember exactly, some sort of kernel panic I think)
and if it can be fixed.


I just tried removing this "if (error != ENOMEM) error = 0" and it seems
to work fine on netbsd-7. The error is reported to the calling layers
without a hitch.

Are there some corner cases where it could cause problems?


Don't recall it being a corner case.


And why does NFS have to save the error in np->n_error to recover it in
the upper layer? Obsolete code that was never touched?


Not sure, but per a quick examination it looks like nfs wants to save 
the error for the next caller.  As long as puffs is synchronous, it 
won't be an issue.  Notably, though, a puffs file server might like to 
be asynchronous in handling a write and report an error later, but 
that's getting complicated.  Optimization is not a substitute for 
correctness ...


Re: [PATCH] GOP_ALLOC and fallocate for PUFFS

2014-09-30 Thread Antti Kantee

On 30/09/14 13:44, Emmanuel Dreyfus wrote:

Hello

When a PUFFS filesystem uses the page cache, data enters the
cache with no guarantee it will be flushed. If it cannot be flushed
(because PUFFS write requests get EDQUOT or ENOSPC), then the
kernel will loop forever trying to flush data from the cache,
and the filesystem cannot be unmounted without -f (and data loss).

In the attached patch, I add in PUFFS:
- support for the fallocate operation
- a puffs_gop_alloc() function that uses fallocate
- when writing through the page cache we first call GOP_ALLOC to make
   sure backend storage is allocated for the data we cache. Debug printfs
   show sane behavior, GOP_ALLOC calling puffs_gop_alloc only when required.

If the filesystem does not implement fallocate, we keep the current
behavior of filling the page cache with data we are not sure we can flush.
Perhaps we can improve further: missing fallocate can be emulated by
writing zeroed chunks. I have implemented that in libperfuse, but
we may want to have this in libpuffs, enabled by a mount option. Input
welcome.


Is it really better to do a synchronous fallocate, put stuff in the page 
cache and flush the page cache some day, instead of just having a 
write-through (or write-first) page cache on the write() path?  You also 
get rid of the fallocate-not-implemented problem that way.


That still leaves the mmap path ... but mmap always causes annoying 
problems and should just die ;)


Writing zeroes might be a bad emulation for distributed file systems, 
though I guess you're the expert in that field and can evaluate the 
risks better than me.
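For what it's worth, the zero-writing emulation being discussed is
roughly this (a userspace sketch with made-up names, not the actual
libperfuse code):

#include <sys/types.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* emulate fallocate by writing zero-filled chunks, so ENOSPC/EDQUOT
 * surfaces here instead of at page cache flush time */
static int
zerofill_range(int fd, off_t off, off_t len)
{
	char buf[8192];
	ssize_t n;

	memset(buf, 0, sizeof(buf));
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(buf) ?
		    (size_t)len : sizeof(buf);
		n = pwrite(fd, buf, chunk, off);
		if (n == -1)
			return errno;
		off += n;
		len -= n;
	}
	return 0;
}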


Re: How PUFFS should deal with EDQUOT?

2014-09-22 Thread Antti Kantee

On 22/09/14 04:28, Emmanuel Dreyfus wrote:

When a PUFFS filesystem enforces quota, a process doing a write over
quota will end up frozen in DE+ state.

The problem is that we have written data in the page cache that is
supposed to go to disk. The code path is a bit complicated, but
basically we go into genfs VOP_PUTPAGES, which leads to genfs_do_io(), where
we have a VOP_STRATEGY, which causes a PUFFS write. The PUFFS write will
get EDQUOT, but genfs_do_io() ignores VOP_STRATEGY's return value and
retries forever.

In other words, when flushing the cache, the kernel ignores errors from
the filesystem and runs an endless loop attempting to flush data, during
which the process that did the over quota write is not allowed to
complete exit().

What is the proper way to deal with that? Is it reasonable to wipe the
page cache using puffs_inval_pagecache_node() when write gets a failure?
Any failure? Or just EDQUOT and ENOSPC? Should that happen in libpuffs
or in the filesystem (libperfuse here)?


I'd guess the key to success would be to support genfs_ops in puffs so 
that the file server is consulted about block allocations.
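In other words, something along these lines on the write path (a sketch;
GOP_ALLOC(9) is the existing genfs hook, and the surrounding code is
heavily simplified):

	/*
	 * sketch: before dirtying the page cache, have the file server
	 * allocate the backing range via its gop_alloc, so EDQUOT/ENOSPC
	 * comes back from write(2) instead of looping in the pagedaemon.
	 */
	error = GOP_ALLOC(vp, uio->uio_offset, bytelen, 0, cred);
	if (error)
		return error;
	/* ... only now copy the data into the page cache ... */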


See also tests/vfs/t_full.c


Re: virtualized nfsd (Re: virtual kernels, syscall routing, etc.)

2011-03-22 Thread Antti Kantee
I almost forgot my annual contribution to this thread (actually missed
it last year, sorry 'bout that).

On Fri Oct 16 2009 at 05:36:40 +0300, Antti Kantee wrote:
 On Thu Nov 27 2008 at 20:32:15 +0200, Antti Kantee wrote:
  Good news everyone!
  
  I've made the kernel nfs service (nfsd) run in userspace.
 
 Ok, I've worked on this a little more.  Now it's possible to run a fully
 selfcontained nfsd with a virtualized TCP/IP stack and hence a dedicated
 IP address (the previous solution used host IP and rpcbind in a very
 unholy cocktail).
 
 [...]

 The bad news is that this currently requires a hacked version of the libc
 rpc client.  Without syscall routing mentioned in my first email on the
 subject, we cannot route the syscalls libc makes to the right kernel.
 The good news is that the modifications are selfcontained and I've put
 up a tarball.

So now I worked on it even more, and it's possible to use the stock kernel
nfs server code and stock userland binaries to run the kernel nfs server
in userspace (and, as usual, the stock kernel module binaries on x86).
The instructions are part of the tutorial I published last week:

http://www.netbsd.org/docs/rump/sptut.html#masterclass

If you want to use kernel module bins instead of rump libs, just remove
-lrumpfs_nfsserver and even -lrumpfs_nfs and -lrumpfs_ffs from the
rump_server command line.  The functionality will be autoloaded from
/stand/i386/5.99.48/modules on the host.  On amd64 it'll require some
tinkering, though, since with standard kernel module binaries you need
to load all rump kernel code into the bottom 2GB due to -mcmodel=kernel
and that tinkering is left as an exercise for the reader (you either
need static linking or to teach ld.so to do this).  Still, even without
tinkering the rump libs work just fine on every arch.

Since I can't figure out how to develop things any further than running
unmodified source and binary of every relevant component, I guess here
endeth this thread.

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


inheriting the lwp private area

2011-03-22 Thread Antti Kantee
Hi,

On Julio's request I was looking at the now-failing tests and resulting
hanging processes.  Basically at least on i386 the failures are a
result of fork() + setcontext() (plus some voodoo) after which calling
pthread_mutex_lock() with signals masked causes a busyloop due to
pthread__self() causing a segv but the signal never getting delivered
(that in itself seems like stinky business, but not the motivating factor
of this mail).

Per 4am intuition it seems pretty obvious that a child should inherit
the forking lwp's private data.  Does that make sense to everyone else?
At least patching fork1() to do so fixes the hanging processes and
failing tests and a quick roll around qemu hasn't caused any problems.

If it doesn't make sense, I'll disable the pthread bits (per commit
guideline clause 5) until support is fully fixed so that others don't have
to suffer from untested half-baked commits causing juju hangs and crashes.

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


rump is complete

2011-03-21 Thread Antti Kantee
Hi,

I have accomplished everything I want to with rump and plan to declare it
stable in NetBSD 6.  This implies adding new interfaces will slow down,
and changing old ones will require backward compat.

If you are interested in the unique possibilities offered by rump,
now is a good time to check your use cases work as expected.

  - antti

For what rump is, does, and how to use it, see the usual place:
http://www.NetBSD.org/docs/rump/


Re: the bouyer-quota2 branch

2011-03-10 Thread Antti Kantee
On Thu Mar 10 2011 at 09:36:20 +0100, Manuel Bouyer wrote:
 On Thu, Mar 10, 2011 at 11:42:08AM +1100, matthew green wrote:
  
   On Sat Feb 19 2011 at 23:21:35 +0100, Manuel Bouyer wrote:
This branch is for the development of a modernized disk quota system.
The 2 main changes are: a new quotactl(2) interface and a new on-disk
format, compatible with journaled ffs.
   
   Hmm, I'm wondering if the new quotactl syscall should have a new name
   instead of keeping the old one.
   
   It doesn't make much sense to play __RENAME() games with it since
   any old code will not compile against the new quotactl signature.
  
  that seems reasonable to me.
 
 What do you propose then ? quotactl is the best name I can find for this
 syscall ...

quotactl2?  quotapctl?  quota_pctl?  quotactl_the_next_generation?
... quota_king?

Considering that quotactl is not used by programmers (unless they're
hacking on the quota utils ;) I don't think we need to spend a lot
of energy on picking the name.  If we want to follow a common naming
scheme for all syscalls which will take a plist (such as future mount?),
we might want to spend a few minutes on it, though.


(Just to explain the rationale for this nomenclatural crisis, yesterday
I discovered that the changed signature broke some assumptions about
syscall compat I'd made in makesyscalls.sh, and that caused the script
to fail in a very-scratchingly way.  I could just change makesyscalls.sh,
but since I had made that assumption, it's possible others have too)

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: the bouyer-quota2 branch

2011-03-10 Thread Antti Kantee
On Thu Mar 10 2011 at 20:29:58 +1100, matthew green wrote:
  (Just to explain the rationale for this nomenclatural crisis, yesterday
  I discovered that the changed signature broke some assumptions about
  syscall compat I'd made in makesyscalls.sh, and that caused the script
  to fail in a very-scratchingly way.  I could just change makesyscalls.sh,
  but since I had made that assumption, it's possible others have too)
 
 BTW, when i changed reboot(2) i added a char * to the signature.
 (this was in 1996?) how does this affect your compat assumptions?

It doesn't affect them because i'm not interested in compiling new code
against oreboot.  So theoretically yes, in reality no.  I care about
the latter ;)

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: the bouyer-quota2 branch

2011-03-10 Thread Antti Kantee
On Thu Mar 10 2011 at 11:28:14 +0100, Manuel Bouyer wrote:
  Considering that quotactl is not used by programmers (unless they're
  hacking on the quota utils ;) I don't think we need to spend a lot
 
 someone who looks at quotactl(8) will also look at quotactl(2) ...

You can still MLINKS the quotactl.2 name or add a note.

  of energy on picking the name.  If we want to follow a common naming
 
 Agreed. So let's keep quotactl(2) ... it's fine and is working.

I don't agree about fine, but I won't push the issue any further.

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: bouyer-quota2: fsck_ffs crash

2011-03-10 Thread Antti Kantee
On Thu Mar 10 2011 at 19:45:20 +0100, Manuel Bouyer wrote:
 On Thu, Mar 10, 2011 at 06:59:41PM +0100, Ignatios Souvatzis wrote:
  Hi,
  
  % unmount /export/home/1
  % tunefs -q user /export/home/1
  % fsck -fy /export/home/1
  ...
   USER QUOTA MISMATCH FOR ID 0: 0/0 SHOULD BE 1791988/1794
   ALLOC? yes
   USER QUOTA MISMATCH FOR ID 0: 0/0 SHOULD BE 0/0
   ALLOC? yes
   fsck: /dev/home/rtheory1: Segmentation fault
  
  This is on Sparc64. I'll provide more data tomorrow, assuming
  I'll find time to point gdb at a -g binary and the core dump.
 
 it should not try to allocate/fix entries for the same uid twice.
 Also, 0/0 SHOULD BE 0/0 looks wrong. It would be interesting
 to see if an entry got really added for id 0 twice, or if the second
 id is the result of some corruption.
 
 Can you see if tests/sbin/fsck_ffs completes fine on sparc64
 (atf-run|atf-report in this directory) ?

They complete fine on a sparc64 (but of course that doesn't guarantee
they complete fine on Ignatios's sparc64, so he should run the tests).

http://www.netbsd.org/~martin/sparc64-atf/22_atf.html

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: the bouyer-quota2 branch

2011-03-09 Thread Antti Kantee
On Sat Feb 19 2011 at 23:21:35 +0100, Manuel Bouyer wrote:
 This branch is for the development of a modernized disk quota system.
 The 2 main changes are: a new quotactl(2) interface and a new on-disk
 format, compatible with journaled ffs.

Hmm, I'm wondering if the new quotactl syscall should have a new name
instead of keeping the old one.

It doesn't make much sense to play __RENAME() games with it since
any old code will not compile against the new quotactl signature.

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: Fwd: Status and future of 3rd party ABI compatibility layer

2011-03-01 Thread Antti Kantee
On Tue Mar 01 2011 at 09:55:38 +, Andrew Doran wrote:
 On Mon, Feb 28, 2011 at 11:25:07AM -0500, Thor Lancelot Simon wrote:
 
  On Mon, Feb 28, 2011 at 11:13:36AM +0200, haad wrote:
   
   With solaris.kmod we are compatible with solaris kernel, (we should
   be able to load solaris kernel modules).
  
  Have you actually tried this?  I am pretty sure it would not work.
  
  It appears to me that solaris.kmod includes shims that provide some
  Solaris kernel interfaces at the *source* level in NetBSD, which
  certainly makes it easier to port kernel code from Solaris but does
  not (as far as I can tell) give us binary compatibility.
 
 Adam may have meant source level compat, it definitely does provide some
 level of that. Of course no binary compat as you say.

If Solaris has a module-compatible kernel ABI it's most likely possible
to be binary compatible considering we're source-compatible already
(cf. rump ABI compatibility with the kernel).  Of course it doesn't
happen accidentally and there's some amount of work involved.  But if
someone finds a use case for it, why not?

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: modules and weak aliases

2011-02-22 Thread Antti Kantee
On Tue Feb 22 2011 at 13:24:38 -0600, David Young wrote:
 If there are weak aliases in my kernel and strong aliases in my kernel
 module, will the in-kernel linker override the weak aliases when I load
 my module, and put back the weak alias when it unloads my module?
 
 Supposing that the answer to my first question is yes, can I make the
 modules subsystem pause, before releasing the module's memory, while all
 threads vacate the module's functions?

From what I recall from having some things accidentally as __weak_alias
in rump, this happens:

	case STB_WEAK:
		kobj_error("weak symbols not supported\n");
		return 0;

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: next vforkup chapter: lwpctl

2011-02-15 Thread Antti Kantee
On Tue Feb 15 2011 at 13:05:11 +, Alexander Nasonov wrote:
 Antti Kantee wrote:
  This is not about rumphijack.  Look at e.g. sh and make.
  
  Even if you do fix them, it's not just limited to malloc either.
  Anything that uses LWPCTL will be screwed up after vfork.
 
 Hi Antti,
 Sorry if I suggest something stupid, but would it be possible to make
 librumphijack pthread-neutral? E.g. use atomic_ops and/or rumpfd as
 synchronization primitives?

In that case you'd have to implement poll/select (and kevent) with the
help of fork().  It would be a much more heavyweight operation, especially
since it causes another rump kernel handshake to happen.  Furthermore, you
cannot cache the workers.  Well, maybe with __clone(CLONE_FILES), but ...

So, yes, it would be possible, but not a good move since it doesn't
solve any problems (apart from working around this kernel bug) and causes
extra penalties.

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


next vforkup chapter: lwpctl

2011-02-14 Thread Antti Kantee
Hi,

Alexander pointed me at a problem where under suitable conditions a
process with rump syscall hijacking would crash after vfork(), and
sent the attached test program.  Under further examination, it turned
out that the problem is due to libpthread and lwpctl.

Having pthread linked causes malloc to use pthread routines instead
of the libc stubs.  Now, the vfork() child will use a pointer to the
parent's lwpctl area and thinks it is running on LWPCTL_CPU_NONE (-1).
When malloc uses this to index the arena map, it unsurprisingly gets
total garbage back.

The following patch makes a vfork child update the parent's lwpctl area
while the child is running.  Comments or better ideas?

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
	malloc(1);
	switch(vfork())
	{
	case -1:
		return EXIT_FAILURE;
	case 0:
		malloc(1);
		_exit(EXIT_FAILURE);
	default:
		;
	}

	return EXIT_SUCCESS;
}
Index: kern/kern_exec.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exec.c,v
retrieving revision 1.305
diff -p -u -r1.305 kern_exec.c
--- kern/kern_exec.c	18 Jan 2011 08:21:03 -0000	1.305
+++ kern/kern_exec.c	14 Feb 2011 12:25:28 -0000
@@ -979,6 +979,7 @@ execve1(struct lwp *l, const char *path,
 		mutex_enter(proc_lock);
 		p->p_lflag &= ~PL_PPWAIT;
 		cv_broadcast(&p->p_pptr->p_waitcv);
+		l->l_lwpctl = NULL;	/* borrowed from parent */
 		mutex_exit(proc_lock);
 	}
 
Index: kern/kern_exit.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.231
diff -p -u -r1.231 kern_exit.c
--- kern/kern_exit.c	18 Dec 2010 01:36:19 -0000	1.231
+++ kern/kern_exit.c	14 Feb 2011 12:25:28 -0000
@@ -343,6 +343,7 @@ exit1(struct lwp *l, int rv)
 	if (p->p_lflag & PL_PPWAIT) {
 		p->p_lflag &= ~PL_PPWAIT;
 		cv_broadcast(&p->p_pptr->p_waitcv);
+		l->l_lwpctl = NULL;	/* borrowed from parent */
 	}
 
 	if (SESS_LEADER(p)) {
Index: kern/kern_lwp.c
===================================================================
RCS file: /cvsroot/src/sys/kern/kern_lwp.c,v
retrieving revision 1.154
diff -p -u -r1.154 kern_lwp.c
--- kern/kern_lwp.c	17 Jan 2011 08:26:58 -0000	1.154
+++ kern/kern_lwp.c	14 Feb 2011 12:25:29 -0000
@@ -696,6 +696,12 @@ lwp_create(lwp_t *l1, proc_t *p2, vaddr_
 	l2->l_pflag = LP_MPSAFE;
 	TAILQ_INIT(&l2->l_ld_locks);
 
+	/* For vfork, borrow parent's lwpctl context */
+	if (flags & LWP_VFORK && l1->l_lwpctl) {
+		l2->l_lwpctl = l1->l_lwpctl;
+		l2->l_flag |= LW_LWPCTL;
+	}
+
 	/*
 	 * If not the first LWP in the process, grab a reference to the
 	 * descriptor table.
@@ -1376,6 +1382,17 @@ lwp_userret(struct lwp *l)
 		KASSERT(0);
 		/* NOTREACHED */
 	}
+
+	/* update lwpctl process (for vfork child_return) */
+	if (l->l_flag & LW_LWPCTL) {
+		lwp_lock(l);
+		l->l_flag &= ~LW_LWPCTL;
+		lwp_unlock(l);
+		KPREEMPT_DISABLE(l);
+		l->l_lwpctl->lc_curcpu = (int)cpu_index(l->l_cpu);
+		l->l_lwpctl->lc_pctr++;
+		KPREEMPT_ENABLE(l);
+	}
 }
 
 #ifdef KERN_SA
@@ -1529,6 +1546,10 @@ lwp_ctl_alloc(vaddr_t *uaddr)
 	l = curlwp;
 	p = l->l_proc;
 
+	/* don't allow a vforked process to create lwp ctls */
+	if (p->p_lflag & PL_PPWAIT)
+		return EBUSY;
+
 	if (l->l_lcpage != NULL) {
 		lcp = l->l_lcpage;
 		*uaddr = lcp->lcp_uaddr + (vaddr_t)l->l_lwpctl - lcp->lcp_kaddr;
@@ -1653,11 +1674,16 @@ lwp_ctl_alloc(vaddr_t *uaddr)
 void
 lwp_ctl_free(lwp_t *l)
 {
+	struct proc *p = l->l_proc;
 	lcproc_t *lp;
 	lcpage_t *lcp;
 	u_int map, offset;
 
-	lp = l->l_proc->p_lwpctl;
+	/* don't free a lwp context we borrowed for vfork */
+	if (p->p_lflag & PL_PPWAIT)
+		return;
+
+	lp = p->p_lwpctl;
 	KASSERT(lp != NULL);
 
 	lcp = l->l_lcpage;
Index: sys/lwp.h
===================================================================
RCS file: /cvsroot/src/sys/sys/lwp.h,v
retrieving revision 1.142
diff -p -u -r1.142 lwp.h
--- sys/lwp.h	28 Jan 2011 16:58:27 -0000	1.142
+++ sys/lwp.h	14 Feb 2011 12:25:29 -0000
@@ -214,6 +214,7 @@ extern lwp_t	lwp0;	/* LWP for proc0. */
 
 /* These flags are kept in l_flag. */
 #define	LW_IDLE		0x0001 /* Idle lwp. */
+#define	LW_LWPCTL	0x0002 /* Adjust lwpctl in userret */
 #define	LW_SINTR	0x0080 /* Sleep is 

Re: remove sparse check in vnd

2011-02-05 Thread Antti Kantee
On Sun Feb 06 2011 at 00:08:33 +0900, Izumi Tsutsui wrote:
 yamt@ wrote:
 
  i'd like to remove the sparseness check in vnd because there's
  no problem to use a sparse files on nfs.
 
 We really want vnd on sparse files for emulator images...

I have this in my /etc/fstab:
/home/pooka/temp/anita/wd0.img%DISKLABEL:a% /anita ffs rw,noauto,rump

It works perfectly for editing the image.  fsck is a slightly gray area,
but with wapbl it's not really a concern.  I use the following to mount
the image so that I don't need to unnecessarily sudo all the access:

 alias anitamnt env P2K_WIZARDUID=0 mount -o log /anita


But on the original subject, maybe we can use either gop_alloc or vop_bmap
to decide if the underlying file system supports vnd on sparse files.
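E.g. something like the following in vnd's configuration path (a sketch
using VOP_BMAP(9); error handling and the loop over the file's blocks
elided):

	daddr_t bn;
	int run, error;

	/* a block that bmaps to -1 has no backing storage allocated,
	 * i.e. the image is sparse at this offset */
	error = VOP_BMAP(vp, lblkno, NULL, &bn, &run);
	if (error == 0 && bn == (daddr_t)-1) {
		/* hole: decide here whether this vnd config is allowed */
	}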

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: kernel memory allocators

2011-01-22 Thread Antti Kantee
On Sat Jan 22 2011 at 17:55:11 +0100, Lars Heidieker wrote:
 yes, that makes sense for trying my changes with different pool_page
 sized pool_allocators...
 I think the initialization order has to be tested on bare metal, right?

Yes, anything which depends on a real uvm/pmap layer being present needs
to be tested, if not on bare metal, at least in an environment where
the whole OS stack is present.  If we had better platform support for
anita (*), things like this would be a lot easier to test ...

I guess in theory it would be possible to use static analysis to check
initialization order, but that might be more of a hobby ;)

*) http://www.netbsd.org/developers/features/

 On Fri, Jan 21, 2011 at 12:29 PM, Antti Kantee po...@cs.hut.fi wrote:
  btw, just in case you're interested, you can easily use rump for userspace
  development/testing of the kmem/vmem/pool layers.  src/tests/rump/rumpkern
  has examples on how to call kernelspace routines directly from user
  namespace.
 
  On Fri Jan 21 2011 at 11:51:08 +0100, Lars Heidieker wrote:
   Do you have your changes available for review?
 
  The kmem patch it includes:
 
  - enhanced vmk caching in the uvm_km module not only for page sized
  allocation but low integer multiples.
    (changed for rump as well)
  - a changed kmem(9) implementation (using these new caches) (it's not
  using vmem see note below)
  - removed the malloc(9) bucket system and made malloc(9) a thin
  wrapper around kmem, just like in the yamt-kmem branch.
  (changed vmstat to deal with the no-longer-existing symbol for the malloc 
  buckets)
 
  - pool_subsystem_init is split into pool_subsystem_bootstrap and
  pool_subsystem_init,
  after bootstrap static allocated pools can be initialized and after
  init allocation is allowed.
  the only instances (as fas as I found them) that do static pool
  initialization earlier are some pmaps those are changed accordingly.
  (Tested i386 and amd64 so far)
 
  vmem:
  Status quo:
  The kmem(9) implementation used vmem for its backing, with an
  pool_allocator for each size this is unusual for caches.
  The vmem(9) backing kmem(9) uses a quantum size of the machine
  alignment so 4 or 8 bytes, therefore the quantum caches of the vmem
  are very small and kmem extends these to larger ones.
  The import functions for vmem do this on a page sized basis and the
  uvm_map subsystem is in charge of controlling the virtual address
  layout and vmem is just an extra layer.
 
  Questions:
  Shouldn't vmem provide the pool caches with pages for import into the
  pools and the quantum caches of vmem should provide these pages for
  the low integer multiplied sizes? That's the way I understand the idea
  of vmem and its implementation in Solaris.
  But this only makes sense if vmem(9) is in charge of controlling, let's
  say the kmem map and not the uvm_map system, slices of this submap
  would be described by vmem entries and not by map entries.
 
  With the extended vmk caching for the kernel_map and kmem_map I
  implemented the quantum caching idea.
 
  Results on an amd64 four-core 8gb machine:
 
  sizes after: building a kernel with make -j200, du /, ./build.sh -j8
  distribution
                               current            changed kmem
  pool size:                   915mb / 950mb      942mb/956mb
  pmap -R0 | wc                2700               1915
 
  sizes after pushing the memory system with several instances of the
  Sieve of Eratosthenes each one consuming about 540mb to shrink the
  pools.
                               current            changed kmem
  pool size:                   657mb / 760mb      620mb/740mb
  pmap -R0 | wc                4280               3327
 
 
  those numbers are not precise (especially the later ones) at all but
  they do hint in a direction.
  Keep in mind that allocations that go to malloc in the current
  implementation go to the pool in the changed one.
  Runtime of the build process was the same within a few seconds difference.
 
  kind regards,
  Lars
 
 
 
  --
  älä karot toivorikkauttas, kyl rätei ja lumpui piisaa
 
 
 
 
 -- 
 Mystische Erklärungen:
 Die mystischen Erklärungen gelten für tief;
 die Wahrheit ist, dass sie noch nicht einmal oberflächlich sind.
    -- Friedrich Nietzsche
    [ Die Fröhliche Wissenschaft Buch 3, 126 ]

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: kernel memory allocators

2011-01-21 Thread Antti Kantee
On Fri Jan 21 2011 at 11:51:08 +0100, Lars Heidieker wrote:
  Do you have your changes available for review?
 
 The kmem patch it includes:
 
 - enhanced vmk caching in the uvm_km module not only for page sized
 allocation but low integer multiples.
   (changed for rump as well)
 - a changed kmem(9) implementation (using these new caches) (it's not
 using vmem see note below)
 - removed the malloc(9) bucket system and made malloc(9) a thin
 wrapper around kmem, just like in the yamt-kmem branch.
 (changed vmstat to deal with the no-longer-existing symbol for the malloc 
 buckets)

With your changes you can probably also include kern_malloc.c in librump
instead of the host-relegated allocator in memalloc.c.  There were two
reasons why it wasn't done before:

1) i didn't want to guess an arbitrary size for kmem_map
2) too many subsystems relied on link sets for malloc types and i
   didn't want to add special handling for that

At least per cursory examination your patch seems to take care of
both issues.

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: kernel messages and rump

2011-01-13 Thread Antti Kantee
On Wed Jan 12 2011 at 19:23:44 +0100, Manuel Bouyer wrote:
 I can live with it for now; having the uprintf output somewhere could help
 for atf tests though. I have filed kern/44378 about this.

Thanks, i'll look at it some day hopefully soon.

Curiously enough, during all the time i've been working with rump
(3.5 years now) I've never missed uprints ;)

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: kernel messages and rump

2011-01-12 Thread Antti Kantee
On Wed Jan 12 2011 at 15:36:02 +0100, Manuel Bouyer wrote:
 Hello,
 I'm playing with rump, and more specifically rump_ffs.
 The mount is rejected (as expected) because the fs image has a feature
 which is not yet in the kernel. It's rejected by this code:
 if (fs->fs_flags & ~(FS_KNOWN_FLAGS | FS_INTERNAL)) {
 	uprintf("%s: unknown ufs flags: 0x%08"PRIx32"%s\n",
 	    mp->mnt_stat.f_mntonname, fs->fs_flags,
 	    (mp->mnt_flag & MNT_FORCE) ? "" : ", not mounting");
 	if ((mp->mnt_flag & MNT_FORCE) == 0) {
 		mutex_exit(&ump->um_lock);
 		return (EINVAL);
 	}
 }
 but even with RUMP_VERBOSE I never see the uprintf(). Where does it go, and
 is there a way to make rump print it (I guess it should just go to
 stderr) ?

It goes to the same place as for any process without a tty: the bitbucket.

To properly support uprintf, there are at least two things to consider:

  1) is the calling process local or remote
  2) does the kernel include rumpkern_tty support

If you want a quick solution, file a PR and add ifdefs to the uprint
routines in subr_prf.c to make them behave like kprintf(TOCONS).

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: WAPBL kernel assertion

2011-01-08 Thread Antti Kantee
On Sat Jan 08 2011 at 14:32:23 +0100, Manuel Bouyer wrote:
 Hello,
 on a NetBSD 5.1 Xen domU, I got:
 panic: kernel diagnostic assertion "wl->wl_dealloccnt < wl->wl_dealloclim" 
 failed: file "/home/builds/ab/netbsd-5/src/sys/kern/vfs_wapbl.c", line 1673
 
 file system is clean (I forced a fsck). For now I'm running without wapbl.
 Does this ring a bell to someone ?

Try including revs 1.27 and 1.28 of vfs_wapbl.c.  I can't recall the
details anymore, but I remember the overflow was quite easy to trigger
on a rump kernel.  Maybe the problem triggers easier in a virtual
environment?

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: bad merge in uvm_fault_lower?

2010-12-15 Thread Antti Kantee
On Mon Dec 13 2010 at 00:24:49 +, Alexander Nasonov wrote:
 Hi,
 In sys/uvm/uvm_fault.c I see three KASSERT's twice: 

Removed one set.  Thanks.

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: Sane support for SMP kernel profiling

2010-12-11 Thread Antti Kantee
On Fri Dec 10 2010 at 23:13:54 -0500, Thor Simon wrote:
 We've fixed SMP kernel profiling, which worked poorly at best (particularly
 on systems with high HZ) since a lock was taken and released around every
 single entry to mcount.  Thanks to Andy for the suggestion as to how.

Nice.  Since you're on a roll, do you have plans to investigate userland
multithreaded profiling?  The only way I've gotten it to work reliably
is to artificially leave libpthread out of the mix, and it's not
multithreaded after that ...

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: what is this KASSERT() testing?

2010-12-05 Thread Antti Kantee
On Mon Dec 06 2010 at 11:55:05 +1100, matthew green wrote:
 
 hi.
 
 
 my devbox just crashed with this:
 
 panic: kernel diagnostic assertion "pg == NULL || pg == PGO_DONTCARE" failed: 
 file "/usr/src/sys/miscfs/genfs/genfs_io.c", line 243
 
 but i don't understand the KASSERT().  it seems that this sequence
 of events will always trigger:
 
 nfound = uvn_findpages(uobj, origoffset, &npages,
     ap->a_m, UFP_NOWAIT|UFP_NOALLOC|(memwrite ? UFP_NORDONLY
     : 0));
 ...
 if (!genfs_node_rdtrylock(vp)) {
 ...
 for (i = 0; i < npages; i++) {
 	pg = ap->a_m[i];
 	if (pg != NULL && pg != PGO_DONTCARE) {
 		ap->a_m[i] = NULL;
 	}
 
 	KASSERT(pg == NULL || pg == PGO_DONTCARE);
 
 won't all pages filled in by the uvn_findpages() be not NULL, so
 if the uvn_findpages() succeeds but the genfs_node_rdtrylock() 
 fails, we will trigger this assert always.
 
 
 i think it should just be removed.

I guess it wants to test ap->a_m[i], cf. the change to the assignment 
clause in the same revision.

-- 
älä karot toivorikkauttas, kyl rätei ja lumpui piisaa


Re: mutexes, locks and so on...

2010-11-24 Thread Antti Kantee
Thanks, I'll use your list as a starting point.  One question though:

On Wed Nov 24 2010 at 00:16:37 +, Andrew Doran wrote:
 - build.sh on a static, unchanging source tree.

From the SSP discussion I have a recollection that build.sh can be
very jittery, up to the order of 1% per build.  I've never confirmed it
myself, though.  Did you notice anything like that?

(I guess the tools would have to be static too, so that they are not
affected by the host compiler)


Re: misuse of pathnames in rump (and portalfs?)

2010-11-24 Thread Antti Kantee
Hi,

On Tue Nov 23 2010 at 23:13:02 +, David Holland wrote:
 Furthermore, it is just plain gross for the behavior of VOP_LOOKUP in
 some directory to depend on how one got to that directory. As a matter
 of design, the working path should not be available to VOP_LOOKUP and
 VOP_LOOKUP should not attempt to make use of it.
 
 When I asked pooka for clarification, I got back an assertion that
 portalfs depends on this behavior so I should rethink the namei design
 to support it. However, as far as I can tell, this is not true: there
 is only one unexpected/problematic use of the pathname buffer in
 question anywhere in the system, in rumpfs.c. Furthermore, even if it
 were true, I think it would be highly undesirable.

You wrote that the whole path will no longer be available.  As you
say yourself, it doesn't make sense for a file system to care about
the previous components, so don't be shocked that I took this to mean
the whole remaining path.  If the whole remaining path is available,
portalfs should be fine.

As for etfs, as you might be able to see from the code, it's only used
for root vnode lookups.  I cannot think of a reason why we cannot define
the key to start with exactly one leading '/'.  Some in-tree users may
not follow that rule now, but they should be quite trivial to locate
with grep.  That should make it work properly with your finally-nonbroken
namei and also take care of all symlink/.. concerns you might have.

thanks,
  antti


Re: mutexes, locks and so on...

2010-11-24 Thread Antti Kantee
On Wed Nov 24 2010 at 12:42:44 -0500, Thor Lancelot Simon wrote:
 On Wed, Nov 24, 2010 at 04:52:38PM +0200, Antti Kantee wrote:
  Thanks, I'll use your list as a starting point.  One question though:
  
  On Wed Nov 24 2010 at 00:16:37 +, Andrew Doran wrote:
   - build.sh on a static, unchanging source tree.
  
  From the SSP discussion I have a recollection that build.sh can be
  very jittery, up to the order of 1% per build.  I've never confirmed it
  myself, though.  Did you notice anything like that?
 
 There are other issues associated with build.sh as a benchmark.

   * What are you trying to test?  If you're trying to test the
 efficiency of cache algorithms or the I/O subsystem (including
 disk sort), for example, you need to test pairs of runs with
 a cold boot of *ALL INVOLVED HARDWARE* (this includes disk
 arrays etc) between each.
 
 * If SSDs, hybrid disks, or other potentially self-reorganizing
   media are involved, forget it, you just basically lose.
 
 * If you're trying to test everything *but* the cache and I/O
   subsystem, then you need to use a warm up procedure you can
   have reasonable confidence works, for example always measuring
   the Nth of N consecutive builds.

Indeed.  Let's start with the low-hanging fruit first -- having some
figures which at least make some sense (e.g. measure second of two builds
in a row) is better than no figures.

 * It can be hard to construct a system configuration where NetBSD
   kernel performance is actually the bottleneck and some other
   hardware limitation is not.  Or where there's only a single
   bottleneck.

Dunno about NetBSD specifically, but this suggests great differences:
http://www.netbsd.org/~ad/50/img15.html

At least I doubt we got dramatically better drivers between 4 and 5.
No idea about other OS performance there.


Re: misuse of pathnames in rump (and portalfs?)

2010-11-24 Thread Antti Kantee
On Wed Nov 24 2010 at 18:12:00 +, David Holland wrote:
   As for etfs, as you might be able to see from the code, it's only used
   for root vnode lookups.  I cannot think of a reason why we cannot define
   the key to start with exactly one leading '/'.  Some in-tree users may
   not follow that rule now, but they should be quite trivial to locate
   with grep.  That should make it work properly with your finally-nonbroken
   namei and also take care of all symlink/.. concerns you might have.
 
 I think it makes more sense for doregister to check for at least one
 leading '/' and remove the leading slashes before storing the key.
 Then the key will match the name passed by lookup; otherwise the
 leading slash won't be there and it won't match. (What I suggested
 last night is broken because it doesn't do this.)

Ah, yea, the leading slashes will be stripped for lookup, so we can't
get an exact match for those anyway.

So, let's define it as string beginning with /, leading /'s collapsed
to 1.

 All users I can find pass an absolute path.

ok, good


Re: mutexes, locks and so on...

2010-11-23 Thread Antti Kantee
On Fri Nov 19 2010 at 00:11:12 +, Andrew Doran wrote:
 You can release it with either call, mutex_spin_ is just a way to avoid
 additional atomic operations.  The usual case is adaptive mutex, but
 stuff like the dispatcher/scheduler makes use of spin mutexes exclusively
 and the fast path versions were invented for that. (Because you can measure
 the effect with benchmarks :-).

Speaking of which, something I (and a few others) have been thinking
about is to have constantly running benchmarks (akin to constantly
running tests).  That way we can have a rough idea which way performance
and resource consumption is going and if there are any sharp regressions.
Are your old benchmarking programs still available somewhere?


Re: mutexes, locks and so on...

2010-11-12 Thread Antti Kantee
On Fri Nov 12 2010 at 14:30:58 +0100, Johnny Billquist wrote:
 By reasoning that we should design for tomorrow's hardware, we might as 
 well design explicitly for x86_64, and let all others emulate that. But 
 in the past, I believe NetBSD has tried to rise above such simple and 
 naïve implementation designs and actually tried to grab the meaning of the 
 operation instead of an explicit implementation. That has belonged more 
 in the field of Linux. I hope we don't go down that path...

Freeway design is not driven by the requirements of the horse.  If a horse
occasionally wants to gallop down a freeway, we're happy to let it as long
as it doesn't cause any impediment to the actual users of the freeway.

Over 15 years ago NetBSD had a possibility to take everyone into account
since everyone was more or less on the same line.  This is no longer true.
If old architectures can continue to be supported, awesome, but they may
in no way dictate MI design decisions which hold back the capabilities
of modern day architectures.


Re: mutexes, locks and so on...

2010-11-12 Thread Antti Kantee
On Fri Nov 12 2010 at 15:25:04 +0100, Johnny Billquist wrote:
 Freeway design is not driven by the requirements of the horse.  If a horse
 occasionally wants to gallop down a freeway, we're happy to let it as long
 as it doesn't cause any impediment to the actual users of the freeway.
 
 Over 15 years ago NetBSD had a possibility to take everyone into account
 since everyone was more or less on the same line.  This is no longer true.
 If old architectures can continue to be supported, awesome, but they may
 in no way dictate MI design decisions which hold back the capabilities
 of modern day architectures.
 
 So what you are arguing is that MI needn't be so much MI anymore, and 
 that supporting anything more than mainstream today is more to be 
 considered a lucky accident than a desired goal?

You can try to twist my words in any way that pleases you.  However, the
fact is that people who put forward a heroic effort in modernizing NetBSD
will not be held accountable for making sure prehistoric architectures
keep up (*).  Some of our older ports have active supporters who keep
the port up to speed with MI changes, set up emulator support, publish
test run results etc.  These ports will continue to be supported by
NetBSD indefinitely.

*) just to be explicit: prehistoric != non-x86


Re: mutexes, locks and so on...

2010-11-12 Thread Antti Kantee
On Fri Nov 12 2010 at 16:58:18 +, Mindaugas Rasiukevicius wrote:
 What Johnny apparently suggests is to revisit mutex(9) interface, which
 is known to work very well, and optimise it for VAX.  Well, I hope we
 do not design MI code to be focused on VAX.  If we do, then perhaps I
 picked the wrong project to join.. :)

He is suggesting to revisit the implementation.  It doesn't take much
thinking to figure out you don't have to use kern_rwlock.c on vax, just
provide the interface.  It's really really unlikely the *interface*
will change, so it's not much code updating to worry about either.

(incidentally, rump kernels have taken this approach for, what, 3 years
now because the kernel implementation of mutex/rwlock uses primitives
which are not in line with the goals of rump, namely to virtualize without
stacking multiple unnecessary implementations of the same abstraction)


Re: XIP (Rev. 2)

2010-11-09 Thread Antti Kantee
A big problem with the XIP thread is that it is simply not palatable.
It takes a lot of commitment just to read the thread, not to mention
putting out sensible review comments like e.g. Chuq and Matt have done.
The issue is complex and the code involved is even more so.  However,
that is no excuse for a confusing presentation.  It seems like hardly
anyone can follow what is going on, and usually that signals that the
audience is not the root of the problem.

A while back chuq promised to send a mail classifying his points
into clear showstopers and issues which can be handled post-merge.
Let's start with that list (hopefully we'll get it soon) and see what
exactly are the relevant issues remaining and solve *only* those issues.

What needs to stop is threading to other areas because $subsystem is
broken beyond repair.  We know, but let's just handle the problems
relevant to XIP for now.


Re: XIP (Rev. 2)

2010-11-09 Thread Antti Kantee
On Tue Nov 09 2010 at 12:47:11 -0600, David Young wrote:
 On Tue, Nov 09, 2010 at 04:31:22PM +0200, Antti Kantee wrote:
  A big problem with the XIP thread is that it is simply not palatable.
  It takes a lot of commitment just to read the thread, not to mention
  putting out a sensible review comments like e.g. Chuq and Matt have done.
  The issue is complex and the code involved is even more so.  However,
  that is no excuse for a confusing presentation.  It seems like hardly
  anyone can follow what is going on, and usually that signals that the
  audience is not the root of the problem.
 
 If the conversation's leading participants adopt the rule that they may
 not introduce a new term (pager ops) or symbol (pgo_fault) to the
 discussion until a manual page describes it, then we will gain some
 useful kernel-internals documentation, and the conversation will be more
 accessible. :-)

Those concepts are carefully documented, if nowhere else, at least in
the uvm dissertation.  Basically a pager is involved in moving things
between memory and whatever the va is backed with (swap, a file system,
ubc, ...).  There's pgo_get which pages data from the backing storage
to memory (*) and pgo_put which does the opposite.  Additionally there's
pgo_fault which is like pgo_get except the interface allows the method
a little more freedom in how it handles the operation.  ... but i don't
know if that's a helpful explanation unless you are familiar with pagers,
which is why it is very difficult to produce succinct documentation on 
the subject -- everyone learns to understand it a little differently.

*) obviously in the case of XIP it is a matter of mapping instead 
of transferring
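Very roughly, that vocabulary maps onto the pager operations structure
like this (a heavily simplified sketch, not the real declaration from
sys/uvm/uvm_pager.h; the signatures are approximate):

struct uvm_pagerops_sketch {
	/* backing store -> memory (for XIP: establish a mapping instead) */
	int	(*pgo_get)(struct uvm_object *uobj, voff_t offset,
		    struct vm_page **pgs, int *npages, int centeridx,
		    vm_prot_t access_type, int advice, int flags);
	/* memory -> backing store */
	int	(*pgo_put)(struct uvm_object *uobj, voff_t lo, voff_t hi,
		    int flags);
	/* like pgo_get, but the pager handles the whole fault itself */
	int	(*pgo_fault)(struct uvm_faultinfo *ufi, vaddr_t vaddr,
		    struct vm_page **pgs, int npages, int centeridx,
		    vm_prot_t access_type, int flags);
};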

But, the problem was not so much the use of terminology as it was the
lack of any clear focus on the direction.  I can't form a clear mental
image of the project, although admittedly I didn't even finish reading
the earlier thread yet.

Like gimpy said, the diff is a big piece to swallow since it's so full
of unrelated parts:

1) man pages
2) new drivers
3) vm
4) vnode pager
5) MD collateral

Then again, it's missing pieces (what's pmap_common.c?  and isn't that
a slight oxymoron ?)

The diff would be much more browsable if it was separated into pieces
and the man pages attached as rendered versions.  Although reading the
diff is quicker than reading the previous thread ;)

A radically different implementation at this stage seems feasible only
if there is strong reason for that based on another actually existing
implementation (in another OS, of course).

Beauty issues aside, can we have a summary of the current implementation
of XIP from a functional perspective, i.e. what works and what doesn't.
That's what users care about ...


Re: Capsicum: practical capabilities for UNIX

2010-10-26 Thread Antti Kantee
On Tue Oct 26 2010 at 13:04:30 +0200, Jean-Yves Migeon wrote:
 
 On Mon, 25 Oct 2010 20:13:16 -0500, David Young dyo...@pobox.com wrote:
  I've been wondering if the dynamic linker could simulate access to
  the global namespace by supplying alternate system-call stubs.  Say
  rtld-elf-cap supplies its own open(2) stub, for example, that searches
  Capsicum's fdlist for a suitable file descriptor on which to call
  openat(2):
  
  int
  open(const char *path, int flags, mode_t mode)
  {
  const char *name;
  int fd;
  
  for (name, fd in fdlist) {
  if (path is-under-directory name)
  return openat(fd, path, flags, mode);
  }
  errno = ENOENT;
  return -1;
  }
 
 That would only work with dynamic executables. Sandboxing static
 executables that way will not work.

Less obviously and more dangerously it will not work for syscalls done
from libc (cf. rpc code in rump nfsd).  Maybe it's possible to link
libc.so so that the linker doesn't resolve unresolved symbols at that
stage, but I haven't investigated that path.

[i didn't read this thread, at least not yet, so apologies if that was
mentioned earlier]


Re: [ANN] Lunatik -- NetBSD kernel scripting with Lua (GSoC project results)

2010-10-19 Thread Antti Kantee
On Tue Oct 05 2010 at 18:24:48 -0300, Lourival Vieira Neto wrote:
 Hi folks,
 
 I'm glad to announce the results of my GSoC project this year [1].
 We've created the support for scripting the NetBSD kernel with Lua,
 which we called Lunatik and it is composed by a port of the Lua
 interpreter to the kernel, a kernel programming interface for
 extending subsystems and a user-space interface for loading user
 scripts into the kernel. You can see more details on [2]. I am
 currently working on the improvement of its implementation, on the
 documentation and on the integration between Lunatik and other
 subsystems, such as npf(9), to provide a real usage scenario.

Cool.

I'm looking forward to seeing your evaluation of real usage scenarios.
If you can find some existing policy code written in C and convert it to
lua, it would make a strong case.  The main metric I'm interested in is
convenience, and performance to some degree depending on what kind of
places your plan to put lua scripts in.  At least in the packet filter
use case the performance is quite critical.

I don't know how well the fibonacci example performs (and the performance
is not very critical there), but I'm sure you'll agree that from the
convenience pov it is a very strong case _against_ lua ;)
(yes, I realize it's not provided for demonstrating convenience)

  - antti


Re: [ANN] Lunatik -- NetBSD kernel scripting with Lua (GSoC project

2010-10-19 Thread Antti Kantee
On Tue Oct 12 2010 at 02:17:35 -0300, Lourival Vieira Neto wrote:
 On Tue, Oct 12, 2010 at 1:50 AM, David Holland dholland-t...@netbsd.org 
 wrote:
  On Tue, Oct 12, 2010 at 12:53:10AM -0300, Lourival Vieira Neto wrote:
   A signature only tells you whose neck to wring when the script
   misbehaves. :-) Since a Lua script running in the kernel won't be
   able to forge a pointer (right?), or conjure references to methods 
  or
   data that weren't in its environment at the outset, you can run it
   in a highly restricted environment so that many kinds of 
  misbehavior
   are difficult or impossible.  Or I would *think* you can restrict 
  the
   environment in that way; I wonder what Lourival thinks about that.
     
      I wouldn't say better =). That's exactly how I'm thinking about
      address this issue: restricting access to each Lua environment. For
      example, a script running in packet filtering should have access to a
      different set of kernel functions than a script running in process
      scheduling.
    
     ...so what do you do if the script calls a bunch of kernel functions
     and then crashes?
   
    if a script crashes, it raises an exception that can be caught by the
    kernel (as an error code)..
 
  Right... so how do you restore the kernel to a valid state?
 
 Why wouldn't it be a valid state after a script crash? I didn't get
 that. Can you exemplify it?

I *guess* what David means is that to perform decisions you need a
certain level of atomicity.  For example, just drawing something out of
a hat, if you want to decide which thread to schedule next, you need to
make sure the selected thread object exists over fetching the candidate
list and the actual scheduling.  For this you use a lock or a reference
counter or whatever.  So if your lua script crashes between fetching the
candidates and doing the actual scheduling, you need some way of releasing
the lock or decrementing the refcounter.  While you can of course push an
error branch stack into lua or write the interfaces to follow a strict
model where you commit state changes only at the last possible moment,
it is additional work and probably quite error-prone.
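
To make the shape of that extra work concrete, a hypothetical hook would
have to unwind on every path, script crash included (the lock and the
commit_choice() helper here are made up; only the Lua C API and the
mutex calls are real interfaces):

#include <sys/mutex.h>
#include <sys/errno.h>
#include <lua.h>			/* include path depends on the embedding */

extern kmutex_t sched_lock;		/* hypothetical lock over the candidates */
int commit_choice(lua_State *);		/* hypothetical: apply the decision */

int
scripted_decision(lua_State *L)
{
	int error;

	mutex_enter(&sched_lock);		/* keep the candidate set stable */
	lua_getglobal(L, "pick_next");		/* the user's script function */
	if (lua_pcall(L, 0, 1, 0) != 0) {	/* the script crashed ... */
		lua_pop(L, 1);			/* ... drop its error message */
		mutex_exit(&sched_lock);	/* ... and unwind by hand */
		return EFAULT;
	}
	error = commit_choice(L);		/* commit only at the last moment */
	mutex_exit(&sched_lock);
	return error;
}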

Although, on the non-academic side of things, if your thread scheduler
crashes, you're kinda screwed anyway.


Re: something really screwed up with mmap+ffs on 5.0_STABLE

2010-09-01 Thread Antti Kantee
On Wed Sep 01 2010 at 15:23:42 +0200, Thomas Klausner wrote:
 On Tue, Aug 17, 2010 at 11:52:31PM +0300, Antti Kantee wrote:
  It would be great if someone could confirm or debunk this on -current
  and for archs beyond i386.  Just get the latest sources, go to
  sys/rump/net/lib/libshmif, comment out line 61 (the one with PREFAULT_RW)
  from if_shmif.c, make && make install, and run tests/net/icmp/t_ping
  floodping in a loop.  You should see a coredump within a few thousand
  iterations (few minutes) if the problem is there.
 
 I think you mean if_shmem.c.

Something like that with a fuzzy match.

 I just tested this on 5.99.39/amd64.
 # ./t_ping floodping
 got 0/1
 passed
 # while true; do ./t_ping floodping; done
 panic: kernel diagnostic assertion busmem->shm_magic == SHMIF_MAGIC failed: 
 file if_shmem.c, line 287
 panic: kernel diagnostic assertion busmem->shm_magic == SHMIF_MAGIC failed: 
 file if_shmem.c, line 287

Thanks.  This has been analyzed and fixed by chuq already.  He said he
just needs more time to verify the fix is correct.  It's essentially
kern/40389, but it turns out the hack I wrote to get 5.0 out wasn't
quite complete.  What goes around comes around ...


Re: [RFC] perfuse permission checks

2010-08-28 Thread Antti Kantee
[resending with tech-kern included]

On Sat Aug 28 2010 at 05:52:54 +0200, Emmanuel Dreyfus wrote:
 Hello
 
 I just commited code in libperfuse to check permissions on various file
 operations. On each operation comes weird questions such as do I need
 r-x on the parent directory, or just --x?. While I attempted to
 experiment around it, I am pretty sure there are bugs left behind. 
 
 Anyone can review the code? It is in src/lib/libperfuse/ops.c
 (search for calls to no_access)

Usually it's enough and easier to perform access checks from lookup
(plus setattr).

I don't know what gluesterfs does, though, and what kind of races are
present due to the distributed nature if you just check access in lookup.


Re: 16 year old bug

2010-08-23 Thread Antti Kantee
On Mon Aug 23 2010 at 13:53:40 +0200, Christoph Egger wrote:
 
 ... has been found by OpenBSD:
 
 Their commit message:
 
 Fix a 16 year old bug in the sorting routine for non-contiguous netmasks.
 For masks of identical length rn_lexobetter() did not stop on the
 first non-equal byte. This leads rn_addroute() to not detecting
 duplicate entries and thus we might create a very long list of masks
 to check for each node.
 This can have a huge impact on IPsec performance, where non-contiguous
 masks are used for the flow lookup.  In a setup with 1300 flows we
 saw 400 duplicate masks and only a third of the expected throughput.
 
 
 The patch is attached. Any comments?

The test for this is missing.


Re: something really screwed up with mmap+ffs on 5.0_STABLE

2010-08-19 Thread Antti Kantee
[whoops, resending with tech-kern cc'd]

On Thu Aug 19 2010 at 11:17:55 +0100, Patrick Welche wrote:
 On Tue, Aug 17, 2010 at 11:52:31PM +0300, Antti Kantee wrote:
  On Tue Aug 17 2010 at 19:06:38 +0300, Antti Kantee wrote:
  It would be great if someone could confirm or debunk this on -current
  and for archs beyond i386.  Just get the latest sources, go to
  sys/rump/net/lib/libshmif, comment out line 61 (the one with PREFAULT_RW)
  from if_shmif.c, make && make install, and run tests/net/icmp/t_ping
  floodping in a loop.  You should see a coredump within a few thousand
  iterations (few minutes) if the problem is there.
 
 Sure enough:

Cool, thanks for confirming.

 826:arp info overwritten for 1.1.1.10 by b2:a0:61:b4:bc:6f

I forgot to mention to remove the busfile in between runs.  Otherwise
tests will pick up on traffic from an old test run.  But this doesn't
affect the result we're after, just creates noise.

 panic: kernel diagnostic assertion busmem->shm_magic == SHMIF_MAGIC failed: 
 fi
 le /sys/rump/net/lib/libshmif/if_shmem.c, line 285
 Abort trap (core dumped)
 while ( 1 )
 827:panic: kernel diagnostic assertion sp.sp_len < BUSMEM_DATASIZE failed: 
 fil
 e /usr/src/sys/rump/net/lib/libshmif/shmif_busops.c, line 135
 Abort trap (core dumped)
 while ( 1 )
 828:panic: kernel diagnostic assertion busmem->shm_magic == SHMIF_MAGIC 
 failed
 : file /sys/rump/net/lib/libshmif/if_shmem.c, line 388
 
 though again on i386.

5.99.latest?


something really screwed up with mmap+ffs on 5.0_STABLE

2010-08-17 Thread Antti Kantee
I've been looking at some quite weird behaviour with mmapped files on ffs.
I want to concentrate on something else for a while, so here's a brain
dump of what I've been struggling with recently, in case it rings a bell
for someone or they even know the solution.


Background:

The shmif rump driver provides a networking backend using the old
mmap-a-file-to-get-a-handle trick.


Observations:

Most of the time the problem is that the first 16k of the bus file gets
corrupted.  The underlying fs blocksize is 32k.  I have verified that:

a) it does not get written to by the involved processes per ktrace -i
b) processes do not overwrite random memory by having a
   PROT_NONE red zone in front

This problem does not happen on tmpfs.  I don't believe there is a timing
issue because I've run the test tens of thousands of times with varying
background load.

Zero-filling the bus file with write() instead of creating a sparse file with
truncate doesn't make much of a difference either.  I was almost sure
it was a problem with the genfs sawhole code, but nope.

Usually after the bus has seen one generation (i.e. the pages have been
faulted in to all processes) there are no further problems.  However,
causing (read) faults from a 3rd party process not involved with the
test may trigger the problem.


The really spooky stuff:

Seems like it's possible to get two views into the same file depending
on read/write or mmap access (whatever happened to mr. ubc???).
Can someone explain this:

 ./dumpbus-mmio -h thank-you-driver-for-getting-me-here
bus version 2, lock: 0, generation: 431, firstoff: 0x5a95a, lastoff: 0x5a8ea
 ./dumpbus-read -h thank-you-driver-for-getting-me-here
dumpbus-read: thank-you-driver-for-getting-me-here not a shmif bus

i.e. same file, but magic number doesn't match when not using mmap.
hexdump uses read() (per ktrace), so I get the garbage version of the
file with it and can confirm it indeed has garbage in it.

The only difference between the two programs is this:
#if 1
read(fd, buf, BUFSIZE);
bmem = (void *)buf;
#else
busmem = mmap(NULL, sb.st_size, PROT_READ, MAP_FILE|MAP_SHARED, fd, 0);
if (busmem == MAP_FAILED)
err(1, "mmap");
bmem = busmem;
#endif

However, I can restore the old version using cp (since it uses mmio):

 ./dumpbus-read -h thank-you-driver-for-getting-me-here 
dumpbus-read: thank-you-driver-for-getting-me-here not a shmif bus
 cp thank-you-driver-for-getting-me-here backup
 ./dumpbus-read -h backup
bus version 2, lock: 0, generation: 431, firstoff: 0x5a95a, lastoff: 0x5a8ea


How-to-repeat:

Get tests/net/icmp from -current and run ./t_ping floodping in a loop
from ffs.  You should see the problem within a few thousand iterations.
Most likely the shmif code will encounter an invariant failure, such as:
panic: kernel diagnostic assertion busmem->shm_magic == SHMIF_MAGIC failed: 
file if_shmem.c, line 391


I plan to update to latest -STABLE soon and see if the problem is still
present there.  Guess I'll reboot now...


Re: Using coccinelle for (quick?) syntax fixing

2010-08-09 Thread Antti Kantee
On Mon Aug 09 2010 at 11:20:29 +0200, Jean-Yves Migeon wrote:
 It is 'error-prone, in the sense that it can raise false positives. But
 when you get more familiar with it, you can either fix the cocci patch
 (easy for __arraycount, I missed one of the cases... less obvious for
 aprint stuff), and proof read the generated patch.

I really dislike untested wide-angle churn, especially if there is 0
measurable gain.  Converting code to __arraycount is a prime example.
The only benefit of __arraycount is avoiding typing and therefore typos.
Neither of those apply when doing a churn.
(there are subjective beauty values, but every C programmer knows
the sizeof/sizeof idiom, which is more than what can be said about
__arraycount)

Examples of measurable benefit are good.  Encouraging churn is less good,
even if spatch-churn is a million times better than sed-churn.

 I used these examples to get familiar with it; it starts getting useful
 when you try to find out buggy code, like double free() in the same
 function, mutex_exit() missing in a branch before returning, etc.

Static analysis is good.  However, it might take quite a bit of effort
to get the rules general enough so that they trigger in more than one
file and specific enough so that you don't get too many false positives.
Just to give an example, the ffs allocator routines don't release the
lock in an error branch.
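
To make that concrete, the shape such a rule would have to match is
roughly the following (illustrative C, not the actual ffs code):

#include <sys/mutex.h>
#include <sys/errno.h>

struct bucket {
	kmutex_t b_lock;
	int b_nfree;
};

int
bucket_alloc(struct bucket *b, int *slotp)
{

	mutex_enter(&b->b_lock);
	if (b->b_nfree == 0)
		return ENOSPC;		/* bug: returns with b_lock held */
	*slotp = --b->b_nfree;
	mutex_exit(&b->b_lock);
	return 0;
}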

I remember coccinelle had problems with cpp.  Any code which uses macros
to skip C syntax will fail silently.  procfs comes to mind here.  Also,
I remember it using so much memory when given our kernel source that I
could not finish a rototill and had to use it in combo with find/grep.

That said, if $someone can produce a set of rules which showably find
bugs in NetBSD code and do not produce a lot of false positives, I'm
very interested in seeing nightly runs.

... especially if there are no TAILQ false positives ;)


Re: fd code multithreaded race?

2010-08-04 Thread Antti Kantee
On Wed Aug 04 2010 at 13:21:07 +, Andrew Doran wrote:
 On Sat, Jul 31, 2010 at 08:31:19PM +0300, Antti Kantee wrote:
  Hi,
  
  I'm looking at a KASSERT which is triggering quite rarely for me (in
  terms of iterations):
  
  panic: kernel diagnostic assertion dt->dt_ff[i]->ff_refcnt == 0 failed: 
  file 
  /usr/allsrc/src/sys/rump/librump/rumpkern/../../../kern/kern_descrip.c, 
  line 856
  
  Upon closer examination, it seems that this can trigger while another
  thread is in fd_getfile() between upping the refcount, testing for
  ff_file, and fd_putfile().  Removing the KASSERT seems to restore correct
 
 You're right there, the KASSERT() is wrong, it should be removed.

Thanks, I'll do that.

  operation, but I didn't read the code far enough to see where the race
  is actually handled and what stops the code from using the wrong file.
 
 FYI the fdfile_t (per-descriptor records) are stable for the lifetime of the
 process, what each record describes can and does of course change, and how
 those records are pointed to does change (fdtab_t).
  
 There isn't really a concept of wrong file, as in, the app gets
 what it asked for.  It is free to ask for the wrong thing, and it's free
 to ask for the right thing at the wrong time, etc - that's its problem.
 
 Unless you're alluding to another bug?

Not really.  I just started thinking about how applications can make
sure they use the right file descriptor.  It seems using close() to
notify other threads of a file descriptor being closed is racy.

So something naive like this:

t1: lock
t1: get fd1
t1: unlock
/* t1 wants to do a syscall with fd1 but is preempted */
t2: lock
t2: close fd1
t2: unlock
t3: lock
t3: open, result fd1
t3: unlock
t1: syscall fd1 ...

will give you the wrong result.  Essentially there is no interlock from
the application lookup to the kernel backing object lookup.

So I guess if you want things to work correctly, instead of close()
you need to dup2() to a zombie/deadfs fd and wait for all threads to
check in before you can close it.  (i assume dup2 is atomic)
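
As a sketch of that approach (the fd table locking is the application's
own; only open() and dup2() are assumed from the system, and the helper
names are made up):

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t fdtab_lock = PTHREAD_MUTEX_INITIALIZER;
static int placeholder_fd = -1;		/* opened once, e.g. /dev/null */

/*
 * Point fd at a harmless placeholder instead of close()ing it; the
 * number cannot be recycled into somebody else's freshly opened file
 * until we actually close it after all threads have checked in.
 */
int
retire_fd(int fd)
{

	pthread_mutex_lock(&fdtab_lock);
	if (placeholder_fd == -1)
		placeholder_fd = open("/dev/null", O_RDWR);
	dup2(placeholder_fd, fd);	/* atomic replacement */
	pthread_mutex_unlock(&fdtab_lock);
	/* ... wait for all threads to check in, then: close(fd); */
	return 0;
}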

Never realized file descriptors and threads were so tricky ;)


Re: Modules loading modules?

2010-08-02 Thread Antti Kantee
On Mon Aug 02 2010 at 16:30:03 +1000, matthew green wrote:
 this is an incomplete reading of the manual page, and you can not
 use mutex_owned() the way you are trying to (regardless of what
 pooka posted.) you can't even using it in various forms of assertions
 safely.  from the man page:
 
   It should not be used to make locking decisions at run time, or to
   verify that a lock is not held.

That's the mantra, yes.

 ie, you can not even KASSERT(!mutex_owned()).

Strictly speaking you can in a case where you have two locks which are
always taken as l1,l2 and released as l2,l1, provided you're not dealing
with a spin mutex.  Does it make any sense?  no (l2 is useless).
Is it possible?  yes.
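
A sketch of that (admittedly pointless) arrangement, with mutex_init()
calls omitted:

#include <sys/mutex.h>
#include <sys/systm.h>

static kmutex_t l1, l2;		/* adaptive; always taken l1 then l2 */

static void
ordered_op(void)
{

	mutex_enter(&l1);
	/*
	 * For an adaptive mutex, mutex_owned() reports whether the
	 * current thread holds it, so asserting that we do not hold
	 * l2 before taking it cannot misfire -- even if, as said,
	 * l2 itself is useless in this arrangement.
	 */
	KASSERT(!mutex_owned(&l2));
	mutex_enter(&l2);
	/* ... */
	mutex_exit(&l2);
	mutex_exit(&l1);
}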

Now, on to sensible stuff.  I'm quite certain that warning was written
to make people avoid writing bad code without understanding locking --
if you need to use mutex_owned() to decide if you should lock a normal
mutex, your code is broken.  The other less likely possibility is that
someone plans to change mutex_owned in the future.

Further data point: the same warning is in rw_held, yet it was used to
implement recursive locking in vlockmgr until its very recent demise.

Ignoring man page mantras and focusing on how the code works, I do not
see anything wrong with Paul's use of mutex_owned().

 but i'm still not sure why we're going to such lengths to hold a
 lock across such a heavy weight operation like loading a module.
 that may involve disk speeds, and so you're looking at waiting
 millions of cycles for the lock.  aka, what eeh posted.

It's held for millions of cycles already now and nobody has pointed out
measurable problems.  But if it is deemed necessary, you can certainly
hide a cv underneath.  The efficiency requirements for the module lock
are probably anyway in the who cares wins ... not spectrum.  At least
I'm not aware of any fastpath wanting to use it.

Anyway, no real opinion there.  A cv most likely is the safe, no-brainer
choice.


Re: fd code multithreaded race?

2010-08-02 Thread Antti Kantee
On Sat Jul 31 2010 at 20:31:19 +0300, Antti Kantee wrote:
 Hi,
 
 I'm looking at a KASSERT which is triggering quite rarely for me (in
 terms of iterations):
 
 panic: kernel diagnostic assertion dt->dt_ff[i]->ff_refcnt == 0 failed: 
 file 
 /usr/allsrc/src/sys/rump/librump/rumpkern/../../../kern/kern_descrip.c, 
 line 856
 
 Upon closer examination, it seems that this can trigger while another
 thread is in fd_getfile() between upping the refcount, testing for
 ff_file, and fd_putfile().  Removing the KASSERT seems to restore correct
 operation, but I didn't read the code far enough to see where the race
 is actually handled and what stops the code from using the wrong file.
 
 How-to-repeat:
 Run tests/fs/puffs/t_fuzz mountfuzz7 in a loop.  A multiprocessor kernel
 might produce a more reliable result, so set RUMP_NCPU unless you have
 a multiprocessor host.  Depending on timings and how the get/put thread
 runs, you might even see the refcount as 0 in the core.
 
 Does anyone see something wrong with the analysis?  If not, I'll create
 a dedicated test and file a PR.

kern/43694, tests/kernel/t_filedesc


Re: Modules loading modules?

2010-08-02 Thread Antti Kantee
On Tue Aug 03 2010 at 02:17:43 +1000, matthew green wrote:
  Now, on to sensible stuff.  I'm quite certain that warning was written
  to make people avoid writing bad code without understanding locking --
  if you need to used mutex_owned() to decide if you should lock a normal
  mutex, your code is broken.  The other less likely possibility is that
  someone plans to change mutex_owned in the future.
  
  Further data point: the same warning is in rw_held, yet it was used to
  implement recursive locking in vlockmgr until its very recent demise.
  
  Ignoring man page mantras and focusing on how the code works, I do not
  see anything wrong with Paul's use of mutex_owned().
 
 this just does not match my actual experience in the kernel.  i had
 weird pmap-style problems and asserts firing wrongly while i did not
 obey the rules in the manual directly.

Not knowing more details it's difficult to comment.  But since you are
talking about the pmap, maybe your experiences are with spin mutexes
instead of adaptive ones?


Re: Modules loading modules?

2010-08-01 Thread Antti Kantee
On Sat Jul 31 2010 at 15:48:26 -0700, Paul Goyette wrote:
 If modload-from-modcmd is found necessary, sounds more like a case for
 the infamous recursive lock.
 
 Recursive lock is the way to go.  I think the same lock should also cover
 all device configuration activites (i.e. autoconf) and any other
 heavy lifting where we have chunks of the system coming and going.
 
 Well, folks, here is a first pass recursive locks!  The attached diffs 
 are against -current as of a few minutes ago.

Oh, heh, I thought we have recursive lock support.  But with that gone
from the vfs locks, I guess not (apart from the kernel lock ;).

I'm not sure if it's a good idea to change the size of kmutex_t.  I guess
plenty of data structures have carefully been adjusted by hand to its
size and I don't know of any automatic way to recalculate that stuff.

Even if not, since this is the only user and we probably won't have
that many of them even in the future, why not just define a new type
``rmutex'' which contains a kmutex, an owner and the counter?  It feels
wrong to punish all the normal kmutex users for just one use.  It'll also
make the implementation a lot simpler to test, since it's purely MI.

separate normal case and worst case


Re: Modules loading modules?

2010-08-01 Thread Antti Kantee
On Sun Aug 01 2010 at 06:10:07 -0700, Paul Goyette wrote:
 One question:  Since an adaptive kmutex_t already includes an owner 
 field, would we really need to have another copy of it in the rmutex_t 
 structure?

Good point.  I think it's ok to do:

	if (mutex_owned(mtx))
		cnt++
	else
		lock
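
Fleshed out a bit, such a separate type could look roughly like the
following (names invented here; assumes adaptive mutexes, so that
mutex_owned() reports ownership by the current thread):

#include <sys/mutex.h>

typedef struct {
	kmutex_t rm_mtx;
	unsigned int rm_recurse;	/* re-entries by the current owner */
} rmutex_t;

void
rmutex_enter(rmutex_t *rm)
{

	if (mutex_owned(&rm->rm_mtx))	/* we hold it already */
		rm->rm_recurse++;
	else
		mutex_enter(&rm->rm_mtx);
}

void
rmutex_exit(rmutex_t *rm)
{

	if (rm->rm_recurse > 0)
		rm->rm_recurse--;
	else
		mutex_exit(&rm->rm_mtx);
}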


fd code multithreaded race?

2010-07-31 Thread Antti Kantee
Hi,

I'm looking at a KASSERT which is triggering quite rarely for me (in
terms of iterations):

panic: kernel diagnostic assertion dt->dt_ff[i]->ff_refcnt == 0 failed: file 
/usr/allsrc/src/sys/rump/librump/rumpkern/../../../kern/kern_descrip.c, line 
856

Upon closer examination, it seems that this can trigger while another
thread is in fd_getfile() between upping the refcount, testing for
ff_file, and fd_putfile().  Removing the KASSERT seems to restore correct
operation, but I didn't read the code far enough to see where the race
is actually handled and what stops the code from using the wrong file.

How-to-repeat:
Run tests/fs/puffs/t_fuzz mountfuzz7 in a loop.  A multiprocessor kernel
might produce a more reliable result, so set RUMP_NCPU unless you have
a multiprocessor host.  Depending on timings and how the get/put thread
runs, you might even see the refcount as 0 in the core.

Does anyone see something wrong with the analysis?  If not, I'll create
a dedicated test and file a PR.


Re: Modules loading modules?

2010-07-26 Thread Antti Kantee
On Sun Jul 25 2010 at 15:17:29 -0700, Paul Goyette wrote:
 On Mon, 26 Jul 2010, matthew green wrote:
 
 
 it seems to me the root problem is that module_mutex is held while
 calling into the module startup routines.
 
 thus, the right solution is to remove this requirement.
 
 Yes, that's what is needed.

I'm far from convinced that's a good idea.  First, it will probably
make the module code a nightmare -- what happens when you have multiple
interleaved loads, some of which fail at some point in their dependency
stack, and let's just throw in a manual modunload to mix up things
further.  Second, and pretty much related to number one, it goes against
one of the most fundamental principles of robust code: atomic actions.

If modload-from-modcmd is found necessary, sounds more like a case for
the infamous recursive lock.

(no comment on the actual problem)


Re: power management and pseudo-devices

2010-07-19 Thread Antti Kantee
On Mon Jul 19 2010 at 01:28:42 +, Quentin Garnier wrote:
 On Sun, Jul 18, 2010 at 05:31:42PM -0700, Paul Goyette wrote:
  On Mon, 19 Jul 2010, Quentin Garnier wrote:
  
  Include ioconf.h I think.
  
  Tried that.  It works for compiling the kernel.  Unfortunately,
  swwdog is included in rump, and there doesn't seem to be an ioconf.h
  available for the librump build.
  
  Well, whatever.  I don't think I want to look at that.
  
  :)
 
 Actually, I still think this is wrong.  It might make librump compile
 and even link, that doesn't mean it will be usable if nothing ever
 creates that extern symbol.  You'll have to check with pooka or explore
 the code, but there have to be some components of rump that have partial
 configuration files.  After all, he created the ioconf directive for
 that purpose.

First of all, let's take a step up from the trenches and try to understand
the problem we're dealing with instead of trying to arbitrarily guess how
to fix the build.  We have two ways of building kernel code: monolithic
(./build.sh kernel=CONF) and modular (kernel modules, rump components).
The latter ones currently always do config stuff on-demand, so changes
cause breakage.  Should a swwdog kernel module exist (why isn't there
one?), it would run into the same problem.

Now, the ioconf keyword for config(1) is meant to help build modular
kernel code by allowing to specify a partial config file.  Currently,
as the name implies, it takes care of creating ioconf.[hc], namely in
this case struct cfdriver swwdog_cd.

Adding a SYSMON.ioconf will solve the problem:
=== snip ===
ioconf sysmon

include conf/files

pseudo-device swwdog
=== snip ===

The good news is that if some day a sysmon kernel module is added,
the exact same ioconf can be used without having to once again run
into trouble.

Then let's view the broader scale.  I think acpibat is currently the
only kernel module using ioconf, and I haven't bothered converting others
since I have realized that the scope of ioconf was a little too narrow.
I'm planning to change the keyword to module and add support for source
files.  This way the default build for every modular kernel component
goes through config, and we can avoid issues due to config changes.
IIRC I have the config(1) part for this done already, but being able to
use an autogenerated SRCS list requires some bsd.subdir.mk style advanced
make hackery which I haven't been in the mood for.  Plus, there's of
course the mess with "file some.c foo | bar & (baz ^ xyzzy) & !!!frobnitz".

Finally, on the eternal "someone should" astral plane, someone should
fix the kernel build to consist of building a set of modules and linking
them together, so we don't have more than one way to skin a kernel.


Re: CVS commit: src/tests/net/icmp

2010-07-12 Thread Antti Kantee
On Mon Jul 12 2010 at 01:51:54 +0200, Jean-Yves Migeon wrote:
  Anyway, the solution as usual is to work the problem from both ends
  (improve the server methods and the kernel drivers) and perform a
  meet-in-the-middle attack at the sweet spot where nothing is lost and
  everything is gained.  The cool thing about working on NetBSD is that
  we can actually do these things properly instead of bolting some hacks
  on top of a black-magic-box we're not allowed to touch.
 
  Although I'm not familiar with the Xen hypercall interface, I assume it
  to be infinitely more well-defined than unix process-kernel interaction
  with no funny bits like fiddling about here and there just because the
  kernel can get away with it.
 
 Yes; however, note that the Xen hypercalls are not expected to be as
 feature-rich as a POSIX process  kernel interface. It is vastly
 simpler, but it is also poorer (the complexity is left as an exercise to
 the tasks above it).
 
 Anyway, you will face the exact same issue as yours with puffs and pud.
 The Xen hypercalls are close to x86 semantics; at this layer, you have
 lost most of the higher level semantic.

It's not the same thing.  Correct me if I misunderstood, but I thought
you wanted to port/adjust/whatever rumpuser to the xen hypercall interface.
Since the xen hypervisor interface, as opposed to a posix process
environment, is designed for hosting an OS, you can do lowlevel ops
as you'd expect to do them instead of having to think about high-level
semantic meaning.

... well, at least in theory, since we can't measure it yet.  Plus I'm
not ultrafamiliar with Xen (read: not familiar at all), so there might
be issues I don't foresee.  And there always are.  But business as usual:
only one way to find out ;)


Re: CVS commit: src/tests/net/icmp

2010-07-11 Thread Antti Kantee
On Sat Jul 10 2010 at 12:30:07 +0200, Adam Hamsik wrote:
 
  
  
  8) Is it possible to run rump_exec in rump ? e.g. to boot rump kernel and 
  start init by it ?
  
  What are you trying to accomplish?  Generally no, in a special case yes.
  
  I've been doing a little work in that area and I have a syscall server
  which can support a process's basic syscall requests in a rump kernel.
  But that's a very boring approach, since it requires host kernel support.
  I think the process should know where it wants its requests to be
  serviced.
 
 I thought about something like userspace virtualization(zones, jails) based 
 on top of rump.

Given that the idea of jails/zones is to limit a userspace process,
doing this in a userspace process is not the obvious route.  It probably
could be done with a software-isolated process, but we are desperately
not there with our toolchain.  Another choice would be to port rumpuser
on top of the Xen hypervisor interface, like jym recently envisioned.

Even so, rump is about virtualizing the kernel, not the user interface
layer.  Given that jails/zones is a well-understood technology with at
least some sort of NetBSD implementation already done, why not go the
obvious route and finish that off?


Re: CVS commit: src/tests/net/icmp

2010-07-11 Thread Antti Kantee
On Sun Jul 11 2010 at 16:49:59 +0200, Jean-Yves Migeon wrote:
 On 11.07.2010 15:00, Antti Kantee wrote:
  On Sat Jul 10 2010 at 12:30:07 +0200, Adam Hamsik wrote:
  Given that the idea of jails/zones is to limit a userspace process,
  doing this in a userspace process is not the obvious route.  It probably
  could be done with a software-isolated process, but we are desperately
  not there with our toolchain.  Another choice would be to port rumpuser
  on top of the Xen hypervisor interface, like jym recently envisioned.
 
 Let me get a bit more precise here :) the purpose is not to offer
 container-like virtualization, but rather to have a finer grained
 approach, close to microkernels, with small processes/tasks that perform
 a specific functionality. What I would like to do is to get rid of the
 big dom0 uber-privileged domain that you encounter in hypervisor-based
 virtualization, by having smaller, isolated domains that perform
 specific tasks (one for block device access, another for network, device
 driver, so on). Without requiring to integrate
 yet_another_monolithic_yet_modular_linux_kernel in.

I didn't mean to say you suggested to offer virtualization containers.
Sorry.  I merely intended to say you had the desire of using the Xen
hypercall interface.  Although I must say now I understand more clearly
why you wanted to do that.

 Frankly, I have no idea how this would perform; basically, dom0 can be
 considered as one big uber-privileged domain, which is as critical as
 the hypervisor itself; if it crashes, or gets compromised, the system is
 entirely crippled. Purpose is to avoid a contamination of the whole dom0
 context if only one of its part is buggy, and one requirement is to get
 it as small as possible.

"perform"?  Are you using that term for execution speed, or was it
accidentally bundled with the rest of the paragraph?

  Even so, rump is about virtualizing the kernel, not the user interface
  layer.  Given that jails/zones is a well-understood technology with at
  least some sort of NetBSD implementation already done, why not go the
  obvious route and finish that off?
 
 I think he was referring to using a rump kernel as a syscall proxy
 server rather than having in-kernel virtualization like jails/zones.
 
 That would make sense, you already have proxy-like feature with rump.

I'm not so sure.  That would require a lot of kernel help to make
everything work correctly.  The first example is mmap: you run into it
pretty fast when you start work on a syscall server ;)

That's not to say there is not synergy.  For example, a jail networking
stack virtualized this way would avoid having to go over all the code, and
reboot would be as simple as kill $serverpid.  Plus, more obviously,
it would not require every jail to share the same code, i.e. you can
have text optimized in various ways for various applications.


Re: CVS commit: src/tests/net/icmp

2010-07-09 Thread Antti Kantee
On Fri Jul 09 2010 at 18:00:05 +0200, Adam Hamsik wrote:
 Let me add some of my questions about rump :)
 
 6) How are device nodes managed inside rump when e.g. /dev/mapper/control 
 created by libdevmaper rump lib.

just as expected ... (?)

Can you elaborate the question?

 7) Does RUMP support multiprocessor setup ? e.g. Can I boot rump kernel in 
 SMP mode and do I need SMP machine for that ?

Yes, on i386 and amd64.  Others would be ~trivially possible too (even
ones where the host does not support SMP), but I haven't bothered to go
into battle with some arch-specific headers and macros.  Probably would
be a few hours of tweaking to get all archs working.

By default the number of virtual CPUs configured into a rump kernel is
the same as the number of CPUs present on the host.  However, you are
free to pick anything from 1 to MAXCPUS.  As I've noted before, unicpu
on an SMP host is cool because you can optimize bus locking away from
kernel work which can be isolated.  This can provide a performance boost
of tens of percent.

The other way (i.e. SMP rump kernel on a unicpu host) is used by e.g.
tests/fs/tmpfs/t_renamerace:renamerace2.  The default qemu setup used
by anita is unicpu, and the race it is trying to trigger did not happen
with only one virtual CPU, so upping the rump configuration to have more
CPUs was the ticket.  Yes, you can specify an arbitrary number of CPUs
to qemu, but that tends to slow down execution quite dramatically (as
in several times slower).  In contrast, with rump there is no slowdown
(apart from all virtual CPUs having to take clock interrupts, which is
negligible unless you run at an insane HZ).

 8) Is it possible to run rump_exec in rump ? e.g. to boot rump kernel and 
 start init by it ?

What are you trying to accomplish?  Generally no, in a special case yes.

I've been doing a little work in that area and I have a syscall server
which can support a process's basic syscall requests in a rump kernel.
But that's a very boring approach, since it requires host kernel support.
I think the process should know where it wants its requests to be
serviced.


Re: CVS commit: src/tests/net/icmp

2010-07-08 Thread Antti Kantee
On Thu Jul 08 2010 at 23:22:44 +0200, Thomas Klausner wrote:
 [redirected from source-changes-d to a hopefully more suitable mailing
 list]
 
 On Mon, Jul 05, 2010 at 12:26:17AM +0300, Antti Kantee wrote:
  I'm happy to give a more detailed explanation on how it works, but I need
  one or two questions to determine the place where I should start from.
  I'm planning a short article on the unique advantages of rump in kernel
  testing (four advantages by my counts so far), and some questions now
  might even help me write that one about what people want to read instead
  of what I guess they'd want to read.
 
 I looked at the tests some more (tmpfs race, and the interface one
 from above). I think I can read them, but am unclear on some of the
 basic properties of a rump kernel.

Hi, good questions.

 For example:
 1. Where is '/'? Does it have any relation to the host systems '/'? Is
 it completely virtual in the memory of the rump kernel?

From a practical perspective, it's in the same place as '/' on e.g.
a qemu instance or xen domu: somewhere.  By default it's in memory,
but you can mount any file system as '/' over rumpfs (default rootfs).

Of course this is partially a trick question, since a rump kernel does
not necessarily have a '/' at all.  Running a configuration without file
systems at all can save quite a bit of memory, and can be the difference
between 50k and 100k nodes in a virtual network (I've only tested up
to a few hundred nodes on my scrawny laptop, but I've done calculations
... I'm sure you can appreciate calculations ;).  In that case any rump
system calls attempting to use VFS will fail with ENOSYS.

 2. Do I understand correctly that for e.g. copying a file from the
 host file system into a rump kernel file system, I would use read and
 rump_sys_write?

Well, yes and no.  It depends on which namespace you are making the
calls from.  If you are in the host namespace (i.e. not inside the rump
kernel), you can do that.  The paths given to rump_sys_open() are ones
relative to the rump kernel '/' (or whatever you've chrooted to inside
the rump kernel), and then you just use the file descriptor as usual.
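
For example, something along these lines (a sketch with error checking
omitted; it assumes only rump_init() and the usual rump_sys_*() syscall
mirrors):

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <rump/rump.h>
#include <rump/rump_syscalls.h>

void
copyin_file(const char *hostpath, const char *rumppath)
{
	char buf[8192];
	ssize_t n;
	int hostfd, rumpfd;

	rump_init();				/* boot the rump kernel */
	hostfd = open(hostpath, O_RDONLY);	/* host namespace */
	rumpfd = rump_sys_open(rumppath,	/* rump kernel namespace */
	    O_WRONLY|O_CREAT|O_TRUNC, 0644);
	while ((n = read(hostfd, buf, sizeof(buf))) > 0)
		rump_sys_write(rumpfd, buf, n);
	rump_sys_close(rumpfd);
	close(hostfd);
}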

If you are inside the rump kernel, you can access the host file system
namespace with etfs, extra terrestrial file system, with which you can
establish mappings from the rump kernel namespace to the host namespace.
For example, the rump_foofs utils use this to configure a virtual block
device pointing to the host, so when I type

rump_ffs /home/pooka/ffs.img /mount

even though VFS_MOUNT() operates inside the rump kernel, the device file
for mount is still used from the host (and, etfs can also just report it
as a block device, so you don't need any of the vnconfig nonsense).
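
Roughly like this (a sketch from memory of the etfs interface; see
rump(3) for the authoritative names and signatures):

#include <rump/rump.h>

/* make a host file visible inside the rump kernel namespace */
static void
register_image(void)
{

	rump_init();
	rump_pub_etfs_register("/dev/fsimage",	/* name inside the rump kernel */
	    "/home/pooka/ffs.img",		/* path on the host */
	    RUMP_ETFS_BLK);			/* present it as a block device */
	/* rump_sys_mount() can now be pointed at /dev/fsimage */
}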

 3. Similarly for network interfaces -- open a socket with socket(2) or
 rump_socket(or so) and copy bytes with read/rump_sys_write?

I'm not quite sure what you want to copy from and where.  If you
connect() to a network service inside the rump kernel, you access it
from the host with read/write (or send/recv) just like any other peer.
If you rump_sys_connect(), you use rump_sys_read/rump_sys_write().

I probably should point out that rump has two different networking
configurations: a full networking stack and what I call sockin.
The former is exactly what you'd expect: interface, tcp/ip, sockets and a
unique IP (or other) address.  This can be a hassle sometimes when you
want to use networking from the rump kernel and do not have a separate
IP address or simply just don't have root privileges to configure a
tap interface.  sockin registers at the protocol layer in the kernel
and pretends to be an inet domain.  What it does is just map requests
to the host sockets.  So e.g. PRU_CONNECT does connect() _on the host_.
This is helpful for cases where you need networking (e.g. rump_nfs and
rump_smbfs), but do not want the hassle and administrative boundary of
configuring a separate address.

 4. Could you NFS export the rump kernel file system to the host?
 (Probably better to a second rump kernel...)

Yes.  When I make changes which affect nfs, I test them by running one
rump kernel with the nfs server and one instance of rump_nfs (the latter
using sockin, i.e. effectively the rump kernel nfs exports to the host).
This way I get a two-machine illusion -- naturally, since the nfs client
is quite finicky, I don't want to use mount_nfs for testing on my desktop.

nfs itself presents one of the unsolved issues with rump: the division
between rump kernel and host kernel is done at the syscall level: foo
or rump_sys_foo().  However, for libraries foo is already hardcoded.
This is especially problematic for libc, since even LD_PRELOAD will not
help.  There are a few different things I've been playing around with here,
but I'll try not to detour into verbose explanations of them in this email.
The whole issue is explained here (and generally in the thread):
http://mail-index.netbsd.org/tech-kern/2009/10/16/msg006276.html

Re: Enabling built-in modules earlier in init

2010-06-17 Thread Antti Kantee
On Wed Jun 16 2010 at 15:36:30 -0700, Paul Goyette wrote:
 The attached diffs add one more routine, module_init3() which gets 
 called from init_main() right after module_class_init(MODULE_CLASS_ANY). 
 module_init3() walks the list of builtin modules that have not already 
 been init'd and marks them disabled.
 
 Tested briefly on my home systems and appears to work.
 
 Any objections to committing this?

I'd still hook it to the end of module_class_init(MODULE_CLASS_ANY)
instead of adding more randomly numbered module_initn() calls.
The other benefit from doing so is that you get it done atomically,
which is always worthwhile, and doubly so when it's a low hanging fruit
like here.

 @@ -416,6 +434,7 @@ module_init_class(modclass_t class)
* init.
*/
   if (module_do_builtin(mi->mi_name, NULL) != 0) {
 + mod->mod_disabled = true;
   TAILQ_REMOVE(&module_builtins, mod, mod_chain);
   TAILQ_INSERT_TAIL(&bi_fail, mod, mod_chain);
   }

Why do you mark it as disabled?  Doesn't this conflict with the "it
might succeed in a later module_init_class()" idea you presented earlier?

module_disabled = true/false in multiple places looks a little
error-prone.  Now that struct module is growing more and more members,
maybe we can just have an object allocator which initializes the value and
afterwards the only acceptable mutation for module_disabled is setting
it to true (might make sense to rename the variable to something like
module_virgin and flip the polarity, though).


Re: Enabling built-in modules earlier in init

2010-06-16 Thread Antti Kantee
On Wed Jun 16 2010 at 04:13:54 -0700, Paul Goyette wrote:
 With the current ways of secmodel register, I'd be damn careful to not
 push it around.  The effect is that if it's called 0 times, you have a
 system which allows everything.  So if your suggestion is implemented
 and you're testing a new secmodel which buggily omits register alongside
 another correctly registering secmodel, things will appear to work fine,
 But if in some scenario the buggy one is loaded alone, well ... welcome
 to the wishing well.
 
 I had some concern about this as well, wondering if I would be able to 
 be sure I'd found all the secmodel modules that might exist.

Especially ones which aren't in src!

 Perhaps it would be best to retain MODULE_CLASS_SECMODEL and also add 
 the suggested MODULE_CLASS_EARLY?

That would be my vote.

But, "early" is a little vague.  What if in the future we want
modules which are initialized even earlier?  Will those be called
MODULE_CLASS_EARLIER_THAN_EARLY?  If the class means initialized before
autoconf, why not use that in the name?

 Also, the modclass id is exported to userland and used as an index to
 a table in modstat.  I think I filed a PR about this being suboptimal.
 
 Yeah, I was planning to update modstat(8) as well.

The better choice is to update modctl(2) to pass down the information
as a proplist.  That way even module classes are pluggable and other
information is easy to add if necessary.  I'm secretly hoping someone
will do this before 6.0 ... ;)


Re: Enabling built-in modules earlier in init

2010-06-16 Thread Antti Kantee
On Tue Jun 15 2010 at 17:10:55 -0700, Paul Goyette wrote:
 Currently, built-in kernel modules are not enabled until very late in 
 the system initialization process, right after we create process #1 for 
 init(8).  (As an exception to this, secmodel modules are enabled much 
 earlier.)
 
 Unfortunately, this means that built-in modules are not available for 
 use during much of the initialization process, and in particular they 
 are not available during auto-configuration.  This means that my recent 
 changes to convert PCIVERBOSE, etc. into kernel modules does not work 
 when the modules are built-in to the kernel!
 
 I would like to enable the built-in modules much earlier, at least early 
 enough to have them available during auto-configuration.  The attached 
 patch accomplishes this.  I have briefly tested the patch, and it seems 
 not to have any unwanted side-effects, but I would appreciate feedback 
 from others who may be more familiar with the init sequence.
 
 An alternative, but less desirable approach, would be to create a new 
 class of modules for PCIVERBOSE and friends, and call module_class_int() 
 early on to enable only these few modules.

Actually reading the first email in the thread also ...

I have to admit I haven't been following your work too closely, but
builtin modules are initialized either when all of them are initialized
per class or when their initialization is explicitly requested.  So if
whatever uses PCIVERBOSE requests the load of the PCIVERBOSE module,
it should be initialized and you should be fine (see module_do_load()).

The only "but" is that explicit loads must be accompanied by
MODCTL_LOAD_FORCE.  I wrote it that way because of the security use case:
if you disable a builtin module due to a security hole, you don't want
it to get autoloaded later.  For file system modules you can always use
rm, but for builtins you don't have that luxury.  So if that is actually
what you're choking on, I suggest adding some flag to determine if the
module has ever been loaded and ignore the need for -F if it hasn't.


Re: Enabling built-in modules earlier in init

2010-06-16 Thread Antti Kantee
On Wed Jun 16 2010 at 06:31:59 -0700, Paul Goyette wrote:
 The attached diffs add a new mod_disabled member to the module_t 
 structure, and set the value to false in each place that a new entry is 
 created.  (Since all of the allocations of module_t structures are done 
 with kmem_zalloc() I could probably avoid the explicit setting of the 
 value to false.)
 
 The value is set to true whenever a module is removed from active duty 
 and returned to the module_builtin list.  (I specifically did NOT mark a 
 module disabled if its modcmd(INIT) failed, under the assumption that it 
 might succeed in a later retry.)

Keeping the same security use case in mind, it would be better that after
full module bootstrap (i.e. MODULE_CLASS_ANY) all builtin modules would
be either initialized or disabled.  Otherwise, if we assume that init
may later succeed for whatever reason, an operator who checks that a module
with a security problem is not activated may be surprised to later find
out that the same module has now been autoenabled.


uvm percpu

2010-06-01 Thread Antti Kantee
While reading the uvm page allocator code, I noticed it tries to allocate
from percpu storage before falling back to global storage.  However, even
if allocation from local storage was possible, a global stats counter is
incremented (e.g. uvmexp.cpuhit++).  In my measurements I've observed
this type of cheap statcounting has a huge impact on percpu algorithms,
as you still need to loadstore a globally contended memory address.
Furthermore, uvmexp cache lines are probably more contended than the page
queue, so theoretically you get less than half of the possible benefit.

I don't expect anyone to remember what the benchmark used to justify
the original percpu commit was, but if someone is going to work on it
further, I'm curious as to how much gain the percpu allocator produced
and how much more it would squeeze out if the global counter was left out.

The above example of course applies more generally.  When you're going
all out with the bag of tricks, i++ can be very expensive ...
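
For comparison, keeping such a counter with percpu(9) would look roughly
like this (a sketch; the cpuhit names are made up):

#include <sys/types.h>
#include <sys/percpu.h>

static percpu_t *cpuhit_pc;		/* one uint64_t per cpu */

void
cpuhit_init(void)
{

	cpuhit_pc = percpu_alloc(sizeof(uint64_t));
}

void
cpuhit_bump(void)
{
	uint64_t *ctrp;

	ctrp = percpu_getref(cpuhit_pc);	/* pins us to this cpu */
	(*ctrp)++;				/* touches only local cache lines */
	percpu_putref(cpuhit_pc);
}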


Re: Red-black tree optimisation

2010-05-27 Thread Antti Kantee
Hi,

On Tue May 04 2010 at 18:20:30 +0200, Adam Ciarciński wrote:
 Hello,
 
 Because at one point I studied red-black trees (not as in dendrology,  
 but as data structures), I looked into the implementation that is  
 being used in NetBSD. I have made some drastic optimisations on sys/ 
 sys/tree.h and would like to have the changes imported into NetBSD  
 repository.
 
 I would like someone to take a look at the patch, which is attached to  
 this message, and verify the code. I have also attached a short PDF  
 document, in which I comment on changes made to the implementation of  
 the red-black tree algorithm.
 
 If it's okay, I can commit the changes myself.
 
 I think we all will benefit from faster and smaller code. :)

Can you present numbers to support your claims of drastic optimizations?

I've used tree.h in out-of-NetBSD projects and don't mind this being
committed.  However, I did not review your changes, so I hope you have
made 100% sure there are no regressions.  Remember that usually the only
way to win is not to play at all ;)


Re: Lightweight virtualization - the rump approach

2010-05-18 Thread Antti Kantee
On Tue May 18 2010 at 14:00:59 +0200, Jean-Yves Migeon wrote:
 Many thanks for answering my questions, Antti. Now that I have sane (and
 safe) pointers, I have some readings to do.

Oh if you just wanted reading, you could have started with the
publications linked from http://www.NetBSD.org/docs/rump/.
Can't comment on the safety or sanity, though ;)
(the web page itself is not quite up-to-date anymore and updating it is,
proverbially, on David Holland's todo list)

Apparently this year's AsiaBSDCon papers aren't online on
2010.asiabsdcon.org yet.  I've put my paper *temporarily* in
ftp://ftp.NetBSD.org/pub/NetBSD/misc/pooka/tmp/rumpdev.pdf
(anyone reading this from the archives, if that link is dead, check
http://2010.asiabsdcon.org/)


Re: Lightweight virtualization - the rump approach

2010-05-14 Thread Antti Kantee
On Thu May 13 2010 at 18:51:16 +0200, Jean-Yves Migeon wrote:
 I am not posting this to reinstate the decades old monolithic vs 
 microkernel troll, so please avoid that field; thanks.

I hope people on these lists are adult enough to realize that argument
is pointless.  The correct answer to "which is better" is of course "both"
(or "neither", as code I hope gets committed later today or during the
weekend will quite measurably demonstrate).

 Lights on the work of Antti, with rump. Most systems I have seen lately 
 provide the characteristics enumerated above by pulling in a general 
 purpose OS, like Linux, with its environment, just to get a specific 
 need, like an up-to-date network stack (strong push for IPv6, anyone?), 
 drivers (filesystems, usb, pci stacks, devices), etc. There is no real 
 componentization in mind.
 
 Everything being virtual these days (see cloud computing buzzwords, or 
 hardware systems delivered with some kind of hypervisor inside - PS3, or 
 sun4v, for example -), I see Antti's work (well, all TNF work too ;) ) 
 as being a real asset to make the NetBSD's code base more widely known, 
 appreciated and used. I have yet to see a solution where you could port 
 then debug kernel code directly to userland (at least, for a general 
 purpose OS), or offer an alternative when you need to port specific 
 components, like network functionality or filesystem code.

My guess of what is going to happen in the future is that the historic
kernel/user boundary and even the OS will mostly go away and you'll just
be left with semi-independent virtualized application stacks running on
minimal hosts, perhaps on ASICs.  The OS is pure unnecessary overhead.
To take a food analogy: flavour rules, not whether the food was baked,
broiled, sauteed or cooked sous vide (sous vided? ;).

 For this reason, I have a few questions for the ones familiar with rump 
 technology here, especially:
 
 - the basic architecture; how did/do you achieve such a functionality? 
 Adding an extra layer of indirection within the kernel?

There's no artificial extra layer of indirection in the code (and,
compared to other virtualization technologies, there's one less at
runtime).  It's mostly about plugging into key places and understanding
what you can use directly from the host (e.g. locking, which is of
course very intertwined with scheduling) and what you really don't want
to be emulating in userspace at all (e.g. virtual memory).  Due to some
magical mystery reason, even code which was written 20 years ago tends
to allow for easy separation.

The other part is more or less completing the work on kernel module
support in NetBSD, mostly minor issues with config left (devsw,
SRCS) and I've got those somewhat done in a corner of my source tree.
Rump components more or less follow the same functional units as kernel
modules apart from the rump{dev,net,vfs} factions, which cannot be
dynamically loaded and without which a regular kernel would not function.
Yes, I decided to use the word faction to describe the three midlayers
between rumpkern and the drivers.

The only real problem is the loosey-goosey use of inlines/macros in
unnecessary places.  But luckily for me, a lot of the work to clean that
up was done by Andy when he made the x86 ports modular.

 Suppose we have 
 improvements in one part of it, like TCP, IP stacks, could it directly 
 benefit the rumpnet component, or any service sitting above it?

Could you elaborate this question?

 - What kind of effort would it require to port it to other OS 
 architectures, especially when the API they offer could be a subset of 
 POSIX, or specific low level API (like Xen's hypercalls)? (closely 
 related to the work of Arnaud Ysmall in misc/rump [4])

As you probably know, rump uses the rumpuser hypercall interface to
access the hypervisor (which is currently just a fancy name for userland
namespace).  It pretty much evolved with the "oh I need this?  ok, I'll
add it" technique.  But, if someone wants to experiment with minimal
hosts, I think we can work on rumpuser a bit and see what qualities the
hosts have in common and what's different.  I don't expect supporting
rump to be any more difficult than for example Wombat/Iguana L4+UML.
In fact, it's probably simpler since with rump there is no notion of
things like address space -- that comes entirely from the host.

Running directly on top of Xen is an interesting idea, but the first
question I have with that is why?, i.e. what is the gain as opposed
running directly in a process on dom0?  The only reason I can think of
is that you don't trust your dom0 OS enough, but then again you can't
really trust Xen guests either?  Besides, rump kernels do not execute
any privileged instructions, so Xen doesn't sound like the right hammer.

 - If rump could be used both for lightweight virtualization (like rump 
 fs servers), or more heavyweight one (netbsd-usermode...)?

Usermode = rump+more, although paradoxically rump = usermode+more also

Re: bin/30756: gdb not usable for live debugging of threaded programs

2010-04-23 Thread Antti Kantee
On Thu Apr 22 2010 at 11:18:14 -0400, Paul Koning wrote:
 Antti pointed out a problem in the patch I originally submitted which
 causes gdb to go into a loop if the child process exits.  The attached
 updated patch corrects that problem.

Yup, your new patch seems to fix that.  Thanks again.

Just one cosmetic issue now.  After finishing, gdb always says:
Couldn't get registers: Operation not permitted.


Re: rump and usb, only one ugen getting attached?

2010-04-08 Thread Antti Kantee
On Thu Apr 08 2010 at 02:35:28 +0100, Jasper Wallace wrote:
 
 Hi,
 
 I'm trying to debug a problem with netbsd and a usb cdc acm device using 
 rump and in the process I can only get rump to attach to ugen0. I can work 
 around this by nailing down ugen0 to a particular usb port in my kernel 
 config, but does rump/ugenhc always only attach to ugen0? UGENHC.ioconf 
 has 4 ugenhc entries so i assume not.

Hmm.  Make sure the other /dev/ugen device nodes exist.  Based on the
timestamps I have on my /dev and from reading /dev/MAKEDEV, only ugen0
nodes are created by default.
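
If memory serves, something along the lines of

cd /dev && sh ./MAKEDEV ugen1 ugen2 ugen3

should create the missing nodes -- but check which targets your MAKEDEV
actually supports, I'm writing this from memory.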

good luck ;)


Re: config(5) break down

2010-03-26 Thread Antti Kantee
On Fri Mar 26 2010 at 13:25:43 +0900, Masao Uebayashi wrote:
 syntax.  I spent a whole weekend to read sys/conf/files, ioconf.c, and
 module stubs in sys/dev/usb/uaudio.c.  I wasted a whole weekend.  I've

This patch should work and make it easier.  No, it doesn't solve
dependencies, but gets developers at least halfway there without having
to waste weekends (with code).  Unfortunately I can't test, since I
forgot to buy a usb audio device from Akihabara ;)

Index: dev/usb/uaudio.c
===
RCS file: /cvsroot/src/sys/dev/usb/uaudio.c,v
retrieving revision 1.117
diff -p -u -r1.117 uaudio.c
--- dev/usb/uaudio.c12 Nov 2009 19:50:01 -  1.117
+++ dev/usb/uaudio.c26 Mar 2010 06:11:39 -
@@ -3065,67 +3065,21 @@ uaudio_set_speed(struct uaudio_softc *sc
 
 MODULE(MODULE_CLASS_DRIVER, uaudio, NULL);
 
-static const struct cfiattrdata audiobuscf_iattrdata = {
-   "audiobus", 0, { { NULL, NULL, 0 }, }
-};
-static const struct cfiattrdata * const uaudio_attrs[] = {
-   &audiobuscf_iattrdata, NULL
-};
-CFDRIVER_DECL(uaudio, DV_DULL, uaudio_attrs);
-extern struct cfattach uaudio_ca;
-static int uaudioloc[6/*USBIFIFCF_NLOCS*/] = {
-   -1/*USBIFIFCF_PORT_DEFAULT*/,
-   -1/*USBIFIFCF_CONFIGURATION_DEFAULT*/,
-   -1/*USBIFIFCF_INTERFACE_DEFAULT*/,
-   -1/*USBIFIFCF_VENDOR_DEFAULT*/,
-   -1/*USBIFIFCF_PRODUCT_DEFAULT*/,
-   -1/*USBIFIFCF_RELEASE_DEFAULT*/};
-static struct cfparent uhubparent = {
-   "usbifif", NULL, DVUNIT_ANY
-};
-static struct cfdata uaudio_cfdata[] = {
-   {
-   .cf_name = "uaudio",
-   .cf_atname = "uaudio",
-   .cf_unit = 0,
-   .cf_fstate = FSTATE_STAR,
-   .cf_loc = uaudioloc,
-   .cf_flags = 0,
-   .cf_pspec = &uhubparent,
-   },
-   { NULL }
-};
+#include "ioconf.c"
 
 static int
 uaudio_modcmd(modcmd_t cmd, void *arg)
 {
-   int err;
 
switch (cmd) {
case MODULE_CMD_INIT:
-   err = config_cfdriver_attach(&uaudio_cd);
-   if (err) {
-   return err;
-   }
-   err = config_cfattach_attach("uaudio", &uaudio_ca);
-   if (err) {
-   config_cfdriver_detach(&uaudio_cd);
-   return err;
-   }
-   err = config_cfdata_attach(uaudio_cfdata, 1);
-   if (err) {
-   config_cfattach_detach("uaudio", &uaudio_ca);
-   config_cfdriver_detach(&uaudio_cd);
-   return err;
-   }
-   return 0;
+   return config_init_component(cfdriver_comp_uaudio,
+   cfattach_comp_uaudio, cfdata_uaudio);
+
case MODULE_CMD_FINI:
-   err = config_cfdata_detach(uaudio_cfdata);
-   if (err)
-   return err;
-   config_cfattach_detach("uaudio", &uaudio_ca);
-   config_cfdriver_detach(&uaudio_cd);
-   return 0;
+   return config_fini_component(cfdriver_comp_uaudio,
+   cfattach_comp_uaudio, cfdata_uaudio);
+
default:
return ENOTTY;
}
Index: modules/uaudio/Makefile
===
RCS file: /cvsroot/src/sys/modules/uaudio/Makefile,v
retrieving revision 1.1
diff -p -u -r1.1 Makefile
--- modules/uaudio/Makefile 28 Jun 2008 09:14:56 -  1.1
+++ modules/uaudio/Makefile 26 Mar 2010 06:11:39 -
@@ -5,6 +5,7 @@
 .PATH: ${S}/dev/usb
 
 KMOD=   uaudio
+IOCONF=UAUDIO.ioconf
 SRCS=  uaudio.c
 
 .include bsd.kmodule.mk
Index: modules/uaudio/UAUDIO.ioconf
===
RCS file: modules/uaudio/UAUDIO.ioconf
diff -N modules/uaudio/UAUDIO.ioconf
--- /dev/null   1 Jan 1970 00:00:00 -
+++ modules/uaudio/UAUDIO.ioconf26 Mar 2010 06:11:39 -
@@ -0,0 +1,12 @@
+#  $NetBSD$
+#
+
+ioconf uaudio
+
+include conf/files
+include dev/usb/files.usb
+
+pseudo-root uhub*
+
+# USB audio
+uaudio* at uhub? port ? configuration ?


Re: test wanted: module plists

2010-03-08 Thread Antti Kantee
On Mon Mar 08 2010 at 02:37:05 +, David Holland wrote:
 The code for loading a module plist from a file system is messed up in
 that it calls namei() and then it calls vn_open() on the same
 nameidata without reinitializing it or cleaning up the previous
 results. I'm surprised this didn't result in fireworks, but apparently
 it didn't.
 
 The following patch fixes that, and compiles, but I'm not set up to be
 able to test this -- is there anyone who can do so easily/quickly?

When I was playing with that code, I used atf on tests/modules.  I can't
remember if it tests loading from .prop, but a .prop file isn't exactly
hard to create.
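
For the record, a .prop file is just an externalized proplib dictionary,
so you can write one by hand or generate it with a couple of lines of C.
An off-the-cuff sketch (untested, and the key name is only an example,
not something the module code necessarily looks at):

#include <prop/proplib.h>

#include <err.h>
#include <stdbool.h>

int
main(void)
{
	prop_dictionary_t dict;

	dict = prop_dictionary_create();
	if (dict == NULL)
		errx(1, "prop_dictionary_create failed");

	/* example key only */
	prop_dictionary_set_bool(dict, "exampleknob", true);

	if (!prop_dictionary_externalize_to_file(dict, "./example.prop"))
		errx(1, "externalize failed");
	prop_object_release(dict);

	return 0;
}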

Dunno what the canonical in-tree feature using this is, though.

 Index: kern_module_vfs.c
 ===
 RCS file: /cvsroot/src/sys/kern/kern_module_vfs.c,v
 retrieving revision 1.3
 diff -u -p -r1.3 kern_module_vfs.c
 --- kern_module_vfs.c 16 Feb 2010 05:47:52 -  1.3
 +++ kern_module_vfs.c 8 Mar 2010 02:33:36 -
 @@ -147,23 +147,18 @@ module_load_plist_vfs(const char *modpat
  NDINIT(&nd, LOOKUP, FOLLOW | (nochroot ? NOCHROOT : 0),
   UIO_SYSSPACE, proppath);
  
 - error = namei(&nd);
 - if (error != 0) {
 - goto out1;
 + error = vn_open(&nd, FREAD, 0);
 + if (error != 0) {
 + goto out1;
   }
  
   error = vn_stat(nd.ni_vp, &sb);
   if (error != 0) {
 - goto out1;
 + goto out;
   }
   if (sb.st_size >= (plistsize - 1)) {/* leave space for term \0 */
   error = EFBIG;
 - goto out1;
 - }
 -
 - error = vn_open(&nd, FREAD, 0);
 - if (error != 0) {
 - goto out1;
 + goto out;
   }
  
   base = kmem_alloc(plistsize, KM_SLEEP);
 
 
 
 -- 
 David A. Holland
 dholl...@netbsd.org


Re: config(5) break down

2010-03-08 Thread Antti Kantee
On Mon Mar 08 2010 at 07:09:07 +, David Holland wrote:
 Meanwhile, I think trying to wipe out all the boolean dependency logic
 in favor of a big graph of modules and submodules is also likely to
 make a mess. What happens to e.g.
 
 file   ufs/ffs/ffs_bswap.c     (ffs | mfs) & ffs_ei
 
 especially given that the ffs code is littered with FFS_EI conditional
 compilation? You can make ffs_bswap its own module, but that doesn't
 really serve any purpose. You could try making an FFS_EI module that
 works by patching the ffs module on the fly or something, and then
 include ffs_bswap.o into that, but that would be both very difficult
 and highly gross. You could compile two copies each of ffs and mfs,
 with and without FFS_EI support, but that wastes space. Or you could
 make FFS_EI no longer optional, which would be a regression.
 
 (FFS_EI isn't the only such option either, it's just one I happen to
 have already banged heads with.)

This one is easy, no need to make it difficult.  The NetBSD-supplied
module is always compiled with FFS_EI (if you don't like it, you can
always compile your own just like you can compile your own kernel now).
We don't care about mfs here, since it's not reasonable to want to mount
a memory file system in the opposite byte order (technically I guess you
could mmap an image instead of malloc+newfs and then mount(MOUNT_MFS),
but you might just as well use ffs).

Things like wapbl are currently an actual problem, since it is multiply
owned (conf/files *and* ufs/files.ufs).  The easy solution (and my
vote) would be to make vfs_wapbl.c always included in the base kernel.
If someone feels it's worth their salt to make it into two modules with
all the dependency hum-haa, that would be a good place to start practicing
instead of ffs_ei.


Re: Zero page

2010-02-02 Thread Antti Kantee
On Wed Feb 03 2010 at 03:06:00 +0900, Masao Uebayashi wrote:
 I need to add zero-page to support XIP.  Unallocated blocks are redirected
 to this.  Basically it's a static single page filled with zero.
 
   void *pmap_zeropage;
   paddr_t pmap_zeropage_phys_addr;
 
 and initialized by pmap.c like:
 
   pmap_zeropage = (void *)uvm_pageboot_alloc(PAGE_SIZE);
   pmap_zeropage_phys_addr = MIPS_KSEG0_TO_PHYS(pmap_zeropage);
 
 Because it's used publicly (from the coming custom genfs_getpages()), it's
 defined somewhere like uvm_page.h.

Why does it need to be in pmap?


Re: Zero page

2010-02-02 Thread Antti Kantee
On Wed Feb 03 2010 at 03:26:33 +0900, Masao Uebayashi wrote:
  Why does it need to be in pmap?
 
 Actually it doesn't.  Probably uvm_page.c is better?

Maybe.

 And it'll be #ifdef XIP'ed.

Can't the first XIP device to attach simply allocate it?


Re: Zero page

2010-02-02 Thread Antti Kantee
On Wed Feb 03 2010 at 03:55:29 +0900, Masao Uebayashi wrote:
  Can't the first XIP device to attach simply allocate it?
 
 It's getpages()'s iteration loop which redirects unallocated pages to
 zero-pages.  If we allocate the zero-page in device drivers, we have to
 have an interface through which it can be retrieved from the vnode or mount.
 Having a well-known global name is simple, but I'm fine with both.

I assumed the reason you mentioned #ifdef XIP was because you didn't
want to waste a whole page of memory on systems which don't use XIP.

So in my suggestion you'd have a global:

struct uvm_page *page_of_blues;

and then in xip_attach:

RUN_ONCE(zeroes, allocate_nothingness);

Or something like that (you can even refcount it if you want to be
extra-fancy).  Then you can just always use the global zeropage in xip
getpages() and don't need to recompile your kernel (and reboot!) to
support XIP device modules.
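
In (pseudo)code, with names made up on the spot and error handling plus
any wiring/management details omitted, the suggestion looks roughly
like this:

#include <sys/param.h>
#include <sys/once.h>

#include <uvm/uvm.h>

static struct vm_page *xip_zeropage;		/* hypothetical name */
static ONCE_DECL(xip_zeropage_once);

static int
xip_zeropage_init(void)
{

	/* one zeroed page, not attached to any object or anon */
	xip_zeropage = uvm_pagealloc(NULL, 0, NULL, UVM_PGA_ZERO);
	return xip_zeropage == NULL ? ENOMEM : 0;
}

/* called from the (hypothetical) xip driver attach */
static void
xip_attach_zeropage(void)
{

	(void)RUN_ONCE(&xip_zeropage_once, xip_zeropage_init);
}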


Re: blocksizes

2010-01-31 Thread Antti Kantee
On Sun Jan 31 2010 at 22:21:52 +0900, Izumi Tsutsui wrote:
  Can you please test with your 2K MO?
 
 It's not easy to test it without a working newfs(8) command.
 (if you need hardware I can send the drive and media..)
 
  N.B. newfs doesn't yet know how to deduce sector sizes, you need
  to use the -S option.
 
 newfs(8) doesn't work even with the -S 2048 option.
 (probably it tries to write data at an offset that isn't sector-size aligned)

Apparently makefs -S 2048 works, and the resulting image also works only
when accessed with 2048 byte simulated sector size (fs-utils with ffs
from rump):

pain-rustique:29:~ env RUMP_BLKSECTSHIFT=11 fsu_du -o ro testi2.ffs -sck
rumpblk: using 11 for sector shift (size 2048)
120830  .
120830  total
pain-rustique:30:~ env RUMP_BLKSECTSHIFT=10 fsu_du -o ro testi2.ffs -sck
rumpblk: using 10 for sector shift (size 1024)
fsu_du: Not a directory

But of course makefs uses a file backend, which doesn't care that much
about unaligned writes, so the problem you mention still might exist.
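
(For anyone wanting to reproduce: something along the lines of
"makefs -t ffs -S 2048 testi2.ffs /some/tree" should produce such an
image, though double-check the option names against makefs(8).)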


Re: uvm_object::vmobjlock

2010-01-28 Thread Antti Kantee
On Fri Jan 29 2010 at 02:03:23 +, Mindaugas Rasiukevicius wrote:
 If you are talking about memory not within the object, well, then "all
 bets are off" applies.  I might argue equally handwavily that you'll
  cause false sharing with other locks from the mutex obj pool, and even
  for many many more locks, since you don't even get the protection of
  the data after the lock being safe. ...
 
 Heh?  The mutex object pool has a necessary alignment and padding, which
 guarantees that the lock has its own cache line.  That was one of the
 reasons, besides reference counting, why lock object pool was invented.

Ooops.  I meant to handwave about how you're now wasting multiple cache
lines where previously only one pretty much always uncontended line
was required.  I'm not convinced at all this is improving performance.
Anyway, you get the point.
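
To spell out the handwave with a toy example -- this is not the actual
uvm_object layout, just an illustration of embedded vs. separately
allocated locks:

#include <sys/param.h>
#include <sys/mutex.h>

/*
 * Lock embedded in the object: the lock and the hot fields right next
 * to it can share a single, usually uncontended cache line.
 */
struct obj_embedded {
	kmutex_t	oe_lock;
	int		oe_hotfield;
};

/*
 * Lock allocated separately and padded to a cache line of its own:
 * no false sharing with neighbours, but every lock/unlock now touches
 * an extra line, plus the line holding the pointer.
 */
struct lockpad {
	kmutex_t	lp_lock;
} __aligned(COHERENCY_UNIT);

struct obj_indirect {
	struct lockpad	*oi_lock;
	int		oi_hotfield;
};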