Re: Interrupt handlers and mutex

2009-12-31 Thread David Holland

On Thu, Dec 31, 2009 at 01:52:48PM -0600, Frank Zerangue wrote:
  Help request -- Mutex(9) indicates that mutex replaces the spl(9) system.

Here are some general (non-NetBSD-specific) answers based on
underlying principles that will hopefully explain the situation better.

  (1) When writing an interrupt handler, should the handler acquire a
  spin mutex before modifying some IO that may be accessed also by a
  LWP?

Yes. In general, in a multiprocessor kernel, the interrupt handler
will not necessarily run on the same CPU that has been posting
requests to the device... and in general more than one CPU may be
doing that. It is prohibitively expensive (as well as generally
undesirable) to disable interrupts for all CPUs at once. Therefore,
disabling interrupts only affects the current CPU, so in order to keep
everything from being tied in knots, you need one or more locks. And
because you can't sleep in an interrupt handler, these must be
spinlocks.

Spinlocks that are used from interrupt handlers must themselves also
disable interrupts. Otherwise, if a thread holding the spinlock is
interrupted by an interrupt handler that tries to acquire the same
spinlock, you get a deadlock.

For this reason, in essentially all multiprocessor systems, the
spinlocks that you use for mutual exclusion in interrupt code disable
and re-enable interrupts for you. In NetBSD each mutex can have an
interrupt level associated with it; if that interrupt level is not
IPL_NONE, acquiring the mutex raises the current interrupt level and
releasing the mutex lowers the current interrupt level.

And in turn, in such systems one generally never sees or uses the
splfoo() functions except in code that hasn't yet been
multiprocessorized.

Currently in NetBSD the interrupt level is only lowered to zero and
only when all spinlocks have been released, instead of any time the
necessary interrupt level drops. This is to avoid complexity when
spinlocks are not released in order, and it mostly makes no practical
difference. (It isn't a good idea to embed e.g. a small block using an
IPL_HIGH mutex inside a large block using a lower-interrupt-level
mutex. But then again, it also isn't a good idea to have a large block
disabling interrupts or using a spinlock anyway.)
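
To make the bookkeeping above concrete, here is a small userland model
in C. It is a sketch only, not the NetBSD implementation: the struct
and function names (cpu_model, model_spin_enter, and so on) and the
numeric IPL values are made up for illustration. It assumes that the
level in force before the first spin mutex was taken is what gets
restored, and that this happens only when the last one is released.

```c
#include <assert.h>

/* Model interrupt priority levels, lowest to highest (values invented). */
enum { IPL_NONE = 0, IPL_VM = 1, IPL_SCHED = 2, IPL_HIGH = 3 };

/* Per-CPU state in this model. */
struct cpu_model {
    int cur_ipl;     /* current interrupt priority level */
    int spin_count;  /* number of spin mutexes currently held */
    int saved_ipl;   /* level to restore when the last one is released */
};

/* A spin mutex with an interrupt level associated at init time. */
struct spin_mutex_model {
    int ipl;
    int held;
};

void
model_spin_enter(struct cpu_model *ci, struct spin_mutex_model *m)
{
    int old = ci->cur_ipl;

    /* Raise (never lower) the level before touching the lock. */
    if (m->ipl > ci->cur_ipl)
        ci->cur_ipl = m->ipl;
    /* Remember the level that was in force before the first spin mutex. */
    if (ci->spin_count++ == 0)
        ci->saved_ipl = old;
    assert(!m->held);  /* a real lock would spin here instead */
    m->held = 1;
}

void
model_spin_exit(struct cpu_model *ci, struct spin_mutex_model *m)
{
    assert(m->held);
    m->held = 0;
    /* The level drops only once every spin mutex has been released. */
    if (--ci->spin_count == 0)
        ci->cur_ipl = ci->saved_ipl;
}

/* An interrupt at level ipl can be taken only if the CPU is below it. */
int
model_intr_deliverable(const struct cpu_model *ci, int ipl)
{
    return ipl > ci->cur_ipl;
}
```

Starting from IPL_NONE, taking an IPL_VM mutex and then an IPL_HIGH
one, the model stays at IPL_HIGH after the IPL_HIGH mutex is released
and only returns to IPL_NONE once the IPL_VM mutex is released as
well.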

NetBSD also has soft interrupts that have more process context than
ordinary interrupt handlers; instead of borrowing the context of
whatever's running when the interrupt arrives, they run on dedicated
kernel threads; this means they can sleep to acquire mutexes. AIUI,
the intended design is that most hard interrupts will do as little
work as possible and trigger a soft interrupt to do the rest; this
reduces the number of mutexes used from real interrupt handlers and
reduces the overall amount of spinning. I'm not entirely up on the
exact details at the moment and hopefully someone else will clarify if
there are questions.

  (2) What happens when the interrupt handler cannot acquire the
  mutex? Will the LWP that holds it ever be able to run again?

Define "cannot acquire". If the mutex is held by some thread on
another processor, the interrupt handler will spin until it's
released. If it's never released, the interrupt handler will spin
forever, but that's not supposed to happen.

If the mutex is held by some thread on the same processor, then you're
deadlocked. This is why you have to disable interrupts before
attempting to get a spinlock that's used from an interrupt handler.
(Note: the exact details depend somewhat on the implementation. But
you don't want to be writing code that pushes the boundaries.)
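
The same-CPU case can be seen in a toy test-and-set lock. This is a
minimal sketch using C11 atomics, assuming nothing about how NetBSD
implements its spin mutexes; toy_spinlock and the functions around it
are hypothetical names. The handler's spin loop is bounded so that the
demonstration terminates; a real one would simply never come back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock; does nothing about interrupts. */
struct toy_spinlock {
    atomic_flag busy;
};

/* One acquisition attempt: true if we got the lock. */
bool
toy_try_acquire(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->busy,
        memory_order_acquire);
}

void
toy_release(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

/*
 * What a handler's spin loop amounts to, bounded for the demo.
 * Returns true if the lock was obtained within max_spins attempts.
 */
bool
toy_handler_spin(struct toy_spinlock *l, unsigned max_spins)
{
    for (unsigned i = 0; i < max_spins; i++) {
        if (toy_try_acquire(l))
            return true;
    }
    return false;  /* an unbounded loop would still be spinning */
}

/* Walk through the same-CPU case described in the text. */
void
toy_same_cpu_demo(void)
{
    struct toy_spinlock l = { ATOMIC_FLAG_INIT };
    bool got;

    got = toy_try_acquire(&l);         /* a thread takes the lock */
    assert(got);
    got = toy_handler_spin(&l, 1000);  /* "interrupt" on the same CPU */
    assert(!got);                      /* the holder cannot run meanwhile */
    toy_release(&l);                   /* possible only outside the handler */
    got = toy_handler_spin(&l, 1000);
    assert(got);
    toy_release(&l);
}
```

While the handler is spinning, the interrupted thread is the only one
that could release the lock, and it cannot run until the handler
returns. Raising the interrupt level before taking the lock keeps the
handler from starting on that CPU in the first place.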

  (3) Will a LWP that holds a spin mutex be pre-empted by the scheduler?

In general, that depends on the interrupt level associated with the
spin mutex. 

-- 
David A. Holland
dholl...@netbsd.org

Re: The imperfect beauty of NetBSD

2010-01-07 Thread David Holland

On Wed, Jan 06, 2010 at 10:51:58PM -0600, Peter Seebach wrote:
  You might like to know about apropos(1):
  
  I am told that officially apropos(1) is deprecated, and substituted with
  man -k.  Which does, in fact, say that it's the same thing.
  
  I think "deprecated" may mean little more than "some standard somewhere
  didn't include it".

If that.

Is there any suitably licensed text search engine that's worth
importing to get a real man page search capability?

(Er. Off-topic, followups to tech-userlevel I guess...)

-- 
David A. Holland
dholl...@netbsd.org

Re: The imperfect beauty of NetBSD

2010-01-07 Thread David Holland

On Thu, Jan 07, 2010 at 12:44:07AM -0500, Alex Goncharov wrote:
  But:
  
 man -S N STRING
 
  to work, and 
  
 man -S N -k STRING
  
  not?...

I think you're looking for man -s, which works fine. I didn't even
know -S existed.

It seems that the problem is that -S is defined somewhat differently
for man and apropos; I have no idea which one of them (if either) is
wrong, but it certainly violates the principle of least surprise the
way it is. Please file a PR.

-- 
David A. Holland
dholl...@netbsd.org

Re: blocksizes

2010-01-21 Thread David Holland

On Thu, Jan 21, 2010 at 10:30:20PM +, Michael van Elst wrote:
  IMHO there need to be three different ways to specify block
  offsets and block counts:
  
  1. in units of blocks of the physical device
  2. in units of blocks of DEV_BSIZE bytes
  3. in bytes

Don't forget: 4. in units of the filesystem block size...

  and we need to establish what units are used where.

IM (fairly strong) O everything should be kept in byte counts, and
never block counts because if you have more than one unit in use it is
far too easy to accidentally mix them or provide the wrong one, and
because they're all the same language-level type there's little hope
of detecting such problems automatically.

Furthermore, Murphy's Law dictates that in any particular place the
count you are given is frequently not in the units you need to give
something else, and then you end up converting back and forth all over
everywhere. This serves no purpose and tends to obfuscate the code
base.

Since in practice nothing can be larger than the maximum value of
off_t anyway, and all counts should be getting carried around as
64-bit values, using byte counts instead of block counts does not
change the maximum addressable size of anything and therefore has no
particular downside.

Things should only be converted to block counts right when talking to
hardware, in which case the correct size to use is immediately
available, or right when reporting to userlevel using interfaces
standardized that way, in which case ditto.
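
As a rough illustration of converting only at the edge, here is a
sketch in C. The types and the function name (byte_extent,
sector_extent, extent_to_sectors) are invented for this example and do
not correspond to any existing kernel interface; the assumption is
simply that the driver knows its sector size at the point where it
talks to the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request as carried around internally: bytes only. */
struct byte_extent {
    uint64_t offset;  /* byte offset on the device */
    uint64_t length;  /* byte count */
};

/* Hypothetical request as handed to the hardware: physical sectors. */
struct sector_extent {
    uint64_t first;   /* first sector number */
    uint64_t count;   /* number of sectors */
};

/*
 * Convert at the device boundary, where the sector size is known.
 * Returns 0 on success, -1 if the sector size is zero or the extent
 * is not aligned to it.
 */
int
extent_to_sectors(const struct byte_extent *in, uint32_t secsize,
    struct sector_extent *out)
{
    assert(in && out);
    if (secsize == 0)
        return -1;
    if (in->offset % secsize != 0 || in->length % secsize != 0)
        return -1;
    out->first = in->offset / secsize;
    out->count = in->length / secsize;
    return 0;
}
```

The same byte extent (offset 8192, length 4096) becomes sectors 16..23
on a 512-byte device and sectors 4..5 on a 2048-byte one, and nothing
above the driver needs to know which.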

  The necessary changes are rather small. In particular, dkwedge_info needs
  to be extended to keep track of the physical sector size so that the dk
  driver can do the transformations.

The physical sector size should be available to callers (just not part
of the API/ABI) so this ought to be done regardless.

-- 
David A. Holland
dholl...@netbsd.org

Re: blocksizes

2010-01-22 Thread David Holland

On Fri, Jan 22, 2010 at 07:38:14AM +, Michael van Elst wrote:
  Like most things, there is no universal correct answer here, simply
  deciding always use bytes because it seems simpler is unlikely to be
  the overall best answer.
  
  I think the suggestion is to use block numbers (or some other form
  of addressing larger units) only internally to some subsystem where
  these have a meaning and to use byte counts between those subsystems.

Yes, that.

  Quotas should use the units the filesystem uses. For FFS that's
  probably fragments.

The quota structures that have been brought up are an on-disk format
that's not subject to change. Any FS-independent quota code should
internally use byte counts, but currently quotas are such a
mess that we don't have any such thing anyway. :-/

-- 
David A. Holland
dholl...@netbsd.org

Re: blocksizes

2010-01-22 Thread David Holland

On Fri, Jan 22, 2010 at 08:07:03AM +0100, Michael van Elst wrote:
  On Fri, Jan 22, 2010 at 05:46:31AM +, David Holland wrote:
   On Thu, Jan 21, 2010 at 10:30:20PM +, Michael van Elst wrote:
IMHO there need to be three different ways to specify block
offsets and block counts:

1. in units of blocks of the physical device
2. in units of blocks of DEV_BSIZE bytes
3. in bytes
   
   Don't forget: 4. in units of the filesystem block size...
  
  I omitted this from the list because only the filesystem
  itself has the notion of 'filesystem block size', but when
  talking to the device it goes back to use DEV_BSIZE. It
  becomes clear that 'filesystem block size' is a very private
  measure of a filesystem when you think about FFS fragments
  where the filesystem already uses a second size and about
  aggregated IO where multiple blocks are accessed as one
  unit.

Indeed. But it's still floating around in the system and still a
possible complication. It's not *quite* invisible outside of each
filesystem; e.g. it affects caching.

and we need to establish what units are used where.
   
   IM (fairly strong) O everything should be kept in byte counts, and
   never block counts because if you have more than one unit in use it is
   far too easy to accidentally mix them or provide the wrong one, and
   because they're all the same language-level type there's little hope
   of detecting such problems automatically.
  
  I would like a system where all I/O is measured in bytes, but this
  requires a complete redesign for all disk devices and all filesystems.

Right, but I think we should make this the end goal. Nobody says we
need to expect to get there promptly. :-/

  And you won't get rid of the physical blocks, at some point you
  have to translate.

Only when interfacing, as previously noted. (And, as noted elsewhere,
the places where this is required also include on-disk formats.)

   Furthermore, Murphy's Law dictates that in any particular place the
   count you are given is frequently not in the units you need to give
   something else, and then you end up converting back and forth all over
   everywhere. This serves no purpose and tends to obfuscate the code
   base.
  
  This is how it works now. We do translate blocks back and forth
  all over the place, except that there are a lot of assumptions that
  physical block size is the same as DEV_BSIZE.

Right. Wading through such logic is one of the things that convinced
me (a long time ago) that it shouldn't exist. Implementing such stuff
in research kernels was the other driving factor - it is too easy to
get wrong and you can't afford to spend time dealing with it.
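
For a sense of how easily this goes wrong, here is a minimal sketch in
C of the kind of mistake the "physical block size equals DEV_BSIZE"
assumption hides. TOY_DEV_BSIZE stands in for DEV_BSIZE, and all the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DEV_BSIZE 512u  /* stands in for DEV_BSIZE */

/* DEV_BSIZE block number to byte offset. */
uint64_t
toy_dbtob(uint64_t db)
{
    return db * TOY_DEV_BSIZE;
}

/*
 * Buggy: hands a DEV_BSIZE block number to the device as if it were a
 * physical sector number. Both are plain integers, so nothing complains.
 */
uint64_t
sector_from_db_assuming_512(uint64_t db)
{
    return db;
}

/* Correct: go through bytes, then divide by the real sector size. */
uint64_t
sector_from_db(uint64_t db, uint32_t secsize)
{
    assert(secsize != 0);
    return toy_dbtob(db) / secsize;
}
```

On a 512-byte device the two functions agree, which is why the bug
can sit unnoticed; on a 2048-byte device block 64 should map to sector
16, and the buggy version reads sector 64 instead.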

  Also, filesystems organize data in larger chunks. There is always
  some translation going on between block or extent numbers and
  now DEV_BSIZE offsets or byte offset in your ideal system.
  
  On the filesystem side it won't get simpler.

It will, some. 

% grep fsbtodb sys/ufs/ffs/*.[ch] | wc -l
  57

That's quite a few more than ought to be there, IMO.

Meanwhile, other things will get quite a bit simpler.

   The physical sector size should be available to callers (just not part
   of the API/ABI) so this ought to be done regardless.
  
  I haven't thought about compatibility issues yet, where is dkwedge_info
  exposed to userland?

I dunno, I'm not all that up on wedges.

-- 
David A. Holland
dholl...@netbsd.org

Re: quota housekeeping unit

2010-01-24 Thread David Holland

On Sun, Jan 24, 2010 at 09:59:09PM +0100, Wolfgang Solfrank wrote:
 As an extreme example [on ISO 9660], you could have a file with 3 bytes,
 where every byte is in a separate block.
 Raising the question of which kind of resource limitation exactly you want
 to impose on the user. Wouldn't it make sense to count that kind of usage
 as three blocks? After all, it's disc blocks you run out of, not bytes of 
 user data.

 I'm not sure why you bring up quota in this discussion.

Someone else mentioned the on-disk FFS quota format. However, this is
not relevant to cd9660.

 The problem I tried to describe is that the current buffer cache
 with its DEV_BSIZE centric implementation more or less forbids any
 implementation of this part of the 9660 specification (albeit I have
 to admit that last I looked into this was before UBC integration.)
 A buffer cache that works with byte offsets and byte sizes of
 buffers would simplify this tremendously

It's both not quite that bad and worse. Buffer cache buffers can be
whatever size; they are normally the FS block size and they certainly
aren't limited to DEV_BSIZE.

However, the API is such that if any one file system's buffers aren't
all the same size it's treading on very thin ice. This should be
rectified sometime.

The address to read from is also a block number, but the buffer code
does not (as far as I know) itself interpret this number but just
passes it along.

-- 
David A. Holland
dholl...@netbsd.org

Re: blocksizes

2010-01-24 Thread David Holland

On Sun, Jan 24, 2010 at 08:48:32PM +, David Laight wrote:
   The btodb/dbtob macros will need another argument to indicate
   where the block size is obtained.
  
  That will just cause massive errors...
  For disks I would go for transfer requests (eg from fs) that are either in
  fixed units (BDEVSIZE, 512) or in bytes.
  
  Much like the current block device transfers - where transfers must be a
  multiple of 512 - transfers you need to be aligned to the physical
  sector size.

bytes! :-)

-- 
David A. Holland
dholl...@netbsd.org

Re: blocksizes

2010-01-24 Thread David Holland

On Sun, Jan 24, 2010 at 11:21:52PM +0100, Michael van Elst wrote:
Not using DEV_BSIZE requires to change how things work now.
   
   He is right in the long run, though.
  
  You may think that the way NetBSD works is a hack as Izumi Tsutsui
  put it. But the argument that keeping things the way they are suddenly
  makes them too simple is just nonsense.

Enh. Design decisions should be made while looking ahead, not with the
nose to the grindstone. What's the right way to do all this? That has
to be established first, then we can debate the merits of how to get
there or of compromises that need to be made with the way things
currently work.

Choosing code architectures because they're easy or apparently simple
(== require less immediate work) is a good way to get into a hole. I
had a coworker once who did a lot of this. His project eventually
needed the equivalent of a federal bank bailout. :-)

In this case, the problem with the way things are is that the way
things are does not work.

I would suggest, based on prior experience with large rototills, that
the symbol DEV_BSIZE be removed entirely and all uses examined one by
one and changed to something else based on the usage. If we want a
symbol that fills the role DEV_BSIZE is maybe supposed to (an
arbitrarily-sized block whose size happens to be convenient for
various legacy reasons) we should call it something else, and make
sure it's only deployed in places where it's correct.

grep shows 718 references in the kernel, so this can't be done all in
one pass, but it's a much smaller problem than e.g. the device/softc
thing.

-- 
David A. Holland
dholl...@netbsd.org

Re: Proposal for adding fsx(8) to base system

2010-01-24 Thread David Holland

On Mon, Jan 25, 2010 at 12:19:30AM +0100, Hubert Feyrer wrote:
 On Sun, 24 Jan 2010, o...@linbsd.org wrote:
 Fsx is a filesystem exerciser that is used to stress filesystem code.
 I would like to propose importing fsx into the base systems, or perhaps 
 pkgsrc.
 The intent is to import ftp://ftp.netbsd.org/pub/NetBSD/misc/ober/fsx/ to 
 src/usr.sbin.

 Sounds like a case for pkgsrc/benchmark for me.

AIUI the reason to have it in base is so the regression tests can use
it. However, if so it should really go in src/external.

Also, I notice the man page has no license.

-- 
David A. Holland
dholl...@netbsd.org

Re: FS corruption because of bufio_cache pool depletion?

2010-01-27 Thread David Holland

On Tue, Jan 26, 2010 at 07:39:00PM +0100, Manuel Bouyer wrote:
   I have a netbsd-3/Xen 2 based server that runs on the same
   hardware and we have seen FS corruption in a particular domU on
   that system that seems to be related to the file system running
   out of space.  That's what the co-admin running that domU tells
   me anyway.  But I haven't seen the damage or the error messages
   in the domU personally.
  
  I've seen this too: kern/41834. There is also kern/35704 on 3.0
  about this topic.

And there's kern/27802, which I've also seen at times.

Conversely, the experience that led to filing misc/33753 didn't hit
any of these problems... for whatever any of this is worth.

-- 
David A. Holland
dholl...@netbsd.org

Re: blocksizes

2010-01-27 Thread David Holland

On Mon, Jan 25, 2010 at 08:06:11AM +, Michael van Elst wrote:
  Choosing code architectures
  
  'Redesigning' things to fix bugs seems to be common sense
  nowadays, as if everything existing is always too bad to be
  used.
  
  Of course the same is valid for the redesigned code base in
  the future.

Yes, and after a dozen or so such ground-up rebuilds (provided each
one is informed by the lessons learned in the previous ones) things
start to reach a decent state.

  One interesting point is that dropping DEV_BSIZE doesn't
  really mean something new but a jump backwards. That's where
  we came from, that's what was 'redesigned' then.

But as you may have noticed, I'm not advocating going back to using
random mixtures of block sizes.

  In this case, the problem with the way things are is that the way
  things are does not work.
  
  It works fine for 16 years now. The problems only come from legacy
  code and code from other sources that wasn't adapted to the then
  valid design, mainly because the problems didn't show up immediately
  due to lack of hardware.

...that is, it works except when it doesn't. :-P

  The software that needs to be fixed is pretty obvious, it's
  not a large rototill but if you are into 'redesigning' you
  may see a lot of places (unrelated to DEV_BSIZE) that could
  be structured better and cleaned up, e.g. partition handling.
  
  And all this should be done, whether you intend to drop or
  keep DEV_BSIZE.

Of course...

-- 
David A. Holland
dholl...@netbsd.org

Re: mutexes, IPL, tty locking

2010-01-27 Thread David Holland

On Tue, Jan 26, 2010 at 10:39:23AM +, Andrew Doran wrote:
   I'm not sure it's as rare as all that; it just mostly doesn't overtly
   fail. Instead you end up silently running at a higher IPL than
   necessary, and that buys you longer interrupt latencies and more
   dropped packets and all that.
  
  I have done extensive testing on SPL behaviour and can confidently say
  that with our current setup it simply does not matter unless you have a
  very poorly written bit of code - in which case that's your problem, not
  the interrupt masking system.

Sure, but the real question is how many such poorly written bits of
code currently exist and how hard they are to find and fix.

If what you mean to say is that you've specifically gone and looked
for such cases and not found any, then great. But so far nobody's been
willing to stick their neck out to make this assertion -- only the
weaker one, "if such code exists, it's wrong".

-- 
David A. Holland
dholl...@netbsd.org

Re: ddb write and io memory

2010-01-30 Thread David Holland

On Sat, Jan 30, 2010 at 08:45:48PM +0100, Frank Wille wrote:
  Therefore I would like to change ddb/db_write_cmd.c as in the following
  patch:
  [...]
  
  Any objections? Do we absolutely need to print the old value here?

I think it's somewhat desirable to. Wouldn't it be better anyway to
create a second command ("iowrite" or "regwrite" or something) that
both does this and also does anything else that might be necessary to
write to I/O registers safely, like flushing caches?

-- 
David A. Holland
dholl...@netbsd.org

Re: buffer cache can return buffer of another inode ?

2010-01-30 Thread David Holland

On Sun, Jan 31, 2010 at 12:35:16AM +0100, Manuel Bouyer wrote:
  Hi,
  while investigating directory corruption on my NFS server I found
  a possible issue with the buffer cache.
  [...]
  I think vclean() should also take care of removing the vnode from
  the buffer cache's hash. Comments ?

Yes. Except, what if someone else is holding the buffer? Invalidating
buffers is delicate (both in general and in the code we currently
have) and doing this could easily just move the problem around.

I'm not familiar enough with the guts to suggest a probably-safe way
of doing this in the short term; in the long term I think the buffer
cache code needs a big rototill and design review. It is next on my
list when/if I ever finish with namei...

-- 
David A. Holland
dholl...@netbsd.org

Re: uvm_object::vmobjlock

2010-01-30 Thread David Holland

On Thu, Jan 28, 2010 at 09:55:53PM +, Mindaugas Rasiukevicius wrote:
  Unless anyone objects, I would like to change struct
  uvm_object::vmobjlock to be dynamically allocated with
  mutex_obj_alloc().  It allows us to: 1) share the lock among
  objects by holding a reference 2) avoid false-sharing on locks.
  Note that struct vnode::v_interlock becomes a pointer, which means
  a chunk of mechanical changes.

You could in theory do

   -#define v_interlock v_uobj.vmobjlock
   +#define v_interlock (*v_uobj.vmobjlock)

but it is probably not a good idea :-)
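
For what it's worth, the replacement text as quoted, with parentheses,
would not parse after "vp->", since "->" has to be followed by a
member name. A subscript form does expand to valid C. The following is
a minimal userland sketch of that variant; toy_lock, toy_uobj, and
toy_vnode are hypothetical stand-ins for the real kernel types, and
none of this is a recommendation to actually do it.

```c
#include <assert.h>

/* Hypothetical stand-ins for the lock, uvm_object, and vnode types. */
struct toy_lock { int held; };
struct toy_uobj { struct toy_lock *vmobjlock; };  /* now a pointer */
struct toy_vnode { struct toy_uobj v_uobj; };

/*
 * Compatibility macro: old code writes &vp->v_interlock and expects a
 * struct toy_lock *. This expands to vp->v_uobj.vmobjlock[0], which
 * is an lvalue of the lock type, so taking its address still works.
 */
#define v_interlock v_uobj.vmobjlock[0]

void
toy_lock_enter(struct toy_lock *l)
{
    assert(!l->held);
    l->held = 1;
}

void
toy_lock_exit(struct toy_lock *l)
{
    assert(l->held);
    l->held = 0;
}

/* An unmodified "old" caller. */
void
toy_touch_vnode(struct toy_vnode *vp)
{
    toy_lock_enter(&vp->v_interlock);
    /* ... work on the vnode under the lock ... */
    toy_lock_exit(&vp->v_interlock);
}
```

Two vnodes whose v_uobj.vmobjlock point at the same lock then share it
through the old spelling, which is also a fair demonstration of how
much such a macro hides from the reader.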

Anyhow, if you do this, can we please come up with a better name for
v_interlock? Calling a lock "interlock" is about as descriptive as
writing "int i" or "bool flag", i.e., fine sometimes when the scope is
limited, but generally not so great for public data structures.

If it were me I'd probably call it v_memlock, but it looks as if it
ought to be something beginning with 'i' to avoid renaming a pile of
other stuff.

(Renaming it will also make sure that all code that needs to be
visited and adjusted actually does get visited and adjusted, which is
a good thing.)

-- 
David A. Holland
dholl...@netbsd.org

unhooking lfs from ufs

2010-02-07 Thread David Holland

On several occasions it's been suggested that lfs should be unhooked
from ufs, on the grounds that sharing ufs between both ffs and lfs has
made all three entities (but particularly lfs) gross. ffs and lfs are
not similar enough structurally for this sharing to really be a good
design. Nobody I've discussed this with (on the lists or in chat) has
been opposed, and I think there's a general consensus that this is the
right direction.

Getting there, however, is going to perhaps be a bit more
controversial. Since ufs does provide a lot of functionality to lfs,
it seems to me that the only practical way to do this is to cut and
paste a copy of ufs into lfs under a different name, hack it up so it
works again, and then begin consolidating. Anything else would involve
either cutting off far too much work at once or leaving lfs entirely
inoperable (as opposed to merely unstable) for a lengthy period; both
of these propositions seem like a worse idea than attending to and
merging changes into a chunk of copied code. (In fact, I've been
maintaining and syncing the copy since July, and it's not been a big
deal so far.) So I think this is the best approach.

The copy involves 18 files from sys/ufs/ufs (out of 21; the ones
excluded are quota.h and unsurprisingly ufs_wapbl.[ch]) which contain
9067 lines of code. That gives the following statistics:

 14988  size of lfs currently
+ 9067  size of copypasted ufs
 24055  size of resulting uncompilable lfs
-  401  result of making it compilable
 23654  size of new lfs

This is the size of the code in sys/ufs/lfs; the userlevel tools need
patching but don't change size significantly.

My guess/estimate is that after several rounds of consolidation the
total size will drop to around 18000-19000 lines. Maybe less, even,
but I wouldn't count on that. I'll be keeping an eye on the total size
going forward.

Anyway, I have done this much and it's ready to go. I will be
committing it tonight, I think, unless there are sudden howls of
protest.

The diff (from HEAD of a couple hours ago to the new compilable lfs)
is posted here:

   http://www.eecs.harvard.edu/~dholland/netbsd/lfs-ufs-20100207.diff

I will probably commit the pasted-only uncompilable form first, and
maybe some of the intermediate steps as well, for the historical
record and to make future merges easier. This may make the tree
temporarily unbuildable, but hopefully not for very long. 

-- 
David A. Holland
dholl...@netbsd.org

Re: unhooking lfs from ufs

2010-02-07 Thread David Holland

On Sun, Feb 07, 2010 at 10:10:31AM +, Mindaugas Rasiukevicius wrote:
   The copy involves 18 files from sys/ufs/ufs (out of 21; the ones
   excluded are quota.h and unsurprisingly ufs_wapbl.[ch]) which contain
   9067 lines of code. That gives the following statistics:
   
14988 size of lfs currently
   + 9067 size of copypasted ufs
24055 size of resulting uncompilable lfs
   -  401 result of making it compilable
23654 size of new lfs
  
  How would this affect UFS side?  For example, any potential code reduction
  and/or simplification?

Yes. ufs_readwrite.c will become much less gross, for example. There
used to be assorted LFS-only code in the ufs sources; ad@ removed the
ifdefs some time ago but they could be resurrected and then used to
purge the relevant code. I don't know how much code that is.

As for deeper simplifications, I don't know without digging around a
lot more than I have (particularly in the ext2fs code), but there
should be some.

   Anyway, I have done this much and it's ready to go. I will be
   committing it tonight, I think, unless there are sudden howls of
   protest.
  
  This involves significant changes, therefore enough time should be left
  for mailing list readers (~1 week at least, before committing anything).

It was discussed months ago. This is a reminder/heads-up.

-- 
David A. Holland
dholl...@netbsd.org

Re: unhooking lfs from ufs

2010-02-07 Thread David Holland

On Sun, Feb 07, 2010 at 11:07:55AM +, Mindaugas Rasiukevicius wrote:
   It was discussed months ago. This is a reminder/heads-up.
  
  Where?  This mailing list is a right place where such discussions (and
  decisions) should happen.

Right here...

-- 
David A. Holland
dholl...@netbsd.org

Re: solved ? [Re: need help with kern/35704 (UBC-related)]

2010-02-28 Thread David Holland

On Tue, Feb 02, 2010 at 10:53:58PM +0100, Manuel Bouyer wrote:
  I found the cause for this one:
  [...]
  
  To fix this I propose to have ffs_truncate() (and derivatives) always set
  v_writesize, even if the real size of the inode didn't change.
  The attached patch completely fixes the test case from kern/35704
  for me. I suspect it could also fix other related file system full
  related PRs.
  
  Anyone has comments about this ? I'd still like to hear some from
  UVM/UBC experts ...

Should this be pulled up to -4? It applies cleanly and I can probably
test it (some...)

-- 
David A. Holland
dholl...@netbsd.org

Re: solved ? [Re: need help with kern/35704 (UBC-related)]

2010-03-03 Thread David Holland

On Wed, Mar 03, 2010 at 09:27:43PM +0100, Manuel Bouyer wrote:
Anyone has comments about this ? I'd still like to hear some from
UVM/UBC experts ...
   
   Should this be pulled up to -4? It applies cleanly and I can probably
   test it (some...)
  
  Yes, it's also needed for -4 (AFAIK it's older than -3). But I've not
  had a chance to test it yet ...

Ok, well, a -4 kernel with it boots normally at least. I'll try
filling /tmp in the morning.

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-03 Thread David Holland

On Wed, Mar 03, 2010 at 03:26:20AM +0900, Masao Uebayashi wrote:
  I want to slowly start breaking down config(5) files
  (sys/conf/files, sys/**/files.*) into small pieces.  The goal is to
  clarify ownership of files; lines like "file aaa.c bbb | ccc" are
  to be changed into "file aaa.c" (ownership) and "ccc: bbb"
  (dependency).

I'm not entirely convinced that makes sense in all cases, but I
suppose it probably does in most.

  Because in the modular world one file belongs to one module.

Perhaps a first step would be using config(1) and files.* to generate
the module makefiles instead of maintaining them by hand...

  Broken config(5) files will be named like module.conf, because files.*
  namespace is insufficient.  For example pci.kmod can't use files.pci.

Huh? I don't understand.

-- 
David A. Holland
dholl...@netbsd.org

Re: msync(2)

2010-03-04 Thread David Holland

On Mon, Feb 22, 2010 at 04:40:58PM -0500, Matthew Mondor wrote:
  After reading the manual page of msync(2), I have the impression that
  if invoked with the MS_SYNC flag, it should be safe enough not to need
  a further fdatasync(2)/fsync_range(2) call afterwards?

That is the theory.

  And how about the metadata?  Would sync(2) be the only true way to
  ensure it's synchronized (considering fsync(2) seems fd-specific)?

What metadata? You can't get to things like the time stamps via mmap.

Granted, in FFS-land someone might have thought it made sense to write
out all the data blocks and not the FS-level metadata that describes
them on disk... but since doing this does not guarantee that the data
can be read back again later, it is not a correct implementation of
msync(2). (Or fdatasync(2) either.)
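
For reference, the pattern being asked about looks roughly like this
in C. It is a sketch under the assumption of a POSIX system; the
helper name store_via_mmap is made up, and error handling is kept to
the minimum needed to show the calls.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes of data at the start of path through a shared
 * mapping, then flush with msync(MS_SYNC). Returns 0 on success,
 * -1 on error.
 */
int
store_via_mmap(const char *path, const void *data, size_t len)
{
    int fd, rv = -1;
    void *p;

    assert(len > 0);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    /* Size the file so the whole mapping is backed by it. */
    if (ftruncate(fd, (off_t)len) == -1)
        goto out;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out;
    memcpy(p, data, len);
    /* MS_SYNC: do not return until the modified pages are written. */
    if (msync(p, len, MS_SYNC) == 0)
        rv = 0;
    if (munmap(p, len) == -1)
        rv = -1;
out:
    close(fd);
    return rv;
}
```

No fsync(2) or fdatasync(2) follows the msync(2) call here; whether
that is enough on a given file system is exactly the "in theory"
caveat above.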

  Also, I am auditing an application which seems to modify mmaped files
  but which does not use msync(2) at all (and I can see that an older
  fsync(2) call was used, but is now commented out).  Should this be
  considered a bug?

Why would it be?

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-04 Thread David Holland

On Fri, Mar 05, 2010 at 01:14:50AM +0900, Masao Uebayashi wrote:
   Perhaps a first step would be using config(1) and files.* to generate
   the module makefiles instead of maintaining them by hand...
  
  cube@ said he did this part long time ago.  The thing is that only
  fixing these tools don't solve all problems magically.  We have to
  fix wrong instances around the tree.

Maybe it should be merged then?

  Broken config(5) files will be named like module.conf, because files.*
  namespace is insufficient.  For example pci.kmod can't use files.pci.
  
  Huh? I don't understand.
  
  Let's see the real examples.  sys/conf/files has this:
  
   file net/zlib.c  (ppp & ppp_deflate) | ipsec | opencrypto | vnd_compression
  
  This means [...]
  We should normalize this as [...]
 
  Now we define a module ppp_deflate which depends on ppp and zlib.  To
  make dependency really work, the depended modules must be already defined.
  To make sure, we have to split files into pieces and include dependencies.
  
  net/zlib.conf

See, this is the part that I don't understand. You're talking about
normalizing logic, which is fine, and making shared files first-class
entities, which is fine too though could get messy.

But then suddenly you jump into splitting up files.* into lots of
little tiny files and I don't see why or how that's connected to what
you're trying to do.

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-08 Thread David Holland

On Mon, Mar 08, 2010 at 10:53:16AM +0200, Antti Kantee wrote:
   (FFS_EI isn't the only such option either, it's just one I happen to
   have already banged heads with.)
  
  This one is easy, no need to make it difficult.

Sure, but as I said it was just an example; what about the next one?

  Things like wapbl are currently an actual problem, since it is multiply
  owned (conf/files *and* ufs/files.ufs).

I don't see why this is a problem either. The way things are right
now, vfs_wapbl.c is conditional on wapbl the same as the rest;
enforcing a hierarchical decomposition by source directory would break
this but that's part of why such hierarchical decompositions are a bad
idea.

(I've also never fully understood why wapbl has to have so many
tentacles hanging out of ffs, either.)

-- 
David A. Holland
dholl...@netbsd.org

Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-11 Thread David Holland

On Wed, Mar 10, 2010 at 11:47:49AM +0900, Masao Uebayashi wrote:
  I wonder what is the best design / implementation of devfs.

none

When you go and do it right it turns into some automount logic and a
tmpfs.

-- 
David A. Holland
dholl...@netbsd.org

Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-11 Thread David Holland

On Thu, Mar 11, 2010 at 01:36:41AM +0900, Masao Uebayashi wrote:
   Well, yes.  But research efforts are like that.  Robustness is pretty
   much necessary for production use but not for the stage this appears to
   be at.
  
  I'm not a researcher.  I'm an engineer.  I like steady move 
  feasible project.

I am a researcher, and my core area of interest is exactly this kind
of problem. If you are looking for a feasible project that can be
relied on to move forward, my honest best recommendation is to pick
something else. :-|

-- 
David A. Holland
dholl...@netbsd.org

Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-14 Thread David Holland

On Sun, Mar 14, 2010 at 03:33:19PM +0900, Masao Uebayashi wrote:
   I did; bus attachments.
  
  If you pay a little more respect to engineers, you'll find this is
  almost same as Iain's saying and what I wrote in the first mail.

huh? he asked me what I meant, I said what I meant...

-- 
David A. Holland
dholl...@netbsd.org

Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-14 Thread David Holland

On Sat, Mar 13, 2010 at 08:02:51AM -0500, der Mouse wrote:
  [st_dev] does not have to correspond, though, to anything else in
  the system.
  Not really, no, but it may as well be the same as what's in st_rdev.
  
  If there still is an st_rdev.  I see no particular reason that needs to
  be preserved.

No, except that it is somewhat useful to be able to identify a device
node (or at least distinguish it from others) and plenty of existing
code expects the st_rdev field to exist. Patching all that is only
worthwhile if it accomplishes some purpose, which it wouldn't really.

   The files in procfs and kernfs are for the most part semantically
   equivalent to real files even when they're virtual or dynamically
   generated.  Devices frequently have other properties.
  
  Disagree.  Writing to real files does not, for example, change the
  system hostname or alter a process's registers.
  
  In fact, that sounds a lot like the kind of dangers that inhere in
  writing to devices indiscriminately, doesn't it?

Yes... and no. There's another sense in which /kern/hostname is the
same as /etc/passwd: both are text files that affect the system
configuration. Changes to both also have immediate operational effects
on the running system. The fact that one is not preserved across
reboots is a negligible difference from the perspective of some
program that might randomly open either.

Unexpectedly opening a tty without being prepared to hang indefinitely
waiting for carrier-detect is a different class of problem. Many
devices also are not like regular files in that you cannot read back
what you write to them; /kern/hostname is again a regular file by that
standard.

I'm not saying that it might not be useful to tag /kern/hostname
somehow (and /etc/passwd too) so that certain classes of programs,
like say mail delivery tools, can categorically refuse to write to
them. But that's kind of a different issue from marking devices...

   [...] devfs might even involve creating [...] (S_IFDEV, say)
   I don't see any point at all in renaming S_IFBLK/S_IFCHR.
  
  In terms of the end state achieved, neither do I.  But there can be
  value in that programs that haven't been ported are more likely to
  misbehave if they see a name (by which I mean S_IFCHR and S_IFBLK)
  they think they know the semantics of but with different semantics than
  if they encounter something they don't recognize.

True, but the semantics that can be expected in practice of S_IFCHR
and S_IFBLK are very limited - most (but not all) S_IFCHR objects
won't seek, for example, and S_IFBLK objects generally require aligned
I/O and have a fixed size, but there are few other expectations. Which
is, after all, why we mark devices as devices; they don't necessarily
behave as regular files can reasonably be expected to.

   [...], and any new device type would have pretty much the same
   semantics anyway.
  
  In some respects.  But lurking under all this has been doing away with
  st_rdev, which for some programs is a radical enough departure that a
  new name is deserved.  (Others won't care, but I suspect most of them
  don't go looking at st_mode.)

Well, no, we're doing away with a specific interpretation of the
contents of st_rdev. Getting rid of st_rdev itself doesn't serve much
further purpose.

One can identify (most) programs that are going to try to interpret
st_rdev the old way by getting rid of the major() and minor() macros.
(IMO, any program that slices up dev_t on its own without using those
deserves the consequences.)
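
To make that concrete, here is a rough userland sketch of a program
that goes through the macros instead of shifting and masking dev_t by
hand. Nothing here comes from the tree: describe_rdev is a made-up
name, and the <sys/sysmacros.h> include is an assumption for
glibc-based systems (on NetBSD the macros come with <sys/types.h>).

```c
#include <sys/types.h>
#include <sys/stat.h>
#ifdef __linux__
#include <sys/sysmacros.h>	/* assumption: glibc keeps major()/minor() here */
#endif
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: report the device numbers behind a device node.
 * Returns 0 on success, -1 if the path can't be stat'd or isn't a device.
 */
int
describe_rdev(const char *path, unsigned long *maj, unsigned long *min)
{
	struct stat st;

	assert(path != NULL && maj != NULL && min != NULL);
	if (stat(path, &st) == -1)
		return -1;
	if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
		return -1;
	/* Let the system's macros interpret dev_t; don't slice it up here. */
	*maj = (unsigned long)major(st.st_rdev);
	*min = (unsigned long)minor(st.st_rdev);
	return 0;
}
```

A program written this way stops compiling if the macros go away,
which is the point: it gets noticed instead of silently
misinterpreting the field.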

   (1) Attaching a device into devfs and attaching a fs into the fs
   namespace are fundamentally the same operation.
   Only at a very general level, [...] at that level open(,O_CREAT,)
   also qualifies.  So [does mknod()]
   Those are different in a fairly basic way: they create an object
   within an existing filesystem namespace, as opposed to binding a
   foreign object into the namespace.
  
  I'm not sure I'd call a filesystem a foreign object.  If that's fair,
  then the filesystem namespace is _all_ foreign objects, and the
  foreign adjective no longer really means anything there.

They are foreign to the filesystem they're being attached into? Maybe
the choice of words isn't so great, but there's a real difference
involved.

   A traditional device node is also a binding of a foreign object, but
   it does it by creating a proxy object in an existing filesystem.
  
  I'm not sure how fair it is to call it a proxy object, any more than
  an S_IFREG inode is a proxy for the big array of bytes (stored
  elsewhere on the disk) that make up the file's contents.

But that big array is part of the conceptual entity that the inode
represents. The driver pointed to by a device special file is not part
of anything in the filesystem.

   Devfs schemes that don't abolish the proxy tend to get in trouble
   because it's too many layers of indirection.  (This is not the only
   problem, but it's *a* problem.)  Devfs schemes that

Re: config(5) break down

2010-03-16 Thread David Holland

On Tue, Mar 16, 2010 at 06:50:29PM +0100, Zafer Aydoğan wrote:
  I'm wholeheartedly behind Julio's statement.
  Users should never have to rebuild anything.

Er, why?

Users should never have to perform complex unautomated procedures,
because such procedures can easily be screwed up and recovery becomes
difficult or impossible.

But recompiling things isn't a complex unautomated procedure, it's a
complex automated procedure, and not really that much different from
other complex automated procedures like binary updates.

Nor is it necessarily slow; building a kernel doesn't take any longer
than booting Vista...

-- 
David A. Holland
dholl...@netbsd.org

Re: build time (was: config(5) break down)

2010-03-17 Thread David Holland

On Wed, Mar 17, 2010 at 07:48:32PM +0100, Edgar Fuß wrote:
  DH Nor is it necessarily slow; building a kernel doesn't take any longer
  DH than booting Vista...
  EH Maybe on your machine.  On mine it's still quite a bit slower than just
  EH editing a config file.
  Which gives you a totally new boot from source option.

That's not new, it's called /vmunix.el and was invented a long time
ago, just never perfected :-)

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-17 Thread David Holland

On Wed, Mar 17, 2010 at 11:10:59AM -0500, Eric Haszlakiewicz wrote:
  On Tue, Mar 16, 2010 at 08:01:31PM +, David Holland wrote:
   But recompiling things isn't a complex unautomated procedure, it's a
   complex automated procedure, and not really that much different from
   other complex automated procedures like binary updates.
  
   The difference here is that a binary update is changing one particular
  machine and updating some other machine obviously won't have the intended
  effect, but recompiling things does the exact same thing regardless of
  where you do it, so having multiple people do it seems like a waste of
  time.

That's a red herring; applying a binary patch does the same thing
everywhere, and recompiling updates one particular machine in exactly
the same sense too. The difference is in what material is distributed
and how and where it's processed. This is not something end users are
going to care about much - at most they'll care about how long it
takes.

Which is a valid concern, of course, especially in extreme examples
like building firefox on a sun3, but it's *not* a usability issue in
the same sense that e.g. incomprehensible error messages are.

Admittedly, neither CVS nor our build system is quite robust enough to
make this really work, but in practice tools like apt-get and yum
aren't quite, either.

Anyhow, it seems to me that a blanket statement "nobody should ever
have to recompile anything" requires some justification; however,
people have been taking it as an axiom lately and that concerns me a
bit.

   Nor is it necessarily slow; building a kernel doesn't take any longer
   than booting Vista...
  
  Maybe on your machine.  On mine it's still quite a bit slower than just
  editing a config file.

Well sure, but that just means we're way ahead of the competition,
since in Windows editing that config file generally requires a
reboot. :-)

-- 
David A. Holland
dholl...@netbsd.org

Re: [gsoc] syscall/libc fuzzer proposal

2010-03-20 Thread David Holland

On Sat, Mar 20, 2010 at 01:54:49PM -0400, Elad Efrat wrote:
 Thor Lancelot Simon wrote:
 If not, I don't think this adds any benefit to your proposal and is likely
 to simply be a distraction; I'd urge you in that case to drop it.

 Strongly seconded. There are so many great ways to improve NetBSD and
 wasting time and money on fuzzing is about as suboptimal as it gets.

Um.

First of all, that's not what Thor said; second of all, you really
should not be telling potential gsoc students that their project ideas
are flatly worthless, even if your judgment were correct; and third,
I'm rather surprised that anyone who claims to work on security would
call testing and analysis tools worthless.

Let's try not to scare everyone off, ok?

-- 
David A. Holland
dholl...@netbsd.org

Re: [gsoc] syscall/libc fuzzer proposal

2010-03-20 Thread David Holland

On Sat, Mar 20, 2010 at 12:40:12PM -0400, Thor Lancelot Simon wrote:
  As a part of my work I would like to write a translator for C
  language and a small library. Their goal would be to detect
  integer overflows, stack overflows, problems with static array
  indexing, etc (when such occur during the program execution). It
  will enable me to uncover more bugs in the software.
  
  What is the benefit of this when compared to existing static-analysis
  tools such as Coverity Scan, splint, or the Clang static analyzer?  Will
  this cover any cases they don't?  If so, which ones?

AIUI from chat, the idea is to increase the probability that if the
testing causes something bogus to happen, the bogus behavior will
result in an easily identifiable abort.

This seems like a valid line of reasoning; all the same, implementing
such a tool is a fairly big (and annoying) pile of grunt work. Plus
various variations on it have been done before. (Some of which might
be worth looking into, actually.)

-- 
David A. Holland
dholl...@netbsd.org

Re: panic: ffs_valloc: dup alloc

2010-03-20 Thread David Holland

On Sat, Mar 20, 2010 at 10:29:44PM +1030, Brett Lymn wrote:
  I have given up on suspending because my filesystems would be
  corrupted with monotonous regularity.  The chances of a corruption
  seems to increase with the amount of disk activity happening on
  suspend.   It seems like something is not being flushed (or not being
  marked as flushed) when the suspend happens.

We don't support suspend-to-disk, right? So the contents of kernel
memory are supposed to be preserved in this suspend? Because if so,
unflushed buffers shouldn't matter. One would think.

That suggests that something is flushing buffers to a device that's
suspended and it's throwing them away instead of rejecting them or
panicking.

Does stuffing a couple sync calls somewhere before it starts
suspending devices (wherever that is, I don't know) make the problems
go away?

-- 
David A. Holland
dholl...@netbsd.org

Re: [gsoc] syscall/libc fuzzer proposal

2010-03-20 Thread David Holland

On Sat, Mar 20, 2010 at 03:40:33PM -0400, Elad Efrat wrote:
  If not, I don't think this adds any benefit to your proposal and
  is likely to simply be a distraction; I'd urge you in that case
  to drop it.
 
  Strongly seconded. There are so many great ways to improve NetBSD and
  wasting time and money on fuzzing is about as suboptimal as it gets.
 
  Um.
 
  First of all, that's not what Thor said;
  
  Sorry? Are you saying that me agreeing with Thor that unless this
  proposal shows some clear advantage over what we already have --
  specifically Coverity Scan -- it should probably be dropped is not
  what Thor said?

He was talking about the bounds-checking translation tool part. You
were attacking the entire thing.

   second of all, you really
   should not be telling potential gsoc students that their project ideas
   are flatly worthless, even if your judgment were correct;
  
  I said exactly what I think

Which was tactless and rude. If someone comes along with an idea
that's basically a waste of time, they should be gently steered
towards something else. Students don't always have good ideas; that's
why they need mentoring and advising, but you don't mentor and advise
very effectively by being hostile and dismissive.

Also, outside of the specific gsoc context, we have a long-standing
custom in this project to not tell other people what to spend their
time on or what is and isn't valuable.

   and third,
   I'm rather surprised that anyone who claims to work on security would
   call testing and analysis tools worthless.
  
  I don't *claim* anything, David; I *work*, at least as opposed to,
  say, assigning bugs to me, claiming for years I'll do something about
  them (together with many other grand ideas) and instead fix, I dunno,
  whitespace and grammar issues. Take your preaching elsewhere; I
  couldn't care less.

Is that what you think I do? (And if so, do you really want to get
into ad hominems? You're on fairly shaky ground.)

  As for the issue at hand, well, I suggest you look at what the
  proposal is, what we already have for years, and draw your own
  conclusions.

Yes, I have; it needs to be fleshed out into a real project proposal
(as is to be expected at this stage, after all) but I don't see
anything inherently wrong with it so far.

-- 
David A. Holland
dholl...@netbsd.org

Re: panic: ffs_valloc: dup alloc

2010-03-20 Thread David Holland

On Sat, Mar 20, 2010 at 04:06:32PM -0400, Steven Bellovin wrote:
   That suggests that something is flushing buffers to a device that's
   suspended and it's throwing them away instead of rejecting them or
   panicking.
  
  Possibly

Although it doesn't quite make sense, because in most cases this could
only corrupt the fs if the same block was left untouched afterwards
for long enough for the (allegedly) clean buffer to be discarded, and
that shouldn't cause a panic right after resume. Unless the fs was
already broken from a previous suspend, I guess.

Maybe there's suspend code somewhere that writes out and also discards
buffers in the hopes of cleaning up for some future suspend-to-disk
work? Could be, I guess, but I'd tend to think not. I ought to go look
at the code but I don't think I have time for that this weekend. :-|

   Does stuffing a couple sync calls somewhere before it starts
   suspending devices (wherever that is, I don't know) make the problems
   go away?
 
  No -- I've had a sync call in my suspend script for years.  More
  precisely, at the moment it's
  
   sync; sleep 1
  
  to let things flush.  No joy.

That might not be late enough; I was thinking of inside the kernel.

  Of course, rejecting them wouldn't seem to do any good; what's
  needed, I suspect, is for the device to queue them (as usual) but
  not fire up the disk when in suspending mode.

Or for the writes to not be issued at all until after resume.

ISTM it must be either the syncer firing at the wrong time or
something's gotten out of order in the suspend sequencing.

-- 
David A. Holland
dholl...@netbsd.org

Re: panic: ffs_valloc: dup alloc

2010-03-20 Thread David Holland

On Sat, Mar 20, 2010 at 05:03:16PM -0400, Steven Bellovin wrote:
   Let me see if I can find my first note on the subject -- it might
   give a clue about the date of any changes.
  
  Turns out that I sendpr-ed it in September: kern/42104.

I even responded to the PR, not that I had any useful ideas at the
time.

That sounds like maybe the problem is not on the suspend side but on
the resume side, that is, that stuff is being written out before (some
layer of) the disk subsystem is ready to go again. With vanilla FFS
such writes should be synchronous so it should be (relatively) easy to
figure out what's going on. Do you feel like trying out dtrace? :-)

On the other hand, if fsck thinks the inode for a named pipe is
unallocated (or particularly, has duplicate blocks, since pipes
shouldn't have blocks at all)... that means that whatever went wrong
went wrong when the pipe was created, not when something exited and
removed it. And with vanilla ffs, those are synchronous writes and
they should happen in quick succession; if the inode didn't get
written but the directory did, something's more badly wrong than just
the disk not being ready yet. And I strongly suspect that the pipe
creation isn't tied to suspending, that is, the pipe should have been
created long before you suspended and should not in general be removed
and recreated by suspending. And that means either something is
severely wrong in general and you're only seeing it after crashing due
to suspend (which is possible, but seems not too likely) or the
suspend cycle is actively writing garbage and corrupting the fs.

Meanwhile, getting traps while dumping is Very Strange (TM). Do we
have any kind of debug code that can checksum memory before and after
the suspend? I wonder if something ACPI-related is garbaging memory.

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-24 Thread David Holland

On Thu, Mar 25, 2010 at 01:14:51AM +0900, Masao Uebayashi wrote:
     (Besides, it's not necessarily as flat as all that, either.)
   
     It's necessary to be flat to be modular.
  
   Mm... not strictly. That's only true when there are diamonds in the
   dependency graph; otherwise, declaring B inside A just indicates that
   B depends on A. Consider the following hackup of files.ufs:
  
  There're diamonds (for example, ppp-deflate depends on ppp and zlib).

Sure. But mostly there aren't.
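
For what it's worth, a crude way to see how often nesting falls short
is to count the entities that have more than one direct dependency.
The sketch below does only that; it is not real diamond detection and
has nothing to do with how config(1) represents things. struct module,
MAXDEPS and count_multi_dep are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAXDEPS 4

/* Hypothetical, much-simplified stand-in for a config(5) entity. */
struct module {
	const char *name;
	const char *deps[MAXDEPS];	/* direct dependencies, NULL-padded */
};

/*
 * Count the modules that have more than one direct dependency.  Those are
 * the ones that cannot be expressed purely by nesting a declaration inside
 * a single parent.
 */
size_t
count_multi_dep(const struct module *mods, size_t nmods)
{
	size_t i, j, n, count = 0;

	assert(mods != NULL || nmods == 0);
	for (i = 0; i < nmods; i++) {
		n = 0;
		for (j = 0; j < MAXDEPS && mods[i].deps[j] != NULL; j++)
			n++;
		if (n > 1)
			count++;
	}
	return count;
}
```

Fed a table where ffs depends on ufs and ppp_deflate depends on both
ppp and zlib, only ppp_deflate gets counted; everything else could be
written as a nested declaration.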

   [...]
   module UFS [...]
   module FFS [...]
   module MFS [...]
   module EXT2FS [...]
   module LFS [...]
  
  In this plan, what *.kmod will be generated?

The ones declared? Or one big one, or one per source file, or whatever
the blazes you want, actually...

   I'm perfectly happy to rework the parser to support syntax like the
   above if we can all agree on what it should be.
  
  So you're proposing a syntax change without understanding the
  existing syntax? (You don't know what braces are for, you didn't
  know define behavior, ...)  I have to say that your proposal is
  not convincing to me...

Um. I know perfectly well that config currently uses braces for
something else. That's irrelevant. There's no need to use braces for
grouping; it just happens to be readily comprehensible to passersby.
There's an infinite number of possible other grouping symbols that can
be used, ranging from   to (! !) or even things like *( )*.
Furthermore, the existing use of braces can just as easily be changed
to something else if that seems desirable.

There's a reason I said "syntax like the above" and "if we can all
agree on what it should be". That wasn't a concrete proposal, it
wasn't meant to be a concrete proposal, no concrete proposal is
complete without an analysis of whether the grammar remains
unambiguous, and nitpicking it on those grounds is futile.

You seem to be completely missing the point.

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-24 Thread David Holland

On Fri, Mar 19, 2010 at 02:49:37PM +, Andrew Doran wrote:
   I *do* think it's a useful datapoint to note that sun2, pmax, algor, etc.
   are never, ever downloaded any more.
  
  Right, and these dead ports must be euthanized.  The mountain of
  unused device drivers and core kernel code is a significant hindrance to
  people working in the kernel.

Speaking from the point of view of repeatedly touching every namei
call anywhere in the kernel... I'd have to disagree. Sure, it'd go
faster if there weren't a pile of legacy binary compat implementations
or if we removed all the mostly-unused fses, but if I wanted a toy
kernel I already have a pile of those in the office. Most of the
issues that the dead ports or fses trigger are real design or
structural problems that would be only masked, not resolved, by
removing that code. Supporting all the random bells and whistles that
e.g. compat_svr4 wants from namei is part of doing it correctly, and
having the correct infrastructure in place that can support these
things is important because the need/desire/demand will come along
again; it always does. For example, the $ORIGIN thing in ld.elf_so is
actually the same as one of the annoying cases in (IIRC) compat_svr4...

I know we don't exactly see eye to eye on these issues but perhaps we
can reach some kind of middle ground?

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-25 Thread David Holland

On Thu, Mar 25, 2010 at 06:22:17PM +0900, Masao Uebayashi wrote:
% grep ':.*,' sys/conf/files | wc -l
86
   
   And? I don't understand your point. There are a lot more than 86
   entities in sys/conf/files.
  
  There are many instances where modules have multiple dependencies.

And? I still don't understand your point.

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-28 Thread David Holland

On Fri, Mar 26, 2010 at 10:24:02AM +0900, Masao Uebayashi wrote:
  (Honestly, I see benefit to not convincing you; objection only from
  dholland@ sounds more convincing to me than no objections.)

Um. I'm sorry you think that. I guess there is no point continuing
this discussion, then. Or much of any other.

-- 
David A. Holland
dholl...@netbsd.org

Re: config(5) break down

2010-03-28 Thread David Holland

On Fri, Mar 26, 2010 at 01:45:51PM +, Andrew Doran wrote:
  I'm speaking of low level kernel code and driver drivers, areas that to 
  date you have had relatively little involvement in.

That's not entirely true, but fair enough. 

  I will however consider discussing the points you raise if/when I
  launch a jihad against emulations and file system code.

I thought you already had! :-p :-)

-- 
David A. Holland
dholl...@netbsd.org

$ORIGIN (was: Re: make: ensure ${.MAKE} works)

2010-04-20 Thread David Holland

On Thu, Apr 15, 2010 at 08:40:19AM +, David Holland wrote:
  Wish we had working $ORIGIN...
  
  We will fairly soon, I think... :-)

To wit: as far as I can tell, having been wading around in that code
recently, the only problem with what we have is that if the path sent
back by namei isn't absolute it needs a getcwd() stuck on the front of
it.

Is it reasonable to just do that? I don't think calling getcwd() from
exec is going to cause locking problems, but it might be more overhead
than we want to swallow.

-- 
David A. Holland
dholl...@netbsd.org

Re: $ORIGIN (was: Re: make: ensure ${.MAKE} works)

2010-04-21 Thread David Holland

On Wed, Apr 21, 2010 at 08:58:31AM -0400, Christos Zoulas wrote:
  | Is it reasonable to just do that? I don't think calling getcwd() from
  | exec is going to cause locking problems, but it might be more overhead
  | than we want to swallow.
  
  The code that we have there works fine now, yamt objected about it because
  strictly speaking the path could have been evicted from the namei() cache,
  but this never happens since the call is immediately after. Calling getcwd
  will add overhead and it is not really necessary because we already did
  resolve the path.

If you exec ../bin/foo, that's all namei will resolve or touch, and
that's the string that'll come back from namei. If we want an absolute
path out, it needs getcwd, either in exec or in namei... and in exec
is probably preferable.

If we really want to support the feature we need to either buy into
that overhead, or inspect the binary in some fashion to only do it in
cases where it's going to be used.

AFAICT getcwd should be no more expensive than vnode_to_path if the
parentage of the current directory is in the name cache, which should
be the common case.

-- 
David A. Holland
dholl...@netbsd.org

Re: $ORIGIN (was: Re: make: ensure ${.MAKE} works)

2010-04-21 Thread David Holland

On Wed, Apr 21, 2010 at 01:22:12PM -0400, Christos Zoulas wrote:
  | If you exec ../bin/foo, that's all namei will resolve or touch, and
  | that's the string that'll come back from namei. If we want an absolute
  | path out, it needs getcwd, either in exec or in namei... and in exec
  | is probably preferable.
  
  That's right, and it affects in my opinion < 5% of the invocations,
  since the majority of the execs are done via the shell and the shells
  pass absolute paths to exec for commands that don't contain '/'.

Right.

  | If we really want to support the feature we need to either buy into
  | that overhead, or inspect the binary in some fashion to only do it in
  | cases where it's going to be used.
  | 
  | AFAICT getcwd should be no more expensive than vnode_to_path if the
  | parentage of the current directory is in the name cache, which should
  | be the common case.
  
  That's what vnode_to_path() does (it calls getcwd), so the cost is
  the same.

I had convinced myself it was supposed to fail if it had to look
outside the cache, but that's only true for the first step.

  I think what you propose is to call something like a
  kernel realpath(path) and use this to set $ORIGIN which is fine
  with me. I did not do it because I did not want to deal with path
  canonicalization (eliminating ../.././// from the path, but I guess
  that getcwd() does this for you if you call it with the full path?).

namei can already do enough of this to get by on (see for example
svr4_sys_resolvepath() in sys/compat/svr4/svr4_misc.c) and exec is
already using this (mis?)feature.

For the time being what we can do is take the path sent back from
namei, and if it's not absolute call getcwd and graft that onto the
front. This will in general yield a partially realpath'd path but I
don't think anyone will care.

In the long run I think a fully realpath'd path can be arranged,
either by calling getcwd first and handing the results to namei to
grind on, or by explicitly compacting any ..'s that appear in the
front of the namei result. I sort of favor the first because it makes
it possible to handle the emulation root properly, I think, but this
can be discussed later on.
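
The second option, compacting the ..'s, could look roughly like the
following if done purely on the string. This is a hedged userland
sketch with an invented name (compact_path). Being purely textual, it
gets the wrong answer when a component in front of a ".." is a
symlink, which is one reason handing the path back to namei may be the
better route.

```c
#include <assert.h>
#include <string.h>

/*
 * Lexically squeeze "//", "/./" and "/../" out of an absolute path, in
 * place.  Purely textual: it never looks at the file system.
 */
void
compact_path(char *path)
{
	char *src = path, *dst = path;

	assert(path != NULL && path[0] == '/');
	while (*src != '\0') {
		size_t len;

		while (*src == '/')		/* skip a run of slashes */
			src++;
		len = strcspn(src, "/");	/* length of next component */
		if (len == 0)
			break;
		if (len == 1 && src[0] == '.') {
			/* "." contributes nothing */
		} else if (len == 2 && src[0] == '.' && src[1] == '.') {
			/* ".." drops the previous component, if any */
			while (dst > path && *--dst != '/')
				continue;
		} else {
			*dst++ = '/';
			memmove(dst, src, len);
			dst += len;
		}
		src += len;
	}
	if (dst == path)
		*dst++ = '/';			/* everything cancelled: root */
	*dst = '\0';
}
```

The output is never longer than the input, so rewriting in place is
safe; a ".." at the root simply stays at the root.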

-- 
David A. Holland
dholl...@netbsd.org

Re: sysctl node names (Re: CVS commit: src/sys/uvm)

2010-04-27 Thread David Holland

On Fri, Feb 19, 2010 at 06:42:03AM -0500, der Mouse wrote:
  I'd say it's a question of whether you think of them as input to the
  kernel, commands ("enable this"), or as output from the kernel,
  reporting state ("this is enabled").
  
  Of course, in most cases, they're actually both, so that doesn't help
  much.  But, as a native anglophone, it's what the difference between
  "enable" and "enabled" here feels like to me.

Well, the sysctl tree is a presentation of the state. Changing it is a
command. So I think it should be "enabled".

We could also avoid this problem by using "on" instead, but that's
probably not a great idea.

-- 
David A. Holland
dholl...@netbsd.org

Re: $ORIGIN (was: Re: make: ensure ${.MAKE} works)

2010-05-02 Thread David Holland

On Wed, Apr 28, 2010 at 02:57:47PM -0400, der Mouse wrote:
  To wit: as far as I can tell, having been wading around in that code
  recently, the only problem with what we have is that if the path
  sent back by namei isn't absolute it needs a getcwd() stuck on the
  front of it.
  
  Is it reasonable to just do that?
  
  I don't think so.  It would be a regression in that it would break
  things in no-path-up-to-/ situations; it also would either fail or
  expose paths that shouldn't be accessible in path-to-/-isn't-readable
  situations.

Does anyone know how other implementations of $ORIGIN deal with these
cases?

For the time being, even if we just provide a relative path when
getcwd fails it'd still be more functional than the current situation.

   Why not get the kernel to keep a reference to the vnode of the
   directory that contained the process image?
  
   Then use some flag to open() (or similar) to open a file relative to
   that vnode?

Someone already invented $ORIGIN with absolute paths and we ought to
support that, especially since we already have a half-baked
implementation.

I've often thought something like this (keeping the directory) would
be a decent approach but it's not clear how to set it up, or e.g. what
ought to be done if the executable is moved or other such cases.

  Hacking on namei() just to get an absolute path in $(.MAKE)? 

It's not just make; see for example PR 42420.

-- 
David A. Holland
dholl...@netbsd.org

Re: WAPBL and IDE mac68k

2010-06-03 Thread David Holland

On Tue, Jun 01, 2010 at 08:31:56AM -0400, der Mouse wrote:
   It happens even when I try to boot to single user mode because I see
   the message saying "/: replaying log to memory" right before it
   panics.  Not sure why the journaling stuff happens when booting in
   single user mode without mounting any filesystems, but that's what it
   is.
  
  You can't boot without mounting any filesystems at all; / must be
  mounted, at a minimum.

You can boot as far as "panic: cannot mount root", though, by e.g.
disabling wd.

Anyway, what's the panic message?

-- 
David A. Holland
dholl...@netbsd.org

Re: Layered fs, vnode locking and v_vnlock removal

2010-06-03 Thread David Holland

On Wed, Jun 02, 2010 at 05:58:40PM +0100, David Laight wrote:
In the long term VOP_xxxLOCK() should become part of the file systems.
   
   AFAIK there is a consensus between yamt@, ad@ and thorpej@ that
   locking should be moved down to the filesystems.
   There was some discussion about it here some time before.

Yes, this keeps coming up and I keep trying to explain why it's
misguided.

  There is a lurking problem making read/write atomically update the file
  offset. I suspect that is currently covered by the vnode lock.
  Might only affect O_APPEND - but I've seen systems get that wrong!
  
  Not to mention the problem of correctly setting the file position
  when read/write fault on a userspace address part way through a
  transfer.

Other important cases include atomicity of O_CREAT and permission
checks done in VFS-level code.

These cases can all be handled by cutting and pasting the code into
every file system, but we really don't want to do that.
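
For reference, this is the guarantee as seen from userland that the
VFS-level code has to uphold regardless of where the locking ends up.
It is a minimal sketch with a made-up helper name (create_exclusive)
and says nothing about how any particular file system implements it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create path only if it does not already exist.  O_CREAT|O_EXCL makes
 * the existence check and the creation a single step as far as other
 * processes can tell.  Returns an open descriptor, or -1 with errno set
 * (EEXIST if somebody else got there first).
 */
int
create_exclusive(const char *path)
{
	assert(path != NULL);
	return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

If two callers race on the same name, exactly one should get a
descriptor back and the other EEXIST; no interleaving should let both
succeed.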

-- 
David A. Holland
dholl...@netbsd.org

Re: Layered fs, vnode locking and v_vnlock removal

2010-06-03 Thread David Holland

On Tue, Jun 01, 2010 at 11:44:03AM +0200, Juergen Hannken-Illjes wrote:
   It's not immediately clear how either of these ought to work, so
   I'm concerned that making the infrastructure less general will
   lead to problems.
  
  1) One upper to many lower vnodes
 This is a file system like unionfs.  It has to lock either one or
 many lower vnodes and does/will not earn anything of shared locks.
  
  2) Many upper to one lower vnode
 Such a layered file system could use a lock shared between ALL
 upper and the lower vnode.  Always taking the lower vnode's lock
 will do the same.  I see no need for shared locks here.

That seems plausible.

   In the long run I intend to make all the vnode ops symmetric with
   respect to locking, which should make a lot of this less toxic, but at
   the rate I've been able to work on this stuff we won't be there
   anytime soon.
  
  The asymmetry comes from functions like null_mount() where a vnode gets
  locked by the lower layer and unlocked by the upper layer.  A lower
  layer expecting its VOP_LOCK() to be matched by a VOP_UNLOCK() will
  fail badly.

...that is just broken, yes. If you can beat sanity into that, please do.

-- 
David A. Holland
dholl...@netbsd.org

Re: wedges on vnd(4) patch

2010-06-22 Thread David Holland

On Mon, Jun 21, 2010 at 06:23:02PM -0400, Christos Zoulas wrote:
  Well, I find the different indentation styles typically used for the braces
  clumsy and not following the standard. Or even when they do, they cause the
  code to move too much to the right:

FWIW, I prefer this:

        switch (c) {
            case 'a':
                {
                        decl;

                        stmt;
                }
                break;
        }

because I've never liked not indenting the cases and half an indent
seems a good way to do that. But that is also at variance with current
rules.

Whether variables should be at the top of the function or not depends
heavily on how big and complicated the blocks are. (And when they are
big and complicated, splitting each case into its own function is also
not always desirable.)

-- 
David A. Holland
dholl...@netbsd.org

Re: Move the vnode lock into file systems

2010-06-26 Thread David Holland

On Sat, Jun 26, 2010 at 10:39:27AM +0200, Juergen Hannken-Illjes wrote:
  The vnode lock operations currently work on a rw lock located inside the
  vnode.  I propose to move this lock into the file system node.
  
  This place is more logical as we lock a file system node and not a vnode.
  This becomes clear if we think of a file system where one file system node
  is attached to more than one vnode.  Ptyfs allowing multiple mounts is such
  a candidate.

I'm not convinced that sharing locks at the VFS level is a good idea
for such cases. While ptyfs specifically is pretty limited and
unlikely to get into trouble, something more complex easily could. I
don't think we ought to encourage this until we have a clear plan for
rebind mounts and the resulting namespace-handling issues. (Since
in a sense that's the general case of multiply-mounted ptyfs.)

Since I'm pretty sure that a reasonable architecture for that will
lead to sharing vnodes between the multiple mountpoints, and
manifesting some kind of virtual name objects akin to Linux's dentries
to keep track of which mountpoint is which, I don't see that it's
necessary or desirable to specialize the locking...

Do you have another use case in mind?

(In the absence of some clear benefits I don't think it's a
particularly good idea to paste a dozen or two copies of genfs_lock
everywhere. But folding vcrackmgr() into genfs_lock and genfs_unlock
seems like a fine idea.)

-- 
David A. Holland
dholl...@netbsd.org

Re: Move the vnode lock into file systems

2010-06-27 Thread David Holland

On Sun, Jun 27, 2010 at 06:18:19PM +0200, Juergen Hannken-Illjes wrote:
   (In the absence of some clear benefits I don't think it's a
   particularly good idea to paste a dozen or two copies of genfs_lock
   everywhere. But folding vcrackmgr() into genfs_lock and genfs_unlock
   seems like a fine idea.)
  
  Primary goal is to abstract vnode locking into the vnode operations
  only and therefore completely removing vlockmgr().

That I can agree with :-)

  For now I can live with genfs_lock()/v_lock becoming the generic
  locking interface where v_lock becomes genfs_lock()-private.

and that I have no objection to.

Vaguely related to which, does anyone object to wrapping VOP_UNLOCK in
a vn_unlock() function (doing nothing extra), so as to be symmetric
with vn_lock()? I think I've mentioned this before, but I'm not sure,
and if so it was a while back...

-- 
David A. Holland
dholl...@netbsd.org

Re: Preserving early console output (pre-Copyright stuff)

2010-07-05 Thread David Holland

On Thu, Jul 01, 2010 at 05:18:36AM -0700, Paul Goyette wrote:
 b) a way to pause long enough to manually transcribe the output?  (A  
 simple timed delay would work, although a Press any key to continue  
 would be easier!)

It may work to do

   printf("Press a key...\n");
   cnpollc(1);
   (void)cngetc();
   cnpollc(0);

... it used to, but that was ~15 years ago.

-- 
David A. Holland
dholl...@netbsd.org

Re: Using coccinelle for (quick?) syntax fixing

2010-08-14 Thread David Holland

On Sat, Aug 14, 2010 at 01:36:15PM +0200, Jean-Yves Migeon wrote:
   I would say don't do __func__ for messages like this; it doesn't
   really serve much purpose vs. typing in a name, causes the observable
   behavior to change silently if the code gets reorganized, and makes it
   much harder to grep the source for an error message you're seeing.
  
  Understood

not a big deal though.

   Changing it to use aprint_whatnot_foo(), though, might be worthwhile...
  
  Hmm, I thought aprint_* functions were essentially for autoconf messages?

Erm... um, yeah, never mind, I was thinking of the way they
automatically rope in the device name and unit number.

-- 
David A. Holland
dholl...@netbsd.org

Re: [ANN] Lunatik -- NetBSD kernel scripting with Lua (GSoC project

2010-10-11 Thread David Holland

On Tue, Oct 12, 2010 at 12:53:10AM -0300, Lourival Vieira Neto wrote:
 A signature only tells you whose neck to wring when the script
 misbehaves. :-) Since a Lua script running in the kernel won't be
 able to forge a pointer (right?), or conjure references to methods or
 data that weren't in its environment at the outset, you can run it
 in a highly restricted environment so that many kinds of misbehavior
 are difficult or impossible.  Or I would *think* you can restrict the
 environment in that way; I wonder what Lourival thinks about that.
   
I wouldn't say better =). That's exactly how I'm thinking about
addressing this issue: restricting access to each Lua environment. For
example, a script running in packet filtering should have access to a
different set of kernel functions than a script running in process
scheduling.
  
   ...so what do you do if the script calls a bunch of kernel functions
   and then crashes?
  
  if a script crashes, it raises an exception that can be caught by the
  kernel (as an error code)..

Right... so how do you restore the kernel to a valid state?

-- 
David A. Holland
dholl...@netbsd.org

Re: kernel module loading vs securelevel

2010-10-15 Thread David Holland

On Sat, Oct 16, 2010 at 11:23:29AM +0900, Izumi Tsutsui wrote:
   It would seem to be intentional.  After all, kernel modules can
   do all sorts of nasty things if they want to.
  
  In that case, module autoload/autounload is not functional at all and
  we have to specify all possible necessary modules explicitly
  during boot time??

Yes. Otherwise it's quite easy to defeat securelevel by causing the
loading of a module that resets it to -1.

-- 
David A. Holland
dholl...@netbsd.org

Re: kernel module loading vs securelevel

2010-10-16 Thread David Holland

On Sun, Oct 17, 2010 at 03:38:42AM +0900, Izumi Tsutsui wrote:
 Heh, then why have we had it on i386 for years?
   
   Because of the X server.
  
  You are just saying:
  We introduced a significant security regression just for our own
  convenience.

Perhaps...

  I see no proper reason to avoid INSECURE for MODULAR if it's okay for X.

...and I'm not convinced of this, primarily because (from a practical
point of view) X is unavoidable and unfixable, whereas modules are
neither.

This gets back to the underlying question of what purpose modules are
supposed to serve, and as I know everyone knows what I think and is
sick and tired of hearing about it, I'll pipe down.

-- 
David A. Holland
dholl...@netbsd.org

Re: kernel module loading vs securelevel

2010-10-16 Thread David Holland

On Sat, Oct 16, 2010 at 12:03:52PM -0700, Paul Goyette wrote:
   autoload/autounload does NOT perform any authorization checks -
   please look at the code!  No checking of securelevel occurs, as far
   as I can see.  For autoload, the module name must not contain a
   '/', so if the module is being loaded from the file system it must
   be loaded from the blessed /stand/${ARCH}/${VERSION}/modules
   directory.  Including the INSECURE option will have no effect on
   autoloading of modules.
  
  If this is true it makes securelevel useless; all you need to do is
  put a hostile module in the right place and cause it to be autoloaded.
  (Remember the point of securelevel is that even root can't lower it.)
  
  John Nemeth has already pointed out that my reading of the code was
  flawed.  Module autoloading _does_ call kauth for authorization.
  The kauth listener provided by the module subsystem returns ALLOW
  for all autoload calls, but this gets overridden by another kauth
  listener, so autoload still gets denied.

Good that it's not true then :-)

  It should be sufficient, I think, to check at boot time that any
  module that can be autoloaded is marked immutable.
  
  And also make the blessed directory itself immutable?  :)

As I recall the semantics of immutable are such that this isn't
necessary to protect modules that are present at boot time (that is,
they can't be unlinked/renamed/etc.), and if there are autoloadable
modules whose names aren't present at boot time, they'll fail the
check.

-- 
David A. Holland
dholl...@netbsd.org

Re: kernel module loading vs securelevel

2010-10-16 Thread David Holland

On Sun, Oct 17, 2010 at 06:13:11AM +1100, matthew green wrote:
   ...and I'm not convinced of this, primarily because (from a practical
   point of view) X is unavoidable and unfixable, whereas modules are
   neither.
  
  actually, with DRM (and KMS) i believe we will be able to run the
  X server as non-root.

Yes, after what, some fifteen years? :-/

-- 
David A. Holland
dholl...@netbsd.org

Re: CVS commit: src/bin/cp

2010-10-26 Thread David Holland

On Mon, Oct 25, 2010 at 05:49:11PM +0100, David Laight wrote:
   No, since in general the file is also being extended (certainly in
   this case it is) it also has to lock the file size, and that's going
   to deny stat() until it's done.
  
  A stat request during a write can safely return the old size.

Yes it can, if it has it. Hence multiversion...

-- 
David A. Holland
dholl...@netbsd.org

Re: RFC: ppath(3): property list paths library

2010-11-02 Thread David Holland

On Mon, Nov 01, 2010 at 08:00:09PM -0500, David Young wrote:
  I'm working on a library called ppath(3) for making property lists more
  convenient to use in the kernel.  With ppath(3), you refer to a property
  to read/write/delete in a property list by the path from the list's
  outermost container.  Comments welcome.

Speaking from the POV of someone who's been working on querying
semistructured data for several years now... I have a pile of
high-level questions: (1) can you articulate the expressive power you
intend for your path expressions, and why that's a logical stopping
point vs. more expressive things; (2) what if any facilities do you
envision for checking paths against proplist schemas when/if we ever
manage to sort out a system for that; (3) what model do you have for
dealing with cases when the values found at the paths provided are not
what the user is expecting; and (4) what model do you have for dealing
with cases where the path does not name a single unique value or
position, if that's possible?

(I'm not trying to give you a hard time, I've just spent a long time
dealing with these problems and I don't want to see familiar mistakes
reinvented.)

-- 
David A. Holland
dholl...@netbsd.org

Re: RFC: ppath(3): property list paths library

2010-11-03 Thread David Holland

On Wed, Nov 03, 2010 at 09:28:11AM +0100, Martin Husemann wrote:
  This is one of the occasions where I would love to use C++ and templates
  in the kernel ;-}

I think what you mean is that you'd like to have a language that has
some kind of sane parameterized types... :-/

-- 
David A. Holland
dholl...@netbsd.org

Re: mutexes, locks and so on...

2010-11-14 Thread David Holland

On Fri, Nov 12, 2010 at 02:21:34PM +0100, Johnny Billquist wrote:
  then I realized that this solution would break if people actually
  wrote code like
  lock(a)
  lock(b)
  release(a)
  release(b)
  
  ...which is very common.
  
  It is? I would have thought (and hoped) that people normally did:
  lock(a)
  lock(b)
  unlock(b)
  unlock(a)

Nope. You might get away with this if we always did strict two-phase
locking in the kernel, but we don't (no kernel does) to avoid
excessive contention on e.g. the vnode for / and other such locks.
Meanwhile, lock coupling tends to appear anytime one is transitioning
through a data structure and wants to maintain consistency.

Thus the typical usage is something like

   lock(a)
   b = a->b
   lock(b)
   unlock(a)
   c = b->c
   lock(c)
   unlock(b)
   do_work(c)
   unlock(c)

It can be shown that this preserves conflict serializability as long
as nothing ever follows the structure in the opposite order (c -> b ->
a).
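
As a minimal sketch of the coupling pattern above: the toy lock below only
records whether it is held, so that the ordering can be checked. The names
toy_lock, node and walk_coupled are illustrative assumptions, not kernel
interfaces.

```c
#include <stddef.h>

/* Toy lock: it only records state so the lock ordering can be inspected. */
struct toy_lock {
	int held;
};

struct node {
	struct toy_lock lock;
	struct node *next;
	int value;
};

static int locks_held;		/* number of locks currently held */
static int max_locks_held;	/* high-water mark of locks_held */

static void
toy_lock_enter(struct toy_lock *l)
{
	l->held = 1;
	if (++locks_held > max_locks_held)
		max_locks_held = locks_held;
}

static void
toy_lock_exit(struct toy_lock *l)
{
	l->held = 0;
	locks_held--;
}

/*
 * Walk to the last node hand over hand: take the next lock before
 * dropping the current one. Locks are therefore released in the order
 * they were acquired (a, then b), not in reverse.
 * Returns the value of the last node.
 */
static int
walk_coupled(struct node *a)
{
	struct node *cur = a, *next;
	int v;

	toy_lock_enter(&cur->lock);
	while ((next = cur->next) != NULL) {
		toy_lock_enter(&next->lock);
		toy_lock_exit(&cur->lock);
		cur = next;
	}
	v = cur->value;		/* stands in for do_work(c) */
	toy_lock_exit(&cur->lock);
	return v;
}
```

At no point are more than two locks held, and a scheme that required
unlocking in reverse order of acquisition could not express this walk.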

The traditional place you find code like this is in pathname
translation, but in a MP kernel it pops up in lots of other places
too.

  I agree that it's not wrong, but untidy. Keeping track of ipl
  levels could have been kept within the mutex instead, thus
  simplifying both the lock and unlock code, at the expense that
  people actually had to unlock mutexes in the reverse order they
  acquired them. Just as with the splraise/splx before.

That however isn't workable.

(And it still wouldn't be workable even in a kernel that had separate
spinlocks and sleep-locks.)

-- 
David A. Holland
dholl...@netbsd.org

Re: Please do not yell at people for trying to help you.

2010-11-14 Thread David Holland

On Fri, Nov 12, 2010 at 08:31:39PM +0000, Eduardo Horvath wrote:
  No it doesn't because all those macros assume the value is being
  transferred from one register to another rather than register to memory.
  The assignment:
  
  foo.size = htole64(size);
  
  Cannot be replaced with:
  
  __inline __asm(stxa %1, [%0] ASI_LITLE : foo.size : size);

The right way to fix this is in the compiler; teach the compiler about
opposite-endian variables and let it pick the right instructions for
accessing them, and the problem goes away.
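
For comparison, here is a sketch of what portable C can do today without
such compiler support: store the value one byte at a time, so the result
is little-endian whatever the host byte order. The helper names store_le64
and load_le64 are assumptions for illustration; whether a given compiler
collapses the loop into a single opposite-endian store is not guaranteed.

```c
#include <stdint.h>

/* Store v at p in little-endian byte order, independent of host order. */
static void
store_le64(unsigned char *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (unsigned char)(v >> (8 * i));
}

/* Read back a little-endian 64-bit value stored at p. */
static uint64_t
load_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}
```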

-- 
David A. Holland
dholl...@netbsd.org

Re: mutexes, locks and so on...

2010-11-14 Thread David Holland

On Sat, Nov 13, 2010 at 01:45:40AM +0900, Izumi Tsutsui wrote:
   Wow. I guess you can add me to the list of people leaving.
  
  There is no perfect world and we don't have enough resources.
  
  Any help to keep support for ancient machines is appreciated, but
  complaints like we should support it which prevents improvements
  of mainstream will just make NetBSD rotten.

What "prevents improvements of mainstream" are we talking about here?
We have someone who wants to provide tuned vax-specific locking
primitives. The absolute worst possible cost to the mainstream that
this incurs is a bit of extra cpp and config hackery.

(Can we all please get a grip?)

-- 
David A. Holland
dholl...@netbsd.org

Re: CVS commit: src/sys/arch/powerpc/oea

2010-11-15 Thread David Holland

(moving this to tech-kern because it's the right place and per request)

On Mon, Nov 15, 2010 at 11:24:21AM +0900, Masao Uebayashi wrote:
   Every header file should include the things it requires to compile.
   Therefore, there should in principle be no cases where a header file
   (or source file) needs to #include something it doesn't itself use.
  
  This clarifies my long-unanswered question, thanks!

*bow*

  I've (re)built about 300 kernels in the last days.  I've found:
  
  - sys/sysctl.h's struct kinfo_proc should be moved into sys/proc.h
(I've done this locally).  Otherwise all sysctl node providers
includes sys/proc.h and uvm/uvm_extern.h.
(This is where I started...)

I'm not sure this is a good plan in the long run. Shouldn't it at some
point be unhooked fully from the real proc structure?

  - sys/syscallargs.h should be split into pieces, otherwise all its
users have to know unrelated types (sys/mount.h, sys/cpu.h).

Since system calls don't in general pass structures by value, it
shouldn't need most of those header files, just forward declarations
of the structs.

(this is, btw, one of the reasons to avoid silly typedefs)

  - sys/proc.h's tsleep(), wakeup(), and friends should be moved into
some common header, because it's widely used API.  sys/proc.h will
be used only for struct proc related things.

Given that this is a deprecated API in the long term I'm not sure it's
worthwhile.

-- 
David A. Holland
dholl...@netbsd.org

Re: CVS commit: src/sys/arch/powerpc/oea

2010-11-15 Thread David Holland

On Mon, Nov 15, 2010 at 10:41:55PM +0000, David Laight wrote:
   Indeed. Properly speaking though, headers that are exported to
   userland should define only the precise symbols that userland needs;
   kernel-only material should be kept elsewhere.
  
  One start would be to add a sys/proc_internal.h so that sys/proc
  can be reduced to only stuff that userspace and some kernel parts
  are really expected to use.

The right way (TM) is to create src/sys/include and put kernel-only
headers in there, to be included as e.g. <proc.h>.

In the long term the user-visible parts would go in
src/sys/include/kern/proc.h, which would be included as <kern/proc.h>.

(It has to be kern/ and not sys/ because a couple decades of standards
creep and poor API maintenance has led to half of sys/*.h properly
belonging to libc. In order to avoid repeating this problem in the
future, all APIs should be defined without direct reference to any
kern/*.h files; those should only be included from other libc or
kernel headers. So libc would grow its own sys/proc.h because that's
part of the libkvm API.)

When done completely the entire kern/ subtree is the same for both
userland and the kernel, including MD headers, no other random kernel
headers need to be installed, and there's no longer any need for
#ifdef _KERNEL.

As much as this probably sounds obvious, the first couple of times I
set out to do it myself I got it wrong. (And it's wrong in Linux too.)

-- 
David A. Holland
dholl...@netbsd.org

Re: CVS commit: src/sys/arch/powerpc/oea

2010-11-15 Thread David Holland

On Mon, Nov 15, 2010 at 03:47:32PM -0500, der Mouse wrote:
   [...] just forward declarations of the structs.
  
   (this is, btw, one of the reasons to avoid silly typedefs)
  
  I'm not sure what typedefs have to do with it.  typedeffing a name to
  an incomplete (forward) struct type works just fine:
  
  struct foo;
  typedef struct foo FOO;
  
  (You can't do anything with a FOO without completing the struct type,
  but you can work with pointers to them)

But now there's no protection against divergence; that is, if I have

   typedef struct foo FOO;

in one header and a typo'd

   typedef struct tfoo FOO;

in another, assuming suitable ifdef guards as already mentioned, now
FOO can be two different things, and the inconsistency in the
cut-and-pasted-material might not be detected for some time.

However, if I just have

   struct foo;

in multiple headers there aren't very many ways this can be wrong that
will compile at all. The only common way for this to go bad is if
you've removed struct foo from your program completely; then you have
to hunt down all the forward declarations by hand and kill them off.
But that's more or less unavoidable.

The difference between these two cases is inherent in the fact that
the typedef form is declaring two things and the plain struct
declaration is declaring only one... there's no particular reason C
couldn't provide a way to create a forward declaration (without
definition) of a typedef name, but it doesn't.
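
A small sketch of the plain-struct form, assuming a hypothetical struct foo
with a single accessor: the forward declaration can be repeated in any
number of headers without harm, and only the implementation needs the
complete type.

```c
/* What one header would contain: a forward declaration is enough for
 * prototypes that only pass pointers. */
struct foo;
int foo_value(const struct foo *);

/* A second header repeating the forward declaration is harmless. */
struct foo;

/* Only the implementation needs the complete type. */
struct foo {
	int value;
};

int
foo_value(const struct foo *f)
{
	return f->value;
}
```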

-- 
David A. Holland
dholl...@netbsd.org

Re: module.prop rename

2010-11-20 Thread David Holland

On Sat, Nov 20, 2010 at 07:50:03PM -0800, John Nemeth wrote:
  } embed the property info in the module file itself?
  
   That may or may not make more sense, but it would require a lot
  more work (i.e. inventing a tool to extract them, edit them, and insert
  them; and modifying the module loading code to extract them).  I have
  very little interest in doing that work at this time.

Fair enough.

-- 
David A. Holland
dholl...@netbsd.org

Re: misuse of pathnames in rump (and portalfs?)

2010-11-23 Thread David Holland

On Tue, Nov 23, 2010 at 11:13:02PM +0000, David Holland wrote:
  However, I discovered today that rumpfs's VOP_LOOKUP implementation
  relies on being able to access not just the name to be looked up, but
  also the rest of the pathname namei is working on, specifically
  including the parts that have already been translated.

Ok, on further inspection it appears that this is overly pessimistic.
It looks, rather, as if rumpfs (specifically the etfs logic) is using
the full namei work buffer and hoping that no such parts actually
appear in it, because if they do it'll fail.

So I think the following change will resolve the problem; can someone
who knows how this is supposed to work check it? (If it's ok, there's
no need to tamper with VOP_LOOKUP.)

Index: rumpfs.c
===================================================================
RCS file: /cvsroot/src/sys/rump/librump/rumpvfs/rumpfs.c,v
retrieving revision 1.74
diff -u -p -r1.74 rumpfs.c
--- rumpfs.c	22 Nov 2010 15:15:35 -0000	1.74
+++ rumpfs.c	24 Nov 2010 04:31:07 -0000
@@ -291,10 +291,9 @@ hft_to_vtype(int hft)
 }
 
 static bool
-etfs_find(const char *key, struct etfs **etp, bool forceprefix)
+etfs_find(const char *key, size_t keylen, struct etfs **etp, bool forceprefix)
 {
 	struct etfs *et;
-	size_t keylen = strlen(key);
 
 	KASSERT(mutex_owned(&etfs_lock));
 
@@ -381,7 +380,7 @@ doregister(const char *key, const char *
 		rn->rn_flags |= RUMPNODE_DIR_ETSUBS;
 
 	mutex_enter(&etfs_lock);
-	if (etfs_find(key, NULL, REGDIR(ftype))) {
+	if (etfs_find(key, strlen(key), NULL, REGDIR(ftype))) {
 		mutex_exit(&etfs_lock);
 		if (et->et_blkmin != -1)
 			rumpblk_deregister(hostpath);
@@ -641,13 +640,15 @@ rump_vop_lookup(void *v)
 	if (dvp == rootvnode && cnp->cn_nameiop == LOOKUP) {
 		bool found;
 		mutex_enter(&etfs_lock);
-		found = etfs_find(cnp->cn_pnbuf, &et, false);
+		found = etfs_find(cnp->cn_nameptr, cnp->cn_namelen, &et, false);
 		mutex_exit(&etfs_lock);
 
 		if (found) {
-			char *offset;
+			const char *offset;
 
-			offset = strstr(cnp->cn_pnbuf, et->et_key);
+			/* pointless as et_key is always the whole string */
+			/*offset = strstr(cnp->cn_nameptr, et->et_key);*/
+			offset = cnp->cn_nameptr;
 			KASSERT(offset);
 
 			rn = et->et_rn;


-- 
David A. Holland
dholl...@netbsd.org

Re: misuse of pathnames in rump (and portalfs?)

2010-11-24 Thread David Holland

On Wed, Nov 24, 2010 at 01:26:04PM -0500, der Mouse wrote:
   Right. But if you want a guaranteed absolute path you should be able
   to do it by calling getcwd first.
  
  Only if you accept breakage if the current directory no longer has any
  name.

Well, if you can't call getcwd, then it won't work... (I was not
suggesting that the getcwd call be wedged inside namei)

  Of course, if you consider that acceptable, then fine.  I don't, not
  for something as central as namei (though this looks as though you may
  be talking about only certain filesystems, in which case it may be
  acceptable).

We'll probably end up doing it on every exec, since there's currently
no way to tell from the ELF headers whether $ORIGIN needs to be set
(this should be considered a bug) but failure is ok for that. In
compat_svr4 I doubt anyone cares much if it fails in corner cases.

-- 
David A. Holland
dholl...@netbsd.org

Re: misuse of pathnames in rump (and portalfs?)

2010-11-24 Thread David Holland

On Wed, Nov 24, 2010 at 08:30:18PM +0200, Antti Kantee wrote:
   I think it makes more sense for doregister to check for at least one
   leading '/' and remove the leading slashes before storing the key.
   Then the key will match the name passed by lookup; otherwise the
   leading slash won't be there and it won't match. (What I suggested
   last night is broken because it doesn't do this.)
  
  Ah, yea, the leading slashes will be stripped for lookup, so we can't
  get an exact match for those anyway.
  
  So, let's define it as string beginning with /, leading /'s collapsed
  to 1.

Ok. See below (replaces the patch upthread):

   All users I can find pass an absolute path.
  
  ok, good

diff -r 66985053a079 sys/rump/librump/rumpvfs/rumpfs.c
--- a/sys/rump/librump/rumpvfs/rumpfs.c	Wed Nov 24 01:34:10 2010 -0500
+++ b/sys/rump/librump/rumpvfs/rumpfs.c	Wed Nov 24 15:41:26 2010 -0500
@@ -324,6 +324,13 @@ doregister(const char *key, const char *
 	devminor_t dmin = -1;
 	int hft, error;
 
+	if (key[0] != '/') {
+		return EINVAL;
+	}
+	while (key[0] == '/') {
+		key++;
+	}
+
 	if (rumpuser_getfileinfo(hostpath, &fsize, &hft, &error))
 		return error;
 
@@ -396,7 +403,7 @@ doregister(const char *key, const char *
 
 	if (ftype == RUMP_ETFS_BLK) {
 		format_bytes(buf, sizeof(buf), size);
-		aprint_verbose("%s: hostpath %s (%s)\n", key, hostpath, buf);
+		aprint_verbose("/%s: hostpath %s (%s)\n", key, hostpath, buf);
 	}
 
 	return 0;
@@ -641,13 +648,15 @@ rump_vop_lookup(void *v)
 	if (dvp == rootvnode && cnp->cn_nameiop == LOOKUP) {
 		bool found;
 		mutex_enter(&etfs_lock);
-		found = etfs_find(cnp->cn_pnbuf, &et, false);
+		found = etfs_find(cnp->cn_nameptr, &et, false);
 		mutex_exit(&etfs_lock);
 
 		if (found) {
-			char *offset;
+			const char *offset;
 
-			offset = strstr(cnp->cn_pnbuf, et->et_key);
+			/* pointless as et_key is always the whole string */
+			/*offset = strstr(cnp->cn_nameptr, et->et_key);*/
+			offset = cnp->cn_nameptr;
 			KASSERT(offset);
 
 			rn = et->et_rn;


-- 
David A. Holland
dholl...@netbsd.org

Re: radix tree implementation for quota ?

2010-11-28 Thread David Holland

On Sun, Nov 28, 2010 at 09:47:02PM +0000, David Holland wrote:
  (Also, why a radix tree? Radix trees are generally not very efficient.
  If you're going to, though, you might want to reuse the direct,
  indirect, double indirect, etc. method FFS uses for block mapping.)

...and the easiest way to do this is to put the quota in a sparse file...

-- 
David A. Holland
dholl...@netbsd.org

Re: radix tree implementation for quota ?

2010-11-28 Thread David Holland

On Sun, Nov 28, 2010 at 11:43:48PM +0100, Joerg Sonnenberger wrote:
  A radix tree is kind of a bad choice for this purpose. The easiest
  approach is most likely to have [a btree]

I would go with an expanding hash table of some kind, e.g. size is 2^n
pages, hash & (2^n - 1) tells you the page to look at, when you fill
up double the size and take an extra bit from the hash value.  If you
use a good hash function you should get decent occupancy rates;
expanding requires rewriting every page, but for systems where the
number of uids is more or less constant over time (which is most
systems) this won't happen very much... and remember that for 30,000
users the total size of quota data is only about 1M anyhow so chugging
it around once in a while isn't a big deal.
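
The page selection described above can be sketched as follows. The function
names are made up for illustration, and n is assumed to be less than 32.

```c
#include <stdint.h>

/* Page index for a table of 2^n pages: the low n bits of the hash. */
static uint32_t
page_for(uint32_t hash, unsigned n)
{
	return hash & ((UINT32_C(1) << n) - 1);
}

/*
 * After doubling from 2^n to 2^(n+1) pages, one more bit of the hash is
 * used. An entry that lived on page p lands on either p or p + 2^n,
 * depending on bit n of its hash.
 */
static uint32_t
page_after_doubling(uint32_t hash, unsigned n)
{
	return page_for(hash, n + 1);
}
```

Since every entry either stays put or moves to exactly one new page,
expanding is a single pass that rewrites each page once.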

However, I'm still not convinced the sparse file really is a serious
problem in practice.

-- 
David A. Holland
dholl...@netbsd.org

Re: radix tree implementation for quota ?

2010-11-30 Thread David Holland

On Mon, Nov 29, 2010 at 11:12:21AM -0500, der Mouse wrote:
  Without any real data on what UID distribution looks like in practice,
  we're all speculating in a vacuum here.

Just for shits and giggles I ran this on a real password file with
about 350 users that's had lots of churn since it was first
established. Using Joerg's hash with a table of size 512 gave 90
collisions; using x % 509 (the nearest available prime) gave 91.

For comparison I tried some of the (more expensive) hashes from db4
and got 111, 94, and 84.

None of the hash functions generated significant hotspots. It seems
like a nonissue.
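
For anyone who wants to repeat this on their own password file, here is a
rough sketch of the counting for the x % 509 case. It assumes a "collision"
means a uid landing in an already-occupied bucket, which may not be exactly
how the numbers above were counted; count_collisions is a made-up name.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hash n uids into table_size buckets with uid % table_size and count
 * every uid that lands in an already-occupied bucket as one collision.
 * Returns -1 on a zero table size or allocation failure.
 */
static long
count_collisions(const unsigned *uids, size_t n, unsigned table_size)
{
	unsigned char *used;
	long collisions = 0;

	if (table_size == 0)
		return -1;
	used = calloc(table_size, 1);
	if (used == NULL)
		return -1;
	for (size_t i = 0; i < n; i++) {
		unsigned b = uids[i] % table_size;

		if (used[b])
			collisions++;
		else
			used[b] = 1;
	}
	free(used);
	return collisions;
}
```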

(Of course, hashing 350 things isn't exactly a big challenge...)

-- 
David A. Holland
dholl...@netbsd.org

Re: Heads up: moving some uvmexp stat to being per-cpu

2010-12-15 Thread David Holland

On Tue, Dec 14, 2010 at 08:49:14PM -0800, Matt Thomas wrote:
  I have a fairly large but mostly simple patch which changes the
  stats collected in uvmexp for faults, intrs, softs, syscalls, and
  traps from 32 bit to 64 bits and puts them in cpu_data (in
  cpu_info).  This makes more accurate and a little cheaper to update
  on 64bit systems.

Would this be a good opportunity to retire the sysctl that returns the
non-ABI-stable uvmexp, or perhaps change its name to indicate that
it's not stable/supported?

-- 
David A. Holland
dholl...@netbsd.org

parsepath op

2011-01-02 Thread David Holland

Because we have at least one FS that may not want paths being looked
up to be split on '/', namely rump etfs, and arguably the most
important simplification to VOP_LOOKUP is to make it handle one path
component at a time, we need a way for a FS to decide how much of a
path it wants to digest at once.

After thinking about this for a while I think the best approach is to
add a parsepath op, which given a pathname returns the length of the
string to consume.

There are two major questions about how this should work: one is
whether it should be a vnode or fs operation, and the other is how
onionfs should handle it.

I'm currently leaning towards a vnode op and applying the restriction
that it must select either the first component (up to the first slash)
or the whole remaining path string. This makes it reasonably possible
for onionfs to deal with cases where its layers don't agree on the
length to consume. (Although I think for the time being I'll just let
such cases fail.) Making it a fs op instead would require that onionfs
always call the operation again on both layers inside lookup; this
would add a certain amount of overhead.

Unfortunately, because etfs requires that the choice depend on the
particular pathname given, I don't think this can be done just by
setting flags somewhere. However, I can't think of a credible use case
that requires more flexibility than selecting either one component or
the whole path. (This includes a moderately crazy research project I
proposed years ago.) So I don't think a more general operation is
needed.
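
To make the restricted form concrete, here is a sketch of the two choices
as plain string functions. These are hypothetical stand-ins, not the
eventual vnode op signature, which would also need the vnode and probably
an error return.

```c
#include <stddef.h>
#include <string.h>

/* Default choice: consume only the first component (up to the first '/'). */
static size_t
parsepath_component(const char *path)
{
	return strcspn(path, "/");
}

/* etfs-like choice: consume the whole remaining path string. */
static size_t
parsepath_whole(const char *path)
{
	return strlen(path);
}
```

A file system like etfs would pick between the two depending on the
particular pathname it is handed; everything else would always use the
first.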

I don't have a candidate patch yet (or even a draft patch); adding new
vops requires touching an unreasonably large number of places. But
this seems like the kind of thing where posting early is a good idea...

-- 
David A. Holland
dholl...@netbsd.org

Re: semantics of TRYEMULROOT

2011-01-02 Thread David Holland

On Sun, Jan 02, 2011 at 09:19:30AM -0500, matthew sporleder wrote:
   [TRYEMULROOT]
  
  Since it's on http://www.netbsd.org/~dholland/buglists/file.html , I'm
  sure you're aware of it, but would 41678 be solved?
  
  http://gnats.NetBSD.org/cgi-bin/query-pr-single.pl?number=41678

Doubtful, as that behavior's inside onionfs. I don't think applying
the same simplification to onionfs would result in the behavior you
want there, either.

(Anyhow, like I wrote in that PR, the only real way forward for
onionfs is to try to figure out a self-consistent model for the
semantics. If as I suspect that turns out to be impossible, the right
way forward is to kill onionfs and replace it with several similar
fses each with a clear set of semantics specialized for a particular
set of purposes.)

-- 
David A. Holland
dholl...@netbsd.org

Re: semantics of TRYEMULROOT

2011-01-02 Thread David Holland

On Sun, Jan 02, 2011 at 06:14:57PM +, David Laight wrote:
  On Sun, Jan 02, 2011 at 09:52:31AM +, David Holland wrote:
   Has anyone ever sat down and clearly worked out the desired semantics
   for TRYEMULROOT? I've noted inconsistencies in the past, and because
   in a number of ways it's a special case of onionfs I'm somewhat
   concerned that there may be cases where the proper or desired behavior
   is unclear or ambiguous.
  
  When I added TRYEMULROOT I did so in order to maintain the same actions
  as the old code - which did all sorts of horrid checks before copying
  a changed path out into the stackgap.
  At that time I didn't want to be worried about which code paths should,
  or should not, look in the emulated root - since many of the emulations
  were probably broken.

That's more or less what I thought :-)

  You are probably the only one who has actually looked into this deeply!
  It does seem likely that only ENOENT (and similar) should cause the
  main fs to be checked.

Yeah, but which are similar? Clearly e.g. ENOTDIR, maybe ELOOP; one
could make a case either way for EROFS and EACCES though. And some
things that might turn up, like ESTALE or ECONNTIMEDOUT, aren't clear
at all...
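
As a rough illustration of that classification (a sketch only; the enum
and function names are made up, the default case is my assumption, and
the timeout case is left out):

```c
#include <errno.h>

/* Hypothetical classification of lookup errors in the emulated root. */
enum retry_class {
	RETRY_CLEAR,	/* clearly "not there": fall back to the main fs */
	RETRY_ARGUABLE,	/* a case can be made either way */
	RETRY_UNCLEAR,	/* no obvious answer */
	RETRY_NO	/* assumption: anything else is returned as-is */
};

enum retry_class
emulroot_retry_class(int error)
{
	switch (error) {
	case ENOENT:
	case ENOTDIR:
		return RETRY_CLEAR;
	case ELOOP:	/* "maybe" in the text above */
	case EROFS:
	case EACCES:
		return RETRY_ARGUABLE;
	case ESTALE:
		return RETRY_UNCLEAR;
	default:
		return RETRY_NO;
	}
}
```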

The question of where to commit to an object to work with is more
important than the precise list of errors to retry on, because the
former is structural and the latter is (relatively) cosmetic.

Currently operations like mkdir do (roughly)

   NDINIT(nd, CREATE, LOCKPARENT, path);
   namei(nd);
   if (nd.ni_vp != NULL) {
      return EEXIST;
   }
   VOP_MKDIR(nd.ni_dvp, nd.ni_cnd);
   vput(nd.ni_dvp);

but the plan is to change this to (roughly)

   char last[NAME_MAX];
   namei_parent(path, dvp, last, sizeof(last));
   vn_lock(dvp);
   VOP_MKDIR(dvp, last);
   vn_unlock(dvp);
   vrele(dvp);

which is simpler and a lot tidier, both on the surface and inside the
fs; however, if TRYEMULROOT is wanted it's obviously not desirable to
need a retry loop here (and in each of the several other similar
functions) -- therefore I'd like namei_parent to be able to commit to
a (parent) directory to operate in, handling TRYEMULROOT inside itself
only. In addition to fitting my design better I think this provides a
more consistent semantics; I just want to make sure we think it'll
work before committing to it.

-- 
David A. Holland
dholl...@netbsd.org

Re: parsepath op

2011-01-02 Thread David Holland

On Sun, Jan 02, 2011 at 10:48:03AM +, David Laight wrote:
  On Sun, Jan 02, 2011 at 09:17:11AM +, David Holland wrote:
   Because we have at least one FS that may not want paths being looked
   up to be split on '/', namely rump etfs, and arguably the most
   important simplification to VOP_LOOKUP is to make it handle one path
   component at a time, we need a way for a FS to decide how much of a
   path it wants to digest at once.
  
  Slightly related is something I did for an embedded os, where
  I appended /param1/param2 to a device path.
  [...]

Right, I think the original art for this came from AmigaDOS. It's
always seemed like a useful idiom to me (much better than creating a
dozen variant device nodes for every physical device) and I don't
intend to do anything that will rule it out.

Basically to make it work we'd need to patch namei to allow calling
VOP_LOOKUP on device nodes instead of giving ENOTDIR, and add a
spec_lookup implementation that does some kind of string-to-ioctl
mapping to issue state changes on the device vnode. Currently writing
a VOP_LOOKUP implementation is a black art, but that's supposed to
change, and I don't think this would require any other special hacks.

-- 
David A. Holland
dholl...@netbsd.org

Re: semantics of TRYEMULROOT

2011-01-02 Thread David Holland

On Sun, Jan 02, 2011 at 09:19:51PM +, Eduardo Horvath wrote:
  TRYEMULROOT should only open existing objects on the emul path, it should 
  never create anything new, so you would never want to use it for mkdir.  
  I don't know if that means you need to pass an extra flag to 
  namei_parent() or what.

That's what I thought at one point, but it's set on almost everything,
including mount, open with or without O_CREAT, mknod, mkfifo, link,
symlink, mkdir, and rename, and also unlink and rmdir.

If that's not how it should be, things should be tidied up. (And, if
neither mkdir-type nor rmdir-type operations should have TRYEMULROOT,
my original question becomes largely or entirely moot.)

-- 
David A. Holland
dholl...@netbsd.org

Re: prop_*_internalize and copyin/out for syscall ?

2011-01-18 Thread David Holland

On Mon, Jan 17, 2011 at 04:33:25PM +0100, Manuel Bouyer wrote:
  so I'm evaluating how to use proplib for the new quotactl(2) I'm working on.

er, why? When I was looking at quota stuff in the context of lfs and
other fs types, the existing quotactl interface seemed fine -- it just
needs to have a clear separation between the syscall-level structures
and the ffs-specific ones.

At worst one might want to split struct dqblk in half, so the block
and inode limits are addressed separately, something like this:

   struct quotaentry {
      uint64_t qe_hardlimit;
      uint64_t qe_softlimit;
      uint64_t qe_current;
      int32_t qe_time;
      int32_t __qe_spare;
   };

with additional suitable constants for addressing block vs. inode
limits.
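
A minimal sketch of how such constants might be used, assuming
hypothetical names (QUOTA_OBJTYPE_*, struct quota_pair,
quota_over_soft); none of these exist in the tree, and a "no limit"
sentinel is deliberately not handled:

```c
#include <stdbool.h>
#include <stdint.h>

struct quotaentry {
	uint64_t qe_hardlimit;
	uint64_t qe_softlimit;
	uint64_t qe_current;
	int32_t qe_time;
	int32_t __qe_spare;
};

/* Hypothetical selectors for block vs. inode limits. */
enum quota_objtype {
	QUOTA_OBJTYPE_BLOCKS = 0,
	QUOTA_OBJTYPE_INODES = 1,
	QUOTA_NOBJTYPES
};

/* Hypothetical per-id record: one quotaentry per object type. */
struct quota_pair {
	struct quotaentry qp_entries[QUOTA_NOBJTYPES];
};

/* True if current usage of the given object type exceeds the soft limit. */
bool
quota_over_soft(const struct quota_pair *qp, enum quota_objtype which)
{
	const struct quotaentry *qe = &qp->qp_entries[which];

	return qe->qe_current > qe->qe_softlimit;
}
```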

I really don't see where proplib figures into this.

-- 
David A. Holland
dholl...@netbsd.org

Re: Dates in boot loaders on !x86

2011-01-18 Thread David Holland

On Tue, Jan 18, 2011 at 04:24:37PM +0100, Joerg Sonnenberger wrote:
  Well, we derive the version to include from the version file. This is
  controlled by a central script. What about adding support to expand
  $DATE$ or some other magic version string, if it is the last in the
  version file? If you are actively developing, you can add that to the
  version file and hopefully remember to replace it with a proper version
  entry before commit.

That's unnecessarily complicated. There's prior art for this:

NetBSD tanaqui 5.99.41 NetBSD 5.99.41 (TANAQUI) #32: Wed Dec  1 01:20:02 EST 2010 dholland@tanaqui:/usr/src/sys/arch/i386/compile/TANAQUI i386
                                                ^^^

Wouldn't be very hard to do the same for bootloaders.

-- 
David A. Holland
dholl...@netbsd.org

Re: Dates in boot loaders on !x86

2011-01-18 Thread David Holland

On Tue, Jan 18, 2011 at 09:39:58PM +0100, Joerg Sonnenberger wrote:
   That's unnecessarily complicated. There's prior art for this:
   [...]
  
  Please look at the mail that started this thread.  newvers provides
  multiple independent variables, so conditionally providing one of them
  needs both an option and output mangling in the users.

It doesn't need an option, because on a clean build it would always be
0 (or 1) -- if you start hacking, then it would increment
itself. Assuming you don't cleandir.

(And I didn't say to reuse the kernel's newvers script itself. All this
needs is about five lines of sh...)

  The consensus seems to be that during normal usage, the build date is
  irrelevant and doesn't provide any value. Based on Martin's suggestion,
  I will add a MKVERBOSEBOOT variable or so (haven't made my mind up about
  the name). If it is set, bootprog_kernrev will include the build date as
  well as user and host name (like the current bootprog_maker). The
  current bootprog_maker and bootprog_kernrev go away.

But anyway, that seems fine. Is it going to be extended to the x86
bootloader?

-- 
David A. Holland
dholl...@netbsd.org

Re: turning off COMPAT_386BSD_MBRPART in disklabel

2011-02-02 Thread David Holland

On Mon, Jan 31, 2011 at 05:40:20PM +0100, Matthias Drochner wrote:
   PR 44496 notes that COMPAT_386BSD_MBRPART is still enabled in
   disklabel(8), even though it was turned off by default in the kernel
   early in 4.99.x. The PR also notes that it's not harmless to leave it
   on.
  
  The PR rather leads to the conclusion that the support for
  old Partition IDs in disklabel(8) is suboptimal.
  Originally, the code did only consider a partition with the
  old ID if no new one was found. This apparently got broken
  when extended partition support was added years later.

Yeah, that's a valid point. I guess the question then is whether
fixing that will prevent any problematic cases from arising...  and
whether at this point it's worth worrying about.
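
For reference, the originally intended behavior described in the quoted
text (consider the old ID only if no new one is found) can be sketched
like this. The function name is hypothetical and the table is reduced
to an array of type bytes; picking the first match when there are
several is my assumption, not something the existing code promises.

```c
#include <stddef.h>
#include <stdint.h>

#define MBR_PTYPE_386BSD 0xa5	/* old partition ID */
#define MBR_PTYPE_NETBSD 0xa9	/* current NetBSD partition ID */

/*
 * Return the index of the partition to use, or -1 if none qualifies.
 * A partition with the old ID is used only when no partition with the
 * new ID appears anywhere in the table.
 */
int
pick_netbsd_partition(const uint8_t *types, size_t n)
{
	int fallback = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (types[i] == MBR_PTYPE_NETBSD)
			return (int)i;
		if (types[i] == MBR_PTYPE_386BSD && fallback == -1)
			fallback = (int)i;
	}
	return fallback;
}
```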

I suspect very few commodity drives old enough to have been fdisk'd
with the old partition ID are still operable, and I suspect that
anyone who's got one that hasn't been updated already is qualified to
run fdisk... and there are very few cases where anyone would need to
run disklabel but not be able to run fdisk first. So I'd really be
inclined at this point to just disable the feature.

...also, it's not entirely clear to me what the code is supposed to be
doing if there are multiple NetBSD partitions; it looks as if what it
*will* do is use the label from the one it sees last and write the
same label to all of them.

blah, using both fdisk partitions and traditional labels on the same
disk has always been a pile of fail.

-- 
David A. Holland
dholl...@netbsd.org

Re: remove sparse check in vnd

2011-02-06 Thread David Holland

On Sat, Feb 05, 2011 at 10:07:13PM -0500, der Mouse wrote:
  Of course, still better would be to fix vnd, though I'm not sure what
  the right fix would be. 

What's the problem? My vague understanding was that you could get into
deadlocks allocating blocks, but maybe I'm confusing it with something
else.

-- 
David A. Holland
dholl...@netbsd.org

Re: turning off COMPAT_386BSD_MBRPART in disklabel

2011-02-06 Thread David Holland

On Thu, Feb 03, 2011 at 08:04:26AM +, David Laight wrote:
The PR rather leads to the conclusion that the support for
old Partition IDs in disklabel(8) is suboptimal.
Originally, the code did only consider a partition with the
old ID if no new one was found. This apparently got broken
when extended partition support was added years later.
   
   Yeah, that's a valid point. I guess the question then is whether
   fixing that will prevent any problematic cases from arising...  and
   whether at this point it's worth worrying about.
  
  Possibly the code should be willing to locate and process such a label.
  Possibly even write it back.
  But it probably shouldn't 'corrupt' it - ie leave it as a valid label
  (doesn't it contain sector number relative to the ptn itself?
  so can't describe any other parts of the disk?)

Are *our* ancient disklabels partition-relative? It's so long ago that
I'm not sure... but the code currently in disklabel(8) doesn't appear
to know anything at all about partition-relative labels.

Given the rest of the discussion here, the fact that fixing
disklabel(8) properly isn't completely trivial, and tls's recent
experience, I think the feature should just be turned off in
disklabel... but, just in case, not removed entirely until we branch
netbsd-6.

Does anyone object to this course of action?

-- 
David A. Holland
dholl...@netbsd.org

Re: turning off COMPAT_386BSD_MBRPART in disklabel

2011-02-12 Thread David Holland

On Mon, Feb 07, 2011 at 01:48:57AM -0500, Thor Lancelot Simon wrote:
  For the record, I am pretty sure it was sysinst, not disklabel, which
  hosed my disk.  Sysinst compiles equivalent code in directly, no?

There are only two uses of MBR_PTYPE_386BSD in src/distrib. One is a
perfectly innocuous list of partition type IDs. The other is in
src/distrib/utils/sysinst/arch/i386/md.c, which changes the partition
ID of a MBR_PTYPE_386BSD partition to MBR_PTYPE_NETBSD if no
MBR_PTYPE_NETBSD partitions are seen.

This is, however, only reached if someone's explicitly attempting to
upgrade an existing installation, so it's probably harmless -- I think
you got hosed by disklabel.

This code should probably be removed from sysinst too, but maybe after
-6 is branched.

-- 
David A. Holland
dholl...@netbsd.org

Re: turning off COMPAT_386BSD_MBRPART in disklabel

2011-02-13 Thread David Holland

On Sun, Feb 13, 2011 at 01:06:36PM -0500, Thor Lancelot Simon wrote:
  Not in the failure case I observed (I can now reproduce this, but since
  it looks like the code in disklabel is going to Go, 

It has Gone :-)

(The remaining question is whether to request pullup to -5; I think I
will unless someone is strongly opposed.)

  If the kernel write-out-label code can do something similar, that ought
  to get the axe, too.

It was disabled by default four years ago.

-- 
David A. Holland
dholl...@netbsd.org

Re: Fwd: Status and future of 3rd party ABI compatibility layer

2011-03-03 Thread David Holland

On Wed, Mar 02, 2011 at 12:40:44AM +, Andrew Doran wrote:
  With modules now basically working we should either retire or move
  some of these items to pkgsrc so that the interested parties maintain them.
  An awful lot of the compat stuff is now very compartmentalised, with not
  much more work to do.

There's at least one thing on the long-term wishlist that ideally
should be done first: migrating to code generation for the syscall
copyin/copyout logic. Given such infrastructure, much of the compat_*
code can be replaced with code generator rules.

Also, we really need a better story for compiling modules outside the
source tree. Installing every random kernel header in /usr/include
isn't the right way to go, but we don't currently have enough internal
API organization to do much better. (Alternatively, we could come up
with a better story for providing a system source tree in pkgsrc, but
that also has issues.)

Darwin (no GUI, doesn't seem to have been updated in the last 5 years)
IRIX
  
  These two are strange and very broken, i.e. internally they are in
  very bad shape.  I vote to delete.  The version control history will still
  be there.  Can't see strong use cases for either.

I don't know that much about compat_darwin, but compat_irix is a pile
of ooze. Someone please delete it :-)

-- 
David A. Holland
dholl...@netbsd.org

Re: the bouyer-quota2 branch

2011-03-07 Thread David Holland

On Sat, Feb 19, 2011 at 11:21:35PM +0100, Manuel Bouyer wrote:
  I think the code in the bouyer-quota2 branch is stable now, and
  ready to be merged to HEAD. Unless objections, I'll merge it in
  about 2 weeks.
  [...]

So, I thought one of the points of this was to make the quota
interface fs-independent, but as it seems to have come out all the
pieces and definitions are still in sys/ufs/ufs, and so far at least I
really do not see where to slice to have quota support in a non-ufs
filesystem.

Can you explain how this is supposed to be done? And can we move the
fs-independent and vfs-level declarations to sys/quota.h and add
kern/vfs_quota.c for the fs-independent code?

-- 
David A. Holland
dholl...@netbsd.org

Re: the bouyer-quota2 branch

2011-03-10 Thread David Holland

On Wed, Mar 09, 2011 at 08:20:00PM +0100, Manuel Bouyer wrote:
  On Wed, Mar 09, 2011 at 06:28:11PM +, David Holland wrote:
 struct quota2_entry (and so struct quota2_val) is used for both
 on-disk storage, and in-memory representation in tools and kernel.
 I agree this should be split; with an extra level of conversion
 (between in-memory and on-disk representation). The issue is struct
 quota2_val: I don't see any reason to have a different structure
 for on-disk and in-memory representation at this level.
   
   Well... for one thing N_QL isn't necessarily 2;
  
  The tools rely on N_QL being defined and constant for all FS types at
  this time. The string for each QL is also defined here

That should be fixed in favor of something more flexible before the
API gets cast in stone; as I was saying there's at least one piece of
prior art out there with three types of quotas. It isn't clear to me
that we'll ever care, but on the other hand the cost of not compiling
in the quota types isn't very high.

Especially since if the interface is really going to be proplib-based
they can just be arbitrary names.

   also the in-memory
   representation shouldn't have disk addresses in it (e.g. q2e_next),
   and in the syscall interface the types should be logical types, like
   uid_t, not sized types.
  
  Sure (although in the syscall interface, all of this are strings now :)

that doesn't exactly make any difference...

  struct quota2_entry isn't the real problem (it's not used much outside of
  filesystem), struct quota2_val is.

Sure.

   Also the structures in quota2.h are too
   hierarchical; a single entry should be an id type (user or group,
   maybe others), an id number, a quota type (block, inode, or other
   things; SGI xfs has three types), the hard and soft limits and
   configured grace period, and the current usage and current expiration
   time.
  
  The hierarchy in quota2.h reflects the proplib structure. A proplib
  quota entry has a type, and an associated array of entries. Each
  entry has an id (uid or gid depending on the type above) and array
  of values for this id.  Each value has a type, current usage,
  limits and grace times.

Well, yes, one of the problems with proplib is that it encourages
hierarchical structuring of things that should be relations.

ISTM that a bundle of quota records should be a single array of
tuples of the form I described above... at least in the canonical
format for communicating among system components.
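
Something along these lines, as a sketch only: every name here is
hypothetical, the id is a plain uint32_t so that it can hold either a
uid or a gid, and the time fields are assumed to be in seconds.

```c
#include <stddef.h>
#include <stdint.h>

enum q_idtype  { Q_IDTYPE_USER, Q_IDTYPE_GROUP };
enum q_objtype { Q_OBJTYPE_BLOCKS, Q_OBJTYPE_INODES };

/* One flat record per (id type, id, object type). */
struct quota_tuple {
	enum q_idtype	qt_idtype;
	uint32_t	qt_id;
	enum q_objtype	qt_objtype;
	uint64_t	qt_hardlimit;
	uint64_t	qt_softlimit;
	int64_t		qt_grace;	/* configured grace period */
	uint64_t	qt_usage;	/* current usage */
	int64_t		qt_expire;	/* current expiration time */
};

/* Linear search of a bundle; returns NULL if there is no such record. */
const struct quota_tuple *
quota_find(const struct quota_tuple *tab, size_t n,
    enum q_idtype idtype, uint32_t id, enum q_objtype objtype)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tab[i].qt_idtype == idtype && tab[i].qt_id == id &&
		    tab[i].qt_objtype == objtype)
			return &tab[i];
	}
	return NULL;
}
```

With a relation like this, adding a third quota type or a third id type
means adding an enum value, not another level of nesting.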

   (Maybe the configured limits and current state should be
   separate structures though
  
  I'm not sure.

I tend to think so because it allows separating the policy (which
might be independent of a given ID) from the operational statistics
(which can't be).

   And maybe the id information should be
   structured to allow ranges.)
  
  I'm not sure we should allow too much at this level. The tools can
  allow range if we want, but they should convert it to a list of discrete
  entries. don't do too much in the kernel.

If I have 80,000 accounts of which 40,000 are undergrads with the same
quota policy, it certainly makes sense to pass this to the kernel
(and, where possible, store it) as one record rather than replicating
it 40,000 times.
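
A sketch of what a range-based policy record could look like, assuming
hypothetical names and a "first matching record wins" rule (which is my
assumption, not a proposal for the final semantics):

```c
#include <stddef.h>
#include <stdint.h>

/* One policy record covering a contiguous range of ids. */
struct quota_policy {
	uint32_t qp_id_lo;	/* first id covered (inclusive) */
	uint32_t qp_id_hi;	/* last id covered (inclusive) */
	uint64_t qp_hardlimit;
	uint64_t qp_softlimit;
};

/* Return the first policy whose range contains id, or NULL. */
const struct quota_policy *
quota_policy_for(const struct quota_policy *tab, size_t n, uint32_t id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (id >= tab[i].qp_id_lo && id <= tab[i].qp_id_hi)
			return &tab[i];
	}
	return NULL;
}
```

The 40,000 undergrads then take one record instead of 40,000.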

I agree we shouldn't go off half-cocked and if we're going to set up a
policy language for quotas it should be designed with some care.
However, with the default stuff we're already moving in that direction
so it seems like a logical step.

   But at a minimum it's like struct dirent; it's not particularly
   different from the FFS on-disk structure, but it needs to be its own
   thing because it plays a different role.
  
  I agree. What I don't get is how to split it to avoid too much extra
  code which would just be a non-optimised memcpy.

We're talking about like a dozen lines of code, far less than the
proplist decoding that has to be replicated in every FS.
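
Roughly like the following sketch. Both structures and all field names
are hypothetical stand-ins (not the real quota2 definitions), and byte
order handling is omitted:

```c
#include <stdint.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical on-disk record: sized types plus a disk-address link. */
struct disk_qentry {
	uint64_t dq_hardlimit;
	uint64_t dq_softlimit;
	uint64_t dq_current;
	int64_t	 dq_time;
	uint64_t dq_next;	/* disk address of the next entry */
	uint32_t dq_uid;
};

/* Hypothetical in-memory record: logical types, no disk addresses. */
struct mem_qentry {
	uid_t	 mq_uid;
	uint64_t mq_hardlimit;
	uint64_t mq_softlimit;
	uint64_t mq_current;
	time_t	 mq_time;
};

/* Convert on-disk to in-memory; dq_next is intentionally dropped. */
void
qentry_from_disk(struct mem_qentry *mq, const struct disk_qentry *dq)
{
	mq->mq_uid = (uid_t)dq->dq_uid;
	mq->mq_hardlimit = dq->dq_hardlimit;
	mq->mq_softlimit = dq->dq_softlimit;
	mq->mq_current = dq->dq_current;
	mq->mq_time = (time_t)dq->dq_time;
}
```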

   Currently it looks like any fs that wants to implement quotas has to
   cut and paste quota2_prop.c. Surely the proplib gunk can be decoded
   fs-independently?
  
  Parts of quota2_prop.c can, I guess. For the part in ufs_quota.c,
  I'm not sure.

If quota2_val and friends can really be an fs-independent interface,
then I don't see that there's any value to passing the proplib bundle
to the FS. If they can't... then the data structures should be
strengthened.

Since we are not going to replicate the userland quota tools for every
different FS type that has quotas, the interface *they* talk has to be
FS-independent.

Even if there's some reason it needs to be proplib-based it's still a
proplib encoding of some physical data structure, and (especially in
the absence of any kind of proplib schemas) it would be helpful to
have that structure clearly defined somewhere.

-- 
David A. Holland
dholl...@netbsd.org

Re: libquota proposal

2011-03-22 Thread David Holland

On Mon, Mar 21, 2011 at 02:21:26PM +0100, Manuel Bouyer wrote:
   (also, edquota and repquota seem fs-independent to me...)
  
  no, they're not: they can directly the quota1 file specified in the
  fstab if quotactl fails or the filesystem is not mounted.

That's a bug, or more accurately legacy behavior that doesn't need to
be supported. Once upon a time (IIRC) df used to fall back to opening
the block device and examining ffs structures directly; that was
removed because it violated desirable abstractions.

-- 
David A. Holland
dholl...@netbsd.org

Re: libquota proposal

2011-03-23 Thread David Holland

(more context restored)
On Wed, Mar 23, 2011 at 09:51:48AM +0100, Manuel Bouyer wrote:
  (also, edquota and repquota seem fs-independent to me...)
  
  no, they're not: they can directly the quota1 file specified in the
  fstab if quotactl fails or the filesystem is not mounted.
  
  That's a bug, or more accurately legacy behavior that doesn't need to
  be supported. 
  
  of course it's not nice. But we're talking about existing code calling the
  legacy quotactl. If we're going to change it to not check the fstab
  options any more, we may as well change it to use libquota.
  
  I don't understand - surely edquota and repquota go through your
  proplib interface now?
  
  We were talking about code like netatalk, which is why I propose
  a public library for this.

Uh, now I really don't understand.

-- 
David A. Holland
dholl...@netbsd.org

Re: libquota proposal

2011-03-23 Thread David Holland

On Wed, Mar 23, 2011 at 09:50:16AM +0100, Manuel Bouyer wrote:
  On Wed, Mar 23, 2011 at 03:44:53AM +, David Holland wrote:
   On Tue, Mar 22, 2011 at 05:41:52PM +0100, Manuel Bouyer wrote:
  |   (also, edquota and repquota seem fs-independent to me...)
  |  
  |  no, they're not: they can directly the quota1 file specified in the
  |  fstab if quotactl fails or the filesystem is not mounted.
  | 
  | That's a bug, or more accurately legacy behavior that doesn't need to
  | be supported. Once upon a time (IIRC) df used to fall back to opening
  | the block device and examining ffs structures directly; that was
  | removed because it violated desirable abstractions.
  
  Totally agree, please remove this complex and hard to maintain stuff.
 
 Once again: this needs to be supported for transition, up to 6.0
 (inclusive).
   
   No, it doesn't. Even before you touched anything, they were only
   scribbling directly as a fallback if the kernel operations failed.
   The kernel operations should not fail in any case where scribbling
   directly makes sense; furthermore there's no need at all to deal with
   the case where the fs isn't mounted.
  
  repquota at least needs them: it doesn't have any way to get a list
  of quotas otherwise

That sounds like a bug.

  (and it's also part of the migration to quota2,
  with repquota -x).

...wait, we're exposing the plists directly to the user?

Shouldn't the migration be a single transparent tunefs operation?

   In the new world order all userland quota operations go through the
   kernel interface so they can interact successfully with filesystems
   using either the old or new quota layouts, or with new filesystems
   that may have their own different quota layouts, like zfs or whatever
   else. Right?
  
  right. Except that the getall command is not supported for quota1,
  repquota does the job itself.

uh, why not? that *is* a bug.

-- 
David A. Holland
dholl...@netbsd.org

Re: Decomposing vfs_subr.c

2011-03-23 Thread David Holland

On Wed, Mar 23, 2011 at 02:18:55PM +, Mindaugas Rasiukevicius wrote:
I would like to split-off parts of vfs_subr.c into vfs_node.c * and
vfs_mount.c modules.  Decomposing should hopefully bring some better
abstraction, as well as make it easier to work with VFS subsystem.

Any objections?
   
   Sounds good to me.  Some comments:
   
   - I think it should be vfs_vnode.c?
  
  OK, unless somebody will come up with a better name.

Since AIUI from chat this is going to contain the vnode lifecycle
code and not e.g. stuff like vn_lock, I think I'd prefer vfs_vncache.c.
But, vfs_vnode.c is definitely better than vfs_node.c.

   - Random thought: some day it would be nice to dump all the syscall code
 into its own directory.
  
  Speaking of structural clean ups - I am thinking about moving vfs_*.c
  into a separate src/sys/vfs directory.  Given that clean code history
  of vfs_subr.c is already damaged (*cough*pooka*cough*) and decomposing
  will do more - it might be worth going all the way.

Well, forcibly moving vfs_lookup.c right now (or anytime in the near
future) would be a bad idea, so let's not. After that stuff
stabilizes, perhaps we can. Though I'd kind of prefer having real
rename support before launching on major reorgs.

-- 
David A. Holland
dholl...@netbsd.org

Re: reading non-standard floppy formats

2011-04-29 Thread David Holland

On Thu, Apr 28, 2011 at 12:02:51PM +0200, Edgar Fuß wrote:
   Is there a saner way of reading non-standard (e.g., 10 sectors per track)
   floppies than either
   a) building a custom kernel with modified fd_types in sys/dev/isa/fd.c
   b) writing a user-space program that sets the appropriate parameters with
   FDIOCSETFORMAT and then, holding the device open, writes the raw floppy
   data to a file?
  No-one?
  
  Does this mean there is no saner way or am I missing something so obvious
  that no-one wants to answer?

More likely than either: nobody knows.

-- 
David A. Holland
dholl...@netbsd.org

Re: NFS server problems (lockup) on netbsd-5

2011-05-02 Thread David Holland

On Mon, May 02, 2011 at 03:23:48PM +0200, Manuel Bouyer wrote:
   unfortunately I don't have a core dump (I couldn't get one).
  
  And unfortunately it's not reproducible with a simple testbed (I've been
  trying for 3 days).
  I wonder if it could be related to the INRENAME change that has
  been pulled up ...

Unlikely - that exists only to help work around a protocol bug in puffs.

-- 
David A. Holland
dholl...@netbsd.org
