zfs not freeing znodes
I have been having problems with my netbsd-10 systems locking up. This has happened on at least 3 physical computers and 3 physical disks, so I'm as sure as I can be that it's not hardware. The systems are all netbsd-10 amd64, with / and /usr on ffs and bulk data on zfs, and they have had 4, 8, 24, or 32 GB of RAM, which from the zfs viewpoint ranges from marginal to considered enough.

The symptoms are that everything is fine for a time, and then it ends up in lockup: no keyboard effective to switch out of X, no ddb, nothing. Sometimes I catch it as it is deteriorating and have been able to get into ddb, and there are a bunch of processes in tstile, with underlying locks including flt_noram5 (from fallible memory). My guess is that I run low on memory and there is a locking bug (failure to release) in a rarely-taken path, perhaps trying to delete files in zfs when the system is out of RAM, or that sort of thing.

Things that tend to lead to higher odds of lockup are:

  - the daily cronjob running (8 GB machine w/o X)
  - leaving firefox open, especially with piggy js tabs
  - running pkgsrc builds
  - anything that deals with very large numbers of files

I have adjusted zfs's target allocations to use less RAM, basing them on total RAM. In theory these would be sysctls anyway. One thing I figured out before is that zfs's approach to respecting RAM limits is to go ahead and allocate when requested and to have a background thread free things. This can result in going way over, and I think it makes something like "untar this huge bunch of files into zfs" put memory pressure on the rest of the system.

On a 32 GB machine, the lockups got more frequent (I can't rule out a graphics card failure unrelated to the memory problem), so I started looking harder. I ran vmstat -m before and after doing cvs updates in NetBSD checkouts (I have them for 9, 10 and current), and in pkgsrc. I noticed that "dnode_t" showed a large number of requests and pages and *no releases*.
An example is:

  - 1847937 requests
  - 307990 pages

That's 1203 MB in dnodes (which are 632 bytes each). But the concerning thing is that every time I did an update of a tree (different ones), the dnode allocation rose and I saw no frees.

I then remembered that I had bumped up kern.maxvnodes long ago, before I was even using zfs, because a netbsd or pkgsrc tree was not fitting in the cache. maxvnodes was at about 1.6M. This seemed big, and I set it to 500K. Then additional "call stat on this huge bunch of files" runs did not result in new allocations, but I didn't see any frees. This was all yesterday or Tuesday. This morning there are 1592742 releases, but no pages have been released. I went to see what I was setting maxvnodes to, and it seems I removed that setting long ago, probably when I upgraded my main machine from an 8G box to a 24G box, or earlier.

This all leaves a lot of questions:

  - Obviously I need to read the zfs code to see what dnode is being used for (am guessing it's "disk node", the on-disk info backing a vnode), how the number of allocations is controlled, and how they are freed.
  - 1.6M vnodes on a system with 32G of RAM seems like it should be ok. I plan to set it to 500K on boot and see if that avoids lockups.
  - zfs's "background free, don't sleep processes that are over" strategy seems kind of risky. I can see "if mildly over, let the background free deal with it", but if a process is just allocating as fast as it can, it seems able to run the system out of RAM. There remains the question of what happens when there isn't RAM left to allocate from pools. I think it's highly likely there is a bug.

Things on my todo list to debug:

  - Set up a VM that has ZFS and see if I can make a repro recipe.
  - In the VM, also try DIAGNOSTIC/DEBUG/LOCKDEBUG.
  - Write code to capture vmstat output periodically and save it. I expect graphing this to be useful in understanding.

What I don't understand is why others aren't seeing this.
I do have settings to avoid having the file cache page out all my processes:

  # \todo Reconsider and document
  vm.filemin=5
  vm.filemax=10
  vm.anonmin=5
  vm.anonmax=80
  vm.execmin=5
  vm.execmax=50
  vm.bufcache=5

But this is, I believe, pretty normal among netbsd users. However, the other logical machine, which is a Xen dom0, has a stock sysctl.conf, and it would reliably crash on the daily cron with 4 GB of RAM, but stay up when running GENERIC with the full 8 GB.
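The "capture vmstat output periodically" item from my todo list could be sketched roughly like this. The function name, directory layout, and intervals are just placeholders; this is a sampling loop, not a finished tool:

```shell
# Rough sketch: sample "vmstat -m" every $2 seconds into timestamped
# files under $1, taking $3 samples.  A count of 0 samples could be
# made to mean "run until killed" if you drop the count check.
capture_pool_stats() {
    dir=$1; interval=$2; count=$3
    mkdir -p "$dir" || return 1
    n=0
    while [ "$n" -lt "$count" ]; do
        # One file per sample; the $n suffix keeps same-second samples distinct.
        vmstat -m > "$dir/$(date +%Y%m%dT%H%M%S).$n" 2>&1
        n=$((n + 1))
        sleep "$interval"
    done
}

# e.g. a day of 5-minute samples, started from cron or a tmux session:
# capture_pool_stats /var/log/vmstat-m 300 288
```

Diffing or graphing the per-pool "Requests"/"Releases" columns across these files should show whether dnode_t keeps growing.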
Re: config: conditional at clause
Valery Ushakov writes:

> I'm still not entirely sure, why XEN kernels include GENERIC.local in
> the first place though.  If one needs a fragile maze of includes just
> to avoid a few lines being copy-pasted, that doesn't feel like a win.

Having FOO include FOO.local is fine, but we still need a local include file that is included by substantially every kernel. The point is that if I decide that on my systems I want some device that isn't on by default, or want to tweak something, and I want it on everything, it should be easy. In particular, I think it's normal to flip back and forth between GENERIC and XEN3_DOM0 on the same machine, and those feel like they are doing the same thing, modulo expecting a hypervisor. Arguably, XEN3_DOM0 should just include GENERIC and then have any dom0 stuff added and some "no" stuff. This feels like "we shouldn't have XEN kernels include what they have included for years, because we found a latent config bug".
Re: vio9p vs. GENERIC.local vs. XEN3_DOM[0U]
Mouse writes:

> My answer is, error-checking.  If I, say, typo "pci" as "cpi" in
>
>	mydev* at cpi?
>
> I'd want an error rather than having the line silently ignored.  (That
> particular typo is not all that plausible.  It's just an example.)
>
> Now, if virtio were specifically declared as "this name is valid but
> may or may not be present"?  I'm on the fence.
>
> If virtio were declared normally in the kernels that provide it and
> declared as valid but specifically absent in XEN3_DOM* kernels?  Then I
> think that's what I'd want (to my limited understanding, this is close
> to what "no virtio" does at present).

A fair point, but are you suggesting, as a general approach, that every bus that could ever exist be declared and that all other kernels have "no" entries?
Re: vio9p vs. GENERIC.local vs. XEN3_DOM[0U]
Christoph Badura writes:

> Currently the vio9p driver is commented out in {i386,amd64}/conf/GENERIC:
> #vio9p* 	at virtio?		# Virtio 9P device
>
> The obvious way to enable that is by adding a line to GENERIC.local:
> vio9p* at virtio?
>
> But doing so breaks the builds of the XEN3_DOM? kernels like so
> sys/arch/amd64/conf/GENERIC.local:1: `vio9p* at virtio?' is orphaned
> (nothing matching `virtio?' found)
> because
> $ grep cinclude {i386,amd64}/conf/*XEN3*
> i386/conf/XEN3PAE_DOM0:cinclude "arch/i386/conf/GENERIC.local"
> i386/conf/XEN3PAE_DOM0:cinclude "arch/i386/conf/XEN3_DOM0.local"
> i386/conf/XEN3PAE_DOMU:cinclude "arch/i386/conf/GENERIC.local"
> i386/conf/XEN3PAE_DOMU:cinclude "arch/i386/conf/XEN3_DOMU.local"
> amd64/conf/XEN3_DOM0:cinclude "arch/amd64/conf/GENERIC.local"
> amd64/conf/XEN3_DOMU:cinclude "arch/amd64/conf/GENERIC.local"
> amd64/conf/XEN3_DOMU:cinclude "arch/amd64/conf/XEN3_DOMU.local"
>
> This is extremely annoying, as that breaks "build.sh release" because that
> builds the XEN3 kernels.  And it prevents us from enabling vio9p on x86
> kernels by default.
>
> The obvious and simplest fix is to make the XEN3 kernels stop including
> GENERIC.local.  (And make amd64 XEN3_DOM0 cinclude XEN3_DOM0.local as on
> i386.)

I don't think that's reasonable at all. GENERIC.local is for things you want in your kernels, and if it's something else, then it certainly belongs in XEN kernels. You have merely found that something is problematic because it assumes there is virtio.

> The less trivial fix is to conditionally attach vio9p in GENERIC.local.
> config(8) has "ifdef"/"ifndef" directives for that.  But they key on
> config attributes and I couldn't find an attribute that is only present in
> XEN kernels.
>
> Now, Someone(TM) could probably go into config(8) and add a way to
> conditionalize on flags.  But that is way more work and, IMHO, tackling
> the problem at the wrong level of abstraction.
The right level of abstraction is to do something that says: if there is a virtio bus, add "vio9p* at virtio?". And this is true of pretty much anything that attaches to a bus that may or may not be present. I wonder if there are good reasons to avoid "just skip lines that refer to a bus that doesn't exist".

> It seems to me that the best way to remedy the situation is to make the
> XEN3 kernels not include GENERIC.local.

That will break a lot of other things, and many will find it very surprising.

> If people really want to include GENERIC.local they can do so in their
> XEN3_DOM?.local files or create a XEN3.local (or XEN3.common.local or
> whatever) that is included from them.

If you really want this, you can just add it to GENERIC. That seems better than asking the rest of the world to change. But seriously, I don't see why making XEN not include GENERIC.local is better than having people who want bus-specific things put them someplace else. You could also add a GENERIC.local.virtio and include it from all kernels that have virtio.
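Concretely, that last suggestion might look something like this (file name and paths are hypothetical; the point is only that the fragment is cincluded solely from kernels that actually configure virtio, so its lines can never be orphaned the way GENERIC.local lines can):

```
# arch/amd64/conf/GENERIC.local.virtio (hypothetical name):
vio9p*	at virtio?		# Virtio 9P device

# ...and in GENERIC and every other kernel config that has virtio:
cinclude "arch/amd64/conf/GENERIC.local.virtio"
```

This keeps the existing GENERIC.local semantics untouched for everything else.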
Re: sysmon(4) messages
Valery Ushakov writes:

I agree this is confusing.

> Would it make more sense to change this to either
>
>     $dev: $sensor changed to $state
>
> and hope for the best (at least now there are no extra nouns that come
> from the message template), or make things verbose and very explicit
> with something like
>
>     $dev: sensor '$sensor' changed state to '$state'

That seems like a big improvement. I think I prefer

    $dev: sensor '$sensor' changed to '$state'

as a sensor's value is implicitly a state, and fewer words are better. But I do like the leading "sensor " as likely to reduce confusion.
Re: strange zfs size allocation data
Martin Husemann writes:

> On Sun, Jul 07, 2024 at 11:32:54PM -0400, Mouse wrote:
>> Is bup zfs-specific?  Because, if you're not doing something
>> filesystem-specific, I actually think you will have trouble even
>> _defining_ what "100% right" is for this test, since everything about
>> sparseness, right down to whether it's even a thing, is
>> filesystem-dependent.
>
> Indeed.  Try running the test on a tmpfs or msdosfs for example and you
> should see the test reliably fail.

It does fail on tmpfs. I see this as a bug; it means that files written sparsely might not fit, when I'd expect them to. I can certainly understand that the person who wrote tmpfs didn't get to this, but given the long history of sparse support in the standard filesystem, I weakly sort it into "bug" rather than "feature I'd expect but is missing". As for msdosfs, I am not surprised; that's a foreign fs with its own format and semantics -- and one that is viewed as old and primitive.
Re: strange zfs size allocation data
Mouse writes:

>> This is a test case, to see if backing up and restoring a sparse file
>> results in a sparse file.  I realize that this probably requires a
>> logging fuse driver and a lot of complexity to do 100% right.
>
> Is bup zfs-specific?  Because, if you're not doing something

No, it is a general backup program. I just happen to have the sources for it on zfs -- which people tell me is a great filesystem, and which is now not odd on NetBSD.

> filesystem-specific, I actually think you will have trouble even
> _defining_ what "100% right" is for this test, since everything about
> sparseness, right down to whether it's even a thing, is
> filesystem-dependent.

True. The point is to try to verify that the backup program, when restoring a sparse file, writes it in such a way that the normal implementation of sparse files works, meaning it results in a file without blocks storing all the zeros. What you are missing, and everybody else too, is that the fact that this is theoretically impossible is irrelevant to it being useful in the real world to detect regressions, even if it also occasionally detects bizarre behavior. A better test would be a 'fuse-sparsetest' that makes metadata about the writes it sees available for inspection later. But that's hard to write.
Re: strange zfs size allocation data
Taylor R Campbell writes:

>> Date: Sun, 07 Jul 2024 14:07:40 -0400
>> From: Greg Troxel
>>
>> I ran into a test failure with bup, where it was restoring a sparse file
>> and trying to validate the resulting disk usage.  It turns out that on
>> zfs (NetBSD 10), when you write a file, it shows as using 1 block and
>> then some seconds later shows as using the right amount.
>
> zfs's struct stat::st_blocks (i.e., struct vattr::va_bytes/va_blksize,
> roughly) gives the number of blocks actually allocated on disk for the
> file, plus 1 for some metadata.

Ah, this is what I sort of suspected.

> Before a newly written file is synced to disk, when it still exists
> only in memory, it doesn't have any space allocated on disk for it
> (though I expect if you hit the logical reservation limit, you'll see
> a write error earlier).

This feels like a zfs bug to me. Yes, the blocks are not allocated on disk, but they are essentially reserved, and logically it is as if they are used up. It's a mere artifact of cache coherency that they are not actually allocated on disk yet.

> Every 5sec, the system syncs the file system (I forget where this
> comes from, whether it's a zfs thing or a NetBSD syncer thing), which
> explains why within a couple of seconds you see du(1) output change.

That's kind of fast vs the old 30s, but it explains it.

> I bet if you fsync just the file you created, or use dd oflag=sync or
> oflag=dsync, you will stop seeing the delay.

I see. I wonder if that is wise, or if the test should just wait 10s. The question is making the test fast vs keeping its impact low. Given your answer, I'd expect this on any zfs filesystem; this seems not to be a NetBSD thing. I just found this, but it doesn't get into "blocks not allocated yet":

  https://github.com/openzfs/zfs/discussions/11533
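For what it's worth, the oflag=sync suggestion is easy to check with a variant of my earlier script. File names are arbitrary, bs is spelled out in bytes so it works with both BSD and GNU dd, and the expected du behavior on zfs is my assumption from this thread, not something I've verified:

```shell
# Sketch: write an 11 MB file that is 10 MB of hole plus 1 MB of data,
# once normally and once with oflag=sync, then compare du output.
# On zfs the first would presumably show ~1 block until the next txg
# sync, while the synchronous one should show ~1 MB immediately.
dd if=/dev/urandom of=seektest seek=10 bs=1048576 count=1 2>/dev/null
du -k seektest
dd if=/dev/urandom of=seektest.sync seek=10 bs=1048576 count=1 oflag=sync 2>/dev/null
du -k seektest.sync
```

If the second du is immediately ~1 MB, the test could fsync and avoid the 10s wait, at the cost of forcing I/O the real workload wouldn't do.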
Re: strange zfs size allocation data
Thor Lancelot Simon writes:

> On Sun, Jul 07, 2024 at 02:07:40PM -0400, Greg Troxel wrote:
>> I ran into a test failure with bup, where it was restoring a sparse file
>> and trying to validate the resulting disk usage.  It turns out that on
>> zfs (NetBSD 10), when you write a file, it shows as using 1 block and
>> then some seconds later shows as using the right amount.
>
> When you say "validate the resulting disk usage" and "using 1 block" what
> do you mean, exactly?  If the file is sparse, I can't see how there's any
> bug unless the wrong st_size is returned by stat() or the wrong length
> returned by lseek().

First, in actual operation, bup just does fs ops and there is no issue. This is a test case, to see if backing up and restoring a sparse file results in a sparse file. I realize that this probably requires a logging fuse driver and a lot of complexity to do 100% right.

What I have been doing as a proxy is the script below, which skips 10 MB and writes 1 MB. Since the file is sparse, one would expect about 1 MB of usage, not 11 MB, and not 1 block. Yes, this is not 100% reliable, but the point is to catch regressions where it results in 11 MB, or where the test is broken. So the test asks: is the amount of space at least the data we actually wrote? Is it well less than the sparse file's nominal length? That does not seem unreasonable, even if it is not 100% sound.

> du counts allocated blocks as reported by stat().  A sparse file might
> legitimately report 0, 1, or any other value, even values that exceed
> (st_size / st_blksize).  And the number of allocated blocks can absolutely

Yes, but a sparse file with 10 MB seeked over and 1 MB of legit urandom data more or less has to take up more than 1 block.

> change even while st_size stays the same - consider a filesystem with
> background deduplication or compression, both of which some variants of
> ZFS have, but ZFS is not the only filesystem with these features.

Sure, I get that zfs makes this hard.
> If bup is relying on some particular block allocation behavior, that seems
> like a bug.

It is only the tests, trying to catch problems. The actual operation tries to rely only on POSIX.
strange zfs size allocation data
I ran into a test failure with bup, where it was restoring a sparse file and trying to validate the resulting disk usage. It turns out that on zfs (NetBSD 10), when you write a file, it shows as using 1 block and then some seconds later shows as using the right amount. So:

  - why is it happening?
  - is this a bug?
  - if we think it's a bug, is it feasible to fix?

A simple program creates files, n empty megabytes followed by 1 real megabyte, for n in 0..9, and then runs 'du' on the file every 1s for 30s, not worrying about precise timing. I have a big ssd which is mostly a zfs type partition, and a pool with just that. Nothing fancy.

  #!/bin/sh
  for i in $(seq 0 9); do
      OUT=seek$i
      rm -rf ${OUT} ${OUT}.size
      dd if=/dev/urandom seek=$i bs=1m count=1 of=${OUT} 2> /dev/null
      for s in $(seq 0 30); do
          (echo -n "$s: "; du ${OUT}) >> $OUT.size
          sleep 1
      done
  done

This leads to ('head -6' shown, since that's sufficient to understand):

  ==> seek0.size <==
  0: 1       seek0
  1: 1       seek0
  2: 1       seek0
  3: 1027    seek0
  4: 1027    seek0
  5: 1027    seek0

  ==> seek1.size <==
  0: 1       seek1
  1: 1       seek1
  2: 1027    seek1
  3: 1027    seek1
  4: 1027    seek1
  5: 1027    seek1

  ==> seek2.size <==
  0: 1       seek2
  1: 1027    seek2
  2: 1027    seek2
  3: 1027    seek2
  4: 1027    seek2
  5: 1027    seek2

  ==> seek3.size <==
  0: 1       seek3
  1: 1       seek3
  2: 1       seek3
  3: 1       seek3
  4: 1027    seek3
  5: 1027    seek3

  ==> seek4.size <==
  0: 1       seek4
  1: 1       seek4
  2: 1       seek4
  3: 1027    seek4
  4: 1027    seek4
  5: 1027    seek4

  ==> seek5.size <==
  0: 1       seek5
  1: 1       seek5
  2: 1027    seek5
  3: 1027    seek5
  4: 1027    seek5
  5: 1027    seek5

  ==> seek6.size <==
  0: 1       seek6
  1: 1       seek6
  2: 1       seek6
  3: 1       seek6
  4: 1       seek6
  5: 1027    seek6

  ==> seek7.size <==
  0: 1       seek7
  1: 1       seek7
  2: 1       seek7
  3: 1       seek7
  4: 1027    seek7
  5: 1027    seek7

  ==> seek8.size <==
  0: 1       seek8
  1: 1       seek8
  2: 1       seek8
  3: 1027    seek8
  4: 1027    seek8
  5: 1027    seek8

  ==> seek9.size <==
  0: 1       seek9
  1: 1027    seek9
  2: 1027    seek9
  3: 1027    seek9
  4: 1027    seek9
  5: 1027    seek9
Re: hang in vcache_vget()
Emmanuel Dreyfus writes:

> Hello
>
> I experienced a system freeze on NetBSD-10.0/i386. Many processes
> waiting on tstile, and one waiting on vnode, with this backtrace:
> sleepq_block
> cv_wait
> vcache_vget
> vcache_get
> ufs_lookup
> VOP_LOOKUP
> lookup_once
> namei_tryemulroot.constprop.0
> namei
> vn_open
> do_open
> do_sys_openat
>
> I regret I did not take the time to show the vnode.
>
> Is it worth a PR? I have no clue if it can be reproduced.

I would say yes, it's worth it.

I have had hangs on 10/amd64, on a system with 32G of RAM. I have been blaming zfs, but my "never hangs" experience has been on 9/ufs. But others say zfs is fine. I just came across the "threads leak memory" problem pointed out by Brian Marcotte, and found a 17G gpg-agent. I now wonder if whatever is hanging is being provoked by running out of memory. Still a bug, but I no longer feel my "new problem" can be pointed at zfs.

Do you think your system had high memory pressure at the time of your crash?

(Sort of off topic, but you should know that because 32-bit computers are no longer manufactured, the rust project thinks you shouldn't be using them. Take it to the ewaste center right away!)
Re: poll(): IN/OUT vs {RD,WR}NORM
Johnny Billquist writes:

> POLLPRI      High priority data may be read without blocking.
>
> POLLRDBAND   Priority data may be read without blocking.
>
> POLLRDNORM   Normal data may be read without blocking.

Is this related to the "oob data" scheme in TCP (which is a hack that doesn't work)? Where do we attach 3 priority levels to data?
Re: Forcing a USB device to "ugen"
Jason Thorpe writes:

> I should be able to do this with OpenOCD (pkgsrc/devel/openocd), but
> libftdi1 fails to find the device because libusb1 only deals in
> "ugen".

Is that fundamental, in that ugen has ioctls that are ugen-ish that uftdi does not? I am guessing you thought about fixing libusb1.

> The desire to use "ugen" on "interface 1" is not a property of
> 0x0403,0x6010, it's really a property of
> "SecuringHardware.com","Tigard V1.1".  Unfortunately, there isn't a
> way to express that in the kernel config syntax.
>
> I think my only short-term option here is to, in uftdi_match(), specifically
> reject based on this criteria:
>
>  - VID == 0x0403
>  - PID == 0x6010
>  - interface number == 1
>  - vendor string == "SecuringHardware.com"
>  - product string == "Tigard V1.1"
>
> (It's never useful, on this particular board, to use the second port as a
> UART.)

That seems reasonable to me. It seems pretty unlikely to break other things.
Re: Polymorphic devices
Brad Spencer writes:

> I don't know just yet, but there might be an unwanted device reset with
> the "use the one you open" technique.  That is, you might have to reset
> the chip to change mode, and if you support, say, I2C and GPIO at the
> same time (which is possible), but then change to just GPIO, the chip
> has to be reset and that will disrupt any settings you might have set (I
> think; I am still working out what needs to happen with the mode
> switches).  This may not matter in the bigger picture, and it wouldn't
> matter as much if the mode switch was a sysctl, which one can say will
> reset the chip anyway.

Interesting complexity, but I'd say state the user has asked for should live in the driver, and if the driver has to write that state again on a mode switch, so be it. Generally if you open a device and close it, you don't have much grounds to expect things you did to persist to the next session, but devices have device-specific semantics anyway.
Re: Polymorphic devices
Brad Spencer writes:

> The first is enhancements to uftdi to support the MPSSE engine that some
> of the FTDI chip variants have.  This engine allows the chip to be an I2C
> bus or SPI bus and provides some simple GPIO, and a bunch of other stuff,
> as well as the normal USB UART.  It is not possible to use all of the
> modes at the same time.  That is, these are not separate devices, but
> modes within one device.  Or another way, depending on the mode of the
> chip you get different child devices attached to it.  I am curious
> what the thoughts are on how this might be modeled.

My reaction, without much thought, is to attach them all, to have the non-selected ones return ENXIO or similar, and to have another device on which you call an ioctl to choose which device to enable. Or perhaps to let you open any of them, flipping the mode, and to fail a second simultaneous open.
Re: Maxphys on -current?
Brian Buhrow writes:

> hello.  I know that this has been a very long term project, but I'm
> wondering about the status of this effort?  I note that FreeBSD-13 has
> a MAXPHYS value of 1048576 bytes.
> Have we found other ways to get more throughput from ATA disks that
> obviate the need for this setting which I'm not aware of?
> If not, is anyone working on this project?  The wiki page says the
> project is stalled.

I haven't heard that anyone is.

When you run dd with bs=64k and then bs=1m, how different are the results? (I believe raw requests happen accordingly, vs MAXPHYS for fs etc. access.)
Re: RFC: Native epoll syscalls
Mouse writes:

>> It is definitely a real problem that people write linuxy code that
>> seems unaware of POSIX and portability.
>
> While I feel a bit uncomfortable appearing to defend the practice (and,
> to be sure, it definitely can be a problem) - but, it's also one of the
> ways advancements happen: add an extension, use it, it turns out to be
> useful, it gets popular.
>
> I've done it myself (well, except for the "gets popular" part, which no
> one person can do alone): labeled control structure, AF_TIMER sockets,
> pidconn, validusershell, the list goes on.

Sure, but this is "there are several extensions, and write code that only uses the local one, even though it could have been written to use any". And perhaps "there are mechanisms which could have been adopted, but instead make up a third". And I really meant "seems unaware", not "made a deliberate decision, evidenced by a written design" :-)
Re: RFC: Native epoll syscalls
Martin Husemann writes:

> On Wed, Jun 21, 2023 at 01:50:47PM -0400, Theodore Preduta wrote:
>> There are two main benefits to adding native epoll syscalls:
>>
>> 1. They can be used to help port Linux software to NetBSD.
>
> Well, syscall numbers are cheap and plenty...
>
> The real question is: is it a useful and consistent API to have?
> At first sight it looks like a mix of kqueue and poll, and seems to be
> quite complex.

It is definitely a real problem that people write linuxy code that seems unaware of POSIX and portability. If we had native epoll, then that code could be built and used. That of course doesn't fix the portability issues, but it avoids them. It seems to me that if we have epoll emulation, it should not be that hard to also have it native, and I think the benefit of being able to run (natively) programs written unportably is significant.
Re: malloc(9) vs kmem(9) interfaces
Taylor R Campbell writes:

> Right, so the question is -- can we get the attribution _without_
> that?  Surely attribution itself is just a matter of some per-CPU
> counters.

Reading along, it strikes me there is a huge point implicit in your last sentence. I first thought of attribution as being able to tell what a particular allocated object is being used for. That requires state per object. However, you are talking about maintaining a count of objects by user, which is vastly cheaper and likely 90%+ as useful. So there are two notions: "object attribution" and "total usage attribution".
Re: LINEAR24 userland format in audio(4) - do we really want it?
nia writes:

> Unfortunately file formats are standardized but the
> way the audio APIs are implemented varies. :/
>
>> It's now no longer broken to handle 24bit WAV files.
>
> This is true, but audioplay is hardly the only
> consumer of the API and could easily be made to communicate
> with the kernel using 32-bit samples.
>
> What is the behaviour of everything in pkgsrc when thrown
> 24bit WAV files?

I'm not following. Are you saying that we should remove support from the kernel API for 24-bit linear, and that lots of stuff in pkgsrc should be fixed so it works better?
Re: USB-related panic in 8.2_STABLE
Timo Buhrmester writes:

> Apparently out of nothing, one of our servers panicked.
>
> uname -a gives:
>
> | NetBSD trave.math.uni-bonn.de 8.2_STABLE NetBSD 8.2_STABLE
> | (MI-Server) #17: Fri Jul 16 14:01:03 CEST 2021
> | supp...@trave.math.uni-bonn.de:/var/work/obj-8/sys/arch/amd64/compile/miserv
> | amd64

My impression is that there have been a lot of USB fixes since 8.

> I've transcribed the panic message and backtrace:
>
> | ohci0: 1 scheduling overruns
> | ugen0: detached
> | ugen0: at uhub4 port 2 (addr 2) disconnected
> | ugen0 at uhub4 port 2
> | ugen0: Phoenixtec Power (0x6da) USB Cable (V2.00) (0x02), rev 1.00/0.06, addr 2
> | uvm_fault(0xfe82574c2458, 0x0, 1) -> e
> | fatal page fault in supervisor mode
> | trap type 6 code 0 rip 0x802f627e cs 0x8 rflags 0x10246 cr2 0x2
> |   ilevel 6 (NB: could be ilevel 0 as well) rsp 0x80013f482c10
> | curlwp 0xfe83002b2000 pid 8393.1 lowest kstack 0x80013f4802c0
> | kernel: page fault trap, code=0
> | Stopped in pid 8393.1 (nutdrv_qx_usb) at netbsd:ugen_get_cdesc+0xb1:
> |   movzwl 2(%rax),%edx
> | db{2}> bt
> | ugen_get_cdesc() at netbsd:ugen_get_cdesc+0xb1
> | ugenioctl() at netbsd:ugenioctl+0x9a4
> | cdev_ioctl() at netbsd:cdev_ioctl+0xb4
> | VOP_IOCTL() at netbsd:VOP_IOCTL+0x54
> | vn_ioctl() at netbsd:vn_ioctl+0xa6
> | sys_ioctl() at netbsd:sys_ioctl+0x11a
> | syscall() at netbsd:syscall+0x1ec
> | --- syscall (number 54) ---
> | 7a73c9eff13a:
> | db{2}>
>
> Any idea what's going on?

It can always be hardware. (Even if one can argue that bad hardware should never lead to a panic.) I'm not saying it is, or is likely, but keep that in mind.

You didn't give timing. If this immediately followed the disconnect, it's perhaps a bug in ugen doing something after the device is gone. It may be that this bug has always been there and that normally the UPS doesn't disconnect, or you hit a bad race.

Try updating to 9 or 10 :-)
Re: crash in timerfd building pandoc / ghc94 related
PHO writes:

> On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
>> I encountered this on some version of 10.99.2 and last night again on
>> 10.99.2 from Friday morning.
>> This is an obvious blocker for me for making 9.4.4 the default.
>> I propose to either revert to the last version or make the default GHC
>> version setable.
>
> I wish I could do the latter, but unfortunately not all Haskell
> packages are buildable with 2 major versions of GHC at the same time
> (most are, but there are a few exceptions).
>
> Alternatively, I think I can patch GHC 9.4 so that it won't use
> timerfd.  It appears to be an optional feature after all; if its
> ./configure doesn't find timerfd it won't use it.  Let me try that.

If it's possible to only do this on NetBSD 10.99, that would be good. It seems so far, from not really paying attention, that there is nothing wrong with ghc but that there is a bug in the kernel. It would also be good to get a reproduction recipe without haskell.
Re: Enable to send packets on if_loop via bpf
Ryota Ozaki writes:

> In the specification DLT_NULL assumes a protocol family in the host
> byte order followed by a payload.  Interfaces of DLT_NULL use
> bpf_mtap_af to pass an mbuf prepending a protocol family.  All interfaces
> follow the spec and work well.
>
> OTOH, bpf_write to interfaces of DLT_NULL is a bit of a sad situation.
> Data written to an interface of DLT_NULL is treated as raw data
> (I don't know why); the data is passed to the interface's output routine
> as is, with dst (sa_family=AF_UNSPEC).  tun seems to be able
> to handle such raw data but the others can't handle the data (probably
> the data will be dropped, like if_loop).

Summarizing and commenting, to make sure I'm not confused: on receive/read, DLT_NULL prepends the AF in host byte order; on transmit/write, it just sends with AF_UNSPEC. This seems broken because it is asymmetric, and is bad because it throws away information that is hard to reliably recreate. On the other hand, this is for link-layer formats, and it seems that some interfaces have an AF that is not really part of what is transmitted, even though really it is. For example, tun is using an IP proto byte to specify the AF, and really this is part of the link protocol -- except we pretend it isn't.

> Correcting bpf_write to assume a prepending protocol family will
> save some interfaces like gif and gre but won't save others like stf
> and wg.  Even worse, the change may break existing users of tun
> that want to treat data as is (though I don't know if users exist).
>
> BTW, prepending a protocol family on tun is a different protocol from
> DLT_NULL of bpf.  tun has three protocol modes and doesn't always prepend
> a protocol family.  (And also the network byte order is used on tun
> as gert says, while DLT_NULL assumes the host byte order.)

Wow.

> So my fix will:
> - keep DLT_NULL of if_loop to not break bpf_mtap_af, and
> - unchange DLT_NULL handling in bpf_write except for if_loop, to not
>   bother existing users.
> The patch looks like this:
>
> @@ -447,6 +448,14 @@ bpf_movein(struct uio *uio, int linktype,
>     uint64_t mtu, struct mbuf **mp,
>  		m0->m_len -= hlen;
>  	}
>
> +	if (linktype == DLT_NULL && ifp->if_type == IFT_LOOP) {
> +		uint32_t af;
> +		memcpy(&af, mtod(m0, void *), sizeof(af));
> +		sockp->sa_family = af;
> +		m0->m_data += sizeof(af);
> +		m0->m_len -= sizeof(af);
> +	}
> +
>  	*mp = m0;
>  	return (0);

That seems ok to me.

I think the long-term right fix is to define DLT_AF, which has an AF word in host order on receive and transmit always, and to modify interfaces to use it whenever they are AF-aware at all. In this case tun would fill in the AF word from the IP proto field, and you'd get a transformed/regularized AF word when really the "link layer packet" had the IP proto field. But that's ok, as it's just cleanup, and reversible.
Re: Enable to send packets on if_loop via bpf
Ryota Ozaki writes: > NetBSD can't do this because a loopback interface > registers itself to bpf as DLT_NULL and bpf treats > packets being sent over the interface as AF_UNSPEC. > Packets of AF_UNSPEC are just dropped by loopback > interfaces. > > FreeBSD and OpenBSD enable to do that by letting users > prepend a protocol family to a sending data. bpf (or if_loop) > extracts it and handles the packet as an extracted protocol > family. The following patch follows them (the implementation > is inspired by OpenBSD). > > http://www.netbsd.org/~ozaki-r/loop-bpf.patch > > The patch changes if_loop to register itself to bpf > as DLT_LOOP and bpf to handle a prepending protocol > family on bpf_write if a sender interface is DLT_LOOP. I am surprised that there is not already a DLT_foo that has this concept, an AF word followed by data. But I guess every interface already has a more-specific format. Looking at if_tun.c, I see DLT_NULL. This should have the same ability to write. I have forgotten the details of how tun encodes AF when transmitting, but I know you can have v4 or v6 inside, and tcpdump works now. So obviously I must be missing something. My suggestion is to look at the rest of the drivers that register DLT_NULL and see if they are amenable to the same fix, and choose a new DLT_SOMETHING that accommodates the broader situation. I am not demanding that you add features to the rest of the drivers. I am only asking that you think about the architectural issue of how the rest of them would be updated, so we don't end up with DLT_LOOP, DLT_TUN, and so on, where they all do almost the same thing, when they could be the same. I don't really have an opinion on host vs network for AF, but I think your choice of aligning with FreeBSD is reasonable. signature.asc Description: PGP signature
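To make the prepend-AF convention concrete, here is a userland sketch of building the buffer one would pass to write(2) on a bpf descriptor attached to such an interface (the open/BIOCSETIF steps are omitted, and the helper name is mine). DLT_NULL takes the AF word in host byte order; OpenBSD-style DLT_LOOP takes it in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl */
#include <sys/socket.h>  /* AF_INET */

/*
 * Build a frame for bpf_write: a 4-byte address-family word followed
 * by the raw packet.  net_order selects DLT_LOOP-style network byte
 * order; otherwise the DLT_NULL host-order convention is used.
 * Returns the total length, or 0 if dst is too small.
 */
static size_t
make_af_frame(unsigned char *dst, size_t dstlen, uint32_t af,
    int net_order, const unsigned char *pkt, size_t pktlen)
{
	uint32_t word = net_order ? htonl(af) : af;

	if (dstlen < sizeof(word) + pktlen)
		return 0;
	memcpy(dst, &word, sizeof(word));
	memcpy(dst + sizeof(word), pkt, pktlen);
	return sizeof(word) + pktlen;
}
```

The byte-order split is exactly the asymmetry discussed in the thread: a reader of DLT_NULL captures must know the capturing host's endianness, while DLT_LOOP captures are portable.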
Re: #pragma once
My quick reaction is that we should stick to actual standards, absent a really compelling case. This isn't compelling to me, and the point that linting for wrong usage isn't hard is a good one. I happen to be in the middle of a paper (from the guix crowd) about de-bootstrapping ocaml. It's about getting rid of binary bootstraps. That's a problem we also have in pkgsrc, but we haven't issued a manifesto. While it might seem tangential, the de-bootstrapping world often wants to compile older code with older tools to construct a build graph that starts from as little binary as possible. Thus, "newer compilers all do this" is a bit scary, as while that's what people usually use, it's more comfortable to say "we need C99 plus X" for as little X as possible. signature.asc Description: PGP signature
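For contrast, the standards-only mechanism the thread is implicitly comparing against, the classic include guard, can be shown in one file by spelling out both "inclusions" (FOO_H is a made-up guard name; real headers derive it from the file name):

```c
#include <assert.h>

/* First "inclusion" of the guarded region: the guard macro is not yet
 * defined, so the body is compiled. */
#ifndef FOO_H
#define FOO_H
#define FOO_BODY_COMPILED 1
#endif

/* Second "inclusion": the guard is already defined, so the body is
 * skipped; if it were not, this would be a compile-time error. */
#ifndef FOO_H
#error "include guard failed; body would be compiled twice"
#endif

int
foo_body_compiled(void)
{
	return FOO_BODY_COMPILED;
}
```

This works with any C compiler back to C89, which is exactly the property the de-bootstrapping argument cares about.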
Re: Can version bump up to 9.99.100?
David Holland writes: > On Fri, Sep 16, 2022 at 07:00:23PM +0700, Robert Elz wrote: > > That is, except for in pkgsrc, which is why I still > > have a (very mild) concern about that one - it actually compares the > > version numbers using its (until it gets changed) "Dewey" comparison > > routines, and for those, 9.99.100 is uncharted territory. > > No, it's not, pkgsrc-Dewey is well defined on arbitrarily large > numbers. In fact, that's in some sense the whole point of it relative > to using fixed-width fields. And, surely we had 9.99.9 and 9.99.10. The third digit is no more special than the second. It's just that it happens less often so the problem of arguably incorrectly written two-digit patterns is more likely than for that to happen with one. It's not reasonable to constrain a normal process because other bugs might exist. signature.asc Description: PGP signature
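The point that arbitrarily large components compare fine can be sketched in a few lines. This mimics the idea of a Dewey-style comparison, not pkgsrc's actual code: strip leading zeros, then compare each dotted component by length first and lexicographically second, so "100" sorts after "99" with no fixed-width assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare dotted version strings component by component, treating each
 * component as an unsigned integer of unbounded size.  Returns -1, 0,
 * or 1.  A sketch of the concept, not pkgsrc's implementation. */
static int
dewey_cmp(const char *a, const char *b)
{
	while (*a || *b) {
		size_t la = strcspn(a, ".");
		size_t lb = strcspn(b, ".");
		int c;

		/* strip leading zeros so "09" compares equal to "9" */
		while (la > 1 && *a == '0') { a++; la--; }
		while (lb > 1 && *b == '0') { b++; lb--; }
		if (la != lb)
			return la < lb ? -1 : 1;
		c = strncmp(a, b, la);
		if (c != 0)
			return c < 0 ? -1 : 1;
		a += la; if (*a == '.') a++;
		b += lb; if (*b == '.') b++;
	}
	return 0;
}
```

With this, 9.99.100 > 9.99.99 for exactly the same reason 9.99.10 > 9.99.9: the longer numeric component wins.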
Re: Module autounload proposal: opt-in, not opt-out
Paul Goyette writes: > (personal note) > It really seems to me that the current module sub-systems is at > best a second-class capability. I often get the feeling that > others don't really care about modules, until it's the only way > to provide something else (dtrace). This proposal feels like > another nail in the modular coffin. Rather than disabling (part > of) the module feature, we should find ways to improve testing > the feature. I'd just like to say that while I haven't gone down the "modules first" path, I have been watching your commits and cheering you on. I do use a few modules, and this is making me think I should try to run MODULAR, especially on machines with less memory. I'm a little scared of not even having UFS, but I can try it as the low-memory machine is not important. signature.asc Description: PGP signature
Re: Module autounload proposal: opt-in, not opt-out
Martin Husemann writes: > I think that all modules that we deliver and declare safe for *autoload* > should require to be actually tested, and a basic test (for me) includes > testing auto-unload. That does not cover races that slip through "casual" > testing, but should have caught the worst bugs. That's a reasonable position for adding modules, but > So the error in the cases you stumbled in is the autoload and keeping the > badly tested module autoloadable but forbid its unloading sounds a bit > strange to me. Given where we are, do you really mean we should withdraw every module from autoload that does not have a documented test result, right now? It seems far better to have them stay loaded than be unavailable. signature.asc Description: PGP signature
Re: New iwn firmware & upgrade procedure
Havard Eidnes writes: >> A quick skim of /libdata/firmware makes me think it is mostly not >> versioned. > > Really? I suspect all the if_iwn files are versioned; if it > follows the pattern for iwlwifi-6000g2a-5, the number behind the > last hyphen is the version number. Look at all the devices, not just if_iwn. signature.asc Description: PGP signature
Re: New iwn firmware & upgrade procedure
Havard Eidnes writes: > 1) Could the if_iwn driver fall back to using the 6000g2a-5 microcode >without any code changes? (My gut feeling says "yes", but I have >no existence proof of that.) Unless it's really necessary (ABI change in accessing device with new firmware), it seems that the firmware should just be named for the device and not have the firmware version. Thus you'd get the version you have in tree, and that might be a little old. Alternatively there could be a symlink. But I don't understand why it is versioned. > Should the wireless firmware go into a different set which we also > learn the habit of extracting before reboot of the kernel? If the versioning is really intractable and frequent, perhaps, but I think this can be 99% solved by not putting firmware versions in filenames. A quick skim of /libdata/firmware makes me think it is mostly not versioned. signature.asc Description: PGP signature
Re: Slightly off topic, question about git
David Brownlee writes: > I suspect most of this also works with s/git/hg/ assuming NetBSD > switches to a mercurial repo Indeed, all of this is not really about git. Systems in the class of "distributed VCS" have two important properties:

- commits are atomic across the repo, not per file
- anyone can prepare commits, whether or not they are authorized to apply them to the repo; an authorized person can apply someone else's commit.

These more or less lead to "local copy of the repo". And there are web tools for people who just want to look at something occasionally. But I find that it's not that big, that right now I have 3 copies (8, 9, current), and that it's nice to be able to do things offline (browse, diff, commit). CVS is really just RCS with

- organization into groups of files
- ability to operate over ssh (rsh originally :-)

That was really great in 1994; I remember what a big advance it was (seriously). signature.asc Description: PGP signature
Re: Memory corruption after fork, only on AMD CPUs
co...@sdf.org writes: > There appears to be a memory corruption bug that only happens on AMD > CPUs running NetBSD (or OpenBSD). The same code doesn't fail on Intel. > This affects Go and they've made some bug reports investigating it[1][2]. > > People have narrowed it down to this simple Go reproducer > (install lang/go117 to run it). This is probably unrelated, but I've been running syncthing for a long time with no issues. syncthing built with go117 crashes at startup on an early 2011 Macbook Pro, and the same syncthing built with go116 works fine. However, this system has an Intel CPU: Processor Name: Intel Core i7 Processor Speed: 2.2 GHz Number of Processors: 1 Total Number of Cores: 4 signature.asc Description: PGP signature
Re: wsvt25 backspace key should match terminfo definition
RVP writes: > On Tue, 23 Nov 2021, Michael van Elst wrote: > >> If you restrict yourself to PC hardware (i386/amd64 arch) then >> you probably have either >> >> a PS/2 keyboard -> the backspace key generates a DEL. >> a USB keyboard -> the backspace key generates a BS. >> >> That's something you cannot shoehorn into a single terminfo >> attribute and that's why many programs simply ignore terminfo >> here, in particular when you think about remote access. > > So, if I had a USB keyboard (don't have one to check right now), the > terminfo entry would be correct? How do we make this consistent then? > Have 2 terminfo entries: wsvt25-ps2 and wsvt25-usb (and fix-up getty > to set the correct one)? wscons is supposed to abstract all this, so making wsvt25-foo for different keyboard classes seems like the wrong approach. wskbd(4) says: • Mapping from keycodes (defined by the specific keyboard driver) to keysyms (hardware independent, defined in /usr/include/dev/wscons/wsksymdef.h). As uwe@ points out, the terms we use and the actual key labels are confusing. When I've talked about the DEL key, I've meant the key that the user types to delete backwards, almost always upper right and easily reachable when touch typing, and that in DEC tradition sent the DEL 0x7f character. It was pointed out that newer terminals have a backarrow logo, and I see that an IBM USB keyboard has that too. Then there's the BS key, which older (almost all actual?) terminals had, but my IBM USB keyboard doesn't have one, and my mac doesn't either. Looking in wsksymdef.h (netbsd-9, which is handy), we see "keysyms" which is what keycodes are supposed to map into, and it talks about them being aligned with ASCII. Relevant to this discussion there is

#define KS_BackSpace 0x08
#define KS_Delete    0x7f
#define KS_KP_Delete 0xf29f

So that's for BS, DEL (to use ASCII) and the extended keypad "delete right" introduced with I think the VT220. 
On my USB keyboard, in NetBSD 9 wscons without trying to mess with mappings, I get

backarrow (key where DEL should be) ==> BS (^H)
keypad Delete key (next to insert/home/end/pageup/pagedown) ==> DEL (^?)

and I see that stty has erase set to ^H. The underlying issue is that the norms of some systems are to map that "user wants to delete left easily reachable key" to BS and some want to map it to DEL. I see these as the PC tradition and the UNIX tradition. So I think NetBSD should decide that we're following the UNIX tradition that this key is DEL, have wskbd map it that way for all keyboard types, and have stty erase start out DEL. (Plus of course carrying this across ssh so cross-deletionism works, which I think is already the case.) A quick glance at wskbd and ukbd did not enlighten me. xev shows similar wrong X keysyms, BS and DEL for "backarrow" and "keypad delete". signature.asc Description: PGP signature
Re: wsvt25 backspace key should match terminfo definition
Valery Ushakov writes: > vt52 is different. I never used a real vt52 or a clone, but the > manual at vt100.net gives the following picture: > > https://vt100.net/docs/vt52-mm/figure3-1.html > > and the description > > https://vt100.net/docs/vt52-mm/chapter3.html#S3.1.2.3 > > Key Code Action Taken if Codes Are Echoed > BACK SPACE 010 Backspace (Cursor Left) function > DELETE 177 Nothing That is explaining what the terminal does when those codes are sent by the computer. That is a different thing from how the computer interprets user input. When using a VT52 on Seventh Edition, for example one pushed DEL to remove the previous character, and the computer would send "" to make it disappear and leave the cursor left. One basically never pushed BS. > vt100 had similar keyboard (again, never used a real one personally) > > https://vt100.net/docs/vt100-ug/chapter3.html#F3-2 > > BACKSPACE 010 Backspace function > DELETE 177 Ignored by the VT100 same as vt52, I think. > But vt200 and later use a different keyboard, lk201 (and i did use a > real vt220 a lot) > > https://vt100.net/docs/vt220-rm/figure3-1.html > > that picture is not very good, the one from the vt320 manual is better > > https://vt100.net/docs/vt320-uu/chapter3.html > > vt220 does NOT have a configuration option that selects the code that > the But somehow the official terminfo database has kbs=^H for vt220! > > Later it became configurable: > > https://vt100.net/docs/vt320-uu/chapter4.html#S4.13 > > For vt320 (where it *is* configurable) terminfo has > > $ infocmp -1 vt320 | grep kbs > kbs=^?, Very interesting! > >> I think the first thing to answer is "what is kbs in terminfo supposed >> to mean". > > X/Open Curses, Issue 7 doesn't explain, other than saying "backspace" > key, which is an unfortunate name, as it's loaded. But it's > sufficiently clear from the context that it's the key that deletes > backwards, i.e. deletes under. So it's the codes generated by the DEL key (as opposed to the Delete key). 
>> My other question is how kbs is used from terminfo. Is it about >> generating output sequences to move the active cursor one left? If so, >> it's right. Is it about "what should the user type to delete left", >> then for a vt52/vt220, that's wrong. If it is supposed to be both, >> that's an architectural bug as those aren't the same thing. > > No, k* capabilities are sequences generated by the terminal when some > key is pressed. The capability for the sequence sent to the the > terminal to move the cursor left one position is cub1 > > $ infocmp -1 vt220 | grep cub1 > cub1=^H, > kcub1=\E[D, > > (kcub1 is the sequence generated by the left arrow _k_ey). Then I'm convinced that kbs should be ^? for these terminals. signature.asc Description: PGP signature
Re: wsvt25 backspace key should match terminfo definition
Johnny Billquist writes: >> For vt320 (where it *is* configurable) terminfo has >> >> $ infocmp -1 vt320 | grep kbs >> kbs=^?, > > Which I think it should be. But what does kbs mean?

- the ASCII character sent by the computer to move the cursor left?
- the ASCII character sent by the BS key?
- the ASCII character sent by the DEL key that the user uses to delete left?

signature.asc Description: PGP signature
Re: wsvt25 backspace key should match terminfo definition
Valery Ushakov writes: > On Tue, Nov 23, 2021 at 00:01:40 +, RVP wrote: > >> On Tue, 23 Nov 2021, Johnny Billquist wrote: >> >> > If something pretends to be a VT220, then the key that deletes >> > characters to the left should send DEL, not BS... >> > Just saying... >> >> That's fine with me too. As long as things are consistent. I suggested the >> kernel change because both terminfo definitions (and the FreeBSD console) >> go for ^H. > > Note that the pckbd_keydesc_us keymap maps the scancode of the <- key to > > KC(14), KS_Cmd_ResetEmul, KS_Delete, > > i.e. 0x7f (^?). > > terminfo is obviously incorrect here. Amazingly, the bug is actually > in vt220 description! wsvt25 just inherits from it: > > $ infocmp -1 vt220 | grep kbs > kbs=^H, > > I checked termcap.src from netbsd-4 and it's wrong there too. I have > no idea htf that could have happened. I think (memory is getting fuzzy) the problem is that the old terminals had a delete key, in the upper right, that users use to remove the previous character, and a BS key, upper left, that was actually a carriage control character. The basic problem is that in the PC world, the key where DEL should be has a backarrow and the PC world thinks it is backspace. That's the DEC-centric viewpoint of course :-) I think any change needs a careful proposal and review, because there are lots of opinions here and a change is likely to mess up a bunch of people's configs, even if they have worked around something broken. I don't mean "no changes", just that if you don't think this is a really hard problem you probably shouldn't change it (globally). Also /usr/include/sys/ttydefaults.h is about all of NetBSD on all sorts of hardware, not just PCs and there are lots of keyboards as well as actual terminals. Ever since we moved beyond ASR33, CERASE has been 0177 (my Unix use more or less began with a VT52 and a Beehive CRT). 
xterm has a config to say "make the key where DEL ought to be generate the key that the tty has configured as ERASE". I suspect that the right approach is 1) choose what wscons generates for the "key where DEL belongs" 2) have the tty set so that the choice in (1) is 'stty erase'. I see the same kbs=^H on vt52. I think the first thing to answer is "what is kbs in terminfo supposed to mean". My other question is how kbs is used from terminfo. Is it about generating output sequences to move the active cursor one left? If so, it's right. Is it about "what should the user type to delete left", then for a vt52/vt220, that's wrong. If it is supposed to be both, that's an architectural bug as those aren't the same thing. signature.asc Description: PGP signature
Re: timecounters
I think it makes sense to document them, and arguably each counter should have a man page, except for things that are somehow in timecounter(9) instead (if they don't have a device name?). signature.asc Description: PGP signature
Re: Representing a rotary encoder input device
What do other systems do? It strikes me that wsmouse feels like it is for things connected with the kbd/mouse/display world. To be cantankerous, using it seems a little bit like representing a GPIO input as a 1-button mouse that doesn't move. I would imagine that a rotary encoder is more likely to be a volume or level control, but perhaps not for the machine, perhaps just reported over MQTT so Home Assistant on some other machine can deal. If you are really talking about encoders hooked to gpio, then perhaps gpio should grow a facility to take N pins and say they are some kind of encoder and then have a gpio encoder abstraction. But maybe you are trying to use an encoder to add scroll to a 3-button mouse? signature.asc Description: PGP signature
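If gpio did grow an encoder abstraction, its core would be the standard quadrature state machine. This is a generic sketch, not tied to any NetBSD API: two pins are sampled, and a Gray-code transition table turns each state change into a step of -1, 0, or +1 (0 also swallows invalid/bouncy transitions).

```c
#include <assert.h>

/* Transition table for a 2-bit quadrature encoder: index is
 * (prev_state << 2) | new_state, value is the step taken. */
static const int quad_table[16] = {
	 0, +1, -1,  0,
	-1,  0,  0, +1,
	+1,  0,  0, -1,
	 0, -1, +1,  0,
};

struct quad {
	unsigned prev;	/* previous (A<<1)|B sample */
	long count;	/* accumulated position */
};

/* Feed one sample of the two pins; returns the step and updates the
 * running count. */
static int
quad_step(struct quad *q, unsigned a, unsigned b)
{
	unsigned st = ((a & 1) << 1) | (b & 1);
	int d = quad_table[(q->prev << 2) | st];

	q->prev = st;
	q->count += d;
	return d;
}
```

The accumulated count is what one would report upward, whether as a volume level, a scroll delta, or an MQTT message.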
Re: SCSI scanners
Julian Coleman writes: > Can we get rid of the SCSI scanner support as well? It only supports old > HP and Mustek scanners, and its functionality is superseded by SANE (which > sends the relevant SCSI commands from userland). If it's really the case that SANE works with these, then that seems ok. (I actually have a UMAX scsi scanner but haven't powered it on in years.) I wonder though if this is causing the kind of trouble that uscanner caused. signature.asc Description: PGP signature
Re: protect pmf from network drivers that don't provide if_stop
Martin Husemann writes: > On Tue, Jun 29, 2021 at 03:46:20PM +0930, Brett Lymn wrote: >> I turned up a fix I had put into my source tree a while back, I think at >> the time the wireless driver (urtwn IIRC) did not set an entry for >> if_stop. > > This is a driver bug, we should not work around it but catch it early > and fix it. So maybe KASSERT that stop exists, and then call it if non-NULL, so regular users don't crash, and DIAGNOSTIC does what DIAGNOSTIC is supposed to do? signature.asc Description: PGP signature
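The suggested shape can be modeled in a few lines. This is a userland sketch with a stand-in struct ifnet, not the kernel's real one: under DIAGNOSTIC, KASSERT catches the driver bug early; without it, the NULL check keeps release kernels from crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Model KASSERT: fatal under DIAGNOSTIC, a no-op otherwise. */
#ifdef DIAGNOSTIC
#define KASSERT(e)	assert(e)
#else
#define KASSERT(e)	((void)0)
#endif

/* Stand-in for the real struct ifnet, reduced to what the sketch needs. */
struct ifnet {
	void (*if_stop)(struct ifnet *, int);
	int stopped;
};

static void
if_stop_safe(struct ifnet *ifp, int disable)
{
	KASSERT(ifp->if_stop != NULL);	/* driver bug: catch it early */
	if (ifp->if_stop != NULL)	/* but don't crash production */
		ifp->if_stop(ifp, disable);
}

static void
demo_stop(struct ifnet *ifp, int disable)
{
	(void)disable;
	ifp->stopped = 1;
}
```

A driver that forgets if_stop panics immediately on a DIAGNOSTIC kernel, while regular users just skip the call.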
Re: regarding the changes to kernel entropy gathering
Thanks - that is useful information. I think the big point is that the new seed file is generated from urandom, not from the internal state, so the new seed doesn't leak internal state. The "save entropy" language didn't allow me to conclude that. Also, your explanation is about updating, but it doesn't address generation of a file for the first time. Presumably that just takes urandom without the old seed that isn't there and doesn't overwrite the old seed that isn't there. Interestingly, I have a machine running current, running as a dom0 sometimes, and haven't had problems. I now realize that's only because the machine had a seed file created under either 7 or 9 (installed 7, updated to 9, updated to current). So it has trusted, untrustworthy entropy (even though surely after all this time some of it must have been unobserved). signature.asc Description: PGP signature
Re: regarding the changes to kernel entropy gathering
Thor Lancelot Simon writes: > shuts down, again all entropy samples that have been added (which, again, > are accumulating in the per-cpu pools) are propagated to the global pool; > all the stream RNGs rekey themselves again; then the seed is extracted. It seems obvious to me that "extracting" the seed should be done in such a way that the state of the internal rng is still unpredictable from the saved seed, even if the state of the newly-booted rng will be predictable. Perhaps by pulling 256 bytes from urandom, perhaps by something more direct and then some sort of hash/rekey to get back traffic protection. Probably this is already done in a way much better thought out than my 30s reaction, the man page doesn't really say this, at least that I could follow; rndctl -S says "save entropy pool". signature.asc Description: PGP signature
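The property being asked for can be illustrated with a toy model (emphatically NOT NetBSD's actual code): derive the saved seed and the pool's next state from the old state through a one-way mix with different domain tags, so the seed file reveals neither the state left running nor future output. splitmix64's finalizer stands in for a real hash/PRF here; it is a 64-bit bijection, so distinct inputs are guaranteed distinct outputs.

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64-style one-way mix, standing in for a real hash. */
static uint64_t
mix64(uint64_t x)
{
	x += 0x9e3779b97f4a7c15ULL;
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

struct pool { uint64_t state; };

/* Produce a seed for the seed file and rekey the pool in one step.
 * The two outputs use different domain tags, so knowing the seed does
 * not give you the new state without inverting the mix. */
static uint64_t
extract_seed(struct pool *p)
{
	uint64_t old = p->state;

	p->state = mix64(old ^ 0x1ULL);	/* domain tag 1: rekey */
	return mix64(old ^ 0x2ULL);	/* domain tag 2: seed */
}
```

This is only the shape of the argument; the real design involves per-CPU pools, rekeying of the stream RNGs, and a proper cryptographic hash.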
Re: ZFS: time to drop Big Scary Warning
chris...@astron.com (Christos Zoulas) writes: > That's a good test, but how does zfs compare in for the same test with lets > say ffs or ext2fs (filesystems that offer persistence)? With the same system, booted in the same way, but with 3 different filesystems mounted on /tmp, I get similar numbers of failures:

tmpfs 12
ffs2  13
zfs   18

So tmpfs/ffs2 are ~equal and zfs has a few more failures (but it all looks a bit random and non-repeatable). So it's hard to sort out "zfs is buggy" vs "some tests fail in timing-related hard-to-understand ways and that seems provoked slightly more with /tmp on zfs". Did you mean something else? signature.asc Description: PGP signature
Re: ZFS: time to drop Big Scary Warning
I got a suggestion to run atf with a ZFS tmp. This is all with current from around March 1, and is straight current, no Xen. Creating tank0/tmp and having it be mounted on /tmp failed the mount (but created the volume) with some sort of "busy" error. I already had a tmpfs mounted. Rebooting, zfs got mounted and then tmpfs, and I unmounted tmpfs and then I have a zfs tmp. So not sure what's up but feels like a tmpfs issue more than a zfs issue, and not a big deal. Or maybe it's a feature that you can't mount over tmpfs. With /tmp being tmpfs, my results are similar to the releng runs. I've indented things that don't match two spaces. Failed test cases:

lib/libc/sys/t_futex_ops:futex_wait_timeout_deadline
lib/libc/sys/t_ptrace_waitid:syscall_signal_on_sce
lib/libc/sys/t_truncate:truncate_err
lib/librumpclient/t_exec:threxec
net/if_wg/t_misc:wg_rekey
usr.bin/cc/t_tsan_data_race:data_race
usr.bin/make/t_make:archive
usr.bin/c++/t_tsan_data_race:data_race
usr.sbin/cpuctl/t_cpuctl:nointr
usr.sbin/cpuctl/t_cpuctl:offline
fs/ffs/t_quotalimit:slimit_le_1_user
modules/t_x86_pte:rwx

Summary for 903 test programs: 9570 passed test cases. 12 failed test cases. 73 expected failed test cases. 530 skipped test cases.

With /tmp being zfs:tank0/tmp, I get Failed test cases:

./bin/cp/t_cp:file_to_file
./lib/libarchive/t_libarchive:libarchive
./lib/libc/stdlib/t_mktemp:mktemp_large_template
./lib/libc/sys/t_ptrace_waitid:syscall_signal_on_sce
./lib/libc/sys/t_stat:stat_chflags
./lib/libc/sys/t_truncate:truncate_err
./net/if_wg/t_misc:wg_rekey
./usr.bin/cc/t_tsan_data_race:data_race_pie
./usr.bin/make/t_make:archive
./usr.bin/ztest/t_ztest:assert
./usr.bin/c++/t_tsan_data_race:data_race
./usr.bin/c++/t_tsan_data_race:data_race_pie
./usr.sbin/cpuctl/t_cpuctl:nointr
./usr.sbin/cpuctl/t_cpuctl:offline
./fs/nfs/t_rquotad:get_nfs_be_1_group
./modules/t_x86_pte:rwx
./modules/t_x86_pte:svs_g_bit_set

Summary for 903 test programs: 9567 passed test cases. 17 failed test cases. 
72 expected failed test cases. 529 skipped test cases. which is also similar, but slightly different. So overall I conclude that there's nothing terrible going on, and that these results are in the same class of mostly passing but somewhat irregular as the base case. So work to do, but it doesn't support "ZFS is scary". (Of course, the system stayed up through the tests and has no apparent trouble, or I would have said.) As an aside, it would be nice if atf-test used TMPDIR or had an argument to say what place to do tests. signature.asc Description: PGP signature
Re: ZFS: time to drop Big Scary Warning
"J. Hannken-Illjes" writes: >> On 19. Mar 2021, at 21:18, Michael wrote: >> >> On Fri, 19 Mar 2021 15:57:18 -0400 >> Greg Troxel wrote: >> >>> Even in current, zfs has a Big Scary Warning. Lots of people are using >>> it and it seems quite solid, especially by -current standards. So it >>> feels times to drop the warning. >>> >>> I am not proposing dropping the warning in 9. >>> >>> Objections/comments? >> >> I've been using it on sparc64 without issues for a while now. >> Does nfs sharing work these days? I dimly remember problems there. > > If you mean misc/55042: Panic when creating a directory on a NFS served ZFS > it should be fixed in -current. I have a box running current/amd64 from about March 4, with a zpool on a disklabel partition, and a filesystem from that exported, mounted on a 9/amd64 box, and did the mkdir test and it was totally fine. I was able to have the maproot segfault happen, before the fix. So yes, this is fixed. So summarizing:

- nobody has said there is any remaining serious issue
- many remember issues about NFS (true) but they all seem ok now
- I just looked over the open PRs and w.r.t. current don't see anything serious.

signature.asc Description: PGP signature
ZFS: time to drop Big Scary Warning
Even in current, zfs has a Big Scary Warning. Lots of people are using it and it seems quite solid, especially by -current standards. So it feels time to drop the warning. I am not proposing dropping the warning in 9. Objections/comments? signature.asc Description: PGP signature
Re: kmem pool for half pagesize is very wasteful
Chuck Silvers writes: > in the longer term, I think it would be good to use even larger pool pages > for large pool objects on systems that have relatively large amount of memory. > even with your patch, a 1k pool object on a system with a 4k VM page size > still has 33% overhead for the redzone, which is a lot for something that > is enabled by DIAGNOSTIC and is thus supposed to be "inexpensive". So maybe the real bug is that this check should not be part of DIAGNOSTIC. I remember from 2.8BSD that DIAGNOSTIC was basically just supposed to add cheap asserts and panic earlier but not really be slower in any way anybody would care about. It seems easy enough to make this separate and not get turned on for DIAGNOSTIC, but some other define. It might even be that for current the checked-in GENERIC enables this. But someone turning on DIAGNOSTIC on 9 shouldn't get things that hurt memory usage really at all, or more than say a 2% degradation in speed. > there's a tradeoff here in that using a pool page size that matches the > VM page size allows us to use the direct map, whereas with a larger > pool page size we can't use the direct map (at least effectively can't today), > but for pools that already use a pool page size that is larger than > the VM page size (eg. buf16k using a 64k pool page size) we already > aren't using the direct map, so there's no real penalty for increasing > the pool page size even further, as long as the larger pool page size > is still a tiny percentage of RAM and KVA. we can choose the pool page size > such that the overhead of the redzone is bounded by whatever percentage > we would like. this way we can use a redzone for most pools while > still keeping the overhead down to a reasonable level. That sounds like great progress and I don't mean to say anything negative about that. signature.asc Description: PGP signature
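The 33% figure falls out of simple arithmetic, which can be made explicit (helper names here are mine, not the kernel's): with a redzone appended to each object, fewer objects fit per pool page, and the overhead is how much more memory the same number of objects then takes.

```c
#include <assert.h>

/* How many fixed-size objects fit in one pool page, given a per-object
 * redzone appended after each object. */
static unsigned
objs_per_page(unsigned pagesz, unsigned objsz, unsigned redzone)
{
	return pagesz / (objsz + redzone);
}

/* Extra memory, in whole percent, to hold the same number of objects
 * once the redzone is enabled. */
static unsigned
overhead_pct(unsigned pagesz, unsigned objsz, unsigned redzone)
{
	unsigned without = objs_per_page(pagesz, objsz, 0);
	unsigned with = objs_per_page(pagesz, objsz, redzone);

	return (without * 100) / with - 100;
}
```

For 1 KiB objects on a 4 KiB page, any nonzero redzone drops you from 4 to 3 objects per page, i.e. 33% overhead; on a 64 KiB pool page the same redzone costs only about 1%, which is exactly the argument for larger pool pages.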
Re: fsync error reporting
Greg Troxel writes: > 1) operating system has a successful return from a write transaction to > a disk controller (perhaps via a controller that has a write-back > cache) > > 2) operating system has been told by the controller that the write has > actually completed to stable storage (guaranteed even if OS crashes or > power fails, so actually written or perhaps in battery-backed cache) I see our man page addresses this with FDISKSYNC. It sounds like you aren't proposing to change this (makes sense), but there's the pesky issue of errors within the disk when writing from cache to media. Perhaps those are unreportable. signature.asc Description: PGP signature
Re: fsync error reporting
David Holland writes: > > > everything that process wrote is on disk, > > > > That is probably unattainable, since I've seen it plausibly asserted > > that some disks lie, reporting that writes are on the media when this > > is not actually true. > > Indeed. What I meant to say is that everything has been sent to disk, > as opposed to being accidentally skipped in the cache because the > buffer was busy, which will currently happen on some of the fsync > paths. > > That's why flushing the disk-level caches was a separate point. (ignoring errors as I have no objection to what you proposed and clarified with mouse@) Maybe I'm way off in space, but I'd like to see us be careful about

1) operating system has a successful return from a write transaction to a disk controller (perhaps via a controller that has a write-back cache)

2) operating system has been told by the controller that the write has actually completed to stable storage (guaranteed even if OS crashes or power fails, so actually written or perhaps in battery-backed cache)

A) for stacked filesystems like raid, cgd, and for things like NFS, there's basically an e2e ack of the above condition.

POSIX is of course weasely about this. But it seems obvious that if you call fsync, you want the property that if there is a crash or power failure (but not a disk media failure :-) that your bits are there, which is case 2. Case 1 is only useful in that files could remain in OS cache for a long time, and there is a pretty good but not guaranteed notion that once in device writeback cache they will get to the actual media in not that long. The old "sync;sync;sync;sleep 10" thing from before there was shutdown(8)... I thought NCQ was supposed to give acks for actual writing, but allow them to be perhaps ordered and multiple in flight, so that one could use that instead of the big-hammer inscrutable writeback cache. 
If the controller doesn't support NCQ, then it seems one has to issue a cache flush, which presumably is defined to get all data in cache as of the flush onto disk before reporting that it's done. Is that what you're thinking, or do you think this is all about case 1? signature.asc Description: PGP signature
Re: fsync_range and O_RDONLY
David Holland writes: > Well, if you have it open for write and I have it open for read, and I > fsync it, it'll sync your changes. I guess maybe POSIX is wrong then :-) But as a random user I can type sync to the shell. > And report any errors to me, so if you're a database and I'm feeling > nasty I can maybe mess with you that way. So I'm not sure it's a great > idea. > > Right now fsync error reporting is a trainwreck though. I think that's the real problem; if I open for write and fsync, then I should get status back that lets me know about my writes, regardless of who else asked for sync. Once that's fixed, then the 'others asking for sync' is much less of a big deal. I know, ENOPATCH. signature.asc Description: PGP signature
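The "my fsync should report on my writes" property boils down to a small discipline at the call site, sketched here: check the return value and capture errno immediately, before anything else can clobber it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Minimal shape of "the caller of fsync learns the fate of its own
 * writes": return 0 on success, -errno on failure. */
static int
checked_fsync(int fd)
{
	if (fsync(fd) == -1)
		return -errno;	/* e.g. -EIO if writeback failed */
	return 0;
}
```

What the thread points out is that today this status can be polluted by, or leak to, other openers of the same file; the snippet only shows what a caller can do, not what the kernel guarantees.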
Re: fsync_range and O_RDONLY
David Holland writes: > Last year, fdatasync() was changed to allow syncing files opened > read-only, because that ceased to be prohibited by POSIX and something > apparently depended on it. I have a dim memory of this and mongodb. > However, fsync_range() was not also so changed. Should it have been? > It's now inconsistent with fsync and fdatasync and it seems like it's > meant to be a strict superset of them. It seems like it might as well be. I would expect this to only really sync the file's metadata, same as the others, but I do not feel like I really understand this. signature.asc Description: PGP signature
Re: partial failures in write(2) (and read(2))
David Holland writes:

 > Basically, it is not feasible to check for and report all possible
 > errors ahead of time, nor in general is it possible or even desirable
 > to unwind portions of a write that have already been completed, which
 > means that if a failure occurs partway through a write there are two
 > reasonable choices for proceeding:
 >    (a) return success with a short count reporting how much data has
 >    already been written;
 >    (b) return failure.
 >
 > In case (a) the error gets lost unless additional steps are taken
 > (which as far as I know we currently have no support for); in case (b)
 > the fact that some data was written gets lost, potentially leading to
 > corrupted output.  Neither of these outcomes is optimal, but optimal
 > (detecting all errors beforehand, or rolling back the data already
 > written) isn't on the table.
 >
 > It seems to me that for most errors (a) is preferable, since correctly
 > written user software will detect the short count, retry with the rest
 > of the data, and hit the error case directly, but it seems not
 > everyone agrees with me.

It seems to me that (a) is obviously the correct approach.

An obvious question is what POSIX requires, pause for `kill -HUP kred` :)
I am only a junior POSIX lawyer, not a senior one, but as I read

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html#tag_16_685

I think your case (a) is the only conforming behavior and obviously
what the spec says must happen.  I do not even see a glimmer of
support for (b).

There is the issue of PIPE_BUF, and requests <= PIPE_BUF being atomic,
but I don't think you are talking about that.  Note that write is
obligated to return partial completion if interrupted by a signal.

I think your notion that it's ok to not return the reason the full
amount wasn't written is entirely valid.  I am surprised this is
contentious (really; not trying to be difficult).
Re: Temporary memory allocation from interrupt context
Martin Husemann writes:

 > On Wed, Nov 11, 2020 at 08:26:45AM -0500, Greg Troxel wrote:
 >> > LOCK(st);
 >> > size_t n, max_n = st->num_items;
 >> > some_state_item **tmp_list =
 >> >     kmem_intr_alloc(max_n * sizeof(*tmp_list));
 >>
 >> kmem_intr_alloc takes a flag, and it seems that you need to pass
 >> KM_NOSLEEP, as blocking for memory in softint context is highly unlikely
 >> to be the right thing.
 >
 > Yes, and of course the real code has that (and works). It's just that
 > - memoryallocators(9) does not cover this case
 > - kmem_intr_alloc(9) is kinda deprecated - quoting the man page:
 >
 >     These routines are for the special cases.  Normally,
 >     pool_cache(9) should be used for memory allocation from interrupt
 >     context.
 >
 > but how would I use pool_cache(9) here?

Not deprecated, but for "special cases".  I think needing a
possibly-big variable-size chunk of memory at interrupt time is
special.

You would use pool_cache by being able to use a fixed-sized object.
But it seems that's not how the situation is.

I think memoryallocators(9) could use some spiffing up; it (on 9) says
kmem(9) cannot be used from interrupt context.

The central hard problem is orthogonal, though: if you don't
pre-allocate, you have to choose between waiting and coping with
failure.
Re: Temporary memory allocation from interrupt context
Martin Husemann writes:

 > Consider the following pseudo-code running in softint context:
 >
 > void
 > softint_func(some_state *st, )
 > {
 >     LOCK(st);
 >     size_t n, max_n = st->num_items;
 >     some_state_item **tmp_list =
 >         kmem_intr_alloc(max_n * sizeof(*tmp_list));

kmem_intr_alloc takes a flag, and it seems that you need to pass
KM_NOSLEEP, as blocking for memory in softint context is highly
unlikely to be the right thing.

The man page is silent on whether lack of both flags is an error, and
if not what the semantics are.  (It seems to me it should be an
error.)

With KM_NOSLEEP, it is possible that the allocation will fail.  Thus
there needs to be a strategy to deal with that.

 >     n = 0;
 >     for (i : st->items) {
 >         if (!(i matches some predicate))
 >             continue;
 >         i->retain();
 >         tmp_list[n++] = i;
 >     }
 >     UNLOCK(st);
 >     /* do something with all elements in tmp_list */
 >     kmem_intr_free(tmp_list, max_n * sizeof(*tmp_list));
 > }
 >
 > I don't want to alloca here (the list could be quite huge) and max_n could
 > vary a lot, so having a "manual" pool of a few common (preallocated)
 > list sizes hanging off the state does not go well either.

I think that you need to pick one of:

 - pre-allocate the largest size and use it temporarily

 - be able to deal with not having memory.  This leads to
   hard-to-debug situations if that code is wrong, because usually
   malloc will succeed.

 - figure out that this softint can block indefinitely, only harming
   later calls of the same family, and not leading to kernel
   deadlock/etc.  This leads to hard-to-debug situations if lack of
   memory does lead to hangs, because usually malloc will succeed.

 > In a perfect world we would avoid the interrupt allocation all together, but
 > I have not found a way to rearrange things here to make this feasible.
 >
 > Is kmem_intr_alloc(9) the best way forward?

With all that said, note that I'm not the allocation expert.
Re: New header for GPIO-like definitions
Julian Coleman writes: > name="LED activity" > name="LED disk5_fault" > > name="INDICATOR key_diag" > name="INDICATOR disk5_present" > > and similar, then parse that in MI code. Another approach would be to extend the fdt schema in the way they would if solving this problem and use that. In other words: if you were in charge of fdt and were going to add this feature, what would you do? But, your name overloading proposal seems ok. signature.asc Description: PGP signature
Re: New header for GPIO-like definitions
Julian Coleman writes:

 >> > #define GPIO_PIN_LED    0x01
 >> > #define GPIO_PIN_SENSOR 0x02
 >> >
 >> > Does this seem reasonable, or is there a better way to do this?
 >
 >> I don't really understand how this is different from in/out.
 >> Presumably this is coming from some request from userspace originally,
 >> where someone, perhaps in a config file, has told the system how a pin
 >> is hooked up.
 >
 > The definitions of pins are coming from hardware-specific properties.

That's what I missed.  On a device you are dealing with, pin N is
*always* wired to an LED because that's how it comes from the factory.
My head was in maker-land where there is an LED because someone wired
one up.

 > In the driver, I'd like to be able to handle requests based on what is
 > connected to the pin.  For example, for LED's, attach them to the LED
 > framework using led_attach()

That makes sense, then.  But how do you denote that logical high turns
on the light, vs logical low?

 >> LED seems overly specific.  Presumably you care that the output does
 >> something like "makes a light".  But I don't understand why your API
 >> cares about light vs noise.  And I don't see an active high/low in your
 >> proposal.  So I don't understand how this is different from just
 >> "controllable binary output"
 >
 > As above, I want to be able to route the pin to the correct internal
 > subsystem in the GPIO driver.

I just remember lights before LED, and the fact that they are LED vs
incandescent is not important to how they are used.  I don't know
what's next.  But given there is an led system, there is no
incremental harm and it seems ok.

 >> I am also not following SENSOR.  Do you just mean "reads if the logic
 >> level at the pin is high or low".
 >>
 >> I don't think you mean using i2c bitbang for a temp sensor.
 >
 > Yes, just reading the logic level to display whether the "thing" connected
 > is on or off.  A better name would be appreciated.  Maybe "INDICATOR", which
 > would match the envsys name "ENVSYS_INDICATOR"?
Or even "GPIO_ENVSYS_INDICATOR" because there might be some binary inputs later that get hooked up to some other kind of framework. > Hopefully, the above is enough, but maybe a code snippet would help (this > snippet is only for LED's, but similar applies for other types). In the > hardware-specific driver, I add the pins to proplib: > > add_gpio_pin(pins, "disk_fault", GPIO_PIN_LED, > 0, GPIO_PIN_ACT_LOW, -1); > ... So I see the ACT_LOW. GPIO_PIN_LED is an output, but presumably this means that one can no longer use it with GPIO and only via led_. Which seems fine. Is that what you mean? > Then, in the MD driver I have: > > pin = prop_array_get(pins, i); > prop_dictionary_get_uint32(pin, "type", &type); > switch (type) { > case GPIO_PIN_LED: > ... > led_attach(n, l, pcf8574_get, pcf8574_set); Do you mean MD, or MI? > and because of the way that this chip works, I also need to know in advance > which pins are input and which are output, to avoid inadvertently changing > the input pins to output when writing to the chip. For that, generic > GPIO_PIN_IS_INPUT and GPIO_PIN_IS_OUTPUT definitions might be useful too. I 95% follow, but I am convinced that what you are doing is ok, so to be clear I have no objections. signature.asc Description: PGP signature
Re: New header for GPIO-like definitions
Julian Coleman writes:

 > I'm adding a driver and hardware-specific properties for GPIO's (which pins
 > control LED's, which control sensors, etc).  I need to be able to pass the
 > pin information from the arch-specific configuration to the MI driver.  I'd
 > like to add a new dev/gpio/gpiotypes.h, so that I can share the definitions
 > between the MI and MD code, e.g.:
 >
 > #define GPIO_PIN_LED    0x01
 > #define GPIO_PIN_SENSOR 0x02
 >
 > Does this seem reasonable, or is there a better way to do this?

I don't really understand how this is different from in/out.
Presumably this is coming from some request from userspace originally,
where someone, perhaps in a config file, has told the system how a pin
is hooked up.

LED seems overly specific.  Presumably you care that the output does
something like "makes a light".  But I don't understand why your API
cares about light vs noise.  And I don't see an active high/low in
your proposal.  So I don't understand how this is different from just
"controllable binary output".

I am also not following SENSOR.  Do you just mean "reads if the logic
level at the pin is high or low".

I don't think you mean using i2c bitbang for a temp sensor.

Perhaps you could step back and explain the bigger picture and what's
awkward currently.  I don't doubt you that more is needed, but I am
not able to understand enough to discuss.
Re: make COMPAT_LINUX match SYSV binaries
co...@sdf.org writes:

 > I feel compelled to explain further:
 > any OS that doesn't rely on this tag is prone to spitting out binaries
 > with the wrong tag.  For example, Go spits out Solaris binaries with SYSV
 > as well.
 >
 > Our current solution to it is the kernel reading through the binary,
 > checking if it contains certain known symbols that are common on Linux.
 >
 > We support the following forms of compat:
 >
 >   ultrix    not ELF
 >   sunos     not ELF (we support only old stuff)
 >   freebsd   always correctly tagged, because the native OS
 >             checks this, like we do.
 >   linux     ELF, not always correctly tagged
 >
 > So, currently, we only support one OS that has this problem, which is
 > linux.  I am proposing we take advantage of it.
 >
 > In the event someone adds support for another OS with this problem (say,
 > modern Solaris), I don't expect this compat to be enabled by default,
 > for security reasons.  So the problem will only occur if a user enables
 > both forms of compat at the same time.
 >
 > Users already have to opt in to have Linux compat support.  I think it is
 > a lot to ask to have them tag every binary.

Thanks for the explanation.  I'm still not thrilled, but I withdraw my
objection.
Re: make COMPAT_LINUX match SYSV binaries
co...@sdf.org writes: > As a background, some Linux binaries don't claim to be targeting the > Linux OS, but instead are "SYSV". > > We have used some heuristics to still identify those binaries as being > Linux binaries, like looking into the symbols defined by the binary. > > it looks like we no longer have other forms of compat expected to use > SYSV ELF binaries. Perhaps we should drop this elaborate detection logic > in favour of detecting SYSV == Linux? In general adapting to every confused practice out there leads us to a bad place. This just feels like a step along that path. I could see having a sysctl/etc. to enable this behavior, but it seems really irregular. Is there a way to have a tool to retag binaries that are tagged incorrectly? It seems SYSV emulation should not allow non-SYSV system calls. signature.asc Description: PGP signature
Re: autoloading compat43 on tty ioctls
chris...@astron.com (Christos Zoulas) writes:

 > Aside for the TIOCGSID bug which I am about to fix (it is in tty_43.c
 > and is used in libc tcgetsid()), all the compat tty ioctls are defined
 > in /usr/src/sys/sys/ioctl_compat.h...  We can empty that file and try
 > to build the tree :-), but I am guessing things will break.  Also a lot
 > of pkgsrc will break too.  It is not 4.3 applications that break it is
 > applications that still use the 4.3 terminal api's.

If the API is still present in our source tree, then the
implementation probably does not belong under COMPAT_43.

As I see it COMPAT_43 is to match an old ABI that one can no longer
(on modern NetBSD) compile to.  What you are describing sounds like
"we have an API still, and we've had it since 4.3", which is not in my
view COMPAT.
Re: Sample boot.cfg for upgraded systems (rndseed & friends)
David Brownlee writes: > What would people think of installing an original copy of the etc set > in /usr/share/examples/etc or similar - its 4.9M extracted and ~500K > compressed and the ability to compare what is on the system to what it > was shipped with would have saved me so much effort over the years :) I personally unpack etc and xetc to /usr/netbsd-etc via the INSTALL-NetBSD update script in etcmanage. I would not be super keen on adding a full etc by default, especially because then there's the issue of managing it for upgrades. But if it is unpacked someplace, and updated on updates, and old files removed on updates via postinstall fix, maybe. signature.asc Description: PGP signature
Re: Logging a kernel message when blocking on entropy
Andreas Gustafsson writes: > The following patch will cause a kernel message to be logged when a > process blocks on /dev/random or some other randomness API. It may > help some users befuddled by pkgsrc builds blocking on /dev/random, > and I'm finding it useful when testing changes aimed at fixing PR > 55659. I'm in favor. I have not dug in to the brave new entropy world. I'm sure it's better in many ways, but it also seems like people/systems that used to not end up blocked before now do, apparently because some sources that used to be considered ok (timing of events) no longer are. So I think people should be given clues - things appear a bit too difficult now. signature.asc Description: PGP signature
Re: Proposal to enable WAPBL by default for 10.0
Taylor R Campbell writes:

[lots of good points, no disagreement]

If /etc/master.passwd is ending up with junk, that's a clue that code
that updates it isn't doing the write secondary file, fsync it, rename
approach.  As I understand it, with POSIX filesystems you have to do
that because there is no guarantee on open/write/close that you'll
have one or the other.  Even with zfs, you could have done write on
the first half and not the second, so I think you still need this.

 > work...which is why I used to use ffs+sync on my laptop, and these
 > days I avoid ffs altogether in favour of zfs and lfs, except on
 > install images written to USB media.)

Do you find that lfs is 100% solid now (in 9-stable, or current)?  I
have seen fixes and never really been sure.
Re: AES leaks, cgd ciphers, and vector units in the kernel
a data point on a machine from 2014:

$ ./aestest -l
BearSSL aes_ct
Intel SSE2 bitsliced
$ progress -f /dev/zero sh -c 'exec ./aestest -e -b 256 -c aes-xts -i "Intel SSE2 bitsliced" > /dev/null'
  399 MiB  56.98 MiB/s ^C
$ progress -f /dev/zero sh -c 'exec ./aestest -e -b 256 -c aes-xts -i "BearSSL aes_ct" > /dev/null'
  211 MiB  26.38 MiB/s ^C
$ progress -f /dev/zero sh -c 'exec ./bad -e -b 256 -c aes-xts > /dev/null'
  869 MiB  86.85 MiB/s ^C

So the sse2 is slower, but not enough to get upset about.

cpu0: "Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz"
cpu0: Intel Core i7, Xeon 34xx, 35xx and 55xx (Nehalem) (686-class), 2800.09 MHz
cpu0: family 0x6 model 0x1a stepping 0x5 (id 0x106a5)
cpu0: features 0xbfebfbff
cpu0: features 0xbfebfbff
cpu0: features 0xbfebfbff
cpu0: features1 0x98e3bd
cpu0: features1 0x98e3bd
cpu0: features2 0x28100800
cpu0: features3 0x1
cpu0: features7 0x9c00
Re: AES leaks, cgd ciphers, and vector units in the kernel
Taylor R Campbell writes:

 >> What I meant is: consider an external USB disk of say 4T, which has a
 >> cgd partition within which is ffs.
 >>
 >> Someone attaches it to several systems in turn, doing cgd_attach, mount,
 >> and then runs bup with /mnt/bup as the target, getting deduplication
 >> across systems.
 >
 > (Side note: as a matter of architecture I would recommend
 > incorporating the cryptography into the application, like borgbackup,
 > restic, or Tarsnap do -- anything at a higher level than disks (even
 > at the level of the file system, like zfs encryption) has much more
 > flexibility and can also provide authentication.  Generally the main
 > use case for disk encryption is to enable recycling disks without
 > worrying about information disclosure; the threat model and security
 > of disk encryption systems are both qualitatively very weak.)

Sure, but this is about doing something that is really reliable about
getting data back for disaster recovery, simplicity, only using tools
that have existed for a long time.  (You can't run zfs on old systems,
and borgbackup has had enough stability issues that I wouldn't trust
it.)

 >> So, using the new faster cipher won't work, because it's not supported
 >> by the older systems.
 >>
 >> However, if the -current system does AES slowly because it has the new
 >> constant-time implementation, and the older ones do it like they used
 >> to, I don't see a real problem.
 >
 > OK.  If you encounter a scenario where this is likely to be a real
 > problem, let me know.

From my viewpoint, a 3x slowdown, but with 100% reliability, is not a
big deal.

 > I drafted an SSE2 implementation which considerably improves on the
 > BearSSL aes_ct implementation on a number of amd64 CPUs I tested from
 > around a decade ago.
It is still slower than before -- and AES-CBC > encryption hurts by far the most, because it is necessarily > sequential, whereas AES-CBC decryption and AES-XTS in both directions > can be vectorized -- but it does mitigate the problem somewhat. This > covers all amd64 CPUs and probably most `i386' CPUs of the last 15-20 > years. > > There is some more room for improvement -- SSSE3 provides PSHUFB which > can sequentially speed up parts of AES, and is supported by a good > number of amd64 CPUs starting around 14 years ago that lack AES-NI -- > but there are diminishing returns for increasing implementation and > maintenance effort, so I'd like to focus on making an impact on > systems that matter. (That includes non-x86 CPUs -- e.g., we could > probably easily adapt the Intel SSE2 logic to ARM NEON -- but I would > like to focus on systems where there is demand.) That sounds good. > I drafted a couple programs to approximately measure performance from > userland. They are very naive and do nothing to measure overhead from > cgd(4) or disk i/o itself. > > https://www.NetBSD.org/~riastradh/tmp/20200621/aestest.tgz > https://www.NetBSD.org/~riastradh/tmp/20200622/adiantum.tgz Thanks - will try them. >> So it remains to make userland AES use also constant time, as a separate >> step? > > Correct. ok - and helpful details from nia@ noted.
Re: AES leaks, cgd ciphers, and vector units in the kernel
Taylor R Campbell writes:

 >> I don't really see the new cipher as a reasonable option for removable
 >> disks that need to be accessed by older systems.  I can see it for
 >> encrypted local disk.  But given AES hardware speedup, I suspect most
 >> people can just stay with AES.
 >
 > Can you be more specific about the systems you're concerned about?
 > What are the characteristics and performance requirements of the
 > different systems that need to share disks?  Do you have a reason to
 > need to share a backup drive that you use on an up-to-date NetBSD on
 > older hardware where it has to be fast, with a much older version of
 > NetBSD?
 >
 > (I am sure there are use cases I haven't thought of; I just want to
 > make sure I understand the use cases before I try to address them.)

What I meant is: consider an external USB disk of say 4T, which has a
cgd partition within which is ffs.

Someone attaches it to several systems in turn, doing cgd_attach,
mount, and then runs bup with /mnt/bup as the target, getting
deduplication across systems.  Of these systems, some are older NetBSD
and some are newer.  Posit one each netbsd 5, 7, 8, 9, current in the
mix, as a blend of strawman and not-so-crazy example.

After this, the disk is taken to an undisclosed location where it is
unlikely to be destroyed (or at least, unlikely to be destroyed
correlated with the main systems' disks), but at which it does not
have reliable physical protection against snooping.

I submit that this is not an odd model for cgd usage.  (I don't
actually do this; I mount disks on one system and do over-the-network
backups from the older systems, and my mix of system versions is
different.)

So, using the new faster cipher won't work, because it's not supported
by the older systems.

However, if the -current system does AES slowly because it has the new
constant-time implementation, and the older ones do it like they used
to, I don't see a real problem.
 >> Is there an easy way to publish code that does hardware AES, to allow
 >> people to measure on their hardware?  If a call for that on -users turns
 >> up essentially zero actual people that would be bothered, I think that
 >> would be interesting.
 >
 > I am not quite sure what you're asking.  Correct me if I have
 > misunderstood, but I suspect what you're getting at is:
 >
 >    How can someone on netbsd<=9 test empirically whether this patch
 >    will have a substantial negative performance impact or not?
 >
 > On basically all amd64 systems of the past decade, and on most if not
 > all aarch64 systems, there is essentially guaranteed to be a net
 > performance improvement.

What about other systems?

 > The best way to test this is to just boot a new kernel and try a
 > workload.  But I assume you are looking for a userland program that
 > one can compile and run to test it without booting a new kernel.

Yes, that's what I meant.  Kind of like "openssl speed".

 > I could in a couple hours make a program that checks cpuid to detect
 > hardware support and does some measurements in isolation -- to
 > estimate an _upper bound_ on the system performance impact.
 >
 > The upper bound is likely to be extremely conservative unless your
 > workload is actually reading and writing zeros to cgd on a RAM-backed
 > file in tmpfs; for a realistic impact on cgd or ipsec you would have
 > to take into account the disk or network throughput -- the fraction of
 > it that is spent in the crypto is what the 1/3-2/3 figure applies to.

I did sort of mean "how many MB/s would the old impl do, and how many
MB/s would the new one do", realizing that actually reading/writing
from disk might overwhelm that.

I'm not sure my request is reasonable; it might help up the comfort
level for people.
> (Note that there is no impact on userland crypto, which means no > impact on TLS or OpenVPN or anything like that, unless for some > bizarre reason you've turned on kern.cryptodevallowsoft and the > userland crypto uses /dev/crypto, the solution to which is to stop > using /dev/crypto and/or turn off kern.cryptodevallowsoft for anything > other than testing because it's terrible (and also the apparently > boolean nature of kern.cryptodevallowsoft is a lie).) So it remains to make userland AES use also constant time, as a separate step? >> I'm unclear on openssl and hardware support; "openssl speed" might be a >> good home for this, and I don't know if openssl needs the same treatment >> as cgd. (Fine to keep separable things separate; not a complaint!) > > OpenSSL is a mixed bag. It has a lot more MD implementations of > various cryptographic primitives. But many of them are still leaky. > So it's probably not a very good proxy for what the performance impact > of this patch set will be. I sort of meant putting the new code in there so it can be measured, but I realize that's messy. Please don't take my "is there a way" question as a demand.
Re: AES leaks, cgd ciphers, and vector units in the kernel
Taylor R Campbell writes:

 >> Date: Thu, 18 Jun 2020 07:19:43 +0200
 >> From: Martin Husemann
 >>
 >> One minor nit: with the performance impact that high, and there being
 >> setups where runtime side channel attacks are totally not an issue,
 >> maybe we should leave the old code in place for now, #ifdef'd as
 >> default-off options to provide time for a full reconstruction (or untill
 >> the machine gets update to "the last decade" cpu)?
 >
 > Having leaky AES code around is asking for trouble -- and would
 > require additional complexity to implement and maintain (e.g., is it
 > always unhooked from the build, or do we hook it in just enough to run
 > tests?), which would add further burden on an audit to verify that
 > it's _not_ being used in a real application.
 >
 > The goals here are to make that burden completely go away by making
 > the answer unconditionally no, there's essentially no danger that AES
 > in the kernel is leaky; and to provide alternatives with performance
 > ranging from `not worse' to `much better' to avoid the conflict that
 > AES invites between performance and security.
 >
 > If you have a specific system where there's a real negative
 > performance impact that matters to you, I would be happy to talk over
 > the details and see how we can address it better.

I see your point, and I think this is probably ok, but I share
Martin's concern.

For me, the main use of cgd is to encrypt backup drives.  I am
therefore not really concerned about side channel attacks when they
are attached and keyed on the system being backed up.  (I realize
other people use cgd for other reasons.)

I don't really see the new cipher as a reasonable option for removable
disks that need to be accessed by older systems.  I can see it for
encrypted local disk.  But given AES hardware speedup, I suspect most
people can just stay with AES.

Is there an easy way to publish code that does hardware AES, to allow
people to measure on their hardware?
If a call for that on -users turns up essentially zero actual people that would be bothered, I think that would be interesting. I'm unclear on openssl and hardware support; "openssl speed" might be a good home for this, and I don't know if openssl needs the same treatment as cgd. (Fine to keep separable things separate; not a complaint!)
Re: makesyscalls (moving forward)
David Holland writes: > Meanwhile it doesn't belong in sbin because it doesn't require root, > nor does doing something useful with it require root, and it doesn't > need to be on /, so... usr.bin. Unless we think libexec is reasonable, > but if 3rd-party code is going to be running it we really want it on > the $PATH, so... I agree with that logic, that makesyscalls is kind of like config, and that /usr/bin makes sense. There's nothing admin-ish about it, as building an operating system is not about configuring the host. We could have a directory for tools used only for building NetBSD that are not otherwise useful, and put config and makesyscalls there, but given that we aren't overwhelming bin in a way that causes trouble, that doesn't seem like a good idea.
Re: KAUTH_SYSTEM_UNENCRYPTED_SWAP
Alexander Nasonov writes: > Greg Troxel wrote: >> Kamil Rytarowski writes: >> >> > Is it possible to avoid negation in the name? >> > >> > KAUTH_SYSTEM_ENABLE_SWAP_ENCRYPTION >> >> I think the point is to have one permission to enable it, which is >> perhaps just regular root, and another to disable it if securelevel is >> elevated. >> >> So perhaps there should be two names, one to enable, one to disable. > > Kauth is about security rather than speed or convenience. Disabling > encryption may improve speed but it definitely degrades your security > level. So, you can enable vm.swap_encrypt at any level but you can't > disable it if you care about security. I understand that. But there's still a question of "should there be a KAUTH name for enabling as well as disabling", separate from "what should the rules be". I think everybody believes that regardless of securelevel, root should be able to enable encrypted swap. But probably almost everyone thinks regular users should not be allowed to enable it. I realize we have a lot of "root can", and that extending kauth to make everything separate is almost certainly too much. But when disabling is a big deal, I think it makes sense to add names for both enabling and disabling, to make that intent clearer in the sources. But, I don't think this is that important, and a comment would do.
Re: KAUTH_SYSTEM_UNENCRYPTED_SWAP
Kamil Rytarowski writes: > Is it possible to avoid negation in the name? > > KAUTH_SYSTEM_ENABLE_SWAP_ENCRYPTION I think the point is to have one permission to enable it, which is perhaps just regular root, and another to disable it if securelevel is elevated. So perhaps there should be two names, one to enable, one to disable.
Re: Rump makes the kernel problematically brittle
Thor Lancelot Simon writes: > I'd love to see a GSoC project to actually make rump build like the > kernel...but it may be too much work. Good points, and improvement would be great.
Re: Rump makes the kernel problematically brittle
The other side of the coin to "rump is fragile" is "an operating
system without rump-style tests that can be run automatically is
susceptible to hard-to-detect failures from changes, and is therefore
fragile".

There have been many instances (usually on current-users, I think) of
reports of newly-failing test cases, leading to rapid removal of
newly-introduced defects.
Re: Rump dependencies (5.2)?
Mouse writes: >> The rump build is done with separate reachover makefiles. [...] > > Hm. Then I think possibly the right answer for the moment is for me to > excise rump from my tree entirely. I can't recall ever wanting its > functionality, and trying to figure out what the dependency graph is > when it exists only implicitly in Makefiles scattered all over the tree > sounds like a recipe for serious headaches. > > If and when it looks worth the effort, I can always back out the > removal commit and clean up the result. But SCM_MEMORY looks like the > more valuable thing for my use cases for the moment. Your tree, your call. But it seems really obvious that you should fix the rump build and write some atf test cases for your SCM_MEMORY stuff, and then you will be able to test it automatically.
Re: Proposal, again: Disable autoload of compat_xyz modules
chris...@astron.com (Christos Zoulas) writes:

 > I propose something very slightly different that can preserve the current
 > functionality with user action:
 >
 > 1. Remove them from standard kernels in architectures where modules are
 >    supported.  Users can add them back or just use modules.
 > 2. Disable autoloading, but provide a sysctl to enable autoloading
 >    (1 global sysctl for all compat modules).  Users can change the default
 >    in /etc/sysctl.conf (adds sysctl to the proposal)

I am assuming that we are talking about disabling autoloading of a
number of compat modules that are some combination of believed likely
to have security bugs and not used extensively, and this includes
compat for foreign OS, but does not, at least for now, include compat
for older NetBSD.

This situation is basically a balancing act of the needs/benefits
somehow aggregated (I will avoid "averaged") over all users.  It seems
pretty unclear how to evaluate that in total.

But, it does seem like your single-sysctl proposal means:

 - people who like compat being autoloaded can add one line in
   sysctl.conf and be back where they were

 - people who want specific modules can load them and not enable the
   general sysctl

 - people who don't know about any of this who try to run Linux
   binaries will lose, and presumably there'd be a line in dmesg that
   says which module failed to autoload, like

     policy blocked autoloading compat_linux module; see compat_linux(8)

   which would then explain.

I'm also assuming this is being talked about for HEAD and hence 10,
and not 9.

Overall, this seems like a reasonable compromise among conflicting
goals.

If older NetBSD compat were included, I'd want to see a separate
sysctl, default-on for now.  (My guess is that wanting to disable that
is a fairly extreme position, at least these days.)
Re: build.sh sets with xz (was Re: vfs cache timing changes)
Martin Husemann writes: > On Fri, Sep 13, 2019 at 06:59:42AM -0400, Greg Troxel wrote: >> I'd like us to keep somewhat separate the notions of: >> >> someone is doing build.sh release >> >> someone wants min-size sets at the expense of a lot of cpu time >> >> >> I regularly do build.sh release, and rsync the releasedir bits to other >> machines, and use them to install. Now perhaps I should be doing >> "distribution", but sometimes I want the ISOs. > > The default is MKDEBUG=no so you probably will not notice the compression > difference that much. I don't follow what MKDEBUG has to do with this, but that's not important. > If you set MKDEBUG=yes you can just as easily set USE_XZ_SETS=no > (or USE_PIGZGZIP=yes if you have pigz installed). Sure, I realize I could do this. The question is about defaults. > The other side of the coin is that we have reproducable builds, and we > should not make it harder than needed to reproduce our official builds. It should not be difficult to do, or hard to understand, which is perhaps different from defaults. > But ... it already needs some settings (which we still need to document > on a wiki page properly), so we could also default to something else > and force maximal compressions via the build.sh command line on the > build cluster. I could see MKREPRODUCIBLE=yes causing defaults of various things to be a particular way, and perhaps letting USE_XZ_SETS default to no otherwise. I would hope that what MKREPRODUCIBLE=yes has to set is not very many things, but I haven't kept up.
Re: build.sh sets with xz (was Re: vfs cache timing changes)
"Tom Spindler (moof)" writes: >> PS: The xz compression for the debug set takes 36 minutes on my machine. >> We shoudl do something about it. Matt to use -T for more parallelism? > > On older machines, xz's default settings are pretty much unusable, > and USE_XZ_SETS=no (or USE_PIGZGZIP=yes) is almost a requirement. > On my not-exactly-slow i7 6700K, build.sh -j4 parallel is just fine > until it hits the xz stage; gzip is many orders of magnitude faster. > Maybe if xz were cranked down to -2 or -3 it'd be better at not > that much of a compression loss, or it defaulted to the higher > compression level only when doing a `build.sh release`. (I have not really been building current so am unclear on the xz details.) I'd like us to keep somewhat separate the notions of: someone is doing build.sh release someone wants min-size sets at the expense of a lot of cpu time I regularly do build.sh release, and rsync the releasedir bits to other machines, and use them to install. Now perhaps I should be doing "distribution", but sometimes I want the ISOs. Sometimes I do builds just to see if they work, e.g. if being diligent about testing changes. (Overall the notion of staying with gzip in most cases, with a tunable for extreme savins sounds sensible but I am too unclear to really weigh in on it.)
Re: NFS lockup after UDP fragments getting lost
Edgar Fuß writes: > Thanks to riastradh@, this turned out to be caused by an (UDP, hard) > NFS mount combined with a mis-configured IPFilter that blocked all but > the first fragment of a fragmented NFS reply (e.g., readdir) combined > with a NetBSD design error (or so Taylor says) that a vnode lock may > be held across I/O, in this case, network I/O. Holding a vnode lock across I/O seems like a bug to me too. Marking the vnode as having an in-progress operation, so others can lock/read/report-that-status/unlock, seems ok. But I'm sure you already know that vnode locking is hard. > It looks like the operation to which the reply was lost sometimes > doesn't get retried. Do we have some weird bug where the first > fragment arriving stops the timeout but the blocking of the remaining > fragments cause it to wedge? Probably not. Fragments sit in the IP reassembly queue until the complete packet can be reassembled, and only then is the packet handed up the stack. So the NFS code is almost certainly totally unaware of the arrival of the first fragment.
Re: /dev/random is hot garbage
Taylor R Campbell writes: >> It would also be reasonable to have a sysctl to allow /dev/random to >> return bytes anyway, like urandom would, and to turn this on for our xen >> builders, as a different workaround. That's easy, and it doesn't break >> the way things are supposed to be for people that don't ask for it. > > What's the advantage of this over using replacing /dev/random by a > symlink to /dev/urandom in the build system? > > A symlink can be restricted to a chroot, while a sysctl knob would > affect the host outside the chroot. The two would presumably require > essentially the same privileges to enact. None, now that I think of it. So let's change that on the xen build host. And, the other issue is that systems need randomness, and we need a way to inject some into xen guests. Enabling some with rndctl works, or at least used to, even if it is theoretically dangerous. But we aren't trying to defend against the dom0.
Re: /dev/random is hot garbage
I don't think we should change /dev/random. For a very long time, the notion is that the bits from /dev/random really are ok for keys, and there has been a notion that such bits are precious and you should be prepared to wait. If you aren't generating a key, you shouldn't read from /dev/random. So I think rust is wrong and should be fixed. I can see the reason for frustration, but I believe that we should not break things that are sensible because they are abused and cause problems in some environments. It would also be reasonable to have a sysctl to allow /dev/random to return bytes anyway, like urandom would, and to turn this on for our xen builders, as a different workaround. That's easy, and it doesn't break the way things are supposed to be for people that don't ask for it. Also, on the xen build hosts, it would perhaps be good to turn on entropy collection from network and disk. Another approach, harder, is to create a xenrnd(4) pseudodevice and hypervisor call that gets bits from the host's /dev/random and injects them as if from a hardware rng.
Re: mknod(2) and POSIX
David Holland writes: > However, I notice that mknod(2) does not describe how to set the > object type with the type bits of the mode argument, or document which > object types are allowed, and mkfifo(2) also does not say whether > S_IFIFO should be set in the mode argument or not. This is documented quite well in the opengroup.org standards pages (for mknod, set the type by OR-ing S_IFIFO into the mode; for mkfifo, just don't set any special bits). Agreed that fixing the man pages would be good. > (Though mkfifo(2) hints not by not documenting EINVAL for "The > supplied mode is invalid"; this sort of inference is annoying even in > standards and really not ok for docs...) https://pubs.opengroup.org/onlinepubs/9699919799/functions/mknod.html https://pubs.opengroup.org/onlinepubs/9699919799/functions/mkfifo.html#tag_16_327 Those seem clear to me.
Re: mknod(2) and POSIX
Agreed with uwe@ about not mixing unrelated changes. Pretend we are using git :-) The patch looks fine. Agreed that making fifos with mknod is an odd thing to do, but if it's in posix, then we should do it unless there's something really bad about supporting the posix usage. In this case, it just seems silly to have a second way to make fifos, not harmful.
mknod(2) and POSIX
I recently noticed that pkgsrc/sysutils/bup failed when restoring a fifo under NetBSD, because it calls mknod (in python) which calls mknod(3) and hence mknod(2). Our mknod(2) man page does not mention creating FIFOs, and claims The mknod() function conforms to IEEE Std 1003.1-1990 (“POSIX.1”). mknodat() conforms to IEEE Std 1003.1-2008 (“POSIX.1”). I can't find the 1990 edition online, but 2004 and 2008 require fifo support in mknod: https://pubs.opengroup.org/onlinepubs/009695399/functions/mknod.html https://pubs.opengroup.org/onlinepubs/9699919799.2008edition/functions/mknod.html However, at least in netbsd-8, our kernel (sys/vfs_syscalls.c:do_mknod_at): requires KAUTH_SYSTEM_MKNOD for all callers, and hence returns EPERM for non-root; and has a switch on allowable types, in which S_IFIFO is not included, and hence returns EINVAL. I realize mkfifo is preferred in our world, and POSIX says it is preferred. But I believe we have a failure to follow POSIX. Other opinions?
Re: pool: removing ioff?
>> But, I wonder if this comes from the Solaris allocation design, and that >> the ioff notion is not about alignment for 4/8 objects to fit the way >> the CPU wants, but for say 128 byte objects to be lined up on various >> different offsets in different pages to make caching work better. But >> perhaps that doesn't exist in NetBSD, or is done differently, or my >> memory of the paper is off. > > Indeed, Sun's SLAB had cache coloring. > > We do too in our pool subsystem, that's not related to ioff, we don't > lose it as a result of removing ioff. Great to hear on both counts, thanks.
Re: pool: removing ioff?
Maxime Villard writes: > I would like to remove the 'ioff' argument from pool_init() and friends, > documented as 'align_offset' in the man page. This parameter allows the > caller to specify that the alignment given in 'align' is to be applied at > the offset 'ioff' within the buffer. > > I think we're better-off with hard-aligned structures, ie with __aligned(32) > in the case of XSCALE. Then we just pass align=32 in the pool, and that's it. > > I would prefer to avoid any confusion in the pool initializers and drop ioff, > rather than having this kind of marginal and not-well-defined features that > add complexity with no real good reason. > > Note also that, as far as I can tell, our policy in the kernel has always > been to hard-align the structures, and then pass the same alignment in the > allocators. I am not objecting as I can't make a performance/complexity argument. But, I wonder if this comes from the Solaris allocation design, and that the ioff notion is not about alignment for 4/8 objects to fit the way the CPU wants, but for say 128 byte objects to be lined up on various different offsets in different pages to make caching work better. But perhaps that doesn't exist in NetBSD, or is done differently, or my memory of the paper is off.
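The hard-alignment policy Maxime describes can be sketched in userland; the struct here is hypothetical, and aligned_alloc(3) plays the role that passing align=32 (with ioff=0) to pool_init() would play in the kernel:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 32-byte-aligned structure, the userland analogue of
 * declaring it __aligned(32) in the kernel.  The aligned attribute
 * also pads sizeof up to a multiple of 32. */
struct dma_desc {
    uint32_t cmd;
    uint32_t addr;
} __attribute__((aligned(32)));

/* Allocate one descriptor with the same 32-byte alignment, matching
 * the "pass the same alignment to the allocator" policy. */
struct dma_desc *desc_alloc(void)
{
    /* aligned_alloc requires size to be a multiple of the alignment;
     * the padding from the aligned attribute guarantees that here. */
    return aligned_alloc(32, sizeof(struct dma_desc));
}
```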
Re: patch: debug instrumentation for dev/raidframe/rf_netbsdkintf.c
Christoph Badura writes: > On Mon, Jan 21, 2019 at 04:24:49PM -0500, Greg Troxel wrote: >> Separately from debug code being careful, if it's a rule that bdv can't >> be NULL, it's just as well to put in a KASSERT. Then we'll find out >> where that isn't true and can fix it. > > I must not be getting something. If rf_containsboot() is passed a NULL > pointer, it will trap with a page fault and we can get a stacktrace from > ddb. If we add a KASSERT it will panic and we can get a stacktrace from > ddb. I don't see where the benefit in that is. The benefit is that the panic from the KASSERT is cleaner, and it documents for readers of the function that the author believes it is a rule. And a KASSERT will definitely fire even on machines that can dereference NULL; whether a NULL dereference faults is technically, if not practically, architecture dependent. > Do you think we should add a KASSERT to document that rf_containsboot() > does expect a valid pointer? I'd see value in that and would go ahead with > it. Yes. Basically, in any kernel function, if there is a requirement that a pointer be non-NULL, then there should be a KASSERT, and the code should then feel free to assume it is valid. When a KASSERT is hit, the user gets a message with the KASSERT expression and the source file/line, instead of a page fault traceback. It's very easy and quick to go from that printout to the KASSERT that failed. Plus, adding the KASSERT, or talking about adding it, is a good way to check if there is consensus among the other developers that this really is a rule. In NetBSD, people are really good at telling you you're wrong!
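The pattern being argued for fits in a few lines; this is a hypothetical userland illustration, with assert(3) standing in for the kernel's KASSERT():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Precondition: name != NULL.  The assertion both enforces the rule
 * and documents it for readers; on failure the report names the
 * expression and file/line rather than producing a page-fault
 * traceback.  (Function and names are hypothetical, not raidframe's.)
 */
size_t label_length(const char *name)
{
    assert(name != NULL);   /* KASSERT(name != NULL) in kernel code */

    /* Past the assertion, the body may assume a valid pointer. */
    return strlen(name);
}
```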
Re: patch: debug instrumentation for dev/raidframe/rf_netbsdkintf.c
Christoph Badura writes: >> > + if (bdv == NULL) >> > + return 0; >> > + >> >> This looked suspicious, even before I read the code. >> The question is if it is ever legitimate for bdv to be NULL. > > That is an excellent point. The short answer is, no it isn't. And it > never was NULL in the code that used it. I got a trap into ddb because of > a null pointer deref in the DPRINTF that I changed (in the 4th hunk of my > patch). > >> I am a fan of having comments before every function declaring their >> preconditions and what they guarantee on exit. Then all uses can be >> audited if they guarantee that the preconditions are true. This approach >> is really hard-core in Eiffel, known as design by contract. > > Yes, I totally agree. Also to the rest of your message that I didn't quote. > > When I prepared the patch yesterday I was about to delete the above change > because at first I couldn't remember why I added it ~3 weeks ago. That > should have raised a big fat warning sign. > > I thought about adding a comment after I read your private mail > earlier today. In the end I decided it is better to not change > rf_containsboot() and instead introduce a wrapper for the benefit of the > DPRINTF. Separately from debug code being careful, if it's a rule that bdv can't be NULL, it's just as well to put in a KASSERT. Then we'll find out where that isn't true and can fix it.
Re: patch: debug instrumentation for dev/raidframe/rf_netbsdkintf.c
Christoph Badura writes: > Here is some instrumentation I found useful during my recent debugging. > If there are no objections, I'd like to commit soon. > > The change to rf_containsboot() simplifies the second DPRINTF that I added. > > Index: rf_netbsdkintf.c > === > RCS file: /cvsroot/src/sys/dev/raidframe/rf_netbsdkintf.c,v > retrieving revision 1.356 > diff -u -r1.356 rf_netbsdkintf.c > --- rf_netbsdkintf.c 23 Jan 2018 22:42:29 - 1.356 > +++ rf_netbsdkintf.c 20 Jan 2019 22:32:14 - > @@ -472,6 +472,9 @@ > const char *bootname = device_xname(bdv); > size_t len = strlen(bootname); > > + if (bdv == NULL) > + return 0; > + > for (int col = 0; col < r->numCol; col++) { > const char *devname = r->Disks[col].devname; > devname += sizeof("/dev/") - 1; This looked suspicious, even before I read the code. The question is whether it is ever legitimate for bdv to be NULL. I am a fan of having comments before every function declaring their preconditions and what they guarantee on exit. Then all uses can be audited if they guarantee that the preconditions are true. This approach is really hard-core in Eiffel, known as design by contract. In NetBSD, many functions have KASSERT at the beginning. This checks them (under DIAGNOSTIC) but it also is a way of documenting the rules. From a quick glance at the code it seems obvious that it's not ok to call these functions with a NULL bdv. So if bdv is an argument and not allowed to be NULL, then early on in that function, where you check/return, there should be KASSERT(bdv != NULL). Not really on point, but as a caution: there should be no behavior change in any function under DIAGNOSTIC, if the code is bug free and preconditions are met. So "if something we can rely on isn't true, panic" is fine, but many other things are not.
Re: Importing libraries for the kernel
m...@netbsd.org writes: > I don't expect there to be any problems with the ISC license. It's the > preferred license for OpenBSD and we use a lot of their code (it's > everywhere in sys/dev/) Agreed that the license is ok. > external, as I understand it, means "please upstream your changes, and > avoid unnecessary local changes". Agreed. And also that we have a plan/expectation of tracking changes and improvements that upstream makes. Code that is not in external more or less implies that we are the maintainer. For these libraries, my expectation is that they are being actively maintained and that we will want to update to newer upstream versions from time to time.
Re: noatime mounts inhibiting atime updates via utime()
Edgar Fuß writes: >> Honestly, I think atime is one of the dumbest thing ever. > We occasionally use them to find out (or have a first guess at): > -- has anyone used libfoobar last year? > -- who uses kbaz, i.e. has /home/xyz/.config/kbaz.conf been accessed? > > We use snapshots to run backups, so atimes are not touched by them. I fairly often look at atimes to find out if old libraries have been used, and various other things. I have also had a test that tried to use utime fail on a machine that was noatime. So the notion that noatime should mean what it does now, but allow explicit writes sounds good. I don't see any value in changing the naming of the flags. Having a fs write atime updates unless mounted noatime seems fine, and if people want noatime that's easy. I would be opposed to e.g. dropping the noatime option, making noatime default, and adding an atime option. That's just churn violating historical norms for no good reason. There's a question of what the default for installs should be, and I don't have a real opinion about that. It would be good to have stats about writes, separately including atime updates. Right now we know it causes writes but I haven't seen data.
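The failure mentioned (a test using utime on a noatime mount) involves an explicit atime write, which is a different thing from the implicit read-triggered updates that noatime is meant to suppress. A sketch of such an explicit write, with a hypothetical helper:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>

/* Explicitly set a file's access time with utimes(2), preserving the
 * modification time.  This is the kind of deliberate atime write that
 * arguably should still succeed on a noatime mount.  Returns 0 on
 * success, -1 with errno set on failure. */
int set_atime(const char *path, time_t when)
{
    struct stat st;
    struct timeval tv[2];

    if (stat(path, &st) == -1)
        return -1;
    tv[0].tv_sec = when;            /* new access time */
    tv[0].tv_usec = 0;
    tv[1].tv_sec = st.st_mtime;     /* keep modification time */
    tv[1].tv_usec = 0;
    return utimes(path, tv);
}
```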
Re: fixing coda in -current
m...@netbsd.org writes: > On Sun, Nov 25, 2018 at 08:05:21PM -0500, Greg Troxel wrote: >> However, I am pleased to report that the coda people have said that they >> are working on a fuse interface, although it's expected to be slower. >> We'll see, both if it materializes and how fast it is. > > That'd be neat. > ... can we get general consensus about removing kernel coda if that > happens, and the FUSE implementation works for netbsd too? > dholland speaks poorly of it, we don't have a volunteer to write out > tests, and it has a history of breakage. Getting consensus is hard enough that I would prefer to defer that until we see where we are. The breakage history from NetBSD VFS changes isn't really that bad -- a few times in 20 years, and it has caused very little trouble for others.
Re: fixing coda in -current
m...@netbsd.org (Emmanuel Dreyfus) writes: > Greg Troxel wrote: > >> However, I am pleased to report that the coda people have said that they >> are working on a fuse interface, although it's expected to be slower. > > FUSE vs kernel does not really matter when we deal with network > filesystem performance. The latency of requesting a network operation is > orders of magnitude higher than issuing a few system calls. That's true when the file has to be fetched. Coda, like AFS, caches files in normal operation, and there are read lock callbacks. So the first fetch is over the network and slow, and subsequent reads are at nearly the speed of the underlying filesystem. It is this speed that people are talking about.
Re: fixing coda in -current
David Holland writes: > So I have no immediate comment on the patch but I'd like to understand > better what it's doing -- the last time I crawled around in it > (probably 7-8 years ago) it appeared to among other things have an > incestuous relationship with ufs_readdir such that if you tried to use > anything other than ffs as the local container it would detonate > violently. But I never did figure out exactly what the deal was other > than it was confusing and seemed to violate a lot of abstractions. > > Can you clarify? It would be nice to have it working properly and > stuff like the above is only going to continue to fail in the future... I didn't read this patch carefully, and I'm not Brett. But the basic scheme is that a container file representing a directory is in a particular format. This has been a source of issues when there was an alignment change in directory reading. My impression is that the way it should be is that a container file that's a directory should be read in ufs format, regardless of the container filesystem type. I am not sure that's the way the code is. However, I am pleased to report that the coda people have said that they are working on a fuse interface, although it's expected to be slower. We'll see, both if it materializes and how fast it is.
Re: fixing coda in -current
bch writes: > On Tue, Nov 20, 2018 at 2:38 PM Greg Troxel wrote: > >> I volunteer to bug Satya about using FUSE instead of a homegrown >> (pre-FUSE) kernel interface. > > > Which Satya is this? The Coda one :-) Here's a proper citation. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.448
Re: fixing coda in -current
I volunteer to bug Satya about using FUSE instead of a homegrown (pre-FUSE) kernel interface. I am unaware of anything else that allows writes while disconnected and reintegrates them. I have actually done that, both on purpose and for several days while my IPsec connection was messed up, and it really worked.
Re: fixing coda in -current
I used to use it, and may again. So I'd like to see it stay, partly because I think it's good to keep NetBSD relevant in the filesystem research world. I am expecting to see new upstream activity. But, I think it makes sense to remove it from GENERIC, and perhaps give it whatever don't-autoload treatment applies, so that people only get it if they explicitly ask for it. That way it should not bother others.
Re: Things not referenced in kernel configs, but mentioned in files.*
co...@sdf.org writes: > So, I am excluding things that appear in ALL, and I am not checking if But ALL is an x86 thing, currently. > they appear as modules. Interesting, but I suppose they then belong in ALL also. > So far I had complaints about the appearance of 'lm' which cannot be > safely included in a default kernel, for example. Sure, lots of things are not ok in GENERIC, but do those concerns apply to it being in ALL?
Re: Things not referenced in kernel configs, but mentioned in files.*
co...@sdf.org writes: > This is an automatically generated list with some hand touchups, feel > free to do whatever with it. I only generated the output. > > ac100ic > acemidi > acpipmtr > [snip] I wonder if these are candidates to add to an ALL kernel, and if it will turn out that they are mostly not x86 things. I see we only have ALL for i386/amd64. I wonder if it makes sense to have one in evbarm.
Re: panic: ffs_snapshot_mount: already on list
m...@netbsd.org (Emmanuel Dreyfus) writes: > Beside the problem that FFS_NO_SNAPSHOT does not really disable > snapshot code, I think we should have a nosnapshot or nofss mount option > to handle such a scenario. Anyone has opinion on this? That seems sensible; filesystems are complicated enough that being able to really ignore complicated features seems good. But we are trying to maintain invariants (which weren't being maintained here), and it would be good if mounting without a feature doesn't make things worse. So maybe this should be a read-only mount option? But also, it seems that there was something wrong with this filesystem, and fsck didn't fix it. That seems like the most important thing to fix.
Re: Time to merge the pgoyette-compat branch (take two)
I am just barely paying attention, but I think modules working well is important, and also having minimal code for what's needed. So if mrg's main concerns have been addressed (aliases), I'm in favor (in a somewhat weak, not really clued in sort of way) of this.
Re: NetBSD-8 kernel too big?
Two thoughts: When trimming, ls -lSr in the kernel build directory will identify large objects. We have had kernel modules for a while, but I'm not entirely clear on where we are. I would think that moving to a mode of aggressively not including things that can be modules, and loading them from the fs as needed, would help, particularly if the issue is the bootloader, vs memory used up when running. This is not built as part of an -8 release build, but there is MODULAR in the conf directory.