zfs not freeing znodes
I have been having problems with my netbsd-10 systems locking up. This has happened on at least 3 physical computers and 3 physical disks, so I'm as sure as I can be that it's not hardware. The systems are all netbsd-10 amd64, with / and /usr on ffs and bulk data on zfs, and they have had 4, 8, 24, or 32 GB of RAM, which from the zfs viewpoint ranges from marginal to considered enough.

The symptoms are that everything is fine for a time, and then it ends up in lockup: no keyboard effective to switch out of X, no ddb, nothing. Sometimes I catch it as it is deteriorating and have been able to get into ddb, and there are a bunch of processes in tstile, with underlying locks including flt_noram5 (from fallible memory). My guess is that I run low on memory and there is a locking bug (failure to release) in a rarely-taken path, perhaps trying to delete files in zfs when the system is out of RAM, or that sort of thing.

Things that tend to lead to higher odds of lockup are:

  - the daily cronjob running (8 GB machine w/o X)
  - leaving firefox open, especially with piggy js tabs
  - running pkgsrc builds
  - anything that deals with very large numbers of files

I have adjusted zfs's target allocations to use less RAM, basing them on total RAM. In theory these would be sysctls anyway. One thing I figured out before is that zfs's approach to respecting RAM limits is to go ahead and allocate when requested and to have a background thread free things. This can result in going way over, and I think it makes something like "untar this huge bunch of files into zfs" put memory pressure on the rest of the system.

On a 32 GB machine, the lockups got more frequent (I can't rule out a graphics card failure unrelated to the memory problem), so I started looking harder. I ran vmstat -m before and after doing cvs updates in NetBSD checkouts (I have them for 9, 10 and current), and in pkgsrc. I noticed that "dnode_t" showed a large number of requests and pages and *no releases*.
An example is:

  - 1847937 requests
  - 307990 pages

That's 1203 MB in dnodes (which are 632 bytes each). But the concerning thing is that every time I did an update of a tree (different ones), the dnode allocation rose and I saw no frees.

I then remembered that I had bumped up kern.maxvnodes long ago, before I was even using zfs, because a netbsd or pkgsrc tree was not fitting in the cache. maxvnodes was at about 1.6M. This seemed big, and I set it to 500K. Then additional "call stat on this huge bunch of files" runs did not result in new allocations, but I didn't see any frees. This was all yesterday or Tuesday. This morning there are 1592742 releases, but no pages have been released. I went to see what I was setting maxvnodes to, and it seems I removed that setting long ago, probably when I upgraded my main machine from an 8G box to a 24G box, or earlier.

This all leaves a lot of questions:

  - Obviously I need to read the zfs code to see what dnode is being used for (am guessing it's "disk node", the on-disk info backing a vnode), how the number of allocations is controlled, and how they are freed.
  - 1.6M vnodes on a system with 32G of RAM seems like it should be ok. I plan to set it to 500K on boot and see if that avoids lockups.
  - zfs's "background free, don't sleep processes that are over" strategy seems kind of risky. I can see "if mildly over, let the background free deal with it", but if a process is just allocating as fast as it can, it seems able to run the system out of RAM. There remains the question of what happens when there isn't RAM left to allocate from pools. I think it's highly likely there is a bug.

Things on my todo list to debug:

  - Set up a VM that has ZFS and see if I can make a repro recipe.
  - In the VM, also try DIAGNOSTIC/DEBUG/LOCKDEBUG.
  - Write code to capture vmstat output periodically and save it. I expect graphing this to be useful in understanding.

What I don't understand is why others aren't seeing this.
I do have settings to avoid having the file cache page out all my processes:

  # \todo Reconsider and document
  vm.filemin=5
  vm.filemax=10
  vm.anonmin=5
  vm.anonmax=80
  vm.execmin=5
  vm.execmax=50
  vm.bufcache=5

But this is, I believe, pretty normal among netbsd users. However, the other logical machine, which is a Xen dom0, has a stock sysctl.conf, and it would reliably crash on the daily cron with 4 GB of RAM, but stay up when running GENERIC with the full 8 GB.
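The "capture vmstat output periodically" item from my todo list could be sketched roughly like this. The function name, directory layout, and intervals are just placeholders; this is a sampling loop, not a finished tool:

```shell
# Rough sketch: sample "vmstat -m" every $2 seconds into timestamped
# files under $1, taking $3 samples.  A count of 0 samples could be
# made to mean "run until killed" if you drop the count check.
capture_pool_stats() {
    dir=$1; interval=$2; count=$3
    mkdir -p "$dir" || return 1
    n=0
    while [ "$n" -lt "$count" ]; do
        # One file per sample; the $n suffix keeps same-second samples distinct.
        vmstat -m > "$dir/$(date +%Y%m%dT%H%M%S).$n" 2>&1
        n=$((n + 1))
        sleep "$interval"
    done
}

# e.g. a day of 5-minute samples, started from cron or a tmux session:
# capture_pool_stats /var/log/vmstat-m 300 288
```

Diffing or graphing the per-pool "Requests"/"Releases" columns across these files should show whether dnode_t keeps growing.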
Re: config: conditional at clause
Valery Ushakov writes:

> I'm still not entirely sure, why XEN kernels include GENERIC.local in
> the first place though.  If one needs a fragile maze of includes just
> to avoid a few lines being copy-pasted, that doesn't feel like a win.

Having FOO include FOO.local is fine, but we still need a local include file that is included by substantially every kernel. The point is that if I decide that on my systems I want some device that isn't on by default, or want to tweak something, and I want it on everything, it should be easy. In particular, I think it's normal to flip back and forth between GENERIC and XEN3_DOM0 on the same machine, and those feel like they are doing the same thing, modulo expecting a hypervisor. Arguably, XEN3_DOM0 should just include GENERIC and then have any dom0 stuff added and some "no" stuff. This feels like "we shouldn't have XEN kernels include what they have included for years, because we found a latent config bug".
Re: vio9p vs. GENERIC.local vs. XEN3_DOM[0U]
Mouse writes:

> My answer is, error-checking.  If I, say, typo "pci" as "cpi" in
>
>	mydev* at cpi?
>
> I'd want an error rather than having the line silently ignored.  (That
> particular typo is not all that plausible.  It's just an example.)
>
> Now, if virtio were specifically declared as "this name is valid but
> may or may not be present"?  I'm on the fence.
>
> If virtio were declared normally in the kernels that provide it and
> declared as valid but specifically absent in XEN3_DOM* kernels?  Then I
> think that's what I'd want (to my limited understanding, this is close
> to what "no virtio" does at present).

A fair point, but are you suggesting, as a general approach, that every bus that could ever exist be declared and that all other kernels have "no" entries?
Re: vio9p vs. GENERIC.local vs. XEN3_DOM[0U]
Christoph Badura writes:

> Currently the vio9p driver is commented out in {i386,amd64}/conf/GENERIC:
> #vio9p* 	at virtio?		# Virtio 9P device
>
> The obvious way to enable that is by adding a line to GENERIC.local:
> vio9p* at virtio?
>
> But doing so breaks the builds of the XEN3_DOM? kernels like so
> sys/arch/amd64/conf/GENERIC.local:1: `vio9p* at virtio?' is orphaned
> (nothing matching `virtio?' found)
> because
> $ grep cinclude {i386,amd64}/conf/*XEN3*
> i386/conf/XEN3PAE_DOM0:cinclude "arch/i386/conf/GENERIC.local"
> i386/conf/XEN3PAE_DOM0:cinclude "arch/i386/conf/XEN3_DOM0.local"
> i386/conf/XEN3PAE_DOMU:cinclude "arch/i386/conf/GENERIC.local"
> i386/conf/XEN3PAE_DOMU:cinclude "arch/i386/conf/XEN3_DOMU.local"
> amd64/conf/XEN3_DOM0:cinclude "arch/amd64/conf/GENERIC.local"
> amd64/conf/XEN3_DOMU:cinclude "arch/amd64/conf/GENERIC.local"
> amd64/conf/XEN3_DOMU:cinclude "arch/amd64/conf/XEN3_DOMU.local"
>
> This is extremely annoying, as that breaks "build.sh release" because that
> builds the XEN3 kernels.  And it prevents us from enabling vio9p on x86
> kernels by default.
>
> The obvious and simplest fix is to make the XEN3 kernels stop including
> GENERIC.local.  (And make amd64 XEN3_DOM0 cinclude XEN3_DOM0.local as on
> i386.)

I don't think that's reasonable at all. GENERIC.local is for things you want in your kernels, and if it's something else, then it certainly belongs in XEN kernels. You have merely found that something is problematic because it assumes there is virtio.

> The less trivial fix is to conditionally attach vio9p in GENERIC.local.
> config(8) has "ifdef"/"ifndef" directives for that.  But they key on
> config attributes and I couldn't find an attribute that is only present in
> XEN kernels.
>
> Now, Someone(TM) could probably go into config(8) and add a way to
> conditionalize on flags.  But that is way more work and, IMHO, tackling
> the problem at the wrong level of abstraction.
The right level of abstraction is to do something that says: if there is a virtio bus, add "vio9p* at virtio?". And this is true of pretty much anything that attaches to a bus that may or may not be present. I wonder if there are good reasons to avoid "just skip lines that refer to a bus that doesn't exist".

> It seems to me that the best way to remedy the situation is to make the
> XEN3 kernels not include GENERIC.local.

That will break a lot of other things, and many will find it very surprising.

> If people really want to include GENERIC.local they can do so in their
> XEN3_DOM?.local files or create a XEN3.local (or XEN3.common.local or
> whatever) that is included from them.

If you really want this, you can just add it to GENERIC. That seems better than asking the rest of the world to change. But seriously, I don't see why making XEN not include GENERIC.local is better than having people who want bus-specific things put them someplace else. You could also add a GENERIC.local.virtio and include it from all kernels that have virtio.
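Concretely, that last suggestion might look something like this (file name and paths are hypothetical; the point is only that the fragment is cincluded solely from kernels that actually configure virtio, so its lines can never be orphaned the way GENERIC.local lines can):

```
# arch/amd64/conf/GENERIC.local.virtio (hypothetical name):
vio9p*	at virtio?		# Virtio 9P device

# ...and in GENERIC and every other kernel config that has virtio:
cinclude "arch/amd64/conf/GENERIC.local.virtio"
```

This keeps the existing GENERIC.local semantics untouched for everything else.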
Re: sysmon(4) messages
Valery Ushakov writes:

I agree this is confusing.

> Would it make more sense to change this to either
>
>     $dev: $sensor changed to $state
>
> and hope for the best (at least now there are no extra nouns that come
> from the message template), or make things verbose and very explicit
> with something like
>
>     $dev: sensor '$sensor' changed state to '$state'

That seems like a big improvement. I think I prefer

    $dev: sensor '$sensor' changed to '$state'

as a sensor's value is implicitly a state, and fewer words are better. But I do like the leading "sensor " as likely to reduce confusion.
Re: strange zfs size allocation data
Martin Husemann writes:

> On Sun, Jul 07, 2024 at 11:32:54PM -0400, Mouse wrote:
>> Is bup zfs-specific?  Because, if you're not doing something
>> filesystem-specific, I actually think you will have trouble even
>> _defining_ what "100% right" is for this test, since everything about
>> sparseness, right down to whether it's even a thing, is
>> filesystem-dependent.
>
> Indeed.  Try running the test on a tmpfs or msdosfs for example and you
> should see the test reliably fail.

It does fail on tmpfs. I see this as a bug; it means that files written sparsely might not fit, when I'd expect them to. I can certainly understand that the person who wrote tmpfs didn't get to this, but given the long history of sparse support in the standard filesystem, I weakly sort it into "bug" rather than "feature I'd expect but is missing". As for msdosfs, I am not surprised; that's a foreign fs with its own format and semantics -- and one that is viewed as old and primitive.
Re: strange zfs size allocation data
Mouse writes:

>> This is a test case, to see if backing up and restoring a sparse file
>> results in a sparse file.  I realize that this probably requires a
>> logging fuse driver and a lot of complexity to do 100% right.
>
> Is bup zfs-specific?  Because, if you're not doing something

No, it is a general backup program. I just happen to have the sources for it on zfs -- which people tell me is a great filesystem, and which is now not odd on NetBSD.

> filesystem-specific, I actually think you will have trouble even
> _defining_ what "100% right" is for this test, since everything about
> sparseness, right down to whether it's even a thing, is
> filesystem-dependent.

True. The point is to try to verify that the backup program, when restoring a sparse file, writes it in such a way that the normal implementation of sparse files works, meaning it results in a file without blocks storing all the zeros. What you are missing, and everybody else too, is that the fact that this is theoretically impossible is irrelevant to it being useful in the real world to detect regressions, even if it also occasionally detects bizarre behavior. A better test would be a 'fuse-sparsetest' that makes metadata about the writes it sees available for inspection later. But that's hard to write.
Re: strange zfs size allocation data
Taylor R Campbell writes:

>> Date: Sun, 07 Jul 2024 14:07:40 -0400
>> From: Greg Troxel
>>
>> I ran into a test failure with bup, where it was restoring a sparse file
>> and trying to validate the resulting disk usage.  It turns out that on
>> zfs (NetBSD 10), when you write a file, it shows as using 1 block and
>> then some seconds later shows as using the right amount.
>
> zfs's struct stat::st_blocks (i.e., struct vattr::va_bytes/va_blksize,
> roughly) gives the number of blocks actually allocated on disk for the
> file, plus 1 for some metadata.

Ah, this is what I sort of suspected.

> Before a newly written file is synced to disk, when it still exists
> only in memory, it doesn't have any space allocated on disk for it
> (though I expect if you hit the logical reservation limit, you'll see
> a write error earlier).

This feels like a zfs bug to me. Yes, the blocks are not allocated on disk, but they are essentially reserved, and logically it is as if they are used up. It's a mere artifact of cache coherency that they are not actually allocated on disk yet.

> Every 5sec, the system syncs the file system (I forget where this
> comes from, whether it's a zfs thing or a NetBSD syncer thing), which
> explains why within a couple of seconds you see du(1) output change.

That's kind of fast vs the old 30s, but it explains it.

> I bet if you fsync just the file you created, or use dd oflag=sync or
> oflag=dsync, you will stop seeing the delay.

I see. I wonder if that is wise, or if the test should just wait 10s. The question is making the test fast vs keeping its impact low. Given your answer, I'd expect this on any zfs filesystem; this seems not to be a NetBSD thing. I just found this, but it doesn't get into "blocks not allocated yet":

  https://github.com/openzfs/zfs/discussions/11533
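For what it's worth, the oflag=sync suggestion is easy to check with a variant of my earlier script. File names are arbitrary, bs is spelled out in bytes so it works with both BSD and GNU dd, and the expected du behavior on zfs is my assumption from this thread, not something I've verified:

```shell
# Sketch: write an 11 MB file that is 10 MB of hole plus 1 MB of data,
# once normally and once with oflag=sync, then compare du output.
# On zfs the first would presumably show ~1 block until the next txg
# sync, while the synchronous one should show ~1 MB immediately.
dd if=/dev/urandom of=seektest seek=10 bs=1048576 count=1 2>/dev/null
du -k seektest
dd if=/dev/urandom of=seektest.sync seek=10 bs=1048576 count=1 oflag=sync 2>/dev/null
du -k seektest.sync
```

If the second du is immediately ~1 MB, the test could fsync and avoid the 10s wait, at the cost of forcing I/O the real workload wouldn't do.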
Re: strange zfs size allocation data
Thor Lancelot Simon writes:

> On Sun, Jul 07, 2024 at 02:07:40PM -0400, Greg Troxel wrote:
>> I ran into a test failure with bup, where it was restoring a sparse file
>> and trying to validate the resulting disk usage.  It turns out that on
>> zfs (NetBSD 10), when you write a file, it shows as using 1 block and
>> then some seconds later shows as using the right amount.
>
> When you say "validate the resulting disk usage" and "using 1 block" what
> do you mean, exactly?  If the file is sparse, I can't see how there's any
> bug unless the wrong st_size is returned by stat() or the wrong length
> returned by lseek().

First, in actual operation, bup just does fs ops and there is no issue. This is a test case, to see if backing up and restoring a sparse file results in a sparse file. I realize that this probably requires a logging fuse driver and a lot of complexity to do 100% right.

What I have been doing as a proxy is the script below, which skips 10 MB and writes 1 MB. Since the file is sparse, one would expect about 1 MB of usage, not 11 MB, and not 1 block. Yes, this is not 100% reliable, but the point is to catch regressions where it results in 11 MB, or where the test is broken. So the test asks: is the amount of space at least the data we actually wrote? Is it well less than the sparse file's nominal length? That does not seem unreasonable, even if it is not 100% sound.

> du counts allocated blocks as reported by stat().  A sparse file might
> legitimately report 0, 1, or any other value, even values that exceed
> (st_size / st_blksize).  And the number of allocated blocks can absolutely

Yes, but a sparse file with 10 MB seeked over and 1 MB of legit urandom data more or less has to take up more than 1 block.

> change even while st_size stays the same - consider a filesystem with
> background deduplication or compression, both of which some variants of
> ZFS have, but ZFS is not the only filesystem with these features.

Sure, I get that zfs makes this hard.
> If bup is relying on some particular block allocation behavior, that seems
> like a bug.

It is only the tests, trying to catch problems. The actual operation tries to rely only on POSIX.
strange zfs size allocation data
I ran into a test failure with bup, where it was restoring a sparse file and trying to validate the resulting disk usage. It turns out that on zfs (NetBSD 10), when you write a file, it shows as using 1 block and then some seconds later shows as using the right amount. So:

  - why is it happening?
  - is this a bug?
  - if we think it's a bug, is it feasible to fix?

A simple program creates files, n empty megabytes followed by 1 real megabyte, for n in 0..9, and then runs 'du' on the file every 1s for 30s, not worrying about precise timing. I have a big ssd which is mostly a zfs type partition, and a pool with just that. Nothing fancy.

  #!/bin/sh
  for i in $(seq 0 9); do
      OUT=seek$i
      rm -rf ${OUT} ${OUT}.size
      dd if=/dev/urandom seek=$i bs=1m count=1 of=${OUT} 2> /dev/null
      for s in $(seq 0 30); do
          (echo -n "$s: "; du ${OUT}) >> $OUT.size
          sleep 1
      done
  done

This leads to ('head -6' shown, since that's sufficient to understand):

  ==> seek0.size <==
  0: 1       seek0
  1: 1       seek0
  2: 1       seek0
  3: 1027    seek0
  4: 1027    seek0
  5: 1027    seek0

  ==> seek1.size <==
  0: 1       seek1
  1: 1       seek1
  2: 1027    seek1
  3: 1027    seek1
  4: 1027    seek1
  5: 1027    seek1

  ==> seek2.size <==
  0: 1       seek2
  1: 1027    seek2
  2: 1027    seek2
  3: 1027    seek2
  4: 1027    seek2
  5: 1027    seek2

  ==> seek3.size <==
  0: 1       seek3
  1: 1       seek3
  2: 1       seek3
  3: 1       seek3
  4: 1027    seek3
  5: 1027    seek3

  ==> seek4.size <==
  0: 1       seek4
  1: 1       seek4
  2: 1       seek4
  3: 1027    seek4
  4: 1027    seek4
  5: 1027    seek4

  ==> seek5.size <==
  0: 1       seek5
  1: 1       seek5
  2: 1027    seek5
  3: 1027    seek5
  4: 1027    seek5
  5: 1027    seek5

  ==> seek6.size <==
  0: 1       seek6
  1: 1       seek6
  2: 1       seek6
  3: 1       seek6
  4: 1       seek6
  5: 1027    seek6

  ==> seek7.size <==
  0: 1       seek7
  1: 1       seek7
  2: 1       seek7
  3: 1       seek7
  4: 1027    seek7
  5: 1027    seek7

  ==> seek8.size <==
  0: 1       seek8
  1: 1       seek8
  2: 1       seek8
  3: 1027    seek8
  4: 1027    seek8
  5: 1027    seek8

  ==> seek9.size <==
  0: 1       seek9
  1: 1027    seek9
  2: 1027    seek9
  3: 1027    seek9
  4: 1027    seek9
  5: 1027    seek9
Re: hang in vcache_vget()
Emmanuel Dreyfus writes:

> Hello
>
> I experienced a system freeze on NetBSD-10.0/i386. Many processes
> waiting on tstile, and one waiting on vnode, with this backtrace:
> sleepq_block
> cv_wait
> vcache_vget
> vcache_get
> ufs_lookup
> VOP_LOOKUP
> lookup_once
> namei_tryemulroot.constprop.0
> namei
> vn_open
> do_open
> do_sys_openat
>
> I regret I did not take the time to show the vnode.
>
> Is it worth a PR? I have no clue if it can be reproduced.

I would say yes, it's worth it.

I have had hangs on 10/amd64, on a system with 32G of RAM. I have been blaming zfs, but my "never hangs" experience has been on 9/ufs. But others say zfs is fine. I just came across the "threads leak memory" problem pointed out by Brian Marcotte, and found a 17G gpg-agent. I now wonder if whatever is hanging is being provoked by running out of memory. Still a bug, but I no longer feel my "new problem" can be pointed at zfs.

Do you think your system had high memory pressure at the time of your crash?

(Sort of off topic, but you should know that because 32-bit computers are no longer manufactured, the rust project thinks you shouldn't be using them. Take it to the ewaste center right away!)
Re: poll(): IN/OUT vs {RD,WR}NORM
Johnny Billquist writes:

> POLLPRI      High priority data may be read without blocking.
>
> POLLRDBAND   Priority data may be read without blocking.
>
> POLLRDNORM   Normal data may be read without blocking.

Is this related to the "oob data" scheme in TCP (which is a hack that doesn't work)? Where do we attach 3 priority levels to data?
Re: Forcing a USB device to "ugen"
Jason Thorpe writes:

> I should be able to do this with OpenOCD (pkgsrc/devel/openocd), but
> libftdi1 fails to find the device because libusb1 only deals in
> "ugen".

Is that fundamental, in that ugen has ioctls that are ugen-ish that uftdi does not? I am guessing you thought about fixing libusb1.

> The desire to use "ugen" on "interface 1" is not a property of
> 0x0403,0x6010, it's really a property of
> "SecuringHardware.com","Tigard V1.1".  Unfortunately, there isn't a
> way to express that in the kernel config syntax.
>
> I think my only short-term option here is to, in uftdi_match(), specifically
> reject based on this criteria:
>
>  - VID == 0x0403
>  - PID == 0x6010
>  - interface number == 1
>  - vendor string == "SecuringHardware.com"
>  - product string == "Tigard V1.1"
>
> (It's never useful, on this particular board, to use the second port as a
> UART.)

That seems reasonable to me. It seems pretty unlikely to break other things.
Re: Polymorphic devices
Brad Spencer writes:

> I don't know just yet, but there might be an unwanted device reset with
> the "use the one you open" technique.  That is, you might have to reset
> the chip to change mode, and if you support, say, I2C and GPIO at the
> same time (which is possible), but then change to just GPIO, the chip
> has to be reset and that will disrupt any settings you might have set (I
> think; I am still working out what needs to happen with the mode
> switches).  This may not matter in the bigger picture, and it wouldn't
> matter as much if the mode switch was a sysctl, which one can say will
> reset the chip anyway.

Interesting complexity, but I'd say state the user has asked for should live in the driver, and if the driver has to write that state again on a mode switch, so be it. Generally if you open a device and close it, you don't have much grounds to expect things you did to persist to the next session, but devices have device-specific semantics anyway.
Re: Polymorphic devices
Brad Spencer writes:

> The first is enhancements to uftdi to support the MPSSE engine that some
> of the FTDI chip variants have.  This engine allows the chip to be an I2C
> bus or SPI bus and provides some simple GPIO, and a bunch of other stuff,
> as well as the normal USB UART.  It is not possible to use all of the
> modes at the same time.  That is, these are not separate devices, but
> modes within one device.  Or another way, depending on the mode of the
> chip you get different child devices attached to it.  I am curious
> what the thoughts are on how this might be modeled.

My reaction, without much thought, is to attach them all, to have the non-selected ones return ENXIO or similar, and to have another device on which you call an ioctl to choose which device to enable. Or perhaps to let you open any of them, flipping the mode, and to fail a second simultaneous open.
Re: Maxphys on -current?
Brian Buhrow writes:

> hello.  I know that this has been a very long term project, but I'm
> wondering about the status of this effort?  I note that FreeBSD-13 has
> a MAXPHYS value of 1048576 bytes.
> Have we found other ways to get more throughput from ATA disks that
> obviate the need for this setting which I'm not aware of?
> If not, is anyone working on this project?  The wiki page says the
> project is stalled.

I haven't heard that anyone is.

When you run dd with bs=64k and then bs=1m, how different are the results? (I believe raw requests happen accordingly, vs MAXPHYS for fs etc. access.)
Re: RFC: Native epoll syscalls
Mouse writes:

>> It is definitely a real problem that people write linuxy code that
>> seems unaware of POSIX and portability.
>
> While I feel a bit uncomfortable appearing to defend the practice (and,
> to be sure, it definitely can be a problem) - but, it's also one of the
> ways advancements happen: add an extension, use it, it turns out to be
> useful, it gets popular.
>
> I've done it myself (well, except for the "gets popular" part, which no
> one person can do alone): labeled control structure, AF_TIMER sockets,
> pidconn, validusershell, the list goes on.

Sure, but this is "there are several extensions, and write code that only uses the local one, even though it could have been written to use any". And perhaps "there are mechanisms which could have been adopted, but instead make up a third". And I really meant "seems unaware", not "made a deliberate decision, evidenced by a written design" :-)
Re: RFC: Native epoll syscalls
Martin Husemann writes:

> On Wed, Jun 21, 2023 at 01:50:47PM -0400, Theodore Preduta wrote:
>> There are two main benefits to adding native epoll syscalls:
>>
>> 1. They can be used to help port Linux software to NetBSD.
>
> Well, syscall numbers are cheap and plenty...
>
> The real question is: is it a useful and consistent API to have?
> At first sight it looks like a mix of kqueue and poll, and seems to be
> quite complex.

It is definitely a real problem that people write linuxy code that seems unaware of POSIX and portability. If we had native epoll, then that code could be built and used. That of course doesn't fix the portability issues, but it avoids them. It seems to me that if we have epoll emulation, it should not be that hard to also have it native, and I think the benefit of being able to run (natively) programs written unportably is significant.
Re: malloc(9) vs kmem(9) interfaces
Taylor R Campbell writes:

> Right, so the question is -- can we get the attribution _without_
> that?  Surely attribution itself is just a matter of some per-CPU
> counters.

Reading along, it strikes me there is a huge point implicit in your last sentence. I first thought of attribution as being able to tell what a particular allocated object is being used for. That requires state per object. However, you are talking about maintaining a count of objects by user, which is vastly cheaper and likely 90%+ as useful. So there are two notions: "object attribution" and "total usage attribution".
Re: LINEAR24 userland format in audio(4) - do we really want it?
nia writes:

> Unfortunately file formats are standardized but the
> way the audio APIs are implemented varies. :/
>
>> It's now no longer broken to handle 24bit WAV files.
>
> This is true, but audioplay is hardly the only
> consumer of the API and could easily be made to communicate
> with the kernel using 32-bit samples.
>
> What is the behaviour of everything in pkgsrc when thrown
> 24bit WAV files?

I'm not following. Are you saying that we should remove support from the kernel API for 24-bit linear, and that lots of stuff in pkgsrc should be fixed so it works better?
Re: USB-related panic in 8.2_STABLE
Timo Buhrmester writes:

> Apparently out of nothing, one of our servers panicked.
>
> uname -a gives:
>
> | NetBSD trave.math.uni-bonn.de 8.2_STABLE NetBSD 8.2_STABLE
> | (MI-Server) #17: Fri Jul 16 14:01:03 CEST 2021
> | supp...@trave.math.uni-bonn.de:/var/work/obj-8/sys/arch/amd64/compile/miserv
> | amd64

My impression is that there have been a lot of USB fixes since 8.

> I've transcribed the panic message and backtrace:
>
> | ohci0: 1 scheduling overruns
> | ugen0: detached
> | ugen0: at uhub4 port 2 (addr 2) disconnected
> | ugen0 at uhub4 port 2
> | ugen0: Phoenixtec Power (0x6da) USB Cable (V2.00) (0x02), rev 1.00/0.06, addr 2
> | uvm_fault(0xfe82574c2458, 0x0, 1) -> e
> | fatal page fault in supervisor mode
> | trap type 6 code 0 rip 0x802f627e cs 0x8 rflags 0x10246 cr2 0x2
> |   ilevel 6 (NB: could be ilevel 0 as well) rsp 0x80013f482c10
> | curlwp 0xfe83002b2000 pid 8393.1 lowest kstack 0x80013f4802c0
> | kernel: page fault trap, code=0
> | Stopped in pid 8393.1 (nutdrv_qx_usb) at netbsd:ugen_get_cdesc+0xb1:
> |   movzwl 2(%rax),%edx
> | db{2}> bt
> | ugen_get_cdesc() at netbsd:ugen_get_cdesc+0xb1
> | ugenioctl() at netbsd:ugenioctl+0x9a4
> | cdev_ioctl() at netbsd:cdev_ioctl+0xb4
> | VOP_IOCTL() at netbsd:VOP_IOCTL+0x54
> | vn_ioctl() at netbsd:vn_ioctl+0xa6
> | sys_ioctl() at netbsd:sys_ioctl+0x11a
> | syscall() at netbsd:syscall+0x1ec
> | --- syscall (number 54) ---
> | 7a73c9eff13a:
> | db{2}>
>
> Any idea what's going on?

It can always be hardware. (Even if one can argue that bad hardware should never lead to a panic.) I'm not saying it is, or is likely, but keep that in mind.

You didn't give timing. If this immediately followed the disconnect, it's perhaps a bug in ugen doing something after the device is gone. It may be that this bug has always been there and that normally the UPS doesn't disconnect, or you hit a bad race.

Try updating to 9 or 10 :-)
Re: crash in timerfd building pandoc / ghc94 related
PHO writes:

> On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
>> I encountered this on some version of 10.99.2 and last night again on
>> 10.99.2 from Friday morning.
>> This is an obvious blocker for me for making 9.4.4 the default.
>> I propose to either revert to the last version or make the default GHC
>> version setable.
>
> I wish I could do the latter, but unfortunately not all Haskell
> packages are buildable with 2 major versions of GHC at the same time
> (most are, but there are a few exceptions).
>
> Alternatively, I think I can patch GHC 9.4 so that it won't use
> timerfd.  It appears to be an optional feature after all; if its
> ./configure doesn't find timerfd it won't use it.  Let me try that.

If it's possible to only do this on NetBSD 10.99, that would be good. It seems so far, from not really paying attention, that there is nothing wrong with ghc but that there is a bug in the kernel. It would also be good to get a reproduction recipe without haskell.
Re: Enable to send packets on if_loop via bpf
Ryota Ozaki writes:

> In the specification DLT_NULL assumes a protocol family in the host
> byte order followed by a payload.  Interfaces of DLT_NULL use
> bpf_mtap_af to pass an mbuf prepending a protocol family.  All interfaces
> follow the spec and work well.
>
> OTOH, bpf_write to interfaces of DLT_NULL is a bit of a sad situation.
> Data written to an interface of DLT_NULL is treated as raw data
> (I don't know why); the data is passed to the interface's output routine
> as is, with dst (sa_family=AF_UNSPEC).  tun seems to be able
> to handle such raw data but the others can't handle the data (probably
> the data will be dropped, like if_loop).

Summarizing and commenting, to make sure I'm not confused: on receive/read, DLT_NULL prepends the AF in host byte order; on transmit/write, it just sends with AF_UNSPEC. This seems broken because it is asymmetric, and is bad because it throws away information that is hard to reliably recreate. On the other hand, this is for link-layer formats, and it seems that some interfaces have an AF that is not really part of what is transmitted, even though really it is. For example, tun is using an IP proto byte to specify the AF, and really this is part of the link protocol -- except we pretend it isn't.

> Correcting bpf_write to assume a prepending protocol family will
> save some interfaces like gif and gre but won't save others like stf
> and wg.  Even worse, the change may break existing users of tun
> that want to treat data as is (though I don't know if users exist).
>
> BTW, prepending a protocol family on tun is a different protocol from
> DLT_NULL of bpf.  tun has three protocol modes and doesn't always prepend
> a protocol family.  (And also the network byte order is used on tun
> as gert says, while DLT_NULL assumes the host byte order.)

Wow.

> So my fix will:
> - keep DLT_NULL of if_loop to not break bpf_mtap_af, and
> - unchange DLT_NULL handling in bpf_write except for if_loop, to not
>   bother existing users.
> The patch looks like this:
>
> @@ -447,6 +448,14 @@ bpf_movein(struct uio *uio, int linktype,
>     uint64_t mtu, struct mbuf **mp,
>  		m0->m_len -= hlen;
>  	}
>
> +	if (linktype == DLT_NULL && ifp->if_type == IFT_LOOP) {
> +		uint32_t af;
> +		memcpy(&af, mtod(m0, void *), sizeof(af));
> +		sockp->sa_family = af;
> +		m0->m_data += sizeof(af);
> +		m0->m_len -= sizeof(af);
> +	}
> +
>  	*mp = m0;
>  	return (0);

That seems ok to me.

I think the long-term right fix is to define DLT_AF, which has an AF word in host order on receive and transmit always, and to modify interfaces to use it whenever they are AF-aware at all. In this case tun would fill in the AF word from the IP proto field, and you'd get a transformed/regularized AF word when really the "link layer packet" had the IP proto field. But that's ok, as it's just cleanup, and reversible.
Re: Enable to send packets on if_loop via bpf
Ryota Ozaki writes: > NetBSD can't do this because a loopback interface > registers itself to bpf as DLT_NULL and bpf treats > packets being sent over the interface as AF_UNSPEC. > Packets of AF_UNSPEC are just dropped by loopback > interfaces. > > FreeBSD and OpenBSD enable to do that by letting users > prepend a protocol family to a sending data. bpf (or if_loop) > extracts it and handles the packet as an extracted protocol > family. The following patch follows them (the implementation > is inspired by OpenBSD). > > http://www.netbsd.org/~ozaki-r/loop-bpf.patch > > The patch changes if_loop to register itself to bpf > as DLT_LOOP and bpf to handle a prepending protocol > family on bpf_write if a sender interface is DLT_LOOP. I am surprised that there is not already a DLT_foo that has this concept, an AF word followed by data. But I guess every interface already has a more-specific format. Looking at if_tun.c, I see DLT_NULL. This should have the same ability to write. I have forgotten the details of how tun encodes AF when transmitting, but I know you can have v4 or v6 inside, and tcpdump works now. So obviously I must be missing something. My suggestion is to look at the rest of the drivers that register DLT_NULL and see if they are amenable to the same fix, and choose a new DLT_SOMETHING that accommodates the broader situation. I am not demanding that you add features to the rest of the drivers. I am only asking that you think about the architectural issue of how the rest of them would be updated, so we don't end up with DLT_LOOP, DLT_TUN, and so on, where they all do almost the same thing, when they could be the same. I don't really have an opinion on host vs network for AF, but I think your choice of aligning with FreeBSD is reasonable. signature.asc Description: PGP signature
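To make the prepend-AF convention concrete, here is a userland sketch of building the buffer one would pass to write(2) on a bpf descriptor attached to such an interface (the open/BIOCSETIF steps are omitted, and the helper name is mine). DLT_NULL takes the AF word in host byte order; OpenBSD-style DLT_LOOP takes it in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl */
#include <sys/socket.h>  /* AF_INET */

/*
 * Build a frame for bpf_write: a 4-byte address-family word followed
 * by the raw packet.  net_order selects DLT_LOOP-style network byte
 * order; otherwise the DLT_NULL host-order convention is used.
 * Returns the total length, or 0 if dst is too small.
 */
static size_t
make_af_frame(unsigned char *dst, size_t dstlen, uint32_t af,
    int net_order, const unsigned char *pkt, size_t pktlen)
{
	uint32_t word = net_order ? htonl(af) : af;

	if (dstlen < sizeof(word) + pktlen)
		return 0;
	memcpy(dst, &word, sizeof(word));
	memcpy(dst + sizeof(word), pkt, pktlen);
	return sizeof(word) + pktlen;
}
```

The byte-order split is exactly the asymmetry discussed in the thread: a reader of DLT_NULL captures must know the capturing host's endianness, while DLT_LOOP captures are portable.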
Re: #pragma once
My quick reaction is that we should stick to actual standards, absent a really compelling case. This isn't compelling to me, and the point that linting for wrong usage isn't hard is a good one. I happen to be in the middle of a paper (from the guix crowd) about de-bootstrapping ocaml. It's about getting rid of binary bootstraps. That's a problem we also have in pkgsrc, but we haven't issued a manifesto. While it might seem tangential, the de-bootstrapping world often wants to compile older code with older tools to construct a build graph that starts from as little binary as possible. Thus, "newer compilers all do this" is a bit scary, as while that's what people usually use, it's more comfortable to say "we need C99 plus X" for as little X as possible. signature.asc Description: PGP signature
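For contrast, the standards-only mechanism the thread is implicitly comparing against, the classic include guard, can be shown in one file by spelling out both "inclusions" (FOO_H is a made-up guard name; real headers derive it from the file name):

```c
#include <assert.h>

/* First "inclusion" of the guarded region: the guard macro is not yet
 * defined, so the body is compiled. */
#ifndef FOO_H
#define FOO_H
#define FOO_BODY_COMPILED 1
#endif

/* Second "inclusion": the guard is already defined, so the body is
 * skipped; if it were not, this would be a compile-time error. */
#ifndef FOO_H
#error "include guard failed; body would be compiled twice"
#endif

int
foo_body_compiled(void)
{
	return FOO_BODY_COMPILED;
}
```

This works with any C compiler back to C89, which is exactly the property the de-bootstrapping argument cares about.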
Re: Can version bump up to 9.99.100?
David Holland writes: > On Fri, Sep 16, 2022 at 07:00:23PM +0700, Robert Elz wrote: > > That is, except for in pkgsrc, which is why I still > > have a (very mild) concern about that one - it actually compares the > > version numbers using its (until it gets changed) "Dewey" comparison > > routines, and for those, 9.99.100 is uncharted territory. > > No, it's not, pkgsrc-Dewey is well defined on arbitrarily large > numbers. In fact, that's in some sense the whole point of it relative > to using fixed-width fields. And, surely we had 9.99.9 and 9.99.10. The third digit is no more special than the second. It's just that it happens less often so the problem of arguably incorrectly written two-digit patterns is more likely than for that to happen with one. It's not reasonable to constrain a normal process because other bugs might exist. signature.asc Description: PGP signature
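The point that arbitrarily large components compare fine can be sketched in a few lines. This mimics the idea of a Dewey-style comparison, not pkgsrc's actual code: strip leading zeros, then compare each dotted component by length first and lexicographically second, so "100" sorts after "99" with no fixed-width assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare dotted version strings component by component, treating each
 * component as an unsigned integer of unbounded size.  Returns -1, 0,
 * or 1.  A sketch of the concept, not pkgsrc's implementation. */
static int
dewey_cmp(const char *a, const char *b)
{
	while (*a || *b) {
		size_t la = strcspn(a, ".");
		size_t lb = strcspn(b, ".");
		int c;

		/* strip leading zeros so "09" compares equal to "9" */
		while (la > 1 && *a == '0') { a++; la--; }
		while (lb > 1 && *b == '0') { b++; lb--; }
		if (la != lb)
			return la < lb ? -1 : 1;
		c = strncmp(a, b, la);
		if (c != 0)
			return c < 0 ? -1 : 1;
		a += la; if (*a == '.') a++;
		b += lb; if (*b == '.') b++;
	}
	return 0;
}
```

With this, 9.99.100 > 9.99.99 for exactly the same reason 9.99.10 > 9.99.9: the longer numeric component wins.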
Re: Module autounload proposal: opt-in, not opt-out
Paul Goyette writes: > (personal note) > It really seems to me that the current module sub-systems is at > best a second-class capability. I often get the feeling that > others don't really care about modules, until it's the only way > to provide something else (dtrace). This proposal feels like > another nail in the modular coffin. Rather than disabling (part > of) the module feature, we should find ways to improve testing > the feature. I'd just like to say that while I haven't gone down the "modules first" path, I have been watching your commits and cheering you on. I do use a few modules, and this is making me think I should try to run MODULAR, especially on machines with less memory. I'm a little scared of not even having UFS, but I can try it as the low-memory machine is not important. signature.asc Description: PGP signature
Re: Module autounload proposal: opt-in, not opt-out
Martin Husemann writes: > I think that all modules that we deliver and declare safe for *autoload* > should require to be actually tested, and a basic test (for me) includes > testing auto-unload. That does not cover races that slip through "casual" > testing, but should have caught the worst bugs. That's a reasonable position for adding modules, but > So the error in the cases you stumbled in is the autoload and keeping the > badly tested module autoloadable but forbid its unloading sounds a bit > strange to me. Given where we are, do you really mean we should withdraw every module from autoload that does not have a documented test result, right now? It seems far better to have them stay loaded than be unavailable. signature.asc Description: PGP signature
Re: New iwn firmware & upgrade procedure
Havard Eidnes writes: >> A quick skim of /libdata/firmware makes me think it is mostly not >> versioned. > > Really? I suspect all the if_iwn files are versioned; if it > follows the pattern for iwlwifi-6000g2a-5, the number behind the > last hyphen is the version number. Look at all the devices, not just if_iwn. signature.asc Description: PGP signature
Re: New iwn firmware & upgrade procedure
Havard Eidnes writes: > 1) Could the if_iwn driver fall back to using the 6000g2a-5 microcode >without any code changes? (My gut feeling says "yes", but I have >no existence proof of that.) Unless it's really necessary (ABI change in accessing device with new firmware), it seems that the firmware should just be named for the device and not have the firmware version. Thus you'd get the version you have in tree, and that might be a little old. Alternatively there could be a symlink. But I don't understand why it is versioned. > Should the wireless firmware go into a different set which we also > learn the habit of extracting before reboot of the kernel? If the versioning is really intractable and frequent, perhaps, but I think this can be 99% solved by not putting firmware versions in filenames. A quick skim of /libdata/firmware makes me think it is mostly not versioned. signature.asc Description: PGP signature
Re: Slightly off topic, question about git
David Brownlee writes: > I suspect most of this also works with s/git/hg/ assuming NetBSD > switches to a mercurial repo Indeed, all of this is not really about git. Systems in the class of "distributed VCS" have two important properties:

- commits are atomic across the repo, not per file
- anyone can prepare commits, whether or not they are authorized to apply them to the repo; an authorized person can apply someone else's commit.

These more or less lead to "local copy of the repo". And there are web tools for people who just want to look at something occasionally. But I find that it's not that big, that right now I have 3 copies (8, 9, current), and that it's nice to be able to do things offline (browse, diff, commit). CVS is really just RCS with

- organization into groups of files
- ability to operate over ssh (rsh originally :-)

That was really great in 1994; I remember what a big advance it was (seriously). signature.asc Description: PGP signature
Re: Memory corruption after fork, only on AMD CPUs
co...@sdf.org writes: > There appears to be a memory corruption bug that only happens on AMD > CPUs running NetBSD (or OpenBSD). The same code doesn't fail on Intel. > This affects Go and they've made some bug reports investigating it[1][2]. > > People have narrowed it down to this simple Go reproducer > (install lang/go117 to run it). This is probably unrelated, but I've been running syncthing for a long time with no issues. syncthing built with go117 crashes at startup on an early 2011 Macbook Pro, and the same syncthing built with go116 works fine. However, this system has an Intel CPU: Processor Name: Intel Core i7 Processor Speed: 2.2 GHz Number of Processors: 1 Total Number of Cores: 4 signature.asc Description: PGP signature
Re: wsvt25 backspace key should match terminfo definition
RVP writes: > On Tue, 23 Nov 2021, Michael van Elst wrote: > >> If you restrict yourself to PC hardware (i386/amd64 arch) then >> you probably have either >> >> a PS/2 keyboard -> the backspace key generates a DEL. >> a USB keyboard -> the backspace key generates a BS. >> >> That's something you cannot shoehorn into a single terminfo >> attribute and that's why many programs simply ignore terminfo >> here, in particular when you think about remote access. > > So, if I had a USB keyboard (don't have one to check right now), the > terminfo entry would be correct? How do we make this consistent then? > Have 2 terminfo entries: wsvt25-ps2 and wsvt25-usb (and fix-up getty > to set the correct one)? wscons is supposed to abstract all this, so making wsvt25-foo for different keyboard classes seems like the wrong approach. wskbd(4) says: • Mapping from keycodes (defined by the specific keyboard driver) to keysyms (hardware independent, defined in /usr/include/dev/wscons/wsksymdef.h). As uwe@ points out, the terms we use and the actual key labels are confusing. When I've talked about the DEL key, I've meant the key that the user types to delete backwards, almost always upper right and easily reachable when touch typing, and that in DEC tradition sent the DEL 0x7f character. It was pointed out that newer terminals have a backarrow logo, and I see that an IBM USB keyboard has that too. Then there's the BS key, which older (almost all actual?) terminals had, but my IBM USB keyboard doesn't have one, and my mac doesn't either. Looking in wsksymdef.h (netbsd-9, which is handy), we see "keysyms" which is what keycodes are supposed to map into, and it talks about them being aligned with ASCII. Relevant to this discussion there is

#define KS_BackSpace 0x08
#define KS_Delete    0x7f
#define KS_KP_Delete 0xf29f

So that's for BS, DEL (to use ASCII) and the extended keypad "delete right" introduced with I think the VT220. 
On my USB keyboard, in NetBSD 9 wscons without trying to mess with mappings, I get

backarrow (key where DEL should be) ==> BS (^H)
keypad Delete key (next to insert/home/end/pageup/pagedown) ==> DEL (^?)

and I see that stty has erase set to ^H. The underlying issue is that the norms of some systems are to map that "user wants to delete left easily reachable key" to BS and some want to map it to DEL. I see these as the PC tradition and the UNIX tradition. So I think NetBSD should decide that we're following the UNIX tradition that this key is DEL, have wskbd map it that way for all keyboard types, and have stty erase start out DEL. (Plus of course carrying this across ssh so cross-deletionism works, which I think is already the case.) A quick glance at wskbd and ukbd did not enlighten me. xev shows similar wrong X keysyms, BS and DEL for "backarrow" and "keypad delete". signature.asc Description: PGP signature
Re: wsvt25 backspace key should match terminfo definition
Valery Ushakov writes: > vt52 is different. I never used a real vt52 or a clone, but the > manual at vt100.net gives the following picture: > > https://vt100.net/docs/vt52-mm/figure3-1.html > > and the description > > https://vt100.net/docs/vt52-mm/chapter3.html#S3.1.2.3 > > Key Code Action Taken if Codes Are Echoed > BACK SPACE 010 Backspace (Cursor Left) function > DELETE 177 Nothing That is explaining what the terminal does when those codes are sent by the computer. That is a different thing from how the computer interprets user input. When using a VT52 on Seventh Edition, for example one pushed DEL to remove the previous character, and the computer would send "" to make it disappear and leave the cursor left. One basically never pushed BS. > vt100 had similar keyboard (again, never used a real one personally) > > https://vt100.net/docs/vt100-ug/chapter3.html#F3-2 > > BACKSPACE 010 Backspace function > DELETE 177 Ignored by the VT100 same as vt52, I think. > But vt200 and later use a different keyboard, lk201 (and i did use a > real vt220 a lot) > > https://vt100.net/docs/vt220-rm/figure3-1.html > > that picture is not very good, the one from the vt320 manual is better > > https://vt100.net/docs/vt320-uu/chapter3.html > > vt220 does NOT have a configuration option that selects the code that > the But somehow the official terminfo database has kbs=^H for vt220! > > Later it became configurable: > > https://vt100.net/docs/vt320-uu/chapter4.html#S4.13 > > For vt320 (where it *is* configurable) terminfo has > > $ infocmp -1 vt320 | grep kbs > kbs=^?, Very interesting! > >> I think the first thing to answer is "what is kbs in terminfo supposed >> to mean". > > X/Open Curses, Issue 7 doesn't explain, other than saying "backspace" > key, which is an unfortunate name, as it's loaded. But it's > sufficiently clear from the context that it's the key that deletes > backwards, i.e. deletes under. So it's the codes generated by the DEL key (as opposed to the Delete key). 
>> My other question is how kbs is used from terminfo. Is it about >> generating output sequences to move the active cursor one left? If so, >> it's right. Is it about "what should the user type to delete left", >> then for a vt52/vt220, that's wrong. If it is supposed to be both, >> that's an architectural bug as those aren't the same thing. > > No, k* capabilities are sequences generated by the terminal when some > key is pressed. The capability for the sequence sent to the the > terminal to move the cursor left one position is cub1 > > $ infocmp -1 vt220 | grep cub1 > cub1=^H, > kcub1=\E[D, > > (kcub1 is the sequence generated by the left arrow _k_ey). Then I'm convinced that kbs should be ^? for these terminals. signature.asc Description: PGP signature
Re: wsvt25 backspace key should match terminfo definition
Johnny Billquist writes: >> For vt320 (where it *is* configurable) terminfo has >> >> $ infocmp -1 vt320 | grep kbs >> kbs=^?, > > Which I think it should be. But what does kbs mean?

- the ASCII character sent by the computer to move the cursor left?
- the ASCII character sent by the BS key?
- the ASCII character sent by the DEL key that the user uses to delete left?

signature.asc Description: PGP signature
Re: wsvt25 backspace key should match terminfo definition
Valery Ushakov writes: > On Tue, Nov 23, 2021 at 00:01:40 +, RVP wrote: > >> On Tue, 23 Nov 2021, Johnny Billquist wrote: >> >> > If something pretends to be a VT220, then the key that deletes >> > characters to the left should send DEL, not BS... >> > Just saying... >> >> That's fine with me too. As long as things are consistent. I suggested the >> kernel change because both terminfo definitions (and the FreeBSD console) >> go for ^H. > > Note that the pckbd_keydesc_us keymap maps the scancode of the <- key to > > KC(14), KS_Cmd_ResetEmul, KS_Delete, > > i.e. 0x7f (^?). > > terminfo is obviously incorrect here. Amazingly, the bug is actually > in vt220 description! wsvt25 just inherits from it: > > $ infocmp -1 vt220 | grep kbs > kbs=^H, > > I checked termcap.src from netbsd-4 and it's wrong there too. I have > no idea htf that could have happened. I think (memory is getting fuzzy) the problem is that the old terminals had a delete key, in the upper right, that users use to remove the previous character, and a BS key, upper left, that was actually a carriage control character. The basic problem is that in the PC world, the key where DEL should be has a backarrow and the PC world thinks it is backspace. That's the DEC-centric viewpoint of course :-) I think any change needs a careful proposal and review, because there are lots of opinions here and a change is likely to mess up a bunch of people's configs, even if they have worked around something broken. I don't mean "no changes", just that if you don't think this is a really hard problem you probably shouldn't change it (globally). Also /usr/include/sys/ttydefaults.h is about all of NetBSD on all sorts of hardware, not just PCs and there are lots of keyboards as well as actual terminals. Ever since we moved beyond ASR33, CERASE has been 0177 (my Unix use more or less began with a VT52 and a Beehive CRT). 
xterm has a config to say "make the key where DEL ought to be generate the key that the tty has configured as ERASE". I suspect that the right approach is 1) choose what wscons generates for the "key where DEL belongs" 2) have the tty set so that the choice in (1) is 'stty erase'. I see the same kbs=^H on vt52. I think the first thing to answer is "what is kbs in terminfo supposed to mean". My other question is how kbs is used from terminfo. Is it about generating output sequences to move the active cursor one left? If so, it's right. Is it about "what should the user type to delete left", then for a vt52/vt220, that's wrong. If it is supposed to be both, that's an architectural bug as those aren't the same thing. signature.asc Description: PGP signature
Re: timecounters
I think it makes sense to document them, and arguably each counter should have a man page, except for things that are somehow in timecounter(9) instead (if they don't have a device name?). signature.asc Description: PGP signature
Re: Representing a rotary encoder input device
What do other systems do? It strikes me that wsmouse feels like it is for things connected with the kbd/mouse/display world. To be cantankerous, using it seems a little bit like representing a GPIO input as a 1-button mouse that doesn't move. I would imagine that a rotary encoder is more likely to be a volume or level control, but perhaps not for the machine, perhaps just reported over MQTT so Home Assistant on some other machine can deal. If you are really talking about encoders hooked to gpio, then perhaps gpio should grow a facility to take N pins and say they are some kind of encoder and then have a gpio encoder abstraction. But maybe you are trying to use an encoder to add scroll to a 3-button mouse? signature.asc Description: PGP signature
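If gpio did grow an encoder abstraction, its core would be the standard quadrature state machine. This is a generic sketch, not tied to any NetBSD API: two pins are sampled, and a Gray-code transition table turns each state change into a step of -1, 0, or +1 (0 also swallows invalid/bouncy transitions).

```c
#include <assert.h>

/* Transition table for a 2-bit quadrature encoder: index is
 * (prev_state << 2) | new_state, value is the step taken. */
static const int quad_table[16] = {
	 0, +1, -1,  0,
	-1,  0,  0, +1,
	+1,  0,  0, -1,
	 0, -1, +1,  0,
};

struct quad {
	unsigned prev;	/* previous (A<<1)|B sample */
	long count;	/* accumulated position */
};

/* Feed one sample of the two pins; returns the step and updates the
 * running count. */
static int
quad_step(struct quad *q, unsigned a, unsigned b)
{
	unsigned st = ((a & 1) << 1) | (b & 1);
	int d = quad_table[(q->prev << 2) | st];

	q->prev = st;
	q->count += d;
	return d;
}
```

The accumulated count is what one would report upward, whether as a volume level, a scroll delta, or an MQTT message.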
Re: SCSI scanners
Julian Coleman writes: > Can we get rid of the SCSI scanner support as well? It only supports old > HP and Mustek scanners, and its functionality is superseded by SANE (which > sends the relevant SCSI commands from userland). If it's really the case that SANE works with these, then that seems ok. (I actually have a UMAX scsi scanner but haven't powered it on in years.) I wonder though if this is causing the kind of trouble that uscanner caused. signature.asc Description: PGP signature
Re: protect pmf from network drivers that don't provide if_stop
Martin Husemann writes: > On Tue, Jun 29, 2021 at 03:46:20PM +0930, Brett Lymn wrote: >> I turned up a fix I had put into my source tree a while back, I think at >> the time the wireless driver (urtwn IIRC) did not set an entry for >> if_stop. > > This is a driver bug, we should not work around it but catch it early > and fix it. So maybe KASSERT that stop exists, and then call it if non-NULL, so regular users don't crash, and DIAGNOSTIC does what DIAGNOSTIC is supposed to do? signature.asc Description: PGP signature
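The suggested shape can be modeled in a few lines. This is a userland sketch with a stand-in struct ifnet, not the kernel's real one: under DIAGNOSTIC, KASSERT catches the driver bug early; without it, the NULL check keeps release kernels from crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Model KASSERT: fatal under DIAGNOSTIC, a no-op otherwise. */
#ifdef DIAGNOSTIC
#define KASSERT(e)	assert(e)
#else
#define KASSERT(e)	((void)0)
#endif

/* Stand-in for the real struct ifnet, reduced to what the sketch needs. */
struct ifnet {
	void (*if_stop)(struct ifnet *, int);
	int stopped;
};

static void
if_stop_safe(struct ifnet *ifp, int disable)
{
	KASSERT(ifp->if_stop != NULL);	/* driver bug: catch it early */
	if (ifp->if_stop != NULL)	/* but don't crash production */
		ifp->if_stop(ifp, disable);
}

static void
demo_stop(struct ifnet *ifp, int disable)
{
	(void)disable;
	ifp->stopped = 1;
}
```

A driver that forgets if_stop panics immediately on a DIAGNOSTIC kernel, while regular users just skip the call.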
Re: regarding the changes to kernel entropy gathering
Thanks - that is useful information. I think the big point is that the new seed file is generated from urandom, not from the internal state, so the new seed doesn't leak internal state. The "save entropy" language didn't allow me to conclude that. Also, your explanation is about updating, but it doesn't address generation of a file for the first time. Presumably that just takes urandom without the old seed that isn't there and doesn't overwrite the old seed that isn't there. Interestingly, I have a machine running current, running as a dom0 sometimes, and haven't had problems. I now realize that's only because the machine had a seed file created under either 7 or 9 (installed 7, updated to 9, updated to current). So it has trusted, untrustworthy entropy (even though surely after all this time some of it must have been unobserved). signature.asc Description: PGP signature
Re: regarding the changes to kernel entropy gathering
Thor Lancelot Simon writes: > shuts down, again all entropy samples that have been added (which, again, > are accumulating in the per-cpu pools) are propagated to the global pool; > all the stream RNGs rekey themselves again; then the seed is extracted. It seems obvious to me that "extracting" the seed should be done in such a way that the state of the internal rng is still unpredictable from the saved seed, even if the state of the newly-booted rng will be predictable. Perhaps by pulling 256 bytes from urandom, perhaps by something more direct and then some sort of hash/rekey to get back traffic protection. Probably this is already done in a way much better thought out than my 30s reaction, the man page doesn't really say this, at least that I could follow; rndctl -S says "save entropy pool". signature.asc Description: PGP signature
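The property being asked for can be illustrated with a toy model (emphatically NOT NetBSD's actual code): derive the saved seed and the pool's next state from the old state through a one-way mix with different domain tags, so the seed file reveals neither the state left running nor future output. splitmix64's finalizer stands in for a real hash/PRF here; it is a 64-bit bijection, so distinct inputs are guaranteed distinct outputs.

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64-style one-way mix, standing in for a real hash. */
static uint64_t
mix64(uint64_t x)
{
	x += 0x9e3779b97f4a7c15ULL;
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return x ^ (x >> 31);
}

struct pool { uint64_t state; };

/* Produce a seed for the seed file and rekey the pool in one step.
 * The two outputs use different domain tags, so knowing the seed does
 * not give you the new state without inverting the mix. */
static uint64_t
extract_seed(struct pool *p)
{
	uint64_t old = p->state;

	p->state = mix64(old ^ 0x1ULL);	/* domain tag 1: rekey */
	return mix64(old ^ 0x2ULL);	/* domain tag 2: seed */
}
```

This is only the shape of the argument; the real design involves per-CPU pools, rekeying of the stream RNGs, and a proper cryptographic hash.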
Re: ZFS: time to drop Big Scary Warning
chris...@astron.com (Christos Zoulas) writes: > That's a good test, but how does zfs compare in for the same test with lets > say ffs or ext2fs (filesystems that offer persistence)? With the same system, booted in the same way, but with 3 different filesystems mounted on /tmp, I get similar numbers of failures:

tmpfs 12
ffs2  13
zfs   18

So tmpfs/ffs2 are ~equal and zfs has a few more failures (but it all looks a bit random and non-repeatable). So it's hard to sort out "zfs is buggy" vs "some tests fail in timing-related hard-to-understand ways and that seems provoked slightly more with /tmp on zfs". Did you mean something else? signature.asc Description: PGP signature
Re: ZFS: time to drop Big Scary Warning
I got a suggestion to run atf with a ZFS tmp. This is all with current from around March 1, and is straight current, no Xen. Creating tank0/tmp and having it be mounted on /tmp failed the mount (but created the volume) with some sort of "busy" error. I already had a tmpfs mounted. Rebooting, zfs got mounted and then tmpfs, and I unmounted tmpfs and then I have a zfs tmp. So not sure what's up but feels like a tmpfs issue more than a zfs issue, and not a big deal. Or maybe it's a feature that you can't mount over tmpfs. With /tmp being tmpfs, my results are similar to the releng runs. I've indented things that don't match two spaces. Failed test cases:

lib/libc/sys/t_futex_ops:futex_wait_timeout_deadline
lib/libc/sys/t_ptrace_waitid:syscall_signal_on_sce
lib/libc/sys/t_truncate:truncate_err
lib/librumpclient/t_exec:threxec
net/if_wg/t_misc:wg_rekey
usr.bin/cc/t_tsan_data_race:data_race
usr.bin/make/t_make:archive
usr.bin/c++/t_tsan_data_race:data_race
usr.sbin/cpuctl/t_cpuctl:nointr
usr.sbin/cpuctl/t_cpuctl:offline
fs/ffs/t_quotalimit:slimit_le_1_user
modules/t_x86_pte:rwx

Summary for 903 test programs: 9570 passed test cases. 12 failed test cases. 73 expected failed test cases. 530 skipped test cases.

With /tmp being zfs:tank0/tmp, I get Failed test cases:

./bin/cp/t_cp:file_to_file
./lib/libarchive/t_libarchive:libarchive
./lib/libc/stdlib/t_mktemp:mktemp_large_template
./lib/libc/sys/t_ptrace_waitid:syscall_signal_on_sce
./lib/libc/sys/t_stat:stat_chflags
./lib/libc/sys/t_truncate:truncate_err
./net/if_wg/t_misc:wg_rekey
./usr.bin/cc/t_tsan_data_race:data_race_pie
./usr.bin/make/t_make:archive
./usr.bin/ztest/t_ztest:assert
./usr.bin/c++/t_tsan_data_race:data_race
./usr.bin/c++/t_tsan_data_race:data_race_pie
./usr.sbin/cpuctl/t_cpuctl:nointr
./usr.sbin/cpuctl/t_cpuctl:offline
./fs/nfs/t_rquotad:get_nfs_be_1_group
./modules/t_x86_pte:rwx
./modules/t_x86_pte:svs_g_bit_set

Summary for 903 test programs: 9567 passed test cases. 17 failed test cases. 
72 expected failed test cases. 529 skipped test cases. which is also similar, but slightly different. So overall I conclude that there's nothing terrible going on, and that these results are in the same class of mostly passing but somewhat irregular as the base case. So work to do, but it doesn't support "ZFS is scary". (Of course, the system stayed up through the tests and has no apparent trouble, or I would have said.) As an aside, it would be nice if atf-test used TMPDIR or had an argument to say what place to do tests. signature.asc Description: PGP signature
Re: ZFS: time to drop Big Scary Warning
"J. Hannken-Illjes" writes: >> On 19. Mar 2021, at 21:18, Michael wrote: >> >> On Fri, 19 Mar 2021 15:57:18 -0400 >> Greg Troxel wrote: >> >>> Even in current, zfs has a Big Scary Warning. Lots of people are using >>> it and it seems quite solid, especially by -current standards. So it >>> feels times to drop the warning. >>> >>> I am not proposing dropping the warning in 9. >>> >>> Objections/comments? >> >> I've been using it on sparc64 without issues for a while now. >> Does nfs sharing work these days? I dimly remember problems there. > > If you mean misc/55042: Panic when creating a directory on a NFS served ZFS > it should be fixed in -current. I have a box running current/amd64 from about March 4, with a zpool on a disklabel partition, and a filesystem from that exported, mounted on a 9/amd64 box, and did the mkdir test and it was totally fine. I was able to have the maproot segfault happen, before the fix. So yes, this is fixed. So summarizing:

- nobody has said there is any remaining serious issue
- many remember issues about NFS (true) but they all seem ok now
- I just looked over the open PRs and w.r.t. current don't see anything serious.

signature.asc Description: PGP signature
ZFS: time to drop Big Scary Warning
Even in current, zfs has a Big Scary Warning. Lots of people are using it and it seems quite solid, especially by -current standards. So it feels time to drop the warning. I am not proposing dropping the warning in 9. Objections/comments? signature.asc Description: PGP signature
Re: kmem pool for half pagesize is very wasteful
Chuck Silvers writes: > in the longer term, I think it would be good to use even larger pool pages > for large pool objects on systems that have relatively large amount of memory. > even with your patch, a 1k pool object on a system with a 4k VM page size > still has 33% overhead for the redzone, which is a lot for something that > is enabled by DIAGNOSTIC and is thus supposed to be "inexpensive". So maybe the real bug is that this check should not be part of DIAGNOSTIC. I remember from 2.8BSD that DIAGNOSTIC was basically just supposed to add cheap asserts and panic earlier but not really be slower in any way anybody would care about. It seems easy enough to make this separate and not get turned on for DIAGNOSTIC, but some other define. It might even be that for current the checked-in GENERIC enables this. But someone turning on DIAGNOSTIC on 9 shouldn't get things that hurt memory usage really at all, or more than say a 2% degradation in speed. > there's a tradeoff here in that using a pool page size that matches the > VM page size allows us to use the direct map, whereas with a larger > pool page size we can't use the direct map (at least effectively can't today), > but for pools that already use a pool page size that is larger than > the VM page size (eg. buf16k using a 64k pool page size) we already > aren't using the direct map, so there's no real penalty for increasing > the pool page size even further, as long as the larger pool page size > is still a tiny percentage of RAM and KVA. we can choose the pool page size > such that the overhead of the redzone is bounded by whatever percentage > we would like. this way we can use a redzone for most pools while > still keeping the overhead down to a reasonable level. That sounds like great progress and I don't mean to say anything negative about that. signature.asc Description: PGP signature
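The 33% figure falls out of simple arithmetic, which can be made explicit (helper names here are mine, not the kernel's): with a redzone appended to each object, fewer objects fit per pool page, and the overhead is how much more memory the same number of objects then takes.

```c
#include <assert.h>

/* How many fixed-size objects fit in one pool page, given a per-object
 * redzone appended after each object. */
static unsigned
objs_per_page(unsigned pagesz, unsigned objsz, unsigned redzone)
{
	return pagesz / (objsz + redzone);
}

/* Extra memory, in whole percent, to hold the same number of objects
 * once the redzone is enabled. */
static unsigned
overhead_pct(unsigned pagesz, unsigned objsz, unsigned redzone)
{
	unsigned without = objs_per_page(pagesz, objsz, 0);
	unsigned with = objs_per_page(pagesz, objsz, redzone);

	return (without * 100) / with - 100;
}
```

For 1 KiB objects on a 4 KiB page, any nonzero redzone drops you from 4 to 3 objects per page, i.e. 33% overhead; on a 64 KiB pool page the same redzone costs only about 1%, which is exactly the argument for larger pool pages.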
Re: fsync error reporting
Greg Troxel writes: > 1) operating system has a successful return from a write transaction to > a disk controller (perhaps via a controller that has a write-back > cache) > > 2) operating system has been told by the controller that the write has > actually completed to stable storage (guaranteed even if OS crashes or > power fails, so actually written or perhaps in battery-backed cache) I see our man page addresses this with FDISKSYNC. It sounds like you aren't proposing to change this (makes sense), but there's the pesky issue of errors within the disk when writing from cache to media. Perhaps those are unreportable. signature.asc Description: PGP signature
Re: fsync error reporting
David Holland writes: > > > everything that process wrote is on disk, > > > > That is probably unattainable, since I've seen it plausibly asserted > > that some disks lie, reporting that writes are on the media when this > > is not actually true. > > Indeed. What I meant to say is that everything has been sent to disk, > as opposed to being accidentally skipped in the cache because the > buffer was busy, which will currently happen on some of the fsync > paths. > > That's why flushing the disk-level caches was a separate point. (ignoring errors as I have no objection to what you proposed and clarified with mouse@) Maybe I'm way off in space, but I'd like to see us be careful about

1) operating system has a successful return from a write transaction to a disk controller (perhaps via a controller that has a write-back cache)

2) operating system has been told by the controller that the write has actually completed to stable storage (guaranteed even if OS crashes or power fails, so actually written or perhaps in battery-backed cache)

A) for stacked filesystems like raid, cgd, and for things like NFS, there's basically an e2e ack of the above condition.

POSIX is of course weasely about this. But it seems obvious that if you call fsync, you want the property that if there is a crash or power failure (but not a disk media failure :-) that your bits are there, which is case 2. Case 1 is only useful in that files could remain in OS cache for a long time, and there is a pretty good but not guaranteed notion that once in device writeback cache they will get to the actual media in not that long. The old "sync;sync;sync;sleep 10" thing from before there was shutdown(8)... I thought NCQ was supposed to give acks for actual writing, but allow them to be perhaps ordered and multiple in flight, so that one could use that instead of the big-hammer inscrutable writeback cache. 
If the controller doesn't support NCQ, then it seems one has to issue a cache flush, which presumably is defined to get all data in cache as of the flush onto disk before reporting that it's done. Is that what you're thinking, or do you think this is all about case 1? signature.asc Description: PGP signature
Re: fsync_range and O_RDONLY
David Holland writes: > Well, if you have it open for write and I have it open for read, and I > fsync it, it'll sync your changes. I guess maybe POSIX is wrong then :-) But as a random user I can type sync to the shell. > And report any errors to me, so if you're a database and I'm feeling > nasty I can maybe mess with you that way. So I'm not sure it's a great > idea. > > Right now fsync error reporting is a trainwreck though. I think that's the real problem; if I open for write and fsync, then I should get status back that lets me know about my writes, regardless of who else asked for sync. Once that's fixed, then the 'others asking for sync' is much less of a big deal. I know, ENOPATCH. signature.asc Description: PGP signature
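The "my fsync should report on my writes" property boils down to a small discipline at the call site, sketched here: check the return value and capture errno immediately, before anything else can clobber it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Minimal shape of "the caller of fsync learns the fate of its own
 * writes": return 0 on success, -errno on failure. */
static int
checked_fsync(int fd)
{
	if (fsync(fd) == -1)
		return -errno;	/* e.g. -EIO if writeback failed */
	return 0;
}
```

What the thread points out is that today this status can be polluted by, or leak to, other openers of the same file; the snippet only shows what a caller can do, not what the kernel guarantees.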
Re: fsync_range and O_RDONLY
David Holland writes: > Last year, fdatasync() was changed to allow syncing files opened > read-only, because that ceased to be prohibited by POSIX and something > apparently depended on it. I have a dim memory of this and mongodb. > However, fsync_range() was not also so changed. Should it have been? > It's now inconsistent with fsync and fdatasync and it seems like it's > meant to be a strict superset of them. It seems like it might as well be. I would expect this to only really sync the file's metadata, same as the others, but I do not feel like I really understand this. signature.asc Description: PGP signature
Re: partial failures in write(2) (and read(2))
David Holland writes:

 > Basically, it is not feasible to check for and report all possible
 > errors ahead of time, nor in general is it possible or even desirable
 > to unwind portions of a write that have already been completed, which
 > means that if a failure occurs partway through a write there are two
 > reasonable choices for proceeding:
 >    (a) return success with a short count reporting how much data has
 >    already been written;
 >    (b) return failure.
 >
 > In case (a) the error gets lost unless additional steps are taken
 > (which as far as I know we currently have no support for); in case (b)
 > the fact that some data was written gets lost, potentially leading to
 > corrupted output.  Neither of these outcomes is optimal, but optimal
 > (detecting all errors beforehand, or rolling back the data already
 > written) isn't on the table.
 >
 > It seems to me that for most errors (a) is preferable, since correctly
 > written user software will detect the short count, retry with the rest
 > of the data, and hit the error case directly, but it seems not
 > everyone agrees with me.

It seems to me that (a) is obviously the correct approach.

An obvious question is what POSIX requires, pause for `kill -HUP kred` :)
I am only a junior POSIX lawyer, not a senior one, but as I read

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html#tag_16_685

I think your case (a) is the only conforming behavior and obviously
what the spec says must happen.  I do not even see a glimmer of
support for (b).

There is the issue of PIPE_BUF, and requests <= PIPE_BUF being atomic,
but I don't think you are talking about that.  Note that write is
obligated to return partial completion if interrupted by a signal.

I think your notion that it's ok to not return the reason the full
amount wasn't written is entirely valid.  I am surprised this is
contentious (really; not trying to be difficult).
Re: Temporary memory allocation from interrupt context
Martin Husemann writes:

 > On Wed, Nov 11, 2020 at 08:26:45AM -0500, Greg Troxel wrote:
 >> > LOCK(st);
 >> > size_t n, max_n = st->num_items;
 >> > some_state_item **tmp_list =
 >> >     kmem_intr_alloc(max_n * sizeof(*tmp_list));
 >>
 >> kmem_intr_alloc takes a flag, and it seems that you need to pass
 >> KM_NOSLEEP, as blocking for memory in softint context is highly unlikely
 >> to be the right thing.
 >
 > Yes, and of course the real code has that (and works). It's just that
 > - memoryallocators(9) does not cover this case
 > - kmem_intr_alloc(9) is kinda deprecated - quoting the man page:
 >
 >     These routines are for the special cases.  Normally,
 >     pool_cache(9) should be used for memory allocation from interrupt
 >     context.
 >
 > but how would I use pool_cache(9) here?

Not deprecated, but for "special cases".  I think needing a
possibly-big variable-size chunk of memory at interrupt time is
special.

You would use pool_cache by being able to use a fixed-sized object.
But it seems that's not how the situation is.

I think memoryallocators(9) could use some spiffing up; it (on 9) says
kmem(9) cannot be used from interrupt context.

The central hard problem is orthogonal, though: if you don't
pre-allocate, you have to choose between waiting and coping with
failure.
Re: Temporary memory allocation from interrupt context
Martin Husemann writes:

 > Consider the following pseudo-code running in softint context:
 >
 > void
 > softint_func(some_state *st, )
 > {
 >     LOCK(st);
 >     size_t n, max_n = st->num_items;
 >     some_state_item **tmp_list =
 >         kmem_intr_alloc(max_n * sizeof(*tmp_list));

kmem_intr_alloc takes a flag, and it seems that you need to pass
KM_NOSLEEP, as blocking for memory in softint context is highly
unlikely to be the right thing.

The man page is silent on whether lack of both flags is an error, and
if not what the semantics are.  (It seems to me it should be an
error.)

With KM_NOSLEEP, it is possible that the allocation will fail.  Thus
there needs to be a strategy to deal with that.

 >     n = 0;
 >     for (i : st->items) {
 >         if (!(i matches some predicate))
 >             continue;
 >         i->retain();
 >         tmp_list[n++] = i;
 >     }
 >     UNLOCK(st);
 >     /* do something with all elements in tmp_list */
 >     kmem_intr_free(tmp_list, max_n * sizeof(*tmp_list));
 > }
 >
 > I don't want to alloca here (the list could be quite huge) and max_n could
 > vary a lot, so having a "manual" pool of a few common (preallocated)
 > list sizes hanging off the state does not go well either.

I think that you need to pick one of:

 - pre-allocate the largest size and use it temporarily

 - be able to deal with not having memory.  This leads to
   hard-to-debug situations if that code is wrong, because usually
   malloc will succeed.

 - figure out that this softint can block indefinitely, only harming
   later calls of the same family, and not leading to kernel
   deadlock/etc.  This leads to hard-to-debug situations if lack of
   memory does lead to hangs, because usually malloc will succeed.

 > In a perfect world we would avoid the interrupt allocation all together, but
 > I have not found a way to rearrange things here to make this feasible.
 >
 > Is kmem_intr_alloc(9) the best way forward?

With all that said, note that I'm not the allocation expert.
Re: New header for GPIO-like definitions
Julian Coleman writes: > name="LED activity" > name="LED disk5_fault" > > name="INDICATOR key_diag" > name="INDICATOR disk5_present" > > and similar, then parse that in MI code. Another approach would be to extend the fdt schema in the way they would if solving this problem and use that. In other words: if you were in charge of fdt and were going to add this feature, what would you do? But, your name overloading proposal seems ok. signature.asc Description: PGP signature
Re: New header for GPIO-like definitions
Julian Coleman writes:

 >> > #define GPIO_PIN_LED    0x01
 >> > #define GPIO_PIN_SENSOR 0x02
 >> >
 >> > Does this seem reasonable, or is there a better way to do this?
 >
 >> I don't really understand how this is different from in/out.
 >> Presumably this is coming from some request from userspace originally,
 >> where someone, perhaps in a config file, has told the system how a pin
 >> is hooked up.
 >
 > The definitions of pins are coming from hardware-specific properties.

That's what I missed.  On a device you are dealing with, pin N is
*always* wired to an LED because that's how it comes from the factory.
My head was in maker-land where there is an LED because someone wired
one up.

 > In the driver, I'd like to be able to handle requests based on what is
 > connected to the pin.  For example, for LED's, attach them to the LED
 > framework using led_attach()

That makes sense, then.  But how do you denote that logical high turns
on the light, vs logical low?

 >> LED seems overly specific.  Presumably you care that the output does
 >> something like "makes a light".  But I don't understand why your API
 >> cares about light vs noise.  And I don't see an active high/low in your
 >> proposal.  So I don't understand how this is different from just
 >> "controllable binary output"
 >
 > As above, I want to be able to route the pin to the correct internal
 > subsystem in the GPIO driver.

I just remember lights before LED, and the fact that they are LED vs
incandescent is not important to how they are used.  I don't know
what's next.  But given there is an led system, there is no
incremental harm and it seems ok.

 >> I am also not following SENSOR.  Do you just mean "reads if the logic
 >> level at the pin is high or low".
 >>
 >> I don't think you mean using i2c bitbang for a temp sensor.
 >
 > Yes, just reading the logic level to display whether the "thing" connected
 > is on or off.  A better name would be appreciated.  Maybe "INDICATOR", which
 > would match the envsys name "ENVSYS_INDICATOR"?
Or even "GPIO_ENVSYS_INDICATOR" because there might be some binary inputs later that get hooked up to some other kind of framework. > Hopefully, the above is enough, but maybe a code snippet would help (this > snippet is only for LED's, but similar applies for other types). In the > hardware-specific driver, I add the pins to proplib: > > add_gpio_pin(pins, "disk_fault", GPIO_PIN_LED, > 0, GPIO_PIN_ACT_LOW, -1); > ... So I see the ACT_LOW. GPIO_PIN_LED is an output, but presumably this means that one can no longer use it with GPIO and only via led_. Which seems fine. Is that what you mean? > Then, in the MD driver I have: > > pin = prop_array_get(pins, i); > prop_dictionary_get_uint32(pin, "type", &type); > switch (type) { > case GPIO_PIN_LED: > ... > led_attach(n, l, pcf8574_get, pcf8574_set); Do you mean MD, or MI? > and because of the way that this chip works, I also need to know in advance > which pins are input and which are output, to avoid inadvertently changing > the input pins to output when writing to the chip. For that, generic > GPIO_PIN_IS_INPUT and GPIO_PIN_IS_OUTPUT definitions might be useful too. I 95% follow, but I am convinced that what you are doing is ok, so to be clear I have no objections. signature.asc Description: PGP signature
Re: New header for GPIO-like definitions
Julian Coleman writes:

 > I'm adding a driver and hardware-specific properties for GPIO's (which pins
 > control LED's, which control sensors, etc).  I need to be able to pass the
 > pin information from the arch-specific configuration to the MI driver.  I'd
 > like to add a new dev/gpio/gpiotypes.h, so that I can share the definitions
 > between the MI and MD code, e.g.:
 >
 > #define GPIO_PIN_LED    0x01
 > #define GPIO_PIN_SENSOR 0x02
 >
 > Does this seem reasonable, or is there a better way to do this?

I don't really understand how this is different from in/out.
Presumably this is coming from some request from userspace originally,
where someone, perhaps in a config file, has told the system how a pin
is hooked up.

LED seems overly specific.  Presumably you care that the output does
something like "makes a light".  But I don't understand why your API
cares about light vs noise.  And I don't see an active high/low in
your proposal.  So I don't understand how this is different from just
"controllable binary output".

I am also not following SENSOR.  Do you just mean "reads if the logic
level at the pin is high or low".

I don't think you mean using i2c bitbang for a temp sensor.

Perhaps you could step back and explain the bigger picture and what's
awkward currently.  I don't doubt you that more is needed, but I am
not able to understand enough to discuss.
Re: make COMPAT_LINUX match SYSV binaries
co...@sdf.org writes:

 > I feel compelled to explain further:
 > any OS that doesn't rely on this tag is prone to spitting out binaries
 > with the wrong tag.  For example, Go spits out Solaris binaries with SYSV
 > as well.
 >
 > Our current solution to it is the kernel reading through the binary,
 > checking if it contains certain known symbols that are common on Linux.
 >
 > We support the following forms of compat:
 >
 >   ultrix    not ELF
 >   sunos     not ELF (we support only old stuff)
 >   freebsd   always correctly tagged, because the native OS
 >             checks this, like we do.
 >   linux     ELF, not always correctly tagged
 >
 > So, currently, we only support one OS that has this problem, which is
 > linux.  I am proposing we take advantage of it.
 >
 > In the event someone adds support for another OS with this problem (say,
 > modern Solaris), I don't expect this compat to be enabled by default,
 > for security reasons.  So the problem will only occur if a user enables
 > both forms of compat at the same time.
 >
 > Users already have to opt in to have Linux compat support.  I think it is
 > a lot to ask to have them tag every binary.

Thanks for the explanation.  I'm still not thrilled, but I withdraw my
objection.
Re: make COMPAT_LINUX match SYSV binaries
co...@sdf.org writes: > As a background, some Linux binaries don't claim to be targeting the > Linux OS, but instead are "SYSV". > > We have used some heuristics to still identify those binaries as being > Linux binaries, like looking into the symbols defined by the binary. > > it looks like we no longer have other forms of compat expected to use > SYSV ELF binaries. Perhaps we should drop this elaborate detection logic > in favour of detecting SYSV == Linux? In general adapting to every confused practice out there leads us to a bad place. This just feels like a step along that path. I could see having a sysctl/etc. to enable this behavior, but it seems really irregular. Is there a way to have a tool to retag binaries that are tagged incorrectly? It seems SYSV emulation should not allow non-SYSV system calls. signature.asc Description: PGP signature
Re: autoloading compat43 on tty ioctls
chris...@astron.com (Christos Zoulas) writes:

 > Aside for the TIOCGSID bug which I am about to fix (it is in tty_43.c
 > and is used in libc tcgetsid()), all the compat tty ioctls are defined
 > in /usr/src/sys/sys/ioctl_compat.h...  We can empty that file and try
 > to build the tree :-), but I am guessing things will break.  Also a lot
 > of pkgsrc will break too.  It is not 4.3 applications that break it is
 > applications that still use the 4.3 terminal api's.

If the API is still present in our source tree, then the
implementation probably does not belong under COMPAT_43.

As I see it COMPAT_43 is to match an old ABI that one can no longer
(on modern NetBSD) compile to.  What you are describing sounds like
"we have an API still, and we've had it since 4.3", which is not in my
view COMPAT.
Re: Sample boot.cfg for upgraded systems (rndseed & friends)
David Brownlee writes: > What would people think of installing an original copy of the etc set > in /usr/share/examples/etc or similar - its 4.9M extracted and ~500K > compressed and the ability to compare what is on the system to what it > was shipped with would have saved me so much effort over the years :) I personally unpack etc and xetc to /usr/netbsd-etc via the INSTALL-NetBSD update script in etcmanage. I would not be super keen on adding a full etc by default, especially because then there's the issue of managing it for upgrades. But if it is unpacked someplace, and updated on updates, and old files removed on updates via postinstall fix, maybe. signature.asc Description: PGP signature
Re: Logging a kernel message when blocking on entropy
Andreas Gustafsson writes: > The following patch will cause a kernel message to be logged when a > process blocks on /dev/random or some other randomness API. It may > help some users befuddled by pkgsrc builds blocking on /dev/random, > and I'm finding it useful when testing changes aimed at fixing PR > 55659. I'm in favor. I have not dug in to the brave new entropy world. I'm sure it's better in many ways, but it also seems like people/systems that used to not end up blocked before now do, apparently because some sources that used to be considered ok (timing of events) no longer are. So I think people should be given clues - things appear a bit too difficult now. signature.asc Description: PGP signature
Re: Proposal to enable WAPBL by default for 10.0
Taylor R Campbell writes:

[lots of good points, no disagreement]

If /etc/master.passwd is ending up with junk, that's a clue that code
that updates it isn't doing the write secondary file, fsync it, rename
approach.  As I understand it, with POSIX filesystems you have to do
that because there is no guarantee on open/write/close that you'll
have one or the other.  Even with zfs, you could have done write on
the first half and not the second, so I think you still need this.

 > work...which is why I used to use ffs+sync on my laptop, and these
 > days I avoid ffs altogether in favour of zfs and lfs, except on
 > install images written to USB media.)

Do you find that lfs is 100% solid now (in 9-stable, or current)?  I
have seen fixes and never really been sure.
Re: AES leaks, cgd ciphers, and vector units in the kernel
a data point on a machine from 2014:

$ ./aestest -l
BearSSL aes_ct
Intel SSE2 bitsliced
$ progress -f /dev/zero sh -c 'exec ./aestest -e -b 256 -c aes-xts -i "Intel SSE2 bitsliced" > /dev/null'
  399 MiB  56.98 MiB/s ^C
$ progress -f /dev/zero sh -c 'exec ./aestest -e -b 256 -c aes-xts -i "BearSSL aes_ct" > /dev/null'
  211 MiB  26.38 MiB/s ^C
$ progress -f /dev/zero sh -c 'exec ./bad -e -b 256 -c aes-xts > /dev/null'
  869 MiB  86.85 MiB/s ^C

So the sse2 is slower, but not enough to get upset about.

cpu0: "Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz"
cpu0: Intel Core i7, Xeon 34xx, 35xx and 55xx (Nehalem) (686-class), 2800.09 MHz
cpu0: family 0x6 model 0x1a stepping 0x5 (id 0x106a5)
cpu0: features 0xbfebfbff
cpu0: features 0xbfebfbff
cpu0: features 0xbfebfbff
cpu0: features1 0x98e3bd
cpu0: features1 0x98e3bd
cpu0: features2 0x28100800
cpu0: features3 0x1
cpu0: features7 0x9c00
Re: AES leaks, cgd ciphers, and vector units in the kernel
Taylor R Campbell writes:

 >> What I meant is: consider an external USB disk of say 4T, which has a
 >> cgd partition within which is ffs.
 >>
 >> Someone attaches it to several systems in turn, doing cgd_attach, mount,
 >> and then runs bup with /mnt/bup as the target, getting deduplication
 >> across systems.
 >
 > (Side note: as a matter of architecture I would recommend
 > incorporating the cryptography into the application, like borgbackup,
 > restic, or Tarsnap do -- anything at a higher level than disks (even
 > at the level of the file system, like zfs encryption) has much more
 > flexibility and can also provide authentication.  Generally the main
 > use case for disk encryption is to enable recycling disks without
 > worrying about information disclosure; the threat model and security
 > of disk encryption systems are both qualitatively very weak.)

Sure, but this is about doing something that is really reliable about
getting data back for disaster recovery, simplicity, only using tools
that have existed for a long time.  (You can't run zfs on old systems,
and borgbackup has had enough stability issues that I wouldn't trust
it.)

 >> So, using the new faster cipher won't work, because it's not supported
 >> by the older systems.
 >>
 >> However, if the -current system does AES slowly because it has the new
 >> constant-time implementation, and the older ones do it like they used
 >> to, I don't see a real problem.
 >
 > OK.  If you encounter a scenario where this is likely to be a real
 > problem, let me know.

From my viewpoint, a 3x slowdown, but with 100% reliability, is not a
big deal.

 > I drafted an SSE2 implementation which considerably improves on the
 > BearSSL aes_ct implementation on a number of amd64 CPUs I tested from
 > around a decade ago.
It is still slower than before -- and AES-CBC > encryption hurts by far the most, because it is necessarily > sequential, whereas AES-CBC decryption and AES-XTS in both directions > can be vectorized -- but it does mitigate the problem somewhat. This > covers all amd64 CPUs and probably most `i386' CPUs of the last 15-20 > years. > > There is some more room for improvement -- SSSE3 provides PSHUFB which > can sequentially speed up parts of AES, and is supported by a good > number of amd64 CPUs starting around 14 years ago that lack AES-NI -- > but there are diminishing returns for increasing implementation and > maintenance effort, so I'd like to focus on making an impact on > systems that matter. (That includes non-x86 CPUs -- e.g., we could > probably easily adapt the Intel SSE2 logic to ARM NEON -- but I would > like to focus on systems where there is demand.) That sounds good. > I drafted a couple programs to approximately measure performance from > userland. They are very naive and do nothing to measure overhead from > cgd(4) or disk i/o itself. > > https://www.NetBSD.org/~riastradh/tmp/20200621/aestest.tgz > https://www.NetBSD.org/~riastradh/tmp/20200622/adiantum.tgz Thanks - will try them. >> So it remains to make userland AES use also constant time, as a separate >> step? > > Correct. ok - and helpful details from nia@ noted.
Re: AES leaks, cgd ciphers, and vector units in the kernel
Taylor R Campbell writes:

 >> I don't really see the new cipher as a reasonable option for removable
 >> disks that need to be accessed by older systems.  I can see it for
 >> encrypted local disk.  But given AES hardware speedup, I suspect most
 >> people can just stay with AES.
 >
 > Can you be more specific about the systems you're concerned about?
 > What are the characteristics and performance requirements of the
 > different systems that need to share disks?  Do you have a reason to
 > need to share a backup drive that you use on an up-to-date NetBSD on
 > older hardware where it has to be fast, with a much older version of
 > NetBSD?
 >
 > (I am sure there are use cases I haven't thought of; I just want to
 > make sure I understand the use cases before I try to address them.)

What I meant is: consider an external USB disk of say 4T, which has a
cgd partition within which is ffs.

Someone attaches it to several systems in turn, doing cgd_attach,
mount, and then runs bup with /mnt/bup as the target, getting
deduplication across systems.  Of these systems, some are older NetBSD
and some are newer.  Posit one each netbsd 5, 7, 8, 9, current in the
mix, as a blend of strawman and not-so-crazy example.

After this, the disk is taken to an undisclosed location where it is
unlikely to be destroyed (or at least, unlikely to be destroyed
correlated with the main systems' disks), but at which it does not
have reliable physical protection against snooping.

I submit that this is not an odd model for cgd usage.  (I don't
actually do this; I mount disks on one system and do over-the-network
backups from the older systems, and my mix of system versions is
different.)

So, using the new faster cipher won't work, because it's not supported
by the older systems.

However, if the -current system does AES slowly because it has the new
constant-time implementation, and the older ones do it like they used
to, I don't see a real problem.
 >> Is there an easy way to publish code that does hardware AES, to allow
 >> people to measure on their hardware?  If a call for that on -users turns
 >> up essentially zero actual people that would be bothered, I think that
 >> would be interesting.
 >
 > I am not quite sure what you're asking.  Correct me if I have
 > misunderstood, but I suspect what you're getting at is:
 >
 >    How can someone on netbsd<=9 test empirically whether this patch
 >    will have a substantial negative performance impact or not?
 >
 > On basically all amd64 systems of the past decade, and on most if not
 > all aarch64 systems, there is essentially guaranteed to be a net
 > performance improvement.

What about other systems?

 > The best way to test this is to just boot a new kernel and try a
 > workload.  But I assume you are looking for a userland program that
 > one can compile and run to test it without booting a new kernel.

Yes, that's what I meant.  Kind of like "openssl speed".

 > I could in a couple hours make a program that checks cpuid to detect
 > hardware support and does some measurements in isolation -- to
 > estimate an _upper bound_ on the system performance impact.
 >
 > The upper bound is likely to be extremely conservative unless your
 > workload is actually reading and writing zeros to cgd on a RAM-backed
 > file in tmpfs; for a realistic impact on cgd or ipsec you would have
 > to take into account the disk or network throughput -- the fraction of
 > it that is spent in the crypto is what the 1/3-2/3 figure applies to.

I did sort of mean "how many MB/s would the old impl do, and how many
MB/s would the new one do", realizing that actually reading/writing
from disk might overwhelm that.

I'm not sure my request is reasonable; it might help up the comfort
level for people.
> (Note that there is no impact on userland crypto, which means no > impact on TLS or OpenVPN or anything like that, unless for some > bizarre reason you've turned on kern.cryptodevallowsoft and the > userland crypto uses /dev/crypto, the solution to which is to stop > using /dev/crypto and/or turn off kern.cryptodevallowsoft for anything > other than testing because it's terrible (and also the apparently > boolean nature of kern.cryptodevallowsoft is a lie).) So it remains to make userland AES use also constant time, as a separate step? >> I'm unclear on openssl and hardware support; "openssl speed" might be a >> good home for this, and I don't know if openssl needs the same treatment >> as cgd. (Fine to keep separable things separate; not a complaint!) > > OpenSSL is a mixed bag. It has a lot more MD implementations of > various cryptographic primitives. But many of them are still leaky. > So it's probably not a very good proxy for what the performance impact > of this patch set will be. I sort of meant putting the new code in there so it can be measured, but I realize that's messy. Please don't take my "is there a way" question as a demand.
Re: AES leaks, cgd ciphers, and vector units in the kernel
Taylor R Campbell writes:

 >> Date: Thu, 18 Jun 2020 07:19:43 +0200
 >> From: Martin Husemann
 >>
 >> One minor nit: with the performance impact that high, and there being
 >> setups where runtime side channel attacks are totally not an issue,
 >> maybe we should leave the old code in place for now, #ifdef'd as
 >> default-off options to provide time for a full reconstruction (or untill
 >> the machine gets update to "the last decade" cpu)?
 >
 > Having leaky AES code around is asking for trouble -- and would
 > require additional complexity to implement and maintain (e.g., is it
 > always unhooked from the build, or do we hook it in just enough to run
 > tests?), which would add further burden on an audit to verify that
 > it's _not_ being used in a real application.
 >
 > The goals here are to make that burden completely go away by making
 > the answer unconditionally no, there's essentially no danger that AES
 > in the kernel is leaky; and to provide alternatives with performance
 > ranging from `not worse' to `much better' to avoid the conflict that
 > AES invites between performance and security.
 >
 > If you have a specific system where there's a real negative
 > performance impact that matters to you, I would be happy to talk over
 > the details and see how we can address it better.

I see your point, and I think this is probably ok, but I share
Martin's concern.

For me, the main use of cgd is to encrypt backup drives.  I am
therefore not really concerned about side channel attacks when they
are attached and keyed on the system being backed up.  (I realize
other people use cgd for other reasons.)

I don't really see the new cipher as a reasonable option for removable
disks that need to be accessed by older systems.  I can see it for
encrypted local disk.  But given AES hardware speedup, I suspect most
people can just stay with AES.

Is there an easy way to publish code that does hardware AES, to allow
people to measure on their hardware?
If a call for that on -users turns up essentially zero actual people that would be bothered, I think that would be interesting. I'm unclear on openssl and hardware support; "openssl speed" might be a good home for this, and I don't know if openssl needs the same treatment as cgd. (Fine to keep separable things separate; not a complaint!)
Re: makesyscalls (moving forward)
David Holland writes: > Meanwhile it doesn't belong in sbin because it doesn't require root, > nor does doing something useful with it require root, and it doesn't > need to be on /, so... usr.bin. Unless we think libexec is reasonable, > but if 3rd-party code is going to be running it we really want it on > the $PATH, so... I agree with that logic, that makesyscalls is kind of like config, and that /usr/bin makes sense. There's nothing admin-ish about it, as building an operating system is not about configuring the host. We could have a directory for tools used only for building NetBSD that are not otherwise useful, and put config and makesyscalls there, but given that we aren't overwhelming bin in a way that causes trouble, that doesn't seem like a good idea.
Re: KAUTH_SYSTEM_UNENCRYPTED_SWAP
Alexander Nasonov writes: > Greg Troxel wrote: >> Kamil Rytarowski writes: >> >> > Is it possible to avoid negation in the name? >> > >> > KAUTH_SYSTEM_ENABLE_SWAP_ENCRYPTION >> >> I think the point is to have one permission to enable it, which is >> perhaps just regular root, and another to disable it if securelevel is >> elevated. >> >> So perhaps there should be two names, one to enable, one to disable. > > Kauth is about security rather than speed or convenience. Disabling > encryption may improve speed but it definitely degrades your security > level. So, you can enable vm.swap_encrypt at any level but you can't > disable it if you care about security. I understand that. But there's still a question of "should there be a KAUTH name for enabling as well as disabling", separate from "what should the rules be". I think everybody believes that regardless of securelevel, root should be able to enable encrypted swap. But probably almost everyone thinks regular users should not be allowed to enable it. I realize we have a lot of "root can", and that extending kauth to make everything separate is almost certainly too much. But when disabling is a big deal, I think it makes sense to add names for both enabling and disabling, to make that intent clearer in the sources. But, I don't think this is that important, and a comment would do.
Re: KAUTH_SYSTEM_UNENCRYPTED_SWAP
Kamil Rytarowski writes: > Is it possible to avoid negation in the name? > > KAUTH_SYSTEM_ENABLE_SWAP_ENCRYPTION I think the point is to have one permission to enable it, which is perhaps just regular root, and another to disable it if securelevel is elevated. So perhaps there should be two names, one to enable, one to disable.
Re: Rump makes the kernel problematically brittle
Thor Lancelot Simon writes: > I'd love to see a GSoC project to actually make rump build like the > kernel...but it may be too much work. Good points, and improvement would be great.
Re: Rump makes the kernel problematically brittle
The other side of the coin to "rump is fragile" is "an operating
system without rump-style tests that can be run automatically is
susceptible to hard-to-detect failures from changes, and is therefore
fragile".

There have been many instances (usually on current-users, I think) of
reports of newly-failing test cases, leading to rapid removal of
newly-introduced defects.
Re: Rump dependencies (5.2)?
Mouse writes: >> The rump build is done with separate reachover makefiles. [...] > > Hm. Then I think possibly the right answer for the moment is for me to > excise rump from my tree entirely. I can't recall ever wanting its > functionality, and trying to figure out what the dependency graph is > when it exists only implicitly in Makefiles scattered all over the tree > sounds like a recipe for serious headaches. > > If and when it looks worth the effort, I can always back out the > removal commit and clean up the result. But SCM_MEMORY looks like the > more valuable thing for my use cases for the moment. Your tree, your call. But it seems really obvious that you should fix the rump build and write some atf test cases for your SCM_MEMORY stuff, and then you will be able to test it automatically.
Re: Proposal, again: Disable autoload of compat_xyz modules
chris...@astron.com (Christos Zoulas) writes:

 > I propose something very slightly different that can preserve the current
 > functionality with user action:
 >
 > 1. Remove them from standard kernels in architectures where modules are
 >    supported.  Users can add them back or just use modules.
 > 2. Disable autoloading, but provide a sysctl to enable autoloading
 >    (1 global sysctl for all compat modules).  Users can change the default
 >    in /etc/sysctl.conf (adds sysctl to the proposal)

I am assuming that we are talking about disabling autoloading of a
number of compat modules that are some combination of believed likely
to have security bugs and not used extensively, and this includes
compat for foreign OS, but does not, at least for now, include compat
for older NetBSD.

This situation is basically a balancing act of the needs/benefits
somehow aggregated (I will avoid "averaged") over all users.  It seems
pretty unclear how to evaluate that in total.

But, it does seem like your single-sysctl proposal means:

 - people who like compat being autoloaded can add one line in
   sysctl.conf and be back where they were

 - people who want specific modules can load them and not enable the
   general sysctl

 - people who don't know about any of this who try to run Linux
   binaries will lose, and presumably there'd be a line in dmesg that
   says which module failed to autoload, like

     policy blocked autoloading compat_linux module; see compat_linux(8)

   which would then explain.

I'm also assuming this is being talked about for HEAD and hence 10,
and not 9.

Overall, this seems like a reasonable compromise among conflicting
goals.

If older NetBSD compat were included, I'd want to see a separate
sysctl, default-on for now.  (My guess is that wanting to disable that
is a fairly extreme position, at least these days.)
Re: build.sh sets with xz (was Re: vfs cache timing changes)
Martin Husemann writes: > On Fri, Sep 13, 2019 at 06:59:42AM -0400, Greg Troxel wrote: >> I'd like us to keep somewhat separate the notions of: >> >> someone is doing build.sh release >> >> someone wants min-size sets at the expense of a lot of cpu time >> >> >> I regularly do build.sh release, and rsync the releasedir bits to other >> machines, and use them to install. Now perhaps I should be doing >> "distribution", but sometimes I want the ISOs. > > The default is MKDEBUG=no so you probably will not notice the compression > difference that much. I don't follow what MKDEBUG has to do with this, but that's not important. > If you set MKDEBUG=yes you can just as easily set USE_XZ_SETS=no > (or USE_PIGZGZIP=yes if you have pigz installed). Sure, I realize I could do this. The question is about defaults. > The other side of the coin is that we have reproducable builds, and we > should not make it harder than needed to reproduce our official builds. It should not be difficult to do, or hard to understand, which is perhaps different from defaults. > But ... it already needs some settings (which we still need to document > on a wiki page properly), so we could also default to something else > and force maximal compressions via the build.sh command line on the > build cluster. I could see MKREPRODUCIBLE=yes causing defaults of various things to be a particular way, and perhaps letting USE_XZ_SETS default to no otherwise. I would hope that what MKREPRODUCIBLE=yes has to set is not very many things, but I haven't kept up.
Re: build.sh sets with xz (was Re: vfs cache timing changes)
"Tom Spindler (moof)" writes: >> PS: The xz compression for the debug set takes 36 minutes on my machine. >> We shoudl do something about it. Matt to use -T for more parallelism? > > On older machines, xz's default settings are pretty much unusable, > and USE_XZ_SETS=no (or USE_PIGZGZIP=yes) is almost a requirement. > On my not-exactly-slow i7 6700K, build.sh -j4 parallel is just fine > until it hits the xz stage; gzip is many orders of magnitude faster. > Maybe if xz were cranked down to -2 or -3 it'd be better at not > that much of a compression loss, or it defaulted to the higher > compression level only when doing a `build.sh release`. (I have not really been building current so am unclear on the xz details.) I'd like us to keep somewhat separate the notions of: someone is doing build.sh release someone wants min-size sets at the expense of a lot of cpu time I regularly do build.sh release, and rsync the releasedir bits to other machines, and use them to install. Now perhaps I should be doing "distribution", but sometimes I want the ISOs. Sometimes I do builds just to see if they work, e.g. if being diligent about testing changes. (Overall the notion of staying with gzip in most cases, with a tunable for extreme savins sounds sensible but I am too unclear to really weigh in on it.)
Re: NFS lockup after UDP fragments getting lost
Edgar Fuß writes: > Thanks to riastradh@, this turned out to be caused by an (UDP, hard) > NFS mount combined with a mis-configured IPFilter that blocked all but > the first fragment of a fragmented NFS reply (e.g., readdir) combined > with a NetBSD design error (or so Taylor says) that a vnode lock may > be held across I/O, in this case, network I/O. Holding a vnode lock across I/O seems like a bug to me too. Marking the vnode as having an in-progress operation, so others can lock/read/report-that-status/unlock, seems ok. But I'm sure you already know that vnode locking is hard. > It looks like the operation to which the reply was lost sometimes > doesn't get retried. Do we have some weird bug where the first > fragment arriving stops the timeout but the blocking of the remaining > fragments cause it to wedge? Probably not. Fragments sit in the IP reassembly queue until the complete packet can be reassembled, and only then is the packet handed up the stack. So the NFS code is almost certainly totally unaware of the arrival of the first fragment.
Re: /dev/random is hot garbage
Taylor R Campbell writes: >> It would also be reasonable to have a sysctl to allow /dev/random to >> return bytes anyway, like urandom would, and to turn this on for our xen >> builders, as a different workaround. That's easy, and it doesn't break >> the way things are supposed to be for people that don't ask for it. > > What's the advantage of this over using replacing /dev/random by a > symlink to /dev/urandom in the build system? > > A symlink can be restricted to a chroot, while a sysctl knob would > affect the host outside the chroot. The two would presumably require > essentially the same privileges to enact. None, now that I think of it. So let's change that on the xen build host. And, the other issue is that systems need randomness, and we need a way to inject some into xen guests. Enabling some with rndctl works, or at least used to, even if it is theoretically dangerous. But we aren't trying to defend against the dom0.
Re: /dev/random is hot garbage
I don't think we should change /dev/random. For a very long time, the notion is that the bits from /dev/random really are ok for keys, and there has been a notion that such bits are precious and you should be prepared to wait. If you aren't generating a key, you shouldn't read from /dev/random. So I think rust is wrong and should be fixed. I can see the reason for frustration, but I believe that we should not break things that are sensible because they are abused and cause problems in some environments. It would also be reasonable to have a sysctl to allow /dev/random to return bytes anyway, like urandom would, and to turn this on for our xen builders, as a different workaround. That's easy, and it doesn't break the way things are supposed to be for people that don't ask for it. Also, on the xen build hosts, it would perhaps be good to turn on entropy collection from network and disk. Another approach, harder, is to create a xenrnd(4) pseudodevice and hypervisor call that gets bits from the host's /dev/random and injects them as if from a hardware rng.
Re: mknod(2) and POSIX
David Holland writes: > However, I notice that mknod(2) does not describe how to set the > object type with the type bits of the mode argument, or document which > object types are allowed, and mkfifo(2) also does not say whether > S_IFIFO should be set in the mode argument or not. This is documented quite well in the opengroup.org standards pages (for mknod, set the type by OR-ing S_IFIFO into the mode; for mkfifo, just don't set any special bits). Agreed that fixing the man pages would be good. > (Though mkfifo(2) hints not by not documenting EINVAL for "The > supplied mode is invalid"; this sort of inference is annoying even in > standards and really not ok for docs...) https://pubs.opengroup.org/onlinepubs/9699919799/functions/mknod.html https://pubs.opengroup.org/onlinepubs/9699919799/functions/mkfifo.html#tag_16_327 Those seem clear to me.
Re: mknod(2) and POSIX
Agreed with uwe@ about not mixing unrelated changes. Pretend we are using git :-) The patch looks fine. Agreed that making fifos with mknod is an odd thing to do, but if it's in posix, then we should do it unless there's something really bad about supporting the posix usage. In this case, it just seems silly to have a second way to make fifos, not harmful.
mknod(2) and POSIX
I recently noticed that pkgsrc/sysutils/bup failed when restoring a fifo under NetBSD, because it calls mknod (in python) which calls mknod(3) and hence mknod(2). Our mknod(2) man page does not mention creating FIFOs, and claims The mknod() function conforms to IEEE Std 1003.1-1990 (“POSIX.1”). mknodat() conforms to IEEE Std 1003.1-2008 (“POSIX.1”). I can't find the 1990 edition online, but 2004 and 2008 require fifo support in mknod: https://pubs.opengroup.org/onlinepubs/009695399/functions/mknod.html https://pubs.opengroup.org/onlinepubs/9699919799.2008edition/functions/mknod.html However, at least in netbsd-8, our kernel (sys/vfs_syscalls.c:do_mknod_at): requires KAUTH_SYSTEM_MKNOD for all callers, and hence returns EPERM for non-root; and has a switch on allowable types, in which S_IFIFO is not included, and hence returns EINVAL. I realize mkfifo is preferred in our world, and POSIX says it is preferred. But I believe we have a failure to follow POSIX. Other opinions?
Re: pool: removing ioff?
>> But, I wonder if this comes from the Solaris allocation design, and that >> the ioff notion is not about alignment for 4/8 objects to fit the way >> the CPU wants, but for say 128 byte objects to be lined up on various >> different offsets in different pages to make caching work better. But >> perhaps that doesn't exist in NetBSD, or is done differently, or my >> memory of the paper is off. > > Indeed, Sun's SLAB had cache coloring. > > We do too in our pool subsystem, that's not related to ioff, we don't > lose it as a result of removing ioff. Great to hear on both counts, thanks.
Re: pool: removing ioff?
Maxime Villard writes: > I would like to remove the 'ioff' argument from pool_init() and friends, > documented as 'align_offset' in the man page. This parameter allows the > caller to specify that the alignment given in 'align' is to be applied at > the offset 'ioff' within the buffer. > > I think we're better-off with hard-aligned structures, ie with __aligned(32) > in the case of XSCALE. Then we just pass align=32 in the pool, and that's it. > > I would prefer to avoid any confusion in the pool initializers and drop ioff, > rather than having this kind of marginal and not-well-defined features that > add complexity with no real good reason. > > Note also that, as far as I can tell, our policy in the kernel has always > been to hard-align the structures, and then pass the same alignment in the > allocators. I am not objecting as I can't make a performance/complexity argument. But, I wonder if this comes from the Solaris allocation design, and that the ioff notion is not about alignment for 4/8 objects to fit the way the CPU wants, but for say 128 byte objects to be lined up on various different offsets in different pages to make caching work better. But perhaps that doesn't exist in NetBSD, or is done differently, or my memory of the paper is off.
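The hard-alignment policy Maxime describes can be sketched in userland; the struct here is hypothetical, and aligned_alloc(3) plays the role that passing align=32 (with ioff=0) to pool_init() would play in the kernel:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 32-byte-aligned structure, the userland analogue of
 * declaring it __aligned(32) in the kernel.  The aligned attribute
 * also pads sizeof up to a multiple of 32. */
struct dma_desc {
    uint32_t cmd;
    uint32_t addr;
} __attribute__((aligned(32)));

/* Allocate one descriptor with the same 32-byte alignment, matching
 * the "pass the same alignment to the allocator" policy. */
struct dma_desc *desc_alloc(void)
{
    /* aligned_alloc requires size to be a multiple of the alignment;
     * the padding from the aligned attribute guarantees that here. */
    return aligned_alloc(32, sizeof(struct dma_desc));
}
```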
Re: patch: debug instrumentation for dev/raidframe/rf_netbsdkintf.c
Christoph Badura writes: > On Mon, Jan 21, 2019 at 04:24:49PM -0500, Greg Troxel wrote: >> Separately from debug code being careful, if it's a rule that bdv can't >> be NULL, it's just as well to put in a KASSERT. Then we'll find out >> where that isn't true and can fix it. > > I must not be getting something. If rf_containsboot() is passed a NULL > pointer, it will trap with a page fault and we can get a stacktrace from > ddb. If we add a KASSERT it will panic and we can get a stacktrace from > ddb. I don't see where the benefit in that is. The benefit is that the panic from the KASSERT is cleaner, and it documents for readers of the function that the author believes it is a rule. And a KASSERT will definitely fire even on machines that can dereference NULL; whether a NULL dereference faults is technically, if not practically, architecture dependent. > Do you think we should add a KASSERT to document that rf_containsboot() > does expect a valid pointer? I'd see value in that and would go ahead with > it. Yes. Basically, in any kernel function, if there is a requirement that a pointer be non-NULL, then there should be a KASSERT, and the code should then feel free to assume it is valid. When a KASSERT is hit, the user gets a message with the KASSERT expression and the source file/line, instead of a page fault traceback. It's very easy and quick to go from that printout to the KASSERT that failed. Plus, adding the KASSERT, or talking about adding it, is a good way to check if there is consensus among the other developers that this really is a rule. In NetBSD, people are really good at telling you you're wrong!
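The pattern being argued for fits in a few lines; this is a hypothetical userland illustration, with assert(3) standing in for the kernel's KASSERT():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Precondition: name != NULL.  The assertion both enforces the rule
 * and documents it for readers; on failure the report names the
 * expression and file/line rather than producing a page-fault
 * traceback.  (Function and names are hypothetical, not raidframe's.)
 */
size_t label_length(const char *name)
{
    assert(name != NULL);   /* KASSERT(name != NULL) in kernel code */

    /* Past the assertion, the body may assume a valid pointer. */
    return strlen(name);
}
```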
Re: patch: debug instrumentation for dev/raidframe/rf_netbsdkintf.c
Christoph Badura writes: >> > + if (bdv == NULL) >> > + return 0; >> > + >> >> This looked suspicious, even before I read the code. >> The question is if it is ever legitimate for bdv to be NULL. > > That is an excellent point. The short answer is, no it isn't. And it > never was NULL in the code that used it. I got a trap into ddb because of > a null pointer deref in the DPRINTF that I changed (in the 4th hunk of my > patch). > >> I am a fan of having comments before every function declaring their >> preconditions and what they guarantee on exit. Then all uses can be >> audited if they guarantee that the preconditions are true. This approach >> is really hard-core in Eiffel, known as design by contract. > > Yes, I totally agree. Also to the rest of your message that I didn't quote. > > When I prepared the patch yesterday I was about to delete the above change > because at first I couldn't remember why I added it ~3 weeks ago. That > should have raised a big fat warning sign. > > I thought about adding a comment after I read your private mail > earlier today. In the end I decided it is better to not change > rf_containsboot() and instead introduce a wrapper for the benefit of the > DPRINTF. Separately from debug code being careful, if it's a rule that bdv can't be NULL, it's just as well to put in a KASSERT. Then we'll find out where that isn't true and can fix it.
Re: patch: debug instrumentation for dev/raidframe/rf_netbsdkintf.c
Christoph Badura writes: > Here is some instrumentation I found useful during my recent debugging. > If there are no objections, I'd like to commit soon. > > The change to rf_containsboot() simplifies the second DPRINTF that I added. > > Index: rf_netbsdkintf.c > === > RCS file: /cvsroot/src/sys/dev/raidframe/rf_netbsdkintf.c,v > retrieving revision 1.356 > diff -u -r1.356 rf_netbsdkintf.c > --- rf_netbsdkintf.c 23 Jan 2018 22:42:29 - 1.356 > +++ rf_netbsdkintf.c 20 Jan 2019 22:32:14 - > @@ -472,6 +472,9 @@ > const char *bootname = device_xname(bdv); > size_t len = strlen(bootname); > > + if (bdv == NULL) > + return 0; > + > for (int col = 0; col < r->numCol; col++) { > const char *devname = r->Disks[col].devname; > devname += sizeof("/dev/") - 1; This looked suspicious, even before I read the code. The question is whether it is ever legitimate for bdv to be NULL. I am a fan of having comments before every function declaring their preconditions and what they guarantee on exit. Then all uses can be audited if they guarantee that the preconditions are true. This approach is really hard-core in Eiffel, known as design by contract. In NetBSD, many functions have KASSERT at the beginning. This checks them (under DIAGNOSTIC) but it also is a way of documenting the rules. From a quick glance at the code it seems obvious that it's not ok to call these functions with a NULL bdv. So if bdv is an argument and not allowed to be NULL, then early on in that function, where you check/return, there should be KASSERT(bdv != NULL). Not really on point, but as a caution: there should be no behavior change in any function under DIAGNOSTIC, if the code is bug free and preconditions are met. So "if something we can rely on isn't true, panic" is fine, but many other things are not.
Re: Importing libraries for the kernel
m...@netbsd.org writes: > I don't expect there to be any problems with the ISC license. It's the > preferred license for OpenBSD and we use a lot of their code (it's > everywhere in sys/dev/) Agreed that the license is ok. > external, as I understand it, means "please upstream your changes, and > avoid unnecessary local changes". Agreed. And also that we have a plan/expectation of tracking changes and improvements that upstream makes. Code that is not in external more or less implies that we are the maintainer. For these libraries, my expectation is that they are being actively maintained and that we will want to update to newer upstream versions from time to time.
Re: noatime mounts inhibiting atime updates via utime()
Edgar Fuß writes: >> Honestly, I think atime is one of the dumbest thing ever. > We occasionally use them to find out (or have a first guess at): > -- has anyone used libfoobar last year? > -- who uses kbaz, i.e. has /home/xyz/.config/kbaz.conf been accessed? > > We use snapshots to run backups, so atimes are not touched by them. I fairly often look at atimes to find out if old libraries have been used, and various other things. I have also had a test that tried to use utime fail on a machine that was noatime. So the notion that noatime should mean what it does now, but allow explicit writes sounds good. I don't see any value in changing the naming of the flags. Having a fs write atime updates unless mounted noatime seems fine, and if people want noatime that's easy. I would be opposed to e.g. dropping the noatime option, making noatime default, and adding an atime option. That's just churn violating historical norms for no good reason. There's a question of what the default for installs should be, and I don't have a real opinion about that. It would be good to have stats about writes, separately including atime updates. Right now we know it causes writes but I haven't seen data.
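The failure mentioned (a test using utime on a noatime mount) involves an explicit atime write, which is a different thing from the implicit read-triggered updates that noatime is meant to suppress. A sketch of such an explicit write, with a hypothetical helper:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <time.h>

/* Explicitly set a file's access time with utimes(2), preserving the
 * modification time.  This is the kind of deliberate atime write that
 * arguably should still succeed on a noatime mount.  Returns 0 on
 * success, -1 with errno set on failure. */
int set_atime(const char *path, time_t when)
{
    struct stat st;
    struct timeval tv[2];

    if (stat(path, &st) == -1)
        return -1;
    tv[0].tv_sec = when;            /* new access time */
    tv[0].tv_usec = 0;
    tv[1].tv_sec = st.st_mtime;     /* keep modification time */
    tv[1].tv_usec = 0;
    return utimes(path, tv);
}
```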
Re: fixing coda in -current
m...@netbsd.org writes: > On Sun, Nov 25, 2018 at 08:05:21PM -0500, Greg Troxel wrote: >> However, I am pleased to report that the coda people have said that they >> are working on a fuse interface, although it's expected to be slower. >> We'll see, both if it materializes and how fast it is. > > That'd be neat. > ... can we get general consensus about removing kernel coda if that > happens, and the FUSE implementation works for netbsd too? > dholland speaks poorly of it, we don't have a volunteer to write out > tests, and it has a history of breakage. Getting consensus is hard enough that I would prefer to defer that until we see where we are. The breakage history from NetBSD VFS changes isn't really that bad -- a few times in 20 years, and it has caused very little trouble for others.
Re: fixing coda in -current
m...@netbsd.org (Emmanuel Dreyfus) writes: > Greg Troxel wrote: > >> However, I am pleased to report that the coda people have said that they >> are working on a fuse interface, although it's expected to be slower. > > FUSE vs kernel does not really matter when we deal with network > filesystem performance. The latency of requesting a network operation is > orders of magnitude higher than issuing a few system calls. That's true when the file has to be fetched. Coda, like AFS, caches files in normal operation, and there are read lock callbacks. So the first fetch is over the network and slow, and subsequent reads are at nearly the speed of the underlying filesystem. It is this speed that people are talking about.
Re: fixing coda in -current
David Holland writes: > So I have no immediate comment on the patch but I'd like to understand > better what it's doing -- the last time I crawled around in it > (probably 7-8 years ago) it appeared to among other things have an > incestuous relationship with ufs_readdir such that if you tried to use > anything other than ffs as the local container it would detonate > violently. But I never did figure out exactly what the deal was other > than it was confusing and seemed to violate a lot of abstractions. > > Can you clarify? It would be nice to have it working properly and > stuff like the above is only going to continue to fail in the future... I didn't read this patch carefully, and I'm not Brett. But the basic scheme is that a container file representing a directory is in a particular format. This has been a source of issues when there was an alignment change in directory reading. My impression is that the way it should be is that a container file that's a directory should be read in ufs format, regardless of the container filesystem type. I am not sure that's the way the code is. However, I am pleased to report that the coda people have said that they are working on a fuse interface, although it's expected to be slower. We'll see, both if it materializes and how fast it is.
Re: fixing coda in -current
bch writes: > On Tue, Nov 20, 2018 at 2:38 PM Greg Troxel wrote: > >> I volunteer to bug Satya about using FUSE instead of a homegrown >> (pre-FUSE) kernel interface. > > > Which Satya is this? The Coda one :-) Here's a proper citation. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.448
Re: fixing coda in -current
I volunteer to bug Satya about using FUSE instead of a homegrown (pre-FUSE) kernel interface. I am unaware of anything else that allows writes while disconnected and reintegrates them. I have actually done that, both on purpose and for several days while my IPsec connection was messed up, and it really worked.
Re: fixing coda in -current
I used to use it, and may again. So I'd like to see it stay, partly because I think it's good to keep NetBSD relevant in the filesystem research world. I am expecting to see new upstream activity. But, I think it makes sense to remove it from GENERIC, and perhaps give it whatever don't-autoload treatment applies, so that people only get it if they explicitly ask for it. That way it should not bother others.
Re: Things not referenced in kernel configs, but mentioned in files.*
co...@sdf.org writes: > So, I am excluding things that appear in ALL, and I am not checking if But ALL is an x86 thing, currently. > they appear as modules. Interesting, but I suppose they then belong in ALL also. > So far I had complaints about the appearance of 'lm' which cannot be > safely included in a default kernel, for example. Sure, lots of things are not ok in GENERIC, but do those concerns apply to it being in ALL?
Re: Things not referenced in kernel configs, but mentioned in files.*
co...@sdf.org writes: > This is an automatically generated list with some hand touchups, feel > free to do whatever with it. I only generated the output. > > ac100ic > acemidi > acpipmtr > [snip] I wonder if these are candidates to add to an ALL kernel, and if it will turn out that they are mostly not x86 things. I see we only have ALL for i386/amd64. I wonder if it makes sense to have one in evbarm.
Re: panic: ffs_snapshot_mount: already on list
m...@netbsd.org (Emmanuel Dreyfus) writes: > Beside the problem that FFS_NO_SNAPSHOT does not really disable > snapshot code, I think we should have a nosnapshot or nofss mount option > to handle such a scenario. Anyone has opinion on this? That seems sensible; filesystems are complicated enough that being able to really ignore complicated features seems good. But we are trying to maintain invariants (which weren't being maintained here), and it would be good if mounting without a feature doesn't make things worse. So maybe this should be a read-only mount option? But also, it seems that there was something wrong with this filesystem, and fsck didn't fix it. That seems like the most important thing to fix.
Re: Time to merge the pgoyette-compat branch (take two)
I am just barely paying attention, but I think modules working well is important, and also having minimal code for what's needed. So if mrg's main concerns have been addressed (aliases), I'm in favor (in a somewhat weak, not really clued in sort of way) of this.
Re: NetBSD-8 kernel too big?
Two thoughts: When trimming, ls -lSr in the kernel build directory will identify large objects. We have had kernel modules for a while, but I'm not entirely clear on where we are. I would think that moving to a mode of aggressively not including things that can be modules, and loading them from the fs as needed, would help, particularly if the issue is the bootloader, vs memory used up when running. This is not built as part of an -8 release build, but there is MODULAR in the conf directory.